

Heather Geiger

Recently Published

Converting MSigDB gene sets into a binary gene set x gene matrix
Pre-processing for Data 620, project 2
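As a rough illustration of what a binary gene set x gene matrix looks like, here is a minimal sketch in Python (not the code from the original post). The set and gene names are hypothetical placeholders rather than actual MSigDB contents, and `binary_membership_matrix` is a name made up for this example.

```python
def binary_membership_matrix(gene_sets):
    """Return (set_names, genes, matrix); matrix[i][j] is 1 if genes[j] is in set i."""
    set_names = sorted(gene_sets)
    # Columns: every gene that appears in at least one set, in sorted order
    genes = sorted({g for members in gene_sets.values() for g in members})
    matrix = []
    for name in set_names:
        members = set(gene_sets[name])
        matrix.append([1 if g in members else 0 for g in genes])
    return set_names, genes, matrix


# Toy input (hypothetical names, not real MSigDB gene sets)
toy_sets = {
    "SET_A": ["GENE1", "GENE2"],
    "SET_B": ["GENE2", "GENE3"],
}

set_names, genes, matrix = binary_membership_matrix(toy_sets)
print(genes)
for name, row in zip(set_names, matrix):
    print(name, row)
```

Each row is a gene set and each column a gene, so two sets share a gene wherever both rows have a 1 in the same column.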
MSigDB hallmark gene set network analysis pre-processing
Titanic data exploration
Exploration of the Titanic data, including an explanation of why certain transformations were applied, for Data 622.
Visualizing vaccine preventable disease (VPD) cases vs. vaccination rates in California
California has very detailed publicly available data on both rates of vaccine-preventable diseases and immunization rates, which are juxtaposed visually here.
Visualizing vaccine preventable disease cases in California
CUNY Data 608 final project proposal
Improving a graphic on atmospheric conditions from the ACT
Took a science graphic from the ACT and used it to show how the same data could be presented more clearly.
Thanksgiving foods data-to-ink ratio example
Show how to make "fun" graphics that still have a relatively good data-to-ink ratio (or not).
Kaggle - House Prices
My take on the Kaggle House Prices dataset
Data 605 Final Problem 1
Data 605 Final Problem 2, sections (1-3)/4
Based on data from the Kaggle House Prices: Advanced Regression Techniques competition. Available here:
Data 605 - Calculus 12.1, question 14
Discussion #15
Data 605 - Calculus section 8.8, Q26
Discussion #14
Data 605 - Calculus section 7.1, Q5
Discussion #13
Data 605 - Probability section 11.2, question 9
Data 605 discussion 10
Section 9.3, Question 10
Data 605 discussion 9
Data 605 - section 6.1, question 8
Expected value discussion
Data 605 - Probability section 4.1, question 27
Week 6 discussion contribution
Data 605 - Probability section 1.1, question 3
Discussion #5
Data 605 - question LT.C42
Discussion #4
Show that Khan Academy and textbook eigenspace method produce the same result.
Textbook example from Data 605 showing that A - (lambda * identity matrix) and (lambda * identity matrix) - A produce the same result when calculating eigenvectors.
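The reason the two conventions agree is that one matrix is the negative of the other, so they send exactly the same vectors to zero. A minimal sketch in plain Python, using a small matrix chosen for this illustration (not the textbook example from the post); `mat_vec` and `shifted` are hypothetical helper names:

```python
def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def shifted(a, lam, sign):
    """sign=+1 returns A - lam*I; sign=-1 returns lam*I - A."""
    n = len(a)
    return [
        [sign * (a[i][j] - (lam if i == j else 0)) for j in range(n)]
        for i in range(n)
    ]


# Illustrative symmetric matrix with eigenvalues 3 and 1
A = [[2, 1], [1, 2]]
lam = 3
v = [1, 1]  # candidate eigenvector for lam = 3

a_minus = shifted(A, lam, +1)   # A - lam*I
minus_a = shifted(A, lam, -1)   # lam*I - A

# Both products are the zero vector, so v lies in both null spaces
print(mat_vec(a_minus, v))
print(mat_vec(minus_a, v))
```

Because (lambda*I - A)v = -(A - lambda*I)v, any vector that solves one equation solves the other, and the resulting eigenspace is identical.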
Data 605 - Discussion DM.C24
Coursera Data Science Capstone Milestone Report
The dataset to be used to build predictive text models includes a number of blog posts, news articles (or blurbs), and tweets in English. In the milestone report, the task is to read in the data, perform some exploratory analysis, and summarize a basic outline of a plan for using the data to create a prediction algorithm and Shiny app.
Predicting Wine Wholesale Purchases
Data 621 - HW #5
Predicting insurance crash and claim amount risk
Data 621 - HW #4
Predicting High vs. Low Crime using Binary Logistic Regression
Data 621 - Homework 3
Special election outcomes in 2017-2018 as a function of 2016 election results
Modeling US special election outcomes in 2017-2018 as a function of 2016 election results. Final project for DATA 606.
MTA Ridership and Delays over Time vs. Recent Media Coverage of the MTA
Data sources for this report include: a CSV file with weekly ridership data broken down by station from 2010 to 2018; a CSV file with monthly rates of delays broken down by subway line, from 2009 to 2017; and Google News API results searching for “MTA” from select news sources (Gothamist, NY Daily News, NY Post, etc.) from the past three months (roughly February-April 2018). One major question is how changes in ridership relate to quality of service. In other words, do New Yorkers exhibit some elasticity of demand in response to delays, or is demand mostly inelastic? Another question is to quantify the tone of media reports about the MTA and compare it to performance data to see whether the tone matches the data.
Multiple linear regression lab
Data 606 lab 8
Multiple and logistic regression homework
Data 606 Homework 8
Scale_manual Tidyverse recipe
Tidyverse recipe for Data 607. Manually adjust colors corresponding to a variable using scale_colour_manual or scale_fill_manual.
Introduction to linear regression (Lab)
Data 606 Lab 7
Introduction to Linear Regression
Data 606 Homework 7
Data 606 Final Project Proposal - Election Data 2017-2018
Data 606 final project proposal to analyze special and standard election data from January 2017 to March 2018.
Inference for categorical data
Data 606 Lab 6
Inference for categorical data (homework)
Data 606 Chapter 6 Homework
Question 5.19 - Inference on Paired Data
Homework question that uses paired temperature measurements to see if this data set supports a hypothesis of increased temperatures over time.
Querying the New York Times Article Search API in R
Using the New York Times Article Search API to get R data frames with articles about Facebook from around the time of a recent scandal (Cambridge Analytica) and an older one (psychological experiment).
Lab - inference for numerical data
Data 606 Lab 5
Inference for numerical data
Data 606 Chapter 5 Homework
Most frequent skills in resumes from New York and San Francisco
Joint effort with Raj Kumar from Data 607. After web scraping to get several hundred resumes each for New York and San Francisco data scientist job seekers, here we use text processing to look for the most frequently listed skills by job seekers in each city and across cities.
Cleaning Resume Data from Draft
Data 607 - project 3 draft
Parsing HTML, XML, and JSON files using R
Data 607 assignment
Foundations for inference
Data 606 - Homework 4
Time Use by Country Survey - Tidying and Analyzing the Data (edited!)
Edited version of a previous RPubs post where I tidied and analyzed a data set of time use by country. Will replace the old RPubs post with this one once the assignment has been graded (this is for Data 607, project 2).
Time Use by Country Survey - Tidying and Analyzing the Data
Data 607 - project 2 dataset 2
MTA Subway Station Ridership Info - Tidying the Data
Data 607 - project 2
Comparing airlines using a tidied flight delay data set
Data 607 assignment 5
The normal distribution (Data 606 Lab 3)
Distributions of Random Variables
Distributions of Random Variables - Data 606 Homework 3
Summarizing Chess Tournament Info using R and Regular Expressions
Summarizing Chess Tournament Info using R and Regular Expressions. Data 607, project 1.
Regions of Olive Oil App Pitch
Slides to pitch a Shiny app to tell which region of Italy an olive oil is from given its chemical composition.
Data 607 - Regex and string functions assignment
Part 1 - get names in first-then-last format, including determining whether names have a title and/or middle name. Part 2 - for a list of example regular expressions, describe what they are trying to match and give an example match.
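A minimal sketch of the Part 1 idea, written in Python rather than the R used for the assignment. The sample names and the helper functions are hypothetical, and the title pattern (a capital letter, one to three lowercase letters, then a period) is an assumption made for illustration, not the rule used in the original post.

```python
import re

# Assumed title pattern: short abbreviation ending in a period, e.g. "Dr." or "Rev."
TITLE_PATTERN = r"^[A-Z][a-z]{1,3}\.\s+"


def first_then_last(name):
    """Turn 'Last, First' into 'First Last'; leave other names unchanged."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        return f"{first} {last}"
    return name.strip()


def has_title(name):
    """True if the reordered name starts with a title abbreviation."""
    return re.match(TITLE_PATTERN, first_then_last(name)) is not None


def has_middle_name(name):
    """True if more than two name parts remain after dropping any title."""
    without_title = re.sub(TITLE_PATTERN, "", first_then_last(name))
    return len(without_title.split()) > 2


# Hypothetical sample names
for raw in ["Smith, Jane", "Dr. John Doe", "Doe, Mary Ann"]:
    print(first_then_last(raw), has_title(raw), has_middle_name(raw))
```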
Data 606 Lab 2 - Probability
Data 606 Lab 2 - Probability
Histogram from a relative frequency distribution
Histogram from a relative frequency distribution of incomes.
Data 606 Lab 1 - Introduction to Data (using CDC data)
Movie Database using R and MySQL
Here I use the MovieTweetings database from GitHub user sidooms. I extract ratings for 10 recent popular movies and create an SQL database from them. I then check this database for correctness using R, and run some basic summaries and plots on the data to see how people rate these movies.
Data 606 Homework 1 - Introduction to Data
Data 606 Lab 0 - Intro to R and RStudio
Data 606 Lab 0 intro to R/RStudio using birth data from Arbuthnot/United States as examples.
Transforming Edible vs. Poisonous Mushrooms Data Set
Describes how to transform the mushroom records from The Audubon Society Field Guide to North American Mushrooms into a more easily understood format.
Heatmap of gene expression with Plotly
Using a simple toy example, I demonstrate how to use Plotly to add an interactive tooltip showing the gene and sample name to a gene expression heat map created with ggplot2.
Leaflet mapping of lunch spots around New York Genome Center
Used the leaflet R library to show lunch spots around New York Genome Center on a map.
Fuel efficiency vs. transmission
Fuel efficiency vs. transmission for Coursera regression course
Weather events most adversely impacting human health and economic activity in the United States
This document uses the NOAA (National Oceanic and Atmospheric Administration) Storm Database to summarize the weather events that have had the greatest impact on both human health (fatalities and injuries) and economic activity (crop and property damage) in the United States.