I briefly discuss Hadoop vs. Spark.
Slides for the analysis in the DATA 621 Final Project.
An example I made for a DATA 621 discussion.
In this report, I analyze crime statistics for Boston in 1978 using logistic regression to classify neighborhoods as high crime or low crime. I find that the clearest indicator that a neighborhood is high crime is high air pollution, measured by the concentration of nitrogen oxides, which are produced by fossil fuel combustion.
A write-up for a discussion question in DATA 621. The question asks how to explain to others the way outliers can influence regression.
In this paper, I analyze salary and unemployment data pertaining to college majors. These data were obtained from fivethirtyeight.com's GitHub page. I find that STEM majors earn more and have lower unemployment than Humanities majors, and that gender inequality in these two categories could play a role in the difference in pay.
This is the final version of the analysis that Chunhui Zhu and I did of world electrical energy production.
This is a draft of Chunhui Zhu's and my Final Project where we examine the production of electricity and the flow of energy resources.
Our final assignment is about multivariate calculus.
Draft of the final project for DATA 607 where we analyze energy production and usage for the top 10 economies.
This week is multivariate calculus.
I use Euler's formula for pi to calculate pi and compare the result to R's built-in value.
This week's assignment covers series and sequences.
I migrate a MySQL database to a Neo4j database.
This week's lab is about multiple regression.
Checking another student's work by request.
This assignment covers logistic and multiple regression models.
This week's discussion involves solving a surface area integral using trigonometric substitution and substitution methods.
This week's homework is a primer in basic calculus.
This week's assignment covers multiple linear regression models and transformations to make data fit a linear regression.
A sample lesson on the derivation and use of kinematic equations.
Our proposal for the Final Project.
I explain a couple of methods to make non-linear data usable for linear regression.
This week I perform a multiple regression analysis on Human Resources data from Kaggle.com (https://www.kaggle.com/ludobenistant/hr-analytics/data) to see whether the measured factors predict job satisfaction.
This assignment looks at the interpretation of linear models and at calculating parameters such as the slope from the correlation coefficient R and the standard deviations.
We use linear regression to reproduce the analysis done in the movie Moneyball. I find that team Batting Average is the best traditional predictor of runs and On-Base%+Slugging is the best modern predictor of runs. All modern statistics outperformed traditional statistics in predicting runs.
This week we perform linear regression on the braking distance of a car vs. speed, and see that a low p-value alone does not mean the model is valid.
This week we are tasked with qualitatively reverse engineering a recommender system. Our group selected Grubhub. Essentially, we test which recommendations are made based on our input and conjecture which models Grubhub uses.
We begin linear regression by examining the relationship between video duration and views for TED talks.
We were tasked with using R's text mining package, 'tm', and supervised learning techniques to classify emails as 'spam' or not. I was able to get >96% accuracy.
This draft of my "Data Science in Context" presentation provides a basic formula for creating a word cloud from a data frame.
This assignment explores the Gambler's Ruin problem under two different strategies: first a constant bet, then an increasing bet. Markov chains, the binomial distribution, and simulations are used.
This week's discussion question involves the Gambler's Ruin problem.
The goal of this assignment is to extract data from the NYT's API in the form of a JSON file and to format it as an R data frame.
This lab covers inference of proportions.
This week's assignment covers the CLT for independent random variables and Moment Generating Functions.
This chapter covers proportion tests, calculating confidence intervals, and chi-square tests.
This assignment covers hypothesis testing using confidence intervals, t-tests, and ANOVA.
In this week's lab we calculate confidence intervals and perform hypothesis testing on data about pregnancies.
This week's discussion in DATA 605 tests the Central Limit Theorem for proportions.
Slides for DATA 607 presentation.
Minor correction made on the original.
This is my proposal for the final project for DATA 606. I will take an in-depth look at income and employment statistics for 173 college majors using data obtained from the fivethirtyeight.com GitHub page.
This is a rough draft of the presentation slides for DATA 607 Project 3.
Extended Silverio's work to include Confidence Intervals, t-tests, and KS tests for salaries. I also adjusted salaries for Cost of Living Index.
This week's assignment covers convolution of discrete and continuous random variables.
In time for the World Series, I calculate the probability of at-bat outcomes given a probability distribution for 4 at-bats.
This week's assignment covers loading data from web-based formats (HTML, XML, and JSON) into R in the form of data frames.
This week's assignment covers important probability densities and distributions, such as the Beta, Geometric, Exponential, Binomial, and Poisson.
This assignment covers calculating confidence intervals, p-values, and hypothesis testing.
This lab examines how to define confidence intervals. Please note that the data in this file will be different from the data I had in RStudio while writing, so the answers may not match the graphs and summary statistics.
This week's discussion covers common continuous probability densities and discrete distributions.
The goal of this project is to take data from three different sources (in this case, two .csv files and one data set scraped from a web page) and use tidyr and dplyr to clean and reorganize the data for further analysis.
This assignment covers combinatorics and probability.
This week's discussion covers combinatorics and conditional probability.
This lab explores behaviors of sampling and populations needed to introduce the Central Limit Theorem and Confidence Intervals.
This assignment covers defining probability distributions and calculating probabilities from probability distributions.
We were tasked with tidying and analyzing a data set using R's tidyr and dplyr. I opted to use an SQL database for the starting data instead of a .csv file.
This homework assignment covers probability distributions, namely the Normal, Geometric and Binomial distributions.
This week's topic in DATA 605 is probability distributions. Here I use a basic simulation to work through a popular urban legend about a professor who tricks the students after they try to trick the professor.
This lab covers the properties of the Normal Distribution.
This week's homework covers the singular value decomposition (SVD) of a matrix and finding the inverse of a matrix from its cofactors.
In this project we were tasked with taking a semi-structured .txt file and creating an R Markdown file that would output a .csv that can be used to populate a SQL database.
This week's topic is Linear Transformations. In this discussion question I show that a transformation is linear.
This was an optional question for the Week 3 homework.
This week's DATA 605 homework covers matrix ranks, eigenvalues, and eigenvectors.
This assignment covers using regular expressions to extract data from files.
This homework set covers probability and discrete random variables.
This homework covers matrix operations such as trace, transpose, matrix multiplication, and factorization.
This lab analyzes Kobe Bryant's shooting performance during the 2009 NBA finals to test the "hot hands" hypothesis. We find that Kobe performed no better than a simulation where the simulated shooter's hit percentage was set at Kobe's hit percentage.
We have to present a practice problem from the text. I am presenting 1.23 which evaluates the methodology used in a survey.
Something I wanted to add to my discussion for DATA 605.
This is my week 2 discussion question for DATA 605.
This lab demonstrates techniques for initial data summaries and visualizations and how to subset data.
My solutions to the first homework set for CUNY DATA 606, Statistics and Probability for Data Analytics.
This is my first homework for DATA 605, Fundamentals of Computational Mathematics. This assignment mostly covers vector and matrix operations and solving systems of equations.
This is a lab for CUNY DATA 606 to familiarize the student with the functionality of R and RStudio.
This is my submission for Homework 1 of DATA 607 in CUNY's MS in Data Science program. The objectives were to load the data on mushrooms from a website into an R data frame, then create a subset of that data frame with 3 or 4 columns from the original data. Finally, we were tasked with relabeling the data headers and categories into a more readable format. I also added a few visualizations.
This is the test lab for DATA 607 for the CUNY MS in Data Science program.
This is an analysis of arrest records.
This is the homework for week 1 of the MSDA Bridge program in R.