Recently Published
DATA624 Brewing Mode
New regulations require us to understand our manufacturing process and its predictive factors through modeling. We also need to choose an optimal predictive model of pH to report to our boss. We have been given historical data regarding our manufacturing process, already split into a training set ("StudentData.xlsx") and a test set ("StudentEvaluation.xlsx").
This will be a technical report showcasing our process of tidying and exploring the received data.
DATA622 Final Project
Our goal is to utilize job salary data (retrieved from the Ask A Manager Salary Survey here: https://docs.google.com/spreadsheets/d/1IPS5dBSGtwYVbjsfbaMCYIWnOuRmJcbequohNxCyGVw/edit?resourcekey#gid=1625408792) and demographic data (Federal Reserve Economic Data: https://fred.stlouisfed.org/release/tables?eid=257197&rid=110) to predict whether the salaries in the survey end up above or below per capita personal income for their state. Being a real-world survey dataset, the data itself is a bit messy, so this will take a decent amount of transformation and cleaning. After cleaning and preparation, we build three classification models: a logit model, an SVM model, and a neural network model. The end goal of this analysis is a model that best predicts whether someone’s qualifications place their salary above or below their state’s per capita income, which workers can use to identify the factors that contribute to being paid above or below that benchmark, and which businesses can use to decide whether a worker’s qualifications warrant pay above or below it.
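A minimal sketch of the modeling stage in R, assuming a cleaned frame `salary_df` with a binary factor `above_income`; the predictor names here are hypothetical stand-ins for the survey’s actual columns:

```r
# Hypothetical cleaned data: salary_df with factor above_income ("above"/"below")
library(e1071)  # svm()
library(nnet)   # nnet()

set.seed(622)
idx   <- sample(nrow(salary_df), 0.8 * nrow(salary_df))
train <- salary_df[idx, ]
test  <- salary_df[-idx, ]

# Logit model
logit_fit <- glm(above_income ~ years_experience + education + industry,
                 data = train, family = binomial)

# SVM with a radial kernel
svm_fit <- svm(above_income ~ years_experience + education + industry,
               data = train, kernel = "radial")

# Single-hidden-layer neural network
nn_fit <- nnet(above_income ~ years_experience + education + industry,
               data = train, size = 5, decay = 0.01, maxit = 200)

# Held-out accuracy for the SVM as one comparison point
mean(predict(svm_fit, test) == test$above_income)
```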
DATA621 Final SalPred
Our goal is to utilize job salary data (retrieved from the Ask A Manager Salary Survey here: https://docs.google.com/spreadsheets/d/1IPS5dBSGtwYVbjsfbaMCYIWnOuRmJcbequohNxCyGVw/edit?resourcekey#gid=1625408792) and demographic data to predict whether the salaries in the survey end up above or below per capita personal income for their state.
DATA622 Support Vector Machines
Our objective for this analysis is to fit a support vector machine to a dataset and compare how the results change relative to our previously built decision tree models. For our dataset we will utilize the sample sales dataset from https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/ containing 1,000,000 records.
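As a rough sketch of the SVM fit, assuming the million-row CSV is loaded as `sales`; the file name and column names are illustrative guesses at the sample data’s fields:

```r
library(e1071)

# File and column names are assumptions about the sample download
sales <- read.csv("1000000 Sales Records.csv")
sales$Order.Priority <- factor(sales$Order.Priority)

# SVMs scale poorly to a million rows, so train on a subsample
set.seed(622)
sub <- sales[sample(nrow(sales), 10000), ]

svm_fit <- svm(Order.Priority ~ Units.Sold + Unit.Price + Total.Profit,
               data = sub, kernel = "radial")
table(predicted = predict(svm_fit, sub), actual = sub$Order.Priority)
```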
DATA624 JK Chapter 8
In this document, we will be going through exercises 8.1, 8.2, 8.3, and 8.7 from Applied Predictive Modeling - Kuhn and Johnson.
DATA624 JK Chapter 7
In this document, we will be going through exercises 7.2 and 7.5 from Applied Predictive Modeling - Kuhn and Johnson.
DATA621 HW4 Insurance
We will explore, analyze, and model a data set containing approximately 8,000 records. Each record represents a customer at an auto insurance company and has various predictor variables regarding the customer’s car, job, and demographics. The response variables are a binary label indicating whether the customer was in a car crash and the value of the damages if they were.
DATA624 JK Chapter 6
In this document, we will be going through exercises 6.2 and 6.3 from Applied Predictive Modeling - Kuhn and Johnson.
DATA622 Decision Tree Algorithms
Our objective for this analysis is to build two decision trees with different sets of variables, plus a random forest model, from a dataset, and compare how the results change depending on the model. For our dataset we will utilize the sample sales dataset from https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/ containing 1,000,000 records.
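A minimal sketch of the tree and forest fits, reusing the same assumed `sales` frame and illustrative column names as the SVM sketch above:

```r
library(rpart)         # decision trees
library(randomForest)  # random forest

set.seed(622)
sub <- sales[sample(nrow(sales), 10000), ]  # subsample for speed

# Two trees with different variable sets, then a forest
tree1 <- rpart(Order.Priority ~ Units.Sold + Unit.Price, data = sub)
tree2 <- rpart(Order.Priority ~ Total.Revenue + Total.Cost, data = sub)
rf    <- randomForest(Order.Priority ~ Units.Sold + Unit.Price + Total.Profit,
                      data = sub, ntree = 200)
```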
DATA621 HW4 Insurance Predictions
Processing insurance customer data to generate a logistic regression model for determining whether a customer is likely to crash, and a linear regression model for how much the crash would cost if so.
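A sketch of the two-stage idea; the frame `insurance` and the response names TARGET_FLAG and TARGET_AMT follow the assignment data as we recall it, so treat them as assumptions:

```r
# Stage 1: probability of a crash (binary TARGET_FLAG)
crash_fit <- glm(TARGET_FLAG ~ . - TARGET_AMT, data = insurance,
                 family = binomial)

# Stage 2: damage cost, fit only on customers who crashed
cost_fit <- lm(TARGET_AMT ~ . - TARGET_FLAG,
               data = subset(insurance, TARGET_FLAG == 1))

# Expected cost combines crash probability with predicted severity
p_crash       <- predict(crash_fit, insurance, type = "response")
expected_cost <- p_crash * predict(cost_fit, insurance)
```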
Curse of Dimensionality
A basic empirical exploration into the curse of dimensionality.
DATA624 Project 1 Time Series Forecasting
The full tidy forecasting workflow applied to untidy time series with multiple keys.
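A condensed view of that workflow with the fpp3 stack; `series_ts` is a hypothetical keyed tsibble standing in for the project data:

```r
library(fpp3)  # loads tsibble, fable, feasts

# One model set per key; fable fits each key's series automatically
fit <- series_ts |>
  model(ets = ETS(value), arima = ARIMA(value))

fit |>
  forecast(h = "1 month") |>
  autoplot(series_ts)
```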
DATA624 FPP Chapter 9
In this document, we will be going through exercises 9.1, 9.2, 9.3, 9.5, 9.6, 9.7, and 9.8 from Forecasting: Principles and Practice (3rd ed).
DATA622 Exploratory Machine Learning Analysis
We will explore, analyze, and model two sample sales datasets from https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/ containing 100 and 1,000,000 records, respectively.
DATA624 FPP Chapter 8
In this document, we will be going through exercises 8.1, 8.5, 8.6, 8.7, 8.8, and 8.9 from Forecasting: Principles and Practice (3rd ed).
DATA624 JK Chapter 3
In this document, we will be going through exercises 3.1 and 3.2 from Applied Predictive Modeling - Kuhn and Johnson.
DATA624 FPP Chapter 5
In this document, we will be going through exercises 5.1, 5.2, 5.3, 5.4, and 5.7 from Forecasting: Principles and Practice (3rd ed).
DATA621 HW1 Moneyball
We will explore, analyze, and model a data set containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive, with the team’s performance for the given year and all statistics adjusted to match the performance of a 162-game season.
DATA624 FPP Chapter 3
In this document, we will be going through exercises 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8, and 3.9 from Forecasting: Principles and Practice (3rd ed).
DATA624 FPP Chapter 2
Exercises from Chapter 2 of Forecasting: Principles and Practice (3rd ed).
DATA 607 Final Project
Predictive modeling for Steam game playtime.
DATA 605 Final Project
Fundamentals of computational mathematics final project culminating in a Kaggle submission for a regression model.
DATA 606 Final Project
In this project, we take data regarding games listed on Valve’s Steam platform from the Steam Spy API. Specifically, we’re interested in the median playtime of the top 100 games by users in the past two weeks. With the additional variables of user rating, number of game owners, and game price, we attempt to answer the question of whether the user ratings reported by Steam are related to median playtime. This is answered by creating regression models from the data after processing it, transforming it, and removing outliers.
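A minimal sketch of the final modeling step, with hypothetical column names standing in for the Steam Spy fields; the log transform is illustrative:

```r
# steam is an assumed frame of the pulled API data
games <- subset(steam, median_playtime > 0)

# Trim playtime outliers beyond 1.5 * IQR before fitting
q    <- quantile(games$median_playtime, c(0.25, 0.75))
keep <- games$median_playtime <= q[2] + 1.5 * diff(q)

fit <- lm(log(median_playtime) ~ rating + owners + price,
          data = games[keep, ])
summary(fit)
```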
DATA 607 Week 12 Project
Classification problems with large sets of publicly available data are best resolved by training a classifier on that data. In this project we will utilize a public dataset of spam and ham (non-spam) emails from https://spamassassin.apache.org/ to train a model to detect the difference between spam and ham emails.
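One plausible shape for such a classifier, assuming the emails are already parsed into a frame `emails` with `text` and `label` columns; the tokenizing uses tidytext and the model is a naive Bayes from e1071:

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(e1071)

# Term counts per email, keeping only reasonably common words
dtm <- emails |>
  mutate(id = row_number()) |>
  unnest_tokens(word, text) |>
  count(id, label, word) |>
  group_by(word) |> filter(n() > 50) |> ungroup() |>
  pivot_wider(names_from = word, values_from = n, values_fill = 0)

nb_fit <- naiveBayes(x = select(dtm, -id, -label), y = factor(dtm$label))
```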
DATA 605 Multiple Regression Discussion
Multiple regression model applied to video game based data.
DATA 605 Week 11 Discussion
Building a simple regression model and checking its fit for Steam Spy data.
DATA 607 Week 10 Assignment
With the textbook Text Mining with R by Julia Silge and David Robinson, we explore sentiment analysis on text. We begin by mimicking code examples from the text, and then extend them to a different corpus and different sentiment lexicons.
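For flavor, here is the shape of the book’s pipeline on a built-in corpus; the post swaps in a different corpus and lexicons, but the steps are the same:

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

# Tokenize, attach Bing sentiment labels, and tally net sentiment per book
austen_books() |>
  unnest_tokens(word, text) |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(book, sentiment) |>
  tidyr::pivot_wider(names_from = sentiment, values_from = n) |>
  mutate(net = positive - negative)
```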
DATA 607 Week 9 Project
For this assignment, we’ll be practicing our knowledge of Tidyverse functions by creating vignette examples for the packages that make up the Tidyverse. In my case, I wanted to go over the forcats package, which focuses on manipulating factors in a data frame, as I had no experience with it up to this point.
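A taste of the verbs the vignette walks through, using the gss_cat survey data that ships with forcats:

```r
library(dplyr)
library(forcats)

gss_cat |>
  mutate(
    relig = fct_lump_n(relig, n = 5),  # collapse rare levels into "Other"
    relig = fct_infreq(relig)          # reorder levels by frequency
  ) |>
  count(relig)
```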
DATA 607 Week 9 Assignment
For this assignment, we’ll be testing our ability to access APIs and pull JSON data from them into data frames. Specifically, we’ll be looking at data from the New York Times Books API.
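A minimal sketch of one such pull against the Books API’s best-seller list endpoint; `NYT_KEY` stands in for a registered API key, and the response shape follows the list endpoint’s documented JSON:

```r
library(httr)
library(jsonlite)

resp <- GET(
  "https://api.nytimes.com/svc/books/v3/lists/current/hardcover-fiction.json",
  query = list(`api-key` = Sys.getenv("NYT_KEY"))
)

# The list payload nests the books table under results$books
books <- fromJSON(content(resp, as = "text"))$results$books
head(books[, c("title", "author", "rank")])
```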
DATA 607 Week 7 Assignment
The goal of this assignment is to begin building the ability to process data from web sources that does not come as a convenient direct download to CSV or some other tabular format. The formats of focus in this assignment are HTML files, which are typical of direct scraping, along with XML and JSON files, which are more likely to be retrieved through API utilization.
To get directly familiar with these formats, we will create a representation of information regarding three books of a certain genre in each of the three formats. After the data has been created, we will utilize various R packages to load the information back in as data frames.
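A sketch of the loading side, with placeholder file names for the three hand-built files:

```r
library(rvest)     # HTML
library(XML)       # XML
library(jsonlite)  # JSON

books_html <- html_table(read_html("books.html"))[[1]]
books_xml  <- xmlToDataFrame("books.xml")
books_json <- fromJSON("books.json")
```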
DATA 607 Week 6 Project
Tidying data is reportedly one of the biggest uses of a data scientist’s time, which is why having the methods for tidying data down is important. In this assignment we will be importing untidy data from .csv files, tidying the data up, and then performing analysis on it. The data we will be working on are three different untidy datasets provided by our classmates.
DATA 607 Week 5 Assignment
In this assignment we will be importing untidy data from a .csv file, tidying the data up, and then performing analysis on the data. The data set we will be working with is a small chart describing arrival delays for two airlines across five destinations. Ultimately, we want to compare the arrival delays of the two airlines in our analysis.
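A sketch of the tidying step, assuming the CSV loads with airline and status columns followed by one column per destination; the file, column, and value names are guesses at the loaded layout:

```r
library(dplyr)
library(tidyr)

flights <- read.csv("airline_delays.csv")  # file name is an assumption

tidy_flights <- flights |>
  pivot_longer(cols = -c(airline, status),
               names_to = "destination", values_to = "count") |>
  # assumes the status column holds "delayed" / "on_time"
  pivot_wider(names_from = status, values_from = count) |>
  mutate(delay_rate = delayed / (delayed + on_time))
```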
DATA 605 Assignment 4
Utilizing eigenvalues and eigenvectors to generate new images that account for variance amongst a given set of original images.
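The core of the idea in a few lines; `imgs` is a hypothetical matrix with one flattened image per row:

```r
# Center the images and eigendecompose their covariance
X   <- scale(imgs, center = TRUE, scale = FALSE)
eig <- eigen(cov(X))

# Reconstruct images from the top-k eigenvectors, the directions of
# greatest variance among the originals
k     <- 5
V     <- eig$vectors[, 1:k]
recon <- X %*% V %*% t(V) +
  matrix(colMeans(imgs), nrow(imgs), ncol(imgs), byrow = TRUE)
```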
DATA 607 Week 4 Project
Project 1 is structured around scraping data from an unfriendly text table of chess statistics to get it into R. Once the data has been wrangled, the next focus is consolidating information that is spread across multiple rows into a single row for each player. Our ultimate goal is a CSV with the data formatted into the columns: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.
DATA 607 Week 3 Assignment
A simple assignment revolving around R’s pattern matching functionality, mainly focusing on regex.
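A small taste of the kind of matching involved, using base R regex; the strings and pattern are illustrative only:

```r
phones <- c("home: 555-853-0172", "cell 555.555.0101", "n/a")
regmatches(phones, regexpr("[0-9]{3}[-.][0-9]{3}[-.][0-9]{4}", phones))
#> "555-853-0172" "555.555.0101"
```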
DATA 607 Week 2 Assignment
A simple assignment in taking survey data, loading it into SQL, and then creating a connection between SQL and R to load the data back into R.
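The round trip in miniature with DBI; RSQLite stands in here for whatever database the assignment actually used, and `survey_df` is the loaded survey frame:

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "survey.db")
dbWriteTable(con, "survey", survey_df, overwrite = TRUE)  # R -> SQL
survey_back <- dbGetQuery(con, "SELECT * FROM survey")    # SQL -> R
dbDisconnect(con)
```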
DATA 607 Week 1 Assignment
A simple exercise in loading data into R, then modifying and subsetting it. The data used regards college fight songs.