
Tillmawitz

Matthew Tillmawitz

Recently Published

Data 622 Assignment 3
Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from previous homework. Answer questions, such as: Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?
Data 622 Assignment 2
Introduction: In machine learning, experimentation refers to the systematic process of designing, executing, and analyzing different configurations to identify the settings that perform best on a given task. Experimentation is learning by doing. It involves systematically changing parameters, evaluating results with metrics, and comparing different approaches to find the best solution; essentially, it is the practice of testing and refining machine learning models through controlled experiments to improve their performance. The key is to modify only one or a few variables at a time to isolate the impact of each change and understand its effect on model performance. In this assignment you will conduct at least 6 experiments. In real life, data scientists run anywhere from a dozen to hundreds of experiments (depending on the dataset and problem domain). Assignment: This assignment consists of conducting at least two (2) experiments for each of three algorithms: Decision Trees, Random Forest, and Adaboost. That is, at least six (6) experiments in total (3 algorithms x 2 experiments each). For each experiment you will define what you are trying to achieve (before each run), conduct the experiment, and at the end review how the experiment went. These experiments will allow you to compare algorithms and choose the optimal model. Using the dataset and EDA from the previous assignment, perform the following. Algorithm Selection: You will perform experiments using the following algorithms: Decision Trees, Random Forest, and Adaboost. Experiment: For each of the algorithms above, perform at least two (2) experiments.
In a typical experiment you should: define the objective of the experiment (hypothesis); decide what will change and what will stay the same; select the evaluation metric (what you want to measure); perform the experiment; and document the experiment so you can compare results (track progress). Variations: There are many things you can vary between experiments. Here are some examples: data sampling (feature selection); data augmentation, e.g., regularization, normalization, scaling; hyperparameter optimization (you decide: random search, grid search, etc.); Decision Tree breadth and depth (this is an example of a hyperparameter); evaluation metrics, e.g., accuracy, precision, recall, F1-score, AUC-ROC; cross-validation strategy, e.g., holdout, k-fold, leave-one-out; number of trees (for ensemble models); train-test split (using different data splits to assess model generalization ability).
Data 622 EDA
Exploratory data analysis of the Bank Marketing Dataset from https://archive.ics.uci.edu/dataset/222/bank+marketing for the CUNY Data 622 course.
Data 624 Final
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process and its predictive factors, and to be able to report to them our predictive model of pH. Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business-friendly, readable document and your predictions in an Excel-readable format. The technical report should show clearly the models you tested and how you selected your final approach. Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the Excel file showing the predictions of your models for pH.
The Effect of Female Education on Birth Rates in the United States
An analysis done for the CUNY SPS DATA 607 class.
Data 624 Homework 9
Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson.
Data 624 Assignment 8
Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two but they have many parts. Please submit both a link to your Rpubs and the .rmd file.
Recommender System Case Study
Your task is to analyze an existing recommender system that you find interesting. You should: Perform a Scenario Design analysis as described below. Consider whether it makes sense for your selected recommender system to perform scenario design twice, once for the organization (e.g. Amazon.com) and once for the organization's customers. Attempt to reverse engineer what you can about the site, from the site interface and any available information that you can find on the Internet or elsewhere. Include specific recommendations about how to improve the site's recommendation capabilities going forward. Create your report using an R Markdown file, and create a discussion thread with a link to the GitHub repo where your Markdown file notebook resides. You are not expected to need to write code for this discussion assignment.
Data 624 Assignment 7
In Kuhn and Johnson do problems 6.2 and 6.3. There are only two but they consist of many parts. Please submit a link to your Rpubs and submit the .rmd file as well.
Data 624 Project 1
This project consists of 3 parts - two required and one bonus - and is worth 15% of your grade. The project is due at 11:59 PM on Sunday Apr 11. I will accept late submissions with a penalty until the meetup after that when we review some projects. Part A – ATM Forecast, ATM624Data.xlsx: In Part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that, it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file. Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx: Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above. Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx: Part C consists of two data sets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note that for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via Rpubs and .rmd and the forecast in an Excel-readable file.
Data 607 Web API Assignment
The New York Times web site provides a rich set of APIs, as described at https://developer.nytimes.com/apis. You’ll need to start by signing up for an API key. Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it into an R DataFrame.
Data 624 Assignment 6
Do the exercises 9.1, 9.2, 9.3, 9.5, 9.6, 9.7, 9.8 in Hyndman. Exercises can be found at https://otexts.com/fpp3/arima-exercises.html
Data 607 week 7 assignment
Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
Data 624 Homework 5
Do exercises 8.1, 8.5, 8.6, 8.7, 8.8, 8.9 in Hyndman.
Data 624 Assignment 4
Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling.
Data 607 Assignment 5
Data 607 Project 1
A simple project to parse chess tournament results and write a summary to a csv.
Data 624 Assignment 3
Do exercises 5.1, 5.2, 5.3, 5.4 and 5.7 in the Hyndman book. Please submit your Rpubs link as well as your .pdf file showing your run code.
Data 624 Assignment 2
Do exercises 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9 from the online Hyndman book.
Data 607 Assignment 3
Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to the problems below. You may work in a small group, but please submit separately with names of all group participants in your submission.
#1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either "DATA" or "STATISTICS".
#2. Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:
#3. Describe, in words, what these expressions will match:
(.)\1\1
"(.)(.)\\2\\1"
(..)\1
"(.).\\1.\\1"
"(.)(.)(.).*\\3\\2\\1"
#4. Construct regular expressions to match words that:
Start and end with the same character.
Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
Data_624_assignment_1
Exercises 2.1, 2.2, 2.3, 2.4, 2.5 and 2.8 from the Hyndman online Forecasting book 3rd edition.
Data 607 Assignment 1