DATA607 - Final
MSDA 607 Final Project
Deliverables Schedule
Deliverable | Date | Points
One Paragraph Proposal | Sunday April 26 | 20
Final Project | Sunday May 10 | 120
Final Project Presentation | Before or during Final Meetup on Wednesday May 13 | 30
Policy on Collaboration
You may work in a team of up to three people. Each project team member is responsible for understanding and being able to explain all of the submitted project code. Remember that you can take work that you find elsewhere as a base to build on, but you need to acknowledge the source, so that I base your grade on what you contributed, not on what you started with!
Approval Meeting
Once you’ve turned in your one paragraph proposal, I want to schedule a 15 minute phone meeting with each person or team (starting when we return from Thanksgiving break), where you’ll describe the reason (benefit) for doing this work and/or question you’re seeking to answer, where you’ll source the data, and the overall flow. For team projects, I also want you to articulate the roles and responsibilities of each team member.
Final Project Checklist
To receive full credit, you’ll need to deliver on all of the items in the checklist below. Please read carefully through this checklist before you make your project proposal. You are (within these checklist constraints) strongly urged to limit scope and make the necessary simplifying assumptions so that you can deliver your work on time!
Proposal describes your motivation for performing this analysis
Proposal describes likely data sources.
Your project has a recognizable “data science workflow,” such as the OSEMN workflow or Hadley Wickham’s Grammar of Data Science. [Example: First the data is acquired, then necessary transformations and clean-up are performed, then the analysis and presentation work is performed.]
Project includes data from at least two different types of data sources (e.g., two or more of these: relational or CSV, Neo4J, web page [scraped or API], MongoDB, etc.)
Project includes at least one data transformation operation. [Examples: transforming from wide to long; converting columns to date format]
Project includes at least one statistical analysis and at least one graphic that describes or validates your data.
Project includes at least one graphic that supports your conclusion(s).
Project includes at least one statistical analysis that supports your conclusion(s).
Project includes at least one feature that we did not cover in class! There are many examples: “I used ggmap; I created a decision tree; I ranked the results; I created my presentation slides directly from R; I figured out how to use OAuth 2.0…”
Presentation. Was the presentation delivered in the allotted time (3 to 5 minutes)?
Presentation. Did you show (at least) one challenge you encountered in code and/or data, and what you did when you encountered that challenge? If you didn’t encounter any challenges, your assignment was clearly too easy for you!
Presentation. Did the audience come away with a clear understanding of your motivation for undertaking the project?
Presentation. Did the audience come away with a clear understanding of at least one insight you gained or conclusion you reached or hypothesis you “confirmed” (rejected or failed to reject…)?
Code and data. Have you delivered the submitted code and data where it is self-contained—preferably in rpubs.com and github? Am I able to fully reproduce your results with what you’ve delivered? You won’t receive full credit if your code references data on your local machine!
Code and data. Does all of the delivered code run without errors?
Code and data. Have you delivered your code and conclusions using a “reproducible research” tool such as RMarkdown?
Deadline management. Were your draft project proposal, project, and presentation delivered on time? Any part of the project that is turned in late will receive a maximum grade of 80%. Please turn in your work on time! You are of course welcome to deliver ahead of schedule!
Project 4
DATA607 - Assignment 11
Building the Next New York Times Recommendation Engine - The New York Times.pdf
Amazon-Recommendations Item to Item Collaborative Filtering .pdf
Your task is to analyze an existing recommender system that you find interesting. You should:
Perform a Scenario Design analysis as described below. Consider whether it makes sense for your selected recommender system to perform scenario design twice, once for the organization (e.g. Amazon.com) and once for the organization’s customers.
Attempt to reverse engineer what you can about the site, from the site interface and any available information that you can find on the Internet or elsewhere.
Include specific recommendations about how to improve the site’s recommendation capabilities going forward.
Create your report using an R Markdown file, and create a discussion thread with a link to the GitHub repo where your Markdown file notebook resides. You are not expected to need to write code for this discussion assignment.
DATA607 - Project 3
Project – Data Science Skills
This is a project for your entire class section to work on together, since being able to work effectively on a virtual team is a
key “soft skill” for data scientists. Please note especially the requirement about making a presentation during our first
meetup after the project is due.
W. Edwards Deming said, “In God we trust, all others must bring data.” Please use data to answer the question,
“Which are the most valued data science skills?” Consider your work as an exploration; there is not necessarily a “right
answer.”
Grading rubric:
You will need to determine what tool(s) you’ll use as a group to effectively collaborate, share code and any
project documentation (such as motivation, approach, findings).
You will have to determine what data to collect, where the data can be found, and how to load it.
The data that you decide to collect should reside in a relational database, in a set of normalized tables.
You should perform any needed tidying, transformations, and exploratory data analysis in R.
Your deliverable should include all code, results, and documentation of your motivation, approach, and findings.
As a group, you should appoint (at least) three people to lead parts of the presentation.
While you are strongly encouraged (and will hopefully find it fun) to try out statistics and data models, your
grade will not be affected by the statistical analysis and modeling performed (since this is a semester one
course on Data Acquisition and Management).
Every student must be prepared to explain how the data was collected, loaded, transformed, tidied, and
analyzed for outliers, etc. in our Meetup. This is the only way I’ll have to determine that everyone actively
participated in the process, so you need to hold yourself responsible for understanding what your class-size
team did! If you are unable to attend the meetup, then you need to either present to me one-on-one before
the meetup presentation, or post a 3 to 5 minute video (e.g. on YouTube) explaining the process. Individual
students will not be responsible for explaining any forays into statistical analysis, modeling, data mining,
regression, decision trees, etc.
Data607 - Assignment 10
Week 10 Assignment
In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:
Work with a different corpus of your choosing, and
Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).
As usual, please submit links to both an .Rmd file posted in your GitHub repository and to your code on rpubs.com. You may work on a small team on this assignment.
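The lexicon-based idea behind this assignment can be sketched in base R without any packages. This is only an illustration of the join-and-sum step, not the chapter's code: the two sentences and the four-word lexicon are hypothetical stand-ins for a real corpus and a real sentiment lexicon.

```r
# Toy corpus and toy lexicon: both are hypothetical stand-ins
text <- c("A happy and wonderful day", "A sad, terrible ending")
lexicon <- data.frame(
  word  = c("happy", "wonderful", "sad", "terrible"),
  score = c(1, 1, -1, -1),
  stringsAsFactors = FALSE
)

# One row per token: lower-case, then split on anything that is not a letter
tokens <- do.call(rbind, lapply(seq_along(text), function(i) {
  words <- strsplit(tolower(text[i]), "[^a-z]+")[[1]]
  data.frame(line = i, word = words[words != ""], stringsAsFactors = FALSE)
}))

# Inner join: only words that appear in the lexicon survive
scored <- merge(tokens, lexicon, by = "word")

# Net sentiment per line (positive minus negative)
net <- tapply(scored$score, scored$line, sum)
```

Swapping in a second lexicon would mean replacing the `lexicon` data frame and comparing the two `net` results line by line.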
DATA607 - Assignment 9
Assignment – Web APIs
The New York Times web site provides a rich set of APIs, as described here: https://developer.nytimes.com/apis
You’ll need to start by signing up for an API key.
Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and
transform it into an R data frame.
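A minimal sketch of the JSON-to-data-frame step, assuming the jsonlite package is installed. The response body below is a hypothetical stand-in, so its field names should not be read as the actual schema of any New York Times API, and `NYT_API_KEY` is a hypothetical environment variable name.

```r
library(jsonlite)

# Hypothetical response body; a real response will have different and many more fields
payload <- '{
  "status": "OK",
  "results": [
    {"title": "Headline one", "section": "world", "published_date": "2020-03-01"},
    {"title": "Headline two", "section": "science", "published_date": "2020-03-02"}
  ]
}'

# Read the key from the environment rather than hard-coding it in the .Rmd file
# (NYT_API_KEY is a hypothetical variable name; it is unused in this offline sketch)
api_key <- Sys.getenv("NYT_API_KEY")

# fromJSON turns the array of result objects into a data frame
parsed <- fromJSON(payload)
articles <- parsed$results

# Convert the date strings into a proper Date column
articles$published_date <- as.Date(articles$published_date)
```

In a live run, `fromJSON()` can also be given a URL instead of a string, so the same two lines would apply once the request URL for the chosen API is built.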
DATA607 - Assignment 7
Assignment – Working with XML and JSON in R
Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more
than one author. For each book, include the title, authors, and two or three other attributes that you find
interesting.
Take the information that you’ve selected about these three books, and separately create three files which
store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”,
“books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you
create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into
separate R data frames. Are the three data frames identical?
Your deliverable is the three source files and the R code. If you can, package your assignment solution up into
an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files
accessible from the web].
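One possible way to approach the loading step for two of the three formats, assuming the jsonlite and xml2 packages are installed. The two-book sample, its tag names, and its field names are hypothetical; it is inlined as strings only so the sketch is self-contained.

```r
library(jsonlite)
library(xml2)

# Hypothetical two-book sample; the real files would hold three books
json_txt <- '[
  {"title": "Book One", "authors": ["Author A", "Author B"], "year": 2001},
  {"title": "Book Two", "authors": ["Author C"], "year": 2010}
]'
xml_txt <- '<books>
  <book><title>Book One</title><author>Author A</author><author>Author B</author><year>2001</year></book>
  <book><title>Book Two</title><author>Author C</author><year>2010</year></book>
</books>'

# JSON: an array of objects becomes a data frame; "authors" is a list column
books_json <- fromJSON(json_txt)

# XML: visit each <book> node and collapse repeated <author> tags into one string
book_nodes <- xml_find_all(read_xml(xml_txt), ".//book")
books_xml <- data.frame(
  title = xml_text(xml_find_first(book_nodes, "./title")),
  authors = vapply(seq_along(book_nodes), function(i) {
    paste(xml_text(xml_find_all(book_nodes[[i]], "./author")), collapse = "; ")
  }, character(1)),
  year = xml_text(xml_find_first(book_nodes, "./year")),
  stringsAsFactors = FALSE
)

# The titles agree, but the frames do not: column types differ between formats
same_titles <- identical(books_json$title, books_xml$title)
identical_frames <- identical(books_json, books_xml)
```

Here the XML route yields `year` as text and `authors` as a single string, while the JSON route yields an integer and a list column, which is one concrete reason the answer to "Are the three data frames identical?" is often no until the types are aligned.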
DATA607 - Project 2
IS 607 – Project 2
The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to:
(1) Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
(2) Please include in your homework submission, for each of the three chosen datasets:
The URL to the .Rmd file in your GitHub repository, and
The URL for your rpubs.com web page.
DATA607 - Assignment 5
The chart above describes arrival delays for two airlines across five destinations. Your task is to:
(1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above.
You’re encouraged to use a “wide” structure similar to how the information appears above, so
that you can practice tidying and transformations as described below.
(2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy
and transform your data.
(3) Perform analysis to compare the arrival delays for the two airlines.
(4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative
descriptions of your data cleanup work, analysis, and conclusions. Please include in your
homework submission:
The URL to the .Rmd file in your GitHub repository, and
The URL for your rpubs.com web page.
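A minimal sketch of the wide-to-long step, assuming tidyr and dplyr are installed. The chart itself is not reproduced here, so the airline names, city names, and counts are hypothetical placeholders rather than the assignment's data.

```r
library(tidyr)
library(dplyr)

# Hypothetical counts in a "wide" layout: one column per destination
wide <- data.frame(
  airline = c("Airline A", "Airline A", "Airline B", "Airline B"),
  status  = c("on time", "delayed", "on time", "delayed"),
  City1   = c(90, 10, 160, 40),
  City2   = c(45, 5, 70, 30),
  stringsAsFactors = FALSE
)

# Wide to long: one row per airline, status, and destination
long <- pivot_longer(wide, cols = c(City1, City2),
                     names_to = "destination", values_to = "flights")

# Put the two statuses side by side so a delay rate can be computed per row
delay_rates <- long %>%
  pivot_wider(names_from = status, values_from = flights) %>%
  mutate(delay_rate = delayed / (delayed + `on time`))
```

Comparing `delay_rate` both per destination and in aggregate is worthwhile, since the two views can rank the airlines differently.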
DATA607 - Project 1
Project 1
In this project, you’re given a text file with chess tournament results where the information has some structure. Your
job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database)
with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents
For the first player, the information would be:
Gary Hua, ON, 6.0, 1794, 1605
1605 was calculated by summing the pre-tournament ratings of the player’s opponents (1436, 1563, 1600, 1610, 1649, 1663, 1716) and
dividing by the total number of games played.
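The arithmetic for the first player can be checked directly in R. The ratings are the seven listed above; the column names in the one-row data frame are hypothetical, chosen only to mirror the required fields.

```r
# Pre-tournament ratings of the first player's seven opponents, from the text above
opponent_ratings <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

# Sum the ratings and divide by the number of games played
games_played <- length(opponent_ratings)
avg_opponent_rating <- sum(opponent_ratings) / games_played

# Round to the nearest whole rating point for the output file
rounded_avg <- round(avg_opponent_rating)

# One output row; the column names are hypothetical
first_row <- data.frame(
  name = "Gary Hua", state = "ON", total_points = 6.0,
  pre_rating = 1794, avg_opponent_pre_rating = rounded_avg
)
```

The sum is 11237, and 11237 / 7 is about 1605.29, which rounds to the 1605 shown in the example row.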
If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data
science, like chess, is a game of back and forth…
The chess rating system (invented by a Minnesota statistician named Arpad Elo) has been used in many other contexts,
including assessing relative strength of employment candidates by human resource departments.
You may substitute another text file (or set of text files, or data scraped from web pages) of similar or greater
complexity, and create your own assignment and solution. You may work in a small team. All of your code should be in
an R markdown file (and published to rpubs.com); with your data accessible for the person running
DATA607 - Assignment 3
Assignment Requirements
=======================
1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either "DATA" or "STATISTICS"
2. Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
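One way to read this exercise is as a text transformation: treat the printed output as a raw string, pull out the quoted items, and rebuild them as a `c(...)` call. The sketch below takes that reading, which is an assumption about the intent, and uses base R only.

```r
# The printed vector, pasted in as one raw string (copied from the text above)
printed <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'

# Extract every double-quoted item, keeping the quotes
quoted_items <- regmatches(printed, gregexpr('"[^"]+"', printed))[[1]]

# Join the items with commas and wrap them in c( ... )
vector_code <- paste0("c(", paste(quoted_items, collapse = ", "), ")")

# Evaluating the generated text gives back a real character vector
fruits <- eval(parse(text = vector_code))
```

Printing `vector_code` with `cat()` shows the target format, and `fruits` confirms that the generated text is valid R.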
**The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:**
3. Describe, in words, what these expressions will match:
4. Construct regular expressions to match words that:
Start and end with the same character.
Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
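The three patterns in exercise 4 can be sketched with backreferences. This is one possible set of answers, written with base R's `grepl()` and `perl = TRUE` rather than stringr; the test words other than "church" and "eleven" are hypothetical additions.

```r
# "church" and "eleven" come from the exercise; the rest are extra test words
words <- c("church", "eleven", "stats", "banana", "data")

# Start and end with the same character (a one-character word also counts)
same_ends <- grepl("^(.)(.*\\1)?$", words, perl = TRUE)

# Contain a repeated pair of letters: capture two letters, then find them again
repeated_pair <- grepl("([a-z]{2}).*\\1", words, perl = TRUE)

# Contain one letter in at least three places: capture it, then find it twice more
three_places <- grepl("([a-z]).*\\1.*\\1", words, perl = TRUE)

results <- data.frame(words, same_ends, repeated_pair, three_places)
```

The patterns assume lower-case input; mixed-case words would need `tolower()` first or `ignore.case = TRUE`.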
DATA607 - Assignment 2
Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary
friends if necessary) to rate each of these movies that they have seen on a scale of 1 to 5. Take the results
(observations) and store them in a SQL database of your choosing. Load the information from the SQL database
into an R dataframe.
This is by design a very open-ended assignment. In general, there’s no need here to ask “Can I…?” questions
about your proposed approach. A variety of reasonable approaches are acceptable. You could for example
access the SQL data directly from R, or you could create an intermediate .CSV file. I should be able to generate
the SQL table(s) and data from your provided code—if you use a graphical user interface to create and populate
tables, it should have a mechanism to generate corresponding SQL code.
This assignment does not need to be 100% reproducible. You can (and should) blank out your SQL password if
your solution requires it; otherwise, full credit requires that your code is “reproducible,” with the assumption
that I have the same database server and R software.
Handling missing data is a foundational skill when working with SQL or R. To receive full credit, you should
demonstrate a reasonable approach for handling missing data. After all, how likely is it that all five of your
friends have seen all six movies?
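A minimal sketch of one reasonable approach, assuming the DBI and RSQLite packages are installed: store an unseen movie as SQL NULL, which arrives in R as NA, and exclude it explicitly when averaging. The in-memory SQLite database, the table and column names, and the movie and rater names are all hypothetical choices for illustration.

```r
library(DBI)

# In-memory SQLite database, so no server or password is involved
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Normalized layout: one table per entity plus a ratings table linking them
dbExecute(con, "CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
dbExecute(con, "CREATE TABLE raters (rater_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
dbExecute(con, "CREATE TABLE ratings (
  rater_id INTEGER REFERENCES raters(rater_id),
  movie_id INTEGER REFERENCES movies(movie_id),
  rating   INTEGER CHECK (rating BETWEEN 1 AND 5)
)")

# A NULL rating records that the rater has not seen the movie
dbExecute(con, "INSERT INTO movies VALUES (1, 'Movie A'), (2, 'Movie B')")
dbExecute(con, "INSERT INTO raters VALUES (1, 'Rater 1'), (2, 'Rater 2')")
dbExecute(con, "INSERT INTO ratings VALUES (1, 1, 5), (1, 2, 3), (2, 1, 4), (2, 2, NULL)")

# Load the joined result into an R data frame
ratings_df <- dbGetQuery(con, "
  SELECT r.name, m.title, t.rating
  FROM ratings t
  JOIN raters r ON r.rater_id = t.rater_id
  JOIN movies m ON m.movie_id = t.movie_id
  ORDER BY r.rater_id, m.movie_id")
dbDisconnect(con)

# SQL NULL arrives as NA; na.rm = TRUE averages only the ratings that exist
avg_by_movie <- tapply(ratings_df$rating, ratings_df$title, mean, na.rm = TRUE)
```

Keeping the NA instead of substituting a zero or a default score avoids biasing the averages toward movies that fewer people have seen.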
You’re encouraged to optionally find other ways to make your solution better. For example, consider
incorporating one or more of the following suggestions into your solution:
• Use survey software to gather the information.
• Are you able to use a password without having to share the password with people who are viewing your
code? There are a lot of interesting approaches that you can uncover with a little bit of research.
• While it’s acceptable to create a single SQL table, can you create a normalized set of tables that
corresponds to the relationship between your movie viewing friends and the movies being rated?
• Is there any benefit in standardizing ratings? How might you approach this?
You should post any code (e.g. SQL and R Markdown) in a GitHub repository, and provide a link in your
assignment submission. For this assignment, you are not required to post your code to rpubs.com.
You may work in a small group on this assignment. If you work in a group, each group member should indicate
who they worked with, and all group members should individually submit their week 2 assignment.
Please start early, and do work that you would want to include in a “presentations portfolio” that you might
share in a job interview with a potential employer! You are encouraged to share thoughts, ask, and answer
clarifying questions in this week’s “R and SQL” forum.
DATA607 - Assignment 1
Assignment Requirements
DATA607 Week 1 assignment: choose one of the provided datasets on fivethirtyeight.com that you find interesting:
Take the data, and create one or more code blocks.
You should finish with a data frame that contains a subset of the columns in your selected dataset. If there is an obvious target (aka response or dependent) variable, you should include this in your set of columns.
You should include (or add if necessary) meaningful column names and replace (if necessary) any non-intuitive abbreviations used in the data that you selected. For example: if you had instead been tasked with working with the UCI mushroom dataset, you would include the target column for edible or poisonous, and transform “e” values to “edible.”
Your deliverable is the R code to perform these transformation tasks.
Make sure that the original data file is accessible through your code—for example, stored in a GitHub repository or AWS S3 bucket and referenced in your code. If the code references data on your local machine, then your work is not reproducible!
Start your R Markdown document with a two to three sentence “Overview” or “Introduction” description of what the article that you chose is about, and include a link to the article.
Finish with a “Conclusions” or “Findings and Recommendations” text block that includes what you might do to extend, verify, or update the work from the selected article.
Each of your text blocks should include at least one header and some additional non-header text.
You’re of course welcome, but not required, to include additional information, such as exploratory data analysis graphics (which we will cover later in the course).
Place your solution into a single R Markdown (.Rmd) file and publish your solution out to rpubs.com.
Post the .Rmd file in your GitHub repository, and provide the appropriate URLs to your GitHub repository and your rpubs.com file in your assignment link.
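The subset, rename, and recode steps can be sketched in base R using the mushroom example mentioned above. The three rows and the generic column names `V1` to `V3` are hypothetical stand-ins for a file read without headers; in a real submission the data would be read from a URL rather than typed in.

```r
# Hypothetical stand-in for a headerless file: V1 is the target, V2 and V3 are features
raw <- data.frame(
  V1 = c("e", "p", "e"),
  V2 = c("x", "x", "b"),
  V3 = c("s", "y", "s"),
  stringsAsFactors = FALSE
)

# Keep a subset of the columns, including the target
mushrooms <- raw[, c("V1", "V2")]

# Replace the generic names with meaningful ones
names(mushrooms) <- c("class", "cap_shape")

# Recode abbreviations through named lookup vectors
class_labels <- c(e = "edible", p = "poisonous")
shape_labels <- c(x = "convex", b = "bell")
mushrooms$class     <- unname(class_labels[mushrooms$class])
mushrooms$cap_shape <- unname(shape_labels[mushrooms$cap_shape])
```

A lookup vector makes an unexpected code show up as NA instead of being silently mislabeled, which is easy to check with `anyNA(mushrooms)`.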
Intro Markdown example