Recently Published
Project 4
It can be useful to be able to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
Assignment 9
The New York Times web site provides a rich set of APIs, as described here: http://developer.nytimes.com/docs
You’ll need to start by signing up for an API key.
Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and
transform it to an R dataframe.
Assignment 7
Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more
than one author. For each book, include the title, authors, and two or three other attributes that you find
interesting.
Take the information that you’ve selected about these three books, and separately create three files which
store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”,
“books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you
create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into
separate R data frames. Are the three data frames identical?
Assignment 5Document
(1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above.
You’re encouraged to use a “wide” structure similar to how the information appears above, so
that you can practice tidying and transformations as described below.
(2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy
and transform your data.
(3) Perform analysis to compare the arrival delays for the two airlines
Homework 2
Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each of these movie that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database. Load the information into an R dataframe.
Assignment – Loading Data into a Data Frame
The task is to study the dataset and the associated description of the data (i.e. “data dictionary”). You may need
to look around a bit, but it’s there! You should take the data, and create a data frame with a subset of the columns
in the dataset. You should include the column that indicates edible or poisonous and three or four other columns.
You should also add meaningful column names and replace the abbreviations used in the data—for example, in the
appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation
tasks.