Recently Published
Week 5 BDT
Read this article about doing statistics with categorical variables. Write at least 500 words discussing how to use these statistics to help understand big data.
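One common statistic for categorical variables is the chi-squared test of independence on a contingency table. The following is a minimal sketch in base R; the counts and the variable names (device, purchased) are made up for illustration and do not come from the article.

```r
# Hypothetical counts: device type (rows) by purchase outcome (columns).
counts <- matrix(
  c(120,  80,
     60, 140),
  nrow = 2, byrow = TRUE,
  dimnames = list(device = c("mobile", "desktop"),
                  purchased = c("no", "yes"))
)

# Share of each outcome within each device type (rows sum to 1).
row_props <- prop.table(counts, margin = 1)

# Chi-squared test of independence; correct = FALSE turns off
# Yates' continuity correction so the statistic matches the textbook formula.
test <- chisq.test(counts, correct = FALSE)

# Cramer's V rescales the statistic to a 0..1 measure of association.
cramers_v <- sqrt(unname(test$statistic) /
                  (sum(counts) * (min(dim(counts)) - 1)))

print(row_props)
print(test)
print(cramers_v)
```

With very large samples almost any difference becomes statistically significant, so an effect-size measure such as Cramer's V is often more informative than the p-value alone.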
Week 3 BDT
Write at least 500 words discussing one or more use cases for Spark.
Use at least three sources. Include at least 3 quotes from your sources enclosed in quotation marks and cited in-line by reference to your reference list. Example: "words you copied" (citation). Each quote should be one full sentence, not altered or paraphrased. Cite your sources using APA format. Use the quotes in your paragraphs.
Write in essay format, not in bulleted, numbered, or other list format.
Week 2 BDT
Write at least 500 words discussing what Spark is and does. Explain what problems it solves.
Use at least three sources. Include at least 3 quotes from your sources enclosed in quotation marks and cited in-line by reference to your reference list. Example: "words you copied" (citation). Each quote should be one full sentence, not altered or paraphrased. Cite your sources using APA format. Use the quotes in your paragraphs.
Week 1 BDT
Write at least 500 words on what 'Big' means in Big Data. What exposure have you had to Big Data?
Use at least three sources. Include at least 3 quotes from your sources enclosed in quotation marks and cited in-line by reference to your reference list. Example: "words you copied" (citation). Each quote should be one full sentence, not altered or paraphrased. Cite your sources using APA format. Use the quotes in your paragraphs.
Write in essay format, not in bulleted, numbered, or other list format.
Week 15 ML
Optimization of regression models
Write a summary of the three most useful algorithms learned during the course.
Week 14 ML
Human Activity Recognition Final Case Analysis
Week 13 ML
Optimization of regression models
Describe various aspects of DBSCAN and Mixture clustering methods. Describe the process of anomaly detection using clustering.
Week 12 ML
Clustering techniques Part 1
Describe in detail the difference between SOM and LLE. Which technique do you think is more effective than K-means clustering?
Week 11 ML
Unsupervised Learning using Dimension Reduction
Week 10 ML
Nutrition Case Study
Write a formal report on your findings from the last several weeks for the regression problem of the Nutrition Case Study.
A sample template for the final report is provided that contains minimum requirements for the report including the following sections: Introduction, Analysis and Results, Methodology, Limitations and Conclusion.
The main objective is to write a fully executed R Markdown program that performs regression prediction using the best models found and compares their cost functions and R-squared values.
Make sure to describe the final hyperparameter settings of all algorithms that were used for comparison purposes.
Week 9 ML
Optimization of regression models
Describe various aspects of training machine learning models for regression, such as the cost function, gradient descent, and the bias-variance tradeoff.
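As an illustration of how a cost function and gradient descent fit together, here is a minimal base R sketch for simple linear regression. The simulated data, learning rate, and iteration count are assumptions chosen for the example, not values from the course.

```r
set.seed(42)

# Simulated data (assumption): y = 3 + 2x + noise.
n <- 100
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1)

# Design matrix with an intercept column.
X <- cbind(1, x)

# Cost function: half the mean squared error of the predictions.
mse_cost <- function(beta) mean((X %*% beta - y)^2) / 2

beta <- c(0, 0)          # starting values
learning_rate <- 0.01    # step size (assumption)
cost_history <- numeric(5000)

for (i in seq_along(cost_history)) {
  # Gradient of the cost with respect to beta.
  gradient <- t(X) %*% (X %*% beta - y) / n
  # Step against the gradient to reduce the cost.
  beta <- beta - learning_rate * as.vector(gradient)
  cost_history[i] <- mse_cost(beta)
}

# Closed-form least-squares fit for comparison.
ols <- coef(lm(y ~ x))
print(rbind(gradient_descent = beta, ols = unname(ols)))
```

Plotting cost_history against the iteration number shows the cost falling quickly at first and then flattening as the estimates approach the least-squares solution.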
Week 8 ML
Preventing overfitting
Describe the difference between LASSO and Ridge regression techniques. Do you have a preference for one over the other for the Nutrition case study predicting the response variable?
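The contrast between the two penalties can be sketched in base R without extra packages. This example uses simulated data (an assumption) and the closed-form ridge solution. For LASSO it shows only the soft-thresholding operator, which gives the LASSO coefficients in the special case of an orthonormal design; a full LASSO fit would normally use a dedicated package.

```r
set.seed(1)

# Simulated, standardized predictors (assumption); the second true coefficient is 0.
n <- 50
X <- scale(matrix(rnorm(n * 3), ncol = 3))
y <- as.vector(X %*% c(2, 0, -1) + rnorm(n))
y <- y - mean(y)  # center the response so no intercept is needed

# Ridge regression has a closed form: (X'X + lambda * I)^(-1) X'y.
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  as.vector(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

ols   <- ridge_coef(X, y, lambda = 0)    # lambda = 0 is ordinary least squares
ridge <- ridge_coef(X, y, lambda = 25)   # shrinks coefficients toward zero

# Soft-thresholding: shrinks by gamma and sets small values exactly to zero.
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)
lasso_like <- soft_threshold(c(2.0, 0.3, -1.0), gamma = 0.5)

print(rbind(ols = ols, ridge = ridge))
print(lasso_like)
```

Ridge pulls every coefficient toward zero without eliminating any, while the soft-thresholding step zeroes out the small coefficient entirely, which is why LASSO can double as a variable-selection method.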
Week 7 ML
Ensemble Methods
Describe how linear regression models are different from nonlinear models. Also describe the main idea behind polynomial terms in a nonlinear model.
Week 6 ML
Santander Bank Case Study
Write a formal report on your findings from the last several weeks for the classification of the Santander Bank Case Study.
The main objective is to write a fully executed R-Markdown program performing classification using the best models found for logistic regression, SVM, Random Forest and XGBoost algorithms, and comparing the values of their cost functions and accuracy scores.
Make sure to describe the final hyperparameter settings of all algorithms that were used for comparison purposes.
Clean and merge all the files using proper IDs discussed in the second week. Create one master data file to be analyzed for case study 2. Now perform EDA and share your findings in the form of an R Markdown report.
Week 5 ML
Ensemble Methods
Describe the fundamental difference between a Random Forest and a boosted tree. Describe the out-of-bag sample and the cost functions for both types of models.
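The out-of-bag idea can be illustrated with a single bootstrap draw in base R. This is a sketch of the sampling step only, with an arbitrary sample size; it does not fit a forest.

```r
set.seed(123)

n <- 10000
rows <- seq_len(n)

# Each tree in a Random Forest is grown on a bootstrap sample:
# n rows drawn with replacement from the training data.
in_bag <- sample(rows, size = n, replace = TRUE)

# Rows never drawn are "out of bag" for this tree and can serve
# as a built-in validation set for it.
out_of_bag <- setdiff(rows, in_bag)

oob_fraction <- length(out_of_bag) / n
print(oob_fraction)  # close to (1 - 1/n)^n, roughly 0.37 for large n
```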
Week 4 ML
Support Vector Machine
Describe a kernel function in SVM. Also describe the process of feature engineering and boundary creation in SVM.
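As one concrete example of a kernel function, the radial basis function (RBF) kernel can be written in a few lines of base R. The gamma value and the sample points below are arbitrary choices for illustration.

```r
# RBF kernel: similarity decays with squared Euclidean distance.
rbf_kernel <- function(x, z, gamma = 0.5) {
  exp(-gamma * sum((x - z)^2))
}

a <- c(1, 2)
b <- c(2, 0)

k_aa <- rbf_kernel(a, a)  # identical points have similarity 1
k_ab <- rbf_kernel(a, b)  # squared distance is 5, so exp(-0.5 * 5)

# Kernel (Gram) matrix for a small set of points.
points <- rbind(c(0, 0), c(1, 0), c(0, 2))
gram <- unname(exp(-0.5 * as.matrix(dist(points))^2))

print(k_aa)
print(k_ab)
print(gram)
```

An SVM only ever needs these pairwise similarities, which is how it can draw a nonlinear boundary in the original feature space without computing the higher-dimensional features explicitly.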
Week 3 ML
Logistic Regression
Week 2 ML
Challenges of ML
Visit the Kaggle website for the Santander Bank classification challenge and describe the main problem. Also describe various challenges and limitations of this case analysis.
Week 1 ML
You are expected to be able to program in R prior to taking this class. Use the Titanic dataset and perform EDA on various columns. Without using any modeling algorithms, and using only basic methods such as frequency distributions, describe the most important predictors of survival among Titanic passengers. For example, were males or females more likely to survive? Were young, rich females more likely to survive than old, poor males?
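A frequency-distribution approach of this kind might look like the following base R sketch. It assumes the aggregated Titanic table that ships with R's datasets package (counts by Class, Sex, Age, and Survived), which may differ from the passenger-level file used in the course.

```r
# datasets::Titanic is a 4-way table of counts:
# dimension 1 = Class, 2 = Sex, 3 = Age, 4 = Survived.
by_sex <- margin.table(Titanic, c(2, 4))

# Survival rate within each sex (rows sum to 1).
survival_rate_by_sex <- prop.table(by_sex, margin = 1)[, "Yes"]

# Survival rate for each combination of class and sex.
by_class_sex <- margin.table(Titanic, c(1, 2, 4))
rate_by_class_sex <- prop.table(by_class_sex, margin = c(1, 2))[, , "Yes"]

print(by_sex)
print(round(survival_rate_by_sex, 3))
print(round(rate_by_class_sex, 3))
```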
Week 14 VA
Refinements in ggplot
Week 13 VA
Find a graph of COVID-19 disease or economic data in a newspaper, journal, or website and recreate it.
Find a map of COVID-19 disease or economic data in a newspaper, journal, or website and recreate it.
Week 10 VA
Load the broom library.
Use tidy() on the out dataframe to produce a new dataframe of component-level information. Store the result in out_comp.
Round all the columns to two decimal places using round_df().
Produce a flipped scatter plot of Term v. Estimate.
Produce a new tidy output of out including confidence intervals. Store it in a variable called out_conf after rounding the dataframe to two decimals.
Remove the intercept column and the term continent from the label and make a plot of points with whiskers to show the coefficients with a confidence range, ordering the output from smallest to largest.
Use the head function to see the first six rows after applying the augment function to out. Store the result in out_aug.
Add the data back into out_aug with the data = argument.
Plot the .fitted data v. the .resid data.
What does this graph show?
Using the pipe, round the output of glance(out).
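The tidy, augment, and glance steps above can be sketched as follows. The model behind out is not shown in this listing, so the example fits a stand-in linear model on the built-in mtcars data (an assumption). It also uses a small hypothetical helper, round_numeric(), in place of round_df(), and leaves out the plotting steps.

```r
library(broom)

# Stand-in model (assumption): the original `out` is not shown here.
out <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Hypothetical helper standing in for round_df(): rounds numeric columns only.
round_numeric <- function(df, digits = 2) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], round, digits = digits)
  df
}

# Component-level information: one row per model term.
out_comp <- round_numeric(tidy(out))

# The same, with confidence intervals, dropping the intercept row.
out_conf <- round_numeric(tidy(out, conf.int = TRUE))
out_conf <- subset(out_conf, term != "(Intercept)")

# Observation-level information, with the original data added back in.
out_aug <- augment(out, data = mtcars)
print(head(out_aug))

# Model-level summary: a single row of fit statistics.
out_glance <- round_numeric(glance(out))
print(out_glance)
```

Plotting .fitted against .resid from out_aug gives a residual plot, where a visible pattern in the points suggests the linear model is missing some structure in the data.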
Week 8 VA
Return to the visualization for Presidential Elections: Popular and Electoral College margins, subset by party, and use that to add color to your points.
Recreate figures 5.28 using functions from the dplyr library.
Using gss_sm data, calculate the mean and median number of children by degree.
Using gapminder data, create a boxplot of life expectancy over time.
Using gapminder data, create a violin plot of population over time.
Visual Analytics Week 1
Week 1 Visual Analytics Solutions