Recently Published
Final_Project_Mansi
News Popularity in Social Media: Insights through Statistical Analysis
Week 13 | Data Dive — Critiquing Models and Analyses
Think about the context of the lab (Lab 8 is chosen here) and consider the following:
- Analytical issues, such as model assumptions
- Overcoming biases (existing or potential)
- Possible risks or societal implications
- Crucial issues which might not be measurable
Mansi_Data_Dive_Time_based_Data
Select a column of your data that encodes time (e.g., "date", "timestamp", "year", etc.). Convert this into a Date in R.
Choose a column of data to analyze over time. This should be a "response-like" variable that is of particular interest.
Create a tsibble object of just the date and response variable. Then, plot your data over time. Consider different windows of time.
What stands out immediately?
Use linear regression to detect any upward or downward trends.
Do you need to subset the data for multiple trends?
How strong are these trends?
Use smoothing to detect at least one season in your data, and interpret your results.
Can you illustrate the seasonality using ACF or PACF?
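The steps above (date conversion, a linear trend, and an autocorrelation check) can be sketched in a few lines. The notebook itself is written in R with a tsibble; the version below is only a language-neutral illustration in Python's standard library, and the data values and the helper names `linear_trend` and `autocorrelation` are hypothetical.

```python
from datetime import date

# Hypothetical monthly observations: ISO date strings and a response value.
raw = [("2015-11-01", 10.0), ("2015-12-01", 12.0), ("2016-01-01", 11.0),
       ("2016-02-01", 13.0), ("2016-03-01", 15.0), ("2016-04-01", 14.0)]

# Convert the time column into real date objects.
dates = [date.fromisoformat(d) for d, _ in raw]
y = [v for _, v in raw]

# Use days since the first observation as the numeric time axis.
t = [(d - dates[0]).days for d in dates]

def linear_trend(x, y):
    """Ordinary least squares slope and intercept for y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def autocorrelation(y, lag):
    """Sample autocorrelation of y at the given lag (one bar of an ACF plot)."""
    n = len(y)
    my = sum(y) / n
    denom = sum((v - my) ** 2 for v in y)
    num = sum((y[i] - my) * (y[i + lag] - my) for i in range(n - lag))
    return num / denom

slope, intercept = linear_trend(t, y)
acf1 = autocorrelation(y, 1)
print(f"slope per day: {slope:.4f}, lag-1 autocorrelation: {acf1:.3f}")
```

A positive slope suggests an upward trend over the window, and a large autocorrelation at a lag such as 7 or 12 would hint at weekly or yearly seasonality, depending on the sampling interval.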
Mansi_Data_Dive_GLMs_Part2
Build a linear (or generalized linear) model as you like
- Use whatever response variable and explanatory variables you prefer
Use the tools from previous weeks to diagnose the model
- Highlight any issues with the model
Interpret at least one of the coefficients
Mansi_Data_Dive_GLMs
Select an interesting binary column of data, or one which can be reasonably converted into a binary variable
This should be something worth modeling
Build a logistic regression model for this variable, using between 1 and 4 explanatory variables
Interpret the coefficients, and explain what they mean in your notebook
(Bonus) Using the Standard Error for at least one coefficient, build a C.I. for that coefficient, and interpret its meaning
Consider a transformation for any explanatory variable, and illustrate why you need the transformation (or why you do not)
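For the bonus step, one common way to turn a coefficient and its standard error into an interval is the Wald construction, which can then be exponentiated to the odds-ratio scale. The sketch below assumes that approach (the notebook may use a different one), is written in Python rather than the notebook's R, and uses made-up values for the coefficient and standard error; `odds_ratio_ci` is a hypothetical helper name.

```python
import math

def odds_ratio_ci(coef, std_err, z=1.96):
    """Wald interval for a logistic coefficient, returned on the odds-ratio scale.

    z=1.96 corresponds to an approximate 95% interval.
    """
    lower, upper = coef - z * std_err, coef + z * std_err
    return math.exp(coef), (math.exp(lower), math.exp(upper))

# Hypothetical values, as they might be read off a fitted model's summary table.
coef, std_err = 0.40, 0.10
or_hat, (ci_low, ci_high) = odds_ratio_ci(coef, std_err)
print(f"odds ratio {or_hat:.3f}, 95% CI ({ci_low:.3f}, {ci_high:.3f})")
```

If the interval on the odds-ratio scale excludes 1, a one-unit increase in that explanatory variable is associated with a change in the odds of the outcome at roughly the 5% significance level.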
Scatter Plots ...
Data Dive — Regression
Your RMarkdown notebook for this data dive should contain the following:
Select a continuous (or ordered integer) column of data that seems most "valuable" given the context of your data, and call this your response variable.
For example, in the Ames housing data, the price of the house is likely of the most value to both buyers and sellers. This is the thing most people will ask about when it comes to houses.
Select a categorical column of data (explanatory variable) that you expect might influence the response variable.
Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results. Be clear about how the R output relates to your conclusions.
If there are more than 10 categories, consider consolidating them before running the test using the methods we've learned in class.
Explain what this might mean for people who may be interested in your data. E.g., "there is not enough evidence to conclude [----], so it would be safe to assume that we can [------]".
Find at least one other continuous (or ordered integer) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear.
Build a linear regression model of the response using just this column, and evaluate its fit.
Run appropriate hypothesis tests and summarize their results. Use diagnostic plots to identify any issues with your model.
Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?
Add at least one other variable to your regression model (e.g., you might use the one from the ANOVA), and evaluate how it helps (or doesn't).
Maybe include an interaction term, but explain why you included it.
You can add up to 4 variables if you like.
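To make the ANOVA step concrete, the F statistic compares variation between category means with variation inside each category. The following is a minimal sketch in Python's standard library (the notebook itself uses R); the groups are invented numbers and `one_way_anova_f` is a hypothetical helper. It stops at the F statistic, because the p-value requires the F distribution, which the standard library does not provide.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical response values for three categories.
groups = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f"F = {f_stat:.2f} on ({df_b}, {df_w}) degrees of freedom")
```

Under the null hypothesis that all category means are equal, F should be close to 1; a much larger value is evidence against that hypothesis.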
Mansi Data Dive — Confidence Interval
## Part 1: Build at least three sets of variable combinations
- For each set of variables, include at least one column that you created (i.e., calculated based on others)
- All variables for this data dive should be either continuous (i.e., numeric) or ordered (e.g., ['small', 'medium', 'large'] is okay, but ["apples", "oranges", "bananas"] is not)
- For each set, there should be one response variable with the others as explanatory variables
## Part 2: Plot a visualization for each response-explanatory relationship, and draw some conclusions based on the plot
- Use what we've covered so far in class to scrutinize the plot (e.g., are there any outliers?)
## Part 3: Calculate the appropriate correlation coefficient for each of these combinations
- Explain why the value makes sense (or doesn't) based on the visualization(s)
## Part 4: Build a confidence interval for each of the response variables. Provide a detailed conclusion about the response variable (i.e., the population) based on your confidence interval.
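Parts 3 and 4 can be illustrated with a short sketch. It is written in Python's standard library rather than the notebook's R, uses invented data, and makes two assumptions worth stating: Pearson's coefficient is used (a rank-based coefficient may suit ordered variables better), and the interval for the mean uses a normal approximation rather than a t distribution. `pearson_r` and `mean_ci` are hypothetical helper names.

```python
import math
from statistics import NormalDist, mean, stdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_ci(values, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(values) / math.sqrt(len(values))
    m = mean(values)
    return m - half_width, m + half_width

# Hypothetical explanatory and response values.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
r = pearson_r(x, y)
low, high = mean_ci(y)
print(f"r = {r:.3f}, 95% CI for mean response: ({low:.2f}, {high:.2f})")
```

For small samples the t-based interval is wider than this approximation, so the normal version is best treated as a rough check.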
Mansi_Data_Dive_Documentation
1. A list of at least 3 columns (or values) in your data which are unclear until you read the documentation.
E.g., this could be a column name, or just some value inside a cell of your data
Why do you think they chose to encode the data the way they did? What could have happened if you didn't read the documentation?
2. At least one element of your data that is unclear even after reading the documentation
You may need to do some digging, but is there anything about the data that your documentation does not explain?
3. Build a visualization which uses a column of data that is affected by the issue you brought up in bullet #2, above. In this visualization, find a way to highlight the issue, and explain what is unclear and why it might be unclear.
You can use color or an annotation, but also make sure to explain your thoughts using Markdown
Do you notice any significant risks? If so, what could you do to reduce negative consequences?
Mansi Assignment: Sampling n Drawing Conclusions
Your RMarkdown notebook for this data dive should contain the following:
A collection of 5-10 random samples of data (with replacement) from at least 6 columns of data
Each subsample should be roughly 50% as long as your data. We are simulating the act of collecting data from a population where the "population" is represented by the data set you already have.
Store each sample set in a separate data frame (e.g., df_i might contain m rows from columns 1-6)
These subsamples should include both categorical and continuous (numeric) data
Scrutinize these subsamples.
How different are they?
What would you have called an anomaly in one sub-sample that you wouldn't in another?
Are there aspects of the data that are consistent among all sub-samples?
Consider how this investigation affects how you might draw conclusions about the data in the future.
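The sampling procedure described above can be sketched as follows. This is an illustration in Python's standard library rather than the notebook's R; the rows are synthetic, and `draw_subsamples` and its parameters are hypothetical names chosen for this example.

```python
import random
from statistics import mean

def draw_subsamples(rows, n_samples=5, fraction=0.5, seed=0):
    """Draw n_samples random subsamples, with replacement, each about
    `fraction` of the data set's length. A fixed seed keeps runs repeatable."""
    rng = random.Random(seed)
    size = round(len(rows) * fraction)
    return [rng.choices(rows, k=size) for _ in range(n_samples)]

# Synthetic "population" with one categorical and one numeric column.
rows = [{"category": "a" if i % 3 else "b", "value": float(i)} for i in range(100)]

samples = draw_subsamples(rows)
for i, sample in enumerate(samples, start=1):
    share_b = sum(r["category"] == "b" for r in sample) / len(sample)
    print(i, len(sample), round(mean(r["value"] for r in sample), 2), round(share_b, 2))
```

Comparing the numeric mean and the category share across subsamples gives a quick sense of how much a summary can move from sampling alone, which is the baseline against which an apparent anomaly should be judged.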
Mansi_Assignment2_Week3
Week 3 | Data Dive — Probabilities and Anomalies
Assignment: Data Dive - Summaries
RMarkdown notebook for this data dive on News Popularity in Multiple Social Media Platforms