
AndersonUyekita

Anderson Hitoshi Uyekita

Recently Published

[Practice Quiz] Regression Models
Based on an analysis involving 173 observations and 19 variables, there is insufficient evidence to affirm that the major category has a significant association with income.
[Quiz 4] Regression Models
[Quiz 3] Regression Models
[Quiz 2] Regression Models
[Quiz 1] Regression Models
[Course Project] Regression Models
The data analysis in this document identified that manual vehicles perform better than automatics in terms of miles per gallon (mpg). Furthermore, based on the linear regression modeled in this study, the difference averages 3.79 miles per gallon, a significant amount.
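As a minimal sketch of the kind of comparison described, using the built-in mtcars dataset: the report's final model adjusts for additional covariates, so the 3.79 mpg figure quoted above differs from the coefficient of this simple, unadjusted fit.

```r
# Sketch: manual vs. automatic transmission in the built-in mtcars data.
# The report's final model adjusts for covariates, so this unadjusted
# coefficient is not the 3.79 mpg figure quoted above.
data(mtcars)
mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = mtcars)
summary(fit)$coefficients      # "amManual" row gives the unadjusted mpg difference

# One plausible adjusted model (covariate choice is an assumption)
fit_adj <- lm(mpg ~ am + wt + qsec, data = mtcars)
summary(fit_adj)$coefficients
```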
[Quiz 4] Statistical Inference
[Course Project] Statistical Inference - Part 2
Part 2 of the Course Project aims to analyze the ToothGrowth dataset using confidence intervals and hypothesis tests. This dataset has 60 observations on 3 variables and describes tooth growth in guinea pigs receiving vitamin C supplements through two delivery methods. According to the results, there is no evidence to affirm that Orange Juice (OJ) and Ascorbic Acid (VC) perform differently. However, there is strong evidence that increasing the vitamin C dosage increases tooth growth.
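One way to reproduce the two comparisons described above with the built-in ToothGrowth data; the published report may use different test options, so treat this as a sketch.

```r
# Sketch of the two comparisons described above, using the built-in ToothGrowth data.
data(ToothGrowth)

# Delivery method: OJ vs. VC (no strong evidence of a difference overall)
t.test(len ~ supp, data = ToothGrowth)

# Dose: compare lowest and highest dose levels (strong evidence of an effect)
low  <- subset(ToothGrowth, dose == 0.5)$len
high <- subset(ToothGrowth, dose == 2.0)$len
t.test(high, low, alternative = "greater")
```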
[Course Project] Statistical Inference - Part 1
Part 1 of the Course Project aims to show an understanding of the central limit theorem (CLT) by comparing simulation results with theoretical expectations. The activity is based on a sample of 1,000 means, each computed from 40 numbers drawn from an exponential distribution with lambda = 0.2. The sample and theoretical values are then compared to illustrate the CLT. As a result, a graph was plotted showing the normality of the sample averages, confirming the CLT.
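A minimal sketch of that simulation; the seed is chosen here only for reproducibility and is not taken from the report.

```r
# Sketch: 1,000 means of 40 exponentials, lambda = 0.2
set.seed(1234)            # seed chosen here for reproducibility; the report may differ
lambda <- 0.2
n      <- 40
sims   <- 1000

sample_means <- replicate(sims, mean(rexp(n, rate = lambda)))

mean(sample_means)        # compare with theoretical mean 1/lambda = 5
var(sample_means)         # compare with theoretical variance (1/lambda)^2 / n = 0.625
hist(sample_means, breaks = 30, main = "Distribution of sample means")
```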
[Quiz 3] Statistical Inference
[Quiz 2] Statistical Inference
[Quiz 1] Statistical Inference
[Course Project 2] Reproducible Research
Course Project 2 explores a database from the U.S. National Oceanic and Atmospheric Administration (NOAA) to create a piece of Reproducible Research following the instructions and review criteria stated by the course instructor. The research is hosted on RPubs and evaluated in a peer-graded assignment. Course Project 2 also aims to evaluate the economic impact and the human consequences caused by storms, hurricanes, tornadoes, floods, and other environmental disasters. According to the analysis performed to answer questions 1 and 2, the event type with the most harmful consequences is the TORNADO, owing to its high frequency and the resulting high totals of injuries and fatalities. By contrast, HURRICANE and TSUNAMI do not have the highest totals of harm to people, and both have a low probability of occurring: the dataset records 299 hurricanes and 20 tsunamis against more than 60 thousand tornado observations. Still, their average number of deaths and injuries per event is greater than the TORNADO's, which we could interpret as both event types being deadlier per occurrence. Lastly, FLOOD is the event type with the greatest economic impact in absolute terms, reaching more than USD 180 billion (most of it related to property damage). Weather problems such as EXCESSIVE HEAT and EXCESSIVE COLD also have a significant economic impact; for example, EXCESSIVE HEAT has generated more than USD 15.8 billion in crop damage. Another critical point is that HURRICANE does not happen regularly, but when it does, it causes severe damage to properties and crops: based on the given dataset, hurricanes generate, on average, USD 300 million of damage per event.
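A sketch of the kind of aggregation behind those conclusions. The file name and column names (EVTYPE, FATALITIES, INJURIES) follow the NOAA storm dataset commonly used in this course and should be treated as assumptions.

```r
# Sketch: harm by event type in the NOAA storm data (file/column names assumed).
storm <- read.csv("repdata_data_StormData.csv.bz2")

# Absolute totals: TORNADO dominates because of its high frequency
harm <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = storm, FUN = sum)
harm$TOTAL <- harm$FATALITIES + harm$INJURIES
head(harm[order(-harm$TOTAL), ], 10)

# Per-event averages highlight rarer but deadlier events (hurricanes, tsunamis)
avg_harm <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = storm, FUN = mean)
head(avg_harm[order(-(avg_harm$FATALITIES + avg_harm$INJURIES)), ], 10)
```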
[Course Project 1] Reproducible Research
Course Project 1 aims to answer questions and accomplish tasks to create a piece of Reproducible Research. The dataset used for this assignment is repdata_data_activity.zip, which contains data about personal movement (steps): 279 kBytes (calculated using the pryr package), 3 variables, and 17,568 observations. The dataset required manipulation to handle the NA observations, or rather a strategy for dealing with them; I used an imputation approach that considers the distinct behavior of each weekday at each interval, and the result of this imputation was an increase in the mean and median. During the Exploratory Data Analysis, it was possible to identify that the tracked person behaves differently on Mondays, Tuesdays, and Wednesdays, waking up early, compared to Thursdays and Fridays, when they usually wake up at 8 AM. It was also possible to identify that on weekends they neither wake up early nor do intense physical exercise. Finally, this tracked person takes almost 11 thousand steps on average every day.
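A sketch of the weekday-by-interval imputation strategy described above; the file and column names (steps, date, interval) follow the course dataset and are assumptions here.

```r
# Sketch of the imputation strategy: replace NA steps with the mean for that
# weekday/interval combination (file and column names assumed from the course data).
activity <- read.csv("activity.csv")
activity$date    <- as.Date(activity$date)
activity$weekday <- weekdays(activity$date)

# Mean steps per weekday/interval, ignoring NAs
means <- aggregate(steps ~ weekday + interval, data = activity, FUN = mean)
names(means)[names(means) == "steps"] <- "mean_steps"

imputed <- merge(activity, means, by = c("weekday", "interval"))
na_rows <- is.na(imputed$steps)
imputed$steps[na_rows] <- imputed$mean_steps[na_rows]

mean(tapply(imputed$steps, imputed$date, sum))   # average daily total after imputation
```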
[Quiz 2] Reproducible Research
[Quiz 1] Reproducible Research
[Course Project 2] Exploratory Data Analysis
Course Project 2 aims to create six plots using base graphics and ggplot2 to answer six given questions. The datasets used for this assignment are from the National Emissions Inventory (NEI), which is recorded every three years; the current Course Project covers the years from 1999 to 2008. The data comprises two data frames: (i) summarySCC_PM25.rds, with almost 6.5 million observations and 6 variables, and (ii) Source_Classification_Code.rds, with over 11.7 thousand observations and 15 variables. Based on the plots created to address the questions, I found that PM2.5 emissions have decreased over the years across the US. While in Baltimore City the reduction in vehicle emissions is clear, in Los Angeles County PM2.5 from vehicles increased until 2005 and then decreased by almost 9% in 2008. Finally, coal-related PM2.5 comes mostly from electric generation, and a sharp reduction in coal PM2.5 was observed between 2005 and 2008. Feel free to look at the Codebook to go in-depth.
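A sketch of the loading step and the national-trend aggregation; the column names (Emissions, year) follow the NEI files used in the course and are assumptions here.

```r
# Sketch: load the NEI data frames and plot total PM2.5 emissions per year.
NEI <- readRDS("summarySCC_PM25.rds")
SCC <- readRDS("Source_Classification_Code.rds")

# Total PM2.5 emissions per year across the US
totals <- aggregate(Emissions ~ year, data = NEI, FUN = sum)

library(ggplot2)
ggplot(totals, aes(x = factor(year), y = Emissions / 1e6)) +
  geom_col() +
  labs(x = "Year", y = "Total PM2.5 emissions (millions of tons)",
       title = "US PM2.5 emissions, 1999-2008")
```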
[Quiz 1] Getting and Cleaning Data
[Course Project 1] Exploratory Data Analysis
Course Project 1 aims to reproduce four given graphics from the UC Irvine Machine Learning Repository's "Individual household electric power consumption Data Set". This dataset is 20 megabytes compressed (zip file) and, uncompressed, reaches around 130 megabytes on Windows; loading the text file is estimated to require around 142.4 megabytes of memory (details are shown in 1. Memory Requirements). According to the instructions, the dataset has more than 2 million rows (observations) and 9 columns, which made it possible to fill in the colClasses argument of read.table for faster loading. Later, in the tidying process, it was necessary to convert the Date column into a Date class object and merge Date and Time to create a POSIXlt object. Finally, I stored the R scripts on GitHub; the PNG files in the repository's root are the graphics exported by my R scripts, each 480 pixels in width and height. Feel free to look at the CodeBook to go in-depth.
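A sketch of the loading and tidying steps described above. The file name, separator, and "?" missing-value marker follow the UCI household power consumption dataset; the example plot is only illustrative of the 480 x 480 PNG output, not one of the four assignment graphics.

```r
# Sketch: fast load with colClasses, then build a POSIXlt date-time column.
col_classes <- c("character", "character", rep("numeric", 7))
power <- read.table("household_power_consumption.txt", header = TRUE, sep = ";",
                    na.strings = "?", colClasses = col_classes)

# Convert Date, then merge Date and Time into a single POSIXlt column
power$Date     <- as.Date(power$Date, format = "%d/%m/%Y")
power$DateTime <- strptime(paste(power$Date, power$Time), format = "%Y-%m-%d %H:%M:%S")

# Illustrative plot exported at 480 x 480 pixels, mirroring the assignment's PNGs
png("plot_example.png", width = 480, height = 480)
plot(power$DateTime, power$Global_active_power, type = "l",
     xlab = "", ylab = "Global Active Power (kilowatts)")
dev.off()
```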
[Programming Assignment 1] R Programming
This Programming Assignment is part of the R Programming course and consists of three parts, each requiring the development of one function. Part 1: The pollutantmean() function calculates the mean of a specific pollutant (sulfate or nitrate), excluding any NA observations. Part 2: The complete() function reports the number of non-NA observations for each monitor id. Part 3: The corr() function calculates the correlation between nitrate and sulfate, excluding rows with NA observations.
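A sketch of Part 1 under the usual course layout: a "specdata" directory containing files 001.csv through 332.csv with sulfate and nitrate columns. These file and column names are assumptions about that layout.

```r
# Sketch of pollutantmean(): mean of one pollutant across selected monitors,
# excluding NA observations (directory/file layout assumed from the course).
pollutantmean <- function(directory = "specdata", pollutant = "sulfate", id = 1:332) {
  values <- numeric(0)
  for (i in id) {
    file    <- file.path(directory, sprintf("%03d.csv", i))
    monitor <- read.csv(file)
    values  <- c(values, monitor[[pollutant]])
  }
  mean(values, na.rm = TRUE)   # exclude NA observations, as required
}

# Usage example
# pollutantmean("specdata", "nitrate", 70:72)
```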
[Programming Assignment 3] R Programming
This Programming Assignment is part of the R Programming course and consists of four parts. The dataset analyzed concerns deaths due to specific outcomes, such as heart attack, heart failure, or pneumonia. Part 1: Most hospitals in the US have 14 to 17 deaths caused by heart attacks monthly. Part 2: The best() function finds the best hospital in a state for any of the outcomes (heart attack, heart failure, or pneumonia), based on 30-day mortality. Part 3: The rankhospital() function returns the hospital name for a given state and a given ranking. Part 4: The rankall() function compares all states at a given ranking.
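A sketch of the best() function from Part 2. The file name and the long outcome column names are assumptions based on the dataset usually distributed with this assignment; adjust them to match the actual CSV.

```r
# Sketch of best(): lowest 30-day mortality in a state for a given outcome
# (file name and column names are assumptions; check against the real CSV).
best <- function(state, outcome) {
  data <- read.csv("outcome-of-care-measures.csv", colClasses = "character")

  cols <- c("heart attack"  = "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack",
            "heart failure" = "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure",
            "pneumonia"     = "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia")
  if (!state %in% data$State)    stop("invalid state")
  if (!outcome %in% names(cols)) stop("invalid outcome")

  sub  <- data[data$State == state, ]
  rate <- suppressWarnings(as.numeric(sub[[cols[outcome]]]))
  sub  <- sub[!is.na(rate), ]
  rate <- rate[!is.na(rate)]

  # Lowest mortality wins; ties broken alphabetically by hospital name
  winners <- sub$Hospital.Name[rate == min(rate)]
  sort(winners)[1]
}
```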
Activity 2
Vignette - Mastering Software Development in R
This is the vignette submitted for evaluation.
Build a New Geom
This is the final project of the 4th course of Mastering Software Development in R, offered by Johns Hopkins University.
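Since the description is brief, here is a minimal, generic sketch of how a new ggplot2 geom is defined with the ggproto system. GeomSimplePoint and geom_simple_point are illustrative names only, not the geom built in the actual project.

```r
# Minimal sketch of building a new ggplot2 geom with ggproto (illustrative names).
library(ggplot2)
library(grid)

GeomSimplePoint <- ggproto("GeomSimplePoint", Geom,
  required_aes = c("x", "y"),
  default_aes  = aes(shape = 19, colour = "black"),
  draw_key     = draw_key_point,
  draw_panel   = function(data, panel_params, coord) {
    coords <- coord$transform(data, panel_params)      # map data to panel coordinates
    pointsGrob(coords$x, coords$y, pch = coords$shape,
               gp = gpar(col = coords$colour))
  }
)

geom_simple_point <- function(mapping = NULL, data = NULL, stat = "identity",
                              position = "identity", na.rm = FALSE,
                              show.legend = NA, inherit.aes = TRUE, ...) {
  layer(geom = GeomSimplePoint, mapping = mapping, data = data, stat = stat,
        position = position, show.legend = show.legend, inherit.aes = inherit.aes,
        params = list(na.rm = na.rm, ...))
}

# Usage example with a built-in dataset
ggplot(mtcars, aes(wt, mpg)) + geom_simple_point()
```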
ND111 - Project 02 - Data Science II - Act Report
This project gives the student a real case of how to gather, assess, clean, and analyze data; in other words, it encompasses Data Wrangling and Exploratory Data Analysis. The database used as an example is the WeRateDogs™ Twitter account, which has more than 7,572,000 followers, 9,500 tweets, and 141,000 likes. The Data Gathering process bundled three different tasks: first, downloading a file from a URL and later loading it into the Jupyter Notebook, which requires a manual step; second, downloading a file programmatically; and third, gathering data from the Twitter API. This step also required saving these data on a local machine. Based on the data gathered, I assessed the most evident issues (17 issues in total) and documented them to create a record of modifications. Later, in the Data Cleaning process, I fixed all identified issues, merged the two files downloaded during the Data Gathering process into one, and added some missing values (from the archive downloaded via the Twitter API). The final data frame was stored as twitter_archive_master.csv. In Data Analysis and Visualization, which I interpreted as Exploratory Analysis, I posed a few questions to guide my analysis, which led me to find strong evidence of: seasonality in the number of tweets along the week and along the year; a positive correlation between the number of retweets and the number of favourites; and no correlation between the outputs of the algorithms used to predict the dog breed.
ND111 - Project 02 - Data Science II - Wrangle Report
During the Data Wrangling process, in the `twitter_archive_enhanced.csv` file, I found several problems in the dog's name column; probably the regex used to gather/find it (from the Twitter user `@dog_rates`, also known as [WeRateDogs™][dog_rates]) was not well calibrated and in many cases captured articles, nouns, or other ordinary words. I fixed this by setting these problematic dog names to `None`. I also found problems in the `rating_numerator` and `rating_denominator` columns, both from the `image_predictions.tsv` file, which required a new process of "scraping" these values from the `text` column. Finally, I combined the files `twitter_archive_enhanced.csv` and `image_predictions.tsv` into a new data frame called `twitter_archive_master.csv`, to which I added two new features: `retweet_count` and `favorite_count`. Both features are gathered from the WeRateDogs™ tweets using the tweepy package. [dog_rates]: https://twitter.com/dog_rates
Job Application QJ45141 - Dashboard
This dashboard is part of the QJ45141 Job Application. My objectives with this Dashboard are: 1) to provide an example of data visualisation; 2) to analyse and summarise data in one place; and 3) to create an analysis recording each step of data manipulation.
Job Application QJ45141 - Data & Information Manager
As part of the Job Application for Data & Information Manager, this document aims to demonstrate my proficiency in data analysis. The outcome of this document is a brief overview of Aurora Energy and the NZ Electricity Market.
Statistical inference - Course Project: Part 2
This project aims to analyze the ToothGrowth database using confidence intervals and/or tests. This dataset has 60 observations and 3 variables, and a summary with a brief exploratory analysis was provided. As a result of this project, supplement type was found to have no effect on tooth growth, while increasing the dose level leads to increased tooth growth.
Statistical Inference - Course Project: Part 1
This exercise aims to show the power of the central limit theorem by comparing simulation results with theoretical expectations. The exercise is based on a sample of 1,000 means generated from 40 numbers (with an exponential distribution profile with lambda 0.2). Comparisons between the sample and theoretical results were made to illustrate the CLT. As a result, a graph was plotted showing the normality of the sample means, confirming the CLT.