Recently Published
spotify03
spotify02
This notebook enriches a cleaned Spotify track list with artist-country metadata to support geographic analyses of streaming popularity. Starting from spotify_data_rmdup.csv—a roughly 13,000-row table of unique tracks and curated audio features—we query the public MusicBrainz API in 1,000-row batches, assigning one or more ISO-3166 country codes to every artist credited on each track. A lightweight in-memory cache plus an on-disk RDS file prevent redundant requests, while an adjustable one-second delay respects MusicBrainz rate limits. All steps—chunk creation, safe API calls, duplicate handling, progress reporting, and final export—are fully scripted in R and wrapped in a Quarto document for reproducibility. The resulting file, dat_with_country_all.rds, adds a pipe-delimited country field that can be merged back into genre or popularity studies, enabling questions such as “Which countries dominate high-popularity electronic playlists?” or “How do cross-border collaborations affect reach?” Researchers can reuse or extend the workflow with minimal edits.
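As a rough illustration of the caching and rate-limiting logic, here is a minimal R sketch. The function name lookup_artist_country, the httr/jsonlite calls, and the cache file name are assumptions for illustration, not the notebook's actual helpers.

library(httr)
library(jsonlite)

cache_file <- "artist_country_cache.rds"                  # hypothetical name
cache <- if (file.exists(cache_file)) readRDS(cache_file) else list()

lookup_artist_country <- function(artist, delay = 1) {
  if (!is.null(cache[[artist]])) return(cache[[artist]])  # in-memory hit
  Sys.sleep(delay)                                        # respect the rate limit
  resp <- GET("https://musicbrainz.org/ws/2/artist/",
              query = list(query = artist, fmt = "json", limit = 1),
              user_agent("spotify-country-demo/0.1 (example@example.com)"))
  country <- NA_character_
  if (status_code(resp) == 200) {
    hits <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$artists
    if (length(hits) && !is.null(hits$country)) country <- hits$country[1]
  }
  cache[[artist]] <<- country                             # update in-memory cache
  saveRDS(cache, cache_file)                              # persist to disk
  country
}

Codes for multi-artist tracks can then be collapsed with paste(codes, collapse = "|") to produce the pipe-delimited country field.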
spotify01
Spotify’s vast streaming catalog offers an ideal sandbox for probing how musical style interacts with commercial success. This technical report chronicles the assembly of an analysis-ready dataset that merges two complementary slices—“High Popularity” and “Low Popularity”—from the Kaggle Spotify Music Data repository. After appending a categorical popularity label, we condense over thirty playlist-level genre tags into six headline genres to streamline visualization and modeling. Duplicate records are systematically flagged at the track-artist-playlist level using flexible separator rules that detect comma, ampersand, and semicolon delimiters within multi-artist fields. Because Spotify sometimes assigns the borderline popularity score of sixty-eight to both classes, ambiguous low-popularity rows are excluded to guarantee mutually exclusive groups. The resulting table contains 13,342 unique tracks spanning electronic, pop, Latin, hip-hop, ambient, rock, and an aggregated “others” bucket. Every transformation is fully scripted in R and encapsulated within a Quarto document, enabling frictionless reruns on fresh dumps or forked datasets. The cleaned corpus supports subsequent inquiries into genre-specific audience reach, collaboration networks, and the quantitative drivers of streaming attention. Practitioners can reuse the workflow as a drop-in template for playlist analytics, while researchers gain a transparent foundation for reproducible music-industry studies. The code is concise, documented, and easily shareable.
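To make those steps concrete, here is a minimal R sketch of the separator rules, genre condensing, and borderline-score filter. Column names such as track_artist and playlist_genre are assumptions about the Kaggle files, and the tag-to-genre mapping shown is illustrative, not the report's full table.

library(dplyr)
library(stringr)

clean_tracks <- function(dat) {
  dat |>
    # normalise multi-artist credits: split on comma, ampersand, or semicolon,
    # then sort and re-join so the same collaboration always matches
    mutate(artist_key = sapply(str_split(track_artist, "\\s*[,&;]\\s*"),
                               function(a) paste(sort(a), collapse = "|"))) |>
    # keep one row per track-artist-playlist combination
    distinct(track_name, artist_key, playlist_name, .keep_all = TRUE) |>
    # condense the playlist genre tags into six headline genres plus "others"
    # (illustrative mapping only)
    mutate(genre6 = case_when(
      str_detect(playlist_genre, "edm|electro|house") ~ "electronic",
      str_detect(playlist_genre, "pop")               ~ "pop",
      str_detect(playlist_genre, "latin")             ~ "latin",
      str_detect(playlist_genre, "hip.?hop|rap")      ~ "hip-hop",
      str_detect(playlist_genre, "ambient")           ~ "ambient",
      str_detect(playlist_genre, "rock")              ~ "rock",
      TRUE                                            ~ "others"
    )) |>
    # drop low-popularity rows at the ambiguous borderline score of 68
    filter(!(popularity_class == "low" & track_popularity == 68))
}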
Social Isolation yearly changes
This study examines changes in time allocation across activities from 2006 to 2021, focusing on who people spend that time with: alone, with family, with colleagues, or with others. The findings show an increase in time spent alone and a decline in interactions with others, particularly non-family members. These trends reflect societal changes such as individualism, digitalization, and shifting work-life dynamics. The analysis highlights that secondary activities, such as work, are increasingly done alone, while leisure activities show a sharp decline in interactions outside the family. These results reveal the growing impact of modern lifestyles on social connections and individual behavior.
Visualization of suicide data in New Zealand (3) - Choropleth map
This report visualizes suicide data in New Zealand for 2023 using choropleth maps to highlight spatial variations across districts. Leveraging R and the ggplot2 package, the study processes and cleans data from official sources, addressing missing and anomalous values to ensure accuracy. Two maps are generated: one depicting the absolute number of suicides and another illustrating suicide rates per 100,000 population. The visualizations reveal significant disparities between regions, identifying high-risk districts that require targeted intervention. These insights aim to inform public health strategies and policy-making to effectively address and mitigate suicide rates across New Zealand.
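A minimal sketch of the choropleth step, assuming an sf object of district boundaries read from a hypothetical shapefile and a toy rate table; the report's actual data sources and district names differ.

library(sf)
library(ggplot2)
library(dplyr)

districts <- st_read("nz_districts.shp")          # hypothetical boundary file
suicide_df <- data.frame(district = c("Auckland", "Wellington"),
                         rate_per_100k = c(9.5, 11.2))   # placeholder values

map_dat <- left_join(districts, suicide_df, by = "district")

ggplot(map_dat) +
  geom_sf(aes(fill = rate_per_100k), colour = "white", linewidth = 0.1) +
  scale_fill_viridis_c(na.value = "grey90", name = "Rate per 100,000") +
  labs(title = "Suspected suicide rate by district, 2023") +
  theme_minimal()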
Visualization of suicide data in New Zealand (2)-2 - Age Group
This report analyzes suicide trends in Aotearoa New Zealand for 2023, focusing on differences across age groups and sexes. Using "Suspected" case data from all ethnic groups, the study employs data cleaning and transformation to ensure accuracy. Utilizing R and ggplot2, the report presents stacked bar charts showing both the number of suicide deaths and rates per 100,000 population across age categories. The findings provide insights for public health officials and policymakers to identify high-risk groups and develop targeted intervention strategies, contributing to efforts to reduce suicide rates and enhance mental health support in New Zealand. (Updated to account for missing data.)
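A minimal sketch of the stacked bar chart, assuming a data frame with age_group, sex, and deaths columns; the counts below are placeholders, not the official figures.

library(ggplot2)

age_dat <- data.frame(
  age_group = rep(c("15-24", "25-44", "45-64", "65+"), each = 2),
  sex       = rep(c("Female", "Male"), times = 4),
  deaths    = c(20, 55, 35, 110, 30, 85, 15, 45)   # placeholder counts
)

ggplot(age_dat, aes(x = age_group, y = deaths, fill = sex)) +
  geom_col(position = "stack") +
  labs(x = "Age group", y = "Suspected suicide deaths",
       title = "Suspected suicide deaths by age group and sex, 2023") +
  theme_minimal()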
Visualization of suicide data in New Zealand (2) - Age Group
This report analyzes suicide trends in Aotearoa New Zealand for 2023, focusing on differences across age groups and sexes. Using "Suspected" case data from all ethnic groups, the study employs data cleaning and transformation to ensure accuracy. Utilizing R and ggplot2, the report presents stacked bar charts showing both the number of suicide deaths and rates per 100,000 population across age categories. The findings provide insights for public health officials and policymakers to identify high-risk groups and develop targeted intervention strategies, contributing to efforts to reduce suicide rates and enhance mental health support in New Zealand.
Visualization of suicide data in New Zealand (1) - Time series
This study analyzes suspected suicide trends in New Zealand from 2009 to 2023. Using R for data cleaning and visualization, it highlights demographic patterns and annual shifts, offering insights for policymakers and researchers.
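A minimal sketch of the time-series plot, using clearly synthetic counts in place of the official figures.

library(ggplot2)

set.seed(1)
ts_dat <- data.frame(year = 2009:2023,
                     deaths = round(550 + cumsum(rnorm(15, 0, 20))))  # synthetic

ggplot(ts_dat, aes(year, deaths)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Suspected suicides",
       title = "Suspected suicides in New Zealand, 2009-2023") +
  theme_minimal()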
Subsetting Method
This note customizes the behavior of the [ operator in R to allow row extraction in the form of df[1:3]. By redefining the [.data.frame method using the S3 system, rows are selected when no column is specified. This customization enables differentiation between row and column selection without affecting other object types.
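A minimal sketch of the redefinition described above: a single-index call df[i] is rerouted to row selection, while two-index calls fall through to the built-in method. Because this shadows the default for all data frames in the session, it is for demonstration only.

`[.data.frame` <- function(x, i, j, ...) {
  if (nargs() == 2L) {
    # single-index call df[i]: treat i as a row index and keep all columns
    base::`[.data.frame`(x, i, TRUE, drop = FALSE)
  } else {
    # df[i, j] and df[i, ] behave exactly as before
    base::`[.data.frame`(x, i, j, ...)
  }
}

df <- data.frame(a = 1:5, b = letters[1:5])
df[1:3]              # now returns rows 1-3 with both columns
df[1:3, "a"]         # column selection with two indices still works
rm(`[.data.frame`)   # restore the default behaviour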
Type Coercion
Type coercion in R refers to the automatic conversion of elements with different data types into a unified type when combined in a vector. R follows a specific data type hierarchy: character > numeric > integer > logical. This ensures that all elements in a vector have the same type. For example, when a logical (TRUE), numeric (17), and character ("twelve") are combined, all elements are coerced into the character type. Similarly, logical values like TRUE and FALSE are converted to 1 and 0 when combined with numeric values. This process allows R to handle mixed-type data efficiently and consistently. Understanding type coercion is crucial for data manipulation, as it prevents unexpected type changes when processing vectors with diverse elements.
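A few console lines confirm the hierarchy described above:

c(TRUE, 17, "twelve")   # "TRUE" "17" "twelve"  -> all coerced to character
c(TRUE, FALSE, 2.5)     # 1.0 0.0 2.5           -> logicals become numeric
c(TRUE, 5L)             # 1 5                   -> logical becomes integer
typeof(c(1L, 2.5))      # "double": integer is promoted to numeric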
ThreeWhiteSoldiers
This R script analyzes and visualizes selected Nikkei 225 stocks to detect the Three White Soldiers (TWS) pattern, a bullish candlestick formation. Utilizing quantmod for data retrieval, TTR for technical analysis, and dplyr for data manipulation, it downloads one year of historical stock data from Yahoo Finance. The script identifies stocks with a TWS pattern on the most recent trading day and visualizes these patterns by adding vertical orange lines with 0.3 transparency to the charts. This automated approach aids analysts and investors in efficiently spotting bullish signals in stock data.
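A minimal sketch of the detection rule on one illustrative ticker; the original script's candle-size thresholds and chartSeries overlay may differ, so a base plot carries the marker here.

library(quantmod)

# one year of daily data for an illustrative Nikkei 225 member (Toyota)
px <- getSymbols("7203.T", src = "yahoo", from = Sys.Date() - 365,
                 auto.assign = FALSE)
op <- as.numeric(Op(px)); cl <- as.numeric(Cl(px))

# TWS core rule: three consecutive bullish candles with successively
# higher opens and higher closes
is_tws <- function(op, cl, k) {
  k >= 3 &&
    all(cl[(k - 2):k] > op[(k - 2):k]) &&
    all(diff(cl[(k - 2):k]) > 0) &&
    all(diff(op[(k - 2):k]) > 0)
}

n <- length(cl)
plot(cl, type = "l", xlab = "Trading day", ylab = "Close", main = "7203.T")
if (is_tws(op, cl, n)) {
  # semi-transparent orange marker, cf. the 0.3 transparency in the post
  abline(v = n, col = adjustcolor("orange", alpha.f = 0.3), lwd = 3)
}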
nikkei225
This R script conducts time series analysis and visualization of selected Nikkei 225 stocks. Utilizing quantmod for data retrieval and dplyr for data manipulation, it fetches one year of historical stock data from Yahoo Finance. The script determines the most recent weekday and the corresponding weekday one year earlier so that the download window covers actual trading days. It dynamically generates variables for each stock's data and visualizes them using chartSeries with a clean theme. This streamlined process helps financial analysts and researchers analyze and visualize stock performance efficiently.
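A minimal sketch of the retrieval-and-plot loop; the ticker list is illustrative and the date arithmetic is simplified relative to the script's weekday handling.

library(quantmod)

tickers <- c("7203.T", "9984.T", "6758.T")   # illustrative Nikkei 225 members
to   <- Sys.Date()
from <- to - 365                             # simplified one-year window

for (tk in tickers) {
  dat <- getSymbols(tk, src = "yahoo", from = from, to = to,
                    auto.assign = FALSE)
  assign(paste0("stk_", gsub("\\.", "_", tk)), dat)   # e.g. stk_7203_T
  chartSeries(dat, theme = chartTheme("white"), name = tk)
}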
Data Cleaning 01
In this study, we present a detailed methodology for cleaning and organizing the data from the Spanish football league. The primary objective is to transform raw match data into a structured and analyzable format, ensuring consistency and accuracy. We employ several data processing techniques using R, including reading raw CSV files, converting team names to their abbreviations, and calculating match results. Our approach involves creating a new dataframe that includes both home and away team perspectives, with additional columns for results and win points. The processed data enables more straightforward analysis for various applications, such as performance analysis, trend identification, and predictive modeling. The effectiveness of our method is demonstrated through a step-by-step transformation of the dataset, ensuring that it is ready for advanced statistical analysis and machine learning applications. This paper contributes to the field by providing a reproducible and scalable data cleaning framework, essential for researchers and analysts working with sports data.
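A minimal sketch of the reshaping step, with hypothetical column names and a three-team abbreviation lookup; the real abbreviation table and match scores are not reproduced here.

library(dplyr)

raw <- data.frame(home_team  = c("Real Madrid", "Barcelona"),
                  away_team  = c("Barcelona", "Sevilla"),
                  home_goals = c(2, 1),
                  away_goals = c(1, 1))        # placeholder scores

abbr <- c("Real Madrid" = "RMA", "Barcelona" = "BAR", "Sevilla" = "SEV")

# one row per team per match, from both perspectives
home <- transmute(raw, team = abbr[home_team], opponent = abbr[away_team],
                  gf = home_goals, ga = away_goals, venue = "home")
away <- transmute(raw, team = abbr[away_team], opponent = abbr[home_team],
                  gf = away_goals, ga = home_goals, venue = "away")

matches <- bind_rows(home, away) |>
  mutate(result = case_when(gf > ga ~ "W", gf < ga ~ "L", TRUE ~ "D"),
         points = c(W = 3, D = 1, L = 0)[result])   # win points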
eyesdata
This study demonstrates the use of the VGAM package in R to fit a vector generalized linear model (VGLM) to synthetic data. Records for 1,000 individuals, including ocular pressure and age, were generated and transformed to include mean ocular pressure and linear predictors. Binary outcomes for each eye were derived using the logitlink function. The vglm function was then used to fit the model with leye and reye as the paired responses and op as the predictor, specifying a bivariate binomial family with an odds-ratio association. Finally, the model matrix was generated to show the data structure.
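A minimal sketch of the simulation and fit; the coefficients are illustrative, and binom2.or (VGAM's bivariate binomial family with an odds-ratio association) stands in for the family used in the post.

library(VGAM)

set.seed(1)
n   <- 1000
age <- runif(n, 40, 80)
op  <- rnorm(n, mean = 16 + 0.05 * age, sd = 3)   # ocular pressure

eta  <- -6 + 0.3 * op                             # illustrative linear predictor
p    <- logitlink(eta, inverse = TRUE)            # inverse logit
leye <- rbinom(n, 1, p)
reye <- rbinom(n, 1, p)
dat  <- data.frame(op, age, leye, reye)

fit <- vglm(cbind(leye, reye) ~ op, family = binom2.or, data = dat)
coef(fit, matrix = TRUE)    # two marginal logits plus the log odds ratio
head(model.matrix(fit))     # inspect the model matrix, as in the post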
confint
This study simulates the construction of 95% confidence intervals for a true mean using normally distributed random data. By generating 100 samples of size 30, we compute the confidence intervals and highlight in red those that do not contain the true mean. The proportion of intervals containing the true mean is calculated and compared with the nominal 95% level. This approach visually demonstrates the reliability of confidence intervals in estimating population parameters.
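A minimal base-R sketch of the simulation:

set.seed(42)
mu <- 0; n_samp <- 100; n_obs <- 30

ci <- t(replicate(n_samp, {
  x <- rnorm(n_obs, mean = mu)
  mean(x) + c(-1, 1) * qt(0.975, n_obs - 1) * sd(x) / sqrt(n_obs)
}))

covers <- ci[, 1] <= mu & mu <= ci[, 2]
mean(covers)    # proportion containing the true mean, close to 0.95

plot(NULL, xlim = range(ci), ylim = c(0, n_samp + 1),
     xlab = "Interval", ylab = "Sample", main = "95% confidence intervals")
segments(ci[, 1], 1:n_samp, ci[, 2], 1:n_samp,
         col = ifelse(covers, "grey40", "red"))   # misses drawn in red
abline(v = mu, lty = 2)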
Spotify
This analysis investigates the relationships between various musical features of songs using a dataset containing attributes such as Danceability, Energy, Key, Loudness, and more. We calculated the mean, standard deviation, and five-number summary for each variable. Correlation matrices were computed, and the most strongly correlated pairs were identified. Boxplots and scatterplot matrices were generated to visualize distributions and correlations. Key findings include a strong negative correlation between Key and Energy, and strong positive correlations between Liveness and Acousticness and between Tempo and Liveness.
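A minimal sketch of the workflow on a synthetic stand-in for the dataset (the real feature values are not reproduced here):

set.seed(1)
spotify <- data.frame(Danceability = runif(50), Energy = runif(50),
                      Key = sample(0:11, 50, replace = TRUE),
                      Loudness = rnorm(50, -8, 3), Liveness = runif(50),
                      Acousticness = runif(50), Tempo = rnorm(50, 120, 20))

sapply(spotify, function(x) c(mean = mean(x), sd = sd(x), fivenum(x)))

cm <- cor(spotify)
cm[upper.tri(cm, diag = TRUE)] <- NA                 # keep each pair once
idx <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)
c(rownames(cm)[idx[1, 1]], colnames(cm)[idx[1, 2]])  # strongest pair

boxplot(scale(spotify), las = 2)   # standardised scales for comparison
pairs(spotify)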
100m
This analysis examines the historical 100m sprint times for male athletes, utilizing linear regression to identify trends over time. A dataset of sprint times from 1964 to 2009 is used to build a regression model, which is then employed to predict sprint times 100 years into the future. The findings include the regression model equation and the predicted sprint time for the year 2109.
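A minimal sketch of the model, with approximate Olympic winning times standing in for the post's dataset:

year <- seq(1964, 2008, by = 4)
time <- c(10.00, 9.95, 10.14, 10.06, 10.25, 9.99, 9.92,
          9.96, 9.84, 9.87, 9.85, 9.69)              # approximate values

fit <- lm(time ~ year)
coef(fit)                                            # intercept and slope
predict(fit, newdata = data.frame(year = 2109))      # 100 years ahead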
Ordinal Logistic Regression
This study compares multinomial and ordinal logistic regression models using custom penalized error metrics. By applying larger penalties for misordered predictions, the performance of each model is assessed through penalized Mean Absolute Error (MAE) and Mean Squared Error (MSE), highlighting the benefits of considering order in predictive accuracy.
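A minimal sketch of one possible penalty scheme, assuming ordered classes coded 1..K and extra weight for errors of two or more categories; the post's exact penalties may differ.

penalized_mae <- function(actual, predicted, penalty = 2) {
  d <- abs(as.integer(actual) - as.integer(predicted))
  mean(ifelse(d > 1, penalty * d, d))     # heavier cost for misordering by 2+
}
penalized_mse <- function(actual, predicted, penalty = 2) {
  d <- abs(as.integer(actual) - as.integer(predicted))
  mean(ifelse(d > 1, penalty * d^2, d^2))
}

actual    <- c(1, 2, 3, 3, 2)
predicted <- c(1, 3, 1, 3, 2)
penalized_mae(actual, predicted)   # (0 + 1 + 2*2 + 0 + 0) / 5 = 1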
Comparative Analysis of GLM and VGLM Using Simulated Data
This document compares GLM and VGLM using simulated data with multiple predictors and Poisson-distributed responses, evaluating residuals and coefficients.
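A minimal sketch of the comparison; with a Poisson response and log link, the two fits should agree to numerical precision.

library(VGAM)

set.seed(7)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rpois(n, lambda = exp(0.5 + 0.8 * x1 - 0.4 * x2))
dat <- data.frame(y, x1, x2)

fit_glm  <- glm(y ~ x1 + x2, family = poisson, data = dat)
fit_vglm <- vglm(y ~ x1 + x2, family = poissonff, data = dat)

cbind(glm = coef(fit_glm), vglm = coef(fit_vglm))   # near-identical estimates
summary(residuals(fit_glm, type = "deviance"))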
polr
A short example of ordinal logistic regression with polr from the MASS package.
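A short, self-contained example using the housing data shipped with MASS:

library(MASS)

fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
summary(fit)                         # coefficients and intercepts (cutpoints)
head(predict(fit, type = "probs"))   # predicted category probabilities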
Frequency barchart
R code for Data Science I classes that draws the bar chart in Figure 1.1 from the frequency distribution table of party supporters.
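A minimal sketch of the chart with placeholder parties and counts (not the Figure 1.1 data):

party <- c("A", "B", "C", "D", "None")
freq  <- c(35, 25, 15, 5, 20)         # placeholder counts
barplot(freq, names.arg = party,
        xlab = "Party", ylab = "Number of supporters",
        main = "Supporters by party")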
Frequency table
R code to create a frequency table of the number of party supporters in Table 1.1 for use in Data Science I classes.
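A minimal sketch of the table step with placeholder responses (not the Table 1.1 data):

responses <- c("A", "B", "A", "C", "B", "A", "None", "B", "A", "C")
tab <- table(responses)   # counts per party
tab
prop.table(tab)           # relative frequencies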