tkubota0720

Takafumi Kubota

Recently Published

Data Cleaning 01
In this study, we present a detailed methodology for cleaning and organizing the data from the Spanish football league. The primary objective is to transform raw match data into a structured and analyzable format, ensuring consistency and accuracy. We employ several data processing techniques using R, including reading raw CSV files, converting team names to their abbreviations, and calculating match results. Our approach involves creating a new dataframe that includes both home and away team perspectives, with additional columns for results and win points. The processed data enables more straightforward analysis for various applications, such as performance analysis, trend identification, and predictive modeling. The effectiveness of our method is demonstrated through a step-by-step transformation of the dataset, ensuring that it is ready for advanced statistical analysis and machine learning applications. This paper contributes to the field by providing a reproducible and scalable data cleaning framework, essential for researchers and analysts working with sports data.
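The reshaping step described above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the column names, the placeholder teams and scores, the abbreviation lookup, and the 3/1/0 points scheme are all assumptions made for the example.

```r
# Placeholder match data; teams, scores, and column names are hypothetical.
raw <- data.frame(
  home       = c("Team A", "Team B"),
  away       = c("Team B", "Team C"),
  home_goals = c(2, 1),
  away_goals = c(1, 1),
  stringsAsFactors = FALSE
)

# Assumed lookup from full team name to abbreviation.
abbr <- c("Team A" = "TMA", "Team B" = "TMB", "Team C" = "TMC")

# Build one row per team per match, so every match appears from both
# the home and the away perspective.
to_long <- function(matches, abbr) {
  home_view <- data.frame(
    team          = unname(abbr[matches$home]),
    opponent      = unname(abbr[matches$away]),
    venue         = "home",
    goals_for     = matches$home_goals,
    goals_against = matches$away_goals,
    stringsAsFactors = FALSE
  )
  away_view <- data.frame(
    team          = unname(abbr[matches$away]),
    opponent      = unname(abbr[matches$home]),
    venue         = "away",
    goals_for     = matches$away_goals,
    goals_against = matches$home_goals,
    stringsAsFactors = FALSE
  )
  long <- rbind(home_view, away_view)

  # Result and win points from the goal difference
  # (3 for a win, 1 for a draw, 0 for a loss: an assumed scheme).
  diff <- long$goals_for - long$goals_against
  long$result <- ifelse(diff > 0, "W", ifelse(diff < 0, "L", "D"))
  long$points <- ifelse(diff > 0, 3, ifelse(diff == 0, 1, 0))
  long
}

long <- to_long(raw, abbr)
```

Stacking the two views keeps each match twice, which makes per-team summaries (for example, total points by team) a simple group-by on the `team` column.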
eyesdata
This study demonstrates the use of the VGAM package in R to fit a vector generalized linear model (VGLM) to synthetic data. Data for 1000 individuals, including ocular pressure and age, were generated and then extended with the mean ocular pressure and the linear predictors. Binary outcomes for the eye conditions were derived using the logitlink function. The vglm function was used to fit the model with leye and reye as responses and op as the predictor, specifying a bivariate binomial family parameterized by an odds ratio. The model matrix was generated to show the data structure.
ggplot2
confint
This study simulates the construction of 95% confidence intervals for the true mean using normally distributed random data. By generating 100 samples of size 30, we compute the confidence intervals and identify those that do not contain the true mean, highlighting them in red. The proportion of intervals containing the true mean is calculated to validate the simulation's accuracy. This approach visually demonstrates the reliability of confidence intervals in estimating population parameters.
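A simulation of this kind might look like the sketch below. The true mean and standard deviation, the seed, and the use of a t-based interval are assumptions for illustration; the summary above does not state them.

```r
set.seed(1)            # arbitrary seed, only for reproducibility
mu     <- 0            # assumed true mean
sigma  <- 1            # assumed true standard deviation
n_sims <- 100          # number of samples
n      <- 30           # size of each sample

# One 95% t-interval per simulated sample; rows are samples.
ci <- t(replicate(n_sims, {
  x    <- rnorm(n, mean = mu, sd = sigma)
  half <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  c(lower = mean(x) - half, upper = mean(x) + half)
}))

# TRUE where the interval contains the true mean.
covers   <- ci[, "lower"] <= mu & mu <= ci[, "upper"]
coverage <- mean(covers)

# Colour vector for plotting: intervals that miss the true mean in red.
cols <- ifelse(covers, "black", "red")
```

With only 100 intervals, `coverage` will fluctuate around 0.95 from run to run rather than match it exactly.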
Spotify
This analysis investigates the relationships between musical features of songs using a dataset containing attributes such as Danceability, Energy, Key, Loudness, and more. We calculated the mean, standard deviation, and five-number summary for each variable. Correlation matrices were computed, and the most strongly correlated pairs were identified. Boxplots and scatterplot matrices were generated to visualize distributions and correlations. Key findings include a strong negative correlation between Key and Energy, and strong positive correlations between Liveness and Acousticness and between Tempo and Liveness.
pokemon
Time Series
100m
This analysis examines the historical 100m sprint times for male athletes, utilizing linear regression to identify trends over time. A dataset of sprint times from 1964 to 2009 is used to build a regression model, which is then employed to predict sprint times 100 years into the future. The findings include the regression model equation and the predicted sprint time for the year 2109.
Ordinal Logistic Regression
This study compares multinomial and ordinal logistic regression models using custom penalized error metrics. By applying larger penalties for misordered predictions, the performance of each model is assessed through penalized Mean Absolute Error (MAE) and Mean Squared Error (MSE), highlighting the benefits of considering order in predictive accuracy.
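One possible reading of an order-aware error metric is sketched below: each prediction is penalized by how many ordered categories it is away from the truth, so a miss by two levels costs more than a miss by one. The level names, the toy labels, and the helper `ordinal_errors` are hypothetical, and the study's actual penalty scheme may differ.

```r
# Hypothetical ordered levels and toy labels, for illustration only.
levels_ord <- c("low", "medium", "high")
actual     <- factor(c("low", "medium", "high", "high"),
                     levels = levels_ord, ordered = TRUE)
predicted  <- factor(c("low", "high", "low", "high"),
                     levels = levels_ord, ordered = TRUE)

# Distance between categories, measured in rank positions.
# MAE penalizes it linearly; MSE penalizes large misses more heavily.
ordinal_errors <- function(actual, predicted) {
  d <- abs(as.integer(actual) - as.integer(predicted))
  c(mae = mean(d), mse = mean(d^2))
}

errs <- ordinal_errors(actual, predicted)
```

Here the rank distances are 0, 1, 2, and 0, so `errs` holds an MAE of 0.75 and an MSE of 1.25. The same function can be applied to the predictions of a multinomial and an ordinal model to compare them on equal terms.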
Comparative Analysis of GLM and VGLM Using Simulated Data
This document compares GLM and VGLM using simulated data with multiple predictors and Poisson-distributed responses, evaluating residuals and coefficients.
Boston
bike_sharing
mpg2
mpg
dplyr01
polr
Example of polr
Rintro
Frequency barchart
R code for the bar chart drawn from the frequency distribution table of the number of party supporters in Figure 1.1 for use in Data Science I classes.
Frequency table
R code to create a frequency table of the number of party supporters in Table 1.1 for use in Data Science I classes.
Document
Document
hoge01