clau_aranda

CLAUDIA

Recently Published

Preprocessing and Exploratory Analysis of a Stroke Prediction Dataset
In this RMarkdown document, I perform data preparation and exploratory analysis on a stroke prediction dataset obtained from Kaggle. The steps include:

Data Preparation:
Setup: Disable warnings and messages in R to ensure cleaner output during execution.
Data Import: Load the stroke prediction dataset from a CSV file and inspect its structure by viewing the first few rows in a tabular format.
Data Cleaning: Remove the first column (patient ID) as it is not relevant for analysis, and delete rows with missing values (NA). Convert integer columns to numeric types and rename columns for clarity. Recode categorical variables.

Data Analysis:
Data Visualization: Display the cleaned dataset.
Factor Conversion: Convert the 'stroke' variable into a factor.
Outlier Handling: Exclude the 'Other' category in gender, as it represents only one individual and is not relevant for further study.
Missing Value Imputation: Impute missing values in all predictor variables, including BMI, using k-nearest neighbors (KNN).
Numerical Variable Distribution: Analyze the distribution of numerical variables and their relationships with the response variable using box plots.
Categorical Variables Distribution: Create mosaic plots to show relationships between categorical variables.
Contingency Tables and Association Tests: Perform statistical tests to examine associations between categorical variables and the response.
Correlation Matrix: Compute and visualize the correlation matrix for numerical variables.
Non-linear Relationships: Explore non-linear relationships between numerical variables and the response through scatter plots with non-linear fit lines.
Heatmap: Create a heatmap to show the distribution of average age by job type.
Data Filtering and Standardization: Filter out patients younger than 50 and those with job statuses of never worked or children.

This document aims to clean and explore the dataset thoroughly to prepare it for further predictive modeling and analysis.
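
As a rough illustration of the cleaning, imputation, and filtering steps listed above, here is a minimal sketch in plain Python (standard library only). It is not the R code from the document itself. The column names (`id`, `gender`, `age`, `work_type`, `bmi`, `stroke`), the `N/A` marker for missing BMI, the job labels `Never_worked` and `children`, and all sample values are assumptions made for the example. The imputation is also simplified: it measures distance on age alone, whereas the document describes KNN imputation across all predictor variables.

```python
import csv
import io

# Hypothetical sample rows; the values are invented for illustration only.
SAMPLE_CSV = """id,gender,age,work_type,bmi,stroke
1,Male,67,Private,36.6,1
2,Female,61,Self-employed,N/A,1
3,Other,26,Private,22.4,0
4,Female,13,children,18.0,0
5,Male,55,Never_worked,27.1,0
6,Female,79,Govt_job,24.0,1
"""

MISSING = ("", "NA", "N/A")  # assumed spellings of a missing value


def load_rows(text):
    """Parse the CSV, drop the patient ID, and convert column types."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        raw.pop("id")  # the identifier carries no analytical information
        rows.append({
            "gender": raw["gender"],
            "age": float(raw["age"]),
            "work_type": raw["work_type"],
            "bmi": None if raw["bmi"] in MISSING else float(raw["bmi"]),
            "stroke": int(raw["stroke"]),
        })
    return rows


def exclude_other_gender(rows):
    """Drop the 'Other' gender category."""
    return [r for r in rows if r["gender"] != "Other"]


def knn_impute_bmi(rows, k=2):
    """Fill each missing BMI with the mean BMI of the k rows nearest in age.

    Simplification: distance uses age only, and only originally observed
    BMI values serve as neighbours.
    """
    known = [r for r in rows if r["bmi"] is not None]
    if not known:
        return rows  # nothing to borrow from
    for r in rows:
        if r["bmi"] is None:
            nearest = sorted(known, key=lambda o: abs(o["age"] - r["age"]))[:k]
            r["bmi"] = sum(o["bmi"] for o in nearest) / len(nearest)
    return rows


def filter_cohort(rows, min_age=50, excluded_jobs=("Never_worked", "children")):
    """Keep patients aged min_age or older whose job status is not excluded."""
    return [
        r for r in rows
        if r["age"] >= min_age and r["work_type"] not in excluded_jobs
    ]


cleaned = filter_cohort(knn_impute_bmi(exclude_other_gender(load_rows(SAMPLE_CSV))))
print(len(cleaned), "rows remain after cleaning")
```

The functions are applied in the same order as the listed steps: category exclusion first, then imputation, then the age and job-status filter, so the imputation can still borrow values from rows that the final filter later removes.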