Recently Published
Predicting the housing market in Iowa: scanning extreme gradient boosting for superior prediction
Here I used the Ames housing data set to perform a fairly comprehensive analysis and build predictive models that estimate house prices. I explored the training set extensively, then used it for feature engineering, missing value imputation, and feature selection. Finally, I trained linear models, such as lasso-regularized regression and PCA regression, as well as more complex algorithms, including gradient boosting, extreme gradient boosting, random forest, and support vector machines. The resulting model predicts house sale prices fairly well, with an RMSE of 0.12717 on the test data set.
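For readers unfamiliar with the metric, here is a minimal sketch of how an RMSE can be computed. The summary does not say on which scale the reported value was measured; the `log_scale` option below is an assumption added for illustration, and the prices are made-up numbers, not results from the analysis.

```python
import math


def rmse(actual, predicted, log_scale=False):
    """Root mean squared error between two equal-length sequences.

    With log_scale=True the error is measured on natural-log values,
    which is an assumption here, not something the summary states.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    if log_scale:
        actual = [math.log(a) for a in actual]
        predicted = [math.log(p) for p in predicted]
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical sale prices, for illustration only.
actual_prices = [200000.0, 150000.0, 320000.0]
predicted_prices = [210000.0, 140000.0, 300000.0]
score = rmse(actual_prices, predicted_prices, log_scale=True)
```

On the log scale, the score is driven by the relative size of each error, so a 10% miss on a cheap house weighs the same as a 10% miss on an expensive one.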
Public health and economic consequences of storm events across the United States
Here I present a detailed analysis of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to determine which storm events have the greatest economic and public health consequences in the United States. The results suggest that tornadoes are the most harmful weather event, associated with over 5000 deaths and 75000 injuries across the U.S. between 1950 and 2011. Floods, on the other hand, are the leading cause of property damage, costing approximately 150 billion U.S. dollars over the same period. In terms of U.S. agriculture, drought has the biggest impact on crops, resulting in a loss of nearly 15 billion U.S. dollars. Finally, I present the geospatial distribution of these severe weather events to show their relative impact on public health and the economy across different states. These observations might help government authorities make decisions and prioritize resources when preparing for different types of severe weather events.
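The ranking described above comes down to summing a measure per event type and sorting the totals. The sketch below shows that step in isolation. The field names (`event`, `fatalities`, `property_damage`) and the records are hypothetical stand-ins, not the actual columns or values of the NOAA database.

```python
from collections import defaultdict


def total_by_event(records, field):
    """Sum `field` per event type and return (event, total) pairs, largest first."""
    totals = defaultdict(float)
    for record in records:
        totals[record["event"]] += record[field]
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical records, for illustration only.
records = [
    {"event": "TORNADO", "fatalities": 3, "property_damage": 2.5e6},
    {"event": "FLOOD", "fatalities": 1, "property_damage": 9.0e6},
    {"event": "TORNADO", "fatalities": 5, "property_damage": 1.0e6},
    {"event": "DROUGHT", "fatalities": 0, "property_damage": 0.2e6},
]

deadliest = total_by_event(records, "fatalities")
costliest = total_by_event(records, "property_damage")
```

Running the same function over different fields gives one ranking per type of consequence, which is how a single data set can yield different "most harmful" events for health and for property.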
A random forest classifier for highly accurate prediction of human activity
Here I present a detailed analysis and predictive modeling of the human activity recognition data set generated by Velloso et al. (http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises). Using the training data set, I built several non-linear models (decision tree, random forest, boosted tree, and support vector machines) to classify 5 types of human activities from the features collected by human activity recognition devices. I also demonstrated stacking and blending, in which combined models are trained on the predictions of the standalone classifiers. All individual and stacked classifiers were then evaluated on an independent validation set, where the random forest model gave the highest accuracy (99.9%). Finally, the random forest model was used to predict 20 independent test cases, resulting in 100% accuracy (verified/data not shown).
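To make the idea of combining classifiers concrete, here is a minimal sketch that blends the predictions of several models by plurality vote and scores the result on a validation set. Note that this is a simpler technique than the stacking described above, where a further model is trained on the base predictions. The model names, labels, and predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Blend per-model label lists into one label per sample by plurality vote.

    Ties go to the label voted for by the earliest model in the list.
    """
    blended = []
    for votes in zip(*predictions_per_model):
        blended.append(Counter(votes).most_common(1)[0][0])
    return blended


def accuracy(truth, predicted):
    """Fraction of samples whose predicted label matches the true label."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    matches = sum(1 for t, p in zip(truth, predicted) if t == p)
    return matches / len(truth)


# Hypothetical validation labels and model outputs, for illustration only.
truth = ["A", "B", "C", "D", "E", "B"]
tree_pred = ["A", "B", "C", "D", "E", "A"]
forest_pred = ["A", "B", "C", "D", "E", "B"]
svm_pred = ["A", "C", "C", "D", "A", "B"]

blend_pred = majority_vote([tree_pred, forest_pred, svm_pred])
blend_accuracy = accuracy(truth, blend_pred)
```

In this toy case the blend corrects the individual mistakes of the tree and the SVM, because no two models are wrong on the same sample.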