
Vishnu Nair

Recently Published

Newton's Method and Support Vector Machines
A small simulation implements Newton's Method, a technique used to find the roots of a function. Afterwards, support vector machines are fit to the Auto dataset with different kernels to see whether the data are separable. A support vector machine is a machine learning technique used for classification. It is versatile because we can adjust the gamma parameter and change the kernel to accommodate various types of data distributions and obtain greater accuracy.
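Below is a minimal sketch of the two pieces: a generic Newton iteration and SVM fits with different kernels. The function being solved, the predictors, and the high/low-mpg label are illustrative choices rather than the ones from the project, and the code assumes the e1071 package and the Auto data from the ISLR package.

```r
# Newton's method: find a root of f by iterating x <- x - f(x) / f'(x).
# The example function f(x) = x^2 - 2 is illustrative only.
newton <- function(f, fprime, x0, tol = 1e-8, max_iter = 100) {
  x <- x0
  for (i in seq_len(max_iter)) {
    step <- f(x) / fprime(x)
    x <- x - step
    if (abs(step) < tol) break
  }
  x
}
newton(function(x) x^2 - 2, function(x) 2 * x, x0 = 1)  # converges to sqrt(2)

# SVM on the Auto data: classify high vs. low mpg with different kernels.
library(ISLR)    # Auto dataset
library(e1071)   # svm()
Auto$high_mpg <- factor(ifelse(Auto$mpg > median(Auto$mpg), "high", "low"))
fit_radial <- svm(high_mpg ~ horsepower + weight, data = Auto,
                  kernel = "radial", gamma = 1, cost = 1)
fit_poly   <- svm(high_mpg ~ horsepower + weight, data = Auto,
                  kernel = "polynomial", degree = 3, cost = 1)
table(predict(fit_radial, Auto), Auto$high_mpg)  # training confusion matrix
```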
Tuning With Cross Validation
Cross validation is a method used to assess the predictive performance of a model. There are various types of cross validation that can be used depending on the data being worked with. In this case, we examine the mean squared error of regressions fit with polynomials of different degrees. Using cross validation, we can see which polynomial degree to fit the model with in order to minimize the MSE.
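As a rough illustration, the snippet below estimates 10-fold cross-validated MSE for polynomial fits of degree 1 through 5. The Auto data and the mpg-on-horsepower regression are stand-ins; the project's actual dataset and response may differ.

```r
library(ISLR)  # Auto data, used here only for illustration
library(boot)  # cv.glm()
set.seed(1)
cv_mse <- sapply(1:5, function(d) {
  fit <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv.glm(Auto, fit, K = 10)$delta[1]  # estimated test MSE for degree d
})
which.min(cv_mse)  # polynomial degree with the lowest cross-validated MSE
```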
Regressions and Splines
Using built-in datasets in R, I explored simple linear regressions, polynomial regressions, and splines. Flexibility is a very important concept when building a model. A model cannot be too rigid, or it will not predict new data accurately; if it is too flexible, it will overfit the training data and also predict poorly. Striking the right balance between the two maximizes prediction accuracy.
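A small sketch of the three kinds of fit on a built-in R dataset (cars: speed vs. stopping distance); the dataset and the chosen degree and degrees of freedom are illustrative, not necessarily those used in the project.

```r
# Linear regression, polynomial regression, and a regression spline on cars.
library(splines)  # bs()
fit_lin    <- lm(dist ~ speed, data = cars)
fit_poly   <- lm(dist ~ poly(speed, 3), data = cars)
fit_spline <- lm(dist ~ bs(speed, df = 5), data = cars)

# Compare the fitted curves over a grid of speeds.
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 100))
plot(cars$speed, cars$dist, xlab = "speed", ylab = "stopping distance")
lines(grid$speed, predict(fit_lin, grid),    col = "black")
lines(grid$speed, predict(fit_poly, grid),   col = "blue")
lines(grid$speed, predict(fit_spline, grid), col = "red")
```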
Bootstrap Exploration
This is a small class project that expanded my understanding of bootstrapping. I bootstrap medium-sized datasets provided in class and compare the resulting summaries with those of the original data. For those unfamiliar with this method, the bootstrap is a resampling technique in which samples are repeatedly drawn, with replacement, from the original dataset so that the sample can stand in for a larger population. In machine learning, this can be useful because a model can be trained on many resampled sets of observations. Of course, one of the major limitations is the overfitting that can occur, since the resampled data are saturated with repeated observations.
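A minimal example of the idea, bootstrapping the median of a single variable with the boot package; mtcars is only a stand-in for the class datasets.

```r
# Bootstrap the median: resample the data with replacement many times and
# look at how the statistic varies across resamples.
library(boot)
set.seed(1)
boot_median <- boot(
  data      = mtcars$mpg,                       # stand-in for the class data
  statistic = function(x, idx) median(x[idx]),  # statistic on each resample
  R         = 1000                              # number of bootstrap resamples
)
boot_median                           # original estimate, bias, standard error
boot.ci(boot_median, type = "perc")   # percentile confidence interval
```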
A Hearty Analysis
Hello! This is an analysis of the Cleveland Heart Disease dataset from the UC Irvine Machine Learning Repository. My aim was to implement various classification techniques to distinguish heart-healthy and heart-unhealthy individuals from one another.

For Newcomers...
In case you are unfamiliar with some of the techniques being implemented, here is some background.

Classification Tree: A classification tree operates very similarly to a regression tree, except that it predicts a qualitative response rather than a quantitative one. Each observation is predicted to belong to the most commonly occurring class among the training observations in its region, and the class proportions within that region are also of interest. We use recursive binary splitting to grow the tree and the classification error rate to measure misclassifications. Pruning can be done to reduce overfitting to the training data and possibly reduce variance; the number of terminal nodes is chosen by cross-validation.

Bagging: This is a method for reducing the variance of a fitted tree. We build a large number of decision trees on bootstrapped training samples, fit a separate model to each, and average the resulting predictions. A convenient way to estimate the test error of a bagged model is with the out-of-bag (OOB) observations, which are the observations left out of each bootstrapped training sample. Comparing the predictions for these OOB observations with their true responses gives the error estimate.

Random Forest: This is an extension of bagging that decorrelates the trees. A random forest considers only a random subset of the predictors when making each split, whereas bagging considers all of them. If there is one strong predictor, most bagged trees will look similar and therefore be highly correlated; restricting each split to a subset of the predictors breaks up that correlation.

Boosting: This is another method for improving the prediction accuracy of decision trees. We combine a large number of small trees, each fit to the residuals of the current model, so the model learns slowly but performs better over time. In bagging and random forests the trees are grown independently of one another, whereas in boosting each tree is grown using information from the trees that came before it.

Goal
See which classification method works best on the Cleveland heart data for classifying healthy vs. unhealthy individuals.
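For readers who want to see roughly how these four methods are fit in R, here is a minimal sketch. It uses a small simulated stand-in for the Cleveland data; the real column names, preprocessing, and tuning values in the project may differ.

```r
library(rpart)         # classification tree
library(randomForest)  # bagging and random forests
library(gbm)           # boosting
set.seed(1)

# Simulated stand-in for the Cleveland data: a few numeric predictors and a
# binary disease label (the real dataset has 13 clinical predictors).
n <- 300
heart <- data.frame(
  age     = rnorm(n, 55, 9),
  chol    = rnorm(n, 245, 50),
  thalach = rnorm(n, 150, 22),
  disease = factor(sample(c("no", "yes"), n, replace = TRUE))
)
train <- sample(nrow(heart), floor(0.7 * nrow(heart)))
test  <- heart[-train, ]

# Classification tree grown by recursive binary splitting, then pruned at the
# complexity parameter with the lowest cross-validated error.
tree_fit    <- rpart(disease ~ ., data = heart[train, ], method = "class")
best_cp     <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_fit, cp = best_cp)

# Bagging is a random forest that considers every predictor at each split;
# a random forest proper uses only a random subset (default ~ sqrt(p)).
p       <- ncol(heart) - 1
bag_fit <- randomForest(disease ~ ., data = heart[train, ], mtry = p)
rf_fit  <- randomForest(disease ~ ., data = heart[train, ])

# Boosting: gbm wants a 0/1 response, so recode the factor before fitting.
heart01   <- transform(heart, disease = as.integer(disease) - 1)
boost_fit <- gbm(disease ~ ., data = heart01[train, ], distribution = "bernoulli",
                 n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)

# Test-set misclassification rates for two of the fits.
mean(predict(tree_pruned, test, type = "class") != test$disease)
mean(predict(rf_fit, test) != test$disease)
```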