
atsels1

Alex Tselsov

Recently Published

Titanic: Machine Learning Challenge (Kaggle.com)
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others. In this challenge, I analyzed which sorts of people were likely to survive. In particular, I applied machine learning techniques to predict which passengers survived the tragedy.
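As a rough illustration of the kind of approach involved (not necessarily the exact model used in the report), a baseline classifier on the Kaggle train.csv might look like this:

```python
# Minimal sketch of a baseline survival model on the Kaggle Titanic data.
# The feature choice here is illustrative, not the report's actual model.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")  # Kaggle's training file

# A few well-known predictors: passenger class, sex, age, and fare
X = pd.DataFrame({
    "Pclass": train["Pclass"],
    "Sex":    (train["Sex"] == "female").astype(int),
    "Age":    train["Age"].fillna(train["Age"].median()),
    "Fare":   train["Fare"].fillna(train["Fare"].median()),
})
y = train["Survived"]

model = RandomForestClassifier(n_estimators=200, random_state=42)
print(cross_val_score(model, X, y, cv=5).mean())  # cross-validated accuracy
```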
Microarray analysis
In this exercise we will analyze data from a typical experiment in medical science. Our example data come from a microarray experiment. In medical science we often ask questions like: "How does this medicine affect the liver?", "What is the difference between normal skin and a skin tumor?", or "Is there a good candidate molecule in an inflamed tissue that we can target with a medicine to reduce the inflammation?" The common denominator is that we want to compare two or more cases, and the best experiment is naturally one that gives as much information as possible about what is going on.

More or less everything that is "going on" in a tissue or a cell in the body is controlled by proteins. Proteins are complex molecules that come in thousands of varieties. Some of them are building blocks of the cell, others transmit information or store energy, and another group, the enzymes, are the work-horses that perform the actions and chemical reactions in the cell. The proper function of a cell depends entirely on all these proteins being present in the correct concentrations. Almost any change that a cell experiences is reflected in increasing and/or decreasing concentrations of one or more of these proteins. In other words, if we had a method to compare all the protein concentrations in healthy and sick cells or tissues, we could pinpoint which proteins are affected, and this could give clues to what has actually happened in the sick sample, especially if we are lucky enough to know something about the affected proteins.

Measuring proteins turns out to be very difficult, but instead of the proteins we can measure mRNAs. An mRNA is another kind of molecule, used to construct the proteins: every protein is built using one specific kind of mRNA, and the more of that mRNA we have, the more of the protein is produced. Here is where the microarray comes in. A microarray is a little plate, almost like a computer chip, with thousands of microscopic, chemically prepared spots on it, and each of these spots can identify one specific mRNA if you pour properly prepared cell sap from a tissue or cell sample onto it. Under a special kind of microscope, each spot gives a measurable light signal that is stronger the higher the concentration of that mRNA in the sample. In other words, if we measure the whole microarray we get values for all the different mRNA concentrations in the sample, which in turn tell us something about all the protein concentrations. Ultimately, the production of each mRNA is controlled by a gene on the chromosomes, which is why microarray data are often termed "gene expression" values.

Microarrays have been used extensively in medical science during the last 20 years, and there is a special database (the Gene Expression Omnibus, GEO) at the National Center for Biotechnology Information (NCBI) in the USA where data from many microarray experiments are publicly available. We are going to download our example data from there. It comes from an investigation of human vein cells treated with the inflammatory stimulus TNF, aiming to elucidate mechanisms of inflammation.
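The core of the downstream analysis is a per-probe comparison of expression between the TNF-treated and control samples. A minimal sketch, assuming the expression matrix has already been loaded with probes as rows and samples as columns (the file and sample names here are hypothetical):

```python
# Sketch of a per-probe differential-expression comparison,
# assuming positive intensity values and hypothetical sample names.
import numpy as np
import pandas as pd
from scipy import stats

expr = pd.read_csv("expression_matrix.csv", index_col=0)  # hypothetical file

tnf_cols  = ["TNF_1", "TNF_2", "TNF_3"]     # TNF-treated samples (assumed)
ctrl_cols = ["CTRL_1", "CTRL_2", "CTRL_3"]  # untreated controls (assumed)

# log2 fold change: positive values mean higher expression after TNF treatment
log2fc = np.log2(expr[tnf_cols].mean(axis=1)) - np.log2(expr[ctrl_cols].mean(axis=1))

# Welch's t-test per probe as a simple significance measure
t, p = stats.ttest_ind(expr[tnf_cols], expr[ctrl_cols], axis=1, equal_var=False)

results = pd.DataFrame({"log2fc": log2fc, "p_value": p})
print(results.sort_values("p_value").head(10))  # most strongly affected probes
```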
Human Activity Prediction with Recursive Partitioning and Regression Trees and Stochastic Gradient Boosting.
Our outcome variable is classe, a factor variable with 5 levels. For this data set, "participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in 5 different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes." [1] Predictions will be evaluated by maximizing accuracy and minimizing the out-of-sample error. All variables remaining after cleaning will be used for prediction. Two models will be tested: recursive partitioning and regression trees, and stochastic gradient boosting. The model with the highest accuracy will be chosen as our final model.
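As an illustration of this comparison (sketched in Python with scikit-learn rather than the R packages the model names suggest; the file name and cleaning steps are assumed from the Weight Lifting Exercise dataset):

```python
# Rough sketch of the two-model accuracy comparison on the classe outcome.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("pml-training.csv")  # assumed file name

# Crude cleaning: keep fully populated numeric predictors only
X = data.drop(columns=["classe"]).select_dtypes("number").dropna(axis=1)
y = data["classe"]

# Hold out a validation set to estimate the out-of-sample error
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "decision tree":     DecisionTreeClassifier(random_state=42),
    "gradient boosting": GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_val, model.predict(X_val))
    print(f"{name}: accuracy={acc:.3f}, est. out-of-sample error={1 - acc:.3f}")
```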
Fuel consumption regression analysis based on 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
This report is based on an analysis of the 1974 Motor Trend US magazine data, which comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). In this report, regression models and exploratory data analyses are used to explore the relationship between transmission type (am) and fuel consumption (MPG). Based on the t-test result, we conclude that there is a substantial difference in fuel consumption between cars with automatic (am=0) and manual (am=1) transmissions: on average, cars with manual transmissions get 3 to 11 more miles per gallon than cars with automatic transmissions. The Akaike information criterion (AIC) is used for model selection, with the adjusted R-squared value used to reach the parsimonious model. Based on the selected model output, we conclude that, keeping number of cylinders, horsepower, and weight constant, cars with manual transmissions get 1.8 more miles per gallon (MPG) than cars with automatic transmissions.
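The headline comparison can be reproduced along these lines; this is a minimal sketch in Python/statsmodels, although the original analysis was presumably done in R:

```python
# Sketch of the t-test and the adjusted regression on the mtcars data.
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

# mtcars via the Rdatasets mirror (requires internet access)
mtcars = sm.datasets.get_rdataset("mtcars").data

# Welch's t-test of MPG by transmission type (am: 0 = automatic, 1 = manual)
auto, manual = mtcars[mtcars.am == 0].mpg, mtcars[mtcars.am == 1].mpg
print(stats.ttest_ind(manual, auto, equal_var=False))

# Adjusted model: the coefficient on am is the MPG gain of a manual
# transmission, holding cylinders, horsepower, and weight constant
fit = smf.ols("mpg ~ am + cyl + hp + wt", data=mtcars).fit()
print(fit.params["am"], fit.rsquared_adj)  # am coefficient is roughly 1.8
```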
Economic Study (2015)
An assessment of the general indicators of the Russian Federation's economy, and a study of the impact of federal small and medium-sized business (SME) support programs on enterprises' cash turnover in 2010–2013.
The Impact of Severe Weather Events on Public Health and the Economy in the U.S.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
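A sketch of the central aggregation, assuming the database has been downloaded as a bz2-compressed CSV (the file name is hypothetical; EVTYPE, FATALITIES, and INJURIES are standard columns in this dataset):

```python
# Sketch: total human cost per event type, for the public-health question.
import pandas as pd

storms = pd.read_csv("StormData.csv.bz2",  # hypothetical local file name
                     usecols=["EVTYPE", "FATALITIES", "INJURIES"])

harm = (storms.groupby("EVTYPE")[["FATALITIES", "INJURIES"]]
              .sum()
              .sort_values("FATALITIES", ascending=False))
print(harm.head(10))  # the most harmful event types
```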
Activity monitoring device data analysis.
It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the "quantified self" movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting them. This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The data consist of two months of observations from an anonymous individual, collected during October and November 2012, and include the number of steps taken in 5-minute intervals each day.
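A minimal sketch of the first steps of such an analysis, assuming the usual layout of this dataset with steps, date, and interval columns (the file and column names are assumed):

```python
# Sketch: daily step totals and the average daily activity pattern.
import pandas as pd

activity = pd.read_csv("activity.csv", parse_dates=["date"])  # assumed file

# Total steps per day (missing intervals ignored)
daily = activity.groupby("date")["steps"].sum()
print(daily.mean(), daily.median())

# Average activity pattern across the 5-minute intervals of the day
pattern = activity.groupby("interval")["steps"].mean()
print(pattern.idxmax(), pattern.max())  # interval with the most steps on average
```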