Matthew Gregory

Recently Published

Classifier performance quantification
When we build statistical models from data using Machine Learning tools how do we go about assessing the models performance? We introduce some commonly used packages and functions therein which help us address this problem. We look at a decision tree classifier we built in a previous post to predict the outcome of a student's end of year exam result from Portuguese Maths students and assess its accuracy and other characteristics.
Merging Government data
We simulate some student data from the national pupil database and merge it with demographic and social deprivation data matched by Lower Super Output Area (LSOA).
We elucidate base tools for the problem of looking up translations using a lookup table in R. Thus we can recode or translate or mutate variables as required.
Regression trees for wine quality prediction
To develop a red wine rating model, we use data donated to the UCI Machine Learning Data Repository by Cortez et al. We are interested in predicting the quality of a red wine taste based on its quantified chemical characteristics. We fit a regression tree model to training data then evaluate with a test data set.
Forest fire prediction with support vector machines
This post is based on a paper by Cortez & Morais (2007). Forest fires are a major environmental issue, creating economical and ecological damage while endangering human lives. Fast detection is a key element for controlling such phenomenon. To achieve this, one alternative is to use automatic tools based on local sensors, such as microclimate and weather data provided by meteorological stations. We demonstrate the workflow in R for implementing a svm model using available functions.
Clustering analysis with K-means
To assess the patterning of the sex ratios and whether my intuition that there were three distinct groups or clusters in the data following RNAi treatment of female Tribolium castaneum beetles, K-means clustering was used to objectively (it is reproducible given the random seed) partition the data into K distinct, non-overlapping clusters. Essentially we wished to identify sub-groups within the data, with the three groups representing the non-fertile crosses, crosses yielding progeny with a 1:1 sex ration and crosses yielding all males. Code pertaining to k-means in R and how best to display the results are described.
Logistic Regression
Fit a logistic regression model in order to predict student Maths exam pass at the end of the year based on pertinent variables introduced in previous posts. Data derived from Cortez et al., publicly available on the UCI machine learning site (
Observational data tend to be technically and logistically easier to collect than randomised experimental data, hence tends to be more readily available. However, it comes with issues such as the treatment assignment mechanism which is likely to be non-random and unknown. This results in the distribution of the covariates (X) in the treatment groups being dissimilar. Thus we must try to match the treatment and control groups across variables to mitigate bias. We demonstrate that here with the cem package in R.
synthpop package in R with student attainment data
Synthesising Sensitive Student dataset to test machine learning tools for prediction using the synthpop package in R with Cortez (2008) secondary school student mathematics score data.
Neural Networks and student attainment prediction
An introduction to using neural networks to predict a continuous response variable output; the final year exam score in Maths, at a couple of Secondary schools in Portugal.
Decision trees for prediction in education
Using the machine learning decision tree algorithm to predict exam performance of students in secondary education. The decision tree is human interpretable.
Elucidates betting strategy for Baccarat and the long term loss one can expect under standard Casino rules.
Diamonds - linear regression
We are interested in predicting the price of a diamond based on 9 predictors. We want to know if we are getting ripped off for the price we pay for a diamond compared to historical prices.
Dota 2 - the mechanics of effective hp
How to calculate how many effective hit points one has to resist physical damage. Includes elucidation of several core functions that make up the mathematics of the Dota 2 universe.
Dota 2 - item choice to optimise ehp
Bang for your buck! How much effective hit points is an item purchase giving you?