Recently Published
Making causal inference on sequential data
Applying the Causal Impact algorithm on a simulated clinical trial
Causal inference: A simulated study
The Bayesian structural time-series model is a useful method for causal inference on time series. It can be applied to clinical trials where the outcome is monitored over a long period, or where the outcome is a physiological signal recorded as a time series (Holter ECG, blood pressure, peak flow or glycaemia). The method can also be considered an alternative to survival analysis, since it allows causal inference about cumulative events (COPD exacerbations or symptoms).
MLM case X16: Ensembling
The present data experiment aims to explore the utility of 3 ensembling techniques (weighted averaging, majority voting and stacking) applied to a multilabel classification problem. The main objective was to classify 10 morphologic patterns in cardiotocography.
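The tutorial itself is in R, but the first two combination rules are small enough to sketch on their own. Below is a minimal Python illustration of majority voting and weighted averaging (function names and toy data are mine, not from the tutorial):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine label predictions from several classifiers by majority vote.
    `predictions` is a list of per-model prediction lists (one label per case)."""
    combined = []
    for case_preds in zip(*predictions):
        combined.append(Counter(case_preds).most_common(1)[0][0])
    return combined

def weighted_average(probabilities, weights):
    """Combine predicted probabilities from several models as a weighted mean."""
    total = sum(weights)
    return [sum(w * p for w, p in zip(weights, case)) / total
            for case in zip(*probabilities)]

# Three models vote on four cases
votes = [["A", "B", "A", "B"],
         ["A", "B", "B", "B"],
         ["B", "B", "A", "A"]]
print(majority_vote(votes))  # ['A', 'B', 'A', 'B']
```

Stacking is not shown here because it needs a second-level learner trained on the base models' outputs; the two rules above are purely arithmetic.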
MLM case X15: Model selection P.2 Classification
In this study, we introduced Random Forest as a powerful method for both feature selection and overfitting control. Because of the randomness in bootstrap aggregating, a single model may not be trustworthy; however, by replicating the model training many times, we can reach a high level of certainty.
MLM case X14: Model selection P.1 Regression
In the present tutorial, we will explore 5 different methods for model selection. These include:
Stepwise linear regression (the conventional method)
Lasso regularisation
Elastic net regularisation
Bayesian model averaging (based on an MCMC sampler)
Wrapped learner with random feature selection
MLM case X13: Feature selection
In this case study X13, we will explore 12 simple filter methods of feature selection that are supported by the mlr package. We also introduce the Kuncheva stability index (KSI) for determining the best feature subset(s) for a binary classification problem.
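The KSI itself is a one-line formula: for two equal-size feature subsets it corrects the raw overlap for the intersection expected by chance. A minimal Python version (the tutorial uses R; function name and example subsets are mine):

```python
def kuncheva_index(subset_a, subset_b, n_features):
    """Kuncheva stability index between two equal-size feature subsets
    drawn from n_features candidates. Ranges up to 1; the chance-expected
    overlap of two random size-k subsets is k^2 / n."""
    a, b = set(subset_a), set(subset_b)
    k = len(a)
    assert len(b) == k, "subsets must have equal size"
    r = len(a & b)  # observed overlap
    expected = k * k / n_features
    return (r - expected) / (k - expected)

print(kuncheva_index({1, 2, 3}, {1, 2, 3}, 10))  # 1.0 (identical subsets)
```

A value near 0 means the selected subsets agree no better than chance, which is a warning sign for the feature-selection procedure.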
ESMS 5-3: Introduction to Mixed Model
This time we will take a long journey to discover more interesting aspects of Linear regression analysis, from the analysis of variance to mixed modelling.
Tutorial 6: 2D kernel density plot
In this short tutorial, we will produce a 2-D kernel density plot using the ggplot2 and spatstat packages.
MLM case X12: GBM based quantile regression
The present study aims to explore the regression GBM (gradient boosting machine) algorithm in h2o. Moreover, we will perform a quantile regression using 3 GBM models, each one predicting a different percentile of the response variable.
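What makes a GBM predict a percentile rather than the mean is the loss it minimises: the quantile (pinball) loss. A minimal Python sketch of that loss, illustrating why a 0.9-quantile model is penalised more for under-prediction (the example numbers are mine):

```python
def pinball_loss(y_true, y_pred, tau):
    """Average quantile (pinball) loss at quantile tau: under-predictions
    are weighted by tau, over-predictions by (1 - tau)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(y_true)

# At tau = 0.9, predicting too low costs far more than predicting too high:
print(pinball_loss([10], [8], 0.9))   # 1.8  (under-prediction by 2)
print(pinball_loss([10], [12], 0.9))  # 0.2  (over-prediction by 2)
```

Fitting the same model class three times with tau at, say, 0.1, 0.5 and 0.9 yields a lower band, a median prediction and an upper band for the response.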
MLM case X11: Generic Bootstrap Aggregating
Bootstrap aggregating (bagging) is a machine learning ensemble meta-algorithm that aims to improve the performance and stability of a weak learner by averaging many weak learners trained on randomly generated training sets. In this case study, we will explore the generic bagging procedure on a binary classification task with a basic Naive Bayes learner.
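The generic procedure is simple enough to sketch outside any ML framework: resample the training set with replacement, fit a weak learner on each resample, and predict by majority vote. A minimal Python illustration with a trivial decision stump standing in for Naive Bayes (all names and the toy data are mine):

```python
import random

def train_stump(xs, ys):
    """Weak learner: a one-feature decision stump that splits at the
    mean of the training sample and assigns each side its majority class."""
    threshold = sum(xs) / len(xs)
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    left_label = max(set(left), key=left.count) if left else 0
    right_label = max(set(right), key=right.count) if right else 1
    return lambda x: left_label if x <= threshold else right_label

def bagging(xs, ys, n_models=25, seed=42):
    """Generic bagging: train each weak learner on a bootstrap resample,
    then predict by majority vote over the ensemble."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    def predict(x):
        votes = [m(x) for m in models]
        return max(set(votes), key=votes.count)
    return predict

xs = [1.0, 1.2, 0.8, 4.0, 4.2, 3.8]
ys = [0, 0, 0, 1, 1, 1]
model = bagging(xs, ys)
```

Each individual stump is unstable (its threshold moves with the resample), but the vote over 25 of them is far steadier, which is exactly the point of bagging.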
MLM case X10: H2O vs GAMLSS
The h2o GLM allows performing an orthogonal polynomial regression on Gamma distributions and combining such a polynomial function with Lasso regularisation to optimise the model's capacity and avoid overfitting. We would like to compare the h2o GLM algorithm with the GAMLSS polynomial functions.
MLM case X9: Exploring GBM learner in h2o
The present case study aims to explore the following machine learning techniques:
Training a GBM learner in h2o
Tuning the cut-off for multiclass classification
Interpretive study on GBM model
Combining the strengths of 3 packages: h2o, mlr and caret
MLM case X8: Dealing with Imbalanced data
The main objective of this case study X8 is to evaluate the effect of different fold_assignment and balance_classes settings on the performance of the Random Forest algorithm applied to a binary classification problem.
ML Tutorial 5: Multiclass imputation using Random Forest
In this 5th tutorial, we attempt to perform a (greedy) imputation for missing values of multiclass-categorical variables using the Random Forest algorithm. Through this data experiment, you will learn how to:
1) Detect missing values in your dataset
2) Develop your own imputation function that applies the h2o Random Forest algorithm in a loop.
MLM case X7: Random Forest vs Deep neural net
In the present tutorial, we will set up a benchmark study to compare the performance of the two most powerful algorithms in h2o in their native forms: Random Forest vs Deep neural net.
MLM case X6: Tuning a Deep neural net
The main objective of this tutorial X6 is to introduce random grid-based tuning for the Deep learning algorithm in another binary classification task.
MLM case X5: Deep learning Regression
Through this case X5, we will introduce another application of the Deep learning method: developing a predictive model for a numerical response variable, i.e. a regression task.
MLM case X4: Deep learning (1)
Deep learning, also known as multilayer neural networks, is considered one of the most powerful machine learning algorithms. Deep learning can solve almost any classification/regression problem if we know how to tune it adequately and give it enough time and/or data for learning. Its drawbacks include a slow training process, black-box models and difficulties with categorical data.
Tutorial 4: A hidden function in MLR
If you think you know everything about the mlr package, you might be wrong. Though we have discovered most of the sophisticated features of this package, such as model wrapping, imputation and grid-based tuning, there are still some hidden functions worth learning. Among them, the plotLearnerPrediction() function might be the most interesting one. It is barely documented, even on mlr's official website.
Tutorial 3: Using mlr for validating a caret model
Tutorial in Vietnamese for Biomedical Datascience group
ESMS 5-2 Anatomy of a simple linear model
Today (27/03/2017), we will celebrate the 160th birthday of Karl Pearson by demonstrating the link between his famous product-moment correlation coefficient (r) and the linear regression model.
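The link can be stated in one line: the least-squares slope equals r rescaled by the ratio of the two standard deviations. A minimal Python check of that identity (the tutorial uses R; the toy data are mine):

```python
import statistics as st

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = st.mean(x), st.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(x, y)
# The least-squares slope of y on x is r scaled by the SD ratio:
slope = r * st.stdev(y) / st.stdev(x)
```

So a regression of standardised y on standardised x has slope exactly r, which is the connection the tutorial demonstrates.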
ESMS Case 7 Robust methods for dealing with outliers
This tutorial was inspired by a status of Prof. Nguyen Van Tuan on Facebook (24/03/2017). In this status, Prof. Tuan introduced Yuen's test for dealing with outliers in data. This method belongs to a new area of statistical analysis called "robust methods". As these methods have never been introduced in Vietnam, we will include them in our ESMS project.
ESMS 6: Alternative tools for t test
This tutorial was inspired by a status of Prof. Nguyen Van Tuan on Facebook (22/03/2017). He encouraged using the bootstrap method rather than non-parametric methods for dealing with messy data.
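The core idea of the bootstrap alternative to the t-test fits in a few lines: resample each group with replacement, recompute the mean difference, and read off a percentile confidence interval. A minimal Python sketch (the tutorial uses R; the sample values are mine):

```python
import random
import statistics as st

def bootstrap_mean_diff_ci(a, b, n_boot=5000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for the difference in means
    between two independent samples (an alternative to the t-test)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]  # resample each group with replacement
        rb = [rng.choice(b) for _ in b]
        diffs.append(st.mean(ra) - st.mean(rb))
    diffs.sort()
    lo = diffs[int(n_boot * alpha / 2)]
    hi = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

a = [5.1, 4.8, 5.6, 5.0, 5.3, 4.9]
b = [4.2, 4.0, 4.5, 3.9, 4.3, 4.1]
lo, hi = bootstrap_mean_diff_ci(a, b)
# If the interval excludes 0, the group difference is unlikely to be chance.
```

Unlike the t-test, this makes no normality assumption about the sampling distribution, which is why it is attractive for messy data.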
MLM case X3: Multilabel classification problem
Multilabel classification is a special supervised learning method. It can be useful for demanding problems, such as predicting the comorbidities or complications of a disease as a whole pathological entity.
Sample size for 2 means, 2 sided equality testing
This tutorial provides the code for estimating the sample sizes needed for comparing the means of 2 independent groups A and B, based on a 2-sided equality test.
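For reference, the usual normal-approximation formula is n per group = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2, assuming a common SD sigma and a true mean difference delta. A minimal Python version (the tutorial's own code is in R):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Sample size per group for comparing the means of two independent
    groups with a two-sided equality test, assuming a common SD `sigma`
    and a true mean difference `delta`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # 0.84 for 80% power
    return ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)

print(n_per_group(sigma=10, delta=5))  # 63 per group at 80% power
```

Note the quadratic behaviour: halving the detectable difference delta quadruples the required sample size.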
ESMS 5-1 Linear model - Part 1
The present tutorial is the starting point of our long journey to discover linear models. We begin with a study question, then introduce the R syntax of the lm() function. Later, we will discover how the least squares algorithm works.
ESMS-4 Toolbox for Agreement study
In this tutorial, the author would like to share every method she knows about the study of agreement. These include:
Linear regression and Correlation analysis
The Bland-Altman method (1983)
The robust method by Carstensen (2008), based on a linear mixed model
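The Bland-Altman method reduces to computing the bias (mean difference between methods) and the 95% limits of agreement (bias plus or minus 1.96 SD of the differences). A minimal Python sketch of that computation (the tutorial uses R; the paired measurements are mine):

```python
import statistics as st

def bland_altman_limits(m1, m2):
    """Bland-Altman analysis: bias (mean difference between two measurement
    methods) and 95% limits of agreement (bias +/- 1.96 SD of differences)."""
    diffs = [a - b for a, b in zip(m1, m2)]
    bias = st.mean(diffs)
    sd = st.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

m1 = [10.2, 11.5, 9.8, 12.1, 10.9]  # method 1 readings
m2 = [10.0, 11.9, 9.5, 12.4, 10.6]  # method 2 readings on the same subjects
bias, lower, upper = bland_altman_limits(m1, m2)
```

The point of the method, in contrast to correlation, is that two methods can correlate almost perfectly yet still disagree by a clinically unacceptable margin; the limits of agreement make that margin explicit.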
ESMS -1 Pearson's product moment correlation coefficient
The Pearson’s product-moment correlation coefficient (r) measures strength and direction of the linear association between two continuous variables. This tutorial will guide you through a standard procedure for evaluating the linear relationship. We also introduce a bootstraping method, as well as how to interpret and report the results.
ESMS -2 Independent sample t-test
This tutorial will show you how to carry out an independent sample t-test in R. We also introduce a bootstrapping procedure for the t-test, as well as how to interpret and report the results.
ESMS -1 One-way ANOVA
This tutorial will show you how to carry out a one-way ANOVA in R. We also suggest some popular techniques for pairwise comparison, including the Tukey post-hoc test, Bonferroni adjustment and planned contrast analysis. Last but not least, we introduce some bootstrapping procedures for ANOVA and post-hoc tests, as well as how to interpret and report the results.
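Underneath aov() in R, the one-way ANOVA F statistic is just the between-group mean square divided by the within-group mean square. A minimal Python sketch of that decomposition (function name and toy groups are mine):

```python
import statistics as st

def one_way_anova_F(groups):
    """F statistic for a one-way ANOVA: between-group variance
    over within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = st.mean(all_values)
    k = len(groups)           # number of groups
    n = len(all_values)       # total observations
    ss_between = sum(len(g) * (st.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((v - st.mean(g)) ** 2 for v in g) for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

groups = [[4, 5, 6], [7, 8, 9], [1, 2, 3]]
F = one_way_anova_F(groups)  # large F: group means differ far more than noise
```

The post-hoc tests in the tutorial then ask *which* pairs of groups drive a significant F.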
MLM case X2: PCA- A simple but forgotten tool for data mining
Principal component analysis (PCA) is a simple but powerful tool for exploring large datasets with many quantitative variables. The goal of PCA is to represent the data in a space of reduced dimensionality from the best viewpoint, enabling an optimal visualisation of the hidden information in multivariate data with only a few synthetic components. The PCA results can be interpreted from either the individual or the variable viewpoint.
Case study X1: Elastic-net regularisation models
The glmnet package, developed by Jerome Friedman, Trevor Hastie and Rob Tibshirani in 2013, provides a powerful and flexible hybrid solution for fitting different types of generalized linear models using the LASSO, Ridge or Elastic net regularisation methods. This case study aims to demonstrate the glmnet package in 3 different ways: on its own, and within the mlr and caret frameworks.
Imputation using MLR and CARET
Tutorial for the Data Science and Quantitative Analysis group, in Vietnamese.
Looping mixed linear model
Tutorial for the Data Science and Quantitative Analysis group, in Vietnamese.
Anatomy of superheat package
A new package for creating heatmaps was recently published by Rebecca Barter. As its name suggests, the superheat package could be the most comprehensive tool for developing and customising heatmaps from your data. The package extends the traditional heatmap with multiple visual elements, such as dendrograms, scatter plots, smoothing curves, correlation bars, text labels and more.
Tutorial 2
Heatmap tutorial for Data Science and Quantitative Analysis group in Vietnamese
Tutorial 1
A tutorial for Data Science and Quantitative Analysis group in Vietnamese.
MLM Case study 17: Automated vs. Manual, User-controlled Tuning for Decision Tree
Overfitting is a critical problem in CART modelling. There are two alternative solutions to overcome this problem: (1) setting limits on the tree size and/or (2) pruning the tree. In this experiment, we will evaluate the performance of 3 different approaches: automated training by caret using either the rpart or rpart2 method, versus user-controlled modelling in mlr. Our hypothesis is that manual tuning would provide better performance.
MLM Case study 5: Machine learning applied to Survival Analysis
We have successfully extended the traditional proportional hazards regression (Cox-PH) analysis with machine learning techniques. Though the mlr package supports 15 different algorithms for survival analysis, a classical algorithm such as the Cox-PH model is still good for interpretive studies. The boosted Cox model is an alternative solution and works as well as the simple Cox-PH model.
MLM Case study 4: Bootstrapping a Logistic model
This experiment has successfully extended the basic logistic regression analysis with bootstrap resampling. Resampling can be applied at 2 levels of an ML study: feature exploration and evaluation of the model's performance, for both interpretive and predictive purposes.
MLM Case study 3: Wrapping up a linear model
In this document, we introduce 3 machine learning techniques that could eventually improve your GLM-based interpretive studies:
(1) Manual feature selection, (2) a brute-force algorithm for automated and active feature selection, letting the model select the most relevant predictors for itself and return the most accurate model, and (3) using the bootstrap for validating a linear model and extracting its confidence intervals.
MLM Case study 52: Go beyond the ROC curves...
The ROC curve, as we know, is just a two-dimensional line plot that shows the relationship between sensitivity and 1-specificity of a given classifier. So ANY classification rule (not only a specific test on its own, but any rule involving stand-alone or combined clinical data, including symptoms, laboratory markers and medical imaging, with a simple or sophisticated algorithm) could be visually evaluated by a ROC curve.
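That generality is easy to see in code: any rule that outputs a score can be swept over its thresholds to trace the curve. A minimal Python sketch (ties between scores are handled naively here; names and toy data are mine):

```python
def roc_points(scores, labels):
    """ROC curve of any scoring rule: sweep the decision threshold down
    through the observed scores and record (1 - specificity, sensitivity)
    at each step. `labels` are 1 (positive) / 0 (negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    # sort cases by decreasing score; lowering the threshold admits one case at a time
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.7, 0.4, 0.3]  # output of any rule, simple or complex
labels = [1, 1, 0, 1, 0]
print(roc_points(scores, labels))
```

Nothing in the function cares whether the score came from a single lab marker or a deep model combining symptoms and imaging, which is exactly the point of the post.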
MLM Case 38: Optimising Clinical Decision Making
Unlike previous studies on machine learning, this time our goal is no longer to predict what will happen to the patient, but to improve our decision making with the help of computerised algorithms. So we will teach the machine to build an optimised treatment flowchart based on historical clinical data.
MLM Case 7: Machine learning applied to Interpretive analysis
This experiment suggests that even for an interpretive study, machine learning techniques such as resampling and feature selection can be helpful by extending our regression analysis. These techniques can provide valuable information such as the model's averaged accuracy and confidence intervals for feature effects. We strongly recommend this approach in routine data analysis.
MLM Project Case Study N°57: SVM vs XGBT
The 57th case study (Chapter 13 - Machine Learning in Medicine-Cookbook Two by T. J. Cleophas and A. H. Zwinderman, Springer, 2014) allows us to set up a small duel between 2 machine learning algorithms: Extreme Gradient Boosting Tree versus Support Vector Machine.
MLM Case 46: Cluster analysis for Clinical Data Mining
Our objective is to demonstrate the ability of cluster analysis, an unsupervised machine learning technique, to explore and identify groups (i.e. clusters) within the biomarker data set.
MLM Case 31: Machine learning and linear mixed models
This parallel-group study aims to compare the long-term effectiveness of a new cholesterol reducing compound with another drug.
The conventional approach for this study design would be MANOVA for repeated measures, or a generalized linear mixed model (GLMM). However, in this study we will introduce an alternative approach that combines the linear mixed model (lme4 package) with machine learning techniques.