This document looks at managing categorical variables in a dataset. In base R, these are seen as factors. Not so in the tidyverse, which creates tibbles instead of data frames. The forcats library allows us to make use of the advantages of factors in the tidyverse.
In this tutorial we take a look at the ubiquitous box-and-whisker plot. It is ideal for visualizing the spread of a numerical variable by way of quartile values. It can also display statistical outliers, which can be of great importance.
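As a rough illustration of what a box-and-whisker plot summarizes, here is a minimal Python sketch (not the tutorial's own R code). It assumes the common 1.5 × IQR convention for flagging outliers, which the tutorial may or may not use, and the function name `box_plot_summary` is hypothetical.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the quartiles and the points flagged as outliers.

    Assumption: outliers are values beyond 1.5 * IQR from the first or
    third quartile. Quartile conventions differ between tools, so the
    exact numbers can differ slightly from what a plotting library shows.
    """
    q1, median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(summary["outliers"])
```

The value 100 lies far above the upper fence, so it is the only point flagged in this example.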
In this tutorial we take a look at all the color options available in Plotly for R. You can indeed color your plots and charts to your heart's content.
In this tutorial we add some control over the color of our histograms. It is an extension of a previous tutorial on histograms.
In this tutorial we explore alternative tests for categorical variables.
In this tutorial we take a look at the most common tests to analyze categorical variables.
In this tutorial we take a look at the exact test of goodness of fit for binomial categorical variables. This hypothesis test compares the actual counts found during data capture against an expected count.
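The idea can be sketched in a few lines of Python (an illustration only, not the tutorial's R code). The sketch assumes the two-sided p-value is the total probability of all outcomes that are no more likely than the observed count under the expected proportion. The function names are hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binomial_p_value(successes, n, expected_p=0.5):
    """Two-sided exact p-value for an observed count against an expected proportion.

    Assumption: sum the probabilities of every outcome that is at most as
    likely as the observed one (a small relative tolerance guards against
    floating-point ties).
    """
    observed = binomial_pmf(successes, n, expected_p)
    threshold = observed * (1 + 1e-7)
    total = sum(
        prob
        for k in range(n + 1)
        if (prob := binomial_pmf(k, n, expected_p)) <= threshold
    )
    return min(1.0, total)


# 2 successes observed in 10 trials, with 50% expected
print(exact_binomial_p_value(2, 10, 0.5))
```

With 2 successes in 10 trials, the outcomes 0, 1, 2, 8, 9, and 10 are all at most as likely as the observed count, giving a p-value of 112/1024.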
This tutorial explains the nature and use of Hotelling's T-squared test to compare the means of several numerical variables between two groups. Along the way we also consider some of the assumptions for the use of this test and look at the packages and functions required for the test.
Accompanying RPubs file for my YouTube tutorial introducing R for biostatistics.
This document contains a first look at an example of a convolutional neural network. It uses one of the built-in Keras image datasets and shows the use of convolutional operation layers, maximum pooling layers, and a flatten layer.
Convolutional neural networks (CNNs) are ideal for image classification. This post provides an explanation of the concepts required to construct a CNN. These include the convolution operation, pooling, stride length, padding, and more.
Cross entropy describes a method for determining the difference between an actual and a predicted categorical data point value.
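For a single data point, this can be sketched in Python as follows (an illustration, not code from the post). It assumes the actual value is one-hot encoded and the prediction is a list of class probabilities. The function name and the `eps` clipping value are choices made for this example.

```python
from math import log


def cross_entropy(actual, predicted, eps=1e-12):
    """Categorical cross entropy for one data point.

    actual:    one-hot list, e.g. [0, 1, 0]
    predicted: predicted class probabilities, e.g. [0.1, 0.8, 0.1]
    eps:       lower bound on probabilities to avoid log(0) (an assumption
               of this sketch, not something stated in the post)
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return -sum(a * log(max(p, eps)) for a, p in zip(actual, predicted))


good = cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])  # confident and correct
bad = cross_entropy([0, 1, 0], [0.6, 0.2, 0.2])   # mostly wrong
print(good, bad)
```

The loss is smaller when more probability is assigned to the true class, which is what makes it usable as a training signal.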
Regression problems have a numerical target variable and require specific loss functions and output layer design.
Researchers have added many mathematical changes to the original concept of a layered perceptron model. This chapter discusses some of these improvements and aims to provide an overview of the topic. This will be useful when implementing techniques such as RMSprop and batch normalization in code.
This chapter implements regularization and dropout to help reduce overfitting (high variance). The IMDB dataset is used as a prime example of high variance, especially when large networks are used.
Dropout is a regularization technique that can be implemented when a network overfits (i.e. a high variance exists). It randomly removes the values of some nodes in a network, leading to a simpler model and hence a reduced hypothesis space.
This document describes the use of regularization in deep neural networks. Regularization is introduced as a form of complexity measurement and a constraint on the hypothesis space of a network. Regularization constrains the hypothesis space by creating simpler networks that generalize better and reduce high variance.
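The mechanism can be sketched in plain Python (an illustration only; the post itself works in Keras for R). This sketch assumes "inverted" dropout, where the surviving values are scaled up by 1 / (1 - rate) so the expected sum stays the same. The function name `dropout` and its parameters are hypothetical.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` during training.

    Assumption: inverted dropout, so kept values are divided by the keep
    probability. `rng` is a random.Random instance, passed in so results
    can be reproduced with a seed.
    """
    if not 0 <= rate < 1:
        raise ValueError("rate must be in [0, 1)")
    keep = 1 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(42)
layer_output = [0.5, 1.2, 0.3, 0.9, 0.7, 1.1]
print(dropout(layer_output, rate=0.5, rng=rng))
```

At prediction time no nodes are dropped. With the scaling above, no further adjustment is needed at that stage.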
This post discusses the issues around creating a proper training and test set. It also discusses the subjects of bias and variance in a model, showing how these are recognized and what steps to take to correct for them.
This chapter shows the actual code to construct a neural network using Keras in R. It introduces the concept of splitting the data into a training and test set. This allows for both training the neural network and testing its accuracy.
A very short introduction to R.
This post describes the basics of a neural network.
The last puzzle piece required to understand the fundamentals of deep neural networks is introduced in this post. It considers expressing the predicted variable as a probability so that it can be used in classification problems.
This chapter describes a multivariable linear regression model as a neural network with a single hidden layer. This is done so as to create a familiarity with the terms and processes of deep learning.
An example of linear regression serves to illustrate the basic concepts of deep learning through the explanation of terms such as cost functions, backpropagation, global minima, and gradient descent.
This post explains the concepts of a model and an error through the use of linear regression. These concepts will play an important role in creating deep neural networks.
This post introduces Keras in R, a deep learning framework using Google's TensorFlow backend. The example is a multi-class classification problem from the University of California at Irvine database for machine learning. The dataset is converted to a .csv file and is available in my GitHub repository.
This post describes the ROSE package used to correct for class imbalance.
This R Markdown post introduces the WDI library to interact with the World Bank Open Data API.
Describing the expression and visualization of bivariate categorical and numerical variables.
A short description of prediction intervals, with an example.
Scatter plots and bubble charts using Plotly for R.
Describing and visualizing univariate data in R.
A post on the use of Plotly to create histograms in R.
Create bar charts using Plotly for R.
Get going with plotting using ggplot2.
This post describes the principles of multivariate logistic regression using R.
Biserial and point-biserial correlations allow for the calculation of a correlation coefficient if one of the variables is dichotomous (binary) in nature.
In this post with R code snippets, I discuss some of the assumptions that must be met for the use of parametric tests.
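As an illustration of the point-biserial case, here is a minimal Python sketch (not the post's own R code). It assumes the dichotomous variable is coded 0/1 and uses the textbook formula based on the difference in group means. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(groups, values):
    """Point-biserial correlation between a 0/1 variable and a numerical one.

    Formula used: (mean of group 1 - mean of group 0) / population SD of
    all values * sqrt(p * q), where p and q are the group proportions.
    This equals the Pearson correlation computed on the 0/1 coding.
    """
    group1 = [v for g, v in zip(groups, values) if g == 1]
    group0 = [v for g, v in zip(groups, values) if g == 0]
    n = len(values)
    p = len(group1) / n
    q = len(group0) / n
    return (mean(group1) - mean(group0)) / pstdev(values) * sqrt(p * q)


r = point_biserial([0, 0, 1, 1], [1.0, 2.0, 3.0, 4.0])
print(round(r, 4))
```

The biserial correlation, which treats the binary variable as a dichotomized continuous one, needs an additional correction and is not covered by this sketch.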