Recently Published
Tests
https://bookdown.org/yihui/rmarkdown/basics.html
First Quarto Document
Workshop: Producing Automated Reports with R Markdown/Quarto
Amandine BLIN <amandine.blin@mnhn.fr>
Report with R Markdown
Amandine Blin
UAR 2700 2AD, Service Analyse de Données
Demo Iris
Amandine Blin
UAR 2700 2AD
Service Analyse de Données
Pôle Analyse de Données
Demo 1
Amandine Blin
UAR 2700 2AD
Service Analyse de Données
Pôle Analyse de Données
Methods and Skills for Sustainability
A Quick Demo on How to Plot SDGs on Maps, with R
SCGIS6
Geoprocessing 1 - Geometric Measurements
SCGIS5
Getting data with APIs
SCGIS4
Wrangling 2D Data with dplyr
SCGIS3
Import raster GIS data into R
SCGIS2
Use tmap to make a map
SCGIS
How to import various types of vector GIS data into R.
Multiple Testing
[mySWIRL notes: Statistical_Inference](https://github.com/DataScienceSpecialization/courses/)
Multiple testing arises when you use data to test several hypotheses at once: for example, if we set $\alpha = 0.05$, then on average one in every 20 true-null hypotheses tested will come out as an error, so even a significantly low p-value could be a false positive. In these cases error measures come into play, especially for big-data analysis. The questions asked are:
_"Which of the variables matter among the thousands should be measured?"_
_"How do we relate unrelated information?"_
P for POWER
library(knitr)    # for creating a PDF document
library(ggplot2)  # for making plots
library(reshape2) # for reshaping data frames
Power is the probability of rejecting the null hypothesis when it is false.
- Power is used to determine whether your sample size was big enough to yield a meaningful, rather than random, result.
- It helps detect whether your alternative hypothesis is true, lowering the risk of a Type II error.
As $\beta$ is the probability of a _Type II error, accepting a false null hypothesis_, power is its complement:
\[
P = 1 - \beta
\]
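In R, power can be computed with the base stats function power.t.test(); the numbers below are illustrative, not taken from the post:

```r
# Power of a one-sided, one-sample t-test for illustrative inputs
power.t.test(n = 16, delta = 2, sd = 4, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")$power

# Or solve for the sample size needed to reach 80% power
power.t.test(power = 0.8, delta = 2, sd = 4, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")$n
```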
Basic Inferential Data Analysis Exercise
The project consists of two parts: (i) [a simulation exercise](https://rpubs.com/lindangulopez/709073) and (ii) a basic inferential data analysis; the latter is presented here. The hypothesis tested was that `the dosage and supplement do not affect tooth length`; the alternative is that they do. The population was assumed to be near-normally distributed. A p-value of 0.03032 was found when comparing orange juice to vitamin C, so we can reject the null hypothesis. Further investigation showed that orange juice is linked to greater tooth growth length at dose = 0.5 mg and dose = 1.0 mg, but that there is no significant difference in tooth length at dose = 2.0 mg, where the p-value was almost 1, at 0.9639.
Exponential Distribution Simulation Exercise
The hypothesis tested was that `the sampling distribution of the exponential distribution is normal, with a mean that matches the population mean and a variance that matches the theoretical result`. The simulation confirmed this: the generated distribution's mean matched the population mean and its variance matched the theoretical result; in addition, the distributions had similar values at the 5%, 25%, 50%, 75% and 95% quantiles.
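A minimal sketch of such a simulation, assuming the usual course settings of $\lambda = 0.2$ and $n = 40$:

```r
# Sampling distribution of the mean of 40 exponentials
set.seed(42)
lambda <- 0.2; n <- 40
means <- replicate(1000, mean(rexp(n, rate = lambda)))
mean(means)   # close to the population mean 1/lambda = 5
var(means)    # close to the theoretical (1/lambda)^2 / n = 0.625
quantile(means, c(.05, .25, .5, .75, .95))
```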
The Central Limit Theorem, a Swiss Army Knife of Statistics
The project consists of two parts: (i) a simulation exercise and (ii) a basic inferential data analysis; the former is presented here.
Asymptotics form the basis for the frequency interpretation of probabilities, describing the behavior of statistics as the sample size, or some other relevant quantity, limits to infinity or to zero. These limits are [the Swiss Army knives of statistics](https://github.com/bcaffo/courses/raw/master/06_StatisticalInference/07_Asymptopia/index.pdf), Brian Caffo.
Simulations were made in R 4.0 to investigate the asymptotic distribution of the exponential distribution, a continuous case, compared to test statistics which are expected to be Gaussian, a strong form of the [Central Limit Theorem](https://youtu.be/hgtMWR3TFnY). Results show that, just as coverage improves with greater $n$ in the CLT, increasing the rate parameter $\lambda$ improves the coverage and adherence to the CLT.
Power
Worked Examples
Bootstrap in action
Resampling Basics
multiple-comparisons
https://www.coursera.org/learn/statistical-inference/lecture/7c7Ns/12-01-multiple-comparisons
Power Basics
When designing an experiment we usually know $\mu_0$ and $\alpha$, and we want to know whether we have, or can get, enough data at the power we want. The more power the better the experimental design; the probability of rejecting the null when it is in fact false is called the power:
Power = 1 - $\beta$
Practice Exercises, S_p & CI
We know that S_p = sqrt(((n_x - 1) * S_x^2 + (n_y - 1) * S_y^2)/(n_x + n_y - 2)), and we know that CI = mu_x - mu_y + c(-1, 1) * qnorm(quantile) * S_p * (1 / n_x + 1 / n_y)^.5, so we can plug in ...
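A quick plug-in with made-up summary statistics (all the values below are hypothetical):

```r
# Hypothetical two-group summary statistics
n_x <- 10; n_y <- 10; S_x <- 2; S_y <- 3; mu_x <- 5; mu_y <- 3
S_p <- sqrt(((n_x - 1) * S_x^2 + (n_y - 1) * S_y^2) / (n_x + n_y - 2))
mu_x - mu_y + c(-1, 1) * qnorm(0.975) * S_p * (1 / n_x + 1 / n_y)^.5
```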
How to Generate & Interpret P Values.
The question motivating p-values is: given some null hypothesis concerning our data, how unusual or extreme is the sample value, for example the sample mean, that we get from our data?
Is our [test statistic](https://rpubs.com/lindangulopez/702246) consistent with our hypothesis? There are, implicitly, three steps we have to take to answer these types of questions.
- Create a null hypothesis
- Calculate a test statistic from the given data
- Compare the test statistic to the hypothetical distribution
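The three steps, sketched on a built-in dataset (the null value of 20 is just an assumed example):

```r
# H0: the mean mpg in mtcars is 20
tt <- t.test(mtcars$mpg, mu = 20)  # calculates the test statistic from the data
tt$statistic                       # the observed T statistic
tt$p.value                         # how extreme it is under the hypothetical t distribution
```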
Making decisions about populations using observed data.
An important concept in hypothesis testing is `the NULL hypothesis`, usually denoted as H_0:
This is the hypothesis that represents the status quo, which is assumed to be true.
It's a baseline against which you're testing alternative hypotheses, usually denoted by H_a.
- Statistical evidence is required to reject H_0 in favor of the research or alternative hypothesis.
Statistical methods for dealing with large & small datasets.
Central Limit Theorem (CLT) - Z statistic - Student’s or Gosset’s t distribution - t confidence intervals
P Values
P-values are a convenient way to communicate the results of a hypothesis test. When communicating a P-value, the reader can perform the test at whatever Type I error rate that they would like. Just compare the P-value to the desired Type I error rate and if the P-value is smaller, reject the null hypothesis.
Formally, the P-value is the probability of getting data as or more extreme than the observed data in favor of the alternative. The probability calculation is done assuming that the null is true. In other words if we get a very large T statistic the P-value answers the question "How likely would it be to get a statistic this large or larger if the null was actually true?". If the answer to that question is "very unlikely", in other words the P-value is very small, then it sheds doubt on the null being true, since you actually observed a statistic that extreme.
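For example, the tail probability for an observed T statistic can be read straight from pt(); the statistic and degrees of freedom here are illustrative:

```r
# Two-sided p-value for an observed T statistic of 2.5 with 15 degrees of freedom
2 * pt(2.5, df = 15, lower.tail = FALSE)
```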
Hypothesis testing
Statistical hypothesis testing is the formal inferential framework around choosing between hypotheses. The null hypothesis is assumed true, H0, and statistical evidence is required to reject it in favor of a research or alternative hypothesis, Ha.
T Confidence Intervals
# Uncomment & Run manipulate in RStudio
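A minimal sketch of a t interval, computed by hand and checked against t.test(), on placeholder data:

```r
set.seed(1)
x <- rnorm(20, mean = 10, sd = 2)  # placeholder sample
mean(x) + c(-1, 1) * qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
t.test(x)$conf.int                 # same 95% interval
```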
NOAA Reproducible Research
Contents:
- Synopsis, Reproducible Research Checklist
- Summary of Results, Past Significant Weather Events in the USA, 1950-2011
- Raw Data: NOAA StormData.csv.bz2
- Data Transformation, storm_data_corrected2
- Data Processing, Storm Event Type Damage
- Data Analysis, Tables and Plots
Maps with R
Using base maps from R's maps package, and also the ggmap package together with ggplot2.
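A minimal base-map sketch with maps and ggplot2 (the ggmap step is omitted here, since ggmap's map sources generally require an API key):

```r
library(ggplot2)
library(maps)  # supplies the outlines behind map_data()

states <- map_data("state")
ggplot(states, aes(long, lat, group = group)) +
  geom_polygon(fill = "grey90", colour = "grey40") +
  coord_quickmap()
```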
Plot in Plot
Demo
echo = FALSE
knitr
FirstKnitr
Demo
R Markdown
Quick Demo
Reproducible reporting with R
R Markdown files can be used to generate reproducible reports.
Explore the National Emissions Inventory database
## [Assignment](https://www.coursera.org/learn/exploratory-data-analysis/peer/b5Ecl/course-project-2)
The overall goal of this assignment is to explore the National Emissions Inventory database and see what it says about fine particulate matter pollution in the United States over the 10-year period 1999 to 2008.
Changes in PM2.5 levels
This plot still needs a bit of work as it's not easy to read by eye, but it looks as though many states have decreased their average PM2.5 levels from 1999 to 2012, while a few states actually increased their levels.
Clustering Tips
Finding differences in patterns is useful for modeling.
Colours in R
Quick demo
Grouping Data in R
When exploring data there are two principal uses of grouping: (i) to point out groups of similar data, where the distance/similarity measure has to be chosen to match the problem, and (ii) to create a set of variables which are uncorrelated but representative of the data, explaining as much variance as possible. The first goal is statistical and is solved by PCA; the second goal is data compression, which can be solved by SVD.
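A minimal sketch of both goals on a built-in dataset (mtcars is just a stand-in):

```r
# Statistical goal: PCA gives uncorrelated, variance-maximising components
pc <- prcomp(mtcars, scale. = TRUE)
summary(pc)$importance[, 1:3]  # variance explained by the first components

# Compression goal: the SVD of the same scaled matrix; scores equal u %*% diag(d)
sv <- svd(scale(mtcars))
all.equal(abs(sv$u %*% diag(sv$d)), abs(pc$x), check.attributes = FALSE)
```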
ggplot2(maacs)
Plotting in ggplot2 allows a plot to be built up in layers, for example: (i) plot the data, (ii) overlay a summary, (iii) then add metadata and annotation.
[Data set: MAACS Cohort] A mouse allergen and asthma cohort study of children (aged 5-17) with persistent asthma; data was collected 5 times per child over a year.
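A sketch of that layering on a built-in dataset, since the MAACS data itself is not public:

```r
library(ggplot2)
# (i) plot the data, (ii) overlay a summary, (iii) add metadata and annotation
ggplot(airquality, aes(Temp, Ozone)) +
  geom_point(alpha = 0.5) +                 # the data
  geom_smooth(method = "lm", se = TRUE) +   # a summary layer
  labs(title = "Ozone vs Temperature",      # metadata and annotation
       x = "Temperature (F)", y = "Ozone (ppb)")
```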
The Lattice Plotting System
The lattice plotting system is implemented using the following packages:
- lattice: contains code for producing Trellis graphics, which are independent of the “base” graphics system; includes functions like xyplot, bwplot, levelplot.
- grid: implements a different graphing system independent of the “base” system; the lattice package builds on top of grid.
The lattice plotting system does not have a “two-phase” aspect with separate plotting and annotation like in base plotting: all plotting/annotation is done at once with a single function call.
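For example, a conditioned Trellis plot comes from one xyplot() call (airquality stands in for any panel data):

```r
library(lattice)
# Ozone vs Wind, one panel per month, all from a single function call
airquality$Month <- factor(airquality$Month)
xyplot(Ozone ~ Wind | Month, data = airquality, layout = c(5, 1))
```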
Exploratory household energy usage data analysis, with R
Examine how household energy usage varies over a 2-day period in February 2007. Your task is to reconstruct the plots below, all of which were constructed using the base plotting system.
Step 1: fork and clone this GitHub repository
Step 2: Download the data to your working directory.
...
Base Plotting on Graphic Devices
screen, pdf, png
Base Plot
Examples for screen device display
Plotting in R
There are three key plotting systems in R: the base system, which follows an “artist's palette” model; the lattice system, which lets you specify an entire plot with one function call and has conditioning built in; and ggplot2, which mixes elements of base and lattice.
Basics, of Exploratory Data Analysis in R
Replaces http://rpubs.com/lindangulopez/656231
HAR-subject-activity-mean Code Book
The course Getting and Cleaning Data is course 3 of the [Data Science Specialization]; it is taught by Jeff Leek, Brian Caffo and Roger D. Peng, all professors at the [Johns Hopkins School of Public Health]. You will learn how to obtain data from the web, from APIs, from databases and from colleagues, in various formats.
The main aim was to understand the basics of data cleaning and how to make data “tidy”. Tidy data, together with the other components of a complete data set (raw data, processing instructions, codebooks, and processed data), dramatically speeds up downstream data analysis tasks; these are the focus of this blog, which presents the
Tidy Data Assignment Results:
- a [README.md] which explains the purpose and content of the repository.
- the R script [run_analysis.R] which transforms the given data to a tidy data set.
- the tidy dataset [HAR-subject-activity-mean.txt] which is produced as an output from the R script.
- a csv file [HAR-subject-activity-mean.csv] for easy data analysis with csv tools.
- a metadata file, for statisticians, the [CodeBook.md] which lists the variables of the tidy data set.
Tidy Data
with tidyverse
Case Studies
Getting & Cleaning Data
Working with Dates
Week 4 Course Notes: data-cleaning in R, by [Linda] (@lindangulopez)
Regular Expressions, in R
Week 4 Course Notes: data-cleaning
Regular expressions are useful when searching text with broad patterns, combining metacharacters, which form the 'grammar' of the search logic, with literals, which are similar to the 'words' used in Natural Language, NL.
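A small illustration of metacharacters versus literals with base R's pattern functions (the strings are made up):

```r
x <- c("data-cleaning", "Data Science", "cleaning data", "metadata")
grepl("^data", x)               # ^ is a metacharacter: anchor to the start of the string
grep("data$", x, value = TRUE)  # $ anchors to the end
sub("clean(ing)?", "tidy", x)   # (ing)? makes the group optional; "clean" is a literal
```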
clean text variables
Week 4 Course Notes: data-cleaning
Run these R code chunks to clean text variables whenever you need to download and clean data.
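A typical cleaning chunk might look like this; the raw names are invented for illustration:

```r
names_raw <- c("Income.Level ", "ZIP Code", "wgtp15")
cleaned <- tolower(trimws(names_raw))  # lower-case and strip stray whitespace
cleaned <- gsub("[. ]", "_", cleaned)  # replace dots and spaces with underscores
strsplit(cleaned, "_")                 # split names into their word components
```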
Week 3 Quiz: data-cleaning
Extracts of a few Case Studies
Managing Data Frames with tidyverse [dplyr]
Merging Data in R4.0
The magrittr package, installed as part of the [dplyr] package, facilitates the rendering of Tidy Data, that is when:
each variable forms a column
each observation forms a row
and each table/file stores data about one kind of observation
Managing Data Frames with dplyr
The magrittr package, installed as part of the dplyr package, facilitates the rendering of Tidy Data, that is when:
Each variable forms a column
Each Observation forms a row
Each table/file stores data about one kind of observation
through these types of data manipulation:
revealing new variables, new observations, and new ways to describe data, as well as subsetting data, doing group-wise operations, etc.
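A minimal piped sketch of those manipulations on a built-in dataset (the kpl conversion is just an example of a new variable):

```r
library(dplyr)
mtcars %>%
  mutate(kpl = mpg * 0.425) %>%  # reveal a new variable (km per litre)
  filter(cyl != 6) %>%           # subset observations
  group_by(cyl) %>%              # group-wise operations
  summarise(mean_kpl = mean(kpl), n = n())
```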
Reshaping Data & Using Pipes
This is an R Markdown document, feel free to [reach out](https://www.linkedin.com/in/lindangulopez/) for finer details.
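A small sketch of reshaping through a pipe, using reshape2 (loaded earlier in these notes) on a built-in dataset:

```r
library(dplyr)
library(reshape2)
mtcars %>%
  melt(id.vars = "cyl", measure.vars = c("mpg", "hp")) %>%  # wide -> long
  dcast(cyl ~ variable, fun.aggregate = mean)               # long -> wide, with a summary
```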
New Variables
Create & Add the New Variables to the Data Frame
Summarizing Data
from summary() to ftable()
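For example, on a built-in dataset (chosen only for illustration):

```r
summary(warpbreaks)                                 # quick per-column summary
ftable(xtabs(~ wool + tension, data = warpbreaks))  # flat contingency table
```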
Subsetting & Sorting
which(), sort(), arrange()
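A quick look at the three functions side by side, on an invented data frame:

```r
library(dplyr)
df <- data.frame(x = c(3, 1, 2), y = c("b", "c", "a"))
which(df$x > 1)                # integer positions of the matching rows
sort(df$x, decreasing = TRUE)  # sort a vector
arrange(df, desc(x), y)        # dplyr: sort a whole data frame by several columns
```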
data-cleaning/week-2-quiz
TO DO: test Q in Ubuntu & R (not RStudio)
Accessing Geospatial Data Using API’s with R
Lesson 7. Programmatically Accessing Geospatial Data Using API’s - Working with and Mapping JSON Data from the Colorado Information Warehouse in R
Web Scraping in R
To avoid getting your IP address blocked, respect the website's terms of service.
Reading & writing to HDF5 with R
Environmental Setup & Code Chunks
Connecting & Sourcing MySQL Data in R
This is an R Markdown document on sourcing data from MariaDB, a MySQL server, with R.
Getting & Cleaning Data Lab
Answers to Quiz 1
Data Sourcing and Examination, in R
This is an R 3.6 Markdown document on **Data Sourcing and Examination, in R**, showing how processing steps can impact results.
Hospital Quality Study
Programming in R Assignment
Document
This is an attempt at the Caching the Inverse of a Matrix assignment from Programming with R; the exercise is to demonstrate closures.
Closures get their name because they enclose the environment of the parent function and can access all its variables. Closures are useful for making function factories, and are one way to manage mutable state in R.
It allows us to have two levels of parameters: a parent level that controls operation and a child level that does the work. The example, makeCacheMatrix(), uses this idea to store x and inv in the enclosing environment of the set, get, setInverse and getInverse functions, that is, the environment within which they were defined: the environment created by makeCacheMatrix().
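A sketch along the lines described, paired with a cacheSolve() in the style of the assignment:

```r
makeCacheMatrix <- function(x = matrix()) {
  inv <- NULL                                  # cached inverse lives in this environment
  set <- function(y) { x <<- y; inv <<- NULL }
  get <- function() x
  setInverse <- function(i) inv <<- i
  getInverse <- function() inv
  list(set = set, get = get,
       setInverse = setInverse, getInverse = getInverse)
}

cacheSolve <- function(cm, ...) {
  inv <- cm$getInverse()
  if (!is.null(inv)) return(inv)  # reuse the cached inverse if it exists
  inv <- solve(cm$get(), ...)     # otherwise compute it,
  cm$setInverse(inv)              # cache it,
  inv                             # and return it
}
```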
Cars Data Set
My First R Markdown