---
title: "STAT500 Assignment 1"
author: "Jonathan Boodle 21153695"
date: "2023-03-21"
output:
  pdf_document: default
  html_document: default
editor_options:
  markdown:
    wrap: 72
---
\textbf{Outline: } This assignment has four questions and it's worth 10%
of your final grade.
\vspace{5mm}
\textbf{Purpose: } To assess your analytical and computing skills on the
material covered.
\vspace{5mm}
\textbf{Total possible marks: 100 marks}.
\bigskip
\textbf{Instructions:}
```{=tex}
\begin{enumerate}
\item \textbf{Use the .Rmd file provided to write your answers}.
Enter your name and student id in the `author` field at
the top of the file.
\item Save the file on disk. The filename must include 1) your last name, 2) your first name,
and 3) your student id, e.g., if Jane Doe submits her assignment, her file must be named ``Doe\_Jane\_123456789''.
\item Finally, knit your .Rmd answer-file to .pdf and submit to Canvas.
\item Fill in and sign an assessment cover-sheet which
must be the very first page in the PDF. Use, e.g., Adobe
Acrobat Pro on Uni computers. You can submit this
cover sheet as a separate file.
\end{enumerate}
```
\vspace{5mm}
\noindent \textbf{IMPORTANT:} Any \texttt{R} code employed to complete
this assignment must be self-explanatory and must be embedded in your
answer using R Markdown code chunks. Screenshots and other images will
not be allowed and will be penalized.
\textbf{Make sure you comment your code so that
your answer can be fully understood.}
\vspace{5mm}
\noindent \textbf{Late submissions}: There is a lateness penalty of 5
marks per day (or part thereof), up to a maximum of 4 days, unless
the student gets an approved SCA no later than Monday 10 April 2023.
See note below.
\vspace{5mm}
\noindent \textbf{Note:} If you need an extension with no lateness
penalty because, e.g., your performance has been impacted by
extenuating, unexpected circumstances, you can submit an SCA along
with relevant evidence using the submission link from the STAT500 Home
page. \textbf{Bear in mind that SCA processing may
take up to 5 working days}. If you have questions, contact
\textcolor{blue}{\texttt{victor.miranda@aut.ac.nz}} or
\textcolor{blue}{\texttt{nuttanan.wichitaksorn@aut.ac.nz}}.
## Question 1: Variables (20 marks)
(a) [5 marks] Create a variable from your student id using the following
code and name your variable as `sid`. For example, Jane Doe with the
student id 123456789 will have
```{r}
sid <- c(1,2,3,4,5,6,7,8,9)
```
```{r}
# Student id 21153695, stored one digit per element
sid <- c(2,1,1,5,3,6,9,5)
```
(b) [3 marks] Find the length of this variable.
```{r}
length(sid)
# The length of variable 'sid' is 8
```
(c) [2 marks] What type is this variable? Explain.
```{r}
class(sid)
# The class of 'sid' is "numeric": c() was given plain number literals, which R stores as double-precision numbers rather than as integers or character strings
```
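As a small illustration of the point above, the sketch below compares the vector built from plain number literals with two variants. The names `sid_int` and `sid_chr` are hypothetical and are not part of the assignment.

```r
sid <- c(2, 1, 1, 5, 3, 6, 9, 5)

# Hypothetical variants, for comparison only
sid_int <- c(2L, 1L, 1L, 5L, 3L, 6L, 9L, 5L)  # the L suffix creates integers
sid_chr <- as.character(sid)                   # the digits stored as text

class(sid)           # "numeric"
typeof(sid)          # "double"
class(sid_int)       # "integer"
is.numeric(sid_int)  # TRUE: integers also count as numeric
class(sid_chr)       # "character"
```

Only the character version would prevent calculations such as `mean()` from returning a number.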
(d) [5 marks] Using `R` functions, find mean, mode, median, standard
deviation, and variance of this variable.
```{r}
library(DescTools) # Provides Mode(); base R's mode() returns the storage mode, not the most frequent value
mean(sid) # The mean of 'sid' is 4
Mode(sid) # The modes are 1 and 5; the frequency reported underneath is 2, so each occurs twice in 'sid'
median(sid) # The median is 4, the average of the two middle values, 3 and 5
sd(sid) # The standard deviation is 2.777, rounded to 3 decimal places
var(sid) # The variance is 7.714, rounded to 3 decimal places
```
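If `DescTools` is not available, the mode can also be found with base R. The following is a minimal sketch, not the method required by the assignment; `counts` and `modes` are hypothetical helper names. It also checks the variance by hand.

```r
sid <- c(2, 1, 1, 5, 3, 6, 9, 5)

# Count how often each distinct value occurs
counts <- table(sid)

# Keep every value whose count equals the highest count
modes <- as.numeric(names(counts)[counts == max(counts)])
modes        # 1 5
max(counts)  # 2

# Sample variance by hand: squared deviations from the mean, divided by n - 1
manual_var <- sum((sid - mean(sid))^2) / (length(sid) - 1)
all.equal(manual_var, var(sid))  # TRUE
```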
(e) [5 marks] Interpret the median and the mode. Use 2-3 sentences.
The median is the middle value of the data once they are sorted. Because 'sid' has 8 values, the median is the average of the two middle values, 3 and 5, which is 4. The median is often a better measure of central tendency than the mean when the data contain outliers or values unlike the rest, such as the 9 in 'sid', because extreme values pull the mean towards them but barely affect the median.
The mode is the value or values that occur most frequently. The variable 'sid' has two modes, 1 and 5: each occurs twice, while every other value occurs once. The mode is useful for seeing how the values are distributed and for spotting patterns such as repeated values.
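The effect of an extreme value on the mean and the median can be checked with a short sketch. The vector `sid_outlier` is hypothetical: it is `sid` with the 9 replaced by 90 purely for illustration.

```r
sid <- c(2, 1, 1, 5, 3, 6, 9, 5)

# Hypothetical copy of sid in which the largest value is made extreme
sid_outlier <- replace(sid, sid == 9, 90)

mean(sid)            # 4
median(sid)          # 4
mean(sid_outlier)    # 14.125: the mean is pulled towards the extreme value
median(sid_outlier)  # 4: the median is unchanged
```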
## Question 2: Working with Data (25 marks)
Run the code below, with the `sid` variable defined as above.
```{r}
set.seed(sum(sid))
d1 <- data.frame(percent_cured = runif(100, min = 0, max = 100),
                 age = rep(c('child', 'adult'), each = 50))
```
(a) [5 marks] Using `R`, create and add a variable called `dose` to the
data frame `d1`. Use `c(100, 200, 100, 200)` with every value
repeated 25 times so that the length of `dose` matches the other two
variables.
```{r}
d1$dose <- rep(c(100, 200, 100, 200), each = 25)
# rep() with each = 25 repeats every value 25 times in turn, giving 100 values to match the 100 rows of d1
```
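A quick way to confirm that `dose` lines up with the other variables is to cross-tabulate it against `age`. This is a sketch for checking only; `cell_counts` is a hypothetical name.

```r
sid <- c(2, 1, 1, 5, 3, 6, 9, 5)
set.seed(sum(sid))
d1 <- data.frame(percent_cured = runif(100, min = 0, max = 100),
                 age = rep(c("child", "adult"), each = 50))
d1$dose <- rep(c(100, 200, 100, 200), each = 25)

# The new column must have one value per row
length(d1$dose) == nrow(d1)  # TRUE

# Each age group should contain 25 rows at each dose
cell_counts <- table(d1$age, d1$dose)
cell_counts
```

Because `age` changes every 50 rows and `dose` changes every 25 rows, each of the four age and dose combinations holds 25 observations.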
(b) [5 marks] Using `R`, find the summary statistics of `percent_cured`
for child and adult.
```{r}
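summary(d1$percent_cured)
# Overall summary, in order: minimum, first quartile, median, mean, third quartile and maximum
tapply(d1$percent_cured, d1$age, summary)
# The same six statistics computed separately for child and adult, as the question asks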
```
(c) [5 marks] In 2-3 sentences, describe the mean, min and max of
`percent_cured` for child and adult. You must consider your answer
to (b).
```{r}
# These figures are for the whole sample, not split by age. The values are widely spread: the minimum is 0.5285, the maximum is 99.9918 and the mean is 50.9058.
# A mean close to 50 with values covering almost the whole 0-100 range is what a uniform distribution on 0 to 100 would give, which matches how the data were generated with runif(); it does not indicate a normal distribution.
# The minimum shows that the lowest observed cure percentage was under 1%, the maximum shows that the highest was almost 100%, and the mean shows that the average cure percentage was about 51%.
```
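The mean, minimum and maximum can also be computed separately for each age group, which is what parts (b) and (c) ask about. The sketch below is one possible way to do this; `stats_by_age` is a hypothetical name, and no output values are quoted because they depend on the random draw.

```r
sid <- c(2, 1, 1, 5, 3, 6, 9, 5)
set.seed(sum(sid))
d1 <- data.frame(percent_cured = runif(100, min = 0, max = 100),
                 age = rep(c("child", "adult"), each = 50))

# Split percent_cured by age, then compute three statistics per group
stats_by_age <- sapply(split(d1$percent_cured, d1$age),
                       function(x) c(mean = mean(x), min = min(x), max = max(x)))

# Rows are the statistics, columns are the age groups
round(stats_by_age, 4)
```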
(d) [10 marks] Plot 3 graphs: (1) a histogram of the variable
`percent_cured`, (2) a box plot of `percent_cured` according to
`age`, and (3) a box plot of `percent_cured` according to `dose`.
What is the median of `percent_cured` according to `dose`?
```{r}
hist(d1$percent_cured, main = "Histogram of Percent Cured")
# For 1) A histogram of Percent Cured
boxplot(percent_cured ~ age, data = d1, main = "Boxplot of Percent Cured by Age")
# For 2) A boxplot of percent cured by age
boxplot(percent_cured ~ dose, data = d1, main = "Boxplot of Percent Cured by Dose")
# For 3) A boxplot of percent cured by dose
aggregate(percent_cured ~ dose, d1, median)
# Using the aggregate function, we can find out the median which comes to 42.0649 for a dose of 100, and 55.4639 for a dose of 200
```
## Question 3: Normal Distribution (25 marks)
(a) [5 marks] Generate 100 random numbers from a normal distribution with
the mean of the `sid` variable in Question 1, name the result
`sid.norm1`, and find its mean.
```{r}
# Write your answer here
```
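One possible approach to part (a) is sketched below. The question does not give a standard deviation or a seed, so `sd = 1` (the `rnorm()` default) and reusing the seed from Question 2 are both assumptions.

```r
sid <- c(2, 1, 1, 5, 3, 6, 9, 5)

set.seed(sum(sid))  # assumption: same seed as Question 2, for reproducibility

# 100 draws from a normal distribution centred on mean(sid);
# sd = 1 is an assumption, since the question does not specify it
sid.norm1 <- rnorm(100, mean = mean(sid), sd = 1)

mean(sid.norm1)  # should be close to, but not exactly, mean(sid) = 4
```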
(b) [5 marks] How can we assess whether `sid.norm1` is from a normal
distribution?
```{r}
# Write your answer here
```
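Normality is commonly assessed with a histogram, a normal Q-Q plot and a formal test such as Shapiro-Wilk. The sketch below shows all three under the same assumptions as before (`sd = 1`, seed taken from `sum(sid)`); `sw` is a hypothetical name.

```r
sid <- c(2, 1, 1, 5, 3, 6, 9, 5)
set.seed(sum(sid))                                  # assumed seed
sid.norm1 <- rnorm(100, mean = mean(sid), sd = 1)   # sd = 1 is an assumption

# Visual check 1: the histogram should look roughly bell-shaped
hist(sid.norm1, main = "Histogram of sid.norm1")

# Visual check 2: points close to the reference line suggest normality
qqnorm(sid.norm1)
qqline(sid.norm1)

# Formal check: the null hypothesis is that the data are normal,
# so a large p-value gives no evidence against normality
sw <- shapiro.test(sid.norm1)
sw$p.value
```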
(c) [5 marks] Increase the random numbers to 1,000 and 10,000 with the
same mean as above, and respectively call the variables, `sid.norm2`
and `sid.norm3`.
```{r}
# Write your answer here
```
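Part (c) only changes the sample size. A minimal sketch, again assuming `sd = 1` and a seed of `sum(sid)`, neither of which is specified in the question:

```r
sid <- c(2, 1, 1, 5, 3, 6, 9, 5)
set.seed(sum(sid))  # assumed seed

# Same mean as in part (a), larger samples; sd = 1 is an assumption
sid.norm2 <- rnorm(1000,  mean = mean(sid), sd = 1)
sid.norm3 <- rnorm(10000, mean = mean(sid), sd = 1)

length(sid.norm2)  # 1000
length(sid.norm3)  # 10000
```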
(d) [10 marks] Compare the mean of all three variables and discuss
widely. You must write a short paragraph of 4-5 sentences for this.
```{r}
# Write your answer here
```
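For part (d), the three sample means can be placed side by side with the standard error of the mean, which is sd / sqrt(n). The sketch below assumes `sd = 1` and a seed of `sum(sid)`; `sizes`, `samples` and `comparison` are hypothetical names.

```r
sid <- c(2, 1, 1, 5, 3, 6, 9, 5)
set.seed(sum(sid))  # assumed seed

# Sample sizes for sid.norm1, sid.norm2 and sid.norm3
sizes <- c(sid.norm1 = 100, sid.norm2 = 1000, sid.norm3 = 10000)

# One normal sample per size; sd = 1 is an assumption
samples <- lapply(sizes, function(n) rnorm(n, mean = mean(sid), sd = 1))

comparison <- data.frame(
  n           = sizes,
  sample_mean = sapply(samples, mean),
  std_error   = 1 / sqrt(sizes)  # theoretical standard error, sd / sqrt(n)
)
comparison
```

The standard error falls from 0.1 at n = 100 to 0.01 at n = 10,000, so the sample mean is expected to sit closer to the true mean of 4 as the sample grows, although any single draw can deviate.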
## Question 4: Importing Data and Data Analysis (30 marks)
First, complete the following tasks:
1. Go to the COVID-19 data portal on the Statistics New Zealand website at
<https://www.stats.govt.nz/experimental/covid-19-data-portal> and
click the orange "DOWNLOAD DATA" button (next to ABOUT) around
the middle of the page.
2. Choose **two** indicators of your own choice. You can select them
one at a time and download the data for each.
3. Once you successfully download each indicator, delete the "metadata"
sheet. Then, save the file in an appropriate folder.
4. Import the dataset for each indicator into `R`
\textbf{Questions:}
(a) [5 marks] For the completion of the four steps above
```{r}
# I chose the General Health and Smoking indicator and the General Health and Drinking indicator. Both datasets have been imported into R.
```
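The import step could look like the sketch below. It assumes the portal download is an Excel workbook and that the `readxl` package is installed. The file names are hypothetical and must be replaced with the actual paths, and `read_indicator` is a hypothetical helper that skips a file instead of stopping with an error.

```r
# Hypothetical file names: replace them with the paths of the downloaded workbooks
files <- c(smoking  = "general_health_smoking.xlsx",
           drinking = "general_health_drinking.xlsx")

# Read one workbook, or return NULL with a message if it cannot be read
read_indicator <- function(path) {
  if (!requireNamespace("readxl", quietly = TRUE)) {
    message("Package 'readxl' is not installed; skipping ", path)
    return(NULL)
  }
  if (!file.exists(path)) {
    message("File not found; skipping ", path)
    return(NULL)
  }
  readxl::read_excel(path)  # reads the first sheet by default
}

indicators <- lapply(files, read_indicator)
str(indicators, max.level = 1)
```

Calling `str()` on each imported data frame afterwards is a simple way to confirm the variable types before answering part (b).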
(b) [10 marks] **For each indicator**, write down the variable type AND
explain why you chose it (5 marks p/ indicator).
```{r}
# In the General Health and Smoking dataset, the variable type is numeric because the values are percentages, and 50% can also be written as 0.5, so all values are numbers. I chose this indicator because during lockdown many people needed a vice to cope with being confined at home, and I feel that many may have taken up smoking. The results could help to visualise the problems associated with smoking and to mitigate the damage.
# In the General Health and Drinking dataset, the variable type is also numeric, for the same reason. I chose this indicator because it relates to the world around me and to other people my age. New Zealand has a drinking culture, and I feel the results can help to visualise the problems people may have had during lockdown.
```
(c) [10 marks] Use an appropriate plot to visualize each
indicator/variable. Explain the selected plot(s) in 2-3
sentences. (5 marks p/ indicator including explanation).
```{r}
# Write your answer here
```
(d) [5 marks] Choose one of your indicators. Make an appropriate plot to
see its distribution. Then discuss this distribution in 3-4
sentences.
```{r}
# Write your answer here
```