---
title: "STAT500 Assignment 1"
author: "Jonathan Boodle 21153695"
date: "2023-03-21"
output:
  pdf_document: default
  html_document: default
editor_options:
  markdown:
    wrap: 72
---

\textbf{Outline: } This assignment has four questions and it's worth 10% of your final grade.

\vspace{5mm}

\textbf{Purpose: } To assess your analytical and computing skills on the material covered.

\vspace{5mm}

\textbf{Total possible marks: 100 marks}.

\bigskip

\textbf{Instructions:}

```{=tex}
\begin{enumerate}
\item \textbf{Use the .Rmd file provided to write your answers}. Enter your name and student id in the `author` field at the top of the file.
\item Save the file on disk. The filename must include 1) your lastname, 2) your firstname, and 3) your student id, e.g., if Jane Doe submits her assignment, her file must be named ``Doe\_Jane\_123456789''.
\item Finally, knit your .Rmd answer-file to .pdf and submit to Canvas.
\item Fill in and sign an assessment cover-sheet which must be the very first page in the PDF. Use, e.g., Adobe Acrobat Pro on Uni computers. You can submit this cover sheet as a separate file.
\end{enumerate}
```

\vspace{5mm}

\noindent \textbf{IMPORTANT:} Any \texttt{R} code employed to complete this assignment must be self-explanatory and must be embedded in your answer using R Markdown code chunks. Screenshots and other images will not be allowed and will be penalized. \textbf{Make sure you make comments on your code for a full understanding of your answer.}

\vspace{5mm}

\noindent \textbf{Late submissions}: There is a lateness penalty of 5 marks/day (or part of a day thereof), up to a maximum of 4 days, unless the student gets an approved SCA by NO later than Monday 10 April 2023. See note below.

\vspace{5mm}

\noindent \textbf{Note:} If you need an extension with no lateness penalty because, e.g., your performance has been impacted by some extenuating, unexpected, circumstances, you can submit an SCA along with relevant evidence using the submission link from the STAT500 Home page. \textbf{Bear in mind that SCA processing may take up to 5 working days}. If you have questions, contact \textcolor{blue}{\texttt{victor.miranda@aut.ac.nz}} or \textcolor{blue}{\texttt{nuttanan.wichitaksorn@aut.ac.nz}}.

## Question 1: Variables (20 marks)

(a) [5 marks] Create a variable from your student id using the following code and name your variable as `sid`. For example, Jane Doe with the student id 123456789 will have

```{r}
sid <- c(1,2,3,4,5,6,7,8,9)
```

```{r}
# Digits of the student id 21153695
sid <- c(2,1,1,5,3,6,9,5)
```

(b) [3 marks] Find the length of this variable.

```{r}
length(sid) # The length of the variable 'sid' is 8
```

(c) [2 marks] What type is this variable? Explain.

```{r}
class(sid) # The class of 'sid' is "numeric": c() was given only numeric literals, so R stores 'sid' as a numeric (double) vector
```

(d) [5 marks] Using `R` functions, find mean, mode, median, standard deviation, and variance of this variable.

```{r}
library(DescTools) # Provides Mode(); base R has no function for the statistical mode
mean(sid) # The mean of the variable 'sid' is 4
Mode(sid) # Returns 1 and 5 with a frequency of 2: both values occur twice in 'sid'
median(sid) # Returns 4, the average of the two middle values 3 and 5
sd(sid) # Returns 2.777, rounded to 3 decimal places
var(sid) # Returns 7.714, rounded to 3 decimal places
```

(e) [5 marks] Interpret the median and the mode. Use 2-3 sentences.

The median is the middle value, or the average of the two middle values, once the data are sorted. Since 'sid' has 8 values, the two middle values of the sorted data are 3 and 5. The average of 3 and 5 is 4, so the median is 4.

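
The median and mode reported for `sid` can also be checked by hand in base R. The chunk below is only an illustrative sketch: the names `sorted`, `manual_median`, `counts` and `manual_mode` are introduced here for the example and are not part of the assignment template.

```r
sid <- c(2, 1, 1, 5, 3, 6, 9, 5)

# Median by hand: sort the values, then average the two middle ones
# (this shortcut assumes an even number of values, as is the case here).
sorted <- sort(sid)
n <- length(sorted)
manual_median <- mean(sorted[c(n / 2, n / 2 + 1)])

# Mode by hand: count each distinct value and keep those with the highest count.
counts <- table(sid)
manual_mode <- as.numeric(names(counts)[counts == max(counts)])

manual_median
manual_mode
```

Because `table()` keeps every value tied for the highest count, this approach reports both modes rather than only the first one.
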
The median is often a better measure of central tendency than the mean because it is less affected by outliers, that is, values unlike the rest, such as 6 and 9 in the variable 'sid'. Those values are far from the others and can pull the mean away from the typical value, which damages the reliability of the result.

The mode is the value or values that occur most frequently. The variable 'sid' has 2 modes, 1 and 5. Both values occur twice in the variable while all the others occur once. This helps to show how the values are distributed and can reveal patterns, such as 1 and 5 being repeated.

## Question 2: Working with Data (25 marks)

Run the code below, with the `sid` variable defined as above.

```{r}
set.seed(sum(sid))
d1 <- data.frame(percent_cured = runif(100, min = 0, max = 100),
                 age = rep(c('child', 'adult'), each = 50))
```

(a) [5 marks] Using `R`, create and add a variable called `dose` to the data frame `d1`. Use `c(100, 200, 100, 200)` with every value repeated 25 times so that the length of `dose` matches the other two variables.

```{r}
d1$dose <- rep(c(100, 200, 100, 200), each = 25) # Repeats each of 100, 200, 100, 200 25 times, giving 100 values to match the 100 rows of d1
```

(b) [5 marks] Using `R`, find the summary statistics of `percent_cured` for child and adult.

```{r}
summary(d1$percent_cured) # Overall minimum, 1st quartile, median, mean, 3rd quartile and maximum
tapply(d1$percent_cured, d1$age, summary) # The same six statistics computed separately for adult and child
```

(c) [5 marks] In 2-3 sentences, describe the mean, min and max of `percent_cured` for child and adult. You must consider your answer to (b).

```{r}
# The data cover a very wide range. Over both age groups combined, the lowest value is 0.5285, the highest is 99.9918 and the mean is 50.9058. The values are spread fairly evenly between 0 and 100, which is what we expect from the uniform distribution that runif() draws from, rather than from a normal distribution. The minimum shows that in at least one observation almost nobody was cured, while the maximum shows that in at least one observation almost everybody was cured.
# The mean shows that, on average, about 50% were cured.
```

(d) [10 marks] Plot 3 graphs: (1) a histogram of the variable `percent_cured`, (2) a box plot of `percent_cured` according to `age`, and (3) a box plot of `percent_cured` according to `dose`. What is the median of `percent_cured` according to `dose`?

```{r}
# (1) Histogram of percent_cured
hist(d1$percent_cured, main = "Histogram of Percent Cured")
# (2) Box plot of percent_cured by age
boxplot(percent_cured ~ age, data = d1, main = "Boxplot of Percent Cured by Age")
# (3) Box plot of percent_cured by dose
boxplot(percent_cured ~ dose, data = d1, main = "Boxplot of Percent Cured by Dose")
# Median of percent_cured for each dose: 42.0649 for a dose of 100 and 55.4639 for a dose of 200
aggregate(percent_cured ~ dose, d1, median)
```

## Question 3: Normal Distribution (25 marks)

(a) [5 marks] Generate 100 random numbers from normal distribution with the mean of the `sid` variable in Question 1 and name it as the `sid.norm1` variable and find its mean.

```{r}
# Write your answer here
```

(b) [5 marks] How can we assess that the `sid.norm1` is from a normal distribution?

```{r}
# Write your answer here
```

(c) [5 marks] Increase the random numbers to 1,000 and 10,000 with the same mean as above, and respectively call the variables, `sid.norm2` and `sid.norm3`.

```{r}
# Write your answer here
```

(d) [10 marks] Compare the mean of all three variables and discuss widely. You must write a short paragraph of 4-5 sentences for this.

```{r}
# Write your answer here
```

## Question 4: Importing Data and Data Analysis (30 marks)

First, complete the following tasks:

1. Go to the COVID-19 data portal on the Statistics New Zealand website at <https://www.stats.govt.nz/experimental/covid-19-data-portal> and click the orange "DOWNLOAD DATA" button (next to ABOUT) around the middle of the page.
2. Choose **two** indicators of your own choice.
   You can select them one at a time and download the data for each.
3. Once you have successfully downloaded each indicator, delete the "metadata" sheet. Then, save the file in an appropriate folder.
4. Import the dataset for each indicator into `R`.

\textbf{Questions:}

(a) [5 marks] For the completion of the four steps above

```{r}
# I chose the General Health and Smoking indicator and the General Health and Drinking indicator. Both datasets are imported into R.
```

(b) [10 marks] **For each indicator**, write down the variable type AND explain why you chose it (5 marks p/ indicator).

```{r}
# In the Health and Smoking dataset, the type is numeric because the values are percentages; 50% can equally be written as 0.5, so every value is a number. I chose this indicator because it relates to the world and to other people my age: during lockdown many people needed a vice to escape the household, and I feel that many may have taken up smoking. The results could help to visualise the problems associated with smoking and to mitigate the damage.
# In the Health and Drinking dataset, the type is also numeric. I chose this indicator for a similar reason: New Zealand has a drinking culture, and I feel the results can help to visualise the problems people may have had during lockdown.
```

(c) [10 marks] Use an appropriate plot to visualize each indicator/variable. Explain the selected plot(s) in 2-3 sentences. (5 marks p/ indicator including explanation).

```{r}
# Write your answer here
```

(d) [5 marks] Choose one of your indicators. Make an appropriate plot to see its distribution. Then discuss this distribution in 3-4 sentences.

```{r}
# Write your answer here
```
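
The variable types stated in Question 4(b) could be confirmed from the imported data rather than asserted. The chunk below is a minimal sketch under several assumptions that do not come from the assignment: that the download is an Excel workbook, that the `readxl` package is installed, and that the file is saved as `covid_indicator.xlsx` (a hypothetical name). The helper `column_types()` and the `toy` data frame are likewise made up for illustration.

```r
# Report the (first) class of every column in a data frame.
column_types <- function(df) {
  vapply(df, function(col) class(col)[1], character(1))
}

# Hypothetical file name: replace it with the file you actually downloaded.
path <- "covid_indicator.xlsx"

# Only attempt the import when readxl is installed and the file exists.
if (requireNamespace("readxl", quietly = TRUE) && file.exists(path)) {
  indicator <- readxl::read_excel(path) # reads the first sheet by default
  print(column_types(indicator))
}

# Toy data frame with made-up values, used only to show the output format.
toy <- data.frame(period = c("2020-03", "2020-04"),
                  value = c(0.5, 0.25),
                  stringsAsFactors = FALSE)
toy_types <- column_types(toy)
toy_types
```

A column holding percentages should be reported as `numeric`; a value column reported as `character` usually means the sheet still contains text such as footnotes or unit labels.
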