BIOL 5404 Assignment 3

Author

Roz Dakin

Due on

February 16, 2025

Instructions

Create a new .R script to complete this assignment in your local assignments folder for this course.
You’ll submit the .R script on Brightspace.
Put your FIRST and LAST NAME in the file name of the script.
Put your name at the top of the script as well.
Please do not include your student ID, just your name is enough.
Show your work!
Make sure your script is organized and legible.
Use code sections (####) & question numbers as outlined below.
Provide all written answers as brief # comments within your script.
Be sure to load all packages needed at the top of your script, like this (adding any other packages needed):

library(tidyverse)

In this assignment, we will use the mpg dataset (which comes with tidyverse) AND also the data contained in the zip file data_by_continent_country.zip, which is available on Brightspace.

Part 1

1a. Copy the custom function null_corr() below into your script. For 1a, look at the function and describe in your own words what this function is intended to do. What does null_corr take as input, what does it attempt to do, and what does it return as output?

null_corr <- function (ss = 100) { 
  x <- rnorm(ss, 0, 1)
  y <- rnorm(ss, 0, 1)
  ctest <- cor.test(x ~ y)
  return(setNames(c(ctest$estimate, ctest$p.value), c('estimate', 'p')))
}

1b. The version of null_corr() above has an error! Find the error and fix it. Then, demonstrate that the function works with a few examples, like this:

null_corr(ss = 50) # etc

1c. Briefly, why does null_corr() gives a different output each time it is run for a given value of ss?

1d. Now that you’ve got the function working, let’s use it to do a simulation.

In this simulation, we want to run null_corr() 10,000 times, and store the output each time.
We will store each output in a single row of a data.frame that we’ll create, called results.
After creating results, write a for loop to run null_corr() 10,000 times, and store each iteration’s output in the i^th row of results. Use the default ss of 100.

results <- data.frame(r = rep(NA, 10000), p = NA)
head(results)

   r  p
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA

1e. Now let’s examine the results. What percent of the time did you obtain a p-value < 0.05? Does your answer match what you would expect? Why do different students obtain slightly different values for the percent here?

1f. Now, repeat the exercise above, but with a LARGER value of ss, let’s say an ss of 1,000.

Make a new data.frame to store these new results for the larger sample size.
Run 10,000 iterations again, using the larger ss of 1,000.
What proportion of time do you expect to get a p-value < 0.05 this time?
Check to see if your expectation is correct.

1g. Use a histogram to examine the distribution of correlation coefficients obtained in your first run (when ss = 100). What, if anything, do you expect to change about this distribution when ss gets larger? (i.e., how do you expect this distribution to look for ss = 1000 vs. ss = 100?)

OPTIONAL: Use ggplot2 to investigate this by generating a plot with both distributions of the correlation coefficient r overlaid on top of each other, so they can be visually compared. (i.e., a plot that shows the distribution of r when ss = 100, and when ss = 1,000, on the same panel).

Part 2

There are two loops below, under headings Loop 1 and Loop 2. Each of these loops has some errors.

2a. Find and fix the two errors in Loop 1, and comment to identify and explain the changes you made. After making the fix, run Loop 1, and inspect the output to show that it’s now working.

# Loop 1 (Hint: this loop has 2 errors)
# The goal of this first loop is to fill in simresults_1
# with means sampled from a normal distribution

simresults_1 <- rep(NA, 1000) 

for (i in 1/1000) { 
  temp <- rnorm(100, 100, 10)
  simresults_1[i] < - mean(temp)
}

simresults_1

2b. Find and fix the three errors in Loop 2, and comment to identify and explain the changes you made. After making the fix, run Loop 2, and inspect the output to show that it’s now working.

# Loop 2 (Hint: this loop has 3 errors)
# The goal of this second loop is to fill in simresults_2
# with means sampled from a normal distribution

simresults_2 <- rep(NA, 1000) 

for (i from 1:1000) {
  temp <- rnorm(100, 99, 10)
  simresults2 <- mean(temp)
}

simresults_2

Part 3

In this section, we’ll work with the data provided in data_by_continent_country.zip on Brightspace. This zip archive contains a set of folders and files that have demographic data from countries around the world. Each data file includes info on a country’s population, average life expectancy, and GDP per capita across 5-year intervals from 1952 to 2007.

Aside: these data come from the gapminder package… they are readily available as a single object if you install the package. I’ve intentionally distributed them across folders here to give us a chance to practice reading in data & iterating a process like this when we have many files to read in nested directories.

Download data_by_continent_country.zip from Brightspace and uncompress/unzip it. The data are stored across folders, organized by continent, then country. Take a look at the folder structure using the explorer/finder on your computer. You should be able to see that there is a folder for each continent, with subfolders for each country, each of which has a csv file with data for a given country.

Follow this organization: PUT the unzipped folder data_by_continent_country in a dedicated directory for A3 (i.e., an A3 folder) that contains only:

your .R script for A3
the folder data_by_continent_country (and all its contents)

Make sure you match your workflow to this organization above, and when you grade your peer, grade them with this setup as well, with the A3 folder as your working directory.

3a. First, determine how many csv files there are in total.

You can use list.files() to do this, by getting the paths to all of the .csv files in data_by_continent_country.
Use the pattern argument in list.files() to identify csv files, and use the recursive argument to search all sub-directories by setting recursive = T.
Read the help for list.files() to understand what these arguments are doing.
Give the output of your call to list.files() a name (e.g., myvector), so that it creates a vector of paths to all the different csv files in data_by_continent_country.
How many csv files are there in total?

3b. Next, you are going to iterate over the vector you created above, reading in each of the csv files, and then storing the data from each of these files in the i^th element of a list that you create for this purpose.

Start by using the line below to create an empty list with the desired length:

mylist <- vector("list", length = length(myvector))

Now, write a for loop to read in each csv file and put into mylist. You can use read.csv() from base R or read_csv() in tidyverse.

Inspect the first element of your list. Try looking at a few more, did it work?

3c. A problem with what we have so far is that the csv files did not actually contain values for country and continent names. We need to incorporate that info so that the data will be useful & tidy!

Revise and repeat your loop above so that after reading in each csv file, your loop will also ADD a column with the correct continent name for that country, and ADD a column for the correct country name for that country. As above, store it in the i^th element of the list. Inspect it to make sure it worked.

Hint: you can use strsplit() from base R, or str_split() from tidyverse, to break up the elements of your paths based on a separator/pattern, and extract the continent and country info from the output of one of these string splitting functions.

Aside: when you have a list of data.frames like this, and all of the data.frames in the list have the exact same column names and types, you can use the rbindlist() function from the data.table package to bind them together into a single data.frame. It’s very useful!

OPTIONAL: Check for duplicates… there is ONE country that had its data accidentally duplicated. Using R, can you identify which country this is, and remove the duplicate data from the list?

OPTIONAL: Inspect the combined data for extreme values. Do the variables have the distributions and ranges you would expect? Identify any extreme values worthy of further investigation (i.e., to verify that outliers are indeed real).

Part 4

Now we are going to work with the mpg dataset that has information on vehicles (cars etc), their fuel efficiency, and other properties. Familiarize yourself with the variables in this dataset.

head(mpg)
help(mpg)

Let’s consider the structure of this tidy dataset. The column model represents the name of the car model:

table(mpg$model)


           4runner 4wd                     a4             a4 quattro 
                     6                      7                      8 
            a6 quattro                 altima     c1500 suburban 2wd 
                     3                      6                      5 
                 camry           camry solara            caravan 2wd 
                     7                      7                     11 
                 civic                corolla               corvette 
                     9                      5                      5 
     dakota pickup 4wd            durango 4wd         expedition 2wd 
                     9                      7                      3 
          explorer 4wd        f150 pickup 4wd           forester awd 
                     6                      7                      6 
    grand cherokee 4wd             grand prix                    gti 
                     8                      5                      5 
           impreza awd                  jetta        k1500 tahoe 4wd 
                     8                      9                      4 
land cruiser wagon 4wd                 malibu                 maxima 
                     2                      5                      3 
       mountaineer 4wd                mustang          navigator 2wd 
                     4                      9                      3 
            new beetle                 passat         pathfinder 4wd 
                     6                      7                      4 
   ram 1500 pickup 4wd            range rover                 sonata 
                    10                      4                      7 
               tiburon      toyota tacoma 4wd 
                     7                      7

filter(mpg, model == 'mustang')

# A tibble: 9 × 11
  manufacturer model   displ  year   cyl trans     drv     cty   hwy fl    class
  <chr>        <chr>   <dbl> <int> <int> <chr>     <chr> <int> <int> <chr> <chr>
1 ford         mustang   3.8  1999     6 manual(m… r        18    26 r     subc…
2 ford         mustang   3.8  1999     6 auto(l4)  r        18    25 r     subc…
3 ford         mustang   4    2008     6 manual(m… r        17    26 r     subc…
4 ford         mustang   4    2008     6 auto(l5)  r        16    24 r     subc…
5 ford         mustang   4.6  1999     8 auto(l4)  r        15    21 r     subc…
6 ford         mustang   4.6  1999     8 manual(m… r        15    22 r     subc…
7 ford         mustang   4.6  2008     8 manual(m… r        15    23 r     subc…
8 ford         mustang   4.6  2008     8 auto(l5)  r        15    22 r     subc…
9 ford         mustang   5.4  2008     8 manual(m… r        14    20 p     subc…

4a. Briefly, how would you describe the structure of the data? How many rows do we have for the 1999 Ford Mustang, for example? How many rows for the 2008 Ford Mustang? Why do we have multiple rows of Ford Mustangs for a given year?

4b. Do all of the car models have the same number of 1999 rows as we do for 1999 Ford Mustangs?

Hint: using group_by() and the tally() function gives a nice way to do this, or group_by() and count(), which is equivalent.

# For example: 

mpg %>% 
  group_by(class) %>% 
  count()

# A tibble: 7 × 2
# Groups:   class [7]
  class          n
  <chr>      <int>
1 2seater        5
2 compact       47
3 midsize       41
4 minivan       11
5 pickup        33
6 subcompact    35
7 suv           62

4c. Use dplyr to calculate the average, and range, for the number of ‘entries’ for a given model-year. By model-year, I mean things like ‘1999 Ford Mustang’, ‘2008 Audi A4’, etc. So, use dplyr functions to determine how many entries (rows) in the data there are for each model-year, and then compute the average, and range (min and max) for those counts.

OPTIONAL: Examine the drv column, which gives info on the vehicle’s drive train (categorized as either 4-wheel drive, front-wheel drive, or rear-wheel drive). Now print the model column, and note that some of the models also have similar information about the vehicle’s drive train in the model name as well. Using vectorized commands, extract any information about drive train from the model column (if it exists). Then, verify that that info you extracted from the model column matches the value in the drv column (quality control).

Hint: str_extract_all() from the stringr package will do this conveniently. Are there any issues/errors?

Part 5

Fuel efficiency values are found in these columns:

cty for efficiency during city driving in miles/gallon
hwy for efficiency during highway driving in miles/gallon

5a. Write a small custom function that converts fuel efficiency values in miles/gallon to fuel consumpsion rates in units of litres/100 km.

Check/demonstrate that your function works in your script.

To help with this: note that 20 miles/gallon is approximately 11.76 l/100 km, and 40 miles/gallon is approximately 5.88 l/100 km.
Note that this is trickier than you might initially think, because the conversion from mpg to l/100 km is non-linear (i.e., it’s NOT a simple linear transformation). This is because mpg as a measure of fuel efficiency has distance in the numerator. By contrast, l/100 km is a measure of fuel consumption, and has distance in the denominator.

5b. Using dplyr and your new function, add (mutate) two new columns into mpg that provide city and highway fuel consumption estimates in litres/100 km.

5c. Take a look at the trans column, which has multiple pieces of information about a vehicle’s transmission type (auto or manual) and the number of speeds (e.g,. m6, m5 etc). Create a new column that JUST identifies a car’s transmission type as either auto OR manual (stripping off the extra characters that provide information about speeds).

5d. How many cars in the dataset have automatic transmission, and how many have manual transmission?

Hint: there are many ways you could do this, but a convenient option from the tidyverse is str_extract(). For example:

# Example of how to use str_extract to look for either of two patterns:

mystring <- c('shebang', 'shalalalala', 'he and me', 'hahaha')

str_extract(string = mystring, pattern = 'he|ha')

[1] "he" "ha" "he" "ha"

5e. Take a look at the class column, which has info on vehicle size classes (there are 7 unique classes). Using R, which is the most common size class? Which is the least common?

5f. Create a new column in mpg that categorizes vehicles as either “small” (2seater, compact, and subcompact), “medium” (midsize), or “large” (minivan, pickup, and suv). You can use case_when(), or use nested calls to ifelse() to do this.

Part 6

6a. Create a graph that shows CITY fuel consumption (in litres/100 km) on the y-axis vs. the transmission type (either auto or manual) on x-axis, using either boxplots or violin plots. Make your plot so that it shows manual on the LEFT side of the plot, and auto on the RIGHT side of the plot, along the x-axis. Make any other aesthetic changes you wish.

HINT: you can do this by making the columns factors using factor() and specifying hte level order, or you can use factor() and do this within the call to ggplot.

6b. Similarly, create a graph that shows CITY fuel consumption (in litres/100 km) on the y-axis vs. the 3-level size class you created above, with the order along the x-axis showing small, medium, then large, from left to right. Again, make any other aesthetic changes you wish.

Extra optional challenges

This is for extra (excellent) practice, and is not graded.

Are there any vehicles (rows) in mpg wherein that vehicle’s city fuel efficiency in the cty column is BETTER (more efficient) than its highway fuel efficiency in the hwy column? Note that in miles/gallon, a LARGER value = better fuel efficiency.
Are there any car MODELS where the best option for CITY fuel efficiency was actually BETTER in 1999 than the best option for that model in 2008? If so, how many? Remember, larger values of miles/gallon indicate better fuel efficiency.

Note, to answer this you will have to do some data-wrangling and reshaping. It can be helpful to think through the steps and write them down in plain language first. I used group_by(), summarize(), and then pivot_wider() as my first steps.

Examine the different of car manufacturers in the dataset here:

table(mpg$manufacturer)


      audi  chevrolet      dodge       ford      honda    hyundai       jeep 
        18         19         37         25          9         14          8 
land rover    lincoln    mercury     nissan    pontiac     subaru     toyota 
         4          3          4         13          5         14         34 
volkswagen 
        27

Using dplyr functions, which manufacturer has has the best average HIGHWAY fuel efficiency in 1999? Which manufacturer has the best average HIGHWAY fuel efficiency in 2008?

Which manufacturer has has the best average CITY fuel efficiency in 1999? Which manufacturer has the best average CITY fuel efficiency in 2008?
Which of the manufacturers in this dataset showed the best IMPROVEMENT in average highway fuel efficiency over that decade?
Which of the manufacturers in this dataset showed the best IMPROVEMENT in average city fuel efficiency over that decade?