Preface

This exploratory data analysis (EDA) examines the paleobiogeography of dinosaur families in North America at the end of the Cretaceous Period.

The Late Cretaceous of North America

The last few thousand years of the Cretaceous Period in North America are represented by a series of geological formations bearing dinosaur fossils (Scollard, Frenchman, upper Hell Creek, Lance, and Denver, among others). These formations present a unique opportunity to study Cretaceous ecosystems in a temporally-constrained window, across a wide geographical range, representing variable environments.

The southern formations, including the upper Hell Creek, Lance, and Denver formations, represent more coastal settings because of their proximity to a receding interior seaway. The Frenchman and Scollard formations in Canada represent more northern habitats farther from the coast.

Datasets

The “cret_dino_abun.csv” file represents a working dataset of dinosaur fossil abundances collected from major museums and institutions across North America. Information includes:

  • Geological formation of origin (Scollard, Frenchman, Hell Creek, Lance, or Denver)

  • Abbreviation of the institution where specimens are housed

  • Dinosaur family/clade

  • County or general area of locality

Note: To maintain a large sample size, one instance of a fossil can range from a single tooth to an entire skeleton.

The “locality_dat.csv” file is a complementary dataset describing the counties or areas used in the “cret_dino_abun.csv” dataset. Information includes:

  • County or general area of locality

  • Average latitude and longitude of the county/area (center of the county)

  • Adjusted latitude and longitude of county/area (area of fossil localities)

  • State or province name

  • State or province abbreviation

Note: The average latitude and longitude values were obtained from location data available on Wikipedia. Adjusted latitude and longitude values were estimated based on locality data of fossils from the county.

Objectives

The objective of the overall project is to investigate whether there are detectable differences in the relative abundance of dinosaur families between areas, and whether these differences represent significant variations in the composition of dinosaur communities across North America.

In this EDA, the objective is to visualize the distribution of dinosaur fossils across North America and identify potential geographical trends in the relative abundance of dinosaur families/clades that should be pursued further as the project continues.

Preparations for Analysis

This R markdown HTML document was built with R version 4.3.2.

If you wish to see the R code used throughout this report, click on the ‘Show’ buttons.

Load Required Packages

Ensure that the following packages are properly installed.

  • devtools

  • tidyverse

  • rnaturalearth

  • rnaturalearthdata

  • sf

  • ggrepel

  • gridExtra

  • maps

  • RColorBrewer

  • knitr

library(devtools)
library(tidyverse)
library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
library(ggrepel)
library(gridExtra)
library(maps)
library(RColorBrewer)
library(knitr)

The package “rnaturalearthhires” must be installed manually using devtools:

devtools::install_github("ropensci/rnaturalearthhires")

Note: Ensure that the working directory is set to the current folder and all required csv files are in the working directory.

Import Data

Import the required files into R. In each case, check that the data loaded in correctly by looking at the top of the data set.

dino <- read_csv("cret_dino_abun.csv") 

# Show first 6 lines
kable(head(dino), format = "html", table.attr = "class='table table-striped table-hover table-bordered' style='margin:auto;'")
Geological Formation | Institution | Geographical Area | Dinosaur Family | Abundance
Scollard | AMNH | Dry Island | Ankylosauridae | 1
Scollard | CMN | Dry Island | Ceratopsidae | 2
Scollard | CMN | Dry Island | Thescelosauridae | 1
Scollard | CMN | Dry Island | Tyrannosauridae | 3
Scollard | CMN | Dry Island | Leptoceratopsidae | 3
Scollard | CMN | Dry Island | Ankylosauridae | 1

“cret_dino_abun.csv” will be referred to as “dino”

local <- read_csv("locality_dat.csv")

# Show first 6 lines
kable(head(local), format = "html", table.attr = "class='table table-striped table-hover table-bordered' style='margin:auto;'")
Geographical Area | Latitude | Longitude | Adjusted Latitude | Adjusted Longitude | State/Province | abbrev
Dry Island | 51.94 | -112.96 | 51.94 | -112.96 | Alberta | AB
GNP | 49.04 | -106.57 | 49.04 | -106.57 | Saskatchewan | SK
Eastend | 49.38 | -108.51 | 49.38 | -108.51 | Saskatchewan | SK
Denver | 39.74 | -104.98 | 39.74 | -104.98 | Colorado | CO
Garfield | 47.28 | -106.99 | 47.69 | -106.92 | Montana | MT
Rosebud | 46.23 | -106.72 | 46.26 | -106.59 | Montana | MT

“locality_dat.csv” will be referred to as “local”

Tidying and Data Hygiene

The following section performs a series of operations to tidy and clean the individual data frames, merge them, and clean the combined data frame. Details of the changes are annotated in the code.

Clean “dino” data frame

  • Adjusted column names

  • Checked that values in each column are reasonable

  • Tidied the abundance column by separating the aggregate abundances

# Adjust column names to shorten and include no special characters

names(dino)[1] <- 'fm'
names(dino)[2] <- 'inst'
names(dino)[3] <- 'area'
names(dino)[4] <- 'fam'
names(dino)[5] <- 'abun'

# Formation column (fm) should consist only of Scollard, Frenchman, Hell Creek, Lance, and Denver
unique(dino$fm) # All values are correct

# Check institution column (inst) for typos 
unique(dino$inst) # All values correct

# Check area column (area) for typos
unique(dino$area) # All values are correct

# Check family (fam) column for typos/duplicates
unique(dino$fam) # All values are correct

# Abundance column should have reasonable values greater than 0
range(dino$abun) # Range from 0 to 474. The zeros are unnecessary, so they will be removed

dino <- dino[dino$abun != 0, ] 
# Overwrite the dino df with data that has values greater than 0 in the abundance column

# Double check that all zero values have been removed
range(dino$abun) # Range from 1 to 474

# Since the abundance column is an aggregate of observations, it is not tidy. To tidy the dataset:
dino <- dino %>%
  uncount(weights = abun, .remove = TRUE)

# Now each column is a variable and row an observation
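As a minimal illustration of what `uncount()` does in the step above, the sketch below expands a small made-up table. The `toy` data frame and its values are hypothetical and are not rows from “cret_dino_abun.csv”.

```r
library(tidyr)
library(tibble)

# Hypothetical aggregated records: one row per family with a count
toy <- tibble(
  fam  = c("Ceratopsidae", "Tyrannosauridae"),
  abun = c(2, 3)
)

# Expand so that each row is a single fossil record;
# .remove = TRUE drops the abun column afterwards
toy_long <- uncount(toy, weights = abun, .remove = TRUE)

nrow(toy_long)       # equals sum(toy$abun)
table(toy_long$fam)  # per-family row counts match the original abun values
```

After expansion, the number of rows equals the sum of the original abundance column, which is a quick way to confirm that no records were lost or duplicated.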

Clean “local” data frame

  • Adjusted column names

  • Checked that values in each column are reasonable

# Adjust column names to shorten and include no special characters

names(local)[1] <- 'area' 
# These values complement those in the dino df and are given the same column name

names(local)[2] <- 'lat'
names(local)[3] <- 'long'
names(local)[4] <- 'adj_lat'
names(local)[5] <- 'adj_long'
names(local)[6] <- 'st_pr'
names(local)[7] <- 'st_pr_abb'

# Each area in the dino df needs complementary data in the local df
length(unique(dino$area)) == length(unique(local$area)) # Returns TRUE: same number of unique areas in both data frames

# Latitude values should range from 0 to 90 (northern hemisphere)
range(local$lat) 
range(local$adj_lat) 
# Both acceptable values between 39.12 (Colorado) and 51.94 (Alberta)

# Longitude values should range from -180 to 0 (western hemisphere)
range(local$long) 
range(local$adj_long)
# Both acceptable values (all negative and between -100 and -120)

# Check all state and province names are spelled correctly
unique(local$st_pr) # All Correct values

# Check all abbreviations are correct
unique(local$st_pr_abb) # All correct values

# Check that each abbreviation corresponds to the correct state/province: 
local %>%
  select(st_pr, st_pr_abb) %>% 
  distinct()
# Each state/province has the correct abbreviation
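Comparing the number of unique areas confirms that the counts agree, but not that the names themselves match. A possible stricter check is sketched below, using `anti_join()` and `setequal()` on small made-up data frames. `toy_dino` and `toy_local` are hypothetical stand-ins for “dino” and “local”, and the misspelled "Garfeild" is deliberate.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for the two data frames
toy_dino  <- tibble(area = c("Dry Island", "Garfield", "Garfeild"))
toy_local <- tibble(area = c("Dry Island", "Garfield", "Rosebud"))

# Same number of unique areas, so the length comparison passes...
same_count <- length(unique(toy_dino$area)) == length(unique(toy_local$area))

# ...but one area in toy_dino has no match in toy_local
missing_in_local <- anti_join(toy_dino, toy_local, by = "area")

# setequal() compares the actual sets of names
same_names <- setequal(toy_dino$area, toy_local$area)
```

An empty `missing_in_local` would indicate that every area in the fossil data has locality information available before merging.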

Merge data frames

The two data frames are combined into “dino_loc”, which ties together the fossil information with locality information. Check that the data frames merged correctly by looking at the top of the new data frame. This data frame will be used for the visualizations.

# Merge the two data frames into a new data frame
dino_loc <- left_join(dino, local, by = 'area')

# Show first 6 lines
kable(head(dino_loc), format = "html", table.attr = "class='table table-striped table-hover table-bordered' style='margin:auto;'")
fm | inst | area | fam | lat | long | adj_lat | adj_long | st_pr | st_pr_abb
Scollard | AMNH | Dry Island | Ankylosauridae | 51.94 | -112.96 | 51.94 | -112.96 | Alberta | AB
Scollard | CMN | Dry Island | Ceratopsidae | 51.94 | -112.96 | 51.94 | -112.96 | Alberta | AB
Scollard | CMN | Dry Island | Ceratopsidae | 51.94 | -112.96 | 51.94 | -112.96 | Alberta | AB
Scollard | CMN | Dry Island | Thescelosauridae | 51.94 | -112.96 | 51.94 | -112.96 | Alberta | AB
Scollard | CMN | Dry Island | Tyrannosauridae | 51.94 | -112.96 | 51.94 | -112.96 | Alberta | AB
Scollard | CMN | Dry Island | Tyrannosauridae | 51.94 | -112.96 | 51.94 | -112.96 | Alberta | AB

Clean “dino_loc” data frame

  • Checked and adjusted formation names to match a geographical region

  • Checked for any NA values

# Check that each county matches a corresponding formation
dino_loc %>%
  select(area, fm, st_pr_abb) %>% # Selects only area, formation, and the state/province abbreviation
  distinct() %>% # Select only the distinct combinations
  group_by(area) %>% # Group them by the area column
  filter(n() > 1) %>% # Keeps only areas that appear with more than one formation
  arrange(area) # Arranges the answers by area so it is easy to see the duplicated areas

### Note: Differences may be due to the age of the collections. Older records may refer to formations as Lance, regardless of location.

# The name of each formation corresponds to a province/state(s). The Hell Creek Formation is the only one spread over multiple states (MT, SD, and ND) and the rest of the Formations are restricted to a state/province. 

# Adjust so that the formation labels correspond to the correct state/province:
dino_loc$fm[dino_loc$st_pr_abb == 'MT' | dino_loc$st_pr_abb == 'SD' | dino_loc$st_pr_abb == 'ND'] <- 'Hell Creek'
dino_loc$fm[dino_loc$st_pr_abb == 'WY'] <- 'Lance'
dino_loc$fm[dino_loc$st_pr_abb == 'CO'] <- 'Denver'
dino_loc$fm[dino_loc$st_pr_abb == 'SK'] <- 'Frenchman'
dino_loc$fm[dino_loc$st_pr_abb == 'AB'] <- 'Scollard'

# Check that there are no NA values in the data frame
unique(is.na(dino_loc)) # All returns FALSE. No NA values
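If missing values were present, a per-column count would show where they occur. The sketch below uses base R on a small made-up data frame. The `toy` object and its missing latitude are hypothetical.

```r
# Hypothetical data frame with one missing latitude
toy <- data.frame(
  area = c("Dry Island", "Garfield"),
  lat  = c(51.94, NA)
)

has_na    <- anyNA(toy)           # TRUE if any cell in the data frame is NA
na_by_col <- colSums(is.na(toy))  # number of NA values in each column
```

Here `na_by_col` would point to the `lat` column as the one needing attention.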

Data Wrangling

The following section modifies the data frame to fit the needs of this EDA. Details of the changes are annotated in the code.

Wrangling “dino_loc” data frame

  • Updated and consolidated dinosaur family/clade names

  • Ordered formations from North to South

# Some classifications of the fossils are outdated or incorrect, and should be consolidated. 

# The groups that need to be consolidated are as follows:

# Caenagnathidae <- Avimimidae, Oviraptoridae
# Tyrannosauridae <- Megalosauridae
# Hadrosauridae <- Iguanodontidae
# Dromaeosauridae <- Small Theropod
# Thescelosauridae <- Hypsilophodontidae

# To adjust these names:

dino_loc$fam[dino_loc$fam == 'Avimimidae' | dino_loc$fam == 'Oviraptoridae'] <- 'Caenagnathidae'
# Avimimidae is a dubious basal lineage
# Oviraptoridae only known from Asia
# Both consolidated into the closely related Caenagnathidae

dino_loc$fam[dino_loc$fam == 'Megalosauridae'] <- 'Tyrannosauridae'
# Megalosauridae not known from the Cretaceous of North America
# Consolidated into Tyrannosauridae, the only large theropod in North America at the time

dino_loc$fam[dino_loc$fam == 'Iguanodontidae'] <- 'Hadrosauridae'
# Iguanodontidae not known from the Cretaceous of North America
# Consolidated into Hadrosauridae, a related group that is abundant in the Late Cretaceous

dino_loc$fam[dino_loc$fam == 'Small Theropod'] <- 'Dromaeosauridae'
# Small theropod was a descriptor used by the RSM for unidentified small theropod teeth and made up a sizable portion of the Frenchman Formation material
# Tentatively consolidated into Dromaeosauridae, since the teeth of other groups of small theropods are fairly easily diagnosable 

dino_loc$fam[dino_loc$fam == 'Hypsilophodontidae'] <- 'Thescelosauridae'
# Thescelosaurus was previously placed into Hypsilophodontidae
# Consolidated into Thescelosauridae, the new family of Thescelosaurus

# To check that names were adjusted accordingly
unique(dino_loc$fam)

# Order the Formations from North to South
dino_loc <- dino_loc %>%
  mutate(fm = factor(fm, levels = c("Scollard", "Frenchman", "Hell Creek", "Lance", "Denver")))
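The family consolidations above could also be expressed as a single lookup, which keeps all of the mappings in one place. This is only a sketch of an alternative, not the approach used in this report. The `fam_lookup` vector reuses the mappings listed above, while the `fam` vector is a hypothetical stand-in for `dino_loc$fam`.

```r
# Named vector: outdated name -> consolidated name (mappings as listed above)
fam_lookup <- c(
  "Avimimidae"         = "Caenagnathidae",
  "Oviraptoridae"      = "Caenagnathidae",
  "Megalosauridae"     = "Tyrannosauridae",
  "Iguanodontidae"     = "Hadrosauridae",
  "Small Theropod"     = "Dromaeosauridae",
  "Hypsilophodontidae" = "Thescelosauridae"
)

# Hypothetical vector of family labels standing in for dino_loc$fam
fam <- c("Oviraptoridae", "Ceratopsidae", "Small Theropod", "Megalosauridae")

# Replace names found in the lookup; leave all other names unchanged
needs_update <- fam %in% names(fam_lookup)
fam[needs_update] <- unname(fam_lookup[fam[needs_update]])

fam
```

Names that are not in the lookup, such as Ceratopsidae here, pass through untouched.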

Summary Statistics

The current sample size of this data set is 9328 fossils representing 15 dinosaur families or clades.

The data set contains records from 23 institutions, collected from across 7 states and provinces in North America.
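Counts like these can be recomputed from the merged data frame whenever new records are added. The sketch below shows one way to do so with `n()` and `n_distinct()`. The `toy_loc` data frame is a small hypothetical stand-in for “dino_loc” that reuses its column names.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for dino_loc (one row per fossil record)
toy_loc <- tibble(
  fam   = c("Ceratopsidae", "Ceratopsidae", "Hadrosauridae", "Tyrannosauridae"),
  inst  = c("CMN", "AMNH", "CMN", "CMN"),
  st_pr = c("Alberta", "Montana", "Montana", "Alberta")
)

toy_summary <- toy_loc %>%
  summarise(
    n_fossils      = n(),               # total number of records
    n_families     = n_distinct(fam),   # number of families/clades
    n_institutions = n_distinct(inst),  # number of institutions
    n_states_provs = n_distinct(st_pr)  # number of states and provinces
  )

toy_summary
```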

Data Visualizations

The following section will visualize the data in various formats, highlighting different aspects of the data.

I. Basic Data Visualizations

The following graphs represent basic visualizations of the number of dinosaur fossils when sorted by geological formation, institution, geographical area, and dinosaur family/clade.

1. Instances of Dinosaur Fossils per Formation

This graph shows the number of dinosaur fossils recorded per geological formation. The graph illustrates a clear sampling bias toward the Hell Creek and Lance formations.

abun_fm <- dino_loc %>%
  count(fm) %>%
  mutate(fm = factor(fm, levels = fm[order(-n)])) %>% # Order from largest to smallest values
  ggplot(aes(x = fm, y = n)) +
  geom_col() +
  labs(x = "Formations", y = "Number of Specimens") +
  theme_bw()

print(abun_fm)

2. Instances of Dinosaur Fossils per Institution

This graph shows the number of dinosaur fossils recorded at each institution. The graph illustrates that most of the Cretaceous dinosaur fossils are housed in a small number of museums.

abun_inst <- dino_loc %>%
  count(inst) %>%
  mutate(inst = factor(inst, levels = inst[order(-n)])) %>%
  ggplot(aes(x = inst, y = n)) +
  geom_col() +
  labs(x = "Institutions", y = "Number of Specimens") +
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(abun_inst)

3. Instances of Dinosaur Fossils per Geographical Area

This graph shows the number of dinosaur fossils recorded from each geographical area. The areas are grouped by their corresponding geological formation. The graph illustrates that most fossils for a given formation come from a few productive areas. This is important to note, since it reveals that not all areas will be useful for this analysis.

abun_area <- dino_loc %>%
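  count(fm, area) %>% # Count fossils per formation and area (result is ungrouped)
  arrange(fm, desc(n)) %>% # Order by formation, then by decreasing number of specimens
  mutate(area = factor(area, levels = unique(area))) %>% # Keep this order on the x-axis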
  ggplot(aes(x = area, y = n, fill = fm)) +
  geom_col() +
  labs(x = "Geographical Areas", y = "Number of Specimens", fill = "Formation") +  # Edit the legend title
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(abun_area)

4. Instances of Dinosaur Fossils per Dinosaur Family/Clade

This graph shows the number of dinosaur fossils recorded for each dinosaur family/clade. The graph illustrates that the most common groups of dinosaurs in these formations (based on fossil counts) are the large herbivores, Ceratopsidae (horned dinosaurs) and Hadrosauridae (duck-billed dinosaurs).

abun_fam <- dino_loc %>%
  count(fam) %>%
  mutate(fam = factor(fam, levels = fam[order(-n)])) %>%
  ggplot(aes(x = fam, y = n)) +
  geom_col() + 
  labs(x = "Families/Clades", y = "Number of Specimens") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(abun_fam)

II. Dinosaur Abundance Distribution

The following graphs represent visualizations focusing on the distribution of these abundance values across geological formations.

5. Abundance of Dinosaur Clades per Formation

This graph shows the number of dinosaur fossils, per clade, present in each formation. This illustrates how common the fossils of a particular group are in a formation.

# Dinosaur abundance by formation
abun_fam_fm <- dino_loc %>%
  count(fam, fm) %>%
  ggplot(aes(x = fam, y = n, fill = fm)) +
  geom_col(position = "dodge") +
  labs(x = "Families/Clades", y = "Number of Specimens", fill = "Formations") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(abun_fam_fm)

Note: This graph shows abundances, but the different sample sizes of each formation (see graph 1) make less-sampled formations harder to compare. A better alternative to make formations comparable is to use relative abundances.

Making a Relative Abundance Data Frame

A new data frame with a relative abundance value (r_abun) is required. The relative abundance represents the percentage of the total abundance that a specific group represents. This new data frame will be called “dino_abun”.

# Create a new data frame with relative abundance of dinosaur clades by formation
dino_abun <- dino_loc %>%
  group_by(fm, fam) %>% 
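  summarise(total_abun = n(), .groups = "drop_last") %>% # Count fossils per formation and family; result stays grouped by fm
  mutate(r_abun = (total_abun / sum(total_abun)) * 100) %>% # Percentage of the formation total
  ungroup() %>%
  complete(fm, fam, fill = list(total_abun = 0, r_abun = 0)) # Fill family/formation combinations that have no records with 0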
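To make the calculation concrete, the sketch below works through the same steps on a small made-up data set and checks that the percentages within each formation sum to 100. The `toy_loc` data frame is hypothetical, and `count()` is used here as a shorthand for grouping and counting.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical records: 4 fossils from one formation, 2 from another
toy_loc <- tibble(
  fm  = c(rep("Scollard", 4), rep("Denver", 2)),
  fam = c("Ceratopsidae", "Ceratopsidae", "Ceratopsidae", "Tyrannosauridae",
          "Ceratopsidae", "Hadrosauridae")
)

toy_abun <- toy_loc %>%
  count(fm, fam, name = "total_abun") %>%                   # fossils per formation and family
  group_by(fm) %>%
  mutate(r_abun = total_abun / sum(total_abun) * 100) %>%   # percentage of the formation total
  ungroup() %>%
  complete(fm, fam, fill = list(total_abun = 0, r_abun = 0)) # add absent combinations as 0

# Percentages within each formation should sum to 100
check <- toy_abun %>%
  group_by(fm) %>%
  summarise(total_pct = sum(r_abun))
```

In this toy case, Ceratopsidae make up 3 of the 4 Scollard records (75%) and 1 of the 2 Denver records (50%), even though the raw counts differ, which is what makes formations with different sample sizes comparable.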

6. Relative Abundance of Dinosaur Clades per Formation

This graph shows the relative abundance of dinosaur clades present in each formation. This better illustrates how common the fossils of a particular group are in each formation and makes comparisons between formations more practical. We can see clear trends where certain groups are more abundant in specific formations.

# Relative abundance of dinosaur clades by formation
r_abun_fam_fm <- ggplot(dino_abun, aes(x = fam, y = r_abun, fill = fm)) + 
  geom_col(position = "dodge") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Family", y = "Relative Abundance (%)", fill = "Formation")

print(r_abun_fam_fm)

Note: Although this graph better illustrates abundance trends between formations, some formations can cover vast geographical areas. To get more detailed data, it might be necessary to further subdivide the formations. To do this, we need to look at the distribution of the areas and localities producing these fossils.

III. Fossil Distribution Visualizations

The following graphs represent the distribution of the above data across North America. The following map visualizations require map shape files to be loaded.

world <- map_data("world") 
states <- ne_states(country = "united states of america", returnclass = "sf") # Load state boundaries
provinces <- ne_states(country = "canada", returnclass = "sf") # Load province boundaries
us_counties <- map_data("county") # Load US county boundaries
states_provinces <- rbind(states, provinces) # Combine state and province boundaries

7. Distribution of Counties/Areas with Dinosaur Fossils

This graph plots all counties and areas with records of dinosaur fossils. The graph illustrates areas where Late Cretaceous layers are exposed. Note that the Hell Creek Formation covers a vast geographical area spanning many states. In the case of Canadian areas without counties, the actual locality is plotted.

dino_dist_area <- ggplot(data = world) + 
  geom_map(map = world, aes(map_id = region), fill = "white", color = "black") +
  geom_map(data = us_counties, map = us_counties, aes(map_id = region), fill = NA, color = "grey") + # Add US county boundaries
  geom_sf(data = states_provinces, color = "black", fill = NA) + # Add state and province boundaries
  geom_point(data = dino_loc, aes(x = long, y = lat, color = fm), size = 2) +
  coord_sf(xlim = c(-115, -100), ylim = c(38, 53), expand = FALSE) +  # Set limits for the map
  labs(x = 'Longitude', y = "Latitude", color = "Formations") + 
  theme(axis.text.x = element_text(angle = -45, hjust = -0.1))

print(dino_dist_area)

Note: Although this graph illustrates the areas with dinosaur fossils, each point only indicates the presence of fossils in the specific county, not the actual location fossils are coming from. A better illustration would be to look at the distribution of localities (adjusted latitude and longitude values).

8. Distribution of Localities with Dinosaur Fossils

This graph plots the estimated location of the major fossil-bearing localities within each county. The graph better illustrates areas where Late Cretaceous layers are exposed. Where locality data could not be obtained, the point has been left at its original position.

dino_dist_loc <- ggplot(data = world) + 
  geom_map(map = world, aes(map_id = region), fill = "white", color = "black") +
  geom_map(data = us_counties, map = us_counties, aes(map_id = region), fill = NA, color = "grey") + # Add US county boundaries
  geom_sf(data = states_provinces, color = "black", fill = NA) + # Add state and province boundaries
  geom_point(data = dino_loc, aes(x = adj_long, y = adj_lat, color = fm), size = 2) +
  coord_sf(xlim = c(-115, -100), ylim = c(38, 53), expand = FALSE) +  # Set limits for the map
  labs(x = 'Longitude', y = "Latitude", color = "Formations") + 
  theme(axis.text.x = element_text(angle = -45, hjust = -0.1))

print(dino_dist_loc)

Note: Some localities are represented by only a few isolated specimens. These can be filtered out to better illustrate the concentrations of fossils.

filtered_dino_dist_loc <- dino_loc %>% 
  group_by(area, adj_long, adj_lat, fm) %>%
  summarise(n = n(), .groups = "drop") %>% # Count specimens per locality and drop the grouping
  filter(n > 100) # Adjust this threshold to change the minimum number of specimens per locality

9. Distribution of Localities with Dinosaur Fossils (n > 100)

This graph restricts the plot to localities with over 100 recorded specimens, which better illustrates clusters or “hot-spots” for dinosaur fossils.

fil_dino_dist_loc <- ggplot(data = world) + 
  geom_map(map = world, aes(map_id = region), fill = "white", color = "black") +
  geom_map(data = us_counties, map = us_counties, aes(map_id = region), fill = NA, color = "grey") + # Add US county boundaries
  geom_sf(data = states_provinces, color = "black", fill = NA) + # Add state and province boundaries
  geom_text_repel(data = filtered_dino_dist_loc, aes(x = adj_long, y = adj_lat, label = area), 
            size = 2.5, hjust = 0.5, vjust = 0) +
  geom_point(data = filtered_dino_dist_loc, aes(x = adj_long, y = adj_lat, color = fm), size = 2) +
  coord_sf(xlim = c(-115, -100), ylim = c(38, 53), expand = FALSE) +  # Set limits for the map
  labs(x = 'Longitude', y = "Latitude", color = "Formations") + 
  theme(axis.text.x = element_text(angle = -45, hjust = -0.1))

print(fil_dino_dist_loc)

By filtering to areas with over 100 specimens, we can see fairly obvious clusters of sites:

  • Scollard - Dry Island

  • Frenchman - Eastend, GNP

  • Hell Creek NW - Garfield, McCone

  • Hell Creek SE - Carter, Slope, Fallon, Harding

  • Lance - Weston, Niobrara

  • Denver - Denver

The previous plot can be adjusted to also show localities with fewer than 100 specimens. Some of these localities can be added to the previously defined clusters based on distance. Black points on the map indicate localities with fewer than 100 specimens.

ggplot(data = world) + 
  geom_map(map = world, aes(map_id = region), fill = "white", color = "black") +
  geom_map(data = us_counties, map = us_counties, aes(map_id = region), fill = NA, color = "grey") + # Add US county boundaries
  geom_sf(data = states_provinces, color = "black", fill = NA) + # Add state and province boundaries
  geom_text_repel(data = unique(dino_loc[, c("adj_long", "adj_lat", "area")]), 
                  aes(x = adj_long, y = adj_lat, label = area), 
                  size = 2, hjust = 0, vjust = 0, 
                  max.overlaps = Inf) +
  geom_point(data = dino_loc, aes(x = adj_long, y = adj_lat), size = 1) +
  geom_point(data = filtered_dino_dist_loc, aes(x = adj_long, y = adj_lat, color = fm), size = 2) +
  coord_sf(xlim = c(-115, -100), ylim = c(38, 53), expand = FALSE) +  # Set limits for the map 
  labs(x = 'Longitude', y = "Latitude", color = "Formations") + 
  theme(axis.text.x = element_text(angle = -45, hjust = -0.1))

Some areas can be added to major clusters, and although other clusters exist, the sample sizes are too small to be meaningful.

Major groupings

  • Scollard - Dry Island

  • Frenchman - Eastend, GNP

  • Hell Creek NW - Garfield, McCone

  • Hell Creek SE - Carter, Slope, Fallon, Harding + Powder River, Bowman

  • Lance E - Weston, Niobrara + Converse

  • Denver - Denver + Jefferson

Other groupings

  • Hell Creek E - Sioux, Corson, Ziebach

  • Hell Creek S - Butte, Perkins, Meade

  • Lance NW - Park, Big Horn

  • Lance SW - Sweetwater, Carbon

Undefined

  • Hell Creek - Dawson, Rosebud, Petroleum, Billings, Morton

  • Lance - Natrona, Goshen, Hot Springs

Add a new column to dino_loc that assigns each area to one of the geographical subdivisions outlined above.

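# Assign subdivisions based on the values in the 'area' column
# Areas listed as "Undefined" above are left as NA
dino_loc$subdivision <- NA_character_ # Create the column before filling it in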
dino_loc$subdivision[dino_loc$area %in% c("Dry Island")] <- "Scollard"
dino_loc$subdivision[dino_loc$area %in% c("Eastend", "GNP")] <- "Frenchman"
dino_loc$subdivision[dino_loc$area %in% c("Garfield", "McCone")] <- "Hell Creek NW"
dino_loc$subdivision[dino_loc$area %in% c("Carter", "Slope", "Fallon", "Harding", "Powder River", "Bowman")] <- "Hell Creek SE"
dino_loc$subdivision[dino_loc$area %in% c("Weston", "Niobrara", "Converse")] <- "Lance E"
dino_loc$subdivision[dino_loc$area %in% c("Denver", "Jefferson")] <- "Denver"
dino_loc$subdivision[dino_loc$area %in% c("Sioux", "Corson", "Ziebach")] <- "Hell Creek E"
dino_loc$subdivision[dino_loc$area %in% c("Butte", "Perkins", "Meade")] <- "Hell Creek S"
dino_loc$subdivision[dino_loc$area %in% c("Park", "Big Horn")] <- "Lance NW"
dino_loc$subdivision[dino_loc$area %in% c("Sweetwater", "Carbon")] <- "Lance SW"
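The same assignment could instead be stored as a small lookup table and joined to the data, which may be easier to maintain if the groupings change. The sketch below is one possible alternative, not the method used in this report. `subdivision_key` contains only a few of the assignments listed above, and `toy_loc` is a hypothetical stand-in for “dino_loc”.

```r
library(dplyr)
library(tibble)

# Lookup table holding a few of the area-to-subdivision assignments listed above
subdivision_key <- tribble(
  ~area,        ~subdivision,
  "Dry Island", "Scollard",
  "Eastend",    "Frenchman",
  "GNP",        "Frenchman",
  "Garfield",   "Hell Creek NW",
  "McCone",     "Hell Creek NW"
)

# Hypothetical stand-in for dino_loc; "Dawson" has no assigned subdivision
toy_loc <- tibble(area = c("GNP", "Garfield", "Dawson", "Dry Island"))

# Areas without an entry in the key receive NA, as in the assignment above
toy_loc <- left_join(toy_loc, subdivision_key, by = "area")

toy_loc
```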

This can be plotted on the previous map to better visualize how the clusters are geographically organized. Most clusters show fairly obvious separation from each other. However, some clusters (Hell Creek SE, E, and S) are harder to distinguish, as they appear to be parts of a larger exposure.

ggplot(data = world) + 
  geom_map(map = world, aes(map_id = region), fill = "white", color = "black") +
  geom_map(data = us_counties, map = us_counties, aes(map_id = region), fill = NA, color = "grey") + 
  geom_sf(data = states_provinces, color = "black", fill = NA) + 
  geom_point(data = dino_loc, aes(x = adj_long, y = adj_lat, color = subdivision), size = 2) +
  coord_sf(xlim = c(-115, -100), ylim = c(38, 53), expand = FALSE) +
  labs(x = 'Longitude', y = "Latitude", color = "Fm Subdivisions") +
  scale_color_manual(values = brewer.pal(11, "Paired")) +
  theme(axis.text.x = element_text(angle = -45, hjust = -0.1))

Making a Relative Abundance Data Frame based on Subdivision

A new data frame, “dino_abun_sd”, is created with the relative abundance of dinosaur clades per subdivision. Subdivisions with fewer than 100 total specimens will be disregarded.

# Create a new data frame with relative abundance of dinosaur clades by subdivision

dino_abun_sd <- dino_loc %>%
  group_by(subdivision, fm, fam) %>% 
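  summarise(total_abun = n(), .groups = "drop_last") %>% # Count fossils per subdivision and family; result stays grouped by subdivision and fm
  mutate(r_abun = (total_abun / sum(total_abun)) * 100) %>% # Percentage of the subdivision total
  group_by(subdivision) %>%
  filter(sum(total_abun) >= 100) %>% # Keep subdivisions with at least 100 specimens
  ungroup() %>%
  mutate(subdivision = factor(subdivision, levels = c("Scollard", "Frenchman", "Hell Creek NW", "Hell Creek SE", "Lance E", "Denver"))) %>% # Order from north to south
  filter(!is.na(subdivision)) # Drop areas with no subdivision or a subdivision not listed above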

10. Relative Abundance of Dinosaur Families/Clades per Subdivision (n > 100)

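This graph shows the relative abundance of dinosaur families per subdivision (with over 100 specimens). Several groups show potential north-south trends in relative abundance. For example, Tyrannosauridae decrease from 25.94% in the Scollard to 3.16% in the Denver, while Hadrosauridae increase from 5.83% to 31.93%.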

# Relative abundance of dinosaur clades by subdivision
r_abun_fam_subdivision <- ggplot(dino_abun_sd, aes(x = fam, y = r_abun, fill = subdivision)) + 
  geom_col(width = 0.8, position = "dodge") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Family", y = "Relative Abundance (%)", fill = "Subdivision")

print(r_abun_fam_subdivision)

Relative Abundance Values of Dinosaur Families/Clades per Subdivision (n > 100)

This table summarizes the relative abundances (in percent) of each family/clade, categorized by each subdivision.

dino_abun_sd_table <- dino_abun_sd %>%
  group_by(fam, subdivision) %>%
  summarise(r_abun = sum(r_abun), .groups = "drop") %>%
  pivot_wider(names_from = subdivision, values_from = r_abun) # One column per subdivision

names(dino_abun_sd_table)[1] <- 'Family/Clade' 

# Order the columns from north to south
dino_abun_sd_table <- dino_abun_sd_table[, c('Family/Clade', "Scollard", "Frenchman", "Hell Creek NW", "Hell Creek SE", "Lance E", "Denver")]


kable(dino_abun_sd_table, format = "html", digits = 2, table.attr = "class='table table-striped table-hover table-bordered' style='margin:auto;'")
Family/Clade | Scollard | Frenchman | Hell Creek NW | Hell Creek SE | Lance E | Denver
Alvarezsauridae | NA | 0.09 | 0.11 | 0.07 | 0.10 | NA
Ankylosauridae | 6.77 | 0.84 | 0.97 | 0.66 | 1.19 | NA
Caenagnathidae | 0.75 | 1.31 | 0.07 | 1.25 | 0.13 | NA
Ceratopsidae | 15.79 | 26.92 | 27.13 | 29.74 | 32.36 | 32.63
Dromaeosauridae | 9.96 | 29.27 | 14.57 | 5.86 | 7.46 | 4.56
Hadrosauridae | 5.83 | 15.67 | 15.04 | 29.96 | 27.32 | 31.93
Leptoceratopsidae | 6.02 | 0.09 | NA | 0.59 | 0.27 | NA
Nodosauridae | 0.38 | 0.09 | 0.25 | 0.22 | 0.36 | NA
Ornithomimidae | 8.46 | 6.10 | 3.17 | 4.25 | 1.46 | 2.46
Pachycephalosauridae | 2.44 | 0.47 | 1.33 | 3.37 | 1.33 | NA
Paronychodon sp. | 2.82 | 0.38 | 5.27 | 0.88 | 10.54 | 2.81
Richardoestesia sp. | 5.64 | 1.31 | 18.40 | 2.93 | 2.79 | 21.05
Thescelosauridae | 7.71 | 5.91 | 2.42 | 7.55 | 4.51 | 0.70
Troodontidae | 1.50 | 0.84 | 1.15 | 1.17 | 4.38 | 0.70
Tyrannosauridae | 25.94 | 10.69 | 10.10 | 11.50 | 5.80 | 3.16
Conclusions

This EDA revealed a number of potential trends to be pursued further as this project progresses. These trends will need to be tested statistically, but their presence at this stage is promising. If they are statistically significant and persist with the addition of new specimens, they may provide valuable new data on the habitat preferences of, and ecological interactions between, various dinosaur groups immediately prior to the end-Cretaceous mass extinction.