BIOL 5404 EDA - Late Cretaceous Dinosaurs of North America

Preface

This Exploratory Data Analysis (EDA) will explore the paleobiogeography of dinosaur families in North America at the end of the Cretaceous Period.

The Late Cretaceous of North America

The last few thousand years of the Cretaceous Period in North America are represented by a series of geological formations bearing dinosaur fossils (Scollard, Frenchman, upper Hell Creek, Lance, and Denver, among others). These formations present a unique opportunity to study Cretaceous ecosystems in a temporally constrained window, across a wide geographical range, and in varied environments.

The southern formations, including the upper Hell Creek, Lance, and Denver formations, represent more coastal settings because of their proximity to a receding interior seaway. The Frenchman and Scollard formations in Canada represent more northern habitats farther from the coast.

Datasets

The “cret_dino_abun.csv” file represents a working dataset of dinosaur fossil abundances collected from major museums and institutions across North America. Information includes:

Geological formation of origin (Scollard, Frenchman, Hell Creek, Lance, or Denver)
Abbreviation of the institution where specimens are housed
Dinosaur family/clade
County or general area of locality

Note: To maintain a large sample size, one instance of a fossil can range from a single tooth to an entire skeleton.

The “locality_dat.csv” file represents a complementary dataset on the county or area used in the “cret_dino_abun.csv” dataset. Information includes:

County or general area of locality
Average latitude and longitude of the county/area (center of the county)
Adjusted latitude and longitude of county/area (area of fossil localities)
State or province name
State or province abbreviation

Note: The average latitude and longitude values were obtained from location data available on Wikipedia. Adjusted latitude and longitude values were estimated based on locality data of fossils from the county.

Objectives

The objective of the overall project is to investigate whether there are detectable differences in the relative abundance of dinosaur families between areas, and whether these differences represent significant variation in the composition of dinosaur communities across North America.

In this EDA, the objective is to visualize the distribution of dinosaur fossils across North America and identify potential geographical trends in the relative abundance of dinosaur families/clades that should be pursued further as the project continues.

Preparations for Analysis

This R markdown HTML document was built with R version 4.3.2.

If you wish to see the R code used throughout this report, click on the ‘Show’ buttons.

Load Required Packages

Ensure that the following packages are properly installed.

devtools
tidyverse
rnaturalearth
rnaturalearthdata
sf
ggrepel
gridExtra
maps
RColorBrewer
knitr

library(devtools)
library(tidyverse)
library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
library(ggrepel)
library(gridExtra)
library(maps)
library(RColorBrewer)
library(knitr)

The package “rnaturalearthhires” must be installed manually using devtools:

devtools::install_github("ropensci/rnaturalearthhires")

Note: Ensure that the working directory is set to the current folder and all required csv files are in the working directory.
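
The note above can be checked before importing anything. The following is a minimal sketch using only base R. The object names (required_files, found) are illustrative and not part of the original report.

```r
# File names taken from the Datasets section
required_files <- c("cret_dino_abun.csv", "locality_dat.csv")

# file.exists() is vectorised: one TRUE/FALSE per path, relative to getwd()
found <- file.exists(required_files)

if (all(found)) {
  message("All required files found in: ", getwd())
} else {
  message("Missing from ", getwd(), ": ",
          paste(required_files[!found], collapse = ", "))
}
```

If a file is reported missing, either move it into the working directory or change the directory with setwd() before running the import step.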

Import Data

Import the required files into R. In each case, check that the data loaded in correctly by looking at the top of the data set.

dino <- read_csv("cret_dino_abun.csv") 

# Show first 6 lines
kable(head(dino), format = "html", table.attr = "class='table table-striped table-hover table-bordered' style='margin:auto;'")

Geological Formation	Institution	Geographical Area	Dinosaur Family	Abundance
Scollard	AMNH	Dry Island	Ankylosauridae	1
Scollard	CMN	Dry Island	Ceratopsidae	2
Scollard	CMN	Dry Island	Thescelosauridae	1
Scollard	CMN	Dry Island	Tyrannosauridae	3
Scollard	CMN	Dry Island	Leptoceratopsidae	3
Scollard	CMN	Dry Island	Ankylosauridae	1

“cret_dino_abun.csv” will be referred to as “dino”

local <- read_csv("locality_dat.csv")

# Show first 6 lines
kable(head(local), format = "html", table.attr = "class='table table-striped table-hover table-bordered' style='margin:auto;'")

Geographical Area	Latitude	Longitude	Adjusted Latitude	Adjusted Longitude	State/Province	abbrev
Dry Island	51.94	-112.96	51.94	-112.96	Alberta	AB
GNP	49.04	-106.57	49.04	-106.57	Saskatchewan	SK
Eastend	49.38	-108.51	49.38	-108.51	Saskatchewan	SK
Denver	39.74	-104.98	39.74	-104.98	Colorado	CO
Garfield	47.28	-106.99	47.69	-106.92	Montana	MT
Rosebud	46.23	-106.72	46.26	-106.59	Montana	MT

“locality_dat.csv” will be referred to as “local”

Tidying and Data Hygiene

The following section performs a series of operations to tidy and clean the individual data frames, merge them, and clean the combined data frame. Details of the changes are annotated in the code.

Clean “dino” data frame

Adjusted column names
Checked that values in each column are reasonable
Tidied the abundance column by separating the aggregate abundances

# Adjust column names to shorten and include no special characters

names(dino)[1] <- 'fm'
names(dino)[2] <- 'inst'
names(dino)[3] <- 'area'
names(dino)[4] <- 'fam'
names(dino)[5] <- 'abun'

# Formation column (fm) should consist only of Scollard, Frenchman, Hell Creek, Lance, and Denver
unique(dino$fm) # All values are correct

# Check institution column (inst) for typos 
unique(dino$inst) # All values correct

# Check area column (area) for typos
unique(dino$area) # All values are correct

# Check family (fam) column for typos/duplicates
unique(dino$fam) # All values are correct

# Abundance column should have reasonable values greater than 0
range(dino$abun) # Range from 0 to 474. Rows with an abundance of zero carry no specimens, so they will be removed

dino <- dino[dino$abun != 0, ] 
# Overwrite the dino df with data that has values greater than 0 in the abundance column

# Double check that all zero values have been removed
range(dino$abun) # Range from 1 to 474

# Since the abundance column is an aggregate of observations, it is not tidy. To tidy the dataset:
dino <- dino %>%
  uncount(weights = abun, .remove = TRUE)

# Now each column is a variable and row an observation
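
To make the effect of uncount() concrete, here is a small sketch on a toy data frame. The two rows and their counts are made up for illustration and are not taken from the dataset.

```r
library(tidyr)

# Hypothetical aggregated records (illustrative values only)
toy <- data.frame(
  fam  = c("Ceratopsidae", "Tyrannosauridae"),
  abun = c(3, 1)
)

# Repeat each row 'abun' times, then drop the weights column
toy_long <- uncount(toy, weights = abun, .remove = TRUE)

nrow(toy_long)       # one row per specimen
table(toy_long$fam)  # specimens per family after expansion
```

After expansion, counting rows with count() or n() recovers the original abundances, which is why the later plots use row counts rather than an abundance column.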

Clean “local” data frame

Adjusted column names
Checked that values in each column are reasonable

# Adjust column names to shorten and include no special characters

names(local)[1] <- 'area' 
# These values complement those in the dino df and are given the same column name

names(local)[2] <- 'lat'
names(local)[3] <- 'long'
names(local)[4] <- 'adj_lat'
names(local)[5] <- 'adj_long'
names(local)[6] <- 'st_pr'
names(local)[7] <- 'st_pr_abb'

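# Each area in the dino df needs complementary data in the local df
length(unique(dino$area)) == length(unique(local$area)) # Yields true. Same number of areas in both data frames
setdiff(unique(dino$area), unique(local$area)) # Lists any dino areas with no match in local (should be empty)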

# Latitude values should range from 0 to 90 (northern hemisphere)
range(local$lat) 
range(local$adj_lat) 
# Both acceptable values between 39.12 (Colorado) and 51.94 (Alberta)

# Longitude values should range from -180 to 0 (western hemisphere)
range(local$long) 
range(local$adj_long)
# Both acceptable values (all negative and between -100 and -120)

# Check all state and province names are spelled correctly
unique(local$st_pr) # All Correct values

# Check all abbreviations are correct
unique(local$st_pr_abb) # All correct values

# Check that each abbreviation corresponds to the correct state/province: 
local %>%
  select(st_pr, st_pr_abb) %>% 
  distinct()
# Each state/province has the correct abbreviation

Merge data frames

The two data frames are combined into “dino_loc”, which ties together the fossil information with locality information. Check that the data frames merged correctly by looking at the top of the new data frame. This data frame will be used for the visualizations.

# Merge the two data frames into a new data frame
dino_loc <- left_join(dino, local, by = 'area')

# Show first 6 lines
kable(head(dino_loc), format = "html", table.attr = "class='table table-striped table-hover table-bordered' style='margin:auto;'")

fm	inst	area	fam	lat	long	adj_lat	adj_long	st_pr	st_pr_abb
Scollard	AMNH	Dry Island	Ankylosauridae	51.94	-112.96	51.94	-112.96	Alberta	AB
Scollard	CMN	Dry Island	Ceratopsidae	51.94	-112.96	51.94	-112.96	Alberta	AB
Scollard	CMN	Dry Island	Ceratopsidae	51.94	-112.96	51.94	-112.96	Alberta	AB
Scollard	CMN	Dry Island	Thescelosauridae	51.94	-112.96	51.94	-112.96	Alberta	AB
Scollard	CMN	Dry Island	Tyrannosauridae	51.94	-112.96	51.94	-112.96	Alberta	AB
Scollard	CMN	Dry Island	Tyrannosauridae	51.94	-112.96	51.94	-112.96	Alberta	AB
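
Looking at the top of the merged data frame does not reveal rows that failed to match. One possible additional check, sketched here on toy data, is an anti_join(), which returns the rows of the first table that have no partner in the second. The toy tables and the area “Hypothetical County” are invented for illustration; the two real coordinates are copied from the locality preview above.

```r
library(dplyr)

# Toy stand-ins for the dino and local data frames
toy_dino <- tibble(
  area = c("Dry Island", "Garfield", "Hypothetical County"),
  fam  = c("Ceratopsidae", "Hadrosauridae", "Tyrannosauridae")
)
toy_local <- tibble(
  area = c("Dry Island", "Garfield"),
  lat  = c(51.94, 47.28)
)

# Rows of toy_dino whose area has no match in toy_local
unmatched <- anti_join(toy_dino, toy_local, by = "area")

# left_join() keeps every fossil row; unmatched rows get NA locality values
joined <- left_join(toy_dino, toy_local, by = "area")
```

An empty result from anti_join() on the real data frames would indicate that every fossil record received locality data.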

Clean “dino_loc” data frame

Checked and adjusted formation names to match a geographical region
Checked for any NA values

# Check that each county matches a corresponding formation
dino_loc %>%
  select(area, fm, st_pr_abb) %>% # Selects only area, formation, and the state/province abbreviation
  distinct() %>% # Select only the distinct combinations
  group_by(area) %>% # Group them by the area column
  filter(n() > 1) %>% # Keeps only areas that appear with more than one formation
  arrange(area) # Arranges the answers by area so it is easy to see the duplicated areas

### Note: Differences may have resulted due to the age of the collections. Older records may refer to formations as Lance, regardless of location. 

# The name of each formation corresponds to a province/state(s). The Hell Creek Formation is the only one spread over multiple states (MT, SD, and ND) and the rest of the Formations are restricted to a state/province. 

# Adjust so that the formation labels correspond to the correct state/province:
dino_loc$fm[dino_loc$st_pr_abb == 'MT' | dino_loc$st_pr_abb == 'SD' | dino_loc$st_pr_abb == 'ND'] <- 'Hell Creek'
dino_loc$fm[dino_loc$st_pr_abb == 'WY'] <- 'Lance'
dino_loc$fm[dino_loc$st_pr_abb == 'CO'] <- 'Denver'
dino_loc$fm[dino_loc$st_pr_abb == 'SK'] <- 'Frenchman'
dino_loc$fm[dino_loc$st_pr_abb == 'AB'] <- 'Scollard'

# Check that there are no NA values in the data frame
anyNA(dino_loc) # Returns FALSE. No NA values
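
The formation relabelling above can also be written as a single lookup, which keeps the state/province-to-formation mapping in one place. This is only a sketch of an alternative: fm_by_region and toy_abb are illustrative names, and the mapping is copied from the assignments above.

```r
# State/province abbreviation -> formation, as assigned above
fm_by_region <- c(
  AB = "Scollard", SK = "Frenchman",
  MT = "Hell Creek", SD = "Hell Creek", ND = "Hell Creek",
  WY = "Lance", CO = "Denver"
)

# Hypothetical abbreviations standing in for dino_loc$st_pr_abb
toy_abb <- c("MT", "WY", "AB", "ND")

# Indexing a named vector by name returns the formation for each record
toy_fm <- unname(fm_by_region[toy_abb])
toy_fm
```

An abbreviation missing from the lookup would come back as NA, so a follow-up anyNA() check would flag any state or province that was not anticipated.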

Data Wrangling

The following section modifies the data frame to fit the needs of this EDA. Details of the changes are annotated in the code.

Wrangling “dino_loc” data frame

Updated and consolidated dinosaur family/clade names
Ordered formations from North to South

# Some classifications of the fossils are outdated or incorrect, and should be consolidated. 

# The groups that need to be consolidated are as follows:

# Caenagnathidae <- Avimimidae, Oviraptoridae
# Tyrannosauridae <- Megalosauridae
# Hadrosauridae <- Iguanodontidae
# Dromaeosauridae <- Small Theropod
# Thescelosauridae <- Hypsilophodontidae

# To adjust these names:

dino_loc$fam[dino_loc$fam == 'Avimimidae' | dino_loc$fam == 'Oviraptoridae'] <- 'Caenagnathidae'
# Avimimidae is a dubious basal lineage
# Oviraptoridae only known from Asia
# Both consolidated into the closely related Caenagnathidae

dino_loc$fam[dino_loc$fam == 'Megalosauridae'] <- 'Tyrannosauridae'
# Megalosauridae not known from the Cretaceous of North America
# Consolidated into Tyrannosauridae, the only large theropod in North America at the time

dino_loc$fam[dino_loc$fam == 'Iguanodontidae'] <- 'Hadrosauridae'
# Iguanodontidae not known from the Cretaceous of North America
# Consolidated into Hadrosauridae, a related group that is abundant in the Late Cretaceous

dino_loc$fam[dino_loc$fam == 'Small Theropod'] <- 'Dromaeosauridae'
# Small theropod was a descriptor used by the RSM for  unidentified small theropod teeth and made up a sizable portion of the Frenchman Formation material
# Tentatively consolidated into Dromaeosauridae, since the teeth of other groups of small theropods are fairly easily diagnosable 

dino_loc$fam[dino_loc$fam == 'Hypsilophodontidae'] <- 'Thescelosauridae'
# Thescelosaurus was previously placed into Hypsilophodontidae
# Consolidated into Thescelosauridae, the new family of Thescelosaurus

# To check that names were adjusted accordingly
unique(dino_loc$fam)

# Order the Formations from North to South
dino_loc <- dino_loc %>%
  mutate(fm = factor(fm, levels = c("Scollard", "Frenchman", "Hell Creek", "Lance", "Denver")))

Summary Statistics

The current sample size of this data set is 9328 fossils representing 15 dinosaur families or clades.

The data set contains records from 23 institutions, collected from across 7 states and provinces in North America.
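
Figures like these can be computed from the data frame rather than typed by hand, so they stay current as specimens are added. The sketch below shows the idea on a toy tibble; the four rows are invented, and on the real data the same summarise() call would be applied to dino_loc.

```r
library(dplyr)

# Toy stand-in for dino_loc (illustrative rows only)
toy <- tibble(
  fam   = c("Ceratopsidae", "Ceratopsidae", "Hadrosauridae", "Tyrannosauridae"),
  inst  = c("CMN", "AMNH", "CMN", "CMN"),
  st_pr = c("Alberta", "Alberta", "Montana", "Montana")
)

summary_stats <- toy %>%
  summarise(
    n_fossils      = n(),               # one row per specimen
    n_families     = n_distinct(fam),   # families/clades represented
    n_institutions = n_distinct(inst),  # institutions holding specimens
    n_regions      = n_distinct(st_pr)  # states and provinces
  )

summary_stats
```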

Data Visualizations

The following section will visualize the data in various formats, highlighting different aspects of the data.

I. Basic Data Visualizations

The following graphs represent basic visualizations of the number of dinosaur fossils when sorted by geological formation, institution, geographical area, and dinosaur family/clade.

1. Instances of Dinosaur Fossils per Formation

This graph shows the number of dinosaur fossils recorded per geological formation. The graph illustrates a clear sampling bias toward the Hell Creek and Lance formations.

abun_fm <- dino_loc %>%
  count(fm) %>%
  mutate(fm = factor(fm, levels = fm[order(-n)])) %>% # Order from largest to smallest values
  ggplot(aes(x = fm, y = n)) +
  geom_col() +
  labs(x = "Formations", y = "Number of Specimens") +
  theme_bw()

print(abun_fm)

2. Instances of Dinosaur Fossils per Institution

This graph shows the number of dinosaur fossils recorded at each institution. The graph illustrates that most of the Cretaceous dinosaur fossils are housed in a small number of museums.

abun_inst <- dino_loc %>%
  count(inst) %>%
  mutate(inst = factor(inst, levels = inst[order(-n)])) %>%
  ggplot(aes(x = inst, y = n)) +
  geom_col() +
  labs(x = "Institutions", y = "Number of Specimens") +
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(abun_inst)

3. Instances of Dinosaur Fossils per Geographical Area

This graph shows the number of dinosaur fossils recorded from each geographical area. The areas are sorted by their corresponding geological formation. The graph illustrates that most fossils for a given formation come from a few productive areas. This is important to note because it reveals that not all areas will be useful for this analysis.

abun_area <- dino_loc %>%
  count(fm, area) %>% # Count specimens per formation and area
  arrange(fm, desc(n)) %>% # Sort by formation, then from largest to smallest
  mutate(area = factor(area, levels = unique(area))) %>% # Lock in that order for the x-axis
  ggplot(aes(x = area, y = n, fill = fm)) +
  geom_col() +
  labs(x = "Geographical Areas", y = "Number of Specimens", fill = "Formation") +  # Edit the legend title
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(abun_area)

4. Instances of Dinosaur Fossils per Dinosaur Family/Clade

This graph shows the number of dinosaur fossils recorded for each dinosaur family/clade. The graph illustrates that the most common groups of dinosaurs in these Late Cretaceous formations (based on fossils) are the large herbivores, Ceratopsidae (horned dinosaurs) and Hadrosauridae (duck-billed dinosaurs).

abun_fam <- dino_loc %>%
  count(fam) %>%
  mutate(fam = factor(fam, levels = fam[order(-n)])) %>%
  ggplot(aes(x = fam, y = n)) +
  geom_col() + 
  labs(x = "Families/Clades", y = "Number of Specimens") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(abun_fam)

II. Dinosaur Abundance Distribution

The following graphs represent visualizations focusing on the distribution of these abundance values across geological formations.

5. Abundance of Dinosaur Clades per Formation

This graph shows the number of dinosaur fossils, per clade, present in each formation. This illustrates how common the fossils of a particular group are in a formation.

# Dinosaur abundance by formation
abun_fam_fm <- dino_loc %>%
  count(fam, fm) %>%
  ggplot(aes(x = fam, y = n, fill = fm)) +
  geom_col(position = "dodge") +
  labs(x = "Families/Clades", y = "Number of Specimens", fill = "Formations") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(abun_fam_fm)

Note: This graph shows raw abundances, but the different sample sizes of the formations (see graph 1) make the less-sampled formations harder to compare. Using relative abundances makes the formations directly comparable.

Making a Relative Abundance Data Frame

A new data frame with a relative abundance value (r_abun) is required. The relative abundance represents the percentage of the total abundance that a specific group represents. This new data frame will be called “dino_abun”.
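
As a small worked example of the calculation, the sketch below computes relative abundance on a toy tibble with two made-up formations (“A” and “B”) and two made-up families (“x” and “y”). Within each formation, the percentages should sum to 100.

```r
library(dplyr)

# Toy specimen-level data: one row per fossil (illustrative only)
toy <- tibble(
  fm  = c("A", "A", "A", "B"),
  fam = c("x", "x", "y", "x")
)

toy_abun <- toy %>%
  count(fm, fam, name = "total_abun") %>%                    # specimens per formation and family
  group_by(fm) %>%                                           # percentages are taken within a formation
  mutate(r_abun = total_abun / sum(total_abun) * 100) %>%
  ungroup()

toy_abun
```

In formation “A”, family “x” makes up two of three specimens and “y” one of three; in formation “B”, the single specimen gives “x” a relative abundance of 100%.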

# Create a new data frame with relative abundance of dinosaur clades by formation
dino_abun <- dino_loc %>%
  group_by(fm, fam) %>% 
  summarise(total_abun = n(), .groups = "drop_last") %>% # Count specimens; the result stays grouped by fm
  mutate(r_abun = (total_abun / sum(total_abun))*100) %>%
  ungroup() %>%
  complete(fm, fam, fill = list(total_abun = 0, r_abun = 0)) # Filling out certain missing clades in formations with 0

6. Relative Abundance of Dinosaur Clades per Formation

This graph shows the relative abundance of dinosaur clades present in each formation. This better illustrates how common the fossils of a particular group are in each formation and makes comparisons between formations more practical. We can see clear trends where certain groups are more abundant in specific formations.

# Relative abundance of dinosaur clades by formation
r_abun_fam_fm <- ggplot(dino_abun, aes(x = fam, y = r_abun, fill = fm)) + 
  geom_col(position = "dodge") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Family", y = "Relative Abundance (%)", fill = "Formation")

print(r_abun_fam_fm)

Note: Although this graph better illustrates abundance trends between formations, some formations can cover vast geographical areas. To get more detailed data, it might be necessary to further subdivide the formations. To do this, we need to look at the distribution of the areas and localities producing these fossils.

III. Fossil Distribution Visualizations

The following graphs represent the distribution of the above data across North America. The following map visualizations require map shape files to be loaded.

world <- map_data("world") 
states <- ne_states(country = "united states of america", returnclass = "sf") # Load state boundaries
provinces <- ne_states(country = "canada", returnclass = "sf") # Load province boundaries
us_counties <- map_data("county") # Load US county boundaries
states_provinces <- rbind(states, provinces) # Combine state and province boundaries

7. Distribution of Counties/Areas with Dinosaur Fossils

This graph plots all counties and areas with records of dinosaur fossils. The graph illustrates areas where Late Cretaceous layers are exposed. Note that the Hell Creek Formation covers a vast geographical area spanning many states. In the case of Canadian areas without counties, the actual locality is plotted.

dino_dist_area <- ggplot(data = world) + 
  geom_map(map = world, aes(map_id = region), fill = "white", color = "black") +
  geom_map(data = us_counties, map = us_counties, aes(map_id = region), fill = NA, color = "grey") + # Add US county boundaries
  geom_sf(data = states_provinces, color = "black", fill = NA) + # Add state and province boundaries
  geom_point(data = dino_loc, aes(x = long, y = lat, color = fm), size = 2) +
  coord_sf(xlim = c(-115, -100), ylim = c(38, 53), expand = FALSE) +  # Set limits for the map
  labs(x = 'Longitude', y = "Latitude", color = "Formations") + 
  theme(axis.text.x = element_text(angle = -45, hjust = -0.1))

print(dino_dist_area)

Note: Although this graph illustrates the areas with dinosaur fossils, each point only indicates the presence of fossils in the specific county, not the actual location fossils are coming from. A better illustration would be to look at the distribution of localities (adjusted latitude and longitude values).

8. Distribution of Localities with Dinosaur Fossils

This graph plots the estimated location of the major fossil-bearing localities within each county. The graph better illustrates areas where Late Cretaceous layers are exposed. Where locality data could not be obtained, the point has been left at its original position.

dino_dist_loc <- ggplot(data = world) + 
  geom_map(map = world, aes(map_id = region), fill = "white", color = "black") +
  geom_map(data = us_counties, map = us_counties, aes(map_id = region), fill = NA, color = "grey") + # Add US county boundaries
  geom_sf(data = states_provinces, color = "black", fill = NA) + # Add state and province boundaries
  geom_point(data = dino_loc, aes(x = adj_long, y = adj_lat, color = fm), size = 2) +
  coord_sf(xlim = c(-115, -100), ylim = c(38, 53), expand = FALSE) +  # Set limits for the map
  labs(x = 'Longitude', y = "Latitude", color = "Formations") + 
  theme(axis.text.x = element_text(angle = -45, hjust = -0.1))

print(dino_dist_loc)

Note: Some localities are represented by only a few isolated specimens. These can be filtered out to better illustrate the concentrations of fossils.

filtered_dino_dist_loc <- dino_loc %>% 
  group_by(area, adj_long, adj_lat, fm) %>%
  summarise(n = n(), .groups = "drop") %>% # Count specimens per locality and drop the grouping
  filter(n > 100) ### This threshold can be adjusted to change the minimum number of specimens per site
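
Because the threshold may need adjusting, one option is to wrap the count-and-filter step in a small function with the cutoff as an argument. This is a sketch only: filter_localities() is a hypothetical helper, not part of the original analysis, and the toy data are invented.

```r
library(dplyr)

# Hypothetical helper: keep areas with more than 'min_n' specimens
filter_localities <- function(df, min_n = 100) {
  df %>%
    count(area, name = "n") %>%  # specimens per area
    filter(n > min_n)
}

# Toy data: five specimens from one site, two from another
toy <- tibble(area = c(rep("Site A", 5), rep("Site B", 2)))

kept <- filter_localities(toy, min_n = 3)
kept
```

With a cutoff of 3, only “Site A” is kept. On the real data, the coordinate and formation columns would need to be included in the count() call so that they are carried through to the map.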

9. Distribution of Localities with Dinosaur Fossils (n > 100)

This graph restricts the plots to localities with over 100 recorded specimens. This graph better illustrates clusters or “hot-spots” for dinosaur fossils.

fil_dino_dist_loc <- ggplot(data = world) + 
  geom_map(map = world, aes(map_id = region), fill = "white", color = "black") +
  geom_map(data = us_counties, map = us_counties, aes(map_id = region), fill = NA, color = "grey") + # Add US county boundaries
  geom_sf(data = states_provinces, color = "black", fill = NA) + # Add state and province boundaries
  geom_text_repel(data = filtered_dino_dist_loc, aes(x = adj_long, y = adj_lat, label = area), 
            size = 2.5, hjust = 0.5, vjust = 0) +
  geom_point(data = filtered_dino_dist_loc, aes(x = adj_long, y = adj_lat, color = fm), size = 2) +
  coord_sf(xlim = c(-115, -100), ylim = c(38, 53), expand = FALSE) +  # Set limits for the map
  labs(x = 'Longitude', y = "Latitude", color = "Formations") + 
  theme(axis.text.x = element_text(angle = -45, hjust = -0.1))

print(fil_dino_dist_loc)

By filtering to areas with over 100 specimens, we can see fairly obvious clusters of sites:

Scollard - Dry Island
Frenchman - Eastend, GNP
Hell Creek NW - Garfield, McCone
Hell Creek SE - Carter, Slope, Fallon, Harding
Lance - Weston, Niobrara
Denver - Denver

The previous plot can be adjusted to also show localities with 100 or fewer specimens. Some of these localities can be added to the previously defined clusters based on distance. Black points on the map indicate localities with 100 or fewer specimens.

ggplot(data = world) + 
  geom_map(map = world, aes(map_id = region), fill = "white", color = "black") +
  geom_map(data = us_counties, map = us_counties, aes(map_id = region), fill = NA, color = "grey") + # Add US county boundaries
  geom_sf(data = states_provinces, color = "black", fill = NA) + # Add state and province boundaries
  geom_text_repel(data = unique(dino_loc[, c("adj_long", "adj_lat", "area")]), 
                  aes(x = adj_long, y = adj_lat, label = area), 
                  size = 2, hjust = 0, vjust = 0, 
                  max.overlaps = Inf) +
  geom_point(data = dino_loc, aes(x = adj_long, y = adj_lat), size = 1) +
  geom_point(data = filtered_dino_dist_loc, aes(x = adj_long, y = adj_lat, color = fm), size = 2) +
  coord_sf(xlim = c(-115, -100), ylim = c(38, 53), expand = FALSE) +  # Set limits for the map 
  labs(x = 'Longitude', y = "Latitude", color = "Formations") + 
  theme(axis.text.x = element_text(angle = -45, hjust = -0.1))

Some areas can be added to major clusters, and although other clusters exist, the sample sizes are too small to be meaningful.

Major groupings

Scollard - Dry Island
Frenchman - Eastend, GNP
Hell Creek NW - Garfield, McCone
Hell Creek SE - Carter, Slope, Fallon, Harding + Powder River, Bowman
Lance E - Weston, Niobrara + Converse
Denver - Denver + Jefferson

Other groupings

Hell Creek E - Sioux, Corson, Ziebach
Hell Creek S - Butte, Perkins, Meade
Lance NW - Park, Big Horn
Lance SW - Sweetwater, Carbon

Undefined

Hell Creek - Dawson, Rosebud, Petroleum, Billings, Morton
Lance - Natrona, Goshen, Hot Springs

Add a new column to dino_loc which separates the formations into geographical subdivisions, outlined above.

# Assign subdivisions based on the values in the 'area' column
dino_loc$subdivision <- NA_character_ # Create the column first; areas not listed below stay NA
dino_loc$subdivision[dino_loc$area %in% c("Dry Island")] <- "Scollard"
dino_loc$subdivision[dino_loc$area %in% c("Eastend", "GNP")] <- "Frenchman"
dino_loc$subdivision[dino_loc$area %in% c("Garfield", "McCone")] <- "Hell Creek NW"
dino_loc$subdivision[dino_loc$area %in% c("Carter", "Slope", "Fallon", "Harding", "Powder River", "Bowman")] <- "Hell Creek SE"
dino_loc$subdivision[dino_loc$area %in% c("Weston", "Niobrara", "Converse")] <- "Lance E"
dino_loc$subdivision[dino_loc$area %in% c("Denver", "Jefferson")] <- "Denver"
dino_loc$subdivision[dino_loc$area %in% c("Sioux", "Corson", "Ziebach")] <- "Hell Creek E"
dino_loc$subdivision[dino_loc$area %in% c("Butte", "Perkins", "Meade")] <- "Hell Creek S"
dino_loc$subdivision[dino_loc$area %in% c("Park", "Big Horn")] <- "Lance NW"
dino_loc$subdivision[dino_loc$area %in% c("Sweetwater", "Carbon")] <- "Lance SW"
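
The same assignment can be expressed as a lookup table joined onto the data, which keeps the area-to-subdivision mapping readable as the number of areas grows. The sketch below uses a subset of the groupings listed above. The names subdivision_key and toy_loc are illustrative.

```r
library(dplyr)

# Area -> subdivision key (a subset of the groupings listed above)
subdivision_key <- tibble(
  area        = c("Dry Island", "Eastend", "GNP", "Garfield", "McCone"),
  subdivision = c("Scollard", "Frenchman", "Frenchman",
                  "Hell Creek NW", "Hell Creek NW")
)

# Toy stand-in for dino_loc; Dawson is one of the "Undefined" areas
toy_loc <- tibble(area = c("Garfield", "GNP", "Dawson"))

# Areas absent from the key receive NA, matching the behaviour above
toy_sub <- left_join(toy_loc, subdivision_key, by = "area")
toy_sub
```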

This can be plotted on the previous map to better visualize how the clusters are geographically organized. Most clusters show fairly obvious separation from each other. However, some clusters (Hell Creek SE, E, and S) are harder to distinguish, as they appear to be parts of a larger exposure.

ggplot(data = world) + 
  geom_map(map = world, aes(map_id = region), fill = "white", color = "black") +
  geom_map(data = us_counties, map = us_counties, aes(map_id = region), fill = NA, color = "grey") + 
  geom_sf(data = states_provinces, color = "black", fill = NA) + 
  geom_point(data = dino_loc, aes(x = adj_long, y = adj_lat, color = subdivision), size = 2) +
  coord_sf(xlim = c(-115, -100), ylim = c(38, 53), expand = FALSE) +
  labs(x = 'Longitude', y = "Latitude", color = "Fm Subdivisions") +
  scale_color_manual(values = brewer.pal(11, "Paired")) +
  theme(axis.text.x = element_text(angle = -45, hjust = -0.1))

Making a Relative Abundance Data Frame based on Subdivision

A new data frame, “dino_abun_sd”, is created with the relative abundance of dinosaur clades per subdivision. Subdivisions with fewer than 100 total specimens are excluded.

# Create a new data frame with relative abundance of dinosaur clades by subdivision

dino_abun_sd <- dino_loc %>%
  group_by(subdivision, fm, fam) %>% 
  summarise(total_abun = n(), .groups = "drop_last") %>% # Count specimens; the result stays grouped by subdivision and fm
  mutate(r_abun = (total_abun / sum(total_abun))*100) %>%
  group_by(subdivision) %>%
  filter(sum(total_abun) >= 100) %>%
  mutate(subdivision = factor(subdivision, levels = c("Scollard", "Frenchman", "Hell Creek NW", "Hell Creek SE", "Lance E", "Denver"))) %>%
  filter(!is.na(subdivision))

10. Relative Abundance of Dinosaur Families/Clades per Subdivision (n > 100)

This graph shows the relative abundance of dinosaur families per subdivision (with over 100 specimens). There are potentially N-S trends in relative abundance of certain groups.

# Relative abundance of dinosaur clades by subdivision
r_abun_fam_subdivision <- ggplot(dino_abun_sd, aes(x = fam, y = r_abun, fill = subdivision)) + 
  geom_col(width = 0.8, position = "dodge") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Family", y = "Relative Abundance (%)", fill = "Subdivision")

print(r_abun_fam_subdivision)

Relative Abundance Values of Dinosaur Families/Clades per Subdivision (n > 100)

This table summarizes the relative abundances (in percent) of each family/clade, categorized by each subdivision.

dino_abun_sd_table <- dino_abun_sd %>%
  group_by(fam, subdivision) %>%
  summarise(r_abun = sum(r_abun), .groups = "drop") %>%
  pivot_wider(names_from = subdivision, values_from = r_abun) # One column per subdivision

names(dino_abun_sd_table)[1] <- 'Family/Clade'

# Put the subdivision columns in north-to-south order
dino_abun_sd_table <- dino_abun_sd_table[, c('Family/Clade', "Scollard", "Frenchman", "Hell Creek NW", "Hell Creek SE", "Lance E", "Denver")]

kable(dino_abun_sd_table, format = "html", digits = 2, table.attr = "class='table table-striped table-hover table-bordered' style='margin:auto;'")

Family/Clade	Scollard	Frenchman	Hell Creek NW	Hell Creek SE	Lance E	Denver
Alvarezsauridae	NA	0.09	0.11	0.07	0.10	NA
Ankylosauridae	6.77	0.84	0.97	0.66	1.19	NA
Caenagnathidae	0.75	1.31	0.07	1.25	0.13	NA
Ceratopsidae	15.79	26.92	27.13	29.74	32.36	32.63
Dromaeosauridae	9.96	29.27	14.57	5.86	7.46	4.56
Hadrosauridae	5.83	15.67	15.04	29.96	27.32	31.93
Leptoceratopsidae	6.02	0.09	NA	0.59	0.27	NA
Nodosauridae	0.38	0.09	0.25	0.22	0.36	NA
Ornithomimidae	8.46	6.10	3.17	4.25	1.46	2.46
Pachycephalosauridae	2.44	0.47	1.33	3.37	1.33	NA
Paronychodon sp.	2.82	0.38	5.27	0.88	10.54	2.81
Richardoestesia sp.	5.64	1.31	18.40	2.93	2.79	21.05
Thescelosauridae	7.71	5.91	2.42	7.55	4.51	0.70
Troodontidae	1.50	0.84	1.15	1.17	4.38	0.70
Tyrannosauridae	25.94	10.69	10.10	11.50	5.80	3.16
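
The NA entries in the table mark family/subdivision combinations with no recorded specimens. If zeros are preferred, pivot_wider() accepts a values_fill argument. The sketch below uses three values copied from the table above. Whether an absence of records should be shown as 0 or NA is a judgment call, since an unrecorded group is not necessarily a truly absent one.

```r
library(tidyr)

# Three long-format rows with values copied from the table above
toy_long <- data.frame(
  fam         = c("Ankylosauridae", "Ceratopsidae", "Ceratopsidae"),
  subdivision = c("Scollard", "Scollard", "Denver"),
  r_abun      = c(6.77, 15.79, 32.63)
)

# values_fill = 0 replaces missing combinations with 0 instead of NA
toy_wide <- pivot_wider(
  toy_long,
  names_from  = subdivision,
  values_from = r_abun,
  values_fill = 0
)

toy_wide
```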

Potential Trends

The northernmost Scollard Formation is the only formation to show a relative abundance of over 5% for Ankylosauridae and Leptoceratopsidae
Ornithomimidae and Thescelosauridae potentially show a gradual decrease in abundance from N to S
Alvarezsauridae and Nodosauridae are consistently rare (under 0.5%) or unrecorded in all subdivisions
Ceratopsidae and Hadrosauridae are consistently abundant. Hadrosauridae show a larger difference in relative abundance between northern and southern areas (generalist vs specialist?)
Dominant small theropod groups (Dromaeosauridae, Paronychodon sp., Richardoestesia sp., Troodontidae) appear to vary by area
- Troodontidae appear fairly consistently across all formations, although they are the most common in the Lance Formation
- Dromaeosauridae are the dominant small theropod in the Frenchman Formation (which is possibly an artifact of lumping “small theropod” fossils into Dromaeosauridae). This will be further investigated in the near future
- Paronychodon sp. are the dominant small theropod in the Lance Formation
- Richardoestesia sp. are the dominant small theropod in the Hell Creek NW subdivision and the Denver Formation (in Hell Creek SE, Dromaeosauridae are more abundant)
Potentially anomalous abundance of Tyrannosauridae specimens in the Scollard Formation (25.94%, more than double the value in any other subdivision; this may affect the relative abundances of other major groups in the Scollard)

Conclusions

This EDA revealed a number of potential trends to be pursued further as this project progresses. The trends observed here still need to be tested statistically, but their presence at this stage is promising. If they prove statistically significant and persist as new specimens are added, they may provide valuable new data on the habitat preferences of, and ecological interactions among, dinosaur groups immediately prior to the end-Cretaceous mass extinction.