Exploratory Data Analysis of Bat Capture Data

Author

Mika Tosoni-Eves

Rationale

All species of bats in Ontario are listed as endangered and while collecting data from them is important, capture methods can be invasive and stressful for the bats (Edwards et al., 2022). For this reason, it’s important to optimize our nights of capture to ensure that we are collecting the most data we can while disturbing the least amount of bat roosts.

I’m using previously collected bat capture data collected from the Davy Lab from the last few years to discover any noticeable trends. Trends such as number of species, number of individuals, or change over time in physical measurements taken at the different sites of capture. By analyzing capture trends in previously collected data, we can find out where to focus our resources and limit unnecessary stress to already endangered species.

Figure 1. Long-legged myotis (Myotis volans) caught in a mist net, a method of trapping that captures bats during flight.

Study Questions

  1. In which year were the most bats captured? The least?

  2. Where were the most bats captured?

  3. Which species were most frequently caught at each site?

  4. Has mass or forearm length changed over the time?

Libraries

library(tidyverse)
library(knitr)

Data Import

First I import the Davy Lab’s collective capture data, selecting only the columns necessary to perform my data. Each row of the data corresponds to a captured individual and includes the year of capture, location of capture, species, sex, age, forearm length (mm), and mass (g) of each bat.

dat <- read_csv("BatCaptureData.csv") %>%
  select('Year', 'General Location', 'Species', 'Sex', 'Age', 'Forearm_mm', 'Mass_g')

Here we have a dataframe containing 5063 observations with 7 columns of values that need to be tidied before analyzing.

Data Cleaning

Here I spend some time manipulating the data to make it easier to use, easier to read, and to ensure all of the values are consistent and in the correct format.

Renaming Columns

dat <- dat %>%
  rename(
    year = Year,
    location = `General Location`,
    species = Species,
    sex = Sex,
    age = Age,
    forearm = Forearm_mm,
    mass = Mass_g
  )

Cleaning: year

Expected: Numerical years between 2017 and 2021

sort(unique(dat$year))
[1] 2017 2018 2019 2020 2021
class(dat$year)
[1] "numeric"

Already properly formatted!

Cleaning: location

Expected: No inadvertent duplicates and consistent naming scheme

sort(unique(dat$location))
 [1] "Amherst Island"   "Belgrave"         "Brighton"         "Burlington"      
 [5] "Cardiff"          "Cordova Mines"    "Eganville"        "Hillsburgh"      
 [9] "Kinmount"         "Kirkland Lake"    "Marston"          "Minden"          
[13] "Mount St Patrick" "Penatanguishene"  "Peterborough"     "Port Elgin"      
[17] "Red Bay"          "Roslin"           "Tory Hill"        "Wainfleet"       
[21] "Wilberforce"      "Wolfe Island"    

Already properly formatted. I’ll be keeping NAs in the dataset for analyses where they won’t interfere.

Cleaning: species

Expected: No inadvertent duplicates, valid species, consistent naming scheme

unique(dat$species)
 [1] "EPFU"  "LABO"  "MYLU"  "PESU"  "MYSE"  "PESU?" "MYLE"  NA      "MYLE?"
[10] "?"    
dat <- dat %>% 
  mutate(species = case_when(
    species == 'EPFU' ~ 'Big brown bat',
    species == 'LABO' ~ 'Eastern red bat',
    species == 'MYLE' ~ 'Eastern small-footed bat',
    species == 'MYSE' ~ 'Northern long-eared bat',
    species == 'PESU' ~ 'Tricoloured bat',
    species == 'MYLU' ~ 'Little brown bat',
    TRUE ~ NA_character_
  ))
unique(dat$species)
[1] "Big brown bat"            "Eastern red bat"         
[3] "Little brown bat"         "Tricoloured bat"         
[5] "Northern long-eared bat"  NA                        
[7] "Eastern small-footed bat"

Issues: Naming scheme was very jargon-heavy and there were values with some uncertainty when the observers weren’t certain in their species ID.

Fix: Set uncertain answers as NAs and replaced jargony abbreviations with the common names of the bat species.

  • EPFU: Big brown bat (Eptesicus fuscus)
  • LABO: Eastern red bat (Lasiurus borealis)
  • MYLE: Eastern small-footed bat (Myotis leibii)
  • MYSE: Northern long-eared bat (Myotis septentrionalis)
  • PESU: Tricoloured bat (Perimyotis sublavus)
  • MYLU: Little brown bat (Myotis lucifugus)

While we’re here, lets check how many of each species we have in our data set!

Bat Species Count
species Count
Little brown bat 4243
Big brown bat 458
Eastern small-footed bat 170
Tricoloured bat 144
NA 32
Northern long-eared bat 14
Eastern red bat 2

Cleaning: sex

Expected: Only values should correspond with either male or female, consistent naming scheme

unique(dat$sex)
[1] "M"       "F"       NA        "?"       "unknown" "A"      
dat <- dat %>% 
  mutate(sex = case_when(
    sex == 'M' ~ 'M',
    sex == 'F' ~ 'F',
    TRUE ~ NA
  ))
unique(dat$sex)
[1] "M" "F" NA 

Issues: Some uncertain values, ‘A’ as a value makes no sense

Fix: Set uncertain and incorrect values as NAs and keep M’s and F’s unchanged.

Cleaning: age

Expected: Only values should correspond with either adult or juvenile, consistent naming scheme

unique(dat$age)
 [1] "J"                       "A"                      
 [3] "J?"                      "A?"                     
 [5] NA                        "?"                      
 [7] "unknown"                 "a"                      
 [9] "J (was called A before)" "na"                     
[11] "F"                      
dat <- dat %>% 
  mutate(age = case_when(
    age == 'a' | age == 'A' ~ 'A',
    age == 'j' | age == 'J' | age == 'J (was called A before)' ~ 'J',
    TRUE ~ NA
  ))
unique(dat$age)
[1] "J" "A" NA 

Issues: Lots of uncertainty and variation in the naming scheme, unnecessary text in some values, one value of ‘F’ that makes no sense in the column.

Fix: Set uncertain values to NA and change naming scheme of certain values to be consistent as either ‘J’ or ‘A’.

Cleaning: forearm

Expected: Numerical values that follow a consistent scheme

dat <- dat %>% 
  mutate(forearm = case_when(
    forearm == 'na' | forearm == '?' | forearm == 'unknown' ~ NA,
    forearm == '38..5' ~ '38.5',
    TRUE ~ forearm
  ))
dat$forearm <- as.numeric(dat$forearm)

Issues: Few uncertain values and one that contained 2 decimal points

Fix: Set uncertain values as NAs and fix the format of the one mistake

Now I want to see if the values are biologically valid

First I plot the distributions of forearm measurements by species and look for any values that are completely unreasonable using field identification guides provided by the Davy Lab:

na.omit(dat) %>% 
  ggplot(aes(x = forearm)) + geom_histogram() +
  facet_wrap(~species, scales = 'free')

Bats with values outside reasonable range:

Little brown bat: Should be <=40mm

Eastern small-footed bat:: Should be <=33mm

dat <- dat %>% 
  filter(!(species == "Eastern small-footed bat" & forearm >= 34),
         !(species == "Little brown bat" & forearm > 40))

na.omit(dat) %>% 
  ggplot(aes(x = forearm)) + geom_histogram() +
  facet_wrap(~species, scales = 'free')

All values look reasonable now!

Cleaning: mass

Expected: Numerical values with a consistent scheme

dat <- dat %>% 
  mutate(mass = case_when(
    mass == 'unknown' | 
      mass == '~19.01*' | 
      mass == '13.23 with transmitter' ~ NA,
    TRUE ~ mass
  ))
dat$mass <- as.numeric(dat$mass)

Issues: Some uncertain values

Fix: Changed uncertain values to NAs and set the column as numeric

Again, I want to check for unreasonable values

Mass can be variable so I’ll be more lenient with expected values, but I will be checking for any outrageous numbers.

na.omit(dat) %>% 
  ggplot(aes(x = mass)) + geom_histogram() +
  facet_wrap(~species, scales = 'free')

dat <- dat %>% 
  filter(!(species == 'Eastern small-footed bat' & mass >= 7),
         !(species == 'Big brown bat' & mass >= 28),
         !(mass > 100))

Removed values: No eastern small-footed bat would be measured above 7g in weight, no big brown bat would be larger than 28g and finally there were 3 values over 500g which are an obvious error.

na.omit(dat) %>% 
  ggplot(aes(x = mass)) + geom_histogram() +
  facet_wrap(~species, scales = 'free')

Values are all reasonable now!

Final Cleaning

Finally, I want to get rid of any captures where all of the columns have NA values because they will be completely useless in my analyses!

dat <- dat %>% 
  filter(!(is.na(forearm) & is.na(mass) & is.na(sex) & is.na(age)))

All in all I have reorganized the dataframe from 5063 observations to a clean and consistent 4550 observations ready to be plotted and analyzed.

Analysis

Question 1: In which year were the most bats captured? The Least?

plot_1 <- dat %>%
  count(year) %>%
  mutate(year = fct_reorder(factor(year), n)) %>%
  ggplot(aes(x = year, y = n, fill = year)) +
  geom_col() +
  geom_text(aes(label = paste0("n = ", n)), vjust = -0.5) +
  theme_minimal() +
  theme(panel.grid = element_blank(), legend.position = 'none') +
  labs(x = 'Sample Year', y = 'Number of Bats Captured')

plot_1

Finding: 2018 had the highest number of bats captured (n =1677)

Finding: 2020 had the lowest number of bats captured (n = 128)

This steep dropoff in capture observations in 2020 was likely do the difficulties with COVID!

Question 2: Where were the most bats captured?

plot_2 <- dat %>%
  filter(!is.na(location)) %>%
  count(location) %>%
  mutate(location = fct_reorder(location, n, .desc = TRUE)) %>%
  ggplot(aes(y = location, x = n, fill = location)) + 
  geom_col() +
  theme_minimal() +
  theme(panel.grid = element_blank(), legend.position = 'none') +
  labs(x = 'Number of Bats Captured', y = 'Sample Location')

plot_2

Finding: Mount St Patrick easily has the highest number of bat captures, completely dwarfing the counts of all other locations.

This may be due to more rigorous capture efforts at this location, but it may also be a factor of extremely high populations in this area! I.E., this is a great place to focus research efforts.

Question 3: Which species were most frequently caught at each site?

common_sp <- dat %>%
  group_by(location, species) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(location, desc(count)) %>%
  group_by(location) %>%
  slice_max(order_by = count, n = 1)

plot_3 <- common_sp %>%
  filter(!is.na(location)) %>% 
  ggplot(aes(x = species, y = fct_reorder(location, species), color = species)) +
  geom_point(size = 5) +
  labs(title = "Most Common Species per Location",
       x = "Species",
       y = "Location") +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

plot_3

Finding: Most sites found little brown bats to be the most common (n = 15), some sites found big brown bats to be most common (n = 6), Burlington sampling found eastern small-footed bats as common as little brown bats.

In species-specific studies where you are aiming to capture a relatively difficult to eastern small-footed bats, focusing your trapping efforts at the Burlington site would likely be a great use of resources!

Figure 2. Pair of eastern small-footed bat (Myotis leibii) resting in a rock crevice

4a. Has forearm length changed over time?

Note: I excluded eastern red bats from the analysis as we only have 2 observations in the dataset

plot_5 <- dat %>%
  filter(!is.na(forearm), !is.na(species), !is.na(year), !(species == 'Eastern red bat')) %>%
  ggplot(aes(x = year, y = forearm)) +
  geom_point(alpha = 0.4, color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  facet_wrap(~ species, scales = "free_y") +
  labs(title = "Forearm Size Across Years",
       x = "Year",
       y = "Forearm Length (mm)") +
  theme_minimal() +
  theme(panel.grid = element_blank())

plot_5

Finding: Looks as though tricoloured bats may be having a reduction in their forearm length in recent years.

Finding: Northern long-eared bats look to be having an increase in forearm length, however small sample size is likely exaggerating the trend.

Finding: The other three species look to be quite consistent in their measurements across the sampling years.

4b. Has mass changed over time?

Note: I excluded eastern red bats from the analysis as we only have 2 observations in the dataset

plot_6 <- dat %>%
  filter(!is.na(mass), !is.na(species), !is.na(year), !(species == 'Eastern red bat')) %>%
  ggplot(aes(x = year, y = mass)) +
  geom_point(alpha = 0.4, color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  facet_wrap(~ species, scales = "free_y") +
  labs(title = "Mass Across Years",
       x = "Year",
       y = "Mass (g)") +
  theme_minimal() +
  theme(panel.grid = element_blank())

plot_6

Finding: All species apart from the eastern small-footed bats show a very consistent spread in their mass measurements each year.

Finding: Eastern small-footed bats look to be having an increase in their mean mass!

This could potentially have been caused by sampling late in the season as they come to the end of their foraging season and fatten up for the winter. But this could be used as rationale to perform future studies to see whether this is really the case!

I decided to explore whether there was any statistical basis in this visual finding.

First I made a subset of the data only containing eastern small-footed bats

dat_smallfooted <- dat %>% 
  filter(species == "Eastern small-footed bat", !is.na(mass), !is.na(year))

dat_smallfooted$year <- as.factor(dat_smallfooted$year)

Then I performed a one-way anova to determine whether mass differed significantly across years. This ANOVA took a subset from the total data of n = 154 eastern small-footed bats.

mass_anova <- aov(mass ~ year, data = dat_smallfooted)
summary(mass_anova)
             Df Sum Sq Mean Sq F value   Pr(>F)    
year          4  6.825  1.7063    9.06 1.41e-06 ***
Residuals   149 28.060  0.1883                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The anova returned a p-value associated with ‘year’ of 1.41e-06, much smaller than the alpha value of 0.05, indicating significance.

I then performed a TukeyHSD to determine which years were significantly difference from each other.

TukeyHSD(mass_anova)

There were 4 pairs of years that showed significantly different mass distributions:

2021-2019: p = ~0.01

2018-2017: p = ~0.002

2020-2018: p = ~0.005

2021-2018: p = ~0.0004

The most interesting of these results to me was the longest term pair from 2021-2018 being highly significant, indicating that there was quite the jump mass between these years.

In the following graph you can see a visualization of this jump in mass.

ggplot(dat_smallfooted, aes(x = year, y = mass)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.4) +
  labs(title = "Eastern Small-footed Bat Mass Over Time",
       x = "Year",
       y = "Mass (g)") +
  theme_minimal() +
  theme(panel.grid = element_blank())

Conclusions:

Using compiled bat capture data we are able to find which trapping sites are best to optimize our sampling efforts. We can also analyze our data to find trends within our data which can inform future exploratory studies. For example, if we wish to perform studies on the eastern small-footed bat and it’s apparent increase in mass, Burlington is likely the best site within our currently collected data.

Citations:

Phoebe D Edwards, Rudy Boonstra, Curtis O Bosson, N Jane Harms, Piia M Kukka, Craig K R Willis, Thomas S Jung, Effects of capture on stress-axis measures in endangered little brown bats (Myotis lucifugus), Journal of Mammalogy, Volume 103, Issue 1, February 2022, Pages 91–99, https://doi.org/10.1093/jmammal/gyab135

https://home.nps.gov/subjects/bats/studying-bats.htm (Figure 1)

https://www.ontario.ca/page/eastern-small-footed-myotis-recovery-strategy (Figure 2)