1. Introduction

Welcome to my EDA project! In this script, we will analyze 16S metabarcoding information to explore bacterial community compositions in potato-wart infested soil samples. Sequencing of the data was done using the Emu pipeline, you may find more information about it on their github if you’re interested.

In this project, I hope to discover genera of bacteria that are extensively studied as Biocontrol Agents (BCAs) in hopes of discovering unique species that could serve as potential defense organisms against Synchytrium endobioticum - a fungal pathogen responsible for inducing potato wart disease. In doing so, it can pave the way to additional research to investigate gene expression pathways associated with those given species to determine if they can confer resistances against the plant pathogen. I will also be assessing the difference in bacterial community composition between original 16S genomic DNA, and whole genome amplified (WGA) 16S genomic DNA to assess if laboratory preparation can affect the sequencing steps.

For each dataset in this EDA, I will provide a breakdown on:

1.) What the dataset entails

2.) How I will tidy each dataset

3.) What I plan to explore with each dataset

Below is an overview of this.

1.1 Overview of datasets:

batch2.csv:
1.) This dataset contains information on 16S bacterial relative abundance information already calculated by the Emu pipeline.

2.) I will tidy this dataframe by organizing the column names to something that can be easily understood, and removing columns that are unnecessary for subsequent analysis, such as columns pertaining to positive and negative controls as I’m interested in comparing between the two treatment groups. I will also be converting this dataframe into long/wide formats, and I will be ensuring there are no NA values.

3.) I will first store this csv file into a metadata dataframe called mm conduct various visualizations, such as creating barplots to see which Genera and Phyla are most present in the original and wga treatments and how they differ.

mock-community-data.csv
1.) This dataset contains the theoretical relative abundance compositions for bacterial members that are present in the positive control

2.) I will not need to tidy this dataframe other than filtering for columns that I will be interested in.

3.) I will use this dataframe as a reference database to compare and see whether my sampling was successful by assessing if the expected bacteria members were present in my positive control after sequencing, and whether their relative composition is similar to the theoretical compositions.

emu-combined-tax_id-counts.csv
1.) This dataframe will contain metadata regarding ASVs (amplicon sequence variants). It stores information on the absolute counts of each species opposed to their relative abundance like in batch2.csv .

2.) I will need to tidy this dataframe to remove the columns containing information on positive and negative controls, and I need to reorganize the columns so that they are easy to understand. I will also need to convert the dataframe between wide format and long format to do multiple types of analysis, as well as ensuring there are no NA values.

3.) I will conduct NMDS analysis and carry out PERMANOVA tests to determine if there is a significant difference between the two treatment groups using largely the vegan R package.

1.2 Packages to be installed:

If you have not done so already, make sure to install the following five R packages using the install.packages() function:

tidyverse, vegan, ggpubbr , tidytext, forcats

Then load them so that we may begin the data analysis:

library(tidyverse)
library(vegan)
library(ggpubr)
library(tidytext)
library(forcats)

1.3 Data Import

Now read the files as indicated in section 1.1:

# Read files

# Full meta data on relative abundance:
mm <- read_csv('data/batch2.csv') 

# Mock community data on relative abundance:
mock <- read_csv('data/mock-community-data.csv') 

# Full meta data on relative abundance, will be tidied:
b2 <- read_csv('data/batch2.csv') 

# Meta data on absolute counts:
b2_abs <- read_csv('data/emu-combined-tax_id-counts.csv') 

1.4 Define functions

We will need to create a couple of functions to help with out EDA. Both will sort our datasets to make them more tidy. The num_order() function will reorganize our columns based on numerical order from lowest to highest. The col_sorter() function will reorganize our taxonomy columns to be in the proper order of Kingdom, People, Class, Order, Family, Genus, Species. Here is the code for both of them:

num_sorter():

num_sorter <- function(df){
  cols <- colnames(df)   # Gets the column names
  fixed_cols <- cols[1:8] # Assigns the fixed columns I want to keep the same way
  sort_cols <- cols[-(1:8)] # Assigns the columns I want to manipulate (after col 8)
  sort_cols_number <- as.numeric(gsub("\\D", "", sort_cols)) # Extracts the numbers from the column names
  sort_cols <- sort_cols[order(sort_cols_number)] # Orders columns based on extracted numbers
  cols <- c(fixed_cols, sort_cols)   # Combines the fixed and sorted columns
  df <- df[, cols]   # Reorder the dataframe
  return(df) # Returns the dataframe
}

col_sorter():

# Create Sorter function to organize taxonomic ranks in order of highest to lowest after tax_id:
col_order <- function(df) {
  df |>
    select(tax_id, superkingdom, phylum, class, order, family, genus, species, everything())
}

Now that our set up is done, we are ready for the next step which involves data tidying.

2. Data Tidying

2.1 Tidying of relative abundance and asv metadata

Let’s quickly inspect the metadata containing our asv and relative abundance information:

b2 # 3135 rows x 37 columns
## # A tibble: 3,135 × 37
##    tax_id  species                  genus family order class phylum superkingdom
##    <chr>   <chr>                    <chr> <chr>  <chr> <chr> <chr>  <chr>       
##  1 100134  Anaerocolumna xylanovor… Anae… Lachn… Clos… Clos… Firmi… Bacteria    
##  2 1002    Hugenholtzia roseola     Huge… Berna… Cyto… Cyto… Bacte… Bacteria    
##  3 1002526 Pseudomonas formosensis  Pseu… Pseud… Pseu… Gamm… Prote… Bacteria    
##  4 1002689 Occallatibacter riparius Occa… Acido… Acid… Acid… Acido… Bacteria    
##  5 1002691 Occallatibacter savannae Occa… Acido… Acid… Acid… Acido… Bacteria    
##  6 1002870 Gaiella occulta          Gaie… Gaiel… Gaie… Rubr… Actin… Bacteria    
##  7 1003110 Verrucosispora maris     Verr… Micro… Micr… Acti… Actin… Bacteria    
##  8 1004261 Oceanobacillus indicire… Ocea… Bacil… Baci… Baci… Firmi… Bacteria    
##  9 1004279 Mesorhizobium muleiense  Meso… Phyll… Rhiz… Alph… Prote… Bacteria    
## 10 1004304 Hydrotalea sandarakina   Hydr… Chiti… Chit… Chit… Bacte… Bacteria    
## # ℹ 3,125 more rows
## # ℹ 29 more variables: `barcode05.fastq-threshold-0.0001` <dbl>,
## #   barcode07.fastq <dbl>, `barcode13.fastq-threshold-0.0001` <dbl>,
## #   `barcode12.fastq-threshold-0.0001` <dbl>, barcode05.fastq <dbl>,
## #   barcode03.fastq <dbl>, barcode06.fastq <dbl>, barcode08.fastq <dbl>,
## #   `barcode06.fastq-threshold-0.0001` <dbl>,
## #   `barcode08.fastq-threshold-0.0001` <dbl>, …
b2_abs # 3135 rows x 37 columns
## # A tibble: 3,135 × 37
##    tax_id  species                  genus family order class phylum superkingdom
##    <chr>   <chr>                    <chr> <chr>  <chr> <chr> <chr>  <chr>       
##  1 100134  Anaerocolumna xylanovor… Anae… Lachn… Clos… Clos… Firmi… Bacteria    
##  2 1002    Hugenholtzia roseola     Huge… Berna… Cyto… Cyto… Bacte… Bacteria    
##  3 1002526 Pseudomonas formosensis  Pseu… Pseud… Pseu… Gamm… Prote… Bacteria    
##  4 1002689 Occallatibacter riparius Occa… Acido… Acid… Acid… Acido… Bacteria    
##  5 1002691 Occallatibacter savannae Occa… Acido… Acid… Acid… Acido… Bacteria    
##  6 1002870 Gaiella occulta          Gaie… Gaiel… Gaie… Rubr… Actin… Bacteria    
##  7 1003110 Verrucosispora maris     Verr… Micro… Micr… Acti… Actin… Bacteria    
##  8 1004261 Oceanobacillus indicire… Ocea… Bacil… Baci… Baci… Firmi… Bacteria    
##  9 1004279 Mesorhizobium muleiense  Meso… Phyll… Rhiz… Alph… Prote… Bacteria    
## 10 1004304 Hydrotalea sandarakina   Hydr… Chiti… Chit… Chit… Bacte… Bacteria    
## # ℹ 3,125 more rows
## # ℹ 29 more variables: `barcode05.fastq-threshold-0.0001` <dbl>,
## #   barcode07.fastq <dbl>, `barcode13.fastq-threshold-0.0001` <dbl>,
## #   `barcode12.fastq-threshold-0.0001` <dbl>, barcode05.fastq <dbl>,
## #   barcode03.fastq <dbl>, barcode06.fastq <dbl>, barcode08.fastq <dbl>,
## #   `barcode06.fastq-threshold-0.0001` <dbl>,
## #   `barcode08.fastq-threshold-0.0001` <dbl>, …

We can see that there are 3135 rows, and 37 columns for both dataframes. The taxonomy information is split by their respective groups, but out of order (we will correct this later on). Currently both dataframes are in wide format where the rows represent a unique species as indicated by the tax_id column. The columns containing barcode information refer to different samples that were pooled together during library prepration. I will discuss a bit more on this in a bit, but first let’s quickly check to see if each row is truly unique using an if statement:

if(length(unique(b2$tax_id)) == nrow(b2) &
   length(unique(b2_abs$tax_id)) == nrow(b2_abs)){
  print("All tax_ID values are unique for both dataframes.")
} else {
  print("All tax_ID values are not unique for both dataframes.")
}
## [1] "All tax_ID values are unique for both dataframes."

For the next step, let’s run our sorter functions on the dataframes:

# Run the functions on the dataframes:
b2 <- col_order(b2)
b2 <- num_sorter(b2)
b2_abs <- col_order(b2_abs)
b2_abs <- num_sorter(b2_abs)

The columns have now been organized so that the taxonomy columns are in order, and the barcode columns are in order as well. Barcodes in this case represent a unique sample site as DNA from multiple sample sites were pooled together during library preparation and sequencing. Because these dataframes contain information on both original and wga treatments on DNA, barcodes01-07 represent original samples whereas barcodes08-14 represent WGA samples. WGA was done to increase the available concentration of DNA to work with, but that may cause some biases in sequencing data as some community members may be more enhanced than others. This is why it’s of interest to investigate the difference between the two treatment groups. The other two samples, barcode15 and barcode16 represent positive and negative control samples.

Note that there are additional columns containing 0.0001 threshold information for each of the 14 barcode samples. These columns include only the species that are at or above this given threshold. Because we are interested in assessing the differences in treatments for all samples, we can remove them from our data. along with the positive and negative control sample information. The positive control sample information will be assessed using the mm dataframe which we will tidy later on.

To tidy, we’ll keep the first 8 columns containing taxonomy information, and get rid of the columns we’re not interested in:

# Note that the first 8 columns contain taxonomy information, so we'll select those columns and columns ending with fastq:
b2 <- b2 |>
  select(1:8, ends_with("fastq"), -barcode15.fastq, -barcode16.fastq)

b2_abs <- b2_abs |>
  select(1:8, ends_with("fastq"), -barcode15.fastq, -barcode16.fastq)

Narrowed it down to 22 columns now.

Now let’s rename these columns to something that’s easier to understand. We will rename barcodes01-07 to og_rel_1-7 and barcodes08-14 to wga_rel_1-7 for the b2 dataframe containing relative abundance information. For the b2_abs dataframe, we will convert it to og_abs_1-7 and wga_abs_1-7 instead.

# Rename the columns to something easier to understand:
b2 <- b2 |>
  rename(
    og_rel_1 = barcode01.fastq, og_rel_2 = barcode02.fastq, og_rel_3 = barcode03.fastq,
    og_rel_4 = barcode04.fastq, og_rel_5 = barcode05.fastq, og_rel_6 = barcode06.fastq,
    og_rel_7 = barcode07.fastq, wga_rel_1 = barcode08.fastq, wga_rel_2 = barcode09.fastq,
    wga_rel_3 = barcode10.fastq, wga_rel_4 = barcode11.fastq, wga_rel_5 = barcode12.fastq,
    wga_rel_6 = barcode13.fastq, wga_rel_7 = barcode14.fastq
  )

b2_abs <- b2_abs |>
  rename(
    og_abs_1 = barcode01.fastq, og_abs_2 = barcode02.fastq, og_abs_3 = barcode03.fastq, 
    og_abs_4 = barcode04.fastq, og_abs_5 = barcode05.fastq, og_abs_6 = barcode06.fastq,
    og_abs_7 = barcode07.fastq, wga_abs_1 = barcode08.fastq, wga_abs_2 = barcode09.fastq,
    wga_abs_3 = barcode10.fastq, wga_abs_4 = barcode11.fastq, wga_abs_5 = barcode12.fastq,
    wga_abs_6 = barcode13.fastq, wga_abs_7 = barcode14.fastq
  )

# See if it worked:
colnames(b2)
##  [1] "tax_id"       "superkingdom" "phylum"       "class"        "order"       
##  [6] "family"       "genus"        "species"      "og_rel_1"     "og_rel_2"    
## [11] "og_rel_3"     "og_rel_4"     "og_rel_5"     "og_rel_6"     "og_rel_7"    
## [16] "wga_rel_1"    "wga_rel_2"    "wga_rel_3"    "wga_rel_4"    "wga_rel_5"   
## [21] "wga_rel_6"    "wga_rel_7"
colnames(b2_abs)
##  [1] "tax_id"       "superkingdom" "phylum"       "class"        "order"       
##  [6] "family"       "genus"        "species"      "og_abs_1"     "og_abs_2"    
## [11] "og_abs_3"     "og_abs_4"     "og_abs_5"     "og_abs_6"     "og_abs_7"    
## [16] "wga_abs_1"    "wga_abs_2"    "wga_abs_3"    "wga_abs_4"    "wga_abs_5"   
## [21] "wga_abs_6"    "wga_abs_7"

Now it will be better if we could organize the dataframes such that one dataframe will contain all the metadata information on original samples, and the other will contain all the metadata information on wga samples. Let’s go ahead and do this:

# Select the taxonomy and 'og' columns from both b2 and b2_abs dataframes:
b2_og <- select(b2, tax_id:species, starts_with("og_rel"))
og_b2_abs <- select(b2_abs, starts_with("og_abs"))

# Combine the selected columns into a new dataframe
og <- bind_cols(b2_og, og_b2_abs)

# Select the taxonomy and 'wga' columns from both b2 and b2_abs dataframes
b2_wga <- select(b2, tax_id:species, starts_with("wga_rel"))
wga_b2_abs <- select(b2_abs, starts_with("wga_abs"))

# Combine the selected columns into a new dataframe
wga <- bind_cols(b2_wga, wga_b2_abs)

Now let’s create a new column called taxonomy for the b2 column. This column will be a concatenated format of the other taxonomy columns in a way such that it will be easy to extract taxonomy information if needed. We will do this by combining the taxonomy columns into one string where each hierarchical classification is separated the following way: k__x; p__x; c__x; o__x; f__x; g__x; s__x . Note that x represents the hierarchical classification for each letter (k = superkingdom, p = phylum, c = class, o = order, f = family, s = species):

b2 <- b2 |>
  mutate(taxonomy = paste0("k__", superkingdom, "; ",
                           "p__", phylum, "; ",
                           "c__", class, "; ",
                           "o__", order, "; ",
                           "f__", family, "; ",
                           "g__", genus, "; ",
                           "s__", species))

b2_abs <- b2_abs |>
  mutate(taxonomy = paste0("k__", superkingdom, "; ",
                           "p__", phylum, "; ",
                           "c__", class, "; ",
                           "o__", order, "; ",
                           "f__", family, "; ",
                           "g__", genus, "; ",
                           "s__", species))

After that, we will separate the asv dataframe columns for further analysis. We will create a taxonomy dataframe only containing taxonomy information for each tax_id, and an asv dataframe only containing the absolute counts for each tax_id. This will be for the b2_abs dataframe as that is what we will largely be doing our asv analyses with to compare between treatment groups

First let’s create the taxonomy dataframe and title it b2_tax:

# Start with taxonomy dataframes:
b2_tax <- b2 |>
  select(tax_id, 2:8, taxonomy) |>
  relocate(taxonomy, .after = tax_id)
print(b2_tax$taxonomy[1])
## [1] "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Anaerocolumna; s__Anaerocolumna xylanovorans"

Seems to work, now the dataframe includes the concatenated taxonomy information that we created earlier on - nice!

Now make the ASV table. We will only filter for the tax_id column, and the sample columns. We will also convert any NA values to 0 and values to integers. This is because for subsequent steps such as when rarefying our data, the function will only accept integer values - so let’s go ahead and do that:

# Do the same for the absolute count data. Also note that for the analysis steps, functions such as rarefy will only accept integer count values, so while we're at it, let's convert the values to whole numbers as well.
b2_abs_asv <- b2_abs |>
  select(tax_id, starts_with(c("og", "wga"))) |>
  mutate(across(starts_with(c("og", "wga")), ~replace_na(., 0))) |>
  mutate(across(starts_with(c("og", "wga")), ~as.integer(round(.))))


# Check to see if it worked:
head(b2_abs_asv)
## # A tibble: 6 × 15
##   tax_id  og_abs_1 og_abs_2 og_abs_3 og_abs_4 og_abs_5 og_abs_6 og_abs_7
##   <chr>      <int>    <int>    <int>    <int>    <int>    <int>    <int>
## 1 100134         0        0        0       21        0        0        0
## 2 1002           0        0        0        0        0        0        0
## 3 1002526        0        0        0        0        0        0        0
## 4 1002689     1267      392     1033     1010     2130      145      245
## 5 1002691     1967      583     1297     2224     2952      211      472
## 6 1002870     4234     1318     2061     1370     1814     1277     2330
## # ℹ 7 more variables: wga_abs_1 <int>, wga_abs_2 <int>, wga_abs_3 <int>,
## #   wga_abs_4 <int>, wga_abs_5 <int>, wga_abs_6 <int>, wga_abs_7 <int>

2.2 Tidying of mock community analysis data

Note that for the mm dataframe, it is essentially the same as the b2 dataframe since they both contain metadata on relative abundance of bacterial species. Since I only require information pertaining to the barcode15.fastq column which contains information on the positive control bacteria composition, I do not need to do much data tidying. As such, I’ll just be filtering my dataframe to select the top 10 abundant bacterial species for the positive control in descending order.

I will then see which species are common between the expected mock community list and our positive controls, and creating a list of species that are common. The mock community data is provided by ZymoBIOMICS™ Microbial Community DNA Standard Catalog No. D6305:

# Filter metadata for top 10 abundant bacterial species for the postive control:
mm <- mm |>
  select(tax_id, species, barcode15.fastq) |>
  filter(barcode15.fastq > 0) |>
  mutate(barcode15.fastq = barcode15.fastq * 100) |>
  arrange(desc(barcode15.fastq)) |>
  slice(1:10)

# Print out which species are the most abundant between my sequencing data and theoretical data:
common_species <- inner_join(mm, mock, by = "species") |>
  select(species) |>
  distinct()

common_species
## # A tibble: 7 × 1
##   species               
##   <chr>                 
## 1 Salmonella enterica   
## 2 Bacillus subtilis     
## 3 Escherichia coli      
## 4 Staphylococcus aureus 
## 5 Listeria monocytogenes
## 6 Enterococcus faecalis 
## 7 Pseudomonas aeruginosa

We can see that 7 out of 8 of the theoretical most abundant species are present in our sequencing data when comparing to the mock community data. Before we proceed to visualization, I need to tidy the data. I will first combine the common species list with the values that we received, and then convert the dataframe to long format for visualization:

# Overrides the mm dataframe to combine the common species list with the values that we received:
mm <- mm |>
  filter(species %in% common_species$species) |>
  select(species, barcode15.fastq) |>
  arrange(species)

# Filter mock data down to the same species as in the data df:
mock <- mock |>
  filter(species %in% common_species$species) |>
  select(species, "16S_only") |>
  arrange(species)

# Now bind the dataframes together:
data_mock <- inner_join(mm, mock, by = "species")

# Reshape the data to a long format for visualization:
data_mock <- data_mock |>
  pivot_longer(cols = c(barcode15.fastq, "16S_only"), names_to = "sample", values_to = "percentage")

# Changing the column values for visualization:
data_mock$sample[data_mock$sample == "barcode15.fastq"] <- "actual abundance"
data_mock$sample[data_mock$sample == "16S_only"] <- "theoretical abundance"

After doing this, we’ll proceed to the visualization of our data.

3. Visualization

3.1 Determining the quality of our sequencing

We will now visualize our results to compare the mock community theoretical results with our practical results. Let’s run a ggplot to show the distribution in relative abundance percentages between the two groups:

ggplot(data_mock, aes(x = species, y = percentage, fill = sample)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +  
  labs(title = "Figure 1. Comparing the Relative Abundance Composition of Positive Controls",
       x = "Species", 
       y = "Percentage (%)", 
       fill = "Sample") +
  scale_fill_discrete(labels = c("Theoretical values", "Sequencing values")) +
  scale_fill_manual(values = c("steelblue", "maroon")) + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 15),
        axis.title = element_text(size=15),
        axis.text.x.bottom = element_text(face="italic"))

From our results, we can see that they are quite similar. The ZymoBIOMICs website mentions that if the sequences are within 15% of each other, then it sequencing was successful. Knowing this now, we will create rarefaction curves to see if we had enough sequencing depth to count all reads in our samples.

We will be rarefying our data using the vegan R package. To first do so, we need to convert our dataframe containing absolute counts into wide format, and make our columns tax_id while the rows will be our samples. We will also remove the column containing sample information and do some minor tidying. Let’s go ahead and do this:

# First temporarily convert the dataframe to long format
b2_abs_wide <- pivot_longer(b2_abs_asv, cols = -tax_id, names_to = "sample", 
                              values_to = "value") 

# Revert the dataframe back to wide format
b2_abs_wide <- pivot_wider(b2_abs_wide, 
                       names_from = tax_id, 
                       values_from = value) |> as.data.frame()

# Assigning the row names as the sample:
rownames(b2_abs_wide) <- b2_abs_wide$sample


# Removing the column containing sample information because we need this in matrix format to be able to work with it. But before we do that, let's pull the names of the sample columns as it will help us with subsequent analysis. We'll also create a new column titled group which indicates whether it is og or wga for the treatment type for a given sample ID:

b2_names <- b2_abs_wide |>
  select(sample) |>
  mutate(group = case_when(
    startsWith(sample, "og") ~ "og",
    startsWith(sample, "wga") ~ "wga"
  ))


# Now let's go ahead and remove the sample column
b2_abs_wide <- b2_abs_wide[,-1]
rownames(b2_abs_wide)[0] <- "samples"

Now before moving onto rarefy our data, we need to determine what the lowest number of read counts for our samples will be. Using the code below, we can determine that it is 95,460. Now let’s visualize the data. Note that this process may take some time.

# Before we rarefy our data, we need to see what the lowest number of counts is for each sample. Let's go ahead and see that:
raremax <- min(rowSums(b2_abs_wide)) # Shows the lowest number of counts is 95,460

# Now let's visualize our rarefy data. The initial output by the vegan package will look messy, so we'll extract the output from the rarecurve() function and visualize it in a neater way using ggplot:

# First run the rarecurve() function and store the output into a variable called rarecurve_data
rarecurve_data <- rarecurve(b2_abs_wide, step = 100, sample = raremax)

# This is taking each element of the list and binds them together to make a dataframe
map_dfr(rarecurve_data, bind_rows) |>
  bind_cols(sample = rownames(b2_abs_wide),) |>
  pivot_longer(-sample) |>
  drop_na() |>
  mutate(counts = as.numeric(str_replace(name, "N",""))) |>
  
# Now plotting the graph:
 ggplot(aes(x=counts, y=value, group=sample, color=sample)) +
  geom_vline(xintercept = raremax, color ="gray") +
  geom_line(size=0.8) +
  labs(x="Sequencing read counts", y="Number of observed species", 
       title="Figure 2. Rarefaction visualization of sequencing samples.") +
  scale_color_discrete(name = "Sample ID") +
  theme_classic2() +
  scale_x_continuous(labels = scales::comma) 

Notice that the output provided by vegan and base R is messy, but with ggplot we can use the data provided by the rarecurve() function, and store it into a separate variable to plot it with ggplot and make it more visually appealing. From our results, we can see that all sequencing samples plateau as the sequencing read counts approach a constant number of species for each sample, thereby indicating there was enough sequencing to cover all species within the samples. The rarefraction curves allow us to quickly visualize the number of observed species recorded for the number of sequences. This indicates our sequencing quality was good, and we can now proceed to visualize the differences in community composition between treatments.

3.2 Assessing community composition differences between treatment groups

First, we will visualize the most abundant phyla and genera within our samples between the original and WGA treatment groups. To do these analyses, we will need to first convert our dataframes to long format, and then do some tidying to filter by the most common phyla and genera. For instance, we need to convert NA values to Unclassified for taxa that have not been identified at that level. Let’s go ahead and visualize the most abundant phyla first:

#For data analysis, we'll go ahead and use the og_b2 and wga_b2 dataframes.

# Reshape the dataframes into long format
og_long <- b2_og |>
  pivot_longer(cols = starts_with("og_rel"), names_to = "sample", values_to = "value") %>%
  drop_na(value) |>
  group_by(sample) |>
  mutate(percentage = value / sum(value) * 100) |>
  as.data.frame()

wga_long <- b2_wga |>
  pivot_longer(cols = starts_with("wga_rel"), names_to = "sample", values_to = "value") %>%
  drop_na(value) |>
  group_by(sample) |>
  mutate(percentage = value / sum(value) * 100) |>
  as.data.frame()


# Now let's visualize the most abundant phyla within our groups. To do so, we'll first create new dataframes titled og_phy and wga_phy to store our information so that we'll be able to compare them side by side using ggarrange().We are summing the total relative abundance for each phylum across each sample, and then multiplying it by 100 to get a percentage as the relative abundance information is stored as a fraction. We are then replacing any NA values as Unclassified, and plotting our results using ggplot():

# Tidying data to select phyla and replacing NA values with Unclassified:
og_phy <- og_long |>
  group_by(sample, tax_id, phylum, value) |>
  summarize(value = sum(value), .groups="drop") |>
  group_by(sample, phylum) |>
  summarize(value = value*100, .groups="drop") |>
  mutate(phylum = replace_na(phylum, "Unclassified")) 

wga_phy <- wga_long |>
  group_by(sample, tax_id, phylum, value) |>
  summarize(value = sum(value), .groups="drop") |>
  group_by(sample, phylum) |>
  summarize(value = value*100, .groups="drop") |>
  mutate(phylum = replace_na(phylum, "Unclassified"))

# Reordering levels of phylum before to descending order before creating the plots
og_phy$phylum <- fct_reorder(og_phy$phylum, og_phy$value, .desc = TRUE)
wga_phy$phylum <- fct_reorder(wga_phy$phylum, wga_phy$value, .desc = TRUE)


# Code for plots:
og_phy_bar <- ggplot(data=og_phy, aes(x=sample, y=value, fill=phylum)) +
  geom_col() +
  labs(y="Percentage (%)", x="Sample ID", fill="Phyla", title = "Figure 3A. Visualizing most abundant phyla for original DNA") +
  theme_classic2() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.text = element_text(size = 8, face="italic"))

wga_phy_bar <- ggplot(data=wga_phy, aes(x=sample, y=value, fill=phylum)) +
  geom_col() +
  labs(y="Percentage (%)", x="Sample ID", fill="Phyla", 
       title = "Figure 3B. Visualizing most abundant phyla for WGA treatment") +
  theme_classic2() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.text = element_text(size = 8, face="italic"))

# Visualize:
og_phy_bar

wga_phy_bar

From this, we can see that the composition in phyla clearly differ between the original and WGA treatments. The most abundant treatment for the original samples appears to be Chloroflexi, and for the WGA samples it appears to be Firmicutes. Regardless, both are common BCAs that have been determined in soil samples based on past studies (Lee et al. 2017). This is a good sign based on this preliminary data.

Now let’s visualize the most abundant genera. Note that to do this, there are a lot of genera, as such, it will not be possible to visualize all of them. Due to this, we need to filter down the data to the top 10 most abundant genera. To do this, we need to sum the relative abundance for all genera and see the top 10 for each treatment group. Based on that, we need to set the cut off point to the minimum threshold we are looking for, and then visualize it using ggplot. This is summarized in the code and plots below:

# Creating temporary variables to tmp1 and tmp2 to store filtered og and wga data respectively containing information on the relative abundance percentages of each genus per sample:

tmp1 <- og_long |>
  group_by(sample, tax_id, genus, value) |>
  summarise(value = sum(value), .groups="drop") |>
  group_by(sample, genus) |>
  summarise(value = value*100, .groups="drop") |>
  mutate(genus = replace_na(genus, "Unclassified"))

tmp2 <- wga_long |>
  group_by(sample, tax_id, genus, value) |>
  summarise(value = sum(value), .groups="drop") |>
  group_by(sample, genus) |>
  summarise(value = value*100, .groups="drop") |>
  mutate(genus = replace_na(genus, "Unclassified"))


# Before we pool for rare taxa, let's first quickly assess the top 10 genera percentages for each of the treatment groups, and use that as the respective cut-off points:


# Shows the lowest percentage would be 2.29%, we'll use that as our cut-off point for the og samples:
tmp1_min <- tmp1 |>
  group_by(genus) |>
  summarize(max = max(value)) |>
  arrange(desc(max)) |>
  slice(10) |>
  pull(max)

# Shows the lowest percentage would be 6.46%, we'll use that as our cut-off point for the wga samples:
tmp2_min <- tmp2 |>
  group_by(genus) |>
  summarize(max = max(value)) |>
  arrange(desc(max)) |>
  slice(10) |>
  pull(max)

# Pooling rare taxa for percentages <= 2% and storing them into respective variables for og and wga:
og_pool <- tmp1|>
  group_by(genus) |>
  summarize(pool = max(value) <= tmp1_min, .groups="drop") 

wga_pool <- tmp2 |>
  group_by(genus) |>
  summarize(pool = max(value) <= tmp2_min, .groups="drop") 


# Now visualizing the dataframes to filter by genus <= 2% for both the original and wga treatments:

# Combining the pooled data information with the temporary dataframes, and then graphing the results using ggplot():
og_gen <- inner_join(tmp1, og_pool, by="genus") |>
  mutate(genus = if_else(pool, "Other", genus)) |>
  group_by(sample, genus) |>
  summarize(value = sum(value))


og_gen_bar <- og_gen |>
  mutate(genus = fct_reorder(genus, value, .desc = TRUE)) |>
  ggplot(aes(x=sample, y=value, fill=genus)) +
  geom_col() +
  labs(y="Percentage (%)", x="Sample ID", fill="Genera",
       title = "Figure 4A. Visualizing most abundant genera for original DNA") +
  theme_classic2() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.text = element_text(size = 8, face="italic"))

wga_gen <- inner_join(tmp2, wga_pool, by="genus") |>
  mutate(genus = if_else(pool, "Other", genus)) |>
  group_by(sample, genus) |>
  summarize(value = sum(value)) 

wga_gen_bar <- wga_gen |>
  mutate(genus = fct_reorder(genus, value, .desc=TRUE)) |>
  ggplot(aes(x=sample, y=value, fill=genus)) +
  geom_col() +
  labs(y="Percentage (%)", x="Sample ID", fill="Genera", 
       title = "Figure 4B. Visualizing most abundant genera for WGA treatment") +
  theme_classic2() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.text = element_text(size = 8, face="italic"))


# Visualize:
og_gen_bar

wga_gen_bar

As we can see from these graphs, the most abundant genera still belong to the ‘Other’ category for both treatment groups based on the thresholds we have defined. This is somewhat expected as there are a lot of genera given the large dataset I am working with, and it may not be possible to visualize them all using this method. Despite this however, I do observe genera such as Bacilius and Clostridium in the WGA treatment group which are genera that have been known for having BCA properties against fungal pathogens in other soil microbiomes (Lee et al. 2017).

We can clearly see from these visualizations that there is a difference between the original and WGA treatment groups. Now let’s do some further statistical tests and visualizations to determine if the difference is significant or not. To do this, we will use various functions within the vegan R package. We will attempt to visualize our data using methods such as MetaMDS and betadisper() plots, while assessing statistical differences using PERMANOVA.

First let’s create an NMDS plot to visualize the difference between the original and WGA treatments. To do so, we’ll need to first normalize our data using decostand() as part of the vegan R package. I will be using the hellinger transformation by selecting the method hell. This performs a square root normalization on our data:

# Normalize the data using decostand(). There are multiple methods to choose from, but I have decided to go with hellinger transformation. 

b2_norm <- decostand(b2_abs_wide, method = "hell")


# Run the NMDS model to visualize composition data
set.seed(100)
nmds_results <- metaMDS(b2_norm, distance="robust.aitchison")
## Run 0 stress 0.0943651 
## Run 1 stress 0.09712313 
## Run 2 stress 0.1027415 
## Run 3 stress 0.107571 
## Run 4 stress 0.09712318 
## Run 5 stress 0.09234865 
## ... New best solution
## ... Procrustes: rmse 0.05151316  max resid 0.09846235 
## Run 6 stress 0.1015796 
## Run 7 stress 0.1075709 
## Run 8 stress 0.3578482 
## Run 9 stress 0.09118546 
## ... New best solution
## ... Procrustes: rmse 0.03079115  max resid 0.05613709 
## Run 10 stress 0.1075709 
## Run 11 stress 0.09319933 
## Run 12 stress 0.1075708 
## Run 13 stress 0.0971231 
## Run 14 stress 0.1887572 
## Run 15 stress 0.1872123 
## Run 16 stress 0.1253094 
## Run 17 stress 0.09118563 
## ... Procrustes: rmse 0.0005617368  max resid 0.00157867 
## ... Similar to previous best
## Run 18 stress 0.1253098 
## Run 19 stress 0.0982696 
## Run 20 stress 0.273934 
## *** Best solution repeated 1 times
# Extract the NMDS scores
nmds_scores <- as.data.frame(scores(nmds_results)$sites)

# Determine the centroids for the NMDS graph:
group_centroids <- data.frame(
  Treatment = c("og", "wga"),
  Centroid_X = c(mean(nmds_scores$NMDS1[b2_names$group == "og"]),
                 mean(nmds_scores$NMDS1[b2_names$group == "wga"])),
  Centroid_Y = c(mean(nmds_scores$NMDS2[b2_names$group == "og"]),
                 mean(nmds_scores$NMDS2[b2_names$group == "wga"]))
)

# Now create a dataframe containing the nmds data that we want to visualize with ggplot:
nmds_data <- data.frame(
  Treatment = b2_names$group,
  NMDS1 = nmds_scores$NMDS1,  
  NMDS2 = nmds_scores$NMDS2,  
  xend = c(rep(group_centroids[1,2], 1), rep(group_centroids[2,2], 1)),
  yend = c(rep(group_centroids[1,3], 1), rep(group_centroids[2,3], 1))
)

# Now plot the data:
ggplot(nmds_data, aes(NMDS1, NMDS2)) +
  geom_point(aes(colour = Treatment), size=2) + 
  stat_ellipse(geom = "polygon", alpha = 0.04, aes(group = Treatment),
               colour = "black", fill = "steelblue") +
  geom_point(data = group_centroids, aes(x = Centroid_X, y = Centroid_Y),
             colour = "black", size = 2, shape = 7) +
  geom_segment(data = nmds_data, aes(x = NMDS1, y = NMDS2, xend = xend, 
                                     yend = yend, colour = Treatment), alpha = 0.5) +
  scale_colour_manual(name = "Treatment type", labels = unique(nmds_data$Treatment),
                      values = c("cyan", "magenta")) +
  labs(title = "Figure 5. NMDS plot comparing the two soil sample treatment groups") +
  theme_classic()

As we can see from this, there is some overlap between the two groups, but they are largely distinct from one another. The obtained stress value for this test was 0.09, which means the fit is fair, but not good as it isn’t below 0.05.

Now let’s also do a dispersion plot as it will allow us to visualize our results in different ways. To do so, we’ll first conduct dissimilarity indices using vegdist() and then visualize the dispersion between our data using betadisper() for multivariate dispersion analysis:

# Calculating dissimilarity indices:
perm_dist <- vegdist(b2_norm, method = "robust.aitchison")

# Let's now check to see if the two treatment groups have similar statistically similar multivariate dispersion. 

dispersion <- betadisper(perm_dist, group = b2_names$group, type = "centroid")
plot(dispersion, main = "Figure 6. Multivariate dispersion plot") 

# Visually, we can see the groups are distinct from each other.

Similar to the NMDS plot, we can clearly see the two treatments differ. Now let’s assess how significant the difference is Permutational multivariate analysis of variance (PERMANOVA):

perm_results <- adonis2(perm_dist~as.factor(nmds_data$Treatment), data=perm_dist,
                        permutations=9999)
perm_results
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 9999
## 
## adonis2(formula = perm_dist ~ as.factor(nmds_data$Treatment), data = perm_dist, permutations = 9999)
##                                Df SumOfSqs      R2      F Pr(>F)    
## as.factor(nmds_data$Treatment)  1   736.47 0.23838 3.7559  4e-04 ***
## Residual                       12  2353.01 0.76162                  
## Total                          13  3089.48 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the PERMANOVA test, we can see that the difference is significant as the p-value is 4x10-4 and is far less than 0.05. This indicates that the probability that the two groups are different are highly not due to chance, suggesting there is a significant difference between the two treatments.

4. Discussion

In conclusion, this EDA has shown me that there is a significant difference between the original and WGA treatments when it comes to bacterial community composition. This suggests to me that if possible, I should avoid using WGA DNA for my library preparation as it could lead to biases and amplification of different species as highlighted in Figures 4 and 5. However, I did find bacterial taxa within both treatments that have been previously found to have BCA properties. As such, these preliminary results seem promising in suggesting that upon pathogen infection to S. endobioticum, soil microbiomes may undergo a shift in their community composition to favour bacterial taxa having BCA properties to protect them from environmental stressors. However, more studies are required to validate this.

Bibliography

Lee, Chol Gyu, Toshiya Iida, Yohei Uwagaki, Yoko Otani, Kazuhiro Nakaho, and Moriya Ohkuma. 2017. “Comparison of Prokaryotic and Eukaryotic Communities in Soil Samples with and Without Tomato Bacterial Wilt Collected from Different Fields.” Microbes and Environments 32 (4): 376–85. https://doi.org/10.1264/jsme2.ME17131.