Welcome to my EDA project! In this script, we will analyze 16S metabarcoding information to explore bacterial community compositions in potato-wart infested soil samples. Sequencing of the data was done using the Emu pipeline, you may find more information about it on their github if you’re interested.
In this project, I hope to discover genera of bacteria that are extensively studied as Biocontrol Agents (BCAs) in hopes of discovering unique species that could serve as potential defense organisms against Synchytrium endobioticum - a fungal pathogen responsible for inducing potato wart disease. In doing so, it can pave the way to additional research to investigate gene expression pathways associated with those given species to determine if they can confer resistances against the plant pathogen. I will also be assessing the difference in bacterial community composition between original 16S genomic DNA, and whole genome amplified (WGA) 16S genomic DNA to assess if laboratory preparation can affect the sequencing steps.
For each dataset in this EDA, I will provide a breakdown on:
1.) What the dataset entails
2.) How I will tidy each dataset
3.) What I plan to explore with each dataset
Below is an overview of this.
batch2.csv:
1.) This dataset contains information on 16S bacterial
relative abundance information already calculated by the Emu
pipeline.
2.) I will tidy this dataframe by organizing the column names to something that can be easily understood, and removing columns that are unnecessary for subsequent analysis, such as columns pertaining to positive and negative controls as I’m interested in comparing between the two treatment groups. I will also be converting this dataframe into long/wide formats, and I will be ensuring there are no NA values.
3.) I will first store this csv file into a metadata dataframe called mm conduct various visualizations, such as creating barplots to see which Genera and Phyla are most present in the original and wga treatments and how they differ.
mock-community-data.csv
1.) This dataset contains the theoretical relative
abundance compositions for bacterial members that are present in the
positive control
2.) I will not need to tidy this dataframe other than filtering for columns that I will be interested in.
3.) I will use this dataframe as a reference database to compare and see whether my sampling was successful by assessing if the expected bacteria members were present in my positive control after sequencing, and whether their relative composition is similar to the theoretical compositions.
emu-combined-tax_id-counts.csv
1.) This dataframe will contain metadata regarding ASVs
(amplicon sequence variants). It stores information on the absolute
counts of each species opposed to their relative abundance like in
batch2.csv .
2.) I will need to tidy this dataframe to remove the columns containing information on positive and negative controls, and I need to reorganize the columns so that they are easy to understand. I will also need to convert the dataframe between wide format and long format to do multiple types of analysis, as well as ensuring there are no NA values.
3.) I will conduct NMDS analysis and carry out PERMANOVA tests to determine if there is a significant difference between the two treatment groups using largely the vegan R package.
If you have not done so already, make sure to install the following
five R packages using the install.packages() function:
tidyverse, vegan, ggpubbr ,
tidytext, forcats
Then load them so that we may begin the data analysis:
library(tidyverse)
library(vegan)
library(ggpubr)
library(tidytext)
library(forcats)
Now read the files as indicated in section 1.1:
# Read files
# Full meta data on relative abundance:
mm <- read_csv('data/batch2.csv')
# Mock community data on relative abundance:
mock <- read_csv('data/mock-community-data.csv')
# Full meta data on relative abundance, will be tidied:
b2 <- read_csv('data/batch2.csv')
# Meta data on absolute counts:
b2_abs <- read_csv('data/emu-combined-tax_id-counts.csv')
We will need to create a couple of functions to help with out EDA.
Both will sort our datasets to make them more tidy. The
num_order() function will reorganize our columns based on
numerical order from lowest to highest. The col_sorter()
function will reorganize our taxonomy columns to be in the proper order
of Kingdom, People, Class, Order, Family, Genus, Species. Here is the
code for both of them:
num_sorter():
num_sorter <- function(df){
cols <- colnames(df) # Gets the column names
fixed_cols <- cols[1:8] # Assigns the fixed columns I want to keep the same way
sort_cols <- cols[-(1:8)] # Assigns the columns I want to manipulate (after col 8)
sort_cols_number <- as.numeric(gsub("\\D", "", sort_cols)) # Extracts the numbers from the column names
sort_cols <- sort_cols[order(sort_cols_number)] # Orders columns based on extracted numbers
cols <- c(fixed_cols, sort_cols) # Combines the fixed and sorted columns
df <- df[, cols] # Reorder the dataframe
return(df) # Returns the dataframe
}
col_sorter():
# Create Sorter function to organize taxonomic ranks in order of highest to lowest after tax_id:
col_order <- function(df) {
df |>
select(tax_id, superkingdom, phylum, class, order, family, genus, species, everything())
}
Now that our set up is done, we are ready for the next step which involves data tidying.
Let’s quickly inspect the metadata containing our asv and relative abundance information:
b2 # 3135 rows x 37 columns
## # A tibble: 3,135 × 37
## tax_id species genus family order class phylum superkingdom
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 100134 Anaerocolumna xylanovor… Anae… Lachn… Clos… Clos… Firmi… Bacteria
## 2 1002 Hugenholtzia roseola Huge… Berna… Cyto… Cyto… Bacte… Bacteria
## 3 1002526 Pseudomonas formosensis Pseu… Pseud… Pseu… Gamm… Prote… Bacteria
## 4 1002689 Occallatibacter riparius Occa… Acido… Acid… Acid… Acido… Bacteria
## 5 1002691 Occallatibacter savannae Occa… Acido… Acid… Acid… Acido… Bacteria
## 6 1002870 Gaiella occulta Gaie… Gaiel… Gaie… Rubr… Actin… Bacteria
## 7 1003110 Verrucosispora maris Verr… Micro… Micr… Acti… Actin… Bacteria
## 8 1004261 Oceanobacillus indicire… Ocea… Bacil… Baci… Baci… Firmi… Bacteria
## 9 1004279 Mesorhizobium muleiense Meso… Phyll… Rhiz… Alph… Prote… Bacteria
## 10 1004304 Hydrotalea sandarakina Hydr… Chiti… Chit… Chit… Bacte… Bacteria
## # ℹ 3,125 more rows
## # ℹ 29 more variables: `barcode05.fastq-threshold-0.0001` <dbl>,
## # barcode07.fastq <dbl>, `barcode13.fastq-threshold-0.0001` <dbl>,
## # `barcode12.fastq-threshold-0.0001` <dbl>, barcode05.fastq <dbl>,
## # barcode03.fastq <dbl>, barcode06.fastq <dbl>, barcode08.fastq <dbl>,
## # `barcode06.fastq-threshold-0.0001` <dbl>,
## # `barcode08.fastq-threshold-0.0001` <dbl>, …
b2_abs # 3135 rows x 37 columns
## # A tibble: 3,135 × 37
## tax_id species genus family order class phylum superkingdom
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 100134 Anaerocolumna xylanovor… Anae… Lachn… Clos… Clos… Firmi… Bacteria
## 2 1002 Hugenholtzia roseola Huge… Berna… Cyto… Cyto… Bacte… Bacteria
## 3 1002526 Pseudomonas formosensis Pseu… Pseud… Pseu… Gamm… Prote… Bacteria
## 4 1002689 Occallatibacter riparius Occa… Acido… Acid… Acid… Acido… Bacteria
## 5 1002691 Occallatibacter savannae Occa… Acido… Acid… Acid… Acido… Bacteria
## 6 1002870 Gaiella occulta Gaie… Gaiel… Gaie… Rubr… Actin… Bacteria
## 7 1003110 Verrucosispora maris Verr… Micro… Micr… Acti… Actin… Bacteria
## 8 1004261 Oceanobacillus indicire… Ocea… Bacil… Baci… Baci… Firmi… Bacteria
## 9 1004279 Mesorhizobium muleiense Meso… Phyll… Rhiz… Alph… Prote… Bacteria
## 10 1004304 Hydrotalea sandarakina Hydr… Chiti… Chit… Chit… Bacte… Bacteria
## # ℹ 3,125 more rows
## # ℹ 29 more variables: `barcode05.fastq-threshold-0.0001` <dbl>,
## # barcode07.fastq <dbl>, `barcode13.fastq-threshold-0.0001` <dbl>,
## # `barcode12.fastq-threshold-0.0001` <dbl>, barcode05.fastq <dbl>,
## # barcode03.fastq <dbl>, barcode06.fastq <dbl>, barcode08.fastq <dbl>,
## # `barcode06.fastq-threshold-0.0001` <dbl>,
## # `barcode08.fastq-threshold-0.0001` <dbl>, …
We can see that there are 3135 rows, and 37 columns for both
dataframes. The taxonomy information is split by their respective
groups, but out of order (we will correct this later on). Currently both
dataframes are in wide format where the rows represent a unique species
as indicated by the tax_id column. The columns containing barcode
information refer to different samples that were pooled together during
library prepration. I will discuss a bit more on this in a bit, but
first let’s quickly check to see if each row is truly unique using an
if statement:
if(length(unique(b2$tax_id)) == nrow(b2) &
length(unique(b2_abs$tax_id)) == nrow(b2_abs)){
print("All tax_ID values are unique for both dataframes.")
} else {
print("All tax_ID values are not unique for both dataframes.")
}
## [1] "All tax_ID values are unique for both dataframes."
For the next step, let’s run our sorter functions on the dataframes:
# Run the functions on the dataframes:
b2 <- col_order(b2)
b2 <- num_sorter(b2)
b2_abs <- col_order(b2_abs)
b2_abs <- num_sorter(b2_abs)
The columns have now been organized so that the taxonomy columns are
in order, and the barcode columns are in order as well. Barcodes in this
case represent a unique sample site as DNA from multiple sample sites
were pooled together during library preparation and sequencing. Because
these dataframes contain information on both original and wga treatments
on DNA, barcodes01-07 represent original
samples whereas barcodes08-14 represent
WGA samples. WGA was done to increase the available
concentration of DNA to work with, but that may cause some biases in
sequencing data as some community members may be more enhanced than
others. This is why it’s of interest to investigate the difference
between the two treatment groups. The other two samples,
barcode15 and barcode16 represent
positive and negative control
samples.
Note that there are additional columns containing 0.0001
threshold information for each of the 14 barcode samples. These
columns include only the species that are at or above this given
threshold. Because we are interested in assessing the differences in
treatments for all samples, we can remove them from our data. along with
the positive and negative control sample information. The positive
control sample information will be assessed using the mm
dataframe which we will tidy later on.
To tidy, we’ll keep the first 8 columns containing taxonomy information, and get rid of the columns we’re not interested in:
# Note that the first 8 columns contain taxonomy information, so we'll select those columns and columns ending with fastq:
b2 <- b2 |>
select(1:8, ends_with("fastq"), -barcode15.fastq, -barcode16.fastq)
b2_abs <- b2_abs |>
select(1:8, ends_with("fastq"), -barcode15.fastq, -barcode16.fastq)
Narrowed it down to 22 columns now.
Now let’s rename these columns to something that’s easier to
understand. We will rename barcodes01-07 to
og_rel_1-7 and barcodes08-14 to
wga_rel_1-7 for the b2 dataframe
containing relative abundance information. For the b2_abs
dataframe, we will convert it to og_abs_1-7 and
wga_abs_1-7 instead.
# Rename the columns to something easier to understand:
b2 <- b2 |>
rename(
og_rel_1 = barcode01.fastq, og_rel_2 = barcode02.fastq, og_rel_3 = barcode03.fastq,
og_rel_4 = barcode04.fastq, og_rel_5 = barcode05.fastq, og_rel_6 = barcode06.fastq,
og_rel_7 = barcode07.fastq, wga_rel_1 = barcode08.fastq, wga_rel_2 = barcode09.fastq,
wga_rel_3 = barcode10.fastq, wga_rel_4 = barcode11.fastq, wga_rel_5 = barcode12.fastq,
wga_rel_6 = barcode13.fastq, wga_rel_7 = barcode14.fastq
)
b2_abs <- b2_abs |>
rename(
og_abs_1 = barcode01.fastq, og_abs_2 = barcode02.fastq, og_abs_3 = barcode03.fastq,
og_abs_4 = barcode04.fastq, og_abs_5 = barcode05.fastq, og_abs_6 = barcode06.fastq,
og_abs_7 = barcode07.fastq, wga_abs_1 = barcode08.fastq, wga_abs_2 = barcode09.fastq,
wga_abs_3 = barcode10.fastq, wga_abs_4 = barcode11.fastq, wga_abs_5 = barcode12.fastq,
wga_abs_6 = barcode13.fastq, wga_abs_7 = barcode14.fastq
)
# See if it worked:
colnames(b2)
## [1] "tax_id" "superkingdom" "phylum" "class" "order"
## [6] "family" "genus" "species" "og_rel_1" "og_rel_2"
## [11] "og_rel_3" "og_rel_4" "og_rel_5" "og_rel_6" "og_rel_7"
## [16] "wga_rel_1" "wga_rel_2" "wga_rel_3" "wga_rel_4" "wga_rel_5"
## [21] "wga_rel_6" "wga_rel_7"
colnames(b2_abs)
## [1] "tax_id" "superkingdom" "phylum" "class" "order"
## [6] "family" "genus" "species" "og_abs_1" "og_abs_2"
## [11] "og_abs_3" "og_abs_4" "og_abs_5" "og_abs_6" "og_abs_7"
## [16] "wga_abs_1" "wga_abs_2" "wga_abs_3" "wga_abs_4" "wga_abs_5"
## [21] "wga_abs_6" "wga_abs_7"
Now it will be better if we could organize the dataframes such that one dataframe will contain all the metadata information on original samples, and the other will contain all the metadata information on wga samples. Let’s go ahead and do this:
# Select the taxonomy and 'og' columns from both b2 and b2_abs dataframes:
b2_og <- select(b2, tax_id:species, starts_with("og_rel"))
og_b2_abs <- select(b2_abs, starts_with("og_abs"))
# Combine the selected columns into a new dataframe
og <- bind_cols(b2_og, og_b2_abs)
# Select the taxonomy and 'wga' columns from both b2 and b2_abs dataframes
b2_wga <- select(b2, tax_id:species, starts_with("wga_rel"))
wga_b2_abs <- select(b2_abs, starts_with("wga_abs"))
# Combine the selected columns into a new dataframe
wga <- bind_cols(b2_wga, wga_b2_abs)
Now let’s create a new column called taxonomy for the b2 column. This column will be a concatenated format of the other taxonomy columns in a way such that it will be easy to extract taxonomy information if needed. We will do this by combining the taxonomy columns into one string where each hierarchical classification is separated the following way: k__x; p__x; c__x; o__x; f__x; g__x; s__x . Note that x represents the hierarchical classification for each letter (k = superkingdom, p = phylum, c = class, o = order, f = family, s = species):
b2 <- b2 |>
mutate(taxonomy = paste0("k__", superkingdom, "; ",
"p__", phylum, "; ",
"c__", class, "; ",
"o__", order, "; ",
"f__", family, "; ",
"g__", genus, "; ",
"s__", species))
b2_abs <- b2_abs |>
mutate(taxonomy = paste0("k__", superkingdom, "; ",
"p__", phylum, "; ",
"c__", class, "; ",
"o__", order, "; ",
"f__", family, "; ",
"g__", genus, "; ",
"s__", species))
After that, we will separate the asv dataframe columns for further analysis. We will create a taxonomy dataframe only containing taxonomy information for each tax_id, and an asv dataframe only containing the absolute counts for each tax_id. This will be for the b2_abs dataframe as that is what we will largely be doing our asv analyses with to compare between treatment groups
First let’s create the taxonomy dataframe and title it
b2_tax:
# Start with taxonomy dataframes:
b2_tax <- b2 |>
select(tax_id, 2:8, taxonomy) |>
relocate(taxonomy, .after = tax_id)
print(b2_tax$taxonomy[1])
## [1] "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Anaerocolumna; s__Anaerocolumna xylanovorans"
Seems to work, now the dataframe includes the concatenated taxonomy information that we created earlier on - nice!
Now make the ASV table. We will only filter for the tax_id column, and the sample columns. We will also convert any NA values to 0 and values to integers. This is because for subsequent steps such as when rarefying our data, the function will only accept integer values - so let’s go ahead and do that:
# Do the same for the absolute count data. Also note that for the analysis steps, functions such as rarefy will only accept integer count values, so while we're at it, let's convert the values to whole numbers as well.
b2_abs_asv <- b2_abs |>
select(tax_id, starts_with(c("og", "wga"))) |>
mutate(across(starts_with(c("og", "wga")), ~replace_na(., 0))) |>
mutate(across(starts_with(c("og", "wga")), ~as.integer(round(.))))
# Check to see if it worked:
head(b2_abs_asv)
## # A tibble: 6 × 15
## tax_id og_abs_1 og_abs_2 og_abs_3 og_abs_4 og_abs_5 og_abs_6 og_abs_7
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 100134 0 0 0 21 0 0 0
## 2 1002 0 0 0 0 0 0 0
## 3 1002526 0 0 0 0 0 0 0
## 4 1002689 1267 392 1033 1010 2130 145 245
## 5 1002691 1967 583 1297 2224 2952 211 472
## 6 1002870 4234 1318 2061 1370 1814 1277 2330
## # ℹ 7 more variables: wga_abs_1 <int>, wga_abs_2 <int>, wga_abs_3 <int>,
## # wga_abs_4 <int>, wga_abs_5 <int>, wga_abs_6 <int>, wga_abs_7 <int>
Note that for the mm dataframe, it is essentially the
same as the b2 dataframe since they both contain metadata
on relative abundance of bacterial species. Since I only require
information pertaining to the barcode15.fastq column
which contains information on the positive control
bacteria composition, I do not need to do much data tidying. As such,
I’ll just be filtering my dataframe to select the top 10 abundant
bacterial species for the positive control in descending order.
I will then see which species are common between the expected mock community list and our positive controls, and creating a list of species that are common. The mock community data is provided by ZymoBIOMICS™ Microbial Community DNA Standard Catalog No. D6305:
# Filter metadata for top 10 abundant bacterial species for the postive control:
mm <- mm |>
select(tax_id, species, barcode15.fastq) |>
filter(barcode15.fastq > 0) |>
mutate(barcode15.fastq = barcode15.fastq * 100) |>
arrange(desc(barcode15.fastq)) |>
slice(1:10)
# Print out which species are the most abundant between my sequencing data and theoretical data:
common_species <- inner_join(mm, mock, by = "species") |>
select(species) |>
distinct()
common_species
## # A tibble: 7 × 1
## species
## <chr>
## 1 Salmonella enterica
## 2 Bacillus subtilis
## 3 Escherichia coli
## 4 Staphylococcus aureus
## 5 Listeria monocytogenes
## 6 Enterococcus faecalis
## 7 Pseudomonas aeruginosa
We can see that 7 out of 8 of the theoretical most abundant species are present in our sequencing data when comparing to the mock community data. Before we proceed to visualization, I need to tidy the data. I will first combine the common species list with the values that we received, and then convert the dataframe to long format for visualization:
# Overrides the mm dataframe to combine the common species list with the values that we received:
mm <- mm |>
filter(species %in% common_species$species) |>
select(species, barcode15.fastq) |>
arrange(species)
# Filter mock data down to the same species as in the data df:
mock <- mock |>
filter(species %in% common_species$species) |>
select(species, "16S_only") |>
arrange(species)
# Now bind the dataframes together:
data_mock <- inner_join(mm, mock, by = "species")
# Reshape the data to a long format for visualization:
data_mock <- data_mock |>
pivot_longer(cols = c(barcode15.fastq, "16S_only"), names_to = "sample", values_to = "percentage")
# Changing the column values for visualization:
data_mock$sample[data_mock$sample == "barcode15.fastq"] <- "actual abundance"
data_mock$sample[data_mock$sample == "16S_only"] <- "theoretical abundance"
After doing this, we’ll proceed to the visualization of our data.
We will now visualize our results to compare the mock community
theoretical results with our practical results. Let’s run a
ggplot to show the distribution in relative abundance
percentages between the two groups:
ggplot(data_mock, aes(x = species, y = percentage, fill = sample)) +
geom_bar(stat = "identity", position = "dodge", color = "black") +
labs(title = "Figure 1. Comparing the Relative Abundance Composition of Positive Controls",
x = "Species",
y = "Percentage (%)",
fill = "Sample") +
scale_fill_discrete(labels = c("Theoretical values", "Sequencing values")) +
scale_fill_manual(values = c("steelblue", "maroon")) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
axis.title.x = element_text(size = 15),
axis.title.y = element_text(size = 15),
axis.title = element_text(size=15),
axis.text.x.bottom = element_text(face="italic"))
From our results, we can see that they are quite similar. The ZymoBIOMICs website mentions that if the sequences are within 15% of each other, then it sequencing was successful. Knowing this now, we will create rarefaction curves to see if we had enough sequencing depth to count all reads in our samples.
We will be rarefying our data using the vegan R package.
To first do so, we need to convert our dataframe containing absolute
counts into wide format, and make our columns tax_id while the rows will
be our samples. We will also remove the column containing sample
information and do some minor tidying. Let’s go ahead and do this:
# First temporarily convert the dataframe to long format
b2_abs_wide <- pivot_longer(b2_abs_asv, cols = -tax_id, names_to = "sample",
values_to = "value")
# Revert the dataframe back to wide format
b2_abs_wide <- pivot_wider(b2_abs_wide,
names_from = tax_id,
values_from = value) |> as.data.frame()
# Assigning the row names as the sample:
rownames(b2_abs_wide) <- b2_abs_wide$sample
# Removing the column containing sample information because we need this in matrix format to be able to work with it. But before we do that, let's pull the names of the sample columns as it will help us with subsequent analysis. We'll also create a new column titled group which indicates whether it is og or wga for the treatment type for a given sample ID:
b2_names <- b2_abs_wide |>
select(sample) |>
mutate(group = case_when(
startsWith(sample, "og") ~ "og",
startsWith(sample, "wga") ~ "wga"
))
# Now let's go ahead and remove the sample column
b2_abs_wide <- b2_abs_wide[,-1]
rownames(b2_abs_wide)[0] <- "samples"
Now before moving onto rarefy our data, we need to determine what the lowest number of read counts for our samples will be. Using the code below, we can determine that it is 95,460. Now let’s visualize the data. Note that this process may take some time.
# Before we rarefy our data, we need to see what the lowest number of counts is for each sample. Let's go ahead and see that:
raremax <- min(rowSums(b2_abs_wide)) # Shows the lowest number of counts is 95,460
# Now let's visualize our rarefy data. The initial output by the vegan package will look messy, so we'll extract the output from the rarecurve() function and visualize it in a neater way using ggplot:
# First run the rarecurve() function and store the output into a variable called rarecurve_data
rarecurve_data <- rarecurve(b2_abs_wide, step = 100, sample = raremax)
# This is taking each element of the list and binds them together to make a dataframe
map_dfr(rarecurve_data, bind_rows) |>
bind_cols(sample = rownames(b2_abs_wide),) |>
pivot_longer(-sample) |>
drop_na() |>
mutate(counts = as.numeric(str_replace(name, "N",""))) |>
# Now plotting the graph:
ggplot(aes(x=counts, y=value, group=sample, color=sample)) +
geom_vline(xintercept = raremax, color ="gray") +
geom_line(size=0.8) +
labs(x="Sequencing read counts", y="Number of observed species",
title="Figure 2. Rarefaction visualization of sequencing samples.") +
scale_color_discrete(name = "Sample ID") +
theme_classic2() +
scale_x_continuous(labels = scales::comma)
Notice that the output provided by vegan and base R is
messy, but with ggplot we can use the data provided by the
rarecurve() function, and store it into a separate variable
to plot it with ggplot and make it more visually appealing.
From our results, we can see that all sequencing samples plateau as the
sequencing read counts approach a constant number of species for each
sample, thereby indicating there was enough sequencing to cover all
species within the samples. The rarefraction curves allow us to quickly
visualize the number of observed species recorded for the number of
sequences. This indicates our sequencing quality was good, and we can
now proceed to visualize the differences in community composition
between treatments.
First, we will visualize the most abundant phyla and genera within our samples between the original and WGA treatment groups. To do these analyses, we will need to first convert our dataframes to long format, and then do some tidying to filter by the most common phyla and genera. For instance, we need to convert NA values to Unclassified for taxa that have not been identified at that level. Let’s go ahead and visualize the most abundant phyla first:
#For data analysis, we'll go ahead and use the og_b2 and wga_b2 dataframes.
# Reshape the dataframes into long format
og_long <- b2_og |>
pivot_longer(cols = starts_with("og_rel"), names_to = "sample", values_to = "value") %>%
drop_na(value) |>
group_by(sample) |>
mutate(percentage = value / sum(value) * 100) |>
as.data.frame()
wga_long <- b2_wga |>
pivot_longer(cols = starts_with("wga_rel"), names_to = "sample", values_to = "value") %>%
drop_na(value) |>
group_by(sample) |>
mutate(percentage = value / sum(value) * 100) |>
as.data.frame()
# Now let's visualize the most abundant phyla within our groups. To do so, we'll first create new dataframes titled og_phy and wga_phy to store our information so that we'll be able to compare them side by side using ggarrange().We are summing the total relative abundance for each phylum across each sample, and then multiplying it by 100 to get a percentage as the relative abundance information is stored as a fraction. We are then replacing any NA values as Unclassified, and plotting our results using ggplot():
# Tidying data to select phyla and replacing NA values with Unclassified:
og_phy <- og_long |>
group_by(sample, tax_id, phylum, value) |>
summarize(value = sum(value), .groups="drop") |>
group_by(sample, phylum) |>
summarize(value = value*100, .groups="drop") |>
mutate(phylum = replace_na(phylum, "Unclassified"))
wga_phy <- wga_long |>
group_by(sample, tax_id, phylum, value) |>
summarize(value = sum(value), .groups="drop") |>
group_by(sample, phylum) |>
summarize(value = value*100, .groups="drop") |>
mutate(phylum = replace_na(phylum, "Unclassified"))
# Reordering levels of phylum before to descending order before creating the plots
og_phy$phylum <- fct_reorder(og_phy$phylum, og_phy$value, .desc = TRUE)
wga_phy$phylum <- fct_reorder(wga_phy$phylum, wga_phy$value, .desc = TRUE)
# Code for plots:
og_phy_bar <- ggplot(data=og_phy, aes(x=sample, y=value, fill=phylum)) +
geom_col() +
labs(y="Percentage (%)", x="Sample ID", fill="Phyla", title = "Figure 3A. Visualizing most abundant phyla for original DNA") +
theme_classic2() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.text = element_text(size = 8, face="italic"))
wga_phy_bar <- ggplot(data=wga_phy, aes(x=sample, y=value, fill=phylum)) +
geom_col() +
labs(y="Percentage (%)", x="Sample ID", fill="Phyla",
title = "Figure 3B. Visualizing most abundant phyla for WGA treatment") +
theme_classic2() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.text = element_text(size = 8, face="italic"))
# Visualize:
og_phy_bar
wga_phy_bar
From this, we can see that the composition in phyla clearly differ between the original and WGA treatments. The most abundant treatment for the original samples appears to be Chloroflexi, and for the WGA samples it appears to be Firmicutes. Regardless, both are common BCAs that have been determined in soil samples based on past studies (Lee et al. 2017). This is a good sign based on this preliminary data.
Now let’s visualize the most abundant genera. Note that to do this,
there are a lot of genera, as such, it will not be possible to visualize
all of them. Due to this, we need to filter down the data to the top 10
most abundant genera. To do this, we need to sum the relative abundance
for all genera and see the top 10 for each treatment group. Based on
that, we need to set the cut off point to the minimum threshold we are
looking for, and then visualize it using ggplot. This is
summarized in the code and plots below:
# Creating temporary variables to tmp1 and tmp2 to store filtered og and wga data respectively containing information on the relative abundance percentages of each genus per sample:
tmp1 <- og_long |>
group_by(sample, tax_id, genus, value) |>
summarise(value = sum(value), .groups="drop") |>
group_by(sample, genus) |>
summarise(value = value*100, .groups="drop") |>
mutate(genus = replace_na(genus, "Unclassified"))
tmp2 <- wga_long |>
group_by(sample, tax_id, genus, value) |>
summarise(value = sum(value), .groups="drop") |>
group_by(sample, genus) |>
summarise(value = value*100, .groups="drop") |>
mutate(genus = replace_na(genus, "Unclassified"))
# Before we pool for rare taxa, let's first quickly assess the top 10 genera percentages for each of the treatment groups, and use that as the respective cut-off points:
# Shows the lowest percentage would be 2.29%, we'll use that as our cut-off point for the og samples:
tmp1_min <- tmp1 |>
group_by(genus) |>
summarize(max = max(value)) |>
arrange(desc(max)) |>
slice(10) |>
pull(max)
# Shows the lowest percentage would be 6.46%, we'll use that as our cut-off point for the wga samples:
tmp2_min <- tmp2 |>
group_by(genus) |>
summarize(max = max(value)) |>
arrange(desc(max)) |>
slice(10) |>
pull(max)
# Pooling rare taxa for percentages <= 2% and storing them into respective variables for og and wga:
og_pool <- tmp1|>
group_by(genus) |>
summarize(pool = max(value) <= tmp1_min, .groups="drop")
wga_pool <- tmp2 |>
group_by(genus) |>
summarize(pool = max(value) <= tmp2_min, .groups="drop")
# Now visualizing the dataframes to filter by genus <= 2% for both the original and wga treatments:
# Combining the pooled data information with the temporary dataframes, and then graphing the results using ggplot():
og_gen <- inner_join(tmp1, og_pool, by="genus") |>
mutate(genus = if_else(pool, "Other", genus)) |>
group_by(sample, genus) |>
summarize(value = sum(value))
og_gen_bar <- og_gen |>
mutate(genus = fct_reorder(genus, value, .desc = TRUE)) |>
ggplot(aes(x=sample, y=value, fill=genus)) +
geom_col() +
labs(y="Percentage (%)", x="Sample ID", fill="Genera",
title = "Figure 4A. Visualizing most abundant genera for original DNA") +
theme_classic2() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.text = element_text(size = 8, face="italic"))
wga_gen <- inner_join(tmp2, wga_pool, by="genus") |>
mutate(genus = if_else(pool, "Other", genus)) |>
group_by(sample, genus) |>
summarize(value = sum(value))
wga_gen_bar <- wga_gen |>
mutate(genus = fct_reorder(genus, value, .desc=TRUE)) |>
ggplot(aes(x=sample, y=value, fill=genus)) +
geom_col() +
labs(y="Percentage (%)", x="Sample ID", fill="Genera",
title = "Figure 4B. Visualizing most abundant genera for WGA treatment") +
theme_classic2() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.text = element_text(size = 8, face="italic"))
# Visualize:
og_gen_bar
wga_gen_bar
As we can see from these graphs, the most abundant genera still belong to the ‘Other’ category for both treatment groups based on the thresholds we have defined. This is somewhat expected as there are a lot of genera given the large dataset I am working with, and it may not be possible to visualize them all using this method. Despite this however, I do observe genera such as Bacilius and Clostridium in the WGA treatment group which are genera that have been known for having BCA properties against fungal pathogens in other soil microbiomes (Lee et al. 2017).
We can clearly see from these visualizations that there is a
difference between the original and WGA treatment groups. Now let’s do
some further statistical tests and visualizations to determine if the
difference is significant or not. To do this, we will use various
functions within the vegan R package. We will attempt to
visualize our data using methods such as MetaMDS and
betadisper() plots, while assessing statistical differences
using PERMANOVA.
First let’s create an NMDS plot to visualize the difference between
the original and WGA treatments. To do so, we’ll need to first normalize
our data using decostand() as part of the
vegan R package. I will be using the hellinger
transformation by selecting the method hell. This performs
a square root normalization on our data:
# Normalize the data using decostand(). There are multiple methods to choose from, but I have decided to go with hellinger transformation.
b2_norm <- decostand(b2_abs_wide, method = "hell")
# Run the NMDS model to visualize composition data
set.seed(100)
nmds_results <- metaMDS(b2_norm, distance="robust.aitchison")
## Run 0 stress 0.0943651
## Run 1 stress 0.09712313
## Run 2 stress 0.1027415
## Run 3 stress 0.107571
## Run 4 stress 0.09712318
## Run 5 stress 0.09234865
## ... New best solution
## ... Procrustes: rmse 0.05151316 max resid 0.09846235
## Run 6 stress 0.1015796
## Run 7 stress 0.1075709
## Run 8 stress 0.3578482
## Run 9 stress 0.09118546
## ... New best solution
## ... Procrustes: rmse 0.03079115 max resid 0.05613709
## Run 10 stress 0.1075709
## Run 11 stress 0.09319933
## Run 12 stress 0.1075708
## Run 13 stress 0.0971231
## Run 14 stress 0.1887572
## Run 15 stress 0.1872123
## Run 16 stress 0.1253094
## Run 17 stress 0.09118563
## ... Procrustes: rmse 0.0005617368 max resid 0.00157867
## ... Similar to previous best
## Run 18 stress 0.1253098
## Run 19 stress 0.0982696
## Run 20 stress 0.273934
## *** Best solution repeated 1 times
# Extract the NMDS scores
nmds_scores <- as.data.frame(scores(nmds_results)$sites)
# Determine the centroids for the NMDS graph:
group_centroids <- data.frame(
Treatment = c("og", "wga"),
Centroid_X = c(mean(nmds_scores$NMDS1[b2_names$group == "og"]),
mean(nmds_scores$NMDS1[b2_names$group == "wga"])),
Centroid_Y = c(mean(nmds_scores$NMDS2[b2_names$group == "og"]),
mean(nmds_scores$NMDS2[b2_names$group == "wga"]))
)
# Now create a dataframe containing the nmds data that we want to visualize with ggplot:
nmds_data <- data.frame(
Treatment = b2_names$group,
NMDS1 = nmds_scores$NMDS1,
NMDS2 = nmds_scores$NMDS2,
xend = c(rep(group_centroids[1,2], 1), rep(group_centroids[2,2], 1)),
yend = c(rep(group_centroids[1,3], 1), rep(group_centroids[2,3], 1))
)
# Now plot the data:
ggplot(nmds_data, aes(NMDS1, NMDS2)) +
geom_point(aes(colour = Treatment), size=2) +
stat_ellipse(geom = "polygon", alpha = 0.04, aes(group = Treatment),
colour = "black", fill = "steelblue") +
geom_point(data = group_centroids, aes(x = Centroid_X, y = Centroid_Y),
colour = "black", size = 2, shape = 7) +
geom_segment(data = nmds_data, aes(x = NMDS1, y = NMDS2, xend = xend,
yend = yend, colour = Treatment), alpha = 0.5) +
scale_colour_manual(name = "Treatment type", labels = unique(nmds_data$Treatment),
values = c("cyan", "magenta")) +
labs(title = "Figure 5. NMDS plot comparing the two soil sample treatment groups") +
theme_classic()
As we can see from this, there is some overlap between the two groups, but they are largely distinct from one another. The obtained stress value for this test was 0.09, which means the fit is fair, but not good as it isn’t below 0.05.
Now let’s also do a dispersion plot as it will allow us to visualize
our results in different ways. To do so, we’ll first conduct
dissimilarity indices using vegdist() and then visualize
the dispersion between our data using betadisper() for
multivariate dispersion analysis:
# Calculating dissimilarity indices:
perm_dist <- vegdist(b2_norm, method = "robust.aitchison")
# Let's now check to see if the two treatment groups have similar statistically similar multivariate dispersion.
dispersion <- betadisper(perm_dist, group = b2_names$group, type = "centroid")
plot(dispersion, main = "Figure 6. Multivariate dispersion plot")
# Visually, we can see the groups are distinct from each other.
Similar to the NMDS plot, we can clearly see the two treatments differ. Now let’s assess how significant the difference is Permutational multivariate analysis of variance (PERMANOVA):
perm_results <- adonis2(perm_dist~as.factor(nmds_data$Treatment), data=perm_dist,
permutations=9999)
perm_results
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 9999
##
## adonis2(formula = perm_dist ~ as.factor(nmds_data$Treatment), data = perm_dist, permutations = 9999)
## Df SumOfSqs R2 F Pr(>F)
## as.factor(nmds_data$Treatment) 1 736.47 0.23838 3.7559 4e-04 ***
## Residual 12 2353.01 0.76162
## Total 13 3089.48 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the PERMANOVA test, we can see that the difference is significant as the p-value is 4x10-4 and is far less than 0.05. This indicates that the probability that the two groups are different are highly not due to chance, suggesting there is a significant difference between the two treatments.
In conclusion, this EDA has shown me that there is a significant difference between the original and WGA treatments when it comes to bacterial community composition. This suggests to me that if possible, I should avoid using WGA DNA for my library preparation as it could lead to biases and amplification of different species as highlighted in Figures 4 and 5. However, I did find bacterial taxa within both treatments that have been previously found to have BCA properties. As such, these preliminary results seem promising in suggesting that upon pathogen infection to S. endobioticum, soil microbiomes may undergo a shift in their community composition to favour bacterial taxa having BCA properties to protect them from environmental stressors. However, more studies are required to validate this.