Differential Abundance
When assessing a microbial community, you might be interested to determine which taxa are differentially abundant between conditions. Given that we have a counts matrix we can use DESeq2!
Phylum Present
Before we assess which phylum are differentially abundant, a bar plot can be a quick first pass at determining this:
# transform the sample counts to proportions
# separate out our proportions
# separate our our tax info
ps.prop <- transform_sample_counts(ps, function(OTU) OTU/sum(OTU))
otu = data.frame(t(data.frame(ps.prop@otu_table)))
tax = data.frame(ps.prop@tax_table)
# merge the otu table and phylum column
# reshape our data to be accepted by ggplot
# merge taxa data with sample meta data
merged <- merge(otu,
tax %>% select(Phylum),
by="row.names") %>%
select(-Row.names) %>%
reshape2::melt() %>%
merge(.,
data.frame(ps.prop@sam_data) %>%
select(Run,Host),
by.x="variable",
by.y="Run")
# plot our taxa
taxa_plot <- ggplot(merged,aes(x=variable,y=value,fill=Phylum)) +
geom_bar(stat='identity') +
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
labs(
x="",
y="Abundance",
title = "Barplot of Phylum Abundance"
)+
facet_wrap(Host ~ ., scales = "free_x")
taxa_plot
Here we note that the wild type seem to have an abundance of Campylobacteria and the C57BL/6NTac have an abundance of Bacteriodota. Let's see if our DESeq2 results confirm this.
Differential Abundance
Differential Abundance measures which taxa are differentially abundant between conditions. So how does it work:
DESeq2 Normalization:
- Geometric mean per ASV
- Divide rows by geometric mean
- Take the median of each sample
- Divide all ASV counts by that median
DESeq2 Model
- The normalized abundances of an ASV are plotted against two conditions
- The regression line that connects these data is used to determine the p-value for differential abundance
DESeq2 P-Value
- The Slope or 𝛽1 is used to calculate a Wald Test Statistic 𝑍
- This statistic is compared to a normal distribution to determine the probability of getting that statistic
Now how do we do this in R?
# Differential Abundance
## convert phyloseq object to DESeq object this dataset was downsampled and
## as such contains zeros for each ASV, we will need to
## add a pseudocount of 1 to continue and ensure the data are still integers
## run DESeq2 against Host status, and ensure wild type is control,
## filter for significant changes and add in phylogenetic info
dds = phyloseq_to_deseq2(ps, ~ Host)
dds@assays@data@listData$counts = apply((dds@assays@data@listData$counts +1),2,as.integer)
dds = DESeq(dds, test="Wald", fitType="parametric")
res = data.frame(
results(dds,
cooksCutoff = FALSE,
contrast = c("Host","C57BL/6NTac","Mus musculus domesticus")))
sigtab = res %>%
cbind(tax_table(ps)[rownames(res), ]) %>%
dplyr::filter(padj < 0.05)
## order sigtab in direction of fold change
sigtab <- sigtab %>%
mutate(Phylum = factor(as.character(Phylum),
levels=names(sort(tapply(
sigtab$log2FoldChange,
sigtab$Phylum,
function(x) max(x)))))
)
# as a reminder let's plot our abundance data again
taxa_plot
## plot differential abundance
ggplot(sigtab , aes(x=Phylum, y=log2FoldChange, color=padj)) +
geom_point(size=6) +
theme_bw() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
ggtitle("Mus musculus domesticus v. C57BL/6NTac")
Explanation of Results
- Wild type seem to have an abundance of Campylobacteria and the C57BL/6NTac have an abundance of Bacteriodota
- Proteobacteria are severely downregulated in our C57BL/6NTac mice. However, they only show up in one sample!
- Be sure that your data are not influenced by outliers!
- Additionally, we collapsed our ASV's to the Phylum level since all ASV's had an identified phylum
Optional: How do I turn this R markdown into an R script?
- run the following code (being sure to change the path to where your script is):
knitr::purl("dada2pipeline.Rmd")
- You should now find an R script called
dada2pipeline.R
!
References
- Galaxy Project - Metagenomics
- Microbiome 101
- Current understanding of the human microbiome
- Amplicon and metagenomics overview
- Variable regions of the 16S ribosomal RNA
- A primer on microbial bioinformatics for nonbioinformaticians
- usearch
- Sample Multiplexing Overview
- DADA2: High resolution sample inference from Illumina amplicon data
- Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons
- DADA2 Pipeline Tutorial (1.16)
- Statistics How To
- Hierarchical Clustering in Data Mining
- Abundance-based dissimilarity metrics
- Differential expression analysis with DESeq2
- Introduction to RNA-Seq with Galaxy
- Evaluation of 16S rRNA Databases for Taxonomic Assignments Using a Mock Community
- Wild Mouse Gut Microbiota Promotes Host Fitness and Improves Disease Resistance
- Normalization and microbial differential abundance strategies depend upon data characteristics
- Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible
- A Primer on Metagenomics
- Clustal W and Clustal X version 2.0
- The neighbor-joining method: a new method for reconstructing phylogenetic trees.
- Large-scale contamination of microbial isolate genomes by Illumina PhiX control
- Dadasnake, a Snakemake implementation of DADA2 to process amplicon sequencing data for microbial ecology