Merging, Chimeras & Taxonomy
Merging Reads
- For paired-end data there is a good deal of overlap between the forward and reverse read
- To resolve this redundancy, these reads are collapsed into contigs
Let's do this with code now!
# Merge Read Pairs
# so far we have "denoised", so to speak,
# these sequence variants. We now need to merge the
# forward and reverse strands
mergers <- mergePairs(
dadaForward,
filtForward,
dadaReverse,
filtReverse,
verbose=TRUE)
619 paired-reads (in 18 unique pairings) successfully merged out of 807 (in 82 pairings) input.
570 paired-reads (in 29 unique pairings) successfully merged out of 815 (in 136 pairings) input.
619 paired-reads (in 28 unique pairings) successfully merged out of 868 (in 128 pairings) input.
713 paired-reads (in 18 unique pairings) successfully merged out of 860 (in 76 pairings) input.
609 paired-reads (in 29 unique pairings) successfully merged out of 851 (in 133 pairings) input.
620 paired-reads (in 30 unique pairings) successfully merged out of 810 (in 115 pairings) input.
679 paired-reads (in 28 unique pairings) successfully merged out of 845 (in 104 pairings) input.
616 paired-reads (in 28 unique pairings) successfully merged out of 830 (in 106 pairings) input.
Info
Here we see that for each sample we get the number of reads that were able to be successfully merged out of the total number of reads that could be merged.
ASVs vs. OTUs
- Now that we have finally merged our sequence variants we are left with an Amplicon Sequence Variant.
- Traditional 16S metagenomic approaches use OTUs or operational taxonomic units instead of ASVs.
- Operational taxonomic units (OTUs): clusters of reads that differ by less than a fixed sequence dissimilarity threshold, most commonly 3% Instead of clustering sequences, methods that resolve amplicon sequence variants (ASVs) distinguish sequence variants differing by as little as one nucleotide. ASVs represent a biological reality and provide nucleotide-level resolution. Callahan, MucMurdie, & Holmes (2017) have demonstrated that “ASVs capture all biological variation present in the data, and ASVs inferred from a given data set can be reproduced in future data sets and validly compared between data sets.”
ASV Table
Now that our sequences are merged we can create an ASV counts table, basically telling us how which samples contain which ASV's:
# Making a Sequence Table
# now that we have merged sequences we can construct
# an Amplicon Sequence Variant (ASV) table
seqtab <- makeSequenceTable(mergers)
Chimera Removal
- During Sequencing microbial DNA is subjected to PCR to amplify DNA
- During PCR it is possible for two unrelated templates to form a non-biological hybrid sequence
- DADA2 finds these chimeras by:
- aligning each sequence to more abundant sequences
- now check low abundant sequences and determine:
- can this sequence be created if we mix the left and right sides of the abundant sequences
Now in code:
# Removing Chimeras
# Chimeric sequences occur as errors during PCR
# when two unrelated templates for a hybrid sequence
# we will need to remove them before going forward
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", verbose=TRUE)
Now let's check if any chimeric sequences are removed:
## check to see if the dimensions are different
## between the chimera filtered and unfiltered
## ASV tables
dim(seqtab)
dim(seqtab.nochim)
[1] 8 119
[1] 8 117
Info
We can see here that 2 chimeric sequences were removed because our before and after sequence count matrices differ by two columns.
Pipeline Quality Control
We will also take a moment to do some final QC:
# Final QC
## we have performed quite a few steps
## and it would be nice to get a final qc check
## before assigning taxonomy
getN <- function(x) sum(getUniques(x))
finalQC <- cbind(
out,
sapply(dadaForward, getN),
sapply(dadaReverse, getN),
sapply(mergers, getN),
rowSums(seqtab.nochim))
colnames(finalQC) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(finalQC) <- sampleNames
finalQC
input filtered denoisedF denoisedR merged nonchim
SRR5690809 1000 905 840 855 619 611
SRR5690810 1000 937 853 885 570 549
SRR5690811 1000 937 880 910 619 594
SRR5690812 1000 924 886 888 713 700
SRR5690819 1000 938 872 906 609 609
SRR5690820 1000 916 870 844 620 620
SRR5690821 1000 921 883 879 679 679
SRR5690822 1000 940 865 891 616 616
Info
Here we see that we start with 1000 sequences per sample, end up with around 900 after filtering, around 800 after denoising to find unique sequences, and around 600-700 sequences after merging sequences and removing chimeric sequences.
Assigning Taxonomy
- To determine which taxon each ASV belongs to DADA2 uses a naïve bayes classifier
- This classifier uses a set of reference sequences with known taxonomy, here we use the SILVA database, as the training set and and outputs taxonomic assignments with bootstrapped confidence
# Assigning Taxonomy
# dada2 uses a naive Bayes classifier when
# assigning taxonomy. This means we need a training
# set of sequences with known taxonomy information.
# here we use the silva database
taxa <- assignTaxonomy(seqtab.nochim, "../data/silva_nr99_v138.1_train_set.fa.gz")
Databases
While we use the SILVA database here, there are other options databases:
Time for a break!