The following items are not hands-on, but will be shown as part of a use case from a recent metagenomics workflow done in collaboration with a Tufts researcher.
If this sounds like a lot of ground to cover, you are correct!
In some portions of this tutorial, we don’t actually run the code, but jump ahead to the next step.
For these “skipped” portions, the code to run the process is provided.
The setup code adds the libraries in the order they will be searched for packages.
Setting .libPaths() to point at the Tufts shared folders can reduce the amount of time spent downloading packages.
Please reach out to us about any difficult or tricky installs; sometimes it is a library package error that has to be fixed by a TTS HPC Research Technology Specialist.
Please make sure that the output of this code chunk has the shared library, “/cluster/tufts/hpc/tools/R/4.0.0”, in position [1].
Example output (it depends on what else you have been doing in RStudio before today):
[1] “/cluster/tufts/hpc/tools/R/4.0.0”
[2] “/cluster/home/arhode05/R/x86_64-pc-linux-gnu-library/4.0”
[3] “/opt/shared/R/4.0.0/lib64/R/library”
Commands to specify the shared directory as the first place to look for R libraries, and to check the output. For example, we are using knitr in this notebook to generate an .html file, so that package is already in our shared libraries.
.libPaths(c('/cluster/tufts/hpc/tools/R/4.0.0',.libPaths()))
.libPaths()
This tutorial uses code chunks. To run an individual code chunk while the notebook is open in OnDemand RStudio, please press the green triangle on the upper right corner of the code chunk.
You may be wondering why we went to the trouble of having bash commands inside an R notebook for this tutorial. The simplest answer is that RStudio has the handy ability to open visualization files with a single click on the file name in the lower-right pane. This saves a lot of time and confusing navigation in this type of project, which produces many intermediate files.
RStudio can run code chunks in other languages by setting the chunk header to that language.
{bash} indicates that the code chunk is in command-line syntax, so these commands can be copied and pasted directly into a terminal.
{r} indicates that the code chunk uses R code.
To save a lot of copying and pasting, we are just going to run both types of chunks inside our notebook using RStudio OnDemand.
The order of chunks is important, so if you skip one or get lost, it is possible to use the dropdown menu next to the word “Run” at the top of this window to run multiple chunks to get you back to where you started.
Any questions?
In this workshop, we will sometimes be loading modules from the HPC to use in a code chunk.
Here is an example code chunk that loads a module for samtools and runs the command with the help flag to generate some output to test that the module loaded correctly.
If the bash environment is not designated in the setup chunk, you may need to add this bash command at the beginning of the chunk to enable module loading.
source ~/.bashrc
module load samtools/1.9
samtools --help
The behavior of modules in code chunks differs from a direct command-line session: the R Markdown notebook will not remember a loaded module from chunk to chunk, so load modules in the chunks where the commands are needed.
An example with conda environments will be shown in Part II.
module list
Conda environments are great for installing specific versions of programs, but they often bring in clashing dependencies, and only one environment can be active at a time.
Using HPC modules can provide more flexibility, because several modules can be loaded at once.
Of course, a conda environment could be used if you load all the required programs into the same environment. We will see an example of this later with Metawrap.
Ticks are capable of spreading pathogens (viruses, bacteria and other parasites) among a host population. Not much is known about the spread of tick-borne viruses in Giant Pandas. The density of giant pandas in breeding captivity in Sichuan Province, China may lead to more transmission of these diseases. The researchers who loaded this data into the SRA were interested in characterizing the virome of these ticks and to compare these with the virome of the pandas.
We are not going to restrict our analysis to the virome; instead, we will take a general look at what these ticks are carrying around, using practical metagenomics workflows.
Check out the final paper here
We are going to bring in some of the data from this bioproject to our Metagenomics Tutorial space.
SRA Metadata - every project is different!
Example from BioProject Microbes From Mum
The Metadata from our BioProject
Something you may have noticed is that the reads are suspiciously even in number, at length 502. The experimental design states that the reads were paired-end 2x250.
The data we are going to copy over can be obtained directly from the SRA by using our command line module for the SRAToolkit. This can take a little while, depending on the internet connection and which source the files are pulled from (on-prem, AWS or GCP).
We have a toolkit on the HPC cluster that allows users to pull data directly from the SRA. The first step is configuring the SRA toolkit with a unique user ID. The second step is running either fastq-dump or fasterq-dump. We will use fastq-dump to demonstrate the options available.
module load /cluster/tufts/bio/tools/module_files/sra/3.0.0
#The next few lines set up your configuration for SRA and are necessary if you have not used the SRA toolkit previously.
#vdb-config is an interactive tool that generates a unique identifier. Since it is difficult to use interactively from RStudio OnDemand, this command opens the interactive screen for three seconds, then closes it, which generates the user ID.
vdb-config -i & read -t 3 ; kill $!
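The "background, wait, kill" pattern in that one-liner is worth understanding on its own. Here is a minimal sketch of the same idea, with sleep standing in for vdb-config -i so it is safe to run anywhere:

```shell
# Launch a long-running program in the background, wait briefly,
# then kill it. `sleep 30` is a harmless stand-in for `vdb-config -i`.
sleep 30 >/dev/null &     # stand-in for the interactive program
pid=$!                    # $! holds the PID of the last background job
read -t 1 -r _ || true    # pause up to 1 second (no input expected)
kill "$pid"               # terminate the background program
echo "stopped background process"
```

In the original one-liner, kill $! sends the termination signal to vdb-config after read times out, which is enough for the tool to have written its configuration file.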
module load /cluster/tufts/bio/tools/module_files/sra/3.0.0
mkdir -p test_SRA
fastq-dump --outdir test_SRA -X 500 --skip-technical --read-filter pass --readids --minReadLen 200 --split-e SRR18086364
Command Line Parameters
-X 500 downloads only the first 500 spots
--skip-technical skips technical reads, keeping only biological reads
--read-filter pass keeps only reads that pass the quality filter
--readids appends .1 to each forward read header and .2 to each reverse read header
There should be three new files in your Metagenomics2022/test_SRA directory.
The output is small enough that we can look into the fastq files by clicking on them. .1 and .2 were appended to your forward and reverse read headers. Which did you receive in each file: .1, .2, or both?
wc -l ./test_SRA/*.fastq
How many lines are in our fastq output files and why?
(Hint 1: we asked for 500 “spots”, not reads) (Hint 2: how did our parameter choices affect our outputs?)
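As a reminder for the hint above: every fastq record spans exactly four lines, so the read count is the line count divided by four, and each "spot" can yield both a forward and a reverse read. A quick sketch of the arithmetic, using a hypothetical line count:

```shell
# Each fastq record occupies exactly four lines:
# @header, sequence, "+" separator, and quality string.
lines=2000                 # hypothetical result of: wc -l < some.fastq
reads=$((lines / 4))
echo "$reads reads"        # 2000 lines -> 500 reads
```

Compare this per-file read count with the 500 spots requested, keeping in mind that --minReadLen 200 and --read-filter pass can drop some reads.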
In some cases, the newer version of this tool, fasterq-dump, will automatically run the split-e version of the command without adding this parameter. It is worth looking at the options before using it, because some of the parameter meanings have changed.
The following code block is for reference; to use the chunk, remove the # at the beginning of the line.
#/cluster/tufts/hpc/tools/spack/linux-rhel7-ivybridge/gcc-9.3.0/fastqc-0.11.9-eeij5jwnwau3ttx4m53bxafmxbhm4fjz/bin/fastqc ./MGData/*overrep.fastq
#source ~/.bashrc
module load fastqc
mkdir -p fastqc
fastqc ./data/*pass*.fastq -o fastqc
#source ~/.bashrc
module load multiqc/1.7.0
mkdir -p multiqc
multiqc ./fastqc -o ./multiqc
Check the quality of your results by navigating to the file folder for multiqc and clicking on the html files that were generated.
Okay, so the data is not perfect, but we just need a set that can make it through the analysis.
Metagenomics has benefited from k-mer based approaches.
BCBio tools are useful for cleaning metagenomics files, because many simple functions can be called and sped up using k-mers.
For example, let’s remove known adapter and other contamination using the NCBI database called “UniVec”
cd ~/Metagenomics2022
#pwd
#grep "TruSeq" ./contaminants/UniVec.fasta
grep "PCR Primer Index 9" ./contaminants/UniVec.fasta
Are these repeats biological (part of the genome) or technical (primer amplification, underabundance of sample or lack of PhiX spike-in)? What does the raw sequencing data show us?
Here is an example of how bbduk.sh can be used to simultaneously remove adapters, phiX spike-in and overrepresented sequences from our custom file.
#source ~/.bashrc
module load java/1.8.0_60
module load /cluster/tufts/bio/tools/module_files/bcbio/1.1.5
mkdir -p decon
bbduk.sh -Xmx10g -tbo -tpe \
in1=data/SRR18085691_pass_1.fastq \
in2=data/SRR18085691_pass_2.fastq \
out1=decon/SRR18085691_decon_1.fastq \
out2=decon/SRR18085691_decon_2.fastq \
outs=decon/SRR18085691_decon_sing.fastq \
ref=contaminants/UniVec.fasta \
ref=contaminants/overrep.fasta \
ref=adapters \
ref=phix \
ktrim=r k=21 rcomp=t mink=11 hdist=2 minlen=200
BBDUK = Decontamination Using Kmers
-tbo trims overlapping bases in paired-end reads
-tpe trims both paired-end reads to the same length
in1, in2, out1, out2, outs mean input forward and reverse reads, and output forward and reverse reads plus orphaned reads
ref can be any file in fasta format, or baked-in options such as phix and adapters
ref can be replaced with literal=ACTGGT,TTTGGTG or any list of strings
ktrim=r means to trim from the 3-prime end
k refers to the kmer size that is chosen
mink allows a shorter kmer size at the end of the sequence
hdist refers to Hamming distance, the number of mismatches allowed in a kmer
minlen indicates the minimum length of the sequence to keep after trimming
#source ~/.bashrc
module load java/1.8.0_60
module load /cluster/tufts/bio/tools/module_files/bcbio/1.1.5
mkdir -p decon
ls data/*_pass_1.fastq | while read line ; do \
bbduk.sh -Xmx10g -tbo -tpe \
in1=$line \
in2=data/$(basename $line _pass_1.fastq)_pass_2.fastq \
out1=decon/$(basename $line _pass_1.fastq)_decon_1.fastq \
out2=decon/$(basename $line _pass_1.fastq)_decon_2.fastq \
outs=decon/$(basename $line _pass_1.fastq)_decon_sing.fastq \
ref=contaminants/UniVec.fasta \
ref=contaminants/overrep.fasta \
ref=phix \
ktrim=r k=21 rcomp=t mink=11 hdist=2 minlen=200; \
done
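The file-name bookkeeping in the loop above relies on basename stripping both the leading directory and a trailing suffix. A standalone sketch of that step, using one of the sample files from this dataset:

```shell
# basename removes the leading directory and, when given a second
# argument, a matching trailing suffix, leaving just the accession.
line=data/SRR18085691_pass_1.fastq
sample=$(basename "$line" _pass_1.fastq)
echo "$sample"                          # SRR18085691
echo "decon/${sample}_decon_2.fastq"    # decon/SRR18085691_decon_2.fastq
```

This is why listing only the *_pass_1.fastq files is enough: the matching _pass_2.fastq name and both output names are rebuilt from the accession.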
#source ~/.bashrc
module load fastqc
mkdir -p fastqc_decon
fastqc ./decon/*decon*.fastq -o fastqc_decon
#source ~/.bashrc
module load multiqc/1.7.0
mkdir -p multiqc_decon
multiqc ./fastqc_decon -o ./multiqc_decon
Low-quality samples are to be removed before assembly. We copy only the samples that passed QC into a gooddata directory:
mkdir -p gooddata
echo SRR18085691 > gooddata.txt
echo SRR18085708 >> gooddata.txt
echo SRR18086342 >> gooddata.txt
echo SRR18086575 >> gooddata.txt
echo SRR18091316 >> gooddata.txt
cat gooddata.txt | while read line; do cp decon/$line*1.fastq gooddata; cp decon/$line*2.fastq gooddata; done
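One detail worth noting in the loop above: the variable after read is written without a dollar sign, because read needs the variable's name, not its value. A minimal self-contained sketch of the keep-list pattern, with hypothetical IDs and echo in place of cp:

```shell
# Write a small keep-list, then loop over it line by line.
# `read -r id` assigns each line to $id; writing `read $id` would
# instead treat the *contents* of $id as the variable name.
printf 'SRR_A\nSRR_B\n' > keep_list.txt
while read -r id; do
  echo "keeping sample $id"
done < keep_list.txt
rm keep_list.txt
```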
This concludes Part I, please come back after the break for Part II - using a wrapper script to bin your data.
#/cluster/tufts/bio/tools/megahit/1.2.9/bin/megahit -m 16000000 -t 32 --read ./gooddata/* \
#--k-list 21,41,61,81,99 \
#--no-mercy \
#--min-count 2 \
#-o megahit_out/
Megahit: de Bruijn graph Assembly
It is an assembly algorithm that splits your sequences into k-mers and tries to build the longest contigs along the shortest parsimonious path.
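To build intuition for the de Bruijn idea (a toy illustration, not Megahit's actual implementation): consecutive k-mers overlap by k-1 bases, so an unambiguous chain of k-mers can be collapsed back into a contig by appending just one new base per k-mer. A sketch with hypothetical 3-mers:

```shell
# Collapse a chain of overlapping 3-mers back into a contig by
# appending only the final base of each successive k-mer.
kmers="ACT CTG TGG GGA"
contig=""
for km in $kmers; do
  if [ -z "$contig" ]; then
    contig=$km                     # the first k-mer seeds the contig
  else
    contig="${contig}${km: -1}"    # append the one new base
  fi
done
echo "$contig"    # ACTGGA
```

Real assemblers do this over a graph with branches and sequencing errors, which is why parameters like --k-list and --min-count matter.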
To save time in the workshop, the assembly has been done using the commented out command above and placed into your workshop folder called “reports”.
Quast is a commonly used tool to assess assemblies. It will provide a report that is either in a text file or an html.
Let’s run the following command and then look for the quast report to see if we have a “good” assembly.
/cluster/tufts/bio/tools/quast/5.2.0/quast.py -o quast_assembly ./reports/final.contigs.fa
At this point, we may be interested in whether our assembly makes sense for our set of contigs.
Kraken2 is the newest version of Kraken, a taxonomic classification system using exact k-mer matches. This k-mer based approach has very fast classification speeds with high accuracy.
This approach differs from homology-based approaches, which try to match sequences to each other and score them based on the number of mismatches, deletions, and insertions. Kraken instead uses the entire k-mer composition of a contig to match it against a database in which the reference sequences have been broken down into k-mer “hashes”, a record of what the k-mers look like for a particular organism.
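As a toy illustration of the exact-k-mer idea (hypothetical sequence, k=4, not Kraken2's internals): a sequence of length L contains L - k + 1 overlapping k-mers, and it is these substrings that get looked up in the hashed database.

```shell
# Enumerate every overlapping 4-mer of a short hypothetical contig.
seq=ACTGGTAC
k=4
for ((i = 0; i <= ${#seq} - k; i++)); do
  echo "${seq:$i:$k}"    # prints ACTG, CTGG, TGGT, GGTA, GTAC
done
```

Exact lookups like this are why Kraken2 is so much faster than alignment, at the cost of being only as good as the database it was built from.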
/cluster/tufts/bio/tools/conda_envs/kraken/2.1.2/bin/kraken2 --use-names --threads 4 --db /cluster/tufts/bio/tools/training/metagenomics/kraken_virus --report kraken.report.txt --quick reports/final.contigs.fa --classified-out classified_sequences.tsv > sequences.kraken
head classified_sequences.tsv
We can open up the sequence report inside RStudio OnDemand and scroll through the findings.
This can be a very long list of species names, so it may be more helpful to use an interactive visualization tool to look at our data.
Krona is a very useful tool that can take in a variety of formats. In this case, we are providing our kraken.report to the visualization. Metabarcoding data assigned to a taxonomy can also be used in this case.
/cluster/tufts/bio/tools/conda_envs/kraken/2.1.2/bin/ktImportTaxonomy -t 5 -m 3 -o krona.html kraken.report.txt
The output, when opened in a web browser, allows the user to interact with the data and explore different levels of the data.
So we are probably running out of time to continue working hands on.
Let’s take some time to discuss what the next steps might look like.
This discussion will center around a recent project from Dr. Pearson’s lab, conducted by Neveda Naz, Ph.D.
Neveda builds little robots and sends them to outer space. (Neveda, if you are here, feel free to comment!)
Please open up this pdf here to find the discussion.
Also, for future reading, I have placed two useful PDFs on our public GitHub pages that summarize the steps we are discussing in further detail.
Essentials in Metagenomics I Essentials in Metagenomics II
This markdown and the associated files will be placed in our GitHub and emailed out to the class shortly.
We will also send out a video of this workshop.
Jason and I hope to continue developing this practical material to provide more hands-on experience in the future, so any feedback on the tools you would like to explore is appreciated.