Obtaining the Reference data and NGS Sequencing data from public repositories

The US National Center for Biotechnology Information (NCBI) hosts repositories for many types of biomedical and genomics data. Today we’ll retrieve reference data from the Genomes Database FTP server and sequencing reads from the Sequence Read Archive (SRA).

Step 1: Galaxy Setup

Create a new history

Click the + at the top of the history panel, on the right-hand side of the screen
To rename the history, click on the box labelled Unnamed History, type ngs data workshop, and press Enter

Step 2: Obtaining our Data

Our dataset is a SARS-CoV-2 Next Generation Sequencing sample. In this section we’ll obtain our reference data and our NGS reads in preparation for alignment.

Import the SARS-CoV-2 genome and gene annotation from NCBI

On the left-hand side tool panel, click the Upload icon

Click Paste/Fetch data
Copy the following URLs and paste in the grey box:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.gff.gz


Click Start followed by Close
Two jobs will appear in the History. Each one changes colour as it progresses: grey (pending) -> orange (running) -> green (complete).
Rename the genome file
- On the green box for the file ending in fna.gz, click on the pencil (Edit attributes) icon
- Under Name, replace the URL in the name with “genome”
- Click Save
Rename the gene annotation file and decompress
- On the green box for the file ending in gff.gz, click on the pencil (Edit attributes) icon
- Under Name, replace the URL in the name with “genes”
- Click Save
- Click on the Convert tab, and under Name select Convert compressed file to uncompressed
- Click Convert Datatype to generate a new dataset with the uncompressed file
Your History should now contain three datasets: genome, genes, and the uncompressed copy of genes.
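The two URLs pasted into the upload box follow a regular layout: the nine digits of the assembly accession become three directory levels, and the genome and annotation files sit side by side in the same assembly folder. The sketch below rebuilds both URLs from the accession and assembly name. The helper `ncbi_assembly_url` is a hypothetical name, not an NCBI function, and the assumption that other assemblies follow the same layout is inferred only from the two URLs used here.

```python
def ncbi_assembly_url(accession: str, assembly_name: str, suffix: str) -> str:
    """Build a download URL under the NCBI genomes area.

    Hypothetical helper: it reproduces the pattern seen in the two URLs
    used in this tutorial and is not an official NCBI API.
    """
    prefix, rest = accession.split("_", 1)   # "GCF", "009858895.2"
    digits = rest.split(".", 1)[0]           # "009858895"
    # Split the digits into three-character directory names.
    parts = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    folder = f"{accession}_{assembly_name}"
    return (
        "https://ftp.ncbi.nlm.nih.gov/genomes/all/"
        f"{prefix}/{'/'.join(parts)}/{folder}/{folder}_{suffix}"
    )


# Both files are derived from the same accession, so they describe the
# same assembly.
genome_url = ncbi_assembly_url("GCF_009858895.2", "ASM985889v3", "genomic.fna.gz")
genes_url = ncbi_assembly_url("GCF_009858895.2", "ASM985889v3", "genomic.gff.gz")
print(genome_url)
print(genes_url)
```

Building both URLs from one accession is a simple way to make sure the genome and its annotation come from the same source.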

Fasta Format

The virus genome is in fasta format. A fasta record has two parts: a header line that starts with a “>” symbol and carries the sequence identifier, followed by the sequence itself on the subsequent lines. You can preview it by clicking on the genome dataset in the History panel.
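To make the two-part structure concrete, here is a minimal sketch of a fasta reader in Python. The function name `read_fasta` and the example record are made up for illustration; the record is not taken from the SARS-CoV-2 genome.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} from fasta-formatted lines."""
    records: Dict[str, str] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep the first word after ">" as the identifier.
            words = line[1:].split()
            current = words[0] if words else ""
            records[current] = ""
        elif current is not None:
            # Sequence lines are joined together until the next header.
            records[current] += line
    return records


# Made-up example record, for illustration only.
example = """>demo_seq an illustrative record
ACGTACGT
TTGA
"""
sequences = read_fasta(example.splitlines())
print(sequences)  # {'demo_seq': 'ACGTACGTTTGA'}
```

Note that the sequence may be wrapped over several lines, so a reader has to join lines until it reaches the next “>”.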

GFF Format

The gene annotation file is in Generic Feature Format (GFF). This format tells us where genes are located in the reference genome. To preview the GFF file, click on the eye icon on the genes uncompressed dataset. Note that we must always be sure that our gene information and genome come from the same source, because the annotation coordinates are only meaningful against the exact genome sequence they were defined on.
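Each feature line in a GFF3 file has nine tab-separated columns: sequence ID, source, feature type, start, end, score, strand, phase, and attributes. The sketch below parses one such line. The names `GffFeature` and `parse_gff_line` are hypothetical, and the example line is hand-written for illustration rather than copied from the downloaded annotation.

```python
from typing import Dict, NamedTuple


class GffFeature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff_line(line: str) -> GffFeature:
    """Parse one tab-separated GFF3 feature line (not a '#' comment line)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    # Column 9 holds semicolon-separated key=value pairs.
    attributes = dict(
        pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
    )
    return GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attributes)


# Hand-written illustrative line; the coordinates are not from the real file.
line = "demo_seq\texample\tgene\t101\t400\t.\t+\t.\tID=gene-demo;Name=demoA"
feature = parse_gff_line(line)
# GFF coordinates are 1-based and inclusive, hence the +1.
print(feature.type, feature.attributes["Name"], feature.end - feature.start + 1)
```

The first column must match a sequence identifier in the fasta file, which is one practical reason the genome and annotation need to come from the same source.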

Step 3: Import NGS sequencing data from Sequence Read Archive

We are interested in obtaining reads from the sample “Viral genomic RNA sequencing of a B.1.617.2/Delta isolate; Severe acute respiratory syndrome coronavirus 2; RNA-Seq”.

Download Reads

We’ll download the data from the Sequence Read Archive using a Galaxy tool from the SRA Toolkit.

In the Tool panel search box, search for a tool called “fasterq” and click on the tool under SRA toolkit called Faster Download and Extract Reads in FASTQ
Under Accession paste the accession number SRR15607266
Click Execute

The result will be four items in the History:
- Log: information about the total reads downloaded
- Other data: Empty, but can contain experiment metadata
- Single-end data: Empty, since this experiment has no single-end files
- Paired-end data: Two files, containing the forward and reverse reads for this sample

Fastq format

Fastq format is a way to store both sequence data and information about the quality of each sequenced position.

Each block of 4 lines contains one sequencing read, for example:

@SRR15607266.1 1 length=76
NTTATCTACTTTTATTTCAGCAGCTCGGCAAGGGTTTGTTGATTCAGATGTAGAAACTAAAGATGTTGTTGAATGT
+SRR15607266.1 1 length=76
#8ACCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

The four lines are, in order:
- Sequence identifier
- Sequence
- + (optionally lists the sequence identifier again)
- Quality string
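The four-line layout can be read with a few lines of Python. This is a minimal sketch, not the parser used by any Galaxy tool: `read_fastq` is a hypothetical name, the example reads are made up, and the whole input is held in memory, which is fine for a demonstration but not for a full sequencing run.

```python
from typing import Iterable, List, Tuple


def read_fastq(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (identifier, sequence, quality) for each 4-line record."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("fastq input must contain a multiple of 4 lines")
    records = []
    for i in range(0, len(rows), 4):
        header, sequence, plus, quality = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Keep the first word after "@" as the read identifier.
        records.append((header[1:].split()[0], sequence, quality))
    return records


# Made-up reads for illustration.
example = """@read1 demo
ACGT
+
GGGG
@read2 demo
NTTA
+read2 demo
#8AC
"""
reads = read_fastq(example.splitlines())
print(reads)
```

Records are taken in fixed groups of four lines rather than by searching for “@”, because “@” is also a valid character in the quality string.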

Paired-end sequencing data will typically be stored as two fastq files, one for the forward reads and one for the reverse reads. Each file should contain the same number of reads, with the same labels, in the same order. If this convention is not followed, downstream tools may fail or silently pair the wrong reads. Fortunately, there are tools such as BBTools Repair that can help restore pairing information.
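The pairing convention can be checked by comparing the read labels of the two files position by position. The sketch below only detects the first disagreement; it does not repair anything and says nothing about how BBTools Repair works. The function names are hypothetical, and stripping a trailing `/1` or `/2` is an assumption about how some files label mates.

```python
from itertools import zip_longest
from typing import Iterable, Optional


def mate_name(header: str) -> str:
    """Reduce a fastq header line to a label shared by both mates."""
    name = header.lstrip("@").split()[0]
    # Assumption: some files mark mates with a trailing /1 or /2.
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_pairing_problem(forward: Iterable[str],
                          reverse: Iterable[str]) -> Optional[int]:
    """Return the 0-based index of the first mismatched pair, or None."""
    pairs = zip_longest(forward, reverse)  # pads the shorter file with None
    for index, (fwd, rev) in enumerate(pairs):
        if fwd is None or rev is None or mate_name(fwd) != mate_name(rev):
            return index
    return None


# Made-up header lines for illustration.
forward_headers = ["@r1/1", "@r2/1", "@r3/1"]
print(first_pairing_problem(forward_headers, ["@r1/2", "@r2/2", "@r3/2"]))  # None
print(first_pairing_problem(forward_headers, ["@r1/2", "@r3/2", "@r2/2"]))  # 1
print(first_pairing_problem(forward_headers, ["@r1/2", "@r2/2"]))           # 2
```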

Base Quality Scores

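The symbols we see in the read quality string are an encoding of the quality score: each character stands for one base, and its ASCII code minus 33 gives the quality score Q.

A quality score is a prediction of the probability P of an error in base calling: Q = -10 × log10(P). For example, Q = 30 corresponds to an error probability of 1 in 1,000.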

Going back to our read, we can see that for most positions the quality character is “G”, which encodes Q = 38, meaning the probability of an error is less than 1/1000.
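The decoding can be reproduced in a few lines of Python. This sketch assumes the offset-33 encoding, which is consistent with “G” mapping to 38 above, and the standard Phred relation Q = -10 × log10(P). The function names are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to per-base quality scores."""
    return [ord(char) - offset for char in quality]


def error_probability(q: int) -> float:
    """Probability of a wrong base call for quality score q."""
    return 10 ** (-q / 10)


# First six quality characters of the example read shown earlier.
scores = phred_scores("#8ACCG")
print(scores)  # [2, 23, 32, 34, 34, 38]
print(error_probability(38) < 1 / 1000)  # True
```

The low score at the first position matches the “N” (unknown base) at the start of the example read.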

Preview Fastq data

Click on the list Pair-end data (fasterq-dump) and then on the sublist SRR15607266 to expand the sample. You’ll see two sequencing files, forward and reverse
Click on the eye icon on the first sequence file, forward, and look at the fastq reads

(Optional Next:) Process Raw Reads

Next: Read Alignment

Previous: Introduction to Galaxy
