
Import Raw Reads from Shared Library

The introductory Slides give an overview of RNA sequencing technologies and our workflow.

Dataset

Our dataset is from the publication:

Chang et al. Next-Generation Sequencing Reveals HIV-1-Mediated Suppression of T Cell Activation and RNA Processing and Regulation of Noncoding RNA Expression in a CD4+ T Cell Line. mBio 2011. doi: 10.1128/mBio.00134-11

The experiment compares the mRNA produced by mock-infected and HIV-infected CD4+ T cells at both 12 hr and 24 hr after infection.

The following steps will walk you through how to run tools needed for our workflow. In each step certain parameters are set. If a parameter option appears on the screen but this tutorial doesn’t mention how to set it, leave it at the default. There are questions throughout, which serve to guide you through the results and check your understanding.

Create a new history

Import the raw data from a shared data library on our server

We’ll import the raw reads from a shared library on our server. They have been downsampled to 1 million reads per file to speed up computation. The full dataset is available from NCBI under accession SRP013224.

You’ll see the collection (or list) chang_2011 in your history.

View Fastq files

The first 4 lines constitute the first sequencing read:

@SRR497699.30343179.1 HWI-EAS39X_10175_FC61MK0_4_117_4812_10346 length=75
CAGATGGCCGCAGAGGAAGCCATGAAGGCCCTGCATGGGGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGAC
+
IIIIGIIHFIIIIBIIDII>IIDHIIHDIIIGIFIIEIGIBDDEFIG<EIEGEEG;<DB@A8CC7<><C@BBDDB
  1. Sequence identifier
  2. Sequence
  3. + (optionally lists the sequence identifier again)
  4. Quality string
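
To make the four-line structure concrete, here is a minimal Python sketch that splits the read shown above into its parts and decodes the quality string into numeric scores. It assumes the Phred+33 (Sanger) encoding, which the characters in this read are consistent with. The function name `parse_fastq_record` is hypothetical, chosen only for illustration.

```python
def parse_fastq_record(lines):
    """Split four FASTQ lines into identifier, sequence, and Phred scores."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a valid FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality string differ in length")
    # Assumption: Phred+33 encoding, so score = ASCII code - 33.
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


# The first read from the tutorial dataset, copied from the text above.
record = [
    "@SRR497699.30343179.1 HWI-EAS39X_10175_FC61MK0_4_117_4812_10346 length=75",
    "CAGATGGCCGCAGAGGAAGCCATGAAGGCCCTGCATGGGGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGAC",
    "+",
    "IIIIGIIHFIIIIBIIDII>IIDHIIHDIIIGIFIIEIGIBDDEFIG<EIEGEEG;<DB@A8CC7<><C@BBDDB",
]
identifier, sequence, scores = parse_fastq_record(record)
print(identifier.split()[0], len(sequence), max(scores), min(scores))
```

Each character in the quality string maps to one base in the sequence, so the two lines must always have the same length. A higher score means a lower estimated probability that the base call is wrong.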

Perform Quality Control on Raw Reads

FastQC provides several modules (as discussed in the intro Slides).

Run FastQC

Question 1: How many sequences are in the sample **HIV_12hr_rep1**? What is their average length?
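
FastQC reports these numbers for you, but they can also be cross-checked directly from a FASTQ file. The sketch below is one way to do that, assuming an uncompressed file with strict four-line records; `fastq_stats` is a hypothetical helper name, and the two reads in the example are made up rather than taken from the tutorial data.

```python
import io


def fastq_stats(handle):
    """Return (read count, mean read length) for an uncompressed FASTQ stream."""
    count = 0
    total_bases = 0
    for index, line in enumerate(handle):
        # Assumption: strict four-line records; the sequence is line 2 of each.
        if index % 4 == 1:
            count += 1
            total_bases += len(line.strip())
    mean_length = total_bases / count if count else 0.0
    return count, mean_length


# Tiny made-up example, not taken from the tutorial data.
example = io.StringIO("@read1\nACGTAC\n+\nIIIIII\n@read2\nACGT\n+\nIIII\n")
count, mean_length = fastq_stats(example)
print(count, mean_length)
```

For a real file, the same function could be called on a handle opened with `open(path)`.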

Aggregate QC data with MultiQC

The tool MultiQC allows us to view our QC results from all samples side by side, in order to check for consistency across replicates. It can use the Raw Data output from FastQC and generate plots for all modules.

Steps to run:

The first panel gives summary statistics:

The second figure is a bar graph showing “Sequence Counts” of unique and duplicate reads for each sample. The remaining figures show each FastQC metric, displaying all samples on a single graph. There is a rectangle at the top that summarizes the pass/fail status of samples.

Question 2: Which metrics show one or more failed samples?

Trim adapters and low quality read ends with Trim Galore!
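
As a rough illustration of what adapter trimming does, the sketch below cuts a read at the first exact occurrence of the Illumina universal adapter prefix `AGATCGGAAGAGC`, using the example read from the "View Fastq files" section. This is a deliberately naive sketch: Trim Galore! delegates to Cutadapt, which tolerates mismatches and partial adapter matches at the read end and also trims low-quality bases, none of which is reproduced here. The function name `trim_adapter` is hypothetical.

```python
ILLUMINA_ADAPTER = "AGATCGGAAGAGC"


def trim_adapter(sequence, quality, adapter=ILLUMINA_ADAPTER):
    """Cut the read at the first exact adapter match; keep quality in sync."""
    position = sequence.find(adapter)
    if position == -1:
        # No adapter found: return the read unchanged.
        return sequence, quality
    return sequence[:position], quality[:position]


# Sequence and quality lines of the example read from the tutorial.
read = "CAGATGGCCGCAGAGGAAGCCATGAAGGCCCTGCATGGGGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGAC"
qual = "IIIIGIIHFIIIIBIIDII>IIDHIIHDIIIGIFIIEIGIBDDEFIG<EIEGEEG;<DB@A8CC7<><C@BBDDB"
trimmed_read, trimmed_qual = trim_adapter(read, qual)
print(len(read), len(trimmed_read))
print(trimmed_read)
```

The sequence and quality strings are truncated at the same position so that each remaining base keeps its own quality score.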

Rerun FastQC and MultiQC

Question 3: Were any reads completely removed from the samples? Note: The MultiQC "General Statistics" table shows a rounded value, so use the "Sequence Counts" graph.
Question 4: Is the adapter problem solved? What about the GC content? Note: HIV replication is ramping up rapidly in these cells in the first 24 hours.

Next: Read Alignment

Previous: Introduction to Galaxy

Main Page