How to run bioinformatics analysis on Tufts HPC

Author: Shirley Li, xue.li37@tufts.edu

Date: 2024-10-07

Objectives

The primary goal of this tutorial is to introduce participants to bioinformatics on a high-performance computing (HPC) cluster.
By the end of this tutorial, participants will: Understand basic command-line operations.
Be familiar with common bioinformatics file formats (e.g., FASTA, FASTQ, SAM, BAM). - Gain hands-on experience with bioinformatics tools and workflows.

Getting Started

Prerequisites

Basic understanding of biology and bioinformatics
Familiarity with the command line
Access to an HPC cluster (e.g., login credentials, necessary software installations)

Setup

Connecting to the Cluster through terminal or Open OnDemand.
Start an interactive session, go from log in mode to compute mode.

srun -p interactive -n 1 --time=4:00:00 --mem=32g --cpus-per-task=8 --pty bash

Copying Sample Data

You can use your lab storage or your home directory (only for this workshop).

cd /cluster/tufts/XXlab/utln/

Copy example data to your directory.

cp -r /cluster/tufts/workshop/demo/bio_2024 ./
cd bio_2024

Let's take a look at the files using ls

genomics_data  other  raw_fastq  README.txt  reference_data

For this workshop, we will store the data in our home directory since it’s small in size. However, in the future, avoid using your home directory for data storage. Always utilize your lab's designated storage space instead. Otherwise, your home directory will fill up quickly, limiting your ability to perform various tasks.

Overview of the input files

FASTA file

FASTA is a text-based format for representing nucleotide sequences or protein sequences. It is widely used in bioinformatics for sequence data storage and analysis. Each sequence in a FASTA file is represented by a header line starting with a '>', followed by lines of sequence data.

A typical FASTA file has the following structure:
Header Line: This line starts with a '>' character, followed by a description or identifier of the sequence.
Sequence Data: The nucleotide or amino acid sequence, which can span multiple lines.
Example:

cat other/rcsb_pdb_5XUS.fasta

What does this file contain?

>5XUS_1|Chain A|LbCpf1|Lachnospiraceae bacterium ND2006 (1410628)
GSHMSKLEKFTNCYSLSKTLRFKAIPVGKTQENIDNKRLLVEDEKRAEDYKGVKKLLDRYYLSFINDVLHSIKLKNLNNYISLFRKKTRTEKENKELENLEINLRKEIAKAFKGNEGYKSLFKKDIIETILPEFLDDKDEIALVNSFNGFTTAFTGFFDNRENMFSEEAKSTSIAFRCINENLTRYISNMDIFEKVDAIFDKHEVQEIKEKILNSDYDVEDFFEGEFFNFVLTQEGIDVYNAIIGGFVTESGEKIKGLNEYINLYNQKTKQKLPKFKPLYKQVLSDRESLSFYGEGYTSDEEVLEVFRNTLNKNSEIFSSIKKLEKLFKNFDEYSSAGIFVKNGPAISTISKDIFGEWNVIRDKWNAEYDDIHLKKKAVVTEKYEDDRRKSFKKIGSFSLEQLQEYADADLSVVEKLKEIIIQKVDEIYKVYGSSEKLFDADFVLEKSLKKNDAVVAIMKDLLDSVKSFENYIKAFFGEGKETNRDESFYGDFVLAYDILLKVDHIYDAIRNYVTQKPYSKDKFKLYFQNPQFMGGWDKDKETDYRATILRYGSKYYLAIMDKKYAKCLQKIDKDDVNGNYEKINYKLLPGPNKMLPKVFFSKKWMAYYNPSEDIQKIYKNGTFKKGDMFNLNDCHKLIDFFKDSISRYPKWSNAYDFNFSETEKYKDIAGFYREVEEQGYKVSFESASKKEVDKLVEEGKLYMFQIYNKDFSDKSHGTPNLHTMYFKLLFDENNHGQIRLSGGAELFMRRASLKKEELVVHPANSPIANKNPDNPKKTTTLSYDVYKDKRFSEDQYELHIPIAINKCPKNIFKINTEVRVLLKHDDNPYVIGIDRGERNLLYIVVVDGKGNIVEQYSLNEIINNFNGIRIKTDYHSLLDKKEKERFEARQNWTSIENIKELKAGYISQVVHKICELVEKYDAVIALEDLNSGFKNSRVKVEKQVYQKFEKMLIDKLNYMVDKKSNPCATGGALKGYQITNKFESFKSMSTQNGFIFYIPAWLTSKIDPSTGFVNLLKTKYTSIADSKKFISSFDRIMYVPEEDLFEFALDYKNFSRTDADYIKKWKLYSYGNRIRIFRNPKKNNVFDWEEVCLTSAYKELFNKYGINYQQGDIRALLCEQSDKAFYSSFMALMSLMLQMRNSITGRTDVDFLISPVKNSDGIFYDSRNYEAQENAILPKNADANGAYNIARKVLWAIGQFKKAEDEKLDKVKIAISNKEWLEYAQTSVKH

>5XUS_2|Chain B|crRNA|synthetic construct (32630)
AAUUUCUACUAAGUGUAGAUGGAAAUUAGGUGCGCUUGGC

>5XUS_3|Chain C|DNA (29-MER)|synthetic construct (32630)
GCCAAGCGCACCTAATTTCCTAAAGGACG

>5XUS_4|Chain D|DNA (5'-D(*CP*GP*TP*CP*CP*TP*TP*TP*A)-3')|synthetic construct (32630)
CGTCCTTTA

Counting the number of sequencies in a FASTA file:

grep -c "^>" other/rcsb_pdb_5XUS.fasta

Common Use: The FASTA format is typically used to store reference genome sequences, which can be for a whole genome, specific chromosomes, or individual genes. This format is essential for tasks like sequence alignment, variant calling, and genome assembly.

FASTQ file

FASTQ is a text-based format used to store both nucleotide sequences and their corresponding quality scores. It is widely used in bioinformatics, particularly for storing data from high-throughput sequencing technologies.
- Structure of a FASTQ File A FASTQ file consists of a series of entries, each representing a single read. Each entry has four lines:
- Header Line: Starts with '@' followed by a sequence identifier and an optional description.
- Sequence Line: The raw sequence of nucleotides (A, T, C, G).
- Separator Line: Starts with a '+' character and can optionally be followed by the same sequence identifier and description as in the header line.
- Quality Line: Encodes the quality scores for the sequence in the sequence line, using ASCII characters.

Example:

head raw_fastq/Irrel_kd_1.subset.fq

Warning: Do not use cat to view fq files, as they are often very large.

@HWI-ST330:304:H045HADXX:1:1214:19169:32660
CTTTTTTGCTGGAACTTTAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGTACCCTTAGTACCAGGTGCTTTTTTGGGAGAAGCT
+
CCCFFFFFGGHGHJJJJJJJIEHIIIIIGIIIJJHIJIIJJJJJGGGIJJJJJJJJJJGHGHHFBCEFFEEEEE;@@CDDD5>ACDDDDDDBB?BDDDCA
@HWI-ST330:304:H045HADXX:2:1212:14280:80867
CCCTAATGATGATATATGGATCAAAAGTCTTCTTTGTAGTACAAACAGTCATGCTGCCTTCGATCAGGTCCAGGGTTGCATTAACATGATGTTCATTTAA
+
:?@DDDDFDHHGHJIGBFGIGIIIGDE:CFIHIIJE1?>CEFGCEGFG?FGI?F@FHCCGIHIIJGHC@FHHGEB@9AEEHFFFEDFCCACEEEEDDDDD
@HWI-ST330:304:H045HADXX:2:2111:14933:89958
GGCATCCATGTTCTTGCCCAAAACCTTGGTTACAGCAATCTGATACTTCTTTTGTGTGGGCTGGCATAGGTCAATGAGGCAGATCGGAAGAGCACACGTC

You can also use less to view the file and press Q key to exit.

FASTQC

A quality control tool used in bioinformatics to assess the quality of high-throughput sequencing data, typically for next-generation sequencing (NGS) reads, like those in FASTQ files. It provides a series of quality checks on raw sequence data, helping researchers identify issues before proceeding with downstream analysis like alignment, variant calling, or transcript quantification.

Example usage:

module load fastqc/0.12.1
fastqc raw_fastq/Irrel_kd_1.subset.fq  # It will generate a html report

Fastqc output

Check this article for how to interpret fastqc result.

GTF file

GTF (Gene Transfer Format) is a file format used to hold information about gene structure. It is widely used in genomics to store annotations of genomic features such as genes, exons, and regulatory elements. GTF is similar to GFF (General Feature Format) but includes additional standardized attributes.

Structure of a GTF File:

A GTF file is a tab-delimited text file with one line per feature. Each line consists of nine fields:

seqname: Name of the sequence (e.g., chromosome).
source: Name of the program that generated the feature.
feature: Type of feature (e.g., gene, exon, CDS).
start: Start position of the feature.
end: End position of the feature.
score: Score value (can be a dot if not used).
strand: Strand of the feature (+ or -).
frame: Frame for coding sequences (0, 1, or 2).
attribute: A semicolon-separated list of key-value pairs describing the feature.

head reference_data/chr1-hg19_genes.gtf

chr1    unknown exon    14362   14829   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    14970   15038   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    15796   15947   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    16607   16765   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    16858   17055   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    17233   17368   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    17606   17742   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    17915   18061   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    18268   18366   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    24738   24891   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    29321   29370   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    34611   35174   .       -       .       gene_id "FAM138F"; gene_name "FAM138F"; transcript_id "NR_026820"; tss_id "TSS8099";
chr1    unknown exon    34611   35174   .       -       .       gene_id "FAM138A"; gene_name "FAM138A"; transcript_id "NR_026818"; tss_id "TSS8099";
chr1    unknown exon    35277   35481   .       -       .       gene_id "FAM138F"; gene_name "FAM138F"; transcript_id "NR_026820"; tss_id "TSS8099";
chr1    unknown exon    35277   35481   .       -       .       gene_id "FAM138A"; gene_name "FAM138A"; transcript_id "NR_026818"; tss_id "TSS8099";
chr1    unknown exon    35721   36081   .       -       .       gene_id "FAM138F"; gene_name "FAM138F"; transcript_id "NR_026820"; tss_id "TSS8099";
chr1    unknown exon    35721   36081   .       -       .       gene_id "FAM138A"; gene_name "FAM138A"; transcript_id "NR_026818"; tss_id "TSS8099";
chr1    unknown CDS     69091   70005   .       +       0       gene_id "OR4F5"; gene_name "OR4F5"; p_id "P9488"; transcript_id "NM_001005484"; tss_id "TSS13903";
chr1    unknown exon    69091   70008   .       +       .       gene_id "OR4F5"; gene_name "OR4F5"; p_id "P9488"; transcript_id "NM_001005484"; tss_id "TSS13903";
chr1    unknown start_codon     69091   69093   .       +       .       gene_id "OR4F5"; gene_name "OR4F5"; p_id "P9488"; transcript_id "NM_001005484"; tss_id "TSS13903";
chr1    unknown stop_codon      70006   70008   .       +       .       gene_id "OR4F5"; gene_name "OR4F5"; p_id "P9488"; transcript_id "NM_001005484"; tss_id "TSS13903";
ch

GTF files are essential for common bioinformatics analyses such as RNA-Seq analysis, genome annotation, and differential expression analysis. They provide the necessary annotation information to map reads to genomic features and to understand the functional elements of the genome.

You can download human gtf file from UCSC Genome Browser https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/ or Ensembl Genome Browser https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/

How to donwload your own reference genome file

Navigate to ensembl website: https://useast.ensembl.org/index.html
Select the organisms you are working on. Ex: Human

ensembl

Choose the file you would like to download: FASTA, GTF or GFF3, ...
Always keep an eye on the genome build. GRCh38 is the latest for human genome.

Toy analysis with interactive jobs

Alignment with STAR

Make sure you are in compute node

utln@login-prod-02    # log in node
utln@p1cmp017         # compute node

If not, do the following

srun -p interactive -n 1 --time=4:00:00 --mem=32g --cpus-per-task=8 --pty bash

Load necessary modules

module load star/2.7.11b

Before aligning your FASTQ files, you need to generate an index using the reference genome (.fa file) and the annotation file (.gtf file).

STAR --runMode genomeGenerate \
     --genomeDir ./reference_data/ \
     --genomeFastaFiles ./reference_data/chr1.fa \
     --sjdbGTFfile ./reference_data/chr1-hg19_genes.gtf \
     --runThreadN 8

You will see output log like this

Sep 20 11:25:15 ..... started STAR run
Sep 20 11:25:15 ... starting to generate Genome files
Sep 20 11:25:20 ..... processing annotations GTF
!!!!! WARNING: --genomeSAindexNbases 14 is too large for the genome size=249250621, which may cause seg-fault at the mapping step. Re-run geno
me generation with recommended --genomeSAindexNbases 12
Sep 20 11:25:22 ... starting to sort Suffix Array. This may take a long time...
Sep 20 11:25:23 ... sorting Suffix Array chunks and saving them to disk...
Sep 20 11:26:40 ... loading chunks from disk, packing SA...
Sep 20 11:26:47 ... finished generating suffix array
Sep 20 11:26:47 ... generating Suffix Array index
Sep 20 11:27:52 ... completed Suffix Array index
Sep 20 11:27:52 ..... inserting junctions into the genome indices
Sep 20 11:28:17 ... writing Genome to disk ...
Sep 20 11:28:17 ... writing Suffix Array to disk ...
Sep 20 11:28:19 ... writing SAindex to disk
Sep 20 11:28:22 ..... finished successfully

If you do ls -lhtr reference_data , you will see the genome index you just generated.

Align FASTQ Reads to the Reference Genome

First, let's create an folder to store output from STAR:

mkdir star_output

For single-end reads:

STAR --genomeDir ./reference_data/ \
     --readFilesIn ./raw_fastq/Irrel_kd_1.subset.fq \
     --outFileNamePrefix ./star_output/ \
     --runThreadN 8

In our example, we use single-end reads.

For paired-end reads:

STAR --genomeDir ./reference_data/ \
     --readFilesIn /path/to/read1.fastq /path/to/read2.fastq \
     --outFileNamePrefix ./star_output/ \
     --runThreadN 8

Output

Check the outputs:

cd star_output
ls

You will see the following:

Aligned.out.sam  Log.final.out  Log.out  Log.progress.out  SJ.out.tab

Aligned.out.sam contains the aligned information.

SAM file

SAM (Sequence Alignment/Map) is a text-based format for storing biological sequences aligned to a reference sequence. It is widely used in bioinformatics for storing large-scale sequencing data, such as that from next-generation sequencing technologies.

A SAM file consists of a header section and an alignment section.

Header Section: Optional. Contains metadata about the alignments and reference sequences. Header lines start with the '@' character.
Alignment Section: Contains the aligned sequences and their corresponding information. Each line represents a single alignment and consists of 11 mandatory fields and optional fields.

@HD     VN:1.4
@SQ     SN:chr1 LN:249250621
@PG     ID:STAR PN:STAR VN:2.7.11b      CL:/usr/local/bin/STAR-avx2   --runThreadN 8   --genomeDir ./reference_data/   --readFilesIn ./raw_fastq/Irrel_kd_1.subset.fq      --outFileNamePrefix ./star_output/
@CO     user command line: /usr/local/bin/STAR-avx2 --genomeDir ./reference_data/ --readFilesIn ./raw_fastq/Irrel_kd_1.subset.fq --outFileNamePrefix ./star_output/ --runThreadN 8
HWI-ST330:304:H045HADXX:2:1102:9621:48866       0       chr1    115266503       255     100M    *       0       0       CCCTCCTCCACAATCTCAATCATTCCTTGGTACTCAGTCTGTGTTGGATCAACACTCCTCAGGGGGCGAATTACTTTGCCAGAGTAAATGGTGGGATCAG    CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJHIIIJJJGIJJIJJJJJJJJJGJJJJJJJJIIIJJJFDDDDDDDDDEDDDDDBB>CDDEEDCDDDDDDDD    NH:i:1  HI:i:1  AS:i:98 nM:i:0
HWI-ST330:304:H045HADXX:2:1104:16192:61086      0       chr1    115266503       255     100M    *       0       0       CCCTCCTCCACAATCTCAATCATTCCTTGGTACTCAGTCTGTGTTGGATCAACACTCCTCAGGGGGCGAATTACTTTGCCAGAGTAAATGGTGGCATCAG    @BCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJIJJJIIJJJIIJJJJJJIJJHJJJJJJJJJJJJFDDDDDDDDDEDDDDDDD@CCDEED:ABDDDDDD    NH:i:1  HI:i:1  AS:i:96 nM:i:1
HWI-ST330:304:H045HADXX:2:1107:5712:34356       0       chr1    115266503       255     100M    *       0       0       CCCTCCTCCACAATCTCAATCATTCCTTGGTACTCAGTCTGTGTTGGATCAACACTCCTCAGGGGGCGAATTACTTTGCCAGAGTAAATGGTGGGATCAG    @B@FFDFFHFFHHIGIIIGIHJJIJJJJJJCEHIJJJFGHIIJIJJJJJIIJJJJJJJIJJJJIIJFDBBBDDDDDEDDDDDDD>CCDEED>ADDDDDDD    NH:i:1  HI:i:1  AS:i:98 nM:i:0
HWI-ST330:304:H045HADXX:2:1113:8407:74596       0       chr1    115266503       255     100M    *       0       0

Check this article about how to interpret sam files.

BAM file

The BAM (Binary Alignment/Map) file format is a binary version of the SAM (Sequence Alignment/Map) format. It is used to store aligned sequence data in a more efficient and compressed form.

Convert sam file to bam file:

module load samtools/1.17
samtools view -S -b Aligned.out.sam > Aligned.out.bam

Bam file cannot be read directly, we have to use samtools to view it:

module load samtools/1.17
samtools view Aligned.out.bam | head -n 20

Toy analysis with job scripts

Let's write a slurm script call run_star.sh

#!/bin/bash
#SBATCH -J STAR_JOB             # Job name
#SBATCH --time=12:00:00         # Job will run for a maximum of 12 hours (D-HH:MM:SS format)
#SBATCH -p batch                # Partition (queue) to submit the job to
#SBATCH -n 1                    # Number of tasks (here we only need 1 task)
#SBATCH --mem=32g               # Allocate 32 GB of memory for this job
#SBATCH --cpus-per-task=8       # Number of CPU cores allocated for this task
#SBATCH --output=MyJob.%j.%N.out  # Output will be saved to a file with job ID (%j) and node name (%N)
#SBATCH --error=MyJob.%j.%N.err   # Error will be saved to a file with job ID (%j) and node name (%N)
#SBATCH --mail-type=ALL         # Receive email notifications for job status (ALL = start, end, fail)
#SBATCH --mail-user=utln@tufts.edu  # Your email to receive notifications

# Load the STAR module
module load star/2.7.11b         # Load the specific version of STAR needed for the alignment

mkdir star_output

# Run STAR alignment
STAR --genomeDir ./reference_data/ \       # Specify the directory containing the genome index files
     --readFilesIn ./raw_fastq/Irrel_kd_1.subset.fq \  # Input FASTQ file(s) for alignment
     --outFileNamePrefix ./star_output/ \  # Output files will be saved in this directory 
     --runThreadN 8                        # Use 8 threads for faster processing

Here is the command to submit job

chmod +x run_star.sh    # Makes the script executable
sbatch run_star.sh      # Submits the script to the SLURM queue

Use squeue -u yourusername to check job status.

Run job with GPU node

Interactive session

srun -p preempt -n 1 --time=04:00:00 --mem=20G --gres=gpu:1 --pty /bin/bash

You can also specify which gpu node you would like to run jobs on

srun -p preempt -n 1 --time=04:00:00 --mem=20G --gres=gpu:a100:1 --pty /bin/bash

Submit jobs to queue

Example script: align.sh using parabricks to do the alignment.

#!/bin/bash
#SBATCH -J fq2bam_alignment          # Job name
#SBATCH -p preempt                   # Submit to the 'preempt' partition (modify based on your cluster setup)
#SBATCH --gres=gpu:1                 # Request 1 GPU for accelerated processing
#SBATCH -n 2                         # Number of tasks (2 in this case)
#SBATCH --mem=60g                    # Memory allocation (60GB)
#SBATCH --time=02:00:00              # Maximum job run time (2 hours)
#SBATCH --cpus-per-task=20           # Number of CPU cores allocated per task
#SBATCH --output=alignment.%j.out    # Standard output file (with job ID %j)
#SBATCH --error=alignment.%j.err     # Standard error file (with job ID %j)
#SBATCH --mail-type=ALL              # Email notifications for all job states (begin, end, fail)
#SBATCH --mail-user=utln@tufts.edu   # Email address for notifications

# Load necessary modules
nvidia-smi                              # Show GPU information (optional for logging)
module load parabricks/4.0.0-1          # Load Parabricks module for GPU-accelerated alignment

# Define variables
genome_reference="/path/to/reference_genome"      # Path to the reference genome (.fasta)
input_fastq1="/path/to/input_read1.fastq"         # Path to the first paired-end FASTQ file
input_fastq2="/path/to/input_read2.fastq"         # Path to the second paired-end FASTQ file
sample_name="sample_identifier"                  # Sample identifier
known_sites_vcf="/path/to/known_sites.vcf"        # Known sites VCF file for BQSR (optional, if available)
output_directory="/path/to/output_directory"      # Directory for the output BAM file and reports
output_bam="${output_directory}/${sample_name}.bam"            # Output BAM file path
output_bqsr_report="${output_directory}/${sample_name}.BQSR-report.txt"  # Output BQSR report path

# Run the Parabricks fq2bam alignment pipeline
pbrun fq2bam \
    --ref ${genome_reference} \                # Reference genome (.fasta)
    --in-fq ${input_fastq1} ${input_fastq2} \  # Input paired-end FASTQ files
    --read-group-sm ${sample_name} \           # Sample name for read group
    --knownSites ${known_sites_vcf} \          # Known sites for BQSR 
    --out-bam ${output_bam} \                  # Output BAM file
    --out-recal-file ${output_bqsr_report}     # Output Base Quality Score Recalibration (BQSR) report

Here is the command to submit job

chmod +x align.sh    # Makes the script executable
sbatch align.sh      # Submits the script to the SLURM queue

Use squeue -u yourusername to check job status.

Useful links:

https://hbctraining.github.io/Intro-to-shell-flipped/lessons/03_working_with_files.html

https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html