Approximate time: 20 minutes
Goals
- Connect to the HPC cluster via On Demand Interface
- Download data
Log into the HPC cluster’s On Demand interface
- Open a Chrome browser and enter the URL https://ondemand.cluster.tufts.edu
- Log in with your Tufts Credentials
- On the top menu bar choose
Clusters->Tufts HPC Shell Access
- Type your password at the prompt (the password will be hidden for security purposes):
tutln01@login.cluster.tufts.edu's password:
- You’ll see a welcome message and a bash prompt, for example for user
tutln01
:
[tutln01@login001 ~]$
This indicates you are logged in to the login node of the cluster.
- Type
clear
to clear the screen
Set up for the analysis
Find 500M storage space
- Check how much available storage you have in your home directory by typing
showquota
.
Result:
Home Directory Quota
Disk quotas for user tutln01 (uid 31394):
Filesystem blocks quota limit grace files quota limit grace
hpcstore03:/hpc_home/home
1222M 5120M 5120M 2161 4295m 4295m
Listing quotas for all groups you are a member of
Group: facstaff Usage: 16819478240KB Quota: 214748364800KB Percent Used: 7.00%
Under blocks
you will see the amount of storage you are using, and under quota you see your quota.
Here, the user has used 1222M of the available 5120M and has enough space for our analysis.
- If you do not have 500M available, you may have space in a project directory for your lab.
These are located in
/cluster/tufts
with names like/cluster/tufts/labname/username/
. If you don’t know whether you have project space, please email tts-research@tufts.edu.
Download the data
- Get an interaction session on a compute node (3 hours, 16 Gb memory, 4 cpu on 1 node) on the default partition (
batch
) by typing:
srun --pty -t 3:00:00 --mem 16G -N 1 --cpus 4 bash
Notes:
If wait times are very long, you can try a different partitions by adding, e.g. -p preempt
or -p interactive
before bash
.
If you go through this workshop in multiple steps, you will have to rerun this step each time you log in.
- Change to your home directory
cd
Or, if you are using a project directory:
cd /cluster/tufts/labname/username/
- Copy the course directory and all files in the directory (-R is for recursive):
cp -R /cluster/tufts/bio/tools/training/intro-to-ngs/ .
(Also available via: git clone https://gitlab.tufts.edu/rbator01/intro-to-ngs.git
)
- Take a look at the contents using the
tree
command:
tree intro-to-ngs
You’ll see a list of all files
intro-to-ngs
├── all_commands.sh <-- Bash script with all commands
├── raw_data <-- Folder with paired end fastq files
│ ├── na12878_1.fq
│ └── na12878_2.fq
├── README.md <-- Contents description
└── ref_data <-- Folder with reference sequence
└── chr10.fa
2 directories, 5 files
Data for the class
Genome In a Bottle (GIAB) was initiated in 2011 by the National Institute of Standards and Technology “to develop the technical infrastructure (reference standards, reference methods, and reference data) to enable translation of whole human genome sequencing to clinical practice” (Zook et al 2012). We’ll be using a DNA Whole Exome Sequencing (WES) dataset released by GIAB for the purposes of benchmarking bioinformatics tools.
The source DNA, known as NA12878, was taken from a single person: the daughter in a father-mother-child ‘trio’. She is also mother to 11 children of her own, for whom sequence data is also available. (HBC Training). Father-mother-child ‘trios’ are often sequenced to study genetic links between family members.
As mentioned in the introduction, WES is a method to concentrate the sequenced DNA fragments in coding regions (exons) of the genome.
For this class, we’ve created a small dataset of reads that align to a single gene that will allow our commands to finish quickly.
Sample: NA12878
Gene: Cyp2c19 on chromosome 10
Sequencing: Illumina, Paired End, Exome