
Slurm Job Arrays

Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily, saving both time and computational resources.

Use cases

  • I have 1000 samples and they all need to run the same workflow.
  • I need to run a simulation 1000 times with a different set of parameters.

Why not use serial jobs?

A common approach is to loop over the inputs in bash and process them one at a time, but this is not efficient for large numbers of tasks. For example:

for fq in *.fastq.gz; do
  fastqc -t 4 "$fq"
done


A bash loop like this works, but the tasks run one after another, so the total runtime grows with the number of files. A Slurm job array lets the scheduler run the tasks in parallel instead.

Slurm arrays

Basic Syntax

Job arrays are only supported for batch jobs. The array index values are specified using the --array (or -a) option of the sbatch command, or with an #SBATCH directive inside the job script.

--array=<indices>

You can specify the array indices in different ways:

  • --array=0-100: Runs jobs with indices from 0 to 100.

  • --array=2,4,6,8,10: Runs jobs with specific indices (2, 4, 6, 8, and 10).

  • --array=2-1000:2: Runs jobs with a step size; here, every 2nd index from 2 to 1000.

  • --array=1-16%2: Uses the % character to limit how many array tasks may run at once; here, 16 jobs are created but only two run at a time. See the submission examples below.
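For example (the script name fastqc_array.sh is just a placeholder), the indices can be attached on the command line or inside the job script:

# On the command line:
sbatch --array=1-16%2 fastqc_array.sh

# Equivalently, inside the job script:
#SBATCH --array=1-16%2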

Job ID and Environment Variables

SLURM_ARRAY_JOB_ID

  • This environment variable represents the job ID of the entire job array.

  • It is the same for all tasks within that job array.

  • If you submit a job array with 10 tasks, each of those tasks will have the same SLURM_ARRAY_JOB_ID.

Example:

If you submit a job array with sbatch --array=1-10 script.sh, and the job array is assigned the job ID 12345, then

  • SLURM_ARRAY_JOB_ID for all tasks will be 12345.

SLURM_ARRAY_TASK_ID

  • This environment variable represents the unique identifier of each task within the job array.

  • It differentiates each task in the array and usually corresponds to the index you specified when submitting the job array.

  • This is the variable you use to handle task-specific operations within the script.

Example:

If you submit a job array with sbatch --array=1-10 script.sh, and the job array is assigned the job ID 12345, then:

  • Task 1 will have SLURM_ARRAY_TASK_ID=1.

  • Task 2 will have SLURM_ARRAY_TASK_ID=2.

  • And so on, up to SLURM_ARRAY_TASK_ID=10 for the last task.

In a simple case, you can directly use the $SLURM_ARRAY_TASK_ID variable in your script to set up your job array.

For instance, if you have a fasta file for each sample like: sample1.fa, sample2.fa, sample3.fa ... sample10.fa, and you want each of the 10 Slurm array tasks to handle a separate sample file, you can replace the line specifying the sample filename with sample${SLURM_ARRAY_TASK_ID}.fa.

This means that for array task 1, the script will run sample1.fa, for array task 2 it will run sample2.fa, and so on.
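When filenames do not contain a convenient number, a common pattern is to map the task ID onto a sorted list of files. A minimal sketch (the filenames and array range are assumptions):

#!/bin/bash
#SBATCH --array=1-10

# Collect the input files into a bash array (sorted by name)
FILES=(sample*.fa)

# Task IDs start at 1, bash array indices at 0, so subtract 1
FILE=${FILES[$((SLURM_ARRAY_TASK_ID - 1))]}

echo "Task ${SLURM_ARRAY_TASK_ID} handles ${FILE}"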

Monitor and cancel jobs

After submission, squeue --me shows one row per running array task, labeled <job_id>_<task_id>. You can cancel a particular array task using the respective JOBID in the first column, e.g. scancel 7456478_2, or cancel all tasks in the array job by specifying just the main job ID, e.g. scancel 7456478.

[yzhang85@login-prod-01 array]$ sbatch fastqc_array.sh
Submitted batch job 7456478
[yzhang85@login-prod-01 array]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         7456478_1   preempt   fastqc yzhang85  R       1:30      1 s1cmp004
         7456478_2   preempt   fastqc yzhang85  R       1:30      1 s1cmp004
         7456478_3   preempt   fastqc yzhang85  R       1:30      1 s1cmp004
         7456478_4   preempt   fastqc yzhang85  R       1:30      1 s1cmp004
         7456478_5   preempt   fastqc yzhang85  R       1:30      1 s1cmp004
         7456478_6   preempt   fastqc yzhang85  R       1:30      1 s1cmp004
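Once tasks finish they no longer appear in squeue, but sacct can still report them by the array job ID (using the job ID from the example above):

sacct -j 7456478 --format=JobID,State,Elapsed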

Example job scripts

In the following example, I have many fastq.gz files in the folder fastq. I need to run fastqc to check the quality of each of these fastq.gz files.

$ ls -1 fastq/*.gz
fastq/SRX1693951_1.fastq.gz
fastq/SRX1693951_2.fastq.gz
fastq/SRX1693952_1.fastq.gz
fastq/SRX1693952_2.fastq.gz
fastq/SRX1693953_1.fastq.gz
fastq/SRX1693953_2.fastq.gz
fastq/SRX1693954_1.fastq.gz
fastq/SRX1693954_2.fastq.gz
fastq/SRX1693955_1.fastq.gz
fastq/SRX1693955_2.fastq.gz
fastq/SRX1693956_1.fastq.gz
fastq/SRX1693956_2.fastq.gz

For each sample (a pair of _1 and _2 fastq.gz files), I want to submit a separate Slurm job to our cluster. This can be achieved with a Slurm job array.

Because the sample accessions end in the digits 1 through 6, we can use fastq/SRX169395${SLURM_ARRAY_TASK_ID}_1.fastq.gz and fastq/SRX169395${SLURM_ARRAY_TASK_ID}_2.fastq.gz to represent the pairs of fastq.gz files, with --array=1-6.

#!/bin/bash
#SBATCH -p preempt  # batch, gpu, preempt, mpi or your group's own partition
#SBATCH -t 1:00:00  # Runtime limit (D-HH:MM:SS)
#SBATCH -N 1   # Number of nodes
#SBATCH -n 1   # Number of tasks per node
#SBATCH -c 4   # Number of CPU cores per task
#SBATCH --mem=8G       # Memory required per node
#SBATCH --array=1-6     # An array of 6 jobs
#SBATCH --job-name=fastqc      # Job name
#SBATCH --mail-type=FAIL,BEGIN,END     # Send an email when job fails, begins, and finishes
#SBATCH --mail-user=yzhang85@tufts.edu       # Email address for notifications
#SBATCH --error=%x-%A_%a.err   # Standard error file: <job_name>-<array_job_id>_<task_id>.err
#SBATCH --output=%x-%A_%a.out  # Standard output file: <job_name>-<array_job_id>_<task_id>.out

echo "SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID

module load fastqc/0.12.1
mkdir -p fastqcOut   # FastQC requires the output directory to exist
fastqc -t 4 fastq/SRX169395${SLURM_ARRAY_TASK_ID}_1.fastq.gz fastq/SRX169395${SLURM_ARRAY_TASK_ID}_2.fastq.gz -o fastqcOut

Output logs

[yzhang85@login-prod-01 array]$ ls -hl
total 13K
drwxrws--- 2 yzhang85 workshop 4.0K Aug 30 11:51 fastq/
drwxrws--- 2 yzhang85 workshop 4.0K Aug 30 11:39 fastqcOut/
-rw-rw---- 1 yzhang85 workshop 1.2K Aug 30 11:54 fastqc-7456478_1.err
-rw-rw---- 1 yzhang85 workshop  110 Aug 30 11:52 fastqc-7456478_1.out
-rw-rw---- 1 yzhang85 workshop 1.1K Aug 30 11:54 fastqc-7456478_2.err
-rw-rw---- 1 yzhang85 workshop  110 Aug 30 11:52 fastqc-7456478_2.out
-rw-rw---- 1 yzhang85 workshop 1.1K Aug 30 11:54 fastqc-7456478_3.err
-rw-rw---- 1 yzhang85 workshop  110 Aug 30 11:52 fastqc-7456478_3.out
-rw-rw---- 1 yzhang85 workshop 1.1K Aug 30 11:54 fastqc-7456478_4.err
-rw-rw---- 1 yzhang85 workshop  110 Aug 30 11:52 fastqc-7456478_4.out
-rw-rw---- 1 yzhang85 workshop 1.1K Aug 30 11:54 fastqc-7456478_5.err
-rw-rw---- 1 yzhang85 workshop  110 Aug 30 11:52 fastqc-7456478_5.out
-rw-rw---- 1 yzhang85 workshop 1.1K Aug 30 11:54 fastqc-7456478_6.err
-rw-rw---- 1 yzhang85 workshop  110 Aug 30 11:52 fastqc-7456478_6.out
-rw-rw---- 1 yzhang85 workshop  918 Aug 30 11:48 fastqc_array.sh

Limits

Submitting too many jobs

MaxArraySize

To query MaxArraySize, you can use:

$ scontrol show config | grep -i array
MaxArraySize            = 2000

Array index values must be smaller than MaxArraySize, so with MaxArraySize = 2000 the largest usable index is 1999.
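If you have more inputs than MaxArraySize allows, one workaround is to let each array task handle a chunk of files instead of a single file. A minimal sketch, assuming a fastq/ directory with many files and a chunk size of 10:

#!/bin/bash
#SBATCH --array=0-99

CHUNK=10                   # number of files per task (assumption)
FILES=(fastq/*.fastq.gz)   # all input files, sorted by name

# Each task processes its own slice of the file list
START=$(( SLURM_ARRAY_TASK_ID * CHUNK ))
for f in "${FILES[@]:${START}:${CHUNK}}"; do
  fastqc -t 4 "$f"
done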

In addition, the following resource limits apply:

Public Partitions (batch+mpi+largemem+gpu)
  • CPU: 1000 cores
  • RAM: 4000 GB
  • GPU: 10

Preempt Partition (preempt)
  • CPU: 2000 cores
  • RAM: 8000 GB
  • GPU: 20



Array jobs with R script

Required files

  1. Parameter File: A file containing the parameters that your array job will iterate through. This file could include different variables or data that each array task will process individually.
  2. Script (R, Shell, Python, etc.): The main script that will perform the analysis or visualization tasks. While the example here is in R, the same structure applies to other languages like shell, Python, or Perl. Adapt the script according to the specific tool or language you are using for the job.
  3. Wrapper Shell Script: A simple shell script that sends your jobs to the SLURM scheduler. This script makes it easy to run multiple tasks automatically, with each task using different parameters from the parameter file.



R Script Example

Here is an example of an R script that generates scatter plots of gene expression based on raw RNA-seq count data:

# Load libraries
library(tidyverse)
library(ggrepel)

# Read in parameters
args <- commandArgs(trailingOnly = TRUE)
gene <- as.character(args[1])
padj <- as.numeric(args[2])

# Subset the gene of interest
dt <- read.table("salmon.merged.gene_counts.tsv", header=T)
d <- dt[match(gene, dt$gene_name),]
d <- gather(d, key = "condition", value = "expression", GFPkd_1:PRMT5kd_3)

# Reformat for ggplot
d_long <- separate(d, col = "condition", into = c("treatment", "replicate"), sep = "_")

# Ggplot to visualize
p <- ggplot(d_long, aes(treatment, expression)) +
        geom_point(size=5, color="steelblue", alpha=0.5) +
        geom_label_repel(aes(label=replicate)) +
        theme_classic() +
        xlab("Treatment") +
        ylab("Gene expression") +
        ggtitle(paste0(gene,": padj ", padj))

# Save plot to a pdf file
ggsave(plot=p, file=paste0(gene, ".pdf"), width=4, height=4)

Script Purpose

This R script creates scatter plots for gene expression levels between control and treated groups from an RNA-seq analysis. It reads in two parameters from the command line: the gene name (genename) and the adjusted p-value (padj). The input data file is salmon.merged.gene_counts.tsv.
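Run standalone, the script takes the two values as positional arguments, e.g. with a gene from the parameter file shown below:

Rscript R_scatter_vis.r MAP2 2.054292e-301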

Example Parameter File

Here’s an example of the parameter file (table.tsv) used in the job array. Each row contains gene expression information, and the wrapper script will extract specific columns for each task.

gene_id baseMean        log2FoldChange  lfcSE   pvalue  padj    genename
ENSG00000078018 1126.709        -2.161184       0.05810824      1.578201e-304   2.054292e-301   MAP2
ENSG00000004799 1224.003        -2.199776       0.06003955      1.17799e-295    1.4154e-292     PDK4
ENSG00000272398 2064.696        1.615232        0.04513554      2.024618e-282   2.258895e-279   CD24
ENSG00000135046 12905.46        -0.8779134      0.02467955      1.349814e-278   1.405606e-275   ANXA1

The wrapper script reads genename from column 7 and padj from column 6 of each row and passes them to the R script.
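For instance, row 2 of table.tsv (the first data row after the header) yields:

$ awk 'NR==2 {print $7, $6}' table.tsv
MAP2 2.054292e-301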

Shell Wrapper Script

The following shell script submits the jobs to the SLURM scheduler as an array of tasks. Each task processes a different gene from the parameter file.

#!/bin/bash
#SBATCH -p preempt  # batch, gpu, preempt, mpi or your group's partition
#SBATCH -t 1:00:00  # Runtime limit (D-HH:MM:SS)
#SBATCH -N 1        # Number of nodes
#SBATCH -n 1        # Number of tasks per node
#SBATCH -c 4        # Number of CPU cores per task
#SBATCH --mem=2G    # Memory required per node
#SBATCH --array=2-11 # An array of 10 jobs
#SBATCH --job-name=Rplot
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --mail-user=xue.li37@tufts.edu
#SBATCH --error=%x-%A_%a.err   # Standard error file: <job_name>-<job_id>-<taskid>.err
#SBATCH --output=%x-%A_%a.out  # Standard output file: <job_name>-<job_id>-<taskid>.out

echo "SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID

module load R/4.4.0
GENE=$(awk "NR==${SLURM_ARRAY_TASK_ID} {print \$7}" table.tsv)   # genename (column 7)
Padj=$(awk "NR==${SLURM_ARRAY_TASK_ID} {print \$6}" table.tsv)   # padj (column 6)

echo $GENE $Padj
Rscript R_scatter_vis.r $GENE $Padj  

Script Details

  • SBATCH --array=2-11 tells SLURM to run tasks for rows 2 to 11 of the parameter file (row 1 is the header, so the indices start at 2).
  • The awk commands extract the GENE and Padj values from the specified row and columns (7th and 6th).
  • The script submits 10 jobs, each running the R script with different GENE and Padj values.

Customizing the Array

You can adjust the --array option to change the range of jobs. For example, to run jobs for every other line from 2 to 1000, you can specify:

#SBATCH --array=2-1000:2

This would submit jobs for rows 2, 4, 6, ..., up to 1000.
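A step size can also be combined with a concurrency limit using %; for example (assuming the same wrapper script):

#SBATCH --array=2-1000:2%50   # every other row, at most 50 tasks running at once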