Project Organization and Data Lifecycle

When setting up a project on the Tufts HPC it is worth considering what kind of resources you need and the data’s lifecyle:

Questions To Ask

When considering compute resources make sure you have an idea of the following:

Question	Why it’s Useful
What is the approximate start date and end date for usage of the data?	This can be helpful when it comes to how TTS plans future storage needs
What is the estimated computation needed?	Knowing how long your jobs might take is incredibly valuable for planning when you’ll run them. If you know the job takes 2 days to run, you’ll be able to plan accordingly.
Are you working with multiple files simultaneously?	Each directory comes with some limit. You can ask for more storage but often storage is stretched thin as it is. If you can stagger when large files are loaded you can avoid hitting your storage maximum.
Can you remove redundant files?	Life sciences data is rife with redundant files/logs. Consider if there is anything that can be deleted or even downloaded again if it is a publically available file.
What software/tools will you use to process, analyze, or view the data?	The Tufts HPC clustser has an incredible array of tools to help you do research. However, it doesn’t have all of them and it might help to check before you start your project.

Project Organization

It might seem insignificant but consider creating an organized folder structure. As a project grows it can be difficult to find relavent data even when the folder structure is neat. Consider the following two projects and how easy it might be to find the files you’re looking for:

Project 1

[tutln01@c1cmp047 project1]$ ls
15338084.err  15338085.err  15338086.err  15338087.err  data         sraAccList.txt        SRR19145573.fastq.gz  STAR1       starLoop.sh
15338084.out  15338085.out  15338086.out  15338087.out  download.sh  SRR19145569.fastq.gz  STAR                  starAll.sh  starOne.sh

Project 2

[tutln01@c1cmp047 project1]$ ls
input_data trimmed_data bam_files feature_counts DEG_results

Main Page

Project Organization and Data Lifecycle

Tools for Life Science

The Basics

NGS Analysis

Metagenomics Analysis

Protein Structure Analysis

Galaxy Tutorials

Project Organization and Data Lifecycle

Questions To Ask

Project Organization

Project 1

Project 2