, a · 2020. 2. 11. · compress the sequence alignment map file (sam to bam) sorting the bam file...

Nagarajan Kathiresan, Ph.D., Computational Scientist, KAUST Supercomputing Lab, nagarajan.kathiresan@kaust.edu.sa

Agenda

• UNIX tools for Bioinformatics
• A simple job script in a command line!
• What is a workflow? How can I build it?
• How to address the job dependencies?
• When to use job arrays?
• Optimization in workflow design.

Note: Some bioinformatics tools, such as BWA (Burrows-Wheeler Aligner), Samtools, and Picard/GATK, are used in the explanations.

UNIX tools for Bioinformatics

Data transfer and search for pattern

Move data between two systems

The rsync utility is very useful for synchronizing files and directories between two different servers.

• Copying from the local machine to a remote machine:
rsync local_directory remote_server_name:remote_directory

• Copying from a remote machine to the local machine:
rsync remote_server_name:remote_directory local_directory

-a archive mode

-r recursive over subdirectories

-v verbose

-x don't cross filesystem boundaries

-H preserve hard links

-P show progress

-n no-op, or dry-run

$ rsync -arvxHP my_data [email protected]:/ibex/scratch/kathirn/work/my_data/

Search for pattern

• grep, egrep, fgrep
• wc
• | (pipe character)
• cut
• awk
• sort
• uniq
• …

Working with genome files

Fasta

Indexed Fasta

Compressed Fastq

Compressed VCF

BAM

SAM

Sorted BAM

GTF

Working with fasta file

$ more Aegilops_tauschii.Aet_v4.0.ncrna.fa

Extract the headers from the FASTA file

grep, egrep, fgrep → print lines matching a pattern
-i, --ignore-case → ignore case
-v, --invert-match → invert the match: print the lines that do not match the pattern
-w, --word-regexp → print only the lines where the pattern matches a whole word
-o, --only-matching → print only the matching part of each line

egrep = grep -E (--extended-regexp)

fgrep = grep -F (--fixed-strings)
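The options above can be tried on a tiny FASTA file. This is a minimal sketch: the file demo.fa and its two records are made up for illustration and are not course data.

```shell
#!/bin/bash
# Hypothetical two-record FASTA file, created only for this demonstration.
workdir=$(mktemp -d)
cat > "$workdir/demo.fa" <<'EOF'
>seq1 gene_biotype:snRNA
ACGTACGT
>seq2 gene_biotype:tRNA
GGCCTTAA
EOF

# Header lines only (lines that contain ">").
headers=$(grep ">" "$workdir/demo.fa")

# -v inverts the match: everything except the header lines.
sequences=$(grep -v ">" "$workdir/demo.fa")

# -i ignores case, -o prints only the matching part of each line.
biotypes=$(grep -io "GENE_BIOTYPE:[[:alpha:]]*" "$workdir/demo.fa")

# -w matches whole words only: "seq" alone never appears as a word, "seq1" does.
# "|| true" keeps the script going when grep finds nothing (exit status 1).
whole_word=$(grep -cw "seq" "$workdir/demo.fa" || true)
exact_word=$(grep -cw "seq1" "$workdir/demo.fa" || true)

echo "$headers"
echo "$sequences"
echo "$biotypes"
echo "seq: $whole_word line(s), seq1: $exact_word line(s)"
rm -r "$workdir"
```

Using a character class such as [[:alpha:]] instead of a-z is a defensive choice that avoids locale-dependent ranges.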

Word count

wc → count the number of lines, words and bytes in a given file

$ wc Aegilops_tauschii.Aet_v4.0.ncrna.fa
13525 48871 1247270 Aegilops_tauschii.Aet_v4.0.ncrna.fa

$ wc -l Aegilops_tauschii.Aet_v4.0.ncrna.fa
13525 Aegilops_tauschii.Aet_v4.0.ncrna.fa

$ wc -c Aegilops_tauschii.Aet_v4.0.ncrna.fa
1247270 Aegilops_tauschii.Aet_v4.0.ncrna.fa

$ wc -w Aegilops_tauschii.Aet_v4.0.ncrna.fa
48871 Aegilops_tauschii.Aet_v4.0.ncrna.fa

Question: How do I count the number of sequences in the above FASTA file?

Answer:
$ grep -c ">" Aegilops_tauschii.Aet_v4.0.ncrna.fa
3732

Why?
• Counting the header lines (">") is the appropriate way: each sequence has exactly one header.
• A single sequence identification line can be followed by many sequence lines, so the line count (13525) overestimates the number of sequences.
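The difference between counting lines and counting headers can be checked on a small file. This is a sketch on a made-up multi-line FASTA file; anchoring the pattern with ^ is an extra precaution that is not on the slides.

```shell
#!/bin/bash
# Hypothetical FASTA file with two records; the first one spans three sequence lines.
workdir=$(mktemp -d)
cat > "$workdir/multi.fa" <<'EOF'
>seqA description one
ACGTACGTAC
GTACGTACGT
ACG
>seqB description two
TTGACC
EOF

# All lines, headers and sequence lines together.
n_lines=$(wc -l < "$workdir/multi.fa")
n_lines=$((n_lines))    # drop any padding that wc may add

# Only lines that start with ">": one per sequence.
n_seqs=$(grep -c "^>" "$workdir/multi.fa")

echo "lines: $n_lines, sequences: $n_seqs"
rm -r "$workdir"
```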

Sequence identification:
>ENSRNA050031380-T1 ncrna chromosome:Aet_v4.0:2D:126982204:126982306:-1 gene:ENSRNA050031380 gene_biotype:snRNA transcript_biotype:snRNA gene_symbol:U6 description:U6 spliceosomal RNA

Sequence:
ACTATATAAAAAACTTCCAATTTTAGTGGAACTATACAGAGAAGATTAGCATGGCCCCGACGCAAGGATGACACACACGAATTGAGAAATGATCCAAATTTTT

Combining the commands

| → pipe character: sends the output of one command to the input of the next

Example

grep option -io: ignore-case and only-matching
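A pipe of this kind might look as follows. This is a sketch on made-up headers; the tag name gene_biotype is taken from the FASTA header shown earlier, everything else is illustrative.

```shell
#!/bin/bash
# Hypothetical FASTA file with three records.
workdir=$(mktemp -d)
cat > "$workdir/headers.fa" <<'EOF'
>t1 gene_biotype:snRNA
ACGT
>t2 gene_biotype:tRNA
GGCC
>t3 gene_biotype:snRNA
TTAA
EOF

# 1) keep the header lines, 2) extract only the biotype tag, 3) count the tags.
n_tags=$(grep ">" "$workdir/headers.fa" | grep -io "gene_biotype:[[:alnum:]]*" | wc -l)
n_tags=$((n_tags))    # drop any padding that wc may add

echo "biotype tags found: $n_tags"
rm -r "$workdir"
```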

Useful data processing tools!

cut → extracts selected columns (fields) from a file
Use: cut -f field_list file_name
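As an illustration of cut, the sketch below selects columns from a small tab-separated table. The variant rows are made up; only the first five VCF-style columns are used.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Tab-separated columns: CHROM POS ID REF ALT (made-up rows).
printf '20\t14370\trs6054257\tG\tA\n20\t17330\t.\tT\tA\nX\t1110696\trs6040355\tA\tG\n' > "$workdir/demo.tsv"

# -f selects fields; the default delimiter is a tab.
chrom_pos=$(cut -f 1,2 "$workdir/demo.tsv")

# -d sets another delimiter, here a comma.
second=$(echo "a,b,c" | cut -d "," -f 2)

echo "$chrom_pos"
echo "$second"
rm -r "$workdir"
```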

Useful tableview!

https://github.com/informationsea/tableview/releases/download/v0.4.6/tableview_linux_amd64

$ cat sample.vcf | grep -v "#" | tableview_linux_amd64

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

$ cat sample.vcf | grep -v "#"

grep option -v: invert-match


Cont. … cut command!

Uniq and sort

AWK → scans each line and performs some actions: awk '{action1} …'

AWK

Combine commands: awk + pipe + uniq + sort …
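A combination of awk, sort and uniq could look like the sketch below, which counts rows per chromosome in a made-up tab-separated table (the data and the output format "chromosome:count" are illustrative choices).

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Made-up table: chromosome, position, reference allele, alternate allele.
printf '1\t100\tA\tG\n2\t200\tC\tT\n1\t300\tG\tA\n1\t400\tT\tC\n2\t500\tA\tC\n' > "$workdir/variants.tsv"

# awk prints column 1; sort groups identical values; uniq -c counts each group;
# sort -k1,1nr orders by descending count; the last awk reshapes to "chrom:count".
counts=$(awk '{print $1}' "$workdir/variants.tsv" | sort | uniq -c | sort -k1,1nr | awk '{print $2 ":" $1}')

echo "$counts"
rm -r "$workdir"
```

Note that uniq only collapses adjacent duplicates, which is why sort comes first.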

A simple job script

Minimum 3 parameters:

1. sbatch: submit a batch script to Slurm
2. time: set a limit on the total run time of the job allocation
   --time=days-hours:minutes:seconds
   -t days-hours:minutes:seconds
   (Slurm also accepts shorter forms such as minutes:seconds and hours:minutes:seconds.)
3. wrap: wrap the specified command string in a simple "sh" shell script and submit it to the Slurm controller

Example:
$ sbatch --time=00:10 --wrap="hostname"

Output: slurm-9024853.out

$ cat slurm-9024853.out
cn603-28-l

Caution note (default Slurm allocation):

• Memory = 2 GB
• CPU = 1 core
• Node = 1 node

Job script (batch jobs)

$ cat my_job.sh
#!/bin/bash
#SBATCH --time=00:10
hostname

$ sbatch ./my_job.sh
Submitted batch job 7438

$ cat slurm-7438.out
cn512-05-r

Workflow - example

Step #1: Genome mapping/alignment (BWA)
Step #2: Compress the Sequence Alignment Map file, SAM to BAM (Samtools)
Step #3: Sorting the BAM file (Samtools)
Step #4: Index for BAM files (Samtools)
Step #5: Chromosome interval for research interest (Samtools)
Step #6: Mark or remove the duplicates (GATK/Picard)

Step #1: Genome alignment

Burrows-Wheeler Aligner:
• bwa index ref.fa
• bwa mem ref.fa reads.fq > aln-se.sam
• bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
• bwa aln ref.fa short_read.fq > aln_sa.sai
• bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam
• bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam
• bwa bwasw ref.fa long_read.fq > aln.sam

http://bio-bwa.sourceforge.net/bwa.shtml

Prerequisites:
1. Genome reference file (GRCh37, HG19 …)
2. Genome reference index files
3. Sample data (single or paired-end)

Caution note:
• By default, BWA runs sequentially (1 core).
• It supports multi-threading for parallelization through the option -t (number of threads).
• The option -T is different: it sets the minimum alignment score for output.

Reference available: /ibex/reference/KSL/


Example: BWA MEM

bwa mem ref.fa read1.fq read2.fq > aln-pe.sam

Step #1: Check the availability of the software
$ module av bwa
------------- /sw/csi/modulefiles/applications -----------------------
bwa/0.7.17/gnu-6.4.0 bwakit/0.7.15/binary-0.7.15

Step #2: Load the software module
$ module load bwa/0.7.17/gnu-6.4.0

Step #3: Prepare a job submission script
Command:
$ bwa mem /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam

SLURM script:
$ sbatch --time=00:10 --wrap="bwa mem /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam"

Caution note on resource allocation: with the defaults, this job gets only
• 2 GB memory
• 1 core

Cont. … (Optimized script)

SLURM script:
$ sbatch \
    --time=2:00:00 \
    --mem=100GB \
    --cpus-per-task=16 \
    --wrap="bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam"

Submitted batch job 9028055

$ cat slurm-9028055.out

[M::bwa_idx_load_from_disk] read 0 ALT contigs

[M::process] read 1600000 sequences (160000000 bp)...

[M::process] read 1600000 sequences (160000000 bp)...

Batch job script

$ cat BWA_MEM_batch.sh

#!/bin/bash

#SBATCH --time=2:00:00

#SBATCH --mem=100GB

#SBATCH --cpus-per-task=16

## Software

module load bwa/0.7.17/gnu-6.4.0

## Command

bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam
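The batch script above states the core count twice (--cpus-per-task=16 and -t 16). One way to keep the two in sync is to read SLURM_CPUS_PER_TASK, which Slurm sets inside a job when --cpus-per-task is given. The sketch below only prints the command; the fallback to 1 thread and the variable names THREADS, REF and BWA_CMD are assumptions made for illustration.

```shell
#!/bin/bash
# SLURM_CPUS_PER_TASK is set by Slurm inside a job that requested --cpus-per-task.
# Outside a job (or without that option) fall back to 1 thread: an assumption.
THREADS=${SLURM_CPUS_PER_TASK:-1}
REF=/ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta

# Build the command as a string and print it; inside a real job script the
# bwa line would be executed directly instead of echoed.
BWA_CMD="bwa mem -t ${THREADS} ${REF} SRR396636.sra_1.fastq SRR396636.sra_2.fastq"
echo "${BWA_CMD} > SRR396636.sam"
```

With this pattern, changing #SBATCH --cpus-per-task is enough to change the number of BWA threads.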

Job submitted using sbatch
$ sbatch ./BWA_MEM_batch.sh
Submitted batch job 7439

Standard output/error will be written to slurm-JOBID.out
$ cat slurm-7439.out
Loading module for BWA
BWA 0.7.17 is now loaded
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1600000 sequences (160000000 bp)...
[M::process] read 1600000 sequences (160000000 bp)...

How can I run 100+ genome samples?

$ ls -lrta *_001.fastq.gz

-rw-r--r-- 1 kathirn g-kathirn 2125471805 Dec 17 2013 NIST7086_CGTACTAG_L002_R2_001.fastq.gz




-rw-r--r-- 1 kathirn g-kathirn 2001172486 Dec 17 2013 NIST7035_TAAGGCGA_L002_R2_001.fastq.gz




------ DATA PROCESSING ------

Steps for processing more samples (S1, S2, S3, …, Sn)

Take one SAMPLE_NAME at a time and submit the job below; repeat until all samples are done.

$ sbatch \
    --time=2:00:00 \
    --mem=100GB \
    --cpus-per-task=16 \
    --wrap="bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SAMPLE_NAME_1.fastq SAMPLE_NAME_2.fastq > SAMPLE_NAME.sam"

In UNIX script

1. Get the unique list of samples
$ ls *_R1_001.fastq.gz

NIST7035_TAAGGCGA_L001_R1_001.fastq.gz

NIST7086_CGTACTAG_L001_R1_001.fastq.gz

NIST7035_TAAGGCGA_L002_R1_001.fastq.gz

NIST7086_CGTACTAG_L002_R1_001.fastq.gz

2. Parse sample by sample
$ for SAMPLE_NAME in `ls *_R1_001.fastq.gz`;
do
    echo $SAMPLE_NAME;
done

Cont.

3. Get the UNIQUE sample name
$ for SAMPLE_NAME in `ls *_R1_001.fastq.gz`;
do
    echo `basename $SAMPLE_NAME _R1_001.fastq.gz`;
done

Output:

NIST7035_TAAGGCGA_L001

NIST7035_TAAGGCGA_L002

NIST7086_CGTACTAG_L001

NIST7086_CGTACTAG_L002
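The same prefixes can also be derived with shell parameter expansion instead of basename. This is a sketch on empty placeholder files that mimic the naming scheme above; it loops over a glob rather than over the output of ls, which is a stylistic choice and not the slides' method.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Empty placeholder files that mimic the naming scheme shown above.
touch "$workdir/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz" \
      "$workdir/NIST7035_TAAGGCGA_L001_R2_001.fastq.gz" \
      "$workdir/NIST7086_CGTACTAG_L001_R1_001.fastq.gz" \
      "$workdir/NIST7086_CGTACTAG_L001_R2_001.fastq.gz"

prefixes=""
for SAMPLE_NAME in "$workdir"/*_R1_001.fastq.gz; do
    FILE=${SAMPLE_NAME##*/}            # strip the directory part
    PREFIX=${FILE%_R1_001.fastq.gz}    # strip the suffix, as basename does
    prefixes="${prefixes}${PREFIX} "
    echo "$PREFIX"
done
rm -r "$workdir"
```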

Cont.

4. Multiple job submission using a FOR loop
$ module load bwa/0.7.17/gnu-6.4.0

$ for SAMPLE_NAME in `ls *_R1_001.fastq.gz`;
do
    PREFIX=`basename $SAMPLE_NAME _R1_001.fastq.gz`;
    sbatch \
        --time=2:00:00 \
        --mem=100GB \
        --cpus-per-task=16 \
        --wrap="bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz > ${PREFIX}.sam"
done
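Before submitting many jobs at once, it can help to print the sbatch commands first. The following is a dry-run sketch under stated assumptions: the S1/S2/S3 files are empty placeholders, the check for a missing R2 file is an added safeguard, and nothing is submitted because every sbatch line is passed to echo (which also drops the quotes around the --wrap string in the printed output).

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Placeholder inputs: S3 deliberately has no R2 mate.
touch "$workdir/S1_R1_001.fastq.gz" "$workdir/S1_R2_001.fastq.gz" \
      "$workdir/S2_R1_001.fastq.gz" "$workdir/S2_R2_001.fastq.gz" \
      "$workdir/S3_R1_001.fastq.gz"
REF=/ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta

n_jobs=0
for R1 in "$workdir"/*_R1_001.fastq.gz; do
    PREFIX=${R1%_R1_001.fastq.gz}
    R2=${PREFIX}_R2_001.fastq.gz
    # Skip samples whose mate file is missing instead of submitting a broken job.
    if [ ! -f "$R2" ]; then
        echo "missing mate for $R1, skipping" >&2
        continue
    fi
    # Dry run: remove "echo" to actually submit.
    echo sbatch --time=2:00:00 --mem=100GB --cpus-per-task=16 \
        --wrap="bwa mem -t 16 $REF $R1 $R2 > ${PREFIX}.sam"
    n_jobs=$((n_jobs + 1))
done
echo "$n_jobs job(s) would be submitted"
rm -r "$workdir"
```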

In a batch script (as a job array)

#!/bin/bash
#SBATCH --job-name=BWA_MEM
#SBATCH --output=BWA_MEM.%A_%a.out
#SBATCH --error=BWA_MEM.%A_%a.err
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --mem=100GB
#SBATCH --cpus-per-task=16
#SBATCH --array=1-4

## Software
module load bwa/0.7.17/gnu-6.4.0

## My variables
SAMPLE=`ls *_R1_001.fastq.gz | head -n $SLURM_ARRAY_TASK_ID | tail -n 1`
PREFIX=`basename $SAMPLE _R1_001.fastq.gz`

## Job command
bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz > ${PREFIX}.sam

Prerequisite: array size = number of samples
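The array size does not have to be typed by hand. A sketch, assuming placeholder files A, B and C: the range is derived from the number of R1 files and passed on the command line, where --array takes precedence over the #SBATCH --array line in the script. The submission is only printed, and SLURM_ARRAY_TASK_ID is set manually here to imitate what Slurm does inside an array task.

```shell
#!/bin/bash
workdir=$(mktemp -d)
touch "$workdir/A_R1_001.fastq.gz" "$workdir/B_R1_001.fastq.gz" "$workdir/C_R1_001.fastq.gz"

# Number of samples = number of R1 files.
N_SAMPLES=$(ls "$workdir"/*_R1_001.fastq.gz | wc -l)
N_SAMPLES=$((N_SAMPLES))    # drop any padding that wc may add

# Dry run: print the submission command instead of running it.
echo sbatch --array=1-"${N_SAMPLES}" ./bwa_mem_array.sh

# Inside the job, task k picks the k-th file; sed -n 'kp' prints only line k,
# which is equivalent to the head | tail combination used above.
SLURM_ARRAY_TASK_ID=2       # set by Slurm in a real array job; fixed here for the demo
SAMPLE=$(ls "$workdir"/*_R1_001.fastq.gz | sed -n "${SLURM_ARRAY_TASK_ID}p")
PREFIX=$(basename "$SAMPLE" _R1_001.fastq.gz)
echo "task ${SLURM_ARRAY_TASK_ID} -> ${PREFIX}"
rm -r "$workdir"
```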

$ sbatch ./bwa_mem_array.sh
Submitted batch job 7440

$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7440_1 batch BWA_MEM kathirn R 0:53 1 cn509-23-l
7440_2 batch BWA_MEM kathirn R 0:53 1 cn509-23-l
7440_3 batch BWA_MEM kathirn R 0:53 1 cn512-05-r
7440_4 batch BWA_MEM kathirn R 0:53 1 cn512-05-r


$ ls -lrta *.sam
-rw-r--r-- 1 kathirn g-kathirn 2790260736 Feb 9 16:41 NIST7086_CGTACTAG_L002.sam
-rw-r--r-- 1 kathirn g-kathirn 3783000064 Feb 9 16:41 NIST7086_CGTACTAG_L001.sam
-rw-r--r-- 1 kathirn g-kathirn 3024093184 Feb 9 16:41 NIST7035_TAAGGCGA_L002.sam
-rw-r--r-- 1 kathirn g-kathirn 3978428416 Feb 9 16:41 NIST7035_TAAGGCGA_L001.sam

Step 2: SAM to BAM files

Samtools to convert SAM files to BAM

#!/bin/bash

module load samtools/1.8

for SAMPLE_NAME in `ls *.sam`;
do
    PREFIX=`basename $SAMPLE_NAME .sam`;
    sbatch --time=2:00:00 --mem=100GB --cpus-per-task=16 --wrap="samtools view --threads 16 -b -S -h -q 30 ${SAMPLE_NAME} > ${PREFIX}.bam"
done

• SAM files are very large.
• A BAM file is the compressed version of a SAM file.
• It is good practice to use BAM files, and it is safe to delete the SAM files once the BAM files are available.

File sizes:
• 1.8G NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
• 1.9G NIST7035_TAAGGCGA_L001_R2_001.fastq.gz
• 13G NIST7035_TAAGGCGA_L001.sam
• 3.4G NIST7035_TAAGGCGA_L001.bam

Step 3: Convert BAM to sorted BAM

Sort the BAM files using samtools

#!/bin/bash

module load samtools/1.8

for SAMPLE_NAME in `ls *.bam`;
do
    PREFIX=`basename $SAMPLE_NAME .bam`;
    sbatch --time=2:00:00 --mem=100GB --cpus-per-task=16 --wrap="samtools sort --threads 16 -T ${PREFIX} ${SAMPLE_NAME} -o ${PREFIX}.sorted.bam"
done

End-of-Step 3!

What are the files generated?

• SAM files (generated from genome alignment)
• Unsorted BAM files (generated by samtools, as part of data compression)
• Sorted BAM files (generated by samtools)

Do we need all these intermediate files? If not?!

*.fastq.gz → *.sam → *.bam → *.sorted.bam

$ bwa mem -t 16 $REF ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz | samtools view --threads 16 -b -S -h -q 30 - | samtools sort --threads 16 -T $PREFIX - > ${PREFIX}.sorted.bam

3-in-1!?

*.fastq.gz → *.sam → *.bam → *.sorted.bam

#!/bin/bash
#SBATCH --job-name=BWA_MEM
#SBATCH --output=BWA_MEM.%A_%a.out
#SBATCH --error=BWA_MEM.%A_%a.err
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --mem=100GB
#SBATCH --cpus-per-task=16
#SBATCH --array=1-4

# Software
module load bwa/0.7.17/gnu-6.4.0
module load samtools/1.8

# My variables
SAMPLE=`ls *_R1_001.fastq.gz | head -n $SLURM_ARRAY_TASK_ID | tail -n 1`
PREFIX=`basename $SAMPLE _R1_001.fastq.gz`

# Job command
bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz | samtools view --threads 16 -b -S -h -q 30 - | samtools sort --threads 16 - > ${PREFIX}.sorted.bam
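One caveat of a piped 3-in-1 command: by default the exit status of a pipeline is that of its last command, so a failure in the first stage can go unnoticed as long as the last stage exits cleanly. In bash, set -o pipefail changes this. The sketch below uses false and cat as stand-ins for a failing first stage and a healthy last stage; it is an illustration, not part of the original script.

```shell
#!/bin/bash
# Default behaviour: the pipeline status is the status of the last command.
set +o pipefail
false | cat
status_default=$?

# With pipefail, the pipeline fails if any stage fails.
set -o pipefail
false | cat && status_pipefail=0 || status_pipefail=$?
set +o pipefail

echo "default: $status_default, pipefail: $status_pipefail"
```

Placing set -o pipefail near the top of a job script lets Slurm report the job as failed when any stage of the pipe fails, which matters once later steps depend on afterok.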

Step 4: Index the BAM files

Index the BAM files using samtools

#!/bin/bash

module load samtools/1.8

for SAMPLE_NAME in `ls *.sorted.bam`;
do
    sbatch --time=30:00 --mem=100GB --cpus-per-task=1 --wrap="samtools index ${SAMPLE_NAME}"
done

Summary: list of files generated

Sorted BAM files:
• -rw-r--r-- 1 kathirn g-kathirn 2.6G Feb 4 12:14 NIST7035_TAAGGCGA_L002.sorted.bam
• -rw-r--r-- 1 kathirn g-kathirn 2.5G Feb 4 12:14 NIST7035_TAAGGCGA_L001.sorted.bam
• -rw-r--r-- 1 kathirn g-kathirn 2.7G Feb 4 12:14 NIST7086_CGTACTAG_L001.sorted.bam
• -rw-r--r-- 1 kathirn g-kathirn 2.7G Feb 4 12:14 NIST7086_CGTACTAG_L002.sorted.bam

Index of sorted BAM files:
• -rw-r--r-- 1 kathirn g-kathirn 3.4M Feb 4 12:25 NIST7086_CGTACTAG_L001.sorted.bam.bai
• -rw-r--r-- 1 kathirn g-kathirn 3.5M Feb 4 12:26 NIST7035_TAAGGCGA_L001.sorted.bam.bai
• -rw-r--r-- 1 kathirn g-kathirn 3.5M Feb 4 12:26 NIST7035_TAAGGCGA_L002.sorted.bam.bai
• -rw-r--r-- 1 kathirn g-kathirn 3.4M Feb 4 12:26 NIST7086_CGTACTAG_L002.sorted.bam.bai

List of chromosomes in each BAM file

@HD VN:1.5 SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:2 LN:243199373
@SQ SN:3 LN:198022430
@SQ SN:4 LN:191154276
@SQ SN:5 LN:180915260
@SQ SN:6 LN:171115067
@SQ SN:7 LN:159138663
@SQ SN:8 LN:146364022
@SQ SN:9 LN:141213431
@SQ SN:10 LN:135534747
@SQ SN:11 LN:135006516
@SQ SN:12 LN:133851895
@SQ SN:13 LN:115169878
@SQ SN:14 LN:107349540
@SQ SN:15 LN:102531392
@SQ SN:16 LN:90354753
@SQ SN:17 LN:81195210
@SQ SN:18 LN:78077248
@SQ SN:19 LN:59128983
@SQ SN:20 LN:63025520
@SQ SN:21 LN:48129895
@SQ SN:22 LN:51304566
@SQ SN:X LN:155270560
@SQ SN:Y LN:59373566
@SQ SN:MT LN:16569
@SQ SN:GL000207.1 LN:4262
@SQ SN:GL000226.1 LN:15008
@SQ SN:GL000229.1 LN:19913
@SQ SN:GL000231.1 LN:27386
@SQ SN:GL000210.1 LN:27682
@SQ SN:GL000239.1 LN:33824
…
@SQ SN:NC_007605 LN:171823
@SQ SN:hs37d5 LN:35477943
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta NIST7035_TAAGGCGA_L002_R1_001.fastq.gz NIST7035_TAAGGCGA_L002_R2_001.fastq.gz
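The reference names and lengths can be pulled out of such a header with awk. The sketch below works on three @SQ lines copied from the listing above and stored in a shell variable; with a real BAM file, the header text would typically come from samtools view -H, where the fields are tab-separated.

```shell
#!/bin/bash
# A few header lines from the listing above, with tabs between the fields.
header=$(printf '@HD\tVN:1.5\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:2\tLN:243199373\n@SQ\tSN:MT\tLN:16569\n')

# Keep the @SQ lines, strip the "SN:" and "LN:" tags, print name and length.
chrom_table=$(echo "$header" | awk -F '\t' '$1 == "@SQ" { sub(/^SN:/, "", $2); sub(/^LN:/, "", $3); print $2, $3 }')

echo "$chrom_table"
```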

Step 5: Chromosome interval for research interest

Objective: Generate a chunk of the BAM file that covers the interval 10,000-15,000 from Chromosome 1, Chromosome 2, etc.

Solution:

$ samtools view NIST7035_TAAGGCGA_L002.sorted.bam 1:10000-15000 | more

HWI-D00119:50:H7AP8ADXX:2:1214:6356:27283 163 1 10354 60 89M12S = 10354 96 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAAACCTAACCCTAACCCTAAGCCCCGGCA 8??DBDBAFF>?FGAFFIIFF9ED8;CCDFDED3?9?@?0?B@?DFF(DHECCC@@HGHHGIIIEECC==BCDFFECECECCCCCCDCDCECC NM:i:0 MD:Z:101 MC:Z:101M AS:i:101 XS:i:71

….

…..

Chr1

Chr2

Cont. Any optimal or better solution!?

#!/bin/bash
#SBATCH --job-name=Region_of_Interest
#SBATCH --output=Region_of_Interest.%A_%a.out
#SBATCH --error=Region_of_Interest.%A_%a.err
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --mem=100GB
#SBATCH --cpus-per-task=16
#SBATCH --array=1-2

## My variables
SAMPLE=NIST7035_TAAGGCGA_L002.sorted.bam
PREFIX=NIST7035_TAAGGCGA_L002
REGION="10000-15000"

## Software
module load samtools/1.8

## Job command to get Region of Interest from Chromosome 1 & 2
samtools view --threads 16 -b -o ${PREFIX}.${SLURM_ARRAY_TASK_ID}.${REGION}.sorted.bam ${SAMPLE} ${SLURM_ARRAY_TASK_ID}:${REGION}

Caution note!
• Job array indices must be non-negative integers (no fractions, no letters, no special symbols, no alphanumeric values, etc.).
• When the chromosome is named "Chr1", the array index has to be mapped to the name, as follows:

if [ ${SLURM_ARRAY_TASK_ID} -eq 1 ]; then
    : # … do something …
else
    : # … do something …
fi

${SLURM_ARRAY_TASK_ID} only has a value inside a batch job that was submitted as a job array.
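For more than two chromosomes, or for names such as X and MT, a lookup table can replace the if/else chain. A sketch, assuming illustrative chromosome names in the CHROMS array; SLURM_ARRAY_TASK_ID is set by hand here to imitate an array task, and the samtools command is only printed.

```shell
#!/bin/bash
# Lookup table: array index -> chromosome name (names are illustrative).
CHROMS=(Chr1 Chr2 ChrX ChrMT)
REGION="10000-15000"
SAMPLE=NIST7035_TAAGGCGA_L002.sorted.bam

# Slurm sets this inside an array job (e.g. --array=1-4); fixed here for the demo.
SLURM_ARRAY_TASK_ID=3

# Bash arrays start at 0, the array task IDs in this example start at 1.
CHR=${CHROMS[$((SLURM_ARRAY_TASK_ID - 1))]}

# Dry run: print the command that the task would execute.
echo samtools view -b -o "${CHR}.${REGION}.bam" "${SAMPLE}" "${CHR}:${REGION}"
```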

To view the chromosome … (e.g. IGV can be used)

Step 6: Mark duplicate(s)

Any optimal or better solution!?

#!/bin/bash
#SBATCH --job-name=MarkDupe
#SBATCH --output=MarkDupe.%A_%a.out
#SBATCH --error=MarkDupe.%A_%a.err
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --mem=100GB
#SBATCH --array=1-4

## My variables
SAMPLE=`ls *.sorted.bam | head -n $SLURM_ARRAY_TASK_ID | tail -n 1`
PREFIX=`basename $SAMPLE .sorted.bam`

## Software
module load gatk/4.0.1.1

## Job command
gatk MarkDuplicates --INPUT $SAMPLE --OUTPUT $PREFIX.duped.sorted.bam --METRICS_FILE $PREFIX.txt --REMOVE_DUPLICATES true

Prerequisite: array size = number of samples

End-of-Step 6

- Many job scripts
- Multiple files
- Manual steps
- etc.

Single job script

Job dependency

Job dependency

sbatch --dependency= ...

after:jobid[:jobid...] → job can begin after the specified jobs have started

afterany:jobid[:jobid...] → job can begin after the specified jobs have terminated

afternotok:jobid[:jobid...] → job can begin after the specified jobs have failed

afterok:jobid[:jobid...] → job can begin after the specified jobs have run to completion with an exit code of zero

singleton → job can begin execution after all previously launched jobs with the same name and user have ended. This is useful to collate results of a swarm or to send a notification at the end of a swarm.

Source: https://hpc.nih.gov/docs/job_dependencies.html


Job dependency - Example

$ cat dependent.sh
#!/bin/bash
## Any bugs/issues, please e-mail: [email protected]
echo "Submitting 5 jobs with 4 job dependency condition";

## Submit First job
First_CMD="sleep 40";
First_Job="sbatch --partition=batch --job-name=First_Step --time=30:00 --output=First-%J.out --error=First-%J.err --nodes=1";
First_ID=$(${First_Job} --parsable --wrap="${First_CMD}");
echo "First Job submitted (\" ${First_CMD} is executing \") and this job id is " ${First_ID};

## Execute the Second job only when First job is successful
Second_CMD="hostname";
Second_Job="sbatch --partition=batch --job-name=Second_Step --time=30:00 --output=Second-%J.out --error=Second-%J.err --nodes=1";
Second_ID=$(${Second_Job} --parsable --dependency=afterok:${First_ID} --wrap="${Second_CMD}");
echo " Second Job (\" ${Second_CMD} \") was submitted (Job_ID=${Second_ID}) and it will execute when the First Job_ID=${First_ID} is successful"

echo " The status of running jobs are ..."
echo "-----------------------------------------------------------------------------------------------------------"
squeue -u $USER -l
echo "-----------------------------------------------------------------------------------------------------------"
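When one step fans out over several jobs, the next step can wait for all of them by joining their IDs with ":" in a single afterok condition. A sketch with made-up job IDs (in practice they would be collected from sbatch --parsable, as in the script above); next_step.sh is a hypothetical script name, and the submission is only printed.

```shell
#!/bin/bash
# Job IDs as they would be returned by "sbatch --parsable" (made-up values).
JOB_IDS=(7441 7442 7443)

# Join the IDs with ":" to build afterok:7441:7442:7443
DEPENDENCY="afterok"
for ID in "${JOB_IDS[@]}"; do
    DEPENDENCY="${DEPENDENCY}:${ID}"
done

# Dry run: next_step.sh is a hypothetical script name.
echo sbatch --dependency="${DEPENDENCY}" ./next_step.sh
```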

Workflow

Source: Computational and Bioinformatics Frameworks for Next-Generation Whole Exome and Genome Sequencing

Source: Best Practices for Variant Discovery in DNAseq

Is this simple and/or Optimized?

Workflow optimization

Workflow for multiple samples: different software/tools for every job step

Step 1: Trimmomatic (TRIMMOMATIC_JAR), cores = 4
Step 2: bwa mem, cores = 16
Step 3: GATK 4.x MarkDuplicates, cores = 1
Step 4: gatk AddOrReplaceReadGroups, cores = 1
Step 5: samtools index, cores = 1
Step 6: GATK 3.x HaplotypeCaller, cores = 16
Step 7: bgzip, cores = 1
Step 8: tabix, cores = 1

Optimal heterogeneous resource allocation

• Automated the workflow: 8 scripts = single script
• Heterogeneous resource allocation used: 64 cores = optimal # cores
• Turnaround time minimized: unpredicted = predicted
• Optimized resource allocations: max. cores = optimal
• Job monitoring and job control: complex = simplified
  • Job restart
  • Job statistics/report

Outcome (each step runs for sample 1, sample 2, …, sample N):

Step 1: Read trimming
Step 2: Read mapping
Step 3: Mark duplicates
Step 4: Add/replace read groups
Step 5: Indexing
Step 6: Haplotype caller
Step 7: Compress the gVCF
Step 8: gVCF index


Acknowledgements: Elodie Rey (Prof. Mark Tester), Michael D. Abrouk (Prof. Simon Krattinger)

THANKS!

Time for Questions and your feedback!
