, a · 2020. 2. 11. · compress the sequence alignment map file (sam to bam) sorting the bam file...

Nagarajan Kathiresan, Ph.D., Computational Scientist, KAUST Supercomputing Lab, nagarajan.kathiresan@kaust.edu.sa

Post on 24-Jan-2021

TRANSCRIPT

  • Nagarajan Kathiresan, Ph.D., Computational Scientist, KAUST Supercomputing Lab, nagarajan.kathiresan@kaust.edu.sa

  • Agenda

    - UNIX tools for Bioinformatics
    - A simple job script on the command line
    - What is a workflow? How can I build it?
    - How to address job dependencies?
    - When to use job arrays?
    - Optimization in workflow design

    Note: Some Bioinformatics tools, such as BWA (Burrows-Wheeler Aligner), Samtools, and Picard/GATK, are used in the explanations.

  • UNIX tools for Bioinformatics: data transfer and search for pattern

  • Move data between two systems

    The rsync utility synchronizes files and directories between two different servers.

    - Copying from the local machine to a remote machine:
        rsync local_directory remote_server_name:remote_directory

    - Copying from a remote machine to the local machine:
        rsync remote_server_name:remote_directory local_directory

    Options:
        -a  archive mode
        -r  recursive over subdirectories
        -v  verbose
        -x  don't cross filesystem boundaries
        -H  preserve hard links
        -P  show progress
        -n  no-op, or dry-run

    Example:
        $ rsync -arvxHP my_data [email protected]:/ibex/scratch/kathirn/work/my_data/

  • Search for pattern

    - grep, egrep, fgrep
    - wc
    - | (pipe character)
    - cut
    - awk
    - sort
    - uniq
    - ...

  • Working with genome files

    Fasta

    Indexed Fasta

    Compressed Fastq

    Compressed VCF

    BAM

    SAM

    Sorted BAM

    GTF

  • Working with fasta file

    $ more Aegilops_tauschii.Aet_v4.0.ncrna.fa

  • Extract the headers from the FASTA file

    grep, egrep, fgrep: print lines matching a pattern

        -i, --ignore-case     ignore case
        -v, --invert-match    invert the match: print the lines that do not match the pattern
        -w, --word-regexp     print the lines where the pattern matches a whole word
        -o, --only-matching   print only the matching part of each line

    egrep = grep -E (--extended-regexp)

    fgrep = grep -F (--fixed-strings)
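The grep options above can be tried on a small file. The following is a minimal sketch, assuming a made-up three-record FASTA file (the names seq1 to seq3 and the file toy.fa are illustrative, not taken from the Aegilops data set):

```bash
# Build a tiny FASTA file (made-up records) in a temporary directory.
WORKDIR=$(mktemp -d)
FASTA="$WORKDIR/toy.fa"
{
    printf '>seq1 gene_biotype:snRNA\n'
    printf 'ACTATATAAAAAACTTCC\n'
    printf '>seq2 gene_biotype:tRNA\n'
    printf 'GGGGTATAGCTCAGT\n'
    printf '>seq3 gene_biotype:snRNA\n'
    printf 'TTTAGTGGAACTATAC\n'
} > "$FASTA"

# Print only the header lines (lines starting with ">").
headers=$(grep "^>" "$FASTA")
echo "$headers"

# -v: print the lines that do NOT match, i.e. the sequence lines.
sequences=$(grep -v "^>" "$FASTA")
echo "$sequences"

# -i -o: case-insensitive match, print only the matching part of each line.
biotypes=$(grep -io "gene_biotype:[a-z]*" "$FASTA")
echo "$biotypes"
```

Anchoring the pattern with "^" restricts the match to lines that start with ">", which is a slightly stricter choice than matching ">" anywhere in the line.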

  • Word count

    wc: count the number of lines, words and characters in a given file

    $ wc Aegilops_tauschii.Aet_v4.0.ncrna.fa
    13525 48871 1247270 Aegilops_tauschii.Aet_v4.0.ncrna.fa

    $ wc -l Aegilops_tauschii.Aet_v4.0.ncrna.fa
    13525 Aegilops_tauschii.Aet_v4.0.ncrna.fa

    $ wc -c Aegilops_tauschii.Aet_v4.0.ncrna.fa
    1247270 Aegilops_tauschii.Aet_v4.0.ncrna.fa

    $ wc -w Aegilops_tauschii.Aet_v4.0.ncrna.fa
    48871 Aegilops_tauschii.Aet_v4.0.ncrna.fa

    Question: How do I count the number of sequences in the above fasta file?

  • Answer:

    $ grep -c ">" Aegilops_tauschii.Aet_v4.0.ncrna.fa
    3732

    Why?

    - Counting the header lines (">") is the appropriate way.
    - A single sequence identifier can be followed by many sequence lines.

    Sequence identification:
    >ENSRNA050031380-T1 ncrna chromosome:Aet_v4.0:2D:126982204:126982306:-1 gene:ENSRNA050031380 gene_biotype:snRNA transcript_biotype:snRNA gene_symbol:U6 description:U6 spliceosomal RNA

    Sequence:
    ACTATATAAAAAACTTCCAATTTTAGTGGAACTATACAGAGAAGATTAGCATGGCCCCGACGCAAGGATGACACACACGAATTGAGAAATGATCCAAATTTTT
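To see why the line count and the sequence count differ, here is a small sketch that assumes a made-up FASTA file in which one sequence is wrapped over two lines (seqA, seqB and wrapped.fa are illustrative names):

```bash
# A toy FASTA file: seqA spans two sequence lines, seqB spans one.
WORKDIR=$(mktemp -d)
FASTA="$WORKDIR/wrapped.fa"
{
    printf '>seqA\n'
    printf 'ACGTACGT\n'
    printf 'ACGTAC\n'
    printf '>seqB\n'
    printf 'TTTTGGGG\n'
} > "$FASTA"

# Total number of lines: headers plus all sequence lines.
n_lines=$(wc -l "$FASTA" | awk '{print $1}')

# Number of sequences: one header line per sequence.
n_seqs=$(grep -c "^>" "$FASTA")

echo "lines: $n_lines, sequences: $n_seqs"
```

The file has five lines but only two sequences, so counting headers is the reliable measure.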

  • Combining the commands

    | (pipe character): sends the output of one command to the input of the next

  • Example

    Grep option: -io (ignore-case and only-matching)

  • Useful data processing tools!

    cut: extracts selected columns (fields) from a file

    Use: cut -f FIELD_LIST FILE_NAME
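A short sketch of cut on a tab-separated file, assuming a made-up VCF-like file (toy.vcf and its two records are illustrative only):

```bash
# A toy tab-separated VCF-like file with one meta line, one header line and two records.
WORKDIR=$(mktemp -d)
VCF="$WORKDIR/toy.vcf"
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\n'
    printf '1\t10354\trs1\tC\tT\n'
    printf '2\t20411\trs2\tG\tA\n'
} > "$VCF"

# Drop the lines starting with "#", then keep columns 1, 2 and 4 (CHROM, POS, REF).
# cut splits on TAB by default; use -d to choose another delimiter.
cols=$(grep -v "^#" "$VCF" | cut -f 1,2,4)
echo "$cols"
```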

  • Useful tableview!

    https://github.com/informationsea/tableview/releases/download/v0.4.6/tableview_linux_amd64

    $ cat sample.vcf | grep -v "#" | tableview_linux_amd64

    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

    $ cat sample.vcf | grep -v "#"

    Grep option: -v (invert-match)

  • Cont. … cut command!

  • Uniq and sort

  • AWK: scans each line and performs some actions.

    awk '{action1} ...'

  • Combine commands: awk + pipe + uniq + sort …
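As one possible illustration of combining awk, pipes, sort and uniq, the sketch below counts records per chromosome in a made-up VCF-like file (toy.vcf and its contents are assumptions for the example):

```bash
# A toy tab-separated file with five records on chromosomes 1 and 2.
WORKDIR=$(mktemp -d)
VCF="$WORKDIR/toy.vcf"
{
    printf '#CHROM\tPOS\tID\tREF\tALT\n'
    printf '2\t20411\t.\tG\tA\n'
    printf '1\t10354\t.\tC\tT\n'
    printf '1\t10420\t.\tA\tG\n'
    printf '2\t20987\t.\tT\tC\n'
    printf '1\t11002\t.\tG\tC\n'
} > "$VCF"

# 1. grep -v drops the header line.
# 2. awk prints the first column (chromosome).
# 3. sort groups identical values together, which uniq requires.
# 4. uniq -c counts each group.
# 5. the last awk reorders the output as "chromosome: count".
counts=$(grep -v "^#" "$VCF" | awk '{print $1}' | sort | uniq -c | awk '{print $2 ": " $1}')
echo "$counts"
```

The sort step matters: uniq only collapses adjacent duplicate lines, so unsorted input would give several partial counts per chromosome.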

  • A simple job script

    Minimum 3 parameters:

    1. sbatch: submit a batch script to Slurm

    2. time: set a limit on the total run time of the job allocation
        --time=days-hours:minutes:seconds
        -t days-hours:minutes:seconds
       (Slurm reads a two-field value such as 00:10 as minutes:seconds.)

    3. wrap: wrap the specified command string in a simple "sh" shell script and submit it to the Slurm controller

    Example:
        $ sbatch --time=00:10 --wrap="hostname"

    Output: slurm-9024853.out
        $ cat slurm-9024853.out
        cn603-28-l

    Caution note (default SLURM allocation):

    - memory = 2GB
    - CPU = 1 core
    - Node = 1 node

    Job script (batch jobs):
        $ cat my_job.sh
        #!/bin/bash
        #SBATCH --time=00:10
        hostname

        $ sbatch ./my_job.sh
        Submitted batch job 7438

        $ cat slurm-7438.out
        cn512-05-r

  • Workflow - example

    Step     Task                                                     Tool
    Step #1  Genome mapping/alignment                                 BWA
    Step #2  Compress the Sequence Alignment Map file (SAM to BAM)    Samtools
    Step #3  Sorting the BAM file                                     Samtools
    Step #4  Index for BAM files                                      Samtools
    Step #5  Chromosome interval for research interest                Samtools
    Step #6  Mark or remove the duplicate                             GATK/Picard

  • Step #1: Genome alignment

    Burrows-Wheeler Aligner:

    - bwa index ref.fa
    - bwa mem ref.fa reads.fq > aln-se.sam
    - bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
    - bwa aln ref.fa short_read.fq > aln_sa.sai
    - bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam
    - bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam
    - bwa bwasw ref.fa long_read.fq > aln.sam

    http://bio-bwa.sourceforge.net/bwa.shtml

    Prerequisites:

    1. Genome reference file (GRCh37, HG19 …)
    2. Genome reference index files
    3. Sample data (single or paired-end)

    Caution note:

    - By default, BWA runs sequentially (1 core).
    - It supports multi-threaded parallelization through the option -t.
    - The option -T (minimum alignment score to output) is a different option; do not confuse the two.

    Reference available: /ibex/reference/KSL/

  • Example: BWA MEM

    bwa mem ref.fa read1.fq read2.fq > aln-pe.sam

    Step #1: Check the availability of the software
        $ module av bwa
        ------------- /sw/csi/modulefiles/applications -----------------------
        bwa/0.7.17/gnu-6.4.0 bwakit/0.7.15/binary-0.7.15

    Step #2: Load the software module
        $ module load bwa/0.7.17/gnu-6.4.0

    Step #3: Prepare a job submission script

    Command:
        $ bwa mem /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam

    SLURM script:
        $ sbatch --time=00:10 --wrap="bwa mem /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam"

    Caution note on the default resource allocation:

    - 2 GB memory
    - 1 core

  • Cont. . . (Optimized script)

    SLURM script:
        $ sbatch \
            --time=2:00:00 \
            --mem=100GB \
            --cpus-per-task=16 \
            --wrap="bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam"
        Submitted batch job 9028055

        $ cat slurm-9028055.out
        [M::bwa_idx_load_from_disk] read 0 ALT contigs
        [M::process] read 1600000 sequences (160000000 bp)...
        [M::process] read 1600000 sequences (160000000 bp)...

  • Batch job script

        $ cat BWA_MEM_batch.sh
        #!/bin/bash
        #SBATCH --time=2:00:00
        #SBATCH --mem=100GB
        #SBATCH --cpus-per-task=16

        ## Software
        module load bwa/0.7.17/gnu-6.4.0

        ## Command
        bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam

    Job submitted using sbatch:
        $ sbatch ./BWA_MEM_batch.sh
        Submitted batch job 7439

    Standard output/error is written to slurm-JOBID.out:
        $ cat slurm-7439.out
        Loading module for BWA
        BWA 0.7.17 is now loaded
        [M::bwa_idx_load_from_disk] read 0 ALT contigs
        [M::process] read 1600000 sequences (160000000 bp)...
        [M::process] read 1600000 sequences (160000000 bp)...

  • How can I run 100+ genome samples?

    $ ls -lrta *_001.fastq.gz

    -rw-r--r-- 1 kathirn g-kathirn 2125471805 Dec 17 2013 NIST7086_CGTACTAG_L002_R2_001.fastq.gz

    -rw-r--r-- 1 kathirn g-kathirn 2083510543 Dec 17 2013 NIST7086_CGTACTAG_L002_R1_001.fastq.gz

    -rw-r--r-- 1 kathirn g-kathirn 2081364133 Dec 17 2013 NIST7086_CGTACTAG_L001_R2_001.fastq.gz

    -rw-r--r-- 1 kathirn g-kathirn 2037956271 Dec 17 2013 NIST7086_CGTACTAG_L001_R1_001.fastq.gz

    -rw-r--r-- 1 kathirn g-kathirn 2001172486 Dec 17 2013 NIST7035_TAAGGCGA_L002_R2_001.fastq.gz

    -rw-r--r-- 1 kathirn g-kathirn 1962477139 Dec 17 2013 NIST7035_TAAGGCGA_L002_R1_001.fastq.gz

    -rw-r--r-- 1 kathirn g-kathirn 1954935121 Dec 17 2013 NIST7035_TAAGGCGA_L001_R2_001.fastq.gz

    -rw-r--r-- 1 kathirn g-kathirn 1914722761 Dec 17 2013 NIST7035_TAAGGCGA_L001_R1_001.fastq.gz

    ------ DATA PROCESSING ------

  • Steps for processing more samples

    Samples: S1, S2, S3, ..., Sn

    For each SAMPLE_NAME, one by one, submit:
        $ sbatch \
            --time=2:00:00 \
            --mem=100GB \
            --cpus-per-task=16 \
            --wrap="bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SAMPLE_NAME_1.fastq SAMPLE_NAME_2.fastq > SAMPLE_NAME.sam"

    Repeat until all samples have been processed; then the job is done.

  • In UNIX script

    1. Get the unique list of samples
        $ ls *_R1_001.fastq.gz
        NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
        NIST7086_CGTACTAG_L001_R1_001.fastq.gz
        NIST7035_TAAGGCGA_L002_R1_001.fastq.gz
        NIST7086_CGTACTAG_L002_R1_001.fastq.gz

    2. Parse sample by sample
        $ for SAMPLE_NAME in `ls *_R1_001.fastq.gz`;
          do
              echo $SAMPLE_NAME;
          done

  • Cont.

    3. Get the UNIQUE sample name
        $ for SAMPLE_NAME in `ls *_R1_001.fastq.gz`;
          do
              echo `basename $SAMPLE_NAME _R1_001.fastq.gz`;
          done

    Output:
        NIST7035_TAAGGCGA_L001
        NIST7035_TAAGGCGA_L002
        NIST7086_CGTACTAG_L001
        NIST7086_CGTACTAG_L002

  • Cont.

    4. Multiple job submission using a FOR LOOP
        $ module load bwa/0.7.17/gnu-6.4.0

        $ for SAMPLE_NAME in `ls *_R1_001.fastq.gz`;
          do
              PREFIX=`basename $SAMPLE_NAME _R1_001.fastq.gz`;
              sbatch \
                  --time=2:00:00 \
                  --mem=100GB \
                  --cpus-per-task=16 \
                  --wrap="bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz > ${PREFIX}.sam"
          done

    The command inside --wrap is kept on a single line so that the shell started by Slurm reads it as one command.
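Before submitting many jobs it can help to print the commands first. The sketch below is a dry run of the loop, under the assumption of two made-up samples S1 and S2; DRY_RUN is a hypothetical switch introduced for this example and is not a Slurm feature:

```bash
# Empty placeholder files standing in for paired FASTQ files (made-up sample names).
WORKDIR=$(mktemp -d)
cd "$WORKDIR" || exit 1
touch S1_R1_001.fastq.gz S1_R2_001.fastq.gz S2_R1_001.fastq.gz S2_R2_001.fastq.gz

REF=/ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta
DRY_RUN=1   # hypothetical switch: set to 0 on a cluster to really call sbatch

n_jobs=0
for fastq in *_R1_001.fastq.gz; do
    prefix=$(basename "$fastq" _R1_001.fastq.gz)
    cmd="bwa mem -t 16 $REF ${prefix}_R1_001.fastq.gz ${prefix}_R2_001.fastq.gz > ${prefix}.sam"
    if [ "$DRY_RUN" -eq 1 ]; then
        # Only print what would be submitted.
        echo "sbatch --time=2:00:00 --mem=100GB --cpus-per-task=16 --wrap=\"$cmd\""
    else
        sbatch --time=2:00:00 --mem=100GB --cpus-per-task=16 --wrap="$cmd"
    fi
    n_jobs=$((n_jobs + 1))
    last_cmd=$cmd
done
echo "jobs prepared: $n_jobs"
```

The loop iterates over the glob directly instead of parsing the output of ls, which behaves the same for these file names and also copes with names containing spaces.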

  • In a batch script (as a job array)

        #!/bin/bash
        #SBATCH --job-name=BWA_MEM
        #SBATCH --output=BWA_MEM.%A_%a.out
        #SBATCH --error=BWA_MEM.%A_%a.err
        #SBATCH --time=2:00:00
        #SBATCH --nodes=1
        #SBATCH --mem=100GB
        #SBATCH --cpus-per-task=16
        #SBATCH --array=1-4

        ## Software
        module load bwa/0.7.17/gnu-6.4.0

        ## My variables
        SAMPLE=`ls *_R1_001.fastq.gz | head -n $SLURM_ARRAY_TASK_ID | tail -n 1` ;
        PREFIX=`basename $SAMPLE _R1_001.fastq.gz` ;

        ## Job command
        bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz > ${PREFIX}.sam

    Prerequisite: array size = number of samples

        $ sbatch ./bwa_mem_array.sh
        Submitted batch job 7440

        $ squeue -u $USER
        JOBID   PARTITION  NAME     USER     ST  TIME  NODES  NODELIST(REASON)
        7440_1  batch      BWA_MEM  kathirn  R   0:53  1      cn509-23-l
        7440_2  batch      BWA_MEM  kathirn  R   0:53  1      cn509-23-l
        7440_3  batch      BWA_MEM  kathirn  R   0:53  1      cn512-05-r
        7440_4  batch      BWA_MEM  kathirn  R   0:53  1      cn512-05-r
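The mapping from array index to sample can be checked outside Slurm. This is a minimal sketch, assuming four made-up samples A to D; TASK_ID falls back to 3 when SLURM_ARRAY_TASK_ID is not set, which is an assumption made only so the example can run on a laptop:

```bash
# Empty placeholder files standing in for the R1 FASTQ files (made-up sample names).
WORKDIR=$(mktemp -d)
cd "$WORKDIR" || exit 1
touch A_R1_001.fastq.gz B_R1_001.fastq.gz C_R1_001.fastq.gz D_R1_001.fastq.gz

# Slurm sets SLURM_ARRAY_TASK_ID inside an array job; default to 3 elsewhere.
TASK_ID=${SLURM_ARRAY_TASK_ID:-3}

# Number of samples, i.e. the array size the job needs.
n_samples=$(ls *_R1_001.fastq.gz | wc -l | tr -d ' ')

# Pick the TASK_ID-th file of the sorted listing (same result as head | tail).
sample=$(ls *_R1_001.fastq.gz | sed -n "${TASK_ID}p")
prefix=$(basename "$sample" _R1_001.fastq.gz)

echo "array range: 1-${n_samples}; task ${TASK_ID} -> ${prefix}"
```

Because options given on the sbatch command line take precedence over #SBATCH lines, the counted value could also be passed at submission time, for example sbatch --array=1-4 ./bwa_mem_array.sh.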

  • Cont. …

    Once the four array tasks have finished, one SAM file per sample is available:

        $ ls -lrta *.sam
        -rw-r--r-- 1 kathirn g-kathirn 2790260736 Feb 9 16:41 NIST7086_CGTACTAG_L002.sam
        -rw-r--r-- 1 kathirn g-kathirn 3783000064 Feb 9 16:41 NIST7086_CGTACTAG_L001.sam
        -rw-r--r-- 1 kathirn g-kathirn 3024093184 Feb 9 16:41 NIST7035_TAAGGCGA_L002.sam
        -rw-r--r-- 1 kathirn g-kathirn 3978428416 Feb 9 16:41 NIST7035_TAAGGCGA_L001.sam

  • Step 2: SAM to BAM files

    Use Samtools to convert SAM files to BAM:

        #!/bin/bash
        module load samtools/1.8

        for SAMPLE_NAME in `ls *.sam`;
        do
            PREFIX=`basename $SAMPLE_NAME .sam`;
            sbatch --time=2:00:00 --mem=100GB --cpus-per-task=16 --wrap="samtools view --threads 16 -b -S -h -q 30 ${SAMPLE_NAME} > ${PREFIX}.bam"
        done

    - SAM files are very large.
    - A BAM file is the compressed version of a SAM file.
    - It is good practice to use the BAM files, and it is safe to delete the SAM files once the BAM files are available.

    File sizes:

    - 1.8G NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
    - 1.9G NIST7035_TAAGGCGA_L001_R2_001.fastq.gz
    - 13G NIST7035_TAAGGCGA_L001.sam
    - 3.4G NIST7035_TAAGGCGA_L001.bam

  • Step 3: convert bam to sorted bam

    Sort the BAM files using samtools:

        #!/bin/bash
        module load samtools/1.8

        for SAMPLE_NAME in `ls *.bam`;
        do
            PREFIX=`basename $SAMPLE_NAME .bam`;
            sbatch --time=2:00:00 --mem=100GB --cpus-per-task=16 --wrap="samtools sort --threads 16 -T ${PREFIX} ${SAMPLE_NAME} -o ${PREFIX}.sorted.bam"
        done

  • End-of-Step 3!

    What are the files generated?

    - SAM files (generated by the genome alignment)
    - unsorted BAM files (generated by samtools, as part of data compression)
    - sorted BAM files (generated by samtools)

    Do we need all these intermediate files? If not, the steps can be chained with pipes:

    *.fastq.gz -> *.sam -> *.bam -> *.sorted.bam

        $ bwa mem -t 16 $REF ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz | samtools view --threads 16 -b -S -h -q 30 - | samtools sort --threads 16 -T ${PREFIX} - > ${PREFIX}.sorted.bam
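File names such as ${PREFIX}_R1_001.fastq.gz need the braces around the variable name, because an underscore is a valid character in a shell variable name. A minimal sketch of the difference, using one of the sample prefixes from the slides:

```bash
PREFIX=NIST7035_TAAGGCGA_L001

# Without braces the shell looks up a variable called PREFIX_R1_001, which is
# unset here, so it expands to nothing and only ".fastq.gz" is left.
wrong="$PREFIX_R1_001.fastq.gz"

# With braces the variable name ends at the closing brace.
right="${PREFIX}_R1_001.fastq.gz"

echo "without braces: '$wrong'"
echo "with braces:    '$right'"
```

A dot ends a variable name, so $PREFIX.sorted.bam expands as intended, but writing ${PREFIX}.sorted.bam keeps the style consistent.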

  • 3-in-1 !?

    *.fastq.gz -> *.sam -> *.bam -> *.sorted.bam

        #!/bin/bash
        #SBATCH --job-name=BWA_MEM
        #SBATCH --output=BWA_MEM.%A_%a.out
        #SBATCH --error=BWA_MEM.%A_%a.err
        #SBATCH --time=2:00:00
        #SBATCH --nodes=1
        #SBATCH --mem=100GB
        #SBATCH --cpus-per-task=16
        #SBATCH --array=1-4

        # Software
        module load bwa/0.7.17/gnu-6.4.0
        module load samtools/1.8

        # My variables
        SAMPLE=`ls *_R1_001.fastq.gz | head -n $SLURM_ARRAY_TASK_ID | tail -n 1` ;
        PREFIX=`basename $SAMPLE _R1_001.fastq.gz` ;

        # Job command
        bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz | samtools view --threads 16 -b -S -h -q 30 - | samtools sort --threads 16 - > $PREFIX.sorted.bam

  • Step 4: index the bam files

    Index the BAM files using samtools:

        #!/bin/bash
        module load samtools/1.8

        for SAMPLE_NAME in `ls *.sorted.bam`;
        do
            PREFIX=`basename $SAMPLE_NAME .sorted.bam`;
            sbatch --time=30:00 --mem=100GB --cpus-per-task=1 --wrap="samtools index ${SAMPLE_NAME}"
        done

  • Summary: list of files generated

    Sorted BAM files:

        -rw-r--r-- 1 kathirn g-kathirn 2.6G Feb 4 12:14 NIST7035_TAAGGCGA_L002.sorted.bam
        -rw-r--r-- 1 kathirn g-kathirn 2.5G Feb 4 12:14 NIST7035_TAAGGCGA_L001.sorted.bam
        -rw-r--r-- 1 kathirn g-kathirn 2.7G Feb 4 12:14 NIST7086_CGTACTAG_L001.sorted.bam
        -rw-r--r-- 1 kathirn g-kathirn 2.7G Feb 4 12:14 NIST7086_CGTACTAG_L002.sorted.bam

    Index of sorted BAM files:

        -rw-r--r-- 1 kathirn g-kathirn 3.4M Feb 4 12:25 NIST7086_CGTACTAG_L001.sorted.bam.bai
        -rw-r--r-- 1 kathirn g-kathirn 3.5M Feb 4 12:26 NIST7035_TAAGGCGA_L001.sorted.bam.bai
        -rw-r--r-- 1 kathirn g-kathirn 3.5M Feb 4 12:26 NIST7035_TAAGGCGA_L002.sorted.bam.bai
        -rw-r--r-- 1 kathirn g-kathirn 3.4M Feb 4 12:26 NIST7086_CGTACTAG_L002.sorted.bam.bai

  • List of chromosomes in each BAM file

        @HD VN:1.5 SO:coordinate
        @SQ SN:1 LN:249250621
        @SQ SN:2 LN:243199373
        @SQ SN:3 LN:198022430
        @SQ SN:4 LN:191154276
        @SQ SN:5 LN:180915260
        @SQ SN:6 LN:171115067
        @SQ SN:7 LN:159138663
        @SQ SN:8 LN:146364022
        @SQ SN:9 LN:141213431
        @SQ SN:10 LN:135534747
        @SQ SN:11 LN:135006516
        @SQ SN:12 LN:133851895
        @SQ SN:13 LN:115169878
        @SQ SN:14 LN:107349540
        @SQ SN:15 LN:102531392
        @SQ SN:16 LN:90354753
        @SQ SN:17 LN:81195210
        @SQ SN:18 LN:78077248
        @SQ SN:19 LN:59128983
        @SQ SN:20 LN:63025520
        @SQ SN:21 LN:48129895
        @SQ SN:22 LN:51304566
        @SQ SN:X LN:155270560
        @SQ SN:Y LN:59373566
        @SQ SN:MT LN:16569
        @SQ SN:GL000207.1 LN:4262
        @SQ SN:GL000226.1 LN:15008
        @SQ SN:GL000229.1 LN:19913
        @SQ SN:GL000231.1 LN:27386
        @SQ SN:GL000210.1 LN:27682
        @SQ SN:GL000239.1 LN:33824
        ….
        @SQ SN:NC_007605 LN:171823
        @SQ SN:hs37d5 LN:35477943
        @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta NIST7035_TAAGGCGA_L002_R1_001.fastq.gz NIST7035_TAAGGCGA_L002_R2_001.fastq.gz
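A header like the one above can be printed with samtools view -H on a BAM file. As a sketch of how the contig names and lengths could be pulled out of it, the example below runs awk on a hand-typed, tab-separated excerpt of that header (header.sam is a made-up file name, and the excerpt keeps only three @SQ lines):

```bash
# A short excerpt of a SAM header; fields are separated by TABs.
WORKDIR=$(mktemp -d)
HEADER="$WORKDIR/header.sam"
{
    printf '@HD\tVN:1.5\tSO:coordinate\n'
    printf '@SQ\tSN:1\tLN:249250621\n'
    printf '@SQ\tSN:2\tLN:243199373\n'
    printf '@SQ\tSN:MT\tLN:16569\n'
} > "$HEADER"

# Keep the @SQ lines, strip the "SN:" and "LN:" tags, print name and length.
contigs=$(awk -F '\t' '$1 == "@SQ" { sub(/^SN:/, "", $2); sub(/^LN:/, "", $3); print $2 "\t" $3 }' "$HEADER")
echo "$contigs"
```

On a cluster the same awk command could read from a pipe, for example samtools view -H sample.sorted.bam piped into awk, where sample.sorted.bam stands for any sorted BAM file.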

  • Step 5: Chromosome interval for research interest

    Objective: Generate a chunk of the BAM file that covers the interval 10,000-15,000 of Chromosome 1, Chromosome 2, etc.

    Solution:

        $ samtools view NIST7035_TAAGGCGA_L002.sorted.bam 1:10000-15000 | more
        HWI-D00119:50:H7AP8ADXX:2:1214:6356:27283 163 1 10354 60 89M12S = 10354 96 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAAACCTAACCCTAACCCTAAGCCCCGGCA 8??DBDBAFF>?FGAFFIIFF9ED8;CCDFDED3?9?@?0?B@?DFF(DHECCC@@HGHHGIIIEECC==BCDFFECECECCCCCCDCDCECC NM:i:0 MD:Z:101 MC:Z:101M AS:i:101 XS:i:71

    ….

    …..

    Chr1

    Chr2

  • Cont. Any optimal or better solution!?

        #!/bin/bash
        #SBATCH --job-name=Region_of_Interest
        #SBATCH --output=Region_of_Interest.%A_%a.out
        #SBATCH --error=Region_of_Interest.%A_%a.err
        #SBATCH --time=2:00:00
        #SBATCH --nodes=1
        #SBATCH --mem=100GB
        #SBATCH --cpus-per-task=16
        #SBATCH --array=1-2

        ## My variables
        SAMPLE=NIST7035_TAAGGCGA_L002.sorted.bam
        PREFIX=NIST7035_TAAGGCGA_L002
        REGION="10000-15000"

        ## Software
        module load samtools/1.8

        ## Job command to get the region of interest from Chromosome 1 & 2
        samtools view --threads 16 -b -o ${PREFIX}.${SLURM_ARRAY_TASK_ID}.${REGION}.sorted.bam ${SAMPLE} ${SLURM_ARRAY_TASK_ID}:${REGION}

    The output and error file names use %A_%a so that each array task writes to its own file.

    Caution note!

    - Job array indices must be integers (no fractions, no characters, no special symbols, no alphanumeric values, etc.).
    - When the chromosome is named "Chr1", the index has to be mapped to the name, for example:

        if [ "${SLURM_ARRAY_TASK_ID}" -eq 1 ]; then
            # ... do something ...
        else
            # ... do something ...
        fi

    The script must run as a batch job (job array) for ${SLURM_ARRAY_TASK_ID} to be set.
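When chromosome names are not plain numbers, one alternative to a chain of if/else tests is a lookup list indexed by the array task ID. This is a sketch under assumptions: the list CHROMS and the fallback value 2 for TASK_ID are made up for the example, and sample.sorted.bam is a placeholder file name:

```bash
# Hypothetical list of chromosome names; position N corresponds to array index N.
CHROMS="Chr1 Chr2 Chr3 ChrX"
REGION="10000-15000"

# Slurm sets SLURM_ARRAY_TASK_ID inside an array job; default to 2 elsewhere.
TASK_ID=${SLURM_ARRAY_TASK_ID:-2}

# Pick the TASK_ID-th space-separated word of the list.
chrom=$(echo "$CHROMS" | cut -d ' ' -f "$TASK_ID")

# Print the command that the array task would run.
echo "task ${TASK_ID} -> samtools view sample.sorted.bam ${chrom}:${REGION}"
```

With this list the job would be submitted with --array=1-4, one task per name.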

  • To view the chromosome … (e.g., IGV can be used)

  • Step 6: mark duplicate(s)

    Any optimal or better solution!?

        #!/bin/bash
        #SBATCH --job-name=MarkDupe
        #SBATCH --output=MarkDupe.%A_%a.out
        #SBATCH --error=MarkDupe.%A_%a.err
        #SBATCH --time=2:00:00
        #SBATCH --nodes=1
        #SBATCH --mem=100GB
        #SBATCH --array=1-4

        ## My variables
        SAMPLE=`ls *.sorted.bam | head -n $SLURM_ARRAY_TASK_ID | tail -n 1` ;
        PREFIX=`basename $SAMPLE .sorted.bam` ;

        ## Software
        module load gatk/4.0.1.1

        ## Job command
        gatk MarkDuplicates --INPUT $SAMPLE --OUTPUT $PREFIX.duped.sorted.bam --METRICS_FILE $PREFIX.txt --REMOVE_DUPLICATES true

    Prerequisite: array size = number of samples

  • End-of-Step 6

    So far:

    - Many job scripts
    - Multiple files
    - Manual steps
    - etc.

    Next: a single job script, using job dependency

  • Job dependency

    sbatch --dependency= ...

    after:jobid[:jobid...]       job can begin after the specified jobs have started
    afterany:jobid[:jobid...]    job can begin after the specified jobs have terminated
    afternotok:jobid[:jobid...]  job can begin after the specified jobs have failed
    afterok:jobid[:jobid...]     job can begin after the specified jobs have run to completion with an exit code of zero
    singleton                    job can begin execution after all previously launched jobs with the same name and user have ended. This is useful to collate results of a swarm or to send a notification at the end of a swarm.

    Source: https://hpc.nih.gov/docs/job_dependencies.html

  • Job dependency - Example

        $ cat dependent.sh
        #!/bin/bash
        ## Any bugs/issues, please e-mail: [email protected]
        echo "Submitting 5 jobs with 4 job dependency condition";

        ## Submit First job
        First_CMD="sleep 40";
        First_Job="sbatch --partition=batch --job-name=First_Step --time=30:00 --output=First-%J.out --error=First-%J.err --nodes=1";
        First_ID=$(${First_Job} --parsable --wrap="${First_CMD}");
        echo "First Job submitted (\" ${First_CMD} is executing \") and this job id is " ${First_ID};

        ## Execute the Second job only when First job is successful
        Second_CMD="hostname";
        Second_Job="sbatch --partition=batch --job-name=Second_Step --time=30:00 --output=Second-%J.out --error=Second-%J.err --nodes=1";
        Second_ID=$(${Second_Job} --parsable --dependency=afterok:${First_ID} --wrap="${Second_CMD}");
        echo " Second Job (\" ${Second_CMD} \") was submitted (Job_ID=${Second_ID}) and it will execute when the First Job_ID=${First_ID} is successful"

        echo " The status of running jobs are ..."
        echo "-----------------------------------------------------------------------------------------------------------"
        squeue -u $USER -l
        echo "-----------------------------------------------------------------------------------------------------------"
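The chaining logic can be rehearsed without a cluster. In the sketch below fake_sbatch is a hypothetical stand-in for sbatch --parsable: it records its arguments in a log file and prints a made-up job ID, so the dependency string built for the second job can be inspected. On a real system the function would be replaced by sbatch itself.

```bash
# Log file that records every "submitted" command.
LOG=$(mktemp)

# HYPOTHETICAL stand-in for sbatch: logs the arguments and prints a made-up
# job ID (1001, 1002, ...), the way "sbatch --parsable" prints the real one.
fake_sbatch() {
    echo "sbatch $*" >> "$LOG"
    echo $((1000 + $(awk 'END {print NR}' "$LOG")))
}

# First job: no dependency.
first_id=$(fake_sbatch --parsable --job-name=First_Step --time=30:00 --wrap="sleep 40")

# Second job: may start only after the first one finished with exit code 0.
second_id=$(fake_sbatch --parsable --job-name=Second_Step --time=30:00 \
    --dependency=afterok:"${first_id}" --wrap="hostname")

echo "First job:  ${first_id}"
echo "Second job: ${second_id} (runs only if ${first_id} succeeds)"
cat "$LOG"
```

Capturing each job ID in a variable is what allows a longer chain: every later step passes the ID of the step it waits for in its own --dependency option.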

  • Workflow

    Source: Computational and Bioinformatics Frameworks for Next-Generation Whole Exome and Genome Sequencing

    Source: Best Practices for Variant Discovery in DNAseq

  • Is this simple and/or Optimized?

  • workflow optimization

    Workflow for multiple samples: a different software tool for every job step, each step run for sample 1, sample 2, ..., sample N.

    Step    Stage                     Tool                           Cores
    Step 1  Read trimming             TRIMMOMATIC_JAR                4
    Step 2  Read mapping              bwa mem                        16
    Step 3  Mark duplicates           GATK 4.x MarkDuplicates        1
    Step 4  Add/replace read groups   gatk AddOrReplaceReadGroups    1
    Step 5  Indexing                  samtools index                 1
    Step 6  Haplotype caller          GATK 3.x HaplotypeCaller       16
    Step 7  Compress the gVCF         bgzip                          1
    Step 8  gVCF index                tabix                          1

    Optimal heterogeneous resource allocation

    Outcome:

    - Automated the workflow: 8 scripts = single script
    - Heterogeneous resource allocation used: 64 cores = optimal # cores
    - Turnaround time minimized: unpredicted = predicted
    - Optimized resource allocations: max. cores = optimal
    - Job monitoring and job control: complex = simplified (job restart, job statistics/report)


    Acknowledgements: Elodie Rey (Prof. Mark Tester ) Michael D. Abrouk (Prof. Simon Krattinger)

  • THANKS!

  • Time for Questions and your feedback!