National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Variant Calling Workshop

Upload: sandro

Post on 22-Feb-2016



Page 1: Variant Calling Workshop

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

Variant Calling Workshop

Page 2: Variant Calling Workshop

Overview

• There will be two parts to the workshop:
  • Variant calling analysis (on the cluster)
  • Visualization (on the desktop) using IGV

• Command prompts (what you will type) will be in boxes preceded by ‘$’. Output will be in red:

$ mkdir foo
$ cd foo
$ ls -la
total 96
drwxrwxr-x 2 cjfields cjfields 32768 Jun 23 22:51 .
drwxr-x--- 39 cjfields cjfields 32768 Jun 23 22:51 ..

Page 3: Variant Calling Workshop

Prelude : Variant Calling Setup

1. Log into the cluster using your classroom account.

2. Create a work folder (I call mine ‘mayo_test’):

$ mkdir mayo_test
$ cd mayo_test
$ ll
total 0

Page 4: Variant Calling Workshop

Part Ia : Variant Calling Setup

3. Link in all scripts from the main work folder to this directory:

$ ln -s /home/mirrors/gatk_bundle/mayo_workshop/*.sh .
$ ls
annotate_snpeff.sh call_variants_ug.sh hard_filtering.sh post_annotate.sh

Page 5: Variant Calling Workshop

Part Ia : Variant Calling Setup

• Data for this workshop are from the 1000 Genomes Project: whole-genome sequencing (WGS) at 60x coverage

• The initial part of the GATK pipeline (alignment, local realignment, base quality score recalibration) has already been done, and the BAM file has been reduced to a portion of human chromosome 20
  • Otherwise, we would not even finish the alignment within the next few days, let alone the other steps

Page 6: Variant Calling Workshop

Part Ia : Variant Calling

• Start the variant calling job. Check the status of the job using ‘qstat’:

$ qsub call_variants_ug.sh
371875.biocluster.igb.illinois.edu

$ qstat -u <YOUR USER NAME>

biocluster.igb.illinois.edu:
                                                                               Req'd  Req'd   Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
371875.biocluste     cjfields    default  call_variants_ug  24285     1      4   10gb    -- R 00:01
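The job state appears in the `S` column of the `qstat` listing. As a minimal sketch, assuming the column layout shown above (the state is the second-to-last field), the state for one job could be pulled out as follows. The helper name `job_state` is hypothetical and not part of the workshop scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `qstat -u` style output on stdin and print the
# state letter for the job whose ID starts with the given number.
# Assumes the state is the second-to-last whitespace-separated field.
job_state() {
    # $1 = numeric job ID, e.g. 371875
    awk -v id="$1" 'index($1, id ".") == 1 { print $(NF - 1) }'
}
```

It would be used as `qstat -u <YOUR USER NAME> | job_state 371875`. On Torque-style schedulers the letter is typically `Q` for queued, `R` for running, and `C` for completed.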

Page 7: Variant Calling Workshop

Part Ia : Variant Calling

• Discussion: what did we just do?

• We ran the GATK UnifiedGenotyper to call variants

• Show the script…

Page 8: Variant Calling Workshop

Part Ia : Variant Calling

• Job done yet? Should only be a few minutes…

• What do the data look like? (anyone here use UNIX?)

$ qstat -u <YOUR USER NAME>

$ ll *vcf*
-rw-rw-r-- 1 cjfields cjfields 237060 Jun 23 23:10 raw_indels.vcf
-rw-rw-r-- 1 cjfields cjfields 2829 Jun 23 23:10 raw_indels.vcf.idx
-rw-rw-r-- 1 cjfields cjfields 3203447 Jun 23 23:08 raw_snps.vcf
-rw-rw-r-- 1 cjfields cjfields 107241 Jun 23 23:08 raw_snps.vcf.idx

$ tail -n 2 raw_indels.vcf
20 26306897 rs200138621 CAGA C 1305.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=3.130;DB;DP=75;FS=0.936;MLEAC=1;MLEAF=0.500;MQ=57.75;MQ0=0;MQRankSum=0.407;QD=5.80;ReadPosRankSum=0.371 GT:AD:DP:GQ:PL 0/1:44,26:75:99:1343,0,2526
20 26314306 rs199619140 GT G 1502.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=3.814;DB;DP=83;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=57.12;MQ0=0;MQRankSum=-1.411;QD=18.11;ReadPosRankSum=1.387 GT:AD:DP:GQ:PL 0/1:33,36:76:99:1540,0,1253
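The last two columns of each record belong together: FORMAT names the per-sample fields, and the sample column holds their values in the same order. Here is a minimal sketch for reading them side by side, assuming a tab-delimited VCF with a single sample. The helper name `show_genotype` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for every record, print CHROM, POS, and each
# FORMAT key paired with its value from the first sample column.
show_genotype() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            n = split($9, keys, ":")       # FORMAT column, e.g. GT:AD:DP:GQ:PL
            split($10, vals, ":")          # first sample column
            out = $1 "\t" $2
            for (i = 1; i <= n; i++) out = out "\t" keys[i] "=" vals[i]
            print out
        }
    ' "$1"
}
```

For example, `show_genotype raw_indels.vcf | tail -n 2` would list the genotype (GT), allele depths (AD), depth (DP), genotype quality (GQ), and genotype likelihoods (PL) for the two records above.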

Page 9: Variant Calling Workshop

Part Ia : Variant Calling

• How many SNPs and Indels were called?

• Any found in dbSNP?

$ grep -c -v '^#' raw_snps.vcf
13621

$ grep -c -v '^#' raw_indels.vcf
1070

$ grep -c 'rs[0-9]*' raw_snps.vcf
12245

$ grep -c 'rs[0-9]*' raw_indels.vcf
1019
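The pattern `rs[0-9]*` matches any line that contains the letters "rs", because `*` also allows zero digits, so header lines or other fields can be counted too. A stricter count looks only at the ID column. The following is a sketch, assuming a tab-delimited VCF in which dbSNP identifiers appear in column 3, possibly separated by semicolons. The helper name `count_dbsnp` is hypothetical, and the counts it gives for the workshop files may or may not differ from those above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count data records whose ID column (field 3)
# holds at least one dbSNP-style identifier (rs followed by digits).
count_dbsnp() {
    awk -F'\t' '
        !/^#/ && $3 ~ /(^|;)rs[0-9]+(;|$)/ { n++ }
        END { print n + 0 }                # print 0 when nothing matched
    ' "$1"
}
```

It would be run as `count_dbsnp raw_snps.vcf` and `count_dbsnp raw_indels.vcf`.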

Page 10: Variant Calling Workshop

Part Ib : Hard filtering

• We need to filter the variant calls

• Generally, for human data we would use variant quality score recalibration, but we have a very small set of variants, so here we use hard filtering
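To make the idea of hard filtering concrete, here is a minimal sketch that applies one fixed cutoff to one annotation. It assumes a tab-delimited VCF with a `QD` value in the INFO column. The 2.0 cutoff, the `LowQD` label, and the helper name `flag_low_qd` are illustrative assumptions, and this is not the workshop's `hard_filtering.sh`. Note that in this sketch records are labeled in the FILTER column and never deleted.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: set FILTER (column 7) to PASS or LowQD from the QD
# value in INFO. The default cutoff of 2.0 is an illustrative assumption.
flag_low_qd() {
    # $1 = VCF file, $2 = optional minimum QD (default 2.0)
    awk -F'\t' -v OFS='\t' -v min_qd="${2:-2.0}" '
        /^#/ { print; next }               # copy header lines unchanged
        {
            qd = ""
            n = split($8, info, ";")       # INFO is a semicolon-separated list
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^QD=/) qd = substr(info[i], 4)
            # records without a QD annotation are left as PASS in this sketch
            $7 = (qd != "" && qd + 0 < min_qd + 0) ? "LowQD" : "PASS"
            print
        }
    ' "$1"
}
```

For example, `flag_low_qd raw_snps.vcf > flagged_snps.vcf` would write a labeled copy with the same number of records as the input.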

Page 11: Variant Calling Workshop

Part Ib : Hard filtering

• Start the hard filtering step. This will be fast:

• You will have two new VCF files in a minute:
  • hard_filtered_snps.vcf
  • hard_filtered_indels.vcf

$ qsub hard_filtering.sh
371886.biocluster.igb.illinois.edu

$ qstat -u <YOUR USER NAME>

biocluster.igb.illinois.edu:
                                                                               Req'd  Req'd   Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
371886.biocluste     cjfields    default  hard_filtering.s  24455     1      4   10gb    -- R    --

Page 13: Variant Calling Workshop

Part Ib : Hard filtering

• What are we doing?

• <Show the code!>

• Questions:
  • Did we lose any variants?
  • How many PASS’ed the filter?
  • What is the difference between the filtered and raw output?

$ grep -c 'PASS' hard_filtered_snps.vcf
8270

$ grep -c 'PASS' hard_filtered_indels.vcf
1041
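Counting `PASS` shows only one side of the result. A tally of every value in the FILTER column also shows which filters the remaining records failed. Here is a minimal sketch, assuming a tab-delimited VCF with FILTER in column 7. The helper name `filter_tally` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count how many data records carry each FILTER value.
filter_tally() {
    awk -F'\t' '
        !/^#/ { count[$7]++ }              # column 7 is FILTER
        END { for (f in count) print f "\t" count[f] }
    ' "$1" | LC_ALL=C sort                 # sort for a stable, readable order
}
```

It would be run as `filter_tally hard_filtered_snps.vcf`. The counts across all rows should add up to the number of records in the file.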

Page 14: Variant Calling Workshop

Part Ic : Annotate the variants (SnpEff)

• Run the next job, which uses SnpEff to add annotation to the VCF:

• This takes a couple of minutes…

• Two new VCF files:
  • hard_filtered_snps_annotated.vcf
  • hard_filtered_indels_annotated.vcf

$ qsub annotate_snpeff.sh
371894.biocluster.igb.illinois.edu

$ qstat -u <YOUR USER NAME>

biocluster.igb.illinois.edu:
                                                                               Req'd  Req'd   Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
371894.biocluste     cjfields    default  annotate_snpeff.  18957     1      1   10gb    -- R    --

Page 15: Variant Calling Workshop

Part Ic : Annotate the variants (SnpEff)

• SnpEff adds information about where the variants are in relation to specific genes

• The IDs for the human assembly version we use are from Ensembl (ENSGXXXXXXXXXXX)

• The Ensembl ID for FOXA2 is ENSG00000125798

Page 17: Variant Calling Workshop

Part Ic : Annotate the variants (SnpEff)

• The Ensembl ID for FOXA2 is ENSG00000125798

• Are there any variants called for FOXA2?

• SnpEff also creates some additional output files; we’ll see those in a bit

$ grep -c 'ENSG00000125798' hard_filtered_snps_annotated.vcf
3

$ grep -c 'ENSG00000125798' hard_filtered_indels_annotated.vcf
0
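To see which variants those are rather than just how many, the matching records can be reduced to their position and alleles. Here is a minimal sketch, assuming the gene ID appears somewhere in the annotation of each relevant record and that the VCF is tab-delimited. The helper name `variants_for_gene` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list CHROM, POS, REF, and ALT for every data record
# that mentions the given gene ID anywhere on the line.
variants_for_gene() {
    # $1 = gene ID (matched as a fixed string), $2 = annotated VCF file
    grep -v '^#' "$2" | grep -F "$1" | cut -f1,2,4,5
}
```

It would be run as `variants_for_gene ENSG00000125798 hard_filtered_snps_annotated.vcf`. The positions it prints can later be entered in IGV to inspect the supporting reads.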

Page 18: Variant Calling Workshop

Part Id : GATK VariantAnnotator

• SnpEff adds a lot of information to the VCF.

• GATK VariantAnnotator helps remove a lot of the extraneous information

Page 19: Variant Calling Workshop

Part Id : GATK VariantAnnotator

• The last step:

• This may take about 5-10 minutes

$ qsub post_annotate.sh
371905.biocluster.igb.illinois.edu

$ qstat -u <YOUR USER NAME>

biocluster.igb.illinois.edu:
                                                                               Req'd  Req'd   Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
371905.biocluste     cjfields    default  post_annotate.sh  24650     1      4   10gb    -- R 00:01

Page 20: Variant Calling Workshop

While this is going on…

• Let’s start a little tutorial on the Integrative Genomics Viewer (IGV), also from the Broad Institute

Page 21: Variant Calling Workshop

Prelude to Part II

• We need to download the results from your user folders to the local desktop

• We’ll use FileZilla for this

Page 22: Variant Calling Workshop

FileZilla

Page 27: Variant Calling Workshop

FileZilla

• Transfer folder to the desktop

Page 28: Variant Calling Workshop

Part II : Viewing Results in IGV

• Open IGV
• Switch the genome to ‘Human (b37)’