eccb10 talk - nextgen sequencing and snps

Next-generation sequencing and SNPs Jan Aerts Wellcome Trust Sanger Institute [email protected]

Upload: jan-aerts

Post on 17-Jul-2015


Page 1: ECCB10 talk - Nextgen sequencing and SNPs

Next-generation sequencing and SNPs

Jan Aerts
Wellcome Trust Sanger Institute

[email protected]

Page 2: ECCB10 talk - Nextgen sequencing and SNPs

Aim

To identify the SNP that causes a disease or phenotype:
– Find them all, so you don’t miss it (false negatives)
– Don’t find too many, so the result stays useful (false positives)

Page 3: ECCB10 talk - Nextgen sequencing and SNPs

General principle

• Map reads to the reference sequence
• Convert from read-based to base-based (i.e. pileup)
• Look at the differences
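The read-based to base-based conversion can be illustrated with a toy sketch. It assumes ungapped reads given as (1-based start, sequence) pairs; the function names and the example data are made up for illustration and are not taken from any of the tools in this talk.

```python
from collections import defaultdict


def pileup(reads):
    """Turn (start, sequence) reads into {position: [bases]} columns.

    Assumes ungapped alignments and 1-based start coordinates.
    """
    columns = defaultdict(list)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset].append(base)
    return columns


def differences(reference, reads):
    """Yield (position, ref_base, mismatching_bases) for columns with mismatches."""
    columns = pileup(reads)
    for pos in sorted(columns):
        ref_base = reference[pos - 1]
        mismatches = [base for base in columns[pos] if base != ref_base]
        if mismatches:
            yield pos, ref_base, mismatches


# Illustrative data only: three short reads against an 8-base reference.
reference = "ACGTACGT"
reads = [(1, "ACGT"), (2, "CGAA"), (3, "GAAC")]
result = list(differences(reference, reads))
```

Here only position 4 differs from the reference: two of the three reads covering it show A instead of T.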

Page 4: ECCB10 talk - Nextgen sequencing and SNPs

This presentation

Factors in finding real SNPs:
– Sequencing technology
– Mapping algorithms and initial calling
– Post-mapping tweaking
– Calling
– Filtering

Based on experiences in exome resequencing (“experiment 5” on the last slide of Thomas’s talk)

Page 5: ECCB10 talk - Nextgen sequencing and SNPs

1. Sequencing

• Provides raw data
• Different technologies
  – Different accuracy (critical!)
  – Different types of errors

Page 6: ECCB10 talk - Nextgen sequencing and SNPs

Accuracy

Base quality drops along the read

Sanger > SOLiD > Illumina > 454 > Helicos

Page 7: ECCB10 talk - Nextgen sequencing and SNPs

Base calling errors

Main source of error for Illumina, less so in SOLiD & 454

Page 8: ECCB10 talk - Nextgen sequencing and SNPs

Homopolymer runs

• Especially 454
  39% of errors are in homopolymers
• A5 motifs: 3.3% error rate
• A8 motifs: 50% error rate!

Reason: signal intensity is used as a measure of homopolymer length
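Because the error rate rises with run length, it can help to know where the long homopolymer runs are. A minimal sketch, assuming a plain sequence string; the function name, the `min_length` default and the example sequence are illustrative choices, not part of any tool mentioned here.

```python
from itertools import groupby


def homopolymer_runs(seq, min_length=5):
    """Return (0-based start, base, length) for each run of at least min_length."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= min_length:
            runs.append((pos, base, length))
        pos += length
    return runs


# Illustrative sequence containing one A5 and one A8 motif.
seq = "ACG" + "A" * 5 + "TCG" + "A" * 8 + "T"
runs = homopolymer_runs(seq)
```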

Page 9: ECCB10 talk - Nextgen sequencing and SNPs
Page 10: ECCB10 talk - Nextgen sequencing and SNPs

Is it 4? Is it 5? Is it 4?

Page 11: ECCB10 talk - Nextgen sequencing and SNPs

Consensus accuracy

Increase accuracy for SNP calling by increasing coverage:
– Illumina: 20X
– SOLiD: 12X
– 454: 7.4X
– Sanger: 3X

Factors: raw accuracy + read length

Page 12: ECCB10 talk - Nextgen sequencing and SNPs

2. Mapping: fastq => bam

• Maq and bwa: report only 1 mapping
  If multiple mappings: mapQ = 0
  <=> mosaik & mrFAST: report alternatives
• Maq and bwa: use paired-end information => might prefer correct distance over correct alignment

Page 13: ECCB10 talk - Nextgen sequencing and SNPs

3. Post-mapping tweaking

Improve quality of mapped data:
– duplicate removal
– baseQ recalibration
– read clipping
– local realignment around indels

Genome Analysis Toolkit (GATK)
http://bit.ly/9zIn4b

Page 14: ECCB10 talk - Nextgen sequencing and SNPs

Duplicate removal

PCR amplification bias
Multiple reads with same start/stop => keep only one (with highest mapping Q)
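The rule on this slide can be sketched in a few lines. This is a toy version under stated assumptions: reads are plain dictionaries with hypothetical keys (`chrom`, `start`, `end`, `strand`, `mapq`), and strand is included in the duplicate key as an extra assumption. Real duplicate-removal tools look at more than this (read pairs, clipping), so this is not how Picard or samtools implement it.

```python
def remove_duplicates(reads):
    """Keep one read per (chrom, start, end, strand): the one with highest mapQ."""
    best = {}
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        # First read seen for this key, or a better-mapped duplicate: keep it.
        if key not in best or read["mapq"] > best[key]["mapq"]:
            best[key] = read
    return list(best.values())


# Illustrative reads: r1 and r2 share start/stop, r3 is unique.
reads = [
    {"name": "r1", "chrom": "1", "start": 100, "end": 150, "strand": "+", "mapq": 20},
    {"name": "r2", "chrom": "1", "start": 100, "end": 150, "strand": "+", "mapq": 37},
    {"name": "r3", "chrom": "1", "start": 120, "end": 170, "strand": "+", "mapq": 30},
]
kept = sorted(read["name"] for read in remove_duplicates(reads))
```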

Page 15: ECCB10 talk - Nextgen sequencing and SNPs

Picard:

java -Xmx2048m \
  -jar /path_to_picardtools/MarkDuplicates.jar \
  INPUT=input.bam \
  OUTPUT=output.bam \
  METRICS_FILE=output.metrics \
  VALIDATION_STRINGENCY=LENIENT

samtools:

samtools rmdup input.bam output.bam

Page 16: ECCB10 talk - Nextgen sequencing and SNPs

baseQ recalibration

• Why?
  – correct for variation in quality with machine cycle, sequence context, lane, baseQ…
• Steps:
  – Identify what to correct for (create plots)
  – Calculate covariates
  – Apply covariates
  – Check (create plots)
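The idea behind the "calculate covariates" step can be shown with a simplified sketch: group base observations by covariate values, count mismatches against the reference, and convert the observed error rate back to a Phred-scaled quality. This is not GATK's actual implementation. The choice of covariates (reported quality and machine cycle), the +1/+2 smoothing and the function name are assumptions made for illustration, and known variant sites are assumed to have been excluded beforehand.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map (reported_q, cycle) to an empirical Phred quality.

    observations: iterable of (reported_q, cycle, is_mismatch) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [mismatches, total]
    for reported_q, cycle, is_mismatch in observations:
        entry = counts[(reported_q, cycle)]
        entry[0] += int(is_mismatch)
        entry[1] += 1

    table = {}
    for key, (mismatches, total) in counts.items():
        # Smoothing (an assumption of this sketch) avoids log10(0) for error-free bins.
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = -10 * math.log10(error_rate)
    return table


# Illustrative data: bases reported as Q30 in cycle 1, 9 mismatches in 998 observations.
observations = [(30, 1, True)] * 9 + [(30, 1, False)] * 989
table = empirical_qualities(observations)
```

In this made-up example the bases were reported as Q30 but behave like Q20, which is the kind of discrepancy recalibration is meant to correct.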

Page 17: ECCB10 talk - Nextgen sequencing and SNPs
Page 18: ECCB10 talk - Nextgen sequencing and SNPs

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  --DBSNP resources/dbsnp_129_hg18.rod \
  -I my_reads.bam \
  -T CountCovariates \
  -cov ReadGroupCovariate \
  -cov QualityScoreCovariate \
  -cov DinucCovariate \
  -recalFile my_reads.recal_data.csv

Page 19: ECCB10 talk - Nextgen sequencing and SNPs

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  -I my_reads.bam \
  -T TableRecalibration \
  -outputBam my_reads.recal.bam \
  -recalFile my_reads.recal_data.csv

Page 20: ECCB10 talk - Nextgen sequencing and SNPs

Read clipping

Remove:
– low quality strings of bases
– sections of reads
– reads containing user-provided sequences

Page 21: ECCB10 talk - Nextgen sequencing and SNPs

Local realignment near indels

Page 22: ECCB10 talk - Nextgen sequencing and SNPs

Local realignment near indels

Page 23: ECCB10 talk - Nextgen sequencing and SNPs

java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I input.bam \
  -R ref.fasta \
  -T IndelRealigner \
  -targetIntervals /path/to/output.intervals \
  -o realignedBam.bam

Page 24: ECCB10 talk - Nextgen sequencing and SNPs

4. SNP calling

• Different callers:
  – samtools
  – GATK UnifiedGenotyper
  – SOAPsnp
  – …
• Read-based => base-based

Page 25: ECCB10 talk - Nextgen sequencing and SNPs

pileup

1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
1 277 T 22 ..CCggC,C,.C.,,CC,..g. +7<;<<<<<<<&<=<<:;<<&<
1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
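To read the fifth column of a pileup line: "." and "," mean a match to the reference on the forward and reverse strand, letters are mismatches, "^" plus one character marks a read start (the character encodes mapping quality), and "$" marks a read end. A minimal sketch for counting the bases at one position follows; the function name is made up, and indels are skipped rather than counted.

```python
import re
from collections import Counter


def count_bases(ref_base, read_bases):
    """Count bases in the read-bases column of a samtools pileup line."""
    # Drop read-start markers ('^' plus one mapping-quality character)
    # and read-end markers ('$'); neither is a base call.
    cleaned = re.sub(r"\^.", "", read_bases).replace("$", "")
    counts = Counter()
    i = 0
    while i < len(cleaned):
        char = cleaned[i]
        if char in "+-":
            # Indel: sign, length digits, then that many bases; skip all of it.
            digits = re.match(r"\d+", cleaned[i + 1:]).group()
            i += 1 + len(digits) + int(digits)
            continue
        if char in ".,":
            counts[ref_base.upper()] += 1
        elif char != "*":  # '*' is a placeholder for a deleted base
            counts[char.upper()] += 1
        i += 1
    return counts


# Position 277 from the slide: reference T, depth 22.
counts_277 = count_bases("T", "..CCggC,C,.C.,,CC,..g.")
```

For position 277 this gives 12 reads supporting the reference T, 7 supporting C and 3 supporting G.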

Page 27: ECCB10 talk - Nextgen sequencing and SNPs

GATK:

java \
  -Xmx6g \
  -jar /path_to/GenomeAnalysisTK.jar \
  -l INFO \
  -R human_b36_plus.fasta \
  -I input.bam \
  -T UnifiedGenotyper \
  --heterozygosity 0.001 \
  -pl Solexa \
  -varout output.vcf \
  -vf VCF \
  -mbq 20 \
  -mmq 10 \
  -stand_call_conf 30.0 \
  --DBSNP dbsnp_129_b36_plus.rod

Page 28: ECCB10 talk - Nextgen sequencing and SNPs

samtools:

samtools pileup \
  -vcs \
  -r 0.001 \
  -l CCDS.txt \
  -f human_b36_plus.fasta \
  input.bam \
  output.pileup

Page 29: ECCB10 talk - Nextgen sequencing and SNPs

VCF file

##fileformat=VCFv3.3
##FILTER=DP,"DP < 3 || DP > 1200"
##FILTER=QUAL,"QUAL < 25.0"
##FILTER=SnpCluster,"SNPs found in clusters"
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=DB,0,Flag,"dbSNP Membership"
##INFO=DP,1,Integer,"Total Depth"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total nubmer of reads>"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##annotatorReference=human_b36_plus.fasta
##reference=human_b36_plus.fasta
##source=VariantAnnotator
##source=VariantFiltration
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam
1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .

Page 30: ECCB10 talk - Nextgen sequencing and SNPs

VCF file

Three parts:
– header: the lines starting with ##
– column header: the line starting with #CHROM
– data: one line per variant

Page 31: ECCB10 talk - Nextgen sequencing and SNPs

VCF file

INFO
DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE
DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE

FORMAT      a_a:bwa057_b:picard.bam
GT:DP:GQ    1/1:3:36.00
GT:DP:GQ    1/1:6:45.00
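The INFO column is a semicolon-separated list of `key=value` pairs and bare flags (such as `DB`), and the FORMAT column names the colon-separated fields of each sample column. A small sketch of how these could be unpacked; the function names are illustrative, and values are left as strings rather than converted to the types declared in the header.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; bare flags map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_sample(format_keys, sample):
    """Pair the FORMAT keys with the values of one sample column."""
    return dict(zip(format_keys.split(":"), sample.split(":")))


# First data line from the slide.
info = parse_info("DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE")
sample = parse_sample("GT:DP:GQ", "1/1:3:36.00")
```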

Page 32: ECCB10 talk - Nextgen sequencing and SNPs

Pileup => VCF

Custom scripts, then annotate:

java -Xmx10g \
  -jar GenomeAnalysisTK.jar \
  -T VariantAnnotator \
  --assume_single_sample_reads sample \
  -R human_b36_plus.fasta \
  -D dbsnp_129_b36_plus.rod \
  -I input.bam \
  -B variant,VCF,unannotated.vcf \
  -o annotated.vcf \
  -A AlleleBalance \
  -A MappingQualityZero \
  -A LowMQ \
  -A RMSMappingQuality \
  -A HaplotypeScore \
  -A QualByDepth \
  -A DepthOfCoverage \
  -A HomopolymerRun

Page 33: ECCB10 talk - Nextgen sequencing and SNPs

5. Filtering

• Aim: to reduce number of false positives
• Options:
  – Depth of coverage
  – Mapping quality
  – SNP clusters
  – Allelic balance
  – Number of reads with mq0

Page 34: ECCB10 talk - Nextgen sequencing and SNPs

java \
  -Xmx4g \
  -jar GenomeAnalysisTK.jar \
  -T VariantFiltration \
  -R human_b36_plus.fasta \
  -o output.vcf \
  -B variant,VCF,input.vcf \
  --clusterWindowSize 10 \
  --filterExpression 'DP < 3 || DP > 1200' \
  --filterName 'DP' \
  --filterExpression 'QUAL < #{qual_cutoff}' \
  --filterName 'QUAL' \
  --filterExpression 'AB > 0.75 && DP > 40' \
  --filterName 'AB'
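The three filter expressions in this command can be read as plain boolean tests on each variant record. The sketch below restates them in Python for illustration only: it is not how GATK evaluates them, the record is a hypothetical dictionary, and the default cutoff of 25.0 is taken from the `QUAL < 25.0` filter in the VCF header shown earlier, since `#{qual_cutoff}` is a placeholder in the command. The SNP-cluster filter (`--clusterWindowSize`) is not covered.

```python
def failed_filters(record, qual_cutoff=25.0):
    """Return the names of the filters a variant record fails.

    record: dict with numeric 'DP' and 'QUAL', and optionally 'AB'.
    """
    failed = []
    depth = record["DP"]
    if depth < 3 or depth > 1200:  # 'DP < 3 || DP > 1200'
        failed.append("DP")
    if record["QUAL"] < qual_cutoff:  # 'QUAL < #{qual_cutoff}'
        failed.append("QUAL")
    allele_balance = record.get("AB")  # only annotated for heterozygous calls
    if allele_balance is not None and allele_balance > 0.75 and depth > 40:
        failed.append("AB")  # 'AB > 0.75 && DP > 40'
    return failed


# First record from the VCF slide: DP=3, QUAL=36.00, no AB annotation.
passing = failed_filters({"DP": 3, "QUAL": 36.0})
```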

Page 35: ECCB10 talk - Nextgen sequencing and SNPs

Filtering - QC metrics (1)

Transition/transversion ratio
– Random: Ti/Tv = 0.5
– Whole genome: 2.0-2.1
– Exome: 3-3.5
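The random expectation of 0.5 follows from counting: of the 12 possible single-base substitutions, 4 are transitions (A<->G, C<->T) and 8 are transversions. A short sketch that computes the ratio for a list of (ref, alt) pairs; the function names are illustrative, and the ratio is undefined when there are no transversions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snps):
    """Ti/Tv for an iterable of (ref, alt) pairs; needs at least one transversion."""
    snps = list(snps)
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    return transitions / transversions


# Every possible substitution once: 4 transitions, 8 transversions.
all_changes = [(ref, alt) for ref in "ACGT" for alt in "ACGT" if ref != alt]
random_ratio = ti_tv_ratio(all_changes)
```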

Page 36: ECCB10 talk - Nextgen sequencing and SNPs

Filtering - QC metrics (2)

Number of novel SNPs
Exome: total 20k - 25k; novel 1-3k

Page 37: ECCB10 talk - Nextgen sequencing and SNPs
Page 38: ECCB10 talk - Nextgen sequencing and SNPs

Combining discovery pipelines

• Mapper: MAQ/bwa/stampy/…
• BaseQ recalibration? Local realignment?
• SNP caller: GATK/samtools/SOAPsnp
• Priors for SNP calling: heterozygosity (whole genome, exome, dbSNP)
• Filtering

Page 39: ECCB10 talk - Nextgen sequencing and SNPs

Combining discovery pipelines

[Figure: ROC plot; x-axis: false positives, y-axis: true positives]

Page 40: ECCB10 talk - Nextgen sequencing and SNPs

Combining discovery pipelines

[Figure: plot comparing single pipelines with combinations of pipelines; labels: “better”, “single”, “combinations”]

Page 41: ECCB10 talk - Nextgen sequencing and SNPs

Indels

Still more tricky than SNPs
– samtools/dindel/GATK
– Sample of 10 individuals: on average per individual:
  • 2 novel functional high-quality SNPs
  • 18 novel functional high-quality indels

“I trust manual interpretation of the reads more than the basic quality parameters we use”

Page 42: ECCB10 talk - Nextgen sequencing and SNPs

4 snp_1 STOP_GAINED
1 snp_2 STOP_LOST
1 snp_3 STOP_GAINED
1 snp_4 ESSENTIAL_SPLICE_SITE,INTRONIC
1 snp_5 ESSENTIAL_SPLICE_SITE,INTRONIC
2 snp_6 STOP_GAINED
2 snp_7 STOP_GAINED
1 snp_8 STOP_GAINED
1 snp_9 STOP_GAINED
1 snp_10 STOP_GAINED
1 snp_11 STOP_GAINED
1 snp_12 STOP_GAINED
1 snp_13 STOP_LOST
1 snp_14 STOP_GAINED

Page 43: ECCB10 talk - Nextgen sequencing and SNPs

Same 14 SNPs as on the previous slide, plus:

178 indels FRAMESHIFT_CODING

Page 44: ECCB10 talk - Nextgen sequencing and SNPs

Conclusions

• Different tools exist, and new ones keep being created
• Best to combine (intersect) the results from different pipelines
• Genome Analysis ToolKit (GATK) provides useful bam-file processing tools:
  – Realignment around indels
  – Base quality recalibration
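Intersecting call sets is simple once each call is reduced to a comparable key. A minimal sketch, assuming each pipeline's calls are available as (chrom, pos, ref, alt) tuples; the function name and the example calls are illustrative (the first two positions are borrowed from the VCF slide, the others are made up).

```python
def intersect_calls(*call_sets):
    """Return the calls present in every call set, sorted by chrom and position."""
    keys = [set(calls) for calls in call_sets]
    return sorted(set.intersection(*keys))


# Illustrative output of two hypothetical pipelines.
pipeline_a = [("1", 856182, "G", "A"), ("1", 866362, "A", "G"), ("1", 900000, "C", "T")]
pipeline_b = [("1", 856182, "G", "A"), ("1", 866362, "A", "G"), ("1", 910000, "T", "C")]
consensus = intersect_calls(pipeline_a, pipeline_b)
```

Only the two calls made by both pipelines survive. The intersection lowers false positives at the cost of dropping calls that a single pipeline got right.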

Page 45: ECCB10 talk - Nextgen sequencing and SNPs

Use in resequencing

• Identify SNPs/indels
• Consequences (loss-of-function?)
• Prevalence in cases/controls
• Model:
  – Dominant: any het
  – Recessive: hom-nonref or compound het
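The two models can be expressed as a test on the genotypes of the candidate variants found in one gene of one individual. This is a sketch under several assumptions: genotypes are unphased strings such as "0/1" or "1/1"; two hets in the same gene are counted as a compound het even though, without phasing, they could sit on the same haplotype; and for the dominant model a hom-nonref genotype is also accepted, since it carries the allele too. The function name is made up.

```python
def gene_fits_model(genotypes, model):
    """Check the candidate-variant genotypes in one gene against an inheritance model."""
    hets = sum(1 for gt in genotypes if gt in ("0/1", "1/0"))
    hom_nonref = sum(1 for gt in genotypes if gt == "1/1")
    if model == "dominant":
        # Slide: "any het"; hom-nonref accepted as well (assumption of this sketch).
        return hets + hom_nonref >= 1
    if model == "recessive":
        # Hom-nonref, or two hets taken as a compound het (phase not checked).
        return hom_nonref >= 1 or hets >= 2
    raise ValueError(f"unknown model: {model}")


single_het_dominant = gene_fits_model(["0/1"], "dominant")
single_het_recessive = gene_fits_model(["0/1"], "recessive")
```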

Page 46: ECCB10 talk - Nextgen sequencing and SNPs

References

• Chan E. In: Single Nucleotide Polymorphisms, Methods in Molecular Biology 578 (2009)
• McKenna et al. Genome Res 20:1297-1303 (2010)
• Li H & Durbin R. Bioinformatics 25:1754-1760 (2009)
• Li H et al. Bioinformatics 25:2078-2079 (2009)
• Li H et al. Genome Res 18:1851-1858 (2008)

Page 47: ECCB10 talk - Nextgen sequencing and SNPs

Questions?