Indel-based Realignment - UCLA · 2017. 2. 24.
Indel-based Realignment
Improving the original alignments of the reads based on multiple sequence (re-)alignment
[Figure: GATK Best Practices workflow for germline variant discovery, in three phases: Data Pre-processing >> Variant Discovery >> Callset Refinement. Legend: Non-GATK.
Data Pre-processing: Raw Reads, Map to Reference (BWA mem), Mark Duplicates & Sort (Picard), Indel Realignment, Base Recalibration, Analysis-Ready Reads.
Variant Discovery: Analysis-Ready Reads, Var. Calling (HC in ERC mode), Genotype Likelihoods, Joint Genotyping, Raw Variants (SNPs, Indels).
Callset Refinement: Variant Recalibration (separately per variant type), Genotype Refinement, Variant Annotation, Variant Evaluation (look good? use in project / troubleshoot), Analysis-Ready Variants (SNPs & Indels).]
You are here in the GATK Best Practices workflow for germline variant discovery
InDels = insertion/deletion
Insertion: Ref seq AGCTAGGGTC, Sample seq AGCTAGGGTC with TTC inserted
Deletion: Ref seq AGCTAGGGTC, Sample seq AGCGGTC
The problem we want to fix
Alignment by BWA:
• Several consecutive “SNPs” only found on reads ending on the right of the homopolymer
• Several consecutive “SNPs” only found on reads ending on the left of the homopolymer
• 7 bp “T” homopolymer run
After realignment:
• Adding a 1-bp insertion brings sanity to the entire alignment
Why does this happen?
• Mappers cannot “see” indels near ends of reads
• Because mismatches are “cheaper” than a gap in this context
Mismatch = -1, Open gap = -3
Ref: T A C C C A T T T T T T T C T A A A A G C T
BWA: C C A T T T T T T C T A A A A A C T
IR: C C A - T T T T T T C T A A A A A C T
[Figure: ref / del / ins read examples: CATGCA CCA TGCA G]
✓ Local realignment around indels -> most parsimonious alignment
✓ Improves accuracy of several downstream processing steps
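The penalty arithmetic behind "mismatches are cheaper than a gap" can be checked on the Ref / BWA / IR rows of this slide. This is a minimal sketch, not the mapper's real scoring code: the column placement of the read against the reference was lost in the slide extraction and is reconstructed here as an assumption, and since the slide gives no gap-extension cost, only the gap-open penalty is charged.

```python
MISMATCH = -1  # penalty per mismatching base (from the slide)
GAP_OPEN = -3  # penalty per opened gap (from the slide)


def penalty(ref_aln, read_aln):
    """Sum penalties over two equal-length aligned strings ('-' marks a gap).

    Assumption: a run of gap characters is charged once (gap open only).
    """
    assert len(ref_aln) == len(read_aln)
    total = 0
    in_gap = False
    for r, q in zip(ref_aln, read_aln):
        if r == "-" or q == "-":
            if not in_gap:
                total += GAP_OPEN
            in_gap = True
        else:
            in_gap = False
            if r != q:
                total += MISMATCH
    return total


# Ungapped placement (assumed to be what the mapper reports):
# the read end is forced onto the reference, creating two false mismatches.
REF_UNGAPPED = "CATTTTTTTCTAAAAGCT"
READ_UNGAPPED = "CCATTTTTTCTAAAAACT"

# Placement with a 1-bp gap in the read (the IR row of the slide).
REF_GAPPED = "CCATTTTTTTCTAAAAGCT"
READ_GAPPED = "CCA-TTTTTTCTAAAAACT"

if __name__ == "__main__":
    print(penalty(REF_UNGAPPED, READ_UNGAPPED))  # -3: three mismatches
    print(penalty(REF_GAPPED, READ_GAPPED))      # -4: one gap plus one mismatch
```

Under these assumptions the ungapped placement scores -3 and the gapped one -4, so a mapper that scores each read on its own prefers the mismatches even though the gapped alignment is the more parsimonious explanation across the whole pile.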
How do we identify where realignment is needed?
• Known sites (e.g. dbSNP, 1000 Genomes)
• Indels seen in original alignments (in CIGARs)
• Sites where evidence suggests a hidden indel
  - Entropy calculation identifies “messy areas”
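One way to picture the "messy areas" idea is a per-column Shannon entropy over the bases piled up at each reference position. The sketch below is illustrative only: it is not the calculation GATK performs, and the function names and the 0.5-bit threshold are hypothetical choices made for this example.

```python
import math
from collections import Counter


def column_entropy(bases):
    """Shannon entropy (in bits) of the bases observed in one pileup column."""
    counts = Counter(bases)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def messy_columns(pileup, threshold=0.5):
    """Return indices of columns whose entropy exceeds the (hypothetical) threshold."""
    return [i for i, col in enumerate(pileup)
            if col and column_entropy(col) > threshold]


if __name__ == "__main__":
    # Each string holds the read bases covering one reference position.
    pileup = ["AAAA", "AACC", "TTTT"]
    print(messy_columns(pileup))
```

A column where all reads agree has zero entropy, while a column split evenly between two bases has one bit, so a run of high-entropy columns near a homopolymer is the kind of signal that would suggest a hidden indel.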
How does the realignment algorithm work?
1. Find the best alternate consensus sequence that, together with the reference, best fits the reads in a pile (maximum of 1 indel)
2. Score for alternate consensus = total sum of quality scores of mismatching bases
3. If the best alternate consensus is sufficiently better than the original alignments (using LOD score threshold) -> accept proposed realignment
[Figure: Ref: AAGAGTAG; AAG---AGTAG; AAGAGTAG. Labels: Three adjacent SNPs; Read pile consistent with a 3 bp insertion; Read pile consistent with the reference sequence. Realigning determines which is better.]
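The scoring and acceptance steps can be sketched on toy data. This is a simplified illustration under stated assumptions, not IndelRealigner's implementation: reads are placed ungapped at their best offset, each read may side with whichever sequence fits it better, the summed Phred qualities are divided by 10 to get a LOD-like value, and the 5.0 threshold, the sequences and the function names are all hypothetical.

```python
def mismatch_quality_sum(read, quals, haplotype):
    """Lowest sum of base qualities at mismatching positions over all
    ungapped placements of the read on the haplotype."""
    best = None
    for start in range(len(haplotype) - len(read) + 1):
        window = haplotype[start:start + len(read)]
        score = sum(q for base, q, h in zip(read, quals, window) if base != h)
        if best is None or score < best:
            best = score
    return best


def evaluate_consensus(reads, ref, alt, lod_threshold=5.0):
    """Compare the pile against the reference alone versus reference plus
    one alternate consensus; return (improvement, accepted)."""
    ref_total = 0
    alt_total = 0
    for seq, quals in reads:
        against_ref = mismatch_quality_sum(seq, quals, ref)
        against_alt = mismatch_quality_sum(seq, quals, alt)
        ref_total += against_ref
        # Each read keeps whichever of the two sequences explains it better.
        alt_total += min(against_ref, against_alt)
    improvement = (ref_total - alt_total) / 10.0  # assumed Phred -> log10 scaling
    return improvement, improvement > lod_threshold


# Toy data: ALT is REF with a hypothetical 3-bp insertion (CAT).
REF = "TTAAGAGTAGCC"
ALT = "TTAAGCATAGTAGCC"
READS = [
    ("AAGCATAG", [30] * 8),  # carries the insertion
    ("GCATAGTA", [30] * 8),  # carries the insertion
    ("AAGAGTAG", [30] * 8),  # matches the reference
]

if __name__ == "__main__":
    print(evaluate_consensus(READS, REF, ALT))
```

With the toy pile, the two insertion-carrying reads mismatch the reference at several high-quality bases but fit the alternate consensus exactly, so the proposed realignment is accepted; a pile made only of reference-matching reads gives no improvement and is left alone.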
Indel Realignment steps/tools
• Identify what regions need to be realigned ➔ RealignerTargetCreator
• Perform the actual realignment ➔ IndelRealigner
RealignerTargetCreator
• Pre-processing step to find intervals that may need realignment
• Input BAM file not necessary if processing only at known indels
• Using a list of known indels will both speed up processing and improve accuracy, but is not required
[Diagram: Input BAM and Known Sites -> RealignerTargetCreator -> Target Intervals -> IndelRealigner -> Realigned BAM]
java -jar GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R human.fasta \
    -I original.bam \
    -known indels.vcf \
    -o realigner.intervals
IndelRealigner
• Attempts realignment at RealignerTargetCreator target intervals
• Must use same input file(s) used in RealignerTargetCreator step
• Processing options
  - Only at known indels: much faster, accurate for ~90-95% of indels
  - At indels seen in the original BAM alignments: the recommended mode
  - Using full Smith-Waterman realignment: most accurate, but heavy computational cost and not really necessary with the new technologies
[Diagram: Input BAM, Target Intervals and Known Sites -> IndelRealigner -> Realigned BAM]
java -jar GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R human.fasta \
    -I original.bam \
    -known indels.vcf \
    -targetIntervals realigner.intervals \
    -o realigned.bam
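The two steps are meant to be chained: the intervals file written by RealignerTargetCreator is the -targetIntervals input of IndelRealigner, and both steps read the same BAM. The sketch below only assembles the two command lines using the flags shown on the slides; the helper names are hypothetical, the file names are the slide's placeholders, and actually running the commands is left commented out because it needs a local GATK 3 jar.

```python
import shlex

GATK_JAR = "GenomeAnalysisTK.jar"  # path to the jar, as on the slides


def target_creator_cmd(ref, bam, known, intervals):
    """Command line for the target-creation step (flags as on the slide)."""
    return ["java", "-jar", GATK_JAR, "-T", "RealignerTargetCreator",
            "-R", ref, "-I", bam, "-known", known, "-o", intervals]


def indel_realigner_cmd(ref, bam, known, intervals, out_bam):
    """Command line for the realignment step (flags as on the slide)."""
    return ["java", "-jar", GATK_JAR, "-T", "IndelRealigner",
            "-R", ref, "-I", bam, "-known", known,
            "-targetIntervals", intervals, "-o", out_bam]


def realignment_commands(ref, bam, known,
                         intervals="realigner.intervals",
                         out_bam="realigned.bam"):
    """Return both commands, sharing the same BAM and intervals file."""
    return [
        target_creator_cmd(ref, bam, known, intervals),
        indel_realigner_cmd(ref, bam, known, intervals, out_bam),
    ]


if __name__ == "__main__":
    for cmd in realignment_commands("human.fasta", "original.bam", "indels.vcf"):
        print(shlex.join(cmd))
        # To execute for real (requires Java and the GATK 3 jar):
        # import subprocess; subprocess.run(cmd, check=True)
```

Building both commands from one function makes it harder to violate the "must use same input file(s)" rule by accident.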
This is what a realigned BAM looks like
[Figure: Before / After realignment, for old data (lower quality) and new data (higher quality)]
DePristo, M., Banks, E., Poplin, R. et al., A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
Can I see the effects of realignment?
• IndelRealigner changes the CIGAR string of realigned reads but maintains the original CIGAR (with OC tag)
  -> Can grep for realigned regions and view in genome browser (IGV)
20GAVAAXX100126:1:67:10041:180738 99 20 10011431 70 87M1D14M = 10011720 390
TTAAATGTGTTTATCTATTGTTCTACTATTCAGTTACCTGATTATAAAATCAAAGATTATTTCATGAAACTCAGTACCCCTTCAGGGAAAAAAAAAAAAAT
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGG X0:i:1 X1:i:0 MC:Z:101M OC:Z:101M PG:Z:MarkDuplicates RG:Z:20GAV.1XG:i:0 AM:i:37
NM:i:1SM:i:37 XM:i:1 XO:i:0
BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@cccddc``a`^\[Y MQ:i:60 XT:A:
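Instead of grepping, the same check can be scripted: a read was realigned if it carries an OC tag whose value differs from the current CIGAR. The sketch below parses one tab-separated SAM line by hand; the function name is hypothetical, and the test record reuses the read name, position, CIGAR and OC tag from the slide but replaces the sequence and qualities with `*` for brevity.

```python
def realignment_info(sam_line):
    """Return (current CIGAR, original CIGAR or None, realigned?) for one SAM line."""
    fields = sam_line.rstrip("\n").split("\t")
    cigar = fields[5]  # sixth mandatory SAM column
    tags = {}
    for tag in fields[11:]:  # optional TAG:TYPE:VALUE fields
        name, _type, value = tag.split(":", 2)
        tags[name] = value
    original = tags.get("OC")
    return cigar, original, original is not None and original != cigar


# Shortened record modelled on the slide (SEQ and QUAL replaced by '*').
RECORD = "\t".join([
    "20GAVAAXX100126:1:67:10041:180738", "99", "20", "10011431", "70",
    "87M1D14M", "=", "10011720", "390", "*", "*",
    "MC:Z:101M", "OC:Z:101M", "NM:i:1",
])

if __name__ == "__main__":
    print(realignment_info(RECORD))
```

For the slide's read the current CIGAR is 87M1D14M while OC holds the original 101M, which is the signature of a read whose mismatches were replaced by a 1-bp deletion.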
Is realignment still necessary with the latest software?
• Variant callers with reassembly step (HaplotypeCaller, MuTect2, Platypus) do not require indel realignment
• BUT potential improvement for Base Quality Score Recalibration when run on realigned BAM files (artifactual SNPs are replaced with real indels).
• Also still useful for legacy tools
  - UnifiedGenotyper
  - MuTect1
Further reading
http://www.broadinstitute.org/gatk/guide/best-practices
http://www.broadinstitute.org/gatk/guide/article?id=38
https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php
https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php