marking duplicates - university of california, los angeles · 2017-02-24 · on-gatk mark...
Marking duplicates
Removing non-independent observations
[Figure: GATK Best Practices workflow for germline variant discovery, in three phases: Data Pre-processing >> Variant Discovery >> Callset Refinement. Data Pre-processing: Raw Reads, Map to Reference (BWA mem, non-GATK), Mark Duplicates & Sort (Picard, non-GATK), Indel Realignment, Base Recalibration, Analysis-Ready Reads. Variant Discovery: Analysis-Ready Reads, Var. Calling (HC in ERC mode), Genotype Likelihoods, Joint Genotyping, Raw Variants (SNPs, Indels). Callset Refinement: Variant Recalibration (separately per variant type), Genotype Refinement, Variant Annotation, Variant Evaluation (look good? use in project / troubleshoot), Analysis-Ready Variants (SNPs & Indels).]
You are here in the GATK Best Practices workflow for germline variant discovery
Why mark duplicates?
[Figure: mapped reads stacked against the reference, before and after the Mark duplicates step; the legend marks a sequencing error propagated in duplicates]
• Duplicates are sets of read pairs that have the same unclipped alignment start and unclipped alignment end
• They’re suspected to be non-independent measurements of a sequence
• Sampled from the exact same template of DNA
• Violates assumptions of variant calling
• What’s more, errors in sample/library prep will get propagated to all the duplicates
• Just pick the “best” copy: this mitigates the effects of errors
How do duplication events arise?
PCR duplicates
Optical duplicates
Read names have the following form: @identifier:lane:tile:x:y
http://www.slideshare.net/jandot/next-generation-sequencing-course-part-2-sequence-mapping
http://www.slideshare.net/cosentia/illumina-gaiix-for-high-throughput-sequencing
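The read-name layout is what makes optical duplicates detectable: two reads from the same lane and tile whose x/y coordinates are very close were probably read from one cluster. A minimal sketch, assuming names follow exactly the five-field form shown on the slide; the helper names, the sample read names, and the 100-pixel threshold are illustrative assumptions, not values from the talk.

```python
from typing import NamedTuple


class ReadLocation(NamedTuple):
    identifier: str
    lane: int
    tile: int
    x: int
    y: int


def parse_read_name(name: str) -> ReadLocation:
    """Split '@identifier:lane:tile:x:y' into its fields."""
    identifier, lane, tile, x, y = name.lstrip("@").split(":")
    return ReadLocation(identifier, int(lane), int(tile), int(x), int(y))


def looks_optical(a: str, b: str, max_pixel_distance: int = 100) -> bool:
    """True if both reads sit on the same lane and tile, close together.

    max_pixel_distance is an assumed threshold chosen for illustration.
    """
    ra, rb = parse_read_name(a), parse_read_name(b)
    if (ra.lane, ra.tile) != (rb.lane, rb.tile):
        return False
    return (abs(ra.x - rb.x) <= max_pixel_distance
            and abs(ra.y - rb.y) <= max_pixel_distance)


# Hypothetical read names in the five-field form.
print(looks_optical("@run1:3:42:1000:2000", "@run1:3:42:1010:1995"))  # True
print(looks_optical("@run1:3:42:1000:2000", "@run1:3:43:1010:1995"))  # False
```

Real read names often carry more fields than this, so the parser would need adjusting for a given instrument.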
Optical and PCR duplication events arise at different rates as a sequencing experiment proceeds
[Figure panels: PCR duplicates; Optical duplicates]
How do we identify duplicate reads?
• Dupes might come from the same input DNA template, so we will assume that reads will have the same start position on the reference
– “Where was the first base that was sequenced?”
– For paired-end (PE) reads, same start for both ends
• Identify duplicate sets, then choose a representative read based on base quality scores and other criteria
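Choosing a representative can be sketched as scoring each read in a duplicate set and keeping the highest scorer. The example below assumes the score is the sum of base qualities at or above a cutoff; the cutoff of 15, the tuple layout, and the function names are assumptions for illustration, and the "other criteria" mentioned on the slide are not modelled.

```python
def quality_score(base_qualities, min_quality=15):
    """Sum the base qualities that reach min_quality (assumed cutoff)."""
    return sum(q for q in base_qualities if q >= min_quality)


def mark_duplicates(duplicate_set):
    """Return {read_name: is_duplicate} for one duplicate set.

    duplicate_set is a list of (read_name, base_qualities) tuples.
    The best-scoring read is kept as the representative; ties go to
    the read that appears first.
    """
    best_name, _ = max(duplicate_set, key=lambda r: quality_score(r[1]))
    return {name: name != best_name for name, _ in duplicate_set}


# Hypothetical qualities for three reads from one duplicate set.
reads = [
    ("r1", [30, 30, 12, 30]),   # score 90 (the 12 is below the cutoff)
    ("r3", [35, 35, 35, 20]),   # score 125
    ("r5", [10, 10, 40, 40]),   # score 80
]
print(mark_duplicates(reads))
# {'r1': True, 'r3': False, 'r5': True}
```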
But there’s a catch (or two)…
• BWA sometimes “clips” bases from the ends of the alignment (when the alignment there is poor)
• Need to use SAM flags + CIGAR string to determine the unclipped 5’ end
• Fragments mapped to the reverse strand are specified by their 3’ position, instead of 5’
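Both catches can be handled with a small calculation on the SAM fields. A minimal sketch, assuming 1-based POS values and standard CIGAR operators; the function and constant names are illustrative, not taken from Picard.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REVERSE_STRAND = 0x10          # SAM flag bit: read maps to the reverse strand
CONSUMES_REFERENCE = set("MDN=X")
CLIPS = set("SH")


def unclipped_five_prime(pos, cigar, flag):
    """Return the 1-based reference position of the read's unclipped 5' end.

    pos is the SAM POS field (leftmost aligned base), cigar the CIGAR string.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if flag & REVERSE_STRAND:
        # 5' end is on the right: last aligned base plus trailing clips.
        aligned_end = pos + sum(n for n, op in ops if op in CONSUMES_REFERENCE) - 1
        trailing = 0
        for n, op in reversed(ops):
            if op not in CLIPS:
                break
            trailing += n
        return aligned_end + trailing
    # Forward strand: 5' end is on the left, so subtract leading clips.
    leading = 0
    for n, op in ops:
        if op not in CLIPS:
            break
        leading += n
    return pos - leading


print(unclipped_five_prime(3, "2S5M", 0))      # 1
print(unclipped_five_prime(1, "5M2H", 0x10))   # 7
```

A forward read whose first two bases were soft-clipped and a reverse read whose last two bases were hard-clipped both recover the position where sequencing actually started.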
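Identify duplicates using orientation + “unclipped” 5’ position

| Pos | Strand | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-----|--------|---|---|---|---|---|---|---|---|---|
| Ref |         | T | A | G | C | C | G | A | T | C |
| r1  | forward | T | A | G | C | C | G | A |   |   |
| r2  | reverse | T | A | G | C | C | G | A |   |   |
| r3  | forward | T | A | – | C | C | (A) G | A |   |   |
| r4  | reverse | T | A | G | C | C | H | H |   |   |
| r5  | forward | T | A | G | C | C | G | A | T | C |
| r6  | forward | S | S | G | C | C | G | A |   |   |
| r7  | forward |   |   | G | C | C | G | A |   |   |

Strand is shown by colour on the slide: blue maps to the forward strand, red maps to the reverse strand. S (soft-clipped) and H (hard-clipped) bases are grey; – is a deleted base and (A) an inserted base. Underlined is the expected 5’ start of the read, given the mapping.
What are the duplicate sets?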
So… what are the duplicate sets?
☞ r1, r3, r5, r6 (start at position 1)
☞ r2, r4 (start at position 7)
☞ r7 (starts at position 3)
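The three duplicate sets can be reproduced by grouping reads on the pair (orientation, unclipped 5’ position). This is a sketch under assumptions: the POS values, CIGAR strings, and strands below are inferred from the slide's example (r2 and r4 on the reverse strand), and the helper name is hypothetical.

```python
import re
from collections import defaultdict
from itertools import takewhile


def five_prime(pos, cigar, reverse):
    """Unclipped 5' position from 1-based POS, CIGAR, and strand."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    if reverse:
        ops.reverse()          # trailing clips become leading clips
    clipped = sum(n for n, _ in takewhile(lambda o: o[1] in "SH", ops))
    if not reverse:
        return pos - clipped
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    return pos + ref_len - 1 + clipped


# (name, POS, CIGAR, maps to reverse strand?): values inferred from the example
reads = [
    ("r1", 1, "7M", False),
    ("r2", 1, "7M", True),
    ("r3", 1, "2M1D2M1I2M", False),
    ("r4", 1, "5M2H", True),
    ("r5", 1, "9M", False),
    ("r6", 3, "2S5M", False),
    ("r7", 3, "5M", False),
]

duplicate_sets = defaultdict(list)
for name, pos, cigar, reverse in reads:
    key = ("-" if reverse else "+", five_prime(pos, cigar, reverse))
    duplicate_sets[key].append(name)

for key, names in duplicate_sets.items():
    print(key, names)
# ('+', 1) ['r1', 'r3', 'r5', 'r6']
# ('-', 7) ['r2', 'r4']
# ('+', 3) ['r7']
```

Note that r7 shares its aligned start with r6 but not its unclipped start, which is why it ends up alone.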
So now we have mapped, sorted, and deduped reads
[Figure panels: Showing duplicate reads; Hiding duplicate reads]
What this means for downstream analysis
• Duplicate status is indicated in the SAM flag
• Duplicates are not removed, just tagged (unless you request removal)
• Downstream tools can read the tag and choose to ignore those reads
• Most GATK tools ignore duplicates by default
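A downstream tool only has to test one bit of the FLAG column to honour the tag. A minimal sketch of that check on plain-text SAM records; the two records and the function names are made up for illustration.

```python
DUPLICATE = 0x400  # SAM flag bit: PCR or optical duplicate


def is_duplicate(sam_line):
    """Check the duplicate bit in the FLAG column (2nd field) of a SAM record."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & DUPLICATE)


def drop_duplicates(sam_lines):
    """Yield header lines and non-duplicate records, skipping marked duplicates."""
    for line in sam_lines:
        if line.startswith("@") or not is_duplicate(line):
            yield line


# Hypothetical records: r1 has flag 1024 (0x400), r3 has flag 0.
records = [
    "r1\t1024\tchr1\t1\t60\t7M\t*\t0\t0\tTAGCCGA\tIIIIIII",
    "r3\t0\tchr1\t1\t60\t7M\t*\t0\t0\tTAGCCGA\tIIIIIII",
]
kept = list(drop_duplicates(records))
print([line.split("\t")[0] for line in kept])   # ['r3']
```

Because the reads are tagged rather than deleted, the same file can still serve an analysis that wants the duplicates back.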
Use cases where you may NOT want to mark duplicates
• Amplicon sequencing -> all reads start at the same position by design
• RNAseq allele-specific expression analysis (ASEReadCounter can disable DuplicateFilter)
Add-on: Predicting the complexity of a sequencing experiment
Complexity analysis depends on:
• Estimated library size
• Return on Investment (ROI) calculations
Estimation of library size and duplication in Picard
Estimated fraction of duplicates
Assumptions
● all reads are drawn from the same Poisson distribution Po(λ)
● the occurrence of duplication events depends on the underlying concentration of inserts in the library
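Under these assumptions, library size can be estimated from how many of the sequenced read pairs turned out to be unique. A minimal sketch, assuming the Lander-Waterman relation C/X = 1 − exp(−N/X), where X is the number of distinct molecules in the library, N the number of read pairs, and C the number of unique read pairs; Picard's documentation describes its estimate in these terms. The function names, the bisection solver, and the example counts are illustrative and not Picard's code.

```python
import math


def unique_pairs_expected(library_size, read_pairs):
    """Expected distinct read pairs C = X * (1 - exp(-N / X))."""
    return -library_size * math.expm1(-read_pairs / library_size)


def estimate_library_size(read_pairs, unique_read_pairs, iterations=200):
    """Solve C/X = 1 - exp(-N/X) for X by bisection.

    read_pairs (N) and unique_read_pairs (C) are observed counts.
    """
    if not 0 < unique_read_pairs < read_pairs:
        raise ValueError("need 0 < unique_read_pairs < read_pairs")
    lo = hi = float(unique_read_pairs)   # at X = C the model predicts fewer than C
    while unique_pairs_expected(hi, read_pairs) < unique_read_pairs:
        hi *= 2                          # grow until the root is bracketed
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if unique_pairs_expected(mid, read_pairs) < unique_read_pairs:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical counts: 1,000,000 read pairs, of which 800,000 are unique.
n, c = 1_000_000, 800_000
size = estimate_library_size(n, c)
print(round(size))
print(f"observed duplicate fraction: {1 - c / n:.2f}")   # 0.20
```

The estimate is undefined when every read pair is unique, which is why the function rejects C = N.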
Active research to improve library size estimation
• Rate of duplication varies with insert size
• Duplication rates also likely vary with GC content
Further reading
http://www.broadinstitute.org/gatk/guide/best-practices
http://broadinstitute.github.io/picard/