ann marie-patch - structural variants and mutation detection using whole genome sequencing
Post on 10-May-2015
Structural variants and mutation detection using whole genome sequencing
2014 Winter School in Mathematical and Computational Biology
Ann‐Marie Patch
Mutation Detection success depends on previous steps
Generalised sequencing workflow
Sample preparation > Library preparation > Sequence generation > Initial data processing > Data analysis
Sample preparation
•DNA
•RNA
•miRNA
•BAC
Library preparation
•Fragmentation
•Size selection
•Target enrichment
•Indexing
Sequence generation
•Platform
•Sequence length
Initial data processing
•Base calling
•Quality assessment
•De novo assembly
•Alignment
Data analysis
•Mutation detection
•Annotation
•Biological interpretation
Mutation cataloguing aids understanding of cancer genetics
International Cancer Genome Consortium projects
sequenced >1500 samples from >600 patients
DNA mutations are normally sensed and repaired or the cell dies
Normal cell division
DNA damage is sensed by the cell
Mechanisms for:
•DNA repair
•Cell cycle check points
•Signalling for cell death
•Signalling for growth
Cell suicide or apoptosis
Cancer cells accumulate mutations
Tumour cell division
DNA damage is NOT sensed by the cell
Disrupted mechanisms for:
•DNA repair
•Cell cycle check points
•Signalling for cell death
•Signalling for growth
Cell suicide or apoptosis
First mutation
Second mutation
Third mutation
Fourth or later mutation
Uncontrolled growth
Cancer sequencing basics
Typically our projects involve the parallel analysis of at least two samples for each patient
Inherited genome sample (germline)
Tumour genome sample
Data analysis
•Mutation detection
•Annotation
•Biological interpretation
Germline variants: seen in both samples
Somatic mutations: specific to tumour sample
The tumour sample contains a mixed population of normal and tumour cells, and also subpopulations of different tumour cells
Mutation detection – finding differences
DNA Mutation detection
•SNV/SNP/Substitutions
•Small insertions and deletions
•Large structural variations
•Copy number aberrations
Cloonan et al 2011
Large structural variants are only detectable from whole genome sequencing
Whole genome paired-end sequencing process recap - library preparation
Genomic DNA
Fragmented DNA
Clean-up DNA fragments
Consistent fragment size distribution
Adaptors added
Sequence reads produced from both ends of each fragment
The distance from the ends of the reads should follow the DNA size distribution
~300 bp
Paired‐end sequence alignment to the reference genome
Reference genome
Paired-end sequences mapped to genome
Coverage depth
Examining how the mapping position and content of the pairs of reads vary across the reference genome allows us to determine mutations and structural rearrangements
Detection software pinpoints differences in your sample from the reference
[Figure: aligned reads compared with the reference. Normal/Germline DNA: germline SNV. Tumour DNA: somatic SNV, somatic deletion, somatic translocation, somatic amplification]
We convert mutation data into positional information and counts using detection software
Somatic mutations that only occur in the tumour are determined
Choosing what software to use to identify mutations
Software list: http://seqanswers.com/wiki/Software/list
Choice can be guided by
•Type of data
•The biological question
•Available computing resources
•Past experience
•Related literature
QCMG DNA mutation detection
•Substitutions: qSNP (in-house tool), GATK (Broad)
•Small insertions and deletions: Pindel (Sanger), GATK (Broad)
•Large structural variations: qSV (in-house tool)
Visualising a germline single nucleotide variant example
Paired-end HiSeq data for Ovarian Cancer patient, Chromosome 11
Grey blocks: 100bp reads matching reference
Small coloured blocks indicate a change from the reference
The reference base is a G
There is an A present in some of the reads
Tumour data
Normal data
Reference sequence
Robinson et al 2011
Pileup analysis produces counts of alleles
Coverage: count of non-duplicate reads that cross any given position
Allele frequency: count of bases at any position
Tumour data: G=36 A=20 (total coverage 56x)
Normal data: G=26 A=33 T=1 (total coverage 60x)
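As a minimal sketch of how pileup counts turn into allele proportions, the snippet below uses the tumour and normal counts quoted on this slide. The function names are illustrative only and are not part of any QCMG tool.

```python
def allele_proportions(counts):
    """Convert per-base read counts at one position into fractions of coverage."""
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


def biallelic_ratio(counts, ref, alt):
    """Alternate-to-reference ratio, written on the slides as 1:x."""
    return counts[alt] / counts[ref]


# Counts taken from the slide (reference base G, alternate base A).
tumour = {"G": 36, "A": 20}
normal = {"G": 26, "A": 33, "T": 1}

for name, counts in (("Tumour", tumour), ("Normal", normal)):
    coverage = sum(counts.values())
    percents = {b: round(100 * f) for b, f in allele_proportions(counts).items()}
    ratio = biallelic_ratio(counts, "G", "A")
    print(f"{name}: coverage {coverage}x, {percents}, ratio 1:{ratio:.2f}")
```

A real pileup would be built from an alignment file after duplicate removal; here the counts are simply typed in.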
Considering error and bias
Allele proportions
Sample   Coverage   Reference allele %   Alternate allele %   Other allele %   Bi-allelic ratio
Tumour   56         G=64%                A=36%                -                1:0.56
Normal   60         G=43%                A=55%                T=2%             1:1.3
Highly skewed representation in tumour samples
Sequencing error (other allele, T=2% in normal)
Diploid organism: expected bi-allelic proportion 50% (ratio 1:1)
Changes in expected proportions can be due to:
•Sample contamination/integrity
•Stochastic sampling/low coverage depth
•Capture or enrichment bias
•Alignment/mapping strategy
•Sequencing error
How should we determine a good call from error?
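One way to get a feel for the "stochastic sampling" item is to ask how often a true 50% heterozygous site would look as skewed as the observed counts through read sampling alone. The sketch below uses a plain binomial model, which is an assumption made for illustration. It ignores tumour purity, copy number, mapping bias and sequencing error, and it is not the method used by qSNP or GATK.

```python
from math import comb


def binomial_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def skew_probability(alt_reads, depth):
    """Two-sided probability of a split at least this uneven at a true 1:1 site."""
    minor = min(alt_reads, depth - alt_reads)
    return min(1.0, 2 * binomial_cdf(minor, depth))


tumour_p = skew_probability(20, 56)  # tumour: A=20 of 56 reads
normal_p = skew_probability(33, 60)  # normal: A=33 of 60 reads
print(f"tumour: {tumour_p:.3f}  normal: {normal_p:.3f}")
```

Under this simple model the normal split is unremarkable, while the tumour split is unlikely to arise from sampling alone, which fits the slide's note that representation in tumour samples is highly skewed.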
How many SNVs would we expect to find?
Human genome (length ~ 3,000,000,000 bases)
Germline changes = ~ 3,000,000 (~1000 mutations per Mb (0.1%))
Ovarian Cancer genome
Somatic mutation = ~6,000 (~2 mutations per Mb)
This number can vary greatly depending upon the type of cancer being sequenced
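The expected counts above follow from simple per-megabase arithmetic, which can be checked with the short sketch below. The genome length and variant counts are the approximate figures from the slide; the helper name is illustrative.

```python
GENOME_LENGTH = 3_000_000_000  # approximate human genome length in bases


def per_megabase(count, genome_length=GENOME_LENGTH):
    """Number of variants per million bases of genome."""
    return count / (genome_length / 1_000_000)


germline_variants = 3_000_000
somatic_mutations = 6_000

print(per_megabase(germline_variants))               # germline variants per Mb
print(per_megabase(somatic_mutations))               # somatic mutations per Mb
print(f"{germline_variants / GENOME_LENGTH:.1%}")    # germline fraction of bases
```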
Filtering of results from mutation detection tools is necessary
Example for sample purity = 64%
Tool   Raw somatic   Filtered somatic   Raw germline   Filtered germline
qSNP   298,388       6,632              4,180,630      3,698,034
GATK   224,839       9,722              4,945,990      4,069,314
Somatic calls: keep between 2-4%
Germline calls: keep between 84-88%
Remember the expected number of somatic mutations ~6,000 and germline variants ~3,000,000
qSNP: in-house, rules-based heuristic tool, sensitive (Kassahn et al 2013)
GATK (unified genotyper): a Bayesian tool (McKenna et al 2010)
The intersect of these tools produces a high confidence SNV call
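The sketch below does two things: it recomputes the fraction of calls kept after filtering from the counts in the table, and it shows the intersect idea as a plain set intersection. Representing a call as a (chromosome, position, reference, alternate) tuple is an assumption for illustration, and the example calls are made up.

```python
def retained_percent(raw, filtered):
    """Percentage of raw calls that survive filtering."""
    return 100.0 * filtered / raw


# (raw, filtered) counts from the table.
counts = {
    "qSNP": {"somatic": (298_388, 6_632), "germline": (4_180_630, 3_698_034)},
    "GATK": {"somatic": (224_839, 9_722), "germline": (4_945_990, 4_069_314)},
}
for tool, classes in counts.items():
    for label, (raw, filtered) in classes.items():
        print(f"{tool} {label}: kept {retained_percent(raw, filtered):.1f}%")


def high_confidence(calls_a, calls_b):
    """Calls reported by both tools."""
    return calls_a & calls_b


# Hypothetical calls, keyed by (chromosome, position, ref, alt).
qsnp_calls = {("chr11", 1000, "G", "A"), ("chr11", 2000, "C", "T")}
gatk_calls = {("chr11", 1000, "G", "A"), ("chr13", 3000, "T", "C")}
print(high_confidence(qsnp_calls, gatk_calls))
```

The computed percentages come out close to the "keep" ranges quoted on the slide. In practice, calls from different tools may need normalising to a common representation before they can be intersected.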
QCMG Strategy for identifying somatic substitution mutations
Control of quality of variant calls through input filtering
•mapping quality for reads >10
•maximum number of mismatches in read <=3
•minimum consecutive matched bases in a read >=34
•duplicate reads removed
Somatic variant calls are made using thresholds for the
•minimum number of reads with the variant
•minimum coverage in tumour and normal sample
•maximum variant count for a given coverage in the matched normal
•threshold proportion of variant call qualities at that position
Potential weakness in calls annotated
•Variant seen in unfiltered bam of matched normal
•Position of variant within 5 bp of ends of reads
•Variant not seen in sequencing reads of both directions
•Variant seen in germline of another patient
•Number of novel starts for reads supporting variant is low
Tumour data
Normal data
Somatic variant: Tumour T=63% C=37%, Normal T=100%
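The input filters listed above can be written as a simple predicate over reads, as sketched below. The `Read` fields, the flag names and the exact interpretation of the two annotations (single direction, near read ends) are assumptions for illustration. Only the numeric thresholds come from the slide, and this is not the qSNP implementation.

```python
from dataclasses import dataclass


@dataclass
class Read:
    mapq: int               # mapping quality
    mismatches: int         # mismatches against the reference
    longest_match_run: int  # longest run of consecutive matched bases
    is_duplicate: bool


def passes_input_filter(read: Read) -> bool:
    """Apply the four input filters quoted on the slide."""
    return (
        read.mapq > 10
        and read.mismatches <= 3
        and read.longest_match_run >= 34
        and not read.is_duplicate
    )


def weakness_flags(supporting):
    """supporting: (distance_to_nearest_read_end, strand) for each variant read.

    Flag names are hypothetical; calls are annotated, not discarded.
    """
    flags = []
    if {strand for _, strand in supporting} != {"+", "-"}:
        flags.append("ONE_DIRECTION")  # variant not seen in both read directions
    if all(dist <= 5 for dist, _ in supporting):
        flags.append("READ_END")       # assumed: every supporting base within 5 bp of a read end
    return flags


print(passes_input_filter(Read(mapq=60, mismatches=1, longest_match_run=80, is_duplicate=False)))
print(weakness_flags([(3, "+"), (2, "+")]))
```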
Detection - examination - verification - modify
We have used a cyclical feedback approach to inform the filtering strategy and improve our mutation calling
Detect mutations
Examine: manual IGV review
Independent verification
•PCR and capillary sequencing
•PCR and deep MiSeq sequencing
•SOLiD sequencing
•mRNA sequencing
Identify patterns and modify filtering strategies
This approach has been key for the detection of small insertions and deletions, as sequencing errors and alignment biases are often exaggerated for indels
Large genomic structural variants need different detection strategies
Ovarian cancer genomes have high instability and are highly rearranged
Structural variants underlie copy number changes
Spectral karyotype from HGSOvCa cell line, Ouellet et al 2008 BMC Cancer
Low resolution
Deletion, Duplication/Insertion, Translocation
Reference
Sample
There are 4 main methods for SV detection in WG sequencing
Alkan, Coe and Eichler 2011
Most well known tools only use one detection method
A few multi-method tools are now available
Visualising structural variants
Sub-microscopic homozygous deletion in a tumour sample
Tumour
Normal
Chromosome 13: 1.3Kb somatic deletion including exon 17 of RB1 gene
Robinson et al 2011
Insert size estimation is key for detection with discordantly mapped read pairs
DNA fragment size distribution: 300bp median (~300 bp)
Production of sequence reads from the end of the fragments
Alignment of read pairs allows calculation of insert size
Typical read-pair insert size distribution visualised by qProfiler
[Plot: log count against base pairs for normally mapped reads]
Discordantly mapped read pairs mark rearrangements
Read pairs too far apart (>1.3kb insert size)
Read pairs too close together
Read pairs in wrong orientation
[Figure: tumour and normal read pairs aligned against the reference]
Detection tools identify clusters of read pairs with similar characteristics
Large clusters indicate more evidence
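The three classes of discordant pair can be sketched as a small classifier over insert size and orientation. The expected window is derived here from the median and median absolute deviation of observed insert sizes. That choice, the multiplier of 5, and the "FR" (forward-reverse) orientation label are assumptions for illustration, not the rule used by qSV or any specific tool.

```python
from statistics import median


def insert_size_window(sizes, n_mads=5):
    """Expected insert-size range as median +/- n_mads * MAD (an assumed rule)."""
    med = median(sizes)
    mad = median([abs(s - med) for s in sizes])
    return med - n_mads * mad, med + n_mads * mad


def classify_pair(insert_size, orientation, low, high):
    """Label a read pair; 'FR' is taken as the expected paired-end orientation."""
    if orientation != "FR":
        return "wrong_orientation"
    if insert_size > high:
        return "too_far_apart"
    if insert_size < low:
        return "too_close_together"
    return "concordant"


observed = [280, 290, 300, 310, 320]  # made-up insert sizes around a 300 bp median
low, high = insert_size_window(observed)
for size, orient in [(305, "FR"), (1600, "FR"), (120, "FR"), (300, "RF")]:
    print(size, orient, classify_pair(size, orient, low, high))
```

A detection tool would then group pairs with the same label and nearby coordinates into clusters, with larger clusters giving stronger evidence.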
Changes in coverage support rearrangements
Clear drop in coverage over the region in the tumour sample
Tumour
Normal
Coverage changes are often associated with SVs
Changes in coverage can be interpreted as copy number and can mark rearrangement breakpoints
Deletion: fewer reads mapped
Duplication: more reads mapped
[Plot: copy number against genomic position] CNVnator (Abyzov et al 2011)
Tools are available that identify copy number variants from read depth partitioning and GC content correction
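As a rough illustration of reading copy number from coverage, the sketch below compares binned tumour and normal read depth as a log2 ratio and labels bins as loss or gain. It is a deliberately simple stand-in: it does no partitioning or GC correction, it is not the CNVnator algorithm, and the thresholds, pseudocount and depth values are made up.

```python
from math import log2


def log2_depth_ratios(tumour, normal, pseudocount=0.5):
    """Per-bin log2(tumour/normal) after scaling tumour to the normal's total depth."""
    scale = sum(normal) / sum(tumour)
    return [
        log2((t * scale + pseudocount) / (n + pseudocount))
        for t, n in zip(tumour, normal)
    ]


def label_bins(ratios, loss=-0.5, gain=0.5):
    """Label each bin using hypothetical log2-ratio thresholds."""
    return ["loss" if r <= loss else "gain" if r >= gain else "neutral" for r in ratios]


tumour_depth = [30, 30, 15, 15, 30, 60]  # made-up mean depth per bin
normal_depth = [30, 30, 30, 30, 30, 30]
ratios = log2_depth_ratios(tumour_depth, normal_depth)
print(label_bins(ratios))
```

Transitions between labels along the genome mark candidate rearrangement breakpoints, which is how coverage can support evidence from read pairs.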
Clusters of soft clipping indicate rearrangement break points
Alignment software that performs soft clipping can reveal exact positions of the break points
Further realignment of the clipped sequences produces split reads
Reads with soft clipping and unmapped reads can be assembled into contigs that span break points
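Soft clipping is recorded in the CIGAR string of each alignment (the `S` operation in the SAM format), so candidate break points can be found by collecting the reference coordinate next to each clip and counting how many reads agree. The sketch below assumes reads are given as (1-based leftmost position, CIGAR) pairs. The function names and the minimum support of 3 are illustrative choices, not taken from a specific tool.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def clip_positions(pos, cigar):
    """Reference coordinates of the aligned bases adjacent to any soft clip."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    positions = []
    if ops and ops[0][1] == "S":
        positions.append(pos)  # clip on the left: first aligned base
    if len(ops) > 1 and ops[-1][1] == "S":
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        positions.append(pos + ref_len - 1)  # clip on the right: last aligned base
    return positions


def clip_clusters(reads, min_support=3):
    """Positions where at least min_support reads are soft clipped."""
    counts = Counter(p for pos, cigar in reads for p in clip_positions(pos, cigar))
    return {p: n for p, n in counts.items() if n >= min_support}


reads = [(1000, "60M40S"), (1010, "50M50S"), (1020, "40M60S"), (900, "100M")]
print(clip_clusters(reads))
```

The clipped sequence itself can then be realigned elsewhere in the genome to give split reads, as described above.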
qSV : Detecting Somatic Structural Variants
qSV detects 3 types of supporting evidence
Resolves all lines of evidence to identify breakpoints to base pair resolution
Felicity Newell
Automation of SV verification process
Structural variants require verification
PCR amplification over breakpoints followed by sequencing
Automation of key stages can increase throughput of verification
PCR of tumour and normal DNA
Verified events are circled
Quek et al in press
Characterising tumour genomes by the distribution of SVs
A huge range in the distribution of SVs in ovarian cancer patients
Unstable: >300 events
Complex localised events
Chromosomes
Copy number
B allele frequency
SVs
Circos, Krzywinski et al 2009
Chromothripsis events can be identified
SV break density
SV types and positions
Copy number segmentation
Log R Ratio
B allele frequency
Chromosome 15
Stephens et al 2011
Breakage‐fusion‐bridge amplification can be identified
SV break density
SV types and positions
Copy number segmentation
Log R Ratio
B allele frequency
Chromosome 12
Loss of telomere region
Control sample
Tumor sample
Kinsella and Bafna 2012
Other complex regions with high density of breakpoints
SV break density
Translocations
SV types and positions
Copy number segmentation
Log R Ratio
B allele frequency
Chromosome 19
Associating structural variants with proximal genes
Structural variant break points are annotated with gene features
Gene model annotation of break points can predict fusion genes
Gene fusions can occur if:
•both breakpoints are within the footprints of genes
•the transcription directions of the two genes align
•the translation phases of the adjoining exons match
•splicing signals are not disrupted
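The four conditions above can be combined into a single predicate, sketched below. The `BreakEnd` fields and the `inverted_join` flag (whether the rearrangement joins the two segments in inverted orientation) are hypothetical simplifications made for illustration; real gene-model annotation is considerably more involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BreakEnd:
    gene: Optional[str]         # gene whose footprint contains the breakpoint, if any
    strand: str                 # transcription strand of that gene: "+" or "-"
    phase: Optional[int]        # translation phase (0, 1 or 2) at the adjoining exon
    splice_signal_intact: bool  # True if the splice signals are undisturbed


def may_form_fusion(upstream: BreakEnd, downstream: BreakEnd, inverted_join: bool) -> bool:
    """Check the four conditions from the slide for a candidate gene fusion."""
    both_in_genes = upstream.gene is not None and downstream.gene is not None
    # Same strand is needed for a non-inverted join, opposite strands for an inverted one.
    directions_align = (upstream.strand == downstream.strand) != inverted_join
    phases_match = upstream.phase is not None and upstream.phase == downstream.phase
    splicing_ok = upstream.splice_signal_intact and downstream.splice_signal_intact
    return both_in_genes and directions_align and phases_match and splicing_ok


# Hypothetical annotated breakpoints.
five_prime = BreakEnd(gene="GENE_A", strand="+", phase=1, splice_signal_intact=True)
three_prime = BreakEnd(gene="GENE_B", strand="+", phase=1, splice_signal_intact=True)
print(may_form_fusion(five_prime, three_prime, inverted_join=False))
```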
Barsha Poudel
Patient summary of mutations identified
Chromosomes
Coding small mutations with amino acid change
SNP array track that shows copy number gain in red and loss in green and regions of loss of heterozygosity
Structural variants in centre
Circos, Krzywinski et al 2009
ICGC catalogue of mutations
Mutation detection summary
Output of mutation detection software requires careful filtering
Development of filtering strategy typically requires a feedback process
Verification is a key part of this process
Detect mutations
Examine: manual IGV review
Independent verification
Identify patterns and modify filtering strategies
Detection - examination - verification - modify
Acknowledgements:
Bioinformatics: John Pearson, Felicity Newell, Lynn Fink, Conrad Leonard, Oliver Holmes, Qinying Xu, Matthew Anderson, Stephen Kazakoff, Nick Waddell, Scott Wood
Genome Biology: Sean Grimmond, Nicola Waddell, Katia Nones, Peter Bailey, Michael Quinn, Kelly Quek
Sequencing: David Miller, Angelika Christ, Tim Bruxner, Craig Nourse, Ehsan Nourbakhsh, Suzanne Manning, Ivon Harliwong, Senel Idrisoglu, Shivangi Wani
Previous team members: Karin Kassahn, Barsha Poudel, Sarah Song, Nicole Cloonan, Darrin Taylor, Deborah Gywnne, Peter Wilson, Anita Steptoe
Peter MacCallum Cancer Centre: David Bowtell, Dariush Etemadmoghadam, Elizabeth Christie, Dale Garsed, Joshy George, Sian Fereday, Laura Galletta, Kathryn Alsop, Nadia Traficante, Joy Hendley, Chris Mitchell, Prue Cowin
Westmead Institute for Cancer Research: Anna deFazio, Catherine Kennedy, Yoke-Eng Chiew, Jillian Hung
National Health and Medical Research Council, Australian Government
Clinicians and patients