graham taylor - university of melbourne - maximising the advantages and addressing the limitation of...
DESCRIPTION
Professor Graham Taylor, Director, Australian node of the Human Variome Project, University of Melbourne presented "Maximising the advantages and addressing the limitation of clonal sequencing in diagnostics" at the National Pathology Forum 2013. This annual conference provides a platform for the public and private sectors to come together and discuss all the latest issues affecting the pathology sector in Australia. For more information, please visit the conference website: http://www.informa.com.au/pathologyforum
TRANSCRIPT
Acknowledgments
Translational Pathology, University of Melbourne: Arthur Lian Chi Hsu, Sebastian Lunke, Clare Love, Kym Pham, Olga Kondrashova, Matt Wakefield, Renate Marquis-Nicholson, Tiffany Cowie and Paul Waring + many others
Translational Genomics, University of Leeds & Leeds Teaching Hospitals: Joanne Morgan, Christopher Watson, Sally Harrison, Laura Crinnion, Lucy Stead, Stefano Berri, Henry Wood, Helen Lindsay, Nick Camm, Antigone Tzika, Josie Hayes, Vicky Crowe, Ruth Charlton, Colin Johnson + many others
HVPA/VPAC: Tim Smith, Alan Lo, Melvyn Leong, David Perkins, Dick Cotton + many others
Victorian Clinical Genetics Service: Damien Bruno & Howard Slater
Sequencing Has Gone Large
Illumina (Solexa)
Roche (454)
ABI (SOLiD / Ion Torrent)
Research, translation and service
Research
• Original
• Surprising
• >80% accurate
• Bespoke
Service
• Proven
• Predictable
• >99.99% accurate
• Standardised
Attractions of NGS
• Unprecedented rate of change
• Reduced cost per base
• Increased capacity
• Increased automation
• Sequence or sequence data on demand
• One size fits all
• Whole genome at or before birth?
– Non invasive PND
– Circulating tumour DNA
Dangers of NGS
• Unwanted incidental findings
• Data handling problems
• Higher sequencing error rate
• The Hype Cycle
• Costs beginning to stabilize?
• Still cost and performance limitations
Clinical drivers for NGS use
1. Increasing demand
2. Increasing expectations
3. Direct to consumer options
4. One stop comprehensive testing – Cost savings
– Improved time to result
5. Low cost, high throughput targeted testing (e.g. tumours)
4 and 5 are different!
Overview
• Genomic technology into diagnostics
– Incremental or radical?
– Cost saving
– Demand driven
• Quality markers
– Assay performance markers
– Empirical analysis
– Data sharing
• The clinical question
• The best use of sequencing real estate
Service Developments 2008-2013
1. Replacing Sanger sequencing by long PCR and NGS
2. Enrichment for targeted gene lists and exomes
3. Grouped read testing for mutations in tumours
4. Copy number by sequencing
1) Long PCR and Indexed NGS Replaces Sanger Sequencing
Less work, lower cost, more reliable, digital data output
[Figure: workflow. LR PCR for Service 1, Service 2 and Service 3 (Standard PCR where required), then Library Preparation, Clonal Sequencing, Data Analysis and Interpretation, with Sanger Sequencing where required, then Reports issued for each service]
• Long range PCR (LR PCR) to assist efficient target amplification
• Semi-automated library preparation and clonal sequencing using Illumina GAII
• Data analysis (NextGENE)
• Standard PCR and Sanger sequencing where required (e.g. LR PCR failure, confirmation of variants detected by NGS)
• Customised spreadsheets for management of samples within panels
– Pre-PCR: sample details
– Post-PCR: amplicon pooling and library preparation
– Post-sequencing: data coverage and identification of variants
• Classification of variants
• Patient reports prepared
Long PCR and NGS Helen Lindsay, Nick Camm & Joanne Morgan
Aneuploidy detection Antigoni Tzika, Josie Hayes, Henry Wood & Kelly Cohen
CNV-seq
BAC array
SCC Lung Tumour Cell lines
Ion torrent PGM 316v2 chip Damien Bruno, Howard Slater
VCGS
Ion Torrent: 250K reads per sample
[Figure: haploid equivalent (y-axis, 0.0 to 3.0) vs chromosomal position (Mb) for test chromosome 4 and test chromosome 14: 28 Mb 4pter del, 26 Mb 14qter dup]
[Figure: haploid equivalent (y-axis, 0.0 to 3.0) vs chromosomal position (Mb) for test chromosome 6 and test chromosome 21: 6 Mb 6pter del, 6 Mb 21qter dup]
Sequencing RE sites MspI & RsaI, size-selected at 200-220 bp
Reduced complexity
Size selection
SNP scoring
Dosage
Re-arrangements
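The reduced-complexity idea on this slide can be illustrated with an in-silico digest. The following is a minimal sketch, not the laboratory protocol: it assumes the standard cut sites for MspI (C^CGG) and RsaI (GT^AC), ignores adapter length in the size window, and the function names `digest` and `size_select` are hypothetical.

```python
import re

# Recognition site and cut offset within the site for each enzyme.
# MspI cuts C^CGG and RsaI cuts GT^AC.
ENZYMES = {"MspI": ("CCGG", 1), "RsaI": ("GTAC", 2)}


def digest(sequence, enzymes=ENZYMES):
    """Return the fragments produced by cutting `sequence` with all enzymes."""
    cuts = set()
    for site, offset in enzymes.values():
        # A lookahead finds every occurrence, including overlapping ones.
        for match in re.finditer(f"(?={site})", sequence):
            cuts.add(match.start() + offset)
    bounds = [0] + sorted(cuts) + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments, min_len=200, max_len=220):
    """Keep fragments inside the size-selection window (inclusive)."""
    return [f for f in fragments if min_len <= len(f) <= max_len]
```

Only the fragments that survive `size_select` would be sequenced, which is what reduces the complexity of the library relative to the whole genome.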
Towards clinical exome analysis
Since Q1 2010: two genes (BRCA1, BRCA2), 22 long PCRs, 50% cost reduction
By Q1 2013: 31 genes, 1 PCR (BRCA1, BRCA2, TP53, CHEK2, BRIP1, PALB etc. etc…), same cost or less
Recent successful mapping studies
Exome capture vs. Targeted Capture
Cancer Panel
[Figure: bar chart per sample (y-axis 95.00 to 100.00). Samples: 2008.3452, 2008.4058, 2009.6330, 2010.1016, 2010.1157, 2010.1296, 2010.3063, 2010.4069, 2010.6678, 2011.2795, CC1 to CC43 (not every number present), EMQN 12.03075, EMQN 12.03076, EMQN 12.03077, and Average]
Validate coverage of target (Percentage of target sequenced to a depth >50X)
An average of 98.8% of bases were covered to >50X; this equates to ~1,976 bases with <50X coverage
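The coverage figure can be computed from per-base depths as in the sketch below. The function names are hypothetical, and the implied target size is only a back-calculation that assumes the 98.8% and ~1,976 base figures refer to the same target (roughly 165 kb under that assumption).

```python
def fraction_above(depths, min_depth=50):
    """Fraction of target bases sequenced to a depth greater than min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(1 for d in depths if d > min_depth) / len(depths)


def implied_target_size(bases_below, fraction_covered):
    """Target size implied by an under-covered base count and a covered fraction."""
    return bases_below / (1.0 - fraction_covered)


# Back-calculation from the slide's figures (an inference, not a reported value).
target_bases = implied_target_size(1976, 0.988)
```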
Reproducibility of capture: possibility of dosage analysis
MYH7
Daughter
Father
Mother
EPCAM exon 9 to MSH2 exon 8 and EPCAM exon 9 to MSH2 exon 1 deletions
[Figure: per-region values (y-axis 0 to 1.6) across captured regions of EPCAM (9 regions, chr2:47596595-47613802), MSH2 (16 regions, chr2:47630281-47710138) and MSH6 (10 regions, chr2:48010323-48034049)]
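Dosage analysis from reproducible capture can be sketched as a per-region ratio of the sample's share of reads to the mean share in controls. This is a minimal illustration under stated assumptions, not the method used for the slide: the function name and the input layout (mean depth per region) are hypothetical, and normalising by total depth means a large deletion slightly inflates the ratios of the remaining regions.

```python
def dosage_ratios(sample_depth, control_depths):
    """Per-region dosage: sample share of reads divided by mean control share.

    sample_depth: {region: mean depth} for the test sample.
    control_depths: list of {region: mean depth} dicts for control samples.
    """
    def normalise(depths):
        total = sum(depths.values())
        return {region: depth / total for region, depth in depths.items()}

    sample = normalise(sample_depth)
    controls = [normalise(c) for c in control_depths]
    ratios = {}
    for region, share in sample.items():
        mean_control = sum(c[region] for c in controls) / len(controls)
        ratios[region] = share / mean_control
    return ratios
```

A heterozygous deletion would be expected to show up as a run of adjacent regions with ratios near 0.5 relative to their neighbours.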
The problem of indels BRCA1 c.1175_1214del40
Soft clipped reads assembled de novo and the consensus read is BLAT aligned
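Reads spanning a large deletion such as the 40 base example often align with one end soft-clipped. The sketch below shows only the first step, pulling the soft-clipped ends out of an aligned read from its CIGAR string; the de novo assembly and BLAT alignment are not shown, and the function name is hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_ends(read_seq, cigar):
    """Return (left_clip, right_clip) sequences soft-clipped from an aligned read."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) are not present in the read sequence, so skip them.
    ops = [(n, op) for n, op in ops if op != "H"]
    left = read_seq[:ops[0][0]] if ops and ops[0][1] == "S" else ""
    right = ""
    if len(ops) > 1 and ops[-1][1] == "S":
        right = read_seq[len(read_seq) - ops[-1][0]:]
    return left, right
```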
From 596 targeted exons there are 12 exons that “shouldn’t” be screened
• APC exon 1
• CDKN2A (3 exons)
• CHEK2 exon 3
• EPCAM exon 1
• FANCD2 exon 20
• FLCN exons 4, 8 and 11
• RAD51 exon 9
• STK11 exon 3
The Blacklist
Low concordance of multiple variant-calling pipelines O’Rawe et al. Genome Medicine 2013, 5:28
SNV concordance: 57.4% Indel concordance 26.8%
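One common way to express concordance between pipelines is the share of all distinct calls that every pipeline reported. The sketch below uses that definition as an assumption; the cited paper's exact method may differ, and the function name is hypothetical.

```python
def concordance(call_sets):
    """Share of all distinct variants that every pipeline reported."""
    call_sets = [set(calls) for calls in call_sets]
    union = set().union(*call_sets)
    if not union:
        return 0.0
    shared = set.intersection(*call_sets)
    return len(shared) / len(union)
```

Variants are best keyed on a normalised tuple such as (chromosome, position, reference, alternate), since indels in particular can be represented differently by different callers.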
Clinical Whole-Exome Sequencing for the Diagnosis of Mendelian Disorders
Yang et al, (Baylor) NEJM 2013
• Whole-exome sequencing identified the underlying genetic defect in 25% of consecutive patients referred for evaluation of a possible genetic condition.
• As testing with whole-exome sequencing evolves to characterise more patients with atypical presentations of known genetic diseases, the spectrum of phenotypes associated with genetic disorders will expand.
Coverage of target
[Figure: coverage (x-axis, 0 to 100) vs features (y-axis, 0 to 7500)]
Data performance indicators
[Figure: insert size of mapped pairs and size of merged pairs, one panel per sample lane; length (x-axis) vs frequency (y-axis). Lanes L001 and L002 for samples MS0404 to MS0415 and MS0444 to MS0455]
Tumour:
– CNS: Min.Cov. 40, Min.Reads 20, Min.Var.Freq. 0.05
– Indels: Min.Cov. 20, Min.Reads 5, Min.Var.Freq. 0.05
Germline:
– CNS: Min.Cov. 40, Min.Reads 20, Min.Var.Freq. 0.20
– Indels: Min.Cov. 10, Min.Reads 5, Min.Var.Freq. 0.20
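The thresholds above can be applied as a simple filter. This is a sketch under stated assumptions: it treats Min.Reads as the number of reads supporting the variant and Min.Var.Freq. as supporting reads divided by coverage, which the slide does not spell out, and the names `THRESHOLDS` and `passes_filter` are hypothetical.

```python
# Threshold values copied from the slide; the key names are illustrative.
THRESHOLDS = {
    ("tumour", "CNS"): {"min_cov": 40, "min_reads": 20, "min_var_freq": 0.05},
    ("tumour", "indel"): {"min_cov": 20, "min_reads": 5, "min_var_freq": 0.05},
    ("germline", "CNS"): {"min_cov": 40, "min_reads": 20, "min_var_freq": 0.20},
    ("germline", "indel"): {"min_cov": 10, "min_reads": 5, "min_var_freq": 0.20},
}


def passes_filter(sample_type, variant_kind, coverage, variant_reads):
    """Return True if a candidate variant meets all three thresholds."""
    limits = THRESHOLDS[(sample_type, variant_kind)]
    if coverage < limits["min_cov"] or variant_reads < limits["min_reads"]:
        return False
    # Coverage is at least min_cov here, so the division is safe.
    return variant_reads / coverage >= limits["min_var_freq"]
```

The same candidate can pass in a tumour and fail in germline analysis, reflecting the lower variant fraction expected in mixed tumour samples.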
• SIFT
• PolyPhen
• HGVS
• HGNC
• RefSeq
• Cosmic_v65
• dbsnp_137Common
• dbsnp_137Flagged
• dbsnp_137Multi
• LOVD?
• BIC?
Merging forward and reverse reads
Variant & SIFT classes
Variant class SIFT class
child father child father
Neuro
I-exome
N-exome
Enrichment and Variant Detection
NF1 exon 1 HTT exon 1
Amplicon targeted resequencing
Step 1: PCR to amplify regions of interest
Genomic
DNA
F
R
Locus-specific sequence
Overhang adapter sequence used in Step 2
Step 2: 2nd round of PCR to add ILMN indices and sequencing adapters
Index adapter oligos from ILMN: contain P5/P7 adapters to make template compatible with flow cell, also contain a unique sample index
P5 | Index 1 | Insert to be sequenced | Index 2 | P7
Assay design: Primers are Y-shaped, with a 5’ phosphate for ligation and a 3’ phosphorothioate to prevent digestion of the 3’ T overhang
Grouped read testing
• Targeted
• Sensitive
• Quantitative
• Low computing overhead
• Genotypes
• Estimates error rate
• BLAST/BLAT mutation scanning option
Genotyping FFPE samples
gene 011707PM 012307PM 022307PM 032307PM
BRAFV600wt 100.00 100.00 100.00 100.00
BRAFV600Ec.1799T>A 0.00 18.18 63.51 0.00
KRAS12&13wt 100.00 100.00 100.00 100.00
KRAS12c.34G>A 0.00 0.00 0.00 0.00
KRAS12c.34G>C 0.00 0.00 0.00 0.00
KRAS12c.34G>T 0.00 0.00 0.00 0.00
KRAS12c.35G>A 0.00 0.00 0.00 0.00
KRAS12c.35G>C 0.00 0.00 0.00 0.00
KRAS12c.35G>T 0.00 0.00 0.00 0.00
KRAS13c.38G>A 0.00 0.00 0.00 0.00
KRAS61wt 100.00 100.00 100.00 100.00
KRAS61c.181C>A 0.00 0.00 0.00 0.00
KRAS61c.181C>G 0.00 0.00 0.00 0.00
KRAS61c.182A>C 0.00 0.00 0.00 0.00
KRAS61c.182A>G 0.12 0.00 0.00 0.11
KRAS61c.182A>T 0.00 0.00 0.00 0.00
KRAS61c.183A>C 0.00 0.00 0.00 0.00
KRAS61c.183A>G 0.00 0.00 0.00 0.00
KRAS61c.183A>T 0.00 0.00 0.00 0.00
NRAS12&13wt 100.00 100.00 100.00 100.00
NRAS12c.34G>A 0.00 0.00 0.00 0.00
NRAS12c.34G>C 0.00 0.00 0.00 0.00
NRAS12c.34G>T 0.00 0.00 0.00 0.00
NRAS12c.35G>A 0.00 0.00 0.00 0.00
NRAS12c.35G>C 30.50 0.00 0.00 19.89
NRAS12c.35G>T 0.00 0.00 0.00 0.00
NRAS13c.37G>A 0.00 0.00 0.00 0.00
NRAS13c.37G>C 0.00 0.00 0.00 0.00
NRAS13c.37G>T 0.00 0.00 0.00 0.00
NRAS13c.38G>A 0.00 0.00 0.00 0.00
NRAS13c.38G>C 0.00 0.00 0.00 0.00
NRAS13c.38G>T 0.00 0.00 0.00 0.00
NRAS61wt 100.00 100.00 100.00 100.00
NRAS61c.181C>A 0.00 0.00 0.00 0.00
NRAS61c.181C>G 0.00 0.00 0.00 0.00
NRAS61c.182A>C 0.00 0.00 0.00 0.00
NRAS61c.182A>G 0.00 0.00 0.00 0.00
NRAS61c.182A>T 0.00 0.00 0.00 0.00
NRAS61c.183A>C 0.00 0.00 0.00 0.00
NRAS61c.183A>G 0.00 0.00 0.00 0.00
NRAS61c.183A>T 0.00 0.00 0.00 0.00
PIK3CA1633wt 100.00 100.00 100.00 100.00
PIK3CA1633c.1633G>A 25.55 14.29 0.00 35.79
PIK3CA3140wt 100.00 100.00 100.00 100.00
PIK3CA3140c.3140A>G 0.00 0.00 0.00 0.00
coverage statistics
each amplicon read sorted by primers
grouped amplicon variants
grab amplicons
sort by locus
group amplicons
Edit Distance
Read Counts, Read Distribution and Analysis
Using the amplimer sequence to grab each amplicon is an alternative to querying the entire sequence output, with the advantage that each set of reads should be more homogeneous and amenable to grouping. Groups above a certain abundance (corresponding to the detection limit selected) can then be compared in detail with the canonical sequence using string comparison tools such as the Levenshtein (edit) distance, or by Smith-Waterman alignment. Using this approach we have confirmed that variants can be identified de novo, but with more interference from sequence errors than by grouped read typing. We have also shown that the current TruSeq Cancer Panel kit co-amplifies a region of chromosome 22 containing a perfect match to the pathogenic KIT Exon 11 c.1669T>A mutation. Artifactual data from the duplicated region risks the reporting of specious variants as false positive results.
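The grab, group and compare steps can be sketched as follows. This is a minimal illustration with hypothetical function names, not the pipeline used in the talk: it requires exact primer matches at both read ends, groups only identical reads, and uses a plain dynamic-programming Levenshtein distance.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def grab_amplicon(reads, forward_primer, reverse_primer_rc):
    """Keep reads that start with one primer and end with the other."""
    return [r for r in reads
            if r.startswith(forward_primer) and r.endswith(reverse_primer_rc)]


def group_reads(reads, canonical, min_fraction=0.05):
    """Group identical reads and report groups above the abundance cut-off.

    Returns (sequence, read count, edit distance to canonical) tuples,
    most abundant group first.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    groups = []
    for sequence, count in counts.most_common():
        if total and count / total >= min_fraction:
            groups.append((sequence, count, edit_distance(sequence, canonical)))
    return groups
```

Setting `min_fraction` to the chosen detection limit discards low-abundance groups, which are dominated by sequencing errors, before any detailed comparison is made.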
Average Read Count per Amplicon +/- SEM
Smith-Waterman
BLAST
363 reads; common mutation: p.G12A; chr12:25,398,290C>G; c.35G>C
TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
220 reads; wildtype
TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
64 reads; non-coding chr12:295,398,329C>T
TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCTTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
39 reads; two errors, non adjacent, one corresponding to c.35G>C
TGTATTGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
27 reads; two errors, non adjacent, one corresponding to c.35G>C
TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGATTATATTAGAACATGTCACACATAAGGTTA
26 reads; two errors, non adjacent, one corresponding to c.35G>C
TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGTAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
17 reads; two non-coding errors, non adjacent
TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCCGCAGGCTTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
17 reads; corresponding to c.35G>A
TGTATCGTCAAGGTACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
>1141_>EGFR9_78 chr7 55242418-55242511 1141
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT
>646_>EGFR9_78 chr7 55242418-55242511 646
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT
>60_>EGFR9_78 chr7 55242418-55242511 60
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTTCATGGCT
>57_>EGFR9_78 chr7 55242418-55242511 57
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGTTTTGCTGTGTGGGGGTCCATGGCT
>54_>EGFR9_78 chr7 55242418-55242511 54
GACTTTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT
>51_>EGFR9_78 chr7 55242418-55242511 51
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCTTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAG---------------ACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT
KRAS point mutation
EGFR 15 base deletion
EGFR 15 base deletion with Smith-Waterman alignment
Amplicon Sequencing: treating reads as groups
Most reads are error free
FFPE contaminates the evidence
Clustal alignment & phylogeny of errors
BRAF V600E
[Figure: UCSC Genome Browser view (hg19), chr7:140,453,130-140,453,230, 50 bases scale. Tracks: YourSeq (Your Sequence from Blat Search), CCDS5863.1 (Consensus CDS), BRAF (RefSeq Genes), Simple Nucleotide Polymorphisms (dbSNP 137) Found in >= 1% of Samples, Segmental Dups (Duplications of >1000 Bases of Non-RepeatMasked Sequence), RepeatMasker (Repeating Elements by RepeatMasker)]
GTGAGGTAGCTCTAAAGTGACATCGATCTGGTTTTAGTGGATAAAAATGACACTCCAGAAGTACTTCTTTATATAGACTCCACATCATTCATTTCCTTTTGTCATCTAGAGTAAAAGGATAG
GSWRSKVTALGFDGIKVTLDEHLFIN
>24028BRAF1
>2594BRAF1
>198BRAF1
>174BRAF1
>170BRAF1
>159BRAF1
>156BRAF1
>156BRAF1
>151BRAF1
>148BRAF1
>144BRAF1
>138BRAF1
>137BRAF1
>135BRAF1
>133BRAF1
>128BRAF1
>123BRAF1
>121BRAF1
>118BRAF1
>116BRAF1
>115BRAF1
>112BRAF1
>110BRAF1
>105BRAF1
>105BRAF1
>104BRAF1
>102BRAF1
>99BRAF1
>92BRAF1
PTEN Variants
[Figure: PTEN Variants bar chart, y-axis 0 to 1600]
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTTTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGTATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGTA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATTAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
TAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATCTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGGAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAGAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
PTEN Variants Canonical
PTENP1
PTEN delTTAC
Families of Variants
Canonical
Pseudo
delTTAC
rest(>2reads)
EGFR + del15: grouped reads of 2 or more = 2,784; grouped read sets = 184
wildtype
mutant
rest
PTEN + del TTAC: grouped reads of 2 or more = 4,314; grouped read sets = 321
DNA Extracted from FFPE is Damaged and Degraded
Nucleotide Damage Degraded DNA Fragments
Strand-specific sequencing to Differentiate True Variation From DNA Damage
Extending grouped read testing
[Figure: UCSC Genome Browser view (hg19), chr17:41,223,030-41,223,190, 50 bases scale, showing flanks and TARGET. Tracks: RefSeq Genes, Simple Nucleotide Polymorphisms (dbSNP 137) Found in >= 1% of Samples, Duplications of >1000 Bases of Non-RepeatMasked Sequence, Repeating Elements by RepeatMasker]
CCAGCAGTATCAGTA(.*)AGATTCTGCAACTTT
TATGAGCAGCAGCTG(.*)CAATTGGGGAACTTT
AGATTCTGCAACTTT(.*)CAATGCAGAGGTTGA
rs1799966
gene total forward reverse
chr17:41223060-41223115 5,589 2,044 3,545
chr17:41223076-41223130 3,254 1,347 1,907
chr17:41223101-41223146 6,963 3,390 3,573
mapped reads 15,806
total reads 38,228,739
percent mapped 0.04
Non-amplicon systems? Probably not with read pairs: random shearing and variable distance between reads. But with single reads, or individual reads from read pairs, why not pretend that they comprise sets of amplicons? This pretense will become more convincing as reads get longer and more accurate.
2008: read length 32 bases, 2013: read length 400 bases
CCAGCAGTATCAGTA(.*)AGATTCTGCAACTTT
TATGAGCAGCAGCTG(.*)CAATTGGGGAACTTT
AGATTCTGCAACTTT(.*)CAATGCAGAGGTTGA
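Flank patterns like these can be applied directly to unaligned reads, treating each matching read as if it were an amplicon. The sketch below is illustrative only: the function names are hypothetical, it checks both orientations of each read, and it counts the captured target sequence per strand, much as the forward and reverse read counts are tabulated on the slides.

```python
import re
from collections import Counter

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (upper-case A, C, G, T, N)."""
    return seq.translate(COMPLEMENT)[::-1]


def count_targets(reads, pattern):
    """Count captured target sequences, split by read orientation.

    pattern: a regular expression with one capture group between two flanks.
    Returns (forward_counts, reverse_counts) as Counters keyed by target.
    """
    regex = re.compile(pattern)
    forward, reverse = Counter(), Counter()
    for read in reads:
        match = regex.search(read)
        if match:
            forward[match.group(1)] += 1
            continue
        match = regex.search(reverse_complement(read))
        if match:
            reverse[match.group(1)] += 1
    return forward, reverse
```

Because `(.*)` is greedy, a read containing the right-hand flank more than once would capture up to the last occurrence; a non-greedy `(.*?)` is an alternative if that matters for a given locus.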
Merged forward and reverse reads
PandaSeq and SeqPrep can merge overlapping read pairs to make them even longer and more accurate. In an unselected 100 base read pair run enriched for hereditary cancer genes, over 20% of the reads could be merged. With longer reads and suitable experimental design this fraction could be increased.
PairsProcessed: 18,456,760
PairsMerged: 4,029,383
PairsWithAdapters: 32,899
PairsDiscarded: 646
percent merged 21.83
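The idea of merging an overlapping pair can be sketched as below. This is a deliberately simplified stand-in, not how PandaSeq or SeqPrep work internally: it accepts only exact overlaps and ignores base qualities, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (upper-case A, C, G, T, N)."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge a read pair whose ends overlap exactly.

    The end of read1 must match the start of the reverse-complemented
    read2 over at least min_overlap bases; returns None otherwise.
    """
    mate = reverse_complement(read2)
    # Try the longest possible overlap first.
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-overlap:] == mate[:overlap]:
            return read1 + mate[overlap:]
    return None


def percent_merged(pairs_merged, pairs_processed):
    """Percentage of processed pairs that were merged."""
    return 100.0 * pairs_merged / pairs_processed
```

Applied to the counts above, `percent_merged(4029383, 18456760)` reproduces the reported 21.83% to two decimal places.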
locus 1total 2total
chr17:41223060-41223115 2631 2777
chr17:41223076-41223130 2452 2674
chr17:41223101-41223146 2223 2501
[Figure: bar chart of 1total and 2total per locus (y-axis 0 to 3000)]
Probability of a given length read as a subset of a longer read in a normal distribution of longer reads: the “minimum substring problem”
This approach might make sense with longer reads
Average read length from 101 to 150 bases
Orthogonal validation without Sanger
[Figure: UCSC Genome Browser view (hg19), chr17:41,223,070-41,223,105, 10 bases scale. Tracks: flanks, Your Sequence from Blat Search, RefSeq Genes, Simple Nucleotide Polymorphisms (dbSNP 137) Found in >= 1% of Samples, Duplications of >1000 Bases of Non-RepeatMasked Sequence, Repeating Elements by RepeatMasker]
AGTCATCATACTCGTCGTCGACCTGAGACCCGTCTAAGA
CCAGCAGTATCAGTA(.*)AGATTCTGCAACTTT
TATGAGCAGCAGCTG(.*)CAATTGGGGAACTTT
AGATTCTGCAACTTT(.*)CAATGCAGAGGTTGA
G YourSeq
rs1799966
[Figure: bar chart of forward reads inc rs1799966 (y-axis 0 to 500) for GCCCAGAGTCCAGCTGCTGCTCATAC and GCCCAG*G*GTCCAGCTGCTGCTCATAC]
Cheap NextGen Adaptations
Unlikely to be high priority for NGS manufacturers
Require custom informatics solutions
• MLPAbrary
• MSIbrary
[Figure: MLPAbrary, CNCE1exon6 (y-axis 0 to 9) for samples MS0606, MS0607, MS0608 and MS0609]
[Figure: Length (x) vs Read Count (y, 0 to 450) for BAT34c4 (lengths 87 to 91), D18S55 (lengths 89, 91, 93, 99, 101, 103) and D5S346 (lengths 63, 65, 67)]
What did we learn?
• Incremental rather than radical approaches are easier to integrate into service
• Quality standards can be progressively bootstrapped from the previous “best practice”
• Some genes will be refractory to hybridization enrichment and short read sequencing
• Formalin fixation increases the noise in tissue samples, reducing sensitivity to low levels of variants in mixed samples
• Data handling is challenging for larger target sizes
What would an NGS-based diagnostic workflow look like?
• Clinically driven request
– Gene list driven by phenotype
– Actionable results
– Quality assured
– Cost-effective
– *Technical report
– Clinical report
Building pipelines to collect and manage the data
Managing and sharing the data
FASTQ output
Technical report: Variant data, Assay performance, Log of processes (software and versioning)
Variant calling (GATK)
Coverage analysis (BEDTools)
Alignment (Novoalign)
Annotation: Annovar, SeattleSeq, Alamut-ht, VEP
Custom info.: Local MAF, Lab path sc., HGMD, Position, ‘Meta’ data
Quality Control of NGS Output and Interpretation
• Confidence in sequence read output
– Source of read (pre-analytical formatting)
– Quality of the read (base calling accuracy, phase, consensus)
• Confidence in variant detection
– Quality of mapping and alignment
– Quality of mismatch detection
• Confidence in variant exclusion
– Coverage of the region of interest
– Identify and flag “black list” regions
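Confidence in variant exclusion can be reported per region by combining the two checks listed: coverage of the region of interest and membership of a black list. The sketch below is a hypothetical illustration; the function name, the input layout (minimum depth per region) and the 50X default are assumptions chosen to match the coverage threshold used earlier in the talk.

```python
def exclusion_confidence(regions, min_depth=50, blacklist=frozenset()):
    """Classify each region of interest for confidence in variant exclusion.

    regions: {name: minimum depth observed across the region}.
    blacklist: names of regions that should not be screened.
    """
    report = {}
    for name, depth in regions.items():
        if name in blacklist:
            report[name] = "blacklisted"
        elif depth < min_depth:
            report[name] = "low coverage"
        else:
            report[name] = "ok"
    return report
```

Only regions reported as "ok" would support a statement that no variant was detected; the others need to be flagged in the technical report.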
Reported data fields: VCF annotation/metadata
Coverage
- Reads aligned
- Coverage of target
- Location of bases
- Mappability
- Known concordance with Gold Standards
Variation
- HGVS compliant variants
- Chr, pos, c. location
- Function (exonic/intronic/UTR)
- Mutation type (synon./missense)
- Distance from s. site
- Splice site data
- SIFT
- AlignGVGD
- dbSNP identifier
- 1K genomes frq data
- dbSNP clinically associated flag
- 89 fields in total
Robust standards for genomic medicine
• Databases and data content
– Access to identified and de-identified data (consent and confidentiality)
– Database accreditation process in prep with RCPA
– Defining the performance of various aligners, variant callers and annotation programs
– Clinical grade Variant Call Format (VCF)
– Metafile covering data trail: what was tested, what was not tested
– EQA
Standards
Human Variome Project
Human Variome Project Australian Node
What We’ve Done
• NeAT Funding (2010-2011)
– Pilot Phase
– 4 labs, 3 diseases: Breast Cancer, Colon Cancer, Huntington’s
– Portal Launched April 2011
– Molecular Data Only
– Collaboration with Mawson
• NeCTAR Funding (2012-2013)
– 12 more labs + all genes they test for
– Configuration Tool
– Clinical Data/Phenotype Linkage
– Transfer data internationally
What We Built
• Collection Tool
• Portal
• Data Model
• Ethics Processes
• Access & Usage Policy
• Data Sharing Agreements
HVPA Search Tools
Improvements to database
Area | Feature | Median
Gene Display | Statistics of number of variants for that gene as table or bar graph | 1
Variant Instances | Raise a concern about an instance's interpretation | 1
General | Matching NATA fields as standard collected fields | 1
Variant search | Search by range | 2
Variant search | Search by genomic position | 2
Variant search | Filter by pathogenicity | 2
Variant search | Sort by ... (pathogenicity, other fields) | 2
Gene Display | Display links to related database for gene by referencing genenames.org | 2
Variant search | Wildcard search of variants | 2
New | Search by disease which shows multiple genes and variant results | 2
New | VCF data imports into HVP Australia | 2
Gene Display | VarVis - visualisation of gene and variants reported | 2
New | VCF data export from HVP Australia of a set of results | 3
Variant search | At instance level - see other variants from this test/patient | 3
Variant Instances | Capture & display SIFT score | 3
Notifications | Notify if consensus of pathogenicity score changed/updated | 3
General | Integration with EBI/NCBI tools for queries and displays | 3
Variant search | Display last date uploaded for this variant (or last 10 dates) | 3
Variant search | Search by variant type (operator, e.g. del, sub, inversion...) | 4
Variant Display | As a track in UCSC, EBI and NCBI genome browser | 4
New | Patient MetaData (BioGrit?) integration (Time of Death, Co-morbidities, ...) | 4
HVP Variant Exporter
DMuDB
Growth of DMuDB is supported by the active collection of data from its users.
Referral data are submitted to the database either directly through the web interface, or to the database administrator in spreadsheet format for bulk upload.
First line quality assurance is performed by the users themselves, with an ‘approval’ status display showing whether a variant entry has been checked and approved by the submitting laboratory. This process aims to reflect the quality assurance procedures followed by diagnostic laboratories.
New genes, diseases and reference sequences are added to the database by the administrator, on request, whenever they are needed.
www.ngrl.org.uk/Manchester
NeCTAR Project Deliverables
Benefit & Outcome | Objective & Measure
1. Improved research capability for InSiGHT
AIM 1. De-identified linkage of HVP Australian Node data with clinical dataset
AIM 2. Demonstration of query capability to InSiGHT
2. Increase access to molecular data
AIM 3. Increase submitting laboratories on HVP Australian Node from 3 to 15.
3. Reduce barriers for new laboratories to participate
AIM 4. Configuration Tool tested at 3 participating laboratories
Prototype NGS database
LOVD3
Adding variants
Storing and Sharing NGS Output
Is my variant in other LOVDs?
The Next Steps
• Start sharing data with HVPA
• Develop NGS best practice for targeted and genome scale sequencing
• An international consensus standard for the variant call file format that is fit for diagnostic use
• Work with RCPA on diagnostic quality variant databases
HVPA