graham taylor - university of melbourne - maximising the advantages and addressing the limitation of...

Acknowledgments

Translational Pathology, University of Melbourne: Arthur Lian Chi Hsu, Sebastian Lunke, Clare Love, Kym Pham, Olga Kondrashova, Matt Wakefield, Renate Marquis-Nicholson, Tiffany Cowie and Paul Waring + many others

Translational Genomics, University of Leeds & Leeds Teaching Hospitals: Joanne Morgan, Christopher Watson, Sally Harrison, Laura Crinnion, Lucy Stead, Stefano Berri, Henry Wood, Helen Lindsay, Nick Camm, Antigone Tzika, Josie Hayes, Vicky Crowe, Ruth Charlton, Colin Johnson + many others

HVPA/VPAC: Tim Smith, Alan Lo, Melvyn Leong, David Perkins, Dick Cotton + many others

Victorian Clinical Genetics Service: Damien Bruno & Howard Slater

Sequencing Has Gone Large

Illumina (Solexa)

Roche (454)

ABI (SOLiD / Ion Torrent)

Research, translation and service

Research

• Original
• Surprising
• >80% accurate
• Bespoke

Service

• Proven
• Predictable
• >99.99% accurate
• Standardised

Attractions of NGS

• Unprecedented rate of change
• Reduced cost per base
• Increased capacity
• Increased automation
• Sequence or sequence data on demand
• One size fits all
• Whole genome at or before birth?
– Non invasive PND
– Circulating tumour DNA

Dangers of NGS

• Unwanted incidental findings
• Data handling problems
• Higher sequencing error rate
• The Hype Cycle
• Costs beginning to stabilize?
• Still cost and performance limitations

Clinical drivers for NGS use

1. Increasing demand

2. Increasing expectations

3. Direct to consumer options

4. One stop comprehensive testing
– Cost savings
– Improved time to result

5. Low cost, high throughput targeted testing (e.g. tumours)

4 and 5 are different!

Overview

• Genomic technology into diagnostics
– Incremental or radical?
– Cost saving
– Demand driven

• Quality markers
– Assay performance markers
– Empirical analysis
– Data sharing

• The clinical question

• The best use of sequencing real estate

Service Developments 2008-2013

1. Replacing Sanger sequencing by long PCR and NGS

2. Enrichment for targeted gene lists and exomes

3. Grouped read testing for mutations in tumours

4. Copy number by sequencing

1) Long PCR and Indexed NGS Replaces Sanger Sequencing

Less work, lower cost, more reliable, digital data output

Workflow (Service 1, Service 2 and Service 3 run in parallel): LR PCR -> Library Preparation -> Clonal Sequencing -> Data Analysis and Interpretation -> Standard PCR and Sanger Sequencing (where required) -> Reports issued

• Long range PCR (LR PCR) to assist efficient target amplification
• Semi-automated library preparation and clonal sequencing using Illumina GAII
• Data analysis (NextGENE)
• Standard PCR and Sanger sequencing where required (e.g. LR PCR failure, confirmation of variants detected by NGS)
• Customised spreadsheets for management of samples within panels
– Pre-PCR: sample details
– Post-PCR: amplicon pooling and library preparation
– Post-sequencing: data coverage and identification of variants
• Classification of variants
• Patient reports prepared

Long PCR and NGS: Helen Lindsay, Nick Camm & Joanne Morgan

Aneuploidy detection: Antigoni Tzika, Josie Hayes, Henry Wood & Kelly Cohen

CNV-seq

BAC array

SCC Lung Tumour Cell lines

Ion Torrent PGM 316v2 chip: Damien Bruno, Howard Slater (VCGS)

Ion Torrent: 250K reads per sample

[Figure: four panels plotting haploid equivalent (0.0 to 3.0) against chromosomal position (Mb) for test chromosomes 4, 14, 6 and 21]

Test chromosomes 4 and 14: 28 Mb 4pter del, 26 Mb 14qter dup

Test chromosomes 6 and 21: 6 Mb 6pter del, 6 Mb 21qter dup

Sequencing RE sites MspI & RsaI, size-selected at 200-220 bp

Reduced complexity

Size selection

SNP scoring

Dosage

Re-arrangements

Towards clinical exome analysis

Since Q1 2010: two genes (BRCA1, BRCA2), 22 long PCRs, 50% cost reduction

By Q1 2013: 31 genes (BRCA1, BRCA2, TP53, CHEK2, BRIP1, PALB etc.), 1 PCR, same cost or less

Recent successful mapping studies

Exome capture vs. Targeted Capture

Cancer Panel

[Figure: bar chart of the percentage of target covered (y-axis 95.00 to 100.00) per sample: ten numbered samples (2008.3452 to 2011.2795), CC1 to CC43, three EMQN samples, and the average]

Validate coverage of target (Percentage of target sequenced to a depth >50X)

An average of 98.8% of bases are covered to >50X; this equates to ~1,976 bases with <50X coverage
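The coverage metric above can be sketched as follows. This is a minimal illustration, not the pipeline used for the slide: the function name and the toy depth values are assumptions. The last lines work back from the slide's figures: if the 1.2% of bases not covered to >50X amounts to about 1,976 bases, the target is roughly 165 kb.

```python
def fraction_covered(depths, min_depth=50):
    """Fraction of target positions with depth strictly greater than min_depth."""
    if not depths:
        raise ValueError("no depth values supplied")
    covered = sum(1 for depth in depths if depth > min_depth)
    return covered / len(depths)


# Toy per-base depths for a 10-base target (hypothetical values).
toy_depths = [120, 98, 75, 51, 50, 49, 200, 310, 64, 12]
toy_fraction = fraction_covered(toy_depths)  # 7 of the 10 bases exceed 50X

# Arithmetic from the slide: 98.8% covered leaves 1.2% of the target below threshold.
bases_below_threshold = 1976
implied_target_size = round(bases_below_threshold / (1 - 0.988))
```

In practice the per-base depths would come from a coverage tool run over the target regions; only the thresholding step is shown here.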

Reproducibility of capture: possibility of dosage analysis

MYH7

Daughter

Father

Mother

EPCAM exon 9 to MSH2 exon 8 and EPCAM exon 9 to MSH2 exon 1 deletions

[Figure: normalised dosage (y-axis 0 to 1.6) per capture target across EPCAM (chr2:47596595-47613802), MSH2 (chr2:47630281-47710138) and MSH6 (chr2:48010323-48034049)]
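A minimal sketch of how dosage might be estimated from reproducible capture depth, assuming each region's share of a sample's total depth is compared with the same share in a control. The function names, the 0.7 and 1.3 cut-offs, the region labels and the depth values are all illustrative assumptions, not the method behind the figure.

```python
def dosage_ratios(test_depth, control_depth):
    """Per-region ratio of normalised depth in a test sample versus a control."""
    test_total = sum(test_depth.values())
    control_total = sum(control_depth.values())
    ratios = {}
    for region, depth in test_depth.items():
        test_share = depth / test_total
        control_share = control_depth[region] / control_total
        ratios[region] = test_share / control_share
    return ratios


def call_dosage(ratio, loss=0.7, gain=1.3):
    """Label a ratio; the cut-offs are assumed values, not validated thresholds."""
    if ratio < loss:
        return "deletion"
    if ratio > gain:
        return "duplication"
    return "normal"


# Hypothetical mean depths per region; region C has half the expected depth.
control = {"A": 100, "B": 100, "C": 100, "D": 100, "E": 100}
test = {"A": 200, "B": 200, "C": 100, "D": 200, "E": 200}
calls = {region: call_dosage(r) for region, r in dosage_ratios(test, control).items()}
```

Normalising by total depth is only reasonable when the altered regions are a small part of the target; a large deletion would shift the baseline for every other region.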

The problem of indels BRCA1 c.1175_1214del40

Soft clipped reads assembled de novo and the consensus read is BLAT aligned
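As a sketch of the first step described above, soft-clipped bases can be pulled out of an alignment record by reading its SAM CIGAR string, where the operation `S` marks clipped bases that remain in the read sequence. The function name and the toy read are assumptions; the de novo assembly and BLAT alignment steps are not shown.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_segments(seq, cigar):
    """Return (leading_clip, trailing_clip) sequences for one aligned read."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Hard-clipped bases (H) are absent from the read sequence, so ignore them.
    core = [item for item in ops if item[1] != "H"]
    leading = trailing = ""
    if core and core[0][1] == "S":
        leading = seq[:core[0][0]]
    if len(core) > 1 and core[-1][1] == "S":
        trailing = seq[len(seq) - core[-1][0]:]
    return leading, trailing


# Toy 10-base read (hypothetical) with three different alignments.
toy_read = "ACGTACGTAC"
clip_start = soft_clipped_segments(toy_read, "3S7M")
clip_end = soft_clipped_segments(toy_read, "6M4S")
no_clip = soft_clipped_segments(toy_read, "10M")
```

Clipped segments that pile up at the same reference position are the candidates for assembly into a consensus.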

From 596 targeted exons there are 12 exons that “shouldn’t” be screened

• APC exon 1
• CDKN2A (3 exons)
• CHEK2 exon 3
• EPCAM exon 1
• FANCD2 exon 20
• FLCN exons 4, 8 and 11
• RAD51 exon 9
• STK11 exon 3

The Blacklist

Low concordance of multiple variant-calling pipelines O’Rawe et al. Genome Medicine 2013, 5:28

SNV concordance: 57.4%; indel concordance: 26.8%
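One way to express concordance between pipelines is the share of all variants called by any pipeline that every pipeline calls. The sketch below uses that definition as an assumption; the cited paper's exact calculation may differ, and the variant tuples are toy values.

```python
def concordance(call_sets):
    """Percentage of variants in the union of all call sets that appear in every set."""
    sets = [set(calls) for calls in call_sets]
    union = set().union(*sets)
    if not union:
        return 0.0
    shared = set.intersection(*sets)
    return 100.0 * len(shared) / len(union)


# Hypothetical variants as (chromosome, position, ref, alt) tuples.
v1 = ("chr1", 100, "A", "G")
v2 = ("chr1", 250, "C", "T")
v3 = ("chr2", 300, "G", "A")
v4 = ("chr3", 400, "T", "C")

pipeline_a = {v1, v2, v3}
pipeline_b = {v1, v2, v4}
pipeline_c = {v1, v3, v4}

toy_concordance = concordance([pipeline_a, pipeline_b, pipeline_c])  # 1 shared of 4 total
```

Variants need to be normalised to a common representation before comparison, which matters most for indels.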


Clinical Whole-Exome Sequencing for the Diagnosis of Mendelian Disorders

Yang et al, (Baylor) NEJM 2013

• Whole-exome sequencing identified the underlying genetic defect in 25% of consecutive patients referred for evaluation of a possible genetic condition.

• As testing with whole-exome sequencing evolves to characterise more patients with atypical presentations of known genetic diseases, the spectrum of phenotypes associated with genetic disorders will expand.

Coverage of target

[Figure: histogram of the number of features (0 to 7500) by coverage (0 to 100)]

Data performance indicators

[Figure: per-lane histograms (length vs frequency) of the insert size of mapped pairs and the size of merged pairs, for MS0404_L001, MS0404_L002, MS0405_L001, MS0405_L002, MS0406_L001, MS0406_L002, MS0407_L001 and MS0407_L002]

Tumour:
CNS: Min.Cov. 40, Min.Reads 20, Min.Var.Freq. 0.05
Indels: Min.Cov. 20, Min.Reads 5, Min.Var.Freq. 0.05

Germline:
CNS: Min.Cov. 40, Min.Reads 20, Min.Var.Freq. 0.20
Indels: Min.Cov. 10, Min.Reads 5, Min.Var.Freq. 0.20
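The thresholds above could be applied as a simple filter, sketched here. The threshold values are those on the slide; the dictionary layout, the function name, and the rule that all three conditions must hold together are assumptions rather than a description of the variant caller actually used.

```python
# Thresholds from the slide, keyed by (sample type, call type).
THRESHOLDS = {
    ("tumour", "cns"): {"min_cov": 40, "min_reads": 20, "min_var_freq": 0.05},
    ("tumour", "indel"): {"min_cov": 20, "min_reads": 5, "min_var_freq": 0.05},
    ("germline", "cns"): {"min_cov": 40, "min_reads": 20, "min_var_freq": 0.20},
    ("germline", "indel"): {"min_cov": 10, "min_reads": 5, "min_var_freq": 0.20},
}


def passes_thresholds(sample_type, call_type, coverage, variant_reads):
    """True if coverage, supporting reads and variant frequency all meet the minimum."""
    limits = THRESHOLDS[(sample_type, call_type)]
    if coverage < limits["min_cov"] or variant_reads < limits["min_reads"]:
        return False
    return variant_reads / coverage >= limits["min_var_freq"]


# The same hypothetical observation (30 variant reads at 400X, 7.5%) under both settings.
tumour_call = passes_thresholds("tumour", "cns", coverage=400, variant_reads=30)
germline_call = passes_thresholds("germline", "cns", coverage=400, variant_reads=30)
```

The example shows why tumour and germline settings differ: a 7.5% variant is plausible in a mixed tumour sample but would be rejected as noise in a germline test.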

• SIFT
• PolyPhen
• HGVS
• HGNC
• RefSeq
• Cosmic_v65
• dbsnp_137Common
• dbsnp_137Flagged
• dbsnp_137Multi
• LOVD?
• BIC?

Merging forward and reverse reads

Variant & SIFT classes

[Figure: variant class and SIFT class, child vs father, for the Neuro, I-exome and N-exome data sets]

Enrichment and Variant Detection

NF1 exon 1 HTT exon 1

Amplicon targeted resequencing

Step 1: PCR to amplify regions of interest

[Figure: genomic DNA amplified with forward (F) and reverse (R) primers, each carrying a locus-specific sequence and an overhang adapter sequence used in Step 2]

Step 2: 2nd round of PCR to add ILMN indices and sequencing adapters

Index adapter oligos from ILMN contain P5/P7 adapters to make the template compatible with the flow cell, and also contain a unique sample index.

Construct layout: P5, Index 1, insert to be sequenced, Index 2, P7

Assay design: primers are Y-shaped, with a 5’ phosphate for ligation and a 3’ phosphorothioate to prevent digestion of the 3’ T overhang

Grouped read testing

• Targeted
• Sensitive
• Quantitative
• Low computing overhead
• Genotypes
• Estimates error rate
• BLAST/BLAT mutation scanning option

Genotyping FFPE samples

gene                   011707PM  012307PM  022307PM  032307PM
BRAFV600wt               100.00    100.00    100.00    100.00
BRAFV600Ec.1799T>A         0.00     18.18     63.51      0.00
KRAS12&13wt              100.00    100.00    100.00    100.00
KRAS12c.34G>A              0.00      0.00      0.00      0.00
KRAS12c.34G>C              0.00      0.00      0.00      0.00
KRAS12c.34G>T              0.00      0.00      0.00      0.00
KRAS12c.35G>A              0.00      0.00      0.00      0.00
KRAS12c.35G>C              0.00      0.00      0.00      0.00
KRAS12c.35G>T              0.00      0.00      0.00      0.00
KRAS13c.38G>A              0.00      0.00      0.00      0.00
KRAS61wt                 100.00    100.00    100.00    100.00
KRAS61c.181C>A             0.00      0.00      0.00      0.00
KRAS61c.181C>G             0.00      0.00      0.00      0.00
KRAS61c.182A>C             0.00      0.00      0.00      0.00
KRAS61c.182A>G             0.12      0.00      0.00      0.11
KRAS61c.182A>T             0.00      0.00      0.00      0.00
KRAS61c.183A>C             0.00      0.00      0.00      0.00
KRAS61c.183A>G             0.00      0.00      0.00      0.00
KRAS61c.183A>T             0.00      0.00      0.00      0.00
NRAS12&13wt              100.00    100.00    100.00    100.00
NRAS12c.34G>A              0.00      0.00      0.00      0.00
NRAS12c.34G>C              0.00      0.00      0.00      0.00
NRAS12c.34G>T              0.00      0.00      0.00      0.00
NRAS12c.35G>A              0.00      0.00      0.00      0.00
NRAS12c.35G>C             30.50      0.00      0.00     19.89
NRAS12c.35G>T              0.00      0.00      0.00      0.00
NRAS13c.37G>A              0.00      0.00      0.00      0.00
NRAS13c.37G>C              0.00      0.00      0.00      0.00
NRAS13c.37G>T              0.00      0.00      0.00      0.00
NRAS13c.38G>A              0.00      0.00      0.00      0.00
NRAS13c.38G>C              0.00      0.00      0.00      0.00
NRAS13c.38G>T              0.00      0.00      0.00      0.00
NRAS61wt                 100.00    100.00    100.00    100.00
NRAS61c.181C>A             0.00      0.00      0.00      0.00
NRAS61c.181C>G             0.00      0.00      0.00      0.00
NRAS61c.182A>C             0.00      0.00      0.00      0.00
NRAS61c.182A>G             0.00      0.00      0.00      0.00
NRAS61c.182A>T             0.00      0.00      0.00      0.00
NRAS61c.183A>C             0.00      0.00      0.00      0.00
NRAS61c.183A>G             0.00      0.00      0.00      0.00
NRAS61c.183A>T             0.00      0.00      0.00      0.00
PIK3CA1633wt             100.00    100.00    100.00    100.00
PIK3CA1633c.1633G>A       25.55     14.29      0.00     35.79
PIK3CA3140wt             100.00    100.00    100.00    100.00
PIK3CA3140c.3140A>G        0.00      0.00      0.00      0.00

coverage statistics

each amplicon read sorted by primers

grouped amplicon variants

grab amplicons

sort by locus

group amplicons

Edit Distance

Read Counts, Read Distribution and Analysis

Using the amplimer sequence to grab each amplicon is an alternative to querying the entire sequence output, with the advantage that each set of reads should be more homogeneous and amenable to grouping. Groups above a certain abundance (corresponding to the detection limit selected) can then be compared in detail with the canonical sequence using string comparison tools such as the Levenshtein (edit) distance, or by Smith-Waterman alignment. Using this approach we have confirmed that variants can be identified de novo, but with more interference from sequence errors than by grouped read typing. We have also shown that the current TruSeq Cancer Panel kit co-amplifies a region of chromosome 22 containing a perfect match to the pathogenic KIT Exon 11 c.1669T>A mutation. Artifactual data from the duplicated region risks the reporting of specious variants as false positive results.
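The grab, group and compare steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy primers and reads, and the 20% abundance cut-off are all hypothetical, and reads are assumed to be oriented so that they start with one primer and end with the other.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def group_amplicon_reads(reads, start_primer, end_primer, min_fraction):
    """Grab reads bounded by the primers, group identical reads, keep abundant groups."""
    grabbed = [r for r in reads if r.startswith(start_primer) and r.endswith(end_primer)]
    counts = Counter(grabbed)
    return [(seq, n) for seq, n in counts.most_common() if n / len(grabbed) >= min_fraction]


# Toy data (hypothetical): 6 canonical reads, 3 variant reads, 1 error read, 1 off-target read.
CANONICAL = "ACGTGGCCATTGA"
toy_reads = [CANONICAL] * 6 + ["ACGTGGTCATTGA"] * 3 + ["ACGTGGCCTTTGA", "GGGGGGGGGGGGG"]

groups = group_amplicon_reads(toy_reads, "ACGT", "TTGA", min_fraction=0.2)
report = [(seq, n, levenshtein(seq, CANONICAL)) for seq, n in groups]
```

Grouping first means the edit distance is computed once per distinct sequence rather than once per read, which keeps the computing overhead low.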

Average Read Count per Amplicon +/- SEM

Smith-Waterman

BLAST

363 reads; common mutation: p.G12A; chr12:25,398,290C>G; c.35G>C

TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

220 reads; wildtype

TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

64 reads; non-coding chr12:295,398,329C>T

TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCTTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

39 reads; two errors, non adjacent, one corresponding to c.35G>C

TGTATTGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

27 reads; two errors, non adjacent, one corresponding to c.35G>C

TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGATTATATTAGAACATGTCACACATAAGGTTA

26 reads; two errors, non adjacent, one corresponding to c.35G>C

TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGTAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

17 reads; two non-coding errors, non adjacent

TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCCGCAGGCTTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

17 reads; corresponding to c.35G>A

TGTATCGTCAAGGTACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

>1141_>EGFR9_78 chr7 55242418-55242511 1141

GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT

>646_>EGFR9_78 chr7 55242418-55242511 646

GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT

>60_>EGFR9_78 chr7 55242418-55242511 60

GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTTCATGGCT

>57_>EGFR9_78 chr7 55242418-55242511 57

GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGTTTTGCTGTGTGGGGGTCCATGGCT

>54_>EGFR9_78 chr7 55242418-55242511 54

GACTTTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT

>51_>EGFR9_78 chr7 55242418-55242511 51

GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCTTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT

GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT

GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAG---------------ACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT

KRAS point mutation

EGFR 15 base deletion

EGFR 15 base deletion with Smith-Waterman alignment
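For a single contiguous deletion like the one above, a much simpler technique than Smith-Waterman can locate it: trim the prefix and suffix that the read shares with the reference and report what is left. The sketch below is that simpler substitute, not the alignment used for the slide. The function name is an assumption; the two sequences are excerpts of the wildtype and deleted EGFR reads shown above.

```python
def locate_indel(reference, read):
    """Trim the shared prefix and suffix; return (offset, reference_part, read_part)."""
    max_shared = min(len(reference), len(read))
    prefix = 0
    while prefix < max_shared and reference[prefix] == read[prefix]:
        prefix += 1
    suffix = 0
    # Stop the suffix before it overlaps the prefix already consumed.
    while suffix < max_shared - prefix and reference[-1 - suffix] == read[-1 - suffix]:
        suffix += 1
    return (prefix,
            reference[prefix:len(reference) - suffix],
            read[prefix:len(read) - suffix])


# Excerpts of the wildtype and deleted EGFR reads from the slide.
wildtype = "CGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAG"
deleted = "CGCTATCAAGACATCTCCGAAAG"

offset, missing_from_read, extra_in_read = locate_indel(wildtype, deleted)
```

The reported placement is one of possibly several equivalent ones when the deleted bases sit in a repeat, and the method does not cope with reads that also carry sequencing errors in the flanks; a proper alignment handles both cases.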

Amplicon Sequencing: treating reads as groups

Most reads are error free

FFPE contaminates the evidence

Clustal alignment & phylogeny of errors

BRAF V600E

[Figure: UCSC Genome Browser (hg19) BLAT view of the BRAF amplicon at chr7:140,453,130-140,453,230, with Consensus CDS (CCDS5863.1), RefSeq Genes, dbSNP 137 common SNPs, segmental duplications and RepeatMasker tracks]

[Figure: Clustal alignment of 29 grouped BRAF1 read sets, labelled by read count: 24,028 and 2,594 reads for the two largest groups, then 27 groups of 198 down to 92 reads]

PTEN Variants

[Figure: bar chart of read count (0 to 1600) for each grouped PTEN read sequence listed below]

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA



CAGAAAAAGTAGAAAATGGAAGTTTATGTGATCAAGAAATCGATAGCATTTGCA


CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGTATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGTA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATTAAGAAATCGATAGCATTTGCA


TAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA













CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATCTGCA






CAGAAAAAGTAGGAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAGAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

PTEN Variants Canonical

PTENP1

PTEN delTTAC

Families of Variants

Canonical

Pseudo

delTTAC

rest(>2reads)

EGFR + del15: grouped reads of 2 or more = 2,784; grouped read sets = 184

wildtype

mutant

rest

PTEN + del TTAC: grouped reads of 2 or more = 4,314; grouped read sets = 321

DNA Extracted from FFPE is Damaged and Degraded

Nucleotide Damage Degraded DNA Fragments

Strand-specific sequencing to Differentiate True Variation From DNA Damage
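One reading of the strand-specific idea is that a true variant should be seen in reads from both strands, while damage to one strand of the template shows up on one strand only. The sketch below encodes that interpretation; the function name, labels, 5% cut-off and read counts are assumptions.

```python
def classify_by_strand(fwd_alt, fwd_total, rev_alt, rev_total, min_freq=0.05):
    """Classify a candidate variant by whether both strands support it."""
    fwd_freq = fwd_alt / fwd_total if fwd_total else 0.0
    rev_freq = rev_alt / rev_total if rev_total else 0.0
    if fwd_freq >= min_freq and rev_freq >= min_freq:
        return "supported on both strands"
    if fwd_freq >= min_freq or rev_freq >= min_freq:
        return "single strand only (possible damage)"
    return "not detected"


# Hypothetical counts of variant reads and total reads per strand.
both = classify_by_strand(60, 600, 55, 580)
one_strand = classify_by_strand(45, 600, 1, 580)
absent = classify_by_strand(0, 600, 1, 580)
```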

Extending grouped read testing

[Figure: UCSC Genome Browser (hg19) view of chr17:41,223,030-41,223,190 showing the target and its flanks, with RefSeq Genes, dbSNP 137 common SNPs, segmental duplications and RepeatMasker tracks]

CCAGCAGTATCAGTA(.*)AGATTCTGCAACTTT

TATGAGCAGCAGCTG(.*)CAATTGGGGAACTTT

AGATTCTGCAACTTT(.*)CAATGCAGAGGTTGA

rs1799966

gene | total | forward | reverse
chr17:41223060-41223115 | 5,589 | 2,044 | 3,545
chr17:41223076-41223130 | 3,254 | 1,347 | 1,907
chr17:41223101-41223146 | 6,963 | 3,390 | 3,573

mapped reads: 15,806
total reads: 38,228,739
percent mapped: 0.04

Non-amplicon systems? Probably not with read pairs, because of random shearing and the variable distance between reads. But with single reads, or individual reads from read pairs, why not pretend that they are composed of sets of amplicons? This pretense will become more convincing as reads get longer and more accurate. 2008: read length 32 bases; 2013: read length 400 bases.

CCAGCAGTATCAGTA(.*)AGATTCTGCAACTTT

TATGAGCAGCAGCTG(.*)CAATTGGGGAACTTT
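The two flank patterns above can be applied directly as regular expressions to treat shotgun reads as if they were amplicons: any read containing both flanks yields the sequence between them, and identical captures are grouped. In this sketch the patterns are those on the slide, while the function name, dictionary keys and toy reads are assumptions.

```python
import re
from collections import Counter

# Flank patterns from the slide; the capture group is the sequence between the flanks.
FLANK_PATTERNS = {
    "flanks_1": re.compile(r"CCAGCAGTATCAGTA(.*)AGATTCTGCAACTTT"),
    "flanks_2": re.compile(r"TATGAGCAGCAGCTG(.*)CAATTGGGGAACTTT"),
}


def group_targets(reads, pattern):
    """Count each distinct sequence captured between the two flanks."""
    counts = Counter()
    for read in reads:
        match = pattern.search(read)
        if match:
            counts[match.group(1)] += 1
    return counts


# Toy reads (hypothetical inserts between the real flank sequences).
left, right = "CCAGCAGTATCAGTA", "AGATTCTGCAACTTT"
toy_reads = [
    "TT" + left + "GTATGAGC" + right + "GG",
    "A" + left + "GTATGAGC" + right,
    left + "GTATGGGC" + right + "CC",
    "ACGTACGTACGTACGT",  # contains neither flank, so it is ignored
]
target_counts = group_targets(toy_reads, FLANK_PATTERNS["flanks_1"])
```

Only reads that span both flanks contribute, which is why the approach becomes more productive as reads get longer.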


Merged forward and reverse reads

PandaSeq and SeqPrep can merge overlapping read pairs to make them even longer and more accurate. In an unselected 100-base read-pair run enriched for hereditary cancer genes, over 20% of the reads could be merged. With longer reads and suitable experimental design this fraction could be increased.

Pairs processed: 18,456,760
Pairs merged: 4,029,383
Pairs with adapters: 32,899
Pairs discarded: 646
Percent merged: 21.83
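To illustrate what merging a read pair involves, the sketch below joins two reads when the end of read 1 exactly matches the start of the reverse complement of read 2. It is a deliberately naive stand-in: real mergers such as PandaSeq and SeqPrep tolerate mismatches and use base qualities, which this does not. The function names, the minimum overlap and the toy fragment are assumptions; the final line reproduces the merged percentage from the counts above.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge a read pair on the longest exact overlap, or return None if there is none."""
    mate = reverse_complement(read2)
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1.endswith(mate[:overlap]):
            return read1 + mate[overlap:]
    return None


# Toy 34-base fragment (hypothetical) sequenced as two 24-base reads overlapping by 14.
fragment = "ACGTTGCAAGGCTTAACCGGATCGATTACGGCTA"
read1 = fragment[:24]
read2 = reverse_complement(fragment[-24:])
merged = merge_pair(read1, read2)

# Percentage of pairs merged, from the counts reported above.
percent_merged = round(100 * 4_029_383 / 18_456_760, 2)
```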

locus | 1total | 2total
chr17:41223060-41223115 | 2631 | 2777
chr17:41223076-41223130 | 2452 | 2674
chr17:41223101-41223146 | 2223 | 2501

[Figure: bar chart (0 to 3000) of 1total and 2total for each of the three loci]

Probability of a given length read as a subset of a longer read in a normal distribution of longer reads: the “minimum substring problem”

This approach might make sense with longer reads

Average read length from 101 to 150 bases

Orthogonal validation without Sanger

[Figure: UCSC Genome Browser (hg19) view of chr17:41,223,070-41,223,105 showing the BLAT-aligned sequence and its flanks at rs1799966]

CCAGCAGTATCAGTA(.*)AGATTCTGCAACTTT

TATGAGCAGCAGCTG(.*)CAATTGGGGAACTTT

[Figure: bar chart of read counts (0 to 500) for the two sequences GCCCAGAGTCCAGCTGCTGCTCATAC and GCCCAG*G*GTCCAGCTGCTGCTCATAC]

Forward reads inc rs1799966

Cheap NextGen Adaptations

Unlikely to be high priority for NGS manufacturers

Require custom informatics solutions

• MLPAbrary

• MSIbrary

[Figure: MLPAbrary bar chart (y-axis 0 to 9) for samples MS0606, MS0607, MS0608 and MS0609; label CNCE1exon6]

[Figure: length (x) vs read count (y, 0 to 450) for BAT34c4 (lengths 87 to 91), D18S55 (lengths 89 to 103) and D5S346 (lengths 63 to 67)]

What did we learn?

• Incremental rather than radical approaches are easier to integrate into service

• Quality standards can be progressively bootstrapped from the previous “best practice”

• Some genes will be refractory to hybridization enrichment and short read sequencing

• Formalin fixation increases the noise in tissue samples, reducing sensitivity to low levels of variants in mixed samples

• Data handling is challenging for larger target sizes

What would an NGS-based diagnostic workflow look like?

• Clinically driven request

– Gene list driven by phenotype

– Actionable results

– Quality assured

– Cost-effective

– *Technical report

– Clinical report

Building pipelines to collect and manage the data

Managing and sharing the data

• FASTQ output
• Alignment (Novoalign)
• Coverage analysis (BEDTools)
• Variant calling (GATK)
• Annotation: Annovar, SeattleSeq, Alamut-ht, VEP
• Custom info.: Local MAF, Lab path. sc., HGMD, Position, ‘Meta’ data
• Technical report: Variant data, Assay performance, Log of processes (software and versioning)

Quality Control of NGS Output and Interpretation

• Confidence in sequence read output
– Source of read (pre-analytical formatting)
– Quality of the read (base calling accuracy, phase, consensus)

• Confidence in variant detection
– Quality of mapping and alignment
– Quality of mismatch detection

• Confidence in variant exclusion
– Coverage of the region of interest
– Identify and flag “black list” regions
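A minimal sketch of the variant exclusion check, assuming each exon is reported with the fraction of its bases above the minimum depth. The blacklist entries are the exons named earlier in the talk (the three CDKN2A exons are omitted because their numbers are not given); the function name, status labels and the requirement of complete coverage are assumptions.

```python
# Exons listed on the blacklist slide; CDKN2A is left out because its exons are not numbered there.
BLACKLIST = {
    ("APC", 1), ("CHEK2", 3), ("EPCAM", 1), ("FANCD2", 20),
    ("FLCN", 4), ("FLCN", 8), ("FLCN", 11), ("RAD51", 9), ("STK11", 3),
}


def exclusion_status(gene, exon, fraction_above_min_depth, required_fraction=1.0):
    """Say whether the absence of a variant call in this exon can be relied on."""
    if (gene, exon) in BLACKLIST:
        return "blacklisted: not screened"
    if fraction_above_min_depth < required_fraction:
        return "incomplete coverage: variants cannot be excluded"
    return "covered: variants can be excluded"


# Hypothetical coverage fractions for three exons.
statuses = [
    exclusion_status("FLCN", 8, 1.0),
    exclusion_status("BRCA1", 11, 0.97),
    exclusion_status("BRCA2", 2, 1.0),
]
```

Checking the blacklist before coverage matters because a blacklisted exon can show good depth and still give unreliable calls.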

Reported data fields: VCF annotation/metadata

Coverage

- Reads aligned

- Coverage of target

- Location of bases

- Mappability

- Known concordance with Gold Standards

Variation

- HGVS compliant variants
- Chr, pos, c. location
- Function (exonic/intronic/UTR)
- Mutation type (synon./missense)
- Distance from s. site
- Splice site data
- SIFT
- AlignGVGD
- dbSNP identifier
- 1K genomes frq data
- dbSNP clinically associated flag
- 89 fields in total

Robust standards for genomic medicine

• Databases and data content

– Access to identified and de-identified data (consent and confidentiality)

– Database accreditation process in prep with RCPA

– Defining the performance of various aligners, variant callers and annotation programs

– Clinical grade Variant Call Format (VCF)

– Metafile covering data trail: what was tested, what was not tested

– EQA

Standards

Human Variome Project

Human Variome Project Australian Node

What We’ve Done

• NeAT Funding (2010-2011)
– Pilot Phase: 4 labs, 3 diseases (Breast Cancer, Colon Cancer, Huntington’s)
– Portal Launched April 2011
– Molecular Data Only
– Collaboration with Mawson

• NeCTAR Funding (2012-2013)
– 12 more labs + all genes they test for
– Configuration Tool
– Clinical Data/Phenotype Linkage
– Transfer data internationally

What We Built

• Collection Tool

• Portal

• Data Model

• Ethics Processes

• Access & Usage Policy

• Data Sharing Agreements

HVPA Search Tools

Improvements to database

Area | Feature | Median
Gene Display | Statistics of number of variants for that gene as table or bar graph | 1
Variant Instances | Raise a concern about an instance's interpretation | 1
General | Matching NATA fields as standard collected fields | 1
Variant search | Search by range | 2
Variant search | Search by genomic position | 2
Variant search | Filter by pathogenicity | 2
Variant search | Sort by... (pathogenicity, other fields) | 2
Gene Display | Display links to related database for gene by referencing genenames.org | 2
Variant search | Wildcard search of variants | 2
New | Search by disease which shows multiple genes and variant results | 2
New | VCF data imports into HVP Australia | 2
Gene Display | VarVis - visualisation of gene and variants reported | 2
New | VCF data export from HVP Australia of a set of results | 3
Variant search | At instance level - see other variants from this test/patient | 3
Variant Instances | Capture & display SIFT score | 3
Notifications | Notify if consensus of pathogenicity score changed/updated | 3
General | Integration with EBI/NCBI tools for queries and displays | 3
Variant search | Display last date uploaded for this variant (or last 10 dates) | 3
Variant search | Search by variant type (operator, e.g. del, sub, inversion...) | 4
Variant Display | As a track in UCSC, EBI and NCBI genome browser | 4
New | Patient MetaData (BioGrit?) integration (Time of Death, Co-morbidities, ...) | 4

HVP Variant Exporter

DMuDB

Growth of DMuDB is supported by the active collection of data from its users.

Referral data are submitted to the database either directly through the web interface, or to the database administrator in spreadsheet format for bulk upload.

First line quality assurance is performed by the users themselves, with an ‘approval’ status display showing whether a variant entry has been checked and approved by the submitting laboratory. This process aims to reflect the quality assurance procedures followed by diagnostic laboratories.

New genes, diseases and reference sequences are added to the database by the administrator, on request, whenever they are needed.

www.ngrl.org.uk/Manchester

NeCTAR Project Deliverables

Benefit & Outcome | Objective & Measure
1. Improved research capability for InSiGHT | AIM 1. De-identified linkage of HVP Australian Node data with clinical data set; AIM 2. Demonstration of query capability to InSiGHT
2. Increase access to molecular data | AIM 3. Increase submitting laboratories on HVP Australian Node from 3 to 15.
3. Reduce barriers for new laboratories to participate | AIM 4. Configuration Tool tested at 3 participating laboratories

Prototype NGS database

Adding variants

Storing and Sharing NGS Output

Is my variant in other LOVDs?

The Next Steps

• Start sharing data with HVPA

• Develop NGS best practice for targeted and genome scale sequencing

• An international consensus standard for the variant call file format that is fit for diagnostic use

• Work with RCPA on diagnostic quality variant databases
