Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

© 2011 Illumina, Inc. All rights reserved. Illumina, illuminaDx, BeadArray, BeadXpress, cBot, CSPro, DASL, Eco, Genetic Energy, GAIIx, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iSelect, MiSeq, Nextera, Sentrix, Solexa, TruSeq, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks or registered trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners. Striving for Perfection: The Platinum Genomes Project Elliott H. Margulies, Ph.D. Director, Scientific Research

Upload: genomeinabottle

Post on 24-Jun-2015


TRANSCRIPT

Page 1: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project


Striving for Perfection: The Platinum Genomes Project

Elliott H. Margulies, Ph.D. Director, Scientific Research

Page 2: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

From Sample to Answer

Sample → Sequence → Analyse → Annotate → Interpret → Answer

Improved Accuracy and Utility of detected variants

Enabling clinical use of WGS

Fast sequencing from low-input and FFPE samples

Integrated “push button” analyses – from sequence to annotated variants

Focus on genome exploration

Page 3: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

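The truth is hard to find…

[Figure: trio pedigree with genotypes Dad A/A, Mom T/T, Child T/T; comparison of the variants called the first time and the second time the same genome is sequenced]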

Sequencing the same genome twice does not give you the identical answer

We identify many more Mendelian conflicts than actually exist
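The trio on this slide (Dad A/A, Mom T/T, Child T/T) is an example of a Mendelian conflict: the child carries no allele that could have come from the father. A minimal sketch of such a check follows, assuming autosomal diploid genotypes written as "allele/allele" strings. The function name and the string format are illustrative assumptions, not taken from the talk.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child):
    """Return True if the child's genotype can be formed by taking
    one allele from each parent (autosomal, diploid, unphased)."""
    father_alleles = father.split("/")
    mother_alleles = mother.split("/")
    child_alleles = sorted(child.split("/"))
    # Try every (paternal allele, maternal allele) combination.
    return any(
        sorted(pair) == child_alleles
        for pair in product(father_alleles, mother_alleles)
    )


# The genotypes shown on the slide: the child cannot be T/T.
print(is_mendelian_consistent("A/A", "T/T", "T/T"))  # False
# A heterozygous child would be consistent with these parents.
print(is_mendelian_consistent("A/A", "T/T", "A/T"))  # True
```

A conflict flagged this way may reflect a genotyping error in any of the three samples rather than a real de novo event, which is the point the slide makes.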

Page 4: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

Summary of increased accuracy

Method         Filter             Sensitivity   Mendelian Conflicts   Accuracy
Eland+CASAVA   unfiltered         96.62         13,032                99.9995%
Eland+CASAVA   + gVCF filters     96.10         8,383                 99.9997%
Eland+CASAVA   + score:coverage   95.25         5,309                 99.9998%
BWA+MPG*                          95.90         4,928                 99.9998%

1.43% loss in sensitivity

59.26% loss in conflicts

* Accurate and comprehensive sequencing of personal genomes. S.S. Ajay, S.C.J. Parker, H. Ozel Abaan, Karin V. Fuentes Fajardo, and E.H. Margulies. Genome Res. 2011 21: 1498-1505

NB: Accuracy is expressed here as % total filtered calls that are Mendelian concordant
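The two "loss" figures on this slide appear to be relative changes between the unfiltered row and the fully filtered row. A small sketch of that arithmetic follows, using the values printed on the slide. The helper name is an assumption made for illustration.

```python
def relative_loss(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Unfiltered vs. "+ score:coverage" rows from the slide.
conflict_loss = relative_loss(13032, 5309)
sensitivity_loss = relative_loss(96.62, 95.25)

print(f"loss in conflicts:   {conflict_loss:.2f}%")     # 59.26%
print(f"loss in sensitivity: {sensitivity_loss:.2f}%")  # 1.42%
```

The conflict figure reproduces the slide exactly. The sensitivity figure comes out at 1.42% rather than the 1.43% shown, presumably because the slide was computed from unrounded sensitivities.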

Page 5: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

A critical assessment of whole-genome sequencing…

• Where are we doing well?
• What parts of the genome are still inaccessible or less accurately called – and most importantly, why?

GOALS:

• Maximum utility for use in research and medical applications
• Determine key areas for improvement and assess progress
• Assess performance in real-life situations

Page 6: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

Platinum genomes: the proposal

• Select a small set of well-known and accessible genomes
• Generate initial WGS datasets using best current practices
• Make it freely available in a database by "open source" principles
• Perform analyses to define high and low quality regions and variant calls
• Examine low quality regions and calls and validate with additional evidence (methods)
• Maintain a database with revised data and evidence to provide a long term benchmark
• Develop improved methods (analysis, chemistry, sample prep)

Page 7: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

CEPH/Utah Pedigree 1463

• Three generation family, extensively sequenced by the genomics community
• Focus on the trio shaded in gray (12877, 12878 and 12882)
• Sourced ~200µg for the initial trio (shaded) and ~50µg for all others

[Figure: pedigree 1463. Top row: 12889, 12890, 12891, 12892. Middle row: 12877, 12878. Bottom row: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12887, 12886, 12888, 12893. Shaded trio: 12877, 12878, 12882.]

Page 8: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

Initial dataset

Sample    Depth    Q30    Genotype coverage   Genotype concordance
NA12877   219.63   91.3   99.79               99.25
NA12878   211.88   93.6   99.8                99.25
NA12882   217.95   93.2   99.8                99.24
NA12881   46.67    91.7   99.84               99.28
NA12880   48.37    91.4   99.74               99.28
NA12879   48.01    92     99.75               99.29
NA12883   54.73    94.2   99.6                99.27
NA12884   43.76    93.2   99.7                99.27
NA12885   54.56    94     99.8                99.28
NA12886   64.98    91     99.8                99.28
NA12887   48.33    92.4   99.81               99.29
NA12888   47.61    92.2   99.81               99.28
NA12889   49.99    91     99.49               99.28
NA12890   59.34    88     99.8                99.29
NA12891   45.49    93     99.75               99.28
NA12892   50.32    93.4   99.67               99.29
NA12893   47.69    92.7   99.79               99.28

Technical Replicate

Page 9: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

[Figure: replicate design for NA12882. Technical Replicate A and Technical Replicate B each comprise one 200x build (18 lanes), two 100x builds (8 lanes each), and four 50x builds.]

• Callability and reproducibility among pairs of replicates
  – 50x vs 100x vs 200x
  – Between technical replicates

Page 10: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

Pair-wise comparisons of genome builds

Coverage   Library     SNPs     Indels   Combined
50x        different   99.34%   90.94%   98.52%
50x        same        99.36%   90.83%   98.52%
100x       different   99.47%   90.60%   98.57%
100x       same        99.47%   90.54%   98.56%
200x       different   99.53%   90.23%   98.55%

Concordance at variant positions where both genomes PASSed basic quality filters
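The concordance figures above compare two builds at positions where both passed filtering. The sketch below shows one way such a number could be computed. The data layout (a dict from position to a genotype and a PASS flag) and all names are assumptions made for illustration, not the pipeline used for the slide.

```python
def concordance(build_a, build_b):
    """Percent of positions PASSing in both builds whose genotypes agree.

    Each build maps a position key to (genotype, passed_filters).
    """
    shared_pass = [
        pos
        for pos in build_a.keys() & build_b.keys()
        if build_a[pos][1] and build_b[pos][1]
    ]
    if not shared_pass:
        return float("nan")
    agreeing = sum(build_a[pos][0] == build_b[pos][0] for pos in shared_pass)
    return 100.0 * agreeing / len(shared_pass)


# Toy example with hypothetical positions and genotypes.
build_a = {
    ("chr1", 100): ("A/T", True),
    ("chr1", 200): ("G/G", True),
    ("chr1", 300): ("C/T", False),  # fails filters in build A, so ignored
    ("chr1", 400): ("T/T", True),
}
build_b = {
    ("chr1", 100): ("A/T", True),
    ("chr1", 200): ("A/G", True),   # genotype differs between builds
    ("chr1", 300): ("C/T", True),
    ("chr1", 400): ("T/T", True),
}
print(f"{concordance(build_a, build_b):.2f}%")  # 66.67%
```

Restricting the denominator to positions that PASS in both builds means uncallable positions do not lower the score, which is why callability is tracked separately in the talk.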

Page 11: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

[Figure: replicate design for NA12882, as on the earlier slide: Technical Replicate A and Technical Replicate B, each with one 200x build (18 lanes), two 100x builds (8 lanes each), and four 50x builds.]

• Consistency across all the replicates
  – How many replicates were able to be called at a given position?
  – How many different genotypes were present at that position?
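The two questions above can be answered per position and then tallied across the genome. A minimal sketch follows, assuming each position is represented by a list of 14 genotype calls (one per build), with None where a build did not pass the genotype quality filter. This representation and the function names are illustrative assumptions.

```python
from collections import Counter


def tally_position(calls):
    """Return (number of replicates called, number of distinct genotypes)."""
    called = [genotype for genotype in calls if genotype is not None]
    return len(called), len(set(called))


def consistency_matrix(positions):
    """Percent of positions falling in each (n_called, n_genotypes) cell."""
    counts = Counter(tally_position(calls) for calls in positions)
    total = sum(counts.values())
    return {cell: 100.0 * n / total for cell, n in counts.items()}


# Toy data: four hypothetical positions across 14 builds.
positions = [
    ["A/T"] * 14,               # called everywhere, one genotype
    ["A/T"] * 13 + [None],      # one build uncalled
    ["A/A"] * 7 + ["A/T"] * 7,  # two different genotypes
    [None] * 14,                # never called
]
for cell, pct in sorted(consistency_matrix(positions).items()):
    print(cell, pct)
```

Each key of the result corresponds to one cell of a replicates-by-genotypes table of the kind shown on the next slide.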

Page 12: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

Consistency among technical replicates

Rows: number of replicates PASSing genotype quality filter. Columns: number of different genotypes. Cell values: % of genome.

    0        1        2        3        4        5        6        7        8        9        10       11       12       13       14
0   1.96
1            0.23
2            0.21     0.0005
3            0.18     0.0006   3.5E-05
4            0.16     0.0007   4.2E-05  8.7E-06
5            0.15     0.0007   4.5E-05  1.3E-05  3.5E-06
6            0.15     0.0008   4.6E-05  1.6E-05  6.1E-06  1.4E-06
7            0.16     0.0008   4.9E-05  1.8E-05  8.8E-06  3.0E-06  8.2E-07
8            0.16     0.0007   5.5E-05  1.9E-05  9.0E-06  4.3E-06  1.9E-06  4.1E-07
9            0.17     0.0007   5.6E-05  2.0E-05  1.1E-05  5.2E-06  2.5E-06  1.4E-06  3.7E-07
10           0.20     0.0006   6.1E-05  2.1E-05  1.1E-05  7.4E-06  3.8E-06  1.9E-06  7.1E-07  1.9E-07
11           0.24     0.0006   6.9E-05  2.6E-05  1.4E-05  9.4E-06  6.4E-06  3.7E-06  1.5E-06  3.7E-07  7.4E-08
12           0.32     0.0007   8.5E-05  3.2E-05  1.9E-05  1.2E-05  8.6E-06  5.5E-06  2.8E-06  1.3E-06  4.8E-07  7.4E-08
13           0.61     0.0010   1.2E-04  4.3E-05  2.8E-05  1.9E-05  1.5E-05  1.1E-05  7.4E-06  4.6E-06  2.0E-06  6.7E-07  2.2E-07
14           95.07    0.0025   2.3E-04  8.6E-05  5.3E-05  4.0E-05  3.6E-05  3.3E-05  3.0E-05  2.3E-05  1.4E-05  7.6E-06  2.1E-06  6.0E-07

“Metal”   Genome   SNVs from a 50x build
Gold      95.1%    94.80%   3,030,777
Silver    2.95%    4.15%    132,579
Copper    0.01%    1.05%    33,679
Lead      1.96%
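The "metal" classes are not defined explicitly on this slide, but the numbers suggest a mapping from the matrix cells: Gold where all 14 replicates are called with one genotype (95.07%), Lead where none are called (1.96%), Copper where more than one genotype is seen, and Silver for the remainder. The sketch below encodes that reading. The rules for Gold, Silver and Lead are inferred from the figures and should be treated as assumptions.

```python
def classify_metal(n_called, n_genotypes, n_replicates=14):
    """Assign a 'metal' class to a position (inferred rules, see text)."""
    if n_called == 0:
        return "lead"      # no replicate could be called
    if n_genotypes > 1:
        return "copper"    # replicates disagree on the genotype
    if n_called == n_replicates:
        return "gold"      # every replicate called, all agree
    return "silver"        # some replicates missing, the rest agree


# Sanity check against the slide: the single-genotype cells for
# 1 to 13 called replicates should add up to roughly the Silver share.
single_genotype_pct = [0.23, 0.21, 0.18, 0.16, 0.15, 0.15, 0.16,
                       0.16, 0.17, 0.20, 0.24, 0.32, 0.61]
silver_pct = sum(single_genotype_pct)
print(round(silver_pct, 2))  # 2.94
```

The summed value of about 2.94% is close to the 2.95% listed for Silver, with the small gap consistent with rounding of the displayed cells. That agreement supports the inferred rules but does not confirm them.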

Page 13: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

Genomic features overlapping with “metal” regions

         Genome   SNVs     CDS      medCDS
gold     95.07%   94.80%   96.91%   97.87%
silver   2.95%    4.15%    1.35%    1.11%
copper   0.01%    1.05%    0.003%   0.002%
lead     1.96%    0.00%    1.74%    1.02%

Page 14: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

A closer examination of “Copper” regions: those that had more than one genotype

Type of inconsistency   Percentage
REF / het SNV           37.40
REF / het DEL           21.89
REF / het INS           15.11
het SNV / hom SNV       5.38
het DEL / hom DEL       0.42
het INS / hom INS       1.43
Remaining               18.38

86% of copper regions had just two different genotypes

Page 15: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

Concordance in “metal” regions

SNP concordance from two builds generated from different libraries

         50x      100x     200x
ALL      99.34%   99.47%   99.53%
Gold     99.80%   99.94%   99.94%
Silver   85.00%   89.81%   93.80%
Copper   53.85%   67.85%   82.12%
Lead*    519      6,589    22,164

* Absolute values more revealing

Non-gold regions of the genome point to areas that are not comprehensively/accurately assessed

Page 16: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

Concordance in “metal” regions

Concordance of variants between two 100x builds from the same library

          SNPs     Indels   Both
Overall   99.47%   90.54%   98.56%
Gold      99.92%   96.77%   99.65%
Silver    90.65%   68.18%   86.32%
Copper    77.13%   57.11%   61.00%
Lead      73.44%   74.73%   73.88%

Indels need more attention

Page 17: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

Practical/Clinical/Medical Relevance

200x build comparison in medically-relevant CDS regions

Metal      ALL     Same    Different   Percent the Same   Percent in Metal
Combined   1,187   1,182   5           99.58%
Gold       1,151   1,151   0           100.00%            96.97%
Silver     29      26      3           89.66%             2.44%
Copper     2       2       0           100.00%            0.17%
Lead       5       3       2           60.00%             0.42%
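The two percentage columns in this table follow directly from the counts. A short sketch of the arithmetic follows, using the counts from the slide. The variable names are illustrative.

```python
# (ALL, Same) counts per metal class, as listed on the slide.
counts = {
    "Gold": (1151, 1151),
    "Silver": (29, 26),
    "Copper": (2, 2),
    "Lead": (5, 3),
}

total_all = sum(n_all for n_all, _ in counts.values())     # 1187
total_same = sum(n_same for _, n_same in counts.values())  # 1182

for metal, (n_all, n_same) in counts.items():
    pct_same = 100.0 * n_same / n_all         # "Percent the Same"
    pct_in_metal = 100.0 * n_all / total_all  # "Percent in Metal"
    print(f"{metal:<7} {pct_same:7.2f}% {pct_in_metal:6.2f}%")

print(f"Combined {100.0 * total_same / total_all:.2f}%")  # Combined 99.58%
```

Because "Percent in Metal" divides by the combined total, the four values sum to 100%, and nearly all of the medically relevant CDS variants fall in Gold regions.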

Page 18: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

Future Plans

• Classify inconsistent parts of the genome into:
  – Alignment or read length issues
    ▪ Paralogous/repetitive/CNV regions
    ▪ Missed or wrong indel calls
  – Depth of coverage
  – Platform-specific artifacts
• Disseminate data/analyses to the research community
• Platform for developing better indel detection
• Error correction via haplotyping efforts
• Independent validation efforts
• Develop a database of variants and associated evidence

Page 19: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

Acknowledgements

• David Bentley
• Sean Humphray
• Mark Ross
• Nick Kerry
• Nondas Fritzilas
• Phil Tedder
• Mike Eberle
• Lisa Murray
• Klaus Maisinger
• Russell Grocock
• Peter Saffrey
• Brad Sickler
• Pedro Cruz
• Shankar Ajay
• Marc Laurant
• Semyon Kruglyak

Page 20: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project


END

Page 21: Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

Research

Accurate and comprehensive sequencing of personal genomes

Subramanian S. Ajay,1 Stephen C.J. Parker,1 Hatice Ozel Abaan,1 Karin V. Fuentes Fajardo,2 and Elliott H. Margulies1,3,4

1 Genome Informatics Section, Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA; 2 Undiagnosed Diseases Program, Office of the Clinical Director, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA

As whole-genome sequencing becomes commoditized and we begin to sequence and analyze personal genomes for clinical and diagnostic purposes, it is necessary to understand what constitutes a complete sequencing experiment for determining genotypes and detecting single-nucleotide variants. Here, we show that the current recommendation of ~30× coverage is not adequate to produce genotype calls across a large fraction of the genome with acceptably low error rates. Our results are based on analyses of a clinical sample sequenced on two related Illumina platforms, GAIIx and HiSeq 2000, to a very high depth (126×). We used these data to establish genotype-calling filters that dramatically increase accuracy. We also empirically determined how the callable portion of the genome varies as a function of the amount of sequence data used. These results help provide a “sequencing guide” for future whole-genome sequencing decisions and metrics by which coverage statistics should be reported.

[Supplemental material is available for this article.]

Whole-genome sequencing and analysis is becoming part of a translational research toolkit (Lupski et al. 2010; Sobreira et al. 2010) to investigate small-scale changes such as single-nucleotide variants (SNVs) and indels (Bentley et al. 2008; Wang et al. 2008; Kim et al. 2009; McKernan et al. 2009; Fujimoto et al. 2010; Lee et al. 2010; Pleasance et al. 2010) in addition to large-scale events such as chromosomal rearrangements (Campbell et al. 2008; Chen et al. 2008) and copy-number variation (Chiang et al. 2009; Park et al. 2010). For both basic genome biology and clinical diagnostics, the trade-offs of data quality and quantity will determine what constitutes a “comprehensive and accurate” whole-genome analysis, especially for detecting SNVs. As whole-genome sequencing becomes commoditized, it will be important to determine quantitative metrics to assess and describe the comprehensiveness of an individual’s genome sequence. No such standards currently exist.

For several reasons (sample handling, platform biases, run-to-run variation, etc.), random generation of sequencing reads does not always represent every region in the genome uniformly. It is therefore necessary to understand what proportion of the whole genome can be accurately ascertained, given a certain amount and type of input data and a specified reference sequence. The 1000 Genomes Project (which aims to accurately assess genetic variation within the human population) refers to this concept as the “accessible” portion of the reference genome (1000 Genomes Project Consortium 2010). While population-scale sequencing focuses on low-coverage pooled data sets, here we focus on requirements for highly accurate SNV calls from an individual’s genome, a question that is extremely important as whole-genome sequencing and analysis of individual genomes transitions from primarily research-based projects to being used for clinical and diagnostic applications. Additionally, we seek to understand the relationship between the amount of sequence data generated and the resulting proportion of the genome where confident genotypes can be derived; we refer to this as the “callable” portion, a term that is roughly equivalent to the 1000 Genomes Project’s “accessible” portion. Using these sequencing metrics and genotype-calling filters will help obviate the need for costly and time-consuming validation efforts. Currently, no empirically derived data sets exist for determining how much sequence data is needed to enable accurate detection of SNVs.

To address this issue, we sequenced a blood sample from a male individual with an undiagnosed clinical condition on two related platforms (Illumina’s GAIIx and HiSeq 2000) to a total of 359 Gb (equivalent to ~126× average sequenced depth). Here we focus on the technical aspects of analyzing these data generated as part of the expanded whole-genome sequencing efforts of the National Institutes of Health (NIH) Undiagnosed Diseases Program (UDP). We leveraged the ultra-deep coverage of this genome to identify sources of incorrect genotype calls and developed approaches to mitigate these inaccuracies. We generated incremental data sets of the deep-sequenced genome to answer the following important questions: Given a specific amount of sequence data, what fraction of the genome is callable? and how many SNVs are detected? Ultimately, we seek to understand how much sequence data is needed for adequate representation of the whole genome for genotype calling and to develop standards by which all whole-genome data sets can be evaluated with respect to comprehensiveness.
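The average depth quoted above is the total number of sequenced bases divided by the genome length. A minimal sketch of that calculation follows. The genome length of about 2.85 Gb is an assumption chosen for illustration, since the text does not state which value was used.

```python
TOTAL_SEQUENCED_BASES = 359e9   # 359 Gb, from the text
ASSUMED_GENOME_LENGTH = 2.85e9  # assumption: approximate genome length in bases


def average_depth(total_bases, genome_length):
    """Average sequenced depth: total bases divided by genome length."""
    return total_bases / genome_length


depth = average_depth(TOTAL_SEQUENCED_BASES, ASSUMED_GENOME_LENGTH)
print(f"{depth:.0f}x")  # 126x
```

Average depth says nothing about how evenly reads are distributed, which is why the paper goes on to ask what fraction of the genome is callable at a given amount of data.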

Answers to these questions will help us make more informed decisions for designing whole-genome sequencing experiments to study genome biology and for clinical analyses, specifically in light of accurately detecting variants that directly modify phenotypes and cause disease.

3 Present address: Illumina Cambridge Ltd., Chesterford Research Park, Little Chesterford, Saffron Walden, Essex CB10 1XL, UK.
4 Corresponding author. E-mail [email protected]
Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.123638.111. Freely available online through the Genome Research Open Access option.

Genome Research 21:000–000. ISSN 1088-9051/11; www.genome.org

Subramanian S. Ajay, Stephen C.J. Parker, Hatice Ozel Abaan, et al. Accurate and comprehensive sequencing of personal genomes. Genome Res. Published online July 19, 2011. doi: 10.1101/gr.123638.111

Downloaded from genome.cshlp.org on July 20, 2011. Published by Cold Spring Harbor Laboratory Press.

Supplemental Material: http://genome.cshlp.org/content/suppl/2011/06/15/gr.123638.111.DC1.html


Copyright © 2011 Cold Spring Harbor Laboratory Press


Genotype calls, 50x vs 50x

Filter                                hg19 callable in both   Discordant
No extra filters                      98.33%                  46,580
With alignment and genotype filters   93.13%                  1,673
No q20 evidence (MapQ1)                                       267

NHGRI