The structure of the talk includes a brief overview of some of the critical milestones over the past decade that have contributed to where we are today as a community of human genetics researchers. Next we’ll review some of the successes of GWAS, meaning genome-wide association studies, in the latter half of the decade, but we will also begin a conversation about what is missing, what is needed next, and where the new hypotheses are that will lead the community forward. This will lead into a discussion of the tools Illumina is developing to enable researchers to explore these new hypotheses more fully throughout 2010. And lastly, we’ll have a short discussion of sequencing and arrays, two complementary technologies that each have their strengths and their own place in a researcher’s toolbox, or arsenal, for going after the variants that contribute to their disease or trait of interest.
First-gen GWAS vs. Next-gen GWAS
The GWAS approach is successful in human genetics
Year    # of publications
2005    2
2006    8
2007    89
2008    151
2009    222
First publications in 2005
Almost 600 total publications since
Over 3500 associations published
Wide range of phenotypes and diseases
Published Genome-Wide Associations through 9/2009: 536 published GWA at p < 5 x 10^-8
NHGRI GWA Catalog: www.genome.gov/GWAStudies
Presenter
Presentation Notes
And GWAS has been successful for many different phenotypes and has identified variants across the entirety of the genome. This slide was downloaded from the NHGRI GWAS Catalog, available at the website listed here, and clearly shows the diversity of diseases and traits that have been studied using a GWAS approach and the many significant findings that have been discovered as a result of these experiments. This is truly remarkable, as many of these disorders have been documented for centuries, even millennia, but for many of them it is only through the GWAS efforts of the past few years that an understanding of the contributing genetic variants has begun to emerge.
The Case of the Missing Heritability
For most common diseases, the sum of individual effects found so far is much less than the total estimated heritability
[Figure: heritability (y-axis, 0% to 100%) split into "Explained" and "Missing" for diseases/traits ordered from rare to common (x-axis)]
Adapted from Manolio et al 2009
Many genetic studies have successfully detected both common and rare genetic variants for both single-gene and complex traits. However, much more is needed to understand the genetics behind many of these disorders, and this is highlighted by the observation of missing heritability for many common diseases and traits. In other words, for many diseases, the sum of the individual effects of the identified associated variants is often much less than the total estimated heritability of those diseases. To illustrate this point, on this graph a number of different disorders are listed along the x-axis, while the percentage of explained heritability is along the y-axis. For very rare Mendelian disorders such as Huntington’s or cystic fibrosis, located on the left-hand side of the graph, the genes and variants that contribute to these disorders have been known for decades through the work of traditional linkage analysis, and cumulatively they explain almost all of the observed heritability. However, even though many variants have been identified that contribute to the more common disorders such as AMD, Crohn’s, lupus, and T2D, cumulatively these variants explain a much smaller fraction of the observed heritability. The question remains how much more could be found if scientists had better tools. So, what are we, as a community, missing? Where is it? And what tools are needed to find it?
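To make the arithmetic behind "missing heritability" concrete, here is a minimal Python sketch. It assumes, as a simplification, that the contributions of individual variants are additive and expressed on the same scale as the total estimated heritability. The function name and all numbers are hypothetical and are not taken from Manolio et al.

```python
def heritability_split(variant_contributions, total_heritability):
    """Return (explained, missing) as fractions of the total estimated heritability.

    variant_contributions: variance explained by each identified variant
    total_heritability: total estimated heritability, on the same scale
    Assumes purely additive effects (a simplification for illustration).
    """
    if total_heritability <= 0:
        raise ValueError("total_heritability must be positive")
    # Cap at 1.0 so overestimated effects cannot produce a negative "missing" share.
    explained = min(sum(variant_contributions) / total_heritability, 1.0)
    return explained, 1.0 - explained


# Hypothetical illustration: five associated variants, total heritability of 0.50
explained, missing = heritability_split([0.02, 0.015, 0.01, 0.01, 0.005], 0.50)
print(f"explained: {explained:.0%}, missing: {missing:.0%}")
```

In this made-up case the five variants together account for only a small share of the estimated heritability, which is the pattern the graph shows for the common disorders.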
Tackling the Full Spectrum of Variants in Disease
Well, one way to think about this is in the context of two variables: risk allele frequency along the x-axis and effect size along the y-axis, as you see in this plot. The sweet spot, if you will, for disease studies is marked by the blue band that travels almost diagonally from rare variants of large effect down and across to common variants of small effect. Outside of that blue band, in the upper right, are common variants with high effect size. There aren’t many of those, or researchers would be finding them regularly with the tools that have been available for common-variant GWAS. In the lower left corner, rare variants of small effect are also of limited interest, as they are almost impossibly difficult to identify and, even if it were possible, have marginal importance to understanding disease at the population level. Now, across this swath, the extremes have been well covered by available technologies. For example, rare variants that confer large effect sizes have been identified using linkage mapping techniques for decades now, with over 2000 hits for Mendelian disorders. Likewise, common variants that confer small to modest effect sizes have been well explored using available GWAS tools for the past five years. The great unexplored area of this curve, therefore, is the centre section: the variants of rare to intermediate allele frequency that confer intermediate effect sizes. And it is through a second generation of GWAS tools, enabling rich GWAS, that researchers will be able to explore this class of variation for association to disease. Obviously, as more and more is learned about the true spectrum of variation through projects such as the 1000 Genomes Project, the tools for exploring common variation will improve as well, so in essence this segment of the curve will also benefit from new “rich” GWAS tools.
And lastly, as next-generation sequencing matures and becomes more available, it will lend itself nicely to the exploration of the left-hand extreme of the curve, the rare variants of large effect, taking over where traditional linkage mapping left off.
Next-gen Sequencing and the 1kGP Revolution: a new era beyond the HapMap Project
Next-Gen Sequencing
High Density Custom Arrays
Targeted resequencing
Next-gen GWAS Arrays
ARRAYS
But, what’s next? Well, the future of human genetics is intimately linked to sequencing-based discovery efforts. As next-gen sequencing matures, the catalogue of variation that is available for creating microarrays will grow at an unprecedented rate. These new variants can be deployed on custom arrays or high density standard arrays that in turn will identify regions of the genome for further discovery efforts, funneling back into targeted resequencing efforts, for example. Suffice it to say that sequencing and arrays are evolving hand in hand to enable the next wave of discoveries.
The 1,000 Genomes Project: Sequence 2,500 genomes to complete the picture of genetic variation
Project Goals
1. Accelerate fine-mapping efforts in gene regions identified through genome-wide association studies or candidate gene studies
2. Improve the power of future genetic association studies by enabling design of next-generation genotyping microarrays that more fully represent human genetic variation
3. Enhance the analysis of ongoing and already completed association studies by improving our ability to “impute” or “predict” untyped genetic variants
Achieve a nearly complete catalog of common human genetic variants with frequency 1% or higher.
Indeed, massive next-generation resequencing projects such as the 1000 Genomes Project are delivering a wealth of new information about the true spectrum of variants present in populations. And all of this new content is available for the design of the next generation of microarrays for GWAS. In fact, it is spelled out in the 1000 Genomes mission statement: to improve the power of future genetic association studies by enabling design of next-generation microarrays…
New Content for Next-gen GWAS Arrays: Rich content to explore new hypotheses and enable new discoveries
Project | Year | Approx. cumulative SNPs found | Tag SNPs needed for max coverage | Lower limit of allele frequency targeted | % variation tagged (r2 > 0.8)
HapMap | 2003-2007 | 3M | ~0.6M | 5% | >90%
1kG Pilot Project | 2008-2009 | 13M | ~2.5M | 2.5% | ~80%
1kG Full Project | 2010 | 35M* | ~5.0M | 1% | >90%
Sequence to discover SNPs >1% MAF (1000-Genomes project)
Leverage the power of LD to select tagSNPs and remove redundancy
Include progressively more SNPs at lower allele frequencies (5%, 2.5%, 1%)
* Estimated
So, what does this mean in practice? Well, step one is to sequence to discover SNPs, an effort that is being completed, for example, by the 1000 Genomes Project, as we’ve already discussed. Next, the same concept of haplotype blocks and tagSNPs that was so effective for the first generation of GWAS arrays continues to apply to these new variants as well. The only difference is that a more complete picture of the variants and their LD blocks is now available for improved tagSNP selection. Furthermore, as the data from the 1000 Genomes Project is to be released in stages, this process of selecting tags can be applied iteratively down to lower and lower frequencies. This table begins with the HapMap and its ~3 million SNPs, which could be tagged by slightly over half a million well-chosen markers. Arrays designed from the HapMap had a lower limit of allele frequency of ~5%, and currently available arrays are able to tag about 90% of all variation in this category. Next, the first phase of the 1000 Genomes Project has identified upwards of 17 million variants so far, though again, applying an intelligent tagging approach, we estimate that about 2.5 million will be needed to capture approximately 80% of all variants down to 2.5% MAF. And lastly, the final phase of the 1000 Genomes Project is anticipated to deliver ~35 million new variants. We estimate that ~90% of these down to 1% MAF can be captured using ~5 million variants.
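The tagging step described above can be sketched as a greedy selection over pairwise r2 values. This is one common heuristic, shown here only for illustration; it is not necessarily the procedure used to design any particular array. The function names and the toy haplotypes are hypothetical, and the 0.8 threshold is the r2 cutoff quoted on the slide.

```python
from itertools import combinations


def r_squared(a, b):
    """Squared correlation (r^2) between two SNPs, given 0/1 alleles per haplotype."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x & y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:  # monomorphic SNP: r^2 is undefined, treat as not taggable
        return 0.0
    return (pab - pa * pb) ** 2 / denom


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick tag SNPs so that every SNP has r^2 >= threshold with some tag.

    haplotypes: dict mapping SNP name -> list of 0/1 alleles, one per haplotype.
    """
    names = list(haplotypes)
    tagged_by = {s: {s} for s in names}  # every SNP tags itself
    for s, t in combinations(names, 2):
        if r_squared(haplotypes[s], haplotypes[t]) >= threshold:
            tagged_by[s].add(t)
            tagged_by[t].add(s)
    untagged = set(names)
    tags = []
    while untagged:
        # Choose the SNP that covers the most still-untagged SNPs.
        best = max(names, key=lambda s: len(tagged_by[s] & untagged))
        tags.append(best)
        untagged -= tagged_by[best]
    return tags


# Hypothetical toy data: 5 SNPs observed on 8 haplotypes
toy_haplotypes = {
    "snp1": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp2": [0, 0, 0, 0, 1, 1, 1, 1],  # identical to snp1
    "snp3": [1, 1, 1, 1, 0, 0, 0, 0],  # mirror image of snp1
    "snp4": [0, 1, 0, 1, 0, 1, 0, 1],
    "snp5": [0, 1, 0, 1, 0, 1, 0, 0],
}
tags = greedy_tag_snps(toy_haplotypes)
print(tags)
```

In the toy data, snp1, snp2, and snp3 are in perfect LD, so a single tag stands in for all three, which is the redundancy removal that lets a few million markers represent a much larger catalog.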
HapMap Represents a Small Part of All Variation
SNPs by observations in 60 CEU Samples
[Figure: count of SNPs (y-axis, x 10^6, 0.0 to 1.0) by number of minor alleles observed (x-axis, 0 to 60); series: 1000 Genomes, HapMap]
The need for more comprehensive chips can be seen just by looking at the amount of content available before the 1000 Genomes Project versus what is now coming out of it. This plot shows the number of CEU SNPs within the HapMap database as a function of the number of times that the minor allele was observed. Maximizing coverage of the common SNPs that predominantly populate the HapMap database has been the logical approach for years because, until recently, it represented the most comprehensive database of SNP information across many samples. When we look at just the CEU data from the 1000 Genomes Project released through the end of 2009, we have now ascertained over three times as many putative SNPs as we knew about from the HapMap data. Additionally, we now have a much richer understanding of the full frequency spectrum of SNPs.
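The quantity on the plot's x-axis, the number of times the minor allele was observed, can be tallied as in the minimal Python sketch below. It assumes diploid genotypes coded as 0, 1, or 2 copies of one allele; the function names and the toy genotypes are hypothetical. With 60 diploid samples there are 120 chromosomes, so the minor allele count runs from 0 to 60.

```python
from collections import Counter


def minor_allele_count(genotypes):
    """Minor allele count for one SNP from diploid genotypes coded 0, 1, or 2."""
    allele_copies = sum(genotypes)
    total_alleles = 2 * len(genotypes)
    # The minor allele is whichever of the two alleles is seen less often.
    return min(allele_copies, total_alleles - allele_copies)


def allele_count_spectrum(snp_genotypes):
    """Map minor allele count -> number of SNPs observed with that count."""
    return Counter(minor_allele_count(g) for g in snp_genotypes.values())


# Hypothetical genotypes for 4 SNPs in 5 samples (10 chromosomes)
toy_genotypes = {
    "snp_a": [0, 0, 0, 0, 1],
    "snp_b": [2, 2, 2, 2, 1],
    "snp_c": [1, 1, 0, 2, 1],
    "snp_d": [0, 1, 0, 1, 0],
}
spectrum = allele_count_spectrum(toy_genotypes)
for count in sorted(spectrum):
    print(count, spectrum[count])
```

Plotting such a tally for two SNP catalogs side by side gives the kind of comparison shown on the slide, where the low-count end of the spectrum is where newly discovered variants accumulate.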
The ultimate GWAS tool providing near complete coverage of common and rare variation
MAF > 5%    MAF > 2.5%    MAF > 1%
So, on a single slide, here is the Omni Family of Microarrays mapped out. The family begins with the Omni1 and OmniExpress arrays on the left-hand side of this slide and proceeds through the Omni1S and Omni2.5, leading then into the Omni2.5S and the Omni5. As these products are being designed on successive releases of the 1000 Genomes Project data, we see a step-wise progression from a 5% MAF target down to a 1% MAF target with the Omni5 and Omni2.5S products. The Omni1 and OmniExpress are available now and were designed primarily from information available from the HapMap project, with the incorporation of a small fraction of new variants identified by the 1kGP. As the Roadmap proceeds, the fraction of new 1000 Genomes Project data incorporated into the chip design will increase with each step. The whole idea here is to provide a clear path for researchers to begin accessing these new, rarer variants from 1kGP as quickly as possible throughout 2010, so that new discoveries can be made faster than if the community waited for the 5M as a standalone product. Furthermore, the Roadmap gives researchers the flexibility to jump into next-gen GWAS at whatever stage is most appropriate for their disease, trait, budget, and long- and short-term research goals.
Illumina’s GWAS Roadmap
Content optimized from next-gen re-sequencing efforts such as 1000 Genomes.
Pushing the boundary of GWAS content into unexplored territory
Cost effective path for researchers that want to ride the cutting edge today
Roadmap Paths
Path | Step 1 | Step 2 | Step 3 | Total Markers
1 | OmniExpress | Omni1S | Omni2.5S | ~4.4 Million
2 | Omni1 | Omni1S | Omni2.5S | ~5 Million
3 | Omni2.5 | Omni2.5S | - | ~5 Million
4 | Omni5 | - | - | ~5 Million
2010 GWAS Roadmap: Multiple Chips Made Easier with the Multi-use Workflow
Roadmap Entry Point | Second Array | Third Array
Omni1-Quad (Multi-use) | Omni1S (Multi-use) | Omni2.5S (Multi-use)
OmniExpress (Multi-use) | Omni1S (Multi-use) | Omni2.5S (Multi-use)
Omni2.5 (Multi-use) | Omni2.5S (Multi-use) | -
Omni2.5 Details
CEU Coverage Estimates: HapMap vs. 1kGP Reference Data
[Figure: % captured at r2 > 0.8 (y-axis, 0% to 100%) for Competitor "New Array"*, Competitor "Old 900K", 660W, Omni1/OmniExpress, and Omni2.5; series: HapMap 5%, 1kGP 5%, 1kGP 2.5%]
This is not just an array with “new” content! The Omni2.5 array is a complete game-changer!
*Base content only
Genomic Coverage Stats for African Populations
YRI Coverage Estimates: HapMap vs. 1kGP Pilot Data
[Figure: "% Captured at r2 > 0.8" (y-axis, ticks 0 to 1) for Competitor "New Array"*, Competitor "Old 900K", 660W, Omni1/OmniExpress, and Omni2.5; series: HapMap 5%, 1kGP 5%, 1kGP 2.5%]
*Base content only
Genomic Coverage Stats for Asian Populations
CHB/JPT Coverage Estimates: HapMap vs. 1kGP Pilot Data
[Figure: "% Captured at r2 > 0.8" (y-axis, ticks 0 to 1) for Competitor "New Array"*, Competitor "Old 900K", 660W, Omni1/OmniExpress, and Omni2.5; series: HapMap 5%, 1kGP 5%, 1kGP 2.5%]
*Base content only
Illumina GWAS Portfolio at a Glance
Feature | Omni2.5 | Omni1 | OEx | CytoSNP12
Number of Markers per Sample | 2,450,000 | 1,140,419 | 733,202 | 301,232
Number of Samples per BeadChip | 4 | 4 | 12 | 12
Scan Times per Sample (minutes) | 15 | 13 | 5 | 3
Spacing (Mean / Median / 90% percentile largest gap)
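For planning purposes, the per-sample figures in the portfolio table can be rolled up to per-BeadChip totals, as in the Python sketch below. The marker counts, samples per BeadChip, and scan times are copied from the table; the roll-up assumes that scan time simply adds up across the samples on one BeadChip, which is an assumption made for illustration and not a stated specification. The names `PORTFOLIO` and `per_chip_totals` are hypothetical.

```python
# (markers per sample, samples per BeadChip, scan minutes per sample),
# copied from the portfolio table.
PORTFOLIO = {
    "Omni2.5": (2_450_000, 4, 15),
    "Omni1": (1_140_419, 4, 13),
    "OEx": (733_202, 12, 5),
    "CytoSNP12": (301_232, 12, 3),
}


def per_chip_totals(markers_per_sample, samples_per_chip, scan_min_per_sample):
    """Return (markers per BeadChip, scan minutes per BeadChip).

    Assumes scan time is additive across the samples on one BeadChip.
    """
    return (
        markers_per_sample * samples_per_chip,
        samples_per_chip * scan_min_per_sample,
    )


totals = {name: per_chip_totals(*spec) for name, spec in PORTFOLIO.items()}
for name, (markers, minutes) in totals.items():
    print(f"{name}: {markers:,} markers and {minutes} scan minutes per BeadChip")
```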
First-generation GWAS has provided a foundation for beginning to understand the genetic architecture of many diseases and traits.
However, first-generation GWAS was limited by the extent of knowledge about the spectrum of variation in humans in the HapMap era.
NGS re-sequencing efforts, such as 1kGP, are providing a much more comprehensive catalog of common variation (>1% MAF) in diverse populations.
Next-gen GWAS tools are leveraging this expanded catalog of variation to drive a new wave of genetic discovery by enabling exploration of the rare-variant hypothesis and higher-resolution CNV research with cost-effective tools.