SNP Resources and Applications

Upload: ankti

Post on 25-Feb-2016


  • SNP Resources and Applications

    SeattleSNPs PGA
    Debbie Nickerson
    Department of Genome Sciences
    [email protected]

    http://pga.gs.washington.edu

  • Strategies for Genetic Analysis

    Families - Linkage Studies
    Simple Inheritance: Single Gene, Rare Variants
    ~1,000 Short Tandem Repeat Markers and now 3,000 SNPs

    Populations - Association Studies
    Complex Inheritance: Multiple Genes, Common Variants
    Polymorphic Markers > 500,000 - 1,000,000 Single Nucleotide Polymorphisms (SNPs)

    Cases: 40% T, 60% C
    Controls: 15% T, 85% C
    Example genotypes: C/C C/T C/C C/T C/C C/C C/T C/C C/C C/T C/C C/T C/T C/C

  • Complex inheritance/disease

    [Figure: Variant Gene, Many Other Genes and Environment contributing to Disease]

    Diseases: Diabetes, Heart Disease, Schizophrenia, Obesity, Multiple Sclerosis, Celiac Disease, Cancer, Asthma, Autism

    Two hypotheses:
    1 - common disease/common variants
    2 - common disease/many rare variants

  • Genetic Strategy - New Insights

    [Figure: allele frequency (LOW to HIGH) plotted against effect size (WEAK to STRONG), with regions labeled LINKAGE, ASSOCIATION and Genome-wide Sequencing]

    Ardlie, Kruglyak & Seielstad (2002) Nat. Rev. Genet. 3: 299-309
    Zondervan & Cardon (2004) Nat. Rev. Genet. 5: 89-100

  • Finding SNPs - Strategies

  • Total sequence variation in humans

    Population size: 6 x 10^9 (diploid)
    Mutation rate: 2 x 10^-8 per bp per generation
    Expected hits: 240 for each bp - Every variant compatible with life exists in the population
    BUT most are vanishingly rare in the population!
    Compare 2 haploid genomes: 1 SNP per 1331 bp*
    *The International SNP Map Working Group, Nature 409:928 - 933 (2001)
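The "expected hits" figure above is the number of chromosomes in the population multiplied by the per-base mutation rate. A minimal sketch of that arithmetic, assuming the mutation rate is 2 x 10^-8 (the sign of the exponent is not visible in the slide text); the function and variable names are illustrative only:

```python
# Back-of-the-envelope check of the slide's "expected hits" figure.
# Assumption: the mutation rate is 2e-8 per bp per generation
# (the minus sign of the exponent is not visible in the slide text).

population_size = 6e9        # individuals
chromosomes_per_person = 2   # diploid: two copies of every base pair
mutation_rate = 2e-8         # new mutations per bp per generation


def expected_hits_per_bp(n_people, ploidy, mu):
    """Expected number of new mutations at a single base pair per generation."""
    return n_people * ploidy * mu


hits = expected_hits_per_bp(population_size, chromosomes_per_person, mutation_rate)
print(f"Expected hits per bp per generation: {hits:.0f}")
```

Under these assumptions the product reproduces the slide's value of 240 hits per base pair.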

  • SNP Discovery: HapMap and others

    Generate more SNPs: Random Shotgun Sequencing
    Genomic DNA (multiple individuals) -> Random Shotgun -> Sequence and align (reference sequence)
    Sequence shown in figure: TACGCCTATATCAAGGAGAT

    Sources of SNPs:
    HapMap
    Perlegen SNP data
    Sequence chromatograms from Celera project

    dbSNP 127 - 11.8 Million SNPs and 5.7 Million SNPs Validated

  • Finding SNPs: Sequence-based SNP Mining

    DNA SEQUENCING
    Sequence Overlap - SNP Discovery

    GTTACGCCAATACAGGATCCAGGAGATTACC
    GTTACGCCAATACAGCATCCAGGAGATTACC
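The sequence-overlap idea can be illustrated with the two reads shown on this slide. The sketch below is a simplification: it assumes the reads are already aligned and of equal length, and it ignores base quality, which real SNP mining pipelines must take into account. The function name `find_mismatches` is hypothetical.

```python
def find_mismatches(seq_a, seq_b):
    """Return (0-based position, base_a, base_b) for every difference
    between two already-aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# The two overlapping reads from the slide.
read_1 = "GTTACGCCAATACAGGATCCAGGAGATTACC"
read_2 = "GTTACGCCAATACAGCATCCAGGAGATTACC"

candidates = find_mismatches(read_1, read_2)
for position, base_1, base_2 in candidates:
    print(f"candidate SNP at position {position}: {base_1}/{base_2}")
```

The only difference between the two reads is a G/C mismatch at 0-based position 15, which is the candidate SNP.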

  • SNP discovery is dependent on your sample population size

    GTTACGCCAATACAGGATCCAGGAGATTACC
    GTTACGCCAATACAGCATCCAGGAGATTACC
    (2 chromosomes)

  • Candidate Gene Resource

  • SNP Discovery in SeattleSNPs

    Generate SNP data from complete genomic resequencing (i.e., 5' regulatory, exon, intron, 3' regulatory sequence)
    PCR amplicons (5' to 3')
    cSNP labels in figure: Arg-Cys, Val-Val
    Complete analysis: cSNPs, Linkage Disequilibrium and Haplotype Data

  • Increasing Sample Size Improves SNP Discovery

    GTTACGCCAATACAGGATCCAGGAGATTACC
    GTTACGCCAATACAGCATCCAGGAGATTACC
    (2 chromosomes)
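The effect of sample size on discovery can be sketched with a simple probability model. It assumes a biallelic SNP and chromosomes sampled independently from the population; a SNP counts as "discovered" when both alleles appear at least once in the sample. The function name is hypothetical, and this is a textbook approximation rather than the calculation used to build the slide.

```python
def discovery_probability(maf, n_chromosomes):
    """Probability that both alleles of a biallelic SNP are seen at least once
    among n independently sampled chromosomes."""
    p, q = maf, 1.0 - maf
    # Subtract the two ways of missing the SNP: every chromosome carries
    # the minor allele, or every chromosome carries the major allele.
    return 1.0 - p ** n_chromosomes - q ** n_chromosomes


for n in (2, 8, 48, 96):
    row = ", ".join(
        f"MAF {maf:.2f}: {discovery_probability(maf, n):.2f}"
        for maf in (0.01, 0.05, 0.10, 0.50)
    )
    print(f"n = {n:3d} chromosomes -> {row}")
```

Under this model, two chromosomes find a 50% frequency SNP only half of the time. Eight chromosomes find nearly all 50% frequency SNPs but only a little over half of those at 10% frequency, which matches the figures quoted in the speaker notes.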

  • SNPs in the Average Gene

    Average Gene Size - 25 kb
    Compare 2 haploid genomes - ~1 SNP in 1,000 bp
    ~150 SNPs (1 per 200 bp) - 15,000,000 SNPs
    ~50 SNPs > 0.05 MAF (1 per 600 bp) - 6,000,000 SNPs (33-40%)
    ~5 coding SNPs (half change the amino acid sequence)
    Crawford et al. Annu Rev Genomics Hum Genet 2005;6:287-312

  • SeattleSNPs panel

    HapMap Integration (~4 million SNPs)
    Figure key: SeattleSNPs discovery (1/188 bp); HapMap SNPs (~1/1000 bp)
    High Density Genic Coverage (SeattleSNPs)
    Low Density Genome Coverage (HapMap)

  • Sequence Variation and the HapMap

  • Summary: The Current State of SNP Resources

    Random SNP discovery generates many SNPs (HapMap)

    Random approaches to SNP discovery have reached limits of discovery and validation (~ 50% of the common SNPs)

    Resequencing approaches continue to catalog important variants (rare and common not captured by the HapMap)

    SeattleSNPs has generated SNP data across >300 key candidate genes

  • NHLBI - Candidate Genes and Medical Resequencing

    http://rsng.nhlbi.nih.gov/scripts/index.cfm

  • Typing SNPs: Approaches

  • HapMap Project: Genotype validated SNPs in dbSNP

    Genotype SNPs in four populations: Initially 1 Million -> Now 4 Million

    CEPH (CEU) (Europe - n = 90, trios)
    Yoruban (YRI) (Africa - n = 90, trios)
    Japanese (JPT) (Asian - n = 45)
    Chinese (CHB) (Asian - n = 45)

    To produce a genome-wide map of common variation

  • Genotyping Adds Value to SNPs

    HapMap Genotyping

    Confirms a SNP as real and informative

    Determines Minor Allele Frequency (MAF) - common or rare

    Determines MAF in different populations

    Detection of SNP correlations - (Linkage Disequilibrium and Haplotypes)
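To make the MAF points concrete, here is a small sketch that computes the minor allele and its frequency from genotype calls, separately for two populations. The genotype lists are invented for illustration and are not HapMap data; the function name is hypothetical and a biallelic SNP is assumed.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """genotypes: list of calls such as 'C/T'.
    Returns (minor_allele, frequency); (None, 0.0) if the sample is monomorphic."""
    counts = Counter(allele for call in genotypes for allele in call.split("/"))
    if len(counts) < 2:
        return None, 0.0
    total = sum(counts.values())
    minor_allele, minor_count = min(counts.items(), key=lambda item: item[1])
    return minor_allele, minor_count / total


# Hypothetical genotype calls for one SNP in two populations.
population_a = ["C/C", "C/T", "C/C", "C/T", "T/T"]
population_b = ["C/C", "C/C", "C/C", "C/C", "C/T"]

for name, calls in (("A", population_a), ("B", population_b)):
    allele, maf = minor_allele_frequency(calls)
    print(f"population {name}: minor allele {allele}, MAF = {maf:.2f}")
```

In this made-up example the same SNP is common in one population (MAF 0.40) and much rarer in the other (MAF 0.10), which is the kind of difference that matters when selecting SNPs for a particular study population.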

  • Genotype correlations among SNPs decrease the number of SNPs that need to be genotyped

  • An Example of SNP Correlation in the Human IL1A Gene

    IL1A in Europeans: 18.5 kb, 50 SNPs
    Carlson et al. (2004) Am J Hum Genet. 74: 106-120.

  • Threshold LD: r^2

    Bin 1: 22 sites
    Bin 2: 18 sites
    Bin 3: 5 sites

    Genotype 1 SNP from each bin
    TagSNP, chosen for biological intuition or ease of assay design
    46 Common SNPs reduce to 3 SNPs - Select one SNP per bin using LDSelect
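The binning idea can be sketched as a greedy procedure: repeatedly pick the SNP that is correlated above the r^2 threshold with the most remaining SNPs, make those SNPs a bin, and repeat. This is a simplified illustration in the spirit of LD-based tagSNP selection, not the LDSelect implementation. The threshold of 0.8, the genotype data and all names are assumptions (the slide's threshold value is not legible), and r^2 is computed here directly from unphased genotype counts.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors
    (minor-allele counts 0/1/2, one entry per individual)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    return cov * cov / (var_x * var_y)


def greedy_ld_bins(genotypes, threshold=0.8):
    """genotypes: dict mapping SNP id -> list of 0/1/2 counts.
    Returns a list of (tag_snp, sorted bin members)."""
    unbinned = set(genotypes)
    bins = []
    while unbinned:
        best_tag, best_members = None, set()
        for snp in sorted(unbinned):  # sorted for reproducible tie-breaking
            members = {
                other for other in unbinned
                if r_squared(genotypes[snp], genotypes[other]) >= threshold
            }
            members.add(snp)  # a SNP always belongs to its own bin
            if len(members) > len(best_members):
                best_tag, best_members = snp, members
        bins.append((best_tag, sorted(best_members)))
        unbinned -= best_members
    return bins


# Hypothetical genotypes for six SNPs in six individuals.
example = {
    "snpA": [0, 1, 2, 0, 1, 2],
    "snpB": [0, 1, 2, 0, 1, 2],
    "snpC": [2, 1, 0, 2, 1, 0],
    "snpD": [0, 0, 1, 1, 2, 2],
    "snpE": [0, 0, 1, 1, 2, 2],
    "snpF": [0, 2, 1, 1, 0, 2],
}

bins = greedy_ld_bins(example)
for tag, members in bins:
    print(f"tagSNP {tag} represents {members}")
```

With this toy data, six SNPs collapse into three bins, so genotyping three tagSNPs captures all six. In practice the choice of tagSNP within a bin can also be guided by biology or assay design, as the slide notes.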

  • Common Variants - LD (Association) Patterns - Not the same in all genes for all populations

    Figure panels: All SNPs; SNPs > 10% MAF
    Populations: African-American; European-American

  • How do I pick TagSNPs?

  • TagSNPs for any gene - Use GVS

  • TagSNPs in any Gene

  • TagSNPs for a gene for typing multiple populations

  • TagSNPs in a pathway of genes

  • Human Association Studies

  • C-Reactive Protein (CRP)

    Pentamer belonging to pentraxin family

    Acute-phase protein produced by the liver in response to cytokine production (IL-6, IL-1, tumor necrosis factor)

    Non-specific response to inflammation, infection, tissue damage

    Well-designed candidate gene studies have provided significant insights, and these have been replicated in genome-wide association studies

  • CRP Analysis

    CRP is an independent risk factor for CVD

    CRP levels are heritable (~40% in FHS)

    Several reported SNPs alter CRP levels

  • tagSNP selection for CRP

    Promoter SNPs (790, 1440)
    Intron SNP (1919)
    Synonymous SNP (2667)
    3' UTR SNP (3006)
    Downstream SNPs (3872, 5237)

    6 cosmopolitan tagSNPs
    1 rare synonymous SNP

  • Association between CRP SNPs and Serum CRP Levels

    CARDIA - Carlson et al Am J Hum Genet 77: 64-77, 2005
    NHANES - Crawford et al Circulation 114: 2458-65, 2006
    CHS - Lange et al JAMA 296: 2703-11, 2006
    Framingham - Larson et al Circulation 113: 1415-23, 2006
    Other - Szalai et al J. Mol Med 83: 440-7, 2005

  • High CRP Associated with SNPs in USF1 Binding Site

    USF1 (Upstream Stimulating Factor)
    Polymorphism at 1421 alters another USF1 binding site

                1420      1430      1440
    H1-4 gcagctacCACGTGcacccagatggcCACTCGtt
    H7-8 gcagctacCACGTGcacccagatggcCACTAGtt
    H5   gcagctacCACGTGcacccagatggcCACTTGtt
    H6   gcagctacCACATGcacccagatggcCACTTGtt

    SNP Alters Expression In Vitro
    Altered Gel Shift In Vitro

    Genome-wide studies lead to regional and candidate gene studies
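The haplotype alignment on this slide can be scanned column by column to locate the polymorphic sites. The sketch below uses the four sequences exactly as shown. The offset that converts alignment columns to the slide's position labels (1421 and 1440) is an assumption inferred from those labels, and the function name is hypothetical.

```python
haplotypes = {
    "H1-4": "gcagctacCACGTGcacccagatggcCACTCGtt",
    "H7-8": "gcagctacCACGTGcacccagatggcCACTAGtt",
    "H5":   "gcagctacCACGTGcacccagatggcCACTTGtt",
    "H6":   "gcagctacCACATGcacccagatggcCACTTGtt",
}


def variable_sites(sequences):
    """Return {0-based column: sorted distinct bases} for every column
    where the aligned sequences do not all agree (case is ignored)."""
    columns = zip(*(seq.upper() for seq in sequences))
    return {
        i: sorted(set(column))
        for i, column in enumerate(columns)
        if len(set(column)) > 1
    }


sites = variable_sites(haplotypes.values())

# Assumption: column 11 corresponds to position 1421 on the slide,
# so slide position = column index + 1410.
POSITION_OFFSET = 1410
for column, alleles in sites.items():
    kind = "triallelic" if len(alleles) == 3 else "biallelic"
    print(f"position {column + POSITION_OFFSET}: {'/'.join(alleles)} ({kind})")
```

Two columns vary: a G/A site in the first capitalized motif and a triallelic A/C/T site in the second, 19 bp apart, which agrees with the remark in the speaker notes that the promoter SNP lies within 20 bp of the triallelic SNP.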

  • Genome-Wide Association Studies

  • Genome-Wide Platforms

    Affymetrix - Random SNPs: 100,000 or 500,000 Quasi-Random SNPs

    Illumina - TagSNPs: 100,000, 317,000, 550,000, 650,000Y SNPs

    1 Million Products are here!

  • Genome-wide Tour de force

    Nature 447: 661-678
    Read all the supplemental materials too!

  • Applying HapMap - Will it work? YES!!

    Hits: Macular Degeneration, Obesity, Cardiac Repolarization, Inflammatory Bowel Disease, Diabetes T1 and T2, Coronary Artery Disease, Rheumatoid Arthritis, Breast Cancer, Colon Cancer ...

    There are misses as well, and it is unclear why - Phenotype, Coverage, Environmental Contexts? Example of a miss - Hypertension

    There are lots more hits in these data sets - sample size, low proxy coverage with other SNPs ...

    Analysis of associations between phenotype(s) and even individual sites is daunting, and this will just be the first stage; it does not even consider multi-site interactions

  • How robust are the new genome-wide platforms?

    How well do they capture common SNPs?

  • LD-based coverage of Sequence Variation

    MAF > 0.05
    Bhangale et al, unpublished

  • How can I get more information about a reference SNP (rs) identified from an association study?

  • Searching for Genomic Information with an RS number

  • Structural Variation

  • Structural Variants Identified in the HapMap

    Structural Variation - Large Insertion-Deletion Events
    ~1,500 indels. Lots more of them - this was only a start

    Conrad et al. (Nature Genetics 38:75-81, 2006)
    Hinds et al. (Nature Genetics 38:82-85, 2006)
    McCarroll et al. (Nature Genetics 38:86-92, 2006)

  • New Variation to Consider - Structural Variation

    Types of Structural Variants: Insertions/Deletions, Inversions, Duplications, Translocations

    Size: Large-scale (>100 kb), intermediate-scale (500 bp to 100 kb), fine-scale (1 to 500 bp)

    Nature 447: 161-165, 2007

  • CEPHA Human Genome Structural Variation Project

    Goal: Complete characterization of normal pattern of structural variation in 62 human genomes

    Genomes have dense SNP maps (HapMap)

    Select most genetically diverse individuals

    62 additional human genome projects underway

    Nature 447:161-165, 2007

  • Sequence-Based Resolution of Structural Variation

    Human Genomic DNA -> Genomic Library (1 million clones) -> Sequence ends of genomic inserts & Map to human genome

    Dataset:
    1,122,408 fosmid pairs preprocessed (15.5X genome coverage)
    639,204 fosmid pairs BEST pairs (8.8X genome coverage)
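The coverage figures on this slide are consistent with a simple physical-coverage calculation: number of clone pairs times insert size, divided by genome size. Neither the insert size nor the genome size is stated on the slide, so the values below (roughly 40 kb fosmid inserts and a 2.9 Gb genome) are assumptions chosen for illustration, and the function name is hypothetical.

```python
# Assumptions (not stated on the slide): typical fosmid insert of ~40 kb
# and a ~2.9 Gb genome assembly.
FOSMID_INSERT_BP = 40_000
GENOME_SIZE_BP = 2.9e9


def physical_coverage(n_clone_pairs, insert_bp=FOSMID_INSERT_BP, genome_bp=GENOME_SIZE_BP):
    """Fold coverage of the genome by the cloned inserts (not by sequenced bases)."""
    return n_clone_pairs * insert_bp / genome_bp


preprocessed = physical_coverage(1_122_408)
best_pairs = physical_coverage(639_204)
print(f"preprocessed pairs: {preprocessed:.1f}X")
print(f"best pairs:         {best_pairs:.1f}X")
```

With these assumed values the calculation gives approximately 15.5X and 8.8X, in line with the slide, which suggests the quoted coverage is clone (physical) coverage rather than sequence coverage.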

  • Kidd, Cooper, and Eichler - unpublished

  • Detection of Indels in Genotype Data

    X-linked SNP
    Unknown indel
    Carlson et al, Hum. Mol. Genet. 15: 1931-1937, 2006

  • Searching for Genomic Information with an RS number

  • DNA Sequencing - the ultimate genotyping platform?

  • Rare Variant Versus Common Variant

    Both could play a role

    Rare Variant - Sequence Individuals

    Common Variants - Genotype a Smaller Set of Variants to Explore Correlations

  • High Density Lipoprotein (HDL)

    Sequencing Known Candidate Genes for Functional Variation
    From Individuals at the Tails of the Trait Distribution

    [Chart: bell-shaped trait distribution with the Low HDL and High HDL tails highlighted as the individuals to sequence]

  • ABCA1 and HDL-C

    Observed excess of rare, nonsynonymous variants in low HDL-C samples at ABCA1

    Demonstrated functional relevance in cell culture

    Cohen et al, Science 305, 869-872, 2004

    Many examples emerging

    Common Disease Rare Variants

  • Personalized Human Genome Sequencing

    Solexa - an example

  • New Technologies

    1 Gigabyte of Sequence

    Problem is to Target - Genes or Regions
    Short reads - 30-35 bp - quality?
    Variation discovery needs ~20-fold coverage
    Needs to be fairly uniform
    Provide 30-50 Mb of baseline
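The relationship between run yield, coverage and target size on this slide is simple division. The sketch below assumes the slide's "1 Gigabyte of Sequence" means roughly 10^9 bases per run and that coverage is perfectly uniform, which the slide itself flags as a concern; the function names are hypothetical.

```python
# Assumption: "1 Gigabyte of Sequence" is read as ~1e9 bases per run.
RUN_YIELD_BP = 1e9


def targetable_megabases(run_yield_bp, fold_coverage):
    """Size of region (in Mb) one run can cover at the requested depth,
    assuming perfectly uniform coverage."""
    return run_yield_bp / fold_coverage / 1e6


def reads_per_run(run_yield_bp, read_length_bp):
    """Approximate number of reads needed to produce the run yield."""
    return run_yield_bp / read_length_bp


target_mb = targetable_megabases(RUN_YIELD_BP, fold_coverage=20)
print(f"Target size at 20-fold coverage: {target_mb:.0f} Mb")
for read_length in (30, 35):
    millions = reads_per_run(RUN_YIELD_BP, read_length) / 1e6
    print(f"{read_length} bp reads: ~{millions:.0f} million reads per run")
```

At 20-fold coverage a run of this size covers about 50 Mb, the upper end of the slide's 30-50 Mb range, which is why targeting genes or regions rather than whole genomes is the practical problem.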

  • Human Genome Variation - Summary

    SeattleSNPs and HapMap - Common variation sources - SeattleSNPs offers insights into coverage
    New Genotyping Platforms - Very successful but more coverage will be coming
    Many genome associations are being identified - regions
    Other variants of interest emerging - structural variation
    Paradigm Shift in Sequencing Technology

  • Acknowledgements

    UW: Mark Rieder, Alex Reiner, Greg Cooper, Peggy Robertson, Tushar Bhangale

    FHCRC: Chris Carlson

    Vanderbilt: Dana Crawford

    Stanford: Shelley Force-Aldred, Rick Myers

    CARDIA: David Siscovick, Dale Williams, Beth Lewis, Kiang Liu, Carlos Irribaren, Myriam Fornage, Cashell Jaquish, Eric Boerwinkle

    NHLBI - SeattleSNPs

    I have the great pleasure of reviewing, in two hours, all of the SNP database resources: a recent history of SNP discovery and resources, and background on some of the projects and databases we will be referencing throughout our workshop. This talk will be an overview of SNPs, SNP discovery, and the current state of the dbSNP database. I believe that most people here today have some idea about the state of SNPs but possibly don't have a clear idea of their quality or the manner in which they were collected. Hopefully this background can fill that void and give you a better understanding of how to approach these databases. PUT THE WORK IN CONTEXT of the present-day data we have.

    They basically started from the point of the TSC and HGP work, with 2.8 million SNPs but 1.4 million validated SNPs, so more had been submitted to dbSNP but validation became a bigger issue. The approach they took to generating SNPs was a bit easier because at this time a draft of the human genome was available to work with. Again they used a random shotgun approach, but did it in a genome-wide manner: simply fragment, clone, sequence and align. [Animation] Other sources: the Perlegen data and the private genome project by Celera.

    This is an outline of how they went about this: starting with input genetic material (mRNA or genomic DNA) and making libraries, essentially taking small pieces of DNA and propagating them to be sequenced, getting that sequence, finding sequence matches, and then looking for differences. One of the first methods for doing this used mRNA and cDNA libraries, but some of the early SNPs developed by this method had quality issues stemming from the reverse transcriptase step, which can introduce errors and yield lower-quality sequence. This lower-quality sequence can then generate false differences between sequence fragments. This can be a problem for all of these methods, but DNA sequencing quality has rapidly improved and that has become less of an issue. First I would like to quickly give an overview of both methods derived from genomic DNA: BAC and RRS. POINT: DNA SEQUENCING --> SEQUENCE OVERLAP --> SNPS

    When we draw from two chromosomes, we can develop theoretical predictions of what fraction of SNPs you can discover using this method. When using 2 chromosomes we can never discover more than 50% of even the most common SNPs, those occurring at 50% allele frequency. In truth, they were discovered in a few more chromosomes than that, but probably not many more than 8 chromosomes. At that level you are able to ascertain nearly all of the very frequent polymorphisms (50% allele frequency), but a much lower percentage of the rare SNPs: in this case about 50% of the SNPs occurring at 10% frequency in the population, and this basically makes sense. In order to overcome the low probability of seeing a rare SNP you would need to sample from more people. BOTTOM LINE: by using these large-scale random approaches it isn't possible to fully discover the entire spectrum of common and rare SNPs.

    The second objective of the HapMap project was to characterize SNPs by genotyping in major representative populations from around the world. The goal for this was to develop 600,000 SNPs genome-wide where they would have genotyped each individual from four populations. Say something about trios and haplotypes; you'll be hearing more about that in the following talk from Dr. Crawford.

    So one question might be why they want to do this genotyping, other than the fact that it adds to the validation status of a SNP. There are some obvious advantages and these are outlined here, and again you will hear more about several of these as we move through the talks today and tomorrow. Population specific (emphasize that): about 10% of SNPs can be private, monomorphic or low frequency in one population and then very polymorphic in another population. If you are designing a study for your population of interest you would like to know that during the SNP selection phase. LD: how are genotypes in the same individuals correlated between SNPs? This affects how dense you would expect your SNP map to be and, practically speaking, could alter the number of SNPs you would have to genotype. Finally, how these SNPs are coinherited as haplotypes, and whether you need to consider this coinheritance when designing your association studies.

    In examining the highest-activity haplotype, the associated tagSNP is ~1000 bp upstream of the transcript. [CLICK] However, several other SNPs in this region are on the same haplotype, including one in the putative promoter. [CLICK] Remarkably, this SNP is within 20 bp of the triallelic SNP and [CLICK] also alters a predicted USF1 binding site.

    14 other SNPs showing clustering pattern