handling and analyzing data from a genome-wide association study

27
Handling and analyzing data from a genome-wide association study Laura Scott Biostatistics Department Center for Statistical Genetics University of Michigan

Upload: kylynn-salinas

Post on 03-Jan-2016

28 views

Category:

Documents


0 download

DESCRIPTION

Handling and analyzing data from a genome-wide association study. Laura Scott Biostatistics Department Center for Statistical Genetics University of Michigan. Outline. Storing large amounts of genotype data Quality control Generating initial association analysis Viewing results - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Handling and analyzing data from a genome-wide association study

Handling and analyzing data from a genome-wide association study

Laura Scott

Biostatistics DepartmentCenter for Statistical Genetics

University of Michigan

Page 2: Handling and analyzing data from a genome-wide association study

Outline

• Storing large amounts of genotype data

• Quality control

• Generating initial association analysis

• Viewing results

• Imputation of missing SNP genotypes

• Storing results and planning specialized analysis

Page 3: Handling and analyzing data from a genome-wide association study

Genotype data is huge

• 500,000 SNPs * 2000 cases + controls = 1,000,000,000 genotypes!

• Need compact ways to store data– If store each genotype as 00, 01, or 11 will

have file that looks like Person 100011001010001001001100….

000000100001000010001000….

Page 4: Handling and analyzing data from a genome-wide association study

Genotype data is huge

• 500,000 SNPs * 2000 cases + controls = 1,000,000,000 genotypes!

• Need compact ways to store data– If store each genotype as 00, 01, or 11 will

have file that looks like SNP 100011001010001001001100….

000000100001000010001000….

Page 5: Handling and analyzing data from a genome-wide association study

Genotype data is huge

• 500,000 SNPs * 2000 cases + controls = 1,000,000,000 genotypes!

• Need compact ways to store data– If store each genotype as 00, 01, or 11 will

have file that looks like 100011001010001001001100….

000000100001000010001000….

– Total file space for 300K SNPs: 4 Gigabytes– Largest chromosome file: .4 Gigabytes

Page 6: Handling and analyzing data from a genome-wide association study

Need to do extensive planning for genotype data before it arrives

• Chromosome datasets are too large for SAS and other commonly used analytic packages

• Need programs to select and write out genotype data in multiple formats

• Tests of procedures with large-scale trial datasets

Page 7: Handling and analyzing data from a genome-wide association study

Gather other data needed for analysis

• SNP information– Chromosome– Position

• SNP annotation– Gene– Function

• Translation of called allele to a standard allele– Example forward strand of given genome build

Page 8: Handling and analyzing data from a genome-wide association study

How good is the data?

• Identify and remove bad samples and SNPs

• Compute summary statistics– Percent successfully genotyped samples– Average genotyping success rate– Duplicate sample error rate– Non-Mendelian inheritance error rates (errors

not consistent with normal transmission of chromosomes in family members)

Page 9: Handling and analyzing data from a genome-wide association study

Identify bad samples and remove

• Poor quality samples– Sample genotype success rate < 95 to 97.5%– Greater proportion of heterozygous genotypes

than expected

• Related individuals (if independent samples) – Based on pair-wise comparisons of similarity of

genotypes

• Sample switches– Wrong sex

• Regions of homozygosity in cell line

Page 10: Handling and analyzing data from a genome-wide association study

Identify poor quality SNPs and remove

• Expected proportions of genotypes are not consistent with observed allele frequency (Hardy Weinberg Equilibrium (HWE))

• HWE p-value < 10-4 to 10-6

• Look for deviation from expected distribution of p-values under the null

• Genotyping success rate < 95%• Duplicate sample or Non-Mendelian error rate

is elevated• Differential missingness in cases and controls

Page 11: Handling and analyzing data from a genome-wide association study

Programs are available for large scale quality control analysis

• Plink– Duplicate error rates, sample relatedness, HWE,.. – Develop by Shaun Purcell– http://pngu.mgh.harvard.edu/~purcell/plink/

• GAINQC Software used for the quality control analysis of the GAIN project – Duplicate error rates, sample relatedness, HWE,..– Developed by Shyam Gopalakrishnan and Goncalo

Abecasis– Available from [email protected]

Page 12: Handling and analyzing data from a genome-wide association study

Initial analysis is straightforward once have everything in place

• Case/control association– Use test that is not affected by deviations

from HWE• Cochran-Armitage test for trend• Equivalent to score test in logistic regression

• TDT or other family-based test

• Quantitative trait association

Page 13: Handling and analyzing data from a genome-wide association study

Programs are available for large scale case-control or family-based analysis• Plink

– Case/control, tdt, quantitative traits– Develop by Shaun Purcell– http://pngu.mgh.harvard.edu/~purcell/plink/

• Merlin– Quantitative traits in independent samples or families,

ability to impute genotypes for untyped individuals based on genotyped family members

– Developed by Goncalo Abecasis– http://www.sph.umich.edu/csg/abecasis/Merlin/

Page 14: Handling and analyzing data from a genome-wide association study

Are the results believable?

• Are stronger associations correlated with poorer quality control measures?

• Is there a strong deviation from expected distribution of p-values?

• Is there confounding from differences in the genetic origins of case and control samples (population stratification)?– Genomic control– Eigenstrat analysis

Page 15: Handling and analyzing data from a genome-wide association study

Seeing from many different angles is believing (sometimes)

• Plink graphical output

• User added custom tracks in the UCSC browser– http://genome.ucsc.edu/– http://genome.ucsc.edu/goldenPath/help/

hgTracksHelp.html#CustomTracks

• Homemade graphes

Page 16: Handling and analyzing data from a genome-wide association study

FUSION T2D association

Page 17: Handling and analyzing data from a genome-wide association study

Many different ways to display similar data

Zeggini et al. (2007) Science 316: 13361341; Diabetes Genetic Initiative (2007) Science 316:1331-1336; Scott et al., (2007) Science 316:1341-1345

Page 18: Handling and analyzing data from a genome-wide association study

Getting more for your genotyping dollars: Imputation of SNP genotypes

• Impute/predict genotypes for:– Missing data within genotyped markers– Untyped markers

• Uses haplotype structure of existing sample such as HapMap samples to infer data for samples with sparser marker set

Page 19: Handling and analyzing data from a genome-wide association study

Observed genotypes

Observed Genotypes

. . . . A . . . . . . . A . . . . A . . .

. . . . G . . . . . . . C . . . . A . . .

Reference Haplotypes

C G A G A T C T C C T T C T T C T G T G CC G A G A T C T C C C G A C C T C A T G GC C A A G C T C T T T T C T T C T G T G CC G A A G C T C T T T T C T T C T G T G CC G A G A C T C T C C G A C C T T A T G CT G G G A T C T C C C G A C C T C A T G GC G A G A T C T C C C G A C C T T G T G CC G A G A C T C T T T T C T T T T G T A CC G A G A C T C T C C G A C C T C G T G CC G A A G C T C T T T T C T T C T G T G C

Study Sample

HapMap

Gonçalo Abecasis

Page 20: Handling and analyzing data from a genome-wide association study

Identify match among reference

Observed Genotypes

. . . . A . . . . . . . A . . . . A . . .

. . . . G . . . . . . . C . . . . A . . .

Reference Haplotypes

C G A G A T C T C C T T C T T C T G T G CC G A G A T C T C C C G A C C T C A T G GC C A A G C T C T T T T C T T C T G T G CC G A A G C T C T T T T C T T C T G T G CC G A G A C T C T C C G A C C T T A T G CT G G G A T C T C C C G A C C T C A T G GC G A G A T C T C C C G A C C T T G T G CC G A G A C T C T T T T C T T T T G T A CC G A G A C T C T C C G A C C T C G T G CC G A A G C T C T T T T C T T C T G T G C

Gonçalo Abecasis

Page 21: Handling and analyzing data from a genome-wide association study

Phase chromosomes, impute missing genotypes

Observed Genotypes

c g a g A t c t c c c g A c c t c A t g gc g a a G c t c t t t t C t t t c A t g g

Reference Haplotypes

C G A G A T C T C C T T C T T C T G T G CC G A G A T C T C C C G A C C T C A T G GC C A A G C T C T T T T C T T C T G T G CC G A A G C T C T T T T C T T C T G T G CC G A G A C T C T C C G A C C T T A T G CT G G G A T C T C C C G A C C T C A T G GC G A G A T C T C C C G A C C T T G T G CC G A G A C T C T T T T C T T T T G T A CC G A G A C T C T C C G A C C T C G T G CC G A A G C T C T T T T C T T C T G T G C

Gonçalo Abecasis

Page 22: Handling and analyzing data from a genome-wide association study

Imputing genotype data allows much more thorough analysis

• Allows testing of untyped variation

• Allows easy combination of data across genotyping platforms

• Provides complete data for analysis with multiple SNPs

Page 23: Handling and analyzing data from a genome-wide association study

Imputed data takes care to generate, analyze and understand

• Requires large scale computing resources• Need to assess quality of imputation

– Compare imputed gentoypes to actual genotypes

• Error rates are higher than for genotyped SNPs• Works less well for rarer alleles• Best to take account of uncertainty imputed

SNPs in analysis– Need ways to take into account fractional genotype

counts

Page 24: Handling and analyzing data from a genome-wide association study

Imputation programs are available

• IMPUTE– Developed by Jonathan Marchini– Nature Genetics, Advance online publication– http://www.stats.ox.ac.uk/~marchini/#software

• Mach 1.0, Markov Chain Haplotyping– Developed by Goncalo Abecasis– http://www.sph.umich.edu/csg/abecasis/MACH/

Page 25: Handling and analyzing data from a genome-wide association study

Need to store results and prepare for large scale specialized analysis

• System to store, view, merge results– SQL database– Plink

• Testing speed of specialized analyses in different statistical packages

• Potential development of software to run large scale specialized analysis

Page 26: Handling and analyzing data from a genome-wide association study

Summary: Ideally what needs to happen before getting the data

• Ability to store, select and write out genotype data in multiple formats for quality control and association analysis

• Identification of primary quality control and analysis programs

• Systems to store, view, merge results• Adequate computing resources to do

intensive computing • Testing of standard and specialized

processes with large-scale trial datasets

Page 27: Handling and analyzing data from a genome-wide association study

FUSION study

U Michigan

Michael BoehnkeKaren ConneelyCharles Ding William DurenTerry GliedtLarry HuAnne JacksonXiao-Yi LiAndrew SkolHeather StringhamPeggy WhiteCristen WillerFang XiangRui Xiao

National Public Health Institute Helsinki

Jaakko TuomilehtoTimo Valle

USCRichard BergmanThomas BuchananRichard Watanabe

Randall PruimCalvin College

Kimberly DohenyElizabeth Pugh

CIDR

Francis CollinsLori BonnycastlePeter ChinesMichael ErdosNarisu NarisuL. Prokunina-OlssonNancy RiebowAndrew SprauAmy SwiftMaurine Tong

NHGRI / NIH

Karen MohlkeKyle GaultonJason Luo Li Qin

UNC-Chapel Hill

U Michigan

Gonçalo AbecasisYun LiJun DingPaul Scheet