![Page 1: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/1.jpg)
Genotype Phasing and Imputation in 1x Sequencing Data
Warren W. Kretzschmar
DPhil Genomic Medicine and Statistics
Wellcome Trust Centre for Human Genetics, Oxford, UK
Supervisor: Jonathan Marchini
![Page 2: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/2.jpg)
• Commonest psychiatric disorder and the second ranking cause of morbidity world-wide.
• Affects 1 in 10 people in their lifetime.
• Estimates of heritability range from 30% to 40%.
Major Depression
![Page 3: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/3.jpg)
Top Ten causes of DALYs
• Major depressive disorders
• Violence
• Ischaemic heart disease
• Alcohol use disorders
• Road traffic accidents
• Diabetes mellitus
• Cerebrovascular disease
• Other unintentional injuries
• Lower respiratory infections
• Chronic obstructive pulmonary disease
DALY: disability-adjusted life year, the number of years lost due to ill-health, disability or early death
![Page 4: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/4.jpg)
Genetics of Major Depression
Major Depressive Disorder Working Group of the Psychiatric GWAS Consortium (2012). A mega-analysis of genome-wide association studies for major depressive disorder. Molecular Psychiatry 18.4:497-511.
Study Design
• Unrelated Europeans
• 9240 cases
• 9519 controls
• 1.2 million SNPs
Hypotheses
• Depression has heterogeneous environmental and genetic causes
• Depression is a complex trait with genetic components of small effect size
![Page 5: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/5.jpg)
CONVERGE (China, Oxford and VCU Experimental Research on Genetic Epidemiology)
Genetically Homogeneous : All subjects are female and their grandparents are Han Chinese
6,000 cases : typically severely affected: 85% qualify for a diagnosis of melancholia by DSM-IV. >25% reported a family history of MD in one or more first-degree relatives
6,000 controls : patients undergoing minor surgical procedures.
Extensive Phenotyping : primary disorder of major depression, common comorbid disorders (e.g. generalized anxiety disorder, panic disorder), within disorder symptoms (e.g. suicidal ideation), disorder subtypes (e.g. melancholia, dysthymia), possible endophenotypes (e.g. neuroticism) and a range of risk factors (e.g. child abuse, stressful life events, social and marital relationships, parenting, post-natal depression, demographics).
Sequencing : mean depth 1.7X using Illumina HiSeq at Beijing Genomics Institute
Current status : Sequencing finished. We have data on 12,000 samples. For now we have only considered the ~13M sites that are polymorphic in 1000 Genomes Asian samples. Analysis ongoing…
59 hospitals, 45 cities, 21 provinces.
![Page 6: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/6.jpg)
Sequence analysis pipeline
Phase 1: genotype likelihood estimation (one sample at a time)
Raw reads → Mapping (Stampy) → Duplicate marking (Picard) → Base quality recalibration (GATK) → Genotype likelihood estimation (SNPTools) → Genotype likelihoods
Phase 2: phasing and imputation (all samples together)
Genotype likelihoods → Phasing and imputation (My focus!) → Genotype probabilities
Data sizes and compute costs annotated on the diagram: 48 TB, 650 GB, 4.6 CPU years, 350 GB, 2.7 CPU years, 5 CPU years
![Page 7: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/7.jpg)
GENOTYPE PHASING AND IMPUTATION
![Page 8: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/8.jpg)
Genotype Phasing
Unphased: G/G A/T A/A T/T G/T A/T T/T A/A G/G G/C
Example SNP chip data
Hap 1: G A A T T T T A G C
Hap 2: G T A T G A T A G G
After Phasing
Phase-informative Sites
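The example above can be checked with a short sketch. It assumes that the phase-informative sites are the heterozygous ones (homozygous sites look the same on both haplotypes, so they carry no phase information). The function names are hypothetical and chosen only for illustration.

```python
# Genotypes and haplotypes copied from the example on this slide.
unphased = "G/G A/T A/A T/T G/T A/T T/T A/A G/G G/C".split()
hap1 = "G A A T T T T A G C".split()
hap2 = "G T A T G A T A G G".split()


def heterozygous_sites(genotypes):
    """Return 0-based indices of sites whose two alleles differ."""
    return [i for i, g in enumerate(genotypes) if len(set(g.split("/"))) == 2]


def is_consistent(genotypes, h1, h2):
    """Check that the two haplotypes reproduce every unphased genotype."""
    return all(
        sorted(g.split("/")) == sorted([a, b])
        for g, a, b in zip(genotypes, h1, h2)
    )


het = heterozygous_sites(unphased)
# With k heterozygous sites there are 2**(k-1) distinct haplotype pairs,
# because swapping the two haplotypes gives the same pair.
n_phasings = 2 ** (len(het) - 1) if het else 1

print("heterozygous sites (0-based):", het)
print("possible haplotype pairs:", n_phasings)
print("Hap 1 / Hap 2 consistent with genotypes:", is_consistent(unphased, hap1, hap2))
```

Here four sites are heterozygous, so eight haplotype pairs are compatible with the genotypes; phasing picks one of them.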
![Page 9: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/9.jpg)
Genotype Imputation from Haplotypes
J Marchini and B Howie. Nature Rev. Genet. 2010
![Page 10: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/10.jpg)
GENOTYPE LIKELIHOODS
![Page 11: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/11.jpg)
What is a Genotype Likelihood?
Genotype Likelihood = Pr( R | G )
R = Reads; also known as the “observed data”
G = Genotype; usually one of ref/ref, ref/alt, alt/alt
Genotype likelihoods (aka GL) are defined on a site by site basis.
GLs are conditional probabilities.
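To make the definition concrete, here is a minimal sketch of Pr( R | G ) at a single site under a simple textbook error model. This is an assumption made for illustration and is not the model used by SNPTools: reads are treated as independent, and each base is miscalled with a fixed probability `error_rate` (a hypothetical parameter).

```python
def genotype_likelihoods(n_ref, n_alt, error_rate=0.01):
    """Pr(R | G) for G in (ref/ref, ref/alt, alt/alt) at one site.

    n_ref and n_alt are the numbers of reads showing the reference and
    alternate allele. Reads are assumed independent, each miscalled with
    probability error_rate (an illustrative simplification).
    """
    gls = []
    for n_alt_alleles in (0, 1, 2):
        frac_alt = n_alt_alleles / 2
        # Probability that a single read shows the alternate allele.
        p_alt = frac_alt * (1 - error_rate) + (1 - frac_alt) * error_rate
        gls.append((1 - p_alt) ** n_ref * p_alt ** n_alt)
    return gls


# One read carrying the alternate allele, as is typical at about 1x depth.
gls = genotype_likelihoods(n_ref=0, n_alt=1)
print(gls)
```

With a single alternate read, ref/alt and alt/alt differ by only about a factor of two, which is why low-depth data needs more than the reads at one site. The three values also do not sum to one, since each is conditional on a different genotype.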
![Page 12: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/12.jpg)
How are Genotype Likelihoods Useful?
Genotype Probability = Pr( G | R ) proportional to Pr( R | G ) * Pr( G )
Genotype likelihoods allow us to quantify how much the reads support each possible genotype independent of other information.
To determine the most likely genotype call, we need a genotype probability.
Pr( G ) = prior probability of G. May be determined through haplotype phasing and imputation approaches.
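The relation above can be sketched in a few lines. The prior used here is a Hardy-Weinberg prior from an assumed alternate allele frequency, chosen only to keep the example self-contained; in the approach described in this talk the prior would instead come from phasing and imputation. All numbers and function names are hypothetical.

```python
def hwe_prior(alt_freq):
    """Prior Pr(G) for (ref/ref, ref/alt, alt/alt) under Hardy-Weinberg."""
    p, q = 1.0 - alt_freq, alt_freq
    return [p * p, 2 * p * q, q * q]


def genotype_posteriors(likelihoods, prior):
    """Pr(G | R), obtained by normalising Pr(R | G) * Pr(G)."""
    unnormalised = [lik * pr for lik, pr in zip(likelihoods, prior)]
    total = sum(unnormalised)
    if total == 0:
        raise ValueError("all genotypes have zero probability")
    return [u / total for u in unnormalised]


# Illustrative genotype likelihoods: the reads favour ref/alt slightly.
example_gls = [0.02, 0.05, 0.001]
prior = hwe_prior(alt_freq=0.1)
posterior = genotype_posteriors(example_gls, prior)

labels = ["ref/ref", "ref/alt", "alt/alt"]
best_by_likelihood = labels[example_gls.index(max(example_gls))]
best_by_posterior = labels[posterior.index(max(posterior))]
print(best_by_likelihood, best_by_posterior)
```

In this example the likelihood alone favours ref/alt, but the prior shifts the most probable genotype to ref/ref, which shows why a call needs a genotype probability rather than a likelihood.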
![Page 13: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/13.jpg)
Genotype Likelihood Creation with SNPTools
Y Wang, J Lu, J Yu, RA Gibbs, FL Yu. Genome Research. 2013
observed reads
Three distributions
Pr(R|G = alt/alt) = 10e-6
Pr(R|G = ref/alt) = 10e-3
Pr(R|G = ref/ref) = 0.06
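Likelihoods this small are usually stored on a log scale. The sketch below converts the three values above to log10 likelihoods and to Phred-scaled values normalised so that the best genotype scores 0, which are the conventions of the VCF `GL` and `PL` fields. The values are read literally as floating-point literals, and whether SNPTools writes its output this way is not stated here.

```python
import math

# Likelihoods as written on the slide, read literally as floating-point
# literals (so 10e-3 is 0.01 and 10e-6 is 0.00001).
likelihoods = {"ref/ref": 0.06, "ref/alt": 10e-3, "alt/alt": 10e-6}


def to_log10(liks):
    """log10-scaled likelihoods."""
    return {g: math.log10(lik) for g, lik in liks.items()}


def to_phred(liks):
    """Phred-scaled likelihoods, shifted so the most likely genotype is 0."""
    raw = {g: -10 * math.log10(lik) for g, lik in liks.items()}
    best = min(raw.values())
    return {g: round(value - best) for g, value in raw.items()}


log10_gl = to_log10(likelihoods)
phred_pl = to_phred(likelihoods)
print(log10_gl)
print(phred_pl)
```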
![Page 14: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/14.jpg)
Genotype Phasing using Genotype Likelihoods
Example GL data
Pr(ref/ref): G/G A/A A/A T/T G/G A/A T/T A/A G/G G/G
Pr(ref/alt): G/A A/T A/G T/A G/T A/T T/C A/G G/C G/C
Pr(alt/alt): A/A T/T G/G A/A T/T T/T C/C G/G C/C C/C
Hap 5: G A A T T A T A G C
Hap 6: G T A T T A T A G G
Plausible Haplotypes after Phasing
Hap 1: G A A T T A C A G G
Hap 2: G T A T T A T A G G
Hap 3: G T A T G A C A G G
Hap 4: G T A T G A T A G C
Reference Haplotypes
![Page 15: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/15.jpg)
General MCMC Scheme for Phasing from GLs
When using GLs, haplotype estimation is currently done in an iterative Markov Chain Monte Carlo (MCMC) scheme:
1. Initialize haplotypes for each sample randomly
2. For a predetermined number of iterations
   1. For each sample
      1. Find a plausible haplotype pair using its GLs and all other haplotypes as a reference panel
      2. Update that sample’s haplotypes with the plausible haplotype pair
3. Return each sample’s current pair of haplotypes
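The scheme above can be sketched as a toy program. This is a deliberately simplified illustration and not the method of any particular tool: alleles are coded 0 (ref) and 1 (alt), and the "plausible haplotype pair" step copies two whole haplotypes from the panel, whereas practical methods typically use a hidden Markov model that lets each haplotype switch between panel haplotypes along the chromosome. All names and the example data are hypothetical.

```python
import random


def pair_likelihood(gl, h1, h2):
    """Product over sites of Pr(R | G) for the genotype implied by h1 and h2.

    gl[site][g] is the likelihood of carrying g alternate alleles (0, 1, 2).
    """
    lik = 1.0
    for site_gl, a, b in zip(gl, h1, h2):
        lik *= site_gl[a + b]
    return lik


def sample_pair(gl, panel, rng):
    """Sample a haplotype pair from the panel, weighted by its GL support."""
    pairs = [(h1, h2) for h1 in panel for h2 in panel]
    weights = [pair_likelihood(gl, h1, h2) for h1, h2 in pairs]
    if sum(weights) == 0:
        weights = [1.0] * len(pairs)  # fall back to a uniform choice
    h1, h2 = rng.choices(pairs, weights=weights, k=1)[0]
    return [list(h1), list(h2)]


def phase(gls, n_iterations, seed=0):
    """Toy MCMC phasing; gls[sample][site] holds three GLs. Needs >= 2 samples."""
    rng = random.Random(seed)
    n_sites = len(gls[0])
    # Step 1: initialise each sample's two haplotypes at random.
    haps = [
        [[rng.randint(0, 1) for _ in range(n_sites)] for _ in range(2)]
        for _ in gls
    ]
    # Step 2: fixed number of sweeps over all samples.
    for _ in range(n_iterations):
        for s, gl in enumerate(gls):
            # Every other sample's haplotypes form the reference panel.
            panel = [h for t, pair in enumerate(haps) if t != s for h in pair]
            haps[s] = sample_pair(gl, panel, rng)
    # Step 3: return each sample's current haplotype pair.
    return haps


# Hypothetical GLs for 3 samples at 4 sites: (ref/ref, ref/alt, alt/alt).
example_gls = [
    [(0.9, 0.1, 0.01), (0.1, 0.8, 0.1), (0.9, 0.1, 0.01), (0.2, 0.7, 0.1)],
    [(0.1, 0.8, 0.1), (0.9, 0.1, 0.01), (0.3, 0.6, 0.1), (0.9, 0.1, 0.01)],
    [(0.9, 0.1, 0.01), (0.2, 0.7, 0.1), (0.9, 0.1, 0.01), (0.1, 0.8, 0.1)],
]
haps = phase(example_gls, n_iterations=5, seed=1)
for sample_haps in haps:
    print(sample_haps)
```

Because whole haplotypes are copied without mutation or recombination, this toy version quickly loses haplotype diversity; it is meant only to show the control flow of the three steps.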
![Page 16: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/16.jpg)
The Tools/Languages I use
Coding: Emacs
Scripting: Perl with DistributedMake for pipelines
Statistical Methods: C++
Figure Generation: R
Statistical Analysis & Report Writing: LaTeX with Sweave
Presentations: PowerPoint or LaTeX
![Page 17: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/17.jpg)
A Bioinformatician’s Best Practices
- Understand your goals and choose appropriate methods
- Be suspicious and trust nobody
- Set traps for your own scripts and other people’s
- Be a detective
- You're a scientist, not a programmer
- Use version control software
- Pipelineitis is a nasty disease
- An Obama frame of mind
- Someone has already done this. Find them!
according to Nick Loman & Mick Watson. Nature Biotechnology. 2013
see also: W. S. Noble. PLoS Computational Biology. 2009
![Page 18: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/18.jpg)
Good Directory Structure
according to W. S. Noble. PLoS Computational Biology. 2009
![Page 19: Genotype Phasing and Imputation in 1x Sequencing Data](https://reader035.vdocuments.us/reader035/viewer/2022070504/568168f2550346895ddff907/html5/thumbnails/19.jpg)
Thank you. Questions?