Construction of Maize Haplotype Map From Single Genome Reference to Pan-Genome reference Qi Sun Bioinformatics Facility Cornell University

Post on 04-Aug-2020


Page 1: Construction of Maize Haplotype Map From Single Genome ...ksiconnect.icrisat.org/wp-content/uploads/2016/07/24-07-2016.pdf · FH +-- Fraction of heterozygous taxa among the 506 taxa

Construction of Maize Haplotype Map

From Single Genome Reference to Pan-Genome reference

Qi Sun

Bioinformatics Facility

Cornell University

Page 2:

$1,000 Genome

Illumina X Ten

Cost of Sequencing a human genome

Page 3:

How does the challenge compare: human vs. crops

Human             Crops                   Challenge in crops
Diploid           Haploid to polyploid    4X harder
Low diversity     High diversity          20X harder
Old transposons   Active transposons      50X harder
Small families    Large families          50X easier
Rare inbreeding   Extensive inbreeding    20X easier

Ed Buckler (2014)

Page 4:

Reference genome: De novo assembly (PacBio, 10X Gemcode)

Haplotype map: Whole genome sequencing (HiSeq X Ten)

Path of haplotypes: Targeted sequencing (GBS, amplicon, etc.)

$5 genome

Page 5:

Cost of sequencing a maize genome (2.1 Gb)

• De novo assembly: $50,000 per genome

• Genome re-sequencing: $50 - $500 per genome

• Genotyping-by-sequencing: $20 per genome

• Other targeted sequencing: <$5 per genome

Genome     Platform   Lab(s)
B73 (v4)   PacBio     Doreen Ware
W22        NRGene     Tom Brutnell et al.

Page 6:

10X Genomics Gemcode Technology

<$5,000 reference genome

Page 7:

Cost of sequencing a maize genome (2.1 Gb)

• De novo assembly: $50,000 per genome

• Genome re-sequencing: $50 - $500 per genome

• Genotyping-by-sequencing: $20 per genome

• Other sparse sequencing: <$5 per genome

Illumina X Ten System

Page 8:

Cost of sequencing a maize genome (2.1 Gb)

• De novo assembly: $50,000 per genome

• Genome re-sequencing: $50 - $500 per genome

• Genotyping-by-sequencing: $20 per genome

• Other targeted sequencing: <$5 per genome

Pro: Genotype the hypo-methylated regions

Con: High cost of DNA preparation; slow turnaround time.

Page 9:

Cost of sequencing a maize genome (2.1 Gb)

• De novo assembly: $50,000 per genome

• Genome re-sequencing: $50 - $500 per genome

• Genotyping-by-sequencing: $20 per genome

• Other targeted sequencing: <$5 per genome

Low-cost DNA prep

Page 10:

Markers and Genotyping

[Figure: alleles at three marker sites in B73 (A, C, G) compared with CML247, W22, and PH207]

Page 11:

Alignment of the genomes

[Figure: alignment of the CML247, W22, and PH207 genomes to B73]

Page 12:

Proportion of aligned genomes

[Figure: proportion of the genome (y-axis, 0 to 0.35) versus coverage (x-axis, 0 to 5) for Mo17, W22, PH207, CML247, and TIL11]

Cheng Zou

Page 13:

HapMap 3 was constructed from whole-genome sequencing of 1,280 maize lines

Dataset             # Taxa   Coverage (min)   Coverage (max)   Coverage (average)
HapMap2             103      1                18.5             4.1
HapMap2 extra       44       4.2              42               11.5
CAU                 725      0.06             36.8             1.75
CIMMYT/BGI          89       1.1              19               11
282 Cornell         271      0                9                2.2
282 Novogene        270      0.6              34.5             4.4
German, Ref. [2]    31       8.3              59               17.4

Page 14:

HapMap3 markers are defined as sites whose physical positions (B73 alleles) match the consensus genetic positions.

[Figure: alignment of the CML247, W22, and PH207 genomes to B73]

Page 15:

Filters used in genotype calling

Reads based

• Mapping quality score (MQ>30)

• Read depth (Segregation test)

• Rare allele read depth

Genetics based

• IBD filter

• LD filter

• Inbreeding coefficient

Robert Bukowski
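The inbreeding coefficient filter can be illustrated with a minimal sketch. It assumes the common per-site definition F = 1 - H_obs / H_exp, with H_exp = 2pq taken from the observed allele frequencies. The slides do not give the exact formula or cutoff used for HapMap 3, and the function name and its inputs (genotype counts such as those reported in the GN flag) are illustrative.

```python
def inbreeding_coefficient(n_aa, n_ab, n_bb):
    """Per-site inbreeding coefficient F = 1 - H_obs / H_exp.

    n_aa, n_ab, n_bb: counts of the three genotype classes at a biallelic site.
    Returns None when the site has no calls or is monomorphic (H_exp == 0).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return None
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    expected_het = 2 * p * (1 - p)    # Hardy-Weinberg expectation
    if expected_het == 0:
        return None
    observed_het = n_ab / n
    return 1 - observed_het / expected_het


# In a panel of inbred lines a real SNP should give F close to 1;
# a site with many heterozygous calls gives F close to 0 or below.
f_inbred = inbreeding_coefficient(50, 0, 50)
f_suspect = inbreeding_coefficient(25, 50, 25)
```

Under this assumption, sites with low F in an inbred panel are candidates for removal, since excess heterozygosity often points to misaligned or duplicated sequence.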

Page 16:

GBS markers are used as anchor markers for IBD and LD filters

[Figure: Sample1 and Sample2 with restriction sites, sequence tags (< 450 bp), and loss of a cut site]

ApeKI is methylation sensitive and can only cut unmethylated DNA

Page 17:

IBD filter

• Genotypes from GBS markers were used to determine IBD regions.

• The match-to-mismatch ratio was checked for each site in IBD regions.

• About 50% of the sites that survive the IBD filter have no minor allele present in the IBD contrasts; these are labeled IBD1.

196 M variants

Genome size: 2,067 Mb

96.8 M survive IBD
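The per-site check described above can be sketched as follows. This is a minimal illustration that assumes one homozygous call per taxon and a list of taxon pairs already found to be IBD around the site from GBS markers. The function names, the data layout, and the `min_ratio` threshold are hypothetical; the slides do not state the cutoff that was used.

```python
def ibd_contrast_summary(calls, ibd_pairs):
    """Count matching and mismatching calls across IBD taxon pairs at one site.

    calls: dict taxon -> allele (e.g. "A"), or None when the call is missing.
    ibd_pairs: (taxon1, taxon2) pairs assumed IBD in the region around the site.
    """
    matches = mismatches = 0
    alleles_seen = set()
    for t1, t2 in ibd_pairs:
        a1, a2 = calls.get(t1), calls.get(t2)
        if a1 is None or a2 is None:
            continue  # skip contrasts with missing data
        alleles_seen.update((a1, a2))
        if a1 == a2:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, alleles_seen


def survives_ibd(matches, mismatches, min_ratio=2.0):
    """Hypothetical rule: keep the site when matches / mismatches >= min_ratio."""
    if matches + mismatches == 0:
        return False  # no informative contrast
    if mismatches == 0:
        return True
    return matches / mismatches >= min_ratio


def is_ibd1(alleles_seen):
    """IBD1: only one allele is present in the IBD contrasts."""
    return len(alleles_seen) == 1


calls = {"t1": "A", "t2": "A", "t3": "T", "t4": "T", "t5": None}
pairs = [("t1", "t2"), ("t3", "t4"), ("t1", "t5")]
matches, mismatches, seen = ibd_contrast_summary(calls, pairs)
```

A site flagged IBD1 passes the filter but the minor allele was never tested in an IBD contrast, so the evidence for it is weaker than for sites where both alleles were seen.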

Page 18:

Local LD filter

• Genotypes from GBS markers were used as the anchor map.

• For each site, measure LD against GBS genotypes.

Genome size: 2,067 Mb

32M LLD sites
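A minimal sketch of a local LD test, under stated assumptions: genotypes are coded 0/1/2, LD is measured as the squared Pearson correlation, and a site counts as "local" when its best-correlated GBS anchor is both strongly correlated and physically close. The `max_dist` and `min_r2` values and the function names are hypothetical, not the HapMap 3 settings.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 codes).

    Taxa with a missing call (None) in either vector are skipped.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic vector: treat LD as absent
    return sxy * sxy / (sxx * syy)


def in_local_ld(site_pos, site_gt, anchors, max_dist=1_000_000, min_r2=0.5):
    """Hypothetical rule: the best-correlated anchor must be strong and nearby.

    anchors: list of (position, genotype_vector) on the same chromosome.
    """
    best_pos, best_r2 = None, 0.0
    for pos, gt in anchors:
        r2 = r_squared(site_gt, gt)
        if r2 > best_r2:
            best_pos, best_r2 = pos, r2
    if best_pos is None or best_r2 < min_r2:
        return False
    return abs(best_pos - site_pos) <= max_dist


anchors = [(100, [0, 0, 2, 2, 0, 2]), (5_000_000, [0, 2, 0, 2, 0, 2])]
local_site = in_local_ld(150, [0, 0, 2, 2, 0, 2], anchors)
distant_site = in_local_ld(150, [0, 2, 0, 2, 0, 2], anchors)
```

The second site correlates best with an anchor about 5 Mb away, which is the pattern expected when reads from a paralogous region are mapped to the wrong place.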

Page 19:

Flags used in HMP 3.2.1 VCF file

Parameter          3.1.1, 3.2.1unimp, 3.2.1imp   Description
DP                 +++   Total read depth at the site
NZ                 +++   Number of taxa with non-zero depth
AD                 +++   Allelic depths (reference, alternative in order listed in ALT field)
AN                 +++   Numbers of alleles (reference, alternative in order listed in ALT field)
AQ                 +++   Average allele base qualities (reference, alternative in order listed in ALT field), computed in HapMap 3.1.1 from 916 taxa
GN                 +++   Numbers of genotypes (AA,AB,BB or AA,AB,AC,BB,BC,CC if 2 alt alleles present)
HT                 +++   Number of heterozygotes
EF                 +++   EF=het_frequency/(presence_frequency*minor_allele_frequency); computed in HapMap 3.1.1 from 916 taxa
PV                 +++   p-value from segregation test, computed in HapMap 3.1.1 from 916 taxa
MAF                +++   Minor allele frequency (summed over all alternative alleles)
MAF0               --+   Minor allele frequency in unimputed HapMap 3.2.1
FH                 +--   Fraction of heterozygous taxa among the 506 taxa with more than 50% non-missing genotypes on chr 10
FH2                +--   Site with FH greater than 2%
IBD1               +++   Only one allele present in IBD contrasts, based on 916 taxa of HapMap 3.1.1
LLD                +++   Site in local LD with GBS map, based on 916 taxa of HapMap 3.1.1
NI5                +++   Indel or site within 5 bp of a putative indel, from 916 taxa of HapMap 3.1.1
INHMP311           -++   Site present in HapMap 3.1.1
ImpHomoAccuracy    --+   Fraction of homozygotes imputed back into homozygotes
ImpMinorAccuracy   --+   Fraction of minor allele homozygotes imputed back into minor allele homozygotes
DUP                --+   Site with heterozygote frequency > 3%, based on unimputed HapMap 3.2.1 genotypes

Page 20:

Flags used in HMP 3.2.1 VCF file

• LLD (50%): Site in local LD with GBS map

• NI5 (15%): Indel or within 5 bp of an indel

• DUP (10%): Site with heterozygote frequency > 3%
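One way these three flags might be used together is sketched below. The sketch assumes the flags appear as value-less keys in the standard semicolon-separated VCF INFO column, and that a user wants sites with LLD but without NI5 or DUP. That selection rule, the function names, and the example INFO strings are illustrative assumptions, not a recommendation from the slides.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; value-less flags map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def is_high_confidence(info):
    """Assumed rule: keep sites flagged LLD and not flagged NI5 or DUP."""
    fields = parse_info(info)
    return "LLD" in fields and "NI5" not in fields and "DUP" not in fields


# Made-up INFO strings for illustration only.
examples = [
    "DP=120;NZ=80;MAF=0.12;LLD",
    "DP=50;NZ=40;LLD;NI5",
    "DP=50;NZ=40;IBD1",
]
kept = [info for info in examples if is_high_confidence(info)]
```

Because the filter only reads the INFO column, it can be applied line by line while streaming a large VCF file without loading genotypes into memory.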

Page 21:

Overlap between various classes of HapMap 3.1.1 polymorphic sites.

37 million high-confidence markers (LLD)

Page 22:

Distribution of inbreeding coefficient for variant sets obtained with two read mapping quality thresholds (q=30 and q=1).

Page 23:

Accuracy of various genotype classes based on statistics from imputation in HapMap 3.2.1

Genotype class            Accuracy within class [%]   % unimputed
Major allele homozygote   99.8                        1.2
Heterozygote              11.1                        47.0
Minor allele homozygote   94.4                        14.2
Indel                     92.2                        17.3

Page 24:

HapMap 3 Discovery Pipeline

Page 25:

HapMap 3 Production Pipeline

Sequencing -> Alignment with BWA -> Genotyping on 82M sites

Page 26:

Loci not included in HapMap 4.0

[Figure: B73 versus Mo17] Genes not in synteny (6.7%)

[Figure: B73 versus Mo17] Duplicated genes, one copy not in synteny (10.6%)

Page 27:

Loci not included in HapMap 4.0

[Figure: B73 versus Mo17] Genes not in synteny (6.7%)

[Figure: B73 versus Mo17] Duplicated genes, one copy not in synteny (10.6%)

These loci are removed in HapMap 4.0 because, in a linear reference genome, they would cause genotyping and imputation errors.

Page 28:

HapMap 4.0: Focus on genomes that can be aligned to B73

[Figure: alignment of the CML247, W22, and PH207 genomes to B73]

Page 29:

Construction of genome alignment and evaluation

HSP = high-scoring segment pair

Chain: a sequence of HSPs

[Figure: B73 versus W22 alignment; histograms of chain length for W22 and Mo17]

Page 30:

Evaluation of genome alignment

Sample   Total HSPs   N50 of chain   Percent of genes covered 90% in the alignment
Mo17     921 Mbp      50,445 bp      75%
W22      981 Mbp      112 Mbp        74%
PH207    907 Mbp      101 Mbp        66%
CML247   936 Mbp      1.2 Mbp        73%
TIL11    351 Mbp      10,084 bp      63%
FV2      499 Mbp      10,765 bp      49%
B47      406 Mbp      6,267 bp       49%
TIL01    321 Mbp      8,333 bp       39%
B97      320 Mbp      7,428 bp       21%
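For readers unfamiliar with the "N50 of chain" column, the statistic can be computed as in this short sketch: sort the chain lengths from longest to shortest and report the length at which the running total first reaches half of the summed length. The function name and the example lengths are illustrative and are not taken from the alignments above.

```python
def n50(lengths):
    """Return the N50 of a collection of chain (or contig) lengths.

    N50 is the length L such that chains of length >= L together
    contain at least half of the total aligned bases.
    """
    if not lengths:
        raise ValueError("n50() needs at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Made-up chain lengths in bp, for illustration only.
example_chains = [2, 3, 4, 5, 6, 7, 8, 9, 10]
example_n50 = n50(example_chains)
```

A large N50, as reported for W22 and PH207, means most aligned bases sit in long collinear chains, while an N50 of a few kilobases indicates a fragmented alignment.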

Page 31:

De novo assemblies

Genome          Size     Technologies          Lab
B73             2.1 Gb   Sanger, PacBio (v4)   Ware
W22             2.4 Gb   NRGene                Brutnell, Buckler, et al.
Mo17            2.2 Gb   Illumina              Ware
PH207           2.4 Gb   DISCOVAR              Buckler
CML247          2.3 Gb   NRGene                Buckler
TIL01           1.4 Gb   MaSuRCA               Hufford
TIL11           1.0 Gb   MaSuRCA               Hufford
diploperennis   1.8 Gb   Dovetail & DISCOVAR   Ross-Ibarra

Page 32:

Alignment of 7 maize/teosinte genomes to B73

Assembly   Total length (bp)   Total aligned   L50 of aligned block   Fraction of gene space covered by alignments
W22        2,105,225,086       981 Mbp         112 Mbp                0.75
Mo17       2,110,724,827       921 Mbp         50,455 bp              0.75
PH207      2,137,607,690       907 Mbp         101 Mbp                0.66
CML247     2,011,717,085       855 Mbp         197,800 bp             0.65
FV2        1,025,206,968       499 Mbp         10,765 bp              0.49
TIL11      1,025,206,968       157 Mbp         4,319 bp               0.26
TIL01      1,398,228,815       284 Mbp         2,694 bp               0.23

Thanks to HapMap 4 Consortium for providing assemblies!

Page 33:

Reduced mismatch rate with pan-genome as reference

[Figure: mismatch fraction on mapped reads]

Page 34:

Comparison of HapMap3 and HapMap4 (Chr10)

[Figure: Hmp32 versus HMP4, PV 0.01; counts as extracted: 2669312 27918623103528]

[Figure: inbreeding coefficient (per taxon)]

Fewer heterozygous calls with pan-genome as reference

Page 35:

Research collaboration

BioHPC Lab: A Cloud Service for Bioinformatics

Training: Workshops & Office hours

Cornell Bioinformatics Facility

Page 36:

Acknowledgement

Maize Rare Allele Project: Ed Buckler (USDA), Jeff Ross-Ibarra (UC Davis), Doreen Ware (USDA), Qi Sun (Cornell), John Doebley, Sherry Flint Garcia, Jim Holland, Sharon Mitchell (Cornell), Theresa Fulton, Cinta Romay (Cornell)

HapMap3: Jinsheng Lai (CAU), Yunbi Xu (CIMMYT/CAAS)

HapMap4: Thomas Brutnell (Danforth), Johann Joets (INRA)

Hapmap team: Robert Bukowski, Cheng Zou (CAAS), Qi Sun