Building a platinum human genome assembly from single haplotype human genomes Karyn Meltz Steinberg PacBio UGM December, 2015 @KMS_Meltzy

Page 1

Building a platinum human genome assembly from single haplotype human genomes

Karyn Meltz Steinberg

PacBio UGM December, 2015

@KMS_Meltzy

Page 2

Single haplotype from hydatidiform mole

• Enucleated egg (no maternal DNA)
• Paternal DNA doubles
• Tumor-like growth
• ONLY paternal DNA present

Page 3

Last year…

Steinberg et al, 2014

Page 4

This year…

[Figure: chart of Contig Number and Contig N50 for CHM13Draft, CHM1PB_2, CHM1PB_1, CHM1_1.1, HuRef, ALLPATHS and YH_2.0; linear axis from 0 to 30,000,000]

Page 5

This year…

[Figure: the same chart of Contig Number and Contig N50 for CHM13Draft, CHM1PB_2, CHM1PB_1, CHM1_1.1, HuRef, ALLPATHS and YH_2.0; log-scale axis from 1 to 100,000,000]

Page 6

We combine PacBio with other technologies to construct the assembly

Page 7

How do we define platinum and gold standards?

                                               GRCh38     Platinum (CHM1)   Gold (NA19240)
% Reference genome covered                     100        98.40             90.80
% Assigned chromosomes                         99.60      98.40             90.80
% gene models covered (>95% id, >90% length)   99.96      98.78             94.26
Contig N50                                     67.8 Mb    26.9 Mb           6.0 Mb
Number of gaps                                 875        3,640             3,568
Total Assembled size                           3.067 Gb   2.996 Gb          2.745 Gb
% haplotype blocks (>1kb) resolved             NA         >95               >80

http://genome.wustl.edu/projects/detail/reference-genomes-improvement/

Page 8

CHM13 Draft Assembly (GCA_000983455.1)

• 60X PacBio (P5 and P6 chemistry)

• Average read length ~11kb

• Daligner/Falcon v 0.2

Total sequence length 2,851,367,788

Number of contigs 2,873

Contig N50 12,981,785

Contig L50 68
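Note on the two contiguity statistics above: N50 is the length of the contig at which the running total of contig lengths, taken longest first, reaches half of the assembly size, and L50 is the number of contigs needed to get there. For the CHM13 draft that reads as: the 68 longest contigs, each at least 12,981,785 bp, hold at least half of the 2,851,367,788 bp. The sketch below is only an illustration of that definition; the function name `n50_l50` and the toy lengths are assumptions made for this example, not the CHM13 contig set or the code used to produce the slide.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for an iterable of contig lengths in bp."""
    ordered = sorted(contig_lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # N50 is the length of the contig that brings the running sum to
        # at least half of the assembly size; L50 is how many contigs
        # (longest first) it took to get there.
        if running >= half_total:
            return length, count


# Toy lengths (hypothetical, NOT the CHM13 contigs); they sum to 300.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)  # 70 2
```

With the toy input, the two longest contigs (80 + 70) already reach half of 300, so N50 is 70 and L50 is 2.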

Page 9

Short read sequence analysis

• 100X Illumina sequence
• Align with BWA-MEM to ordered and oriented assembly
• Variant calling via SpeedSeq (Chiang et al, 2015)
  • SNVs, indels: FreeBayes
  • SVs: LUMPY, SVTyper
  • CNV: CNVnator

Page 10

CHM13 Illumina data aligned to CHM13 assembly

202,016 SNVs/indels on unplaced scaffolds

SV_TYPES       >10kb   5-10kb   1-5kb   <1kb
DELETIONS      174     131      430     2582
INVERSIONS     5       0        2       7
DUPLICATIONS   151     112      309     113
TOTAL          330     243      741     2702
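The TOTAL row of the SV table above can be checked by summing each size bin over the three SV types. The following is a minimal sketch of that arithmetic; the dictionary layout and the helper name `column_totals` are assumptions made for this example, and the numbers are copied from the slide.

```python
# SV counts per type and size bin, copied from the slide's table.
sv_counts = {
    "DELETIONS":    {">10kb": 174, "5-10kb": 131, "1-5kb": 430, "<1kb": 2582},
    "INVERSIONS":   {">10kb": 5,   "5-10kb": 0,   "1-5kb": 2,   "<1kb": 7},
    "DUPLICATIONS": {">10kb": 151, "5-10kb": 112, "1-5kb": 309, "<1kb": 113},
}
# The TOTAL row as reported on the slide.
reported_totals = {">10kb": 330, "5-10kb": 243, "1-5kb": 741, "<1kb": 2702}


def column_totals(table):
    """Sum every size bin across all SV types."""
    totals = {}
    for per_bin in table.values():
        for size_bin, count in per_bin.items():
            totals[size_bin] = totals.get(size_bin, 0) + count
    return totals


computed = column_totals(sv_counts)
print(computed == reported_totals)  # True
print(sum(computed.values()))       # 4016 SV calls across all size bins
```

The computed column sums agree with the reported TOTAL row, and most calls fall in the <1kb bin.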

Page 11

BioNano can be used to size gaps and identify structural variants

[Figure: BioNano map aligned to the PacBio assembly, with regions labeled "Collapse", "Expansion in Assembly" and "Gap in Sequence"]

BioNano alignment to CHM13

SV_TYPES
DELETIONS    41
INVERSIONS   10
INSERTIONS   15
TOTAL        66

Page 12

BioNano reveals collapse in PacBio assembly

[Figure: PacBio assembly aligned against the BioNano map]

Page 13

Illumina data aligned to PacBio assembly also shows collapse

Page 14

BioNano reveals collapse in PacBio assembly due to highly homologous segmental duplications

SD = 96%

CHR1 46746040 46857004 40 W LBHZ01000938.1 110965

CHR1 46857005 47034202 41 N 177198 gap

CHR1 47034203 52157695 42 W LBHZ01000245.1 5123493

[Figure: PacBio assembly aligned against the BioNano map]
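The three CHR1 rows on this slide look like abbreviated AGP-style records: two sequence components (type W) flanking a sized gap (type N). A quick way to read them is to confirm that each row's coordinate span equals its stated length and that the rows tile the chromosome without overlap. The sketch below does that under one stated assumption: the last number on a W row is taken to be the component length, because the slide does not show the full AGP columns. The helper names `parse_row` and `find_problems` are made up for this example.

```python
# Rows copied from the slide (abbreviated AGP-style records).
ROWS = """\
CHR1 46746040 46857004 40 W LBHZ01000938.1 110965
CHR1 46857005 47034202 41 N 177198 gap
CHR1 47034203 52157695 42 W LBHZ01000245.1 5123493
"""


def parse_row(line):
    """Parse one abbreviated row as printed on the slide."""
    fields = line.split()
    row = {
        "object": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "part": int(fields[3]),
        "type": fields[4],
    }
    if row["type"] == "W":    # sequence component (a PacBio contig)
        row["component"] = fields[5]
        row["length"] = int(fields[6])  # ASSUMED to be the component length
    elif row["type"] == "N":  # gap of stated size
        row["component"] = None
        row["length"] = int(fields[5])
    else:
        raise ValueError(f"unexpected component type: {row['type']}")
    return row


def find_problems(rows):
    """Report rows whose span and length disagree, or that do not abut."""
    problems = []
    previous = None
    for current in rows:
        # Coordinates are treated as 1-based and inclusive.
        span = current["end"] - current["start"] + 1
        if span != current["length"]:
            problems.append(f"part {current['part']}: span {span} != length {current['length']}")
        if previous is not None and current["start"] != previous["end"] + 1:
            problems.append(f"part {current['part']}: not adjacent to part {previous['part']}")
        previous = current
    return problems


rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]
print(find_problems(rows))  # []
gap = next(r for r in rows if r["type"] == "N")
print(gap["length"])        # 177198
```

Under that assumption the rows are internally consistent: the 177,198 bp gap sits exactly between the two contigs.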

Page 15

This region is rich in medically relevant genes

This locus has an assigned GRC issue due to unresolved variation and may be a candidate locus for alternative representation in the reference

Page 16

CHM13 Hybrid Scaffold

[Figure: hybrid scaffold shown with the PacBio contigs and BioNano contigs that build it]

Page 17

CHM13 Hybrid Scaffolds Improve Contiguity

                       BioNano Map   PacBio Assembly   Hybrid Scaffold
# of Contigs           3593          1590 *            254
Min Contig Length      0.08 Mb       0                 0.27 Mb
Median Contig Length   0.61 Mb       0.06 Mb           4.35 Mb
Mean Contig Length     0.78 Mb       1.78 Mb           9.68 Mb
Contig N50             1.02 Mb       12.98 Mb          20.79 Mb
Max Contig Length      5.27 Mb       63.15 Mb          82.83 Mb
Total Contig Length    2812 Mb       2824 Mb           2457.75 Mb

*Number of contigs used in hybrid scaffolding
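A quick consistency check on the table above: the number of contigs times the mean contig length should land close to the reported total length, and the N50 ratio gives the contiguity gain from hybrid scaffolding. This is a rough sketch using only the rounded figures on the slide; the 1% tolerance and the helper name `relative_gap` are assumptions chosen for this example.

```python
# (contig count, mean contig length in Mb, reported total length in Mb),
# copied from the slide's table.
assemblies = {
    "BioNano Map": (3593, 0.78, 2812.0),
    "PacBio Assembly": (1590, 1.78, 2824.0),
    "Hybrid Scaffold": (254, 9.68, 2457.75),
}


def relative_gap(count, mean_mb, total_mb):
    """Relative difference between count * mean and the reported total."""
    return abs(count * mean_mb - total_mb) / total_mb


for name, (count, mean_mb, total_mb) in assemblies.items():
    gap = relative_gap(count, mean_mb, total_mb)
    status = "consistent" if gap < 0.01 else "check"  # 1% tolerance (assumed)
    print(f"{name}: {count * mean_mb:.1f} Mb vs {total_mb} Mb ({status})")

# Contig N50 of the hybrid scaffold relative to the PacBio assembly.
n50_gain = 20.79 / 12.98
print(f"{n50_gain:.2f}x")  # 1.60x
```

All three columns agree to within rounding, and the hybrid scaffold N50 is about 1.6 times the PacBio contig N50.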

Page 18

Reference based Analyses

• 100X Illumina sequence from CHM13
• Align to GRCh37 and GRCh38 with BWA-MEM
• Variant calling via SpeedSeq (Chiang et al, 2015)
  • SNVs, indels: FreeBayes
  • SVs: LUMPY, SVTyper
  • CNV: CNVnator

Page 19

Similar number of variants per chromosome

[Figure: variants per chromosome, panels labeled GRCh37.p15 and GRCh38.p2]

Page 20

Similar annotation of variants

[Figure: variant annotation, panels labeled GRCh37.p15 and GRCh38.p2]

Page 21

Page 22

[Figure: panels labeled GRCh37.p15 and GRCh38.p2]

Page 23

SRGAP2 region resolved in GRCh38

Patch alignment to chromosome 1

1q32 1q21 1p21

Page 24

[Figure: panels labeled GRCh37.p15 and GRCh38.p2]

Page 25

PRIM2 region resolved in GRCh38

Page 26

tl;dpa*

• The reference genome assembly is constantly being improved
• New PacBio-based assemblies are orders of magnitude more contiguous than previous WGS assemblies
• Integration of other data (e.g. BioNano, Dovetail) can improve contiguity even further and be used to identify structurally variant haplotypes that can be added to reference as alternative loci
• Platinum genome sequences integrated into GRCh38 have greatly improved read mapping and variant calling

*too long; didn’t pay attention

Page 27

Acknowledgements

The McDonnell Genome Institute at Washington University in St. Louis
Rick Wilson
Bob Fulton
Wes Warren
Tina Graves-Lindsay
Vince Magrini
Sean McGrath
Derek Albracht
Milinn Kremitzki
Susan Rock
Debbie Scheer
Aye Wollam
The Finishing and Bioinformatics Teams at The Genome Institute

University of Washington
Evan Eichler
John Huddleston
Archana Raja

NCBI
Valerie Schneider

University of Pittsburgh School of Medicine (CHM cell lines)
Urvashi Surti

Personalis
Deanna Church

BioNano Genomics
Palak Sheth

Pacific Biosciences
Jason Chin
Nick Sisneros
