Building a platinum human genome
assembly from single haplotype
human genomes
Karyn Meltz Steinberg
PacBio UGM December, 2015
@KMS_Meltzy
Single haplotype from hydatidiform mole
Paternal DNA doubles
Tumor like growth
ONLY paternal DNA present
Enucleated egg (no maternal DNA)
Last year…
Steinberg et al, 2014
This year…
[Figure: contig number and contig N50 on a linear axis (0 to 30,000,000) for CHM13Draft, CHM1PB_2, CHM1PB_1, CHM1_1.1, HuRef, ALLPATHS and YH_2.0]
This year…
[Figure: contig number and contig N50 on a log-scale axis (1 to 100,000,000) for CHM13Draft, CHM1PB_2, CHM1PB_1, CHM1_1.1, HuRef, ALLPATHS and YH_2.0]
We combine PacBio with other technologies to construct
the assembly
How do we define platinum and gold standards?
Metric                                        GRCh38     Platinum (CHM1)   Gold (NA19240)
% Reference genome covered                    100        98.40             90.80
% Assigned chromosomes                        99.60      98.40             90.80
% gene models covered (>95% id, >90% length)  99.96      98.78             94.26
Contig N50                                    67.8 Mb    26.9 Mb           6.0 Mb
Number of gaps                                875        3,640             3,568
Total assembled size                          3.067 Gb   2.996 Gb          2.745 Gb
% haplotype blocks (>1kb) resolved            NA         >95               >80
http://genome.wustl.edu/projects/detail/reference-genomes-improvement/
CHM13 Draft Assembly (GCA_000983455.1)
• 60X PacBio (P5 and P6 chemistry)
• Average read length ~11kb
• Daligner/Falcon v 0.2
Total sequence length 2,851,367,788
Number of contigs 2,873
Contig N50 12,981,785
Contig L50 68
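For readers unfamiliar with the two contiguity statistics quoted above, here is a minimal sketch of how contig N50 and L50 are conventionally computed from a list of contig lengths. The function name and the toy lengths are illustrative assumptions, not the pipeline or data used for the CHM13 draft.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total of
    lengths, taken from longest to shortest, first reaches half of
    the total assembly size. L50 is how many contigs that took.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy lengths (hypothetical, not CHM13 contigs).
toy_lengths = [50, 40, 30, 20, 10]
print(n50_l50(toy_lengths))  # (40, 2)
```

Read this way, the reported figures say that the 68 longest contigs (L50), each at least 12,981,785 bp long (N50), together cover at least half of the 2,851,367,788 bp of assembled sequence.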
Short read sequence analysis
• 100X Illumina sequence
• Align with BWA-MEM to ordered and
oriented assembly
• Variant calling via SpeedSeq (Chiang et al,
2015)
• SNVs, indels: FreeBayes
• SVs: LUMPY, SVTyper
• CNV: CNVnator
CHM13 Illumina data aligned to CHM13 assembly
202,016 SNVs/indels on unplaced scaffolds
SV_TYPES       >10kb   5-10kb   1-5kb   <1kb
DELETIONS      174     131      430     2582
INVERSIONS     5       0        2       7
DUPLICATIONS   151     112      309     113
TOTAL          330     243      741     2702
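A table like the one above can be produced by binning each structural variant call by type and length. The sketch below is one way to do that. The type labels, the toy calls, and how the bin edges are handled (for example, whether exactly 10,000 bp counts as 5-10kb) are assumptions, since the slide does not specify them.

```python
from collections import Counter


def size_bin(length_bp):
    """Assign an SV length in bp to one of the four size classes.

    Boundary handling is an assumption: 5-10kb is taken to include
    both 5,000 bp and 10,000 bp.
    """
    if length_bp > 10_000:
        return ">10kb"
    if length_bp >= 5_000:
        return "5-10kb"
    if length_bp >= 1_000:
        return "1-5kb"
    return "<1kb"


def tabulate(calls):
    """Count calls per (sv_type, size class).

    `calls` is an iterable of (sv_type, length_bp) pairs. Lengths are
    made positive first, because deletion lengths are often reported
    as negative numbers.
    """
    table = Counter()
    for sv_type, length_bp in calls:
        table[(sv_type, size_bin(abs(length_bp)))] += 1
    return table


# Hypothetical calls for illustration only.
toy_calls = [("DEL", -250), ("DEL", 12000), ("DUP", 5000),
             ("INV", 1500), ("DEL", 999)]
table = tabulate(toy_calls)
for key in sorted(table):
    print(key, table[key])
```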
BioNano can be used to size gaps and identify
structural variants
[Figure: BioNano map aligned to the PacBio assembly. Labels: Collapse, Expansion in Assembly, Gap in Sequence]
SV_TYPES
DELETIONS 41
INVERSIONS 10
INSERTIONS 15
TOTAL 66
BioNano alignment to CHM13
BioNano reveals collapse in PacBio assembly
PacBio Assembly
BioNano Map
Illumina data aligned to PacBio assembly also shows
collapse
BioNano reveals collapse in PacBio assembly due to
highly homologous segmental duplications
SD = 96%
CHR1 46746040 46857004 40 W LBHZ01000938.1 110965
CHR1 46857005 47034202 41 N 177198 gap
CHR1 47034203 52157695 42 W LBHZ01000245.1 5123493
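The three rows above look like abbreviated AGP-style component rows: object, start, end, part number, a type code (W for sequence, N for gap), and then either a contig accession plus length or a gap length. Under that assumed column layout, a short sketch can parse them and pull out the gap. The class and function names are hypothetical.

```python
from typing import NamedTuple


class Row(NamedTuple):
    obj: str        # scaffold or chromosome name
    start: int      # 1-based start on the object
    end: int        # 1-based inclusive end on the object
    part: int       # part number within the object
    kind: str       # "W" = sequence contig, "N" = gap
    component: str  # contig accession, or "gap"
    length: int     # span length in bp


def parse_row(line):
    """Parse one abbreviated row as printed on the slide (assumed layout)."""
    f = line.split()
    obj, start, end, part, kind = f[0], int(f[1]), int(f[2]), int(f[3]), f[4]
    if kind == "N":
        component, length = "gap", int(f[5])
    else:
        component, length = f[5], int(f[6])
    # Sanity check: the stated length should match the coordinates.
    if end - start + 1 != length:
        raise ValueError(f"length mismatch in row: {line!r}")
    return Row(obj, start, end, part, kind, component, length)


SLIDE_ROWS = """\
CHR1 46746040 46857004 40 W LBHZ01000938.1 110965
CHR1 46857005 47034202 41 N 177198 gap
CHR1 47034203 52157695 42 W LBHZ01000245.1 5123493
"""

rows = [parse_row(line) for line in SLIDE_ROWS.splitlines()]
gaps = [r for r in rows if r.kind == "N"]
for g in gaps:
    print(g.obj, g.start, g.end, g.length)
```

On these rows the coordinates and stated lengths agree, and the only gap is the 177,198 bp span between contigs LBHZ01000938.1 and LBHZ01000245.1.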
PacBio Assembly
BioNano Map
This region is rich in medically relevant genes
This locus has an assigned GRC issue due to unresolved variation and may be
a candidate locus for alternative representation in the reference
CHM13 Hybrid Scaffold
Hybrid Scaffold
PacBio Contigs
BioNano Contigs
CHM13 Hybrid Scaffolds Improve Contiguity
                      BioNano Map   PacBio Assembly   Hybrid Scaffold
# of Contigs          3593          1590*             254
Min Contig Length     0.08 Mb       0                 0.27 Mb
Median Contig Length  0.61 Mb       0.06 Mb           4.35 Mb
Mean Contig Length    0.78 Mb       1.78 Mb           9.68 Mb
Contig N50            1.02 Mb       12.98 Mb          20.79 Mb
Max Contig Length     5.27 Mb       63.15 Mb          82.83 Mb
Total Contig Length   2812 Mb       2824 Mb           2457.75 Mb
*Number of contigs used in hybrid scaffolding
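As a quick consistency check on the hybrid scaffold table, the mean contig length in each column should equal total contig length divided by number of contigs. The sketch below redoes that arithmetic with the values from the slide. The dictionary layout and function name are illustrative assumptions.

```python
# (number of contigs, total length in Mb, reported mean in Mb), from the slide.
columns = {
    "BioNano Map": (3593, 2812.0, 0.78),
    "PacBio Assembly": (1590, 2824.0, 1.78),
    "Hybrid Scaffold": (254, 2457.75, 9.68),
}


def mean_length_mb(n_contigs, total_mb):
    """Mean contig length in Mb: total length divided by contig count."""
    return total_mb / n_contigs


for name, (n_contigs, total_mb, reported_mb) in columns.items():
    computed = mean_length_mb(n_contigs, total_mb)
    print(f"{name}: computed {computed:.2f} Mb, reported {reported_mb} Mb")
```

The computed means agree with the reported ones to two decimal places. For the PacBio column this holds with the 1590 contigs used in hybrid scaffolding.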
Reference based Analyses
• 100X Illumina sequence from CHM13
• Align to GRCh37 and GRCh38 with BWA-MEM
• Variant calling via SpeedSeq (Chiang et al,
2015)
• SNVs, indels: FreeBayes
• SVs: LUMPY, SVTyper
• CNV: CNVnator
Similar number of variants per chromosome
[Figure: variants per chromosome, GRCh37.p15 vs. GRCh38.p2]
Similar annotation of variants
[Figure: variant annotation, GRCh37.p15 vs. GRCh38.p2]
GRCh37.p15
GRCh38.p2
SRGAP2 region resolved in GRCh38
Patch alignment to chromosome 1
1q32 1q21 1p21
GRCh37.p15
GRCh38.p2
PRIM2 region resolved in GRCh38
tl;dpa*
• The reference genome assembly is constantly being
improved
• New PacBio-based assemblies are orders of magnitude
more contiguous than previous WGS assemblies
• Integration of other data (e.g. BioNano, Dovetail) can
improve contiguity even further and can identify
structurally variant haplotypes to add to the
reference as alternative loci
• Platinum genome sequences integrated into GRCh38
have greatly improved read mapping and variant calling
*too long; didn’t pay attention
Acknowledgements
The McDonnell Genome Institute at
Washington University in St. Louis
Rick Wilson
Bob Fulton
Wes Warren
Tina Graves-Lindsay
Vince Magrini
Sean McGrath
Derek Albracht
Milinn Kremitzki
Susan Rock
Debbie Scheer
Aye Wollam
The Finishing and Bioinformatics
Teams at The Genome Institute
University of Washington
Evan Eichler
John Huddleston
Archana Raja
NCBI
Valerie Schneider
University of Pittsburgh
School of Medicine (CHM cell lines)
Urvashi Surti
Personalis
Deanna Church
BioNano Genomics
Palak Sheth
Pacific Biosciences
Jason Chin
Nick Sisneros