

For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. Femto Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners.

Structural variant detection in crops using low-fold coverage long-read sequencing

Michelle Vierra and Aaron Wenger

PacBio, 1305 O’Brien Drive, Menlo Park, CA 94025

Introduction

Insertions, deletions, duplications, translocations, inversions, and tandem repeat expansions in the structural variant (SV) size range (≥50 bp) contribute to the evolution of traits and often have significant associations with agronomically important phenotypes. However, most SVs are too small to detect with array comparative genomic hybridization and too large to reliably discover with short-read DNA sequencing. While de novo assembly is the most comprehensive way to identify variants in a genome, recent studies in human genomes show that PacBio SMRT Sequencing sensitively detects structural variants at low coverage.

Human SV Benchmark

The Genome in a Bottle Consortium (ref. 2) has developed a benchmark set of insertion and deletion SVs in a human male, HG002/NA24385. When technologies are compared against this benchmark, PacBio has the highest precision and recall across the structural variant size range, particularly for insertions.

Figure 6. SVs detected by assembly and pbsv. (A) Insertion and deletion detected at 10- and 20-fold coverage. (B) Deletion detected at 20- but not 10-fold coverage.

References

1. Chaisson MJ, et al. (2019). Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun. doi:10.1038/s41467-018-08148-z.

2. Zook JM, et al. (2016). Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 3:160025.

3. Zhang J, et al. (2016). Extensive sequence divergence between the reference genomes of two elite indica rice varieties Zhenshan 97 and Minghui 63. PNAS. 113(35):E5163-71.

4. Kawahara Y, et al. (2013). Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical mapping. Rice. 6(1):4.

Conclusions

- PacBio sequencing has high precision and recall for SVs in plant genomes.
- SV calling is effective at lower coverage than is de novo assembly.
- The workflow to detect SVs is simple and efficient.

[Figure 1A labels: variant classes SNVs, indels, SVs; technologies short-read sequencing and PacBio SMRT sequencing; values 3 Mb, 5 Mb, 10 Mb.]

Figure 1. Variation in a typical germline human genome (ref. 1). (A) Most of the base pairs that differ between two human genomes are in indels (1-49 base pairs) and in structural variants (SVs; differences ≥50 base pairs). Short-read sequencing has limited sensitivity for indels and SVs, while PacBio long-read sequencing comprehensively detects variants of all sizes. (B) Precision (blue) and recall (orange) for SVs in a human genome (HG002) against fold coverage. Recall remains high for ≥10-fold coverage.

Technology        Precision   Recall
PacBio            96.13%      95.99%
Oxford Nanopore   83.23%      87.46%
Illumina          85.35%      55.88%
10X Genomics      83.79%      39.83%
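For readers less familiar with the two metrics in the table: precision is the fraction of calls that match the benchmark, and recall is the fraction of benchmark variants that are called. The sketch below shows the arithmetic in Python. The function name and the counts are hypothetical, since the poster does not report the underlying true-positive, false-positive, or false-negative counts.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) as fractions between 0 and 1."""
    called = true_pos + false_pos          # every variant the caller reported
    in_benchmark = true_pos + false_neg    # every variant in the benchmark
    precision = true_pos / called if called else 0.0
    recall = true_pos / in_benchmark if in_benchmark else 0.0
    return precision, recall


# Hypothetical counts for illustration only; not taken from the GIAB benchmark.
precision, recall = precision_recall(true_pos=90, false_pos=10, false_neg=30)
print(f"precision={precision:.2%} recall={recall:.2%}")  # precision=90.00% recall=75.00%
```

A caller can have high precision and low recall at the same time, which is the pattern the table shows for the short-read and linked-read call sets.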

Figure 2. Variant calling performance against the GIAB HG002 v0.6 benchmark. Histograms indicate the number of variants and lines show the precision (blue) and recall (orange) at each variant size for call sets from different technologies.

[Figure 2 panels: Illumina (Manta), 10X Genomics (LongRanger), PacBio (pbsv), Oxford Nanopore (pbsv); x-axis: structural variant length (bp), with deletions and insertions shown separately.]

[Figure 1B plot, Human SV Benchmark: precision and recall, value (%) from 0 to 100, against fold coverage from 0 to 30.]

Methods for SV Detection

[Figure 3 workflow: Sequel System (sequence), then minimap2 (map reads), then pbsv (call variants). pbsv steps: find SV signatures, cluster SV signatures, filter by support, summarize consensus, genotype. Example labels: Cluster; 280 bp Deletion; Variant Call.]

Figure 3. Workflow to detect structural variants from PacBio long reads. To call structural variants, pbsv identifies signatures of structural variation in alignments, clusters nearby signatures with similar length and sequence, summarizes each cluster into a consensus call, and assigns a genotype based on read support.
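The steps in Figure 3 can be illustrated with a toy Python sketch. This is not pbsv's implementation: the `Signature` type, the `call_svs` function, the greedy clustering, the median consensus, and every threshold (`max_dist`, `min_len_ratio`, `min_support`, the 0.8 genotype cutoff) are assumptions chosen only to make the idea concrete, and sequence similarity is ignored.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Signature:
    """One SV signature observed in a single read alignment (hypothetical type)."""
    chrom: str
    pos: int       # reference start position
    svtype: str    # "DEL" or "INS"
    length: int    # SV length in bp, assumed > 0


def similar(a, b, max_dist, min_len_ratio):
    """Same chromosome and type, nearby positions, and similar lengths."""
    if (a.chrom, a.svtype) != (b.chrom, b.svtype):
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    return min(a.length, b.length) / max(a.length, b.length) >= min_len_ratio


def call_svs(signatures, coverage, max_dist=100, min_len_ratio=0.8, min_support=2):
    """Cluster signatures, filter by support, summarize, and assign a crude genotype."""
    clusters = []
    for sig in sorted(signatures, key=lambda s: (s.chrom, s.svtype, s.pos)):
        for cluster in clusters:
            if similar(cluster[-1], sig, max_dist, min_len_ratio):
                cluster.append(sig)
                break
        else:
            clusters.append([sig])

    calls = []
    for cluster in clusters:
        support = len(cluster)
        if support < min_support:      # filter by read support
            continue
        fraction = support / coverage  # share of reads carrying the variant
        calls.append({
            "chrom": cluster[0].chrom,
            "pos": int(median(s.pos for s in cluster)),        # consensus position
            "svtype": cluster[0].svtype,
            "length": int(median(s.length for s in cluster)),  # consensus length
            "support": support,
            "genotype": "1/1" if fraction >= 0.8 else "0/1",   # assumed cutoff
        })
    return calls


# Hypothetical signatures: three reads agree on a deletion, one read shows an insertion.
reads = [
    Signature("chr1", 1000, "DEL", 280),
    Signature("chr1", 1003, "DEL", 278),
    Signature("chr1", 998, "DEL", 283),
    Signature("chr1", 5000, "INS", 60),
]
calls = call_svs(reads, coverage=3)
print(calls)
```

With these inputs the three deletion signatures collapse into one 280 bp deletion call at position 1000, and the single-read insertion is dropped by the support filter. Using one `coverage` value for every site is a simplification; a real caller would count the reads spanning each locus.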

SV Calling in MH63 with Assembly

Comparison of a de novo assembly of the Oryza sativa indica cultivar MH63 (ref. 3) to the International Rice Genome Sequencing Project reference assembly of the Oryza sativa japonica cultivar Nipponbare (ref. 4) provides a baseline of structural variants against which to evaluate variant calling with pbsv. The MH63 de novo assembly used 110-fold coverage of PacBio reads.

[Figure 4 panels: MH63 assembly against Nipponbare reference; plots of structural variant length (bp) for deletions and insertions, in two ranges: SVs 50 bp to 1 kb and SVs 1 kb to 10 kb.]

Figure 4. Structural variants in MH63 from assembly-to-assembly comparison. Structural variants were detected by aligning the MH63 assembly to the Nipponbare reference with minimap2 and calling variants with paftools. The assembly-based callset includes 3,739 deletions and 4,328 insertions between 50 bp and 10 kb.
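To show how insertions and deletions in the SV size range can be read off an assembly-to-reference alignment, here is a minimal Python sketch that walks a CIGAR string and reports gaps of at least 50 bp. It is an illustration only and does not reproduce paftools. The function name `svs_from_cigar`, the 0-based coordinate convention, and the example CIGAR string are assumptions.

```python
import re

# One CIGAR operation: a count followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def svs_from_cigar(chrom, ref_start, cigar, min_len=50):
    """Yield (chrom, ref_pos, svtype, length) for insertions and deletions >= min_len.

    ref_start is the 0-based reference coordinate where the alignment begins.
    """
    ref_pos = ref_start
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op == "D":
            if n >= min_len:
                yield (chrom, ref_pos, "DEL", n)
            ref_pos += n                 # deletions consume reference bases
        elif op == "I":
            if n >= min_len:
                yield (chrom, ref_pos, "INS", n)
            # insertions consume no reference bases
        elif op in "M=XN":
            ref_pos += n                 # matches, mismatches, and skips consume reference
        # S, H, and P consume no reference bases


# Hypothetical alignment: a 280 bp deletion, a 12 bp indel (too small), a 75 bp insertion.
sv_calls = list(svs_from_cigar("Chr1", 1000, "500M280D300M12I200M75I100M"))
print(sv_calls)  # [('Chr1', 1500, 'DEL', 280), ('Chr1', 2280, 'INS', 75)]
```

The 12 bp insertion is skipped because it falls below the 50 bp threshold that separates indels from SVs.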

Cultivar                           Accession                     Length   N50
Oryza sativa japonica Nipponbare   GCA_001433935.1 (release 7)   374 Mb   7.7 Mb
Oryza sativa indica MH63           GCA_001623345.2 (MH63)        387 Mb   25.6 Mb

Thank you to Dave Scherer, Pamela Bentley Mills, and Kristin Robertshaw for help with poster generation.

SV Calling in MH63 with pbsv

MH63 was re-sequenced to 30-fold PacBio read coverage and downsampled to generate lower coverage. Structural variants were called with pbsv and compared to the calls generated from de novo assembly.

[Figure 5 plot: recall of SVs in assembly (%), from 0 to 100, against MH63 fold coverage, from 0 to 30; series: recall of deletions, recall of insertions.]

Figure 5. Recall of pbsv at various coverage levels. Recall remains high for coverage ≥10-fold, with the primary limit of sensitivity being large insertions.
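One way to compute the per-type recall plotted in Figure 5 is to count how many assembly-based SVs have a matching call in the read-based call set. The Python sketch below does this under stated assumptions: the poster does not describe its matching rule, so the position window (`max_dist`), the length-ratio threshold (`min_len_ratio`), the function names, and the example variants are all hypothetical.

```python
def matches(call, truth, max_dist=500, min_len_ratio=0.7):
    """True if two SVs share chromosome and type, lie close together, and have similar length.

    Each SV is a tuple (chrom, pos, svtype, length).
    """
    c_chrom, c_pos, c_type, c_len = call
    t_chrom, t_pos, t_type, t_len = truth
    if (c_chrom, c_type) != (t_chrom, t_type):
        return False
    if abs(c_pos - t_pos) > max_dist:
        return False
    return min(c_len, t_len) / max(c_len, t_len) >= min_len_ratio


def recall(truth_set, call_set):
    """Fraction of truth SVs that have at least one matching call."""
    if not truth_set:
        return 0.0
    found = sum(1 for t in truth_set if any(matches(c, t) for c in call_set))
    return found / len(truth_set)


def recall_by_type(truth_set, call_set):
    """Recall computed separately for deletions and insertions."""
    return {
        svtype: recall([t for t in truth_set if t[2] == svtype], call_set)
        for svtype in ("DEL", "INS")
    }


# Hypothetical call sets: the large insertion is found by assembly but not by read-based calling.
assembly_svs = [("Chr1", 1500, "DEL", 280), ("Chr1", 2280, "INS", 75), ("Chr2", 9000, "INS", 4000)]
read_calls = [("Chr1", 1498, "DEL", 279), ("Chr1", 2285, "INS", 74)]
per_type = recall_by_type(assembly_svs, read_calls)
print(per_type)  # {'DEL': 1.0, 'INS': 0.5}
```

Running the same comparison on call sets from each downsampled coverage level would give one point per coverage level for each series in the plot.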
