Discovery of Structural Variation with Next-Generation Sequencing

Alexandre Gillet-Markowska

Gilles Fischer Team – Biology of Genomes

UMR7238 Laboratory of Computational and Quantitative Biology

Université Pierre et Marie-Curie, Paris

Post on 14-Dec-2015

Outline

(i) Structural variations (SV)

(ii) SV detection technologies

(iii) Read pairs: 2 types of Illumina genomic DNA libraries

(iv) SV detection using read pairs

(v) Polymorphic SV

Structural Variations (SV)

(1) Yes, the minimal size is arbitrary…

[Figure: each SV type drawn as reference (ref) versus rearranged (SV) chromosomes, split into intrachromosomal SV and interchromosomal SV. Balanced SV: INVERSION (INV), RECIPROCAL TRANSLOCATION (RT). Unbalanced SV (CNV): INSERTION (INS), DELETION (DEL), TANDEM DUPLICATION (DUP).]

Balanced SV versus Unbalanced SV

Pictures adapted from Feuk et al., 2006, Nature Reviews; Calvin Blackman Bridges, Science

Why discover SV?

SV are involved in more than 30 diseases (psoriasis, Crohn's disease, ASD…)

Chromosomal instability is detected in the vast majority of cancers

SV are a powerful mechanism of adaptation and evolution

SV detection technologies

Timeline of technologies used to discover SV (structural variations) since 1936

Comparative cytogenetics

1936: Calvin Blackman Bridges, Science

1959: Lejeune, Study of somatic chromosomes from 9 mongoloid children, Hebd Seances Acad Sci

1986: Smith et al, Interstitial deletion of (17)(p11.2p11.2) in nine patients, Am J Med Genet

Microarrays

2004: Iafrate, Detection of large-scale variation in the human genome, Nature

2004: Sebat, Large-scale copy number polymorphism in the human genome, Science

200 and 221 CNV

2006: Redon, Global variation in copy number in the human genome, Nature

360 Mb CNVR (12% of the human genome)

NGS

2007: Korbel et al, Paired-end mapping reveals extensive structural variation in the human genome, Science

1 000 SV

2010: 1000 HGP, A map of human genome variation from population-scale sequencing, Nature

20 000 SV

‘Range of usability’ of technologies

Size limit

SV type limit

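SV detection with NGS data

[Table: four NGS-based detection approaches compared. The approach names were shown graphically and are not recoverable here, so the columns are numbered in slide order.]

Criterion          Approach 1      Approach 2     Approach 3   Approach 4
Breakpoints res.   >100 bp         1 bp           1-10 bp      1 bp
SV size range      > insert size   1 bp–50 kbp    >10 bp       >1 bp
CNV                Yes             Yes            Yes          Yes
Balanced SV        Yes             Yes            No           Yes
FDR                Variable        >10%           High?        low
Missing rate       Variable        >25%           High?        High?

Quinlan & Hall 2011 Trends in Genetics

LI 2011 Nature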

How to detect SV with NGS data?

Read pairs: 2 types of Illumina genomic DNA libraries

1) Illumina Paired-End (PE)

2) Illumina Mate-Pair (MP)

Illumina Paired-End vs Mate-Pair

MP allows a better genome assembly than PE

MP allows the detection of SV that involve repeated elements

[Figure: insert-size distribution of 100,000 read pairs; x-axis: insert size (bp); annotation: 5,000 (or much less…)]
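The insert-size distribution of a library can be summarised before any SV calling. The following is a minimal sketch, assuming the insert sizes have already been extracted from the alignments as a list of integers. The function name `insert_size_summary` is hypothetical, and the choice of median and median absolute deviation (rather than mean and standard deviation) is an assumption made here because a few very long or very short pairs would otherwise distort the summary.

```python
import statistics

def insert_size_summary(insert_sizes):
    """Return (median, median absolute deviation) of a list of insert sizes in bp.

    Robust statistics are used (an assumption of this sketch) so that a
    handful of aberrant pairs does not shift the estimate of the library size.
    """
    median = statistics.median(insert_sizes)
    mad = statistics.median(abs(size - median) for size in insert_sizes)
    return median, mad

# Toy example: five pairs near 400 bp and one aberrant 5,000 bp pair.
print(insert_size_summary([380, 400, 420, 390, 410, 5000]))
```

The two values can then serve as the centre and spread of the expected insert size when deciding whether a pair is unusually long or short.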

SV detection with Read pairs

1) trim the data

2) align the data to the reference genome

3) remove PCR duplicates

4) SV calling

Trim the data

First criterion: Chargaff's rule (%A = %T and %G = %C on both DNA strands)
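One way to apply this first criterion is to look at base composition at each read position and flag the positions where %A and %T, or %G and %C, diverge. The sketch below assumes reads are plain strings. The function names and the 5-point tolerance are illustrative assumptions, not values from the slides.

```python
from collections import Counter

def base_composition_by_cycle(reads):
    """Percent of A, C, G and T at each read position (N and other symbols ignored)."""
    max_len = max((len(read) for read in reads), default=0)
    composition = []
    for pos in range(max_len):
        counts = Counter(read[pos] for read in reads
                         if len(read) > pos and read[pos] in "ACGT")
        total = sum(counts.values())
        composition.append({base: (100.0 * counts[base] / total if total else 0.0)
                            for base in "ACGT"})
    return composition

def chargaff_violations(composition, tolerance=5.0):
    """Positions where |%A - %T| or |%G - %C| exceeds the tolerance (hypothetical cutoff)."""
    return [pos for pos, pct in enumerate(composition)
            if abs(pct["A"] - pct["T"]) > tolerance
            or abs(pct["G"] - pct["C"]) > tolerance]

reads = ["AACG", "ATGC", "ATCG", "ATGC"]
print(chargaff_violations(base_composition_by_cycle(reads)))
```

Positions reported at the start or end of the reads are candidates for trimming. On a real library the comparison only makes sense with many reads per position.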

Trim the data

Second criterion: nucleotide quality

Trimming tools

Bcbio-nextgen, Btrim, CANGS, Chipster, Clean reads, ConDeTri, Ea-utils, Fastx, Flexbar, PRINSEQ, Reaper, SeqTrim, Skewer, SolexaQA, TagCleaner, Trimmomatic
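The second criterion, nucleotide quality, can be illustrated with the simplest possible strategy: removing low-quality bases from the 3' end of a read. This sketch is not the algorithm of any tool listed above. It assumes Phred+33 quality encoding, and the cutoff of 20 is an arbitrary illustrative value.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]

def trim_3prime(sequence, quality_string, min_quality=20, offset=33):
    """Drop trailing bases until the last remaining base reaches min_quality."""
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]

# 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0 in Phred+33.
print(trim_3prime("ACGTACGT", "IIIII#!#"))
```

Dedicated trimmers usually offer more elaborate strategies (sliding windows, adapter removal), so this only shows the principle.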

Align the data to the reference genome

Remove PCR duplicates

PCR duplicates annotation tools

samtools rmdup (only intra-molecular duplicates)

markduplicates.jar (picard tools)

FastUniq

…
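The idea behind duplicate removal can be sketched without any of these tools: read pairs whose two ends map to exactly the same coordinates and strands are treated as copies of one original fragment, and only one is kept. The `ReadPair` record and its `score` field are hypothetical simplifications introduced for this example. Real tools work on BAM records and differ in their details.

```python
from typing import NamedTuple

class ReadPair(NamedTuple):
    name: str
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str
    score: int  # e.g. sum of base qualities (assumption of this sketch)

def remove_duplicates(pairs):
    """Keep one pair per set of mapping coordinates, preferring the highest score."""
    best = {}
    for pair in pairs:
        end1 = (pair.chrom1, pair.pos1, pair.strand1)
        end2 = (pair.chrom2, pair.pos2, pair.strand2)
        key = tuple(sorted([end1, end2]))  # same key whichever read comes first
        if key not in best or pair.score > best[key].score:
            best[key] = pair
    return list(best.values())

pairs = [
    ReadPair("a", "chrI", 100, "+", "chrI", 500, "-", 30),
    ReadPair("b", "chrI", 100, "+", "chrI", 500, "-", 40),   # duplicate of "a"
    ReadPair("c", "chrI", 100, "+", "chrII", 900, "-", 10),  # different mate position
]
print([pair.name for pair in remove_duplicates(pairs)])
```

Because the key includes both chromosomes, pairs whose ends map to different chromosomes are handled in the same way as the others.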

SV signatures

SV have nearly identical signatures with MP and PE

Gillet-Markowska, 2014, Bioinformatics
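The signatures themselves were shown as figures. As a rough illustration only, the sketch below uses the commonly described read-pair signatures for a paired-end library in which concordant pairs point towards each other (left read on "+", right read on "-"). The function name, the labels, the default insert size of 400 bp with a standard deviation of 40 bp, and the 3-standard-deviation cutoff are all assumptions of this example. The distance between the two reported positions is taken as the insert size, which is a simplification.

```python
def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  mean_insert=400, sd_insert=40, n_sd=3):
    """Assign a read pair to a candidate signature (simplified paired-end rules)."""
    if chrom1 != chrom2:
        return "INTERCHROMOSOMAL"
    # Order the two reads along the chromosome.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)])
    if left_strand == right_strand:
        return "INV"   # both reads on the same strand
    if left_strand == "-" and right_strand == "+":
        return "DUP"   # reads point away from each other
    span = right_pos - left_pos
    if span > mean_insert + n_sd * sd_insert:
        return "DEL"   # reads map further apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "INS"   # reads map closer than expected
    return "CONCORDANT"

print(classify_pair("chrI", 1000, "+", "chrI", 5000, "-"))
```

A single discordant pair is only a hint. Callers generally require several pairs with the same signature at the same location before reporting an SV. For mate-pair libraries the expected orientation of concordant pairs depends on how the reads were processed, so the strand rules above would need to be adapted.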

Inter-tool variability is immense

Adapted from ICGC-TCGA challenge

SV examples

Korbel et al, Science 2007

SV in the Human genome

Not-so-identical monozygotic twins

Bruder, C. E. G. et al. Phenotypically concordant and discordant monozygotic twins display different DNA copy-number-variation profiles. Am. J. Hum. Genet. 82, 763–771 (2008)

Butterfly mimicry

Livestock phenotypes caused by CNV

Polymorphic SV

Individual (germ line): SV in 100% of cells of each individual

Tissue (somatic): SV in one tissue / in a few cells

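[Figure: experimental design with 80 successive bottlenecks (Bottleneck 1 to Bottleneck 80), one every 30 generations from generation 0 to generation 2400; between bottlenecks the number of cells grows from 1 (then 2, 4, …) to 10^9]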

Sequencing a single culture

Can we detect de novo SV occurring in a single cell culture by high-throughput sequencing?

DNA extraction, sequencing

(n=80) DNA extraction, sequencing

The physical coverage (theoretically) sets the detection threshold
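The link between physical coverage and detection threshold can be made concrete with a small calculation. If an SV is present in a fraction f of the cells and the physical coverage is C, about C × f read pairs are expected to span its breakpoint. The sketch below additionally assumes that the number of spanning pairs follows a Poisson distribution and that a caller needs at least `min_supporting_pairs` of them. Both are modelling assumptions of this example, not statements from the slides.

```python
import math

def expected_supporting_pairs(physical_coverage, sv_frequency):
    """Expected number of read pairs spanning a breakpoint carried by a fraction of cells."""
    return physical_coverage * sv_frequency

def detection_probability(physical_coverage, sv_frequency, min_supporting_pairs=2):
    """Poisson probability (assumed model) of observing at least min_supporting_pairs."""
    lam = expected_supporting_pairs(physical_coverage, sv_frequency)
    p_fewer = sum(math.exp(-lam) * lam ** i / math.factorial(i)
                  for i in range(min_supporting_pairs))
    return 1.0 - p_fewer

# An SV carried by 1 cell in 1,000, at two different physical coverages.
for coverage in (6000, 700):
    print(coverage, round(detection_probability(coverage, 0.001), 3))
```

Under these assumptions, the higher the physical coverage, the lower the frequency at which an SV can still be expected to leave enough supporting pairs.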

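[Figure: a single S. cerevisiae culture followed over 30 generations, from 1 cell at generation 0 to 10^9 cells at generation 30 (intermediate labels include 8.10^3 cells at generation 13 and 1.6.10^4 cells at generation 14); coverage labels: 6,000X and 700X]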

Sequencing with high physical coverage

Paired-End sequencing: insert size ~ 400 bp

[Figure: read pairs from 10 cells (Cell 1 to Cell 10) aligned to the reference, with coverage (sequence) and coverage (physical) tracks]

covseq = 0.5X

covphys = 0.85X

covSV = 0

Mate Pair sequencing: insert size ~ 1 to 20 kb

[Figure: mate pairs from 10 cells (Cell 1 to Cell 10) aligned to the reference, with coverage (sequence) and coverage (physical) tracks; a discordant paired sequence is highlighted]

covseq = 0.5X

covphys = 5X

covSV = 1

Mate Pair sequencing increases the sensitivity of SV detection
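The gain can be illustrated by computing both coverages for the same sequencing effort. In this sketch, sequence coverage counts the sequenced bases (two reads per pair) and physical coverage counts the whole fragment spanned by each pair. That definition of physical coverage is an assumption, and the read length, pair count and genome size below are hypothetical round numbers rather than values from the slides.

```python
def sequence_coverage(n_pairs, read_length, genome_size):
    """Average number of times a base is covered by a sequenced read."""
    return 2 * n_pairs * read_length / genome_size

def physical_coverage(n_pairs, insert_size, genome_size):
    """Average number of fragments (read pairs plus unsequenced insert) spanning a base."""
    return n_pairs * insert_size / genome_size

GENOME_SIZE = 12_000_000  # hypothetical genome of 12 Mb
N_PAIRS = 30_000          # hypothetical number of read pairs
READ_LENGTH = 100         # hypothetical read length in bp

print("covseq:", sequence_coverage(N_PAIRS, READ_LENGTH, GENOME_SIZE))
print("covphys, 400 bp inserts:", physical_coverage(N_PAIRS, 400, GENOME_SIZE))
print("covphys, 2 kb inserts:", physical_coverage(N_PAIRS, 2000, GENOME_SIZE))
```

With the number of reads held constant, the sequence coverage stays the same while the physical coverage scales with the insert size, which is why longer mate-pair inserts give more pairs spanning any given breakpoint.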

Illumina Paired-End
