Next-Generation Sequencing: Challenges and Opportunities Ion Mandoiu Computer Science and Engineering Department University of Connecticut

Post on 21-Dec-2015


Page 1: Next-Generation Sequencing: Challenges and Opportunities Ion Mandoiu Computer Science and Engineering Department University of Connecticut

Next-Generation Sequencing: Challenges and Opportunities

Ion Mandoiu

Computer Science and Engineering Department

University of Connecticut

Page 2

Outline

• Background on high-throughput sequencing
• Identification of tumor-specific epitopes
• Estimation of gene and isoform expression levels
• Viral quasispecies reconstruction
• Future work

Page 3

http://www.economist.com/node/16349358

Advances in High-Throughput Sequencing (HTS)

Roche/454 FLX Titanium: 400-600 million bases/run, 400bp avg. length

Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100bp read length

SOLiD 4: 1.4-2.4 billion PE reads/run, 35-50bp read length

Page 4

Illumina Workflow – Library Preparation

Genomic DNA mRNA

Page 5

Illumina Workflow – Cluster Generation

Page 6

Illumina Workflow – Sequencing by Synthesis

Page 7

Cost of Whole Genome Sequencing

[Figure: cost (log scale, $100 to $100,000,000) versus sequencing time (days, weeks, months, years). Data points are labelled by platform and coverage, including Illumina@36x and SOLiD@12x.]

Page 8

HTS applications

• HTS is a transformative technology
• Numerous applications besides de novo genome sequencing:
– RNA-Seq
– Non-coding RNAs
– ChIP-Seq
– Epigenetics
– Structural variation
– Metagenomics
– Paleogenomics
– …

Page 9

Outline

• Background on high-throughput sequencing
• Identification of tumor-specific epitopes
• Estimation of gene and isoform expression levels
• Viral quasispecies reconstruction
• Future work

Page 10

Genomics-Guided Cancer Immunotherapy

[Figure: workflow from tumor mRNA sequencing (example reads shown as nucleotide strings) to tumor-specific epitopes (example peptides SYFPEITHI, ISETDLSLL, CALRRNESL), then peptide synthesis, immune system stimulation, and tumor remission.]

Mouse Image Source: http://www.clker.com/clipart-simple-cartoon-mouse-2.html

Page 11

Bioinformatics Pipeline

[Figure: pipeline flowchart. Stages: CCDS Mapping, Genome Mapping, Read Merging, SNV Detection, Haplotyping, Epitope Prediction, Primer Design. Data passed between stages: tumor mRNA reads, CCDS mapped reads, genome mapped reads, mapped reads, tumor-specific SNVs, close SNV haplotypes, tumor-specific epitopes, primers for Sanger sequencing.]

Page 12

Bioinformatics Pipeline

[Figure: pipeline flowchart, repeated from Page 11.]

Page 13

Mapping mRNA Reads

http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png

Page 14

Read Merging

Genome      CCDS        Agree?  Hard Merge  Soft Merge
Unique      Unique      Yes     Keep        Keep
Unique      Unique      No      Throw       Throw
Unique      Multiple    No      Throw       Keep
Unique      Not mapped  No      Keep        Keep
Multiple    Unique      No      Throw       Keep
Multiple    Multiple    No      Throw       Throw
Multiple    Not mapped  No      Throw       Throw
Not mapped  Unique      No      Keep        Keep
Not mapped  Multiple    No      Throw       Throw
Not mapped  Not mapped  Yes     Throw       Throw
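The merge table can be encoded directly as a lookup. The sketch below is only an illustration of the rules listed on this slide; the constant names and the `keep_read` function are hypothetical and are not taken from the authors' software.

```python
UNIQUE, MULTIPLE, UNMAPPED = "unique", "multiple", "not mapped"

# (genome status, CCDS status, alignments agree?) -> (hard merge keeps, soft merge keeps)
RULES = {
    (UNIQUE,   UNIQUE,   True):  (True,  True),
    (UNIQUE,   UNIQUE,   False): (False, False),
    (UNIQUE,   MULTIPLE, False): (False, True),
    (UNIQUE,   UNMAPPED, False): (True,  True),
    (MULTIPLE, UNIQUE,   False): (False, True),
    (MULTIPLE, MULTIPLE, False): (False, False),
    (MULTIPLE, UNMAPPED, False): (False, False),
    (UNMAPPED, UNIQUE,   False): (True,  True),
    (UNMAPPED, MULTIPLE, False): (False, False),
    (UNMAPPED, UNMAPPED, True):  (False, False),
}

def keep_read(genome, ccds, agree, mode="hard"):
    """Return True if the read is kept under the given merge mode."""
    hard, soft = RULES[(genome, ccds, agree)]
    return hard if mode == "hard" else soft

# A read mapped uniquely to the genome but ambiguously to CCDS:
print(keep_read(UNIQUE, MULTIPLE, False, mode="hard"))
print(keep_read(UNIQUE, MULTIPLE, False, mode="soft"))
```

Soft merge differs from hard merge only in the two rows where one mapping is unique and the other is ambiguous.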

Page 15

SNV Detection and Genotyping

[Figure: reads aligned to a reference sequence; locus i and the read set R_i are marked.]

• r(i): base call of read r at locus i
• ε_r(i): probability of error in base call r(i)
• G_i: genotype at locus i

Page 16

SNV Detection and Genotyping

• Use Bayes rule to calculate posterior probabilities and pick the genotype with the largest one
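Written in the notation of Page 15, and assuming R_i denotes the set of reads covering locus i, Bayes' rule takes the standard form below. The prior P(G_i) is left unspecified because the transcript does not say which prior is used.

```latex
P(G_i \mid R_i) = \frac{P(G_i)\, P(R_i \mid G_i)}{\sum_{G} P(G)\, P(R_i \mid G)},
\qquad
\hat{G}_i = \arg\max_{G} P(G \mid R_i)
```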

Page 17

SNV Detection and Genotyping

• Calculate conditional probabilities by multiplying contributions of individual reads
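A minimal Python sketch of this calculation follows. It rests on assumptions that are not stated on the slides: genotypes are unordered pairs of alleles, each read samples either allele with probability 1/2, an erroneous call is equally likely to be any of the three other bases, and the prior is uniform. The function name and input format are hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod

BASES = "ACGT"

def genotype_posteriors(calls, prior=None):
    """calls: list of (base, error_prob) pairs, one per read covering the locus."""
    genotypes = list(combinations_with_replacement(BASES, 2))
    if prior is None:
        # Assumption: uniform prior over the 10 unordered diploid genotypes.
        prior = {g: 1.0 / len(genotypes) for g in genotypes}

    def base_prob(base, err, allele):
        # P(observed base | true allele), errors spread evenly over 3 bases.
        return 1.0 - err if base == allele else err / 3.0

    joint = {}
    for g in genotypes:
        # Product over reads of the per-read contribution.
        likelihood = prod(
            0.5 * base_prob(b, e, g[0]) + 0.5 * base_prob(b, e, g[1])
            for b, e in calls
        )
        joint[g] = prior[g] * likelihood
    total = sum(joint.values())
    return {g: p / total for g, p in joint.items()}

# Five reads calling A and four calling G, each with 1% error probability.
calls = [("A", 0.01)] * 5 + [("G", 0.01)] * 4
post = genotype_posteriors(calls)
best = max(post, key=post.get)
print(best)
```

With this balanced evidence the heterozygous genotype receives the largest posterior, which is the genotype the previous slide says to pick.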

Page 18

Data Filtering

[Figure: % of mismatches (0% to 45%) by read position (1 to 33) for four read sets: Transcripts, Genome, Hard Merge, Soft Merge.]

Page 19

Accuracy per RPKM bins

[Figure: stacked bars (0% to 100%) comparing SOAPsnp, Maq, and SNVQ in six bins: RPKM < 1, 1 < RPKM < 5, 5 < RPKM < 10, 10 < RPKM < 50, 50 < RPKM < 100, RPKM > 100. Categories: TP HomoVar, TP Hetero, FP, FN HomoVar, FN Hetero.]

Page 20

Bioinformatics Pipeline

[Figure: pipeline flowchart, repeated from Page 11.]

Page 21

Haplotyping

• Human somatic cells are diploid, containing two sets of nearly identical chromosomes, one set derived from each parent.

ACGTTACATTGCCACTCAATC--TGGA
ACGTCACATTG-CACTCGATCGCTGGA

Heterozygous variants

Page 22

Haplotyping

Locus  Event      Alleles
1      SNV        C,T
2      Deletion   C,-
3      SNV        A,G
4      Insertion  -,GC

Locus  Event      Alleles Hap 1  Alleles Hap 2
1      SNV        T              C
2      Deletion   C              -
3      SNV        A              G
4      Insertion  -              GC

Page 23

RefHap Algorithm

• Reduce the problem to Max-Cut
• Solve Max-Cut
• Build haplotypes according to the cut

Locus  1  2  3  4  5
f1     -  0  1  1  0
f2     1  1  0  -  1
f3     1  -  -  0  -
f4     -  0  0  -  1

[Figure: graph on fragments 1-4 with edge weights (1,2) = 3, (1,3) = 1, (1,4) = 1, (2,3) = -1, (2,4) = -1.]

h1 00110
h2 11001
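The three steps can be traced on the slide's fragment matrix with a small Python sketch. Two choices here are assumptions for illustration rather than the published RefHap procedure: the edge weight is taken to be mismatches minus matches over loci covered by both fragments, and Max-Cut is solved by exhaustive search, which is feasible only for a handful of fragments. All function names are hypothetical.

```python
from itertools import combinations

fragments = ["-0110", "110-1", "1--0-", "-00-1"]  # f1..f4 from the slide

def weight(a, b):
    """Mismatches minus matches over loci covered by both fragments."""
    score = 0
    for x, y in zip(a, b):
        if x != "-" and y != "-":
            score += 1 if x != y else -1
    return score

def max_cut(frags):
    """Exhaustive Max-Cut; fragment 0 is pinned to side 0 to break symmetry."""
    n = len(frags)
    best_value, best_side = None, None
    for mask in range(2 ** (n - 1)):
        side = [0] + [(mask >> k) & 1 for k in range(n - 1)]
        value = sum(weight(frags[i], frags[j])
                    for i, j in combinations(range(n), 2)
                    if side[i] != side[j])
        if best_value is None or value > best_value:
            best_value, best_side = value, side
    return best_value, best_side

def build_haplotypes(frags, side):
    """Majority vote per locus; fragments on side 1 vote with flipped alleles."""
    h1 = []
    for locus in range(len(frags[0])):
        votes = 0
        for frag, s in zip(frags, side):
            if frag[locus] != "-":
                allele = int(frag[locus]) ^ s
                votes += 1 if allele else -1
        h1.append("1" if votes > 0 else "0")  # ties default to 0
    h1 = "".join(h1)
    h2 = "".join("1" if c == "0" else "0" for c in h1)
    return h1, h2

value, side = max_cut(fragments)
h1, h2 = build_haplotypes(fragments, side)
print(value, side, h1, h2)
```

On this input the best cut separates f1 from f2, f3, and f4, and the majority vote yields the same h1 and h2 as the slide.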

Page 24

Bioinformatics Pipeline

[Figure: pipeline flowchart, repeated from Page 11.]

Page 25

Immunology Background

J.W. Yewdell, E. Reits and J. Neefjes. Making sense of mass destruction: quantitating MHC class I antigen presentation. Nature Reviews Immunology, 3:952-961, 2003

Page 26

Epitope Prediction

C. Lundegaard et al. MHC Class I Epitope Binding Prediction Trained on Small Data Sets. In Lecture Notes in Computer Science, 3239:217-225, 2004

Page 27

Results on Tumor Data

Mouse strain: BALB/C, B10.D2, TRAMP

Tumor         Meth-A  CMS5  prostate1  prostate2  prostate3  prostate4
#lanes        1       3     4          3          3          3
HQ Het SNPs   465     77    86         17         292        193
Dd Weak       119     17    14         12         63         70
Dd Strong     20      2     2          0          7          12
Kd Weak       111     21    10         0          19         54
Kd Strong     3       1     1          0          1          3
Ld Weak       99      12    25         4          47         75
Ld Strong     8       0     0          0          2          9
Total Weak    329     50    49         16         129        199
Total Strong  31      3     3          0          10         24

Page 28

Experimental Validation

• Mutations reported by [Noguchi et al 94] were found by the pipeline

• Sanger sequencing confirmed 18 out of 20 mutations for Meth-A and 26 out of 28 mutations for CMS5

• Immunogenic potential under experimental validation in the Srivastava lab at UCHC

Page 29

Outline

• Background on high-throughput sequencing
• Identification of tumor-specific epitopes
• Estimation of gene and isoform expression levels
• Viral quasispecies reconstruction
• Future work

Page 30

RNA-Seq

A B C D E

Make cDNA & shatter into fragments

Sequence fragment ends

Map reads

Gene Expression (GE)

A B C

A C

D E

Isoform Discovery (ID) Isoform Expression (IE)

Page 31

Alternative Splicing

[Griffith and Marra 07]

Page 32

Challenges to Accurate Estimation of Gene Expression Levels

• Read ambiguity (multireads)

• What is the gene length?

A B C D E

Page 33

Previous approaches to GE

• Ignore multireads
• [Mortazavi et al. 08]
– Fractionally allocate multireads based on unique read estimates
• [Pasaniuc et al. 10]
– EM algorithm for solving ambiguities
• Gene length: sum of lengths of exons that appear in at least one isoform
– Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]

Page 34

Read Ambiguity in IE

A B C D E

A C

Page 35

Previous approaches to IE

• [Jiang&Wong 09]
– Poisson model + importance sampling, single reads
• [Richard et al. 10]
– EM Algorithm based on Poisson model, single reads in exons
• [Li et al. 10]
– EM Algorithm, single reads
• [Feng et al. 10]
– Convex quadratic program, pairs used only for ID
• [Trapnell et al. 10]
– Extends Jiang’s model to paired reads
– Fragment length distribution

Page 36

Our contribution

• Unified probabilistic model and Expectation-Maximization Algorithm for IE considering
– Single and/or paired reads
– Fragment length distribution
– Strand information
– Base quality scores

Page 37

Read-Isoform Compatibility

$$w_{r,i} = \sum_{a} O_a \, Q_a \, F_a$$
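As a small illustration, the compatibility weight can be computed by summing over the alignments a of read r to isoform i. The reading of the three factors is inferred from the list on Page 36 and is an assumption: O_a as a 0/1 strand-consistency indicator, Q_a as a probability derived from base quality scores, and F_a as the probability of the implied fragment length. The dictionary layout and the numbers are hypothetical.

```python
def compatibility_weight(alignments):
    """Sum O_a * Q_a * F_a over the alignments of one read to one isoform."""
    return sum(a["O"] * a["Q"] * a["F"] for a in alignments)

alignments = [
    {"O": 1, "Q": 0.95, "F": 0.02},
    {"O": 0, "Q": 0.90, "F": 0.05},  # wrong orientation, so it contributes nothing
]
w = compatibility_weight(alignments)
print(w)
```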

Page 38

Fragment length distribution

• Paired reads

[Figure: a read pair mapped to isoforms A-B-C and A-C implies fragment lengths i and j, which are scored with the fragment length probabilities Fa(i) and Fa(j).]

Page 39

Fragment length distribution

• Single reads

[Figure: a single read mapped to isoforms A-B-C and A-C, with fragment lengths i and j scored by Fa(i) and Fa(j).]

Page 40

IsoEM algorithm

E-step

M-step
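The update formulas for the two steps did not survive in this transcript. The sketch below is therefore a generic mixture-model EM, not IsoEM's exact updates: it estimates the fraction of reads drawn from each isoform from read-isoform weights, then converts read fractions to isoform frequencies by dividing by isoform length. The function names, the input layout, and the use of plain lengths instead of effective lengths are all assumptions, and all weights are assumed to be positive.

```python
def em_read_fractions(read_weights, iters=200):
    """read_weights: one dict per read mapping isoform -> compatibility weight."""
    isoforms = sorted({i for w in read_weights for i in w})
    theta = {i: 1.0 / len(isoforms) for i in isoforms}
    for _ in range(iters):
        # E-step: fractionally assign each read to its compatible isoforms.
        expected = dict.fromkeys(isoforms, 0.0)
        for w in read_weights:
            z = sum(w_i * theta[i] for i, w_i in w.items())
            for i, w_i in w.items():
                expected[i] += w_i * theta[i] / z
        # M-step: re-estimate read fractions from the expected counts.
        total = sum(expected.values())
        theta = {i: expected[i] / total for i in isoforms}
    return theta

def isoform_frequencies(theta, lengths):
    """Convert read fractions to isoform frequencies (length-normalized)."""
    density = {i: theta[i] / lengths[i] for i in theta}
    total = sum(density.values())
    return {i: d / total for i, d in density.items()}

# Three reads unique to A, one unique to B, four equally compatible with both.
reads = [{"A": 1.0}] * 3 + [{"B": 1.0}] + [{"A": 1.0, "B": 1.0}] * 4
theta = em_read_fractions(reads)
freq = isoform_frequencies(theta, {"A": 1000, "B": 2000})
print(theta, freq)
```

The ambiguous reads end up split in proportion to the evidence from the unique reads, which is the behavior the EM approaches on the earlier slides are meant to provide.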

Page 41

Error Fraction Curves - Isoforms

• 30M single reads of length 25 (simulated)

[Figure: % of isoforms over threshold (0 to 100) versus relative error threshold (0 to 1) for Uniq, Rescue, UniqLN, Cufflinks, RSEM, and IsoEM.]
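One point on such a curve is the percentage of isoforms whose relative error exceeds a given threshold. A minimal sketch, assuming relative error is |estimate - truth| / truth and that all true values are positive (the slides do not say how zero-expression isoforms are handled); the function name and sample numbers are hypothetical:

```python
def error_fraction(true_vals, est_vals, threshold):
    """Percentage of items whose relative error exceeds the threshold."""
    over = sum(
        1 for t, e in zip(true_vals, est_vals)
        if abs(e - t) / t > threshold  # assumes t > 0
    )
    return 100.0 * over / len(true_vals)

true_vals = [10.0, 20.0, 40.0, 100.0]
est_vals = [10.0, 25.0, 20.0, 90.0]
curve = [(th, error_fraction(true_vals, est_vals, th)) for th in (0.0, 0.2, 0.4, 0.6)]
print(curve)
```

A method whose curve lies lower has fewer isoforms with large errors at every threshold.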

Page 42

Error Fraction Curves - Genes

• 30M single reads of length 25 (simulated)

[Figure: % of genes over threshold (0 to 100) versus relative error threshold (0 to 1) for Uniq, Rescue, GeneEM, Cufflinks, RSEM, and IsoEM.]

Page 43

Validation on MAQC Samples

[Figure: R² (0.6 to 0.85) versus million mapped bases for IsoEM and Cufflinks on UHRR libraries 1-6 and HBRR libraries 1-2.]

Page 44

Outline

• Background on high-throughput sequencing
• Identification of tumor-specific epitopes
• Estimation of gene and isoform expression levels
• Viral quasispecies reconstruction
• Future work

Page 45

Viral Quasispecies

RNA viruses (HIV, HCV)

Many replication mistakes

Quasispecies (qsps) = co-existing closely related variants

Variants differ in:
  virulence
  ability to escape the immune system
  resistance to antiviral therapies
  tissue tropism

How do qsps contribute to viral persistence and evolution?

Page 46

454 Pyrosequencing

Pyrosequencing = Sequencing by Synthesis.

GS FLX Titanium:
  Fragments (reads): 300-800 bp
  Sequence of the reads
  System software assembles reads into a single genome

We need software that assembles reads into multiple genomes!

Page 47

Quasispecies Spectrum Reconstruction (QSR) Problem

Given: pyrosequencing reads from a quasispecies population of unknown size and distribution

Reconstruct the quasispecies spectrum:
  sequences
  frequencies

Page 48

ViSpA Viral Spectrum Assembler

Page 49

454 Sequencing Errors

Error rate ~0.1%.

Fixed number of incorporated bases vs. light intensity value.

Incorrect resolution of homopolymers =>
  over-calls (insertions): 65-75% of errors
  under-calls (deletions): 20-30% of errors

Page 50

Preprocessing of Aligned Reads

1. Deletions in reads: D
   Replace each deletion confirmed by only a single read with either the allele value present in all other reads or N.

2. Insertions into reference: I
   Remove insertions confirmed by only a single read.

3. Imputation of missing values N
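Step 1 can be sketched as a column-wise pass over the aligned reads. The representation is an assumption made for illustration: reads are equal-length strings in reference coordinates, with `-` marking a deletion and a space marking positions the read does not cover. The function name is hypothetical and this is not ViSpA's code.

```python
def fix_single_deletions(reads):
    """Replace deletions seen in exactly one read by the consensus allele or N."""
    fixed = [list(r) for r in reads]
    for pos, col in enumerate(zip(*reads)):
        gap_rows = [i for i, c in enumerate(col) if c == "-"]
        if len(gap_rows) != 1:
            continue  # no deletion here, or it is supported by several reads
        others = {c for c in col if c not in "- "}
        # Use the allele shared by all other reads; otherwise mark as unknown.
        fixed[gap_rows[0]][pos] = others.pop() if len(others) == 1 else "N"
    return ["".join(r) for r in fixed]

reads = ["ACGT-CA", "ACGTTCA", "ACGTTCA", "AC-TACA"]
cleaned = fix_single_deletions(reads)
print(cleaned)
```

Positions set to N here are the missing values that step 3 would later impute.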

Page 51

Read Graph: Vertices

Subread = completely contained in some read with ≤ n mismatches.

Superread = not a subread => the vertex in the read graph.

Example reads:
ACTGGTCCCTCCTGAGTGT
GGTCCCTCCT
TGGTCACTCGTGAG
ACCTCATCGAAGCGGCGTCCT

Page 52

Read Graph: Edges

An edge between two vertices exists if the two superreads overlap and agree on their overlap with ≤ m mismatches.

Auxiliary vertices: source and sink
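The vertex and edge definitions can be sketched as follows, assuming each aligned read is a hypothetical `(start, sequence)` pair in reference coordinates. Two simplifications are assumptions of this sketch: a read counts as a subread only of a strictly longer read, and edges are directed from the read that starts earlier. The source and sink vertices are omitted.

```python
def mismatches(a, b):
    """Mismatch count on the overlap of two (start, seq) reads, or None if disjoint."""
    (sa, xa), (sb, xb) = a, b
    lo, hi = max(sa, sb), min(sa + len(xa), sb + len(xb))
    if lo >= hi:
        return None
    return sum(xa[p - sa] != xb[p - sb] for p in range(lo, hi))

def contained_in(r, o):
    """True if read r lies entirely within the span of read o."""
    return o[0] <= r[0] and r[0] + len(r[1]) <= o[0] + len(o[1])

def build_read_graph(reads, n=0, m=0):
    reads = sorted(set(reads))
    # Vertices: superreads, i.e. reads that are not subreads of a longer read.
    supers = [
        r for r in reads
        if not any(len(o[1]) > len(r[1]) and contained_in(r, o)
                   and mismatches(r, o) <= n for o in reads)
    ]
    # Edges: overlapping superreads that agree with at most m mismatches.
    edges = []
    for u in supers:
        for v in supers:
            if u[0] < v[0]:
                d = mismatches(u, v)
                if d is not None and d <= m:
                    edges.append((u, v))
    return supers, edges

reads = [(0, "ACTGGTCC"), (2, "TGGT"), (4, "GTCCCTAA"), (20, "GGGG")]
supers, edges = build_read_graph(reads)
print(supers, edges)
```

In the example, the read starting at position 2 is absorbed as a subread, and the only edge joins the two superreads that overlap on positions 4 to 7.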

Page 53

Read Graph: Edge Cost

The most probable source-sink path through each vertex

Cost: uncertainty that two superreads are from the same qsps.

Overhang Δ is the shift in start positions of two overlapping superreads.

Page 54

Contig Assembling

Max Bandwidth Path through vertex: path minimizing maximum edge cost for the path and each subpath

Consensus of path’s superreads: each position is the >70%-majority or N

Weighted consensus obtained on all reads

Remove duplicates: duplicated sequences = statistical evidence

$$p(s, r) = \binom{l}{k} \left(\frac{t}{L}\right)^{k} \left(1 - \frac{t}{L}\right)^{l-k}$$

where r is a read of length l, s is a qsps of length L, k is the number of mismatches, and t/L is the mutation rate.
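The core of a max bandwidth search, finding a path that minimizes the largest edge cost, can be sketched with a Dijkstra-style search in which path cost is the maximum rather than the sum of edge costs. This is a generic minimax-path routine, not ViSpA's implementation, and it does not enforce the additional condition on every subpath. The graph layout and vertex names are hypothetical.

```python
import heapq

def minimax_path(graph, source, sink):
    """Return (bottleneck, path) minimizing the largest edge cost, or None."""
    best = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        cost, u = heapq.heappop(heap)
        if cost > best.get(u, float("inf")):
            continue  # stale heap entry
        if u == sink:
            break
        for v, w in graph.get(u, {}).items():
            new_cost = max(cost, w)  # path cost is its most expensive edge
            if new_cost < best.get(v, float("inf")):
                best[v] = new_cost
                parent[v] = u
                heapq.heappush(heap, (new_cost, v))
    if sink not in best:
        return None
    path, node = [], sink
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[sink], path[::-1]

graph = {
    "source": {"a": 0.2, "b": 0.5},
    "a": {"c": 0.9, "sink": 0.7},
    "b": {"c": 0.4},
    "c": {"sink": 0.3},
}
result = minimax_path(graph, "source", "sink")
print(result)
```

To obtain a source-sink path through a given vertex v, one option is to join the best path from the source to v with the best path from v to the sink.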

Page 55

Expectation Maximization

Bipartite graph:
  Q_q is a candidate with frequency f_q
  R_r is a read with observed frequency o_r
  Weight h_{q,r} = probability that read r is produced by qsps q with j mismatches

$$h_{q,r} = \binom{l}{j} (1 - \varepsilon)^{l-j} \varepsilon^{j}$$

(l: read length; ε: sequencing error rate)

E step:

$$p_{q,r} = \frac{f_q \, h_{q,r}}{\sum_{q'} f_{q'} \, h_{q',r}}$$

M step:

$$f_q = \frac{\sum_{r} p_{q,r} \, o_r}{\sum_{r} o_r}$$
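The two steps translate into a short iteration. In the sketch below the binomial form of the weight and the meaning of `eps` as a per-base error rate are assumptions, since the formula is only partly legible in this transcript. The function names, the matrix layout `h[q][r]`, and the fixed iteration count are hypothetical.

```python
from math import comb

def h_weight(l, j, eps):
    """Probability of j mismatches in a read of length l at error rate eps."""
    return comb(l, j) * (1 - eps) ** (l - j) * eps ** j

def em_frequencies(h, o, iters=100):
    """h[q][r]: weight of candidate q for read r; o[r]: observed read frequency."""
    nq, nr = len(h), len(o)
    f = [1.0 / nq] * nq
    total_o = sum(o)
    for _ in range(iters):
        # E step: probability that read r was produced by candidate q.
        p = [[0.0] * nr for _ in range(nq)]
        for r in range(nr):
            z = sum(f[q] * h[q][r] for q in range(nq))
            if z > 0:
                for q in range(nq):
                    p[q][r] = f[q] * h[q][r] / z
        # M step: candidate frequency = weighted share of the observed reads.
        f = [sum(p[q][r] * o[r] for r in range(nr)) / total_o for q in range(nq)]
    return f

# Two candidates, two distinct reads of length 10 observed 3 times and once.
h = [
    [h_weight(10, 0, 0.01), h_weight(10, 2, 0.01)],
    [h_weight(10, 2, 0.01), h_weight(10, 0, 0.01)],
]
f = em_frequencies(h, [3, 1])
print(f)
```

If some read has zero weight for every candidate, it is skipped in the E step and the estimated frequencies no longer sum to one.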

Page 56

HCV Qsps (P. Balfe)

30927 reads from 5.2Kb-long region of HCV-1a genomes

Intravenous drug user infected for less than 3 months => mutation rate is in [1.75%, 8%]

27764 reads, average length = 292bp

Indels: ~77% of reads
  Insertion lengths: 1 (86%), 3 (9.8%)
  Deletion lengths: 1 (98%)

N: ~7% of reads

Page 57

HCV Data Statistics

Page 58

NJ Tree for 12 Most Frequent Qsps (No Insertions)

The top sequence: 26.9% (no mismatches) and 50.4% (≤1 mismatch) of the reads.

In sum: 35.6% (no mismatches) and 64.5% (≤1 mismatch) of the reads.

Reconstructed sequence with highest frequency is 99% identical to one of the ORFs obtained by cloning the quasispecies.

Page 59

Conclusions & Future Work

• Implementations of these methods are freely available at http://dna.engr.uconn.edu/software/

• Ongoing work
– Monitoring immune responses by TCR sequencing
– Isoform discovery
– Computational deconvolution of heterogeneous samples
– Reconstruction & frequency estimation of virus quasispecies from Ion Torrent reads

Page 60

Acknowledgments

Immunogenomics
  Jorge Duitama (KU Leuven)
  Pramod K. Srivastava, Adam Adler, Brent Graveley, Duan Fei (UCHC)
  Matt Alessandri and Kelly Gonzalez (Ambry Genetics)

IsoEM
  Marius Nicolae (UConn)
  Alex Zelikovsky, Serghei Mangul (GSU)

ViSpA
  Alex Zelikovsky, Irina Astrovskaya, Bassam Tork, Serghei Mangul (GSU), and Kelly Westbrooks (Life Technologies)
  Peter Balfe (Birmingham University, UK)

Funding
  NSF awards IIS-0546457, IIS-0916948, and DBI-0543365
  UCONN Research Foundation UCIG grant