Advanced Methods for Sequence Analysis Gunnar Rätsch Friedrich Miescher Laboratory, Tübingen Vorlesung WS 2007/2008 Eberhard-Karls-Universität Tübingen 6 February, 2008

Page 1: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Advanced Methods for Sequence Analysis

Gunnar Rätsch

Friedrich Miescher Laboratory, Tübingen

Lecture, winter semester (WS) 2007/2008
Eberhard-Karls-Universität Tübingen

6 February, 2008

http://www.fml.mpg.de/raetsch/lectures/amsa07

Today

Cheng Soon Ong, Petra Philips and Gunnar Rätsch: Kernel Methods for Predictive Sequence Analysis: Introduction, Page 2

Three applications
- Learning parameters for spliced alignments
- Discriminative gene finding
- Analysis of resequencing data

Summary of previous lectures

Max-Margin Structured Output Learning

Learn function $f(y|x)$ scoring segmentations $y$ for $x$

Maximize $f(y|x)$ w.r.t. $y$ for prediction:

$$\operatorname*{argmax}_{y \in \mathcal{Y}^*} f(y|x)$$

Given N sequence pairs $(x_1, y_1), \dots, (x_N, y_N)$ for training

Determine $f$ such that there is a large margin between true and wrong segmentations:

$$\min_{f} \; C \sum_{n=1}^{N} \xi_n + P[f]$$

$$\text{s.t.} \quad f(y_n|x_n) - f(y|x_n) \ge 1 - \xi_n \quad \text{for all } y \in \mathcal{Y}^*,\ y \ne y_n,\ n = 1, \dots, N$$

Exponentially many constraints!

Joint Feature Map

Recall the kernel trick
For each kernel, there exists a corresponding feature mapping $\Phi(x)$ on the inputs such that $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$.

Joint kernel on X and Y
We define a joint feature map on $\mathcal{X} \times \mathcal{Y}$, denoted by $\Phi(x, y)$. Then the corresponding kernel function is

$$k((x, y), (x', y')) := \langle \Phi(x, y), \Phi(x', y') \rangle.$$

For multiclass
For normal multiclass classification, the joint feature map decomposes and the kernel on Y is the identity, that is

$$k((x, y), (x', y')) := [[y = y']] \, k(x, x').$$
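As a minimal sketch of the multiclass case, the joint kernel can be written as a label indicator multiplied by an input kernel. The linear input kernel and both function names are illustrative assumptions, not part of the lecture.

```python
def linear_kernel(x, x2):
    """Plain dot product; a stand-in for any input kernel k(x, x')."""
    return sum(a * b for a, b in zip(x, x2))


def joint_multiclass_kernel(x, y, x2, y2, base_kernel=linear_kernel):
    """k((x, y), (x', y')) = [[y == y']] * k(x, x')."""
    return base_kernel(x, x2) if y == y2 else 0.0


if __name__ == "__main__":
    print(joint_multiclass_kernel([1, 2], 0, [3, 4], 0))  # same label
    print(joint_multiclass_kernel([1, 2], 0, [3, 4], 1))  # different labels
```

Pairs with different labels get kernel value zero, so each class effectively has its own block of weights.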

Algorithm

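1. $\mathcal{Y}_n^1 = \emptyset$, for $n = 1, \dots, N$

2. Solve

$$(w^t, \xi^t) = \operatorname*{argmin}_{w \in \mathcal{F},\, \xi} \; C \sum_{n=1}^{N} \xi_n + \|w\|^2$$

$$\text{s.t.} \quad \langle w, \Phi(x_n, y_n) - \Phi(x_n, y) \rangle \ge 1 - \xi_n \quad \text{for all } y \in \mathcal{Y}_n^t,\ y \ne y_n,\ n = 1, \dots, N$$

3. Find violated constraints ($n = 1, \dots, N$)

$$y_n^t = \operatorname*{argmax}_{y \in \mathcal{Y}^*,\ y \ne y_n} \langle w^t, \Phi(x_n, y) \rangle$$

If $\langle w^t, \Phi(x_n, y_n) - \Phi(x_n, y_n^t) \rangle < 1 - \xi_n^t$, set $\mathcal{Y}_n^{t+1} = \mathcal{Y}_n^t \cup \{y_n^t\}$

4. If violated constraint exists then go to 2

5. Otherwise terminate ⇒ optimal solution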
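The working-set loop above can be sketched on a toy multiclass problem, where the output space is just a small set of labels. Everything named here is an assumption made for illustration: the block-structured `joint_features`, the function names, and in particular the use of plain subgradient descent in place of a proper QP solver for step 2.

```python
import numpy as np


def joint_features(x, y, n_classes):
    """Phi(x, y): copy x into the block that belongs to label y."""
    phi = np.zeros(n_classes * len(x))
    phi[y * len(x):(y + 1) * len(x)] = x
    return phi


def margins(w, x, y_true, labels, n_classes):
    """<w, Phi(x, y_true) - Phi(x, y)> for every y in labels."""
    phi_true = joint_features(x, y_true, n_classes)
    return {y: w @ (phi_true - joint_features(x, y, n_classes)) for y in labels}


def slack(w, x, y_true, working, n_classes):
    """Smallest slack that satisfies all constraints in the working set."""
    m = margins(w, x, y_true, working, n_classes)
    return max([0.0] + [1.0 - v for v in m.values()])


def solve_restricted(X, Y, working, n_classes, C, w, steps=2000, lr=0.01):
    """Approximate step 2 by subgradient descent (stand-in for a QP solver)."""
    for _ in range(steps):
        grad = 2.0 * w
        for x, y_true, ws in zip(X, Y, working):
            m = margins(w, x, y_true, ws, n_classes)
            if m:
                y_worst = min(m, key=m.get)
                if m[y_worst] < 1.0:  # hinge is active for this example
                    grad -= C * (joint_features(x, y_true, n_classes)
                                 - joint_features(x, y_worst, n_classes))
        w = w - lr * grad
    return w


def train(X, Y, n_classes, C=1.0, eps=1e-6):
    w = np.zeros(n_classes * len(X[0]))
    working = [set() for _ in X]                       # step 1
    while True:
        w = solve_restricted(X, Y, working, n_classes, C, w)  # step 2
        added = False
        for x, y_true, ws in zip(X, Y, working):       # step 3
            others = [y for y in range(n_classes) if y != y_true]
            m = margins(w, x, y_true, others, n_classes)
            y_hat = min(m, key=m.get)  # highest-scoring wrong label
            if m[y_hat] < 1.0 - slack(w, x, y_true, ws, n_classes) - eps:
                ws.add(y_hat)
                added = True
        if not added:                                  # steps 4 and 5
            return w, working


def predict(w, x, n_classes):
    scores = [w @ joint_features(x, y, n_classes) for y in range(n_classes)]
    return int(np.argmax(scores))


if __name__ == "__main__":
    w, working = train([[1.0], [-1.0]], [0, 1], n_classes=2)
    print(predict(w, [1.0], 2), predict(w, [-1.0], 2), working)
```

In this sketch a label already in the working set can never be reported as violated again, so every pass either adds a new label or stops.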

PALMA: Perfect Alignments using Large Margins

Gunnar Rätsch (1), Bettina Hepp (2), Uta Schulze (1,3), and Cheng Soon Ong (1,4)

(1) Friedrich Miescher Laboratory, Tübingen
(2) Fraunhofer FIRST, Berlin
(3) University of Leipzig
(4) Max Planck Institute for Biological Cybernetics, Tübingen

http://www.fml.mpg.de/raetsch/projects/palma

Motivation & Background

Abundant experimental data:
- Expressed Sequence Tags (EST)
- Full length mRNAs

Alignment to genomic sequences helps
- Discovery of new genes
- Delineation of exon/intron boundaries
- Identification of alternative splice forms
- Finding SNPs, ...

Problems
- Repetitive elements, paralogs, pseudo-genes
- Sequencing errors, polymorphisms
- Non-canonical splice sites
- Microexons

Previous Work

More than 10 years of research on spliced alignments
- Greedy algorithms (extend seed words or Blast based)
  - Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001]
  - Blat (prefers AG/GT) [Kent, 2002]
- EST Genome (DP based, prefers AG/GS) [Mott, 1997]
- Exalin (DP based, AG/GS only) [Zhang and Gish, 2006]
  - Fixed substitution and gap costs
  - Splice site model (PWMs)
  ⇒ Maximum likelihood combination

Why another tool?
- More accurate splice site models (SVM based)
- Intron length model
- More principled combination (based on large margins)

2-Class Splice Site Detection

True sites: fixed window around a true splice site
Decoy sites: generated by shifting the window

- Very unbalanced problem (1:200)
- Millions of points from EST databases
- Large scale methods necessary (here: Support Vector Machines)
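The construction of true and decoy examples can be sketched as follows. The window sizes, the shift range and the function names are assumptions made for illustration; the slide does not say how far the window is shifted or how decoy positions are chosen.

```python
def extract_window(seq, pos, left, right):
    """Return seq[pos-left : pos+right], or None if the window leaves the sequence."""
    start, end = pos - left, pos + right
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


def make_examples(seq, true_sites, left=3, right=3, max_shift=2):
    """Positives: windows at true sites. Decoys: the same window shifted sideways."""
    positives, decoys = [], []
    true_set = set(true_sites)
    for pos in true_sites:
        window = extract_window(seq, pos, left, right)
        if window is not None:
            positives.append(window)
        for shift in range(-max_shift, max_shift + 1):
            shifted = pos + shift
            if shift == 0 or shifted in true_set:
                continue  # never label a true site as a decoy
            window = extract_window(seq, shifted, left, right)
            if window is not None:
                decoys.append(window)
    return positives, decoys


if __name__ == "__main__":
    print(make_examples("ACGTACGTAC", [5], left=2, right=2, max_shift=1))
```

Each true site yields several decoys, which is one way the strong class imbalance mentioned on the slide arises.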

Results on C. elegans, 1-ROC score [Rätsch and Sonnenburg, 2004]

Method                          Donor    Acceptor
Higher order PWM                1.77%    1.12%
SVM w/ TOP Kernel               1.68%    1.30%
SVM w/ Polynomial               1.59%    1.05%
SVM w/ Locality Improved        1.52%    0.92%
SVM w/ Weighted Degree '04      1.53%    0.95%
SVM w/ Weighted Degree '05      0.38%    0.26%

Weighted Degree kernel is simple and accurate:
- Computes similarity between sequences
- Allows training on 10 million examples [Sonnenburg et al., 2006]
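The slide does not spell out the kernel. A minimal sketch, assuming the commonly cited form of the weighted degree kernel: count substrings of length d that match at the same position in both sequences, weighted by $\beta_d = 2(D - d + 1) / (D(D + 1))$. The function name and the default degree are illustrative.

```python
def weighted_degree_kernel(s, t, degree=3):
    """Sum over d = 1..degree of beta_d * (number of positions where the
    length-d substrings of s and t are identical)."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    total = 0.0
    for d in range(1, degree + 1):
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        matches = sum(1 for i in range(len(s) - d + 1) if s[i:i + d] == t[i:i + d])
        total += beta * matches
    return total


if __name__ == "__main__":
    print(weighted_degree_kernel("ACGT", "ACGA", degree=2))
```

Because only same-position substrings are compared, the kernel suits fixed windows centred on a candidate splice site.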

Alignment Algorithms

Input
Two sequences over the alphabet {A, C, G, T, N}
- EST sequence $S_E$ of length m
- DNA sequence $S_D$ of length n
Substitution matrix $M : \Sigma \times \Sigma \to \mathbb{R}$, where $\Sigma := \{A, C, G, T, N, -\}$

Output
Sequence alignment $A$
- Sequence of pairs, i.e. $A = (a_r, b_r)_{r=1,\dots,R}$ with $a_r, b_r \in \Sigma$
- $R \le m + n$ depends on the alignment

Alignment that maximizes the alignment score

$$s(A) = \sum_{r=1}^{R} M(a_r, b_r)$$
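As a small illustration of the score definition, an alignment can be stored as a list of symbol pairs and scored by summing matrix entries. The concrete match, mismatch and gap values and the helper names are assumptions chosen for the example.

```python
def simple_matrix(match=1.0, mismatch=-1.0, gap=-2.0):
    """Toy substitution matrix over {A, C, G, T, N, -} stored as a dict."""
    alphabet = "ACGTN-"
    M = {}
    for a in alphabet:
        for b in alphabet:
            if a == "-" or b == "-":
                M[(a, b)] = gap
            elif a == b:
                M[(a, b)] = match
            else:
                M[(a, b)] = mismatch
    return M


def alignment_score(alignment, M):
    """s(A) = sum over r of M(a_r, b_r) for A = [(a_1, b_1), ..., (a_R, b_R)]."""
    return sum(M[(a, b)] for a, b in alignment)


if __name__ == "__main__":
    A = [("A", "A"), ("C", "-"), ("G", "G"), ("T", "A")]
    print(alignment_score(A, simple_matrix()))
```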

Maximizing the Alignment Score

Needleman-Wunsch algorithm
- Maximizes alignment score by dynamic programming
- Fills $m \cdot n$ alignment matrix $V$:
  $V(i, 0) := 0$ and $V(0, j) := 0$ for all $i, j$
- Recursion

$$V(i, j) = \max \begin{cases} V(i-1, j-1) + M(S_E(i), S_D(j)) \\ V(i-1, j) + M(S_E(i), \text{'-'}) \\ V(i, j-1) + M(\text{'-'}, S_D(j)) \end{cases}$$

- Time and space complexity: $O(m \cdot n)$

Problems:
- Does not distinguish between gaps and introns
- How to choose M?
- No splice site model!
- Too expensive for alignments against whole genomes
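The recursion can be written down directly. This sketch follows the slide's initialisation (zeros in the first row and column); the toy substitution matrix and the function names are assumptions for illustration.

```python
def simple_matrix(match=1.0, mismatch=-1.0, gap=-2.0):
    """Toy substitution matrix over {A, C, G, T, N, -} stored as a dict."""
    alphabet = "ACGTN-"
    return {
        (a, b): gap if "-" in (a, b) else (match if a == b else mismatch)
        for a in alphabet
        for b in alphabet
    }


def needleman_wunsch(est, dna, M):
    """Fill the (m+1) x (n+1) matrix V with the three-case recursion."""
    m, n = len(est), len(dna)
    V = [[0.0] * (n + 1) for _ in range(m + 1)]  # V(i, 0) = V(0, j) = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + M[(est[i - 1], dna[j - 1])],  # (mis)match
                V[i - 1][j] + M[(est[i - 1], "-")],             # gap in DNA
                V[i][j - 1] + M[("-", dna[j - 1])],             # gap in EST
            )
    return V


if __name__ == "__main__":
    V = needleman_wunsch("ACG", "ACG", simple_matrix())
    print(V[3][3])
```

The two nested loops make the $O(m \cdot n)$ time and space cost visible.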

Needleman-Wunsch algorithm

Needleman-Wunsch with Introns

Recursion with Intron Model

Extended recursion formula (for all $i = 1, \dots, m$, $j = 1, \dots, n$)

$$V(i, j) = \max \begin{cases} V(i-1, j-1) + M(S_E(i), S_D(j)) \\ V(i-1, j) + M(S_E(i), \text{'-'}) \\ V(i, j-1) + M(\text{'-'}, S_D(j)) \\ \max_{1 \le k \le j-1} \left( V(i, k) + f_I(k, j) \right) \end{cases}$$

For intron score $f_I(k, j)$ consider
- Splice site scores $s_k^{Don}$ and $s_k^{Acc}$ (SVM predictions): contribute $f_{Don}(s_k^{Don}) + f_{Acc}(s_k^{Acc})$
- Length of intron: contributes $f_{Len}(j - k)$

Unspecified functions $f_{Don}$, $f_{Acc}$, $f_{Len}$ as well as $M$!

Idea: Learn functions on training set with known alignments
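A sketch of the extended recursion with the additional intron case. The intron score is passed in as a function `intron_score(k, j)`. The constant score used in the example, the toy substitution matrix and the function names are assumptions, and the splice site and length terms are not modelled here.

```python
def simple_matrix(match=1.0, mismatch=-1.0, gap=-2.0):
    """Toy substitution matrix over {A, C, G, T, N, -} stored as a dict."""
    alphabet = "ACGTN-"
    return {
        (a, b): gap if "-" in (a, b) else (match if a == b else mismatch)
        for a in alphabet
        for b in alphabet
    }


def spliced_alignment_matrix(est, dna, M, intron_score):
    """Needleman-Wunsch recursion plus a fourth case that skips an intron."""
    m, n = len(est), len(dna)
    V = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            candidates = [
                V[i - 1][j - 1] + M[(est[i - 1], dna[j - 1])],
                V[i - 1][j] + M[(est[i - 1], "-")],
                V[i][j - 1] + M[("-", dna[j - 1])],
            ]
            # Intron case: DNA positions k+1..j are skipped, EST index unchanged.
            candidates.extend(V[i][k] + intron_score(k, j) for k in range(1, j))
            V[i][j] = max(candidates)
    return V


if __name__ == "__main__":
    est = "ACGT"                 # two short exons: "AC" and "GT"
    dna = "AC" + "TTTTTT" + "GT"  # exons separated by a toy intron
    V = spliced_alignment_matrix(est, dna, simple_matrix(), lambda k, j: -1.0)
    print(V[len(est)][len(dna)])
```

The inner maximisation over k adds a factor of n to the runtime of this naive version.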

Parameterization

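Substitution matrix $M : \Sigma \times \Sigma \to \mathbb{R}$
Functions $f_{Len}$, $f_{Acc}$ and $f_{Don}$

Piecewise linear functions (support points $x_1, \dots, x_s$):

$$f(x) = \begin{cases} \theta_1 & x \le x_1 \\ \dfrac{\theta_i (x_{i+1} - x) + \theta_{i+1} (x - x_i)}{x_{i+1} - x_i} & x_i \le x \le x_{i+1} \\ \theta_s & x \ge x_s \end{cases}$$

$\theta := (\theta_1, \dots, \theta_s)$ parametrizes the function
Let $\theta := (\theta_{Acc}, \theta_{Don}, \theta_{Len}, \theta_M)$

Given $\theta$, alignment score $s_\theta(A)$ is fully defined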
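The piecewise linear function can be evaluated directly from its definition. The function name and the example support points and values are illustrative assumptions.

```python
from bisect import bisect_right


def piecewise_linear(x, support, theta):
    """Evaluate f(x) for sorted support points x_1 < ... < x_s and values theta."""
    if x <= support[0]:
        return theta[0]
    if x >= support[-1]:
        return theta[-1]
    i = bisect_right(support, x) - 1  # support[i] <= x < support[i + 1]
    x_lo, x_hi = support[i], support[i + 1]
    return (theta[i] * (x_hi - x) + theta[i + 1] * (x - x_lo)) / (x_hi - x_lo)


if __name__ == "__main__":
    support = [0.0, 10.0, 20.0]
    theta = [1.0, 3.0, 2.0]
    for x in (-5.0, 5.0, 10.0, 15.0, 25.0):
        print(x, piecewise_linear(x, support, theta))
```

For a fixed x the value is a weighted sum of at most two entries of theta, so f is linear in its parameters.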

Parameters: Optimization

Idea
Find $\theta$ such that for a known alignment $A^+$

$$s_\theta(A^+) \gg s_\theta(A^-)$$

where $A^- \ne A^+$ is any wrong alignment

Given N known alignments $A_i^+$, $i = 1, \dots, N$

Solve quadratic optimization problem (QP)

$$\min_{\xi \ge 0,\, \theta} \; \frac{1}{N} \sum_{i=1}^{N} \xi_i + P(\theta)$$

$$\text{s.t.} \quad s_\theta(A_i^+) - s_\theta(A_i^-) \ge 1 - \xi_i \quad \forall A_i^- \ne A_i^+,\ i = 1, \dots, N$$

$\xi_i$: Slack variables to implement a soft margin
$P(\theta)$: Regularizer leading to smooth functions

Problem: Exponentially many constraints

Iterative Algorithm

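Set $\theta := (\theta_{Acc}, \theta_{Don}, \theta_{Len}, \theta_M)$ randomly, $\mathcal{A}_i^- = \emptyset$

For $t = 1, \dots, T$
  For $i = 1, \dots, N$
    Compute (wrong) alignment $A^-$ based on $\theta$
    If $A^- \ne A_i^+$, then $\mathcal{A}_i^- := \mathcal{A}_i^- \cup \{A^-\}$
  Obtain new parameters $\theta$ by solving the restricted QP

$$\min_{\xi \ge 0,\, \theta} \; \frac{1}{N} \sum_{i=1}^{N} \xi_i + P(\theta)$$

$$\text{s.t.} \quad s_\theta(A_i^+) - s_\theta(A^-) \ge 1 - \xi_i \quad \forall A^- \in \mathcal{A}_i^-,\ i = 1, \dots, N$$

Only need to solve small optimization problems!
Guaranteed convergence!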

Microexon Simulation Study

Artificial Data
- Consider EST confirmed exon triples (C. elegans) (found by blat and strict filtering)
- Shorten middle exon in central region (EST & DNA)
  - Microexon generated
  - Splice sites still intact
- Generate insertions/deletions/mutations in artificial EST (noise level = 0%, 1%, 2%, 10%, 20%, 50%)
- Train PALMA on 4608 exon triples (≈ 1h)
- Test blat, sim4, exalin and PALMA on 4358 triples
- Correct only if all boundaries are correctly predicted
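One way to generate a noisy artificial EST is sketched below. The noise model (each base is edited independently with the given rate, and the three edit types are equally likely) and the function name are assumptions; the slide does not state how the noise was actually drawn.

```python
import random


def add_noise(seq, rate, rng):
    """Apply an insertion, deletion or mutation to each base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            kind = rng.choice(["insertion", "deletion", "mutation"])
            if kind == "insertion":
                out.append(base)
                out.append(rng.choice("ACGT"))  # extra base after the original
            elif kind == "mutation":
                out.append(rng.choice([b for b in "ACGT" if b != base]))
            # deletion: append nothing
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for rate in (0.0, 0.01, 0.02, 0.10, 0.20, 0.50):
        print(rate, add_noise("ACGTACGTACGTACGT", rate, rng))
```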

Results (C. elegans)

- Error rate increases drastically for blat and sim4
- Exalin performs quite well, but always > 3% wrong
- PALMA is always correct for noise level ≤ 10%
- PALMA w/o SS only slightly worse

Conclusion

- New alignment algorithm for perfect alignments of mRNA and DNA
- Exploits very accurate SVM-based splice site predictions
- New idea of combining different sources of information:
  similarity, splice site scores and intron lengths
- Large margin based iterative algorithm
  - Guaranteed convergence
- Significantly reduced error rates on test data (short exons/much noise)
- Better detection of microexons & alternatively spliced exons
- Current work: Reduce computational complexity
- Source code (Python/C++, GPL) and data available at

Substitution Matrix

$M : \Sigma \times \Sigma \to \mathbb{R}$
- Matches score high, gaps score low
- Mismatch scores all close to zero

Donor/Acceptor Score

$f_{Acc}$ and $f_{Don}$

Score the acceptor and donor SVM outputs

Intron Length Score

$f_{Len}$

Maximum near 50nt (most frequent intron length)

Discriminative Gene Finding

mGene Talk (30 min)

Computational Approaches to the Analysis of Whole Genome Resequencing Data

Gunnar Rätsch

Friedrich Miescher Laboratory, Tübingen

Gunnar Ratsch (FML, Tubingen) Computational Approaches for Genome Resequencing 1 / 20

Introduction

What is the genetic basis of variation?

Modified from Koornneef et al. 2004. Ann. Rev. Plant Biology, 55, 141-172.

Questions:

What sequence changes occur in short time frames?

Which polymorphisms and genes underlie adaption?

What are the consequences for gene function?

Arabidopsis thaliana:

119 Mb finished euchromatic sequence (Col-0)

Resources comparable to Drosophila and C. elegans

Collections of >1000 wild strains from 3 continents

Strains are largely homozygous

Resequencing of 20 wild strains

genome-wide identification of sequence polymorphisms

High-density oligo-nucleotide arrays for high-throughput resequencing

Resequencing Array Basics I

Resequencing Array Basics II

>99.99% of bases represented

Each base queried with forward and reverse strand probe quartets

Nearly 1 billion oligos per ecotype

19+1 ecotypes surveyed representing worldwide distribution

Resequencing Data

Data analysis challenge

Hybridization intensities depend on
- Oligomer
- Ecotype
- Repeats

Measurement noise

Identify SNPs

Problematic cases
- Highly polymorphic regions
- Deletions/insertions

A Brief Excursion to Machine Learning I

Large Margin Learning
- Extract features
- Find linear separation
- With large margin
- Classify new points

Works with
- many features
- nonlinearities using kernels

Elegant theory

"Support Vector Machines"

Application to SNP discovery

Classification using Support Vector Machines with 302 features

≈2,400 known SNPs/ecotype (Nordborg et al., PLoS Biol., 2005)

Out-of-sample evaluation and prediction on whole genome

Comparison with Perlegen's model-based method (Hinds et al., Science, 2005)

Clark et al., in prep., 2006.

Identification of Highly Polymorphic Regions

Results

Performance drops when other SNPs are in vicinity (1-20nt)

Least predicted SNPs in highly polymorphic regions!

ML more sensitive

New Approach

Polymorphic Region Prediction (PRP)

Novel ML techniques for predicting complex properties

Modeling polymorphic regions

Example sequence from Bor-4

Create segmentation into

conserved and

polymorphic regions (distance <25 nt)

Predict segmentation using a state-model-based approach

[Figure labels: conserved, polymorphic]

Gunnar Ratsch (FML, Tubingen) Computational Approaches for Genome Resequencing 10 / 20
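The target segmentation described on this slide can be sketched in a few lines of Python. This is only a minimal sketch under assumptions that the slide does not state: polymorphism positions are integer coordinates, consecutive polymorphisms closer than 25 nt are merged into one polymorphic region, and a polymorphism without such a neighbour is treated as isolated and left in the conserved state. The names `polymorphic_regions` and `max_gap` are hypothetical.

```python
def polymorphic_regions(positions, max_gap=25):
    """Merge polymorphism positions closer than max_gap into (start, end) regions.

    Assumption: a region needs at least two polymorphisms; isolated ones are skipped.
    """
    ordered = sorted(positions)
    regions = []
    if not ordered:
        return regions
    start = prev = ordered[0]
    count = 1
    for pos in ordered[1:]:
        if pos - prev < max_gap:
            # Still within the same polymorphic region: extend it.
            prev = pos
            count += 1
        else:
            # Gap too large: close the current group, start a new one.
            if count > 1:
                regions.append((start, prev))
            start = prev = pos
            count = 1
    if count > 1:
        regions.append((start, prev))
    return regions


print(polymorphic_regions([10, 20, 40, 100, 200, 210]))  # [(10, 40), (200, 210)]
```

Everything outside the returned intervals would be labelled conserved; the state-model-based predictor is then trained to reproduce such a segmentation from the hybridization data.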

Page 57: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Example

Bor-4, chromosome 5, 14303502...14304007

[Figure: tracks along the region (x-axis 50 to 550): input feature, hybridization intensities (y-axis 0 to 3); known labels; SNP calls (SNPs detected with SNP calling methods); target segmentation; prediction of a polymorphic region (PRP)]

Count a prediction as correct if it overlaps by at least 75%

55% Sensitivity, 90% Specificity (excluding isolated SNPs)

Previously: Heuristic method with much lower sensitivity

Gunnar Ratsch (FML, Tubingen) Computational Approaches for Genome Resequencing 11 / 20
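The 75% overlap criterion can be illustrated as follows. The slide does not say which interval the 75% refers to, so this sketch makes an assumption: a target region counts as found if at least 75% of it is covered by some prediction (used for sensitivity), and a prediction counts as correct if at least 75% of it is covered by some target region (used for specificity, read here as the share of correct predictions). The helper names and the example intervals are hypothetical.

```python
def covered_fraction(a, b):
    """Fraction of interval a (inclusive start, end) that is covered by interval b."""
    overlap = min(a[1], b[1]) - max(a[0], b[0]) + 1
    return max(0, overlap) / (a[1] - a[0] + 1)


def hit_rate(regions, others, min_overlap=0.75):
    """Share of `regions` covered to at least `min_overlap` by some interval in `others`."""
    hits = sum(
        1 for r in regions
        if any(covered_fraction(r, o) >= min_overlap for o in others)
    )
    return hits / len(regions)


# Hypothetical toy intervals, not data from the lecture.
targets = [(100, 199), (300, 399)]
predictions = [(110, 210), (500, 520)]

sensitivity = hit_rate(targets, predictions)   # targets recovered by a prediction
specificity = hit_rate(predictions, targets)   # predictions supported by a target
print(sensitivity, specificity)  # 0.5 0.5
```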

Page 61: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Complementing SNP Calls

[Figure: SNP detection true positive rate (y-axis, 0 to 1) by distance from SNP to nearest polymorphism (x-axis bins: 1, [2,3], [4,5], [6,10], [11,20], [21,50], [51,100], >100); two series: "SNPs covered by PRPs, TP rate" and "SNP calls, TP rate"]

Fraction of called/covered polymorphisms (test set):

SNPs: ∼32% (SNP calling, MB+ML), ∼41% (region predictor)
SNPs, both methods: ∼65%
Deletions (per base): ∼71%
Insertions: ∼39%

Gunnar Ratsch (FML, Tubingen) Computational Approaches for Genome Resequencing 12 / 20

Page 62: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Polymorphism Distribution

[Figure: concentric pie chart by sequence type. Inner circle: genomic bases; inner ring: SNP predictions; outer ring: bases in PRPs. Genomic-base shares shown in the chart: 28.0, 15.9, 3.1, 4.7, 43.2 and 5.3%.]

Sequence type        SNP no.             bp in PRPs
Coding               230,186 (35.5%)     3,828,824 (13.3%)
Intron               96,069 (14.8%)      4,090,306 (14.2%)
Untranslated RNA     32,771 (5.1%)       1,029,948 (3.6%)
Intergenic           229,335 (35.4%)     16,046,419 (55.6%)
Pseudogene           21,552 (3.3%)       1,413,369 (4.9%)
Transposon           38,657 (6.0%)       2,462,213 (8.5%)

Coding regions are underrepresented for PRPs while transposons are overrepresented.

Clark et al., Zeller et al., in preparation, 2006

Gunnar Ratsch (FML, Tubingen) Computational Approaches for Genome Resequencing 13 / 20
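The percentages in the two rings of the chart follow directly from the counts in the table. The short sketch below recomputes them; the counts are copied from the slide, while the variable and function names are only illustrative.

```python
# Counts copied from the slide's table: (number of SNPs, bp in PRPs).
table = {
    "Coding": (230186, 3828824),
    "Intron": (96069, 4090306),
    "Untranslated RNA": (32771, 1029948),
    "Intergenic": (229335, 16046419),
    "Pseudogene": (21552, 1413369),
    "Transposon": (38657, 2462213),
}


def shares(values):
    """Percentage share of each value in the total, rounded to one decimal."""
    total = sum(values)
    return [round(100.0 * v / total, 1) for v in values]


snp_share = dict(zip(table, shares([snp for snp, _ in table.values()])))
prp_share = dict(zip(table, shares([bp for _, bp in table.values()])))

for name in table:
    print(f"{name:18s} SNPs {snp_share[name]:5.1f}%   PRP bases {prp_share[name]:5.1f}%")
```

Coding sequence accounts for 13.3% of the bases in PRPs but 28.0% appears among the genomic-base shares in the chart, which is consistent with the statement that coding regions are underrepresented, assuming that 28.0% is indeed the coding share.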

Page 63: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Predicted Effects on Gene Products

SNPs may interfere with signal recognition

Transcription start & stop

Splice sites

Translation start & stop

Regulatory elements

Consequences

Modification of consensus sequence (e.g. splice site dinucl.)

Permanent change ⇒ Easy to check

Weaker or stronger recognition of signals

Different expression levels
(Alternative) splicing pattern may change

⇒ Needs ab initio gene finding ⇒ build on our previous work

Gunnar Ratsch (FML, Tubingen) Computational Approaches for Genome Resequencing 14 / 20

Page 65: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Effects on Genes

SNPs (MB+ML) leading to consensus changes

109,976 amino acid changes

1,227 premature stops

156 initiation methionine changes

435 splice site changes

Major changes in 573 genes validated by dideoxy sequencing∗

Polymorphic Regions Predictions

Overlap coding regions of 16,692 genes in at least one ecotype

50% overlap in 1,910 genes, 95% overlap in 743 genes

122 coding sequence deletions validated by dideoxy sequencing∗

∗ Verified in collaboration with laboratory of Joseph Ecker

Gunnar Ratsch (FML, Tubingen) Computational Approaches for Genome Resequencing 15 / 20

Page 67: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Major Effects Distribution

Gunnar Ratsch (FML, Tubingen) Computational Approaches for Genome Resequencing 16 / 20

Page 68: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Predicted Effects by Gene Finding (Preliminary)

Gunnar Ratsch (FML, Tubingen) Computational Approaches for Genome Resequencing 17 / 20

Page 69: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Example of Predicted Splice Form Change

[Figure: predicted gene structures of AT4G02980 (chromosome 4, + strand; track labels 1 to 6; x-axis ticks 768 to 2304) for 20 accessions and the annotation. Bor-4, Ler-1, Tsu-1 and Bay-0 carry the segment labels +145, 79, 106, 200, 64, 351, 196, 108, 90, 397, 85. Bur-0, Br-0, C24, Cvi-0, Got-7, Lov-5, NFA-8, Est-1, Fei-0, RRS-10, RRS-7, Sha, Tamm-2, Ts-1, Van-0, Col-0 and the annotation carry +145, 106, 200, 64, 351, 383, 90, 397, 85.]

Gunnar Ratsch (FML, Tubingen) Computational Approaches for Genome Resequencing 18 / 20

Page 70: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Conclusions

Created inventory for polymorphisms in Arabidopsis thaliana

Highly polymorphic: ≈0.5% in SNPs, ≈25% in PRPs

New method for SNP calling; more accurate than Perlegen’s

Accurate polymorphic region predictions

Important for further analysis (e.g. dideoxy sequencing)

Large number of major effect changes

Overrepresented in R genes, F-box genes, Receptor-like kinases

More predicted changes by ab initio gene finding

Application to other genomes (rice, human?)

Study variations using mRNA tiling arrays

Expression levels, splicing, . . .

Gunnar Ratsch (FML, Tubingen) Computational Approaches for Genome Resequencing 19 / 20

Page 74: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Acknowledgments

Friedrich Miescher Laboratory

Gabriele Schweikert

Georg Zeller

The Salk Institute, CA

Paul Shinn

Joseph Ecker

MPI for Biological Cybernetics

Bernhard Schölkopf

MPI for Developmental Biology

Richard Clark

Stephan Ossowski

Norman Warthmann

Detlef Weigel

ZBIT, University of Tübingen

Daniel Huson

Perlegen Sciences, CA

Kelly Frazer

Gunnar Ratsch (FML, Tubingen) Computational Approaches for Genome Resequencing 20 / 20

Page 75: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Summary Of Lectures

Cheng Soon Ong, Petra Philips and Gunnar Rätsch: Kernel Methods for Predictive Sequence Analysis: Introduction, Page 33

Statistical Learning Theory
Probabilistic Approaches
Support Vector Machines
Convex Optimization
Boosting
Introduction to (String) Kernels
Fast SVMs using String Kernels
Graph kernels
Multiple Kernel Learning
Structured Output Learning
Applications

Page 76: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Probabilistic Learning Model

G. Rätsch, C.S. Ong and P. Philips: Advanced Methods for Sequence Analysis, Page 17

Assumption
All data is generated by the same hidden probabilistic source!

Formally
$P$ is an unknown joint probability distribution over $\mathcal{X} \times \mathcal{Y}$

Training data $((x_1, y_1), \ldots, (x_n, y_n))$ is iid $\sim P$

Aim: find best $f^* \in \mathcal{F}$ that minimizes risk

$$R(f) = \int \ell(f(x), y)\, dP.$$

ERM: find best $f_n \in \mathcal{F}$ that minimizes empirical risk

$$R_{\mathrm{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i).$$

Page 77: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Generative vs Discriminative

Cheng Soon Ong, Petra Philips and Gunnar Rätsch: Kernel Methods for Predictive Sequence Analysis: Introduction, Page 34

Generative approach
Models $p(x, y) = p(x|y)\,p(y)$. Uses Bayes’ rule to infer

$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}.$$

Discriminative approach
Models $p(y|x)$ directly and takes max.

Examples
Generative: Mixtures of Gaussians, Hidden Markov Models, Bayesian Networks, Graphical Models, ...
Discriminative: SVM, (Regularized) Least Squares Regression, ...
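As a small illustration of the generative route, the sketch below applies Bayes’ rule to a discrete toy model and then takes the most probable class. The class names, the prior and the likelihood values are invented for illustration and are not taken from the lecture.

```python
def posterior(prior, likelihood, x):
    """Bayes' rule: p(y|x) = p(x|y) p(y) / p(x), with p(x) = sum_y p(x|y) p(y)."""
    joint = {y: likelihood[y][x] * prior[y] for y in prior}
    evidence = sum(joint.values())
    return {y: p / evidence for y, p in joint.items()}


# Hypothetical toy numbers: class prior p(y) and class-conditional p(x|y).
prior = {"exon": 0.3, "intron": 0.7}
likelihood = {
    "exon": {"G": 0.3, "other": 0.7},
    "intron": {"G": 0.2, "other": 0.8},
}

post = posterior(prior, likelihood, "G")
prediction = max(post, key=post.get)  # class with the largest posterior
print(post, prediction)
```

A discriminative method skips the model of $p(x|y)$ and fits the decision function directly, which is what the SVM on the following slide does.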

Page 78: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

SVMs: Geometric View

Cheng Soon Ong and Gunnar Rätsch: Kernel Methods for Predictive Sequence Analysis: Introduction, Page 2

Minimize

$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$$

Subject to

$$y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \quad \text{for all } i = 1, \ldots, N.$$

The examples on the margin are called support vectors [Vapnik, 1995]
Called the soft margin SVM or the C-SVM [Cortes and Vapnik, 1995]
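For a fixed $(w, b)$ the smallest feasible slack of example $i$ is $\xi_i = \max(0, 1 - y_i(\langle w, x_i \rangle + b))$, so the objective can be evaluated directly. The sketch below does this for a hypothetical toy data set; it only evaluates the objective and does not solve the optimization problem.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def soft_margin_objective(w, b, X, y, C):
    """Return 0.5*||w||^2 + C*sum(xi) and the slacks xi for a given (w, b)."""
    # Smallest slack satisfying y_i(<w, x_i> + b) >= 1 - xi_i and xi_i >= 0.
    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks


# Hypothetical toy data: the second point lies inside the margin.
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
y = [1, 1, -1]
objective, slacks = soft_margin_objective((1.0, 0.0), 0.0, X, y, C=10.0)
print(objective, slacks)  # 5.5 [0.0, 0.5, 0.0]
```

Larger values of `C` penalize margin violations more strongly, at the cost of a smaller margin.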

Page 79: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Convex Optimization

G. Rätsch, C.S. Ong and P. Philips: Advanced Methods for Sequence Analysis, Page 19

Constrained Optimization (generally hard)

$$\min_x f_0(x) \quad \text{subject to} \quad f_i(x) \le 0 \text{ for all } i, \quad g_j(x) = 0 \text{ for all } j$$

Convex Optimization (generally easy)

$$\min_x f_0(x) \quad \text{subject to} \quad f_i(x) \le 0 \text{ for all } i, \quad a_j^\top x = b_j \text{ for all } j$$

$f_0, f_1, \ldots, f_m$ are convex, and the equality constraints are affine [Boyd and Vandenberghe, 2004].

Page 80: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Nonlinear Algorithms in Feature Space

Cheng Soon Ong and Gunnar Rätsch: Kernel Methods for Predictive Sequence Analysis: Introduction, Page 9

Linear separation might not be sufficient!
⇒ Map into a higher dimensional feature space

Example: all second order monomials

$$\Phi : \mathbb{R}^2 \to \mathbb{R}^3, \qquad (x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$$

[Figure: two classes of points in input space (axes $x_1$, $x_2$) and their images in feature space (axes $z_1$, $z_2$, $z_3$)]
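A useful property of this particular map is that inner products in feature space can be computed without forming $\Phi$ explicitly: $\langle \Phi(x), \Phi(x') \rangle = \langle x, x' \rangle^2$. The following sketch checks the identity numerically; the function names and the two example points are chosen for illustration.

```python
import math


def phi(x):
    """Second order monomial map from R^2 to R^3."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2, computed in input space."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # inner product in feature space
implicit = poly2_kernel(x, z)    # same value via the kernel
print(explicit, implicit)
```

This is the idea behind the kernel trick: an algorithm that only needs inner products can work in the feature space at the cost of a kernel evaluation in input space.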

Page 81: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

AdaBoost (Freund & Schapire 1996)

Machine Learning Summer School, July 2006 in Taipei. Gunnar Rätsch: An Introduction to Boosting. Part I (The Idea of Boosting), Page 15

Idea:
Simple hypotheses are not perfect!
Hypotheses combination ⇒ increased accuracy

Problems:
How to generate different hypotheses?
How to combine them?

Method:
Compute distribution $d_1, \ldots, d_N$ on examples
Find hypothesis on the weighted training sample $(x_1, y_1, d_1), \ldots, (x_N, y_N, d_N)$
Combine hypotheses $h_1, h_2, \ldots$ linearly:

$$f = \sum_{t=1}^{T} \alpha_t h_t$$
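The three steps of the method can be sketched as a compact loop. This is a minimal illustration, not the lecture's implementation: it assumes 1-D inputs, threshold "stumps" as weak hypotheses, the usual AdaBoost weight $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ and the exponential reweighting of examples. The toy data and all names are hypothetical.

```python
import math


def make_stump(threshold, sign):
    """Weak hypothesis on 1-D inputs: `sign` right of the threshold, `-sign` left of it."""
    return lambda x: sign if x > threshold else -sign


def adaboost(xs, ys, candidates, rounds):
    n = len(xs)
    d = [1.0 / n] * n                        # distribution d_1, ..., d_N on examples
    ensemble = []                            # list of (alpha_t, h_t)
    for _ in range(rounds):
        def weighted_error(h):
            return sum(di for di, x, y in zip(d, xs, ys) if h(x) != y)

        h = min(candidates, key=weighted_error)   # best hypothesis on the weighted sample
        eps = weighted_error(h)
        if eps >= 0.5:                       # no candidate beats random guessing
            break
        eps = max(eps, 1e-12)                # guard against a perfect hypothesis
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((alpha, h))
        # Increase the weight of misclassified examples, then renormalize.
        d = [di * math.exp(-alpha * y * h(x)) for di, x, y in zip(d, xs, ys)]
        total = sum(d)
        d = [di / total for di in d]
    return ensemble


def predict(ensemble, x):
    """Sign of the linear combination f = sum_t alpha_t * h_t(x)."""
    return 1 if sum(alpha * h(x) for alpha, h in ensemble) > 0 else -1


# Toy problem that no single stump solves: + + - - + +
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
candidates = [make_stump(t + 0.5, s) for t in range(7) for s in (1, -1)]
ensemble = adaboost(xs, ys, candidates, rounds=3)
print([predict(ensemble, x) for x in xs])  # [1, 1, -1, -1, 1, 1]
```

Each individual stump misclassifies at least two of the six points, yet the weighted combination of three stumps separates the sample.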

Page 82: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Example: Trees & Tries

Cheng Soon Ong and Gunnar Rätsch: Kernel Methods for Predictive Sequence Analysis: Introduction, Page 20

Tree (trie) data structure stores sparse weightings on sequences (and their subsequences).

Illustration: Three sequences AAA, AGA, GAA were added to a trie ($\alpha$’s are the weights of the sequences).

Building tree: $O(Q \cdot L \cdot D)$

Compute all $f(x_i)$: $O(N \cdot L \cdot D)$

Memory: $O(Q \cdot L \cdot D \cdot |\Sigma|)$

Works for any $D$
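A simplified version of such a trie can be written as follows. The sketch is position-independent: every substring of length up to $D$ of an added sequence contributes the sequence's weight to the matching trie node, and scoring a new sequence sums the weights found along its own substrings. The lecture's string kernels may additionally tie substrings to positions; that, along with all names and the example weights, is an assumption of this sketch.

```python
class TrieNode:
    __slots__ = ("children", "weight")

    def __init__(self):
        self.children = {}   # symbol -> TrieNode
        self.weight = 0.0    # accumulated weight of the substring ending here


def add_sequence(root, seq, alpha, depth):
    """Add weight alpha for every substring of seq with length <= depth."""
    for start in range(len(seq)):
        node = root
        for symbol in seq[start:start + depth]:
            node = node.children.setdefault(symbol, TrieNode())
            node.weight += alpha


def score(root, seq, depth):
    """f(seq): sum of stored weights over all substrings of seq with length <= depth."""
    total = 0.0
    for start in range(len(seq)):
        node = root
        for symbol in seq[start:start + depth]:
            node = node.children.get(symbol)
            if node is None:
                break        # substring never seen: contributes nothing
            total += node.weight
    return total


root = TrieNode()
for seq, alpha in [("AAA", 1.0), ("AGA", 0.5), ("GAA", -0.5)]:  # hypothetical weights
    add_sequence(root, seq, alpha, depth=3)

print(score(root, "AAA", depth=3))  # 13.0
```

Adding a sequence and scoring a sequence both walk at most $L \cdot D$ trie edges, which matches the $O(Q \cdot L \cdot D)$ and $O(N \cdot L \cdot D)$ costs quoted on the slide for $Q$ stored and $N$ scored sequences.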

Page 83: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Definition of Diffusion Kernel

• A: Adjacency matrix
• D: Diagonal matrix of degrees
• L = D - A: Graph Laplacian matrix
• Diffusion kernel matrix: $K = \exp(-\beta L)$
  – $\beta$: Diffusion parameter
• Matrix exponential, not elementwise exponential
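The distinction between the matrix exponential and an elementwise exponential is easy to see in code. The sketch below assumes the standard form $K = \exp(-\beta L)$ with $L = D - A$ and evaluates the matrix exponential with a truncated power series, which is adequate for a small graph and a moderate $\beta$ but is not a robust general-purpose routine. Function names and the example graph are illustrative.

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def diffusion_kernel(adj, beta, terms=30):
    """K = exp(-beta * L) with L = D - A, via the series sum_k (-beta*L)^k / k!."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    lap = [[(degree[i] if i == j else 0.0) - adj[i][j] for j in range(n)] for i in range(n)]
    m = [[-beta * v for v in row] for row in lap]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # k = 0 term
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, m)]  # m^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


# Path graph 0 - 1 - 2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
K = diffusion_kernel(adj, beta=0.5)
for row in K:
    print([round(v, 4) for v in row])
```

In the result, node 0 is more similar to its neighbour 1 than to node 2, and nodes 0 and 2 have a positive similarity although they share no edge; an elementwise exponential of $-\beta L$ could not express that.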

Page 84: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Summary

G. Rätsch, C.S. Ong and P. Philips: Advanced Methods for Sequence Analysis, Page 21

MKL Algorithm

Automatically computes best convex combination of kernels

$$k(x, x') = \sum_{p=1}^{M} \beta_p k_p(x, x'), \qquad \sum_{p=1}^{M} \beta_p = 1, \quad \beta_p \ge 0$$

SILP formulation makes large scale training and evaluation possible.

Possible Applications

Heterogeneous data.
Improving interpretability.
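The combined kernel itself is simple to form once the weights are known. The sketch below only builds the convex combination for given weights and checks the simplex constraint; it does not learn the weights, which is the part the MKL algorithm and its SILP formulation solve. The kernel matrices, the weights and the function name are hypothetical.

```python
def combine_kernels(kernels, betas):
    """Return sum_p beta_p * K_p for kernel matrices K_p and convex weights beta_p."""
    if any(b < 0 for b in betas) or abs(sum(betas) - 1.0) > 1e-9:
        raise ValueError("betas must be non-negative and sum to 1")
    n = len(kernels[0])
    return [
        [sum(b * k[i][j] for b, k in zip(betas, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical kernel matrices on the same two examples,
# e.g. one from sequence data and one from another data source.
k_seq = [[1.0, 0.2], [0.2, 1.0]]
k_other = [[1.0, 0.8], [0.8, 1.0]]
k = combine_kernels([k_seq, k_other], [0.25, 0.75])
print(k)
```

Because the weights are non-negative, the combination of positive semidefinite kernels is again a valid kernel, and the learned weights indicate how much each data source contributes.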

Page 85: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Max-Margin Structured Output Learning

Cheng Soon Ong, Petra Philips and Gunnar Rätsch: Kernel Methods for Predictive Sequence Analysis: Introduction, Page 17

Learn function f (y|x) scoring segmentations y for x

Maximize f (y|x) w.r.t. y for prediction:

$$\arg\max_{y \in \mathcal{Y}^*} f(y|x)$$

Given $N$ sequence pairs $(x_1, y_1), \ldots, (x_N, y_N)$ for training
Determine $f$ such that there is a large margin between true and wrong segmentations

$$\min_f \; C \sum_{n=1}^{N} \xi_n + P[f]$$

$$\text{subject to} \quad f(y_n|x_n) - f(y|x_n) \ge 1 - \xi_n \quad \text{for all } y_n \ne y \in \mathcal{Y}^*, \; n = 1, \ldots, N$$

Exponentially many constraints!
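To make the constraint set concrete, the sketch below enumerates every wrong labelling of a short sequence and computes the slack $\xi_n$ needed by the most violated constraint. It assumes a deliberately simple scoring function (a per-position weight times the input value, no transition terms) and two labels; these choices, the numbers and all names are illustrative, and brute-force enumeration is only feasible for very short sequences.

```python
from itertools import product


def score(y, x, w_emit):
    """Toy scoring function f(y|x): sum over positions of w_emit[label] * x_t."""
    return sum(w_emit[label] * xt for label, xt in zip(y, x))


def slack(x, y_true, w_emit, labels=(0, 1)):
    """Slack for one training pair, plus the number of margin constraints it induces."""
    true_score = score(y_true, x, w_emit)
    wrong = [y for y in product(labels, repeat=len(x)) if y != tuple(y_true)]
    best_wrong = max(score(y, x, w_emit) for y in wrong)
    # Smallest xi with f(y_n|x_n) - f(y|x_n) >= 1 - xi for all wrong y, and xi >= 0.
    return max(0.0, 1.0 - (true_score - best_wrong)), len(wrong)


w_emit = {0: -1.0, 1: 1.0}
xi, n_constraints = slack([0.2, 2.0, -2.0], (1, 1, 0), w_emit)
print(xi, n_constraints)  # xi is about 0.6, with 7 constraints
```

With two labels a sequence of length $L$ already yields $2^L - 1$ constraints per training example, which is why practical training works with only the most violated constraints, found with dynamic programming, instead of enumerating them.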

Page 87: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Called Label Propagation, as the same solution is achieved by iteratively propagating labels along edges until convergence

[images from “Learning with Local and Global Consistency”, Zhou, Bousquet, Lal, Weston, Schölkopf; NIPS 2004]

Note: here color ≠ classes

Page 88: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

Ulrike von Luxburg: PCA and Kernel PCA, 6 December 2006, slide 27

Summary PCA
Keep in mind the following picture:

Page 89: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

A Brief Excursion to Machine Learning II

Problems with Mature Solutions

Machine Learning works well for relatively simple objects with simple properties:

Classification

Regression

(Novelty detection)

Current Research

Large scale problems (>10 million examples)

Classification of structured objects, e.g. sequences, networks etc.

Improving interpretability of learning results

Prediction of complex properties

Sonnenburg et al., Journal of Machine Learning Research, 2006: http://www.fml.mpg.de/raetsch/projects/shogun.

Gunnar Ratsch (FML, Tubingen) Computational Approaches for Genome Resequencing 8 / 22

Page 93: Advanced Methods for Sequence Analysis · Sim4 [Florea et al., 1998], Spidey [Wheelan et al., 2001] Blat (prefers AG/GT) [Kent, 2002] EST Genome (DP based, prefers AG/GS) [Mott, 1997]

References

L. Florea, G. Hartzell, Z. Zhang, G.M. Rubin, and W. Miller. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Research, 8:967–974, 1998.

W. J. Kent. BLAT–the BLAST-like alignment tool. Genome Res, 12(4):656–664, April 2002.

R. Mott. EST GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci., 13:477–478, 1997.

G. Rätsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In K. Tsuda, B. Schölkopf, and J.-P. Vert, editors, Kernel Methods in Computational Biology. MIT Press, 2004.

Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, and Bernhard Schölkopf. Large Scale Multiple Kernel Learning. Journal of Machine Learning Research, 7:1531–1565, July 2006.

S.J. Wheelan, D.M. Church, and J.M. Ostell. Spidey: a tool for mRNA-to-genomic alignments. Genome Research, 11(11):1952–7, 2001.

M. Zhang and W. Gish. Improved spliced alignment from an information theoretic approach. Bioinformatics, 22(1):13–20, January 2006.