
Upload: eric-mckinney

Post on 17-Jan-2018


Bioinformatics PhD. Course

1. Biological introduction

2. Comparison of short sequences (up to 10.000 bps): Dot matrix, Pairwise align., Multiple align., Hash alg.

3. Comparison of large sequences (more than 10.000 bps): Data structures, Suffix trees, MUMs

4. String matching: Exact, Extended, Approximate

5. Sequence assembly

6. Projects: PROMO, MREPATT, …


Comparison of large sequences

First part:

Alignment of large sequences

Dynamic programming

• Short sequences (up to 10.000 bps) can be aligned using dynamic programming.

acc.................................agt
|||.................................|xx
acc.................................a--

• Quadratic cost of space and time.

What about genomes?

accaccacaccacaacgagcata … acctgagcgatat

acc..t

Genomic sequences

In which cases can dynamic programming be applied?

• Genomic sequences have millions of base pairs.

• The sequences are 1000 times longer.

• The running time is 1.000.000 times higher!

(1 second becomes 11 days) (1 minute becomes 2 years)
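The two figures above follow from the quadratic cost: sequences 1000 times longer need 1000 × 1000 times more work. A minimal arithmetic check, assuming a 365-day year (the variable names are chosen only for this sketch):

```python
# Scaling of a quadratic-cost algorithm when both sequences grow 1000-fold.
LENGTH_FACTOR = 1_000
time_factor = LENGTH_FACTOR ** 2            # quadratic cost: 1.000.000

SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_YEAR = 365 * 24 * 60            # assumption: 365-day year

days = 1 * time_factor / SECONDS_PER_DAY    # a job that took 1 second
years = 1 * time_factor / MINUTES_PER_YEAR  # a job that took 1 minute

print(f"{days:.1f} days, {years:.1f} years")
```

The results are roughly 11.6 days and 1.9 years, which the slide rounds to 11 days and 2 years.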

First assumption

[Figure: Genome A and Genome B drawn as two parallel sequences, and as the two axes of a plot (Genome B horizontal, Genome A vertical).]

Realistic assumptions?

Unrealistic assumption!

More realistic assumption

But now, is it a real case?

[Figure: Genome A and Genome B compared under each assumption.]

Preview in a real case

Chlamydia muridarum: 1.084.689 bps
Chlamydia trachomatis: 1.057.413 bps

Preview in a real case

Pyrococcus abyssi: 1.790.334 bps
Pyrococcus horikoshii: 1.763.341 bps

Methodology of an alignment

1st: Identify the portions that can be aligned.

2nd: Make a preview.

3rd: Make the alignment.

(Linear cost)

Preview-Revisited

… a a t g … c t g …

… c g t g … c c c …

MUM: Maximal Unique Matching

Connect to MALGEN
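A MUM is usually defined as a substring that occurs exactly once in each sequence and cannot be extended to the left or to the right without losing the match. The sketch below applies that definition by brute force; it is far from the linear cost discussed in these slides and is only meant to make the definition concrete. The names `naive_mums`, `count_overlapping` and `min_len` are chosen for this illustration.

```python
def count_overlapping(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlapping ones included."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_mums(s1: str, s2: str, min_len: int = 1):
    """Return MUMs as (substring, start in s1, start in s2), 1-based starts."""
    mums = []
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] != s2[j]:
                continue
            # Skip pairs that could be extended to the left.
            if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(s1) and j + k < len(s2) and s1[i + k] == s2[j + k]:
                k += 1
            match = s1[i:i + k]
            if (k >= min_len
                    and count_overlapping(s1, match) == 1
                    and count_overlapping(s2, match) == 1):
                mums.append((match, i + 1, j + 1))
    return mums


print(naive_mums("ababaabb", "aabaabba"))
```

For the two example strings used later in the course, the only MUM is abaabb, starting at position 3 of the first string and position 2 of the second.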

Methodology of an alignment

1st: Identify the portions that can be aligned.

2nd: Make a preview.

3rd: Make the alignment: with CLUSTALW, TCOFFEE, …

How can these portions be determined?

How can MUMs be found? Linear cost with suffix trees.

Comparison of large sequences

M-GCAT

Todd Treangen

Homework

1. Javier
2. Dmitry
3. Ana Iris
4. David
5. Patricia
6. Rogeli
7. Atif
8. Aina
9. Isaac
10. Maria Merce
11. Romina
12. Guillem
13. Raul
14. Alexis
15. Ramon


Bioinformatics PhD. Course

Second part:

Introducing Suffix trees

Suffix trees

Given string ababaas:

Suffixes:

1: ababaas
2: babaas
3: abaas
4: baas
5: aas
6: as
7: s

[Figure: suffix tree of ababaas. Each leaf carries the rest of its suffix and the suffix start position, from 1 to 7 (for example s,7).]

What kind of queries?

Applications of Suffix trees

1. Exact string matching

• Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?

[Figure: suffix tree of ababaas.]
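The question above can be answered by walking down from the root: a pattern occurs in the sequence exactly when it is a prefix of some suffix. The sketch below uses an uncompressed suffix trie (one node per character) instead of the compressed suffix tree of the slides, which keeps the code short at the price of quadratic space; `build_suffix_trie` and `contains` are names chosen for this illustration.

```python
def build_suffix_trie(text: str) -> dict:
    """Uncompressed suffix trie as nested dicts: one node per character."""
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie: dict, pattern: str) -> bool:
    """Walk down from the root; the pattern occurs iff the walk never fails."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("ababaas")
for pattern in ("abab", "aab", "ab"):
    print(pattern, contains(trie, pattern))
```

For ababaas the walk succeeds for abab and ab and fails for aab. Each query costs time proportional to the pattern length, independent of the length of the sequence.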

Quadratic insertion algorithm

Given the string …

Invariant properties:

P1: the leaves of suffixes from … have been inserted

and the suffix-tree …

Quadratic insertion algorithm

Given the string ababaabbs

[Figure sequence: the suffixes of ababaabbs are inserted into the tree one at a time. Leaves in the final tree: baabbs,1, baabbs,2, abbs,3, abbs,4, abbs,5, bs,6, bs,7, s,8 and s,9.]
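The insertion steps shown in these frames can be sketched as follows: each suffix is inserted by walking down from the root and splitting an edge where the first mismatch occurs. Every insertion may cost time proportional to the suffix length, hence the quadratic total. This is a minimal sketch, not the course's implementation; it assumes the string ends with a symbol that appears nowhere else (s in the example), and the names `Node`, `insert_suffix`, `build_suffix_tree` and `leaves` are chosen for this illustration.

```python
class Node:
    def __init__(self, suffix=None):
        self.children = {}      # first character -> (edge label, child Node)
        self.suffix = suffix    # 1-based start position for leaves, else None


def insert_suffix(root: Node, text: str, start: int) -> None:
    """Insert text[start:] by walking down from the root."""
    node, rest = root, text[start:]
    while True:
        first = rest[0]
        if first not in node.children:
            node.children[first] = (rest, Node(suffix=start + 1))
            return
        label, child = node.children[first]
        # Length of the common prefix of the edge label and the rest of the suffix.
        k = 0
        while k < len(label) and k < len(rest) and label[k] == rest[k]:
            k += 1
        if k == len(label):
            node, rest = child, rest[k:]    # whole edge matched: keep descending
            continue
        # Mismatch inside the edge: split it and hang the new leaf there.
        middle = Node()
        middle.children[label[k]] = (label[k:], child)
        middle.children[rest[k]] = (rest[k:], Node(suffix=start + 1))
        node.children[first] = (label[:k], middle)
        return


def build_suffix_tree(text: str) -> Node:
    """Assumes text ends with a unique terminator, so no suffix is a prefix of another."""
    root = Node()
    for start in range(len(text)):
        insert_suffix(root, text, start)
    return root


def leaves(node: Node, path: str = ""):
    """Yield (suffix spelled from the root, start position) for every leaf."""
    if node.suffix is not None:
        yield path, node.suffix
    for label, child in node.children.values():
        yield from leaves(child, path + label)


tree = build_suffix_tree("ababaabbs")
print(sorted(leaves(tree), key=lambda leaf: leaf[1]))
```

The tree built for ababaabbs has nine leaves, one per suffix, and three edges leaving the root (a, b and s).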

Generalized suffix tree

The suffix tree of many strings is called the generalized suffix tree, and it is the suffix tree of the concatenation of the strings.

For instance, the generalized suffix tree of ababaabb and aabaa is the suffix tree of ababaabbαaabaaβ, where α and β are end symbols that appear nowhere else.

Generalized suffix tree

Given the suffix tree of ababaabbα:

[Figure: suffix tree of ababaabbα, with leaves numbered 1 to 9.]

Construction of the suffix tree of ababaabbαaabaaβ:

[Figure sequence: the suffixes of aabaaβ are inserted one at a time into the tree of ababaabbα, adding the leaves that end in β, numbered 1 to 6.]

Generalized suffix tree of ababaabbαaabaaβ:

[Figure: the resulting tree, holding the nine leaves of ababaabbα and the six leaves of aabaaβ.]


Applications of Suffix trees

2. The substring problem for a database of strings DB

• Does the DB contain any occurrence of the patterns abab, aab, and ab?

[Figure: generalized suffix tree of ababaabbαaabaaβ.]
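A hedged sketch of the substring problem for the two example strings: all suffixes of every database string go into one shared structure, and each node remembers which strings pass through it. For brevity this uses an uncompressed trie and per-node sets of string numbers instead of the end symbols α and β, so it is an illustration of the idea rather than a real generalized suffix tree; `build_generalized_trie` and `strings_containing` are names chosen here.

```python
def build_generalized_trie(strings):
    """Uncompressed generalized suffix trie; each node records which strings reach it."""
    root = {"ids": set(), "next": {}}
    for string_id, text in enumerate(strings, start=1):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["next"].setdefault(ch, {"ids": set(), "next": {}})
                node["ids"].add(string_id)
    return root


def strings_containing(trie, pattern):
    """Return the numbers (1-based) of the database strings that contain pattern."""
    node = trie
    for ch in pattern:
        if ch not in node["next"]:
            return set()
        node = node["next"][ch]
    return set(node["ids"])


db = ["ababaabb", "aabaa"]
trie = build_generalized_trie(db)
for pattern in ("abab", "aab", "ab"):
    print(pattern, sorted(strings_containing(trie, pattern)))
```

With this database, abab is found only in the first string, while aab and ab are found in both.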

Applications of Suffix trees

3. The longest common substring of two strings

[Figure: generalized suffix tree of ababaabbαaabaaβ.]
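In a generalized suffix tree, the longest common substring is spelled by the deepest node that has leaves of both strings below it. The sketch below follows that idea on an uncompressed trie that marks each node with the strings reaching it; it is a quadratic-space illustration under that simplification, and `longest_common_substring` is a name chosen here.

```python
def longest_common_substring(s1: str, s2: str) -> str:
    # Generalized suffix trie: each node keeps its children and the set of
    # strings (1 or 2) that have a suffix passing through it.
    root = {"ids": set(), "next": {}}
    for string_id, text in ((1, s1), (2, s2)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["next"].setdefault(ch, {"ids": set(), "next": {}})
                node["ids"].add(string_id)

    # Depth-first search restricted to nodes reached by both strings.
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node["next"].items():
            if child["ids"] == {1, 2}:
                stack.append((child, path + ch))
    return best


print(longest_common_substring("ababaabb", "aabaa"))
```

For ababaabb and aabaa the result is abaa, the only common substring of length 4.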

Applications of Suffix trees

5. Finding MUMs.

[Figure: generalized suffix tree of ababaabbαaabaaβ.]


Bioinformatics PhD. Course

Third part:

Suffix links

Suffix links

[Figure sequence: the suffix tree of ababaabbα, with the suffix links between its internal nodes drawn one frame at a time.]
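In the usual definition, the suffix link of an internal node whose path from the root spells xw (x a single character) points to the internal node whose path spells w. The sketch below lists the links for the example string by brute force from that definition, without building the tree: a substring labels an internal node when it is followed by at least two different characters. The `$` sign stands in for α, and `internal_node_labels` and `suffix_links` are names chosen for this illustration.

```python
from collections import defaultdict


def internal_node_labels(text: str) -> set:
    """Path labels of the internal nodes (root included) of the suffix tree of text.

    A substring labels an internal node when it is followed by at least two
    different characters. Assumes text ends with a unique terminator.
    """
    followers = defaultdict(set)
    for start in range(len(text)):
        for end in range(start, len(text)):
            followers[text[start:end]].add(text[end])
    return {label for label, chars in followers.items() if len(chars) >= 2}


def suffix_links(text: str) -> dict:
    """Map each non-root internal node xw to the node w (both as path labels)."""
    labels = internal_node_labels(text)
    links = {}
    for label in labels:
        if label:                        # the root (empty label) gets no link here
            target = label[1:]
            assert target in labels      # the target node always exists
            links[label] = target
    return links


for source, target in sorted(suffix_links("ababaabb$").items()):
    print(repr(source), "->", repr(target))
```

For ababaabb followed by its end symbol, the internal nodes are a, ab, aba, b and ba, and the links go aba to ba, ba to a, ab to b, and a and b to the root.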

Traversal using Suffix links

[Figure sequence: S2 is matched against the suffix tree of S1 = ababaabbα, one frame per step.]

Given S2 = a a b a a

aa in S2[1]

aab in S2[1] = S1[5..6-7] in S2[1]

Given S2 = a a b a a b b a

Unique matchings:

S1[3..6-8] in S2[2]
S1[4..6-8] in S2[3]
S1[5..8] in S2[4]
S1[6..8] in S2[5]
S1[7..8] in S2[6]
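The unique matchings listed above can be reproduced by brute force: for every position of S2, take the longest prefix of the rest of S2 that occurs in S1, and keep it if it occurs in S1 only once. The slides obtain the same information in linear time by following suffix links; the sketch below does not use suffix links and only checks the result. The names `occurrences` and `unique_matchings` are chosen for this illustration.

```python
def occurrences(text: str, pattern: str) -> list:
    """1-based start positions of pattern in text, overlapping ones included."""
    found, start = [], text.find(pattern)
    while start != -1:
        found.append(start + 1)
        start = text.find(pattern, start + 1)
    return found


def unique_matchings(s1: str, s2: str) -> dict:
    """For each 1-based position j of s2, the longest match of s2[j..] in s1.

    Returns {j: (i, length)} only when that longest match occurs once in s1.
    """
    result = {}
    for j in range(len(s2)):
        length = 0
        while j + length < len(s2) and s2[j:j + length + 1] in s1:
            length += 1
        if length:
            hits = occurrences(s1, s2[j:j + length])
            if len(hits) == 1:
                result[j + 1] = (hits[0], length)
    return result


print(unique_matchings("ababaabb", "aabaabba"))
```

The entry for S2[2] is (3, 6), that is S1[3..8], and the entries for S2[4], S2[5] and S2[6] are S1[5..8], S1[6..8] and S1[7..8], in agreement with the list on the slide.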

From UMs to MUMs

Given S2 = a a b a a b b a

and S1 = a b a b a a b b α

Unique matchings:

S1[3..6-8] in S2[2]
S1[4..6-8] in S2[3]
S1[5..8] in S2[4]
S1[6..8] in S2[5]
S1[7..8] in S2[6]

Array of UMs

1:
2:
3: 6-8
4: 6-8
5: 8
6: 8
7: 8
8:
9:

MUM: S1[3..6-8] in S2[2]
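One way to go from unique matchings to MUMs is to discard every matching that could be extended to the left, since it is part of a longer one, and every matching that is not unique in S2 as well. The sketch below does this by brute force; it is an assumption about the filtering step, not a transcription of the array procedure of the slide, and `mums_from_unique_matchings` is a name chosen for this illustration.

```python
def occurrences(text: str, pattern: str) -> list:
    """1-based start positions of pattern in text, overlapping ones included."""
    found, start = [], text.find(pattern)
    while start != -1:
        found.append(start + 1)
        start = text.find(pattern, start + 1)
    return found


def mums_from_unique_matchings(s1: str, s2: str) -> list:
    """Return MUMs as (start in s1, start in s2, length), 1-based starts."""
    mums = []
    for j in range(len(s2)):
        # Longest prefix of s2[j:] occurring in s1: right-maximal by construction.
        length = 0
        while j + length < len(s2) and s2[j:j + length + 1] in s1:
            length += 1
        match = s2[j:j + length]
        hits = occurrences(s1, match) if length else []
        if len(hits) != 1 or len(occurrences(s2, match)) != 1:
            continue                      # not unique in both sequences
        i = hits[0] - 1                   # 0-based start in s1
        if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
            continue                      # extendable to the left: not maximal
        mums.append((hits[0], j + 1, length))
    return mums


print(mums_from_unique_matchings("ababaabb", "aabaabba"))
```

For the example strings only one matching survives: S1[3..8] in S2[2], the MUM given on the slide.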


Bioinformatics PhD. Course

Fourth part:

Linear insertion algorithm

Quadratic insertion algorithm

Given the string …

Invariant properties:

P1: the leaves of suffixes from … have been inserted

and the suffix-tree …

Linear insertion algorithm

Given the string …

Invariant properties:

P1: the leaves of suffixes from … have been inserted

P2: the string … is the longest string that can be spelt through the tree.

and the suffix-tree …
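The sketch below assumes that the string in P2 is what McCreight's construction calls the head of the next suffix: the longest prefix of that suffix that already appears in the tree, that is, the longest prefix that also starts at an earlier position. It computes heads by brute force for ababaababb, the string of the example slides, and is not the linear algorithm itself. The name `head` is chosen for this illustration.

```python
def head(text: str, i: int) -> str:
    """Longest prefix of the suffix starting at 1-based position i that also
    starts at some earlier position of text."""
    suffix = text[i - 1:]
    best = 0
    for j in range(i - 1):                  # 0-based earlier start positions
        k = 0
        while k < len(suffix) and text[j + k] == suffix[k]:
            k += 1
        best = max(best, k)
    return suffix[:best]


text = "ababaababb"
heads = [head(text, i) for i in range(1, len(text) + 1)]
print(heads[5:9])    # heads of the suffixes starting at positions 6, 7, 8 and 9

# Each head is at most one character shorter than the previous one.
assert all(len(heads[i + 1]) >= len(heads[i]) - 1 for i in range(len(heads) - 1))
```

For suffixes 6 to 9 the heads are abab, bab, ab and b. Because each head loses at most one character with respect to the previous one, the next insertion does not need to restart from the root, which is what suffix links exploit to reach linear cost.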

Page 82: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

Linear insertion algorithm: example

Given the string ababaababb...

ba

baababb...,2

a ababb...,5

ba ababb...,3

baababb...,1ababb...,4

a

Page 83: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

Linear insertion algorithm: example

Given the string ababaababb...

ba

baababb...,2

a ababb...,5

ba ababb...,3

baababb...,1ababb...,4

6 7 8

Page 84: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

Linear insertion algorithm: example

ba

baababb...,2

a ababb...,5

ba ababb...,3

baababb...,1ababb...,4

6 7 8Given the string ababaababb...

Page 85: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

Linear insertion algorithm: example

ba

baababb...,2

a ababb...,5

ba ababb...,3

baababb...,1ababb...,4

6 7 89Given the string ababaababb...

Page 86: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

Linear insertion algorithm: example

a ababb...,5

ba ababb...,3

baababb...,1ba

baababb...,2

ababb...,4

Given the string ababaababb...

6 7 89

baababb...,1b

b...,6

aababb...,1

Page 87: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

Linear insertion algorithm: example

a ababb...,5

ba ababb...,3

ba

baababb...,2

ababb...,4

Given the string ababaababb...

7 89

b

b...,6

aababb...,1

Page 88: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

Linear insertion algorithm: example

a ababb...,5

ba ababb...,3

ba

baababb...,2

ababb...,4

Given the string ababaababb...

7 89

b

b...,6

aababb...,1

Page 89: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

Linear insertion algorithm: example

a ababb...,5

ba ababb...,3

ba

baababb...,2

ababb...,4

Given the string ababaababb...

7 89

b

b...,6

aababb...,1

Page 90: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

Linear insertion algorithm: example

a ababb...,5

ba ababb...,3

ba

baababb...,2

ababb...,4

Given the string ababaababb...

7 89

b

b...,6

aababb...,1

baababb...,2b aababb...,2

Page 91: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

Linear insertion algorithm: example

a ababb...,5

ba ababb...,3

ba

baababb...,2

ababb...,4

Given the string ababaababb...

7 8…

b

b...,6

aababb...,1

baababb...,2b

b...,7

aababb...,2

Page 92: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

Linear insertion algorithm: example

a ababb...,5

ba ababb...,3

ba ababb...,4

Given the string ababaababb...

89

b

b...,6

aababb...,1

b

b...,7

aababb...,2

Page 93: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

Linear insertion algorithm: example

a ababb...,5

ba ababb...,3

ba ababb...,4

Given the string ababaababb...

89

b

b...,6

aababb...,1

b

b...,7

aababb...,2

Page 94: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

Linear insertion algorithm: example

a ababb...,5

ba ababb...,3

ba ababb...,4

Given the string ababaababb...

89

b

b...,6

aababb...,1

b

b...,7

aababb...,2

Page 95: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

Linear insertion algorithm: example

a ababb...,5

ba ababb...,3

ba ababb...,4

Given the string ababaababb...

89

b

b...,6

aababb...,1

b

b...,7

aababb...,2

Page 96: Bioinformatics PhD. Course 1. Biological introduction Exact Extended Approximate 6. Projects: PROMO, MREPATT, … 5. Sequence assembly 2. Comparison of short

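Linear insertion algorithm: example

Given the string ababaababb...

[Figure: suffix tree under construction, position markers 8 and 9. Edge labels shown: a, b, a, ba, b, b; leaf edges ababb...,5; ababb...,4; ababb...,3; b...,6; aababb...,1; b...,7; aababb...,2.]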

Linear insertion algorithm: example

Given the string ababaababb...

[Figure: suffix tree under construction, position markers 8 and 9. Edge labels shown: a, b, a, ba, b, b; leaf edges ababb...,5; ababb...,4; ababb...,3; b...,6; aababb...,1; b...,7; aababb...,2; b...,8.]

Linear insertion algorithm: example

Given the string ababaababb...

[Figure: suffix tree under construction, position marker 9. Edge labels shown: a, b, a, ba, b, b; leaf edges ababb...,5; ababb...,4; ababb...,3; b...,6; aababb...,1; b...,7; aababb...,2; b...,8.]


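Linear insertion algorithm: example

Given the string ababaababb...

[Figure: suffix tree under construction, position marker 9. Edge labels shown: a, b, a, b, a, b, b; leaf edges ababb...,5; ababb...,4; ababb...,3; b...,6; aababb...,1; b...,7; aababb...,2; b...,8.]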

Linear insertion algorithm: example

Given the string ababaababb...

[Figure: suffix tree under construction, position marker 9. Edge labels shown: a, b, a, b, a, b, b; leaf edges ababb...,5; ababb...,4; ababb...,3; b...,6; aababb...,1; b...,7; aababb...,2; b...,8; b...,9.]

Linear insertion algorithm: example

Given the string ababaababb...

[Figure: suffix tree under construction, position marker 9. Edge labels shown: a, b, a, b, a, b, b; leaf edges ababb...,5; ababb...,4; ababb...,3; b...,6; ababb...,1; b...,7; aababb...,2; b...,8; b...,9.]


Index

Suffix arrays

"Suffix arrays: a new method for on-line string searches", U. Manber, G. Myers


Suffix arrays

Given string ababaa#:

Suffixes:

1: ababaa#
2: babaa#
3: abaa#
4: baa#
5: aa#
6: a#
7: #

… but lexicographically sorted (# sorts before a and b):

1: #        (suffix 7)
2: a#       (suffix 6)
3: aa#      (suffix 5)
4: abaa#    (suffix 3)
5: ababaa#  (suffix 1)
6: baa#     (suffix 4)
7: babaa#   (suffix 2)

What is the cost? O(n log(n))
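As a rough illustration of the slide above, the following Python sketch builds the suffix array of ababaa# by sorting the suffix start positions. The function name `suffix_array` is hypothetical, and the code assumes that # sorts before the letters (true for ASCII). It is a naive sort that compares whole suffixes, so its worst case is worse than the O(n log(n)) quoted on the slide; it only shows what the resulting array looks like.

```python
def suffix_array(text):
    """Return the 1-based start positions of all suffixes of text,
    ordered lexicographically by the suffix they start.

    Naive sketch: each comparison may scan a whole suffix, so this is
    not the O(n log n) construction, only an illustration of the result.
    """
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


if __name__ == "__main__":
    text = "ababaa#"
    for rank, start in enumerate(suffix_array(text), start=1):
        # rank in sorted order, the suffix itself, and where it starts
        print(rank, text[start - 1:], start)
```

For ababaa# the returned positions are 7, 6, 5, 3, 1, 4, 2.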


Applications of suffix arrays

1. Exact string matching

• Does the sequence ababaa# contain any occurrence of the patterns abab, aab, and ab?

Sorted suffixes:

1: #        (suffix 7)
2: a#       (suffix 6)
3: aa#      (suffix 5)
4: abaa#    (suffix 3)
5: ababaa#  (suffix 1)
6: baa#     (suffix 4)
7: babaa#   (suffix 2)

Binary search … what is the cost?

O(log(n) |P|)

Can it be improved to O(log(n)+|P|)?
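The binary search on this slide can be sketched as follows, under the assumption that the suffix array is built by a naive sort and that # sorts before the letters. The names `suffix_array` and `contains` are hypothetical. Each of the O(log(n)) steps compares at most |P| characters, which gives the O(log(n) |P|) cost.

```python
def suffix_array(text):
    """0-based start positions of the suffixes of text, sorted (naive sketch)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text, sa, pattern):
    """Binary search for pattern among the sorted suffixes of text."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare only the first |P| characters of the middle suffix.
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # lo is the first suffix that is not smaller than the pattern;
    # the pattern occurs iff that suffix starts with it.
    return lo < len(sa) and text.startswith(pattern, sa[lo])


if __name__ == "__main__":
    text = "ababaa#"
    sa = suffix_array(text)
    for p in ("abab", "aab", "ab"):
        print(p, contains(text, sa, p))
```

With the example string, abab and ab are found and aab is not.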

Fast search with cost O(log(n)+|P|)

Query: searched in the suffix array (positions 1, 2, …, n)

Invariant properties:

P1: α < query ≤ β

P2: matches pref(query)

Algorithm (γ is a position between α and β):

If suff(γ) < query then α = γ else β = γ
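A minimal sketch of the α/β/γ loop, assuming 0-based positions, virtual sentinels α = -1 and β = n, and hypothetical names `suffix_array` and `find`. It keeps the number of query characters already matched at α and β (called `l` and `r` here, an assumption about what P2 tracks) and restarts each comparison at min(l, r). This shortcut alone does not guarantee O(log(n)+|P|); the full bound needs precomputed longest-common-prefix information, which is not shown.

```python
def suffix_array(text):
    """0-based start positions of the suffixes of text, sorted (naive sketch)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, sa, query):
    """Return the 0-based start of one occurrence of query in text, or -1."""
    alpha, beta = -1, len(sa)   # P1: suff(alpha) < query <= suff(beta)
    l = r = 0                   # query characters matched at alpha / at beta
    while beta - alpha > 1:
        gamma = (alpha + beta) // 2
        start = sa[gamma]
        # Every suffix between alpha and beta shares min(l, r) characters
        # with the query, so the comparison can skip them.
        k = min(l, r)
        while (k < len(query) and start + k < len(text)
               and text[start + k] == query[k]):
            k += 1
        suffix_is_smaller = k < len(query) and (
            start + k == len(text) or text[start + k] < query[k])
        if suffix_is_smaller:
            alpha, l = gamma, k     # suff(gamma) < query
        else:
            beta, r = gamma, k      # query <= suff(gamma)
    if beta < len(sa) and text.startswith(query, sa[beta]):
        return sa[beta]
    return -1


if __name__ == "__main__":
    text = "ababaa#"
    sa = suffix_array(text)
    for q in ("abab", "aab", "ab"):
        print(q, find(text, sa, q))
```

When the loop ends, β is the first position whose suffix is not smaller than the query, so a single prefix check decides whether the query occurs.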