Incremental Inference of Relational Motifs with a Degenerate Alphabet Nadia Pisanti, LIPN Paris 13 & ABI Paris 6 joint work with: H.Soldano, LIPN Paris 13 & ABI Paris 6 M.Carpentier, ABI Paris 6 CPM 2005

Post on 19-Dec-2015

Incremental Inference of Relational Motifs with a Degenerate Alphabet

Nadia Pisanti, LIPN Paris 13 & ABI Paris 6

joint work with:

H. Soldano, LIPN Paris 13 & ABI Paris 6

M. Carpentier, ABI Paris 6

CPM 2005

Summary

Relational Motifs: The model. A few motivations.

Previous work: KMR: the idea of the paradigm. KMRC: using a degenerate alphabet & maximality. KMRELAT = KMRC + “relations”. Problems.

The algorithm: The idea. Properties that guarantee correctness and efficiency. Preliminary tests on 3D Proteins.

Relational Motifs: the idea

“normal” motifs:

IN: Sequence. Alphabet: {A,C,G,T}. Length: 9. Quorum: 5.

OUT: Motif TCAGTCTCA and its occurrences.

[Figure: the occurrences of TCAGTCTCA highlighted in the input sequence.]

relational motifs (trivial case: only relations):

IN: Sequence and pairwise relations. Alphabet: natural numbers. Length: 3. Quorum: 4. Relations alphabet: {<,>,=}.

OUT: Relations: (p1,p2) <, (p2,p3) >, and (p1,p3) =

[Figure: occurrences highlighted in a sequence of natural numbers; the values 151, 161 and 252 are visible.]
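The trivial case above (only relations, no symbols) can be sketched in a few lines of Python. This is a brute-force illustration, not the algorithm of the talk: every window of consecutive values is summarised by the relations between all pairs of its positions, and windows with the same summary are grouped. The function names and the input list are assumptions made for the example.

```python
from collections import defaultdict

def relation(a, b):
    """Symbol of the relations alphabet {<, >, =} that holds between a and b."""
    if a < b:
        return "<"
    if a > b:
        return ">"
    return "="

def relational_motifs(seq, length, quorum):
    """Group windows of `length` consecutive values by their pairwise relations."""
    extents = defaultdict(list)
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        # Pairs are visited in the order (p1,p2), (p1,p3), (p2,p3) for length 3.
        signature = tuple(
            relation(window[i], window[j])
            for i in range(length)
            for j in range(i + 1, length)
        )
        extents[signature].append(start + 1)  # 1-based positions, as on the slides
    return {sig: pos for sig, pos in extents.items() if len(pos) >= quorum}

# Hypothetical input, not taken from the slides.
numbers = [151, 252, 151, 161, 151, 7, 9, 7, 3, 8, 3]
motifs = relational_motifs(numbers, length=3, quorum=4)
```

With this input the only signature that reaches the quorum is (p1,p2) <, (p1,p3) =, (p2,p3) >, the relational motif of the slide, occurring at positions 1, 3, 6 and 9.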

Some possible applications

Music: detecting scales and tunes in different keys.

3D structures of proteins: amino acids in 3D space and pairwise distance as relation.

Motif inference in structured texts: data structures, source code.

Numbers and arithmetical relations.

Events and temporal relations.

KMR (1972): concatenation of short motifs into long ones

Ex: Length k with quorum 3

Length k/2: three motifs, occurring at b1,b2,b3,..., at r1,r2,r3,..., and at y1,y2,y3,... respectively.

[Figure: the input sequence with the occurrence positions y1, y2, b1, b8, r7, r2, b2, b3, r1, r3, y7, r8 marked. The motifs are drawn as graphical symbols that the transcript does not preserve.]

If two (k/2)-motifs occur at distance k/2 from each other, and k/2 is the length of both, then their concatenation occurs at the position of the first one (y7 in the example) and has length k.

Length k: one motif occurs at y2,y7,...; one occurs at y1,...; one occurs at b1,b8 only: no quorum, so it is discarded.

OUTPUT: the motif occurring at y2,y7,y5 and the motif occurring at y1,... The set of positions y2,y7,y5 is the extent of the first motif.

With O(log k) steps all k-motifs are generated
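The doubling idea can be sketched as follows for exact motifs. It is a minimal illustration under the assumption that k is a power of two; `kmr_labels`, `kmr_motifs` and the input text are names and data invented for the example.

```python
def kmr_labels(text, k):
    """Label each position with the class of the k-long substring starting there.

    k is assumed to be a power of two, so every step doubles the length.
    """
    # Length 1: two positions are equivalent when they carry the same symbol.
    ids = {c: i for i, c in enumerate(sorted(set(text)))}
    labels = [ids[c] for c in text]
    length = 1
    while length < k:
        # A substring of length 2*length at i is the pair
        # (class at i, class at i + length).
        pairs = [(labels[i], labels[i + length])
                 for i in range(len(labels) - length)]
        ids = {p: j for j, p in enumerate(sorted(set(pairs)))}
        labels = [ids[p] for p in pairs]
        length *= 2
    return labels

def kmr_motifs(text, k, quorum):
    """Exact k-motifs with their extents (1-based positions)."""
    groups = {}
    for pos, label in enumerate(kmr_labels(text, k)):
        groups.setdefault(label, []).append(pos)
    return {text[p[0]:p[0] + k]: [i + 1 for i in p]
            for p in groups.values() if len(p) >= quorum}

# Hypothetical input, not taken from the slides.
found = kmr_motifs("TCAGTCAGATCAG", k=4, quorum=3)
```

Here two doubling steps (1 to 2, 2 to 4) are enough, and the only 4-motif reaching quorum 3 is TCAG with extent 1, 5, 10.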

KMRC: Degenerate Alphabet

Finding exact motifs is not enough for certain applications... (nor so challenging!)

IN: Input sequence (hence implicitly Σ = {A,C,G,T}) & motif’s alphabet (cover of Σ): {A},{C,G},{C,T}. Length: 5. Quorum: 5.

OUT: Motif {C,T}{C,G}A{C,T}A

[Figure: fragments of the input sequence with the occurrences highlighted: CCATA, TCACA, TGATA, TCACACGACA, TCATATGATA.]

Alphabet with degeneracy 2
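As a small check of what "occurring" means with a degenerate alphabet, the sketch below matches the motif {C,T}{C,G}A{C,T}A against the fragments shown on the slide. The helper name `occurrences` is an assumption; the motif and the fragments are those of the slide.

```python
def occurrences(motif, seq):
    """1-based start positions where each symbol of seq lies in the motif's set."""
    k = len(motif)
    return [
        start + 1
        for start in range(len(seq) - k + 1)
        if all(seq[start + i] in motif[i] for i in range(k))
    ]

# Motif {C,T}{C,G}A{C,T}A, written as one set of symbols per position.
motif = [set("CT"), set("CG"), set("A"), set("CT"), set("A")]
fragments = ["CCATA", "TCACA", "TGATA", "TCACACGACA", "TCATATGATA"]
hits = {frag: occurrences(motif, frag) for frag in fragments}
total = sum(len(positions) for positions in hits.values())
```

The three short fragments match once each and the two long ones match at positions 1 and 6, so the fragments alone already contain 7 occurrences, above the quorum of 5.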

Repeated structures in 3D proteins

Relational motifs with degenerate alphabet (one for symbols and one for relations)

Sequence alphabet: amino acids of primary structure.

Motifs degenerate alphabet: grouping amino acids with similar chemical properties.

Relations: distance of α-carbons in 3D tertiary structure of the protein.

Relations degenerate alphabet: e.g. discretizing the distance.

[Figure: numbered positions in 3D space, labelled with their pairwise distances.]

r(1,2) = 3, r(2,3) = 4, r(3,4) = 3, r(1,3) = 4, r(2,4) = 3, r(1,4) = 5

r(6,7) = 4, r(7,8) = 4, r(8,9) = 2, r(6,8) = 3, r(7,9) = 4, r(6,9) = 6
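To see what discretizing the distance does, the sketch below compares the two groups of distances listed above (positions 1 to 4 and positions 6 to 9) pair by pair. The binning rule (integer division by a hypothetical `bin_width`) is an assumption made for the example; the slides do not say how distances are discretized.

```python
# Distances copied from the slide, keyed by the pair of positions (i, j).
first = {(1, 2): 3, (2, 3): 4, (3, 4): 3, (1, 3): 4, (2, 4): 3, (1, 4): 5}
second = {(6, 7): 4, (7, 8): 4, (8, 9): 2, (6, 8): 3, (7, 9): 4, (6, 9): 6}

def relative(relations, start):
    """Re-key the pairs by their offset from the first position of the group."""
    return {(i - start, j - start): v for (i, j), v in relations.items()}

def conserved(rel_a, rel_b, bin_width):
    """Pairs whose distances fall in the same bin (assumed rule: value // bin_width)."""
    return sorted(p for p in rel_a
                  if rel_a[p] // bin_width == rel_b[p] // bin_width)

a, b = relative(first, 1), relative(second, 6)
exact = conserved(a, b, bin_width=1)   # no discretization
coarse = conserved(a, b, bin_width=2)  # bins {2,3}, {4,5}, {6,7}
```

Without discretization only the pair at offsets (1,2) is conserved; with bins of width 2 the pair at offsets (2,3) is conserved as well. A coarser relations alphabet therefore lets more occurrences share the same relations.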

Motifs & Extents

E.g. Motif’s alphabet C1={A,C}, C2={A,G}, C3={A,T} with degeneracy g=3

A k-motif is a k-long word on {C1,C2,C3} (occurring q times)

Different motifs can occur at the same position...

Even worse: two different motifs may have the very same extents

In AAAAAAAAAAAAAAAAAAAAA ... AAAAAAAAAAAAAAAAAAAAAAAAAAAA = $A^n$

• Every k-long word on {C1,C2,C3} is a different k-motif!

• Each one of them has extent {1,2,3,…,n-k+1} (indeed, it’s always the same…)

Hence {1,2,3,…,n-k+1} represents $g^k$ motifs!
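The sequence of n copies of A can be checked by brute force for small values. The sketch enumerates every k-long word on {C1,C2,C3} and computes its extent; n = 10 and k = 3 are arbitrary choices made for the example.

```python
from itertools import product

cover = {"C1": set("AC"), "C2": set("AG"), "C3": set("AT")}

def extent(motif, seq):
    """Set of 1-based positions where the word of cover sets matches seq."""
    k = len(motif)
    return frozenset(
        start + 1
        for start in range(len(seq) - k + 1)
        if all(seq[start + i] in cover[name] for i, name in enumerate(motif))
    )

n, k = 10, 3            # arbitrary small values
seq = "A" * n
extents = {motif: extent(motif, seq) for motif in product(cover, repeat=k)}
distinct = set(extents.values())
```

All 3^3 = 27 motifs end up with the single extent {1,...,8}, which is {1,...,n-k+1} for these values.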

Maximal motifs

e.g. motif’s alphabet: C1={A}, C2={C,G}, C3={C,T}

[Figure: fragments of the input sequence (CCACA, CCACA, CGATA, CCACACGACA, CCACACGATA) with the positions p1, p2, p3, p4, p5, p6, p7 marked.]

Motif C2 C2 C1 C2 C1 occurs at p3,p4,p5,p6,p7: non maximal

Motif C2 C2 C1 C3 C1 occurs at p3,p4,p5,p6,p7 and in p1,p2: maximal
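Read this way, a motif is non-maximal when its extent is strictly included in the extent of another motif. A minimal sketch of that test follows, with the cover of the slide but on a short input string chosen for the example (the positions p1 to p7 of the figure cannot be recovered from the transcript). The quorum is ignored here.

```python
cover = {"C1": set("A"), "C2": set("CG"), "C3": set("CT")}

def extent(motif, seq):
    """Set of 1-based positions where the word of cover sets matches seq."""
    k = len(motif)
    return frozenset(
        start + 1
        for start in range(len(seq) - k + 1)
        if all(seq[start + i] in cover[name] for i, name in enumerate(motif))
    )

def maximal(extents):
    """Keep motifs whose extent is not strictly included in another extent."""
    return {m: e for m, e in extents.items()
            if not any(e < other for other in extents.values())}

seq = "CCACACGATA"  # hypothetical input, chosen for the example
candidates = [("C2", "C2", "C1", "C2", "C1"), ("C2", "C2", "C1", "C3", "C1")]
extents = {m: extent(m, seq) for m in candidates}
kept = maximal(extents)
```

On this input C2 C2 C1 C2 C1 occurs only at position 1, while C2 C2 C1 C3 C1 occurs at positions 1 and 6, so only the second motif is kept.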

Maximal motifs: good and bad news

Good news: Each maximal k-motif can be built from two maximal (k/2)-motifs.

Bad news: Two maximal (k/2)-motifs can generate a non-maximal motif. Non-maximal motifs have to be detected and discarded at each step.

Very bad news: There can be an exponential number of maximal motifs (theoretically).

KMRelat (2003): introducing relations

Ex: Length k with quorum 3

Length k/2: one motif occurs at r1,r2,r3,... and relations are conserved; another occurs at y1,y2,y3,... and relations are conserved.

Length k: the concatenation occurs at y1,y2,y3,... and SOME relations are conserved.

There are still $O(k^2)$ relations to be checked... per each occurrence... and at each step...

Overlap steps

[Figure: two motifs of length k-d, the second shifted by d positions with respect to the first, overlap into one motif of length k.]

Only $O(d^2)$ relations to check where d is a constant
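The count can be verified by listing the pairs of positions of the k-motif that neither of the two (k-d)-motifs already covers. The sketch assumes the two motifs start at offsets 0 and d; the function name and the values of k and d are chosen for the example.

```python
def unchecked_pairs(k, d):
    """Pairs (i, j) of positions of the k-motif not covered by either (k-d)-motif."""
    left = range(0, k - d)   # positions covered by the motif at x
    right = range(d, k)      # positions covered by the motif at x+d
    return [
        (i, j)
        for i in range(k)
        for j in range(i + 1, k)
        if not (i in left and j in left) and not (i in right and j in right)
    ]

k, d = 8, 2
new = unchecked_pairs(k, d)
all_pairs = k * (k - 1) // 2
```

For k = 8 and d = 2 only 4 = d^2 pairs out of 28 are new: they pair one of the first d positions with one of the last d positions.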

Why KMR+overlap

It takes O(k) steps, each one taking O(n), hence O(kn) [regardless of the degenerate alphabet].

Possible alternatives:

KMR would take O(log k) steps, with step i concatenating two $2^i$-motifs and checking $(2^i)^2$ relations, that is $\sum_{i=1}^{(\log_2 k)-1} n \cdot 2^{2i} = n (k^2 - 4)/3 = O(k^2 n)$.

With an in depth approach (not KMR-like) it would take O(k) steps, where at step i an i-motif is extended by one position and i relations are checked, that is $\sum_{i=1}^{k-1} n \cdot i = n k (k-1)/2 = O(k^2 n)$.

KMRoverlap and relations

Inferring relational k-motifs with degenerate alphabets performing overlap steps:

Maximal motifs still suffice.

No need to explicitly store relations: the extents still suffice.

Relations refine the query and thus reduce the search space and the output size.

More sensitive motif inference.

KMRoverlap: sketch of the algorithm

l := 1;
REPEAT
  Overlap two relational l-motifs;
  Check relations and generate as many relational (l+d)-motifs as conserved ones;
  Check quorum;
  Eliminate non maximal;
  l := l+d;
UNTIL l = k.

... but there is a problem...

Pseudomotifs

An example:

Motif’s alphabet: C1={a,b}, C2={b,c}, and C3={x}. Input sequence = xbxcxaxbxc. Quorum q=2 and length k=3.

Position: 1 2 3 4 5 6 7 8 9 0

Sequence: x b x c x a x b x c

[Figure: the cover, with a and b in C1, b and c in C2, and x in C3.]

The extent {1,7} corresponds to xbx occurring twice (so far so good...) and corresponding to the motif C3 (C1∩C2) C3 (strange...)

• C3 C1 C3 has extent {1,5,7} ⊃ {1,7}

• C3 C2 C3 has extent {1,3,7} ⊃ {1,7}

C3 (C1∩C2) C3 is a pseudomotif... and {1,7} a pseudoextent

The extent {2,8} corresponds to (C1∩C2) C3 C2, but it is also the extent of the 3-motif C2 C3 C2: {2,8} is not a pseudoextent.
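The example is small enough to check by brute force. The sketch below computes the extent of every 3-long word on {C1,C2,C3} over the input sequence of the slide, then tests whether a set of positions is the extent of at least one real motif. The helper names are assumptions made for the example.

```python
from itertools import product

cover = {"C1": set("ab"), "C2": set("bc"), "C3": set("x")}
seq = "xbxcxaxbxc"
k, q = 3, 2

def extent(motif):
    """Set of 1-based positions where the word of cover sets matches seq."""
    return frozenset(
        start + 1
        for start in range(len(seq) - k + 1)
        if all(seq[start + i] in cover[name] for i, name in enumerate(motif))
    )

extents = {m: extent(m) for m in product(cover, repeat=k)}
# Extents of real motifs, i.e. words on the cover that satisfy the quorum.
real_extents = {e for e in extents.values() if len(e) >= q}

def is_pseudoextent(positions):
    """True when no word on the cover has exactly this extent."""
    return frozenset(positions) not in real_extents

odd = extents[("C3", "C1", "C3")] & extents[("C3", "C2", "C3")]
even = extents[("C1", "C3", "C2")] & extents[("C2", "C3", "C2")]
```

The first intersection is {1,7}, which no real motif has as its extent. The second is {2,8}, which is the extent of C2 C3 C2, so it is not a pseudoextent.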

Pseudomotifs

Why are pseudomotifs dangerous?

The overlap of two maximal (k-d)-motifs can generate a pseudomotif.

There can be $O(2^{|G|k})$ distinct pseudomotifs of length k.

Pseudomotifs can never be maximal. Thus:

They will never have to be output.

They will never be useful to generate longer motifs.

We need to find a way to avoid generating them...

Storing prefixes and suffixes

[Figure: two motifs of length k-d, shifted by d, overlap into a motif of length k. The prefix and the suffix of length k-d are marked on the motifs of length k.]

Each motif stores its prefixes and suffixes, also of inherited motifs.

When the extent of one motif is included in that of another, the first is eliminated and the second inherits its prefixes and suffixes.

Avoiding pseudomotifs

[Figure: two motifs of length k-d, shifted by d, each one with prefixes in the set P and with suffixes in the set S.]

Length k is generated iff S ∩ P ≠ ∅

Some interesting properties

The prefix-suffix condition avoids generating exactly pseudomotifs!

It is enough that ONE maximal motif inherits the prefix and suffix of a discarded one.

[Figure: only one of two maximal motifs inherits from the discarded motif.]

The KMRoverlap algorithm

l := 1;
while l < k do
  for each l-motif occurring at x and l-motif occurring at x+d do:
    If S ∩ P ≠ ∅ then generate an (l+d)-motif per each different set of conserved relations;
  Eliminate extents that are < q;
  Eliminate nonmaximal extents;
  l := l + d;
endwhile;
Output all remaining motifs.
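A much simplified Python sketch of the overlap loop is given below. It keeps only part of the scheme, and all of the following are assumptions of the sketch rather than features of the algorithm in the talk: d is fixed to 1, there are no relations, there is no prefix/suffix inheritance, and maximality is tested only on the final output. The requirement that the two motifs agree where they overlap stands in for the prefix-suffix condition in this restricted setting.

```python
def kmr_overlap(seq, cover, k, q):
    """Overlap steps with d = 1 on words of cover-set names (simplified sketch)."""
    # 1-motifs: one per set of the cover, with 1-based extents.
    motifs = {}
    for name, symbols in cover.items():
        ext = frozenset(i + 1 for i, c in enumerate(seq) if c in symbols)
        if len(ext) >= q:
            motifs[(name,)] = ext
    l = 1
    while l < k:
        longer = {}
        for left, ext_left in motifs.items():
            for right, ext_right in motifs.items():
                # The two l-motifs must be identical on the overlap region.
                if left[1:] != right[:l - 1]:
                    continue
                # Occurrences of left at x followed by right at x+1.
                ext = frozenset(x for x in ext_left if x + 1 in ext_right)
                if len(ext) >= q:  # quorum check
                    longer[left + right[l - 1:]] = ext
        motifs = longer
        l += 1
    # Eliminate motifs whose extent is strictly included in another extent.
    return {m: e for m, e in motifs.items()
            if not any(e < other for other in motifs.values())}

cover = {"C1": set("ab"), "C2": set("bc"), "C3": set("x")}
result = kmr_overlap("xbxcxaxbxc", cover, k=3, q=2)
```

On the pseudomotif example the sketch never builds C3 (C1∩C2) C3, because C3 C1 and C2 C3 disagree on the overlapped position. It returns C3 C1 C3, C3 C2 C3 and C1 C3 C2; C2 C3 C2 is dropped at the end because its extent {2,8} is included in {2,6,8}.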

KMRoverlap

Ex: Length k with quorum 2

Length k-d: [Figure: the motifs of length k-d, drawn as graphical symbols that the transcript does not preserve.]

Overlap two motifs of length k-d: check relations and generate as many motifs of length k as relations sets.

Length k: of the motifs generated by the first overlap, one occurs twice and one occurs once. Further overlaps generate three more motifs, each occurring twice.

Check quorum: the motif that occurs once is eliminated.

They are all maximal (just a coincidence to simplify!)

Output: four motifs, each occurring twice.

Complexity

O(k) steps.

At each step there are $O(n g^l)$ motifs of length l.

Generating new motifs takes $O(n g^l)$.

Detecting possible inclusions takes $O(n g^{2l})$.

Overall complexity in $O(k n g^{2k})$ [linear w.r.t. input size but still looks bad]

but it is a very rough approximation...

To be precise...

There are two degeneracies: g and g’

At each step there are $O(n g^l)$ motifs of length l.

Generating new motifs takes $O(n g^l + (g')^{2l})$.

Detecting possible inclusions takes $O(n (g + (g')^{2l})^2)$.

Overall complexity in $O(k n (g + (g')^{2k})^2)$ [linear w.r.t. input size but exponential in k]

but it is an even rougher approximation...

KMRoverlap: correctness and completeness

The algorithm is correct (it generates ONLY maximal k-motifs) because:

Non-maximal motifs are discarded.

It stops when k-motifs are generated.

The algorithm is complete (it generates ALL maximal k-motifs) because:

Overlapping two maximal (k-d)-motifs is enough to generate all maximal k-motifs.

The prefix-suffix condition only discards pseudomotifs.

Preliminary tests

k = 8, d = 1, q = 5, n ~ $10^3$

generated motifs   avoided pseudo-motifs   maximal motifs
            8973                       0              124
           81757                   76953             1110
          550911                 1881241              936
          165727                 1673186              167
           12502                   15668               27

Each row is one overlap step. The steps covered are 3 to 4, 4 to 5, 5 to 6, 6 to 7 and 7 to 8; the transcript lists them out of order, so the rows are not labelled here.

• As expected there are many pseudo-motifs.

• Although they have the same theoretical upper bound, the number of maximal k-motifs is considerably smaller than that of the k-motifs.

THANK YOU