motif extraction and grammar induction in language and biology

Motif Extraction and Grammar Induction in

Language and Biology

David HornTel Aviv University

http://horn.tau.ac.il

Analogy between languages

Human languages are realized in streams of speech or in lines of text

All computational problems can be realized by a Turing machine

Hereditary biology makes use of chains of nucleotides in DNA and RNA and chains of amino-acids in proteins

All are one-dimensional representations of reality

Reality, however, is not one dimensional

hence one needs a set of syntactic rules, to make sense of the text, and a semantic mapping into actions in the real world.

The distinction between syntactic and semantic levels (Chomsky 1957) is intuitively clear in human languages.

In biology syntax refers to the structure of the sequences whereas semantics relates the different sequence elements to the complicated process involving transcription, the birth of mRNA, to translation, the birth of the protein.

Examples of syntax (semantics): segmentation of the chromosome to genes (originators of proteins), promoters (transcription regulation) and 3’ UTRs (translation regulation); segmentation of genes to exons (coding) and introns (noncoding); finding motifs on promoters (TFBS relevant to enabling or forbidding transcription); finding motifs on proteins (relevant for protein interaction and protein functionality) etc.

Can we induce the rules (and, sometimes, the words) from the texts?

ADIOS is an algorithm that induces syntax, or grammar, from the text. MEX is an algorithm that extracts motifs, or patterns, from the text.

Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman.Unsupervised learning of natural languages.Proc. Natl. Acad. Sci. USA, 102 (2005) 11629-11634.

MEX: motif extraction algorithm Create a graph whose vertices are words (for

text) or letters (for biological sequences) Load all strings of text onto the graph as

paths over the vertices Given the loaded graph consider trial-paths

that may coincide with original strings of text Use context sensitive statistics to define left-

and right-moving probabilities that help to define motifs

Given a set of stringsa l i c e w a s b e g i n n i n g t o g e t v e r y t i r e d o f s i t t i n g b y h e r s i s t e r o n t h e b a n k a n d o f h a v i n g n o t h i n g t o d o o n c e o r t w i c e s h e h a d p e e p e d i n t o t h e b o o k h e r s i s t e r w a s r e a d i n g b u t i t h a d n o p i c t u r e s o r c o n v e r s a t i o n s i n i t a n d w h a t i s t h e u s e o f a b o o k t h o u g h t a l i c e w i t h o u t p i c t u r e s o r c o n v e r s a t i o n

alicewas beginning toget verytiredof sitting by hersister onthebank and of having nothing todo onceortwice shehad peep ed intothe book hersister was reading butit hadno pictures or conversation s init and what is theuseof abook thoughtalice without pictures or conversation

Find patterns in strings of letters

Motifs EXtraction (MEX)

• ∑ = {a-z}

Creating the graph (directed)…

e

h

d

v

ab

c

g

fu

t

s

r

qp

on

ij

km l

w

xy

zbeginend

(1,1)

)1,2(

)1,3(

)1,4(

)1,5()1,6(

• alice was

(2,1)(2,2)

(2,3)(2,4)

structured graph

a

b

c

d

e

f

g

h

i

j

k

l

m

n

o

p

{1003;3}

{1002;1}

{1002;2}

{1003;4}

{1003;5}

{1003;6}

{1003;8}

{1003;7}

{1003;9}

{1003;10}

{1003;11}

{1003;12}

{1003;13}

{1003;14}

Creating the graph… )Cont’(

Creating the graph… (Cont’)

random graph

a

b

c

d

e

f

g

h

i

j

k

l

m

n

o

p

a l i c e w a s b e g i n n i n g t o g e t v e r y t i r e d o f s i t t i n g b y h e r s i s t e r o n t h e b a n k a n d o f c o n v e r s a t i o n

)1(

endbegin a l ne i c

w h e n a l i c e h a d b e e n a l l t h e w a y d o w n o n e s i d e a n d u p t h e o t h e r t r y i n g e v e r y d o o r s h e w a l k e d s a d l y

)2(

dw h

Creating the graph - cont’d

a endbegin l nec i

structured graph

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

{1003;3}

{1002;1}

{1002;2}

{1003;4}

{1003;5}

{1003;6}

{1003;8}

{1003;7}

{1003;9}

{1003;10}

{1003;11}

{1003;12}

{1003;13}

{1003;14}

Searching for patterns

1

2

36

4 5 9

7

2

38

7

6 8

5 4

1

path

search path

Searching for patterns (Cont’)

L matrix (numbers of paths)

alicewas

a8770

l10464704

i4689127486

c3974016372382

e39739748870313545

w484851665792671

a212121231926248770

s171717191423779646492

b22225101463

e222246924

g222245510 Calculate conditional probabilities

l(ei;ej) = number of occurrences of sub-path (ei,ej) Where (ei,ej) is: ei→ ei+1 → ei+2 → … → ej-1 → ej

From L to P

Calculating conditional probabilities

La8770l1046i486

c397e397w48a21s17b2e2

P_Ra0.08l0.12i0.45

c0.85e1w0.12a0.44s0.81b0.12e1

P(a) = 0.08

P(l|a) = 1046/8770

P(i|al) = 486/1046

P(w|alice) = 48/397

P(e|alic) = 397/397

P(c|ali) = 397/486

P_Ralicewas

a0.08

l0.120.043

i0.450.190.069

c0.850.440.0850.022

e10.990.770.30.12

w0.120.120.10.0940.0430.024

a0.440.440.410.350.330.230.08

s0.810.810.810.830.740.60.110.06

b0.120.120.120.110.0350.0270.0150.01

e11110.80.60.640.38

g111110.830.560.42

The probability Matrix

Probability MatrixP_R = Right moving probability )end of pattern(

P_L = Left moving probability )beginning of pattern(

),(

),()...|(),(

1121

ji

jijiiiijiR eel

eeleeeeepeeP

),(

),()...|(),(

1121

ij

ijjjiiiijL eel

eeleeeeepeeP

)(SM ij

),( jiR eeP

)( ieP

),( ijL eeP

if ji ji ji

if

if

P_R

P_L

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

p

e1 e2 e3 e4 e5

significantpattern

PR(e2|e1)=3/4

PR(e4|e1e2e3)=1

PR(e5|e1e2e3e4)=1/3

PL(e4)=6/41

PL(e3|e4)=5/6

PL(e2|e3e4)=1

PL(e1|e2e3e4)=3/5

PL

SLSR

PR

PR(e1)=4/41

PR(e3|e1e2)=1

begin end

),(

),(),(

1jiR

jiRjiR eeP

eePeeD

),(

),(),(

1 ijL

ijLijL eeP

eePeeD

Detecting a significant pattern

Significance test • Pn = (e1,en) : a potential pattern edge

• m = # of paths from e1 to en r = # of paths from e1 to en+1

• ≡ )Pn is a pattern edge(

• Assume P*n+1 = the “true” Pn+1, given )e1,en(

• H0: P*n+1 ≥ Pn·η )does not diverge(

H1: P*n+1 < Pn·η

• odds to receive results at least as “extreme” as r and m are:

If the outcome is less than a predetermined α

The pattern is significant

rmreel

r

rmreel

r

eel

rpni pp

rmr

mpp

r

mmrPeeB

ninini

)1(

)!(!

!)1()|(),(

),(

0

),(

0

),(

0

n

n

P

P 1 nPm

r

Rewiring the graph • Once the algorithm has reached the stop criteria )e.g. ceases to

locate new patterns(, for each significant patterns, the sub-paths it subsumed are merged into a new vertex

• The graph is rewired in a length–significance descending order

ALICE motifs

Weight Occurrences Lengthconversation 0.98 11 11whiterabbit 1.00 22 10caterpillar 1.00 28 10interrupted 0.94 7 10procession 0.93 6 9mockturtle 0.91 56 9beautiful 1.00 16 8important 0.99 11 8continued 0.98 9 8different 0.98 9 8atanyrate 0.94 7 8difficult 0.94 7 8surprise 0.99 10 7appeared 0.97 10 7mushroom 0.97 8 7thistime 0.95 19 7suddenly 0.94 13 7business 0.94 7 7nonsense 0.94 7 7morethan 0.94 6 7remember 0.92 20 7consider 0.91 10 7curious 1.00 19 6hadbeen 1.00 17 6however 1.00 20 6perhaps 1.00 16 6hastily 1.00 16 6herself 1.00 78 6footman 1.00 14 6suppose 1.00 12 6silence 0.99 14 6witness 0.99 10 6gryphon 0.97 54 6serpent 0.97 11 6angrily 0.97 8 6croquet 0.97 8 6venture 0.95 12 6forsome 0.95 12 6timidly 0.95 9 6whisper 0.95 9 6rabbit 1.00 27 5course 1.00 25 5eplied 1.00 22 5seemed 1.00 26 5remark 1.00 28 5

Motifs selected in order of

-length

-weight (significance of drop)

Shown here are results of one run over a trial-path and the beginning of the list of motifs extracted from it

Application to Biology

Vertices of the graph: 4 or 20 lettersPaths: gene-sequences, promoter-

sequences, protein-sequences.Conditional probabilities on the graph

are proportional to the number of paths

Trial-path: testing transition probabilities to extract motifs

Extracting Motifs from Enzymes

>P54233 | 1.7.1.1LLDPRDEGTADQWIPRNASMVRFTGKHPFNGEGPLPRLMHHGFITPSPLRYVRNHGPVPKIKWDEWTVEVTGLVKRSTHFTMEKLMREFPHREFPATLVCAGNRRKEHNMVKQSIGFNWGAAGGSTSVWRGVPLRHVLKRCGILARMKGAMYVSFEGAEDLPGGGGSKYGTSVKREMAMDPSRDIILAFMQNGEPLAPDHGFPVRMIIPGFIGGRMVKWLKRIVVTEHECDSHYHYKDNRVLPSHVDAELANDEGWWYKPEYIINELNINSVITTPCHEEILPINSWTTQMPYFIRGYAYSGGGRKVTRVEVTLDGGGTWQVCTLDCPEKPNKYGKYWCWCFWSVEVEVLDLLGAREIAVRAWDEALNTQPEKLIWNVMGMMNNCWFRVKTNVCRPHKGEIGIVFEHPTQPGNQSGGWMAKEKHLEKSSES

Each enzyme sequence corresponds to a

single path

Applying MEX to oxidoreductases 6602 enzyme sequences MEX motifs are specific subsequences

V. Kunik, Z. Solan, S. Edelman, E.Ruppin and D. Horn. CSB2005

3165 motifs were obtained Distribution of MEX motifs

Enzyme MotifsN

um

ber

of

moti

fs

Length of motif

>P54233 | 1.7.1.1LLDPRDEGTADQWIPRNASMVRFTGKHPFNGEGPLPRLMHHGFITPSPLRYVRNHGPVPKIKWDEWTVEVTGLVKRSTHFTMEKLMREFPHREFPATLVCAGNRRKEHNMVKQSIGFNWGAAGGSTSVWRGVPLRHVLKRCGILARMKGAMYVSFEGAEDLPGGGGSKYGTSVKREMAMDPSRDIILAFMQNGEPLAPDHGFPVRMIIPGFIGGRMVKWLKRIVVTEHECDSHYHYKDNRVLPSHVDAELANDEGWWYKPEYIINELNINSVITTPCHEEILPINSWTTQMPYFIRGYAYSGGGRKVTRVEVTLDGGGTWQVCTLDCPEKPNKYGKYWCWCFWSVEVEVLDLLGAREIAVRAWDEALNTQPEKLIWNVMGMMNNCWFRVKTNVCRPHKGEIGIVFEHPTQPGNQSGGWMAKEKHLEKSSES

Each enzyme is represented as a ‘bag of motifs’

>P54233 | 1.7.1.1RDEGTAD, TGKHPFN, LMHHGFITP, YVRNHGPVP, WTVEVTG, PDHGFP YHYKDN, KVTRVE, YGKYWCW, MGMMNNCWF

Enzymes Representation

These 1222 MEX motifs cover 3739 enzymes

The functionality of an enzyme is determined

according to its EC number

Classification Hierarchy [ Webb, 1992 ]

n1.n2: sub-class / 2nd level

n1: class

n1.n2.n3: sub-subclass / 3rd level

n1.n2.n3.n4: precise enzymatic activity

Enzyme Function

EC number: n1.n2.n3.n4 (a unique identifier)

An example:• EC 1 .12 . 1 . n4

hydrogen as electron donors

NAD+ / NADP+ as electron acceptors

NAD+ oxidoreductase

oxidoreductases

EC 1.12.1. 2

H2 + NAD+ = H+ + NADH

Current knowledge regarding enzyme classification High sequence similarity is required to

guarantee functional similarity of proteins. A recent analysis of enzymes by Tian and

Skolnick 2003 suggests that 40% pairwise sequence identity can be used as a threshold for safe transferability of the first three digits of the Enzyme Commission (EC) number.

The EC number, which is of the form:n1:n2:n3:n4 specifies the location of the enzyme on a tree of functionalities.

Current knowledge regarding enzyme classification

Using pairwise sequence similarity, and combining it with the Support Vector Machine (SVM) classification approach, Liao and Noble 2003 have argued that they obtain a significantly improved remote homology detection relative to existing state-of-the-art algorithms.

Cai at al (2003,2004) have applied SVM to a protein description based on physico-chemical features of their amino-acids such as hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility.

Ben-Hur and Brutlag 2004 used the eMotif approach and analyzed the oxidoreductases as `bags of motis’.With appropriate feature selection methods theyobtain success rates over 90% for a variety of classifiers.

O17433 1148 262 463 610 7987 1627 260P19992 124 7290 27 111 3706 18128 3432Q01284 6652 198 1489 710 425 64 55Q12723 693 145 7290 3712 65 543 522P14060 455 2664 848 55 128 256 74Q60555 7290 3712 65 543 522 6748 7159

The MEX method

• SVM classifier input:

Classification Tasks:

• 16 2nd level subclasses

• 32 3rd level sub-subclasses

A linear SVM is applied to evaluate the predictive

power of MEX motifs Enzyme sequences are randomly partitioned

into a training-set and a test-set (75%-25%) 16 2nd level classification tasks 32 3rd level classification tasks Performance measurement: Jaccard Score

The train-test procedure was repeated 40 times

to gather sufficient statistics

tptp + fp + fn

J =

Methods

Results

Average Jaccard scores: 2nd level: 0.88 ± 0.06 3rd level: 0.84 ± 0.09

EC subclass

# o

f seq

uen

ces EC subclass

Jaccard

score

2nd level results

0.35

0.4

0.45

0.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

11

.1

1.1

4

1.6

1.9

1.2

1.1

1

1.3

1.8

1.1

5

1.4

1.1

8

1.7

1.1

7

1.1

3

1.1

0

1.5

MEXSmith-Waterman SVMProt

* * **

* ***

*

*

Jaccard

Score

EC Subclassα < 0.01

2nd Level Classification

0.35

0.4

0.45

0.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

1.05

1.1

1.1

.1

1.9

.3

1.2

.1

1.6

.5

1.1

1.1

1.1

4.1

4

1.1

5.1

1.3

.3

1.8

.4

1.1

8.6

1.1

4.1

3

1.8

.1

1.1

7.4

1.4

.1

1.6

.99

1.1

3.1

1

1.7

.1

1.4

.3

1.1

4.9

9

1.3

.1

1.2

.4

1.1

4.1

5

1.3

.99

1.1

0.3

1.1

4.1

9

1.5

.1

1.1

4.1

1

1.1

4.1

6

1.6

.2

1.1

8.1

1.4

.99

1.1

.99

MEXSmith-Waterman

**

* * *

*

* *

*

*

*

*

*

* * **

* *

*

* *

3rd Level Classification

α < 0.01

Jaccard

Score

EC Sub-subclass

Summary so far…

MEX is an unsupervised motif extraction method that

finds biologically meaningful motifs

The meaning of the motifs is established by the

classification tasks

Classifies enzymes better than Smith-Waterman

and outperforms SVMProt

There exists correspondence between MEX motifs

and PROSITE biologically significant patterns

Specific motifs of length 6 and longer specify

the functionality of enzymes

ADIOS (AutomaticDIstillation Of Structure)

Representation of a corpus (of sentences) as paths over a graph whose vertices are lexical elements (words)

Motif Extraction (MEX) procedure for establishing new vertices thus progressively redefining the graph in an unsupervised fashion

Recursive Generalization

ADIOS: recursive generalization

1. For each path 1. Slide a context window of size L2. For each location i {0< i <=L}

1. Look for all paths with identical prefix at ip {0< ip<i}

2. and identical suffix at is {i< is <L}

3. If a ‘generalized’ pattern is found )and significant…(1. Add E – the equivalence class to the graph2. Rewire

2. Continue with MEX and ADIOS until no significant pattern is found

Loading sentences onto a graph whose vertices are words

• Is that a cat?

• Is that a dog?

• And is that a horse?

• Where is the dog?

Results will be finding the pattern

P=‘is that a E ’?

with an equivalence class

E={cat, horse, dog}

the

dog

where

cat

horse

and

a

?

begin end

is

that

a lexicon ...

the

dog

where

cat

horse

and

a

?

begin end

is

that

a lexicon... and a path induced by a string

the

dog

where

cat

horse

and

a

?

begin end

is

that

a lexicon... and a graph induced by a corpus of strings

8

5

54

6

7

e2 e3 e4

4

7

6

54

new vertex

11

A

B

E

took chair

bed

table

the

5

to

E

E(i+2)E

blue

green

redthe

E

table

bed

green

blue table

bedbed

chair

D

F

is

table

chair

bed

L

took the to

G

e4

8

8

storedequivalence class

C

E

==

ischairthe red

newequivalence class

p

e1 e2 e3 e4 e5

significantpattern

PL(e4)=6/Ne

PL

PR(e1)=4/Ne

begin end0

0.2

0.4

0.6

0.8

1

PR(e2)=3/4

PR(e4)=1

PR(e5)=1/3

PL(e3)=5/6

PL(e2)=1

PL(e1)=3/5

PR

PR(e3)=1

41

7

6

55

4

4

1 1

6

6

7

7

2

2

8

3

e1

e2 e3

33

7

2

2

1

5

search path

3

8

i+2 i+3i+1i

C

newequivalence class

i+2 i+3i+1i

mergingpoint

dispersionpoint

P

P

P

P

E

Generalization

1205

567 321120132234 621987

321234987 1203

567 321120132234 621987 2000

321234987 1203 3211203

234987 1204

987 2001 1204

The Model: The training process

Beth

Cin

dy

Georg

e

Jim

Joe

Pam

70

is

cele

bra

ting

livin

g

pla

yin

g

sta

yin

g

work

ing

65

66

extr

em

ely

very

98

far

aw

ay

67

1012

3

4

56

7

9

10

11

1213

1201 8

144begin end

begin end

Beth

Cin

dy

Georg

e

Jim

Joe

Pam

70

is

cele

bra

ting

livin

g

pla

yin

g

sta

yin

g

work

ing

65

661

2

45

6

120

3

extr

em

ely

very

98

far

aw

ay

67

1

2

3

4 5

101

begin endGeorge is working very

far

aw

ay

1 2

67A

B

Croot-pattern(RP)

P144s begin P144 end

P120 P101P120 E70 P66E70 Beth | Cindy...P66 is E65E65 celebrating | living...

P101 E98 P67E98 extremely | veryP67 far away

D

First pattern formation

Higher hierarchies: patterns (P) constructed of other Ps, equivalence classes (E) and terminals (T)

Final stage: root pattern

CFG: context free grammar

Trees to be read from top to bottom and from left to right

student learns from teacher Teacher generates a corpus of

sentences Student generates syntax out of

significant patterns and equivalence classes

Unseen teacher-generated patterns are checked by student (recall)

Student-generated patterns are checked by teacher (significance)

ATIS experiments(Airline Transport Information System)

The ATIS-CFG is a hand-made CFG of 4592 rules, constructed to provide good recall (45%) of ATIS-NL, a corpus of natural language (13,000 sentences).

We train multiple ADIOS learners using ATIS-CFG as the teacher. Recursion is limited to depth 10. In testing performance, precision is defined by taking the mean across individual learners, while for recall acceptance by one learner suffices.

ADIOS learning from ATIS-CFG (4592 rules)using different numbers of learners, and different window length L

0

0.6

0 0.2 0.4 0.6 0.8 1

Over-generalization

L=5

L=4L=3

10,000Sentences

120,000Sentences

40,000Sentences

120,000Sentences

L=3

L=5

L=4

L=5

L=4L=3L=5

L=6

150 learners

30 learners

L=7

L=6L=6

Pre

cisi

on

1

0.4

0.2

0.8

B

Recall

ATIS experiments

The ATIS natural language corpus contains 13,000 sentences.Training ADIOS on it leads to recall of 40% (ATIS-CFG reaches recall of 45%). Nonetheless human judged precision is remarkable: 8 subjects judged the grammatical acceptability to be roughly the same as that of ATIS-NL!ALL this while ATIS-CFG produces 99% of ungrammatical sentences!

0.00

0.20

0.40

0.60

0.80

1.00

ADIOS ATIS-N

Pre

cis

ion

A B

Chinese

Spanish

French

English

Swedish

Danish

C

D E

Meta-analysis of ADIOS results.

We define a pattern spectrum as the histogram of pattern types, whose bins are labeled by sequences such as (T,P) or (E,E,T), E standing for equivalence class, T for tree-terminal (original unit) and P for significant pattern.

We apply this analysis to the Parallel Bible, a text containing 31,000 verses in six different languages.

TT

TE

TP

ET

EE

EP

PT

PE

PP

TTT

TTE

TTP

TE

T

TE

E

TE

P

TP

T

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

English

Spanish

Swedish

Chinese

Danish

French

0.60

0.65

0.70

0.75

0.80

0.85

0.90

0.95

1.00

0 200 400 600 8000.60

0.65

0.70

0.75

0.80

0.85

0.90

0.95

1.00

0 200 400 600 800

A B

Chinese

Spanish

French

English

Swedish

Danish

C

D E

TT

TE

TP

ET

EE

EP

PT

PE

PP

TTT

TTE

TTP

TE

T

TE

E

TE

P

TP

T

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

A B

Chinese

Spanish

French

English

Swedish

Danish

C

D E

The typological relations among six different natural languages, as judged according to pattern spectra. These are accepted relations according to linguists.

Phylogeny of languages

Summary MEX is a motif-extraction method applicable to

linguistic texts. We have shown its applicability to proteins.

MEX motifs serve as ‘words’, allowing for enzyme classification: from sequence to function!

ADIOS is a grammar induction algorithm, employing MEX in a space of words, patterns and equivalence classes, and constructing a CFG representing syntax of the data.

Thus the syntactic analysis of linguistic corpora and biological texts is put on common grounds.

Unsupervised learning of natural languages PNAS 102 (2005) 11629

More information at http://adios.tau.ac.il

motif extraction and grammar induction in language and biology

Documents

t i o n s i n i t

t i s t h e u s e o

t i o nalicewas

h e r s i s t e r w

t h o u g h t

d i n g b u t i t h

n d o f h

n d w h