Motif Extraction and Grammar Induction in Language and Biology
David Horn, Tel Aviv University
http://horn.tau.ac.il

Page 1: Motif Extraction  and Grammar Induction in Language and Biology

Motif Extraction and Grammar Induction in Language and Biology

David Horn
Tel Aviv University

http://horn.tau.ac.il

Page 2: Motif Extraction  and Grammar Induction in Language and Biology

Analogy between languages

Human languages are realized in streams of speech or in lines of text

Every computable problem can be realized by a Turing machine, which reads and writes a one-dimensional tape

Hereditary biology makes use of chains of nucleotides in DNA and RNA and chains of amino-acids in proteins

All are one-dimensional representations of reality

Page 3: Motif Extraction  and Grammar Induction in Language and Biology

Reality, however, is not one-dimensional. Hence one needs a set of syntactic rules to make sense of the text, and a semantic mapping into actions in the real world.

The distinction between syntactic and semantic levels (Chomsky 1957) is intuitively clear in human languages.

In biology, syntax refers to the structure of the sequences, whereas semantics relates the different sequence elements to the complicated process that runs from transcription (the birth of mRNA) to translation (the birth of the protein).

Examples of syntax (semantics):
• segmentation of the chromosome into genes (originators of proteins), promoters (transcription regulation) and 3' UTRs (translation regulation)
• segmentation of genes into exons (coding) and introns (noncoding)
• finding motifs on promoters (TFBS relevant to enabling or forbidding transcription)
• finding motifs on proteins (relevant for protein interaction and protein functionality), etc.

Page 4: Motif Extraction  and Grammar Induction in Language and Biology

Can we induce the rules (and, sometimes, the words) from the texts?

ADIOS is an algorithm that induces syntax, or grammar, from the text. MEX is an algorithm that extracts motifs, or patterns, from the text.

Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman. Unsupervised learning of natural languages. Proc. Natl. Acad. Sci. USA 102 (2005) 11629-11634.

Page 5: Motif Extraction  and Grammar Induction in Language and Biology

MEX: motif extraction algorithm

• Create a graph whose vertices are words (for text) or letters (for biological sequences)
• Load all strings of text onto the graph as paths over the vertices
• Given the loaded graph, consider trial-paths that may coincide with original strings of text
• Use context-sensitive statistics to define left- and right-moving probabilities that help to define motifs
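The first two steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `load_paths`, the `begin`/`end` sentinel vertices and the (path index, position) edge labels are assumptions made for the example.

```python
from collections import defaultdict

def load_paths(strings):
    """Load each string onto a directed graph as a path over letter vertices.

    Returns (edges, paths): `edges` maps a (u, v) vertex pair to the list of
    (path_index, position) labels of the paths that cross that edge.
    """
    edges = defaultdict(list)
    paths = []
    for k, s in enumerate(strings, start=1):
        path = ["begin"] + list(s) + ["end"]
        paths.append(path)
        for pos in range(len(path) - 1):
            edges[(path[pos], path[pos + 1])].append((k, pos + 1))
    return edges, paths

edges, paths = load_paths(["alicewas", "whenalice"])
print(edges[("a", "l")])  # labels of the paths that use the edge a -> l
```

Keeping the per-edge labels makes it possible to follow an individual string through vertices that many strings share.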

Page 6: Motif Extraction  and Grammar Induction in Language and Biology

Given a set of strings

a l i c e w a s b e g i n n i n g t o g e t v e r y t i r e d o f s i t t i n g b y h e r s i s t e r o n t h e b a n k a n d o f h a v i n g n o t h i n g t o d o o n c e o r t w i c e s h e h a d p e e p e d i n t o t h e b o o k h e r s i s t e r w a s r e a d i n g b u t i t h a d n o p i c t u r e s o r c o n v e r s a t i o n s i n i t a n d w h a t i s t h e u s e o f a b o o k t h o u g h t a l i c e w i t h o u t p i c t u r e s o r c o n v e r s a t i o n

alicewas beginning toget verytiredof sitting by hersister onthebank and of having nothing todo onceortwice shehad peep ed intothe book hersister was reading butit hadno pictures or conversation s init and what is theuseof abook thoughtalice without pictures or conversation

Find patterns in strings of letters

Motifs EXtraction (MEX)

Page 7: Motif Extraction  and Grammar Induction in Language and Biology

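Creating the graph (directed)…

• ∑ = {a-z}

[Figure: directed graph with one vertex per letter a-z plus "begin" and "end" vertices. The string "alice was" is drawn as a path; edge labels (1,1), (1,2), (1,3), (1,4), (1,5), (1,6) and (2,1), (2,2), (2,3), (2,4) appear along the paths.]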

Page 8: Motif Extraction  and Grammar Induction in Language and Biology

Creating the graph… (Cont')

[Figure: structured graph over vertices a-p, with edge labels {1002;1}, {1002;2} and {1003;3} through {1003;14}.]

Page 9: Motif Extraction  and Grammar Induction in Language and Biology

Creating the graph… (Cont')

[Figure: random graph over vertices a-p.]

Page 10: Motif Extraction  and Grammar Induction in Language and Biology

Creating the graph (cont'd)

(1) a l i c e w a s b e g i n n i n g t o g e t v e r y t i r e d o f s i t t i n g b y h e r s i s t e r o n t h e b a n k a n d o f c o n v e r s a t i o n

(2) w h e n a l i c e h a d b e e n a l l t h e w a y d o w n o n e s i d e a n d u p t h e o t h e r t r y i n g e v e r y d o o r s h e w a l k e d s a d l y

[Figure: strings (1) and (2) drawn as paths over a graph with vertices begin, end, a, l, i, c, e, n, w, h, d.]

Page 11: Motif Extraction  and Grammar Induction in Language and Biology

Searching for patterns

[Figure: structured graph over vertices 1-16, with edge labels {1002;1}, {1002;2} and {1003;3} through {1003;14}.]

Page 12: Motif Extraction  and Grammar Induction in Language and Biology

Searching for patterns (Cont')

[Figure: graph of numbered vertices 1-9 showing a path and a search path.]

Page 13: Motif Extraction  and Grammar Induction in Language and Biology

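Searching for patterns (Cont')

[Figure: probability p (0 to 1) along the search path begin → e1 → e2 → e3 → e4 → e5 → end, with the curves P_R and P_L, the drops S_L and S_R, and the significant pattern marked.]

Right-moving probabilities:
P_R(e1) = 4/41
P_R(e2|e1) = 3/4
P_R(e3|e1e2) = 1
P_R(e4|e1e2e3) = 1
P_R(e5|e1e2e3e4) = 1/3

Left-moving probabilities:
P_L(e4) = 6/41
P_L(e3|e4) = 5/6
P_L(e2|e3e4) = 1
P_L(e1|e2e3e4) = 3/5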

Page 14: Motif Extraction  and Grammar Induction in Language and Biology

L matrix (numbers of paths)

        a      l      i      c      e      w      a      s
a    8770
l    1046   4704
i     486    912   7486
c     397    401    637   2382
e     397    397    488    703  13545
w      48     48     51     66    579   2671
a      21     21     21     23    192    624   8770
s      17     17     17     19    142    377    964   6492
b       2      2      2      2      5     10     14     63
e       2      2      2      2      4      6      9     24
g       2      2      2      2      4      5      5     10

Calculate conditional probabilities

l(ei;ej) = number of occurrences of the sub-path (ei,ej), where (ei,ej) is: ei → ei+1 → ei+2 → … → ej-1 → ej

Page 15: Motif Extraction  and Grammar Induction in Language and Biology

From L to P

Calculating conditional probabilities (first column of L and of P_R):

        L      P_R
a    8770     0.08
l    1046     0.12
i     486     0.45
c     397     0.85
e     397     1
w      48     0.12
a      21     0.44
s      17     0.81
b       2     0.12
e       2     1

P(a) = 0.08
P(l|a) = 1046/8770
P(i|al) = 486/1046
P(c|ali) = 397/486
P(e|alic) = 397/397
P(w|alice) = 48/397
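The ratios listed here can be recomputed directly from the path counts. The snippet below is a small sketch that uses the counts from the first column of the L matrix; the names `prefix_counts` and `right_moving` are made up for the illustration.

```python
from fractions import Fraction

# Counts l(e_1; e_j) from the first column of the L matrix:
# number of paths that contain "a", "al", "ali", ...
prefix_counts = [("a", 8770), ("al", 1046), ("ali", 486),
                 ("alic", 397), ("alice", 397), ("alicew", 48)]

def right_moving(counts):
    """P_R for each extension: l(e_1; e_j) / l(e_1; e_{j-1})."""
    probs = {}
    for (_, prev), (sub, cur) in zip(counts, counts[1:]):
        probs[sub] = Fraction(cur, prev)
    return probs

p_r = right_moving(prefix_counts)
for sub, p in p_r.items():
    # e.g. the entry for "alic" is printed as P(c|ali)
    print(f"P({sub[-1]}|{sub[:-1]}) = {p} = {float(p):.3f}")
```

The value stays at 1 while the path follows "alic" to "alice" and falls sharply at the next letter, which is the kind of drop MEX uses to mark the end of a pattern.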

Page 16: Motif Extraction  and Grammar Induction in Language and Biology

The probability matrix

P_R     a      l      i      c      e      w      a      s
a    0.08
l    0.12   0.043
i    0.45   0.19   0.069
c    0.85   0.44   0.085  0.022
e    1      0.99   0.77   0.3    0.12
w    0.12   0.12   0.1    0.094  0.043  0.024
a    0.44   0.44   0.41   0.35   0.33   0.23   0.08
s    0.81   0.81   0.81   0.83   0.74   0.6    0.11   0.06
b    0.12   0.12   0.12   0.11   0.035  0.027  0.015  0.01
e    1      1      1      1      0.8    0.6    0.64   0.38
g    1      1      1      1      1      0.83   0.56   0.42

Page 17: Motif Extraction  and Grammar Induction in Language and Biology

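Probability Matrix

P_R = right-moving probability (end of pattern)
P_L = left-moving probability (beginning of pattern)

$$P_R(e_i; e_j) = p(e_j \mid e_i e_{i+1} e_{i+2} \ldots e_{j-1}) = \frac{l(e_i; e_j)}{l(e_i; e_{j-1})}$$

$$P_L(e_j; e_i) = p(e_i \mid e_{i+1} e_{i+2} \ldots e_{j-1} e_j) = \frac{l(e_i; e_j)}{l(e_{i+1}; e_j)}$$

$$M_{ij}(S) = \begin{cases} P_R(e_j; e_i) & \text{if } i > j \\ P(e_i) & \text{if } i = j \\ P_L(e_j; e_i) & \text{if } i < j \end{cases}$$

P_R fills the part of the matrix below the diagonal, P_L the part above it.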

Page 18: Motif Extraction  and Grammar Induction in Language and Biology
Page 19: Motif Extraction  and Grammar Induction in Language and Biology

Detecting a significant pattern

[Figure: the search path begin → e1 → e2 → e3 → e4 → e5 → end with the curves P_R and P_L (p from 0 to 1), the drops S_L and S_R, and the significant pattern marked. Values shown: P_R(e1)=4/41, P_R(e2|e1)=3/4, P_R(e3|e1e2)=1, P_R(e4|e1e2e3)=1, P_R(e5|e1e2e3e4)=1/3; P_L(e4)=6/41, P_L(e3|e4)=5/6, P_L(e2|e3e4)=1, P_L(e1|e2e3e4)=3/5.]

$$D_R(e_i; e_j) = \frac{P_R(e_i; e_j)}{P_R(e_i; e_{j-1})}$$

$$D_L(e_j; e_i) = \frac{P_L(e_j; e_i)}{P_L(e_j; e_{i+1})}$$
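As a worked illustration, the drop ratios can be evaluated on the e1 … e5 example. This is a sketch only: the threshold value `ETA` is a hypothetical choice, not a number taken from the slides, and the function names are made up for the example.

```python
from fractions import Fraction as F

# Right- and left-moving probabilities of the e1..e5 example.
p_right = [F(4, 41), F(3, 4), F(1), F(1), F(1, 3)]  # P_R at e1, e2, e3, e4, e5
p_left = [F(3, 5), F(1), F(5, 6), F(6, 41)]         # P_L at e1, e2, e3, e4

ETA = F(65, 100)  # hypothetical drop threshold (an assumption)

def right_drops(p, eta):
    """0-based indices j where D_R = p[j] / p[j-1] falls below eta."""
    return [j for j in range(1, len(p)) if p[j] / p[j - 1] < eta]

def left_drops(p, eta):
    """0-based indices i where D_L = p[i] / p[i+1] falls below eta."""
    return [i for i in range(len(p) - 1) if p[i] / p[i + 1] < eta]

print(right_drops(p_right, ETA))  # index 4, i.e. the step into e5
print(left_drops(p_left, ETA))    # index 0, i.e. the step back into e1
```

With this threshold the right-moving probability drops on entering e5 and the left-moving probability drops on entering e1, so the candidate pattern spans e2 to e4.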

Page 20: Motif Extraction  and Grammar Induction in Language and Biology

Significance test

• P_n = P_R(e_1; e_n): a potential pattern edge
• m = # of paths from e_1 to e_n; r = # of paths from e_1 to e_{n+1}, so that r/m = P_{n+1}
• P_{n+1} / P_n < η ≡ (P_n is a pattern edge)
• Assume P*_{n+1} = the "true" P_{n+1}, given (e_1, e_n)
• H0: P*_{n+1} ≥ P_n·η (does not diverge)
  H1: P*_{n+1} < P_n·η
• The odds of receiving results at least as "extreme" as r and m are:

$$B(e_1; e_n) = \sum_{x=0}^{r} P(x \mid m) = \sum_{x=0}^{r} \binom{m}{x} p^x (1-p)^{m-x} = \sum_{x=0}^{r} \frac{m!}{x!\,(m-x)!}\, p^x (1-p)^{m-x}, \qquad p = \eta P_n,\; r = l(e_1; e_{n+1})$$

If the outcome is less than a predetermined α, the pattern is significant.
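The test amounts to a one-sided binomial tail. The sketch below evaluates it with the standard library; the function name `drop_p_value` and the numbers in the example call (30 paths, 5 continuations, η = 0.65, α = 0.01) are illustrative assumptions, not values from the slides.

```python
from math import comb

def drop_p_value(m, r, p_n, eta):
    """Probability of at most r continuations out of m paths if the true
    continuation probability sat at the edge of H0, p = eta * p_n."""
    p = eta * p_n
    return sum(comb(m, x) * p**x * (1 - p)**(m - x) for x in range(r + 1))

# Hypothetical case: 30 paths reach e_n, only 5 continue to e_{n+1}.
ALPHA = 0.01
b = drop_p_value(m=30, r=5, p_n=1.0, eta=0.65)
print(b, b < ALPHA)
```

A value of `b` below `ALPHA` rejects H0, so the drop after e_n is treated as significant and e_n as a pattern edge.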

Page 21: Motif Extraction  and Grammar Induction in Language and Biology

Rewiring the graph

• Once the algorithm has reached the stop criterion (e.g. it ceases to locate new patterns), for each significant pattern the sub-paths it subsumes are merged into a new vertex

• The graph is rewired in descending order of length, then significance
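The merge step can be pictured on a single path. The helper below is a simplified sketch (the name `rewire` and the vertex label `P1` are hypothetical): it replaces each occurrence of a motif by one new vertex and leaves the rest of the path untouched.

```python
def rewire(path, motif, new_vertex):
    """Replace every occurrence of `motif` (a list of vertices) in `path`
    by the single vertex `new_vertex`, scanning left to right."""
    out, i, n = [], 0, len(motif)
    while i < len(path):
        if path[i:i + n] == motif:
            out.append(new_vertex)
            i += n
        else:
            out.append(path[i])
            i += 1
    return out

path = list("thealicewas")
print(rewire(path, list("alice"), "P1"))
```

Applying the motifs longest first, as the slide prescribes, keeps a short motif from breaking up a longer one that contains it.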

Page 22: Motif Extraction  and Grammar Induction in Language and Biology

ALICE motifs

Motif          Weight  Occurrences  Length
conversation   0.98    11           11
whiterabbit    1.00    22           10
caterpillar    1.00    28           10
interrupted    0.94    7            10
procession     0.93    6            9
mockturtle     0.91    56           9
beautiful      1.00    16           8
important      0.99    11           8
continued      0.98    9            8
different      0.98    9            8
atanyrate      0.94    7            8
difficult      0.94    7            8
surprise       0.99    10           7
appeared       0.97    10           7
mushroom       0.97    8            7
thistime       0.95    19           7
suddenly       0.94    13           7
business       0.94    7            7
nonsense       0.94    7            7
morethan       0.94    6            7
remember       0.92    20           7
consider       0.91    10           7
curious        1.00    19           6
hadbeen        1.00    17           6
however        1.00    20           6
perhaps        1.00    16           6
hastily        1.00    16           6
herself        1.00    78           6
footman        1.00    14           6
suppose        1.00    12           6
silence        0.99    14           6
witness        0.99    10           6
gryphon        0.97    54           6
serpent        0.97    11           6
angrily        0.97    8            6
croquet        0.97    8            6
venture        0.95    12           6
forsome        0.95    12           6
timidly        0.95    9            6
whisper        0.95    9            6
rabbit         1.00    27           5
course         1.00    25           5
eplied         1.00    22           5
seemed         1.00    26           5
remark         1.00    28           5

Motifs are selected in order of:
- length
- weight (significance of drop)

Shown here are the results of one run over a trial-path, and the beginning of the list of motifs extracted from it.

Page 23: Motif Extraction  and Grammar Induction in Language and Biology

Application to Biology

• Vertices of the graph: 4 letters (nucleotides) or 20 letters (amino acids)
• Paths: gene sequences, promoter sequences, protein sequences
• Conditional probabilities on the graph are proportional to the number of paths
• Trial-path: testing transition probabilities to extract motifs

Page 24: Motif Extraction  and Grammar Induction in Language and Biology

Extracting Motifs from Enzymes

>P54233 | 1.7.1.1
LLDPRDEGTADQWIPRNASMVRFTGKHPFNGEGPLPRLMHHGFITPSPLRYVRNHGPVPKIKWDEWTVEVTGLVKRSTHFTMEKLMREFPHREFPATLVCAGNRRKEHNMVKQSIGFNWGAAGGSTSVWRGVPLRHVLKRCGILARMKGAMYVSFEGAEDLPGGGGSKYGTSVKREMAMDPSRDIILAFMQNGEPLAPDHGFPVRMIIPGFIGGRMVKWLKRIVVTEHECDSHYHYKDNRVLPSHVDAELANDEGWWYKPEYIINELNINSVITTPCHEEILPINSWTTQMPYFIRGYAYSGGGRKVTRVEVTLDGGGTWQVCTLDCPEKPNKYGKYWCWCFWSVEVEVLDLLGAREIAVRAWDEALNTQPEKLIWNVMGMMNNCWFRVKTNVCRPHKGEIGIVFEHPTQPGNQSGGWMAKEKHLEKSSES

Each enzyme sequence corresponds to a single path

• Applying MEX to oxidoreductases
• 6602 enzyme sequences
• MEX motifs are specific subsequences

V. Kunik, Z. Solan, S. Edelman, E. Ruppin and D. Horn. CSB2005

Page 25: Motif Extraction  and Grammar Induction in Language and Biology

Enzyme Motifs

3165 motifs were obtained

[Figure: distribution of MEX motifs; x-axis: length of motif, y-axis: number of motifs.]

Page 26: Motif Extraction  and Grammar Induction in Language and Biology

Enzymes Representation

>P54233 | 1.7.1.1
LLDPRDEGTADQWIPRNASMVRFTGKHPFNGEGPLPRLMHHGFITPSPLRYVRNHGPVPKIKWDEWTVEVTGLVKRSTHFTMEKLMREFPHREFPATLVCAGNRRKEHNMVKQSIGFNWGAAGGSTSVWRGVPLRHVLKRCGILARMKGAMYVSFEGAEDLPGGGGSKYGTSVKREMAMDPSRDIILAFMQNGEPLAPDHGFPVRMIIPGFIGGRMVKWLKRIVVTEHECDSHYHYKDNRVLPSHVDAELANDEGWWYKPEYIINELNINSVITTPCHEEILPINSWTTQMPYFIRGYAYSGGGRKVTRVEVTLDGGGTWQVCTLDCPEKPNKYGKYWCWCFWSVEVEVLDLLGAREIAVRAWDEALNTQPEKLIWNVMGMMNNCWFRVKTNVCRPHKGEIGIVFEHPTQPGNQSGGWMAKEKHLEKSSES

Each enzyme is represented as a 'bag of motifs':

>P54233 | 1.7.1.1
RDEGTAD, TGKHPFN, LMHHGFITP, YVRNHGPVP, WTVEVTG, PDHGFP, YHYKDN, KVTRVE, YGKYWCW, MGMMNNCWF

These 1222 MEX motifs cover 3739 enzymes
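A bag-of-motifs representation reduces to exact substring lookups. The sketch below runs on the opening residues of P54233 and four of its motifs from the slide; the function name `bag_of_motifs` and the motif `AAAAAAA` (added as a deliberate non-match) are assumptions of the example.

```python
def bag_of_motifs(sequence, motifs):
    """Return the motifs that occur as exact substrings of the sequence."""
    return [m for m in motifs if m in sequence]

# Opening residues of P54233 and a few candidate motifs;
# "AAAAAAA" is a made-up motif that should not match.
seq = "LLDPRDEGTADQWIPRNASMVRFTGKHPFNGEGPLPRLMHHGFITPSPLRYVRNHGPVPK"
motifs = ["RDEGTAD", "TGKHPFN", "LMHHGFITP", "YVRNHGPVP", "AAAAAAA"]
print(bag_of_motifs(seq, motifs))
```

The order of the motifs within the sequence is discarded, which is what makes the result a bag rather than a string.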

Page 27: Motif Extraction  and Grammar Induction in Language and Biology

Enzyme Function

The functionality of an enzyme is determined according to its EC number.

EC number: n1.n2.n3.n4 (a unique identifier)

Classification Hierarchy [Webb, 1992]:
n1: class
n1.n2: sub-class / 2nd level
n1.n2.n3: sub-subclass / 3rd level
n1.n2.n3.n4: precise enzymatic activity
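The hierarchy is simply the set of prefixes of the EC number. A short sketch, with the function name `ec_levels` and the dictionary keys chosen for this example:

```python
def ec_levels(ec):
    """Split an EC number 'n1.n2.n3.n4' into its hierarchy prefixes."""
    parts = ec.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four fields, got {ec!r}")
    names = ["class", "subclass", "sub-subclass", "activity"]
    return {name: ".".join(parts[:k + 1]) for k, name in enumerate(names)}

print(ec_levels("1.7.1.1"))
```

Two enzymes belong to the same 2nd-level class exactly when their `subclass` entries are equal, and likewise for the 3rd level.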

Page 28: Motif Extraction  and Grammar Induction in Language and Biology

An example: EC 1.12.1.n4

• EC 1: oxidoreductases
• EC 1.12: hydrogen as electron donor
• EC 1.12.1: NAD+ / NADP+ as electron acceptors
• EC 1.12.1.2: NAD+ oxidoreductase
  H2 + NAD+ = H+ + NADH

Page 29: Motif Extraction  and Grammar Induction in Language and Biology

Current knowledge regarding enzyme classification

• High sequence similarity is required to guarantee functional similarity of proteins.

• A recent analysis of enzymes by Tian and Skolnick (2003) suggests that 40% pairwise sequence identity can be used as a threshold for safe transferability of the first three digits of the Enzyme Commission (EC) number.

• The EC number, which is of the form n1.n2.n3.n4, specifies the location of the enzyme on a tree of functionalities.

Page 30: Motif Extraction  and Grammar Induction in Language and Biology

Current knowledge regarding enzyme classification

Using pairwise sequence similarity, and combining it with the Support Vector Machine (SVM) classification approach, Liao and Noble 2003 have argued that they obtain a significantly improved remote homology detection relative to existing state-of-the-art algorithms.

Cai et al. (2003, 2004) have applied SVM to a protein description based on physico-chemical features of the amino acids, such as hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility.

Ben-Hur and Brutlag (2004) used the eMotif approach and analyzed the oxidoreductases as 'bags of motifs'. With appropriate feature selection methods they obtain success rates over 90% for a variety of classifiers.

Page 31: Motif Extraction  and Grammar Induction in Language and Biology

The MEX method

• SVM classifier input:

O17433  1148  262   463   610   7987  1627   260
P19992  124   7290  27    111   3706  18128  3432
Q01284  6652  198   1489  710   425   64     55
Q12723  693   145   7290  3712  65    543    522
P14060  455   2664  848   55    128   256    74
Q60555  7290  3712  65    543   522   6748   7159

Classification Tasks:

• 16 2nd level subclasses

• 32 3rd level sub-subclasses

Page 32: Motif Extraction  and Grammar Induction in Language and Biology

Methods

• A linear SVM is applied to evaluate the predictive power of MEX motifs
• Enzyme sequences are randomly partitioned into a training set and a test set (75%-25%)
• 16 2nd level classification tasks
• 32 3rd level classification tasks
• Performance measurement: Jaccard score, $J = \frac{tp}{tp + fp + fn}$
• The train-test procedure was repeated 40 times to gather sufficient statistics
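For concreteness, the Jaccard score can be computed from binary labels as follows. This is a minimal sketch; returning 1.0 when there are no positives at all is a convention assumed here, not something stated on the slide.

```python
def jaccard(y_true, y_pred):
    """J = tp / (tp + fp + fn) for binary labels (True = in the class)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # assumed convention for empty case

y_true = [True, True, True, False, False]
y_pred = [True, True, False, True, False]
print(jaccard(y_true, y_pred))  # tp = 2, fp = 1, fn = 1
```

Unlike accuracy, the score ignores true negatives, which matters when a subclass contains only a small fraction of the sequences.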

Page 33: Motif Extraction  and Grammar Induction in Language and Biology

Results

Average Jaccard scores:
2nd level: 0.88 ± 0.06
3rd level: 0.84 ± 0.09

[Figure: 2nd level results. Two panels plotted against EC subclass: number of sequences, and Jaccard score.]

Page 34: Motif Extraction  and Grammar Induction in Language and Biology

2nd Level Classification

[Figure: Jaccard score (axis from 0.35 to 1) per EC subclass for MEX, Smith-Waterman and SVMProt. EC subclasses shown: 1.1, 1.14, 1.6, 1.9, 1.2, 1.11, 1.3, 1.8, 1.15, 1.4, 1.18, 1.7, 1.17, 1.13, 1.10, 1.5. Asterisks mark subclasses at α < 0.01.]

Page 35: Motif Extraction  and Grammar Induction in Language and Biology

3rd Level Classification

[Figure: Jaccard score (axis from 0.35 to 1.1) per EC sub-subclass for MEX and Smith-Waterman. Asterisks mark sub-subclasses at α < 0.01.]

Page 36: Motif Extraction  and Grammar Induction in Language and Biology

Summary so far…

• MEX is an unsupervised motif extraction method that finds biologically meaningful motifs
• The meaning of the motifs is established by the classification tasks
• MEX classifies enzymes better than Smith-Waterman and outperforms SVMProt
• There exists a correspondence between MEX motifs and PROSITE biologically significant patterns
• Specific motifs of length 6 and longer specify the functionality of enzymes

Page 37: Motif Extraction  and Grammar Induction in Language and Biology

ADIOS (Automatic DIstillation Of Structure)

• Representation of a corpus (of sentences) as paths over a graph whose vertices are lexical elements (words)

• Motif Extraction (MEX) procedure for establishing new vertices, thus progressively redefining the graph in an unsupervised fashion

• Recursive Generalization

Page 38: Motif Extraction  and Grammar Induction in Language and Biology

ADIOS: recursive generalization

1. For each path
   1. Slide a context window of size L
   2. For each location i {0 < i <= L}
      1. Look for all paths with identical prefix at ip {0 < ip < i}
      2. and identical suffix at is {i < is < L}
      3. If a 'generalized' pattern is found (and significant…)
         1. Add E, the equivalence class, to the graph
         2. Rewire
2. Continue with MEX and ADIOS until no significant pattern is found

Page 39: Motif Extraction  and Grammar Induction in Language and Biology

Loading sentences onto a graph whose vertices are words

• Is that a cat?

• Is that a dog?

• And is that a horse?

• Where is the dog?

The result is the pattern

P = 'is that a E ?'

with an equivalence class

E = {cat, horse, dog}
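The idea behind this example can be reproduced with a toy slot-filling search. The sketch below is not the ADIOS procedure itself: it skips the significance test, and the function name, the window size of 5 and the slot position are assumptions chosen to fit the four sentences.

```python
from collections import defaultdict

def equivalence_classes(sentences, window=5, slot=3):
    """Group words that fill the same slot of a window whose other
    positions are identical across sentences."""
    classes = defaultdict(set)
    for words in sentences:
        for start in range(len(words) - window + 1):
            win = words[start:start + window]
            context = tuple(win[:slot]) + ("E",) + tuple(win[slot + 1:])
            classes[context].add(win[slot])
    # keep only contexts that were seen with more than one filler
    return {c: e for c, e in classes.items() if len(e) > 1}

corpus = [s.split() for s in [
    "is that a cat ?", "is that a dog ?",
    "and is that a horse ?", "where is the dog ?"]]
found = equivalence_classes(corpus)
for context, members in found.items():
    print(" ".join(context), sorted(members))
```

On these sentences the only shared context is "is that a E ?", filled by cat, dog and horse; "where is the dog ?" contributes nothing because its prefix differs.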

Page 40: Motif Extraction  and Grammar Induction in Language and Biology

[Figure: a lexicon, drawn as a graph with vertices begin, where, and, is, that, the, a, cat, dog, horse, ?, end.]

Page 41: Motif Extraction  and Grammar Induction in Language and Biology

[Figure: a lexicon (same vertices) and a path induced by a string.]

Page 42: Motif Extraction  and Grammar Induction in Language and Biology

[Figure: a lexicon (same vertices) and a path induced by a string.]

Page 43: Motif Extraction  and Grammar Induction in Language and Biology

[Figure: a lexicon (same vertices) and a path induced by a string.]

Page 44: Motif Extraction  and Grammar Induction in Language and Biology

[Figure: a lexicon (same vertices) and a path induced by a string.]

Page 45: Motif Extraction  and Grammar Induction in Language and Biology

[Figure: a lexicon (same vertices) and a graph induced by a corpus of strings.]

Page 46: Motif Extraction  and Grammar Induction in Language and Biology

Generalization

[Figure, panels A-G: a search path over e1 … e5 with the curves P_R and P_L (P_R(e1)=4/Ne, P_R(e2)=3/4, P_R(e3)=1, P_R(e4)=1, P_R(e5)=1/3; P_L(e4)=6/Ne, P_L(e3)=5/6, P_L(e2)=1, P_L(e1)=3/5) and the significant pattern marked; the pattern e2 e3 e4 merged into a new vertex; a window i, i+1, i+2, i+3 along the search path, with a merging point and a dispersion point; a new equivalence class E(i+2) formed from words such as chair, bed, table in the context "took the … to"; comparison with a stored equivalence class (blue, green, red).]

Page 47: Motif Extraction  and Grammar Induction in Language and Biology

The Model: The training process

[Figure: the training process, illustrated on paths written as sequences of numeric vertex labels.]

Page 48: Motif Extraction  and Grammar Induction in Language and Biology

The Model: The training process

[Figure: the training process, continued, on the same paths of numeric vertex labels.]

Page 49: Motif Extraction  and Grammar Induction in Language and Biology

[Figure, panels A-D: pattern trees built over the words Beth, Cindy, George, Jim, Joe, Pam (E70); is; celebrating, living, playing, staying, working (E65); extremely, very (E98); far away (P67). Example path: begin George is working very far away end.]

• First pattern formation
• Higher hierarchies: patterns (P) constructed of other Ps, equivalence classes (E) and terminals (T)
• Final stage: root pattern (RP)
• CFG: context-free grammar

s → begin P144 end
P144 → P120 P101
P120 → E70 P66
E70 → Beth | Cindy ...
P66 → is E65
E65 → celebrating | living ...
P101 → E98 P67
E98 → extremely | very
P67 → far away

Trees are to be read from top to bottom and from left to right.
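Reading the root pattern as a context-free grammar means that sentences can be generated by expanding it top-down. The sketch below encodes the rules as a Python dictionary; only the class members written out explicitly on the slide are included, so the members hidden behind the ellipses are left out on purpose, and the names `RULES` and `expand` are hypothetical.

```python
import itertools

# Rules as read off the slide; members elided with "..." are omitted.
RULES = {
    "P144": [["P120", "P101"]],
    "P120": [["E70", "P66"]],
    "E70": [["Beth"], ["Cindy"]],
    "P66": [["is", "E65"]],
    "E65": [["celebrating"], ["living"]],
    "P101": [["E98", "P67"]],
    "E98": [["extremely"], ["very"]],
    "P67": [["far", "away"]],
}

def expand(symbol):
    """Yield every terminal word sequence derivable from `symbol`."""
    if symbol not in RULES:  # terminal word
        yield [symbol]
        return
    for rhs in RULES[symbol]:
        # Cartesian product of the expansions of each right-hand-side symbol
        for parts in itertools.product(*(list(expand(s)) for s in rhs)):
            yield [w for part in parts for w in part]

sentences = [" ".join(words) for words in expand("P144")]
print(len(sentences))
print(sentences[0])
```

Patterns (P) concatenate their children while equivalence classes (E) offer alternatives, so the number of generated sentences is the product of the class sizes along the tree.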

Page 50: Motif Extraction  and Grammar Induction in Language and Biology

Student learns from teacher

• Teacher generates a corpus of sentences
• Student generates syntax out of significant patterns and equivalence classes
• Unseen teacher-generated patterns are checked by student (recall)
• Student-generated patterns are checked by teacher (significance)

Page 51: Motif Extraction  and Grammar Induction in Language and Biology

student-teacher process

[Figure: the Teacher holds a grammar (CFG) and generates n training sentences; the Student acquires a grammar from them (acquisition). m test sentences are exchanged in each direction to measure recall and precision. The acquired grammar is drawn as a tree over the words, with node labels 84, 58, 63, 60, 64, 48, 49, 50, 51, 85, 53, 61, 62.]

Grammar rules shown in the figure:

P84 → that P58 P63
E63 → E64 P48
E64 → Beth | Cindy | George | Jim | Joe | Pam | P49 | P51
P48 → , doesn't it
P51 → the E50
P49 → a E50
E50 → bird | cat | cow | dog | horse | rabbit
P61 → who E62
E62 → adores | loves | scolds | worships
E53 → Beth | Cindy | George | Jim | Joe | Pam
E85 → annoys | bothers | disturbs | worries
P58 → E60 E64
E60 → flies | jumps | laughs

Page 52: Motif Extraction  and Grammar Induction in Language and Biology

ATIS experiments (Air Travel Information System)

The ATIS-CFG is a hand-made CFG of 4592 rules, constructed to provide good recall (45%) of ATIS-NL, a corpus of natural language (13,000 sentences).

We train multiple ADIOS learners using ATIS-CFG as the teacher. Recursion is limited to depth 10. In testing performance, precision is defined by taking the mean across individual learners, while for recall acceptance by one learner suffices.
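The two aggregation rules differ: recall pools the learners, precision averages over them. A toy sketch of both, in which every learner and the teacher are stand-in predicates invented for the illustration (real learners would be induced grammars):

```python
def recall(test_sentences, learners):
    """Fraction of teacher sentences accepted by at least one learner."""
    accepted = sum(any(accepts(s) for accepts in learners)
                   for s in test_sentences)
    return accepted / len(test_sentences)

def precision(generated_by_learner, teacher_accepts):
    """Mean over learners of the fraction of each learner's generated
    sentences that the teacher accepts."""
    per_learner = [sum(teacher_accepts(s) for s in sents) / len(sents)
                   for sents in generated_by_learner]
    return sum(per_learner) / len(per_learner)

# Hypothetical stand-ins: each "learner" is a predicate over sentences.
learner_a = lambda s: s.startswith("show")
learner_b = lambda s: s.endswith("boston")
teacher = lambda s: "flights" in s

tests = ["show flights", "list flights to boston", "what is the fare"]
generated = [["show flights", "show me"], ["flights to boston"]]
r = recall(tests, [learner_a, learner_b])
p = precision(generated, teacher)
print(r, p)
```

Adding learners can only raise recall under this definition, while precision stays an average, which is why the two are aggregated differently.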

Page 53: Motif Extraction  and Grammar Induction in Language and Biology

ADIOS learning from ATIS-CFG (4592 rules) using different numbers of learners and different window lengths L

[Figure: precision against recall (both from 0 to 1) for 30 learners and 150 learners, window lengths L = 3 to 7, and training sets of 10,000, 40,000 and 120,000 sentences. A region of over-generalization is marked.]

Page 54: Motif Extraction  and Grammar Induction in Language and Biology

ATIS experiments

The ATIS natural language corpus contains 13,000 sentences. Training ADIOS on it leads to a recall of 40% (ATIS-CFG reaches a recall of 45%). Nonetheless, the human-judged precision is remarkable: 8 subjects judged the grammatical acceptability of ADIOS output to be roughly the same as that of ATIS-NL, whereas 99% of the sentences produced by ATIS-CFG are ungrammatical.

Page 55: Motif Extraction  and Grammar Induction in Language and Biology

[Figure: bar chart of precision (0.00 to 1.00) for ADIOS and ATIS-N.]

Page 56: Motif Extraction  and Grammar Induction in Language and Biology

Meta-analysis of ADIOS results.

We define a pattern spectrum as the histogram of pattern types, whose bins are labeled by sequences such as (T,P) or (E,E,T), E standing for equivalence class, T for tree-terminal (original unit) and P for significant pattern.

We apply this analysis to the Parallel Bible, a text containing 31,000 verses in six different languages.
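A pattern spectrum is a normalized histogram over type strings. The sketch below assumes each pattern is given as a list of element labels whose first letter encodes the kind (T, E or P); this label format and the example patterns are hypothetical.

```python
from collections import Counter

def pattern_spectrum(patterns):
    """Normalized histogram of pattern types.

    Each pattern is a sequence of element labels whose first letter gives
    the kind: T (terminal), E (equivalence class) or P (pattern).
    """
    types = Counter("".join(label[0] for label in p) for p in patterns)
    total = sum(types.values())
    return {t: n / total for t, n in types.items()}

# Hypothetical patterns, written with an assumed label format.
patterns = [["E70", "P66"], ["E98", "P67"],
            ["T:is", "E65"], ["T:far", "T:away"]]
spectrum = pattern_spectrum(patterns)
print(spectrum)
```

Because the bins depend only on the kinds of elements and not on the words, spectra computed for different languages can be compared directly.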

Page 57: Motif Extraction  and Grammar Induction in Language and Biology

[Figure, panels A-E: pattern spectra, i.e. the relative frequency (0 to 0.35) of the pattern types TT, TE, TP, ET, EE, EP, PT, PE, PP, TTT, TTE, TTP, TET, TEE, TEP, TPT, for English, Spanish, Swedish, Chinese, Danish and French; two further plots with vertical axes from 0.60 to 1.00 and horizontal axes from 0 to 800; and a diagram relating Chinese, Spanish, French, English, Swedish and Danish.]

Page 58: Motif Extraction  and Grammar Induction in Language and Biology

Phylogeny of languages

[Figure, panels A-E: pattern spectra (relative frequency, 0 to 0.35, of the pattern types TT through TPT) and a diagram relating Chinese, Spanish, French, English, Swedish and Danish.]

The typological relations among six different natural languages, as judged according to pattern spectra. These agree with the relations accepted by linguists.

Page 59: Motif Extraction  and Grammar Induction in Language and Biology

Summary

• MEX is a motif-extraction method applicable to linguistic texts. We have shown its applicability to proteins.

• MEX motifs serve as 'words', allowing for enzyme classification: from sequence to function!

• ADIOS is a grammar induction algorithm, employing MEX in a space of words, patterns and equivalence classes, and constructing a CFG representing the syntax of the data.

• Thus the syntactic analysis of linguistic corpora and biological texts is put on common grounds.

Unsupervised learning of natural languages, PNAS 102 (2005) 11629

More information at http://adios.tau.ac.il