Motif Extraction and Grammar Induction in Language and Biology

David Horn, Tel Aviv University
http://horn.tau.ac.il
Analogy between languages

• Human languages are realized in streams of speech or in lines of text.
• All computational problems can be realized by a Turing machine.
• Hereditary biology makes use of chains of nucleotides in DNA and RNA, and chains of amino acids in proteins.
• All are one-dimensional representations of reality.

Reality, however, is not one-dimensional; hence one needs a set of syntactic rules to make sense of the text, and a semantic mapping into actions in the real world.
The distinction between syntactic and semantic levels (Chomsky 1957) is intuitively clear in human languages.
In biology, syntax refers to the structure of the sequences, whereas semantics relates the different sequence elements to the complicated processes leading from transcription (the birth of mRNA) to translation (the birth of the protein).
Examples of syntax (semantics): segmentation of the chromosome into genes (originators of proteins), promoters (transcription regulation) and 3’ UTRs (translation regulation); segmentation of genes into exons (coding) and introns (noncoding); finding motifs on promoters (TFBSs relevant to enabling or forbidding transcription); finding motifs on proteins (relevant to protein interaction and protein functionality); etc.
Can we induce the rules (and, sometimes, the words) from the texts?
ADIOS is an algorithm that induces syntax, or grammar, from the text. MEX is an algorithm that extracts motifs, or patterns, from the text.
Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman. Unsupervised learning of natural languages. Proc. Natl. Acad. Sci. USA 102 (2005) 11629-11634.
MEX: motif extraction algorithm

• Create a graph whose vertices are words (for text) or letters (for biological sequences).
• Load all strings of text onto the graph as paths over the vertices.
• Given the loaded graph, consider trial-paths that may coincide with original strings of text.
• Use context-sensitive statistics to define left- and right-moving probabilities that help to define motifs.
Given a set of strings:

a l i c e w a s b e g i n n i n g t o g e t v e r y t i r e d o f s i t t i n g b y h e r s i s t e r o n t h e b a n k a n d o f h a v i n g n o t h i n g t o d o o n c e o r t w i c e s h e h a d p e e p e d i n t o t h e b o o k h e r s i s t e r w a s r e a d i n g b u t i t h a d n o p i c t u r e s o r c o n v e r s a t i o n s i n i t a n d w h a t i s t h e u s e o f a b o o k t h o u g h t a l i c e w i t h o u t p i c t u r e s o r c o n v e r s a t i o n

MEX extracts a segmentation into motifs:

alicewas beginning toget verytiredof sitting by hersister onthebank and of having nothing todo onceortwice shehad peep ed intothe book hersister was reading butit hadno pictures or conversation s init and what is theuseof abook thoughtalice without pictures or conversation
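As a toy illustration of the loading step (not the authors' implementation; the function name, the begin/end vertices and the sub-path length cap are assumptions made here), each string can be loaded as a path over letter vertices while sub-path occurrences are tallied:

```python
from collections import defaultdict

def load_strings(strings, max_len=10):
    """Load each string as a path over letter vertices (plus special
    begin/end vertices) and tally l(e_i..e_j): the number of loaded
    paths that traverse each sub-path, up to max_len symbols."""
    l = defaultdict(int)
    for s in strings:
        path = ["begin"] + list(s) + ["end"]
        for i in range(len(path)):
            for j in range(i + 1, min(len(path), i + max_len) + 1):
                l[tuple(path[i:j])] += 1
    return l

l = load_strings(["alicewas", "aliceran", "alicewas"])
print(l[("a", "l", "i", "c", "e")])  # 'alice' occurs in all three strings → 3
```

The counts collected here are exactly the l(ei; ej) statistics used further below to define the left- and right-moving probabilities.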
Find patterns in strings of letters
Motifs EXtraction (MEX)
• Σ = {a-z}
Creating the graph (directed)…

[Figure: a directed graph over the 26 letter vertices a-z plus special begin and end vertices. The string "alice was" is loaded as paths whose edges carry (path-id, step) labels (1,1)…(1,6) and (2,1)…(2,4).]
Creating the graph… (cont’d)

[Figure: the structured graph over vertices a…p, with edges labeled {path-id; step}, e.g. {1002;1}, {1002;2}, {1003;3}…{1003;14}.]
Creating the graph… (cont’d)

[Figure: a random layout of the letter vertices a-z with begin and end. Two strings are loaded as paths:

(1) a l i c e w a s b e g i n n i n g t o g e t v e r y t i r e d o f s i t t i n g b y h e r s i s t e r o n t h e b a n k a n d o f c o n v e r s a t i o n

(2) w h e n a l i c e h a d b e e n a l l t h e w a y d o w n o n e s i d e a n d u p t h e o t h e r t r y i n g e v e r y d o o r s h e w a l k e d s a d l y]
[Figure: the structured graph after loading, with edges labeled {path-id; step} at steps 1-16, e.g. {1002;1}, {1003;3}…{1003;14}.]
Searching for patterns

[Figure: a trial search path is laid over the loaded graph and compared with the original paths that traverse the same vertices.]
Searching for patterns (Cont’)

[Figure: right- and left-moving probabilities along a candidate path begin e1 e2 e3 e4 e5 end, plotted on a 0-1 scale.
Right-moving: PR(e1)=4/41, PR(e2|e1)=3/4, PR(e3|e1e2)=1, PR(e4|e1e2e3)=1, PR(e5|e1e2e3e4)=1/3.
Left-moving: PL(e4)=6/41, PL(e3|e4)=5/6, PL(e2|e3e4)=1, PL(e1|e2e3e4)=3/5.
The sharp drops SR and SL delimit a significant pattern.]
Searching for patterns (Cont’)

L matrix (numbers of paths), along the trial path alicewasbeg:

         a     l     i     c     e     w     a     s
a     8770
l     1046  4704
i      486   912  7486
c      397   401   637  2382
e      397   397   488   703 13545
w       48    48    51    66   579  2671
a       21    21    21    23   192   624  8770
s       17    17    17    19   142   377   964  6492
b        2     2     2     2     5    10    14    63
e        2     2     2     2     4     6     9    24
g        2     2     2     2     4     5     5    10

l(ei; ej) = number of occurrences of the sub-path (ei, ej), where (ei, ej) is: ei → ei+1 → ei+2 → … → ej-1 → ej
From L to P

Calculating conditional probabilities:

P(a) = 0.08
P(l|a) = 1046/8770
P(i|al) = 486/1046
P(c|ali) = 397/486
P(e|alic) = 397/397
P(w|alice) = 48/397

P_R matrix:

         a     l     i     c     e     w     a     s
a     0.08
l     0.12  0.043
i     0.45  0.19  0.069
c     0.85  0.44  0.085 0.022
e     1     0.99  0.77  0.3   0.12
w     0.12  0.12  0.1   0.094 0.043 0.024
a     0.44  0.44  0.41  0.35  0.33  0.23  0.08
s     0.81  0.81  0.81  0.83  0.74  0.6   0.11  0.06
b     0.12  0.12  0.12  0.11  0.035 0.027 0.015 0.01
e     1     1     1     1     0.8   0.6   0.64  0.38
g     1     1     1     1     1     0.83  0.56  0.42
The probability Matrix

P_R = right-moving probability (detects the end of a pattern)
P_L = left-moving probability (detects the beginning of a pattern)

P_R(ei, ej) = p(ej | ei ei+1 … ej-1) = l(ei, ej) / l(ei, ej-1)

P_L(ej, ei) = p(ei | ei+1 … ej) = l(ei, ej) / l(ei+1, ej)

M(S)ij =
  P_R(ei, ej)   if i < j
  P(ei)         if i = j
  P_L(ej, ei)   if i > j
Detecting a significant pattern

D_R(ei, ej) = P_R(ei, ej) / P_R(ei, ej-1)

D_L(ej, ei) = P_L(ej, ei) / P_L(ej, ei+1)

A sharp drop (D < η) in the right- or left-moving probability marks a candidate end or beginning of a pattern.
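A runnable sketch of these drop statistics on a toy corpus (the corpus, the function names and the threshold η = 0.65 are illustrative assumptions, not values taken from this talk):

```python
from collections import defaultdict

def subpath_counts(strings, max_len=8):
    """l(e_i..e_j): number of occurrences of each sub-path."""
    l = defaultdict(int)
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, min(len(s), i + max_len) + 1):
                l[tuple(s[i:j])] += 1
    return l

def p_right(l, path, i, j):
    """P_R(e_i, e_j) = l(e_i..e_j) / l(e_i..e_{j-1})."""
    return l[tuple(path[i:j + 1])] / l[tuple(path[i:j])]

def right_drop_ends(l, path, i, eta=0.65):
    """Positions j-1 where D_R = P_R(e_i,e_j) / P_R(e_i,e_{j-1}) < eta,
    i.e. candidate right ends of a motif starting at position i."""
    return [j - 1 for j in range(i + 2, len(path))
            if p_right(l, path, i, j) / p_right(l, path, i, j - 1) < eta]

corpus = ["alicewas", "alicegot", "aliceran", "alicewas"]
l = subpath_counts(corpus)
print(right_drop_ends(l, list("alicewas"), 0))  # → [4]: the drop comes right after 'alice'
```

The drop at index 4 reflects the divergence of continuations after "alice" (w, g, r) in this toy corpus.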
Significance test

• Pn = (e1, en): a potential pattern edge
• m = # of paths from e1 to en; r = # of paths from e1 to en+1
• Assume P*n+1 is the “true” Pn+1, given (e1, en)
• H0: P*n+1 ≥ Pn·η (does not diverge)
  H1: P*n+1 < Pn·η
• The odds to receive results at least as “extreme” as r out of m are given by the cumulative binomial

  B(ei, en) = Σ x=0..r [ m! / (x!(m−x)!) ] (Pn·η)^x (1 − Pn·η)^(m−x)

  with r = l(e1, en+1) and m = l(e1, en).
• If the outcome is less than a predetermined α, the pattern is significant.
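A minimal sketch of this one-sided binomial test (the function name and the example numbers are illustrative, and η = 0.65, α = 0.01 are placeholder defaults):

```python
from math import comb

def drop_significant(r, m, p_n, eta=0.65, alpha=0.01):
    """Cumulative binomial B(r | m, Pn*eta): the probability of seeing
    at most r continuations out of m paths if the true continuation
    probability were still Pn*eta (H0). The drop is accepted as a
    significant pattern edge when B < alpha."""
    p0 = p_n * eta
    b = sum(comb(m, x) * p0**x * (1 - p0)**(m - x) for x in range(r + 1))
    return b < alpha

print(drop_significant(2, 40, 0.9))  # clear drop: only 2 of 40 paths continue
print(drop_significant(5, 8, 0.9))   # too few paths for significance
```

With few paths (small m) even a large apparent drop cannot reach significance, which is why the test is applied rather than thresholding D alone.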
Rewiring the graph

• Once the algorithm has reached the stop criterion (e.g. it ceases to locate new patterns), the sub-paths subsumed by each significant pattern are merged into a new vertex.
• The graph is rewired in descending order of length and significance.
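A simplified sketch of the rewiring step (the ranking key, data layout and names are assumptions made here; real paths would be graph edges rather than lists):

```python
def rewire(paths, patterns):
    """Replace each significant pattern's sub-paths by a single merged
    vertex, processing patterns in descending (length, significance)
    order so longer, stronger motifs win overlaps."""
    ranked = sorted(patterns, key=lambda p: (len(p), patterns[p]), reverse=True)
    for motif in ranked:
        k = len(motif)
        new_paths = []
        for path in paths:
            out, i = [], 0
            while i < len(path):
                if tuple(path[i:i + k]) == motif:
                    out.append("".join(motif))  # the merged vertex
                    i += k
                else:
                    out.append(path[i])
                    i += 1
            new_paths.append(out)
        paths = new_paths
    return paths

paths = [list("alicewasreading"), list("alicewastired")]
print(rewire(paths, {tuple("alicewas"): 0.99}))
```

After rewiring, subsequent MEX passes treat "alicewas" as a single symbol, allowing motifs-of-motifs to emerge.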
ALICE motifs

Motif          Weight  Occurrences  Length
conversation    0.98       11         11
whiterabbit     1.00       22         10
caterpillar     1.00       28         10
interrupted     0.94        7         10
procession      0.93        6          9
mockturtle      0.91       56          9
beautiful       1.00       16          8
important       0.99       11          8
continued       0.98        9          8
different       0.98        9          8
atanyrate       0.94        7          8
difficult       0.94        7          8
surprise        0.99       10          7
appeared        0.97       10          7
mushroom        0.97        8          7
thistime        0.95       19          7
suddenly        0.94       13          7
business        0.94        7          7
nonsense        0.94        7          7
morethan        0.94        6          7
remember        0.92       20          7
consider        0.91       10          7
curious         1.00       19          6
hadbeen         1.00       17          6
however         1.00       20          6
perhaps         1.00       16          6
hastily         1.00       16          6
herself         1.00       78          6
footman         1.00       14          6
suppose         1.00       12          6
silence         0.99       14          6
witness         0.99       10          6
gryphon         0.97       54          6
serpent         0.97       11          6
angrily         0.97        8          6
croquet         0.97        8          6
venture         0.95       12          6
forsome         0.95       12          6
timidly         0.95        9          6
whisper         0.95        9          6
rabbit          1.00       27          5
course          1.00       25          5
eplied          1.00       22          5
seemed          1.00       26          5
remark          1.00       28          5

Motifs are selected in order of length and weight (significance of the drop). Shown here are the results of one run over a trial-path: the beginning of the list of motifs extracted from it.
Application to Biology

• Vertices of the graph: 4 or 20 letters.
• Paths: gene sequences, promoter sequences, protein sequences.
• Conditional probabilities on the graph are proportional to the numbers of paths.
• Trial-path: testing transition probabilities to extract motifs.
Extracting Motifs from Enzymes
>P54233 | 1.7.1.1
LLDPRDEGTADQWIPRNASMVRFTGKHPFNGEGPLPRLMHHGFITPSPLRYVRNHGPVPKIKWDEWTVEVTGLVKRSTHFTMEKLMREFPHREFPATLVCAGNRRKEHNMVKQSIGFNWGAAGGSTSVWRGVPLRHVLKRCGILARMKGAMYVSFEGAEDLPGGGGSKYGTSVKREMAMDPSRDIILAFMQNGEPLAPDHGFPVRMIIPGFIGGRMVKWLKRIVVTEHECDSHYHYKDNRVLPSHVDAELANDEGWWYKPEYIINELNINSVITTPCHEEILPINSWTTQMPYFIRGYAYSGGGRKVTRVEVTLDGGGTWQVCTLDCPEKPNKYGKYWCWCFWSVEVEVLDLLGAREIAVRAWDEALNTQPEKLIWNVMGMMNNCWFRVKTNVCRPHKGEIGIVFEHPTQPGNQSGGWMAKEKHLEKSSES

Each enzyme sequence corresponds to a single path.
Applying MEX to oxidoreductases

• 6602 enzyme sequences
• MEX motifs are specific subsequences
• 3165 motifs were obtained

V. Kunik, Z. Solan, S. Edelman, E. Ruppin and D. Horn. CSB 2005.

[Figure: distribution of MEX motifs; number of motifs vs. length of motif.]
Enzymes Representation

Each enzyme is represented as a ‘bag of motifs’:

>P54233 | 1.7.1.1
RDEGTAD, TGKHPFN, LMHHGFITP, YVRNHGPVP, WTVEVTG, PDHGFP, YHYKDN, KVTRVE, YGKYWCW, MGMMNNCWF

These 1222 MEX motifs cover 3739 enzymes.
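A hedged sketch of the ‘bag of motifs’ encoding that such a classifier could consume (the motif list is shortened and partly hypothetical, and a plain binary indicator vector is only one possible encoding):

```python
def motif_features(sequence, motifs):
    """Binary 'bag of motifs' vector: component i is 1 iff motif i
    occurs anywhere in the enzyme sequence."""
    return [1 if m in sequence else 0 for m in motifs]

# first four motifs are from P54233 above; 'AAAAAA' is a made-up absent one
motifs = ["RDEGTAD", "TGKHPFN", "PDHGFP", "KVTRVE", "AAAAAA"]
seq = "...RDEGTAD...PDHGFP..."  # stand-in for a full enzyme sequence
print(motif_features(seq, motifs))  # → [1, 0, 1, 0, 0]
```

Vectors of this kind, one per enzyme, are what a linear classifier can then be trained on.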
The functionality of an enzyme is determined according to its EC number.
Enzyme Function

EC number: n1.n2.n3.n4 (a unique identifier)

Classification Hierarchy [Webb, 1992]
• n1: class
• n1.n2: sub-class / 2nd level
• n1.n2.n3: sub-subclass / 3rd level
• n1.n2.n3.n4: precise enzymatic activity

An example: EC 1.12.1.n4
• 1: oxidoreductases
• 1.12: hydrogen as electron donors
• 1.12.1: NAD+ / NADP+ as electron acceptors
• EC 1.12.1.2: NAD+ oxidoreductase, H2 + NAD+ = H+ + NADH
Current knowledge regarding enzyme classification

High sequence similarity is required to guarantee functional similarity of proteins. A recent analysis of enzymes by Tian and Skolnick (2003) suggests that 40% pairwise sequence identity can be used as a threshold for safe transferability of the first three digits of the Enzyme Commission (EC) number.

The EC number, which is of the form n1.n2.n3.n4, specifies the location of the enzyme on a tree of functionalities.
Current knowledge regarding enzyme classification
Using pairwise sequence similarity, and combining it with the Support Vector Machine (SVM) classification approach, Liao and Noble (2003) argued that they obtain significantly improved remote homology detection relative to existing state-of-the-art algorithms.
Cai et al. (2003, 2004) have applied SVM to a protein description based on physico-chemical features of the amino acids, such as hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility.

Ben-Hur and Brutlag (2004) used the eMotif approach and analyzed the oxidoreductases as ‘bags of motifs’. With appropriate feature selection methods they obtain success rates over 90% for a variety of classifiers.
The MEX method

• SVM classifier input: each enzyme (by accession number) is represented by the indices of its MEX motifs, e.g.

O17433   1148   262   463   610  7987  1627   260
P19992    124  7290    27   111  3706 18128  3432
Q01284   6652   198  1489   710   425    64    55
Q12723    693   145  7290  3712    65   543   522
P14060    455  2664   848    55   128   256    74
Q60555   7290  3712    65   543   522  6748  7159

Classification tasks:
• 16 2nd level subclasses
• 32 3rd level sub-subclasses
Methods

• A linear SVM is applied to evaluate the predictive power of MEX motifs.
• Enzyme sequences are randomly partitioned into a training-set and a test-set (75%-25%).
• 16 2nd level classification tasks; 32 3rd level classification tasks.
• Performance measurement: Jaccard score, J = tp / (tp + fp + fn).
• The train-test procedure was repeated 40 times to gather sufficient statistics.
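The Jaccard measure used above is a one-liner; it ignores true negatives, which suits one-vs-rest tasks where negatives vastly outnumber positives:

```python
def jaccard(tp, fp, fn):
    """Jaccard score J = tp / (tp + fp + fn). True negatives are
    deliberately excluded from the denominator."""
    return tp / (tp + fp + fn)

print(jaccard(80, 10, 10))  # → 0.8
```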
Results

Average Jaccard scores: 2nd level 0.88 ± 0.06; 3rd level 0.84 ± 0.09.

2nd Level Classification

[Figure: Jaccard score per EC subclass (the 16 oxidoreductase subclasses), together with the number of sequences per subclass, comparing MEX, Smith-Waterman and SVMProt; asterisks mark differences significant at α < 0.01.]
3rd Level Classification

[Figure: Jaccard score per EC sub-subclass (32 sub-subclasses), comparing MEX and Smith-Waterman; asterisks mark differences significant at α < 0.01.]
Summary so far…

• MEX is an unsupervised motif extraction method that finds biologically meaningful motifs.
• The meaning of the motifs is established by the classification tasks.
• It classifies enzymes better than Smith-Waterman and outperforms SVMProt.
• There exists a correspondence between MEX motifs and PROSITE biologically significant patterns.
• Specific motifs of length 6 and longer specify the functionality of enzymes.
ADIOS (Automatic DIstillation Of Structure)

• Representation of a corpus (of sentences) as paths over a graph whose vertices are lexical elements (words).
• Motif Extraction (MEX) procedure for establishing new vertices, thus progressively redefining the graph in an unsupervised fashion.
• Recursive generalization.
ADIOS: recursive generalization

1. For each path:
   1. Slide a context window of size L.
   2. For each location i (0 < i ≤ L):
      1. Look for all paths with an identical prefix at ip (0 < ip < i)
      2. and an identical suffix at is (i < is < L).
      3. If a ‘generalized’ pattern is found (and significant): add E, the equivalence class, to the graph, and rewire.
2. Continue with MEX and ADIOS until no significant pattern is found.
Loading sentences onto a graph whose vertices are words:

• Is that a cat?
• Is that a dog?
• And is that a horse?
• Where is the dog?

The result is the pattern P = ‘is that a E ?’ with the equivalence class E = {cat, horse, dog}.
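On this toy corpus, a simplified one-slot version of the generalization step can be sketched as follows (function name and the fixed single-slot restriction are assumptions made here; the full algorithm works on the rewired graph):

```python
from collections import defaultdict

def equivalence_classes(sentences, window=5):
    """Slide a context window over each word path; windows that share
    an identical prefix and suffix around one differing slot
    contribute their slot fillers to an equivalence class."""
    slots = defaultdict(set)
    for s in sentences:
        words = s.split()
        for start in range(len(words) - window + 1):
            win = words[start:start + window]
            for i in range(1, window - 1):  # slot strictly inside the window
                context = (tuple(win[:i]), tuple(win[i + 1:]))
                slots[context].add(win[i])
    # only contexts with more than one filler generalize
    return {c: fillers for c, fillers in slots.items() if len(fillers) > 1}

sents = ["is that a cat ?", "is that a dog ?",
         "and is that a horse ?", "where is the dog ?"]
print(equivalence_classes(sents))
```

The only context with multiple fillers is (is that a _ ?), yielding E = {cat, dog, horse}; the stray sentence "where is the dog ?" correctly contributes nothing.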
[Figure: the lexicon {the, dog, where, cat, horse, and, a, ?, is, that} plus begin and end shown as graph vertices; then a path induced by a single string, and finally the graph induced by the whole corpus of strings.]
[Figures: MEX statistics over the word graph single out a significant pattern, which is merged into a new vertex. Candidate generalizations are found between a merging point and a dispersion point within window positions i…i+3, yielding new and stored equivalence classes, e.g. E = {red, green, blue} and E = {chair, table, bed} in contexts built from {took, the, to, is}.]
Generalization
The Model: the training process

[Figure: patterns are formed hierarchically; vertices such as 567, 321, 1201, 32, 234, 621 and 987 are subsumed into pattern vertices 1203, 1204, 2000, 2001 and 1205.]
[Figure: a corpus over the lexicon {Beth, Cindy, George, Jim, Joe, Pam, is, celebrating, living, playing, staying, working, extremely, very, far, away}. For the sentence ‘George is working very far away’ the learned root-pattern (RP) is P144, spanning begin…end, with the rules:

P144 → P120 P101
P120 → E70 P66
E70 → Beth | Cindy | …
P66 → is E65
E65 → celebrating | living | …
P101 → E98 P67
E98 → extremely | very
P67 → far away]

First pattern formation
Higher hierarchies: patterns (P) constructed of other Ps, equivalence classes (E) and terminals (T)
Final stage: root pattern
CFG: context free grammar
Trees to be read from top to bottom and from left to right
Student-teacher process

• The student learns from the teacher.
• The teacher generates a corpus of sentences.
• The student generates syntax out of significant patterns and equivalence classes.
• Unseen teacher-generated patterns are checked by the student (recall).
• Student-generated patterns are checked by the teacher (significance).
Teacher grammar (CFG):

P84 → that P58 P63
E63 → E64 P48
E64 → Beth | Cindy | George | Jim | Joe | Pam | P49 | P51
P48 → , doesn't it
P51 → the E50
P49 → a E50
E50 → bird | cat | cow | dog | horse | rabbit
P61 → who E62
E62 → adores | loves | scolds | worships
E53 → Beth | Cindy | George | Jim | Joe | Pam
E85 → annoys | bothers | disturbs | worries
P58 → E60 E64
E60 → flies | jumps | laughs

Training: n sentences. Testing: m sentences.

Student: acquisition of the grammar yields the Acquired Grammar.
[Figure: the grammar acquired by the student, shown as a tree of patterns (48, 49, 50, 51, 58, 60, 61, 62, 63, 64, 84, 85) over the teacher's lexicon, with pattern 84 significant at 0.0001. Testing on m sentences yields recall and precision.]
ATIS experiments (Air Travel Information System)

The ATIS-CFG is a hand-made CFG of 4592 rules, constructed to provide good recall (45%) of ATIS-NL, a natural-language corpus of 13,000 sentences.

We train multiple ADIOS learners using ATIS-CFG as the teacher. Recursion is limited to depth 10. In testing performance, precision is defined by taking the mean across individual learners, while for recall acceptance by one learner suffices.
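These two pooling rules can be sketched as follows (function names and the tiny acceptance table are illustrative; per-learner precisions are assumed to be computed elsewhere):

```python
def recall_any_learner(acceptance):
    """acceptance[l][s] = True iff learner l accepts teacher sentence s.
    A sentence counts as covered if ANY learner accepts it."""
    n = len(acceptance[0])
    covered = sum(any(acc[s] for acc in acceptance) for s in range(n))
    return covered / n

def precision_mean_learner(per_learner_precision):
    """Precision is averaged over the individual learners."""
    return sum(per_learner_precision) / len(per_learner_precision)

acc = [[True, False, False, True],
       [False, True, False, True]]
print(recall_any_learner(acc))             # → 0.75 (3 of 4 sentences covered)
print(precision_mean_learner([0.9, 0.7]))  # mean precision over the learners
```

Pooling recall over learners while averaging precision rewards an ensemble whose members cover different parts of the grammar without each member over-generating.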
ADIOS learning from ATIS-CFG (4592 rules), using different numbers of learners and different window lengths L

[Figure: precision vs. recall (both on a 0-1 scale) for window lengths L = 3-7, with 30 or 150 learners and training corpora of 10,000, 40,000 and 120,000 sentences; the over-generalization region is marked.]
ATIS experiments

The ATIS natural language corpus contains 13,000 sentences. Training ADIOS on it leads to a recall of 40% (ATIS-CFG reaches a recall of 45%). Nonetheless, human-judged precision is remarkable: 8 subjects judged the grammatical acceptability of ADIOS-generated sentences to be roughly the same as that of ATIS-NL, all while 99% of ATIS-CFG-generated sentences are ungrammatical!
[Figure: human-judged precision of ADIOS-generated sentences compared with ATIS-NL sentences.]
Meta-analysis of ADIOS results.
We define a pattern spectrum as the histogram of pattern types, whose bins are labeled by sequences such as (T,P) or (E,E,T), E standing for equivalence class, T for tree-terminal (original unit) and P for significant pattern.
We apply this analysis to the Parallel Bible, a text containing 31,000 verses in six different languages.
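A possible sketch of computing such a pattern spectrum (the encoding of patterns as (kind, id) sequences is a hypothetical representation chosen here for illustration):

```python
from collections import Counter

def pattern_spectrum(patterns):
    """Histogram of pattern types: each learned pattern is reduced to
    its sequence of element kinds (T = terminal, E = equivalence
    class, P = pattern), and bin counts are normalized to frequencies."""
    labels = Counter("".join(kind for kind, _ in p) for p in patterns)
    total = sum(labels.values())
    return {k: v / total for k, v in labels.items()}

# hypothetical learned patterns as (kind, id) sequences
patterns = [[("T", "is"), ("P", 84)],
            [("E", 50), ("T", "dog")],
            [("T", "the"), ("P", 58)],
            [("T", "a"), ("E", 50), ("T", "?")]]
print(pattern_spectrum(patterns))  # → {'TP': 0.5, 'ET': 0.25, 'TET': 0.25}
```

Comparing two corpora then reduces to comparing two such frequency vectors, e.g. by a simple distance between aligned bins.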
[Figure: pattern spectra for English, Spanish, Swedish, Chinese, Danish and French, with bins TT, TE, TP, ET, EE, EP, PT, PE, PP, TTT, TTE, TTP, TET, TEE, TEP, TPT, …, and bin frequencies up to about 0.35; further panels show the resulting pairwise distances between the languages.]
Phylogeny of languages

The typological relations among six natural languages, as judged according to their pattern spectra, agree with the relations accepted by linguists.
Summary

• MEX is a motif-extraction method applicable to linguistic texts; we have shown its applicability to proteins.
• MEX motifs serve as ‘words’, allowing for enzyme classification: from sequence to function!
• ADIOS is a grammar induction algorithm, employing MEX in a space of words, patterns and equivalence classes, and constructing a CFG representing the syntax of the data.
• Thus the syntactic analysis of linguistic corpora and of biological texts is put on common ground.
Unsupervised learning of natural languages. Proc. Natl. Acad. Sci. USA 102 (2005) 11629.
More information at http://adios.tau.ac.il