A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes

Post on 31-Jan-2016


Page 1: A Similar Fragments Merging Approach to Learn Automata on Proteins

A Similar Fragments Merging Approach to Learn Automata on Proteins

Goulven KERBELLEC & François COSTE
IRISA / INRIA Rennes

Page 2: A Similar Fragments Merging Approach to Learn Automata on Proteins

Outline of the talk

Protein families signatures
Similar Fragment Merging Approach (Protomata-L)
  Characterization
    Similar Fragment Pairs (SFPs)
    Ordering the SFPs
  Generalization
    Merging of SFPs in an automaton
    Gap generalization
    Identification of physico-chemical properties
Experiments

Page 3: A Similar Fragments Merging Approach to Learn Automata on Proteins

Protein families

Amino acid alphabet:
{A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}

Protein sequence:
>AQP1_BOVIN
MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI…

Protein data set:
>AQP1_BOVIN
MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI…
>AQP2_RAT
MWELRSIAFSRAVLAEFLATLLFVFFGLGSALQWASSPPSVLQIAVAFGLGIGILVQALGH…
>AQP3_MOUSE
MGRQKELMNRCGEMLHIRYRLLRQALAECLGTLILVMFGCGSVAQVVLSRGTHGGFLT…

Common function & common topology (3D structure)
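The data set on this slide is written in FASTA style (a ">" header line followed by the sequence). As a minimal sketch, assuming that format, the following reads such a data set into a dictionary and checks each sequence against the 20-letter alphabet. The function names are hypothetical and the sequences in the usage lines are shortened stand-ins, not the full proteins.

```python
AMINO_ACIDS = set("ARNDCQEGHILKMFPSTWYV")  # alphabet listed on the slide


def parse_fasta(text):
    """Return {header: sequence} for FASTA-style text (sketch, no error recovery)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # A sequence may span several lines; concatenate them.
            records[name] += line
    return records


def is_protein(sequence):
    """True when every symbol belongs to the amino acid alphabet."""
    return set(sequence) <= AMINO_ACIDS


# Shortened, illustrative sequences (the slide truncates the real ones).
data_set = parse_fasta(">AQP1_BOVIN\nMASEFKKKLFW\n>AQP2_RAT\nMWELRSIAF\nSRAVLAEF\n")
all_valid = all(is_protein(seq) for seq in data_set.values())
```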

Page 4: A Similar Fragments Merging Approach to Learn Automata on Proteins

Characterization of a protein family

[Figure: zinc finger, two C and two H residues bound to a Zn ion]

Zinc Finger Pattern:
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

ZBT11 ...Csi..CgrtLpklyslriHmlk..H...
ZBT10 ...Cdi..CgklFtrrehvkrHslv..H...
ZBT34 ...Ckf..CgkkYtrkdqleyHirg..H...
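To make the pattern syntax concrete, here is a minimal sketch that turns a pattern of this style into a Python regular expression and checks it against the three aligned fragments. It handles only the elements used in the zinc finger pattern (single residues, x, bracketed sets, and repetition counts); the function name `prosite_to_regex` is hypothetical, and the test strings are the slide fragments with the alignment dots removed and letters upper-cased.

```python
import re

# One pattern element: a residue, "x", or a bracketed set,
# optionally followed by a repetition "(n)" or "(n,m)".
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a restricted PROSITE-style pattern into a regex string."""
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        core = "." if core == "x" else core
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(core + repeat)
    return "".join(parts)


zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
fragments = {
    "ZBT11": "CSICGRTLPKLYSLRIHMLKH",
    "ZBT10": "CDICGKLFTRREHVKRHSLVH",
    "ZBT34": "CKFCGKKYTRKDQLEYHIRGH",
}
matches = {name: bool(re.fullmatch(zinc_finger, f)) for name, f in fragments.items()}
```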

Page 5: A Similar Fragments Merging Approach to Learn Automata on Proteins

Expressivity classes of patterns

Class  Example
A      T-C-T-T-G-A
B      D-R-C-C-x(2)-H-D-x-C
C      G-G-G-T-F-[ILV]-[ST]-[ILV]
D      V-x-P-x(2)-[RQ]-x(4)-G-x(2)-L-[LM]
E      G-C-x(1,3)-C-P-x(8,10)-C-C
F      C-x(2,4)-C-x(3)-[ILVFYC]-x(8)-H-x(3,5)-H
G      D-T-A-G-Q-E-*-L-V-G-N-K
H      D-T-A-G-[NQ]-*-L-V-G-N-[KEH]
I      D-T-A-x(2,5)-G-[NQ]-*-L-V-G-N-[KEH]
J      Regular Expression / Automaton

PROSITE, PRATT, TEIRESIAS
PROTOMATA-L

Page 6: A Similar Fragments Merging Approach to Learn Automata on Proteins

Characterization

Page 7: A Similar Fragments Merging Approach to Learn Automata on Proteins

Similar Fragment Pairs
Significantly similar fragment pairs (SFPs)
Natural selection
Important area characterization

Data set D:

Page 8: A Similar Fragments Merging Approach to Learn Automata on Proteins

Ordering the SFPs

Problem :

Solution : ordering the SFPs by scoring each SFP, S(f1,f2) = ?

3 different scoring functions :
  dialign Sd
  support Ss
  implication Si

Page 9: A Similar Fragments Merging Approach to Learn Automata on Proteins

Dialign Score

Sd(f1,f2) = -log P(L, Sim)

L = |f1| = |f2|
Sim = sum of the individual similarity values (BLOSUM62 similarity)
P = probability that a random SFP of the same length L has the same similarity Sim
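The score above can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the similarity is a toy identity score rather than BLOSUM62, P is estimated by Monte Carlo sampling of uniformly random fragment pairs rather than computed as in DIALIGN, and "at least as similar" is used as the event, which is a choice of this sketch. The function names are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def identity_similarity(a, b):
    """Toy stand-in for a substitution matrix such as BLOSUM62."""
    return 1 if a == b else 0


def fragment_similarity(f1, f2, sim=identity_similarity):
    """Sim: sum of the individual similarity values of aligned residues."""
    return sum(sim(a, b) for a, b in zip(f1, f2))


def dialign_like_score(f1, f2, sim=identity_similarity, trials=5000, seed=0):
    """Estimate -log P, with P the chance that a random pair of length L
    reaches a similarity at least as high as the observed one."""
    if len(f1) != len(f2):
        raise ValueError("fragments of an SFP must have the same length")
    rng = random.Random(seed)
    length = len(f1)
    observed = fragment_similarity(f1, f2, sim)
    hits = 0
    for _ in range(trials):
        r1 = rng.choices(AMINO_ACIDS, k=length)
        r2 = rng.choices(AMINO_ACIDS, k=length)
        if fragment_similarity(r1, r2, sim) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    p = (hits + 1) / (trials + 1)
    return -math.log(p)


score_similar = dialign_like_score("EIK", "EIK")
score_unrelated = dialign_like_score("EIK", "AAA")
```

With this construction an unrelated pair gets a score of zero, because every random pair is at least as similar, while rarer similarities get larger scores.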

Page 10: A Similar Fragments Merging Approach to Learn Automata on Proteins

Support Score

Taking into account the representativeness of an SFP

Ss(f1,f2,D) = number of sequences supporting <f1,f2>

[Figure: fragments f1 and f2 and a third fragment f]

<f1,f2> is supported by f with respect to the triangular inequality relating Sd(f,f1) + Sd(f,f2) and Sd(f1,f2)

Page 11: A Similar Fragments Merging Approach to Learn Automata on Proteins

Implication Score

Taking into account a counter-example set N
Discriminative fragments
Lerman index:

Si(f1,f2,D,N) = [ -P(Ss(f1,f2,N)) + P(Ss(f1,f2,D)) x P(N) ] / [ P(Ss(f1,f2,D)) x |N| ]

with P(X) = |X| / (|D| + |N|)

Page 12: A Similar Fragments Merging Approach to Learn Automata on Proteins

Generalization

Page 13: A Similar Fragments Merging Approach to Learn Automata on Proteins

From protein data sets to automata

MASEIKLFW

M A S E I K L F W

Page 14: A Similar Fragments Merging Approach to Learn Automata on Proteins

From protein data sets to automata

MASEIKLFW

MGYEVKYRV

M G Y E V K Y R V

M A S E I K L F W

Page 15: A Similar Fragments Merging Approach to Learn Automata on Proteins

Merging SFPs

MASEIKLFW

MGYEVKYRV

M G Y E V K Y R V

M A S E I K L F W

Page 16: A Similar Fragments Merging Approach to Learn Automata on Proteins

Merging SFPs

MASEIKLFW

MGYEVKYRV

[Figure: automaton after merging: (M G Y | M A S) -> E -> [I,V] -> K -> (Y R V | L F W)]

Page 17: A Similar Fragments Merging Approach to Learn Automata on Proteins

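Merging SFPs

MASEIKLFW

MGYEVKYRV

[Figure: automaton after merging: (M G Y | M A S) -> E -> [I,V] -> K -> (Y R V | L F W)]

New sequences accepted by the merged automaton:

MASEVKLFW MGYEIKYRV

MASEIKYRV MGYEVKLFW

MASEVKYRV MGYEIKLFW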
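The generalization effect of a merge can be reproduced with a small sketch. It assumes a simple representation that is not taken from the slides: the initial automaton is one linear path per sequence, states are (sequence index, position) pairs, and merging an SFP identifies the states of the second fragment with those of the first. The names `build_mca`, `merge_fragments`, and `language` are hypothetical.

```python
def build_mca(sequences):
    """One linear path per sequence; states are (sequence index, position)."""
    transitions = set()
    for k, seq in enumerate(sequences):
        for i, symbol in enumerate(seq):
            transitions.add(((k, i), symbol, (k, i + 1)))
    initials = {(k, 0) for k in range(len(sequences))}
    finals = {(k, len(seq)) for k, seq in enumerate(sequences)}
    return transitions, initials, finals


def merge_fragments(automaton, k1, start1, k2, start2, length):
    """Identify the states of fragment 2 with the aligned states of fragment 1."""
    transitions, initials, finals = automaton
    rename = {(k2, start2 + i): (k1, start1 + i) for i in range(length + 1)}

    def r(state):
        return rename.get(state, state)

    return (
        {(r(p), a, r(q)) for p, a, q in transitions},
        {r(q) for q in initials},
        {r(q) for q in finals},
    )


def language(automaton):
    """Enumerate accepted words (assumes the automaton is acyclic)."""
    transitions, initials, finals = automaton
    outgoing = {}
    for p, a, q in transitions:
        outgoing.setdefault(p, []).append((a, q))
    words = set()
    stack = [(q, "") for q in initials]
    while stack:
        state, word = stack.pop()
        if state in finals:
            words.add(word)
        for a, nxt in outgoing.get(state, []):
            stack.append((nxt, word + a))
    return words


mca = build_mca(["MASEIKLFW", "MGYEVKYRV"])
# Merge the SFP <EIK, EVK>: both fragments start at position 3 and have length 3.
merged = merge_fragments(mca, 0, 3, 1, 3, 3)
accepted = language(merged)
```

Before the merge only the two training sequences are accepted; after it, every combination of prefix, [I,V], and suffix is accepted, which is where the six additional sequences on the slide come from.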

Page 18: A Similar Fragments Merging Approach to Learn Automata on Proteins

Protein Sequence Data Set

List of SFPs

Ordered List of SFPs

MCA

MERGING

Automaton / Regular Expression

Page 19: A Similar Fragments Merging Approach to Learn Automata on Proteins

Gap Generalization
Merge non-representative transitions onto themselves
Treat them as "gaps"

Page 20: A Similar Fragments Merging Approach to Learn Automata on Proteins

Identification of Physico-chemical properties

Similar fragments ~ potential functional area
Amino acids sharing the same position
Physico-chemical property at play
=> Generalization from a group (of amino acids) to a Taylor group

[I,V] -> [I,L,V] (aliphatic)
[I,Q,W,P] -> x (no information), with x = {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}
C -> C
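A minimal sketch of this generalization step, under simplifying assumptions: the observed amino acids at a position are replaced by the smallest listed group that contains them, and by x when no group does. Only the aliphatic group {I,L,V} comes from the slide; the other group memberships are illustrative assumptions, and the selection rule ignores the statistical test used to accept or reject an expansion. The name `generalize` is hypothetical.

```python
ALPHABET = frozenset("ARNDCQEGHILKMFPSTWYV")

# Only "aliphatic" is taken from the slide; the other memberships are
# illustrative assumptions, not the talk's exact Taylor groups.
GROUPS = {
    "aliphatic": frozenset("ILV"),
    "aromatic": frozenset("FWYH"),
    "positive": frozenset("HKR"),
    "negative": frozenset("DE"),
    "charged": frozenset("DEHKR"),
}


def generalize(observed):
    """Return (label, amino acid set) for the amino acids seen at one position."""
    observed = frozenset(observed)
    if not observed or not observed <= ALPHABET:
        raise ValueError("expected a non-empty set of amino acids")
    if len(observed) == 1:
        # A single conserved residue is kept as is (C -> C on the slide).
        (residue,) = observed
        return residue, observed
    candidates = [(len(members), name)
                  for name, members in GROUPS.items() if observed <= members]
    if candidates:
        _, name = min(candidates)  # smallest group containing the observation
        return name, GROUPS[name]
    return "x", ALPHABET  # no informative group: any amino acid


examples = [generalize("IV"), generalize("IQWP"), generalize("C")]
```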

Page 21: A Similar Fragments Merging Approach to Learn Automata on Proteins

Likelihood ratio test

A likelihood ratio test decides whether or not the multi-set A has been generated according to a physico-chemical group G.

Given a threshold, we test the expansion of A to G and reject it when LRG/A is below the threshold.
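The formula of the ratio did not survive in this transcript, so the following is only a sketch of one plausible likelihood ratio, not the talk's definition. It assumes that the group model draws amino acids uniformly from G and that the alternative model uses the empirical frequencies of A. The names `log_likelihood_ratio` and `accept_expansion` and the threshold value are hypothetical.

```python
import math
from collections import Counter


def log_likelihood_ratio(observed, group):
    """log( L(A | uniform over G) / L(A | empirical frequencies of A) )."""
    counts = Counter(observed)
    if not set(counts) <= set(group):
        raise ValueError("the group must contain every observed amino acid")
    n = sum(counts.values())
    log_l_group = n * math.log(1.0 / len(group))
    log_l_observed = sum(c * math.log(c / n) for c in counts.values())
    return log_l_group - log_l_observed


def accept_expansion(observed, group, threshold):
    """Reject the expansion of A to G when the ratio falls below the threshold."""
    return math.exp(log_likelihood_ratio(observed, group)) >= threshold


# Two observations of I and V leave room for L; many observations without L do not.
few = accept_expansion("IV", "ILV", threshold=0.1)
many = accept_expansion("IIIIIVVVVV", "ILV", threshold=0.1)
```

Under these assumptions the ratio never exceeds 1, and it shrinks as more observations accumulate without any member of G outside A appearing.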

Page 22: A Similar Fragments Merging Approach to Learn Automata on Proteins

Experiments

Page 23: A Similar Fragments Merging Approach to Learn Automata on Proteins

MIP : the Major Intrinsic Protein Family

Family: MIP

Subfamilies: AQP, Glpf, Gla

Page 24: A Similar Fragments Merging Approach to Learn Automata on Proteins

Data sets

UNIPROT

MIP in SWISS-PROT

Set « T » (159 seq)

Set « M » (44 seq): identity < 90%

Set « W+ » (24 seq)

Set « W- » (16 seq)

Set « C » (49 seq): Blast (1 < e < 100), not MIP

Set « E » (79 seq)

Set « U » (911 seq)

Water-specific

Page 25: A Similar Fragments Merging Approach to Learn Automata on Proteins

Experiments

First Common Fragment on a Family
  MIP family
  Positive set
  Comparison with pattern discovery tools: Teiresias, Pratt, Protomata-L (short pattern)

Water-specific Characterization
  MIP sub-families
  Positive and negative sets
  Leave-one-out cross-validation
  Protomata-L (short to long pattern)

Page 26: A Similar Fragments Merging Approach to Learn Automata on Proteins

First Common Fragment Automaton

Results of 4 patterns scanned on the Swiss-Prot protein database

Learning set: Set « M » (44 seq)

Target set: Set « T » (159 seq)

Page 27: A Similar Fragments Merging Approach to Learn Automata on Proteins

From short automata to long automata

Previous experiment
  only the first SFPs of the ordered list of SFPs
  short automaton: first common fragment automaton

Next experiment
  larger cut-offs in the list of SFPs
  Protomata-L is able to create longer automata with more common subparts
  Long patterns are close to the topology (3D structure) of the family

Page 28: A Similar Fragments Merging Approach to Learn Automata on Proteins

Water-specific characterization

Leave-one-out cross-validation

Learning set
  W+ \ Si : positive learning set
  W- \ Sj : negative learning set

Test set: { Si U Sj }

Control set: Set T

Implication score

Set « W+ » (24 seq)

Set « W- » (16 seq)

Set « C » (49 seq)
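The split described on this slide can be sketched as a generator of folds. The slide does not say how the left-out positive Si and negative Sj are paired; this sketch assumes every (Si, Sj) combination is used, which is only one possible reading. The function name and the toy sequence labels are hypothetical.

```python
def leave_one_out_splits(positives, negatives):
    """Yield (positive learning set, negative learning set, test set) triples.

    Assumption: one fold per pair (Si, Sj), with Si removed from the
    positives and Sj removed from the negatives.
    """
    for i, s_i in enumerate(positives):
        for j, s_j in enumerate(negatives):
            learn_pos = positives[:i] + positives[i + 1:]   # W+ \ Si
            learn_neg = negatives[:j] + negatives[j + 1:]   # W- \ Sj
            yield learn_pos, learn_neg, [s_i, s_j]          # test set {Si U Sj}


# Toy labels standing in for the sequences of W+ and W-.
folds = list(leave_one_out_splits(["p1", "p2", "p3"], ["n1", "n2"]))
```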

Page 29: A Similar Fragments Merging Approach to Learn Automata on Proteins

Leave-one-out cross-validation

Page 30: A Similar Fragments Merging Approach to Learn Automata on Proteins

Error Correcting Cost

The error correcting cost of a sequence S represents the distance (BLOSUM similarity) between S and the closest sequence accepted by the automaton A.

Distribution of sequences with long automata (size approx. 100)
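As a rough sketch of this notion, the cost below is the minimum edit distance between S and an explicitly listed set of accepted sequences. Two simplifications are assumed: unit edit costs replace the BLOSUM-based distance of the talk, and the automaton is represented by a finite list of its sequences rather than searched directly. The function names are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit costs (stand-in for a BLOSUM-based cost)."""
    previous = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        current = [i]
        for j, b in enumerate(t, 1):
            current.append(min(
                previous[j] + 1,             # delete a from s
                current[j - 1] + 1,          # insert b into s
                previous[j - 1] + (a != b),  # substitute, free when equal
            ))
        previous = current
    return previous[-1]


def error_correcting_cost(sequence, accepted):
    """Distance from the sequence to the closest accepted sequence."""
    return min(edit_distance(sequence, word) for word in accepted)


accepted_sequences = {"MASEIKLFW", "MGYEVKYRV"}
costs = {s: error_correcting_cost(s, accepted_sequences)
         for s in ("MASEIKLFW", "MASEVKLFW", "MASEIKLF")}
```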

Page 31: A Similar Fragments Merging Approach to Learn Automata on Proteins

Leave-one-out cross-validation
With Error Correcting Cost

Page 32: A Similar Fragments Merging Approach to Learn Automata on Proteins

Leave-one-out cross-validation

Page 33: A Similar Fragments Merging Approach to Learn Automata on Proteins

Conclusion & Perspective

Good characterization of a protein family using automata (-> HMM structure)
No need for a multiple alignment: greedy, data-driven algorithm
Localization of important subparts
Physico-chemical identification and generalization
Counter-example sets
Bringing knowledge into automata is possible (-> 2D structure)

Page 34: A Similar Fragments Merging Approach to Learn Automata on Proteins

Questions ?


Page 35: A Similar Fragments Merging Approach to Learn Automata on Proteins

Demo

Page 36: A Similar Fragments Merging Approach to Learn Automata on Proteins

Protomata-L's Approach

First Common Fragment

Page 37: A Similar Fragments Merging Approach to Learn Automata on Proteins

Protomata-L's Approach

To get a more precise automaton

Page 38: A Similar Fragments Merging Approach to Learn Automata on Proteins

Data set (Protein sequences)

EXTRACTION

Pairs of fragments

SORT

Initial Automaton (MCA)

MERGING

IDENTIFICATION OF « GAPS »

IDENTIFICATION OF PHYSICOCHEMICAL GROUPS

Page 39: A Similar Fragments Merging Approach to Learn Automata on Proteins

Structural discrimination

Page 40: A Similar Fragments Merging Approach to Learn Automata on Proteins
Page 41: A Similar Fragments Merging Approach to Learn Automata on Proteins

Aromatic

Hydrophobic

Non-informative

Generalization of an Aquaporins automaton

Page 42: A Similar Fragments Merging Approach to Learn Automata on Proteins

Physico-chemical properties identification

Likelihood ratio test

Aliphatic, Small, x