Fixed parameter algorithms for
protein similarity search under mRNA structure constraints
A joint work by:
G. Blin, G. Fertin, D. Hermelin, and S. Vialette.
2
Outline
Biological motivation: mRNA molecules; the mRNA to protein process; selenocysteine insertion.
The MRSO problem: implied structure graph; known results.
Two natural parameters: the parameters; nice edge bipartition; a general algorithm for both parameters.
3
Outline
The cutwidth parameter: an efficient algorithm for small cutwidth; implications of this algorithm.
Binary similarity functions.
Closing remarks.
4
mRNA molecules:
Can be considered as strings over {A,C,G,U}.
Complementary bases (A-U, G-C) may pair to form a folding structure (secondary structure) of the mRNA.
Encode genetic information that is later translated into proteins.
Biological Motivation
5
Biological Motivation The mRNA protein process:
6
The mRNA to protein process, standard assumption: each codon encodes a single amino acid.
Recently, biologists found that this is not necessarily true: depending on the folding structure of the mRNA, a single codon might encode different amino acids.
Example application: selenocysteine insertion.
Biological Motivation
7
Selenocysteine insertion:
Selenocysteine is a rare amino acid only recently discovered.
Generated by the UGA codon which usually encodes a stop signal.
The presence of the SECIS element forces the generation of Selenocysteine rather than stopping the encoding.
Biological Motivation
8
Selenocysteine insertion:
Modifying existing proteins by inserting the SECIS element results in certain cases in enhanced proteins.
Is this application only the tip of the iceberg?
Biological Motivation
9
The MRSO problem
The MRSO problem: Given a specified secondary structure S and an mRNA sequence
R, construct an mRNA sequence R’ with complementary
nucleotides according to S which is as similar as possible to R.
[Figure: an example input, the mRNA sequence R = CGG CGA CUA AAU together with a secondary structure S.]
10
The MRSO problem
[Figure: an output sequence R’ for the same R and S.]
11
The score of a solution is given by n similarity functions:
Given f1,…,fn, one needs no additional information on the source mRNA sequence R.
Example: R’ = CGU CGA CUA GCG
s(R’) = f1(CGU) + f2(CGA) + f3(CUA) + f4(GCG)
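The scoring rule above can be sketched in a few lines of Python. The position-wise similarity used here is only an illustrative assumption, since the talk leaves f1,…,fn arbitrary; the names `hamming_similarity` and `score` are hypothetical.

```python
def hamming_similarity(source_codon):
    """Hypothetical f_i: number of positions that agree with the source codon."""
    def f(codon):
        return sum(x == y for x, y in zip(codon, source_codon))
    return f


def score(assignment, sims):
    """s(R') = f_1(c_1) + ... + f_n(c_n)."""
    return sum(f(c) for f, c in zip(sims, assignment))


# Assumed source sequence R; it enters the score only through f_1..f_n.
source = ["CGG", "CGA", "CUA", "AAU"]
sims = [hamming_similarity(c) for c in source]

candidate = ["CGU", "CGA", "CUA", "GCG"]  # the R' shown on the slide
total = score(candidate, sims)            # 2 + 3 + 3 + 0
```

Once `sims` is built, `score` never looks at `source` again, which is the point made on the slide.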
The MRSO problem
12
The implied structure graph:
A linear graph with maximum degree 3.
Complementarity constraints between nucleotides are labeled on the edges of G.
[Figure: a secondary structure S and its implied structure graph G on vertices 1 to 4.]
The MRSO problem
13
The MRSO problem
A more formal definition [Backofen et al.’02]:
Given an implied structure graph G with n vertices, and similarity functions f1,…,fn, find an assignment of codons c1,…,cn to the vertices of G that:
1. Maximizes f1(c1) + … + fn(cn).
2. Is compatible with respect to G.
This definition allows adapting to different applications. It also allows a certain degree of combinatorial leverage, as we shall soon see…
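To make the definition concrete, here is a minimal sketch of the compatibility test and an exhaustive solver for tiny instances. The edge encoding `(i, a, j, b)` (nucleotide a of codon i pairs with nucleotide b of codon j, both 0-indexed) and all function names are assumptions made for illustration; only the A-U and G-C pairs mentioned in the talk are modelled.

```python
import math
from itertools import product

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}
CODONS = ["".join(p) for p in product("ACGU", repeat=3)]  # all 64 codons


def is_compatible(assignment, edges):
    """Every edge (i, a, j, b) must join complementary nucleotides."""
    return all(COMPLEMENT[assignment[i][a]] == assignment[j][b]
               for i, a, j, b in edges)


def brute_force_mrso(n, edges, sims):
    """Exhaustive search over 64^n assignments; usable only for tiny n."""
    best_score, best = -math.inf, None
    for assignment in product(CODONS, repeat=n):
        if is_compatible(assignment, edges):
            s = sum(f(c) for f, c in zip(sims, assignment))
            if s > best_score:
                best_score, best = s, assignment
    return best_score, best


# Toy instance: first nucleotide of codon 0 is paired with the last of codon 1.
source = ["CGG", "CGA"]
sims = [lambda c, t=t: sum(x == y for x, y in zip(c, t)) for t in source]
edges = [(0, 0, 1, 2)]
best_score, best = brute_force_mrso(2, edges, sims)
```

The source itself is not compatible here (C would have to pair with A), so at least one nucleotide must change and the best achievable score is 5 rather than 6.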
14
The MRSO problem – known results [Backofen et al.’02 and Bongartz’04]:
NP-complete (APX-hard) for general implied structure graphs.
Constant factor approximation algorithms exist, but they cannot handle −∞ values well.
In P when the implied structure graph G is outer-planar, in other words, when one can permute the nodes of G such that no two edges of G cross.
[Backofen et al.’02] give an O(n) algorithm for outer-planar implied structure graphs.
We call this algorithm Aop in this talk.
15
Two natural parameters
Let d = # degree-3 vertices in G.
Let x = # edge crossings in G.
[Figure: an implied structure graph on vertices 1 to 8.]
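Both parameters are easy to read off a linear graph. The sketch below assumes edges are given as pairs of vertex numbers in the linear order and ignores the nucleotide labels; the function names and the example edge list are hypothetical.

```python
from collections import Counter
from itertools import combinations


def degree_three_vertices(edges):
    """Number of vertices incident to exactly three edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sum(1 for d in degree.values() if d == 3)


def edge_crossings(edges):
    """Edges (u, v) and (x, y) cross when u < x < v < y in the linear order."""
    norm = sorted(tuple(sorted(e)) for e in edges)
    return sum(1 for (u, v), (x, y) in combinations(norm, 2) if u < x < v < y)


edges = [(1, 4), (2, 6), (3, 5), (1, 7), (1, 8)]  # assumed example
```

In this example only vertex 1 has degree 3, and the edge (1, 4) crosses both (2, 6) and (3, 5), while nested edges such as (2, 6) and (3, 5) do not cross.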
16
Two natural parameters
Modifying the similarity functions: We can modify the similarity functions so that some vertices are
assigned specific codons in any feasible solution.
For example: Ensuring the first vertex is assigned AAA:
f*1(AAA) = f1(AAA).
f*1(C) = −∞ for all C ≠ AAA.
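The modification is a one-line wrapper around the original function. In this sketch `pin` and the sample function `f1` are hypothetical names chosen for illustration.

```python
import math


def pin(f, codon):
    """Return f*, which equals f on `codon` and -infinity on every other codon."""
    def f_star(c):
        return f(c) if c == codon else -math.inf
    return f_star


f1 = lambda c: c.count("A")   # hypothetical similarity function for vertex 1
f1_star = pin(f1, "AAA")
```

Any assignment with a finite score under `f1_star` must give the first vertex the codon AAA, so an optimal solution for the modified functions respects the pin whenever a feasible one exists.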
17
Nice edge bipartition of G:
Upper part induces an outer-planar graph.
[Figure: G on vertices 1 to 8, with its edges split into an upper part and a bottom part.]
Two natural parameters
18
A general algorithm:
Enumerate all assignments which are compatible with respect to the bottom part.
Invoke Aop with each such assignment.
Time complexity = O(2^O(b) · n), where b = # bottom edges.
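A minimal sketch of this enumeration, under several assumptions: edges are tuples `(i, a, j, b)` meaning nucleotide a of codon i pairs with nucleotide b of codon j, each nucleotide occurs in at most one edge, and Aop is passed in as a callable. The stand-in used below is an exhaustive search, not the linear-time algorithm of Backofen et al.; all names are hypothetical.

```python
import math
from itertools import product

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}
CODONS = ["".join(p) for p in product("ACGU", repeat=3)]


def restrict(f, fixed):
    """f*: agrees with f on codons matching the fixed positions, -inf elsewhere."""
    def f_star(codon):
        ok = all(codon[pos] == nuc for pos, nuc in fixed.items())
        return f(codon) if ok else -math.inf
    return f_star


def exhaustive_stand_in(n, edges, sims):
    """Placeholder for Aop: exhaustive search, NOT the linear-time algorithm."""
    best_score, best = -math.inf, None
    for assignment in product(CODONS, repeat=n):
        if all(COMPLEMENT[assignment[i][a]] == assignment[j][b]
               for i, a, j, b in edges):
            s = sum(f(c) for f, c in zip(sims, assignment))
            if s > best_score:
                best_score, best = s, assignment
    return best_score, best


def solve_with_bipartition(n, upper, bottom, sims, aop=exhaustive_stand_in):
    """Try all 4^b nucleotide choices on the b bottom edges, calling aop for each."""
    best_score, best = -math.inf, None
    for choice in product("ACGU", repeat=len(bottom)):
        fixed = [dict() for _ in range(n)]
        for (i, a, j, b), nuc in zip(bottom, choice):
            fixed[i][a] = nuc               # choose one endpoint freely
            fixed[j][b] = COMPLEMENT[nuc]   # the other endpoint is then forced
        pinned = [restrict(f, fx) for f, fx in zip(sims, fixed)]
        s, assignment = aop(n, upper, pinned)
        if s > best_score:
            best_score, best = s, assignment
    return best_score, best


source = ["CGG", "CGA"]
sims = [lambda c, t=t: sum(x == y for x, y in zip(c, t)) for t in source]
upper, bottom = [(0, 0, 1, 2)], [(0, 2, 1, 0)]
best_score, best = solve_with_bipartition(2, upper, bottom, sims)
```

Each bottom edge contributes a factor of 4 = 2^2 to the number of calls, which is where the 2^O(b) term comes from; the bottom constraints are enforced purely through the modified similarity functions.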
Two natural parameters
19
The general algorithm can be applied for our two natural parameters:
Parameter x = # edge crossings in G.
Time = O(2^O(x) · n), hence polynomial for x = O(lg n).
Two natural parameters
20
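The general algorithm can be applied for our two natural parameters:
Parameter d = # degree-3 vertices in G.
Every graph with maximum degree 2 is outer-planar, so moving one edge at each degree-3 vertex to the bottom part leaves an outer-planar upper part.
Time = O(2^O(d) · n), hence polynomial for d = O(lg n).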
Two natural parameters
21
The cutwidth of G:
For p ∈ {1,…,n−1}, let Ep denote the edges connecting vertices from {1,…,p} to {p+1,…,n}, and let Vp denote the vertices of G which are incident to Ep.
Let c denote the cutwidth of G. Then c = max_p |Ep|.
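These definitions translate directly into code. The sketch below computes Ep, Vp, and the cutwidth for the given vertex ordering 1,…,n (it does not search over orderings); the edge list is an assumed example, not the graph from the slide.

```python
def cut_edges(edges, p):
    """E_p: edges joining {1..p} to {p+1..n}."""
    return [(u, v) for u, v in edges if min(u, v) <= p < max(u, v)]


def cut_vertices(edges, p):
    """V_p: vertices incident to E_p."""
    return sorted({w for e in cut_edges(edges, p) for w in e})


def cutwidth(n, edges):
    """max over p of |E_p| for the given vertex ordering 1..n."""
    return max(len(cut_edges(edges, p)) for p in range(1, n))


edges = [(1, 4), (2, 6), (3, 5), (7, 8)]  # assumed example graph on 8 vertices
```

For this example the widest cut is at p = 3, where all of (1, 4), (2, 6), and (3, 5) cross, so the cutwidth is 3.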
[Figure: G on vertices 1 to 8 with the cut at p = 2, showing Ep and Vp.]
The cutwidth parameter
22
Algorithm outline:
Pick any p ∈ {1,…,n−1}.
For each assignment for Vp that is compatible with Ep: recursively find the optimal solution for the subgraphs of G induced by {1,…,p} and {p+1,…,n} under this assignment.
Return the highest scoring solution found in the previous step.
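The outline rests on the observation that the two sides of a cut interact only through the nucleotides on Ep. The sketch below uses that same observation in a different form: a left-to-right sweep (dynamic programming over the cuts p = 1, 2, …) instead of the recursive split described above. It is an illustration under assumptions, not the authors' algorithm: edges are tuples `(i, a, j, b)` with i < j and 0-indexed codons, each nucleotide occurs in at most one edge, and all names are hypothetical.

```python
import math
from itertools import product

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}
CODONS = ["".join(p) for p in product("ACGU", repeat=3)]


def solve_by_sweep(n, edges, sims):
    """edges: (i, a, j, b) with i < j, codons indexed 0..n-1."""
    opening = {}
    for i, a, j, b in edges:
        opening.setdefault(i, []).append((a, j, b))
    # Key: demands (codon, position, nucleotide) imposed by edges crossing the cut.
    # Value: best (score, partial assignment) among prefixes creating those demands.
    table = {frozenset(): (0, ())}
    for p in range(n):
        nxt = {}
        for demands, (s, partial) in table.items():
            due = [(b, nuc) for j, b, nuc in demands if j == p]
            pending = {d for d in demands if d[0] != p}
            for c in CODONS:
                if any(c[b] != nuc for b, nuc in due):
                    continue  # c violates an edge coming from the left
                new = frozenset(pending | {(j, b, COMPLEMENT[c[a]])
                                           for a, j, b in opening.get(p, [])})
                cand = (s + sims[p](c), partial + (c,))
                if new not in nxt or cand[0] > nxt[new][0]:
                    nxt[new] = cand
        table = nxt
    return table.get(frozenset(), (-math.inf, None))


source = ["CGG", "CGA", "CUA", "AAU"]
sims = [lambda c, t=t: sum(x == y for x, y in zip(c, t)) for t in source]
edges = [(0, 0, 3, 2), (0, 1, 1, 0)]
best_score, best = solve_by_sweep(4, edges, sims)
```

At every cut the table holds at most 4^|Ep| entries, one per choice of nucleotides on the crossing edges, so the work per vertex is exponential only in the cutwidth. In the toy instance the edge between codons 0 and 3 forces one nucleotide of the source to change, giving a best score of 11 out of 12.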
The cutwidth parameter
[Figure: G on vertices 1 to 8 with an example codon assignment.]
23
Time = O(2^O(c) · n), where c is the cutwidth of G; hence polynomial for c = O(lg n).
Theorem [Korach & Solel ’93 via Chung & Seymour ’89]: Any graph G with n vertices, constant maximum degree, and constant treewidth has a vertex ordering under which G has cutwidth O(lg n).
Theorem [Bodlaender’95]: If G is either a chordal graph or a circular-arc graph with constant maximum clique size, then G has constant treewidth. If G is k-outerplanar for any constant k, then G has constant treewidth.
Combining all the above we get: MRSO is polynomial time solvable if G is either a chordal graph, a circular-arc graph, or k-outerplanar for constant k.
The cutwidth parameter
24
Binary similarity functions
Suppose we are only interested in the number of “correct” codons in a solution.
In this case we can restrict ourselves to binary similarity functions. That is, for all i: fi : {A,C,G,U}^3 → {0,1}.
Unfortunately, MRSO is NP-hard even when restricted only to instances with binary similarity functions.
[Figure: a source sequence and a target sequence compared codon by codon.]
25
Binary similarity functions
MRSO with binary similarity functions is in FPT for parameter k = score of the optimal solution. More precisely, it is solvable in O(2^(9.25k) · n) time.
Proof sketch:
We can assume w.l.o.g. that for all i there exists a codon C such that fi(C) = 1.
Any maximal independent set in G is of size at least n/4, since G is at most cubic.
We prove the cases k ≤ n/4 and k > n/4 separately.
26
Binary similarity functions
Suppose the target score k satisfies k ≤ n/4:
Find an independent set of size k in O(k) time.
Since for all i there exists a C such that fi(C) = 1, there exists an assignment to this independent set which guarantees a score of at least k.
Since fi ≥ 0 for all i, this assignment can be extended to all vertices of G to obtain an assignment with score at least k.
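The first case can be sketched as a greedy independent set followed by an extension step. This is an illustration under assumptions: edges are tuples `(i, a, j, b)` with 0-indexed codons and positions, each nucleotide occurs in at most one edge, no edge joins a codon to itself, and the codon chosen for each vertex of the independent set is taken to be one with fi = 1. The function names are hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def greedy_independent_set(n, edges):
    """Maximal independent set; has size >= n/4 when every degree is at most 3."""
    neighbours = {v: set() for v in range(n)}
    for i, _, j, _ in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    chosen, blocked = [], set()
    for v in range(n):
        if v not in blocked:
            chosen.append(v)
            blocked |= neighbours[v] | {v}
    return chosen


def extend(n, edges, good):
    """good: {vertex: codon with f_i(codon) = 1} on an independent set.
    Fills in the remaining vertices so that every edge is satisfied."""
    codons = {v: list(good.get(v, "AAA")) for v in range(n)}
    for i, a, j, b in edges:
        if i in good:
            codons[j][b] = COMPLEMENT[codons[i][a]]   # j must follow i
        elif j in good:
            codons[i][a] = COMPLEMENT[codons[j][b]]   # i must follow j
        else:
            codons[i][a], codons[j][b] = "A", "U"     # both ends are free
    return ["".join(codons[v]) for v in range(n)]


edges = [(0, 0, 3, 2), (0, 1, 1, 0), (1, 2, 2, 0)]
targets = ["CGG", "CGA", "CUA", "AAU"]     # assumed codons with f_i = 1
chosen = greedy_independent_set(4, edges)
good = {v: targets[v] for v in chosen}
full = extend(4, edges, good)
```

Because no two chosen vertices are adjacent, their codons never constrain each other, and every other position touched by an edge is free to be set to the required complement.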
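Suppose the target score k satisfies k > n/4:
Try all k-subsets of the vertices of G. Since n < 4k, there are at most C(4k, k) ≤ 2^(3.25k) such subsets.
Enumerating all possible codon assignments for each subset requires O(2^(6k)) time, since each of the k vertices has 64 = 2^6 candidate codons.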
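The constants in the second case can be checked numerically. The sketch below verifies the subset bound with exact integer arithmetic for a range of k (raising both sides to the fourth power avoids the fractional exponent) and adds up the exponents; the function name is hypothetical.

```python
from math import comb


def subset_bound_holds(k):
    """Check C(4k, k) <= 2^(3.25 k) exactly, via C(4k, k)^4 <= 2^(13 k)."""
    return comb(4 * k, k) ** 4 <= 2 ** (13 * k)


checked = all(subset_bound_holds(k) for k in range(1, 200))

# 2^(3.25k) subsets times 64^k = 2^(6k) codon assignments per subset.
total_exponent = 3.25 + 6
```

The bound follows from the standard entropy estimate C(4k, k) ≤ 2^(4k·H(1/4)) with H(1/4) ≈ 0.811, and the two exponents add up to the 9.25 in the stated running time.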
27
Closing remarks
Extending our results:
Finding a practical algorithm for the cutwidth problem restricted to cubic graphs with fixed cutwidth.
More interesting parameters?
Hardness results?
Applying our techniques to a similar variation of the problem which has been studied in the literature [Backofen’04].
Thank You!