On realizing shapes in the theory of RNA neutral networks
Speaker:
Leszek Gąsieniec, U of Liverpool, UK
Joint work with:
Peter Clote, Boston College, USA
Roman Kolpakov, U of Moscow, Russia
Evangelos Kranakis, Carleton U, Canada
Danny Krizanc, Wesleyan U, USA
Realizing shapes
• RNA sequences
• RNA secondary structures - shapes
• Problems:
– Finding all shapes related to a single RNA sequence
– Realizing a number of shapes based on a single RNA sequence
• Solutions
– NP-hardness result
– Exact/approximate solutions
Secondary structure: “shape”
• For an integer n>0, a length-n RNA nucleotide sequence is considered as a word in the space C^n = {A,C,G,U}^n
• For a = a_1 a_2 … a_n ∈ C^n, a secondary structure S ∈ Sn is a collection of pairs (i,j) such that:
– a_i a_j ∈ {AU,UA,CG,GC}
– if (i,j) ∈ S and (k,l) ∈ S, then the combination i<k<j<l is not permitted, i.e., pseudo-knots are disallowed
– each position occurs in at most one pair (i,j) ∈ S
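The three conditions above can be checked mechanically. The following is a minimal sketch, not taken from the talk: the name is_secondary_structure is hypothetical, and it assumes pairs are given as 1-based tuples (i, j) with i < j.

```python
WATSON_CRICK = {"AU", "UA", "CG", "GC"}

def is_secondary_structure(seq, pairs):
    """Check the three slide conditions for 1-based pairs (i, j), i < j."""
    used = set()
    for i, j in pairs:
        # Assumption: pairs are ordered and lie inside the sequence.
        if not (1 <= i < j <= len(seq)):
            return False
        # Condition 1: the paired nucleotides must be complementary.
        if seq[i - 1] + seq[j - 1] not in WATSON_CRICK:
            return False
        # Condition 3: each position takes part in at most one pair.
        if i in used or j in used:
            return False
        used.update((i, j))
    # Condition 2: no pseudo-knots, i.e. no crossing i < k < j < l.
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True
```

The crossing test is quadratic in the number of pairs, which keeps the sketch short; a stack-based scan would do the same check in linear time.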
Shapes
[Figure: the sequence ACGUCGGUACCAGUUGAGGUCCGAGGACG shown three times; the second and third copies are marked NO.]
Shapes
• Secondary structures can be identified with balanced parenthesis expressions padded with ‘dots’, where
– a dot (°) corresponds to an unpaired nucleotide position, and
– a matching pair of parentheses which opens at nucleotide position i and closes at nucleotide position j corresponds to a base pair (i,j)
AC°UCGGUA°CAGUU°A°°UC°GAG°°C°
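As an illustration of this correspondence, here is a small sketch that converts a parenthesis expression into its list of base pairs. It assumes the conventional dot-bracket notation with "(", ")" and "." as the dot symbol; the function name parse_shape is hypothetical.

```python
def parse_shape(shape):
    """Return the sorted 1-based base pairs of a dot-bracket string."""
    stack, pairs = [], []
    for pos, ch in enumerate(shape, start=1):
        if ch == "(":
            stack.append(pos)          # remember where the pair opens
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs.append((stack.pop(), pos))
        elif ch != ".":
            raise ValueError("unexpected symbol %r" % ch)
    if stack:
        raise ValueError("unbalanced '('")
    return sorted(pairs)
```

Because every closing parenthesis is matched with the most recent unmatched opening one, the resulting pairs never cross, so pseudo-knots cannot be expressed in this notation.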
Realizing Shapes
• Given a shape S ∈ Sn and a word a ∈ C^n, we say that a realizes S if every pair (i,j) of S joins complementary nucleotides a_i a_j, i.e., if padding the remaining positions with dots is feasible
S: AC°UCGGUA°CAGUU°A°°UC°GAG°°C°
a: ACGUCGGUACCAGUUGAGGUCCGAGGACG
Decision Problem
• “Given a finite set of secondary structures (shapes) {S1,S2,…,Sk}, under what conditions does there exist a single RNA sequence which can realize each of the given structures?”
• What can be done if such a realization is not feasible?
Optimization Problem M*RP
• We add a “don’t care” symbol * which matches any symbol {A,C,G,U}.
• Given a set of secondary structures (shapes) {S1,S2,…,Sk} to be realized by a sequence in C^n, find the minimum number of positions N(S1,…,Sk) such that, after removal of all base pairs incident to these positions (i.e., replacement of these positions by *), there exists a sequence a ∈ C^n which realizes each of the structures Si.
• We call this the Min * Realizability Problem and we refer to it as M*RP
Results
• O(nk) algorithm for the decision problem, i.e., when N(S1,…,Sk)=0
• Proof that the M*RP problem is NP-hard for k > 3 (the case k=3 remains open)
• We also study a bounded version of M*RP with a limited number of *s. E.g., we show that the case limited to the presence of a single * is also solvable in time O(nk).
M*RP Simplification
• We observe that a string a over the four-letter alphabet {A,C,G,U} realizing the shapes S1,…,Sk exists if and only if there is a binary string b realizing the same set of shapes; here a binary string realizes a shape if the two endpoints of each edge/pair carry different bits 0/1.
M*RP Simplification
AC°UCGGUA°CAGUU°A°°UC°GAG°°C°
10°010101°0110U°A°°UC°GAG°°C°
10°010101°01101°1°°00°100°°1°
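One way to see the equivalence is through an explicit translation in both directions. The sketch below is an assumption, not necessarily the mapping drawn on the slide: it sends A and C to 0 and G and U to 1, so each of AU, UA, CG, GC gets two different bits, and it reads a binary string back as A for 0 and U for 1. The names TO_BIT, to_binary and from_binary are hypothetical.

```python
# Assumed mapping: complementary nucleotides always receive different bits.
TO_BIT = {"A": "0", "C": "0", "G": "1", "U": "1"}

def to_binary(rna):
    """Translate an RNA word into a binary word position by position."""
    return "".join(TO_BIT[x] for x in rna)

def from_binary(bits):
    """Translate a binary word into an RNA word using only A and U."""
    return "".join("A" if b == "0" else "U" for b in bits)
```

With this choice, any pair of positions carrying different bits becomes AU or UA, so a binary realization immediately yields a realization over {A,C,G,U}.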
Graph of shapes
• G(S1,…,Sk) = (V,E) is a graph with:
– the set of vertices V consists of the consecutive positions 1,…,n of the bases (binary symbols in the simplified version) of the sequence in C^n
– the set of edges E is the union of the sets of edges (base pairs) appearing in the shapes S1,…,Sk
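A minimal sketch of this construction, assuming each shape is given as a collection of 1-based pairs (i, j); the name graph_of_shapes and the adjacency-set representation are illustrative choices.

```python
def graph_of_shapes(n, shapes):
    """Build G(S1,...,Sk) as a dict: position -> set of paired positions."""
    adj = {v: set() for v in range(1, n + 1)}
    for pairs in shapes:
        for i, j in pairs:
            # An edge shared by several shapes is stored only once.
            adj[i].add(j)
            adj[j].add(i)
    return adj
```

Each shape contributes at most one edge per position, so every vertex has degree at most k and the graph has size O(nk).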
Graph of shapes
[Figure: the sequence ACGUCGGUACCAGUUGAGGUCCGAGGACG shown twice, with positions labelled 1, 2, …, n.]
An observation
• Lemma:
– Any set of shapes S1,S2,…,Sk of size n can be realized by a single binary string b if and only if the graph G(S1,S2,…,Sk) has no odd cycles (it is 2-colorable).
– Moreover, one can check the existence of b and, if b exists, construct it in O(nk) time
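The lemma suggests a direct procedure: 2-color the graph of shapes by BFS and read the colors as bits. The following is a sketch under the assumption that shapes are collections of 1-based pairs; realizing_binary_string is a hypothetical name.

```python
from collections import deque

def realizing_binary_string(n, shapes):
    """Return a binary string realizing all shapes, or None on an odd cycle."""
    adj = [[] for _ in range(n + 1)]
    for pairs in shapes:
        for i, j in pairs:
            adj[i].append(j)
            adj[j].append(i)
    color = [None] * (n + 1)
    for start in range(1, n + 1):
        if color[start] is not None:
            continue
        color[start] = 0               # free choice in every component
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None        # same color on both ends: odd cycle
    return "".join(str(c) for c in color[1:])
```

Every vertex and edge is handled a constant number of times, so the running time is linear in the size of the graph, in line with the O(nk) bound stated above.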
M*RP[m] Problem
• M*RP[m] problem - for any set of shapes S1,…,Sk compute a string over the alphabet {0,1,*} which realizes all shapes and contains no more than m occurrences of the don’t care symbol *
• Lemma: M*RP[m] problem can be solved in time $O\left(\binom{n}{m}\,\|G(S_1,\dots,S_k)\|\right)$
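A bound of this form is what one gets from exhaustive search: try every set of at most m positions for the *s, drop the edges incident to them, and test whether the rest is 2-colorable. The sketch below illustrates that idea and is not necessarily the procedure behind the lemma; two_color and solve_mrp_bounded are hypothetical names, and pairs are assumed to be 1-based tuples.

```python
from collections import deque
from itertools import combinations

def two_color(n, edges, removed):
    """2-color positions 1..n ignoring `removed`; None if an odd cycle remains."""
    adj = [[] for _ in range(n + 1)]
    for i, j in edges:
        if i not in removed and j not in removed:
            adj[i].append(j)
            adj[j].append(i)
    color = {}
    for start in range(1, n + 1):
        if start in removed or start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None
    return color

def solve_mrp_bounded(n, shapes, m):
    """Return a string over {0,1,*} with at most m stars, or None."""
    edges = {pair for pairs in shapes for pair in pairs}
    for size in range(m + 1):          # fewest stars first
        for removed in combinations(range(1, n + 1), size):
            color = two_color(n, edges, set(removed))
            if color is not None:
                return "".join("*" if v in removed else str(color[v])
                               for v in range(1, n + 1))
    return None
```

Because candidate sets are tried in order of increasing size, the first string found uses the smallest possible number of *s not exceeding m.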
Solving M*RP[1] problem
• Using the formula from the previous slide we know that M*RP[1] problem can be solved in time O(n||G(S1,…,Sk)||)
• In what follows we give some details of the algorithm solving M*RP[1] in time O(||G(S1,…,Sk)||)
Critical vertices
• A vertex of a graph G is called critical if it is contained in all odd cycles in G.
• Lemma: All critical vertices of an arbitrary graph G can be found in time O(||G||).
• Theorem: M*RP[1] can be solved in O(||G(S1,…,Sk)||) time.
Sketch of the algorithm
• Find any odd cycle without chords
– this can be done by first finding any odd cycle C, e.g., with the help of BFS and the parity test
– having an odd cycle, we “chop off” (one after another) its even sub-cycles based on chords
– all done in time O(||G||)
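The first step, finding some odd cycle with BFS and a parity test, can be sketched as follows. This is an illustration only: the chord "chop-off" step is not implemented, so the returned cycle may still contain chords. The names find_odd_cycle and _close_cycle are hypothetical, and the graph is assumed to be a dict mapping each vertex to its neighbours.

```python
from collections import deque

def _close_cycle(u, v, depth, parent):
    """Walk u and v up the BFS tree to their meeting point and join the paths."""
    left, right = [u], [v]
    while left[-1] != right[-1]:
        if depth[left[-1]] >= depth[right[-1]]:
            left.append(parent[left[-1]])
        else:
            right.append(parent[right[-1]])
    # left ends at the meeting point; append right's path in reverse without it.
    return left + right[-2::-1]

def find_odd_cycle(adj):
    """Return the vertices of some odd cycle in order, or None if none exists."""
    depth, parent = {}, {}
    for root in adj:
        if root in depth:
            continue
        depth[root], parent[root] = 0, None
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in depth:
                    depth[v], parent[v] = depth[u] + 1, u
                    queue.append(v)
                elif depth[v] % 2 == depth[u] % 2:
                    # Parity test: an edge between equal-parity levels
                    # closes an odd cycle through the BFS tree.
                    return _close_cycle(u, v, depth, parent)
    return None
```

The search and the walk back to the meeting point each touch every vertex and edge a bounded number of times, so this step stays within O(||G||).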
External connected components K1,K2…,Ke
[Figure: an odd cycle C with external connected components K1, K2, …, Ki.]
Odd neighbor pairs
[Figure: the territory of a connected component Ki and the territory of the odd cycle C; two vertices x and y, a segment of length L, and the example L + l(x) + l(y) = 5.]
Some properties of external connected components
• The external components must not contain an odd cycle, i.e., each component is 2-colorable
• For any Ki
– the number of odd neighbor pairs of Ki must be odd,
– and it cannot be larger than 2
• Which means that each Ki must have exactly one odd neighbor pair, which defines a segment Li on the odd cycle C
Critical vertices
• Let R be the intersection of all segments Li
• One can prove that:
– all critical vertices are contained in R
– and every vertex in R is critical, i.e., any cycle in G which does not contain vertices from R must be even
• The content of the set R can be computed in time linear in ||C||.
Conclusion
• Theorem: M*RP[1] can be solved in O(||G(S1,…,Sk)||) time
– what is the complexity of M*RP[i]?
• Theorem: M*RP is NP-hard for k>3
– the case with k=2 is always realizable, and
– the complexity of the case with k=3 is not yet established
Thank you