On sorting unsigned permutations by double-cut-and-joins


J Comb Optim (2013) 25:339–351
DOI 10.1007/s10878-010-9369-8


Xin Chen

Published online: 4 December 2010
© Springer Science+Business Media, LLC 2010

Abstract The problem of sorting unsigned permutations by double-cut-and-joins (SBD) arises when we perform double-cut-and-join (DCJ) operations on pairs of unichromosomal genomes without gene strandedness information. In this paper we show that it is an NP-hard problem by establishing its equivalence to a previously known problem, called breakpoint graph decomposition (BGD), which calls for a largest collection of edge-disjoint alternating cycles in a breakpoint graph. To obtain a better approximation algorithm for the SBD problem, we made a suitable modification to Lin and Jiang's algorithm, which was initially proposed to approximate the BGD problem, and then carried out a rigorous performance analysis via fractional linear programming. The approximation ratio thus achieved for the SBD problem is 17/12 + ε ≈ 1.4167 + ε, for any positive ε.

Keywords Genome rearrangement · Double-cut-and-joins · Breakpoint graph decomposition · Fractional linear programming

1 Introduction

A basic problem in the study of genome rearrangements is to compute the genomic distance between two genomes based on their gene orders. The genomic distance is generally defined as the minimum number of operations required to transform one genome into another, and the complexity of computing it may largely depend on the choice of operations and on the representation of genomes as well.

Double-cut-and-join (DCJ) is an operation that cuts a chromosome in two places and joins the four ends of the cuts in a new way. It was first introduced in Yancopoulos et al. (2005) and later refined in Bergeron et al. (2006), to unify all the classical genome rearrangement operations including inversions, transpositions, translocations, block-interchanges, fissions and fusions. A simple formula exists for the genomic distance by the DCJ operations, which can be computed in linear time for pairs of genomes when the strandedness of genes is known, i.e., when the directions of the genes are given.

X. Chen
School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
e-mail: [email protected]

However, genetic maps produced from many experimental studies such as recombination analysis and physical imaging generally do not specify the strandedness of genes or markers. For instance, the Gramene database (http://www.gramene.org) contains a variety of such maps for the rice, maize, oats and other cereal genomes. As a result, a genome can only be represented as an unsigned permutation. When we perform the DCJ operations on pairs of such genomes, a new combinatorial problem naturally arises, which we call sorting unsigned permutations by double-cut-and-joins, or SBD for short. It is specifically defined as the problem of finding the minimum number of DCJs required to transform an unsigned permutation into the identity unsigned permutation.

In this paper we study the SBD problem, and for the sake of simplicity, we focus our present study on the problem whose input is restricted to two unichromosomal genomes. We first show that the minimum number of DCJs required to sort an unsigned permutation π, denoted by d(π), is equal to the number of breakpoints b(π) minus the maximum number c(π) of edge-disjoint alternating cycles in the breakpoint graph G(π); that is, d(π) = b(π) − c(π). It turns out that the SBD problem is indeed equivalent to a previously known NP-hard problem called breakpoint graph decomposition (BGD; Kececioglu and Sankoff 1995), implying that the SBD problem is NP-hard as well.

Bafna and Pevzner (1996) presented the first approximation algorithm for the BGD problem with performance ratio 7/4. It was subsequently improved to 3/2 and 33/23 + ε ≈ 1.4348 + ε, for any positive ε, due to Christie (1998) and Caprara and Rizzi (2002), respectively. At present, the best known approximation ratio achievable for BGD (and hence for SBD) is (5073 − 15√1201)/3208 + ε ≈ 1.4193 + ε, for any positive ε, due to Lin and Jiang (2004). To further improve the approximation, we present in Sect. 4.2 a suitable modification of Lin and Jiang's approximation algorithm, and then carry out a rigorous performance analysis via fractional linear programming to obtain a better approximation ratio of 17/12 + ε ≈ 1.4167 + ε, for any positive ε.

2 Preliminaries

2.1 Breakpoint graph decomposition

Let π = π_1 π_2 ··· π_n be an unsigned permutation on {1, 2, ..., n}. We usually extend π by adding π_0 = 0 and π_{n+1} = n + 1. A pair of consecutive elements π_i and π_{i+1} of π is called an adjacency if |π_i − π_{i+1}| = 1 and otherwise, a breakpoint. Let b(π) denote the number of breakpoints in π. Define the inverse permutation π^{−1} of π by setting its entry at index π_i to i, that is, (π^{−1})_{π_i} := i, for all i = 0, ..., n + 1.
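The definitions above can be illustrated with a minimal Python sketch. The function names `count_breakpoints` and `inverse` are hypothetical helpers introduced here for illustration; the permutation is assumed to be given as a list of the integers 1..n.

```python
def count_breakpoints(perm):
    """Return b(pi): the number of breakpoints of the extended permutation."""
    n = len(perm)
    ext = [0] + list(perm) + [n + 1]          # add pi_0 = 0 and pi_{n+1} = n + 1
    # A consecutive pair is a breakpoint unless its values differ by exactly 1.
    return sum(1 for x, y in zip(ext, ext[1:]) if abs(x - y) != 1)


def inverse(perm):
    """Return the inverse of the extended permutation: inv[value] = position."""
    ext = [0] + list(perm) + [len(perm) + 1]
    inv = [0] * len(ext)
    for position, value in enumerate(ext):
        inv[value] = position
    return inv
```

For the permutation π = 4 5 3 1 2 used in Fig. 1, `count_breakpoints([4, 5, 3, 1, 2])` gives 4, in agreement with the value b(π) = 4 stated in the figure caption.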


The notion of the breakpoint graph was first introduced in Bafna and Pevzner (1993) to study the problem of sorting by reversals. Given an unsigned permutation π, the breakpoint graph is an edge-colored graph G(π) with n + 2 vertices π_0, π_1, π_2, ..., π_n, π_{n+1}. Two vertices π_i and π_j are joined by a black edge if they form a breakpoint in π, or by a gray edge if they form a breakpoint in π^{−1}. A cycle in G(π) is called alternating if the colors of every two consecutive edges are distinct. Henceforth, all cycles referred to in the breakpoint graph will be alternating cycles. The length of a cycle is the number of black edges that it contains, and a cycle of length l is called an l-cycle. The breakpoint graph G(π) can always be decomposed into a maximum number of edge-disjoint alternating cycles (Bafna and Pevzner 1993), and this maximum number is denoted by c(π).
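As a small illustration of this construction, the sketch below lists the black and gray edges of G(π). The function name `breakpoint_graph` and the edge-list representation are assumptions made for this example; gray edges are read as pairs of consecutive values v, v + 1 whose positions in π are not adjacent.

```python
def breakpoint_graph(perm):
    """Return (black, gray) edge lists of the breakpoint graph G(pi).

    Vertices are the values 0..n+1 of the extended permutation.
    """
    ext = [0] + list(perm) + [len(perm) + 1]
    pos = {value: index for index, value in enumerate(ext)}
    # Black edges: consecutive positions whose values are not consecutive.
    black = [(ext[i], ext[i + 1]) for i in range(len(ext) - 1)
             if abs(ext[i] - ext[i + 1]) != 1]
    # Gray edges: consecutive values whose positions are not consecutive.
    gray = [(v, v + 1) for v in range(len(ext) - 1)
            if abs(pos[v] - pos[v + 1]) != 1]
    return black, gray
```

For π = 4 5 3 1 2 this yields four black and four gray edges, which split into the two alternating 2-cycles 0, 4, 3, 1 and 5, 3, 2, 6.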

A slightly different definition of breakpoint graph can also be seen in many studies (e.g. Hannenhalli and Pevzner 1996). It allows edges to join not only breakpoints but also adjacencies so that 1-cycles may occur in a breakpoint graph. Let G′(π) denote the breakpoint graph constructed with this definition and c′(π) the number of edge-disjoint alternating cycles in a maximum decomposition of G′(π). It is not hard (though not trivial) to see that there always exists a maximum cycle decomposition that retains all the 1-cycles in G′(π). Therefore, we have

b(π) − c(π) = n + 1 − c′(π). (1)

Unless otherwise stated, we will use the first definition of breakpoint graph in the rest of the paper.

The problem of breakpoint graph decomposition is aimed at finding b(π) − c(π). It was initially introduced to help solve the problem of sorting by reversals (Bafna and Pevzner 1993; Caprara and Rizzi 2002; Lin and Jiang 2004). Since b(π) is given with a permutation π, it becomes equivalent to decomposing G(π) into a maximum number of edge-disjoint alternating cycles.

The concept of breakpoint graph extends naturally to signed permutations. A signed permutation π⃗, in which each element π⃗_i is a signed integer, can be transformed into an unsigned permutation π by replacing a positive integer +i by the two unsigned integers 2i − 1, 2i and a negative integer −i by the two unsigned integers 2i, 2i − 1, respectively. Then, define b(π⃗) := b(π) and c(π⃗) := c(π). Observe that in the breakpoint graph of a signed permutation, every vertex has degree at most two, thereby making the cycle decomposition unique and trivial. As stated in Theorem 1 below, Yancopoulos et al. (2005) showed that b(π⃗) − c(π⃗) gives the minimum number of DCJ operations required to transform a signed permutation π⃗ into the identity signed permutation.
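The signed-to-unsigned transformation described above can be sketched as follows. The function name `unsigned_image` is a hypothetical helper, and a signed permutation is assumed to be given as a list of nonzero integers.

```python
def unsigned_image(signed_perm):
    """Map a signed permutation to its unsigned image on {1, ..., 2n}.

    +i becomes the pair (2i - 1, 2i); -i becomes the pair (2i, 2i - 1).
    """
    image = []
    for element in signed_perm:
        i = abs(element)
        if element > 0:
            image.extend([2 * i - 1, 2 * i])
        else:
            image.extend([2 * i, 2 * i - 1])
    return image
```

For example, the signed permutation +3 +4 −1 −2 is mapped to 5 6 7 8 2 1 4 3.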

3 Sorting by double-cut-and-joins

In order to unify all the classical genome rearrangement events, Yancopoulos et al. (2005) introduced a new edit operation called double-cut-and-join (DCJ). It generally cuts a chromosome in two places and joins the four ends of the cuts in a new way. For example, when a chromosome is given as an unsigned permutation π, a DCJ operation ρ_1 that acts on two consecutive pairs π_i π_{i+1} and π_j π_{j+1} (i < j) of π cuts both π_i π_{i+1} and π_j π_{j+1} and joins either π_i π_j and π_{i+1} π_{j+1}, or π_i π_{j+1} and π_{i+1} π_j, to create two new consecutive pairs. If one chooses to join π_i π_j and π_{i+1} π_{j+1}, it simulates an inversion operation, resulting in a new permutation π · ρ_1 = π_1 ··· π_i π_j ··· π_{i+1} π_{j+1} ··· π_n. On the other hand, if one joins π_i π_{j+1} and π_{i+1} π_j, a circular intermediate π_{i+1} ··· π_j is generated (see Fig. 1b). In this case, we may absorb the circular intermediate back into the permutation by another DCJ operation ρ_2 that cuts two consecutive pairs π_k π_{k+1} (assume k < i) and π_l π_{l+1} (assume i < l < j) and joins π_k π_{l+1} and π_l π_{k+1}, resulting in a new permutation π · ρ_1 · ρ_2 = π_1 ··· π_k π_{l+1} ··· π_j π_{i+1} ··· π_l π_{k+1} ··· π_i π_{j+1} ··· π_n. We can see that the composition of the two DCJ operations ρ_1 and ρ_2 indeed simulates a block-interchange operation that acts on the two segments π_{k+1} ··· π_i and π_{l+1} ··· π_j.
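The two outcomes described above can be mimicked on plain Python lists. This is only a sketch of the resulting permutations, not of the cut-and-join mechanics themselves; the function names are hypothetical, and the indices follow the text (π_1 is `perm[0]`, and the boundary elements π_0 and π_{n+1} are implicit, so i may be 0 and j may be n).

```python
def dcj_inversion(perm, i, j):
    """Result of one DCJ joining pi_i pi_j and pi_{i+1} pi_{j+1} (0 <= i < j <= n):
    the segment pi_{i+1} .. pi_j is reversed in place."""
    return perm[:i] + perm[i:j][::-1] + perm[j:]


def dcj_block_interchange(perm, k, i, l, j):
    """Result of the two DCJs rho_1, rho_2 from the text (k < i < l < j):
    the segments pi_{k+1} .. pi_i and pi_{l+1} .. pi_j exchange places."""
    return perm[:k] + perm[l:j] + perm[i:l] + perm[k:i] + perm[j:]
```

With k = 0, i = 2, l = 3, j = 5, `dcj_block_interchange([4, 5, 3, 1, 2], 0, 2, 3, 5)` returns the identity permutation, so two DCJ operations suffice for the permutation of Fig. 1.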

Given an unsigned permutation π, the problem of sorting by double-cut-and-joins (SBD) is to find a sequence of DCJ operations ρ_1, ρ_2, ..., ρ_t such that π · ρ_1 · ρ_2 ··· ρ_t is the identity unsigned permutation and t is minimum. Let d(π) := t. In the case of signed permutations π⃗, we already know the following result.

Theorem 1 (Yancopoulos et al. 2005) Let π⃗ be a signed permutation on {1, 2, ..., n}. Then, d(π⃗) = b(π⃗) − c(π⃗). Moreover, an optimal sequence of DCJ operations that transforms π⃗ into the identity signed permutation can be found in linear time.

Fig. 1 (a) The breakpoint graph of the unsigned permutation π = 4 5 3 1 2, for which b(π) = 4 and c(π) = 2. (b) Two double-cut-and-join operations that sort the permutation π. Note that a circular intermediate (3 4 5) is created after the first DCJ operation


However, to the best of our knowledge, the problem of sorting unsigned permutations by double-cut-and-joins has not been discussed in the literature, so very little is known about it. In the following, we first show that d(π) can still be computed using the same formula as given in the previous theorem.

Theorem 2 Let π be an unsigned permutation on {1, 2, ..., n}. Then, d(π) = b(π) − c(π).

Proof A spin of π refers to a signed permutation π⃗ such that π⃗_i = +π_i or −π_i (Bafna and Pevzner 1993; Hannenhalli and Pevzner 1996). Consider now a maximum cycle decomposition of the breakpoint graph G(π). Let π⃗_i = +π_i if any of the following conditions holds: (i) there exists a cycle in the given cycle decomposition which contains a path π_{i−1} · π_i · (π_i − 1), (ii) there exists a cycle in the given cycle decomposition which contains a path π_{i+1} · π_i · (π_i + 1), or (iii) π_{i−1} = π_i − 1. Otherwise, let π⃗_i = −π_i. The signed permutation π⃗ is so constructed that its breakpoint graph preserves all the cycles from the given cycle decomposition of G(π). Hence, b(π) = b(π⃗) and c(π) = c(π⃗). Further observe that the effect of a DCJ operation on π⃗ can be mimicked by a DCJ operation on π, thus implying that d(π) ≤ d(π⃗). Finally, it follows from Theorem 1 that d(π) ≤ d(π⃗) = b(π⃗) − c(π⃗) = b(π) − c(π).

On the other hand, let us consider an optimal sorting of π by the sequence of DCJ operations ρ_1 · ρ_2 ··· ρ_{d(π)}. Define a signed permutation π̃ by π̃_i = +π_i. If we apply the same sequence of DCJ operations ρ_1 · ρ_2 ··· ρ_{d(π)} to π̃, the resulting permutation π̃ · ρ_1 · ρ_2 ··· ρ_{d(π)} would be an identity permutation if the sign of every element is ignored. Next we define another signed permutation π⃗ as follows. Let π⃗_i = +π_i if π̃_i remains positive in π̃ · ρ_1 · ρ_2 ··· ρ_{d(π)}. Otherwise, let π⃗_i = −π_i. If we also apply the same sequence of DCJ operations ρ_1 · ρ_2 ··· ρ_{d(π)} to π⃗, then the resulting permutation π⃗ · ρ_1 · ρ_2 ··· ρ_{d(π)} must be an identity permutation with all positive elements. It thus implies that d(π) ≥ d(π⃗) = b(π⃗) − c(π⃗). It is worth noting that the cycle decomposition of G(π⃗) might not lead to a cycle decomposition of G(π).¹ Therefore, we turn to the breakpoint graph G′(π) constructed with the second definition that we discussed earlier. From π⃗, we can obtain a cycle decomposition of G′(π) as follows. If π⃗_i is positive, then let the cycle decomposition contain both paths π_{i−1} · π_i · (π_i − 1) and π_{i+1} · π_i · (π_i + 1) (where the first edge is a black edge and the second is a gray edge for each path). If π⃗_i is negative, then let the cycle decomposition contain both paths π_{i−1} · π_i · (π_i + 1) and π_{i+1} · π_i · (π_i − 1) (where, once again, the first edge is a black edge and the second is a gray edge for each path). Observe that every cycle of length at least 2 in the resulting cycle decomposition corresponds to a unique cycle of the breakpoint graph G(π⃗), thereby implying that b(π⃗) − c(π⃗) ≥ n + 1 − c′(π). Furthermore, by (1), we have b(π) − c(π) = n + 1 − c′(π). These finally give d(π) ≥ b(π) − c(π). □
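For very small permutations, the formula d(π) = b(π) − c(π) can be evaluated by exhaustive search. The sketch below is an exponential-time illustration, not an algorithm from the paper. It assumes that alternating cycles are closed trails that may revisit a vertex, so that a full decomposition is determined by how the black and gray edges are paired at each vertex (the same choice that a spin makes in the proof above). All function names are hypothetical.

```python
from itertools import product


def breakpoint_graph(perm):
    """Black and gray edge lists of G(pi) on the values 0..n+1."""
    ext = [0] + list(perm) + [len(perm) + 1]
    pos = {value: index for index, value in enumerate(ext)}
    black = [(ext[i], ext[i + 1]) for i in range(len(ext) - 1)
             if abs(ext[i] - ext[i + 1]) != 1]
    gray = [(v, v + 1) for v in range(len(ext) - 1)
            if abs(pos[v] - pos[v + 1]) != 1]
    return black, gray


def max_cycle_count(perm):
    """Brute-force c(pi): try every black/gray pairing at every vertex."""
    black, gray = breakpoint_graph(perm)
    edges = [("black", e) for e in black] + [("gray", e) for e in gray]
    incident = {}  # vertex -> colour -> indices of incident edges
    for index, (colour, endpoints) in enumerate(edges):
        for v in endpoints:
            incident.setdefault(v, {"black": [], "gray": []})[colour].append(index)
    options = []
    for colours in incident.values():
        b, g = colours["black"], colours["gray"]
        if len(b) == 1:                      # one black and one gray edge: forced
            options.append([[(b[0], g[0])]])
        else:                                # two of each: two possible pairings
            options.append([[(b[0], g[0]), (b[1], g[1])],
                            [(b[0], g[1]), (b[1], g[0])]])
    best = 0
    for choice in product(*options):
        parent = list(range(len(edges)))     # union-find over edges

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for pairs in choice:
            for first, second in pairs:
                parent[find(first)] = find(second)
        # Each connected group of paired edges is one alternating cycle.
        best = max(best, len({find(e) for e in range(len(edges))}))
    return best


def dcj_distance(perm):
    """d(pi) = b(pi) - c(pi), as stated in Theorem 2."""
    black, _ = breakpoint_graph(perm)
    return len(black) - max_cycle_count(perm)
```

On π = 4 5 3 1 2 the search finds c(π) = 2 and hence d(π) = 2, matching Fig. 1; on π = 3 4 1 2 it finds a single 3-cycle, so d(π) = 3 − 1 = 2.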

From the above theorem we can easily see that the SBD problem is indeed equivalent to the problem of breakpoint graph decomposition (BGD), which was already shown to be NP-hard in Kececioglu and Sankoff (1995). The following theorem thus follows.

¹ One such example is π = 3 4 1 2 and π⃗ = 3 4 −1 −2.

Theorem 3 The SBD problem is NP-hard.

In the next section we will present an approximation algorithm for the SBD problem, which is also an improved approximation algorithm for the BGD problem.

4 Approximation of SBD

Theorem 2 indicates that a larger collection of edge-disjoint alternating cycles in the breakpoint graph G(π) would yield a shorter sequence of DCJ operations for sorting, since the number of breakpoints is fixed for a given unsigned permutation π. Therefore, to better approximate b(π) − c(π), we focus mainly on improving the alternating cycle decomposition.

4.1 Definitions and graph-theoretic background

Let c_i denote the number of i-cycles in a (fixed) maximum cycle decomposition of G(π), for i ≥ 2. Therefore, c(π) = Σ_{i≥2} c_i. Moreover, let c*_2 denote the maximum number of edge-disjoint 2-cycles in G(π), and c*_3 the maximum number of edge-disjoint cycles of length no more than 3. Obviously, c_2 ≤ c*_2 and c_2 + c_3 ≤ c*_3.

4.1.1 Finding edge-disjoint 2-cycles

To find a collection of edge-disjoint 2-cycles, one may construct a graph G_2(π) whose vertices represent all the 2-cycles in G(π) and whose edges join two 2-cycles that share some edge in G(π). An independent set in a graph is a subset of vertices in which no two are adjacent, and a maximum independent set is an independent set of maximum cardinality. Observe that a maximum independent set of G_2(π) gives a largest collection of edge-disjoint 2-cycles with cardinality c*_2.

Unfortunately, the problem of maximum independent set is NP-hard (Karp 1972). In order to obtain a good approximation, Caprara and Rizzi (2002) found a way to reduce the graph G_2(π) to another graph G′_2(π) with maximum degree at most 4, such that a maximum independent set of G′_2(π) is also a maximum independent set of G_2(π). By the best-known approximation algorithm (denoted as APPROX-MIS) for the maximum independent set problem restricted to graphs with bounded maximum degree (Berman and Fürer 1994), one can obtain the following lemma.

Lemma 1 (Caprara and Rizzi 2002) The problem of finding a largest collection of edge-disjoint 2-cycles in a breakpoint graph G(π) can be approximated with ratio 5/7 − ε in polynomial time, for any positive ε.
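The construction of G_2(π) can be sketched in a few lines of Python. Note the assumptions: the edge lists are taken as given (here written out by hand), the function names are hypothetical, and the final selection step is a plain minimum-degree greedy heuristic used as a stand-in. It is neither APPROX-MIS nor the degree-reducing transformation of Caprara and Rizzi, and it carries no 5/7 − ε guarantee.

```python
from itertools import combinations


def two_cycles(black, gray):
    """Enumerate the alternating 2-cycles as frozensets of coloured edges."""
    gray_set = {frozenset(e) for e in gray}
    found = set()
    for (a, b), (c, d) in combinations(black, 2):
        # Close the black edges ab and cd with gray edges in both possible ways.
        for x, y in ((c, d), (d, c)):
            g1, g2 = frozenset((b, x)), frozenset((y, a))
            if g1 != g2 and g1 in gray_set and g2 in gray_set:
                found.add(frozenset({("black", frozenset((a, b))),
                                     ("black", frozenset((c, d))),
                                     ("gray", g1), ("gray", g2)}))
    return list(found)


def conflict_graph(cycles):
    """G_2: vertex i is cycle i; i and j are adjacent if they share an edge."""
    return {i: {j for j in range(len(cycles)) if j != i and cycles[i] & cycles[j]}
            for i in range(len(cycles))}


def greedy_disjoint_cycles(cycles):
    """Greedy independent set in G_2 (illustrative stand-in for APPROX-MIS)."""
    adjacency = conflict_graph(cycles)
    chosen, blocked = [], set()
    for i in sorted(adjacency, key=lambda v: len(adjacency[v])):
        if i not in blocked:
            chosen.append(cycles[i])
            blocked |= adjacency[i]
    return chosen
```

For the breakpoint graph of π = 4 5 3 1 2 (black edges 0-4, 5-3, 3-1, 2-6 and gray edges 0-1, 2-3, 3-4, 5-6), the sketch finds two 2-cycles that do not conflict, so both are selected.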


4.1.2 Finding edge-disjoint 2-cycles and 3-cycles

To find a collection of edge-disjoint cycles of length no more than 3, one may construct a collection C of subsets of the base set S, where S consists of all the edges in G(π) and each subset in C consists of the edges of a 2-cycle or 3-cycle in G(π). A set packing is a sub-collection of pairwise disjoint subsets in C, and a maximum set packing is a set packing with maximum cardinality. Observe that a maximum set packing of C gives a largest sub-collection of 2-cycles and 3-cycles with cardinality c*_3.

The problem of maximum set packing is NP-hard too (Karp 1972). Since every subset in C has size at most 6, the problem reduces to the k-set packing problem, where k = 6. By the best-known approximation algorithm (denoted as APPROX-MSP) for the k-set packing problem (Halldórsson 1995), one can obtain the following lemma.

Lemma 2 (Lin and Jiang 2004) The problem of finding a largest collection of edge-disjoint cycles of length at most 3 in a breakpoint graph G(π) can be approximated with ratio 3/8 in polynomial time.

4.1.3 Local improvement search

The local improvement search is a technique often used to solve hard combinatorial optimization problems (Berman and Fürer 1994; Halldórsson 1995). To facilitate our subsequent algorithmic performance analysis, it is briefly introduced below in the context of the k-set packing problem.

Let I = (S, C) be an instance of the k-set packing problem, where S is a base set and C is a collection of subsets of S. We call C′ a feasible solution of I if it is a sub-collection of pairwise disjoint subsets in C. C′ is further said to be maximal if there is no subset C_i ∈ C \ C′ such that C′ ∪ {C_i} is still feasible. Even when a set packing C′ is already maximal, it might happen that there exist two subsets C_{i1}, C_{i2} ∈ C \ C′ and a subset C_j ∈ C′ such that (C′ ∪ {C_{i1}, C_{i2}}) \ {C_j} is feasible. In this case, we say that the subsets C_{i1} and C_{i2} improve C′, as its size grows by one. If, on the other hand, there is no such pair of subsets improving C′, then C′ is said to be 2-maximal.

A local improvement search procedure that finds a 2-maximal set packing for an instance I = (S, C) of the k-set packing problem may start with an arbitrary (possibly empty) set packing C′, and then repeatedly update C′, either by adding one subset of C \ C′ to C′ or by replacing one subset in C′ with two subsets in C \ C′, always keeping C′ feasible, until no more updates are possible.
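The procedure just described can be written down directly. The following is a minimal, unoptimized sketch for a generic set packing instance; the function name `two_maximal_packing` and the representation of subsets as frozensets are choices made for this illustration only.

```python
from itertools import combinations


def two_maximal_packing(collection, start=()):
    """Local improvement search: grow `start` into a 2-maximal set packing.

    `collection` is an iterable of sets; `start` must be pairwise disjoint.
    """
    collection = [frozenset(s) for s in collection]
    packing = [frozenset(s) for s in start]
    improved = True
    while improved:
        improved = False
        used = set().union(*packing)
        outside = [s for s in collection if s not in packing]
        # Update 1: add a single subset that is disjoint from the packing.
        for s in outside:
            if not (s & used):
                packing.append(s)
                improved = True
                break
        if improved:
            continue
        # Update 2: replace one subset of the packing by two outside subsets.
        for old in packing:
            freed = used - old                     # elements still occupied
            candidates = [s for s in outside if not (s & freed)]
            for s1, s2 in combinations(candidates, 2):
                if not (s1 & s2):
                    packing.remove(old)
                    packing.extend([s1, s2])
                    improved = True
                    break
            if improved:
                break
    return packing
```

Every update increases the size of the packing by one, so the loop terminates after at most |C| updates. For instance, starting from the packing {{1, 2}} in the collection {{1, 2}, {1, 3}, {2, 4}}, the search replaces {1, 2} by the two subsets {1, 3} and {2, 4}.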

4.1.4 Lin and Jiang’s algorithm

The currently best ratio known to approximate b(π) − c(π) is due to Lin and Jiang (2004). Their algorithm is summarized below.

Lin and Jiang’s algorithm

1. Compute a collection P of 2-cycles by algorithm APPROX-MIS.
2. Compute a 2-maximal collection Q of cycles of length at most 3 by improving P (using the local improvement search procedure as described in Sect. 4.1.3).
3. Compute a collection R of cycles of length at most 3 by algorithm APPROX-MSP.
4. Output the larger of the two collections Q and R.

Basically, Lin and Jiang's algorithm first runs APPROX-MIS and APPROX-MSP to obtain two lower bounds of (5/7 − ε)c_2 and (3/8)(c_2 + c_3) on the number of edge-disjoint cycles, respectively, and then runs a 2-maximal improvement procedure on the collection P of edge-disjoint 2-cycles. The latter leads to a better size guarantee by incorporating a balancing argument, as stated in the following lemma.

Lemma 3 (Lin and Jiang 2004) The resulting 2-maximal collection of edge-disjoint cycles either improves the lower bound from (5/7 − ε)c_2 to ((√1201 + 89)/168 − ε)c_2 or improves the lower bound from (3/8)(c_2 + c_3) to ((37 − √1201)/6)(c_2 + c_3), but not both.

Further note that at least four black edges are needed to form an l-cycle, for any l ≥ 4. With this fact and the above lemma, Lin and Jiang (2004) successfully showed the following performance ratio for approximating b(π) − c(π), thereby giving the same ratio for approximating the SBD problem.

Theorem 4 (Lin and Jiang 2004) The problem of minimizing b(π) − c(π) can be approximated within ratio (5073 − 15√1201)/3208 + ε ≈ 1.4193 + ε in polynomial time, for any positive ε.

4.2 A better performance guarantee

In the previous section, we introduced the best-to-date approximation algorithm for minimizing b(π) − c(π). In this section, we make a suitable modification to this algorithm and then perform a rigorous performance analysis via fractional linear programming to achieve a better approximation ratio.

4.2.1 The modified algorithm

We modified Lin and Jiang's algorithm mainly by removing a computation-intensive step that employs the APPROX-MSP algorithm to compute a collection R of cycles of length at most 3 (i.e., step 3 as seen above). The modified algorithm is summarized below.

The modified algorithm

1. Compute a collection P of 2-cycles by algorithm APPROX-MIS.
2. Compute a 2-maximal collection Q of cycles of length at most 3 by improving P (using the local improvement search procedure as described in Sect. 4.1.3).
3. Output the collection Q.


Let Q_2 and Q_3 be the sub-collections of 2-cycles and 3-cycles in Q, respectively. Obviously, Q = Q_2 ∪ Q_3 and Q_2 ∩ Q_3 = ∅. Furthermore, let C_2 and C_3 be the sub-collections of 2-cycles and 3-cycles in a (fixed) maximum cycle decomposition of G(π), respectively. In the following, we denote the sizes of the above-mentioned collections by their respective lower-case letters (e.g., the size of Q_2 is denoted by q_2). Moreover, we simplify the notations b(π) and c(π) to b and c, respectively.

Lemma 4 2(c_2 + c_3) ≤ 5q_2 + 7q_3.

Proof Notice that C_2 ∪ C_3 is a (not necessarily maximal) set packing of 2-cycles and 3-cycles. Let S_1 be the collection of cycles in C_2 ∪ C_3 each of which intersects exactly one cycle in Q, and let S_2 = (C_2 ∪ C_3) \ S_1. Obviously, s_1 + s_2 = c_2 + c_3. Since Q is a 2-maximal set packing of 2-cycles and 3-cycles, every cycle in C_2 ∪ C_3 intersects at least one cycle in Q, thereby implying that every cycle in S_2 intersects at least two cycles in Q. Moreover, no two cycles in S_1 can intersect the same cycle in Q because otherwise they would further improve Q. Therefore,

s_1 ≤ q = q_2 + q_3. (2)

Further notice that every 2-cycle in Q_2 can intersect at most 4 cycles in C_2 ∪ C_3, and that every 3-cycle in Q_3 can intersect at most 6 cycles in C_2 ∪ C_3. Therefore,

s_1 + 2s_2 ≤ 4q_2 + 6q_3. (3)

Adding Inequalities (2) and (3) yields

2s_1 + 2s_2 ≤ 5q_2 + 7q_3, (4)

which, together with s_1 + s_2 = c_2 + c_3, establishes the lemma. □

Lemma 5 2p ≤ 2q_2 + q_3.

Proof Notice that Q is initialized as P in our modified algorithm, which implies that q_2 = p and q_3 = 0 initially. Therefore, the inequality 2p ≤ 2q_2 + q_3 is true at the beginning of the local search for a 2-maximal collection. During the local search process, any improvement by replacing one set with two sets or adding one set into Q will never lower the value of 2q_2 + q_3. Therefore, 2p ≤ 2q_2 + q_3 remains true at the end of the local search. □

4.2.2 Performance analysis via fractional linear programming

Let C′ be a collection of edge-disjoint alternating cycles in which Q_2 and Q_3 are respectively the sub-collections of 2-cycles and 3-cycles output from the above algorithm. It is obvious that q_2 + q_3 ≤ c′ and c_2 + c_3 ≤ c. Moreover, c ≤ c_2 + c_3 + (b − 2c_2 − 3c_3)/4, which follows from the fact that every i-cycle contains at least 4 black edges for all i ≥ 4. Then, finding the worst-case ratio between the sizes of the approximate solution C′ and some (fixed) optimal solution C is reduced to solving the following optimization problem.

$$
\begin{aligned}
\max\quad & \frac{b - c'}{b - c} \\
\text{subject to}\quad & c_2 + c_3 \le c, \\
& c \le c_2 + c_3 + \frac{b - 2c_2 - 3c_3}{4}, \\
& \left(\frac{5}{7} - \varepsilon\right) c_2 \le p, \\
& 2(c_2 + c_3) \le 5q_2 + 7q_3, \\
& 2p \le 2q_2 + q_3, \\
& q_2 + q_3 \le c', \\
& b \ge 1, \\
& c_2, c_3, p, q_2, q_3 \ge 0.
\end{aligned}
$$

This is a fractional linear programming problem, that is, a generalization of a linear programming problem in which the objective function is the ratio of two linear functions. Notice that the value of c′ is bounded from below and the value of c is bounded from above. Therefore, the maximum is attained when we substitute c′ by q_2 + q_3 and c by c_2 + c_3 + (b − 2c_2 − 3c_3)/4. That is,

$$
\begin{aligned}
\max\quad & \frac{b - q_2 - q_3}{\frac{3}{4} b - \frac{1}{2} c_2 - \frac{1}{4} c_3} \\
\text{subject to}\quad & 2c_2 + 3c_3 \le b, \\
& \left(\frac{5}{7} - \varepsilon\right) c_2 \le p, \\
& 2(c_2 + c_3) \le 5q_2 + 7q_3, \\
& 2p \le 2q_2 + q_3, \\
& b \ge 1, \\
& c_2, c_3, p, q_2, q_3 \ge 0.
\end{aligned}
$$

Consider the two cases c_2 + 2c_3 < 1 and c_2 + 2c_3 ≥ 1 separately. When c_2 + 2c_3 < 1, we have c_2 = c_3 = 0 because both c_2 and c_3 are nonnegative integers. The maximum objective value 4/3 is then attained when q_2 = q_3 = 0. When c_2 + 2c_3 ≥ 1, we further have 2c_2 + 3c_3 ≥ 1. Then, the maximum is attained when b = 2c_2 + 3c_3. Therefore, the worst-case approximation ratio is equal to max{4/3, R(ε)}, where R(ε) denotes the maximum objective value of the following fractional linear programming problem

$$
\begin{aligned}
\max\quad & \frac{2c_2 + 3c_3 - q_2 - q_3}{c_2 + 2c_3} \\
\text{subject to}\quad & \left(\frac{5}{7} - \varepsilon\right) c_2 \le p, \\
& 2(c_2 + c_3) \le 5q_2 + 7q_3, \\
& 2p \le 2q_2 + q_3, \\
& c_2 + 2c_3 \ge 1, \\
& c_2, c_3, p, q_2, q_3 \ge 0.
\end{aligned}
$$

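Because its objective function is bounded both from below and from above, we may transform it to an equivalent linear programming problem

$$
\begin{aligned}
\max\quad & 2c_2 + 3c_3 - q_2 - q_3 \\
\text{subject to}\quad & \left(\frac{5}{7} - \varepsilon\right) c_2 \le p, \\
& 2(c_2 + c_3) \le 5q_2 + 7q_3, \\
& 2p \le 2q_2 + q_3, \\
& c_2 + 2c_3 = 1, \\
& c_2, c_3, p, q_2, q_3 \ge 0.
\end{aligned}
$$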

Because p appears only in the first and third constraints, we may remove the variable p by combining these two constraints. Moreover, the equality constraint allows us to remove the variable c_3 by the substitution c_3 = (1 − c_2)/2:

$$
\begin{aligned}
\max\quad & \frac{3}{2} + \frac{1}{2} c_2 - q_2 - q_3 \\
\text{subject to}\quad & \left(\frac{5}{7} - \varepsilon\right) c_2 \le q_2 + \frac{1}{2} q_3, \\
& c_2 + 1 \le 5q_2 + 7q_3, \\
& c_2 \le 1, \\
& c_2, q_2, q_3 \ge 0.
\end{aligned}
$$

It is well known that, for a linear programming problem, the maximum is attained at an extreme point of the feasible region. It turns out that, for the above particular linear programming problem, the maximum value R(ε) is attained when

$$
c_2 = \frac{7}{18 - 35\varepsilon}, \qquad q_2 = \frac{5 - 7\varepsilon}{18 - 35\varepsilon}, \qquad \text{and} \qquad q_3 = 0,
$$

such that

$$
R(\varepsilon) = \frac{3}{2} - \frac{3 - 14\varepsilon}{36 - 70\varepsilon}.
$$
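The stated optimum can be cross-checked with exact rational arithmetic. The sketch below is an independent sanity check, not part of the original analysis: it enumerates the extreme points of the three-variable linear program in (c_2, q_2, q_3) by intersecting every triple of constraint boundaries, keeps the feasible ones, and returns the best objective value. All function names are hypothetical; only the Python standard library is used.

```python
from fractions import Fraction
from itertools import combinations

F = Fraction


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(rows, rhs):
    """Solve a 3x3 linear system by Cramer's rule; None if it is singular."""
    d = det3(rows)
    if d == 0:
        return None
    solution = []
    for col in range(3):
        m = [list(row) for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        solution.append(det3(m) / d)
    return solution


def reduced_lp_max(eps):
    """Maximum of 3/2 + c2/2 - q2 - q3 over the extreme points of the LP."""
    eps = F(eps)
    # Each constraint is (coefficients of (c2, q2, q3), rhs), read as <=.
    constraints = [
        ([F(5, 7) - eps, F(-1), F(-1, 2)], F(0)),   # (5/7 - eps) c2 <= q2 + q3/2
        ([F(1), F(-5), F(-7)], F(-1)),              # c2 + 1 <= 5 q2 + 7 q3
        ([F(1), F(0), F(0)], F(1)),                 # c2 <= 1
        ([F(-1), F(0), F(0)], F(0)),                # c2 >= 0
        ([F(0), F(-1), F(0)], F(0)),                # q2 >= 0
        ([F(0), F(0), F(-1)], F(0)),                # q3 >= 0
    ]
    best = None
    for trio in combinations(constraints, 3):
        point = solve3([coef for coef, _ in trio], [rhs for _, rhs in trio])
        if point is None:
            continue
        if all(sum(a * x for a, x in zip(coef, point)) <= rhs
               for coef, rhs in constraints):
            c2, q2, q3 = point
            value = F(3, 2) + c2 / 2 - q2 - q3
            if best is None or value > best:
                best = value
    return best


def closed_form(eps):
    """R(eps) = 3/2 - (3 - 14 eps) / (36 - 70 eps), as stated in the text."""
    eps = F(eps)
    return F(3, 2) - (3 - 14 * eps) / (36 - 70 * eps)
```

With ε = 0 the enumeration returns exactly 17/12, and with ε = 1/100 it agrees with the closed form R(ε) above. The enumeration is valid here because the objective is bounded above on the feasible region and the nonnegativity constraints guarantee that extreme points exist.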


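Since R(ε) is differentiable when ε ≤ 1/2, and a direct computation gives R(ε) − 17/12 = 49ε/(216 − 420ε), we obtain

$$
R(\varepsilon) \le \frac{17}{12} + \varepsilon, \qquad 0 \le \varepsilon \le \frac{167}{420},
$$

which suffices because ε may be chosen arbitrarily small.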
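The comparison between R(ε) and 17/12 + ε can be checked in exact arithmetic. The sketch below uses the closed form of R(ε) given above; the function names are hypothetical. Under this closed form, the gap R(ε) − 17/12 equals 49ε/(216 − 420ε), so the bound appears to hold exactly for 0 ≤ ε ≤ 167/420 ≈ 0.3976. This does not affect the final result, since ε may be taken arbitrarily small.

```python
from fractions import Fraction


def R(eps):
    """Closed form R(eps) = 3/2 - (3 - 14 eps) / (36 - 70 eps)."""
    eps = Fraction(eps)
    return Fraction(3, 2) - (3 - 14 * eps) / (36 - 70 * eps)


def gap(eps):
    """R(eps) - 17/12, which simplifies to 49 eps / (216 - 420 eps)."""
    return R(eps) - Fraction(17, 12)


def bound_holds(eps):
    """Check R(eps) <= 17/12 + eps exactly."""
    eps = Fraction(eps)
    return R(eps) <= Fraction(17, 12) + eps
```

For example, `bound_holds(Fraction(1, 100))` is true, and the two sides coincide at ε = 167/420.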

Finally, notice that 17/12 + ε ≥ 4/3, where 4/3 is the approximation ratio that we can achieve for the instances with c_2 = c_3 = 0. Putting them together, we have proved the following theorem, which states an improved approximation ratio for approximating b(π) − c(π).

Theorem 5 The problem of minimizing b(π) − c(π) can be approximated within ratio 17/12 + ε ≈ 1.4167 + ε in polynomial time, for any positive ε.

Along with Theorem 2, the above theorem yields one of our main results below.

Theorem 6 The SBD problem can be approximated within ratio 17/12 + ε ≈ 1.4167 + ε in polynomial time, for any positive ε.

5 Conclusions

Since the double-cut-and-join (DCJ) operation was first introduced in 2005, a variety of genome rearrangement analyses have been carried out in the DCJ context, such as estimating true evolutionary distances (Lin and Moret 2008), genome halving (Warren and Sankoff 2008), genome aliquoting (Warren and Sankoff 2009), and finding genome medians (Zhang et al. 2009). All these analyses are based on the common assumption that the strandedness of the genes of interest is already known.

In this paper, we studied the problem of sorting unsigned permutations by double-cut-and-joins (SBD), which naturally arises when the DCJ operations are performed on pairs of unichromosomal genomes without the gene strandedness information. We first showed that the SBD problem is equivalent to a previously known NP-hard problem called breakpoint graph decomposition, thereby implying that the SBD problem is also NP-hard. This result contrasts with our intuition that computing the rearrangement distance by the DCJ operations is always very easy, as exemplified in sorting signed permutations (Yancopoulos et al. 2005). To achieve a better approximation to the SBD problem, we made a suitable modification to Lin and Jiang's algorithm, which was initially proposed to approximate the BGD problem, and carried out a rigorous performance analysis based on fractional linear programming. The final approximation ratio achieved for the SBD problem is 17/12 + ε ≈ 1.4167 + ε, improving over the previously known approximation ratio (5073 − 15√1201)/3208 + ε ≈ 1.4193 + ε, for any positive ε.

Acknowledgements This research was supported in part by the Singapore NRF grant NRF2007IDM-IDM002-010 and MOE AcRF Tier 1 grant RG78/08. We would like to thank the referees for their valuable comments.


References

Bafna V, Pevzner PA (1993) Genome rearrangements and sorting by reversals. In: SFCS '93: Proceedings of the 1993 IEEE 34th annual foundations of computer science, pp 148–157

Bafna V, Pevzner PA (1996) Genome rearrangements and sorting by reversals. SIAM J Comput 25(2):272–289

Bergeron A, Mixtacki J, Stoye J (2006) A unifying view of genome rearrangements. In: WABI, pp 163–173

Berman P, Fürer M (1994) Approximating maximum independent set in bounded degree graphs. In: Proceedings of the 5th annual ACM-SIAM symposium on discrete algorithms, pp 365–371

Caprara A, Rizzi R (2002) Improved approximation for breakpoint graph decomposition and sorting by reversals. J Comb Optim 6(2):157–182

Christie DA (1998) A 3/2-approximation algorithm for sorting by reversals. In: SODA '98: Proceedings of the 9th annual ACM-SIAM symposium on discrete algorithms, pp 244–252

Halldórsson MM (1995) Approximating discrete collections via local improvements. In: SODA '95: Proceedings of the 6th annual ACM-SIAM symposium on discrete algorithms, pp 160–169

Hannenhalli S, Pevzner P (1996) To cut ... or not to cut (applications of comparative physical maps in molecular evolution). In: SODA '96: Proceedings of the 7th annual ACM-SIAM symposium on discrete algorithms, pp 304–313

Karp RM (1972) Reducibility among combinatorial problems. In: Miller RE, Thatcher JW (eds) Complexity of computer computations. Plenum, New York, pp 85–103

Kececioglu JD, Sankoff D (1995) Exact and approximation algorithms for sorting by reversals, with application to genome rearrangement. Algorithmica 13(1/2):180–210

Lin G, Jiang T (2004) A further improved approximation algorithm for breakpoint graph decomposition. J Comb Optim 8(2):183–194

Lin Y, Moret BM (2008) Estimating true evolutionary distances under the DCJ model. Bioinformatics 24(13):i114–i122

Warren R, Sankoff D (2008) Genome halving with double cut and join. In: Proceedings of the 6th Asia-Pacific bioinformatics conference, vol 6, pp 231–240

Warren R, Sankoff D (2009) Genome aliquoting with double cut and join. BMC Bioinform 10(Suppl 1):S2. doi:10.1186/1471-2105-10-S1-S2

Yancopoulos S, Attie O, Friedberg R (2005) Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics 21(16):3340–3346. doi:10.1093/bioinformatics/bti535

Zhang M, Arndt W, Tang J (2009) An exact solver for the DCJ median problem. In: Pacific symposium on biocomputing, pp 138–149