Shortest Common Superstring


Upload: mukta-debnath

Post on 02-Apr-2015



PROBLEM DEFINITION :

Given a set of strings S1, ..., Sn, find the shortest string s which contains each Si as a substring.


PROBLEM DESCRIPTION :

The shortest superstring problem takes several strings as input, possibly of different lengths, and finds the shortest string that contains every input string as a substring. This is helpful in the genome project, since it allows researchers to reconstruct entire coding regions from a collection of overlapping fragments. The problem also arises in a variety of other applications, including sparse matrix compression. Suppose we have an (n x m) matrix in which most of the elements are zero. We can partition each row into (m / k) runs of k elements each and construct the shortest common superstring S' of these runs. The problem is then reduced to storing the superstring S', plus an (n x m / k) array of pointers into S' denoting where each of the runs starts. Accessing a particular element M[i,j] still takes constant time, and space is saved when |S'| << mn.
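The constant-time access mentioned above can be sketched in C as follows. This is only an illustration under stated assumptions: the function name packed_get, the parameter names, and the row-major layout of the pointer array are all hypothetical, since the text does not define a concrete storage format.

```c
#include <stddef.h>

/* Looks up M[i][j] in a matrix whose rows were cut into runs of k elements.
   super        : the superstring that contains every run as a substring
   start        : n x runs_per_row offsets, stored row by row;
                  start[i * runs_per_row + r] is where run r of row i
                  begins inside super
   All names are hypothetical; they are not taken from the report. */
int packed_get(const int *super, const size_t *start,
               size_t runs_per_row, size_t k, size_t i, size_t j)
{
    size_t run = j / k;      /* which run of row i holds column j */
    size_t offset = j % k;   /* position of column j inside that run */
    return super[start[i * runs_per_row + run] + offset];
}
```

For example, the 2 x 4 matrix with rows (0,0,5,0) and (5,0,0,0) and k = 2 has only two distinct runs, (0,0) and (5,0). Both fit into the superstring (5,0,0), so the offsets are 1,0 for the first row and 0,1 for the second.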

INPUT DESCRIPTION :

Given a set of n strings, S = {S1,...,Sn}, we want to find the shortest string s that contains every Si as a substring.

OUTPUT DESCRIPTION :

The output is the shortest common superstring of the given set of substrings. The substrings are then printed below it, each one shifted to the right to the position where it matches the superstring.

ASSUMPTIONS :

We assume that no string Si in S is a substring of another string Sj in S. This problem is NP-hard: no polynomial-time algorithm is known, and the running time of exact methods grows exponentially with the number of strings, so large instances cannot be solved exactly in a practical amount of time.
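The assumption above can be checked before any of the algorithms is run. The following is a minimal sketch using the standard strstr function; the function name is_substring_free is hypothetical and is not part of the program given later in this report.

```c
#include <string.h>

/* Returns 1 if no string in strs[0..n-1] occurs inside a different string
   of the set, and 0 otherwise. Two identical strings also count as a
   violation, because each is a substring of the other.
   The function name is hypothetical. */
int is_substring_free(const char *strs[], int n)
{
    int i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            if (i != j && strstr(strs[j], strs[i]) != NULL)
                return 0;
    return 1;
}
```

Strings of equal length, as required by the program in this report, satisfy the assumption whenever they are all distinct.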


TECHNIQUES THAT CAN BE APPLIED:

1. GREEDY HEURISTIC METHOD :

The Greedy Heuristic method provides the standard approach to approximating Shortest Common Superstring.

ALGORITHM:

Step 1 : Input the set of strings S = { S1, S2, ..., Sn }.

Step 2 : For every pair of strings, compute the overlap (the longest suffix of one string that is a prefix of the other) using the Brute-Force Algorithm or the Knuth-Morris-Pratt Algorithm, and identify the pair with the maximum overlap.

Step 3 : Replace the pair of strings with maximum overlap by their merged string. Repeat Steps 2 and 3 until only one string remains.

Step 4 : Output the superstring on one line, followed by the substrings, each shifted to the right to the position where it occurs in the superstring.
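Steps 1 to 3 of the greedy heuristic can be sketched in C as follows. This is an illustration, not the program of this report: it uses the brute-force overlap computation, the names overlap, greedy_superstring, MAX_STRINGS and MAX_LEN are hypothetical, and it assumes that every row of strs is large enough to hold the final superstring.

```c
#include <string.h>

#define MAX_STRINGS 10    /* assumed limits, not taken from the report */
#define MAX_LEN     128

/* Length of the longest suffix of a that is also a prefix of b
   (brute force: try the longest candidate first). */
static int overlap(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    int k = la < lb ? la : lb;
    for (; k > 0; k--)
        if (strncmp(a + la - k, b, (size_t)k) == 0)
            return k;
    return 0;
}

/* Repeatedly replaces the pair with the largest overlap by its merge.
   strs is modified in place; the final string is copied to out. */
void greedy_superstring(char strs[][MAX_LEN], int n, char *out)
{
    while (n > 1) {
        int best = -1, bi = 0, bj = 1, i, j;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++) {
                int k;
                if (i == j)
                    continue;
                k = overlap(strs[i], strs[j]);
                if (k > best) {
                    best = k;
                    bi = i;
                    bj = j;
                }
            }
        /* merge: append the part of strs[bj] that is not overlapped */
        strcat(strs[bi], strs[bj] + best);
        /* remove strs[bj] by moving the last string into its slot */
        if (bj != n - 1)
            strcpy(strs[bj], strs[n - 1]);
        n--;
    }
    strcpy(out, strs[0]);
}
```

Ties between pairs with equal overlap are broken by taking the first pair found, so a different scan order may give a different superstring of the same or a different length.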

2. USING TRAVELLING SALESMAN PROBLEM APPROACH :

The Travelling Salesman Problem (TSP) is one of the best-known hard problems in computer science. A salesperson must visit n cities, passing through each city exactly once, beginning from a city that is considered the base or starting city and returning to it. The cost of transportation between each pair of cities is given. The problem is to find the order of visiting the cities that minimizes the total cost of the route.

To solve the above problem using TSP we have to do the following operations:

1. Create an overlap graph G where vertex vi represents string Si.

2. Assign edge (vi,vj) a weight equal to the length of Si minus the overlap of Si with Sj. Thus W(vi,vj) = 1 for Si = abc and Sj = bcd.

3. The minimum weight path visiting all the vertices defines the SCS; its length is the weight of the path plus the length of the last string on the path. These edge weights are not symmetric.

4. For the strings used in Output Instance 1, W(v1,v2) = 3 for the first two strings S1 = ABRAC and S2 = ACADA.

5. Now the TSP is applied.

ALGORITHM TSP:

Step 1 : First, enumerate all (n-1)! possible tours from a fixed starting vertex, where n is the number of input strings.

Step 2 : Next, compute the cost of every one of these (n-1)! tours.

Step 3 : Finally, keep the tour with the minimum cost.
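The exhaustive search can be sketched in C as follows. Note one deliberate difference from the steps above: a superstring corresponds to a path rather than a closed tour, so this sketch tries all n! orderings of the strings instead of (n-1)! tours from a fixed start. The names Search, overlap_len, permute, exhaustive_scs_length and MAX_N are hypothetical.

```c
#include <limits.h>
#include <string.h>

#define MAX_N 8   /* n! orderings are tried, so n must stay small */

typedef struct {
    int n;
    int len[MAX_N];         /* len[i] = length of Si */
    int w[MAX_N][MAX_N];    /* w[i][j] = length of Si - overlap(Si, Sj) */
    int used[MAX_N];
    int order[MAX_N];       /* ordering currently being built */
    int best_cost;
    int best_order[MAX_N];
} Search;

/* Longest suffix of a that is also a prefix of b. */
static int overlap_len(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    int k = la < lb ? la : lb;
    for (; k > 0; k--)
        if (strncmp(a + la - k, b, (size_t)k) == 0)
            return k;
    return 0;
}

/* Extends the partial ordering by every unused vertex in turn. */
static void permute(Search *s, int depth, int cost)
{
    int v;
    if (depth == s->n) {
        cost += s->len[s->order[depth - 1]];  /* last string is kept whole */
        if (cost < s->best_cost) {
            s->best_cost = cost;
            memcpy(s->best_order, s->order, sizeof s->order);
        }
        return;
    }
    for (v = 0; v < s->n; v++) {
        if (s->used[v])
            continue;
        s->used[v] = 1;
        s->order[depth] = v;
        permute(s, depth + 1,
                depth == 0 ? 0 : cost + s->w[s->order[depth - 1]][v]);
        s->used[v] = 0;
    }
}

/* Returns the length of the shortest common superstring, or -1 if n is
   out of range. best_order receives the order in which to merge. */
int exhaustive_scs_length(const char *strs[], int n, int best_order[])
{
    Search s;
    int i, j;
    if (n < 1 || n > MAX_N)
        return -1;
    memset(&s, 0, sizeof s);
    s.n = n;
    s.best_cost = INT_MAX;
    for (i = 0; i < n; i++)
        s.len[i] = (int)strlen(strs[i]);
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            s.w[i][j] = s.len[i] - overlap_len(strs[i], strs[j]);
    permute(&s, 0, 0);
    memcpy(best_order, s.best_order, (size_t)n * sizeof best_order[0]);
    return s.best_cost;
}
```

Because the running time grows as n!, this approach is only usable for a handful of strings, which is why the heuristic methods in this report matter in practice.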


3. THE SET COVER ALGORITHM APPROACH :

Using the set cover method, we obtain a 2Hn factor approximation algorithm, where Hn = 1 + 1/2 + ... + 1/n is the n-th harmonic number.

Given the input S = {S1,...,Sn}, we construct a string rijk for every pair Si, Sj in S by merging Si and Sj on their overlap of k characters (where k is the maximum overlap between the two). Let R be the set of all such strings r. For a string v, define sub(v) = {s in S | s is a substring of v}. The candidate sets of the set cover instance C are the sets sub(v) for all v in S U R.

ALGORITHM (SET COVER):

Step 1 : Use the greedy set cover algorithm to find a cover for the instance C.

Step 2 : From the sets selected by the algorithm, recover the strings v1, ..., vk such that sub(v1) U ... U sub(vk) is the cover for C.

Step 3 : Concatenate the strings v1, ..., vk. The result is a common superstring whose length is within a factor of 2Hn of the shortest one.

4. KRUSKAL’S MAXIMUM SPANNING TREE ALGORITHM :

We can also approach the problem by creating a graph G from the given set of strings and finding its Maximum Spanning Tree using Kruskal's Algorithm. T represents the tree.

ALGORITHM :

One method for computing the maximum weight spanning tree of a network G, due to Kruskal, can be summarized as follows.

Step 1 : Sort the edges of G into decreasing order by weight. Let T be the set of edges comprising the maximum weight spanning tree. Set T to the empty set.

Step 2 : Add the first edge to T.

Step 3 : Add the next edge to T if and only if it does not form a cycle in T. If there are no remaining edges, exit and report G to be disconnected.

Step 4 : If T has n-1 edges (where n is the number of vertices in G), stop and output T. Otherwise go to Step 3.
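The four steps can be sketched in C as follows. This is a generic sketch, not the program of this report: it treats the edges as undirected, numbers the vertices from 0, sorts with the standard qsort function, and detects cycles with a disjoint-set (union-find) structure. The names Edge, find_root, by_weight_desc, max_spanning_tree and MAX_VERTICES are hypothetical.

```c
#include <stdlib.h>

#define MAX_VERTICES 64   /* assumed limit for this sketch */

typedef struct { int u, v, weight; } Edge;

static int parent[MAX_VERTICES];

/* Finds the representative of the component containing x. */
static int find_root(int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];   /* shorten the path as we go */
        x = parent[x];
    }
    return x;
}

/* qsort comparator: heavier edges come first. */
static int by_weight_desc(const void *a, const void *b)
{
    const Edge *ea = a, *eb = b;
    return (eb->weight > ea->weight) - (eb->weight < ea->weight);
}

/* Writes the chosen edges to tree and returns how many were chosen.
   A result smaller than n_vertices - 1 means the graph is disconnected.
   Returns -1 if n_vertices exceeds MAX_VERTICES. edges is sorted in place. */
int max_spanning_tree(Edge edges[], int n_edges, int n_vertices, Edge tree[])
{
    int i, count = 0;
    if (n_vertices > MAX_VERTICES)
        return -1;
    for (i = 0; i < n_vertices; i++)
        parent[i] = i;                       /* every vertex starts alone */
    qsort(edges, (size_t)n_edges, sizeof edges[0], by_weight_desc);
    for (i = 0; i < n_edges && count < n_vertices - 1; i++) {
        int ru = find_root(edges[i].u);
        int rv = find_root(edges[i].v);
        if (ru != rv) {                      /* no cycle: accept the edge */
            parent[ru] = rv;
            tree[count++] = edges[i];
        }
    }
    return count;
}
```

A spanning tree found this way may branch, whereas a superstring needs the strings arranged in a single chain, so the tree still has to be restricted or rearranged before it can be read off as a string.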


OUR LOGICAL APPROACH :

We begin by taking 'n' substrings from the user and storing them in a 2D array. The user may enter a maximum of 10 substrings, which is the boundary condition of the program. The program is implemented using several structures. First, once the inputs are read, the structure 'matrix' records the number of overlapping characters between each ordered pair of substrings, that is, how many characters at the end of one substring match the beginning of the other. The structure 'edgelist' then represents each substring as a vertex and each non-zero overlap as a weighted edge, so the whole input is represented in the form of a graph. We sort 'edgelist' in non-increasing order of weight and apply Kruskal's algorithm to it; the edges chosen for the maximal spanning tree are stored in the structure 'sequence'. Finally, a function is invoked which rearranges these edges so that they can be read off in order to form the shortest common superstring.

DATA STRUCTURES USED :

We solve this problem using only the array data structure. We use a 2D array 'IS' to store the input substrings and a 1D array 'OS' to store the shortest common superstring, which is the required output. We chose the array as the primary data structure because strings are most naturally represented as character arrays. Manipulating the strings is also easier, since traversing an array by its indices adds little overhead.

PROGRAM IMPLEMENTATION USING C-CODE :

/*Inclusion of Header Files*/
#include<stdio.h>
#include<string.h>

/*declaration of global variables*/
/*all arrays are indexed from 1, so each one has an extra slot*/
char IS[11][11];/*input strings: up to 10 strings of up to 10 characters*/
char OS[101];/*output string: at most 10 x 10 characters plus the terminating '\0'*/
int total;/*total no. of substrings*/
int length;/*length of a substring*/
int edge_count=0;/*no. of matches found*/
int sequence_count=0;/*no. of matches actually considered in formation of the output string*/

/*declaration of global structures*/
struct string_matrix/*structure which keeps the record of common characters between each pair of substrings*/
{
    int value[11];
}matrix[11];

struct edgelist/*structure which stores the non-zero entries of the matrix (at most 10 x 9 of them)*/
{
    int u,v,weight;
}edgelist[91];

struct sequence/*structure which holds the maximal spanning tree*/
{
    int u,v,weight;
}sequence[10];

struct dummy_sequence/*scratch copy used while the sequence is rearranged*/
{
    int u,v,weight;
}dseq[10];

/*declaration of global functions*/
void display_sub_strings(void);
void create_matrix(void);
void display_matrix(void);
void create_edgelist(void);
void display_edgelist(void);
void arrange_edgelist(void);
void create_sequence(void);
int check_cycle(int);
void arrange_sequence(void);
void display_sequence(void);
void create_super_string(void);
void display_super_string(void);

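int main(void)
{
    int i;
    int max_strings=(int)(sizeof IS/sizeof IS[0])-1;/*row 0 of IS is unused*/
    char line[64];/*input buffer, larger than any accepted substring*/

    printf("ENTER THE TOTAL NO. OF SUBSTRINGS : ");
    if(scanf("%d",&total)!=1 || total<2 || total>max_strings)
    {
        printf("THE NO. OF SUBSTRINGS MUST BE BETWEEN 2 AND %d\n",max_strings);
        return 1;
    }
    fgets(line,sizeof line,stdin);/*discard the rest of the line that held the number*/

    printf("ENTER %d SUBSTRINGS (each terminated by an enter)\nEACH SUBSTRING MUST BE OF SAME LENGTH :\n",total);
    for(i=1;i<=total;i++)
    {
        if(fgets(line,sizeof line,stdin)==NULL)
            return 1;
        line[strcspn(line,"\n")]='\0';/*strip the newline kept by fgets*/
        if(strlen(line)>=sizeof IS[i])
        {
            printf("SUBSTRING TOO LONG\n");
            return 1;
        }
        strcpy(IS[i],line);
    }
    length=strlen(IS[1]);

    /*string_matrix needs no explicit initialization: global objects start out as zero*/

    display_sub_strings();
    create_matrix();
    display_matrix();
    create_edgelist();
    arrange_edgelist();
    display_edgelist();
    create_sequence();
    arrange_sequence();
    display_sequence();
    create_super_string();
    display_super_string();
    return 0;
}/*end of main*/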

/*definition of global functions*/
/*function to display each substring entered*/
void display_sub_strings(void)
{
    int i;
    printf("\nENTERED SUBSTRINGS :\n");
    printf("-------------------------------------\n");
    for(i=1;i<=total;i++)
    {
        printf("IS[%d] = ",i);
        puts(IS[i]);
    }
}/*end of function*/

/*function to create the string_matrix*/
/*matrix[i].value[j] = length of the longest suffix of the i-th string that is also a prefix of the j-th string*/
void create_matrix(void)
{
    int i,j,l;
    length=strlen(IS[1]);
    for(i=1;i<=total;i++)
    {
        for(j=1;j<=total;j++)
        {
            if(i==j)
                continue;/*a string is not matched against itself*/
            /*try the longest possible overlap first and stop at the first match*/
            for(l=length-1;l>0;l--)
                if(strncmp(IS[i]+length-l,IS[j],l)==0)
                    break;
            /*match found for last 'l' characters of i-th string (l is 0 if there is none)*/
            matrix[i].value[j]=l;
        }
    }
}/*end of function*/

/*function to display the string_matrix*/
void display_matrix(void)
{
    int i,j;
    printf("\nMATRIX :\n");
    printf("--------------\n");
    printf("Here MATRIX[i][j] = max. matching characters between two strings and MATRIX[i][j] = 0 if i=j\n");
    printf("      ");
    for(i=1;i<=total;i++)
        printf("IS[%d] ",i);
    printf("\n");
    for(i=1;i<=total;i++)
    {
        printf("IS[%d] ",i);
        for(j=1;j<=total;j++)
            printf("%3d   ",matrix[i].value[j]);
        printf("\n");
    }
}/*end of function*/

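/*function to create the edge_list*/
void create_edgelist(void)
{
    int i,j;
    int capacity=(int)(sizeof edgelist/sizeof edgelist[0])-1;/*slot 0 is unused*/
    for(i=1;i<=total;i++)
    {
        for(j=1;j<=total;j++)
        {
            /*store only the non-zero entries, and never write past the end of the array*/
            if(matrix[i].value[j] && edge_count<capacity)
            {
                edge_count++;
                edgelist[edge_count].u=i;
                edgelist[edge_count].v=j;
                edgelist[edge_count].weight=matrix[i].value[j];
            }
        }
    }
}/*end of function*/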

/*function to arrange the edge_list in non-increasing order, implemented bubble sort*/
void arrange_edgelist(void)
{
    int i,j,flag=1;
    struct edgelist temp;
    for(i=1;i<=edge_count && flag;i++)
    {
        flag=0;/*stays 0 if this pass makes no swap, which ends the sort early*/
        for(j=edge_count;j>i;j--)
        {
            if(edgelist[j].weight > edgelist[j-1].weight)
            {
                /*swap the two edges as whole structures*/
                temp=edgelist[j];
                edgelist[j]=edgelist[j-1];
                edgelist[j-1]=temp;
                flag=1;
            }
        }
    }
}/*end of function*/

/*function to display the edge_list*/
void display_edgelist(void)
{
    int i;
    printf("\nEDGELIST :\n");
    printf("-----------------\n");
    printf("Here The Non-zero Entries of The above Matrix is Represented in Form of an Edgelist\n");
    printf("VERTEX 1  VERTEX 2  EDGE\n");
    for(i=1;i<=edge_count;i++)
        printf("IS[%d]     IS[%d]     %d\n",edgelist[i].u,edgelist[i].v,edgelist[i].weight);
}/*end of function*/

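/*function to create the maximal spanning tree using kruskal algorithm*/
/*the tree is kept a single chain, so that it can later be read off as one string*/
void create_sequence(void)
{
    int i=1,j,w,ok;
    while((sequence_count < total-1) && (i<=edge_count))
    {
        /*vertex u must not already have an outgoing edge*/
        ok=!check_cycle(edgelist[i].u);
        /*vertex v must not already have an incoming edge*/
        for(j=1;ok && j<=sequence_count;j++)
            if(sequence[j].v==edgelist[i].v)
                ok=0;
        /*following the chain that starts at v must not lead back to u*/
        w=edgelist[i].v;
        for(j=1;ok && j<=sequence_count;j++)
        {
            if(sequence[j].u==w)
            {
                w=sequence[j].v;
                if(w==edgelist[i].u)
                    ok=0;
                j=0;/*restart the scan from the new vertex*/
            }
        }
        if(ok)
        {
            sequence_count++;
            sequence[sequence_count].u=edgelist[i].u;
            sequence[sequence_count].v=edgelist[i].v;
            sequence[sequence_count].weight=edgelist[i].weight;
        }
        i++;
    }
}/*end of function*/

/*function to check whether vertex u already has an outgoing edge in the tree*/
int check_cycle(int u)
{
    int i;
    for(i=1;i<=sequence_count;i++)
        if(sequence[i].u==u)
            return(1);
    return(0);
}/*end of function*/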

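/*function to form the final sequence of substrings*/
void arrange_sequence(void)
{
    int i,j,k,start=0,is_head;
    /*the chain starts at the edge whose source vertex is not the destination of any edge*/
    for(i=1;i<=sequence_count && !start;i++)
    {
        is_head=1;
        for(j=1;j<=sequence_count;j++)
            if(sequence[j].v==sequence[i].u)
                is_head=0;
        if(is_head)
            start=i;
    }
    if(!start)
        return;/*no starting edge found, leave the sequence as it is*/
    dseq[1].u=sequence[start].u;
    dseq[1].v=sequence[start].v;
    dseq[1].weight=sequence[start].weight;
    /*repeatedly append the edge that leaves the vertex reached so far*/
    for(k=1;k<sequence_count;k++)
    {
        for(j=1;j<=sequence_count;j++)
            if(sequence[j].u==dseq[k].v)
                break;
        if(j>sequence_count)
            return;/*the edges do not form a single chain*/
        dseq[k+1].u=sequence[j].u;
        dseq[k+1].v=sequence[j].v;
        dseq[k+1].weight=sequence[j].weight;
    }
    /*copy into sequence*/
    for(i=1;i<=sequence_count;i++)
    {
        sequence[i].u=dseq[i].u;
        sequence[i].v=dseq[i].v;
        sequence[i].weight=dseq[i].weight;
    }
}/*end of function*/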

/*function to display the final sequence of substrings*/
void display_sequence(void)
{
    int i;
    printf("\nSEQUENCE :\n");
    printf("--------\n");
    printf("Here we Represent The Maximal Spanning Tree in Form of a List :\n");
    printf("VERTEX 1  VERTEX 2  EDGE\n");
    for(i=1;i<=sequence_count;i++)
        printf("IS[%d]     IS[%d]     %d\n",sequence[i].u,sequence[i].v,sequence[i].weight);
}/*end of function*/

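/*function to form the shortest common string*/
void create_super_string(void)
{
    int i,j=0,k;
    int limit=(int)sizeof OS-1;/*leave room for the terminating '\0'*/
    /*start with the whole of the first string in the sequence*/
    for(i=0;i<length && j<limit;i++,j++)
        OS[j]=IS[sequence[1].u][i];
    /*append the part of each following string that is not already covered by the overlap*/
    for(i=1;i<=sequence_count;i++)
    {
        for(k=sequence[i].weight;k<length && j<limit;k++,j++)
        {
            OS[j]=IS[sequence[i].v][k];
        }
    }
    OS[j]='\0';
}/*end of function*/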

/*function to display the shortest common string*/
void display_super_string(void)
{
    int i,count_blank=0;
    printf("\nSHORTEST COMMON SUPERSTRING :\n");
    puts(OS);
    printf("------------\n");
    /*printing a formatted output: each substring is shifted to the position where it occurs in the superstring*/
    puts(IS[sequence[1].u]);
    for(i=1;i<=sequence_count;i++)
    {
        count_blank+=length-sequence[i].weight;
        printf("%*s%s\n",count_blank,"",IS[sequence[i].v]);
    }
}/*end of function*/

/*definition of global functions finished*/


OUTPUT INSTANCE 1 :

ENTER THE TOTAL NO. OF SUBSTRINGS : 5

ENTER 5 SUBSTRINGS (each terminated by an enter)
EACH SUBSTRING MUST BE OF SAME LENGTH :
ABRAC
ACADA
ADABR
DABRA
RACAD

ENTERED SUBSTRINGS :
------------------------------------
IS[1] = ABRAC
IS[2] = ACADA
IS[3] = ADABR
IS[4] = DABRA
IS[5] = RACAD

MATRIX :
-------------
Here MATRIX[i][j] = max. matching characters between two strings and MATRIX[i][j] = 0 if i=j

        IS[1]  IS[2]  IS[3]  IS[4]  IS[5]
IS[1]     0      2      0      0      3
IS[2]     1      0      3      2      0
IS[3]     3      0      0      4      1
IS[4]     4      1      1      0      2
IS[5]     0      4      2      1      0

EDGELIST :
-----------------
Here The Non-zero Entries of The above Matrix is Represented in Form of an Edgelist

VERTEX 1  VERTEX 2  EDGE
IS[3]     IS[4]     4
IS[4]     IS[1]     4
IS[5]     IS[2]     4
IS[1]     IS[5]     3
IS[2]     IS[3]     3
IS[3]     IS[1]     3
IS[1]     IS[2]     2
IS[2]     IS[4]     2
IS[4]     IS[5]     2
IS[5]     IS[3]     2
IS[2]     IS[1]     1
IS[3]     IS[5]     1
IS[4]     IS[2]     1
IS[4]     IS[3]     1
IS[5]     IS[4]     1

SEQUENCE :
-------------------
Here we Represent The Maximal Spanning Tree in Form of a List :

VERTEX 1  VERTEX 2  EDGE
IS[3]     IS[4]     4
IS[4]     IS[1]     4
IS[1]     IS[5]     3
IS[5]     IS[2]     4

SHORTEST COMMON SUPERSTRING :

ADABRACADA
---------------------
ADABR
 DABRA
  ABRAC
    RACAD
     ACADA

OUTPUT INSTANCE 2 :

ENTER THE TOTAL NO. OF SUBSTRINGS : 4

ENTER 4 SUBSTRINGS (each terminated by an enter)
EACH SUBSTRING MUST BE OF SAME LENGTH :
ABCDE
BCDEF
DEFGH
CDEFG

ENTERED SUBSTRINGS :
------------------------------------
IS[1] = ABCDE
IS[2] = BCDEF
IS[3] = DEFGH
IS[4] = CDEFG

MATRIX :
-------------
Here MATRIX[i][j] = max. matching characters between two strings and MATRIX[i][j] = 0 if i=j

        IS[1]  IS[2]  IS[3]  IS[4]
IS[1]     0      4      2      3
IS[2]     1      0      3      4
IS[3]     0      0      0      0
IS[4]     0      0      4      0

EDGELIST :
----------------
Here The Non-zero Entries of The above Matrix is Represented in Form of an Edgelist

VERTEX 1  VERTEX 2  EDGE
IS[1]     IS[2]     4
IS[2]     IS[4]     4
IS[4]     IS[3]     4
IS[1]     IS[4]     3
IS[2]     IS[3]     3
IS[1]     IS[3]     2

SEQUENCE :
------------------
Here we Represent The Maximal Spanning Tree in Form of a List :

VERTEX 1  VERTEX 2  EDGE
IS[1]     IS[2]     4
IS[2]     IS[4]     4
IS[4]     IS[3]     4

SHORTEST COMMON SUPERSTRING :

ABCDEFGH
-----------------
ABCDE
 BCDEF
  CDEFG
   DEFGH

DISCUSSION :

1. The code is implemented considering certain basic assumptions, such as:
i. Each substring entered must be of equal length.
ii. Every substring entered must overlap with at least one of the other substrings.
iii. No Si is a substring of Sj, where both Si and Sj belong to S.

2. Certain boundary conditions also have to be maintained, such as:
i. Each substring entered must be at most 10 characters long.
ii. A maximum of 10 substrings may be entered.
iii. The output string is a fixed-size 1D array, which limits the length of the superstring that can be stored.

3. The output is displayed in a formatted way so that it is easier for the user to understand the formation of the shortest common superstring.


4. Kruskal's algorithm is generally used to compute the minimal spanning tree, but here it is used to find the maximal spanning tree. This is possible because the edges are processed in non-increasing order of weight instead of non-decreasing order. Kruskal's algorithm starts by sorting all edges of the graph. The time complexity of this sorting operation is O(E log E) if there are E edges in the graph. The main loop of the algorithm makes E iterations in the worst case. In each iteration, the major task is to find whether the current edge introduces a cycle. With a suitable disjoint-set structure, detecting a cycle takes O(log n) in the worst case if the graph contains n vertices. Thus the overall time complexity of the algorithm is O(E log E) + O(E log n). Note that the program above sorts the edges with bubble sort and scans the selected edges linearly, so it does not itself reach this bound.

5. This program can be further improved by using suffix trees. This can be done by building a tree containing all suffixes of all strings of S. String Si overlaps with Sj if and only if a suffix of Si matches a prefix of Sj. Traversing these vertices in order of distance from the root defines the approximate merging order.

APPLICATION OF THIS PROBLEM:

The shortest common superstring problem (SCS) has been extensively studied for its applications in string compression and DNA sequence assembly. Although the problem is known to be Max-SNP hard, the simple greedy algorithm performs extremely well in practice. To explain the good performance, previous researchers proved that the greedy algorithm is asymptotically optimal on random instances. Unfortunately, the practical instances in DNA sequence assembly are very different from the random instances.
