On the k-Closest Substring and k-Consensus Pattern Problems
Yishan Jiao, Jingyi Xu
Institute of Computing Technology, Chinese Academy of Sciences
Ming Li
University of Waterloo
July 5, 2004
2
Outline
• Motivation & background
• Our contributions
  • A PTAS for the k-Closest Substring Problem
  • The NP-hardness of (2-ε)-approximation of the HRC problem
  • A PTAS for the k-Consensus Pattern Problem
• Conclusion
3
Motivation
Given n protein sequences, find a “conserved” region in each of them separately.
[Figure: n sequences, each with a highlighted conserved region of length L, drawn in red or blue.]
• Red/blue regions are different conserved regions, or motifs.
• They don’t have to be exactly the same.
• They match with higher scores than other regions.
4
Focused problem: k-Closest Substring Problem (k-CSS)
Given a string set $S = \{s_1, s_2, \ldots, s_n\}$ with each string of length $m$, and an integer $L$, find $k$ center strings $c_1, c_2, \ldots, c_k$ of length $L$, minimizing $d$ such that for every string $s_j \in S$ there is a length-$L$ substring $t_j$ (closest substring) of $s_j$ with $\min_{1 \le i \le k} d(c_i, t_j) \le d$.
We call the solution $(\{c_1, c_2, \ldots, c_k\}, d)$ a $k$-clustering of $S$, and call the number $d$ the maximum cluster radius of the $k$-clustering.
A special case when k =2
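The k-CSS objective on this slide can be made concrete with a small sketch. It is only an illustration, assuming plain Python strings with $m \ge L$ and the Hamming distance as $d$; the helper names `hamming`, `substrings`, and `max_cluster_radius` are hypothetical and do not come from the talk. It evaluates the maximum cluster radius of a given set of centers by brute force and does not search for optimal centers.

```python
from typing import List, Sequence


def hamming(a: str, b: str) -> int:
    """Hamming distance d(a, b) between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def substrings(s: str, L: int) -> List[str]:
    """All length-L substrings of s, in left-to-right order."""
    return [s[i:i + L] for i in range(len(s) - L + 1)]


def max_cluster_radius(strings: Sequence[str], centers: Sequence[str]) -> int:
    """Smallest d such that every s_j has a length-L substring within d of some center."""
    L = len(centers[0])
    radius = 0
    for s in strings:
        # Closest substring of s to any of the k centers.
        closest = min(hamming(c, t) for t in substrings(s, L) for c in centers)
        radius = max(radius, closest)
    return radius
```

With `k = 2`, calling `max_cluster_radius(S, [c1, c2])` evaluates a candidate pair of centers; the PTAS in the following slides is about finding such centers efficiently, which this sketch does not attempt.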
5
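2-CSS
Input: $S = \{s_1, s_2, \ldots, s_n\}$, $|s_i| = m$.
Output: $\{c_1, c_2\}$ and $\{t_1, t_2, \ldots, t_n\}$, minimizing $d$, where $d = \max_{1 \le j \le n} \min_{i = 1, 2} d(c_i, t_j)$.
[Figure: the strings of S split into two groups $S_A$ and $S_B$. $T_A$: red regions (substrings) around center $c_1$; $T_B$: blue regions (substrings) around center $c_2$. Each region has length L, and $d(c_i, t)$ is the distance from a region $t$ to its center.]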
6
Related work
• k-Closest Substring problem
  • with k = 1: Closest Substring problem. A PTAS; M. Li et al., JACM 49(2):157-171, 2002.
  • with L = m: Hamming Radius k-clustering problem (HRC). Geometric counterpart: Geometric k-center problem.
• Closest Substring problem with L = m: Closest String problem.
• Hamming Radius O(1)-clustering problem (O(1)-HRC): an RPTAS for the Hamming Radius O(1)-clustering problem; doctoral dissertation, J. Jansson, 2003.
8
The PTAS for k-CSS
Difficulties:
• How to choose the n closest substrings?
• How to partition the strings into k sets accordingly?
Method:
• Extend the random sampling strategy in [M. Li et al., JACM 49(2):157-171, 2002].
• Construct h to approximate the Hamming distance.
Result:
• A PTAS for O(1)-CSS.
9
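P-Q decomposition
[Figure: r substrings $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ over L positions, together with the center $c'_1$; the positions are split into Q and P, with a sample R inside P.]
• $Q$: the set of positions where $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ agree.
• $P = \{1, 2, \ldots, L\} \setminus Q$.
• $R \subseteq P$ ($O(\log(mn))$ random positions in $P$).
• $c'|_P = c|_P$.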
10
P-Q decomposition
Lemma 1: $|P_{i_1, \ldots, i_r}| \le r \, d_{opt}$.
Lemma 2: Given a set $T = \{t_1, t_2, \ldots, t_n\}$ of strings, each of length $L$. For any constant $r \ge 2$, there are $r$ strings $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ (allowing repeats) in $T$ and a center string $c'$ such that for any $1 \le l \le n$,
$d(c', t_l) \le \left(1 + \frac{1}{2r - 1}\right) d_{opt}$,
where $c'|_{Q_{i_1, \ldots, i_r}} = t_{i_1}|_{Q_{i_1, \ldots, i_r}}$ and $c'|_{P_{i_1, \ldots, i_r}} = c|_{P_{i_1, \ldots, i_r}}$.
Thus, by trying all possible length-L substrings in S, we can get $(t_{i_1}, t_{i_2}, \ldots, t_{i_r})$ and $(t_{j_1}, t_{j_2}, \ldots, t_{j_r})$ satisfying Lemma 2.
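The P-Q decomposition used in these lemmas can be sketched as follows. This is a minimal illustration, assuming 0-indexed positions (the slides count from 1) and a hypothetical helper name `pq_decomposition`; it only splits the positions for one already-chosen tuple of r substrings and does not enumerate all tuples.

```python
from typing import List, Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def pq_decomposition(chosen: Sequence[str]) -> Tuple[List[int], List[int]]:
    """Split positions 0..L-1 into Q (all chosen strings agree) and P (the rest)."""
    first = chosen[0]
    L = len(first)
    Q = [p for p in range(L) if all(t[p] == first[p] for t in chosen)]
    agree = set(Q)
    P = [p for p in range(L) if p not in agree]
    return P, Q


# Small example with r = 3 substrings of length L = 5.
chosen = ["ACGTA", "ACGTT", "ACCTA"]
P, Q = pq_decomposition(chosen)

# Any center within distance d of every chosen string forces |P| <= r * d,
# which is the counting argument behind Lemma 1.
center = "ACGTA"
d = max(hamming(center, t) for t in chosen)
```

In this example the strings disagree only at positions 2 and 4, so `P` has two elements while `r * d` is 3.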
11
Random sampling strategy:
The random sampling strategy $R_1$ ($R_2$): randomly pick $O(\log(mn))$ positions from $P_1$ ($P_2$).
By the definition in Lemma 2, let $c' = (c'_1, c'_2)$:
1. $c'|_{Q_{i_1, \ldots, i_r}}$: the string where $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ agree.
2. $c'|_{P_{i_1, \ldots, i_r}}$: ???? We know nothing about the optimal $c$!
12
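Random sampling strategy
Lemma 3: Let $U = \{u_j \mid u_j$ is a length-$L$ substring of $s_j$, $s_j \in S\}$. With high probability (at least $1 - 4(mn)^{-1/3}$), for each $u_j \in U$,
$|d(u_j, c'_1) - h(u_j, c'_1, Q_1, R_1)| \le \epsilon |P_1|$,
$|d(u_j, c'_2) - h(u_j, c'_2, Q_2, R_2)| \le \epsilon |P_2|$.
h approximates the Hamming distance well:
$h(o, c, Q, R) = d(o|_Q, c|_Q) + \frac{|P|}{|R|} \, d(o|_R, c|_R)$.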
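The estimator h from this slide can be sketched in a few lines. The sketch assumes 0-indexed positions, a Q with distinct positions, and P taken as the complement of Q; the names `restrict` and `sample_positions`, and the `size` and `seed` parameters, are hypothetical. The sample size is left to the caller, since the slides only state that it is $O(\log(mn))$.

```python
import random
from typing import List, Optional, Sequence


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def restrict(s: str, positions: Sequence[int]) -> str:
    """s restricted to the given positions (repeats allowed)."""
    return "".join(s[p] for p in positions)


def sample_positions(P: Sequence[int], size: int, seed: Optional[int] = None) -> List[int]:
    """Draw `size` positions from P uniformly at random, with replacement."""
    rng = random.Random(seed)
    return rng.choices(list(P), k=size) if P else []


def h(o: str, c: str, Q: Sequence[int], R: Sequence[int]) -> float:
    """Exact distance on Q plus the distance on the sample R scaled by |P| / |R|."""
    exact_part = hamming(restrict(o, Q), restrict(c, Q))
    if not R:  # nothing sampled, for example when P is empty
        return float(exact_part)
    p_size = len(o) - len(Q)  # |P|, since P is the complement of Q
    return exact_part + p_size / len(R) * hamming(restrict(o, R), restrict(c, R))
```

When R covers every position of P exactly once, `h` coincides with the true Hamming distance; with a smaller random sample it is only an estimate, which is the situation Lemma 3 bounds.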
13
Scheme of PTAS
1. By trying all possible length-L substrings in S, we can get $(t_{i_1}, t_{i_2}, \ldots, t_{i_r})$ and $(t_{j_1}, t_{j_2}, \ldots, t_{j_r})$ satisfying Lemma 2. Thus get $c'_1|_{Q_1}$ and $c'_2|_{Q_2}$.
2. Trying all possibilities on the $O(\log(mn))$ random positions in $P$. Thus get $c'_1|_{R_1}$ and $c'_2|_{R_2}$.
3. Partition each $s_j \in S$ into $C'_1$ if
$\min_{\{\text{any length-}L\text{ substring } t \text{ of } s_j\}} h(t, c'_1, Q_1, R_1) \le \min_{\{\text{any length-}L\text{ substring } t \text{ of } s_j\}} h(t, c'_2, Q_2, R_2)$,
and into $C'_2$ otherwise.
4. Get closest substrings $t'_j$ for each $s_j \in S$, and $T'_1$, $T'_2$ accordingly.
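Steps 3 and 4 of the scheme can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: positions are 0-indexed, the candidate centers are passed as full strings although only their Q and R positions are read, and the names `best_substring` and `partition` are hypothetical.

```python
from typing import List, Sequence, Tuple


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def restrict(s: str, positions: Sequence[int]) -> str:
    return "".join(s[p] for p in positions)


def h(o: str, c: str, Q: Sequence[int], R: Sequence[int]) -> float:
    """Exact distance on Q plus the sampled distance on R scaled by |P| / |R|."""
    exact_part = hamming(restrict(o, Q), restrict(c, Q))
    if not R:
        return float(exact_part)
    return exact_part + (len(o) - len(Q)) / len(R) * hamming(restrict(o, R), restrict(c, R))


def best_substring(s: str, c: str, Q: Sequence[int], R: Sequence[int]) -> Tuple[float, str]:
    """Minimum of h over all length-L substrings t of s, with a minimizing t."""
    L = len(c)
    scored = [(h(s[i:i + L], c, Q, R), s[i:i + L]) for i in range(len(s) - L + 1)]
    return min(scored, key=lambda pair: pair[0])


def partition(strings: Sequence[str],
              c1: str, Q1: Sequence[int], R1: Sequence[int],
              c2: str, Q2: Sequence[int], R2: Sequence[int]):
    """Step 3: assign each s_j to C1' or C2'. Step 4: record its closest substring."""
    C1: List[str] = []
    C2: List[str] = []
    T1: List[str] = []
    T2: List[str] = []
    for s in strings:
        f1, t1 = best_substring(s, c1, Q1, R1)
        f2, t2 = best_substring(s, c2, Q2, R2)
        if f1 <= f2:
            C1.append(s)
            T1.append(t1)
        else:
            C2.append(s)
            T2.append(t2)
    return C1, C2, T1, T2
```

Ties go to the first cluster, matching the "≤ ... otherwise" rule in step 3.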
14
Scheme of PTAS
5. Get final approximating center strings:
$\min d$
$d(t'_j|_{P_1}, x_1) \le d - d(t'_j|_{Q_1}, t_{i_1}|_{Q_1})$, $t'_j \in T'_1$; $|x_1| = |P_1|$.
$d(t'_j|_{P_2}, x_2) \le d - d(t'_j|_{Q_2}, t_{j_1}|_{Q_2})$, $t'_j \in T'_2$; $|x_2| = |P_2|$.
We can solve the optimization problem within error $\epsilon r d_{opt}$ by applying the method for the Closest String problem ([LMW02]) to the sets $T'_1$ and $T'_2$ respectively.
Get a solution $(x_1, x_2) = (c_1|_{P_1}, c_2|_{P_2})$ such that
$d \le \left(1 + \frac{1}{2r - 1} + 2 \epsilon r\right) d_{opt}$.
15
Sum up:
1. Partition $S$ based on $(c'_1, c'_2)$ and $h$, with $|h - d| \le \epsilon r d_{opt}$ [Lemma 3 & 1].
2. $c'_1, c'_2$ approximate the optimal $c_1, c_2$ [Lemma 1].
3. Combination of the Combinatorial Argument (P-Q decomposition) with the Random Sampling Strategy.
17
The NP-hardness of (2-ε)-approximation of the HRC problem
Main ideas:
• Given any instance G=(V,E) of the Vertex Cover problem, |V| = n, |E| = m'.
• Construct an instance <S, k> of the Hamming radius k-clustering problem, which has a k-clustering with the maximum cluster radius not exceeding 2 if and only if G has a vertex cover with k-m' vertices.
Reduction: vertex cover problem → (2-ε)-approximation of the HRC problem.
18
$S = \{ s_{ij}^{1}, s_{ij}^{3}, s_{ij}^{5} \mid (v_i, v_j) \in E \}$; for all $x, y \in S$ with $x \ne y$, $H(x, y) = 4$ or $8$.
Thus finding an approximate solution within an approximation factor less than 2 is no easier than finding an exact solution.
19
We can prove: given k ≤ 2m', k-m' vertices in V can cover E if and only if there is a k-clustering of S with the maximum cluster radius equal to 2.
If there were a polynomial-time algorithm for the Hamming radius k-clustering problem within an approximation factor less than 2, then the exact vertex cover number of any instance G could be computed in polynomial time.
This is a contradiction (assuming P ≠ NP).
21
Conclusion
• A nice combination of a combinatorial argument (P-Q decomposition) with the random sampling strategy in solving the k-CSS problem.
• An alternative and direct proof of the NP-hardness of (2-ε)-approximation of the HRC problem.
22
Contact Us
Authors:
• Yishan Jiao, Jingyi Xu: {jys,xjy}@ict.ac.cn, Bioinformatics Lab, Institute of Computing Technology, Chinese Academy of Sciences
• Ming Li: [email protected], University of Waterloo
23
Thank You!
25
Deterministic PTAS for O(1)-Consensus Pattern problem (1)
k-Consensus Pattern problem
Most related works:
• The Hamming O(1)-median clustering problem: the O(1)-Consensus Pattern problem when L = m. An RPTAS; R. Ostrovsky et al., JACM 49(2):139-156, 2002.
• The Consensus Pattern problem: the k-Consensus Pattern problem when k = 1. A PTAS; M. Li et al., STOC’99.
We give a deterministic PTAS for the O(1)-Consensus Pattern problem and prove its correctness.
26
DPTAS for O(1)-CP (1)
Outline:
1. Suppose in the optimal solution: ({c1,c2}, {t1,t2,…,tn}, {C1,C2}), where C1, C2 are instances of the Consensus Pattern problem.
2. Trying all possibilities, get and satisfying Lemma 3 in M. Li et al., STOC’99.
27
DPTAS for O(1)-CP (2)
Outline:
3. Get c1’, c2’:
   c1’: the column-wise majority string of
   c2’: the column-wise majority string of
4. Partition each into C1’, C2’ as follows: otherwise
5. Get closest substrings (tl’) in T1’, T2’ satisfying
28
DPTAS for O(1)-CP (3)
Outline:
6. Get a good approximation solution where c1”, c2” are the column-wise majority strings of all strings in T1’, T2’ respectively.
7. Conclusion: output a solution in polynomial time with total cost at most
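The "column-wise majority string" used in steps 3 and 6 can be sketched as follows. This is only an illustration, assuming equal-length Python strings and taking the total cost to be the sum of Hamming distances to the center (the usual Consensus Pattern objective); the function names are hypothetical, and ties are broken in favor of the symbol seen first in a column.

```python
from collections import Counter
from typing import Sequence


def column_majority(strings: Sequence[str]) -> str:
    """Pick the most frequent symbol in every column of the aligned strings."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*strings))


def total_cost(center: str, strings: Sequence[str]) -> int:
    """Sum of Hamming distances from the center to every string."""
    return sum(sum(x != y for x, y in zip(center, t)) for t in strings)


T = ["ACGT", "ACGA", "TCGT"]
center = column_majority(T)
cost = total_cost(center, T)
```

For a fixed set of substrings, the column-wise majority minimizes the sum of Hamming distances column by column, which is why it is a natural choice of center once the partition and the substrings are fixed.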
29
PTAS for 2-Consensus Pattern problem
30
Definition of PTAS
A family of approximation algorithms $\{A_\epsilon\}_{\epsilon > 0}$ for a problem P is called a polynomial (time) approximation scheme, or PTAS, if algorithm $A_\epsilon$ is a $(1+\epsilon)$-approximation algorithm and its running time is polynomial in the size of the input for each fixed $\epsilon$.
31
Vertex-cover problem
• Vertex cover: given an undirected graph $G = (V, E)$, a subset $V' \subseteq V$ such that if $(u, v) \in E$, then $u \in V'$ or $v \in V'$ (or both).
• Size of a vertex cover: the number of vertices in it.
• Vertex-cover problem: find a vertex cover of minimum size.
32
Vertex-cover problem
• The vertex-cover problem is NP-complete (see Section 34.5.2).
  • Vertex-cover belongs to NP.
  • Vertex-cover is NP-hard (CLIQUE $\le_P$ vertex-cover).
• Reduce $\langle G, k \rangle$ of a CLIQUE instance, where $G = \langle V, E \rangle$, to $\langle G', |V| - k \rangle$ of a vertex-cover instance, where $G' = \langle V, E' \rangle$ and $E' = \{(u, v) : u, v \in V, u \ne v \text{ and } \langle u, v \rangle \notin E\}$.
• So find an approximation algorithm.
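The CLIQUE to vertex-cover reduction on this slide builds the complement graph. A minimal sketch, assuming vertices are hashable labels and edges are undirected pairs; all function names are hypothetical.

```python
from itertools import combinations
from typing import Hashable, Iterable, List, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def complement_edges(vertices: Sequence[Hashable], edges: Iterable[Edge]) -> List[Edge]:
    """E' = all pairs {u, v} with u != v that are not edges of G."""
    present = {frozenset(e) for e in edges}
    return [(u, v) for u, v in combinations(vertices, 2) if frozenset((u, v)) not in present]


def reduce_clique_to_vertex_cover(vertices: Sequence[Hashable], edges: Iterable[Edge], k: int):
    """Map the CLIQUE instance <G, k> to the vertex-cover instance <G', |V| - k>."""
    return list(vertices), complement_edges(vertices, edges), len(vertices) - k


def is_vertex_cover(cover: Iterable[Hashable], edges: Iterable[Edge]) -> bool:
    chosen = set(cover)
    return all(u in chosen or v in chosen for u, v in edges)


def is_clique(nodes: Iterable[Hashable], edges: Iterable[Edge]) -> bool:
    present = {frozenset(e) for e in edges}
    return all(frozenset(pair) in present for pair in combinations(list(nodes), 2))


V = [1, 2, 3, 4]
E = [(1, 2), (1, 3), (2, 3), (3, 4)]
V2, E2, cover_size = reduce_clique_to_vertex_cover(V, E, k=3)
```

Here {1, 2, 3} is a clique of size 3 in G, and its complement {4} is a vertex cover of size |V| - 3 = 1 in G'.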
33
34
Conclusion for the approximation solution
Outline:
Get a good approximation solution where
10. Conclusion: outputs (c1”, c2”) in polynomial time, satisfying with high probability:
• Can be derandomized by a standard method [MR95].
• Extension to the k = O(1) case: trivial.
35
PTAS for 2-CSS
36
Notation
1. $s, t, o, c$: strings, $|s| = |t| = m$, $|o| = |c| = L$.
2. Position set: a multiset $P = \{j_1, j_2, \ldots, j_k\}$ $(1 \le j_1 \le j_2 \le \cdots \le j_k \le m)$.
3. $s|_P$: the string $s[j_1] s[j_2] \cdots s[j_k]$.
4. $d^P(s, t) = d(s|_P, t|_P)$: Hamming distance between $s|_P$ and $t|_P$.
5. Suppose in the optimal solution for 2-CSS: $(\{c_1, c_2\}, \{t_1, t_2, \ldots, t_n\}, \{T_1, T_2\}, \{S_1, S_2\}, d_{opt})$, where $T_1, T_2$ are instances of the Closest String problem.
37
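P-Q decomposition
[Figure: r substrings $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ over L positions, together with the center $c'_1$; the positions are split into Q and P, with a sample R inside P.]
• $Q$: the set of positions where $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ agree.
• $P = \{1, 2, \ldots, L\} \setminus Q$.
• $R \subseteq P$ ($O(\log(mn))$ random positions in $P$).
Distance measures:
• $h(o, c, Q, R) = d(o|_Q, c|_Q) + \frac{|P|}{|R|} \, d(o|_R, c|_R)$.
• $f(s, c, Q, R) = \min_{\{\text{any length-}L\text{ substring } t \text{ of } s\}} h(t, c, Q, R)$.
• $g(s, c) = \min_{\{\text{any length-}L\text{ substring } t \text{ of } s\}} d(t, c)$.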