JOURNAL OF COMPLEXITY 18, 1024–1036 (2002)

doi:10.1006/jcom.2002.0649


Complete on Average Boolean Satisfiability

Jie Wang¹

Department of Computer Science, University of Massachusetts, One University Avenue,

Lowell, Massachusetts 01854

E-mail: wang@cs.uml.edu; URL: http://www.cs.uml.edu/~wang

Received September 5, 2001; revised April 18, 2002; accepted April 24, 2002;

published online July 2, 2002

We present in this paper a dynamic binary coding scheme $\alpha$ on CNF formulas $\psi$, and show that under a uniform distribution $\mu_\alpha$ on binary strings $\alpha(\psi)$, SAT is complete on average, where $\mu_\alpha(\psi)$ is proportional to $|\alpha(\psi)|^{-2} 2^{-|\alpha(\psi)|}$. We then show that there is a $k_0 > 2$ such that for all $k \ge k_0$, $k$SAT under $\mu_\alpha$ is complete on average. © 2002 Elsevier Science (USA)

Key Words: average NP-completeness; distributional tiling; distributional

Boolean satisfiability; randomized reductions.

1. INTRODUCTION

Finding a reasonable distribution on CNF formulas for which SAT is complete on average is a major open problem [CM97]. This problem was motivated by the desire to understand how difficult it is to determine whether a given random CNF formula is satisfiable. Most of the early approaches were algorithmic in nature, designing specific algorithms to evaluate CNF formulas and analyzing their average performance under various distributions of CNF formulas. In particular, many of the previous papers were centered around the Davis–Putnam procedure (DPP) [DLL62, DP60] under two models of probability distributions: one on random clause lengths (RCL) and the other on fixed clause lengths (FCL). DPP is a resolution procedure that employs certain heuristics to evaluate a CNF formula recursively by selecting literals one at a time, each with its two possible truth assignments. Let $\psi$ be a CNF formula, and $\psi(u)$ the formula obtained from $\psi$ by setting $u$ true. Then DPP works as follows: If $\psi$ is empty return yes; else if $\psi$ contains an empty clause return no; else if $\psi$ contains a pure literal $u$ or a unit clause $\{u\}$ return $\mathrm{DPP}(\psi(u))$; else select a literal $v$ in $\psi$, return yes if $\mathrm{DPP}(\psi(v)) =$ yes, and return $\mathrm{DPP}(\psi(\neg v))$ otherwise.
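The recursive procedure just described can be sketched in Python as follows. This is only a minimal illustration and not the implementation analyzed in the cited papers: the representation of a clause as a set of nonzero integers (with `-i` standing for the negation of variable `i`), the helper name `assign`, and the rule used to select the branching literal are all assumptions made here.

```python
def assign(clauses, lit):
    """Return the formula obtained by setting literal `lit` true."""
    # Clauses containing lit are satisfied and dropped; -lit is deleted elsewhere.
    return [c - {-lit} for c in clauses if lit not in c]


def dpp(clauses):
    """Decide satisfiability of a CNF formula given as a list of sets of ints."""
    if not clauses:
        return True                       # empty formula: yes
    if any(len(c) == 0 for c in clauses):
        return False                      # empty clause: no
    literals = {l for c in clauses for l in c}
    for c in clauses:                     # unit clause {u}
        if len(c) == 1:
            (u,) = c
            return dpp(assign(clauses, u))
    for u in sorted(literals):            # pure literal u
        if -u not in literals:
            return dpp(assign(clauses, u))
    v = min(literals, key=abs)            # hypothetical selection rule
    return dpp(assign(clauses, v)) or dpp(assign(clauses, -v))
```

For instance, `dpp([{1, 2}, {-1}])` follows the unit clause first and then the remaining unit clause `{2}`.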

¹ Supported in part by NSF under Grants CCR-9820611 and CCR-0296037.

0885-064X/02 $35.00 © 2002 Elsevier Science (USA). All rights reserved.

Both the RCL and FCL models contain two parameters, the number $n$ of variables and the number $m$ of clauses, from which a random CNF formula $\psi$ is drawn.

In the RCL model (also called “the fixed density model”), $\psi$ is constructed by including each of the $2n$ literals in each of the $m$ clauses with probability $p$. Assume that clause lengths are binomially distributed, and so the expected clause length is $2pn$. Thus the probability distribution of $\psi$, denoted by $\mu_{\mathrm{RCL}(p)}(\psi)$, is proportional to

$$n^{-2} m^{-2} 2^{-2pnm}.$$

For any constant value of $p$, DPP solves random RCL instances in polynomial time $O(n^2 m)$ on average [Gol79]. This result is due to a favorable choice of distribution, for the probability that a random RCL instance is satisfiable tends to 1 as $n$ grows, and a witness can be found by a constant number of guesses of random truth assignments [FP83].

In the FCL model (also called “the random $k$SAT model”), $\psi$ is generated by selecting clauses uniformly at random from the set of all possible nontrivial clauses of a fixed length $k$. A clause is trivial if it contains both a variable and its negation. Since there are $2^k \binom{n}{k}$ nontrivial clauses, the probability distribution $\mu_{\mathrm{FCL}}(\psi)$ of a random FCL instance is proportional to

$$n^{-2} m^{-2} 2^{-km} \binom{n}{k}^{-m}.$$

DPP on random FCL instances runs in exponential time on average to find all solutions [CS88, FP83].
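The two models above can be illustrated with a short sampler. This is a sketch under stated assumptions rather than a generator taken from the cited work: the function names, the signed-integer representation of literals, and the choice to draw FCL clauses with replacement are introduced here for illustration.

```python
import random


def sample_rcl(n, m, p, rng):
    """RCL model: each of the 2n literals enters each of the m clauses with probability p."""
    literals = [sign * i for i in range(1, n + 1) for sign in (1, -1)]
    return [frozenset(l for l in literals if rng.random() < p) for _ in range(m)]


def sample_fcl(n, m, k, rng):
    """FCL model: m clauses drawn uniformly from the nontrivial clauses of length k."""
    clauses = []
    for _ in range(m):
        variables = rng.sample(range(1, n + 1), k)    # k distinct variables
        # An independent fair sign per variable makes every nontrivial clause equally likely.
        clauses.append(frozenset(v if rng.random() < 0.5 else -v for v in variables))
    return clauses
```

A call such as `sample_fcl(10, 5, 3, random.Random(0))` yields five 3-clauses over ten variables. Note that an RCL clause may be empty or trivial, which the model as described permits.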

We observe that random formulas can also be generated as follows: First, choose a binary encoding scheme for encoding formulas, then randomly generate binary encodings. We shall allow probability distributions $\mu$ on binary strings to sum up to less than 1 by assuming that $\mu(x) = 0$ if $x$ is not a legal encoding. Let $|\psi|$ denote the number of literals occurring in $\psi$. If $\psi$ is a $k$SAT instance over $n$ variables and with $m$ clauses, then $|\psi| = km$. Each such formula can be encoded using $2n$ codewords of equal length, each of which represents a literal. Let $c(\psi)$ denote such a binary encoding of $\psi$, and $\ell_\psi$ the length of $c(\psi)$. Then $\ell_\psi = km(\lceil \log n \rceil + 1)$, and the uniform distribution of $c(\psi)$, denoted by $\mu_{\mathrm{UNI}}(c(\psi))$, is proportional to

$$n^{-2} m^{-2} \ell_\psi^{-2} 2^{-\ell_\psi}.$$

Let $\mu_{\mathrm{FCL}}(\psi)$ denote the probability distribution of $\psi$, which is proportional to

$$n^{-2} m^{-2} 2^{-km} \binom{n}{k}^{-m}.$$
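The fixed-width encoding $c(\psi)$ described above can be sketched as follows. The exact bit layout is not given in the text, so the layout used here (one sign bit followed by the variable index in $\lceil \log n \rceil$ bits) and the function name `encode_fixed` are assumptions; the sketch only shows that the length comes out to $km(\lceil \log n \rceil + 1)$.

```python
from math import ceil, log2


def encode_fixed(clauses, n):
    """Encode a kSAT formula with one fixed-width codeword per literal occurrence."""
    width = ceil(log2(n))                 # bits for the variable index, assuming n >= 2
    words = []
    for clause in clauses:
        for lit in clause:
            sign = "1" if lit < 0 else "0"
            # Variables are numbered 1..n, so index abs(lit) - 1 fits in `width` bits.
            words.append(sign + format(abs(lit) - 1, "0{}b".format(width)))
    return "".join(words)
```

For a 3SAT formula with $m = 2$ clauses over $n = 4$ variables, the encoding has $3 \cdot 2 \cdot (2 + 1) = 18$ bits.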

Since $\binom{n}{k} \sim n^k$ when $k = o(\sqrt{n})$ [Bol85], we have

$$2^{-km} \binom{n}{k}^{-m} \sim 2^{-km(\log n + 1)}.$$

This implies that $\mu_{\mathrm{UNI}}(c(\psi))$ and $\mu_{\mathrm{FCL}}(\psi)$ are equivalent distributions from the average-complexity point of view, for each is dominated by the other. A distribution $\mu$ is dominated by a distribution $\nu$, written as $\mu \preceq \nu$, if for all $x$, $\mu(x) \le q(|x|)\nu(x)$ for some fixed polynomial $q$. Thus, under $\mu_{\mathrm{UNI}}(c(\psi))$, DPP on random $k$SAT instances also runs in exponential time on average.

Despite intensive effort, it remains open whether $k$SAT is complete on average under $\mu_{\mathrm{FCL}}(\psi)$ or $\mu_{\mathrm{UNI}}(c(\psi))$. No concrete encoding scheme² on CNF formulas is even known such that, under the uniform distribution of this encoding, SAT is complete on average. Such an encoding scheme $c'(\psi)$ must be polynomially equivalent to $c(\psi)$; namely, $c(\psi)$ can be obtained from $c'(\psi)$ in time polynomial in the size of the input, and vice versa. We construct such a concrete encoding scheme in this paper and settle the open question affirmatively.

We fix a finite alphabet $A$ for describing formulas, with each symbol encoded in binary. In addition to all the symbols available on a standard keyboard, $A$ also includes the symbols $\circ$, $\sum$, $\prod$, and $\neg$. We label variables as $v_1, v_2, \ldots$, and label literals as $\ell_i$, where $i = \pm 1, \pm 2, \ldots$, such that $\ell_i$ ($i > 0$) represents variable $v_i$ and $\ell_{-i}$ represents its negation $\neg v_i$. For each clause $C$, we list the literals in increasing order of indices. This sequence of indices is unique for $C$, and we call it an identifier of $C$. The length of $C$ is defined to be the length of its identifier. We list the clauses in $\psi$ in lexicographical order: Clauses with shorter identifiers are listed prior to those with longer identifiers; within the same length, clauses are listed in dictionary order. It is easy to see that any CNF formula $\psi$ is uniquely represented in this way. For convenience, we still use $\psi$ to denote this ordered expression.
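The ordered expression of a formula described above can be computed as in the following sketch. The function names are hypothetical, clauses are assumed to be collections of signed integer indices, and dictionary order is assumed to compare indices numerically.

```python
def identifier(clause):
    """Literal indices of a clause in increasing order."""
    return tuple(sorted(clause))


def canonical_form(clauses):
    """List clause identifiers: shorter identifiers first, then dictionary order."""
    identifiers = {identifier(c) for c in clauses}
    return sorted(identifiers, key=lambda ident: (len(ident), ident))
```

Two listings of the same set of clauses produce the same output, which is the uniqueness property the text relies on.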

We use “polytime” to denote “polynomial-time” and “polynomial time.” We are only interested in polynomially equivalent encodings $E_\psi$ of $\psi$.

In worst-case complexity, the choice of encoding scheme does not affect NP-completeness as long as the schemes are polynomially equivalent. In average-case complexity, however, the choice of encoding scheme is crucial, for different encoding schemes may result in distributions that do not dominate each other. We consider a concrete encoding scheme $\alpha$ for $\psi$ in this paper. Since unit clauses must be set true in order to satisfy a CNF formula $\psi$, we first check whether $\psi$ contains unit clauses. If it does, we compress the unit clauses based on arithmetic progressions in literal indexing (if there are any). We then use a dynamic coding scheme to encode the rest of the clauses in $\psi$ (details of this encoding are given in Section 2). This encoding is one-to-one, polytime computable, and polytime invertible. For convenience, we call the uniform distribution on the strings $\alpha(\psi)$ an $\alpha$-distribution.

² If we do not care about concrete encodings, then one can easily construct a universal distribution, based on an enumeration of all polynomially samplable distributions, to make SAT (and all standard NP-complete problems) complete on average for the class of NP problems with polynomially samplable distributions [BCGL92]. A distribution $\mu$ is said to be samplable if there exists a randomized algorithm that outputs $x$ with probability $\mu(x)$ in time polynomial in $|x|$.

Distributional Satisfiability (DistSAT)
Instance: A CNF formula $\psi$.
Question: Is $\psi$ satisfiable?
Distribution: $\mu_\alpha(\psi)$ is proportional to $|\alpha(\psi)|^{-2} 2^{-|\alpha(\psi)|}$.

When $\psi$ is restricted to be a $k$SAT instance in DistSAT, we denote the problem by Dist-$k$SAT. We show that DistSAT is complete on average for DistNP. Since $\mu_\alpha$ is flat, namely, $\mu_\alpha(\psi) \le 2^{-|\psi|^{\varepsilon}}$ for some fixed $\varepsilon > 0$, randomized reductions are necessary for establishing the completeness result (unless EXP = NEXP) [Gur91]. The rest of the paper is organized as follows. In Section 2, we describe the $\alpha$-distribution in detail. In Section 3, we first present some basics of the theory of average NP-completeness for readers who are not familiar with it. We then define a flat distributional tiling problem (FDT) and show that it is complete on average for DistNP under randomized reductions. We show that FDT is reducible to DistSAT, and hence DistSAT is complete on average for DistNP. We then show that there exists a $k_0 > 2$ such that for all $k > k_0$, Dist-$k$SAT is complete on average under randomized reductions.

2. THE α-DISTRIBUTION

Denote by $|x|$ the length of a binary string $x$, and by $|X|$ the cardinality of a set $X$. A probability distribution (or simply distribution) $\mu$ is a real-valued function from $\{0,1\}^*$ to $[0,1]$ such that $\sum_x \mu(x) \le 1$.

In previous literature, when one says that $\mu$ is polytime computable, it means that either the distribution function $\mu^*(x) = \sum_{y \le x} \mu(y)$ is polytime computable, where $\le$ is the standard lexicographical order on $\{0,1\}^*$, or $\mu \preceq \nu$ for some $\nu$ with polytime computable $\nu^*$. This definition implies that $\mu(x) \preceq 2^{-|\zeta(x)|}$ for some function $\zeta$ with the property that $\zeta(x)$ is one-to-one, polytime computable, and polytime invertible in $|x|$, which is the only property required of $\mu$ for establishing average NP-completeness. Without loss of generality, we define $\mu$ to be polytime computable if there exists a one-to-one and polytime computable function $\zeta_\mu : \{0,1\}^* \to \{0,1\}^*$, with $\zeta_\mu^{-1}(y) = x$ being computable in time polynomial in $|x|$, such that for all $x$, $\mu(x) \preceq 2^{-|\zeta_\mu(x)|}$.

Given a binary string $x$, we can embed $x$ in a longer string and find it efficiently as follows. Let $s$ be $|x|$ written in binary. Set $e(x) = 0^{|s|} 1 s x$. Then given a string with $e(x)$ as a prefix, we can count the number of 0's before the initial 1, use this number to find the number $s$, and then use $s$ to find $x$. Notice that $|x| + \log |x| \le |e(x)| \le |x| + 2 \log |x|$. For convenience, we call $e(x)$ a logarithmic embedding of $x$ ($x$-embedding, in short).
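The embedding and its recovery step can be sketched directly from the definition $e(x) = 0^{|s|} 1 s x$. The function names `embed` and `extract` are hypothetical, and binary strings are represented as Python strings of `'0'` and `'1'`. By construction, the string returned by `embed` has exactly $|x| + 2|s| + 1$ characters.

```python
def embed(x):
    """Return e(x) = 0^{|s|} 1 s x, where s is |x| written in binary."""
    s = format(len(x), "b")
    return "0" * len(s) + "1" + s + x


def extract(y):
    """Split a string that starts with e(x) into (x, remaining suffix)."""
    k = y.index("1")                      # number of leading 0's equals |s|
    s = y[k + 1:2 * k + 1]                # the k bits after the separator 1
    n = int(s, 2)                         # |x|
    start = 2 * k + 1
    return y[start:start + n], y[start + n:]
```

For example, `embed("1011")` is `"00011001011"`, and `extract` recovers `"1011"` from any string having it as a prefix.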

Let $x$ be a binary string starting with 1. Then $x$ is a unique concatenation of the base strings: 1, 10, 000, 100 [Wan99]. We encode each symbol in $A$ by a string in the regular set $R = 0100(00 + 11)^* 11$ as follows. Let $\ell$ be the least even integer such that $2^{(\ell - 6)/2} \ge |x| + |A|$. Let $S$ be the set of the first $|A|$ strings (in lexicographical order) in $R$ of length $\ell$ such that no string in $S$ is a substring of $x$. Such a set $S$ exists because the string $x$ has at most $|x|$ substrings of length $\ell$. It follows that none of the base strings is a prefix of any coded symbol, and that if a nonempty suffix $z$ of a coded symbol $u$ is a prefix of a coded symbol $v$, then $z = u = v$. We assign a distinct element of $S$ to each symbol in $A$ in a fixed order. The length of each coded symbol is $\ell = O(\log |x|)$. This dynamic encoding scheme was first used in [Gur91] to show that distributional Post correspondence under a uniform distribution is complete on average. For convenience, we call this coding scheme a logarithmic encoding scheme of $x$ ($x$-encoding, in short).
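The selection of the code set $S$ can be sketched as below. The function name `x_encoding` and the representation of the alphabet as a Python list are assumptions; the sketch follows the rule stated above (least even $\ell$ with $2^{(\ell-6)/2} \ge |x| + |A|$, then the first $|A|$ strings of $R$ of length $\ell$ that do not occur in $x$).

```python
from itertools import product


def x_encoding(x, alphabet):
    """Map each symbol to a codeword in 0100(00+11)*11 that is not a substring of x."""
    need = len(x) + len(alphabet)
    ell = 6
    while 2 ** ((ell - 6) // 2) < need:   # least even ell with enough candidates
        ell += 2
    pairs = (ell - 6) // 2
    codes = []
    # product over ('00', '11') enumerates the middle part in lexicographical order.
    for middle in product(("00", "11"), repeat=pairs):
        word = "0100" + "".join(middle) + "11"
        if word not in x:
            codes.append(word)
            if len(codes) == len(alphabet):
                break
    return dict(zip(alphabet, codes))
```

Because at most $|x|$ candidates can occur in $x$, the loop always collects $|A|$ codewords before the candidates run out.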

Let $\psi$ be a CNF formula. We construct $\alpha(\psi)$ as follows.

* Unit clause compression: A unit clause is a clause $\{\ell\}$ of length 1, in which the literal $\ell$ is also called a unit witness. Let $u_1 < \cdots < u_\ell$ be the indices of the literals in all unit clauses in $\psi$. Let $\langle u_{a+i} \rangle_{i=0}^{k}$ be a subsequence of $\langle u_j \rangle_{j=1}^{\ell}$. If there exist integers $c > 1$ and $d$ ($d$ could be negative) such that for all $0 \le i \le k$,

$$u_{a+i} = b_i + d + ic,$$

where $b_i \in \{0, 1\}$, then we call $\langle u_{a+i} \rangle_{i=0}^{k}$ a 0–1 arithmetic progression with base $(c, d, b_0 b_1 \cdots b_k)$. Let $\langle u_t, u_{t+1}, \ldots, u_{t+\eta-1} \rangle$ be the first (i.e., with $t$ being the smallest of all) longest 0–1 arithmetic progression. Assume that its base is $(\beta, \gamma, x_0 x_1 \cdots x_{\eta-1})$. Let $x = 1 x_0 x_1 \cdots x_{\eta-1}$. We replace $\{\ell_{u_t}\} \cdots \{\ell_{u_{t+\eta-1}}\}$ by

$$x@(u_t, \beta, \gamma),$$

and place it in front of the other clauses.

For example, suppose $\psi$ contains the following unit clauses: $\{\neg v_3\}$, $\{\neg v_1\}$, $\{v_2\}$, and $\{v_5\}$. Then $u_1 = -3$, $u_2 = -1$, $u_3 = 2$, and $u_4 = 5$. Let $c = 2$, $d = -3$, and $b_0 b_1 b_2 = 001$; then $u_{1+i} = b_i + d + ic$ for $i = 0, 1, 2$, but $u_4 = u_{1+3} = 5 > 1 + d + 3c > 0 + d + 3c$. Thus, $\langle u_1, u_2, u_3 \rangle$ is the first longest 0–1 arithmetic progression. Its base is $(2, -3, 001)$. Hence, we replace $\{\neg v_3\}$, $\{\neg v_1\}$, and $\{v_2\}$ in $\psi$ by $001@(0, 2, -3)$.

* Dynamic encoding: Let $\psi'$ be the expression of formula $\psi$ obtained after the unit clause compression. Let $x$ be the binary string obtained from the unit clause compression (note that $x$ begins with 1). We fix an $x$-encoding scheme for $A$, and encode every symbol in $A$ by the $x$-encoding scheme. Replace $x$ by $e(x)$. Encode every integer $z$ in $\psi'$ by $\#b(z)\#$, where $b(z)$ is $z$ written in binary, and then replace 1 with 10 and 0 with 01 in $b(z)$. Since $x$-encoding codes are strings of the form $0100(00 + 11)^* 11$, a binary number so coded is easily distinguished from any coded symbol in $A$. Finally, encode every other symbol in $\psi'$ by its $x$-encoding code. If the length of the resulting binary string is less than $|\psi|^{1/3}$, pad the string to make it at least this long. The final string is $\alpha(\psi)$, with prefix $e(x)$, which is a sequence of coded symbols in $A$ under $x$-encoding.
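Two small ingredients of the construction above can be sketched in Python: the test that a sequence of unit-clause indices is a 0–1 arithmetic progression for a given pair $(c, d)$, and the bit doubling applied to $b(z)$. Both function names are hypothetical. The text does not say how a negative integer is written in binary, so `double_bits` is restricted here to nonnegative integers as an assumption.

```python
def progression_bits(indices, c, d):
    """Return b_0 b_1 ... if indices[i] = b_i + d + i*c with every b_i in {0, 1}, else None."""
    bits = []
    for i, u in enumerate(indices):
        b = u - d - i * c
        if b not in (0, 1):
            return None
        bits.append(str(b))
    return "".join(bits)


def double_bits(z):
    """Write a nonnegative integer in binary, then map 1 -> 10 and 0 -> 01."""
    return "".join("10" if bit == "1" else "01" for bit in format(z, "b"))
```

With the indices of the worked example, `progression_bits([-3, -1, 2], 2, -3)` gives `"001"`, while appending the index 5 makes the test fail for the same $c$ and $d$.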

Theorem 1. The encoding $\alpha$ is one-to-one, polytime computable, and polytime invertible.

Proof. It is straightforward to see that $\alpha$ is one-to-one and polytime computable. To compute $\alpha^{-1}$ on input $y$, we first look for the prefix $e(x)$ of $y$ to extract $x$, from which we know how $A$ is encoded by the $x$-encoding scheme. If $y$ represents a set of clauses, let $\psi$ be the CNF formula uniquely determined by $y$, and output $\psi$. If $y$ does not have $e(x)$ as a prefix or $y$ does not represent a set of clauses, output “nil.” □

Corollary 2. The distribution $\mu_\alpha$ is polytime computable.

Clearly, $\mu_\alpha$ is a flat distribution.

3. COMPLETE ON AVERAGE SAT

3.1. Basics of Average-Case NP-Completeness

Denote by $N$ the set of nonnegative integers. Let $f : \Sigma^+ \to N$ be a function with an input distribution $\mu$. If there exists an $\varepsilon > 0$ such that

$$\sum_{|x| \ne 0} f^{\varepsilon}(x)\, |x|^{-1} \mu(x) < \infty,$$

then we say that $f$ is polynomial on $\mu$-average [Lev86]. Denote by AP the class of distributional problems $(D, \mu)$, where $D$ is solvable in time polynomial on $\mu$-average. Let $(A, \mu)$ and $(B, \nu)$ be two distributional problems. If $A$ is reducible to $B$ via a one-to-one, polytime computable reduction $f$, and $\mu \preceq \nu \circ f$, then $(A, \mu)$ is polytime reducible to $(B, \nu)$. AP is closed under polytime reductions, and polytime reductions are transitive.

Let DistNP $= \{(D, \mu) : D \in \mathrm{NP}$ and $\mu$ is polytime computable$\}$. Since $\mu_\alpha$ is flat, we need to use randomized reductions to show that DistSAT is complete on average for DistNP.

We assume that a randomized algorithm $U$ flips a coin only when its computation requires a random bit, and the coin is unbiased. Randomized algorithms (to solve a problem) are allowed to make errors and produce incorrect outputs on some sequences of random bits. They can also run forever on some random (infinite) sequences. If $U$ on input $x$ halts with a correct output using random bits $r$, we call $(x, r)$ a good input for $U$. We note that deterministic algorithms are a special case with good inputs $(x, \lambda)$, where $\lambda$ represents the empty string. Let $G$ be a set of good inputs for $U$. Let

$$G(x) = \{r : (x, r) \in G\}.$$

Let $\mu$ be an input distribution. If $G(x) \ne \emptyset$ for all $x$ with $\mu(x) > 0$, we call $G$ a good-input domain of $U$ (with respect to $\mu$). It is easy to see that no string in $G(x)$ is a prefix of a different string in $G(x)$ (otherwise, the longer string cannot be in $G(x)$, for the algorithm halts before the string is generated). Let

$$U_G(x) = \frac{1}{\sum_{r \in G(x)} 2^{-|r|}},$$

which is called the rarity function of $G$. We say that $G$ is nonrare (with respect to $\mu$) if $U_G$ is polynomial on $\mu$-average. $U$ is almost total if $U_G(x) = 1$ for all $x$ with $\mu(x) > 0$. For all $(x, r) \in G$, define

$$\mu_G(x, r) = \mu(x)\, 2^{-|r|}\, U_G(x).$$
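The rarity function and the induced distribution $\mu_G$ can be computed exactly for a finite set $G(x)$, as in the following sketch. The function names and the use of `fractions.Fraction` for exact arithmetic are choices made here for illustration only.

```python
from fractions import Fraction


def rarity(good_strings):
    """U_G(x) = 1 / sum over r in G(x) of 2^{-|r|}, for a finite nonempty G(x)."""
    total = sum(Fraction(1, 2 ** len(r)) for r in good_strings)
    return 1 / total


def mu_G(mu_x, r, good_strings):
    """mu_G(x, r) = mu(x) * 2^{-|r|} * U_G(x)."""
    return mu_x * Fraction(1, 2 ** len(r)) * rarity(good_strings)
```

Summing `mu_G` over all $r \in G(x)$ returns $\mu(x)$, which shows that the rarity factor simply renormalizes the weight of the good random strings.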

Let $t(x, r)$ be the running time of $U$ on input $(x, r) \in G$. If $G$ is nonrare and there exists an $\varepsilon > 0$ such that

$$\sum_{(x, r) \in G} t^{\varepsilon}(x, r)\, |x|^{-1} \mu_G(x, r) < \infty,$$

then we say that $U$ runs in polytime on $\mu$-average. If $t(x, r)$ is bounded by a polynomial in $|x|$ for all $(x, r) \in G$, then we say that $U$ runs in polytime.

One way to justify the correctness of the output is to show that its input belongs to the good domain. For this purpose, we consider certifiable domains. Domain $G$ is certifiable if $G$ is decidable in polytime on $\mu(x)|r|^{-2}2^{-|r|}$-average. It can be shown that $U$ runs in polytime on $\mu$-average if and only if $U$ can be iterated in a certain manner to run in polytime on $\mu$-average with an almost total good-input domain [BG93].

Denote by RAP the class of all distributional problems $(D, \mu)$, where $D$ is solvable by a randomized algorithm in polytime on $\mu$-average with a certifiable, nonrare good-input domain. We say that $(A, \mu)$ is polytime randomly reducible to $(B, \nu)$ if there is a one-to-one reduction $f$, computable by a randomized algorithm in polytime with a certifiable, nonrare good-input domain $G$, such that, for all $(x, r) \in G$, $x \in A$ if and only if $f(x, r) \in B$, and $\mu_G \preceq \nu \circ f$. RAP is closed under polytime randomized reductions, and polytime randomized reductions are transitive [BG93].

It is easy to see that AP is a subset of RAP, and that polytime reducibility is a special case of polytime random reducibility. For more information about randomized reductions, the reader is referred to [BG93, Gur91, VL88, Wan97].

We say that a distributional problem $(D, \mu) \in$ DistNP is complete on average for DistNP (under randomized reductions) if all distributional problems in DistNP are polytime (randomly) reducible to $(D, \mu)$.

3.2. Main Theorems

We first consider a variant of Levin's distributional tiling problem [Lev86]. A tile is a square with a symbol on each side that may not be rotated or turned over. Denote by $(a, b, c, d)$ a tile whose symbols are $a$, $b$, $c$, and $d$ clockwise starting from the top side. We assume that there is a sufficient supply of copies of each tile. A tiling of an $n \times n$ square is an arrangement of $n^2$ tiles covering the square in which the symbols on the common sides of adjacent tiles are the same. Let $T$ be a finite set of tiles. Let $H_T \subseteq T \times T$ be the collection of pairs of tiles that can be placed horizontally, and $V_T \subseteq T \times T$ be the collection of pairs of tiles that can be placed vertically. A tiling system of $T$ is a pair $(S, \sigma)$, where $S \subseteq T$ with $|S| = 2$ and $S \times S \subseteq H_T$, and $\sigma$ is a sequence of tiles $\sigma_1 \sigma_2 \cdots \sigma_k$ such that $\sigma_1 \in T$, $\sigma_i \in S$ for $i = 2, \ldots, k$, and $(\sigma_1, \sigma_2) \in H_T$.

Flat Distributional Tiling (FDT)
Instance: A tiling system $(S, \sigma)$ of $T$, where $\sigma = \sigma_1 \sigma_2 \cdots \sigma_k$.
Question: Can $\sigma$ be extended to tile a $k \times k$ square?
Distribution: $\mu_{\mathrm{FT}}(S, \sigma)$ is proportional to $k^{-2} 2^{-k}$.
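The notion of a tiling used in FDT can be made concrete with a small checker. This is a sketch under assumptions introduced here: a tile is a 4-tuple `(top, right, bottom, left)` following the clockwise convention in the text, and a candidate tiling is a list of rows with `grid[0]` taken as the bottom row.

```python
def horizontal_ok(left, right):
    """Two tiles may sit side by side if the shared vertical edge carries one symbol."""
    return left[1] == right[3]


def vertical_ok(lower, upper):
    """Two tiles may be stacked if the shared horizontal edge carries one symbol."""
    return lower[0] == upper[2]


def is_tiling(grid):
    """Check that an n x n grid of tiles (bottom row first) is a valid tiling."""
    n = len(grid)
    if any(len(row) != n for row in grid):
        return False
    for i in range(n):
        for j in range(n):
            if j + 1 < n and not horizontal_ok(grid[i][j], grid[i][j + 1]):
                return False
            if i + 1 < n and not vertical_ok(grid[i][j], grid[i + 1][j]):
                return False
    return True
```

The sets $H_T$ and $V_T$ are then the pairs of tiles accepted by `horizontal_ok` and `vertical_ok`, respectively.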

To show that FDT is complete on average for DistNP, we reduce the following flat distributional halting problem to FDT. Let $\langle x, y \rangle$ denote the binary string $e(x)y$.

Gurevich [Gur91] constructed a nondeterministic Turing machine (NTM) $M_G$ such that, on binary instances $x01^n$ with $|x| < n$, it is complete on average for DistNP to decide whether $M_G$ accepts $x$ within $n$ steps under distribution $\mu_H(x01^n)$, which is proportional to $n^{-3} 2^{-|x|}$. Let $K$ be the set of positive instances $x01^n$.

Flat Distributional Halting (FDH)
Instance: $\langle x, y \rangle$, where $x$ and $y$ are binary strings.
Question: Does $M_G$ accept $x$ within $|y|$ steps?
Distribution: $\mu_{\mathrm{FH}}(\langle x, y \rangle)$ is proportional to $|x|^{-2} |y|^{-2} 2^{-|x|} 2^{-|y|}$.

Let $K_F$ denote the set of positive instances of FDH. We can then randomly reduce $(K, \mu_H)$ to $(K_F, \mu_{\mathrm{FH}})$ as follows. On input $x01^n$, the reduction $f$ generates a random string $r$ with $|r| = n$ and outputs $\langle x, r \rangle$. Let $G = \{(x01^n, r) : |r| = n\}$; then $U_G(x01^n) = 1$, and $G$ is polytime computable and so is certifiable. We note that $f$ is one-to-one and polytime computable on $G$. Clearly, for all $(x01^n, r) \in G$: $x01^n \in K$ if and only if $\langle x, r \rangle \in K_F$, and $\mu_G(x01^n, r) = \mu_H(x01^n) 2^{-|r|} = O(n^{-3} 2^{-|x|} 2^{-|r|}) < O(|x|^2)\, \mu_{\mathrm{FH}}(\langle x, r \rangle)$. This implies that FDH is complete on average for DistNP under randomized reductions.

Lemma 3. There is a set $T_K$ of tiles and a tiling system $(S, \sigma)$ of $T_K$ such that FDT is complete on average for DistNP under randomized reductions.

Proof. We randomly reduce FDH to FDT. Let $M_F$ be a one-tape NTM that accepts $K_F$ in polytime. We construct a one-tape NTM $M$ such that on input $z$, if $z = e(w)z'$ for some $w$, $M$ extracts $w$; otherwise, $M$ rejects. This can be carried out deterministically in polytime in $|w|$. $M$ then determines whether $w = \langle x, y \rangle$ for some $x$ and $y$ in deterministic polytime in $|w|$. If so, $M$ simulates $M_F$ on $w$; otherwise, $M$ rejects. Thus, there is a polynomial $p$ such that $M_F$ accepts $\langle x, y \rangle$ if and only if $M$ accepts $e(w)z'$ for any $z'$, and every computation path of $M$ has length strictly less than $p(|w|)$. Note that $M$ either accepts all inputs beginning with $e(w)$ or rejects all inputs beginning with $e(w)$, depending on whether or not $M_F$ accepts $\langle x, y \rangle$. Let $Q$ be the set of states of $M$, with starting state $q_s$, accepting state $q_a$, and rejecting state $q_r$. Let $\Delta$ be the transition function of $M$, and $B$ the blank symbol. Let $T_K$ be the set of the following tiles, where $a, b, c \in \{0, 1, B\}$, $q \in Q - \{q_a, q_r\}$, $p \in Q$, and $*$, $\#$, $\$$ are symbols not in $Q \cup \{0, 1, B\}$:

* $(a, *, a, *)$, $((q_a, b), *, (q_a, b), *)$, $((q_s, b), \#, \$, \$)$, $(a, \#, \$, \#)$;
* $(b, p, (q, a), *)$ and $((p, c), *, c, p)$, if $\Delta(q, a) = (p, b, R)$;
* $(b, *, (q, a), p)$ and $((p, c), p, c, *)$, if $\Delta(q, a) = (p, b, L)$.

The sets $H_{T_K}$ and $V_{T_K}$ can be readily obtained. Let

$$S = \{(0, \#, \$, \#),\ (1, \#, \$, \#)\}.$$

Let $|e(w)| = k$ and write $e(w) = w_1 w_2 \cdots w_k$, where $w_i \in \{0, 1\}$. Let $r = r_1 r_2 \cdots r_\ell$ be a random string, where $\ell = p(|w|) - k$ and $r_i \in \{0, 1\}$. Let $\sigma = \sigma_1 \sigma_2 \cdots \sigma_k \sigma_{k+1} \cdots \sigma_{k+\ell}$, where $\sigma_1 = ((q_s, w_1), \#, \$, \$)$, $\sigma_i = (w_i, \#, \$, \#)$ for $i = 2, \ldots, k$, and $\sigma_j = (r_j, \#, \$, \#)$ for $j = k+1, \ldots, k+\ell$. Since $M$ will reach the accepting or the rejecting state in strictly less than $p(|w|)$ time, and the tiling can only duplicate the accepting state, $\sigma$ can extend to a tiling of a $p(|w|) \times p(|w|)$ square if and only if $\sigma$ occupies the bottom row of the square and $M$ accepts $\sigma$. Let $G = \{(\langle x, y \rangle, r) : |r| = p(|w|) - |e(w)|\}$. Then $U_G(\langle x, y \rangle) = 1$ and $G$ is polytime computable and so is certifiable. Thus, $f(\langle x, y \rangle, r) = (S, \sigma)$ is the desired randomized reduction from FDH to FDT, which is one-to-one and polytime computable on $G$. To verify domination of distributions, we note that

$$\begin{aligned}
\mu_G(\langle x, y \rangle, r) &= \mu_{\mathrm{FH}}(\langle x, y \rangle)\, 2^{-|r|} = \Theta(|x|^{-2} |y|^{-2} 2^{-|x|} 2^{-|y|} 2^{-|r|}) \\
&\le O(|y|^{-2} 2^{-|\langle x, y \rangle|} 2^{-|r|}) \le O(|w|^{2} |y|^{-2} 2^{-|e(w)|} 2^{-|r|}) \\
&< O((|x| + |y|)^{4} |\sigma|^{-2} 2^{-|\sigma|}).
\end{aligned}$$

Thus $\mu_{\mathrm{FH}}(\langle x, y \rangle) \preceq \mu_{\mathrm{FT}}(f(\langle x, y \rangle))$. This completes the proof. □

Theorem 4. DistSAT is complete on average for DistNP under randomized reductions.

Proof. We reduce FDT to DistSAT. Let $T_K$ be the set of tiles and $(S, \sigma)$ the tiling system of $T_K$ obtained from Lemma 3. Label the two tiles in $S$ as $t_0$ and $t_1$, and the tiles in $T_K - S$ as $t_2, \ldots, t_{s-1}$, where $s = |T_K|$ is a constant. Let $N = |\sigma|$. Create $n = N^2 s$ variables $v_0, v_1, \ldots, v_{n-1}$.

For each variable $v_r$, where $r = (iN + j)s + k$, $0 \le i < N$, $0 \le j < N$, and $0 \le k < s$, we will later want $v_r$ to be 1 if the $(i, j)$th cell of the square is covered by the $k$th tile. We construct a formula $\psi$ as follows.

(1) Let $\sigma = \sigma_0 \sigma_1 \cdots \sigma_{|\sigma|-1}$. Assume that $\sigma_0 = t_p$, and $\sigma_i = t_{x_i}$ for $i = 1, \ldots, |\sigma| - 1$, where $x_i \in \{0, 1\}$. $\psi$ includes the following unit clauses:

$$v_p \prod_{i=1}^{|\sigma|-1} v_{is + x_i}. \qquad (1)$$

(2) For all $k, k'$ with $(t_k, t_{k'}) \notin H_{T_K}$, where $0 \le k, k' < s$, $\psi$ includes the following 2-clauses:

$$\prod_{I=0}^{N^2-1} (\neg v_{Is+k} + \neg v_{(I+1)s+k'}). \qquad (2)$$

(3) For all $k, k'$ with $(t_k, t_{k'}) \notin V_{T_K}$, where $0 \le k, k' < s$, $\psi$ includes the following 2-clauses:

$$\prod_{I=0}^{N^2-1} (\neg v_{Is+k} + \neg v_{(I+N)s+k'}). \qquad (3)$$

(4) Finally, $\psi$ includes the following $s$-clauses:

$$\prod_{I=0}^{N^2-1} \sum_{k=0}^{s-1} v_{Is+k}. \qquad (4)$$

We note that $|\psi| = N + 4N^2 + N^2 s = \Theta(sN^2)$. Let $f(S, \sigma) = \psi$; then $f$ is one-to-one. If $(S, \sigma)$ is a positive instance of FDT, then for all $i$ and $j$ with $0 \le i < N$ and $0 \le j < N$, there must be a $k < s$ such that the square at location $(i, j)$ is tiled by $t_k$. Set $v_{(iN+j)s+k} = 1$. Since for all $I$ with $0 \le I < N^2$ there must be a pair of integers $i$ and $j$ with $i, j \in [0, N)$ such that $I = iN + j$, we conclude that every clause in (4) is satisfied. Clearly, every unit clause in (1) is satisfied. Now for each pair $(k, k')$ with $(t_k, t_{k'}) \notin H_{T_K}$, and for all $I$ with $0 \le I < N^2$, at least one of $v_{Is+k}$ and $v_{(I+1)s+k'}$ has not been set to 1. Set that variable to 0. Thus, every clause in (2) is satisfied. Similarly, every clause in (3) is also satisfied. Conversely, we can show that if $\psi$ is satisfiable then $(S, \sigma)$ is a positive instance of FDT.

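Let $x = x_1 \cdots x_{|\sigma|-1}$. Notice that $s$ is a constant and $|\sigma| = N$. For every symbol $a \in A$, we use $\bar{a}$ to denote the $x$-encoding of $a$. Then

$$\alpha\Big(v_p \prod_{i=1}^{|\sigma|-1} v_{is+x_i}\Big) = x@(\#b(1)\#, \#b(s)\#, \#b(0)\#)\, \bar{v}_p, \qquad (5)$$

and so $|\alpha(v_p \prod_{i=1}^{|\sigma|-1} v_{is+x_i})| = |x| + \Theta(\log |x|)$. Under $x$-encoding we can see that $|\alpha(Y)| = \Theta(\log |x| + \log N)$ for $Y$ being a formula in either (2), (3), or (4). Thus,

$$|\alpha(\psi)| = |x| + \Theta(\log |x| + \log N) \qquad (6)$$

$$= |\sigma| + \Theta(\log |\sigma|), \qquad (7)$$

and so $|\alpha(\psi)| = \Theta(|\psi|^{1/2}) > |\psi|^{1/3}$. Equality (7) implies that $\mu_{\mathrm{FT}}(S, \sigma) \preceq \mu_\alpha(f(S, \sigma))$. Hence $f$ reduces FDT to DistSAT. □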
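The shape of the formula built in the proof of Theorem 4 can be sketched as follows. This is an illustrative reconstruction and deviates from the text in three labeled ways: variables are numbered from 1 so that a negative integer can stand for a negated literal, the adjacency clauses are generated only for cells that are actually neighbors inside the $N \times N$ square, and the pairs in `V` are assumed to be ordered as (lower tile, upper tile). The function name and argument layout are hypothetical.

```python
def tiling_to_cnf(num_tiles, H, V, sigma):
    """Build clauses for: bottom row fixed to sigma, adjacent cells compatible, every cell tiled.

    H and V are sets of allowed index pairs (k, k2); sigma is a list of tile indices.
    """
    s, N = num_tiles, len(sigma)

    def var(i, j, k):
        # Variable for "cell (i, j) holds tile k", shifted by 1 (assumption).
        return (i * N + j) * s + k + 1

    clauses = [[var(0, j, t)] for j, t in enumerate(sigma)]       # unit clauses, as in (1)
    for i in range(N):
        for j in range(N):
            clauses.append([var(i, j, k) for k in range(s)])      # s-clauses, as in (4)
            for k in range(s):
                for k2 in range(s):
                    if j + 1 < N and (k, k2) not in H:            # 2-clauses, as in (2)
                        clauses.append([-var(i, j, k), -var(i, j + 1, k2)])
                    if i + 1 < N and (k, k2) not in V:            # 2-clauses, as in (3)
                        clauses.append([-var(i, j, k), -var(i + 1, j, k2)])
    return clauses
```

Every clause produced has length 1, 2, or $s$, matching the three clause lengths used later in the proof sketch of Theorem 5.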

We can extend $\alpha$-encoding to compress certain special clauses in a $k$SAT formula $\psi$. Let $C$ be a nontrivial clause of length $k$ in $\psi$ with identifier $\langle j_1, \ldots, j_k \rangle$, where $k > 1$. We say that $C$ is a special clause of $\psi$ if $\psi$ contains $2^k - 1$ clauses of length $k$ (including $C$) $\{\ell_{q_1}, \ell_{q_2}, \ldots, \ell_{q_k}\}$, where $q_i = \pm j_i$, and the $q_i$ cannot be all negative. Thus, among these special clauses there is one that contains all variables. Let $\langle h_1, \ldots, h_k \rangle$ be the identifier of this clause. Then the product of all these $2^k - 1$ special clauses equals 1 if and only if $\ell_{h_1} = \ell_{h_2} = \cdots = \ell_{h_k} = 1$. We call $\langle h_1, \ldots, h_k \rangle$ the basis of these special clauses.
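The claim that the $2^k - 1$ special clauses are simultaneously satisfied only when every basis variable is true can be checked by brute force for small $k$. In this sketch the function names are hypothetical, and the basis is assumed to be a tuple of positive variable indices.

```python
from itertools import product


def special_clauses(basis):
    """All sign patterns over the basis variables except the all-negative one."""
    k = len(basis)
    return [frozenset(sign * h for sign, h in zip(signs, basis))
            for signs in product((1, -1), repeat=k)
            if any(sign > 0 for sign in signs)]


def satisfies(clauses, assignment):
    """assignment maps each variable index to True or False."""
    return all(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses)
```

Any assignment that sets some basis variable to false falsifies the special clause in which exactly the false variables appear unnegated, which is why only the all-true assignment survives.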

We look for the first special clause starting from the first clause in $\psi$, group together all the special clauses with the same basis $\langle h_1, \ldots, h_k \rangle$, and replace the product of these clauses by

$$[h_1, \ldots, h_k]. \qquad (8)$$

We then encode (8) with $[\#b(h_1)\#, \ldots, \#b(h_k)\#]$ using the $x$-encoding in $\alpha$.

Theorem 5. For every $k \ge s$, where $s = |T_K|$, Dist-$k$SAT is complete on average for DistNP under randomized reductions.

Proof (Sketch). Let $\psi$ be the formula constructed in the proof of Theorem 4 with $n$ variables. Then each clause in $\psi$ has length 1, 2, or $s$. Create $s$ new variables $v_n, \ldots, v_{n+s-1}$, and $2^s - 1$ special clauses with basis $\{n, \ldots, (n + s - 1)\}$. Replace each unit clause $\{\ell\}$ in $\psi$ by $\{\ell, \neg v_n, \ldots, \neg v_{n+s-2}\}$, and replace each 2-clause $\{\ell_1, \ell_2\}$ in $\psi$ by $\{\ell_1, \ell_2, \neg v_n, \ldots, \neg v_{n+s-3}\}$. This produces a formula $\psi_s$ in $s$SAT, and $\psi$ is satisfiable if and only if $\psi_s$ also is. Moreover, $|\alpha(\psi_s)| = |\sigma| + \Theta(\log |\sigma|)$. So Dist-$s$SAT is complete on average.

Reducing Dist-$k$SAT to Dist-$(k+1)$SAT is straightforward. □
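The padding step in the proof sketch can be illustrated as follows. The function name `pad_to_s` is hypothetical, and, unlike the text, variables are numbered $1, \ldots, n$ here so that the new variables are $n+1, \ldots, n+s$ and a negative integer denotes a negated literal.

```python
from itertools import product


def pad_to_s(clauses, n, s):
    """Pad clauses of length < s with negated fresh variables and add the special clauses."""
    new = list(range(n + 1, n + s + 1))
    # The 2^s - 1 special clauses force every fresh variable to be true.
    special = [[sign * v for sign, v in zip(signs, new)]
               for signs in product((1, -1), repeat=s)
               if any(sign > 0 for sign in signs)]
    # A clause of length t receives the first s - t fresh variables, negated.
    padded = [list(c) + [-v for v in new[:s - len(c)]] for c in clauses]
    return padded + special
```

Since the special clauses force all fresh variables to be true, the added negated literals are false in every satisfying assignment, so the padded formula is satisfiable exactly when the original one is.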

Finally, we would like to point out that it remains open whether for $3 \le k < s$, Dist-$k$SAT is complete on average for DistNP.

ACKNOWLEDGMENTS

I am grateful to Jay Belanger and Drue Coles for carefully reading early drafts of this paper,

and to Steve Cook, Yuri Gurevich, and Leonid Levin for their comments. I thank Drue Coles

for pointing out an improved statement of FDT.

REFERENCES

[BCGL92] S. Ben-David, B. Chor, O. Goldreich, and M. Luby, On the theory of average case complexity, J. Comput. System Sci. 44 (1992), 193–219. (Preliminary version first appeared in STOC'89.)

[BG93] A. Blass and Y. Gurevich, Randomizing reductions of search problems, SIAM J. Comput. 22 (1993), 949–975.

[Bol85] B. Bollobás, “Random Graphs,” Academic Press, New York, 1985.

[CS88] V. Chvátal and E. Szemerédi, Many hard examples for Resolution, J. ACM 35 (1988), 759–768.

[CM97] S. A. Cook and D. G. Mitchell, Finding hard instances of the satisfiability problem: A survey, in “Satisfiability Problem: Theory and Applications,” (D.-Z. Du, J. Gu, and P. Pardalos, Eds.), pp. 1–17, AMS Press, Providence, RI, 1997.

[DLL62] M. Davis, G. Logemann, and D. Loveland, A machine program for theorem-proving, Comm. ACM 5 (1962), 394–397.

[DP60] M. Davis and H. Putnam, A computing procedure for quantification theory, J. ACM 7 (1960), 201–215.

[FP83] J. Franco and M. Paull, Probabilistic analysis of the Davis–Putnam procedure for solving the satisfiability problem, Discrete Appl. Math. 22 (1988), 35–51.

[Gol79] A. Goldberg, “On the Complexity of the Satisfiability Problem,” Courant Computer Science Report No. 16, New York University, 1979.

[Gur91] Y. Gurevich, Average case completeness, J. Comput. System Sci. 42 (1991), 346–398. (Preliminary version first appeared in FOCS'87.)

[Lev86] L. Levin, Average case complete problems, SIAM J. Comput. 15 (1986), 285–286. (Preliminary version first appeared in STOC'84.)

[VL88] R. Venkatesan and L. Levin, Random instances of a graph coloring problem are hard, in “Proceedings of the 20th Annual Symposium on Theory of Computing,” pp. 217–222, ACM Press, Providence, RI, 1988.

[Wan97] J. Wang, Average-case computational complexity theory, in “Complexity Theory Retrospective II,” (L. Hemaspaandra and A. Selman, Eds.), pp. 295–328, Springer-Verlag, Berlin, 1997.

[Wan99] J. Wang, Distributional word problems for groups, SIAM J. Comput. 28 (1999), 1264–1283. (Preliminary version first appeared in STOC'95.)