Adaptive annealing: a near-optimal connection between sampling and counting

Adaptive annealing: a near-optimal connection between sampling and counting. Daniel Štefankovič (University of Rochester), Santosh Vempala, Eric Vigoda (Georgia Tech)


TRANSCRIPT

Page 1: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Adaptive annealing: a near-optimal connection between

sampling and counting

Daniel Štefankovič (University of Rochester)

Santosh Vempala, Eric Vigoda (Georgia Tech)

Page 2: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Adaptive annealing: a near-optimal connection between

sampling and counting

Daniel Štefankovič (University of Rochester)

Santosh Vempala, Eric Vigoda (Georgia Tech)

If you want to count using MCMC then statistical physicsis useful.

Page 3: Adaptive annealing:  a near-optimal  connection between  sampling and counting

1. Counting problems

2. Basic tools: Chernoff, Chebyshev

3. Dealing with large quantities (the product method)

4. Statistical physics

5. Cooling schedules (our work)

6. More…

Outline

Page 4: Adaptive annealing:  a near-optimal  connection between  sampling and counting

independent sets

spanning trees

matchings

perfect matchings

k-colorings

Counting

Page 5: Adaptive annealing:  a near-optimal  connection between  sampling and counting

independent sets

spanning trees

matchings

perfect matchings

k-colorings

Counting

Page 6: Adaptive annealing:  a near-optimal  connection between  sampling and counting

spanning trees

Compute the number of

Page 7: Adaptive annealing:  a near-optimal  connection between  sampling and counting

spanning trees

Compute the number of

Kirchhoff’s Matrix Tree Theorem: # spanning trees = det of (D – A) with row v and column v deleted (any cofactor of the Laplacian).

Example (4-cycle):

A =
0 1 0 1
1 0 1 0
0 1 0 1
1 0 1 0

D =
2 0 0 0
0 2 0 0
0 0 2 0
0 0 0 2

det of the 3×3 minor of D – A:

det
 2 -1  0
-1  2 -1
 0 -1  2
= 4
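The Matrix Tree Theorem can be checked directly on the 4-cycle from the slide; a minimal sketch in Python (the adjacency-matrix encoding and helper name are mine, not from the talk), using exact arithmetic so the determinant is an integer:

```python
from fractions import Fraction

def spanning_trees(adj):
    """Count spanning trees: det of the Laplacian D - A with row 0 and
    column 0 removed, via Gaussian elimination over exact Fractions."""
    n = len(adj)
    L = [[Fraction((sum(adj[i]) if i == j else 0) - adj[i][j])
          for j in range(1, n)] for i in range(1, n)]
    det = Fraction(1)
    m = n - 1
    for col in range(m):
        pivot = next((r for r in range(col, m) if L[r][col] != 0), None)
        if pivot is None:
            return 0            # singular minor: graph is disconnected
        if pivot != col:
            L[col], L[pivot] = L[pivot], L[col]
            det = -det          # row swap flips the sign
        det *= L[col][col]
        for r in range(col + 1, m):
            factor = L[r][col] / L[col][col]
            for c in range(col, m):
                L[r][c] -= factor * L[col][c]
    return int(det)

c4 = [[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]]  # the slide's 4-cycle
print(spanning_trees(c4))  # 4
```

The complete graph K4 gives 16, matching Cayley's formula 4^(4-2).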

Page 8: Adaptive annealing:  a near-optimal  connection between  sampling and counting

spanning trees

Compute the number of

G → [polynomial-time algorithm] → number of spanning trees of G

Page 9: Adaptive annealing:  a near-optimal  connection between  sampling and counting

independent sets

spanning trees

matchings

perfect matchings

k-colorings

Counting ?

Page 10: Adaptive annealing:  a near-optimal  connection between  sampling and counting

independent sets

Compute the number of

(hard-core gas model)

independent set of a graph = subset S of vertices, no two in S are neighbors

Page 11: Adaptive annealing:  a near-optimal  connection between  sampling and counting

# independent sets = 7

independent set = subset S of vertices no two in S are neighbors

Page 12: Adaptive annealing:  a near-optimal  connection between  sampling and counting

# independent sets of the path graphs G1, G2, G3, ..., Gn-2, Gn-1, Gn = ?

Page 13: Adaptive annealing:  a near-optimal  connection between  sampling and counting

# independent sets of the path graphs G1, G2, G3, ..., Gn-2, Gn-1, Gn:

2, 3, 5, ..., Fn-1, Fn, Fn+1

(the Fibonacci numbers: each count is the sum of the previous two)
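The Fibonacci pattern on the path graphs can be verified by brute force; a small sketch (function names and the brute-force check are mine):

```python
from itertools import combinations

def count_is_path(n):
    """# independent sets of the n-vertex path, via the Fibonacci
    recurrence: vertex n is either excluded (count for n-1 vertices)
    or included (count for n-2 vertices)."""
    a, b = 1, 2  # counts for the 0-vertex and 1-vertex paths
    for _ in range(n - 1):
        a, b = b, a + b
    return b if n >= 1 else a

def count_is_brute(n):
    """Enumerate all vertex subsets of the n-vertex path directly."""
    edges = [(i, i + 1) for i in range(n - 1)]
    total = 0
    for r in range(n + 1):
        for S in combinations(range(n), r):
            s = set(S)
            if all(u not in s or v not in s for u, v in edges):
                total += 1
    return total

print([count_is_path(n) for n in range(1, 8)])  # [2, 3, 5, 8, 13, 21, 34]
```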

Page 14: Adaptive annealing:  a near-optimal  connection between  sampling and counting

# independent sets = 5598861

independent set = subset S of vertices no two in S are neighbors

Page 15: Adaptive annealing:  a near-optimal  connection between  sampling and counting

independent sets

Compute the number of

G → [polynomial-time algorithm] → number of independent sets of G

?

Page 16: Adaptive annealing:  a near-optimal  connection between  sampling and counting

independent sets

Compute the number of

G → [polynomial-time algorithm] → number of independent sets of G

! (unlikely)

Page 17: Adaptive annealing:  a near-optimal  connection between  sampling and counting

graph G → # independent sets in G is #P-complete,

even for 3-regular graphs (Dyer, Greenhill, 1997)

[diagram: P ⊆ NP, FP ⊆ #P]

Page 18: Adaptive annealing:  a near-optimal  connection between  sampling and counting

graph G # independent sets in G

approximation

randomization

?

Page 19: Adaptive annealing:  a near-optimal  connection between  sampling and counting

graph G # independent sets in G

approximation

randomization

? which is more important?

Page 20: Adaptive annealing:  a near-optimal  connection between  sampling and counting

graph G # independent sets in G

approximation

randomization

? which is more important?

My world-view: (true) randomness is important conceptually but NOT computationally (i.e., I believe P = BPP).

Approximation makes problems easier (i.e., I believe #P = BPP).

Page 21: Adaptive annealing:  a near-optimal  connection between  sampling and counting

We would like to know Q

Goal: random variable Y such that

P( (1-ε)Q ≤ Y ≤ (1+ε)Q ) ≥ 1-δ

“Y gives a (1±ε)-estimate”

Page 22: Adaptive annealing:  a near-optimal  connection between  sampling and counting

We would like to know Q

Goal: random variable Y such that

P( (1-ε)Q ≤ Y ≤ (1+ε)Q ) ≥ 1-δ

FPRAS (fully polynomial randomized approximation scheme):

G, ε, δ → [polynomial-time algorithm] → Y

Page 23: Adaptive annealing:  a near-optimal  connection between  sampling and counting

1. Counting problems

2. Basic tools: Chernoff, Chebyshev

3. Dealing with large quantities (the product method)

4. Statistical physics

5. Cooling schedules (our work)

6. More...

Outline

Page 24: Adaptive annealing:  a near-optimal  connection between  sampling and counting

We would like to know Q.

1. Get an unbiased estimator X, i.e., E[X] = Q.

2. “Boost the quality” of X:

Y = (X1 + X2 + ... + Xn) / n

Page 25: Adaptive annealing:  a near-optimal  connection between  sampling and counting

The Bienaymé-Chebyshev inequality:

P( Y gives a (1±ε)-estimate ) ≥ 1 - (1/ε²) · V[Y]/E[Y]²

Page 26: Adaptive annealing:  a near-optimal  connection between  sampling and counting

The Bienaymé-Chebyshev inequality:

P( Y gives a (1±ε)-estimate ) ≥ 1 - (1/ε²) · V[Y]/E[Y]²

Y = (X1 + X2 + ... + Xn) / n

V[Y]/E[Y]² = (1/n) · V[X]/E[X]²   (V[X]/E[X]² = squared coefficient of variation, SCV)

Page 27: Adaptive annealing:  a near-optimal  connection between  sampling and counting

The Bienaymé-Chebyshev inequality:

Let X1,...,Xn,X be independent, identically distributed random variables, Q = E[X]. Let

Y = (X1 + X2 + ... + Xn) / n

Then

P( Y gives a (1±ε)-estimate of Q ) ≥ 1 - (1/ε²) · V[X] / (n·E[X]²)

Page 28: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Chernoff’s bound:

Let X1,...,Xn,X be independent, identically distributed random variables, 0 ≤ X ≤ 1, Q = E[X]. Let

Y = (X1 + X2 + ... + Xn) / n

Then

P( Y gives a (1±ε)-estimate of Q ) ≥ 1 – e^(-ε²·n·E[X]/3)

Page 29: Adaptive annealing:  a near-optimal  connection between  sampling and counting
Page 30: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Number of samples to achieve precision ε with confidence δ:

Chebyshev:  n ≥ (1/ε²) · (V[X]/E[X]²) · (1/δ)

Chernoff (0 ≤ X ≤ 1):  n ≥ (1/ε²) · (1/E[X]) · 3 ln(1/δ)

Page 31: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Number of samples to achieve precision ε with confidence δ:

Chebyshev:  n ≥ (1/ε²) · (V[X]/E[X]²) · (1/δ)   (1/δ: BAD)

Chernoff (0 ≤ X ≤ 1):  n ≥ (1/ε²) · (1/E[X]) · 3 ln(1/δ)   (ln(1/δ): GOOD, 1/E[X]: BAD)

Page 32: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Median “boosting trick”

Y = (X1 + X2 + ... + Xn) / n

BY BIENAYMÉ-CHEBYSHEV: with n ≥ (1/ε²) · (1/E[X]) · 4,

P( (1-ε)Q ≤ Y ≤ (1+ε)Q ) ≥ 3/4

Page 33: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Median trick – repeat 2T times

BY BIENAYMÉ-CHEBYSHEV: each Y lands in [(1-ε)Q, (1+ε)Q] with probability ≥ 3/4.

BY CHERNOFF: P( > T out of 2T land in [(1-ε)Q, (1+ε)Q] ) ≥ 1 - e^(-T/4)

P( the median is in [(1-ε)Q, (1+ε)Q] ) ≥ 1 - e^(-T/4)
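The boosting recipe of the last two slides can be simulated; a toy sketch (the Bernoulli sampler and all constants are my choices, not the talk's), averaging n ≥ 4/(ε²·E[X]) samples per batch and then taking a median of batch means:

```python
import random

def median_of_means(sampler, n, batches):
    """Average n samples per batch (Chebyshev: each batch mean is a
    (1 +/- eps)-estimate with prob >= 3/4), then take the median of the
    batch means (Chernoff: failure prob drops exponentially in batches)."""
    means = sorted(sum(sampler() for _ in range(n)) / n
                   for _ in range(batches))
    return means[batches // 2]

random.seed(0)
q = 0.5                                  # Q = E[X] for a Bernoulli sampler
eps = 0.1
n = int(4 / (eps ** 2 * q)) + 1          # slide bound: n >= 4/(eps^2 E[X])
est = median_of_means(lambda: random.random() < q, n, batches=11)
print(est)                               # should land in (1 +/- eps) * q
```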

Page 34: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Chebyshev + median trick:  n ≥ (1/ε²) · (V[X]/E[X]²) · 32 ln(1/δ)

Chernoff (0 ≤ X ≤ 1):  n ≥ (1/ε²) · (1/E[X]) · 3 ln(1/δ)   (1/E[X]: BAD)

Page 35: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Creating an “approximator” from X:

n ≥ (1/ε²) · (V[X]/E[X]²) · ln(1/δ) samples  (ε = precision, δ = confidence)

Page 36: Adaptive annealing:  a near-optimal  connection between  sampling and counting

1. Counting problems

2. Basic tools: Chernoff, Chebyshev

3. Dealing with large quantities (the product method)

4. Statistical physics

5. Cooling schedules (our work)

6. More...

Outline

Page 37: Adaptive annealing:  a near-optimal  connection between  sampling and counting

(approx) counting ⇐ sampling: Valleau, Card’72 (physical chemistry), Babai’79 (for matchings and colorings), Jerrum, Valiant, V.Vazirani’86

the outcome of the JVV reduction: random variables X1, X2, ..., Xt such that

1) E[X1 X2 ... Xt] = “WANTED”

2) V[Xi]/E[Xi]² = O(1)   (squared coefficient of variation, SCV)

and the Xi are easy to estimate.

Page 38: Adaptive annealing:  a near-optimal  connection between  sampling and counting

(approx) counting ⇐ sampling

1) E[X1 X2 ... Xt] = “WANTED”

2) V[Xi]/E[Xi]² = O(1), and the Xi are easy to estimate

Theorem (Dyer-Frieze’91): O(t²/ε²) samples (O(t/ε²) from each Xi) give a (1±ε) estimator of “WANTED” with probability ≥ 3/4.

Page 39: Adaptive annealing:  a near-optimal  connection between  sampling and counting

JVV for independent sets

GOAL: given a graph G, estimate the number of independent sets of G.

P( S = ∅ ) = 1 / (# independent sets)   (S a uniformly random independent set)

Page 40: Adaptive annealing:  a near-optimal  connection between  sampling and counting

JVV for independent sets

P( S = ∅ ) = P(A1) · P(A2 | A1) · P(A3 | A1∩A2) · P(A4 | A1∩A2∩A3) = X1 X2 X3 X4

using P(A∩B) = P(A)·P(B|A), with Ai the event “vertex i is not in S”.

Xi ∈ [0,1] and E[Xi] ≥ ½  ⇒  V[Xi]/E[Xi]² = O(1)


Page 42: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Self-reducibility for independent sets

P( v ∉ S ) = 5/7   (S a uniformly random independent set)

Page 43: Adaptive annealing:  a near-optimal  connection between  sampling and counting

P( v ∉ S ) = 5/7

Self-reducibility for independent sets

Page 44: Adaptive annealing:  a near-optimal  connection between  sampling and counting

P( v ∉ S ) = 5/7 = (# independent sets of G - v) / (# independent sets of G)

Self-reducibility for independent sets

Page 45: Adaptive annealing:  a near-optimal  connection between  sampling and counting

P( v' ∉ S' ) = 3/5   (now for a uniformly random independent set of G - v)

Self-reducibility for independent sets

Page 46: Adaptive annealing:  a near-optimal  connection between  sampling and counting

P( v' ∉ S' ) = 3/5 = (# independent sets of G - v - v') / (# independent sets of G - v)

Self-reducibility for independent sets

Page 47: Adaptive annealing:  a near-optimal  connection between  sampling and counting

(3/5) · (5/7) = 3/7

(2/3) · (3/5) · (5/7) = 2/7  ⇒  # independent sets = 7

Self-reducibility for independent sets
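The telescoping above can be run exactly on a small example; a sketch (the 5-vertex path, vertex ordering, and helper names are mine): with Gi the subgraph induced on the first i vertices, a uniformly random independent set S of Gi avoids vertex i with probability |IS(Gi-1)| / |IS(Gi)|, so the product of the reciprocals recovers |IS(G)|.

```python
from itertools import combinations

def independent_sets(n, edges):
    """All independent sets of the graph on vertices 0..n-1."""
    sets = []
    for r in range(n + 1):
        for S in combinations(range(n), r):
            s = set(S)
            if all(u not in s or v not in s for u, v in edges):
                sets.append(s)
    return sets

n = 5
edges = [(i, i + 1) for i in range(n - 1)]   # 5-vertex path
count = 1.0                                  # |IS(G0)| = 1 (empty graph)
for i in range(1, n + 1):
    sub = [(u, v) for u, v in edges if u < i and v < i]
    sets_i = independent_sets(i, sub)
    # exact probability that vertex i-1 is avoided by a random IS of Gi
    p_avoid = sum(1 for s in sets_i if (i - 1) not in s) / len(sets_i)
    count /= p_avoid                         # telescoping product
print(round(count))  # 13, the Fibonacci count for the 5-vertex path
```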

Page 48: Adaptive annealing:  a near-optimal  connection between  sampling and counting

JVV: If we have a sampler oracle:

graph G → [SAMPLER ORACLE] → random independent set of G

then FPRAS using O(n²) samples.

Page 49: Adaptive annealing:  a near-optimal  connection between  sampling and counting

JVV: If we have a sampler oracle:

graph G → [SAMPLER ORACLE] → random independent set of G

then FPRAS using O(n²) samples.

ŠVV: If we have a sampler oracle:

graph G → [SAMPLER ORACLE] → set from gas-model Gibbs distribution at β

then FPRAS using O*(n) samples.

Page 50: Adaptive annealing:  a near-optimal  connection between  sampling and counting

O*( |V| ) samples suffice for counting

Application – independent sets

Cost per sample (Vigoda’01, Dyer-Greenhill’01): time = O*( |V| ) for graphs of degree ≤ 4.

Total running time: O* ( |V|2 ).

Page 51: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Other applications (total running time):

matchings O*(n²m) (using Jerrum, Sinclair’89)

spin systems: Ising model O*(n²) for β < βC (using Marinelli, Olivieri’95)

k-colorings O*(n²) for k > 2Δ (using Jerrum’95)

Page 52: Adaptive annealing:  a near-optimal  connection between  sampling and counting

1. Counting problems

2. Basic tools: Chernoff, Chebyshev

3. Dealing with large quantities (the product method)

4. Statistical physics

5. Cooling schedules (our work)

6. More…

Outline

Page 53: Adaptive annealing:  a near-optimal  connection between  sampling and counting

easy = hot

hard = cold

Page 54: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Hamiltonian

[figure: configurations arranged at energy levels H = 4, 2, 1, 0]

Page 55: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Hamiltonian H : Ω → {0,...,n}

Big set = Ω

Goal: estimate |H⁻¹(0)|

|H⁻¹(0)| = E[X1] · ... · E[Xt]

Page 56: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Distributions between hot and cold (Gibbs distributions):

μβ(x) ∝ exp(-β·H(x)),   β = inverse temperature

β = 0: hot, uniform on Ω;   β = ∞: cold, uniform on H⁻¹(0)

Page 57: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Distributions between hot and cold:

μβ(x) = exp(-β·H(x)) / Z(β)

Normalizing factor = partition function:   Z(β) = Σx exp(-β·H(x))

Page 58: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Partition function

Z(β) = Σx exp(-β·H(x))

have: Z(0) = |Ω|;   want: Z(∞) = |H⁻¹(0)|

Page 59: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Partition function - example

Z(β) = Σx exp(-β·H(x));  have: Z(0) = |Ω|, want: Z(∞) = |H⁻¹(0)|

[figure: 16 configurations at energy levels 4, 2, 1, 0]

Z(β) = 1·e^(-4β) + 4·e^(-2β) + 4·e^(-1β) + 7·e^(-0β)

Z(0) = 16,   Z(∞) = 7
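The example above is small enough to evaluate directly; a minimal sketch (the dictionary encoding of the level sizes a_k is mine):

```python
import math

# The slide's toy Hamiltonian: a_k = |H^-1(k)| states at energy k,
# i.e. one state at energy 4, four at 2, four at 1, seven at 0.
a = {4: 1, 2: 4, 1: 4, 0: 7}

def Z(beta):
    """Partition function Z(beta) = sum_k a_k * exp(-beta*k)."""
    return sum(ak * math.exp(-beta * k) for k, ak in a.items())

print(Z(0))    # 16 = |Omega|
print(Z(50))   # ~7 = |H^-1(0)|, the beta -> infinity limit
```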

Page 60: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Assumption: we have a sampler oracle for μβ(x) = exp(-β·H(x)) / Z(β):

graph G → [SAMPLER ORACLE] → subset of V from μβ

Page 61: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Assumption: we have a sampler oracle for μβ(x) = exp(-β·H(x)) / Z(β)

W ~ μβ

Page 62: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Assumption: we have a sampler oracle for μβ(x) = exp(-β·H(x)) / Z(β)

W ~ μβ,   X = exp(H(W)·(β - β'))

Page 63: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Assumption: we have a sampler oracle for μβ(x) = exp(-β·H(x)) / Z(β)

W ~ μβ,   X = exp(H(W)·(β - β'))

E[X] = Σs μβ(s)·X(s) = Z(β') / Z(β)

⇒ we can obtain the following ratio: Z(β') / Z(β)
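The identity E[X] = Z(β')/Z(β) can be checked exactly on the toy example, since the expectation is a finite sum over energy levels (the encoding of a_k is mine):

```python
import math

# Toy Hamiltonian from the slides: a_k = |H^-1(k)|.
a = {4: 1, 2: 4, 1: 4, 0: 7}

def Z(beta):
    return sum(ak * math.exp(-beta * k) for k, ak in a.items())

def expected_X(beta, beta2):
    """E[exp(H(W)*(beta - beta2))] for W ~ Gibbs at beta, computed as an
    exact sum over energy levels k with P(H(W)=k) = a_k e^{-beta k}/Z(beta)."""
    z = Z(beta)
    return sum((ak * math.exp(-beta * k) / z) * math.exp(k * (beta - beta2))
               for k, ak in a.items())

beta, beta2 = 0.3, 0.7
print(expected_X(beta, beta2), Z(beta2) / Z(beta))  # the two agree
```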

Page 64: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Our goal restated

Partition function:  Z(β) = Σx exp(-β·H(x))

Goal: estimate Z(∞) = |H⁻¹(0)|:

Z(∞) = (Z(β1)/Z(β0)) · (Z(β2)/Z(β1)) · ... · (Z(βt)/Z(βt-1)) · Z(β0)

where β0 = 0 < β1 < β2 < ... < βt = ∞
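The telescoping product can be run end to end on the toy example; a sketch (the schedule, sample counts, and exact level-sampler are my choices, not the talk's):

```python
import math, random

# Toy example from the slides: a_4=1, a_2=4, a_1=4, a_0=7,
# so Z(0)=16 and Z(inf)=7.
a = {4: 1, 2: 4, 1: 4, 0: 7}

def Z(beta):
    return sum(ak * math.exp(-beta * k) for k, ak in a.items())

def sample_energy(beta):
    """Exact sample of H(W) for W ~ Gibbs at beta (tiny state space)."""
    r = random.random() * Z(beta)
    acc = 0.0
    for k, ak in a.items():
        acc += ak * math.exp(-beta * k)
        if r <= acc:
            return k
    return k

random.seed(1)
betas = [i / 10 for i in range(51)]          # crude schedule 0.0 .. 5.0
est = Z(0)                                   # start from the known Z(0)
for b0, b1 in zip(betas, betas[1:]):
    xs = [math.exp(sample_energy(b0) * (b0 - b1)) for _ in range(1000)]
    est *= sum(xs) / len(xs)                 # estimates Z(b1)/Z(b0)
print(est)                                   # estimate of Z(5.0) (~7.03)
```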

Page 65: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Our goal restated

Z(∞) = (Z(β1)/Z(β0)) · (Z(β2)/Z(β1)) · ... · (Z(βt)/Z(βt-1)) · Z(β0)

Cooling schedule:  β0 = 0 < β1 < β2 < ... < βt = ∞

How to choose the cooling schedule? Minimize length, while satisfying

E[Xi] = Z(βi)/Z(βi-1),   V[Xi]/E[Xi]² = O(1)


Page 67: Adaptive annealing:  a near-optimal  connection between  sampling and counting

1. Counting problems

2. Basic tools: Chernoff, Chebyshev

3. Dealing with large quantities (the product method)

4. Statistical physics

5. Cooling schedules (our work)

6. More...

Outline

Page 68: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Parameters: A and n

Z(0) = A,   H: Ω → {0,...,n}

Z(β) = Σx exp(-β·H(x)) = Σ(k=0..n) ak·e^(-βk),  where ak = |H⁻¹(k)|

Page 69: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Parameters

Z(0) = A,   H: Ω → {0,...,n}

                    A       n
independent sets    2^|V|   |E|
matchings           |V|!    |V|
perfect matchings   |V|!    |V|
k-colorings         k^|V|   |E|

Page 70: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Parameters

Z(0) = A,   H: Ω → {0,...,n}

                    A       n
independent sets    2^|V|   |E|
matchings           |V|!    |V|
perfect matchings   |V|!    |V|
k-colorings         k^|V|   |E|

matchings = # ways of marrying them so that no unhappy couple



Page 73: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Parameters

Z(0) = A,   H: Ω → {0,...,n}

                    A       n
independent sets    2^|V|   |E|
matchings           |V|!    |V|
perfect matchings   |V|!    |V|
k-colorings         k^|V|   |E|

perfect matchings: marry ignoring “compatibility”; hamiltonian = number of unhappy couples


Page 75: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Previous cooling schedules

Z(0) = A,   H: Ω → {0,...,n}

Cooling schedule:  β0 = 0 < β1 < β2 < ... < βt = ∞

“Safe steps” (Bezáková, Štefankovič, Vigoda, V.Vazirani’06):
β → β + 1/n
β → β·(1 + 1/ln A)
β → ∞  (once β ≥ ln A)

Cooling schedules of length O( n ln A ) and O( (ln n)(ln A) ) (Bezáková, Štefankovič, Vigoda, V.Vazirani’06)


Page 77: Adaptive annealing:  a near-optimal  connection between  sampling and counting

“Safe steps” (Bezáková, Štefankovič, Vigoda, V.Vazirani’06):
β → β + 1/n
β → β·(1 + 1/ln A)
β → ∞  (once β ≥ ln A)

Z(β) = Σ(k=0..n) ak·e^(-βk);   W ~ μβ,  X = exp(H(W)·(β - β'))

For the step β → β + 1/n:  1/e ≤ X ≤ 1, hence E[X] ≥ 1/e and V[X]/E[X]² ≤ e

Page 78: Adaptive annealing:  a near-optimal  connection between  sampling and counting

“Safe steps” (continued):

For the step β → ∞ (from β ≥ ln A):  Z(∞) = a0 ≥ 1,  Z(ln A) ≤ a0 + 1,  so E[X] ≥ 1/2

Page 79: Adaptive annealing:  a near-optimal  connection between  sampling and counting

“Safe steps” (continued):

For the step β → β·(1 + 1/ln A):  E[X] ≥ 1/(2e)

Page 80: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Previous cooling schedules

“Safe steps”: β → β + 1/n;  β → β·(1 + 1/ln A);  β → ∞ once β ≥ ln A  (Bezáková, Štefankovič, Vigoda, V.Vazirani’06)

Cooling schedules of length O( n ln A ) and O( (ln n)(ln A) )

e.g.: 1/n, 2/n, 3/n, ..., (ln A)/n, ..., ln A
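The multiplicative safe-step schedule can be generated directly; a sketch (the function and the specific n, A are mine), showing the O((ln n)(ln A)) length against the n·ln A additive alternative:

```python
import math

def nonadaptive_schedule(n, A):
    """Safe-step schedule in the style of the slides: start at 1/n,
    multiply by (1 + 1/ln A) until beta exceeds ln A, then a final
    jump to beta = infinity.  Length is O((ln n)(ln A))."""
    lnA = math.log(A)
    betas = [0.0, 1.0 / n]
    while betas[-1] < lnA:
        betas.append(betas[-1] * (1 + 1.0 / lnA))
    betas.append(math.inf)
    return betas

s = nonadaptive_schedule(n=100, A=2 ** 100)
# additive schedule 1/n, 2/n, ..., ln A would have ~ n*ln A ~ 6900 steps
print(len(s))
```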

Page 81: Adaptive annealing:  a near-optimal  connection between  sampling and counting

No better fixed schedule possible

Z(0) = A,   H: Ω → {0,...,n}

THEOREM: Let Za(β) = (A/(1+a)) · (1 + a·e^(-βn)), with a ∈ [0, A-1].

A schedule that works for all Za has LENGTH Ω( (ln n)(ln A) ).

Page 82: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Parameters

Z(0) = A,   H: Ω → {0,...,n}

Previously: non-adaptive schedules of length Ω*( ln A )

Our main result: can get an adaptive schedule of length O*( (ln A)^(1/2) )

Page 83: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Related work

can get adaptive schedule of length O*( (ln A)^(1/2) )

Lovász-Vempala: volume of convex bodies in O*(n⁴), with a schedule of length O(n^(1/2)) (non-adaptive cooling schedule, using specific properties of the “volume” partition functions)

Page 84: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Existential part

for every partition function there exists a cooling schedule of length O*((ln A)1/2)

Lemma:

can get adaptive scheduleof length O* ( (ln A)1/2 )

there exists

Page 85: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Cooling schedule (definition refresh)

Z(∞) = (Z(β1)/Z(β0)) · (Z(β2)/Z(β1)) · ... · (Z(βt)/Z(βt-1)) · Z(β0)

Cooling schedule:  β0 = 0 < β1 < β2 < ... < βt = ∞

How to choose the cooling schedule? Minimize length, while satisfying

E[Xi] = Z(βi)/Z(βi-1),   V[Xi]/E[Xi]² = O(1)

Page 86: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Express the SCV using the partition function (going from β to β'):

W ~ μβ,   X = exp(H(W)·(β - β'))

E[X] = Z(β') / Z(β)

E[X²]/E[X]² = Z(2β'-β)·Z(β) / Z(β')² =: C

V[X]/E[X]² + 1 = E[X²]/E[X]²

Page 87: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Proof: let f(β) = ln Z(β).

E[X²]/E[X]² = Z(2β'-β)·Z(β) / Z(β')² ≤ C

⇔  (f(2β'-β) + f(β))/2 ≤ (ln C)/2 + f(β')

[graph of f: the midpoint of the chord over [β, 2β'-β] lies at most C’ = (ln C)/2 above f(β')]

Page 88: Adaptive annealing:  a near-optimal  connection between  sampling and counting

f(β) = ln Z(β);  f is decreasing;  f is convex;  f’(0) ≥ –n;  f(0) ≤ ln A

Properties of partition functions

Page 89: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Properties of partition functions:

f(β) = ln Z(β);  f is decreasing;  f is convex;  f’(0) ≥ –n;  f(0) ≤ ln A

f(β) = ln Σ(k=0..n) ak·e^(-βk)

f’(β) = - Σ(k=0..n) ak·k·e^(-βk) / Σ(k=0..n) ak·e^(-βk)

(using (ln Z)’ = Z’/Z)

Page 90: Adaptive annealing:  a near-optimal  connection between  sampling and counting

f(β) = ln Z(β);  f is decreasing, f is convex, f’(0) ≥ –n, f(0) ≤ ln A

GOAL: proving Lemma: for every partition function there exists a cooling schedule of length O*((ln A)^(1/2))

Proof idea: either f or f’ changes a lot. Let K := the drop in f over a step. Then the drop in ln |f’| is ≥ 1/K.

Page 91: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Proof: Let K := the drop in f. Then the drop in ln |f’| is ≥ 1/K.

c := (a+b)/2,  Δ := b-a;  have f(c) = (f(a)+f(b))/2 – 1

f is convex:  (f(a) – f(c)) / |f’(a)| ≤ Δ/2 ≤ (f(c) – f(b)) / |f’(b)|

[picture: a, c, b]

Page 92: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Let K := the drop in f. Then the drop in ln |f’| is ≥ 1/K.

c := (a+b)/2,  Δ := b-a;  have f(c) = (f(a)+f(b))/2 – 1

f is convex:  (f(a) – f(c)) / |f’(a)| ≤ (f(c) – f(b)) / |f’(b)|

⇒  |f’(b)| ≤ |f’(a)| · (1 – 1/K) ≤ |f’(a)| · e^(-1/K)

Page 93: Adaptive annealing:  a near-optimal  connection between  sampling and counting

f: [a,b] → R, convex, decreasing, can be “approximated” using

( ln( f’(a)/f’(b) ) · (f(a) - f(b)) )^(1/2)

segments

Page 94: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Proof:

Technicality: getting to 2β'-β   [figure: β, β', 2β'-β]

Page 95: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Proof:

Technicality: getting to 2β'-β   [figure: βi, βi+1]

Page 96: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Proof:

Technicality: getting to 2β'-β   [figure: βi, βi+1, βi+2]

Page 97: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Proof:

Technicality: getting to 2β'-β   [figure: βi, βi+1, βi+2, βi+3]

≤ ln ln A extra steps

Page 98: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Existential → Algorithmic

there exists an adaptive schedule of length O*( (ln A)^(1/2) )

→ can get an adaptive schedule of length O*( (ln A)^(1/2) )

Page 99: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Algorithmic construction

Our main result: using a sampler oracle for μβ(x) = exp(-β·H(x))/Z(β),

we can construct a cooling schedule of length ≤ 38·(ln A)^(1/2)·(ln ln A)·(ln n)

Total number of oracle calls: ≤ 10⁷ (ln A) (ln ln A + ln n)⁷ ln(1/δ)

Page 100: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Algorithmic construction

current inverse temperature β; ideally move to β' such that

B1 ≤ E[X²]/E[X]² ≤ B2,  where E[X] = Z(β')/Z(β)

Page 101: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Algorithmic construction

current inverse temperature β; ideally move to β' such that

B1 ≤ E[X²]/E[X]² ≤ B2,  where E[X] = Z(β')/Z(β)

(≤ B2: X is “easy to estimate”)

Page 102: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Algorithmic construction

current inverse temperature β; ideally move to β' such that

B1 ≤ E[X²]/E[X]² ≤ B2,  where E[X] = Z(β')/Z(β)

(≥ B1: we make progress, where B1 > 1)

Page 103: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Algorithmic construction

current inverse temperature β; ideally move to β' such that

B1 ≤ E[X²]/E[X]² ≤ B2,  where E[X] = Z(β')/Z(β)

need to construct a “feeler” for this

Page 104: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Algorithmic construction

current inverse temperature β; ideally move to β' such that B1 ≤ E[X²]/E[X]² ≤ B2, where E[X] = Z(β')/Z(β)

need to construct a “feeler” for

E[X²]/E[X]² = (Z(β)/Z(β')) · (Z(2β'-β)/Z(β'))

Page 105: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Algorithmic construction

current inverse temperature β; ideally move to β' such that B1 ≤ E[X²]/E[X]² ≤ B2, where E[X] = Z(β')/Z(β)

need to construct a “feeler” for

E[X²]/E[X]² = (Z(β)/Z(β')) · (Z(2β'-β)/Z(β'))   (a bad “feeler”)

Page 106: Adaptive annealing:  a near-optimal  connection between  sampling and counting

estimator for Z(β')/Z(β)

Z(β) = Σ(k=0..n) ak·e^(-βk)

For W ~ μβ we have P(H(W)=k) = ak·e^(-βk) / Z(β)

Page 107: Adaptive annealing:  a near-optimal  connection between  sampling and counting

estimator for Z(β')/Z(β)

Z(β) = Σ(k=0..n) ak·e^(-βk)

For W ~ μβ we have P(H(W)=k) = ak·e^(-βk) / Z(β)

For U ~ μβ' we have P(H(U)=k) = ak·e^(-β'k) / Z(β')

If H(X)=k is likely at both temperatures → estimator


Page 109: Adaptive annealing:  a near-optimal  connection between  sampling and counting

estimator for Z(β')/Z(β)

For W ~ μβ we have P(H(W)=k) = ak·e^(-βk) / Z(β)

For U ~ μβ' we have P(H(U)=k) = ak·e^(-β'k) / Z(β')

(P(H(U)=k) / P(H(W)=k)) · e^(k(β'-β)) = Z(β)/Z(β')

Page 110: Adaptive annealing:  a near-optimal  connection between  sampling and counting

estimator for Z(β')/Z(β)

For W ~ μβ we have P(H(W)=k) = ak·e^(-βk) / Z(β)

For U ~ μβ' we have P(H(U)=k) = ak·e^(-β'k) / Z(β')

(P(H(U)=k) / P(H(W)=k)) · e^(k(β'-β)) = Z(β)/Z(β')

PROBLEM: P(H(W)=k) can be too small

Page 111: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Rough estimator for Z(β')/Z(β): use an interval instead of a single value

Z(β) = Σ(k=0..n) ak·e^(-βk)

For W ~ μβ we have P(H(W)∈[c,d]) = Σ(k=c..d) ak·e^(-βk) / Z(β)

For U ~ μβ' we have P(H(U)∈[c,d]) = Σ(k=c..d) ak·e^(-β'k) / Z(β')

Page 112: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Rough estimator for Z(β')/Z(β)

If |β'-β|·|d-c| ≤ 1 then

e^(c(β'-β)) · (P(H(U)∈[c,d]) / P(H(W)∈[c,d])) estimates Z(β)/Z(β') within a factor of e:

(Σ(k=c..d) ak·e^(-β'k)) · e^(c(β'-β)) = Σ(k=c..d) ak·e^(-β'(k-c))·e^(-βc), and each term is within a factor e of ak·e^(-β(k-c))·e^(-βc).

We also need P(H(U)∈[c,d]) and P(H(W)∈[c,d]) to be large.

Page 113: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Split {0,1,...,n} into h ≤ 4·(ln n)·(ln A) intervals [0], [1], [2], ..., [c, c(1+1/ln A)], ...

We will show: for any inverse temperature β there exists an interval I with P(H(W)∈I) ≥ 1/(8h).

We say that I is HEAVY for β.


Page 115: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Algorithm: repeat

find an interval I which is heavy for the current inverse temperature β

see how far I stays heavy (until some β*)

use the interval I for the feeler: Z(β')/Z(β) and Z(2β'-β)/Z(β')

ANALYSIS: in each iteration we
* either make progress,
* or eliminate the interval I,
* or make a “long move”

Page 116: Adaptive annealing:  a near-optimal  connection between  sampling and counting

[figure: distribution of H(X) for X ~ μβ; I = a heavy interval at β]

Page 117: Adaptive annealing:  a near-optimal  connection between  sampling and counting

[figure: distribution of H(X) at a larger inverse temperature; I = a heavy interval at β, but no longer heavy here!]

Page 118: Adaptive annealing:  a near-optimal  connection between  sampling and counting

[figure: distribution of H(X) for X ~ μβ’; I = a heavy interval at β, still heavy at β’, not heavy beyond]

Page 119: Adaptive annealing:  a near-optimal  connection between  sampling and counting

[figure: I is heavy on a range of inverse temperatures, NOT heavy beyond β*]

use binary search to find β* (up to additive precision 1/(2n): find β** with β* ≤ β** ≤ β* + 1/(2n))

I = [a,b],   δ = min{ 1/(b-a), ln A }

Page 120: Adaptive annealing:  a near-optimal  connection between  sampling and counting

[figure: I is heavy on a range of inverse temperatures, NOT heavy beyond β*; binary search finds β* ≤ β** ≤ β* + 1/(2n);  I = [a,b],  δ = min{ 1/(b-a), ln A }]

How do you know that you can use binary search?

Page 121: Adaptive annealing:  a near-optimal  connection between  sampling and counting

How do you know that you can use binary search?

I is h-heavy at β  ⇔  P(H(X)∈I) ≥ 1/(8h) for X ~ μβ  ⇔  Σ(k∈I) ak·e^(-βk) ≥ (1/(8h)) · Σ(k=0..n) ak·e^(-βk)

Lemma: the set of inverse temperatures for which I is h-heavy is an interval.

Page 122: Adaptive annealing:  a near-optimal  connection between  sampling and counting

How do you know that you can use binary search?

Σ(k∈I) ak·e^(-βk) ≥ (1/(8h)) · Σ(k=0..n) ak·e^(-βk)

Substitute x = e^(-β); the difference of the two sides is a polynomial c0·x⁰ + c1·x¹ + c2·x² + ... + cn·xⁿ.

Descartes’ rule of signs: number of positive roots ≤ number of sign changes (e.g. the pattern + - + + has 2 sign changes).

Page 123: Adaptive annealing:  a near-optimal  connection between  sampling and counting

How do you know that you can use binary search?

Σ(k∈I) ak·e^(-βk) ≥ (1/(8h)) · Σ(k=0..n) ak·e^(-βk)

Descartes’ rule of signs (x = e^(-β)): number of positive roots ≤ number of sign changes.

Examples: -1 + x + x² + x³ + ... + xⁿ has one sign change; 1 + x + x² ≥ 0 has none.

Page 124: Adaptive annealing:  a near-optimal  connection between  sampling and counting

How do you know that you can use binary search?

Σ(k∈I) ak·e^(-βk) ≥ (1/(8h)) · Σ(k=0..n) ak·e^(-βk)

The coefficient sequence c0·x⁰ + c1·x¹ + ... + cn·xⁿ (x = e^(-β)) has one sign change, so by Descartes’ rule of signs there is at most one positive root: the set of β where I is h-heavy is an interval.

Page 125: Adaptive annealing:  a near-optimal  connection between  sampling and counting

[figure: I is heavy on [β, β*], NOT heavy beyond β* + 1/(2n)]

I = [a,b]: can roughly compute the ratio Z(β)/Z(β') for β' ∈ [β, β*] if |β-β'|·|b-a| ≤ 1

Page 126: Adaptive annealing:  a near-optimal  connection between  sampling and counting

[figure: I is heavy on [β, β*], NOT heavy beyond β* + 1/(2n)]

I = [a,b]: can roughly compute the ratio Z(β)/Z(β') for β' ∈ [β, β*] if |β-β'|·|b-a| ≤ 1

find the largest β' such that Z(β)·Z(2β'-β) / Z(β')² ≤ C:

1. success (make progress)

2. eliminate interval

3. long move

Page 127: Adaptive annealing:  a near-optimal  connection between  sampling and counting
Page 128: Adaptive annealing:  a near-optimal  connection between  sampling and counting

if we have sampler oracles for μβ, then we can get an adaptive schedule of length t = O*( (ln A)^(1/2) )

independent sets O*(n²) (using Vigoda’01, Dyer-Greenhill’01)

matchings O*(n²m) (using Jerrum, Sinclair’89)

spin systems: Ising model O*(n²) for β < βC (using Marinelli, Olivieri’95)

k-colorings O*(n²) for k > 2Δ (using Jerrum’95)

Page 129: Adaptive annealing:  a near-optimal  connection between  sampling and counting

1. Counting problems

2. Basic tools: Chernoff, Chebyshev

3. Dealing with large quantities (the product method)

4. Statistical physics

5. Cooling schedules (our work)

6. More...

Outline

Page 130: Adaptive annealing:  a near-optimal  connection between  sampling and counting

6. More… a) proof of Dyer-Frieze

b) independent sets revisited

c) warm starts

Outline

Page 131: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Appendix – proof of:

Theorem (Dyer-Frieze’91): given X1, ..., Xt with

1) E[X1 X2 ... Xt] = “WANTED”

2) V[Xi]/E[Xi]² = O(1), and the Xi easy to estimate,

O(t²/ε²) samples (O(t/ε²) from each Xi) give a (1±ε) estimator of “WANTED” with probability ≥ 3/4.

Page 132: Adaptive annealing:  a near-optimal  connection between  sampling and counting

How precise do the Xi have to be? First attempt – term by term:

Main idea:  (1 ± ε/t)·(1 ± ε/t)·...·(1 ± ε/t) ≈ 1 ± ε

so estimate each term to precision ε/t, using n ≥ (1/(ε/t)²) · (V[X]/E[X]²) · ln(1/δ) samples:

each term Θ(t²/ε²) samples ⇒ Θ(t³/ε²) total

Page 133: Adaptive annealing:  a near-optimal  connection between  sampling and counting

How precise do the Xi have to be? Analyzing the SCV is better (Dyer-Frieze’1991):

P( X gives a (1±ε)-estimate ) ≥ 1 - (1/ε²) · V[X]/E[X]²   (V[X]/E[X]² = squared coefficient of variation, SCV)

GOAL: SCV(X) ≤ ε²/4 for X = X1 X2 ... Xt

Page 134: Adaptive annealing:  a near-optimal  connection between  sampling and counting

How precise do the Xi have to be? Analyzing the SCV is better (Dyer-Frieze’1991)

SCV(X) = V[X]/E[X]² = E[X²]/E[X]² - 1

Main idea:

SCV(X) = (1+SCV(X1)) · ... · (1+SCV(Xt)) - 1

SCV(Xi) = O(ε²/t)  ⇒  SCV(X) < ε²/4

Page 135: Adaptive annealing:  a near-optimal  connection between  sampling and counting

How precise do the Xi have to be?

Analyzing SCV is better (Dyer-Frieze’1991)

SCV(X) = V[X]/E[X]² = E[X²]/E[X]² − 1

Main idea: SCV(Xi) ≤ ε²/(5t) for every i ⇒ SCV(X) < ε²/4

proof: SCV(X) = (1+SCV(X1)) ... (1+SCV(Xt)) − 1

X1, X2 independent ⇒ E[X1 X2] = E[X1] E[X2]

X1, X2 independent ⇒ X1², X2² independent

X1, X2 independent ⇒ SCV(X1 X2) = (1+SCV(X1))(1+SCV(X2)) − 1
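The product rule in the proof can be checked exactly on two small hand-picked discrete distributions (the values and probabilities below are arbitrary):

```python
from itertools import product

def moments(dist):
    # dist: list of (value, prob) pairs; returns (E[X], E[X^2])
    m1 = sum(v * p for v, p in dist)
    m2 = sum(v * v * p for v, p in dist)
    return m1, m2

def scv(dist):
    m1, m2 = moments(dist)
    return m2 / m1**2 - 1          # SCV = E[X^2]/E[X]^2 - 1

X1 = [(1.0, 0.5), (3.0, 0.5)]
X2 = [(2.0, 0.25), (4.0, 0.75)]

# exact distribution of the product of independent X1, X2
prod_dist = [(v1 * v2, p1 * p2) for (v1, p1), (v2, p2) in product(X1, X2)]

lhs = scv(prod_dist)
rhs = (1 + scv(X1)) * (1 + scv(X2)) - 1
```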

Page 136: Adaptive annealing:  a near-optimal  connection between  sampling and counting

How precise do the Xi have to be?

Analyzing SCV is better (Dyer-Frieze’1991)

X = X1 X2 ... Xt

Main idea: SCV(Xi) ≤ ε²/(5t) for every i ⇒ SCV(X) < ε²/4

each term ≈ (t/ε²) samples ⇒ (t²/ε²) total

Page 137: Adaptive annealing:  a near-optimal  connection between  sampling and counting

6. More… a) proof of Dyer-Frieze

b) independent sets revisited

c) warm starts

Outline

Page 138: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Hamiltonian (figure: example configurations with H = 0, 1, 2, 4)

Page 139: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Hamiltonian – many possibilities

(figure: configurations with H = 0, 1, 2, ... for the hardcore lattice gas model)

Page 140: Adaptive annealing:  a near-optimal  connection between  sampling and counting

What would be a natural hamiltonian for planar graphs?

Page 141: Adaptive annealing:  a near-optimal  connection between  sampling and counting

What would be a natural hamiltonian for planar graphs?

H(G) = number of edges

natural MC: pick u,v uniformly at random

if {u,v} ∉ G: with probability λ/(1+λ) try G + {u,v}

if {u,v} ∈ G: with probability 1/(1+λ) try G − {u,v}

Page 142: Adaptive annealing:  a near-optimal  connection between  sampling and counting

natural MC: pick u,v uniformly at random

if {u,v} ∉ G: with probability λ/(1+λ) try G + {u,v}

if {u,v} ∈ G: with probability 1/(1+λ) try G − {u,v}

G → G’ = G + {u,v}:  P(G,G’) = λ / ( (1+λ) n(n−1)/2 )

G’ → G:  P(G’,G) = 1 / ( (1+λ) n(n−1)/2 )

Page 143: Adaptive annealing:  a near-optimal  connection between  sampling and counting

P(G,G’) = λ / ( (1+λ) n(n−1)/2 )   for G’ = G + {u,v}

P(G’,G) = 1 / ( (1+λ) n(n−1)/2 )

π(G) ∝ λ^(number of edges of G) satisfies the detailed balance condition

π(G) P(G,G’) = π(G’) P(G’,G)

(λ = exp(−β))
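The detailed balance claim can be verified exhaustively on a tiny instance. Every graph on 4 vertices is planar, so the chain below (a sketch of the add/remove dynamics, with graphs encoded as bitmasks over the 6 vertex pairs) is exactly the planar-graph chain for n = 4:

```python
from itertools import combinations

lam = 0.7                                 # lam = exp(-beta); any positive value works
pairs = list(combinations(range(4), 2))   # 6 vertex pairs; every 4-vertex graph is planar
states = range(1 << len(pairs))           # graphs as bitmasks over the 6 pairs

def weight(g):
    # pi(G) proportional to lam^{number of edges}
    return lam ** bin(g).count("1")

def P(g, h):
    # transition probability of the add/remove chain between adjacent graphs
    diff = g ^ h
    if bin(diff).count("1") != 1:
        return 0.0
    if h & diff:                          # edge added going g -> h
        return (1 / len(pairs)) * lam / (1 + lam)
    return (1 / len(pairs)) * 1 / (1 + lam)   # edge removed

ok = all(abs(weight(g) * P(g, g ^ (1 << i))
             - weight(g ^ (1 << i)) * P(g ^ (1 << i), g)) < 1e-12
         for g in states for i in range(len(pairs)))
```

Since π(G) P(G,G’) = π(G’) P(G’,G) for every pair of adjacent graphs, λ^{|E(G)|}/Z is stationary for the chain.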

Page 144: Adaptive annealing:  a near-optimal  connection between  sampling and counting

6. More… a) proof of Dyer-Frieze

b) independent sets revisited

c) warm starts

Outline

Page 145: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Mixing time: τmix = smallest t such that |μt − π|TV ≤ 1/e

Relaxation time: τrel = 1/(1−λ2)

τrel ≤ τmix ≤ τrel ln(1/πmin)

example: τmix = Θ(n ln n) vs τrel = Θ(n) (figure: n=3)

(discrepancy may be substantially bigger for, e.g., matchings)

Page 146: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Mixing time: τmix = smallest t such that |μt − π|TV ≤ 1/e

Relaxation time: τrel = 1/(1−λ2)

Estimating π(S)

Y = 1 if X ∈ S, 0 otherwise

E[Y] = π(S)

METHOD 1: independent samples X1, X2, X3, ..., Xs (each from a separate run of ≈ τmix steps)

Page 147: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Mixing time: τmix = smallest t such that |μt − π|TV ≤ 1/e

Relaxation time: τrel = 1/(1−λ2)

Estimating π(S)

Y = 1 if X ∈ S, 0 otherwise

E[Y] = π(S)

METHOD 1: independent samples X1, X2, X3, ..., Xs (each from a separate run of ≈ τmix steps)

METHOD 2 (Gillman’98, Kahale’96, ...): average Y over consecutive states X1 X2 X3 ... Xs of a single run

Page 148: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Mixing time: τmix = smallest t such that |μt − π|TV ≤ 1/e

Relaxation time: τrel = 1/(1−λ2)

Further speed-up

METHOD 2 (Gillman’98, Kahale’96, ...): average Y over consecutive states X1 X2 X3 ... Xs of a single run

|μt − π|TV ≤ exp(−t/τrel) Varπ(μ0/π)^{1/2}

μ0 with ( Σx π(x)(μ0(x)/π(x) − 1)² )^{1/2} small is called a warm start
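A numeric check of the warm-start bound on a small reversible chain (a lazy walk on a 5-vertex path; the construction is illustrative, not from the talk): starting from a point mass, the TV distance stays below exp(−t/τrel) · Varπ(μ0/π)^{1/2} at every step:

```python
import numpy as np

# lazy random walk on a 5-vertex path: reversible, all eigenvalues in [0, 1]
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W /= W.sum(axis=1, keepdims=True)
P = (np.eye(n) + W) / 2

pi = np.array([1, 2, 2, 2, 1], dtype=float) / 8        # stationary (proportional to degree)

# relaxation time from the spectrum of the pi-symmetrized kernel
D = np.diag(np.sqrt(pi))
eigs = np.sort(np.linalg.eigvalsh(D @ P @ np.linalg.inv(D)))
tau_rel = 1.0 / (1.0 - eigs[-2])

mu0 = np.zeros(n)
mu0[0] = 1.0                                            # cold start: point mass in a corner
chi0 = np.sqrt(np.sum(pi * (mu0 / pi - 1.0) ** 2))      # Var_pi(mu0/pi)^{1/2}

results = []
mu = mu0.copy()
for t in range(1, 31):
    mu = mu @ P                                         # one step of the chain
    tv = 0.5 * np.abs(mu - pi).sum()
    results.append((tv, np.exp(-t / tau_rel) * chi0))
```

From a warm start chi0 is O(1) instead of ≈ (1/πmin)^{1/2}, which is exactly where the speed-up comes from.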

Page 149: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Mixing time: τmix = smallest t such that |μt − π|TV ≤ 1/e

Relaxation time: τrel = 1/(1−λ2)

Further speed-up

METHOD 2 (Gillman’98, Kahale’96, ...): average Y over consecutive states X1 X2 X3 ... Xs of a single run

|μt − π|TV ≤ exp(−t/τrel) Varπ(μ0/π)^{1/2}

μ0 with ( Σx π(x)(μ0(x)/π(x) − 1)² )^{1/2} small is called a warm start

a sample at β can be used as a warm start for β’

⇐ cooling schedule can step from β’ to β

Page 150: Adaptive annealing:  a near-optimal  connection between  sampling and counting

a sample at β can be used as a warm start for β’

⇐ cooling schedule can step from β’ to β

β0 → β1 → β2 → β3 → .... → βm = “well mixed” states

m = O( (ln n)(ln A) )

Page 151: Adaptive annealing:  a near-optimal  connection between  sampling and counting

β0 → β1 → β2 → β3 → .... → βm = “well mixed” states

METHOD 2: X1 X2 X3 ... Xs

run our cooling-schedule algorithm with METHOD 2, using the “well mixed” states as starting points

Page 152: Adaptive annealing:  a near-optimal  connection between  sampling and counting

Output of our algorithm: β0, β1, ..., βk with k = O*( (ln A)^{1/2} )

small augmentation (so that we can use a sample at the current β as a warm start at the next)

still O*( (ln A)^{1/2} ):  β0 → β1 → β2 → β3 → .... → βm

Use an analogue of Dyer-Frieze for independent samples from vector variables with slightly dependent coordinates.

Page 153: Adaptive annealing:  a near-optimal  connection between  sampling and counting

if we have sampler oracles for μβ ∝ exp(−β H(x))

then we can get an adaptive schedule of length t = O*( (ln A)^{1/2} )

independent sets O*(n²) (using Vigoda’01, Dyer-Greenhill’01)

matchings O*(n²m) (using Jerrum, Sinclair’89)

spin systems: Ising model O*(n²) for β < βC (using Marinelli, Olivieri’95)

k-colorings O*(n²) for k > 2Δ (using Jerrum’95)