

Lower Bounds for Exact Model Counting and Applications in Probabilistic Databases

Paul Beame, Jerry Li, Sudeepa Roy, Dan Suciu

University of Washington

Model Counting

• Model Counting Problem: Given a Boolean formula F, compute #F = #models (satisfying assignments) of F

e.g. F = (x ∨ y) ∧ (x ∨ u ∨ w) ∧ (¬x ∨ u ∨ ¬w ∨ z); count the assignments to x, y, u, w, z that make F true

• Probability Computation Problem: Given F and independent probabilities Pr(x), Pr(y), Pr(z), …, compute Pr(F)
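Both problems can be stated operationally with a tiny brute-force sketch (Python, illustration only — not one of the algorithms discussed in these slides; the formula is the reconstructed running example):

```python
from itertools import product

# Brute-force model counting and probability computation for
# F = (x ∨ y) ∧ (x ∨ u ∨ w) ∧ (¬x ∨ u ∨ ¬w ∨ z).
VARS = ["x", "y", "u", "w", "z"]

def F(a):
    return ((a["x"] or a["y"])
            and (a["x"] or a["u"] or a["w"])
            and (not a["x"] or a["u"] or not a["w"] or a["z"]))

def model_count(formula, variables):
    # #F: enumerate all 2^n assignments and count the satisfying ones.
    return sum(formula(dict(zip(variables, bits)))
               for bits in product([False, True], repeat=len(variables)))

def probability(formula, variables, pr):
    # Pr(F): weight each satisfying assignment by the product of the
    # independent variable probabilities.
    total = 0.0
    for bits in product([False, True], repeat=len(variables)):
        a = dict(zip(variables, bits))
        if formula(a):
            weight = 1.0
            for v in variables:
                weight *= pr[v] if a[v] else 1.0 - pr[v]
            total += weight
    return total

print(model_count(F, VARS))                          # 20 of the 32 assignments
print(probability(F, VARS, {v: 0.5 for v in VARS}))  # 5/8 = 0.625
```

This exhaustive enumeration is exponential in the number of variables, which is exactly why the practical counters below work so hard to avoid it.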

Model Counting

• #P-hard
▫ Even for formulas where satisfiability is easy to check

• Applications in probabilistic inference
▫ e.g. Bayesian net learning

• There are many practical model counters that can compute both #F and Pr(F)

Exact Model Counters

• Search-based / DPLL-based (explore the assignment space and count the satisfying assignments): CDP [Birnbaum et al. '99], Relsat [Bayardo Jr. et al. '97, '00], Cachet [Sang et al. '05], SharpSAT [Thurley '06]

• Knowledge compilation-based (compile F into a "computation-friendly" form): c2d [Darwiche '04], Dsharp [Muise et al. '12]

[Survey by Gomes et al. '09]

Both techniques explicitly or implicitly
• use DPLL-based algorithms
• produce FBDD or Decision-DNNF compiled forms (as output or trace) [Huang-Darwiche '05, '07]

Model Counters Use Extensions to DPLL

• Caching subformulas
▫ Cachet, SharpSAT, c2d, Dsharp
• Component analysis
▫ Relsat, c2d, Cachet, SharpSAT, Dsharp
• Conflict-directed clause learning
▫ Cachet, SharpSAT, c2d, Dsharp

• DPLL + caching (+ clause learning) → FBDD
• DPLL + caching + component analysis (+ clause learning) → Decision-DNNF

How much more does component analysis add? i.e., how much more powerful are decision-DNNFs than FBDDs?

Main Result

Theorem:
• Any decision-DNNF of size N can be converted into an FBDD of size at most N^(log N + 1)
• If the formula is a k-DNF, the FBDD has size at most N^k
• The conversion algorithm runs in time linear in the size of its output

Consequence: Running Time Lower Bounds

Model counting algorithm running time ≥ compiled form size

Lower bound on compiled form size ⇒ lower bound on running time

▫ Note: the running time may be much larger than the size
▫ e.g. an unsatisfiable CNF formula has a trivial compiled form

Consequence: Running Time Lower Bounds

Our quasipolynomial conversion + known exponential lower bounds on FBDDs [Bollig-Wegener '00, Wegener '02]
⇒ exponential lower bounds on decision-DNNF size
⇒ exponential lower bounds on the running time of exact model counters

Outline

• Review of DPLL-based algorithms
▫ Extensions (Caching & Component Analysis)
▫ Knowledge Compilation (FBDD & Decision-DNNF)
• Our Contributions
▫ Decision-DNNF to FBDD conversion
▫ Implications of the conversion
▫ Applications to Probabilistic Databases
• Conclusions

DPLL Algorithms

Davis, Putnam, Logemann, Loveland [Davis et al. '60, '62]

F: (x ∨ y) ∧ (x ∨ u ∨ w) ∧ (¬x ∨ u ∨ ¬w ∨ z)

[Figure: DPLL search tree branching on x, y, u, w, z, with residual formulas such as y ∧ (u ∨ w) and u ∨ ¬w ∨ z, and subtree probabilities 3/8 and 7/8 combining to Pr(F) = 5/8]

Assume the uniform distribution for simplicity.

// basic DPLL:
Function Pr(F):
  if F = false then return 0
  if F = true then return 1
  select a variable x, return
    ½ Pr(F|x=0) + ½ Pr(F|x=1)
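The pseudocode above can be rendered directly in Python (a sketch, not the authors' code; the clause representation is assumed — a formula is a list of clauses, a clause a list of (variable, polarity) literals):

```python
# Basic DPLL probability computation under the uniform distribution.
# ("x", False) means the literal ¬x.
def restrict(clauses, var, value):
    # Substitute var := value: drop satisfied clauses, shrink the rest.
    out = []
    for clause in clauses:
        satisfied = False
        remaining = []
        for v, polarity in clause:
            if v == var:
                if polarity == value:
                    satisfied = True
                    break
            else:
                remaining.append((v, polarity))
        if not satisfied:
            out.append(remaining)
    return out

def pr_dpll(clauses):
    if any(len(c) == 0 for c in clauses):    # an empty clause: F = false
        return 0.0
    if not clauses:                          # no clauses left: F = true
        return 1.0
    x = clauses[0][0][0]                     # select a variable
    return 0.5 * pr_dpll(restrict(clauses, x, False)) + \
           0.5 * pr_dpll(restrict(clauses, x, True))

F = [[("x", True), ("y", True)],
     [("x", True), ("u", True), ("w", True)],
     [("x", False), ("u", True), ("w", False), ("z", True)]]
print(pr_dpll(F))   # 0.625 = 5/8, matching the trace
```

Without caching or component analysis this explores the full decision tree, as in the figure.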

DPLL Algorithms

F: (x ∨ y) ∧ (x ∨ u ∨ w) ∧ (¬x ∨ u ∨ ¬w ∨ z)

[Figure: the same DPLL search tree, annotated with the probabilities ½, ¾, 3/8, 7/8 at internal nodes and Pr(F) = 5/8 at the root]

The trace is a decision tree for F.

Extensions to DPLL

• Caching subformulas
• Component analysis
• Conflict-directed clause learning
▫ Affects the efficiency of the algorithm, but not the final "form" of the trace

Traces of
• DPLL + caching (+ clause learning) → FBDD
• DPLL + caching + component analysis (+ clause learning) → Decision-DNNF

Caching

F: (x ∨ y) ∧ (x ∨ u ∨ w) ∧ (¬x ∨ u ∨ ¬w ∨ z)

[Figure: the DPLL search tree, with repeated residual subformulas (e.g. on w) that caching lets the algorithm reuse]

// basic DPLL:
Function Pr(F):
  if F = false then return 0
  if F = true then return 1
  select a variable x, return
    ½ Pr(F|x=0) + ½ Pr(F|x=1)

// DPLL with caching:
Cache F and Pr(F); look it up before computing
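Caching can be sketched by memoizing on a canonical form of the residual formula (Python, illustration only; using a frozenset of clauses as the cache key is my choice, not the slides'):

```python
# DPLL with caching: memoize Pr on a canonical form of the residual
# formula, so a repeated subformula is computed only once.
def restrict(clauses, var, value):
    out = []
    for clause in clauses:
        satisfied = False
        remaining = []
        for v, polarity in clause:
            if v == var:
                if polarity == value:
                    satisfied = True
                    break
            else:
                remaining.append((v, polarity))
        if not satisfied:
            out.append(remaining)
    return out

def pr_cached(clauses, cache=None):
    if cache is None:
        cache = {}
    key = frozenset(frozenset(c) for c in clauses)   # canonical form
    if key in cache:                                 # look it up first
        return cache[key]
    if any(len(c) == 0 for c in clauses):
        result = 0.0
    elif not clauses:
        result = 1.0
    else:
        x = clauses[0][0][0]
        result = 0.5 * pr_cached(restrict(clauses, x, False), cache) + \
                 0.5 * pr_cached(restrict(clauses, x, True), cache)
    cache[key] = result
    return result

F = [[("x", True), ("y", True)],
     [("x", True), ("u", True), ("w", True)],
     [("x", False), ("u", True), ("w", False), ("z", True)]]
print(pr_cached(F))   # 0.625, the same answer with shared work
```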

Caching & FBDDs

F: (x ∨ y) ∧ (x ∨ u ∨ w) ∧ (¬x ∨ u ∨ ¬w ∨ z)

[Figure: the DPLL trace with shared subtrees merged into a DAG]

The trace is a decision-DAG for F: an FBDD (Free Binary Decision Diagram), also called an ROBP (Read-Once Branching Program)

• Every variable is tested at most once on any path
• All internal nodes are decision nodes
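This is why compiled forms are "computation-friendly": once F is an FBDD, Pr(F) is a single bottom-up pass, linear in the FBDD size. A sketch (the node encoding is assumed, and the FBDD for the running example is hand-built here, not produced by the slides' algorithm):

```python
# Evaluate Pr(F) over an FBDD in one bottom-up pass.
# Nodes: ("leaf", bool) or ("dec", var, lo_id, hi_id).
def fbdd_prob(nodes, root, pr):
    memo = {}
    def go(i):
        if i in memo:
            return memo[i]
        node = nodes[i]
        if node[0] == "leaf":
            p = 1.0 if node[1] else 0.0
        else:
            _, var, lo, hi = node
            # Condition on var: Pr = Pr(var=0)·Pr(lo) + Pr(var=1)·Pr(hi)
            p = (1.0 - pr[var]) * go(lo) + pr[var] * go(hi)
        memo[i] = p
        return p
    return go(root)

# A hand-built FBDD for F = (x∨y) ∧ (x∨u∨w) ∧ (¬x∨u∨¬w∨z):
NODES = {
    0: ("leaf", False),
    1: ("leaf", True),
    2: ("dec", "z", 0, 1),
    3: ("dec", "w", 1, 2),   # x=1 branch: u ∨ ¬w ∨ z
    4: ("dec", "u", 3, 1),
    5: ("dec", "w", 0, 1),   # x=0 branch: y ∧ (u ∨ w)
    6: ("dec", "u", 5, 1),
    7: ("dec", "y", 0, 6),
    8: ("dec", "x", 7, 4),   # root
}
uniform = {v: 0.5 for v in "xyuwz"}
print(fbdd_prob(NODES, 8, uniform))   # 0.625 = 5/8
```

The intermediate values reproduce the trace's annotations: the x=0 subtree evaluates to 3/8 and the x=1 subtree to 7/8.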

Component Analysis

F: (x ∨ y) ∧ (x ∨ u ∨ w) ∧ (¬x ∨ u ∨ ¬w ∨ z)

[Figure: at x = 0 the residual formula y ∧ (u ∨ w) splits into the components y and (u ∨ w) over disjoint variables]

// basic DPLL:
Function Pr(F):
  if F = false then return 0
  if F = true then return 1
  select a variable x, return
    ½ Pr(F|x=0) + ½ Pr(F|x=1)

// DPLL with component analysis (and caching):
if F = G ∧ H where G and H have disjoint sets of variables,
  Pr(F) = Pr(G) × Pr(H)
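The decomposition step can be sketched by splitting the residual clause set into variable-disjoint components and multiplying (Python, illustration only):

```python
# DPLL with component analysis: split the residual clause set into
# variable-disjoint components and multiply their probabilities.
def restrict(clauses, var, value):
    out = []
    for clause in clauses:
        satisfied = False
        remaining = []
        for v, polarity in clause:
            if v == var:
                if polarity == value:
                    satisfied = True
                    break
            else:
                remaining.append((v, polarity))
        if not satisfied:
            out.append(remaining)
    return out

def split_components(clauses):
    # Group clauses that (transitively) share variables.
    comps = []   # list of (clause_list, variable_set)
    for clause in clauses:
        merged, merged_vars = [clause], {v for v, _ in clause}
        rest = []
        for comp, comp_vars in comps:
            if comp_vars & merged_vars:
                merged += comp
                merged_vars |= comp_vars
            else:
                rest.append((comp, comp_vars))
        comps = rest + [(merged, merged_vars)]
    return [comp for comp, _ in comps]

def pr_comp(clauses):
    if any(len(c) == 0 for c in clauses):
        return 0.0
    if not clauses:
        return 1.0
    comps = split_components(clauses)
    if len(comps) > 1:                   # F = G ∧ H with disjoint variables
        result = 1.0
        for comp in comps:
            result *= pr_comp(comp)      # Pr(F) = Pr(G) × Pr(H)
        return result
    x = clauses[0][0][0]
    return 0.5 * pr_comp(restrict(clauses, x, False)) + \
           0.5 * pr_comp(restrict(clauses, x, True))

F = [[("x", True), ("y", True)],
     [("x", True), ("u", True), ("w", True)],
     [("x", False), ("u", True), ("w", False), ("z", True)]]
print(pr_comp(F))   # 0.625; at x=0 the residual splits into y and (u ∨ w)
```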

Components & Decision-DNNF

F: (x ∨ y) ∧ (x ∨ u ∨ w) ∧ (¬x ∨ u ∨ ¬w ∨ z)

[Figure: the trace as a DAG whose x = 0 branch is an AND node with variable-disjoint children y and (u ∨ w)]

The trace is a Decision-DNNF [Huang-Darwiche '05, '07]:
FBDD + "decomposable" AND nodes (the two sub-DAGs of an AND node do not share variables)

How much power do they add?

Main Technical Result

Decision-DNNF → FBDD, with an efficient construction:
• size N → size N^(log N + 1) (quasipolynomial)
• size N → size N^k (polynomial) if the formula is a k-DNF, e.g. a 3-DNF: (x ∧ y ∧ z) ∨ (w ∧ y ∧ z)

Outline

• Review of DPLL algorithms
▫ Extensions (Caching & Component Analysis)
▫ Knowledge Compilation (FBDDs & Decision-DNNF)
• Our Contributions
▫ Decision-DNNF to FBDD conversion
▫ Implications of the conversion
▫ Applications to Probabilistic Databases
• Conclusions

Decision-DNNF → FBDD

Need to convert all AND nodes to decision nodes while still evaluating the same formula F.

A Simple Idea

[Figure: an AND node with children G and H becomes sequential tests: G's 1-sink is rewired to the root of H, and both 0-sinks go to 0]

G and H do not share variables, so every variable is still tested at most once on any path; the result is an FBDD.

But, what if sub-DAGs are shared?

[Figure: in the decision-DNNF, two AND nodes share sub-DAGs G and H through edges g′ and h; sequencing G before H for one AND node and H before G for the other creates a conflict at the shared nodes]

Obvious Solution: Replicate Nodes

[Figure: the shared sub-DAGs G and H are duplicated so that each AND node gets its own copy]

No conflict: apply the simple idea. But this may require recursive replication, which can cause exponential blowup!

Main Idea: Replicate the Smaller Sub-DAG

[Figure: an AND node with a smaller sub-DAG and a larger sub-DAG, with edges coming in from other nodes in the decision-DNNF]

Each AND node creates a private copy of its smaller sub-DAG.

Light and Heavy Edges

[Figure: the AND edge into the smaller sub-DAG is a light edge; the edge into the larger sub-DAG is a heavy edge]

Each AND node creates a private copy of its smaller sub-DAG:
⇒ recursively, a node u is replicated once for each time it lies inside a smaller sub-DAG
⇒ #copies of u = #sequences of light edges leading to u

Quasipolynomial Conversion

L = max #light edges on any path

L ≤ log N, since sizes at least double walking back along light edges: N = N_small + N_big ≥ 2·N_small ≥ … ≥ 2^L

#Copies of each node ≤ N^L ≤ N^(log N)

#Nodes in FBDD ≤ N · N^(log N) = N^(log N + 1)

We also show that our analysis is tight.
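The counting above can be checked numerically (a sketch of the arithmetic only, not of the proof; N = 1024 is an arbitrary example size):

```python
import math

# Along any path, following a light edge enters a sub-DAG of at most
# half the current size (N = N_small + N_big >= 2 * N_small), so after
# L light edges the size has halved L times: 2**L <= N, i.e. L <= log2 N.
N = 1024
L = int(math.log2(N))                   # max light edges on any path: 10
copies_per_node = N ** L                # <= N^(log N) copies of any node
fbdd_size_bound = N * copies_per_node   # <= N^(log N + 1) nodes in the FBDD
print(L, fbdd_size_bound == N ** (L + 1))
```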

Polynomial Conversion for k-DNFs

• L = max #light edges on any path ≤ k − 1
• #Nodes in FBDD ≤ N · N^L = N^k

Outline

• Review of DPLL algorithms
▫ Extensions (Caching & Component Analysis)
▫ Knowledge Compilation (FBDDs & Decision-DNNF)
• Our Contributions
▫ Decision-DNNF to FBDD conversion
▫ Implications of the conversion
▫ Applications to Probabilistic Databases
• Conclusions

Separation Results

• FBDD: decision-DAG; each variable is tested at most once along any path
• Decision-DNNF: FBDD + decomposable AND nodes (variable-disjoint sub-DAGs)
• AND-FBDD: FBDD + AND nodes (not necessarily decomposable) [Wegener '00]
• d-DNNF: decomposable AND nodes + OR nodes whose sub-DAGs are not simultaneously satisfiable [Darwiche '01, Darwiche-Marquis '02]

Exponential separation: there are formulas with poly-size AND-FBDDs or d-DNNFs but with exponential lower bounds on decision-DNNF size.

Outline

• Review of DPLL algorithms
▫ Extensions (Caching & Component Analysis)
▫ Knowledge Compilation (FBDDs & Decision-DNNF)
• Our Contributions
▫ Decision-DNNF to FBDD conversion
▫ Implications of the conversion
▫ Applications to Probabilistic Databases
• Conclusions

Probabilistic Databases

AsthmaPatient: Ann (x1, Pr = 0.3), Bob (x2, Pr = 0.1)
Friend: (Ann, Joe) (y1, Pr = 0.9), (Ann, Tom) (y2, Pr = 0.5), (Bob, Tom) (y3, Pr = 0.7)
Smoker: Joe (z1, Pr = 0.5), Tom (z2, Pr = 1.0)

Boolean query Q: ∃x ∃y AsthmaPatient(x) ∧ Friend(x, y) ∧ Smoker(y)

• Tuples are probabilistic (and independent)
▫ "Ann" is present with probability 0.3, i.e. Pr(x1) = 0.3
• What is the probability that Q is true on D?
▫ Assign unique variables to tuples
• Boolean formula F_Q,D = (x1 ∧ y1 ∧ z1) ∨ (x1 ∧ y2 ∧ z2) ∨ (x2 ∧ y3 ∧ z2)
▫ Q is true on D ⇔ F_Q,D is true

Probabilistic Databases

• F_Q,D = (x1 ∧ y1 ∧ z1) ∨ (x1 ∧ y2 ∧ z2) ∨ (x2 ∧ y3 ∧ z2)

• Probability Computation Problem: compute Pr(F_Q,D) given Pr(x1), Pr(x2), …

• F_Q,D can be written as a k-DNF
▫ for fixed, monotone queries Q

For an important class of queries Q, we get exponential lower bounds on decision-DNNFs, and hence on model counting algorithms.
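Pr(F_Q,D) can be computed by brute force over the lineage DNF (Python sketch; the exact tuple-to-probability mapping below is my reading of the slide's figure and should be treated as an assumption):

```python
from itertools import product

# Brute-force Pr(F_Q,D) for the lineage formula
# F_Q,D = (x1∧y1∧z1) ∨ (x1∧y2∧z2) ∨ (x2∧y3∧z2).
# Tuple probabilities assumed from the figure:
PR = {"x1": 0.3, "x2": 0.1,             # AsthmaPatient: Ann, Bob
      "y1": 0.9, "y2": 0.5, "y3": 0.7,  # Friend: (Ann,Joe), (Ann,Tom), (Bob,Tom)
      "z1": 0.5, "z2": 1.0}             # Smoker: Joe, Tom

def F(a):
    return ((a["x1"] and a["y1"] and a["z1"])
            or (a["x1"] and a["y2"] and a["z2"])
            or (a["x2"] and a["y3"] and a["z2"]))

variables = sorted(PR)
answer = 0.0
for bits in product([False, True], repeat=len(variables)):
    a = dict(zip(variables, bits))
    if F(a):
        weight = 1.0
        for v in variables:
            weight *= PR[v] if a[v] else 1.0 - PR[v]
        answer += weight
print(answer)
```

This enumeration is exponential in the number of tuples; the lower bounds above say that for some queries, no DPLL-style exact counter does fundamentally better.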

Outline

• Review of DPLL algorithms
▫ Extensions (Caching & Component Analysis)
▫ Knowledge Compilation (FBDDs & Decision-DNNF)
• Our Contributions
▫ Decision-DNNF to FBDD conversion
▫ Implications of the conversion
▫ Applications to Probabilistic Databases
• Conclusions

Summary

• Quasipolynomial conversion of any decision-DNNF into an FBDD (polynomial for k-DNFs)
• Exponential lower bounds on model counting algorithms
• d-DNNFs and AND-FBDDs are exponentially more powerful than decision-DNNFs
• Applications in probabilistic databases

Open Problems

• A polynomial conversion of decision-DNNFs to FBDDs?
• A more powerful syntactic subclass of d-DNNFs than decision-DNNFs?
▫ d-DNNF is a semantic concept
▫ There is no efficient algorithm to test whether two sub-DAGs of an OR node are simultaneously satisfiable
• Approximate model counting?


Thank You

Questions?
