queries with difference on probabilistic databases sanjeev khanna sudeepa roy val tannen university...

1

Queries with Difference on Probabilistic Databases

Sanjeev KhannaSudeepa RoyVal Tannen

University of Pennsylvania

2

Probabilistic Databases

• To model and query uncertain data (sensor networks, information extraction…)

• Possible worlds model– Each possible world W is a standard database

instance, has a probability P[W]– Compact representation D assuming independence

D

a1

a2

a3

a3

b1

b1

b2

b3

0.1

0.5

0.2

0.1

a1

a2

a3

0.3

0.4

0.6

b1

b2

b3

0.7

0.8

0.4

RS

T

3

Query Semantics

• Query Semantics on probabilistic databases:– Apply the query q on each possible world W– Add up the probabilities of the worlds that give

the same query answer A P[q(D) = A] = ∑W : q(W) = A P[W]

• Goal: Efficiently evaluate P[q(D) = A]– Data complexity; want time polynomial in n = |D|

• Can we always efficiently compute P[q(D)]?– NO, in general it is #P-hard

4

b1

b2

b3

u1

u2

u3

0.7

0.8

0.4

b1

b2

b3

0.7

0.8

0.4

a1

a2

a3

a3

b1

b1

b2

b3

v1

v2

v3

v4

0.1

0.5

0.2

0.1

a1

a2

a3

a3

b1

b1

b2

b3

0.1

0.5

0.2

0.1

a1

a2

a3

w1

w2

w3

0.3

0.4

0.6

a1

a2

a3

0.3

0.4

0.6

Introduce event variables for tuples (P[w1] = 0.3, …)

Step 1: Boolean provenance for q(D) [FR ’97, Z ’97] f = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3

Step 2: Compute P[q(D)] = P[f]given P[w1] = 0.3, P[v1] = 0.4, …

Probability

Event variables

Boolean query q():-R(x),S(x, y),T(y)

easy

hard

Query Answering in Two Steps

DR

ST

5

Probability Computation for Positive Queries

• Dichotomy Result [DS ’04, ’07; DSS ’10] Given q as input, we can efficiently decide if q is

– Safe: Safe plans run in poly-time on all instances, or,

– Unsafe: #P-hard, e.g. q() :- R(x) S(x, y) T(y)

• Instance-by-instance approach [SDG ’10, RPT ’11]– Both q and D are given as input – Poly-time algorithm to compute P[q(D)] for special

cases even if q is unsafe

What about queries with difference?

Boolean Provenances for Difference

c1

c1

c2

c3

a1

a2

a3

a2

v1

v2

v3

v4

a1

a2

a3

w1

w2

w3

R T

6

q1(x):- R(x, y), S(y, z)

b1

b2

b1

c1

c2

c3

u1

u2

u3

q2(x):- R(x, y), S(y, z), T(z)b1

b2

u1(v1 + v2) + u3v4

u2v3

b1

b2

u1v1w1 + u1v2w2 + u3v4w2

u2v3w3

b1

b2

(u1(v1 + v2) + u3v4) . (u1v1w1 + u1v2w2 + u3v4w2)

(u2v3) . (u2v3w3)

q = q1 – q2

S

7

Previous Work on Difference

FOR ’11– Framework for exact and approximate

probability computation – But, no guarantee of polynomial running time

In fact, we show in this paper that with difference,

in some cases no approximation exists (unless NP = RP)

How far can we go with difference in poly-time?

8

A Quick Comparison

With difference

• DNF of boolean provenance may be exponential in n

• P[q(D)] may not be approximable

Without difference

• DNF of boolean provenance is poly-size (n|q|)

• P[q(D)] is always approximable (FPRAS)

FPRAS: Fully Polynomial Randomized Approx. Scheme Compute with prob. ≥ ¾ in time polynomial in n, 1/ε

p [(1-ε) P[q(D)], (1+ε) P[q(D)]

9

Our Results

• We study queries of the form q1 – q2 and their generalization

– FPRAS: If q1 is any UCQ, q2 is any safe CQ-

– #P-hardness: Even if both q1 and q2 are safe CQ-

– Inapproximability: Even if q1 is the trivial TRUE query and q2 is a UCQ

• Our FPRAS result extends to a larger class of queries of which q1 – q2 is a special case

[CQ- : Conjunctive queries without self-joins]

10

Difference Rank• Define difference rank (q) of query q recursively

– (R) = 0

– (q1 - q2) = (q1) + (q2) + 1• R – S : rank 1

– (q1 ⋈ q2) = (q1) + (q2)

• (R – S1) ⋈ (R - S2) : rank 2

• (R - T1) ⋈ T2 : rank 1

– (q1 q2) = max ((q1), (q2))

• (R – S1) ⋈ (R - S2) (R - T1) ⋈ T2 : rank 2

– Select, project: rank remains the same

11

FPRAS for queries q with (q) = 1given some conditions hold

(inapproximable for (q) = 1 in general)

12

Steps in FPRAS

• Step 1: Compute boolean provenance of q[D] for any query q with (q) = 1

• Step 2: Write the boolean provenance in a “Probability Friendly Form” (if possible)

• Step 3: FPRAS inspired by Karp-Luby framework

13

Boolean Provenance for Queries q s.t. (q) = 1

Lemma:For any q with (q) = 1, on any D, the provenance

f of q(D) has form

f is poly-size in n = |D|, poly-time computable

...))(()).(( 4321 DNFDNFDNFDNFf

14

Probability Friendly Form (PFF)

If f is in PFF, we can approximate P[f] using Karp-Luby Framework

...))(()).(( 4321 DNFDNFDNFDNFf

...))(()).(( 42 dDNNFbcddDNNFabcf

f is in PFF, if the negated DNF-s can be written in poly-size d-DNNFs (next slide)

15

d-DNNFDarwiche ’01, ’02, DM ’02deterministic - Decomposable Negation Normal

Form

No internal node can have negation

At most one child of a +-node is satisfiable

Children of a .-node do not share variables

+

+1v

1u

1v

2v 2v

In general, can be a DAG

Probability can be computed in linear time


16

Karp-Luby Framework

[KL ’83] Given boolean expression DAGs F1, …, Fm

f = F1 + F2 + ... + Fm

P[f] can be computed in poly-time (in m, n)

if in poly-time, i(1) P[Fi] can be computed

(2) it can be checked if a given assignment satisfies Fi

(3) a random satisfying assignment of Fi can be sampled

Well-studied special case: DNF counting, where F1, …, Fm are DNF minterms: f = xyz + xyw + wuv

17

Conditions (1) and (2) hold for PFF

Product of minterm and d-DNNF is another d-DNNF

w2=1, z1=1

+

+1v

1v

2v 2v

121 zwu+

+1v

1u

1v

2v 2v


18

Condition (3) also holdsLemma: Generating a random satisfying assignment on a d-DNNF can be done in poly-time

+

+1v

1v

2v 2v

1. Process in reverse topological order

2. Generate a random satisfying assignment bottom up

v2 = 1 v2 = 0

v1 = 0

v1 = 1v2 = 0

v1 = 0, v2 = 0

v1 = 1, v2 = 0

At random

19

Expressibility in PFF

So, if f is in PFF, we can approximate P[q(D)]

But, can we decide in poly-time if some sub-expressions of a boolean expression have poly-size d-DNNFs?

• Not known • But, there are natural sufficient conditions that can be

verified in poly-time

– If certain sub-queries are safe and hence generate read-once expressions [OH ’08]

– If sub-queries generate poly-size OBDDs [JS ’11]

– Extends to instance-by-instance approach (both q, D given)

20

#P-hardness for q1 - q2

both q1, q2 are safe CQ-

21

#P-hardness: Steps in the proof

“Hard” query q = q1 – q2

– q1() := R1(x, y1) R2(x, y2) R3(x, y3) R4(x, y4)

– q2() := R1(x1, y) R2(x2, y) R3(x3, y) R4(x4, y)

Counting independent sets in 3-regular bipartite graphs (XZ ’06)

Counting edge covers in bipartite graphs of degree ≤ 4, where the edge set can be partitioned into 4 disjoint matchings

22

Other Related Work

– Semantics of probabilistic query answering• Fuhr-Rollecke ’97, Zimanyi ‘97

– Dichotomy of CQ- ,CQ and UCQ queries• Dalvi-Suciu ’04, ’07, Dalvi-Schnaitter-Suciu ’10

– Knowledge compilation techniques• Olteanu-Huang ’08, Jha-Olteanu-Suciu ’10, Jha-Suciu ’11, Fink-Olteanu ’11

– Instance-by-instance approach• Sen-Deshpande-Getoor ’10, Roy-Perduca-Tannen ’11

23

Conclusions and Future work

A step towards understanding complexity of exact and approximate computation for queries withdifference operations

Future work– Dichotomy results that classify syntactically difference

queries (similar to positive UCQ)?

– Extending FPRAS to queries with difference rank > 1?

– Experimental evaluation of our algorithms

24

Thank you

Questions?

queries with difference on probabilistic databases sanjeev khanna sudeepa roy val tannen university...

Documents