Scrubbing Query Results from Probabilistic Databases

Jianwen Chen, Ling Feng, Wenwei Xue

A skeleton of scrubbing probabilistic database query results

Three probabilistic relation examples

Query 1: look for the year(s) where at least one movie was liked by people from northern regions

The user gets the following answer from the probabilistic database:

User: Where is the probability derived from?

System: It is based on two assumptions: Pr(x4) = 0.9 and Pr(x5) = 0.2.

User: I think the movie with MovieID = 4 is not actually liked by people from northern regions. Pr(x4) should be 0.1, not 0.9!

System: The new probability is 0.28!

How to identify the top-k uncertain assumptions for user clarification?

How to recompute the probability?

Pr(ee) = Pr(x4 ∨ x5) = Pr(x4) + Pr(x5) – Pr(x4) * Pr(x5) = 0.9 + 0.2 – 0.9 * 0.2 = 0.92
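The arithmetic for Query 1 can be checked with a minimal Python sketch. It assumes x4 and x5 are independent; the function name `pr_or` is illustrative and does not come from the slides.

```python
def pr_or(p_a: float, p_b: float) -> float:
    """Probability of (a OR b) for two independent events."""
    return p_a + p_b - p_a * p_b


original = pr_or(0.9, 0.2)   # system's assumptions: Pr(x4) = 0.9, Pr(x5) = 0.2
corrected = pr_or(0.1, 0.2)  # after the user lowers Pr(x4) to 0.1
print(round(original, 2), round(corrected, 2))  # prints: 0.92 0.28
```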

∂Pr(ee)/∂Pr(x5) = 1 – Pr(x4) = 1 – 0.9 = 0.1

∂Pr(ee)/∂Pr(x4) = 1 – Pr(x5) = 1 – 0.2 = 0.8

Top-k assumptions

Pr(ee) = Pr(x4 ∨ x5) = Pr(x4) + Pr(x5) – Pr(x4) * Pr(x5) = 0.1 + 0.2 – 0.1 * 0.2 = 0.28

EventID   Prob.   Rate
x4        0.9     0.8
x5        0.2     0.1

(Pr(x4) after the user's correction: 0.1)
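The Rate values for x4 and x5 can be reproduced with a short sketch. It assumes that the rate is the partial derivative of Pr(x4 ∨ x5) with respect to each event probability, that the events are independent, and that assumptions are ranked by the absolute value of that rate. The helper names `rates_for_or` and `top_k` are hypothetical.

```python
def rates_for_or(probs: dict[str, float]) -> dict[str, float]:
    """d Pr(e1 OR ... OR en) / d Pr(ei) for independent events.

    Pr(OR) = 1 - prod(1 - p_j), so the derivative with respect to p_i
    is the product of (1 - p_j) over all j != i.
    """
    rates = {}
    for name in probs:
        rate = 1.0
        for other, p in probs.items():
            if other != name:
                rate *= 1.0 - p
        rates[name] = rate
    return rates


def top_k(rates: dict[str, float], k: int) -> list[str]:
    """Events ordered by decreasing absolute rate of change (assumed ranking)."""
    return sorted(rates, key=lambda e: abs(rates[e]), reverse=True)[:k]


rates = rates_for_or({"x4": 0.9, "x5": 0.2})
print({name: round(r, 2) for name, r in rates.items()})
print(top_k(rates, 1))  # prints: ['x4']
```

Under these assumptions x4 ranks first, which matches the dialogue in which the user is asked about Pr(x4).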

Basic algorithm to compute top-k assumptions

For an event expression ee, to compute its probability Pr(ee), one can first convert it into an equivalent disjunctive normal form and then apply the inclusion-exclusion formula.

Disjunctive normal form: ee = C1 ∨ C2 ∨ … ∨ Cm

where C1 = e11 ∧ e12 ∧ … ∧ e1 s1, C2 = e21 ∧ e22 ∧ … ∧ e2 s2, ..., Cm = em1 ∧ em2 ∧ … ∧ em sm, with m ≥ 1 and s1, s2, …, sm ≥ 1

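Inclusion-exclusion formula:

$$
\Pr(ee) = \sum_{i} \Pr(C_i) - \sum_{i<j} \Pr(C_i \wedge C_j) + \sum_{i<j<k} \Pr(C_i \wedge C_j \wedge C_k) - \dots + (-1)^{m-1} \Pr(C_1 \wedge C_2 \wedge \dots \wedge C_m)
$$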
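The inclusion-exclusion computation over a disjunctive normal form can be sketched as follows. This is an illustration, not the authors' implementation: it assumes independent basic events, represents each literal as a hypothetical `(event_name, is_positive)` pair, and uses the illustrative function names `pr_conjunction` and `pr_dnf`.

```python
from itertools import combinations

# A literal is (event_name, is_positive); a conjunct is a list of literals.
Literal = tuple[str, bool]


def pr_conjunction(literals: set[Literal], probs: dict[str, float]) -> float:
    """Probability of a conjunction of literals over independent basic events."""
    names = [name for name, _ in literals]
    if len(names) != len(set(names)):  # both e and ~e are present
        return 0.0
    result = 1.0
    for name, positive in literals:
        result *= probs[name] if positive else 1.0 - probs[name]
    return result


def pr_dnf(conjuncts: list[list[Literal]], probs: dict[str, float]) -> float:
    """Inclusion-exclusion over all non-empty subsets of the m conjuncts."""
    total = 0.0
    for size in range(1, len(conjuncts) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(conjuncts, size):
            merged = set().union(*subset)  # literals of C_i AND C_j AND ...
            total += sign * pr_conjunction(merged, probs)
    return total


# Query 1: ee = x4 OR x5, i.e. two conjuncts with one literal each.
query1 = [[("x4", True)], [("x5", True)]]
print(round(pr_dnf(query1, {"x4": 0.9, "x5": 0.2}), 2))  # prints: 0.92
```

The outer loops visit every non-empty subset of conjuncts, so the number of terms grows exponentially with m.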

Basic algorithm to compute top-k assumptions

To compute ∂Pr(ee)/∂Pr(ei), one can rewrite Pr(ee) as

Pr(ee) = α * Pr(ei) + β

where α and β are two sub-expressions that do not depend on Pr(ei), and ∂Pr(ee)/∂Pr(ei) = α.

The time complexity is O(2^m), where m is the number of conjuncts in the disjunctive normal form of ee.
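One way to obtain α and β numerically is sketched below. It relies on the assumption that the basic events are independent, in which case Pr(ee) is linear in each individual Pr(ei): evaluating Pr(ee) with Pr(ei) set to 0 gives β, and the difference between the values at 1 and at 0 gives α. The names `alpha_beta`, `pr_query1`, and `Probs` are illustrative only.

```python
from typing import Callable

Probs = dict[str, float]


def alpha_beta(pr: Callable[[Probs], float], probs: Probs, event: str) -> tuple[float, float]:
    """Return (alpha, beta) such that pr(probs) == alpha * probs[event] + beta."""
    beta = pr({**probs, event: 0.0})           # value when the event never holds
    alpha = pr({**probs, event: 1.0}) - beta   # slope with respect to Pr(event)
    return alpha, beta


# Example: ee = x4 OR x5 from Query 1.
def pr_query1(p: Probs) -> float:
    return p["x4"] + p["x5"] - p["x4"] * p["x5"]


probs = {"x4": 0.9, "x5": 0.2}
alpha, beta = alpha_beta(pr_query1, probs, "x4")
print(round(alpha, 2), round(beta, 2))  # prints: 0.8 0.2
```

Here α is the rate of change of Pr(ee) with respect to Pr(x4), and α * 0.9 + β recovers 0.92.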

Optimization

We restrict the event expression ee to the case where the basic events e1, e2, …, en are independent and none of them occurs more than once in ee. This form can be obtained for most queries (80% of the TPC-H queries) by using the well-researched optimization technique adopted in:

Dalvi, N., Suciu, D.: Efficient query evaluation on probabilistic databases. VLDB Journal 16(4) (2007) 523–544

Three probabilistic relation examples

Query 2: look for the year(s) where at least one movie was liked by people from northern regions but not by people from southern regions

The user gets the following answer from the uncertain database:

ee = (e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)

Pr(e1) = 0.2, Pr(e2) = 0.7, Pr(e3) = 0.1, Pr(e4) = 0.9, Pr(e5) = 0.7, Pr(e6) = 0.2

Pr(ee)?

Pr(~ee) = 1 – Pr(ee)

Pr(ee1 ∧ ee2) = Pr(ee1) * Pr(ee2)

Pr(ee1 ∨ ee2) = Pr(ee1) + Pr(ee2) – Pr(ee1) * Pr(ee2)

Pr(ee)=f(Pr(e1),Pr(e2),…,Pr(e6))

∂Pr(ee)/∂Pr(e1), ∂Pr(ee)/∂Pr(e2), ..., ∂Pr(ee)/∂Pr(e6)

Expression tree of (e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6), with leaf probabilities Pr(e1) = 0.2, Pr(e2) = 0.7, Pr(e3) = 0.1, Pr(e4) = 0.9, Pr(e5) = 0.7, Pr(e6) = 0.2

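Negation node (~e2): Pr(ee(N)) = 1 – Pr(ee(leftChild(N))) = 1 – 0.7 = 0.3

Conjunction node (e1 ∧ ~e2): Pr(ee(N)) = Pr(ee(leftChild(N))) * Pr(ee(rightChild(N))) = 0.2 * 0.3 = 0.06

Disjunction node ((e1 ∧ ~e2) ∨ (e3 ∧ ~e4)): Pr(ee(N)) = Pr(ee(leftChild(N))) + Pr(ee(rightChild(N))) – Pr(ee(leftChild(N))) * Pr(ee(rightChild(N))) = 0.06 + 0.01 – 0.06 * 0.01 = 0.0694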
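The bottom-up evaluation over the expression tree can be sketched in Python as below. The node classes (`Leaf`, `Not`, `And`, `Or`), the helper `clause`, and the left-nested shape of the disjunction are assumptions made for illustration; the rules applied at each node are the three probability rules for negation, conjunction, and disjunction, which are valid when the basic events are independent and no event occurs twice.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    name: str


@dataclass
class Not:
    child: "Node"


@dataclass
class And:
    left: "Node"
    right: "Node"


@dataclass
class Or:
    left: "Node"
    right: "Node"


Node = Union[Leaf, Not, And, Or]


def pr(node: Node, probs: dict[str, float]) -> float:
    """Evaluate the probability of the sub-expression rooted at node."""
    if isinstance(node, Leaf):
        return probs[node.name]
    if isinstance(node, Not):
        return 1.0 - pr(node.child, probs)
    left, right = pr(node.left, probs), pr(node.right, probs)
    if isinstance(node, And):
        return left * right
    return left + right - left * right  # Or node


def clause(a: str, b: str) -> Node:
    """Build the sub-tree for (a AND NOT b)."""
    return And(Leaf(a), Not(Leaf(b)))


ee = Or(Or(clause("e1", "e2"), clause("e3", "e4")), clause("e5", "e6"))
probs = {"e1": 0.2, "e2": 0.7, "e3": 0.1, "e4": 0.9, "e5": 0.7, "e6": 0.2}
print(round(pr(ee, probs), 6))  # prints: 0.590536
```

Each node is visited once, so the evaluation is linear in the size of the tree rather than exponential in the number of conjuncts.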


Second Optimization

(e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)

top-2 assumptions

Scrub the query result

Recompute Pr((e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)) with the modified Pr(e2) and Pr(e5)

Performance Study

Conclusion
