Scrubbing Query Results from Probabilistic Databases
Jianwen Chen, Ling Feng, Wenwei Xue
A skeleton of scrubbing probabilistic database query results
Three probabilistic relation examples
Query 1: look for the year(s) where at least one movie was liked by people from northern regions
The user gets the following answer from the probabilistic database:
User: Where is the probability derived from?
System: It is based on two assumptions: Pr(x4) = 0.9 and Pr(x5) = 0.2
User: I think the movie with MovieID = 4 is not actually liked by people from northern regions. Pr(x4) should be 0.1, not 0.9!
System: The new probability is 0.28!
How to identify the top-k uncertain assumptions for user clarification?
How to recompute the probability?
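Pr(ee) = Pr(x4 ∨ x5) = Pr(x4) + Pr(x5) – Pr(x4) * Pr(x5) = 0.9 + 0.2 – 0.9 * 0.2 = 0.92
$$\frac{\partial \Pr(ee)}{\partial \Pr(x_4)} = 1 - \Pr(x_5) = 1 - 0.2 = 0.8$$
$$\frac{\partial \Pr(ee)}{\partial \Pr(x_5)} = 1 - \Pr(x_4) = 1 - 0.9 = 0.1$$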
Top-k assumptions
Pr(ee) = Pr(x4 ∨ x5) = Pr(x4) + Pr(x5) – Pr(x4) * Pr(x5) = 0.1 + 0.2 – 0.1 * 0.2 = 0.28
EventID   Prob.   Rate
x4        0.9     0.8
x5        0.2     0.1
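The Query 1 numbers can be reproduced with a few lines of Python. This is only a minimal sketch, assuming that x4 and x5 are independent and that the "Rate" column is the partial derivative of Pr(ee) with respect to each assumption; the names `pr_or`, `rates`, and `ranked` are illustrative and do not come from the slides.

```python
def pr_or(p_a, p_b):
    """Probability of (a OR b) for two independent events."""
    return p_a + p_b - p_a * p_b

p_x4, p_x5 = 0.9, 0.2

# Answer probability before scrubbing: Pr(x4 OR x5).
original = pr_or(p_x4, p_x5)

# Rate of change of Pr(ee) with respect to each assumption
# (assumed here to be the "Rate" column of the table).
rates = {
    "x4": 1 - p_x5,  # dPr(ee)/dPr(x4)
    "x5": 1 - p_x4,  # dPr(ee)/dPr(x5)
}

# Rank assumptions by the magnitude of their rate, largest first.
ranked = sorted(rates.items(), key=lambda kv: abs(kv[1]), reverse=True)

# The user corrects Pr(x4) from 0.9 to 0.1; recompute the answer.
scrubbed = pr_or(0.1, p_x5)

print(round(original, 2), ranked[0][0], round(scrubbed, 2))
```

Under these assumptions x4 ranks first, which matches the dialogue in which the user is asked about Pr(x4).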
Basic algorithm to compute top-k assumptions
For an event expression ee, to compute its probability Pr(ee), one can first convert it into an equivalent disjunctive normal form, and then apply the inclusion-exclusion formula.
Disjunctive normal form: ee = C1 ∨ C2 ∨ … ∨ Cm
where C1 = e11 ∧ e12 ∧ … ∧ e1s1, C2 = e21 ∧ e22 ∧ … ∧ e2s2, ..., Cm = em1 ∧ em2 ∧ … ∧ emsm, with m ≥ 1 and s1, s2, …, sm ≥ 1
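Inclusion-exclusion formula:
$$\Pr(ee) = \Pr(C_1 \vee C_2 \vee \dots \vee C_m) = \sum_{i} \Pr(C_i) - \sum_{i<j} \Pr(C_i \wedge C_j) + \sum_{i<j<k} \Pr(C_i \wedge C_j \wedge C_k) - \dots + (-1)^{m-1} \Pr(C_1 \wedge C_2 \wedge \dots \wedge C_m)$$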
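The inclusion-exclusion computation over a disjunctive normal form can be sketched as follows. The sketch makes two simplifying assumptions that are not stated on the slide: basic events are independent, and each conjunct contains only positive literals, so the conjunction of several conjuncts is just the union of their event sets. The function name `pr_dnf` and the data layout are hypothetical.

```python
from itertools import combinations
from math import prod

def pr_dnf(conjuncts, pr):
    """Pr(C1 OR ... OR Cm) by inclusion-exclusion.

    conjuncts: list of sets of basic-event names (positive literals only).
    pr:        dict mapping each basic-event name to its probability.
    Assumes the basic events are mutually independent.
    """
    total = 0.0
    for size in range(1, len(conjuncts) + 1):
        sign = 1 if size % 2 == 1 else -1  # +, -, +, ... by subset size
        for subset in combinations(conjuncts, size):
            # AND of the chosen conjuncts = union of their basic events.
            events = set().union(*subset)
            total += sign * prod(pr[e] for e in events)
    return total

# Query 1: ee = x4 OR x5, i.e. two single-event conjuncts.
query1 = pr_dnf([{"x4"}, {"x5"}], {"x4": 0.9, "x5": 0.2})
print(round(query1, 2))
```

The loop visits every non-empty subset of the m conjuncts, which is where the exponential cost of the basic algorithm comes from.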
Basic algorithm to compute top-k assumptions
To compute $\partial \Pr(ee) / \partial \Pr(e_i)$, one can rewrite Pr(ee) as
Pr(ee) = α * Pr(ei) + β
where α and β are two sub-expressions that do not depend on Pr(ei), and
$$\frac{\partial \Pr(ee)}{\partial \Pr(e_i)} = \alpha$$
The time complexity is O(2^m), where m is the number of conjuncts in the disjunctive normal form of ee.
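One way to obtain α and β without symbolic manipulation is to evaluate Pr(ee) twice, once with Pr(ei) set to 0 and once with it set to 1: since Pr(ee) = α * Pr(ei) + β, the first evaluation gives β and the difference gives α. The sketch below assumes independent basic events and a function `f` that returns the exact Pr(ee) for a given probability assignment; `alpha_beta` and `pr_ee` are hypothetical names, and this is not necessarily how the authors compute α.

```python
def alpha_beta(f, pr, event):
    """Return (alpha, beta) such that f = alpha * Pr(event) + beta.

    f:     function taking a dict of basic-event probabilities and
           returning the exact Pr(ee) (assumed linear in each entry).
    pr:    dict of current basic-event probabilities.
    event: name of the basic event e_i.
    """
    at_one = f({**pr, event: 1.0})   # alpha + beta
    at_zero = f({**pr, event: 0.0})  # beta
    return at_one - at_zero, at_zero

def pr_ee(pr):
    """Query 1: Pr(x4 OR x5) for independent x4 and x5."""
    return pr["x4"] + pr["x5"] - pr["x4"] * pr["x5"]

probs = {"x4": 0.9, "x5": 0.2}
alpha, beta = alpha_beta(pr_ee, probs, "x4")
print(round(alpha, 2), round(beta, 2))
```

For Query 1 this gives α = 0.8 for x4, the same value as the rate listed for x4, and α * 0.9 + β reproduces Pr(ee) = 0.92.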
Optimization
We restrict the event expression ee to the case where the basic events e1, e2, …, en are independent and do not occur repeatedly in ee. Most queries (80% of the TPC/H queries) can be brought into this form by using the well-researched optimization technique adopted in:
Dalvi, N., Suciu, D.: Efficient query evaluation on probabilistic databases. VLDB Journal 16(4) (2007) 523–544
Three probabilistic relation examples
Query 2: look for the year(s) where at least one movie was liked by people from northern regions but not by people from southern regions
The user gets the following answer from the uncertain database:
ee = (e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)
Pr(e1)=0.2, Pr(e2)=0.7, Pr(e3)=0.1, Pr(e4)=0.9, Pr(e5)=0.7, Pr(e6)=0.2
Pr(ee)?
Pr(~ee) = 1 –Pr(ee)
Pr(ee1 ∧ ee2) = Pr(ee1) * Pr(ee2)
Pr(ee1 ∨ ee2) = Pr(ee1) + Pr(ee2) – Pr(ee1) * Pr(ee2)
Pr(ee)=f(Pr(e1),Pr(e2),…,Pr(e6))
$$\frac{\partial \Pr(ee)}{\partial \Pr(e_1)},\ \frac{\partial \Pr(ee)}{\partial \Pr(e_2)},\ \dots,\ \frac{\partial \Pr(ee)}{\partial \Pr(e_6)}$$
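When basic events are independent and none is repeated, all six partial derivatives can be obtained by applying the chain rule down the expression tree. The following is a sketch of that idea under those assumptions, not a transcription of the authors' algorithm; the tuple encoding of the tree and the names `evaluate` and `gradient` are hypothetical. For brevity it re-evaluates subtrees instead of caching their values.

```python
def evaluate(node, pr):
    """Probability of the sub-expression rooted at node."""
    op = node[0]
    if op == "var":
        return pr[node[1]]
    if op == "not":
        return 1 - evaluate(node[1], pr)
    a, b = evaluate(node[1], pr), evaluate(node[2], pr)
    return a * b if op == "and" else a + b - a * b

def gradient(node, pr, upstream=1.0, grads=None):
    """Accumulate dPr(root)/dPr(e) for every basic event e."""
    grads = {} if grads is None else grads
    op = node[0]
    if op == "var":
        grads[node[1]] = grads.get(node[1], 0.0) + upstream
    elif op == "not":
        gradient(node[1], pr, -upstream, grads)   # d(1 - a)/da = -1
    else:
        a, b = evaluate(node[1], pr), evaluate(node[2], pr)
        if op == "and":
            da, db = b, a                          # d(a*b)
        else:
            da, db = 1 - b, 1 - a                  # d(a + b - a*b)
        gradient(node[1], pr, upstream * da, grads)
        gradient(node[2], pr, upstream * db, grads)
    return grads

def conj(pos, neg):
    """Build the sub-tree for (pos AND NOT neg)."""
    return ("and", ("var", pos), ("not", ("var", neg)))

# ee = (e1 AND ~e2) OR (e3 AND ~e4) OR (e5 AND ~e6)
tree = ("or", ("or", conj("e1", "e2"), conj("e3", "e4")), conj("e5", "e6"))
pr = {"e1": 0.2, "e2": 0.7, "e3": 0.1, "e4": 0.9, "e5": 0.7, "e6": 0.2}

grads = gradient(tree, pr)
for name in sorted(grads):
    print(name, round(grads[name], 5))
```

A negative derivative means that raising the probability of that basic event lowers Pr(ee), as is the case for the negated events e2, e4, and e6.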
ee = (e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)
Pr(e1)=0.2
Pr(e2)=0.7
Pr(e3)=0.1
Pr(e4)=0.9
Pr(e5)=0.7
Pr(e6)=0.2
Evaluating the expression tree bottom-up, node by node:
~e2: Pr(ee(N)) = 1 - Pr(ee(leftChild(N))) = 1 - 0.7 = 0.3
e1 ∧ ~e2: Pr(ee(N)) = Pr(ee(leftChild(N))) * Pr(ee(rightChild(N))) = 0.2 * 0.3 = 0.06
~e4: Pr(ee(N)) = 1 - Pr(ee(leftChild(N))) = 1 - 0.9 = 0.1
e3 ∧ ~e4: Pr(ee(N)) = Pr(ee(leftChild(N))) * Pr(ee(rightChild(N))) = 0.1 * 0.1 = 0.01
(e1 ∧ ~e2) ∨ (e3 ∧ ~e4): Pr(ee(N)) = Pr(ee(leftChild(N))) + Pr(ee(rightChild(N))) - Pr(ee(leftChild(N))) * Pr(ee(rightChild(N))) = 0.06 + 0.01 - 0.06 * 0.01 = 0.0694
~e6: Pr(ee(N)) = 1 - Pr(ee(leftChild(N))) = 1 - 0.2 = 0.8
e5 ∧ ~e6: Pr(ee(N)) = Pr(ee(leftChild(N))) * Pr(ee(rightChild(N))) = 0.7 * 0.8 = 0.56
Root: Pr(ee(N)) = Pr(ee(leftChild(N))) + Pr(ee(rightChild(N))) - Pr(ee(leftChild(N))) * Pr(ee(rightChild(N))) = 0.0694 + 0.56 - 0.0694 * 0.56 = 0.591
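The node-by-node evaluation for Query 2 can be traced with a short post-order traversal. This is a sketch that assumes independent, non-repeated basic events and applies the three rules for ~, ∧, and ∨ given earlier; the tuple encoding of the tree and the names `trace` and `conj` are illustrative only.

```python
def trace(node, pr, steps):
    """Evaluate node bottom-up, recording (operator, value) per inner node."""
    op = node[0]
    if op == "var":
        return pr[node[1]]
    if op == "not":
        value = 1 - trace(node[1], pr, steps)
    else:
        left = trace(node[1], pr, steps)
        right = trace(node[2], pr, steps)
        if op == "and":
            value = left * right
        else:  # "or"
            value = left + right - left * right
    steps.append((op, value))
    return value

def conj(pos, neg):
    """Build the sub-tree for (pos AND NOT neg)."""
    return ("and", ("var", pos), ("not", ("var", neg)))

# ee = (e1 AND ~e2) OR (e3 AND ~e4) OR (e5 AND ~e6)
tree = ("or", ("or", conj("e1", "e2"), conj("e3", "e4")), conj("e5", "e6"))
pr = {"e1": 0.2, "e2": 0.7, "e3": 0.1, "e4": 0.9, "e5": 0.7, "e6": 0.2}

steps = []
result = trace(tree, pr, steps)
for op, value in steps:
    print(op, round(value, 4))
```

Each node is visited once, so the cost is linear in the size of the expression, and the final value is about 0.5905, which the slides round to 0.591.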
Second Optimization
(e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)
top-2 assumptions
Scrub the query result
Recompute Pr((e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)) with modified Pr(e2) and Pr(e5)
Performance Study
Conclusion