A*-tree: A Structure for Storage and Modeling of Uncertain Multidimensional Arrays
Presented by: ZHANG Xiaofei
March 2, 2011
Outline
• Motivation
• Modeling correlated uncertainty
• Construction of A*-tree
• Analysis of A*-tree
• Query processing
• Experiments
Motivation
• Multidimensional arrays
– Well suited to scientific and engineering applications
– Logically equivalent to relational tables
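[Figure: a two-dimensional array with dimensions D1 and D2, each cell holding <A1, A2, …, An>, next to the equivalent relational table with columns D1, D2, A1, A2, …, An]
A cell of a multidimensional array: (A1, A2, …, Ak, D1, D2, …, Dd)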
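The equivalence between an array and a relational table can be illustrated with a minimal Python sketch. The function names, the 2 x 2 shape, and the attribute values are hypothetical and chosen only for illustration; the sketch assumes two dimensions (D1, D2) and a tuple of attributes per cell.

```python
def array_to_rows(array):
    """Flatten a 2-D array of attribute tuples into rows (D1, D2, A1, ..., An)."""
    rows = []
    for d1, row in enumerate(array):
        for d2, attrs in enumerate(row):
            rows.append((d1, d2, *attrs))
    return rows

def rows_to_array(rows, shape):
    """Rebuild the 2-D array; the first two columns of each row are the dimension indices."""
    n1, n2 = shape
    array = [[None] * n2 for _ in range(n1)]
    for d1, d2, *attrs in rows:
        array[d1][d2] = tuple(attrs)
    return array

# Hypothetical 2 x 2 array, each cell holding two attributes <A1, A2>.
cells = [[(20.0, 5.0), (19.0, 5.5)],
         [(21.0, 4.8), (18.5, 5.1)]]
rows = array_to_rows(cells)
```

Converting to rows and back returns the original array, which is the sense in which the two representations carry the same information.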
Motivation (Cont’d)
• Uncertain data
– Inevitable
– Two categories
Motivation (Cont’d)
• Correlated uncertain data
– Example: geographically distributed sensors
ID   X    Y   Z    T.    H.    P.
S1   11   9   15   20*   20*   5*
S2   11   7   13   19*   19*   5.5*
Further application examples include router network traffic analysis and quantization of images or sound.
Modeling Correlated Uncertainty
• PGM: Probabilistic Graphical Model
– Bayesian network
$P(NC \mid B, E, A) = P(NC \mid A)$
$P(A \mid E, B)$, $P(A \mid E)$, $P(A \mid B)$
Limitations:
1) Requires prior knowledge and initial probabilities
2) Significant computational cost (NP-hard)
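A small Python sketch may help make the Bayesian-network idea and its cost concrete. It assumes a four-variable network over the slide's symbols B, E, A, NC in which A depends on B and E, and NC depends only on A; this structure and every probability value below are assumptions for illustration, not taken from the slides.

```python
from itertools import product

# Hypothetical conditional probability tables (all numbers are illustrative).
P_B = {True: 0.01, False: 0.99}
P_E = {True: 0.02, False: 0.98}
P_A_GIVEN_BE = {  # P(A = True | B, E)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
P_NC_GIVEN_A = {True: 0.90, False: 0.05}  # P(NC = True | A)

def bern(p_true, value):
    """Probability of a Boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, nc):
    """Factorization assumed in this sketch: P(B) P(E) P(A|B,E) P(NC|A)."""
    return (P_B[b] * P_E[e]
            * bern(P_A_GIVEN_BE[(b, e)], a)
            * bern(P_NC_GIVEN_A[a], nc))

def cond_nc(a, b=None, e=None):
    """P(NC = True | A = a [, B = b, E = e]) by brute-force enumeration."""
    num = den = 0.0
    for bb, ee, nc in product([True, False], repeat=3):
        if (b is not None and bb != b) or (e is not None and ee != e):
            continue
        p = joint(bb, ee, a, nc)
        den += p
        if nc:
            num += p
    return num / den

total = sum(joint(*v) for v in product([True, False], repeat=4))
```

Under these assumptions, `cond_nc(True, b=True, e=False)` and `cond_nc(True)` agree, so conditioning on B and E adds nothing once A is known. The brute-force loop visits every assignment, which grows exponentially with the number of variables and hints at why the computational cost is listed as a limitation.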
Modeling Correlated Uncertainty (Cont’d)
• PGM: Probabilistic Graphical Model
– Markov Random Fields
A graphical model in which a set of random variables has a Markov property described by an undirected graph
Pros: can represent cyclic dependencies
Cons: cannot represent induced dependencies
NP-hard to compute
Modeling Correlated Uncertainty (Cont’d)
• Considering the locality of correlation
– E.g. a 2-dimensional array
Construction of A*-tree
• Basic A*-structure
1) k-ary tree: k = 2^d, where d is the number of correlated dimensions
2) Each leaf contains the joint distribution of the four neighboring cells it maps to
3) The joint distribution at each internal node is recursively defined
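The shape of the basic structure can be sketched in Python as follows. This is only a structural sketch under stated assumptions: every dimension has the same power-of-two size, a leaf covers a block with side 2 (four cells when d = 2), and the `Node`, `build`, and `count` names are hypothetical. No distributions are stored here.

```python
from dataclasses import dataclass, field
from itertools import product

@dataclass
class Node:
    lo: tuple            # inclusive lower corner of the block covered by this node
    hi: tuple            # exclusive upper corner
    children: list = field(default_factory=list)

def build(lo, hi):
    """Recursively split a block into 2^d children until each side has length 2."""
    node = Node(lo, hi)
    if all(h - l <= 2 for l, h in zip(lo, hi)):
        return node  # leaf: maps to 2^d neighboring cells (four when d = 2)
    mids = [(l + h) // 2 for l, h in zip(lo, hi)]
    for choice in product((0, 1), repeat=len(lo)):
        c_lo = tuple(m if c else l for c, l, m in zip(choice, lo, mids))
        c_hi = tuple(h if c else m for c, m, h in zip(choice, mids, hi))
        node.children.append(build(c_lo, c_hi))
    return node

def count(node):
    """Return (internal_nodes, leaves) of the subtree rooted at node."""
    if not node.children:
        return 0, 1
    internal, leaves = 1, 0
    for child in node.children:
        i, l = count(child)
        internal, leaves = internal + i, leaves + l
    return internal, leaves

root = build((0, 0), (8, 8))  # hypothetical 8 x 8 array, d = 2
```

For the 8 x 8 example the root has 2^2 = 4 children, and the tree has 5 internal nodes and 16 leaves.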
Construction of A*-tree (Cont’d)
• Joint distribution at a node
X1 X2
X3 X4
Y = (X1 + X2 + X3 + X4)/4
Xi = Y(1 + Fi)
Fi has range k; the distribution table has r entries; l bits represent each probability
Table size: $\frac{r\,(3\log k + l)}{8}$ bytes
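The relation between the cell values Xi, their mean Y, and the relative deviations Fi can be checked with a short Python sketch. The function names and sample values are hypothetical, and the size expression is read here as r(3 log k + l)/8 with a base-2 logarithm, which is an assumption about the slide's notation.

```python
import math

def encode(cells):
    """Return (Y, [F1, ..., F4]) with Y the mean of the cells and Xi = Y * (1 + Fi)."""
    y = sum(cells) / len(cells)
    if y == 0:
        raise ValueError("relative deviations are undefined when the mean is zero")
    return y, [x / y - 1.0 for x in cells]

def decode(y, f):
    """Reconstruct the cell values from the mean and the relative deviations."""
    return [y * (1.0 + fi) for fi in f]

def table_bytes(r, k, l):
    """Size r * (3 * log2(k) + l) / 8; the base-2 logarithm is an assumption."""
    return r * (3 * math.log2(k) + l) / 8

y, f = encode([20.0, 19.0, 21.0, 20.0])
```

Because Y is the mean, the four deviations sum to zero, so any three of them determine the fourth. That would be consistent with the factor 3 in the size expression, although the slides do not state this reasoning explicitly.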
Construction of A*-tree (Cont’d)
• Extension of A*-tree
– Uneven dimension sizes
• A dimension of size 2k+1 is partitioned into k and k+1
• The shorter dimension stops partitioning first, while partitioning of the longer dimension continues
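The uneven-size rule can be sketched in Python as below. The stopping threshold `min_side=2` and both function names are assumptions made for this illustration; the slides only state the k / k+1 split and that the shorter dimension stops first.

```python
def split(n):
    """Split a dimension of size n into two parts; 2k+1 becomes k and k+1."""
    return n // 2, n - n // 2

def partition_sizes(shape, min_side=2):
    """List block shapes level by level; a dimension stops splitting once its side is <= min_side."""
    levels = [tuple(shape)]
    current = tuple(shape)
    while any(s > min_side for s in current):
        # Follow the larger part at each step, only to show how the sizes evolve.
        current = tuple(split(s)[1] if s > min_side else s for s in current)
        levels.append(current)
    return levels
```

For a hypothetical 9 x 3 array, `partition_sizes((9, 3))` yields the shapes (9, 3), (5, 2), (3, 2), (2, 2): the second dimension stops after one split while the first keeps being partitioned.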
Construction of A*-tree (Cont’d)
• Extension of A*-tree
– Basic uncertainty blocks of arbitrary shapes
• Each cell is intuitively the basic uncertainty block; however, this granularity may be too fine
• Initial identification of uncertainty blocks is user- and application-specified
Analysis of A*-tree
• Natural mapping from A*-tree to Bayesian Network
Analysis of A*-tree (Cont’d)
• How the A*-tree model expresses neighboring correlation
– From the perspective of any random query, the average level at which cell correlation is encoded is low (efficient inference and accurate modeling)
Analysis of A*-tree (Cont’d)
• Neighboring cells and clustering distance
– Definition
Analysis of A*-tree (Cont’d)
• Neighboring cells and clustering distance
Analysis of A*-tree (Cont’d)
• CD (Clustering Distance)
– For any query that may return q pairs of neighboring cells
Expected average CD
[Equation: E(avgCD) as a sum over levels i of a tree of height h]
e.g. for a 1024 × 1024 array, h = 10, and E(avgCD) ≈ 1.01
Analysis of A*-tree (Cont’d)
• Accuracy vs. Efficiency
– Double “flip”
– Polynomial time scan O(d*n)
– Consider basic uncertainty block
Query Processing
• Monte Carlo based query processing
– Sampling
Q: SELECT avg(brightness)
   FROM space_image
   WHERE Dis(x,y,z,322,108,251) < 50
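A Monte Carlo estimate of an AVG query could look roughly like the Python sketch below. It is not the presented algorithm: it assumes a fixed root mean, a discrete table of deviation vectors (F1, ..., Fm) at each node, and a top-down pass that sets each child to Y * (1 + Fi). The tree encoding, the function names, and all numbers are hypothetical, and the query predicate is replaced by an explicit list of selected cell positions.

```python
import random

def sample_cells(y, node, rng):
    """Top-down sampling: draw a deviation vector F at each node, then set child = Y * (1 + Fi)."""
    if node is None:          # reached a single cell
        return [y]
    tables, children = node   # tables: list of (probability, (F1, ..., Fm))
    probs = [p for p, _ in tables]
    f = rng.choices([fv for _, fv in tables], weights=probs, k=1)[0]
    cells = []
    for fi, child in zip(f, children):
        cells.extend(sample_cells(y * (1.0 + fi), child, rng))
    return cells

def monte_carlo_avg(root_y, root, selected, n_samples=1000, seed=0):
    """Estimate AVG over the selected cell positions by averaging sampled answers."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        cells = sample_cells(root_y, root, rng)
        total += sum(cells[i] for i in selected) / len(selected)
    return total / n_samples

# Hypothetical one-level tree over four cells with a two-entry distribution table.
leaf = ([(0.5, (0.1, -0.1, 0.0, 0.0)), (0.5, (-0.1, 0.1, 0.0, 0.0))], [None] * 4)
estimate = monte_carlo_avg(100.0, leaf, selected=[2, 3])
```

Each sample needs a single pass from the root to the cells, since every node is drawn only from its parent's value and its own table.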
Query Processing (Cont’d)
• Compared with MRF
– MRF requires sequential rounds of sampling
– Each sampled node is computed from all the nodes
Query Processing (Cont’d)
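• Other queries
– COUNT, AVG and SUM
• Minimum Set Cover
• Built-in cell-count function
• Effective query answering
$\mathrm{COUNT} = \sum_{i=1}^{t} c_i$
$\mathrm{SUM} = \sum_{i=1}^{t} a_i c_i$
$\mathrm{AVG} = \left( \sum_{i=1}^{t} a_i c_i \right) / \sum_{i=1}^{t} c_i$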
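The aggregate computation over a set cover can be sketched in Python as follows. It reads c_i as the number of cells under the i-th covering node and a_i as that node's average value, which is an interpretation of the slide's symbols; the function name and the sample cover are hypothetical.

```python
def aggregates(cover):
    """Compute (COUNT, SUM, AVG) from a list of (c_i, a_i) pairs, one per covering node.

    c_i is the number of cells under node i and a_i is that node's average value.
    """
    count = sum(c for c, _ in cover)
    total = sum(c * a for c, a in cover)
    avg = total / count if count else float("nan")
    return count, total, avg

# Hypothetical cover of t = 3 nodes.
count, total, avg = aggregates([(4, 20.0), (16, 18.5), (1, 22.0)])
```

Only the t covering nodes are visited, so the cost depends on the size of the cover rather than on the number of cells in the query region.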
Experiments
• Data set description
ID   X   Y   T   Temperature
• Evaluations
– Accuracy of modeling the underlying joint distribution
– Execution time
– Aggregate query
– Space cost
Experiments (Cont’d)
• Accuracy
Experiments (Cont’d)
• Accuracy
Experiments (Cont’d)
• Execution time
Experiments (Cont’d)
• Aggregate query and space cost
Thank you!
Q&A