Special Topics in Data Engineering. Panagiotis Karras. CS6234 Lecture, March 4th, 2009.
TRANSCRIPT
Outline
• Summarizing Data Streams.
• Efficient Array Partitioning: 1D Case; 2D Case.
• Hierarchical Synopses with Optimal Error Guarantees.
Summarizing Data Streams
• Approximate a sequence [d1, d2, …, dn] with B buckets si = [bi, ei, vi], so that an error metric is minimized.
• Data arrive as a stream: seen only once; cannot be stored.
• Objective functions:
  Max. abs. error: L∞(F, X) = max_i |f_i − x_i|
  Euclidean error: L2(F, X) = ( Σ_{i=1..n} (f_i − x_i)² )^(1/2)
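As a concrete check of these two objectives, here is a minimal sketch (the data and bucket values are hypothetical):

```python
# Maximum-absolute and Euclidean error between original data X
# and a bucketed approximation F.

def max_abs_error(F, X):
    # L_inf(F, X) = max_i |f_i - x_i|
    return max(abs(f - x) for f, x in zip(F, X))

def euclidean_error(F, X):
    # L_2(F, X) = sqrt(sum_i (f_i - x_i)^2)
    return sum((f - x) ** 2 for f, x in zip(F, X)) ** 0.5

X = [4, 5, 6, 2]      # original values
F = [4, 4, 4, 4]      # one bucket approximating all four by v = 4
print(max_abs_error(F, X))    # 2
print(euclidean_error(F, X))  # sqrt(0 + 1 + 4 + 4) = 3.0
```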
Histograms [KSM 2007]
• Solve the error-bounded problem.
  Maximum absolute error bound ε = 2:
  4 5 6 2 15 17 3 6 9 12 …
  [ 4 ] [ 16 ] [ 4.5 ] [ …
• Generalized to any weighted maximum-error metric.
  Each value di defines a tolerance interval [di − ε/wi, di + ε/wi].
  A bucket is closed when the running intersection of the intervals becomes empty.
  Complexity: O(n)
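A minimal sketch of this one-pass construction for the unweighted maximum-absolute-error case (function and variable names are illustrative, not from the paper):

```python
def error_bounded_buckets(data, eps):
    """One-pass greedy histogram for a max-abs error bound eps:
    keep the running intersection of the tolerance intervals
    [d - eps, d + eps]; close the bucket when it becomes empty."""
    buckets = []                      # (start, end, value) triples
    lo, hi = float('-inf'), float('inf')
    start = 0
    for i, d in enumerate(data):
        nlo, nhi = max(lo, d - eps), min(hi, d + eps)
        if nlo > nhi:                 # intersection became empty
            buckets.append((start, i - 1, (lo + hi) / 2))
            start, lo, hi = i, d - eps, d + eps
        else:
            lo, hi = nlo, nhi
    buckets.append((start, len(data) - 1, (lo + hi) / 2))
    return buckets

# The slide's example sequence with eps = 2:
print(error_bounded_buckets([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 2))
```

On the slide's sequence with ε = 2, the first three closed buckets take the representative values 4, 16 and 4.5, matching the example above.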
Histograms
• Apply to the space-bounded problem.
  Perform binary search in the domain of the error bound ε.
  Complexity: O(n log ε*)
  For an error value ε requiring space B′ ≤ B, with actual error ε′, run an optimality test:
  run the error-bounded algorithm under a constraint just below ε′ instead of ε.
  If it requires more than B buckets, then the optimal solution has been reached.
  Independent of the number of buckets B.
• What about the streaming case?
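To picture the reduction, here is a brute-force sketch that scans candidate error values in order rather than binary-searching them as the paper does (helper names are made up):

```python
def buckets_needed(data, eps):
    # Number of buckets produced by the greedy error-bounded pass.
    count, lo, hi = 1, float('-inf'), float('inf')
    for d in data:
        nlo, nhi = max(lo, d - eps), min(hi, d + eps)
        if nlo > nhi:
            count, lo, hi = count + 1, d - eps, d + eps
        else:
            lo, hi = nlo, nhi
    return count

def space_bounded_error(data, B):
    # For max-abs error the optimal bound has the form |d_i - d_j| / 2,
    # so scanning these O(n^2) candidates in order is exact.
    candidates = sorted({abs(x - y) / 2 for x in data for y in data})
    return next(e for e in candidates if buckets_needed(data, e) <= B)

print(space_bounded_error([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 4))  # 2.0
```

Since the bucket count is non-increasing in ε, the first candidate that fits in B buckets is the optimal error, which is what makes binary search applicable.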
Streamstrapping [Guha 2009]
• Run multiple algorithms.
• The metric error satisfies the property:
  E(XY, H) ≤ E(X, H(X)) + E(H(X)Y, H)
1. Read the first B items; keep reading until a first error estimate α (> 1/M) arises.
2. Start versions for the error bounds α, α(1+ε), …, α(1+ε)^(J−1), where J = O((1/ε) log(1/ε)).
3. When the version for some bound γ fails:
   a) terminate all versions for bounds up to γ;
   b) start new versions for the bounds γ(1+ε), …, γ(1+ε)^J, using the summary of the failed version as their first input.
4. Repeat until the end of the input.
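The bootstrapping idea can be caricatured with a single active estimate (a heavily simplified sketch, not Guha's multi-version scheme; a greedy max-abs-error pass serves as the base algorithm, and all names are illustrative):

```python
def summarize(values, eps, B):
    """Greedy max-abs-error bucketing under bound eps; returns
    (value, count) pairs, or None if more than B buckets are needed."""
    buckets, lo, hi, cnt = [], float('-inf'), float('inf'), 0
    for d in values:
        nlo, nhi = max(lo, d - eps), min(hi, d + eps)
        if nlo > nhi:
            buckets.append(((lo + hi) / 2, cnt))
            if len(buckets) == B:
                return None            # bound eps fails within B buckets
            lo, hi, cnt = d - eps, d + eps, 1
        else:
            lo, hi, cnt = nlo, nhi, cnt + 1
    buckets.append(((lo + hi) / 2, cnt))
    return buckets

def streamstrap_sketch(stream, B, eps, ratio=2.0):
    """Single-estimate caricature: when the current error estimate
    fails, raise it and re-summarize the summary's representatives
    (the raw prefix is gone) together with the new item."""
    reps = []
    for d in stream:
        values = [v for v, c in reps for _ in range(c)] + [d]
        while (s := summarize(values, eps, B)) is None:
            eps *= ratio
        reps = s
    return eps, reps

print(streamstrap_sketch([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 3, 0.5))
```

The key property above is exactly what justifies feeding a summary's representatives back in: the error added by doing so is bounded by the error of the discarded summary.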
Streamstrapping [Guha 2009]
• Theorem: For any 0 < ε < 1, the StreamStrap algorithm achieves a (1+ε)(2+3ε) approximation, running O((1/ε) log(1/ε)) copies and O((1/ε) log M) initializations.
• Proof:
  Consider the lowest bound value α for which an algorithm runs.
  Suppose the error estimate was raised j times before reaching it.
  Xi: prefix of the input just before the error estimate was raised for the i-th time.
  Yj: suffix between the (j−1)-th and j-th raising of the error estimate.
  Hi: summary built for Xi. Then:
  E(Xj Yj, H) ≤ E(Xj, Hj) + E(Hj Yj, H)   (added error + target error)
  Furthermore:
  E(Xj, Hj) ≤ E(X_{j−1}, H_{j−1}) + E(H_{j−1} Yj, Hj)   (recursion)
  The error estimate is raised by a factor (1+ε) at every raising.
Streamstrapping [Guha 2009]
• Proof (cont'd): Putting it all together, telescoping:
  E(Xj, Hj) ≤ Σ_{i=1..j} E(H_{i−1} Yi, Hi)
  Since the bounds grow geometrically by a factor (1+ε), the sum is dominated by its last terms.
  Moreover, E(Hj Yj, H*) ≤ E(Xj, Hj) + E(Xj Yj, H*).
  However, the run under the bound just below the current one failed, so that bound is exceeded by the optimal error E*.
  Thus, the added error E(Xj, Hj) is at most 3ε(1+ε) E*.
  In conclusion, the total error is at most (1+ε)(2+3ε) E* (added error plus optimal error); the bound on the number of initializations follows.
Streamstrapping [Guha 2009]
• Theorem: The algorithm runs in O((B/ε) log M) space and O(n + (B/ε) log²B log log M) time.
• Proof:
  The space bound follows from the number of copies.
  Batch the input values in groups of t. Define a binary tree over the t values and compute min & max over the tree nodes; using the tree, the min & max of any interval are computed in O(log t).
  Every copy has to check violation of its bound over the t items. Non-violation is decided in O(1); a violation is located in O(log²t). Over all B buckets this is O(B log²t), and over all O((1/ε) log(1/ε)) running algorithms it becomes O((B/ε) log²t log log M), on top of O(n) tree-building work.
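The min/max tree over a batch of t values can be sketched as a standard segment tree (illustrative code, not from the paper):

```python
class MinMaxTree:
    """Binary tree over a batch of values: min & max of any
    interval [l, r] in O(log t) time."""
    def __init__(self, values):
        n = 1
        while n < len(values):
            n *= 2
        self.n = n
        self.mn = [float('inf')] * (2 * n)
        self.mx = [float('-inf')] * (2 * n)
        for i, v in enumerate(values):
            self.mn[n + i] = self.mx[n + i] = v
        for i in range(n - 1, 0, -1):   # internal nodes bottom-up
            self.mn[i] = min(self.mn[2 * i], self.mn[2 * i + 1])
            self.mx[i] = max(self.mx[2 * i], self.mx[2 * i + 1])

    def query(self, l, r):
        """(min, max) over the interval [l, r], inclusive."""
        lo, hi = float('inf'), float('-inf')
        l += self.n
        r += self.n + 1
        while l < r:
            if l & 1:
                lo, hi = min(lo, self.mn[l]), max(hi, self.mx[l])
                l += 1
            if r & 1:
                r -= 1
                lo, hi = min(lo, self.mn[r]), max(hi, self.mx[r])
            l //= 2
            r //= 2
        return lo, hi

t = MinMaxTree([4, 5, 6, 2, 15, 17, 3, 6])
print(t.query(2, 5))  # (2, 17)
```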
1D Array Partitioning [KMS 1997]
• Problem:
  Partition an array A of n items into p intervals so that the maximum weight of the intervals is minimized, where F_A(i, j) = Σ_{k=i..j} A[k].
  Arises in load balancing in pipelined, parallel environments.
1D Array Partitioning [KMS 1997]
• Idea:
  Perform binary search on all possible O(n²) intervals responsible for the maximum-weight result (bottlenecks).
• Obstacle: An approximate median has to be calculated in O(n) time.
1D Array Partitioning [KMS 1997]
• Solution: Exploit the internal structure of the O(n²) intervals: n columns, with column c consisting of F_A(i, c), i = 1, …, c.

  F(1,1)  F(1,2)  F(1,3)  …  F(1,n)
          F(2,2)  F(2,3)  …  F(2,n)
                  F(3,3)  …  F(3,n)
                           ⋱
                              F(n,n)

  Each column is monotonically non-increasing (top to bottom).
1D Array Partitioning [KMS 1997]
• Calls to F(…) need O(1). (why?)
• The median of any subcolumn is determined with one call to the F oracle. (how?)
Splitter-finding algorithm:
• Find the median weight in each active subcolumn.
• Find the median of medians m in O(n) (standard).
• Cl (Cr): set of columns with median < (>) m.
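The O(1) oracle behind the first bullet is a prefix-sum array; and since each column is sorted, a subcolumn's median is just its middle element, i.e. one oracle call (a small sketch, names made up):

```python
from itertools import accumulate

class IntervalOracle:
    """F(i, j) = A[i] + ... + A[j] in O(1), after O(n) preprocessing."""
    def __init__(self, A):
        self.prefix = [0] + list(accumulate(A))  # prefix[k] = sum A[0..k-1]

    def F(self, i, j):
        return self.prefix[j + 1] - self.prefix[i]

    def subcolumn_median(self, c, i_lo, i_hi):
        # Column c holds F(i, c) for i = i_lo..i_hi, sorted
        # non-increasingly, so its median is the middle element:
        # a single oracle call.
        return self.F((i_lo + i_hi) // 2, c)

A = [3, 1, 4, 1, 5, 9, 2, 6]
o = IntervalOracle(A)
print(o.F(2, 5))                    # 4 + 1 + 5 + 9 = 19
print(o.subcolumn_median(5, 0, 4))  # F(2, 5) = 19
```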
1D Array Partitioning [KMS 1997]
• The median of medians m is not always a splitter; it is one only when min(|Cl|, |Cr|) ≥ (|Cl| + |Cr|)/8.
1D Array Partitioning [KMS 1997]
• If the median of medians m is not a splitter, recur on the set of active subcolumns (Cl or Cr) with more elements (ignored elements are still considered in future set-size calculations).
• Otherwise, return m as a good splitter (approximate median).
End of splitter-finding algorithm.
1D Array Partitioning [KMS 1997]
Overall algorithm:
1. Arrange the intervals in subcolumns.
2. Find a splitter weight m of the active subcolumns.
3. Check whether the array is partitionable into p intervals of maximum weight m. (how?)
4. If true, then m is an upper bound on the optimal maximum weight; eliminate half of the elements of each subcolumn in Cl; otherwise, in Cr.
5. Recur until convergence to the optimal m.
Complexity: O(n log n)
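The feasibility check in step 3 is a greedy scan, and a brute-force search over the O(n²) interval weights recovers the optimal m (an illustrative sketch, not the paper's O(n log n) splitter-based search; names are made up):

```python
from itertools import accumulate

def partitionable(A, p, m):
    """Can A be split into at most p intervals of weight <= m each?
    Greedy: extend the current interval while its weight stays <= m."""
    if max(A) > m:
        return False
    parts, weight = 1, 0
    for a in A:
        if weight + a > m:
            parts, weight = parts + 1, a
        else:
            weight += a
    return parts <= p

def min_max_weight(A, p):
    # Candidate bottleneck values: all O(n^2) interval weights F(i, j).
    prefix = [0] + list(accumulate(A))
    candidates = sorted({prefix[j + 1] - prefix[i]
                         for i in range(len(A)) for j in range(i, len(A))})
    return next(m for m in candidates if partitionable(A, p, m))

A = [3, 1, 4, 1, 5, 9, 2, 6]
print(min_max_weight(A, 3))  # 14: e.g. [3 1 4 1 5 | 9 2 | 6]
```

Feasibility is monotone in m, which is what the binary search over the bottleneck candidates exploits.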
2D Array Partitioning [KMS 1997]
• Problem:
  Partition a 2D array of n × n items into a p × p partition (inducing p² blocks) so that the maximum weight of the blocks is minimized.
  Arises in particle-in-cell computations, sparse matrix computations, etc.
• NP-hard [GM 1996]
• APX-hard [CCM 1996]
2D Array Partitioning [KMS 1997]
• Definition: Two axis-parallel rectangles are independent if their projections are disjoint along both the x-axis and the y-axis.
• Observation 1: If an array has a (p, W)-partition, then it may contain at most 2p independent rectangles of weight strictly greater than W. (why?)
2D Array Partitioning [KMS 1997]
• At least one partition line is needed to stab each of the independent rectangles.
• Best case: 2p independent rectangles.
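Independence is a simple projection test; a small helper sketch:

```python
def independent(r1, r2):
    """Axis-parallel rectangles as (x1, x2, y1, y2), x1 <= x2, y1 <= y2.
    Independent iff the x-projections AND y-projections are disjoint."""
    (ax1, ax2, ay1, ay2), (bx1, bx2, by1, by2) = r1, r2
    x_disjoint = ax2 < bx1 or bx2 < ax1
    y_disjoint = ay2 < by1 or by2 < ay1
    return x_disjoint and y_disjoint

# Disjoint in x but overlapping in y: one horizontal partition line
# can stab both, so they are not independent.
print(independent((0, 1, 0, 1), (2, 3, 0, 1)))  # False
print(independent((0, 1, 0, 1), (2, 3, 2, 3)))  # True
```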
2D Array Partitioning [KMS 1997]
The algorithm: assume we know the optimal W, with max_{i,j} A[i, j] ≤ W.
Step 1: (define P)
Given W, obtain a partition P such that each row/column within any block has weight at most 2W. (how?)
Independent horizontal/vertical scans, keeping track of the running sum of the weights of each row/column in the block. (why does P exist?)
2D Array Partitioning [KMS 1997]
Step 2: (from P to S)
Construct the set S of all minimal rectangles of weight more than W, entirely contained in blocks of P. (how?)
Start from each location within a block; consider all possible rectangles in order of increasing sides, until W is exceeded; keep the minimal ones.
Property of S: rectangle weight at most 3W. (why?)
Hint: rows/columns in blocks of P weigh at most 2W.
2D Array Partitioning [KMS 1997]
Step 3: (from S to M)
Determine a locally 3-optimal set M ⊆ S of independent rectangles.
3-optimality: for i = 1, 2, 3, there does not exist a set of i independent rectangles in S that, added to M after removing i − 1 rectangles from it, does not violate the independence condition.
Polynomial-time construction. (how? with swaps: local optimality is easy)
2D Array Partitioning [KMS 1997]
Step 4: (from M to a new partition)
For each rectangle in M, set the two straddling horizontal and the two straddling vertical lines that induce it: at most a 2|M| × 2|M| partition is derived, with h ≤ 2|M| horizontal lines and v ≤ 2|M| vertical lines.
New partition: P from Step 1 together with these lines.
2D Array Partitioning [KMS 1997]
Step 5: (final)
Retain every ⌈h/p⌉-th horizontal line and every ⌈v/p⌉-th vertical line.
The maximum weight is increased at most by a factor of ⌈h/p⌉ · ⌈v/p⌉.
2D Array Partitioning [KMS 1997]
Analysis: We have to show that:
a. Given W (large enough) such that a (p, W)-partition exists, the maximum block weight in the constructed partition is O(W).
b. The minimum W for which the analysis holds (found by binary search) is an upper bound on the optimum W.
2D Array Partitioning [KMS 1997]
Lemma 1: (at Step 1)
Let b be a block contained in partition P. If the weight of b exceeds 27W, then b can be partitioned into 3 independent rectangles of weight > W.
Proof: Vertical scan in b; cut as soon as the seen slab weight exceeds 7W (hence slab weight < 9W). (why?)
Horizontal scan; cut as soon as one seen slab weight exceeds W.
2D Array Partitioning [KMS 1997]
Proof (cont'd): A slab weight exceeding W does not exceed 3W. (why?)
Eventually, 3 rectangles weighing > W each.
[Figures: block b split by the vertical scan into slabs of weight around 7W and 4W, and by the horizontal scan into pieces of weight W to 3W.]
2D Array Partitioning [KMS 1997]
Lemma 2: (at Step 4) The weight of any block of the Step-4 partition is O(W).
Proof:
Case 1: b ∈ M. The weight of b is O(W). (recall: rectangles in S weigh < 3W)
Case 2: b ∉ M. The weight of b is < 27W: if it were > 27W, then b would be partitionable into 3 independent rectangles, which could substitute the at most 2 rectangles in M non-independent of b: this violates the 3-optimality of M.
2D Array Partitioning [KMS 1997]
Lemma 3: (at Step 3) If a (p, W)-partition exists, then |M| ≤ 2p.
Proof: The weight of each rectangle in M is > W. By Observation 1, at most 2p independent rectangles can be contained in M.
2D Array Partitioning [KMS 1997]
Lemma 4: (at Step 5) If a (p, W)-partition exists, the weight of any block in the final solution is O(W).
Proof: At Step 5, the maximum weight is increased at most by a factor of ⌈h/p⌉ · ⌈v/p⌉ ≤ 25, since h, v ≤ 2|M| ≤ 4p.
By Lemma 2, the maximum weight before Step 5 is O(W). Hence, the final weight is O(W). (a)
The least W for which Step 1 and Step 3 succeed does not exceed the optimum W. Found by binary search. (b)
Compact Hierarchical Histograms
• Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized.
• Heuristic solutions: [Reiss et al. VLDB 2006]
  O(B²n log n log B) time, O(Bn log²n) space.
  [Figure: coefficient tree with root c0, children c1, c2, and c3, c4, c5, c6 over the data d0, …, d3.]
• The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node, and also on whether node C is a bucket node. [Reiss et al. VLDB 2006]
Compact Hierarchical Histograms
• Solve the error-bounded problem. Next-to-bottom-level case:
  Let node ci have children c2i and c2i+1 over the data pairs (a, b) and (c, d), and let ⟨x, y, …⟩ denote the intersection of the tolerance intervals of x, y, …
  S(i, v), the number of occupied nodes needed below ci when the value inherited from above is v:
  S(i, v) = 0, if v ∈ ⟨a, b, c, d⟩;
  S(i, v) = 1, if v ∈ ⟨a, b⟩ (occupy c2i+1 with some z ∈ ⟨c, d⟩) or v ∈ ⟨c, d⟩ (occupy c2i with some z ∈ ⟨a, b⟩);
  S(i, v) = 2, otherwise (occupy both children).
  Also record s_i* = min_v S(i, v) and a value v* attaining it.
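For the unweighted max-abs case, this case analysis is plain interval arithmetic; a small sketch (⟨…⟩ implemented as an intersection; it assumes the relevant sibling intersection is nonempty when returning 1, as holds when the bound is feasible):

```python
def tol_intersection(values, eps):
    """<v1, ..., vk>: intersection of the tolerance intervals
    [v - eps, v + eps]; returns (lo, hi), or None if empty."""
    lo = max(v - eps for v in values)
    hi = min(v + eps for v in values)
    return (lo, hi) if lo <= hi else None

def contains(iv, v):
    return iv is not None and iv[0] <= v <= iv[1]

def S_next_to_bottom(a, b, c, d, v, eps):
    """Occupied nodes needed below node c_i with leaf pairs (a, b)
    and (c, d), given the inherited value v: 0, 1 or 2."""
    if contains(tol_intersection([a, b, c, d], eps), v):
        return 0                       # v serves all four leaves
    if contains(tol_intersection([a, b], eps), v) or \
       contains(tol_intersection([c, d], eps), v):
        return 1                       # occupy the other child with some z
    return 2                           # occupy both children

print(S_next_to_bottom(4, 5, 6, 2, 4, 2))      # 0
print(S_next_to_bottom(4, 5, 15, 17, 4.5, 2))  # 1
```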
Compact Hierarchical Histograms
• Solve the error-bounded problem. General, recursive case:
  With S_L(v) = S(2i, v), S_R(v) = S(2i+1, v), s_L* = min_v S_L(v) and s_R* = min_v S_R(v):
  S(i, v) = min {
    S_L(v) + S_R(v),         (no child occupied)
    1 + s_L* + S_R(v),       (left child occupied, with its best value)
    S_L(v) + 1 + s_R*,       (right child occupied, with its best value)
    2 + s_L* + s_R*          (both children occupied)
  }
  Complexity: O(n log n) time, O(log²n) space. (space-efficient)
• Apply to the space-bounded problem.
  Complexity: O(n log n log log n)
  Polynomially tractable.
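The recursion above can be exercised by brute force on a toy input (a sketch only: candidate values are restricted to tolerance-interval endpoints, which suffices for the max-abs metric; n is assumed a power of two, and all names are made up):

```python
from functools import lru_cache

def chh_min_buckets(data, eps):
    """Minimum occupied CHH nodes so that every leaf is within eps
    of the value inherited from its nearest occupied ancestor."""
    INF = float('inf')
    cands = sorted({d - eps for d in data} | {d + eps for d in data})

    @lru_cache(maxsize=None)
    def S(lo, hi, v):
        # Occupied nodes strictly below the node covering data[lo:hi],
        # given inherited value v (None = no ancestor occupied yet).
        if hi - lo == 1:
            return 0 if v is not None and abs(data[lo] - v) <= eps else INF
        mid = (lo + hi) // 2
        sL, sR = S(lo, mid, v), S(mid, hi, v)
        sL_star = min(S(lo, mid, z) for z in cands)
        sR_star = min(S(mid, hi, z) for z in cands)
        return min(sL + sR,             # no child occupied
                   1 + sL_star + sR,    # left child occupied, best value
                   sL + 1 + sR_star,    # right child occupied, best value
                   2 + sL_star + sR_star)

    n = len(data)
    return min(S(0, n, None),                       # root not occupied
               1 + min(S(0, n, z) for z in cands))  # root occupied

print(chh_min_buckets((4, 5, 6, 2), 2))    # 1: root value 4 fits all
print(chh_min_buckets((0, 0, 10, 10), 1))  # 2: one value per half
```

The efficient algorithm avoids this enumeration by maintaining, per node, the breakpoints of S(i, ·) as a piecewise-constant function of v.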
References
1. P. Karras, D. Sacharidis, N. Mamoulis: Exploiting duality in summarization with deterministic guarantees. KDD 2007.
2. S. Guha: Tight results for clustering and summarizing data streams. ICDT 2009.
3. S. Khanna, S. Muthukrishnan, S. Skiena: Efficient array partitioning. ICALP 1997.
4. F. Reiss, M. Garofalakis, J. M. Hellerstein: Compact histograms for hierarchical identifiers. VLDB 2006.
5. P. Karras, N. Mamoulis: Hierarchical synopses with optimal error guarantees. ACM TODS 33(3), 2008.