Special Topics in Data Engineering. Panagiotis Karras. CS6234 Lecture, March 4th, 2009.
TRANSCRIPT
Outline
• Summarizing Data Streams.
• Efficient Array Partitioning: 1D Case; 2D Case.
• Hierarchical Synopses with Optimal Error Guarantees.
Summarizing Data Streams
• Approximate a sequence [d1, d2, …, dn] with B buckets si = [bi, ei, vi], so that an error metric is minimized.
• Data arrive as a stream: seen only once; cannot be stored.
• Objective functions:
  Max. abs. error: L∞(F, X) = max_i |f_i − x_i|
  Euclidean error: L2(F, X) = ( Σ_{i=1..n} (f_i − x_i)² )^(1/2)
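As a concrete check of these two objectives, here is a minimal sketch (the data and bucket values are hypothetical):

```python
# Maximum-absolute and Euclidean error between original data X
# and a bucketed approximation F.

def max_abs_error(F, X):
    # L_inf(F, X) = max_i |f_i - x_i|
    return max(abs(f - x) for f, x in zip(F, X))

def euclidean_error(F, X):
    # L_2(F, X) = sqrt(sum_i (f_i - x_i)^2)
    return sum((f - x) ** 2 for f, x in zip(F, X)) ** 0.5

X = [4, 5, 6, 2]      # original values
F = [4, 4, 4, 4]      # one bucket approximating all four by v = 4
print(max_abs_error(F, X))    # 2
print(euclidean_error(F, X))  # sqrt(0 + 1 + 4 + 4) = 3.0
```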
Histograms [KSM 2007]
• Solve the error-bounded problem.
  Maximum absolute error bound ε = 2:
  4 5 6 2 15 17 3 6 9 12 …
  [ 4 ] [ 16 ] [ 4.5 ] [ …
• Generalized to any weighted maximum-error metric.
  Each value di defines a tolerance interval [di − ε/wi, di + ε/wi].
  A bucket is closed when the running intersection of the intervals becomes empty.
  Complexity: O(n)
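A minimal sketch of this one-pass construction for the unweighted maximum-absolute-error case (function and variable names are illustrative, not from the paper):

```python
def error_bounded_buckets(data, eps):
    """One-pass greedy histogram for a max-abs error bound eps:
    keep the running intersection of the tolerance intervals
    [d - eps, d + eps]; close the bucket when it becomes empty."""
    buckets = []                      # (start, end, value) triples
    lo, hi = float('-inf'), float('inf')
    start = 0
    for i, d in enumerate(data):
        nlo, nhi = max(lo, d - eps), min(hi, d + eps)
        if nlo > nhi:                 # intersection became empty
            buckets.append((start, i - 1, (lo + hi) / 2))
            start, lo, hi = i, d - eps, d + eps
        else:
            lo, hi = nlo, nhi
    buckets.append((start, len(data) - 1, (lo + hi) / 2))
    return buckets

# The slide's example sequence with eps = 2:
print(error_bounded_buckets([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 2))
```

On the slide's sequence with ε = 2, the first three closed buckets take the representative values 4, 16 and 4.5, matching the example above.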
Histograms
• Apply to the space-bounded problem.
  Perform binary search in the domain of the error bound ε.
  Complexity: O(n log ε*)
  For an error value ε requiring space B′ ≤ B, with actual error ε′, run an optimality test:
  run the error-bounded algorithm under a constraint just below ε′ instead of ε.
  If it requires more than B buckets, then the optimal solution has been reached.
  Independent of the number of buckets B.
• What about the streaming case?
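To picture the reduction, here is a brute-force sketch that scans candidate error values in order rather than binary-searching them as the paper does (helper names are made up):

```python
def buckets_needed(data, eps):
    # Number of buckets produced by the greedy error-bounded pass.
    count, lo, hi = 1, float('-inf'), float('inf')
    for d in data:
        nlo, nhi = max(lo, d - eps), min(hi, d + eps)
        if nlo > nhi:
            count, lo, hi = count + 1, d - eps, d + eps
        else:
            lo, hi = nlo, nhi
    return count

def space_bounded_error(data, B):
    # For max-abs error the optimal bound has the form |d_i - d_j| / 2,
    # so scanning these O(n^2) candidates in order is exact.
    candidates = sorted({abs(x - y) / 2 for x in data for y in data})
    return next(e for e in candidates if buckets_needed(data, e) <= B)

print(space_bounded_error([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 4))  # 2.0
```

Since the bucket count is non-increasing in ε, the first candidate that fits in B buckets is the optimal error, which is what makes binary search applicable.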
Streamstrapping [Guha 2009]
• Run multiple algorithms.
• The metric error satisfies the property:
  E(XY, H) ≤ E(X, H(X)) + E(H(X)Y, H)
1. Read the first B items; keep reading until a first error estimate α (> 1/M) arises.
2. Start versions for the error bounds α, α(1+ε), …, α(1+ε)^(J−1), where J = O((1/ε) log(1/ε)).
3. When the version for some bound γ fails:
   a) terminate all versions for bounds up to γ;
   b) start new versions for the bounds γ(1+ε), …, γ(1+ε)^J, using the summary of the failed version as their first input.
4. Repeat until the end of the input.
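The bootstrapping idea can be caricatured with a single active estimate (a heavily simplified sketch, not Guha's multi-version scheme; a greedy max-abs-error pass serves as the base algorithm, and all names are illustrative):

```python
def summarize(values, eps, B):
    """Greedy max-abs-error bucketing under bound eps; returns
    (value, count) pairs, or None if more than B buckets are needed."""
    buckets, lo, hi, cnt = [], float('-inf'), float('inf'), 0
    for d in values:
        nlo, nhi = max(lo, d - eps), min(hi, d + eps)
        if nlo > nhi:
            buckets.append(((lo + hi) / 2, cnt))
            if len(buckets) == B:
                return None            # bound eps fails within B buckets
            lo, hi, cnt = d - eps, d + eps, 1
        else:
            lo, hi, cnt = nlo, nhi, cnt + 1
    buckets.append(((lo + hi) / 2, cnt))
    return buckets

def streamstrap_sketch(stream, B, eps, ratio=2.0):
    """Single-estimate caricature: when the current error estimate
    fails, raise it and re-summarize the summary's representatives
    (the raw prefix is gone) together with the new item."""
    reps = []
    for d in stream:
        values = [v for v, c in reps for _ in range(c)] + [d]
        while (s := summarize(values, eps, B)) is None:
            eps *= ratio
        reps = s
    return eps, reps

print(streamstrap_sketch([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 3, 0.5))
```

The key property above is exactly what justifies feeding a summary's representatives back in: the error added by doing so is bounded by the error of the discarded summary.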
Streamstrapping [Guha 2009]
• Theorem: For any 0 < ε < 1, the StreamStrap algorithm achieves a (1+ε)(2+3ε) approximation, running O((1/ε) log(1/ε)) copies and O((1/ε) log M) initializations.
• Proof:
  Consider the lowest bound value α for which an algorithm runs.
  Suppose the error estimate was raised j times before reaching it.
  Xi: prefix of the input just before the error estimate was raised for the i-th time.
  Yj: suffix between the (j−1)-th and j-th raising of the error estimate.
  Hi: summary built for Xi. Then:
  E(Xj Yj, H) ≤ E(Xj, Hj) + E(Hj Yj, H)   (added error + target error)
  Furthermore:
  E(Xj, Hj) ≤ E(X_{j−1}, H_{j−1}) + E(H_{j−1} Yj, Hj)   (recursion)
  The error estimate is raised by a factor (1+ε) at every raising.
Streamstrapping [Guha 2009]
• Proof (cont'd): Putting it all together, telescoping:
  E(Xj, Hj) ≤ Σ_{i=1..j} E(H_{i−1} Yi, Hi)
  Since the bounds grow geometrically by a factor (1+ε), the sum is dominated by its last terms.
  Moreover, E(Hj Yj, H*) ≤ E(Xj, Hj) + E(Xj Yj, H*).
  However, the run under the bound just below the current one failed, so that bound is exceeded by the optimal error E*.
  Thus, the added error E(Xj, Hj) is at most 3ε(1+ε) E*.
  In conclusion, the total error is at most (1+ε)(2+3ε) E* (added error plus optimal error); the bound on the number of initializations follows.
Streamstrapping [Guha 2009]
• Theorem: The algorithm runs in O((B/ε) log M) space and O(n + (B/ε) log²B log log M) time.
• Proof:
  The space bound follows from the number of copies.
  Batch the input values in groups of t. Define a binary tree over the t values and compute min & max over the tree nodes; using the tree, the min & max of any interval are computed in O(log t).
  Every copy has to check violation of its bound over the t items. Non-violation is decided in O(1); a violation is located in O(log²t). Over all B buckets this is O(B log²t), and over all O((1/ε) log(1/ε)) running algorithms it becomes O((B/ε) log²t log log M), on top of O(n) tree-building work.
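The min/max tree over a batch of t values can be sketched as a standard segment tree (illustrative code, not from the paper):

```python
class MinMaxTree:
    """Binary tree over a batch of values: min & max of any
    interval [l, r] in O(log t) time."""
    def __init__(self, values):
        n = 1
        while n < len(values):
            n *= 2
        self.n = n
        self.mn = [float('inf')] * (2 * n)
        self.mx = [float('-inf')] * (2 * n)
        for i, v in enumerate(values):
            self.mn[n + i] = self.mx[n + i] = v
        for i in range(n - 1, 0, -1):   # internal nodes bottom-up
            self.mn[i] = min(self.mn[2 * i], self.mn[2 * i + 1])
            self.mx[i] = max(self.mx[2 * i], self.mx[2 * i + 1])

    def query(self, l, r):
        """(min, max) over the interval [l, r], inclusive."""
        lo, hi = float('inf'), float('-inf')
        l += self.n
        r += self.n + 1
        while l < r:
            if l & 1:
                lo, hi = min(lo, self.mn[l]), max(hi, self.mx[l])
                l += 1
            if r & 1:
                r -= 1
                lo, hi = min(lo, self.mn[r]), max(hi, self.mx[r])
            l //= 2
            r //= 2
        return lo, hi

t = MinMaxTree([4, 5, 6, 2, 15, 17, 3, 6])
print(t.query(2, 5))  # (2, 17)
```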
1D Array Partitioning [KMS 1997]
• Problem:
  Partition an array A of n items into p intervals so that the maximum weight of the intervals is minimized, where F_A(i, j) = Σ_{k=i..j} A[k].
  Arises in load balancing in pipelined, parallel environments.
1D Array Partitioning [KMS 1997]
• Idea:
  Perform binary search on all possible O(n²) intervals responsible for the maximum-weight result (bottlenecks).
• Obstacle: An approximate median has to be calculated in O(n) time.
1D Array Partitioning [KMS 1997]
• Solution: Exploit the internal structure of the O(n²) intervals: n columns, with column c consisting of F_A(i, c), i = 1, …, c.

  F(1,1)  F(1,2)  F(1,3)  …  F(1,n)
          F(2,2)  F(2,3)  …  F(2,n)
                  F(3,3)  …  F(3,n)
                           ⋱
                              F(n,n)

  Each column is monotonically non-increasing (top to bottom).
1D Array Partitioning [KMS 1997]
• Calls to F(…) need O(1). (why?)
• The median of any subcolumn is determined with one call to the F oracle. (how?)
Splitter-finding algorithm:
• Find the median weight in each active subcolumn.
• Find the median of medians m in O(n) (standard).
• Cl (Cr): set of columns with median < (>) m.
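The O(1) oracle behind the first bullet is a prefix-sum array; and since each column is sorted, a subcolumn's median is just its middle element, i.e. one oracle call (a small sketch, names made up):

```python
from itertools import accumulate

class IntervalOracle:
    """F(i, j) = A[i] + ... + A[j] in O(1), after O(n) preprocessing."""
    def __init__(self, A):
        self.prefix = [0] + list(accumulate(A))  # prefix[k] = sum A[0..k-1]

    def F(self, i, j):
        return self.prefix[j + 1] - self.prefix[i]

    def subcolumn_median(self, c, i_lo, i_hi):
        # Column c holds F(i, c) for i = i_lo..i_hi, sorted
        # non-increasingly, so its median is the middle element:
        # a single oracle call.
        return self.F((i_lo + i_hi) // 2, c)

A = [3, 1, 4, 1, 5, 9, 2, 6]
o = IntervalOracle(A)
print(o.F(2, 5))                    # 4 + 1 + 5 + 9 = 19
print(o.subcolumn_median(5, 0, 4))  # F(2, 5) = 19
```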
1D Array Partitioning [KMS 1997]
• The median of medians m is not always a splitter; it is one only when min(|Cl|, |Cr|) ≥ (|Cl| + |Cr|)/8.
1D Array Partitioning [KMS 1997]
• If the median of medians m is not a splitter, recur on the set of active subcolumns (Cl or Cr) with more elements (ignored elements are still considered in future set-size calculations).
• Otherwise, return m as a good splitter (approximate median).
End of splitter-finding algorithm.
1D Array Partitioning [KMS 1997]
Overall algorithm:
1. Arrange the intervals in subcolumns.
2. Find a splitter weight m of the active subcolumns.
3. Check whether the array is partitionable into p intervals of maximum weight m. (how?)
4. If true, then m is an upper bound on the optimal maximum weight; eliminate half of the elements of each subcolumn in Cl; otherwise, in Cr.
5. Recur until convergence to the optimal m.
Complexity: O(n log n)
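The feasibility check in step 3 is a greedy scan, and a brute-force search over the O(n²) interval weights recovers the optimal m (an illustrative sketch, not the paper's O(n log n) splitter-based search; names are made up):

```python
from itertools import accumulate

def partitionable(A, p, m):
    """Can A be split into at most p intervals of weight <= m each?
    Greedy: extend the current interval while its weight stays <= m."""
    if max(A) > m:
        return False
    parts, weight = 1, 0
    for a in A:
        if weight + a > m:
            parts, weight = parts + 1, a
        else:
            weight += a
    return parts <= p

def min_max_weight(A, p):
    # Candidate bottleneck values: all O(n^2) interval weights F(i, j).
    prefix = [0] + list(accumulate(A))
    candidates = sorted({prefix[j + 1] - prefix[i]
                         for i in range(len(A)) for j in range(i, len(A))})
    return next(m for m in candidates if partitionable(A, p, m))

A = [3, 1, 4, 1, 5, 9, 2, 6]
print(min_max_weight(A, 3))  # 14: e.g. [3 1 4 1 5 | 9 2 | 6]
```

Feasibility is monotone in m, which is what the binary search over the bottleneck candidates exploits.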
2D Array Partitioning [KMS 1997]
• Problem:
  Partition a 2D array of n × n items into a p × p partition (inducing p² blocks) so that the maximum weight of the blocks is minimized.
  Arises in particle-in-cell computations, sparse matrix computations, etc.
• NP-hard [GM 1996]
• APX-hard [CCM 1996]
2D Array Partitioning [KMS 1997]
• Definition: Two axis-parallel rectangles are independent if their projections are disjoint along both the x-axis and the y-axis.
• Observation 1: If an array has a (p, W)-partition, then it may contain at most 2p independent rectangles of weight strictly greater than W. (why?)
2D Array Partitioning [KMS 1997]
• At least one partition line is needed to stab each of the independent rectangles.
• Best case: 2p independent rectangles.
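Independence is a simple projection test; a small helper sketch:

```python
def independent(r1, r2):
    """Axis-parallel rectangles as (x1, x2, y1, y2), x1 <= x2, y1 <= y2.
    Independent iff the x-projections AND y-projections are disjoint."""
    (ax1, ax2, ay1, ay2), (bx1, bx2, by1, by2) = r1, r2
    x_disjoint = ax2 < bx1 or bx2 < ax1
    y_disjoint = ay2 < by1 or by2 < ay1
    return x_disjoint and y_disjoint

# Disjoint in x but overlapping in y: one horizontal partition line
# can stab both, so they are not independent.
print(independent((0, 1, 0, 1), (2, 3, 0, 1)))  # False
print(independent((0, 1, 0, 1), (2, 3, 2, 3)))  # True
```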
2D Array Partitioning [KMS 1997]
The algorithm: assume we know the optimal W, with max_{i,j} A[i, j] ≤ W.
Step 1: (define P)
Given W, obtain a partition P such that each row/column within any block has weight at most 2W. (how?)
Independent horizontal/vertical scans, keeping track of the running sum of the weights of each row/column in the block. (why does P exist?)
2D Array Partitioning [KMS 1997]
Step 2: (from P to S)
Construct the set S of all minimal rectangles of weight more than W, entirely contained in blocks of P. (how?)
Start from each location within a block; consider all possible rectangles in order of increasing sides, until W is exceeded; keep the minimal ones.
Property of S: rectangle weight at most 3W. (why?)
Hint: rows/columns in blocks of P weigh at most 2W.
2D Array Partitioning [KMS 1997]
Step 3: (from S to M)
Determine a locally 3-optimal set M ⊆ S of independent rectangles.
3-optimality: for i = 1, 2, 3, there does not exist a set of i independent rectangles in S that, added to M after removing i − 1 rectangles from it, does not violate the independence condition.
Polynomial-time construction. (how? with swaps: local optimality is easy)
2D Array Partitioning [KMS 1997]
Step 4: (from M to a new partition)
For each rectangle in M, set the two straddling horizontal and the two straddling vertical lines that induce it: at most a 2|M| × 2|M| partition is derived, with h ≤ 2|M| horizontal lines and v ≤ 2|M| vertical lines.
New partition: P from Step 1 together with these lines.
2D Array Partitioning [KMS 1997]
Step 5: (final)
Retain every ⌈h/p⌉-th horizontal line and every ⌈v/p⌉-th vertical line.
The maximum weight is increased at most by a factor of ⌈h/p⌉ · ⌈v/p⌉.
2D Array Partitioning [KMS 1997]
Analysis: We have to show that:
a. Given W (large enough) such that a (p, W)-partition exists, the maximum block weight in the constructed partition is O(W).
b. The minimum W for which the analysis holds (found by binary search) is an upper bound on the optimum W.
2D Array Partitioning [KMS 1997]
Lemma 1: (at Step 1)
Let b be a block contained in partition P. If the weight of b exceeds 27W, then b can be partitioned into 3 independent rectangles of weight > W.
Proof: Vertical scan in b; cut as soon as the seen slab weight exceeds 7W (hence slab weight < 9W). (why?)
Horizontal scan; cut as soon as one seen slab weight exceeds W.
2D Array Partitioning [KMS 1997]
Proof (cont'd): A slab weight exceeding W does not exceed 3W. (why?)
Eventually, 3 rectangles weighing > W each.
[Figures: block b split by the vertical scan into slabs of weight around 7W and 4W, and by the horizontal scan into pieces of weight W to 3W.]
2D Array Partitioning [KMS 1997]
Lemma 2: (at Step 4) The weight of any block of the Step-4 partition is O(W).
Proof:
Case 1: b ∈ M. The weight of b is O(W). (recall: rectangles in S weigh < 3W)
Case 2: b ∉ M. The weight of b is < 27W: if it were > 27W, then b would be partitionable into 3 independent rectangles, which could substitute the at most 2 rectangles in M non-independent of b: this violates the 3-optimality of M.
2D Array Partitioning [KMS 1997]
Lemma 3: (at Step 3) If a (p, W)-partition exists, then |M| ≤ 2p.
Proof: The weight of each rectangle in M is > W. By Observation 1, at most 2p independent rectangles can be contained in M.
2D Array Partitioning [KMS 1997]
Lemma 4: (at Step 5) If a (p, W)-partition exists, the weight of any block in the final solution is O(W).
Proof: At Step 5, the maximum weight is increased at most by a factor of ⌈h/p⌉ · ⌈v/p⌉ ≤ 25, since h, v ≤ 2|M| ≤ 4p.
By Lemma 2, the maximum weight before Step 5 is O(W). Hence, the final weight is O(W). (a)
The least W for which Step 1 and Step 3 succeed does not exceed the optimum W. Found by binary search. (b)
Compact Hierarchical Histograms
• Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized.
• Heuristic solutions: [Reiss et al. VLDB 2006]
  O(B²n log n log B) time, O(Bn log²n) space.
  [Figure: coefficient tree with root c0, children c1, c2, and c3, c4, c5, c6 over the data d0, …, d3.]
• The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node, and also on whether node C is a bucket node. [Reiss et al. VLDB 2006]
Compact Hierarchical Histograms
• Solve the error-bounded problem. Next-to-bottom-level case:
  Let node ci have children c2i and c2i+1 over the data pairs (a, b) and (c, d), and let ⟨x, y, …⟩ denote the intersection of the tolerance intervals of x, y, …
  S(i, v), the number of occupied nodes needed below ci when the value inherited from above is v:
  S(i, v) = 0, if v ∈ ⟨a, b, c, d⟩;
  S(i, v) = 1, if v ∈ ⟨a, b⟩ (occupy c2i+1 with some z ∈ ⟨c, d⟩) or v ∈ ⟨c, d⟩ (occupy c2i with some z ∈ ⟨a, b⟩);
  S(i, v) = 2, otherwise (occupy both children).
  Also record s_i* = min_v S(i, v) and a value v* attaining it.
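For the unweighted max-abs case, this case analysis is plain interval arithmetic; a small sketch (⟨…⟩ implemented as an intersection; it assumes the relevant sibling intersection is nonempty when returning 1, as holds when the bound is feasible):

```python
def tol_intersection(values, eps):
    """<v1, ..., vk>: intersection of the tolerance intervals
    [v - eps, v + eps]; returns (lo, hi), or None if empty."""
    lo = max(v - eps for v in values)
    hi = min(v + eps for v in values)
    return (lo, hi) if lo <= hi else None

def contains(iv, v):
    return iv is not None and iv[0] <= v <= iv[1]

def S_next_to_bottom(a, b, c, d, v, eps):
    """Occupied nodes needed below node c_i with leaf pairs (a, b)
    and (c, d), given the inherited value v: 0, 1 or 2."""
    if contains(tol_intersection([a, b, c, d], eps), v):
        return 0                       # v serves all four leaves
    if contains(tol_intersection([a, b], eps), v) or \
       contains(tol_intersection([c, d], eps), v):
        return 1                       # occupy the other child with some z
    return 2                           # occupy both children

print(S_next_to_bottom(4, 5, 6, 2, 4, 2))      # 0
print(S_next_to_bottom(4, 5, 15, 17, 4.5, 2))  # 1
```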
Compact Hierarchical Histograms
• Solve the error-bounded problem. General, recursive case:
  With S_L(v) = S(2i, v), S_R(v) = S(2i+1, v), s_L* = min_v S_L(v) and s_R* = min_v S_R(v):
  S(i, v) = min {
    S_L(v) + S_R(v),         (no child occupied)
    1 + s_L* + S_R(v),       (left child occupied, with its best value)
    S_L(v) + 1 + s_R*,       (right child occupied, with its best value)
    2 + s_L* + s_R*          (both children occupied)
  }
  Complexity: O(n log n) time, O(log²n) space. (space-efficient)
• Apply to the space-bounded problem.
  Complexity: O(n log n log log n)
  Polynomially tractable.
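The recursion above can be exercised by brute force on a toy input (a sketch only: candidate values are restricted to tolerance-interval endpoints, which suffices for the max-abs metric; n is assumed a power of two, and all names are made up):

```python
from functools import lru_cache

def chh_min_buckets(data, eps):
    """Minimum occupied CHH nodes so that every leaf is within eps
    of the value inherited from its nearest occupied ancestor."""
    INF = float('inf')
    cands = sorted({d - eps for d in data} | {d + eps for d in data})

    @lru_cache(maxsize=None)
    def S(lo, hi, v):
        # Occupied nodes strictly below the node covering data[lo:hi],
        # given inherited value v (None = no ancestor occupied yet).
        if hi - lo == 1:
            return 0 if v is not None and abs(data[lo] - v) <= eps else INF
        mid = (lo + hi) // 2
        sL, sR = S(lo, mid, v), S(mid, hi, v)
        sL_star = min(S(lo, mid, z) for z in cands)
        sR_star = min(S(mid, hi, z) for z in cands)
        return min(sL + sR,             # no child occupied
                   1 + sL_star + sR,    # left child occupied, best value
                   sL + 1 + sR_star,    # right child occupied, best value
                   2 + sL_star + sR_star)

    n = len(data)
    return min(S(0, n, None),                       # root not occupied
               1 + min(S(0, n, z) for z in cands))  # root occupied

print(chh_min_buckets((4, 5, 6, 2), 2))    # 1: root value 4 fits all
print(chh_min_buckets((0, 0, 10, 10), 1))  # 2: one value per half
```

The efficient algorithm avoids this enumeration by maintaining, per node, the breakpoints of S(i, ·) as a piecewise-constant function of v.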
References
1. P. Karras, D. Sacharidis, N. Mamoulis: Exploiting duality in summarization with deterministic guarantees. KDD 2007.
2. S. Guha: Tight results for clustering and summarizing data streams. ICDT 2009.
3. S. Khanna, S. Muthukrishnan, S. Skiena: Efficient array partitioning. ICALP 1997.
4. F. Reiss, M. Garofalakis, J. M. Hellerstein: Compact histograms for hierarchical identifiers. VLDB 2006.
5. P. Karras, N. Mamoulis: Hierarchical synopses with optimal error guarantees. ACM TODS 33(3), 2008.