multiple aggregations over data stream

2006/3/21 1

Multiple Aggregations over Data Stream

Rui Zhang, Nick Koudas, Beng Chin Ooi, Divesh Srivastava

SIGMOD 2005

Outline

• Introduction to Giga-Scope DSMS

• Multiple Aggregations Problem

• The proposed approach

- choice of phantoms

- space allocation problem

• Conclusion

Giga-Scope

• A DSMS designed to monitor high-speed IP traffic data.

DSMS components:

• LFTA (Network Interface Card): simple low-level queries over the high-speed data stream, which serve to reduce the data volume.

• HFTA (Main Memory): processes the lower-speed data stream fed by the LFTAs.

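Single Aggregation in Giga-Scope

Select A, count(*) From R Group by A;

[Figure: records of stream R are hashed on A into an LFTA hash table with slots 0-9 holding (group, count) entries such as (2,1), (24,1), (3,1), (17,1), (4,1); entries evicted from the LFTA are passed to the HFTA.]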

Cost of Processing a Single Aggregation

• Probe (c1): the cost of looking up the hash table in the LFTA, plus a possible update in case of a collision

• Eviction (c2): the cost of transferring an entry from the LFTA to the HFTA

Processing Multiple Aggregations Naively

Select A, count(*) From R Group by A;

Select B, count(*) From R Group by B;

Select C, count(*) From R Group by C;

Stream records: (2, 3, 4) (24, 4, 3) (2, 3, 4) (2, 3, 4) (4, 2, 3)

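[Figure: each record of R(A, B, C) probes three LFTA hash tables, one per query; evicted entries go to the HFTA.]

• Hash Table A: (24,1) is evicted on a collision with group 4; (2,3) and (4,1) remain at the end of the epoch

• Hash Table B: (3,3), (4,1), (2,1) at the end of the epoch

• Hash Table C: (4,3), (3,2) at the end of the epoch

Cost: 15c1 + 1c2 + 7c2 (15 probes, 1 eviction during the epoch, 7 evictions at the end of the epoch)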

Processing Multiple Aggregations by Maintaining Phantoms

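R(A, B, C): (2, 3, 4) (24, 4, 3) (2, 3, 4) (2, 3, 4) (4, 2, 3)

[Figure: in the LFTA, each record probes only the phantom Hash Table ABC; entries leaving ABC are propagated to Hash Tables A, B and C, whose entries are evicted to the HFTA.]

• Hash Table ABC (successive states): (2, 3, 4, 1), (24, 4, 3, 1), (2, 3, 4, 2), (2, 3, 4, 3), (4, 2, 3, 1)

• At the end of the epoch, ABC is flushed into A, B and C:
  - Hash Table A: (2, 3), (24, 1), (4, 1)
  - Hash Table B: (3, 3), (4, 1), (2, 1)
  - Hash Table C: (4, 3), (3, 1), then (3, 2)

Cost: 14c1 + 8c2 (5 probes of ABC plus 3 x 3 probes of A, B, C; 8 evictions to the HFTA)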
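The two cost figures on these slides (15c1 + 8c2 without the phantom, 14c1 + 8c2 with it) can be reproduced with a small simulation. This is only a sketch under stated assumptions: the `Table` class, `run_epoch`, and the hash function `bucket_of` are hypothetical names and choices made for illustration, and the ten-bucket table size is read off the slots 0-9 drawn on the slides. The hash is chosen so that groups 24 and 4 collide in table A, as they do on the slide.

```python
from collections import Counter

BUCKETS = 10  # assumed table size, matching the slots 0-9 on the slides


def bucket_of(key):
    """Hypothetical hash: weighted sum of the key's attributes modulo BUCKETS."""
    return sum((i + 1) * v for i, v in enumerate(key)) % BUCKETS


class Table:
    """One LFTA hash table; each bucket holds a single [key, record, count] entry."""

    def __init__(self, attrs, children=()):
        self.attrs = attrs        # positions of the group-by attributes in a record
        self.children = children  # tables fed by this one (empty for a query table)
        self.slots = {}

    def probe(self, record, count, cost):
        """Look up the record's group; on a collision, evict the resident entry."""
        cost["c1"] += 1
        key = tuple(record[i] for i in self.attrs)
        slot = bucket_of(key)
        resident = self.slots.get(slot)
        if resident and resident[0] == key:
            resident[2] += count
            return
        if resident:
            self.evict(resident, cost)
        self.slots[slot] = [key, record, count]

    def evict(self, entry, cost):
        """A phantom pushes its entry into its children; a query table sends it to the HFTA."""
        _, record, count = entry
        if self.children:
            for child in self.children:
                child.probe(record, count, cost)  # each child probe costs c1
        else:
            cost["c2"] += 1

    def flush(self, cost):
        """End of epoch: evict every remaining entry."""
        for slot in sorted(self.slots):
            self.evict(self.slots[slot], cost)
        self.slots.clear()


def run_epoch(roots, leaves, records):
    cost = Counter()
    for record in records:
        for table in roots:
            table.probe(record, 1, cost)
    for table in roots:          # flush phantoms first ...
        if table.children:
            table.flush(cost)
    for table in leaves:         # ... then the query tables
        table.flush(cost)
    return cost


RECORDS = [(2, 3, 4), (24, 4, 3), (2, 3, 4), (2, 3, 4), (4, 2, 3)]


def make_queries():
    return [Table((0,)), Table((1,)), Table((2,))]


naive_tables = make_queries()
naive = run_epoch(naive_tables, naive_tables, RECORDS)

shared_tables = make_queries()
phantom = Table((0, 1, 2), children=shared_tables)
shared = run_epoch([phantom], shared_tables, RECORDS)

print(dict(naive))   # {'c1': 15, 'c2': 8}
print(dict(shared))  # {'c1': 14, 'c2': 8}
```

On five records the phantom saves only one probe; the saving grows when many records fall into the same ABC group, because repeated records then stop at the phantom.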

The problem

• Consider a set of aggregation queries over a data stream that differ only in their grouping attributes. Determine an optimal sharing configuration for the queries under limited memory:

- choice of phantoms

- space allocation

[Figure: lattice of relations. Given queries Q1-Q4: AB, BC, BD, CD. Candidate phantoms: ABC, ABD, BCD, ABCD.]

Idea by maintaining phantoms

• x1: the collision rate of the query hash tables without phantoms

• x1': the collision rate of the query hash tables with phantoms

• x2: the collision rate of phantom ABC

• The total cost for n records:

– Without phantom: E1 = 3nc1 + 3x1nc2

– With the phantom: E2 = nc1 + 3x2nc1 + 3x1'x2nc2

Example

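[Figure: without the phantom, each record probes A, B and C (cost c1 each) and their evictions cost c2; with the phantom, each record probes ABC (cost c1), which feeds A, B and C.]

• To be fair, the total space used for the hash tables should be the same with or without the phantom:

– Without the phantom: A, B, C get M/3 each (collision rate x1)

– With the phantom: A, B, C, ABC get M/4 each (collision rate x1' for A, B, C and x2 for ABC)

• Cost per record:

– E1 = 3c1 + 3x1c2

– E2 = c1 + 3x2c1 + 3x1'x2c2

• E1 - E2 = (2 - 3x2)c1 + 3(x1 - x1'x2)c2 = F(x1, x2, x1')

• When x2 → 0, the phantom reduces the cost.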
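The per-record costs of this example translate directly into code. The sketch below only restates the formulas E1 and E2; the function names and the numeric values of c1, c2 and the collision rates are illustrative assumptions, not figures from the slides.

```python
import math


def cost_without_phantom(x1, c1, c2):
    """E1: every record probes A, B and C; a fraction x1 of the probes evicts to the HFTA."""
    return 3 * c1 + 3 * x1 * c2


def cost_with_phantom(x1p, x2, c1, c2):
    """E2: one probe of ABC; a fraction x2 of the records is pushed on to A, B and C."""
    return c1 + 3 * x2 * c1 + 3 * x1p * x2 * c2


def benefit(x1, x1p, x2, c1, c2):
    """E1 - E2: positive when maintaining the phantom is cheaper."""
    return cost_without_phantom(x1, c1, c2) - cost_with_phantom(x1p, x2, c1, c2)


# Illustrative values only (assumed, not taken from the slides).
c1, c2 = 1.0, 50.0
x1, x1p, x2 = 0.2, 0.3, 0.1

gain = benefit(x1, x1p, x2, c1, c2)
closed_form = (2 - 3 * x2) * c1 + 3 * (x1 - x1p * x2) * c2
print(gain, math.isclose(gain, closed_form))
```

Because the phantom takes memory away from A, B and C, x1' is normally larger than x1; the phantom still pays off as long as x2 is small enough.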

The collision rate estimation

• g: number of groups of a relation

• b: number of buckets in the hash table

• Example: g = 3000, b = 1000

• Key point: the probability that k groups out of g are hashed to the same bucket

• Bk: the number of buckets having k groups

• nrg: the expected number of records for each group

• (1-1/k): the collision rate in a bucket holding k groups; collisions happen only in buckets holding more than one group
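The formula itself did not survive on this slide, so the following is one reading of the quantities it lists, not a quotation of the authors' equation. Assuming uniform random hashing (the number of groups per bucket is binomial), equal record counts per group, and a collision rate of (1 - 1/k) in a bucket with k groups, the overall collision rate is the sum over k >= 2 of Bk * k * nrg * (1 - 1/k) divided by the total number of records g * nrg. Under those assumptions nrg cancels and the sum has the closed form 1 - (b/g)(1 - (1 - 1/b)^g). The function names are hypothetical.

```python
import math


def collision_rate_sum(g, b):
    """Sum over k >= 2 of B_k * (k - 1) / g, with B_k = b * P(k groups share a bucket).

    Assumes b >= 2. The binomial probabilities are built with a recurrence
    to avoid overflowing floats with huge binomial coefficients.
    """
    q = 1.0 / b
    p_k = (1.0 - q) ** g  # P(0 groups in a bucket)
    total = 0.0
    for k in range(g):
        # P(k + 1) from P(k) for a Binomial(g, 1/b) distribution
        p_k *= (g - k) / (k + 1) * q / (1.0 - q)
        groups_in_bucket = k + 1
        if groups_in_bucket >= 2:
            total += b * p_k * (groups_in_bucket - 1) / g
    return total


def collision_rate_closed_form(g, b):
    """Closed form of the same sum under the same assumptions."""
    return 1.0 - (b / g) * (1.0 - (1.0 - 1.0 / b) ** g)


g, b = 3000, 1000  # the example values shown on the slide
x_sum = collision_rate_sum(g, b)
x_closed = collision_rate_closed_form(g, b)
print(round(x_sum, 4), round(x_closed, 4))
```

The closed form makes the key dependency visible: the collision rate is driven almost entirely by the ratio g/b.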

Algorithmic strategies for choosing the phantoms

• Benefit = the maintenance cost without the phantom minus the maintenance cost with the phantom.

Greedy by Increasing Collision Rate

• The configuration I initially includes only the queries

• We calculate the maintenance cost if a phantom R is added to I

• By comparing it with the maintenance cost when R is not in I, we get the benefit of R

• After we add the chosen phantom to I, we iterate over the remaining phantoms

• As more phantoms are added to I, the overall collision rate goes up and the benefit decreases

• Stop when the benefit becomes negative.

Greedy by Increasing Collision Rate

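[Figure: lattice of relations. Queries Q1-Q4: AB, BC, BD, CD. Candidate phantoms: ABC, ABD, BCD, ABCD.]

Group counts shown in the figure: ABCD g=2837; three-attribute phantoms g=2117, g=2387, g=2249; AB g=1846; remaining queries g=1946, g=1899, g=1999.

Available memory = 12000

• Queries only (total g = 7690): Allocate AB = (1846/7690)*12000, Allocate BC…, Allocate BD…, Allocate CD…

• Try ABCD (Linear Proportional Allocation, total g = 10527): Allocate ABCD = (2837/10527)*12000, Allocate AB = (1846/10527)*12000, Allocate BC…, Allocate BD…, Allocate CD…

• The allocation b_ABCD determines the collision rate x_ABCD, which determines the benefit E1-E2 = F(x1, x2, x1')

• The process ends when the benefit becomes negative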
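The greedy loop described on these slides can be sketched as follows. Several parts are assumptions made for illustration and are not confirmed by the slides: the closed-form `collision_rate` (uniform random hashing), the feeding rule in `parent_of` (each relation is fed by its instantiated superset with the fewest groups), the ratio c2/c1 = 50, and the assignment of the g values to BC, BD, CD, ABC, ABD and BCD, which is guessed from their order in the figure. Only g = 1846 for AB and g = 2837 for ABCD are confirmed by the allocation formulas on the slide.

```python
def collision_rate(g, b):
    """Assumed closed form for uniform random hashing (not given on the slides)."""
    return 1.0 - (b / g) * (1.0 - (1.0 - 1.0 / b) ** g)


def allocate(config, groups, memory):
    """Linear proportional allocation: buckets in proportion to the number of groups."""
    total = sum(groups[r] for r in config)
    return {r: memory * groups[r] / total for r in config}


def parent_of(rel, config, groups):
    """Hypothetical feeding rule: the instantiated proper superset with the fewest groups."""
    supersets = [r for r in config if set(rel) < set(r)]
    return min(supersets, key=lambda r: groups[r]) if supersets else None


def cost(config, queries, groups, memory, c1, c2):
    """Per-record maintenance cost of a configuration."""
    buckets = allocate(config, groups, memory)
    x = {r: collision_rate(groups[r], buckets[r]) for r in config}

    def feed_rate(rel):
        # Fraction of stream records that reach this table.
        parent = parent_of(rel, config, groups)
        return 1.0 if parent is None else feed_rate(parent) * x[parent]

    total = 0.0
    for rel in config:
        rate = feed_rate(rel)
        total += rate * c1                 # probes of this table
        if rel in queries:
            total += rate * x[rel] * c2    # evictions to the HFTA
    return total


def greedy_by_collision_rate(queries, phantoms, groups, memory, c1=1.0, c2=50.0):
    config = list(queries)                 # start with the queries only
    current = cost(config, queries, groups, memory, c1, c2)
    candidates = list(phantoms)
    while candidates:
        best = min(candidates,
                   key=lambda p: cost(config + [p], queries, groups, memory, c1, c2))
        best_cost = cost(config + [best], queries, groups, memory, c1, c2)
        if current - best_cost <= 0:       # benefit is no longer positive
            break
        config.append(best)
        candidates.remove(best)
        current = best_cost
    return config, current


QUERIES = ["AB", "BC", "BD", "CD"]
PHANTOMS = ["ABC", "ABD", "BCD", "ABCD"]
# AB and ABCD are confirmed by the slide; the other assignments are assumed.
GROUPS = {"AB": 1846, "BC": 1946, "BD": 1899, "CD": 1999,
          "ABC": 2117, "ABD": 2387, "BCD": 2249, "ABCD": 2837}
MEMORY = 12000

base_cost = cost(QUERIES, QUERIES, GROUPS, MEMORY, 1.0, 50.0)
chosen, final_cost = greedy_by_collision_rate(QUERIES, PHANTOMS, GROUPS, MEMORY)
print(chosen, round(base_cost, 2), round(final_cost, 2))
```

Every added phantom shrinks the share of memory given to the tables already in the configuration, which is why the collision rates rise and the benefit eventually turns negative.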

Space Allocation

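[Figure: two-level graph. Phantom AB (collision rate x0) feeds queries A (collision rate x1) and B (collision rate x2).]

With the collision rates approximated by x_i = g_i/b_i and b_0 = M - b_1 - b_2:

$$
\begin{aligned}
e &= c_1 + 2x_0c_1 + x_0x_1c_2 + x_0x_2c_2 \\
  &= c_1 + 2\frac{g_0}{b_0}c_1 + \frac{g_0}{b_0}\frac{g_1}{b_1}c_2 + \frac{g_0}{b_0}\frac{g_2}{b_2}c_2 \\
  &= c_1 + \frac{g_0}{M - b_1 - b_2}\left(2c_1 + \frac{g_1}{b_1}c_2 + \frac{g_2}{b_2}c_2\right)
\end{aligned}
$$

Setting the partial derivatives of e with respect to b_1 and b_2 to 0:

$$\frac{g_1}{b_1^2} = \frac{g_2}{b_2^2}$$

When this holds, e has minimum cost.

Thereby, the space allocated is proportional to the square root of the number of groups.

This is the optimal solution for the two-level graph.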
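The square-root rule for the two-level graph can be checked numerically. The sketch below assumes the cost e = c1 + 2·x0·c1 + x0·x1·c2 + x0·x2·c2 with each collision rate approximated by g_i/b_i, where index 0 is the phantom AB and indices 1 and 2 are the queries A and B. The function names and all numeric values are illustrative assumptions. It searches every integer split of the memory by brute force and compares the ratio b1/b2 at the optimum with sqrt(g1/g2).

```python
import math


def two_level_cost(b1, b2, memory, g0, g1, g2, c1, c2):
    """Cost e of the two-level graph with x_i approximated by g_i / b_i."""
    b0 = memory - b1 - b2
    x0, x1, x2 = g0 / b0, g1 / b1, g2 / b2
    return c1 + 2 * x0 * c1 + x0 * x1 * c2 + x0 * x2 * c2


def best_split(memory, g0, g1, g2, c1=1.0, c2=50.0):
    """Exhaustive search over integer allocations with b0, b1, b2 >= 1."""
    best = None
    for b1 in range(1, memory - 1):
        for b2 in range(1, memory - b1):
            e = two_level_cost(b1, b2, memory, g0, g1, g2, c1, c2)
            if best is None or e < best[0]:
                best = (e, b1, b2)
    return best


# Illustrative values only: g1 / g2 = 4, so the rule predicts b1 / b2 = 2.
MEMORY, G0, G1, G2 = 600, 500, 400, 100
e_min, b1_opt, b2_opt = best_split(MEMORY, G0, G1, G2)
print(b1_opt, b2_opt, b1_opt / b2_opt, math.sqrt(G1 / G2))
```

The ratio is only approximately 2 because the search is restricted to integer bucket counts.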

One way of allocating hash table space to a relation is in proportion to the number of groups in the relation.

We can allocate space φ·g to a relation with g groups, where φ is a constant that we set large.

Greedy by Increasing Space

• We calculate the benefit of each phantom according to the cost model

• We calculate the benefit per unit space for each phantom R: benefit / (φ·g_R), where φ·g_R is the space allocated to R

• We choose the phantom with the largest benefit per unit space as the first phantom to instantiate

• The process ends when the benefit per unit space becomes negative

Algorithmic strategies for choosing the phantoms

• Greedy by Increasing Space

[Figure: lattice of relations. Queries Q1-Q4: AB, BC, BD, CD. Candidate phantoms: ABC, ABD, BCD, ABCD. Group counts shown: ABCD g=2837; three-attribute phantoms g=2117, g=2387, g=2249; AB g=1846; remaining queries g=1946, g=1899, g=1999. Candidate phantoms are labelled with example benefits: Benefit=2, Benefit=1, Benefit=-1.]

E1-E2 = (2-3x2)c1 + 3(x1-x1'x2)c2

Benefit/Space as a metric

Available memory = 12000

• After allocating the queries: 12000 - 7690 = 4310

• Try ABCD: 4310 - 2837 = 1473

The process ends when

1. The benefit becomes negative

2. The space is exhausted

Drawback

• The constant φ (the space multiplier applied to g) needs to be tuned to find the best performance

Space Allocation

• According to Abel's impossibility theorem, general polynomial equations of degree higher than 4 cannot be solved algebraically (in radicals); we call such cases unsolvable

• More general multi-level configurations generate equations of even higher order, which are unsolvable

• We use heuristics to decide the space allocation for these unsolvable cases, based on the analysis available

Space Allocation

• Super-node with Linear Combination

• Super-node with Square Root Combination

• Linear Proportional Allocation

• Square Root Proportional Allocation
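The two proportional heuristics in this list differ only in the weight given to each relation. A minimal sketch follows, assuming space is divided in proportion to g (linear) or to the square root of g (square root); the helper name `proportional_allocation` and its `power` parameter are hypothetical. The super-node variants are not sketched because the slides do not describe them. The g values for AB and ABCD come from the earlier example; the assignment of the other three values to BC, BD and CD is assumed.

```python
import math


def proportional_allocation(groups, memory, power=1.0):
    """Split memory in proportion to g ** power (1.0 = linear, 0.5 = square root)."""
    weights = {rel: g ** power for rel, g in groups.items()}
    total = sum(weights.values())
    return {rel: memory * w / total for rel, w in weights.items()}


# AB and ABCD are taken from the slides; BC, BD, CD are assumed assignments.
GROUP_COUNTS = {"ABCD": 2837, "AB": 1846, "BC": 1946, "BD": 1899, "CD": 1999}
AVAILABLE = 12000

linear = proportional_allocation(GROUP_COUNTS, AVAILABLE, power=1.0)
square_root = proportional_allocation(GROUP_COUNTS, AVAILABLE, power=0.5)

for rel in GROUP_COUNTS:
    print(rel, round(linear[rel], 1), round(square_root[rel], 1))
```

Square-root allocation flattens the differences between relations: a table with four times as many groups receives only twice the space.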

Conclusion

• We address the problem of efficiently computing multiple aggregations over high speed data streams

• In a real DSMS, the value of g is unknown.
