

XY-Sketch: on Sketching Data Streams at Web Scale

Yongqiang Liu
University of Science and Technology of China
Hefei, China
[email protected]

Xike Xie
University of Science and Technology of China
Hefei, China
[email protected]

ABSTRACT

Conventional sketching methods for counting stream item frequencies use hash functions to map data items to a concise structure, e.g., a two-dimensional array, at the expense of overcounting due to hashing collisions. Despite their popularity, the errors accumulated from hashing collisions deteriorate sketching accuracy as data grows rapidly, which poses a great challenge to sketching big data streams at web scale. In this paper, we propose a novel structure, called XY-sketch, which estimates the frequency of a data item by estimating the probability of the item appearing in the data stream. The framework associated with XY-sketch consists of two phases, namely the decomposition and recomposition phases. A data item is split into a set of compactly stored basic elements, which can be strung together in a probabilistic manner for query evaluation during the recomposition phase. Throughout, we conduct optimization under space constraints and detailed theoretical analysis. Experiments on both real and synthetic datasets show superior scalability in sketching large-scale streams. Remarkably, XY-sketch is orders of magnitude more accurate than existing solutions when the space budget is small.

KEYWORDS

Data streams; Sketch; Data structures

1 INTRODUCTION

In many applications, big data streams are continuously and automatically generated, such as web clicks [25], emails [17], financial data trackers [31], sensor networks [16], network traffic [13], and social network interactions [14][22]. For example, the U.S. online advertising market was reported to be worth 100 billion dollars in 2018 [23], and is likely to expand to 230 billion dollars in the near future [20]. The industry relies on tracking the web-click streams of billions of users and counting many combinations of events, leading to a blow-up in the number of counted items [11][20]. Hence, a compact yet scalable structure is desired to support emerging applications such as web-scale stream analytics.

Sketches are compact data structures that take small space to support high-quality approximate queries over data streams [7][5][9][19][6][30]. Different from conventional database processing, which requires multiple passes, data stream processing is often done sequentially in one pass, empowered by sketches. The task of a sketch is to estimate item frequencies for data streams. State-of-the-art solutions include CM-sketch [7], C-sketch [5], CU-sketch [9], A-sketch [19], Cold Filter [30], and so on. They adopt a similar underlying structure, which is essentially a d × w array of counters for storing item frequencies. Each of the d rows of the array is associated with a hash function for mapping items to w counters. Nevertheless, hashing collisions pose a great challenge to sketch scalability, especially in the face of big data streams.

The sketch scalability is of paramount importance in big data streaming scenarios, such as online advertising and social network tracking, where data items come with unprecedentedly expanding domains (numbers of distinct items) and growing volumes. Let $f_i$ be the true frequency of item $x_i$ and $\hat{f}_i$ be the estimated frequency of $x_i$. We investigate how the estimation accuracy scales. Equivalently, we investigate how the frequency estimation error, $|\hat{f}_i - f_i|$, scales with respect to the growth of the total number of items N and the number of distinct items n, namely N-scalability and n-scalability.

Table 1: Error Bounds for Sketches (with probability at least 1 − δ)

    XY-sketch:   $\frac{2}{\delta n} \sum_{i=1}^{n} f_i$
    CM-sketch:   $\frac{e}{w} \sum_{i=1}^{n} f_i$
    C-sketch:    $\frac{8}{\sqrt{w}} \sqrt{\sum_{i=1}^{n} f_i^2}$

We can analyze the scalability based on the estimation error bounds of the sketches, as shown in Table 1, which connects the estimation accuracy with the two scalability factors, n and N. For a stream, N refers to the total number of items, and n refers to the number of distinct items. From Table 1, the error bound of C-sketch is $\frac{8}{\sqrt{w}} \sqrt{\sum_{i=1}^{n} f_i^2}$ [5]. By transforming the error bound from the L2 norm to the L1 norm, the error bound can be represented by $\frac{8}{\sqrt{wn}} N$,¹ meaning that the error bound is proportional to N and inversely proportional to $\sqrt{wn}$. For example, if n is increased to $n_1$, the error bound of C-sketch shrinks to $\sqrt{n/n_1}$ of its original bound, resulting in a lower error bound. In contrast, the error bound of XY-sketch shrinks to $n/n_1$ of the original one, which shows better scalability in terms of n.
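The transformation can be made explicit by expanding the paper's footnoted inequality. The following short derivation is a sketch; note that the final step substitutes the L2 term by its L1-based lower bound, so the L1 form represents the bound in terms of N and n rather than replacing it exactly:

    % Footnoted inequality (a consequence of Cauchy-Schwarz), with \sum_i f_i = N:
    \frac{1}{\sqrt{n}} \sum_{i=1}^{n} f_i \;\le\; \sqrt{\sum_{i=1}^{n} f_i^2}
    \quad\Longrightarrow\quad
    \sqrt{\sum_{i=1}^{n} f_i^2} \;\ge\; \frac{N}{\sqrt{n}}
    % Substituting this lower bound into the C-sketch bound gives its L1 form:
    \frac{8}{\sqrt{w}} \sqrt{\sum_{i=1}^{n} f_i^2}
    \;\rightsquigarrow\;
    \frac{8}{\sqrt{w}} \cdot \frac{N}{\sqrt{n}} \;=\; \frac{8}{\sqrt{wn}}\, N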

In this work, we propose XY-sketch, a novel sketching technique that tackles the scalability challenges by adopting a novel decomposition-and-recomposition framework. For the first time, we estimate the frequency of a stream item by estimating the probability of the data item appearing in the data stream. To count an item of a stream, one can then simply multiply this probability by the total number of items N. The basic idea is to decompose an item into a sequence of elements that need much smaller storage space. During the query phase, the decomposed elements can be recomposed for frequency estimation in a probabilistic manner. Both the decomposition and recomposition phases are enabled by bijective functions. Bijectivity ensures that a decomposed item can be uniquely recomposed into its original form, which cannot be achieved with one-way functions, i.e., hash functions.

¹ According to the inequality $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} f_i \le \sqrt{\sum_{i=1}^{n} f_i^2}$.


For hash-function-based solutions, an item's frequency can falsely be retrieved as the sum of the frequencies of multiple data items, due to hashing collisions. Hashing collisions aggravate the situation especially when the space budget is small and the value of n is big, as in web data streaming applications. For XY-sketch, there can be errors caused by the approximation of conditional probabilities in the decomposition-and-recomposition framework. We therefore conduct a detailed analysis to gain theoretical confidence in bounding such errors. Extensive experiments on real and synthetic datasets show that our proposals are effective, especially when the space budget is small.

Our contributions can be summarized as follows.

• We propose a novel sketching technique, called XY-sketch, which utilizes the decomposition-and-recomposition framework.
• We conduct detailed theoretical analysis of the estimation error bounds to gain insights into scalability performance.
• We propose both a basic structure and an extended structure for XY-sketch. We also investigate corresponding optimization techniques for further enhancing the frequency estimation accuracy.
• We conduct extensive experiments on both real and synthetic datasets to evaluate the scalability of XY-sketch.

The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 investigates the decomposition-and-recomposition framework, which formulates the basic structure of XY-sketch. Section 4 provides detailed theoretical analysis of the estimation error bounds. Section 5 proposes the extended structure of XY-sketch, in association with a series of optimization techniques. Section 6 reports the experimental results. Section 7 concludes the paper.

2 RELATED WORK

CM-sketch [7] consists of d rows, each of which comprises a hash function and a set of w counters. When a new item arrives, for each of the d rows, CM-sketch applies the corresponding hash function to locate a counter and increments it by one; in total, d counters are incremented. For retrieving an item's frequency, CM-sketch applies the d hash functions to find d counters and reports the minimal value among them. CU-sketch [9] is similar to CM-sketch, except that it adopts conservative updating, which only increments the counter(s) with the minimum value among the d mapped counters.
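The update/query logic of CM-sketch and the conservative-update variant just described can be illustrated with a short sketch. The following Python code is a minimal illustration, not the papers' reference implementation; the salted use of Python's built-in hash is an assumption standing in for the per-row hash functions:

    import random

    class CMSketch:
        """Minimal Count-Min sketch: d rows of w counters, one hash per row."""
        def __init__(self, d, w, seed=0):
            self.d, self.w = d, w
            rng = random.Random(seed)
            self.salts = [rng.getrandbits(32) for _ in range(d)]  # per-row hash salts
            self.rows = [[0] * w for _ in range(d)]

        def _col(self, j, item):
            return hash((self.salts[j], item)) % self.w

        def update(self, item, conservative=False):
            cols = [self._col(j, item) for j in range(self.d)]
            if conservative:  # CU-sketch: only raise the minimum counter(s)
                m = min(self.rows[j][c] for j, c in enumerate(cols))
                for j, c in enumerate(cols):
                    if self.rows[j][c] == m:
                        self.rows[j][c] += 1
            else:             # CM-sketch: increment one counter per row
                for j, c in enumerate(cols):
                    self.rows[j][c] += 1

        def query(self, item):
            # report the minimum of the d mapped counters (never underestimates)
            return min(self.rows[j][self._col(j, item)] for j in range(self.d))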

C-sketch [5] has the same structure as CM-sketch, except that it maintains an extra hash function for each row, which maps the arriving item to {−1, 1}. The extra hash function thus determines whether the corresponding counter should be updated positively or negatively. For retrieving an item's frequency, C-sketch reports the median of the d mapped counters. CM-sketch and C-sketch are considered the two basic sketching techniques, of which most existing sketching techniques, such as Bias-sketch [6], A-sketch [19], and Cold Filter [30], are variants.

Bias-sketch [6] improves over C-sketch and CM-sketch by taking extra storage for samples of streaming items in order to avoid biased estimation. Bias-sketch focuses on recording and recovering the entire data stream, a target different from the item frequency estimation considered in our work. A-sketch [19] and Cold Filter [30] both use filters as auxiliary structures associated with basic sketches, e.g., CM-sketch. In particular, A-sketch uses filters for high-frequency items, whereas Cold Filter uses filters for low-frequency items. Hence, A-sketch achieves high accuracy in estimating high-frequency items, while Cold Filter adopts a well-devised two-layered structure to achieve good accuracy for low-frequency item estimation. Meanwhile, Cold Filter incurs parameter tuning issues for automatic configuration in practice.

Recently, MV-sketch [21] has studied heavy hitter and heavy change queries, which differ from the item frequency estimation considered in this paper. SketchLearn [13] uses adaptive statistical inference to relieve users of approximate measurement burdens. Although it is versatile in addressing many types of queries, its performance on basic queries, e.g., point queries, is limited compared with CM-sketch, C-sketch, and their variants. There also exist other types of sketching techniques for various purposes. For example, OM-sketch [29] and Pyramid sketch [27] avoid counter overflows. Ada-sketch [20] achieves better frequency estimation for recent items than for old ones, by using techniques from digital Dolby noise reduction. Odd sketch [18] is a compact binary sketch for estimating the similarity of two sets, which is relevant in applications such as web duplicate detection and collaborative filtering.

3 BASIC STRUCTURE OF XY-SKETCH

3.1 Preliminaries

We consider a standard model, called the cash register model. Suppose a data stream $S_N$ with N items and n distinct items. The stream $S_N$ can be represented by a sequence $\langle e_1, ..., e_N \rangle$, where each item $e_i$ takes a value from the item set X ($e_i \in X$). Notice that the items in $X = \{x_1, ..., x_n\}$ are distinct, i.e., $x_i \ne x_j$. The frequency $f_i$ equals the number of times item $x_i$ appears in the stream $S_N$. Next, we formally define bijective functions and basic elements, which form the foundation of the decomposition-and-recomposition framework.

Table 2: List of notations

    S_N                          data stream of N items
    N                            total number of items
    n                            total number of distinct items
    X                            domain of items, |X| = w^d
    Y                            domain of elements, |Y| = w
    x_i ∈ X                      an item in X
    y_i^(j) ∈ Y                  j-th element of x_i
    Y_{d×w}                      matrix of d rows and w columns
    Y = ⟨Y^(1), ..., Y^(d)⟩      random variables for the d elements
    bit(x_i, k)                  k-th bit (from the right) of x_i
    f_i ($\hat{f}_i$)            (estimated) frequency of x_i
    ϖ and ϖ⁻¹                    bijective function and its inverse
    d                            an item has d elements
    b                            an element has b bits


Definition 1 (Bijective Function ϖ). ϖ is a bijective function that maps a data item to a sequence of d elements, formally $\varpi: X \to Y^d$. For example, given item $x_i \in X$, we have $\varpi(x_i) = \langle y_i^{(1)}, y_i^{(2)}, ..., y_i^{(d)} \rangle$ and $\varpi^{-1}(\langle y_i^{(1)}, y_i^{(2)}, ..., y_i^{(d)} \rangle) = x_i$, where $\{y_i^{(j)} \in Y\}_{1 \le j \le d}$ are the elements of $x_i$.

In the stream, an item or an element can equivalently be viewed as a sequence of bits, or a binary string. We assume that each element has b bits and each item has d elements, and thus d × b bits, so that an item's length is an integral multiple, d, of an element's length.² Essentially, an element of an item is an ordered arrangement of a subset of b bits of the item. Within an item, all d elements are equal-sized and mutually exclusive.

The bijective function ϖ represents the one-to-one correspondence between the two domains X and $Y^d$. Once the bijective function is given, an item can be uniquely identified by its corresponding sequence of elements, and vice versa. We show an example in Figure 1. Given item $x_i$ = 101101011, it can be decomposed into three elements $y_i^{(1)}$, $y_i^{(2)}$, and $y_i^{(3)}$ with ϖ. Element $y_i^{(1)}$ = 101 is derived by taking the 1st, 5th, and 6th bits of $x_i$. Following the corresponding relation shown in Figure 1, $y_i^{(2)}$ = 011 and $y_i^{(3)}$ = 101 can be obtained similarly. Inversely, we can recompose $x_i$ from $\{y_i^{(j)}\}_{1 \le j \le 3}$ with $\varpi^{-1}$, the inverse function of ϖ.

There can be (d · b)! possible ways of mapping between items and elements, if an item consists of d elements and an element consists of b bits. Next, we introduce the concept of a random permutation. Based on that, we define the random bijective function, which is general and therefore forms the basis for the theoretical analysis of XY-sketch.

Definition 2 (Random Permutation). Given a sequence of integers ⟨1, 2, ..., d · b⟩, a random permutation $\langle q_1, ..., q_s \rangle$ (s = d · b) can be obtained by applying a randomly selected permutation of (d · b) elements to the sequence, or, equivalently, by choosing a random element from the set of distinct permutations of the sequence.

Given an item $x_i$, let bit($x_i$, k) be the k-th bit of the binary form of $x_i$, so that $x_i = \sum_{k=1}^{s} bit(x_i, k) \cdot 2^{k-1}$, where s = d · b. We then formally define the random bijective function with a random permutation and the function bit(·,·) in Definition 3. Since it is easy to prove that the mapping ϖ* defined in Definition 3 is a bijection, we omit the proof due to page limits.

Definition 3 (Random Bijective Function ϖ*). ϖ* is a random mapping with a random permutation $Q = \langle q_1, ..., q_s \rangle$. For any $x_i \in X$, $\varpi^*(x_i) = \langle y_i^{(1)}, y_i^{(2)}, ..., y_i^{(d)} \rangle$, where $y_i^{(j)} = \sum_{k=1}^{b} bit(x_i, q_{(j-1) \cdot b + k}) \cdot 2^{k-1}$, for j = 1, ..., d.

Unless explicitly noted, the random bijective function is used for item decomposition and recomposition by default.
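A random bijective function of this kind can be realized by permuting the d · b bit positions and grouping them into d groups of b bits. The following Python sketch is illustrative only (the name make_bijection is ours, and bit positions are 0-indexed rather than 1-indexed as in Definition 3):

    import random

    def make_bijection(d, b, seed=0):
        """Random bijective mapping (Definition 3): permute the d*b bit
        positions of an item, then group them into d elements of b bits."""
        s = d * b
        perm = list(range(s))
        random.Random(seed).shuffle(perm)  # the random permutation Q

        def decompose(x):
            bits = [(x >> k) & 1 for k in range(s)]  # bit(x, k+1), right to left
            return [sum(bits[perm[j * b + k]] << k for k in range(b))
                    for j in range(d)]

        def recompose(elems):
            x = 0
            for j, e in enumerate(elems):
                for k in range(b):
                    x |= ((e >> k) & 1) << perm[j * b + k]
            return x

        return decompose, recompose

Because perm is a permutation, every bit of x lands in exactly one element, so recompose(decompose(x)) == x for any x < 2^(d·b).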

3.2 Decomposition and Recomposition

Data Structure. XY-sketch maintains a d × w matrix $Y_{d \times w}$ of counters. Suppose each item consists of d elements. Each of the d rows corresponds to the position at which an element is ordered within the item, and each of the w columns corresponds to a possible value of an element. Initially, all counters of the matrix are set to zero. The mapping from stream items to matrix counters is implemented in accordance with ϖ*, as detailed below.

² If items are of different lengths, we can round them up to the same length. For example, items in double-precision floating-point format can be rounded up to a length of 64, meaning that they consist of 64 bits.

Figure 1: Schematic diagram of the decomposition-and-recomposition framework (b = 3, d = 3).

Figure 2: Basic Structure

Decomposition. Upon receiving an item from the data stream, XY-sketch decomposes it into a sequence of elements with ϖ*. For each element, the corresponding counter of the matrix is retrieved and incremented by one. The process repeats until all d elements of the item are processed. Therefore, the time complexity is O(d). An example is shown in Figure 2, where a 2 × 4 matrix is utilized by XY-sketch for handling the data stream ⟨0, 1, 2, ..., 15, 6, 7⟩. Suppose item 6 is currently to be handled. It is decomposed into two elements, 01 and 10, by function ϖ*. The first element increments the counter at (row 1, column 01) by one. The second element increments the counter at (row 2, column 10) by one. After that, the decomposition processing of item 6 is done. Similarly, the next item, 7, is decomposed into 01 and 11, and the counters at (row 1, column 01) and (row 2, column 11) are incremented by one, respectively.

We formalize the decomposition phase in Algorithm 1.


Algorithm 1 Decomposition Phase
1: Y_{d×w} ← 0
2: while data item x_i in S_N arrives do
3:   ⟨y_i^(1), y_i^(2), ..., y_i^(d)⟩ ← ϖ*(x_i)
4:   for j = 1 to d do
5:     Y_{d×w}[j, y_i^(j)] ← Y_{d×w}[j, y_i^(j)] + 1
6:   end for
7: end while

When data item $x_i$ arrives, XY-sketch first splits $x_i$ into d elements (line 3). Then, the counters of the d elements in $Y_{d \times w}$ are found and incremented by 1 (lines 4-6).
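As a concrete illustration of Algorithm 1, the following Python sketch maintains the counter matrix and the running total N; it reuses the illustrative make_bijection helper defined above (the class and its names are our assumptions, not the paper's code):

    class XYSketch:
        """Basic XY-sketch: a d x w counter matrix updated via the random
        bijective function (Algorithm 1)."""
        def __init__(self, d, b, seed=0):
            self.d, self.w = d, 2 ** b
            self.decompose, self.recompose = make_bijection(d, b, seed)
            self.Y = [[0] * self.w for _ in range(d)]  # Y_{d x w}, all zeros
            self.N = 0                                  # total items seen

        def update(self, x):
            self.N += 1
            for j, e in enumerate(self.decompose(x)):  # one counter per row
                self.Y[j][e] += 1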

Recomposition. The recomposition phase serves query evaluation. In this work, we consider point queries, the most basic type of sketch-based query.

Algorithm 2 Recomposition Phase
Input: a two-dimensional array Y_{d×w} and item x_i
Output: the estimated frequency f̂_i of item x_i
1: ⟨y_i^(1), y_i^(2), ..., y_i^(d)⟩ ← ϖ*(x_i)
2: sum ← Σ_{k=1}^{w} Y_{d×w}[1, k]; f̂_i ← sum
3: for j = 1 to d do
4:   f̂_i ← f̂_i × Y_{d×w}[j, y_i^(j)] / sum
5: end for
6: return f̂_i

The process is depicted in Algorithm 2. Upon receiving a point query for the frequency of an item $x_i$, XY-sketch decomposes $x_i$ into a sequence of d elements with ϖ*. For each of the d elements, we first find its corresponding counter in the respective row of the matrix. Let $y_i^{(j)}$ be the j-th element of $x_i$. For each row, the probability that $Y^{(j)}$ takes the value $y_i^{(j)}$ is estimated by $Y_{d \times w}[j, y_i^{(j)}]/N$.³ The value of N can be derived by summing the counters of any row of the matrix. Finally, we take the product of all d probabilities and N, which equals the estimated frequency $\hat{f}_i$, according to Equation 3. The result is returned as the frequency of item $x_i$ to answer the query (lines 3-6). Since there are d probabilities to be found, the time complexity is O(d). Next, we show an example of the calculation in the recomposition phase.

In Figure 2, there are in total 18 items in the stream, N = 18, which can be calculated as either 4 + 6 + 4 + 4 or 4 + 4 + 5 + 5. The counters corresponding to item 7's two elements, 01 and 11, have values 6 and 5, respectively. The frequency of item 7 can thus be estimated by 18 × (6/18) × (5/18) ≈ 1.67, while the exact frequency of item 7 is 2. The accuracy for estimating $f_7$ is thus 1.67/2 ≈ 84%. We will show that the estimation is very effective with respect to the scalability issues. More details on the error bound analysis are given in Section 4. Next, we explain the reasoning behind the calculation of the recomposition phase.
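Before that, the query side can be stated compactly. The helper below (xy_query is a hypothetical name) follows Algorithm 2 against the illustrative XYSketch class above; for the Figure 2 example it returns 18 · (6/18) · (5/18) ≈ 1.67:

    def xy_query(sk, x):
        """Estimate f_i = N * prod_j Pr(Y^(j) = y_i^(j))  (Algorithm 2)."""
        est = float(sk.N)
        for j, e in enumerate(sk.decompose(x)):
            est *= sk.Y[j][e] / sk.N  # per-row probability Y[j, y^(j)] / N
        return est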

³ $Y_{d \times w}[j, y_i^{(j)}]$ records, among the N data items, the number of items whose j-th basic element after decomposition is $y_i^{(j)}$. Therefore, the probability that the j-th element takes the value $y_i^{(j)}$ is $Y_{d \times w}[j, y_i^{(j)}]/N$.

3.3 Analysis

The decomposition-and-recomposition framework is conceived and formulated with detailed analysis on balancing the tradeoff between space and accuracy. Let X be the random variable for the distinct item values of the stream. The probability Pr(X = $x_i$) indicates the possibility that an item takes the value $x_i$, satisfying $\sum_{1 \le i \le n} Pr(X = x_i) = 1$. So, the frequency $f_i$ of item $x_i$ can be evaluated as the product of the total number of items N and the probability Pr(X = $x_i$):

$$f_i = N \cdot Pr(X = x_i) = N \cdot Pr(Y^{(1)} = y_i^{(1)}, Y^{(2)} = y_i^{(2)}, ..., Y^{(d)} = y_i^{(d)}) \quad (1)$$

Here, $Y = \langle Y^{(1)}, ..., Y^{(d)} \rangle$ represents a sequence of random variables for the distinct element values obtained through the decomposition phase. Then, $Pr(Y^{(j)} = y_i^{(j)})$ is the probability that the j-th element of item $x_i$ takes the value $y_i^{(j)}$. By the chain rule of probability theory, Equation 1 can be expanded as follows:

$$f_i = N \cdot Pr(Y^{(1)} = y_i^{(1)}) \cdot Pr(Y^{(2)} = y_i^{(2)} \mid Y^{(1)} = y_i^{(1)}) \cdots Pr(Y^{(d)} = y_i^{(d)} \mid Y^{(1)} = y_i^{(1)}, ..., Y^{(d-1)} = y_i^{(d-1)}) \quad (2)$$

Evaluating the exact conditional probabilities in Equation 2 is costly. For an element of length b, there can be $2^b$ different possible values of the element. The calculation of Equation 2 then requires storing $O(w^d)$ items, which is unaffordable and violates the compactness requirement of sketching techniques.⁴ To this end, we study how to accurately approximate the calculation of the conditional probabilities, so as to accurately approximate and accelerate the calculation of the item frequency. Theoretically, if the random variable $Y^{(j)}$ is independent of, or weakly dependent on, the variables $\{Y^{(k)}\}_{k \ne j}$, the conditional probability can be well approximated, and thus simplified, by its unconditional counterpart. More details are covered in the analysis part (Section 4). The estimation of $f_i$ can thus be written as follows:

$$\hat{f}_i = N \prod_{1 \le j \le d} Pr(Y^{(j)} = y_i^{(j)}), \text{ where } Pr(Y^{(j)} = y_i^{(j)}) = \frac{Y_{d \times w}[j, y_i^{(j)}]}{\sum_{k=1}^{w} Y_{d \times w}[j, k]} = \frac{Y_{d \times w}[j, y_i^{(j)}]}{N} \quad (3)$$

This way, the space complexity is reduced from $O(w^d)$ to $O(wd)$, since only $\{Pr(Y^{(j)} = y_i^{(j)})\}$ is needed in the recomposition phase to evaluate $\hat{f}_i$.

⁴ Intuitively, to calculate the $Pr(Y^{(d)} = y_i^{(d)} \mid Y^{(1)} = y_i^{(1)}, ..., Y^{(d-1)} = y_i^{(d-1)})$ part of Equation 2, one needs to maintain a one-dimensional array with $w^d$ elements, $Y_{w^d}$, each of which represents the exact frequency of one data item, so that $Pr(Y^{(d)} = y_i^{(d)} \mid Y^{(1)} = y_i^{(1)}, ..., Y^{(d-1)} = y_i^{(d-1)}) = Y_{w^d}[y_i^{(1)} w^{d-1} + y_i^{(2)} w^{d-2} + \cdots + y_i^{(d)}] \,\big/\, \sum_{k = y_i^{(1)} w^{d-1} + \cdots + y_i^{(d-1)} w}^{\,y_i^{(1)} w^{d-1} + \cdots + y_i^{(d-1)} w + w - 1} Y_{w^d}[k]$.

3.4 Discussion

In this section, we introduced the basic structure of XY-sketch, which is easy to implement and deploy. The basic data structure is a two-dimensional matrix $Y_{d \times w}$, where $w = 2^b$.


The mechanism of the basic structure follows the decomposition-and-recomposition framework. During the query phase, the desired frequency can be estimated by collecting and evaluating the relevant elements in a probabilistic manner.

In the extreme case, if d equals 1, the decomposition mapping $X \to Y^d$ degenerates into $X \to Y$, so that an element degenerates into an item. This way, |X| equals n, meaning that XY-sketch degenerates into a frequency histogram, which takes O(n) space, although the estimation accuracy can be 100%. Again, the setting d = 1 makes the solution not scalable and thus violates the compactness requirement of sketching. So, in practice, the value of d is often greater than one. In fact, a larger d corresponds to better space efficiency of the structure. The detailed analysis enabling the tuning of the trade-off between space efficiency and estimation accuracy is shown in the subsequent sections.

So far, several questions remain to be answered. 1) How good is the estimation $\hat{f}_i$ compared with state-of-the-art solutions, e.g., CM-sketch? 2) Parameter w is set to a power of 2; in such settings, the given space may not be fully used. We tackle these challenges in the following sections. In particular, the first question is covered in Section 4, and the second in Section 5.

4 ANALYSIS

In this section, we show the error bound analysis of XY-sketch. We first derive the general error bound, based on which we analyze the N- and n-scalability. Then, we study the error bounds under uniform and Zipfian distributions, gaining insights into the sketching properties and performance.

4.1 General Error Bound and Scalability Analysis

We hereby investigate a general error bound for the item frequency estimation, as shown in Theorem 1.

Theorem 1. If we randomly select one of the n items, then with probability at least 1 − δ, we have $|\hat{f}_i - f_i| \le \frac{2}{n\delta} N$ for the XY-sketch item frequency estimation.

Proof. Let $\zeta_i = |\hat{f}_i - f_i|$ be the random variable of the estimation error. Since $\zeta_i$, $\hat{f}_i$, and $f_i$ are all positive, it holds that $\zeta_i \le \max\{\hat{f}_i, f_i\}$. Since $\sum_i f_i = N$ and $\sum_i \hat{f}_i \le N$ (Lemma 1), we have $\sum_{i=1}^{n} \hat{f}_i + \sum_{i=1}^{n} f_i \le 2N$. Therefore,

$$\sum_{i=1}^{n} \zeta_i \le \sum_{i=1}^{n} \max\{\hat{f}_i, f_i\} \le \sum_{i=1}^{n} \hat{f}_i + \sum_{i=1}^{n} f_i \le 2N \quad (4)$$

Then, we use reductio ad absurdum to prove the theorem. In the case of δn < 1, if there exists $i' \in \{1, 2, ..., n\}$ such that $\zeta_{i'} > \frac{2}{n\delta} N$, then $\sum_{i=1}^{n} \zeta_i > 2N$ (since $\frac{2}{n\delta} N > 2N$ when δn < 1), which contradicts Equation 4. In the other case, let δ < k < 1, and suppose that the error $\zeta_i$ is greater than $\frac{2}{n\delta} N$ with probability k. Then there exist pairwise distinct $i_1, i_2, ..., i_{kn} \in \{1, 2, ..., n\}$ such that $\zeta_{i_j} > \frac{2}{n\delta} N$ for j = 1, 2, ..., kn. If so, $\sum_{j=1}^{kn} \zeta_{i_j} > kn \cdot \frac{2}{n\delta} N = \frac{2k}{\delta} N > 2N$ (since k > δ), which again contradicts Equation 4. Therefore, the theorem is proved. □

Lemma 1. $\sum_{i=1}^{n} \hat{f}_i \le N$.

Proof. Let $\hat{p}_i = \prod_{1 \le j \le d} Pr(Y^{(j)} = y_i^{(j)})$. Since $\hat{f}_i = N \hat{p}_i$ (Equation 3), we have $\sum_{i=1}^{n} \hat{f}_i = N \sum_{i=1}^{n} \hat{p}_i$. Since $\forall j \in [1, d]$, $\sum_{k=1}^{w} Pr(Y^{(j)} = k) = 1$, we have $\sum_{i=1}^{n} \hat{p}_i \le 1$. Therefore, $\sum_{i=1}^{n} \hat{f}_i \le N$. □

Based on Theorem 1, we can see why the error bound of XY-sketch is tighter than that of CM-sketch when the space budget of both sketches is set to w · d. Recall that the error bound of CM-sketch is $\frac{eN}{w}$, with probability at least 1 − δ [7], while the error bound of XY-sketch is $\frac{2N}{n\delta}$, with probability at least 1 − δ. With the same probability at least 1 − δ, letting $\frac{2}{n\delta} < \frac{e}{w}$, we find that $n > \frac{2w}{e\delta}$, or equivalently $w < \frac{en\delta}{2}$. Therefore, if the number of distinct items n is large (higher than $\frac{2w}{e\delta}$), or the space budget is small (w lower than $\frac{en\delta}{2}$), the error bound of XY-sketch is tighter than that of CM-sketch. That is, XY-sketch outperforms CM-sketch when the space budget is small, or when the number of distinct items in the stream is large. This conclusion is consistent with the experimental results in Section 6.

Theorem 1 can also be used for analyzing the scalability of XY-sketch. For example, when N is fixed, the error bound shrinks as the number of distinct items n increases. This means that XY-sketch has good n-scalability.

n-scalability. Based on Table 1, after transforming C-sketch's error bound from the L2 norm to the L1 norm, we can see that neither C-sketch's nor CM-sketch's n-scalability is as good as XY-sketch's. For example, the error bound of C-sketch shrinks to $\sqrt{n/n_1}$ of the original when n increases to $n_1$, while the error bound of XY-sketch shrinks to $n/n_1$ of the original, leading to a lower error bound. Because most existing sketching techniques, such as A-sketch and Cold Filter, are based on CM-sketch or C-sketch, their n-scalability is likewise not as good as that of XY-sketch. Remarkably, the error bound of XY-sketch is $\frac{2}{n\delta} N$, which decreases as n increases, if N and the space budget are fixed.

N-scalability. XY-sketch achieves better N-scalability than its competitors under certain conditions. Take CM-sketch as an example. In the case of $n > \frac{2w}{e\delta}$, or $w < \frac{en\delta}{2}$, the factor $\frac{2}{\delta n}$ is smaller than the factor $\frac{e}{w}$. Therefore, the growth of XY-sketch's error bound is smaller than the growth of CM-sketch's error bound as N increases, meaning that XY-sketch has better N-scalability in this case.

4.2 Error Bounds under Detailed Distributions

In this part, we analyze the estimation error bound when items follow uniform and Zipfian distributions, respectively.

First of all, for ease of presentation, we define several symbols for representing probabilities. Let $\phi(Y^{(1)} = y_i^{(1)}, Y^{(2)} = y_i^{(2)}, ..., Y^{(j)} = y_i^{(j)})$ be the number of data items in the stream $S_N$ whose 1st element equals $y_i^{(1)}$, ..., and whose j-th element equals $y_i^{(j)}$. We abbreviate $\phi(Y^{(1)} = y_i^{(1)}, ..., Y^{(j)} = y_i^{(j)})$ as $\phi(y_i^{(1)}, ..., y_i^{(j)})$, and use $CPr(Y^{(j)} = y_i^{(j)})$ to denote the conditional probability $Pr(Y^{(j)} = y_i^{(j)} \mid Y^{(1)} = y_i^{(1)}, ..., Y^{(j-1)} = y_i^{(j-1)})$. Then, we obtain the following equations:

$$CPr(Y^{(j)} = y_i^{(j)}) = \frac{\phi(y_i^{(1)}, y_i^{(2)}, ..., y_i^{(j)})}{\phi(y_i^{(1)}, y_i^{(2)}, ..., y_i^{(j-1)})} \quad (5)$$

$$Pr(Y^{(j)} = y_i^{(j)}) = \frac{\phi(y_i^{(j)})}{\sum_{k=0}^{w-1} \phi(Y^{(j)} = k)} = \frac{\phi(y_i^{(j)})}{N} \quad (6)$$

Next, we show the relationship between Equations 5 and 6, which can be used for bounding the estimated probability under various item distributions.

Theorem 2. It holds that ∀j ∈ [1, d], ∀$x_i \in S_N$, and ∀$y_i^{(j)} \in [0, w-1]$:

$$\min_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\} \;\le\; Pr(Y^{(j)} = y_i^{(j)}) \;\le\; \max_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\}$$

Proof. We first show the case j = 2, then generalize to j > 2. We use reductio ad absurdum. Suppose the estimated probability $Pr(Y^{(2)} = y_i^{(2)})$ is always greater than the conditional probability $CPr(Y^{(2)} = y_i^{(2)})$. With Equations 5 and 6, we get:

$$\frac{\phi(y_i^{(2)})}{N} > \frac{\phi(Y^{(1)} = k, Y^{(2)} = y_i^{(2)})}{\phi(Y^{(1)} = k)}, \quad k \in [0, w-1]$$

By a simple transformation, we obtain the inequality $\phi(y_i^{(2)}) \cdot \phi(Y^{(1)} = k) > \phi(Y^{(1)} = k, Y^{(2)} = y_i^{(2)}) \cdot N$. The inequality holds for different values of k, ranging from 0 to w − 1. We thus get w inequalities by substituting k with the values from 0 to w − 1. Summing the left and right sides of these w inequalities, respectively, yields:

$$\phi(y_i^{(2)}) \cdot \sum_{k=0}^{w-1} \phi(Y^{(1)} = k) > N \cdot \sum_{k=0}^{w-1} \phi(Y^{(1)} = k, Y^{(2)} = y_i^{(2)})$$

Since all values ({0, 1, ..., w − 1}) of the basic elements are taken into account, we have $\sum_{k=0}^{w-1} \phi(Y^{(1)} = k) = N$ and $\sum_{k=0}^{w-1} \phi(Y^{(1)} = k, Y^{(2)} = y_i^{(2)}) = \phi(y_i^{(2)})$. Finally, we obtain $\phi(y_i^{(2)}) \cdot N > \phi(y_i^{(2)}) \cdot N$, a contradiction. Therefore, the estimated probability $Pr(Y^{(2)} = y_i^{(2)})$ is bounded by the minimum and maximum values of the conditional probabilities $CPr(Y^{(2)} = y_{i'}^{(2)})$, where $x_{i'} \in S_N$. The argument extends similarly to the case j > 2. □

Theorem 2 gives the upper and lower bounds of the estimated probability $Pr(Y^{(j)} = y_i^{(j)})$. The bounds are expressed in terms of conditional probabilities. However, such conditional probabilistic bounds are difficult to derive in practice. To this end, we consider two representative distributions for which closed-form error bounds can be evaluated.

Error Bounds vs. Uniform Distributions. If the data items of the stream follow a uniform distribution, the counter values $Y_{d \times w}[j, y_i^{(j)}]$ in each row will be close to one another, for j = 1, 2, ..., d and ∀$x_i \in S_N$. Therefore, with large probability, the conditional probability of the most frequent item achieves its maximum value, so we get:

$$\max_{x_i \in S_N}\{f_i\} = N \cdot \prod_{j=1}^{d} \max_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\}$$

Similarly, the following holds:

$$\min_{x_i \in S_N}\{f_i\} = N \cdot \prod_{j=1}^{d} \min_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\}$$

Let $R = \max_{x_i \in S_N}\{f_i\} - \min_{x_i \in S_N}\{f_i\}$ be the range of the item frequencies. The estimation error bound for the uniform case is then as shown in Theorem 3.

Theorem 3. Under a uniform distribution, it holds that ∀$x_i \in S_N$, $|f_i - \hat{f}_i| \le R$, where $R = \max_{x_i \in S_N}\{f_i\} - \min_{x_i \in S_N}\{f_i\}$.

Proof. For ∀$x_i \in S_N$, ∀j ∈ [1, d], and ∀$y_i^{(j)} \in [0, w-1]$, we have $|CPr(Y^{(j)} = y_i^{(j)}) - Pr(Y^{(j)} = y_i^{(j)})| \le |\max_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\} - \min_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\}|$, according to Theorem 2. Thus, combining Equations 2 and 3, the exact frequency $f_i$ and the estimated frequency $\hat{f}_i$ are both lower than N times the product of the maximum conditional probabilities and higher than N times the product of the minimum conditional probabilities. Thus, we have:

$$|f_i - \hat{f}_i| \le \prod_{j=1}^{d} \max_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\} \cdot N \;-\; \prod_{j=1}^{d} \min_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\} \cdot N,$$

which is equivalent to $|f_i - \hat{f}_i| \le R$. □

From Theorem 3, if data items follow a uniform distribution, the value of the range R will be very small, meaning that the error bound of XY-sketch is small. This conclusion can also be drawn from Theorem 2: under a uniform distribution, $\min_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\}$ is close to $\max_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\}$. That is, for each row of XY-sketch, the estimated probability is very close to the true probability. Therefore, the final error of XY-sketch will also be small.

Error Bounds vs. Skewed Distributions. We now show the error bound of the item frequency estimation when the data items in the stream follow a skewed distribution. We use the Zipfian distribution to model skewed distributions, following the setting of existing works [8][4][28]. A Zipfian distribution has a parameter z (here z > 1) and a constant $C_z$. Correspondingly, in the skewed setting only, $f_i$ denotes the frequency of the i-th most frequent item. Recall that items are in the range [1, 2, ..., n], where n is the number of distinct items. The constant $C_z$ is then determined by z and n, since $\sum_{i=1}^{n} \frac{C_z}{i^z} = 1$. Therefore, the following holds:

$$\int_{k}^{k+1} \frac{C_z}{i^z}\, di \;\le\; \frac{C_z}{k^z} \cdot 1 \;\le\; \int_{k-1}^{k} \frac{C_z}{i^z}\, di$$


After extending the upper limit of integration in the above formula to infinity, we get:

$$\int_{k}^{+\infty} \frac{C_z}{i^z}\, di \;\le\; \sum_{i=k}^{n} \frac{C_z}{i^z} \;\le\; \int_{k-1}^{+\infty} \frac{C_z}{i^z}\, di$$

With some transformations, we obtain the following inequality:

$$\frac{C_z k^{1-z}}{z-1} \;\le\; \sum_{i=k}^{n} \frac{C_z}{i^z} \;\le\; \frac{C_z (k-1)^{1-z}}{z-1}$$

From this derivation for the Zipfian distribution, we can derive the error bound of cold items. Here, we define cold items as the set of items excluding the k most frequent items; the cold items can be represented as $CI = \{x_{k+1}, x_{k+2}, ..., x_n\}$. We focus on CI, which constitutes the vast majority of data items in the stream, since the frequencies of the top-k data items $\{x_1, ..., x_k\}$ can almost be exactly estimated with little extra space, such as a filter for A-sketch [19] and CM-sketch [7].

According to Theorem 2, it can reasonably be assumed that $Nk \cdot \frac{1}{w^d} \le \sum_{i=1}^{k} \hat{f}_i \le \sum_{i=1}^{k} f_i$, since we have $\frac{1}{w} \le Pr(Y^{(j)} = y_i^{(j)}) \le CPr(Y^{(j)} = y_i^{(j)})$ for any j ∈ [1, 2, ..., d] and any i ∈ [1, 2, ..., k]. Based on that, we can derive the error bound for cold items as follows.

Theorem 4. For a Zipfian distribution with parameter z and scaling constant $C_z$, with probability at least 1 − δ, the error bound of cold items is $\frac{N}{(n-k)\delta} \cdot \left(\frac{C_z k^{1-z}}{z-1} + \left(1 - \frac{k}{w^d}\right)\right)$.

Proof. Since the data items follow a Zipfian distribution, we get $\sum_{i=k+1}^{n} \frac{C_z}{i^z} \le \int_{k}^{+\infty} \frac{C_z}{i^z}\, di$. That is, $\sum_{i=k+1}^{n} f_i \le N \frac{C_z k^{1-z}}{z-1}$. Accordingly, we get $\sum_{i=1}^{k} \hat{f}_i \ge \frac{Nk}{w^d}$. Therefore, $\sum_{i=k+1}^{n} \hat{f}_i \le N - \sum_{i=1}^{k} \hat{f}_i \le N \cdot (1 - \frac{k}{w^d})$.

Let $\zeta_i = |\hat{f}_i - f_i|$ be the random variable of the estimation error, where i ∈ [k+1, k+2, ..., n]. Since $\hat{f}_i$ and $f_i$ are both positive, it holds that $\zeta_i \le \max\{\hat{f}_i, f_i\}$. Therefore,

$$\sum_{i=k+1}^{n} \zeta_i \le \sum_{i=k+1}^{n} \hat{f}_i + \sum_{i=k+1}^{n} f_i \le N \left(\frac{C_z k^{1-z}}{z-1} + \left(1 - \frac{k}{w^d}\right)\right)$$

Then, we use reductio ad absurdum to prove the theorem. Suppose u is a probability value satisfying δ < u < 1, and suppose the random variable $\zeta_i$ is greater than $\frac{N}{(n-k)\delta}(\frac{C_z k^{1-z}}{z-1} + (1 - \frac{k}{w^d}))$ with probability u. This means there exist $\zeta_{i_1}, \zeta_{i_2}, ..., \zeta_{i_{u(n-k)}}$ such that $\zeta_{i_j} > \frac{N}{(n-k)\delta}(\frac{C_z k^{1-z}}{z-1} + (1 - \frac{k}{w^d}))$, for j = 1, 2, ..., u(n−k). Thus, $\sum_{j=1}^{u(n-k)} \zeta_{i_j} > \frac{u}{\delta} N (\frac{C_z k^{1-z}}{z-1} + (1 - \frac{k}{w^d})) > N (\frac{C_z k^{1-z}}{z-1} + (1 - \frac{k}{w^d}))$ since u > δ, which contradicts the inequality above. Therefore, the theorem is proved. □

5 EXTENSIONS

5.1 Extended Structure of XY-sketch

XY-sketch's basic structure is in the form of a matrix with d rows and w columns. Let $w_i$ be the number of elements of row i. For the basic structure, $w_i$ is limited to a power of 2 and all $w_i$ are equal, satisfying $\prod_{i \le d} w_i = |X|$. Due to this setting of $w_i$, an arbitrary amount of space β may not be fully padded by the basic structure, leaving part of the allocated space unused for sketching stream items. Hereby, we study how to extend the basic structure in a way such that the space can be fully utilized and hence the estimation accuracy can be improved.

In particular, we extend the basic XY-sketch structure by setting the $w_i$ to tunable values. This way, the structure of XY-sketch is no longer a set of equal-length rows in a matrix, but a set of rows, long and short, so that each row may contain a different number of elements. We denote the length (number of bits) of an element on row i by $b_i$, satisfying $\sum_i b_i = b \cdot d$, or equivalently $\prod_{i \le d} w_i = |X|$, meaning that the domain of items is preserved, and therefore the recomposition of elements from all the different rows is capable of recovering the original data stream items.

Heuristics. However, finding the optimal setting of the row lengths $\{w_i\}$ is computationally challenging: in total, there are at most $\sum_{i=0}^{b \cdot d - 1} \binom{b \cdot d - 1}{i} = 2^{b \cdot d - 1}$ possible settings. Thus, we design a greedy algorithm to approach the optimal space allocation, as shown in Algorithm 3. Initially, the item has b · d bits and the space budget is β. In the first iteration, we set up row 1, in which each element has $b_1$ bits. For the next round, it is then equivalent to solving the same problem with space (β − $w_1$) for a pseudo item of (b · d − $b_1$) bits. The subroutine is thus invoked recursively.

Algorithm 3 SpaceAlloc(space budget β, pseudo item length b · d)  // w is a global sequence
1: r ← ⌊log β⌋
2: while b · d > 0 do
3:   w′ ← 2^r
4:   if β − w′ < 2 × (b · d − r) then
5:     w′ ← 2^{r−1}
6:   end if
7:   append w′ to w
8:   SpaceAlloc(β − w′, b · d − r)
9: end while

At iteration i, the configuration of $w_i$ should satisfy two conditions. First, the covered pseudo item length r should be maximized (line 1). Recalling Equations 2 and 3, we take the probability $Pr(Y^{(j)} = y_i^{(j)})$ to estimate the exact probability $CPr(Y^{(j)} = y_i^{(j)})$. When j equals 1, the probability $Pr(Y^{(j)} = y_i^{(j)})$ is equal to the probability $CPr(Y^{(j)} = y_i^{(j)})$; that is, the probability logged in the first row of XY-sketch is always the exact probability. So, we prefer to take a large enough space to log this exact probability, and we thus make $b_1$ as large as possible. Then, we view the original item as two parts: part one with these $b_1$ bits, and part two with the remaining (b · d − $b_1$) bits. We regard the remaining (b · d − $b_1$) bits as a new item in a new field. For this new item, we also use the probability $Pr(Y^{(j)} = y_i^{(j)})$ to estimate the probability $CPr(Y^{(j)} = y_i^{(j)})$, so a similar situation occurs. Therefore, we also set $b_2$ large enough within the remaining space.

Second, the remaining space (β − w′) should be adequate for storing the current pseudo items of (b · d − r) bits (line 4). The correctness is guaranteed by Theorem 6, in which c represents these remaining bits (b · d − r). If the remaining space is not enough to process the remaining bits, the given r is too large; we then continue to the next iteration after adjusting r (line 5).

Therefore, according to Algorithm 3, the leftover space is limited. The leftover space is equally divided into u parts, where u = ⌊Lspace/$w_d$⌋; the remainder, which must be less than $w_d$, is ignored. Each of the u parts is of size $w_d$. The u parts are then combined with the d-th row to construct an equal-width histogram of u + 1 buckets. This way, the first d − 1 rows represent the item sub-domain $\prod_{i \le d-1} w_i$, and the last (d-th) row represents the sub-domain $w_d$, satisfying that the product of the two equals $\prod_{i \le d} w_i = |X|$.

Theorem 5. The space cost is decreased by the decomposition operation.

Proof. It is equivalent to prove that if $m = w_1 w_2$ with $w_1, w_2 \in \mathbb{N}^+$, $w_1 \ge 2$, and $w_2 \ge 2$, then $m \ge w_1 + w_2$. Since $m = w_1 w_2$, the problem is to prove $w_1 w_2 \ge w_1 + w_2$, or equivalently $(w_1 - 1) \ge \frac{w_1}{w_2}$. Since $\frac{w_1}{x}$ is a decreasing function of x for x > 0 and $w_2 \ge 2$, we have $\frac{w_1}{w_2} \le \frac{w_1}{2}$; and since $w_1 \ge 2$, we have $\frac{w_1}{2} \le w_1 - 1$. Therefore $w_1 w_2 \ge w_1 + w_2$, with strict inequality when $w_1, w_2 > 2$. □

Theorem 6. Given a series of pseudo items of c bits, the minimum space required for storing them in the decomposition-and-recomposition framework is 2 · c.

Proof. Let $m = \prod_{i=1}^{d} w_i$ and $\beta = \sum_{i=1}^{d} w_i$. We first show that β attains its minimum when $w_1 = w_2 = ... = w_d$. Letting $w_d = m / \prod_{i=1}^{d-1} w_i$, we have $\beta = \sum_{i=1}^{d-1} w_i + \frac{m}{\prod_{i=1}^{d-1} w_i}$ and $\frac{\partial \beta}{\partial w_j} = 1 - \frac{m}{\prod_{i \ne j,\, i < d} w_i} \cdot \frac{1}{w_j^2} = 1 - \frac{w_d}{w_j}$. Solving $\frac{\partial \beta}{\partial w_j} = 0$ gives $w_j = w_d$ as the valley point. Similarly, $\beta = \sum_{i=1}^{d} w_i$ is minimized when $w_1 = w_2 = ... = w_d$. Then, according to Theorem 5, each $w_i$ should be as small as possible; we get $\min(w_i) = 2$, since $w_i = 2^{b_i}$ and $b_i \in \mathbb{N}^+$. Thus, the minimum space is achieved if the decomposition process is applied to pseudo items of c bits until $w_i = 2$ for all i, and the corresponding space cost is 2 · c. □

Error Bounds. We show that the error bound analysis for the basic structure also applies to the extended structure, via Theorems 7 to 9.

Theorem 7. For the extended XY-sketch structure, if we randomly select one of the n items, then with probability at least 1 − δ, we have $|\hat{f}_i - f_i| \le \frac{2}{n\delta} N$.

Proof. As in Theorem 1, let $\zeta_i = |\hat{f}_i - f_i|$ be the random variable of the error. Since $\sum_{i=1}^{n} \hat{p}_i < 1$, we have $\sum_{i=1}^{n} \hat{f}_i \le N$. Then, the same reductio ad absurdum as in Theorem 1 proves this theorem. □

Essentially, the key difference between the basic structure and the extended structure is that the number of counters in each row of the extended structure may differ. But this difference does not affect the key property of XY-sketch (Theorem 2) that $\min_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\} \le Pr(Y^{(j)} = y_i^{(j)}) \le \max_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\}$ for ∀j ∈ [1, d], ∀$x_i \in S_N$, and ∀$y_i^{(j)} \in [0, w_j - 1]$, meaning that the estimated probability is bounded by the minimum and maximum conditional probabilities. Owing to this, the error bounds of XY-sketch under the uniform and Zipfian distributions also hold for the extended structure.

For the uniform distribution, Theorem 3 still works for the extended structure, because the estimated probability is bounded by the minimum and maximum conditional probabilities; the gap between the estimated and true probabilities is thus small. Therefore, the error bound under a uniform distribution is still limited by the range R, as shown in Theorem 8. As for the Zipfian distribution, the key formula $Nk \cdot \frac{1}{\prod_{i=1}^{d} w_i} \le \sum_{i=1}^{k} \hat{f}_i \le \sum_{i=1}^{k} f_i$ still holds because the estimated probability is bounded. Thus, Theorem 9 below differs from Theorem 4 only slightly in its expression, and the proof method is similar to the previous one, so we do not repeat it.

Theorem 8. For the extended XY-sketch structure, we have $|f_i - \hat{f}_i| \le R$ under a uniform distribution, where $R = \max_{x_i \in S_N}\{f_i\} - \min_{x_i \in S_N}\{f_i\}$.

Theorem 9. For the extended XY-sketch structure, given a Zipfian distribution with parameter z and scaling constant $C_z$, with probability at least 1 − δ, the error bound of cold items is $\frac{N}{(n-k)\delta} \cdot \left(\frac{C_z k^{1-z}}{z-1} + \left(1 - \frac{k}{\prod_{j=1}^{d} w_j}\right)\right)$.

5.2 Statistics-based Optimization

There exist (b · d)! possible mappings between the b · d bits and the d elements. Each mapping uniquely determines a decomposition-and-recomposition procedure. Although all mappings comply with the error bounds in Section 4, their performance varies. Hereby, we design a statistics-based method, which takes a small number of items to gather statistics in order to guide the selection of the mapping.

The idea is to obtain the distribution over {0, 1} for every bit of the data items. The distribution is collected from the first $N_1$ items of the data stream. Let $c_j(0)$ and $c_j(1)$ be the counts of values 0 and 1 for the j-th bit, respectively. We can then obtain the probabilities of the j-th bit taking values 0 and 1 as follows:

$$p_j(0) = \frac{c_j(0)}{c_j(0) + c_j(1)} \quad\text{and}\quad p_j(1) = \frac{c_j(1)}{c_j(0) + c_j(1)}$$

For every bit of an item, we calculate its entropy as $H(j) = -\sum_{i=0}^{1} p_j(i) \log(p_j(i))$. Based on the entropy, we sort the bits in descending order to form a sequence. The sequence is then divided into a set of segments of lengths $b_1$, $b_2$, and so on. For the basic structure (Section 3.2), all $b_i$ are equal. For the extended structure (Section 5.1), the $b_i$ are determined by Algorithm 3.
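A possible realization of this bit-ordering step is sketched below; order_bits_by_entropy is our illustrative name, and the sample argument stands for the first $N_1$ stream items (assumed nonempty):

    import math

    def order_bits_by_entropy(sample, s):
        """Return the s bit positions sorted by descending empirical entropy
        H(j) = -sum_i p_j(i) * log p_j(i), computed from sampled items."""
        ones, n = [0] * s, 0
        for x in sample:
            n += 1
            for j in range(s):
                ones[j] += (x >> j) & 1       # running count c_j(1)
        def entropy(j):
            p1 = ones[j] / n
            return -sum(p * math.log(p) for p in (p1, 1.0 - p1) if p > 0)
        return sorted(range(s), key=entropy, reverse=True)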

The rationale of the statistics-based method can be explained through the concept of mutual information. Let the mutual information between the i-th and j-th bits be defined as $I(i, j) = H(i) - H(i|j)$. The mutual information I(i, j) represents the amount of information about the i-th bit obtained by observing the j-th bit.

Theorem 10. $I(i, j) \le \min\{H(i), H(j)\}$.

Proof. Notice that $I(i, j) = H(i) - H(i|j)$ and $I(j, i) = H(j) - H(j|i)$. Since $I(i, j) = I(j, i)$ and $H(i|j), H(j|i) \ge 0$, we get $I(i, j) \le \min\{H(i), H(j)\}$. □

Theorem 11. When the sequence is formed by sorting the bits in descending order of entropy, the mutual information I(i, j) attains a lower (tighter) bound, where the i-th and j-th bits come from different rows of XY-sketch.

Proof. According to Theorem 10, the bound on the mutual information I(i, j) depends on the bit with the lower entropy value. Let the i-th bit come from the first row of XY-sketch and the j-th bit from the second row, and denote the minimum-entropy bit in the first row as the v-th bit. When the sequence is formed by sorting the bits in descending order of entropy, the mutual information I(i, j) is bounded by H(j), which is always smaller than H(v). When this descending condition does not hold, the mutual information I(i, j) may be larger than H(v), leading to a higher bound. The case in which bits i and j come from other rows is analogous, and the theorem is proved. □

We focus only on the case where the i-th and j-th bits come from different rows of XY-sketch, since the estimation error of XY-sketch arises at the junctions between different rows. Theorem 11 shows the lower bound on the mutual information I(i, j) that we prefer.

The optimization procedure takes O(b × d) extra space for gathering the statistics. This cost is negligible, since b × d is small. Notice that the frequency information of the first $N_1$ items cannot be preserved in the optimized XY-sketch, according to the one-pass processing criterion for data streams. If there exists temporarily allocated space large enough for storing the $N_1$ items, a better optimization can be implemented, and the space can be released once the optimization finishes.

It is worth noting that the entropy of each bit obtained through statistics depends on the arrival order of the data stream items. We therefore employ the random order model [30][12][24], a general model regardless of the distribution of the item set. The idea of the random order model is that each incoming item in the stream is picked independently and uniformly at random from X, so that it is adaptive to arbitrary frequency distributions over distinct item sets at any point in time. Therefore, the entropies of the bits calculated from the first $N_1$ data items reflect those of the entire data stream.

6 RESULTS

We cover the experimental setup in Section 6.1 and report the results in Section 6.2.

6.1 Setup

Datasets. We use two real datasets for the experiments, Kosarak [1] and WebDocs [1]. Kosarak contains 30.5 MB of (anonymized) click-stream data from a Hungarian online news portal. The total number of data items is 8,019,015, and the number of distinct items is 41,270. This dataset has a skewed distribution similar to a Zipfian distribution with parameter 1.0. WebDocs is a huge real-life transactional dataset built from a crawled collection of web documents. The dataset has 299,887,139 items and 5,267,656 distinct items, and its size is 1.37 GB. More detailed information on this dataset can be found in [3]. Let ρ be the ratio of the number of distinct items to the total number of items, ρ = n/N. The ρ-values of Kosarak and WebDocs are 0.52% and 1.8%, respectively. We also generate a series of 6 synthetic datasets for testing n-scalability. To facilitate the construction of datasets with different ρ-values, these 6 synthetic datasets follow a Normal distribution with µ = 5 × 10⁷; we vary the variance σ² from 5 × 10⁵ to 2 × 10⁶ to make the ρ-value range from 1.0% to 5.3%. Each of the datasets contains 2 × 10⁸ data items and takes about 1.67 GB of space.

Metrics. We adopt two commonly accepted metrics for evaluating the accuracy of sketching methods, the Average Absolute Error (AAE) and the Average Relative Error (ARE). Formally,

$$AAE = \frac{1}{n} \sum_{i=1}^{n} |\hat{f}_i - f_i|, \qquad ARE = \frac{1}{n} \sum_{i=1}^{n} \frac{|\hat{f}_i - f_i|}{f_i} \quad (7)$$
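In code, the two metrics of Equation 7 amount to simple averages over the n distinct items (a small sketch; est and true are assumed to be aligned per-item lists of estimated and exact frequencies):

    def aae(est, true):
        # Average Absolute Error over the n distinct items (Equation 7)
        return sum(abs(e - t) for e, t in zip(est, true)) / len(true)

    def are(est, true):
        # Average Relative Error; assumes every true frequency is nonzero
        return sum(abs(e - t) / t for e, t in zip(est, true)) / len(true)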

Baselines. We consider five competitors: CM-sketch [7], C-sketch [5], CU-sketch [9], CU with A-sketch [19], and CU with Cold Filter [30], denoted CM, C, CU, A, and CF, respectively. XY-sketch with the statistics-based optimization is denoted XY. We also compare the results of XY against XY-sketch without the statistics-based optimization. For both A and CF, we use 32-bit Bob Hashing [2] and use CU-sketch for the sketching part, following the setting of [30]. In particular, for A, we set its filter size, i.e., the number of items in the filter, to 32 for Kosarak and to 1,280 for WebDocs, with which its best performance is achieved. For CF, we follow the default setting in [30]. We use CF40, CF70, and CF90 to denote CF with filter percentages equal to 40%, 70%, and 90%, respectively. For all baselines, the number of hash functions is 3, following the settings in [30][10][15].

Parameters. We select the first $N_1$ items of the data stream to determine the mapping. By default, $N_1$ is set to 50,000. We also examine the effect of $N_1$ in Section 6.2.

6.2 Experiment Results
N-scalability. We first evaluate the N-scalability of XY-sketch and the baselines. We vary the total number of data items from 100M to 300M for the WebDocs dataset, and from 40M to 200M for the synthetic datasets. In all tests, the space budget is fixed to 1.5MB, so that the effect on scalability can be solely observed.

The results of AAE and ARE on WebDocs are reported in Figures 3 (a) and (b), respectively. We can see that the AAE and ARE values of all baselines increase significantly with the number of data items. For example, the AAE value of C more than doubles when the data volume increases from 100M to 300M, in Figure 3 (a). This is because more hashing collisions are incurred as the number of data items grows. Compared to the baselines, XY performs very stably w.r.t. the increase of data stream volume. It can be observed that XY mostly dominates the other competitors on both metrics. Compared to the fast increase of the AAE and ARE values of the baselines, only a slight increase can be observed for XY. In particular, when the number of items increases from 100M to 300M, the AAE of XY only increases from 7.95 to 18.4. In contrast, the AAEs of CM, C, CU, A, and CF (CF90) increase by 106.95, 117.32, 68.98, 68.89, and 83.85, respectively, which is at least 8 times larger

than that of XY. Similar trends can be observed from the results on the synthetic datasets, as shown in Figures 3(c) and (d). In summary, the results in Figure 3 show that XY achieves much better performance in terms of N-scalability.

Figure 3: N-scalability. (a) AAE w.r.t. N (WebDocs); (b) ARE w.r.t. N (WebDocs); (c) AAE w.r.t. N (synthetic); (d) ARE w.r.t. N (synthetic).

n-Scalability. We evaluate the n-scalability of XY and its competitors in Figure 4. We use the 6 synthetic datasets so that the AAE and ARE w.r.t. the varying ρ-values can be observed. Each dataset contains 2 × 10^8 data items. In all tests, the space budget is fixed to 1.5MB. From Figure 4 (a), the errors of all methods increase w.r.t. ρ, except for C and XY, which is consistent with the scalability analysis in Section 4. Among all methods, XY achieves the best n-scalability. The AAE value of XY is an order of magnitude lower than that of C, and two orders of magnitude lower than the others. Similar results are observed for the ARE metric, as shown in Figure 4 (b). The ARE value of XY is orders of magnitude lower than that of its competitors. We can conclude that XY achieves better n-scalability than its competitors.

Figure 4: n-scalability. (a) AAE w.r.t. ρ; (b) ARE w.r.t. ρ.

Compactness. We test the space efficiency of XY, which is important for the compactness requirement of sketching techniques. The results are collected from experiments on the two real datasets, and are reported in Figure 5.

First, we compare the AAE and ARE metrics for all six sketches on the Kosarak data, varying the space budget from 80KB to 240KB. In Figure 5(a), all methods achieve smaller AAE as the space budget increases, whereas XY always performs best. In particular, when the space budget equals 80KB, the AAEs of CM, C, CU, A, and CF (CF70) are 13.8, 20.9, 8.4, 8.6, and 14.9 times that of XY, respectively. This means that XY performs better frequency estimation than its competitors with the same amount of space. Similar trends on ARE can be observed in Figure 5(b), where XY always achieves the smallest ARE. Especially when the space budget is small, the improvement of XY in terms of estimation accuracy is significant. For example, when the space budget equals 80KB, the ARE of XY is orders of magnitude lower than the others.

The results on WebDocs are reported in Figures 5(c) and (d), respectively. The performance is tested by varying the space budget from 1.5MB to 4MB. In Figure 5(c), the AAE of each method decreases as the space budget increases. In particular, when the space budget equals 1.5MB, the AAEs of CM, C, CU, A, and CF (CF90) are 7.6, 8.5, 4.8, 4.8, and 4.3 times that of XY, respectively. It is worth noticing that when the space budget is very small, all baselines degrade in frequency estimation for data items, making them unqualified for data stream sketching. In contrast, XY achieves much better estimation accuracy than the baselines, especially when the space budget is small. For example, when the space budget equals 1.5MB, the AAE of XY is orders of magnitude lower than that of CF70 and CF90. Similarly, in Figure 5(d), when the space budget is set to 1.5MB, the ARE of XY is orders of magnitude lower than that of the other methods. We note that sketches are often used in a small and fast memory, e.g., L1 or L2 cache [19][26]. All the experiments in [19][30][27][29][21] are done with a space budget no larger than 2MB. Therefore, we can conclude that XY dominates its competitors in the common range of space budgets.

Efficiency. We test the updating and querying efficiency of all sketches on the two real-world datasets, in Figure 6. In Figure 6 (a), all sketches achieve high throughput in handling item updates. In particular, XY has the highest throughput among all methods on Kosarak. The reason may be that XY-sketch only uses bit operations in the decomposition phase; at the same time, the number of distinct items n in Kosarak is relatively small, so the number of bits to be decomposed is correspondingly small, leading to higher throughput. Figure 6 (b) reports the query efficiency on Kosarak. We can see that CM, CU, A, CF90, and XY have similar performance in query processing, where CU is the fastest. CF's query efficiency correlates significantly with its parameter settings. In Figures 6 (c) and (d), we report the updating and querying efficiency for WebDocs, respectively. The performance of all methods, except A, is almost on the same level. XY is slightly slower than CM, CU, CF70, and CF90, and is faster than C and A.

There exist some small fluctuations for XY, which might be caused by the recomposition phase incurring computational costs on probability calculations. We argue that this is still worth the effort, given the significant improvement in estimation accuracy achieved by XY.

Figure 5: Compactness. (a) AAE w.r.t. space (Kosarak); (b) ARE w.r.t. space (Kosarak); (c) AAE w.r.t. space (WebDocs); (d) ARE w.r.t. space (WebDocs).

Figure 6: Efficiency (Updating and Querying). (a) updating (Kosarak); (b) querying (Kosarak); (c) updating (WebDocs); (d) querying (WebDocs).

Figure 7: Effect of Statistics-based Optimization. (a) AAE w.r.t. space; (b) ARE w.r.t. space.

Effect of Statistics-based Optimization. We now examine the effectiveness of the statistics-based optimization proposed in Section 5. We randomly select 100 mapping functions from the random bijective function family {ϖ*}. We run experiments with all of the 100 mapping functions on the WebDocs dataset, as shown in Figure 7 (a). For each value on the x-axis, we record the result with the highest estimation error as XYmax, the result with the lowest estimation error as XYmin, and the average of the 100 results as XYavg. Here, we use XY to denote the method with the statistics-based optimization techniques. In Figures 7 (a) and (b), the AAE and ARE values of XYmax, XYavg, and XY all decrease as the space

allocated increases. Also, the AAE and ARE values of XY are almost the lowest among the 100 tests. This implies that the proposed optimization techniques make XY mostly outperform the randomly selected set of mapping functions.
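For intuition, one simple way to realize a random bijection is as a random permutation of bit positions; this construction is our illustrative assumption, and the actual family {ϖ*} is the one defined in Section 5.

```python
import random

def random_bit_bijection(num_bits=32, seed=None):
    # A uniformly random permutation of bit positions induces a bijection
    # on num_bits-bit integers (an illustrative stand-in for a member of
    # a random bijective function family).
    perm = list(range(num_bits))
    random.Random(seed).shuffle(perm)
    def apply(x):
        return sum(((x >> i) & 1) << perm[i] for i in range(num_bits))
    return apply

# e.g., draw 100 candidate mappings, as in the experiment above
candidates = [random_bit_bijection(seed=s) for s in range(100)]
```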

Effect of N1. We hereby test the effect of parameter N1 on the performance of statistics-based optimization. Parameter N1 is used to estimate the entropy of every bit of the input items, in order to optimize the setting of bijective functions. The results on Kosarak and WebDocs are reported in Figures 8 (a) and (b), respectively. They show that a larger value of N1 corresponds to better accuracy. Both AAE and ARE converge after N1 reaches some value. The convergence point of N1 is 50K for Kosarak, and 4.4K for WebDocs, which is quite small compared to the total volume of the data streams. Also, we test the effect of N1 under different space budgets, represented by different curves in Figure 8. We can observe that larger allocated space corresponds to smaller estimation errors. On the other hand, the convergence point of N1 is independent of the allocated space, meaning that the setting of N1 is general to space budgets for our sketch.

Figure 8: Effect of N1. (a) AAE w.r.t. N1 (Kosarak); (b) AAE w.r.t. N1 (WebDocs).
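As a reading aid, the per-bit entropy that N1 is used to estimate can be computed as below; how these estimates then fix the bijective mapping follows the optimization of Section 5, so this sketch is illustrative only.

```python
import math

def per_bit_entropy(first_n1_items, num_bits=32):
    # Empirical entropy H(b) = -p*log2(p) - (1-p)*log2(1-p) of each bit b,
    # where p is the fraction of the first N1 items with bit b set.
    n1 = len(first_n1_items)
    ones = [0] * num_bits
    for x in first_n1_items:
        for b in range(num_bits):
            ones[b] += (x >> b) & 1
    entropies = []
    for c in ones:
        p = c / n1
        entropies.append(0.0 if p in (0.0, 1.0)
                         else -(p * math.log2(p) + (1 - p) * math.log2(1 - p)))
    return entropies
```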

7 CONCLUSION
In this paper, we study the problem of item frequency estimation for web-scale data streams, by proposing a new sketch, called XY-sketch. XY-sketch follows a novel decomposition-and-recomposition framework built on bijective functions, which converts the problem of item frequency estimation into a problem of probability estimation. XY-sketch can achieve high estimation accuracy with very small space. We conduct a detailed error-bound analysis to gain theoretical insights into the scalability of the structure. Several optimization techniques are studied to further enhance the performance. Experimental results on real and synthetic datasets show that XY-sketch outperforms state-of-the-art solutions when the space budget is small.

REFERENCES
[1] Frequent Itemset Mining Dataset Repository. http://fimi.uantwerpen.be/data/.
[2] Hash website. http://burtleburtle.net/bob/hash/evahash.html.
[3] WebDocs: a real-life huge transactional dataset. http://fimi.uantwerpen.be/data/webdocs.pdf.
[4] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. 1999. Web Caching and Zipf-like Distributions: Evidence and Implications. In INFOCOM. 126–134.
[5] Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. 2002. Finding Frequent Items in Data Streams. In ICALP. 693–703.
[6] Jiecao Chen and Qin Zhang. 2017. Bias-Aware Sketches. In PVLDB. 961–972.
[7] Graham Cormode and S. Muthukrishnan. 2005. An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55, 1 (2005), 58–75.
[8] Graham Cormode and S. Muthukrishnan. 2005. Summarizing and Mining Skewed Data Streams. In SDM. 44–55.
[9] Cristian Estan and George Varghese. 2002. New directions in traffic measurement and accounting. In SIGCOMM. 323–336.
[10] Amit Goyal, Hal Daumé III, and Graham Cormode. 2012. Sketch Algorithms for Estimating Point Queries in NLP. In EMNLP-CoNLL. 1093–1103.
[11] Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. 2010. Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine. In ICML. Omnipress, 13–20.
[12] Sudipto Guha and Andrew McGregor. 2009. Stream Order and Order Statistics: Quantile Estimation in Random-Order Streams. SIAM J. Comput. 38, 5 (2009), 2044–2059.
[13] Qun Huang, Patrick P. C. Lee, and Yungang Bao. 2018. Sketchlearn: relieving user burdens in approximate measurement with automated statistical inference. In SIGCOMM. 576–590.
[14] Mohammad Tanvir Irfan and Tucker Gordon. 2019. The Power of Context in Networks: Ideal Point Models with Social Interactions. In IJCAI. 6176–6180.
[15] Yi Lu, Andrea Montanari, Balaji Prabhakar, Sarang Dharmapurikar, and Abdul Kabbani. 2008. Counter braids: a novel counter architecture for per-flow measurement. In SIGMETRICS. 121–132.
[16] Samuel Madden and Michael J. Franklin. 2002. Fjording the Stream: An Architecture for Queries Over Streaming Sensor Data. In ICDE. 555–566.
[17] D. Madigan. 2003. DIMACS working group on monitoring message streams. http://stat.rutgers.edu/madigan/mms/.
[18] Michael Mitzenmacher, Rasmus Pagh, and Ninh Pham. 2014. Efficient estimation for high similarities using odd sketches. In WWW. 109–118.
[19] Pratanu Roy, Arijit Khan, and Gustavo Alonso. 2016. Augmented Sketch: Faster and More Accurate Stream Processing. In SIGMOD. 1449–1463.
[20] Anshumali Shrivastava, Arnd Christian König, and Mikhail Bilenko. 2016. Time Adaptive Sketches (Ada-Sketches) for Summarizing Data Streams. In SIGMOD. 1417–1432.
[21] Lu Tang, Qun Huang, and Patrick P. C. Lee. 2019. MV-Sketch: A Fast and Compact Invertible Sketch for Heavy Flow Detection in Network Data Streams. In INFOCOM. 2026–2034.
[22] Ramine Tinati, Xin Wang, Ian C. Brown, Thanassis Tiropanis, and Wendy Hall. 2015. A Streaming Real-Time Web Observatory Architecture for Monitoring the Health of Social Machines. In WWW. 1149–1154.
[23] Luca Vassio, Michele Garetto, Carla-Fabiana Chiasserini, and Emilio Leonardi. 2020. User Interaction with Online Advertisements: Temporal Modeling and Optimization of Ads Placement. TOMPECS 5, 2 (2020), 8:1–8:26.
[24] Zhewei Wei, Ge Luo, Ke Yi, Xiaoyong Du, and Ji-Rong Wen. 2015. Persistent Data Sketching. In SIGMOD. 795–810.
[25] Tobias Weller. 2018. Compromised Account Detection Based on Clickstream Data. In WWW. 819–823.
[26] Tong Yang, Haowei Zhang, Hao Wang, Muhammad Shahzad, Xue Liu, Qin Xin, and Xiaoming Li. 2019. FID-sketch: an accurate sketch to store frequencies in data streams. World Wide Web 22, 6 (2019), 2675–2696.
[27] Tong Yang, Yang Zhou, Hao Jin, Shigang Chen, and Xiaoming Li. 2017. Pyramid Sketch: a Sketch Framework for Frequency Estimation of Data Streams. In PVLDB. 1442–1453.
[28] Yue Yang and Jianwen Zhu. 2016. Write Skew and Zipf Distribution: Evidence and Implications. TOS 12, 4 (2016), 21:1–21:19.
[29] Yang Zhou, Peng Liu, Hao Jin, Tong Yang, Shoujiang Dang, and Xiaoming Li. 2017. One Memory Access Sketch: A More Accurate and Faster Sketch for Per-Flow Measurement. In GLOBECOM. 1–6.
[30] Yang Zhou, Tong Yang, Jie Jiang, Bin Cui, Minlan Yu, Xiaoming Li, and Steve Uhlig. 2018. Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing. In SIGMOD. 741–756.
[31] Yunyue Zhu and Dennis E. Shasha. 2002. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time. In VLDB. 358–369.