
Using Punctuation Schemes to Characterize Strategies for Querying over Data Streams

Peter A. Tucker, Member, IEEE, David Maier, Senior Member, IEEE,

Tim Sheard, Member, IEEE, and Paul Stephens

Abstract—Many systems and strategies have been proposed for processing nonterminating data streams. Each approach has

advantages and disadvantages, including the kinds of queries that can be executed. We present a framework for characterizing the

kinds of queries that can be executed over streams based on a notion of compact sets from topology. We first apply our framework to

queries over punctuated data streams. Previous work on punctuations focused primarily on the behavior of individual query operators.

We use our framework to determine if an entire query can benefit from punctuations available from stream sources. We then consider

other common strategies proposed in the literature for executing queries over streams, and we discuss how our framework can

characterize the kinds of queries each strategy can answer.

Index Terms—Data streams, query execution, punctuation.


1 INTRODUCTION

Two kinds of traditional query operators pose problems for nonterminating data streams: blocking operators, which produce output only after the input has ended, and stateful operators, which maintain an unbounded amount of state. A number of strategies have been proposed to allow the use of blocking and stateful operators over nonterminating streams, including querying over ordered data, applying windows over the input, and embedding punctuations. Each strategy has advantages, but little research has been done to help formally decide which strategy can be used to execute a particular query. In this work, we introduce the concept of punctuation schemes and use punctuation schemes to determine if a given query can be executed over nonterminating inputs. Further, we use punctuation schemes in discussing the kinds of queries other strategies can address.

1.1 Motivating Example

A number of commercial and research systems process and

monitor online auctions, including eBay [1], Yahoo!

Auctions [2], the Fishmarket [3], and the Michigan Internet

AuctionBot [4]. Software agents may represent humans in

the auction to bid on or sell items. A user registers through

an agent, then participates in auctions as a buyer or seller.

We model an online auction with three kinds of stream

sources that supply data to an auction monitoring system,

as shown in Fig. 1. Bids for an item currently for sale arrive

on one of several Bid streams, new items for sale arrive on the Auction stream, and newly registered users arrive on the Person stream.

For simplicity, assume two bid streams (Bid1 and Bid2) with schema: BidN(auctionid, bidderid, value, hour, minute). An auction administrator will want a query that counts the bids placed each hour where the bid price is considered high (say, greater than $250). The SQL for such a query is

Query 1. Count high price bids

SELECT hour, COUNT(*)

FROM (SELECT * FROM Bid1

UNION

SELECT * FROM Bid2)

WHERE value > 250

GROUP BY hour;

However, as group-by is a blocking operator, this query cannot be executed over nonterminating inputs without help.

Auction items arrive on a separate stream from bids. For concreteness, we will use the following schema: Auction(id, sellerid, itemname, expires). A second important query reports the maximum bid submitted for each item. Such a query reports the sale price for each item when the auction for that item closes. The SQL for such a query is

Query 2. Closing price for items

SELECT A.id, A.itemname, B.close

FROM Auction A,

(SELECT auctionid,

MAX(value) AS close

FROM (SELECT * FROM Bid1

UNION

SELECT * FROM Bid2)

GROUP BY auctionid) B

WHERE A.id = B.auctionid;

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 9, SEPTEMBER 2007 1

. P.A. Tucker and P. Stephens are with the Department of Math and Computer Science, Whitworth University, 300 W. Hawthorne Road, Spokane, WA 99218. E-mail: {ptucker, pstephens07}@whitworth.edu.

. D. Maier and T. Sheard are with the Department of Computer Science, Portland State University, PO Box 751, Portland, OR 97207. E-mail: {maier, sheard}@cs.pdx.edu.

Manuscript received 4 Dec. 2006; revised 29 Mar. 2007; accepted 6 Apr. 2007; published online 24 Apr. 2007. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-0545-1206. Digital Object Identifier no. 10.1109/TKDE.2007.1052.

1041-4347/07/$25.00 © 2007 IEEE Published by the IEEE Computer Society

As in Query 1, this query has a group-by operator, which is a blocking operator. Additionally, this query has a join operator, which requires an unbounded amount of state. Both kinds of operators are problematic when dealing with nonterminating input, so this query also cannot be executed over nonterminating inputs without help.

1.2 Comparing Approaches to Querying over Data Streams

Several strategies have been proposed for executing queries over nonterminating data streams. These strategies include relying on data arriving in some order based on specific attribute values (typically in the form of a timestamp), defining windows over the input streams, using heartbeats, and embedding punctuations in the stream. We discuss these approaches further in Section 8.

The performance and expressiveness of data stream systems can be compared using benchmarks such as Linear Road [5] and NEXMark [6]. However, there has not been much investigation in formally comparing different data stream strategies. In this work, we will develop a framework inspired by compact sets, then use our framework as a method for comparing strategies based on the kinds of queries that each can handle.

1.3 Organization

The rest of this paper is organized as follows: In Section 2, we give a brief overview of punctuation semantics. We present definitions of groupings and punctuation schemes in Section 3. In Section 4, we discuss how to determine if punctuation schemes unblock query operators and, in Section 5, we discuss how to determine if punctuation schemes reduce operator state. In Section 6, we show how to determine if all of the queries benefit from punctuation schemes. We introduce a new punctuation-specific query operator in Section 7. In Section 8, we use punctuation schemes to compare various data stream approaches. We review related work in Section 9 and conclude in Section 10.

2 BRIEF OVERVIEW OF PUNCTUATED DATA STREAMS

A punctuation is an item embedded into a stream that describes a subset of the domain of that stream. We say that a data item $d$ matches a punctuation $p$ (denoted $\mathit{match}(d, p)$) if $d$ belongs to the subset described by $p$. A punctuation $p$ embedded in a data stream states that no data items will arrive subsequently that match $p$. A stream that adheres to this property is called grammatical. In this work, we consider only streams that are grammatical.

By embedding punctuations into a stream, query operators know about the ends of particular subsets of data. A blocking operator might use this information to output results for a completed subset. A stateful operator might purge state for such a subset. A query operator that is enhanced to exploit punctuations implements three punctuation behaviors: First, pass behavior defines what data items can be output when punctuations arrive. Second, keep behavior defines the state an operator must retain to continue outputting correct results. Finally, propagation behavior defines what punctuations may be emitted from the operator to operators further up the query tree. For each behavior, we have defined corresponding punctuation invariants for many common query operators to formally define how punctuations should be handled and have shown that operators which adhere to the invariants behave correctly [7].

Punctuation invariants are cumulative. They consider a prefix of the input stream (or streams). Pass invariants return the data items that can be output beyond those that would have been output during normal execution. Operators that do not block on their input(s) can use the trivial pass invariant, which outputs the empty set. Propagation invariants return the punctuations that can be emitted from the operator such that the operator’s output remains grammatical. Keep invariants return the data items that must be retained in operator state beyond what is kept during normal operator execution. Examples of the three kinds of invariants for specific operators are given in Table 1.

Fig. 2 shows a possible query plan for Query 1. Suppose the Bid sources emit a punctuation after the last value for each hour. Consider punctuation p that signals that no more data items will arrive with hour value 6 from Bid1. When p arrives at the union operator, the operator must wait for a


Fig. 1. Simple architecture for a system to monitor an online auction.

TABLE 1. Example Punctuation Invariants for Various Operators

Here, ds represents data items, and ps represents punctuations that have arrived. The function grpP outputs punctuations in the output schema that match an entire group when enough input punctuations have arrived to guarantee that all data items for that group have arrived. The wildcard pattern (“*”) matches all values for that attribute.

punctuation with the same hour from Bid2. After p is seen on both streams, the punctuation can be output (propagation behavior). If union is eliminating duplicates, then all data items that match p can be removed from the state (keep behavior).
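The waiting behavior of union described here can be sketched in a few lines of Python. This is only an illustrative simplification, not the propagation invariant from Table 1: the class name `PunctuatedUnion`, its methods, and the tuple encoding of punctuations are all hypothetical, and the sketch only pairs up punctuations that are exactly equal on every input rather than combining overlapping ones.

```python
from typing import Hashable, List, Set


class PunctuatedUnion:
    """Duplicate-preserving union (hypothetical sketch).

    A punctuation is forwarded only once an equal punctuation has been
    received on every input.
    """

    def __init__(self, n_inputs: int = 2) -> None:
        # seen[i] holds the punctuations received so far on input i.
        self.seen: List[Set[Hashable]] = [set() for _ in range(n_inputs)]

    def on_data(self, item):
        # Union does not block: data items pass straight through.
        return [item]

    def on_punctuation(self, input_index: int, punct: Hashable):
        self.seen[input_index].add(punct)
        if all(punct in s for s in self.seen):
            # Every input has promised that no more matching items arrive,
            # so the promise also holds for the output (propagation).
            for s in self.seen:
                s.discard(punct)  # assumption: forget it once forwarded
            return [punct]
        return []


union = PunctuatedUnion()
p = ("*", "*", "*", 6, "*")  # no more bids with hour value 6
first = union.on_punctuation(0, p)   # seen on Bid1 only: must wait
second = union.on_punctuation(1, p)  # now seen on Bid2 as well
```

Here `first` is empty because only Bid1 has delivered the punctuation, while `second` contains `p`.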

When the group-by operator receives p, it knows that no more data items will arrive that contribute to the group for hour value 6. It can output results for that group (pass behavior), then reduce the state required for that group (keep behavior). Finally, it can also output punctuation for hour value 6 (propagation behavior).
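The three behaviors of group-by can be illustrated with a small Python sketch of the counting operator from Query 1. The class `PunctuatedCountByHour`, the `("data", ...)` / `("punct", ...)` tagging of outputs, and the choice to pass an hour value instead of a full punctuation tuple are assumptions made for brevity; they are not taken from the invariants in Table 1.

```python
from collections import defaultdict


class PunctuatedCountByHour:
    """COUNT(*) grouped by hour, unblocked by hourly punctuations (sketch)."""

    def __init__(self):
        self.counts = defaultdict(int)  # operator state: hour -> running count

    def on_data(self, bid):
        auctionid, bidderid, value, hour, minute = bid
        self.counts[hour] += 1
        return []  # blocking: nothing can be emitted from data alone

    def on_punctuation(self, hour):
        """Handle the punctuation <*, *, *, hour, *> (hypothetical encoding)."""
        out = []
        if hour in self.counts:
            # Pass behavior: the group is complete, so emit its result.
            # Keep behavior: pop() also discards the state for that group.
            out.append(("data", (hour, self.counts.pop(hour))))
        # Propagation behavior: no more results for this hour will follow.
        out.append(("punct", (hour, "*")))
        return out


op = PunctuatedCountByHour()
op.on_data((1, 7, 300.0, 6, 5))
op.on_data((2, 8, 275.0, 6, 40))
op.on_data((1, 7, 310.0, 7, 1))
emitted = op.on_punctuation(6)
```

After the punctuation for hour 6 arrives, `emitted` holds the count for that hour followed by the propagated punctuation, and only the state for hour 7 remains.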

To this point, our work on punctuations has focused on defining correct behavior for individual query operators under punctuations. Indeed, punctuations can fix the problems with Queries 1 and 2. For Query 1, punctuations embedded into the bid streams that mark the end of each hour could be used to unblock the group-by operator since it is grouping on the hour attribute. For Query 2, punctuations that mark the end of bidding for each auction can be used both to reduce the amount of state required by the join operator (since the join attributes are id and auctionid) and unblock group-by (since the group-by attribute is auctionid).

We used punctuations in previous work to produce results and reduce state [8] but have avoided an important question, namely, “how do we determine what kinds of punctuation will help a given query?” Punctuations are not always beneficial. Indeed, there are queries that cannot be improved with any kind of punctuation. Query 1 groups data items on the hour attribute and, so, has a natural grouping of interest based on hour. (Grouping of interest is somewhat akin to the idea of an “interesting order” [9].) A grouping is a collection of groups in the data domain based on values of the grouping attribute(s). Many operators have natural groupings of interest and these help decide if a particular set of punctuations benefits a query. Section 3 provides details on groupings of interest.

Punctuations that collectively match all possible data items in a grouping of interest may benefit a query, but only if the number of punctuations needed to cover each specific group is finite and will eventually arrive on the stream.¹ (The total number of punctuations in a stream need not be finite.) To illustrate, suppose that each bid stream contains punctuations marking the end of bids from a specific user for a particular hour. Fig. 3 depicts the data items that match each punctuation. Using this set of punctuations, we can match the group of all bids for a given hour using a finite number of punctuations. Therefore, such punctuations benefit queries that group solely on hour. However, we cannot realistically match all bids from a given bidder with a finite number of punctuations. Therefore, queries that group on bidderid will not benefit from this set of punctuations. The need to cover a group with a finite number of punctuations suggests a notion similar to compact sets [10], [11] from topology. We will use analogs to compactness to formally determine if a specific set of punctuations will benefit a particular query.

3 GROUPINGS AND PUNCTUATION SCHEMES

We want to determine the utility of a given set of punctuations for a particular query. To this end, we introduce several concepts. A dataspace represents the domain of all possible data items that may appear on a stream. For example, the dataspace for the Bid streams is

$$D_B = \{\langle a, b, v, h, m \rangle \mid a \in \mathbb{Z},\ b \in \mathbb{Z},\ v \in \mathbb{R},\ h \in \mathbb{Z},\ m \in [0, 59]\}.$$

The term subspace means any subset of a dataspace.

3.1 Groupings for Dataspaces and Groupings of Interest

A grouping for a dataspace or subspace is a collection of groups where each group contains items that have equal values for specified attributes and the union of the groups equals the dataspace. For example, the grouping of $D_B$ on $h$ is

$$\big\{ \{\langle a, b, v, h, m \rangle \mid a \in \mathbb{Z},\ b \in \mathbb{Z},\ v \in \mathbb{R},\ m \in [0, 59]\} \;\big|\; h \in \mathbb{Z} \big\}.$$

For the subspace

$$S = \{\langle 1, 1, 50, 1, 15 \rangle,\ \langle 1, 2, 70, 1, 45 \rangle,\ \langle 1, 3, 75, 2, 10 \rangle\},$$

the grouping on $h$ is (with $h$ value in bold)

$$\big\{ \{\langle 1, 1, 50, \mathbf{1}, 15 \rangle,\ \langle 1, 2, 70, \mathbf{1}, 45 \rangle\},\ \{\langle 1, 3, 75, \mathbf{2}, 10 \rangle\} \big\}.$$

A grouping of interest for a query operator is a grouping that arises naturally from the definition or implementation of that operator. Many operators have groupings of interest.


1. Note that, in practice, we do not need such a strong statement. It is sufficient to assume that every punctuation will eventually appear or that a punctuation will arrive that subsumes it.

Fig. 2. Query plan for Query 1.

Fig. 3. Data items that match punctuations for a specific bidder and hour. Each square in the grid represents all bids from a bidder that have arrived during a specific hour. The darker area, containing all bids for a particular hour, can be covered by a finite number of punctuations. The lighter area, containing all bids from a particular bidder, cannot.

For example, a group-by’s grouping of interest is the one defined by the group-by attributes. Join has two groupings of interest, one for each input, based on the join attributes. We discuss groupings of interest for specific operators in Section 5.

3.2 Punctuation Format and Punctuation Schemes

For a given dataspace, a punctuation is a tuple of patterns, one pattern per attribute of the dataspace. There are many options for patterns to construct punctuations. The patterns we present have the property of closure under intersection (and, therefore, punctuations are closed under conjunction). This property is important for some punctuation behaviors such as the propagation behavior for union.

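The set of valid patterns over a domain $A$ includes two special patterns, the wildcard $*$ and the empty pattern $\varnothing$, where $\{*, \varnothing\} \cap A = \emptyset$. The set of patterns then is defined as

$$\Pi_A = \{*\} \cup \{\varnothing\} \cup \{A' \mid A' \subseteq A \wedge A' \text{ is finite}\}.$$

The $\mathit{matchPat}$ function takes an element $a$ from $A$ and a pattern $p$ from $\Pi_A$ and returns true if $a$ matches $p$:

$$\mathit{matchPat} :: A \times \Pi_A \to \mathit{Boolean},$$

$$\mathit{matchPat}(a, *) = \mathit{true},$$

$$\mathit{matchPat}(a, \varnothing) = \mathit{false},$$

$$\mathit{matchPat}(a, a_1) = (a == a_1),$$

$$\mathit{matchPat}(a, A') = \begin{cases} \mathit{true} & \text{if } \exists a' \in A' \mid \mathit{matchPat}(a, a') \\ \mathit{false} & \text{otherwise.} \end{cases}$$

If $A$ is totally ordered by $\le$, we can supplement these patterns to include range matching. We add $\bot$ and $\top$, where $\{\bot, \top\} \cap A = \emptyset \wedge \forall a \in A,\ \bot < a < \top$. We extend $\Pi_A$ with

$$\{[a_1, a_2] \mid a_1, a_2 \in A \wedge a_1 \le a_2\} \cup \{[\bot, a_2] \mid a_2 \in A\} \cup \{[a_1, \top] \mid a_1 \in A\}$$

and extend $\mathit{matchPat}$

$$\mathit{matchPat}(a, [a_1, a_2]) = a_1 \le a \le a_2,$$

$$\mathit{matchPat}(a, [\bot, a_2]) = a \le a_2,$$

$$\mathit{matchPat}(a, [a_1, \top]) = a_1 \le a.$$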

Two points are worth mentioning. First, note that we do not need the case $[\bot, \top]$ as that would be the same as the $*$ pattern. Second, we only specify closed intervals above. It is a simple extension to specify open and mixed intervals as well.

For dataspace $D$ (over $A_1 \times A_2 \times \ldots \times A_n$), the punctuation space $\mathbb{P}_D = \Pi_{A_1} \times \Pi_{A_2} \times \ldots \times \Pi_{A_n}$ is the set of possible punctuations over $D$. The $\mathit{match}$ function determines when an item from $D$ matches punctuation from $\mathbb{P}_D$ by comparing values and patterns from corresponding attributes. For $D$,

$$\mathit{match} :: D \times \mathbb{P}_D \to \mathit{boolean},$$

$$\mathit{match}(d, p) = \forall i \in [1, n],\ \mathit{matchPat}(d(i), p(i)),$$

where $d(i) \in A_i$ is the value of the $i$th attribute of $d$ and $p(i) \in \Pi_{A_i}$ is the pattern value of the $i$th attribute of $p$. Consider a punctuation $p = \langle *, *, [5.00, 8.00], 10, * \rangle$ for $D_B$. Given data items $d_1 = \langle 1, 2, 6.00, 10, 20 \rangle$ and $d_2 = \langle 1, 2, 7.00, 11, 20 \rangle$, it is clear that $\mathit{match}(d_1, p) = \mathit{true}$ and $\mathit{match}(d_2, p) = \mathit{false}$.
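The pattern and punctuation matching functions translate almost directly into code. The following Python sketch is one possible encoding, not the authors' implementation: the sentinels `WILDCARD` and `EMPTY`, the `Range` class (with `None` standing in for the added least and greatest elements), and the use of Python sets for finite list patterns are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional

WILDCARD = object()  # the "*" pattern: matches every value
EMPTY = object()     # the empty pattern: matches no value


@dataclass(frozen=True)
class Range:
    """Closed interval pattern; None means unbounded on that side."""
    lo: Optional[Any] = None
    hi: Optional[Any] = None


def match_pat(a, pattern) -> bool:
    """Return True if value a matches a single attribute pattern."""
    if pattern is WILDCARD:
        return True
    if pattern is EMPTY:
        return False
    if isinstance(pattern, Range):
        return ((pattern.lo is None or pattern.lo <= a)
                and (pattern.hi is None or a <= pattern.hi))
    if isinstance(pattern, (set, frozenset)):
        # Finite list pattern: a matches if it matches any member.
        return any(match_pat(a, member) for member in pattern)
    return a == pattern  # constant pattern


def match(d, p) -> bool:
    """Return True if data item d matches punctuation p attribute-wise."""
    return len(d) == len(p) and all(match_pat(v, pat) for v, pat in zip(d, p))


# The example from the text: value in [5.00, 8.00] and hour equal to 10.
p = (WILDCARD, WILDCARD, Range(5.00, 8.00), 10, WILDCARD)
d1 = (1, 2, 6.00, 10, 20)
d2 = (1, 2, 7.00, 11, 20)
print(match(d1, p), match(d2, p))
```

The final line prints `True False`, agreeing with the example: `d2` fails only on the hour attribute.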

A punctuation scheme $\mathcal{P}_D$ for $D$ is the set of punctuations that will be emitted from a stream source.² Clearly, $\mathcal{P}_D \subseteq \mathbb{P}_D$. We specify individual punctuation schemes using set notation. Suppose each bid stream source outputs punctuations at the end of each hour. Then, the punctuation scheme is $\mathcal{P}^{h}_{B} = \{\langle *, *, *, h, * \rangle \mid h \in \mathbb{Z}\}$. Alternatively, if a bid source outputs a punctuation marking the end of each 10-minute interval for each hour, the punctuation scheme is

$$\mathcal{P}^{h,m}_{B} = \{\langle *, *, *, h, [m, m + 9] \rangle \mid h \in \mathbb{Z} \wedge m \in \{0, 10, 20, 30, 40, 50\}\}.$$

A punctuation scheme $\mathcal{P}_D$ is complete if, for every $d \in D$, there exists $p \in \mathcal{P}_D$ such that $\mathit{match}(d, p)$. Generally, if a punctuation scheme is not complete, then blocking operators cannot be completely unblocked. Query 1 groups on values of the hour attribute. The punctuation scheme $\mathcal{P}^{2h}_{B} = \{\langle *, *, *, 2h, * \rangle \mid h \in \mathbb{Z}\}$ emits punctuations that match data items for every even hour and, so, is not complete. Query 1 will not be completely unblocked. The punctuation scheme $\mathcal{P}^{h}_{B} = \{\langle *, *, *, h, * \rangle \mid h \in \mathbb{Z}\}$ given earlier is complete and does completely unblock Query 1.
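Because the schemes above are infinite sets, completeness cannot be established by enumeration, but a counterexample to completeness is easy to exhibit. The Python sketch below models each scheme as a hypothetical function that, given a bid, returns the punctuation of the scheme that matches it (or `None` if there is none). The function names and the tuple encoding of punctuations, including `(lo, hi)` for a range pattern, are assumptions made for this illustration.

```python
def hourly(d):
    """Hourly scheme: <*, *, *, h, *> exists for every integer hour h."""
    return ("*", "*", "*", d[3], "*")


def even_hours(d):
    """Even-hour scheme: <*, *, *, 2h, *> exists only for even hours."""
    return ("*", "*", "*", d[3], "*") if d[3] % 2 == 0 else None


def ten_minute(d):
    """Ten-minute scheme: <*, *, *, h, [m, m+9]> for m in {0, 10, ..., 50}."""
    lo = (d[4] // 10) * 10
    return ("*", "*", "*", d[3], (lo, lo + 9))


def unmatched(scheme, sample):
    """Items of the sample that no punctuation of the scheme will ever match."""
    return [d for d in sample if scheme(d) is None]


sample = [(1, 2, 300.0, 6, 5), (1, 3, 260.0, 7, 42)]
witnesses = unmatched(even_hours, sample)
```

A nonempty `witnesses` list shows that the even-hour scheme is not complete; an empty result for the other two schemes is consistent with completeness but does not prove it, since only a finite sample is inspected.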

We want to know when all items of a group in a grouping have arrived. Punctuations provide this information. The interpretation of a punctuation $p$ is the subspace of data items that match $p$, denoted $I(p)$. Formally, given dataspace $D$ and punctuation $p \in \mathbb{P}_D$, $I(p) = \{d \mid d \in D \wedge \mathit{match}(d, p)\}$. For punctuation scheme $\mathcal{P}_D$, $S_{\mathcal{P}_D} = \{I(p) \mid p \in \mathcal{P}_D\}$ is the collection of interpretations for punctuations in $\mathcal{P}_D$. When $D$ is understood, we will write $\mathcal{P}$ for $\mathcal{P}_D$ and $S_{\mathcal{P}}$ for $S_{\mathcal{P}_D}$.

3.3 Benefitting Queries with Punctuation Schemes

A punctuation scheme benefits a query if the following conditions hold:

. Enables. All result data items for a query will eventually be output.

. Cleanses. Every data item that resides in the state for any operator in the query will eventually be removed.

Note that, even if a punctuation scheme cleanses a query, there may still be data in its state at every point during execution.

Our goal is to determine if a punctuation scheme benefits an entire query. Given a query tree $Q$ for a query and the punctuation schemes for each stream source, we can use propagation invariants to determine the output punctuation scheme for each operator in $Q$. If the root operator in the tree emits a complete punctuation scheme, we know that the query is enabled.

From topology, a collection $S$ forms a cover for a set $T$ if $\bigcup S \supseteq T$. Further, $T$ is compact if every cover for $T$ in $S$ contains a finite subcollection that is also a cover for $T$ [10], [11]. Applying these concepts to punctuations, given a group $G$ in a grouping and a punctuation scheme $\mathcal{P}$, $\mathcal{P}$ is a cover for $G$ if each data item in $G$ matches some punctuation in $\mathcal{P}$. $G$ is compact relative to $\mathcal{P}$ if some finite subset of $\mathcal{P}$ also forms a cover for $G$. If all groups of a grouping $\mathcal{G}$ are compact relative to $\mathcal{P}$, we say that $\mathcal{G}$ is compact relative to $\mathcal{P}$.

2. Note that every punctuation in $\mathcal{P}_D$ will either appear eventually or be subsumed by some other punctuation in $\mathcal{P}_D$. Punctuations that will not appear on the data stream are not in $\mathcal{P}_D$.

For example, consider groupings based on that in Fig. 3. The grouping

$$\mathcal{G}_h = \big\{ \{\langle a, b, v, h, m \rangle \mid a \in \mathbb{Z},\ b \in \mathbb{Z},\ v \in \mathbb{R},\ m \in [0, 59]\} \;\big|\; h \in \mathbb{Z} \big\}$$

is compact relative to the punctuation scheme

$$\mathcal{P}^{h}_{B} = \{\langle *, *, *, h, * \rangle \mid h \in \mathbb{Z}\}$$

since a finite number of punctuations (one in this case) from $\mathcal{P}^{h}_{B}$ will match each group in $\mathcal{G}_h$. However, the grouping

$$\mathcal{G}_b = \big\{ \{\langle a, b, v, h, m \rangle \mid a \in \mathbb{Z},\ v \in \mathbb{R},\ h \in \mathbb{Z},\ m \in [0, 59]\} \;\big|\; b \in \mathbb{Z} \big\}$$

is not compact relative to $\mathcal{P}^{h}_{B}$. Our basic strategy is to show that, if a grouping of interest for an operator is compact relative to its input punctuation scheme, then that punctuation scheme benefits the operator. Further, if the groupings of interest for all operators in a query are compact relative to their input punctuation schemes, then the entire query benefits.
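For schemes of the simple shape used in these examples, the compactness test can be reduced to a check on attributes. The sketch below rests on an assumption that is not part of the paper: each attribute of a scheme is described as `"wild"` (always `*`), `"finite"` (finitely many patterns jointly cover its domain, such as the six 10-minute ranges), or `"each"` (one constant per value of an unbounded domain). Under that assumption, a group needs infinitely many punctuations exactly when some `"each"` attribute is left free by the grouping. The names `SchemeShape` and `is_compact` are hypothetical.

```python
from typing import Dict, Iterable

# Hypothetical per-attribute description of a punctuation scheme:
#   "wild"   - every punctuation carries "*" for this attribute
#   "finite" - finitely many patterns jointly cover the attribute's domain
#   "each"   - one constant pattern per value of an unbounded domain
SchemeShape = Dict[str, str]

# Hourly scheme: <*, *, *, h, *> for every hour h.
P_h: SchemeShape = {"auctionid": "wild", "bidderid": "wild",
                    "value": "wild", "hour": "each", "minute": "wild"}

# Ten-minute scheme: <*, *, *, h, [m, m+9]> with six ranges per hour.
P_hm: SchemeShape = {"auctionid": "wild", "bidderid": "wild",
                     "value": "wild", "hour": "each", "minute": "finite"}


def is_compact(grouping_attrs: Iterable[str], shape: SchemeShape) -> bool:
    """True if every group of the grouping has a finite cover in the scheme.

    Valid only under the stated assumption about scheme shapes: an "each"
    attribute that the grouping does not fix ranges over unboundedly many
    values inside one group, and each needs its own punctuation.
    """
    fixed = set(grouping_attrs)
    return all(kind != "each" or attr in fixed
               for attr, kind in shape.items())


by_hour = is_compact(["hour"], P_h)        # one punctuation per group
by_bidder = is_compact(["bidderid"], P_h)  # needs a punctuation for every hour
```

Here `by_hour` is `True` and `by_bidder` is `False`, matching the two groupings discussed above; grouping on hour under the ten-minute scheme is also compact, since six punctuations cover each group.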

4 ENABLING QUERY OPERATORS

We first consider the kinds of punctuation schemes that unblock unary operators. For a unary operator $O$, $\mathcal{P}_I[O]$ denotes the input punctuation scheme and $\mathcal{P}_R[O]$ the output (result) punctuation scheme. An input punctuation scheme $\mathcal{P}_I[O]$ enables an output punctuation $p \in \mathcal{P}_R[O]$ if $p$ can be propagated after some finite subset of $\mathcal{P}_I[O]$ has arrived. Further, an input punctuation scheme $\mathcal{P}_I[O]$ enables $\mathcal{P}_R[O]$ if, for every $p \in \mathcal{P}_R[O]$, $\mathcal{P}_I[O]$ enables $p$. If $\mathcal{P}_I[O]$ enables $\mathcal{P}_R[O]$, then we know that all result data items contained in interpretations of $\mathcal{P}_R[O]$ will be output. If $\mathcal{P}_R[O]$ is complete, we know that $O$ is enabled.

4.1 The Definition of Preimage

We need a map from a punctuation $p$ in $\mathcal{P}_R[O]$ to a subspace $T$ of the input dataspace $D_I$ such that, if $T$ is compact relative to $\mathcal{P}_I[O]$, then $p$ will eventually be emitted. We call this map preimage, defined for specific operators in Table 2. (We use dupelim to refer to the operator that removes duplicates.) Intuitively, for any punctuation $p \in \mathcal{P}_R[O]$, $\mathit{preimage}[O](p)$ tells us what subspace must be covered by input punctuations before $O$ can safely emit $p$. Put another way, $\mathit{preimage}[O](p)$ returns that subspace of the input that contributes to output data items that match $p$.

These are the definitions we will consider in this work. We can choose an alternate definition of preimage for a given operator, such as $\mathit{preimage}[\sigma](p) = \sigma(I(p))$, as long as Theorem 1 (given in the next section) holds.

4.2 Enabling Punctuations from Unary Operators

We want to determine if an input punctuation scheme $\mathcal{P}_I[O]$ enables an output punctuation scheme $\mathcal{P}_R[O]$ for a unary operator $O$. The following theorem specifies when a given punctuation in $\mathcal{P}_R[O]$ can be emitted by $O$. We will later use that result to show when $\mathcal{P}_R[O]$ is enabled by $\mathcal{P}_I[O]$.

Theorem 1. For unary operator $O$, grammatical input stream $S$, input punctuation scheme $\mathcal{P}_I[O]$, and output punctuation scheme $\mathcal{P}_R[O]$, $p_r \in \mathcal{P}_R[O]$ can be emitted if $\mathit{preimage}[O](p_r)$ is compact relative to $\mathcal{P}_I[O]$.

A detailed proof can be found elsewhere [7]. The proof strategy is to consider $P \subseteq \mathcal{P}_I[O]$ that is compact over $\mathit{preimage}[O](p_r)$, where all punctuations in $P$ have arrived. As discussed in Section 2, the pass invariant for $O$ formally defines what data items can be output due to punctuations. We use the pass invariant for $O$ to show that all data items in $\mathit{preimage}[O](p_r)$ have been output and, therefore, $p_r$ can be emitted.

For example, Query 1 uses the group-by operator $G^{\mathit{count}(*)}_{h}$. A possible output punctuation scheme for $G^{\mathit{count}(*)}_{h}$ could be $\mathcal{P}^{h}_{O} = \{\langle h, * \rangle \mid h \in \mathbb{Z}\}$. An input punctuation scheme relative to which the grouping on $h$ is compact, such as $\mathcal{P}^{h}_{B} = \{\langle *, *, *, h, * \rangle \mid h \in \mathbb{Z}\}$, would enable $\mathcal{P}^{h}_{O}$ for $G^{\mathit{count}(*)}_{h}$.

We use this result to show the required properties of $\mathcal{P}_I[O]$ that will enable $\mathcal{P}_R[O]$ in the following theorem.

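Theorem 2. For a unary operator $O$, let

$$S_g = \{\mathit{preimage}[O](p_r) \mid p_r \in \mathcal{P}_R[O]\}.$$

If $S_g$ is compact relative to $\mathcal{P}_I[O]$, then $\mathcal{P}_I[O]$ enables $\mathcal{P}_R[O]$.

Proof. Suppose $S_g$ is compact relative to $\mathcal{P}_I[O]$. Then,

$$\begin{aligned}
& \forall G \in S_g,\ G \text{ is compact relative to } \mathcal{P}_I[O] \\
\Rightarrow\ & \forall G \in \{\mathit{preimage}[O](p_r) \mid p_r \in \mathcal{P}_R[O]\},\ G \text{ is compact relative to } \mathcal{P}_I[O] \\
\Rightarrow\ & \forall p_r \in \mathcal{P}_R[O],\ \mathit{preimage}[O](p_r) \text{ is compact relative to } \mathcal{P}_I[O] \\
\Rightarrow\ & [\text{by Theorem 1}]\ \forall p_r \in \mathcal{P}_R[O],\ p_r \text{ will be emitted} \\
\Rightarrow\ & \forall p_r \in \mathcal{P}_R[O],\ \mathcal{P}_I[O] \text{ enables } p_r \\
\Rightarrow\ & \mathcal{P}_I[O] \text{ enables } \mathcal{P}_R[O]. \qquad \square
\end{aligned}$$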

That is, if the collection of preimages for each punctuation in $\mathcal{P}_R[O]$ is compact relative to $\mathcal{P}_I[O]$, then $\mathcal{P}_R[O]$ is enabled by $\mathcal{P}_I[O]$. In this case, we know that all punctuations in $\mathcal{P}_R[O]$ will be emitted and, therefore, all result data items will eventually be output.

TABLE 2. Definitions for preimage for Various Operators for Punctuations $p \in \mathcal{P}_R[O]$

We use the notation “$: *$” to mean $*$ values for each attribute of the schema not already listed.

The approach for binary operators is similar to that of unary operators and, hence, omitted (and can also be found elsewhere [7]).

5 CLEANSING QUERY OPERATORS

To reason about the state maintained by query operators during query execution, we use groupings to model state for various operators. We show that if the input punctuation schemes are complete and if all groupings for the state for an operator $O$ are compact relative to those schemes, then $O$ will be cleansed.

Our discussion of enabling a query operator $O$ focused on the logical definition of $O$ and did not need to consider the implementation of $O$. However, because we must model the state maintained during execution to determine whether $O$ is cleansed, we must now consider its implementation. We limit our discussion to well-known implementations of query operators and, when possible, to those that may be suitable for processing nonterminating data streams. For example, the implementation of join that we consider does not block on either input and does not require indexes.

5.1 Modeling State Required for Query Operators

Many query operator implementations use a hash table, keyed on a subset of the attributes of a data item. Clearly, hash table structures conform neatly to groupings, where the hash-key attributes are the grouping attributes. In practice, different hash-key values may hash to the same bucket. Although our model does not exactly conform to the hash buckets, it is sufficient for modeling how state is maintained.

Our models for the state maintained by various operator implementations are shown in Table 3. Operators that maintain no state, such as select, duplicate-preserving project, and union, are trivially cleansed and are not listed.

. DupElim. Duplicate elimination can be implemented using a hash table, using all attributes as the hash key. Since the hash key includes all attributes, the grouping for this implementation is a collection of singleton sets. Using $(A, B, C, D)$ as an example input schema, the grouping is $\mathcal{G}[\delta] = \big\{ \{\langle a, b, c, d \rangle\} \mid a \in A,\ b \in B,\ c \in C,\ d \in D \big\}$.

. Group-by. We implement group-by using a hash table, keyed on the group-by attributes. Therefore, we will model its state using a grouping on the group-by attributes. Again using $(A, B, C, D)$, if the group-by attribute is $A$, then the grouping is $\mathcal{G}[G^{f}_{A}] = \big\{ \{\langle a, b, c, d \rangle \mid b \in B,\ c \in C,\ d \in D\} \mid a \in A \big\}$. In practice, full data items are not held in state. Instead, only the information required to generate the resulting data items is kept. However, there is a correspondence between that information and the grouping we define.

. Join. We use the symmetric hash join [12] implementation for the join operator. Symmetric hash join maintains one hash table for each input, keyed on the join attributes. We model the state required for each input with a grouping, where the grouping attributes are the join attributes for that input. Consider the following two example input streams: $S_1$ with attributes $(A, B, C, D)$ and $S_2$ with attributes $(D, E, F)$, where the join condition is $S_1.D = S_2.D$. The grouping is the pair $(\mathcal{G}^{1}[\bowtie], \mathcal{G}^{2}[\bowtie])$, where

$$\mathcal{G}^{1}[\bowtie] = \big\{ \{\langle a, b, c, d \rangle \mid a \in A,\ b \in B,\ c \in C\} \mid d \in D \big\},$$

$$\mathcal{G}^{2}[\bowtie] = \big\{ \{\langle d, e, f \rangle \mid e \in E,\ f \in F\} \mid d \in D \big\}.$$

Note that we model state as a pair of groupings for binary operators. This model contrasts to the preimage function, which returned a collection of pairs.

. Intersect. Intersect can be implemented as a special case of join on all attributes. Thus, we can model state as for join. The hash keys for the join will be all attributes. Thus, the grouping is based on all attributes. Using $(A, B, C, D)$ for both inputs, the grouping pair is $(\mathcal{G}[\cap], \mathcal{G}[\cap])$, where

$$\mathcal{G}[\cap] = \big\{ \{\langle a, b, c, d \rangle\} \mid a \in A,\ b \in B,\ c \in C,\ d \in D \big\}.$$

. Difference. We implement difference using a hash table for each input, keyed on all attributes, as in intersect. When some data item arrives from the positive side, we first probe the hash table for the negative side. Using input streams $S_1$ and $S_2$ with attributes $(A, B, C, D)$, the two groupings are defined as $(\mathcal{G}[-], \mathcal{G}[-])$, where

$$\mathcal{G}[-] = \big\{ \{\langle a, b, c, d \rangle\} \mid a \in A,\ b \in B,\ c \in C,\ d \in D \big\}.$$
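The join state model above, one hash table per input keyed on the join attribute, can be sketched as follows. This Python class is a simplified illustration and not the implementation from [12]: the name `SymmetricHashJoin`, the integer `side` argument, and the convention that a punctuation is passed as a single join-key value (standing for a punctuation that pins the join attribute and is `*` elsewhere) are all assumptions.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Equijoin of two streams with one hash table per input (sketch)."""

    def __init__(self, key_left: int, key_right: int):
        # Position of the join attribute in items of input 0 and input 1.
        self.key = (key_left, key_right)
        # State: join-key value -> items seen so far, for each input.
        self.tables = (defaultdict(list), defaultdict(list))

    def on_data(self, side: int, item: tuple):
        k = item[self.key[side]]
        self.tables[side][k].append(item)            # insert into own table
        matches = self.tables[1 - side].get(k, [])   # probe the other table
        if side == 0:
            return [item + m for m in matches]
        return [m + item for m in matches]

    def on_punctuation(self, side: int, key_value):
        # No more items with this join key will arrive on `side`, so items
        # stored for the OTHER input under that key can never join again.
        self.tables[1 - side].pop(key_value, None)


join = SymmetricHashJoin(key_left=3, key_right=0)  # S1.D = S2.D
join.on_data(0, (1, 2, 3, 10))
results = join.on_data(1, (10, "e", "f"))
join.on_punctuation(1, 10)  # S2 will send no more items with D = 10
```

After the punctuation, the group for `D = 10` is gone from the table for `S1`, while the table for `S2` keeps its entry until a matching punctuation arrives from `S1`. A fuller version would also remember punctuations so that later arrivals on the other input with a punctuated key are not stored at all.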

5.2 Cleansing Operators with Punctuation Schemes

Using the state models, we can describe the punctuation schemes that cleanse those operators.

Theorem 3. Given a grammatical stream $S$, a unary operator $O$ that discards state per the keep invariant for $O$ at the earliest possible instant, a state model for $O$ represented as a grouping $\mathcal{G}[O]$, and an input punctuation scheme $\mathcal{P}_I[O]$, if $G \in \mathcal{G}[O]$ is compact relative to $\mathcal{P}_I[O]$, then all data items held in state for $O$ that also exist in $G$ will eventually be removed.

TABLE 3. State Models for Various Implementations of Query Operators

$R$ represents the input schema. For input domain $D_I$, the domain of attribute $a$ is $D_I(a)$. Superscripts are used to denote specific inputs for binary operators.

Details for the proof for Theorem 3 are found elsewhere [7]. The general proof strategy is to show that any data item $d$ that resides in state belongs to some group in the grouping model for that operator. Since that group is compact relative to the input punctuation scheme, some set of punctuations will eventually arrive that covers the group for $d$. As mentioned in Section 2, the keep invariant for $O$ formally defines what data items must be held in state for correct output (and, therefore, what can be removed). We use the keep invariant to show that $d$ will then be removed from state.

We extend the results on groups to groupings as a whole, and, therefore, the operator is cleansed by the input punctuation scheme.

Theorem 4. Given a grammatical stream $S$, a unary operator $O$, and a complete punctuation scheme $\mathcal{P}_I[O]$, if the state model for $O$ is a grouping $\mathcal{G}[O]$ that is compact relative to $\mathcal{P}_I[O]$, then $\mathcal{P}_I[O]$ cleanses $O$.

Proof. We need to show that, for a unary operator, every data item $d \in D_I$ that at some point resides in state will eventually be removed. Since $\mathcal{P}_I[O]$ is complete, there is a punctuation $p \in \mathcal{P}_I[O]$ such that $\mathit{match}(d, p)$. Since $S$ is grammatical, any such punctuation must arrive after $d$ in $S$. Since $d$ is in state, $d \in G_d$ for some group $G_d \in \mathcal{G}[O]$. Since $\mathcal{G}[O]$ is compact relative to $\mathcal{P}_I[O]$, $G_d$ is also compact relative to $\mathcal{P}_I[O]$. By Theorem 3, all data items in state that are also in $G_d$ can be removed. Therefore, $\mathcal{P}_I[O]$ cleanses $O$. $\square$

The case for binary operators is similar and can be found elsewhere [7].

With these results, we can now determine if a given collection of punctuation schemes will benefit (enable and cleanse) specific queries, as we will show in Section 6. Note, however, we do not address cleansing an operator of punctuation. For example, though the union operator does not maintain data items in state, it must maintain punctuations in state in order to propagate punctuations correctly. Cleansing punctuations is an area for future work.

6 DETERMINING IF PUNCTUATION SCHEMES BENEFIT SPECIFIC QUERIES

We now have the tools to demonstrate whether a given input punctuation scheme benefits a particular query operator. Now, we consider an entire query plan, composed of an arbitrary combination of query operators. Our approach is to associate input and output punctuation schemes with each operator in the plan, where the output punctuation scheme for an operator is an input punctuation scheme for its parent. If we can prove that each operator in the query plan benefits from its input punctuation(s), then we have demonstrated that the whole query benefits from the input punctuation schemes.

6.1 Determining if the Example Queries Can Benefit

Consider Query 1, expressed in relational algebra: $Q_1 = G^{\mathit{count}(*)}_{h}(\sigma_{\mathit{value} > 250}(\mathit{Bid1} \cup \mathit{Bid2}))$. A query plan was given earlier in Fig. 2. Suppose the bid streams (Bid1 and Bid2) emit the following punctuation scheme: $\mathcal{P}^{h}_{B} = \{\langle *, *, *, h, * \rangle \mid h \in \mathbb{Z}\}$. $\mathcal{P}^{h}_{B}$ benefits Query 1 if all of its operators benefit from their respective input punctuation schemes. Let $\mathcal{P}_O[G^{\mathit{count}(*)}_{h}] = \{\langle h, * \rangle \mid h \in \mathbb{Z}\}$ be the output punctuation scheme for the query. Fig. 4 shows punctuation scheme assignments for each operator.

We start at the union operator, with input punctuation schemes $\mathcal{P}^{h}_{B}$ on both inputs. We can use the definition of $\mathit{preimage}[\cup]$ to set the output punctuation scheme to $\mathcal{P}^{h}_{B}$. Similarly, using the definition of $\mathit{preimage}[\sigma]$, we can set the output punctuation scheme for select also to $\mathcal{P}^{h}_{B}$.

To show that $\mathcal{P}^{h}_{B}$ enables the group-by, we must show that some output punctuation $p$ will be emitted if $\mathit{preimage}[G^{\mathit{count}(*)}_{h}](p)$ is compact relative to $\mathcal{P}^{h}_{B}$. By definition, $\mathit{preimage}[G^{\mathit{count}(*)}_{h}](p) = I(\langle *, *, *, p(h), * \rangle)$. As $p(h) \in \Pi_{\mathbb{Z}}$, some $p_b \in \mathcal{P}^{h}_{B}$ exists such that $I(\langle *, *, *, p(h), * \rangle)$ is covered by $p_b$. Therefore, $I(\langle *, *, *, p(h), * \rangle)$ is compact relative to $\mathcal{P}^{h}_{B}$ and, so, $\mathcal{P}^{h}_{B}$ enables group-by.

To show that group-by is cleansed by $\mathcal{P}^{h}_{B}$, recall that the model for state is a grouping based on the group-by attribute, in this case, $h$:

$$\mathcal{G}[G^{\mathit{count}(*)}_{h}] = \big\{ \{\langle a, b, v, h, m \rangle \mid a \in \mathbb{Z},\ b \in \mathbb{Z},\ v \in \mathbb{R},\ m \in [0, 59]\} \;\big|\; h \in \mathbb{Z} \big\}.$$

It can be shown that $\mathcal{G}[G^{\mathit{count}(*)}_{h}]$ is compact relative to $\mathcal{P}^{h}_{B}$. Therefore, given some group $G \in \mathcal{G}[G^{\mathit{count}(*)}_{h}]$, there must exist some $p \in \mathcal{P}^{h}_{B}$ that covers $G$. Thus, by Theorem 4, $\mathcal{G}[G^{\mathit{count}(*)}_{h}]$ is cleansed by $\mathcal{P}^{h}_{B}$; hence, $\mathcal{P}^{h}_{B}$ benefits group-by.

Therefore, as each operator in Query 1 benefits from its input punctuation ($\mathcal{P}^{h}_{B}$ in all cases) and since $\mathcal{P}_O[G^{\mathit{count}(*)}_{h}]$ is complete, we have that Query 1 benefits from its input punctuation scheme $\mathcal{P}^{h}_{B}$.

In a similar fashion, appropriate punctuations can be shown to benefit Query 2. Suppose the bid stream emits a punctuation scheme $\mathcal{P}^{a}_{B} = \{\langle a, *, *, *, * \rangle \mid a \in \mathbb{Z}\}$ and the auction stream emits a punctuation scheme

$$\mathcal{P}^{i}_{A} = \{\langle i, *, *, * \rangle \mid i \in \mathbb{Z}\}.$$


Fig. 4. Punctuation scheme assignments for Query 1.

That is, both streams emit punctuations that are compact relative to groupings defined on auction ID values. Note that such punctuations are realistic: auctions will end in a fixed amount of time. Let us set the punctuation scheme assignments for the other operators as follows:

$$P_O[\cup] = P_B^a, \qquad P_O[G_a^{max(v)}] = \{\langle a, * \rangle \mid a \in \mathbb{Z}\},$$
$$P_O[\bowtie_{a=i}] = \{\langle *, *, i, *, *, * \rangle \mid i \in \mathbb{Z}\}, \qquad P_O[\pi_{i,n,v}] = \{\langle i, *, * \rangle \mid i \in \mathbb{Z}\}.$$

Punctuation scheme assignments for Query 2 are illustrated in Fig. 5. The group-by operator can be shown to benefit from $P_O[\cup]$, and the join operator can be shown to be cleansed by $P_O[G_a^{max(v)}]$ and $P_A^i$ (note that join is trivially enabled). Finally, the project operator for Query 2 can only emit punctuation based on what kinds of punctuation it receives. In this case, it will emit $P_O[\pi_{i,n,v}]$. Since $P_O[\pi_{i,n,v}]$ is complete, Query 2 benefits from its input punctuations.
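To make the Query 1 argument concrete, the following is a minimal Python sketch of a count-by-hour group-by that is both enabled and cleansed by punctuations of the form (*, *, *, h, *). It is an illustration under assumptions, not the implementation used in this work: the class name `CountByHour`, the tuple encoding of data items and punctuations, the `"*"` wildcard marker, and the attribute positions are all hypothetical, and the selection on price is assumed to have been applied upstream.

```python
WILDCARD = "*"  # assumed encoding of the wildcard pattern


class CountByHour:
    """Sketch of a group-by count on hour over Bid items (a, b, v, h, m)."""

    HOUR = 3  # assumed position of the hour attribute in the Bid schema

    def __init__(self):
        self.state = {}  # hour -> running count

    def on_data(self, item):
        hour = item[self.HOUR]
        self.state[hour] = self.state.get(hour, 0) + 1

    def on_punctuation(self, punct):
        """Return (results, output punctuations) triggered by punct."""
        # A punctuation covers whole groups only if every non-hour
        # pattern is the wildcard; otherwise it is ignored here.
        if any(p != WILDCARD for i, p in enumerate(punct) if i != self.HOUR):
            return [], []
        hour = punct[self.HOUR]
        covered = [h for h in self.state if hour == WILDCARD or h == hour]
        # Enable: emit the result for each covered group.
        # Cleanse: remove that group from state.
        results = [(h, self.state.pop(h)) for h in sorted(covered)]
        return results, [(hour, WILDCARD)]
```

For example, after two bids for hour 1 and one bid for hour 2, the punctuation `("*", "*", "*", 1, "*")` releases the count for hour 1 and leaves only the hour 2 group in state, whereas a punctuation on auction ID alone releases nothing.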

6.2 Algorithm for Determining if a Query Can Benefit

Using the examples above, we can formulate a general algorithm to determine if a given query can benefit from available punctuations. In essence, the algorithm is a depth-first traversal, where each operator requests all available punctuation schemes from its source(s) and determines which of those punctuation schemes will benefit that operator and, of those that do, what the available output punctuation schemes are. If all operators benefit, then the query benefits.

Specifically, each query operator will support the iterator availablePSchemes. This iterator will first call availablePSchemes on its child node(s), then check if the returned punctuation schemes will benefit that operator. If not, then the operator will continue to call that method on its child node(s) until either punctuation schemes are returned that will benefit the operator or a NULL is returned, indicating that no more punctuation schemes are available. In that case, the operator also returns NULL and the query will not benefit from any available punctuation schemes. If punctuation schemes are returned from the child node(s) that benefit the operator, then an output punctuation scheme is generated based on the input punctuation schemes and the propagation invariant for that operator and that punctuation scheme is returned. If the root operator for the query plan returns a non-NULL complete punctuation scheme, then the query will benefit from the input punctuation scheme. The algorithm for unary operators is given in Fig. 6. It can be adapted to work for binary operators as well. The method benefit returns true if the given punctuation scheme benefits that operator. The method determinePScheme returns the output punctuation scheme based on the available input punctuation schemes and that operator’s punctuation invariant. The definitions of these methods will vary for each operator, based on the propagation invariant for that operator.
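The traversal just described can be sketched in Python as follows. This is a simplified illustration rather than the algorithm of Fig. 6 itself: a punctuation scheme is represented, by assumption, as a frozenset of the attribute names that its punctuations constrain; generator exhaustion plays the role of the NULL return; the completeness check on the root scheme is omitted; and the class names and the particular `benefit` rules for `Select` and `GroupBy` are hypothetical stand-ins for the per-operator definitions.

```python
class Source:
    """Leaf of the plan; offers the punctuation schemes its stream emits."""

    def __init__(self, schemes):
        self.schemes = list(schemes)

    def available_pschemes(self):
        yield from self.schemes


class UnaryOperator:
    def __init__(self, child):
        self.child = child

    def benefit(self, scheme):
        raise NotImplementedError

    def determine_pscheme(self, scheme):
        raise NotImplementedError

    def available_pschemes(self):
        # Keep pulling schemes from the child until one benefits this
        # operator; running out of schemes corresponds to returning NULL.
        for scheme in self.child.available_pschemes():
            if self.benefit(scheme):
                yield self.determine_pscheme(scheme)


class Select(UnaryOperator):
    # Assumed rule: select is neither blocking nor stateful, so every
    # scheme is acceptable and is propagated unchanged.
    def benefit(self, scheme):
        return True

    def determine_pscheme(self, scheme):
        return scheme


class GroupBy(UnaryOperator):
    def __init__(self, child, group_attrs):
        super().__init__(child)
        self.group_attrs = frozenset(group_attrs)

    # Assumed rule: a scheme helps only if it constrains grouping
    # attributes and nothing else, so each punctuation covers whole groups.
    def benefit(self, scheme):
        return bool(scheme) and scheme <= self.group_attrs

    def determine_pscheme(self, scheme):
        return scheme


def query_benefits(root):
    """True if the root operator can produce at least one output scheme."""
    return next(root.available_pschemes(), None) is not None
```

With a bid source offering schemes on auctionid and on hour, `GroupBy(Select(source), {"hour"})` skips the auctionid scheme and yields the hour scheme, so the plan is reported as benefiting.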

7 THE DESCRIBE OPERATOR

We say that a punctuation describes a set of attributes if there exists a specific set of values for those attributes such that the punctuation covers all possible data items with those values. For example, given a grouping $\mathcal{G}$ over some attribute $A$, a punctuation that covers a group $G \in \mathcal{G}$ is said to describe the grouping attributes of $\mathcal{G}$. In our punctuation format, given a set of attributes $A$ in a schema $R$, a punctuation describes $A$ if the pattern for every attribute in $R - A$ is the wildcard. Consider the punctuation schemes for Bid:

$$P_B^h = \{\langle *, *, *, h, * \rangle \mid h \in \mathbb{Z}\}, \qquad P_B^a = \{\langle a, *, *, *, * \rangle \mid a \in \mathbb{Z}\}.$$

Punctuations in $P_B^h$ describe the hour attribute (as well as the hour and minute attributes and any other combination of attributes containing hour), and punctuations in $P_B^a$ describe the auctionid attribute (and any combinations containing auctionid).

We have seen that operators often have a natural grouping of interest and that only punctuations of a certain form can benefit those operators. Rather than having every operator include a routine to determine if a given punctuation will be beneficial, we factor this functionality into a new operator, called the describe operator. The describe operator outputs data items as they arrive and filters out


Fig. 5. Punctuation scheme assignments for Query 2.

Fig. 6. Algorithm for determining if a specific unary query operator will benefit from available punctuation schemes, where o is the child query operator.

incoming punctuations that will not help query operators further along in the query tree. In factoring this functionality into a separate operator, we avoid implementing separate code in operators such as join and group by to verify each incoming punctuation. Further, by taking advantage of equivalences involving describe and other operators, we are able to “push” the describe operator down in the query plan in order to filter out punctuations earlier during query execution. Some equivalences of this form have been defined elsewhere [7]. Note that the describe operator can be used to filter out all punctuations when the query does not require punctuations for proper execution (for example, selection queries).

Another function that can optionally be performed by the describe operator is to “build up” new punctuations from incoming punctuations when possible. For example, data items in our Bid stream have an hour attribute and a minute attribute. Suppose a punctuation arrives that matches all possible data items for hour 1 between minutes 0 and 15 ($\langle *, *, *, 1, [0, 15] \rangle$), then another arrives that matches all possible data items for hour 1 between minutes 15 and 45 ($\langle *, *, *, 1, [15, 45] \rangle$), and then a third arrives that matches all data items for hour 1 between minutes 45 and 59 ($\langle *, *, *, 1, [45, 59] \rangle$). From these punctuations, we can infer that all data items for hour 1 have arrived, even though we have not explicitly received that punctuation. In this example, the describe operator can emit a new punctuation that matches all data items for hour 1 ($\langle *, *, *, 1, * \rangle$).

7.1 Defining Describe and Its Punctuation Invariants

The formal definition for describe is straightforward. We denote the describe operator as $D_A(S)$, where $A$ is the list of attributes that output punctuations should describe, and $S$ is the input stream. Any data item that arrives is output.

Now, we define the punctuation invariants for describe. The pass and keep invariants for describe are simple. Describe is not a blocking operator. Therefore, the pass invariant for describe does not specify any additional data items to be output due to punctuations. Similarly, because describe does not store data items in state, the keep invariant is trivial (though punctuations may be stored in state).

The main purpose for the describe operator is to propagate punctuation, so the propagation invariant is more complex. Describe manipulates punctuations in one of two ways: First, only those punctuations that will help the query are emitted. Second, new punctuations are built up from incoming punctuations when possible. The first version of describe is relatively easy to define. If a punctuation arrives that describes the desired attributes, then we can pass it on. If not, then ignore it. We use the functions data and punct to separate out the data and the punctuations, respectively, from a stream. Given an input schema $R$ for stream $S$, where $ds = data(S)$ and $ps = puncts(S)$ are the data items and punctuations that have arrived from $S$, respectively,

$$cprop_{D_A}(ds, ps) = \{x \mid x \in ps \wedge (\forall a \in (R - A),\ x(a) = *)\}, \qquad (1)$$

that is, only output punctuations that have a wildcard value (*) for attributes that are not in the set of described attributes $A$.
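Definition (1) amounts to a filter over the punctuations seen so far. The sketch below shows one way to express it in Python; the tuple encoding, the `"*"` wildcard marker, and the attribute names in the example schema (in particular `bidder` and `price`) are assumptions made for illustration only.

```python
WILDCARD = "*"  # assumed encoding of the wildcard pattern


def describes(punct, schema, describe_attrs):
    """True if punct is the wildcard on every attribute in R - A."""
    return all(
        pattern == WILDCARD
        for name, pattern in zip(schema, punct)
        if name not in describe_attrs
    )


def cprop_describe(puncts, schema, describe_attrs):
    """Keep only the punctuations that describe the attributes in A."""
    return [p for p in puncts if describes(p, schema, describe_attrs)]


# Example with an assumed five-attribute Bid schema.
BID_SCHEMA = ("auctionid", "bidder", "price", "hour", "minute")
arrived = [
    ("*", "*", "*", 1, "*"),      # describes hour
    (5, "*", "*", "*", "*"),      # describes auctionid, not hour
    ("*", "*", "*", 1, (0, 15)),  # constrains minute as well
]
hour_only = cprop_describe(arrived, BID_SCHEMA, {"hour"})
```

Here `hour_only` contains just the first punctuation; widening the described set to hour and minute would also admit the third.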

Punctuations that alone do not describe the desired attributes may be combined into a new punctuation that does describe those attributes. In the following alternate definition of cprop for describe, we use the function setCoalesce that takes a set of punctuations and builds new valid punctuations from input punctuations:

$$cprop_{D_A}(ds, ps) = \{x \mid x \in setCoalesce(ps) \wedge (\forall a \in (R - A),\ x(a) = *)\}. \qquad (2)$$

This definition outputs all punctuations that describe the desired attributes, as well as the punctuations that can be derived from the punctuations received. Both definitions of cprop are valid. Definition 2 comes at an implementation cost. We must keep punctuations in state as they arrive to be able to derive new punctuations when later punctuations arrive, thus making describe a stateful operator. Removing punctuations from state is an area of future work. Definition 1 does not have that added cost.

7.2 Implementation of Describe

In our implementation, the describe operator takes three parameters: The attributes-to-describe parameter is required. The watch-attributes and watch-patterns parameters are optional. The attributes-to-describe parameter specifies what punctuations are meaningful. In Query 1, the beneficial punctuations are those that match all data items for a given hour. Therefore, in that case, attributes-to-describe is set to hour. The watch-attributes parameter lists the attributes that can be used to build up punctuations that describe the attributes listed in the attributes-to-describe parameter. Each attribute in the watch-attributes parameter has a pattern in the watch-patterns parameter value defining a range that must be covered by input punctuations in order to generate the new punctuation. Again using Query 1 as an example, we can watch for punctuations that describe the minute attribute for a specific hour and, if they cover the range [0, 59] for a particular hour, new punctuation can be generated that describes that hour. Therefore, we set watch-attributes to minute and watch-patterns to the range [0, 59].

The implementation for (1) is easy: simply walk through the attributes of a punctuation and, for all nondescribe attributes, check that the value is the wildcard (*). Our current implementation for (2) is somewhat simplistic. We limit the watch-attributes parameter to a single attribute. In order to generate new punctuations, we only handle punctuations with range-type values for the watch-attributes and wildcard values for all other nondescribe attributes. As punctuations arrive, our goal is to combine ranges for the watch-attributes values from multiple punctuations that produce a cover for the watch-patterns value. When a cover for the watch-patterns has arrived, a new punctuation can be output with wildcard values for attributes not listed in the attributes-to-describe parameter value. An example of using the describe operator in Query 1 is shown in Fig. 7.


We store incoming punctuations in a hash table. The hash key is built by concatenating values for each nondescribe attribute. When a punctuation arrives, we build the hash key and probe the hash table. If the key is not present, we use the range value for the watch-attribute as the hash value. If the key is present, we calculate the union of the existing range value with the range value from the punctuation. If that range covers the watch-pattern, then we build a punctuation with the wildcard for watch-attribute. Otherwise, we rehash the range back into the hash table and continue. The pseudocode for this algorithm is given in Fig. 8.
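The build-up procedure can be sketched as follows. This is an interpretation under stated assumptions and not a transcription of Fig. 8: ranges are closed integer intervals encoded as `(lo, hi)` tuples, the hash key is taken to be every pattern other than the watch attribute, the stored value is a list of disjoint intervals rather than a single range, and an entry is dropped once it has produced a covering punctuation. The names `merge` and `DescribeBuildUp` are hypothetical.

```python
WILDCARD = "*"  # assumed encoding of the wildcard pattern


def merge(intervals, new):
    """Union a closed integer interval into a list of disjoint intervals."""
    lo, hi = new
    merged = []
    for a, b in intervals:
        if b < lo - 1 or a > hi + 1:  # neither overlapping nor adjacent
            merged.append((a, b))
        else:
            lo, hi = min(lo, a), max(hi, b)
    merged.append((lo, hi))
    return sorted(merged)


class DescribeBuildUp:
    """Sketch of describe with one watch attribute and one watch pattern."""

    def __init__(self, watch_index, watch_pattern):
        self.watch_index = watch_index      # position of the watch attribute
        self.watch_pattern = watch_pattern  # range to be covered, e.g. (0, 59)
        self.table = {}                     # key -> disjoint ranges seen so far

    def on_punctuation(self, punct):
        """Return a punctuation to emit, or None if nothing can be emitted."""
        i = self.watch_index
        rng = punct[i]
        if rng == WILDCARD:
            return punct  # already wildcard on the watch attribute
        if not isinstance(rng, tuple):
            return None   # only range-type values are handled
        key = punct[:i] + punct[i + 1:]
        ranges = merge(self.table.get(key, []), rng)
        lo, hi = self.watch_pattern
        if any(a <= lo and hi <= b for a, b in ranges):
            self.table.pop(key, None)
            return punct[:i] + (WILDCARD,) + punct[i + 1:]
        self.table[key] = ranges
        return None
```

Feeding the three hour-1 punctuations from the earlier example (minutes [0, 15], [45, 59], then [15, 45]) returns None twice and then the built-up punctuation that is wildcard on minute.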

7.3 Benefit of Describe

We have implemented the describe operator into the NiagaraST query engine, developed based on Niagara [13]. We executed some performance tests to evaluate the performance overhead of punctuations, as well as to determine the effect of adding the describe operator to query plans. A more thorough discussion can be found elsewhere [7]. In short, we did notice that the describe operator can improve overall data throughput for queries that process punctuations. Particularly when the describe operator is able to filter out unwanted punctuations early or build up punctuations for later operators, we see a noticeable improvement in the rate at which data items can be processed. Such results are promising, but more evaluation is an area for future work.

8 DETERMINING PUNCTUATION SCHEMES FOR OTHER STREAMING APPROACHES

Our grouping framework helps us determine if a given set of punctuation schemes benefits a particular query. We have also considered other strategies for querying over data streams: ordered data, windows, and heartbeats. For both ordered data and windows, we posit that punctuations are implicitly embedded in the stream which captures the behavior of those strategies and, therefore, we can define punctuation schemes for those punctuations. In this way, we can compare the kinds of queries each approach can support.

8.1 Positional and Ordered Data

Sequence database systems [14], [15] rely on metainformation about input sequences to optimize the execution of sequence queries. Many of the techniques used in sequence database systems can be applied to nonterminating data streams. Blocking and stateful operators can be implemented to take advantage of input that arrives in an interesting order. For example, Query 1 uses a group-by operator to group data items by hour. If the input arrives sorted on hour, when a new hour value arrives to the group-by operator, results for the previous hour can be output and that state cleared.

The Gigascope system [16] processes streams of network packet data. Operators in Gigascope rely on data that is monotonically nondecreasing on timestamp values. Specifically, the join operator must use a predicate that contains an attribute from each input that is monotonically nondecreasing. The join implementation uses this information to determine when a data item will no longer join with data items from the other join input and can therefore be removed from state (merge is similar). Further, group-by requires that some grouping attribute be monotonically nondecreasing. When a data item arrives whose ordered attribute is greater than any current group, the results for that group can then be output.

Ordered data fits nicely into our punctuation scheme framework. When data arrive in some known order, we


Fig. 7. Use of the describe operator in a query plan to output punctuations on the hour attribute ($h$), building up from punctuations on the minute attribute ($m$). $P_B^{h,m} = \{\langle *, *, *, h, [m, m+9] \rangle \mid h \in \mathbb{Z} \wedge m \in \{0, 10, 20, 30, 40, 50\}\}$. $P_B^h = \{\langle *, *, *, h, * \rangle \mid h \in \mathbb{Z}\}$.

Fig. 8. Algorithm for handling punctuations in the describe operator.

consider each data item $d$ as an implicit punctuation: We know that no data item will arrive after $d$ that precedes $d$ in the sort order. That is, each $d$ is the end of a group, where $d$ is the maximal data item (according to the given order) in the group. Thus, each group in the grouping represents a prefix of the input domain.

Considering each data item as an implicit punctuation, we can define a punctuation scheme as follows, where $A$ is the set of input domain attributes, $S$ is the set of ordering attributes, $primary(S)$ is the primary attribute of the ordering attributes $S$, and $domain(a_i)$ is the domain of the attribute $a_i$:

$$P_S = \{\langle p_1, p_2, \ldots, p_n \rangle \mid \forall a_i \in A,\ a_i \neq primary(S) \Rightarrow p_i = *,\ a_i = primary(S) \Rightarrow p_i = [\bot, v_i) \text{ where } v_i \in domain(a_i)\}.$$

That is, each punctuation has wildcard values for the attributes that are not the primary sorting attribute and a range pattern from $\bot$ up to a value for the primary sorting attribute.

For Gigascope, the primary sorting attribute is the timestamp attribute. A data item $d$ serves as a punctuation with wildcard patterns for nontimestamp attributes and a range pattern that matches all values up to the timestamp value for $d$. (This is a common style of punctuations that we refer to as linear punctuations.) Considering $d$ as a punctuation, a blocking operator can output results for groups up to the prefix including $d$, and a stateful operator can reduce state for those same groups. Since the punctuations are implicit in the data items, in order for punctuations to remain valid, operators must maintain ordered output. Thus, queries containing operators with groupings on the timestamp attribute will benefit from ordered data streams.

Given $P_S$, we can tell that only queries with groupings of interest based on the sorted attribute will benefit from ordered inputs. Other queries will not benefit and are not appropriate for nonterminating input streams. In our example, Query 1 has a grouping of interest on hour. If data items arrive sorted by hour, then these systems will be able to execute that query. However, Query 2 has a grouping of interest on auctionid. As it is unlikely that the inputs will arrive sorted on auctionid, these systems will not be able to execute that query.
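As a rough illustration of how a data item acts as an implicit punctuation from $P_S$, the Python sketch below counts items per value of a sorted attribute and closes every group whose key is strictly smaller than the key of the arriving item, mirroring the half-open range pattern. It assumes the input really is nondecreasing on the chosen attribute; the class name `SortedCount` and the tuple layout are hypothetical and do not reflect how Gigascope or sequence database systems are implemented.

```python
class SortedCount:
    """Sketch of a count group-by over input sorted on the grouping attribute."""

    def __init__(self, key_index):
        self.key_index = key_index  # position of the sorted grouping attribute
        self.state = {}             # key -> running count

    def on_data(self, item):
        """Process one item; return the (key, count) results it releases."""
        key = item[self.key_index]
        # The item implies a punctuation with range pattern [bottom, key):
        # no later item can carry a smaller key, so those groups are done.
        closed = sorted(k for k in self.state if k < key)
        results = [(k, self.state.pop(k)) for k in closed]
        self.state[key] = self.state.get(key, 0) + 1
        return results
```

Note that the group for the most recent key always remains in state, since an item with the same key may still arrive.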

8.2 Windows

Many stream processing systems break input data into contiguous subsets, called windows [17], [18], [19], [20], [21]. The two most common kinds of windows are

• tumbling windows, which break the stream into successive, nonoverlapping subsets of data, and

• sliding windows, which break the stream into successive, overlapping subsets of data.

By breaking the stream input into bounded windows, a query operator processes bounded data sets. As each new window arrives, the query is restarted for that new window. Since the window’s content is bounded, blocking operators output results before reaching the end of the stream. For example, we could redefine Query 1 to use fixed windows as “Output the number of high-valued bids each hour.”

Each window has a first data item and a last data item. Thus, we must have some notion of when the last item for a window has arrived in order to process the window. For example, a window defined on time ends when the time runs out. That point when the window ends can be considered an implicit punctuation.

Li et al. [22] make window participation explicit in a data item using a window identifier ($wid$). We will use this approach in our discussion for concreteness. Let $wid$ have domain $W$. Therefore, the punctuation scheme for windows could be

$$P_W = \{\langle p_1, p_2, \ldots, p_n, wid \rangle \mid \forall a_i \in A,\ p_i = *,\ wid \in W\}.$$

A punctuation exists in the punctuation scheme for each $wid$. When a window completes, a punctuation is embedded into the stream stating that the window has closed. A blocking operator can output its results for that window, a stateful operator can remove from state the information for that window, and all operators can emit a punctuation stating that the window is complete. Thus, a query that defines windows over its input streams will benefit from $P_W$.

We can see from $P_W$ that queries that have groupings of interest on the window ID will benefit from windows. That is, if the query defines some kind of window, then these kinds of systems can be used over nonterminating streams effectively. Again, since Query 1 has a grouping of interest on the hour attribute, a window can be defined over that attribute and window queries will execute successfully.
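One way to picture $P_W$ is as a preprocessing step that tags each data item with a window identifier and embeds a punctuation whenever a window closes. The Python sketch below does this for tumbling windows over a time attribute. It is an assumption-laden illustration, not the mechanism of Li et al. [22]: the function name, the `("data", ...)`/`("punct", ...)` event encoding, and the integer-division rule for computing the window identifier are hypothetical, and the input is assumed to be nondecreasing on the time attribute.

```python
WILDCARD = "*"  # assumed encoding of the wildcard pattern


def with_window_punctuations(items, time_index, width):
    """Tag items with a window id and emit a punctuation per closed window."""
    current = None  # id of the window currently open, if any
    for item in items:
        wid = item[time_index] // width
        if current is not None and wid > current:
            # Every window from the open one up to, but not including,
            # the new one is now complete (some may have been empty).
            for closed in range(current, wid):
                yield ("punct", (WILDCARD,) * len(item) + (closed,))
        current = wid if current is None else max(current, wid)
        yield ("data", item + (wid,))
```

Each emitted punctuation is wildcard on all original attributes and carries only the identifier of the window that has closed, matching the shape of the members of $P_W$.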

Query 2 has a grouping of interest on auctionid for both the Auction and Bid streams as those streams participate in the join. A window can be defined for the auctionid attribute for the Auction stream, but it is impractical to try to define a window based on the auctionid over the Bid stream. Bids for open auctions will be interspersed with other bids, so each window will contain bids for multiple auctions. Auctions will likely close at different times, so it would not be appropriate to calculate the “current” closing price for all auctions in the window, only those that have actually closed. Clearly, Query 2 cannot be rewritten to use windows effectively.

8.3 Heartbeats

Similar to punctuations are heartbeats, presented by Babu and Widom [23]. The Stanford system requires that all data items be moved from the input manager to the query processor and that they be in order, which they term progress. Heartbeats are a mechanism by which progress can be ensured in all cases. They define a heartbeat $\tau$ to be an application-defined timestamp over a discrete ordered domain, which we will call $T$. A heartbeat $\tau$ over a stream or set of streams tells the input manager that every subsequent timestamp on those streams will be greater than $\tau$; thus, the input manager can move all buffered data items into the query processor. A heartbeat may be supplied by the stream source or, if such is not the case, it may be determined and supplied by the data stream management system (DSMS). So long as at least one stream is supplying data, there is progress in the system. However, should all streams pause, there will be no further heartbeats and the input manager will continue to hold the buffered tuples. To avoid this situation, a timeout value $t_{timeout}$ is supplied by the user which will define the maximum amount of time that will be allowed before the DSMS declares a heartbeat and releases data from the input manager.

The entire point of defining progress on the data streams is to make sure that there are never any data items that are buffered indefinitely in the input manager. When a heartbeat is introduced, the input manager knows that the buffered data items may be sent. The input manager also is responsible for ordering any out-of-order data items as order on timestamp values is a requirement for the query processor.

Each heartbeat can be considered a punctuation, stating that no more data items will arrive with a timestamp value less than or equal to the timestamp value of the heartbeat. Thus, we can define a punctuation scheme for heartbeat-style punctuations as follows:

$$P_T = \{\langle p_1, p_2, \ldots, p_n, [\bot, \tau] \rangle \mid \forall a_i \in A,\ p_i = *,\ \tau \in T\}.$$

When a punctuation from $P_T$ arrives, blocking operators output data items and stateful operators reduce state. Heartbeats only benefit queries over timestamps. Therefore, this scheme would function well for a time-based query such as in Query 1, but would not be able to process Query 2, where the grouping of interest is on auctionid.
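The correspondence between a heartbeat and a punctuation from $P_T$ can be sketched as an input manager that buffers items and, on a heartbeat $\tau$, releases everything with timestamp at most $\tau$ in timestamp order along with the matching punctuation. This Python sketch is hypothetical and is not the design of the Stanford system [23]: the class and method names are invented, `None` stands in for $\bot$, and timeouts are not modeled.

```python
import heapq

WILDCARD = "*"  # assumed encoding of the wildcard pattern


class InputManager:
    """Sketch of a buffer that releases data in order when a heartbeat arrives."""

    def __init__(self, arity):
        self.arity = arity  # number of non-timestamp attributes
        self.buffer = []    # min-heap of (timestamp, item)

    def on_data(self, timestamp, item):
        heapq.heappush(self.buffer, (timestamp, item))

    def on_heartbeat(self, tau):
        """Return (released items in order, punctuation for [bottom, tau])."""
        released = []
        while self.buffer and self.buffer[0][0] <= tau:
            released.append(heapq.heappop(self.buffer))
        # Wildcard on every data attribute, closed range up to tau on time.
        punct = (WILDCARD,) * self.arity + ((None, tau),)
        return released, punct
```

Items with timestamps above $\tau$ stay buffered until a later heartbeat covers them.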

8.4 Summary

We have shown how all three common approaches to processing data streams (windows, ordered data, and heartbeats) can be described using punctuations. We described each approach in terms of a punctuation scheme implied by it. The next step is to put this theory into practice. We have begun designing and developing a data stream processing system that will rely on punctuations to implement the various approaches. Such a system will allow us to evaluate the efficiency of our approach as compared to systems that are specifically designed to use other approaches.

9 RELATED WORK

Data stream processing has received a great deal of attention recently in the data management community. Babcock et al. [24] give a good overview.

We have seen that some systems rely on input arrival order to handle queries over nonterminating streams. The Gigascope system [16] is a good example. Sequence data systems [14], [15] and temporal database systems [25] rely on input order. In our example, Query 1 groups on hour. If data arrive in an interesting order relative to a query, then these kinds of systems will be able to produce results.

Another approach is to define windows over the input data according to data arrival. Sliding-window queries were introduced in temporal database systems [26], [27] and later applied to queries over data streams in the Tangram system [28]. This kind of query has also been called a moving-window query. Tumbling-window queries (or fixed-window queries) are a special case of sliding-window queries. For a tumbling-window query, we alter the request slightly to only output results at the end of the sliding-window period. Landmark windows [19] consider all data items in a stream from some landmark forward to calculate a result. Damped-window queries [21] are an extension of sliding-window queries. A damped-window query evaluates each window along with previous windows together, where more recent windows make a greater contribution to the results for a window than older windows.

Li et al. [29] present a guarantee of safety for the join operator using punctuations. That is, their goal is to guarantee that the join operator in a given continuous join query can execute without requiring unbounded state based on knowledge of incoming punctuations. They introduce a simplified version of the punctuated schemes that we present, where each attribute is allowed a wildcard or nonwildcard pattern. Our punctuation schemes are more complete and we show how the punctuation schemes can be applied to many more query operators.

Other systems have used variants of punctuations with positive results. Shkapenyuk et al. [30], for example, show how punctuations can be applied in Gigascope to decrease query memory utilization. Further, Ding et al. [31] present a join algorithm that is optimized to process punctuations, and show that their algorithm outperforms pure window join algorithms in both memory usage and throughput.

10 CONCLUSIONS AND FUTURE WORK

Understanding how punctuations improve the behavior of individual query operators is interesting. However, it is only a building block for a more important problem, namely, whether a given query can benefit from punctuations and, if so, what kinds of punctuations can benefit that query.

To this end, we use punctuation schemes to specify punctuations that may appear from a source. Punctuation schemes are defined using set notation and describe a natural grouping. Further, we have seen that various query operators also have natural groupings of interest. When the groupings defined by punctuation schemes cover the groupings of interest for a query, then the punctuation schemes may benefit the query. We showed how punctuation schemes can be used to define other common approaches to processing data streams.

Further, we have introduced the describe operator specifically to handle punctuations. Instead of each operator analyzing punctuations as they arrive to determine if they can benefit that operator, the describe operator can filter out unwanted punctuations. Further, the describe operator can be “pushed down” the query tree, filtering out unwanted punctuations as early as possible and building up punctuations when desired.

Finally, we presented a top-down algorithm for determining if a given query will benefit from any of the input punctuation schemes available from its inputs. Such an algorithm would be useful for optimizers when determining an appropriate query plan to execute over streams. If one query plan can benefit from incoming punctuations and another plan cannot, then perhaps choosing the first plan would be most appropriate.


Our work here is a starting point. Clearly, some significant issues remain. One is developing a more formal approach for determining necessary input punctuation schemes for a given query. To be effective, such an approach should be able to list multiple possible input punctuation schemes as well as suggest an “optimal” set of punctuation schemes. Optimality could mean the set of punctuation schemes that output data items the fastest, the set of punctuation schemes that minimize state, or some other criterion.

Currently, the placement and parameter values of the describe operator are chosen by hand. Determining those values is very much specific to the application data and the available punctuations. One could imagine a declarative language that defines relationships between attributes in a stream, which could then be used to determine the parameters for the describe operator. An area for future work is how to implement a query optimizer that handles the describe operator.

Our notion of cleansing an operator might be strengthened. An operator is cleansed if any data item that resides in state will eventually be removed from state. A stronger notion is to give a bound for state. This notion might be captured in one of at least three related ways. First, a given data item will be removed within n data items that follow it (similar to one kind of k-constraints [32]). Second, there is a bound on how many data items will be held in state. Finally, there is a bound on how long an item will be held in state. All three are useful, but require more knowledge about the properties of the input streams. For example, knowing the arrival order of punctuations and the “skew” (or distance in tuples) between punctuations on different inputs would be useful in determining the maximum amount of time a data item will remain in state. Strengthening our notion of cleansing in these ways could improve how an optimizer chooses a query plan and is an area for future work.

It might make sense to quantify how much a query benefits from a punctuation scheme, rather than making a simple “yes/no” judgment. For example, all operators involved in a given query plan might benefit from punctuations, with the exception of a project operator. Project is neither a blocking nor a stateful operator, but, due to the projected attributes, it is not able to emit punctuations. We would want the query optimizer to ensure that the project operator is executed as near to the top of the query plan as possible, allowing more operators in the query plan to benefit. Further, if some query plan cannot benefit by any punctuation, we would like to determine if an alternative, equivalent query plan exists that can benefit. Finally, we would like to include available punctuation schemes in query optimization. For example, relevant punctuation schemes could be a logical property of subexpressions during query optimization and be used to limit construction of plans to those that can benefit from available input punctuations.

ACKNOWLEDGMENTS

The authors would like to thank Jennifer Widom for her many useful suggestions, Sava Krstic, John Matthews, Kent Jones, Lyle Cochran, and Donna Pierce for the discussions they had related to compactness and topologies, Leonidas Fegaras for discussions he had relating compactness to punctuation schemes and helping to find examples of equivalent queries with different behavior for a punctuation scheme, Vassilis Papadimos, Kristin Tufte, and Jin Li for various discussions on stream queries and feedback on this paper, and comments from the anonymous reviewers. Funding was provided by the US Defense Advanced Research Projects Agency (DARPA) through NAVY/SPAWAR Contract N66001-99-1-8908 and by US National Science Foundation Awards IIS0086002 and IIS0612311. Funding for Paul Stephens was provided by the Office of Career Services and by the Weyerhauser Younger Scholars Program at Whitworth University.

REFERENCES

[1] eBay homepage, http://www.eBay.com/, 2007.

[2] Yahoo! auctions homepage, http://auctions.yahoo.com/, 2007.

[3] J.A. Rodríguez, P. Noriega, C. Sierra, and J. Padget, “FM96.5 A Java-Based Electronic Auction House,” Proc. Int’l Conf. and Exhibition on the Practical Application of Intelligent Agents and Multi-Agent Technology, pp. 207-224, Apr. 1997.

[4] P.R. Wurman, M.P. Wellman, and W.E. Walsh, “The Michigan Internet AuctionBot: A Configurable Auction Server for Human and Software Agents,” Proc. Second Int’l Conf. Autonomous Agents (Agents ’98), pp. 301-308, May 1998.

[5] A. Arasu, M. Cherniak, E. Galvez, D. Maier, A. Maskey, E. Ryvkina, M. Stonebraker, and R. Tibbets, “Linear Road: A Stream Data Management Benchmark,” Proc. Int’l Conf. Very Large Data Bases, pp. 480-491, Aug. 2004.

[6] J. Li, D. Maier, V. Papadimos, P. Tucker, and K. Tufte, “NEXMark—A Benchmark for Queries over Data Streams,” http://datalab.cs.pdx.edu/niagara/NEXMark/, 2003.

[7] P.A. Tucker, “Punctuated Data Streams,” PhD dissertation, OGI School of Science and Eng. at Oregon Health and Science University, Aug. 2005.

[8] P.A. Tucker, D. Maier, T. Sheard, and L. Fegaras, “Exploiting Punctuation Semantics in Continuous Data Streams,” IEEE Trans. Knowledge and Data Eng., vol. 15, no. 3, pp. 555-568, May/June 2003.

[9] P.G. Selinger, M.M. Astrahan, D.D. Chamberlin, R.A. Lorie, and T.G. Price, “Access Path Selection in a Relational Database Management System,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 23-34, May 1979.

[10] R.H. Kasriel, Undergraduate Topology. W.B. Saunders, 1971.

[11] W. Rudin, Principles of Mathematical Analysis. McGraw Hill, 1964.

[12] A.N. Wilschut and P.M.G. Apers, “Dataflow Query Execution in a Parallel Main-Memory Environment,” Proc. IASTED Int’l Conf. Parallel and Distributed Information Systems, pp. 68-77, Dec. 1991.

[13] J. Naughton, D. DeWitt, D. Maier, J. Chen, L. Galanis, K. Tufte, J. Kang, Q. Luo, N. Prakash, and F. Tian, “The Niagara Query System,” The IEEE Data Eng. Bull., vol. 24, no. 2, pp. 27-33, June 2000.

[14] P. Seshadri, M. Livny, and R. Ramakrishnan, “Sequence Query Processing,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 430-441, May 1994.

[15] P. Seshadri, M. Livny, and R. Ramakrishnan, “SEQ: A Model for Sequence Databases,” Proc. IEEE Int’l Conf. Data Eng., pp. 232-239, Mar. 1995.

[16] T. Johnson, C. Cranor, O. Spatscheck, and V. Shkapenyuk, “Gigascope: A Stream Database for Network Applications,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 647-651, June 2003.

[17] A. Arasu, S. Babu, and J. Widom, “The CQL Continuous Query Language: Semantic Foundations and Query Execution,” Int’l J. Very Large Data Bases, vol. 15, no. 2, pp. 121-142, June 2006.

[18] S. Chandrasekaran and M.J. Franklin, “Streaming Queries over Streaming Data,” Proc. Int’l Conf. Very Large Data Bases, pp. 203-214, Aug. 2002.

[19] J. Gehrke, F. Korn, and D. Srivastava, “On Computing Correlated Aggregates over Continuous Data Streams,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 13-24, May 2001.

[20] M. Sullivan and A. Heybey, “Tribeca: A System for Managing Large Databases of Network Traffic,” Proc. USENIX Ann. Technical Conf., pp. 13-24, June 1998.

[21] Y. Zhu and D. Shasha, “StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time,” Proc. Int’l Conf. Very Large Data Bases, pp. 358-369, Aug. 2002.

[22] J. Li, D. Maier, K. Tufte, V. Papadimos, and P.A. Tucker, “Semantics and Evaluation Techniques for Window Aggregates in Data Streams,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 311-322, June 2005.

[23] S. Babu and J. Widom, “Continuous Queries over Data Streams,” SIGMOD Record, vol. 30, no. 3, pp. 109-120, Sept. 2001.

[24] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and Issues in Data Stream Systems,” Proc. ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems, pp. 1-16, June 2002.

[25] M.D. Soo, “Bibliography on Temporal Databases,” SIGMOD Record, vol. 20, no. 1, pp. 14-23, 1991.

[26] G. Özsoyoğlu and R.T. Snodgrass, “Temporal and Real-Time Databases: A Survey,” IEEE Trans. Knowledge and Data Eng., vol. 7, no. 4, pp. 513-532, Aug. 1995.

[27] A. Segev and A. Shoshani, “Logical Modeling of Temporal Data,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 454-466, May 1987.

[28] D.S. Parker, R.R. Muntz, and L. Chau, “The Tangram Stream Query Processing System,” Proc. Fifth IEEE Int’l Conf. Data Eng., pp. 556-563, Feb. 1989.

[29] H.-G. Li, S. Chen, J. Tatemura, D. Agrawal, K.S. Candan, and W.-P. Hsiung, “Safety Guarantee of Continuous Join Queries over Punctuated Data Streams,” Proc. Int’l Conf. Very Large Data Bases, pp. 19-30, Sept. 2006.

[30] V. Shkapenyuk, T. Johnson, O. Spatscheck, and S. Muthukrishnan, “A Heartbeat Mechanism and Its Application in Gigascope,” Proc. Int’l Conf. Very Large Data Bases, pp. 1079-1088, Aug. 2005.

[31] L. Ding, N. Mehta, E.A. Rundensteiner, and G.T. Heineman, “Joining Punctuated Streams,” Proc. Ninth Int’l Conf. Extending Database Technology, pp. 587-604, Mar. 2004.

[32] S. Babu, U. Srivastava, and J. Widom, “Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams,” ACM Trans. Database Systems, vol. 29, no. 3, pp. 545-580, Sept. 2004.

Peter A. Tucker received the BS degree in mathematics and computer science in 1991 from Whitworth College and the PhD degree from Oregon Health and Science University in 2005. He worked for eight years at Microsoft Corp. in software development, holding such roles as software design engineer in testing and software design engineer, before attending the OGI School of Science and Engineering at Oregon Health and Science University. He is currently an assistant professor at Whitworth University in the Department of Math and Computer Science. His current interests include data stream processing, computer ethics, and functional programming. He is a member of the ACM, IEEE, and IEEE Computer Society.

David Maier received the BA (Honors College) degree from the University of Oregon, with a double major in mathematics and computer science, and the PhD degree from Princeton University in electrical engineering and computer science. He is currently the Maseeh Professor of Emerging Technologies in the Department of Computer Science at Portland State University. He was previously on the faculties of the OGI School of Science and Engineering (formerly the Oregon Graduate Institute) at the Oregon Health and Science University and of the State University of New York at Stony Brook. He has held visiting positions at l’Institut National de Recherche en Informatique et en Automatique (INRIA, Rocquencourt) and the University of Wisconsin-Madison. His research interests include query processing, object-oriented systems, databases and data-product management for scientific computing, superimposed information systems, and stream data processing. He is a fellow of the ACM and holds the ACM SIGMOD Innovations Award. He is a senior member of the IEEE and a member of the IEEE Computer Society.

Tim Sheard received the PhD degree in computer and information science from the University of Massachusetts, Amherst, in 1985 and is currently a professor of computer science at Portland State University. His research interests include program generation, metaprogramming systems, theorem proving, logical frameworks, type systems, domain specific languages, and patterns for functional programming. He was the general chair of Generative Programming and Component Engineering (GPCE ’04) and was organizer of the 2001 ICFP Programming Contest that attracted more than 250 entries from around the world. He is a pioneer in the area of metaprogramming and is the creator of three research artifacts (MetaML, Template Haskell, and the Omega Programming Language) which have a broad influence on the programming language community. He is a member of the IEEE.

Paul Stephens is currently working toward the BS degree in computer science, the BA degree in mathematics, and the BA degree in applied physics at Whitworth University. He has experience in business IT working at UPF Incorporated in Spokane, Washington, where he was active in database management and Web services. Following graduation, he is interested in pursuing graduate studies in natural language processing.


