
Proceedings of DMKD'03

8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery

13th June 2003, San Diego, California

in conjunction with the ACM SIGMOD International Conference on Management of Data

Sponsored by ACM SIGMOD

Editors

Mohammed J. Zaki
Computer Science Department
Rensselaer Polytechnic Institute
Troy, New York, USA

Charu C. Aggarwal
IBM T.J. Watson Research Center
Yorktown Heights, New York, USA


8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'03)

Mohammed J. Zaki
Computer Science Department
Rensselaer Polytechnic Institute

Troy, NY 12180

[email protected]

Charu C. Aggarwal
IBM T.J. Watson Research Center

Yorktown Heights, NY 10598

[email protected]

Foreword

This is the 8th workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'03), held annually in conjunction with the ACM SIGMOD conference. The workshop aims to bring together data-mining researchers and experienced practitioners with the goal of discussing the next generation of data-mining tools. Rather than following a "mini-conference" format focusing on the presentation of polished research results, the DMKD workshop aims to foster an informal atmosphere, where researchers and practitioners can freely interact through short presentations and open discussions on their ongoing work as well as forward-looking research visions and experiences for future data-mining applications.

In addition to research on novel data-mining algorithms and experiences with innovative mining systems and applications, of particular interest in this year's DMKD workshop is the broad theme of "Future Trends in Data Mining." Topics of interest include:

• Privacy & Security Issues in Mining: Privacy-preserving data mining techniques are invaluable in cases where one may not look at the detailed data but is allowed to infer high-level information. This is also relevant to the use of mining for national security applications.

• Mining Data Streams: In many emerging applications, data arrives and must be processed on a continuous basis, i.e., there is a need for mining without the benefit of several passes over a static, persistent snapshot.

• Data Mining in Bioinformatics and Biological Database Systems: High-performance data mining tools will play a crucial role in the analysis of the ever-growing databases of bio-sequences and structures.

• Semi/Un-Structured Mining for the World Wide Web: The vast amounts of information with little or no structure on the web raise a host of challenging mining problems, such as web resource discovery and topic distillation; web structure/linkage mining; intelligent web searching and crawling; and personalization of web content.

• Future Trends/Past Reflections: What are the emerging topics and applications for next-generation data mining? What lessons have been learned from over a decade of data mining? What areas need more attention?

Out of the 28 paper submissions we were able to select only 8 full papers (an acceptance rate of 28.5%). In addition, 5 short papers were accepted. Each paper was reviewed by three members of the program committee. Along with an invited talk, we were able to assemble a very exciting program.

We would like to thank all the authors, invited speakers, and attendees for contributing to the success of the workshop. Special thanks to the program committee for help in reviewing the submissions and ensuring a high-quality program.

Sincerely,
Mohammed Zaki, Charu Aggarwal (editors)

Workshop Co-Chairs
Mohammed J. Zaki, Rensselaer Polytechnic Institute
Charu Aggarwal, IBM T.J. Watson Research Center

Program Committee
Roberto Bayardo, IBM Almaden Research Center, USA
Alok Choudhary, Northwestern University, USA
Gautam Das, Microsoft Research, USA
Venkatesh Ganti, Microsoft Research, USA
Minos N. Garofalakis, Bell Labs, USA
Dimitrios Gunopulos, University of California, Riverside, USA
Jiawei Han, University of Illinois at Urbana-Champaign, USA
Eamonn Keogh, University of California, Riverside, USA
Nick Koudas, AT&T Research, USA
Vipin Kumar, University of Minnesota, USA
Bing Liu, University of Illinois at Chicago, USA
Rosa Meo, University of Torino, Italy
Raymond Ng, University of British Columbia, Canada
Srini Parthasarathy, Ohio State University, USA
Rajeev Rastogi, Bell Labs, USA
Kyuseok Shim, Seoul National University, South Korea
Hannu Toivonen, University of Helsinki, Finland
Philip S. Yu, IBM T.J. Watson Research Center, USA


Workshop Program & Contents

The workshop will foster an informal atmosphere, where participants can freely interact through short presentations and open discussions on their ongoing work, as well as discussion of current and future trends in data mining. Questions can be asked freely at any time, and discussion slots have been set aside in the morning and afternoon sessions. These slots are meant to focus on critical evaluation of current approaches and open problems. All speakers and participants are encouraged to think about these issues prior to the workshop and to take an active part in the ensuing discussions.

8:00-8:30 Continental Breakfast

8:30-8:35 Opening Remarks

8:35-9:10, Invited Talk: Analyzing Massive Data Streams: Past, Present, and Future, Minos Garofalakis (Bell Labs) page 1

9:10-9:50 Data Streams I

• 9:10-9:35, Full Paper: A Symbolic Representation of Time Series, with Implications for Streaming Algorithms, Jessica Lin, Eamonn Keogh, Stefano Lonardi, Bill Chiu (University of California, Riverside) page 2

• 9:35-9:50, Short Paper: Clustering Binary Data Streams with K-means, Carlos Ordonez (Teradata, NCR) page 12

9:50-10:15 Coffee Break

10:15-11:20 DB Integration

• 10:15-10:40, Full Paper: Processing Frequent Itemset Discovery Queries by Division and Set Containment Join Operators, Ralf Rantzau (University of Stuttgart) page 20

• 10:40-10:55, Short Paper: Efficient OLAP Operations for Spatial Data Using Peano Trees, Baoying Wang, Fei Pan, Dongmei Ren, Yue Cui, Qiang Ding, William Perrizo (North Dakota State University) page 28

• 10:55-11:20, Full Paper: Clustering Gene Expression Data in SQL Using Locally Adaptive Metrics, Dimitris Papadopoulos (University of California, Riverside), Carlotta Domeniconi (George Mason University), Dimitrios Gunopulos (University of California, Riverside), Sheng Ma (IBM T.J. Watson Research Center) page 35

11:20-11:30 Short Discussion Session (Open Floor)

11:30-12:30 FCRC Plenary Talk by James Kurose on Networking

12:30-1:45 Lunch

1:45-2:25 WWW Mining

• 1:45-2:10, Full Paper: Graph-Based Ranking Algorithms for E-mail Expertise Analysis, Byron Dom, Iris Eiron, Alex Cozzi (IBM Almaden Research Center), Yi Zhang (CMU) page 42

• 2:10-2:25, Short Paper: Deriving Link-context from HTML Tag Tree, Gautam Pant (The University of Iowa) page 49

2:25-3:05 Data Streams II

• 2:25-2:50, Full Paper: Clustering of Streaming Time Series is Meaningless, Jessica Lin, Eamonn Keogh, Wagner Truppel (University of California, Riverside) page 56

• 2:50-3:05, Short Paper: A Learning-Based Approach to Estimate Statistics of Operators in Continuous Queries: a Case Study, Like Gao (GMU), X. Sean Wang (GMU), Min Wang (IBM T.J. Watson Research Center), Sriram Padmanabhan (IBM T.J. Watson Research Center) page 66

3:05-3:45 Bioinformatics

• 3:05-3:30, Full Paper: Using transposition for pattern discovery from microarray data, Francois Rioult (GREYC CNRS 6072, University of Caen), Jean-Francois Boulicaut (INSA Lyon LIRIS), Bruno Cremilleux (GREYC CNRS 6072, University of Caen), Jeremy Besson (INSA Lyon LIRIS) page 73

• 3:30-3:45, Short Paper: Weave Amino Acid Sequences for Protein Secondary Structure Prediction, Bin Wang (Northeastern University, China), Xiaochun Yang (Brigham Young University) page 80

3:45-4:00 Coffee Break

4:00-4:50 Privacy & Security

• 4:00-4:25, Full Paper: Assuring privacy when big brother is watching, Murat Kantarcioglu, Chris Clifton (Purdue University) page 88

• 4:25-4:50, Full Paper: Dynamic Inference Control, Jessica Staddon (Palo Alto Research Center) page 94

4:50-5:25 Discussion Session (Open Floor)

5:25-5:30 Closing Remarks.


Invited Talk

Analyzing Massive Data Streams: Past, Present, and Future

Minos Garofalakis
Bell Labs

600 Mountain Avenue
Murray Hill, NJ 07974, USA

[email protected]

ABSTRACT
Continuous data streams arise naturally, for example, in the installations of large telecom and Internet service providers, where detailed usage information (Call-Detail-Records, SNMP/RMON packet-flow data, etc.) from different parts of the underlying network needs to be continuously collected and analyzed for interesting trends. Such environments raise a critical need for effective stream-processing algorithms that can provide (typically, approximate) answers to data-analysis queries while utilizing only small space (to maintain concise stream synopses) and small processing time per stream item. In this talk, I will discuss the basic pseudo-random sketching mechanism for building stream synopses and our ongoing work that exploits sketch synopses to build an approximate SQL (multi-)query processor. I will also describe our recent results on extending sketching to handle more complex forms of queries and streaming data (e.g., similarity joins over streams of XML trees), and try to identify some challenging open problems in the data-streaming area.


A Symbolic Representation of Time Series, with Implications for Streaming Algorithms

Jessica Lin Eamonn Keogh Stefano Lonardi Bill Chiu

University of California - Riverside

Computer Science & Engineering Department Riverside, CA 92521, USA

{jessica, eamonn, stelo, bill}@cs.ucr.edu

ABSTRACT
The parallel explosions of interest in streaming data, and data mining of time series have had surprisingly little intersection. This is in spite of the fact that time series data are typically streaming data. The main reason for this apparent paradox is the fact that the vast majority of work on streaming data explicitly assumes that the data is discrete, whereas the vast majority of time series data is real valued. Many researchers have also considered transforming real valued time series into symbolic representations, noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities, in addition to allowing formerly "batch-only" problems to be tackled by the streaming community. While many symbolic representations of time series have been introduced over the past decades, they all suffer from three fatal flaws. Firstly, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Secondly, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. Finally, most of these symbolic approaches require one to have access to all the data before creating the symbolic representation. This last feature explicitly thwarts efforts to use the representations with streaming algorithms. In this work we introduce a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. Finally, our representation allows the real valued data to be converted in a streaming fashion, with only an infinitesimal time and space overhead.

We will demonstrate the utility of our representation on the classic data mining tasks of clustering, classification, query by content and anomaly detection.

Keywords
Time Series, Data Mining, Data Streams, Symbolic, Discretize

1. INTRODUCTION
The parallel explosions of interest in streaming data [4, 8, 10, 18], and data mining of time series [6, 7, 9, 20, 21, 24, 26, 34] have had surprisingly little intersection. This is in spite of the fact that time series data are typically streaming data, for example, stock value, medical and meteorological data [30]. The main reason for this apparent paradox is the fact that the vast majority of work on streaming data explicitly assumes that the data is discrete, whereas the vast majority of time series data is real valued [23].

Many high level representations of time series have been proposed for data mining. Figure 1 illustrates a hierarchy of all the various time series representations in the literature [2, 7, 14, 16, 20, 22, 25, 30, 31, 35]. One representation that the data mining community has not considered in detail is the discretization of the original data into symbolic strings. At first glance this seems a surprising oversight. In addition to allowing the framing of time series problems as streaming problems, there is an enormous wealth of existing algorithms and data structures that allow the efficient manipulation of symbolic representations. Such algorithms have received decades of attention in the text retrieval community, and more recent attention from the bioinformatics community [3, 13, 17, 29, 32, 33]. Some simple examples of "tools" that are not defined for real-valued sequences but are defined for symbolic approaches include hashing, Markov models, suffix trees, decision trees, etc. As a more concrete example, consider the Jaccard coefficient [13], a distance measure beloved by streaming researchers. The Jaccard coefficient is only well defined for discrete data (such as web clicks or individual keystrokes) and thus cannot be used with real-valued time series.

There is a simple explanation for the data mining community’s lack of interest in symbolic manipulation as a supporting technique for mining time series. If the data are transformed into virtually any of the other representations depicted in Figure 1, then it is possible to measure the similarity of two time series in that representation space, such that the distance is guaranteed to lower bound the true distance between the time series in the original space. This simple fact is at the core of almost all algorithms in time series data mining and indexing [14]. However, in spite of the fact that there are dozens of techniques for producing different variants of the symbolic representation [2, 11, 20], there is no known method for calculating the distance in the symbolic space, while providing the lower bounding guarantee.

In addition to allowing the creation of lower bounding distance measures, there is one other highly desirable property of any time series representation, including a symbolic one. Almost all time series datasets are very high dimensional. This is a challenging fact because all non-trivial data mining and indexing algorithms degrade exponentially with dimensionality. For example, above 16-20 dimensions, index structures degrade to sequential scanning [19].


Figure 1: A hierarchy of all the various time series representations in the literature. The leaf nodes refer to the actual representation, and the internal nodes refer to the classification of the approach. The contribution of this paper is to introduce a new representation, the lower bounding symbolic approach

None of the symbolic representations that we are aware of allow dimensionality reduction [2, 11, 20]. There is some reduction in the storage space required, since fewer bits are required for each value; however, the intrinsic dimensionality of the symbolic representation is the same as the original data.

In [4], Babcock et al. ask if "there is a need for database researchers to develop fundamental and general-purpose models… for data streams." The opinion of the authors is affirmative. In this work we take a step towards this goal by introducing a representation of time series that is suitable for streaming algorithms. It is dimensionality reducing, lower bounding and can be obtained in a streaming fashion.

As we shall demonstrate, the lower bounding feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our representation on the classic data mining tasks of clustering [21], classification [16], indexing [1, 14, 22, 35], and anomaly detection [9, 24, 31].

The rest of this paper is organized as follows. Section 2 briefly discusses background material on time series data mining and related work. Section 3 introduces our novel symbolic approach, and discusses its dimensionality reduction, numerosity reduction and lower bounding abilities. Section 4 contains an experimental evaluation of the symbolic approach on a variety of data mining tasks. Finally, Section 5 offers some conclusions and suggestions for future work.

2. BACKGROUND AND RELATED WORK
Time series data mining has attracted enormous attention in the last decade. The review below is necessarily brief; we refer interested readers to [30, 23] for a more in-depth review.

2.1 Time Series Data Mining Tasks
While making no pretence to be exhaustive, the following list summarizes the areas that have seen the majority of research interest in time series data mining.

• Indexing: Given a query time series Q, and some similarity/dissimilarity measure D(Q,C), find the most similar time series in database DB [1, 7, 14, 22, 35].

• Clustering: Find natural groupings of the time series in database DB under some similarity/dissimilarity measure D(Q,C) [21, 25].

• Classification: Given an unlabeled time series Q, assign it to one of two or more predefined classes [16].

• Summarization: Given a time series Q containing n datapoints where n is an extremely large number, create a (possibly graphic) approximation of Q which retains its essential features but fits on a single page, computer screen, executive summary, etc. [26].

• Anomaly Detection: Given a time series Q, and some model of "normal" behavior, find all sections of Q which contain anomalies, or "surprising/interesting/unexpected/novel" behavior [9, 24, 31].

Since the datasets encountered by data miners typically don't fit in main memory, and disk I/O tends to be the bottleneck for any data mining task, a simple generic framework for time series data mining has emerged [14]. The basic approach is outlined in Table 1.

1. Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest.

2. Approximately solve the task at hand in main memory.

3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data.

Table 1: A generic time series data mining approach

It should be clear that the utility of this framework depends heavily on the quality of the approximation created in Step 1. If the approximation is very faithful to the original data, then the solution obtained in main memory is likely to be the same as, or very close to, the solution we would have obtained on the original data. The handful of disk accesses made in Step 3 to confirm or slightly modify the solution will be inconsequential compared to the number of disk accesses required had we worked on the original data. With this in mind, there has been great interest in approximate representations of time series, which we consider below.

2.2 Time Series Representations As with most problems in computer science, the suitable choice of representation greatly affects the ease and efficiency of time series data mining. With this in mind, a great number of time series representations have been introduced, including the Discrete Fourier Transform (DFT) [14], the Discrete Wavelet Transform (DWT) [7], Piecewise Linear, and Piecewise Constant models (PAA) [22], (APCA) [16, 22], and Singular Value Decomposition (SVD) [22]. Figure 2 illustrates the most commonly used representations.

[Figure 1 appears here: a tree whose top-level split is Data Adaptive vs. Non Data Adaptive, with leaves including Piecewise Linear Approximation (Interpolation, Regression), Adaptive Piecewise Constant Approximation, Singular Value Decomposition, Sorted Coefficients, Symbolic (Natural Language; Strings, subdivided into Lower Bounding and Non-Lower Bounding), Trees, Wavelets (Orthonormal: Haar, Daubechies dbn n > 1; Bi-Orthonormal: Coiflets, Symlets), Random Mappings, Spectral (Discrete Fourier Transform, Discrete Cosine Transform), and Piecewise Aggregate Approximation.]


Figure 2: The most common representations for time series data mining. Each can be visualized as an attempt to approximate the signal with a linear combination of basis functions

Recent work suggests that there is little to choose between the above in terms of indexing power [23]; however, the representations have other features that may act as strengths or weaknesses. As a simple example, wavelets have the useful multiresolution property, but are only defined for time series that are an integer power of two in length [7].

One important feature of all the above representations is that they are real valued. This limits the algorithms, data structures and definitions available for them. For example, in anomaly detection we cannot meaningfully define the probability of observing any particular set of wavelet coefficients, since the probability of observing any real number is zero [27]. Such limitations have led researchers to consider using a symbolic representation of time series.

While there are literally hundreds of papers on discretizing (symbolizing, tokenizing, quantizing) time series [2, 20] (see [11] for an extensive survey), none of the techniques allows a distance measure that lower bounds a distance measure defined on the original time series. For this reason, the generic time series data mining approach illustrated in Table 1 is of little utility, since the approximate solution to the problem created in main memory may be arbitrarily dissimilar to the true solution that would have been obtained on the original data. If, however, one had a symbolic approach that allowed lower bounding of the true distance, one could take advantage of the generic time series data mining model, and of a host of other algorithms, definitions and data structures which are only defined for discrete data, including hashing, Markov models, and suffix trees. This is exactly the contribution of this paper. We call our symbolic representation of time series SAX (Symbolic Aggregate approXimation), and define it in the next section.

3. SAX: OUR SYMBOLIC APPROACH
SAX allows a time series of arbitrary length n to be reduced to a string of arbitrary length w (w < n, typically w << n). The alphabet size is also an arbitrary integer a, where a > 2. Table 2 summarizes the major notation used in this and subsequent sections.

$C$          A time series $C = c_1, \ldots, c_n$

$\bar{C}$    A Piecewise Aggregate Approximation of a time series, $\bar{C} = \bar{c}_1, \ldots, \bar{c}_w$

$\hat{C}$    A symbolic representation of a time series, $\hat{C} = \hat{c}_1, \ldots, \hat{c}_w$

$w$          The number of PAA segments representing time series $C$

$a$          Alphabet size (e.g., for the alphabet = {a, b, c}, $a = 3$)

Table 2: A summarization of the notation used in this paper

Our discretization procedure is unique in that it uses an intermediate representation between the raw time series and the symbolic strings. We first transform the data into the Piecewise Aggregate Approximation (PAA) representation and then symbolize the PAA representation into a discrete string. There are two important advantages to doing this:

• Dimensionality Reduction: We can use the well-defined and well-documented dimensionality reduction power of PAA [22, 35], and the reduction is automatically carried over to the symbolic representation.

• Lower Bounding: Proving that a distance measure between two symbolic strings lower bounds the true distance between the original time series is non-trivial. The key observation that allows us to prove lower bounds is to concentrate on proving that the symbolic distance measure lower bounds the PAA distance measure. Then we can prove the desired result by transitivity, by simply pointing to the existing proofs for the PAA representation itself [35].

We will briefly review the PAA technique before considering the symbolic extension.

3.1 Dimensionality Reduction Via PAA
A time series $C$ of length $n$ can be represented in a $w$-dimensional space by a vector $\bar{C} = \bar{c}_1, \ldots, \bar{c}_w$. The $i$th element of $\bar{C}$ is calculated by the following equation:

$$\bar{c}_i = \frac{w}{n} \sum_{j = \frac{n}{w}(i-1) + 1}^{\frac{n}{w} i} c_j \qquad (1)$$

Simply stated, to reduce the time series from n dimensions to w dimensions, the data is divided into w equal-sized "frames." The mean value of the data falling within a frame is calculated and a vector of these values becomes the data-reduced representation. The representation can be visualized as an attempt to approximate the original time series with a linear combination of box basis functions as shown in Figure 3.

Figure 3: The PAA representation can be visualized as an attempt to model a time series with a linear combination of box basis functions. In this case, a sequence of length 128 is reduced to 8 dimensions

The PAA dimensionality reduction is intuitive and simple, yet has been shown to rival more sophisticated dimensionality reduction techniques like Fourier transforms and wavelets [22, 23, 35].

We normalize each time series to have a mean of zero and a standard deviation of one before converting it to the PAA representation, since it is well understood that it is meaningless to compare time series with different offsets and amplitudes [23].
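To make the construction concrete, here is a minimal Python sketch of the normalization and PAA steps (our own illustration, not code from the paper). For simplicity it assumes w divides n evenly; Eq. 1 also covers the general case. The epsilon guard for near-constant series anticipates the discussion in Section 3.4.

    import numpy as np

    def znorm(ts, eps=1e-6):
        # Zero mean, unit variance. Near-constant series are mapped to all
        # zeros rather than having their noise amplified (see Section 3.4).
        ts = np.asarray(ts, dtype=float)
        sd = ts.std()
        if sd < eps:
            return np.zeros_like(ts)
        return (ts - ts.mean()) / sd

    def paa(ts, w):
        # Eq. 1 when w divides n: the mean of each of the w equal-sized frames.
        ts = np.asarray(ts, dtype=float)
        n = len(ts)
        assert n % w == 0, "this sketch assumes w divides n"
        return ts.reshape(w, n // w).mean(axis=1)

    # A sequence of length 128 reduced to 8 dimensions, as in Figure 3.
    C = znorm(np.sin(np.linspace(0, 4 * np.pi, 128)))
    print(paa(C, 8))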

3.2 Discretization
Having transformed a time series database into PAA, we can apply a further transformation to obtain a discrete representation. It is desirable to have a discretization technique that will produce symbols with equiprobability [3, 28]. This is easily achieved since normalized time series have a Gaussian distribution [27]. To illustrate this, we extracted subsequences of length 128 from 8 different time series and plotted a normal probability plot of the data as shown in Figure 4.

Figure 4: A normal probability plot of the cumulative distribution of values from subsequences of length 128 from 8 different datasets. The highly linear nature of the plot strongly suggests that the data came from a Gaussian distribution

Given that the normalized time series have a highly Gaussian distribution, we can simply determine the "breakpoints" that will produce equal-sized areas under the Gaussian curve [27].

Definition 1. Breakpoints: breakpoints are a sorted list of numbers $B = \beta_1, \ldots, \beta_{a-1}$ such that the area under a $N(0,1)$ Gaussian curve from $\beta_i$ to $\beta_{i+1}$ is $1/a$ ($\beta_0$ and $\beta_a$ are defined as $-\infty$ and $\infty$, respectively).

These breakpoints may be determined by looking them up in a statistical table. For example, Table 3 gives the breakpoints for values of a from 3 to 10.
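Rather than storing a statistical table, the breakpoints can also be generated on demand as quantiles of the standard normal distribution. A hedged sketch using SciPy (our choice of library, not anything mandated by the paper):

    import numpy as np
    from scipy.stats import norm

    def sax_breakpoints(a):
        # The a-1 cut points beta_1 .. beta_{a-1} that divide N(0,1) into
        # a equiprobable regions (Definition 1).
        return norm.ppf(np.arange(1, a) / a)

    # Matches the a = 4 column of Table 3 up to rounding: [-0.67, 0.00, 0.67]
    print(np.round(sax_breakpoints(4), 2))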

βi      a=3     a=4     a=5     a=6     a=7     a=8     a=9     a=10
β1     -0.43   -0.67   -0.84   -0.97   -1.07   -1.15   -1.22   -1.28
β2      0.43    0.00   -0.25   -0.43   -0.57   -0.67   -0.76   -0.84
β3              0.67    0.25    0.00   -0.18   -0.32   -0.43   -0.52
β4                      0.84    0.43    0.18    0.00   -0.14   -0.25
β5                              0.97    0.57    0.32    0.14    0.00
β6                                      1.07    0.67    0.43    0.25
β7                                              1.15    0.76    0.52
β8                                                      1.22    0.84
β9                                                              1.28

Table 3: A lookup table that contains the breakpoints that divide a Gaussian distribution into an arbitrary number (from 3 to 10) of equiprobable regions

Once the breakpoints have been obtained we can discretize a time series in the following manner. We first obtain a PAA of the time series. All PAA coefficients that are below the smallest breakpoint are mapped to the symbol “a,” all coefficients greater than or equal to the smallest breakpoint and less than the second smallest breakpoint are mapped to the symbol “b,” etc. Figure 5 illustrates the idea.

Figure 5: A time series is discretized by first obtaining a PAA approximation and then using predetermined breakpoints to map the PAA coefficients into SAX symbols. In the example above, with n = 128, w = 8 and a = 3, the time series is mapped to the word baabccbc

Note that in this example the 3 symbols, “a,” “b,” and “c” are approximately equiprobable as we desired. We call the concatenation of symbols that represent a subsequence a word.

Definition 2. Word: A subsequence $C$ of length $n$ can be represented as a word $\hat{C} = \hat{c}_1, \ldots, \hat{c}_w$ as follows. Let $alpha_i$ denote the $i$th element of the alphabet, i.e., $alpha_1 = a$ and $alpha_2 = b$. Then the mapping from a PAA approximation $\bar{C}$ to a word $\hat{C}$ is obtained as follows:

$$\hat{c}_i = alpha_j \quad \text{iff} \quad \beta_{j-1} \le \bar{c}_i < \beta_j \qquad (2)$$

We have now defined SAX, our symbolic representation (the PAA representation is merely an intermediate step required to obtain the symbolic representation).
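Assembling the pieces, the full raw-series-to-word transform is only a few lines. This is our illustrative sketch (again assuming w divides n); np.searchsorted returns, for each PAA coefficient, the number of breakpoints at or below it, which is exactly the symbol index of Eq. 2.

    import numpy as np
    from scipy.stats import norm

    def sax_word(ts, w, a, alphabet="abcdefghij"):
        # z-normalize, take the PAA (Eq. 1), then map each coefficient to a
        # symbol via the Gaussian breakpoints (Eq. 2).
        ts = np.asarray(ts, dtype=float)
        sd = ts.std()
        ts = np.zeros_like(ts) if sd < 1e-6 else (ts - ts.mean()) / sd
        frames = ts.reshape(w, len(ts) // w).mean(axis=1)
        beta = norm.ppf(np.arange(1, a) / a)
        return "".join(alphabet[j]
                       for j in np.searchsorted(beta, frames, side="right"))

    # n = 128, w = 8, a = 3, as in the Figure 5 example (the word itself
    # depends on the input series).
    print(sax_word(np.sin(np.linspace(0, 4 * np.pi, 128)), w=8, a=3))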

3.3 Distance Measures
Having introduced the new representation of time series, we can now define a distance measure on it. By far the most common distance measure for time series is the Euclidean distance [23, 29]. Given two time series $Q$ and $C$ of the same length $n$, Eq. 3 defines their Euclidean distance, and Figure 6.A illustrates a visual intuition of the measure.

$$D(Q, C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2} \qquad (3)$$

If we transform the original subsequences into PAA representations, $\bar{Q}$ and $\bar{C}$, using Eq. 1, we can then obtain a lower bounding approximation of the Euclidean distance between the original subsequences by:

$$DR(\bar{Q}, \bar{C}) \equiv \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} (\bar{q}_i - \bar{c}_i)^2} \qquad (4)$$

This measure is illustrated in Figure 6.B. A proof that $DR(\bar{Q}, \bar{C})$ lower bounds the true Euclidean distance appears in [22] (an alternative proof appears in [35]). If we further transform the data into the symbolic representation, we can define a MINDIST function that returns the minimum distance between the original time series of two words:

$$MINDIST(\hat{Q}, \hat{C}) \equiv \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} dist(\hat{q}_i, \hat{c}_i)^2} \qquad (5)$$

The function resembles Eq. 4 except for the fact that the distance between the two PAA coefficients has been replaced with the sub-function dist(). The dist() function can be implemented using a table lookup as illustrated in Table 4.


       a      b      c      d
a      0      0      0.67   1.34
b      0      0      0      0.67
c      0.67   0      0      0
d      1.34   0.67   0      0

Table 4: A lookup table used by the MINDIST function. This table is for an alphabet of cardinality of 4, i.e. a = 4. The distance between two symbols can be read off by examining the corresponding row and column. For example, dist(a,b) = 0 and dist(a,c) = 0.67.

The value in cell (r, c) for any lookup table can be calculated by the following expression:

$$cell_{r,c} = \begin{cases} 0 & \text{if } |r - c| \le 1 \\ \beta_{\max(r,c)-1} - \beta_{\min(r,c)} & \text{otherwise} \end{cases} \qquad (6)$$

For a given value of the alphabet size a, the table need only be calculated once, then stored for fast lookup. The MINDIST function can be visualized in Figure 6.C.
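A sketch of Eq. 6 and Eq. 5 in the same illustrative Python (ours, not the authors' code); dist_table(4) matches Table 4 up to rounding, and the two example words are those shown in Figure 6:

    import numpy as np
    from scipy.stats import norm

    def dist_table(a):
        # Eq. 6: distances between symbol indices 0..a-1. Adjacent or equal
        # symbols have distance 0; otherwise the gap between their breakpoints.
        beta = norm.ppf(np.arange(1, a) / a)
        table = np.zeros((a, a))
        for r in range(a):
            for c in range(a):
                if abs(r - c) > 1:
                    table[r, c] = beta[max(r, c) - 1] - beta[min(r, c)]
        return table

    def mindist(q_hat, c_hat, n, table, alphabet="abcdefghij"):
        # Eq. 5: a lower bound on the Euclidean distance between the two
        # original length-n series represented by the SAX words q_hat, c_hat.
        w = len(q_hat)
        idx = {s: i for i, s in enumerate(alphabet)}
        sq = sum(table[idx[q], idx[c]] ** 2 for q, c in zip(q_hat, c_hat))
        return np.sqrt(n / w) * np.sqrt(sq)

    print(np.round(dist_table(4), 2))                       # cf. Table 4
    # The two words of Figure 6 (a = 3, w = 8, n = 128):
    print(mindist("baabccbc", "babcacca", n=128, table=dist_table(3)))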

Figure 6: A visual intuition of the three representations discussed in this work, and the distance measures defined on them. A) The Euclidean distance between two time series can be visualized as the square root of the sum of the squared differences of each pair of corresponding points. B) The distance measure defined for the PAA approximation can be seen as the square root of the sum of the squared differences between each pair of corresponding PAA coefficients, multiplied by the square root of the compression rate. C) The distance between two SAX representations of a time series requires looking up the distances between each pair of symbols, squaring them, summing them, taking the square root and finally multiplying by the square root of the compression rate

There is one issue we must address if we are to use a symbolic representation of time series. If we wish to approximate a massive dataset in main memory, the parameters w and a have to be chosen in such a way that the approximation makes the best use of the primary memory available. There is a clear tradeoff between the parameter w controlling the number of approximating elements, and the value a controlling the granularity of each approximating element.

It is infeasible to determine the best tradeoff analytically, since it is highly data dependent. We can, however, empirically determine the best values with a simple experiment. Since we wish to achieve the tightest possible lower bounds, we can simply estimate the lower bounds over all possible feasible parameters, and choose the best settings.

$$\text{Tightness of Lower Bound} = \frac{MINDIST(\hat{Q}, \hat{C})}{D(Q, C)} \qquad (7)$$

We performed such a test with a concatenation of 50 time series databases taken from the UCR time series data mining archive. For every combination of parameters we averaged the result of 100,000 experiments on subsequences of length 256. Figure 7 shows the results.

Figure 7: The empirically estimated tightness of lower bounds over the cross product of a = [3…11] and w = [2…8]. The darker histogram bars illustrate combinations of parameters that require approximately equal space to store every possible word (approximately 4 megabytes)

The results suggest that using a low value for a results in weak bounds, but that there are diminishing returns for large values of a. The results also suggest that the parameters are not too critical; an alphabet size in the range of 5 to 8 seems to be a good choice.
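For intuition, such a parameter sweep can be sketched as below, reusing znorm, sax_word, dist_table, and mindist from the earlier snippets. We substitute synthetic random walks for the UCR archive and use far fewer trials, so the numbers are only indicative, but Eq. 7 is computed exactly as defined.

    import numpy as np

    rng = np.random.default_rng(0)

    def tightness(Q, C, w, a):
        # Eq. 7: MINDIST over the true Euclidean distance, for one pair.
        d = np.sqrt(np.sum((Q - C) ** 2))
        table = dist_table(a)
        return mindist(sax_word(Q, w, a), sax_word(C, w, a), len(Q), table) / d

    for a in range(3, 12):
        for w in (2, 4, 8):
            vals = [tightness(znorm(np.cumsum(rng.standard_normal(256))),
                              znorm(np.cumsum(rng.standard_normal(256))), w, a)
                    for _ in range(100)]
            print(f"a={a:2d} w={w} mean tightness={np.mean(vals):.3f}")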

3.4 Numerosity Reduction
We have seen that, given a single time series, our approach can significantly reduce its dimensionality. In addition, our approach can reduce the numerosity of the data for some applications.

Most applications assume that we have one very long time series T, and that manageable subsequences of length n are extracted by use of a sliding window, then stored in a matrix for further manipulation [7, 14, 22, 35]. Figure 8 illustrates the idea.

Figure 8: An illustration of the notation introduced in this section: A time series T of length 128, the subsequence C67 of length n = 16, and the first 8 subsequences extracted by a sliding window. Note that the sliding windows are overlapping

[Figures 6, 7, and 8 appear here. In the Figure 6 example, the two series map to the SAX words baabccbc and babcacca.]

When performing sliding window subsequence extraction with any of the real-valued representations, we must store all |T| - n + 1 extracted subsequences (in dimensionality reduced form). However, imagine for a moment that we are using our proposed approach. If the first word extracted is aabbcc, and the window is shifted to discover that the second word is also aabbcc, we can reasonably decide not to include the second occurrence of the word in the sliding window matrix. If we ever need to retrieve all occurrences of aabbcc, we can go to the location pointed to by the first occurrence, and remember to slide to the right, testing to see if the next window is also mapped to the same word. We can stop testing as soon as the word changes. This simple idea is very similar to the run-length-encoding data compression algorithm.
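In code, this bookkeeping is a run-length pass over the sliding windows. A sketch (ours), using sax_word from the earlier snippet; runs holds one (position, word) pointer per run of identical words:

    def sliding_sax_runs(T, n, w, a):
        # Slide a window of length n over T; record a pointer only when the
        # SAX word changes, i.e. one entry per run of identical words.
        runs = []
        prev = None
        for p in range(len(T) - n + 1):
            word = sax_word(T[p:p + n], w, a)
            if word != prev:
                runs.append((p, word))
                prev = word
        return runs

    # Numerosity reduction factor = (len(T) - n + 1) / len(runs).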

The utility of this optimization depends on the parameters used and the data itself, but it typically yields a numerosity reduction factor of two or three. However, many datasets are characterized by long periods of little or no movement, followed by bursts of activity (seismological data is an obvious example). On these datasets the numerosity reduction factor can be huge. Consider the example shown in Figure 9.

Figure 9: Sliding window extraction on Space Shuttle Telemetry data, with n = 32. At time point 61, the extracted word is aabbcc, and the next 401 subsequences also map to this word. Only a pointer to the first occurrence must be recorded, thus producing a large reduction in numerosity

There is only one special case we must consider. As we noted in Section 3.1, we normalize each time series (including subsequences) to have a mean of zero and a standard deviation of one. However, if the subsequence contains only one value, the standard deviation is not defined. More troublesome is the case where the subsequence is almost constant, perhaps 31 zeros and a single 0.0001. If we normalize this subsequence, the single differing element will have its value exploded to 5.48. This situation occurs quite frequently. For example, the last 200 time units of the data in Figure 9 appear to be constant, but actually contain tiny amounts of noise. If we were to normalize subsequences extracted from this area, the normalization would magnify the noise to large meaningless patterns.

We can easily deal with this problem: if the standard deviation of the sequence before normalization is below an epsilon ε, we simply assign the entire word to the middle-ranged alphabet (e.g. cccccc if a = 5).

We end this section with a visual comparison between SAX and the four most used representations in the literature (Figure 10).

Figure 10: A visual comparison of SAX and the four most common time series data mining representations. A raw time series of length 128 is transformed into the word ffffffeeeddcbaabceedcbaaaaacddee. This is a fair comparison since the number of bits in each representation is the same

4. EXPERIMENTAL VALIDATION OF SAX
We performed various data mining tasks using SAX. For clustering and classification, we compared the results with the classic Euclidean distance, and with other previously proposed symbolic approaches. Note that none of these other approaches use dimensionality reduction. In the next paragraph we summarize the strawmen representations that we compare SAX to. We choose these two approaches since they are typical representatives of symbolic approaches in the literature.

André-Jönsson and Badal [2] proposed the SDA algorithm, which computes the changes between values from one instance to the next and divides the range into user-predefined regions. The disadvantages of this approach are obvious: prior knowledge of the data distribution of the time series is required in order to set the breakpoints, and the discretized time series does not conserve the general shape or distribution of the data values.

Huang and Yu proposed the IMPACTS algorithm, which uses the change ratio between one time point and the next to discretize the time series [20]. The range of change ratios is then divided into equal-sized sections and mapped into symbols. The time series is converted to a discretized collection of change ratios. As with SAX, the user needs to define the cardinality of symbols.

4.1 Clustering
Clustering is one of the most common data mining tasks, being useful in its own right as an exploratory tool, and also as a subroutine in more complex algorithms [12, 15, 21].

4.1.1 Hierarchical Clustering
Comparing hierarchical clusterings is a very good way to compare and contrast similarity measures, since a dendrogram of size N summarizes $O(N^2)$ distance calculations [23]. The evaluation is typically subjective; we simply adjudge which distance measure appears to create the most natural groupings of the data. However, if we know the data labels in advance we can also make objective statements of the quality of the clustering. In Figure 11 we clustered nine time series from the Control Chart dataset, three each from the decreasing trend, upward shift and normal classes.


Figure 11: A comparison of the four distance measures’ ability to cluster members of the Control Chart dataset. Complete linkage was used as the agglomeration technique

In this case we can objectively state that SAX is superior, since it correctly assigns each class to its own subtree. This is simply a side effect due to the smoothing effect of dimensionality reduction. More generally, we observed that SAX closely mimics Euclidean distance on various datasets.

4.1.2 Partitional Clustering
Although hierarchical clustering is a good sanity check for any proposed distance measure, it has limited utility for data mining because of its poor scalability. The most commonly used data mining clustering algorithm is k-means [15], so for completeness we will consider it here. We performed k-means on both the original raw data, and our symbolic representation. Figure 12 shows a typical run of k-means on a space telemetry dataset. Both algorithms converge after 11 iterations on average.

The results here are quite unintuitive and surprising; working with an approximation of the data gives better results than working with the original data. Fortunately, a recent paper offers a suggestion as to why this might be so. It has been shown that initializing the cluster centers on a low dimension approximation of the data can improve the quality [12]; this is what clustering with SAX implicitly does.

Figure 12: A comparison of the k-means clustering algorithm using SAX and the raw data. The dataset was Space Shuttle telemetry, 1,000 subsequences of length 512. Surprisingly, working with the symbolic approximation produces better results than working with the original data

4.2 Classification
Classification of time series has attracted much interest from the data mining community. The high dimensionality, high feature correlation, and typically high levels of noise found in time series provide an interesting research problem [23]. Although special-purpose algorithms have been proposed [25], we will consider only the two most common classification algorithms, for brevity, clarity of presentation, and to facilitate independent confirmation of our findings.

4.2.1 Nearest Neighbor Classification
To compare different distance measures on 1-nearest-neighbor classification, we used leave-one-out cross validation. We compare SAX with Euclidean distance, IMPACTS, SDA, and LPmax. Two classic synthetic datasets are used: the Cylinder-Bell-Funnel (CBF) dataset has 50 instances of time series for each of the three classes, and the Control Chart (CC) has 100 instances for each of the six classes [23].

Since SAX allows dimensionality and alphabet size as user input, and IMPACTS allows variable alphabet size, we ran the experiments on different combinations of dimensionality reduction and alphabet size. For the other approaches we applied the simple dimensionality reduction technique of skipping data points at a fixed interval. In Figure 13, we show the results with a dimensionality reduction of 4 to 1.

Similar results were observed for other levels of dimensionality reduction. Once again, SAX’s ability to beat Euclidean distance is probably due to the smoothing effect of dimensionality reduction; nevertheless, this experiment does show the superiority of SAX over the other approaches proposed in the literature.
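The evaluation harness itself is simple; here is a generic leave-one-out 1-nearest-neighbor sketch with a pluggable distance (our illustration of the methodology, not the authors' code):

    import numpy as np

    def loo_1nn_error(X, y, dist):
        # Leave-one-out: classify each series by its nearest neighbor among
        # the rest; return the fraction misclassified.
        errors = 0
        for i in range(len(X)):
            d = [dist(X[i], X[j]) if j != i else np.inf for j in range(len(X))]
            errors += y[int(np.argmin(d))] != y[i]
        return errors / len(X)

    # e.g., with the SAX helpers above (w = 32, a = 6 are arbitrary here):
    # sax_dist = lambda q, c: mindist(sax_word(q, 32, 6), sax_word(c, 32, 6),
    #                                 len(q), dist_table(6))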

4.2.2 Decision Tree Classification
Due to its poor scalability, Nearest Neighbor is unsuitable for most data mining applications; instead, decision trees are the most common choice of classifier. While decision trees are defined for real data, attempting to classify time series using the raw data would clearly be a mistake, since the high dimensionality and noise levels would result in a deep, bushy tree with poor accuracy.


Figure 13: A comparison of five distance measures' utility for nearest neighbor classification. We tested different alphabet sizes for SAX and IMPACTS. SDA's alphabet size is fixed at 5.

In an attempt to overcome this problem, Geurts [16] suggests representing the time series as a Regression Tree (RT) (this representation is essentially the same as APCA [22], see Figure 2), and training the decision tree directly on this representation. The technique shows great promise.

We compared SAX to the Regression Tree (RT) on two datasets; the results are in Table 5.

Dataset    SAX            Regression Tree
CC         3.04 ± 1.64    2.78 ± 2.11
CBF        0.97 ± 1.41    1.14 ± 1.02

Table 5: A comparison of SAX with the specialized Regression Tree approach for decision tree classification. Our approach used an alphabet size of 6; both approaches used a dimensionality of 8

Note that while our results are competitive with the RT approach, the RT representation is undoubtedly superior in terms of interpretability [16]. Once again, our point is simply that our "black box" approach can be competitive with specialized solutions.

4.3 Query by Content (Indexing)
The majority of work on time series data mining appearing in the literature has addressed the problem of indexing time series for fast retrieval [30]. Indeed, it is in this context that most of the representations enumerated in Figure 1 were introduced [7, 14, 22, 35]. Dozens of papers have introduced techniques to do indexing with a symbolic approach [2, 20], but without exception, the answer set retrieved by these techniques can be very different from the answer set that would be retrieved by the true Euclidean distance. It is only by using a lower bounding technique that one can guarantee retrieving the full answer set, with no false dismissals [14].

To perform query by content, we built an index using SAX, and compared it to an index built using the Haar wavelet approach [7]. Since the datasets we use are large and disk-resident, and the reduced dimensionality could still be potentially high (or at least high enough that performance degenerates to a sequential scan if an R-tree were used [19]), we use the Vector Approximation (VA) file as our indexing algorithm. We note, however, that SAX could also be indexed by classic string indexing techniques such as suffix trees.
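The "no false dismissals" guarantee is what makes the standard filter-and-refine search correct. Below is a simplified in-memory sketch of such a search (ours; a real system would precompute the words and store them in a VA-file or string index), reusing sax_word, dist_table, and mindist from the earlier snippets:

    import numpy as np

    def one_nn_query(query, database, w, a):
        # database: list of raw series, each the same length as query.
        # Prune with MINDIST (a guaranteed lower bound); touch the raw series
        # (the expensive "disk" step) only when the bound fails to prune.
        table = dist_table(a)
        words = [sax_word(C, w, a) for C in database]   # precomputed in practice
        q_hat = sax_word(query, w, a)
        n = len(query)
        best, best_d, fetches = None, np.inf, 0
        for i, C in enumerate(database):
            if mindist(q_hat, words[i], n, table) >= best_d:
                continue                                # provably cannot win
            fetches += 1
            d = np.sqrt(np.sum((np.asarray(query) - np.asarray(C)) ** 2))
            if d < best_d:
                best, best_d = i, d
        return best, best_d, fetches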

To compare performance, we measure the percentage of disk I/Os required in order to retrieve the one-nearest neighbor to a randomly extracted query, relative to the number of disk I/Os required for a sequential scan. Since it has been forcefully shown that the choice of dataset can make a significant difference in the relative indexing ability of a representation, we tested on more than 50 datasets from the UCR Time Series Data Mining Archive. In Figure 14 we show 4 representative examples.

Figure 14: A comparison of indexing ability of wavelets versus SAX. The Y-axis is the percentage of the data that must be retrieved from disk to answer a 1-NN query of length 256, when the dimensionality reduction ratio is 32 to 1 for both approaches

Once again we find our representation competitive with existing approaches.

4.4 Taking Advantage of the Discrete Nature of our Representation
In the previous sections we showed examples of how our proposed representation can compete with real-valued representations and the original data. In this section we illustrate examples of data mining algorithms that take explicit advantage of the discrete nature of our representation.

4.4.1 Detecting Novel/Surprising/Anomalous Behavior
A simple idea for detecting anomalous behavior in time series is to examine previously observed normal data and build a model of it. Data obtained in the future can be compared to this model and any lack of conformity can signal an anomaly [9]. In order to achieve this, in [24] we combined a statistically sound scheme with an efficient combinatorial approach. The statistical scheme is based on Markov chains and normalization. Markov chains are used to model the "normal" behavior, which is inferred from the previously observed data. The time- and space-efficiency of the algorithm comes from the use of a suffix tree as the main data structure. Each node of the suffix tree represents a pattern. The tree is annotated with a score obtained by comparing the support of a pattern observed in the new data with the support recorded in the Markov model. This apparently simple strategy turns out to be very effective in discovering surprising patterns. In the original work we used a simple symbolic approach, similar to IMPACTS [20]; here we revisit the work using SAX.
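The system in [24] couples a Markov model with a suffix tree; the toy sketch below (ours, a considerable simplification) conveys just the scoring idea with symbol bigrams over SAX strings: train transition probabilities on "normal" data, then flag new strings whose likelihood is low.

    import math
    from collections import Counter

    def train_bigrams(normal_word, smooth=1.0):
        # Transition probabilities P(y | x) over the symbols of a SAX string
        # obtained from previously observed normal data (Laplace-smoothed).
        pairs = Counter(zip(normal_word, normal_word[1:]))
        starts = Counter(normal_word[:-1])
        symbols = sorted(set(normal_word))
        return {(x, y): (pairs[(x, y)] + smooth) /
                        (starts[x] + smooth * len(symbols))
                for x in symbols for y in symbols}

    def surprise(model, word):
        # Negative log-likelihood of a new SAX string; high values signal
        # patterns that were rare or unseen in the normal data.
        return -sum(math.log(model.get(p, 1e-9)) for p in zip(word, word[1:]))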

For completeness, we will compare SAX to two highly referenced anomaly detection algorithms that are defined on real valued representations, the TSA-tree Wavelet based approach of Shahabi et al. [31] and the Immunology (IMM) inspired work of Dasgupta and Forrest [9]. We also include the Markov technique using IMPACTS and SDA in order to discover how much of the difference can be attributed directly to the representation. Figure 15 contains an experiment comparing all 5 techniques.


Figure 15: A comparison of five anomaly detection algorithms on the same task. I) The training data, a slightly noisy sine wave of length 1,000. II) The time series to be examined for anomalies is a noisy sine wave that was created with the same parameters as the training sequence, then an assortment of anomalies were introduced at time periods 250, 500 and 750. III) and IIII) The Markov Model technique using the IMPACTS and SDA representation did not clearly discover the anomalies, and reported some false alarms. V) The IMM anomaly detection algorithm appears to have discovered the first anomaly, but it also reported many false alarms. VI) The TSA-Tree approach is unable to detect the anomalies. VII) The Markov model-based technique using SAX clearly finds the anomalies, with no false alarms

The results on this simple experiment are impressive. Since suffix trees and Markov models can be used only on discrete data, this offers a motivation for our symbolic approach.

4.4.2 Motif Discovery
It is well understood in bioinformatics that overrepresented DNA sequences often have biological significance [3, 13, 29]. A substantial body of literature has been devoted to techniques to discover such patterns [17, 32, 33]. In a previous work, we defined the related concept of "time series motif" [26]. Time series motifs are close analogues of their discrete cousins, although the definitions must be augmented to prevent certain degenerate solutions. The naïve algorithm to discover the motifs is quadratic in the length of the time series. In [26], we demonstrated a simple technique to mitigate the quadratic complexity by a large constant factor; nevertheless, this time complexity is clearly untenable for most real datasets.

The symbolic nature of SAX offers a unique opportunity to avail of the wealth of bioinformatics research in this area. In particular, recent work by Tompa and Buhler holds great promise [33]. The authors show that many previously unsolvable motif discovery problems can be solved by hashing subsequences into buckets using a random subset of their features as a key, then doing some post-processing search on the hash buckets. (Of course, this description greatly understates the contributions of this work. We urge the reader to consult the original paper.) They call their algorithm PROJECTION.

We carefully reimplemented the random projection algorithm of Tompa and Buhler, making minor changes in the post-processing step to allow for the fact that although we are hashing random projections of our symbolic representation, we actually wish to discover motifs defined on the original raw data. Figure 16 shows an example of a motif discovered in an industrial dataset [5] using this technique.

Figure 16: Above, a motif discovered in a complex dataset by the modified PROJECTION algorithm. Below, the motif is best visualized by aligning the two subsequences and "zooming in." The similarity of the two subsequences is striking, and hints at unexpected regularity

Apart from the attractive scalability of the algorithm, there is another important advantage over other approaches. The PROJECTION algorithm is able to discover motifs even in the presence of noise. Our extension of the algorithm inherits this robustness to noise.
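In outline, the PROJECTION idea can be sketched as follows (our own simplification of [33], with verification against the raw data left as a comment): hash each SAX word on a random subset of k symbol positions, repeat, and count collisions; pairs that collide often are motif candidates.

    import random
    from collections import defaultdict

    def projection_candidates(words, k, iterations=20, seed=0):
        # words: SAX words of equal length w, one per extracted subsequence.
        rng = random.Random(seed)
        w = len(words[0])
        collisions = defaultdict(int)
        for _ in range(iterations):
            mask = rng.sample(range(w), k)          # random k symbol positions
            buckets = defaultdict(list)
            for i, word in enumerate(words):
                buckets[tuple(word[j] for j in mask)].append(i)
            for members in buckets.values():
                for x in range(len(members)):
                    for y in range(x + 1, len(members)):
                        collisions[(members[x], members[y])] += 1
        # Post-process: sort pairs by collision count and verify the top
        # candidates against the original raw subsequences.
        return collisions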

5. CONCLUSIONS AND FUTURE DIRECTIONS
In this work we introduced the first dimensionality reducing, lower bounding, streaming symbolic approach in the literature. We have shown that our representation is competitive with, or superior to, other representations on a wide variety of classic data mining problems, and that its discrete nature allows us to tackle emerging tasks such as anomaly detection and motif discovery.

A host of future directions suggest themselves. In addition to use with streaming algorithms, there is an enormous wealth of useful definitions, algorithms and data structures in the bioinformatics literature that can be exploited by our representation [3, 13, 17, 28, 29, 32, 33]. It may be possible to create a lower bounding approximation of Dynamic Time Warping [6], by slightly modifying the classic string edit distance. Finally, there may be utility in extending our work to multidimensional time series [34].

6. REFERENCES [1] Agrawal, R., Psaila, G., Wimmers, E. L. & Zait, M. (1995).

Querying Shapes of Histories. In proceedings of the 21st Int'l Conference on Very Large Databases. Zurich, Switzerland, Sept 11-15. pp 502-514.

[2] André-Jönsson, H. & Badal. D. (1997). Using Signature Files for Querying Time-Series Data. In proceedings of Principles of Data Mining and Knowledge Discovery, 1st European Symposium. Trondheim, Norway, Jun 24-27. pp 211-220.

[3] Apostolico, A., Bock, M. E. & Lonardi, S. (2002). Monotony of Surprise and Large-Scale Quest for Unusual Words. In proceedings of the 6th Int'l Conference on Research in Computational Molecular Biology. Washington, DC, April 18-21. pp 22-31.

[4] Babcock, B., Babu, S., Datar, M., Motwani, R. & Widom, J. (2002). Models and Issues in Data Stream Systems. Invited Paper in proceedings of the 2002 ACM Symp. on Principles of Database Systems. June 3-5, Madison, WI.

(Figure content omitted: panels I)-VII) with value axes, and the Winding Dataset plot (angular speed of reel 1) with the motif subsequences A and B marked.)


[5] Bastogne, T., Noura, H., Richard, A. & Hittinger, J. M. (2002). Application of Subspace Methods to the Identification of a Winding Process. In proceedings of the 4th European Control Conference, Vol. 5, Brussels.

[6] Berndt, D. & Clifford, J. (1994). Using Dynamic Time Warping to Find Patterns in Time Series. In proceedings of the Workshop on Knowledge Discovery in Databases, at the 12th Int'l Conference on Artificial Intelligence. July 31-Aug 4, Seattle, WA. pp 229-248.

[7] Chan, K. & Fu, A. W. (1999). Efficient Time Series Matching by Wavelets. In proceedings of the 15th IEEE Int'l Conference on Data Engineering. Sydney, Australia, Mar 23-26. pp 126-133.

[8] Cortes, C., Fisher, K., Pregibon, D., Rogers, A. & Smith, F. (2000). Hancock: a Language for Extracting Signatures from Data Streams. In proceedings of the 6th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining. Aug 20-23, Boston, MA. pp 9-17.

[9] Dasgupta, D. & Forrest, S. (1996). Novelty Detection in Time Series Data using Ideas from Immunology. In proceedings of the International Conference on Intelligent Systems. June 19-21.

[10] Datar, M. & Muthukrishnan, S. (2002). Estimating Rarity and Similarity over Data Stream Windows. In proceedings of the 10th European Symposium on Algorithms. Sep 17-21, Rome, Italy.

[11] Daw, C. S., Finney, C. E. A. & Tracy, E. R. (2001). Symbolic Analysis of Experimental Data. Review of Scientific Instruments.

[12] Ding, C., He, X., Zha, H. & Simon, H. (2002). Adaptive Dimension Reduction for Clustering High Dimensional Data. In proceedings of the 2nd IEEE International Conference on Data Mining. Dec 9-12, Maebashi, Japan. pp 147-154.

[13] Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

[14] Faloutsos, C., Ranganathan, M. & Manolopoulos, Y. (1994). Fast Subsequence Matching in Time-Series Databases. In proceedings of the ACM SIGMOD Int'l Conference on Management of Data. May 24-27, Minneapolis, MN. pp 419-429.

[15] Fayyad, U., Reina, C. & Bradley, P. (1998). Initialization of Iterative Refinement Clustering Algorithms. In proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. New York, NY, Aug 27-31. pp 194-198.

[16] Geurts, P. (2001). Pattern Extraction for Time Series Classification. In proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery. Sep 3-7, Freiburg, Germany. pp. 115-127.

[17] Gionis, A. & Mannila, H. (2003). Finding Recurrent Sources in Sequences. In proceedings of the 7th International Conference on Research in Computational Molecular Biology. Apr 10-13, Berlin, Germany. To Appear.

[18] Guha, S., Mishra, N., Motwani, R. & O'Callaghan, L. (2000). Clustering Data Streams. In proceedings of the 41st Symposium on Foundations of Computer Science. Nov 12-14, Redondo Beach, CA. pp 359-366.

[19] Hellerstein, J. M., Papadimitriou, C. H. & Koutsoupias, E. (1997). Towards an Analysis of Indexing Schemes. In proceedings of the 16th ACM Symposium on Principles of Database Systems. May 12-14, Tucson, AZ. pp 249-256.

[20] Huang, Y. & Yu, P. S. (1999). Adaptive Query Processing for Time-Series Data. In proceedings of the 5th Int'l Conference on Knowledge Discovery and Data Mining. San Diego, CA, Aug 15-18. pp 282-286.

[21] Kalpakis, K., Gada, D. & Puttagunta, V. (2001). Distance Measures for Effective Clustering of ARIMA Time-Series. In proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, Nov 29-Dec 2. pp 273-280.

[22] Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. (2001). Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. In proceedings of ACM SIGMOD Conference on Management of Data. Santa Barbara, CA, May 21-24. pp 151-162.

[23] Keogh, E. & Kasetty, S. (2002). On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. In proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. July 23 - 26, 2002. Edmonton, Alberta, Canada. pp 102-111.

[24] Keogh, E., Lonardi, S. & Chiu, W. (2002). Finding Surprising Patterns in a Time Series Database in Linear Time and Space. In the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. July 23 - 26, 2002. Edmonton, Alberta, Canada. pp 550-556.

[25] Keogh, E. & Pazzani, M. (1998). An Enhanced Representation of Time Series Which Allows Fast and Accurate Classification, Clustering and Relevance Feedback. In proceedings of the 4th Int'l Conference on Knowledge Discovery and Data Mining. New York, NY, Aug 27-31. pp 239-241.

[26] Lin, J., Keogh, E., Lonardi, S. & Patel, P. (2002). Finding Motifs in Time Series. In proceedings of the 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 23-26. pp 53-68.

[27] Larsen, R. J. & Marx, M. L. (1986). An Introduction to Mathematical Statistics and Its Applications. 2nd Edition. Prentice Hall, Englewood Cliffs, NJ.

[28] Lonardi, S. (2001). Global Detectors of Unusual Words: Design, Implementation, and Applications to Pattern Discovery in Biosequences. PhD thesis, Department of Computer Sciences, Purdue University, August, 2001.

[29] Reinert, G., Schbath, S. & Waterman, M. S. (2000). Probabilistic and Statistical Properties of Words: An Overview. Journal of Computational Biology, Vol. 7, pp 1-46.

[30] Roddick, J. F., Hornsby, K. & Spiliopoulou, M. (2001). An Updated Bibliography of Temporal, Spatial and Spatio-Temporal Data Mining Research. In Post-Workshop Proceedings of the International Workshop on Temporal, Spatial and Spatio-Temporal Data Mining. Berlin, Springer. Lecture Notes in Artificial Intelligence. Roddick, J. F. and Hornsby, K., Eds. 147-163.

[31] Shahabi, C., Tian, X. & Zhao, W. (2000). TSA-tree: A Wavelet-Based Approach to Improve the Efficiency of Multi-Level Surprise and Trend Queries. In proceedings of the 12th Int'l Conference on Scientific and Statistical Database Management. pp 55-68.

[32] Staden, R. (1989). Methods for Discovering Novel Motifs in Nucleic Acid Sequences. Computer Applications in Biosciences. Vol. 5(5). pp 293-298.

[33] Tompa, M. & Buhler, J. (2001). Finding Motifs Using Random Projections. In proceedings of the 5th Int'l Conference on Computational Molecular Biology. Montreal, Canada, Apr 22-25. pp 67-74.

[34] Vlachos, M., Kollios, G. & Gunopulos, D. (2002). Discovering Similar Multidimensional Trajectories. In proceedings of the 18th International Conference on Data Engineering. Feb 26-Mar 1, San Jose, CA.

[35] Yi, B. K. & Faloutsos, C. (2000). Fast Time Sequence Indexing for Arbitrary Lp Norms. In proceedings of the 26th Int'l Conference on Very Large Databases. Sep 10-14, Cairo, Egypt. pp 385-394.


Clustering Binary Data Streams with K-means

Carlos Ordonez
Teradata, a division of NCR
San Diego, CA 92127, USA

ABSTRACT
Clustering data streams is an interesting Data Mining problem. This article presents three variants of the K-means algorithm to cluster binary data streams: On-line K-means, Scalable K-means, and Incremental K-means, a newly proposed variant that finds higher quality solutions in less time. The higher quality solutions are obtained with a mean-based initialization and incremental learning. The speedup is achieved through a simplified set of sufficient statistics and operations with sparse matrices. A summary table of clusters is maintained on-line. The K-means variants are compared with respect to quality of results and speed. The proposed algorithms can be used to monitor transactions.

1. INTRODUCTION
Clustering algorithms partition a data set into several disjoint groups such that points in the same group are similar to each other according to some similarity metric [9]. Most clustering algorithms work with numeric data [3; 6; 14; 26], but there has been work on clustering categorical data as well [12; 15; 18; 23]. The problem definition of clustering categorical data is not as clear as the problem of clustering numeric data [9]. There has been extensive database research on clustering large and high dimensional data sets; some important approaches include [6; 3; 17; 26; 2; 7; 20]. Clustering data streams has recently become a popular research direction [13]. The problem is difficult. High dimensionality [17; 1; 2; 23], data sparsity [2; 14] and noise [3; 6; 7; 17] make clustering a more challenging problem.

This work focuses on clustering binary data sets. Binary data sets are interesting and useful for a variety of reasons. They are the simplest form of data available in a computer and they can be used to represent categorical data. From a clustering point of view they offer several advantages: there is no concept of noise like that of quantitative data, and they can be efficiently stored, indexed and retrieved. Since all dimensions have the same scale there is no need to transform the data set. There is extensive research on efficient algorithms that can manage large data sets [26; 14] and high dimensionality [2; 17]. Despite such efforts K-means remains one of the most popular clustering algorithms used in practice. The main reasons are that it is simple to implement, it is fairly efficient, results are easy to interpret and it can work under a variety of conditions. Nevertheless, it is not the final answer to clustering [6; 9]. Some of its disadvantages include dependence on initialization, sensitivity to outliers and skewed distributions, and convergence to poor locally optimal solutions. This article introduces several improvements to K-means to cluster binary data streams. The K-means variants studied in this article include the standard version of K-means [19; 9], On-line K-means [25], Scalable K-means [6], and Incremental K-means, a variant we propose.

1.1 Contributions and article outline
This article presents several K-means improvements to cluster binary data streams. Improvements include simple sufficient statistics for binary data, efficient distance computation for sparse binary vectors, sparse matrix operations and a summary table of clustering results showing frequent binary values and outliers. We study how to incorporate these changes into several variants of the K-means algorithm. Particular attention is paid to the Incremental K-means algorithm proposed in this article. An extensive experimental section compares all variants.

The article is organized as follows. Section 2 introduces definitions, the K-means algorithm and examples. Section 3 introduces the proposed improvements to cluster binary data streams and explains how to incorporate them into all K-means variants. Section 4 presents extensive experimental evaluation with real and synthetic data sets. Section 5 briefly discusses related work. Section 6 contains the conclusions and directions for future work.

2. PRELIMINARIES

2.1 Definitions
The input for the K-means clustering algorithm is a data set D having n d-dimensional points and k, the desired number of clusters. The output is three matrices, C, R, W, containing the means, the squared distances and the weights, respectively, for each cluster, together with a partition of D into k subsets and a measure of cluster quality. Matrices C and R are d × k and W is a k × 1 matrix. To manipulate matrices we use the following convention for subscripts. For transactions we use i; i ∈ {1, 2, . . . , n} (i alone is a subscript, whereas ij refers to item j). For cluster number we use j; j ∈ {1, 2, . . . , k}, and to refer to one dimension we use l; l ∈ {1, 2, . . . , d}. Let D1, D2, . . . , Dk be the k subsets of D induced by clusters s.t. Dj ∩ Dj′ = ∅ for j ≠ j′. To refer to a column of C or R we use the j subscript (i.e. Cj, Rj). So Cj and Rj refer to the jth cluster centroid and the jth variance matrix, respectively.


Wj is the jth cluster weight. The diag[] notation will be used as a generic operator to obtain a diagonal matrix from a vector, to consider only the diagonal of a matrix, or to convert the diagonal of a matrix into a vector. This work assumes that a symmetric similarity measure [9], like Euclidean distance, on binary points is acceptable. That is, a 0/0 match is as important as a 1/1 match on some dimension. This is in contrast to asymmetric measures, like the Jaccard coefficient, that give no importance to 0/0 matches [15; 18].

Let S = {0, 1}^d be the d-dimensional Hamming cube and let D = {t1, t2, . . . , tn} be a database of n points in S. That is, ti is a binary vector (treated as a transaction). Matrix D is a d × n sparse binary matrix. Let Ti = {l | Dli = 1, l ∈ {1, 2, . . . , d}}, for i ∈ {1, 2, . . . , n}. That is, Ti is the set of non-zero coordinates of ti; Ti can be understood as a transaction or an itemset, to be defined below. The input data set D then becomes a stream of integers indicating dimensions equal to 1, separated by an end-of-transaction marker. Since transactions are sparse vectors, |Ti| << d. This fact will be used to develop an efficient distance computation. We will use T to denote the average transaction size, T = (1/n) ∑_{i=1}^{n} |Ti|. Since D contains binary points, Clj can be understood as the fraction of points in cluster j that have dimension l equal to 1, and Wj as the fraction of the n points that belong to cluster j; Rj represents a diagonal matrix but can be manipulated as a vector. Points that do not adjust well to the clustering model are called outliers. K-means uses Euclidean distance; the distance from ti to Cj is

δ(ti, Cj) = (ti − Cj)^t (ti − Cj).     (1)

2.2 The K-means clustering algorithm
K-means [19] is one of the most popular clustering algorithms [6; 10; 24]. It is simple and fairly fast [9; 6]. K-means is initialized from some random or approximate solution. Each iteration assigns each point to its nearest cluster and then points belonging to the same cluster are averaged to get new cluster centroids. Each iteration successively improves the cluster centroids until they become stable.

Formally, the problem of clustering is defined as finding apartition of D into k subsets such that

∑_{i=1}^{n} δ(ti, Cj)     (2)

is minimized, where Cj is the nearest cluster centroid of ti. The quality of a clustering model is measured by the sum of squared distances from each point to the cluster where it was assigned [26; 21; 6]. This quantity is proportional to the average quantization error, also known as distortion [19; 24]. The quality of a solution is measured as

q(C) = (1/n) ∑_{i=1}^{n} δ(ti, Cj),     (3)

which can be computed from R as

q(R, W) = ∑_{j=1}^{k} Wj ∑_{l=1}^{d} Rlj.     (4)
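For concreteness, the following is a minimal sketch of the standard K-means loop just described (Python with NumPy; the naming is ours, not from the article), reporting the quality measure q of equation (3):

    import numpy as np

    def kmeans(D, k, iters=10, seed=0):
        # D: n x d matrix of (binary) points; k: desired number of clusters.
        rng = np.random.default_rng(seed)
        C = D[rng.choice(len(D), k, replace=False)].astype(float)  # k seed points
        for _ in range(iters):
            # assignment step: each point goes to its nearest centroid
            d2 = ((D[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # update step: points of the same cluster are averaged
            for j in range(k):
                members = D[labels == j]
                if len(members) > 0:
                    C[j] = members.mean(axis=0)
        d2 = ((D[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        q = d2[np.arange(len(D)), labels].mean()  # equation (3)
        return C, labels, q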

2.3 Example
We now present an example with d = 16, k = 4 to motivate the need for a summary table. Assume items come as transactions containing integers in {1, . . . , 16}. The clustering results for a store, in terms of matrices, are shown in Figure 1. If we consider a real database environment in which there are hundreds of product categories or thousands of products, it is difficult to understand the output matrices C, W. We propose a summary of clusters as shown in Table 1. This summary is easier to understand than the matrix of floating point numbers. Another advantage is that we can see outliers. This table suggests associations [4; 16] among items in the same row.

3. CLUSTERING BINARY DATA STREAMS

3.1 Sparse matrix operations and simple sufficient statistics
The first concern when using a clustering algorithm with large data sets is speed. To accelerate K-means we use sparse distance computation and simpler sufficient statistics. We explain distance computation first. Sparse distance computation is made possible by precomputing the distances between the null transaction (zeroes on all dimensions) and all centroids Cj. Then only differences for non-null dimensions of each transaction are computed to determine cluster membership as transactions are being read. When D is a sparse matrix and d is high, the distance formula is expensive to compute. In typical transaction databases only a few dimensions have non-zero values. So we precompute the distance from every Cj to the null vector 0. To that end, we define the k-dimensional vector ∆ with ∆j = δ(0, Cj). Based on ∆, distances can be computed as

δ(ti, Cj) = ∆j + ∑_{l=1, (ti)l ≠ 0}^{d} ( ((ti)l − Clj)² − Clj² ).     (5)

This computation improves performance significantly, but it does not affect result accuracy. A similar idea is applied to update clusters. This is discussed in more detail later.
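A minimal sketch of this sparse distance computation (Python; our naming, assuming binary input so that (ti)l = 1 for every l in Ti):

    def sparse_distance(Ti, Cj, delta_j):
        # Ti: non-zero dimensions of transaction ti; Cj: centroid j as a
        # dense array; delta_j = delta(0, Cj), precomputed once per centroid.
        d = delta_j
        for l in Ti:  # O(|Ti|) work instead of O(d), per equation (5)
            d += (1.0 - Cj[l]) ** 2 - Cj[l] ** 2
        return d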

K-means can use sufficient statistics [6; 26], which are summaries of D1, D2, . . . , Dk represented by the three matrices M, Q, N that contain the sum of points, the sum of squared points and the number of points per cluster, respectively. K-means is fast and the idea of using sufficient statistics is well known to make it faster [6; 10], but we can take a further step. Sufficient statistics can be simplified for clustering binary data. The following result states that the sufficient statistics for the problem of clustering binary vectors are simpler than the sufficient statistics required for clustering numeric data.


W =
      0.40
      0.05
      0.20
      0.35

C =
                      1     2     3     4
       1 : beer      0.12  0.80  0.09  0.21
       2 : bread     0.95  0.21  0.45  0.78
       3 : coffee    0.69  0.14  0.02  0.16
       4 : crackers  0.31  0.87  0.50  0.07
       5 : ham       0.21  0.89  0.21  0.14
       6 : jelly     0.83  0.12  0.25  0.11
       7 : magazine  0.11  0.09  0.03  0.03
       8 : meat      0.02  0.11  0.13  0.99
       9 : milk      0.91  0.45  0.77  0.23
      10 : notebook  0.02  0.56  0.45  0.07
      11 : oil       0.21  0.11  0.56  0.70
      12 : produce   0.43  0.05  0.82  0.21
      13 : salsa     0.25  0.01  0.21  0.11
      14 : soda      0.10  0.61  0.70  0.07
      15 : water     0.16  0.27  0.88  0.09
      16 : wine      0.01  0.08  0.12  0.13

Figure 1: Basket clusters

j   Wj    Cj (frequent dimensions)                   Outliers (odd transactions)
1   40%   90-100%: bread milk; 80-90%: jelly         {jelly wine notebook}
2    5%   80-90%: beer crackers ham                  {crackers sauce}
3   20%   80-90%: produce water; 70-80%: milk soda   {water coffee}
4   35%   90-100%: meat; 80-90%: bread; 70-80%: oil  {bread magazine}

Table 1: Transaction clusters summary table

Lemma 1. Let D be a set of n transactions of binary data and let D1, D2, . . . , Dk be a partition of D. Then the sufficient statistics required for computing C, R, W are only N and M.

Proof: To compute W, k counters are needed for the k subsets of D; these are stored in the k × 1 matrix N, and then Wj = Nj/n. To compute C we use Mj = ∑_{ti∈Dj} ti, and then Cj = Mj/Nj. To compute Q the following formula must be computed: Qj = ∑_{ti∈Dj} diag[ti ti^t] = ∑_{ti∈Dj} ti = Mj. Note that diag[ti ti^t] = ti because x = x² if x is binary and ti is a vector of binary numbers; elements off the diagonal are ignored for diagonal matrices. Then Q = M. Therefore, only N and M are needed. ∎

A consequence of the previous lemma is that R can be computed from C without scanning D or storing Q. Even further, Lemma 1 makes it possible to reduce storage to one half. However, it is not possible to reduce storage further, because when K-means determines cluster membership it needs to keep a copy of Cj to compute distances and a separate matrix with Mj to add the point to cluster j. Therefore both C and M are needed for an Incremental or Scalable version, but not for an On-line version. The On-line K-means algorithm to be introduced later could keep a single matrix for centroids and sum of points. For the remainder of this article M is a d × k matrix with Mj = ∑_{ti∈Dj} ti, and N is a k × 1 matrix with Nj = |Dj|. The update formulas for C, R, W are

Cj = (1/Nj) Mj,     (6)

Rj = diag[Cj] − Cj Cj^t,     (7)

and

Wj = Nj / ∑_{j′=1}^{k} Nj′.     (8)
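As a small illustration of Lemma 1 and equations (6)-(8) (Python with NumPy; our naming, assuming every cluster is non-empty), the whole model can be recovered from N and M alone:

    import numpy as np

    def model_from_stats(M, N):
        # M: d x k matrix of per-cluster sums of points (Q = M for binary data);
        # N: length-k vector of per-cluster counts, assumed positive.
        C = M / N          # equation (6), column j divided by Nj
        R = C - C * C      # equation (7): the diagonal of Rj, as a vector
        W = N / N.sum()    # equation (8)
        return C, R, W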

A summary table G, like the one shown in Section 2.3, is necessary to understand very high dimensional binary data. Instead of trying to interpret a d × k matrix containing floating point numbers, the user can focus on those matrix entries that are more interesting. This table should be cheap to maintain so that it introduces a negligible overhead. The user specifies a list of cutoff points and a number of top outliers per cluster. These cutoff points uniformly partition the cluster means Cj. Dimension subscripts (items) are inserted into a rank every time C is updated. The table is constructed using a list of cutoff points c1, c2, . . . , s.t. 1 ≥ ci > 0, and a number of desired outliers per cluster Oj ≥ 0 for cluster j. For each cutoff ci there is a (possibly empty) list of dimensions L s.t. l ∈ L and ci > Clj ≥ ci+1. Cutoff points are taken at equal interval lengths in decreasing order starting at 1, and are used for all k clusters for efficiency purposes. Having different cutoff points for each cluster, or having them separated at different intervals, would introduce a significant overhead to maintain G. Top outliers are inserted into a list kept in descending order by distance to their nearest centroid; the outlier with the smallest distance is deleted from the list. Outliers are inserted when cluster membership is determined. The cost to maintain this table is low. The following are important considerations for updating the summary table on-line. (1) It is expected that only a few dimensions will appear in some rank since the data sets are sparse. (2) Only high percentage ranks, closer to 1, are considered interesting. (3) All the cutoff points are separated at equal intervals. In this manner the rank for some dimension of Cj is determined in time O(1) using Clj as an index value (see the sketch after this list).


Irregular intervals would make a linear search mandatory. (4) The dimensions are sorted in lexicographical order by dimension index l inside each rank. The insertion is done in time almost O(1) because only a few dimensions appear in the list and the data set D is assumed to be a sparse matrix. When G becomes populated as transactions are scanned, it is likely that some dimension of Cj is already inserted, and this can be determined in time O(log |L|) by doing a binary search. This would not be the case if all the cutoff points spanned the [0, 1] interval, because then all d dimensions would have to be ranked. (5) Outlier insertion is done in time almost O(1). This is the case because most input points are not outliers and it is assumed the desired number of outliers in the list is small.
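The O(1) rank lookup of consideration (3) amounts to simple arithmetic on the centroid entry, as in this small sketch (Python; our naming):

    def rank_index(clj, num_cutoffs):
        # Cutoff points at equal intervals in decreasing order from 1, e.g.
        # num_cutoffs = 10 gives 1.0, 0.9, ..., 0.1; rank 0 is then the
        # 90-100% rank used in Table 1.
        return min(int((1.0 - clj) * num_cutoffs), num_cutoffs - 1)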

3.2 K-means variants for binary data streams
This section presents the variants of K-means for binary data streams based on the improvements introduced above. Empty clusters are re-seeded with the furthest neighbors of non-empty clusters, as proposed in [6]. The re-seeding points are extracted from the outlier list stored in the summary table. We believe incorporating all improvements in all algorithms is more valuable than improving only one and then claiming that one is the best. This allows a fair comparison.

The input is D = {T1, T2, . . . , Tn}, the binary data points given as transactions, as defined in Section 2.1, and k, the desired number of clusters. It is important to observe that points are assumed to come as lists of integers as defined in Section 2. The order of dimensions inside each transaction is not important. The order of transactions is not important as long as transactions do not come sorted by cluster. The output is the clustering model given by the matrices C, R, W, a partition of D into D1, D2, . . . , Dk, a summary table G and a measure of cluster quality q(R, W). Let the nearest neighbor function be defined as NN(ti) = J such that δ(ti, CJ) ≤ δ(ti, Cj) for j ∈ {1, 2, . . . , k}.

Let ⊕ be sparse addition of vectors. Mathematically this addition is normal vector addition, but for implementation purposes only non-zero entries are added; Mj ⊕ ti has complexity O(T). In a similar manner we define a sparse division of matrices, and a sparse matrix subtraction, that only update matrix entries that change after reading a new point and assigning it to its nearest cluster. Empty clusters are re-seeded as follows. If WJ = 0 then CJ ← to (an outlier transaction), where to is a transaction s.t. δ(to, CJ) ≥ δ(ti, Cj) for j ≠ J and ti ∈ Dj. This outlier to is taken from the top outlier list in G.

The On-line K-means variant we use in this work is based on the On-line EM algorithm [22]. The basic difference is that distances are used to compute cluster memberships instead of weighted probabilities with Gaussian distributions. Initialization is based on a sample of k different points to seed C. The weights Wj are initialized to 1/k to avoid early re-seeding. A similar streamed version for continuous data is outlined in [21]. We implemented our improvements on the simplified Scalable K-means version proposed in [10]. In that work the authors give convincing evidence that a simpler version of the original Scalable K-means [6] produces higher quality results in less time. The only critical parameter is the buffer size. Primary and secondary compression parameters as proposed in [6] are not needed. This version performs primary compression, discarding all buffer points each time. Initialization is based on a sample of k different points to seed C. The weights Wj are initialized to 1/k to avoid early re-seeding.

Incremental K-means

This is our proposed variant of K-means. This version is a compromise between On-line K-means and Standard K-means doing one iteration. A fundamental difference from both Standard K-means and Scalable K-means is that Incremental K-means does not iterate until convergence. Another important difference is that initialization, explained below, is done using global statistics of D instead of using a sample of k transactions. Doing only one iteration could be a limitation with continuous (numeric) data, but it is reasonable with binary data. A difference from On-line K-means is that it does not update C and Wj on every transaction, but every n/L transactions (L times in total), and each time it touches the entirety of C and W. The setting for L is important to get a good solution. If L = 1 then Incremental K-means reduces to Standard K-means stopped early after one iteration. On the other hand, if L = n then Incremental K-means reduces to On-line K-means. The setting we propose is L = √n. This is a good setting for a variety of reasons: (1) It is independent of d and k. (2) A larger data set size n accelerates convergence since as n → ∞, L → ∞. (3) The number of points used to recompute centroids is the same as the total number of times they are updated. Cluster centroids are initialized with small changes to the global mean of the data set. This mean-based initialization has the advantage of not requiring a pass over the data set to get different seeds for different runs, because the global mean can be incrementally maintained. In the pseudo-code, r represents a random number in [0, 1], µ is the global mean and σ = diag[Rj] represents a vector of global standard deviations. As d → ∞, cluster centroids Cj → µ. This is based on the fact that the value of ∑_{i=1}^{n} δ(ti, x) is minimized when x = µ.

3.3 Suitability for binary data streams
Clustering data streams has become a popular research direction [13]. All the variants introduced above read each data point from disk just once. However, Scalable K-means iterates in memory, which can slow it down for an incoming flow of transactions. Moreover, since it iterates until convergence, it is necessary to have a threshold on the number of iterations to provide time guarantees. On-line K-means and Incremental K-means can keep up with the incoming flow of transactions since they do not iterate.

The proposed K-means variants have O(Tkn) time complexity, where T is the average transaction size; recall that T << d. Re-seeding using outliers is done in time O(k). For sparse binary data, updating G takes time almost O(1). In the case of Scalable K-means there is an additional complexity factor given by the number of iterations until convergence. Matrices M, C require O(dk) space and N, W require O(k). Since R is derived from C, that space is saved. For all approaches there is an additional requirement of O(b) to hold b transactions in a buffer. In the case of Scalable K-means this size is explicitly used as a parameter to the clustering process.


Input: {T1, T2, . . . , Tn} and k
Output: C, R, W and q(R, W)

FOR j = 1 TO k DO
    Cj ← µ ± σr/d
    Nj ← 0
    Mj ← 0
    Wj ← 1/k
END
L ← √n
FOR i = 1 TO n DO
    j ← NN(ti)
    Mj ← Mj ⊕ ti
    Nj ← Nj + 1
    IF (i mod (n/L)) = 0 THEN
        Cj ← Mj/Nj
        Rj ← Cj − Cj^t Cj
        Wj ← Nj/i
        FOR j = 1 TO k DO
            IF Wj = 0 THEN
                Cj ← to
            END
        END
    END
END

Figure 2: The Incremental K-means algorithm
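A minimal runnable rendering of Figure 2 (Python with NumPy; the naming is ours, and the outlier-based re-seeding is omitted for brevity):

    import math
    import numpy as np

    def incremental_kmeans(transactions, d, k, seed=0):
        # transactions: list of lists of non-zero dimension ids (binary data)
        rng = np.random.default_rng(seed)
        n = len(transactions)
        mu = np.zeros(d)
        for T in transactions:      # global mean (one pass here; in a stream
            mu[T] += 1.0 / n        # it would be maintained incrementally)
        sigma = np.sqrt(mu * (1.0 - mu))                  # global std devs
        C = mu + sigma * rng.uniform(-1, 1, (k, d)) / d   # Cj <- mu +- sigma*r/d
        M = np.zeros((k, d)); N = np.zeros(k)
        W = np.full(k, 1.0 / k)
        L = max(int(math.sqrt(n)), 1); step = max(n // L, 1)
        delta = (C ** 2).sum(axis=1)      # delta(0, Cj) for sparse distances
        for i, T in enumerate(transactions, start=1):
            dist = delta + ((1.0 - C[:, T]) ** 2 - C[:, T] ** 2).sum(axis=1)
            j = int(dist.argmin())        # j = NN(ti)
            M[j, T] += 1.0; N[j] += 1.0   # sparse updates, Mj <- Mj (+) ti
            if i % step == 0:             # recompute centroids L times overall
                nz = N > 0
                C[nz] = M[nz] / N[nz][:, None]
                W = N / i
                delta = (C ** 2).sum(axis=1)
        return C, W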

In any case the buffer space requirements are negligible compared to the clustering model space. In short, space requirements are O(dk) and time complexity is O(Tkn) for sparse binary vectors.

4. EXPERIMENTAL EVALUATION
This section contains extensive experimental evaluation with real and synthetic data sets. All algorithms were benchmarked on quality of results and performance. The main criterion for quality was q(R, W). Most running times were measured in seconds. All experiments were done on a PC running at 600MHz with 64MB of main memory and a 20GB hard disk. All algorithms were implemented in the C++ language. The Scalable K-means variant we used was modified from the code available from the authors of [10]. For the sake of completeness we also incorporated our performance improvements into Standard K-means to compare quality of solutions and performance. In general the only main parameter is k. Standard K-means and Scalable K-means had a tolerance threshold for q(R, W) set to ε = 1.0e−5. The Scalable K-means buffer size was set at 1% as recommended by the authors [6]. On-line and Incremental K-means had no parameters to set other than k. All algorithms updated the clustering summary table G on-line, showing that the cost to maintain it is low.

We do not show comparisons with the original implementations of Scalable K-means proposed in [6] and [10] because their times on sparse binary data sets are an order of magnitude higher; we believe this would not be interesting, and time comparisons would be meaningless.

4.1 Experiments with Real Data Sets
Table 2 shows the best results out of 10 runs for several real data sets. The table shows the quantization error (average sum of squared distances) and elapsed time. Lower numbers are better. When we refer to d it is the number of binary dimensions, not to be confused with the original dimensionality of some data sets, which may be smaller. We ran all algorithms a few times to find a good value for k. Such k produced clusters that had C values close to 1 in a few dimensions and values mostly close to 0 for the rest.

The patient data set had n = 655 and d = 19. This data set contained information on patients being treated for heart disease. We extracted categorical columns and each value was mapped to a binary dimension. Patients were already binned by their age in 10-year increments by medical doctors. There were clusters with some dimension Clj ≥ 90%. One cluster had 18% of patients who were male, white and aged between 50-59. Another cluster had 15% of patients who were male, white and aged between 70-79; the interesting aspect is that the cluster with males aged 60-69 only had 6% of patients. Another cluster had about 10% of black male patients who were between 30-79, but most of them (36%) were between 40-49. Another cluster had 14% of patients who were white females aged 70-79. Two outliers were two male patients younger than 40. Since n is very small, times were a fraction of 1 second.

The store data set had a sample of n = 234,000 transactions from a chain of supermarkets with d = 12 binary attributes. Each transaction had categorical attributes from its store. The quantitative attributes were ignored. In this case binary dimensions were the values of each categorical variable describing a predefined classification of stores according to their profit performance (low, medium and high), the store layout (convenience, market, supermarket) and store size (small, medium, large, and very large). K-means was run with k = 10. In this case small convenience stores had the worst performance, found in 5% of the data set. High performance was concentrated mostly in market town stores of small, medium and large size, in 15% of transactions. One cluster captured most (66%) of the medium convenience stores, of which one third had high performance. A heavy cluster with 22% of transactions had medium performance, occurring mostly in convenience and market town layouts but of varying sizes. There were no interesting outliers in this case; they only included transactions with missing information.

The bktdept data set came from another chain of grocery stores. Its sizes were n = 1M and d = 161. The dimensions were the store departments, indicating whether the customer bought in that department or not. This data set was challenging given its high dimensionality, with most transactions having 5 departments or less, and also because clusters had significant overlap. A small k did not produce good results. We had to go up to k = 40 to find acceptable solutions. The algorithms were able to identify some significant clusters. One cluster with 1% of baskets involved bread and pet products; a strange combination. Another 1% of baskets involved mostly biscuits and candy. 8% of baskets had cream as the main product, with no other significant product bought together with it. Another interesting finding is that 5% mainly bought newspapers, but always with a variety of other products, with milk being the most frequent one. Another 10% bought mostly cigarettes, but one fifth of them also bought milk. A couple of interesting outliers with odd combinations included a basket with more than 50 departments visited (too many compared to most baskets), including toys, cooking aids, pet products and food departments, and another basket with cigarettes, nuts and medicine among others. It was surprising that Scalable K-means was slower than Standard K-means.


Data set             k    std KM   online KM   scal KM   incr KM
patient              8    0.0217   0.0247      0.0199    0.0181
  n = 655, d = 19         0        0           0         0
store                10   0.0355   0.0448      0.0361    0.0389
  n = 234k, d = 12        32       13          19        12
bktdept              40   0.0149   0.0168      0.0151    0.0151
  n = 1M, d = 161         823      251         842       284
harddisk             5    0.0119   0.0279      0.0016    0.0016
  n = 556k, d = 13        62       36          54        31
phone                13   0.0116   0.0174      0.0090    0.0095
  n = 743k, d = 22        472      57          91        64

(For each data set the first row gives the quantization error q(R, W) and the second row the elapsed time in seconds.)

Table 2: Quality of results with real data sets

The harddisk data set contained data from a hard disk manufacturer. Each record corresponded to one disk passing or failing a series of tests coded in a categorical variable. The sizes for this data set were n = 556,390 and d = 13. Most of the tests involved measurements where a zero would imply failing and a value greater than zero passing (with a degree of confidence). We mapped each measurement to one binary dimension, setting a positive number to 1. This data set was hard to cluster because almost 100% of disks passed the tests. All algorithms found clusters in which failing disks were spread across all clusters. However, the best solution indicated that most failing disks (about 1.3%) failed a specific test. In this case outliers were passing disks that passed two difficult tests and failed the common ones. Another outlier was a failing disk that failed the same tests as the outlier passing disk.

The phone data set contained demographic data about customers from a telephone company. We selected a few categorical dimensions that were already binary. This data set was very easy to cluster because customers concentrated in three well defined groups, with the rest in small clusters. No clusters seemed to indicate anything particularly interesting. We omit further discussion.

4.2 Experiments with Transaction Data Sets
In this section we present experiments with transaction data sets created with the IBM data generator [4; 5]. The defaults we used were as follows. The number of transactions was n = 100k. The average transaction size T was 8, 10 and 20. Pattern length (I) was one half of transaction length. Dimensionality d was 100, 1000 and 10,000. The rest of the parameters were kept at their defaults (average rule confidence = 0.25, correlation = 0.75). These data sets represent very sparse matrices and very high dimensional data. We did not expect to find any significant clusters, but we wanted to try the algorithms on them to see how they behaved. For most cluster solutions summarized in Table 3, centroids were below 0.5. According to our criterion they were bad. Clustering transaction files was difficult. Increasing k improved results by a small margin, as can be seen. This fact indicated that clusters had significant overlap with the default data generation parameter settings. We further investigated how to make the generator produce clusters, and it turned out that correlation and confidence were the most relevant parameters. In that case the quantization error showed a significant decrease for all algorithms. For these synthetic data sets all algorithms found solutions of similar quality. In general Scalable K-means found slightly better solutions.

Then Standard and Incremental K-means came in second place. On-line K-means always found the worst solution.

Figure 3 shows performance graphs varying n, d and T with defaults n = 100k, d = 1000, T = 10. The program was run with k = 10. The left graph compares the four K-means variants. The center graph shows performance for Incremental K-means at two dimensionalities. The right graph shows Incremental K-means performance varying the average transaction size T. In this case the story is quite different compared to the easier synthetic binary data sets. Performance for Standard K-means and Scalable K-means degrades more rapidly than their counterparts with increasing n. All K-means variants are minimally affected by dimensionality when the average transaction size is kept fixed, showing the advantage of performing sparse distance computation. All variants exhibit linear behavior.

4.3 Discussion
In general, Incremental K-means and Scalable K-means found the best solutions. In a few cases Scalable K-means found slightly higher quality solutions than Incremental K-means, but in most cases Incremental K-means was as good or better than Scalable K-means. For two real data sets Scalable K-means found slightly better solutions than Incremental K-means. Performance-wise, On-line K-means and Incremental K-means were the fastest. It was surprising that in a few cases Incremental K-means was faster than On-line K-means. It turned out that continuously recomputing averages for every transaction does cause overhead. Standard K-means was always the slowest. In its favor we can say that it found the best solution in a couple of cases with the real data sets. This suggests some clustering problems are hard enough to require several scans over the data set to find a good solution. In general there was not the sensitivity to initialization seen when clustering numeric data, but more sophisticated initialization techniques may be employed. Such improvements can be added to this proposal.

5. RELATED WORK
Research on scaling K-means to cluster large data sets includes [6] and [10], where Scalable K-means is proposed and improved. An alternative way to re-seed clusters for K-means is proposed in [11]. We would like to compare that approach to the one used in this article. The problem of clustering data streams is proposed in [13]. There has been some work in that direction. Randomized clustering algorithms to find high-quality solutions on streaming data are proposed in [21].


Data set      d       k    std KM    online KM   scal KM   incr KM
T8I4D100k     1000    10   0.00758   0.00760     0.00746   0.00746
T8I4D100k     1000    20   0.00719   0.00727     0.00710   0.00720
T10I5D100k    100     10   0.07404   0.07545     0.07444   0.07456
T10I5D100k    100     20   0.07042   0.07220     0.07100   0.07151
T20I10D100k   10000   10   0.00194   0.00195     0.00192   0.00192
T20I10D100k   10000   20   0.00186   0.00191     0.00184   0.00184

Table 3: Quality of results with transaction data sets

(Figure content omitted: left graph, time in seconds vs. n × 1000 for incr KM, scal KM, online KM and std KM; center graph, time in minutes vs. k for incr KM at d = 1000 and d = 10000; right graph, time in minutes vs. average transaction size for incr KM at d = 1000.)

Figure 3: Performance varying n, k and T with high d

There is work on clustering categorical data sets from a Data Mining perspective. The K-modes algorithm is proposed in [18]; this algorithm is a variant of K-means, but uses only frequency counting on 1/1 matches. Another algorithm is ROCK, which groups points according to their common neighbors (links) in a hierarchical manner [15]. CACTUS is a graph-based algorithm that clusters categorical values using point summaries [12]. These approaches are different from ours since they are not distance-based and have different optimization objectives. Also, ROCK is a hierarchical algorithm. One interesting aspect discussed in [15] is the error propagation when using a distance-based algorithm to cluster binary data in a hierarchical manner; fortunately, K-means is not hierarchical. An advantage of the K-means variants introduced above over purely categorical clustering algorithms, like K-modes, CACTUS and ROCK, is that they can be extended to cluster points with both numeric (continuous) and categorical dimensions (with categorical values mapped to binary). This problem is known as clustering mixed data types [9; 8]. There is some criticism of using distance similarity metrics for binary data [9], but our point is that a careful summarization and interpretation of results can make the approach useful. A fundamental difference from associations [4] is that associations describe frequent patterns found in the data, matched only on 1/1 occurrences. Another difference is that associations do not summarize the transactions that support the discovered pattern.

6. CONCLUSIONS
This article proposed several improvements to K-means for clustering binary data streams. Sufficient statistics are simpler for binary data. Distance computation is optimized for sparse binary vectors. A summary table with the best cluster dimensions and outliers is maintained on-line. The proposed improvements are fairly easy to incorporate. All variants were compared with real and synthetic data sets. The proposed Incremental K-means variant is faster than the already quite fast Scalable K-means and finds solutions of comparable quality. In general these two algorithms find higher quality solutions than On-line K-means.

Future work includes the following. Mining associations from clusters is an interesting problem. We plan to approximate association support from clusters, avoiding scanning transactions. We want to introduce further processing using set-oriented metrics like the Jaccard coefficient or association support. Some of our improvements apply to continuous data, but that is a harder problem for a variety of reasons. We believe a further acceleration of Incremental K-means is not possible unless approximation, randomization or sampling are used.

Acknowledgements
Special thanks to Chris Jermaine, from the University of Florida, whose comments improved the presentation of this article.

7. REFERENCES
[1] C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park. Fast algorithms for projected clustering. In ACM SIGMOD Conference, 1999.

[2] C. Aggarwal and P. Yu. Finding generalized projected clusters in high dimensional spaces. In ACM SIGMOD Conference, 2000.

[3] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In ACM SIGMOD Conference, 1998.

[4] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD Conference, pages 207-216, 1993.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB Conference, 1994.


[6] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In ACM KDD Conference, 1998.

[7] M. Breunig, H.P. Kriegel, P. Kroger, and J. Sander. Data bubbles: Quality preserving performance boosting for hierarchical clustering. In ACM SIGMOD Conference, 2001.

[8] R. Dubes and A.K. Jain. Clustering Methodologies in Exploratory Data Analysis, pages 10-35. Academic Press, New York, 1980.

[9] R. Duda and P. Hart. Pattern Classification and Scene Analysis, pages 10-45. J. Wiley and Sons, 1973.

[10] F. Farnstrom, J. Lewis, and C. Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51-57, June 2000.

[11] B. Fritzke. The LBG-U method for vector quantization - an improvement over LBG inspired from neural networks. Neural Processing Letters, 5(1):35-45, 1997.

[12] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: clustering categorical data using summaries. In ACM KDD Conference, 1999.

[13] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In FOCS, 2000.

[14] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In SIGMOD Conference, 1998.

[15] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In ICDE Conference, 1999.

[16] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD Conference, 2000.

[17] A. Hinneburg and D. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality. In VLDB Conference, 1999.

[18] Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3), 1998.

[19] J.B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967.

[20] A. Nanopoulos, Y. Theodoridis, and Y. Manolopoulos. C2P: Clustering based on closest pairs. In VLDB Conference, 2001.

[21] L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. Streaming-data algorithms for high quality clustering. In IEEE ICDE, 2002.

[22] C. Ordonez and E. Omiecinski. FREM: Fast and robust EM clustering for large data sets. In ACM CIKM Conference, 2002.

[23] C. Ordonez, E. Omiecinski, and N. Ezquerra. A fast algorithm to cluster high dimensional basket data. In IEEE ICDM Conference, 2001.

[24] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 1999.

[25] M. Sato and S. Ishii. On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12(2), 2000.

[26] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD Conference, 1996.


Processing Frequent Itemset Discovery Queries by Division and Set Containment Join Operators

Ralf Rantzau
Department of Computer Science, Electrical Engineering and Information Technology
University of Stuttgart
Universitätsstraße 38, 70569 Stuttgart, Germany

ABSTRACT
SQL-based data mining algorithms are rarely used in practice today. Most performance experiments have shown that SQL-based approaches are inferior to main-memory algorithms. Nevertheless, database vendors try to integrate analysis functionality to some extent into their query execution and optimization components in order to narrow the gap between data and processing. Such database support is particularly important when data mining applications need to analyze very large datasets or when they need to access current data, not a possibly outdated copy of it.

We investigate approaches based on SQL for the problem of finding frequent itemsets in a transaction table, including an algorithm that we recently proposed, called Quiver, which employs universal and existential quantifications. This approach employs a table schema for itemsets that is similar to the commonly used vertical layout for transactions: each item of an itemset is stored in a separate row. We argue that expressing the frequent itemset discovery problem using quantifications offers interesting opportunities to process such queries using set containment join or set containment division operators, which are not yet available in commercial database systems. Initial performance experiments reveal that Quiver cannot be processed efficiently by commercial DBMSs. However, our experiments with query execution plans that use operators realizing set containment tests suggest that an efficient processing of Quiver is possible.

Keywords
association rule discovery, relational division, set containment join

1. INTRODUCTION
The frequent itemset discovery algorithms used in today's data mining applications typically employ sophisticated in-memory data structures, where the data is stored into and retrieved from flat files. This situation is driven by the need for high-performance ad-hoc data mining in areas such as health-care or e-commerce; see for example [12] for an overview. However, ad-hoc mining often comes at a high cost because huge investments into powerful hardware and sophisticated software are required.

1.1 Why Data Mining with SQL?
From a performance perspective, data mining algorithms that are implemented with the help of SQL are usually considered inferior to algorithms that process data outside the database system [21]. One of the most important reasons is that offline algorithms employ sophisticated in-memory data structures and try to scan the data as few times as possible, while SQL-based algorithms either require several scans over the data or require many and complex joins between the input tables. However, DBMSs already provide techniques to deal with huge datasets. These techniques need to be re-implemented in part if offline algorithms are used to analyze data that is so large that the in-memory data structures grow beyond the size of the computer's main memory.

One can classify data mining model generation algorithms along the following list of features:

• On-line vs. off-line mining: While the algorithm is running, the user may change parameters and retrieve intermediate or approximate results as early as possible.

• In-place vs. out-of-place mining: The algorithm retrieves original data from the database itself, or it operates on a copy of the original data, perhaps with a data layout that is optimal for the algorithm.

• Database-coupled vs. database-decoupled mining: The algorithm uses the DBMS merely as a file system, i.e., it stores and loads data using trivial queries, or it exploits the query processing capabilities of the DBMS for nontrivial mining queries.

Obviously, some algorithms lie somewhere in-between these categories. For example, there are implementations of the frequent itemset discovery method that allow user interaction only to some extent: a user can increase the minimum support threshold but maybe not decrease it, because this would require revisiting some previously seen transactions. Hence, such an implementation realizes on-line mining only up to a certain degree.

We believe that there are datasets of high potential value that are so large that the only way to analyze them in their entirety (without sampling) is to employ algorithms that exploit the scalable query processing capability of a database system, i.e., these datasets require in-place, database-coupled mining, however at the potential cost of allowing only off-line mining. On-line mining can be achieved if the query execution plans are based on non-blocking operators along the "critical paths" of the data streams from the input tables to the root operator of the plan.


A framework for such operators supporting queries within the knowledge discovery process has been investigated, for example, in [5].

1.2 The Problem of Frequent Itemset Discovery
We briefly introduce the widely established terminology for frequent itemset discovery. An item is an object of analytic interest, like a product of a shop or a URL of a document on a web site. An itemset is a set of items, and a k-itemset contains k items. A transaction is an itemset representing a fact, like a basket of distinct products purchased together or a collection of documents requested by a user during a web site visit.

Given a set of transactions, the frequent itemset discovery problem is to find itemsets within the transactions that appear at least as frequently as a given threshold, called minimum support. For example, a user can define that an itemset is frequent if it appears in at least 2% of all transactions.

Almost all frequent itemset discovery algorithms consist of a sequence of steps that proceed in a bottom-up manner. The result of the kth step is the set of frequent k-itemsets, denoted as Fk. The first step computes the set of frequent items, or 1-itemsets, F1. Each of the following steps consists of two phases:

1. The candidate generation phase computes a set of potential frequent k-itemsets from Fk−1. The new set is called Ck, the set of candidate k-itemsets. It is a superset of Fk.

2. The support counting phase retains those itemsets from Ck that appear in the given set of transactions at least as frequently as the minimum support and stores them in Fk.
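As an illustration of this two-phase scheme, a simple sketch in Python (not one of the SQL-based algorithms discussed in this paper) might read:

    def frequent_itemsets(transactions, minsup):
        # transactions: iterable of item collections; minsup: fraction in (0, 1]
        transactions = [frozenset(t) for t in transactions]
        n = len(transactions)
        items = {i for t in transactions for i in t}
        # first step: the frequent 1-itemsets F1
        F = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= minsup * n}
        result, k = set(F), 2
        while F:
            # candidate generation: build Ck from F(k-1)
            C = {a | b for a in F for b in F if len(a | b) == k}
            # support counting: a set containment test i <= t per pair
            F = {c for c in C
                 if sum(c <= t for t in transactions) >= minsup * n}
            result |= F
            k += 1
        return result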

All known SQL-based algorithms follow this "classical" two-phase approach. There are other, non-SQL-based approaches, such as frequent-pattern growth [7], which do not require a candidate generation phase. However, the frequent-pattern growth algorithm employs a (relatively complex) main-memory data structure called the frequent-pattern tree, which disqualifies it for a straightforward comparison with SQL-based algorithms.

1.3 Set Containment Tests
The set containment join (SCJ) ⋈⊆ is a join between two relations R and S on set-valued attributes R.a and S.b, the join condition being that a ⊆ b. It is considered an important operator for queries involving set-valued attributes [8; 9; 11; 15; 14; 17; 18; 25]. For example, set containment test operations have been used for optimizing a workload of continuous queries, in particular for checking if one query is a subquery of another [4]. Another application area is content-based retrieval in document databases, when one tries to find a collection of documents containing a set of keywords.

Frequent itemset discovery, one of the most important data mining operations, is an excellent example of a problem that can be expressed using set containment joins: Given a set of transactions T and a single (candidate) itemset i, how many transactions t ∈ T fulfill the condition i ⊆ t?

(a) Transaction

    transaction | items
    1001        | {A, D}
    1002        | {A, B, C, D}
    1003        | {A, C, D}

(b) Itemset

    itemset | items
    101     | {A, B, D}
    102     | {A, C}

(c) Contains

    itemset | items     | transaction | items
    101     | {A, B, D} | 1002        | {A, B, C, D}
    102     | {A, C}    | 1002        | {A, B, C, D}
    102     | {A, C}    | 1003        | {A, C, D}

Figure 1: An example set containment join: Itemset ⋈items⊆items Transaction = Contains

(a) Transaction (single dividend)

    transaction | item
    1001        | A
    1001        | D
    1002        | A
    1002        | B
    1002        | C
    1002        | D
    1003        | A
    1003        | C
    1003        | D

(b) Itemset (several divisors)

    itemset | item
    101     | A
    101     | B
    101     | D
    102     | A
    102     | C

(c) Contains (several quotients)

    transaction | itemset
    1002        | 101
    1002        | 102
    1003        | 102

Figure 2: An example set containment division: Transaction ÷{item}⊇{item} Itemset = Contains

If this number is at least the minimum support threshold, the itemset is considered frequent. In general, we would like to test a whole set of itemsets I for containment in T: find, for each itemset in I, the number of distinct transactions paired with it in I ⋈⊆ T.
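As a plain illustration of these semantics (ours, not an implementation from the literature), the following Python sketch computes the SCJ of the example relations of Figure 1 with a naive nested loop:

    def set_containment_join(itemsets, transactions):
        # Emit (itemset, transaction) whenever the itemset's items
        # are contained in the transaction's items.
        return [(i_id, t_id)
                for i_id, i in itemsets.items()
                for t_id, t in transactions.items()
                if i <= t]

    # The data of Figure 1:
    transactions = {1001: {'A', 'D'}, 1002: {'A', 'B', 'C', 'D'}, 1003: {'A', 'C', 'D'}}
    itemsets = {101: {'A', 'B', 'D'}, 102: {'A', 'C'}}
    print(set_containment_join(itemsets, transactions))
    # [(101, 1002), (102, 1002), (102, 1003)] -- the Contains relation of Figure 1(c)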

The set containment join takes unnormalized relations as input, i.e., the relations are not in the first normal form (1NF), which would require that the attributes have only atomic values. We have recently introduced the set containment division (SCD) operator ÷⊇, which is equivalent to SCJ except that the relations are in 1NF [20]. It is a generalization of the well-known relational division operator. Given two relations R(A ∪ B) and S(C ∪ D), where A = {a1, ..., am}, B = {b1, ..., bn}, C = {c1, ..., cn}, and D = {d1, ..., dp} are sets of attributes, and the attribute domains of bi are compatible with those of ci for all 1 ≤ i ≤ n. The sets A, B, and C have to be nonempty, while D may be empty. The set containment division is defined by the algebra expression

R ÷B⊇C S = ⋃x∈πD(S) ((R ÷ πC(σD=x(S))) × (x)) = T(A ∪ D).

Note that the superset symbol (⊇) is used to contrast SCD with the traditional division operator and to show its similarity to the SCJ operator, which has a subset relation (⊆). It does not mean that SCD compares set-valued attributes, unlike SCJ.


If D = ∅, i.e., if S consists of a single group or "set" of elements, then the set containment division becomes the normal division operator:

R ÷B⊇C S = R ÷ S = T(A),

where R is called the dividend, S the divisor, and T the quotient.

To compare the two operators SCJ and SCD, Figures 1 and 2 sketch a simple example that uses equivalent input data for both operators. SCJ is based on unnormalized data, while SCD requires input relations in 1NF. Each tuple of the result relation Contains delivers the information on which itemset is contained in which transaction.
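The following Python sketch (again only our illustration, not the operator implementation of [20]) spells out the SCD definition on the 1NF data of Figure 2: the 1NF tuples are grouped into sets, and one ordinary relational division is performed per divisor group, with the results united.

    def set_containment_division(dividend, divisor):
        # Group the 1NF tuples into sets, keyed by transaction resp. itemset id.
        r = {}
        for t_id, item in dividend:            # R(transaction, item)
            r.setdefault(t_id, set()).add(item)
        s = {}
        for i_id, item in divisor:             # S(itemset, item)
            s.setdefault(i_id, set()).add(item)
        # One ordinary relational division per divisor group, united: a
        # transaction is a quotient for itemset i iff it contains all of i's items.
        return [(t_id, i_id)
                for i_id, i in s.items()
                for t_id, t in r.items()
                if i <= t]

    # The data of Figure 2; yields the Contains relation of Figure 2(c).
    dividend = [(1001, 'A'), (1001, 'D'), (1002, 'A'), (1002, 'B'), (1002, 'C'),
                (1002, 'D'), (1003, 'A'), (1003, 'C'), (1003, 'D')]
    divisor = [(101, 'A'), (101, 'B'), (101, 'D'), (102, 'A'), (102, 'C')]
    print(set_containment_division(dividend, divisor))
    # [(1002, 101), (1002, 102), (1003, 102)]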

In this paper, we argue that if SQL allowed expressing SCJ and SCD problems in an intuitive manner, and if several algorithms implementing these operators were available in a DBMS, this would greatly facilitate the processing of queries for frequent itemset discovery.

1.4 Paper Overview

The remainder of this paper is organized as follows. In Section 2, we discuss frequent itemset algorithms that employ SQL queries. Query execution plans that realize the queries are discussed in Section 3. Section 4 explains algorithms for the SCJ and SCD operators and their implementation based on an open source Java class library for building query processors. Several experiments that assess the performance of the algorithms based on both synthetic and real-life datasets are presented in Section 5. We conclude the paper in Section 6 and give a brief outlook on future work.

2. FREQUENT ITEMSET DISCOVERY WITH SQL

Algorithms for deriving association rules with SQL have been studied in great detail in the past. Frequent itemset discovery, a preprocessing phase of association rule discovery, is more time-consuming than the subsequent rule generation phase [2]. Therefore, all association rule discovery algorithms described in the literature strive to optimize the frequent itemset generation phase.

2.1 Related Work

The SETM algorithm is the first SQL-based approach [10] described in the literature. Subsequent work suggested improvements of SETM. For example, in [24] views are used instead of some of the tables employed in SETM. The authors also suggest a reformulation of SETM using subqueries. The performance of SETM on a parallel DBMS has been studied in [16]. The results have shown that SETM does not perform well on large datasets, and new approaches have been devised, for example K-Way-Join, Three-Way-Join, Subquery, and Two-Group-Bys [22]. These new algorithms differ only in the statements used for support counting. They use the same statement for generating Ck, as shown in Figure 3 for the example value k = 4. The statement creates a new candidate k-itemset by exploiting the fact that all of its k subsets of size k−1 have to be frequent. This condition is called the Apriori property because it was originally introduced in the Apriori algorithm [2; 13]. Two frequent subsets are picked to construct a new candidate. These itemsets must have the same items from position 1 to k−1. The new candidate is then constructed by adding the kth items of both itemsets in lexicographically ascending order.

INSERT
INTO C4 (itemset, item1, item2, item3, item4)
SELECT newid(), item1, item2, item3, item4
FROM (
    SELECT a1.item1, a1.item2, a1.item3, a2.item3 AS item4
    FROM F3 AS a1, F3 AS a2, F3 AS a3, F3 AS a4
    WHERE a1.item1 = a2.item1 AND
          a1.item2 = a2.item2 AND
          a1.item3 < a2.item3 AND
          -- Apriori property.
          -- Skip item1.
          a3.item1 = a1.item2 AND
          a3.item2 = a1.item3 AND
          a3.item3 = a2.item3 AND
          -- Skip item2.
          a4.item1 = a1.item1 AND
          a4.item2 = a1.item3 AND
          a4.item3 = a2.item3
) AS temporary;

Figure 3: Candidate generation phase in SQL-92 for k = 4. Such a statement is used by all known algorithms that have a horizontal table layout.

In addition, the statement checks whether the k − 2 remaining subsets of the new candidates are frequent as well, expressed by the two "skip item" predicates in Figure 3.

Another well-known approach presented in [22], called K-Way-Join, uses k instances of the transaction table and joins them with each other and with a single instance of Ck. The support counting phase of K-Way-Join is illustrated in Figure 4(a), where we contrast it with an equivalent approach using a vertical table layout in Figure 4(b) that is similar to our Quiver approach, discussed below.

The algorithms presented in [22] perform differently for different data characteristics. The authors report that Subquery is the best algorithm overall compared to the other approaches based on SQL-92. The reason is that it exploits common prefixes between candidate k-itemsets when counting the support.

More recently, an approach called Set-oriented Apriori has been proposed [23]. The authors argue that too much redundant computation is involved in each support counting phase. They claim that it is beneficial to save the information about which item combinations are contained in which transaction, i.e., Set-oriented Apriori generates an additional table Tk(transaction, item1, ..., itemk) in the kth step of the algorithm. The algorithm derives the frequent itemsets by grouping on the k items of Tk, and it generates Tk+1 using Tk. Their performance results have shown that Set-oriented Apriori performs better than Subquery, especially for high values of k.

2.2 The Quiver Algorithm

We have recently introduced a new SQL-based algorithm for deriving frequent itemsets called Quiver (QUantified Itemset discoVERy) [19]. The basic idea of the algorithm is to use a vertical layout for representing itemsets in a relation in the same way as transactions are represented, i.e., the relations storing candidate and frequent itemsets have the same schema (itemset, pos, item), similar to the schema for transactions T(transaction, item). Table 1 (taken from [19]) illustrates the difference between the horizontal and the vertical layout for transactions and itemsets.

Quiver's SQL statements employ universal quantification in both the candidate generation and the support counting phase.


INSERT
INTO S4 (itemset, support)
SELECT c1.itemset, COUNT(*)
FROM C4 AS c1,
     T AS t1, T AS t2, T AS t3, T AS t4
WHERE c1.item1 = t1.item AND
      c1.item2 = t2.item AND
      c1.item3 = t3.item AND
      c1.item4 = t4.item AND
      t1.transaction = t2.transaction AND
      t1.transaction = t3.transaction AND
      t1.transaction = t4.transaction
GROUP BY c1.itemset
HAVING COUNT(*) >= @minimum_support;

INSERT
INTO F4 (itemset, item1, item2, item3, item4)
SELECT c1.itemset, c1.item1, c1.item2, c1.item3, c1.item4
FROM C4 AS c1, S4 AS s1
WHERE c1.itemset = s1.itemset;

(a) Original, horizontal version of K-Way-Join

INSERT
INTO S4 (itemset, support)
SELECT c1.itemset, COUNT(*)
FROM C4 AS c1, C4 AS c2, C4 AS c3, C4 AS c4,
     T AS t1, T AS t2, T AS t3, T AS t4
WHERE c1.itemset = c2.itemset AND
      c1.itemset = c3.itemset AND
      c1.itemset = c4.itemset AND
      t1.transaction = t2.transaction AND
      t1.transaction = t3.transaction AND
      t1.transaction = t4.transaction AND
      c1.item = t1.item AND
      c2.item = t2.item AND
      c3.item = t3.item AND
      c4.item = t4.item AND
      c1.pos = 1 AND
      c2.pos = 2 AND
      c3.pos = 3 AND
      c4.pos = 4
GROUP BY c1.itemset
HAVING COUNT(*) >= @minimum_support;

INSERT
INTO F4 (itemset, pos, item)
SELECT c1.itemset, c1.pos, c1.item
FROM C4 AS c1, S4 AS s1
WHERE c1.itemset = s1.itemset;

(b) Vertical version of K-Way-Join

Figure 4: Support counting phase of K-Way-Join for k = 4

Quiver constructs a candidate (k + 1)-itemset i from two frequent k-itemsets f1.itemset and f2.itemset by checking whether for all positions 1 ≤ f1.pos = f2.pos = i.pos ≤ k − 1 the condition f1.item = f2.item = i.item holds. This is the prefix construction used in the Apriori algorithm mentioned in Section 2.1. Similar predicates are added to the WHERE clause of the SQL query to realize the Apriori property, i.e., the "skip item" predicates mentioned in the SQL query used for retrieving horizontal candidates in Figure 3. The SQL statement used for this phase is quite lengthy; therefore, we do not present it in this paper. It is shown in [19], together with an equivalent query in tuple relational calculus, to emphasize the use of universal quantification (the universal quantifier "∀").

The frequent itemset counting phase of Quiver uses universal quantification as well. It is used to express the set containment test: for each candidate itemset i, find the number of transactions t such that there is a value t.item = i.item for all values of i.pos. Note that this condition holds for itemsets of arbitrary size k; no further restriction involving the parameter k is necessary. Hence, the same SQL statement, depicted in Figure 5(a), can be used for any iteration of the frequent itemset discovery algorithm. The nested NOT EXISTS predicate realizes the universal quantifier.

Unfortunately, there is no universal quantifier defined in the SQL standard, not even in the upcoming standard SQL:2003.

horizontal layout (single-row/multi-column):

    Transactions:
    transaction | item1 | item2 | item3
    1001        | A     | C     | NULL
    1002        | A     | B     | C

    Itemsets:
    itemset | item1 | item2
    101     | A     | B
    102     | A     | C

vertical layout (multi-row/single-column):

    Transactions:
    transaction | item
    1001        | A
    1001        | C
    1002        | A
    1002        | B
    1002        | C

    Itemsets:
    itemset | pos | item
    101     | 1   | A
    101     | 2   | B
    102     | 1   | A
    102     | 2   | C

Table 1: Table layout alternatives for storing the items of transactions and itemsets

It is difficult for an optimizer to recognize that the query for support counting in Figure 5(a) is actually a division problem. Therefore, we suggest a set containment division operator in SQL, whose syntax is illustrated in Figure 5(b). Its semantics is equivalent to that of Figure 5(a).

Note that we cannot simply write

T AS t1 SET_CONTAINMENT_DIVIDE BY C AS c1

because the layout of the divisor table C(itemset, pos, item) has the attribute pos, the lexicographical position of an item within an itemset, which is needed when candidates are created and which would incorrectly lead to grouping the divisor table on the attributes (itemset, pos) instead of (itemset). Hence, we omit this attribute in the SELECT clause of the subquery.

3. QUERY EXECUTION STRATEGIES

In this section, we show query execution plans (QEPs) produced by a commercial DBMS for some example SQL statements of the SQL-based frequent itemset discovery algorithms K-Way-Join and Quiver. In addition, we show how to realize the Quiver query for frequent itemset counting using an SCD operator.

We compare Quiver to K-Way-Join because of their structural similarity. Remember that Quiver uses a vertical table layout for both itemsets and transactions, while K-Way-Join uses a horizontal table layout for itemsets and a vertical layout for transactions. We are aware of the fact that K-Way-Join is not the best algorithm based on SQL-92 overall, even though our preliminary tests with a synthetic dataset have shown the opposite (see Section 5). However, we decided to derive first results by comparing these two approaches. Further performance experiments with other approaches will follow.

In Figures 6(a) and (b), we sketch the query execution plans that we derived during performance experiments with Microsoft SQL Server 2000 for several SQL algorithms discussed in Section 2.1. The plans (a) and (b) illustrate the execution strategies chosen by the optimizer for a given transaction table T and a candidate table C4 for the queries shown in Figures 4(a) and 5(a). The operators in use are sort, row count, retrieval of the first row only (Top1), rename (ρ), group (γ), join (⋈), left anti-semi-join (▷), index scan (IScan), and index lookup (ISeek).

The QEP (a) for the K-Way-Join query uses four hash-joins to realize the Apriori trick. It joins the candidate table C4 four times with the transaction table.

The QEP (b) for the Quiver query using quantifications employs anti-semi-joins. A left anti-semi-join returns all rows of the left input that have no matching row in the right input.


INSERT
INTO S (itemset, support)
SELECT itemset, COUNT(DISTINCT transaction) AS support
FROM (
    SELECT c1.itemset, t1.transaction
    FROM C AS c1, T AS t1
    WHERE NOT EXISTS (
        SELECT *
        FROM C AS c2
        WHERE NOT EXISTS (
            SELECT *
            FROM T AS t2
            WHERE NOT (c1.itemset = c2.itemset) OR
                  (t2.transaction = t1.transaction AND
                   t2.item = c2.item)))
) AS Contains
GROUP BY itemset
HAVING support >= @minimum_support;

INSERT
INTO F (itemset, pos, item)
SELECT c1.itemset, c1.pos, c1.item
FROM C AS c1, S AS s1
WHERE c1.itemset = s1.itemset;

(a) Quiver using quantifications

INSERT
INTO S (itemset, support)
SELECT c1.itemset, COUNT(*)
FROM T AS t1 SET_CONTAINMENT_DIVIDE BY (
    SELECT itemset, item
    FROM C
) AS c1
ON (t1.item = c1.item)
GROUP BY c1.itemset
HAVING COUNT(*) >= @minimum_support;

INSERT
INTO F (itemset, pos, item)
SELECT c1.itemset, c1.pos, c1.item
FROM C AS c1, S AS s1
WHERE c1.itemset = s1.itemset;

(b) Quiver using a set containment division operation, specified by hypothetical SQL keywords

Figure 5: Support counting phase of Quiver for any value of k. We write C (F) instead of Ck (Fk) for the candidate (frequent) itemset table because of the independence of the parameter k.

The top left anti-semi-join operator tests, for each combination of values (c1.itemset, t1.transaction) on the left, whether at least one row can be retrieved from the right input. If none can be, the combination qualifies for the subsequent grouping and counting; otherwise, the left row is skipped. Similar processing is done for the left anti-semi-join with the outer reference c2.item. An interesting point to note is that the index scan (not the seek) of t2 on (item, transaction) is uncorrelated. Every access to this table is used to check whether the transaction table is empty or not. For our problem of frequent itemset discovery, this table is non-empty by default.

In addition to the QEPs that have been derived for real SQL queries using a commercial DBMS, we illustrate a hypothetical QEP for the version of the Quiver algorithm that employs a set containment division operator in Figure 6(c). The corresponding query using hypothetical SQL keywords is specified in Figure 5(b).

4. ALGORITHMS FOR SCD AND SCJ

The term "hash" specified for the set containment division operator in the QEP in Figure 6(c) was used to illustrate that there may be several implementations (physical operators) in a DBMS realizing SCD, in this example based on the hash-division algorithm [6]. Since SCD is based on the division operator, as shown by the definition in Section 1.3, many implementations of division come into consideration for realizing SCD, depending on the current data characteristics.

[Figure 6: Example query execution plans computing F4, generated for the SQL queries in Figures 4(a), 5(a), and 5(b): (a) original, horizontal version of K-Way-Join with hash-joins; (b) Quiver with nested-loops left anti-semi-joins; (c) Quiver with hash-based SCD.]

For example, the data may be grouped or even sorted on certain attributes, as discussed in [20].

The SCD operator that we used for the performance tests is based on the hash-division algorithm [6]. Hash-division uses two hash tables. The divisor hash table stores all rows of the divisor table (i.e., all rows of a single divisor group for SCD) and assigns to each row an integer number that is equal to k − 1 for the kth row. The quotient hash table stores all distinct values of the dividend's quotient attributes


(i.e., attribute set A in Section 1.3). These values are a superset of the final quotients, i.e., they are quotient candidates. For each quotient candidate value, a bitmap is kept whose length is the number of divisor rows. For each divisor group, the dividend table is scanned once. For each dividend row, the algorithm looks up the row's divisor value in the divisor hash table, yielding an index i into the quotient bitmap. Then, the quotient candidate is looked up in the quotient hash table. If it is found, the bit at position i is set to true; otherwise, a new quotient candidate and bitmap are inserted, with the bit at position i set to true and all other bits set to false. After the scan, those quotient candidates in the quotient hash table whose bits are all true are returned.
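A minimal Python rendering of this description may make the bookkeeping concrete. This is only our sketch of hash-division for a single divisor group; it omits all of the hash-table engineering of [6], and the names are illustrative.

    def hash_division(dividend, divisor):
        # Divisor hash table: divisor row value -> bit position 0..n-1
        # (the text's "integer number k-1 for the kth row").
        divisor_index = {row: pos for pos, row in enumerate(divisor)}
        n = len(divisor_index)
        # Quotient hash table: quotient candidate -> bitmap over divisor rows.
        bitmaps = {}
        for quotient, value in dividend:       # one scan of the dividend
            pos = divisor_index.get(value)
            if pos is None:
                continue                       # row matches no divisor element
            bits = bitmaps.setdefault(quotient, [False] * n)
            bits[pos] = True
        # Return the candidates whose bits are all true.
        return [q for q, bits in bitmaps.items() if all(bits)]

    # Divide the transactions of Figure 2(a) by the single itemset 101 = {A, B, D}:
    dividend = [(1001, 'A'), (1001, 'D'), (1002, 'A'), (1002, 'B'), (1002, 'C'),
                (1002, 'D'), (1003, 'A'), (1003, 'C'), (1003, 'D')]
    print(hash_division(dividend, ['A', 'B', 'D']))   # [1002]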

We have realized the QEPs shown in Figure 6 using the open source Java class library XXL (eXtensible and fleXible Library for data processing) for building query processors [3]. Some example classes are BTree, Buffer, HashGrouper, Join, and Predicate. Interestingly, XXL provides a class called SortBasedDivision that implements the merge-division algorithm. However, several operators necessary for our purposes, like nested-loops anti-semi-join and hash-join, had to be built from scratch.

Several algorithms for SCJs have been proposed, as mentioned in Section 1.3. It may seem curious that we only show a QEP and experiments using Quiver realized with our home-made SCD instead of the better-known SCJ operator. In fact, we have realized an implementation of SCJ using the Adaptive Pick-and-Sweep Join algorithm [15], which is claimed to be the most efficient SCJ algorithm to date, and made initial performance tests, presented in Section 5.2. However, we cannot yet report on a thorough comparison of different implementations of SCD and SCJ, as this is part of our ongoing work. Note that an interesting recent publication [11] on SCJs did not investigate this algorithm.

5. EXPERIMENTS

5.1 Commercial DBMS

In Figure 7, we highlight some results of experiments on a commercial database system to assess the query execution plans and performance of the SQL-based algorithms K-Way-Join, Subquery, Set-oriented Apriori, and Quiver using quantifications. We used Microsoft SQL Server 2000 Standard Edition on a 4-CPU Intel Pentium-III Xeon PC with 900 MHz, 4 GB main memory, and Microsoft Windows 2000 Server. Due to lack of space, we cannot give the details of the indexes provided on the tables and which primary keys were chosen. However, it was surprising that K-Way-Join performed best for this (admittedly small) dataset, contrary to the reports in related work mentioned in Section 2.1. We provide more information on this comparison in [19].

5.2 XXL Query Execution Plans

In addition to the experiments on a commercial DBMS, we have executed the manual implementation of the QEPs with Java and XXL, using one synthetic and one real-life dataset as the transaction table. Table 2 summarizes the characteristics of the datasets. The synthetic data has been produced using the SDSU Java Library, based on the well-known IBM data generation tool mentioned in [2]. The real-life transaction dataset, called BMS-WebView-2 by Blue Martini Software [26], contains several months of click-stream data of an e-commerce web site.

[Figure 7: Experiments with SQL-based algorithms on a commercial DBMS for the synthetic dataset T5.I5.D10k with minimum support of 1% (100 transactions): (a) dataset characteristics (numbers of candidate and frequent k-itemsets for k = 1, ..., 10); (b) execution times of Quiver with quantifications, Set-oriented Apriori, Subquery, and K-Way-Join. Note the logarithmic scale of the y-axis.]

Type of data | Dataset       | Distinct items | Rows    | Transactions | Avg. trans. size | Max. trans. size
synthetic    | T5.I5.D5k     | 86             | 30,268  | 5,000        | 6.05             | 18
real-life    | BMS-WebView-2 | 3,340          | 358,278 | 77,512       | 4.62             | 161

Table 2: Overview of transaction datasets

5.2.1 Set Containment Division

The plans were executed on a dual-CPU Intel Pentium-III PC with 600 MHz, 256 MB main memory, Microsoft Windows 2000 Personal Edition, and JRE 1.4.1. The data resided on a single local hard disk.

We used subsets of the transaction datasets and subsets of the candidate 4-itemsets to derive frequent 4-itemsets for certain minimum support values. The synthetic (real-life) data candidates were generated for a minimum support of 1% (0.04%). Figure 8 shows some results of our experiments. One can observe that the execution plan using SCD performed best only for small numbers of candidate sets. Of course, this result is disappointing, but it is simply the result of our straightforward implementation of set containment division. Our implementation follows strictly the logical operator's definition, which is a union of several divisions: each division takes the entire transaction table as dividend and a single itemset group as divisor. We plan to develop more efficient implementations of the operator in the future.



[Figure 9: Experiments with set containment division for two types of data sets: execution time (s) versus number of candidates for the synthetic and the real-life dataset.]

The plan chosen by the commercial DBMS based on anti-semi-joins was always a bad solution for the given datasets. K-Way-Join outperforms the other approaches for larger numbers of candidates.

5.2.2 Set Containment Join

In addition to the experiments comparing different query execution strategies, we have made initial performance tests using a set containment join operator instead of a set containment division, illustrated in Figure 9. We used the adaptive pick-and-sweep join algorithm. The query execution plan that we implemented using XXL is the same as in Figure 6(c), but with an SCJ operator instead of the hash-based SCD. In the nested layout, the sets of items were represented by a Java vector of XXL tuples having a single item attribute. The test platform was a single-CPU Intel Pentium-4 PC with 1.9 GHz, 512 MB main memory, Microsoft Windows XP Professional SP 1, and JRE 1.4.1. The data resided on a single local hard disk. The experiments are based on the synthetic dataset T5.I5.D10k and a subset of 10,000 transactions of the real-life dataset BMS-WebView-2. The candidate itemsets are similar to those used for the previous experiments. For these small datasets, the operator processed the data with linear performance for a growing number of candidate 4-itemsets.

6. CONCLUSIONS AND FUTURE WORK

We have shown that frequent itemset discovery is a prominent example of queries involving set containment tests. These tests can be realized by efficient algorithms for set containment join when the input data have set-valued attributes, or by set containment division when the data are in 1NF. A DBMS would have more options to solve the frequent itemset discovery problem optimally if it could choose to employ such operators inside the execution plans for some given data characteristics. No commercial DBMS offers an implementation of these operators to date. We believe that such support would make data mining with SQL more attractive.

We are currently realizing query execution plans using algorithms for set containment joins, and we are comparing them to plans involving set containment division. In future work, we will also investigate algorithms based on index structures supporting efficient containment tests, in particular the algorithms discussed in [8] and [11]. Furthermore, we will continue to study query processing techniques for switching between a vertical and a horizontal layout transparently (without changing the SQL statements) for data mining problems.

Note that the investigation of the trade-offs of horizontal and vertical data representations is not new. For example, the benefit of employing a vertical layout for querying e-commerce data has been observed in [1]. The techniques described there are generally applicable, and we will study them in the context of the frequent itemset discovery problem.

7. ACKNOWLEDGMENTS

Daniel Bock's support of the performance experiments was greatly appreciated.

8. REFERENCES

[1] R. Agrawal, A. Somani, and Y. Xu. Storage and Querying of E-Commerce Data. In Proceedings VLDB, Rome, Italy, pages 149-158, September 2001.

[2] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings VLDB, Santiago, Chile, pages 487-499, September 1994.

[3] J. V. d. Bercken, B. Blohsfeld, J.-P. Dittrich, J. Kramer, T. Schafer, M. Schneider, and M. Seeger. XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries. In Proceedings VLDB, Rome, Italy, pages 39-48, September 2001.

[4] J. Chen and D. DeWitt. Dynamic Re-grouping of Continuous Queries. In Proceedings VLDB, Hong Kong, China, pages 430-441, August 2002.

[5] M. Gimbel, M. Klein, and P. Lockemann. Interactivity, Scalability and Resource Control for Efficient KDD Support in DBMS. In Proceedings DTDM, Prague, Czech Republic, pages 37-50, March 2002.

[6] G. Graefe and R. Cole. Fast Algorithms for Universal Quantification in Large Databases. TODS, 20(2):187-236, 1995.

[7] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.

[8] S. Helmer. Performance Enhancements for Advanced Database Management Systems. PhD thesis, University of Mannheim, Germany, December 2000.

[9] S. Helmer and G. Moerkotte. Compiling Away Set Containment and Intersection Joins. Technical Report, University of Mannheim, Germany.

[10] M. Houtsma and A. Swami. Set-oriented Data Mining in Relational Databases. DKE, 17(3):245-262, December 1995.

[11] N. Mamoulis. Efficient Processing of Joins on Set-valued Attributes. In Proceedings SIGMOD, San Diego, California, USA, June 2003.

[12] W. Maniatty and M. Zaki. A Requirements Analysis for Parallel KDD Systems. In Proceedings HIPS, Cancun, Mexico, pages 358-365, May 2000.

[13] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient Algorithms for Discovering Association Rules. In AAAI Workshop on Knowledge Discovery in Databases, Seattle, Washington, USA, pages 181-192, July 1994.


[Figure 8: Execution times of the plans in Figure 6 for different numbers of candidate 4-itemsets and for two different datasets, comparing K-Way-Join, Quiver with left anti-semi-joins, and Quiver with set containment division: (a)-(c) synthetic dataset with 4, 8, and 16 candidates; (d)-(f) real-life dataset with 4, 8, and 16 candidates. Note the logarithmic scale of both axes.]

[14] S. Melnik and H. Garcia-Molina. Divide-and-Conquer Algorithm for Computing Set Containment Joins. In Proceedings EDBT, Prague, Czech Republic, pages 427-444, March 2002.

[15] S. Melnik and H. Garcia-Molina. Adaptive Algorithms for Set Containment Joins. TODS, 28(1):56-99, March 2003.

[16] I. Pramudiono, T. Shintani, T. Tamura, and M. Kitsuregawa. Parallel SQL Based Association Rule Mining on Large Scale PC Cluster: Performance Comparison with Directly Coded C Implementation. In Proceedings PAKDD, Beijing, China, pages 94-98, April 1999.

[17] K. Ramasamy. Efficient Storage and Query Processing of Set-valued Attributes. PhD thesis, University of Wisconsin, Madison, Wisconsin, USA, 2002. 144 pages.

[18] K. Ramasamy, J. Patel, J. Naughton, and R. Kaushik. Set Containment Joins: The Good, The Bad and The Ugly. In Proceedings VLDB, Cairo, Egypt, pages 351-362, September 2000.

[19] R. Rantzau. Frequent Itemset Discovery with SQL Using Universal Quantification. In P. Lanzi and R. Meo, editors, Database Support for Data Mining Applications, volume 2682 of LNCS. Springer, 2003. To appear.

[20] R. Rantzau, L. Shapiro, B. Mitschang, and Q. Wang. Algorithms and Applications for Universal Quantification in Relational Databases. Information Systems Journal, Elsevier, 28(1):3-32, January 2003.

[21] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications. In Proceedings SIGMOD, Seattle, Washington, USA, pages 343-354, June 1998.

[22] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications. Research Report RJ 10107 (91923), IBM Almaden Research Center, San Jose, California, USA, March 1998.

[23] S. Thomas and S. Chakravarthy. Performance Evaluation and Optimization of Join Queries for Association Rule Mining. In Proceedings DaWaK, Florence, Italy, pages 241-250, August-September 1999.

[24] T. Yoshizawa, I. Pramudiono, and M. Kitsuregawa. SQL Based Association Rule Mining Using Commercial RDBMS (IBM DB2 UDB EEE). In Proceedings DaWaK, London, UK, pages 301-306, September 2000.

[25] C. Zhang, J. Naughton, D. DeWitt, Q. Luo, and G. Lohman. On Supporting Containment Queries in Relational Database Management Systems. In Proceedings SIGMOD, Santa Barbara, California, USA, May 2001.

[26] Z. Zheng, R. Kohavi, and L. Mason. Real World Performance of Association Rule Algorithms. In Proceedings SIGKDD, San Francisco, California, USA, pages 401-406, August 2001.


Efficient OLAP Operations for Spatial Data Using Peano Trees

Baoying Wang, Fei Pan, Dongmei Ren, Yue Cui, Qiang Ding, William Perrizo
Computer Science Department, North Dakota State University

Fargo, ND 58105 Tel: (701) 231-6257

{baoying.wang, fei.pan, dongmei.ren, yue.cui, qiang.ding, william.perrizo}@ndsu.nodak.edu

ABSTRACT

Online Analytical Processing (OLAP) is an important application of data warehouses. With more and more spatial data being collected, such as remotely sensed images, geographical information, and digital sky survey data, efficient OLAP for spatial data is in great demand. In this paper, we build a new data warehouse structure, the PD-cube. With the PD-cube, OLAP operations and queries can be implemented efficiently. All of this is accomplished based on the fast logical operations of Peano Trees (P-Trees∗). One of the P-tree variations, the Predicate P-tree, is used to efficiently reduce data accesses by filtering out "bit holes" consisting of consecutive 0's. Experiments show that OLAP operations can be executed much faster than with traditional OLAP methods.

Keywords Online analytical processing (OLAP), spatial data, P-Trees, PD-cube.

1. INTRODUCTION

Data warehouses (DWs) are collections of historical, summarized, non-volatile data, accumulated from transactional databases. They are optimized for Online Analytical Processing (OLAP) and have proven to be valuable for decision-making [9]. With more and more spatial data being collected, such as remotely sensed images, geographical information, and digital sky survey data, it is important to study on-line analytical processing of spatial data warehouses. The data in a warehouse are conceptually modeled as data cubes. Gray et al. introduced the data cube operator as a generalization of the SQL group-by operator [3]. The chunk-based multi-way array aggregation method for data cubes in OLAP was proposed in [4]. Typically, OLAP queries are complex and the size of the data warehouse is huge, which can cause queries to take very long to complete if executed directly on raw data. This delay is unacceptable in most data warehouse environments. Two major approaches have been proposed to process queries in a data warehouse efficiently: speeding up the queries by using index structures, and speeding up the queries by operating on compressed data. Many different index structures have been proposed for high-dimensional data [1, 2].

* Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.

Some special indexing techniques, such as the bitmap index, the join index, and the bitmap join index, have been introduced. The bitmap index uses a bit vector to represent the membership of tuples in a table. A significant advantage of the bitmap index is that complex logical selection operations can be implemented very quickly by performing bit-wise and, or, and complement operations. However, bitmap indexes are space-inefficient for high-cardinality attributes. Bitmap indexes are only suitable for narrow domains, especially for membership functions with 0 or 1 as function values. The problem associated with sparse bitmap indexes is discussed in [5]. In order to overcome this problem, improved bitmaps, e.g., the encoded bitmap [7], have been proposed. The join index gained popularity from its use in relational database query processing. Join indexing is especially useful for maintaining the relationship between a foreign key and its matching primary keys in the joinable relations. Recently, a new data structure, the Peano Tree (P-tree) [10, 11], has been introduced for spatial data. The P-tree is a lossless, quadrant-based compression data structure. It provides efficient logical operations that are well suited for multi-dimensional spatial data. One of the P-tree variations, the Predicate P-tree, is used to efficiently reduce data accesses by filtering out "bit holes" which consist of consecutive 0's. In this paper, we present a new data warehousing structure, the PD-cube, to facilitate OLAP operations and queries. Fast logical operations on P-Trees are used to accomplish these OLAP operations. Experiments show that OLAP operations can be executed much faster this way than with traditional data cube methods. This paper is organized as follows. In Section 2, we review the P-tree structure and its operations. In Section 3, we propose a new data warehouse structure, the PD-cube, and efficient OLAP operations using P-Trees. Finally, we compare our method with traditional data cube methods experimentally in Section 4 and conclude the paper in Section 5.

2. REVIEW OF PEANO TREES

A quadrant-based tree structure, called the Peano Tree (P-tree), was developed to facilitate compression and very fast logical operations on bit sequential (bSQ) data [10]. P-Trees can be 1-dimensional, 2-dimensional, 3-dimensional, etc. The most useful form of a P-tree is the Predicate P-tree, in which a 1-bit appears at those tree nodes corresponding to quadrants


for which the predicate holds. The predicate can be a particular bit position of a particular attribute or, more generally, a set of values for each of a set of attributes. The Pure-1 P-tree (P1-tree) and the Non-Pure-0 P-tree (NP0-tree) are two Predicate P-Trees. In Figure 1, a bSQ file with 64 rows is shown; the file is rearranged into 2-D Peano (Z) order on the left, and the P-Trees (P1-tree and NP0-tree) are on the right. We identify quadrants using a quadrant identifier, Qid: the string of successive sub-quadrant numbers (0, 1, 2, or 3 in Z or Peano order), separated by "." (as in IP addresses). Thus, the Qid of the bolded and underlined quadrant in Figure 1 is 2.2.

Figure 1. bSQ file, 2-D Peano order bSQ file, P1-tree, and NP0-tree

Logical and and or are the most important and most frequently used P-tree operations. By using Predicate P-tree operations, we filter out "big holes" consisting of consecutive 0's and get the mixed quadrants. Then we only load the mixed quadrants of the Peano sequence into main memory, thus reducing data accesses. The and and or operations using NP0-trees are illustrated in Figure 2.

a). NP0-tree1 b). NP0-tree2 c). AND Result d). OR Result

Figure 2. P-tree Operations: and and or

In Figure 2, a) and b) are two NP0-trees, and c) and d) are their and result and or result, respectively. From c), we know that the bits in the position ranges [0, 32] and [49, 64] are pure 0, since the bits in the quadrants with Qid 0, 1, and 3 are 0. The quadrant with Qid 2 is the non-pure-0 part, so we only need to load the quadrants with Qid 2. The logical operation results calculated in this way are exactly the same as those of anding two bit sequences, but with reduced data accesses. There are several ways to perform P-tree logical operations. The basic way is to perform the operations level by level, starting from the root level. The node operation is commutative.
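As a sketch of this level-by-level recursion (ours only; it uses a simplified quadrant-tree encoding with 0 for a pure-0 quadrant, 1 for a pure-1 quadrant, and a 4-tuple for a mixed node, rather than the exact P1/NP0 node layouts):

    def ptree_and(a, b):
        # A node is 0 (pure-0 quadrant), 1 (pure-1 quadrant),
        # or a 4-tuple of child nodes (a mixed quadrant).
        if a == 0 or b == 0:
            return 0                  # anything AND pure-0 is pure-0
        if a == 1:
            return b                  # pure-1 is the identity for AND
        if b == 1:
            return a
        kids = tuple(ptree_and(x, y) for x, y in zip(a, b))
        if all(k == 0 for k in kids):
            return 0                  # re-compress a pure-0 result
        if all(k == 1 for k in kids):
            return 1                  # re-compress a pure-1 result
        return kids

    # Two small quadrant trees and their AND:
    t1 = (1, 0, (1, 1, 0, 0), 1)
    t2 = (1, 1, 0, (1, 0, 1, 0))
    print(ptree_and(t1, t2))          # (1, 0, 0, (1, 0, 1, 0))

The or operation is symmetric, with the roles of pure-0 and pure-1 exchanged.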

3. OLAP OPERATIONS USING P-TREES

In this section, we first develop a new data warehousing structure, the PD-cube, which is a bit-wise representation of the data cube. Then we show that OLAP operations, such as slice/dice and roll-up, based on the PD-cube can be efficiently implemented using P-Trees.

3.1 PD-cube

To take advantage of the continuity and sparseness of spatial data, we develop a new data warehouse structure, the PD-cube. It is a bit-wise data cube in Peano order, which is consistent with P-Trees in terms of data storage. We partition the data cube down to the bit level in order to facilitate the fast logical operations of Predicate P-Trees. Predicate P-Trees are tree-based bitmap indexes built on quadrants. By means of the fast logical and, or, and not operations of Predicate P-Trees, PD-cubes achieve efficient data access, which is very important for massive spatial data sets.

[Figures 1 and 2 (graphics): the 64-bit bSQ file 1111110011111000111111001111111011110000111100001111000001110000, its 8x8 2-D Peano-order layout, the corresponding P1-tree and NP0-tree, and two NP0-trees together with their and and or result trees.]

[Figure 3. A Fact Table, its Data Cube and PD-cubes: (a) a 64-row fact table of yield (X, Y, T, Yield) in Peano order, with 4-bit yield values; (b) the data cube of field yield; (c) the four bit-wise PD-cubes of the attribute Yield.]


Figure 3 shows an example of a 3-D data cube representing crop yield. It has three dimensions: position x (X), position y (Y), and time (T). Since yield is a 4-bit value, we split the data cube of yield into four bit-wise cubes, which are stored in the form of the predicate P-Trees NP0 and P1.

3.2 OLAP Operations

The OLAP operations are the basis for answering spatial data warehouse questions like: "find all galaxies brighter than magnitude 22", "find the average yield of a field", or "find the total traffic flow during a period of time". Slice/dice and roll-up operations are described as follows.

3.2.1 Slice/dice

The slice operation is a selection along one dimension, while the dice operation defines a sub-cube by performing a selection on two or more dimensions. Slice and dice are accomplished very efficiently using P-Trees. Typical select statements may have a number of predicates in their "where" clause that must be combined in a Boolean manner. The predicates may include "=", "<", and ">". These predicates lead to two different query scenarios: an "=" clause results in an equal query, and "<" or ">" in a range query.

3.2.1.1 Equal Select Slice

Suppose we have a 3-D data cube representing crop yield with dimensions X, Y, and T, where X = {0, 1}, Y = {0, 1}, and T = {0, 1}. The attribute "Yield" is converted from decimal to binary. The data cube and its corresponding relational table are shown in Figure 4. From the table in Figure 4, we build six P-Trees, as shown in Figure 5, according to the P-tree construction rule: Pi,j is the P-tree for the jth bit of the ith attribute. For an attribute of m bits, there are m P-Trees, one for each bit position.

X | Y | T | Yield
0 | 0 | 0 | 111
1 | 0 | 0 | 011
0 | 1 | 0 | 001
1 | 1 | 0 | 100
0 | 0 | 1 | 011
1 | 0 | 1 | 010
0 | 1 | 1 | 100
1 | 1 | 1 | 001

Figure 4. A 3-D Cube and Binary Relational Table

P1 P2 P3 P41 P42 P43

Figure 5. P-Trees for the 3-D Cube

Suppose we want to know the yield where Y = 1. The query is "get yield where Y = 1". The selection process is: first get the P-tree mask, and then trim the P-Trees. The detailed process is as follows. First, create the P-tree mask, PM = P2. Then use PM to generate a new set of P-Trees; the new P-tree set is the result of the selection. Generating the new P-Trees is simple: just trim the nodes corresponding to the positions of pure 0 in PM (PM is P2 here). Taking P43 as an example, the trimming process is shown in Figure 6.

Figure 6. Trim Process of P-tree based on PM

After trimming the whole P-tree set, we can convert the trimmed P-tree set easily to a data cube as shown in Figure 7.

Figure 7. Result Cube of Selection
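To make the mask idea concrete, the following Python sketch (ours; names are illustrative) flattens each P-tree of Figure 5 to a plain bit vector in Peano order, ignoring compression; the equal-select slice then amounts to a bitwise and with the mask PM = P2:

    # The eight rows of the relational table in Figure 4: (X, Y, T, Yield).
    rows = [(0, 0, 0, 0b111), (1, 0, 0, 0b011), (0, 1, 0, 0b001), (1, 1, 0, 0b100),
            (0, 0, 1, 0b011), (1, 0, 1, 0b010), (0, 1, 1, 0b100), (1, 1, 1, 0b001)]

    def bitvector(pred):
        # Pack one bit per row (row 0 first, i.e., most significant bit).
        v = 0
        for x, y, t, yld in rows:
            v = (v << 1) | int(bool(pred(x, y, t, yld)))
        return v

    P2 = bitvector(lambda x, y, t, v: y == 1)      # mask PM for "Y = 1"
    P43 = bitvector(lambda x, y, t, v: v & 1)      # lowest bit of Yield
    print(f"{P2:08b}", f"{P43 & P2:08b}")          # 00110011 00100001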

3.2.1.2 Range Slice

Typical range queries for a spatial data warehouse include questions like "get crop yield within the range of X > 200 and Y > 150". We analyze the following range slice cases.


a. Get yield where Y > 1001. Given that the number of bits of the second attribute Y is fixed at 4, we can easily get the values that are greater than 1001. From Figure 8, we can see that {Y > 1001} consists of two subsets, {Y = 11**} and {Y = 101*}, where * is 1 or 0. Therefore, the query clause can be written as "where Y = 11** || Y = 101*".

Figure 8. Range Query of the Second Attribute Y > 1001

The query is retrieved by the corresponding P-tree mask PMgt = PM1 || PM2, where PM1 is the predicate P-tree for data that start with "11" and PM2 is for data that start with "101". Since Y is the second attribute of the Peano cube, we have PM1 = P21 & P22 and PM2 = P21 & P'22 & P23, where P21 is the predicate P-tree for data that have value 1 in the first bit position of the second attribute, P'22 is the predicate P-tree for data that have value 0 in the second bit position of the second attribute, and P22 and P23 follow the same rule.

b. Get yield where T > 01011001. The retrieval process for this query is shown in Figure 9. Attribute T is the third dimension of the PD-cube. The data set {T > 01011001} can be divided into a number of subsets: {1*******}, {011*****}, {010111**}, and {0101101*}, where * is 1 or 0. The retrieval process can be generalized as follows (a sketch follows the list):

Figure 9. Range Query of the Third Dimension T > 01011001

• The number of subsets is equal to the number of 0's in the bit string. The data set {T > 01011001} consists of four subsets.

• Starting with the highest bit of the string, go to the nth 0, revert it to 1, and fill the remaining positions with the wildcard *; this yields the nth subset. The first subset of {T > 01011001} is {1*******}, the second subset is {011*****}, and so on.

• Generate a mask P-tree for each subset according to its bit pattern and positions, as discussed in the previous example. For this example, PM1 = P31, PM2 = P'31 & P32 & P33, PM3 = P'31 & P32 & P'33 & P34 & P35 & P36, and PM4 = P'31 & P32 & P'33 & P34 & P35 & P'36 & P37.

• OR all mask P-Trees together to get the overall mask P-tree. For this example, PMgt = PM1 || PM2 || PM3 || PM4.
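The subset decomposition in the second bullet above can be written down directly. The following Python sketch (our illustration; the function name is hypothetical) enumerates the prefix patterns for a given threshold; each pattern then becomes an and of bit P-trees (a 1 at position j selects P3j, a 0 selects P'3j), and the patterns are ored together.

    def range_patterns(bits):
        # Decompose {T > bits} into one prefix pattern per 0-bit in the
        # threshold: keep the prefix, flip that 0 to 1, wildcard the rest.
        patterns = []
        for pos, b in enumerate(bits):
            if b == '0':
                patterns.append(bits[:pos] + '1' + '*' * (len(bits) - pos - 1))
        return patterns

    print(range_patterns('01011001'))
    # ['1*******', '011*****', '010111**', '0101101*']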

c. Combination of Equal Query and Range Query

It is often the case that the "where" clause has ">=". For example, "get yield where T >= 01011001". We can easily divide the data set {T >= 01011001} into two subsets, {T > 01011001} and {T = 01011001}. Therefore, the mask P-tree is PMge = PMgt || PMeq, where PMgt is related to the range query and PMeq is related to the equal query.

For a "where" clause like "01111011 > T > 01011001", we need both conditions {01111011 > T} and {T > 01011001} to hold, so the mask P-tree is PMgl = PMgt & PMlt.

d. Complement of Range Query

We have just analyzed "where" clauses with ">" or ">=", but we also want "<" or "<=". For example, "get yield where T <= 01011001". Obviously, the data set {T <= 01011001} is the complement of {T > 01011001}. With the result of the query "get yield where T > 01011001", we can easily retrieve the query "get yield where T <= 01011001" by taking the complement, i.e., PMle = PM'gt. The dice operation is similar to the slice operation; dice merely involves more than one dimension, while slice is on one dimension. An example of dice is "get yield where T <= 01011001 and Y > 1001". We just need to and the P-tree masks of the slices on dimensions T and Y to get the final P-tree for the dice.


3.2.2 Rollup

The PD-cube is stored in Peano order rather than in raster order. Therefore, the rollup of the PD-cube is accomplished differently from that of the traditional data cube, as a result of the different storage models. Following the Peano storage of the PD-cube, we develop the rollup algorithm in Figure 11. With this algorithm, we can easily calculate aggregations along any dimension.

The rollup process we have discussed so far is for one cube. As described in Section 3.1, PD-cubes are bit-wise data cubes. The attribute "Yield" in Figure 4 is composed of three PD-cubes (see Figure 10).

Figure 10. PD-cubes for Attribute “Yield”

Figure 11. Rollup Algorithm for PD-cube

Suppose we want to aggregate "Yield" along dimension T (pointing down). First we need to roll up each cube using the algorithm in Figure 11 and get the results S2[], S1[], and S0[], as shown in Figure 12. With the rollup results of each cube, S2, S1, and S0, we can get the overall rollup result S as follows: S[i] = S2[i] × 2² + S1[i] × 2¹ + S0[i] × 2⁰. For i = 1, S[1] = 1 × 2² + 1 × 2¹ + 2 × 2⁰ = 8. Similarly, we get S[2] = 7, S[3] = 7, and S[4] = 7.
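The recombination of the per-bit rollup results is easy to check in code. The following Python sketch (ours, with illustrative names) applies S[i] = S2[i]·2² + S1[i]·2¹ + S0[i]·2⁰ to the worked value above:

    def combine(per_bit_sums):
        # per_bit_sums[j][i] is the rollup of bit-cube j at cell i,
        # highest-order bit first; each contributes with weight 2^j.
        nbits = len(per_bit_sums)
        return [sum(s[i] << (nbits - 1 - j) for j, s in enumerate(per_bit_sums))
                for i in range(len(per_bit_sums[0]))]

    # For the example cell above with S2 = 1, S1 = 1, S0 = 2:
    print(combine([[1], [1], [2]]))   # [8]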

4. PERFORMANCE ANALYSIS

We have implemented the PD-cube and conducted a series of experiments to validate our performance analysis of the algorithm. Our experiments are implemented in C++ on a 1 GHz Pentium PC with 1 GB main memory. The test data includes the aerial TIFF images of the Oakes Irrigation Test Area in North Dakota. The data is prepared in five sizes: 128x128, 256x256, 512x512, 1024x1024, and 2048x2048. The data sets are available at [12]. In this experiment, we compare our algorithm with the bitmap indexed data cube method. The response time is calculated as the average CPU time over 20 queries. The goal is to validate our claim that the PD-cube and P-Trees have high scalability in terms of data size. The experimental results are shown in Figure 13. As shown in Figure 13, when the cube size is greater than 1900 KB, our method outperforms the bitmap indexed data cube method. As the cube size increases, we can see the drastic increase in response time for the bitmap indexed data cube method. The P-tree method shows its great strength for big cube sizes. In summary, our experiments show that our algorithm, with the PD-cube data structure and the P-tree algorithm, demonstrates a significant improvement in performance and scalability over the bitmap indexed data cube method for spatial data.

[Figure 10 (graphics): the three bit-wise PD-cubes of attribute "Yield". Figure 11 (pseudocode): the rollup algorithm rollup(d, n, kd, R[], start) for the PD-cube, where d is the number of dimensions of the data cube, n the size of the data cube, kd the dimension to roll up along, R[] the result of the rollup, start the start index into R[], val the value in the data cube at the cursor position, and Skip moves the cursor forward within the Peano cube; the algorithm recurses over sub-quadrants, skipping quadrants that do not lie along the rollup dimension.]


S2[ ] = { 1, 1, 1, 0} S1 [ ] = { 0, 1, 1, 1} S0 [ ] = { 2, 1, 1, 1} S [ ] = { 8, 7, 7, 7}

Figure 12. Rollup of “Yield” along Dimension T

Figure 13. Response Time Comparison

5. CONCLUSION

In this paper, we present a general data warehousing structure, the PD-cube, to facilitate OLAP operations. The fast logical operations of P-Trees are used to accomplish these operations. Predicate P-Trees are used to find the "big holes" of consecutive 0's by performing logical operations. Only the mixed chunks need to be loaded; thus the amount of data scanned is reduced. Experiments indicate that OLAP operations using P-Trees are much faster than traditional data cube methods. One direction for future research is to extend our PD-cube into a parallel data warehouse system. This appears particularly promising, since partitioning a large cube horizontally or vertically into small cubes can improve query performance through parallelism. Besides, our method also has potential applications in other areas, such as DNA microarray and medical image analysis.

6. REFERENCES [1] K. Lin, H. V. Jagadish, C. Faloutsos. The TV-Tree: An

Index Structure for High-Dimensional Data, VLDB Journal, Vol. 7, pp. 517-542, 1995.

[2] S. Berchtold, D. Keim, H. P. Kriegel. The X-Tree: An Index Structure for High-Dimensional Data, Proc. 22nd Int. Conf. on Very Large Databases (VLDB), 1996, pp. 28-79.

[3] J. Gray, A. Bosworth, A. Layman, H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-tab, and Sub-Totals. Data Mining and Knowledge Discovery 1, 29-57, 1997

[4] Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. 1997 ACM-SIGMOD Int, Conf. Management of Data (SIGMOD’97), pages 159-170, Tucson, AZ, May 1997.

[5] P. O’Neil, G. Graefe. Multi-Table Joins Through Bitmapped Join Indices, SIGMOD Record 24(7), September 1995

[6] P. Valduriez, Join Indices, ACM TODS, 12(2), June 1987.

[7] M.C. Wu, A. Buchmann. Encoded Bitmap Indexing for Data Warehouses. Proceedings of the 14th International Conference on Data Engineering, Orlando, Florida, USA, Feb. 1998, pp. 220--270.

[8] S. Amer-Yahia, T. Johnson. Optimizing Queries on Compressed Bitmaps. VLDB 2000: 729-778

[9] E. Codd, S. Codd, C. Salley. Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate. Technical Report. 1997, available at http://www.arborsoft.com/essbase/wht_ppr/coddps.zip.

[10] W. Perrizo, Peano Count Tree Technology, Technical Report NDSU-CSOR-TR-01-1, 2001.

[11] M. Khan, Q. Ding, W. Perrizo. k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees, PAKDD 2002, Springer-Verlag, LNAI 2776, 2002, pp. 517-528.

[12] TIFF image data sets. Available at http://midas-10cs.ndsu.nodak.edu/data/images/.


[Figure 13 plot: response time (ms) versus cube size (KB), both on logarithmic scales (100 to 1,000,000 KB; 100 to 100,000 ms), comparing the Bitmap and PD-cube methods.]


Clustering Gene Expression Data in SQL Using Locally Adaptive Metrics

Dimitris Papadopoulos
University of California, Riverside
dimitris@cs.ucr.edu

Carlotta Domeniconi
George Mason University
carlotta@ise.gmu.edu

Dimitrios Gunopulos
University of California, Riverside
dg@cs.ucr.edu

Sheng Ma
IBM T.J. Watson Research Center
shengma@us.ibm.com

ABSTRACT
The clustering problem concerns the discovery of homogeneous groups of data according to a certain similarity measure. Clustering suffers from the curse of dimensionality: it is not meaningful to look for clusters in high dimensional spaces, as the average density of points anywhere in the input space is likely to be low. As a consequence, distance functions that equally use all input features may be ineffective. We introduce an algorithm that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This approach avoids the risk of loss of information encountered in global dimensionality reduction techniques. Our method associates with each cluster a weight vector, whose values capture the relevance of features within the corresponding cluster. In this paper we present an efficient SQL implementation of our algorithm that enables the discovery of clusters in data residing inside a relational DBMS.

1. INTRODUCTION
The clustering problem concerns the discovery of homogeneous groups of data according to a certain similarity measure. It has been studied extensively in the statistics [2], machine learning [13, 5], and database [20, 14, 10] communities. Clustering suffers from the curse of dimensionality in high dimensional spaces: for any given pair of points within the same cluster, there are likely to exist at least a few dimensions on which the points are far apart from each other. It is not meaningful to look for clusters in such a space, as the average density of points anywhere in the input space is likely to be low. As a consequence, distance functions that equally use all input features may be ineffective.
The problem of high dimensionality can be addressed by requiring the user to specify a subspace (i.e., a subset of dimensions) for cluster analysis. However, the identification of subspaces by the user is an error-prone process. More importantly, the correlations that identify clusters in the data are likely not to be known by the user. Indeed, we desire such correlations, and the induced subspaces, to be part of the findings of the clustering process itself.

*Supported by NSF CAREER Award 9984729, NSF IIS-9907477, and the DoD.

In order to capture the local correlations of data, a proper feature selection procedure should operate locally in the input space. In this paper we propose a soft feature selection procedure that assigns (local) weights to features according to the local correlations of data along each dimension. Dimensions along which data are loosely correlated receive a small weight, which has the effect of elongating distances along that dimension. Features along which data are strongly correlated receive a large weight, which has the effect of constricting distances along that dimension. Figure 1 gives a simple example. The left plot depicts two clusters of data elongated along the x and y dimensions. The right plot shows the same clusters, where within-cluster distances between points are computed using the respective local weights generated by our algorithm. The weight values reflect local correlations of data and reshape each cluster as a dense spherical cloud. This directional local reshaping of distances better separates clusters, and allows for the discovery of different patterns in different subspaces of the original input space.
The data-mining task of clustering can ideally be performed inside a relational DBMS. By relying on an RDBMS to manage the data, the clustering application is freed from this task. Also, implementing a clustering algorithm in SQL makes the programming task easier. Finally, since SQL is an industry standard and available in all major RDBMSs, the implementation of the clustering algorithm is portable.
In this paper we formalize the problem of finding different clusters in different subspaces. Our algorithm (Locally Adaptive Clustering, or LAC) discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This approach avoids the risk of loss of information encountered in global dimensionality reduction techniques. We give a concrete SQL implementation of our algorithm that exhibits good performance and scalability. These are the key considerations for deploying a clustering algorithm inside an RDBMS, especially when dealing with very large datasets that reside on disk, and if we want to take advantage of computer clusters.

2. RELATED WORK
Local dimensionality reduction approaches for the purpose of efficiently indexing high dimensional spaces have recently been discussed in the database literature [12, 4, 17]. In general, the efficacy of these methods depends on how the clustering problem is addressed in the first place in the original feature space.



[Two scatter plots in the (x, y) plane: Cluster0 and Cluster1 in the original input space (left), and the same clusters transformed by local weights (right).]

Figure 1: (Left) Clusters in original input space. (Right) Clusters transformed by local weights.

A potentially serious problem with such techniques is the lack of data to locally perform PCA on each cluster to derive the principal components. Moreover, for clustering purposes, the new dimensions may be difficult to interpret, making it hard to understand clusters in relation to the original space.
The problem of finding different clusters in different subspaces of the original input space has been addressed in [16]. While the work in [16] successfully introduced a methodology for looking at different subspaces for different clusters, it does not compute a partitioning of the data into disjoint groups: the reported dense regions largely overlap, since for a given dense region all its projections on lower dimensional subspaces are also dense, and they all get reported. On the other hand, for many applications such as customer segmentation and trend analysis, a partition of the data is desirable, since it provides a clear interpretability of the results.
Recently [15], another density-based projective clustering algorithm (DOC) has been proposed. This approach pursues an optimality criterion defined in terms of the density of each cluster in its corresponding subspace. A Monte Carlo procedure is then developed to approximate with high probability an optimal projective cluster.
[9] also addressed the problem of feature selection to find clusters hidden in high dimensional data. The authors search through feature subset spaces, evaluating each subset by first clustering in the corresponding subspace, and then evaluating the resulting clusters and feature subset using the chosen feature selection criterion. Dimensionality reduction is performed globally in this case; therefore, the technique in [9] is expected to be effective when a dataset contains some relevant and some irrelevant (noisy) features, across all clusters.
The problem of finding different clusters in different subspaces is also addressed in [1]. The proposed algorithm (PROjected CLUStering) seeks subsets of dimensions such that the points are closely clustered in the corresponding spanned subspaces. Both the number of clusters and the average number of dimensions per cluster are user-defined parameters. PROCLUS starts by choosing a random set of medoids, and then progressively improves the quality of the medoids through an iterative hill climbing procedure that discards the "bad" medoids from the current set. In order to find the set of dimensions that matter the most for each cluster, the algorithm selects the dimensions along which the points have the smallest average distance from the current medoid.
In contrast to the PROCLUS algorithm, our method does not require specifying the average number of dimensions to be kept per cluster: for each cluster, in fact, all features are taken into consideration, but properly weighted. The PROCLUS algorithm is more prone to loss of information if the number of dimensions is not properly chosen. For example, if the data of two clusters in two dimensions are distributed as in Figure 1 (left), PROCLUS may find that feature x is the most important for cluster 0, and feature y is the most important for cluster 1; but projecting cluster 1 along the y dimension does not allow the points of the two clusters to be properly separated. We avoid this problem by keeping both dimensions for both clusters, and properly weighting distances along each feature within each cluster. In addition, while the authors in [1] do not prove whether their algorithm converges to the chosen optimality criterion, we can formally show that our algorithm converges to a local minimum of the associated error function.
Generative approaches have also been developed for local dimensionality reduction and clustering. The approach in [11] makes use of maximum likelihood factor analysis to model local correlations between features. [18] extends the single PCA model to a mixture of local linear sub-models to capture nonlinear structure in the data; a mixture of principal component analyzers is derived as the solution to a maximum-likelihood problem, and an EM algorithm is formulated to estimate the parameters.
�I�15��1�#x�5:6Ã��6����9V��6�����85/�1�I�����6���g��U�C������z5V���k�{_57��W T �57��p�f T 5 Qª�� 3_���~S �#��W��#���� T �h�K�i�����5C�����685Y�-�Z�K�#��q�ImA��6�mw�#����I�����6��m~� T 5�6R���ygJ57����my�U�K�45768�����68�o����6��I�b�U����J57���k£9 T ���5:6sp ¨ ���5V�8�#�4����5�����mO���=�1�o�#m��+{H�b97�����-�5/��~��6��+{H�}�U�K�4576������6��i�I�5C�U���-n����g����5:� ������6 ¨ ��W����5Ã�Ã� � 5VmK�/ /�M ª�� 3O���~S �C�7k£¢J68��� T �#�mw5:�=��8�5~µZ���_� T 5~���#��M�����z�I��1�#6R�Amt����97���8�-�5V�)�U�J�#6J��mt5:�I����5~¶����� T 5q�����-�)���4�z�#��1�#6R�Mmw�#�~9V���8�-�5V���#py�H�8���8��#Å+579V���68W�97���8��57��q�#����68WY� T 5y¶}�8����5V6������6��8�R5V�16 Æ �)�I�K���:{§��Y�8��#�z5V���k}�57���I�1�#�5�z����6������m8� T 5O�?{_��9V���8�-�5V��:p��P5_�:j������y� T ���������g8��57�¥g�kyx�575:����68Wgz�#� T �8����5768��K�#6��imt�#�qgz�#� T 97�����-�5V�1�7�Q�I6J�P������z5V���k�{H57��W T ���68W�8���-�1��6�975:�4�#����6�Wl5=�I9 T mt5:�I����5�{A��� T ��6Ã5:�I9 T 97�����-�5/�:pPv&6Ã��������n�����6s�_{ T ���K5}� T 5��#��� T �#�����6¥���/�)����6��I�o�U��Ij�5}{ T 5/� T 5V�Y� T 57����#��W��#���� T �©9:��6Uj#5V�W�57�~��b� T 5C9 T ���576��#�8����C�������+k�9V����5V�����6s�d{O59:��6lmw�#����#����k�� T �:{Ì� T �=�y���8�q�I�KW#�#���� T �h97��6Rj�5V�W�5V�i����o���U9:�I�����6����y���Ç�Imd� T 5~�#����9V�K�I�5:�o5V���#�_mw�8689V�����6spL�5:6�5V�1�#���j�5_�I���������9 T 5V� T �:j�5_�#����Mgz57576q�857j#5:�����z5:�~mw�#�Q���U9:�I�8����n��5768��K�#6J�I�����?k4�5:�U��9V�����6���6���9V���8�-�5V���6�W�p f T 5��#�8����R�I9 T �K6}�����/��C�Ix�57�)�8�5��#mH�C�I�����q��� ����x�57��� T �U�U�Zmw��9V��I�i��68����k�����)�-�b���U��57����U9:�#��9:�#��57�K�I�����6��Agz5V�+{H57576bm�5:�#��U�57�:p���=¯I��57�R�576J����� T 5����68W���5l 3 �©���U��57�M��Ã�|�����U����5l�#mi���U9:�I����K6�5=�I�d�1��g8nF�4�8��57���d���97��������5H68��6�����6857�#���-����9/��8�5H��6i� T 5H���I�1�Up�������R��8�5 �#m��8���6897���J�#��97�����z�#6�576��d�#6J�I��kUÈV57������U��57�����d��5V���j�5:��#���P��#���������6£��|���C�I�����q�8�4nF����x#5:��� T �R���£�8��#g���5:�bp ��6 «�X�#��W��#���� T �Í���_mw�#��y�����I�5=�Y���57�-����C�#�5M� T 5~�8�#�1�#��5/�5V��:p

3. BICLUSTERING OF GENE EXPRESSION DATA
Microarray technology is one of the latest breakthroughs in experimental molecular biology. Gene expression data are generated by DNA chips and other microarray techniques, and they are often presented as matrices of expression levels of genes under different conditions (e.g., environments, individuals, tissues). Each row corresponds to a gene, and each column represents a condition under which the gene is developed.
Biologists are interested in finding sets of genes showing strikingly similar up-regulation and down-regulation under a set of conditions. To this extent, the concept of bicluster has recently been introduced [6]. A bicluster is a subset of genes and a subset of conditions with a high similarity score. A particular score that applies to expression data is the mean squared residue score [6]. Let I and J be subsets of genes and experiments, respectively. The pair (I, J) specifies a submatrix A_IJ with a mean squared residue score defined as follows:

  H(I, J) = (1 / (|I| |J|)) Σ_{i∈I, j∈J} (a_ij − a_iJ − a_Ij + a_IJ)²   (1)

where a_iJ = (1/|J|) Σ_{j∈J} a_ij, a_Ij = (1/|I|) Σ_{i∈I} a_ij, and a_IJ = (1 / (|I| |J|)) Σ_{i∈I, j∈J} a_ij.


These represent the row means, the column means, and the mean of the whole submatrix, respectively. The lowest score H(I, J) = 0 indicates that the gene expression levels fluctuate in unison. The aim is then to find biclusters with low mean squared residue score (below a certain threshold).
We observe that the mean squared residue score is minimized when subsets of genes and experiments (i.e., dimensions) are chosen so that the gene vectors (i.e., the rows of the resulting bicluster) are close to each other with respect to the Euclidean distance. As a result, the LAC algorithm, like other subspace clustering algorithms, is well suited to perform a simultaneous clustering of both genes and conditions in a microarray data matrix.
[19] introduces an algorithm (pCluster) for clustering similar patterns, which has been applied to DNA microarray data of a type of yeast. The pCluster model optimizes a criterion that is different from the mean squared residue score, as it looks for coherent patterns on a subset of dimensions (e.g., in an identified subspace, objects reveal larger values for the second dimension than for the first).
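To make the score concrete, the following C++ sketch evaluates H(I, J) exactly as in equation (1). It is our illustration of the score from [6], not code from the paper; the function and variable names are ours.

#include <cstddef>
#include <vector>

// Mean squared residue of the submatrix of a given by rows I and columns J.
double msr(const std::vector<std::vector<double>>& a,
           const std::vector<int>& I, const std::vector<int>& J) {
    double aIJ = 0.0;                                   // submatrix mean
    std::vector<double> aiJ(I.size(), 0.0), aIj(J.size(), 0.0);
    for (std::size_t r = 0; r < I.size(); ++r)
        for (std::size_t c = 0; c < J.size(); ++c) {
            double v = a[I[r]][J[c]];
            aiJ[r] += v; aIj[c] += v; aIJ += v;
        }
    for (double& m : aiJ) m /= J.size();                // row means a_iJ
    for (double& m : aIj) m /= I.size();                // column means a_Ij
    aIJ /= I.size() * J.size();
    double h = 0.0;
    for (std::size_t r = 0; r < I.size(); ++r)
        for (std::size_t c = 0; c < J.size(); ++c) {
            double res = a[I[r]][J[c]] - aiJ[r] - aIj[c] + aIJ;
            h += res * res;
        }
    return h / (I.size() * J.size());
}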

4. PROBLEM STATEMENT
We define what we call a weighted cluster. Consider a set of points in some space of dimensionality D. A weighted cluster C is a subset of data points, together with a vector of weights w = (w_1, ..., w_D), such that the points in C are closely clustered according to the L2 norm distance weighted using w. The component w_i measures the degree of correlation of points in C along feature i. The problem now becomes how to estimate the weight vector w for each cluster in the dataset.
In this setting, the concept of cluster is not based only on points, but also involves a weighted distance metric, i.e., clusters are discovered in spaces transformed by w. Each cluster is associated with its own w, which reflects the correlation of points within the cluster itself. The effect of w is to transform distances so that the associated cluster is reshaped into a dense hypersphere of points separated from other data.
In traditional clustering, the partition of a set of points is induced by a set of representative vectors, also called centroids or centers. The partition induced by discovering weighted clusters is formally defined as follows.

Definition 1: Given a set S of N points x in the D-dimensional Euclidean space, a set of k centers {c_1, ..., c_k}, c_j in R^D, j = 1, ..., k, coupled with a set of corresponding weight vectors {w_1, ..., w_k}, w_j in R^D, j = 1, ..., k, partition S into k sets {S_1, ..., S_k}:

  S_j = { x | (Σ_{i=1}^D w_ji (x_i − c_ji)²)^(1/2) < (Σ_{i=1}^D w_li (x_i − c_li)²)^(1/2), for all l ≠ j }   (2)

where w_ji and c_ji represent the i-th components of the vectors w_j and c_j, respectively (ties are broken randomly).
The set of centers and weights is optimal with respect to the Euclidean norm if it minimizes the error measure:

  E_1(C, W) = Σ_{j=1}^k Σ_{i=1}^D w_ji X_ji   (3)

subject to the constraints Σ_{i=1}^D w_ji² = 1, for each j. C and W are (D × k) matrices whose column vectors are c_j and w_j respectively, i.e., C = [c_1 ... c_k] and W = [w_1 ... w_k]. X_ji represents the average distance from the centroid c_j of the points in cluster j along dimension i, and is defined as follows:

  X_ji = (1 / |S_j|) Σ_{x∈S_j} (c_ji − x_i)²

The exponential function in (4) below has the effect of making the weights w_ji more sensitive to changes in X_ji, and therefore to changes in local feature relevance. This allows larger error improvements as we update the values of weights and centers, and therefore a faster computation to achieve spherically shaped clusters (separated from each other) in the space transformed by the optimal weights (see Figure 1).
In the following we present an algorithm that finds a solution (a set of centers and weights) that is a local minimum of the error function (3).

5. LOCALLY ADAPTIVE CLUSTERING ALGORITHM
This section describes our feature relevance estimation procedure, which assigns local weights to features according to the local correlation of data along each dimension. Our technique progressively improves the quality of initial centroids and weights by investigating the space near the centers in order to estimate the dimensions that matter the most, i.e., the dimensions along which local data are mostly correlated.
Specifically, we proceed as follows. We start with well-scattered points in S as the k centroids: we choose the first centroid at random, and select the others so that they are far from one another, and from the first chosen center. We initially set all weights to 1/√D. Given the initial centroids c_j, for j = 1, ..., k, we compute the corresponding sets S_j as given in (2). We then compute the average distance along each dimension from the points in S_j to c_j. Let X_ji denote this average distance along dimension i: the smaller X_ji is, the larger is the correlation of points along dimension i. We use the values X_ji in an exponential weighting scheme to credit weights to features (and to clusters):

  w_ji = exp(−h · X_ji) / ( Σ_{l=1}^D exp(−2 · h · X_jl) )^(1/2)   (4)

where h is a parameter that can be chosen to maximize (or minimize) the influence of X_ji on w_ji. When h = 0 we have w_ji = 1/√D, thereby ignoring any difference between the X_ji; on the other hand, when h is large, a change in X_ji is exponentially reflected in w_ji. We empirically determine the value of h through cross-validation in our experiments with simulated data, and set h to 9 in the experiments with real data. The exponential weighting is more sensitive to changes in local feature relevance and gives rise to better performance improvements. It is also more stable, because it prevents distances from extending infinitely in any direction, i.e., zero weight; this, however, can occur when either linear or quadratic weighting is used.
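As an illustration of the update in (4), the following C++ sketch (ours; all names are illustrative) computes the weight vector of a single cluster from its per-dimension average distances X_ji. The normalization guarantees Σ_i w_ji² = 1.

#include <cmath>
#include <cstddef>
#include <vector>

// X holds X_{ji} for one cluster j, i = 0..D-1; h is the paper's parameter.
std::vector<double> lac_weights(const std::vector<double>& X, double h) {
    double norm2 = 0.0;                       // Σ_l exp(-2 h X_{jl})
    for (double x : X) norm2 += std::exp(-2.0 * h * x);
    std::vector<double> w(X.size());
    for (std::size_t i = 0; i < X.size(); ++i)
        w[i] = std::exp(-h * X[i]) / std::sqrt(norm2);
    return w;
}

With h = 0 the function returns the uniform vector with components 1/√D, matching the limiting case discussed above.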



The weights w_ji enable us to elongate distances along the less important dimensions, i.e., dimensions along which points are loosely correlated, and, at the same time, to constrict distances along the most influential ones, i.e., features along which points are strongly correlated. Note that the technique is centroid-based, because the weightings depend on the centroid. The computed weights are used to update the sets S_j, and therefore the centroids' coordinates. The procedure is iterated until convergence is reached, i.e., until no change in the centroids' coordinates is observed. The resulting algorithm, which we call LAC (Locally Adaptive Clustering), is summarized in the following.

Input: Set S of points x in R^D, and k, the number of clusters.
1. Start with k initial centroids c_1, c_2, ..., c_k;
2. Set w_ji = 1/√D, for each centroid c_j, j = 1, ..., k, and each feature i = 1, ..., D;

3. For each centroid c_j, j = 1, ..., k, and for each data point x:
   - Set S_j = { x | j = arg min_l L_w(c_l, x) }, where L_w(c_l, x) = (Σ_{i=1}^D w_li (c_li − x_i)²)^(1/2);

4. Compute new weights. For each centroid c_j, j = 1, ..., k, and for each feature i:
   - Set X_ji = Σ_{x∈S_j} (c_ji − x_i)² / |S_j|, where |S_j| is the cardinality of set S_j; X_ji represents the average distance of the points in S_j from c_j along feature i;
   - Set w_ji = exp(−h · X_ji) / ( Σ_{l=1}^D exp(−2 · h · X_jl) )^(1/2);

5. For each centroid c_j, j = 1, ..., k, and for each data point x:
   - Recompute S_j = { x | j = arg min_l L_w(c_l, x) };

6. Compute new centroids. Set c_j = (Σ_x x · 1_{S_j}(x)) / (Σ_x 1_{S_j}(x)), for each j = 1, ..., k, where 1_S(·) is the indicator function of set S;

7. Iterate 3, 4, 5, 6 until convergence (i.e., no change in the centroids' coordinates).

The sequential structure of the LAC algorithm is analogous to the mathematics of the EM algorithm [7]. The hidden variables are the assignments of the points to the centroids. Step 3 constitutes the E step: it finds the values of the hidden variables S_j given the previous values of the parameters w_ji and c_ji. The following step (the M step) consists in finding new matrices of weights and centroids that minimize the error function with respect to the current estimation of the hidden variables. We can also prove the following [8]:

Theorem. The LAC algorithm converges to a local minimum of the error function (3).
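Steps 3 and 5 reduce to a weighted nearest-centroid assignment. A minimal C++ sketch (ours, with illustrative names) is:

#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

// Return the index j minimizing the weighted distance L_w(c_j, x);
// the square root is omitted since it is monotone.
int assign(const Point& x, const std::vector<Point>& c,
           const std::vector<Point>& w) {
    int best = 0;
    double best_d = std::numeric_limits<double>::max();
    for (std::size_t j = 0; j < c.size(); ++j) {
        double d = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i)
            d += w[j][i] * (c[j][i] - x[i]) * (c[j][i] - x[i]);
        if (d < best_d) { best_d = d; best = static_cast<int>(j); }
    }
    return best;
}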

6. IMPLEMENTING LAC IN SQL
In the implementation of LAC in SQL we use the tables shown in Table 1. Next we describe in detail each step of the algorithm presented in Section 5, expressed in SQL.
In Figure 2 we give the query that populates table S(ID, cID), where ID is the point id and cID is the cluster id. This query

INSERT INTO S
WITH T1 (ID, cID, d) AS
  (SELECT DATA.ID, C.cID,
          SQRT(W.W1*(C.X1 - DATA.X1)**2 + ... + W.WD*(C.XD - DATA.XD)**2)
   FROM DATA, C, W
   WHERE C.cID = W.cID),
T2 (ID, md) AS
  (SELECT T1.ID, MIN(d) AS md
   FROM T1
   GROUP BY T1.ID)
SELECT T2.ID, T1.cID
FROM T1, T2
WHERE T2.md = T1.d AND T2.ID = T1.ID;

Figure 2: SQL: Assign points (in DATA) to clusters (S), calculating the weighted distances on the fly

INSERT INTO CARDS
SELECT cID, COUNT(*)
FROM S
GROUP BY cID;

Figure 3: SQL: Calculate the cardinalities of the sets S_j

implements steps 3 and 5 of the LAC algorithm. The query joins the dataset table DATA with the centroids' table C and the weights' table W in order to calculate the weighted distance of each point from each cluster centroid. The result of the join goes to the temporary table T1. Then, in the second table T2, for each point we calculate the minimum distance of the point at hand from all cluster centroids. The final result (i.e., table S) is given through a join of T1 and T2. Table T1 has size kN, while T2 has size N (and so does table S), where k is the number of clusters and N is the number of points. Note that we can reduce the space used by the temporary tables by opening a cursor on tables DATA, C and W and, for each tuple, calculating the k distances and storing the point id along with the cluster id corresponding to the minimum distance directly into table S. Thus the space requirement becomes the actual size of table S, i.e., N.
Figure 3 shows the query that computes the cardinalities of the sets S_j, j = 1, ..., k. The sets are stored in table S(ID, cID). The computation is straightforward: it is a count of the rows in S, grouped by the cluster id. The result is stored in table CARDS, and the values are used in steps 4 and 6 of LAC.
In Figure 4 we give the query that computes the average distance of the points in S_j from c_j. This query implements the first part of step 4, where the values X_ji = Σ_{x∈S_j} (c_ji − x_i)² / |S_j| are computed. The result set of this query is kept in table X(cID, X1, ..., XD, SUMEXP). Column SUMEXP is calculated as a function of the D preceding columns, and is used in the second part of step 4, where the weights are computed; we chose to actually store this sum in order to avoid long expressions in the computation of the weights. The query joins tables DATA, C and S on the point ids, and presents the results by grouping the resulting rows on the cluster ids.
The second part of step 4 of LAC is presented in Figure 5. The query uses the results stored in table X. Note that



Table    Columns                      Contents
DATA     ID, X1, ..., XD              data points
W        cID, W1, ..., WD             weights
C        cID, X1, ..., XD             centroids
S        ID, cID                      sets S_j
X        cID, X1, ..., XD, SUMEXP     avg. distances X_ji
CARDS    cID, count                   cardinalities |S_j|

Table 1: Tables' descriptions

INSERT INTO X
WITH T1 (cID, S1, ..., SD) AS
  (SELECT C.cID,
          SUM((C.X1 - DATA.X1)**2), ..., SUM((C.XD - DATA.XD)**2)
   FROM C, DATA, S
   WHERE C.cID = S.cID AND S.ID = DATA.ID
   GROUP BY C.cID)
SELECT T1.cID,
       S1 / CARDS.count, ..., SD / CARDS.count, 0
FROM T1, CARDS
WHERE T1.cID = CARDS.cID;
UPDATE X
SET SUMEXP = EXP(-2*h*X.X1) + ... + EXP(-2*h*X.XD);

Figure 4: SQL: Compute the distances X_ji

INSERT INTO W
SELECT X.cID,
       EXP(-h*X.X1) / SQRT(X.SUMEXP),
       ...,
       EXP(-h*X.XD) / SQRT(X.SUMEXP)
FROM X;

Figure 5: SQL: Compute the weights (W)

parameter h is actually substituted by a scalar value during the dynamic invocation of the queries by our control program.
In Figure 6 the query that computes the centroids is presented. This is step 6 of the LAC algorithm. The query first joins tables DATA, C and S in order to compute, per cluster, the sums of the attributes of the points that belong to the same cluster. This intermediate result goes to a temporary table SUMS, which has size k. Then, tables SUMS and CARDS are joined in order to compute the final result, i.e., the coordinates of the new centroids. Note that we cannot avoid the second join (which is performed on small tables, anyway), because all the free columns that appear in a SELECT clause and are not involved in an aggregation function (SUM in our case) have to appear in the GROUP BY clause. In this case, we perform a GROUP BY on the cluster id column only, thus we cannot include column CARDS.count in the SELECT clause that creates table SUMS. Hence the need for the second join between tables SUMS and CARDS.
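Since h enters the query of Figure 5 only as a literal, a control program can splice its current value into the SQL string before each invocation. The C++ sketch below shows one way to do this under the Table 1 schema; the function name and the D parameter are ours, not the paper's.

#include <sstream>
#include <string>

// Build the Figure 5 weight query for a D-dimensional dataset,
// substituting the scalar h into each EXP(...) term.
std::string weight_query(int D, double h) {
    std::ostringstream q;
    q << "INSERT INTO W SELECT X.cID";
    for (int i = 1; i <= D; ++i)
        q << ", EXP(-" << h << "*X.X" << i << ")/SQRT(X.SUMEXP)";
    q << " FROM X";
    return q.str();
}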

7. EXPERIMENTAL EVALUATION

INSERT INTO C
WITH SUMS (cID, S1, ..., SD) AS
  (SELECT C.cID, SUM(DATA.X1), ..., SUM(DATA.XD)
   FROM DATA, C, S
   WHERE C.cID = S.cID AND S.ID = DATA.ID
   GROUP BY C.cID)
SELECT SUMS.cID,
       S1 / CARDS.count, ..., SD / CARDS.count
FROM SUMS, CARDS
WHERE CARDS.cID = SUMS.cID;

Figure 6: SQL: Compute new centroids (C)

In this section we present an experimental evaluation of the SQL implementation of LAC. The experiments were run on a desktop PC with a Pentium 4 processor and 1GB of RAM, hosted by Linux Mandrake (kernel ver. 2.4) and running DB2 v8.
We created synthetic datasets in which we embedded k clusters, with the points' coordinates spanning a fixed range on each dimension. The cardinalities N of the datasets were 100K, 500K and 1M tuples. Figure 7 shows the execution time per iteration; the algorithm converged, on average, within 7 iterations. We can see that the execution time scales linearly with the size of the database. The algorithm's error (fraction of misclassified objects over the total number of objects) was below 1% for all three datasets.
In order to evaluate the performance of LAC with respect to the dimensionality of the data, we generated a second group of datasets having dimensionality D equal to 10, 20 and 30, with the cardinality N fixed to 100K tuples. The execution times per iteration for these datasets are shown in Figure 8; the algorithm's error ranged from under 1% to about 8%.
Furthermore, in order to check the performance while varying the number of embedded clusters, datasets with an increasing number of clusters k (up to 15) were used; the cardinality N was fixed to 100K tuples and the dimensionality D was kept constant. Figure 9 depicts the execution time per iteration for these sets of data; the algorithm's error stayed below 1% (at most 0.77%).
The basic characteristic of microarray DNA expression data is the large number of conditions under which the probes are taken. In the form of a relational table, the genes correspond to rows and the conditions to columns (attributes). In order to examine the scalability of the SQL implementation of LAC on real data, we did the following.



[Figure 7. Synthetic datasets: execution time (sec) per iteration with varying number of tuples (in thousands); D and k fixed.]

[Figure 8. Synthetic datasets: execution time (sec) per iteration with varying number of dimensions; N = 100K.]

[Figure 9. Synthetic datasets: execution time (sec) per iteration with varying number of clusters; N = 100K.]

[Figure 10. Biological datasets: execution time (sec) per iteration with varying number of genes (in thousands).]

We used a DNA microarray of gene expression profiles in hereditary breast cancer (available at http://research.nhgri.nih.gov/microarray/NEJM_Supplement), which has 3226 genes and 22 conditions, and we replicated its tuples in order to obtain larger sets, of up to roughly 100K tuples. We ran LAC on these data. The results show that within two biclusters, five conditions (with ids 7, 8, 9, 19, ... in the first one, and 7, 8, 9, 10, ... in the second one) receive considerably larger weight than the others. The remaining biclusters contain fewer genes, and all their conditions receive equal weights, indicating that no correlation was found among those conditions. We observe that different combinations of conditions are selected for different biclusters, as is expected from a biological perspective. Figure 10 presents the execution time per iteration for this set of experiments; the algorithm scales linearly with respect to database size, i.e., the number of genes.

8. CONCLUSIONS
We presented an algorithm that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This method avoids the risk of loss of information encountered in global dimensionality reduction techniques. We implemented our algorithm in SQL in order to facilitate the discovery of clusters in datasets stored in an RDBMS. The experimental evaluation shows the scalability of our algorithm.

9. REFERENCES
[1] C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast Algorithms for Projected Clustering. In Proc. ACM SIGMOD, 1999.

[2] P. Arabie and L. J. Hubert. An overview of combinatorial data analysis. In Clustering and Classification. World Scientific, 1996.

[3] L. Bottou and V. Vapnik. Local Learning Algorithms. Neural Computation, 4(6):888-900, 1992.

[4] K. Chakrabarti and S. Mehrotra. Local dimensionality reduction: A new approach to indexing high dimensional spaces. In Proc. VLDB, 2000.



[5] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. AAAI/MIT Press, 1996.

[6] Y. Cheng and G. M. Church. Biclustering of expression data. In Proc. ISMB, pages 93-103, 2000.

[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.

[8] C. Domeniconi. Locally Adaptive Techniques for Pattern Classification. PhD thesis, UC Riverside, Aug 2002.

[9] J. G. Dy and C. E. Brodley. Feature Subset Selection and Order Identification for Unsupervised Learning. In Proc. ICML, 2000.

[10] M. Ester, H.-P. Kriegel, and X. Xu. A database interface for clustering in large spatial databases. In Proc. KDD, 1995.

[11] Z. Ghahramani and G. E. Hinton. The EM Algorithm for Mixtures of Factor Analyzers. Technical Report CRG-TR-96-1, Dept. of Computer Science, Univ. of Toronto, 1996.

[12] E. Keogh, K. Chakrabarti, S. Mehrotra, and M. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. In Proc. of ACM SIGMOD, 2001.

[13] R. Michalski and R. Stepp. Machine Learning: An Artificial Intelligence Approach, chapter "Learning from observation: Conceptual Clustering". TIOGA Publishing Co., 1983.

[14] R. T. Ng and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. In Proc. of VLDB, 1994.

[15] C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. A Monte Carlo algorithm for fast projective clustering. In Proc. ACM SIGMOD, 2002.

[16] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In Proceedings of ACM SIGMOD, pages 94-105, June 1998.

[17] A. Thomasian, V. Castelli, and C. S. Li. Clustering via singular value decomposition for approximate indexing in high dimensional spaces. In Proc. CIKM, 1998.

[18] M. E. Tipping and C. M. Bishop. Mixtures of Principal Component Analyzers. Neural Computation, 11(2), 1999.

[19] H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. In Proc. ACM SIGMOD, 2002.

[20] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proc. ACM SIGMOD, 1996.



Graph-Based Ranking Algorithms for E-mail Expertise Analysis

Byron Dom*

Yahoo! Inc.
701 First Ave.

Sunnyvale, CA, USA

[email protected]

Iris Eiron
IBM Almaden Research

Center
650 Harry Rd.

San Jose, CA, USA

[email protected]

Alex Cozzi
IBM Almaden Research

Center
650 Harry Rd.

San Jose, CA, USA

[email protected]

Yi Zhang†

Language Technology Institute
School of Computer Science
Carnegie Mellon University

[email protected]

ABSTRACT
We investigate several graph-based ranking measures for ordering people according to their degree of expertise on subjects of interest. While the complete expertise analysis consists of several steps, in this paper we focus on the analysis of a graph whose nodes correspond to correspondents (people), whose edges correspond to the existence of email correspondence between the people corresponding to the nodes they connect, and whose edge directions point from the member of the pair whose relative expertise has been estimated to be higher. We perform our analysis on both synthetic and real data, and we introduce a new error measure for comparing ranked lists.

General Terms
Algorithms, Experimentation, Measurement

Keywords
Expert finding, Social network analysis, Graph node ranking, Ordered list distance

1. INTRODUCTION
A problem faced on occasion by almost everyone is that of locating some desired information or source of knowledge. The latter is desirable when one isn't even sure what to ask, in which case the need can usually be satisfied by finding an expert on the topic of interest. This problem of finding an expert is what we address in this paper. In large enterprises such as companies and government agencies, a source of information that can be utilized in this search
*The research was done while at IBM.
†The research was done while on an internship at IBM.

for experts is the aggregate email of the organization. Email is used today as a major means of communication in business; as such, it contains precious information about the activities, interests, and priorities of individuals and of the organization. The email collection also includes both those requesting information and those providing it. It is reasonable to hope, therefore, that appropriate analysis of an organization's aggregate email can identify individuals with a high level of expertise in topics of interest to the organization as a whole, or to some sufficiently large fraction of it.
In [3] an approach to performing this expertise analysis of email collections, and a system embodying that approach, are described. In this paper we concentrate on the third part of the following three-step process, which is applied to the subset (of the complete collection) of the emails discussing a certain topic.

1. Quantitatively estimate the relative expertise between pairs of people based on all the on-topic emails between them.

2. These pairwise relative-expertise assessments can be used to implicitly form a weighted directed graph (digraph) in which the nodes correspond to people (senders and receivers of emails in the collection). The existence of an edge between nodes indicates that there was at least one email between the associated people. The direction of the edge is from the person of greater expertise to the person of lesser expertise, and the edge weight is the quantitative measure of relative expertise.

3. The described digraph is analyzed using a graph-based ranking algorithm whose output is a ranked list of all the nodes (people) according to their absolute expertise level.

The goal of the work described here is to compare various ranking algorithms. Our method for comparison is based on



applying them to graphs for which the true ranking is known and then assessing how well the algorithms' outputs agree with the true ranking. Both simulations and real-data experiments are presented. We start with a simulation of a fully connected "perfect" digraph in which every node pair is connected by a link with the correct direction and weight. This graph is then systematically degraded by removing and/or reversing links. Additionally, we analyze graphs based on real data, in which edges account for email exchange. These graphs contain only a small fraction of the possible edges. Furthermore, the use of text classification to determine the direction of the edges may further degrade the quality of the graph by assigning an incorrect direction and/or weights to the edges.
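As an illustration of this simulation setup, the C++ sketch below (ours; the exact degradation schedule used in the experiments is not recoverable from this copy) builds the fully connected "perfect" digraph over n ranked people and then degrades it by reversing each edge with probability p.

#include <random>
#include <utility>
#include <vector>

// People are ranked 0 (most expert) to n-1. In the perfect graph every
// edge points from the person of greater expertise to the person of
// lesser expertise; degradation reverses each edge with probability p.
std::vector<std::pair<int, int>> degraded_graph(int n, double p,
                                                unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::bernoulli_distribution flip(p);
    std::vector<std::pair<int, int>> edges;   // (from, to): from knows more
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            edges.push_back(flip(gen) ? std::make_pair(j, i)
                                      : std::make_pair(i, j));
    return edges;
}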

2. RELATED WORK
Motivated by the prevalence of the WWW, the task of expert finding has lately received growing attention from the research community. For a review of work in this area, both research studies and commercial systems, we refer the reader to [15]. In contrast to our approach, the majority of the systems studied so far focus on connecting people to experts who possess some knowledge on a given topic. Our system, in comparison, is concerned with ranking people according to their expertise levels rather than separating experts from non-experts.

A system that attempts to provide a ranking of experts by their expertise level is described in [9]. This is an expertise-finder system that exploits technical papers, presentations, resumes, home pages etc. on MITRE's corporate intranet to establish the location of relevant experts. The ranking uses a simple scheme based on the number of mentions of a term or phrase.

Another factor differentiating our system from many other expertise-location systems is the use of email for identifying experts. One of the few systems that are based on emails is [13]. However, this system does not consider the content of email messages but only the patterns of correspondence. Another system that uses email is presented in [14]. This system uses a pre-existing hierarchy of subject areas, characterized by word frequencies, to identify experts in specific areas by analyzing the word frequencies of the email messages written by each individual. However, like many of the systems mentioned above, this system does not deal with the question of the ranking of the identified experts.

3. RANKING MEASURES
As we stated in the introduction, our hope is to use both the text of email messages and their distribution patterns to determine who the experts are. Because we represent the organization whose email we study as a graph, whose nodes correspond to people and whose edges (links) correspond to the existence of email correspondence between the people whose nodes they connect, we think of and refer to this analysis as "link analysis". The outcome of our analysis of the email between two people is the determination of which of the two knows more about the topic and a relative-expertise score, which characterizes the magnitude of that assessed difference. We can make an analogy between this comparison and a competition (match, game, etc.) between two sports teams.

In this analogy the relative-expertise score corresponds to the difference in score between the two teams. We can further extend this analogy to a correspondence between finding experts and finding a champion by ranking teams, especially in cases where there have not been comparisons between every pair of teams. This problem has been studied extensively (see, for example, [1; 2]). We will make use of this analogy in the following descriptions of the ranking approaches we evaluate.

3.1 Affinity (a.k.a Score)
Following our team-ranking analogy, perhaps the simplest way of ranking teams is by the number of wins or (closely related) the total point differential between the team and all of its opponents. We refer to this measure as "affinity" since it was referred to as a "score" in [3]. This measure would be fine if every team played every other team, or if we had relative-expertise assessments between every pair of people; but when that is not the case it tends not to work well, because it rates wins over all teams equally, regardless of the records of those teams and of who their opponents were.
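As an illustration (our own sketch, not code from the paper), the affinity of a node can be computed as its total weighted point differential over all assessments it participates in; the edge representation and names below are assumptions:

    # Minimal sketch of the Affinity (Score) measure, assuming the graph is
    # given as (winner, loser, weight) triples.
    from collections import defaultdict

    def affinity_scores(edges):
        score = defaultdict(float)
        for winner, loser, weight in edges:
            score[winner] += weight   # credit the more expert node
            score[loser] -= weight    # debit the less expert node
        return dict(score)

    # Example: alice beats bob and carol, bob beats carol.
    print(affinity_scores([("alice", "bob", 1.0), ("alice", "carol", 1.0),
                           ("bob", "carol", 1.0)]))
    # {'alice': 2.0, 'bob': 0.0, 'carol': -2.0}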

3.2 Successor
This is related to the measures described in [5]. Think of directed edges as pointing from winners to losers, or from greater to lesser expertise. If all these edge directions are correct, all the people "downstream" (reachable by a directed path) from a given node are of lesser expertise. Thus one measure of expertise is simply the count of such people (nodes).
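A possible implementation (ours, with illustrative names) counts, for every node, the nodes reachable from it along directed edges:

    # Sketch of the Successor measure: score a node by how many nodes are
    # reachable from it along directed (greater -> lesser expertise) edges.
    def successor_scores(nodes, edges):
        adj = {u: [] for u in nodes}
        for u, v in edges:
            adj[u].append(v)

        def count_reachable(src):
            seen, stack = set(), [src]
            while stack:
                for nxt in adj[stack.pop()]:
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
            return len(seen - {src})   # exclude src if a cycle returns to it

        return {u: count_reachable(u) for u in nodes}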

3.3 PageRank
One of the most popular team-ranking techniques is described in [2]. In this approach the obtained ranking is the principal eigenvector (corresponding to the largest eigenvalue) of the adjacency matrix of the digraph in which edges correspond to matches between teams and the direction of an edge is determined by who won. This approach turns out to be virtually identical to the well-known PageRank [12] algorithm for ranking web pages, the only difference being that in PageRank low-weight edges are added between all nodes in both directions. This weight is an adjustable parameter. We refer to this step of adding low-weight links to all nodes as "smoothing", which we also perform.
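The following sketch illustrates this kind of power iteration with the smoothing step; it assumes numpy, a dense adjacency matrix A with A[i, j] > 0 for an edge i to j, and a smoothing weight eps that we choose for illustration:

    import numpy as np

    def pagerank_smoothed(A, eps=0.15, iters=100):
        """Power iteration on a row-normalized adjacency matrix after
        adding low-weight links between all node pairs in both directions."""
        n = A.shape[0]
        M = (1.0 - eps) * A + eps * np.ones((n, n)) / n  # the "smoothing" step
        M = M / M.sum(axis=1, keepdims=True)             # make rows stochastic
        p = np.ones(n) / n
        for _ in range(iters):
            p = M.T @ p                                  # one power-iteration step
            p /= p.sum()
        return p                                         # scores; sort to rank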

3.4 Positional Power Function
One of the ranking measures we chose to evaluate is taken from [5]. It is closely related to what they call the "positional power function". For a graph the ranks {p_i : i = 1, 2, ..., n} satisfy the following system of equations:

p_i = Σ_{j ∈ S_i} (1/n)(p_j + 1),     (1)

where S_i denotes the set of successor nodes of node i. A generalization of (1), which we evaluated as part of our experimental characterization, is:

p_i = Σ_{j ∈ S_i} [β + (1 - β) p_j].     (2)

We solved this system by the usual power-iteration scheme. The corresponding results in our experimental section are labelled "Power".
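A minimal fixed-point iteration for equation (2) might look as follows; this is our sketch, and the normalization step is an assumption we add to keep the iterates bounded, not something stated above:

    def positional_power(succ, beta=0.5, iters=100):
        """succ: dict mapping node i -> set of successor nodes S_i.
        Iterates p_i <- sum_{j in S_i} [beta + (1 - beta) * p_j]."""
        p = {i: 0.0 for i in succ}
        for _ in range(iters):
            p = {i: sum(beta + (1.0 - beta) * p[j] for j in succ[i])
                 for i in succ}
            scale = max(p.values()) or 1.0   # keep the iterates bounded
            p = {i: v / scale for i, v in p.items()}
        return p

    print(positional_power({1: {2, 3}, 2: {3}, 3: set()}))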

3.5 HITS (authority)



The HITS [7] algorithm was devised with the thought of ranking web pages according to their degree of "authority". It was designed with the structure of the web in mind. In the associated digraph the directed edges correspond to hyperlinks. The HITS algorithm produces two scores/rankings: the "hub" score and the "authority" score. We present only the latter because it gave consistently better results than the hub score, as expected. In terms of the digraph adjacency matrix A the authority scores are given by the principal eigenvector of the matrix AᵀA, where Aᵀ represents the transpose of A.
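For illustration, the authority vector can be obtained by power iteration on AᵀA (our sketch, assuming numpy and a 0/1 adjacency matrix of a non-empty graph):

    import numpy as np

    def hits_authority(A, iters=100):
        """Principal eigenvector of A^T A via power iteration."""
        a = np.ones(A.shape[0])
        for _ in range(iters):
            a = A.T @ (A @ a)          # one iteration step on A^T A
            a /= np.linalg.norm(a)     # re-normalize to unit length
        return a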

4. ACCURACY EVALUATION
In this section we describe the measures we use to characterize the accuracy of ranking, by characterizing the degree of agreement and disagreement between the computed rankings and the correct (ground-truth) rankings. In the analysis of synthetic data (described in Section 5) we used a novel error-distance measure, designed to exhibit qualitative behavior desirable for an expertise-ranking accuracy measure. We refer to this measure as "RankErr" and describe it in the following section. The real-data analysis (described in Section 6) presents results using both the RankErr measure and recall graphs.

4.1 Rank–Error Distance
This is a measure of the following form:

d(r, s) = Σ_i φ(μ_i) ψ(δ_i),     (3)

where μ_i = min(r_i, s_i) and δ_i = |r_i - s_i|, r being the ground-truth ranking and s the computed ranking.

The general form of (3) embodies two notions about the desirable qualitative behavior of such a measure. One of these ideas is that the importance of an error (difference in the rank of the same object) increases with the highest rank the associated object receives on either of the lists (min(r_i, s_i)). If neither rank is high, the associated error (difference) should contribute very little to the overall score. This notion is captured in the function φ(μ), which decays monotonically with its argument, asymptotically approaching zero. The second idea is that the magnitude of the error associated with a difference δ in rank should saturate at (asymptotically approach) some value as δ increases. The notion here is that beyond some value, all differences should carry approximately the same weight. This behavior is embodied in the function ψ(δ). In experiments we used the following specific functional forms for φ and ψ, each characterized by a single adjustable parameter, a and b respectively:

φ(μ; a) = exp(-a μ)

and

ψ(δ; b) = (1/b) [1 - exp(-b δ)].

The values of the parameters a and b were held fixed across our experiments.
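A direct transcription of (3) with these functional forms might read as follows (our sketch; the concrete values of a and b are illustrative, since the paper's exact settings are not reproduced here):

    import math

    def rank_err(r, s, a=0.5, b=0.5):
        """Rank-error distance between a ground-truth ranking r and a
        computed ranking s, both given as dicts mapping item -> rank."""
        total = 0.0
        for item in r:
            mu = min(r[item], s[item])
            delta = abs(r[item] - s[item])
            phi = math.exp(-a * mu)                         # de-emphasize low ranks
            psi = (1.0 / b) * (1.0 - math.exp(-b * delta))  # saturating error
            total += phi * psi
        return total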

4.2 Recall
In our recall-based analysis we characterize the comparison between two rankings by a recall curve (see Figure 2 for an example). In these figures the term "query size" (let k represent this) refers to the top k ranked nodes (people) in the computed ranking s, and the "recall" is the fraction of the top k nodes of the ground-truth ranking r that are contained in the top k members of s.
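In code form (a sketch under the definitions above), recall at a given query size is:

    def recall_at_k(ground_truth, computed, k):
        """Both rankings are lists ordered from most to least expert."""
        return len(set(ground_truth[:k]) & set(computed[:k])) / float(k)

    # Example: with query size 2, one of the true top-2 is retrieved.
    print(recall_at_k(["ann", "bob", "cal"], ["ann", "cal", "bob"], 2))  # 0.5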

5. RESULTS FOR SYNTHETIC DATA
As part of our study of the behavior of the various ranking measures we ran them on synthetically generated expertise graphs. The general approach was to initially generate "perfect" graphs and then randomly degrade those graphs by removing and reversing edges. By "perfect" graphs we mean graphs for which there is an edge (link) between every pair of nodes and the direction of these edges correctly reflects the relative expertise of the nodes they connect. In what follows we present results for three such collections of experiments.

1. The first of these corresponds to only removing edges from a perfect 21-node graph.

2. The second corresponds to only reversing edges in a perfect 21-node graph.

3. The third corresponds to both removing and reversing edges in a perfect 101-node graph. That is, a number of edges is first removed and then a number of those remaining are reversed.

In the work presented here all edges have weight 1. Figures 1(a) and 1(b) are plots of the error distance (RankErr d(r, s)) described in Section 4 versus the number of links either removed or reversed. The individual points are average values from several trials, as described below. There are three tables in this section, each table representing an experimental collection. Each table contains three columns for each measure. The following is a description of these columns.

avg. dist.: This is the average d(r, s) between the ranking produced by the measure and the "ground truth" (i.e. the correct ranking). This average is over all trials in the particular experiment. Smaller values are better and a perfect score is zero.

no. best: This is the number of trials for which the associated measure produced the best ranking as measured by d(r, s). Larger numbers are better; a perfect score is equal to the total number of trials. The worst possible score is zero.

no. worst: This is the number of trials for which the associated measure produced the worst ranking as measured by d(r, s). Smaller numbers are better; a perfect score is zero. The worst possible score is equal to the total number of trials.

5.1 Removal of Edges from Perfect Graphs
The results presented here are for a set of experiments on 21-node graphs. It is assumed that the associated 21 people have expertise "scores" of {1, 2, 3, ..., 20, 21}. A directed link with weight 1 is placed between every pair of nodes, going from higher to lower expertise score.



This is the initial perfect graph, which has |E| = 21 · 20/2 = 210 edges. The number of edges removed from these graphs is varied from 1 up to 11 in this case. The results of these experiments are plotted in Figure 1(a) and tabulated in Table 1.
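The degradation procedure can be sketched as follows (our illustration of the setup described above; random seeds and trial counts are omitted):

    import random
    from itertools import combinations

    def perfect_graph(n):
        """All node pairs linked, directed from higher to lower expertise."""
        return [(max(u, v), min(u, v), 1)
                for u, v in combinations(range(1, n + 1), 2)]

    def degrade(edges, n_remove=0, n_reverse=0):
        edges = list(edges)
        random.shuffle(edges)
        edges = edges[n_remove:]                      # drop n_remove edges
        for i in random.sample(range(len(edges)), n_reverse):
            u, v, w = edges[i]
            edges[i] = (v, u, w)                      # reverse the direction
        return edges

    g = degrade(perfect_graph(21), n_remove=5)        # e.g. one removal trial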

(a) Removing links. (b) Reversing links.

Figure 1: Results from randomly removing or reversing links in a perfect 21-node graph: average RankErr distance versus the number of links removed (a) or reversed (b), for the Successor, PageRank, HITS (auth.), Affinity (Score) and Power measures.

Table 1: Results of randomly removing links from a perfect 21-node graph (avg. dist., no. best and no. worst for each of the five measures).

5.2 Reversal of Edges in Perfect Graphs
In this set of experiments we reverse rather than remove edges in perfect graphs. The number of edges reversed, and the scheme for determining that number and selecting the edges, is exactly the same as that used in the removal experiments described in Section 5.1. The results of these experiments are plotted in Figure 1(b) and tabulated in Table 2.

Table 2: Results of randomly reversing links in a perfect 21-node graph.

5.3 Combined Removal and Reversal from Perfect Graphs
The results presented here are for a set of experiments on 101-node graphs. It is assumed that the associated 101 people have expertise "scores" of {1, 2, 3, ..., 100, 101}. A directed link with weight 1 is placed between every pair of nodes, going from higher to lower expertise score. This is the initial perfect graph, which has |E| = 101 · 100/2 = 5050 edges. The graph is then degraded in various ways described as follows.

- Number of edges removed, m_rm: this is varied as in the previous experiments.
- Number of random sets of m_rm edges removed for each value of m_rm: N(m_rm) = ⌈log(|E|/m_rm)⌉, where G_rm is a random variable whose values are graphs formed by deleting a random set of m_rm edges from G.
- Number of edges reversed, m_rv: this is varied from 1 up to the number of remaining edges for each G_rm.
- Number of random sets of m_rv of the remaining edges reversed for each value of the pair (G_rm, m_rv): N(m_rm, m_rv) = ⌈log((|E| - m_rm)/m_rv)⌉.

The total number of graphs analyzed is Σ_{m_rm} N(m_rm) · Σ_{m_rv} N(m_rm, m_rv). The results of this process are tabulated in Table 3.

Table 3: Results of randomly removing and reversing links from a perfect 101-node graph.



6. RESULTS FOR REAL DATA
For the second part of our experiments we collected a data set of internal email communication contributed by individuals in our organization. Fifteen people from a single workgroup contributed email messages from their personal email boxes. From this set we selected, using keyword search, messages related to ten topics. Five of the ten topics were generic, such as LaTeX, Java, and Perl, while the other five were more specific to the type of work done in the workgroup. The full list of people involved in correspondence on each topic was extracted. Human evaluators were then asked to rate each person for their level of expertise per topic on a scale from 1 to 10. These ratings were then averaged across the evaluators. We used the resulting ranked list for each of the topics as the ground truth for our experiments.

Recall that in the graph representation of an email corpus the nodes are associated with the people involved in the correspondence and a directed edge between two nodes encodes which of the two is the lesser expert on the subject at hand. With an email collection at hand, this relative-expertise information may be based on the analysis of the email messages exchanged between the parties. In the case where no correspondence exists between a pair of people, the relative expertise level cannot be directly determined and the corresponding edge will not exist in the graph. As a result, graphs based on real email correspondence are much sparser than the ideal fully connected graphs. Table 4 reveals that the number of edges in the graphs representing these topics is on average only a small percentage of the number of edges in the equivalent fully connected graphs.

Table 4: Characteristics of the real-data graphs: the number of nodes, the number of edges, and the percentage of possible edges for each of the ten topics (java, latex, matlab, perl, xml, cognitive, patent and three others).

While in the synthetic experiments (Section 5) we were randomly reversing edges, with an email corpus at hand we assume the direction of an edge is determined by a text classifier which analyzes the content of the messages exchanged between a pair of people. Like any analysis of this kind, we cannot expect 100% accuracy from this process. Therefore in the resulting graph some of the edges may be reversed. We experimented with four different analysis methods. For every pair of people u, v between which correspondence exists, the various analysis methods act as follows:

Ideal classifier: if the expertise level of u is higher than that of v, an edge between the two is created, in the correct direction, with weight 1.

Dummy classifier: creates two parallel edges in opposite directions between u and v, each with weight 1/2.

Random classifier: flip an unbiased coin; depending on the outcome, create either the edge (u, v) or the edge (v, u), with weight 1.

Maximum Entropy: we trained a Maximum Entropy (ME) [11] classifier from the Rainbow package [10]. We train the classifier on the set of messages sent from u to v, and create an edge (u, v) with weight w and the edge (v, u) with weight 1 - w, where w is the probability the classifier gave to the event that u has higher expertise than v. If communication exists in both directions we train the classifier separately for each direction. We then assign a weight of (w + 1 - k)/2 to the edge (u, v) and (k + 1 - w)/2 to the edge (v, u), where w and k are the outputs of the classifier when run on messages in the one and the opposite direction (a sketch follows). In our experiments we used nine of the ten topics to train the classifier and tested on the remaining topic.
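A sketch of how such classifier outputs could be folded into edge weights (our code; the names and the exact probability semantics are illustrative):

    def me_edges(u, v, w, k=None):
        """w: classifier probability from messages u -> v; k: the analogous
        probability from messages v -> u, if correspondence is two-way.
        Returns weighted directed edges as (src, dst, weight) triples."""
        if k is None:                         # one-directional correspondence
            return [(u, v, w), (v, u, 1.0 - w)]
        return [(u, v, (w + 1.0 - k) / 2.0),  # average the two directions
                (v, u, (k + 1.0 - w) / 2.0)]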

Table 5: Results on the real-data graphs: PageRank, HITS, Power and Affinity under the Ideal, Dummy, Random and ME classifiers.

A summary of the results of running the four ranking algorithms (excluding Successor) previously mentioned on our email data is given in Table 5. The results are further detailed in Figures 2, 3 and 4.

7. DISCUSSION OF RESULTS
In the synthetic results of Section 5 we examined three modes of degradation starting from perfect graphs. In the removal-only results the best (nearly perfect) results were obtained using PageRank and Successor, whereas HITS was significantly worse than the others. In the reversal-only results, on the other hand, the best results were obtained with Power, though the results obtained with PageRank and Affinity were nearly as good. The Successor results were much worse than the others and the HITS results were somewhat worse. Similar results were obtained in the removal-plus-reversal experiments. That is, the best results were obtained with Power, with PageRank and Affinity finishing relatively close second and third respectively. The HITS algorithm was a more distant fourth and Successor was once again significantly worse than the others. As can be expected, measured accuracies obtained for the removal-only experiments were significantly better than those obtained for the reversal-only and removal-plus-reversal results.

In the real-data results of Section 6 we examined four modes of degradation starting from sparse graphs. The results obtained with the Ideal classifier support the synthetic removal-only results. PageRank performed significantly better, whereas HITS was slightly worse than the other methods. Among the Dummy, Random, and ME classifiers very small differences in overall performance are observed. It should be noted, however, that Power performed slightly better than all the other methods. This is consistent with the results gathered for the removal-plus-reversal and removal-only synthetic experiments.



(a) PageRank. (b) HITS.

Figure 2: Impact of Classification: recall versus query size under the Ideal, Dummy, Random and ME classifiers, with a uniformly-at-random baseline.

Note that the difference in performance between the best performing algorithm and the worst performing algorithm on synthetic data was much more significant than the difference between those algorithms observed on real data. These experiments also confirm that using the ideal (i.e. correct) classification to generate the edge directions in the graph yielded significantly better results than those obtained when using the other classification methods.

As a calibration point, the graphs presenting real-data results (Figures 2, 3 and 4) include curves corresponding to picking a ranking uniformly at random. These are the straight-line curves appearing lowest in these figures. Not surprisingly, all examined ranking algorithms perform significantly better than this random ordering.

Unfortunately, we discovered that the ME classifier yielded ranking performance that is comparable to, if not worse than, that of the Dummy and Random classifiers. This essentially means that the classifier failed to provide the ranking algorithm with any useful information from the content of the messages. This demonstrates the difficulties of the classification task. Two main factors contribute to the difficulty of this task: (1) both messages authored by experts and those authored by lesser experts discuss the same topic, and therefore use similar vocabulary; (2) email messages vary in length and tend to be shorter than other documents, sometimes as short as a single word (e.g. "yes" or "no"), with much of the original context missing. Therefore, not every message contains sufficient information by itself to identify the relative expertise level of the parties involved.

(a) Power. (b) Affinity.

Figure 3: Impact of Classification - contd.: recall versus query size for the Power and Affinity measures.

7.1 Conclusion
The Successor algorithm, while dealing well with missing edges, is clearly not robust to errors in edge direction. For this reason it was omitted from the real-data analysis. The HITS algorithm, while having results of reasonable accuracy, gave noticeably worse results than the others (except Successor). The inferior results obtained with HITS may be attributable to the fact that it was conceived for the purpose of ranking web pages and is based on the notion of the existence of "hubs" and "authorities" which mutually reinforce each other during the ranking process. While one can hypothesize the existence of certain email correspondents that play a role analogous to that of hubs, it seems unlikely that many exist. A good hub would be someone who corresponds with a large number of experts on a certain topic. It is our feeling that most users of email will tend to have one (or at most a few) person that they use as an expert consultant on a given topic.

The tournament-like model implicit in Power and PageRank seems more appropriate to the problem at hand. PageRank performs noticeably better than all other algorithms in the case where edges are missing but the ones present in the graph point in the right direction. When noise is introduced in the edge directions, Power becomes more advantageous. The final conclusion regarding which of the two is best depends on the accuracy of the available classifier. Affinity also gave reasonable results, though somewhat inferior to Power and PageRank.



(a) Ideal Classifier. (b) Dummy Classifier.

Figure 4: Best ranking algorithm: recall versus query size under the Ideal and Dummy classifiers.

Examining the performance of the various algorithms operating on a graph constructed with the Dummy classifier (Figure 4(b)) reveals that all the examined algorithms perform significantly better than the uniformly-at-random ranking curve. This result is encouraging because it implies that by using only the pattern of the email exchange, that is, the link structure alone, some significant information can be extracted. With a query size of only four we already receive substantial recall. If the accuracy of the classifier is improved we can expect similar recall with a query size of only two.

8. FUTURE WORK
In the future we may investigate other ranking algorithms and variations of those studied here. In addition we plan more extensive experimental investigations. In the synthetic-data investigations we will use graphs that are much sparser and therefore closer to those encountered in real data. We also plan to include weighted edges (i.e. not only 0/1) in our analysis. In the real-data experiments we will obtain more representative email collections.

Acknowledgments
We would like to thank Christopher Campbell for collecting subjective ranking data, and Nadav Eiron for his continuous support.

9. REFERENCES
[1] Bibliography on college football ranking systems. Online.
[2] Overview of ranking literature. Online.
[3] A. Cozzi, B. Dom, N. Eiron, Y. Zhang, C. Campbell, and P. Maglio. Identifying experts by analyzing e-mail. Submitted, 2003.
[4] L. N. Foner. Entertaining agents: A sociological case study. In Proceedings of the First International Conference on Autonomous Agents (Agents '97), 1997.
[5] P. J.-J. Herings, G. van der Laan, and D. Talman. Measuring the power of nodes in digraphs. Technical report, Tinbergen Institute, October 2001.
[6] H. Kautz, B. Selman, and M. Shah. ReferralWeb: Combining social networks and collaborative filtering. Communications of the ACM, 40(3):63-65, 1997.
[7] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[8] B. Krulwich and C. Burkey. The ContactFinder agent: Answering bulletin board questions with referrals. In AAAI '96, 1996.
[9] D. Mattox, M. Maybury, and D. Morey. Enterprise expert and knowledge discovery. In Proceedings of the 8th International Conference on Human-Computer Interaction (HCI International '99), 1999.
[10] A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
[11] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification, 1999.
[12] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
[13] M. F. Schwartz and D. C. M. Wood. Discovering shared interests using graph analysis. Communications of the ACM, 36(8):78-89, 1993.
[14] W. Sihn and F. Heeren. XpertFinder - expert finding within specified subject areas through analysis of e-mail communication. In Euromedia 2001: Sixth Annual Scientific Conference on Web Technology, New Media, Communications and Telematics Theory, Methods, Tools and Applications, 2001.
[15] D. Yimam and A. Kobsa. Expert finding systems for organizations: Problem and domain analysis and the DEMOIR approach. Journal of Organizational Computing and Electronic Commerce, 13(1):1-24, 2003.



Deriving Link-context from HTML Tag Tree

Gautam Pant
Department of Management Sciences

The University of Iowa
Iowa City, IA 52242

[email protected]

ABSTRACT
HTML anchors are often surrounded by text that seems to describe the destination page appropriately. The text surrounding a link, or the link-context, is used for a variety of tasks associated with Web information retrieval. These tasks can benefit by identifying regularities in the manner in which "good" contexts appear around links. In this paper, we describe a framework for conducting such a study. The framework serves as an evaluation platform for comparing various link-context derivation methods. We apply the framework to a sample of Web pages obtained from more than 10,000 different categories of the ODP. Our focus is on understanding the potential merits of using a Web page's tag tree structure for deriving link-contexts. We find that good link-context can be associated with tag tree hierarchy. Our results show that climbing up the tag tree when the link-context provided by greater depths is too short can provide better performance than some of the traditional techniques.

General Terms
Algorithms, Experimentation, Measurement

Keywords
Tag tree, DOM, Link-context

1. INTRODUCTION
Link-context is used for many tasks related to Web information retrieval. In most cases, anchor text or text within an arbitrary size window around a link is used to derive the context of a link. While the importance of link-context has been noted in areas such as automatic classification [2] and search engines [3], focused or topical crawlers (e.g., [9; 6; 12; 1]) have a special reliance on it. Topical crawlers follow the hyperlinked structure of the Web while using the scent of

information to direct themselves towards topically relevant pages. For deriving (sniffing) the appropriate scent they mine the content of pages that are already fetched to prioritize the fetching of unvisited pages. Unlike search engines that use contextual information to complement the content based retrieval, topical crawlers depend primarily on contextual information (since they score unvisited pages). Moreover, a search engine may be able to use contextual information about a URL (and the corresponding page) from a large

number of pages that contain that URL. Such global information is hard to come by for a topical crawler that only fetches several thousand pages. Hence, the crawler must make the best use of the little information that it has about the unvisited pages. While the study of combined contexts derived from many pages is important, in this paper we will concentrate on the quality of link-contexts derived from a single page.

A link-context derivation method takes in a hyperlink URL and the HTML page in which it appears as inputs. The output of the method is a context, if any, for the URL. Traditional context derivation methods view an HTML page as a flat file containing text interspersed with links. This view is a simple extension of techniques used with ordinary documents for obtaining contexts of citations included within the documents. However, recent methods have tried to use an HTML document's tag tree or Document Object Model (DOM) structure to derive contexts [5; 2]. These methods view an HTML page as a tree with <html> as its root and different tags and text forming the tree nodes. Figure 1 shows an example of an HTML page and its tag tree representation. As noted earlier, the use of text under the anchor tag (anchor text) as link-context has existed for some time. Do other tags and their hierarchy provide any additional information about the context of a link? The question forms one of the underlying themes of this paper.

When we use an anchor text as the context of a link, we are using all the text in the sub-tree rooted at the anchor tag as the link-context. We may extend the idea and consider all the text in the sub-tree rooted at any of the ancestors of an anchor tag as the context of the anchor URL. We call the node where the sub-tree is rooted the aggregation node. Note that both the context and the corresponding link are within the same sub-tree. Hence, we may view an aggregation node as one that aggregates a link with potentially helpful context. There are many potential aggregation nodes for a single anchor.

2. RELATED WORK
Since the early days of the Web many different applications have tried to derive contexts of links appearing on Web pages. McBryan [11] used the anchor text to index URLs in the WWW Worm. Iwazume et al. [10] guided a crawler using the anchor text along with an ontology. SharkSearch [9] used not only the anchor text but some of the text in its "vicinity" to estimate the benefit of following the corresponding link. In a recent work, Chakrabarti et al. [5] suggested the idea of using DOM offset, based on the distance of text tokens from an anchor on a tag tree, to score links for crawling.



<html>
 <head><title>Projects</title></head>
 <body>
  <h4>Projects</h4>
  <ul>
   <li><a href="blink.html">LAMP</a> Linkage analysis with multiple processors.</li>
   <li><a href="nice.html">NICE</a> The network infrastructure for combinatorial exploration.</li>
   <li><a href="amass.html">AMASS</a> A DNA sequence assembly algorithm.</li>
   <li><a href="dali.html">DALI</a> A distributed, adaptive, first-order logic theorem prover.</li>
  </ul>
 </body>
</html>

(Tag tree: html at the root, head and body below it, then title, h4 and ul, with li, a and text nodes as descendants; the anchor, its parent and grand-parent lie on the aggregation path from the anchor to the root.)

Figure 1: An HTML page and the corresponding tag tree

A DOM offset of zero corresponds to all the text tokens in the sub-tree rooted at the anchor tag. The text tokens to the left of the anchor tag on the tag tree are assigned negative DOM offset values. Similarly, positive DOM offset values are assigned to text tokens on the right of the anchor tag. The offset values are assigned in order: the first token on the right or left is assigned a value of ±1, and so on. The authors trained a classifier in an attempt to identify the optimal DOM offset window to derive the link contexts. They found the offset window to be no larger than 10 (offset values between −5 and +5) for most cases. The mechanism for assigning the DOM offset values made no use of the inherent hierarchy in the tag tree. We will later describe link-context derivation methods that make explicit use of the tag tree hierarchy and compare them against some of the traditional techniques.

Attardi et al. [2] proposed a technique that categorized a Web page based on the context of the URL corresponding to the page. In fact, they exploited the HTML structure to derive a sequence of context strings. However, they restricted the analysis to using only a few HTML tags and by obtaining the text within the tags in a selective and ad-hoc manner. We note that their concept of context path has similarities to the idea of aggregation path described in section 3.3.1. A context path is a sequence of text strings that are associated with a URL based on their appearance in certain tags and at specific positions in those tags along the hierarchy of the tag tree. The authors also associated a priori semantic meaning with some tags. For example, they argued that the title of a table column or row should be associated with the hyperlinks appearing in the corresponding row or column.

Brin and Page [3] have suggested the use of anchor text to index URLs in their Google search engine.

(Flowchart: obtain a diverse sample of URLs and their manual descriptions; for each sample URL, find in-links to the URL, pick a random in-link, fetch the in-link page, tidy the HTML, derive different contexts of the sample URL, and measure and store properties of the contexts; when no more URLs remain, summarize and analyse the data collected.)

Figure 2: The framework for study

Craswell et al. [7] have shown that using anchor text can provide more effective rankings for site finding tasks. Chakrabarti et al. [4] used a text window of size B bytes around the anchor to obtain the link-context. With an experiment using 5000 Web pages, the authors found that the word "Yahoo" was most likely to occur within 50 bytes of the anchor containing the URL http://www.yahoo.com/. Davison [8], while using a sample of the DiscoWeb, showed that anchor text (actually it includes words in the vicinity) has high similarity to the destination page as compared to a random page. The author also suggested (based on some results) that enlarging the context of a link by including more words should increase the chance of getting the important terms, but at the cost of including more unimportant terms. While the focus of our study is different, we do validate some of the results shown by Davison.

3. FRAMEWORK

3.1 General Methodology
Figure 2 describes the framework in terms of the main steps needed to conduct the study. We first need a sample of URLs that represent a wide variety of topics. In addition, the semantics of the pages corresponding to the URLs must be understood and described by human experts. The manual descriptions will guide us in judging the quality of contexts that are derived for the corresponding URLs. We rely on the manual descriptions because the text content of a Web page may not describe its semantics accurately. Many Web pages are not designed to describe themselves and often contain little text. However, after viewing a page, a human expert can provide us with a good description of what the page is about.

Once we have a sample of Web pages (identified by sample URLs) and their manual descriptions, we obtain the in-links for those pages. For each sample URL we use a search engine to find its in-links. We must make sure that the in-link information from the search engine is not stale (i.e. the in-link points to the sample URL).



(Tag tree: html at the root; a p node containing strong, text and a nodes, with text tokens as leaves.)

Figure 3: Mapping an HTML snippet (top) to a tag tree (bottom) via conversion of the HTML to a convenient XML format

We choose one random in-link for each sample URL. We fetch the page corresponding to the in-link and call it the in-link page. The page may provide us with content that describes the sample URL. Before we process the in-link page for deriving contexts, we "tidy" the HTML and convert it into a convenient XML format. After the conversion, we use multiple methods to derive contexts for the sample URL. The manual description of the sample URL helps us in judging different contexts and hence the methods that produced them. In the current study, we use one set of methods that analyze the HTML tag tree and another set that use arbitrary size text windows around the sample URL in href. We summarize the performance data obtained from our sample using a number of metrics. The framework allows for evaluating context derivation methods in general. Note that we only fetch the first 20KB of each in-link page and we parse the tag tree up to a depth of 19 (root of the tree is at depth 0).

3.2 HTML Tag Tree
Many Web pages contain badly written HTML. For example, a start tag may not have an end tag, or the tags may not be properly nested. In many cases, the <html> tag or the <body> tag is altogether missing from the HTML page. The process of converting a "dirty" HTML document into a well-formed one is called tidying an HTML page¹. It includes insertion of missing tags and reordering of tags in the "dirty" page. Tidying an HTML page is necessary to make sure that we can map the content onto a tree structure with each node having a single parent. Hence, it is an essential precursor to analyzing an HTML page as a tag tree.

¹ http://www.w3.org/People/Raggett/tidy/

Further, the text tokens are enclosed between <text>...</text> tags, which makes sure that all text tokens appear as leaves on the tag tree. While this step is not necessary for mapping an HTML document to a tree structure, it does provide some simplifications for the analysis. Figure 3 shows the process through which an HTML snippet is converted into a convenient XML format and mapped onto a tag tree.

3.3 Context Derivation Methods
We will now describe two link-context derivation techniques. We later evaluate many versions of each and treat each version as a separate context derivation method.

3.3.1 Context from Aggregation Nodes
We fetch each in-link page, tidy it up as explained before and map it onto a tag tree structure. Somewhere on the tree we will find the anchor tag that contains the sample URL. We call that anchor the sample anchor. If there is more than one sample anchor in the in-link page, we consider only the first one for analysis. Next, we treat each node on the path from the root of the tree to the sample anchor as a potential aggregation node (see Figure 1, shaded nodes). Note that a particular context derivation method would identify one aggregation node for each sample URL. The path from the root of the tree to the sample anchor that contains potential aggregation nodes is called the aggregation path (see Figure 1). Once a node on the aggregation path is set to be an aggregation node, the following data is collected from it:

• All of the text in the sub-tree rooted at the node is retrieved as the context of the sample URL. The similarity (described later) of the context to the manual description of the sample URL is measured and called the context similarity.

• Number of words in the context is counted. Large size contexts may be too "noisy" and burdensome for some systems.

When we decide on a strategy to assign an aggregation node on a given in-link page, it constitutes a context derivation method. Since we have collected the above data for the entire aggregation path for each of the in-links in our sample, we may evaluate many different methods.

For each in-link page we may have an optimal aggregation node that provides the highest context similarity for the corresponding sample URL. If there are multiple aggregation nodes with the highest similarity, we pick the one closest to the sample anchor as the optimal one.
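To make the procedure concrete, here is a sketch (ours, not the paper's code) that collects the candidate contexts along the aggregation path; it assumes the page is already well-formed XML, as produced by the tidying step of Section 3.2, and that it contains an anchor with the given href:

    import xml.etree.ElementTree as ET

    def aggregation_contexts(xml_page, sample_url):
        """Return the context string at each potential aggregation node:
        the anchor containing sample_url, then its parent, grand-parent,
        and so on up to the root of the tag tree."""
        root = ET.fromstring(xml_page)
        parent = {c: p for p in root.iter() for c in p}   # child -> parent map
        anchor = next(a for a in root.iter("a") if a.get("href") == sample_url)
        contexts, node = [], anchor
        while node is not None:
            text = " ".join(" ".join(node.itertext()).split())  # normalized
            contexts.append(text)
            node = parent.get(node)       # climb the aggregation path
        return contexts   # contexts[0]: anchor text; contexts[-1]: whole page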

3.3.2 Context from Text Window
We consider the general technique of using arbitrary size windows of text around the sample URL in href as a baseline for comparison. Hence, for each sample URL, we use a window of T words around its first appearance on the in-link page as the context. Whenever possible, the window is symmetric with respect to the href containing the sample URL. The window may or may not include the entire anchor text. We derive context using multiple values of T starting from 2 and moving up to 100 with increments of 2 words. Each value of T corresponds to a separate context derivation method.
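A baseline window extractor might look like this (our sketch; tokens is assumed to be the flat word stream of the page and href_index the anchor's position in it):

    def text_window_context(tokens, href_index, T):
        """T-word window around the anchor, symmetric whenever possible."""
        start = max(0, href_index - T // 2)
        end = min(len(tokens), start + T)
        start = max(0, end - T)          # re-balance near page boundaries
        return tokens[start:end]

    page = "projects lamp linkage analysis with multiple processors".split()
    print(text_window_context(page, href_index=1, T=4))
    # ['projects', 'lamp', 'linkage', 'analysis']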


3.4 Performance Metrics
While manual evaluations of context derivation methods are ideal, such evaluation techniques will not scale easily to thousands of sample URLs and several methods applied to each sample. Hence, we depend on automated techniques. Stop words are removed before computing context similarities, and stemming [13] is used to normalize the words. The (cosine) similarity between a context c and a description d is computed as:

sim(c, d) = \frac{\vec{v}_d \cdot \vec{v}_c}{\|\vec{v}_d\| \cdot \|\vec{v}_c\|}

where \vec{v}_c and \vec{v}_d are term frequency based vector representations of the context and the description respectively, \vec{v}_c \cdot \vec{v}_d is the dot (inner) product of the two vectors, and \|\vec{v}\| is the Euclidean norm of the vector \vec{v}.
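As an illustration, the similarity computation might be sketched in Python as follows; this assumes the context and description have already been stop-word filtered and stemmed.

    import math
    from collections import Counter

    def cosine_similarity(context_terms, description_terms):
        # Term-frequency vectors for the context and the description.
        vc, vd = Counter(context_terms), Counter(description_terms)
        dot = sum(vc[t] * vd[t] for t in vc)                 # dot product
        norm_c = math.sqrt(sum(f * f for f in vc.values()))  # Euclidean norms
        norm_d = math.sqrt(sum(f * f for f in vd.values()))
        if norm_c == 0 or norm_d == 0:
            return 0.0
        return dot / (norm_c * norm_d)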

We measure the performance of each of the context derivation methods using the following performance metrics:

Average context similarity: Given a context derivation method M, we measure the context similarity for each of the sample URLs. The average context similarity for M is then computed as:

avg\_context\_sim(M) = \frac{1}{n} \sum_{i=1}^{n} sim(c_i, d_i)

where n is the sample size, c_i is the context derived using the method M for sample URL i, and d_i is the manual description of the sample URL i. The average context similarity is a measure of the average quality of contexts obtained using the method.

Zero-similarity frequency: The zero-similarity frequency measures the frequency with which a method obtains a context that has no similarity to the manual description. The frequency is measured as a percentage of the sample size. While zero similarity may not necessarily mean zero information (due to variable word usage), it does mean that the method failed to provide many of the important words that were used by an expert editor to describe a given page. Such failures could have an effect on the recall of a system based on the method. It is possible that a method provides us with high quality contexts for some URLs but provides us with little or no information about many others. The zero-similarity frequency is computed as:

zero\_sim\_freq(M) = \frac{100}{n} \sum_{i=1}^{n} \delta(sim(c_i, d_i))

where \delta(x) is 1 if x = 0, and 0 otherwise. Unlike average context similarity, we would like to have a low value of zero-similarity frequency.

Average context size: This measure can be viewed as the potential cost involved in processing or indexing contexts derived from a method. It is simply computed as:

avg\_context\_size(M) = \frac{1}{n} \sum_{i=1}^{n} words(c_i)

where words(c) counts the number of words in the context c. A desirable property of a context derivation method may be that it produces reasonably sized link-contexts.

Note that we associate 95% confidence intervals with all average values.
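A sketch of how the three metrics fit together, assuming each sample is a (context_words, description_words) pair of already normalized token lists, and reusing the cosine_similarity sketch above:

    def evaluate_method(samples):
        # samples: list of (context_words, description_words) pairs.
        n = len(samples)
        sims = [cosine_similarity(c, d) for c, d in samples]
        avg_context_sim = sum(sims) / n
        zero_sim_freq = 100.0 * sum(1 for s in sims if s == 0.0) / n
        avg_context_size = sum(len(c) for c, _ in samples) / n
        return avg_context_sim, zero_sim_freq, avg_context_size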

4. SAMPLE
We obtain our sample URLs from the Open Directory Project² (ODP). The ODP contains a large list of categorized Web pages along with their manual descriptions provided by human editors. Due to the lack of commercial bias and an easily available content dump, the ODP is a suitable and practical source for obtaining our sample. First, we filter out all the categories of the ODP that have no external URLs, since we want only those categories from which we can get sample URLs. Next, we remove the categories that appear under the "World", "Regional" or "International" top-level categories or sub-categories. The categories that appear to be an alphabetical category (e.g., Business: Healthcare: Consultancy: A) are also filtered out. These filters are needed to avoid many semantically similar categories as well as pages with non-English content. Once we have a filtered set of ODP categories (nearly 100,000 in number), we randomly pick 20,000 categories from them. These categories represent a large spectrum of interests on the Web. From each of the 20,000 categories we pick one external URL at random. The external URL thus obtained is used as a sample URL. We concatenate the title and the description (provided by the editors) of the external URL and use the combined text as the description of the sample URL. Hence, we get a sample of 20,000 URLs along with their manual descriptions from an equal number of ODP categories.

As mentioned earlier, in-links to the sample pages are needed to evaluate different context derivation methods. We used the Google Web API³ to collect the in-links. Using the API we were able to obtain 1-10 in-links for only 14,960 of the 20,000 sample URLs. Out of the in-links returned by the API, we filter out the ones that are from the dmoz.org or directory.google.com (a popular directory based on the ODP) domains. Further, we fetch each page corresponding to the in-links and compare it (ignoring case and non-alphanumeric characters) against the description of the corresponding sample URL. If an in-link page contains the description, we filter it out. The filtering process is necessary to avoid ODP pages or their partial or complete mirrors amongst the in-links. Such in-links, if not filtered, would corrupt our conclusions. Finally, we remove in-link pages that do not contain a link to the sample page (a case of stale information from the search engine). We randomly pick one in-link from the remaining in-links and associate it with the sample URL and its description. Table 1 shows a sample URL, its description and a random in-link. Due to the in-link filtering criteria and limited results from the API, our usable sample reduces to 10,549 sample URLs.

5. ANALYSIS
The sample anchor's absolute depth and the number of its ancestors will vary with in-link pages. However, due to the HTML tidying process, we are assured that each sample anchor will have a parent and a grand-parent node. This is because the tidying process inserts <html> and <body> tags when they are missing from an HTML page. Also, each tag tree will always have the <html> tag at its root. For some in-link pages, the grand-parent and root nodes will be the same. However, in many other cases the two will be entirely different.

² http://dmoz.org
³ http://www.google.com/apis/


Table 1: Sample instance

Sample URL:  http://www.biotechnologyjobs.com/
Description: biotechnology jobs database of candidate profiles. automatically matches search to most suitable job in the biotech, pharmaceutical, regulatory affairs or medical device industries.
In-link:     http://www.agriscape.com/careers/biotechnology/

Figure 4 shows the performance of four different context derivation methods. The four methods correspond to setting the aggregation node at the anchor (sample anchor), parent, grand-parent and the root node respectively. In addition to the four methods, we also show the performance of a fictional method that would always find the optimal aggregation node. The error bars in this and the following plots show 95% confidence intervals. All the plots with the same performance metric are drawn to the same scale. Hence, non-overlapping error bars indicate a result that is significant at the α = 0.05 significance level. Figure 4(a) shows that the fictional method will have significantly higher average context similarity than any of the other four methods. We also find that the method that uses anchor text gives the highest average context similarity amongst the four methods. However, the same method leads to information failure for more than 40% of the sample URLs (Figure 4(b)). The average size of the anchor text is found to be 2.85 ± 0.05 words (slightly higher than that reported by Davison [8]). The method that picks context from the parent node reduces the zero-similarity frequency by more than 11%, at the cost of slightly reducing the average context similarity.

The method that sets the aggregation node at the root uses the entire in-link page as the context of the sample URL. As expected, such a strategy produces low quality (similarity) contexts with a large number of context words (Figure 4(a),(c)). However, it leads to contexts that are far more likely to contain some information about the sample URL (Figure 4(b)). Hence, our metrics show a trade-off which may manifest itself in the form of precision and recall in a system that uses the context derivation methods. Davison's [8] results also show a similar trade-off when incrementally enlarging the size of a link-context.

For our baseline methods that use arbitrarily sized text windows, we find that the best average context similarity is obtained when T is about 14 words. Yet, this best average context similarity is significantly worse than that of the method that simply sets the aggregation node at the sample anchor or its parent. However, the zero-similarity frequency at T = 14 is lower (better) than for the methods that use the anchor or parent as the aggregation node.

We find that for 65% of the sample the optimal aggregation node can be found at the anchor, the parent or the grand-parent nodes. Also, it is most frequently found at the anchor, and the frequency progressively decreases as we move up. This suggests that if we were to look for the optimal aggregation node, we should start from the anchor and make our way upwards if needed. Based on this observation, we suggest a simple tree climbing algorithm for setting the aggregation node. This algorithm strives for a minimum number of words in a context. We start from the sample anchor and, if it does not have enough words (N) under it, we move a level above it in the tree. This process is continued until we have the minimum number (N) of words in our context or we reach the root of the tree. We test the algorithm for various values of N, starting from 0 to 20 in increments of 1 word. The algorithm with each value of N can be considered a different context derivation method. For N = 0, the aggregation node is trivially set to the sample anchor. Figure 5 shows the performance for various values of N. We find that the average context similarity with N = 2, 3, 4 is significantly higher than that of any of the methods seen before (cf. Figure 4(a) and Figure 5(a)). The best average context similarity is seen at N = 2, with an average context size of about 53 words and a zero-similarity frequency of 20% (half that of the method using the anchor text). However, even for N = 2, the algorithm's average context similarity is much lower than that of the fictional method that finds the optimal aggregation node. Hence, there is considerable room for improvement that may come through more sophisticated data mining techniques.
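A minimal sketch of the tree climbing algorithm, assuming hypothetical tag-tree nodes that expose a parent attribute and a words() method returning the text tokens in the node's subtree:

    def climbing_context(sample_anchor, N):
        node = sample_anchor
        # Climb until the subtree holds at least N words, or we reach the
        # root of the tag tree (whose parent is None).
        while len(node.words()) < N and node.parent is not None:
            node = node.parent
        return node.words()   # the derived link-context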

6. CONCLUSION
We introduced a framework to study link-context derivation methods. The framework included a number of metrics to understand the utility of a given link-context. Using the framework, we evaluated a number of methods based on the HTML tag tree and arbitrarily sized text windows. The importance of anchor text as link-context was reaffirmed. However, we also noticed its failure in providing information for a large number of sample URLs. The parent of an anchor node appeared to provide a good balance between the lack of information in the case of the anchor node and the lack of quality in the case of the root node. We observed that the optimal aggregation node is likely to be found at depths close to the sample anchor. A simple algorithm that utilizes the tag tree hierarchy and the above observation provided us with higher quality contexts than any of the other methods considered.

We envisage a number of extensions to the current work. Large search engines have many pages that have several in-links. It is worth studying the performance of context derivation methods when they combine contexts from different source pages. In fact, our evaluation framework provides for such an analysis.

7. ACKNOWLEDGMENTS
The author would like to thank Filippo Menczer, Padmini Srinivasan, Alberto Segre and Robin McEntire for their valuable support and suggestions. Special thanks to Nick Street for providing the hardware that was used to conduct the experiments.

8. REFERENCES

[1] C. C. Aggarwal, F. Al-Garawi, and P. S. Yu. Intelligent crawling on the World Wide Web with arbitrary predicates. In WWW10, Hong Kong, May 2001.


Figure 4: Performance of aggregation node based methods: (a) average context similarity, (b) zero-similarity frequency, (c) average context size.

Figure 5: Performance of the tree climbing algorithm: (a) average context similarity, (b) zero-similarity frequency, (c) average context size.


[2] G. Attardi, A. Gullì, and F. Sebastiani. Automatic Web page categorization by link and context analysis. In Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, 1999.

[3] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.

[4] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource list compilation by analyzing hyperlink structure and associated text. In WWW7, 1998.

[5] S. Chakrabarti, K. Punera, and M. Subramanyam. Accelerated focused crawling through online relevance feedback. In WWW2002, Hawaii, May 2002.

[6] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific Web resource discovery. Computer Networks, 31(11–16):1623–1640, 1999.

[7] N. Craswell, D. Hawking, and S. Robertson. Effective site finding using link anchor information. In Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.

[8] B. Davison. Topical locality in the web. In Proc. 23rd Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2000.

[9] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur. The shark-search algorithm — An application: Tailored Web site mapping. In WWW7, 1998.

[10] M. Iwazume, K. Shirakami, K. Hatadani, H. Takeda, and T. Nishida. IICA: An ontology-based internet navigation system. In AAAI-96 Workshop on Internet Based Information Systems, 1996.

[11] O. McBryan. GENVL and WWWW: Tools for taming the Web. In Proc. 1st International World Wide Web Conference, 1994.

[12] F. Menczer and R. K. Belew. Adaptive retrieval agents: Internalizing local context and scaling up to the Web. Machine Learning, 39(2–3):203–242, 2000.

[13] M. Porter. An algorithm for suffix stripping. Program,14(3):130–137, 1980.


Clustering of Streaming Time Series is Meaningless

Jessica Lin    Eamonn Keogh    Wagner Truppel

Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
{jessica, eamonn, wagner}@cs.ucr.edu

ABSTRACT
Time series data is perhaps the most frequently encountered type of data examined by the data mining community. Clustering is perhaps the most frequently used data mining algorithm, being useful in its own right as an exploratory technique, and also as a subroutine in more complex data mining algorithms such as rule discovery, indexing, summarization, anomaly detection, and classification. Given these two facts, it is hardly surprising that time series clustering has attracted much attention. The data to be clustered can be in one of two formats: many individual time series, or a single time series, from which individual time series are extracted with a sliding window. Given the recent explosion of interest in streaming data and online algorithms, the latter case has received much attention.

In this work we make a surprising claim. Clustering of streaming time series is completely meaningless. More concretely, clusters extracted from streaming time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature.

We can justify calling our claim surprising, since it invalidates the contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative examples, and a comprehensive set of experiments on reimplementations of previous work. Although the primary contribution of our work is to draw attention to the fact that an apparent solution to an important problem is incorrect and should no longer be used, we also introduce a novel method which, based on the concept of time series motifs, is able to meaningfully cluster some streaming time series datasets.

Keywords
Time Series, Data Mining, Clustering, Rule Discovery, Data Streams

1. INTRODUCTION
Time series data is perhaps the most commonly encountered kind of data explored by data miners [26, 35]. Clustering is perhaps the most frequently used data mining algorithm [14], being useful in its own right as an exploratory technique, and as a subroutine in more complex data mining algorithms [3, 5]. Given these two facts, it is hardly surprising that time series clustering has attracted an extraordinary amount of attention [3, 7, 8, 9, 11, 12, 15, 16, 17, 18, 20, 21, 24, 25, 27, 28, 29, 30, 31, 32, 33, 36, 38, 40, 42, 45]. The work in this area can be broadly classified into two categories:

• Whole Clustering: The notion of clustering here is similar to that of conventional clustering of discrete objects. Given a set of individual time series data, the objective is to group similar time series into the same cluster.

• Subsequence Clustering: Given a single time series, individual time series (subsequences) are extracted with a sliding window. Clustering is then performed on the extracted time series.

Subsequence clustering is commonly used as a subroutine in many other algorithms, including rule discovery [9, 11, 15, 16, 17, 20, 21, 30, 32, 36, 42, 45], indexing [27, 33], classification [7, 8], prediction [37, 40], and anomaly detection [45]. For clarity, we will refer to this type of clustering as STS (Subsequence Time Series) clustering.

In this work we make a surprising claim. Clustering streaming time series is meaningless! More concretely, clusters extracted from streaming time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. This simple observation has never appeared in the literature.

Our claim is surprising since it calls into question the contributions of dozens of papers. In fact, the existence of so much work based on STS clustering offers an obvious counter argument to our claim. It could be argued: “Since many papers have been published which use time series subsequence clustering as a subroutine, and these papers produced successful results, time series subsequence clustering must be a meaningful operation.”

We strongly feel that this is not the case. We believe that in all such cases the results are consistent with what one would expect from random cluster centers. We recognize that this is a strong assertion, so we will demonstrate our claim by reimplementing the most successful (i.e. the most referenced) examples of such work, and showing with exhaustive experiments that these contributions inherit the property of meaningless results from the STS clustering subroutine.

The rest of this paper is organized as follows. In Section 2 we will review the necessary background material on time series and clustering, then briefly review the body of research that uses STS clustering. In Section 3 we will show that STS clustering is meaningless with a series of simple intuitive experiments; then in Section 4 we will explain why STS clustering cannot produce useful results. In Section 5 we show that the many algorithms that use STS clustering as a subroutine produce results indistinguishable from random clusters. Since the main


contribution of this paper may be considered “negative,” we conclude in Section 6 with the demonstration of a simple algorithm that can find clusters in at least some trivial streaming datasets. This algorithm is not presented as the best way to find clusters in streaming time series; it is simply offered as an existence proof that such an algorithm exists, and to pave the way for future research.

2. BACKGROUND MATERIAL
In order to frame our contribution in the proper context we begin with a review of the necessary background material.

2.1 Notation and Definitions
We begin with a definition of our data type of interest, time series:

Definition 1. Time Series: A time series T = t1,…,tm is an ordered set of m real-valued variables.

Data miners are typically not interested in any of the global properties of a time series; rather, data miners confine their interest to subsections of the time series, called subsequences.

Definition 2. Subsequence: Given a time series T of length m, a subsequence Cp of T is a sampling of length w < m of contiguous positions from T; that is, Cp = tp,…,tp+w-1 for 1 ≤ p ≤ m – w + 1.

In this work we are interested in the case where all the subsequences are extracted, and then clustered. This is achieved by use of a sliding window.

Definition 3. Sliding Windows: Given a time series T of length m, and a user-defined subsequence length of w, a matrix S of all possible subsequences can be built by “sliding a window” across T and placing subsequence Cp in the pth row of S. The size of matrix S is (m – w + 1) by w.

Figure 1 summarizes all the above definitions and notations.

Figure 1. An illustration of the notation introduced in this section: a time series T of length 128, a subsequence of length w = 16, beginning at datapoint 67, and the first 8 subsequences extracted by a sliding window.

Note that while S contains exactly the same information as T, it requires significantly more storage space. This is typically not a problem, since, as we shall see in the next section, the limiting factor tends to be the CPU time for clustering.
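A minimal Python sketch of Definition 3, building the (m – w + 1) by w matrix S of all subsequences of T:

    import numpy as np

    def sliding_window_matrix(T, w):
        # Row p holds the subsequence of length w starting at position p.
        m = len(T)
        return np.array([T[p:p + w] for p in range(m - w + 1)])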

2.2 Background on Clustering
One of the most widely used clustering approaches is hierarchical clustering, due to the great visualization power it offers [26, 29]. Hierarchical clustering produces a nested hierarchy of similar groups of objects, according to a pairwise distance matrix of the objects. One of the advantages of this method is its generality, since the user does not need to provide any parameters such as the number of clusters. However, its application is limited to only small datasets, due to its quadratic computational complexity. Table 1 outlines the basic hierarchical clustering algorithm.

Algorithm: Hierarchical Clustering
1. Calculate the distance between all objects. Store the results in a distance matrix.
2. Search through the distance matrix and find the two most similar clusters/objects.
3. Join the two clusters/objects to produce a cluster that now has at least 2 objects.
4. Update the matrix by calculating the distances between this new cluster and all other clusters.
5. Repeat step 2 until all cases are in one cluster.

Table 1: An outline of hierarchical clustering.
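For reference, a sketch of the outline above using SciPy's agglomerative implementation; fcluster performs the "cutting" into k clusters discussed later in Section 3.2, and method is the linkage method introduced there:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def hierarchical_clusters(objects, k, method="average"):
        Z = linkage(np.asarray(objects), method=method)  # steps 1-5 of Table 1
        return fcluster(Z, t=k, criterion="maxclust")    # cut into k clusters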

A faster method to perform clustering is k-means [5]. The basic intuition behind k-means (and a more general class of clustering algorithms known as iterative refinement algorithms) is shown in Table 2:

Algorithm: k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise goto 3.

Table 2: An outline of the k-means algorithm.

The k-means algorithm for N objects has a complexity of O(kNrD), with k the number of clusters specified by the user, r the number of iterations until convergence, and D the dimensionality of time series (in the case of STS clustering, D is the length of the sliding window, w). While the algorithm is perhaps the most commonly used clustering algorithm in the literature, it has several shortcomings, including the fact that the number of clusters must be specified in advance [5, 14].

It is well understood that some types of high dimensional clustering may be meaningless. As noted by [1, 4], in high dimensions the very concept of nearest neighbor has little meaning, because the ratio of the distance to the nearest neighbor over the distance to the average neighbor rapidly approaches one as the dimensionality increases. However, time series, while often having high dimensionality, typically have a low intrinsic dimensionality [25], and can therefore be meaningful candidates for clustering.

2.3 Background on Time Series Data Mining
The last decade has seen an extraordinary interest in mining time series data, with at least one thousand papers on the subject [26]. Tasks addressed by the researchers include segmentation, indexing, clustering, classification, anomaly detection, rule discovery, and summarization.

Of the above, a significant fraction use streaming time series clustering as a subroutine. Below we will enumerate some representative examples.

• There has been much work on finding association rules in time series [9, 11, 15, 16, 20, 21, 30, 32, 26, 42, 45]. Virtually all work is based on the classic paper of Das et al. that uses STS clustering to convert real valued time series


into symbolic values, which can then be manipulated by classic rule finding algorithms [9].

• The problem of anomaly detection in time series has been generalized to include the detection of surprising or interesting patterns (which are not necessarily anomalies). There are many approaches to this problem, including several based on STS clustering [45].

• Indexing of time series is an important problem that has attracted the attention of dozens of researchers. Several of the proposed techniques make use of STS clustering [27, 33].

• Several techniques for classifying time series make use of STS clustering to preprocess the data before passing it to a standard classification technique such as a decision tree [7, 8].

• Clustering of streaming time series has also been proposed as a knowledge discovery tool in its own right. Researchers have suggested various techniques to speed up the clustering [11].

The above is just a small fraction of the work in the area; more extensive surveys may be found in [24, 35].

3. DEMONSTRATIONS OF THE MEANINGLESSNESS OF STS CLUSTERING
In this section we will demonstrate the meaninglessness of STS clustering. In order to demonstrate that this meaninglessness is a product of the way the data is obtained by sliding windows, and not some quirk of the clustering algorithm, we will also do whole clustering as a control [12, 31].

3.1 K-means Clustering
Because k-means is a heuristic, hill-climbing algorithm, the cluster centers found may not be optimal [14]. That is, the algorithm is guaranteed to converge on a local, but not necessarily global, optimum. The choice of the initial centers affects the quality of results. One technique to mitigate this problem is to do multiple restarts, and choose the best set of clusters [5]. An obvious question to ask is how much variability in the shapes of cluster centers we get between multiple runs. We can measure this variability with the following equation:

• Let A = (a_1, a_2, …, a_k) be the cluster centers derived from one run of k-means.

• Let B = (b_1, b_2, …, b_k) be the cluster centers derived from a different run of k-means.

• Let dist(a_i, b_j) be the distance between two cluster centers, measured with Euclidean distance.

Then the distance between two sets of clusters can be defined as:

cluster\_distance(A, B) = \sum_{i=1}^{k} \min_{1 \le j \le k} dist(a_i, b_j) \qquad (1)

The simple intuition behind the equation is that each individual cluster center in A should map onto its closest counterpart in B, and the sum of all such distances tells us how similar two sets of clusters are.

An important observation is that we can use this measure not only to compare two sets of clusters derived from the same dataset, but also two sets of clusters which have been derived from different data sources. Given this fact, we propose a simple experiment.

We performed 3 random restarts of k-means on a stock market dataset, and saved the 3 resulting sets of cluster centers into set X. We also performed 3 random restarts on a random walk dataset, saving the 3 resulting sets of cluster centers into set Y.

We then measured the average cluster distance (as defined in equation 1), between each set of cluster centers in X, to each other set of cluster centers in X. We call this number within_set_X_distance. We also measured the average cluster distance between each set of cluster centers in X, to cluster centers in Y; we call this number between_set_X_and_Y_distance.

We can use these two numbers to create a fraction:

clustering\_meaningfulness(X, Y) = \frac{within\_set\_X\_distance}{between\_set\_X\_and\_Y\_distance} \qquad (2)

We can justify calling this number "clustering meaningfulness" since it clearly measures just that. If the clustering algorithm is returning the same or similar sets of clusters despite different initial seeds, the numerator should be close to zero. In contrast, there is no reason why the clusters from two completely different, unrelated datasets should be similar. Therefore, we should expect the denominator to be relatively large. So overall we should expect that the value of clustering meaningfulness(X, Y) will be close to zero when X and Y are sets of cluster centers derived from different datasets.
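A Python sketch of Eq. 1 and Eq. 2, assuming X and Y are lists of cluster-center sets and each set is a list of 1-D numpy arrays:

    import itertools
    import numpy as np

    def cluster_distance(A, B):
        # Eq. 1: each center in A maps onto its nearest counterpart in B.
        return sum(min(np.linalg.norm(a - b) for b in B) for a in A)

    def clustering_meaningfulness(X, Y):
        # Eq. 2: within-dataset spread over between-dataset spread.
        within = np.mean([cluster_distance(A, B)
                          for A, B in itertools.combinations(X, 2)])
        between = np.mean([cluster_distance(A, B) for A in X for B in Y])
        return within / between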

As a control, we performed the exact same experiment, on the same data, but using subsequences that were randomly extracted, rather than extracted by a sliding window. We call this whole clustering.

Since it might be argued that any results obtained were the consequence of a particular combination of k and w, we tried the cross product of k = {3, 5, 7, 11} and w = {8, 16, 32}. For every combination of parameters we repeated the entire process 100 times, and averaged the results. Figure 2 shows the results.

Figure 2. A comparison of the clustering meaningfulness for whole clustering and STS clustering, using k-means with a variety of parameters. The two datasets used were Standard and Poor's 500 Index closing values and random walk data.

The results are astonishing. The cluster centers found by STS clustering on any particular run of k-means on the stock market dataset are not significantly more similar to each other than they are to cluster centers taken from random walk data! In other


words, if we were asked to perform clustering on a particular stock market dataset, we could reuse an old clustering obtained from random walk data, and no one could tell the difference!

We reemphasize here that the difference in the results for STS clustering and whole clustering in this experiment (and all experiments in this work) is due exclusively to the feature extraction step. In particular, both are being tested on the same dataset, with the same parameters of w and k, using the same algorithm.

We also note that the exact definition of clustering meaningfulness is not important to our results. In our definition, each cluster center in A maps onto its closest match in B. It is possible therefore that two or more cluster centers from A map to one center in B, and some clusters in B have no match. However, we tried other variants of this definition, including pairwise matching, minimum matching and maximum matching, together with dozens of other measurements of clustering quality suggested in the literature [14]; it simply makes no significant difference to the results.

3.2 Hierarchical Clustering
The previous section suggests that k-means clustering of STS time series does not produce meaningful results, at least for stock market data. An obvious question to ask is, is this true for STS with other clustering algorithms? We will answer the question for hierarchical clustering here.

Hierarchical clustering, unlike k-means, is a deterministic algorithm. So we cannot reuse the experimental methodology from the previous section exactly; however, we can do something very similar.

First, we note that hierarchical clustering can be converted into a partitional clustering by cutting the first k links [29]. Figure 3 illustrates the idea. The resultant time series in each of the k subtrees can then be merged into single cluster prototypes. When performing hierarchical clustering, one has to make a choice about how to define the distance between two clusters; this choice is called the linkage method (cf. line 4 of Table 1).

Figure 3. A hierarchical clustering of ten time series. The clustering can be converted to a k partitional clustering by "sliding" a cutting line until it intersects k lines of the dendrogram, then averaging the time series in the k subtrees to form k cluster centers (gray panel).

Three popular choices are complete linkage, average linkage and Ward's method [14]. We can use all three methods for the stock market dataset, and place the resulting cluster centers into set X. We can do the same for random walk data and place the resulting cluster centers into set Y. Having done this, we can extend the measure of clustering meaningfulness in Eq. 2 to hierarchical clustering, and run a similar experiment as in the last section, but using hierarchical clustering. The results of this experiment are shown in Figure 4.

Figure 4. A comparison of the clustering meaningfulness for whole clustering and STS clustering using hierarchical clustering with a variety of parameters. The two datasets used were Standard and Poor's 500 Index closing values and random walk data.

Once again, the results are astonishing. While it is well understood that the choice of linkage method can have minor effects on the clustering found, the results above tell us that when doing STS clustering, the choice of linkage method has as much effect as the choice of dataset! Another way of looking at the results is as follows. If we were asked to perform hierarchical clustering on a particular dataset, but we did not have to report which linkage method we used, we could reuse an old random walk clustering and no one could tell the difference without re-running the clustering for every possible linkage method.


3.3 Other Datasets and Algorithms
The results in the two previous sections are extraordinary, but are they the consequence of some properties of stock market data, or, as we claim, a property of the sliding window feature extraction? The latter is the case, which we can simply demonstrate. We visually inspected the UCR archive of time series datasets for the two time series datasets that appear the least alike [23]. The best two candidates we discovered are shown in Figure 5.

Figure 5. Two subjectively very dissimilar time series from the UCR archive. Only the first 1,000 datapoints are shown. The two time series have very different properties of stationarity, noise, periodicity, symmetry, autocorrelation etc.

We repeated the experiment of Section 3.2, using these two datasets in place of the stock market data and the random walk data. The results are shown in Figure 6.

Figure 6. A comparison of the clustering meaningfulness for whole clustering and STS clustering, using k-means with a variety of parameters. The two datasets used were buoy_sensor(1) and ocean.

In our view, this experiment sounds the death knell for clustering of STS time series. If we cannot easily differentiate between the clusters from these two vastly different time series, then how could we possibly find meaningful clusters in any data?

In fact, the experiments shown in this section are just a tiny subset of the experiments we performed. We tested other clustering algorithms, including EM and SOMs [43]. We tested on 42 different datasets [24, 26]. We experimented with other measures of clustering quality [14]. We tried other variants of k-means, including different seeding algorithms. Although Euclidean distance is the most commonly used distance measure for time series data mining, we also tried other distance measures from the literature, including Manhattan, L∞, Mahalanobis distance and dynamic time warping distance [12, 24, 31]. We tried various normalization techniques, including Z-normalization, 0-1 normalization, amplitude-only normalization, offset-only normalization, no normalization, etc. In every case we are forced to the inescapable conclusion: whole clustering of time series is usually a meaningful thing to do, but sliding window time series clustering is never meaningful.

4. WHY IS STS CLUSTERING MEANINGLESS?
Before explaining why STS clustering is meaningless, it will be instructive to visualize the cluster centers produced by both whole clustering and STS clustering. We will demonstrate on the classic Cylinder-Bell-Funnel data [26]. This dataset consists of random instantiations of the eponymous patterns, with Gaussian noise added. While each time series is of length 128, the onset and duration of the shape is subject to random variability. Figure 7 shows one instance from each of the three clusters.

Figure 7. Examples of Cylinder, Bell, and Funnel patterns.

We generated a dataset of 30 instances of each pattern, and performed k-means clustering on it, with k = 3. The resulting cluster centers are shown in Figure 8. As one might expect, all three clusters are successfully discovered. The final centers closely resemble the three different patterns in the dataset, although the sharp edges of the patterns have been somewhat "softened" by the averaging of many time series with some variability in the time axis.

Figure 8. The three final centers found by k-means on the cylinder-bell-funnel dataset. The shapes of the centers are close approximations of the original patterns.

To compare the results of whole clustering to STS clustering, we took the 90 time series used above and concatenated them into one long time series. We then performed STS k-means clustering. To make it easy for the algorithm, we use the exact length of the patterns (w = 128) as the window length, and k = 3 as the number of desired clusters. The cluster centers are shown in Figure 9.


Figure 9. The three final centers found by subsequence clustering using the sliding window approach.
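The experiment can be sketched in a few lines, reusing the sliding_window_matrix helper from the Section 2.1 sketch; cbf_instances, a list of the 90 length-128 Cylinder-Bell-Funnel series, is a hypothetical placeholder for the data described above.

    import numpy as np
    from sklearn.cluster import KMeans

    long_series = np.concatenate(cbf_instances)    # one long time series
    S = sliding_window_matrix(long_series, w=128)  # every subsequence
    centers = KMeans(n_clusters=3, n_init=10).fit(S).cluster_centers_
    # Plotting the rows of centers yields smooth, roughly sinusoidal
    # shapes rather than the cylinder, bell, or funnel patterns.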

The results are extraordinarily unintuitive! The cluster centers look nothing like any of the patterns in the data; what’s more, they appear to be perfect sine waves.

In fact, for w << m, we get approximate sine waves with STS clustering regardless of the clustering algorithm, the number of


clusters, or the dataset used! Furthermore, although the sine waves are always exactly out of phase with each other by 1/k period, overall, their joint phase is arbitrary, and will change with every random restart of k-means.

This result completely explains the results from the last section. If sine waves appear as cluster centers for every dataset, then clearly it will be impossible to distinguish one dataset’s clusters from another. Having demonstrated the inability of STS clustering to produce meaningful results, we have now revealed a new question: why do we always get cluster centers with this special structure?

4.1 A Hidden Constraint
To explain the unintuitive results above, we must introduce a new fact.

Theorem 1: For any time series dataset T, if T is clustered using sliding windows, and w << m, then the mean of all the data (i.e. the special case of k = 1) will be an approximately constant vector.

In other words, if we run STS k-means on any dataset, with k = 1 (an unusual case, but perfectly legal), we will always end up with a horizontal line as the cluster center. The proof of this fact is straightforward but long, so we have elucidated it in a separate technical report [41].

We content ourselves here with giving the intuition behind the proof, and offering a visual “proof” in Figure 10.

Figure 10: A visual “proof” of Theorem 1. Ten time series of vastly different properties of stationarity, noise, periodicity, symmetry, autocorrelation, etc, are shown at left. The cluster centers for each time series, for w = 32, k = 1 are shown next to the data. Far right shows a zoom-in that illustrates just how close to a straight line the cluster centers are. While the objects have been shifted for clarity, they have not been rescaled in either axis; note the light gray circle in both graphs. The datasets used are, reading from top to bottom: Space Shuttle, Flutter, Speech, Power_Data, Koski_ecg, Earthquake, Chaotic, Cylinder, Random_Walk, and Balloon.

The intuition behind Theorem 1 is as follows. Imagine an arbitrary datapoint ti somewhere in the time series T, such that w ≤ i ≤ m – w + 1. If the time series is much longer than the window size, then virtually all datapoints are of this type. What contribution does this datapoint make to the overall mean of the STS matrix S? As the sliding window passes by, the datapoint first appears as the rightmost value in the window, then it goes on to appear exactly once in every possible location within the sliding window. So the datapoint ti's contribution to the overall shape is the same everywhere and must be a horizontal line. Only those points at the very beginning and the very end of the time series avoid contributing their value to all w columns of S, but these are asymptotically irrelevant. The average of many horizontal lines is clearly just another horizontal line.
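Theorem 1 is also easy to check empirically; a sketch (not a substitute for the proof in [41]), reusing the sliding_window_matrix helper from the Section 2.1 sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    T = np.cumsum(rng.standard_normal(10000))   # a random-walk time series
    S = sliding_window_matrix(T, w=32)
    center = S.mean(axis=0)                     # the k = 1 cluster center
    # The spread of the center is tiny relative to the spread of T itself.
    print(np.ptp(center), np.ptp(T))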

The implications of Theorem 1 become clearer when we consider the following well documented fact. For any dataset, the weighted (by cluster membership) average of k clusters must sum up to the global mean. The implication for STS clustering is profound. If we hope to discover k clusters in our dataset, we can only do so if the weighted average of these clusters happens to sum to a constant line! However, there is no reason why we should expect this to be true of any dataset, much less every dataset. This hidden constraint limits the utility of STS clustering to a vanishingly small subspace of all datasets.

4.2 The Importance of Trivial Matches
There are further constraints on the types of datasets where STS clustering could possibly work. Consider a subsequence Cp that is a member of a cluster. If we examine the entire dataset for similar subsequences, we should typically expect to find the best matches to Cp to be the subsequences …, Cp-2, Cp-1, Cp+1, Cp+2, … In other words, the best matches to any subsequence tend to be just slightly shifted versions of the subsequence. Figure 11 illustrates the idea, and Definition 4 states it more formally.

Definition 4. Trivial Match: Given a subsequence C beginning at position p, a matching subsequence M beginning at q, and a distance R, we say that M is a trivial match to C of order R if either p = q, or there does not exist a subsequence M' beginning at q' such that D(C, M') > R, and either q < q' < p or p < q' < q.

The importance of trivial matches, in a different context, has been documented elsewhere [28].
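A brute-force sketch of counting trivial matches under Definition 4, assuming Euclidean distance; S is the sliding-window matrix, p the row of the subsequence C, and R the distance threshold:

    import numpy as np

    def count_trivial_matches(S, p, R):
        count = 1                   # q = p itself counts as a trivial match
        for step in (-1, 1):        # walk left, then right, from p
            q = p + step
            # The chain of trivial matches is broken by the first
            # intermediate subsequence farther than R from S[p].
            while 0 <= q < len(S) and np.linalg.norm(S[p] - S[q]) <= R:
                count += 1
                q += step
        return count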

Figure 11: For almost any subsequence C in a time series, the closest matching subsequences are the subsequences immediately to the left and right of C.

An important observation is the fact that different subsequences can have vastly different numbers of trivial matches. In particular, smooth, slowly changing subsequences tend to have many trivial matches, whereas subsequences with rapidly changing features and/or noise tend to have very few trivial matches. Figure 12 illustrates the idea. The figure shows a time series that subjectively appears to have a cluster of 3 square waves. The bottom plot shows how many trivial matches each subsequence has. Note that the square waves have very few trivial matches, so all three taken together sit in a sparsely populated region of w-space. In contrast, consider the relatively smooth Gaussian bump centered at 125. The subsequences in the smooth ascent of this feature have more than 25 trivial matches, and thus sit in a dense region of w-space; the same is true for the subsequences in the descent from the peak. So if clustering this dataset with k-means, k = 2, the two cluster centers will be irresistibly drawn to these two "shapes", simple ascending and descending lines.


Figure 12: A) A time series T that subjectively appears to have a cluster of 3 noisy square waves. B) Here the i-th value is the number of trivial matches for the subsequence Ci in T, where R = 1, w = 64.

The importance of this observation for STS clustering is obvious. Imagine we have a time series where we subjectively see two clusters: equal numbers of a smooth, slowly changing pattern, and a noisier pattern with many features. In w-dimensional space, the smooth pattern is surrounded by many trivial matches. This dense volume will appear to any clustering algorithm an extremely promising cluster center. In contrast, the highly featured, noisy pattern has very few trivial matches, and thus sits in a relatively sparse space, all but ignored by the clustering algorithm. Note that it is not possible to simply remove or "factor out" the trivial matches, since there is no way to know beforehand the true patterns.

We have not yet fully explained why the cluster centers for STS clustering degenerate to sine waves (cf. Figure 9). However, we have shown that for STS "clustering", algorithms do not really cluster the data. If not clustering, what are the algorithms doing? It is instructive to note that if we perform singular value decomposition on time series, we also get shapes that seem to approximate sine waves [25]. This suggests that STS clustering algorithms are simply returning a set of basis functions that can be added together in a weighted combination to approximate the original data.

An even more tantalizing piece of evidence exists. In the 1920's, "data miners" were excited to find that by preprocessing their data with repeated smoothing, they could discover trading cycles. Their joy was shattered by a theorem of Evgeny Slutsky (1880-1948), who demonstrated that any noisy time series will converge to a sine wave after repeated applications of moving window smoothing [22]. While STS clustering is not exactly the same as repeated moving window smoothing, it is clearly highly related. For brevity, we defer further discussion of this point to future work.

4.3 Is there a Simple Fix?
Having gained an understanding of the fact that STS clustering is meaningless, and having developed an intuition as to why this is so, it is natural to ask if there is a simple modification to allow it to produce meaningful results. We asked this question, not just among ourselves, but also of dozens of time series clustering researchers with whom we shared our initial results. While we considered all suggestions, we discuss only the two most promising ones here.

The first idea is to increment the sliding window by more than one unit each time. In fact, this idea was suggested by [9], but only as a speed-up mechanism. Unfortunately, this idea does not help. If the new step size s is much smaller than w, we still get the same empirical results. If s is approximately equal to, or larger than, w, we are no longer doing subsequence clustering, but whole clustering. This is not useful, since the choice of the offset for the first window is a critical parameter, and choices that differ by just one timepoint can give arbitrarily different results.

The second idea is to set k to be some number much greater than the true number of clusters we expect to find, then do some post-processing to find the real clusters. Empirically, we could not make this idea work, even on the trivial dataset introduced at the beginning of this section. We found that even if k is extremely large, unless it is a significant fraction of T, we still get arbitrary sine waves as cluster centers. In addition, we note that the time complexity for k-means increases with k.

It is our belief that there is no simple solution to the problem of STS-clustering; the definition of the problem is itself intrinsically flawed.

4.4 Necessary Conditions for STS Clustering to Work
We conclude this section with a summary of the conditions that must be satisfied for STS clustering to be meaningful.

Assume that a time series contains k approximately or exactly repeated patterns of length w. Further assume that we happen to know k and w in advance. A necessary (but not necessarily sufficient) condition for a clustering algorithm to discover the k patterns is that the weighted mean of the patterns must sum to a horizontal line, and each of the k patterns must have approximately equal numbers of trivial matches.

It is obvious that the chances of both these conditions being met are essentially zero.

5. A CASE STUDY ON EXISTING WORK
As we noted in the introduction, an obvious counter argument to our claim is the following: "Since many papers have been published which use time series subsequence clustering as a subroutine, and these papers produce successful results, time series subsequence clustering must be a meaningful operation." To counter this argument, we have reimplemented the most influential such work, the Time Series Rule Finding algorithm of Das et al. [9] (the algorithm is not named in the original work; we will call it TSRF here for brevity and clarity).

5.1 (Not) Finding Rules in Time Series
The algorithm begins by performing STS clustering. The centers of these clusters are then used as primitives that are fed into a slightly modified version of a classic association rule algorithm [2]. Finally, the rules are ranked by their J-measure, an entropy based measure of their significance.

The rule finding algorithm found the rules shown in Figure 13 using 19 months of NASDAQ data. The high values of support, confidence and J-measure are offered as evidence of the significance of the rules. The rules are to be interpreted as follows. In Figure 13(b) we see that "if a stock rises then falls greatly, followed by a smaller rise, then we can expect to see within 20 time units, a pattern of rapid decrease followed by a leveling out." [9].


Figure 13: Above, two examples of "significant" rules found by Das et al. (this is a capture of Figure 4 from their paper). Below, a table of the parameters they used and results they found.

w    d    Rule           Sup %   Conf %   J-Mea.   Fig
20   5.5  7 ⇒(15) 8      8.3     73.0     0.0036   (a)
30   5.5  18 ⇒(20) 21    1.3     62.7     0.0039   (b)

What would happen if we used the TSRF algorithm to try to find rules in random walk data, using exactly the same parameters? Since no such rules should exist by definition, we should get radically different results¹. Figure 14 shows one such experiment; the support, confidence and J-measure values are essentially the same as in Figure 13!

Figure 14: Above, two examples of "significant" rules found in random walk data using the techniques of Das et al. Below, we used identical parameters and found near-identical results.

w    d    Rule           Sup %   Conf %   J-Mea.   Fig
20   5.5  11 ⇒(15) 3     6.9     71.2     0.0042   (a)
30   5.5  24 ⇒(20) 19    2.1     74.7     0.0035   (b)

This one experiment might have been an extraordinary coincidence; we might have created a random walk time series that happens to have some structure to it. Therefore, for every result shown in the original paper we ran 100 recreations using different random walk datasets, using quantum mechanically generated numbers to ensure randomness [44]. In every case the published results cannot be distinguished from our results on random walk data.

The above experiment is troublesome, but perhaps there are simply no rules to be found in the stock market. We devised a simple experiment on a dataset that does contain known rules. In particular, we tested the algorithm on a normal healthy electrocardiogram. Here, there is an obvious rule that one heartbeat follows another. Surprisingly, even with much tweaking of the parameters, the TSRF algorithm cannot find this simple rule.

The TSRF algorithm is based on the classic rule mining work of Agrawal et al. [2]; the only difference is the STS step. Since the work of [2] has been carefully vindicated in hundreds of experiments on both real and synthetic datasets, it seems reasonable to

¹ Note that the shapes of the patterns in Figures 13 and 14 are only very approximately sinusoidal. This is because the time series are relatively short compared to the window length. When the experiments are repeated with longer time series, the shapes converge to pure sine waves.

conclude that the STS clustering is at the heart of the problems with the TSRF algorithm.

These results may appear surprising, since they invalidate the contributions of dozens of papers [9, 11, 15, 16, 17, 20, 21, 30, 32, 36, 42, 45]. However, in retrospect, this result should not really be too surprising. Imagine that a researcher claims to have an algorithm that can differentiate between three types of Iris flowers (Setosa, Virginica and Versicolor) based on petal and sepal length and width [10]. This claim is not so extraordinary, given that it is well known that even amateur botanists and gardeners have this skill [6]. However, the paper in question is claiming to introduce an algorithm that can find rules in stock market time series. There is simply no evidence that any human can do this, in fact, the opposite is true: every indication suggests that the patterns much beloved by technical analysts such as the “calendar effect” are completely spurious [19, 39].

6. A TENTATIVE SOLUTION

The results presented in this paper thus far are somewhat downbeat. In this section we modify the tone by introducing an algorithm that can find clusters in some streaming time series. This algorithm is not presented as the best way to find clusters in streaming time series; for one thing, its time complexity is untenable for massive datasets. It is offered simply as an existence proof that such an algorithm is possible, and to pave the way for future research.

Our algorithm is motivated by the two observations in Section 4 that attempting to cluster every subsequence produces an unrealistic constraint, and that considering trivial matches causes smooth, low-detail subsequences to form pseudo clusters.

We begin by considering motifs, a concept closely related to clusters. Motifs are overrepresented sequences in discrete strings, for example, in musical or DNA sequences [34]. Classic definitions of motifs require that the underlying data be discrete, but in recent work the present authors have extended the definitions to real-valued time series [28]. Figure 15 gives a visual intuition of a motif, and Definition 5 defines the concept more concretely.

Figure 15: An example of a motif that occurs 4 times in a short section of the winding(4) dataset.

Definition 5. K-Motifs: Given a time series T, a subsequence length n, and a distance range R, the most significant motif in T (called the 1-Motif) is the subsequence C1 that has the highest count of non-trivial matches. The Kth most significant motif in T (called the K-Motif) is the subsequence CK that has the highest count of non-trivial matches and satisfies D(CK, Ci) > 2R for all 1 ≤ i < K.

Although motifs may be considered similar to clusters, there are several important differences, a few of which we enumerate here:

• When mining motifs, we must specify an additional parameter R.


• Assuming the distance R is defined as Euclidean, motifs always define circular regions in space, whereas clusters may have arbitrary shapes².

• Motifs generally define a small subset of the data, and not the entire dataset.

• The definition of motifs explicitly eliminates trivial matches.

Note that while the first two points appear as limitations, the last two points explicitly counter the two reasons that STS clustering cannot produce meaningful results.

We cannot simply run a K-motif detection algorithm in place of STS clustering, since a subset of the motifs discovered might really be a group that should be clustered together. For example, imagine a true cluster that sits in a hyper-ellipsoid. It might be approximated by 2 or 3 motifs that cover approximately the same volume. However, we could run a K-motif detection algorithm, with K >> k, to extract promising subsequences from the data, then use a classic clustering algorithm to cluster only these subsequences. This idea is formalized in Table 3.

Algorithm motif-based-clustering
1. Decide on a value for k.
2. Discover the K-motifs in the data, for K = k × c (c is some constant, in the region of about 2 to 30).
3. Run k-means, or k-partitional hierarchical clustering, or any other clustering algorithm on the subsequences covered by the K-motifs.

Table 3: An outline of the motif-based-clustering algorithm.

Line two of the algorithm requires a call to a motif discovery algorithm; such an algorithm appears in [28].
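As an existence proof in code, the following is a minimal sketch of Table 3, assuming numpy and scikit-learn; the quadratic brute-force motif search stands in for the faster discovery algorithm of [28], and the z-normalization and the default c = 10 are our own illustrative choices.

import numpy as np
from sklearn.cluster import KMeans

def znorm(s):
    """Z-normalize a subsequence so matches are offset/scale invariant."""
    sd = s.std()
    return (s - s.mean()) / sd if sd > 0 else s - s.mean()

def k_motifs(T, n, R, K):
    """Brute-force K-motif discovery in the spirit of Definition 5."""
    subs = np.array([znorm(T[i:i + n]) for i in range(len(T) - n + 1)])
    taken = np.zeros(len(subs), dtype=bool)   # windows already covered
    motifs, centers = [], []
    for _ in range(K):
        best, best_hits = None, []
        for i in np.flatnonzero(~taken):
            # Definition 5: a new motif must be > 2R from all earlier ones.
            if any(np.linalg.norm(subs[i] - subs[c]) <= 2 * R for c in centers):
                continue
            d = np.linalg.norm(subs - subs[i], axis=1)
            hits, last = [], -n
            for j in np.flatnonzero((d <= R) & ~taken):
                if j - last >= n:             # exclude trivial (overlapping) matches
                    hits.append(int(j))
                    last = j
            if len(hits) > len(best_hits):
                best, best_hits = i, hits
        if best is None:
            break
        motifs.append(best_hits)
        centers.append(best)
        taken[best_hits] = True
    return motifs

def motif_based_clustering(T, n, k, R, c=10):
    """Steps 1-3 of Table 3: discover K = k*c motifs, then cluster only
    the subsequences they cover."""
    covered = [i for motif in k_motifs(T, n, R, K=k * c) for i in motif]
    X = np.array([znorm(T[i:i + n]) for i in covered])
    return KMeans(n_clusters=k, n_init=10).fit(X).cluster_centers_

The (> 2R) test implements Definition 5 directly, the (j − last ≥ n) test is what excludes trivial matches, and only the windows covered by motifs ever reach k-means.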

6.1 Experimental Results

We have seen that the cluster centers returned by STS clustering have been mistaken for meaningful clusters by many researchers. To eliminate the possibility of repeating this mistake, we will demonstrate the proposed algorithm on the dataset introduced in Section 4, which consists of the concatenation of 30 examples each of the Cylinder, Bell, and Funnel shapes in random order. We would like our clustering algorithm to find cluster centers similar to the ones shown in Figure 8.

We ran the motif-based clustering algorithm with w = 128 and k = 3, which is fair since we also gave these two correct parameters to all the algorithms above. We also needed to specify the value of R; we did this by simply examining a fraction of our dataset, finding ten pairs of subsequences that appeared similar, measuring the Euclidean distance between each pair, and averaging the results. The cluster centers found are shown in Figure 16.
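This selection of R is easy to mechanize; a small sketch, where similar_pairs is an assumed, hand-picked list of (i, j) subsequence index pairs:

import numpy as np

def estimate_R(T, n, similar_pairs):
    """Average Euclidean distance between hand-picked pairs of z-normalized
    length-n subsequences judged similar by eye."""
    def z(s):
        sd = s.std()
        return (s - s.mean()) / sd if sd > 0 else s - s.mean()
    return float(np.mean([np.linalg.norm(z(T[i:i + n]) - z(T[j:j + n]))
                          for i, j in similar_pairs]))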

Figure 16: The cluster centers found by the motif-based-clustering algorithm on the concatenated Cylinder-Bell-Funnel dataset. Note that the results are very similar to the prototype shapes shown in Figure 7, and to the cluster centers found in the whole matching case, shown in Figure 8.

² It is true that k-means favors circular clusters, but more generally, clustering algorithms can define arbitrary spaces.

These results tell us that, on at least some datasets, we can do meaningful streaming clustering. The fact that this is achieved by working with only a subset of the data, and by explicitly excluding trivial matches, further supports our explanations in Sections 4.1 and 4.2 of why STS clustering is meaningless.

7. CONCLUSIONS

We have shown that a popular technique for data mining does not produce meaningful results. We have further explained the reasons why this is so.

Although our work may be viewed as negative, we have shown that a reformulation of the problem can allow clusters to be extracted from streaming time series. In future work we intend to consider several related questions; for example, whether or not the weaknesses of STS clustering described here have any implications for model-based, streaming clustering of time series, or streaming clustering of nominal data [13].

Acknowledgments: We gratefully acknowledge the following people who looked at an early draft of this paper and made useful comments and suggestions: Michalis Vlachos, Frank Höppner, Howard Hamilton, Daniel Barbara, Magnus Lie Hetland, Hongyuan Zha, Sergio Focardi, Xiaoming Jin, Shoji Hirano, Shusaku Tsumoto, Mark Last, and Zbigniew Struzik.

8. REFERENCES

[1] Aggarwal, C., Hinneburg, A. & Keim, D. A. (2001). On the Surprising Behavior of Distance Metrics in High Dimensional Space. In proceedings of the 8th Int'l Conference on Database Theory. London, UK, Jan 4-6. pp 420-434.

[2] Agrawal, R., Imielinski, T. & Swami, A. (1993). Mining Association Rules Between Sets of Items in Large Databases. In proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. Washington, D.C., May 26-28. pp. 207-216.

[3] Bar-Joseph, Z., Gerber, G., Gifford, D., Jaakkola, T. & Simon, I. (2002). A New Approach to Analyzing Gene Expression Time Series Data. In proceedings of the 6th Annual Int'l Conference on Research in Computational Molecular Biology. Washington, D.C., Apr 18-21. pp 39-48.

[4] Beyer, K., Goldstein, J., Ramakrishnan, R. & Shaft, U. (1999). When is Nearest Neighbor Meaningful? In proceedings of the 7th Int'l Conference on Database Theory. Jerusalem, Israel, Jan 10-12. pp 217-235.

[5] Bradley, P. S. & Fayyad, U. M. (1998). Refining Initial Points for K-Means Clustering. In proceedings of the 15th Int'l Conference on Machine Learning. Madison, WI, July 24-27. pp. 91-99.

[6] British Iris Society, Species Group Staff. (1997). A Guide to Species Irises: Their Identification and Cultivation. Cambridge University Press. March, 1997.

[7] Cotofrei, P. (2002). Statistical Temporal Rules. In proceedings of the 15th Conference on Computational Statistics - Short Communications and Posters. Berlin, Germany, Aug 24-28.

[8] Cotofrei, P. & Stoffel, K. (2002). Classification Rules + Time = Temporal Rules. In proceedings of the 2002 Int'l Conference on Computational Science. Amsterdam, Netherlands, Apr 21-24. pp 572-581.

[9] Das, G., Lin, K., Mannila, H., Renganathan, G. & Smyth, P. (1998). Rule Discovery from Time Series. In proceedings of the 4th Int'l Conference on Knowledge Discovery and Data Mining. New York, NY, Aug 27-31. pp 16-22.

[10] Fisher, R. A. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, Vol. 7, No. 2, pp 179-188.

[11] Fu, T. C., Chung, F. L., Ng, V. & Luk, R. (2001). Pattern Discovery from Stock Time Series Using Self-Organizing Maps. Workshop Notes of KDD2001 Workshop on Temporal Data Mining. San Francisco, CA, Aug 26-29. pp 27-37.

[12] Gavrilov, M., Anguelov, D., Indyk, P. & Motwani, R. (2000). Mining the Stock Market: Which Measure is Best? In proceedings of the 6th ACM Int'l Conference on Knowledge Discovery and Data Mining. Boston, MA, Aug 20-23. pp 487-496.

[13] Guha, S., Mishra, N., Motwani, R. & O'Callaghan, L. (2000). Clustering Data Streams. In proceedings of the 41st Annual Symposium on Foundations of Computer Science. Redondo Beach, CA, Nov 12-14. pp. 359-366.

[14] Halkidi, M., Batistakis, Y. & Vazirgiannis, M. (2001). On Clustering Validation Techniques. Journal of Intelligent Information Systems (JIIS), Vol. 17, No. 2-3. pp. 107-145.

[15] Harms, S. K., Deogun, J. & Tadesse, T. (2002). Discovering Sequential Association Rules with Constraints and Time Lags in Multiple Sequences. In proceedings of the 13th Int'l Symposium on Methodologies for Intelligent Systems. Lyon, France, June 27-29. pp 432-441.

[16] Harms, S. K., Reichenbach, S., Goddard, S. E., Tadesse, T. & Waltman, W. J. (2002). Data Mining in a Geospatial Decision Support System for Drought Risk Management. In proceedings of the 1st National Conference on Digital Government. Los Angeles, CA, May 21-23. pp. 9-16.

[17] Hetland, M. L. & Sætrom, P. (2002). Temporal Rule Discovery Using Genetic Programming and Specialized Hardware. In proceedings of the 4th Int'l Conference on Recent Advances in Soft Computing. Nottingham, UK, Dec 12-13.

[18] Honda, R., Wang, S., Kikuchi, T. & Konishi, O. (2002). Mining of Moving Objects from Time-Series Images and its Application to Satellite Weather Imagery. The Journal of Intelligent Information Systems, Vol. 19, No. 1, pp. 79-93.

[19] Jensen, D. (2000). Data Snooping, Dredging and Fishing: The Dark Side of Data Mining. SIGKDD99 panel report. ACM SIGKDD Explorations, Vol. 1, No. 2. pp. 52-54.

[20] Jin, X., Lu, Y. & Shi, C. (2002). Distribution Discovery: Local Analysis of Temporal Rules. In proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Taipei, Taiwan, May 6-8. pp 469-480.

[21] Jin, X., Wang, L., Lu, Y. & Shi, C. (2002). Indexing and Mining of the Local Patterns in Sequence Database. In proceedings of the 3rd International Conference on Intelligent Data Engineering and Automated Learning. Manchester, UK, Aug 12-14. pp 68-73.

[22] Kendall, M. (1976) Time-Series. 2nd Edition. Charles Griffin and Company, Ltd., London.

[23] Keogh, E. (2002). The UCR Time Series Data Mining Archive [http://www.cs.ucr.edu/~eamonn/TSDMA/index.html]. Riverside CA. University of California - Computer Science & Engineering Department

[24] Keogh, E. (2002). Exact Indexing of Dynamic Time Warping. In proceedings of the 28th International Conference on Very Large Data Bases. Hong Kong, Aug 20-23. pp 406-417.

[25] Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. (2001). Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Journal of Knowledge and Information Systems. Vol. 3, No. 3, pp. 263-286.

[26] Keogh, E. & Kasetty, S. (2002). On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. In proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. July 23 - 26, 2002. Edmonton, Alberta, Canada. pp 102-111.

[27] Li, C., Yu, P. S. & Castelli, V. (1998). MALM: A Framework for Mining Sequence Database at Multiple Abstraction Levels. In proceedings of the 7th ACM CIKM Int'l Conference on Information and Knowledge Management. Bethesda, MD, Nov 3-7. pp 267-272.

[28] Lin, J., Keogh, E., Patel, P. & Lonardi, S. (2002). Finding Motifs in Time Series. In the 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. July 23-26, 2002. Edmonton, Alberta, Canada.

[29] Mantegna, R. N. (1999). Hierarchical Structure in Financial Markets. European Physical Journal B, Vol. 11, pp. 193-197.

[30] Mori, T. & Uehara, K. (2001). Extraction of Primitive Motion and Discovery of Association Rules from Human Motion. In proceedings of the 10th IEEE Int'l Workshop on Robot and Human Communication. Bordeaux-Paris, France, Sept 18-21. pp 200-206.

[31] Oates, T. (1999). Identifying Distinctive Subsequences in Multivariate Time Series by Clustering. In proceedings of the 5th International Conference on Knowledge Discovery and Data Mining. San Diego, CA, Aug 15-18. pp 322-326.

[32] Osaki, R., Shimada, M. & Uehara, K. (2000). A Motion Recognition Method by Using Primitive Motions, Arisawa, H. and Catarci, T. (eds.) Advances in Visual Information Management, Visual Database Systems, Kluwer Academic Pub. pp 117-127.

[33] Radhakrishnan, N., Wilson, J. D. & Loizou, P. C. (2000). An Alternate Partitioning Technique to Quantify the Regularity of Complex Time Series. International Journal of Bifurcation and Chaos, Vol. 10, No. 7. World Scientific Publishing. pp 1773-1779.

[34] Reinert, G., Schbath, S. & Waterman, M. S. (2000). Probabilistic and Statistical Properties of Words: An Overview. J. Comput. Biol., Vol. 7, pp 1-46.

[35] Roddick, J. F. & Spiliopoulou, M. (2002). A Survey of Temporal Knowledge Discovery Paradigms and Methods. IEEE Transactions on Knowledge and Data Engineering. Vol. 14, No. 4, pp 750-767.

[36] Sarker, B. K., Mori, T. & Uehara, K. (2002). Parallel Algorithms for Mining Association Rules in Time Series Data. Technical Report CS24-2002-1.

[37] Schittenkopf, C., Tino, P. & Dorffner, G. (2000). The Benefit of Information Reduction for Trading Strategies. Report Series for Adaptive Information Systems and Management in Economics and Management Science, July. Report #45.

[38] Steinbach, M., Tan, P. N., Kumar, V., Klooster, S. & Potter, C. (2002). Temporal Data Mining for the Discovery and Analysis of Ocean Climate Indices. In the 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada. July 23.

[39] Timmermann, A., Sullivan, R. & White, H. (1998). The Dangers of Data-Driven Inference: The Case of Calendar Effects in Stock Returns. FMG Discussion Papers dp0304, Financial Markets Group and ESRC.

[40] Tino, P., Schittenkopf, C. & Dorffner, G. (2000). Temporal Pattern Recognition in Noisy Non-stationary Time Series Based on Quantization into Symbolic Streams: Lessons Learned from Financial Volatility Trading. Report Series for Adaptive Information Systems and Management in Economics and Management Science, July. Report #46.

[41] Truppel, W., Keogh, E. & Lin, J. (2003). A Hidden Constraint When Clustering Streaming Time Series. UCR Tech Report.

[42] Uehara, K. & Shimada, M. (2002). Extraction of Primitive Motion and Discovery of Association Rules from Human Motion Data. Progress in Discovery Science 2002, Lecture Notes in Artificial Intelligence, Vol. 2281. Springer-Verlag. pp 338-348.

[43] Van Laerhoven, K. (2001). Combining the Kohonen Self-Organizing Map and K-Means for On-line Classification of Sensor data. Artificial Neural Networks, Dorffner, G., Bischof, H. & Hornik, K. (Eds.), Vienna, Austria; Lecture Notes in Artificial Intelligence. Vol. 2130, Springer Verlag, pp.464-470.

[44] Walker, J. (2001). HotBits: Genuine Random Numbers Generated by Radioactive Decay. www.fourmilab.ch/hotbits/

[45] Yairi, T., Kato, Y. & Hori, K. (2001). Fault Detection by Mining Association Rules in House-keeping Data. In proceedings of the 6th International Symposium on Artificial Intelligence, Robotics and Automation in Space. Montreal, Canada, June 18-21.


A Learning-Based Approach to Estimate Statistics of Operators in Continuous Queries: a Case Study

Like Gao†, Min Wang‡, X. Sean Wang†, Sriram Padmanabhan‡

† Dept. of Information and Software Engineering, George Mason University, VA; {lgao, xywang}@gmu.edu
‡ IBM T.J. Watson Research Center, NY; {min, srp}@us.ibm.com

ABSTRACT

1. INTRODUCTION


2. SIMILARITY-BASED SEARCH OVER STREAMING TIME SERIES

3. TWO IMPORTANT STATISTICS FOR OPTIMIZING CONTINUOUS QUERIES


4. STATISTIC ESTIMATION

4.1 Feature Extractors

Figure 1: The feature extractor. (Diagram labels: Sorted Patterns, First DFT Coefficient, Maxstream, Minstream, α, LowerBound, UpperBound, LowerIndex, UpperIndex, IndexRange, patterns Ps0 … PsN.)
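The labels above suggest the general mechanism; what follows is a speculative minimal sketch of that idea, with all details assumed: summarize each pattern by its first DFT coefficient, sort the patterns on this key, and locate a streaming window by binary search within bounds set by a tolerance α.

import numpy as np

def first_dft_coefficient(x):
    """Magnitude of the first non-trivial DFT coefficient of a sequence."""
    return abs(np.fft.rfft(x)[1])

# Assumed pattern set: three sinusoids of different amplitude.
patterns = [a * np.sin(np.linspace(0, 2 * np.pi, 64)) for a in (1.0, 2.0, 3.0)]
keys = np.sort([first_dft_coefficient(p) for p in patterns])   # "Sorted Patterns"

window = 1.4 * np.sin(np.linspace(0, 2 * np.pi, 64))           # current stream window
k, alpha = first_dft_coefficient(window), 5.0
lower_idx = np.searchsorted(keys, k - alpha)                   # LowerIndex
upper_idx = np.searchsorted(keys, k + alpha)                   # UpperIndex
print(lower_idx, upper_idx, upper_idx - lower_idx)             # IndexRange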


Figure 2: Features used. (Table listing the features of the cost and emptiness extractors, including LowerIndex and UpperIndex.)

4.2 Estimation Models

Figure 3: A sample cost estimation model. (Decision tree splitting on IndexRange, UpperIndex and LowerIndex, with leaf costs of 1, 2, 4, 7 and 15 ms.)
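Figure 3 indicates the flavor of such a model; below is a hedged sketch of fitting one with an off-the-shelf decision-tree learner (scikit-learn assumed; the index features follow Figure 1, and the training data and cost law are synthetic stand-ins, not the authors' protocol).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# One row per past evaluation: LowerIndex, UpperIndex, IndexRange.
lower = rng.integers(0, 50_000, size=2_000)
width = rng.integers(0, 30_000, size=2_000)
X = np.column_stack([lower, lower + width, width])
cost_ms = 1 + width / 2_000 + rng.normal(0, 0.5, size=2_000)   # assumed cost law

model = DecisionTreeRegressor(max_depth=5).fit(X, cost_ms)
print(model.predict([[500, 12_000, 11_500]]))                  # estimated cost (ms)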


5. EXPERIMENTAL RESULTS

Figure 4: Performance of the models. (Plot of accuracy (%) against threshold α for the Emptiness and Cost models.)

Figure 5: Distribution of estimation costs. (Histogram comparing estimated and actual costs.)

����3C���&! �� ���$���G��'i��� �%���6��*��<�7��`�6��<�C�� �&�%!)����1�� ���(HaTs����� �$���! !�01@b��� �k� ! H �U�0 S���*�1��� � ��� #���d��'Y�������$��0�0 �$�<���63t3 ����7���_4��3O���5�N�����E�]-G�k�$���� ����������R�&�:3 �$0 �$HNTb�?� ���?�����E*��$����� ��S��� � 3������,���Y���$���^�����G���$��� *����<@E��'���� �G�$���� ����������.�&�:35�$0 �$H^J/���*��$��� 0 ���,��*��QS� LM�$�7 �=�i S�� *���! H� *���� �i S�� *��e! BG-G�O�6��t���$�k���4��.�����O�$�&!5�� ���$���?�$���� ����� ����&�:3 ��0^�4���N��*��$0������L��$01@R�� S��e���$��� *����<@�H � �R�������*��������� SC��*����43 �;���4��?��E��� �.��� *���������0�3b � ��*��6�������$Bi��������$�$� *����@eS����$�?3 �6-Y�_�*����O����3u�����$��S������O��!(H�J/� ��C'D��0�0 �6-Y�=�����='�����O������7-Y� �$���� *���������0�3U��;L���*�@R������0 0Y��LM��*�@R8 �SM�<BI��� �.���6��*��<�e �QL���*�@R0 ����$01@���k*�������01�E �s���e�$�&! �A@W������� +2�$�&!5�]@��Y��� ��! � �6H Z �U��� �$���>�6����$�$B�������$���� ����� ���?�&�:3 �$0��6���.���$�$�5*�����$01@Q�$���� ������"�������$�&! �� �������$HTV� �$�7��������� *��$��� ��0�3k�������$��L���0 � �$�Y��78)���]-G�$�$�IB5�����;���&! �� ���$��� �k�&�*��R���$����1�� LM�=���s�����O'P�6����5*��eL���0 �����$H Z �t���� �k�$������B,������$���� ����� ���O�&�:3 �$0K�����,�?*��$0���� LM��0 @k0 �6-���$�$� *����@�HJ/���;���69��*�*��$�������k'D��*�!)��*�'P��*������ �$�E35*���!O �`���4��Y-G�;����01@k�������� � £ �^J��$�����&������:�;���C35��*� LM�Q'P�6����5*��$�$HQJ/�� �N ����*��:3 � �$�$�N�<*�+*���*��i���Y�����a'D�$���� *��GL��0�� �$�$H^J/� �$���/��*�*��*��^�4�6LM��3 1�K��*��$������&!���������b�$���� ������������$� *����@b'P��*.3 1�K��*��$���E�� �& 0f��*�1�]@b���5*��$������0�3bL��01+�����$HCTz���$�U�����& �&!4����E8)�$�$���&�$�E0f��*�S���B^��� �&�����$� *����@b�'`������$���� ����� ���=35*���!��,����3k���&35���$�,��� �N!)��*�'P��*������ �$��HJ/���,��$�$� *����@>�'K�����Y�$�������$���� ����� ���&�&�:3 �$0 ����*�������3*#���� ���q�� �CB����;�����6-Y�U �b�i S��5*��H! H � �����Q���4��;�k�& ��+2�$0�������1_4�63R�$�����


does not necessarily mean a bad estimation, since it may still be very close to the actual cost. To assess this, we ran the similarity-based search with threshold α = 0.1 for another 2,000 time positions, and studied the cost estimation distribution. The result is shown in Figure 5. In this figure, each pair of actual cost and estimated cost corresponds to one evaluation. The vertical value indicates the total number of times that a pair of estimated cost and actual cost appears. Since all non-zero values are clustered on the diagonal in this figure, this means that in each evaluation, the estimated cost is very close to the actual cost.

For the search with α = 0.1, the averaged time to find the two statistic estimates is about 0.05ms. Compared to the overall search time of 10ms on average, this is a very small overhead. Experiments with other thresholds also confirm that our proposed approach has a very low overhead.

6. RELATED WORK
Continuous queries have long been attracting the attention of database researchers. Terry et al. [28] were perhaps the first to introduce the notion of "continuous queries" for a class of queries that are issued once and then run "continuously" over databases. Later, Liu et al. [17; 18] presented a similar concept, termed "continual query". Recently, due to the emerging needs of data stream management systems, the interest in continuous queries, especially in continuous queries over streams, has increased sharply. One direction in this field is to design stream data processing systems that support continuous queries; e.g., Chen et al. presented the design of the NiagaraCQ system [6; 7], while Babu and Widom [4] and Madden and Franklin [19] also reported system architectures and related issues for dealing with continuous queries. Another direction is query plan optimization for continuous queries [20; 29]. More related details on continuous queries and streams can be found in [3; 11].

In this paper, we studied the estimation problem for similarity-based searches over streaming time series. Most previous work [1; 2; 24] on similarity-based searches concentrated on ad-hoc queries over stationary data sources, and in most cases, the time series stored in databases are supposed to have the same length. Although [9; 10] also study similarity-based searches over streaming time series, their main contribution is on how to efficiently evaluate a similarity-based search on a single streaming time series. In contrast, this paper considers the scenario in which a continuous query involves multiple streaming time series and multiple similarity-based searches, and focuses on estimating the statistics that are necessary for optimizing such a query.

This paper is also closely related to the literature on statistics estimation, which has been addressed in the introduction section. Besides, another interesting work [13] used machine learning algorithms on training queries to "re-vitalize" existing output size estimation methods whose performance had previously deteriorated due to high update loads. However, all these works are based on modeling the data distribution. In contrast, our learning-based approach is a "direct" estimation method without the intermediate step of modeling the data distribution.

7. CONCLUSIONS
In this paper, we introduced a learning-based framework to estimate statistics of an operator in continuous queries. The framework consists of extracting features and building learning models using a training workload. We studied the case of similarity-based queries over streaming time series to validate this approach. Our experiments demonstrated the effectiveness of our proposed approach.

It should be noted that there may be different ways to design a feature extractor. It is important for a feature extractor to be well constructed so that the learning method can lead to accurate estimation models. It will be interesting to study strategies for finding an "optimal" pair of feature extractor and estimation model, which may yield the best estimation accuracy under given resource constraints. Another interesting research direction is to apply the learning-based approach to other types of continuous queries.

8. REFERENCES

[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Proc. FODO, pages 69–84, 1993.

[2] R. Agrawal, K.-I. Lin, H. S. Sawhney, and K. Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In Proc. VLDB, pages 490–501, 1995.

[3] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. ACM PODS, pages 1–16, 2002.

[4] S. Babu and J. Widom. Continuous queries over data streams. SIGMOD Record, 2001.

[5] C. M. Chen and N. Roussopoulos. Adaptive selectivity estimation using query feedback. In Proc. ACM SIGMOD, pages 161–172, 1994.

[6] J. Chen, D. J. DeWitt, and J. F. Naughton. Design and evaluation of alternative selection placement strategies in optimizing continuous queries. In ICDE Conference, 2002.

[7] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for Internet databases. In Proc. ACM SIGMOD, pages 379–390, 2000.

[8] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In Proc. ACM SIGMOD, pages 419–429, 1994.

[9] L. Gao and X. S. Wang. Continually evaluating similarity-based pattern queries on a streaming time series. In Proc. ACM SIGMOD, 2002.

[10] L. Gao, X. S. Wang, and Z. Yao. Evaluating continuous nearest neighbor queries for streaming time series via pre-fetching. In Proc. CIKM, pages 485–492, 2002.

[11] M. N. Garofalakis, J. Gehrke, and R. Rastogi. Querying and mining data streams: You only get one look. In Proc. ACM SIGMOD, 2002.

[12] D. Gunopulos and G. Das. Time series similarity measures. In Proc. KDD, pages 243–307, 2000.

[13] B. Harangsri, J. Shepherd, and A. Ngu. Query size estimation using machine learning. In Database Systems for Advanced Applications, pages 97–106, 1997.

[14] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Knowledge and Information Systems, 3(3):263–286, 2001.

[15] R. Kooi. The Optimization of Queries in Relational Databases. PhD thesis, Case Western Reserve University, 1980.

[16] R. J. Lipton, J. F. Naughton, and D. A. Schneider. Practical selectivity estimation through adaptive sampling. In Proc. ACM SIGMOD, pages 1–11, 1990.

[17] L. Liu, C. Pu, R. Barga, and T. Zhou. Differential evaluation of continual queries. In International Conference on Distributed Computing Systems, pages 458–465, 1996.

[18] L. Liu, C. Pu, and W. Tang. Continual queries for Internet scale event-driven information delivery. IEEE TKDE, 11(4):610–628, 1999.

[19] S. Madden and M. J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In ICDE Conference, 2002.

[20] S. Madden, M. Shah, J. M. Hellerstein, and V. Raman. Continuously adaptive continuous queries over streams. In Proc. ACM SIGMOD, pages 49–60, 2002.

[21] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In Proc. ACM SIGMOD, pages 448–459, 1998.

[22] A. V. Oppenheim and R. W. Schafer. Digital Signal Processing. Prentice-Hall, Inc., 1975.

[23] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[24] D. Rafiei and A. Mendelzon. Similarity-based queries for time series data. In Proc. ACM SIGMOD, pages 13–25, 1997.

[25] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proc. ACM SIGMOD, pages 23–34, 1979.

[26] W. Sun, Y. Ling, N. Rishe, and Y. Deng. An instant and accurate size estimation method for joins and selection in a retrieval-intensive environment. In Proc. ACM SIGMOD, pages 79–88, 1993.

[27] J. D. Taft. The Sliding DFT Page. On-line.

[28] D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. In Proc. ACM SIGMOD, pages 321–330, 1992.

[29] S. D. Viglas and J. F. Naughton. Rate-based query optimization for streaming information sources. In Proc. ACM SIGMOD, pages 37–48, 2002.


Using transposition for pattern discovery from microarray data

Francois Rioult
GREYC CNRS UMR 6072

Universite de Caen

F-14032 Caen, France

[email protected]

Jean-Francois Boulicaut
LIRIS CNRS FRE 2672

INSA Lyon

F-69621 Villeurbanne, France

[email protected]

Bruno Cremilleux
GREYC CNRS UMR 6072

Universite de Caen

F-14032 Caen, France

[email protected]

Jeremy Besson
LIRIS CNRS FRE 2672

INSA Lyon

F-69621 Villeurbanne, France

[email protected]

ABSTRACT
We analyze expression matrices to identify a priori interesting sets of genes, e.g., genes that are frequently co-regulated. Such matrices provide expression values for given biological situations (the lines) and given genes (the columns). The frequent itemset (sets of columns) extraction technique can process difficult cases (millions of lines, hundreds of columns) provided that the data is not too dense. However, expression matrices can be dense and generally have only a few lines w.r.t. the number of columns. Known algorithms, including the recent algorithms that compute the so-called condensed representations, can fail. Thanks to the properties of Galois connections, we propose an original technique that processes the transposed matrices while computing the sets of genes. We validate the potential of this framework by looking for the closed sets in two microarray data sets.

1. INTRODUCTION
We are now entering the post-genome era and it seems obvious that, in the near future, the critical need will not be to generate data, but to derive knowledge from huge data sets generated at very high throughput. Different techniques (including microarrays and SAGE) make it possible to study the simultaneous expression of (tens of) thousands of genes in various biological situations. The data generated by those experiments can then be seen as expression matrices in which the expression levels of genes (the columns) are recorded in various biological situations (the lines). Various knowledge discovery methods can be applied to such data, e.g., the discovery of sets of co-regulated genes, also known as synexpression groups [12]. These sets can be computed from the frequent sets in the boolean matrices coding for the expression data (see Table 1).

An attribute ai is assigned the value true (1) to represent the over- (or under-) expression of gene i in that particular situation.

Discretization procedures (true is assigned above a threshold value) are used to derive boolean matrices from a raw expression matrix. Discretization can obviously have a large influence on the nature of the extracted sets. It is thus essential that, in exploratory contexts, one can study different threshold values and proceed with a large number of analyses.

What we would like to do is to compute all sets of genes that have the true value in a sufficient number (frequency threshold) of biological situations. Extracting frequent sets is one of the most studied data mining techniques since the description of the Apriori algorithm [1], and tens of algorithms have been published. Nevertheless, the gene expression matrices obtained through microarrays raise new difficulties, due to their "pathological" dimensions (i.e., few lines and a huge number of columns). This is a very difficult problem since the overall complexity is exponential in the number of genes. Furthermore, the size of the solution (i.e., the collection of extracted sets) is huge whatever the frequency threshold, since there is a very limited number of lines.

In Section 2, we present the problems raised by the extraction of frequent sets. Section 3 proposes a solution that combines the power of closed set extraction with an original use of the properties of the Galois connection [17; 9]. In Section 4 we provide experimental results on two matrices built from microarray data [2; 16]. They establish the spectacular gains allowed by our approach. Section 5 discusses the use of the extracted closed sets, and Section 6 concludes.

2. FREQUENT SET EXTRACTION

2.1 Definitions
Let S denote a set of biological situations and A denote a set of attributes. In the example from Table 1, S = {s1, . . . , s5} and A = {a1, . . . , a10}. Each attribute denotes a property about the expression of a gene. The encoded expression data is represented by the matrix of the binary relation R ⊆ S × A defined for each situation and each attribute. (si, aj) ∈ R denotes that situation i has property j, i.e., that gene j is over-expressed or under-expressed in situation i. A database r to be mined is thus a 3-tuple (S, A, R). LA = 2^A is the power set of attributes. For the sake of clarity, sets of attributes are often called sets of genes. LS = 2^S is the power set of situations.

Definition 1. Given T ⊆ S and X ⊆ A, let f(T) = {a ∈ A | ∀s ∈ T, (s, a) ∈ R} and g(X) = {s ∈ S | ∀a ∈ X, (s, a) ∈ R}. f provides the set of over-expressed or under-expressed genes that are common to a set of situations, and g provides the set of situations that share a given set of attributes (expression properties). (f, g) is the so-called Galois connection between S and A. We use the classical notations h = f ◦ g and h′ = g ◦ f to denote the Galois closure operators.


Situations  a1 a2 a3 a4 a5 a6 a7 a8 a9 a10
s1           1  1  1  1  0  1  1  0  0  0
s2           1  1  1  1  0  0  0  0  1  1
s3           1  1  1  1  0  0  0  0  1  1
s4           0  0  0  0  1  1  1  1  1  1
s5           1  0  1  0  1  1  1  1  0  0

Table 1: Example of a boolean matrix r1

Definition 2. A set of genes X ⊆ A is closed iff h(X) = X. We say that X satisfies the CClose constraint in r: CClose(X, r) ≡ h(X) = X. A set of situations T ⊆ S is closed iff h′(T) = T.

Definition 3. The frequency of a set of genes X ⊆ A, denoted F(X, r), is the size of g(X). Constraint Cfreq enforces a minimal frequency: Cfreq(X, r) ≡ F(X, r) ≥ γ, where γ is the user-defined frequency threshold.

Example 1. Given r1, we have F({a1, a3, a5}) = 1 and F({a1, a2}) = 3. If γ = 3, {a9, a10} and {a1, a2, a3, a4} satisfy Cfreq in r1 but {a1, a5} does not. h({a1, a2}) in r1 is f(g({a1, a2})) = f({s1, s2, s3}) = {a1, a2, a3, a4}. {a1, a2} does not satisfy CClose in r1 but {a1, a2, a3, a4} satisfies it.
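To make Definitions 1-3 and Example 1 concrete, here is a minimal Python sketch (ours, not part of the paper; names such as galois_f are invented for illustration) of f, g, h and F on the boolean matrix r1 of Table 1:

    # r1 from Table 1: each situation maps to its set of true attributes.
    R1 = {
        "s1": {"a1", "a2", "a3", "a4", "a6", "a7"},
        "s2": {"a1", "a2", "a3", "a4", "a9", "a10"},
        "s3": {"a1", "a2", "a3", "a4", "a9", "a10"},
        "s4": {"a5", "a6", "a7", "a8", "a9", "a10"},
        "s5": {"a1", "a3", "a5", "a6", "a7", "a8"},
    }
    ALL_ATTRS = set().union(*R1.values())

    def galois_f(T):
        """f(T): attributes common to every situation in T (f(empty) = A)."""
        return set.intersection(*(R1[s] for s in T)) if T else set(ALL_ATTRS)

    def galois_g(X):
        """g(X): situations that possess every attribute in X."""
        return {s for s, attrs in R1.items() if set(X) <= attrs}

    def h(X):
        """Galois closure h = f o g on sets of attributes."""
        return galois_f(galois_g(X))

    def freq(X):
        """F(X, r1) = |g(X)|."""
        return len(galois_g(X))

    print(freq({"a1", "a2"}))       # 3, as in Example 1
    print(sorted(h({"a1", "a2"})))  # ['a1', 'a2', 'a3', 'a4']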

Mining task. We want to compute the collection of the frequent sets of genes FS = {ϕ ∈ LA | Cfreq(ϕ, r) satisfied}, where Cfreq is the minimal frequency constraint and r is a boolean expression matrix. Furthermore, we need the frequencies of each frequent itemset in order to, e.g., derive interesting association rules from them.

The closure of a set of genes X, h(X), is the maximal (w.r.t. set inclusion) superset of X which has the same frequency as X. A closed set of genes is thus a maximal set of genes whose expression properties (true values) are shared by a set of situations. E.g., the closed set {a1, a3} in the data of Table 1 is the largest set of genes that are over-expressed (or under-expressed) simultaneously in situations s1, s2, s3

and s5.

The concept of free set has been introduced in [7] as a special case of the δ-free sets and has been proposed independently in [3] under the name of key pattern. This concept characterizes the closed set generators [13] but is also useful for non redundant association rule computation (see, e.g., [5] for an illustration).

Definition 4. A set of genes X ⊆ A is free iff X is not included in the closure (i.e., h = f ◦ g) of one of its strict subsets. We say that X satisfies the Cfree constraint in r. An alternative definition is that X is free in r iff the frequency of X in r is strictly lower than the frequency of every strict subset of X.

Example 2. {a1, a6} satisfies Cfree in r1 but {a1, a2, a3} does not.
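Continuing the sketch above, the second characterization in Definition 4 gives a direct freeness test (is_free is our name; checking the immediate strict subsets suffices because frequency is anti-monotonic):

    def is_free(X):
        # X is free iff freq(X) is strictly lower than the frequency
        # of every strict subset; immediate subsets are enough.
        X = set(X)
        return all(freq(X - {x}) > freq(X) for x in X)

    print(is_free({"a1", "a6"}))        # True, as in Example 2
    print(is_free({"a1", "a2", "a3"}))  # False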

It is easy to adapt these definitions to sets of situations. An important result is that the closures of the free sets are closed sets. The size of the collection of the free sets is, by construction, greater than or equal to the size of the collection of the closed sets (see, e.g., [8]).

It is well known that LA can be represented by a lattice ordered by set inclusion. On top of the lattice, we have the

empty set, then the singletons, the pairs, etc. The last level for our example from Table 1 contains the unique set of size 10. A classical framework [11; 10] for an efficient exploration of such a search space is based on the monotonicity of the used constraints w.r.t. the specialization relation, i.e., set inclusion.

Definition 5. A constraint C on sets is said to be anti-monotonic when ∀X, X′: (X′ ⊆ X ∧ X satisfies C) ⇒ X′ satisfies C. A constraint C is said to be monotonic when ∀X, X′: (X ⊆ X′ ∧ X satisfies C) ⇒ X′ satisfies C.

Example 3. Cfreq, Cfree and Cfreq ∧ Cfree are anti-monotonic. Csize(X) ≡ |X| > 3 is monotonic.

The negation of a monotonic (resp. anti-monotonic) constraint is an anti-monotonic (resp. monotonic) constraint. Anti-monotonic constraints can be pushed efficiently into the extraction process: when a set X does not satisfy an anti-monotonic constraint, we can prune large parts of the lattice since no superset of X can satisfy it. For instance, the Apriori algorithm [1] computes all the frequent sets by a levelwise search on the lattice, starting from the most general sentences (the singletons) until it reaches the most specific sentences that are frequent (the maximal frequent sets w.r.t. set inclusion). Apriori and its variants work well on very large boolean matrices (millions of lines, hundreds or thousands of columns) that are not dense and for weakly correlated data. Notice that such algorithms have to count the frequency of at least every frequent set.
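As an illustration of this levelwise pruning (a toy sketch reusing freq and ALL_ATTRS from the sketches above, not an optimized miner), an Apriori-style search over Table 1 looks as follows:

    from itertools import combinations

    def apriori(min_freq):
        # Keep frequent sets level by level; a (k+1)-candidate survives
        # only if all of its k-subsets are frequent (anti-monotonicity).
        level = [frozenset([a]) for a in sorted(ALL_ATTRS) if freq({a}) >= min_freq]
        frequent = set(level)
        while level:
            candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
            level = [c for c in candidates
                     if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))
                     and freq(c) >= min_freq]
            frequent.update(level)
        return frequent

    print(len(apriori(3)))  # 21 frequent sets in Table 1 for gamma = 3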

2.2 Extraction tractability
The computation of the sets that satisfy a given constraint C is a very hard problem. Indeed, as soon as we have more than a few tens of columns, only quite a small subset of the search space can be explored. Then, the size of the solution, i.e., the collection of the sets that satisfy C, can be so huge that no algorithm can compute them all. When the used constraint is Cfreq, it is possible to take a greater frequency threshold to decrease a priori the size of the solution and thus provide the whole collection of the frequent sets. The required threshold can however be disappointing for the biologist: extracted patterns are so frequent that they are already known.

In the expression matrices we have to analyze, the number of frequent sets can be huge, whatever the frequency threshold. This comes from the rather low number of lines and thus the small number of possible frequencies. Clearly, Apriori and its variants can not be used here. Since we need the frequencies of every frequent set, e.g., for deriving valid association rules, algorithms that compute only the maximal frequent itemsets, e.g., [4], do not solve the problem.

We decided to investigate the use of the so-called condensed

representations of the frequent sets by the frequent closed


sets, i.e., CFS = {ϕ ∈ LA | Cfreq(ϕ, r) ∧ CClose(ϕ, r) satisfied}, because FS can be efficiently derived from CFS [13; 6]. CFS is a compact representation of the information about every frequent set and its frequency. Furthermore, several recent algorithms can compute the frequent closed sets efficiently [13; 7; 8; 14; 18; 3]. To be efficient, these algorithms can not use the properties of CClose, which is neither anti-monotonic nor monotonic. However, we can compute the frequent free sets and provide their closures, i.e., {h(ϕ) ∈ LA | Cfreq(ϕ, r) ∧ Cfree(ϕ, r) satisfied} [7; 8; 3]. The lattice is still explored levelwise. At level k, the data is accessed to compute the frequency and the closure of each candidate set. The infrequent sets can be pruned. Thanks to pruning at level k-1, the frequent sets are free sets. Candidates for the next level can be generated from two free sets (using an Apriori-like generation procedure [1]), and candidates for which at least one subset is not frequent (Cfreq is violated) or that are included in the closure of one of their subsets (i.e., Cfree is violated) are pruned before the next iteration can start. At the end, we compute h(X) for each frequent free set that has been extracted. It turns out that the anti-monotonicity of a constraint like Cfreq ∧ Cfree is used in two phases. First (Criterion 1), we avoid the computation of supersets that do not satisfy the constraint thanks to the Apriori-like generation procedure. Next (Criterion 2), we prune the sets for which some subsets do not satisfy the constraint. The number of pruned candidates in the second phase, i.e., failures for Criterion 2, can be huge for matrices with a number of lines that is small w.r.t. the number of columns, and it can lead to intractable extractions. In other terms, even though these approaches have given excellent results on large matrices for transactional data (e.g., correlated and dense data in WWW usage mining applications), they can fail on expression matrices because of their "pathological" dimensions. Furthermore, we want to enable the use of various discretization procedures and thus the analysis of more or less dense matrices. It appears crucial to us that we can achieve a breakthrough w.r.t. extraction intractability, and this has led to the following original method.
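Before turning to that method, the levelwise free-set extraction just described can be summarized in a toy sketch (again on the helpers above; the cited algorithms are far more careful about Criterion 2):

    def frequent_free_closures(min_freq):
        # Levelwise mining of frequent free sets; emit the closure h(X)
        # of each one: a condensed representation of all frequent sets.
        out = {}
        level = [frozenset([a]) for a in sorted(ALL_ATTRS)
                 if freq({a}) >= min_freq and is_free({a})]
        while level:
            for X in level:
                out[X] = frozenset(h(X))
            nxt = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
            level = [c for c in nxt if freq(c) >= min_freq and is_free(c)]
        return out

    closures = frequent_free_closures(3)
    print(len(closures), len(set(closures.values())))  # free sets vs. closed sets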

3. A NEW METHOD
We have considered the extraction from a transposed matrix, using the Galois connection to infer the results that would have been extracted from the initial matrix. Indeed, one can associate to the lattice on genes the lattice on situations. Elements from these lattices are linked by the Galois operators. The Galois connection gives rise to concepts [17] that associate sets of genes with sets of situations, or, in the transposed matrix, sets of situations with sets of genes. When we have only few situations and many genes, the transposition makes it possible to reduce the complexity of the search.

Definition 6. If X ∈ LA and T ∈ LS, we consider the so-called concepts (X, T) where T = g(X) and X = f(T). By construction, concepts are built on closed sets and each closed set of genes (resp. situations) is linked to a closed set of situations (resp. genes).

Definition 7. The concept theory w.r.t. r, L = LA × LS, and a constraint C is denoted Thc(L, r, C). It is the collection of concepts (X, T) such that X ∈ {ϕ ∈ LA | C(ϕ, r) satisfied}.

In Figure 1, we provide the so-called Galois lattice for the concepts in the data from Table 1. The specialization relation on the sets of genes, which is oriented from the top towards the bottom of the lattice, is now associated with a specialization relation on sets of situations which is oriented in the reverse direction. Indeed, if X ⊂ Y then g(X) ⊇ g(Y).

The collection of the maximally specific sets of genes (e.g., the maximal frequent itemsets) has been called the positive border in [10]. A dual concept is the one of negative border, i.e., the minimally general sets (e.g., the smallest infrequent sets whose every subset is frequent). The lattice is thus split in two parts. On the top, we have the solution, which is bordered by the positive border. On the bottom, we have the sets that do not belong to the solution. The minimal elements of this part constitute the negative border. This duality is interesting: borders are related to a specialisation relation and an anti-monotonic constraint. The bottom part of the lattice can be considered as the solution for the negated constraint; the former positive border becomes the negative border and vice versa.

On the Galois lattice, it is possible to perform two types of extraction: one on the gene space (1), starting from the top of the lattice and following the specialisation relation on the genes, and the other one on the biological situations (2), starting from the bottom of the lattice and following the specialisation relation on the situations. We now define the matrix transposition for a matrix r, the constraint transposition for a constraint C, and we state the central result of the complementarity of the extractions.

Definition 8. If r = (S, A, R) is an expression matrix, the transposed matrix is tr = (A, S, tR) where (a, s) ∈ tR ⇐⇒ (s, a) ∈ R.

Whereas the matrix transposition is quite obvious, the transposition of constraints is less immediate. In the case of the minimal frequency constraint Cfreq, the dual notion of the frequency for the sets of genes is the length of the corresponding sets of situations.

Definition 9. Let C be a constraint on LA; its transposed constraint tC is defined on LS by ∀T ∈ LS, tC(T, r) ⇐⇒ C(f(T), r), where f is the Galois operator. Thus, tCfreq(T, r) ≡ |T| ≥ γ if γ is the frequency threshold for Cfreq.

With respect to gene specialization, tC is monotonic (resp. anti-monotonic) if C is monotonic (resp. anti-monotonic). However, if C is anti-monotonic (e.g., Cfreq) following the gene specialization relation, tC is monotonic according to the specialization relation on the situations: it has to be negated to get an anti-monotonic constraint that can be used efficiently.

Property 1. If C is anti-monotonic w.r.t. gene specialization, then ¬tC is anti-monotonic w.r.t. situation specialization.

We have an operation for the transposition of the data and a new anti-monotonic constraint w.r.t. the specialization relation on the situations. However, to obtain this new anti-monotonic constraint, we had to transpose the original constraint and take its negation: the new extraction turns out to be complementary to the collection we would get with the standard extraction.


Figure 1: A Galois lattice

[The lattice links each closed set of genes ({a1,a3}; {a1,a2,a3,a4}; {a1,a3,a6,a7}; {a6,a7}; {a9,a10}; {a5,a6,a7,a8}; {a1,a2,a3,a4,a9,a10}) to its closed set of situations ({o1,o2,o3,o5}; {o1,o4,o5}; {o2,o3,o4}; {o1,o2,o3}; {o1,o5}; {o4,o5}; {o2,o3}). Gene specialisation is oriented downwards, situation specialisation upwards.]

Definition 10. Given tr and ¬tC, the transposed theory Thc(L, tr, ¬tC) is the transposition of Thc(L, r, C).

Property 2. The concept theory Thc(L, r, C) and its transposed theory Thc(L, tr, ¬tC) are complementary w.r.t. the whole collection of concepts.

Example 4. On the data from Table 1, the sets of genes with a frequency of at least 3 are {a1, a3}, {a6, a7}, {a9, a10}, and {a1, a2, a3, a4}. A closed set of genes has a frequency of at least 3 iff the size of the corresponding situation set is at least 3. When taking the negation of this constraint, we look for the sets of situations whose size is at most 2 (an anti-monotonic constraint w.r.t. the situation specialization). The sets {s1, s5}, {s4, s5} and {s2, s3} are extracted. Clearly, the two collections are complementary (see Figure 2).

The correctness of this extraction method for finding the closed sets of genes from the extractions on transposed matrices relies on this complementarity property. Due to the lack of space, we consider only a straightforward application of this framework that concerns the computation of concepts.

If we compute the closed sets from the gene space, the Galois connection allows us to infer the closed sets of situations. Reciprocally, the extraction on the transposed matrix provides the closed sets on the situations and we can infer the closed sets of genes. Thus, the same collection of closed sets can be extracted from a matrix or its transpose. The choice between one or the other method can be guided by the dimensions of the matrix. On the data from Table 1, the smallest dimension concerns the situations (5 elements) and it leads to 2^5 = 32 possible sets. Among these 32 elements, only 10 are closed. However, extracting the closed sets from the original matrix, which contains 10 columns, leads to a search space of 2^10 = 1024 sets of genes whereas there are still 10 closed sets. To compute the closed sets, we output the closures of the free sets. This is an efficient solution since Cfree is anti-monotonic. Several free sets can however generate the same closed set. On the data from Table 1, the free set extraction provides 41 sets which generate the 10 closed sets, whereas the transposed matrix extraction provides only 17 free sets. We provide real examples in the next section.
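To see the transposition at work on the toy example (a brute-force sketch of ours, certainly not the mv-miner algorithm), one can enumerate the 2^5 = 32 sets of situations and close each of them; since every closed set of genes is f(T) for some T, this recovers the whole collection of closed gene sets from the transposed side:

    from itertools import combinations

    def closed_gene_sets():
        situations = sorted(R1)
        closed = set()
        for k in range(len(situations) + 1):
            for T in combinations(situations, k):
                closed.add(frozenset(galois_f(set(T))))  # f(T) is always closed
        return closed

    print(len(closed_gene_sets()))  # every closed set of genes of Table 1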

4. APPLICATIONS
We have been working on data sets produced with cDNA microarrays at the Stanford Genome Technology Center (Palo Alto, CA 94306, USA). The first data set is described in [16]. It concerns the study of human insulin resistance. From 6 cDNA microarrays (around 42 557 spots for 29 308 UniGene clusters), a typical preprocessing for microarray data has given an expression matrix with 6 lines (situations) and 1 065 columns (genes). It is denoted as the inra matrix. The second data set concerns gene expression during the development of the drosophila [2]. With the same kind of preprocessing, we got an expression matrix with 162 lines and 1 230 columns, denoted droso. To derive boolean matrices, we have encoded the over-expression of genes: for each gene i, we have computed a threshold σi under which the attribute boolean value is false, and true otherwise. Different methods can be used for the definition of threshold σi

and we have done as follows [2]: σi = Maxi × (1 − σ_discr), where Maxi is the maximal expression value for gene i and σ_discr is a parameter that is common to every gene.
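A minimal sketch of this binarization step (array layout and names are ours; the paper's preprocessing pipeline is more involved):

    import numpy as np

    def binarize(expr, sigma_discr):
        # expr: situations x genes matrix; gene i is set to true where
        # its value reaches sigma_i = Max_i * (1 - sigma_discr).
        thresholds = expr.max(axis=0) * (1.0 - sigma_discr)
        return expr >= thresholds

    expr = np.array([[5.0, 1.0], [4.0, 9.0], [1.0, 8.0]])
    print(binarize(expr, 0.2))  # boolean matrix fed to the extraction step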

We used the prototype mv-miner implemented by F. Rioult and have extracted the closed sets under the frequency threshold 1 to get all of them. These experiments have been performed on a 800MHz processor with 4GB of RAM and 3GB of swap (linux operating system). We have used the parameter σ_discr to study the extraction complexity w.r.t. the density of the boolean matrices (the ratio of the number of true values to the total number of values).

First, we compare the extraction in inra and tinra for a given value of σ_discr (Table 2).

We have 41 closed sets in these boolean matrices. The free set extraction on inra provides 667 831 free sets of genes whereas the extraction on tinra provides 42 free sets of situations. In Section 2.2, we have seen that algorithms use anti-monotonic constraints in two ways. Criterion 1 avoids generating some candidates that would have to be pruned. Criterion 2 enables pruning of candidates for which the constraint is not satisfied. Checking Criterion 2 is expensive because it needs to store the sets and check the properties of all their subsets. Table 2 provides the number of sets (for each level in the levelwise search) which satisfy these two criteria and the number of sets that have been examined


Figure 2: Complementarity of the extractions

[The Galois lattice of Figure 1 split in two complementary parts: on top, the concepts with frequency of gene sets > 3 (size of situation sets > 3); on the bottom, those with size of situation sets < 3 (frequency of gene sets < 3). Gene specialisation is oriented downwards, situation specialisation upwards.]

Table 2: Failure/success in pruning for inra and tinra

                       tinra                  inra
size             success  failure      success     failure
1                      6        0          777           0
2                     15        0      172 548     128 928
3                     16        4    2 315 383   4 713 114
4                      6        9    2 965 726   9 371 325
5                      0        2            0   1 544 485
Total                 43       15    5 454 434  15 757 852
Nb free sets          42               667 831
Nb closed sets                  41

when processing inra and tinra. Extraction in tinra is clearly more efficient. Not only does it provide fewer candidate sets to test (43 vs. 5 454 434), but it also leads to far fewer failures: 15 vs. 15 757 852.

We have performed experiments on the two microarray data sets for various discretization thresholds. Considering droso, Table 3 shows that extraction becomes feasible for larger densities. It enables the biologist to explore alternative discretizations.

Results on inra confirm the observations (see Table 4). The difference between standard extraction and transposed matrix extraction is even more spectacular. Indeed, extraction time on the transposed matrix can become negligible w.r.t. the standard extraction time (e.g., 120 ms vs. 368 409 ms). Notice that the number of free sets of genes can be very large w.r.t. the number of closed sets to be found, e.g., 51 881 free sets for only 34 closed sets.

Also, the method has been applied on very large expression matrices derived from human SAGE data (matrix 90 × 12 636) [15]. In these matrices and for different discretization techniques, none of the standard extractions have been tractable while extractions on the transposed matrix have been easy.

5. USING THE CLOSED SETS
An algorithm like mv-miner takes a boolean matrix and provides the free sets on the columns and the closed sets on both the lines and the columns, with their frequencies in the data. From the closed sets of genes and their frequencies, it is indeed possible to select the frequent closed sets given a frequency threshold. When using Cfreq with γ > 1, it is possible to use its transposed constraint and make use of the transposed extractions.

Let us discuss informally how to use the closed sets to derive some knowledge. It is possible to regenerate the whole collection of the frequent sets of genes (resp. situations) from the frequent closed sets of genes (resp. situations). So, the multiple applications of the frequent itemsets become available (e.g., association rule mining, class characterization, some types of clustering, approximation of the joint distribution). Notice also that the extracted collections can be represented as concepts [17] and thus many knowledge discovery tasks based on concept lattices can be considered.

It is quite possible that the number of frequent sets of genes is too huge and that a "blind" regeneration process is not feasible. It is possible to filter at regeneration time, e.g., to take into account some syntactical constraints on the sets of interest. Notice also that the free sets, which have been proved useful for non redundant association rule computation, are missing. When mining the transposed matrix, we get the free sets on situations but not the free sets on genes.

Let us however sketch typical uses of the extracted patterns. Part of this has already been validated on SAGE data analysis in [5]. It is useful to look at the patterns extracted after several discretizations on the same expression matrix. These extractions can provide different sets of co-regulated genes for the same expression data. So, after such computations, the biologist wants to compare different closed set


Table 3: Results for the drosophila data

          σ_discr   density   time (ms)   nb free sets   nb closed sets
tdroso     0.02      0.08          160            965              434
droso      0.02      0.08        1 622          5 732              434
tdroso     0.075     0.015         420          3 667            1 508
droso      0.075     0.015      35 390         60 742            1 508
tdroso     0.1       0.019         721          6 890            2 569
droso      0.1       0.019     146 861        162 907            2 569
tdroso     0.15      0.032       4 526         36 309           10 447
droso      0.15      0.032     failure              -                -
tdroso     0.2       0.047      36 722        410 666           46 751
droso      0.2       0.047     failure              -                -
tdroso     0.25      0.067     455 575      1 330 099          259 938
droso      0.25      0.067     failure              -                -
tdroso     0.3       0.09      failure              -                -
droso      0.3       0.09      failure              -                -

collections, looking at the common patterns, the dissimilarities, etc. Browsing these collections of closed sets can lead to the selection of some of them, e.g., the ones that are almost always extracted.

Then, the biologists can select some concepts for an in-depth study of the interactions between the involved genes. Clearly, one objective criterion for selection can be based on the frequency. One important method concerns the use of information sources about gene functions. Quite often, one of the first post-processing steps is to look for the homogeneity of the sets (e.g., whether they all share the same function). It is then quite interesting to focus on almost homogeneous sets of genes and look at the outliers. This approach has been used in the SAGE data analysis described in [5] and has provided a valuable result: one EST (Expressed Sequence Tag) was always co-regulated with a set of around 20 genes that had the same function, and it is reasonable to suspect that this EST has that function. For this type of post-processing, it is possible to use the various ontologies that are available, e.g., http://www.geneontology.org/, and, e.g., study the homogeneity of the selected sets of genes at different levels (biological process, molecular function, cellular function).

Last but not least, the biologist can choose a given closed set of genes X and then project the original expression matrix on X. Since the size of a closed set will generally be small w.r.t. the size of the whole collection of genes, it is then possible to mine this restricted matrix. For instance, it becomes possible to extract the whole collection of non redundant association rules (free sets of genes in the left-hand side) from this non transposed restricted matrix.

6. CONCLUSION
We have been studying the extraction of groups of genes found to be frequently co-regulated in expression matrices. This type of data raises difficult problems due to the huge size of the search space and to the huge size of the solutions. In [5], it has been shown that the use of condensed representations as described in, e.g., [6; 7] was useful, at least when the number of biological situations is not too small in light of the number of genes. Unfortunately this situation is rarely observed in most of the available gene expression data. We therefore explored the possibility of processing the transposed

matrices by making use of the properties of the Galois connections. This resulted in a very spectacular improvement of the extraction procedure, allowing us to work in contexts where previous approaches failed. [5] has validated the interest of frequent closed sets in biological terms on a reduced set of genes. We are confident that, given the algorithmic breakthrough, biologically significant information will be extracted from the expression data we have to mine. We are furthermore exploring the transposition of other constraints.

Acknowledgements This work has been partially funded by the EU contract cInQ IST-2000-26469 (FET arm of the IST programme). F. Rioult is supported by the IRM department of the Caen CHU, the comité de la Manche de la Ligue contre le Cancer and the Conseil Régional de Basse-Normandie. The authors thank S. Blachon, O. Gandrillon, C. Robardet, and S. Rome for exciting discussions.

8. REFERENCES

[1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI Press, 1996.

[2] M. Arbeitman, E. Furlong, F. Imam, E. Johnson, B. Null, B. Baker, M. Krasnow, M. P. Scott, R. Davis, and K. P. White. Gene expression during the life cycle of drosophila melanogaster. Science, 297, 2002.

[3] Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal. Mining frequent patterns with counting inference. SIGKDD Explorations, 2(2):66–75, Dec. 2000.

[4] R. J. Bayardo. Efficiently mining long patterns from databases. In Proceedings ACM SIGMOD'98, pages 85–93, Seattle, Washington, USA, 1998. ACM Press.

[5] C. Becquet, S. Blachon, B. Jeudy, J.-F. Boulicaut, and O. Gandrillon. Strong association rule mining for large gene expression data analysis: a case study on human SAGE data. Genome Biology, 12, 2002.


Table 4: Results for the human data

         σ_discr   density   time (ms)   nb free sets   nb closed sets
tinra     0.01      0.193          80             19               19
inra      0.01      0.193      16 183         11 039               19
tinra     0.05      0.21           90             29               29
inra      0.05      0.21       42 991         23 377               29
tinra     0.085     0.023         110             33               33
inra      0.085     0.023      83 570         38 659               33
tinra     0.1       0.24          120             36               34
inra      0.1       0.24      368 409         51 881               34
tinra     0.15      0.268         120             40               38
inra      0.15      0.268     failure              -                -

[6] J.-F. Boulicaut and A. Bykowski. Frequent closures as a concise representation for binary data mining. In Proceedings PAKDD'00, volume 1805 of LNAI, pages 62–73, Kyoto, JP, Apr. 2000. Springer-Verlag.

[7] J.-F. Boulicaut, A. Bykowski, and C. Rigotti. Approximation of frequency queries by means of free-sets. In Proceedings PKDD'00, volume 1910 of LNAI, pages 75–85, Lyon, F, Sept. 2000. Springer-Verlag.

[8] J.-F. Boulicaut, A. Bykowski, and C. Rigotti. Free-sets: a condensed representation of boolean data for the approximation of frequency queries. Data Mining and Knowledge Discovery journal, 7(1):5–22, 2003.

[9] R. Godin, R. Missaoui, and H. Alaoui. Incremental algorithms based on Galois (concepts) lattices. Computational Intelligence, 11(2):246–267, 1995.

[10] H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery journal, 1(3):241–258, 1997.

[11] T. M. Mitchell. Generalization as search. Artificial Intelligence, 18:203–226, 1982.

[12] C. Niehrs and N. Pollet. Synexpression groups in eukaryotes. Nature, 402:483–487, 1999.

[13] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1):25–46, Jan. 1999.

[14] J. Pei, J. Han, and R. Mao. CLOSET: an efficient algorithm for mining frequent closed itemsets. In Proceedings ACM SIGMOD Workshop DMKD'00, Dallas, USA, May 2000.

[15] C. Robardet, F. Rioult, S. Blachon, B. Crémilleux, O. Gandrillon, and J.-F. Boulicaut. Mining closed sets of genes from SAGE expression matrices: a spectacular improvement. Technical report, LIRIS INSA Lyon, F-69622 Villeurbanne, March 2003.

[16] S. Rome, K. Clément, R. Rabasa-Lhoret, E. Loizon, C. Poitou, G. S. Barsh, J.-P. Riou, M. Laville, and H. Vidal. Microarray profiling of human skeletal muscle reveals that insulin regulates 800 genes during a hyperinsulinemic clamp. Journal of Biological Chemistry, March 2003. In Press.

[17] R. Wille. Restructuring lattice theory: an approach based on hierarchies of concepts. In I. Rival, editor, Ordered sets, pages 445–470. Reidel, 1982.

[18] M. J. Zaki. Generating non-redundant association rules. In Proceedings ACM SIGKDD'00, pages 34–43, Boston, USA, Aug. 2000. AAAI Press.


Weave Amino Acid Sequences for Protein Secondary Structure Prediction

Xiaochun Yang
Department of Computer Science,
Brigham Young University,
Provo, Utah, 84602 U.S.A.

[email protected]

Bin Wang
Institute of Software,
Northeastern University,
Shenyang, 110004, P.R. China.

wang [email protected]

ABSTRACT
Given a known protein sequence, predicting its secondary structure can help understand its three-dimensional (tertiary) structure, i.e., the folding. In this paper, we present an approach for predicting protein secondary structures. Different from the existing prediction methods, our approach proposes an encoding schema that weaves physio-chemical information into encoded vectors and a prediction framework that combines the context information with secondary structure segments. We employed Support Vector Machines (SVM) for training on the CB513 and RS126 data sets, which are collections of protein secondary structure sequences, through sevenfold cross validation to uncover the structural differences of protein secondary structures. Then, we apply the sliding window technique to test a set of protein sequences based on the group classification learned from the training set. Our approach achieves 77.8% segment overlap accuracy (SOV) and 75.2% three-state overall per-residue accuracy (Q3), which outperforms other prediction methods.

General Terms
Design, Experimentation

Keywords
Protein structure prediction, protein secondary structure, SVM, encoding schema

1. INTRODUCTION
Understanding the function and the physiological role of proteins is a fundamental requirement for the discovery of novel medicines and protein-based products with medical and industrial applications. To enhance this understanding, many researchers focus on the functional analysis of the genome these days. This functional analysis is a difficult problem, and the determination of protein three-dimensional structures is essential in the analysis of genome functions. A protein three-dimensional structure can help understand its function. Unfortunately, it happens that three-dimensional protein structures cannot be accurately predicted from protein (amino acid) sequences. An intermediate, but useful and fundamental, alternative step is to predict protein secondary structures since a protein three-

dimensional structure results partially from its secondary structure.

Protein secondary structures can be categorized into several states: α-helix (H), 3_10 helix (G), π helix (I), isolated β-bridge (B), extended β-strand (E), hydrogen bonded turn (T), bend (S) and random coil (-) [1]. In the previous work on predicting protein secondary structures, researchers focused on classifying these states into three consolidated classes: the helix, strand and coil classes, among which the helix and strand classes are the major repetitive secondary structures [11]. These days the number of known protein sequences has far exceeded the number of known protein secondary structures, which have been obtained through chemical and biological methods such as crystallography and NMR [14], and the rapid pace of the discovery of genomic sequences has further widened the gap. Thus, computational tools for predicting protein secondary structures from protein sequences are needed, and they are essential in narrowing the widening gap [14]. After four decades of research work, the prediction of secondary structures has not yet attained a satisfying accuracy rate. (See Table 1.)

The secondary structure prediction methods developed in the past [8; 17; 19] use local information of a single sequence, but they exhibit major drawbacks. For example, the accuracy measures (Q3) of these methods are relatively low (57% for the 1st generation and 63% for the 2nd generation). A usual criticism of the neural network prediction approach [17] is that the methods are "black boxes." The neural network approach may be an effective classifier; however, it is hard to explain why a given pattern is classified as a helix rather than a strand, and thus it is difficult to derive a conclusion on the merit of the approach. Recently, a more promising approach to solving the prediction problem appeared in [7]. Hua et al. introduce a prediction approach based on the SVM [18], which separates the input data, i.e., the protein secondary structure segments, into two classes to uncover the protein secondary structural classes.

However, most of the existing works did not consider the roles of the physio-chemical properties of each amino acid residue, and to the best of our knowledge, none of them consider the roles of the contexts of helices, strands, or coils in protein sequences, which results in lower prediction accuracy. Our prediction approach builds up an encoding schema that weaves the properties of different chemical groups into secondary structure segments and a two dimensional prediction framework

that considers not only the segments of interest themselves (e.g., helices and strands) but also the contexts of those


Table 1: Prediction of protein secondary structures during the three generations

                           1st Generation   2nd Generation       3rd Generation
Time Period                1960s - 1970s    1980s                1990s
Standard Method            GOR1 [8]         GOR3 [13; 19]        PHD [17]
Approach                   Statistical      Local Interactions   Homologous Protein
                           Information      Among Amino Acids    Sequences
Prediction Accuracy (Q3)   57%              63%                  72%

segments of interest. The encoding schema includes: (i) n-grams for determining the length of training segments, (ii) cardinal vectors for catching similar physio-chemical properties of amino acid residues, and (iii) probabilities of different residues for capturing the tendency that a residue location has a conformation of helical (strand or coil) structure, while the two dimensional prediction framework uses the sliding window technique to exhaustively examine all the possible protein sequence segments and synthesizes classifiers derived from the helical (strand or coil) segments and context segments into the final prediction.

We use two data sets, CB513 [3] and RS126 [16], which contain 513 and 126 sequences of protein secondary structures, respectively, that have been discovered by using different chemical and biological experiments, and we employ SVM to train samples (segments) in the data sets. Then, we use the two dimensional prediction framework to compare each predicted segment with known protein classes. Our prediction approach gets encouraging experimental results under various accuracy measures, and it has some advantages over other prediction methods, which are listed as follows.

• It weaves part of the biological information, such as the chemical group of each residue and the frequency of residues, into the encoding schema, which combines the classical encoding schema, matrix calculation, and a statistical method: we use the classical encoding schema to represent the order of residues, matrix calculation to combine biological information and the order of segments together, and the statistical method to capture the frequency of each residue in the training set.

• Our two dimensional prediction is correlated, i.e., each prediction is not made in isolation, but takes account of the predictions for neighboring residues. It allows protein classification based on not only the local regions of similarity but also the context regions of similarity, in contrast to the detection of global similarities used in the traditional search techniques. Each prediction is based on multiple decision functions, which gets rid of the inherent bias/noise.

• The approach is theoretically sound since the SVM method has been adopted by researchers in mathematics and computer science since its discovery in 1997 for solving classification problems, and it has been proved to be efficient in the determination of the decision function separating two different (in our work, protein) classes, which is exactly what we need in solving the prediction of protein secondary structures.

The rest of the paper is organized as follows. In Section 2, we briefly introduce SVM; in Section 3, we propose an encoding schema that transforms amino acid segments into vectors for further training; in Section 4, we present a two dimensional prediction framework; in Section 5, we discuss our experimental results; and finally we give our conclusions.

2. REVIEW OF SUPPORT VECTOR MACHINE

SVM [18] has been successfully applied in solving a wide range of pattern recognition problems, including face detection [12] and text classification [4]. It is popular due to its effective avoidance of overfitting, its ability to handle large feature spaces, its condensation of data set information, and its promising empirical performance.

The basic idea of SVM is as follows. For a training set containing l samples, where each sample is represented by an n-dimensional input vector x_i (1 <= i <= l) and the training set is regarded as lying in an n-dimensional input space R^n, SVM tries to estimate a decision function f: R^n -> {+1, -1}; i.e., for each n-dimensional input vector x_i in R^n (i = 1, 2, ..., l), there is a classification identifier y_i in {+1, -1} that indicates to which class, +1 or -1, x_i belongs. The decision function is as follows:

  f(\vec{x}) = \mathrm{sign}\Big( \sum_{i=1}^{l} y_i \alpha_i \, (\vec{x} \cdot \vec{x}_i) + b \Big)    (1)

where l is the number of vectors x_i (1 <= i <= l) in the training set, x is a variable representing a testing vector from a test set, y_i in {+1, -1} is a classification identifier, and the coefficients alpha_i are obtained by solving the following convex Quadratic Programming (QP) problem:

Maximize

  \sum_{i=1}^{l} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j \, y_i y_j \, (\vec{x}_i \cdot \vec{x}_j)    (2)

with constraints

  \sum_{i=1}^{l} \alpha_i y_i = 0, \quad 0 \le \alpha_i, \quad i = 1, 2, \ldots, l.    (3)

A vector x can be classified by the decision function in the testing phase by calculating the distances (dot products) between x and the x_i. (Note that the decision function depends only on the dot products of vectors in R^n.)

In the case where a linear boundary is inappropriate for classifying the samples in the training set, a straightforward method can be used to map vectors in R^n into a high-dimensional feature space H via a nonlinear map Phi: R^n -> H. SVM thus maps an input vector x_i in R^n into a feature vector Phi(x_i) in H. The training problem then depends only on the dot products of vectors, Phi(x) . Phi(x_i), in H. However, if H is a high-dimensional space, Phi(x) . Phi(x_i) will be


very expensive to compute [12]. Fortunately, the dot product in H has an equivalent expression K(x, x_i) in R^n such that K(x, x_i) = Phi(x) . Phi(x_i), where K(x, x_i) is a kernel function. Therefore, the kernel function is used to compute Phi(x) . Phi(x_i) in the input space R^n rather than in the potentially high-dimensional feature space H; i.e., the kernel function is used in R^n without the need to know Phi explicitly. If the kernel function contains a bias term,¹ b can be accommodated within the kernel function, and hence the decision function in Equation 1 can be rewritten as Equation 4:

  f(\vec{x}) = \mathrm{sign}\Big( \sum_{i=1}^{l} y_i \alpha_i \, K(\vec{x}, \vec{x}_i) \Big)    (4)

where the coefficients alpha_i satisfy Equation 5 with the constraints in Equation 6:

Maximize

  \sum_{i=1}^{l} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j \, y_i y_j \, K(\vec{x}_i, \vec{x}_j)    (5)

with constraints

  \sum_{i=1}^{l} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, l.    (6)

where C is an upper bound value. A large C assigns a high penalty to classification errors [12].

The most commonly employed kernel functions are shown in Table 2. For a given training set, only the kernel function and the parameter C need to be selected to specify an SVM.
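
As a concrete illustration (not the authors' code), the following minimal Python sketch fits a two-class SVM with the Gaussian radial basis kernel of Table 2. Scikit-learn's SVC, which wraps LIBSVM [2], is an assumption here, and X, y are random stand-ins for encoded segment vectors and their +1/-1 (e.g., helix/non-helix) labels.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((40, 300))                  # 40 toy 300-dimensional encoded vectors
y = np.where(rng.random(40) > 0.5, 1, -1)  # toy helix / non-helix labels

clf = SVC(kernel="rbf", gamma=0.2, C=1.5)  # gamma and C as selected in Section 5.2
clf.fit(X, y)
print(clf.predict(X[:5]))                  # sign of the decision function, Eq. (4)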

3. ENCODING SCHEMA

Each known protein sequence is composed of a number of structured segments. We first extract n-gram segments from known protein sequences, taking into consideration both the structured segments and their contexts. Then we represent each extracted segment as an input vector for further training and prediction.

3.1 N-gram Extraction

It is well accepted that the order of residues in an amino acid sequence plays the most important role in predicting its secondary structure. Having looked at many sequences, we know which segments tend to form a secondary structure and which do not. For this task, we cannot consider each residue separately. Therefore, we use an n-gram model (any segment with n amino acid residues, where n > 0) to extract segments from amino acid sequences and transform them into input vectors. The total number of possible segments of an amino acid sequence using n-gram extraction is m - n + 1, where m (> 0) is the number of residues in the sequence. The choice of n is determined by the average length of the different structure segments (helix or strand). Based on our study, for most helices the length is between five and 17 residues, and for most strands the length is between five and 11. Thus, we vary n between five and 17, beginning with a small size and gradually increasing it to determine the "best" size for future helix/non-helix (or strand/non-strand) prediction.

¹Many employed kernels have a bias term, and any finite kernel function can be made to have one [12].

(It is anticipated that the best size n for each classification should achieve the best prediction result for that classification.)

3.1.1 Subsequences without context

When extracting segments from a training set without considering context, we consider only the structure segments (i.e., helix, strand, or coil) in known protein sequences. This involves identifying the structure segments in the protein sequences of the training set and extracting n-gram subsequences m - n + 1 times from each structure segment sequentially, where m is the number of residues in the segment.

Example 1. Given the protein sequence "MNIFEMLRID" with a helix segment "NIFEML" from position 2 to 7, the four 3-gram subsequences are "NIF," "IFE," "FEM," and "EML."
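
A minimal sketch of this n-gram extraction in Python (ngrams is a hypothetical helper, not from the paper):

def ngrams(segment, n):
    """All m - n + 1 length-n subsequences of a structure segment."""
    return [segment[i:i + n] for i in range(len(segment) - n + 1)]

# Example 1: the helix segment "NIFEML" of "MNIFEMLRID"
print(ngrams("NIFEML", 3))   # ['NIF', 'IFE', 'FEM', 'EML']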

3.1.2 Subsequences with context

Locally encoding structure segments has an inherent disadvantage: the contexts of the structure segments are excluded. Consider the protein sequence "MNIFEMLRID" again. Residue 'M' and subsequence "RID" are contexts of the helix segment "NIFEML," which leads us to infer that the nearest neighbors (one or more residues) of "NIFEML" must have significantly different properties from "NIFEML"; otherwise these neighbors would be included in "NIFEML" as part of the helical structure.

Definition 1. N-gram context subsequences of structure segments. Given a protein sequence s and a secondary structure segment ss of size m that begins at position i in s, the left n-gram context of ss is the subsequence of size n that begins at position i - ⌊n/2⌋ in s, and the right n-gram context of ss is the subsequence of size n that begins at position i + m - ⌈n/2⌉ in s.

Example 2. Given the protein sequence "DQAMRYKT," where the residues "QAM" form a helical structure, the left 5-gram context subsequence of "QAM" is "∅DQAM," where "∅D" are the left two neighbors and ∅ indicates that the following "D" is the head of the sequence. The right 5-gram context of "QAM" is "QAMRY," where "RY" are the right two neighbors.
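
A small sketch of Definition 1 under a 0-indexed convention (contexts is a hypothetical helper; '@' stands in for the head/tail marker written as ∅ in Example 2):

def contexts(seq, start, m, n):
    """Left/right n-gram contexts of the segment seq[start:start+m]."""
    s = "@" * n + seq + "@" * n          # pad so contexts can run off the ends
    i = start + n                        # segment position in the padded string
    left = s[i - n // 2 : i - n // 2 + n]
    right = s[i + m - (n + 1) // 2 : i + m - (n + 1) // 2 + n]
    return left, right

# Example 2: helix "QAM" at position 1 of "DQAMRYKT", 5-gram contexts
print(contexts("DQAMRYKT", 1, 3, 5))     # ('@DQAM', 'QAMRY')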

3.2 Encoding Schema of Amino Acid Residues

The transformation of n-gram subsequences into input vectors is an encoding process. The classical encoding schema was adopted along with a neural network for predicting protein secondary structures [13]: in a segment, each residue is encoded as an orthogonal binary vector [1, 0, ..., 0], ..., or [0, 0, ..., 1], in which '1' indicates the relative position of the corresponding residue. These vectors are 21-dimensional; each of the first twenty coefficients denotes a distinct amino acid in the segment, and the last unit specifies whether the corresponding residue is the "head" or the "tail" of the sequence [13]. However, even when chemical and biological methods, such as crystallography and NMR, are used, the "head" and the "tail" of a segment cannot be accurately determined [14]. Therefore, we use the first 20 units of each 21-dimensional vector to represent a residue, which we call the 20-dimensional residue vector. In this paper, we adopt the following ordered list of residues from [11] to encode a residue at its relative position in the corresponding sequence.


Table 2: The commonly accepted kernel functions

  Typical kernel function                    Explanation
  K(x_i, x_j) = ((x_i . x_j) + θ)^d          Polynomial function
  K(x_i, x_j) = exp(-γ|x_i - x_j|²)          Gaussian radial basis function
  K(x_i, x_j) = tanh(x_i . x_j - θ)          Multi-layer perceptron function

Table 3: Five groups according to chemical properties of amino acid residues.

  R-group classification     5D cardinal vector   Amino acid residues
  Nonpolar, aliphatic (c1)   [1, 0, 0, 0, 0]      A, V, L, I, M
  Aromatic (c2)              [0, 1, 0, 0, 0]      F, Y, W
  Polar, uncharged (c3)      [0, 0, 1, 0, 0]      G, S, P, T, C, N, Q
  Positively charged (c4)    [0, 0, 0, 1, 0]      K, H, R
  Negatively charged (c5)    [0, 0, 0, 0, 1]      D, E

A, V, L, I, M, F, Y, W, G, S, P, T, C, N, Q, K, H, R, D, E

Example 3. Given a training set of protein sequences, the 20-dimensional residue vector of 'V' is [0, 1, 0, ..., 0]_{1×20}, where the second coefficient of the vector is 1, since 'V' is the 2nd residue in the ordered list given above.

In our prediction approach, instead of using the 20-dimensional residue vector alone to represent a residue, we incorporate statistical information and the physio-chemical features of each residue into our encoding schema, enhancing the classical approach by considering the probability of each residue and the different biological groups [11] to which a residue belongs. In Table 3, the twenty amino acid residues are classified into five R-groups, combining the hydropathy index, molecular weight Mr, and pI value [11].

Besides this R-group classification, amino acid residues of secondary structure sequences can be classified into many different groups according to their biological meanings, such as exchange groups [5] and hydrophobicity groups [11] (see the appendix). [11] indicates that if two amino acids have similar biological properties, they should form similar structures. Therefore, we define the cardinal vector to incorporate the biological group information of amino acid residues into a secondary structure sequence.

Definition 2. Cardinal vector. Given an m-class biological group of amino acid residues, its m-dimensional (mD) cardinal vector is a binary vector v_{c_m} = [a_{c_1}, a_{c_2}, ..., a_{c_m}], where a_{c_i} ∈ {0, 1} (1 ≤ i ≤ m) and \sum_{i=1}^{m} a_{c_i}^2 = 1.

Definition 3. Probability of an m-class group with helix (strand or coil) structure in a training set. Given a training set s and an m-class group C = {c_1, ..., c_m} of amino acid residues, the probability that a residue i in group c_j (1 ≤ j ≤ m) has helical structure is

  p^H_{i,c_j} = \frac{1}{N_H} \sum_{i \in c_j} N_{H_i}    (7)

where p^H_{i,c_j} (1 ≤ i ≤ 20) is the probability that residue i belongs to group c_j and is a helix in s, N_H is the total number of residues that have the conformation of helical structure in the training set, and N_{H_i} is the number of occurrences of residue i that belong to c_j and are helices in s. (p^E_{i,c_j} for strand and p^C_{i,c_j} for coil structures are computed accordingly.)

We use the Kronecker product [20] of residue vectors, cardinal vectors, and residue frequencies in the training set to encode a residue, and we use the concatenation of the vectors of n consecutive residues to encode an amino acid subsequence with n residues. Residue vectors distinguish one residue from another, cardinal vectors capture the similar physio-chemical properties of residues, the frequency of a residue captures the tendency of the residue to occur in a particular structure, and the concatenation of vectors preserves the order of residues in a protein subsequence.

Definition 4. Encoding schema of protein segments. Given a training set s and an m-class biological group C = {c_1, ..., c_m}, the encoded vector of a residue i w.r.t. a group c_j (1 ≤ j ≤ m) is

  v_{i,c_j} = p^H_{i,c_j} \times (v_{r_i} \otimes v_{c_m c_j}) = p^H_{i,c_j} \times [a_1 \times v_{c_m c_j}, \ldots, a_{20} \times v_{c_m c_j}]    (8)

where v_{r_i} = [a_1, a_2, ..., a_{20}] is the residue vector of residue i, v_{c_m c_j} = [a_{c_1}, a_{c_2}, ..., a_{c_m}] is the cardinal vector of the m-class group to which i belongs, and p^H_{i,c_j} is the probability that residue i is a helix in s and belongs to group c_j. For each subsequence ss in s, the encoded vector of ss is the concatenation of the vectors of the consecutive residues in ss. Hence, the size of the encoded vector of a residue is 20 × m, and the size of the encoded vector of a subsequence with n residues is 20 × m × n.

Example 4. Consider a training set that contains the following two known protein sequences and their DSSP structures:²

  s1: LWQFNGMIKCK (H)   PS (T, coil)   PLLD (H)   FNN (S, coil)   GCY (T, coil)
  s2: NLLLSDW (E)   W (coil)   HQ (S, coil)   S (coil)   IHKQEVGLS (H)

where H is α-helix, E is extended β-strand, T is hydrogen-bonded turn, S is bend, and a blank label is random coil. Hence,

  Helix (H): LWQFNGMIKCK, PLLD, and IHKQEVGLS
  Strand (E): NLLLSDW
  Coil (S, T, blank): PS, FNNGCY, and WHQS

²DSSP (Dictionary of Secondary Structure of Proteins) is available at http://www.sander.ebi.ac.uk/dssp.


Let us consider the first helix segment, "LWQFNGMIKCK," extracted from s1 and the first coil segment, "PS," also extracted from s1. If we use 3-grams and the R-group classification, i.e., the 5-group classification of Table 3, to train (determine) a helix/non-helix classifier, "LWQ" is represented as a 300-dimensional (20 × 5 × 3) vector V1, in which the first 100 units of V1 (see below) encode the amino acid residue 'L.' Since 'L' belongs to the nonpolar, aliphatic R-group (c1), the probability that 'L' is a helix and belongs to c1 (in the training set) is 33%, since p^H_{L,c1} = 8/24 = 0.33, where 24 is the number of helix residues and 8 is the number of residues that belong to c1 and have the conformation of helical structure in the training set.

  V1 = [ 0, ..., 0 (10 zeros), 0.33 (= p^H_{L,c1}), 0, ..., 0 (89 zeros),    <-- 'L'
         0, ..., 0 (35 zeros), 0.08 (= p^H_{W,c2}), 0, ..., 0 (62 zeros),    <-- 'W'
         0, ..., 0 (70 zeros), 0.33 (= p^H_{Q,c3}), 0, ..., 0 (29 zeros) ]   <-- 'Q'

Both strand and coil structures can be regarded as non-helix in the helix/non-helix classification. Thus, "PS " is a non-helix subsequence of size 3, where the trailing blank in "PS " denotes a space added to make up the required length. The encoding vector of "PS " is V2, shown below.

  V2 = [ 0, ..., 0 (52 zeros), 0.33 (= p^H_{P,c3}), 0, ..., 0 (47 zeros),    <-- 'P'
         0, ..., 0 (45 zeros), 0.33 (= p^H_{S,c3}), 0, ..., 0 (54 zeros),    <-- 'S'
         0, ..., 0 (100 zeros) ]                                             <-- ' '

4. PREDICTION FRAMEWORK

4.1 Training of Encoded Vectors Using SVM

Protein secondary structure prediction is a typical three-class classification problem. A simple strategy for handling the three-class problem is to cast it into a set of binary categorizations. For an n-class classification, n - 1 classes are used to construct training sets: the ith class is trained with all of the samples in the ith class carrying "y = +1" labels and all other example data carrying "y = -1" labels. If the input vectors are transformed from helix (strand or coil) segments, their learned decision function is called the vertical decision function, denoted fv. Otherwise, if the input vectors are transformed from context subsequences, their learned decision function is called a horizontal decision function, which is further classified into the left horizontal decision function derived from the left context subsequences, denoted fhl, and the right horizontal decision function derived from the right context subsequences, denoted fhr. The final prediction synthesizes fv and fhl (fhr), as discussed in detail in Section 4.3.

4.2 Sliding Window

The sliding window method is adopted to move a window over the test protein sequence. A window, which is a one-dimensional frame of slots, is overlaid on and moved over a test protein sequence to examine different segments of its amino acids. It extracts a protein segment of length L (L ≥ 1) to match against a potential structure class using the decision function generated in the training phase. Note that the size of the sliding window should be the same as the size of the n-gram, so that the test vectors can be fed into the same input space as the training vectors.

4.3 Synthesis of Deduced Decision Functions

We develop a heuristic algorithm to synthesize the deduced decision functions for the final prediction.

Algorithm 1. Prediction of a protein sequence s with a sliding window of size n; returns a labeled sequence.

string predict(string s, int n, SVMmodel model,
               SVMmodel l_model, SVMmodel r_model)
{
    bool mark = true;
    int pre_p = -1;                  /* prediction for the previous segment */
    string s_p = new(strlen(s));     /* predicted (labeled) sequence */
    int begin = 0;                   /* begin position of the sliding window */
    while (begin < strlen(s)) {
        int current = svm_predict(segment(s, begin, n), model);
        int left    = svm_l_predict(l_seg(s, begin, n), l_model);
        int right   = svm_r_predict(r_seg(s, begin, n), r_model);
        if (mark)
            begin += l_predict(s_p, begin, current, left, n, &pre_p, &mark);
        else
            begin += r_predict(s_p, begin, current, right, n, &pre_p, &mark);
    }
    return s_p;
}

int l_predict(string s_p, int begin, int current, int left,
              int size, int *pre_p, bool *mark)
{
    int len = size;
    if ((current == -1) && (left == -1)) {
        mark_seq(s_p, begin, len, -1);  *pre_p = -1;
    } else if ((current == 1) && (left == -1)) {
        if (begin == 0) { mark_seq(s_p, begin, len, 1);  *pre_p = 1; }
        else            { mark_seq(s_p, begin, len, *pre_p);  *mark = !(*mark); }
    } else if ((current == -1) && (left == 1)) {
        len = size / 2;  mark_seq(s_p, begin, len, -1);
        *mark = false;  *pre_p = -1;
    } else {  /* current == 1 && left == 1 */
        len = 1;  mark_seq(s_p, begin, len, 1);  *pre_p = 1;
    }
    return len;
}

int r_predict(string s_p, int begin, int current, int right,
              int size, int *pre_p, bool *mark)
{
    int len = size;
    if ((current == -1) && (right == -1)) {
        mark_seq(s_p, begin, len, -1);  *pre_p = -1;
    } else if ((current == -1) && (right == 1)) {
        len = 1;  mark_seq(s_p, begin, len, 1);  *pre_p = 1;
    } else if ((current == 1) && (right == -1)) {
        mark_seq(s_p, begin, len, *pre_p);  *mark = (*pre_p == 1);
    } else {  /* current == 1 && right == 1 */
        len = size / 2;  mark_seq(s_p, begin, len, 1);  *pre_p = 1;
    }
    return len;
}

For a helix/non-helix (strand/non-strand, coil/non-coil)


classifier, we select the best accuracy value (to be discussed in Section 5.1), the corresponding decision function f(x) (including the kernel function and the upper bound C), and the n-gram size for the prediction phase. Thus, given an unknown sequence s, we first use a sliding window of size n and the different biological groups (cardinal vectors) to transform the subsequences of s into different sets of input vectors. Then, for each set of vectors, we apply the derived decision functions to classify each subsequence into either the helix or non-helix (strand or non-strand, coil or non-coil) class. If the different biological groups yield conflicting predictions, a voting strategy decides the final prediction.
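
A minimal sketch of this voting step, under the assumption that it is a simple majority over the labels produced by the classifiers built from the different groupings (Tables 3, 7, and 8):

from collections import Counter

def vote(labels):
    """Majority label among the per-group classifier predictions."""
    return Counter(labels).most_common(1)[0][0]

print(vote(["H", "H", "C"]))   # 'H'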

5. EXPERIMENTAL DATA

The experiments are carried out on the CB513 and RS126 sets, which contain 513 and 126 non-redundant protein amino acid sequences, respectively, whose secondary structures are known. In these two data sets, the assignment of secondary structure is usually performed by DSSP [9], STRIDE [6], or DEFINE [15].

Different assignment methods interpret some residues in protein sequences differently, and therefore influence the prediction accuracy, as discussed by the constructor of CB513. In this paper, we select the DSSP assignment in order to compare with other prediction approaches. The DSSP assignment classifies the eight secondary structure states, H, G, I, B, E, T, S, and - [1], into the three classes helix (H), strand (E), and coil (C). Here, we follow the methods in [7; 17], which treat H, G, and I as helix (H), E as strand (E), and the other states as coil (C). This assignment differs slightly from other assignments reported in the literature. For example, in the CASP competition [10], H contains DSSP classes H and G, while E contains DSSP classes E and B.

We use multiple cross-validation trials to minimize the variation in the results caused by a particular choice of training or test sets. A sevenfold cross-validation is used on CB513 (RS126): CB513 (RS126) is randomly divided into seven subsets of similar size; the data in six subsets are used for training, and the data in the remaining subset are used for testing.

5.1 Evaluation Metrics

For a given sequence, we evaluate our prediction approach and compare it with several well-accepted prediction methods using the following evaluation metrics. First, we evaluate the effectiveness of SVM in our prediction application domain. We then select the best parameters (e.g., kernel function and upper bound C) of the SVM according to the highest accuracy value in Equation 9, and use the selected parameters for further prediction and comparison. For evaluating the effectiveness of our prediction approach, we use the following metrics.

The prediction accuracy using SVM is calculated as:

  AC_{SVM} = \frac{N_{c_i}}{N_i} \times 100\%    (9)

where i is a test set, N_{c_i} is the number of residues correctly predicted in i, and N_i is the number of residues in i.

We also employ several standard performance measures to assess prediction accuracy. The three-state overall per-residue accuracy (Q3) and the segment overlap measure (SOV) [17] are calculated to evaluate accuracy.

Q3, the overall per-residue three-state accuracy [17], is the most widely used measure for assessing the quality of a prediction approach. It is defined as

  Q_3 = \frac{N_H + N_S + N_C}{N} \times 100\%    (10)

where N_H, N_S, and N_C are the numbers of residues correctly predicted in the helix, strand, and coil states, respectively, and N is the number of residues in the predicted sequences. Q3, however, does not account for over-predictions (e.g., non-helix residues predicted as helix) or under-predictions (e.g., helix residues predicted as non-helix).

Another popular accuracy measure, the segment overlap measure (SOV), has been proposed to assess the quality of a prediction in a more realistic manner, by taking into account the type and position of each secondary structure segment, the natural variation of segment boundaries among families of homologous proteins, and the ambiguity at the ends of each segment. SOV is calculated as

  Sov = \frac{1}{N} \sum_{s} \frac{minov(s_{obs}, s_{pred}) + \delta}{maxov(s_{obs}, s_{pred})} \times len(s_{obs})    (11)

where N is the number of residues in the predicted sequences, s ranges over the segments of the predicted sequences, s_{obs} and s_{pred} are the observed and predicted segments, respectively, minov is the actual overlap between s_{obs} and s_{pred}, maxov is the total extent of either segment, and len(s_{obs}) is the length of the observed segment. δ is an integer chosen to be smaller than minov and smaller than one half of the length of s_{obs}; δ = 1, 2, or 3 for short, intermediate, and long segments, respectively.

The per-residue accuracy for each type of secondary structure (Q_H, Q_E, Q_C, Q_H^{pre}, Q_E^{pre}, Q_C^{pre}) was also calculated. We distinguish Q_I and Q_I^{pre} (here, I = H, E, and C) as

  Q_I = \frac{N_{I_{cp}}}{N_{I_o}} \times 100\%, \qquad Q_I^{pre} = \frac{N_{I_{cp}}}{N_{I_p}} \times 100\%    (12)

where N_{I_{cp}} is the number of residues correctly predicted in state I, N_{I_o} is the number of residues observed in state I, and N_{I_p} is the number of residues predicted in state I.

In the following subsections, we use LIBSVM [2], an SVM software package, to evaluate our prediction approach from different points of view, including (i) determination of the kernel function and parameter C, (ii) selection of the window size, and (iii) comparison with other prediction methods.

5.2 Determination of Kernel Functions and Parameters in SVM

Given a data set, a proper kernel function and its parameters must be chosen to construct the SVM classifier; different kernel functions and parameters construct different SVM classifiers. We want to select an optimal kernel function and parameters that achieve a high AC_SVM. However, there are no established theoretical methods for determining the optimal kernel function and its parameters. In [7], the author showed that the Gaussian radial basis function exp(-γ|x_i - x_j|²) provides superior performance in generalization ability and convergence speed. Therefore, we test different parameters γ and upper bound values C for exp(-γ|x_i - x_j|²) and compare the testing accuracies on the helix/non-helix (H/¬H) task. Since the average helix length in


CB513 is eight, we use 8-grams to select the kernel function and parameter C for H/¬H.

As shown in Table 4, for each C we vary γ from 0.001 to 50, where 0.001 is the ratio of 1 to the number of dimensions, 800 (20 × 5 × 8), in the training set. We found that the best accuracy (84.74%) is achieved by exp(-γ|x_i - x_j|²) with γ equal to 0.2 and C equal to 1.5. It is well known that a large C value imposes a high penalty on classification errors [12]. However, Table 4 shows that for each γ between 0.001 and 50, as C increases, the accuracy on the test set first rises to a peak value and then decreases. This is because too small a value of C gives poor accuracy, while too large a value of C leads to overfitting; a proper value of C balances these effects. We found that C = 1.5 is appropriate not only for H/¬H but also for the other classifiers. In the following tests we set C = 1.5 and γ = 0.2 to construct the different SVM classifiers.
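
A hedged sketch of the parameter sweep behind Table 4 (not the authors' script; scikit-learn's GridSearchCV over SVC, which wraps LIBSVM, is an assumption here, and X, y are random stand-ins for the 800-dimensional 8-gram vectors and their labels):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.random((70, 800))                      # toy encoded 8-gram vectors
y = rng.integers(0, 2, 70) * 2 - 1             # toy +1/-1 (H / non-H) labels

search = GridSearchCV(SVC(kernel="rbf"),
                      {"C": [0.01, 0.1, 1.5, 5, 50],
                       "gamma": [0.001, 0.01, 0.2, 0.3, 0.5, 1, 5, 50]},
                      cv=7)                    # sevenfold, as in Section 5
search.fit(X, y)
print(search.best_params_)                     # ideally {'C': 1.5, 'gamma': 0.2}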

5.3 Dependency on Window Size

There is a trade-off between the number of residues used and the level of "noise" when deciding the size of the window. If the window size is too large, the prediction process may be misled by noise, while if the window size is too small, the number of residues available for predicting the protein structure may be insufficient. A proper window size thus leads to good performance.

We construct three different binary classifiers. The optimal window size, n_op, for each binary classifier was determined on CB513. The results in Table 5 indicate that the optimal window size is related to the average length of the secondary structure segments: in general, longer segments require larger optimal window sizes. (The average lengths of helix, strand, and coil in CB513 are 8, 5, and 5, respectively.) It is interesting that the optimal window size found by our prediction approach is similar to the one found by previous NN approaches [13; 17]; e.g., the optimal window sizes for H/¬H are both 13.

5.4 Effectiveness of Our Prediction Approach

We now evaluate the effectiveness of our prediction approach. We first compare it with PHD [17] and an SVM-based method [7]. PHD is one of the most accurate and reliable secondary structure prediction methods based on NNs, and the SVM-based method is a newly proposed method that can outperform PHD. Our comparison is based on the same data sets (CB513 and RS126), the same secondary structure definitions, and the same accuracy assessments. For ease of illustration, we call the SVM-based method SVM01 and our prediction method SVM_SSP. The comparison results among PHD, SVM01, and SVM_SSP are shown in Table 6, where '-' means that the result could not be obtained from the corresponding paper.

On the CB513 set, the Q3 of SVM_SSP is 75.2%, which is 1.7% higher than that of SVM01, and the SOV of SVM_SSP reaches 77.8%, which is 1.6% higher than that of SVM01. The Q3 and SOV scores on the RS126 set are improved by 3.0% and 2.9%, respectively, compared with the PHD results.

The comparison between PHD and the two SVM-based methods indicates that the SVM-based methods outperform PHD, because SVM can successfully avoid many problems that other machine learning approaches often encounter. For example, the structures of NNs (especially the size of the hidden layer) are hard to determine,

Table 6: Comparison with the results of other systems.

                CB513                 RS126
             SVM01    SVM_SSP     PHD      SVM01    SVM_SSP
  Q3         73.5%    75.2%       70.8%    71.2%    73.8%
  Q_H        75.0%    77.5%       72.0%    73.0%    75.1%
  Q_E        60.0%    64.8%       66.0%    58.0%    61.2%
  Q_C        79.0%    79.9%       72.0%    75.0%    81.2%
  Q_H^pre    79.0%    82.1%       73.0%    77.0%    80.9%
  Q_E^pre    67.0%    70.5%       60.0%    66.0%    69.7%
  Q_C^pre    70.0%    70.6%       -        69.0%    70.1%
  Sov        76.2%    77.8%       73.5%    74.6%    76.4%

gradient-based training algorithms only guarantee finding local minima, too many model parameters have to be optimized, and overfitting is hard to avoid. In addition, the comparison between SVM01 and SVM_SSP shows that the encoding schema of SVM_SSP and the two dimensional prediction framework play significant roles in predicting protein secondary structures.

6. CONCLUSIONS

In this paper, we propose an approach to predicting protein secondary structures using SVM. Our prediction approach takes a segment extracted from a given sequence using a sliding window and matches it against the structural classification results computed by SVM. We incorporate the biological classifications of amino acid residues and statistical information into an encoding schema, consider not only structured segments but also their contexts, and choose the most suitable kernel function, parameters, and window size for the prediction of the different protein structure classes. The experimental results show that our two dimensional prediction framework outperforms existing prediction methods.

7. ACKNOWLEDGEMENTS

The authors would like to thank Jianyin Shao of the Chemistry and Biochemistry Department at Brigham Young University for his helpful discussions and suggestions. Special thanks also go to Chih-Chung Chang and Chih-Jen Lin for providing LIBSVM.

8. REFERENCES

[1] Protein Data Bank. http://www.rcsb.org/pdb/, 2002.

[2] C.-C. Chang and C.-J. Lin. LIBSVM: a Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

[3] J. A. Cuff and G. J. Barton. Evaluation and Improvement of Multiple Sequence Methods for Protein Secondary Structure Prediction. Proteins: Struct. Funct. Genet., 34:508–519, 1999.

[4] H. Drucker, D. Wu, and V. Vapnik. Support Vector Machines for Spam Categorization. IEEE Trans. on Neural Networks, 10:1048–1054, 1999.

[5] M. O. Dayhoff, editor. Atlas of Protein Sequence and Structure, volume 5. National Biomedical Research Foundation, Washington, D.C., 1972.


Table 4: Accuracies AC_SVM for different parameters γ and upper bounds C for H/¬H under 8-gram on the CB513 set, where * marks the best accuracy for each C.

  C \ γ   0.001     0.01      0.2       0.3       0.5       1         5         50
  0.01    50.00%    50.00%    51.39%    65.83%    80.83%   *83.89%    50.00%    50.00%
  0.1     50.00%    50.00%    83.89%    83.89%    83.61%   *84.44%    83.61%    51.11%
  1.5     51.39%    82.50%   *84.74%    81.94%    82.50%    80.28%    53.89%    53.89%
  5       79.72%    79.72%    80.83%   *81.94%    81.39%    81.67%    53.89%    53.89%
  50      78.61%    81.28%   *81.67%    81.11%    81.67%    82.78%    78.06%    53.89%

Table 5: Accuracies AC_SVM for different window sizes for each binary classifier on CB513.

  Binary       Window size (n-gram)                                               Best size
  classifier   5         7         9         11        13        15        17     n_op
  H/¬H         79.38%    83.91%    84.82%    85.99%    86.72%    86.13%    84.54%    13
  E/¬E         79.26%    81.65%    83.89%    84.06%    83.38%    82.45%    81.61%    11
  C/¬C         75.12%    78.18%    79.80%    81.06%    80.58%    80.47%    79.01%    11

[6] D. Frishman and P. Argos. Knowledge-Based Protein Secondary Structure Assignment. Proteins, 23:566–579, 1995.

[7] S. Hua and Z. Sun. A Novel Method of Protein Secondary Structure Prediction with High Segment Overlap Measure: Support Vector Machine Approach. J. Mol. Biol., 308:397–407, 2001.

[8] J. Garnier, D. J. Osguthorpe, and B. Robson. Analysis of the Accuracy and Implications of Simple Methods for Predicting the Secondary Structure of Globular Proteins. J. Mol. Biol., 120:97–120, 1978.

[9] W. Kabsch and C. Sander. A Dictionary of Protein Secondary Structure. Biopolymers, 22:2577–2637, 1983.

[10] J. Moult et al. Critical Assessment of Methods of Protein Structure Prediction (CASP): Round II. Proteins, Suppl. 1, 29(S1):2–6, 1997.

[11] D. Nelson and M. Cox. Lehninger Principles of Biochemistry. Worth Publishers, 2000.

[12] E. E. Osuna, R. Freund, and F. Girosi. Support Vector Machines: Training and Applications (A.I. Memo 1602). MIT A.I. Lab, 1997.

[13] N. Qian and T. J. Sejnowski. Predicting the Secondary Structure of Globular Proteins Using Neural Network Models. J. Mol. Biol., 202:865–884, 1988.

[14] H. H. Rashidi and K. L. Buehler. Bioinformatics Basics: Applications in Biological Science and Medicine. CRC Press, 2000.

[15] F. M. Richards and C. E. Kundrot. Identification of Structural Motifs from Protein Coordinate Data: Secondary Structure and First-Level Supersecondary Structure. Proteins, 3:71–84, 1988.

[16] B. Rost and C. Sander. Prediction of Protein Secondary Structure at Better Than 70% Accuracy. J. Mol. Biol., 232:584–599, 1993.

[17] B. Rost, C. Sander, and R. Schneider. Redefining the Goals of Protein Secondary Structure Prediction. J. Mol. Biol., 235:13–26, 1994.

[18] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

[19] M. J. Zvelebil, G. J. Barton, W. R. Taylor, et al. Prediction of Protein Secondary Structure and Active Sites Using the Alignment of Homologous Sequences. J. Mol. Biol., 195:957–961, 1987.

[20] D. Zwillinger, S. G. Krantz, and K. H. Rosen, editors. Standard Mathematical Tables and Formulae (30th edition). CRC Press, 1996.

APPENDIX

A. CARDINAL VECTORS WITH PHYSIO-CHEMICAL PROPERTIES

The exchange groups and hydrophobicity groups are shown in Tables 7 and 8, respectively.

Table 7: Exchange groups of amino acid residues.

  Exchange group   6D cardinal vector     Amino acid residues
  (A)              [1, 0, 0, 0, 0, 0]     C
  (C)              [0, 1, 0, 0, 0, 0]     A, G, P, S, T
  (D)              [0, 0, 1, 0, 0, 0]     D, E, N, Q
  (E)              [0, 0, 0, 1, 0, 0]     H, K, R
  (F)              [0, 0, 0, 0, 1, 0]     I, L, M, V
  (G)              [0, 0, 0, 0, 0, 1]     F, W, Y

Table 8: Hydrophobicity groups of amino acid residues.

  Hydrophobicity group   2D cardinal vector   Amino acid residues
  hydrophobic (O)        [1, 0]               C, D, E, G, H, K, N, Q, R, S, T, Y
  hydrophilic (I)        [0, 1]               A, F, I, L, M, P, V, W


Assuring Privacy when Big Brother is Watching

Murat Kantarcıoglu and Chris Clifton
Department of Computer Sciences, Purdue University
250 N University St, West Lafayette, IN 47907-2066
{kanmurat, clifton}@cs.purdue.edu

ABSTRACT

Homeland security measures are increasing the amount of data collected, processed, and mined. At the same time, the owners of the data have raised legitimate concerns about their privacy and potential abuses of the data. Privacy-preserving data mining techniques enable learning models without violating privacy. This paper addresses a complementary problem: what if we want to apply a model without revealing it? This paper presents a method to apply classification rules without revealing either the data or the rules. In addition, the rules can be verified not to use "forbidden" criteria.

Keywords

Privacy, Security, Profiling

1. INTRODUCTION

The new U.S. government CAPPS II initiative plans to classify each airline passenger with a green, yellow, or red risk level. Passengers classified as green will be subject to only normal checks, while yellow will get extra screening and red won't fly [2]. Although government agencies promise that no discriminatory rules will be used in the classification and that the privacy of the data will be maintained, this does not satisfy privacy advocates.

One solution is to have the government send the classification rules to the owners of the data (e.g., credit reporting agencies). The data owners would then verify that the rules are not discriminatory and return the classification for each passenger. The problem with this approach is that revealing the classification rules gives an advantage to terrorists. For example, knowing that there is a rule "a one-way ticket paid in cash implies yellow risk," no real terrorist will buy a one-way ticket with cash.

This appears to give three conflicting privacy/security requirements: the data must not be revealed, the classifier must not be revealed, and the classifier must be checked for validity. Although these seem contradictory, we show that if such a system must exist, it can be built while achieving significant levels of privacy. We prove that under reasonable assumptions (i.e., the existence of one-way functions and non-colluding parties), it is possible to do classification as in the above example while provably satisfying the following conditions:

• No one learns the classification result other than the designated party.

• No information other than the classification result is revealed to the designated party.

• The rules used for classification can be checked for the presence of certain conditions without revealing the rules.

While such a system still raises privacy issues (particularly for anyone classified as red), the security risks are significantly reduced relative to a "give all information to the government" (or anyone else) model.

2. RELATED WORK

This work has many similarities with privacy-preserving distributed data mining, which was first addressed for the construction of decision trees [11], based on the concepts of secure multiparty computation (discussed below). There has since been work addressing association rules in horizontally partitioned data [6; 7], EM clustering in horizontally partitioned data [10], association rules in vertically partitioned data [13], and generalized approaches to reducing the number of "on-line" parties required for computation [8]. However, the goal in this paper is not to learn a data mining model but to privately use a model. (The other approach to privacy-preserving data mining, adding noise to data before the mining process (e.g., [1]), does not make sense when the goal is to privately evaluate the results of the model on specific data items.) A good survey of privacy and data mining can be found in [9].

2.1 Secure Multiparty Computation

In the late 1980s, work in secure multiparty computation demonstrated that a wide class of functions can be computed securely under reasonable assumptions (see Theorem 2.1). We give a brief overview of this work, concentrating on material that is used later in the paper. For simplicity, we concentrate on the two-party case; extending the definitions to the multiparty case is straightforward.

Secure multiparty computation has generally concentratedon two models of security. The semi-honest model assumeseach party follows the rules of the protocol, but is free tolater use what it sees during execution of the protocol tocompromise security. The malicious model assumes partiescan arbitrarily “cheat”, and such cheating will not compro-mise either security or the results (i.e., the results from the


nonmalicious parties will be correct, or the malicious party will be detected).

We assume an intermediate model: preserving privacy with non-colluding parties. A malicious party may corrupt the results, but will not be able to learn the private data of other parties without colluding with another party. This is a realistic assumption for our problem: both parties want correct results (the class of a passenger) and are unwilling to compromise this goal through malicious behavior. The real security issue is the potential for misuse of the data or rules if accidentally released. The risk that a terrorist would discover rules stored at an airline or credit agency is much greater than the risk that they would alter the complex software to defeat the system, particularly as altering the protocol to maliciously learn the rules could be detected by inserting passengers with known security classes as a continuous test of the system.

The formal definition of secure multiparty computation [5] provides a methodology for proving the security of our approach. Informally, this definition states that a computation is secure if the view of each party during the execution of the protocol can be effectively simulated from the input and the output of that party. This is not quite the same as saying that private information is protected. For example, assume two parties use a secure protocol to compare two positive integers. If A has 2 as its integer and the comparison result indicates that 2 is greater than or equal to the other site's integer, A can conclude that B has 1 as its input. Site A can deduce this information solely from its local input and the final result; the disclosure is a fault of the problem being solved and not of the protocol used to solve it. To prove a protocol secure, we have to show that nothing seen during its execution gives more information than seeing the final result.

Yao introduced secure multiparty computation with his millionaires' problem (and solution): two millionaires want to know who is richer, without either disclosing their net worth [14]. Goldreich extended this to show that there is a secure solution for any functionality [5]. The general secure two-party evaluation is based on expressing the function f(x, y) as a circuit and encrypting the gates for secure evaluation. This leads to the following theorem:

Theorem 2.1. [4] Suppose that trapdoor permutations exist. Then any circuit is privately computable (in the semi-honest model).

Trapdoor permutations can be constructed under the intractability-of-factoring assumption.

The general construction works as follows. The function f to be computed is first represented as a combinatorial circuit. Each party sends a random bit to the other party for each input, and replaces its input with the exclusive-or of the input and the random bit. This gives each party a share of each input, without revealing anything about the input. The parties then run a cryptographic protocol to learn their shares of the result of each gate in the circuit. At the end, the parties combine their shares to obtain the final result. This protocol has been proven to produce the desired result without disclosing anything except that result.
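
The input-sharing step can be sketched in a few lines of Python (an illustration of the idea only, not a full gate-evaluation protocol):

import secrets

def share(bit):
    """Split one input bit into two XOR shares; either share alone is uniform."""
    r = secrets.randbits(1)
    return r, bit ^ r        # (share kept by the owner, share sent to the peer)

a_share, b_share = share(1)
print(a_share ^ b_share)     # recombining the shares recovers the input: 1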

3. PRIVATE CONTROLLED CLASSIFICATION

We first formally define the problem of classifying items using decision rules when no party is allowed to know both the rules and the data, and the rules must be checked for "forbidden" tests. This problem could be solved using the generic method described above; while we have a construction and proof of such a circuit, it is omitted due to space constraints. Instead, we present a comparatively efficient approach based on the notion of an untrusted third party that processes streams of data from the data and rule sources.

3.1 Problem Statement

Given an instance x from site Data (D) with v attributes, we want to classify x according to a rule set R provided by site Government (G). Let us assume that each attribute of x has n bits, and let x_i denote the ith attribute of x. We assume that each classification rule r ∈ R is of the form (L_1 ∧ L_2 ∧ ... ∧ L_v) → C, where C is the predicted class if (L_1 ∧ L_2 ∧ ... ∧ L_v) evaluates to true. Each L_i is either x_i = a or a don't care (always true). (While the don't-care clauses are redundant in the problem definition, they need to be included explicitly in the protocol to mask the number of clauses in each rule. By using don't cares, G can define rules with an arbitrary number of clauses; the other parties gain no information about the number of "real" clauses in a rule.) We assume that for any given x only one rule will be satisfied.

In addition, D has a set F of rules that are not allowed to be used for classification; in other words, D requires F ∩ R = ∅. The goal is to find the class value of x according to R while satisfying the following conditions:

• D will not be able to learn any rules in R,

• D will be convinced that F ∩R = ∅ holds, and

• G will only learn the class value of x and what is implied by the class value.

3.2 Untrusted Non-colluding Site

To achieve a solution that is both secure and efficient, we make use of an untrusted, non-colluding site. The use of such a site was first suggested in [3]; its application to privacy-preserving data mining was discussed in [8]. The key property of such a site is that without active help from one of the other sites it learns nothing, although by colluding with other sites it may be able to obtain information that should not be revealed. Thus, the only trust placed in the site is that it will not collude with any of the other sites to violate privacy. Although this seems like a strong assumption, it occurs often in real life. For example, bidders and sellers on eBay assume that eBay is not colluding with other bidders or sellers against them. In our approach, the untrusted site learns only the number of literals v, the number of rules |R|, and how many literals of a given rule are satisfied. It does not learn what those literals are, what the class is, how they relate to literals satisfied by other rules or other data items, or anything else except what is explicitly stated above.

3.3 Commutative Encryption

A key tool used in this protocol is commutative encryption. An encryption algorithm is commutative if the following two equations hold for any feasible encryption keys K_1, ..., K_n ∈ K, any item to be encrypted m ∈ M, and any


permutations i, j, and all m_1, m_2 ∈ M such that m_1 ≠ m_2:

  E_{K_{i_1}}(\ldots E_{K_{i_n}}(m) \ldots) = E_{K_{j_1}}(\ldots E_{K_{j_n}}(m) \ldots)    (1)

and, for any given k and ε < 1/2^k,

  \Pr\big(E_{K_{i_1}}(\ldots E_{K_{i_n}}(m_1) \ldots) = E_{K_{j_1}}(\ldots E_{K_{j_n}}(m_2) \ldots)\big) < ε    (2)

Commutative encryption is used to check whether two items are equal without revealing them. For example, assume that party A has item i_A and party B has item i_B. To check whether the items are equal, each party encrypts its item and sends it to the other party: party A sends E_{K_A}(i_A) to B, and party B sends E_{K_B}(i_B) to A. Each party then encrypts the received item with its own key, giving party A E_{K_A}(E_{K_B}(i_B)) and party B E_{K_B}(E_{K_A}(i_A)). At this point, they can compare the encrypted data. If the original items are the same, Equation 1 ensures that they have the same encrypted value. If they are different, Equation 2 ensures that with high probability the encrypted values differ. During this comparison, each site sees only the other site's encrypted value. Assuming the security of the encryption, this simple scheme is a secure way to check equality.

One example of a commutative encryption scheme is Pohlig-Hellman [12], based on the assumed intractability of the discrete logarithm problem. This or any other commutative encryption scheme will work for our purposes.
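
A minimal Python sketch of such a scheme, using modular exponentiation in the style of Pohlig-Hellman (the tiny prime is for illustration only; a real deployment needs a large, carefully chosen modulus):

import math, random

p = 2**61 - 1                       # toy Mersenne prime modulus

def keygen():
    """A valid key must be invertible modulo p - 1."""
    while True:
        k = random.randrange(3, p - 1)
        if math.gcd(k, p - 1) == 1:
            return k

def enc(k, m):
    """E_k(m) = m^k mod p; encryptions commute because exponents multiply."""
    return pow(m, k, p)

iA, iB = 123456789, 123456789       # the two parties' items (equal here)
kA, kB = keygen(), keygen()
# each party encrypts its own item, then the other party's ciphertext
print(enc(kB, enc(kA, iA)) == enc(kA, enc(kB, iB)))   # True iff iA == iB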

3.4 Protocol for Private Controlled Classification

We now show how to solve the problem presented in Section 3.1 between sites D, G, and an untrusted, non-colluding site C, where C learns only the number of attributes v, the number of rules |R|, and the number of literals satisfied by each rule for a given instance x.

The basic idea behind the protocol is that sites D and G send synchronized streams of encrypted data and rule clauses to site C. The order of the attributes is scrambled in a way known to D and G, but not to C. Each attribute is given two values, one corresponding to don't care, the other to its true value. Each clause also has two values for each attribute: one is simply an "invalid" value (masking the real value); the other is the desired result, either a (for a clause x_j = a) or the agreed-upon "don't care" value. C checks whether either the first or the second values match; if so, then either the attribute is a match or the clause is a don't care. If there is a match for every clause in a rule, then the rule is true.

The key is that the don't care, true, and invalid values are encrypted differently for each data/rule pair in the stream, in a way shared by D and G but unknown to C. The order (whether the first slot holds the true value or the don't-care value) also changes, again in a way known only to D and G. Since all values are encrypted (again, with a key unknown to C), the non-colluding site C learns nothing except which rule matches. Since the rule identifier is also encrypted, this is useless to C.

The protocol operates in three phases: encryption, prediction, and verification. Three methods are used for encryption: a one-time pad based on a pseudo-random number generator shared by D and G, a standard encryption method E to encrypt data sent to C, and a commutative method Ec for checking for forbidden rules. To aid the discussion, a summary of the functions and symbols used is given in Table 1.

Table 1: Function and Symbol Descriptions

  Symbol          Description
  E_K             Encryption with key K
  Ec_K            Commutative encryption with key K
  Rg              Common pseudo-random generator shared by sites D and G
  x               Data item to be evaluated, a vector of attributes x_j
  A               Rules x attributes matrix of encrypted instances created by site D
  R[i]            Rule i, consisting of clauses corresponding to the attributes of x
  B               Rules x attributes matrix of encrypted rules created by site G
  n_j, (n_j + 1)  Values outside the domain of the jth attribute
  i               Index for rules
  j               Index for attributes
  σ               Index for "match" and "invalid" pairs

Protocol 1 Private classification: encryption phase

Require: D and G share a pseudo-random generator Rg; G has a private generator Rpg. Rg(k) (Rpg(k)) denotes the kth number generated by the corresponding generator. Let n be a vector of elements where n_j and n_j + 1 are values outside the domain of x_j.
  cntd = cntg = cntpg = 0;

At site D:
  Kr = Rg(cntd++);
  for i = 1; i <= |R|; i++ do
    for each attribute x_j of x do
      σ = Rg(cntd++) mod 2;
      A[i][j][σ] = E_Kr(x_j ⊕ Rg(cntd++));
      A[i][j][(1 - σ) mod 2] = E_Kr(n_j ⊕ Rg(cntd++));
    end for
  end for
  send A to site C;

At site G:
  randomly permute the rules in R;
  Kr = Rg(cntg++);
  // R[i][j] is the ith rule's jth literal; R[i][v + 1] is the ith rule's predicted class
  for i = 1; i <= |R|; i++ do
    for j = 1; j <= v; j++ do
      σ = Rg(cntg++) mod 2;
      if R[i][j] is of the form x_j = a_j then
        B[i][j][σ] = E_Kr(a_j ⊕ Rg(cntg++));
        B[i][j][(1 - σ) mod 2] = E_Kr((n_j + 1) ⊕ Rg(cntg++));
      else   // don't care
        B[i][j][σ] = E_Kr((n_j + 1) ⊕ Rg(cntg++));
        B[i][j][(1 - σ) mod 2] = E_Kr(n_j ⊕ Rg(cntg++));
      end if
    end for
    rc = Rpg(cntpg++);
    B̄[i] = (rc, E_Kr(rc) ⊕ R[i][v + 1]);
  end for
  send B, B̄ to site C;


In the encryption phase, site D first encrypts every attribute of x by xoring it with a random number. The xor with a random number is effectively a one-time pad, preventing site C from learning anything. An additional attribute value is added corresponding to the don't-care condition, using the "illegal" value n_j not in the domain of x_j. Site G creates encrypted values for each literal based on whether it must match or is a don't care. For a don't care, B[i][j][(1 - σ) mod 2] will be the same as A[i][j][(1 - σ) mod 2] (n_j xored with the same random number). G applies a slightly different cryptographic scheme for class values (again padding with a newly generated random number each time), ensuring that C does not see that different rules may have the same class value. This is necessary to ensure that C does not learn the class distribution of different instances over multiple runs of the protocol. This is repeated for every rule (giving multiple encryptions of the same x, but with different encryption each time). The encryption phase is described in detail in Protocol 1.

Protocol 2 Private classification: prediction phase

Require: A, B, and B̄ have been generated and sent to site C.

At site C:
  for i = 1; i <= |R|; i++ do
    if ∀j, 1 ≤ j ≤ v: (A[i][j][0] == B[i][j][0] ∨ A[i][j][1] == B[i][j][1]) then
      (rs, cs) = B̄[i];
      break;
    end if
  end for
  randomly generate rf;
  send (rs, cs ⊕ rf) to D and send rf to G;

At site D:
  receive (rd, cd) from site C;
  send cd ⊕ E_Kr(rd) to site G;

At site G:
  receive r from site C and c from site D;
  output c ⊕ r as the classification result;

Site C compares the vectors to find which rule is satisfied in its entirety. However, it cannot directly send the prediction result to site G, as this would reveal which rule is satisfied, since the result is the encrypted (and distinct for each rule) value rather than the actual class. C instead sends the result xored with a random number to site D. D decrypts this to get the class, but the true class value is masked by the random number generated by C. Finally, G combines the information it gets from site C and site D to learn the classification. This process is fully described in Protocol 2.
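
To make the data flow concrete, the following single-process Python sketch simulates Protocols 1 and 2 on toy data. All names here are hypothetical: SHA-256 keyed hashing stands in for E, and Python's random.Random stands in for the shared generator Rg, consumed in lockstep by D and G.

import hashlib, random

MASK = (1 << 32) - 1

def E(key, m):
    """Deterministic keyed 'encryption': equality-preserving, like E_Kr."""
    return int.from_bytes(hashlib.sha256(f"{key}|{m}".encode()).digest(), "big") & MASK

def site_D(x, num_rules, n, seed):
    rg = random.Random(seed)                      # shared generator Rg
    Kr = rg.getrandbits(32)
    A = []
    for _ in range(num_rules):
        row = []
        for j, xj in enumerate(x):
            sigma = rg.getrandbits(1)
            slots = [None, None]
            slots[sigma] = E(Kr, xj ^ rg.getrandbits(32))
            slots[1 - sigma] = E(Kr, n[j] ^ rg.getrandbits(32))
            row.append(slots)
        A.append(row)
    return A, Kr

def site_G(rules, n, seed, priv_seed):
    rg, rpg = random.Random(seed), random.Random(priv_seed)
    Kr = rg.getrandbits(32)
    B, Bbar = [], []
    for clauses, cls in rules:                    # clauses[j]: value or None (don't care)
        row = []
        for j, aj in enumerate(clauses):
            sigma = rg.getrandbits(1)
            slots = [None, None]
            if aj is not None:                    # literal x_j = a_j
                slots[sigma] = E(Kr, aj ^ rg.getrandbits(32))
                slots[1 - sigma] = E(Kr, (n[j] + 1) ^ rg.getrandbits(32))
            else:                                 # don't care: matches D's n_j slot
                slots[sigma] = E(Kr, (n[j] + 1) ^ rg.getrandbits(32))
                slots[1 - sigma] = E(Kr, n[j] ^ rg.getrandbits(32))
            row.append(slots)
        rc = rpg.getrandbits(32)
        B.append(row); Bbar.append((rc, E(Kr, rc) ^ cls))
    return B, Bbar

def site_C(A, B, Bbar):
    for i in range(len(B)):
        if all(A[i][j][0] == B[i][j][0] or A[i][j][1] == B[i][j][1]
               for j in range(len(B[i]))):
            rs, cs = Bbar[i]
            rf = random.getrandbits(32)           # C's masking random
            return (rs, cs ^ rf), rf              # message to D, message to G
    raise ValueError("no rule matched")

# demo: x = (1, 0); rule "x0 = 1 (x1 don't care) -> class 7"
n = [10, 10]                                      # values outside each domain
rules = [((1, None), 7), ((0, 0), 3)]
A, Kr = site_D((1, 0), len(rules), n, seed=99)
B, Bbar = site_G(rules, n, seed=99, priv_seed=7)
to_D, to_G = site_C(A, B, Bbar)
rd, cd = to_D
print((cd ^ E(Kr, rd)) ^ to_G)                    # G recovers the class: 7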

3.5 Checking for Forbidden Rules

Protocols 1 and 2 generate the correct classification without revealing rules or data. Protocol 3 shows how C and D can test whether F ∩ R = ∅ (presumably before D returns the masked result to G in Protocol 2).

The main idea of Protocol 3 is that equality can be checked without revealing the items, using commutative encryption. Since site D knows which random numbers are used by site G, it can mask F in the same way using those random numbers and encrypt the result with E_Kr. Site D also gets the masked rules after encryption by site C, encrypts those, and returns them to C. C now has the double-encrypted version of the

Protocol 3 Private classification: verification phase

Require: B is the rule set received from G in Protocol 1; Ec is a commutative encryption method.

At site C:
  K = { create 2 · |F| · v random keys };
  for i = 1; i <= |F|; i++ do
    for j = 1; j <= v; j++ do
      En[i][j][0] = Ec_K[i][j][0](B[rf][j][0]);
      En[i][j][1] = Ec_K[i][j][1](B[rf][j][1]);
    end for
  end for
  send En to site D;

At site D:
  receive En;
  Kd = { create 2 · |F| · v random keys };
  for i = 1; i <= |F|; i++ do
    for j = 1; j <= v; j++ do
      En[i][j][0] = Ec_Kd[i][j][0](En[i][j][0]);
      En[i][j][1] = Ec_Kd[i][j][1](En[i][j][1]);
    end for
  end for
  generate Fe, the matrix of encrypted forbidden rules, from F using the method used by G to generate B from R in Protocol 1;
  for i = 1; i <= |F|; i++ do
    for j = 1; j <= v; j++ do
      Ef[i][j][0] = Ec_Kd[i][j][0](Fe[i][j][0]);
      Ef[i][j][1] = Ec_Kd[i][j][1](Fe[i][j][1]);
    end for
  end for
  send Ef, En to site C;

At site C:
  receive Ef, En;
  for i = 1; i <= |F|; i++ do
    for j = 1; j <= v; j++ do
      Ez[i][j][0] = Ec_K[i][j][0](Ef[i][j][0]);
      Ez[i][j][1] = Ec_K[i][j][1](Ef[i][j][1]);
    end for
  end for
  for i = 1; i <= |F|; i++ do
    if ∀j, 1 ≤ j ≤ v: (En[i][j][0] == Ez[i][j][0] ∨ En[i][j][1] == Ez[i][j][1]) then
      output that F ∩ R = ∅ does not hold;
    end if
  end for


classification rule set R. It now encrypts F. The items can then be compared, since Ec_{k_1}(Ec_{k_2}(A)) = Ec_{k_2}(Ec_{k_1}(A)).

3.6 Security of the Protocol

To prove that this protocol reveals nothing but the number of matching literals, we use the secure multiparty computation paradigm of defining a simulator whose output is computationally indistinguishable from the view seen by each party during execution of the protocol. This reduces to showing that the received messages can be simulated; the algorithm itself generates the rest of the view. During the encryption phase, D and G receive no messages. Assuming the encryption is secure, the output of the encryption is computationally indistinguishable from a randomly chosen string over the domain of the encryption output. We can define the simulator as follows. The simulator randomly generates numbers and assigns them to As. For Bs, a number of locations corresponding to the number of matching literals are first chosen at random; for these locations, A[i][j][σ] is used for B[i][j][σ], while for the other locations random values are generated.

Assume that the output of the simulator's As, Bs is not computationally indistinguishable from the A, B seen during the protocol, and let M be the distinguisher. Then M must be able to distinguish between some A[i][j][σ] and As[i][j][σ]. This means M can distinguish between a random number and E_Kr(X ⊕ R), contradicting our secure encryption assumption. A similar argument applies to B.

For the classification phase, site D sees only rd and cd. Since both are random numbers, a simple random number generator can be used as the simulator. The message G receives can be simulated using a random number generator (simulating the message from C) and the actual result (result ⊕ r). For the verification phase, note that what each party sees is an item encrypted with a random key for each field. Assume that the positions of the literals that are equal are part of the final result (changing the order of the literals at each execution prevents this from revealing any real information). Site C can use the same simulator given for the encryption phase. Therefore we conclude that the method is secure under the stated assumptions.

In practice, site C would operate in a streaming mode, with D sending each new instance x and G re-encrypting the rules R. To C these would appear to be continuous streams; the repetition in R is masked by the random pad. This prevents C from learning anything based on results over multiple instances (e.g., giving C a single encrypted rule set for all instances would enable C to learn which rule was most common, even without knowing what that rule was).

One interesting advantage of this method over the generic method is that by colluding (e.g., under a court order), any two sites can reveal what the third site has. For example, if G does use a forbidden rule, C and D can collaborate to expose what that rule is. This is not directly possible with generic circuit evaluation, since a malfeasant G could delete the keys used in the protocol to permanently hide the forbidden rule.

4. COST ANALYSIS

Privacy is not free. Keeping the necessary information private requires many encryptions for each classification. Given v literals in each rule, both sites G and D perform O(|R| · v) encryptions in the encryption phase. The prediction phase requires a single encryption. Verification requires O(|F| · v) encryptions. The total number of encryptions is then O((|R| + |F|) · v).

The communication cost of the protocol is similar. Assume the size of each encrypted value is t bits. The encryption phase sends O(|R| · v · t) bits, and the prediction phase sends 3t bits. In the verification phase the set F is transmitted in encrypted form, O(|F| · v · t) bits. The total communication cost is O((|R| + |F|) · v · t) bits.
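To make the asymptotics concrete, here is a small illustrative calculator (our sketch, with arbitrary example parameters, not figures from the paper):

def protocol_costs(R, F, v, t):
    # |R| rules, |F| forbidden rules, v literals per rule, t-bit ciphertexts
    encryptions = (R + F) * v + 1     # encryption and verification phases, plus one for prediction
    bits = (R + F) * v * t + 3 * t    # ciphertext bits sent across all three phases
    return encryptions, bits

# e.g., 100 rules, 10 forbidden rules, 20 literals per rule, 1024-bit values:
print(protocol_costs(100, 10, 20, 1024))  # (2201, 2255872): about 2.3 Mbit per instance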

While the communication cost may seem high, particularly since this is the cost per instance to be evaluated, the protocol is likely to work well in practice. Bandwidth is growing rapidly; it is generally latency that limits performance. This protocol adapts well to streaming: the small number of messages, and the fact that each site sends only one per phase, means a continuous stream of instances could be fed through the system. The throughput of the system could approach the bandwidth of the communication network.

Note that generic circuit evaluation is likely to be significantly more expensive. We have a circuit construction that solves the problem with O((|R| + |F|) · v · n) encryptions, where n is the number of bits needed to represent a literal. The circuit size is O(|R| · |F| · v · n) gates, giving a bit-transmission cost of O(|R| · |F| · v · n · t). Due to the necessity of representing all possible inputs in the construction of the circuit, a significantly more efficient approach based on generic circuit evaluation is unlikely.

5. CONCLUSIONS

As the use of data mining results for homeland security and other potentially intrusive purposes increases, privately using these results will become more important. We have shown that it is possible to ensure privacy without compromising the ultimate goal.

The protocol presented here is a first cut at addressing this problem. Developing a practical solution requires additional work. One example is negative clauses: we have a method to easily extend this protocol to handle clauses with negation, but it reveals at least an upper bound on the number of clauses with negation. Developing a more secure approach, and proving it secure, is an open topic. Other issues include non-boolean rules (e.g., best match), allowing multiple rule matches, supporting less structured data formats (e.g., text), and demonstrating practical efficiency.

Continued research in privacy-preserving data mining must take into account not just generating the results, but efficiently and privately using them. Another research direction is lower bounds on computation and communication cost. We would like to explore the trade-offs between cost and privacy. We believe that research can develop methods to increase security in our lives while sacrificing less privacy than generally assumed.


Dynamic Inference Control

Jessica Staddon
Palo Alto Research Center
3333 Coyote Hill Rd.
Palo Alto, CA 94304, USA
[email protected]

ABSTRACT

An inference problem exists in a multilevel database if knowledge of some objects in the database allows information with a higher security level to be inferred. Many such inferences may be prevented prior to any query processing by raising the security level of some of the objects; however, this inevitably impedes information access, as a user with low authorization who queries just one of the objects with raised security must seek clearance even when not in danger of making the inference. More flexible access control is possible when inferences are prevented during query processing, but this practice can result in slow query response times. We demonstrate that access control can be made sufficiently dynamic to ensure easy access to the information users are entitled to, while retaining fast query processing. Our inference control schemes provide collusion resistance and have a query processing time that depends only on the length of the inference channels (not on the length of user query histories). In addition, our schemes provide a property we call crowd control that goes beyond collusion resistance to ensure that if a large number of users have queried all but one of the objects in an inference channel, then no one will be able to query the remaining object, regardless of the level of collusion resistance provided by the scheme.

1. INTRODUCTION

In a multilevel database an inference problem exists if users are able to infer sensitive information from a sequence of queries that each have a low security classification (i.e., are not sensitive). For example, any user may be able to query a database to retrieve a list of ship names and the ports at which they are docked. In addition, the knowledge of which ports are being used to load ships with weapons may have a low security classification. Yet, these two queries together constitute an inference channel, because if both are made then it is possible to infer exactly which ships are carrying weapons, and this may be sensitive.

When protecting against inferences, there is an inherent trade-off between the granularity of the access control and the query processing time. The approach to inference control proposed in [17; 21] requires essentially no query processing. In [17; 21], the security levels of specific objects in the database are raised in order to prevent a user with low security clearance from completing enough queries to be able to make an undesired inference. By ensuring that at least one object in each inference channel requires high clearance, low-clearance users are prevented from making inferences. However, a user who only wants to query one particular object whose security level has been raised will be unable to do so without receiving additional authorization, even though the information they seek may be completely innocuous on its own. Hence, because access controls are predetermined, this approach may impede access to information unnecessarily.

Another approach to inference control is to determine at query time whether a query can be safely answered. This can be done, for example, by maintaining user query histories. When a user makes a query, it is checked against the user's query history and all known inference channels before granting access to the results of the query. More sophisticated methods of query-time inference control use inference engines to determine access [9]. However, since query histories can be quite long, this approach can result in slow query processing.

We introduce a new approach to inference control that allows for fast query processing while enabling fine-grained access control, and thus, flexible information access. In our approach, access-enabling tokens are associated with objects in the database, and users are allocated keys that they use to generate the necessary tokens. Once a token is used to query an object, the key it was derived from cannot be used to query any other object in the inference channel. This is implemented by deleting (or, revoking) the tokens generated with this key from other objects in the channel. Hence, query processing depends on the length of the channel rather than the ever-growing user query histories. In addition, because initially the same tokens are associated with each object, our approach allows for flexible information access. A user can access any objects in the inference channel provided doing so will not enable the user to make the undesired inference, even through collusion with other users.

Our approach to inference control is inspired by cryptographic revocation techniques for large groups of users. The motivating intuition behind the use of these techniques is that group dynamics play an essential role in ensuring inference control: it is not enough to consider only the user requesting the object when deciding whether or not to grant access; instead, all of the users of the database, and all of the queries they have made, should somehow be taken into account. The difficulty comes in finding a way to do this without relying on the time-consuming processing of user query histories. As an example, one might imagine solving the problem by associating counters with objects in the channel, and cutting off access to the channel when the counters get too high. However, the counters reflect x queries by 1 user and 1 query by each of x users in the same way, and this does not sufficiently capture the access dynamics. By leveraging ideas from group communications we are able to provide some separation between these cases, and an automated mechanism for updating the access controls that is far more likely to be affected by large-scale querying by a few users than by scattered queries by many users. More precisely, our schemes provide collusion resistance and a desirable new property that we call crowd control. Crowd control ensures that if a large number of users have queried all but one of the objects in the channel, then no one will be able to query the remainder of the channel, even if they have never queried the database before.

Overview. In Section 1.1 we discuss related work, and in Section 2 we provide the necessary background definitions and notation. In Section 3 we present our two inference control schemes, and in Section 4 we discuss their attributes and highlight areas for future work.

1.1 Related Work

Our approach to inference control relies on the identification of inference channels in an initial pre-query processing step. Of course, it is impossible to identify all inference channels [24], but previous work has demonstrated significant success in identifying inferences both from the database schema [6; 17; 2; 26] and from the data itself [5; 8; 28]. Our inference control techniques work with either approach; however, we note that if the latter approach is used then it should be re-run periodically to accommodate changes in the data, and this could lead to a need to distribute additional keys to users (if, for example, an inference channel of a length exceeding all others is identified).

Inferences may also be detected at query processing time (see [9; 25; 12]). This approach may lead to long query processing since it most effectively operates on a user's entire query history (in conjunction with a logic-based inference engine).

Once the channels are identified, we assign keys to users and tokens to the objects in the inference channels in such a way that users have flexibility in the objects they choose to access, provided the users are unable to complete the channel. Because the queries made by each user must affect the access capabilities of other users in order to ensure collusion resistance and the desirable property of crowd control, we draw inspiration for our schemes from secure group communication. In particular, the key allocation method employed in our schemes is similar to what is currently known as a subset-cover scheme [14] in the group communication setting. We believe that these cryptographic techniques are in many ways better suited to the inference control problem than to the group communication setting. For example, if a large number of users have queried all but one object in an inference channel, then it may be wise to consider this information to be in the public domain and to block all users (even those who have never queried the database) from querying the remaining object in the channel. The analogous situation in group communication is the revocation of an innocent user as the result of the revocation of several unauthorized users. This is a well-known problem in group communication that much work has gone into remedying (see, for example, [10]), yet it is a desirable property of an inference control scheme (what we call crowd control).

As mentioned in Section 1, an alternative approach to inference control that can be viewed as fitting in our general framework is to maintain a bit for each object indicating whether or not it has been released as the result of a query (see, for example, [9]). Our approach allows for more accurate access control because we can distinguish between n users each accessing m−1 objects in an inference channel of length m, and n(m−1) users each accessing a single object in the channel. The method of [9] cannot, and this distinction is important since in the former case the remaining object in the channel should probably be blocked from all users if n is large, but in the latter case this may be unnecessary.

Finally, we note that if it is determined that a user's current query will enable them to make an undesired inference, then this may be prevented by using query response modification (see, for example, [9; 23]). Our techniques can be used in conjunction with query response modification; that is, rather than requiring that the user receive higher authorization before completing the inference channel, it is also possible to simply modify the response, provided that still yields sufficiently useful information.

A survey of many of the existing approaches to inference control is in [4].

2. PRELIMINARIES

We denote the number of users of the database by n, and the users themselves by U_1, . . . , U_n. We view inference channels as being composed of objects. For our purposes, "object" is a generic term to describe a unit of information that is recoverable from the database, for example, a fact, attribute or relation. The example inference channel of Section 1 consists of objects that are each relations: the relation between ships and ports, and the relation between ports and weapons. We use m to denote the number of objects in an inference channel, and let O_1, . . . , O_m denote the objects in the channel. We sometimes refer to an inference channel of length m as an m-channel. The ship-port inference channel of Section 1 is a 2-channel.

In the initialization phase, users receive a set of encryption keys that they use to prove authorization (perhaps in addition to authenticating themselves to the database) to access the objects in an inference channel. Users prove authorization by encrypting some information specific to their object of interest. For example, the token might be an encryption of the attribute names needed to pose the query to the database. Referring back to the example of Section 1, we might require that the user form the token that is an encryption, under an approved key, of the attributes "ship name" and "port" concatenated together (i.e., the attributes are first represented as bit strings, then concatenated and then encrypted).^1

We denote the set of keys allocated to U_i by K_i, i = 1, . . . , n. Each encryption key may be known to several users. Before any queries have been made, each encryption key can potentially be used to access any object in the channel. However, users only have enough keys to generate tokens for a proper subset of the objects in the channel.

^1 It is possible that with this approach a sequence of queries will be treated as though they form an inference channel when they do not. This is a common problem in inference control. We discuss ways to remedy this in Section 4.


We say a user has maximally queried the channel if they have used all possible keys to query the channel. By limiting the number of keys per user, we guarantee collusion resistance: a coalition of users cannot together query all the objects in the inference channel. The following definition makes this more precise.

Definition 1. Let c be an integer, 0 < c ≤ n. We say that an inference protection scheme is c-collusion resistant if c users acting in collusion are unable to query all the objects in any inference channel.

Once an encryption key is used to gain access to object O_i, the same key cannot be used to gain access to any object O_j, j ≠ i, in the same inference channel. This is accomplished by deleting (or, revoking) the tokens generated by that key from the rest of the inference channel. In other words, part of the automated access control mechanism is to update the list of acceptable tokens for objects in the inference channel each time an object in the channel is accessed (this processing time depends on the length of the channel). Hence, because a key may be used by many users, the queries of each user potentially affect the access capabilities of many users. For example, if two of the keys used to query an object in the channel belong to the same user, then this user will be unable to maximally query the channel. These properties allow us to achieve a property that we call crowd control: if many users have queried a maximal portion of an inference channel (e.g., m−1 out of m objects), then no user should be able to complete the inference channel by making the remaining query. The reasoning here is that if an object has been queried many times, it is more likely to have been leaked, and so it may be prudent to consider the query results to be in the public domain. So, if this is the case for most of the channel, access to the remaining object should be explicitly prohibited.

Definition 2. Let 0 < ε < 1, and let U denote a randomly selected user. Consider a c-collusion resistant inference protection scheme. If, when more than x sets of c users have together queried some set of m − 1 objects, O_{i_1}, . . . , O_{i_{m−1}}, in inference channel {O_1, . . . , O_m}, U cannot, with probability at least 1 − ε, access the remaining object {O_1, . . . , O_m} − {O_{i_1}, . . . , O_{i_{m−1}}}, then we say the scheme has (x, c, ε)-crowd control.

Figure 1 is a high-level example of how token sets, and consequently access control, change as a result of user queries. The figure simplifies our approach in two ways. First, we do not show the keys each user receives, but rather just the tokens they generate from them (depicted as ovals). The second, and more important, simplification is that tokens corresponding to different objects appear identical. In reality, the tokens are particular to the object with which they are associated, while the encryption keys used to generate them may be the same (see the discussion in Section 4 for more on this).

An important part of our analysis is assessing how information access changes as more users query the channel. To do this we need to understand how one user's set of keys relates to another user's set of keys. This is important because with each query the set of acceptable tokens for every other query can change. More precisely, we often study how much the key sets of one group of users, say U_1, . . . , U_r, cover the key set of a user U_{r+1}. The size of the cover is the number of keys U_{r+1} has in common with at least one of U_1, . . . , U_r, that is, the value |K_{r+1} ∩ (∪_{i=1}^{r} K_i)|.

[Figure 1 omitted: a token-set diagram for the 3-channel {O1, O2, O3} showing the initial state, the state after U1 queries O1 and O2, and the state after U3 queries O1 and O2.]

Figure 1: In this example there are four users, each with two tokens (or, two keys that they use to generate tokens), and collusion resistance is c = 1. For i = 1, 2, 3, the i-th column indicates which tokens can be used to access object O_i after the queries listed on the left-hand side have been executed. After both U1 and U3 have queried objects O1 and O2, no one can query object O3 (it has no more acceptable tokens) but everyone can still access both O1 and O2.

Finally, for simplicity of exposition, we often assume that a fractional quantity (i.e., a/b) is an integer. This should be clear from context.

3. DYNAMIC INFERENCE CONTROL

We assume that inference channels have been identified (for example, using a tool such as [17]) prior to the deployment of our inference control scheme. Our protocol consists of three phases:

• Key Allocation: Users are allocated (m_max − 1)/c keys, where m_max is the maximum length of an inference channel in the database, and c is the desired degree of collusion resistance.

• Initialization of the database: For each inference channel Q = {O_1, . . . , O_m}, a set of tokens, T_i, is associated with each object such that each user is capable of generating exactly (m − 1)/c tokens in T_i, for i = 1, . . . , m. Initially, that is, prior to any queries, the token sets are identical: T_1 = T_2 = . . . = T_m.


• Query processing: If token t ∈ T_i is used to gain access to O_i, then for every s ≠ i, any token in T_s that was generated by the same key is deleted. Hence, the token sets change as queries are made (see the sketch following this list).
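The following runnable sketch (our illustration, not the paper's; it assumes the bucket-based allocation of Section 3.1 and uses HMAC as a stand-in for the token-generating encryption, which the paper leaves abstract) walks through all three phases on the 3-channel of Figure 1:

import hmac, hashlib, secrets

def make_buckets(m_max, c, q):
    # Key Allocation, step 1: (m_max - 1)/c buckets of q random keys each
    return [[secrets.token_bytes(16) for _ in range(q)]
            for _ in range((m_max - 1) // c)]

def allocate_keys(buckets):
    # each user draws one randomly selected key from every bucket
    return [secrets.choice(bucket) for bucket in buckets]

def token(key, obj):
    # object-specific token generated from a (possibly shared) key
    return hmac.new(key, obj.encode(), hashlib.sha256).digest()

class Channel:
    def __init__(self, objects, buckets, m, c):
        # Initialization: identical token sets T_1 = ... = T_m,
        # built from the keys of (m - 1)/c chosen buckets
        keys = [k for bucket in buckets[: (m - 1) // c] for k in bucket]
        self.tokens = {o: {token(k, o) for k in keys} for o in objects}

    def query(self, user_keys, obj):
        # Query processing: grant access on an acceptable token, then
        # revoke the used key's tokens from every other object
        for k in user_keys:
            if token(k, obj) in self.tokens[obj]:
                for other in self.tokens:
                    if other != obj:
                        self.tokens[other].discard(token(k, other))
                return True
        return False  # no acceptable token left: access denied

# Figure 1 scenario: m = 3, c = 1, q = 2; every user holds (m-1)/c = 2 keys
buckets = make_buckets(m_max=3, c=1, q=2)
users = [allocate_keys(buckets) for _ in range(4)]
chan = Channel(["O1", "O2", "O3"], buckets, m=3, c=1)
for u in (0, 2):  # U1 and U3 each query O1 and then O2
    assert chan.query(users[u], "O1") and chan.query(users[u], "O2")
# If U1 and U3 jointly hold all four keys, O3 is now unqueryable by anyone
print("tokens left for O3:", len(chan.tokens["O3"]))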

We present two schemes that differ in how the first stage is completed; initialization of the database is essentially the same for both, and query processing is as described above. The first is a simple randomized scheme that achieves probabilistic guarantees on crowd control (that is, ε > 0). We analyze that scheme according to the crowd control and information access it permits as a function of the number of users querying the database.

We also present an algebraic scheme that offers deterministic guarantees on information access. Due to space constraints we do not analyze the scheme, although its analysis is similar to that of the randomized approach.

3.1 A Probabilistic Key Allocation Scheme

To allocate keys to the users we adopt a "bucket"-based approach that is often used in secure group communication (see, for example, [10]). Let there be (m_max − 1)/c buckets, B_1, . . . , B_{(m_max−1)/c}, each containing q encryption keys (that is, there are q(m_max − 1)/c keys total). The keys themselves are randomly generated and are of a bit length suitable for the symmetric encryption scheme being used. For i = 1, . . . , n, U_i receives a randomly selected key from each bucket, for a total of (m_max − 1)/c keys per user.

Token sets for an inference channel of length m ≤ m_max are formed by choosing a subset of (m − 1)/c buckets, B_{i_1}, . . . , B_{i_{(m−1)/c}}, and using the keys in each bucket to generate the tokens for each object in the m-channel, as described in Section 2 (i.e., initially, every key in B_{i_1} ∪ . . . ∪ B_{i_{(m−1)/c}} can be used to access each object). Hence, before any queries are made, each user has the ability to query any (m − 1)/c objects, and so the scheme is c-collusion resistant. The following theorem shows how, for a given value of ε, the degree of crowd control afforded by the scheme depends on m and q. The idea behind the theorem is that a large set of users is likely to cover the key set of another user, so if these users have maximally queried the channel, the other user will be blocked.

Theorem 1. Let 0 < ε < 1, and let c denote the collusion resistance of an instance of the probabilistic inference control scheme. Let x = ln(1 − (1 − ε)^{c/(m−1)}) / ln(1 − c/q). Then the scheme has (x, c, ε)-crowd control.

Proof. It suffices to show that if more than x = ln(1 − (1 − ε)^{c/(m−1)}) / ln(1 − c/q) sets of c users have queried m − 1 objects, O_{i_1}, . . . , O_{i_{m−1}}, in an inference channel of length m, then the probability that a user U can query O_{i_m} is less than ε. U is unable to query O_{i_m} if the key sets of the users who have queried any of the other objects in the channel cover the relevant part of U's key set (i.e., those keys of U's that are useful for querying this channel). Of course, this happens with probability 1 if U has made (m − 1)/c other queries, and otherwise, the probability of this covering is at least (1 − (1 − c/q)^x)^{(m−1)/c}. Setting the latter quantity to be at least 1 − ε and solving for x gives the quantity in the theorem statement.
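As a quick numeric check of the formula as reconstructed above (our sketch), the threshold for the parameters used in Figure 2 comes out near 70 users:

import math

def crowd_threshold(q, m, c, eps):
    # x from Theorem 1: ln(1 - (1 - eps)^(c/(m-1))) / ln(1 - c/q)
    return math.log(1 - (1 - eps) ** (c / (m - 1))) / math.log(1 - c / q)

print(crowd_threshold(q=10, m=16, c=1, eps=0.01))  # ~69.4, matching Figure 2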

[Figure 2 omitted: a plot titled "Crowd Control (q = 10, m = 16)" of P(U can query O_m | x) against x, the number of users who have each queried O_1, . . . , O_{m−1}.]

Figure 2: This figure shows how U's access to an object in an m-channel changes as the number of users who have each accessed the m − 1 other objects in the channel grows. Here, c = 1, q = 10 and m = 16 (hence, this scheme can accommodate 10^15 users). When the number of users is 70 or more, U can only access the object with probability at most .01.

Note that this theorem can be used to lower bound the probability that U can access object O_m when x sets of c users each have accessed any portion (i.e., not necessarily a maximal portion) of O_1, . . . , O_{m−1}. However, it is a far weaker bound in this case. We claim that this is as it should be: if a large number of users have maximally queried the channel, then for security purposes the remainder of the channel should be blocked; however, if the users have only partially accessed the channel, then the security risk of leaving it accessible is far less. To get a better idea of how access changes as a function of x, we include a concrete example of the access probabilities in Figure 2.

Theorem 1 shows that if a particular (m − 1)-subset of the objects in a channel has been queried many times, then users will be unable to query the remaining object in the channel whether or not they have already made some queries. As the following lemma shows, however, it is likely that most users will still be able to access some ((m−1)/c)-subset of the queried objects.

Lemma 1. Let 0 < ε < 1, and let c denote the collusion resistance of an instance of the probabilistic inference control scheme. If x > ln(1 − (1 − ε)^{c²/m²}) / ln(1 − 1/q²) users have each maximally queried an m-channel, then with probability greater than 1 − ε, a user U who is not amongst the x users can maximally query the same channel.

Proof. A user U cannot maximally query the channel if two of U's keys have been used to query a single object. This is impossible if, for every pair of U's keys, one of the users who has maximally queried the channel holds both keys. The probability of this is at least (note this is a coarse bound) (1 − (1 − 1/q²)^x)^{(m/c)²}. Setting this quantity to be greater than 1 − ε, we get the bound in the lemma statement.
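Evaluating the (coarse) bound numerically under the same parameters as Figure 2 gives a sense of its scale (our illustration of the formula as reconstructed above):

import math

def lemma1_threshold(q, m, c, eps):
    # x from Lemma 1: ln(1 - (1 - eps)^(c^2/m^2)) / ln(1 - 1/q^2)
    return math.log(1 - (1 - eps) ** (c * c / (m * m))) / math.log(1 - 1 / q**2)

print(lemma1_threshold(q=10, m=16, c=1, eps=0.01))  # ~1009: past this many maximal
# queriers, a fresh user can still maximally query with probability >= 0.99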

The above results demonstrate how a threshold number of queries to a channel affects the access capabilities of other users, but they do not describe how information access changes before the threshold is reached. We prove a lower bound on information access as a function of the number of users querying the channel by using the fact that any of U's keys that have not been used to query an object in the channel may be used to query any object in the channel.

Theorem 2. Let 0 < α < 1. Consider a c-collusion resistant instance of the probabilistic inference control scheme. If the number of users who have maximally queried the channel is x < q(1 − α)ε^{c/((m−1)(1−α))}/e, then with probability at least 1 − ε, another user can access any subset of objects in the channel of size α(m−1)/c.

Proof. Consider a set of x + 1 users, U ∉ {U_1, . . . , U_x}. The expected number of U's keys that U_1, . . . , U_x cover is (m−1)x/(cq). We show that the probability that U_1, . . . , U_x cover more than 1 − α of the keys U has for querying the channel is less than ε. From Chernoff bounds [13], it follows that this is true when the following inequality holds:

( e^{q(1−α)/x − 1} / (q(1−α)/x)^{q(1−α)/x} )^{(m−1)x/(cq)} < ε.

This inequality is satisfied by the x bound given in the theorem statement.
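For intuition, evaluating the bound as reconstructed above with some example parameters (our illustration; the parameter values are arbitrary, not the paper's):

import math

def thm2_max_users(q, m, c, eps, alpha):
    # x from Theorem 2: q(1 - alpha) * eps^(c/((m-1)(1-alpha))) / e
    return q * (1 - alpha) * eps ** (c / ((m - 1) * (1 - alpha))) / math.e

print(thm2_max_users(q=1000, m=16, c=1, eps=0.01, alpha=0.5))  # ~99.6: with up to
# ~99 maximal queriers, a fresh user can access half the channel w.p. >= 0.99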

To reduce user U's access to an inference channel, two users sharing distinct keys with U must use these keys to query the same object, O, in the channel. In such a situation, U is unable to use either key to query other objects in the channel, while U only needs one of the keys to query O; hence one of U's keys cannot be used. Restricting a user's access to the channel therefore depends on more of the key allocation structure than we use in Theorem 2, and so we suspect that this lower bound is generally not tight (and Lemma 1 provides particular evidence that it is not).

3.2 A Variant with Deterministic Guarantees

The key allocation scheme of Section 3.1 guarantees crowd control and information access probabilistically, as a function of the number of users querying an inference channel. In some settings deterministic guarantees are necessary. We can achieve some deterministic guarantees by using error-correcting codes to allocate keys to users. Specifically, we use Reed-Solomon codes (see, for example, [27]) to allocate keys to users. Reed-Solomon codes use a polynomial that is unique to each user to determine each user's codeword, or in our case, key set. Because two polynomials of the same degree intersect in at most as many points as their degree, two users in our inference control scheme will share at most as many keys as the degree of their polynomials. Using this fact, we construct a scheme with deterministic (i.e., ε = 0) information access guarantees. The following makes the key allocation part of such a scheme more precise; the initialization of the database and the query processing are both just as before.

We consider all polynomials of degree t, 0 < t < (m_min − 1)/c, over the finite field F_q of q elements, where m_min is the minimum length of an inference channel. To each user, U, we associate a unique such polynomial, p_U(x) ∈ F_q[x]. For each element (γ, β) ∈ F_q × F_q we generate a random key, k_{γ,β}, of the desired bit length. User U receives the set of keys K_U = {k_{γ,p_U(γ)} | γ ∈ A ⊆ F_q}, where A is a set of size (m_max − 1)/c. Note that this is very similar to the bucket-based construction of Section 3.1, except that using polynomials to allocate keys from the "buckets" gives more control over the overlap between users' key sets. The following lemma demonstrates one of the deterministic guarantees provided.
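A small sketch of this allocation (ours, not the paper's; it assumes a prime q, so that F_q is simply arithmetic mod q, and uses a tiny degree-1 example):

import secrets

q, t = 11, 1  # field size (prime, so F_q = integers mod q) and polynomial degree
keys = {(g, b): secrets.token_bytes(16) for g in range(q) for b in range(q)}  # k_{gamma,beta}

def user_keys(coeffs, A):
    # K_U = { k_{gamma, p_U(gamma)} : gamma in A }, with p_U given by coeffs
    p = lambda x: sum(c * x**i for i, c in enumerate(coeffs)) % q
    return {keys[(g, p(g))] for g in A}

A = range(5)                # |A| = (m_max - 1)/c evaluation points
u1 = user_keys((3, 7), A)   # p(x) = 3 + 7x
u2 = user_keys((5, 7), A)   # parallel line: shares no keys with u1
u3 = user_keys((10, 0), A)  # agrees with u1's polynomial only at x = 1
assert len(u1 & u2) == 0 and len(u1 & u3) <= t  # overlap bounded by degree t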

Lemma 2. Consider a c-collusion resistant inference protection scheme that uses the key allocation method of this section. If x < (m−1)(1−α)/(ct) users have maximally accessed an m-channel, then another user can access any subset of objects in the channel of size α(m−1)/c with probability 1.

Proof. Consider a user U who is not among the x users who have maximally accessed the channel. U shares at most t keys with each of the x users, and so at most tx < t · ((m−1)(1−α)/(ct)) = (m−1)(1−α)/c keys total. Hence, U has more than α(m−1)/c keys that none of the x users have, and can access a different object in the channel with each key.

An analysis very similar (but a bit more involved due to the fact that keys are not assigned independently) can be performed to prove crowd control and lower bounds on information access. The scheme performs comparably to the earlier one.

4. DISCUSSION AND OPEN PROBLEMS

We have concentrated on a single inference channel for simplicity of exposition. When using our methods to prevent inferences across multiple channels, a potential problem arises when a single object appears in channels of different lengths. To ensure that no inferences can be made by exploiting the varying channel lengths, it may be necessary to reduce the number of acceptable keys associated with objects in overlapping channels, thus reducing information access.

In addition to focusing on a single channel, we have also assumed a fixed user base. Over time, however, it is likely that users will lose access to the database (i.e., be revoked) and new users will be added. The scheme of Section 3.1 has a fixed upper bound on the number of users it can accommodate: q^{(m_max−1)/c} (the bound for the scheme of Section 3.2 may be less, depending on t). To allow for the addition of new users we can choose q to be larger than is currently required. Of course, this will have consequences for crowd control: increasing q means users' key sets are more disjoint, and so the queries of individual users tend to have less of an impact on the access capabilities of others.

When a user is revoked from accessing the database, their keys must no longer be valid. Simply deleting the tokens generated from their keys might unfairly restrict the access of others, so rekeying of the valid users may be needed. There is a large body of work on revocation in group communication that we may leverage for this problem. For example, efficient techniques for key updating over a broadcast channel are provided in [16; 10]. In addition, we note that protection against key leaks can be built into the key allocation scheme by ensuring that if a group of users pool their keys and leak a subset of the resulting pooled set, we will be able to trace at least one of the culprit users. Such tracing capability can be incorporated into exactly the type of key allocation we use in this paper (see, for example, [19]), but it will tend to increase the crowd control threshold, x.

As mentioned earlier (and as Lemma 1 indicates), our lower bound on information access for the scheme given in Theorem 2 is not tight. A proof that takes more of the combinatorics of the key allocation scheme into account should produce a better bound. In addition, there may be ways to reduce user storage while retaining inference control. Using techniques similar to those in [14], it may be possible to allocate to each user a smaller set of keys that they can then use to generate additional keys, and thus, tokens.

Our approach offers more flexible information access than existing approaches with fast query processing, but it still may not offer quite the flexibility afforded by schemes that rely on user query histories to determine access control. This is because, as mentioned in Section 1.1, we require users to form tokens that are derived from information that is specific, but not necessarily unique, to the object of the user's interest. We could easily base the tokens on unique information, but we think that doing so would increase the total number of inference channels so much as to negatively impact performance.

Finally, we note that with our schemes it is possible to determine what information a user could have accessed just from the current access control state (i.e., the value of the token sets at various points in time). For example, any key that appears in a token set T_i and no others must have been used to query object O_i, whereas a key that appears in all the token sets could not have been used. Hence, if a user is known to be compromised, the authorities can use their knowledge of the user's keys to determine what information the user could have accessed. Of course, provided the users' key sets are kept private unless such a security breach occurs, the fact that keys are shared amongst users also provides desirable privacy to the users of the database.

Acknowledgements

The author is grateful to Teresa Lunt for numerous helpful discussions and to David Woodruff for comments on an earlier draft of this paper.

5. REFERENCES

[1] A. Barani-Dastjerdi, J. Pieprzyk and R. Safavi-Naini. Security in databases: A survey study. Manuscript, 1996.

[2] L. Binns. Implementation considerations for inference detection: Intended vs. actual classification. In Proceedings of the IFIP WG 11.3 Seventh Annual Working Conference on Database Security, 1993.

[3] D. Denning and T. Lunt. A multilevel relational data model. In IEEE Symposium on Security and Privacy, 1987.

[4] S. Jajodia and C. Meadows. Inference problems in multilevel secure database management systems. In Information Security: An Integrated Collection of Essays, M. Abrams et al., eds., IEEE Computer Society Press, 1995, pp. 570-584.

[5] J. Hale and S. Shenoi. Catalytic inference analysis: Detecting inference threats due to knowledge discovery. In IEEE Symposium on Security and Privacy, 1997, pp. 188-199.

[6] T. Hinke. Database inference engine design approach. In Database Security II: Status and Prospects, 1990.

[7] T. Hinke, H. Delugach and A. Chandrasekhar. A fast algorithm for detecting second paths in database inference analysis. Journal of Computer Security, 1995.

[8] T. Hinke, H. Delugach and A. Chandrasekhar. Layered knowledge chunks for database inference. In Database Security VII: Status and Prospects, 1994.

[9] T. Keefe, M. Thuraisingham and W. Tsai. Secure query-processing strategies. IEEE Computer, Vol. 22, No. 3, 1989, pp. 63-70.

[10] R. Kumar, S. Rajagopalan and A. Sahai. Coding constructions for blacklisting problems without computational assumptions. In Advances in Cryptology – Crypto '99, pp. 609-623.

[11] T. Lunt. Access control policies for database systems. In Database Security II: Status and Prospects, pp. 41-52.

[12] D. Marks, A. Motro and S. Jajodia. Enhancing the controlled disclosure of sensitive information. In Proceedings of the European Symposium on Research in Computer Security, Rome, Italy, September 1996.

[13] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 2000.

[14] D. Naor, M. Naor and J. Lotspiech. Revocation and tracing schemes for stateless receivers. In Advances in Cryptology – Crypto 2001.

[15] T. Lin. Inference secure multilevel databases. In Database Security VI, pp. 317-333.

[16] M. Naor and B. Pinkas. Efficient trace and revoke schemes. In Financial Cryptography 2000.

[17] X. Qian, M. Stickel, P. Karp, T. Lunt and T. Garvey. Detection and elimination of inference channels in multilevel relational database systems. In IEEE Symposium on Security and Privacy, 1993.

[18] G. Smith. Modeling security-relevant data semantics. In IEEE Symposium on Security and Privacy, 1990.

[19] D. Stinson and R. Wei. Key preassigned traceability schemes for broadcast encryption. Lecture Notes in Computer Science 1556 (1999), pp. 144-156 (SAC '98 Proceedings).

[20] B. Sowerbutts and S. Cordingley. Database architectonics and inferential security. In Database Security IV, pp. 309-324.

[21] M. Stickel. Elimination of inference channels by optimal upgrading. In IEEE Symposium on Security and Privacy, 1994.

[22] B. Thuraisingham. Data mining, national security, privacy and civil liberties. To appear in Knowledge Discovery and Data Mining Magazine (SIGKDD), 2003.

[23] B. Thuraisingham. Mandatory security in object-oriented database systems. In Proceedings of OOPSLA '89.

[24] B. Thuraisingham. Recursion theoretic properties of the inference problem. In IEEE Third Computer Security Foundations Workshop, 1990.

[25] B. Thuraisingham. Towards the design of a secure data/knowledge base management system. In Data & Knowledge Engineering, 1990.

[26] B. Thuraisingham. The use of conceptual structures for handling the inference problem. In Database Security V, pp. 333-362.

[27] J. van Lint. Introduction to Coding Theory. Springer-Verlag, 1999.

[28] R. Yip and K. Levitt. Data level inference detection in database systems. In IEEE Eleventh Computer Security Foundations Workshop, 1998.