smm: a data stream management system for knowledge discovery 1 hetal thakkar, nikolay laptev, hamid...

SMM: A Data Stream Management System for Knowledge Discovery

1

Hetal Thakkar, Nikolay Laptev, Hamid Mousavi, Barzan Mozafari, Vincenzo Russo, Carlo Zaniolo

Computer Science Department UCLA

Data Stream Management Systems (DSMS)

2

• DSMS critical in a variety of applicationso Click-stream analysis,o Algorithmic Tradingo Network monitoringo Credit card fraud detection …

• Many DSMS Projects and Prototypes :o STREAM (Stanford), Aurora/Borealis (Brown, MIT),

Telegraph (UCB), Gigascope (AT&T), Stream Mill (UCLA), … and so on.

• Commercial Startups and vendor extensions:o StreamBase, Aleri, Coral8, Apama, Truviso,o DBMS vendors …

• Support for online mining on data streams: unresolved issue for current systems.

Two Main Research Challenges

3

• Challenge I: Fast and Light algorithms needed for online mining algorithms.

• Challenge II: These and business intelligence applications require the Quality of Service (QoS) of DSMS. Thus these algorithms must be deployed as part of a DSMS.

• Much research on first challenge—a stream of papers in DM conferences—but not on the second that is probably even harder.

Data Stream Mining & DSMS QoS

4

•DSMS: Support continuous queries over massive data streams – with QoS (Quality of Service) guarantees and– (Quasi) Real-time response through:

o Scheduling, query optimization,o Windows and other Synopseso Load shedding …

• But - Current DSMS focus on simple continuous queries- Using query languages based on SQL- Lackluster history of SQL with KDD- DSMS bring more problems:

e.g. blocking queries not allowed.

Knowledge Discovery from DBs (KDD) vs. SQL

5

OLAP in Relational DBMS: simple SQL extensions brought rich payoffs to vendors.

Extending DBMSs for Data Mining proved much harder: Limited expressive power of SQL and OR-DBMS Apriori in DB2 [Saravagi’ 98] proved extremely difficult and not as

efficient as the cache-mining task.CAC Imielinski & Mannila [CACM’96]: A call for Declarative DM Vendors: Libraries of Mining Methods

o IBM: DB2 Intelligent Miner-o Oracle Data Minero OLE DB for DM (DMX)

Mining Models, Predictive Model Markup Language (PMML) Closed proprietary systems, limited extensibility, usability … compared with open systems such as WEKA.

+ CACM’96] M’96]

Our Stream Mill Miner (SMM) Syst.

6

• Efficient support for online mining algorithms: DSMS QoS, scalability, load shedding, synopses

• Expressive Power & Extensibility: User-Defined Aggregates (UDAs), with windows and slides.

• Genericity: Mining Algorithms for arbitrary windows (logical/physical), & tables with any number of columns

• Abstract Mining Models & Mining Workflows: analysts do not care to see SQL code

• GUI to further enhance ease of use.

Expressive Power and Extensibility by UDA functions

7

• UDAs are needed to express mining algorithms.

• UDAs can be written in:

• An external PL (as other DBMS & DSMS do), or

• In SQL itself (then SQL becomes Turing-Complete!)

• Using the following template:o INITIALIZE: the first tuple

o ITERATE: each subsequent tuples

o TERMINATE : After the end of the relation/stream

• UDAs are invoked in the same way as built-in aggregates

UDA Example: same as SQL AVG

8

AGGREGATE myavg(Next Real) : Real { TABLE state(tsum Real, cnt Int); INITIALIZE: { INSERT INTO state VALUES (Next, 1) } ITERATE: { UPDATE state SET tsum = tsum+Next, cnt = cnt+1; } TERMINATE: {

INSERT INTO RETURN SELECT tsum/cnt FROM state; }

}Blocking UDAs can be

applied to data streams only if

we use Windows !

The ‘state’ table stores the computation state

INSERT INTO RETURNthe value produced by the UDAIn TERMINATE this is blocking

Windows:

9

• Windowed QuerySELECT myavg(price) OVER (ROWS 9 PRECEDING) FROM OpenAuction

• SQL:2003 OLAP functions for built-ins onlyo Other DSMSs

• Window typeso Logical, physical, unlimited, partition by, o (DSMS) slides, tumbles.

Window UDAs and Differential Computation

10

WINDOW AGGREGATE myavg(Next Real) : Real

{ TABLE inwindow(wnext Real); TABLE state(tsum Real, cnt Int);

INITIALIZE: { INSERT INTO state VALUES (Next, 1); INSERT INTO RETURN VALUES (Next)}

ITERATE: { UPDATE state SET tsum=tsum+Next, cnt=cnt+1; INSERT INTO RETURN SELECT tsum/cnt FROM state}

EXPIRE: { UPDATE state SET cnt = cnt-1, tsum=tsum - oldest().wnext }}• inwindow: system-memorized tuples. oldest() of such tuples• EXPIRE: servicing an expired tuple—possibly asynchrously • Physical window: for each new tuple one expired tuple• Logical window: for each new tuple zero or more expired

tuples—same code!

Slides for Window UDAs

11

SELECT myavg(price) OVER (ROWS 99 PRECEDING SLIDE 5) FROM OpenAuction

The 100-tuple window is partitioned into 20 panes of 5 tuples

each

Very useful to scale down computation & output E.g. mining aggregates: train on data from last hour but a new

classifier every 10 minutes.

When slide ≥ window is called a tumble Windows in DBMS (SQL:2003 OLAP functions) and slides in

DSMS--only for built-in aggregates.

Windows logical, physical, unlimited, partition by, slides, tumbles: declared in SMM by a base UDA +a window UDA.

Superior expressive power and extensibility.

Genericity (of UDAs)

12

• UDAs with window and slides: expressive power and extensibility—great for mining algorithms

• SMM UDAs can be declared with an arbitrary but fixed number of arguments.

• But the same mining algorithm must work on windows and tables where tuples have an arbitrary number of columns.

Solution:• For UDA coded in an external PL, each tuple can be

represented as a self-describing blob.

• For UDAs coded in SQ use verticalization…

Verticalization

13

RID 1:Outlook 3:Humidity Play?

1 Sunny Hot High Weak No

2 Overcast Hot High Weak Yes

2:Temprt

RID Column Value Decision

1 1 Sunny No

1 2 Hot No

1 3 High No

1 4 Weak No

2 1 Overcast Yes

2 2 Hot Yes

2 3 High Yes

2 4 Weak Yes

4: Wind

Generic NBClassifier (Training)

14

TABLE DescriptorTbl(Col INT, Val INT, Dec INT, normCnt REAL);

WINDOW AGGREGATE LearnNB(col INT, val Char, totCols INT, classVal INT) :

INT { INITIALIZE: ITERATE: { UPDATE DescriptorTbl SET normCnt = normCnt + 1 WHERE Col = col AND Val = val AND Dec = classVal; INSERT INTO DescriptorTbl VALUES (col, val, classVal, 1)

WHERE SQLCODE <> 0; } EXPIRE: { UPDATE DescriptorTbl SET normCnt = normCnt - 1 WHERE Col = oldest().col AND Val = oldest().val AND Dec = oldest().classVal } } E.g. for DescriptorTbl

Example: NBClassifier (Classifiying)

15

•Assume that that test table has the same verticalized format as the training table.

• Then we can use a join to find the counts for each tuple and each class.

• Then we need to multiply the counts for each class—how? Use the sum aggregate.

• And compare the results and select the larger. …

•What about missing values ….?

• What if tuples arrive in a stream?

•NBC is probably the simplest classification methods—among the effective ones

Data Stream Mining: why Mining Models?

16

Specification mining tasks involves many details. E.g., a simple classifier:

• Define a training stream and a testing stream• A data cleaner/discretizer• A TrainingUDA that builds a model• A TestingUDA that uses the model• How these two communicate: a table holding the model• Different parameters for each classifier instance• A workflow to describe the flow of the information between

mining tasks: More critical here than in KDD

SMM’s Mining Models: a declarative, user-definable framework to achieve all the above.

Built-in Mining Algorithms in SMM

17

• Online classifierso Naïve Bayesiano Decision Treeo Linear Regressiono Ensemble Methodso K-nearest Neighbor

• Online clusteringo DBScan [Ester’ 96]o IncDBScano Windowed K-means*o DenStream* [Cao’ 06]o CluStream

• Association rule miningo Approximate frequent

itemso SWIM [ICDE’ 08]o Moment [Chi’ 04]o AFPIM

• Time Series/Sequenceso SQL-TS [Sadri’ 01]o K*SQL [VLDB’10]

• Many more …

o Already supported O Work in progress

Our Stream Mill Miner (SMM) Syst.

18

1. DSMS performance and QoS

2. Genericity/flexibility

3. Expressive Power & Extensibility

– So performance, flexibility, power, extensibility of mining algorithms written in our Expressive Stream Language ESL (an extension of SQL)

– But Analysts do not want to see hundred of lines of ESL code!

Thus SMM also provides:

• Abstract Mining Models & Mining Workflows, and

• GUI to further enhance ease of use.

A More Complex Task: Classifier Ensembles

19

• Classifier Ensembles for accuracy and concept shift/drift: Weighted bagging [Wang’ 03], adaptive boosting [Chu’ 04], inductive transfer [Forman’ 06].

• Example: Specify UDAs (boxes) and flow for Weighted Bagging

BuildEns

Classify

Classify

Train

Voting

Ensemble Based Bagging: Flows

20

MODELTYPE EnsembleBag { BuildEns (UDA buildEns),

Train (UDA learnDTree), UpdateEns (UDA updateEnsembles),

Classify (UDA evaluateClassifier), ManageWeights (UDA updateWeights), Voting (UDA voting),

SHARETABLES(activeEnsembles, ensClassTbl, ensembleWeights),

Flow Train ( CREATE STREAM buildEnsTrain AS (RUN BuildEns ON INSTREAM); CREATE STREAM dTreeTrain AS (RUN Train ON buildEnsTrain); RUN UpdateEns ON dTreeTrain; CREATE STREAM ensClassiTrainPairs AS (SELECT a.ensId trainEns,b.ensId,b.id,b.col,b.val,b.lbl,b.numCols FROM buildEnsTrain AS b, activeEnsembles AS a); CREATE STREAM evalClassiTrain AS (RUN Classify ON ensClassiTrainPairs); INSERT INTO OUTSTREAM RUN ManageWeights ON evalClassiTrain; ), Flow Test ( … )}

Aggregate buildEns (idi int, coli int, vali int, lbli int, numColsi int, tWeighti int, ensSize int):

(ensId int, id int, col int, val int, lbl int, numCols int, tWeight int) { table curEnsId(ensId int) memory; table curEnsCnt(cnt int) memory;

initialize: { insert into curEnsCnt values(1); insert into curEnsId values(1);

insert into return select ensId, idi, coli, vali, lbli, numColsi, tWeighti from curEnsId; } iterate: { update curEnsCnt set cnt = cnt + 1 where coli = numColsi;

insert into return select ensId, idi, coli, vali, lbli, numColsi, tWeighti from curEnsId; /* indicates end of ens */ insert into return select ensId, -1, 0, 0, 0, 0, 0 from curEnsId where coli = numColsi and (select cnt from curEnsCnt) = ensSize; update curEnsId set ensId = ((ensId+1)%20) where (select cnt from

curEnsCnt) = ensSize and coli = numColsi; update curEnsCnt set cnt = 0 where coli = numColsi and (select cnt from

curEnsCnt) = ensSize; }};

29th Oct 08

Picture of the flow definition GUI

22

After Functionality & Usability:Performance

23

No data stream mining workbench to compare against,

We compared SMM with WEKA on

• Integration overhead: performance lost because algorithm is embedded in the system

• Scalability.

– Results obtained using a single-processor machine, with a Pentium4, 2.4GHz processor, 1GB RAM

– On data preloaded in main memory.

29th Oct 08

Comparison with Weka

24

• Left most bars, Iris-C4.5, HD-C4.5: C4.5 directly on data.

• Middle bars, Iris-SMM(C4.5), HD-SMM(C4.5): C4.5 incorporated into SMM

• Rightmost bars: Weka J48

• C4.5 was recast as a UDA and incorporated into SMM

Integration Overhead: Integrated SWIM vs. Standalone SWIM (Frequent Patterns on DS)

25

Concurrent Queries

26

Conclusions• SMM main contributions

– Building on a DSMS efficiency and QoS, achieved– Expressive Power, Generality and User Extensibility

• With an SQL-based continuous query language• UDAs with windows and slides• Arbitrary relational/XML data streams• A suite of fast & light mining algorithms (domestic & imported)

– High-level mining Language• Defining the mining process and information flow

• New mining models can be defined easily

• High-level abstractions and GUI to match analysts’ requirements.

27

28

Acknowledging the many SMM Contributors

• Yijian Bai, • Stefano Emiliozzi, • Chang Luo, • Yan-Nei Law, • Haixun Wang, • Kai Zeng, • Xin Zhou

• Hetal Thakkar, • Nikolay Laptev, • Hamid Mousavi, • Barzan Mozafari, • Vincenzo Russo,

Thank You!

Questions?

29

Example: NBClassifier (Classifiying)

30

AGGREGATE ClassifyNB(col INT, val REAL, totCols INT):INT { TABLE tmp(column INT, value REAL); TABLE pred(dec INT, tot REAL); INITIALIZE: ITERATE: { INSERT INTO tmp VALUES (col, val); INSERT INTO pred SELECT d.Dec, sum(abs(log(normCnt))) FROM DescriptorTbl AS d, tmp AS t WHERE col = totCols AND d.Val=t.value AND d.Col=t.column GROUP BY d.Dec; INSERT INTO RETURN SELECT dec FROM pred WHERE col = totCols AND tot = (SELECT max(tot) FROM pred GROUP BY dec); DELETE FROM tmp WHERE col = totCols; DELETE FROM pred WHERE col = totCols; }}

Future Work

• Integration of other mining algorithms• Distributed execution of UDAs, like

MapReduce• Similar solution for databases

31

29th Oct 08

Research Challenge I: Online Mining Algorithms

32

• Online mining different from static mining o Changing data characteristics

Data distribution Concept-drifts and shifts

o Data volume Existing static data solutions not suitable Load shedding and sampling

o Response time constraints Sacrifice accuracy

• Fast & light algorithms required

29th Oct 08

Hot Research Topic: Online Mining Algorithms

33

• Existing Online algorithmso Moment [Chi’ 04], AFPIM [Koh’ 04]o CluStream [Aggrawal’ 03], GenIc [Gupta’ 04]o Ensemble based bagging [Wang’ 03]o Adaptive boosting [Chu’ 04]o Many more opportunities

E.g. frequent itemset mining over large sliding windows (SWIM) [ICDE’ 08]

• Do not tackle the system oriented challenges

29th Oct 08

DSMSs

34

• Commercial Systems (General Purpose)o Aleri - OLAP querieso StreamBase - Synopses and pattern matchingo Apama, Coral8, …o Oracle, IBMo KX, Vhayu – specializedo All oriented towards SQL

• Focusing on special purpose queries or simple SQL querieso Little extensibilityo No mining support

29th Oct 08

Another Example: End-to-end Association Rule Mining

35

GUI for defining and using mining models and flows.

smm: a data stream management system for knowledge discovery 1 hetal thakkar, nikolay laptev, hamid...

Documents