smm: a data stream management system for knowledge discovery 1 hetal thakkar, nikolay laptev, hamid...
TRANSCRIPT
SMM: A Data Stream Management System for Knowledge Discovery
1
Hetal Thakkar, Nikolay Laptev, Hamid Mousavi, Barzan Mozafari, Vincenzo Russo, Carlo Zaniolo
Computer Science Department UCLA
Data Stream Management Systems (DSMS)
2
• DSMS critical in a variety of applicationso Click-stream analysis,o Algorithmic Tradingo Network monitoringo Credit card fraud detection …
• Many DSMS Projects and Prototypes :o STREAM (Stanford), Aurora/Borealis (Brown, MIT),
Telegraph (UCB), Gigascope (AT&T), Stream Mill (UCLA), … and so on.
• Commercial Startups and vendor extensions:o StreamBase, Aleri, Coral8, Apama, Truviso,o DBMS vendors …
• Support for online mining on data streams: unresolved issue for current systems.
Two Main Research Challenges
3
• Challenge I: Fast and Light algorithms needed for online mining algorithms.
• Challenge II: These and business intelligence applications require the Quality of Service (QoS) of DSMS. Thus these algorithms must be deployed as part of a DSMS.
• Much research on first challenge—a stream of papers in DM conferences—but not on the second that is probably even harder.
Data Stream Mining & DSMS QoS
4
•DSMS: Support continuous queries over massive data streams – with QoS (Quality of Service) guarantees and– (Quasi) Real-time response through:
o Scheduling, query optimization,o Windows and other Synopseso Load shedding …
• But - Current DSMS focus on simple continuous queries- Using query languages based on SQL- Lackluster history of SQL with KDD- DSMS bring more problems:
e.g. blocking queries not allowed.
Knowledge Discovery from DBs (KDD) vs. SQL
5
OLAP in Relational DBMS: simple SQL extensions brought rich payoffs to vendors.
Extending DBMSs for Data Mining proved much harder: Limited expressive power of SQL and OR-DBMS Apriori in DB2 [Saravagi’ 98] proved extremely difficult and not as
efficient as the cache-mining task.CAC Imielinski & Mannila [CACM’96]: A call for Declarative DM Vendors: Libraries of Mining Methods
o IBM: DB2 Intelligent Miner-o Oracle Data Minero OLE DB for DM (DMX)
Mining Models, Predictive Model Markup Language (PMML) Closed proprietary systems, limited extensibility, usability … compared with open systems such as WEKA.
+ CACM’96] M’96]
Our Stream Mill Miner (SMM) Syst.
6
• Efficient support for online mining algorithms: DSMS QoS, scalability, load shedding, synopses
• Expressive Power & Extensibility: User-Defined Aggregates (UDAs), with windows and slides.
• Genericity: Mining Algorithms for arbitrary windows (logical/physical), & tables with any number of columns
• Abstract Mining Models & Mining Workflows: analysts do not care to see SQL code
• GUI to further enhance ease of use.
Expressive Power and Extensibility by UDA functions
7
• UDAs are needed to express mining algorithms.
• UDAs can be written in:
• An external PL (as other DBMS & DSMS do), or
• In SQL itself (then SQL becomes Turing-Complete!)
• Using the following template:o INITIALIZE: the first tuple
o ITERATE: each subsequent tuples
o TERMINATE : After the end of the relation/stream
• UDAs are invoked in the same way as built-in aggregates
UDA Example: same as SQL AVG
8
AGGREGATE myavg(Next Real) : Real { TABLE state(tsum Real, cnt Int); INITIALIZE: { INSERT INTO state VALUES (Next, 1) } ITERATE: { UPDATE state SET tsum = tsum+Next, cnt = cnt+1; } TERMINATE: {
INSERT INTO RETURN SELECT tsum/cnt FROM state; }
}Blocking UDAs can be
applied to data streams only if
we use Windows !
The ‘state’ table stores the computation state
INSERT INTO RETURNthe value produced by the UDAIn TERMINATE this is blocking
Windows:
9
• Windowed QuerySELECT myavg(price) OVER (ROWS 9 PRECEDING) FROM OpenAuction
• SQL:2003 OLAP functions for built-ins onlyo Other DSMSs
• Window typeso Logical, physical, unlimited, partition by, o (DSMS) slides, tumbles.
Window UDAs and Differential Computation
10
WINDOW AGGREGATE myavg(Next Real) : Real
{ TABLE inwindow(wnext Real); TABLE state(tsum Real, cnt Int);
INITIALIZE: { INSERT INTO state VALUES (Next, 1); INSERT INTO RETURN VALUES (Next)}
ITERATE: { UPDATE state SET tsum=tsum+Next, cnt=cnt+1; INSERT INTO RETURN SELECT tsum/cnt FROM state}
EXPIRE: { UPDATE state SET cnt = cnt-1, tsum=tsum - oldest().wnext }}• inwindow: system-memorized tuples. oldest() of such tuples• EXPIRE: servicing an expired tuple—possibly asynchrously • Physical window: for each new tuple one expired tuple• Logical window: for each new tuple zero or more expired
tuples—same code!
Slides for Window UDAs
11
SELECT myavg(price) OVER (ROWS 99 PRECEDING SLIDE 5) FROM OpenAuction
The 100-tuple window is partitioned into 20 panes of 5 tuples
each
Very useful to scale down computation & output E.g. mining aggregates: train on data from last hour but a new
classifier every 10 minutes.
When slide ≥ window is called a tumble Windows in DBMS (SQL:2003 OLAP functions) and slides in
DSMS--only for built-in aggregates.
Windows logical, physical, unlimited, partition by, slides, tumbles: declared in SMM by a base UDA +a window UDA.
Superior expressive power and extensibility.
Genericity (of UDAs)
12
• UDAs with window and slides: expressive power and extensibility—great for mining algorithms
• SMM UDAs can be declared with an arbitrary but fixed number of arguments.
• But the same mining algorithm must work on windows and tables where tuples have an arbitrary number of columns.
Solution:• For UDA coded in an external PL, each tuple can be
represented as a self-describing blob.
• For UDAs coded in SQ use verticalization…
Verticalization
13
RID 1:Outlook 3:Humidity Play?
1 Sunny Hot High Weak No
2 Overcast Hot High Weak Yes
2:Temprt
RID Column Value Decision
1 1 Sunny No
1 2 Hot No
1 3 High No
1 4 Weak No
2 1 Overcast Yes
2 2 Hot Yes
2 3 High Yes
2 4 Weak Yes
4: Wind
Generic NBClassifier (Training)
14
TABLE DescriptorTbl(Col INT, Val INT, Dec INT, normCnt REAL);
WINDOW AGGREGATE LearnNB(col INT, val Char, totCols INT, classVal INT) :
INT { INITIALIZE: ITERATE: { UPDATE DescriptorTbl SET normCnt = normCnt + 1 WHERE Col = col AND Val = val AND Dec = classVal; INSERT INTO DescriptorTbl VALUES (col, val, classVal, 1)
WHERE SQLCODE <> 0; } EXPIRE: { UPDATE DescriptorTbl SET normCnt = normCnt - 1 WHERE Col = oldest().col AND Val = oldest().val AND Dec = oldest().classVal } } E.g. for DescriptorTbl
Example: NBClassifier (Classifiying)
15
•Assume that that test table has the same verticalized format as the training table.
• Then we can use a join to find the counts for each tuple and each class.
• Then we need to multiply the counts for each class—how? Use the sum aggregate.
• And compare the results and select the larger. …
•What about missing values ….?
• What if tuples arrive in a stream?
•NBC is probably the simplest classification methods—among the effective ones
Data Stream Mining: why Mining Models?
16
Specification mining tasks involves many details. E.g., a simple classifier:
• Define a training stream and a testing stream• A data cleaner/discretizer• A TrainingUDA that builds a model• A TestingUDA that uses the model• How these two communicate: a table holding the model• Different parameters for each classifier instance• A workflow to describe the flow of the information between
mining tasks: More critical here than in KDD
SMM’s Mining Models: a declarative, user-definable framework to achieve all the above.
Built-in Mining Algorithms in SMM
17
• Online classifierso Naïve Bayesiano Decision Treeo Linear Regressiono Ensemble Methodso K-nearest Neighbor
• Online clusteringo DBScan [Ester’ 96]o IncDBScano Windowed K-means*o DenStream* [Cao’ 06]o CluStream
• Association rule miningo Approximate frequent
itemso SWIM [ICDE’ 08]o Moment [Chi’ 04]o AFPIM
• Time Series/Sequenceso SQL-TS [Sadri’ 01]o K*SQL [VLDB’10]
• Many more …
o Already supported O Work in progress
Our Stream Mill Miner (SMM) Syst.
18
1. DSMS performance and QoS
2. Genericity/flexibility
3. Expressive Power & Extensibility
– So performance, flexibility, power, extensibility of mining algorithms written in our Expressive Stream Language ESL (an extension of SQL)
– But Analysts do not want to see hundred of lines of ESL code!
Thus SMM also provides:
• Abstract Mining Models & Mining Workflows, and
• GUI to further enhance ease of use.
A More Complex Task: Classifier Ensembles
19
• Classifier Ensembles for accuracy and concept shift/drift: Weighted bagging [Wang’ 03], adaptive boosting [Chu’ 04], inductive transfer [Forman’ 06].
• Example: Specify UDAs (boxes) and flow for Weighted Bagging
BuildEns
Classify
Classify
Train
Voting
Ensemble Based Bagging: Flows
20
MODELTYPE EnsembleBag { BuildEns (UDA buildEns),
Train (UDA learnDTree), UpdateEns (UDA updateEnsembles),
Classify (UDA evaluateClassifier), ManageWeights (UDA updateWeights), Voting (UDA voting),
SHARETABLES(activeEnsembles, ensClassTbl, ensembleWeights),
Flow Train ( CREATE STREAM buildEnsTrain AS (RUN BuildEns ON INSTREAM); CREATE STREAM dTreeTrain AS (RUN Train ON buildEnsTrain); RUN UpdateEns ON dTreeTrain; CREATE STREAM ensClassiTrainPairs AS (SELECT a.ensId trainEns,b.ensId,b.id,b.col,b.val,b.lbl,b.numCols FROM buildEnsTrain AS b, activeEnsembles AS a); CREATE STREAM evalClassiTrain AS (RUN Classify ON ensClassiTrainPairs); INSERT INTO OUTSTREAM RUN ManageWeights ON evalClassiTrain; ), Flow Test ( … )}
Aggregate buildEns (idi int, coli int, vali int, lbli int, numColsi int, tWeighti int, ensSize int):
(ensId int, id int, col int, val int, lbl int, numCols int, tWeight int) { table curEnsId(ensId int) memory; table curEnsCnt(cnt int) memory;
initialize: { insert into curEnsCnt values(1); insert into curEnsId values(1);
insert into return select ensId, idi, coli, vali, lbli, numColsi, tWeighti from curEnsId; } iterate: { update curEnsCnt set cnt = cnt + 1 where coli = numColsi;
insert into return select ensId, idi, coli, vali, lbli, numColsi, tWeighti from curEnsId; /* indicates end of ens */ insert into return select ensId, -1, 0, 0, 0, 0, 0 from curEnsId where coli = numColsi and (select cnt from curEnsCnt) = ensSize; update curEnsId set ensId = ((ensId+1)%20) where (select cnt from
curEnsCnt) = ensSize and coli = numColsi; update curEnsCnt set cnt = 0 where coli = numColsi and (select cnt from
curEnsCnt) = ensSize; }};
29th Oct 08
Picture of the flow definition GUI
22
After Functionality & Usability:Performance
23
No data stream mining workbench to compare against,
We compared SMM with WEKA on
• Integration overhead: performance lost because algorithm is embedded in the system
• Scalability.
– Results obtained using a single-processor machine, with a Pentium4, 2.4GHz processor, 1GB RAM
– On data preloaded in main memory.
29th Oct 08
Comparison with Weka
24
• Left most bars, Iris-C4.5, HD-C4.5: C4.5 directly on data.
• Middle bars, Iris-SMM(C4.5), HD-SMM(C4.5): C4.5 incorporated into SMM
• Rightmost bars: Weka J48
• C4.5 was recast as a UDA and incorporated into SMM
Integration Overhead: Integrated SWIM vs. Standalone SWIM (Frequent Patterns on DS)
25
Concurrent Queries
26
Conclusions• SMM main contributions
– Building on a DSMS efficiency and QoS, achieved– Expressive Power, Generality and User Extensibility
• With an SQL-based continuous query language• UDAs with windows and slides• Arbitrary relational/XML data streams• A suite of fast & light mining algorithms (domestic & imported)
– High-level mining Language• Defining the mining process and information flow
• New mining models can be defined easily
• High-level abstractions and GUI to match analysts’ requirements.
27
28
Acknowledging the many SMM Contributors
• Yijian Bai, • Stefano Emiliozzi, • Chang Luo, • Yan-Nei Law, • Haixun Wang, • Kai Zeng, • Xin Zhou
• Hetal Thakkar, • Nikolay Laptev, • Hamid Mousavi, • Barzan Mozafari, • Vincenzo Russo,
Thank You!
Questions?
29
Example: NBClassifier (Classifiying)
30
AGGREGATE ClassifyNB(col INT, val REAL, totCols INT):INT { TABLE tmp(column INT, value REAL); TABLE pred(dec INT, tot REAL); INITIALIZE: ITERATE: { INSERT INTO tmp VALUES (col, val); INSERT INTO pred SELECT d.Dec, sum(abs(log(normCnt))) FROM DescriptorTbl AS d, tmp AS t WHERE col = totCols AND d.Val=t.value AND d.Col=t.column GROUP BY d.Dec; INSERT INTO RETURN SELECT dec FROM pred WHERE col = totCols AND tot = (SELECT max(tot) FROM pred GROUP BY dec); DELETE FROM tmp WHERE col = totCols; DELETE FROM pred WHERE col = totCols; }}
Future Work
• Integration of other mining algorithms• Distributed execution of UDAs, like
MapReduce• Similar solution for databases
31
29th Oct 08
Research Challenge I: Online Mining Algorithms
32
• Online mining different from static mining o Changing data characteristics
Data distribution Concept-drifts and shifts
o Data volume Existing static data solutions not suitable Load shedding and sampling
o Response time constraints Sacrifice accuracy
• Fast & light algorithms required
29th Oct 08
Hot Research Topic: Online Mining Algorithms
33
• Existing Online algorithmso Moment [Chi’ 04], AFPIM [Koh’ 04]o CluStream [Aggrawal’ 03], GenIc [Gupta’ 04]o Ensemble based bagging [Wang’ 03]o Adaptive boosting [Chu’ 04]o Many more opportunities
E.g. frequent itemset mining over large sliding windows (SWIM) [ICDE’ 08]
• Do not tackle the system oriented challenges
29th Oct 08
DSMSs
34
• Commercial Systems (General Purpose)o Aleri - OLAP querieso StreamBase - Synopses and pattern matchingo Apama, Coral8, …o Oracle, IBMo KX, Vhayu – specializedo All oriented towards SQL
• Focusing on special purpose queries or simple SQL querieso Little extensibilityo No mining support
29th Oct 08
Another Example: End-to-end Association Rule Mining
35
GUI for defining and using mining models and flows.