extending sql for decision support applications sql for decision support applications carlo zaniolo*...

49
Extending SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49

Upload: dangmien

Post on 30-Mar-2018

236 views

Category:

Documents


18 download

TRANSCRIPT

Page 1: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Extending SQL for Decision SupportApplications

Carlo Zaniolo*

*Joint Work with Haixun Wang, IBM

UCLA – p.1/49

Page 2: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Outline

Part I The problem and the state of the art

Part II Introduction to ATLaS

Part III Decision support applications The System andPerformance

Part IV Conclusions and future directions.

UCLA – p.2/49

Page 3: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Part I: The Problem

• Databases: where the data is (well most of it)

• But Database Management Systems (DBMSs) do notsupport well data mining tasks

• Desiderata: Data Mining Query Languages that supportad-hoc mining queries and general data mining

UCLA – p.3/49

Page 4: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

DBMSs and Data Mining

• Many proposals, including:• DMQL [Han, Fu, Wang, Koperski, Zaiane:

DMDW 1996]• Mine operator [Meo, Psaila, Ceri: 1996]• M-SQL [Imielinski, Virmani: 1999]

• Difficult technical challenges:• No natural way to retrofit SQL with mining

operators—as opposed to ROLAP extensionsthat naturally fit in the (super)group-by syntax

• Implementation and Performance issues• Much diversity in mining tasks: Can one

solution fit all?

UCLA – p.4/49

Page 5: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Data Mining in Object-OrientedDBMSs

• S. Sarawagi, S. Thomas,R. Agrawal: Integrating AssociationRule Mining with Relational Database Systems: Alternativesand Implications,SIGMOD 1998

• The Question: forget nice SQL extensions, and ask ifexperts can implement Apriori efficiently in anObject-Relational System such as DB2. An the answer was:• Not easily: UDFs are very difficult to write and debug• Not as efficient as Cache Mining approaches

• Apriori established as the acid test for the extensibility ofDBMSs for data mining tasks.

• Next Question: is SQL the real cause of these problems andshould we instead use other languages for database centricdatamining?

UCLA – p.5/49

Page 6: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Replacing SQL with BetterLanguages for Mining Databases

• The DATASIFT project uses a logical datalanguage (LDL++) to address these problems[Giannotti, Manco, et al. 1999, 2000]• Both deductive and inductive reasoning needed

to support the data mining process• LDL++ is Turing complete and supports User

Defined Aggregates (UDAs)• Direct C++ implementation of UDAs to solve

performance problems• New Datamining Algebras: The 3W Model

[Johnson, Lakshmanan, Ng: VLDB 2000]

UCLA – p.6/49

Page 7: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Part II: Introducing ATLaS

1. History and main ideas

2. Simple examples: average, minpoints, temporalcoalescing after projetion

3. Transitive closure computation

4. Data Streaming

UCLA – p.7/49

Page 8: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

A Brief History of ATLaS

1. SQL–AG: extending SQL3 proposal forAggregates to support ‘early returns’

2. LDL++ 5.1: Logic Database Language MonotonicAggregates: used freely in recursive queries forBoM and greedy algorithms

3. SADL: Simple Aggregate Definition Languagebased on SQL. easy to use, but with limitedperformance and power

4. AXL: Aggregate eXtension Language: Much morepowerful and efficient

5. ATLaS: table functions and support for thedefinition and management of in-memory datastructures using SQL.

UCLA – p.8/49

Page 9: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

ATLaS Main Ideas

• Tables as the only data type• SQL statements as the only statements• Native Extensibility by letting users introduce new

Aggregates and Table functions by coding them inSQL

UCLA – p.9/49

Page 10: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Defining Aggregates

Aggregates are functions that process a stream ofvalues, on the basis of whether the current item is

• The first value—INITIALIZE state,• Every other successive value—ITERATE state,• The EOF marker—TERMINATE state• The calling query generates the streams—one for

each GROUP BY— and set the states

This way of defining UDAs is similar to that used byPostgres, LDL++, SQL3, etc.

ATLaS aggregates take streams as input but also re-

turn streams as output (e.g., online aggregates)UCLA – p.10/49

Page 11: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Example: Define Average

AGGREGATE myavg(Next Int) : Real{ TABLE state(sum Int, cnt Int);

INITIALIZE : {INSERT INTO state VALUES (Next, 1);

}ITERATE : {

UPDATE state SET sum=sum+Next, cnt=cnt+1;}TERMINATE : {

INSERT INTO RETURNSELECT sum/cnt FROM state;

}}

UCLA – p.11/49

Page 12: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

OnLine Averages

AGGREGATE online avg(Next Int) : Real{ TABLE state(sum Int, cnt Int);

INITIALIZE : {INSERT INTO state VALUES (Next, 1);

}ITERATE : {

UPDATE state SET sum=sum+Next, cnt=cnt+1;INSERT INTO RETURN

SELECT sum/cnt FROM stateWHERE cnt % 200 = 0;

}TERMINATE : {}

}

UCLA – p.12/49

Page 13: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Calling UDAs

SELECT Sex, online avg(Sal)FROM employeeWHERE Dept=1024GROUP BY Sex;

UCLA – p.13/49

Page 14: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Simple Aggregates in ATLaS

• SEQ: an aggregate that appends to each new tuplea consecutive sequence number

• The DISTINCT version of the same (duplicate tuplesare ignored)

• To implement that, you must declare a MEMO tableto memorize old values

• Then results could be returned:• during computation: in the INITIAL and ITERATE

states: nonblocking and monotonic UDA• all at the end in the TERMINATE state: blocking

(and frequently) nonmonotonic UDA• Users can exercise high-level control over

computation.UCLA – p.14/49

Page 15: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

The Point and value of Minimum in asequence of pairs

AGGREGATE minpair(iPoint Int, iValue Int): (mPoint Int, mValue Int)

{ TABLE mvalue(value Int); TABLE mpoints(point Int);INITIALIZE: {

INSERT INTO mvalue VALUES (iValue);INSERT INTO mpoints VALUES(iPoint);

}ITERATE: {

UPDATE mvalue SET value = iValue WHERE iValue < value;DELETE FROM mpoints WHERE SQLCODE = 1;INSERT INTO mpoints

SELECT iPoint FROM mvalueWHERE iValue =mvalue.value;

}TERMINATE: {

INSERT INTO RETURNSELECT point, value FROM mpoints, mvalue; }

} UCLA – p.15/49

Page 16: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Coalescing

AGGREGATE coalesce(from TIME, to TIME)

: (start TIME, end TIME)

{ TABLE state(cFrom TIME, cTo TIME);

INITIALIZE: { INSERT INTO state VALUES(from,to) }ITERATE :{

UPDATE state SET cTo = to

WHERE cTo >= from AND cTo < to;

INSERT INTORETURN

SELECT cFrom, cTo FROM state

WHERE cTo < from;

UPDATE state

SET cFrom = from, cTo = to

WHERE cTo < from; }TERMINATE: { INSERT INTO RETURN

SELECT cFrom, cTo FROM state; }}

UCLA – p.16/49

Page 17: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Computation of Transitive Closures

AGGREGATE reachable(Rnode Char(10)) : Char(10){ TABLE memnodes(Pnode: Char(10));

INITIALIZE: ITERATE: {INSERT INTO memnodes VALUES(Rnode)

WHERE NOT EXIST (SELECT * FROM memnodes AS aWHERE a.Pnode=Rnode);

INSERT INTO RETURN VALUES (Rnode)WHERE SQLCODE=1;

INSERT INTO RETURNSELECT reachable(end) FROM dgraphWHERE SQLCODE=1 AND start=Rnode;

}}

UCLA – p.17/49

Page 18: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Transitive Closures—Cont.

• The nodes reachable from ’000’ can be computedby calling this aggregate:

SELECT reachable(dgraph.end)FROM dgraphWHERE dgraph.start=’000’;

• reachable performs a top-down computation(Prolog-like)

• We can also express a bottom-up computationusing the differential fixpoint

• In the next slide we show a nonrecursive way,similar to that used by active database triggers.

UCLA – p.18/49

Page 19: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Nodes Reachable from ’000’–a latrigger

AGGREGATE tclosur(X Char(10), Y Char(10)): (tcX Char(10), tcY Char(10))

{ TABLE tc(snode Char(10), enode: Char(10));INITIALIZE: ITERATE: {

INSERT INTO tc VALUES(X,Y);INSERT INTO tc

SELECT tc.snode, YFROM tc WHERE tc.enode=X;

INSERT INTO tcSELECT X, tc.enodeFROM tc WHERE Y=tc.snode;

}TERMINATE: {

INSERT INTO RETURN SELECT * FROM tc; }}

The actual call is: SELECT tclosur(dgraph.start, dgraph.end)FROM dgraph;

UCLA – p.19/49

Page 20: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

The Power of Streams

• Relationally complete languages cannot expresstransitive closures

• Recursion had to be added to SQL to expressthese queries

• Here, we have expressed transitive closure in anon-recursive ATLaS program

• Conclusion: a stream-oriented processing modeladds significant expressive power to SQL!

• ATLaS taps on this hidden source of power.

UCLA – p.20/49

Page 21: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Blocking and NonBlockingAggregates

• Nonblocking aggregates are needed for streams• Every UDA with an empty TERMINATE clause is

nonblocking—also monotonic• tclosr can be made nonblocking by moving the

RETURN to the initialize/iterate states.• Memory is the second issue for stream-based

processing• Our program only uses one tuple. This is fine if

there is no duplicate path (i.e., our graph is a tree)or we do not mind duplicates. Otherwise, we needto store previous pairs in a memo table and add aNOT IN check to the code.

UCLA – p.21/49

Page 22: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Part III: Decision SupportApplications in ATLaS

Here we discuss:

• OLAP applications• Classifiers• The Apriori Algorithm

Many other algorithms, including DBSCAN and other

clustering methods, are listed in the ATLaS web page

http://wis.cs.ucla.edu/atlas with documenta-

tion and downloads

UCLA – p.22/49

Page 23: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

ROLLUPS in SQL

SELECT Time, Region, Dept, SUM(Sales)FROM Sales GROUP BY Time, Region, Dept

UNION ALLSELECT Time, Region, ‘all’ , SUM(Sales)FROM Sales GROUP BY Time, Region

UNION ALLSELECT Time, ‘all’, ‘all’, SUM(Sales)FROM Sales GROUP BY Time

UNION ALLSELECT ‘all’, ‘all’, ‘all’, SUM(Sales)FROM Sales;

To remedy the lack of performance and ease of use

of standard SQL supergroups constructs have been

added to SQLUCLA – p.23/49

Page 24: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

ROLLUPS in ATLaS

We will defined a new aggregate called rollup:

SELECT rollup(Time, Region, Dept, Sales)FROM data;

UCLA – p.24/49

Page 25: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Roll-up sales on Time, Region, Dept

AGGREGATE rollup(T Char(20), R Char(20),D Char(20), Sales Real)

: (T Char(20), R Char(20), D Char(20), Sales Real){ TABLE memo(j Int, Time Char(20), Region Char(20),

Dept Char(20), Sum Real) MEMORY;FUNCTION onestep(L Int, T Char(20), R Char(20),

D Char(20), S Real): (T Char(20), R Char(20), D Char(20), Sales Real)

{ INSERT INTO RETURNSELECT Time, Region, Dept, Sum FROM memoWHERE L=j

AND (T 6=Time OR R6=Region OR D6=Dept);UPDATE memo SET Sum = Sum + ( SELECT m.Sum

FROM memo AS m WHERE m.j=L)WHERE SQLCODE=1 AND j=L+1;

UPDATE memoSET Time=T, Region=R, Dept=D, Sum=SWHERE SQLCODE=1 AND j=L;

}

UCLA – p.25/49

Page 26: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Roll-up sales on Time, Region,Dept—cont.

INITIALIZE: {INSERT INTO memo

VALUES (1, T, R, D, Sales), (2, T, R, ‘all’, 0),(3, T, ‘all’, ‘all’, 0), (4, ‘all’, ‘all’, ‘all’, 0);

}ITERATE: { UPDATE memo SET Sum = Sum + Sales

WHERE Time=T AND Region=R AND Dept=D;INSERT INTO RETURN SELECT os.*

FROM TABLE(onestep(1, T, R, D, Sales)) AS osWHERE SQLCODE=0;

INSERT INTO RETURN SELECT os.*FROM TABLE(onestep(2, T, R, ‘all’, 0)) AS osWHERE SQLCODE=1;

INSERT INTO RETURN SELECT os.*FROM TABLE(onestep(3, T, ‘all’, ‘all’, 0)) AS osWHERE SQLCODE=1;

}

UCLA – p.26/49

Page 27: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Roll-up sales on Time, Region,Dept–cont.

TERMINATE: {INSERT INTO RETURN SELECT os.*

FROM TABLE(onestep(1, ‘all’, ‘all’, ‘all’, 0)) AS os;INSERT INTO RETURN SELECT os.*

FROM TABLE(onestep(2, ‘all’, ‘all’, ‘all’, 0)) AS os;INSERT INTO RETURN SELECT os.*

FROM TABLE(onestep(3, ‘all’, ‘all’, ‘all’, 0)) AS os;INSERT INTO RETURN

SELECT Time, Region, Dept, SumFROM memo WHERE j=4;

}}

UCLA – p.27/49

Page 28: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Sorting and Rolling

AGGREGATE sort and roll(T Char(20), R Char(20),D Char(20), Sales Real)

: (T Char(20), R Char(20), D Char(20), Sales Real){TABLE temp(A1 Char(20), A2 Char(20), A3 Char(20),

V Real) MEMORY;FUNCTION sort temp()

: (A1 Char(20), A2 Char(20), A3 Char(20), V Real){ INSERT INTO RETURN

SELECT * FROM temp ORDER BY A1, A2, A3;}INITIALIZE: ITERATE: {

INSERT INTO temp VALUES (T, R, D, Sales);}TERMINATE: {

INSERT INTO RETURNSELECT rollup(t.A1, t.A2, t.A3, t.V)FROM TABLE (sort temp()) AS t;

}} UCLA – p.28/49

Page 29: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Computing a data cube by 3sort-and-roll

AGGREGATE cube(T Char(20), R Char(20), D Char(20),Sales Real)

: (T Char(20), R Char(20), D Char(20), Sales Real){ TABLE temp(A1 Char(20), A2 Char(20), A3 Char(20),

V Real) MEMORY;FUNCTION sort1()

: (A1 Char(20), A2 Char(20), A3 Char(20), V Real){ INSERT INTO RETURN

SELECT * FROM temp ORDER BY A1, A2, A3 }FUNCTION sort2()

: (A1 Char(20), A2 Char(20), A3 Char(20), V Real){ INSERT INTO RETURN

SELECT * FROM temp ORDER BY A2, A3 }FUNCTION sort3()

: (A1 Char(20), A2 Char(20), A3 Char(20), V Real){ INSERT INTO RETURN

SELECT * FROM temp ORDER BY A3, A1 }}

UCLA – p.29/49

Page 30: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Cubing–cont

INITIALIZE: ITERATE: {INSERT INTO temp VALUES (T, R, D, Sales);

}TERMINATE: {

INSERT INTO RETURNSELECT rollup(t.A1, t.A2, t.A3, t.V)FROM TABLE (sort1) AS t;

INSERT INTO RETURNSELECT rollup2(t.A1, t.A2, t.A3, t.V)FROM TABLE (sort2) AS t;

INSERT INTO RETURNSELECT rollup3(t.A1, t.A2, t.A3, t.V)FROM TABLE (sort3) AS t;

}}

UCLA – p.30/49

Page 31: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

The PlayTennis Example

RID Outlook Temp Humidity Wind Play1 Sunny Hot High Weak No2 Rain Mild High Weak Yes3 Overcast Hot High Weak Yes4 Sunny Hot High Strong No5 Rain Cool Normal Weak Yes6 Rain Cool Normal Strong Yes7 Overcast Cool Normal Strong No8 Sunny Mild High Weak No9 Sunny Cool Normal Weak Yes

10 Rain Mild Normal Weak Yes11 Sunny Mild Normal Strong Yes12 Overcast Mild High Strong Yes13 Overcast Hot Normal Weak Yes14 Rain Mild High Strong No

RID col val YorN1 1 Rain No1 2 Mild No1 3 High No1 4 Strong No2 1 Rain Yes2 2 Mild Yes2 3 High Yes2 4 Weak Yes3 1 . . .

. . . . . .

UCLA – p.31/49

Page 32: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Table Functions

Dissemble a relation into column/value pairs:

FUNCTION dissemble (v1 Int, v2 Int, v3 Int, v4 Int, YorN Int): (col Int, val Int, YorN Int);

{ INSERT INTO RETURNVALUES (1, v1, YorN), (2, v2, YorN),

(3, v3, YorN), (4, v4, YorN);}

Then you write the classifier and call it as follows:

SELECT classify(0, p.RID, d.Col, d.Val, d.YorN)FROM PlayTennis AS p,TABLE(dissemble(Outlook,Temp, Humidity, Wind, Play)) AS d;

UCLA – p.32/49

Page 33: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Naive Bayesian Classifiers

• For each pair (column, value) count the positives and

negatives and store them in a summary table:

col val Yc Nc

1 Rain 4 1

1 Sunny 2 3

1 Overcast 2 1

2 Hot 1 2

2 . . . . . . . . .

• Also tally up the positives and negatives• These operations are done in one pass• The resulting table is all is needed to classify a new tuple.

Example: (Sunny, Hot, . . .).UCLA – p.33/49

Page 34: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Decision Tree Classifiers

Select a column for splitting using the gini index—one branch for

each value in the column:

• We start by computing the SUMMARY table.• Compute the Gini index g for each column:

Each class: fp = p/(p + n), fn = n/(p + n), g = 1− f2

p− f2

n

All classes: Gini = f1 × g1 + f2 × g2 + . . .

then store the result in ginitable(Col, Gini)• Find the column with the least Gini (using minpair) and

store it in mincol.• Reclassify the tuples by splitting each class according to

their values in column c. An unique id must be generated for

the new class.• Recursive invocation (but homogenous classes and

columns with only one value are not split) UCLA – p.34/49

Page 35: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

A Scalable Decision Tree Classifier

AGGREGATE classify(iNode Int, RecId Int, iCol Int,iValue Int, iYorN Int)

{ TABLE treenodes(RecId Int, Node Int, Col Int,Value Int, YorN Int);

TABLE mincol(Col Int);TABLE summary(Col Int, Value Int, Yc Int, Nc Int,

INDEX {Col,Value});TABLE ginitable(Col Int, Gini Int);INITIALIZE : ITERATE : {

INSERT INTO treenodesVALUES(RecId, iNode, iCol, iValue, iYorN);

UPDATE summarySET Yc=Yc+iYorN, Nc=Nc+1-iYorNWHERE Col = iCol AND Value = iValue;

INSERT INTO summarySELECT iCol, iValue, iYorN, 1-iYorNWHERE SQLCODE=0;

}UCLA – p.35/49

Page 36: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

A Scalable Decision TreeClassifier—Cont

TERMINATE : {INSERT INTO ginitable

SELECT Col, sum((Yc*Nc)/(Yc+Nc))/sum(Yc+Nc)FROM summaryGROUP BY Col;HAVING count(Value)>1

AND sum(Yc)>0 AND sum(Nc)>0;INSERT INTO mincol

SELECT minpair(Col, Gini) FROM ginitable;INSERT INTO result

SELECT iNode, Col FROM mincol;/* Call classify() recusively to partition each of its subnodes unless it is pure.*/

SELECT classify(t.Node*MAXVALUE+m.Value+1,t.RecId, t.Col, t.Value, t.YorN)

FROM treenodes AS t,( SELECT tt.RecId RecId, tt.Value Value

FROM treenodes AS tt, mincol AS mWHERE tt.Col=m.Col

) AS mWHERE t.RecId = m.RecIdGROUP BY m.Value; }

} UCLA – p.36/49

Page 37: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Association Rules: Apriori

• Many attempts to implement frequent-item-setcomputations in SQL DBMS and O-R DBs havefailed to produce good performance

• In-depth investigation by Sarawagi, Thomas, andAgrawal [ACM/SIGMOD 98] established this as theacid test for any SQL extension claiming to dosupport data mining

• Our previous system, AXL system had also failedbecause of poor performance

• ATLaS solved the problem via (i) table functions,(ii) in-memory tables, and (iii) better optimizationtechniques.

UCLA – p.37/49

Page 38: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Example: AprioriAlgorithm—minimun support = 2

D :

TID Items

1000 1, 3, 4200 2, 3, 5300 1, 2, 3, 5400 2, 5

Scan D

Itemset Support

{1} 2{2} 3{3} 3{4} 1{5} 3

Filter

Itemset Support

{1} 2{2} 3{3} 3{5} 3

BuildLargerSets

Itemset

{1,2 }{1,3 }{1, 5 }{2,3 }{2,5}{3, 5 }

Scan D

Itemset Support

{1,2 } 1{1,3 } 2{1, 5 } 1{2,3 } 2{2,5 } 3{3, 5 } 2

Filter

Itemset Support

{1,3 } 2{2,3 } 2{2,5 } 3{3, 5 } 2

BuildLargerSets

→Itemset

{2,3, 5 }Scan D

Itemset Support

{2,3, 5 } 2Filter:

Itemset Support

{2,3, 5 } 2

UCLA – p.38/49

Page 39: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Main Operations

• Scanning the database and counting occurrences• Pruning the itemsets below the minimum support

level:• Combining frequent sets of size n into candidate

larger sets of size n + 1 [or even larger].Monotonicity Condition: The support level of a setis always ≤ than that of every subset

• Checking the presence of large-cardinalityitemsets is expensive. A prefix tree is normallyused to solve this problem.

UCLA – p.39/49

Page 40: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

A Priori in ATLaS

• baskets(item INT): A stream of transactions from theDB table

0, 2, 3, 4,0, 1, 2, 3, 4,0, 3, 4, 5,0, 1, 2, 5,0, 2, 4, 5

• The frequent itemsets in the prefix tree: trie

• The tuples in cands hold an item, cit, a reference,trieref, to a leaf node of the trie, and a the countfreqcount, for the set.

UCLA – p.40/49

Page 41: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

The Prefix Tree

1 2 3 4 5

null

2 3 4 5 3 4 5 4 5 5

1st Level

cands

1 2 3 4 5

null

2 3 5

1st Level

44 5

4

2nd Level

3rd Level

UCLA – p.41/49

Page 42: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Main ATLaS Program for Apriori

1: TABLE baskets(item Int);2: TABLE trie(item Int, father REF(trie), INDEX(father)) MEMORY;3: TABLE cands(item Int, trieref REF(trie), freqcount Int,

INDEX(cit,trieref)) MEMORY;4: TABLE fitems(item Int, INDEX(item));{generat frequent one-itemsets}

5: INSERT INTO fitemsSELECT item FROM baskets WHERE item > 0GROUP BY item HAVING count(*)≥ MinSup;

{intialize the trie to contain frequent one-itemsets}6: INSERT INTO trie SELECT item, null FROM fitems;{self-join frequent 1-itemsets to get candidate 2-itemsets}

7: INSERT INTO candsSELECT t1.itno, t2.OID, 0 FROM trie AS t1, trie AS t2WHERE t1.itno > t2.itno;

{Generate (k+1)-itemsets from k-itemsets. Start with k=2}8: SELECT countset(item, 2, MinSup, cands) FROM baskets;

UCLA – p.42/49

Page 43: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Managing the Prefix Tree

• A perfect candidate for an TRIE ADT coded in C++• But we did it in ATLaS SQL• Using an in-memory table using SQL3 reference

type to organize the TRIE data structure• How are reference types on in-memory tables

implemented in ATLaS ...?

UCLA – p.43/49

Page 44: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Performance

Name T I D size of dataset

T5.I2.D100K 5 2 100K 2.8M text stream

T10.I2.D100K 10 2 100K 5.2M text stream

T10.I4.D100K 10 4 100K 5.2M text stream

T20.I2.D100K 20 2 100K 10.1M text stream

Table 1: Benchmark Data Sets

UCLA – p.44/49

Page 45: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Performance Curves

2 1.5 1 .75 .5 .250

5

10

15

20

25

30

35

40

Min Support

Tim

e (s

ec)

T5.I2.D100K

ATLaSC

2 1.5 1 .75 .5 .250

20

40

60

80

100

Min Support

Tim

e (s

ec)

T10.I2.D100K

ATLaSC

2 1.5 1 .75 .5 .250

20

40

60

80

100

Min Support

Tim

e (s

ec)

T10.I4.D100K

ATLaSC

2 1.5 1 .75 .5 .250

100

200

300

400

500

Min Support

Tim

e (s

ec)

T20.I2.D100K

ATLaSC

Figure 1: ATLaS vs. C implementation of Apriori

UCLA – p.45/49

Page 46: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

Programming in ATLaS

• Table-based programming is powerful and naturalfor data intensive applications

• SQL can be awkward and many extensions arepossible

• But even SQL AS IS is adequate even for complexdatamining queries and algorithms.

UCLA – p.46/49

Page 47: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

The ATLaS System

• ATLaS programs into C programs are compiledinto C programs that Execute on the Berkeley DBrecord manager

• The 100 Apriori program compiles into 2,800 linesof C The system

• Other data structures (R-trees, in-memory tables)have been added using the same API.

• The system is now 54,000 lines of C++ code.

UCLA – p.47/49

Page 48: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

ATLaS: Conclusions

• A simple native extensibility mechanism for SQL• More efficient than Java or PL/SQL. Effective with

Data Mining Applications• Also OLAP applications, and recursive queries,

and temporal database applications• Complements current extensibility mechanisms

based on UDFs and Data Blades• Supports and favors streaming aggregates (in

SQL the default is blocking)• Good basis for determining program properties:

e.g. (non)monotonic and blocking behavior• These are lessons that future query languages

cannot easily ignore. UCLA – p.48/49

Page 49: Extending SQL for Decision Support Applications SQL for Decision Support Applications Carlo Zaniolo* *Joint Work with Haixun Wang, IBM UCLA – p.1/49 Outline Part I The problem and

References

1. J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A Data MiningQuery Language for Relational Databases. In Proc. 1996 SIGMOD’96 Workshopon Research Issues on Data Mining and Knowledge Discovery (DMKD’96), pp.27-33, Montreal, Canada, June 1996.

2. R. Meo, G. Psaila, S. Ceri. A New SQL-like Operator for Mining Association Rules.In Proc. VLDB96, 1996 Int. Conf. Very Large Data Bases, Bombay, India, pp.122-133.

3. T. Imielinski and A. Virmani. MSQL: a query language for database mining. DataMining and Knowledge Discovery, 3:373-408, 1999. S.

4. S. Tsur, J. Ulman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov. Query ¤ocks:a generalization of association rule mining. In Proc. 1998 ACM-SIGMOD, p. 1-12,1998.

5. F. Bonchi , F. Giannotti, G. Mainetto, D. Pedreschi. A Classification-basedMethodology for Planning Audit Strategies in Fraud Detection. In Proc. KDD-99,ACM-SIGKDD Int. Conf. on Knowledge Discovery & Data Mining, San Diego (CA),August 1999.

6. F. Giannotti, G. Manco, D. Pedreschi and F. Turini. Experiences with a logic-basedknowledge discovery support environment. In Proc. 1999 ACM SIGMODWorkshop on Research Issues in Data Mining and Knowledge Discovery (DMKD1999).

7. T. Johnson, Laks V. S. Lakshmanan, Raymond T. Ng: The 3W Model and Algebrafor Unified Data Mining. VLDB 2000: 21-32

UCLA – p.49/49