BOOSTING TUPLE PROPAGATION IN MULTI-RELATIONAL CLASSIFICATION

Dept. of Mathematics, University of Calabria, Italy

Lucantonio Ghionna, Gianluigi Greco

15th International Database Engineering & Applications Symposium, Lisbon, Portugal, 21-23 September 2011

http://www.unical.it/portale/

Outline

• Background: Multi-Relational Classification

• Problem Complexity: Tractability Islands, Heuristic Approaches

• DBMS Implementation: System Design, Experiments

• Conclusion: Remarks

Multi-Relational Classification

Target relation:

Each tuple has a class label, indicating whether a loan is paid on time.

Account(account-id, district-id, frequency, date)

Loan(loan-id, account-id, date, amount, duration, payment)

Order(order-id, account-id, bank-to, account-to, amount, type)

Card(card-id, disp-id, type, issue-date)

Disposition(disp-id, account-id, client-id)

Client(client-id, birth-date, gender, district-id)

District(district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96)

Transaction(trans-id, account-id, date, type, operation, amount, balance, symbol)

How to make decisions on loan granting?


Loan Applications

Applicant     Loan ID  Account ID  Amount  Duration  Decision
Applicant #1  1        124         1000    12        Yes
Applicant #2  2        124         4000    12        Yes
Applicant #3  3        108         10000   24        No
Applicant #4  4        45          12000   36        No

Accounts

Account ID  Frequency  Open date  District ID
128         monthly    02/27/96   61820
108         weekly     09/23/95   61820
45          monthly    12/09/94   61801
67          weekly     01/01/95   61822

Other relations: Orders, Districts

Search for good predicates across multiple relations

Do good payers access their account with a "monthly" frequency?

Solving CLP: State of the Art

Flattening approach [Krogel03]

• Build the universal relation through joins

• Combinatorial explosion of data: large tables with many attributes [Mugg92]

Upgrading approach [Xu06]

• Keep the universal relation virtual by propagating labels through foreign keys

• Global perspective [Xu06]

• Local perspective [Blockeel03, Yin04, Xu06]
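As a minimal sketch of what propagating labels through a foreign key can look like, the snippet below pushes the Decision labels of the Loan Applications example onto the accounts they reference, then counts labels for a candidate predicate on Account. The function names `propagate_labels` and `predicate_support` and the in-memory dictionaries are illustrative assumptions, not part of the presented system; the rows are the example rows from the slides.

```python
from collections import Counter, defaultdict

# Example rows from the Loan Applications table: (loan_id, account_id, decision)
loans = [
    (1, 124, "Yes"),
    (2, 124, "Yes"),
    (3, 108, "No"),
    (4, 45, "No"),
]

# Example rows from the Accounts table: account_id -> frequency
accounts = {128: "monthly", 108: "weekly", 45: "monthly", 67: "weekly"}


def propagate_labels(loans, accounts):
    """Attach to each account the class-label counts of the loans referencing it."""
    counts = defaultdict(Counter)
    for _loan_id, account_id, decision in loans:
        if account_id in accounts:  # follow the foreign key only if it resolves
            counts[account_id][decision] += 1
    return counts


def predicate_support(counts, accounts, frequency):
    """Sum the propagated label counts over accounts with the given frequency."""
    total = Counter()
    for account_id, label_counts in counts.items():
        if accounts[account_id] == frequency:
            total += label_counts
    return total


counts = propagate_labels(loans, accounts)
print(predicate_support(counts, accounts, "monthly"))
print(predicate_support(counts, accounts, "weekly"))
```

In the example excerpt, the loans on account 124 have no matching row in Accounts, so they contribute nothing. The universal relation is never materialized: only label counts travel along the foreign key.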

Contributions

We show that the propagation problem can effectively be solved on databases whose hypergraphs are nearly-acyclic

We design effective algorithms for the global/local perspectives

We provide an implementation of a complete JDBC-based system for tuple propagation

Experiments

Problem Complexity: Tractability Islands, Heuristic Approaches

Global Perspective: Tractability Islands of CLP

Good news: exponentially large universal relations do not imply CLP intractability [Xu06]

Example over p1(X,Y), p2(X,Z,W), p5(Y,T,X). Each cell shows the number and the class vector attached to the tuple at that step.

Tuple              Initial   Bottom up (1)  Bottom up (2)  Top down (1)  Top down (2)
p1: x1 y1          1 <0,0>   1 <1,0>        2 <2,0>        2 <2,0>       2 <2,0>
p1: x2 y1          1 <0,0>   1 <0,1>        1 <0,1>        1 <0,1>       1 <0,1>
p1: x1 y2          1 <0,0>   1 <1,0>        1 <1,0>        1 <1,0>       1 <1,0>
p2: x1 z1 v1 C1    1 <1,0>   1 <1,0>        1 <1,0>        2 <2,0>       2 <2,0>
p2: x2 z2 v2 C2    1 <0,1>   1 <0,1>        1 <0,1>        1 <0,1>       1 <0,1>
p5: y1 t1 x1       1 <0,0>   1 <0,0>        1 <0,0>        1 <0,0>       1 <1,0>
p5: y1 t2 x1       1 <0,0>   1 <0,0>        1 <0,0>        1 <0,0>       1 <1,0>
p5: y1 t2 x2       1 <0,0>   1 <0,0>        1 <0,0>        1 <0,0>       1 <0,1>

CLP is tractable on dependency graphs whose undirected versions are (forests of) trees [Xu06]

Tractability Islands of CLP. Are trees enough?

The (undirected) dependency graph is a bipartite clique of size m × n; hence it is not a tree, and the result in [Xu06] does not apply.

CLP is still tractable!

R1(B1, A1, …, Am), R2(B2, A1, …, Am), …, Rn(Bn, A1, …, Am)

R’1(A1), R’2(A2), …, R’m(Am)

Q = R1(B1, A1, …, Am), R2(B2, A1, …, Am), …, Rn(Bn, A1, …, Am), R’1(A1), R’2(A2), …, R’m(Am)
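One way to see why Q stays easy even though its dependency graph is cyclic is that its hypergraph is α-acyclic, which corresponds to hypertree width 1 [Gottlob02]. The sketch below checks this for a small instance (n = m = 3) with the classical GYO reduction. The function name `is_alpha_acyclic` and the encoding of hyperedges as attribute lists are assumptions made for illustration; this is not the CLPonDBk algorithm.

```python
def is_alpha_acyclic(hyperedges):
    """GYO reduction: repeatedly drop vertices that occur in a single edge and
    edges contained in another edge; acyclic iff every edge is eliminated."""
    edges = [set(e) for e in hyperedges]
    changed = True
    while changed:
        changed = False
        # Rule 1: remove vertices that occur in exactly one hyperedge.
        for e in edges:
            lonely = {v for v in e if sum(v in f for f in edges) == 1}
            if lonely:
                e -= lonely
                changed = True
        # Rule 2: remove one edge that is empty or contained in another edge.
        for i, e in enumerate(edges):
            if not e or any(i != j and e <= f for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
    return not edges


# Hypergraph of Q for n = m = 3: Ri = {Bi, A1, A2, A3} and R'j = {Aj}.
n, m = 3, 3
A = [f"A{j}" for j in range(1, m + 1)]
q = [[f"B{i}", *A] for i in range(1, n + 1)] + [[a] for a in A]

# A triangle of binary relations, the standard cyclic example, for contrast.
triangle = [["X", "Y"], ["Y", "Z"], ["Z", "X"]]

print(is_alpha_acyclic(q))
print(is_alpha_acyclic(triangle))
```

The reduction succeeds on Q because each Bi occurs in one relation only and, once the Bi are dropped, every remaining hyperedge is contained in {A1, …, Am}.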

Tractability Islands of CLP. Hypertree Decompositions

For fixed k:

• deciding whether hw(Q) ≤ k is in P [Gottlob02]

• computing hypertree decompositions is in P [Gottlob02]

Hypertree decomposition of Q (variables, covering relations):

{B1, …, Bm, A1, …, Am}  R1, R2, …, Rm

{A1} R’1   {A2} R’2   …   {Am} R’m

[Figure: hypergraph of Q, with hyperedges R1, R2, …, Rm and R’1, R’2, …, R’m over the variables A1, A2, …, Am and B1, B2, …, Bm]

Tractability Islands of CLP. Hypertree Decompositions

Cyclic dependency graph … bounded width!

CLPonDBk algorithm

[Figure: a Bottom up Phase followed by a Top Down Phase over the decomposition nodes Loan; Transaction; Order; Account,Disposition; Card; Client; District; Account]

Tuples shown at each node:

Loan: (L1, A1, #C1), (L2, A1, #C2)

Order: (O1, A1), (O2, A2), (O1, A2)

Transaction: (T1, A1), (T2, A2)

Account,Disposition: (A1, D1, C1, S1), (A2, D2, C2, S1)

Card: (P1, S1), (P2, S2)

Client: (C1, D1), (C2, D1), (C1, D2)

District: (D1), (D2), (D3)

Account: (A1, D1), (A2, D2)

During the Top Down Phase, the nodes are progressively annotated with the vector <1,1>.

CLPonDBk solves CLP in time $O(|D| \times \max_{R_i \in D} \|R_i\|^{k+3})$ on the class of instances whose associated hypergraphs have hypertree width bounded by k.


L-CLP: Local Perspective on Propagation Problem

In several multi-relational approaches, CLP is heuristically restricted to portions of the database

Reducing the search space can pragmatically speed up the computation

• Still, joining many relations may be challenging from a computational viewpoint.

L-CLP: NTtoT_onDBMS and TtoNT_onDBMS

“Target to Non-Target” propagation (TtoNT_onDBMS): propagate information from R1 to Rm, then evaluate C on the result.

“Non-Target to Target” propagation (NTtoT_onDBMS): start by filtering Rm with the condition C, then join the result with Rm-1, and iterate the process back to R1.

The propagation path from R1 to Rm only requires joining pairs of “adjacent” relations.
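The two directions can be sketched on a chain of three relations using Python's built-in `sqlite3` module. The schema (`r1` as target with key `id` and label `cl`, `r2`, `r3`), the sample rows, and the function names `nt_to_t` and `t_to_nt` are hypothetical and only mirror the two formulations; the presented system runs over JDBC, and the actual evaluation order is ultimately chosen by the DBMS query planner.

```python
import sqlite3


def setup():
    """Build a tiny chain r1 (target) - r2 - r3 in an in-memory database."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE r1 (id INTEGER PRIMARY KEY, cl TEXT, a INTEGER);
        CREATE TABLE r2 (a INTEGER, b INTEGER);
        CREATE TABLE r3 (b INTEGER, v TEXT);
        INSERT INTO r1 VALUES (1, 'yes', 10), (2, 'yes', 11), (3, 'no', 10);
        INSERT INTO r2 VALUES (10, 100), (11, 101);
        INSERT INTO r3 VALUES (100, 'monthly'), (101, 'weekly');
        """
    )
    return con


def nt_to_t(con, label, value):
    """Non-target to target: filter r3 by C, then semi-join back to r1."""
    rows = con.execute(
        """
        SELECT id FROM r1
        WHERE cl = ?
          AND a IN (SELECT a FROM r2
                    WHERE b IN (SELECT b FROM r3 WHERE v = ?))
        ORDER BY id
        """,
        (label, value),
    )
    return [r[0] for r in rows]


def t_to_nt(con, label, value):
    """Target to non-target: carry target keys forward, evaluate C at the end."""
    rows = con.execute(
        """
        SELECT DISTINCT r1.id
        FROM r1 JOIN r2 ON r1.a = r2.a
                JOIN r3 ON r2.b = r3.b
        WHERE r1.cl = ? AND r3.v = ?
        ORDER BY r1.id
        """,
        (label, value),
    )
    return [r[0] for r in rows]


con = setup()
print(nt_to_t(con, "yes", "monthly"))
print(t_to_nt(con, "yes", "monthly"))
```

Both functions return the same target keys. They differ in which side of the chain is reduced first, which is what the propagation ratio in the experiments is meant to capture.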

[Figure: the NTtoT_onDBMS and TtoNT_onDBMS procedures]

DBMS Implementation: System Design, Experiments

A JDBC System for CLP

Experimentation Settings

Scenario:

• CROSSMINE + NTtoT_onDBMS

• CROSSMINE + TtoNT_onDBMS

• CROSSMINE + TupleIDPropagation

Parameters:

• The number m of relations

• The number ||target|| of tuples in the target relation

• The “propagation ratio” ||target||/||R||

• The selectivity s of each join attribute

Environment:

• 2.1 GHz Centrino PC, 1 GB RAM, 5400 rpm hard disk (Windows XP Professional)

Computation Time and Propagation Time

m=5; ||target||/||R||=1; s=50%

• Dramatic improvements w.r.t. standard CrossMine

• Effective scaling for large relations

• …

Gains w.r.t. CrossMine

m=5; s=50%

• Gain on propagation up to 95%

• Gain on computation time up to 90%

• …

NTtoT_onDBMS or TtoNT_onDBMS?

NTtoT_onDBMS vs TtoNT_onDBMS

||target||=100000; m=5; s=50%

||target||=100000; m=5; s=50%; ||target||/||R||=1

• TtoNT_onDBMS is the best with a low propagation ratio

• NTtoT_onDBMS is the best when the target relation is much larger than the other relations

• Semi-join operators are a winning choice in practical database applications

Conclusion and Discussion

• The CLP problem is a challenging task which can be effectively tackled using state-of-the-art query-optimization methods

  • Propagation over a large class of nearly-acyclic database schemas is in fact tractable (polynomial upper bound guarantee)

  • The result in [Xu06] emerges as a special case

• The database implementation of local-perspective methods shows tremendous benefits w.r.t. standard in-memory strategies

• Potential benefits for many classification algorithms, such as Bayesian classifiers [Getoor01], probabilistic models [Taskar02], and decision tree learning methods [Leiva03].

THANK YOU!

References

P. A. Bernstein and N. Goodman. Power of natural semijoins. SIAM Journal on Computing, 10(4):751–771, 1981.

H. Blockeel and L. De Raedt. Top-down Induction of First-Order Logical Decision Trees. Artificial Intelligence, 101(1-2):285–297, 1998.

H. Blockeel and M. Sebag. Scalability and Efficiency in Multi-relational Data Mining. SIGKDD Explorations Newsletters, 5(1):17–30, 2003.

M. Ceci and D. Malerba. Mr-SBC: a Multi-Relational Naive Bayes Classifier. In Proc. of PKDD’03, pages 95–106, 2003.

S. Džeroski. Multi-relational Data Mining: an Introduction. SIGKDD Explorations Newsletters, 5(1):1–16, 2003.

P. A. Flach and N. Lachiche. IBC2: A True First-Order Bayesian Classifier. In Proc. of ILP’02, pages 133–148, 2002.

R. Frank and F.M.M. Ester. A Method for Multi-relational Classification Using Single and Multi-feature Aggregation Functions. In Proc. of PKDD’07, pages 430–437, 2007.

L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning Probabilistic Models of Relational Structure. In Proc. of ICML’01, pages 170–177, 2001.

G. Gottlob, N. Leone, and F. Scarcello. Hypertree decomposition and tractable queries. Journal of Computer and System Sciences, 64:579–627, 2002.

G. Gottlob, Z. Miklos, and T. Schwentick. Generalized hypertree decompositions: NP-hardness and tractable variants. In Proc. of PODS’07, pages 13–22, 2007.

H. Guo and H. L. Viktor. Multirelational classification: a multiple view approach. Knowledge and Information Systems, 17(3):287–312, 2008.

G. Jing-Feng, L. Jing, and B. Wei-Feng. An Efficient Relational Decision Tree Classification Algorithm. In Proc. of ICNC’07, pages 530–534, 2007.

M. A. Krogel, S. Rawles, F. Zelezny, P. A. Flach, N. Lavrac, and S. Wrobel. Comparative Evaluation of Approaches to Propositionalization. In Proc. of ILP’03, pages 197–214, 2003.

H. Leiva, A. Atramentov, and V. Honavar. A Multi-relational Decision Tree Learning Algorithm. In Proc. of ILP’03, pages 97–112, 2002.

H. Liu, X. Yin, and J. Han. An efficient Multi-relational Naïve Bayesian classifier based on Semantic Relationship Graph. In Proc. of MRDM’05, pages 39–48, 2005.

S. Muggleton. Inductive Logic Programming. Academic Press, New York, 1992.

J. Neville, D. Jensen, L. Friedland, and M. Hay. Learning Relational Probability Trees. In Proc. of KDD’03, pages 625–630, 2003.

J. Neville, D. Jensen, and B. Gallagher. Simple Estimators for Relational Bayesian Classifiers. In Proc. of ICDM’03, page 609, 2003.

U. Pompe and I. Kononenko. Naive Bayesian Classifier within ILP-R. In Proc. of ILP’95, pages 417–436, 1995.

B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proc. of UAI’02, 2002.

K. Wang, Y. Xu, P.S. Yu, and R. She. Building Decision Trees on Records Linked through Key References. In Proc. of SDM’05, 2005.

Y. Xu, K. Wang, A. Wai-Chee Fu, R. She, and J. Pei. Classification Spanning Correlated Data Streams. In Proc. of CIKM’06, pages 132–141, 2006.

M. Yannakakis. Algorithms for acyclic database schemes. In Proc. of VLDB’81, pages 82–94, 1981.

X. Yin, J. Han, J. Yang, and P.S. Yu. CrossMine: Efficient Classification Across Multiple Database Relations. In Proc. of ICDE’04, page 399, 2004.


Formal Framework

Input: D (with target having attribute CL), I, a class label ‘l’, and a condition C over the attributes of some relation R ∈ D;

Output: $\pi_{key[target]}\big(\sigma_{C \,\wedge\, target.CL = \text{‘l’}}(R(D, I))\big)$

{account-id, district-id} {Account}

{transaction-id, account-id} {Transaction}

{loan-id, account-id} {Loan}

{order-id, account-id} {Order}

{account-id, disp-id, client-id, district-id} {Account, Disposition}

{disp-id, card-id} {Card}

{client-id, district-id} {Client}

{district-id} {District}
