BOOSTING TUPLE PROPAGATION IN MULTI-RELATIONAL CLASSIFICATION
Dept. of Mathematics, University of Calabria, Italy
Lucantonio Ghionna, Gianluigi Greco
15th International Database Engineering & Applications Symposium
Lisbon, Portugal, 21-23 September, 2011
Outline
Background: Multi-Relational Classification
Problem Complexity: Tractability Islands, Heuristic Approaches
DBMS Implementation: System Design, Experiments
Concluding Remarks
Multi-Relational Classification
Target relation: Loan. Each tuple has a class label, indicating whether a loan is paid on time.
Database schema:
Loan(loan-id, account-id, date, amount, duration, payment)
Account(account-id, district-id, frequency, date)
Order(order-id, account-id, bank-to, account-to, amount, type)
Transaction(trans-id, account-id, date, type, operation, amount, balance, symbol)
Disposition(disp-id, account-id, client-id)
Card(card-id, disp-id, type, issue-date)
Client(client-id, birth-date, gender, district-id)
District(district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96)
How to make decisions on loan granting?
Multi-Relational Classification
Loan Applications
              Loan ID  Account ID  Amount  Duration  Decision
Applicant #1  1        124         1000    12        Yes
Applicant #2  2        124         4000    12        Yes
Applicant #3  3        108         10000   24        No
Applicant #4  4        45          12000   36        No
Accounts
Account ID  Frequency  Open date  District ID
128         monthly    02/27/96   61820
108         weekly     09/23/95   61820
45          monthly    12/09/94   61801
67          weekly     01/01/95   61822
Orders, Districts, and other relations
Search for good predicates across multiple relations
Do good payers access their account with a "monthly" frequency?
Solving CLP: State of the Art
Flattening approach [Krogel03]: build the universal relation through joins
  Combinatorial explosion of data; large tables with many attributes [Mugg92]
Upgrading approach [Xu06]: keep the universal relation virtual by propagating labels through foreign keys
  Global perspective [Xu06]
  Local perspective [Blockeel03, Yin04, Xu06]
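To make the upgrading idea concrete, here is a minimal Python sketch (an illustration, not the authors' implementation) that propagates class-label counts from the Loan Applications rows to the Accounts rows through the account ID, without materializing the join. The rows are copied from the example tables of the previous slide; the function names `propagate_labels` and `predicate_support` are hypothetical.

```python
from collections import Counter, defaultdict

# Rows copied from the Loan Applications and Accounts example tables.
# (loan_id, account_id, decision)
loans = [(1, 124, "Yes"), (2, 124, "Yes"), (3, 108, "No"), (4, 45, "No")]
# (account_id, frequency)
accounts = [(128, "monthly"), (108, "weekly"), (45, "monthly"), (67, "weekly")]


def propagate_labels(target_rows):
    """Attach to each foreign-key value the class-label counts of the
    target tuples that reference it (no join is materialized)."""
    counts = defaultdict(Counter)
    for _loan_id, account_id, decision in target_rows:
        counts[account_id][decision] += 1
    return counts


def predicate_support(account_rows, label_counts, frequency):
    """Class-label counts of the loans whose account has the given frequency."""
    total = Counter()
    for account_id, freq in account_rows:
        if freq == frequency:
            total.update(label_counts.get(account_id, Counter()))
    return total


label_counts = propagate_labels(loans)
monthly = predicate_support(accounts, label_counts, "monthly")
print(dict(monthly))
```

In the rows as listed, account 124 has no counterpart in the Accounts excerpt, so only loan 4 is reached through a "monthly" account.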
Contributions
We show that the propagation problem can effectively be solved on databases whose hypergraphs are nearly-acyclic
We design effective algorithms for the global/local perspectives
We provide an implementation of a complete JDBC-based system for tuple propagation
Experiments
Problem Complexity: Tractability Islands, Heuristic Approaches
Global Perspective: Tractability Islands of CLP
Good news: exponentially large universal relations do not imply CLP intractability [Xu06]
Initial state (each tuple carries a count and a vector of class-label counts <C1,C2>):
p1(X,Y)
  x1 y1        1  <0,0>
  x2 y1        1  <0,0>
  x1 y2        1  <0,0>
p2(X,Z,W)
  x1 z1 v1 C1  1  <1,0>
  x2 z2 v2 C2  1  <0,1>
p5(Y,T,X)
  y1 t1 x1     1  <0,0>
  y1 t2 x1     1  <0,0>
  y1 t2 x2     1  <0,0>
Bottom up, step 1:
p1(X,Y)
  x1 y1        1  <1,0>
  x2 y1        1  <0,1>
  x1 y2        1  <1,0>
p2(X,Z,W)
  x1 z1 v1 C1  1  <1,0>
  x2 z2 v2 C2  1  <0,1>
p5(Y,T,X)
  y1 t1 x1     1  <0,0>
  y1 t2 x1     1  <0,0>
  y1 t2 x2     1  <0,0>
Bottom up, step 2:
p1(X,Y)
  x1 y1        2  <2,0>
  x2 y1        1  <0,1>
  x1 y2        1  <1,0>
p2(X,Z,W)
  x1 z1 v1 C1  1  <1,0>
  x2 z2 v2 C2  1  <0,1>
p5(Y,T,X)
  y1 t1 x1     1  <0,0>
  y1 t2 x1     1  <0,0>
  y1 t2 x2     1  <0,0>
Top down, step 1:
p1(X,Y)
  x1 y1        2  <2,0>
  x2 y1        1  <0,1>
  x1 y2        1  <1,0>
p2(X,Z,W)
  x1 z1 v1 C1  2  <2,0>
  x2 z2 v2 C2  1  <0,1>
p5(Y,T,X)
  y1 t1 x1     1  <0,0>
  y1 t2 x1     1  <0,0>
  y1 t2 x2     1  <0,0>
Top down, step 2:
p1(X,Y)
  x1 y1        2  <2,0>
  x2 y1        1  <0,1>
  x1 y2        1  <1,0>
p2(X,Z,W)
  x1 z1 v1 C1  2  <2,0>
  x2 z2 v2 C2  1  <0,1>
p5(Y,T,X)
  y1 t1 x1     1  <1,0>
  y1 t2 x1     1  <1,0>
  y1 t2 x2     1  <0,1>
CLP is tractable on dependency graphs whose undirected versions are (forests of) trees [Xu06]
Tractability Islands of CLP. Are trees enough?
Relations: R1(B1, A1, …, Am), R2(B2, A1, …, Am), …, Rn(Bn, A1, …, Am) and R’1(A1), R’2(A2), …, R’m(Am)
Q = R1(B1, A1, …, Am), R2(B2, A1, …, Am), …, Rn(Bn, A1, …, Am), R’1(A1), R’2(A2), …, R’m(Am)
The (undirected) dependency graph is a bipartite clique of size m × n; hence it is not a tree, and the result in [Xu06] does not apply
CLP is still tractable!
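One way to see why tractability can survive here (an illustration, not the authors' argument) is to look at the hypergraph of Q rather than at the dependency graph. The sketch below applies the classical GYO reduction to a small, hypothetical instance of Q with n = 3 and m = 2; the function name `is_alpha_acyclic` is an assumption. The reduction empties the hypergraph of Q, so it is acyclic, whereas a triangle of binary relations is not.

```python
def is_alpha_acyclic(hyperedges):
    """GYO reduction: repeatedly drop vertices that occur in a single
    hyperedge and hyperedges that are empty or contained in another one.
    The hypergraph is acyclic iff nothing is left at the end."""
    edges = [set(e) for e in hyperedges]
    changed = True
    while changed:
        changed = False
        # Rule 1: remove vertices occurring in exactly one hyperedge.
        for e in edges:
            for v in list(e):
                if sum(v in f for f in edges) == 1:
                    e.discard(v)
                    changed = True
        # Rule 2: remove one hyperedge that is empty or covered by another.
        for i, e in enumerate(edges):
            if not e or any(j != i and e <= f for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
    return not edges


# Hypothetical instance of Q with n = 3 and m = 2.
q_edges = [
    {"B1", "A1", "A2"},  # R1
    {"B2", "A1", "A2"},  # R2
    {"B3", "A1", "A2"},  # R3
    {"A1"},              # R'1
    {"A2"},              # R'2
]
triangle = [{"X", "Y"}, {"Y", "Z"}, {"X", "Z"}]

print(is_alpha_acyclic(q_edges), is_alpha_acyclic(triangle))
```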
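Tractability Islands of CLP. Hypertree Decompositions
For fixed k: deciding whether hw(Q) ≤ k is in P [Gottlob02]; computing hypertree decompositions (of width at most k) is in P [Gottlob02]
Q = R1(B1, A1, …, Am), R2(B2, A1, …, Am), …, Rn(Bn, A1, …, Am), R’1(A1), R’2(A2), …, R’m(Am)
Hypertree decomposition of Q:
  root: {B1, …, Bm, A1, …, Am}, covered by R1, R2, …, Rm
  children: {A1}, covered by R’1; {A2}, covered by R’2; …; {Am}, covered by R’m
[Figure: hypergraph of Q, with hyperedges R1, R2, …, Rm and R’1, R’2, …, R’m over the variables B1, B2, …, Bm and A1, A2, …, Am]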
Tractability Islands of CLP. Hypertree Decompositions
Cyclic dependency graph … bounded width!
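CLPonDBk algorithm
Bottom-up phase
Nodes of the decomposition and their tuples:
Loan:                  (L1, A1, #C1), (L2, A1, #C2)
Order:                 (O1, A1), (O2, A2), (O1, A2)
Transaction:           (T1, A1), (T2, A2)
Account, Disposition:  (A1, D1, C1, S1), (A2, D2, C2, S1)
Card:                  (P1, S1), (P2, S2)
Client:                (C1, D1), (C2, D1), (C1, D2)
District:              (D1), (D2), (D3)
Account:               (A1, D1), (A2, D2)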
CLPonDBk algorithm
Top-down phase
[Animation frames over the same nodes and tuples as in the bottom-up phase: the annotation <1,1> is attached one step at a time, reaching seven annotations in the final frame]
CLPonDBk solves CLP in time O(|D| × max_{Ri ∈ D} ||Ri||^(k+3)) on the class of instances whose associated hypergraphs have hypertree width bounded by k.
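The bottom-up and top-down phases are reminiscent of the classical two-pass semi-join reduction for acyclic schemas (see the Yannakakis and Bernstein-Goodman entries in the references). The following Python sketch shows that classical technique only; it is not CLPonDBk itself and it does not track class-label counts. The tree shape (Account as root, with Loan and Transaction as children), the attribute names, and the function names are assumptions made for illustration; the tuples mirror a few of those shown on the slides.

```python
def semijoin(r, s):
    """Tuples of r that agree with at least one tuple of s on the shared attributes."""
    if not r or not s:
        return []
    shared = sorted(set(r[0]) & set(s[0]))
    keys = {tuple(t[a] for a in shared) for t in s}
    return [t for t in r if tuple(t[a] for a in shared) in keys]


def two_pass_reduce(children, root, relations):
    """Bottom-up then top-down semi-join passes over a rooted tree of relations."""
    rels = dict(relations)
    order, stack = [], [root]
    while stack:  # parents are listed before their descendants
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    for node in reversed(order):  # bottom up: children filter their parent
        for child in children.get(node, []):
            rels[node] = semijoin(rels[node], rels[child])
    for node in order:  # top down: the parent filters its children
        for child in children.get(node, []):
            rels[child] = semijoin(rels[child], rels[node])
    return rels


relations = {
    "Account": [{"acc": "A1", "dist": "D1"}, {"acc": "A2", "dist": "D2"}],
    "Loan": [{"loan": "L1", "acc": "A1"}, {"loan": "L2", "acc": "A1"}],
    "Transaction": [{"trans": "T1", "acc": "A1"}, {"trans": "T2", "acc": "A2"}],
}
children = {"Account": ["Loan", "Transaction"]}  # hypothetical tree shape

reduced = two_pass_reduce(children, "Account", relations)
print(reduced["Account"], reduced["Transaction"])
```

After the two passes, every remaining tuple takes part in at least one tuple of the full join, which is why no exponentially large intermediate result needs to be built.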
L-CLP: Local Perspective on the Propagation Problem
In several multi-relational approaches, CLP is heuristically restricted to portions of the database
Reducing the search space can pragmatically speed up the computation
• Still, joining many relations may be challenging from a computational viewpoint.
L-CLP: NTtoT_onDBMS and TtoNT_onDBMS
“Target to Non-Target” propagation (TtoNT_onDBMS): propagate information from R1 to Rm, then evaluate C on the result
“Non-Target to Target” propagation (NTtoT_onDBMS): start by filtering Rm with the condition C, then join the result with Rm-1, and iterate the process back to R1
The propagation path from R1 to Rm only requires joining pairs of “adjacent” relations
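The two directions can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not the DBMS-based algorithms of the talk: the toy path Loan - Account - District, the attribute names, and the function names `t_to_nt` and `nt_to_t` are all hypothetical.

```python
# Hypothetical toy instance of a propagation path R1 - R2 - R3
# (Loan - Account - District); R1 is the target relation.
loan = [
    {"loan_id": 1, "account_id": 124, "cl": "Yes"},
    {"loan_id": 2, "account_id": 108, "cl": "No"},
    {"loan_id": 3, "account_id": 45, "cl": "Yes"},
]
account = [
    {"account_id": 124, "district_id": 61820},
    {"account_id": 108, "district_id": 61820},
    {"account_id": 45, "district_id": 61801},
]
district = [
    {"district_id": 61820, "region": "north"},
    {"district_id": 61801, "region": "south"},
]
path = [loan, account, district]
join_attrs = ["account_id", "district_id"]  # join_attrs[i] links path[i] and path[i + 1]


def condition(row):
    """Condition C, evaluated on tuples of the last relation Rm."""
    return row["region"] == "north"


def t_to_nt(path, join_attrs, key, label, cond):
    """Target to Non-Target: carry target keys forward, evaluate C at the end."""
    frontier = [(t[key], t) for t in path[0] if t["cl"] == label]
    for rel, attr in zip(path[1:], join_attrs):
        index = {}
        for r in rel:
            index.setdefault(r[attr], []).append(r)
        frontier = [(k, r) for k, t in frontier for r in index.get(t[attr], [])]
    return {k for k, r in frontier if cond(r)}


def nt_to_t(path, join_attrs, key, label, cond):
    """Non-Target to Target: filter Rm by C, then semi-join back to R1."""
    survivors = [r for r in path[-1] if cond(r)]
    for rel, attr in zip(reversed(path[:-1]), reversed(join_attrs)):
        values = {s[attr] for s in survivors}
        survivors = [r for r in rel if r[attr] in values]
    return {t[key] for t in survivors if t["cl"] == label}


forward = t_to_nt(path, join_attrs, "loan_id", "Yes", condition)
backward = nt_to_t(path, join_attrs, "loan_id", "Yes", condition)
print(forward, backward)
```

Both functions touch only pairs of adjacent relations and return the same set of target keys; they differ in the size of the intermediate results they carry along the path.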
[Figure: NTtoT_onDBMS and TtoNT_onDBMS]
DBMS Implementation: System Design, Experiments
A JDBC System for CLP
Experimentation Settings
Scenario:
• CROSSMINE + NTtoT_onDBMS
• CROSSMINE + TtoNT_onDBMS
• CROSSMINE + TupleIDPropagation
Parameters:
• The number m of relations
• The number ||target|| of tuples in the target relation
• The “propagation ratio” ||target||/||R||
• The selectivity s of each join attribute
Environment:
• 2.1 GHz Centrino PC, 1 GB RAM, 5400 rpm hard disk (Windows XP Professional)
Computation Time and Propagation Time
m=5; ||target||/||R||=1; s=50%
• Dramatic improvements w.r.t. standard CrossMine
• Effective scaling for large relations
• …
Gains w.r.t. CrossMine
m=5; s=50%
• Gain on propagation up to 95%
• Gain on computation time up to 90%
• …
NTtoT_onDBMS or TtoNT_onDBMS?
NTtoT_onDBMS vs TtoNT_onDBMS
||target||=100000; m=5; s=50%
||target||=100000; m=5; s=50%; ||target||/||R||=1
• TtoNT_onDBMS is the best with a low propagation ratio
• NTtoT_onDBMS is the best when the target relation is much larger than the other relations
• Semi-join operators are a winning choice in practical database applications
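As an illustration of the last point, a Non-Target-to-Target style propagation can be pushed into the DBMS as a chain of semi-joins written with nested IN subqueries. The sketch below uses Python's built-in sqlite3 module as a stand-in for a JDBC-accessible DBMS; the table names, column names, and rows are hypothetical and not taken from the system described in the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE loan(loan_id INTEGER PRIMARY KEY, account_id INTEGER, cl TEXT);
CREATE TABLE account(account_id INTEGER PRIMARY KEY, district_id INTEGER);
CREATE TABLE district(district_id INTEGER PRIMARY KEY, region TEXT);
INSERT INTO loan VALUES (1, 10, 'Yes'), (2, 10, 'No'), (3, 20, 'Yes');
INSERT INTO account VALUES (10, 100), (20, 200);
INSERT INTO district VALUES (100, 'north'), (200, 'south');
""")

# Each IN subquery is a semi-join: only join-attribute values travel
# from one relation to the next, never the joined tuples themselves.
query = """
SELECT loan_id FROM loan
WHERE cl = ?
  AND account_id IN (
      SELECT account_id FROM account
      WHERE district_id IN (
          SELECT district_id FROM district WHERE region = ?))
ORDER BY loan_id
"""
result = [row[0] for row in conn.execute(query, ("Yes", "north"))]
print(result)
```

The innermost subquery plays the role of filtering Rm with the condition C; each enclosing level moves one relation closer to the target.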
Conclusion and Discussion
The CLP problem is a challenging task that can be effectively tackled using state-of-the-art query-optimization methods
  Propagation over a large class of nearly-acyclic database schemas is in fact tractable (polynomial upper bound guarantee)
  The result in [Xu06] emerges as a special case
The database implementation of local-perspective methods shows tremendous benefits w.r.t. standard in-memory strategies
Potential benefits for many classification algorithms, such as Bayesian classifiers [Getoor01], probabilistic models [Taskar02], and decision tree learning methods [Leiva03].
THANK YOU!
References
P. A. Bernstein and N. Goodman. Power of natural semijoins. SIAM Journal on Computing, 10(4):751–771, 1981.
H. Blockeel and L. De Raedt. Top-down Induction of First-Order Logical Decision Trees. Artificial Intelligence, 101(1-2):285–297, 1998.
H. Blockeel and M. Sebag. Scalability and Efficiency in Multi-relational Data Mining. SIGKDD Explorations Newsletters, 5(1):17–30, 2003.
M. Ceci and D. Malerba. Mr-SBC: a Multi-Relational Naive Bayes Classifier. In Proc. of PKDD’03, pages 95–106, 2003.
S. Džeroski. Multi-relational Data Mining: an Introduction. SIGKDD Explorations Newsletters, 5(1):1–16, 2003.
P. A. Flach and N. Lachiche. IBC2: A True First-Order Bayesian Classifier. In Proc. of ILP’02, pages 133–148, 2002.
R. Frank and F.M.M. Ester. A Method for Multi-relational Classification Using Single and Multi-feature Aggregation Functions. In Proc. of PKDD’07, pages 430–437, 2007.
L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning Probabilistic Models of Relational Structure. In Proc. of ICML’01, pages 170–177, 2001.
G. Gottlob, N. Leone, and F. Scarcello. Hypertree decomposition and tractable queries. Journal of Computer and System Sciences, 64:579–627, 2002.
G. Gottlob, Z. Miklos, and T. Schwentick. Generalized hypertree decompositions: NP-hardness and tractable variants. In Proc. of PODS’07, pages 13–22, 2007.
H. Guo and H. L. Viktor. Multirelational classification: a multiple view approach. Knowledge and Information Systems, 17(3):287–312, 2008.
G. Jing-Feng, L. Jing, and B. Wei-Feng. An Efficient Relational Decision Tree Classification Algorithm. In Proc. of ICNC’07, pages 530–534, 2007.
M. A. Krogel, S. Rawles, F. Zelezny, P. A. Flach, N. Lavrac, and S. Wrobel. Comparative Evaluation of Approaches to Propositionalization. In Proc. of ILP’03, pages 197–214, 2003.
H. Leiva, A. Atramentov, and V. Honavar. A Multi-relational Decision Tree Learning Algorithm. In Proc. of ILP’03, pages 97–112, 2002.
H. Liu, X. Yin, and J. Han. An efficient Multi-relational Naïve Bayesian classifier based on Semantic Relationship Graph. In Proc. of MRDM’05, pages 39–48, 2005.
S. Muggleton. Inductive Logic Programming. Academic Press, New York, 1992.
J. Neville, D. Jensen, L. Friedland, and M. Hay. Learning Relational Probability Trees. In Proc. of KDD’03, pages 625–630, 2003.
J. Neville, D. Jensen, and B. Gallagher. Simple Estimators for Relational Bayesian Classifiers. In Proc. of ICDM’03, page 609, 2003.
U. Pompe and I. Kononenko. Naive Bayesian Classifier within ILP-R. In Proc. of ILP’95, pages 417–436, 1995.
B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proc. of UAI’02, 2002.
K. Wang, Y. Xu, P.S. Yu, and R. She. Building Decision Trees on Records Linked through Key References. In Proc. of SDM’05, 2005.
Y. Xu, K. Wang, A. Wai-Chee Fu, R. She, and J. Pei. Classification Spanning Correlated Data Streams. In Proc. of CIKM’06, pages 132–141, 2006.
M. Yannakakis. Algorithms for acyclic database schemes. In Proc. of VLDB’81, pages 82–94, 1981.
X. Yin, J. Han, J. Yang, and P.S. Yu. CrossMine: Efficient Classification Across Multiple Database Relations. In Proc. of ICDE’04, page 399, 2004.
Multi-Relational Classification
Formal Framework
Input: D (with target having attribute CL), I, a class label ‘l’, and a condition C over the attributes of some relation R ∈ D;
Output: π_key[target] (σ_{C ∧ target.CL=‘l’} R(D, I))
{account-id,district-id} {Account}
{transaction-id,account-id} {Transaction}
{loan-id,account-id} {Loan}
{order-id,account-id} {Order}
{account-id,disp-id,client-id,district-id} {Account,Disposition}
{disp-id,card-id} {Card}
{client-id,district-id} {Client}
{district-id} {District}