
Matic Perovšek, Anže Vavpetič, Nada Lavrač

Jožef Stefan Institute, Slovenia

A Wordification Approach to Relational Data Mining: Early Results

Overview

Introduction

Methodology

Experimental results

Conclusion

Introduction

Relational data mining algorithms aim to induce models and/or relational patterns from multiple tables

Individual-centered relational databases can be transformed into a single-table form; this transformation is called propositionalization

Motivation

Wordification inspired by text mining techniques

Large number of simple, easy to understand features

Greater scalability, handling large datasets

Can be used as a preprocessing step to propositional learners, as well as to declarative modeling / constraint solving (De Raedt et al., today’s invited talk)

Methodology

1. Transformation from relational database to a textual corpus

2. TF-IDF weight calculation

Transformation from relational database to a textual corpus

One individual of the initial relational database -> one text document

Features -> the words of this document

Words constructed as a combination:

Transformation from relational database to a textual corpus

For each individual, the words generated for the main table are concatenated with words generated from the secondary (BK) tables
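The transformation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy tables, the `movie_id` key, and the `table_attribute_value` word format are all hypothetical choices made for this example.

```python
from typing import Dict, List

# Hypothetical toy database: a main table of movies and a secondary (BK) table
# of roles linked to it by movie_id. All names and values are illustrative.
movies = [
    {"movie_id": 1, "genre": "drama", "decade": "1990s"},
    {"movie_id": 2, "genre": "comedy", "decade": "2000s"},
]
roles = [
    {"movie_id": 1, "actor_gender": "f"},
    {"movie_id": 1, "actor_gender": "m"},
    {"movie_id": 2, "actor_gender": "m"},
]


def row_to_words(table: str, row: dict, skip: tuple) -> List[str]:
    """Turn one row into words; the table_attribute_value format is an assumption."""
    return [f"{table}_{attr}_{value}" for attr, value in row.items() if attr not in skip]


def wordify(main, secondary, key) -> Dict[int, List[str]]:
    """Build one document (a list of words) per individual of the main table."""
    corpus = {}
    for row in main:
        words = row_to_words("movie", row, skip=(key,))
        # Concatenate the words of every linked secondary-table row.
        for sec in secondary:
            if sec[key] == row[key]:
                words += row_to_words("role", sec, skip=(key,))
        corpus[row[key]] = words
    return corpus


corpus = wordify(movies, roles, key="movie_id")
```

Each individual (here, a movie) ends up as one document whose words come first from the main table and then from the linked rows of the secondary table.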

Example

TF-IDF weights

No explicit use of existential variables in our features, TF-IDF instead

The weight of a word gives a strong indication of how relevant the feature is for the given individual.

The TF-IDF weights can then be used either to filter out words with low importance or directly as feature values by a propositional learner.
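The weighting and filtering steps can be sketched as follows. The slides do not say which TF-IDF variant is used, so this sketch assumes a common one: relative term frequency multiplied by log(N / df). The toy corpus, the function names, and the threshold are hypothetical.

```python
import math
from collections import Counter

# Hypothetical wordified corpus: one list of words per individual.
corpus = {
    1: ["movie_genre_drama", "role_actor_gender_f", "role_actor_gender_m"],
    2: ["movie_genre_comedy", "role_actor_gender_m", "role_actor_gender_m"],
}


def tf_idf(corpus):
    """Return {individual: {word: weight}} using tf * log(N / df) (assumed variant)."""
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for words in corpus.values() for word in set(words))
    weights = {}
    for doc_id, words in corpus.items():
        counts = Counter(words)
        weights[doc_id] = {
            w: (c / len(words)) * math.log(n_docs / df[w]) for w, c in counts.items()
        }
    return weights


def filter_words(weights, threshold):
    """Keep only words whose weight exceeds the (assumed) threshold."""
    return {
        doc_id: {w: x for w, x in ws.items() if x > threshold}
        for doc_id, ws in weights.items()
    }


weights = tf_idf(corpus)
filtered = filter_words(weights, threshold=0.0)
```

In this toy corpus the word `role_actor_gender_m` occurs in every document, so its weight is zero and the filter drops it. The remaining weights could be passed to a propositional learner as feature values.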

Experimental results

Slovenian traffic accidents database

IMDB database

Top 250 and bottom 100 movies

Movies, actors, movie genres, directors, director genres

Applied the wordification methodology

Performed association rule learning
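To illustrate what association rule learning over wordified documents involves, here is a brute-force sketch restricted to rules with a single antecedent. It is not the algorithm or tool used in the experiments, which the slides do not name. The documents, the `class_top` / `class_bottom` words, and the thresholds are hypothetical.

```python
from itertools import combinations

# Hypothetical wordified documents, each reduced to a set of words (items).
documents = [
    {"movie_genre_drama", "director_genre_drama", "class_top"},
    {"movie_genre_drama", "director_genre_drama", "class_top"},
    {"movie_genre_comedy", "director_genre_comedy", "class_bottom"},
    {"movie_genre_drama", "director_genre_comedy", "class_bottom"},
]


def support(itemset, docs):
    """Fraction of documents that contain every item in itemset."""
    return sum(itemset <= d for d in docs) / len(docs)


def rules(docs, min_support, min_confidence):
    """Enumerate single-antecedent rules lhs -> rhs by brute force."""
    items = sorted(set().union(*docs))
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            supp = support({lhs, rhs}, docs)
            if supp < min_support:
                continue
            # Confidence: support of the whole rule over support of the antecedent.
            conf = supp / support({lhs}, docs)
            if conf >= min_confidence:
                found.append((lhs, rhs, supp, conf))
    return found


found = rules(documents, min_support=0.5, min_confidence=1.0)
```

Each returned tuple is (antecedent, consequent, support, confidence). For larger corpora a dedicated frequent-itemset algorithm such as Apriori would be the usual choice, since brute-force enumeration does not scale.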

Experimental results

Conclusion

Novel propositionalization technique called Wordification

Greater scalability

Easy to understand features

Further work:

Test on larger databases

Experimental comparison with other propositionalization techniques

Combine with propositionalization-like approach to mining heterogeneous information networks (Grčar et al. 2012), applicable to CLP in data preprocessing

Grčar, Trdin, Lavrač: A Methodology for Mining Document-Enriched Heterogeneous Information Networks, Computer Journal 2012
