Matic Perovšek, Anže Vavpetič, Nada Lavrač
Jožef Stefan Institute, Slovenia
A Wordification Approach to Relational Data Mining: Early Results
Introduction
Relational data mining algorithms aim to induce models and/or relational patterns from multiple tables
Individual-centered relational databases can be transformed to a single-table form, a process known as propositionalization
Motivation
Wordification inspired by text mining techniques
Large number of simple, easy to understand features
Greater scalability, handling large datasets
Can be used as a preprocessing step to propositional learners, as well as to declarative modeling / constraint solving (De Raedt et al., today’s invited talk)
Methodology
1. Transformation from relational database to a textual corpus
2. TF-IDF weight calculation
Transformation from relational database to a textual corpus
One individual of the initial relational database -> one text document
Features -> the words of this document
Words constructed as a combination:
Transformation from relational database to a textual corpus
For each individual, the words generated for the main table are concatenated with words generated from the secondary (BK) tables
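The transformation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the slides do not preserve the exact word construction, so the `table_attribute_value` word format, the `wordify` function, and the toy movie rows are all assumptions made for the example.

```python
def wordify(main_rows, secondary_rows, main_name, secondary_name, key):
    """Turn each individual (one row of the main table) into a list of words.

    Assumption: a word is the string "<table>_<attribute>_<value>".
    Rows are plain dicts; `key` names the column that links both tables.
    """
    docs = {}
    # Words generated from the main table: one document per individual.
    for row in main_rows:
        docs[row[key]] = [
            f"{main_name}_{attr}_{val}" for attr, val in row.items() if attr != key
        ]
    # Words generated from the secondary (BK) table are concatenated
    # onto the document of the individual they refer to.
    for row in secondary_rows:
        if row[key] in docs:
            docs[row[key]].extend(
                f"{secondary_name}_{attr}_{val}"
                for attr, val in row.items()
                if attr != key
            )
    return docs


# Hypothetical toy data, invented for illustration only.
movies = [{"id": 1, "rank": "top"}, {"id": 2, "rank": "bottom"}]
genres = [
    {"id": 1, "genre": "drama"},
    {"id": 1, "genre": "crime"},
    {"id": 2, "genre": "comedy"},
]

docs = wordify(movies, genres, "movie", "genre", "id")
for individual, words in docs.items():
    print(individual, " ".join(words))
```

Each printed line is one text document: the individual's identifier followed by the words that serve as its features.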
TF-IDF weights
No explicit use of existential variables in our features, TF-IDF instead
The weight of a word gives a strong indication of how relevant the feature is for the given individual.
The TF-IDF weights can then be used either to filter out words of low importance or directly as feature values by a propositional learner.
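As a rough sketch of the weighting and filtering step, the code below computes TF-IDF weights for a small corpus of word lists and drops low-weight words. The slides do not state which TF-IDF variant was used, so the common textbook form (relative term frequency times log(N / document frequency)) is assumed here; the function names, the threshold, and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return {doc_id: {word: weight}} for a corpus {doc_id: [words]}.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for words in corpus.values():
        df.update(set(words))
    weights = {}
    for doc_id, words in corpus.items():
        counts = Counter(words)
        weights[doc_id] = {
            word: (count / len(words)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return weights


def filter_low(weights, threshold):
    """Keep only the words whose weight reaches the given threshold."""
    return {
        doc_id: {w: v for w, v in doc.items() if v >= threshold}
        for doc_id, doc in weights.items()
    }


# Hypothetical toy corpus: "x" occurs in every document, so its weight is 0.
corpus = {"a": ["x", "y"], "b": ["x", "z", "z"]}
weights = tfidf(corpus)
kept = filter_low(weights, threshold=0.1)
print(kept)
```

A word that appears in every document gets weight zero and is filtered out, while words specific to a few individuals keep a positive weight that a propositional learner could use directly as a feature value.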
Experimental results
Slovenian traffic accidents database
IMDB database
Top 250 and bottom 100 movies
Movies, actors, movie genres, directors, director genres
Applied the wordification methodology
Performed association rule learning
Conclusion
Novel propositionalization technique called Wordification
Greater scalability
Easy to understand features
Further work:
Test on larger databases
Experimental comparison with other propositionalization techniques
Combine with propositionalization-like approach to mining heterogeneous information networks (Grčar et al. 2012), applicable to CLP in data preprocessing
Grčar, Trdin, Lavrač: A Methodology for Mining Document-Enriched Heterogeneous Information Networks, Computer Journal 2012