rm world 2014: data mining with background knowledge from the web

08/22/14 Paulheim, Ristoski, Mitichkin, Bizer

Data Mining with Background Knowledge from the Web

Introducing the RapidMiner Linked Open Data Extension

Heiko Paulheim, Petar Ristoski, Evgeny Mitichkin, Christian Bizer


Motivation: An Example Data Mining Task

• Analyzing book sales

ISBN           City       Sold
3-2347-3427-1  Darmstadt  124
3-43784-324-2  Mannheim   493
3-145-34587-0  Roßdorf    14
...

ISBN           City       Population  ...  Genre   Publisher     ...  Sold
3-2347-3427-1  Darmstadt  144402      ...  Crime   Bloody Books  ...  124
3-43784-324-2  Mannheim   291458      ...  Crime   Guns Ltd.     ...  493
3-145-34587-0  Roßdorf    12019       ...  Travel  Up&Away       ...  14
...

→ Crime novels sell better in larger cities
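The enrichment in this example amounts to joining the sales table with background tables keyed by city and by ISBN. A minimal Python sketch, using only the three rows shown on the slide; the names `city_population`, `book_info`, and `enrich` are illustrative assumptions and not part of the extension:

```python
# Sales rows as shown on the slide: (ISBN, city, copies sold).
sales = [
    ("3-2347-3427-1", "Darmstadt", 124),
    ("3-43784-324-2", "Mannheim", 493),
    ("3-145-34587-0", "Roßdorf", 14),
]

# Background knowledge keyed by city and by ISBN (values from the slide).
city_population = {"Darmstadt": 144402, "Mannheim": 291458, "Roßdorf": 12019}
book_info = {
    "3-2347-3427-1": {"genre": "Crime", "publisher": "Bloody Books"},
    "3-43784-324-2": {"genre": "Crime", "publisher": "Guns Ltd."},
    "3-145-34587-0": {"genre": "Travel", "publisher": "Up&Away"},
}


def enrich(rows):
    """Join each sales row with the city and book background attributes."""
    enriched = []
    for isbn, city, sold in rows:
        record = {"isbn": isbn, "city": city, "sold": sold}
        record["population"] = city_population.get(city)
        record.update(book_info.get(isbn, {}))
        enriched.append(record)
    return enriched


enriched = enrich(sales)

# Crime novels only, ordered by city size, to eyeball the trend.
crime = sorted(
    (r for r in enriched if r.get("genre") == "Crime"),
    key=lambda r: r["population"],
)
for r in crime:
    print(r["city"], r["population"], r["sold"])
```

With only two crime titles in the excerpt, this illustrates the shape of the enriched table rather than supporting the conclusion statistically.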


Motivation

• Many data mining problems are solved better

– when you have more background knowledge

(leaving scalability aside)

• Problems:

– Tedious work

– Selection bias: what to include?


Linked Open Data in a Nutshell

• Started in 2007

• A collection of ~1,000 open datasets

– from various domains, e.g., general knowledge, government data, …

– using semantic web standards (HTTP, RDF, SPARQL,…)

• Machine processable

• Free of charge

• Sophisticated tool stacks


Linked Open Data in a Nutshell

http://lod-cloud.net/


Example: DBpedia


The RapidMiner LOD Extension

• Automatic discovery of links to Linked Open Data

– for local data objects

– e.g., the database entry Boston is linked to http://dbpedia.org/resource/Boston

• Automatic generation of attributes

– e.g., add all numeric values found for Boston (and other cities)

• Plus

– Feature selection algorithms optimized for LOD

– Automatic following of links to other datasets

– Schema matching (coming soon)

• No need to know Semantic Web technologies!
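To give a rough idea of what linking and attribute generation involve behind the scenes, here is a minimal Python sketch. It assumes a naive pattern-based linker (label to DBpedia URI) and builds a SPARQL query string for numeric values. The slides do not describe the extension's actual linking strategies, so the function names `guess_dbpedia_uri` and `numeric_values_query` are hypothetical, and the query is only constructed, not sent to an endpoint:

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def guess_dbpedia_uri(label):
    """Guess a DBpedia URI from a label (naive, pattern-based linking)."""
    # DBpedia resource names follow Wikipedia titles: spaces become underscores.
    name = label.strip().replace(" ", "_")
    return DBPEDIA_RESOURCE + quote(name, safe="_(),")


def numeric_values_query(uri):
    """Build a SPARQL query selecting all numeric literals of a resource."""
    return (
        "SELECT ?p ?v WHERE { "
        f"<{uri}> ?p ?v . "
        "FILTER(isNumeric(?v)) }"
    )


uri = guess_dbpedia_uri("Boston")
query = numeric_values_query(uri)
print(uri)
print(query)
```

A guessed URI is not guaranteed to exist or to denote the intended entity (ambiguous names are the obvious failure case), which is why link discovery is worth automating carefully.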


Example: the Auto MPG Dataset

• A well-known UCI dataset

– Goal: predict fuel consumption of cars

• Hypothesis: background knowledge → more accurate predictions

• Used background knowledge:

– Entity types and categories from DBpedia (=Wikipedia)



• Result: with M5Rules, the prediction error drops to almost half

– i.e., on average, we are wrong by 1.6 instead of 2.9 MPG
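As a quick check of "almost half", the two error figures from the slide can be compared directly. The variable names are illustrative; the slide does not name the exact error measure, so the figures are treated simply as average errors in MPG:

```python
error_without = 2.9  # MPG, original attributes only (from the slide)
error_with = 1.6     # MPG, with DBpedia types and categories (from the slide)

ratio = error_with / error_without  # share of the error that remains
reduction = 1 - ratio               # share of the error that was removed

print(f"remaining error: {ratio:.0%}, reduction: {reduction:.0%}")
```

The error that remains is about 55% of the original, a reduction of roughly 45%.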



• The original attributes are

– cylinders, displacement, horsepower, weight, acceleration, model year, origin

– plus name (unique string) and mpg (target)

• Models built are, e.g.,

– high horsepower/weight → high consumption

• Additional attributes lead to further insights, e.g.

– front-wheel drives have a lower consumption than rear-wheel drives

– hatchbacks have a lower consumption than station wagons

– rally cars generally have a low consumption


Example: Analyzing Statistics

• As shown, e.g., at ESWC 2012, SemStats 2013

• Statistics found on the web often contain only a few attributes

– extreme case: only entity + target

• Examples:

– Quality of living in cities (right)

– Corruption by country

– Fertility rate by country

– Suicide rate by country

– Box office revenue of films

– ...



• Process in RapidMiner:

– load statistic

– link entities (cities, countries, etc.) to LOD cloud

– collect additional attributes

– analyze for correlations with target attribute of statistic
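The last step of this process can be sketched in plain Python. In RapidMiner it is built from operators rather than code, so the following is only an illustration of the analysis idea: rank the collected attributes by their Pearson correlation with the target. The function names and the toy values are made up for illustration and are not taken from any of the statistics mentioned:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_attributes(target, attributes):
    """Rank attributes by the absolute value of their correlation with target."""
    scores = {name: pearson(values, target) for name, values in attributes.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: one target value per entity, plus collected attributes.
target = [1.0, 2.0, 3.0, 4.0]
attributes = {
    "attr_a": [2.0, 4.0, 6.0, 8.0],  # rises with the target
    "attr_b": [4.0, 3.0, 2.0, 1.5],  # falls as the target rises
    "attr_c": [1.0, 3.0, 2.0, 1.0],  # no clear relation
}

ranking = rank_attributes(target, attributes)
for name, r in ranking:
    print(f"{name}: {r:+.2f}")
```

Sorting by absolute value keeps strong negative correlations near the top, since those are as informative as positive ones.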



• Corruption Perception Index (CPI) by Transparency International

• Indicators for low corruption:

– high HDI (human development index)

– large number of companies

– large number of NGOs

– small number of cargo airlines?!

• Burnout rates in German DAX companies

– Positive correlation between turnover and burnout rates

– Car manufacturers are less prone to burnout

– Local companies are less prone to burnout than international ones

• Exception: Frankfurt



• Sexual activity (based on Durex survey 2005-2009)

– Higher in French speaking than in English speaking countries

– High GDP per capita → low activity

– High unemployment rate → high activity

– High number of ISPs → low activity

http://xkcd.com/552/


Further Usage Examples

• Classification of Twitter messages (SMILE, 2013)

– given a target, e.g., messages related to car traffic

– annotate message, extract abstract features for concepts

– e.g. “I-90” → highway

• Prediction of user location for Twitter (ICWSM, 2013)

– useful, e.g., for market research

– combination with sentiment analysis: public opinion maps

• Identifying disputed topics in the news (LD4KD, 2014)

– on a corpus of different online newspapers

– identified, e.g., competing opinions on drug legislation and gay marriage

• Debugging Linked Open Data as such

– e.g., identifying wrong links and axioms

– combination with outlier detection


Conclusions

• Many data mining tasks are better solved with more background knowledge

– better predictive models

– more insights from additional attributes

• A lot of such knowledge exists as Linked Open Data

• The Linked Open Data extension grants easy access to that data

– from within RapidMiner

– without the need to know anything about RDF, SPARQL, etc.

• Try it out!

– find “Linked Open Data” on the marketplace

– Google Group: https://groups.google.com/forum/#!forum/rmlod

