RapidMiner World 2014: Data Mining with Background Knowledge from the Web
08/22/14 Paulheim, Ristoski, Mitichkin, Bizer 1
Data Mining with Background Knowledge from the Web
Introducing the RapidMiner Linked Open Data Extension
Heiko Paulheim, Petar Ristoski, Evgeny Mitichkin, Christian Bizer
Motivation: An Example Data Mining Task
• Analyzing book sales
ISBN            City       Sold
3-2347-3427-1   Darmstadt  124
3-43784-324-2   Mannheim   493
3-145-34587-0   Roßdorf    14
...

ISBN            City       Population  ...  Genre   Publisher     ...  Sold
3-2347-3427-1   Darmstadt  144402      ...  Crime   Bloody Books  ...  124
3-43784-324-2   Mannheim   291458      ...  Crime   Guns Ltd.     ...  493
3-145-34587-0   Roßdorf    12019       ...  Travel  Up&Away       ...  14
...
→ Crime novels sell better in larger cities
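The enrichment step in this example can be sketched in a few lines of Python. This is only an illustration using the three rows shown on the slide; the dictionaries standing in for the background knowledge and the function name `enrich` are assumptions, not part of the extension.

```python
# Local data: the three example rows from the slide.
sales = [
    {"isbn": "3-2347-3427-1", "city": "Darmstadt", "sold": 124},
    {"isbn": "3-43784-324-2", "city": "Mannheim", "sold": 493},
    {"isbn": "3-145-34587-0", "city": "Roßdorf", "sold": 14},
]

# Background knowledge (here hard-coded; in practice it would come from the web).
city_population = {"Darmstadt": 144402, "Mannheim": 291458, "Roßdorf": 12019}
book_genre = {
    "3-2347-3427-1": "Crime",
    "3-43784-324-2": "Crime",
    "3-145-34587-0": "Travel",
}


def enrich(rows, population_by_city, genre_by_isbn):
    """Join background attributes onto each sales row (hypothetical helper)."""
    return [
        {
            **row,
            "population": population_by_city[row["city"]],
            "genre": genre_by_isbn[row["isbn"]],
        }
        for row in rows
    ]


enriched = enrich(sales, city_population, book_genre)

# With the extra attributes, crime novels can be compared across city sizes.
crime_by_city_size = sorted(
    (row for row in enriched if row["genre"] == "Crime"),
    key=lambda row: row["population"],
)
for row in crime_by_city_size:
    print(row["city"], row["population"], row["sold"])
```

Three rows are of course far too few to support the conclusion; the sketch only shows why the pattern becomes visible once population and genre are available as attributes.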
Motivation
• Many data mining problems are solved better
– when you have more background knowledge
(leaving scalability aside)
• Problems:
– Tedious work
– Selection bias: what to include?
Linked Open Data in a Nutshell
• Started in 2007
• A collection of ~1,000 open datasets
– from various domains, e.g., general knowledge, government data, …
– using Semantic Web standards (HTTP, RDF, SPARQL, …)
• Machine processable
• Free of charge
• Sophisticated tool stacks
• LOD cloud diagram: http://lod-cloud.net/
Example: DBpedia
The RapidMiner LOD Extension
• Automatic discovery of links to Linked Open Data
– for local data objects
– e.g., the database entry Boston is linked to http://dbpedia.org/resource/Boston
• Automatic generation of attributes
– e.g., add all numeric values found for Boston (and other cities)
• Plus
– Feature selection algorithms optimized for LOD
– Automatic following of links to other datasets
– Schema matching (coming soon)
• No need to know Semantic Web technologies!
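To make the two core steps more concrete, here is a minimal sketch of what linking and attribute generation could look like outside RapidMiner. It assumes the common DBpedia convention that a resource URI is the label with spaces replaced by underscores, and it only builds a SPARQL query string rather than sending it. The function names are hypothetical and do not reflect the extension's internals.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def link_by_pattern(label: str) -> str:
    """Guess a DBpedia URI from a local label (assumed naming pattern)."""
    return DBPEDIA_RESOURCE + quote(label.strip().replace(" ", "_"))


def numeric_values_query(uri: str) -> str:
    """Build a SPARQL query asking for all numeric values attached to `uri`."""
    return (
        "SELECT ?property ?value WHERE { "
        f"<{uri}> ?property ?value . "
        "FILTER(isNumeric(?value)) }"
    )


boston = link_by_pattern("Boston")
print(boston)
print(numeric_values_query(boston))
```

A guessed URI is not guaranteed to exist or to denote the intended entity, so a real linker would have to verify it, for example by checking that the resource has the expected type.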
Example: the Auto MPG Dataset
• A well-known UCI dataset
– Goal: predict fuel consumption of cars
• Hypothesis: background knowledge → more accurate predictions
• Used background knowledge:
– Entity types and categories from DBpedia (=Wikipedia)
• Result: with M5Rules, the prediction error drops to almost half
– i.e., on average, we are wrong by 1.6 instead of 2.9 MPG
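As a quick check of the "almost half" claim, the relative reduction follows directly from the two error values on the slide (the variable names are only for illustration):

```python
error_without_background = 2.9  # average error in MPG, original attributes only
error_with_background = 1.6     # average error in MPG, with DBpedia attributes

relative_reduction = 1 - error_with_background / error_without_background
print(f"{relative_reduction:.1%}")  # prints 44.8%
```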
Example: the Auto MPG Dataset
• The original attributes are
– cylinders, displacement, horsepower, weight, acceleration, model year, origin
– plus name (unique string) and mpg (target)
• Models built are, e.g.,
– high horsepower/weight → high consumption
• Additional attributes lead to further insights, e.g.
– front-wheel drives have a lower consumption than rear-wheel drives
– hatchbacks have a lower consumption than station wagons
– rally cars generally have a low consumption
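Insights like these become possible once entity types and categories are turned into attributes a learner can use. One simple way to do that, sketched below, is one binary attribute per type or category. The car names and category labels are made up for illustration, and `binarize` is a hypothetical helper, not the extension's generator.

```python
def binarize(entity_labels):
    """Turn per-entity label sets into 0/1 attribute vectors.

    Returns the sorted attribute names and, per entity, a list of 0/1 values
    in the same order.
    """
    vocabulary = sorted({label for labels in entity_labels.values() for label in labels})
    rows = {
        name: [int(label in labels) for label in vocabulary]
        for name, labels in entity_labels.items()
    }
    return vocabulary, rows


# Hypothetical categories for two hypothetical cars.
cars = {
    "car_a": {"Front-wheel-drive vehicles", "Hatchbacks"},
    "car_b": {"Rear-wheel-drive vehicles", "Station wagons"},
}

attribute_names, attribute_rows = binarize(cars)
print(attribute_names)
for name, values in attribute_rows.items():
    print(name, values)
```

The resulting columns can be appended to the original attributes before training, at which point a rule learner can form conditions such as "is a hatchback".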
Example: Analyzing Statistics
• As shown, e.g., at ESWC 2012, SemStats 2013
• Statistics found on the web often contain only a few attributes
– extreme case: only entity + target
• Examples:
– Quality of living in cities (right)
– Corruption by country
– Fertility rate by country
– Suicide rate by country
– Box office revenue of films
– ...
Example: Analyzing Statistics
• Process in RapidMiner:
– load statistic
– link entities (cities, countries, etc.) to LOD cloud
– collect additional attributes
– analyze for correlations with target attribute of statistic
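The last step of this process can be illustrated with a small sketch that ranks collected attributes by the strength of their Pearson correlation with the target. The attribute names and numbers are invented, and Pearson correlation is just one plausible choice here; the slides do not say which measure is used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_attributes(attributes, target):
    """Sort attributes by absolute correlation with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in attributes.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented toy data: one target value and three attributes per entity.
target = [1.0, 2.0, 3.0, 4.0]
attributes = {
    "attr_a": [2.0, 4.0, 6.0, 8.0],  # rises with the target
    "attr_b": [4.0, 3.0, 2.0, 1.0],  # falls as the target rises
    "attr_c": [1.0, 3.0, 1.0, 3.0],  # only weakly related
}

ranked = rank_attributes(attributes, target)
for name, score in ranked:
    print(f"{name}: {score:+.2f}")
```

With hundreds of generated attributes, some strong correlations will appear by chance, so findings of this kind are better read as hypotheses than as results.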
Example: Analyzing Statistics
• Corruption Perception Index (CPI) by Transparency International
• Indicators for low corruption:
– high HDI (human development index)
– large number of companies
– large number of NGOs
– small number of cargo airlines?!
• Burnout rates in German DAX companies
– Positive correlation between turnover and burnout rates
– Car manufacturers are less prone to burnout
– Local companies are less prone to burnout than international ones
• Exception: Frankfurt
Example: Analyzing Statistics
• Sexual activity (based on Durex survey 2005-2009)
– Higher in French speaking than in English speaking countries
– High GDP per capita → low activity
– High unemployment rate → high activity
– High number of ISPs → low activity
http://xkcd.com/552/
Further Usage Examples
• Classification of Twitter messages (SMILE, 2013)
– given a target, e.g., messages related to car traffic
– annotate message, extract abstract features for concepts
– e.g., “I-90” → highway
• Prediction of user location for Twitter (ICWSM, 2013)
– useful, e.g., for market research
– combination with sentiment analysis: public opinion maps
• Identifying disputed topics in the news (LD4KD, 2014)
– on a corpus of different online newspapers
– identified, e.g., competing opinions on drug legislation and gay marriage
• Debugging Linked Open Data itself
– e.g., identifying wrong links and axioms
– combination with outlier detection
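For the Twitter classification example, the idea of abstract features can be sketched as follows: mentions in a message are replaced by the type of the concept they refer to, so that a classifier sees "highway" rather than one specific road. The lookup table below is a hand-written stand-in for an annotation service plus LOD type lookup, and `abstract_features` is a hypothetical name.

```python
def abstract_features(message, concept_types):
    """Return the sorted concept types of all known mentions in a message."""
    tokens = {token.strip(".,!?;:\"'") for token in message.split()}
    return sorted({concept_types[token] for token in tokens if token in concept_types})


# Hand-written stand-in for annotation plus type lookup.
concept_types = {"I-90": "highway"}

print(abstract_features("Accident on I-90, expect delays!", concept_types))
```

Because the feature is the type rather than the surface string, a model trained on messages about one highway could in principle generalize to messages about another.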
Conclusions
• Many data mining tasks are better solved with more background knowledge
– better predictive models
– more insights from additional attributes
• A lot of such knowledge exists as Linked Open Data
• The Linked Open Data extension grants easy access to that data
– from within RapidMiner
– without the need to know anything about RDF, SPARQL, etc.
• Try it out!
– find “Linked Open Data” on the marketplace
– Google Group: https://groups.google.com/forum/#!forum/rmlod