Post on 18-Dec-2015
Machine Learning and the Semantic Web
Hendrik Blockeel
Katholieke Universiteit Leuven
Department of Computer Science
Thanks: Raymond Kosala, Nico Jacobs
Overview
Machine learning and data mining
Relationship with the Semantic Web
  Synergy between both
Some concrete examples
  Document classification
  Information integration
Conclusions
Machine Learning & Data Mining
Related technology, different focus
Machine learning:
  Programs that improve their performance on certain tasks
  Focus on adaptive behaviour
Data mining:
  Discovering implicit knowledge (regularities) in large amounts of data
  Focus on handling large amounts of data
Very useful technology in the context of the Web
Learning Agents
Programs that
  Learn the user’s preferences
    Make life for the user as simple as possible
    E.g., intelligent mail reader
    E.g., adaptive web pages
      Move links, create “direct” links, ...
      Index page synthesis (Perkowitz & Etzioni, IJCAI 1999)
  Learn how to find reliable information
    E.g., learn which other people have similar preferences to this user, and use their opinions to make suggestions
(other applications: learning to play games, ...)
Mining the Web
Analyze data that are available on the Web
Distinguish 3 types:
  Web content mining
    Look in contents of documents (text, ...)
  Web structure mining
    Look at links between documents
  Web usage mining
    Look at user logs (e.g., who accessed a web page, which links are often used, ...)
Web Content Mining
Relies on information extraction (IE)
  E.g., in a text: find keywords, ...
  Techniques from machine learning, statistics, ... used to guess from context
    what a word means
    what its function in the text is
    ...
  Fill a schema with specific slots, based on analysis of text
  Even more complicated: recognise objects in pictures, ...
Information extraction is a complex matter
Mining for Genes
Jenssen et al. (2001), Nature Genetics 28, “A literature network of human genes”
  Mining MEDLINE database of abstracts
  Find names of genes occurring together
  Construct similarity graph
  Construct a database with this information
Database contains knowledge no single individual has, or could obtain without data mining
Similar techniques could be used on the web
  One extra problem: uncertainty about reliability
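The co-occurrence step can be illustrated with a minimal sketch. The abstracts, the gene list and the function name `cooccurrence_graph` are made up for illustration; the gene-name recognition used by Jenssen et al. is not reproduced here, and a plain word match stands in for it.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(abstracts, gene_names):
    """Count how often each pair of gene names appears in the same abstract."""
    edges = Counter()
    for text in abstracts:
        # Crude tokenisation (an assumption): strip commas and full stops, split on spaces.
        words = set(text.upper().replace(",", " ").replace(".", " ").split())
        present = sorted(g for g in gene_names if g in words)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges

# Toy data (hypothetical sentences, not taken from MEDLINE).
abstracts = [
    "BRCA1 interacts with TP53 in tumour cells.",
    "Expression of TP53 and BRCA1 was measured.",
    "MYC is regulated independently of TP53.",
]
genes = {"BRCA1", "TP53", "MYC"}
graph = cooccurrence_graph(abstracts, genes)
```

The edge weights in `graph` form the similarity graph: the more abstracts mention two genes together, the heavier the edge between them.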
Web Structure Mining
Analyse structure of the web
  Which sites have many incoming / outgoing links?
    Identify “hubs”
  Find clusters of sites that are strongly interconnected
    Web communities
  ...
E.g., Google
  Identifies important pages based on the links that point to them (rather than the contents of the page itself)
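Link-based ranking can be sketched with a simplified PageRank-style iteration. This is an illustration only: the toy link graph, the damping factor of 0.85 and the fixed number of iterations are assumptions, and the function is not a description of how Google's production ranking works.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy web: three pages link to "c", which links back to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
top_page = max(rank, key=rank.get)
```

In this toy graph the page with the most incoming links, `c`, ends up with the highest rank, regardless of what the page itself contains.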
Web Usage Mining
Log user behaviour
  Which links are often followed, in which order, how long a page is looked at, ...
Possible at several levels:
  General usage statistics
  User-specific statistics
    Relating behaviour to properties of user, insofar available
E.g., adaptive web sites
  Adaplix project: automatic index page creation
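As a small illustration of general usage statistics, the sketch below counts which page-to-page transitions occur most often in a log. The log format (time-ordered `(user, page)` pairs), the page names and the function name `link_transitions` are assumptions made for this example.

```python
from collections import Counter

def link_transitions(log):
    """Count page-to-page transitions per user from (user, page) records in time order."""
    last_page = {}
    transitions = Counter()
    for user, page in log:
        if user in last_page:
            transitions[(last_page[user], page)] += 1
        last_page[user] = page
    return transitions

# Hypothetical access log of two users.
log = [
    ("u1", "/home"), ("u2", "/home"), ("u1", "/courses"),
    ("u2", "/courses"), ("u1", "/courses/cs101"), ("u2", "/staff"),
]
transitions = link_transitions(log)
most_followed = transitions.most_common(1)[0]
```

An adaptive site could use counts like these to decide which links deserve a more prominent place.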
Web Mining As It Currently Is
Machine learning / data mining strongly rely on
  Data quantity
  Data quality
Quantity is usually not a problem on the Web
Quality is!
  Much data not in easily processable format
    E.g., inside text documents: need information extraction
    Unstructured, poorly structured, heterogeneously structured
  Lots of noise
  ...
How Is All This Related to the Semantic Web?
There can be a synergy:
  Machine learning can help with building the Semantic Web
  The Semantic Web will help mining the Web, making Web interfaces and agents more intelligent
What Machine Learning Can Do for the Semantic Web
Upgrading the current web to a semantic web involves a lot of work
Can partially be automated!
Examples:
  Learning ontologies
  Automatic document classification
  Information integration
  ...
Learning Ontologies
Maedche & Staab (2001), “Ontology learning for the semantic web”
View:
  Manually creating ontologies is very labour-intensive
  Fully automating the creation of ontologies is not feasible
  Hence: develop a tool that helps with building ontologies
Basic components:
  Good graphical interface (man-machine interaction)
  Powerful underlying machine learning techniques
Text-To-Onto
Framework:
  Import / reuse existing ontologies
  Extract ontology from documents
    Identify new terms, map onto existing concepts or define new ones
    Identify relationships between concepts
    ...
    Many opportunities for general machine learning techniques
  Prune ontology
  Refine ontology
Some Useful Techniques for Learning Ontologies
Term extraction from texts
Identification of concepts
Hierarchical clustering
  Clustering: finding groups of “similar” things
  Hierarchical clustering: clusters of clusters
  Taxonomy can be constructed through hierarchical clustering of concepts
Association rules
  Find sets of terms that often occur together
  May indicate important relations
    E.g., events in texts often co-occur with locations
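How hierarchical clustering can yield a taxonomy is sketched below. Each term is described by the set of words it tends to occur with, and the two most similar clusters are merged repeatedly. The terms, their context words, the Jaccard measure and the pooling of context words on a merge are all assumptions chosen for brevity; this is not the algorithm used in Text-To-Onto.

```python
def jaccard(a, b):
    """Overlap of two word sets, between 0 and 1."""
    return len(a & b) / len(a | b)

def agglomerate(contexts):
    """Repeatedly merge the two most similar clusters; return a nested-tuple tree."""
    # Each cluster is (tree, pooled context words of its members).
    clusters = [(term, set(words)) for term, words in sorted(contexts.items())]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = jaccard(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical terms with the words they co-occur with.
contexts = {
    "hotel": {"book", "room", "stay"},
    "hostel": {"book", "room", "cheap"},
    "concert": {"ticket", "music", "stage"},
    "festival": {"ticket", "music", "outdoor"},
}
tree = agglomerate(contexts)
```

The resulting tree groups `concert` with `festival` and `hostel` with `hotel` before joining the two groups, which is the kind of structure a human could then name (e.g. "event" and "accommodation") and refine.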
Information Integration
Doan, Domingos, Halevy: “Reconciling Schemas of Disparate Data Sources”, ACM SIGMOD 2001
Context: given databases with different schemas:
  Find similarities in schemas, guess how concepts map onto each other
  Integrate the schemas
Essentially the same as mapping ontologies onto each other
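To make "guess how concepts map onto each other" concrete, here is a deliberately naive sketch that matches fields purely on name similarity. It is not the method of Doan et al., which learns from more than the field names; the field names, the threshold of 0.5 and the function name `match_schemas` are assumptions.

```python
from difflib import SequenceMatcher

def match_schemas(source_fields, target_fields, threshold=0.5):
    """Map each source field to the most similarly named target field, if any."""
    mapping = {}
    for s in source_fields:
        scored = [(SequenceMatcher(None, s.lower(), t.lower()).ratio(), t)
                  for t in target_fields]
        score, best = max(scored)
        # Leave the field unmapped when even the best candidate is a poor match.
        if score >= threshold:
            mapping[s] = best
    return mapping

# Two hypothetical schemas describing the same kind of data.
source = ["listed_price", "Address", "qty"]
target = ["price", "address", "agent"]
mapping = match_schemas(source, target)
```

Here `qty` stays unmapped because no target name resembles it. Name similarity alone breaks down quickly (e.g. `cost` vs. `price`), which is why learning from the data values themselves is attractive.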
Automated Document Classification
Mitchell et al.
Based on examples of web pages + what kind of page they are (course page, student page, ...), learn to classify new pages
  Can be based on contents of page, links pointing to page, typical structure of certain kinds of web sites (e.g. universities), ...
Note: helps to relate objects to ontology
Problem: how to get labelled examples
  Unlimited amount of unlabelled pages available
  But labelling them manually is labour-intensive!
Exploiting Unlabelled Data
A solution: co-training (Blum & Mitchell 1998)
  Learn separate (imperfect) classifiers from disjoint sets of sufficient information
  E.g., learn to classify pages from
    Content of page (“Home page of CS 101”)
    Links pointing to page (“CS 101”)
  Take classifications that classifier A is most certain of, add these labels to training set for B (and vice versa)
  Repeat multiple times (kind of bootstrapping process)
Co-training makes it possible to exploit large amounts of unlabelled data!
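The loop described above can be sketched as follows. The two classifiers are simple word counters standing in for real learners, "certainty" is taken to be the margin between the two best label scores, and all page data is invented; these are assumptions for illustration, not the setup of Blum & Mitchell.

```python
from collections import Counter

def train(examples):
    """examples: list of (words, label). Returns per-label word counts."""
    counts = {}
    for words, label in examples:
        counts.setdefault(label, Counter()).update(words)
    return counts

def predict(counts, words):
    """Return (label, confidence); confidence is the margin between the top two scores."""
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    margin = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
    return ranked[0][0], margin

def most_confident(model, pool, view):
    """Return (example, label) for the pool example the model is most sure of, or None."""
    best = max(pool, key=lambda p: predict(model, p[view])[1])
    label, margin = predict(model, best[view])
    return (best, label) if margin > 0 else None

def co_train(labelled, unlabelled, rounds=3):
    """Each example is (content_words, link_words); view 0 is content, view 1 is links."""
    train_a = [(c, y) for (c, _), y in labelled]   # classifier A sees page content
    train_b = [(l, y) for (_, l), y in labelled]   # classifier B sees incoming link text
    pool = list(unlabelled)
    for _ in range(rounds):
        model_a, model_b = train(train_a), train(train_b)
        for model, view, other_train in ((model_a, 0, train_b), (model_b, 1, train_a)):
            if not pool:
                break
            picked = most_confident(model, pool, view)
            if picked:
                example, label = picked
                # The confident label becomes training data for the other classifier.
                other_train.append((example[1 - view], label))
                pool.remove(example)
    return train(train_a), train(train_b)

# Two hand-labelled pages and three unlabelled ones (all hypothetical).
labelled = [
    ((["course", "homework", "lecture"], ["cs101"]), "course"),
    ((["student", "hobbies", "cv"], ["homepage"]), "student"),
]
unlabelled = [
    (["lecture", "homework", "exam"], ["cs202", "syllabus"]),
    (["cv", "hobbies", "photos"], ["personal", "homepage"]),
    (["exam", "schedule"], ["syllabus"]),
]
model_a, model_b = co_train(labelled, unlabelled)
```

In this toy run the content classifier never saw the word "exam" in a hand-labelled page, yet after co-training it associates it with course pages: the link classifier first learned "syllabus" from A's confident label and then passed a label back for the third page.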
What the Semantic Web Can Do for Machine Learning
Will make mining the web much easier
Reason 1: removal of ambiguity
  More precise knowledge of what is meant by certain terms
Reason 2: structured vs. unstructured data
  Learning from structured data is much easier than from unstructured data
Reason 3: availability of background knowledge
  Can be used to make better decisions when learning
Removal of Ambiguity
Example: text document classification
  E.g., given a text, tell to which newsgroups it belongs
Typical approach: “bag of words”
  Look only at which words occur in the text, and how often
  Each time a word occurs that occurs mainly in one particular class, increase probability for that class
But words are ambiguous!
  Increased classification accuracy can be expected by removing ambiguity
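A bag-of-words classifier along these lines can be sketched with naive Bayes and add-one smoothing. The choice of naive Bayes, the newsgroup names and the four training texts are assumptions for illustration.

```python
import math
from collections import Counter

def train_bag_of_words(documents):
    """documents: list of (text, label). Returns word counts and document counts per label."""
    word_counts, doc_counts = {}, Counter()
    for text, label in documents:
        word_counts.setdefault(label, Counter()).update(text.lower().split())
        doc_counts[label] += 1
    return word_counts, doc_counts

def classify(model, text):
    """Pick the label with the highest naive Bayes log-probability (add-one smoothing)."""
    word_counts, doc_counts = model
    vocabulary = {w for c in word_counts.values() for w in c}
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            # Words typical of a class raise that class's score the most.
            score += math.log((counts[word] + 1) / (total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training texts for two newsgroups.
training = [
    ("the jaguar engine and gearbox", "rec.autos"),
    ("new engine for the sports car", "rec.autos"),
    ("the jaguar hunts in the rainforest", "sci.bio"),
    ("rainforest cats hunt at night", "sci.bio"),
]
model = train_bag_of_words(training)
autos = classify(model, "car engine")
bio = classify(model, "rainforest night")
```

The word "jaguar" occurs once in each class here, so on its own it gives the classifier no evidence either way. If the text were annotated with which sense is meant (the car or the animal), that word would become a strong signal.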
Mining From (Un)structured Data
Mining data = intensively querying data
Answering a query is
  Easy in structured data
    Relational database, XML, ...
  Harder in semi-structured data (e.g., HTML)
  Hard in unstructured data
    Information extraction needed
    Could do this by learning a “wrapper”
    This involves one extra layer of learning
Relating this to our text example: taking into account function of words in text
Availability of Background Knowledge
Learning = finding relevant patterns in behaviour
Important to have the right context to describe these patterns
Example:
  Making interesting offers to clients
  “People who bought this book also bought ...”
  = “Instance-based” learning
    Estimate profile of user
    Find users with similar profile
    Look at behaviour of those users to help current user
Availability of Background Knowledge
Can work better if more background knowledge is available, e.g., type of book, author, ...
For instance, for books:
  “similar profile” = users that up till now bought the same books as this user
    May not be many people
  “similar” = often bought books by same author
    Probably many more people, allows for more reasonable guess
  “similar” = often bought books of same genre (fiction, ...)
    May work even better
Ontologies (among others) provide such background knowledge
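The effect of the three notions of "similar" can be sketched as follows. The catalogue, the purchase lists and the function name `similar_users` are invented; "similar" is simplified here to "shares at least one purchase at the chosen level".

```python
def similar_users(purchases, user, key=lambda book: book):
    """Users sharing at least one purchase with `user`, compared at the level given by `key`."""
    mine = {key(b) for b in purchases[user]}
    return sorted(u for u, books in purchases.items()
                  if u != user and mine & {key(b) for b in books})

# Hypothetical background knowledge: title -> (author, genre).
catalogue = {
    "Book A": ("Author X", "fiction"),
    "Book B": ("Author X", "fiction"),
    "Book C": ("Author Y", "fiction"),
    "Book D": ("Author Z", "cookery"),
}
purchases = {
    "ann": ["Book A"],
    "bob": ["Book A"],
    "cat": ["Book B"],
    "dan": ["Book C"],
    "eve": ["Book D"],
}
by_title = similar_users(purchases, "ann")
by_author = similar_users(purchases, "ann", key=lambda b: catalogue[b][0])
by_genre = similar_users(purchases, "ann", key=lambda b: catalogue[b][1])
```

Without the catalogue only `bob` counts as similar to `ann`; with author information `cat` joins, and with genre information `dan` as well, so there are more users to base a suggestion on.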
Web Mining Revisited
Semantic Web will change
  Content mining
    Clearer view on contents and meaning of documents
  Structure mining
    More relevant structure
  Usage mining
    More relevant information on actions of user
Will in general improve intelligence of systems
  E.g., mail filter gets a better view of contents of mails
Promising Learning Techniques
Many different learning techniques exist
  Neural networks, support vector machines, instance-based learning, Bayesian learning, association rules, ...
Not all equally suitable for every task
  E.g., SVM for document classification works well
  E.g., instance-based learning: find other users with same profile as this user to make predictions
Intelligent agents will use a mix of them
Relational learners seem interesting
  Can handle explicit information on objects and relations between them
  Classic example: inductive logic programming
Inductive Logic Programming
Induces rules in first-order logic from examples or other rules
  Such rules can be used to reason with
  The reasoning can be explained
    Cf. example of mail program
Can use existing background knowledge
  “Knowledge-intensive learning”
  Currently: good background knowledge has to be engineered manually
  Will become more easily available with semantic web
Example: mining in chemical domains
Mining in Chemical Domains
Example problem: relate activity of molecule to its properties
  Useful for, e.g., drug development
Which properties are important?
  Chemically relevant properties: functional groups, 3D structure, ... ?
  Has to be encoded manually
  Ideally: get relevant information from some trustworthy data source as and when needed
Intelligent agents will exploit (“tap”) the common intelligence of the Web
Conclusions
Machine learning is a promising tool for the Semantic Web
  For building it
  For exploiting it
Clear synergy between Semantic Web efforts and Machine Learning efforts
Some References
Maedche, “A Machine Learning Perspective for the Semantic Web”, position paper, www.semanticweb.org/SWWS/program/position/soi-maedche.pdf
Maedche & Staab (2001), “Ontology Learning for the Semantic Web”, IEEE Intelligent Systems 16(2)
Jenssen et al. (2001), Nature Genetics 28
Doan et al. (2001), ACM SIGMOD conf.
Kosala & Blockeel (2000), SIGKDD Explorations 2(1)
Mitchell (1996), Machine Learning