Ontology construction from text
Blaz Fortuna
Outline
  Big picture
  OntoGen
  Future work
Big picture
Vision
What is “text”?
  From single documents to large corpora (different granularity)
What is “structured information”?
  From topic taxonomies to full-blown ontologies (different expressivity)
Extracting structured information from text
Available tools
  Text mining: for dealing with large corpora
  Natural Language Processing (NLP): for dealing with sentence-level structure
  Machine learning: for abstracting structure from data (modeling); used inside many text mining and NLP algorithms
  Visualization: for user interaction
The Plan
[Figure: planned tools placed on two axes, expressiveness and granularity (document, corpus): OntoGen, Template Extraction, Semantic Graphs, Q&A]
OntoGen
OntoGen
Tool for semi-automatic ontology construction from large text corpora
Integrates several text-mining methods:
  Clustering
  Active learning
  Classification
  Visualizations
Publicly available at ontogen.ijs.si
[Fortuna, Mladenić, Grobelnik, 2005]
Ontology construction with OntoGen
Semi-automatic
  The system provides suggestions and insights into the domain
  The user interacts with the parameters of the methods
  Final decisions are taken by the user
Data-driven
  Most of the aid provided by the system is based on some underlying data
  Instances are described by features extracted from the data (e.g. word vectors)
Ontology model in OntoGen
An ontology is a data model representing:
  a set of concepts within a domain
  the relationships between these concepts
OntoGen models an ontology as a graph/network structure consisting of:
  a set of concepts (vertices in a graph)
  a set of instances assigned to particular concepts (data records assigned to vertices in a graph)
  a set of relationships connecting concepts (directed edges in a graph)
Each instance is described by a set of features.
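The graph model described on this slide can be sketched as a small data structure. This is only an illustration, not OntoGen's actual implementation: the class names, the "sub-concept-of" relation label and the toy instances are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A vertex in the ontology graph, holding the ids of its instances."""
    name: str
    instance_ids: set = field(default_factory=set)


@dataclass
class Ontology:
    """Concepts (vertices), labeled directed relations (edges), instances."""
    concepts: dict = field(default_factory=dict)   # concept name -> Concept
    relations: set = field(default_factory=set)    # (source, label, target)
    features: dict = field(default_factory=dict)   # instance id -> features

    def add_concept(self, name, instance_ids=()):
        self.concepts[name] = Concept(name, set(instance_ids))

    def add_relation(self, source, label, target):
        # Both endpoints must already exist as concepts.
        if source not in self.concepts or target not in self.concepts:
            raise KeyError("unknown concept in relation")
        self.relations.add((source, label, target))

    def sub_concepts(self, name):
        """Concepts linked to `name` by a 'sub-concept-of' edge (assumed label)."""
        return sorted(s for (s, label, t) in self.relations
                      if label == "sub-concept-of" and t == name)


# Toy ontology: two instances, a root concept and two sub-concepts.
onto = Ontology()
onto.features = {1: {"analysis": 1.0}, 2: {"algebra": 1.0}}
onto.add_concept("root", [1, 2])
onto.add_concept("analysis", [1])
onto.add_concept("algebra", [2])
onto.add_relation("analysis", "sub-concept-of", "root")
onto.add_relation("algebra", "sub-concept-of", "root")
```

Storing relations as labeled triples keeps the sketch open to relation types other than the sub-concept hierarchy.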
Example of a Topic Ontology
Instance representation
Bag of words:
  Vocabulary: {wi | i = 1, …, N}
  Documents are represented with vectors (word space)
Example:
  Document set:
    d1 = “Canonical Correlation Analysis”
    d2 = “Numerical Analysis”
    d3 = “Numerical Linear Algebra”
  Vocabulary: {“Canonical”, “Correlation”, “Analysis”, “Numerical”, “Linear”, “Algebra”}
  Document vector representation:
    x1 = (1, 1, 1, 0, 0, 0)
    x2 = (0, 0, 1, 1, 0, 0)
    x3 = (0, 0, 0, 1, 1, 1)
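The example above can be reproduced with a few lines of code. This is a minimal sketch using raw word counts and whitespace tokenization; both function names are made up for illustration.

```python
def build_vocabulary(documents):
    """Collect distinct words in order of first appearance."""
    vocabulary = []
    for text in documents:
        for word in text.split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    words = text.split()
    return tuple(words.count(word) for word in vocabulary)


documents = [
    "Canonical Correlation Analysis",
    "Numerical Analysis",
    "Numerical Linear Algebra",
]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(d, vocabulary) for d in documents]
```

In practice the raw counts would usually be replaced by weighted values such as TFIDF.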
Basic idea behind OntoGen
[Figure labels: Domain, Text corpus, Ontology, Concept A, Concept B, Concept C]
Concept discovery – unsupervised
Clustering-based approach
  K-means clustering of the instances
  Clusters are offered as suggestions
  The user selects relevant suggestions
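The clustering step can be sketched with a plain k-means (Lloyd's algorithm). This is not OntoGen's own code: squared Euclidean distance, the fixed initial centroids and the toy vectors are assumptions chosen to keep the example deterministic.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def k_means(vectors, initial_centroids, max_iterations=100):
    """Lloyd's algorithm; returns one cluster index per input vector."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no instance changed cluster
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = mean(members)
    return assignment


# Toy instance vectors; each resulting cluster would be one concept suggestion.
vectors = [(1, 1, 0, 0), (1, 0, 0, 0), (0, 0, 1, 1), (0, 0, 0, 1)]
clusters = k_means(vectors, initial_centroids=[vectors[0], vectors[2]])
```

The instances sharing a cluster index form one suggested sub-concept, which the user can accept or discard.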
Concept discovery – unsupervised
Visualization-based approach
  Topic-landscape-based visualization
  One instance is one yellow point on the map
  Similar instances appear closer together
  The user can make a concept by selecting a region of the map
  Pink points on the map are selected instances
Concept discovery – supervised
Active learning based approach
  User enters a query
  System ranks the instances according to the query
  User labels instances:
    Yes – belongs to the concept
    No – does not belong to the concept
  Once there are enough labeled instances, the system switches to SVM-based active learning
  When done, the concept is added to the ontology
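The two phases above can be sketched as two selection functions. The slide does not say how the query ranking or the SVM-based selection is computed, so cosine similarity and uncertainty sampling (picking the instance closest to the separating hyperplane) are assumptions. Training the SVM is left out: the weights and bias below are hypothetical values standing in for a trained linear model.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_query(query_vector, instances):
    """Phase 1: instance ids ordered from most to least similar to the query."""
    return sorted(instances,
                  key=lambda i: cosine(query_vector, instances[i]),
                  reverse=True)


def decision(weights, bias, x):
    """Signed score of a linear classifier w.x + b."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def most_uncertain(weights, bias, instances, labeled):
    """Phase 2: the unlabeled instance closest to the hyperplane w.x + b = 0."""
    unlabeled = [i for i in instances if i not in labeled]
    return min(unlabeled,
               key=lambda i: abs(decision(weights, bias, instances[i])))


instances = {
    "d1": (1, 1, 1, 0, 0, 0),
    "d2": (0, 0, 1, 1, 0, 0),
    "d3": (0, 0, 0, 1, 1, 1),
}
query = (0, 0, 0, 1, 0, 0)  # the single word "Numerical"
ranking = rank_by_query(query, instances)

# Hypothetical linear model, as if trained on the labels collected so far.
weights, bias = (-1.0, -1.0, 0.0, 1.0, 0.5, 0.5), -0.5
next_question = most_uncertain(weights, bias, instances, labeled={"d1"})
```

Each answer from the user would be added to the labeled set and the model retrained before the next question is chosen.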
Concept discovery – supervised
Classification-based approach
  Instances are classified into a background ontology called OntoLight
  Concepts with the most instances are provided as sub-concept suggestions
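The counting idea behind these suggestions can be sketched as follows. How OntoLight classifies instances is not described here, so a nearest-centroid classifier with dot-product similarity is used as a stand-in, and the background concepts and vectors are toy assumptions.

```python
from collections import Counter


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def classify(vector, centroids):
    """Name of the background concept whose centroid is most similar."""
    return max(centroids, key=lambda name: dot(vector, centroids[name]))


def suggest_sub_concepts(instances, centroids, top=2):
    """Background concepts that received the most instances, best first."""
    counts = Counter(classify(v, centroids) for v in instances)
    return [name for name, _ in counts.most_common(top)]


# Toy background ontology with two concepts in a two-word feature space.
centroids = {"Mathematics": (1.0, 0.0), "Sports": (0.0, 1.0)}
instances = [(0.9, 0.1), (0.8, 0.3), (0.2, 0.7)]
suggestions = suggest_sub_concepts(instances, centroids, top=1)
```

The same counts could also serve for naming: the background concept that attracts most of a concept's instances is a candidate name.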
Concept naming – unsupervised
Automatic extraction of keywords for describing the concepts
  First approach: based on TFIDF weights of words
  Second approach: based on an SVM-based feature selection algorithm
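The first approach can be sketched by summing TFIDF weights over a concept's documents and keeping the top-scoring words. The exact weighting used by OntoGen is not given on the slide; raw term frequency times log(N / document frequency) is an assumption, and the documents are toy data.

```python
import math
from collections import Counter


def tfidf_keywords(concept_docs, all_docs, top=3):
    """Rank the words of a concept by summed TF-IDF over its documents.

    `concept_docs` is assumed to be a subset of `all_docs`.
    """
    n_docs = len(all_docs)
    document_frequency = Counter()
    for doc in all_docs:
        document_frequency.update(set(doc.lower().split()))
    scores = Counter()
    for doc in concept_docs:
        for word, tf in Counter(doc.lower().split()).items():
            scores[word] += tf * math.log(n_docs / document_frequency[word])
    return [word for word, _ in scores.most_common(top)]


concept_docs = [
    "numerical analysis",
    "numerical linear algebra",
    "numerical optimization",
]
all_docs = concept_docs + [
    "stock market analysis",
    "stock market news",
    "sports news",
]
best_keyword = tfidf_keywords(concept_docs, all_docs, top=1)
```

A word scores high when it is frequent inside the concept and rare in the rest of the corpus.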
Concept naming – supervised
Classification-based approach
  The concept’s instances are classified into a background ontology called OntoLight
  Names from the background ontology with the most classified instances are provided as suggestions
  Shows what the concept is called in some pre-defined vocabulary
Concept visualization
Instances are visualized as points on a 2D map.
The distance between two instances on the map corresponds to their similarity.
Characteristic keywords are shown for all parts of the map.
User can select groups of instances on the map to create sub-concepts.
Ontology visualization
Ontology concepts visualized as points on the 2D topic map.
Topic map generated from a set of text documents.
Multiple views of the same data
Simple taxonomy on top of Reuters news articles
Two different views: one focuses on topics, the other on geography.
Each view yields a different taxonomy on the data.
An SVM-based method detects the importance of keywords for each view.
Topics view
Countries view
UK takeovers and mergers: The following are additions and deletions to the takeovers and mergers list for the week beginning August 19, as provided by the Takeover …
Lloyd’s CEO questioned in recovery suit in U.S.: Ronald Sandler, chief executive of Lloyd's of London, on Tuesday underwent a second day of court interrogation about …
Word weight learning
The word weight learning method is based on SVM feature selection.
Besides ranking the words, it also assigns them weights based on the SVM classifier.
Notation:
  N – number of documents
  {x_1, …, x_N} – documents
  C(x_i) – set of categories for document x_i
  n – number of words
  {w_1, …, w_n} – word weights
  {n^j_1, …, n^j_n} – SVM normal vector for the j-th category
Algorithm:
  1. Calculate a linear SVM classifier for each category.
  2. Calculate word weights for each category from the SVM normal vectors. The weight for the i-th word and j-th category is:
       w_i^j = \frac{1}{N} \sum_{k=1}^{N} x_{k,i} \, n_i^j
  3. Final word weights are calculated separately for each document:
       x_{k,i} = TF_i \cdot \sum_{j \in C(x_k)} w_i^j
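Steps 2 and 3 can be sketched in code under one reading of the formulas on this slide, which is an assumption: the weight of word i for category j is the mean value of that word over all documents multiplied by component i of the category's SVM normal vector, and the final value for a document is the word's term frequency times the sum of these weights over the document's categories. Step 1 (training the SVMs) is not shown; the normal vectors below are made-up numbers.

```python
def category_word_weights(X, normal):
    """Step 2: w_i^j = (1/N) * sum_k x_{k,i} * n_i^j for one category j."""
    n_docs = len(X)
    return [sum(row[i] for row in X) / n_docs * normal[i]
            for i in range(len(normal))]


def reweight_document(tf, categories, weights_by_category):
    """Step 3: term frequency times the summed weights of the document's categories."""
    return [tf[i] * sum(weights_by_category[j][i] for j in categories)
            for i in range(len(tf))]


# Two documents over a two-word vocabulary (toy values).
X = [[1, 0],
     [1, 2]]
# Hypothetical SVM normal vectors, one per category.
normals = {"A": [0.5, -1.0], "B": [2.0, 1.0]}

weights_by_category = {j: category_word_weights(X, n) for j, n in normals.items()}
# Second document: term frequencies [1, 2], labeled with both categories.
final_weights = reweight_document([1, 2], ["A", "B"], weights_by_category)
```

A word that the classifiers of a document's categories find unimportant (or contradictory, as word 2 here) ends up with a weight near zero in that document.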
Relations – preprocessing
Name-entity profile
  Sentences extracted from articles in which the name entity appears
  Example: Agassi
    Olympic champion Agassi meets of Morocco in the first round.
Co-occurrence profiles
  Sentences extracted from articles in which two name entities appear together
  Example: Sampras – Agassi
    There will be no repeat of last year's men's final with eighth-ranked Agassi landing in Sampras's half of the draw.
Relationship
  By extracting keywords from co-occurrence profiles we can get a summary of the relationship between two name entities.
  Keywords are extracted from the bag-of-words vectors of the co-occurrence profile.
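The preprocessing above can be sketched as two small functions: one collects the sentences in which both entities appear, the other counts the remaining words. The regular-expression sentence splitter, plain substring matching of entity names, raw counts as keyword scores and the toy articles are all simplifying assumptions.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurrence_profile(articles, entity_a, entity_b):
    """Sentences in which both entities appear together."""
    return [s for article in articles for s in split_sentences(article)
            if entity_a in s and entity_b in s]


def profile_keywords(profile, stop_words, top=3):
    """Most frequent non-stop words in the profile's bag of words."""
    counts = Counter(
        word
        for sentence in profile
        for word in re.findall(r"[a-z]+", sentence.lower())
        if word not in stop_words
    )
    return [word for word, _ in counts.most_common(top)]


# Toy articles, made up for illustration.
articles = [
    "Agassi won on Monday. Sampras and Agassi meet in the final.",
    "Sampras beat Agassi in the final last year.",
]
profile = cooccurrence_profile(articles, "Sampras", "Agassi")
stop_words = {"and", "in", "the", "sampras", "agassi"}
keywords = profile_keywords(profile, stop_words, top=1)
```

The entity names themselves are treated as stop words so that the keywords describe the relationship rather than the entities.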
Relations – example
Bill Clinton
  Iraq [476]: president, missiles, attacks, Kurdish, northern
  Bob Dole [294]: republican, president, presidential, candidates, poll
  United States [204]: president, Monday, southern, move, election
  White House [146]: president, spokesman, reporters, Friday, campaign
  Iran [74]: president, investment, gas, law, penalize
  Congress [66]: president, calling, billion, republican, democrat
  Chicago [42]: president, conventional, democrat, drug, campaign
  Al Gore [40]: president, vice, bus, tour, election
Chicago
  Clinton [236]: conventional, democrat, training, day, campaign
  U.S. [164]: trader, markets, purchasers, index, future
  New York [100]: variety, mixed, critical, poll, bulletproof
  Dole [70]: conventional, democrat, campaign, drug, Sunday
  Kansas City [70]: basis, wheat, bushels, fob, red
  Los Angeles [60]: variety, mixed, critical, poll, stg
  Illinois [34]: democrat, state, conventional, trip, mayor
  Chicago Board of Trade [34]: future, deliverable, stocks, bus, reporters
  San Francisco [34]: operations, municipal, full, remain, services
  Boston [32]: fared, comparatively, game, existed, American
Relations – abstraction
Clustering of name entities using k-means clustering
Relations between clusters are established based on the name-entity co-occurrence profiles:
  Let C1 and C2 be two clusters
  Let pij be a co-occurrence profile between name entities di and dj
  P = {pij | di from C1 and dj from C2}
  The relation is defined by the profile set P
  The summary of the relation is extracted from the centroid vector of the profiles from P
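The construction of P and its summary can be sketched as follows. The profile vectors, the three-word vocabulary and the choice to accept an entity pair in either order are assumptions made for the example.

```python
def relation_profiles(cluster_1, cluster_2, profiles):
    """P: co-occurrence profiles linking an entity of C1 with an entity of C2."""
    return [vector for (a, b), vector in profiles.items()
            if (a in cluster_1 and b in cluster_2)
            or (b in cluster_1 and a in cluster_2)]


def centroid(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def relation_summary(cluster_1, cluster_2, profiles, vocabulary, top=2):
    """Top-weighted words of the centroid of P; empty if the clusters never co-occur."""
    P = relation_profiles(cluster_1, cluster_2, profiles)
    if not P:
        return []
    c = centroid(P)
    ranked = sorted(range(len(vocabulary)), key=lambda i: c[i], reverse=True)
    return [vocabulary[i] for i in ranked[:top]]


# Toy bag-of-words profiles over a three-word vocabulary (made-up numbers).
vocabulary = ["election", "peace", "wheat"]
profiles = {
    ("Bosnia", "Washington"): [2, 1, 0],
    ("Sarajevo", "United States"): [1, 3, 0],
    ("Bosnia", "Russia"): [0, 0, 5],
}
C1 = {"Bosnia", "Sarajevo"}
C2 = {"Washington", "United States"}
summary = relation_summary(C1, C2, profiles, vocabulary)
```

Only the two profiles that connect C1 with C2 enter the centroid; the Bosnia and Russia profile is ignored because Russia is in neither cluster.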
Relations – example
Example of clusters:
  Cluster 1:
    Name Entities: Bosnia, Bosnian, Sarajevo
    Keywords: serbs, moslems, bosnian, election
  Cluster 2:
    Name Entities: Russia, Britain, Germany, France
    Keywords: meeting, country, government, told
  Cluster 3:
    Name Entities: Washington, United States
    Keywords: spokesman, military, missiles
Example of relations:
  Cluster 1 vs. Cluster 3:
    Name Entities: U.N., U.S., American, Washington, Bosnia, Turkey, Richard Holbrooke, U.N. Security Council, White House
    Keywords: election, serb, war, bosnians, moslem, peace, tribunal, police, spokesman, crime
  Cluster 1 vs. Cluster 2:
    Name Entities: NATO, Yugoslavia, Bosnia, Croatia, Serbia, Belgrade, Balkan, OSCE, Burns
    Keywords: country, election, state, international, peace, meeting, secretary, foreign, talks, member
Relations – example
Name Entities: Russia, Britain, Germany, France, China, EU
  Keywords: meeting, country, government, told, officials, union, minister, secretary, trade, report
Name Entities: Hashimoto, Romano Prodi, Benjamin Netanyahu, Jim Bolger
  Keywords: minister, prime, meeting, foreign, talks, president, peace, visit, told, officials
Name Entities: Bill Clinton, Jacques Chirac, Suharto, Hosni Mubarak, Leonid Kuchma
  Keywords: president, meeting, visit, talks, leaders, minister, secretary, officials, state
Name Entities: Supreme Court, U.S. District Court, Simpson, Justice Department
  Keywords: courts, case, year, told, rules, trials, charges, sentenced, law, file
Name Entities: Tennessee Valley Authority, New Hill, TVA, Florida Power & Light Co, St Lucie
  Keywords: plant, powerful, company, venture, electrical, projects, million, joint, province, state
Relations – example
[Figure: abstracted relation graph. Concepts: Country, President, Minister, Court, Power plant. Relation labels: Visit, Visit, Invest, Rule]
Evaluation
First prototype was successfully used:
  Applied in multiple domains: business, legislation and digital libraries (SEKT project)
  Users were always domain experts with limited knowledge of and experience with ontology construction / knowledge engineering
  Feedback from the first trials was used as input for the second prototype, the one presented here
User study performed for the second prototype
  Main impression: the tool saves time and is especially useful when working with large collections of documents
  Main disadvantages: abstraction, unattractive interface design
Used in several EU projects: SWING, TAO, NEON, ECOLEAD, E4, TOOLEAST
From the users
Future work
Move towards bigger granularity
Semantic graphs
  Extract data points at the sentence level (OntoGen does it at the document level)
  Based on triplets extracted from sentence structure: Subject, Predicate, Object
  Extraction can be done with:
    Parsers
    Structured learning
  Triplets from one document can be merged into semantic graphs
  Stronger than bag-of-words
  Example application: document summarization
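Merging triplets into a graph can be sketched as below. The triplets are assumed to be already extracted (by a parser, for instance), and the adjacency-set representation and the toy triplets are assumptions for illustration.

```python
from collections import defaultdict


def build_semantic_graph(triplets):
    """Merge (subject, predicate, object) triplets into one directed graph.

    Nodes are subjects and objects; each edge carries its predicate as label.
    """
    graph = defaultdict(set)
    for subject, predicate, obj in triplets:
        graph[subject].add((predicate, obj))
        graph[obj]  # touch the key so that objects also appear as nodes
    return graph


# Toy triplets as they might come from the sentences of one document.
triplets = [
    ("earthquake", "hits", "city"),
    ("earthquake", "kills", "people"),
    ("city", "declares", "emergency"),
]
graph = build_semantic_graph(triplets)
```

Because equal subjects and objects collapse into one node, facts from different sentences about the same entity become connected, which a bag of words cannot express.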
Template extraction
Hypothesis: people view events through “templates”
  Models of how things evolve and relate
  People use these models to understand and predict
Goal: automatic extraction of such models from texts
Search over triplets
  Triplet extraction was run over the Reuters corpus
  800k news articles from 1996 to 1997
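A search over triplets, and a first step from search results towards a template, can be sketched as follows. The wildcard search and the grouping of objects by predicate are illustrative assumptions, not the method used on the Reuters corpus, and the triplets are toy data.

```python
from collections import defaultdict


def search_triplets(triplets, subject=None, predicate=None, obj=None):
    """Triplets matching every field that is not None (None is a wildcard)."""
    pattern = (subject, predicate, obj)
    return [t for t in triplets
            if all(p is None or p == v for p, v in zip(pattern, t))]


def template_for(triplets, subject):
    """Group the objects seen with each predicate of the given subject."""
    slots = defaultdict(set)
    for _, predicate, obj in search_triplets(triplets, subject=subject):
        slots[predicate].add(obj)
    return dict(slots)


# Toy triplets, made up for illustration.
triplets = [
    ("earthquake", "hits", "Japan"),
    ("earthquake", "kills", "people"),
    ("storm", "hits", "Florida"),
    ("earthquake", "hits", "Turkey"),
]
hits = search_triplets(triplets, subject="earthquake", predicate="hits")
template = template_for(triplets, "earthquake")
```

Each predicate becomes a slot of the template and the collected objects hint at the type of filler the slot expects.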
Search over triplets
Template earthquake
[Figure: template for “Earthquake”. Relation labels: Hits, Hits in, Kills, Collapses, Registered in, Measured by. Slot labels: Places, Time-period, People, Buildings, Richter scale, Government]
Thank you!
Questions?