
Ontology construction from text

Blaz Fortuna

Outline
  Big picture
  OntoGen
  Future work

Big picture


Vision

What is “text”?
  From single documents to large corpora (different granularity)

What is “structured information”?
  From topic taxonomies to full-blown ontologies (different expressivity)

Extracting structured information from text

Available tools

Text mining
  … for dealing with large corpora

Natural Language Processing (NLP)
  … for dealing with sentence-level structure

Machine learning
  … for abstracting structure from data (modeling)
  … used inside many text mining and NLP algorithms

Visualization
  … for user interaction

The Plan

[Figure: approaches placed along two axes, expressiveness and granularity (from document to corpus): OntoGen, Template Extraction, Semantic Graphs, Q&A]

OntoGen


OntoGen

Tool for semi-automatic ontology construction from large text corpora

Integrates several text-mining methods:
  Clustering
  Active learning
  Classification
  Visualizations

Publicly available at ontogen.ijs.si

[Fortuna, Mladenić, Grobelnik, 2005]

Ontology construction with OntoGen

Semi-automatic
  The system provides suggestions and insights into the domain
  The user interacts with the parameters of the methods
  Final decisions are taken by the user

Data-driven
  Most of the aid provided by the system is based on some underlying data
  Instances are described by features extracted from the data (e.g. word vectors)

Ontology model in OntoGen

An ontology is a data model representing:
  a set of concepts within a domain
  the relationships between these concepts

OntoGen models an ontology as a graph/network structure consisting of:
  a set of concepts (vertices in the graph)
  a set of instances assigned to particular concepts (data records assigned to vertices in the graph)
  a set of relationships connecting concepts (directed edges in the graph)

Each instance is described by a set of features.
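The graph model described on this slide can be sketched as a small data structure. This is only an illustration of the idea, not OntoGen's actual implementation: the class names, method names, and the "sub-concept-of" relation label are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class Instance:
    """A data record described by a set of features (e.g. word weights)."""
    instance_id: str
    features: Dict[str, float]


@dataclass
class Concept:
    """A vertex of the ontology graph with the instances assigned to it."""
    name: str
    instance_ids: Set[str] = field(default_factory=set)


@dataclass
class Ontology:
    concepts: Dict[str, Concept] = field(default_factory=dict)
    instances: Dict[str, Instance] = field(default_factory=dict)
    # Directed, labeled edges: (source concept, relation label, target concept).
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)

    def add_concept(self, name: str) -> Concept:
        # Create the concept on first use, otherwise return the existing one.
        return self.concepts.setdefault(name, Concept(name))

    def assign(self, instance: Instance, concept_name: str) -> None:
        # Store the instance and attach it to the given concept.
        self.instances[instance.instance_id] = instance
        self.add_concept(concept_name).instance_ids.add(instance.instance_id)

    def relate(self, source: str, label: str, target: str) -> None:
        # Add a directed edge between two concepts, creating them if needed.
        self.add_concept(source)
        self.add_concept(target)
        self.relations.add((source, label, target))


# Hypothetical usage: one instance, two concepts, one directed relation.
ontology = Ontology()
ontology.assign(Instance("d2", {"numerical": 1.0, "analysis": 1.0}), "Numerical methods")
ontology.relate("Numerical methods", "sub-concept-of", "Mathematics")
```

Keeping instances in their own dictionary and storing only ids on the concepts lets one instance belong to several concepts without being copied.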

Example of a Topic Ontology


Instance representation

Bag of words:
  Vocabulary: {w_i | i = 1, …, N}
  Documents are represented with vectors (word space)

Example:
  Document set:
    d1 = “Canonical Correlation Analysis”
    d2 = “Numerical Analysis”
    d3 = “Numerical Linear Algebra”
  Vocabulary:
    {“Canonical”, “Correlation”, “Analysis”, “Numerical”, “Linear”, “Algebra”}
  Document vector representation:
    x1 = (1, 1, 1, 0, 0, 0)
    x2 = (0, 0, 1, 1, 0, 0)
    x3 = (0, 0, 0, 1, 1, 1)
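The bag-of-words example from this slide can be reproduced with a few lines of code. This is a minimal sketch that assumes whitespace tokenization and raw word counts; the function names are made up for the illustration, and any weighting or preprocessing used in practice is not shown.

```python
def build_vocabulary(documents):
    """Collect distinct words in order of first appearance."""
    vocabulary = []
    for text in documents:
        for word in text.split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    words = text.split()
    return tuple(words.count(word) for word in vocabulary)


documents = [
    "Canonical Correlation Analysis",
    "Numerical Analysis",
    "Numerical Linear Algebra",
]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(text, vocabulary) for text in documents]
```

With the three example documents, `vocabulary` and `vectors` match the vocabulary and the vectors x1, x2, x3 listed on the slide.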

Basic idea behind OntoGen

[Figure: a text corpus from the domain is turned into an ontology with Concept A, Concept B and Concept C]

Concept discovery – unsupervised

Clustering-based approach
  K-means clustering of the instances
  Clusters are offered as suggestions
  The user selects the relevant suggestions
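To make the clustering step concrete, here is a minimal k-means sketch over instance vectors. It is an illustration under simplifying assumptions (squared Euclidean distance, the first k instances as initial centroids, a fixed number of iterations), not the clustering code used in OntoGen; all names are hypothetical.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def k_means(vectors, k, iterations=20):
    """Return a cluster index for every vector."""
    centroids = list(vectors[:k])  # simplistic deterministic initialization
    assignment = []
    for _ in range(iterations):
        # Assignment step: each instance goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment


# Hypothetical instance vectors; each resulting cluster would be shown
# to the user as one concept suggestion.
instances = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 0, 0), (0, 0, 0, 1)]
suggestions = k_means(instances, k=2)
```

In the workflow on the slide, the user would then keep only those clusters that correspond to meaningful concepts.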

Concept discovery – unsupervised

Visualization-based approach
  Topic-landscape based visualization
  Each instance is one yellow point on the map
  Similar instances appear closer together
  The user can make a concept by selecting a region of the map
  Pink points on the map are the selected instances

Concept discovery – supervised

Active-learning-based approach
  The user enters a query
  The system ranks the instances according to the query
  The user labels instances:
    Yes – belongs to the concept
    No – does not belong to the concept
  Once there are enough labeled instances, the system switches to SVM-based active learning
  When done, the concept is added to the ontology
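The two phases above can be sketched as two selection functions. The second one uses uncertainty sampling, a common choice for SVM-based active learning in which the next question is about the instance closest to the decision boundary; whether OntoGen uses exactly this criterion is an assumption. Training the linear model is omitted, so `weights` and `bias` are taken as given, and all names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rank_by_query(instances, query):
    """Initial phase: order instance ids by similarity to the query vector."""
    return sorted(instances, key=lambda i: dot(instances[i], query), reverse=True)


def most_uncertain(instances, labeled_ids, weights, bias):
    """Later phase: pick the unlabeled instance nearest the decision boundary.

    `weights` and `bias` stand for a linear classifier trained on the
    labels collected so far (training itself is not shown here).
    """
    candidates = [i for i in instances if i not in labeled_ids]
    return min(candidates, key=lambda i: abs(dot(weights, instances[i]) + bias))


# Hypothetical two-dimensional instances keyed by id.
instances = {"a": (1.0, 0.0), "b": (0.0, 1.0), "c": (0.6, 0.5)}
ranking = rank_by_query(instances, query=(1.0, 0.0))
next_question = most_uncertain(instances, labeled_ids=set(),
                               weights=(1.0, -1.0), bias=0.0)
```

Asking about the most uncertain instance tends to need fewer Yes/No answers from the user than labeling instances in ranked order.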

Concept discovery – supervised

Classification-based approach
  Instances are classified into a background ontology called OntoLight
  Concepts with the most instances are provided as sub-concept suggestions

Concept naming – unsupervised

Automatic extraction of keywords for describing the concepts
  First approach: based on TFIDF weights of words
  Second approach: based on an SVM-based feature selection algorithm
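The first, TFIDF-based approach can be illustrated with a short sketch: words that are frequent inside the concept but rare in the whole corpus get the highest scores. The exact weighting scheme used by OntoGen is not given on the slide, so the plain `tf * log(N / df)` formula and all names below are assumptions.

```python
import math
from collections import Counter


def tfidf_keywords(concept_docs, all_docs, top_k=3):
    """Rank the words of a concept by TF-IDF.

    Assumes every document of the concept is also part of `all_docs`,
    so each word has a document frequency of at least one.
    """
    n_docs = len(all_docs)
    doc_freq = Counter()
    for doc in all_docs:
        doc_freq.update(set(doc.lower().split()))
    term_freq = Counter()
    for doc in concept_docs:
        term_freq.update(doc.lower().split())
    scores = {
        word: tf * math.log(n_docs / doc_freq[word])
        for word, tf in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Hypothetical corpus; the first two documents form the concept.
corpus = [
    "numerical analysis",
    "numerical linear algebra",
    "canonical correlation analysis",
    "stock market news",
    "stock prices fall",
]
keywords = tfidf_keywords(corpus[:2], corpus)
```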

Concept naming – supervised

Classification-based approach
  The concept’s instances are classified into a background ontology called OntoLight
  Names from the background ontology with the most classified instances are provided as suggestions
  Shows what the concept is called in some pre-defined vocabulary

Concept visualization

Instances are visualized as points on a 2D map.

The distance between two instances on the map corresponds to their similarity.

Characteristic keywords are shown for all parts of the map.

The user can select groups of instances on the map to create sub-concepts.

Ontology visualization

Ontology concepts visualized as points on the 2D topic map.

Topic map generated from a set of text documents.


Multiple views of the same data

Simple taxonomy on top of Reuters news articles

Two different views: one focuses on topics, the other on geography

Each view yields a different taxonomy on the data.

An SVM-based method detects the importance of keywords for each view.

Topics view

Countries view

UK takeovers and mergers
The following are additions and deletions to the takeovers and mergers list for the week beginning August 19, as provided by the Takeover …

Lloyd’s CEO questioned in recovery suit in U.S.
Ronald Sandler, chief executive of Lloyd's of London, on Tuesday underwent a second day of court interrogation about …

Word weight learning

The word weight learning method is based on SVM feature selection.

Besides ranking the words, it also assigns them weights based on the SVM classifier.

Notation:
  $N$ – number of documents
  $\{x_1, \ldots, x_N\}$ – documents
  $C(x_i)$ – set of categories for document $x_i$
  $n$ – number of words
  $\{w_1, \ldots, w_n\}$ – word weights
  $\{n^j_1, \ldots, n^j_n\}$ – SVM normal vector for the $j$-th category

Algorithm:
  1. Calculate a linear SVM classifier for each category.
  2. Calculate word weights for each category from the SVM normal vectors. The weight for the $i$-th word and $j$-th category is:
     $$w^j_i = \frac{1}{N} \sum_{k=1}^{N} x_{k,i} \, n^j_i$$
  3. Final word weights are calculated separately for each document:
     $$x_{k,i} = TF_i \cdot \sum_{j \in C(x_k)} w^j_i$$
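Steps 2 and 3 of the algorithm can be sketched in code as follows. The sketch assumes that step 2 averages the product of a word's document value and the corresponding component of the SVM normal vector, and that step 3 multiplies a word's term frequency by the summed weights of the document's categories. SVM training (step 1) is not shown: the normal vectors are passed in as given, and all function and variable names are hypothetical.

```python
def category_word_weights(documents, normals):
    """Step 2: weights[j][i] = (1/N) * sum over k of x[k][i] * n[j][i]."""
    n_docs = len(documents)
    n_words = len(documents[0])
    weights = {}
    for category, normal in normals.items():
        weights[category] = [
            sum(doc[i] for doc in documents) / n_docs * normal[i]
            for i in range(n_words)
        ]
    return weights


def document_word_weights(tf_vector, categories, weights):
    """Step 3: scale each term frequency by the summed category weights."""
    return [
        tf * sum(weights[c][i] for c in categories)
        for i, tf in enumerate(tf_vector)
    ]


# Hypothetical data: two documents over two words, two categories.
documents = [(1, 0), (1, 2)]
normals = {"topics": (0.5, 1.0), "countries": (2.0, 0.0)}
weights = category_word_weights(documents, normals)
weighted_doc = document_word_weights((1, 2), ["topics", "countries"], weights)
```

Because the weights depend on the categories of each document, the same word can end up with different weights in different views of the data.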

Relations – preprocessing


Name-entity profile
  Sentences extracted from articles in which the name entity appears
  Example: Agassi
    Olympic champion Agassi meets of Morocco in the first round.

Co-occurrence profile
  Sentences extracted from articles in which two name entities appear together
  Example: Sampras – Agassi
    There will be no repeat of last year's men's final with eighth-ranked Agassi landing in Sampras's half of the draw.

Relationship
  By extracting keywords from co-occurrence profiles we get a summary of the relationship between two name entities.
  Keywords are extracted from the bag-of-words vectors of the co-occurrence profiles.
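A rough sketch of how the two kinds of profiles could be collected is shown below. It assumes the sentences are already split and the list of name entities is already known, and it detects an entity by naive substring matching; a real pipeline would use a name-entity recognizer instead. The function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_profiles(sentences, entities):
    """Collect sentences per entity and per co-occurring entity pair."""
    entity_profiles = defaultdict(list)
    pair_profiles = defaultdict(list)
    for sentence in sentences:
        # Naive detection: an entity is present if its name is a substring.
        present = sorted(e for e in entities if e in sentence)
        for entity in present:
            entity_profiles[entity].append(sentence)
        for pair in combinations(present, 2):
            pair_profiles[pair].append(sentence)
    return dict(entity_profiles), dict(pair_profiles)


# The two example sentences quoted on the slide.
sentences = [
    "Olympic champion Agassi meets of Morocco in the first round.",
    "There will be no repeat of last year's men's final with eighth-ranked "
    "Agassi landing in Sampras's half of the draw.",
]
entity_profiles, pair_profiles = build_profiles(sentences, ["Agassi", "Sampras"])
```

Each profile can then be turned into a bag-of-words vector, from which the keywords describing the entity or the relationship are extracted.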

Relations – example

Bill Clinton
  Iraq [476]: president, missiles, attacks, Kurdish, northern
  Bob Dole [294]: republican, president, presidential, candidates, poll
  United States [204]: president, Monday, southern, move, election
  White House [146]: president, spokesman, reporters, Friday, campaign
  Iran [74]: president, investment, gas, law, penalize
  Congress [66]: president, calling, billion, republican, democrat
  Chicago [42]: president, conventional, democrat, drug, campaign
  Al Gore [40]: president, vice, bus, tour, election

Chicago
  Clinton [236]: conventional, democrat, training, day, campaign
  U.S. [164]: trader, markets, purchasers, index, future
  New York [100]: variety, mixed, critical, poll, bulletproof
  Dole [70]: conventional, democrat, campaign, drug, Sunday
  Kansas City [70]: basis, wheat, bushels, fob, red
  Los Angeles [60]: variety, mixed, critical, poll, stg
  Illinois [34]: democrat, state, conventional, trip, mayor
  Chicago Board of Trade [34]: future, deliverable, stocks, bus, reporters
  San Francisco [34]: operations, municipal, full, remain, services
  Boston [32]: fared, comparatively, game, existed, American

Relations – abstraction

Name entities are clustered using k-means clustering.

Relations between clusters are established based on the name entities' co-occurrence profiles:
  Let C1 and C2 be two clusters
  Let p_ij be the co-occurrence profile between documents d_i and d_j
  P = {p_ij | d_i from C1 and d_j from C2}

The relation is defined by the profile set P.

The summary of the relation is extracted from the centroid vector of the profiles in P.
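The construction of the profile set P and its centroid can be sketched as follows. The sketch assumes each co-occurrence profile is already a bag-of-words vector keyed by an entity pair, and it takes the highest-valued centroid components as the summary keywords; the selection rule, the example data, and all names are assumptions.

```python
def relation_summary(cluster_1, cluster_2, profiles, vocabulary, top_k=3):
    """Summarize the relation between two clusters of name entities."""
    # P: profiles whose two entities lie in different clusters.
    selected = [
        vector
        for (a, b), vector in profiles.items()
        if (a in cluster_1 and b in cluster_2)
        or (a in cluster_2 and b in cluster_1)
    ]
    if not selected:
        return []
    # Centroid: component-wise mean of the selected profile vectors.
    centroid = [sum(column) / len(selected) for column in zip(*selected)]
    ranked = sorted(
        range(len(vocabulary)), key=lambda i: (-centroid[i], vocabulary[i])
    )
    return [vocabulary[i] for i in ranked[:top_k]]


# Hypothetical vocabulary and co-occurrence profile vectors.
vocabulary = ["peace", "election", "trade", "war"]
profiles = {
    ("Bosnia", "Washington"): (3, 1, 0, 2),
    ("Sarajevo", "United States"): (1, 3, 0, 4),
    ("Russia", "Britain"): (0, 0, 5, 0),
}
summary = relation_summary(
    {"Bosnia", "Sarajevo"}, {"Washington", "United States"}, profiles, vocabulary
)
```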

Relations – example

Example of clusters:
  Cluster 1
    Name entities: Bosnia, Bosnian, Sarajevo
    Keywords: serbs, moslems, bosnian, election
  Cluster 2
    Name entities: Russia, Britain, Germany, France
    Keywords: meeting, country, government, told
  Cluster 3
    Name entities: Washington, United States
    Keywords: spokesman, military, missiles

Example of relations:
  Cluster 1 vs. Cluster 3
    Name entities: U.N., U.S., American, Washington, Bosnia, Turkey, Richard Holbrooke, U.N. Security Council, White House
    Keywords: election, serb, war, bosnians, moslem, peace, tribunal, police, spokesman, crime
  Cluster 1 vs. Cluster 2
    Name entities: NATO, Yugoslavia, Bosnia, Croatia, Serbia, Belgrade, Balkan, OSCE, Burns
    Keywords: country, election, state, international, peace, meeting, secretary, foreign, talks, member


[Figure: clusters of name entities with the keywords shown next to them]
  Russia, Britain, Germany, France, China, EU: meeting, country, government, told, officials, union, minister, secretary, trade, report
  Hashimoto, Romano Prodi, Benjamin Netanyahu, Jim Bolger: minister, prime, meeting, foreign, talks, president, peace, visit, told, officials
  Bill Clinton, Jacques Chirac, Suharto, Hosni Mubarak, Leonid Kuchma: president, meeting, visit, talks, leaders, minister, secretary, officials, state
  Supreme Court, U.S. District Court, Simpson, Justice Department: courts, case, year, told, rules, trials, charges, sentenced, law, file
  Tennessee Valley Authority, New Hill, TVA, Florida Power & Light Co, St Lucie: plant, powerful, company, venture, electrical, projects, million, joint, province, state

[Figure: abstracted graph with the labels Country, President, Minister, Court and Power plant, and the relation labels Visit, Visit, Invest and Rule]

Evaluation

The first prototype was successfully used:
  Applied in multiple domains: business, legislation and digital libraries (SEKT project)
  Users were always domain experts with limited knowledge of and experience with ontology construction / knowledge engineering
  Feedback from the first trials was used as input for the second prototype, the one presented here

A user study was performed for the second prototype:
  Main impression
    The tool saves time
    It is especially useful when working with large collections of documents
  Main disadvantages
    Abstraction
    Unattractive interface design

Used in several EU projects:
  SWING, TAO, NEON, ECOLEAD, E4, TOOLEAST

From the users


Future work


Move towards bigger granularity

Semantic graphs
  Extract data points at the sentence level (OntoGen does it at the document level)
  Based on triplets extracted from sentence structure: Subject, Predicate, Object
  Extraction can be done with
    Parsers
    Structured learning
  Triplets from one document can be merged into semantic graphs
  Stronger than bag-of-words

Example application: Document summarization
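Merging triplets into a graph can be sketched as building an adjacency list in which subjects and objects become nodes and predicates become edge labels. The triplets below are written by hand for illustration; extracting them from parsed sentences is not shown, and the function name is hypothetical.

```python
from collections import defaultdict


def merge_triplets(triplets):
    """Merge (subject, predicate, object) triplets into a labeled graph."""
    graph = defaultdict(list)
    for subject, predicate, obj in triplets:
        edge = (predicate, obj)
        if edge not in graph[subject]:  # skip repeated triplets
            graph[subject].append(edge)
        graph[obj]  # make sure the object also exists as a node
    return dict(graph)


# Hypothetical triplets from one document.
triplets = [
    ("earthquake", "hits", "city"),
    ("earthquake", "kills", "people"),
    ("city", "has", "buildings"),
    ("earthquake", "hits", "city"),
]
semantic_graph = merge_triplets(triplets)
```

Unlike a bag of words, the graph keeps track of who did what to whom, which is the kind of structure a summarizer can exploit.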


Template extraction


Hypothesis: people view events through “templates”
  Models of how things evolve and relate
  People use these models to understand and predict

Goal: automatic extraction of such models from text

Search over triplets
  Triplet extraction was run over the Reuters corpus
  800k news articles from 1996 to 1997
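A search over extracted triplets can be pictured as pattern matching with wildcards: any of the three positions may be fixed or left open. The sketch below is a plain linear scan over a hand-written list, meant only to illustrate the idea; a corpus of this size would need an index, and the function name and example triplets are assumptions rather than actual Reuters output.

```python
def search_triplets(triplets, subject=None, predicate=None, obj=None):
    """Return the triplets matching the pattern; None matches anything."""
    pattern = (subject, predicate, obj)
    return [
        triplet
        for triplet in triplets
        if all(p is None or p == value for p, value in zip(pattern, triplet))
    ]


# Hypothetical triplets.
triplets = [
    ("earthquake", "hits", "city"),
    ("earthquake", "kills", "people"),
    ("storm", "hits", "coast"),
]
about_earthquakes = search_triplets(triplets, subject="earthquake")
things_hit = search_triplets(triplets, predicate="hits")
```

Grouping the results of such queries by predicate is one way to see which relations recur around an event type, which is the raw material for a template.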

Search over triplets


Template earthquake

[Figure: template graph for “earthquake”. Node labels: Earthquake, Places, Time-period, People, Buildings, Richter scale, Government. Relation labels: Hits, Hits in, Kills, Collapses, Registered in, Measured by]

Thank you!

Questions?
