Using statistically improbable phrases to automate the creation of an ontology
for a technical document collection
Thesis Defense
Joseph Colannino
Tulsa, April 29, 2009
© 2009, all rights reserved
Overview
• Literature survey
  – Contrast of keyword and semantic search
  – The TF x IDF strategy
  – Statistically Improbable Phrases (SIPs)
• Key concepts
  – Semantic filtering and mapping
  – Use of SIPs to create an ontology
• Method and key results
• Limitations
• Recommendations for future work
Shortcomings of Keyword Search
[Diagram: DOCs → keyword extraction → cache; keywords query the cache to retrieve DOCs, but many extraneous DOCs come back as well]
We would hope that keywords can be extracted and cached, leading to accurate document retrieval. Unfortunately, we retrieve a lot of extraneous stuff as well.
Advantages of Semantic Search
[Diagram: DOCs → semantic extraction → semantic container; semantic tokens query the container to retrieve DOCs]
If some black box existed that could extract meaning or thoughts and cache them, we could link our thoughts with thoughts in documents.
Why are keywords not good semantic tokens?
[Diagram: words {red, green, blue} map one-to-one onto thoughts {color: red, color: green, color: blue} via t = φ(w)]
If words mapped to thoughts bijectively, we could use keywords to search for semantics.
Why are keywords not good semantic tokens?
[Diagram: words {red, green, blue, livid, sullen, neophytic} map many-to-many onto thoughts {color: red, color: green, color: blue, disposition: depressed, disposition: angry, disposition: inexperienced} via t = γ(w)]
But they don't. Words have many more than one meaning, and meanings may be expressed by many more than one word.
Why are keywords not good semantic tokens?
[Diagram: the same many-to-many word-thought mapping, now with a source query on one side and its target on the other]
This makes it tough to match a source query with its target.
Why are keywords not good semantic tokens?
• Words have meaning only in context*

WORD   # Meanings   Some alternate meanings
john       33       toilet, anonymous…
car        41       railcar, wagon, coach…
is          8       information systems…
blue       87       sad, vulgar, restrictive…

(33 x 41 x 8 x 87) x 4! = 22,600,512

If there are 22 million possible meanings for the sentence "John's car is blue," why isn't anyone confused about what it means?

*Frege, G. 1892. Über Sinn und Bedeutung, in Zeitschrift für Philosophie und philosophische Kritik, 100: 25-50. Translated as 'On Sense and Reference' by M. Black in Translations from the Philosophical Writings of Gottlob Frege, P. Geach and M. Black (eds. and trans.), Oxford: Blackwell, third edition, 1980.
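The slide's count can be verified in two lines of Python; note that reading the 4! factor as the number of word orderings is my interpretation, not something stated on the slide:

```python
from math import factorial

# Verify the slide's arithmetic: 33, 41, 8, and 87 senses for the four
# words, multiplied by 4! (presumably the possible word orderings).
meanings = 33 * 41 * 8 * 87 * factorial(4)
print(meanings)  # 22600512
```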
Why are keywords not good semantic tokens?
• Words have meaning only in context
Why doesn't "John's car is blue" mean that the "toilet's railcar information systems are sad"?
[Diagram: in context, the meanings of john (33), car (41), is (8), and blue (87) are each restricted to 1]
John's car is blue ≡ The car belonging to the person named John is colored blue
Semantic Filtering
Because context acts as a filter to restrict meaning. With so many restrictions, only one meaning is left. Therefore, phrases have narrower semantics than keywords.
Semantic Filtering
• Owing to semantic filtering, word phrases contain more precise (more restricted) semantics than individual words or Boolean constructions thereof
• If we were able to automatically generate meaningful phrases from a document, we could begin to search semantically
Statistically Improbable Phrases
• A statistically improbable phrase (SIP) is a contiguous cluster of words which occurs
1. with high frequency within a document (high term frequency, TF)
*AND*
2. with low frequency in the document collection (high inverse document frequency, IDF)
Statistically Improbable Phrases
• Highly repeated phrases within a document are statistically correlated with meaningful phrases, because humans emphasize important points via repetition
• Therefore, phrases occurring with high term frequency (TF) in a document are likely to be meaningful
• To filter out pedestrian words and turns of phrase, the phrase must also occur outside the document with low frequency (i.e., high inverse document frequency, IDF)
• The resulting strategy is known as TF x IDF
Jabberwocky
Here is a nonsense poem:
"…
Beware the Jabberwock, my son! The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun the frumious Bandersnatch!
…"
From "Jabberwocky," Lewis Carroll; Through the Looking-Glass, and What Alice Found There, 1872
Jabberwocky – Term Frequency
Let's post the TF for each word in the poem:
"…
Beware2 the19 Jabberwock4, my3 son! The19 jaws that2 bite, the19 claws that2 catch!
Beware2 the19 Jubjub bird, and15 shun the19 frumious Bandersnatch
…"
Jabberwocky – Term Frequencies
Now let's filter out the low-TF words:
"…
Beware2 the19 Jabberwock4, my3 son! The19 jaws that2 bite, the19 claws that2 catch!
Beware2 the19 Jubjub bird, and15 shun the19 frumious Bandersnatch
…"
• Considering TF only, "the" is the most important word!
[Bar chart: the (19), and (15), Jabberwock (4), my (3), beware (2), that (2)]
Jabberwocky – TF x IDF
• But considering TF x IDF, Jabberwock emerges as the seminal construct!
[Bar chart: the (19), and (15), Jabberwock (4), my (3), beware (2), that (2)]
[Diagram: the document's high-TF terms (the, and, my, beware, that, Jabberwock) pass through an IDF filter built from the document collection's most common words (the, of, to, in, and, a, for, was, is, that, on, at, he, with, by, be, it, an, as, his, …); TF x IDF leaves only Jabberwock]
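The TF versus TF x IDF reversal can be sketched in a few lines of Python. The term frequencies come from the slides; the collection size and document frequencies below are illustrative assumptions chosen only to show the effect (common words like "the" appear in essentially every document, while "Jabberwock" appears in almost none), not measurements from any real collection:

```python
import math

# Term frequencies in the poem, from the slides.
tf = {"the": 19, "and": 15, "Jabberwock": 4, "my": 3, "beware": 2, "that": 2}

# Assumed collection statistics (illustrative only).
n_docs = 10_000
doc_freq = {"the": 10_000, "and": 9_900, "that": 9_800,
            "my": 9_000, "beware": 1_200, "Jabberwock": 2}

def tfidf(term):
    # Classic TF x IDF: idf = ln(N / df), so terms in every document score 0.
    idf = math.log(n_docs / doc_freq[term])
    return tf[term] * idf

ranked = sorted(tf, key=tfidf, reverse=True)
print(ranked[0])  # Jabberwock
```

With TF alone, "the" (19 occurrences) dominates; weighting by IDF zeroes it out and promotes "Jabberwock" to the top, exactly as in the bar charts above.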
What is an Ontology?
• Thesis title: Using statistically improbable phrases to automate the creation of an ontology for a technical document collection
What is an Ontology?
• "Ontologies have been defined in many ways, some of which contradict each other."
• "The broadest usage of ontology is for formal representations of what, to a human, is common sense. …"
A. Taylor. 2004. The Organization of Information, Libraries Unlimited, Westport CT. p 282, emphasis mine
What is an Ontology?
• "The Semantic Web is a vision for the future of the Web in which information is given explicit meaning."
• "An ontology defines the terms used to describe and represent an area of knowledge"
OWL Web Ontology Language, Use Cases and Requirements, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/webont-req/#onto-def, emphasis mine
What is an Ontology?
• An ontology is "a collection of tokenized meanings."
J. Colannino, 2009. Using statistically improbable phrases to automate the creation of an ontology for a technical document collection, p 2, Masters Thesis, University of Oklahoma, Tulsa.
Automating Ontology Creation
[Diagram: DOCs → TF x IDF (semantic extraction) → SIPs (semantic tokens) → ontology (semantic container) → semantic search → DOCs]
Now we see that TF x IDF can be used as an extraction algorithm, the ontology is the container, and the SIPs are the semantic tokens.
Selecting the Database
To perform the analysis, the database had to be
• Large enough
• Inexpensive
• Indexed
• Available to the public
• Accessible to the researcher
Method
• So we used Amazon.com
• Selected a seminal text
• Then used its SIPs to interrogate other collections such as the World Wide Web
• We compared the results with keyword searches
The next page shows the logic
Method
[Flowchart: Begin → Specify document collection → Select seminal document → Find related docs using SIPs → Related docs (348) → Use SMEs to score docs → MCSPA → Use SIPs to find MCSPA → Use keywords to find MCSPA → Analyze → Report results]
Method, 25 SIPs in Seminal Text (MCSPA)
air preheat temperature
blocking generators
bridgewall temperature
burner throat
diffusion burner
dummy factors
ethylene cracking units
flame dimensions
flame quality
floor burners
function shape analysis
genuine replicates
heat flux curve
heat flux profile
hidden extrapolation
hydrogen reformers
insulation pattern
lenient sense
lurking factors
normal matrix equation
normalized heat flux
refinery fuel gas
substoichiometric combustion
thermoacoustic coupling
tramp air
Method, 348 Documents
Method
• 25 SIPs in seminal text, shared by
• 348 books, collectively comprising
• 5829 unique SIPs
Results: SIP Search
Google Search Results for SIPs on the Worldwide Web, Unrestricted as to Document Type
Placement of MCSPA in Google Search

SIP                            Placement   1   ≤5  ≤10  ≤20  ≤50
air preheat temperature            48      N   N   N    N    Y
blocking generators                 7      N   N   Y    Y    Y
bridgewall temperature             13      N   N   N    Y    Y
burner throat                      16      N   N   N    Y    Y
diffusion burner                   17      N   N   N    Y    Y
dummy factors                      14      N   N   N    Y    Y
ethylene cracking units            10      N   N   Y    Y    Y
flame dimensions                   17      N   N   N    Y    Y
flame quality                      15      N   N   N    Y    Y
floor burners                      50      N   N   N    N    Y
function shape analysis             1      Y   Y   Y    Y    Y
genuine replicates                  1      Y   Y   Y    Y    Y
heat flux curve                     1      Y   Y   Y    Y    Y
heat flux profile                   7      N   N   Y    Y    Y
hidden extrapolation                4      N   Y   Y    Y    Y
hydrogen reformers                  7      N   N   Y    Y    Y
insulation pattern                 20      N   N   N    Y    Y
lenient sense                       4      N   Y   Y    Y    Y
lurking factors                     1      Y   Y   Y    Y    Y
normal matrix equation              4      N   Y   Y    Y    Y
normalized heat flux                1      Y   Y   Y    Y    Y
refinery fuel gas                 380      N   N   N    N    N
substoichiometric combustion        8      N   N   Y    Y    Y
thermoacoustic coupling             3      N   Y   Y    Y    Y
tramp air                           8      N   N   Y    Y    Y
Number by Category                         5   9   15   22   24
Percentage by Category                    20% 36% 60%  88%  96%

88% ≤ 20
88% of SIPs found the seminal document on the World Wide Web in the first 20 returns
Results: Boolean Search
Results of Boolean Search on Google.com
Placement of MCSPA in Google Search

Search Term                            Placement   1   ≤5  ≤10  ≤20  ≤50
air AND preheat AND temperature          >100      N   N   N    N    N
blocking AND generators                     4      N   Y   Y    Y    Y
bridgewall AND temperature                 20      N   N   N    Y    Y
burner AND throat                          48      N   N   N    N    Y
diffusion AND burner                       54      N   N   N    N    N
dummy AND factors                        >100      N   N   N    N    N
ethylene AND cracking AND units          >100      N   N   N    N    N
flame AND dimensions                     >100      N   N   N    N    N
flame AND quality                        >100      N   N   N    N    N
floor AND burners                           2      N   Y   Y    Y    Y
function AND shape AND analysis          >100      N   N   N    N    N
genuine AND replicates                      2      N   Y   Y    Y    Y
heat AND flux AND curve                  >100      N   N   N    N    N
heat AND flux AND profile                >100      N   N   N    N    N
hidden AND extrapolation                    7      N   N   Y    Y    Y
hydrogen AND reformers                   >100      N   N   N    N    N
insulation AND pattern                   >100      N   N   N    N    N
lenient AND sense                          29      N   N   N    N    Y
lurking AND factors                         3      N   Y   Y    Y    Y
normal AND matrix AND equation           >100      N   N   N    N    N
normalized AND heat AND flux               43      N   N   N    N    Y
refinery AND fuel AND gas                >100      N   N   N    N    N
substoichiometric AND combustion           12      N   N   N    Y    Y
thermoacoustic AND coupling                21      N   N   N    N    Y
tramp AND air                              12      N   N   N    Y    Y
Number by Category                                 0   4   5    8    12
Percentage by Category                             0% 16% 20%  32%  48%

32% ≤ 20
Only 32% of keywords did the same
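The bottom rows of both tables follow mechanically from the placement columns. A short Python sketch reproduces them; the placements are transcribed from the tables above, with ">100" recorded as 999 so it falls outside every cutoff:

```python
# Placements of the seminal text (MCSPA), in table order.
sip_placements = [48, 7, 13, 16, 17, 14, 10, 17, 15, 50, 1, 1, 1, 7, 4,
                  7, 20, 4, 1, 4, 1, 380, 8, 3, 8]
boolean_placements = [999, 4, 20, 48, 54, 999, 999, 999, 999, 2, 999, 2,
                      999, 999, 7, 999, 999, 29, 3, 999, 43, 999, 12, 21, 12]

def counts_by_cutoff(placements, cutoffs=(1, 5, 10, 20, 50)):
    # For each cutoff, count the terms that returned the target at or
    # before that placement (the "Number by Category" row).
    return [sum(p <= c for p in placements) for c in cutoffs]

print(counts_by_cutoff(sip_placements))      # [5, 9, 15, 22, 24]
print(counts_by_cutoff(boolean_placements))  # [0, 4, 5, 8, 12]
```

Dividing each count by the 25 search terms gives the percentage rows: 22/25 = 88% of SIPs versus 8/25 = 32% of Boolean constructions within the first 20 returns.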
Results: Boolean Vs. SIP
[Plot: Cumulative Percent of Terms (Confidence) vs. Return Placement of Seminal Text (Information Overload). SIP fit: y = 0.2288 Ln(x) + 0.0741, R² = 0.94. Boolean fit: y = 0.1294 Ln(x) - 0.0368, R² = 0.98.]
Here are the results graphically. We can think of the cumulative % of terms as a kind of confidence that we will get a return within a given placement. For example, we have about 75% confidence that SIPs will return the target text in the first 20 hits.
Results: Boolean Vs. SIP
[Same plot as the previous slide]
We can think of the return placement as a kind of measure of information overload. We want our target to be returned as the first hit. If it's returned as hit number 50, then we are overloaded with lots of extraneous information.
Results: Boolean Vs. SIP
[Same plot, annotated with the confidence ratio of 2.2 at a placement of 20]
We can define a confidence ratio as shown. For example, if we desire our target to be returned in the first 20 hits, then SIPs give us more than double the confidence that this will be so compared with keywords.
Results: Boolean Vs. SIP
[Same plot, annotated with the overload ratio of 9.8 at 50% confidence]
We can also define an overload ratio as shown. For example, for 50% confidence, we need to examine about 10 times more returns to find our target if we use Boolean keyword constructions than if we use SIPs.
Results: Boolean Vs. SIP
[Plot: Confidence Ratio and Overload Ratio vs. Placement (Information Overload), marking the confidence ratio of 2.2 and the overload ratio of 9.8]
The confidence ratio tends to stabilize after the first 10 hits, but the overload ratio continues to grow: for example, if you want 90% confidence that your search returns the target, you need to examine almost 40x more returns with keywords compared to SIPs.
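The 2.2 and 9.8 annotations follow directly from the two fitted curves; a small Python sketch using the fit coefficients from the plots reproduces both ratios:

```python
import math

# Fitted confidence curves from the plots (x = return placement):
#   SIP:     y = 0.2288 ln(x) + 0.0741   (R² = 0.94)
#   Boolean: y = 0.1294 ln(x) - 0.0368   (R² = 0.98)
def conf_sip(x):
    return 0.2288 * math.log(x) + 0.0741

def conf_bool(x):
    return 0.1294 * math.log(x) - 0.0368

def confidence_ratio(x):
    # At a fixed placement, how much more confident are we with SIPs?
    return conf_sip(x) / conf_bool(x)

def placement(conf, slope, intercept):
    # Invert a logarithmic fit: placement needed for a given confidence.
    return math.exp((conf - intercept) / slope)

def overload_ratio(conf):
    # At a fixed confidence, how many more returns must a Boolean
    # search examine than a SIP search?
    x_sip = placement(conf, 0.2288, 0.0741)
    x_bool = placement(conf, 0.1294, -0.0368)
    return x_bool / x_sip

print(round(confidence_ratio(20), 1))  # 2.2
print(round(overload_ratio(0.5), 1))   # 9.8
```

The same inversion at 90% confidence gives placements of roughly 37 for SIPs versus about 1,400 for Boolean keywords, which is the "almost 40x" figure cited above.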
Limitations of the Work
• Restricted to comparisons between other documents and a single seminal text
• Investigated voluminous documents (i.e., books)
• Presumed that Amazon.com's SIP strategy was indicative of TF x IDF strategies in general
Limitations of the Work
• Neither TF x IDF nor SIPs are worry free:
  – non-standard terminology
  – author-coined terms
  – evolving nomenclature
  – synonymic constructions
• Although SIPs have narrower semantics than keywords, they may still be too broad
• SIPs cannot capture all thoughts
Recommendations
• Expand the work to include more than a single seminal text
• Examine shorter works such as technical reports and memoranda
• Modify search engines to cache SIP tokens as well as keywords and begin to enable semantic search