Using statistically improbable phrases to automate the creation of an ontology
for a technical document collection
Thesis Defense
Joseph Colannino
Tulsa, April 29, 2009
© 2009, all rights reserved
Overview
• Literature survey
  – Contrast of keyword and semantic search
  – The TF x IDF strategy
  – Statistically Improbable Phrases (SIPs)
• Key concepts
  – Semantic filtering and mapping
  – Use of SIPs to create an ontology
• Method and key results
• Limitations
• Recommendations for future work
Shortcomings of Keyword Search
[Diagram: DOCs → keyword extraction → cache; keywords query the cache to retrieve DOCs, but many extraneous DOCs come back as well]
We would hope that keywords can be extracted and cached, leading to accurate document retrieval. Unfortunately, we retrieve a lot of extraneous stuff as well.
Advantages of Semantic Search
[Diagram: DOCs → semantic extraction → semantic container; semantic tokens query the container to retrieve DOCs]
If some black box existed that could extract meaning or thoughts and cache them, we could link our thoughts with thoughts in documents.
Why are keywords not good semantic tokens?
[Diagram: words {red, green, blue} map one-to-one onto thoughts {color: red, color: green, color: blue} via t = φ(w)]
If words mapped to thoughts bijectively, we could use keywords to search for semantics.
Why are keywords not good semantic tokens?
[Diagram: words {red, green, blue, livid, sullen, neophytic} map many-to-many onto thoughts {color: red, color: green, color: blue, disposition: depressed, disposition: angry, disposition: inexperienced} via t = γ(w)]
But they don't. Words have many more than one meaning, and meanings may be expressed by many more than one word.
Why are keywords not good semantic tokens?
[Diagram: the same many-to-many word-thought mapping, now with a source query on one side and its target on the other]
This makes it tough to match a source query with its target.
Why are keywords not good semantic tokens?
• Words have meaning only in context*

WORD   # Meanings   Some alternate meanings
john       33       toilet, anonymous…
car        41       railcar, wagon, coach…
is          8       information systems…
blue       87       sad, vulgar, restrictive…

(33 x 41 x 8 x 87) x 4! = 22,600,512

If there are 22 million possible meanings for the sentence "John's car is blue," why isn't anyone confused about what it means?

*Frege, G. 1892. Über Sinn und Bedeutung, in Zeitschrift für Philosophie und philosophische Kritik, 100: 25-50. Translated as 'On Sense and Reference' by M. Black in Translations from the Philosophical Writings of Gottlob Frege, P. Geach and M. Black (eds. and trans.), Oxford: Blackwell, third edition, 1980.
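The slide's count can be verified in two lines of Python; note that reading the 4! factor as the number of word orderings is my interpretation, not something stated on the slide:

```python
from math import factorial

# Verify the slide's arithmetic: 33, 41, 8, and 87 senses for the four
# words, multiplied by 4! (presumably the possible word orderings).
meanings = 33 * 41 * 8 * 87 * factorial(4)
print(meanings)  # 22600512
```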
Why are keywords not good semantic tokens?
• Words have meaning only in context
Why doesn't "John's car is blue" mean that the "toilet's railcar information systems are sad"?
[Diagram: in context, the meanings of john (33), car (41), is (8), and blue (87) are each restricted to 1]
John's car is blue ≡ The car belonging to the person named John is colored blue
Semantic Filtering
Because context acts as a filter to restrict meaning. With so many restrictions, only one meaning is left. Therefore, phrases have narrower semantics than keywords.
Semantic Filtering
• Owing to semantic filtering, word phrases contain more precise (more restricted) semantics than individual words or Boolean constructions thereof
• If we were able to automatically generate meaningful phrases from a document, we could begin to search semantically
Statistically Improbable Phrases
• A statistically improbable phrase (SIP) is a contiguous cluster of words which occurs
1. with high frequency within a document (high term frequency, TF)
*AND*
2. with low frequency in the document collection (high inverse document frequency, IDF)
Statistically Improbable Phrases
• Highly repeated phrases within a document are statistically correlated with meaningful phrases, because humans emphasize important points via repetition
• Therefore, phrases occurring with high term frequency (TF) in a document are likely to be meaningful
• To filter out pedestrian words and turns of phrase, the phrase must also occur outside the document with low frequency (i.e., high inverse document frequency, IDF)
• The resulting strategy is known as TF x IDF
Jabberwocky
Here is a nonsense poem:
"…
Beware the Jabberwock, my son! The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun the frumious Bandersnatch!
…"
From "Jabberwocky," Lewis Carroll; Through the Looking-Glass, and What Alice Found There, 1872
Jabberwocky – Term Frequency
Let's post the TF for each word in the poem:
"…
Beware2 the19 Jabberwock4, my3 son! The19 jaws that2 bite, the19 claws that2 catch!
Beware2 the19 Jubjub bird, and15 shun the19 frumious Bandersnatch
…"
Jabberwocky – Term Frequencies
Now let's filter out the low-TF words:
"…
Beware2 the19 Jabberwock4, my3 son! The19 jaws that2 bite, the19 claws that2 catch!
Beware2 the19 Jubjub bird, and15 shun the19 frumious Bandersnatch
…"
• Considering TF only, "the" is the most important word!
[Bar chart: the (19), and (15), Jabberwock (4), my (3), beware (2), that (2)]
Jabberwocky – TF x IDF
• But considering TF x IDF, Jabberwock emerges as the seminal construct!
[Bar chart: the (19), and (15), Jabberwock (4), my (3), beware (2), that (2)]
[Diagram: the document's high-TF terms (the, and, my, beware, that, Jabberwock) pass through an IDF filter built from the document collection's most common words (the, of, to, in, and, a, for, was, is, that, on, at, he, with, by, be, it, an, as, his, …); TF x IDF leaves only Jabberwock]
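The TF versus TF x IDF reversal can be sketched in a few lines of Python. The term frequencies come from the slides; the collection size and document frequencies below are illustrative assumptions chosen only to show the effect (common words like "the" appear in essentially every document, while "Jabberwock" appears in almost none), not measurements from any real collection:

```python
import math

# Term frequencies in the poem, from the slides.
tf = {"the": 19, "and": 15, "Jabberwock": 4, "my": 3, "beware": 2, "that": 2}

# Assumed collection statistics (illustrative only).
n_docs = 10_000
doc_freq = {"the": 10_000, "and": 9_900, "that": 9_800,
            "my": 9_000, "beware": 1_200, "Jabberwock": 2}

def tfidf(term):
    # Classic TF x IDF: idf = ln(N / df), so terms in every document score 0.
    idf = math.log(n_docs / doc_freq[term])
    return tf[term] * idf

ranked = sorted(tf, key=tfidf, reverse=True)
print(ranked[0])  # Jabberwock
```

With TF alone, "the" (19 occurrences) dominates; weighting by IDF zeroes it out and promotes "Jabberwock" to the top, exactly as in the bar charts above.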
What is an Ontology?
• Thesis title: Using statistically improbable phrases to automate the creation of an ontology for a technical document collection
What is an Ontology?
• "Ontologies have been defined in many ways, some of which contradict each other."
• "The broadest usage of ontology is for formal representations of what, to a human, is common sense. …"
A. Taylor. 2004. The Organization of Information, Libraries Unlimited, Westport CT. p 282, emphasis mine
What is an Ontology?
• "The Semantic Web is a vision for the future of the Web in which information is given explicit meaning."
• "An ontology defines the terms used to describe and represent an area of knowledge"
OWL Web Ontology Language, Use Cases and Requirements, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/webont-req/#onto-def, emphasis mine
What is an Ontology?
• An ontology is "a collection of tokenized meanings."
J. Colannino, 2009. Using statistically improbable phrases to automate the creation of an ontology for a technical document collection, p 2, Masters Thesis, University of Oklahoma, Tulsa.
Automating Ontology Creation
[Diagram: DOCs → TF x IDF (semantic extraction) → SIPs (semantic tokens) → ontology (semantic container) → semantic search → DOCs]
Now we see that TF x IDF can be used as an extraction algorithm, the ontology is the container, and the SIPs are the semantic tokens.
Selecting the Database
To perform the analysis, the database had to be
• Large enough
• Inexpensive
• Indexed
• Available to the public
• Accessible to the researcher
Method
• So we used Amazon.com
• Selected a seminal text
• Then used its SIPs to interrogate other collections such as the World Wide Web
• We compared the results with keyword searches
The next page shows the logic
Method
[Flowchart: Begin → Specify document collection → Select seminal document → Find related docs using SIPs → Related docs (348) → Use SMEs to score docs → MCSPA → Use SIPs to find MCSPA → Use keywords to find MCSPA → Analyze → Report results]
Method, 25 SIPs in Seminal Text (MCSPA)
air preheat temperature
blocking generators
bridgewall temperature
burner throat
diffusion burner
dummy factors
ethylene cracking units
flame dimensions
flame quality
floor burners
function shape analysis
genuine replicates
heat flux curve
heat flux profile
hidden extrapolation
hydrogen reformers
insulation pattern
lenient sense
lurking factors
normal matrix equation
normalized heat flux
refinery fuel gas
substoichiometric combustion
thermoacoustic coupling
tramp air
Method, 348 Documents
Method
• 25 SIPs in seminal text, shared by
• 348 books, collectively comprising
• 5829 unique SIPs
Results: SIP Search
Google Search Results for SIPs on the Worldwide Web, Unrestricted as to Document Type
Placement of MCSPA in Google Search

SIP                            Placement   1   ≤5  ≤10  ≤20  ≤50
air preheat temperature            48      N   N   N    N    Y
blocking generators                 7      N   N   Y    Y    Y
bridgewall temperature             13      N   N   N    Y    Y
burner throat                      16      N   N   N    Y    Y
diffusion burner                   17      N   N   N    Y    Y
dummy factors                      14      N   N   N    Y    Y
ethylene cracking units            10      N   N   Y    Y    Y
flame dimensions                   17      N   N   N    Y    Y
flame quality                      15      N   N   N    Y    Y
floor burners                      50      N   N   N    N    Y
function shape analysis             1      Y   Y   Y    Y    Y
genuine replicates                  1      Y   Y   Y    Y    Y
heat flux curve                     1      Y   Y   Y    Y    Y
heat flux profile                   7      N   N   Y    Y    Y
hidden extrapolation                4      N   Y   Y    Y    Y
hydrogen reformers                  7      N   N   Y    Y    Y
insulation pattern                 20      N   N   N    Y    Y
lenient sense                       4      N   Y   Y    Y    Y
lurking factors                     1      Y   Y   Y    Y    Y
normal matrix equation              4      N   Y   Y    Y    Y
normalized heat flux                1      Y   Y   Y    Y    Y
refinery fuel gas                 380      N   N   N    N    N
substoichiometric combustion        8      N   N   Y    Y    Y
thermoacoustic coupling             3      N   Y   Y    Y    Y
tramp air                           8      N   N   Y    Y    Y
Number by Category                         5   9   15   22   24
Percentage by Category                    20% 36% 60%  88%  96%

88% ≤ 20
88% of SIPs found the seminal document on the World Wide Web in the first 20 returns
Results: Boolean Search
Results of Boolean Search on Google.com
Placement of MCSPA in Google Search

Search Term                            Placement   1   ≤5  ≤10  ≤20  ≤50
air AND preheat AND temperature          >100      N   N   N    N    N
blocking AND generators                     4      N   Y   Y    Y    Y
bridgewall AND temperature                 20      N   N   N    Y    Y
burner AND throat                          48      N   N   N    N    Y
diffusion AND burner                       54      N   N   N    N    N
dummy AND factors                        >100      N   N   N    N    N
ethylene AND cracking AND units          >100      N   N   N    N    N
flame AND dimensions                     >100      N   N   N    N    N
flame AND quality                        >100      N   N   N    N    N
floor AND burners                           2      N   Y   Y    Y    Y
function AND shape AND analysis          >100      N   N   N    N    N
genuine AND replicates                      2      N   Y   Y    Y    Y
heat AND flux AND curve                  >100      N   N   N    N    N
heat AND flux AND profile                >100      N   N   N    N    N
hidden AND extrapolation                    7      N   N   Y    Y    Y
hydrogen AND reformers                   >100      N   N   N    N    N
insulation AND pattern                   >100      N   N   N    N    N
lenient AND sense                          29      N   N   N    N    Y
lurking AND factors                         3      N   Y   Y    Y    Y
normal AND matrix AND equation           >100      N   N   N    N    N
normalized AND heat AND flux               43      N   N   N    N    Y
refinery AND fuel AND gas                >100      N   N   N    N    N
substoichiometric AND combustion           12      N   N   N    Y    Y
thermoacoustic AND coupling                21      N   N   N    N    Y
tramp AND air                              12      N   N   N    Y    Y
Number by Category                                 0   4   5    8    12
Percentage by Category                             0% 16% 20%  32%  48%

32% ≤ 20
Only 32% of keywords did the same
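The bottom rows of both tables follow mechanically from the placement columns. A short Python sketch reproduces them; the placements are transcribed from the tables above, with ">100" recorded as 999 so it falls outside every cutoff:

```python
# Placements of the seminal text (MCSPA), in table order.
sip_placements = [48, 7, 13, 16, 17, 14, 10, 17, 15, 50, 1, 1, 1, 7, 4,
                  7, 20, 4, 1, 4, 1, 380, 8, 3, 8]
boolean_placements = [999, 4, 20, 48, 54, 999, 999, 999, 999, 2, 999, 2,
                      999, 999, 7, 999, 999, 29, 3, 999, 43, 999, 12, 21, 12]

def counts_by_cutoff(placements, cutoffs=(1, 5, 10, 20, 50)):
    # For each cutoff, count the terms that returned the target at or
    # before that placement (the "Number by Category" row).
    return [sum(p <= c for p in placements) for c in cutoffs]

print(counts_by_cutoff(sip_placements))      # [5, 9, 15, 22, 24]
print(counts_by_cutoff(boolean_placements))  # [0, 4, 5, 8, 12]
```

Dividing each count by the 25 search terms gives the percentage rows: 22/25 = 88% of SIPs versus 8/25 = 32% of Boolean constructions within the first 20 returns.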
Results: Boolean Vs. SIP
[Plot: Cumulative Percent of Terms (Confidence) vs. Return Placement of Seminal Text (Information Overload). SIP fit: y = 0.2288 Ln(x) + 0.0741, R² = 0.94. Boolean fit: y = 0.1294 Ln(x) - 0.0368, R² = 0.98.]
Here are the results graphically. We can think of the cumulative % of terms as a kind of confidence that we will get a return within a given placement. For example, we have about 75% confidence that SIPs will return the target text in the first 20 hits.
Results: Boolean Vs. SIP
[Same plot as the previous slide]
We can think of the return placement as a kind of measure of information overload. We want our target to be returned as the first hit. If it's returned as hit number 50, then we are overloaded with lots of extraneous information.
Results: Boolean Vs. SIP
[Same plot, annotated with the confidence ratio of 2.2 at a placement of 20]
We can define a confidence ratio as shown. For example, if we desire our target to be returned in the first 20 hits, then SIPs give us more than double the confidence that this will be so compared with keywords.
Results: Boolean Vs. SIP
[Same plot, annotated with the overload ratio of 9.8 at 50% confidence]
We can also define an overload ratio as shown. For example, for 50% confidence, we need to examine about 10 times more returns to find our target if we use Boolean keyword constructions than if we use SIPs.
Results: Boolean Vs. SIP
[Plot: Confidence Ratio and Overload Ratio vs. Placement (Information Overload), marking the confidence ratio of 2.2 and the overload ratio of 9.8]
The confidence ratio tends to stabilize after the first 10 hits, but the overload ratio continues to grow: for example, if you want 90% confidence that your search returns the target, you need to examine almost 40x more returns with keywords compared to SIPs.
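The 2.2 and 9.8 annotations follow directly from the two fitted curves; a small Python sketch using the fit coefficients from the plots reproduces both ratios:

```python
import math

# Fitted confidence curves from the plots (x = return placement):
#   SIP:     y = 0.2288 ln(x) + 0.0741   (R² = 0.94)
#   Boolean: y = 0.1294 ln(x) - 0.0368   (R² = 0.98)
def conf_sip(x):
    return 0.2288 * math.log(x) + 0.0741

def conf_bool(x):
    return 0.1294 * math.log(x) - 0.0368

def confidence_ratio(x):
    # At a fixed placement, how much more confident are we with SIPs?
    return conf_sip(x) / conf_bool(x)

def placement(conf, slope, intercept):
    # Invert a logarithmic fit: placement needed for a given confidence.
    return math.exp((conf - intercept) / slope)

def overload_ratio(conf):
    # At a fixed confidence, how many more returns must a Boolean
    # search examine than a SIP search?
    x_sip = placement(conf, 0.2288, 0.0741)
    x_bool = placement(conf, 0.1294, -0.0368)
    return x_bool / x_sip

print(round(confidence_ratio(20), 1))  # 2.2
print(round(overload_ratio(0.5), 1))   # 9.8
```

The same inversion at 90% confidence gives placements of roughly 37 for SIPs versus about 1,400 for Boolean keywords, which is the "almost 40x" figure cited above.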
Limitations of the Work
• Restricted to comparisons between other documents and a single seminal text
• Investigated voluminous documents (i.e., books)
• Presumed that Amazon.com's SIP strategy was indicative of TF x IDF strategies in general
Limitations of the Work
• Neither TF x IDF nor SIPs are worry free:
  – non-standard terminology
  – author-coined terms
  – evolving nomenclature
  – synonymic constructions
• Although SIPs have narrower semantics than keywords, they may still be too broad
• SIPs cannot capture all thoughts
Recommendations
• Expand the work to include more than a single seminal text
• Examine shorter works such as technical reports and memoranda
• Modify search engines to cache SIP tokens as well as keywords and begin to enable semantic search