Towards the Self-Annotating Web
Philipp Cimiano, Siegfried Handschuh, Steffen Staab
Presenter: Hieu K Le (most slides come from Philipp Cimiano)
CS598CXZ - Spring 2005 - UIUC
Outline
• Introduction
• The Process of PANKOW
• Pattern-based categorization
• Evaluation
• Integration into CREAM
• Related work
• Conclusion
Annotating
• To annotate terms in a web page:
  – Manually defining annotations
  – Learning extraction rules
• Both require a lot of labor
From Google
• "cities such as Laksa": 0 hits
• "dishes such as Laksa": 10 hits
• "mountains such as Laksa": 0 hits
• "temples such as Laksa": 0 hits

Google knows more than all of you together!
An example of using syntactic information + statistics to derive semantic information.
Self-annotating
• PANKOW (Pattern-based Annotation through Knowledge On the Web)
  – Unsupervised
  – Pattern-based
  – Within a fixed ontology
  – Leverages information from the whole Web
The Self-Annotating Web
• There is a huge amount of implicit knowledge in the Web
• Make use of this implicit knowledge together with statistical information to propose formal annotations and overcome the vicious cycle:
  semantics ≈ syntax + statistics?
• Annotation by maximal statistical evidence
Patterns
• HEARST1: <CONCEPT>s such as <INSTANCE>
• HEARST2: such <CONCEPT>s as <INSTANCE>
• HEARST3: <CONCEPT>s, (especially/including) <INSTANCE>
• HEARST4: <INSTANCE> (and/or) other <CONCEPT>s
• Examples:
  – dishes such as Laksa
  – such dishes as Laksa
  – dishes, especially Laksa
  – dishes, including Laksa
  – Laksa and other dishes
  – Laksa or other dishes
Patterns (Cont'd)
• DEFINITE1: the <INSTANCE> <CONCEPT>
• DEFINITE2: the <CONCEPT> <INSTANCE>
• APPOSITION: <INSTANCE>, a <CONCEPT>
• COPULA: <INSTANCE> is a <CONCEPT>
• Examples:
  – the Laksa dish
  – the dish Laksa
  – Laksa, a dish
  – Laksa is a dish
Asking Google (more formally)
• For an instance i ∈ I, a concept c ∈ C, and a pattern p ∈ {HEARST1, ..., COPULA}, count(i, c, p) returns the number of Google hits of the instantiated pattern
• Aggregate the evidence over all patterns:
  count(i, c) := Σ_p count(i, c, p)
  E.g. count(Laksa, dish) := count(Laksa, dish, DEFINITE1) + ...
• Assign each instance the concept with maximal evidence, restricted to the best ones beyond a threshold:
  R := { (i, c_i) | i ∈ I, c_i := argmax_{c ∈ C} count(i, c) }
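The aggregation-and-argmax step can be sketched in a few lines. This is a toy illustration, not the PANKOW implementation: the per-pattern hit numbers below are invented (in the real system they come from Google queries), and the function names are assumptions:

```python
# Sketch of PANKOW-style categorization: sum per-pattern hit counts for
# each candidate concept, then pick the concept with maximal evidence.
def count(instance, concept, hits_by_pattern):
    """count(i, c) := sum over all patterns p of count(i, c, p)."""
    return sum(hits_by_pattern.get((instance, concept), []))

def categorize(instance, concepts, hits_by_pattern, threshold=0):
    """Return the argmax concept if its evidence exceeds the threshold."""
    totals = {c: count(instance, c, hits_by_pattern) for c in concepts}
    best = max(totals, key=totals.get)
    return best if totals[best] > threshold else None

# Invented per-pattern hit counts for "Laksa" (illustration only):
hits = {
    ("Laksa", "dish"):     [10, 2, 0, 1],
    ("Laksa", "city"):     [0, 0, 0, 0],
    ("Laksa", "mountain"): [0, 0, 0, 0],
}
print(categorize("Laksa", ["dish", "city", "mountain"], hits))  # prints "dish"
```

The threshold parameter mirrors the slides' restriction to "the best ones beyond threshold": instances whose maximal count is too low are left unannotated.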
OutlineIntroductionThe Process of PANKOWPattern-based categorizationEvaluationIntegration to CREAMRelated workConclusion
Evaluation Scenario
• Corpus: 45 texts from http://www.lonelyplanet.com/destinations
• Ontology: tourism ontology from the GETESS project
  – #concepts: original: 1043; pruned: 682
• Manual annotation by two subjects:
  – A: 436 instance/concept assignments
  – B: 392 instance/concept assignments
  – Overlap: 277 instances (Gold Standard)
  – A and B used 59 different concepts
  – Categorial (Kappa) agreement on 277 instances: 63.5%
Examples

Instance        Concept        Hits
Atlantic        city           1520837
Bahamas         island         649166
USA             country        582275
Connecticut     state          302814
Caribbean       sea            227279
Mediterranean   sea            212284
Canada          country        176783
Guatemala       city           174439
Africa          region         131063
Australia       country        128607
France          country        125863
Germany         country        124421
Easter          island         96585
St Lawrence     river          65095
Commonwealth    state          49692
New Zealand     island         40711
Adriatic        sea            39726
Netherlands     country        37926
St John         church         34021
Belgium         country        33847
San Juan        island         31994
Mayotte         island         31540
EU              country        28035
UNESCO          organization   27739
Austria         group          24266
Greece          island         23021
Malawi          lake           21081
Israel          country        19732
Perth           street         17880
Luxembourg      city           16393
Nigeria         state          15650
St Croix        river          14952
Nakuru          lake           14840
Kenya           country        14382
Benin           city           14126
Cape Town       city           13768
Results

[Chart: Precision, Recall, and F-Measure plotted against the threshold (thresholds 90 to 990; values 0 to 0.5).]
F = 28.24%
R/Acc = 24.90%
Comparison

System           #Concepts   Preprocessing / Cost          Accuracy
[MUC-7]                  3   Various (?)                   >> 90%
[Fleischman02]           8   N-gram extraction ($)         70.4%
PANKOW                  59   none                          24.9%
[Hahn98]-TH            196   syn. & sem. analysis ($$$)    21%
[Hahn98]-CB            196   syn. & sem. analysis ($$$)    26%
[Hahn98]-CB            196   syn. & sem. analysis ($$$)    31%
CREAM/OntoMat

[Architecture diagram: the Annotation Environment comprises an AnnotationTool GUI with plugins (Document Editor/Viewer, Ontology Guidance & Fact Browser, Annotation by Markup), an Annotation Inference Server, and Document Management. PANKOW crawls Web Pages from the WWW and annotates them; Domain Ontologies are loaded, facts are extracted and queried, producing Annotated Web Pages.]
Results (Interactive Mode)

[Chart: Precision, Recall, and F-Measure (Top 5) plotted against the threshold (values 0 to 0.8).]
F = 51.65%
R/Acc = 49.46%
Current State-of-the-art
• Large-scale IE [SemTag&Seeker@WWW'03]
  – only disambiguation
• Standard IE (MUC)
  – needs handcrafted rules
• ML-based IE (e.g. Amilcare@{OntoMat,MnM})
  – needs a hand-annotated training corpus
  – does not scale to large numbers of concepts
  – rule induction takes time
• KnowItAll (Etzioni et al. WWW'04)
  – shallow (pattern-matching-based) approach
Conclusion

Summary:
• new paradigm to overcome the annotation problem
• unsupervised instance categorization
• first step towards the self-annotating Web
• difficult task: open domain, many categories
• decent precision, low recall
• very good results in interactive mode
• currently inefficient (590 Google queries per instance)

Challenges:
• contextual disambiguation
• annotating relations (currently restricted to instances)
• scalability (e.g. only issue reasonable queries to Google)
• accurate recognition of named entities (currently a POS tagger)