
Towards the Self-Annotating Web. Philipp Cimiano, Siegfried Handschuh, Steffen Staab. Presenter: Hieu K Le (most of the slides come from Philipp Cimiano). CS598CXZ - Spring 2005 - UIUC

Upload: piers-barber

Post on 17-Dec-2015


TRANSCRIPT

Towards the Self-Annotating Web

Philipp Cimiano, Siegfried Handschuh, Steffen Staab

Presenter: Hieu K Le (most of the slides come from Philipp Cimiano)

CS598CXZ - Spring 2005 - UIUC

Outline
• Introduction
• The Process of PANKOW
• Pattern-based categorization
• Evaluation
• Integration to CREAM
• Related work
• Conclusion

The annotation problem in 4 cartoons

The annotation problem from a scientific point of view

The annotation problem in practice

The vicious cycle

Annotating

[Slide diagram: "A Noun" linked to "A Concept", with a "?" marking the unknown assignment]

Annotating
• To annotate terms in a web page:
– Manually defining annotations
– Learning extraction rules
Both require a lot of labor

A small Quiz
• What is "Laksa"?

A. A dish
B. A city
C. A temple
D. A mountain

The answer is:


From Google

[Chart: Google hit counts for "Laksa" combined with each candidate concept (dish, city, temple, mountain); "dish" clearly dominates]

From Google
• "cities such as Laksa": 0 hits
• "dishes such as Laksa": 10 hits
• "mountains such as Laksa": 0 hits
• "temples such as Laksa": 0 hits

Google knows more than all of you together! An example of using syntactic information + statistics to derive semantic information

Self-annotating
• PANKOW (Pattern-based Annotation through Knowledge On the Web)
– Unsupervised
– Pattern-based
– Within a fixed ontology
– Involves information from the whole web

The Self-Annotating Web
• There is a huge amount of implicit knowledge in the Web
• Make use of this implicit knowledge together with statistical information to propose formal annotations and overcome the vicious cycle:
  semantics ≈ syntax + statistics?
• Annotation by maximal statistical evidence


PANKOW Process


Patterns

• HEARST1: <CONCEPT>s such as <INSTANCE>
• HEARST2: such <CONCEPT>s as <INSTANCE>
• HEARST3: <CONCEPT>s, (especially/including) <INSTANCE>
• HEARST4: <INSTANCE> (and/or) other <CONCEPT>s

• Examples:
– dishes such as Laksa

– such dishes as Laksa

– dishes, especially Laksa

– dishes, including Laksa

– Laksa and other dishes

– Laksa or other dishes

Patterns (Cont'd)

• DEFINITE1: the <INSTANCE> <CONCEPT>• DEFINITE2: the <CONCEPT> <INSTANCE>

• APPOSITION:<INSTANCE>, a <CONCEPT>• COPULA: <INSTANCE> is a <CONCEPT>

• Examples:
• the Laksa dish
• the dish Laksa
• Laksa, a dish
• Laksa is a dish

Asking Google (more formally)

• Instance i ∈ I, concept c ∈ C, pattern p ∈ {HEARST1, ..., COPULA}: count(i,c,p) returns the number of Google hits of the instantiated pattern

• E.g. count(Laksa, dish) := count(Laksa, dish, DEFINITE1) + ...
• Restrict to the best ones beyond a threshold

  count(i,c) := Σ_p count(i,c,p)

  R := { (i, c_i) | i ∈ I, c_i = argmax_{c ∈ C} count(i,c) }
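The aggregation and argmax above can be sketched in a few lines. The hit counts below are the illustrative numbers from the Laksa quiz, and the threshold value is a free parameter of the sketch, not the paper's setting:

```python
# Sketch of PANKOW's categorization by maximal statistical evidence:
# count(i,c) := sum over patterns p of count(i,c,p); an instance gets
# the concept with the highest aggregate hit count, provided it clears
# a threshold.

PATTERNS = ["HEARST1", "HEARST2", "HEARST3a", "HEARST3b", "HEARST4a",
            "HEARST4b", "DEFINITE1", "DEFINITE2", "APPOSITION", "COPULA"]

def count(i, c, hits):
    """Aggregate hit count over all patterns (0 for unseen triples)."""
    return sum(hits.get((i, c, p), 0) for p in PATTERNS)

def categorize(i, concepts, hits, threshold):
    """Return the maximal-evidence concept for i, or None below threshold."""
    best = max(concepts, key=lambda c: count(i, c, hits))
    return best if count(i, best, hits) >= threshold else None

# Illustrative counts from the quiz slide: only "dishes such as Laksa"
# returned hits.
hits = {("Laksa", "dish", "HEARST1"): 10}
print(categorize("Laksa", ["dish", "city", "temple", "mountain"],
                 hits, threshold=5))   # -> dish
```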


Evaluation Scenario
• Corpus: 45 texts from http://www.lonelyplanet.com/destinations
• Ontology: tourism ontology from the GETESS project
– #concepts: original: 1043; pruned: 682
• Manual annotation by two subjects:
– A: 436 instance/concept assignments
– B: 392 instance/concept assignments
– Overlap: 277 instances (Gold Standard)
– A and B used 59 different concepts
– Categorial (Kappa) agreement on 277 instances: 63.5%
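The 63.5% categorial agreement is a chance-corrected (Cohen's kappa) statistic over the two annotators' concept assignments. A sketch with made-up labels; the study's actual annotations are not reproduced here:

```python
# Sketch: Cohen's kappa between two annotators' concept assignments.
# The label lists in the example are illustrative only.
from collections import Counter

def cohen_kappa(a, b):
    """Chance-corrected agreement between two equally long label lists."""
    assert len(a) == len(b) and a
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n          # raw agreement
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[l] * cb[l] for l in set(ca) | set(cb)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)                   # kappa

print(cohen_kappa(["dish", "city", "city"], ["dish", "city", "river"]))
```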

Examples

(instance, assigned concept, aggregated Google hits)

Atlantic – city – 1520837
Bahamas – island – 649166
USA – country – 582275
Connecticut – state – 302814
Caribbean – sea – 227279
Mediterranean – sea – 212284
Canada – country – 176783
Guatemala – city – 174439
Africa – region – 131063
Australia – country – 128607
France – country – 125863
Germany – country – 124421
Easter – island – 96585
St Lawrence – river – 65095
Commonwealth – state – 49692
New Zealand – island – 40711
Adriatic – sea – 39726
Netherlands – country – 37926
St John – church – 34021
Belgium – country – 33847
San Juan – island – 31994
Mayotte – island – 31540
EU – country – 28035
UNESCO – organization – 27739
Austria – group – 24266
Greece – island – 23021
Malawi – lake – 21081
Israel – country – 19732
Perth – street – 17880
Luxembourg – city – 16393
Nigeria – state – 15650
St Croix – river – 14952
Nakuru – lake – 14840
Kenya – country – 14382
Benin – city – 14126
Cape Town – city – 13768

Results

[Chart: Precision, Recall, and F-Measure as a function of the threshold (0 to 990); best F=28.24%, R/Acc=24.90%]

Comparison

System         | # categories | Preprocessing / Cost       | Accuracy
[MUC-7]        | 3            | Various (?)                | >> 90%
[Fleischman02] | 8            | N-gram extraction ($)      | 70.4%
PANKOW         | 59           | none                       | 24.9%
[Hahn98]-TH    | 196          | syn. & sem. analysis ($$$) | 21%
[Hahn98]-CB    | 196          | syn. & sem. analysis ($$$) | 26%
[Hahn98]-CB    | 196          | syn. & sem. analysis ($$$) | 31%
[Alfonseca02]  | 1200         | syn. analysis ($$)         | 17.39% (strict)


CREAM/OntoMat

[Architecture diagram: the annotation environment (Annotation Tool GUI with Document Editor/Viewer, Ontology Guidance & Fact Browser, and Document Management plugins; Annotation Inference Server) loads Web Pages and Domain Ontologies and produces Annotated Web Pages via annotation by markup; PANKOW plugs in to crawl and query the WWW and annotate documents]

PANKOW & CREAM/OntoMat

Results (Interactive Mode)

[Chart: Precision, Recall, and F-Measure (Top 5) as a function of the threshold; best F=51.65%, R/Acc=49.46%]


Current State-of-the-art
• Large-scale IE [SemTag&Seeker@WWW'03]
– only disambiguation
• Standard IE (MUC)
– needs handcrafted rules
• ML-based IE (e.g. Amilcare@{OntoMat, MnM})
– needs a hand-annotated training corpus
– does not scale to large numbers of concepts
– rule induction takes time
• KnowItAll (Etzioni et al. WWW'04)
– shallow (pattern-matching-based) approach


Conclusion

Summary:
• new paradigm to overcome the annotation problem
• unsupervised instance categorization
• first step towards the self-annotating Web
• difficult task: open domain, many categories
• decent precision, low recall
• very good results for interactive mode
• currently inefficient (590 Google queries/instance)

Challenges:
• contextual disambiguation
• annotating relations (currently restricted to instances)
• scalability (e.g. only choose reasonable queries to Google)
• accurate recognition of Named Entities (currently POS-tagger)


Thanks to…
• Philipp Cimiano (cimiano@aifb.uni-karlsruhe.de) for the slides
• The audience for listening