Geographic Text Search. Corporate Proprietary, Copyright 1999-2003, MetaCarta, Inc. Analysis of geographic references. András Kornai, Beth Sundheim. HLT/NAACL03 workshop, 31 May 2003



Analysis of geographic references

András Kornai, Beth Sundheim

HLT/NAACL03 workshop 31 May 2003


Thanks to

Program committee: Doug Appelt, Merrick Lex Berman, Sean Boisen, Quintin Congdon, Jim Cowie, Doug Jones, Linda Hill, George Wilson

Conference support: Ed Hovy, James Allen, Steven Abney, Dragomir Radev, Ali Hakim, Dekang Lin

Sponsors: TIDES, AQUAINT


Program

• 19 papers submitted, 12 accepted

• 2 invited speakers

• 2 discussion periods

• Authors asked to email presentation to [email protected] by end of day


Changes

• Afternoon invited speaker: Jerry Hobbs (ISI) replaces Randy Flynn (NIMA)

• Paper presentation ordering: Li et al. swapped with Manov et al. (9:30am vs. 12:10pm)

• Additional workshop event: Linda Hill (UCSB) poster during breaks


Workshop goals

• Exchange information on work in the analysis and grounding of place names and other forms of geographic reference

• Informally assess state of art in handling various aspects of the problem

• Identify ways to follow up on workshop as a community


External resources

• Diversity across projects: ADL, Tipster, NIMA/USGS, UN-LOCODE, TGN, GB Historical GIS, web, …

• Integrated resources: KIM KB (Manov et al.), named entity word list in InfoXtract, extended multi-gazetteer MetaCarta db, …

• Net result – how happy are we with current resources and integration solutions?
    • With coverage of named places, richness of information, utility for NLP analysis as well as for grounding references?
    • With using a named entity finder as an analysis preprocessor?


Entity finding in text

• Some systems (for now) entirely manual

• Semi-automated (with human review)

• Fully automated
    • FS template matching
    • (Weighted) rule-based
    • HMM-based
    • Confidence-based


Disambiguation

• What do we mean?
    • Discrimination between names of places and other types of names
    • Disambiguation of place reference by location of place
    • Disambiguation of place reference by type of place

• How well do current techniques work, and what hard problems remain?
    • Relative difficulty given texts about U.S., detailed location references, historical texts
    • Relation to general word sense disambiguation problem
    • Use of non-local descriptive references, coreference, …
    • Co-occurrence of names with non-spatial clue terms (“San Francisco” and “earthquake”)
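
The last point, co-occurrence with non-spatial clue terms, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Candidate` class, the entry identifiers, and the hand-picked clue-term sets are hypothetical and are not taken from any system discussed at the workshop. The sketch simply counts clue terms in the surrounding text and declines to choose when no candidate clearly wins.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical gazetteer entry competing for one place name."""
    entry_id: str
    clue_terms: frozenset  # assumed, hand-picked terms associated with the place


def clue_score(candidate, context):
    """Count how many of the candidate's clue terms occur in the context."""
    tokens = set(re.findall(r"\w+", context.lower()))
    return len(candidate.clue_terms & tokens)


def disambiguate(candidates, context):
    """Return the best-scoring candidate, or None if there is no clear winner."""
    scored = sorted(
        ((clue_score(c, context), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not scored or scored[0][0] == 0:
        return None  # no clue term seen at all
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # tie between the top two candidates
    return scored[0][1]


# Toy data: two places that share the name "San Francisco".
CANDIDATES = [
    Candidate("toy-sf-california", frozenset({"earthquake", "bay", "california"})),
    Candidate("toy-sf-argentina", frozenset({"argentina", "cordoba"})),
]

winner = disambiguate(CANDIDATES, "An earthquake shook San Francisco.")
undecided = disambiguate(CANDIDATES, "We flew to San Francisco.")
```

Returning `None` rather than guessing keeps the "no clear winner" case visible, so it can be routed to human review in a semi-automated setup.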


Disambiguation (2)

Observations from Nov. ’02 name annotation round:

• For 80% of all name instances, evidence from local context was enough to determine which gazetteer entry was the corresponding one in over 75% of cases
    • This augurs well for successful automation

• No gazetteer linkage could be made for 20% of all name instances – either the name did not appear in the gazetteer at all (majority), or it appeared there in the wrong sense
    • This lack of gazetteer coverage presents a significant challenge


Failure modes (1)

• Lack of complete match on name
    • St. Petersburg – no variant in gazetteer with “St[.]”

• Multiple acceptable entries
    • [the] Crimea – one for “regions”, one for “capes”

• Transliteration differences
    • Sheremetyevo -> Sheremet’yevo
    • Belarus -> Byelarus

• Mismatch on feature type
    • Simferopol, Vladikavkaz – “capital” in doc, but not in gazetteer
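
Some of these mismatches are surface-form problems that a normalization step applied to both the document name and the gazetteer names might reduce. The sketch below is one possible approach, not a description of any system presented at the workshop. The `normalize` function and the `ABBREVIATIONS` table are hypothetical, and expanding "St" to "Saint" is an assumption that would be wrong for street names.

```python
import re
import unicodedata

# Illustrative abbreviation table; a real one would need far more care.
ABBREVIATIONS = {"st": "saint"}


def normalize(name):
    """Reduce a place name to a crude matching key."""
    # Decompose accented letters and drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Drop a leading article, as in "[the] Crimea".
    text = re.sub(r"^the\s+", "", text)
    # Drop apostrophes used in some transliterations (Sheremet'yevo).
    text = re.sub("['\u2019]", "", text)
    # Tokenize on letters and digits, which also discards the dot in "St.".
    tokens = re.findall(r"[a-z0-9]+", text)
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


key_abbrev = normalize("St. Petersburg")
key_full = normalize("Saint Petersburg")
```

With this key, "St. Petersburg" and "Saint Petersburg" collapse to the same string, as do "Sheremetyevo" and "Sheremet’yevo". Spelling-level variants such as Belarus and Byelarus are not handled and would need a variant list or fuzzy matching, and feature-type mismatches are outside the scope of name normalization altogether.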


Failure modes (2)

• Many matching entries, but no clear winner
    • Prigorodny – 16 hits on Prigorod (many in Russia)

• No entry for general places
    • Asia – no entry in gazetteer

• Variant name missing from entry
    • America – no match in gazetteer (i.e., not a listed variant)

• Name in doc matches wrong entry in gaz
    • The Heavenly Ski Resort – exactly matches entry with BUILDING feature, but correct entry is under Heavenly Valley Ski Area (with LOCALE feature in USGS GNIS and “sports facilities” feature in ADL gaz)
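
For annotation review or error analysis it can help to record which kind of lookup outcome occurred. The following is a minimal sketch under stated assumptions: the `Outcome` labels, the `classify_lookup` function, the toy gazetteer contents, and the feature-type strings for the Prigorod entries are all hypothetical. Only the place names and the BUILDING and LOCALE labels echo the examples above.

```python
from enum import Enum


class Outcome(Enum):
    NO_ENTRY = "no entry"
    UNIQUE = "unique match"
    AMBIGUOUS = "many matches, no clear winner"
    TYPE_MISMATCH = "name matches, feature type does not"


# Toy gazetteer: lowercased name -> list of (entry_id, feature_type).
# All identifiers and the "POPULATED PLACE" label are made up for illustration.
TOY_GAZETTEER = {
    "heavenly ski resort": [("toy-1", "BUILDING")],
    "prigorod": [("toy-2", "POPULATED PLACE"), ("toy-3", "POPULATED PLACE")],
}


def classify_lookup(name, expected_type=None, gazetteer=TOY_GAZETTEER):
    """Classify a gazetteer lookup by the kind of result it produced."""
    entries = gazetteer.get(name.lower(), [])
    if not entries:
        return Outcome.NO_ENTRY
    if expected_type is not None:
        typed = [entry for entry in entries if entry[1] == expected_type]
        if not typed:
            # The name is listed, but never with the type the document implies.
            return Outcome.TYPE_MISMATCH
        entries = typed
    return Outcome.UNIQUE if len(entries) == 1 else Outcome.AMBIGUOUS


asia = classify_lookup("Asia")
prigorod = classify_lookup("Prigorod")
heavenly = classify_lookup("Heavenly Ski Resort", expected_type="LOCALE")
```

The Heavenly case shows why an exact name match alone is not enough: without the expected feature type, the lookup would report a unique match on the wrong entry.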


Foreign language

• Example: TIDES surprise language exercise
    • Challenge: Develop resources and NLP tools for a foreign language in a month (June)
    • Can’t expect to find an existing placename gazetteer for this language
    • This language is likely to have a non-western script; ease of transliteration unpredictable


Community

• Offerings from SPAWAR Systems Center:
    • Annotated corpora available to those with licenses for source texts, along with annotation protocol
    • “Modernized” (with respect to diacritics) Tipster gazetteer available upon request

• Call for papers:
    • Special issue of TALIP journal on temporal and spatial information processing (Editors: Mani, Pustejovsky, Sundheim)
    • Submissions due December 1 – think about it!


Tagging

• Finding the entity in text

• Disambiguation

• Type assignment

• Grounding
    • Linking to unique gazetteer entry
    • Assigning coordinates
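
The four steps above can be pictured as stages that progressively fill in one record per place reference. The sketch below is a toy under stated assumptions: `PlaceTag`, `find_entities`, `ground`, and the one-entry gazetteer are hypothetical names, the capitalized-word finder is a stand-in for a real entity finder, and the Boston coordinates are approximate. Disambiguation and type assignment are collapsed into a single dictionary lookup here, which a real system would not do.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PlaceTag:
    """One place reference in text, filled in stage by stage."""
    start: int
    end: int
    text: str
    feature_type: Optional[str] = None                 # type assignment
    gazetteer_id: Optional[str] = None                 # link to a unique entry
    coordinates: Optional[Tuple[float, float]] = None  # (lat, lon)


# Toy gazetteer: lowercased name -> (entry_id, feature_type, (lat, lon)).
TOY_GAZETTEER = {
    "boston": ("toy-boston", "populated place", (42.36, -71.06)),
}


def find_entities(text):
    """Stand-in entity finder: runs of capitalized words."""
    pattern = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
    return [PlaceTag(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]


def ground(tag):
    """Link a tag to a gazetteer entry and copy its type and coordinates."""
    entry = TOY_GAZETTEER.get(tag.text.lower())
    if entry is not None:
        tag.gazetteer_id, tag.feature_type, tag.coordinates = entry
    return tag


def tag_places(text):
    """Run the stages and keep only the tags that could be grounded."""
    return [tag for tag in map(ground, find_entities(text)) if tag.gazetteer_id]


sentence = "Flights from Boston were delayed."
tags = tag_places(sentence)
```

Keeping character offsets on each tag means the grounded result can always be traced back to the exact span in the source text.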


Annotation standards

• Example: Automatic Content Extraction (ACE)
    • XML-based
    • Levels: mentions (instances), entities, inter-entity relations
    • Types of mentions: names, nominals (descriptive references), pronouns
    • Entity categories wrt places: LOCATION, FACILITY, GEOPOLITICAL ENTITY (GPE)
    • Each category has defined subtypes (new)
    • Scheme allows for metonymic usage and fuzzy meaning
    • Software tools to support manual annotation, output format transformation, annotation lookup and review
    • Entity and relation schemes could/should be elaborated further over time


Volume and pressure


Conclusions

• Procedural input sought from participants: shall we summarize at the end?

• Who is we: Organizers? Session chairs? Committee members? Panel?