technologies for (semi-) automatic metadata creation gate.ac.uk/ nlp.shef.ac.uk

Technologies for (semi-) automatic metadata creation http://gate.ac.uk/ http://nlp.shef.ac.uk/ Diana Maynard University of Sheffield KnowledgeWeb WP 1.3 meeting, Crete, 14 May 2004



Page 1: Technologies for (semi-) automatic metadata creation gate.ac.uk/ nlp.shef.ac.uk

Technologies for (semi-) automatic metadata creation

http://gate.ac.uk/ http://nlp.shef.ac.uk/

Diana Maynard
University of Sheffield

KnowledgeWeb WP 1.3 meeting, Crete, 14 May 2004

Page 2

Overview

• USFD is mainly concerned in this WP with best practices and guidelines for ontology-based web applications

• State-of-the-art systems and platforms for metadata creation

• Metadata is created through semantic tagging

• Metadata can be represented as inline (modification of the original document) or standoff (separate storage from the document)
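The difference between inline and standoff representation can be illustrated with a minimal Python sketch. The function names, the dictionary layout and the sample sentence are assumptions made for illustration only; none of the systems discussed here is claimed to use this format.

```python
def make_standoff(text, mentions):
    """Build standoff annotations: character offsets kept apart from the text.

    `mentions` is a list of (surface string, label) pairs; only the first
    occurrence of each surface string is annotated (a simplification).
    """
    annotations = []
    for surface, label in mentions:
        start = text.find(surface)
        if start != -1:
            annotations.append(
                {"start": start, "end": start + len(surface), "type": label}
            )
    return annotations


def to_inline(text, annotations):
    """Return a tagged copy of the text (inline markup).

    Assumes the annotations do not overlap.
    """
    pieces = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a["start"]):
        pieces.append(text[cursor:ann["start"]])
        span = text[ann["start"]:ann["end"]]
        pieces.append(f'<{ann["type"]}>{span}</{ann["type"]}>')
        cursor = ann["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hypothetical sample data.
sample_text = "Diana Maynard works at the University of Sheffield."
sample_mentions = [
    ("Diana Maynard", "Person"),
    ("University of Sheffield", "Organization"),
]
standoff = make_standoff(sample_text, sample_mentions)
inline = to_inline(sample_text, standoff)
```

With standoff storage the original document is never modified, so several annotation layers can refer to the same text; the inline form is self-contained but changes the document.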

Page 3

Semi-automatic v automatic metadata creation

• Semi-automatic methods are more reliable, but require human intervention
  – MnM: requires initial human annotation; pre-defined ontology
  – S-CREAM
  – AERODAML

• Automatic methods are less reliable, but suitable for large volumes of text, and offer a dynamic view
  – SemTag: semantic tagging from ontology
  – KIM: semantic tagging and ontology population
  – h-TechSight: semantic tagging, ontology population and evolution

Page 4

Semi-automatic methods

• MnM

• S-CREAM

Page 5

MnM

• Semi-automatic in that it requires initial training by user

• Uses a pre-defined set of concepts in an ontology

• User browses the web and manually annotates chosen pages

• System learns annotation rules, tests them, and takes over annotation, populating ontologies with the instances found

• Precision and recall are not perfect; however, retraining is possible at any stage

Page 6

S-CREAM

• Semi-automatic CREAtion of Metadata

• Uses Ont-O-Mat + Amilcare

• Trainable for different domains

• Aligns conceptual markup (which defines relational metadata) provided by e.g. Ont-O-Mat with semantic markup provided by Amilcare

Page 7

Annotated data in S-CREAM

Page 8

Amilcare

• Amilcare learns IE rules from pre-annotated data (e.g. using Ont-O-Mat)

• Uses GATE (ANNIE) for pre-processing + applies rules learnt in training phase to new documents

• Concepts need to be pre-defined, but system can be trained for new domain

• Can be tuned towards precision or recall
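To make the precision/recall tradeoff concrete, the following minimal Python sketch computes precision, recall and the F-beta score for a set of predicted annotations against a gold standard. It only illustrates the metrics; how Amilcare is actually tuned is not described here, and the function name and the tuple layout of the annotations are assumptions.

```python
def precision_recall_fbeta(predicted, gold, beta=1.0):
    """Score predicted annotations against gold annotations.

    Annotations are hashable items, e.g. (start, end, type) tuples, and only
    exact matches count as correct. beta > 1 weights recall more heavily,
    beta < 1 weights precision more heavily.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Hypothetical annotations: (start offset, end offset, concept).
gold_annotations = {(0, 13, "Person"), (27, 50, "Organization"), (60, 66, "Location")}
predicted_annotations = {(0, 13, "Person"), (27, 50, "Location")}
scores = precision_recall_fbeta(predicted_annotations, gold_annotations)
```

A system tuned towards precision would accept fewer, safer annotations; one tuned towards recall would accept more candidates at the cost of more errors.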

Page 9

Automatic methods

• SemTag

• KIM

• h-TechSight

Page 10

SemTag and KIM

• SemTag and KIM both annotate webpages using instances from an ontology

• Main problem is to disambiguate instances that occur in multiple parts of the ontology

• SemTag aims for accuracy of classification, whereas KIM aims more for recall (finding all instances)

• KIM also uses IE to find new instances not present in ontology

Page 11

SemTag

• Automated semantic tagging of large corpora, using TAP ontology (contains 65K instances)

• Largest scale semantic tagging effort to date

• Uses concept of Semantic Label Bureau

• Annotations are stored separately from web pages (standoff markup)

• Uses corpus-wide statistics to improve quality of tagging, e.g. automated alias discovery

• Tags can be extracted using a variety of mechanisms, e.g. search for all tags matching a particular object
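The idea of storing tags apart from the pages and retrieving them by object can be sketched as a small in-memory index. This is an illustrative sketch only, not the Semantic Label Bureau interface: the class `LabelStore`, its methods, the example URLs and the object identifiers are all hypothetical.

```python
from collections import defaultdict


class LabelStore:
    """Hypothetical standoff tag store indexed by ontology object and by page."""

    def __init__(self):
        self._by_object = defaultdict(list)
        self._by_url = defaultdict(list)

    def add(self, url, start, end, object_id):
        """Record that characters [start, end) of the page refer to object_id."""
        tag = {"url": url, "start": start, "end": end, "object": object_id}
        self._by_object[object_id].append(tag)
        self._by_url[url].append(tag)

    def tags_for_object(self, object_id):
        """Return all tags that point at one ontology object."""
        return list(self._by_object.get(object_id, []))

    def tags_for_page(self, url):
        """Return all tags attached to one page."""
        return list(self._by_url.get(url, []))


# Hypothetical usage with made-up pages and object identifiers.
store = LabelStore()
store.add("http://example.org/a.html", 10, 19, "City:Sheffield")
store.add("http://example.org/b.html", 42, 51, "City:Sheffield")
store.add("http://example.org/b.html", 60, 65, "Country:Crete")
sheffield_tags = store.tags_for_object("City:Sheffield")
```

Because the pages themselves are untouched, the same store can serve different consumers, each querying by object or by page as needed.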

Page 12

SemTag Architecture

Page 13

KIM

• Uses an ontology (KIMO) with 86K/200K instances

• Lookup phase marks instances from the ontology

• High ambiguity of instances with the same label (e.g. locations belonging to different countries)

• Disambiguation uses an Entity Ranking algorithm, i.e., priority ordering of entities with the same label based on corpus statistics

• Lookup is combined with rule-based IE system (from GATE) to recognise new instances of concepts and relations

• Special KB enrichment stage where some of these new instances are added to the KB
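A priority ordering of same-label entities based on corpus statistics could look roughly like the following Python sketch. The details of KIM's Entity Ranking algorithm are not given here, so this is a simple frequency-based stand-in; the function names, the gazetteer layout and the counts are invented for illustration.

```python
def rank_candidates(label, gazetteer, corpus_counts):
    """Order the entities that share a label, most frequent in the corpus first.

    gazetteer: dict mapping a surface label to a list of entity identifiers.
    corpus_counts: dict mapping an entity identifier to how often it was seen.
    Ties are broken alphabetically so that the ordering is deterministic.
    """
    candidates = gazetteer.get(label, [])
    return sorted(candidates, key=lambda entity: (-corpus_counts.get(entity, 0), entity))


def disambiguate(label, gazetteer, corpus_counts):
    """Pick the top-ranked entity for a label, or None if the label is unknown."""
    ranked = rank_candidates(label, gazetteer, corpus_counts)
    return ranked[0] if ranked else None


# Hypothetical gazetteer and counts (not taken from KIMO).
example_gazetteer = {"Cambridge": ["Cambridge_Massachusetts", "Cambridge_UK"]}
example_counts = {"Cambridge_UK": 120, "Cambridge_Massachusetts": 80}
best = disambiguate("Cambridge", example_gazetteer, example_counts)
```

A label missing from the gazetteer yields no candidate, which is the case the rule-based IE component is meant to cover by recognising new instances.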

Page 14

KIM (2)

Page 15

h-TechSight KMP

• Knowledge management platform for fully automatic metadata creation and ontology population, and semi-automatic ontology evolution, powered by GATE and ToolBox.

• Data-driven analysis of ontologies enables trends of instances to be monitored

• Uses GATE to support the instance-based evolution of ontologies in the Chemical Engineering domain.

• Analysis of unrestricted text to extract instances of concepts from such ontologies

• Instances populated into a domain-specific ontology and/or exported to an Access / Oracle database
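Monitoring trends of instances amounts to counting how often instances of each concept are found per time period. The following Python sketch assumes the extracted instances are available as (period, concept) pairs; that input format, the function name and the sample data are assumptions, not the h-TechSight data model.

```python
from collections import Counter, defaultdict


def instance_trends(observations):
    """Count extracted instances per concept and time period.

    observations: iterable of (period, concept) pairs, e.g. ("2004-01", "Skill").
    Returns {concept: {period: count}} with periods in sorted order.
    """
    trends = defaultdict(Counter)
    for period, concept in observations:
        trends[concept][period] += 1
    return {
        concept: dict(sorted(counts.items()))
        for concept, counts in trends.items()
    }


# Hypothetical observations from two monthly crawls.
sample_observations = [
    ("2004-01", "Skill"),
    ("2004-01", "Skill"),
    ("2004-02", "Skill"),
    ("2004-02", "JobTitle"),
]
trends = instance_trends(sample_observations)
```

A concept whose counts rise or fall over successive periods is a candidate for attention when the ontology is revised.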

Page 16

[Figure: h-TechSight workflow diagram. Recoverable labels: Web site, URL, Employment, Ontology, DB, Visualisation of New Instances, Analysis of Results, Evolution of Ontologies]

Page 17

Ontology-based IE in h-TechSight

• Ontology-based IE for semantic tagging of job adverts, news and reports in the chemical engineering domain

• Semantic tagging used as input for ontological analysis

• Fundamental to the application is a domain-specific ontology

• Terminological gazetteer lists are linked to classes in the ontology

• Rules classify the mentions in the text with respect to the domain ontology

• Annotations output into a database or as an ontology
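Linking gazetteer terms to ontology classes can be sketched in a few lines of Python. This is only an approximation of the lookup step: h-TechSight uses GATE's gazetteers and hand-written rules, whereas the sketch below does a plain case-insensitive string match, and the function name, class names and sample text are hypothetical.

```python
import re


def tag_mentions(text, gazetteer):
    """Find gazetteer terms in the text and label them with their ontology class.

    gazetteer: dict mapping a term to the name of an ontology class.
    Matching is case-insensitive and respects word boundaries.
    Returns annotations sorted by position in the text.
    """
    annotations = []
    for term, ontology_class in gazetteer.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append(
                {
                    "start": match.start(),
                    "end": match.end(),
                    "text": match.group(0),
                    "class": ontology_class,
                }
            )
    annotations.sort(key=lambda a: (a["start"], -a["end"]))
    return annotations


# Hypothetical gazetteer for job adverts.
job_gazetteer = {
    "process engineer": "JobTitle",
    "Java": "Skill",
    "Sheffield": "Location",
}
advert = "Seeking a process engineer with Java skills in Sheffield."
mentions = tag_mentions(advert, job_gazetteer)
```

The resulting annotations could then be written to a database table or added to the ontology as instances of the matched classes.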

Page 18

Limitations

• h-TechSight uses a rule-based IE system

• Requires human expert to write rules

• Accurate on restricted domains with small ontologies

• Adaptation to a new domain / ontology may require some effort

Page 19

Summary

• Tradeoff between semi-automatic and fully automatic systems, dependent on application, corpus size, etc.

• Tradeoff between rule-based and ML techniques for IE

• Tradeoff between dynamic and static systems