Page 1: The Semantic Web and Language Technology BT Exact, Martlesham

                                                                                                                           

The Semantic Web and Language Technology

BT Exact, Martlesham

Hamish Cunningham

Department of Computer Science,

University of Sheffield

Friday October 11th 2002

• Next generation web

• GATE, language technology infrastructure


A Ubiquitous Permeable Web

The next generation of the web must be:

• ubiquitous: semantics for every device, every organisation, every individual;

• permeable: allow contextual data to penetrate and persist;

• companionable: able to engage with us via multiple natural modalities.

Roles for Language Technology:

• discovery of semantics (ubiquity);

• mediating between context and personal semantic memories (permeability);

• conversing with people and the semantic web (companionableness).


Critical Mass for the Semantic Web

The SW: machine processable, repurposable data to complement hypertext

But: semantics = 0.0000000...% of the Web

How to achieve critical mass? Huge scale automatic annotation. Requirements:

• Huge scale:

– freely available to all EU citizens

– distributed (over a Grid)

– re-purposeable (delivered as Web Services)

• Portability and robustness via:

– simple and therefore shallow HLT methods

– +ve and –ve learning

– analogs of IPSEs for computer-literate users


Motivation for Software Infrastructure for Language Engineering

• Need for scalable, reusable, and portable HLT solutions

• Support for large data, in multiple media, languages, formats, and locations

• Lowering the cost of creation of new language processing components

• Promoting quantitative evaluation metrics via tools and a level playing field


Motivation (II):


GATE, a General Architecture for Text Engineering

• An architecture: a macro-level organisational picture for LE software systems.

• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.

• A development environment: for language engineers, computational linguists et al, GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.

• Some free components... and wrappers for other people's components

• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.

• Free software (LGPL). Download at http://gate.ac.uk/download/


Architectural principles

• Non-prescriptive, theory neutral (strength and weakness)

• Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka)

• (Almost) everything is a component, and component sets are user-extendable

Component-based development

• An OO way of chunking software: Java Beans

• GATE components: CREOLE = modified Java Beans (Collection of REusable Objects for Language Engineering)

• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.


GATE Language Resources

GATE LRs are documents, ontologies, corpora, lexicons, ...

Documents / corpora:

• GATE documents loaded from local files or the web...

• Diverse document formats: text, HTML, XML, email, RTF, SGML.

Processing Resources

Algorithmic components known as PRs – beans with execute methods.

• All PRs can handle Unicode data by default.

• Clear distinction between code and data (simple repurposing).

• 20-30 freebies with GATE

• e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene
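The idea of processing resources as "beans with execute methods", kept apart from the data they work on, can be sketched in a few lines. This is a minimal illustration in Python, not GATE's Java API: the classes Annotation, Document, WhitespaceTokeniser and CapitalisedWordTagger and the function run_pipeline are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over the document text, with free-form features."""
    kind: str
    start: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    """A language resource: plain data, with no processing logic of its own."""
    text: str
    annotations: list = field(default_factory=list)


class WhitespaceTokeniser:
    """Hypothetical PR: adds one Token annotation per whitespace-separated word."""

    def execute(self, doc: Document) -> None:
        offset = 0
        for word in doc.text.split():
            start = doc.text.index(word, offset)
            end = start + len(word)
            doc.annotations.append(Annotation("Token", start, end, {"string": word}))
            offset = end


class CapitalisedWordTagger:
    """Hypothetical PR: flags Tokens that start with an upper-case letter."""

    def execute(self, doc: Document) -> None:
        for ann in doc.annotations:
            if ann.kind == "Token":
                ann.features["upperInitial"] = ann.features["string"][:1].isupper()


def run_pipeline(doc: Document, resources: list) -> Document:
    """Run each processing resource in order over the same document."""
    for resource in resources:
        resource.execute(doc)
    return doc


doc = run_pipeline(Document("GATE was built in Sheffield"),
                   [WhitespaceTokeniser(), CapitalisedWordTagger()])
capitalised = [a.features["string"] for a in doc.annotations
               if a.features.get("upperInitial")]
print(capitalised)
```

Because each resource only reads and writes annotations on the shared document, resources can be reordered, swapped or reused on other documents without touching the data classes, which is one plausible reading of the "simple repurposing" point above.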


A Language Analysis Example

[Diagram labels: GATE; format handlers; HTML, RTF and XML docs; ANNIE; POS tagger; named entity; coreference; event extraction; custom application 1; document content; document metadata; document format data; linguistic data; file storage; relational database (Oracle/PostgreSQL).]


Building IE Components in GATE (1)

The ANNIE system – a reusable and easily extendable set of components


Building IE Components in GATE (2)

JAPE: a Java Annotation Patterns Engine

• Light, robust regular-expression-based processing

• Cascaded finite state transduction

• Low-overhead development of new components

Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+
  {Lookup.kind == companyDesignator}
):companyMatch
-->
:companyMatch.NamedEntity = { kind = company, rule = "Company1" }
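To make the Company1 rule concrete, here is a rough Python sketch of the pattern it describes: one or more upper-initial tokens followed by a company designator, with the matched span labelled as a NamedEntity. It is not JAPE and does not use GATE. The helpers tok, match_company1 and annotate are hypothetical, each designator is assumed to be a single token carrying its lookup kind, the leftmost-longest matching strategy is an assumption, and the rule's Priority is not modelled.

```python
def tok(string, lookup=None):
    """Build a toy token; the orthography test here is deliberately crude."""
    orth = "upperInitial" if string.istitle() else "other"
    return {"string": string, "orthography": orth, "lookup": lookup}


def match_company1(tokens):
    """Return (start, end) token-index spans matching the Company1 pattern."""
    matches = []
    i = 0
    while i < len(tokens):
        best_end = None
        j = i
        # Consume the ( {Token.orthography == upperInitial} )+ part ...
        while j < len(tokens) and tokens[j]["orthography"] == "upperInitial":
            j += 1
            # ... and after each consumed token, test for the designator.
            if j < len(tokens) and tokens[j]["lookup"] == "companyDesignator":
                best_end = j + 1  # remember the longest match seen so far
        if best_end is None:
            i += 1
        else:
            matches.append((i, best_end))
            i = best_end  # resume after the match so spans never overlap
    return matches


def annotate(tokens):
    """Right-hand side of the rule: one NamedEntity per matched span."""
    return [{"type": "NamedEntity", "kind": "company", "rule": "Company1",
             "text": " ".join(t["string"] for t in tokens[start:end])}
            for start, end in match_company1(tokens)]


sentence = [tok("shares"), tok("in"), tok("Acme"), tok("Widgets"),
            tok("Ltd", lookup="companyDesignator"), tok("rose")]
entities = annotate(sentence)
print(entities)
```

In this toy model the designator check runs before the next upper-initial test, so a designator such as "Ltd" that is itself capitalised still closes the match instead of being swallowed by the repeated token pattern.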


The Semantic Web and GATE

GATE is being used for development of (semi-)automatic methods for:

• linking web pages to Ontologies using Information Extraction;

• learning and evolving Ontologies via IE and lexical semantic network traversal.


Populating Ontologies with IE


Protégé and Ontology Management


Information Retrieval Support

Based on the Lucene IR engine


Displaying Multilingual Data

All the visualisation and editing tools for ML LRs use enhanced Java facilities.


Applications

GATE has been used for a variety of applications, including:

• MUMIS: automatic creation of semantic indexes for multimedia programme material

• MUSE: a multi-genre IE system

• Metadata for Medline (at Merck)

• ACE: participation in the Automatic Content Extraction programme

• HSE: summarisation of health and safety information from company reports

• OldBaileyIE: NE recognition on 17th century Old Bailey Court reports.

• Various Medical Informatics and database technology projects

• IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian this autumn)


Conclusion

GATE: an infrastructure that lowers the overhead of creating & embedding robust NLP components

Further information: http://gate.ac.uk/

• Online demos, tutorials and documentation

• Software downloads

• Talks and papers
