Page 1: A Unicode-based Environment for the Creation and use of LRs

                                                                                                                           

A Unicode-based Environment for the Creation and use of LRs

Valentin Tablan, Cristian Ursu, Kalina Bontcheva, Hamish Cunningham, Diana Maynard, Oana Hamza, Tony McEnery (1), Paul Baker (1), Mark Leisher (2)

Department of Computer Science, University of Sheffield
(1) University of Lancaster
(2) New Mexico State University

GATE (a General Architecture for Text Engineering) and ML LRs

1. Motivation (history of men’s underwear)
2. Short definition of GATE
3. GATE, Unicode and Java
4. EMILLE

1(11)


Motivation for Software Infrastructure for Language Engineering

Analogy with recent history of men’s underwear – also supportive infrastructure:

• The bad old days: Y-fronts: supportive, yes, but tended to be too constrictive
• The brave new world: boxer shorts: still supportive, but less constraining

The purpose of our work (the boxer shorts ideal):

freedom within a supportive environment

2(11)


GATE is:
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
• Some free components... and wrappers for other people's components.
• Tools for: evaluation; visualisation/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/

3(11)


Architectural principles
• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. v1 used LT-NSL for SGML input; v2 talks to other XML-based systems, APIs and standards)
• (Almost) everything is a component, and component sets are user-extendable

Component-based development
• An OO way of chunking software: Java Beans
• GATE components: CREOLE (Collection of REusable Objects for Language Engineering) = modified Java Beans
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.

4(11)


GATE Language Resources

GATE LRs are documents, ontologies, corpora, lexicons.

Documents / corpora:
• GATE documents loaded from local files or the web...
• Diverse document formats: text, HTML, XML, email, RTF, SGML.

Multilinguality:
• New internationalised versions of the JVM support >100 different encodings.
• Other encodings: developing a system for user entry of mapping tables.
• LR persistence through XML, file datastore or databases (Oracle, PostgreSQL).
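As an illustration of the encoding support mentioned above, the following minimal sketch uses only the standard `java.nio.charset` API to turn bytes stored in a legacy encoding into a Java (Unicode) string. The class and method names (`EncodingSketch`, `decode`) are made up for this example and are not part of GATE; how GATE itself loads documents is not shown on the slides.

```java
import java.nio.charset.Charset;

class EncodingSketch {
    /** Decodes raw bytes into a Java (Unicode) string using the named encoding. */
    static String decode(byte[] raw, String encodingName) {
        // Charset.forName throws UnsupportedCharsetException if this JVM lacks the encoding.
        return new String(raw, Charset.forName(encodingName));
    }

    public static void main(String[] args) {
        // The same character, U+00E9 (e with acute accent), stored in two encodings.
        byte[] latin1 = {(byte) 0xE9};
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9};

        // Once decoded, both byte sequences yield the same Unicode string.
        System.out.println(decode(latin1, "ISO-8859-1").equals(decode(utf8, "UTF-8")));

        // Number of encodings this particular JVM offers; varies by vendor and version.
        System.out.println(Charset.availableCharsets().size());
    }
}
```

Only a handful of encodings (such as ISO-8859-1, UTF-8 and UTF-16) are guaranteed on every JVM; the rest depend on the installation, which is presumably why user-supplied mapping tables are being developed for encodings the JVM does not cover.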

5(11)


Processing Resources

Algorithmic components known as PRs: beans with execute methods.
• All PRs can handle Unicode data by default.
• Clear distinction between code and data (simple repurposing).
• 20-30 freebies with GATE.

Unicode Tokeniser
• splits text into typed tokens based on an FSM
• dynamically constructed from a set of rules based on the character categories defined by the Unicode standard, e.g.:

    UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* > Token;orth=upperInitial;kind=word;

• output can be localised by a later module (e.g. “don’t” → “do” “n’t”)
• current status:
  • 23 rules seem able to handle Indo-European languages without changes.
  • the English tokeniser: Unicode tokeniser + pattern grammar FST.
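The single tokeniser rule quoted above can be approximated in plain Java, because `Character.getType` exposes the Unicode general categories the rule refers to. This is a hand-written sketch of that one rule only, not GATE's tokeniser: the real component builds an FSM from its full rule set, whereas this version simply skips any character that cannot start a match. The names `TokeniserSketch`, `Token` and `upperInitialWords` are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class TokeniserSketch {
    /** A typed token: the matched text plus the two features the rule assigns. */
    static final class Token {
        final String text;
        final String orth;
        final String kind;

        Token(String text, String orth, String kind) {
            this.text = text;
            this.orth = orth;
            this.kind = kind;
        }
    }

    static boolean isUpper(char c) {
        return Character.getType(c) == Character.UPPERCASE_LETTER;
    }

    static boolean isLowerOrDash(char c) {
        int type = Character.getType(c);
        return type == Character.LOWERCASE_LETTER || type == Character.DASH_PUNCTUATION;
    }

    /** Applies UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* across the text. */
    static List<Token> upperInitialWords(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!isUpper(text.charAt(i))) {
                i++; // not a possible start for this rule; a full tokeniser would try other rules
                continue;
            }
            int end = i + 1;
            while (end < text.length() && isLowerOrDash(text.charAt(end))) {
                end++; // greedily consume the starred part of the rule
            }
            tokens.add(new Token(text.substring(i, end), "upperInitial", "word"));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : upperInitialWords("Anne-marie met \u00c9lise")) {
            System.out.println(t.text + " orth=" + t.orth + " kind=" + t.kind);
        }
    }
}
```

Because the test is on Unicode categories rather than on the ranges A-Z and a-z, the accented capital in the sample input is treated exactly like an ASCII one, which is the property that lets a small rule set cover many languages. The sketch works on `char` values, so characters outside the Basic Multilingual Plane would need code-point iteration instead.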

6(11)


Displaying Multilingual Data (1)

GATE uses the standard (and imperfect) Java rendering engine for displaying text.

7(11)


Displaying Multilingual Data (2)

All the visualisation and editing tools for ML LRs use the same facilities.

8(11)


Editing Multilingual Data
• Java provides no special support for text input (this may change)
• GATE Unicode Kit (GUK) plugs this hole
• Support for defining additional Input Methods; currently 30 IMs for 17 languages
• Pluggable in other applications (e.g. MPI’s EUDICO)
• Can use virtual keyboard or standard layouts over QWERTY
• IMs defined in plain text files
• GUK comes with a standalone Unicode editor
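To make the idea of an input method defined in a plain text file concrete, here is a small sketch of a key-to-character mapping table laid over a QWERTY keyboard. The file format (one `<key> <hex code point>` pair per line, `#` for comments), the class name `InputMethodSketch` and the two sample mappings are all invented for illustration; the slides do not describe GUK's actual definition format, and the mappings are not taken from any real keyboard layout.

```java
import java.util.HashMap;
import java.util.Map;

class InputMethodSketch {
    /** Parses lines such as "k 0915" (hypothetical format) into a key-to-text table. */
    static Map<Character, String> parse(String definition) {
        Map<Character, String> table = new HashMap<>();
        for (String line : definition.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = line.split("\\s+");
            int codePoint = Integer.parseInt(parts[1], 16);
            table.put(parts[0].charAt(0), new String(Character.toChars(codePoint)));
        }
        return table;
    }

    /** Converts typed QWERTY keys to target text; unmapped keys pass through unchanged. */
    static String type(Map<Character, String> table, String keys) {
        StringBuilder out = new StringBuilder();
        for (char key : keys.toCharArray()) {
            out.append(table.getOrDefault(key, String.valueOf(key)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative table: 'k' -> U+0915 (Devanagari KA), 'a' -> U+0905 (Devanagari A).
        Map<Character, String> table = parse("# sample table\nk 0915\na 0905");
        System.out.println(type(table, "ka!"));
    }
}
```

Keeping the table in a text file rather than in code means a new language can be supported by writing a new definition, with no recompilation. A real input method would also need multi-key sequences and dead keys, which this single-key sketch leaves out.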

9(11)


EMILLE: Enabling Minority LE

3-year EPSRC project at Lancaster University and Sheffield University.

Corpus development:
• written language corpora of at least 9,000,000 words for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu.
• spoken corpora of at least 500,000 words per language.

Unicode developments for GATE:
• Indic keyboard layouts.
• encodings for Indic languages.

Development of basic LE tools:
• POS tagging.
• alignment tools for parallel corpora.

10(11)


Encore

http://gate.ac.uk/

Other GATE-related stuff at LREC:

• Saggion et al.: Extracting Information for MM Indexing [Weds, 19.05]

• Baker et al.: EMILLE [Thurs, 10.25]

• Demo and poster [Thurs, 11.00-12.20, session D1]

• Pastra et al.: Reuse of NE pattern grammars [Thurs, 16.20]

• Fliers

11(11)