Describing and Discovering Language Resources David Illsley, Ewan Klein, Steve Renals School of Informatics University of Edinburgh

Post on 28-Dec-2015

Describing and Discovering Language Resources

David Illsley, Ewan Klein, Steve Renals

School of Informatics, University of Edinburgh

Overview

• Goals: availability and interoperability
• Service oriented architecture and workflow
• NLP Components
• Service description and discovery
• NLP and the Grid

What are Language Resources?

• Language Resources (LRs) of two kinds:
• Static resources:
  – corpora (text, speech, multimodal)
  – lexicons, terminologies, ontologies
  – grammars, declarative rule-sets
• Processing resources:
  – segmenters, tokenizers, zoners, taggers, entity classifiers, chunkers, parsers, …

Goals

• Maximize availability of static LRs for automatic processing

• Maximize interoperability of processing LRs

LRs on the WWW, 1

• Can use the WWW to locate corpora
• Example: OLAC (Open Language Archives Community)
  – Provides query interface to search for corpora across multiple repositories
  – Requires standard metadata record for harvesting.
  – Does not provide access to corpora.

LRs on the WWW, 2

• Can use the WWW to directly search corpora
• Many examples
• BNC Online Search
  – words (with regular expressions)
  – tag strings
• Typically search is limited (expressiveness, number of results)

LRs on the WWW, 3

• Can use the WWW to download tools
• Some tools offer a demo web interface
• No interoperability:
  – you cannot take the output of one web-interfaced tool and feed it as input to another tool

LRs on the WWW, 4

• Challenges for accessing static LRs for automatic processing:
  – licensing restrictions
  – file (or database) structure
  – data format
  – data transfer
• What about processing LRs?
  – can download,
  – but not execute in an interoperable manner

Web Services (WS)

• A WS is a self-contained software resource
• Can be located and invoked across the web:
  – identified by a URL
  – public interfaces and bindings are defined and described using XML
• Other applications interact with it in a prescribed manner
  – XML-based messages conveyed by internet protocols (e.g. HTTP)
• Web services can be composed into complex, distributed applications

Service Oriented Architecture (SOA)

[Diagram: Service Requester (client), Service Provider (service and service description), and Discovery Agencies (descriptions), connected over the WWW by the operations publish, locate, and interact. Source: Berners-Lee]

Web Service: Key Ideas

• Interaction with Web Services is described by, and conducted using, XML documents exchanged over the internet
• SOAP protocol
  – describes the form of messages and how to process them
  – a way of representing Remote Procedure Calls over HTTP
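
As a rough illustration of a remote procedure call carried in a SOAP message, the sketch below wraps an operation name and its arguments in a SOAP 1.1 envelope and unpacks them again. The `tokenize` operation, the `http://example.org/nlp` namespace, and both helper functions are assumptions made for this example; they are not part of the talk, and no HTTP transport is shown.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/nlp"  # hypothetical namespace for an NLP service


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.split("}", 1)[1]


def build_request(operation: str, **arguments: str) -> bytes:
    """Serialize a procedure call as Envelope/Body/<operation>/<argument>..."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in arguments.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = value
    return ET.tostring(envelope, encoding="utf-8")


def parse_request(message: bytes):
    """Recover the operation name and argument dictionary from an envelope."""
    envelope = ET.fromstring(message)
    call = envelope.find(f"{{{SOAP_ENV}}}Body")[0]  # first child of Body is the call
    arguments = {local_name(arg.tag): arg.text or "" for arg in call}
    return local_name(call.tag), arguments


message = build_request("tokenize", text="Colorless green ideas sleep furiously.")
operation, arguments = parse_request(message)
```

In a deployed service the bytes in `message` would be the body of an HTTP POST, and the provider would run the equivalent of `parse_request` before dispatching to the wrapped tool.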

The Appeal of Web Services

• A means of building distributed systems
• virtualization: not dependent on any one programming language, OS, development environment

• based on well-understood underlying protocols

• components can be developed independently

• decentralized (apart from DNS)

NLP Services

• Fairly easy to wrap legacy code as web services

• Allows us to deploy tools across the web as part of a larger application

• Corpora can also be deployed as services

• Helps with availability and interoperability

• But still many challenges

Building NLP Applications

• Many NLP applications involve relatively few ‘conceptual’ components:
  – tokenizers, taggers, named entity recognizers, parsers, etc
  – often different versions of the same components
  – much repeated (and messy) labour in wiring the components together to interoperate

Issues in Component Approach

• Granularity
  – What is appropriate ‘grain size’ of functionality?
    • Too fine: heavy overheads in communication, lose ease of use
    • Too gross: loss of flexibility
    • Hierarchical decomposition is possible
• Compatibility
  – informational, functional, formal

Linguistic Annotation

• Makes information in raw text explicit:
  – Classification of words and phrases
  – Detection of structural relationships
  – Annotation with general and domain-specific semantic labels

• Usually proceeds from more concrete to more abstract

• Earlier stages of annotation feed into the later stages

• Assumed that annotation is represented as XML

Idealized View

Compatible NLP Services: Substitution

[Diagram: pipeline tokenize → POS tag → parse, with two further POS tag services shown as alternatives for the middle stage]

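Compatible NLP Services: Sequencing

[Diagram: the separate services tokenize, POS tag, and parse, and the same three services chained into one sequence tokenize → POS tag → parse]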
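
Substitution and sequencing can be sketched with ordinary functions standing in for services. Everything below is a toy under stated assumptions: the two taggers, the flat "parse", and the `sequence` helper are invented for illustration and do not represent any real NLP tool.

```python
from functools import reduce


def tokenize(text):
    """Toy tokenizer: split on whitespace."""
    return text.split()


def tag_all_nouns(tokens):
    """Toy tagger A: label every token NN."""
    return [(token, "NN") for token in tokens]


def tag_by_suffix(tokens):
    """Toy tagger B: label tokens ending in 'ing' as VBG, everything else NN."""
    return [(token, "VBG" if token.endswith("ing") else "NN") for token in tokens]


def parse(tagged):
    """Toy parser: wrap the tagged tokens in a single flat S node."""
    return ("S", tagged)


def sequence(*stages):
    """Sequencing: feed the output of each stage to the next one."""
    return lambda data: reduce(lambda value, stage: stage(value), stages, data)


# Substitution: the two taggers accept and return the same kinds of value,
# so either can fill the middle slot without touching the other stages.
pipeline_a = sequence(tokenize, tag_all_nouns, parse)
pipeline_b = sequence(tokenize, tag_by_suffix, parse)

result_a = pipeline_a("dogs barking")
result_b = pipeline_b("dogs barking")
```

The swap works only because both taggers agree on input and output shape, which is the compatibility question the following slides turn to.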

WSDL File

• XML document, usually on same machine as server
• Describes everything involved in calling a web service:
  – The service URL and namespace
  – The type of web service
  – List of available functions
  – Arguments for each function
  – Data type of each argument
  – Return value of each function and data type of each return value
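
To make the list above concrete, here is a small sketch that reads the functions, arguments, and data types out of a WSDL 1.1 fragment. The fragment itself (a `tokenize` operation in the `http://example.org/nlp` namespace) is hypothetical and deliberately incomplete: it omits the binding and service elements that a real WSDL file would also carry.

```python
import xml.etree.ElementTree as ET

NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}  # WSDL 1.1 namespace

# Hypothetical description of a tokenizer service (messages and portType only).
WSDL_TEXT = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:tns="http://example.org/nlp"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://example.org/nlp">
  <message name="TokenizeRequest"><part name="text" type="xsd:string"/></message>
  <message name="TokenizeResponse"><part name="tokens" type="xsd:string"/></message>
  <portType name="TokenizerPort">
    <operation name="tokenize">
      <input message="tns:TokenizeRequest"/>
      <output message="tns:TokenizeResponse"/>
    </operation>
  </portType>
</definitions>"""


def describe(wsdl_text):
    """Map each operation name to its argument and return (name, type) pairs."""
    root = ET.fromstring(wsdl_text)
    parts = {
        message.get("name"): [
            (part.get("name"), part.get("type"))
            for part in message.findall("wsdl:part", NS)
        ]
        for message in root.findall("wsdl:message", NS)
    }
    operations = {}
    for operation in root.findall("wsdl:portType/wsdl:operation", NS):
        request = operation.find("wsdl:input", NS).get("message").split(":")[-1]
        response = operation.find("wsdl:output", NS).get("message").split(":")[-1]
        operations[operation.get("name")] = {
            "arguments": parts[request],
            "returns": parts[response],
        }
    return operations


interface = describe(WSDL_TEXT)
```

Note that both the input and the output come out typed as `xsd:string`: the description says nothing about whether the string is raw text, tokenized text, or tagged text.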

Processor Input and Output Types

• Composition of NL processors constrained by input and output types
• Candidates for types?
• WSDL provides simple data types:
  – strings, integers, booleans
  – not expressive enough

• Can we build on notion of metadata for LRs?

IMDI Catalogue Specification

Catalogue.Title                      Arabic Treebank
Catalogue.Subject-Language           ara
Catalogue.Content-Type               written
Catalogue.Format.Text                UTF-8
Catalogue.Smallest Annotation Unit   word
Catalogue.Publisher                  LDC
Catalogue.Size                       266 Mb

LR Metadata Standards

• Advantages
  – consistency
  – software knows what to expect
  – can be designed according to agreed principles
• Challenges
  – no generally agreed ontology for LRs
  – hard to get agreement (and who gets to decide?)
  – categorizations of LRs influenced by favourite linguistic theory

• Other people are addressing this issue

What’s missing: tool metadata

• What kind of metadata would enable us to ensure tool interoperability?

• Neither OLAC nor IMDI provides an answer.

Discovering Resources

• Who cares about discovering LRs?
  – researchers who are searching for LRs that meet specific research criteria
  – information providers
  – teachers, journalists, casual browsers
  – …

• Current focus: automatic discovery by software agents

Service Description & Discovery

• What LRs can be discovered depends on how the LRs are described.

• How LRs are described depends on the requirements for discovery.

• Composability:
  – If an agent (human or software) has already selected component P, what other components Q can provide well-formed input to P?
  – Query for all Q such that Q’s output type is compatible with P’s input type
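
The composability query can be sketched as a filter over a registry of processor descriptions. The sketch assumes, purely for illustration, that a type is a set of annotation layers and that "compatible" means the candidate's output carries every layer the selected processor requires; the layer names and the three registered processors are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    name: str
    requires: frozenset  # annotation layers the input document must carry
    adds: frozenset      # annotation layers this processor contributes

    @property
    def provides(self):
        # Annotation is treated as cumulative: output keeps every input layer.
        return self.requires | self.adds


def feeders(p, registry):
    """All Q in the registry whose output type is compatible with P's input type."""
    return [q for q in registry if q is not p and p.requires <= q.provides]


REGISTRY = [
    Processor("tokenizer", frozenset(), frozenset({"word"})),
    Processor("tagger", frozenset({"word"}), frozenset({"pos"})),
    Processor("parser", frozenset({"word", "pos"}), frozenset({"syntax"})),
]

parser = REGISTRY[2]
feeder_names = [q.name for q in feeders(parser, REGISTRY)]
```

Here only the tagger qualifies as a feeder for the parser, because the tokenizer's output lacks the `pos` layer.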

Some Versions of BNC

name: British National Corpus, Version 1.0
type: text
size: 2866 MB

name: British National Corpus, Version 1.0, marked up in XML
type: text
size: 815 MB

name: British National Corpus, Version 1.0, parsed with Charniak parser
type: text
size: 419 MB

name: British National Corpus, Version 1.0, parsed with IMS parser
type: text
size: 2088 MB

name: British National Corpus, Version 1.0, parsed with Minipar
type: text
size: 448 MB

Corpus Request Scenario

• Agent A requests corpus C with property [key = val].

• If C with [key = val] exists, serve it to A.
• Otherwise,
  – find processor P such that output of P(C) satisfies [key = val]
  – apply P to C
  – serve result to A
  – store result for future requests
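
The scenario above can be sketched as a small service class. This is a simplification under stated assumptions: a corpus is a plain dictionary with a `props` mapping, processors are looked up by the property they declare as output rather than discovered, and the names `CorpusService`, `tokenize`, and `markup` are invented for the example.

```python
class CorpusService:
    """Serve a corpus with a requested property, deriving it on demand."""

    def __init__(self, corpus, processors):
        self.corpus = corpus          # the stored corpus C
        self.processors = processors  # {(key, val): function from corpus to corpus}
        self.derived = {}             # results stored for future requests
        self.runs = 0                 # how many times a processor was applied

    def request(self, key, val):
        # If C with [key = val] exists, serve it.
        if self.corpus["props"].get(key) == val:
            return self.corpus
        if (key, val) in self.derived:
            return self.derived[(key, val)]
        # Otherwise find a processor P whose output satisfies [key = val].
        processor = self.processors.get((key, val))
        if processor is None:
            raise LookupError(f"no processor yields {key} = {val}")
        # Apply P to C, store the result, and serve it.
        self.runs += 1
        result = processor(self.corpus)
        self.derived[(key, val)] = result
        return result


def tokenize(corpus):
    """Toy processor: whitespace tokenization, recorded in the metadata."""
    return {
        "data": corpus["data"].split(),
        "props": {**corpus["props"], "markup": "tokens"},
    }


raw = {"data": "the dog barks", "props": {"markup": "none"}}
service = CorpusService(raw, {("markup", "tokens"): tokenize})
first = service.request("markup", "tokens")
second = service.request("markup", "tokens")  # answered from the store
```

From the requesting agent's side the two calls look identical; only the service knows that the first one triggered processing.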

Service Description

• Standard approach
  – WSDL: describes service inputs/outputs in terms of simple data types
  – Doesn’t support semantically-based service discovery
• Alternatives from Semantic Web
  – inputs and outputs specified in an ontology language
  – OWL and RDF both possible

NLP as Document Annotation

• NL Processor
  – takes a partially annotated document as input
  – yields a more richly annotated document as output

Tagging as document annotation

• Part of Speech Tagger
  – takes in a document with markup of words
  – yields a document with additional markup of part of speech
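
A minimal sketch of tagging as document annotation, assuming words are marked up as `<w>` elements and part of speech is recorded in a `pos` attribute. The element names, the attribute name, and the three-word lexicon are assumptions made for this example, not a format given in the talk.

```python
import xml.etree.ElementTree as ET

# Toy lexicon, for illustration only; unknown words are labelled UNK.
TOY_LEXICON = {"the": "DT", "dog": "NN", "barks": "VBZ"}


def pos_tag(document):
    """Add a pos attribute to every <w> element (modifies the tree in place)."""
    for word in document.iter("w"):
        word.set("pos", TOY_LEXICON.get((word.text or "").lower(), "UNK"))
    return document


# Input: a document that already carries word markup.
document = ET.fromstring("<s><w>The</w><w>dog</w><w>barks</w></s>")
tagged = ET.tostring(pos_tag(document), encoding="unicode")
```

The output document still contains all of the input's word markup, with the tagger's annotation layered on top.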

Document Class

NB This is just corpus metadata!

Subsumption over the Document class

Subsumption over Processors
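
The diagrams for these two slides did not survive in the transcript, so the following is only a guess at the general idea: a subsumption check over a small class hierarchy. The class names and the hierarchy itself are hypothetical, and the assumption that a parsed document counts as a kind of tagged document follows from treating annotation as cumulative.

```python
# Hypothetical controlled vocabulary: each class maps to its direct parent.
PARENT = {
    "ParsedDocument": "TaggedDocument",
    "TaggedDocument": "TokenizedDocument",
    "TokenizedDocument": "Document",
}


def subsumes(general, specific):
    """True if `general` is `specific` itself or one of its ancestors."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = PARENT.get(current)
    return False


def can_feed(output_class, input_class):
    """A processor's output is acceptable wherever its class is subsumed by the input class."""
    return subsumes(input_class, output_class)


# A tagger's output can feed a processor that only asks for tokenized documents,
# but a tokenizer's output cannot feed a processor that asks for tagged ones.
tagged_feeds_tokenized = can_feed("TaggedDocument", "TokenizedDocument")
tokenized_feeds_tagged = can_feed("TokenizedDocument", "TaggedDocument")
```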

Grid & NLP

• Parallelism
  – distribute processes over many machines
  – use parallel algorithms within process
  – redundancy and fault tolerance
• Distributed data
  – multiple corpora
  – distributed annotation of single corpus
• Distributed processing pipeline
  – different components hosted at different sites
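
The first kind of parallelism, spreading one annotation step over many workers, can be sketched on a single machine. This is only a local stand-in: threads from Python's standard library play the role of grid nodes, no grid middleware is involved, and the `tokenize` step and `annotate_corpus` helper are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize(document):
    """Toy annotation step applied independently to each document."""
    return document.split()


def annotate_corpus(documents, workers=4):
    """Process documents concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tokenize, documents))


annotated = annotate_corpus(["the dog barks", "colorless green ideas"])
```

The approach depends on documents being independent of one another for the step in question, which holds for tokenization but not for every kind of annotation.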

Implementation

• Based on Globus Toolkit 3.2 middleware
• Corpus Services and Transformation Services provide interfaces for corpora and tools
• Service Data Elements describe properties of services
  – properties are aggregated by Index Service, can be queried by clients
• Index Service extended by Model Service
  – provides richer description of services using RDF triples

• Backward chaining used to construct pipelines that will produce a requested resource
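
Backward chaining over service descriptions can be sketched as a recursive search from the requested resource back to something already available. The three services and the resource kinds below are hypothetical, each service is reduced to a single input kind and a single output kind, and nothing here reflects the actual Model Service or the Globus Toolkit API.

```python
# Hypothetical transformation services: name -> (input kind, output kind).
SERVICES = {
    "tokenizer": ("raw", "tokenized"),
    "tagger": ("tokenized", "tagged"),
    "parser": ("tagged", "parsed"),
}


def plan(goal, available, seen=frozenset()):
    """Return a list of service names producing `goal`, or None if impossible."""
    if goal in available:
        return []  # nothing to run: the resource already exists
    if goal in seen:
        return None  # already trying to derive this goal: avoid looping
    for name, (needs, produces) in SERVICES.items():
        if produces == goal:
            # Chain backwards: first work out how to obtain this service's input.
            prefix = plan(needs, available, seen | {goal})
            if prefix is not None:
                return prefix + [name]
    return None


full_pipeline = plan("parsed", {"raw"})
short_pipeline = plan("parsed", {"tagged"})
no_pipeline = plan("aligned", {"raw"})
```

When a partially processed version of the corpus is already stored, the planner returns a shorter pipeline, as `short_pipeline` shows.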

Summary

• Corpus query
  – for user, no obvious distinction between raw and processed data
• Corpus service
  – either provide existing resource, or generate it
• Need to have metadata for tools which allows automatic composition
• Metadata needs to allow subsumption matching
  – using shared controlled vocabulary