Beesley 2001

Finite-State Technology and Linguistic Applications

12-16 March 2001

Xerox Research Centre Europe, Grenoble Laboratory

6, chemin de Maupertuis, 38240 MEYLAN, France

Kenneth BEESLEY

Ken Beesley: Brief Introduction

B.A., Linguistics and Computer Science, Brigham Young University, 1978

Diploma, Linguistics and Phonetics, Univ. of Glasgow, 1979

D.Phil., “Epistemics” (Cognitive Science), Univ. of Edinburgh, 1983

ALPNET, computer assisted translation, 1984-1990

1988-1990 Arabic morphology project, exposure to Finite-State Morphology from Lauri Karttunen at COLING 1988

Microlytics (Xerox spinoff), 1990-1993

Xerox Corporation 1993-present

Morphology projects: Arabic, Spanish, Portuguese, Italian, Dutch, (Malay), (Aymara); also teaching finite-state programming techniques

Some people are into finite-state programming for the mathematics and algorithms; I’m in it because it lets me build working systems for interesting natural languages.

Goals for the Week

• Introduce finite-state theory

• Introduce the Xerox Finite-State “Calculus”, a practical software implementation of the theory: xfst, lexc

• Try to convince you that finite-state natural-language processing is a Good Thing

• The Hope: Inspire a few of you to start your own computational projects, perhaps on Maltese

Finite-state techniques are widely used today in both research and industry for natural-language processing. The software implementations and documentation are improving steadily, and they are increasingly available to all of us.

Schedule

Monday 12 March    17.00-19.00    LC1117      Gentle Introduction

Tuesday 13         17.00-19.30    Unix Lab    Intro. to xfst

Wednesday 14       10.00-12.30    Unix Lab    More on xfst

Thursday 15        17.00-19.30    Unix Lab    Intro. to lexc

Friday 16          18.30-20.00    403 CCT     Linguistics Circle

Today’s Goals

• Understand “Regular” Languages and Relations.

• Understand the mathematical operations that can be performed on such Languages and Relations.

• Understand how Languages, Relations, Regular Expressions, and Networks are interrelated.

• Understand that we can create finite-state networks and compute with them using Xerox Finite-State Technology

• xfst interface

– Regular-Expression Compiler

– Access to Finite-State Algorithms

• lexc language

– Used mainly for lexicons and for describing morphotactics

Why is “Finite State” Computing So Interesting?

• Finite-state systems are mathematically elegant, easily manipulated and modifiable.

• Computationally efficient. Usually very compact.

• The programming we linguists do is declarative. We describe the facts of our natural language; i.e. we write grammars. We do not hack ad hoc code.

• The runtime code, which applies our systems to linguistic input, is already written and it is completely language-independent.

• Finite-state systems are inherently bidirectional: we can use the same system to analyze and to generate.

What is Finite-State Computing Good For?

Mostly “lower-level” natural language processing:

• Tokenization

• Spelling checking/correction

• Phonology

• Morphological Analysis/Generation (emphasis this week)

• Part-of-Speech Tagging

• “Shallow” Syntactic Parsing and “Chunking”

Finite-state techniques cannot do everything; but for tasks where they do apply, they are extremely attractive.

Where is Xerox Finite-State Technology Used?

Xerox Research

• Xerox Palo Alto Research Center

• Xerox Research Centre Europe

Xerox Business Units and Partners

• ATS

• MKMS

• Inxight

Universities and Research Groups

• Over 70 licensees

We would like to make Xerox technology the de facto standard.

The Gentle Introduction

• Chapter 1 of The Book

• Physical Finite-State Machines (Automata)

• Linguistic Finite-State Machines

– Symbol

– Alphabet

– Language

• Lookup and Generation

• Quick Review of Set Theory

• Languages, Relations and Transducers

Physical Machines with Finite States

The Lightswitch Machine

[Figure: two states, OFF and ON, connected by one arc labeled PUSH UP and one arc labeled PUSH DOWN.]

Physical Machines with Finite States

The Lightswitch Toggle Machine

[Figure: two states, OFF and ON, connected by two arcs that are both labeled PUSH.]

Physical Machines with Finite States

The Fan in Ken’s Old Car

[Figure: four states, OFF, LOW, MED and HI, linked by three arcs labeled R and three arcs labeled L.]

Physical Machines with Finite States

Three-Way Lightswitch

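[Figure: four states, OFF, LOW, MED and HI, linked by four arcs labeled R: three between neighbouring states and a fourth closing the cycle.]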

The Cola Machine

• Need to enter 25 cents (USA) to get a drink

• Accepts the following coins:

– Nickel = 5 cents

– Dime = 10 cents

– Quarter = 25 cents

• For simplicity, our machine needs exact change

• We will model only the coin-accepting mechanism

Physical Machines with Finite States

The Cola Machine

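[Figure: six states labeled 0, 5, 10, 15, 20 and 25. Arcs labeled N (nickel), D (dime) and Q (quarter) connect them: five N arcs, four D arcs and one Q arc. State 0 is the Start State; state 25 is the Final/Accept State.]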

The Cola Machine Language

• List of all the sequences of coins accepted: Q, DDN, DND, NDD, DNNN, NDNN, NNDN, NNND, NNNNN

• Think of the coins as SYMBOLS or CHARACTERS

• The set of symbols accepted is the ALPHABET of the machine

• Think of sequences of coins as WORDS or “strings”

• The set of words accepted by the machine is its LANGUAGE
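
The cola machine can be sketched in a few lines of Python. This is only an illustration, not Xerox code: the names COINS, step, accepts and language are invented here, and the states are assumed to be the running totals in cents, with a coin rejected if it would push the total past 25.

```python
# Hypothetical sketch of the cola machine's coin-accepting mechanism.
# The alphabet is the set of coin symbols; states are totals in cents.
COINS = {"N": 5, "D": 10, "Q": 25}
START, FINAL = 0, 25


def step(state, coin):
    """Return the next state, or None if no arc exists for this coin."""
    total = state + COINS[coin]
    return total if total <= FINAL else None


def accepts(word):
    """True if the coin sequence leads from the start state to the final state."""
    state = START
    for coin in word:
        if coin not in COINS:
            return False
        state = step(state, coin)
        if state is None:
            return False
    return state == FINAL


def language():
    """Enumerate the machine's language by following every path from START."""
    words = []
    stack = [(START, "")]
    while stack:
        state, word = stack.pop()
        if state == FINAL:
            words.append(word)
            continue
        for coin in COINS:
            nxt = step(state, coin)
            if nxt is not None:
                stack.append((nxt, word + coin))
    return sorted(words)
```

Calling language() walks every path through the machine, so its result can be compared directly with the list of accepted coin sequences; accepts("DND") checks a single word.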

Linguistic Machines

[Figure: a network whose paths spell the words canto, tigre and mesa. The input word mesa is matched against the network (“Apply”).]

More Linguistic Machines

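[Figure: a network whose arcs are labeled with the letters c, l, e, a, v and r.]

A Transducer

[Figure: one transducer path. Upper side: m e s a +Noun +Fem +Pl. Lower side: m e s a 0 0 s. “Apply Up” maps the surface word mesas to the analysis mesa+Noun+Fem+Pl; “Apply Down” maps mesa+Noun+Fem+Pl to mesas.]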
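
The mesas example can be imitated with a minimal Python sketch. It assumes a transducer reduced to a list of paths, each path a sequence of (upper, lower) symbol pairs, which is far simpler than a real finite-state network; the names PATH, side, apply_up and apply_down are invented for this illustration and are not the xfst interface.

```python
# Hypothetical sketch: a transducer reduced to a list of paths.
EPSILON = "0"  # the slides write the empty symbol as 0

# One path: (upper symbol, lower symbol) pairs for mesa+Noun+Fem+Pl / mesas.
PATH = [
    ("m", "m"), ("e", "e"), ("s", "s"), ("a", "a"),
    ("+Noun", EPSILON), ("+Fem", EPSILON), ("+Pl", "s"),
]


def side(path, index):
    """Spell out one side of a path (0 = upper, 1 = lower), skipping epsilons."""
    return "".join(pair[index] for pair in path if pair[index] != EPSILON)


def apply_up(surface, paths=(PATH,)):
    """Analysis: return the upper-side strings paired with a surface string."""
    return [side(p, 0) for p in paths if side(p, 1) == surface]


def apply_down(analysis, paths=(PATH,)):
    """Generation: return the lower-side strings paired with an analysis string."""
    return [side(p, 1) for p in paths if side(p, 0) == analysis]
```

The same PATH data serves both directions: apply_up("mesas") analyzes and apply_down("mesa+Noun+Fem+Pl") generates, which is the bidirectionality the slides emphasize.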

A Morphological Analyzer

[Figure: a Transducer relating the Surface Word Language to the Analysis Word Language.]

A Quick Review of Set Theory

A set is a collection of objects.

[Figure: a set containing the objects A, B, D and E.]

We can enumerate the “members” or “elements” of finite sets: { A, D, B, E }.

There is no significant order in a set, so { A, D, B, E } is the same set as { E, A, D, B }, etc.

Uniqueness of Elements

You cannot have two or more ‘A’ elements in the same set

[Figure: the set containing A, B, D and E.]

{ A, A, D, B, E } is just a redundant specification of the set { A, D, B, E }.
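
Python's built-in set type behaves the same way, so a short sketch can confirm both points: order does not matter and duplicates collapse. The variable names are chosen only for this illustration.

```python
# Sets have no significant order and no duplicate elements.
first = {"A", "D", "B", "E"}
reordered = {"E", "A", "D", "B"}
redundant = {"A", "A", "D", "B", "E"}  # the second "A" adds nothing

same_order_free = first == reordered
same_after_duplicates = first == redundant
size = len(redundant)
```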

Cardinality of Sets

The Empty Set: { }

A Finite Set: e.g. { Norway, Denmark, Sweden }

An Infinite Set: e.g. the set of all positive integers

Simple Operations on Sets: Union

Set 1: { A, B, C }

Set 2: { D, E }

Union of Set 1 and Set 2: { A, B, C, D, E }

Simple Operations on Sets (2): Union

Set 1: { A, B, C }

Set 2: { C, D }

Union of Set 1 and Set 2: { A, B, C, D }

Simple Operations on Sets (3): Intersection

Set 1: { A, B, C }

Set 2: { C, D }

Intersection of Set 1 and Set 2: { C }

Simple Operations on Sets (4): Subtraction

Set 1: { A, B, C }

Set 2: { C, D }

Set 1 minus Set 2: { A, B }
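
The three operations map directly onto Python's set operators. The sketch below reuses the overlapping sets from the slides, { A, B, C } and { C, D }; the variable names are assumptions of this example.

```python
# Union, intersection and subtraction with Python's built-in set operators.
set1 = {"A", "B", "C"}
set2 = {"C", "D"}

union = set1 | set2          # every element found in either set
intersection = set1 & set2   # only the elements found in both sets
difference = set1 - set2     # elements of set1 that are not in set2
```

Subtraction is the one operation here that is not symmetric: set2 - set1 gives { D }, not { A, B }.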

Formal Languages

Very Important Concept in Formal Language Theory:

A Language is just a Set of Words.

• We use the terms “word” and “string” interchangeably.

• A Language can be empty, have finite cardinality, or be infinite in size.

• You can union, intersect and subtract languages, just like any other sets.

Union of Languages (Sets)

Language 1: { dog, cat, rat }

Language 2: { elephant, mouse }

Union of Language 1 and Language 2: { dog, cat, rat, elephant, mouse }

Intersection of Languages (Sets)

Language 1: { dog, cat, rat }

Language 2: { elephant, mouse }

Intersection of Language 1 and Language 2: { } (the empty language, since the two share no words)

Intersection of Languages (Sets)

Language 1: { dog, cat, rat }

Language 2: { rat, mouse }

Intersection of Language 1 and Language 2: { rat }

Subtraction of Languages (Sets)

Language 1: { dog, cat, rat }

Language 2: { rat, mouse }

Language 1 minus Language 2: { dog, cat }

Languages

• A language is a set of words (=strings).

• Words (strings) are composed of symbols (letters) that are “concatenated” together.

• At another level, words are composed of “morphemes”.

• In most natural languages, we concatenate morphemes together to form whole words.

For sets consisting of words (i.e. for Languages), the operation of concatenation is very important.

Concatenation of Languages

Root Language: { work, talk, walk }

Suffix Language: { 0, ing, ed, s }, where 0 stands for the empty string

The concatenation of the Suffix language after the Root language: { work, working, worked, works, talk, talking, talked, talks, walk, walking, walked, walks }
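
Concatenating two languages means pairing every word of the first with every word of the second. A minimal Python sketch, with the hypothetical function name concatenate and the empty string "" standing in for the slide's 0:

```python
def concatenate(lang1, lang2):
    """Return every word formed by a word of lang1 followed by a word of lang2."""
    return {w1 + w2 for w1 in lang1 for w2 in lang2}


roots = {"work", "talk", "walk"}
suffixes = {"", "ing", "ed", "s"}  # "" plays the role of 0 on the slide

words = concatenate(roots, suffixes)
```

Three roots and four suffixes give twelve words. Unlike union and intersection, concatenation depends on order: concatenate(suffixes, roots) yields strings such as "ingwork" instead.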

Languages and Networks

[Figure: Network/Language 1 encodes the roots walk, work and talk. Network/Language 2 encodes the suffixes 0, s, ed and ing. A third network shows the concatenation of Network 1 and Network 2.]

Grammars, Languages, Networks

[Figure: a Grammar, written in xfst or lexc, Describes a Language or Relation and Compiles Into a Finite-State Network. The network is used to Recognize the language or Map the relation.]

In the coming days, we will learn how to write xfst and lexc grammars and compile them into working systems.

Tasks/Exercises

• Read chapter 1, at least up to page 28

• Do Exercises 1.10.1 (page 34) and 1.10.2 (page 36).

• For more rigor, read Chapter 2. Do the graphing exercise in Appendix B (page 381).