
DESCRIPTION

Talk at Geekcamp SG given by Francis Bond on Natural Language Processing using open source tools.

TRANSCRIPT

Page 1: Open Source Natural Language Processing - Francis Bond

Open Source

Natural Language Processing

Francis Bond

<www3.ntu.edu.sg/home/fcbond/>

Division of Linguistics and Multilingual Studies

Nanyang Technological University

<[email protected]>

2009-08-21 (GeekCamp)


Self Introduction

➣ BA in Japanese and Mathematics

➣ BEng in Power and Control

➣ PhD in “Machine Translation”

➣ 1991-2006 NTT (Nippon Telegraph and Telephone)

➢ Japanese - English/Malay Machine Translation

➢ Japanese corpus, grammar and ontology (Hinoki)

➣ 2006-2009 NICT (National Institute of Information and Communications Technology)

➢ Japanese - English, Chinese Machine Translation

➢ Japanese WordNet (Released in March 2009)


Overview

➣ What is NLP (and Why do it)?

➣ Machine Translation Examples

➣ Why Open Source?

➣ Wrap Up

➣ State of the Art


The basic problem

We get words

People saw her duck.

We want meaning


People saw her duck (1)

http://www.animaltalk.us/for/Animals/fw-cute-picture-of-your-daughter-with-duck/


People saw her duck (2)

http://www.nataliedee.com/012109/ducking-incoming-balls.jpg


People saw her duck (3)

OpenClipArtLibrary


Syntax

(1) [S [NP [N [N [N People]]]] [VP [V [V:see saw]] [NP [DET her] [N [N [N duck]]]]]]

(2) [S [NP [N [N [N People]]]] [VP [V [V [V:see saw]]] [NP [N her]] [VP [V [V [V duck]]]]]]

(3) [S [NP [N [N [N People]]]] [VP [V [V:saw saw]] [NP [DET her] [N [N [N duck]]]]]]
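As a rough illustration of how a parser finds all three readings, here is a minimal CKY-style sketch in Python. The lexicon, the category names (Vsee, Vsaw, SC for a small clause) and the rules are assumptions made up for this example; they are not the grammar used in the talk.

```python
from collections import defaultdict

# Hypothetical toy lexicon: the categories each word can take.
LEXICON = {
    "people": {"NP"},
    "saw": {"Vsee", "Vsaw"},   # past of "see", or present of "saw" (cut)
    "her": {"DET", "NP"},      # possessive determiner, or pronoun
    "duck": {"N", "VP"},       # the bird, or the verb "to duck"
}

# Hypothetical binary rules: (left, right) -> possible parents.
RULES = {
    ("NP", "VP"): ["S", "SC"],  # sentence, or small clause "her duck"
    ("DET", "N"): ["NP"],
    ("Vsee", "NP"): ["VP"],     # reading (1)
    ("Vsee", "SC"): ["VP"],     # reading (2)
    ("Vsaw", "NP"): ["VP"],     # reading (3)
}


def count_parses(words, goal="S"):
    """Count the distinct parse trees that span the whole sentence (CKY)."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))  # (start, end) -> {category: count}
    for i, word in enumerate(words):
        for cat in LEXICON[word.lower()]:
            chart[i, i + 1][cat] = 1
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left, n_left in list(chart[start, mid].items()):
                    for right, n_right in list(chart[mid, end].items()):
                        for parent in RULES.get((left, right), ()):
                            chart[start, end][parent] += n_left * n_right
    return chart[0, n][goal]


print(count_parses("People saw her duck".split()))  # 3
```

Even this four-word sentence has three analyses under a five-rule grammar, so a realistic grammar needs a way to rank its parses.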


Structural Semantics

Who did what to whom, how, where, when and why?

(1) see(people, duck_i: past) poss(duck_i, pron:[3rd, sg, fem]: past)

(2) see(people, duck_j) duck_j(pron:[3rd, sg, fem])

(3) saw(people, duck_i) poss(duck_i, pron:[3rd, sg, fem])


Lexical Semantics

What are people? What’s a duck? What does sawing entail?

(4) people ⊂ entity

(5) see ⊂ perceive

(6) saw ⊂ cut

(7) duck_i ⊂ bird

(8) duck_j ⊂ move


Pragmatics

The study of meaning in context.

➣ Which people?

➣ What duck?

➣ Why did you say that?

➣ What does it imply?


The problem restated

➣ How can we model and resolve ambiguity?

➣ Two main approaches

➢ Deduce implicit models

∗ bag of words, n-gram chunks, . . .

➢ Define explicit models

∗ Grammars, lexicons and thesauri

➣ Then build a statistical language model (machine learning)
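To make the "implicit model" idea concrete, here is a minimal sketch of a bigram (2-gram) language model estimated by relative frequency. The three-sentence corpus and the function names are made up for this example; a real model would be trained on a large corpus and smoothed.

```python
from collections import Counter


def train_bigrams(sentences):
    """Return P(word | previous word) estimated by relative frequency."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        context_counts.update(tokens[:-1])             # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent pairs

    def prob(previous, word):
        if context_counts[previous] == 0:
            return 0.0  # unseen context; a real model would smooth instead
        return bigram_counts[previous, word] / context_counts[previous]

    return prob


# Hypothetical three-sentence corpus, made up for this sketch.
corpus = ["people saw her duck", "people saw her", "the duck saw people"]
prob = train_bigrams(corpus)
print(prob("saw", "her"))   # 2 of the 3 occurrences of "saw" are followed by "her"
```

Nothing about grammar is stated anywhere in the code: the model only knows which words tend to follow which.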


Not just algorithms

➣ The data is as important as the algorithm

➣ Two areas of development

➢ Open (?) Content

∗ The Web!, Text Corpora, WordNet, Wikipedia, dictionaries, . . .

➢ Open Software

∗ NLTK (python), Gate, DELPH-IN, . . .

➣ Copyright issues are always with us (;_;)


Some Examples

➣ Speech Recognition

➣ Text-to-speech

➣ Segmentation: split strings into words

➣ Part-of-Speech Tagging: nouns, verbs, . . .

➣ Named Entity Recognition

➣ Syntactic Parsing: syntactic trees and dependencies

➣ Word Sense Disambiguation: lexical semantics

➣ Semantic Parsing: structural semantics


Two Examples of Open Source MT

➣ MOSES (http://www.statmt.org/moses/)

➢ Open Source Statistical MT tool kit

Just add bilingual corpus!

➣ LOGON (www.delph-in.net/)

➢ Open Source Knowledge-based MT tool kit

Just add transfer rules!


Statistical Machine Translation?

Basic Idea (Brown et al. 1990)

$$\hat{E} = \operatorname*{argmax}_{E} P(E \mid J) = \operatorname*{argmax}_{E} P(E)\, P(J \mid E)$$

➣ Translation Model: P(J|E)

➣ Language Model: P(E)

➣ Decoder: Japanese J → argmax_E P(E) P(J|E) → English E
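A minimal sketch of the decoding rule, assuming a tiny hand-written candidate list: pick the English sentence E that maximises P(E) P(J|E). All probabilities are invented for illustration, and a real decoder such as Moses searches a very large space of candidates rather than enumerating them.

```python
# All probabilities below are invented for illustration.
LANGUAGE_MODEL = {                       # P(E): how fluent is the English?
    "Why did he do that?": 2e-2,
    "Why did he did such a thing?": 1e-4,
}
TRANSLATION_MODEL = {                    # P(J | E): how well does E explain J?
    ("彼はなぜそんなことをしたのか。", "Why did he do that?"): 0.3,
    ("彼はなぜそんなことをしたのか。", "Why did he did such a thing?"): 0.5,
}


def decode(japanese, candidates):
    """Return the candidate E that maximises P(E) * P(J | E)."""
    def score(english):
        p_e = LANGUAGE_MODEL.get(english, 0.0)
        p_j_given_e = TRANSLATION_MODEL.get((japanese, english), 0.0)
        return p_e * p_j_given_e
    return max(candidates, key=score)


best = decode("彼はなぜそんなことをしたのか。", list(LANGUAGE_MODEL))
print(best)  # Why did he do that?
```

With these made-up numbers the language model outweighs the slightly higher translation score of the ungrammatical candidate, which is the point of splitting the problem into two models.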


Translation Model (IBM Model 4)

$$P(J, A \mid E) = \prod_i n(\phi_i \mid E_i) \cdot \binom{m-\phi_0}{\phi_0}\, p_0^{m-2\phi_0}\, p_1^{\phi_0} \cdot \prod_j t(J_j \mid E_{A_j}) \cdot \prod d_1(j-k \mid A(E_i)\, B(J_j)) \prod d_{>1}(j-j' \mid B(J_j))$$

could you recommend another hotel

↓ Fertility Model: ∏ n(φ_i|E_i)

could could recommend another another hotel

↓ NULL Generation Model: C(m−φ_0, φ_0) p_0^(m−2φ_0) p_1^(φ_0)

could could recommend NULL another another hotel NULL

↓ Lexicon Model: ∏ t(J_j|E_Aj)

ていただけ ます 紹介し を 他 の ホテル か

↓ Distortion Model: ∏ d_1(j − k|A(E_i)B(J_j)) ∏ d_>1(j − j′|B(J_j))

他 の ホテル を 紹介し ていただけ ます か

Now with chunks (another hotel ↔ 他 の ホテル)!
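The four steps of the generative story can be replayed on the slide's example with a short sketch. This is only an illustration: the real model assigns a probability to every choice (n, p_0/p_1, t, d), whereas here the fertilities, NULL positions, translations and word order are fixed by hand, and the function and parameter names are hypothetical.

```python
def generate(english, fertility, null_positions, lexicon, order):
    """Replay fertility, NULL generation, lexicon and distortion steps."""
    # 1. Fertility: copy each English word phi times (phi = 0 drops it).
    #    Assumes each English word occurs once, so it can key the dict.
    copied = [w for w in english for _ in range(fertility[w])]
    # 2. NULL generation: insert NULL tokens at the chosen positions.
    with_null = list(copied)
    for position in sorted(null_positions):
        with_null.insert(position, "NULL")
    # 3. Lexicon: the k-th copy of a word takes its k-th listed translation.
    seen = {}
    translated = []
    for w in with_null:
        k = seen.get(w, 0)
        translated.append(lexicon[w][k])
        seen[w] = k + 1
    # 4. Distortion: reorder into target-language word order.
    return [translated[i] for i in order]


result = generate(
    english=["could", "you", "recommend", "another", "hotel"],
    fertility={"could": 2, "you": 0, "recommend": 1, "another": 2, "hotel": 1},
    null_positions=[3, 7],
    lexicon={
        "could": ["ていただけ", "ます"],
        "recommend": ["紹介し"],
        "NULL": ["を", "か"],
        "another": ["他", "の"],
        "hotel": ["ホテル"],
    },
    order=[4, 5, 6, 3, 2, 0, 1, 7],
)
print(" ".join(result))  # 他 の ホテル を 紹介し ていただけ ます か
```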


Knowledge-based MT

Source Text → Source Analysis (JACY) → MRS_S → Semantic Transfer → MRS_T → Target Generation (ERG) → Target Text

(with Stochastic Model(s))

➣ From text to meaning and back again

➢ Grammars for Japanese and English

➢ Stochastic models to choose interpretations

➢ Brittle but powerful


Some Examples

Source 私はいやいやその仕事をした。

Ref I did the work against my will.

Moses I did the work against his will.

JaEn I did that work unwillingly.

Source バイオリンの音色はとても美しい。

Ref The sound of the violin is very sweet.

Moses The violin 音色 is very beautiful .

JaEn Really, the violin timbers are beautiful.

Source メイドはテーブルにナイフとフォークを並べた。

Ref The maid arranged the knives and forks on the table.

Moses The maid on the table arranged the knives and forks.

JaEn The maid set up the fork with the knife in the table.


Source その銀行はここから遠いですか。

Ref Is there bank far from here?

Moses The bank is a long way from here?

JaEn Is that bank distant from here?

Source シェークスピアに匹敵する劇作家はいない。

Ref No dramatist can compare with Shakespeare.

Moses Shakespeare is quite equal to a dramatist. (no no)

JaEn A playwright, that matches in Shie-kusupia, doesn’t live.

Source 彼はなぜそんなことをしたのか。

Ref Why did he do that?

Moses Why did he did such a thing?

JaEn Why did he do that business?


Why Open?

➣ NLP needs serious resources

➢ They cannot be built and maintained by a single group

➢ Open source is a very practical way of achieving flexible multi-group collaboration

➣ NLP needs standards and historically the successful ones have been created bottom-up.

➣ Seeing one’s work used by other groups is very rewarding.

➣ People are generally enthusiastic about contributing to widely used work.


➣ Making resources open source removes difficulties in distributing work or in continuing work at another institution.

➣ Researchers are evaluated by the impact that their work has: Open Source work generally has more impact.

➣ Research should be open in principle:

“. . . the principle of openness in research - the principle of freedom of access by all interested persons to the underlying data, to the processes, and to the final results of research - is one of overriding importance.”

Openness in Research (Stanford, Research Policy Handbook 2.6)

Not just the warm inner glow

NLP by regexp

Bilingual Dictionaries from mainly monolingual text!

➣ Fully Bracketed Examples

➢ 「収穫逓減の法則(the law of diminishing return)」

➣ Partly Bracketed Examples

➢ 図1に,明瞭性 (Clarity)・新奇性 (Novelty)

➣ Over a million pairs from the Japanese Web corpus

➢ Not yet released (copyright again)
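A minimal sketch of the idea, assuming a single regular expression that pairs a run of Japanese characters with an immediately following parenthesised English gloss. The pattern and the function name are assumptions for this example, not the expressions actually used to build the million-pair list.

```python
import re

# Hiragana, katakana (without the middle dot ・), prolonged-sound mark, kanji.
JAPANESE = r"[\u3041-\u3096\u30a1-\u30fa\u30fc\u4e00-\u9fff]+"
# English gloss inside ASCII or full-width parentheses.
ENGLISH = r"[(（]([A-Za-z][A-Za-z \-]*)[)）]"
PAIR = re.compile("(" + JAPANESE + r")\s*" + ENGLISH)


def extract_pairs(text):
    """Return (japanese, english) pairs found in mostly monolingual text."""
    return [(ja, en.strip()) for ja, en in PAIR.findall(text)]


print(extract_pairs("「収穫逓減の法則(the law of diminishing return)」"))
print(extract_pairs("図1に,明瞭性 (Clarity)・新奇性 (Novelty)"))
```

The fully bracketed case is easy because 「 and 」 mark where the Japanese term starts. In the partly bracketed case this sketch simply takes the longest run of Japanese characters before the parenthesis, which will sometimes include too much.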

It’s fun

The ultimate goal

➣ NLP is fairly wide in scope

➣ We want to know everything about everything and how it fits together

➢ The best source of knowledge we have is still text

➢ Replace human bandwidth with machine bandwidth

➢ Process, refine, reprocess

➣ Need both technical and social approaches

➢ Linguistic Analysis

➢ Machine Learning

➢ User Generated Content

Mad Scientists of the World Unite

Closing

➣ There are many great open source NLP tools

➢ the bleeding edge is mainly open source

If you want to know more

Or even better want to play with them

Or best of all develop them

⇒ Say hello: (especially PhD candidates)

[email protected]

And now, the end is near

Another Example of the Problem

(9) Everyone gets a little of Cucumber’s ♥.

➣ Lexical gaps: Cucumber (name)

➣ Lexical gaps: ♥ (noun – we have it as verb: I ♥ NY)

➣ How to model ambiguity

➢ Cucumber is deliberately ambiguous here

∗ research shows rude jokes are funnier

∗ can we model this?

Topical Example

Solutions

➣ Morphological analysis should guess the POS

➢ Based on two to three words of previous context and a large learned lexicon and model

➢ This allows us to parse

➢ Actually there are issues with ♥ (words are [a-z -]+)

➣ Recognizing “Cucumber” as software

Cucumber is a tool that can execute . . .

➣ Linking ♥ to love: ♥n → ♥v (v2n derivational rule)

➣ Scaling is the problem
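The issue with ♥ noted above can be reproduced in a few lines: a word pattern of [a-z -]+ simply cannot match the symbol. The extended pattern is a hypothetical workaround for this sketch, not the fix used in any particular system.

```python
import re

WORD = re.compile(r"[a-z -]+")               # the word pattern quoted on the slide
WORD_OR_SYMBOL = re.compile(r"[a-z -]+|♥")   # hypothetical extension


def is_word(token, pattern=WORD):
    """True if the whole (lower-cased) token matches the pattern."""
    return pattern.fullmatch(token.lower()) is not None


print(is_word("Cucumber"))            # True
print(is_word("♥"))                   # False
print(is_word("♥", WORD_OR_SYMBOL))   # True
```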

Feel free to use these slides or extracts from them for any purpose at all, Francis Bond 2009-08-22.