LING/C SC 581: Advanced Computational Linguistics, Lecture Notes, Jan 15th


LING/C SC 581: Advanced Computational Linguistics

Lecture Notes, Jan 15th

Course

• Webpage for lecture slides and Panopto recordings:
  – http://dingo.sbs.arizona.edu/~sandiway/ling581-15/

• Meeting information

Course Objectives

• Follow-on course to LING/C SC/PSYC 438/538 Computational Linguistics:
  – continue with selected material from the 538 textbook (J&M):
    • 25 chapters, a lot of material not covered in 438/538
• And gain more extensive experience
  – with new stuff not in textbook
  – dealing with natural language software packages
  – installation, input data formatting
  – operation
  – project exercises
  – useful “real-world” computational experience
  – abilities gained will be of value to employers

Computational Facilities

• Use your own laptop/desktop
  – can also make use of the computers in this lab (Shantz 338)
    • but you don’t have installation rights on these computers
    • plus the alarm goes off after hours and campus police will arrive…
• Platforms: Windows is maybe possible but you really should run some variant of Unix… (for your task #1 for this week)
  – Linux (separate bootable partition or via virtualization software)
    • de facto standard for advanced/research software
    • https://www.virtualbox.org/ (free!)
  – Cygwin on Windows
    • http://www.cygwin.com/
    • Linux-like environment for Windows making it possible to port software running on POSIX systems (such as Linux, BSD, and Unix systems) to Windows.
  – OSX
    • Not quite Linux, some porting issues, especially with C programs; can use VirtualBox (Linux under OSX)

Grading

• Completion of all homework tasks will result in a satisfactory grade (A)

• Tasks should be completed before the next class.
  – email me your work ([email protected]).
  – also be prepared to come up and present your work (if called upon).

Today's Topics

• Homework Task 1: Install tregex

• Minimum Edit Distance

Homework Task 1: Install Tregex

Computer language: Java

• http://nlp.stanford.edu/software/tregex.shtml
• (538: Perl regex on strings)
• 581: regex for trees …

Homework Task 1: Install Tregex

• We’ll use the program tregex from Stanford University to explore the Penn Treebank
  – current version:

Penn Treebank

• Availability
  – Source:
    • Linguistic Data Consortium (LDC)
    • U. of Arizona is a (fee-paying) member of this consortium
    • Resources are made available to the community through the main library
    • URL
      – http://sabio.library.arizona.edu/search/X

Penn Treebank (V3)

• Call Record

Have it on a USB drive here that I will pass around: TREEBANK_3.zip (65.2MB)

Penn Treebank (V3)

• Raw data:

tregex

• Tregex is a Tgrep2-style utility for matching patterns in trees.

• written in Java
• run-tregex-gui.command shell script
• -mx flag: the 300m default memory size may need to be increased depending on the platform

tregex

• Select the PTB directory
  – TREEBANK_3/parsed/mrg/wsj/
• Browse
• Deselect any unwanted files

Part 2

• Minimum Edit Distance
• Textbook: section 3.11


Minimum Edit Distance

• general string comparison
• edit operations are insertion, deletion and substitution
• not just limited to distance defined by a single operation away
• we can ask how different string a is from string b by the minimum edit distance


Minimum Edit Distance

• applications
  – could be used for multi-typo correction
  – used in Machine Translation Evaluation (MTEval)
  – example
    • Source: 生産工程改善について
    • Translations:
      • (Standard) For improvement of the production process
      • (MT-A) About a production process betterment
      • (MT-B) About the production process improvement
    • method
      – compute edit distance between MT-A and Standard and MT-B and Standard in terms of word insertion/substitution etc.
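The method above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used by any particular MT evaluation tool: it assumes whitespace tokenization and unit cost for word insertion, deletion and substitution, and the function name `word_edit_distance` is made up for this example.

```python
def word_edit_distance(candidate, reference):
    """Unit-cost edit distance counted over words rather than letters."""
    cand, ref = candidate.split(), reference.split()
    # table[i][j] = cost of turning the first i candidate words
    # into the first j reference words
    table = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i in range(len(cand) + 1):
        table[i][0] = i          # delete i words
    for j in range(len(ref) + 1):
        table[0][j] = j          # insert j words
    for i in range(1, len(cand) + 1):
        for j in range(1, len(ref) + 1):
            same = cand[i - 1] == ref[j - 1]
            substitute = table[i - 1][j - 1] + (0 if same else 1)
            table[i][j] = min(table[i - 1][j] + 1,   # delete a word
                              table[i][j - 1] + 1,   # insert a word
                              substitute)
    return table[-1][-1]


standard = "For improvement of the production process"
mt_a = "About a production process betterment"
mt_b = "About the production process improvement"

print("MT-A vs Standard:", word_edit_distance(mt_a, standard))
print("MT-B vs Standard:", word_edit_distance(mt_b, standard))
```

Under these assumptions the translation with the smaller distance is the one closer to the Standard translation in word choice and order.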


Minimum Edit Distance

• cost models
  – Levenshtein
    • insertion, deletion and substitution all have unit cost
  – Levenshtein (alternate)
    • insertion, deletion have unit cost
    • substitution is twice as expensive
    • substitution = one insert followed by one delete
  – Typewriter
    • insertion, deletion and substitution all have unit cost
    • modified by key proximity
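One way to see how the three cost models differ is to pass the substitution cost in as a function. The sketch below is only illustrative: the function names are invented here, and the typewriter model in particular is an assumption, since the slide does not say how key proximity changes the cost. Here a substitution between two adjacent keys is arbitrarily priced at 0.5, using a tiny hand-picked sample of neighboring QWERTY keys.

```python
def edit_distance(s, t, sub_cost, ins_cost=1, del_cost=1):
    """Minimum edit distance; sub_cost(a, b) prices replacing a with b."""
    # prev holds the previous table row; start with the row for empty s
    prev = [j * ins_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * del_cost]                 # first i letters of s -> empty
        for j, b in enumerate(t, start=1):
            diag = prev[j - 1] + (0 if a == b else sub_cost(a, b))
            curr.append(min(prev[j] + del_cost,      # delete a
                            curr[j - 1] + ins_cost,  # insert b
                            diag))                   # match or substitute
        prev = curr
    return prev[-1]


def levenshtein(a, b):
    return 1            # substitution has unit cost


def levenshtein_alternate(a, b):
    return 2            # substitution = one insert followed by one delete


# Hypothetical sample of adjacent keys; a real model would cover the keyboard.
ADJACENT_KEYS = {frozenset("as"), frozenset("sd"), frozenset("qw"), frozenset("we")}


def typewriter(a, b):
    # Assumption: neighboring keys are cheaper to confuse (0.5 instead of 1).
    return 0.5 if frozenset((a, b)) in ADJACENT_KEYS else 1


print(edit_distance("leader", "adapter", levenshtein))
print(edit_distance("leader", "adapter", levenshtein_alternate))
print(edit_distance("cat", "cst", typewriter))
```

Keeping the cost model separate from the table-filling loop means the same code serves all three models.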

Minimum Edit Distance

• Dynamic Programming
  – divide-and-conquer
    • to solve a problem we divide it into sub-problems
  – sub-problems may be repeated
    • don’t want to re-solve a sub-problem the 2nd time around
  – idea: put solutions to sub-problems in a table
    • and just look up the solution 2nd time around, thereby saving time
    • memoization

we’ll use a spreadsheet…
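The memoization idea can also be sketched directly in code. The version below is a minimal illustration, assuming insertion and deletion cost 1 and substitution cost 2 (the function name and the `sub_cost` parameter are choices made for this example). Python's `functools.lru_cache` plays the role of the table: each sub-problem (i, j) is solved once and looked up afterwards.

```python
from functools import lru_cache


def min_edit_distance(s, t, sub_cost=2):
    """Top-down minimum edit distance with memoized sub-problems."""

    @lru_cache(maxsize=None)          # the "table" of solved sub-problems
    def cost(i, j):
        # cost of transforming the first i letters of s into the first j of t
        if i == 0:
            return j                  # insert j letters
        if j == 0:
            return i                  # delete i letters
        diagonal = cost(i - 1, j - 1) + (0 if s[i - 1] == t[j - 1] else sub_cost)
        return min(cost(i - 1, j) + 1,    # delete s[i-1]
                   cost(i, j - 1) + 1,    # insert t[j-1]
                   diagonal)              # match or substitute

    return cost(len(s), len(t))


print(min_edit_distance("leader", "adapter"))
print(min_edit_distance("leader", "adapter", sub_cost=1))
```

Without the cache the same (i, j) pairs would be recomputed many times; with it, the work is bounded by the number of table cells.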

Minimum Edit Distance

• Consider a simple case: xy ⇄ yx
• Minimum # of operations:
  – insert and delete
  – cost = 2
• Minimum # of operations:
  – swap
  – cost = ?
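One possible answer to the swap question is to treat the swap of two adjacent letters as a fourth operation with its own cost. The sketch below does this in the style known as optimal string alignment (a restricted form of Damerau-Levenshtein distance). It is an illustration only: the slides do not fix the swap cost, so `swap_cost` is a free parameter here, and the other operations are assumed to have unit cost.

```python
def distance_with_swap(s, t, swap_cost=1):
    """Edit distance where swapping two adjacent letters is one operation."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            substitute = d[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else 1)
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, substitute)
            # last two letters of each prefix are the same pair, reversed
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap_cost)
    return d[-1][-1]


print(distance_with_swap("xy", "yx"))               # swap priced at 1
print(distance_with_swap("xy", "yx", swap_cost=3))  # swap too expensive to use
```

If the swap is priced above 2, the insert-and-delete route wins and the swap is never chosen.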

Minimum Edit Distance

• Generally

Minimum Edit Distance

• Programming Practice: could be easily implemented in Perl


Minimum Edit Distance Computation

• Or in Microsoft Excel, file: eds.xls (on course webpage)

• $ in a cell reference means don’t change when copied from cell to cell
  – e.g. in C$1, 1 stays the same
  – in $A3, A stays the same

Minimum Edit Distance

• Task: transform string s1..si into string t1..tj
  – each sn and tn are letters
  – string s is of length i, t is of length j
• Example:
  – s = leader, t = adapter
  – i = 6, j = 7
  – Let’s say you’re allowed just three operations: (1) delete a letter, (2) insert a letter, or (3) substitute a letter for another letter
  – What is one possible way to generate t from s?

Minimum Edit Distance

• Example:
  – s = leader, t = adapter
  – What is one possible way to generate t from s?
  – leader
  – ↕︎ ↕︎
  – adapter
  – cost is 2 deletes and 3 inserts, total 5 operations
  – Question: is this the minimum possible?

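• Simplest method: delete every letter of leader, then insert every letter of adapter
  – leader → leade → lead → lea → le → l → (empty) → a → ad → ada → adap → adapt → adapte → adapter
  – cost: 13 operations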
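The simplest method gives an upper bound on the cost: the length of s plus the length of t. A short sketch (the function name is made up for this example) that lists the intermediate strings and counts the operations:

```python
def naive_edit_script(s, t):
    """Delete s letter by letter, then build t letter by letter."""
    steps = [s]
    for i in range(len(s) - 1, -1, -1):   # 6 deletions for "leader"
        steps.append(s[:i])
    for j in range(1, len(t) + 1):        # 7 insertions for "adapter"
        steps.append(t[:j])
    return steps


steps = naive_edit_script("leader", "adapter")
print(" -> ".join(step or "(empty)" for step in steps))
print("operations:", len(steps) - 1)
```

Any cleverer method only has to beat this bound, which is what the table-based computation does.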

Minimum Edit Distance

• Set up a table with the letters of s = leader down the rows (1 to 6) and the letters of t = adapter across the columns (1 to 7); row 0 and column 0 stand for the empty string

        0  1  2  3  4  5  6  7
           a  d  a  p  t  e  r
  0
  1  l
  2  e
  3  a
  4  d
  5  e
  6  r

• cell (2,3): cost of transforming le into ada
• cell (6,7): cost of transforming leader into adapter
• cell (3,0): cost of transforming lea into (empty)
• cell (0,4): cost of transforming (empty) into adap

Minimum Edit Distance

• cell (5,6): cost of transforming leade into adapte; call it k
• cell (6,7): also k, because the final r of leader matches the final r of adapter, so the diagonal step from cell (5,6) adds no cost

Minimum Edit Distance

• Three ways to reach cell (2,4), the cost of transforming le into adap, in one step:
  – from cell (2,3), le into ada at cost k: insert p, giving k+1
  – from cell (1,4), l into adap at cost k: delete e, giving k+1
  – from cell (1,3), l into ada at cost k: substitute p for e, giving k+2, assuming the cost of swapping e for p is 2
• In general the three neighbors hold different costs:

        0  1  2  3     4
           a  d  a     p
  1  l           k1,3  k1,4
  2  e           k2,3  ?

• cell (2,4): minimum of the three costs to get here in one step
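The "minimum of three costs" rule can be written as a recurrence. The notation D(i,j) for the value in cell (i,j) is introduced here for convenience and is not from the slides; the costs assume insertion and deletion cost 1 and substitution cost 2, as in the e-for-p example.

```latex
\begin{aligned}
D(i,0) &= i, \qquad D(0,j) = j,\\
D(i,j) &= \min
\begin{cases}
D(i-1,j) + 1 & \text{(delete } s_i\text{)}\\
D(i,j-1) + 1 & \text{(insert } t_j\text{)}\\
D(i-1,j-1) + c(s_i,t_j) & \text{(match or substitute)}
\end{cases}\\
c(s_i,t_j) &=
\begin{cases}
0 & \text{if } s_i = t_j\\
2 & \text{otherwise}
\end{cases}
\end{aligned}
```

For cell (2,4) this reads D(2,4) = min(D(1,4)+1, D(2,3)+1, D(1,3)+2), since e and p differ.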

Minimum Edit Distance

• Fill in column 0 first: cell (i,0) is the cost of transforming the first i letters of leader into (empty), e.g. cell (3,0) for lea
• cell (0,0) = 0
• cost of le = cost of l, plus the cost of deleting the e, so each cell is one more than the cell above it

        0  1  2  3  4  5  6  7
           a  d  a  p  t  e  r
  0     0
  1  l  1
  2  e  2
  3  a  3
  4  d  4
  5  e  5
  6  r  6

Minimum Edit Distance

• Fill in row 0 the same way: cell (0,j) = j, the cost of inserting the first j letters of adapter

        0  1  2  3  4  5  6  7
           a  d  a  p  t  e  r
  0     0  1  2  3  4  5  6  7
  1  l  1
  2  e  2
  3  a  3
  4  d  4
  5  e  5
  6  r  6

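Minimum Edit Distance

• With row 0 and column 0 in place, every other cell is filled from its three neighbors: the cell above, the cell to the left, and the cell diagonally above-left
• cell (1,1), l into a: each of the three routes costs 2 (1+1 from above, 1+1 from the left, 0+2 on the diagonal with substitution cost 2), so the value is 2

        0  1  2  3  4  5  6  7
           a  d  a  p  t  e  r
  0     0  1  2  3  4  5  6  7
  1  l  1  2
  2  e  2
  3  a  3
  4  d  4
  5  e  5
  6  r  6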

Minimum Edit Distance

• Suppose cell (4,5) = 5, cell (4,6) = 6 and cell (5,5) = 6
• cell (5,6), leade into adapte: the final e matches e, so the diagonal step from cell (4,5) adds no cost, giving 5

Minimum Edit Distance

• Suppose cell (2,4) = 6, cell (2,5) = 7 and cell (3,4) = 5
• cell (3,5), lea into adapt: a and t differ, so the cheapest route is from cell (3,4) plus one insertion, giving 6

Minimum Edit Distance

• Suppose cell (5,5) = 6, cell (5,6) = 5 and cell (6,5) = 7
• cell (6,6), leader into adapte: r and e differ, so the cheapest route is from cell (5,6) plus one deletion, giving 6
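The whole table for leader and adapter can be filled in the same cell-by-cell way. The sketch below is one possible implementation, assuming insertion and deletion cost 1 and substitution cost 2 as in the worked cells above; the function names `edit_table` and `show` are made up for this example.

```python
def edit_table(s, t, sub_cost=2):
    """Return the full table; table[i][j] is the value of cell (i,j)."""
    table = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        table[i][0] = i                       # column 0: delete i letters
    for j in range(1, len(t) + 1):
        table[0][j] = j                       # row 0: insert j letters
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            diagonal = table[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else sub_cost)
            table[i][j] = min(table[i - 1][j] + 1,   # from above: delete
                              table[i][j - 1] + 1,   # from the left: insert
                              diagonal)              # match or substitute
    return table


def show(s, t, table):
    """Print the table with s down the side and t across the top."""
    print(" " * 6 + "  ".join(t))
    for i, row in enumerate(table):
        label = s[i - 1] if i else " "
        print(label + " " + " ".join(f"{value:2d}" for value in row))


table = edit_table("leader", "adapter")
show("leader", "adapter", table)
print("minimum edit distance:", table[6][7])
```

The bottom-right cell, cell (6,7), holds the cost of transforming leader into adapter, which can be compared with the 5-operation solution found by hand.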