lecture 3: spell checkers, n-grams - university of pittsburghnaraehan/ling1330/lecture3.pdf ·...


Page 1: Lecture 3: Spell checkers, n-grams - University of Pittsburghnaraehan/ling1330/Lecture3.pdf · Spell checkers 1/26/2017 3 What did you find? Which spell checkers work well, which

Lecture 3: Spell Checking in Context: n-gram Models

LING 1330/2330: Introduction to Computational Linguistics

Na-Rae Han


Objectives

Context-aware spell checkers

n-gram as context

Character-level n-grams

Word-level n-grams

Frequent n-grams in English

1/26/2017 2


Spell checkers


What did you find?

Which spell checkers work well, which don’t? In what way?

Anything else you noticed?


Samsung Keypad's candidates for what this word is or would become

What I typed in so far

Samsung Keypad's top pick (correction)


What do you think about the ranking?


typo

correction

correction candidates.

Edit Distance in action


typo

correction

correction candidates.

Edit Distance in action
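
The slides name edit distance but do not say which variant or weighting the keyboard uses. As a rough sketch, assuming plain Levenshtein distance (insertions, deletions and substitutions each cost 1), candidates could be ranked as below. The function names and the tiny vocabulary are made up for illustration and are not Samsung Keypad's actual method.

```python
def edit_distance(source, target):
    """Levenshtein distance: fewest single-character insertions,
    deletions and substitutions needed to turn source into target."""
    # prev[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]  # turning i source characters into "" takes i deletions
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(prev[j] + 1,          # delete s_ch
                            curr[j - 1] + 1,      # insert t_ch
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank_candidates(typo, vocabulary):
    """Sort candidate corrections by distance to the typo (ties: alphabetical)."""
    return sorted(vocabulary, key=lambda word: (edit_distance(typo, word), word))


# Hypothetical vocabulary, for illustration only.
print(rank_candidates("teh", ["the", "ten", "tech"]))
```

Note that plain Levenshtein distance treats a swap of two adjacent letters ("teh" for "the") as two edits, so distance alone may not put the intended word first.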

What type of spell checker is this?

Real-time: generates candidates with each character input
    cf. MS Word: operates on incoming words (responds after a space or Enter)

Look-ahead mode: two goals, spell correction and saving typing
    Predicts whole words based on the first few characters

Aggressive: auto-corrects without user confirmation

Context-aware: considers preceding characters

Probability-based: ranks candidates based on likelihood
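
To make "look-ahead" and "probability-based" concrete, here is a minimal sketch that completes a typed prefix and ranks the completions by word frequency. The word counts are invented toy numbers, and `complete` is a hypothetical helper; how the keyboard really scores candidates is not stated in the slides.

```python
# Toy word frequencies, invented for this example.
TOY_COUNTS = {"the": 500, "this": 200, "they": 120, "then": 90, "theory": 15}


def complete(prefix, word_counts, k=3):
    """Return up to k words starting with prefix, most frequent first."""
    matches = [word for word in word_counts if word.startswith(prefix)]
    # Sort by descending count; break ties alphabetically.
    matches.sort(key=lambda word: (-word_counts[word], word))
    return matches[:k]


print(complete("the", TOY_COUNTS))
print(complete("th", TOY_COUNTS, k=2))
```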

Larger context?

So Samsung Keypad considers preceding characters as context.

Does it consider preceding words as well?

"Are you"

"Is you"

Same candidates, same top pick. Does NOT take preceding words into consideration.

This would have been a better pick following "is"


MS Word considers word contexts


suggests "form" → "from"

suggests "your" → "you're" instead of "from" → "form"


n-grams: character-level

n-gram: a stretch of text n units long

unigrams (1), bigrams (2), trigrams (3), 4-grams, 5-grams, …

'green ideas'

Character unigrams: ['g', 'r', 'e', 'e', 'n', ' ', 'i', 'd', 'e', 'a', 's']

Character bigrams: ['gr', 're', 'ee', 'en', 'n ', ' i', 'id', 'de', 'ea', 'as']

Character trigrams: ['gre', 'ree', 'een', 'en ', 'n i', ' id', 'ide', 'dea', 'eas']

Character 4-grams: ['gree', 'reen', 'een ', 'en i', 'n id', ' ide', 'idea', 'deas']
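
The lists above can be reproduced with a sliding window over the string. This is a minimal sketch; `char_ngrams` is a name chosen here, not something from the lecture.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters in text, left to right."""
    # A string of length L has L - n + 1 windows of size n (none if n > L).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


for n in (1, 2, 3, 4):
    print(n, char_ngrams("green ideas", n))
```

Spaces count as characters, which is why 'n ' and ' i' show up among the bigrams.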


n-grams: word-level


n-gram: a stretch of text n units long

unigrams (1), bigrams (2), trigrams (3), 4-grams, 5-grams, …

'Colorless green ideas sleep furiously.'

Word bigrams: [('colorless', 'green'), ('green', 'ideas'), ('ideas', 'sleep'), ('sleep', 'furiously'), ('furiously', '.')]

Word trigrams: [('colorless', 'green', 'ideas'), ('green', 'ideas', 'sleep'), ('ideas', 'sleep', 'furiously'), ('sleep', 'furiously', '.')]
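
Word n-grams work the same way once the text is split into tokens. The sketch below assumes a very simple tokenizer (lowercase, words and punctuation marks as separate tokens), which happens to reproduce the lists above; a real tokenizer would handle more cases. Both function names are chosen for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("Colorless green ideas sleep furiously.")
print(word_ngrams(tokens, 2))
print(word_ngrams(tokens, 3))
```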


n-grams and probability


How likely do you think these letter bigrams are in English:

'th' 'ti' 'tb' 'tq' 'tx'

Putting it in terms of conditional probability:

After a user has typed the letter 't', what is the most likely next character?

How about after 'q'? After 'io'?
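
One way to answer these questions is to estimate P(next character | context) by relative frequency: count how often each character follows the context and divide by the number of times the context is followed by anything. The sketch below does this on a made-up sample string; real estimates would need a large corpus, and `next_char_probs` is a hypothetical helper name.

```python
from collections import Counter


def next_char_probs(text, context):
    """Estimate P(next char | context) from relative frequencies in text."""
    n = len(context)
    # Collect the character that follows each occurrence of the context.
    followers = Counter(
        text[i + n] for i in range(len(text) - n) if text[i:i + n] == context
    )
    total = sum(followers.values())
    return {ch: count / total for ch, count in followers.items()}


# Made-up sample text, for illustration only.
SAMPLE = "the thin thief took ten tickets"
probs = next_char_probs(SAMPLE, "t")
print(max(probs, key=probs.get), probs)
```

The same function accepts longer contexts such as 'io', since the context can be any string.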

For fun:

What are the most frequent English letter bigrams?

th, he, in, er, an, re, nd, on, en, at

Trigrams?

the, and, ing, her, hat, his, tha, ere, for, ent
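
Frequency lists like these come from counting n-grams over a large corpus. A minimal sketch with `collections.Counter` follows. It assumes n-grams are counted inside words only (no n-grams spanning a space); the slides do not say how their lists were compiled, so results on real text may differ from the lists above.

```python
from collections import Counter


def top_char_ngrams(text, n, k=10):
    """Return the k most frequent within-word character n-grams in text."""
    # Keep letters and spaces only, so punctuation never enters an n-gram.
    cleaned = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    counts = Counter()
    for word in cleaned.split():
        counts.update(word[i:i + n] for i in range(len(word) - n + 1))
    return counts.most_common(k)


print(top_char_ngrams("The other day, then there was the thing.", 2, k=3))
```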

Word-level n-grams

How likely do you think these n-grams are in English:

are you         46622
is you          4441
are you so      428
are you also    26
are you does    -

Putting it in terms of conditional probability:

After a user types in 'are you', what is the most likely next word?

How about 'in the'? 'in the middle'?
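
The counts on this slide are enough for a small worked example: P(word | context) can be estimated as count(context + word) / count(context). The sketch below assumes the counts pair up with the n-grams in the order listed, that all counts come from the same corpus, and that the "-" for 'are you does' means no occurrences. `conditional_prob` is a name chosen here.

```python
# Counts as listed on the slide ('are you does' has no count and is left out).
COUNTS = {
    ("are", "you"): 46622,
    ("is", "you"): 4441,
    ("are", "you", "so"): 428,
    ("are", "you", "also"): 26,
}


def conditional_prob(context, word, counts):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    context_count = counts.get(context, 0)
    if context_count == 0:
        return 0.0  # unseen context: no estimate available, report 0
    return counts.get(context + (word,), 0) / context_count


for word in ("so", "also", "does"):
    print(word, conditional_prob(("are", "you"), word, COUNTS))
```

With these numbers, 'so' follows 'are you' a little under 1% of the time (428 / 46622), 'also' far less often, and 'does' gets probability 0.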

For fun: most frequent bigrams?

2551888 of the
1887475 in the
1041011 to the
861798 on the
676658 and the
648408 to be
578806 for the
561171 at the
498217 in a
479627 do n't
455367 with the
451460 from the
443547 of a
395939 that the
362176 is a
361879 going to
335255 by the
330828 as a
319846 with a
317431 I think

Source: http://www.ngrams.info/download_coca.asp
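
A table like this is what a word-context-aware checker could consult when ranking the next word. As a sketch, the code below takes a handful of the bigram counts listed above and orders the possible next words after a given word. Only bigrams that appear in this top-20 list are included, so the ranking is relative to those entries and is not a full probability estimate. `rank_next_words` is a hypothetical helper.

```python
# A subset of the bigram counts listed on this slide.
BIGRAM_COUNTS = {
    ("of", "the"): 2551888, ("of", "a"): 443547,
    ("in", "the"): 1887475, ("in", "a"): 498217,
    ("to", "the"): 1041011, ("to", "be"): 648408,
    ("with", "the"): 455367, ("with", "a"): 319846,
}


def rank_next_words(prev_word, bigram_counts):
    """Order the known followers of prev_word by bigram count, highest first."""
    followers = {
        w2: count for (w1, w2), count in bigram_counts.items() if w1 == prev_word
    }
    return sorted(followers, key=followers.get, reverse=True)


print(rank_next_words("to", BIGRAM_COUNTS))
print(rank_next_words("with", BIGRAM_COUNTS))
```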

Most frequent trigrams?

198630 I do n't
140305 one of the
129406 a lot of
117289 the United States
79825 do n't know
76782 out of the
75015 as well as
73540 going to be
61373 I did n't
61132 to be a

Source: http://www.ngrams.info/download_coca.asp

4-grams? 5-grams?

4-grams:
54647 I do n't know
43766 I do n't think
33975 in the United States
29848 the end of the
27176 do n't want to

5-grams:
12663 I do n't want to
10663 at the end of the
8484 in the middle of the
8038 I do n't know what
6446 I do n't know if

Source: http://www.ngrams.info/download_coca.asp