Natural Language Processing · robotics.cs.tamu.edu/dshell/cs420/nlp.pdf · 2019-12-01

Post on 13-Aug-2020

Natural Language Processing

Nov 19, 2019

Outline

• Introduction
• Regular Expression: Notations, Examples
• Text Classification: Naive Bayes, Example
• Language Modelling: Unigram, Bigram and N-gram

Who wrote the Federalist Papers?

1787-8: anonymous essays by Jay, Madison, and Hamilton try to convince New York to ratify the U.S. Constitution.

Authorship of 12 of the letters is in dispute.

1963: solved by Mosteller and Wallace using Bayesian methods.

By the end of this lecture we will see how to do that.

What makes it hard?

Formal languages are unambiguous.

Natural languages are ambiguous:

“He saw her duck.”
“Time flies like an arrow. Fruit flies like a banana.”

By the end of the class

By the end of the class we will see how to do:

1. Text classification, e.g. spam detection, authorship identification.

2. Spell correction, e.g. auto-correct.

3. Word suggestion.

Regular Expressions

A formal language for specifying text strings.

Notations

• Disjunctions with []:

Pattern         Matches
[Ww]oodchuck    woodchuck, Woodchuck
[0123456789]    Any single digit

• Disjunctions with |:

Pattern    Matches
abc|def    Find ‘abc’ or ‘def’.
a|b|ab     Find ‘a’ or ‘b’ or ‘ab’. Example: ‘abc’
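The two kinds of disjunction can be tried out directly. The following is a minimal sketch using Python's standard `re` module; the sample strings are made up for illustration and are not from the lecture.

```python
import re

# [Ww] matches exactly one character: either 'W' or 'w'.
found = re.findall(r"[Ww]oodchuck", "Woodchuck or woodchuck? Not a goodchuck.")

# A bracket listing all ten digits matches any single digit.
digits = re.findall(r"[0123456789]", "CSCE 420")

# | chooses between whole alternatives; Python's re tries them left to right,
# so on the input 'abc' the alternative 'a' wins before 'ab' is considered.
first = re.search(r"a|b|ab", "abc").group()

print(found, digits, first)
```

Note that the left-to-right preference for alternatives is a property of Python's engine; other regular expression tools may prefer the longest match instead.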

Notations

• Ranges:

Pattern    Matches
[A-Z]      An uppercase letter.
[a-z]      A lowercase letter.
[0-9]      A single digit.

• Negation ^ (note: the caret means negation only when it is first inside []):

Pattern    Matches
[^A-Z]     Not an uppercase letter
[^Ss]      Neither ‘S’ nor ‘s’
[^e^]      Neither ‘e’ nor ‘^’
a^b        Search for the pattern ‘a^b’
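A short sketch of ranges and negation in Python's `re` module follows; the sample text is invented for illustration. One difference from the notation above is worth flagging: in Python an unescaped caret outside brackets is always the start anchor, so the literal pattern has to be written `a\^b`.

```python
import re

text = "Sam saw e^x and a^b"

upper = re.findall(r"[A-Z]", text)             # uppercase letters only
not_s = re.findall(r"[^Ss]", "Sass")           # anything except 'S' or 's'
not_e_or_caret = re.findall(r"[^e^]", "e^x")   # the second ^ is a literal caret
literal = re.findall(r"a\^b", text)            # escaped caret outside brackets

# Unescaped, Python reads ^ as "start of string", which can never follow 'a'.
unescaped = re.search(r"a^b", text)

print(upper, not_s, not_e_or_caret, literal, unescaped)
```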

Notations (? * . + ^ $ \)

Symbol    Meaning
?         0 or 1 of the previous character
*         0 or more of the previous character
+         1 or more of the previous character
.         Any character
^         Start anchor
$         End anchor
\         Escape character

Examples

• Pattern: ^[A-Z]
  Which of them are matches? “Class”, “cSCE”, “420”.
  Answer: Class.

• Pattern: ^[^A-Z]
  Which of them are matches? “Class”, “cSCE”, “420”.
  Answer: cSCE, 420.

• Pattern: .$
  Which of them are matches? “end”, “end?”, “end!”, “end.”.
  Answer: all four (end, end?, end!, end.).

• Pattern: \.$
  Which of them are matches? “end”, “end?”, “end!”, “end.”.
  Answer: only “end.”.
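These answers can be checked mechanically. The sketch below assumes Python's `re.search`, which reports a match if the pattern is found anywhere in the string; the helper name `matches` is introduced here for illustration.

```python
import re

def matches(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [s for s in candidates if re.search(pattern, s)]

words = ["Class", "cSCE", "420"]
endings = ["end", "end?", "end!", "end."]

starts_upper = matches(r"^[A-Z]", words)       # first character is uppercase
starts_not_upper = matches(r"^[^A-Z]", words)  # first character is anything else
any_last_char = matches(r".$", endings)        # . accepts any final character
literal_period = matches(r"\.$", endings)      # \. accepts only a real period

print(starts_upper, starts_not_upper, any_last_char, literal_period)
```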

Examples

• Pattern: colou?r
  Which of them are matches? “color”, “colour”, “colouur”.
  Answer: color, colour.

• Pattern: colou+r
  Which of them are matches? “color”, “colour”, “colouur”.
  Answer: colour, colouur.

• Pattern: colou*r
  Which of them are matches? “color”, “colour”, “colouur”.
  Answer: color, colour, colouur.

• Pattern: colou.r
  Which of them are matches? “color”, “colour”, “colouur”.
  Answer: colouur.
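The same check works for the quantifier examples. This sketch assumes Python's `re.fullmatch`, which requires the whole string to match the pattern; the helper name `full_matches` is illustrative.

```python
import re

candidates = ["color", "colour", "colouur"]

def full_matches(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

results = {p: full_matches(p) for p in ("colou?r", "colou+r", "colou*r", "colou.r")}

for pattern, matched in results.items():
    print(pattern, matched)
```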

Example

We need to find instances of “the” in a text.

Pattern                       Result
the                           × misses ‘The’
[Tt]he                        × also matches inside ‘Theology’
[^A-Za-z][Tt]he[^A-Za-z]      requires a non-letter on each side
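The three attempts can be compared on a sample sentence. This is a sketch with an invented sentence, using Python's `re.findall`. The last line uses the word-boundary marker `\b`, which is not part of the notation covered in the lecture; it is shown only because the bracketed pattern cannot match a "The" that sits at the very start or end of the text.

```python
import re

text = "The other day the theology class met; then the bell rang."

plain = re.findall(r"the", text)             # misses 'The', hits 'other', 'then'
either_case = re.findall(r"[Tt]he", text)    # finds 'The' but still hits 'theology'
bounded = re.findall(r"[^A-Za-z][Tt]he[^A-Za-z]", text)

# Alternative not from the slides: \b matches at a word edge without
# consuming a character, so the sentence-initial 'The' is found as well.
word_boundary = re.findall(r"\b[Tt]he\b", text)

print(len(plain), len(either_case), bounded, word_boundary)
```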

Text classification

Assigning subject categories, topics, or genres.

Spam detection.

Authorship identification.

Age/gender identification.

Language Identification.

· · ·

Text Classification

Inputs:

A document d.
A fixed set of classes $C = \{c_1, c_2, \dots, c_n\}$.

Output:

A predicted class c ∈ C

Naive Bayes

Relies on simple representation of document – Bag of Words.

For a document d and a class c

$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$

Naive Bayes Classifier

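$$c_{MAP} = \arg\max_{c \in C} P(c \mid d)$$

MAP: maximum a posteriori (the most likely class).

Applying Bayes' rule:

$$c_{MAP} = \arg\max_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)}$$

P(d) is the same for every class, so it does not affect the argmax and can be dropped:

$$c_{MAP} = \arg\max_{c \in C} P(d \mid c)\,P(c)$$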

Naive Bayes Classifier

$$c_{MAP} = \arg\max_{c \in C} P(d \mid c)\,P(c)$$

Let's say that the document is represented by n features $x_1, x_2, \dots, x_n$:

$$c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \dots, x_n \mid c)\,P(c)$$

Assumptions

Bag of words: the position of words does not matter.

Conditional independence: the feature probabilities $P(x_i \mid c)$ are independent given the class c.

$$P(x_1, x_2, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$$
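In practice a product of many small probabilities can underflow to zero in floating point. A common workaround, which is not part of the lecture, is to add logarithms instead of multiplying; since log is increasing, the argmax over classes is unchanged. The sketch below uses made-up probabilities and an illustrative function name.

```python
import math

def log_score(prior, feature_probs):
    """log P(c) + sum_i log P(x_i | c); ranks classes like the raw product."""
    return math.log(prior) + sum(math.log(p) for p in feature_probs)

# Hypothetical numbers: prior 0.75 and two feature probabilities of 0.4.
raw_product = 0.75 * 0.4 * 0.4
recovered = math.exp(log_score(0.75, [0.4, 0.4]))

print(raw_product, recovered)
```

This form requires every probability to be strictly positive, since log(0) is undefined.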

Bag of word representation
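The figure for this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the idea: a document is reduced to its word counts and the word order is thrown away. Lowercasing and splitting on whitespace are simplifying assumptions made here, not choices stated in the lecture.

```python
from collections import Counter

def bag_of_words(text):
    """Map each lowercase, whitespace-separated token to its count."""
    return Counter(text.lower().split())

bag1 = bag_of_words("Carla Betty Carla")
bag2 = bag_of_words("Betty Carla Carla")

# Two documents with the same words in a different order get the same bag.
print(bag1 == bag2, bag1)
```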

Naive Bayes: Learning

What do we need?
A training set of m hand-labeled documents $(d_1, c_1), \dots, (d_m, c_m)$.

Naive Bayes: Learning

Let $N_D$ be the number of documents, and $N_{c_j}$ be the number of documents in class $c_j$.

Let $V_{c_j}$ be the set of all words in the documents of class $c_j$.

Now we find the maximum likelihood estimates:

$$\hat{P}(c_j) = \frac{N_{c_j}}{N_D}$$

$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V_{c_j}} \mathrm{count}(w, c_j)}$$

Now we can classify a document d by:

$$c_d = \arg\max_{c_j \in C} \hat{P}(c_j) \prod_{w_i \in d} \hat{P}(w_i \mid c_j)$$
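The estimates and the decision rule above can be sketched in a few lines. The function names `train` and `classify` are illustrative, tokens are split on whitespace as a simplifying assumption, and the four labeled documents are the training set used in the lecture's worked example. Exact fractions are used so the numbers can be checked by hand.

```python
from collections import Counter, defaultdict
from fractions import Fraction

def train(docs):
    """docs: list of (text, label). Returns unsmoothed priors and likelihoods."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for text, label in docs:
        word_counts[label].update(text.split())
    priors = {c: Fraction(n, len(docs)) for c, n in class_counts.items()}
    likelihoods = {
        c: {w: Fraction(k, sum(counts.values())) for w, k in counts.items()}
        for c, counts in word_counts.items()
    }
    return priors, likelihoods

def classify(text, priors, likelihoods):
    """Return the highest-scoring class and the score of every class."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for w in text.split():
            score *= likelihoods[c].get(w, Fraction(0))  # unseen in class: 0
        scores[c] = score
    return max(scores, key=scores.get), scores

docs = [
    ("Carla Betty Carla", "a"),
    ("Carla Carla Suzanne", "a"),
    ("Carla Matt", "a"),
    ("Taylor Jessica Carla", "b"),
]
priors, likelihoods = train(docs)
label, scores = classify("Carla Matt", priors, likelihoods)
print(label, scores)
```

Here "Matt" never occurs in class b, so the score of class b is exactly zero regardless of the other words.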

Naive Bayes: Learning

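What if we come across an unknown word in the document d?
Let $w_u$ be the unknown word. Then $\hat{P}(w_u \mid c_j) = 0$ for every class $c_j$, so the product is zero for every class and the classes cannot be compared.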
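The effect is easy to reproduce. In this sketch the priors and the two likelihood values are illustrative unsmoothed estimates, and "Zed" is a hypothetical word that appears in no training document.

```python
from fractions import Fraction

# Illustrative unsmoothed estimates for a two-class model.
priors = {"a": Fraction(3, 4), "b": Fraction(1, 4)}
likelihoods = {"a": {"Carla": Fraction(5, 8)}, "b": {"Carla": Fraction(1, 3)}}

def score(words, c):
    """Prior times the product of per-word likelihoods for class c."""
    s = priors[c]
    for w in words:
        s *= likelihoods[c].get(w, Fraction(0))  # unknown word contributes 0
    return s

scores = {c: score(["Carla", "Zed"], c) for c in priors}
print(scores)
```

Both scores collapse to zero, so a single unknown word wipes out all the evidence from the known words.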

Laplace smoothing

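Let V be the set of all words in the training documents, i.e., $V = \bigcup_{c_j} V_{c_j}$.

Add one extra entry to the vocabulary for the unknown word.

$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j) + 1}{\sum_{w \in V_{c_j}} \mathrm{count}(w, c_j) + |V| + 1}$$

So, for all unknown words, we have:

$$\hat{P}(w_u \mid c_j) = \frac{1}{\sum_{w \in V_{c_j}} \mathrm{count}(w, c_j) + |V| + 1}$$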
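A small sketch of the smoothed estimate follows. The function name `smoothed_likelihood` is illustrative; the counts are those of class a in the lecture's worked example (8 word tokens, with a 6-word vocabulary across all classes). The final check shows why the denominator gains |V| + 1: the probabilities of the vocabulary words plus the single unknown-word slot sum to one.

```python
from fractions import Fraction

def smoothed_likelihood(word, class_counts, vocab_size):
    """Add-one estimate with one extra vocabulary slot for unknown words."""
    total = sum(class_counts.values())
    return Fraction(class_counts.get(word, 0) + 1, total + vocab_size + 1)

vocab = ["Carla", "Betty", "Suzanne", "Matt", "Taylor", "Jessica"]
counts_a = {"Carla": 5, "Betty": 1, "Suzanne": 1, "Matt": 1}

p_carla = smoothed_likelihood("Carla", counts_a, len(vocab))    # (5 + 1) / 15
p_taylor = smoothed_likelihood("Taylor", counts_a, len(vocab))  # (0 + 1) / 15
p_unknown = smoothed_likelihood("Zed", counts_a, len(vocab))    # hypothetical word

# Every vocabulary word plus the unknown slot together account for all the mass.
total_mass = sum(smoothed_likelihood(w, counts_a, len(vocab)) for w in vocab) + p_unknown
print(p_carla, p_taylor, p_unknown, total_mass)
```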

Example

Training set:

#    Text                     Class
1    Carla Betty Carla        a
2    Carla Carla Suzanne      a
3    Carla Matt               a
4    Taylor Jessica Carla     b

$\hat{P}(a) = 3/4$
$\hat{P}(b) = 1/4$

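Class a contains 8 word tokens, the vocabulary has |V| = 6 words, and one extra slot is added for the unknown word:

$\hat{P}(\text{Carla} \mid a) = (5 + 1)/(8 + 6 + 1) = 6/15$
$\hat{P}(\text{Taylor} \mid a) = (0 + 1)/(8 + 6 + 1) = 1/15$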

Document d5: Carla Carla Carla Taylor Jessica

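$\hat{P}(a) = 3/4$ and $\hat{P}(b) = 1/4$
$\hat{P}(\text{Carla} \mid a) = 6/15$, $\hat{P}(\text{Carla} \mid b) = 2/10$
$\hat{P}(\text{Taylor} \mid a) = 1/15$, $\hat{P}(\text{Taylor} \mid b) = 2/10$
$\hat{P}(\text{Jessica} \mid a) = 1/15$, $\hat{P}(\text{Jessica} \mid b) = 2/10$

$P(a \mid d_5) \propto 3/4 \times (6/15)^3 \times 1/15 \times 1/15 \approx 0.0002$
$P(b \mid d_5) \propto 1/4 \times (2/10)^3 \times 2/10 \times 2/10 \approx 0.00008$

The score for class a is larger, so $d_5$ is assigned to class a.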
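The whole worked example can be reproduced end to end. This is a sketch under the same assumptions as the slides (add-one smoothing with one extra vocabulary slot, whitespace tokens); variable names are illustrative and exact fractions are used so the results can be compared with the hand calculation.

```python
from collections import Counter
from fractions import Fraction

training = [
    ("Carla Betty Carla", "a"),
    ("Carla Carla Suzanne", "a"),
    ("Carla Matt", "a"),
    ("Taylor Jessica Carla", "b"),
]
d5 = "Carla Carla Carla Taylor Jessica"

vocab = {w for text, _ in training for w in text.split()}
labels = sorted({label for _, label in training})

scores = {}
for c in labels:
    docs_c = [text for text, label in training if label == c]
    counts = Counter(w for text in docs_c for w in text.split())
    total = sum(counts.values())
    score = Fraction(len(docs_c), len(training))  # prior of class c
    for w in d5.split():
        # Smoothed likelihood: (count + 1) / (tokens in class + |V| + 1)
        score *= Fraction(counts[w] + 1, total + len(vocab) + 1)
    scores[c] = score

predicted = max(scores, key=scores.get)
print(predicted, {c: float(s) for c, s in scores.items()})
```

The scores are unnormalized, which is why they do not sum to one; only their ordering matters for the prediction.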

Naive Bayes

Naive Bayes is not so naive!!

Robust to Irrelevant Features.

Optimal if the independence assumptions hold.

A good, dependable baseline for text classification, although other classifiers exist that give better accuracy.

Federalist Papers

Discussion: Federalist Papers.
E.g., what training set do we need?

Language Modelling

Goal: Assign probability to a sentence.

Why?

Machine Translation. P(high winds tonight) > P(large winds tonight)

Spell Correction. P(about fifteen minutes from) > P(about fifteen minuets from)

Speech Recognition. P(I saw a van) > P(eyes awe of an)

Language Modelling

Goal: compute the probability of a sentence or sequence of words.

$$P(W) = P(w_1, w_2, \dots, w_n)$$

Related task: probability of an upcoming word.

$$P(w_i \mid w_1, w_2, \dots, w_{i-1})$$

Chain rule:

$$P(x_1, x_2, \dots, x_n) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2) \cdots P(x_n \mid x_1, x_2, \dots, x_{n-1}) = \prod_i P(x_i \mid x_1, x_2, \dots, x_{i-1})$$

Example

P(its water is so transparent that) = P(its) × P(water | its) × P(is | its, water) × P(so | its, water, is) × P(transparent | its, water, is, so) × P(that | its, water, is, so, transparent)

Can we count?

$$P(\text{that} \mid \text{its, water, is, so, transparent}) = \frac{P(\text{its, water, is, so, transparent, that})}{P(\text{its, water, is, so, transparent})}$$

No. Too many possibilities.

Markov Assumption

Take only the k words preceding it.

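$$P(w_i \mid w_1, w_2, \dots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \dots, w_{i-1})$$

With k = 1:

$$P(\text{that} \mid \text{its, water, is, so, transparent}) \approx P(\text{that} \mid \text{transparent})$$

or, with k = 2:

$$P(\text{that} \mid \text{its, water, is, so, transparent}) \approx P(\text{that} \mid \text{so, transparent})$$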
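Operationally, the Markov assumption just truncates the history before it is used as the conditioning context. A minimal sketch, with an illustrative function name:

```python
def markov_context(history, k):
    """Keep only the last k words of the history (all of it if it is shorter)."""
    # history[-0:] would return the whole list, so k = 0 is handled separately.
    return tuple(history[-k:]) if k > 0 else ()

history = ["its", "water", "is", "so", "transparent"]

one_word = markov_context(history, 1)
two_words = markov_context(history, 2)
no_context = markov_context(history, 0)

print(one_word, two_words, no_context)
```

With k = 0 the context is empty, so each word is predicted on its own.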

Unigram, Bigram and N-gram

Unigram model:

$$P(w_1, w_2, \dots, w_{n-1}, w_n) \approx \prod_i P(w_i)$$

Bigram model:

$$P(w_i \mid w_1, w_2, \dots, w_{i-1}) \approx P(w_i \mid w_{i-1})$$

N-gram model: extension to trigram, 4-gram, 5-gram, etc.
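The slides do not say how the bigram probabilities are obtained. A common choice, assumed here, is the relative-frequency estimate count(w_{i-1}, w_i) / count(w_{i-1}). The sketch below uses a hypothetical two-sentence corpus, an assumed `<s>` start-of-sentence marker, and illustrative function names.

```python
from collections import Counter
from fractions import Fraction

def bigram_model(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency; <s> marks sentence start."""
    context_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, word)] += 1
    return {pair: Fraction(n, context_counts[pair[0]]) for pair, n in pair_counts.items()}

def sentence_probability(model, sentence):
    """Multiply the bigram probabilities along the sentence (0 if a pair is unseen)."""
    tokens = ["<s>"] + sentence.split()
    prob = Fraction(1)
    for pair in zip(tokens, tokens[1:]):
        prob *= model.get(pair, Fraction(0))
    return prob

# Hypothetical toy corpus, for illustration only.
corpus = ["its water is so transparent", "its water is clear"]
model = bigram_model(corpus)

seen = sentence_probability(model, "its water is so transparent")
unseen = sentence_probability(model, "water is clear")
print(seen, unseen)
```

The second sentence gets probability zero because no training sentence starts with "water". The sketch also omits an end-of-sentence marker, so it scores word sequences rather than defining a full distribution over sentences.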

Discussion

1. Spell correction

2. Word suggestion.
