Introduction to Natural Language Processing

Vered Shwartz

Bar-Ilan University, Israel

June 13, 2018

TMI: Too Much Information
Let’s start with two facts:

- 90% of the data in the world today has been created in the last two years.[1]
- Our attention span is now less than that of a goldfish,[2] and we almost never read through an article.[3]

[1] IBM, March 2012 (!)
[2] The Telegraph, March 2016
[3] Slate, March 2016

TMI: Too Much Information
Why is this a problem?

Some people may spend their entire vacation... trying to find the optimal hotel!

Natural Language Processing to the rescue
We are working on automatic methods to...

- Summarize multiple long texts
- Answer questions based on texts
- Identify the sentiment of texts (e.g. reviews)
- More...

What is Natural Language Processing (NLP)?

- Goal: for computers to “understand” and be able to communicate with people in natural languages (e.g. English)

NLP Applications are Everywhere

- Spell Check
- Grammar Correction
- Autocomplete
- Spam Detection
- Machine Translation
- Search Queries
- Question Answering
- Targeted Ads
- Personal Assistants
- Chatbots

Text Analysis Tasks

[Pipeline figure: speech → speech to text → text → tokenization (split to words) → morphological analysis → syntactic analysis → semantic analysis]

Text Analysis Tasks: Tokenization

- Split text into a sequence of tokens (≈ words)
- Naive approach: split sentences by period, words by spaces
- How to tokenize this text?

‘Whose frisbee is this?’ John asked, rather self-consciously. ‘Oh, it’s one of the boys’ said the Sen.

- (Optional) answer:
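
A minimal sketch of why the naive approach struggles on the example text, using only Python's standard `re` module. The abbreviation list, the regular expression, and both function names are illustrative assumptions, not the tokenizer behind the slide's answer. ASCII quotes stand in for the curly ones.

```python
import re

# The slide's example, with ASCII quotes standing in for the curly ones.
TEXT = ("'Whose frisbee is this?' John asked, rather self-consciously. "
        "'Oh, it's one of the boys' said the Sen.")

# Assumed, deliberately tiny list of abbreviations that keep their period.
ABBREVIATIONS = {"Sen.", "Dr.", "Mr."}

# A word (internal hyphens or apostrophes allowed, optional trailing period),
# or any single non-space, non-letter character such as punctuation.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*\.?|[^\sA-Za-z]")


def naive_tokenize(text):
    """Naive approach: sentences end at a period, words are split on spaces."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s.split() for s in sentences]


def regex_tokenize(text):
    """Split off punctuation, but keep the period of known abbreviations."""
    tokens = []
    for tok in TOKEN_RE.findall(text):
        if len(tok) > 1 and tok.endswith(".") and tok not in ABBREVIATIONS:
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens


if __name__ == "__main__":
    print(naive_tokenize(TEXT))
    print(regex_tokenize(TEXT))
```

The naive version leaves punctuation glued to words (`this?'`, `asked,`) and strips the period from `Sen.`. The regex version keeps `self-consciously`, `it's` and `Sen.` whole, but it still cannot tell whether the apostrophe after `boys` is a possessive or a closing quote.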

Text Analysis Tasks: Morphological Analysis

- Words are made from morphemes, smaller meaningful units
- Normally: base form + affixes
  - Nouns - plural form: dogs, suffixes, baby → babies
  - Verbs - tense: worked, working; person: works
- Many irregularities... “women and children began running away as the wolves showed their teeth”
- Morphological analysis:
  - input: “am”, output: “be” + 1 PERSON + PRESENT
- Lemmatizer: reduce inflectional forms of a word to a common base form, e.g. children → child, running → run
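
The "base form + affixes, plus irregularities" idea can be sketched as a toy lemmatizer: a lookup table for irregular forms, then a few suffix-stripping rules. The table, the rules, and the minimum stem length are assumptions chosen to cover the slide's examples. A real lemmatizer needs a full lexicon and part-of-speech information.

```python
# Irregular forms cannot be derived by rule, so they are looked up directly.
IRREGULAR = {
    "women": "woman", "children": "child", "teeth": "tooth",
    "wolves": "wolf", "began": "begin", "begun": "begin",
    "am": "be", "is": "be", "are": "be",
}

# (suffix, replacement) pairs, tried in order; longer suffixes come first.
SUFFIX_RULES = [
    ("ies", "y"),   # babies -> baby
    ("xes", "x"),   # suffixes -> suffix
    ("ing", ""),    # working -> work
    ("ed", ""),     # worked -> work
    ("s", ""),      # dogs -> dog, works -> work
]


def lemmatize(word):
    """Return a guessed base form for a single English word."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if len(stem) < 3:
                continue  # too short to be a plausible stem (e.g. "as")
            # Undo consonant doubling: running -> runn -> run.
            if suffix in ("ing", "ed") and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem + replacement
    return word


if __name__ == "__main__":
    sentence = "women and children began running away as the wolves showed their teeth"
    print([lemmatize(w) for w in sentence.split()])
```

The rules are intentionally crude: the doubling rule that fixes `running` would wrongly turn `falling` into `fal`, which is one reason finding good hand-written rules is hard.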

Text Analysis Tasks: Part of Speech Tagging

- Tags each word with its part of speech (POS): noun, verb, adjective, adverb, preposition, etc.
- Surrounding words help in deciding on the correct POS tag for ambiguous words:
  - I’m reading an interesting book ⇒ book = NOUN
  - I would like to book a flight ⇒ book = VERB
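
A toy sketch of how neighbouring words can disambiguate "book". The two context rules and the function name are assumptions made up for this example. Actual taggers learn such cues from annotated data rather than relying on hand-written rules.

```python
DETERMINERS = {"a", "an", "the"}


def tag_ambiguous(tokens, i):
    """Guess NOUN or VERB for tokens[i] from the words around it."""
    prev_word = tokens[i - 1].lower() if i > 0 else ""
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    # "to book ..." or "book a/an/the ..." suggests a verb reading.
    if prev_word == "to" or next_word in DETERMINERS:
        return "VERB"
    return "NOUN"


if __name__ == "__main__":
    for sentence in ("I'm reading an interesting book",
                     "I would like to book a flight"):
        tokens = sentence.split()
        i = tokens.index("book")
        print(sentence, "=>", tag_ambiguous(tokens, i))
```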

Text Analysis Tasks: Syntactic Parsing

- Analyzes the syntactic structure of a sentence
- Let’s look at some syntactic ambiguities!

- “They ate pizza with anchovies”
  - (1) They ate pizza, the pizza had anchovies on it
  - (2) They ate pizza using anchovies instead of utensils
  - (3) The anchovies also ate pizza
- Each of the interpretations yields a different syntactic analysis
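
One way to make "a different syntactic analysis" concrete is to write each reading as a set of head-dependent edges that differ only in where "with anchovies" attaches. The attachment chosen for each reading and the edge format are assumptions for illustration. Other grammatical frameworks would encode the three readings differently.

```python
# Assumed attachment site of the phrase "with anchovies" for each reading.
ATTACHMENTS = {
    "(1) topping": "pizza",     # the pizza had anchovies on it
    "(2) utensil": "ate",       # anchovies were used to eat
    "(3) companions": "They",   # the anchovies also ate
}


def dependency_edges(pp_head):
    """Return (head, dependent) edges given the head that 'anchovies' attaches to."""
    return {
        ("ate", "They"),         # subject
        ("ate", "pizza"),        # object
        ("anchovies", "with"),   # the preposition marks its noun
        (pp_head, "anchovies"),  # the only edge that changes between readings
    }


if __name__ == "__main__":
    for reading, head in ATTACHMENTS.items():
        print(reading, sorted(dependency_edges(head)))
```

The word sequence is identical in all three cases. Only the structure differs, which is why a parser has to choose between analyses.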

Text Analysis Tasks: Coreference Resolution

- Identify mentions referring to the same entity
- Considered a difficult task!

- “I gave the monkeys the bananas because they were hungry” ⇒ they = the monkeys
- “I gave the monkeys the bananas because they were ripe” ⇒ they = the bananas
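
The two sentences have the same structure, so the referent of "they" can only be chosen using knowledge about monkeys and bananas. The sketch below fakes that knowledge with a hand-written table. The table contents and the function name are assumptions, and this is not how coreference systems are built in practice.

```python
# Assumed world knowledge: properties each candidate can plausibly have.
CAN_BE = {
    "the monkeys": {"hungry", "tired"},
    "the bananas": {"ripe", "yellow"},
}


def resolve_they(sentence, knowledge=CAN_BE):
    """Return the candidates compatible with the sentence's final adjective."""
    prop = sentence.rstrip(".").split()[-1].lower()
    return [cand for cand, props in knowledge.items() if prop in props]


if __name__ == "__main__":
    for ending in ("hungry", "ripe"):
        s = f"I gave the monkeys the bananas because they were {ending}"
        print(s, "=>", resolve_they(s))
```

Any property missing from the table leaves "they" unresolved, which hints at why the task is considered difficult: the required knowledge is open-ended.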

Text Analysis Tasks: Word Sense Disambiguation

- What’s the correct sense of a word in a given context?

from http://naviglinlp.blogspot.co.il/

Text Analysis Tasks: Named Entities

- Named Entity Recognition: recognize entities and their type
- Entity Linking: linking entities to their Wikipedia pages

from http://www.ibm.com/blogs/research

NLP is hard!

- Tokenization and POS tagging are almost 100% accurate today, but semantic tasks are far from that
- Two major difficulties:
  - Ambiguity: one text can have multiple meanings
  - Lexical variability: the same meaning can be expressed with different words

Example Application: Spam Detection

(used to be much worse... > 90%!)

- Automatically determine whether an email is spam or not
  - (and move spam messages to “spam” folder)
- Special case of Text Classification: given a text, automatically determine its topic
- How does it work?

Spam Detection
Let’s think of characteristics of spam emails

- Unknown sender
- Spam triggering words:
  - Earn extra cash
  - Earn $
  - Free
  - Lose weight
  - Instant
  - Bonus
  - ...
- Naive idea: mark any email that contains these words as spam
- Problem: inaccurate (will mark non-spam as spam and vice versa)
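
The naive idea can be sketched in a few lines, which also makes its errors easy to see. The trigger list is taken from the slide. The function name, the substring matching, and the sample emails are assumptions for illustration.

```python
SPAM_TRIGGERS = ["earn extra cash", "earn $", "free", "lose weight", "instant", "bonus"]


def naive_is_spam(body):
    """Mark an email as spam if it contains any trigger phrase."""
    text = body.lower()
    return any(trigger in text for trigger in SPAM_TRIGGERS)


if __name__ == "__main__":
    emails = [
        "Earn extra cash from home!",                  # spam, caught
        "Hi, are you free for lunch on Friday?",       # not spam, but flagged
        "You have been selected to receive a prize.",  # spam-like, but missed
    ]
    for body in emails:
        print(naive_is_spam(body), "-", body)
```

The second email is a false positive and the third is a false negative, matching the "non-spam as spam and vice versa" problem.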

Spam Detection: Rule-based Approach

- Better idea: define rules, e.g. “mark as spam if unknown sender and contains at least 2 spam triggering words”
- More accurate: e.g. will not mark an email from your mother, with the word “instant” as spam :)
- Problems:
  - Finding the optimal rules is difficult
  - Not all triggering words were created equal
- Solution: Let the computer “learn” these rules alone!
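
The example rule ("unknown sender and at least 2 spam triggering words") can be written down directly. This is a minimal sketch: the function names, the `known_senders` set, the addresses, and the sample emails are all hypothetical.

```python
SPAM_TRIGGERS = ["earn extra cash", "earn $", "free", "lose weight", "instant", "bonus"]


def count_triggers(body):
    """Count how many distinct trigger phrases occur in the email body."""
    text = body.lower()
    return sum(trigger in text for trigger in SPAM_TRIGGERS)


def rule_is_spam(sender, body, known_senders, min_triggers=2):
    """Spam only if the sender is unknown AND enough trigger phrases appear."""
    return sender not in known_senders and count_triggers(body) >= min_triggers


if __name__ == "__main__":
    known = {"mom@example.com"}  # hypothetical address book
    print(rule_is_spam("mom@example.com", "I made instant soup, it was free!", known))
    print(rule_is_spam("stranger@example.net", "Free bonus! Lose weight in an instant", known))
    print(rule_is_spam("stranger@example.net", "Are you free on Friday?", known))
```

Both listed problems are visible here: the threshold `min_triggers=2` was picked by hand, and every trigger phrase counts equally. A learned classifier would instead estimate a weight per word from labeled emails.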


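Spam Detection: Supervised Learning

- Let the computer learn a scoring function:
  $$\text{score} = \dots + \alpha_{\text{have}} \cdot c(\text{have}) + \alpha_{\text{sent}} \cdot c(\text{sent}) + \dots + \alpha_{\text{bernard}} \cdot c(\text{bernard})$$
- Different weight $\alpha_i$ for each word, e.g. $\alpha_{\text{cash}} > \alpha_{\text{document}}$
- Classify as spam if score > threshold (learn threshold too!)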
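The scoring function can be sketched in a few lines of Python. This assumes that c(word) is the number of times the word occurs in the email and that words are split on whitespace. The weights and the threshold below are made-up numbers, not learned values.

```python
from collections import Counter

# Made-up weights for illustration only; in the lecture they are learned from data.
WEIGHTS = {"cash": 2.0, "free": 1.5, "document": -1.0, "sent": -0.5}
THRESHOLD = 1.0  # also made up; the slides suggest learning it as well


def score(text, weights):
    """Sum of weight(word) * count(word) over the words of the email."""
    counts = Counter(text.lower().split())
    # Words without a weight contribute nothing to the score.
    return sum(weights.get(word, 0.0) * c for word, c in counts.items())


def is_spam(text, weights=WEIGHTS, threshold=THRESHOLD):
    """Classify as spam if the score exceeds the threshold."""
    return score(text, weights) > threshold


print(is_spam("free cash free"))            # True  (score 5.0)
print(is_spam("I have sent the document"))  # False (score -1.5)
```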


Spam Detection: Supervised Learning

- How does the computer learn the $\alpha$ weights?
- Supervised learning: estimate a function (learn weights) using labeled examples
- Take a lot of emails, manually mark them as spam/not spam
- The computer learns a function (weights) that best predicts spam/not spam for the known emails
- If we have enough examples, it will also work well on new emails
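The slides do not name a learning algorithm. As one possible illustration, the sketch below uses the classic perceptron update rule to learn one weight per word from labeled emails. The six training emails are invented, and the learned bias plays the role of the (negated) threshold.

```python
from collections import Counter, defaultdict


def featurize(text):
    """Bag-of-words counts c(word) for one email."""
    return Counter(text.lower().split())


def train_perceptron(examples, epochs=10):
    """Learn one weight per word plus a bias from (text, is_spam) pairs."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, is_spam in examples:
            counts = featurize(text)
            score = bias + sum(weights[w] * c for w, c in counts.items())
            if (score > 0) != is_spam:
                # Wrong prediction: move the weights towards the correct label.
                step = 1.0 if is_spam else -1.0
                for w, c in counts.items():
                    weights[w] += step * c
                bias += step
    return weights, bias


def predict(text, weights, bias):
    """Return True if the learned score says the email is spam."""
    counts = featurize(text)
    return bias + sum(weights.get(w, 0.0) * c for w, c in counts.items()) > 0


# Hypothetical, manually labeled training emails (True = spam).
EXAMPLES = [
    ("earn extra cash now", True),
    ("free bonus cash", True),
    ("lose weight instant bonus", True),
    ("i have sent the document", False),
    ("meeting notes attached", False),
    ("see you at lunch", False),
]

weights, bias = train_perceptron(EXAMPLES)
print(predict("instant cash bonus", weights, bias))  # True
print(predict("i sent the notes", weights, bias))    # False
```

With only six examples this generalizes to new emails only when they reuse words seen in training, which is why the slide asks for "a lot of emails".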


Spam Detection: Features

- We used bag-of-words as features for classification: { I, have, sent, you, ... }
- If we have enough spam examples that contain the word “urgent”, $\alpha_{\text{urgent}}$ will be high
- What about similar words like “immediate” or “instant”?
- We need to find a way to let the computer know about semantically similar words


Word Representation: One-hot Vectors

- How do we represent all the words in the computer?
- Simplest: we have a dictionary, and each word has an index, e.g. index(urgent) = 316, index(instant) = 12418
- You can think of the word with index i as a vector (array of numbers) that is all zeros except for a single 1 at the i-th position, a “one-hot vector”:

  urgent   0 0 ... 1 0 ... 0 0 ... 0   (the 1 is at index 316)
  instant  0 0 ... 0 0 ... 1 0 ... 0   (the 1 is at index 12418)
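A small Python sketch of the same idea, using a hypothetical four-word vocabulary instead of a full dictionary (so the indices are 0 to 3 rather than 316 and 12418):

```python
def build_index(vocabulary):
    """Assign each distinct word an integer index (a tiny stand-in for a dictionary)."""
    return {word: i for i, word in enumerate(vocabulary)}


def one_hot(word, index):
    """Vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(index)
    vec[index[word]] = 1
    return vec


index = build_index(["have", "instant", "sent", "urgent"])
urgent = one_hot("urgent", index)
instant = one_hot("instant", index)

print(urgent)   # [0, 0, 0, 1]
print(instant)  # [0, 1, 0, 0]

# Two different words never share a non-zero position, so their dot product is 0.
print(sum(a * b for a, b in zip(urgent, instant)))  # 0
```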


Spam Detection: Bag-of-words with One-hot Vectors

- A vector representing the entire email: sum of one-hot vectors of the words in the email:

  I               0 0 ... 1 0 ... 0 0 ... 0
  have            0 1 ... 0 0 ... 0 0 ... 0
  sent            0 0 ... 0 0 ... 1 0 ... 0
  + ...           ...
  bernard         0 0 ... 0 1 ... 0 0 ... 0
  =
  feature vector  0 4 ... 2 1 ... 1 0 ... 0

- Problem: Emails with similar words (e.g. deliver instead of send, urgent instead of instant) have very different feature vectors!
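The summation and the resulting problem can be reproduced with a toy example. The six-word vocabulary and the two emails below are assumptions chosen so that the emails differ only in near-synonyms.

```python
# Hypothetical vocabulary; a real one would hold tens of thousands of words.
VOCAB = ["deliver", "i", "instant", "send", "urgent", "will"]
INDEX = {word: i for i, word in enumerate(VOCAB)}


def one_hot(word):
    """Vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(VOCAB)
    vec[INDEX[word]] = 1
    return vec


def bag_of_words(text):
    """Sum the one-hot vectors of the words; entry i ends up as the count of word i."""
    total = [0] * len(VOCAB)
    for word in text.lower().split():
        if word in INDEX:  # words outside the vocabulary are skipped
            total = [t + o for t, o in zip(total, one_hot(word))]
    return total


a = bag_of_words("urgent I will send")
b = bag_of_words("instant I will deliver")

print(a)  # [0, 1, 0, 1, 1, 1]
print(b)  # [1, 1, 1, 0, 0, 1]
```

The two emails mean almost the same thing, yet their vectors agree only on the positions for "i" and "will". The weights learned for "urgent" and "send" say nothing about "instant" and "deliver".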


Word Representation: Distributional Word Vectors

- Can we have similar vectors for semantically similar words?
- “You shall know a word by the company it keeps” (John Rupert Firth, 1957)

  elevator  0 0 ... 0.16 0 ... 0.49 0 ... 0
  lift      0 0 ... 0.15 0 ... 0.51 0 ... 0

  (the two non-zero columns shown are marked "up" and "stairs")

- Now semantically similar words have similar word vectors!
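One simple way to build such vectors, sketched here under stated assumptions, is to count which words appear in the same sentence as the target word and compare the resulting count vectors with cosine similarity. The five-sentence corpus is invented, and raw counts are used where the slide shows normalized values.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences):
    """Map each word to counts of the other words appearing in the same sentence."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            for j, context in enumerate(words):
                if i != j:
                    vectors[word][context] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus.
CORPUS = [
    "take the elevator up",
    "take the lift up",
    "the elevator or the stairs",
    "the lift or the stairs",
    "send the urgent document",
]

vectors = cooccurrence_vectors(CORPUS)
sim_lift = cosine(vectors["elevator"], vectors["lift"])
sim_document = cosine(vectors["elevator"], vectors["document"])
print(sim_lift > sim_document)  # True
```

"elevator" and "lift" keep the same company ("take", "up", "stairs", ...), so their vectors point in the same direction, while "document" shares only the word "the" with them.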


Spam Detection: Bag-of-words with Distributional Word Vectors

- Again, we sum up all the vectors:

  I        0 0    ... 0.12 0.03 ... 0.04 0 ... 0
  have     0 0.22 ... 0    0    ... 0    0 ... 0
  sent     0 0.43 ... 0    0.1  ... 0.25 0 ... 0
  + ...    ...
  bernard  0 0    ... 0    0.67 ... 0    0 ... 0
  =
  FV       0 0.65 ... 0.12 0.71 ... 0.29 0 ... 0

- We can now replace a word (e.g. sent) with a similar word (e.g. delivered) and get a similar feature vector ⇒ same classification for similar emails!
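The effect of swapping in a similar word can be checked numerically. The three-dimensional vectors below are made up for illustration (real distributional vectors have many more dimensions), and cosine similarity is used here as one common way to compare feature vectors.

```python
import math

# Made-up word vectors, for illustration only.
VECTORS = {
    "i":         [0.1, 0.0, 0.0],
    "have":      [0.0, 0.2, 0.0],
    "sent":      [0.0, 0.4, 0.3],
    "delivered": [0.0, 0.5, 0.3],
    "cash":      [0.9, 0.0, 0.1],
}


def feature_vector(text):
    """Sum the word vectors of all known words in the email."""
    total = [0.0, 0.0, 0.0]
    for word in text.lower().split():
        if word in VECTORS:
            total = [t + x for t, x in zip(total, VECTORS[word])]
    return total


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


original = feature_vector("I have sent")
paraphrase = feature_vector("I have delivered")
different = feature_vector("I have cash")

print(cosine(original, paraphrase) > cosine(original, different))  # True
```

Because "sent" and "delivered" have nearby vectors, the two emails end up with nearly identical feature vectors, so a classifier with fixed weights would score them almost the same.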


Word Embeddings

- [ A more recent type of distributional vectors ]
- Find most similar words:

See more here: http://bionlp-www.utu.fi/wv_demo/


Additional Resources

- Books:
  - Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
  - Dan Jurafsky and James H. Martin, Speech and Language Processing, Second Edition. Pearson Education, 2014.
- Resources from NACLO (North American Computational Linguistics Olympiad): http://nacloweb.org/resources.php
- My blog: http://veredshwartz.blogspot.co.il