Machine Learning in NLP, Lecture 1: Introduction
TRANSCRIPT
overview of today's lecture
I some information about the course
I machine learning basics and overview
I overview of the assignments
teaching
I apart from today's lecture, the rest of the teaching will be in
the computer lab (T225)
I material related to the assignments will be taught in the lab
room
I the rest will be available as video lectures
course web page
I http://spraakbanken.gu.se/personal/richard/ml2014
I . . . or from GUL
your work
I 3 assignments (solved individually)
I mini-project (individually or in groups)
  I write a short report
  I present it at a seminar
basic ideas
I given some object, make a prediction
  I is this patient diabetic?
  I is the sentiment of this movie review positive?
  I does this image contain a cat?
  I what is the grammatical function of this noun phrase?
  I what will be tomorrow's share value of this stock?
  I what are the part-of-speech tags of the words in this sentence?
I the goal of machine learning is to build the prediction
functions by observing data
some types of machine learning problems
I classification: learning to output a category label
  I spam/non-spam; positive/negative; subject/object, . . .
I structure prediction: learning to build some structure
  I POS tagging; dependency parsing; translation; . . .
I (numerical regression: learning to guess a number)
  I value of a share; number of stars in a review; . . .
why machine learning?
why would we want to build the function from data instead of just
implementing it?
I usually because we don't really know how to write down the function by hand
  I speech recognition
  I image classification
  I syntactic parsing
  I translation
  I . . .
I might not be necessary for limited tasks where we know:
  I morphology?
  I sentence splitting and tokenization?
  I identification of limited classes of names, dates and times?
I which is more expensive in your case? knowledge or data?
don't forget your linguistic intuitions!
machine learning automates some tasks, but we still need our brains:
I defining the tasks and terminology
I annotating training and testing data
I having an intuition about which features may be useful can be crucial
I in general, features are more important than the choice of learning algorithm
I error analysis
I defining constraints to guide the learner
  I valency lexicons can be used in parsers
  I grammar-based parsers with ML-trained disambiguators
example: is the patient diabetic?
I in order to predict, we make some measurements of properties we believe will be useful
I these are called the features
attributes/values or bag of words
I we often represent the features as attributes with values
  I in practice, as a Python dict

features = {"gender": "male",
            "age": 37,
            "blood_pressure": 130,
            ... }

I sometimes, it's easier just to see the features as a list of e.g. words (bag of words)

features = ["here", "are", "some", "words",
            "in", "a", "document"]
examples of ML in NLP: document classification
I in a previous course, you have implemented a classifier of documents
I many document classifiers use the words of the documents as their features (bag of words)
I . . . but we could also add other features such as the presence of smileys or negations
I Günther, Sentiment Analysis of Microblogs, MLT Master's Thesis, 2013
examples of ML in NLP: difficulty level classification
I what learner level (e.g. according to CEFR) do you need to understand the following Swedish sentences?
  I Flickan sover. → A1
  I Under förberedelsetiden har en baslinjestudie utförts för att kartlägga bland annat diabetesärftlighet, oral glukostolerans, mat och konditionsvanor och socialekonomiska faktorer i åldrarna 35–54 år i flera kommuner inom Nordvästra och Sydöstra sjukvårdsområdena (NVSO resp SÖSO). → C2
I Pilán, NLP-based Approaches to Sentence Readability for Second Language
Learning Purposes, MLT Master's Thesis, 2013
examples of ML in NLP: coreference resolution
I do two given noun phrases refer to the same real-world entity?
I Soon et al. A Machine Learning Approach to Coreference
Resolution of Noun Phrases, Comp. Ling. 2001
examples of ML in NLP: named entity recognition
[ORG United Nations] official [PER Ekeus] heads for [LOC Baghdad].
examples of ML in NLP: named entity recognition
I Zhang and Johnson A Robust Risk Minimization based Named
Entity Recognition System, CoNLL 2003
-20pt
two interesting seminars this week
I this week, Joel Tetreault (Yahoo! Research) is visiting us and will be giving seminars at the following times:
  I Thursday, 15.15–16.00, L308: Hate Speech Detection
  I Friday, 10.30–11.30, T307: Automated Grammatical Error Correction for Language Learners
machine learning in NLP research
I ACL, EMNLP, Coling, etc. are heavily dominated by ML-focused papers
what goes on when we "learn"?
I the learning algorithm observes the examples in the training set
I it tries to find common patterns that explain the data: it generalizes
what is generalization?
I imagine a rote learner:
  I store each example in memory
  I at prediction time: if the input was seen before, return the corresponding output; otherwise return "I don't know"
I is this any good? probably not, since this learner does not try to generalize
I generalization = coming up with a hypothesis that says something about what we haven't seen
I exactly how this is done depends on what algorithm we are using
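a minimal sketch of such a rote learner (the class name and interface are invented for illustration; inputs are assumed to be hashable):

class RoteLearner:
    # memorizes training examples and cannot generalize
    def __init__(self):
        self.memory = {}

    def train(self, examples):
        # examples: list of (input, output) pairs
        for x, y in examples:
            self.memory[x] = y

    def predict(self, x):
        # known input: return the memorized answer; otherwise give up
        return self.memory.get(x, "I don't know")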
"knowledge" from experience?
I given some experience, when is it justified to say that we have gained "knowledge"?
learning algorithms that we have seen so far
I we have already seen a number of learning algorithms in the course on statistical methods:
  I Naive Bayes
  I perceptron
  I hidden Markov models
perceptron revisited
I the perceptron learning algorithm creates a weight table
I each weight in the table corresponds to a feature
  I e.g. "fine" probably has a high positive weight in sentiment analysis
  I "boring" a negative weight
  I "and" a weight near zero
I classification is carried out by summing the weights for each feature
def perceptron_classify(features, weights):
    # sum the weights of all features present in the input
    score = 0
    for f in features:
        score += weights.get(f, 0)
    if score >= 0:
        return "pos"
    else:
        return "neg"
the perceptron learning algorithm
I start with an empty weight table
I classify according to the current weight table
I each time we misclassify, change the weight table a bit
  I if a positive instance was misclassified, add 1 to the weight of each feature in the document
  I and conversely . . .
def perceptron_learn(examples, number_iterations):
    # examples: list of (label, features) pairs
    weights = {}
    for iteration in range(number_iterations):
        for label, features in examples:
            guess = perceptron_classify(features, weights)
            if label == "pos" and guess == "neg":
                # misclassified positive: push the feature weights up
                for f in features:
                    weights[f] = weights.get(f, 0) + 1
            elif label == "neg" and guess == "pos":
                # misclassified negative: push the feature weights down
                for f in features:
                    weights[f] = weights.get(f, 0) - 1
    return weights
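a toy training run (the examples are invented; each example is a (label, bag-of-words) pair, as assumed by the code above):

training_examples = [("pos", ["a", "fine", "movie"]),
                     ("neg", ["a", "boring", "movie"]),
                     ("pos", ["fine", "acting"]),
                     ("neg", ["boring", "and", "predictable"])]
weights = perceptron_learn(training_examples, number_iterations=10)
print(perceptron_classify(["fine", "but", "predictable"], weights))
# "fine" ends up with a positive weight, so this prints "pos"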
estimation in Naive Bayes, revisited
I Naive Bayes:

  P(document, label) = P(f1, . . . , fn, label)
                     = P(label) · P(f1, . . . , fn | label)
                     = P(label) · P(f1 | label) · . . . · P(fn | label)

  (the last step is the naive independence assumption: the features are assumed independent given the label)

I how do we estimate the probabilities?
  I maximum likelihood: set the probabilities so that the probability of the data is maximized
estimation in Naive Bayes: supervised case
I how do we estimate P(positive)?

  P_MLE(positive) = count(positive) / count(all) = 2/4

I how do we estimate P("nice" | positive)?

  P_MLE("nice" | positive) = count("nice", positive) / count(any word, positive) = 2/7
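these counts translate directly into code; a minimal sketch, assuming (label, word list) training pairs as in the perceptron example (the function name is invented):

from collections import Counter

def mle_estimates(examples):
    # examples: list of (label, word_list) pairs
    label_counts = Counter()        # count(label)
    word_label_counts = Counter()   # count(word, label)
    words_per_label = Counter()     # count(any word, label)
    for label, words in examples:
        label_counts[label] += 1
        for w in words:
            word_label_counts[(w, label)] += 1
            words_per_label[label] += 1
    total = sum(label_counts.values())
    p_label = {l: c / total for l, c in label_counts.items()}
    p_word = {(w, l): c / words_per_label[l]
              for (w, l), c in word_label_counts.items()}
    return p_label, p_word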
representation of the prediction function
we may represent our prediction function in different ways:
I numerical models:
  I weight or probability tables
  I networked models
I rules:
  I decision trees
  I transformation rules
example: the prediction function as numbers
def sentiment_is_positive(features):
    score = 0.0
    score += 2.1 * features["wonderful"]
    score += 0.6 * features["good"]
    ...
    score -= 0.9 * features["bad"]
    score -= 3.1 * features["awful"]
    ...
    if score > 0:
        return True
    else:
        return False
example: the prediction function as rules
def patient_is_sick(features):
    if features["systolic_blood_pressure"] > 140:
        return True
    if features["gender"] == "m":
        if features["psa"] > 4.0:
            return True
    ...
    return False
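rule representations like this need not be written by hand: decision-tree learners induce them from data. A minimal sketch with scikit-learn (the feature values and labels are invented for illustration):

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X = [{"systolic_blood_pressure": 150, "gender": "m", "psa": 2.0},
     {"systolic_blood_pressure": 120, "gender": "f", "psa": 0.0},
     {"systolic_blood_pressure": 130, "gender": "m", "psa": 5.5}]
y = [True, False, True]

# vectorize the dicts, then fit a decision tree to the labels
model = make_pipeline(DictVectorizer(), DecisionTreeClassifier())
model.fit(X, y)
print(model.predict([{"systolic_blood_pressure": 125, "gender": "m", "psa": 4.5}]))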
machine learning software
I general-purpose software, large collections of algorithms:
  I scikit-learn: http://scikit-learn.org
    I Python library (will be used in this course)
  I Weka: http://www.cs.waikato.ac.nz/ml/weka
    I Java library with a nice user interface
I NLTK includes some learning algorithms but seems to be discontinuing them in favor of scikit-learn
I special-purpose software, small collections of algorithms:
  I LibSVM/LibLinear
  I CRF++, CRFSGD
  I . . .
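a taste of the scikit-learn interface (a sketch; the documents and labels are invented):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import make_pipeline

docs = ["a fine movie", "fine acting", "boring and predictable"]
labels = ["pos", "pos", "neg"]

# bag-of-words features plus a perceptron classifier, as discussed above
model = make_pipeline(CountVectorizer(), Perceptron())
model.fit(docs, labels)
print(model.predict(["a boring movie"]))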
evaluation
I how do we evaluate our systems?
  I intrinsic evaluation: test the performance in isolation
  I extrinsic evaluation: I changed my POS tagger; how does this change the performance of my parser? how much more money do I make?
I common measures in intrinsic evaluation:
  I classification accuracy
  I precision and recall (for needle-in-a-haystack problems)
I also several other task-dependent measures
I remember to test for statistical signi�cance of your results!
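as a sketch, these measures computed with scikit-learn (the gold and predicted labels are invented):

from sklearn.metrics import accuracy_score, precision_score, recall_score

gold = ["pos", "neg", "pos", "neg", "pos"]
pred = ["pos", "pos", "pos", "neg", "neg"]

print(accuracy_score(gold, pred))                    # 3 of 5 correct = 0.6
print(precision_score(gold, pred, pos_label="pos"))  # 2 of 3 predicted "pos" are correct
print(recall_score(gold, pred, pos_label="pos"))     # 2 of 3 gold "pos" were found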
assignment 1: feature design for function tagging
[parse-tree figure: "Smoking cigarettes is bad for you", with POS and constituent labels, annotated with the function tags SBJ (subject) and PRD (predicative)]
I probably, the most important step when using machine learning in NLP is to design useful features
I that is your job in this assignment (a hypothetical illustration follows below)
I please check the assignment web page before the lab session
  I in particular, please read the paper Chrupała et al. (2007), Better Training for Function Labeling (at least the introduction and the description of the features)
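for a flavor of what feature design can look like, a hypothetical extractor for a constituent in a parse tree; the node interface and the feature names here are invented for illustration and are not the feature set of Chrupała et al.:

def function_tag_features(node):
    # node is assumed to expose .label, .head_word, .parent, .left_sibling
    features = {"label": node.label,      # e.g. "NP"
                "head": node.head_word,   # e.g. "cigarettes"
                "parent_label": node.parent.label if node.parent else "ROOT"}
    if node.left_sibling is not None:
        features["left_sibling"] = node.left_sibling.label
    return features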
assignment 2: classifier implementation
I read a paper about a simple algorithm for training the support vector machine classifier
I write code to implement the algorithm, similar to the algorithms in scikit-learn
assignment 3: learning for tagging and parsing
I implement the structured perceptron learning algorithm
I apply it to named entity recognition and dependency parsing
[ORG United Nations] official [PER Ekeus] heads for [LOC Baghdad].
[dependency-parsing example: "yesterday she gave the horse an apple"]
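a compact sketch of the structured perceptron update; predict and extract_features are placeholders you would supply (e.g. a Viterbi tagger and a feature extractor for tag sequences):

def structured_perceptron(examples, predict, extract_features, n_iterations):
    # examples: list of (input, gold_structure) pairs
    weights = {}
    for _ in range(n_iterations):
        for x, gold in examples:
            guess = predict(x, weights)
            if guess != gold:
                # reward the features of the gold structure ...
                for f in extract_features(x, gold):
                    weights[f] = weights.get(f, 0) + 1
                # ... and penalize the features of the guessed structure
                for f in extract_features(x, guess):
                    weights[f] = weights.get(f, 0) - 1
    return weights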
independent work
I select a topic of interest (or ask me for ideas)
I define a small project
I write code, carry out experiments
I write a short paper, present it at a seminar at the end of the
course