Machine Learning in NLP, Lecture 1: Introduction
TRANSCRIPT
overview of today's lecture
I some information about the course
I machine learning basics and overview
I overview of the assignments
teaching
I apart from today's lecture, the rest of the teaching will be in
the computer lab (T225)
I material related to the assignments will be taught in the lab
room
I the rest will be available as video lectures
course web page
I http://spraakbanken.gu.se/personal/richard/ml2014
I . . . or from GUL
your work
I 3 assignments (solved individually)
I mini-project (individually or in groups)
  I write a short report
  I present it at a seminar
basic ideas
I given some object, make a prediction
  I is this patient diabetic?
  I is the sentiment of this movie review positive?
  I does this image contain a cat?
  I what is the grammatical function of this noun phrase?
  I what will be tomorrow's share value of this stock?
  I what are the part-of-speech tags of the words in this sentence?
I the goal of machine learning is to build the prediction
functions by observing data
some types of machine learning problems
I classification: learning to output a category label
  I spam/non-spam; positive/negative; subject/object, . . .
I structure prediction: learning to build some structure
  I POS tagging; dependency parsing; translation; . . .
I (numerical regression: learning to guess a number)
  I value of a share; number of stars in a review; . . .
why machine learning?
why would we want to build the function from data instead of just
implementing it?
I usually because we don't really know how to write down the function by hand
  I speech recognition
  I image classification
  I syntactic parsing
  I translation
  I . . .
I might not be necessary for limited tasks where we know:
  I morphology?
  I sentence splitting and tokenization?
  I identification of limited classes of names, dates and times?
I which is more expensive in your case? knowledge or data?
don't forget your linguistic intuitions!
machine learning automates some tasks, but we still need our brains:
I defining the tasks and terminology
I annotating training and testing data
I having an intuition about which features may be useful can be crucial
I in general, features are more important than the choice of learning algorithm
I error analysis
I defining constraints to guide the learner
  I valency lexicons can be used in parsers
  I grammar-based parsers with ML-trained disambiguators
example: is the patient diabetic?
I in order to predict, we make some measurements of properties we believe will be useful
I these are called the features
attributes/values or bag of words
I we often represent the features as attributes with values
  I in practice, as a Python dict

features = {"gender": "male",
            "age": 37,
            "blood_pressure": 130,
            ... }

I sometimes, it's easier just to see the features as a list of e.g. words (bag of words)

features = ["here", "are", "some", "words",
            "in", "a", "document"]
examples of ML in NLP: document classification
I in a previous course, you have implemented a classifier of documents
I many document classifiers use the words of the documents as their features (bag of words)
I . . . but we could also add other features such as the presence of smileys or negations
I Günther, Sentiment Analysis of Microblogs, MLT Master's Thesis, 2013
examples of ML in NLP: difficulty level classification
I what learner level (e.g. according to CEFR) do you need to understand the following Swedish sentences?
  I Flickan sover. → A1
  I Under förberedelsetiden har en baslinjestudie utförts för att kartlägga bland annat diabetesärftlighet, oral glukostolerans, mat och konditionsvanor och socialekonomiska faktorer i åldrarna 35–54 år i flera kommuner inom Nordvästra och Sydöstra sjukvårdsområdena (NVSO resp SÖSO). → C2
I Pilán, NLP-based Approaches to Sentence Readability for Second Language
Learning Purposes, MLT Master's Thesis, 2013
examples of ML in NLP: coreference resolution
I do two given noun phrases refer to the same real-world entity?
I Soon et al. A Machine Learning Approach to Coreference
Resolution of Noun Phrases, Comp. Ling. 2001
examples of ML in NLP: named entity recognition
[ORG United Nations] official [PER Ekeus] heads for [LOC Baghdad].
examples of ML in NLP: named entity recognition
I Zhang and Johnson A Robust Risk Minimization based Named
Entity Recognition System, CoNLL 2003
-20pt
two interesting seminars this week
I this week, Joel Tetreault (Yahoo! Research) is visiting us and will be giving seminars at the following times:
  I Thursday, 15.15–16.00, L308: Hate Speech Detection
  I Friday, 10.30–11.30, T307: Automated Grammatical Error Correction for Language Learners
machine learning in NLP research
I ACL, EMNLP, Coling, etc. are heavily dominated by ML-focused papers
what goes on when we "learn"?
I the learning algorithm observes the examples in the training set
I it tries to find common patterns that explain the data: it generalizes
what is generalization?
I imagine a rote learner:
  I store each example in memory
  I at prediction time: if the input was seen before, return the corresponding output; otherwise return "I don't know"
I is this any good? probably not, since this learner does not try to generalize
I generalization = coming up with a hypothesis that says something about what we haven't seen
I exactly how this is done depends on what algorithm we are using
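a minimal sketch of such a rote learner (the class name and interface are invented for illustration; inputs are assumed to be hashable):

class RoteLearner:
    # memorizes training examples and cannot generalize
    def __init__(self):
        self.memory = {}

    def train(self, examples):
        # examples: list of (input, output) pairs
        for x, y in examples:
            self.memory[x] = y

    def predict(self, x):
        # known input: return the memorized answer; otherwise give up
        return self.memory.get(x, "I don't know")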
"knowledge" from experience?
I given some experience, when is it justified to say that we have gained "knowledge"?
learning algorithms that we have seen so far
I we have already seen a number of learning algorithms in the course on statistical methods:
  I Naive Bayes
  I perceptron
  I hidden Markov models
perceptron revisited
I the perceptron learning algorithm creates a weight table
I each weight in the table corresponds to a feature
  I e.g. "fine" probably has a high positive weight in sentiment analysis
  I "boring" a negative weight
  I "and" a weight near zero
I classification is carried out by summing the weights for each feature
def perceptron_classify(features, weights):
    # sum the weights of all features present in the input
    score = 0
    for f in features:
        score += weights.get(f, 0)
    if score >= 0:
        return "pos"
    else:
        return "neg"
the perceptron learning algorithm
I start with an empty weight table
I classify according to the current weight table
I each time we misclassify, change the weight table a bit
  I if a positive instance was misclassified, add 1 to the weight of each feature in the document
  I and conversely . . .
def perceptron_learn(examples, number_iterations):
    # examples: list of (label, features) pairs
    weights = {}
    for iteration in range(number_iterations):
        for label, features in examples:
            guess = perceptron_classify(features, weights)
            if label == "pos" and guess == "neg":
                # misclassified positive: push the feature weights up
                for f in features:
                    weights[f] = weights.get(f, 0) + 1
            elif label == "neg" and guess == "pos":
                # misclassified negative: push the feature weights down
                for f in features:
                    weights[f] = weights.get(f, 0) - 1
    return weights
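a toy training run (the examples are invented; each example is a (label, bag-of-words) pair, as assumed by the code above):

training_examples = [("pos", ["a", "fine", "movie"]),
                     ("neg", ["a", "boring", "movie"]),
                     ("pos", ["fine", "acting"]),
                     ("neg", ["boring", "and", "predictable"])]
weights = perceptron_learn(training_examples, number_iterations=10)
print(perceptron_classify(["fine", "but", "predictable"], weights))
# "fine" ends up with a positive weight, so this prints "pos"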
estimation in Naive Bayes, revisited
I Naive Bayes:

  P(document, label) = P(f1, . . . , fn, label)
                     = P(label) · P(f1, . . . , fn | label)
                     = P(label) · P(f1 | label) · . . . · P(fn | label)

  (the last step is the naive independence assumption: the features are assumed independent given the label)

I how do we estimate the probabilities?
  I maximum likelihood: set the probabilities so that the probability of the data is maximized
estimation in Naive Bayes: supervised case
I how do we estimate P(positive)?

  P_MLE(positive) = count(positive) / count(all) = 2/4

I how do we estimate P("nice" | positive)?

  P_MLE("nice" | positive) = count("nice", positive) / count(any word, positive) = 2/7
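these counts translate directly into code; a minimal sketch, assuming (label, word list) training pairs as in the perceptron example (the function name is invented):

from collections import Counter

def mle_estimates(examples):
    # examples: list of (label, word_list) pairs
    label_counts = Counter()        # count(label)
    word_label_counts = Counter()   # count(word, label)
    words_per_label = Counter()     # count(any word, label)
    for label, words in examples:
        label_counts[label] += 1
        for w in words:
            word_label_counts[(w, label)] += 1
            words_per_label[label] += 1
    total = sum(label_counts.values())
    p_label = {l: c / total for l, c in label_counts.items()}
    p_word = {(w, l): c / words_per_label[l]
              for (w, l), c in word_label_counts.items()}
    return p_label, p_word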
representation of the prediction function
we may represent our prediction function in different ways:
I numerical models:
  I weight or probability tables
  I networked models
I rules:
  I decision trees
  I transformation rules
example: the prediction function as numbers
def sentiment_is_positive(features):
    score = 0.0
    score += 2.1 * features["wonderful"]
    score += 0.6 * features["good"]
    ...
    score -= 0.9 * features["bad"]
    score -= 3.1 * features["awful"]
    ...
    if score > 0:
        return True
    else:
        return False
example: the prediction function as rules
def patient_is_sick(features):
    if features["systolic_blood_pressure"] > 140:
        return True
    if features["gender"] == "m":
        if features["psa"] > 4.0:
            return True
    ...
    return False
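rule representations like this need not be written by hand: decision-tree learners induce them from data. A minimal sketch with scikit-learn (the feature values and labels are invented for illustration):

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X = [{"systolic_blood_pressure": 150, "gender": "m", "psa": 2.0},
     {"systolic_blood_pressure": 120, "gender": "f", "psa": 0.0},
     {"systolic_blood_pressure": 130, "gender": "m", "psa": 5.5}]
y = [True, False, True]

# vectorize the dicts, then fit a decision tree to the labels
model = make_pipeline(DictVectorizer(), DecisionTreeClassifier())
model.fit(X, y)
print(model.predict([{"systolic_blood_pressure": 125, "gender": "m", "psa": 4.5}]))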
machine learning software
I general-purpose software, large collections of algorithms:
  I scikit-learn: http://scikit-learn.org
    I Python library (will be used in this course)
  I Weka: http://www.cs.waikato.ac.nz/ml/weka
    I Java library with a nice user interface
I NLTK includes some learning algorithms but seems to be discontinuing them in favor of scikit-learn
I special-purpose software, small collections of algorithms:
  I LibSVM/LibLinear
  I CRF++, CRFSGD
  I . . .
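a taste of the scikit-learn interface (a sketch; the documents and labels are invented):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import make_pipeline

docs = ["a fine movie", "fine acting", "boring and predictable"]
labels = ["pos", "pos", "neg"]

# bag-of-words features plus a perceptron classifier, as discussed above
model = make_pipeline(CountVectorizer(), Perceptron())
model.fit(docs, labels)
print(model.predict(["a boring movie"]))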
evaluation
I how do we evaluate our systems?
  I intrinsic evaluation: test the performance in isolation
  I extrinsic evaluation: I changed my POS tagger; how does this change the performance of my parser? how much more money do I make?
I common measures in intrinsic evaluation:
  I classification accuracy
  I precision and recall (for needle-in-a-haystack problems)
I also several other task-dependent measures
I remember to test for statistical signi�cance of your results!
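as a sketch, these measures computed with scikit-learn (the gold and predicted labels are invented):

from sklearn.metrics import accuracy_score, precision_score, recall_score

gold = ["pos", "neg", "pos", "neg", "pos"]
pred = ["pos", "pos", "pos", "neg", "neg"]

print(accuracy_score(gold, pred))                    # 3 of 5 correct = 0.6
print(precision_score(gold, pred, pos_label="pos"))  # 2 of 3 predicted "pos" are correct
print(recall_score(gold, pred, pos_label="pos"))     # 2 of 3 gold "pos" were found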
assignment 1: feature design for function tagging
[parse-tree figure: "Smoking cigarettes is bad for you", with POS and constituent labels, annotated with the function tags SBJ (subject) and PRD (predicative)]
I probably, the most important step when using machine learning in NLP is to design useful features
I that is your job in this assignment (a hypothetical illustration follows below)
I please check the assignment web page before the lab session
  I in particular, please read the paper Chrupała et al. (2007), Better Training for Function Labeling (at least the introduction and the description of the features)
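for a flavor of what feature design can look like, a hypothetical extractor for a constituent in a parse tree; the node interface and the feature names here are invented for illustration and are not the feature set of Chrupała et al.:

def function_tag_features(node):
    # node is assumed to expose .label, .head_word, .parent, .left_sibling
    features = {"label": node.label,      # e.g. "NP"
                "head": node.head_word,   # e.g. "cigarettes"
                "parent_label": node.parent.label if node.parent else "ROOT"}
    if node.left_sibling is not None:
        features["left_sibling"] = node.left_sibling.label
    return features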
assignment 2: classifier implementation
I read a paper about a simple algorithm for training the support vector machine classifier
I write code to implement the algorithm, similar to the algorithms in scikit-learn
assignment 3: learning for tagging and parsing
I implement the structured perceptron learning algorithm
I apply it to named entity recognition and dependency parsing
[ORG United Nations] official [PER Ekeus] heads for [LOC Baghdad].
[dependency-parsing example: "yesterday she gave the horse an apple"]
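a compact sketch of the structured perceptron update; predict and extract_features are placeholders you would supply (e.g. a Viterbi tagger and a feature extractor for tag sequences):

def structured_perceptron(examples, predict, extract_features, n_iterations):
    # examples: list of (input, gold_structure) pairs
    weights = {}
    for _ in range(n_iterations):
        for x, gold in examples:
            guess = predict(x, weights)
            if guess != gold:
                # reward the features of the gold structure ...
                for f in extract_features(x, gold):
                    weights[f] = weights.get(f, 0) + 1
                # ... and penalize the features of the guessed structure
                for f in extract_features(x, guess):
                    weights[f] = weights.get(f, 0) - 1
    return weights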
independent work
I select a topic of interest (or ask me for ideas)
I define a small project
I write code, carry out experiments
I write a short paper, present it at a seminar at the end of the
course