TAL: NLP Tasks. An Introduction to the Classification of Text Documents
Vincent Guigue
INTRODUCTION: different tasks in textual data analysis
What is text?
- A sequence of letters: l e _ c h a t _ e s t ...
- A sequence of words: le chat est ...
- A set of words, in alphabetical order: chat est le ...

At least one source worth checking: C. Manning, Stanford: https://nlp.stanford.edu/cmanning/
http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
http://web.stanford.edu/class/cs224n/
Basic tasks

Grammatical analysis:
- Part-Of-Speech (POS) tagging: noun, proper noun, determiner, verb...
- NER = Named Entity Recognition: detection of proper nouns, work on co-references
- SRL = Semantic Role Labeling: subject, verb, complements...
- Co-reference resolution: "Le chat est dans le jardin, il mange un morceau de jambon." ("The cat is in the garden, it is eating a piece of ham.")

Thematic / semantic analysis:
- Building a metric between words
- Thematic classification of documents: e.g. football, scientific article, political analysis
- Sentiment classification: positive, negative, neutral, aggressive, ...

⇒ These tasks are very varied and operate at different scales: words, sentences, documents.
Higher-level tasks
- Topic detection & tracking
- Machine translation
- Question Answering
- Information extraction
- Text generation
Illustrations

Part-of-Speech (POS) tagging
- Tags: ADJECTIVE, NOUN, PREPOSITION, VERB, ADVERB, ARTICLE...
- Example: Bob drank coffee at Starbucks ⇒ Bob (NOUN) drank (VERB) coffee (NOUN) at (PREPOSITION) Starbucks (NOUN).

Named Entity Recognition (NER)
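A quick hands-on illustration (not from the slides): a minimal POS tagging and NER sketch with NLTK. It assumes the NLTK data packages listed in the comments have been downloaded, and it outputs NLTK's Penn Treebank tags (NNP, VBD, ...) rather than the coarse tags above.

# Minimal POS + NER sketch with NLTK (the download names below are the classic
# NLTK package names; newer NLTK versions may use language-suffixed variants).
import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

tokens = nltk.word_tokenize("Bob drank coffee at Starbucks")
tags = nltk.pos_tag(tokens)    # e.g. [('Bob', 'NNP'), ('drank', 'VBD'), ...]
print(tags)
print(nltk.ne_chunk(tags))     # chunks named entities: PERSON, ORGANIZATION, GPE...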
Illustrations (continued)

Parsing (figure credits: CoreNLP, Stanford NLP)
Semantic Role Labeling
Information Extraction
Question answering (QA)
Information Extraction example (from Dan Jurafsky's slides). From the e-mail:

  Subject: curriculum meeting
  Date: January 15, 2012
  To: Dan Jurafsky

  Hi Dan, we've now scheduled the curriculum meeting.
  It will be in Gates 159 tomorrow from 10:00-11:30.
  -Chris

the system creates a new calendar entry:
  Event: Curriculum mtg; Date: Jan-16-2012; Start: 10:00am; End: 11:30am; Where: Gates 159
Illustrations (continued)

Information extraction & sentiment analysis (from Dan Jurafsky's slides). From camera reviews such as:
- "nice and compact to carry!"
- "since the camera is small and light, I won't need to carry around those heavy, bulky professional cameras either!"
- "the camera feels flimsy, is plastic and very light in weight; you have to be very delicate in the handling of this camera"
the system aggregates per-review judgments (✓ / ✗) on the attribute "size and weight", among the attributes: zoom, affordability, size and weight, flash, ease of use.
Illustrations (continued)

Machine translation:
- Aligning words
- Generating an intelligible / plausible sentence

History (evolving fast):
- word-by-word translation
- sequence translation
- translation of knowledge / meaning
A quick recap (before we start!)

The state of language technology (from Dan Jurafsky's slides):

Mostly solved:
- Spam detection: "Let's go to Agra!" ✓ / "Buy V1AGRA ..." ✗
- Part-of-speech (POS) tagging: "Colorless green ideas sleep furiously." → ADJ ADJ NOUN VERB ADV
- Named entity recognition (NER): "Einstein met with UN officials in Princeton" → PERSON, ORG, LOC

Making good progress:
- Sentiment analysis: "Best roast chicken in San Francisco!" / "The waiter ignored us for 20 minutes."
- Coreference resolution: "Carter told Mubarak he shouldn't run again."
- Word sense disambiguation (WSD): "I need new batteries for my mouse."
- Parsing: "I can see Alcatraz from the window!"
- Machine translation (MT): a Chinese source sentence → "The 13th Shanghai International Film Festival ..."
- Information extraction (IE): "You're invited to our dinner party, Friday May 27 at 8:30" → add "Party, May 27" to the calendar

Still really hard:
- Question answering (QA): "Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?"
- Paraphrase: "XYZ acquired ABC yesterday" ⇔ "ABC has been taken over by XYZ"
- Summarization: "The Dow Jones is up", "Housing prices rose", "The S&P500 jumped" → "Economy is good"
- Dialog: "Where is Citizen Kane playing in SF?" → "Castro Theatre at 7:30. Do you want a ticket?"
Some difficulties to spice up the course

Why else is natural language understanding difficult? (from Dan Jurafsky's slides)
- Non-standard English: "Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either"
- Segmentation issues: "the New York-New Haven Railroad" (is it New York / New Haven, or York-New?)
- Idioms: dark horse, get cold feet, lose face, throw in the towel
- Neologisms: unfriend, retweet, bromance
- Tricky entity names: "Where is A Bug's Life playing ...", "Let It Be was recorded ...", "... a mutation on the for gene ..."
- World knowledge: "Mary and Sue are sisters." vs. "Mary and Sue are mothers."
But that's what makes it fun!
Text modeling(s)

Standard processing pipeline:
1. Preprocessing: encoding (latin, utf8, ...), punctuation, stemming, lemmatization, tokenization, lower/upper case, regex...
2. Formatting: construction of a dictionary (index) + inverted index (to explain the processing results); vector representation; preservation of sequences
3. Processing: classification of documents, words, sentences; semantics... Perceptron or HMM?
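To make step 1 concrete, here is a minimal preprocessing sketch (an illustration, not the course's exact chain): lowercasing, punctuation removal with a regex, naive whitespace tokenization, then dictionary construction.

# Toy preprocessing chain: normalization + tokenization + dictionary (index).
import re
from collections import Counter

def preprocess(doc):
    doc = doc.lower()                   # lower/upper case normalization
    doc = re.sub(r"[^\w\s]", " ", doc)  # strip punctuation
    return doc.split()                  # naive whitespace tokenization

docs = ["The lion does not live in the jungle", "Lions eat big preys"]
tokens = [preprocess(d) for d in docs]
vocab = sorted({w for t in tokens for w in t})
dictionary = {w: i for i, w in enumerate(vocab)}   # word -> index
counts = [Counter(t) for t in tokens]              # per-document counts
print(dictionary, counts)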
BOW: Bag Of Words

Handling textual data: the classification case
1. Big corpus ⇔ huge vocabulary ⇒ Perceptron, SVM, Naive Bayes... boosting, bagging... distributed & efficient algorithms
2. Sentence structure is hard to model ⇒ remove the structure...
3. Words are polymorphous: singular/plural, masculine/feminine ⇒ several approaches... (see below)
4. Machine learning + large dimensionality = problems ⇒ remove useless words
Bag of words

Sentence structure = costly handling ⇒ elimination!
Thus: document = set of words + counts

Bag-of-words representation:
1. Extraction of the vocabulary V
2. Each document becomes a counting vector: $d \in \mathbb{N}^{|V|}$

Note: d is always a sparse vector, mainly composed of 0s.
Example

Set of toy documents:

documents = ['The lion does not live in the jungle',
             'Lions eat big preys',
             'In the zoo, the lion sleep',
             'Self-driving cars will be autonomous in towns',
             'The future car has no steering wheel',
             'My car already has sensors and a camera']

Dictionary: [figure: dictionary heat map; green level ∝ number of occurrences]
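A sketch of the same example with scikit-learn's CountVectorizer (vocabulary extraction + counting vectors stored as a sparse matrix). Note that its default preprocessing lowercases and drops one-character tokens, so the extracted dictionary differs slightly from the raw one in the figure.

from sklearn.feature_extraction.text import CountVectorizer

documents = ['The lion does not live in the jungle',
             'Lions eat big preys',
             'In the zoo, the lion sleep',
             'Self-driving cars will be autonomous in towns',
             'The future car has no steering wheel',
             'My car already has sensors and a camera']

vectorizer = CountVectorizer()            # default: lowercasing + word tokenization
X = vectorizer.fit_transform(documents)   # sparse counting matrix, shape (6, |V|)
print(vectorizer.get_feature_names_out()) # the extracted dictionary
print(X.toarray())                        # dense view: fine for a toy corpus only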
Information coding

Counting the words appearing in 2 documents:
d1 = "The lion does not live in the jungle"
d2 = "In the zoo, the lion sleep"

word     d1  d2
The       1   0
does      1   0
in        1   0
jungle    1   0
lion      1   1
live      1   0
not       1   0
the       1   2
In        0   1
sleep     0   1
zoo,      0   1

All other dictionary entries (Lions, big, eat, preys, Self-driving, autonomous, be, cars, towns, will, car, future, has, no, steering, wheel, My, a, already, and, camera, sensors) count 0 in both documents.

+ We are able to vectorize textual information
− The dictionary requires preprocessing
Word representation & semantic gap

All words are orthogonal. Consider 2 virtual documents, each made of a single word:
$d_i = (0, \ldots, 0, d_{ik} > 0, 0, \ldots, 0)$
$d_j = (0, \ldots, 0, d_{jk'} > 0, 0, \ldots, 0)$
Then: $k \neq k' \Rightarrow d_i \cdot d_j = 0$
... even if $w_k$ = lion and $w_{k'}$ = lions.

⇒ Definition of the semantic gap: no metric between words.
Semantic issue

Understanding documents = matching relevant descriptors
- Syntactic difference ⇒ orthogonality of the representation vectors
- Word groups: more intrinsic semantics... but fewer matches with other documents
- N-grams ⇒ dictionary size ↗
- N-grams = great potential... but require careful preprocessing

Example: This film was not interesting
- Unigrams: this, film, was, not, interesting
- Bigrams: this_film, film_was, was_not, not_interesting
- N-grams... + combinations: e.g. 1-3 grams
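A small sketch of n-gram extraction; scikit-learn's ngram_range makes the "1-3 grams" combination a one-liner.

from sklearn.feature_extraction.text import CountVectorizer

doc = ["This film was not interesting"]
for rng in [(1, 1), (2, 2), (1, 3)]:      # unigrams, bigrams, 1-3 grams
    v = CountVectorizer(ngram_range=rng).fit(doc)
    print(rng, v.get_feature_names_out())
# (2, 2) yields (in alphabetical order): 'film was', 'not interesting', 'this film', 'was not'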
Implementation issues

How many unique words in a corpus of 10k movie reviews?

Example: "Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

104,077 unique words...
10^4 documents × 10^5 words × 4 bytes = 4·10^9 bytes ⇒ 4 GB...
against 100 MB of raw textual data. How to improve?
Sparse coding / hash table ⇒ 0s are no longer encoded.
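A sketch of the memory argument with scipy's sparse matrices; the sizes are the indicative figures above, and the random fill pattern stands in for real word counts.

import numpy as np
from scipy.sparse import csr_matrix

n_docs, vocab = 10_000, 100_000   # a dense float32 matrix would need ~4 GB
rng = np.random.default_rng(0)
nnz = 1_000_000                   # ~100 nonzero counts per document
rows = rng.integers(0, n_docs, size=nnz)
cols = rng.integers(0, vocab, size=nnz)
X = csr_matrix((np.ones(nnz, dtype=np.float32), (rows, cols)), shape=(n_docs, vocab))
mem = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"{mem / 1e6:.0f} MB instead of {n_docs * vocab * 4 / 1e9:.0f} GB")
# Beware: X.toarray(), or an implicit conversion inside a library, brings back the 4 GB.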
Implementation issues (2)

- Hash table... ⇒ no operators!
- Higher-level sparse coding = sparse matrix. Several implementations: key = row/column indexing, cell indexing, or block matrices.
- Sparse matrices are rather well integrated... but beware: if your program behaves strangely (e.g. in scikit-learn), there may be an implicit conversion to a full matrix inside.
Text modeling

Classical approach: bag of words (bag-of-words, BoW)

+ Advantages of BoW:
- rather simple, rather lightweight
- fast (real-time systems, IR, natural indexing...)
- many enrichment possibilities (POS, context coding, N-grams...)
- well suited to document classification
- efficient existing implementations: nltk, sklearn

− Drawback(s) of BoW:
- loss of sentence/document structure
⇒ several tasks become hard to tackle: NER, POS tagging, SRL, text generation
Sequential data processing

Handling sequences better
1. Enriching the vector description: N-grams, describing word context... Typical use: improving document-level classification tasks.
2. Sliding-window approach: making decisions at the intra-document scale.
   - Fixed window size ⇒ a vector description remains possible
   - Classifier on a local representation of the text
   - Signal processing (AR, ARMA...)
   - Pattern detection (frequent itemsets, association rules)
3. Sequence models:
   - Hidden Markov Models (HMM; in French: Modèles de Markov Cachés, MMC) [figure: the dependencies in an HMM; credit: T. Artières, UPMC]
   - CRF (Conditional Random Fields): a discriminative approach
A history of approaches to POS, SRL, NER

- Association-rule modeling (1980s): which co-occurrences between a POS and an item in its context are frequent? ⇒ rules
- Bayesian modeling: for a POS $i$, model the distribution of the context, $p(\text{context} \mid \theta_i)$; maximum-likelihood decision: $\arg\max_i p(\text{context} \mid \theta_i)$
- Structured extension (Hidden Markov Models), from 1985-90: HMM taggers are fast and achieve precision/recall scores of about 93-95%
- Towards discriminative modeling (CRF), from 2001
- Recurrent Neural Networks (cf. the ARF and AS courses), from 2010
NLP / ML: a lot in common

Linked funding (and not always glorious):
- MUC / TREC conferences (...): IR, information extraction, document classification, sentiment, QA; multiple domains: general, medical, patents, legal... ⇒ building datasets, centralizing results, exchanges
- NSA, "black boxes" (the 2015 French intelligence law)

ML advances driven by NLP: HMM, CRF

Many ambitious projects:
- Google Knowledge Graph
- NELL: Never-Ending Language Learning (Tom Mitchell, CMU)
HMM

Sequential formalization
Observations: Le chat est dans le salon
Labels: DET NN VBZ ...

The HMM is relevant here: it is based on
- label-to-label transitions,
- observation probabilities.
Notations

As always, the Markov chain is composed of:
- a sequence of states $S = (s_1, \ldots, s_T)$,
- whose values are drawn from a finite set $Q = (q_1, \ldots, q_N)$.

As always, the model is defined by $\{\Pi, A\}$:
- $\pi_i = P(s_1 = q_i)$
- $a_{ij} = p(s_{t+1} = q_j \mid s_t = q_i)$

The observations are modeled from the $s_t$:
- observation sequence: $X = (x_1, \ldots, x_T)$
- probability law: $b_j(t) = p(x_t \mid s_t = q_j)$; B can be discrete or continuous

HMM: $\lambda = \{\Pi, A, B\}$
What we manipulate

- Observation sequence: $X = (x_1, \ldots, x_T)$
- State sequence (hidden = missing): $S = (s_1, \ldots, s_T)$
Reminder: the structure of an HMM

Components of an HMM:
[figure: trellis with N hidden states; state sequence S_1, S_2, ..., S_T above the observations X_1, X_2, ..., X_T, annotated with Π (initial distribution), A (transitions) and B (emissions)]
- Order-1 assumption: each state depends only on the previous one.
- Each observation depends only on the current state.

The states are unknown... and the combinatorics to consider is problematic!
The three problems of HMMs (Ferguson, Rabiner)

- Evaluation: given λ, compute $p(x_1^T \mid \lambda)$
- Decoding: given λ, which state sequence generated the observations?
  $s_1^{T\star} = \arg\max_{s_1^T} p(x_1^T, s_1^T \mid \lambda)$
- Learning: from a series of observations, find $\lambda^\star$:
  $\lambda^\star = \{\Pi^\star, A^\star, B^\star\} = \arg\max_{s_1^T, \lambda} p(x_1^T, s_1^T \mid \lambda)$
PB1: the forward algorithm (dynamic programming)

$\alpha_t(i) = p(x_1^t, s_t = i \mid \lambda)$

- Initialization: $\alpha_1(i) = p(x_1, s_1 = i \mid \lambda) = \pi_i b_i(x_1)$
- Iteration: $\alpha_t(j) = \left[ \sum_{i=1}^N \alpha_{t-1}(i)\, a_{ij} \right] b_j(x_t)$
- Termination: $p(x_1^T \mid \lambda) = \sum_{i=1}^N \alpha_T(i)$

Complexity linear in T; usually T >> N.
[figure: forward recursion on the state/observation trellis]
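A numpy sketch of this recursion on a toy discrete HMM (the model values below are illustrative assumptions).

import numpy as np

def forward(pi, A, B, x):
    """pi: (N,) initial probs; A: (N,N) transitions a_ij; B: (N,V) emissions b_j(o);
    x: list of observation indices. Returns p(x_1^T | lambda)."""
    alpha = pi * B[:, x[0]]              # initialization: alpha_1(i) = pi_i b_i(x_1)
    for t in range(1, len(x)):
        alpha = (alpha @ A) * B[:, x[t]] # iteration: [sum_i alpha_{t-1}(i) a_ij] b_j(x_t)
        # in practice, rescale alpha here (or work in log space) to avoid underflow
    return alpha.sum()                   # termination: sum_i alpha_T(i)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(forward(pi, A, B, [0, 1, 1]))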
PB2: Viterbi (recap)

$\delta_t(i) = \max_{s_1^{t-1}} p(s_1^{t-1}, s_t = i, x_1^t \mid \lambda)$

1. Initialization: $\delta_1(i) = \pi_i b_i(x_1)$, $\Psi_1(i) = 0$
2. Recursion: $\delta_t(j) = \left[ \max_i \delta_{t-1}(i)\, a_{ij} \right] b_j(x_t)$, $\Psi_t(j) = \arg\max_{i \in [1, N]} \delta_{t-1}(i)\, a_{ij}$
3. Termination: $S^\star = \max_i \delta_T(i)$
4. Path: $q_T^\star = \arg\max_i \delta_T(i)$, $q_t^\star = \Psi_{t+1}(q_{t+1}^\star)$

[figure: Viterbi recursion on the state/observation trellis]
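A matching numpy sketch of Viterbi decoding, written in log space to avoid underflow (same toy model as the forward sketch).

import numpy as np

def viterbi(pi, A, B, x):
    T, N = len(x), len(pi)
    delta = np.log(pi) + np.log(B[:, x[0]])  # initialization (log domain)
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)  # scores[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)       # best predecessor for each state j
        delta = scores.max(axis=0) + np.log(B[:, x[t]])
    path = [int(delta.argmax())]             # termination + backtracking
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(viterbi(pi, A, B, [0, 1, 1]))          # most likely state sequence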
PB3: learning HMMs

Simplified version (hard assignment), k-means-like. We already have:
- Evaluation: $p(x_1^T \mid \lambda)$
- Decoding: $s_1^{T\star} = \arg\max_{s_1^T} p(x_1^T, s_1^T \mid \lambda)$

Proposed algorithm:
  Data: observations X; structure = N, K
  Result: $\Pi^\star, A^\star, B^\star$
  Initialize $\lambda_0 = \{\Pi_0, A_0, B_0\}$ (carefully, if possible); t = 0;
  while convergence is not reached do
    $S^{t+1}$ = decode(X, $\lambda_t$);
    $\lambda_{t+1} = \{\Pi_{t+1}, A_{t+1}, B_{t+1}\}$ obtained by counting the transitions;
    t = t + 1;
  end
Algorithm 1: simplified Baum-Welch for learning an HMM

You already have all the pieces to implement this!
Learning in the supervised setting

Observations: Le chat est dans le salon
Labels: DET NN VBZ ...

Much simpler (after the costly labeling task):
- the matrices A, B, Π are obtained by counting...
- inference = Viterbi

Philosophy & limits:
- find the labeling that maximizes the likelihood of the state-observation sequence...
- ... under the HMM assumptions (independence of the observations given the states; order 1).
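A sketch of this counting-based supervised estimation (the toy tag/word sets are hypothetical, and the add-one smoothing is an extra assumption, not on the slide).

import numpy as np

tags = ["DET", "NN", "VBZ", "PREP"]
words = ["le", "chat", "est", "dans", "salon"]
corpus = [(["le", "chat", "est", "dans", "le", "salon"],
           ["DET", "NN", "VBZ", "PREP", "DET", "NN"])]   # (observations, labels)

N, V = len(tags), len(words)
Pi, A, B = np.ones(N), np.ones((N, N)), np.ones((N, V))  # add-one smoothed counts
for x, y in corpus:
    s = [tags.index(t) for t in y]
    o = [words.index(w) for w in x]
    Pi[s[0]] += 1
    for t in range(1, len(s)):
        A[s[t - 1], s[t]] += 1                           # transition counts
    for st, ot in zip(s, o):
        B[st, ot] += 1                                   # emission counts
Pi /= Pi.sum()
A /= A.sum(axis=1, keepdims=True)
B /= B.sum(axis=1, keepdims=True)                        # inference is then Viterbi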
HMM ⇒ CRF

From "An Introduction to Conditional Random Fields", Sutton & McCallum:

[Figure 1.2: diagram of the relationship between naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative models, and general CRFs: naive Bayes → HMMs (SEQUENCE) → generative directed models (GENERAL GRAPHS); naive Bayes → logistic regression (CONDITIONAL), logistic regression → linear-chain CRFs (SEQUENCE) → general CRFs (GENERAL GRAPHS)]

"Furthermore, even when naive Bayes has good classification accuracy, its probability estimates tend to be poor. To understand why, imagine training naive Bayes on a data set in which all the features are repeated, that is, x = (x_1, x_1, x_2, x_2, ..., x_K, x_K). This will increase the confidence of the naive Bayes probability estimates, even though no new information has been added to the data. Assumptions like naive Bayes can be especially problematic when we generalize to sequence models, because inference essentially combines evidence from different parts of the model. If probability estimates at a local level are overconfident, it might be difficult to combine them sensibly.

Actually, the difference in performance between naive Bayes and logistic regression is due only to the fact that the first is generative and the second discriminative; the two classifiers are, for discrete input, identical in all other respects. Naive Bayes and logistic regression consider the same hypothesis space, in the sense that any logistic regression classifier can be converted into a naive Bayes classifier with the same decision boundary, and vice versa. Another way of saying this is that the naive Bayes model (1.5) defines the same family of distributions as the logistic regression model (1.7), if we interpret it generatively as

$p(y, \mathbf{x}) = \frac{\exp\{\sum_k \lambda_k f_k(y, \mathbf{x})\}}{\sum_{\tilde{y}, \tilde{\mathbf{x}}} \exp\{\sum_k \lambda_k f_k(\tilde{y}, \tilde{\mathbf{x}})\}}. \quad (1.9)$

This means that if the naive Bayes model (1.5) is trained to maximize the conditional likelihood, we recover the same classifier as from logistic regression. Conversely, if the logistic regression model is interpreted generatively, as in (1.9), and is trained to maximize the joint likelihood p(y, x), then we recover the same classifier as from naive Bayes. In the terminology of Ng and Jordan [2002], naive Bayes and logistic regression form a generative-discriminative pair. The principal advantage of discriminative modeling is that it is better suited to..."
CRF modeling

- Word sequence $\mathbf{x} = \{x_1, \ldots, x_T\}$
- Label sequence $\mathbf{y}$ (= POS tags)

Parametric estimation of the probabilities, based on the exponential family:
$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \prod_t \Psi_t(y_t, y_{t-1}, x_t)$, with $\Psi_t(y_t, y_{t-1}, x_t) = \exp\left( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \right)$

- The dependencies of $\Psi_t$ ⇒ the shape of the HMM model
- $\theta_k$: parameters to estimate (cf. logistic regression)
- $f_k(y_t, y_{t-1}, x_t)$: generic expression of the features (details below)
CRF = a generalization of HMMs

General case (previous slide):
$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \prod_t \exp\left( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \right)$

Special case: $f_k$ = indicator of the existence of $(y_t, y_{t-1})$ or $(y_t, x_t)$:
$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \prod_t \exp\left( \sum_{i,j \in S} \theta_{i,j} \mathbf{1}_{y_t = i \,\&\, y_{t-1} = j} + \sum_{i \in S, o \in O} \mu_{o,i} \mathbf{1}_{y_t = i \,\&\, x_t = o} \right)$

With:
- $\theta_{i,j} = \log p(y_t = i \mid y_{t-1} = j)$
- $\mu_{o,i} = \log p(x = o \mid y_t = i)$
- $Z = 1$

⇒ In this case, the features are binary (1/0).
CRF: moving to conditional probabilities

$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \prod_t \exp\left( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \right)$

⇒ $p(\mathbf{y} \mid \mathbf{x}) = \frac{\prod_t \exp\left( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \right)}{\sum_{\mathbf{y}'} \prod_t \exp\left( \sum_k \theta_k f_k(y'_t, y'_{t-1}, x_t) \right)} = \frac{1}{Z(\mathbf{x})} \prod_t \exp\left( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \right)$
CRF learning

$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_t \exp\left( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \right)$

As for logistic regression:
$\mathcal{L}_{cond} = \sum_n \log p(\mathbf{y}^n \mid \mathbf{x}^n) = \sum_n \sum_t \sum_k \theta_k f_k(\mathbf{x}^n, y^n_t, y^n_{t-1}) - \sum_n \log(Z(\mathbf{x}^n))$

How to optimize? Gradient ascent on $\partial \mathcal{L} / \partial \theta_k$:
$\theta_k \leftarrow \theta_k + \sum_{n,t} f_k(\mathbf{x}^n, y^n_t, y^n_{t-1}) - \sum_{n,t} \sum_{y'_t, y'_{t-1}} f_k(\mathbf{x}^n, y'_t, y'_{t-1})\, p(y'_t, y'_{t-1} \mid \mathbf{x}^n)$

Exact computation is possible in $O(T M^2 N)$; faster approximate solutions exist.
(M: number of labels, T: chain length, N: number of chains)

The difficulty lies essentially in the normalization factor $Z(\mathbf{x})$.
Regularization

What do you think of the model's complexity? How can it be limited?

$\mathcal{L}_{cond} = \sum_{n,k,t} \theta_k f_k(\mathbf{x}^n, y^n_t, y^n_{t-1}) - \sum_n \log(Z(\mathbf{x}^n))$

⇒ L2 penalty:
$\mathcal{L}_{cond} = \sum_{n,k,t} \theta_k f_k(\mathbf{x}^n, y^n_t, y^n_{t-1}) - \sum_n \log(Z(\mathbf{x}^n)) - \frac{1}{2\sigma^2} \|\theta\|^2$

or L1 penalty:
$\mathcal{L}_{cond} = \sum_{n,k,t} \theta_k f_k(\mathbf{x}^n, y^n_t, y^n_{t-1}) - \sum_n \log(Z(\mathbf{x}^n)) - \alpha \sum_k |\theta_k|$
And what about the f_k???

$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left\{ \sum_k \sum_{t=1}^T \theta_k f_k(\mathbf{x}, y_t, y_{t-1}) \right\}$

By default, the $f_k$ are features and are not learned. Examples:
- $f_1(\mathbf{x}, y_t, y_{t-1}) = 1$ if $y_t$ = ADVERB and the word ends in "-ly"; 0 otherwise. If the weight $\theta_1$ associated with this feature is large and positive, this feature essentially says that we prefer labelings where words ending in -ly get labeled as ADVERB.
- $f_2(\mathbf{x}, y_t, y_{t-1}) = 1$ if $t = 1$, $y_t$ = VERB, and the sentence ends in a question mark; 0 otherwise. Again, if the weight $\theta_2$ associated with this feature is large and positive, then labelings that assign VERB to the first word in a question (e.g., "Is this a sentence beginning with a verb?") are preferred.
- $f_3(\mathbf{x}, y_t, y_{t-1}) = 1$ if $y_{t-1}$ = ADJECTIVE and $y_t$ = NOUN; 0 otherwise. Again, a positive weight for this feature means that adjectives tend to be followed by nouns.

It is also possible to learn features from data (a sketch with a CRF library follows).
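Indicator features of exactly this form are what linear-chain CRF libraries expect. Below, a sketch with the third-party sklearn-crfsuite package (assumed installed; the tiny training set is hypothetical). Transition features like f_3 are generated automatically by the library, and c1/c2 are the L1/L2 regularization weights of the previous slide.

import sklearn_crfsuite

def token_features(sent, t):
    return {
        "word.lower": sent[t].lower(),
        "suffix-ly": sent[t].endswith("ly"),   # cf. f_1
        "is_first": t == 0,                    # cf. f_2 (position cue)
        "prev_word": sent[t - 1].lower() if t > 0 else "<BOS>",
    }

X_train = [[token_features(s, t) for t in range(len(s))]
           for s in [["He", "ran", "quickly"]]]
y_train = [["PRON", "VERB", "ADVERB"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))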
Feature engineering

- Label-observation features: the $f_k$ that can be written $f_k(\mathbf{x}, y_t) = \mathbf{1}_{y_t = c}\, q_k(\mathbf{x})$ are easier to handle (the $q_k$ are computed once and for all). E.g.: if x = word_i, x ends in -ing, x is capitalized... then 1, else 0.
- Node-observation features: even if it is less precise, try not to mix references to the observations and to the transitions:
  $f_k(\mathbf{x}, y_t, y_{t-1}) = q_k(\mathbf{x})\, \mathbf{1}_{y_t = c}\, \mathbf{1}_{y_{t-1} = c'}$ ⇒ $f_k(x_t, y_t) = q_k(\mathbf{x})\, \mathbf{1}_{y_t = c}$ and $f_{k+1}(y_t, y_{t-1}) = \mathbf{1}_{y_t = c}\, \mathbf{1}_{y_{t-1} = c'}$
- Boundary labels
Feature engineering (2)

- Unsupported features: generated automatically from the observations (e.g. "with" is not a city name)... but not very relevant. Use these features to disambiguate errors in the training set.
- Feature induction
- Features from different time steps
- Redundant features
- Complex features = model outputs
Inference

Process:
1. Define the features,
2. Learn the model parameters (θ),
3. Label new sentences (= inference): $\mathbf{y}^\star = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$

⇒ A solution very close to the Viterbi algorithm.
Inference (2)

$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_t \Psi_t(y_t, y_{t-1}, x_t)$, with $\Psi_t(y_t, y_{t-1}, x_t) = \exp\left( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \right)$

1. Bridge to HMMs:
   $\Psi_t(i, j, x) = p(y_t = j \mid y_{t-1} = i)\, p(x_t = x \mid y_t = j)$
   $\alpha_t(j) = \sum_i \Psi_t(i, j, x)\, \alpha_{t-1}(i)$, $\quad \beta_t(i) = \sum_j \Psi_t(i, j, x)\, \beta_{t+1}(j)$
   $\delta_t(j) = \max_i \Psi_t(i, j, x)\, \delta_{t-1}(i)$
2. These definitions remain valid for CRFs (cf. Sutton & McCallum), with:
   $Z(\mathbf{x}) = \sum_i \alpha_T(i) = \beta_0(y_0)$
Applications

- Sentence analysis: morpho-syntactic analysis
- NER: named entity recognition
  ... Mister George W. Bush arrived in Rome together with ...
  ... O      name   name name  O       O  city O        O   ...
- Moving to 2D... image analysis:
  - contour detection
  - object classification
  - features = spatial cohesion, usual/impossible transitions
  (figure credit: the DGM lib)