TAL: NLP Tasks. An Introduction to the Classification of Text Documents
Vincent Guigue
INTRODUCTION: different tasks in textual data analysis
What is text?
- A sequence of letters: l e _ c h a t _ e s t ...
- A sequence of words: le chat est ...
- A set of words, in alphabetical order: chat est le ...

At least one source worth checking: C. Manning, Stanford: https://nlp.stanford.edu/cmanning/
http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
http://web.stanford.edu/class/cs224n/
Basic tasks

Grammatical analysis:
- Part-Of-Speech (POS) tagging: noun, proper noun, determiner, verb...
- NER = Named Entity Recognition: detection of proper nouns, work on co-references
- SRL = Semantic Role Labeling: subject, verb, complements...
- Co-reference resolution: "Le chat est dans le jardin, il mange un morceau de jambon." ("The cat is in the garden, it is eating a piece of ham.")

Thematic / semantic analysis:
- Building a metric between words
- Thematic classification of documents: e.g. football, scientific article, political analysis
- Sentiment classification: positive, negative, neutral, aggressive, ...

⇒ These tasks are very varied and operate at different scales: words, sentences, documents.
Higher-level tasks
- Topic detection & tracking
- Machine translation
- Question Answering
- Information extraction
- Text generation
Illustrations

Part-of-Speech (POS) tagging
- Tags: ADJECTIVE, NOUN, PREPOSITION, VERB, ADVERB, ARTICLE...
- Example: Bob drank coffee at Starbucks ⇒ Bob (NOUN) drank (VERB) coffee (NOUN) at (PREPOSITION) Starbucks (NOUN).

Named Entity Recognition (NER)
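A quick hands-on illustration (not from the slides): a minimal POS tagging and NER sketch with NLTK. It assumes the NLTK data packages listed in the comments have been downloaded, and it outputs NLTK's Penn Treebank tags (NNP, VBD, ...) rather than the coarse tags above.

# Minimal POS + NER sketch with NLTK (the download names below are the classic
# NLTK package names; newer NLTK versions may use language-suffixed variants).
import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

tokens = nltk.word_tokenize("Bob drank coffee at Starbucks")
tags = nltk.pos_tag(tokens)    # e.g. [('Bob', 'NNP'), ('drank', 'VBD'), ...]
print(tags)
print(nltk.ne_chunk(tags))     # chunks named entities: PERSON, ORGANIZATION, GPE...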
Illustrations (continued)

Parsing (figure credits: CoreNLP, Stanford NLP)
Semantic Role Labeling
Information Extraction
Question answering (QA)
Information Extraction example (from Dan Jurafsky's slides). From the e-mail:

  Subject: curriculum meeting
  Date: January 15, 2012
  To: Dan Jurafsky

  Hi Dan, we've now scheduled the curriculum meeting.
  It will be in Gates 159 tomorrow from 10:00-11:30.
  -Chris

the system creates a new calendar entry:
  Event: Curriculum mtg; Date: Jan-16-2012; Start: 10:00am; End: 11:30am; Where: Gates 159
Illustrations (continued)

Information extraction & sentiment analysis (from Dan Jurafsky's slides). From camera reviews such as:
- "nice and compact to carry!"
- "since the camera is small and light, I won't need to carry around those heavy, bulky professional cameras either!"
- "the camera feels flimsy, is plastic and very light in weight; you have to be very delicate in the handling of this camera"
the system aggregates per-review judgments (✓ / ✗) on the attribute "size and weight", among the attributes: zoom, affordability, size and weight, flash, ease of use.
Illustrations (continued)

Machine translation:
- Aligning words
- Generating an intelligible / plausible sentence

History (evolving fast):
- word-by-word translation
- sequence translation
- translation of knowledge / meaning
A quick recap (before we start!)

The state of language technology (from Dan Jurafsky's slides):

Mostly solved:
- Spam detection: "Let's go to Agra!" ✓ / "Buy V1AGRA ..." ✗
- Part-of-speech (POS) tagging: "Colorless green ideas sleep furiously." → ADJ ADJ NOUN VERB ADV
- Named entity recognition (NER): "Einstein met with UN officials in Princeton" → PERSON, ORG, LOC

Making good progress:
- Sentiment analysis: "Best roast chicken in San Francisco!" / "The waiter ignored us for 20 minutes."
- Coreference resolution: "Carter told Mubarak he shouldn't run again."
- Word sense disambiguation (WSD): "I need new batteries for my mouse."
- Parsing: "I can see Alcatraz from the window!"
- Machine translation (MT): a Chinese source sentence → "The 13th Shanghai International Film Festival ..."
- Information extraction (IE): "You're invited to our dinner party, Friday May 27 at 8:30" → add "Party, May 27" to the calendar

Still really hard:
- Question answering (QA): "Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?"
- Paraphrase: "XYZ acquired ABC yesterday" ⇔ "ABC has been taken over by XYZ"
- Summarization: "The Dow Jones is up", "Housing prices rose", "The S&P500 jumped" → "Economy is good"
- Dialog: "Where is Citizen Kane playing in SF?" → "Castro Theatre at 7:30. Do you want a ticket?"
Some difficulties to spice up the course

Why else is natural language understanding difficult? (from Dan Jurafsky's slides)
- Non-standard English: "Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either"
- Segmentation issues: "the New York-New Haven Railroad" (is it New York / New Haven, or York-New?)
- Idioms: dark horse, get cold feet, lose face, throw in the towel
- Neologisms: unfriend, retweet, bromance
- Tricky entity names: "Where is A Bug's Life playing ...", "Let It Be was recorded ...", "... a mutation on the for gene ..."
- World knowledge: "Mary and Sue are sisters." vs. "Mary and Sue are mothers."
But that's what makes it fun!
Text modeling(s)

Standard processing pipeline:
1. Preprocessing: encoding (latin, utf8, ...), punctuation, stemming, lemmatization, tokenization, lower/upper case, regex...
2. Formatting: construction of a dictionary (index) + inverted index (to explain the processing results); vector representation; preservation of sequences
3. Processing: classification of documents, words, sentences; semantics... Perceptron or HMM?
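To make step 1 concrete, here is a minimal preprocessing sketch (an illustration, not the course's exact chain): lowercasing, punctuation removal with a regex, naive whitespace tokenization, then dictionary construction.

# Toy preprocessing chain: normalization + tokenization + dictionary (index).
import re
from collections import Counter

def preprocess(doc):
    doc = doc.lower()                   # lower/upper case normalization
    doc = re.sub(r"[^\w\s]", " ", doc)  # strip punctuation
    return doc.split()                  # naive whitespace tokenization

docs = ["The lion does not live in the jungle", "Lions eat big preys"]
tokens = [preprocess(d) for d in docs]
vocab = sorted({w for t in tokens for w in t})
dictionary = {w: i for i, w in enumerate(vocab)}   # word -> index
counts = [Counter(t) for t in tokens]              # per-document counts
print(dictionary, counts)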
BOW: Bag Of Words

Handling textual data: the classification case
1. Big corpus ⇔ huge vocabulary ⇒ Perceptron, SVM, Naive Bayes... boosting, bagging... distributed & efficient algorithms
2. Sentence structure is hard to model ⇒ remove the structure...
3. Words are polymorphous: singular/plural, masculine/feminine ⇒ several approaches... (see below)
4. Machine learning + large dimensionality = problems ⇒ remove useless words
Bag of words

Sentence structure = costly handling ⇒ elimination!
Thus: document = set of words + counts

Bag-of-words representation:
1. Extraction of the vocabulary V
2. Each document becomes a counting vector: $d \in \mathbb{N}^{|V|}$

Note: d is always a sparse vector, mainly composed of 0s.
Example

Set of toy documents:

documents = ['The lion does not live in the jungle',
             'Lions eat big preys',
             'In the zoo, the lion sleep',
             'Self-driving cars will be autonomous in towns',
             'The future car has no steering wheel',
             'My car already has sensors and a camera']

Dictionary: [figure: dictionary heat map; green level ∝ number of occurrences]
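A sketch of the same example with scikit-learn's CountVectorizer (vocabulary extraction + counting vectors stored as a sparse matrix). Note that its default preprocessing lowercases and drops one-character tokens, so the extracted dictionary differs slightly from the raw one in the figure.

from sklearn.feature_extraction.text import CountVectorizer

documents = ['The lion does not live in the jungle',
             'Lions eat big preys',
             'In the zoo, the lion sleep',
             'Self-driving cars will be autonomous in towns',
             'The future car has no steering wheel',
             'My car already has sensors and a camera']

vectorizer = CountVectorizer()            # default: lowercasing + word tokenization
X = vectorizer.fit_transform(documents)   # sparse counting matrix, shape (6, |V|)
print(vectorizer.get_feature_names_out()) # the extracted dictionary
print(X.toarray())                        # dense view: fine for a toy corpus only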
Information coding

Counting the words appearing in 2 documents:
d1 = "The lion does not live in the jungle"
d2 = "In the zoo, the lion sleep"

word     d1  d2
The       1   0
does      1   0
in        1   0
jungle    1   0
lion      1   1
live      1   0
not       1   0
the       1   2
In        0   1
sleep     0   1
zoo,      0   1

All other dictionary entries (Lions, big, eat, preys, Self-driving, autonomous, be, cars, towns, will, car, future, has, no, steering, wheel, My, a, already, and, camera, sensors) count 0 in both documents.

+ We are able to vectorize textual information
− The dictionary requires preprocessing
Word representation & semantic gap

All words are orthogonal. Consider 2 virtual documents, each made of a single word:
$d_i = (0, \ldots, 0, d_{ik} > 0, 0, \ldots, 0)$
$d_j = (0, \ldots, 0, d_{jk'} > 0, 0, \ldots, 0)$
Then: $k \neq k' \Rightarrow d_i \cdot d_j = 0$
... even if $w_k$ = lion and $w_{k'}$ = lions.

⇒ Definition of the semantic gap: no metric between words.
Semantic issue

Understanding documents = matching relevant descriptors
- Syntactic difference ⇒ orthogonality of the representation vectors
- Word groups: more intrinsic semantics... but fewer matches with other documents
- N-grams ⇒ dictionary size ↗
- N-grams = great potential... but require careful preprocessing

Example: This film was not interesting
- Unigrams: this, film, was, not, interesting
- Bigrams: this_film, film_was, was_not, not_interesting
- N-grams... + combinations: e.g. 1-3 grams
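A small sketch of n-gram extraction; scikit-learn's ngram_range makes the "1-3 grams" combination a one-liner.

from sklearn.feature_extraction.text import CountVectorizer

doc = ["This film was not interesting"]
for rng in [(1, 1), (2, 2), (1, 3)]:      # unigrams, bigrams, 1-3 grams
    v = CountVectorizer(ngram_range=rng).fit(doc)
    print(rng, v.get_feature_names_out())
# (2, 2) yields (in alphabetical order): 'film was', 'not interesting', 'this film', 'was not'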
Implementation issues

How many unique words in a corpus of 10k movie reviews?

Example: "Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

104,077 unique words...
10^4 documents × 10^5 words × 4 bytes = 4·10^9 bytes ⇒ 4 GB...
against 100 MB of raw textual data. How to improve?
Sparse coding / hash table ⇒ 0s are no longer encoded.
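A sketch of the memory argument with scipy's sparse matrices; the sizes are the indicative figures above, and the random fill pattern stands in for real word counts.

import numpy as np
from scipy.sparse import csr_matrix

n_docs, vocab = 10_000, 100_000   # a dense float32 matrix would need ~4 GB
rng = np.random.default_rng(0)
nnz = 1_000_000                   # ~100 nonzero counts per document
rows = rng.integers(0, n_docs, size=nnz)
cols = rng.integers(0, vocab, size=nnz)
X = csr_matrix((np.ones(nnz, dtype=np.float32), (rows, cols)), shape=(n_docs, vocab))
mem = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"{mem / 1e6:.0f} MB instead of {n_docs * vocab * 4 / 1e9:.0f} GB")
# Beware: X.toarray(), or an implicit conversion inside a library, brings back the 4 GB.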
Implementation issues (2)

- Hash table... ⇒ no operators!
- Higher-level sparse coding = sparse matrix. Several implementations: key = row/column indexing, cell indexing, or block matrices.
- Sparse matrices are rather well integrated... but beware: if your program behaves strangely (e.g. in scikit-learn), there may be an implicit conversion to a full matrix inside.
Text modeling

Classical approach: bag of words (bag-of-words, BoW)

+ Advantages of BoW:
- rather simple, rather lightweight
- fast (real-time systems, IR, natural indexing...)
- many enrichment possibilities (POS, context coding, N-grams...)
- well suited to document classification
- efficient existing implementations: nltk, sklearn

− Drawback(s) of BoW:
- loss of sentence/document structure
⇒ several tasks become hard to tackle: NER, POS tagging, SRL, text generation
Sequential data processing

Handling sequences better
1. Enriching the vector description: N-grams, describing word context... Typical use: improving document-level classification tasks.
2. Sliding-window approach: making decisions at the intra-document scale.
   - Fixed window size ⇒ a vector description remains possible
   - Classifier on a local representation of the text
   - Signal processing (AR, ARMA...)
   - Pattern detection (frequent itemsets, association rules)
3. Sequence models:
   - Hidden Markov Models (HMM; in French: Modèles de Markov Cachés, MMC) [figure: the dependencies in an HMM; credit: T. Artières, UPMC]
   - CRF (Conditional Random Fields): a discriminative approach
A history of approaches to POS, SRL, NER

- Association-rule modeling (1980s): which co-occurrences between a POS and an item in its context are frequent? ⇒ rules
- Bayesian modeling: for a POS $i$, model the distribution of the context, $p(\text{context} \mid \theta_i)$; maximum-likelihood decision: $\arg\max_i p(\text{context} \mid \theta_i)$
- Structured extension (Hidden Markov Models), from 1985-90: HMM taggers are fast and achieve precision/recall scores of about 93-95%
- Towards discriminative modeling (CRF), from 2001
- Recurrent Neural Networks (cf. the ARF and AS courses), from 2010
NLP / ML: a lot in common

Linked funding (and not always glorious):
- MUC / TREC conferences (...): IR, information extraction, document classification, sentiment, QA; multiple domains: general, medical, patents, legal... ⇒ building datasets, centralizing results, exchanges
- NSA, "black boxes" (the 2015 French intelligence law)

ML advances driven by NLP: HMM, CRF

Many ambitious projects:
- Google Knowledge Graph
- NELL: Never-Ending Language Learning (Tom Mitchell, CMU)
HMM

Sequential formalization
Observations: Le chat est dans le salon
Labels: DET NN VBZ ...

The HMM is relevant here: it is based on
- label-to-label transitions,
- observation probabilities.
Notations

As always, the Markov chain is composed of:
- a sequence of states $S = (s_1, \ldots, s_T)$,
- whose values are drawn from a finite set $Q = (q_1, \ldots, q_N)$.

As always, the model is defined by $\{\Pi, A\}$:
- $\pi_i = P(s_1 = q_i)$
- $a_{ij} = p(s_{t+1} = q_j \mid s_t = q_i)$

The observations are modeled from the $s_t$:
- observation sequence: $X = (x_1, \ldots, x_T)$
- probability law: $b_j(t) = p(x_t \mid s_t = q_j)$; B can be discrete or continuous

HMM: $\lambda = \{\Pi, A, B\}$
What we manipulate

- Observation sequence: $X = (x_1, \ldots, x_T)$
- State sequence (hidden = missing): $S = (s_1, \ldots, s_T)$
Reminder: the structure of an HMM

Components of an HMM:
[figure: trellis with N hidden states; state sequence S_1, S_2, ..., S_T above the observations X_1, X_2, ..., X_T, annotated with Π (initial distribution), A (transitions) and B (emissions)]
- Order-1 assumption: each state depends only on the previous one.
- Each observation depends only on the current state.

The states are unknown... and the combinatorics to consider is problematic!
The three problems of HMMs (Ferguson, Rabiner)

- Evaluation: given λ, compute $p(x_1^T \mid \lambda)$
- Decoding: given λ, which state sequence generated the observations?
  $s_1^{T\star} = \arg\max_{s_1^T} p(x_1^T, s_1^T \mid \lambda)$
- Learning: from a series of observations, find $\lambda^\star$:
  $\lambda^\star = \{\Pi^\star, A^\star, B^\star\} = \arg\max_{s_1^T, \lambda} p(x_1^T, s_1^T \mid \lambda)$
PB1: the forward algorithm (dynamic programming)

$\alpha_t(i) = p(x_1^t, s_t = i \mid \lambda)$

- Initialization: $\alpha_1(i) = p(x_1, s_1 = i \mid \lambda) = \pi_i b_i(x_1)$
- Iteration: $\alpha_t(j) = \left[ \sum_{i=1}^N \alpha_{t-1}(i)\, a_{ij} \right] b_j(x_t)$
- Termination: $p(x_1^T \mid \lambda) = \sum_{i=1}^N \alpha_T(i)$

Complexity linear in T; usually T >> N.
[figure: forward recursion on the state/observation trellis]
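A numpy sketch of this recursion on a toy discrete HMM (the model values below are illustrative assumptions).

import numpy as np

def forward(pi, A, B, x):
    """pi: (N,) initial probs; A: (N,N) transitions a_ij; B: (N,V) emissions b_j(o);
    x: list of observation indices. Returns p(x_1^T | lambda)."""
    alpha = pi * B[:, x[0]]              # initialization: alpha_1(i) = pi_i b_i(x_1)
    for t in range(1, len(x)):
        alpha = (alpha @ A) * B[:, x[t]] # iteration: [sum_i alpha_{t-1}(i) a_ij] b_j(x_t)
        # in practice, rescale alpha here (or work in log space) to avoid underflow
    return alpha.sum()                   # termination: sum_i alpha_T(i)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(forward(pi, A, B, [0, 1, 1]))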
PB2: Viterbi (recap)

$\delta_t(i) = \max_{s_1^{t-1}} p(s_1^{t-1}, s_t = i, x_1^t \mid \lambda)$

1. Initialization: $\delta_1(i) = \pi_i b_i(x_1)$, $\Psi_1(i) = 0$
2. Recursion: $\delta_t(j) = \left[ \max_i \delta_{t-1}(i)\, a_{ij} \right] b_j(x_t)$, $\Psi_t(j) = \arg\max_{i \in [1, N]} \delta_{t-1}(i)\, a_{ij}$
3. Termination: $S^\star = \max_i \delta_T(i)$
4. Path: $q_T^\star = \arg\max_i \delta_T(i)$, $q_t^\star = \Psi_{t+1}(q_{t+1}^\star)$

[figure: Viterbi recursion on the state/observation trellis]
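A matching numpy sketch of Viterbi decoding, written in log space to avoid underflow (same toy model as the forward sketch).

import numpy as np

def viterbi(pi, A, B, x):
    T, N = len(x), len(pi)
    delta = np.log(pi) + np.log(B[:, x[0]])  # initialization (log domain)
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)  # scores[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)       # best predecessor for each state j
        delta = scores.max(axis=0) + np.log(B[:, x[t]])
    path = [int(delta.argmax())]             # termination + backtracking
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(viterbi(pi, A, B, [0, 1, 1]))          # most likely state sequence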
PB3: learning HMMs

Simplified version (hard assignment), k-means-like. We already have:
- Evaluation: $p(x_1^T \mid \lambda)$
- Decoding: $s_1^{T\star} = \arg\max_{s_1^T} p(x_1^T, s_1^T \mid \lambda)$

Proposed algorithm:
  Data: observations X; structure = N, K
  Result: $\Pi^\star, A^\star, B^\star$
  Initialize $\lambda_0 = \{\Pi_0, A_0, B_0\}$ (carefully, if possible); t = 0;
  while convergence is not reached do
    $S^{t+1}$ = decode(X, $\lambda_t$);
    $\lambda_{t+1} = \{\Pi_{t+1}, A_{t+1}, B_{t+1}\}$ obtained by counting the transitions;
    t = t + 1;
  end
Algorithm 1: simplified Baum-Welch for learning an HMM

You already have all the pieces to implement this!
Learning in the supervised setting

Observations: Le chat est dans le salon
Labels: DET NN VBZ ...

Much simpler (after the costly labeling task):
- the matrices A, B, Π are obtained by counting...
- inference = Viterbi

Philosophy & limits:
- find the labeling that maximizes the likelihood of the state-observation sequence...
- ... under the HMM assumptions (independence of the observations given the states; order 1).
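A sketch of this counting-based supervised estimation (the toy tag/word sets are hypothetical, and the add-one smoothing is an extra assumption, not on the slide).

import numpy as np

tags = ["DET", "NN", "VBZ", "PREP"]
words = ["le", "chat", "est", "dans", "salon"]
corpus = [(["le", "chat", "est", "dans", "le", "salon"],
           ["DET", "NN", "VBZ", "PREP", "DET", "NN"])]   # (observations, labels)

N, V = len(tags), len(words)
Pi, A, B = np.ones(N), np.ones((N, N)), np.ones((N, V))  # add-one smoothed counts
for x, y in corpus:
    s = [tags.index(t) for t in y]
    o = [words.index(w) for w in x]
    Pi[s[0]] += 1
    for t in range(1, len(s)):
        A[s[t - 1], s[t]] += 1                           # transition counts
    for st, ot in zip(s, o):
        B[st, ot] += 1                                   # emission counts
Pi /= Pi.sum()
A /= A.sum(axis=1, keepdims=True)
B /= B.sum(axis=1, keepdims=True)                        # inference is then Viterbi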
HMM ⇒ CRF

From "An Introduction to Conditional Random Fields", Sutton & McCallum:

[Figure 1.2: diagram of the relationship between naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative models, and general CRFs: naive Bayes → HMMs (SEQUENCE) → generative directed models (GENERAL GRAPHS); naive Bayes → logistic regression (CONDITIONAL), logistic regression → linear-chain CRFs (SEQUENCE) → general CRFs (GENERAL GRAPHS)]

"Furthermore, even when naive Bayes has good classification accuracy, its probability estimates tend to be poor. To understand why, imagine training naive Bayes on a data set in which all the features are repeated, that is, x = (x_1, x_1, x_2, x_2, ..., x_K, x_K). This will increase the confidence of the naive Bayes probability estimates, even though no new information has been added to the data. Assumptions like naive Bayes can be especially problematic when we generalize to sequence models, because inference essentially combines evidence from different parts of the model. If probability estimates at a local level are overconfident, it might be difficult to combine them sensibly.

Actually, the difference in performance between naive Bayes and logistic regression is due only to the fact that the first is generative and the second discriminative; the two classifiers are, for discrete input, identical in all other respects. Naive Bayes and logistic regression consider the same hypothesis space, in the sense that any logistic regression classifier can be converted into a naive Bayes classifier with the same decision boundary, and vice versa. Another way of saying this is that the naive Bayes model (1.5) defines the same family of distributions as the logistic regression model (1.7), if we interpret it generatively as

$p(y, \mathbf{x}) = \frac{\exp\{\sum_k \lambda_k f_k(y, \mathbf{x})\}}{\sum_{\tilde{y}, \tilde{\mathbf{x}}} \exp\{\sum_k \lambda_k f_k(\tilde{y}, \tilde{\mathbf{x}})\}}. \quad (1.9)$

This means that if the naive Bayes model (1.5) is trained to maximize the conditional likelihood, we recover the same classifier as from logistic regression. Conversely, if the logistic regression model is interpreted generatively, as in (1.9), and is trained to maximize the joint likelihood p(y, x), then we recover the same classifier as from naive Bayes. In the terminology of Ng and Jordan [2002], naive Bayes and logistic regression form a generative-discriminative pair. The principal advantage of discriminative modeling is that it is better suited to..."
CRF modeling

- Word sequence $\mathbf{x} = \{x_1, \ldots, x_T\}$
- Label sequence $\mathbf{y}$ (= POS tags)

Parametric estimation of the probabilities, based on the exponential family:
$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \prod_t \Psi_t(y_t, y_{t-1}, x_t)$, with $\Psi_t(y_t, y_{t-1}, x_t) = \exp\left( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \right)$

- The dependencies of $\Psi_t$ ⇒ the shape of the HMM model
- $\theta_k$: parameters to estimate (cf. logistic regression)
- $f_k(y_t, y_{t-1}, x_t)$: generic expression of the features (details below)
CRF = a generalization of HMMs

General case (previous slide):
$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \prod_t \exp\left( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \right)$

Special case: $f_k$ = indicator of the existence of $(y_t, y_{t-1})$ or $(y_t, x_t)$:
$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \prod_t \exp\left( \sum_{i,j \in S} \theta_{i,j} \mathbf{1}_{y_t = i \,\&\, y_{t-1} = j} + \sum_{i \in S, o \in O} \mu_{o,i} \mathbf{1}_{y_t = i \,\&\, x_t = o} \right)$

With:
- $\theta_{i,j} = \log p(y_t = i \mid y_{t-1} = j)$
- $\mu_{o,i} = \log p(x = o \mid y_t = i)$
- $Z = 1$

⇒ In this case, the features are binary (1/0).
CRF: moving to conditional probabilities

$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \prod_t \exp\left( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \right)$

⇒ $p(\mathbf{y} \mid \mathbf{x}) = \frac{\prod_t \exp\left( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \right)}{\sum_{\mathbf{y}'} \prod_t \exp\left( \sum_k \theta_k f_k(y'_t, y'_{t-1}, x_t) \right)} = \frac{1}{Z(\mathbf{x})} \prod_t \exp\left( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \right)$
CRF learning

$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_t \exp\left( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \right)$

As for logistic regression:
$\mathcal{L}_{cond} = \sum_n \log p(\mathbf{y}^n \mid \mathbf{x}^n) = \sum_n \sum_t \sum_k \theta_k f_k(\mathbf{x}^n, y^n_t, y^n_{t-1}) - \sum_n \log(Z(\mathbf{x}^n))$

How to optimize? Gradient ascent on $\partial \mathcal{L} / \partial \theta_k$:
$\theta_k \leftarrow \theta_k + \sum_{n,t} f_k(\mathbf{x}^n, y^n_t, y^n_{t-1}) - \sum_{n,t} \sum_{y'_t, y'_{t-1}} f_k(\mathbf{x}^n, y'_t, y'_{t-1})\, p(y'_t, y'_{t-1} \mid \mathbf{x}^n)$

Exact computation is possible in $O(T M^2 N)$; faster approximate solutions exist.
(M: number of labels, T: chain length, N: number of chains)

The difficulty lies essentially in the normalization factor $Z(\mathbf{x})$.
Regularization

What do you think of the model's complexity? How can it be limited?

$\mathcal{L}_{cond} = \sum_{n,k,t} \theta_k f_k(\mathbf{x}^n, y^n_t, y^n_{t-1}) - \sum_n \log(Z(\mathbf{x}^n))$

⇒ L2 penalty:
$\mathcal{L}_{cond} = \sum_{n,k,t} \theta_k f_k(\mathbf{x}^n, y^n_t, y^n_{t-1}) - \sum_n \log(Z(\mathbf{x}^n)) - \frac{1}{2\sigma^2} \|\theta\|^2$

or L1 penalty:
$\mathcal{L}_{cond} = \sum_{n,k,t} \theta_k f_k(\mathbf{x}^n, y^n_t, y^n_{t-1}) - \sum_n \log(Z(\mathbf{x}^n)) - \alpha \sum_k |\theta_k|$
And what about the f_k???

$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left\{ \sum_k \sum_{t=1}^T \theta_k f_k(\mathbf{x}, y_t, y_{t-1}) \right\}$

By default, the $f_k$ are features and are not learned. Examples:
- $f_1(\mathbf{x}, y_t, y_{t-1}) = 1$ if $y_t$ = ADVERB and the word ends in "-ly"; 0 otherwise. If the weight $\theta_1$ associated with this feature is large and positive, this feature essentially says that we prefer labelings where words ending in -ly get labeled as ADVERB.
- $f_2(\mathbf{x}, y_t, y_{t-1}) = 1$ if $t = 1$, $y_t$ = VERB, and the sentence ends in a question mark; 0 otherwise. Again, if the weight $\theta_2$ associated with this feature is large and positive, then labelings that assign VERB to the first word in a question (e.g., "Is this a sentence beginning with a verb?") are preferred.
- $f_3(\mathbf{x}, y_t, y_{t-1}) = 1$ if $y_{t-1}$ = ADJECTIVE and $y_t$ = NOUN; 0 otherwise. Again, a positive weight for this feature means that adjectives tend to be followed by nouns.

It is also possible to learn features from data (a sketch with a CRF library follows).
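Indicator features of exactly this form are what linear-chain CRF libraries expect. Below, a sketch with the third-party sklearn-crfsuite package (assumed installed; the tiny training set is hypothetical). Transition features like f_3 are generated automatically by the library, and c1/c2 are the L1/L2 regularization weights of the previous slide.

import sklearn_crfsuite

def token_features(sent, t):
    return {
        "word.lower": sent[t].lower(),
        "suffix-ly": sent[t].endswith("ly"),   # cf. f_1
        "is_first": t == 0,                    # cf. f_2 (position cue)
        "prev_word": sent[t - 1].lower() if t > 0 else "<BOS>",
    }

X_train = [[token_features(s, t) for t in range(len(s))]
           for s in [["He", "ran", "quickly"]]]
y_train = [["PRON", "VERB", "ADVERB"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))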
Feature engineering

- Label-observation features: the $f_k$ that can be written $f_k(\mathbf{x}, y_t) = \mathbf{1}_{y_t = c}\, q_k(\mathbf{x})$ are easier to handle (the $q_k$ are computed once and for all). E.g.: if x = word_i, x ends in -ing, x is capitalized... then 1, else 0.
- Node-observation features: even if it is less precise, try not to mix references to the observations and to the transitions:
  $f_k(\mathbf{x}, y_t, y_{t-1}) = q_k(\mathbf{x})\, \mathbf{1}_{y_t = c}\, \mathbf{1}_{y_{t-1} = c'}$ ⇒ $f_k(x_t, y_t) = q_k(\mathbf{x})\, \mathbf{1}_{y_t = c}$ and $f_{k+1}(y_t, y_{t-1}) = \mathbf{1}_{y_t = c}\, \mathbf{1}_{y_{t-1} = c'}$
- Boundary labels
Feature engineering (2)

- Unsupported features: generated automatically from the observations (e.g. "with" is not a city name)... but not very relevant. Use these features to disambiguate errors in the training set.
- Feature induction
- Features from different time steps
- Redundant features
- Complex features = model outputs
Inference

Process:
1. Define the features,
2. Learn the model parameters (θ),
3. Label new sentences (= inference): $\mathbf{y}^\star = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$

⇒ A solution very close to the Viterbi algorithm.
Inference (2)

$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_t \Psi_t(y_t, y_{t-1}, x_t)$, with $\Psi_t(y_t, y_{t-1}, x_t) = \exp\left( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \right)$

1. Bridge to HMMs:
   $\Psi_t(i, j, x) = p(y_t = j \mid y_{t-1} = i)\, p(x_t = x \mid y_t = j)$
   $\alpha_t(j) = \sum_i \Psi_t(i, j, x)\, \alpha_{t-1}(i)$, $\quad \beta_t(i) = \sum_j \Psi_t(i, j, x)\, \beta_{t+1}(j)$
   $\delta_t(j) = \max_i \Psi_t(i, j, x)\, \delta_{t-1}(i)$
2. These definitions remain valid for CRFs (cf. Sutton & McCallum), with:
   $Z(\mathbf{x}) = \sum_i \alpha_T(i) = \beta_0(y_0)$
Applications

- Sentence analysis: morpho-syntactic analysis
- NER: named entity recognition
  ... Mister George W. Bush arrived in Rome together with ...
  ... O      name   name name  O       O  city O        O   ...
- Moving to 2D... image analysis:
  - contour detection
  - object classification
  - features = spatial cohesion, usual/impossible transitions
  (figure credit: the DGM lib)