Learning Character Level Representation for POS Tagging
Cícero Nogueira dos Santos, Bianca Zadrozny
Presented by Anirban Majumder
Introduction : Distributed Word Embedding
● Useful technique to capture syntactic and semantic information about words.
● But for many NLP tasks, such as POS tagging, information about word morphology and shape is important, and it is not captured in these embeddings.
● The paper proposes a deep neural network that learns character-level representations to capture intra-word information.
Char-WNN Architecture
● joins word-level and character-level embeddings to perform POS tagging
● extension of Collobert et al.'s (2011) NN architecture
● uses a convolutional layer to extract a character-level embedding for a word of any size
Char-WNN Architecture
● Input: a fixed-size window of words centered on the target word
● Output: for each word in a sentence, the NN produces a score for each tag τ ∈ T (the tag set)
Word and Char-Level Embedding
● every word is drawn from a fixed-size word vocabulary V^wrd, and every character from a fixed-size character vocabulary V^chr
● two embedding matrices are used:
W^wrd ∈ R^(d_wrd × |V^wrd|)
W^chr ∈ R^(d_chr × |V^chr|)
Word and Char-Level Embedding
● Given a sentence of n words {w_1, w_2, ..., w_n}, each word w_n is converted into a vector representation u_n as follows:
u_n = [r^wrd ; r^wch]
where r^wrd ∈ R^(d_wrd) is the word-level embedding and r^wch ∈ R^(cl_u) is the character-level embedding
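The lookup-and-concatenate step above can be sketched in numpy; the vocabulary, dimensions, and random weights here are made-up placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d_wrd-dimensional word embeddings, cl_u-dimensional
# character-level embeddings (names follow the slides, values are illustrative).
d_wrd, cl_u = 5, 3
vocab = {"the": 0, "cat": 1, "sat": 2}
W_wrd = rng.normal(size=(d_wrd, len(vocab)))   # word embedding matrix W^wrd

def word_vector(word, r_wch):
    """u_n = [r^wrd ; r^wch]: concatenate word- and character-level embeddings."""
    r_wrd = W_wrd[:, vocab[word]]              # column lookup in W^wrd
    return np.concatenate([r_wrd, r_wch])      # shape (d_wrd + cl_u,)

u = word_vector("cat", rng.normal(size=cl_u))
print(u.shape)  # (8,)
```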
Word and Char-Level Embedding
Char-Level Embedding : Details
● produces local features around each character of the word
● combines them into a fixed-size character-level embedding
● Given a word w composed of M characters {c_1, c_2, ..., c_M}, each character c_m is transformed into a character embedding r^chr_m. The input to the convolutional layer is the sequence of character embeddings of the M characters.
Char-Level Embedding : Details
● a window of size k^chr (the character context window) slides over the sequence of character embeddings {r^chr_1, r^chr_2, ..., r^chr_M}
● the vector z_m, the concatenation of the character embeddings in the window centered on the m-th character, is defined as:
z_m = (r^chr_(m−(k^chr−1)/2), ..., r^chr_(m+(k^chr−1)/2))^T
Char-Level Embedding : Details
● The convolutional layer computes the j-th element of the character-level embedding r^wch of the word w as follows:
[r^wch]_j = max_(1≤m≤M) [W^0 z_m + b^0]_j
● the matrix W^0 extracts local features around each character window of the given word
● a global fixed-size feature vector is obtained by applying the max operator over all character windows
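A minimal numpy sketch of this convolution-plus-max-pooling step; the sizes, the lowercase-only character vocabulary, and the zero-padding at word boundaries are illustrative assumptions, not details fixed by the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes (illustrative, not from the paper).
d_chr, cl_u, k_chr = 4, 6, 3          # char vector size, conv units, window
char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
W_chr = rng.normal(size=(d_chr, len(char_vocab)))   # W^chr
W0 = rng.normal(size=(cl_u, k_chr * d_chr))         # W^0
b0 = rng.normal(size=cl_u)                          # b^0

def char_level_embedding(word):
    """r^wch: convolution over character windows followed by max pooling."""
    r = [W_chr[:, char_vocab[c]] for c in word]     # r^chr_1 .. r^chr_M
    pad = [np.zeros(d_chr)] * ((k_chr - 1) // 2)    # pad word boundaries
    r = pad + r + pad
    # z_m: concatenation of the k_chr character embeddings around position m.
    zs = [np.concatenate(r[m:m + k_chr]) for m in range(len(word))]
    # [r^wch]_j = max_m [W^0 z_m + b^0]_j  -> fixed size regardless of word length
    return np.max([W0 @ z + b0 for z in zs], axis=0)

print(char_level_embedding("cats").shape)  # (6,)
```

Because the max runs over however many windows the word has, the output dimension is always cl_u, which is what lets the network handle words of any length.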
Char-Level Embedding : Details
● Parameters to be learned: W^chr, W^0 and b^0
● Hyper-parameters:
d^chr : the size of the character vectors
cl_u : the number of convolutional units (also the size of the character-level embedding)
k^chr : the size of the character context window
Scoring
● follows Collobert et al.'s (2011) window approach to score all tags in T for each word in a sentence
● the assumption is that the tag of a word depends mainly on its neighboring words
● to compute the tag scores for the n-th word in the sentence, we first create a vector x_n by concatenating a sequence of k^wrd embeddings, centered on the n-th word
Scoring
● the vector x_n:
x_n = (u_(n−(k^wrd−1)/2), ..., u_(n+(k^wrd−1)/2))^T
● the vector x_n is processed by two NN layers to compute the scores:
s(x_n) = W^2 h(W^1 x_n + b^1) + b^2
where W^1 ∈ R^(hl_u × k^wrd(d_wrd + cl_u)) and W^2 ∈ R^(|T| × hl_u)
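The window-scoring layer can be sketched as follows; all sizes and the random weights are placeholders, and np.tanh stands in for the hardtanh transfer function used in the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sizes: 8-dim word representations u_n, window k_wrd = 3,
# hl_u hidden units, |T| = 4 tags (all values illustrative).
dim, k_wrd, hl_u, n_tags = 8, 3, 10, 4
W1 = rng.normal(size=(hl_u, k_wrd * dim))   # W^1
b1 = rng.normal(size=hl_u)                  # b^1
W2 = rng.normal(size=(n_tags, hl_u))        # W^2
b2 = rng.normal(size=n_tags)                # b^2

def tag_scores(u_seq, n):
    """s(x_n) for the n-th word: concatenate the window, apply two layers."""
    pad = [np.zeros(dim)] * ((k_wrd - 1) // 2)   # pad sentence boundaries
    seq = pad + list(u_seq) + pad
    x_n = np.concatenate(seq[n:n + k_wrd])       # window centered on word n
    h = np.tanh(W1 @ x_n + b1)                   # paper uses hardtanh here
    return W2 @ h + b2                           # one score per tag in T

u_seq = [rng.normal(size=dim) for _ in range(5)]  # a 5-word toy sentence
print(tag_scores(u_seq, 2).shape)  # (4,)
```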
Structured Inference :
● the tags of neighbouring words are strongly dependent
● a prediction scheme that takes the sentence structure into account (Collobert et al., 2011)
Structured Inference :
● We compute the score for a tag path [t]_1^N = {t_1, t_2, ..., t_N} as
S([w]_1^N, [t]_1^N, θ) = Σ_(n=1..N) (A_(t_(n−1), t_n) + s(x_n)_(t_n))
where s(x_n)_(t_n) is the score of tag t_n for the word w_n,
A_(t_(n−1), t_n) is the transition score for jumping from tag t_(n−1) to tag t_n, and
θ is the set of all trainable network parameters (W^wrd, W^chr, W^0, b^0, W^1, b^1, W^2, b^2, A)
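A toy sketch of the path score S; the tag count, sentence length, and scores are made up, the start-of-sentence transition A_(t_0, t_1) is dropped for simplicity, and exhaustive enumeration stands in for the Viterbi dynamic program used in practice:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical toy values: 3 tags, a 4-word sentence (scores are made up).
n_tags, N = 3, 4
A = rng.normal(size=(n_tags, n_tags))   # A[i, j]: transition score tag i -> j
s = rng.normal(size=(N, n_tags))        # s(x_n)_t: network score, word n, tag t

def path_score(tags):
    """S = sum_n (A_(t_(n-1), t_n) + s(x_n)_(t_n)); the initial transition
    from a start-of-sentence tag is omitted here for simplicity."""
    score = s[0, tags[0]]
    for n in range(1, N):
        score += A[tags[n - 1], tags[n]] + s[n, tags[n]]
    return score

# Exhaustive search over all |T|^N tag paths (Viterbi does this in O(N |T|^2)).
best = max(itertools.product(range(n_tags), repeat=N), key=path_score)
print(best)
```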
Network Training :
● the network is trained by minimizing the negative log-likelihood over the training set D, as in Collobert et al. (2011)
● a sentence score is interpreted as a conditional probability over a tag path
log p([t]_1^N | [w]_1^N, θ) = S([w]_1^N, [t]_1^N, θ) − log Σ_(∀[u]_1^N ∈ T^N) e^(S([w]_1^N, [u]_1^N, θ))
● stochastic gradient descent is used to minimize the negative log-likelihood with respect to θ
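The log-likelihood above can be sketched on a toy instance of the path score S defined earlier; the sizes and scores are made up, the normalizer is enumerated over all paths for clarity (the paper computes it with a forward-style dynamic program), and the max-shift is the standard log-sum-exp trick:

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)

# Toy setup: 3 tags, a 4-word sentence (all values illustrative).
n_tags, N = 3, 4
A = rng.normal(size=(n_tags, n_tags))   # transition scores
s = rng.normal(size=(N, n_tags))        # per-word tag scores s(x_n)_t

def path_score(tags):
    """S for one tag path (start-of-sentence transition omitted)."""
    score = s[0, tags[0]]
    for n in range(1, N):
        score += A[tags[n - 1], tags[n]] + s[n, tags[n]]
    return score

def log_likelihood(gold_tags):
    """log p([t]|[w], θ) = S(gold) - log sum over all paths of e^(S(path))."""
    all_scores = [path_score(p)
                  for p in itertools.product(range(n_tags), repeat=N)]
    m = max(all_scores)                             # log-sum-exp trick
    log_Z = m + np.log(sum(np.exp(x - m) for x in all_scores))
    return path_score(gold_tags) - log_Z

ll = log_likelihood((0, 1, 2, 0))
print(ll <= 0)  # True: a log-probability is never positive
```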
English Datasets
SET       SENT.    TOKENS    OOSV    OOUV
TRAINING  38,219   912,344       0   6,317
DEVELOP.   5,527   131,768   4,467     958
TEST       5,462   129,654   3,649     923
WSJ Corpus
Experimental Setup : POS Tagging Datasets
Portuguese Datasets
SET       SENT.    TOKENS    OOSV    OOUV
TRAINING  42,021   959,413       0   4,155
DEVELOP.   2,212    48,258   1,360     202
TEST       9,141   213,794   9,523   1,004
Mac-Morpho Corpus
English POS Tagging Results:
SYSTEM   FEATURES   ACC.    ACC. OOSV   ACC. OOUV
CHARWNN  –          97.32   89.86       85.48
WNN      CAPS+SUF2  97.21   89.28       86.89
WNN      CAPS       97.08   86.08       79.96
WNN      SUF2       96.33   84.16       80.61
WNN      –          96.13   80.68       71.94
Comparison of different NNs for POS Tagging of the WSJ Corpus
Portuguese POS Tagging Results:
SYSTEM   FEATURES   ACC.    ACC. OOSV   ACC. OOUV
CHARWNN  –          97.47   92.49       89.74
WNN      CAPS+SUF3  97.42   92.64       89.64
WNN      CAPS       97.27   90.41       86.35
WNN      SUF3       96.35   85.73       81.67
WNN      –          96.19   83.08       75.40
For POS Tagging of the Mac-Morpho Corpus
Results:
● Most similar words using character-level embeddings learned with WSJ Corpus
Query words:      INCONSIDERABLE  83-YEAR-OLD  SHEEP-LIKE     DOMESTICALLY  UNSTEADINESS
Most similar:     INCONCEIVABLE      43-YEAR-OLD  ROCKET-LIKE    FINANCIALLY   UNEASINESS
                  INDISTINGUISHABLE  63-YEAR-OLD  FERN-LIKE      ESSENTIALLY   UNHAPPINESS
                  INNUMERABLE        73-YEAR-OLD  SLIVER-LIKE    GENERALLY     UNPLEASANTNESS
                  INCOMPATIBLE       49-YEAR-OLD  BUSINESS-LIKE  IRONICALLY    BUSINESS
                  INCOMPREHENSIBLE   53-YEAR-OLD  WAR-LIKE       SPECIALLY     UNWILLINGNESS
Results:
● Most similar words using word-level embeddings learned using unlabeled English texts
Query words:      INCONSIDERABLE  00-YEAR-OLD  SHEEP-LIKE       DOMESTICALLY  UNSTEADINESS
Most similar:     INSIGNIFICANT   SEVENTEEN-YEAR-OLD  BURROWER         WORLDWIDE    PARESTHESIA
                  INORDINATE      SIXTEEN-YEAR-OLD    CRUSTACEAN-LIKE  000,000,000  HYPERSALIVATION
                  ASSUREDLY       FOURTEEN-YEAR-OLD   TROLL-LIKE       00,000,000   DROWSINESS
                  UNDESERVED      NINETEEN-YEAR-OLD   SCORPION-LIKE    SALES        DIPLOPIA
                  SCRUPLE         FIFTEEN-YEAR-OLD    UROHIDROSIS      RETAILS      BREATHLESSNESS
Results:
● Most similar words using word-level embeddings learned using unlabeled Portuguese texts
Query words:      GRADAÇÕES  CLANDESTINAMENTE  REVOGAÇÃO  DESLUMBRAMENTO  DROGASSE
Most similar:     TONALIDADES      ILEGALMENTE      ANULAÇÃO               ASSOMBRO      –
                  MODULAÇÕES       ALI              PROMULGAÇÃO            EXOTISMO      –
                  CARACTERIZAÇÕES  ATAMBUA          CADUCIDADE             ENFADO        –
                  NUANÇAS          BRAZZAVILLE      INCONSTITUCIONALIDADE  ENCANTAMENTO  –
                  COLORAÇÕES       VOLUNTARIAMENTE  NULIDADE               FASCÍNIO      –
Future Work :
● analyzing the interrelationship between the two embeddings in more detail
● applying this approach to other NLP tasks such as text chunking, NER, etc.
Thank You