TRANSCRIPT
-
Learning Task-specific Bilexical Embeddings
Pranava Madhyastha(1), Xavier Carreras(1,2), Ariadna Quattoni(1,2)
(1) Universitat Politècnica de Catalunya (2) Xerox Research Centre Europe
-
Bilexical Relations
I Increasing interest in bilexical relations (relation between pairs of words)
I Dependency Parsing - lexical items (words) connected by binary relations
[Dependency parse of "Small birds sing loud songs": ROOT → sing, SUBJ(sing → birds), OBJ(sing → songs), NMOD(birds → Small), NMOD(songs → loud)]
I Bilexical Predictions can be modelled as Pr(modifier|head)
2
-
In Focus: Unseen words
Adjective-Noun relation, where an adjective modifies a noun
Vinyl can be applied to electronic devices and cases
[NMOD? arcs: which noun does each adjective modify?]
I If one or more of the above nouns or adjectives have not been observed in the supervision: estimating Pr(adjective | noun) is hard
I Word frequencies follow a Zipf distribution
I Generalisation is a challenge
3
-
Distributional Word Space Models
I Distributional Hypothesis: Linguistic items with similar distributions have similar meanings
[Concordance: corpus contexts of "moon", e.g. "... the moon shining in on the ...", "... the moon rises, full and amber ...", "... the light of an enormous moon ..."]
I For every word w we can compute an n-dimensional vector-space representation φ(w) ∈ ℝⁿ from a large corpus
4
-
Contributions
Formulation of statistical models to improve bilexical prediction tasks
I Supervised framework to learn bilexical models over distributional representations
⇒ based on learning bilinear forms
I Compressing representations by imposing low-rank constraints on bilinear forms
I Lexical embeddings tailored for a specific bilexical task.
5
-
Overview
Bilexical Models
Low Rank Constraints
Learning
Experiments
6
-
Overview
Bilexical Models
Low Rank Constraints
Learning
Experiments
7
-
Unsupervised Bilexical Models
I We can define a simple bilexical model as:
Pr(m | h) = exp{⟨φ(m), φ(h)⟩} / ∑_{m′} exp{⟨φ(m′), φ(h)⟩}
where ⟨φ(x), φ(y)⟩ denotes the inner product.
I Problem: Designing appropriate contexts for the required relations
I Solution: Leverage supervised training corpus
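As a toy illustration of this softmax over inner products, a minimal NumPy sketch (the vocabulary, random vectors, and function name are hypothetical, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                        # embedding dimensionality (toy)
vocab = ["loud", "small", "red", "fast"]      # hypothetical candidate modifiers
phi = {w: rng.normal(size=n) for w in vocab}  # distributional vectors phi(w)
phi["songs"] = rng.normal(size=n)             # head word

def pr_modifier_given_head(head, candidates, phi):
    """Softmax over inner products <phi(m), phi(h)> across candidate modifiers."""
    scores = np.array([phi[m] @ phi[head] for m in candidates])
    scores -= scores.max()                    # for numerical stability
    p = np.exp(scores)
    return p / p.sum()

p = pr_modifier_given_head("songs", vocab, phi)
```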
8
-
Supervised bilexical model
I We define the bilexical model in a bilinear setting as:
φ(m)⊤ W φ(h)
where:
φ(m) and φ(h) are n-dimensional representations of m and h
W ∈ ℝ^(n×n) is a matrix of parameters
9
-
Interpreting the Bilinear Models
I If we write the bilinear model as:
∑_{i=1}^{n} ∑_{j=1}^{n} f_{i,j}(m, h) W_{i,j}
I where f_{i,j}(m, h) = φ(m)[i] φ(h)[j]
I ⇒ Bilinear models are linear models, with an extended feature space!
I ⇒ We can re-use all the algorithms designed for linear models.
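This equivalence can be checked numerically; a NumPy sketch with random toy vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
phi_m, phi_h = rng.normal(size=n), rng.normal(size=n)
W = rng.normal(size=(n, n))

bilinear = phi_m @ W @ phi_h              # phi(m)^T W phi(h)
features = np.outer(phi_m, phi_h).ravel() # f_{i,j}(m,h) = phi(m)[i] * phi(h)[j]
linear = features @ W.ravel()             # linear model in the extended space
```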
10
-
Using Bilexical Models
I We define the bilexical operator as:
Pr(m|h) =exp
{φ(m)>Wφ(h)
}∑m′∈M exp {φ(m′)>Wφ(h)}
⇒ Standard conditional log-linear model
11
-
Overview
Bilexical Models
Low Rank Constraints
Learning
Experiments
12
-
Rank Constraints
φ(m)⊤ W φ(h)
[Expanded: the row vector φ(m)⊤ = [m₁ m₂ ⋯ mₙ] times the n×n parameter matrix W = (w_{ij}) times the column vector φ(h) = [h₁ h₂ ⋯ hₙ]⊤]
13
-
Rank Constraints
I Factorizing W via SVD: W = U Σ V⊤, with U ∈ ℝ^(n×k), Σ = diag(σ₁, …, σ_k), V⊤ ∈ ℝ^(k×n)
φ(m)⊤ W φ(h) = φ(m)⊤ U Σ V⊤ φ(h)
I Please note: W has rank k
14
-
Low Rank Embedding
I Regrouping, we get:
φ(m)⊤ W φ(h) = (φ(m)⊤ U) Σ (V⊤ φ(h))
I We can see φ(m)⊤U as a projection of m and V⊤φ(h) as a projection of h
I ⇒ Rank(W) defines the dimensionality of the induced space, hence the embedding
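The regrouping can be verified numerically; a NumPy sketch with a random rank-k matrix (toy sizes, not the experimental settings):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 20, 4
# build a rank-k parameter matrix W from random factors
W = rng.normal(size=(n, k)) @ rng.normal(size=(k, n))
U, s, Vt = np.linalg.svd(W)
U, S, Vt = U[:, :k], np.diag(s[:k]), Vt[:k, :]  # rank-k SVD factors

phi_m, phi_h = rng.normal(size=n), rng.normal(size=n)
proj_m = phi_m @ U       # k-dimensional embedding of the modifier
proj_h = S @ Vt @ phi_h  # k-dimensional embedding of the head (Sigma folded in)
```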
15
-
Computational Properties
I In many tasks, given a head, we must rank a huge number of modifiers
I Strategy:
I Project each lexical item in the vocabulary into its low-dimensional embedding of size k
I Compute the bilexical score as a k-dimensional inner product
I Substantial computational gain as long as we obtain low-rank models
16
-
Summary
I Induce high dimensional representation from a huge corpus
I Learn embeddings suited for a given task
I Our bilexical formulation is, in principle, a linear model, but with an extended feature space
I Low rank bilexical embedding is computationally efficient
17
-
Overview
Bilexical Models
Low Rank Constraints
Learning
Experiments
18
-
Formulation
I Given:
I a set of training tuples D = (m₁, h₁), …, (m_l, h_l)
I where the m are modifiers and the h are heads
I distributional representations φ(m) and φ(h) computed over some corpus
I We set it as a conditional log-linear distribution:
Pr(m | h) = exp{φ(m)⊤ W φ(h)} / ∑_{m′∈M} exp{φ(m′)⊤ W φ(h)}
19
-
Learning and Regularization
I Standard conditional maximum-likelihood optimization; maximize the log-likelihood function:
log Pr(D) = ∑_{(m,h)∈D} [ φ(m)⊤ W φ(h) − log ∑_{m′∈M} exp{φ(m′)⊤ W φ(h)} ]
I Adding a regularization penalty, our algorithm essentially maximizes:
∑_{(m,h)∈D} log Pr(m | h) − λ ‖W‖_p
I Regularization using the proximal gradient method (FOBOS):
I ℓ1 regularization, ‖W‖₁ ⇒ sparse feature space
I ℓ2 regularization, ‖W‖₂ ⇒ dense parameters
I ℓ∗ regularization, ‖W‖∗ ⇒ low-rank embedding
20
-
Algorithm: Proximal Algorithm for Bilexical Operators
1  while iteration < MaxIteration do
2      W_{t+½} = W_t − η_t g(W_t)                       // g: gradient of the neg. log-likelihood
       /* adding the regularization penalty:            */
       /* W_{t+1} = argmin_W ‖W_{t+½} − W‖²₂ + η_t λ r(W); we use the proximal operator */
3      if ℓ1 regularizer then
4          W_{t+1}(i, j) = sign(W_{t+½}(i, j)) · max(|W_{t+½}(i, j)| − η_t λ, 0)
                                                        // basic thresholding operation
5      else if ℓ2 regularizer then
6          W_{t+1} = W_{t+½} / (1 + η_t λ)              // basic scaling operation
7      else if nuclear-norm regularizer then
8          U Σ V⊤ = SVD(W_{t+½})
9          σ̄_i = max(σ_i − η_t λ, 0)                    // σ_i: the i-th diagonal element of Σ
10         W_{t+1} = U Σ̄ V⊤                             // singular-value thresholding
11     end
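The three proximal updates can be sketched in NumPy as follows (function names are illustrative; `step` stands for η_t λ):

```python
import numpy as np

def prox_l1(W_half, step):
    """Soft-thresholding: shrink each entry toward zero (sparse W)."""
    return np.sign(W_half) * np.maximum(np.abs(W_half) - step, 0.0)

def prox_l2(W_half, step):
    """Uniform scaling: shrink all entries by the same factor (dense W)."""
    return W_half / (1.0 + step)

def prox_nuclear(W_half, step):
    """Singular-value thresholding: shrink the spectrum (low-rank W)."""
    U, s, Vt = np.linalg.svd(W_half, full_matrices=False)
    s_bar = np.maximum(s - step, 0.0)
    return U @ np.diag(s_bar) @ Vt
```

Each operator is cheap except the nuclear-norm case, which requires an SVD of the iterate at every step.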
21
-
Overview
Bilexical Models
Low Rank Constraints
Learning
Experiments
22
-
Experiments
I Tasks:
I Noun-Adjective relations: Pr(adjective | noun) and Pr(noun | adjective)
I Verb-Object relations: Pr(object | verb) and Pr(verb | object)
I Data:
I Supervised corpus: gold-standard dependencies of the Penn Treebank
I We partition the heads of head-modifier relations into three parts: 60% of heads for training, 10% for validation and 30% for test.
I No heads from the test set were in the training set.
I Corpora for distributional representations: BLLIP corpus
I Training: For each head word, using the supervised data, we compile a list of compatible and incompatible modifiers
23
-
Results
Nouns — Predicted Adjectives
president: executive, senior, chief, frank, former, international, marketing, assistant, annual, financial
wife: former, executive, new, financial, own, senior, old, other, deputy, major
shares: annual, due, net, convertible, average, new, high-yield, initial, tax-exempt, subordinated
mortgages: annualized, annual, three-month, one-year, average, six-month, conventional, short-term, higher, lower
month: last, next, fiscal, first, past, latest, early, previous, new, current
problem: new, good, major, tough, bad, big, first, financial, long, federal
holiday: new, major, special, fourth-quarter, joint, quarterly, third-quarter, small, strong, own
Table: 10 most likely adjectives for some nouns
24
-
Results
[Plot: pairwise accuracy (46–62) vs number of operations (1e3–1e8), objects given verb; curves: unsupervised, NN, L1, L2]
I Pairwise accuracy: a measure of ranking compatible above incompatible modifiers
I Capacity of the model: given the head, the number of double operations required to compute scores for all modifiers
I In general, if the representation size is n and there are m modifiers:
I ℓ1 & ℓ2 ⇒ if W has d non-zero weights ⇒ dm operations
I ℓ∗ ⇒ if the rank of W is k ⇒ kn + km operations
25
-
[Figure: four panels of pairwise accuracy vs number of operations (1e3–1e8): Adjectives given Noun, Nouns given Adjective, Objects given Verb, Verbs given Object; curves: unsupervised, NN, L1, L2]
Figure: Pairwise accuracy vs number of double operations to compute the distribution over m for a given h
26
-
Prepositional Phrase attachment
[Diagram: verb (v), object (o), modifier (m); NMOD arcs for NOMINAL(prep) vs VERBAL(prep) attachment]
I Given: for every preposition p, a set of training tuples Dp = {(v, o, p, m, y)₁, …, (v, o, p, m, y)_l}
I Distributional representations: φ(v), φ(o), φ(m)
Pr(y = V | ⟨v, o, p, m⟩) = exp{φ(v)⊤ W_p^V φ(m)} / Z
Pr(y = O | ⟨v, o, p, m⟩) = exp{φ(o)⊤ W_p^O φ(m)} / Z
I Does the bilinear model complement the linear model?
I For a constant λ ∈ [0, 1]:
Pr(y | x) = λ Pr_L(y | x) + (1 − λ) Pr_B(y | x)
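The interpolation step is straightforward; a minimal sketch assuming the two component distributions are already computed (the numbers below are hypothetical):

```python
import numpy as np

def interpolate(p_linear, p_bilinear, lam):
    """Mixture Pr(y|x) = lam * Pr_L(y|x) + (1 - lam) * Pr_B(y|x), lam in [0, 1]."""
    return lam * np.asarray(p_linear) + (1.0 - lam) * np.asarray(p_bilinear)

# hypothetical two-way attachment distributions (verbal vs nominal):
p = interpolate([0.7, 0.3], [0.4, 0.6], 0.5)  # -> [0.55, 0.45]
```

Since both components sum to one, any convex combination is again a valid distribution.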
27
-
Results
[Bar chart: attachment accuracy (55–80) for prepositions "for", "from", "with"; bars: bilinear L1, bilinear L2, bilinear NN, linear, interpolated L1, interpolated L2, interpolated NN]
Figure: Attachment accuracies of linear, bilinear and interpolated models for three prepositions
28
-
Conclusion
I We have presented a semi-supervised bilexical model that has the potential to generalize over unseen words
I We have proposed a method to learn low-rank embeddings for scoring bilexical relations efficiently
I We want to apply this idea to other bilexical tasks in NLP
I We want to explore how to combine other feature representations with low-rank bilexical operators.
29
-
Thank You
30