Word representations: A simple and general method for semi-supervised learning
Joseph Turian
with Lev Ratinov and Yoshua Bengio
Goodies: http://metaoptimize.com/projects/wordreprs/
Slide 2: Supervised training
[diagram: sup data → sup model]
Slides 3-5: Semi-sup training?
[diagram builds on supervised training: sup data → sup model, then adds "more feats"]
Slide 6: [diagram: "more feats" shared across sup task 1 and sup task 2, each with its own sup data and sup model]
Slide 7: Joint semi-sup
[diagram: unsup data + sup data → semi-sup model]
Slide 8: Unsup pretraining, semi-sup fine-tuning
[diagram: unsup data → unsup model (unsup pretraining); unsup model + sup data → semi-sup model (semi-sup fine-tuning)]
Slide 9: [diagram: unsup data → (unsup training) → unsup model → unsup feats]
Slide 10: [diagram: unsup data → (unsup training) → unsup feats; unsup feats + sup data → (sup training) → semi-sup model]
Slide 11: [diagram: unsup data → (unsup training) → unsup feats, shared by sup task 1, sup task 2, and sup task 3]
Slide 12: What unsupervised features are most useful in NLP?
Slide 13: Natural language processing
• Words, words, words
• Words, words, words
• Words, words, words
Slide 14: How do we handle words?
• Not very well
Slide 15: “One-hot” word representation
• |V| = |vocabulary|, e.g. 50K for PTB2
[diagram: word -1, word 0, word +1 as one-hot inputs (3*|V| dims) → (3*|V|) x m weight matrix → m outputs, a Pr dist over labels]
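The one-hot window on this slide can be sketched as follows. This is a minimal illustration with a toy three-word vocabulary, not the talk's actual feature pipeline: each window word becomes a |V|-dimensional indicator vector, and the three vectors are concatenated into the 3*|V|-dimensional input.

```python
vocab = {"the": 0, "cat": 1, "sat": 2}  # toy vocabulary, |V| = 3

def one_hot(word, vocab):
    # |V|-dim indicator vector with a single 1.0 at the word's index
    v = [0.0] * len(vocab)
    v[vocab[word]] = 1.0
    return v

def window_features(w_prev, w_cur, w_next, vocab):
    # concatenate word -1, word 0, word +1 -> 3*|V| dims,
    # the input to the (3*|V|) x m weight matrix on the slide
    return one_hot(w_prev, vocab) + one_hot(w_cur, vocab) + one_hot(w_next, vocab)

x = window_features("the", "cat", "sat", vocab)
assert len(x) == 3 * len(vocab)
```

Note how sparse this is: only 3 of the 3*|V| inputs are nonzero, and every word gets its own independent column of weights.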
Slide 16: One-hot word representation
• 85% of vocab words occur as only 10% of corpus tokens
• Bad estimate of Pr(label | rare word)
[diagram: word 0 as one-hot input (|V| dims) → |V| x m weight matrix → m outputs]
Slides 17-19: Approach
• Manual feature engineering
20
Approach
• Induce word reprs over large corpus, unsupervised
• Use word reprs as word features for supervised task
Slide 21: Less sparse word reprs?
• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs
Slide 23: Distributional representations
[diagram: matrix F of size |V| (size of vocab) x C]
e.g. F_w,v = Pr(v follows word w), or F_w,v = Pr(v occurs in same doc as w)
Slide 24: Distributional representations
[diagram: F (|V| x C) mapped by g to f (|V| x d)]
g(F) = f, e.g. g = LSI/LSA, LDA, PCA, ICA, random transform; C >> d
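A minimal sketch of this pipeline on a toy corpus, assuming the "v follows w" variant of F and a truncated SVD (the LSA-style choice) as g. The corpus, vocabulary, and d are illustrative stand-ins, not the talk's setup.

```python
import numpy as np

corpus = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# F[w, v] ~ Pr(v follows word w), estimated from bigram counts
F = np.zeros((len(vocab), len(vocab)))
for w, v in zip(corpus, corpus[1:]):
    F[idx[w], idx[v]] += 1.0
F /= np.maximum(F.sum(axis=1, keepdims=True), 1.0)  # normalize rows

# g = truncated SVD: project C columns down to d << C
d = 2
U, S, _ = np.linalg.svd(F, full_matrices=False)
f = U[:, :d] * S[:d]  # d-dim distributional representation per word

print(f.shape)  # (|V|, d) = (6, 2)
```

Each row of `f` is now a dense, low-dimensional representation of one vocabulary word, in place of a row of the huge sparse F.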
Slide 26: Class-based word repr
• |C| classes, hard clustering
[diagram: word 0 as one-hot word + class input (|V|+|C| dims) → (|V|+|C|) x m weight matrix → m outputs]
Slide 27: Class-based word repr
• Hard vs. soft clustering
• Hierarchical vs. flat clustering
Slide 28: Less sparse word reprs?
• Distributional word reprs
• Class-based (clustering) word reprs
  – Brown (hard, hierarchical) clustering
  – HMM (soft, flat) clustering
• Distributed word reprs
Slide 30: Brown clustering
• Hard, hierarchical class-based LM
• Brown et al. (1992)
• Greedy technique for maximizing bigram mutual information
• Merge words by contextual similarity
Slide 31: Brown clustering
(image from Terry Koo)
cluster(chairman) = ‘0010’
2-prefix(cluster(chairman)) = ‘00’
Slide 32: Brown clustering
• Hard, hierarchical class-based LM
• 1000 classes
• Use prefixes = 4, 6, 10, 20
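Extracting the prefix features is simple string slicing, since a Brown cluster is a bit string encoding the path in the merge tree. A sketch, using the prefix lengths from the slide (the feature names are hypothetical, not the talk's exact naming):

```python
def brown_prefix_features(bitstring, lengths=(4, 6, 10, 20)):
    # coarse-to-fine cluster features: shorter prefixes = coarser clusters
    return {f"cluster-pre{n}": bitstring[:n] for n in lengths if len(bitstring) >= n}

feats = brown_prefix_features("1010001100")
print(feats)
# {'cluster-pre4': '1010', 'cluster-pre6': '101000', 'cluster-pre10': '1010001100'}
```

The length-20 prefix is skipped here because this cluster path is only 10 bits deep; words higher in the tree simply emit fewer prefix features.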
Slide 35: Distributed word repr
• k- (low) dimensional, dense representation
• “word embedding” matrix E of size |V| x k
[diagram: word 0 as k-dim input → k x m weight matrix → m outputs]
Slide 36: Sequence labeling w/ embeddings
[diagram: word -1, word 0, word +1 each looked up in E (|V| x k, tied weights), concatenated (3*k dims) → (3*k) x m weight matrix → m outputs]
“word embedding” matrix E of size |V| x k
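The lookup-and-concatenate step above can be sketched as follows. The sizes and the random E are toy stand-ins; the point is the tied weights: the same E is used at every window position.

```python
import numpy as np

V, k = 5, 4                       # toy vocab size and embedding dim
rng = np.random.default_rng(0)
E = rng.normal(size=(V, k))       # word embedding matrix, |V| x k

def window_input(word_ids, E):
    # one shared ("tied") E for word -1, word 0, word +1;
    # concatenation gives the 3*k input to the (3*k) x m classifier
    return np.concatenate([E[i] for i in word_ids])

x = window_input([2, 0, 3], E)
print(x.shape)  # (3*k,) = (12,)
```

Compare with the one-hot window: the input shrinks from 3*|V| sparse dims to 3*k dense dims, and similar words can share similar inputs.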
Slide 37: Less sparse word reprs?
• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs
  – Collobert + Weston (2008)
  – HLBL embeddings (Mnih + Hinton, 2007)
Slide 39: Collobert + Weston 2008
[diagram: window w1 w2 w3 w4 w5 → 50-dim embeddings concatenated (50*5) → 100 hidden units → 1 output score]
Training criterion: score(true window) > μ + score(corrupted window)
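The ranking criterion above is a margin (hinge) loss. A minimal sketch, assuming a tiny scorer with the 250 → 100 → 1 sizes from the slide; the random weights and inputs are placeholders, and a real trainer would backprop this loss into the embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(100, 250))  # 50*5 concatenated embeddings in
W2 = rng.normal(scale=0.1, size=(1, 100))    # 100 hidden units -> 1 score

def score(window_emb):
    # scalar score for one window
    return (W2 @ np.tanh(W1 @ window_emb)).item()

def ranking_loss(true_win, corrupt_win, mu=1.0):
    # hinge: zero once score(true) > mu + score(corrupt)
    return max(0.0, mu + score(corrupt_win) - score(true_win))

loss = ranking_loss(rng.normal(size=250), rng.normal(size=250))
assert loss >= 0.0
```

The "corrupted" window is typically the true window with its middle word replaced by a random word, so no labels are needed: the raw corpus itself supplies the positive examples.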
Slide 40: 50-dim embeddings: Collobert + Weston (2008)
t-SNE vis by van der Maaten + Hinton (2008)
Slide 42: Log-bilinear language model (LBL)
[diagram: w1 w2 w3 w4 → linear prediction of w5]
Pr(target) = exp(predict · target) / Z
Slide 43: HLBL
• HLBL = hierarchical (fast) training of LBL
• Mnih + Hinton (2009)
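The flat LBL distribution from slide 42 can be sketched as follows: predict a representation for w5 as a linear function of the context representations, then softmax the dot products against every word's representation. The toy sizes and random parameters are placeholders; the hierarchical trick in HLBL replaces exactly this full-vocabulary Z with a tree of binary decisions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 6, 4
R = rng.normal(size=(V, k))                       # one k-dim repr per word
C = [rng.normal(size=(k, k)) for _ in range(4)]   # one matrix per context position

def lbl_distribution(context_ids):
    # linear prediction of w5's representation from w1..w4
    predict = sum(Ci @ R[w] for Ci, w in zip(C, context_ids))
    logits = R @ predict                          # predict . target, all targets
    p = np.exp(logits - logits.max())
    return p / p.sum()                            # the division by Z

p = lbl_distribution([0, 1, 2, 3])
assert abs(p.sum() - 1.0) < 1e-9
```

Computing Z costs O(|V|) per prediction here, which is why the hierarchical (fast) variant matters at a 270K-type vocabulary.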
Slide 44: Approach
• Induce word reprs over large corpus, unsupervised
  – Brown: 3 days
  – HLBL: 1 week, 100 epochs
  – C&W: 4 weeks, 50 epochs
• Use word reprs as word features for supervised task
Slide 45: Unsupervised corpus
• RCV1 newswire
• 40M tokens (vocab = all 270K types)
Slide 46: Supervised tasks
• Chunking (CoNLL, 2000)
  – CRF (Sha + Pereira, 2003)
• Named entity recognition (NER)
  – Averaged perceptron (linear classifier)
  – Based upon Ratinov + Roth (2009)
Slide 47: Unsupervised word reprs as features
Word = “the”
Embedding = [-0.2, …, 1.6]
Brown cluster = 1010001100
(cluster 4-prefix = 1010, cluster 6-prefix = 101000, …)
Slide 48: Unsupervised word reprs as features
Orig X = {pos-2=“DT”: 1, word-2=“the”: 1, ...}
X w/ Brown = {pos-2=“DT”: 1, word-2=“the”: 1, class-2-pre4=“1010”: 1, class-2-pre6=“101000”: 1}
X w/ emb = {pos-2=“DT”: 1, word-2=“the”: 1, word-2-dim00: -0.2, …, word-2-dim49: 1.6, ...}
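Building these augmented feature dicts is mechanical. A sketch with toy lookup tables (the 2-dim embedding and single-entry Brown table are stand-ins; the feature-name scheme follows the slide): Brown prefixes become binary indicator features, embedding dimensions become real-valued features.

```python
brown = {"the": "1010001100"}     # toy Brown-cluster table
emb = {"the": [-0.2, 1.6]}        # toy 2-dim embedding table

X = {"pos-2=DT": 1.0, "word-2=the": 1.0}   # original sparse features

# Brown version: indicator feature per cluster prefix
X_brown = dict(X)
c = brown["the"]
for n in (4, 6, 10, 20):
    if len(c) >= n:
        X_brown[f"class-2-pre{n}={c[:n]}"] = 1.0

# Embedding version: one real-valued feature per dimension
X_emb = dict(X)
for d, value in enumerate(emb["the"]):
    X_emb[f"word-2-dim{d:02d}"] = value
```

Note the asymmetry the talk later returns to: the Brown features stay binary and sparse, while the embedding features are dense reals, which makes supervised training slower and adds a scaling hyperparameter.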
Slide 49: Embeddings: Normalization
E = σ * E / stddev(E)
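The normalization is a single rescaling so that the embedding matrix has standard deviation σ (the summary slide reports σ = 0.1 working well). A sketch with a random stand-in E:

```python
import numpy as np

sigma = 0.1
E = np.random.default_rng(0).normal(scale=3.0, size=(1000, 50))  # stand-in E

E = sigma * E / E.std()          # rescale so stddev(E) == sigma

print(round(float(E.std()), 3))  # 0.1
```

This puts the real-valued embedding features on a scale comparable to the binary indicator features they sit next to in the supervised model.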
Slide 50: Embeddings: Normalization (Chunking)
[chart]
Slide 51: Embeddings: Normalization (NER)
[chart]
Slide 52: Repr capacity (Chunking)
[chart]
Slide 53: Repr capacity (NER)
[chart]
Slide 54: Test results (Chunking)
[bar chart, F1 axis 93-95.5: baseline, HLBL, C&W, Brown, C&W+Brown, Suzuki+Isozaki (08) 15M, Suzuki+Isozaki (08) 1B]
Slide 55: Test results (NER)
[bar chart, F1 axis 84-91: Baseline, Baseline + nonlocal, Gazetteers, C&W, HLBL, Brown, All, All + nonlocal]
Slide 56: MUC7 (OOD) results (NER)
[bar chart, F1 axis 66-84: Baseline, Baseline + nonlocal, Gazetteers, C&W, HLBL, Brown, All, All + nonlocal]
Slide 57: Test results (NER)
[bar chart, F1 axis 88-91: Lin+Wu (09) 3.4B, Suzuki+Isozaki (08) 37M, Suzuki+Isozaki (08) 1B, All+nonlocal 37M, Lin+Wu (09) 700B]
Slide 58: Test results
• Chunking: C&W = Brown
• NER: C&W < Brown
• Why?
Slide 59: Word freq vs word error (Chunking)
[chart]
Slide 60: Word freq vs word error (NER)
[chart]
Slide 61: Summary
• Both Brown + word emb can increase acc of near-SOTA system
• Combining can improve accuracy further
• On rare words, Brown > word emb
• Scale parameter σ = 0.1
• Goodies: http://metaoptimize.com/projects/wordreprs/
Word features! Code!
Slide 62: Difficulties with word embeddings
• No stopping criterion during unsup training
• More active features (slower sup training)
• Hyperparameters
  – Learning rate for model
  – (optional) Learning rate for embeddings
  – Normalization constant
• vs. Brown clusters, few hyperparams
Slide 63: HMM approach
• Soft, flat class-based repr
• Multinomial distribution over hidden states = word representation
• 80 hidden states
• Huang and Yates (2009)
• No results with HMM approach yet