Source: phdforum.im.ntu.edu.tw/1012/20130308.pdf
TRANSCRIPT
Wordnet-Enhanced
Topic Models
Hsin-Min Lu
盧信銘
Department of Information Management
National Taiwan University
Outline
• Introduction
• Literature Review
• Wordnet-Enhanced Topic Model
• Experiments
Introduction
• Leveraging unstructured data is a challenging yet rewarding task
• Topic modeling, a family of unsupervised learning models, is useful for discovering latent topic structures in free text
• Topic models assume that a document is a mixture of topic distributions
• Each topic is a distribution over the vocabulary
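These two assumptions can be made concrete with a tiny generative sketch (the vocabulary, topic distributions, and mixture weights below are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-topic model over a 5-word vocabulary.
vocab = ["web", "search", "link", "term", "relevance"]
topics = np.array([
    [0.5, 0.3, 0.2, 0.0, 0.0],   # topic 0: each row is a distribution over the vocabulary
    [0.0, 0.0, 0.1, 0.5, 0.4],   # topic 1
])
theta = np.array([0.7, 0.3])      # this document's mixture over the two topics

# Generate a 10-token document: draw a topic per token, then a word from that topic.
doc = []
for _ in range(10):
    z = rng.choice(2, p=theta)
    doc.append(vocab[rng.choice(5, p=topics[z])])
```

Topic models invert this process: given only the documents, they infer the topics and the per-document mixtures.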
Statistical Topic Models for Text Mining
[Diagram: Text Collections → Probabilistic Topic Modeling → applications such as subtopic discovery, opinion comparison, summarization, and topical pattern analysis]
Example topics (multinomial distributions over words):
  web 0.21, search 0.10, link 0.08, graph 0.05, …
  term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independ. 0.03, model 0.03, …
Representative topic models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], Pachinko allocation [Li & McCallum 06], Topic over time [Wang et al. 06], …
Introduction (Cont’d.)
• An ongoing research stream incorporates meta-data variables into topic modeling
– Richer models
– Useful estimation results
• This study aims at incorporating Wordnet synset information into topic models
– A topic may be a combination of Wordnet synsets, and/or
– The hidden co-occurrence structure
Introduction (Cont’d.)
• Wordnet-Enhanced Topic Model
– Incorporates Wordnet synsets into topic models
– Wordnet synsets affect the prior of topics
– Multinomial-probit-like setting for the prior
– Wordnet synsets influence topic inference at the token level
– Document-level random effects capture document-wide topic tendency
– Inference using Gibbs sampling
Literature Review
• Wordnet
• Latent Dirichlet Allocation (LDA)
• LDA with Dirichlet Forest Prior
• Concept-Topic Model
• LDA with Wordnet
Wordnet
• WordNet is a large lexical database of
English.
– POS: Nouns, verbs, adjectives and adverbs
• Words are organized by synsets
– A synset expresses a distinct concept
– Synsets are interlinked by means of
conceptual-semantic and lexical relations
– Synsets form a network
– Useful for computational linguistics and
natural language processing
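The synset network can be pictured as nodes linked by semantic relations. A minimal sketch using a hand-coded toy fragment (the synset names below are illustrative stand-ins; the real network is navigated through WordNet's database or libraries such as NLTK):

```python
# Toy fragment of the WordNet noun hierarchy (illustrative only; the real
# network has tens of thousands of interlinked noun synsets).
hypernym = {
    "imitation.n.01": "copy.n.01",
    "copy.n.01": "representation.n.01",
    "representation.n.01": "creation.n.01",
    "creation.n.01": "artifact.n.01",
    "artifact.n.01": "whole.n.01",
    "whole.n.01": "physical_object.n.01",
    "physical_object.n.01": "entity.n.01",
}

def hypernym_path(synset):
    """Follow 'kind-of' (hypernymy) links up to the unique beginner synset."""
    path = [synset]
    while synset in hypernym:
        synset = hypernym[synset]
        path.append(synset)
    return path
```

Walking such paths is the basic operation behind many WordNet-based similarity and generalization measures.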
Wordnet (Cont’d.)
• Important differences between WordNet and a thesaurus:
– WordNet interlinks not just word forms (strings of letters) but specific senses of words
– WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity
WordNet (Cont’d.)
• A lexical semantic network relating word forms and
lexicalized concepts (i.e., concepts that speakers have
adopted word forms to express)
• Main relations—hyponymy/troponymy (kind-of/way-to),
meronymy (part-whole), synonymy, antonymy
• Predominantly hierarchical, few relations across
grammatical class, glosses & example sentences do not
participate in network
• Nouns organized under 9 unique beginners
• Command-line interface & C library
• Prehistoric (but greppable!) db format
Lexical Matrix
Creation of Synsets
• Three principles: minimality, coverage, replaceability
Synsets
• {house} is ambiguous; {house, home} has the sense of a social unit living together. Is this the minimal unit?
• {family, house, home} makes the unit completely unambiguous
• For coverage: {family, household, house, home}, ordered according to frequency
• Replaceability of the most frequent words is a requirement
Synset Creation
• From first principles:
– Pick all the senses from good standard dictionaries
– Obtain synonyms for each sense
– Requires long hours of hard work
Wordnet Statistics (Version 2.1)

POS        Unique Strings   Synsets   Total Word-Sense Pairs
Noun       117,097          81,426    145,104
Verb       11,488           13,650    24,890
Adjective  22,141           18,877    31,302
Adverb     4,601            3,644     5,720
Totals     155,327          117,597   207,016
Wordnet Example
• Fake (n) has three senses:
– Something that is counterfeit; not what it seems to be (synonyms: sham, postiche)
– A person who makes deceitful pretenses (synonyms: impostor, pretender, faker, …)
– [Football] A deceptive move made by a football player (synonym: juke)
Wordnet Example (Cont’d.)
[Figure: hypernym trees for the three senses of "fake", each rooted at a unique beginner synset. Sense 1 (sham, postiche): entity → physical object → whole, unit → artifact → creation → representation → copy → imitation. Sense 2 (impostor, pretender, faker, fraud, shammer, role player, pseudo, pseud): entity → causal agent → living thing → organism, being → person → bad person → wrongdoer → deceiver. Sense 3 (juke): act, human action → action → choice, selection → decision → move → tactical maneuver → feint.]
Topic Models
• Latent variable models are useful in
discovering hidden structures in text data
– Latent Semantic Indexing using singular value
decomposition (SVD) (Deerwester et al. 1990)
– Probabilistic Latent Semantic Indexing (pLSI)
(Hofmann 1999)
– Latent Dirichlet allocation (LDA) (Blei et al. 2003)
Topic Models (Cont’d.)
• LDA addresses the shortcomings of its predecessors:
– SVD may produce negative factor loadings, which makes the result hard to interpret
– pLSI (aspect model): the number of parameters grows linearly with the number of documents
• This leads to model overfitting
– LDA outperforms pLSI in terms of held-out probability (perplexity)
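Perplexity, the held-out metric mentioned above, is the exponentiated negative average per-token log-likelihood. A minimal sketch:

```python
import math

def perplexity(token_log_probs):
    """exp(- average per-token log-likelihood); lower is better."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns each of 4 held-out tokens probability 1/8 has
# perplexity 8: it is as "surprised" as a uniform 8-word vocabulary.
lp = [math.log(1 / 8)] * 4
print(perplexity(lp))  # 8.0 (up to floating point)
```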
LDA Generative Process
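The slide's equations did not survive extraction; the standard LDA generative process (Blei et al. 2003) can be sketched as follows (the hyperparameter values are illustrative):

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.1, beta=0.1, seed=0):
    """LDA's generative story:
    1. phi_j ~ Dirichlet(beta) for each topic j (a distribution over words);
    2. theta_d ~ Dirichlet(alpha) for each document d (a mixture over topics);
    3. for each token: z ~ Multinomial(theta_d), then w ~ Multinomial(phi_z)."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet([beta] * vocab_size, size=n_topics)
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet([alpha] * n_topics)
        z = rng.choice(n_topics, size=doc_len, p=theta)
        docs.append([rng.choice(vocab_size, p=phi[t]) for t in z])
    return docs

docs = generate_corpus(n_docs=3, doc_len=20, n_topics=4, vocab_size=50)
```

Inference reverses this story: only the words w are observed, and the z, theta, and phi must be recovered.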
LDA Inference Problem
LDA Model
LDA Model (Cont’d.)
LDA Model (Cont’d.)
LDA: Intractable Inference
Model Estimation Methods

Method                     Model              Latent Z             Latent θ_d            Other Parameters
Collapsed Gibbs Sampling   LDA                Sample               Integrate out         Integrate out φ_j
Stochastic EM              TOT                Sample               Integrate out         Integrate out φ_j; maximize w.r.t. other parameters
Variational Bayes          LDA and DTM        Assume independent   Assume independent*   Maximize
Augmented Gibbs Sampling   WNTM (this study)  Sample               Sample                Integrate out φ_j; sample other parameters

* Retains the sequential structure in DTM.
Collapsed Gibbs Sampling
P(W | Z, β) = ∏_j [ Γ(Wβ) / Γ(β)^W ] · [ ∏_w Γ(n_j^w + β) ] / Γ(n_j^· + Wβ)
P(Z | α) = ∏_d [ Γ(Tα) / Γ(α)^T ] · [ ∏_j Γ(n_d,j + α) ] / Γ(n_d,· + Tα)
where n_j^w is the number of tokens of word w assigned to topic j, n_d,j is the number of tokens in document d assigned to topic j, W is the vocabulary size, and T is the number of topics.
Marginalize 𝜃𝑑
Joint Probability
Posterior Probability
Posterior Probability (Cont’d.)
Limitations of the LDA Model
• Additional meta-data information cannot be included in the model
– Partially addressed by the author-topic model (AT) (Rosen-Zvi et al. 2010) and Dirichlet-multinomial regression (DMR) (Mimno and McCallum, 2008)
– The AT model delivers worse performance than the LDA model
• Except when testing articles are very short
– The AT model is not a general framework for including arbitrary document-level meta-data in the model
LDA with Dirichlet Forest Prior
• Dirichlet Forest Prior can be used to
incorporate prior knowledge
– Mixture of Dirichlet tree distributions
– Two basic types of knowledge
• Must-Links: two words have similar probability
within any topic
• Cannot-Links: two words should not both have
large probability within any topic
Andrzejewski, Zhu, and Craven, ICML 2009
LDA with Dirichlet Forest Prior (Cont’d.)
– Additional types of knowledge:
• Split: separate two or more sets of words from a single topic into different topics by placing must-links within the sets and cannot-links between them
• Merge: combine two or more sets of words using must-links
• Isolate: place must-links within the common set, and cannot-links between the common set and the other high-probability words from all topics
Dirichlet Tree Distribution for Must-Link
• A Dirichlet tree distribution is a composition of Dirichlet distributions
– (a) A, B, and C form the vocabulary; start sampling from the root node to model Must-Link(A, B)
– (b) An instance with β = 1 and η = 50
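A sketch of sampling from such a two-level Dirichlet tree for Must-Link(A, B) over a three-word vocabulary {A, B, C}, using the β = 1, η = 50 setting from the slide (the exact edge-weight convention is an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_must_link(beta=1.0, eta=50.0):
    """The root splits mass between the internal node covering {A, B} and leaf
    C; the internal node, with large pseudo-counts eta, then splits its share
    almost evenly between A and B, enforcing p(A) ~= p(B)."""
    root = rng.dirichlet([2 * beta, beta])   # (internal node {A,B}, leaf C)
    inner = rng.dirichlet([eta, eta])        # (A, B) within the internal node
    return np.array([root[0] * inner[0], root[0] * inner[1], root[1]])

samples = np.array([sample_must_link() for _ in range(2000)])
# p(A) and p(B) stay close in every draw, while the (A+B)-vs-C split
# remains diffuse, matching the behavior described on the next slide.
```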
Dirichlet Tree Distribution
• The Dirichlet tree distribution can preserve specific correlation structures that the standard Dirichlet distribution cannot
• (c) A large set of samples from the Dirichlet tree in (b); note p(A) ≈ p(B)
• (d) A standard Dirichlet distribution with parameters (50, 50, 1)
Combining Dirichlet Tree for Cannot-Link
• (e) Cannot-Link (A,B) and Cannot-Link(B,C)
• (f) The complementary graph of (e)
• (g) The Dirichlet subtree for clique
{A,C}
• (h) The Dirichlet subtree for clique {B}
LDA with Dirichlet Forest Prior (Cont’d.)
• q ~ DirichletForest(β, η)
• φ ~ DirichletTree(q)
• A Dirichlet Forest is a mixture of
Dirichlet Trees
LDA with Dirichlet Forest Prior (Cont’d.)
Concept Topic Model
• Observed words are generated either from a set of hidden topics or from a set of fixed concepts
Steyvers, Smyth, and Chemudugunta, 2011
LDA with Wordnet
• Words are generated by walking down the
tree of Wordnet synsets
Boyd-Graber, Blei, and Zhu, 2007, EMNLP
Research Gaps
• The Dirichlet forest prior can be used to “constrain” topic models
– However, the model cannot “turn off” the constraints when they are inappropriate
• LDAWN provides a model-driven word-sense disambiguation mechanism
– Not suitable for topic modeling, since LDAWN cannot handle words not in Wordnet
• CTM assumes that pre-existing concepts are “constant”
– Different concepts may emerge in different contexts
Developing the
Wordnet-Enhanced Topic Model (WNTM)
• Need a more flexible framework to include
Wordnet concepts into the latent topic
model
• A topic in WNTM may be
– The combination of several WN synsets
– A new topic unrelated to existing synsets
– The combination of the above two
The WNTM
• x_di: the Wordnet concept vector for token i of document d
• Token-level influence structure
• q_d,j: document-specific topic tendency
• g_j: slope coefficients for x_di
The WNTM Model
• H_di,j = q_d,j + x_di′ g_j + e_di,j
• e_di,j ~ N(0, Σ)
• z_di = 0 if max(H_di) < 0; z_di = j if max(H_di) = H_di,j ≥ 0
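A sketch of one token's topic draw under this multinomial-probit-like setting (the array shapes and the 1-based topic indexing, with 0 as the all-negative outcome, are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_topic(q_d, x_di, g, chol_sigma):
    """H_di,j = q_d,j + x_di' g_j + e_di,j with e ~ N(0, Sigma);
    z_di = 0 if every H is negative, else the index j of the largest H.
    q_d: (J,) document tendencies; x_di: (K,) concept vector;
    g: (K, J) slopes; chol_sigma: Cholesky factor of Sigma."""
    e = chol_sigma @ rng.standard_normal(len(q_d))
    H = q_d + x_di @ g + e
    return 0 if H.max() < 0 else int(np.argmax(H)) + 1  # topics numbered 1..J

# With a dominant tendency toward topic 2 and tiny noise, topic 2 is drawn:
q = np.array([-5.0, 10.0, -5.0])
z = draw_topic(q, np.zeros(2), np.zeros((2, 3)), 0.01 * np.eye(3))
```

This is the mechanism through which the Wordnet concept vector x_di shifts the prior topic probabilities at the token level.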
Inference: Gibbs Sampling
• Updating z:
p(z_di = j | z_−di, w_di, w_−di, X, q, g, Σ)
∝ p(w_di | z_di = j, ·) · p(z_di = j | q, g, Σ, X)
= [ (n_−di,j^{w_di} + β) / (n_−di,j^· + Wβ) ] · p(z_di = j | q_d, g, Σ, x_di)
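One such update for a single token can be sketched as follows; the counts are toy values, and the prior term p(z_di = j | q_d, g, Σ, x_di) is simply passed in as a vector, since in WNTM it comes from the probit component:

```python
import numpy as np

def resample_token(j_old, w, n_jw, n_j, prior_z, beta, rng):
    """Resample one token's topic: remove it from the counts, score each topic
    by (n_-di,j^w + beta) / (n_-di,j + W*beta) times the prior, draw, restore."""
    W = n_jw.shape[1]
    n_jw[j_old, w] -= 1          # remove the token's current assignment
    n_j[j_old] -= 1
    p = (n_jw[:, w] + beta) / (n_j + W * beta) * prior_z
    p /= p.sum()
    j_new = rng.choice(len(n_j), p=p)
    n_jw[j_new, w] += 1          # add it back under the new assignment
    n_j[j_new] += 1
    return j_new

# Toy state: 2 topics, 2 words; the token (topic 0, word 1) is resampled.
n_jw = np.array([[2.0, 1.0], [0.0, 3.0]])  # topic-word counts
n_j = n_jw.sum(axis=1)
j = resample_token(0, 1, n_jw, n_j, np.array([0.5, 0.5]), 0.1,
                   np.random.default_rng(0))
```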
Inference: Augmented Gibbs Sampling
• Updating H:
H_di,j | H_di,−j ~ TruncatedNormal(·)
– McCulloch and Rossi (1994), Imai and van Dyk (2004)
Inference: Augmented Gibbs Sampling
• Draw a²* from trace(Σ_old⁻¹)/χ²_{J−1}.
• Draw H̃*_di,j by first drawing H*_di,j conditional on z, q_old, g_old, Σ_old, H_old, and setting H̃*_di,j = a* H*_di,j.
• Draw q_new and g_new by first drawing q*, g*, and a²** conditional on H̃*_di,j, a²*, and Σ_old, and setting q_new = q*/a**, g_new = g*/a**.
• Draw Σ_new by first drawing Σ* conditional on H̃*_di,j, q*, g*, and setting Σ_new = Σ*/Σ*_11 and H_new_di,j = H̃*_di,j/√Σ*_11.
Implementation
• C, C++, and OpenMP (core functions) + R (function interfaces) + Python (text pre-processing)
Research Testbed
• Reuters-21578
– 11,771 documents
– 775,553 words
– 26,898 unique words
• Wordnet 2.1 is used for concept
construction
Wordnet Concept Construction
• Select the Wordnet synsets most relevant to the given corpus
• Definition of a concept:
– A group of words with similar meanings, constructed from Wordnet synsets
• Consider nouns only
– Organized in a tree structure
Wordnet Concept Construction (Cont’d.)
• For each word:
– Find the root form using the morphy tool
– Identify the synsets for the word
– For each synset:
• Construct a concept by merging the words in this synset, its descendants, its parent, its siblings, and the descendants of its siblings
• Delete a concept if it contains fewer than 5 distinct tokens
• A concept is not useful if it contains too few words
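The merging step can be sketched over a toy synset tree (the synsets and word lists below are hypothetical stand-ins; the study walks real WordNet 2.1 synsets found via the morphy tool):

```python
# Hypothetical synset tree and word memberships, for illustration only.
children = {
    "asset.n.01": ["security.n.04", "reserve.n.02"],
    "security.n.04": ["share.n.01", "bond.n.02"],
}
words_of = {
    "asset.n.01": {"asset"},
    "security.n.04": {"security", "certificate"},
    "reserve.n.02": {"reserve", "backlog"},
    "share.n.01": {"share", "stock"},
    "bond.n.02": {"bond", "debenture"},
}
parent = {c: p for p, cs in children.items() for c in cs}

def descendants(s):
    out = set()
    for c in children.get(s, []):
        out |= {c} | descendants(c)
    return out

def build_concept(synset, min_size=5):
    """Merge the words of the synset, its descendants, its parent, its
    siblings, and the siblings' descendants; drop too-small concepts."""
    members = {synset} | descendants(synset)
    p = parent.get(synset)
    if p is not None:
        members.add(p)
        for sib in children[p]:
            members |= {sib} | descendants(sib)
    words = set().union(*(words_of[s] for s in members))
    return words if len(words) >= min_size else None

concept = build_concept("security.n.04")
```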
Wordnet Concept Construction (Cont’d.)
• For each concept:
– Compute the average co-occurrence length
• The number of the concept’s unique tokens appearing in a document
• Averaged over all positive values
– Delete concepts with average co-occurrence length ≤ 1.15
• Sort the concepts in descending order by a relevance score (average co-occurrence length / number of unique tokens)
• Delete the concepts in the bottom 25%
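The filtering steps above can be sketched as follows (the corpus and concepts are toy values; the thresholds follow the slide):

```python
def avg_cooccurrence_length(concept_words, docs):
    """Per document, count the concept's unique tokens that appear in it;
    average over the documents where that count is positive."""
    counts = [len(concept_words & set(doc)) for doc in docs]
    positive = [c for c in counts if c > 0]
    return sum(positive) / len(positive) if positive else 0.0

def filter_concepts(concepts, docs, min_len=1.15, drop_frac=0.25):
    """Keep concepts with average co-occurrence length > min_len, rank them by
    relevance (avg length / # unique words), and drop the bottom 25%."""
    scored = []
    for name, words in concepts.items():
        length = avg_cooccurrence_length(words, docs)
        if length > min_len:
            scored.append((length / len(words), name))
    scored.sort(reverse=True)
    keep = scored[: max(1, round(len(scored) * (1 - drop_frac)))]
    return [name for _, name in keep]

docs = [["oil", "gas", "price"], ["oil", "coal"], ["price"]]
concepts = {"fuel": {"oil", "gas", "coal", "petroleum"}, "single": {"price"}}
kept = filter_concepts(concepts, docs)
```

Here "fuel" survives (average co-occurrence length 2.0), while "single" is dropped (average length 1.0 ≤ 1.15).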
Wordnet Concepts

Concept                         # Unique Words / Avg. Freq. / Avg. Co-occur. Len.   Words in the Concept (at most 10 listed)
proportion.n.01                 6 / 2372.7 / 1.24    scale, percent, pct, content, rate, percentage
security.n.04                   7 / 1856.9 / 1.47    scrip, debenture, share, treasury, convertible, stock, bond
offer.n.02                      9 / 842.4 / 1.24     price, question, proposition, prospectus, tender, proposal, reward, bid, special
fossil fuel.n.01                6 / 838.8 / 1.64     oil, jet, gas, petroleum, coal, crude
funds.n.01                      7 / 806.0 / 1.15     exchequer, pocket, till, trough, treasury, roll, bank
sum.n.01                        49 / 736.3 / 2.21    figure, revenue, pool, win, purse, sales, profits, rent, proceeds, payoff (truncated)
social science.n.01             5 / 688.2 / 1.17     econometrics, politics, economics, finance, government
slope.n.01                      15 / 616.7 / 1.42    decline, upgrade, descent, waterside, rise, coast, uphill, steep, brae, fall (truncated)
gregorian calendar month.n.01   20 / 612.3 / 1.51    february, feb, mar, march, august, aug, september, sept, december, dec (truncated)
Summary Statistics of Wordnet Concepts

# of Wordnet Concepts per Word:   0     1     2     3     4 or more
Proportion:                       45%   27%   12%   8%    8%
Perplexity at Different Sweeps
• # of topics = 25
Estimated Topic “Statement”
Top Keywords:
estimate statement bill account action order coupon intervention
review case usair accounting suit pass transfer
Wordnet Concepts:
commercial document.n.01 (5.43)*
estimate statement bill account order
proceeding.n.01 (4.29) *
action intervention review case suit
relationship.n.03 (0.86) *
account hold restraint trust confinement
advantage.n.01 (0.69)*
account leverage profitability expediency privilege
fact.n.01 (0.51) *
case observation score specific item
Matching LDA Topic
Top Keywords:
ct net loss shr profit rev note oper avg shrs mths qtr sales exclude gain
*Estimated coefficients for Wordnet concepts.
Estimated Topic “Earnings”
Top Keywords:
mln ct net loss dlrs shr profit rev note year gain oper include avg
shrs
Wordnet Concepts:
advantage.n.01 (3.59) *
profit gain good leverage preference
subject.n.01 (-0.02) *
puzzle head precedent case question
push.n.01 (-0.03) *
pinch crunch nudge mill boost
legislature.n.01 (-0.06) *
diet congress house senate parliament
Matching LDA Topic
Top Keywords:
mln note net stg include profit tax extraordinary pretax operate full item making turnover income
Estimated Topic “Market Update”
Top Keywords:
week total end product period average amount demand supply line
inflation term shipment number release
Wordnet Concepts:
quantity.n.03 (5.33) *
total product average amount term
part.n.09 (4.66) *
end period factor top beginning
work time.n.01 (4.38) *
week turn hours shift turnaround
economic process.n.01 (4.34) *
demand supply inflation consumption spiral
merchandise.n.01 (4.26) *
line shipment number release inventory cargo
Matching LDA Topic
Top Keywords:
union south area spokesman city ship strike port worker africa line week affect state southern
Estimated Topic “Macroeconomics”
Top Keywords:
dollar market currency west yen economic dealer central growth cut
japan economy expect policy interest
Wordnet Concepts:
semite.n.01 (-0.03) *
palestinian arab saudi omani arabian
rational_number.n.01 (-0.11) *
thousandth fraction fourth eighth half
seed.n.01 (-0.12) *
soybean coffee hazelnut nut cob
fact.n.01 (-0.22) *
observation score specific item case
Matching LDA Topic
Top Keywords:
dollar currency yen west exchange market rates japan dealer central german germany intervention finance paris
The Effect of Wordnet Concepts
The Effect of Topic Number
Questions