TRANSCRIPT
Modeling Documents by Combining Semantic Concepts with
Unsupervised Statistical Learning
Authors: Chaitanya Chemudugunta
America Holloway
Padhraic Smyth
Mark Steyvers
Source: ISWC2008
Reporter: Yong-Xiang Chen
2
Abstract
• Problem: mapping an entire document or Web page to concepts in a given ontology
• Proposes a probabilistic modeling framework
  – Based on statistical topic models (also known as LDA models)
  – The methodology combines:
    • human-defined concepts
    • data-driven topics
• The methodology can be used to automatically tag Web pages with concepts
  – From a known set of concepts
  – Without any need for labeled documents
  – Maps all words in a document, not just entities
3
Concepts from ontologies
• An ontology includes
  – Ontological concepts
    • Associated vocabulary
    • Hierarchical relations between concepts
• Topics from statistical models and concepts from ontologies both represent “focused” sets of words that relate to some abstract notion
4
Words that relate to “FAMILY”
5
Statistical Topic Models
• LDA is a state-of-the-art unsupervised learning technique for extracting thematic information
• Topic model
  – Words in a document arise via a two-stage process:
    • Word-topic distributions: p(w|z)
    • Topic-document distributions: p(z|d)
  – Both are learned in a completely unsupervised manner
    • Gibbs sampling assigns topics to each word in the corpus
  – The topic variable z plays the role of a low-dimensional representation of the semantic content of a document
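The two-stage generative process above can be sketched as a toy Python snippet. The distributions here are made up for illustration; they are not from the paper:

```python
import random

random.seed(0)

# Hypothetical toy distributions: two topics over a tiny vocabulary.
# phi is p(w|z), the word-topic distributions.
phi = {
    0: {"gene": 0.5, "dna": 0.4, "music": 0.1},
    1: {"music": 0.6, "band": 0.3, "dna": 0.1},
}
# theta is p(z|d), the topic-document distribution for one document.
theta = {0: 0.7, 1: 0.3}

def sample(dist):
    """Draw one outcome from a {outcome: probability} distribution."""
    r, total = random.random(), 0.0
    for outcome, p in dist.items():
        total += p
        if r < total:
            return outcome
    return outcome  # guard against floating-point rounding

def generate_word():
    """Stage 1: pick a topic z ~ p(z|d). Stage 2: pick a word w ~ p(w|z)."""
    z = sample(theta)
    return sample(phi[z])

doc = [generate_word() for _ in range(10)]
print(doc)
```

In the real model these distributions are the unknowns: Gibbs sampling inverts this generative story to infer them from the observed words.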
6
Topic
• A topic takes the form of a multinomial distribution over a vocabulary of words
• A topic can, in a loose sense, be viewed as a probabilistic representation of a semantic concept
7
How do statistical topic modeling techniques "overlay" probabilities on concepts within an ontology?
1. C: the number of human-defined concepts
  – Each concept cj consists of a finite set of Nj unique words
2. A corpus of documents, such as Web pages
• How can these two sources of information be merged via the topic model?
  – "Tagging" documents with concepts from the ontology, with little or no supervised labeled data available
• The concept model: replace topic z with concept c
• Unknown parameters: p(wi|cj) and p(cj|d)
• Goal: estimate these from an appropriate corpus
8
Concept model vs. Topic model
• In the concept model
  – The words that belong to a concept are defined by a human a priori
  – Limited to a small subset of the overall vocabulary
• In the topic model
  – All words in the vocabulary can be associated with any particular topic, but with different probabilities
9
Treat concepts as “topics with constraints”
• The constraints consist of setting words that are not a priori mentioned in a concept to have probability 0
• Use Gibbs sampling to assign concepts to words in documents
  – The additional constraint is that a word can only be assigned to a concept that it is associated with in the ontology
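The constrained assignment step can be sketched in Python. The concept definitions and count-based weights below are illustrative placeholders, not from the paper:

```python
import random

random.seed(1)

# Hypothetical concept definitions (word sets fixed a priori by the ontology).
concepts = {
    "FAMILY": {"mother", "father", "child"},
    "MUSIC": {"band", "song", "child"},  # a word may belong to several concepts
}

def allowed(word):
    """The constraint: a word may only be assigned to concepts that contain it."""
    return [c for c, words in concepts.items() if word in words]

def gibbs_assign(word, weights):
    """Sample a concept for `word`, restricted to its allowed set, with
    probability proportional to the supplied (count-based) weights."""
    cands = allowed(word)
    w = [weights.get(c, 1.0) for c in cands]
    return random.choices(cands, weights=w, k=1)[0]

print(allowed("child"))  # both concepts contain "child"
print(gibbs_assign("mother", {"FAMILY": 5.0, "MUSIC": 2.0}))  # "FAMILY"
```

Because "mother" appears only in FAMILY, its assignment is forced; ambiguous words like "child" are resolved probabilistically by the sampler.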
10
Estimate unknown parameters
• To estimate p(wi|cj)
  – Count how many words in the corpus were assigned by the sampling algorithm to concept cj, and normalize these counts to arrive at the probability distribution p(wi|cj)
• To estimate p(cj|d)
  – Count how many times each concept is assigned to a word in document d, and again normalize and smooth the counts to obtain p(cj|d)
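The count-and-normalize estimation can be sketched as follows. The sampler output and the add-one smoothing used for p(c|d) are illustrative simplifications, not the paper's exact scheme:

```python
from collections import Counter, defaultdict

# Hypothetical sampler output: (document, word, assigned concept) triples.
assignments = [
    ("d1", "mother", "FAMILY"),
    ("d1", "father", "FAMILY"),
    ("d1", "song", "MUSIC"),
    ("d2", "band", "MUSIC"),
]

# p(w|c): count word assignments per concept, normalize within each concept.
wc = defaultdict(Counter)
for _, word, concept in assignments:
    wc[concept][word] += 1
p_w_given_c = {
    c: {w: n / sum(cnt.values()) for w, n in cnt.items()} for c, cnt in wc.items()
}

# p(c|d): count concept assignments per document, then normalize with
# add-one smoothing over the full concept set (a simple stand-in for
# whatever smoothing the paper actually uses).
concept_set = sorted({c for _, _, c in assignments})
cd = defaultdict(Counter)
for doc, _, concept in assignments:
    cd[doc][concept] += 1
p_c_given_d = {
    d: {c: (cnt[c] + 1) / (sum(cnt.values()) + len(concept_set)) for c in concept_set}
    for d, cnt in cd.items()
}

print(p_w_given_c["FAMILY"])  # {'mother': 0.5, 'father': 0.5}
print(p_c_given_d["d1"])      # {'FAMILY': 0.6, 'MUSIC': 0.4}
```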
11
Example
• Learned probabilities for words (ranked highest by probability) for two different concepts from the CIDE ontology, after training on the TASA corpus
12
Usage of concept model
• Use the learned probabilistic representation of concepts to map new documents into concepts within an ontology
• Use the semantic concepts to improve the quality of data-driven topic models
13
Variations of the concept model framework
• ConceptL (concept-learned)
  – The base concept model described above, with the word-concept probabilities p(wi|cj) learned from data
• ConceptU (concept-uniform)
  – The word-concept probabilities p(wi|cj) are defined to be uniform for all words within a concept
  – Baseline model
• ConceptF (concept-fixed)
  – The word-concept probabilities are available a priori as part of the concept definition
• For both of these models, Gibbs sampling is used as before, but the p(w|c) probabilities are held fixed and not learned
• ConceptLH, ConceptFH
  – Incorporate hierarchical information such as a concept tree
  – An internal concept node is associated with its own words and all the words associated with its children
• ConceptL+Topics, ConceptLH+Topics
  – Incorporation of unconstrained data-driven topics alongside the concepts
  – Allowing the Gibbs sampling procedure to either assign a word to a constrained concept or to one of the unconstrained topics
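For the Concept+Topics variants, the sampler's candidate set for each word combines the constrained concepts with the unconstrained topics. A minimal sketch, with illustrative concept definitions and topic count:

```python
# Hypothetical concept definitions and number of data-driven topics.
concepts = {"FAMILY": {"mother", "child"}, "MUSIC": {"band", "song"}}
n_topics = 3  # unconstrained topics T0..T2

def candidates(word):
    """Candidate assignments for a word: every concept whose a priori word
    set contains it, plus all unconstrained data-driven topics."""
    cons = [c for c, ws in concepts.items() if word in ws]
    topics = [f"T{k}" for k in range(n_topics)]
    return cons + topics

print(candidates("mother"))  # ['FAMILY', 'T0', 'T1', 'T2']
print(candidates("quark"))   # no concept matches: ['T0', 'T1', 'T2']
```

Words outside every concept's vocabulary can still be modeled, because the unconstrained topics are always available as a fallback.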
14
Experiment setting (1/2)
• Text corpus
  – Touchstone Applied Science Associates (TASA) dataset
  – D = 37,651 documents with passages excerpted from educational texts
  – Divided into 9 different educational topics
  – Focus only on the documents classified as SCIENCE (D = 5,356; word tokens = 1.7M) and SOCIAL STUDIES (D = 10,501; word tokens = 3.4M)
15
Experiment setting (2/2)
• Concepts
  – The Open Directory Project (ODP), a human-edited hierarchical directory of the web
    • Contains descriptions and URLs on a large number of hierarchically organized topics
    • Extracted all the topics in the SCIENCE subtree, which consists of C = 10,817 nodes
  – The Cambridge International Dictionary of English (CIDE)
    • Consists of C = 1,923 hierarchically organized semantic categories
    • The concepts vary in the number of words, with a median of 54 words and a maximum of 3,074
    • Each word can be a member of multiple concepts
16
Tagging Documents with Concepts
• Assigning likely concepts to each word in a document, depending on the context of the document
• The concept models assign concepts at the word level, so a document can be summarized at multiple levels of granularity
  – Snippets, sections, whole Web pages, …
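Because tags live at the word level, summarizing any span is just aggregating its word-level assignments. A toy sketch with made-up tags:

```python
from collections import Counter

# Hypothetical word-level concept tags for a short page: (word, concept).
tags = [
    ("mother", "FAMILY"), ("child", "FAMILY"), ("band", "MUSIC"),
    ("song", "MUSIC"), ("song", "MUSIC"), ("father", "FAMILY"),
]

def summarize(tagged, top=2):
    """Summarize any span of tagged words by its most frequent concepts."""
    return Counter(c for _, c in tagged).most_common(top)

print(summarize(tags))       # whole page
print(summarize(tags[:3]))   # a snippet: the first three words only
```

The same function serves every granularity; only the slice of tagged words changes.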
17
Using the ConceptU model to automatically tag a Web page with CIDE concepts
18
Using the ConceptU model to automatically tag a Web page with ODP concepts
19
Tagging at the word level using the ConceptL model
20
Language Modeling Experiments
• Perplexity
  – Lower scores are better, since they indicate that the model's distribution is closer to that of the actual text
  – 90% of the documents are used for training
  – 10% for computing test perplexity
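Perplexity is the exponentiated negative average log-likelihood per word. A minimal sketch, with made-up per-word probabilities:

```python
import math

# Hypothetical per-word probabilities a model assigns to a held-out text.
word_probs = [0.1, 0.05, 0.2, 0.1]

# Perplexity = exp(-average log-likelihood per word); lower is better.
avg_ll = sum(math.log(p) for p in word_probs) / len(word_probs)
perplexity = math.exp(-avg_ll)
print(round(perplexity, 2))  # 10.0
```

Intuitively, a perplexity of 10 means the model is, on average, as uncertain as a uniform choice among 10 words.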
21
Perplexity for various models
22
Topic model vs. ConceptLH+Topics
23
Varying the training data
24
Conclusion
• The model can automatically place words and documents in a text corpus into a set of human-defined concepts
• Illustrated how concepts can be "tuned" to a corpus to obtain an improved probabilistic language model