Filtering Semi-Structured Documents Based on Faceted Feedback
Lanbo Zhang, Yi Zhang, Qianli Xing
Information Retrieval and Knowledge Management (IRKM) Lab
University of California, Santa Cruz
Personalized Information Filtering
• Identify user-desired documents from a document stream
• Two families of filtering approaches
– Collaborative Filtering (CF)
– Content-Based Filtering (CBF)
• Applications: news feeder, email spam filter, etc.
[Figure: news, blogs, emails, … enter the filtering system, which outputs the passed documents]
Semi-Structured Documents
• Increasingly prevalent over the Internet
• Emails, news, movies, tweets, etc.
• Plenty of metadata available
Definitions
• Facet: a metadata field
– Date, Topic, Location, Director, Genre, etc.
• Facet-Value Pair (FVP): a metadata field assigned with a particular value
– Topic: Royal wedding
– Date: 04-29-2011
– Location: London, UK
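A minimal sketch of how these definitions might look in code. The `FVP` type and the set-of-pairs document layout are hypothetical names chosen for illustration; only the example field values come from the slide.

```python
from typing import FrozenSet, NamedTuple


class FVP(NamedTuple):
    """A facet-value pair: one metadata field with one particular value."""
    facet: str
    value: str


# A semi-structured document viewed as the set of FVPs found in its metadata.
doc_fvps: FrozenSet[FVP] = frozenset({
    FVP("Topic", "Royal wedding"),
    FVP("Date", "04-29-2011"),
    FVP("Location", "London, UK"),
})

# Facets are the distinct metadata fields present in the document.
facets = sorted({pair.facet for pair in doc_fvps})
```

Storing FVPs as hashable tuples makes membership tests and set operations cheap, which is convenient when matching documents against user-selected pairs.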
Lanbo Zhang, Yi Zhang, Qianli Xing. IRKM Lab at University of California, Santa Cruz
Motivation
• Existing filtering approaches learn user interests based on users’ relevance judgments of documents
• Users may have prior knowledge of which facet-value pairs are relevant
– English-only readers
• “Language: English”
– Social network analysts
• “Company: Facebook”
Can we exploit users’ prior knowledge on facet-value pairs for filtering?
A New User Interaction Mechanism: Faceted Feedback
[Figure: the filtering system presents FVP candidates (Lang: …, Topic: …, Date: …); the user returns the relevant FVPs (Topic: …, Lang: …)]
Research Questions
• Question 1
– How to select facet-value pair candidates?
• Question 2
– How to learn user profiles based on faceted feedback?
Q1: Possible Methods
• Feature selection methods for text classification
– E.g., Mutual Information, Chi-Square measure, etc.
• Usually a large number of labeled documents available
• Query expansion methods for retrieval
– E.g., TFIDF score on pseudo relevant documents
• No labeled documents available
FVP Selection: Our Approach
• In a filtering task
– A large number of unlabeled documents
– Possibly a small number of labeled documents
• We rank facet-value pairs by
[Equation not preserved in the transcript. Its annotated terms: pseudo relevant (positively classified) documents; user-labeled relevant documents; and $\mathrm{IDF}(f) = \log\frac{N}{N_f}$]
Intuition: features that occur frequently among relevant docs while rarely in the whole corpus are very likely to be relevant
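The intuition can be sketched in code. The exact scoring formula is not preserved in this transcript, so the sketch below assumes a TF-IDF-style score: the frequency of an FVP among pseudo relevant and user-labeled relevant documents, multiplied by IDF(f) = log(N / N_f). The function name `rank_fvps` and the `labeled_weight` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def rank_fvps(corpus, pseudo_relevant, labeled_relevant, labeled_weight=2.0):
    """Rank FVPs by (weighted frequency among relevant docs) * IDF.

    Each document is a set of (facet, value) tuples. The weighting scheme
    is an assumption for illustration, not the formula from the talk.
    """
    n_docs = len(corpus)
    # N_f: number of corpus documents that contain each FVP.
    doc_freq = Counter(fvp for doc in corpus for fvp in doc)

    # Weighted frequency among pseudo relevant and user-labeled relevant docs.
    rel_freq = defaultdict(float)
    for doc in pseudo_relevant:
        for fvp in doc:
            rel_freq[fvp] += 1.0
    for doc in labeled_relevant:
        for fvp in doc:
            rel_freq[fvp] += labeled_weight

    scores = {}
    for fvp, freq in rel_freq.items():
        if doc_freq[fvp] == 0:
            continue  # FVP never seen in the corpus; IDF is undefined
        idf = math.log(n_docs / doc_freq[fvp])  # IDF(f) = log(N / N_f)
        scores[fvp] = freq * idf
    return sorted(scores, key=scores.get, reverse=True)


d1 = {("Lang", "English"), ("Company", "Facebook")}
d2 = {("Lang", "English"), ("Company", "Facebook")}
d3 = {("Lang", "English"), ("Company", "IBM")}
d4 = {("Lang", "English"), ("Company", "IBM")}
ranking = rank_fvps([d1, d2, d3, d4], pseudo_relevant=[d1], labeled_relevant=[d2])
```

In this toy corpus "Lang: English" appears in every document, so its IDF is zero and "Company: Facebook" ranks first, matching the stated intuition.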
Content-Based Filtering (CBF)
• Treated as a binary text classification task
• User profile: a feature vector that represents a user’s information needs (interests/preferences)
• Given the user profile θ, a document can be determined as relevant or not according to:
[Equation not preserved in the transcript. Its annotations: document vector; document label]
The core of CBF is learning the user profile!
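A minimal sketch of such a decision rule. The slide's formula is not preserved here, so this assumes a logistic model, a common choice for content-based filtering; the function names, the threshold, and the example vectors are all hypothetical.

```python
import math


def relevance_probability(theta, x):
    """P(relevant | x, theta) under an assumed logistic model."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def is_relevant(theta, x, threshold=0.5):
    """Deliver the document when its relevance probability passes the threshold."""
    return relevance_probability(theta, x) >= threshold


theta = [1.5, -2.0, 0.5]      # hypothetical user profile
doc_vector = [1.0, 0.0, 1.0]  # hypothetical document feature vector
delivered = is_relevant(theta, doc_vector)
```

Under this assumption, learning the user profile amounts to fitting `theta`, which is where the feedback discussed in the rest of the talk comes in.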
Q2: Possible Methods
• Simple methods
– Boolean strategy (AND, OR)
– Feature selection
– Pseudo relevant document
• Sophisticated methods
– Bayesian logistic regression with an adjusted prior (Dayanik et al. 06)
– Generalized Expectation Criteria (Druck et al. 08)
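As an illustration of the simplest baseline, here is a sketch of the Boolean strategy. It assumes the AND variant passes a document only if it contains every user-selected FVP and the OR variant passes it if it contains at least one; the talk does not spell out these semantics, and the function names are hypothetical.

```python
def bool_and(doc_fvps, selected_fvps):
    """Pass the document only if it contains every user-selected FVP."""
    return set(selected_fvps) <= set(doc_fvps)


def bool_or(doc_fvps, selected_fvps):
    """Pass the document if it contains at least one user-selected FVP."""
    return bool(set(selected_fvps) & set(doc_fvps))


doc = {("Language", "English"), ("Company", "Twitter")}
selected = [("Language", "English"), ("Company", "Facebook")]
passes_and = bool_and(doc, selected)
passes_or = bool_or(doc, selected)
```

The AND variant tends to favor precision and the OR variant recall, which is one reason a learned profile can be preferable to a hard rule.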
Our Approach
• The assumption– A feature is selected by a user since it has a high
correlation with the document label (R/NR)
• Generalized Constraint Model (GCM)
14Lanbo Zhang, Yi Zhang, Qianli Xing. IRKM Lab at University of California, Santa Cruz
Correlation Decomposition
• Sufficiency
– The probability of a document being relevant given that the feature has occurred: P(R+|f=1)
– P(R+|f=1)=1 : sufficient features
• E.g., “Company: Facebook” for social network analysts
• Necessity
– The probability of the feature having occurred given that a document is relevant: P(f=1|R+)
– P(f=1|R+)=1 : necessary features
• E.g., “Language: English” for English-only readers
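The two quantities can be computed empirically when relevance labels are known. The sketch below assumes documents are identified by integer ids and that both the relevant set and the set of documents containing a feature are available; the function names and toy ids are hypothetical.

```python
def sufficiency(docs_with_feature, relevant_docs):
    """Empirical P(R+ | f=1): share of documents containing f that are relevant."""
    if not docs_with_feature:
        return 0.0
    return len(docs_with_feature & relevant_docs) / len(docs_with_feature)


def necessity(docs_with_feature, relevant_docs):
    """Empirical P(f=1 | R+): share of relevant documents that contain f."""
    if not relevant_docs:
        return 0.0
    return len(docs_with_feature & relevant_docs) / len(relevant_docs)


relevant = {1, 2, 3, 4}
facebook_docs = {1, 2}              # every covered document is relevant
english_docs = {1, 2, 3, 4, 5, 6}   # covers every relevant document

facebook_suff = sufficiency(facebook_docs, relevant)
english_nec = necessity(english_docs, relevant)
```

Here the "Facebook" feature is sufficient but not necessary, while the "English" feature is necessary but not sufficient.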
Examples: Highly-Correlated Features
[Figure: regions within the whole corpus: the relevant set R+ and the document sets f1=1, f2=1, f3=1]
1) f1 is a sufficient feature since P(R+|f1=1)=1
2) f2 is a necessary feature since P(f2=1|R+)=1
3) f3 is neither necessary nor sufficient, but both its sufficiency and necessity are high (>0.5)
Estimating Sufficiency
[Equation not preserved in the transcript. Its annotations: document label; the feature; the set of documents covered by feature f; user profile vector; estimation of the label of document di]
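One reading consistent with these annotations, offered as an assumption because the slide's formula is not preserved: the sufficiency of feature f is estimated by averaging the model's predicted relevance over the documents that f covers. The symbols $D_f$ (documents covered by f), $d_i$, $y_i$ (label of $d_i$), and $\theta$ (user profile vector) are notation chosen here.

```latex
P(R^{+} \mid f = 1) \;\approx\; \frac{1}{|D_f|} \sum_{d_i \in D_f} P(y_i = 1 \mid d_i, \theta)
```

Because the estimate depends on $\theta$, it can act as a constraint on the profile during learning rather than as a fixed statistic.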
Estimating Necessity
[Equation not preserved in the transcript. Its annotations: feature sufficiency; Bayes’ Theorem!; prior distribution]
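The Bayes' theorem step can be sketched as follows: necessity is obtained from sufficiency via P(f=1|R+) = P(R+|f=1) · P(f=1) / P(R+). How the talk estimates the prior terms is not preserved, so the function name and the example numbers are hypothetical.

```python
def necessity_from_sufficiency(suff, p_feature, p_relevant):
    """Bayes' theorem: P(f=1 | R+) = P(R+ | f=1) * P(f=1) / P(R+)."""
    if p_relevant <= 0.0:
        raise ValueError("p_relevant must be positive")
    return suff * p_feature / p_relevant


# Hypothetical numbers: f occurs in 30% of documents, 20% of documents
# are relevant, and half of the documents containing f are relevant.
nec = necessity_from_sufficiency(suff=0.5, p_feature=0.3, p_relevant=0.2)
```

With these numbers, three quarters of the relevant documents contain f even though only half of the documents containing f are relevant.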
Reference Distributions
• Our assumption
– User selects a feature since it has a high sufficiency and/or a high necessity
• Reference distributions: two Bernoulli dist’ns
– The sufficiency/necessity of a user-selected feature should be close to the reference distribution
– KL-divergence for similarity measure
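A small sketch of the similarity measure: the KL divergence between two Bernoulli distributions. The reference parameter 0.9 used in the example and the direction of the divergence (reference first) are assumptions; the talk does not state either.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)), with clipping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


reference = 0.9  # hypothetical reference distribution parameter
close = bernoulli_kl(reference, 0.85)  # estimated sufficiency near the reference
far = bernoulli_kl(reference, 0.30)    # estimated sufficiency far from it
```

A feature whose estimated sufficiency or necessity is far from the reference incurs a larger divergence, which is what lets the measure serve as a penalty during learning.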
User Profile Learning
• The unified loss function to combine two types of feedback:
[Equation not preserved in the transcript. Its annotated terms: user-labeled documents; necessary features; sufficient features]
Ts, Tn: reference dist’ns
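A rough sketch of what a loss with these three terms could look like. Since the actual formula is not preserved, everything specific here is an assumption: the logistic model, the log loss on labeled documents, the use of unlabeled documents to estimate sufficiency and necessity, the KL direction, and the hypothetical parameters `t_s`, `t_n`, `lam_s`, and `lam_n`.

```python
import math


def _kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def predict(theta, x):
    """Assumed logistic model for P(relevant | x, theta)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def unified_loss(theta, labeled, unlabeled, sufficient, necessary,
                 t_s=0.9, t_n=0.9, lam_s=1.0, lam_n=1.0):
    """Log loss on labeled docs plus KL penalties for user-selected features.

    labeled: list of (x, y) with y in {0, 1}; unlabeled: non-empty list of x.
    sufficient / necessary: feature indices selected by the user.
    """
    eps = 1e-12
    loss = 0.0
    # Term 1: user-labeled documents.
    for x, y in labeled:
        p = min(max(predict(theta, x), eps), 1.0 - eps)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)

    probs = [predict(theta, x) for x in unlabeled]
    p_rel = sum(probs) / len(probs)  # estimated P(R+)

    for f in set(sufficient) | set(necessary):
        covered = [p for p, x in zip(probs, unlabeled) if x[f] > 0]
        if not covered:
            continue
        suff = sum(covered) / len(covered)            # estimated P(R+ | f=1)
        p_f = len(covered) / len(unlabeled)           # estimated P(f=1)
        nec = min(1.0, suff * p_f / max(p_rel, eps))  # Bayes' theorem
        if f in sufficient:
            loss += lam_s * _kl(t_s, suff)  # Term 2: sufficient features
        if f in necessary:
            loss += lam_n * _kl(t_n, nec)   # Term 3: necessary features
    return loss


unlabeled = [[1, 0], [1, 0], [0, 1], [0, 1]]
labeled = [([1, 0], 1), ([0, 1], 0)]
good = unified_loss([3.0, -3.0], labeled, unlabeled, sufficient=[0], necessary=[0])
bad = unified_loss([-3.0, 3.0], labeled, unlabeled, sufficient=[0], necessary=[0])
```

A profile that agrees with both the labeled documents and the selected feature yields a lower loss than one that contradicts them, so minimizing the loss over `theta` would combine the two kinds of feedback.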
User Interaction Mechanisms
• Two mechanisms
– Mechanism 1: ask users to select features they think are relevant
– Mechanism 2: ask users to specifically select features they think are sufficient and necessary respectively
Outline
• Introduction
• Faceted Feedback
– Facet-Value Pair Candidate Selection
– Learning from Faceted Feedback
• Experiments
– Settings
– Results
• Summary
Data Sets
• Use two data sets from the TREC filtering track
– TREC 2000: OHSUMED (348,566 medical articles) + 63 topics (information needs)
• Metadata field: MeSH (Medical Subject Headings)
– TREC 2002: RCV1 (~800,000 news articles) + 50 topics defined by human assessors
• Metadata fields: Topic, Industry, Region
• Split each topic set into two equal-size subsets
– One for parameter tuning, the other for testing
Faceted Feedback Collection
• Recruit subjects on Mechanical Turk
– Five subjects per topic
– The average performance is reported
• For each topic, we show subjects
– The topic description (information need)
– A group of facet-value pair candidates
Evaluation Metrics
• Precision (macro)
• Recall (macro)
• T11U = 2 * Nrd - Nnd
– Nrd: the number of relevant docs delivered
– Nnd: the number of non-relevant docs delivered
• T11SU = (max(T11U / MaxU, MinNU) - MinNU) / (1 - MinNU)
– MinNU = -0.5
– MaxU: the maximum possible utility (T11U)
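The utility metrics can be sketched in code. The scaled utility below follows the TREC 2002 filtering track definition as commonly stated: T11U is normalized by MaxU, floored at MinNU, and rescaled to [0, 1]. Taking MaxU as twice the number of relevant documents follows from the T11U formula; the function names are hypothetical.

```python
def t11u(n_rel_delivered, n_nonrel_delivered):
    """T11U = 2 * Nrd - Nnd."""
    return 2 * n_rel_delivered - n_nonrel_delivered


def t11su(n_rel_delivered, n_nonrel_delivered, n_rel_total, min_nu=-0.5):
    """Scaled utility: normalize T11U by MaxU, floor at MinNU, map to [0, 1].

    Assumes n_rel_total > 0.
    """
    # MaxU: utility of delivering every relevant doc and nothing else.
    max_u = 2 * n_rel_total
    norm_u = t11u(n_rel_delivered, n_nonrel_delivered) / max_u
    return (max(norm_u, min_nu) - min_nu) / (1.0 - min_nu)


# Hypothetical run: 10 relevant and 5 non-relevant docs delivered,
# out of 20 relevant docs in total.
utility = t11u(10, 5)
scaled = t11su(10, 5, 20)
```

A system that delivers nothing scores 1/3 under this scaling, so values below that indicate a profile that does more harm than staying silent.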
Results 1: With/Without Faceted Feedback (FF)
[Figure: filtering performance with and without faceted feedback; x-axis: # relevant docs initially known]
Faceted feedback improves filtering performance, especially when fewer relevant documents are initially known.
Results 2: Different Learning Algorithms
[Figure: performance of different learning algorithms, grouped as “Our approach” and “Existing approaches”]
BOOL(A), BOOL(O): Boolean strategy
FS: feature selection based on FF
Pseudo-D/Q: pseudo relevant doc/query
Prior: logistic regression with Bayesian prior
GEC: generalized expectation criteria
Summary
• Faceted feedback is useful for filtering, especially in cold-start scenarios
• The Generalized Constraint Model (GCM) is a robust user profile learning algorithm
• In future work, we will evaluate our methods on data sets where faceted features are more important
– Movie, music, product, etc.
Questions?