
Filtering Semi-Structured Documents Based on Faceted Feedback

Lanbo Zhang, Yi Zhang, Qianli Xing

Information Retrieval and Knowledge Management (IRKM) Lab

University of California, Santa Cruz

Personalized Information Filtering

• Identify user-desired documents from a document stream
• Two families of filtering approaches
  – Collaborative Filtering (CF)
  – Content-Based Filtering (CBF)
• Applications: news feeds, email spam filters, etc.

[Figure: a filtering system that takes News, Blogs, Emails, … as input and outputs the passed documents]

Semi-Structured Documents

• Increasingly prevalent over the Internet
• Emails, news, movies, tweets, etc.
• Plenty of metadata available

Definitions

• Facet: a metadata field
  – Date, Topic, Location, Director, Genre, etc.
• Facet-Value Pair (FVP): a metadata field assigned a particular value
  – Topic: Royal wedding
  – Date: 04-29-2011
  – Location: London, UK
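
One way to picture these definitions is as a small data structure. The sketch below is illustrative only: the class names `FacetValuePair` and `Document`, the `has` helper, and the sample text are assumptions for this example, not part of the original system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FacetValuePair:
    """A metadata field (facet) bound to one particular value."""
    facet: str
    value: str


@dataclass
class Document:
    """A semi-structured document: free text plus a set of facet-value pairs."""
    text: str
    fvps: frozenset = field(default_factory=frozenset)

    def has(self, fvp: FacetValuePair) -> bool:
        """True if this document carries the given facet-value pair."""
        return fvp in self.fvps


# Hypothetical document built from the example FVPs on the slide.
doc = Document(
    text="Crowds gather for the royal wedding.",
    fvps=frozenset({
        FacetValuePair("Topic", "Royal wedding"),
        FacetValuePair("Date", "04-29-2011"),
        FacetValuePair("Location", "London, UK"),
    }),
)
```

Making `FacetValuePair` frozen keeps it hashable, so FVPs can be stored in sets and used as dictionary keys when counting occurrences.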


Motivation

• Existing filtering approaches learn user interests based on users’ relevance judgments of documents

• Users may have prior knowledge on which facet-value pairs are relevant
  – English-only readers
    • “Language: English”
  – Social network analysts
    • “Company: Facebook”



Can we exploit users’ prior knowledge on facet-value pairs for filtering?


A New User Interaction Mechanism: Faceted Feedback

[Figure: the filtering system presents FVP candidates (Lang: …, Topic: …, Date: …) and receives the relevant FVPs (Topic: …, Lang: …)]

Research Questions

• Question 1
  – How to select facet-value pair candidates?
• Question 2
  – How to learn user profiles based on faceted feedback?


Q1: Possible Methods

• Feature selection methods for text classification
  – E.g., Mutual Information, Chi-Square measure, etc.
    • Usually a large number of labeled documents available
• Query expansion methods for retrieval
  – E.g., TFIDF score on pseudo relevant documents
    • No labeled documents available


FVP Selection: Our Approach

• In a filtering task
  – A large number of unlabeled documents
  – Possibly a small number of labeled documents
• We rank facet-value pairs by a score [formula not recovered] whose annotated terms are
  – Pseudo relevant (positively classified) documents
  – User-labeled relevant documents

$\mathrm{IDF}(f) = \log \frac{N}{N_f}$

Intuition: features that occur frequently among relevant docs while rarely in the whole corpus are very likely to be relevant
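
The intuition above can be sketched in code. The exact ranking formula is not legible in the extracted slide, so this example assumes a simple score: the number of user-labeled and pseudo-relevant documents containing a facet-value pair, multiplied by IDF(f) = log(N / N_f), with N taken as the corpus size and N_f as the number of corpus documents containing f. The function name `rank_fvps` and the additive combination of the two document sets are assumptions.

```python
import math
from collections import Counter


def rank_fvps(labeled_relevant, pseudo_relevant, corpus, top_k=10):
    """Rank facet-value pairs by (frequency among relevant docs) * IDF.

    Every document is a set of hashable facet-value pairs,
    e.g. {("Topic", "A"), ("Lang", "en")}.
    """
    n_docs = len(corpus)
    # N_f: number of corpus documents that contain each facet-value pair.
    doc_freq = Counter(f for doc in corpus for f in set(doc))
    # Occurrences among user-labeled and pseudo-relevant documents (assumed additive).
    rel_freq = Counter(f for doc in labeled_relevant for f in set(doc))
    rel_freq.update(f for doc in pseudo_relevant for f in set(doc))

    scores = {
        f: count * math.log(n_docs / doc_freq[f])
        for f, count in rel_freq.items()
        if doc_freq[f] > 0  # skip pairs never seen in the corpus
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy corpus: "Lang: en" is everywhere, "Topic: A" is concentrated in relevant docs.
corpus = [
    {("Lang", "en"), ("Topic", "A")},
    {("Lang", "en"), ("Topic", "B")},
    {("Lang", "en"), ("Topic", "A")},
    {("Lang", "en"), ("Topic", "C")},
]
ranked = rank_fvps([corpus[0]], [corpus[2]], corpus)
```

In this toy corpus "Topic: A" ranks first, while "Lang: en" scores zero because it occurs in every document and so has an IDF of zero.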



Content-Based Filtering (CBF)

• Treated as a binary text classification task
• User profile: a feature vector that represents a user’s information needs (interests/preferences)
• Given the user profile θ, a document can be determined as relevant or not according to:

[Equation not recovered; its annotations label the document vector and the document label]

The core of CBF is learning the user profile!
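
To make the role of the user profile concrete, here is a minimal sketch of a decision rule. It assumes a logistic model, P(relevant | x, θ) = σ(θ · x), with a 0.5 threshold. The slide's own equation is not legible in this text, so the model form, the threshold, and the function names are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relevance_probability(theta, x):
    """P(relevant | x, theta) under the assumed logistic model."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


def is_relevant(theta, x, threshold=0.5):
    """Deliver the document if its estimated relevance reaches the threshold."""
    return relevance_probability(theta, x) >= threshold


# Hypothetical three-feature profile and document vector.
theta = [1.5, -2.0, 0.5]
x = [1.0, 0.0, 1.0]
delivered = is_relevant(theta, x)
```

Under this assumption, learning the user profile means fitting `theta`; the filtering decision itself is a single dot product.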

Q2: Possible Methods

• Simple methods
  – Boolean strategy (AND, OR)
  – Feature selection
  – Pseudo relevant document
• Sophisticated methods
  – Bayesian logistic regression with an adjusted prior (Dayanik et al. 06)
  – Generalized Expectation Criteria (Druck et al. 08)
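
The simplest of these baselines, the Boolean strategy, can be sketched as follows. The slide does not spell out its semantics, so this example assumes that AND delivers a document only if it contains every user-selected facet-value pair and OR delivers it if it contains at least one. The function name `boolean_filter` is hypothetical.

```python
def boolean_filter(doc_fvps, selected_fvps, mode="OR"):
    """Decide whether to deliver a document using only the user-selected FVPs."""
    selected = set(selected_fvps)
    present = set(doc_fvps)
    if not selected:
        return False  # assumption: with no feedback, deliver nothing
    if mode == "AND":
        return selected <= present
    if mode == "OR":
        return bool(selected & present)
    raise ValueError("mode must be 'AND' or 'OR'")


selected = [("Language", "English"), ("Company", "Facebook")]
doc = [("Language", "English"), ("Topic", "Sports")]
passes_or = boolean_filter(doc, selected, mode="OR")
passes_and = boolean_filter(doc, selected, mode="AND")
```

Here the document passes under OR but not under AND, which hints at why a hard Boolean rule can be too permissive or too strict.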


Our Approach

• The assumption
  – A feature is selected by a user since it has a high correlation with the document label (R/NR)
• Generalized Constraint Model (GCM)


Correlation Decomposition

• Sufficiency
  – The probability of a document being relevant given that the feature has occurred: P(R+|f=1)
  – P(R+|f=1)=1 : sufficient features
    • E.g., “Company: Facebook” for social network analysts
• Necessity
  – The probability of the feature having occurred given that a document is relevant: P(f=1|R+)
  – P(f=1|R+)=1 : necessary features
    • E.g., “Language: English” for English-only readers
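
Both quantities are plain conditional probabilities, so on a fully labeled sample they can be computed by counting. The sketch below uses toy data; the function name and inputs are illustrative, not from the original system.

```python
def sufficiency_and_necessity(has_feature, is_relevant):
    """Return (P(R+ | f=1), P(f=1 | R+)) from parallel boolean lists."""
    both = sum(1 for f, r in zip(has_feature, is_relevant) if f and r)
    n_feature = sum(1 for f in has_feature if f)
    n_relevant = sum(1 for r in is_relevant if r)
    sufficiency = both / n_feature if n_feature else float("nan")
    necessity = both / n_relevant if n_relevant else float("nan")
    return sufficiency, necessity


# Toy sample: the feature occurs in one document, and that document is relevant,
# but a second relevant document lacks the feature.
has_feature = [True, False, False, False]
is_relevant = [True, True, False, False]
suff, nec = sufficiency_and_necessity(has_feature, is_relevant)
```

In this sample the feature is sufficient (sufficiency 1.0) but not necessary (necessity 0.5).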


Examples: Highly-Correlated Features

[Figure: the whole corpus, with regions marked R+, f1=1, f2=1, and f3=1]

1) f1 is a sufficient feature since P(R+|f1=1)=1

2) f2 is a necessary feature since P(f2=1|R+)=1


3) f3 is neither necessary nor sufficient, but both its sufficiency and necessity are high (>0.5)

Estimating Sufficiency

[Equation not recovered. Its annotations label the document label, the feature, the set of documents covered by feature f, the user profile vector, and the estimation of the label of document d_i]

Estimating Necessity

[Equation not recovered. It applies Bayes’ theorem; its annotations label the feature sufficiency and the prior distribution]
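
The slide equations for these two estimates are not legible in this text, so the following is only one plausible reading of their annotations. It assumes sufficiency is estimated as the average predicted relevance over the documents covered by the feature, and necessity is obtained from it by Bayes' theorem, P(f=1|R+) = P(R+|f=1) P(f=1) / P(R+), with both priors estimated from the same document pool. All function names are hypothetical.

```python
def estimate_sufficiency(probs, covered):
    """Mean predicted relevance over documents that contain the feature.

    probs[i]   : the current model's estimate of P(relevant | d_i, theta)
    covered[i] : True if the feature occurs in document d_i
    """
    covered_probs = [p for p, c in zip(probs, covered) if c]
    if not covered_probs:
        raise ValueError("feature covers no documents")
    return sum(covered_probs) / len(covered_probs)


def estimate_necessity(probs, covered):
    """Bayes' theorem: P(f=1 | R+) = P(R+ | f=1) * P(f=1) / P(R+)."""
    sufficiency = estimate_sufficiency(probs, covered)
    p_feature = sum(1 for c in covered if c) / len(covered)  # assumed prior P(f=1)
    p_relevant = sum(probs) / len(probs)                     # assumed prior P(R+)
    return sufficiency * p_feature / p_relevant


# Toy pool of four documents with model-predicted relevance probabilities.
probs = [0.9, 0.8, 0.1, 0.2]
covered = [True, True, False, False]
suff_hat = estimate_sufficiency(probs, covered)
nec_hat = estimate_necessity(probs, covered)
```

Because both estimates depend on the model's predictions, they are functions of the user profile, which is what would allow them to act as constraints during learning.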

Reference Distributions

• Our assumption
  – A user selects a feature because it has a high sufficiency and/or a high necessity
• Reference distributions: two Bernoulli distributions
  – The sufficiency/necessity of a user-selected feature should be close to the reference distribution
  – KL-divergence as the similarity measure
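
For two Bernoulli distributions the KL-divergence has a closed form, KL(t || q) = t log(t/q) + (1 - t) log((1 - t)/(1 - q)). The sketch below computes it; the direction of the divergence (reference first) and the reference value 0.9 are assumptions chosen for illustration.

```python
import math


def bernoulli_kl(t, q, eps=1e-12):
    """KL(Bernoulli(t) || Bernoulli(q)) in nats; q is clipped away from 0 and 1."""
    q = min(max(q, eps), 1.0 - eps)
    kl = 0.0
    if t > 0.0:
        kl += t * math.log(t / q)
    if t < 1.0:
        kl += (1.0 - t) * math.log((1.0 - t) / (1.0 - q))
    return kl


reference = 0.9  # hypothetical reference probability for a "high" sufficiency
penalty_close = bernoulli_kl(reference, 0.9)
penalty_far = bernoulli_kl(reference, 0.5)
```

The penalty is zero when the estimated sufficiency or necessity matches the reference and grows as the estimate moves away from it.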


User Profile Learning

• The unified loss function to combine two types of feedback:

[Equation not recovered. Its annotated terms cover user-labeled documents, necessary features, and sufficient features]

Ts, Tn: reference distributions
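
The loss function itself is not legible in this text. As a rough sketch of how the three annotated terms could fit together, the example below assumes a logistic model and sums (1) the negative log-likelihood of user-labeled documents, (2) a KL penalty pulling each sufficient feature's estimated P(R+|f=1) toward a reference Ts, and (3) a KL penalty pulling each necessary feature's estimated P(f=1|R+) toward a reference Tn. The weighting, the reference values, and all names are assumptions, not the authors' formulation.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bernoulli_kl(t, q, eps=1e-12):
    q = min(max(q, eps), 1.0 - eps)
    kl = 0.0
    if t > 0.0:
        kl += t * math.log(t / q)
    if t < 1.0:
        kl += (1.0 - t) * math.log((1.0 - t) / (1.0 - q))
    return kl


def unified_loss(theta, labeled, unlabeled, sufficient, necessary,
                 t_s=0.9, t_n=0.9, weight=1.0):
    """labeled: list of (x, y) with y in {0, 1}; unlabeled: list of x.

    sufficient / necessary: indices of features the user marked.
    A document contains feature j when x[j] > 0.
    """
    def prob(x):
        return sigmoid(sum(t * v for t, v in zip(theta, x)))

    # Term 1: negative log-likelihood of the user-labeled documents.
    loss = 0.0
    for x, y in labeled:
        p = min(max(prob(x), 1e-12), 1.0 - 1e-12)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)

    probs = [prob(x) for x in unlabeled]
    total = sum(probs)

    def covered_mass(j):
        covered = [p for p, x in zip(probs, unlabeled) if x[j] > 0]
        return sum(covered), len(covered)

    # Term 2: sufficient features, estimated P(R+ | f=1) versus reference t_s.
    for j in sufficient:
        mass, count = covered_mass(j)
        if count:
            loss += weight * bernoulli_kl(t_s, mass / count)

    # Term 3: necessary features, estimated P(f=1 | R+) versus reference t_n.
    for j in necessary:
        mass, _ = covered_mass(j)
        if total > 0:
            loss += weight * bernoulli_kl(t_n, mass / total)
    return loss


# Toy data: feature 0 marks relevant documents, feature 1 marks non-relevant ones.
unlabeled = [[1, 0], [1, 0], [0, 1], [0, 1]]
labeled = [([1, 0], 1), ([0, 1], 0)]
good = unified_loss([3.0, -3.0], labeled, unlabeled, sufficient=[0], necessary=[0])
bad = unified_loss([-3.0, 3.0], labeled, unlabeled, sufficient=[0], necessary=[0])
```

A profile that agrees with both the labeled documents and the user-selected features gets a much lower loss than one that contradicts them. Minimizing such a loss, for example by gradient descent, would be the learning step.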

User Interaction Mechanisms

• Two mechanisms
  – Mechanism 1: ask users to select features they think are relevant
  – Mechanism 2: ask users to select, separately, the features they think are sufficient and the features they think are necessary


Outline

• Introduction
• Faceted Feedback
  – Facet-Value Pair Candidate Selection
  – Learning from Faceted Feedback
• Experiments
  – Settings
  – Results
• Summary

Data Sets

• Use two data sets from the TREC filtering track
  – TREC 2000: OHSUMED (348,566 medical articles) + 63 topics (information needs)
    • Metadata field: MeSH (Medical Subject Headings)
  – TREC 2002: RCV1 (~800,000 news articles) + 50 topics defined by human assessors
    • Metadata fields: Topic, Industry, Region
• Split each topic set into two equal-size subsets
  – One for parameter tuning, the other for testing


Faceted Feedback Collection

• Recruit subjects on Mechanical Turk
  – Five subjects per topic
  – The average performance will be reported
• For each topic, we show subjects
  – The topic description (information need)
  – A group of facet-value pair candidates


Evaluation Metrics

• Precision (macro)
• Recall (macro)
• T11U = 2 * Nrd - Nnd
  – Nrd: the number of relevant docs delivered
  – Nnd: the number of non-relevant docs delivered
• T11SU = (max(T11U / MaxU, MinNU) - MinNU) / (1 - MinNU)
  – MinNU = -0.5
  – MaxU: the maximum possible utility (T11U)
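
The utility metrics can be computed directly from delivery counts. This sketch uses the TREC 2002 filtering track definition of the scaled utility, T11SU = (max(T11U / MaxU, MinNU) - MinNU) / (1 - MinNU), and assumes MaxU is twice the total number of relevant documents for the topic, i.e. the utility of delivering every relevant document and nothing else. Function and parameter names are illustrative.

```python
def t11u(n_rel_delivered, n_nonrel_delivered):
    """Linear utility: +2 per relevant doc delivered, -1 per non-relevant doc."""
    return 2 * n_rel_delivered - n_nonrel_delivered


def t11su(n_rel_delivered, n_nonrel_delivered, n_rel_total, min_nu=-0.5):
    """Scaled utility in [0, 1]; normalized utility is floored at min_nu."""
    max_u = 2 * n_rel_total  # assumed: best case delivers all relevant docs only
    if max_u == 0:
        raise ValueError("topic has no relevant documents")
    normalized = t11u(n_rel_delivered, n_nonrel_delivered) / max_u
    return (max(normalized, min_nu) - min_nu) / (1.0 - min_nu)


# Hypothetical run: 10 relevant and 4 non-relevant docs delivered, 20 relevant in total.
utility = t11u(10, 4)
scaled = t11su(10, 4, 20)
```

The floor at MinNU bounds the penalty for a topic where a system delivers many non-relevant documents, so one bad topic cannot dominate the average.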



Results 1: w/wo Faceted Feedback (FF)

Faceted feedback improves filtering performance, especially when fewer relevant documents are initially known.

[Figure: filtering performance with and without faceted feedback, plotted against the number of relevant docs initially known]

Results 2: Different Learning Algorithms

[Figure: comparison of learning algorithms, with our approach marked against existing approaches]

BOOL(A), BOOL(O): Boolean strategy
FS: feature selection based on FF
Pseudo-D/Q: pseudo relevant doc/query
Prior: logistic regression with Bayesian prior
GEC: generalized expectation criteria


Summary

• Faceted feedback is useful for filtering, especially in cold-start scenarios
• The Generalized Constraint Model (GCM) is a robust user profile learning algorithm
• In future work, we will evaluate our methods on data sets where faceted features are more important
  – Movie, music, product, etc.


Questions?



