Carnegie Mellon
Novelty and Redundancy Detection in Adaptive Filtering
Yi Zhang , Jamie Callan , Thomas Minka
Carnegie Mellon University
{yiz, callan, minka}@cs.cmu.edu
Outline
• Introduction: task definition and related work
• Building a filtering system
– Filtering system structure
– Redundancy measures
• Experimental methodology
– Creating testing datasets
– Evaluation measures
• Experimental results
• Conclusion and future work
Task Definition
• What users want in adaptive filtering: relevant and novel information, as soon as the document arrives
• Current filtering systems are relevance-oriented
– Optimization: deliver as much relevant information as possible
– Evaluation: relevance-based recall/precision; the system gets credit for relevant but redundant information
Relation to First Story Detection (FSD) in Topic Detection and Tracking (TDT)
• No prior work on novelty detection in adaptive filtering
• Current research on FSD in TDT:
– Goal: identify the first story of an event
– Current performance: far from solved
• FSD in TDT != novelty detection while filtering
– Different assumptions about the definition of redundancy
– Unsupervised learning vs. supervised learning
– Novelty detection in filtering concerns a user-specified domain, and user information is available
Relevancy vs. Novelty
• User wants: relevant and novel information
• Contradiction?
– Relevant: deliver documents similar to previously delivered relevant documents
– Novel: deliver documents not similar to previously delivered relevant documents
• Solution: a two-stage system
– Use different similarity measures to model relevancy and novelty
Two-Stage Filtering System
[Figure: a stream of documents passes through relevance filtering and then redundancy filtering, which labels each document as novel or redundant.]
Two Problems for Novelty Detection
• Input:
– A sequence of documents the user has read
– User feedback
• Redundancy measure (our current focus):
– Measures the redundancy of the current document with respect to previous documents
– Profile-specific, anytime updating of the redundancy/novelty measure
• Thresholding
– Only a document with a redundancy score below the threshold is considered novel
Redundancy Measures
• Use similarity/distance/difference between two documents to measure redundancy
• 3 types of document representation
– Set difference
– Geometric distance (cosine similarity)
– Distributional similarity (language model)
• Redundancy of $d_t$ with respect to the delivered relevant documents $d_i$: $R(d_t) = \max_{d_i} R(d_t \mid d_i)$
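The aggregation over delivered documents, together with the threshold test from the previous slide, can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `redundancy_score`, `is_novel`, and the toy pairwise measure `word_overlap` are hypothetical, and returning negative infinity for an empty history is an assumption.

```python
from typing import Callable, Iterable, TypeVar

Doc = TypeVar("Doc")


def redundancy_score(
    d_t: Doc,
    delivered: Iterable[Doc],
    pairwise: Callable[[Doc, Doc], float],
) -> float:
    """R(d_t): the largest pairwise redundancy of d_t against any delivered document."""
    scores = [pairwise(d_t, d_i) for d_i in delivered]
    # Assumption: with no delivered documents, nothing can make d_t redundant.
    return max(scores) if scores else float("-inf")


def is_novel(score: float, threshold: float) -> bool:
    """A document is novel only if its redundancy score is below the threshold."""
    return score < threshold


def word_overlap(a: str, b: str) -> float:
    """Toy pairwise measure (hypothetical): fraction of a's words also found in b."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a) if words_a else 0.0
```

Any of the pairwise measures in this talk could be passed as `pairwise`; the word-overlap function is only a stand-in to keep the sketch self-contained.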
Set Difference
• Main idea:
– Boolean bag-of-words representation
– Use smoothing to add frequent words to the document representation
• Algorithm:
– $w_j \in \mathrm{Set}(d)$ iff $\mathrm{Count}(w_j, d) > k$, where $\mathrm{Count}(w_j, d) = \alpha_1 \cdot tf_{w_j,d} + \alpha_3 \cdot rdf_{w_j} + \alpha_2 \cdot df_{w_j}$
– Use the number of new words in $d_t$ to measure novelty: $R(d_t \mid d_i) = -\lvert \mathrm{Set}(d_t) \cap \overline{\mathrm{Set}(d_i)} \rvert$
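A minimal Python sketch of the set difference measure follows. The slide does not define tf, df, and rdf; the sketch assumes tf is the word's count in the document, df the number of documents containing the word, and rdf the number of delivered relevant documents containing it. The function names, the default weights, and the choice of `k` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def word_set(
    doc_tokens: Iterable[str],
    df: Dict[str, int],
    rdf: Dict[str, int],
    k: float,
    a1: float = 1.0,
    a2: float = 0.0,
    a3: float = 0.0,
) -> Set[str]:
    """Set(d): words whose smoothed count exceeds k.

    Assumed reading of the slide: Count = a1 * tf + a3 * rdf + a2 * df.
    Words absent from the document can still enter the set through df or rdf.
    """
    tf = Counter(doc_tokens)
    vocab = set(tf) | set(df) | set(rdf)
    return {
        w
        for w in vocab
        if a1 * tf[w] + a3 * rdf.get(w, 0) + a2 * df.get(w, 0) > k
    }


def set_difference_redundancy(set_t: Set[str], set_i: Set[str]) -> int:
    """R(d_t | d_i): minus the number of words in Set(d_t) that are not in Set(d_i)."""
    return -len(set_t - set_i)
```

With the default weights the smoothing terms are switched off and the sets contain exactly the words whose raw count exceeds `k`.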
Geometric Distance
• Main idea:
– Basic vector space approach
• Algorithm:
– Represent a document as a vector; the weight of each dimension is the tf*idf score of the corresponding word
– Use cosine similarity to measure redundancy: $R(d_t \mid d_i) = \mathrm{Cosine}(d_t, d_i)$
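The cosine measure can be sketched in a few lines of Python. The slide does not say which tf*idf variant is used, so the weighting below (raw term count times log(N / df)) is an assumption, as are the function names and the treatment of words without a df entry.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vector(tokens: List[str], df: Dict[str, int], n_docs: int) -> Dict[str, float]:
    """Sparse tf*idf vector for one document.

    Assumption: idf = log(n_docs / df); words without a df entry get df = 1.
    """
    tf = Counter(tokens)
    return {w: c * math.log(n_docs / df.get(w, 1)) for w, c in tf.items()}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Because cosine similarity is symmetric, this measure gives the same score regardless of which of the two documents arrived first.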
Distributional Similarity (1)
• Main idea:
– Unigram language models
• Algorithm:
– Represent a document d as a word distribution $\theta_d$
– Measure the redundancy/novelty between two documents using the Kullback-Leibler (KL) divergence between the corresponding distributions: $R(d_t \mid d_i) = -KL(\theta_{d_t}, \theta_{d_i})$
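For reference, the standard definition of KL divergence between two unigram models can be written in the slide's notation as follows; the sum over vocabulary words $w$ and the notation $P(w \mid \theta)$ are introduced here and do not appear on the slide.

```latex
KL(\theta_{d_t}, \theta_{d_i})
  = \sum_{w} P(w \mid \theta_{d_t}) \,
    \log \frac{P(w \mid \theta_{d_t})}{P(w \mid \theta_{d_i})}
```

The measure is asymmetric, and a term becomes infinite whenever $P(w \mid \theta_{d_i}) = 0$ for a word with $P(w \mid \theta_{d_t}) > 0$.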
Distributional Similarity (2): Smoothing
• Why smoothing:
– Maximum likelihood estimation of $\theta_d$ makes $KL(\theta_{d_t}, \theta_{d_i})$ infinite because of unseen words
– Smoothing makes the language model estimate more accurate
• Smoothing algorithms for $\theta_d$:
– Bayesian smoothing using Dirichlet priors (Zhai & Lafferty, SIGIR 01)
– Smoothing using shrinkage (McCallum, ICML 98)
– Mixture-model-based smoothing
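As an illustration of the first option, the sketch below applies the usual Dirichlet prior formula, P(w | θ_d) = (c(w, d) + μ P(w | background)) / (|d| + μ), and then computes the KL divergence between two smoothed models. The function names, the default μ, and the decision to ignore words outside the background vocabulary are assumptions made for the example, not details from the talk.

```python
import math
from collections import Counter
from typing import Dict, List


def dirichlet_lm(
    tokens: List[str], background: Dict[str, float], mu: float = 1000.0
) -> Dict[str, float]:
    """Dirichlet-smoothed unigram model over the background vocabulary.

    Assumptions: `background` assigns a nonzero probability to every word in
    its vocabulary, and document words outside that vocabulary are ignored.
    """
    counts = Counter(t for t in tokens if t in background)
    length = sum(counts.values())
    return {
        w: (counts[w] + mu * p_bg) / (length + mu)
        for w, p_bg in background.items()
    }


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p, q) for two distributions over the same vocabulary."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)


def lm_redundancy(theta_t: Dict[str, float], theta_i: Dict[str, float]) -> float:
    """R(d_t | d_i) = -KL(theta_dt, theta_di)."""
    return -kl_divergence(theta_t, theta_i)
```

Since every smoothed probability is positive, the divergence stays finite even when the two documents share no words.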
A Mixture Model: Relevancy vs. Novelty
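[Figure: a document is generated by a mixture of three language models, with mixture weights $\lambda_E$, $\lambda_T$, $\lambda_{d\_core}$]
• $M_E$: $\theta_E$, general English
• $M_T$: $\theta_T$, topic
• $M_I$: $\theta_{d\_core}$, new information
• Relevancy detection: focus on learning $\theta_T$
• Redundancy detection: focus on learning $\theta_{d\_core}$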
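The slide does not say how θ_d_core is estimated. One common way to fit a single component of such a mixture is expectation-maximization with the other two models and the mixture weights held fixed; the sketch below illustrates that approach under those assumptions. The function name, the fixed weights, and the iteration count are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def estimate_core_model(
    tokens: List[str],
    theta_e: Dict[str, float],
    theta_t: Dict[str, float],
    lam_e: float,
    lam_t: float,
    lam_core: float,
    iterations: int = 50,
) -> Dict[str, float]:
    """EM estimate of theta_d_core with theta_E, theta_T and the weights fixed."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Start from the document's maximum likelihood estimate.
    theta_core = {w: c / total for w, c in counts.items()}
    for _ in range(iterations):
        # E-step: expected count of each word attributed to the core component.
        expected = {}
        for w, c in counts.items():
            core = lam_core * theta_core[w]
            mix = lam_e * theta_e.get(w, 0.0) + lam_t * theta_t.get(w, 0.0) + core
            expected[w] = c * core / mix if mix > 0 else 0.0
        # M-step: renormalize the expected counts into a distribution.
        norm = sum(expected.values())
        if norm == 0:
            break
        theta_core = {w: e / norm for w, e in expected.items()}
    return theta_core
```

Words that the general English and topic models already explain well lose probability mass in `theta_core`, so the estimate concentrates on what is specific to the document.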
A New Evaluation Dataset: APWSJ
• Combine 1988-1990 AP and WSJ data to get a corpus that is likely to contain redundant documents
• Hired undergraduates to read all relevant documents in chronological order and judge:
– Whether a document is redundant
– If yes, which set of documents makes it redundant
• Two degrees of redundancy: absolutely redundant vs. somewhat redundant
• Adjudicated by two assessors
Another Evaluation Dataset: TREC Interactive Data
• Combine the TREC-6, TREC-7 and TREC-8 interactive datasets (20 TREC topics)
• Each topic contains several aspects
• NIST assessors identified the aspects covered by each document
• Assume $d_t$ is redundant if all aspects related to $d_t$ have already been covered by documents the user has previously seen
– A strong assumption about what is novel/redundant
– Can still provide useful information
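The aspect-coverage rule translates directly into code. In this minimal sketch each document is represented by the set of aspect identifiers assigned to it; the function name and the handling of documents with no aspects are assumptions.

```python
from typing import List, Set


def label_redundant(aspects_per_doc: List[Set[str]]) -> List[bool]:
    """Label each document, in arrival order, as redundant or not.

    A document is redundant if every one of its aspects was already covered
    by earlier documents.
    """
    covered: Set[str] = set()
    labels: List[bool] = []
    for aspects in aspects_per_doc:
        # Assumption: a document with no listed aspects is not marked redundant.
        labels.append(bool(aspects) and aspects <= covered)
        covered |= aspects
    return labels
```

For example, `label_redundant([{"a"}, {"a", "b"}, {"b"}])` marks only the third document as redundant, because the second one still introduces aspect `"b"`.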
Evaluation Methodology (1)
• Four components of an adaptive filtering system
– Relevancy measure
– Relevance threshold
– Redundancy measure
– Redundancy threshold
• Goal: focus on redundancy measures and avoid the influence of the other parts of the filtering system
• Assume a perfect relevancy detection stage to avoid the influence of that stage
• Use 11-point average recall-precision graphs to avoid the influence of the thresholding module
Evaluation Methodology (2)
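| | Redundant | Non-Redundant |
|---|---|---|
| Delivered | R+ | N+ |
| Not delivered | R- | N- |

$\text{Redundancy\_Mistake} = \dfrac{R^{+} + N^{-}}{R^{+} + R^{-} + N^{+} + N^{-}}$
$\text{Redundancy\_Recall} = \dfrac{R^{-}}{R^{-} + R^{+}}$
$\text{Redundancy\_Precision} = \dfrac{R^{-}}{R^{-} + N^{-}}$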
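The three measures can be computed from the four cells of the contingency table. The sketch below assumes the reading in which a redundant document that is not delivered (R-) counts as correctly identified, while a delivered redundant document (R+) and a withheld non-redundant document (N-) count as mistakes. The function name and the zero-denominator convention are hypothetical.

```python
from typing import Dict


def redundancy_metrics(
    r_plus: int, r_minus: int, n_plus: int, n_minus: int
) -> Dict[str, float]:
    """Redundancy precision, recall and mistake rate from contingency counts.

    r_plus / r_minus: redundant documents delivered / not delivered.
    n_plus / n_minus: non-redundant documents delivered / not delivered.
    """
    total = r_plus + r_minus + n_plus + n_minus
    withheld = r_minus + n_minus      # documents the system treated as redundant
    redundant = r_minus + r_plus      # documents that really are redundant
    return {
        # Assumption: a ratio with an empty denominator is reported as 0.0.
        "precision": r_minus / withheld if withheld else 0.0,
        "recall": r_minus / redundant if redundant else 0.0,
        "mistake": (r_plus + n_minus) / total if total else 0.0,
    }
```

With 10 redundant documents delivered, 30 redundant withheld, 50 novel delivered and 10 novel withheld, precision and recall are both 0.75 and the mistake rate is 0.2.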
Comparing Different Redundancy Measures on Two Datasets
• The cosine measure is consistently good
• The mixture language model works much better than the other LM approaches
Mistakes After Thresholding
| Measure | Absolutely redundant or somewhat redundant | Absolutely redundant only |
|---|---|---|
| Set Distance | 43.5% | 28% |
| Cosine Distance | 28.1% | 18.7% |
| Shrinkage (LM) | 44.3% | 21% |
| Dirichlet Prior (LM) | 42.4% | 21% |
| Mixture Model (LM) | 27.4% | 16.7% |
• A simple thresholding algorithm makes the system complete
• Learning the user's preference is important
• Similar results for the interactive track data are reported in the paper
Conclusion: Our Contributions
• Novelty/redundancy detection in an adaptive filtering system
– Two-stage approach
• Reasonably good at identifying redundant documents
– Cosine similarity
– Mixture language model
• Factors affecting accuracy
– Accuracy at finding relevant documents
– Redundancy measure
– Redundancy threshold
Future Work
• Cosine similarity is far from optimal (symmetric vs. asymmetric)
• Feature engineering: time, source, author, named entities…
• Better novelty measure
– Document-document distance vs. document-cluster distance (?)
– Depends on the user: what is novel/redundant for the user?
• Learning user redundancy preferences
– Thresholding: sparse training data problem
Appendix: Threshold Algorithm
• Initialize Rthreshold so that only near duplicates are treated as redundant
• For each d_t delivered:
    If the user says it is redundant and R(d_t) > max R(d_i) over all d_i (delivered relevant documents):
        Rthreshold = R(d_t)
    Else:
        Rthreshold = Rthreshold - (Rthreshold - R(d_t)) / 10
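The update rule can be sketched in Python as below. The slide's condition is terse, so the sketch follows one literal reading: the new document's score is compared with the largest score among previously delivered documents. That reading, the class name `RedundancyThreshold`, and its attributes are assumptions rather than the authors' code.

```python
from typing import List


class RedundancyThreshold:
    """Adaptive redundancy threshold updated from user feedback."""

    def __init__(self, initial: float) -> None:
        # The initial value should be high enough that only near duplicates exceed it.
        self.threshold = initial
        self.delivered_scores: List[float] = []

    def update(self, score: float, user_says_redundant: bool) -> float:
        """Update the threshold after delivering a document with redundancy `score`."""
        prior_max = max(self.delivered_scores, default=float("-inf"))
        if user_says_redundant and score > prior_max:
            # Drop the threshold to the score of the document judged redundant.
            self.threshold = score
        else:
            # Move the threshold one tenth of the way toward the document's score.
            self.threshold -= (self.threshold - score) / 10
        self.delivered_scores.append(score)
        return self.threshold
```

Starting from a threshold of 0.9, a delivered document with score 0.5 that the user marks redundant sets the threshold to 0.5; a following document with score 0.3 that is not marked redundant moves it to 0.48.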