On Frequent Chatters Mining
Claudio Lucchese
1st HPC Lab Workshop
6/15/12 1st HPC Workshop - Claudio Lucchese
Frequent Patterns Mining
• How many patterns do you see in the following dataset?
[Figure: binary dataset with columns A to M and rows 1 to 13]
Claudio Lucchese, Salvatore Orlando, Raffaele Perego: Mining Top-K Patterns from Binary Datasets in Presence of Noise. SDM 2010
[Figure: binary dataset with columns A to M and rows 1 to 13]
Frequent Patterns Mining
Frequent Patterns Mining
• Usually rows and columns are not in a “good-looking” order
State of the art
• Most recent approaches try to discover the top-k patterns that optimize different cost functions:
  • Minimize noise (“holes”), or
  • Minimize MDL: encoding(Patterns) + encoding(Data|Patterns)
  • Maximize Information Ratio: number of bits of information w.r.t. the Maximum Entropy Model built on the row and column marginal distributions
  • Minimize the length of the patterns and the amount of noise (our approach)
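A minimal sketch of the last cost function, assuming a pattern is a pair (items, rows) and that noise is counted as the cells where the data and the union of the patterns disagree. The names `Pattern` and `pattern_set_cost` are hypothetical, and the exact weighting used in the SDM 2010 paper is not reproduced here.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical representation: a pattern is (set of items, set of row ids).
Pattern = Tuple[FrozenSet[str], FrozenSet[int]]


def pattern_set_cost(data: Set[Tuple[int, str]], patterns: List[Pattern]) -> int:
    """Cost = total pattern length + number of noisy cells (assumed form)."""
    # Length of a pattern: number of items plus number of rows it claims.
    length = sum(len(items) + len(rows) for items, rows in patterns)
    # Cells covered by at least one pattern.
    covered = {(r, i) for items, rows in patterns for r in rows for i in items}
    # Noise: cells where the patterns and the data disagree
    # ("holes" inside patterns plus 1-cells left unexplained).
    noise = len(covered ^ data)
    return length + noise


# Toy dataset: rows 1..4 contain items A, B, C, with one hole at (3, "B").
data = {(r, i) for r in (1, 2, 3, 4) for i in "ABC"} - {(3, "B")}
pattern = (frozenset("ABC"), frozenset({1, 2, 3, 4}))
cost_with_pattern = pattern_set_cost(data, [pattern])
cost_without = pattern_set_cost(data, [])
```

On this toy dataset a single pattern with one hole is cheaper than describing every cell as noise, which is the trade-off the cost function is meant to capture.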
Evaluation
• Unsupervised:
  • Measure how well the proposed algorithm optimizes the proposed cost function
  • What is the best cost function?
• We are investigating supervised measures:
  • Unsupervised extraction: extract patterns from a classification/clustering dataset without using class/cluster labels
  • Supervised evaluation: measure how well the patterns can predict/match classes/clusters
• Preliminary result:
  • Fancy cost functions might not be the best ones
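One possible supervised measure, as a sketch only: the slides do not name a specific measure, so weighted purity is an assumption here, and `pattern_purity` is a hypothetical name. Patterns are extracted without labels; the labels are used only to score how well each pattern's rows match a single class.

```python
from collections import Counter
from typing import Dict, List, Set


def pattern_purity(pattern_rows: List[Set[int]], labels: Dict[int, str]) -> float:
    """Share of covered rows that belong to the majority class of their pattern."""
    majority_total = 0
    covered_total = 0
    for rows in pattern_rows:
        if not rows:
            continue  # an empty pattern covers nothing
        counts = Counter(labels[r] for r in rows)
        majority_total += counts.most_common(1)[0][1]
        covered_total += len(rows)
    return majority_total / covered_total if covered_total else 0.0


# Toy example: labels are hidden during extraction, used only for evaluation.
labels = {1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
purity = pattern_purity([{1, 2, 3}, {4, 5}], labels)
```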
Information Overload in News
Gianmarco De Francisci Morales, Aristides Gionis, Claudio Lucchese: From chatter to headlines: harnessing the real-time web for personalized news recommendation. WSDM 2012.
✓Timeliness
✓Personalization
Can we exploit Twitter?
[Figure: number of mentions of “Osama Bin Laden”]
• 90% of the clicks happen within 2 days of publication
• Only a few occur early!
News Get Old Soon
T.Rex (Twitter-based news recommendation system)
• Builds a user model from Twitter
• Signals from user generated content, social neighbors and popularity across Twitter and news
• Entity-based representation (overcomes vocabulary mismatch)
• Learns a personalized news ranking function:
  • Picks candidates from a pool of related or popular fresh news, ranks them, and presents the top-k to the user
• Ranking function is user and time dependent
• Social model + Content model + Popularity model
• Popularity model tracks entity popularity by the number of mentions in Twitter and news (with exponential forgetting)
• Content model measures the relatedness of a bag-of-entities representation of a user's tweet stream and of a news article
• Social model weights the content model of every social neighbor by a truncated PageRank on the Twitter network
Recommendation Model
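A rough sketch of two of the components above, under stated assumptions: the half-life parameter, the cosine measure for relatedness, and all names (`DecayedPopularity`, `cosine`, `rank_score`) are illustrative choices, not taken from the T.Rex paper. The social score is passed in as a plain number, since the truncated PageRank is not sketched here.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple


class DecayedPopularity:
    """Counts entity mentions and forgets old ones exponentially."""

    def __init__(self, half_life_hours: float = 24.0) -> None:
        self.decay = math.log(2) / half_life_hours  # assumed half-life
        self.score: Dict[str, float] = defaultdict(float)
        self.last_seen: Dict[str, float] = {}

    def value(self, entity: str, t_hours: float) -> float:
        # Decay the stored count by the time elapsed since the last mention.
        if entity not in self.last_seen:
            return 0.0
        age = t_hours - self.last_seen[entity]
        return self.score[entity] * math.exp(-self.decay * age)

    def mention(self, entity: str, t_hours: float) -> None:
        # Streaming update: decay the old count, then add one mention.
        self.score[entity] = self.value(entity, t_hours) + 1.0
        self.last_seen[entity] = t_hours


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Relatedness of two bag-of-entities vectors (cosine is an assumption)."""
    dot = sum(w * b.get(e, 0.0) for e, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_score(social: float, content: float, popularity: float,
               weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three model scores for one (user, news, time)."""
    a, b, c = weights
    return a * social + b * content + c * popularity
```

Timestamps are assumed to arrive in non-decreasing order, which matches the streaming, count-only design the slides describe.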
✓Designed to be streaming and lightweight (just counting)
✓User model is updated continuously
System Overview
• Learning-to-rank approach with SVM
• Each time the user clicks on a news article, we learn a set of preferences (clicked_news > non_clicked_news)
• Prune the number of constraints for scalability:
  • only news published in the last 2 days
  • only the top-k news for each ranking component
• Can optionally include additional features for news articles:
  • click count, age, etc. (T.Rex+)
Learning the Weights
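The preference generation and pruning described above could be sketched as follows. This is an illustration under assumptions: the three-component feature tuple, the function names, and the plain pairwise hinge-loss updates are stand-ins; the slides only say that an SVM learning-to-rank approach is used, and an off-the-shelf ranking SVM solver would normally replace `train_weights`.

```python
from typing import Dict, List, Tuple

Features = Tuple[float, float, float]  # hypothetical (social, content, popularity)
Pair = Tuple[Features, Features]       # (clicked news, non-clicked news)


def preference_pairs(clicked: str,
                     candidates: Dict[str, Tuple[Features, float]],
                     now_hours: float,
                     max_age_hours: float = 48.0,
                     top_k: int = 2) -> List[Pair]:
    """candidates maps news id -> (features, publication time in hours)."""
    # Pruning 1: keep only non-clicked news published in the last 2 days.
    fresh = {news: feats for news, (feats, published) in candidates.items()
             if news != clicked and now_hours - published <= max_age_hours}
    # Pruning 2: keep only the top-k news of each ranking component.
    keep = set()
    for comp in range(3):
        ranked = sorted(fresh, key=lambda n: fresh[n][comp], reverse=True)
        keep.update(ranked[:top_k])
    clicked_feats = candidates[clicked][0]
    return [(clicked_feats, fresh[news]) for news in sorted(keep)]


def score(w: List[float], x: Features) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def train_weights(pairs: List[Pair], epochs: int = 50,
                  lr: float = 0.1, reg: float = 0.01) -> List[float]:
    """Pairwise hinge-loss updates: a simplified stand-in for a ranking SVM."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for pos, neg in pairs:
            diff = [p - q for p, q in zip(pos, neg)]
            margin = score(w, tuple(diff))
            w = [wi * (1.0 - lr * reg) for wi in w]  # L2 shrinkage
            if margin < 1.0:  # preference violated or inside the margin
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w
```

Each click yields at most 3 × top_k constraints, which keeps training cost bounded regardless of how many news articles are in the pool.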
✓User generated content is a very good predictor albeit very sparse
✓Click Count is a strong baseline but does not help T.Rex+
Predicting Clicked News
Predicting Clicked Entities
Future works (?)
• Explain a set of news showing how the main topics interacted with each other over time.
• Example: European sovereign-debt crisis
[Figure: topics along a time axis. Labels: Merkel, Monti, France, Berlusconi, Greece, EU, New Italian government, Fiscal Compact, EuroBond, Obama, Loan]
• Applications:
  • Given the news the user is currently reading, provide an explanation of the related facts that precede that news
  • Given a query, provide an explanation of the documents related to that query
  • Given a set of topics, explain their relations over time
  • Browse a collection of news by changing the topics of interest, the time window, and the granularity
• A topic is a named entity relevant over time
• An interaction is a cluster of news related to some event and relevant in a small time window
• It might be important to cover the given time window, but recent events might be more interesting
• Given a maximum number of main topics and interactions, maximize:
  • Topic coverage and diversity
  • Event time coverage
  • Cluster similarity
  • Main topic connectivity
• It is different from news clustering:
  • Even with a good clustering, it might not be trivial to select which events and which topics to show in order to maximize the amount of information delivered to the user
• There is some interesting related work:
  • it is aimed at finding chains of news, whereas we are more interested in topic evolution
Thank you!