Research Trends in Multimedia Content Services
Data Mining and Web Search Group
Computer and Automation Research Institute
Hungarian Academy of Sciences
András A. Benczúr
A Benczur – Research Trends in Multimedia Content Services – FuturICT 28 April 2008
Web 2.0, 3.0 …?
• Platform convergence (Web, PC, mobile, television) – information vs. recreation
• Emphasis on social content (blogs, Wikipedia, photo and video sharing)
• From search towards recommendation (query free, profile based, personalized)
• From text towards multimedia
• Glocalization (language, geography)
• Spam
A sample service
RSS, Web 2.0
• Small screen browsing
• Recommendation based on user profile (avoid query typing)
• Read blogs, view media, …
client software
Recommender engine
The user profile
• History stored for each user:
  • Known ratings, preferences, opinion – scarce!
  • Items read, weighted by time spent
    • details seen, scrolling, back button
  • Terms in documents read, tf.idf weighted top list
  • User language, region, current location and known sociodemographic data
• Multimedia!
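A minimal sketch of how the "terms in documents read, tf.idf weighted top list" part of a profile might be computed. The input format, the function name `profile_terms`, and the linear weighting by seconds spent are assumptions for illustration, not the service's actual implementation.

```python
import math
from collections import Counter

def profile_terms(read_docs, doc_freq, n_docs, top_k=5):
    """Build a tf.idf-weighted top list of terms from the documents a user read.

    read_docs: list of (tokens, seconds_spent) pairs (hypothetical input format).
    doc_freq:  term -> number of corpus documents containing the term.
    n_docs:    corpus size used for the idf part.
    """
    weighted_tf = Counter()
    for tokens, seconds in read_docs:
        for term, count in Counter(tokens).items():
            # Assumption: time spent on an item scales its contribution linearly.
            weighted_tf[term] += count * seconds
    scores = {
        term: tf * math.log(n_docs / doc_freq[term])
        for term, tf in weighted_tf.items()
        if doc_freq.get(term, 0) > 0
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy reading history: one item read for two minutes, one skimmed for ten seconds.
history = [
    (["netflix", "prize", "the", "movie"], 120.0),
    (["spam", "the", "filter"], 10.0),
]
df = {"netflix": 5, "prize": 20, "the": 1000, "movie": 100, "spam": 50, "filter": 40}
top = profile_terms(history, df, n_docs=1000, top_k=3)
```

Terms from the item read longest dominate the list, and a term that occurs in every document ("the") gets weight zero.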
Same item, multiple sources
Information vs recreation: Do not mix the two?
Spam is increasingly annoying
Distribution of categories
Reputable 70.0%
Spam 16.5%
Weborg 0.8%
Ad 3.7%
Non-existent 7.9%
Empty 0.4%
Alias 0.3%
Unknown 0.4%
Effect of search result position
[Figure: time spent viewing each result position and time to reach each result]
Multimedia Information Retrieval
Similar objects
Segmentation
Class of Query Image
Pre-classified Images
VOC2007
Original Training Set
Query Images
ImageCLEF Object Retrieval Task
Networked relation
• spam
• social network analysis
• churn
Social networks
[Figure: network of individual (home) and business customers; legend: business ADSL, business, individual ADSL, individual]
Insurance fraud in a network
Stacked Graphical Learning
1. Predict churn p(v) of node v
2. For target node u, aggregate p(v) for neighbors to form new feature f(u)
3. Rerun classification by adding feature f(.)
4. Iterate
[Figure: target node u with unknown label and neighbors v1, v2, v7]
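The four steps above can be sketched as follows. This is only an illustration under stated assumptions: the base classifier is a tiny logistic regression written inline (the slides do not name the classifier used), neighbor predictions are aggregated by a plain mean, and the graph, features, and all function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=200, lr=0.5):
    """Tiny logistic regression via batch gradient descent (stand-in classifier)."""
    w = [0.0] * (len(X[0]) + 1)           # last weight is the bias
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(a * b for a, b in zip(w, xi + [1.0]))) - yi
            for k, xk in enumerate(xi + [1.0]):
                grad[k] += err * xk
        w = [wk - lr * gk / len(X) for wk, gk in zip(w, grad)]
    return w

def stacked_graphical_learning(features, labels, neighbors, rounds=3):
    """features: node -> list of floats; labels: node -> 0/1 for training nodes;
    neighbors: node -> list of adjacent nodes. Returns node -> churn probability."""
    nodes = sorted(features)
    train = [v for v in nodes if v in labels]
    stacked = None                         # f(.) is not available in the first round
    p = {}
    for _ in range(rounds):                # step 4: iterate
        X = {v: features[v] + ([stacked[v]] if stacked is not None else [])
             for v in nodes}
        w = train_logreg([X[v] for v in train], [labels[v] for v in train])
        # Steps 1 and 3: (re)predict p(v) for every node.
        p = {v: sigmoid(sum(a * b for a, b in zip(w, X[v] + [1.0]))) for v in nodes}
        # Step 2: aggregate neighbor predictions into the new feature f(u)
        # (mean aggregation is an assumption of this sketch).
        stacked = {u: sum(p[v] for v in neighbors[u]) / len(neighbors[u])
                   for u in nodes}
    return p

# Toy graph: two cliques, churners (a*) and non-churners (b*). The unlabeled
# targets a4 and b4 have identical own features and differ only in neighbors.
features = {"a1": [1.0], "a2": [1.0], "a3": [1.0], "a4": [0.5],
            "b1": [0.0], "b2": [0.0], "b3": [0.0], "b4": [0.5]}
labels = {"a1": 1, "a2": 1, "a3": 1, "b1": 0, "b2": 0, "b3": 0}
neighbors = {n: [m for m in features if m != n and m[0] == n[0]] for n in features}
p = stacked_graphical_learning(features, labels, neighbors)
```

In the first round a4 and b4 receive the same score; once the neighbor feature f(u) is added, the node surrounded by predicted churners is ranked higher.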
Why social networks are hard to analyze
Subgraphs of social networks
Medium size dense communities attract much algorithmic work
Tentacles induce noise
Mapping into 2D
plain spectral
semidefinite
Research Highlights
Recommenders: KDD Cup 2007 Task 1 First Prize
Predict the probability that a user rated a movie in 2006, based on training data up to 2005
Spam filtering: Web Spam Challenge 1 first place
Churn prediction: method presented at KDD Cup 2009 Workshop
Task XXXX
Netflix: lessons and differences learned
• Ratings 1–5 stars
• Predict an unseen rating
• Evaluation: RMSE
• 0.8572: $1,000,000
• Current leader: 0.8650
• Oct/07: 0.8712
KDD Cup 2007
• same data set
• predict existence of a rating
Results of two separate tasks
BellKor team report [Bell, Koren 2007]:
• Low rank approximation
• Restricted Boltzmann Machine
• Nearest neighbor
KDD Cup 2007: Predict probability that a user rated a movie in 2006:
• Given list of 100,000 user–movie pairs
• Users and movies drawn from Netflix Prize data set
Winner report [K, B, and our colleagues 2007]
Evaluation and Issue 1
For a given user i and movie j:
$w_{ij} = 0$ if no rating given, $1$ otherwise
$\mathrm{RMSE}^2 = \sum_{i,j} (w_{ij} - \hat{w}_{ij})^2$, where $\hat{w}_{ij}$ is the predicted value
KDD Cup example:
• Our RMSE: 0.256
• First runner up: 0.263
• All zeroes prediction: 0.279 (Place 10-13)
But why do we use RMSE and not precision/recall?
• RMSE prefers correct probability guesses for the majority of infrequently visited items
• The presence of the recommender changes usage
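A small sketch of the measure on binary "rated / not rated" targets. It divides the squared error sum by the number of pairs, as in the usual definition of RMSE; that normalization is an assumption here, since the slide formula shows only the sum. The toy data is hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over paired lists of equal length."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical toy sample: w_ij = 1 for 1 of 4 user-movie pairs.
w = [1, 0, 0, 0]
all_zero = rmse(w, [0.0] * len(w))     # equals sqrt(share of rated pairs)
base_rate = rmse(w, [0.25] * len(w))   # constant guess at the positive rate
```

With this normalization the all-zeroes baseline scores the square root of the share of rated pairs, so an all-zeroes RMSE of 0.279 would correspond to roughly 7.8% of the listed pairs being rated (0.279² ≈ 0.078). That leaves little room between a trivial guess and the best result.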
Method Overview
• Probability by naive user-movie independence
  • Item frequency estimation (Time Series)
  • User frequency estimation
  • Reaches RMSE 0.260 in itself (still first place)
• Data Mining
  • SVD
  • Item-item similarities
  • Association Rules
• Combination (we used linear regression)
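One way to read "naive user-movie independence" is sketched below: if a user is expected to give d_u ratings and a movie to receive d_m ratings out of E ratings overall, independence gives a pair probability of about d_u · d_m / E. The exact estimator used by the team is not given on the slide, so this formula, the function names, and the blend weights are assumptions; in practice the weights would be fitted by linear regression on held-out data.

```python
def naive_pair_probability(user_count, movie_count, total_count):
    """Independence estimate: P(user rates movie) ~ d_user * d_movie / total."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return min(1.0, user_count * movie_count / total_count)

def combine(predictions, weights, intercept=0.0):
    """Linear blend of several predictors, clipped to a valid probability."""
    score = intercept + sum(w * p for w, p in zip(weights, predictions))
    return max(0.0, min(1.0, score))

# Hypothetical counts: the user is expected to rate 10 movies, the movie to
# receive 50 ratings, out of 1000 ratings in total.
p_naive = naive_pair_probability(user_count=10, movie_count=50, total_count=1000)

# Blend with a second, made-up predictor score using illustrative weights.
p_blend = combine([p_naive, 0.3], weights=[0.5, 0.5])
```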
Time series prediction
Interest remains for long time range (several years)
Short lifetime of online items
Origo
Very different behavior in time: news articles
http://www.origo.hu/filmklub/20060124kiolte.html
Publication day
Next day usage peak
Third day
and gone …
SVD
K-dim SVD: Noise filtering – the essence of the matrix – optimizes $\mathrm{RMSE}^2 = \sum_{i,j} (A_{ij} - \hat{A}_{ij})^2$
• SVD explains ratings as effect of few linear factors
• RMSE (ℓ2 error) 10-30 dim: 0.93
Issue: too many news items
18K Netflix movies vs. potentially infinite set of items
-> may recommend data source but not the item
[Figure labels: user, movie, news item]
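A minimal sketch of a K-dim approximation, using power iteration with deflation in plain Python instead of a library SVD routine. The function names and the toy matrix are hypothetical. A rank-K truncated SVD minimizes the squared error sum above among all rank-K matrices (Eckart–Young); this sketch only approximates that solution numerically and assumes a fully observed, nonnegative matrix.

```python
import math

def top_singular_triple(A, iters=200):
    """Power iteration on A^T A for the leading singular value and vectors."""
    rows, cols = len(A), len(A[0])
    # Uniform start; adequate for nonnegative rating matrices.
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v_new = [sum(A[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v_new))
        if norm == 0.0:                    # nothing left to explain
            return 0.0, [0.0] * rows, v
        v = [x / norm for x in v_new]
    u = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma == 0.0:
        return 0.0, [0.0] * rows, v
    return sigma, [x / sigma for x in u], v

def low_rank_approx(A, k):
    """Sum of the k leading singular components, found one at a time."""
    residual = [row[:] for row in A]
    approx = [[0.0] * len(A[0]) for _ in A]
    for _ in range(k):
        sigma, u, v = top_singular_triple(residual)
        for i in range(len(A)):
            for j in range(len(A[0])):
                approx[i][j] += sigma * u[i] * v[j]
                residual[i][j] -= sigma * u[i] * v[j]   # deflate
    return approx

# Toy user x movie matrix driven by a single factor (rank 1).
A = [[2.0, 4.0], [1.0, 2.0], [3.0, 6.0]]
A1 = low_rank_approx(A, k=1)
```

Because the toy matrix has exactly one underlying factor, a single component reproduces it; real rating matrices need more dimensions, as the 10-30 figure on the slide suggests.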
Lessons learned
• Content similarity might be the key feature
• Relative success of trivial estimates on KDD Cup!
• Data mining techniques overlap, apparently catch similar patterns
• Precision/recall is more important than RMSE
• Solution must make heavy use of time
Future plans and ideas
• New partners and application fields: network infrastructure, new generation services, bioinformatics, …?
• Scaling our solutions to multi-core architectures
• Use our search (cross-lingual, multimedia etc) and recommender system capabilities in major solutions; mobile, new generation platforms etc.
• Expand means of our European level collaboration, e.g. KIC participation
Questions?
Andras A. Benczur
[email protected]
http://datamining.sztaki.hu