3rd Italian Information Retrieval Workshop (IIR 2012) – Bari January 26, 2012
MOVIE RECOMMENDATION WITH DBPEDIA
Politecnico di Bari
Via Orabona, 4
70125 Bari (ITALY)
Roberto Mirizzi, Tommaso Di Noia, Azzurra Ragone, Vito Claudio Ostuni, Eugenio Di Sciascio
Outline
DBpedia: a nucleus for a Web of Open Data
  Social knowledge bases for similarity detection
Semantic Vector Space Model
  Vector Space Model adapted to RDF graphs
MORE: More than Movie Recommendation
  Content-based recommendation in action
Evaluation
  Precision and Recall experiments with MovieLens
Conclusion
What is Linked Data?
Linked Data is about using the Web to connect related data that wasn't previously linked, or using the Web to lower the barriers to linking data currently linked using other methods. More specifically, Wikipedia defines Linked Data as “a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF.” [www.linkeddata.org]
DBpedia: a Nucleus for a Web of Data (i)
DBpedia: a Nucleus for a Web of Data (ii)
Let’s use all this knowledge to build smarter content-based recommender systems.
The DBpedia knowledge base currently describes more than 3.64 million things, highly interconnected in the RDF graph.
Social KBs for similarity detection
[Figure: DBpedia graph fragment around Ocean’s Eleven and Ocean’s Twelve. Nodes shown: George Clooney, Brad Pitt, Steven Soderbergh, Catherine Zeta-Jones, 2000s crime films, American criminal comedy films, Crime films, Crime.]
Semantic Vector Space Model (i)
[http://en.wikipedia.org/wiki/File:Vector_space_model.jpg]
Quick recap on the Vector Space Model
The Vector Space Model is an algebraic model that represents both text documents and queries as vectors of index-term weights $w_{t,d}$, which are positive and non-binary.
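$$v_d = [w_{1,d}, w_{2,d}, \dots, w_{N,d}]^T$$

$$w_{t,d} = tf_{t,d} \cdot idf_t$$

$$tf_{t,d} = \frac{n_{t,d}}{\sum_k n_{k,d}}$$

$$idf_t = \log \frac{|D|}{|\{d' \in D : t \in d'\}|}$$

$$sim(d_j, q) = \frac{d_j \cdot q}{\|d_j\| \, \|q\|} = \frac{\sum_{i=1}^{N} w_{i,j} \cdot w_{i,q}}{\sqrt{\sum_{i=1}^{N} w_{i,j}^2} \cdot \sqrt{\sum_{i=1}^{N} w_{i,q}^2}}$$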
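The recap above can be made concrete with a small sketch. The Python below is an illustration only, not code from the presentation: the toy documents and the helper names (`term_frequencies`, `inverse_document_frequency`, `tfidf_vector`, `cosine`) are assumptions, and the natural logarithm is an arbitrary choice since the recap does not fix a base.

```python
import math
from collections import Counter


def term_frequencies(doc):
    """tf_{t,d}: occurrences of t in d divided by the total number of terms in d."""
    counts = Counter(doc)
    total = sum(counts.values()) or 1  # guard against an empty document
    return {term: n / total for term, n in counts.items()}


def inverse_document_frequency(term, docs):
    """idf_t = log(|D| / number of documents containing t); 0.0 if t never occurs."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing) if containing else 0.0


def tfidf_vector(doc, docs, vocabulary):
    """One weight w_{t,d} = tf_{t,d} * idf_t per vocabulary term."""
    tf = term_frequencies(doc)
    return [tf.get(t, 0.0) * inverse_document_frequency(t, docs) for t in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus, used only to exercise the formulas.
docs = [
    ["heist", "casino", "crime"],
    ["heist", "europe", "crime"],
    ["romance", "paris"],
]
vocabulary = sorted({t for d in docs for t in d})
vectors = [tfidf_vector(d, docs, vocabulary) for d in docs]

sim_01 = cosine(vectors[0], vectors[1])  # documents sharing two terms
sim_02 = cosine(vectors[0], vectors[2])  # documents sharing no terms
```

In this toy corpus the first two documents share terms and get a positive similarity, while the first and third share none and get zero.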
Semantic Vector Space Model (ii)
Each resource (movie) is expressed as a tensor in a multi-dimensional space, where each dimension corresponds to a specific property of the considered datasets (e.g., starring, subject/broader, director, genre, …).
[Figure: Vector Space Model applied to RDF graphs. Ocean’s Eleven and Ocean’s Twelve are laid out along one slice per property (starring, director, subject/broader, genre). Resources on the axes: George Clooney, Brad Pitt, Catherine Zeta-Jones, Steven Soderbergh, 2000s crime films, Crime films, American criminal…, Crime. The starring slice is shown in detail with George Clooney, Brad Pitt and Catherine Zeta-Jones.]
Semantic Vector Space Model (iii)
STARRING
George Clooney [gc] (38 movies)
Catherine Z. Jones [czj] (22 movies)
Brad Pitt [bp] (35 movies)
Ocean’s Eleven [o11] (13 actors)
Ocean’s Twelve [o12] (15 actors)
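$$w_{actor_x, movie_y} = tf_{actor_x, movie_y} \cdot idf_{actor_x}$$

$$sim^{starring}(o_{12}, o_{11}) = \frac{w_{gc,o_{12}} \cdot w_{gc,o_{11}} + w_{czj,o_{12}} \cdot w_{czj,o_{11}} + w_{bp,o_{12}} \cdot w_{bp,o_{11}}}{\sqrt{w_{gc,o_{12}}^2 + w_{czj,o_{12}}^2 + w_{bp,o_{12}}^2} \cdot \sqrt{w_{gc,o_{11}}^2 + w_{bp,o_{11}}^2}}$$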
Semantic Vector Space Model (iv)
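$$w_{gc,o_{11}} = tf_{gc,o_{11}} \cdot idf_{gc} = \frac{1}{13} \cdot \log\frac{49184}{38} = 0.239$$

$$w_{gc,o_{12}} = tf_{gc,o_{12}} \cdot idf_{gc} = \frac{1}{15} \cdot \log\frac{49184}{38} = 0.207$$

$$w_{czj,o_{11}} = tf_{czj,o_{11}} \cdot idf_{czj} = 0 \cdot \log\frac{49184}{22} = 0$$

$$w_{czj,o_{12}} = tf_{czj,o_{12}} \cdot idf_{czj} = \frac{1}{15} \cdot \log\frac{49184}{22} = 0.223$$

$$w_{bp,o_{11}} = tf_{bp,o_{11}} \cdot idf_{bp} = \frac{1}{13} \cdot \log\frac{49184}{35} = 0.242$$

$$w_{bp,o_{12}} = tf_{bp,o_{12}} \cdot idf_{bp} = \frac{1}{15} \cdot \log\frac{49184}{35} = 0.210$$

$$\alpha_{starring} \cdot sim^{starring}(o_{12}, o_{11}) + \alpha_{genre} \cdot sim^{genre}(o_{12}, o_{11}) + \alpha_{subject} \cdot sim^{subject}(o_{12}, o_{11}) + \dots = sim(o_{12}, o_{11})$$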
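The weights on this slide can be re-derived with a few lines of Python. The sketch below is illustrative rather than the authors' code. It assumes that 49184 is the total number of movies in the dataset and that the logarithm is base 10 (the choice that reproduces the printed values). The cast sets are inferred from the term frequencies on the slide, which are zero for Catherine Zeta-Jones in Ocean's Eleven; all names (`TOTAL_MOVIES`, `weight`, `cosine`) are hypothetical.

```python
import math

TOTAL_MOVIES = 49184  # constant from the slide; assumed to be the number of movies
movies_per_actor = {"gc": 38, "czj": 22, "bp": 35}
actors_per_movie = {"o11": 13, "o12": 15}
# Inferred from the slide: czj has tf = 0 for o11 and tf = 1/15 for o12.
cast = {"o11": {"gc", "bp"}, "o12": {"gc", "czj", "bp"}}


def weight(actor, movie):
    """w_{actor,movie} = tf * idf, with tf = 1 / (actors in the movie) if the actor stars in it."""
    tf = 1 / actors_per_movie[movie] if actor in cast[movie] else 0.0
    idf = math.log10(TOTAL_MOVIES / movies_per_actor[actor])
    return tf * idf


def cosine(u, v):
    """Cosine similarity between two equally ordered weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


actors = ["gc", "czj", "bp"]
w = {(a, m): weight(a, m) for a in actors for m in ("o11", "o12")}

vector_o11 = [w[(a, "o11")] for a in actors]
vector_o12 = [w[(a, "o12")] for a in actors]
sim_starring = cosine(vector_o12, vector_o11)

for (actor, movie), value in w.items():
    print(f"w[{actor},{movie}] = {value:.3f}")
print(f"sim_starring(o12, o11) = {sim_starring:.3f}")
```

The same procedure would be repeated for each property (genre, subject, and so on), and the per-property similarities would then be combined with the weights of the last formula on the slide.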
MORE: More than Movie Recommendation
http://apps.facebook.com/movie-recommendation/
MORE is a Facebook application that semantically recommends movies to the user leveraging the knowledge within DBpedia. MORE supports the user in exploratory browsing tasks by guiding their search through a semantic knowledge space. Similarities between movies are computed by a Semantic version of the classical Vector Space Model (sVSM), applied to semantic datasets.
Semantic Content-based Recommender
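Given a user profile, defined as:
$$profile(u) = \{ m_j \mid u \text{ likes } m_j \}$$
We compute a similarity between a movie $m_i$ and the information encoded in $profile(u)$:
$$r(u, m_i) = \frac{\sum_{m_j \in profile(u)} \frac{1}{P} \sum_{p} \alpha_p \cdot sim^p(m_j, m_i)}{|profile(u)|}$$
where $P$ is the number of properties considered and $\alpha_p$ is the weight of property $p$.
If this similarity is greater than or equal to 0.5, we suggest the movie $m_i$ to the user $u$.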
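The recommendation rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MORE implementation: the per-property similarity values, the property weights, the placeholder title "Movie X" and the function names (`property_sims`, `predicted_rating`, `recommend`) are all hypothetical.

```python
# Hypothetical per-property similarities between pairs of movies.
SIMS = {
    frozenset(("Ocean's Eleven", "Ocean's Twelve")): {
        "starring": 0.8, "director": 1.0, "genre": 1.0,
    },
    frozenset(("Ocean's Eleven", "Movie X")): {
        "starring": 0.0, "director": 0.0, "genre": 0.3,
    },
}

# Hypothetical property weights; the slides obtain them by training.
ALPHAS = {"starring": 1.0, "director": 1.0, "genre": 1.0}


def property_sims(movie_a, movie_b):
    """Look up the per-property similarities of an unordered pair of movies."""
    return SIMS[frozenset((movie_a, movie_b))]


def predicted_rating(profile, candidate, alphas):
    """Average, over the liked movies, of the weighted mean of per-property similarities."""
    if not profile:
        return 0.0
    num_properties = len(alphas)
    total = 0.0
    for liked in profile:
        sims = property_sims(liked, candidate)
        total += sum(alphas[p] * sims[p] for p in alphas) / num_properties
    return total / len(profile)


def recommend(profile, candidates, alphas, threshold=0.5):
    """Suggest every candidate whose predicted rating reaches the threshold."""
    return [m for m in candidates if predicted_rating(profile, m, alphas) >= threshold]


profile = ["Ocean's Eleven"]
candidates = ["Ocean's Twelve", "Movie X"]
rating_o12 = predicted_rating(profile, "Ocean's Twelve", ALPHAS)
suggested = recommend(profile, candidates, ALPHAS)
```

With these made-up numbers only the sequel clears the 0.5 threshold, since the other candidate shares little with the liked movie.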
Training the system
In order to identify the best possible values for the coefficients $\alpha_p$ (i.e., the weights associated with each property), we train the system via a genetic algorithm, adopting an N-fold cross-validation approach (with N = 5) on the 100k MovieLens dataset. At the end we obtain a set $A_p = \{\alpha_p^1, \dots, \alpha_p^5\}$ of 5 different values for each $\alpha_p$.
Then, we evaluate the performance with standard precision and recall tests, when $\alpha_p$ is one of the following:
$\min(A_p)$, $\max(A_p)$, $avg(A_p)$, $median(A_p)$, $lowestError(A_p)$
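As a rough illustration of how the five candidate values could be derived from the per-fold results, consider the sketch below. The coefficient values and misclassification errors are invented for the example, and reading the "lowest error" candidate as the value learned in the fold with the smallest misclassification error is an assumption, as is the helper name `candidate_values`.

```python
from statistics import mean, median

# Hypothetical values of one coefficient learned in each of the 5 folds.
fold_alphas = [0.42, 0.55, 0.48, 0.61, 0.50]
# Hypothetical misclassification error measured in each fold.
fold_errors = [0.21, 0.18, 0.25, 0.19, 0.22]


def candidate_values(alphas, errors):
    """Return the five ways of collapsing the per-fold values into one coefficient."""
    best_fold = min(range(len(errors)), key=errors.__getitem__)
    return {
        "min": min(alphas),
        "max": max(alphas),
        "avg": mean(alphas),
        "median": median(alphas),
        # Assumption: the value from the fold with the lowest error.
        "lowestError": alphas[best_fold],
    }


candidates = candidate_values(fold_alphas, fold_errors)
```

Each of the five candidates would then be plugged into the recommender in turn and compared on precision and recall.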
Evaluation: Precision & Recall
$$P@N = \frac{|Rec@N \cap TestSet|}{N}$$

$$R@N = \frac{|Rec@N \cap TestSet|}{|TestSet|}$$
The figure shows high values of Precision and Recall for $N \in \{3, 4, 5, 6, 7\}$. The best values are obtained by choosing, for the coefficients $\alpha_p$, the values with the lowest misclassification error on $A_p$.
We also evaluated the importance of the subject/broader property. The information carried by this property is specific to ontological datasets. As shown in the figure, performance drops drastically if we do not consider this property.
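The two measures can be computed as in the following sketch. The ranked list and test set are made up for illustration, and the function names are hypothetical; only the definitions of P@N and R@N come from the slide.

```python
def precision_at_n(recommended, test_set, n):
    """P@N: fraction of the top-N recommendations that appear in the test set."""
    hits = len(set(recommended[:n]) & set(test_set))
    return hits / n


def recall_at_n(recommended, test_set, n):
    """R@N: fraction of the test set that appears among the top-N recommendations."""
    if not test_set:
        return 0.0
    hits = len(set(recommended[:n]) & set(test_set))
    return hits / len(test_set)


# Hypothetical ranked recommendations and held-out liked movies for one user.
recommended = ["m1", "m2", "m3", "m4", "m5", "m6", "m7"]
test_set = {"m2", "m5", "m9"}

for n in (3, 4, 5, 6, 7):
    p = precision_at_n(recommended, test_set, n)
    r = recall_at_n(recommended, test_set, n)
    print(f"N={n}: P@N={p:.3f}, R@N={r:.3f}")
```

In an evaluation over many users, these per-user values would typically be averaged for each N.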
Conclusion & Future directions
The huge amount of data available on DBpedia can be successfully exploited to build content-based recommender systems.
We have presented MORE, a Facebook application that leverages the knowledge within DBpedia to produce movie recommendations by means of a semantic version of the classical vector space model (sVSM).
Evaluation against historical datasets and high values of precision and recall prove the validity of our approach.
We are currently working on:
  Testing the approach in different domains
  Improving the recommendation with a hybrid approach (content-based and collaborative filtering)
We acknowledge partial support of HP IRP 2011. Grant CW267313.
Q? A!