Language Models for Collaborative Filtering Neighbourhoods [ECIR '16 slides]

ECIR 2016, Padua, Italy
Language Models for Collaborative Filtering Neighbourhoods
Daniel Valcarce, Javier Parapar, Álvaro Barreiro
@dvalcarce @jparapar @AlvaroBarreiroG
Information Retrieval Lab (@IRLab_UDC), University of A Coruña, Spain


Outline

1. Recommender Systems

2. Weighted Sum Recommender (WSR)

3. Improving WSR

4. Language Models for Neighbourhoods

5. Experiments

6. Conclusions and Future Directions


RECOMMENDER SYSTEMS

Recommender Systems

Recommender systems aim to provide items that may be of interest to the users.

Top-N recommendation techniques create a ranking of the N most relevant items for each user.

Main categories:

# Content-based: exploits item metadata to recommend items similar to those the target user liked in the past.

# Collaborative filtering: relies on user feedback such as ratings or clicks.

# Hybrid: combination of content-based and collaborative filtering approaches.


Collaborative Filtering

Collaborative Filtering (CF) methods exploit feedback from users:

# Explicit: ratings or reviews.

# Implicit: clicks or purchases.

Two main families of CF methods:

# Model-based: learn a model from the data and use it for recommendation.

# Neighbourhood-based (or memory-based): compute recommendations directly from a subset of the ratings.


Notation

# The set of users: U

# The set of items: I

# The rating that user u gave to item i: r_{u,i}

# The set of items rated by user u: I_u

# The set of users that rated item i: U_i

# The average rating of user u: \mu_u

# The average rating of item i: \mu_i

# The user neighbourhood of user u: V_u

# The item neighbourhood of item i: J_i
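To make the notation concrete, here is a small Python sketch (our own illustration, not part of the slides) that builds I_u, U_i, \mu_u and \mu_i from a toy dictionary of ratings:

from collections import defaultdict

# Toy data (hypothetical): ratings[u][i] = r_{u,i}
ratings = {
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 4, "i3": 2},
    "u3": {"i2": 1, "i3": 5},
}
I_u = {u: set(items) for u, items in ratings.items()}   # items rated by user u
U_i = defaultdict(set)                                   # users that rated item i
for u, items in ratings.items():
    for i in items:
        U_i[i].add(u)
mu_u = {u: sum(r.values()) / len(r) for u, r in ratings.items()}               # average rating of u
mu_i = {i: sum(ratings[u][i] for u in us) / len(us) for i, us in U_i.items()}  # average rating of i
print(I_u["u1"], sorted(U_i["i1"]), mu_u["u1"], mu_i["i1"])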

Neighbourhood-based Methods

Two perspectives:

# User-based: recommend items that users with interests in common with you liked.

# Item-based: recommend items similar to those you liked. Similarity between items is computed from the users the items have in common (not from the content!).

The effectiveness of neighbourhood-based methods relies largely on how neighbours are computed.

The most common approach is to compute the k nearest neighbours (k-NN algorithm) using a pairwise similarity.

Popular Pairwise Similarities (user-based)

Pearson's Correlation (user-based):

pearson(u, v) = \frac{\sum_{i \in I_u \cap I_v} (r_{u,i} - \mu_u)(r_{v,i} - \mu_v)}{\sqrt{\sum_{i \in I_u} (r_{u,i} - \mu_u)^2} \, \sqrt{\sum_{i \in I_v} (r_{v,i} - \mu_v)^2}}

Cosine (user-based):

cosine(u, v) = \frac{\sum_{i \in I_u \cap I_v} r_{u,i} \, r_{v,i}}{\sqrt{\sum_{i \in I_u} r_{u,i}^2} \, \sqrt{\sum_{i \in I_v} r_{v,i}^2}}
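As an illustration, a minimal Python sketch of both user-based similarities plus k-nearest-neighbour selection; the toy ratings and function names are ours, not from the talk.

import math
from heapq import nlargest

ratings = {                        # ratings[u][i] = r_{u,i} (toy data, ours)
    "u1": {"i1": 5, "i2": 3, "i3": 4},
    "u2": {"i1": 4, "i2": 2},
    "u3": {"i2": 5, "i3": 1},
}

def cosine(u, v):
    common = ratings[u].keys() & ratings[v].keys()
    num = sum(ratings[u][i] * ratings[v][i] for i in common)
    den = (math.sqrt(sum(r * r for r in ratings[u].values()))
           * math.sqrt(sum(r * r for r in ratings[v].values())))
    return num / den if den else 0.0

def pearson(u, v):
    common = ratings[u].keys() & ratings[v].keys()
    mu_u = sum(ratings[u].values()) / len(ratings[u])
    mu_v = sum(ratings[v].values()) / len(ratings[v])
    num = sum((ratings[u][i] - mu_u) * (ratings[v][i] - mu_v) for i in common)
    den = (math.sqrt(sum((r - mu_u) ** 2 for r in ratings[u].values()))
           * math.sqrt(sum((r - mu_v) ** 2 for r in ratings[v].values())))
    return num / den if den else 0.0

def k_nearest_neighbours(u, k=2, sim=cosine):
    # V_u: the k users most similar to the target user u
    return nlargest(k, (v for v in ratings if v != u), key=lambda v: sim(u, v))

print(k_nearest_neighbours("u1"), round(pearson("u1", "u2"), 3))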

Popular Pairwise Similarities (item-based)

Pearson's Correlation (item-based):

pearson(i, j) = \frac{\sum_{u \in U_i \cap U_j} (r_{u,i} - \mu_i)(r_{u,j} - \mu_j)}{\sqrt{\sum_{u \in U_i} (r_{u,i} - \mu_i)^2} \, \sqrt{\sum_{u \in U_j} (r_{u,j} - \mu_j)^2}}

Cosine (item-based):

cosine(i, j) = \frac{\sum_{u \in U_i \cap U_j} r_{u,i} \, r_{u,j}}{\sqrt{\sum_{u \in U_i} r_{u,i}^2} \, \sqrt{\sum_{u \in U_j} r_{u,j}^2}}
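Item-based cosine is the same computation on the transposed rating matrix, so a short NumPy sketch (our illustration, assuming a dense user-by-item matrix with 0 for missing ratings) covers both directions:

import numpy as np

# R[u, i] = r_{u,i}, with 0 meaning "not rated" (toy example).
R = np.array([[5, 3, 0],
              [4, 0, 2],
              [0, 1, 5]], dtype=float)

def cosine_matrix(M):
    # Pairwise cosine between the rows of M.
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0           # avoid division by zero for empty profiles
    return (M / norms) @ (M / norms).T

user_sims = cosine_matrix(R)      # user-based: rows are user profiles
item_sims = cosine_matrix(R.T)    # item-based: rows are item profiles (transpose)
print(np.round(item_sims, 3))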

Non-Normalised Cosine Neighbourhood (NNCosNgbr)

NNCosNgbr (Cremonesi et al., RecSys 2010):

# Simple and effective item-based neighbourhood algorithm:

\hat{r}_{u,i} = b_{u,i} + \sum_{j \in J_i} s(i, j) \, (r_{u,j} - b_{u,j})

# Removes the effect of biases (the observed deviations from the average), b_{u,i} = \mu + b_u + b_i, obtained by solving:

\min_{b_*} \sum_{(u,i)} (r_{u,i} - \mu - b_u - b_i)^2 + \beta \left( \sum_{u \in U} b_u^2 + \sum_{i \in I} b_i^2 \right)

# Uses a shrunk cosine similarity:

s(i, j) = \frac{|U_i \cap U_j|}{|U_i \cap U_j| + \alpha} \, cosine(i, j)
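A rough Python sketch of NNCosNgbr-style prediction; note that, for brevity, the biases are estimated here with plain averages rather than by solving the regularised least-squares problem above, and all names and data are ours.

import numpy as np

R = np.array([[5, 3, 0],           # R[u, i] = r_{u,i}; 0 means "not rated" (toy data)
              [4, 0, 2],
              [0, 1, 5]], dtype=float)
mask = R > 0
alpha = 10.0                       # shrinkage parameter

mu = R[mask].mean()                # global average; b_u, b_i below are simple (unregularised) estimates
b_u = np.array([R[u, mask[u]].mean() - mu for u in range(R.shape[0])])
b_i = np.array([R[mask[:, i], i].mean() - mu for i in range(R.shape[1])])

def shrunk_cosine(i, j):
    n = (mask[:, i] & mask[:, j]).sum()               # |U_i ∩ U_j|
    den = np.linalg.norm(R[:, i]) * np.linalg.norm(R[:, j])
    cos = (R[:, i] @ R[:, j]) / den if den else 0.0
    return n / (n + alpha) * cos

def predict(u, i, neighbours):
    # r_hat(u, i) = b_{u,i} + sum over rated neighbours j of s(i, j) * (r_{u,j} - b_{u,j})
    return (mu + b_u[u] + b_i[i]) + sum(
        shrunk_cosine(i, j) * (R[u, j] - (mu + b_u[u] + b_i[j]))
        for j in neighbours if mask[u, j])

print(round(predict(0, 2, neighbours=[0, 1]), 3))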


WEIGHTED SUM RECOMMENDER (WSR)

Weighted Sum Recommender (WSR)

The original NNCosNgbr:

\hat{r}_{u,i} = b_{u,i} + \sum_{j \in J_i} s(i, j) \, (r_{u,j} - b_{u,j})    (1)

Without bias removal (NNCosNgbr'):

\hat{r}_{u,i} = \sum_{j \in J_i} s(i, j) \, r_{u,j}    (2)

Using plain cosine instead of shrunk cosine (WSR-IB):

\hat{r}_{u,i} = \sum_{j \in J_i} cosine(i, j) \, r_{u,j}    (3)

Also the user-based version (WSR-UB):

\hat{r}_{u,i} = \sum_{v \in V_u} cosine(u, v) \, r_{v,i}    (4)
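A minimal sketch of WSR-UB (equation 4): score every unseen item for a user by a cosine-weighted sum over the user's k nearest neighbours. The toy data and names are ours.

import math
from heapq import nlargest

ratings = {                        # ratings[u][i] = r_{u,i} (toy data, ours)
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 4, "i2": 2, "i3": 5},
    "u3": {"i2": 5, "i3": 1, "i4": 4},
}

def cosine(u, v):
    common = ratings[u].keys() & ratings[v].keys()
    num = sum(ratings[u][i] * ratings[v][i] for i in common)
    den = (math.sqrt(sum(r * r for r in ratings[u].values()))
           * math.sqrt(sum(r * r for r in ratings[v].values())))
    return num / den if den else 0.0

def wsr_ub(u, k=2, n=10):
    # Equation (4): r_hat(u, i) = sum over v in V_u of cosine(u, v) * r_{v,i}
    V_u = nlargest(k, (v for v in ratings if v != u), key=lambda v: cosine(u, v))
    scores = {}
    for v in V_u:
        w = cosine(u, v)
        for i, r in ratings[v].items():
            if i not in ratings[u]:                   # rank only unseen items
                scores[i] = scores.get(i, 0.0) + w * r
    return nlargest(n, scores.items(), key=lambda x: x[1])

print(wsr_ub("u1"))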


Experiments with WSR

Algorithm     ML 100k    ML 1M      R3-Yahoo!   LibraryThing
NNCosNgbr     0.1427     0.1042     0.0138      0.0550
NNCosNgbr'    0.3704a    0.3334a    0.0257a     0.2217ad
WSR-IB        0.3867ab   0.3382ab   0.0274ab    0.2539abd
WSR-UB        0.3899ab   0.3430ab   0.0261a     0.1906a

Table: Values of nDCG@10. Statistical significance is superscripted (Wilcoxon two-sided p < 0.01). Pink = best algorithm. Blue = not significantly different from the best.

IMPROVING WSR

Improving WSR

Can we do better with this simple approach (WSR)?

Yes!

Pairwise similarities have a huge impact on performance.

Cosine provides important improvements over Pearson's correlation coefficient (Cremonesi et al., RecSys 2010).

Let's study cosine similarity from the perspective of Information Retrieval.


Cosine Similarity and the Vector Space Model

Recommendation    Information Retrieval
Target user       Query
Rest of users     Documents
Items             Terms

Under this scheme, using cosine similarity for finding neighbours is equivalent to searching in the Vector Space Model.

If we swap users and items, we can derive an analogous item-based approach.

We can use sophisticated search techniques for finding neighbours!
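Under this analogy, neighbour search can reuse standard IR machinery. The sketch below builds a toy inverted index from items ("terms") to the users that rated them and accumulates cosine scores for candidate neighbours, query-at-a-time style; the structure and names are ours, added for illustration.

import math
from collections import defaultdict

ratings = {                        # ratings[u][i] = r_{u,i} (toy data, ours)
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 4, "i3": 2},
    "u3": {"i2": 5, "i3": 1},
}

index = defaultdict(list)          # inverted index: item ("term") -> [(user, rating), ...]
for u, items in ratings.items():
    for i, r in items.items():
        index[i].append((u, r))
norm = {u: math.sqrt(sum(r * r for r in items.values())) for u, items in ratings.items()}

def neighbours(u):
    # Treat u's profile as the query; walk only the posting lists of the items
    # u has rated, accumulate dot products, then normalise to cosine.
    scores = defaultdict(float)
    for i, r_ui in ratings[u].items():
        for v, r_vi in index[i]:
            if v != u:
                scores[v] += r_ui * r_vi
    return sorted(((s / (norm[u] * norm[v]), v) for v, s in scores.items()), reverse=True)

print(neighbours("u1"))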


LANGUAGE MODELS FOR NEIGHBOURHOODS

Language Models

Statistical language models are a state-of-the-art framework for document retrieval.

Documents are ranked according to their posterior probability given the query:

p(d|q) = \frac{p(q|d) \, p(d)}{p(q)} \stackrel{rank}{=} p(q|d) \, p(d)

The query likelihood, p(q|d), is based on a unigram model:

p(q|d) = \prod_{t \in q} p(t|d)^{c(t,q)}

The document prior, p(d), is usually considered uniform.


Language Models for Finding Neighbourhoods (I)

Information Retrieval:

p(d|q) \stackrel{rank}{=} p(d) \prod_{t \in q} p(t|d)^{c(t,q)}

User-based collaborative filtering:

p(v|u) \stackrel{rank}{=} p(v) \prod_{i \in I_u} p(i|v)^{r_{u,i}}

Item-based collaborative filtering:

p(j|i) \stackrel{rank}{=} p(j) \prod_{u \in U_i} p(u|j)^{r_{u,i}}

Language Models for Finding Neighbourhoods (II)

User-based collaborative filtering:

p(v|u) \stackrel{rank}{=} p(v) \prod_{i \in I_u} p(i|v)^{r_{u,i}}

We assume a multinomial distribution over the counts of ratings. The maximum likelihood estimate (MLE) is:

p_{mle}(i|v) = \frac{r_{v,i}}{\sum_{j \in I_v} r_{v,j}}

However, the MLE suffers from sparsity: we need smoothing!


Smoothing Methods for Language Models

Absolute Discounting (AD):

p_\delta(i|u) = \frac{\max(r_{u,i} - \delta, 0) + \delta \, |I_u| \, p(i|C)}{\sum_{j \in I_u} r_{u,j}}

Jelinek-Mercer (JM):

p_\lambda(i|u) = (1 - \lambda) \, \frac{r_{u,i}}{\sum_{j \in I_u} r_{u,j}} + \lambda \, p(i|C)

Dirichlet Priors (DP):

p_\mu(i|u) = \frac{r_{u,i} + \mu \, p(i|C)}{\mu + \sum_{j \in I_u} r_{u,j}}
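A compact sketch of the user-based LM neighbour score with these three smoothing estimates: the smoothed models describe each candidate neighbour v, the collection model p(i|C) is estimated from rating counts, the prior p(v) is taken as uniform, and scoring is done in log space for stability. The toy data and function names are ours.

import math

ratings = {                        # ratings[u][i] = r_{u,i} (toy data, ours)
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 4, "i3": 2},
    "u3": {"i2": 5, "i3": 1},
}
total = sum(sum(r.values()) for r in ratings.values())
items = {i for r in ratings.values() for i in r}
p_C = {i: sum(ratings[u].get(i, 0) for u in ratings) / total for i in items}   # collection model p(i|C)

def p_dp(i, v, mu=100.0):          # Dirichlet Priors
    return (ratings[v].get(i, 0) + mu * p_C[i]) / (mu + sum(ratings[v].values()))

def p_jm(i, v, lam=0.5):           # Jelinek-Mercer
    return (1 - lam) * ratings[v].get(i, 0) / sum(ratings[v].values()) + lam * p_C[i]

def p_ad(i, v, delta=0.5):         # Absolute Discounting
    num = max(ratings[v].get(i, 0) - delta, 0) + delta * len(ratings[v]) * p_C[i]
    return num / sum(ratings[v].values())

def lm_score(u, v, smooth=p_dp):
    # log p(v|u) up to rank, with a uniform neighbour prior p(v):
    # sum over i in I_u of r_{u,i} * log p(i|v)
    return sum(r_ui * math.log(smooth(i, v)) for i, r_ui in ratings[u].items())

print(sorted(((lm_score("u1", v), v) for v in ratings if v != "u1"), reverse=True))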

EXPERIMENTS

Experimental settings

Baselines:

# Pearson’s correlation coefficient

# RM1Sim: user-based similarity (Bellogín et al., RecSys’13)

# Cosine similarity

Our similarities are Language Models using:

# Absolute Discounting smoothing

# Jelinek-Mercer smoothing

# Dirichlet Priors smoothing


Parameter Sensitivity of WSR-UB on MovieLens 100k

[Figure: nDCG@10 of WSR-UB as a function of the smoothing parameter (µ for LM-Dirichlet Priors; λ and δ for RM1Sim, LM-Jelinek-Mercer and LM-Absolute Discounting), compared against the Pearson and Cosine baselines.]

Parameter Sensitivity of WSR-IB on R3-Yahoo!

[Figure: nDCG@10 of WSR-IB as a function of the smoothing parameter (µ for LM-Dirichlet Priors; λ and δ for LM-Jelinek-Mercer and LM-Absolute Discounting), compared against the Pearson and Cosine baselines.]

Precision (nDCG@10)

Algorithm     ML 100k     ML 1M       R3-Yahoo!   LibraryThing
NNCosNgbr     0.1427      0.1042      0.0138      0.0550
PureSVD       0.3595a     0.3499ac    0.0198a     0.2245a
Cosine-WSR    0.3899ab    0.3430a     0.0274ab    0.2476ab
LM-DP-WSR     0.4017abc   0.3585abc   0.0271ab    0.2464ab
LM-JM-WSR     0.4013abc   0.3622abcd  0.0276ab    0.2537abcd

Table: Values of precision in terms of normalised discounted cumulative gain at 10 (nDCG@10). Statistical significance is superscripted (Wilcoxon two-sided p < 0.01). Pink = best algorithm. Blue = not significantly different from the best.

Diversity (Gini@10)

Algorithm     ML 100k   ML 1M    R3-Yahoo!   LibraryThing
Cosine-WSR    0.0549    0.0400   0.0902      0.1025
LM-DP-WSR     0.0659    0.0435   0.1557      0.1356
LM-JM-WSR     0.0627    0.0435   0.1034      0.1245

Table: Values of the complement of the Gini index at 10. Pink = best algorithm.

Novelty (MSI@10)

Algorithm     ML 100k   ML 1M     R3-Yahoo!   LibraryThing
Cosine-WSR    11.0579   12.4816   21.1968     41.1462
LM-DP-WSR     11.5219   12.8040   25.9647     46.4197
LM-JM-WSR     11.3921   12.8417   21.7935     43.5986

Table: Values of novelty in terms of Mean Self Information at 10. Pink = best algorithm.

CONCLUSIONS AND FUTURE DIRECTIONS

Conclusions

Novel approach for computing user or item neighbourhoods based on statistical language models. It can be combined with a simple algorithm (WSR):

# Highly accurate recommendations.

# Improved novelty and diversity figures compared to cosine.

# Low computational complexity.

We can leverage inverted indexes to compute neighbourhoods:

# High efficiency.

# High scalability.

Future work

Use non-uniform priors:

# Include document/profile length normalisation.

# Introduce business strategies.

Besides multinomial, explore other probability distributions:

# Multivariate Bernoulli.

# Multivariate Poisson.


THANK YOU!

@dvalcarce
http://www.dc.fi.udc.es/~dvalcarce