runninghead: applyingpsychologicalmodelstomusic

Running head: APPLYING PSYCHOLOGICAL MODELS TO MUSICRECOMMENDATION 1

Exploring the Uses of Psychological Models of Generalization in Music Recommendation

Systems

Tiffany Hwu

University of California, Berkeley

APPLYING PSYCHOLOGICAL MODELS TO MUSIC RECOMMENDATION 2

Exploring the Uses of Psychological Models of Generalization in Music Recommendation

Systems

Abstract

Recommender systems are integrated into many internet-based services, including

movie recommendations, shopping suggestions, and automatic playlist generation.

Although there are a number of effective recommender models inspired by traditional

machine learning methods, the use of psychological models in recommendation is far less

explored. We propose that the process of finding similar music tracks parallels the

cognitive task of generalization, which could potentially be used to aid in music playlist

recommendation. In generalization, stimuli are defined within a psychological space, in

which previously experienced stimuli are used to create generalizations about newly

presented stimuli. Similarly, a person who is trying to construct a playlist will use their

prior musical knowledge or intuitions to find songs of similar taste. The main objective of

this work is to evaluate the effectiveness of applying psychological models to large scale

online datasets which contain the listening histories of users. The models are tested both

qualitatively and quantitively by holding out portions of the dataset and evaluating how

well they can predict the missing information. Using common metrics from information

retrieval, we explore the advantages and differences of using psychological models over

traditional machine learning models in recommender systems. Additionally, we provide an

example of how large existing databases of human behavior can be used to conduct

psychology experiments in a robust and affordable manner.


Introduction

Recommendation systems, also known as recommender systems, have had a wide

variety of uses in recommending music, shopping items, and movies, to the benefit of

websites hoping to maximize profit and customers searching for items that meet their needs

and preferences. The basic recommender system consists of a collection of items described

by content and user ratings, accompanied by a model that uses this data to generate

predictions on which items a particular user will prefer. There are two main approaches to

recommendation: the collaborative filtering approach, which relies on user ratings for

prediction, and a content-based approach, which uses data on features innate to the items,

such as the audio features of songs. Specifically within the collaborative filtering approach,

there are memory-based algorithms which compare users and their preferred items, and

model-based algorithms which use the data to train models and learn latent

representations of the users and items(Su & Khoshgoftaar, 2009).

Music recommendation systems are a subset of recommendation system containing a

collection of tracks and user preferences of tracks which are either explicit in the form of

numerical ratings or implicit with other behavioral data. With websites such as

Last.fm(CBS Interactive, n.d.) containing vast amounts of information on the contents of

user playlists, we are able to create and test our own recommendation models.

The process of providing recommendations based on user and item data can be

viewed as a task of finding which stimuli are most similar to each other and which

conceptual groups of stimuli can be formed. In other words, the process of recommendation

can be viewed as a form of generalization as described by Shepard(1987). This different

perspective on recommendation models naturally presents the question of whether we can

use the field of psychology to augment current recommendation algorithms. A number of

computational models of generalization could readily be used as recommendation

algorithms and contrasted with traditional models to see what they may contribute to the

cause. For instance, Tenenbaum and Griffiths(2001) suggest that human generalization can


be captured within a simple Bayesian framework which is able to generalize from an

arbitrary number of consequential stimuli and with an arbitrary representational structure.

This Bayesian generalization model is an example of the many topics of psychology

relevant to our topic of recommendation.

Such an exploration would benefit not only the world of recommender systems, but

also serve as an example of incorporating large preexisting datasets into psychology

research. The availability of music playlist data makes it a perfect medium for observing

psychological trends. Unlike the traditional experimental paradigm of experimentation

with a small population size of hand-run subjects, we can move toward the paradigm of

finding large preconstructed data of human behavior that buffers against risks of small

homogenous testing populations and expensive experimental procedures. With these

motivations, we compare and contrast various psychological and non-psychological models

in music recommendation.

This paper begins by providing background on the Bayesian generalization

framework. We then extend this model and other models to the task of music

recommendation, and describe the datasets and methods used to compare them. Finally,

we discuss the ways in which recommendations using psychological models can provide far

different results from recommendations of more traditional models.

Background

Bayesian Generalization. The Bayesian generalization framework(Tenenbaum &

Griffiths, 2001) has been successfully used in a variety of different psychological domains.

The framework consists of a query X of positively seen examples and a hypothesis space

H, a collection of hypotheses h defined by collections of positively seen examples. The

likelihood of any one hypothesis is defined by


P (X|h) =

1/|h|n x(j) ∈ h

0 otherwise

(1)

which demonstrates the size principle, the idea that hypotheses of smaller size (defined by

fewer examples) are more likely than hypotheses of larger size. Here, size is represented by

the variable n. To find the posterior probability P (h|x) of a hypothesis being correct, we

apply Bayes’ rule. The prior, P (h), is adjusted according to the particular task that is

being modeled, as shown here:

P (h|X) = P (X|h)P (h)∑h′∈H

P (X|h′)P (h′) (2)

Once there is a posterior probability of each hypothesis being correct, we can

determine whether a new object is part of a concept C by applying the equation below. C

represents the concept embodied by our query X, and P (y ∈ C|h) is either a 1 or a 0

depending on whether the output element exists in the hypothesis:

P (y ∈ C|X) =∑h∈H

P (y ∈ C|h)p(h|X) (3)

Abbott, Austerweil, and Griffiths(2012) apply a Bayesian generalization framework

for large-scale word learning. With a hypothesis space constructed from Wordnet(Miller,

1995), the model is able to learn the taxonomic relationships between words. The success

of the Bayesian generalization framework in this domain is the main motivation for

expanding into other applications such as music playlist recommendation.

Recommendation as Generalization

Recommendation can be viewed as a generalization task in which a group of items

exists in a user’s history and the goal is to determine which other items would belong in a

similar category. In this section we discuss how to apply models of generalization to this

new domain.


Datasets

Our primary method for exploring psychological and traditional models follows the

format of the Million Song Dataset Challenge, a music recommendation challenge that

provides half of the listening histories of a large collection of users and asks contestants to

predict the missing half of the data(Brian McFee & Lanckriet, 2012). Since the contest was

hosted by Kaggle in 2012 and is no longer accepting submissions, the missing half of the

data has been released, allowing us to calculate scores that our models would have achieved

if entered in the competition. While all contestants have access to advanced audio features

of each song through the million song dataset, our solution relies mainly on user ratings.

The listening histories for 110,000 users is provided in the form of triplets consisting

of user id, song id, and playcount. A dataset containing the listening histories of an

additional 1 million users is available in the Echonest Taste Profile Subset, which follows

the same format.

Additionally, we repeat the same procedure on the AOTM-2011 dataset(McFee &

Lanckriet, 2012), a large dataset compiled from Art of the Mix, which is a website where

users post their favorite playlists. The playlist data was separated into equally-sized

training and testing sets, split randomly. While both datasets draw from the Million Song

Dataset, the AOTM-2011 dataset consists of consciously-selected playlists as opposed to

entire listening histories as in the MSD challenge.

Constructing a Hypothesis Space

The testing and training datasets were converted into matrices with columns

representing users and rows representing songs. The elements of the matrix are ‘1’ if the

song has been played by the user at least once or is contained in a playlist, and ‘0’

otherwise. This process resulted in 2 binary matrices for each dataset, as summarized

below.


Models

The primary task of our work involves applying psychological models on the visible

half of user listening histories to see how well they can predict the songs in the missing

half. All models will be applied to the binary matrices described above. In addition, each

model will be tested along 2 conditions: query size (number of songs in the visible half of a

user’s history), and popularity threshold (matrix filtered out to contain only songs above a

specified total playcount).

Music Recommendation as Bayesian Inference. To apply this framework to

our data, we treat each each column as a hypothesis. The corresponding query X of

positively seen examples is a collection of songs representing the visible half of a particular

user’s listening history. The likelihood is as described above, with hypotheses of fewer

songs being more likely.

The prior in this case is assigned an Erlang distribution, representative of the

intuition that intermediate-sized playlists are more likely than small or large playlists. This

is described by P (h) ∝ (|h|/σ2)e−|h|/σ where σ was hand-selected as 10.

One further adjustment was made to the likelihood calculation, allowing for an error

term ε = 1 ∗ 10−15 accounting for noise in the dataset. This allows for a likelihood to be

calculated even if not all songs in a query are members of a particular hypothesis. This is

expressed in

P (d|h) =

1/|h|(1− ε) + ε d ∈ h

ε otherwise

(4)

P (X|h) =∏d

P (d|h) (5)

Finally we can use probability generalization to create a ranking of all songs in the

dataset listed in order of songs most likely to be in the user’s missing half of listening

history.


Exemplar/Prototype Theory. The exemplar and prototype models perform

categorization tasks through probability density estimates(Ashby & Alfonso-Reese, 1995)

as described below.

Prototype theory is the idea that some objects are more prototypical of a category

than others and can be used as the basis of comparison when deciding whether or not a

new stimulus belongs to the category. A formal model based on this idea can be stated as

Equation 6, where dist is the Hamming distance between the two vectors x and y, and λp

is a hand-picked value (0.15 in this case) which is used to optimize results. The score is

thus calculated by

Pscore(y) = exp{−λpdist(x, yproto)} (6)

Exemplar theory is the idea that all instances in memory belonging to a certain

category are used in the process of comparison with a new stimulus. The formalized model

of this idea is similar to that of prototype theory, except that it uses the sum of

comparisons with all items in a category rather than a prototype. As with the prototype

model, λe is a hand-picked value of 0.15, found through

Escore(y) =∑xj∈x

exp{−λedist(y, xj)} (7)

The models described above can be applied as they are to the binarized matrices of

user listening history. We can interpret the prototype model as a construction of a

prototypical song representing all of the songs in a query. We can then rank all songs by

similarity to the prototype. The exemplar model compares all songs in the dataset to each

song in the query and sums up the comparisons.

Baseline Models

Performance of the psychological models is measured alongside non-psychological

models as a standard of comparison


Bayesian Sets. The Bayesian sets model is a machine learning method used to

categorize elements into sets(Ghahramani & Heller, 2005). It can be applied very efficiently

with a single matrix multiplication and has seen success in modeling judgments of

representativeness in images(Abbott, Heller, Ghahramani, & Griffiths, 2011). The Bayesian

sets score is found by the following:

score(x) = p(x|Hc)/p(x) (8)

TFIDF. Term frequency-inverse document frequency is a common technique for

determining the uniqueness of a term within a document. The term is devalued if it

belongs in many documents, and valued if it appears frequently within specific documents.

TFIDF is purportedly used in many commercially-used music recommendation

algorithms(Mims, 2011) and thus serves as a good standard of comparison. The two

equations below summarize our use of TFIDF

score(h) =∑x∈X

TF (x, h) ∗ IDF (x) (9)

P (y ∈ C|X) =∑h∈H

P (y ∈ C|h)score(h) (10)

Term frequency (TF) is simply the frequency of a term x in a document h and inverse

document frequency (IDF) is the reciprocal of the frequency of documents containing the

term. In our case, users are analogous to documents and terms are analogous to songs. We

compute the sum of TFIDF scores for each song in the query X and use a probability

generalization scheme identical to the one used in the Bayesian generalization framework.

Popularity. As our most baseline model, we can simply rank all songs by total

playcount. Songs with higher playcount are more highly recommended as a whole and

therefore appear more relevant in general. This serves as a sanity check, as we must ensure

that no reasonably effective model should fare worse than this.


Metrics

The submission and scoring process is as follows. A submission uses the visible half of

the data in whatever way it wishes, and returns a list of songs for each user, ranked in

order of how likely the song is to be in the missing half of the data for a specific user. Just

as in the Million Song Dataset Challenge, we use four standard metrics in information

retrieval to compare the ranked output y for user u with the actual hidden data. This is

represented in matrix M .

Precision at 10. Precision is a common metric of information retrieval,

representing the proportion of correct items in a top-k ranking. The particular form of

precision we use here is the precision at rank 10 of the ranked list of relevant songs,

calculated by

Pk(u, y) = 110

10∑j=1

Mu,y(j) (11)

Truncated mAP. The main metric used to compare submissions in the

competition is the mean average precision (mAP) of the ranked song suggestions, with a

cutoff τ of the first 500 songs. nu is the total number of users. Average precision is found by

AP (u, y) = 1nu

τ∑k=1

Pk(u, y) ·Mu,y(k) (12)

while mean average precision is simply the mean of these AP scores:

mAP = 1m

∑u

AP (u, yu) (13)

DCG. Discounted cumulative gain (DCG) highly rewards relevant documents for

having a high ranking and penalizes for having a low ranking. We find the DCG only up to

the 10th element in the ranking, so n = 10. The equation for this is

DCG(n) =n∑j=1

2relevant(j) − 1log(1 + j) (14)


MRR. Mean reciprocal rank (MRR) is simply the mean of the reciprocal ranks of

all items in the query set. n is the number of items in the query. Mean reciprocal rank is

calculated by

MRR(n) = 1n

n∑j=1

1relevant(j) (15)

Results

Since all of the metrics used show similar trends, we will use mAP for illustrating

results, as it was the main metric for the Million Song Dataset Challenge. Full results can

be found in the appendix.

*NOTE: The metrics have not yet been run on prototype and exemplar models.

Varying Population Threshold

All models show an increase in performance as the population threshold increases as

presented in Figures 1 and 2. This makes sense, as the precision scores should increase

when there is more listening history data available for each song. For the Million Song

Dataset Challenge, the Bayesian generalization model appears to outperform TF-IDF,

while the reverse is true for the AOTM-2011 dataset.

Varying Query Size

For the Million Song Dataset Challenge, the models perform poorly on query sizes

1-10, and perform much better when the full query sizes are used. For the AOTM-2011

dataset, Bayesian generalization and TF-IDF show an increase in performance as the query

size increases. The Bayesian generalization framework shows a particular strength in

generating correct recommendations when the query size is small. Interestingly, Bayesian

sets performs worse as the query size increases.


Qualitative Comparison

A quick glance at the actual content of the recommendations provided by different

models tangibly shows the large difference in results. In a sample query of 3 Michael

Jackson songs, the Bayesian generalization model has inferred the theme of our query and

suggested only Michael Jackson songs. Bayesian Sets appears to have guessed themes of

80’s music and Halloween (likely from Michael Jackson’s ’Thriller’). TF-IDF may have

picked up on these but also recommends a few songs which have a less clear relation to

Michael Jackson. Further, if we provide just one song, ’Thriller’, in our query for Bayesian

generalization, it directly picks up the Halloween theme and offers several Halloween songs.

Discussion

The qualitative and quantitative results both show that Bayesian generalization can

often provide insightful recommendations that more traditional models overlook. A

particular strength of the Bayesian generalization framework is its ability to detect the

theme of a certain query when given one song (ex. Thriller) vs. three songs (ex. three

Michael Jackson songs). This suggests that applying such psychological models to music

recommendation could perhaps provide a more human-like quality to current

recommendation systems. The difference in trends regarding scores of the MSD taste

profile and the scores of the AOTM-2011 dataset show the importance in selecting the

proper model for the task. A proposed explanation of this discrepancy lies in the fact that

the AOTM-2011 dataset consists of users selecting songs that they believe would go well

together. Thus, applying a Bayesian generalization model might do well at modeling a real

human who makes recommendations based off of their knowledge of particular song

combinations. Additionally, since the MSD taste profile hypotheses contain entire listening

histories and are less thematic, smaller queries may be insufficient to generate good

recommendations, as suggested by the query size results.


These results additionally call to question the use of traditional information retrieval

metrics when thinking about the problem of recommendation. A typical quantitative

approach in machine learning consists of the methods we used, in which half of the dataset

is removed and recovered. While our results can detect a few trends in increasing

population thresholds and query sizes, many phenomena, such as the poor performance of

Bayesian sets with increasing query size, is hard to deconstruct. In contrast, a qualitative

survey of a sample query leads to clear contrasts amongst the models and a good intuition

for which model would be best for the task.

Conclusion

The world of music recommendation and recommender systems in general shows

strong use of traditional machine learning techniques and metrics. We have seen that

applying psychological models of generalization can contribute significantly to current

systems and perhaps provide more insight into how a human would recommend songs

versus how a typical machine learning algorithm would recommend songs.


References

Abbott, J. T., Austerweil, J. L., & Griffiths, T. L. (2012). Constructing a hypothesis space

from the web for large-scale Bayesian word learning. In Proceedings of the 34th

Annual Conference of the Cognitive Science Society.

Abbott, J. T., Heller, K. A., Ghahramani, Z., & Griffiths, T. L. (2011). Testing a Bayesian

measure of representativeness using a large image database. In NIPS (Vol. 24, pp.

2321–2329).

Ashby, F. G., & Alfonso-Reese, L. A. (1995). Categorization as probability density

estimation. Journal of mathematical psychology, 39 (2), 216–233.

Brian McFee, D. P. E., Thierry Bertin-Mahieux, & Lanckriet, G. R. (2012). The million

song dataset challenge. Proceedings of the 21st International Conference Companion

on World Wide Web, 12 , 909–916. Retrieved from

http://cosmal.ucsd.edu/ gert/papers/msdc.pdf

CBS Interactive. (n.d.). www.last.fm. ((last accessed: Apr. 16 2014))

Ghahramani, Z., & Heller, K. A. (2005). Bayesian sets. In NIPS (Vol. 2, pp. 22–23).

McFee, B., & Lanckriet, G. R. (2012). Hypergraph models of playlist dialects. In Ismir

(pp. 343–348).

Miller, G. A. (1995). Wordnet: a lexical database for english. Communications of the

ACM , 38 (11), 39–41. (Princeton University dataset available at

http://wordnet.princeton.edu)

Mims, C. (2011). How itunes genius really works. Technology Review. Retrieved from

www.technologyreview.com/view/419198/how-itunes-genius-really-works/

Shepard, R. N. (1987). Towards a universal law of generalization for psychological science.

Science, 237 , 1317-1323.

Su, & Khoshgoftaar. (2009). A survey of collaborative filtering techniques. Advances in

Articial Intelligence, 2009 (421425).

Tenenbaum, J., & Griffiths, T. L. (2001). Generalization, similarity, and Bayesian


inference. Behavioral and Brain Sciences, 24 , 629-641.


train set test set

matrix dimensions 286213*110000 286213x110000

avg songs per user 13.1903 1.2358

avg users per song 3.7568 0.3520

Table 1

Summary of MSD Challenge Dataset


train set test set

matrix dimensions 1818*13514 1818x13514

avg songs per user 9.8113 8.4004

avg users per song 1.3199 1.1301

Table 2

Summary of AOTM-2011 Dataset


Psychological Models Traditional Models

Bayesian Generalization Bayesian Sets

Exemplar TF-IDF

Prototype Popularity

Table 3

Summary of models being tested


Bayesian Generalization TF-IDF Bayesian Sets

*Bad *Smooth Criminal The Monster Mash

*I Just Can’t Stop Loving You Tiny Dancer *Smooth Criminal

*Smooth Criminal Like A Prayer Nightmare on My Street

*Man in the Mirror *Let’s Get It On *Bad

*Wanna Be Startin’ Somethin’ I Believe in a Thing Called Love *Can You Feel It

*PYT *Money Love is a Battlefield

*You Rock My World Kiss Every Day is Halloween

*Baby Be Mine *Bad *I Just Can’t Stop Loving You

*Why You Wanna Trip on Me? Dead Man’s Party Shake Your Groove Thing

*Speechless Halloween *Stranger in Moscow

Table 4

Rankings for query: Billie Jean, Thiller, The Way You Make Me Feel. Michael Jackson

songs are denoted by asterisks (*).


Bayesian Generalization

Halloween

The Monster Mash

Werewolves of London

I Put a Spell on You

*Billie Jean

Dead Man’s Party

Girls Just Wanna Have Fun

*Smooth Criminal

Ghost Town

*The Way You Make Me Feel

I’m a Mummy

Video Killed the Radio Star

Table 5

Rankings for Bayesian generalization for query: Thriller. Michael Jackson songs are

denoted by asterisks (*).


Figure 1 . The mean average precision (mAP) of each model on the Million Song Dataset

Challenge as a function of population threshold.


Figure 2 . The mean average precision (mAP) of each model on the AOTM dataset as a

function of population threshold.


Figure 3 . The mean average precision (mAP) of each model on the Million Song Dataset

Challenge as a function of query size.


Figure 4 . The mean average precision (mAP) of each model on the AOTM dataset as a

function of query size.


Appendix A: Full Results - Varying Query Size

AOTM-2011 Dataset

N 1 2 3 5 10 all

Bayesian Generalization 0.0025 0.0029 0.0033 0.0037 0.0039 0.0039

TF-IDF 0.0029 0.0026 0.0029 0.0031 0.0032 0.0032

Bayesian Sets 0.0029 0.0015 0.0014 0.0017 0.0016 0.0016

Popularity 0.0033 0.0033 0.0033 0.0033 0.0033 0.0033

Prototype - - - - - -

Exemplar - - - - - -

Table 6

P at 10

N 1 2 3 5 10 all


TF-IDF 0.0024 0.0024 0.0027 0.0030 0.0031 0.0031

Bayesian Sets 0.0032 0.0015 0.0015 0.0015 0.0015 0.0015

Popularity 0.0035 0.0035 0.0035 0.0035 0.0035 0.0035



Table 7

mAP


N 1 2 3 5 10 all


TF-IDF 6.5722 6.5685 6.5700 6.5707 6.5709 6.5709

Bayesian Sets 6.5744 6.5631 6.5630 6.5639 6.5633 6.5631

Popularity 6.5767 6.5767 6.5767 6.5767 6.5767 6.5767



Table 8

DCG

N 1 2 3 5 10 all


TF-IDF 0.0020 0.0020 0.0022 0.0024 0.0025 0.0025

Bayesian Sets 0.0029 0.0012 0.0013 0.0013 0.0013 0.0013

Popularity 0.0032 0.0032 0.0032 0.0032 0.0032 0.0032



Table 9

MRR

Million Song Dataset Challenge

Appendix B: Full Results- Varying Population Threshold


N 1 2 3 5 10 all

Bayesian Generalization 2.8425e-05 1.4213e-04 5.6850e-05 0 0 1.3758e-02

TF-IDF 5.6850e-05 5.6850e-05 2.8425e-05 0 2.4417e-02 4.4940e-02

Bayesian Sets 5.6850e-05 5.6850e-05 2.8425e-05 0 2.8425e-05 4.4940e-02

Popularity 3.1950e-02 3.1950e-02 3.1950e-02 3.1950e-02 3.1950e-02 3.1950e-02



Table 10

P at 10

N 1 2 3 5 10 all

Bayesian Generalization 5.9968e-05 3.2262e-05 1.8260e-05 8.5345e-06 1.3408e-05 1.9592e-02

TF-IDF 5.8601e-05 2.1097e-05 1.3903e-05 9.6758e-06 1.1140e-05 3.8134e-02

Bayesian Sets 3.5465e-05 6.7142e-05 9.7076e-06 8.9287e-06 8.0981e-06 3.4094e-02

Popularity 1.9151e-02 1.9151e-02 1.9151e-02 1.9151e-02 1.9151e-02 1.9151e-02



Table 11

mAP

AOTM-2011 Dataset


N 1 2 3 5 10 all


TF-IDF 6.5555 6.5555 6.5552 6.5551 6.5550 6.6688

Bayesian Sets 6.5553 6.5556 6.5551 6.5550 6.5551 6.7747

Popularity 6.7804 6.7804 6.7804 6.7804 6.7804 6.7804



Table 12

DCG

N 1 2 3 5 10 all


TF-IDF 0.4440 0.5319 0.5347 0.5349 0.5349 0.5349

Bayesian Sets 0.0032 0.4137 0.4318 0.4487 0.4552 0.4554

Popularity 0.0034 0.0034 0.0034 0.0034 0.0034 0.0034



Table 13

MRR


T none 2 3 5

Bayesian Generalization 0.0039 0.0057 0.0060 0.0066

TF-IDF 0.0032 0.0050 0.0055 0.0058

Bayesian Sets 0.0016 0.0025 0.0028 0.0032

Popularity 0.0033 0.0043 0.0047 0.0056

Prototype - - - -

Exemplar - - - -

Table 14

P at 10

T none 2 3 5


TF-IDF 0.0031 0.0071 0.0085 0.0103

Bayesian Sets 0.0015 0.0040 0.0048 0.0071

Popularity 0.0035 0.0073 0.0089 0.0126

Prototype - - - -

Exemplar - - - -

Table 15

mAP


T none 2 3 5


TF-IDF 6.5709 6.5810 6.5841 6.5856

Bayesian Sets 6.5631 6.5688 6.5705 6.5731

Popularity 6.5767 6.5835 6.5860 6.5921

Prototype - - - -

Exemplar - - - -

Table 16

DCG

T none 2 3 5


TF-IDF 0.0025 0.0061 0.0074 0.0092

Bayesian Sets 0.0013 0.0036 0.0044 0.0066

Popularity 0.0032 0.0068 0.0082 0.0117

Prototype - - - -

Exemplar - - - -

Table 17

MRR


Million Song Dataset Challenge

T none 10 25 50 100 200


TF-IDF 0.0244 0.0408 0.0479 0.0508 0.0523 0.0523

Bayesian Sets 0.0449 0.0404 0.0408 0.0408 0.0405 0.0359

Popularity 0.0319 0.0322 0.0327 0.0337 0.0366 0.0434



Table 18

P at 10

T none 10 25 50 100 200


TF-IDF 0.0381 0.0509 0.0553 0.0589 0.0638 0.0736

Bayesian Sets 0.0341 0.0363 0.0379 0.0391 0.0400 0.0444

Popularity 0.0192 0.0222 0.0256 0.0305 0.0409 0.0634



Table 19

mAP


T none 10 25 50 100 200


TF-IDF 6.6688 6.7515 6.7920 6.8122 6.8265 6.8335

Bayesian Sets 6.7747 6.7521 6.7570 6.7600 6.7623 6.7459

Popularity 6.7804 6.7822 6.7857 6.7931 6.8130 6.8616



Table 20

DCG

T none 10 25 50 100 200


TF-IDF 0.0142 0.0214 0.0263 0.0309 0.0380 0.0479

Bayesian Sets 0.0145 0.0170 0.0195 0.0220 0.0257 0.0315

Popularity 0.0120 0.0142 0.0167 0.0204 0.0282 0.0454



Table 21

MRR

runninghead: applyingpsychologicalmodelstomusic

Documents