Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks



Page 1: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Julián Urbano, Jorge Morato, Mónica Marrero and Diego Martín
[email protected]

SIGIR CSE 2010 · Geneva, Switzerland · July 23rd

Page 2: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks


Outline

• Introduction
• Motivation
• Alternative Methodology
• Crowdsourcing Preferences
• Results
• Conclusions and Future Work

Page 3: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks


Evaluation Experiments

• Essential for Information Retrieval [Voorhees, 2002]
• Traditionally followed the Cranfield paradigm
  ▫ Relevance judgments are the most important part of test collections (and the most expensive)
• In the music domain, evaluation was not taken very seriously until recently
  ▫ MIREX appeared in 2005 [Downie et al., 2010]
  ▫ Additional problems with the construction and maintenance of test collections [Downie, 2004]

Page 4: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks


Music Similarity Tasks

• Given a music piece (i.e. the query), return a ranked list of other pieces similar to it
  ▫ Actual music contents, forget the metadata!
• It comes in two flavors
  ▫ Symbolic Melodic Similarity (SMS)
  ▫ Audio Music Similarity (AMS)
• It is inherently more complex to evaluate
  ▫ Relevance judgments are very problematic

Page 5: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Relevance (Similarity) Judgments

• Relevance is usually considered on a fixed scale
  ▫ Relevant, not relevant, very relevant…
• For music similarity tasks relevance is rather continuous [Selfridge-Field, 1998][Typke et al., 2005][Jones et al., 2007]
  ▫ Single melodic changes are not perceived to change the overall melody
     Move a note up or down in pitch, shorten it, etc.
  ▫ But the similarity is weaker as more changes apply
• Where is the line between relevance levels?


Page 6: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Partially Ordered Lists

• The relevance of a document is implied by its position in a partially ordered list [Typke et al., 2005]
  ▫ Does not need any prefixed relevance scale
• Ordered groups of equally relevant documents
  ▫ Have to keep the order of the groups
  ▫ Allow permutations within the same group
• Assessors only need to be sure that any pair of documents is ordered properly


Page 7: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Partially Ordered Lists (II)


Page 8: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Partially Ordered Lists (and III)

• Used in the first edition of MIREX in 2005 [Downie et al., 2005]
• Widely accepted by the MIR community to report new developments [Urbano et al., 2010a][Pinto et al., 2008][Hanna et al., 2007][Gratchen et al., 2006]
• MIREX was forced to move to traditional level-based relevance from 2006 onwards
  ▫ Partially ordered lists are expensive
  ▫ And have some inconsistencies


Page 9: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Expensiveness

• The ground truth for just 11 queries took 35 music experts for 2 hours [Typke et al., 2005]
  ▫ Only 11 of them had time to work on all 11 queries
  ▫ This exceeds MIREX’s resources for a single task
• MIREX had to move to level-based relevance
  ▫ BROAD: Not Similar, Somewhat Similar, Very Similar
  ▫ FINE: numerical, from 0 to 10 with one decimal digit
• Problems with assessor consistency came up


Page 10: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Issues with Assessor Consistency

• The line between levels is certainly unclear [Jones et al., 2007][Downie et al., 2010]


Page 11: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Original Methodology

• Go back to partially ordered lists
  ▫ Filter the collection
  ▫ Have the experts rank the candidates
  ▫ Arrange the candidates by rank
  ▫ Aggregate candidates whose ranks are not significantly different (Mann-Whitney U; see the sketch below)
• There are known odd results and inconsistencies [Typke et al., 2005][Hanna et al., 2007][Urbano et al., 2010b]
  ▫ Disregard changes that do not alter the actual perception, such as clef or key and time signature
  ▫ Something like changing the language of a text and using synonyms [Urbano et al., 2010a]
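
For illustration, here is a minimal sketch of that aggregation step, assuming each candidate's expert ranks are available in a dict; the comparison against only the previous candidate and the significance level are simplifications of mine, not details taken from [Typke et al., 2005].

```python
# Minimal sketch (assumed details, not the original scripts): sort candidates by
# median expert rank and merge adjacent candidates whose rank samples are not
# significantly different (Mann-Whitney U), yielding a partially ordered list.
from statistics import median
from scipy.stats import mannwhitneyu

def aggregate_by_rank(ranks, alpha=0.05):
    """ranks: dict mapping each candidate to the list of ranks it received."""
    docs = sorted(ranks, key=lambda d: median(ranks[d]))  # best (lowest) median first
    groups = [[docs[0]]]
    for doc in docs[1:]:
        prev = groups[-1][-1]  # compare with the previous candidate only (simplification)
        _, p = mannwhitneyu(ranks[prev], ranks[doc], alternative='two-sided')
        if p < alpha:
            groups.append([doc])      # significantly different ranks: new group
        else:
            groups[-1].append(doc)    # not significantly different: same group
    return groups
```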


Page 12: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Inconsistencies due to Ranking


Page 13: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Alternative Methodology

• Minimize inconsistencies [Urbano et al., 2010b]
• Cheapen the whole process
• Reasonable Person hypothesis [Downie, 2004]
  ▫ With crowdsourcing (finally)
• Use Amazon Mechanical Turk
  ▫ Get rid of experts [Alonso et al., 2008][Alonso et al., 2009]
  ▫ Work with “reasonable turkers”
  ▫ Explore other domains to apply crowdsourcing


Page 14: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Equally Relevant Documents

• Experts were forced to give totally ordered lists
• One would expect ranks to randomly average out
  ▫ Half the experts prefer one document
  ▫ Half the experts prefer the other one
• That is hardly the case
  ▫ Do not expect similar ranks if the experts cannot give similar ranks in the first place


Page 15: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Give Audio instead of Images

• Experts may be guided by the images, not the music
  ▫ Some irrelevant changes in the image can deceive them
• No music expertise should be needed
  ▫ Reasonable person turker hypothesis


Page 16: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Preference Judgments

• In their heads, experts actually make preference judgments
  ▫ Similar to a binary search
  ▫ Accelerates assessor fatigue as the list grows
• Already noted for level-based relevance
  ▫ Go back and re-judge [Downie et al., 2010][Jones et al., 2007]
  ▫ Overlapping between BROAD and FINE scores
• Change the relevance assessment question
  ▫ Which is more similar to Q: A or B? [Carterette et al., 2008]


Page 17: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Preference Judgments (II)

• Better than traditional level-based relevance
  ▫ Inter-assessor agreement
  ▫ Time to answer
• In our case, three-point preferences
  ▫ A < B (A is more similar)
  ▫ A = B (they are equally similar/dissimilar)
  ▫ A > B (B is more similar)


Page 18: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Preference Judgments (and III)

• Use a modified QuickSort algorithm to sort documents into a partially ordered list (see the sketch below)
  ▫ Do not need all O(n²) judgments, but O(n·log n)

(Figure: example of the sorting process; the legend distinguishes the current pivot on the segment from documents that have been pivots already.)
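
To make the sorting step concrete, here is a minimal sketch of a QuickSort-style three-way partition driven by the three-point preferences. The prefer(a, b, query) callback is a hypothetical stand-in for the crowdsourced judgment, and the naive pivot choice is a simplification, not necessarily the authors' exact implementation.

```python
# Minimal sketch (assumed, not the exact implementation used for MTurk): build a
# partially ordered list with a QuickSort-style three-way partition driven by
# three-point preference judgments. prefer(a, b, query) is a hypothetical
# callback returning '<' (a more similar), '=' (tie) or '>' (b more similar).

def partially_ordered_list(docs, query, prefer):
    """Return a list of groups of documents, most similar to the query first."""
    if not docs:
        return []
    pivot, rest = docs[0], docs[1:]      # naive pivot choice
    more, equal, less = [], [pivot], []
    for doc in rest:                     # one judgment per document and pivot
        judgment = prefer(doc, pivot, query)
        if judgment == '<':              # doc is more similar than the pivot
            more.append(doc)
        elif judgment == '=':            # doc ties with the pivot
            equal.append(doc)
        else:                            # the pivot is more similar
            less.append(doc)
    return (partially_ordered_list(more, query, prefer)
            + [equal]
            + partially_ordered_list(less, query, prefer))
```

On average this needs O(n·log n) pairwise judgments instead of all O(n²) pairs, which is what makes the crowdsourced batches affordable.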


Page 19: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

How Many Assessors?

• Ranks are given to each document in a pair
  ▫ +1 if it is preferred over the other one
  ▫ -1 if the other one is preferred
  ▫ 0 if they were judged equally similar/dissimilar
• Test for signed differences in the samples (see the sketch below)
• In the original lists 35 experts were used
  ▫ Ranks of a document ranged from 1 to more than 20
• Our rank sample is less (and equally) variable
  ▫ rank(A) = -rank(B) ⇒ var(A) = var(B)
  ▫ Effect size is larger, so statistical power increases
  ▫ Fewer assessors are needed overall
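
The slide does not name the exact test; a plain sign test over the +1/0/-1 scores is one simple instantiation (a Wilcoxon signed-rank test would be another natural choice). The sketch below, including the function name and threshold, is my own illustration rather than the authors' procedure.

```python
# Minimal sketch (assumed): decide whether one document of a pair is
# significantly preferred over the other, using a sign test on the
# +1 / 0 / -1 scores collected from the workers.
from scipy.stats import binomtest

def significantly_ordered(scores, alpha=0.05):
    """scores: list of +1 (A preferred), -1 (B preferred) or 0 (tie)."""
    wins_a = sum(1 for s in scores if s > 0)
    wins_b = sum(1 for s in scores if s < 0)
    n = wins_a + wins_b               # ties carry no sign information here
    if n == 0:
        return False                  # only ties: keep A and B in the same group
    p = binomtest(wins_a, n, 0.5).pvalue
    return p < alpha                  # True: A and B go to different groups

# e.g. significantly_ordered([+1, +1, +1, 0, +1, -1, +1, +1, +1, +1]) -> True
```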


Page 20: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Crowdsourcing Preferences

• Crowdsourcing seems very appropriate
  ▫ Reasonable person hypothesis
  ▫ Audio instead of images
  ▫ Preference judgments
  ▫ QuickSort for partially ordered lists
• The task can be split into very small assignments
• It should be much cheaper and more consistent
  ▫ Do not need experts
  ▫ Do not deceive workers, which increases consistency
  ▫ Easier and faster to judge
  ▫ Need fewer judgments and judges


Page 21: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

New Domain of Application

• Crowdsourcing has been used mainly to evaluate text documents in English
• How about other languages?
  ▫ Spanish [Alonso et al., 2010]
• How about multimedia?
  ▫ Image tagging? [Nowak et al., 2010]
  ▫ Music similarity?


Page 22: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Data

• MIREX 2005 Evaluation collection
  ▫ ~550 musical incipits in MIDI format
  ▫ 11 queries also in MIDI format
  ▫ 4 to 23 candidates per query
• Convert to MP3, as it is easier to play in browsers
• Trim the leading and trailing silence
  ▫ From 1 to 57 secs. (mean 6) down to 1 to 26 secs. (mean 4)
  ▫ 4 to 24 secs. (mean 13) to listen to all 3 incipits
• Uploaded all MP3 files and a Flash player to a private server to stream data on the fly


Page 23: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

HIT Design

(Screenshot of the HIT interface; the reward per assignment was $0.02 — “2 yummy cents of dollar”.)


Page 24: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Threats to Validity

• Basically had to randomize everything
  ▫ Initial order of candidates in the first segment
  ▫ Alternate between queries
  ▫ Alternate between pivots of the same query
  ▫ Alternate pivots as variations A and B
• Let the workers know about this randomization
• In the first trials some documents were judged more similar to the query than the query itself!
  ▫ Require at least a 95% acceptance rate
  ▫ Ask for 10 different workers per HIT [Alonso et al., 2009]
  ▫ Beware of bots (always judged equal, in 8 secs.)


Page 25: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Summary of Submissions

• The 11 lists account for 119 candidates to judge
• Sent 8 batches (QuickSort iterations) to MTurk
• Had to judge 281 pairs (38%) = 2,810 judgments
• 79 unique workers over about a day and a half
• A total cost (excluding trials) of $70.25 (see the check below)
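
A quick back-of-the-envelope check of these totals; the $0.02 reward per assignment comes from the HIT design slide, while the $0.005 Amazon fee per assignment is an assumption of mine that happens to reproduce the reported cost.

```python
# Sanity check of the reported totals (the per-assignment Amazon fee is assumed).
pairs, workers_per_hit = 281, 10
judgments = pairs * workers_per_hit          # 281 pairs x 10 workers = 2810 judgments
cost = judgments * (0.02 + 0.005)            # reward + assumed platform fee
print(judgments, round(cost, 2))             # 2810 70.25
```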


Page 26: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Feedback and Music Background

• 23 of the 79 workers gave us feedback
  ▫ 4 very positive comments: very relaxing music
  ▫ 1 greedy worker: give me more money
  ▫ 2 technical problems loading the audio in 2 HITs
     Not reported by any of the other 9 workers
  ▫ 5 reported no music background
  ▫ 6 had formal music education
  ▫ 9 professional practitioners for several years
  ▫ 9 play an instrument, mainly piano
  ▫ 6 performers in choir


Page 27: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Agreement between Workers

• Forget about Fleiss' Kappa
  ▫ Does not account for the size of the disagreement
  ▫ A<B and A=B is not as bad as A<B and B<A
• Look at all 45 pairs of judgments per document pair (see the sketch below)
  ▫ +2 if total agreement (e.g. A<B and A<B)
  ▫ +1 if partial agreement (e.g. A<B and A=B)
  ▫ 0 if no agreement (i.e. A<B and B<A)
  ▫ Divide by 90 (all pairs with total agreement)
• Average agreement score per pair was 0.664
  ▫ From 0.506 (iteration 8) to 0.822 (iteration 2)
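
A minimal sketch of this agreement score, assuming the ten judgments for a document pair are encoded as '<', '=' and '>' (the encoding and function name are mine):

```python
# Minimal sketch (assumed encoding): pairwise agreement score of the judgments
# collected for one document pair; 1.0 means all workers gave the same preference.
from itertools import combinations

def agreement_score(judgments):
    """judgments: e.g. ['<', '<', '=', '>', ...]; returns a score in [0, 1]."""
    pairs = list(combinations(judgments, 2))  # 45 pairs for 10 judgments
    score = 0
    for a, b in pairs:
        if a == b:
            score += 2            # total agreement (e.g. A<B and A<B)
        elif '=' in (a, b):
            score += 1            # partial agreement (e.g. A<B and A=B)
        # opposite preferences (A<B and B<A) add nothing
    return score / (2 * len(pairs))           # divide by 90 for 10 judgments

# e.g. agreement_score(['<'] * 10) == 1.0
```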


Page 28: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Agreement Workers-Experts

• Those 10 judgments were actually aggregated
• Agreement with the expert judgments (table; percentages per row total):
  ▫ 155 (55%) total agreement
  ▫ 102 (36%) partial agreement
  ▫ 23 (8%) no agreement
• Total agreement score = 0.735
• Supports the reasonable person hypothesis


Page 29: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Agreement Single Worker-Experts


Page 30: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Agreement (Summary)


• Very similar judgments overall
  ▫ The reasonable person hypothesis still holds
  ▫ Crowdsourcing seems a doable alternative
  ▫ No music expertise seems necessary
• We could use just one assessor per pair
  ▫ If we could keep him/her throughout the query

Page 31: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Ground Truth Similarity

• Do high agreement scores translate into highly similar ground truth lists?
• Consider the original lists (All-2) as ground truth
• And the crowdsourced lists as a system’s result
  ▫ Compute the Average Dynamic Recall [Typke et al., 2006]
  ▫ And then the other way around
• Also compare with the (more consistent) original lists aggregated in Any-1 form [Urbano et al., 2010b]


Page 32: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Ground Truth Similarity (II)

• The result depends on the initial ordering
  ▫ Ground truth = (A, B, C), (D, E)
  ▫ Results1 = (A, B), (D, E, C) → ADR score = 0.933
  ▫ Results2 = (A, B), (C, D, E) → ADR score = 1
• Results1 is identical to Results2 (as partially ordered lists)
• Generate 1000 (identical) versions by randomly permuting the documents within a group (see the ADR sketch below)
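
To make the dependence on within-group order concrete, here is a minimal sketch of Average Dynamic Recall as I read [Typke et al., 2006] (the function name and data layout are mine); it reproduces the two scores above.

```python
# Minimal sketch (my reading of Typke et al., 2006): Average Dynamic Recall of a
# ranked result list against a ground truth given as ordered groups of documents.

def average_dynamic_recall(groups, ranking):
    """groups: e.g. [['A','B','C'], ['D','E']]; ranking: e.g. ['A','B','D','E','C']."""
    n = sum(len(g) for g in groups)          # only the first n results are evaluated
    recalls = []
    for i in range(1, n + 1):
        allowed, covered = set(), 0
        for g in groups:                     # groups fully or partially reached at depth i
            allowed.update(g)
            covered += len(g)
            if covered >= i:
                break
        hits = sum(1 for doc in ranking[:i] if doc in allowed)
        recalls.append(hits / i)
    return sum(recalls) / n

print(average_dynamic_recall([['A', 'B', 'C'], ['D', 'E']], list('ABDEC')))  # ~0.933
print(average_dynamic_recall([['A', 'B', 'C'], ['D', 'E']], list('ABCDE')))  # 1.0
```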


Page 33: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Ground Truth Similarity (and III)

(Table of ADR scores; min. and max. between square brackets)

• Very similar to the original All-2 lists
• Like the Any-1 version, also more restrictive
• More consistent (workers were not deceived)


Page 34: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

MIREX 2005 Revisited

• Would the evaluation have been affected?
  ▫ Re-evaluated the 7 systems that participated
  ▫ Included our Splines system [Urbano et al., 2010a]
• All systems perform significantly worse
  ▫ ADR scores drop by 9-15%
• But their ranking is just the same
  ▫ Kendall’s τ = 1 (see the sketch below)
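
The ranking comparison can be reproduced with Kendall’s τ over the per-system ADR scores; the numbers below are hypothetical placeholders, not the actual MIREX 2005 results.

```python
# Minimal sketch with hypothetical ADR scores (not the real MIREX 2005 numbers):
# tau = 1 means both ground truths rank the systems identically, even though all
# absolute scores dropped.
from scipy.stats import kendalltau

adr_original = {'sysA': 0.80, 'sysB': 0.72, 'sysC': 0.65}   # hypothetical
adr_crowd    = {'sysA': 0.70, 'sysB': 0.64, 'sysC': 0.57}   # hypothetical (9-15% lower)

systems = sorted(adr_original)
tau, _ = kendalltau([adr_original[s] for s in systems],
                    [adr_crowd[s] for s in systems])
print(tau)   # 1.0: the system ranking is unchanged
```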


Page 35: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Conclusions

• Partially ordered lists should come back
• We proposed an alternative methodology
  ▫ Asked for three-point preference judgments
  ▫ Used Amazon Mechanical Turk
     Crowdsourcing can be used for music-related tasks
     Provided empirical evidence supporting the reasonable person hypothesis
• What for?
  ▫ More affordable and large-scale evaluations


Page 36: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Conclusions (and II)

• We need fewer assessors
  ▫ More queries with the same man-power
• Preferences are easier and faster to judge
• Fewer judgments are required
  ▫ Sorting algorithm
• Avoid inconsistencies (A=B option)
• Using audio instead of images gets rid of experts
• From 70 expert hours to 35 hours for $70


Page 37: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Future Work

• Choice of pivots in the sorting algorithm
  ▫ e.g. the query itself would not provide information
• Study the collections for Audio Tasks
  ▫ They have more data
     Inaccessible
  ▫ But no partially ordered list (yet)
• Use our methodology with one real expert judging preferences for the same query
• Try crowdsourcing too with one single worker


Page 38: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Future Work (and II)

• Experimental study on the characteristics of music similarity perception by humans
  ▫ Is it transitive?
     We assumed it is
  ▫ Is it symmetrical?
• If these properties do not hold, we have problems
• If they do, we can start thinking about Minimal and Incremental Test Collections [Carterette et al., 2005]


Page 39: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks


And That’s It!

Picture by 姒儿喵喵