Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks



Page 1: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Julián Urbano, Jorge Morato, Mónica Marrero and Diego Martín
[email protected]

SIGIR CSE 2010 · Geneva, Switzerland · July 23rd

Page 2: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks


Outline

• Introduction
• Motivation
• Alternative Methodology
• Crowdsourcing Preferences
• Results
• Conclusions and Future Work

Page 3: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks


Evaluation Experiments

• Essential for Information Retrieval [Voorhees, 2002]
• Traditionally followed the Cranfield paradigm
  ▫ Relevance judgments are the most important part of test collections (and the most expensive)
• In the music domain, evaluation was not taken very seriously until recently
  ▫ MIREX appeared in 2005 [Downie et al., 2010]
  ▫ Additional problems with the construction and maintenance of test collections [Downie, 2004]

Page 4: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks


Music Similarity Tasks

• Given a music piece (i.e. the query), return a ranked list of other pieces similar to it
  ▫ Actual music contents, forget the metadata!
• It comes in two flavors
  ▫ Symbolic Melodic Similarity (SMS)
  ▫ Audio Music Similarity (AMS)
• It is inherently more complex to evaluate
  ▫ Relevance judgments are very problematic

Page 5: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Relevance (Similarity) Judgments

• Relevance is usually considered on a fixed scale
  ▫ Relevant, not relevant, very relevant…
• For music similarity tasks relevance is rather continuous [Selfridge-Field, 1998][Typke et al., 2005][Jones et al., 2007]
  ▫ Single melodic changes are not perceived to change the overall melody
     Move a note up or down in pitch, shorten it, etc.
  ▫ But the similarity is weaker as more changes apply
• Where is the line between relevance levels?


Page 6: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Partially Ordered Lists

• The relevance of a document is implied by its position in a partially ordered list [Typke et al., 2005]
  ▫ Does not need any prefixed relevance scale
• Ordered groups of equally relevant documents
  ▫ Have to keep the order of the groups
  ▫ Allow permutations within the same group
• Assessors only need to be sure that any pair of documents is ordered properly


Page 7: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Partially Ordered Lists (II)


Page 8: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Partially Ordered Lists (and III)

• Used in the first edition of MIREX in 2005 [Downie et al., 2005]
• Widely accepted by the MIR community to report new developments [Urbano et al., 2010a][Pinto et al., 2008][Hanna et al., 2007][Gratchen et al., 2006]
• MIREX was forced to move to traditional level-based relevance from 2006 onwards
  ▫ Partially ordered lists are expensive
  ▫ And have some inconsistencies


Page 9: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Expensiveness

• The ground truth for just 11 queries took 35 music experts for 2 hours [Typke et al., 2005]
  ▫ Only 11 of them had time to work on all 11 queries
  ▫ This exceeds MIREX’s resources for a single task
• MIREX had to move to level-based relevance
  ▫ BROAD: Not Similar, Somewhat Similar, Very Similar
  ▫ FINE: numerical, from 0 to 10 with one decimal digit
• Problems with assessor consistency came up


Page 10: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Issues with Assessor Consistency

• The line between levels is certainly unclear [Jones et al., 2007][Downie et al., 2010]


Page 11: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Original Methodology

• Go back to partially ordered lists
  ▫ Filter the collection
  ▫ Have the experts rank the candidates
  ▫ Arrange the candidates by rank
  ▫ Aggregate candidates whose ranks are not significantly different (Mann-Whitney U; see the sketch below)
• There are known odd results and inconsistencies [Typke et al., 2005][Hanna et al., 2007][Urbano et al., 2010b]
  ▫ Disregard changes that do not alter the actual perception, such as clef or key and time signature
  ▫ Something like changing the language of a text and using synonyms [Urbano et al., 2010a]
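
For illustration, here is a minimal sketch of that aggregation step, assuming each candidate's expert ranks are available in a dict; the comparison against only the previous candidate and the significance level are simplifications of mine, not details taken from [Typke et al., 2005].

```python
# Minimal sketch (assumed details, not the original scripts): sort candidates by
# median expert rank and merge adjacent candidates whose rank samples are not
# significantly different (Mann-Whitney U), yielding a partially ordered list.
from statistics import median
from scipy.stats import mannwhitneyu

def aggregate_by_rank(ranks, alpha=0.05):
    """ranks: dict mapping each candidate to the list of ranks it received."""
    docs = sorted(ranks, key=lambda d: median(ranks[d]))  # best (lowest) median first
    groups = [[docs[0]]]
    for doc in docs[1:]:
        prev = groups[-1][-1]  # compare with the previous candidate only (simplification)
        _, p = mannwhitneyu(ranks[prev], ranks[doc], alternative='two-sided')
        if p < alpha:
            groups.append([doc])      # significantly different ranks: new group
        else:
            groups[-1].append(doc)    # not significantly different: same group
    return groups
```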


Page 12: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Inconsistencies due to Ranking


Page 13: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Alternative Methodology

• Minimize inconsistencies [Urbano et al., 2010b]
• Cheapen the whole process
• Reasonable Person hypothesis [Downie, 2004]
  ▫ With crowdsourcing (finally)
• Use Amazon Mechanical Turk
  ▫ Get rid of experts [Alonso et al., 2008][Alonso et al., 2009]
  ▫ Work with “reasonable turkers”
  ▫ Explore other domains to apply crowdsourcing


Page 14: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Equally Relevant Documents

• Experts were forced to give totally ordered lists
• One would expect ranks to randomly average out
  ▫ Half the experts prefer one document
  ▫ Half the experts prefer the other one
• That is hardly the case
  ▫ Do not expect similar ranks if the experts cannot give similar ranks in the first place


Page 15: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Give Audio instead of Images

• Experts may be guided by the images, not the music
  ▫ Some irrelevant changes in the image can deceive them
• No music expertise should be needed
  ▫ Reasonable person turker hypothesis


Page 16: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Preference Judgments

• In their heads, experts actually make preference judgments
  ▫ Similar to a binary search
  ▫ Accelerates assessor fatigue as the list grows
• Already noted for level-based relevance
  ▫ Go back and re-judge [Downie et al., 2010][Jones et al., 2007]
  ▫ Overlapping between BROAD and FINE scores
• Change the relevance assessment question
  ▫ Which is more similar to Q: A or B? [Carterette et al., 2008]


Page 17: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Preference Judgments (II)

• Better than traditional level-based relevance
  ▫ Inter-assessor agreement
  ▫ Time to answer
• In our case, three-point preferences
  ▫ A < B (A is more similar)
  ▫ A = B (they are equally similar/dissimilar)
  ▫ A > B (B is more similar)


Page 18: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Preference Judgments (and III)

• Use a modified QuickSort algorithm to sort documents into a partially ordered list (see the sketch below)
  ▫ Do not need all O(n²) judgments, but O(n·log n)

(Figure: example of the sorting process; the legend distinguishes the current pivot on the segment from documents that have been pivots already.)
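
To make the sorting step concrete, here is a minimal sketch of a QuickSort-style three-way partition driven by the three-point preferences. The prefer(a, b, query) callback is a hypothetical stand-in for the crowdsourced judgment, and the naive pivot choice is a simplification, not necessarily the authors' exact implementation.

```python
# Minimal sketch (assumed, not the exact implementation used for MTurk): build a
# partially ordered list with a QuickSort-style three-way partition driven by
# three-point preference judgments. prefer(a, b, query) is a hypothetical
# callback returning '<' (a more similar), '=' (tie) or '>' (b more similar).

def partially_ordered_list(docs, query, prefer):
    """Return a list of groups of documents, most similar to the query first."""
    if not docs:
        return []
    pivot, rest = docs[0], docs[1:]      # naive pivot choice
    more, equal, less = [], [pivot], []
    for doc in rest:                     # one judgment per document and pivot
        judgment = prefer(doc, pivot, query)
        if judgment == '<':              # doc is more similar than the pivot
            more.append(doc)
        elif judgment == '=':            # doc ties with the pivot
            equal.append(doc)
        else:                            # the pivot is more similar
            less.append(doc)
    return (partially_ordered_list(more, query, prefer)
            + [equal]
            + partially_ordered_list(less, query, prefer))
```

On average this needs O(n·log n) pairwise judgments instead of all O(n²) pairs, which is what makes the crowdsourced batches affordable.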


Page 19: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

How Many Assessors?

• Ranks are given to each document in a pair
  ▫ +1 if it is preferred over the other one
  ▫ -1 if the other one is preferred
  ▫ 0 if they were judged equally similar/dissimilar
• Test for signed differences in the samples (see the sketch below)
• In the original lists 35 experts were used
  ▫ Ranks of a document ranged from 1 to more than 20
• Our rank sample is less (and equally) variable
  ▫ rank(A) = -rank(B) ⇒ var(A) = var(B)
  ▫ Effect size is larger, so statistical power increases
  ▫ Fewer assessors are needed overall
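
The slide does not name the exact test; a plain sign test over the +1/0/-1 scores is one simple instantiation (a Wilcoxon signed-rank test would be another natural choice). The sketch below, including the function name and threshold, is my own illustration rather than the authors' procedure.

```python
# Minimal sketch (assumed): decide whether one document of a pair is
# significantly preferred over the other, using a sign test on the
# +1 / 0 / -1 scores collected from the workers.
from scipy.stats import binomtest

def significantly_ordered(scores, alpha=0.05):
    """scores: list of +1 (A preferred), -1 (B preferred) or 0 (tie)."""
    wins_a = sum(1 for s in scores if s > 0)
    wins_b = sum(1 for s in scores if s < 0)
    n = wins_a + wins_b               # ties carry no sign information here
    if n == 0:
        return False                  # only ties: keep A and B in the same group
    p = binomtest(wins_a, n, 0.5).pvalue
    return p < alpha                  # True: A and B go to different groups

# e.g. significantly_ordered([+1, +1, +1, 0, +1, -1, +1, +1, +1, +1]) -> True
```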


Page 20: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Crowdsourcing Preferences

• Crowdsourcing seems very appropriate
  ▫ Reasonable person hypothesis
  ▫ Audio instead of images
  ▫ Preference judgments
  ▫ QuickSort for partially ordered lists
• The task can be split into very small assignments
• It should be much cheaper and more consistent
  ▫ Do not need experts
  ▫ Do not deceive workers, which increases consistency
  ▫ Easier and faster to judge
  ▫ Need fewer judgments and judges


Page 21: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

New Domain of Application

• Crowdsourcing has been used mainly to evaluate text documents in English
• How about other languages?
  ▫ Spanish [Alonso et al., 2010]
• How about multimedia?
  ▫ Image tagging? [Nowak et al., 2010]
  ▫ Music similarity?


Page 22: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Data

• MIREX 2005 Evaluation collection
  ▫ ~550 musical incipits in MIDI format
  ▫ 11 queries also in MIDI format
  ▫ 4 to 23 candidates per query
• Convert to MP3, as it is easier to play in browsers
• Trim the leading and trailing silence
  ▫ From 1 to 57 secs. (mean 6) down to 1 to 26 secs. (mean 4)
  ▫ 4 to 24 secs. (mean 13) to listen to all 3 incipits
• Uploaded all MP3 files and a Flash player to a private server to stream data on the fly


Page 23: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

HIT Design

(Screenshot of the HIT interface; the reward per assignment was $0.02 — “2 yummy cents of dollar”.)


Page 24: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Threats to Validity

• Basically had to randomize everything
  ▫ Initial order of candidates in the first segment
  ▫ Alternate between queries
  ▫ Alternate between pivots of the same query
  ▫ Alternate pivots as variations A and B
• Let the workers know about this randomization
• In the first trials some documents were judged more similar to the query than the query itself!
  ▫ Require at least a 95% acceptance rate
  ▫ Ask for 10 different workers per HIT [Alonso et al., 2009]
  ▫ Beware of bots (always judged equal, in 8 secs.)


Page 25: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Summary of Submissions

• The 11 lists account for 119 candidates to judge
• Sent 8 batches (QuickSort iterations) to MTurk
• Had to judge 281 pairs (38%) = 2,810 judgments
• 79 unique workers over about a day and a half
• A total cost (excluding trials) of $70.25 (see the check below)
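
A quick back-of-the-envelope check of these totals; the $0.02 reward per assignment comes from the HIT design slide, while the $0.005 Amazon fee per assignment is an assumption of mine that happens to reproduce the reported cost.

```python
# Sanity check of the reported totals (the per-assignment Amazon fee is assumed).
pairs, workers_per_hit = 281, 10
judgments = pairs * workers_per_hit          # 281 pairs x 10 workers = 2810 judgments
cost = judgments * (0.02 + 0.005)            # reward + assumed platform fee
print(judgments, round(cost, 2))             # 2810 70.25
```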


Page 26: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Feedback and Music Background

• 23 of the 79 workers gave us feedback
  ▫ 4 very positive comments: very relaxing music
  ▫ 1 greedy worker: give me more money
  ▫ 2 technical problems loading the audio in 2 HITs
     Not reported by any of the other 9 workers
  ▫ 5 reported no music background
  ▫ 6 had formal music education
  ▫ 9 professional practitioners for several years
  ▫ 9 play an instrument, mainly piano
  ▫ 6 performers in choir


Page 27: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Agreement between Workers

• Forget about Fleiss' Kappa
  ▫ Does not account for the size of the disagreement
  ▫ A<B and A=B is not as bad as A<B and B<A
• Look at all 45 pairs of judgments per document pair (see the sketch below)
  ▫ +2 if total agreement (e.g. A<B and A<B)
  ▫ +1 if partial agreement (e.g. A<B and A=B)
  ▫ 0 if no agreement (i.e. A<B and B<A)
  ▫ Divide by 90 (all pairs with total agreement)
• Average agreement score per pair was 0.664
  ▫ From 0.506 (iteration 8) to 0.822 (iteration 2)
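
A minimal sketch of this agreement score, assuming the ten judgments for a document pair are encoded as '<', '=' and '>' (the encoding and function name are mine):

```python
# Minimal sketch (assumed encoding): pairwise agreement score of the judgments
# collected for one document pair; 1.0 means all workers gave the same preference.
from itertools import combinations

def agreement_score(judgments):
    """judgments: e.g. ['<', '<', '=', '>', ...]; returns a score in [0, 1]."""
    pairs = list(combinations(judgments, 2))  # 45 pairs for 10 judgments
    score = 0
    for a, b in pairs:
        if a == b:
            score += 2            # total agreement (e.g. A<B and A<B)
        elif '=' in (a, b):
            score += 1            # partial agreement (e.g. A<B and A=B)
        # opposite preferences (A<B and B<A) add nothing
    return score / (2 * len(pairs))           # divide by 90 for 10 judgments

# e.g. agreement_score(['<'] * 10) == 1.0
```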


Page 28: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Agreement Workers-Experts

• Those 10 judgments were actually aggregated
• Agreement with the expert judgments (table; percentages per row total):
  ▫ 155 (55%) total agreement
  ▫ 102 (36%) partial agreement
  ▫ 23 (8%) no agreement
• Total agreement score = 0.735
• Supports the reasonable person hypothesis


Page 29: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Agreement Single Worker-Experts


Page 30: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Agreement (Summary)


• Very similar judgments overall
  ▫ The reasonable person hypothesis still holds
  ▫ Crowdsourcing seems a doable alternative
  ▫ No music expertise seems necessary
• We could use just one assessor per pair
  ▫ If we could keep him/her throughout the query

Page 31: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Ground Truth Similarity

• Do high agreement scores translate into highly similar ground truth lists?
• Consider the original lists (All-2) as ground truth
• And the crowdsourced lists as a system’s result
  ▫ Compute the Average Dynamic Recall [Typke et al., 2006]
  ▫ And then the other way around
• Also compare with the (more consistent) original lists aggregated in Any-1 form [Urbano et al., 2010b]


Page 32: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Ground Truth Similarity (II)

• The result depends on the initial ordering
  ▫ Ground truth = (A, B, C), (D, E)
  ▫ Results1 = (A, B), (D, E, C) → ADR score = 0.933
  ▫ Results2 = (A, B), (C, D, E) → ADR score = 1
• Results1 is identical to Results2 (as partially ordered lists)
• Generate 1000 (identical) versions by randomly permuting the documents within a group (see the ADR sketch below)
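
To make the dependence on within-group order concrete, here is a minimal sketch of Average Dynamic Recall as I read [Typke et al., 2006] (the function name and data layout are mine); it reproduces the two scores above.

```python
# Minimal sketch (my reading of Typke et al., 2006): Average Dynamic Recall of a
# ranked result list against a ground truth given as ordered groups of documents.

def average_dynamic_recall(groups, ranking):
    """groups: e.g. [['A','B','C'], ['D','E']]; ranking: e.g. ['A','B','D','E','C']."""
    n = sum(len(g) for g in groups)          # only the first n results are evaluated
    recalls = []
    for i in range(1, n + 1):
        allowed, covered = set(), 0
        for g in groups:                     # groups fully or partially reached at depth i
            allowed.update(g)
            covered += len(g)
            if covered >= i:
                break
        hits = sum(1 for doc in ranking[:i] if doc in allowed)
        recalls.append(hits / i)
    return sum(recalls) / n

print(average_dynamic_recall([['A', 'B', 'C'], ['D', 'E']], list('ABDEC')))  # ~0.933
print(average_dynamic_recall([['A', 'B', 'C'], ['D', 'E']], list('ABCDE')))  # 1.0
```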


Page 33: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Ground Truth Similarity (and III)

(Table of ADR scores; min. and max. between square brackets)

• Very similar to the original All-2 lists
• Like the Any-1 version, also more restrictive
• More consistent (workers were not deceived)


Page 34: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

MIREX 2005 Revisited

• Would the evaluation have been affected?
  ▫ Re-evaluated the 7 systems that participated
  ▫ Included our Splines system [Urbano et al., 2010a]
• All systems perform significantly worse
  ▫ ADR scores drop by 9-15%
• But their ranking is just the same
  ▫ Kendall’s τ = 1 (see the sketch below)
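
The ranking comparison can be reproduced with Kendall’s τ over the per-system ADR scores; the numbers below are hypothetical placeholders, not the actual MIREX 2005 results.

```python
# Minimal sketch with hypothetical ADR scores (not the real MIREX 2005 numbers):
# tau = 1 means both ground truths rank the systems identically, even though all
# absolute scores dropped.
from scipy.stats import kendalltau

adr_original = {'sysA': 0.80, 'sysB': 0.72, 'sysC': 0.65}   # hypothetical
adr_crowd    = {'sysA': 0.70, 'sysB': 0.64, 'sysC': 0.57}   # hypothetical (9-15% lower)

systems = sorted(adr_original)
tau, _ = kendalltau([adr_original[s] for s in systems],
                    [adr_crowd[s] for s in systems])
print(tau)   # 1.0: the system ranking is unchanged
```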


Page 35: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Conclusions

• Partially ordered lists should come back
• We proposed an alternative methodology
  ▫ Asked for three-point preference judgments
  ▫ Used Amazon Mechanical Turk
     Crowdsourcing can be used for music-related tasks
     Provided empirical evidence supporting the reasonable person hypothesis
• What for?
  ▫ More affordable and large-scale evaluations


Page 36: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Conclusions (and II)

• We need fewer assessors
  ▫ More queries with the same man-power
• Preferences are easier and faster to judge
• Fewer judgments are required
  ▫ Sorting algorithm
• Avoid inconsistencies (A=B option)
• Using audio instead of images gets rid of experts
• From 70 expert hours to 35 hours for $70


Page 37: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Future Work

• Choice of pivots in the sorting algorithm
  ▫ e.g. the query itself would not provide information
• Study the collections for Audio Tasks
  ▫ They have more data
     Inaccessible
  ▫ But no partially ordered list (yet)
• Use our methodology with one real expert judging preferences for the same query
• Try crowdsourcing too with one single worker


Page 38: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Future Work (and II)

• Experimental study on the characteristics of music similarity perception by humans
  ▫ Is it transitive?
     We assumed it is
  ▫ Is it symmetrical?
• If these properties do not hold, we have problems
• If they do, we can start thinking about Minimal and Incremental Test Collections [Carterette et al., 2005]


Page 39: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks


And That’s It!

Picture by 姒儿喵喵