Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

Julián Urbano (@julian_urbano) · University Carlos III of Madrid
ISMIR 2011 · Miami, USA · October 26th


DESCRIPTION

The Music Information Retrieval field has acknowledged the need for rigorous scientific evaluations for some time now. Several efforts have set out to develop and provide the necessary infrastructure, technology and methodologies to carry out these evaluations, out of which the annual Music Information Retrieval Evaluation eXchange (MIREX) emerged. The community as a whole has gained enormously from this evaluation forum, but very little attention has been paid to reliability and correctness issues. From the standpoint of experimental validity, this paper surveys past meta-evaluation work in Text Information Retrieval, arguing that the music community still needs to address various issues concerning the evaluation of music systems and the IR research and development cycle, and pointing out directions for further research and proposals along these lines.

TRANSCRIPT

Page 1: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

Information Retrieval Meta-Evaluation:

Challenges and Opportunities

in the Music Domain

Julián Urbano @julian_urbano

University Carlos III of Madrid

ISMIR 2011 · Miami, USA · October 26th

Picture by Daniel Ray

Page 2: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

Picture by Bill Mill

Page 3: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

current evaluation practices

hinder the proper

development of Music IR

Page 4: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

we lack

meta-evaluation studies

we can’t complete the IR

research & development cycle

Page 5: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

how did we get here?

Picture by NASA History Office

Page 6: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

1960 → 2011

Cranfield 2 (1962-1966)

MEDLARS (1966-1967)

SMART (1961-1995)

TREC (1992-today)

CLEF (2000-today)

NTCIR (1999-today)

the basis · users · collections · large-scale · multi-language & multi-modal

Page 7: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

1960 → 2011

Cranfield 2 (1962-1966)

MEDLARS (1966-1967)

SMART (1961-1995)

TREC (1992-today)

CLEF (2000-today)

NTCIR (1999-today)

ISMIR (2000-today)

ISMIR 2001 resolution on the need to create standardized MIR test collections, tasks, and evaluation metrics for MIR research and development

3 workshops (2002-2003): The MIR/MDL Evaluation Project

Page 8: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

1960 → 2011

ISMIR (2000-today)

MIREX (2005-today)

ISMIR 2001 resolution on the need to create standardized MIR test collections, tasks, and evaluation metrics for MIR research and development

3 workshops (2002-2003): The MIR/MDL Evaluation Project

follow the steps of the Text IR folks, but carefully: not everything applies to music

>1200 runs!

Cranfield 2 (1962-1966)

MEDLARS (1966-1967)

SMART (1961-1995)

TREC (1992-today)

CLEF (2000-today)

NTCIR (1999-today)

Page 9: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

1960 → 2011

ISMIR (2000-today)

MIREX (2005-today)

are we done already?

positive impact on MIR

TREC (1992-today)

CLEF (2000-today)

NTCIR (1999-today)

Cranfield 2 (1962-1966)

MEDLARS (1966-1967)

SMART (1961-1995)

Evaluation is not easy: nearly 2 decades of Meta-Evaluation in Text IR

Page 10: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

1960 → 2011

ISMIR (2000-today)

MIREX (2005-today)

are we done already?

positive impact on MIR

TREC (1992-today)

CLEF (2000-today)

NTCIR (1999-today)

Cranfield 2 (1962-1966)

MEDLARS (1966-1967)

SMART (1961-1995)

Evaluation is not easy: nearly 2 decades of Meta-Evaluation in Text IR

a lot of things have happened here!

“not everything applies” … but much of it does!

some good practices inherited from here

Page 11: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

we still have

a very long

way to go

Page 12: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

evaluation

Picture by Official U.S. Navy Imagery

Page 13: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

Cranfield Paradigm

Task

User Model

Page 14: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

Experimental Validity

how well an experiment meets the well-grounded

requirements of the scientific method

do the results fairly and actually assess

what was intended?

Meta-Evaluation

analyze the validity of IR Evaluation experiments

Page 15: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

Table: types of experimental validity (Construct, Content, Convergent, Criterion, Internal, External, Conclusion) mapped against the components of an IR evaluation experiment (Task, User model, Documents, Queries, Ground truth, Systems, Measures); an x marks the components each type of validity concerns.

Page 16: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

experimental failures

Page 17: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

construct validity

what?

do the variables of the experiment correspond

to the theoretical meaning of the concept

they purport to measure?

how?

thorough selection and justification

of the variables used

#fail

measure quality of a Web search engine

by the number of visits

Page 18: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

construct validity in IR

effectiveness measures and their user model [Carterette, SIGIR2011]

set-based measures do not resemble real users [Sanderson et al., SIGIR2010]

rank-based measures are better [Järvelin et al., TOIS2002]

graded relevance is better [Voorhees, SIGIR2001] [Kekäläinen, IP&M2005]

other forms of ground truth are better [Bennett et al., SIGIRForum2008]
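
To make the set-based vs. rank-based distinction concrete, here is a minimal Python sketch with invented relevance grades (not MIREX data): Precision@k treats the top-k results as a set, while an nDCG-style measure rewards ranking the highly relevant items first, in line with a user model that scans from the top.

import math

def precision_at_k(relevance, k):
    # set-based: fraction of the top-k results that are relevant at all
    return sum(1 for rel in relevance[:k] if rel > 0) / k

def ndcg_at_k(relevance, k):
    # rank-based, graded: discounted cumulative gain normalized by the ideal ranking
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance[:k]))
    ideal = sorted(relevance, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# two hypothetical systems return the same 5 documents (grades 0-2) in opposite orders
system_a = [2, 2, 1, 0, 0]
system_b = [0, 0, 1, 2, 2]
for name, run in (("A", system_a), ("B", system_b)):
    print(name, precision_at_k(run, 5), round(ndcg_at_k(run, 5), 3))
# Precision@5 is 0.6 for both, but nDCG@5 is 1.0 vs 0.568: the set-based measure
# cannot tell the two systems apart, the rank-based graded one can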

Page 19: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

content validity

what?

do the experimental units reflect and represent

the elements of the domain under study?

how?

careful selection of the experimental units

#fail

measure reading comprehension

only with sci-fi books

Page 20: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

content validity in IR

tasks closely resembling real-world settings

systems completely fulfilling real-user needs

heavy user component, difficult to control

evaluate the system component instead [Cleverdon, SIGIR2001] [Voorhees, CLEF2002]

actual value of systems is really unknown [Marchionini, CACM2006]

sometimes they just do not work with real users [Turpin et al., SIGIR2001]

Page 21: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

content validity in IR

documents resembling real-world settings

large and representative samples

careful selection of queries, diverse but reasonable [Voorhees, CLEF2002] [Carterette et al., ECIR2009]

some queries are better at differentiating bad systems [Guiver et al., TOIS2009] [Robertson, ECIR2011]

random selection is not good, especially for Machine Learning

Page 22: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

convergent validity

what?

do the results agree with other results, theoretical or

experimental, to which they should be related?

how?

careful examination and confirmation

of the relationship between the results

and others supposedly related

#fail

measures of math skills not correlated

with abstract thinking

Page 23: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

convergent validity in IR

ground truth data is subjective

differences across groups and over time

different results depending on who evaluates

absolute numbers change

relative differences hold for the most part [Voorhees, IP&M2000]

for large-scale evaluations or varying experience

of assessors, differences do exist [Carterette et al., 2010]
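
A toy illustration of that finding, with made-up MAP scores for five systems under two hypothetical assessor groups: the absolute numbers change considerably, but the system ranking barely does (Kendall's tau via scipy is an assumed dependency here).

from scipy.stats import kendalltau

systems = ["s1", "s2", "s3", "s4", "s5"]
map_group_1 = [0.31, 0.28, 0.24, 0.22, 0.15]   # invented MAP with group 1 judgments
map_group_2 = [0.42, 0.40, 0.33, 0.34, 0.21]   # invented MAP with group 2 judgments

# rankings induced by each set of judgments
rank_1 = [s for _, s in sorted(zip(map_group_1, systems), reverse=True)]
rank_2 = [s for _, s in sorted(zip(map_group_2, systems), reverse=True)]
print(rank_1)   # ['s1', 's2', 's3', 's4', 's5']
print(rank_2)   # ['s1', 's2', 's4', 's3', 's5']  (one swap)

tau, _ = kendalltau(map_group_1, map_group_2)
print(round(tau, 2))   # 0.8: high agreement despite very different absolute scores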

Page 24: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

convergent validity in IR

measures are precision- or recall-oriented

they should therefore be correlated with each other

but they actually are not [Kekäläinen, IP&M2005] [Sakai, IP&M2007]

better correlated with others than with themselves! [Webber et al., SIGIR2008]

correlation with user satisfaction in the task [Sanderson et al., SIGIR2010]

ranks, unconventional judgments, discounted gain… [Bennett et al., SIGIRForum2008] [Järvelin et al., TOIS2002]

reliability?

Page 25: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

criterion validity

what?

are the results correlated with those of

other experiments already known to be valid?

how?

careful examination and confirmation of the

correlation between our results and previous ones

#fail

ask if the new drink is good

instead of better than the old one

Page 26: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

criterion validity in IR

practical large-scale methodologies: pooling [Buckley et al., SIGIR2004]

judgments by non-experts [Bailey et al., SIGIR2008]

crowdsourcing for low cost [Alonso et al., SIGIR2009] [Carvalho et al., SIGIRForum2010]

estimate measures with fewer judgments [Yilmaz et al., CIKM2006] [Yilmaz et al., SIGIR2008]

select which documents to judge, by informativeness [Carterette et al., SIGIR2006] [Carterette et al., SIGIR2007]

use no relevance judgments at all [Soboroff et al., SIGIR2001]

less effort, but same results?
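
For reference, a minimal sketch of depth-k pooling, the first of these practical methodologies (run contents and depth below are invented): only the union of each run's top-k documents is judged, and everything outside the pool is treated as non-relevant.

def pool(runs, depth):
    # runs: {system name: ranked list of doc ids, best first}
    judged = set()
    for ranking in runs.values():
        judged.update(ranking[:depth])   # union of every run's top-k
    return judged

runs = {
    "system_a": ["d3", "d1", "d7", "d9", "d2"],
    "system_b": ["d1", "d4", "d3", "d8", "d5"],
}
print(sorted(pool(runs, depth=3)))   # ['d1', 'd3', 'd4', 'd7'] -> only these get judged
# documents outside the pool are assumed non-relevant, which is exactly the
# incompleteness problem raised below under internal and external validity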

Page 27: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

internal validity

what?

can the conclusions be rigorously drawn

from the experiment alone

and not other overlooked factors?

how?

careful identification and control of possible

confounding variables and selection of the design

#fail

measure usefulness of Windows vs Linux vs iOS

only with Apple employees

Page 28: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

internal validity in IR

inconsistency: performance depends on assessors [Voorhees, IP&M2000] [Carterette et al., SIGIR2010]

incompleteness: performance depends on pools

system reinforcement [Zobel, SIGIR2008]

affects reliability of measures and overall results [Sakai, JIR2008] [Buckley et al., SIGIR2007]

train-test: same characteristics in queries and docs

improvements on the same collections: overfitting [Voorhees, CLEF2002]

measures must be fair to all systems

especially for Machine Learning
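
A minimal sketch of one way to keep tuning and testing separate at the query level, which addresses the train-test and overfitting points above (query ids and split ratio are arbitrary examples, not MIREX practice):

import random

def split_queries(query_ids, test_fraction=0.3, seed=42):
    # shuffle deterministically, then hold out a fraction of queries for testing only
    ids = sorted(query_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(len(ids) * (1 - test_fraction)))
    return ids[:cut], ids[cut:]

train_q, test_q = split_queries([f"q{i:02d}" for i in range(1, 21)])
print(len(train_q), len(test_q))   # 14 6
# parameters are tuned on train_q only; test_q is used once, for the final comparison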

Page 29: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

external validity

what?

can the results be generalized

to other populations and experimental settings?

how?

careful design and justification

of sampling and selection methods

#fail

study cancer treatment mostly with teenage males

Page 30: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

external validity in IR

weakest point of IR Evaluation [Voorhees, CLEF2002]

large-scale is always incomplete [Zobel, SIGIR2008] [Buckley et al., SIGIR2004]

test collections are themselves an evaluation result

but they become hardly reusable [Carterette et al., WSDM2010] [Carterette et al., SIGIR2010]

Page 31: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

external validity in IR

cross-collection comparisons are unjustified

performance highly depends on test collection characteristics [Bodoff et al., SIGIR2007] [Voorhees, CLEF2002]

systems perform differently with different collections

interpretation of results must be in terms of

pairwise comparisons, not absolute numbers [Voorhees, CLEF2002]

do not claim anything about the state of the art

based on a handful of experiments

baselines can be used to compare across collections [Armstrong et al., CIKM2009]: meaningful, not random!
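
A toy sketch of that idea: report the improvement over a shared, meaningful baseline rather than absolute scores when talking across collections (all numbers invented).

scores = {
    "collection_A": {"baseline": 0.20, "my_system": 0.26},
    "collection_B": {"baseline": 0.35, "my_system": 0.42},
}
for coll, s in scores.items():
    improvement = (s["my_system"] - s["baseline"]) / s["baseline"]
    print(coll, f"{improvement:+.0%}")
# +30% on collection_A vs +20% on collection_B is a comparable statement;
# the absolute scores 0.26 vs 0.42 across collections are not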

Page 32: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

conclusion validity

what?

are the conclusions justified based on the results?

how?

careful selection of the measuring instruments and

statistical methods used to draw grand conclusions

#fail

more access to the Internet in China than in the US

because of the larger total number of users

Page 33: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

conclusion validity in IR

measures should be sensitive and stable [Buckley et al., SIGIR2000]

and also powerful [Voorhees et al., SIGIR2002] [Sakai, IP&M2007]

with little effort [Sanderson et al., SIGIR2005]

always bearing in mind

the user model and the task

Page 34: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

conclusion validity in IR

statistical methods to compare score distributions [Smucker et al., CIKM2007] [Webber et al., CIKM2008]

correct interpretation of the statistics

hypothesis testing is troublesome

statistical significance ≠ practical significance

increasing #queries (sample size) increases power

to detect ever smaller differences (effect-size)

eventually, everything is statistically significant
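
A simulated example of that last point (synthetic per-query scores, not real evaluation results): a fixed, tiny difference of 0.005 in AP becomes "statistically significant" once the query set is large enough, even though the effect size never changes.

import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

def p_value(n_queries, true_diff=0.005, noise=0.05):
    # simulate per-query AP for a baseline and for a system that is better by true_diff
    base = rng.uniform(0.2, 0.8, n_queries)
    other = base + true_diff + rng.normal(0.0, noise, n_queries)
    return ttest_rel(other, base).pvalue

for n in (25, 100, 1000, 10000):
    print(n, round(p_value(n), 4))
# the p-value shrinks as queries are added, even though a 0.005 difference
# is arguably meaningless in practice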

Page 35: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

challenges

Picture by Brian Snelson

Page 36: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

Page 37: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

Page 38: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

Page 39: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

Page 40: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

Page 41: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

MIR evaluation practices

do not allow us to complete this cycle

Page 42: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

Page 43: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

loose definition of task intent and user model

realistic data

Page 44: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

Page 45: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

collections are too small and/or biased

lack of realistic, controlled public collections

private, undescribed and unanalyzed collections emerge

can't replicate results, often leading to wrong conclusions

standard formats and evaluation software to minimize bugs

Page 46: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

Page 47: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

undocumented measures, no accepted evaluation software

lack of baselines as lower bound (random is not a baseline!)

proper statistics, correct interpretation of statistics

Page 48: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

Page 49: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

raw musical material unknown

undocumented queries and/or documents

go back to private collections: overfitting!

Page 50: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

Page 51: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

IR Research & Development Cycle

collections can't be reused

blind improvements

go back to private collections: overfitting!

Page 52: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

Picture by Donna Grayson

Page 53: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

collections

large, heterogeneous and controlled

not a hard endeavour, except for the damn copyright

Million Song Dataset!

still problematic (new features? actual music?)

standardize collections across tasks

better understanding and use of improvements

Page 54: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

raw music data

essential for Learning and Improvement phases

use copyright-free data

Jamendo!

study possible biases

reconsider artificial material

Page 55: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

evaluation model

let teams run their own algorithms(needs public collections)

relief for IMIRSEL and promote wider participation

successfully used for 20 years in Text IR venues

adopted by MusiCLEF

only viable alternative in the long run

MIREX-DIY platforms still don’t allow full completion

of the IR Research & Development Cycle

Page 56: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

organization

IMIRSEL plans, schedules and runs everything

add a 2nd tier of organizers, task-specific

logistics, planning, evaluation, troubleshooting…

format of large forums like TREC and CLEF

smooth the process and develop tasks that really

push the limits of the state of the art

Page 57: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

overview papers

every year, by task organizers

detail the evaluation process, data, results

discussion to boost Interpretation and Learning

perfect wrap-up for team papers,

which rarely discuss results, and many are not even drafted

Page 58: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

specific methodologies

MIR has unique methodologies and measures

meta-evaluate: analyze and improve

human effects on the evaluation

user satisfaction

Page 59: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

standard evaluation software

bugs are inevitable

open evaluation software to everybody

gain reliability

speed up the development process

serve as documentation for newcomers

promote standardization of formats
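
As an illustration of what such shared software could look like, here is a small sketch that reads TREC-style run and qrels files (a hypothetical choice of standard format, used only as an example; MIREX defines its own formats) and computes average precision. Having one open implementation of this kind is what catches format mismatches and scoring bugs early.

from collections import defaultdict

def read_run(path):
    # run file lines: query_id Q0 doc_id rank score run_tag
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, _, doc, _, score, _ = line.split()
            run[qid].append((float(score), doc))
    # rank documents by descending score for each query
    return {q: [d for _, d in sorted(docs, reverse=True)] for q, docs in run.items()}

def read_qrels(path):
    # qrels file lines: query_id 0 doc_id relevance
    qrels = defaultdict(set)
    with open(path) as f:
        for line in f:
            qid, _, doc, rel = line.split()
            if int(rel) > 0:
                qrels[qid].add(doc)
    return qrels

def average_precision(ranking, relevant):
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# hypothetical file names; mean AP over all queries in the qrels
# run, qrels = read_run("system_a.run"), read_qrels("task.qrels")
# print(sum(average_precision(run.get(q, []), r) for q, r in qrels.items()) / len(qrels))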

Page 60: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

baselines

help measure the overall progress of the field

standard formats + standard software +

public controlled collections + raw music +

task-specific organization

measure the state of the art

Page 61: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

commitment

we need to acknowledge the current problems

MIREX should not only be a place to

evaluate and improve systems

but also a place to

meta-evaluate and improve how we evaluate

and a place to

design tasks that challenge researchers

analyze our evaluation methodologies

Page 62: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

we all need to start

questioning evaluation practices

Page 63: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

it’s worth it

Picture by Brian Snelson

Page 64: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

we all need to start

questioning evaluation practices

it's not that everything we do is wrong…

Page 65: Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

we all need to start

questioning evaluation practices

it's not that everything we do is wrong… it's that we don't know it!