Is peer review any good? A quantitative analysis of peer review


DESCRIPTION

This is a presentation of the paper in which we focus on the analysis of peer reviews and reviewers' behavior in conference review processes. We report on the development, definition, and rationale of a theoretical model for peer review processes that supports the identification of appropriate metrics to assess the processes' main properties. We then apply the proposed model and analysis framework to data sets of conference paper reviews. We discuss in detail the results, their implications, and their possible use toward improving the analyzed peer review processes.

TRANSCRIPT

Is peer review any good? A quantitative analysis of peer review

Fabio Casati, Maurizio Marchese, Azzurra Ragone, Matteo Turrini

University of Trento

http://eprints.biblio.unitn.it/archive/00001654/01/techRep045.pdf

Initial Goals

•  Understand how well peer review works

•  Understand how to improve the process

•  Metrics + analysis
   –  (refer to liquid doc)

•  Focus only on the gatekeeping aspect

"Not everything that can be counted counts, and not everything that counts can be counted." -- Albert Einstein

Metric Dimensions

[Diagram: the three metric dimensions (Quality, Fairness, Efficiency) and their associated metrics: Kendall distance, divergence, disagreement, biases, robustness, unbiasing, and effort vs. quality with effort-invariant alternatives]

Data Sets

•  Around 7000 reviews from various conferences in the CS field (more on the way)
   –  Large, medium, small
   –  Some with "young reviewers"

Is peer review effective? Does it work?

•  And what does it mean to be effective? HOW do we measure it?

•  It is easier to measure/detect "problems"

•  Peer review ranking vs. ideal ranking

Comparing rankings

[Figure: two rankings of the same paper IDs (28, 17, 2, 45, 67, 89, 33, ...) shown side by side in different orders, compared position by position]

Ideal ranking (?)

•  Success in a subsequent phase
•  Citations

Suggested reading: "Positional effects on citation and readership in arXiv," by Haque and Ginsparg.

Comparing rankings

•  Example: t = 3, N = 10

•  Metrics: divergence Div(t, N) and Kendall τ
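As a rough illustration, the sketch below shows how the two ranking-comparison metrics named on this slide could be computed in Python. The Div(t, N) definition used here (the share of one ranking's top t papers missing from the other's top t) is an assumed reading of the slide's notation, not necessarily the paper's exact formula, and the paper IDs are toy values echoing the earlier figure.

```python
from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    """Normalized Kendall tau distance: fraction of paper pairs that
    the two rankings order differently (0 = identical, 1 = reversed)."""
    pos_a = {p: i for i, p in enumerate(rank_a)}
    pos_b = {p: i for i, p in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    discordant = sum(1 for x, y in pairs
                     if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)
    return discordant / len(pairs)

def divergence(rank_a, rank_b, t):
    """Assumed reading of Div(t, N): share of rank_a's top-t papers
    that do not appear in rank_b's top t."""
    return len(set(rank_a[:t]) - set(rank_b[:t])) / t

# Toy data: N = 10 papers, t = 3, orders loosely echoing the figure.
peer  = [28, 17, 2, 45, 67, 89, 33, 5, 12, 71]
ideal = [33, 89, 2, 17, 67, 28, 45, 5, 12, 71]
print(kendall_tau_distance(peer, ideal))  # 16/45, about 0.36
print(divergence(peer, ideal, t=3))       # 2 of peer's top 3 missing: 0.67
```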

Results: peer review ranking vs. citation count

[Plot: divergence Div as a function of normalized t]

Randomness and reliability

•  Quality-related but independent of the criteria for the "ideal" ranking

•  Basic stats
•  Disagreement
•  Robustness
•  Biases

Quality-related Metrics: Statistics

[Histogram: distribution of integer marks (0 to 10) vs. probability (y-axis roughly 0 to 0.18)]
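As a minimal sketch of the basic statistic behind this histogram, the snippet below computes the empirical probability of each integer mark; the 0 to 10 scale and the sample marks are assumptions for illustration.

```python
from collections import Counter

def mark_distribution(marks, scale_max=10):
    """Empirical probability of each integer mark on a 0..scale_max scale."""
    counts = Counter(marks)
    return {m: counts.get(m, 0) / len(marks) for m in range(scale_max + 1)}

sample_marks = [6, 7, 7, 8, 5, 6, 9, 4, 7, 6, 8, 3]  # made-up marks
for mark, prob in mark_distribution(sample_marks).items():
    print(f"mark {mark:2d}: {prob:.2f}")
```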

Disagreement

•  Measures the difference between the marks given by the reviewers on the same contribution.

•  The rationale behind this metric is that in a review process we expect some kind of agreement between reviewers.
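One plausible way to turn this idea into a number is sketched below, under the assumption that disagreement is the mean pairwise absolute gap between a paper's marks, normalized by the width of the marking scale; the paper's exact normalization, and the "reshuffled" baseline in the next table, may be defined differently.

```python
from itertools import combinations

def paper_disagreement(marks, scale_max=10):
    """Mean pairwise absolute difference of one paper's marks,
    normalized by the width of the marking scale."""
    pairs = list(combinations(marks, 2))
    return sum(abs(a - b) for a, b in pairs) / (len(pairs) * scale_max)

def process_disagreement(reviews_by_paper):
    """Average per-paper disagreement over papers with >= 2 reviews."""
    vals = [paper_disagreement(m) for m in reviews_by_paper.values() if len(m) > 1]
    return sum(vals) / len(vals)

reviews = {"p1": [6, 7, 5], "p2": [2, 9, 8], "p3": [4, 4]}  # toy marks
print(process_disagreement(reviews))  # about 0.20
```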

Normalized Disagreement (after discussion)

              C1      C2                     C3
Computed      0.27    0.32 (high variance)   0.26 (high variance)
Reshuffled    0.34    0.40                   0.32

Robustness

•  Sensitivity to small variations in the marks
   –  Tries to assess the impact of small indecisions in giving the mark (e.g., 6 vs. 7, ...)

•  Measures divergence after applying an ε-variation to the marks

•  Results: reasonably robust, except for the conference managed by young researchers
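The ε-variation idea lends itself to a small simulation; the sketch below assumes robustness is estimated by perturbing every mark by a random amount in [-eps, +eps], re-ranking papers by average mark, and measuring the top-t divergence from the original ranking. The actual procedure in the paper may differ.

```python
import random

def rank_by_average(marks_by_paper):
    """Papers sorted by average mark, best first."""
    return sorted(marks_by_paper,
                  key=lambda p: -sum(marks_by_paper[p]) / len(marks_by_paper[p]))

def robustness(marks_by_paper, eps=1.0, t=10, trials=200):
    """Average top-t divergence after random +/-eps perturbation of each mark."""
    base_top = set(rank_by_average(marks_by_paper)[:t])
    total = 0.0
    for _ in range(trials):
        noisy = {p: [m + random.uniform(-eps, eps) for m in ms]
                 for p, ms in marks_by_paper.items()}
        total += len(base_top - set(rank_by_average(noisy)[:t])) / t
    return total / trials

random.seed(0)
toy = {f"p{i}": [random.randint(3, 9) for _ in range(3)] for i in range(30)}
print(robustness(toy, eps=1.0, t=10))  # closer to 0 = more robust
```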

Metric dimensions

[Diagram: the metric-dimension map again (Quality, Fairness, Efficiency), now also including statistics alongside Kendall distance, divergence, disagreement, biases, robustness, unbiasing, and effort]

Fairness

•  Definition: a review process is fair if and only if the acceptance of a contribution does not depend on the particular set of PC members that reviews it

•  The key is in the assignment of a paper to reviewers: a paper assignment is unfair if the specific assignment influences (makes more predictable) the fate of the paper.

Potential biases

•  Rating bias: reviewers are biased if they consistently give higher/lower marks than their colleagues who are reviewing the same paper

•  Affiliation bias
•  Topic bias
•  Country bias
•  Gender bias
•  …
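A rating bias along the lines of this definition could be estimated roughly as below: each reviewer's bias is their average gap from the mean mark of their co-reviewers on the same papers. The (reviewer, paper, mark) triple format is an assumption for illustration.

```python
from collections import defaultdict

def rating_biases(reviews):
    """reviews: iterable of (reviewer, paper, mark) triples.
    Returns each reviewer's average gap from co-reviewers' mean mark."""
    by_paper = defaultdict(list)
    for reviewer, paper, mark in reviews:
        by_paper[paper].append((reviewer, mark))
    gaps = defaultdict(list)
    for entries in by_paper.values():
        if len(entries) < 2:
            continue  # no co-reviewers to compare against
        for reviewer, mark in entries:
            others = [m for r, m in entries if r != reviewer]
            gaps[reviewer].append(mark - sum(others) / len(others))
    return {r: sum(g) / len(g) for r, g in gaps.items()}

toy = [("r1", "p1", 8), ("r2", "p1", 5), ("r1", "p2", 7), ("r3", "p2", 4)]
print(rating_biases(toy))  # r1 sits consistently above co-reviewers: +3.0
```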

Computed Normalized Rating Biases

                  C2       C3       C4
Top accepting     3.44     1.52     1.17
Top rejecting    -2.78    -2.06    -1.17
> +|min bias|     5%       9%       7%
< -|min bias|     4%       8%       7%

                                         C2     C3     C4
Unbiasing effect (divergence)            9%     11%    14%
Unbiasing effect (reviewers affected)    16     5      4
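A simple unbiasing step consistent with the table above might subtract each reviewer's estimated rating bias from their marks and re-rank; the "unbiasing effect (divergence)" row would then be the divergence between the original and corrected rankings. This sketch reuses the hypothetical rating_biases helper from the earlier snippet.

```python
def unbias(reviews, biases):
    """Subtract each reviewer's estimated bias from their marks.
    reviews: (reviewer, paper, mark) triples; biases: reviewer -> bias."""
    return [(r, p, m - biases.get(r, 0.0)) for r, p, m in reviews]
```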
