Rater Reliability How Good is Your Coding?

TRANSCRIPT

Page 1:

Rater Reliability: How Good is Your Coding?

Page 2:

Why Estimate Reliability?

Quality of your data

Number of coders or raters needed

Reviewers/Grant Applications

Page 3:

For What Variables Do You Need Reliability Estimates?

Any variables with judgments

Ratings of any kind

Recordings, even of numbers or counts

Basically, all of them

Page 4:

Data Collection (1)

1. One judge rates all targets: NA.

2. Two judges, each rates a (different) half of the targets; or more than two judges, each rating different targets: NA.

3. Two judges, each rates all targets; or three or more judges, all rating all targets: a crossed design.

4. Four judges, with a different pair rating each target (every target rated by two judges, but a different two for each target); or three or more judges, not all rating all targets: a nested design.

Page 5:

Data Collection (2)

IMHO, use a fully crossed design to estimate reliability (otherwise the estimate is hard to compute and you may have to hire help). A fully crossed design is good for the final data collection too, but it may not be feasible.

Use any design (crossed or nested) to collect real data.

Use the proper reliability estimate (fixed raters for a crossed design, random raters for a nested design, and the proper number of raters) for the design you actually used.

Page 6:

Estimation (1)

Use the data you collected to compute sums of squares for judge, target, and error. SAS GLM can do this for you.

Compute ICC(3,1) or ICC(2,1) depending on whether your raters will be treated as fixed (crossed design) or random (nested design).

Apply Spearman-Brown to estimate the reliability of your data.

Page 7:

Estimation (2)

If you collected fully crossed data (all judges saw all targets for the entire study), you can treat each rater as a column (item) and each target or study as a row (person), then compute Cronbach's alpha for those data as the rater reliability index. Alpha = ICC(3,k); a small sketch of this follows below.

Can’t do that if raters and targets are not crossed.

Page 8:

Illustration (1)

Three raters judge the rigor of 5 articles using a 1-to-5 scale.

Study  Jim  Joe  Sue
1      2    3    1
2      3    2    2
3      4    3    3
4      5    4    4
5      5    5    3

Page 9:

Illustration (2)

Computer input: one column for ratings, one for rater, one for target.

Analysis: GLM, predicting rating from rater, target, and rater-by-target (you can use SAS, SPSS, R, whatever).

Output: sums of squares and mean squares for each.

Source        Type III SS  Mean Square
Rater         3.73         1.87
Target        14.27        3.57
Rater*Target  2.93         .37
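For readers without SAS or SPSS at hand, here is a minimal Python sketch (not from the slides, assuming numpy is available) that reproduces these sums of squares and mean squares by hand for the balanced 5 x 3 table.

import numpy as np

# Rows = targets (studies 1-5), columns = raters (Jim, Joe, Sue).
x = np.array([[2, 3, 1],
              [3, 2, 2],
              [4, 3, 3],
              [5, 4, 4],
              [5, 5, 3]], dtype=float)
n, k = x.shape                  # n targets, k raters
grand = x.mean()                # grand mean

ss_rater  = n * ((x.mean(axis=0) - grand) ** 2).sum()   # judge (column) effect
ss_target = k * ((x.mean(axis=1) - grand) ** 2).sum()   # target (row) effect
ss_total  = ((x - grand) ** 2).sum()
ss_error  = ss_total - ss_rater - ss_target             # rater-by-target residual

jms = ss_rater  / (k - 1)              # judge mean square
bms = ss_target / (n - 1)              # between-targets mean square
ems = ss_error  / ((n - 1) * (k - 1))  # error mean square

print(round(ss_rater, 2),  round(jms, 2))   # 3.73 1.87
print(round(ss_target, 2), round(bms, 2))   # 14.27 3.57
print(round(ss_error, 2),  round(ems, 2))   # 2.93 0.37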

Page 10:

Illustration (3)

ICC(2,1) = one random rater

ICC(3,1) = one fixed rater

\[
\mathrm{ICC}(2,1) = \frac{BMS - EMS}{BMS + (k-1)EMS + k(JMS - EMS)/n}
                  = \frac{3.57 - .37}{3.57 + (3-1)(.37) + 3(1.87 - .37)/5} = .61
\]

\[
\mathrm{ICC}(3,1) = \frac{BMS - EMS}{BMS + (k-1)EMS}
                  = \frac{3.57 - .37}{3.57 + (3-1)(.37)} = .74
\]

where BMS is the target (between-targets) mean square, JMS the judge (rater) mean square, EMS the error mean square, k the number of judges, and n the number of targets.

Use mean squares to compute intraclass correlations.

See Shrout & Fleiss (1979) for additional ICCs.
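A small Python sketch (not from the slides) of these two formulas, plugging in the mean squares from the previous page:

def icc_2_1(bms, jms, ems, k, n):
    """ICC(2,1): one random rater, two-way random-effects model."""
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)

def icc_3_1(bms, ems, k):
    """ICC(3,1): one fixed rater, two-way mixed model."""
    return (bms - ems) / (bms + (k - 1) * ems)

bms, jms, ems, k, n = 3.57, 1.87, 0.37, 3, 5
print(round(icc_2_1(bms, jms, ems, k, n), 2))  # 0.61
print(round(icc_3_1(bms, ems, k), 2))          # 0.74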

Page 11:

Illustration (4)

Use Spearman-Brown to estimate the reliability of multiple raters and to estimate the number of raters needed for a desired level of reliability.

Reliability of k raters, given the single-rater reliability (ICC) \(\rho_{ii}\):

\[
\rho'_{CC} = \frac{k\,\rho_{ii}}{1 + (k-1)\rho_{ii}}
\]

Reliability of 2 raters:

Random: \(\rho'_{CC} = 2(.61) / (1 + .61) = .76\)

Fixed: \(\rho'_{CC} = 2(.74) / (1 + .74) = .85\)

Number of raters m needed for a desired reliability \(\rho^{*}\), given single-rater reliability \(\rho_{L}\):

\[
m = \frac{\rho^{*}(1 - \rho_{L})}{\rho_{L}(1 - \rho^{*})}
\]

Raters needed for rxx of .90:

Random: \(m = .90(1 - .61) / (.61(1 - .90)) \approx 5.75\), so about 6 raters

Fixed: \(m = .90(1 - .74) / (.74(1 - .90)) \approx 3.16\), so about 4 raters
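A minimal Python sketch (not from the slides) of the Spearman-Brown step-up and the raters-needed calculation shown above:

def stepped_up(r_single, k):
    """Reliability of the mean of k raters, given one rater's reliability."""
    return k * r_single / (1 + (k - 1) * r_single)

def raters_needed(r_single, r_target):
    """Number of raters needed to reach the target reliability."""
    return r_target * (1 - r_single) / (r_single * (1 - r_target))

print(round(stepped_up(0.61, 2), 2))        # 0.76  (two random raters)
print(round(stepped_up(0.74, 2), 2))        # 0.85  (two fixed raters)
print(round(raters_needed(0.61, 0.90), 2))  # 5.75  -> about 6 random raters
print(round(raters_needed(0.74, 0.90), 2))  # 3.16  -> about 4 fixed raters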

Page 12:

SPSS

Raters are columns, targets (studies) are rows

Analyze, Scale, Reliability Analysis

Drag all columns into Items

The default: Model Alpha will produce ICC(3,k)

In this case alpha = .897 (three judges; the same judges rate every target and we take the average).

Page 13:

SPSS (2)

To get 1 fixed judge: Analyze, Scale, Reliability Analysis, move all columns into Items, then click Statistics.

Check box Intraclass correlation coefficient

For 1 fixed judge, click 2-way mixed, ok, then run

In this case 1 fixed judge is .74.

For 1 random judge, click 1-way random.

In this case, 1 random judge is about .59 (SPSS's 1-way random model is ICC(1,1), which folds the judge effect into error, so it does not exactly match the .61 from the ICC(2,1) formula).

Page 14:

Categorical Agreement

If the same data were categorical, we could compute a percent agreement for each item and average over items. This does not take chance agreement into account, but it is easy to do.

We should use kappa in such cases.

Can use SPSS if 2 raters, but not if there are more.

You can use SAS (my program) if there are more than two raters; a toy two-rater example also follows at the end:

http://faculty.cas.usf.edu/mbrannick/software/kappa.htm
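For two raters, percent agreement and Cohen's kappa are easy to compute directly. Here is a toy Python sketch with made-up categorical codes; it is not the SAS program linked above (which handles more than two raters), just an illustration of the chance-correction idea.

from collections import Counter

# Hypothetical codes assigned by two raters to the same ten segments.
rater1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
rater2 = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]

n = len(rater1)
observed = sum(a == b for a, b in zip(rater1, rater2)) / n   # percent agreement

# Chance agreement from each rater's marginal proportions.
p1, p2 = Counter(rater1), Counter(rater2)
expected = sum((p1[c] / n) * (p2[c] / n) for c in set(rater1) | set(rater2))

kappa = (observed - expected) / (1 - expected)               # Cohen's kappa
print(round(observed, 2), round(kappa, 2))                   # 0.8 0.58

Note how the same 80 percent agreement drops to a kappa of about .58 once chance agreement is removed, which is exactly why kappa is preferred over raw percent agreement.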