
Page 1: Reliability

Reliability

As my grand pappy, Old Reliable,

used to say . . .

Who is this famous bloodhound? What was he noted for saying?


Page 2: Reliability

What CU?


Page 3: Reliability

Reliability Topics:

The Basic Notion of Reliability

Factors Affecting Reliability

Methods of Determining Reliability

Methods Used by Professional Test Makers

Method Suggested for Your Own Tests

Standard Error of Measurement

Confidence Bands


Page 4: Reliability

Basic Notions of Reliability

Reliability refers to the consistency of a test score or set of test scores, not to a property of the test itself.

Reliability questions ask: “Are the scores consistent?” “Are they stable?”

Reliability is a matter of degree; it is NOT all-or-none.

Reliability is not the same as validity – validity asks “Does a test measure what it is supposed to?” (reliability is a necessary, but not sufficient, condition for validity).

Reliability deals with unsystematic error in assessment. Systematic error (for example, “I test well because I am ‘test-wise’” or “I do not test well because English is not my first language”) will not be uncovered through tests of reliability.


Page 5: Reliability

Factors Affecting Reliability: Sources of Unreliability

Test Scoring –

differences between two scorers’ judgments

one scorer over time (fatigue) and/or halo effect

Test Content –

the sample of test items is too small

the sample of test items is not evenly selected across material

Test Administration –

noise, time limits not consistent, physical conditions

Personal Conditions –

temporary ups and downs

(chronic test anxiety would be a systematic error and thus undetectable through measures of reliability)

Note: None of these factors automatically results in unreliability, but as we build our assessments, we hope to reduce their impact. The extent to which these factors may be affecting test scores is an empirical question, and we can and will address it as we continue.


Page 6: Reliability

A Bit of Theory (True/Observed)

The “perfect test” would be unaffected by the sources of unreliability, and on this perfect test each examinee would get his or her true score. Unfortunately, we know the observed score we get was likely affected by one or more of the sources of unreliability.

So, our observed score is likely too high or too low. The difference between the observed score and the true score we call the error score; this score can be positive or negative.

We can express this mathematically as: True Score = Obtained Score +/- Error, or T = O +/- E (looking at it another way, O = T +/- E).

Theory Time: If we could re-administer a test to one person an infinite number of times, what would we expect the distribution of their scores to look like? Answer: the bell-shaped curve. We will return to this concept when we discuss the standard error of measurement.
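
The “infinite re-administrations” thought experiment is easy to simulate. Below is a minimal sketch (assuming Python with NumPy; the true score of 75 and error spread of 5 are invented values) showing that when unsystematic error is added to a fixed true score over many hypothetical administrations, the observed scores pile up in a roughly bell-shaped distribution around the true score.

```python
# A minimal sketch of the O = T +/- E idea: one examinee, many hypothetical
# administrations. The true score (75) and error spread (5) are invented
# for illustration only.
import numpy as np

rng = np.random.default_rng(seed=0)

true_score = 75                         # the score a "perfect test" would give
error = rng.normal(0, 5, size=10_000)   # unsystematic error, centered on zero
observed = true_score + error           # O = T + E for each administration

print(f"mean of observed scores:   {observed.mean():.2f}")  # close to the true score
print(f"spread of observed scores: {observed.std():.2f}")   # close to the error spread
```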


Page 7: Reliability

Determining Reliability by Using the Concept of Correlation

I can use my understanding of correlation (how two things are related) to come up with a mathematical calculation that will suggest the strength (or lack of strength) regarding one or more of the sources of unreliability that I have identified.

I will be calculating what is called the reliability coefficient (since it is a correlation coefficient measuring a type of reliability). This value will range from -1 to +1.

For example, let’s consider rater reliability. That is, do different scorers rate equally; or, another concern, does one scorer rate differently over time. We express that as either

Inter-rater: reliability among raters (international – many nations)

Intra-rater: same rater (intramural sports – within 1 school)

Note: some authors omit the hyphen after inter- and intra-.

Compute using Spearman Rank Correlation
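
As one possible way to carry out that computation, here is a short sketch using SciPy’s spearmanr on two raters’ scores for the same set of essays; the rater data are invented for illustration.

```python
# Inter-rater reliability sketch: Spearman rank correlation between two raters
# scoring the same ten essays. The scores below are made up for illustration.
from scipy.stats import spearmanr

rater_a = [88, 75, 92, 67, 81, 70, 95, 73, 84, 78]
rater_b = [85, 72, 90, 70, 83, 68, 96, 75, 80, 74]

rho, p_value = spearmanr(rater_a, rater_b)
print(f"Spearman rank correlation (inter-rater reliability): {rho:.2f}")
```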


Page 8: Reliability

Re-enter the Correlation Coefficient - the calculated number that best describes the relationship between two variables, but now we will call it the reliability coefficient

Reliability coefficient – symbol is “r” – linear relationships

Range: -1.00 through .00 to +1.00

Sign indicates direction:

+ indicates that as one variable increases, the other variable increases

- indicates that as one variable increases, the other variable decreases

Number indicates strength. Although the following table is somewhat arbitrary, this thinking might be useful in interpretation:

-1.0 to -0.7: strong converse association
-0.7 to -0.3: weak converse association
-0.3 to +0.3: little or no association
+0.3 to +0.7: weak direct association
+0.7 to +1.0: strong direct association
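
As an illustration only, a tiny helper could map a computed coefficient onto the (admittedly arbitrary) categories in the table above; the function name and the handling of values that fall exactly on a cut point are my own choices.

```python
# Map a correlation/reliability coefficient onto the rough interpretation
# categories from the table above. Boundary values are assigned to one
# neighboring category by convention here.
def describe_association(r: float) -> str:
    if r <= -0.7:
        return "strong converse association"
    elif r <= -0.3:
        return "weak converse association"
    elif r < 0.3:
        return "little or no association"
    elif r < 0.7:
        return "weak direct association"
    else:
        return "strong direct association"

print(describe_association(0.85))   # strong direct association
print(describe_association(-0.45))  # weak converse association
```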


Page 9: Reliability

Some History . . .

Karl Pearson (1857-1936)


Pearson was a Galton protégé and was appointed the first Galton Professor of Eugenics (1911) at University College London. He introduced a new science, "biometrics," which integrated statistics with evolutionary theory. He advocated social imperialism – the view that "superior" races and countries should produce more offspring than those considered to be less developed. In the United States, Indiana was the first state to pass a pioneering statute (1907) allowing state officials to sterilize those deemed unfit to breed. California enacted an even stricter eugenics law, making it legal for state officials to asexualize those considered feeble-minded, prisoners exhibiting sexual or moral perversions, and anyone with more than three criminal convictions.

Page 10: Reliability

More Reliability Approaches to Consider

Test-retest – (impractical for you; important in standardized tests)

Alternate Forms (again, impractical for you but important in standardized tests)

Internal Consistency (not appropriate for speeded tests)

Kuder-Richardson (really a series of formulas based on dichotomously scored items)

Coefficient alpha – Cronbach’s (most widely used, as it can be used with continuous item types; see the sketch after this list)

Split-half; odd-even – w/ Spearman-Brown correction to apply to the full test (easiest for you to do and understand)
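
Because coefficient alpha appears so often in published test manuals, here is a minimal sketch of how it could be computed from a students-by-items score matrix, using the standard formula alpha = k/(k-1) × (1 - sum of item variances / variance of total scores); the 6-student, 4-item data are invented.

```python
# Cronbach's alpha from a (students x items) score matrix.
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
# The 6-student, 4-item ratings below are invented for illustration.
import numpy as np

scores = np.array([
    [4, 3, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [4, 5, 4, 4],
    [1, 2, 2, 1],
])

k = scores.shape[1]                              # number of items
item_variances = scores.var(axis=0, ddof=1)      # variance of each item
total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha: {alpha:.2f}")
```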


Page 11: Reliability

Reliability of Your Classroom Tests

I would recommend doing Split-Half Reliability.

Step 1 – Split your test into two parts (odd – even).

Step 2 – Use “Pearson Product Moment Correlation – Ungrouped Data” to determine rxy (rxy represents the correlation between the two halves of the scale). By doing the split-half we reduce the number of items, which we know will automatically reduce the reliability, SO

Step 3 – To estimate the reliability of the whole test, use the Spearman-Brown “correction” formula

rsb = 2rxy /(1+rxy)

where rsb is the split-half reliability coefficient
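
A minimal sketch of the three steps, assuming Python with NumPy and an invented 8-student by 10-item matrix of right/wrong (1/0) scores:

```python
# Split-half reliability with the Spearman-Brown correction, following the
# three steps above. The 8-student x 10-item 0/1 score matrix is invented.
import numpy as np

scores = np.array([
    [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
])

# Step 1: split into odd- and even-numbered items and total each half.
odd_half = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...

# Step 2: Pearson product-moment correlation between the two half-test scores.
r_xy = np.corrcoef(odd_half, even_half)[0, 1]

# Step 3: Spearman-Brown correction to estimate full-test reliability.
r_sb = 2 * r_xy / (1 + r_xy)

print(f"half-test correlation r_xy:    {r_xy:.2f}")
print(f"split-half reliability r_sb:   {r_sb:.2f}")
```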


Page 12: Reliability

As a Teacher, What Do I Need to Know Most About Reliability?

For tests I create myself:

Increasing the number of items increases reliability.

Moderate difficulty level increases reliability.

Having items measuring similar content increases reliability.

For standardized tests I use:

Look for each test’s published reliability data.

Use the published reliability coefficient to determine the Standard Error of Measurement (abbreviated SEM) found in the data.

See the following illustration.


Page 13: Reliability

Standard Error of Measurement

The SEM is the standard deviation of a hypothetically infinite number of obtained scores around a person’s true score.


Page 14: Reliability

SEM and Confidence Bands

The SEM is a standard deviation of a distribution assumed to be normal.

So computing the SEM can help me better interpret scores.

Formula: SEM = SD × √(1 - r)

I can take the computed SEM and build a Confidence Band around my score.

Confidence Bands:

68% Confidence Band: +/- 1 SEM
95% Confidence Band: +/- 1.96 SEM
99% Confidence Band: +/- 2.58 SEM

I can also do percentiles (a bit harder). Many professional test makers give me this information
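
A small sketch of the arithmetic, using invented numbers (score scale SD = 15, published reliability r = .91, obtained score = 108):

```python
# SEM and confidence bands, using SEM = SD * sqrt(1 - r).
# The SD, reliability, and obtained score below are invented for illustration.
import math

sd = 15          # standard deviation of the test's score scale
r = 0.91         # published reliability coefficient
obtained = 108   # one student's obtained score

sem = sd * math.sqrt(1 - r)
print(f"SEM = {sem:.2f}")

for label, z in [("68%", 1.0), ("95%", 1.96), ("99%", 2.58)]:
    low, high = obtained - z * sem, obtained + z * sem
    print(f"{label} confidence band: {low:.1f} to {high:.1f}  (+/- {z} SEM)")
```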


Page 15: Reliability

Final Thoughts & Advice

Use multiple sources of information.

Find and use a published test’s SEM to help interpretation.

Standard Error of Measurement is distinct from:

Standard error of the mean (samples/populations)

Standard error of estimate (prediction)

Reliability for criterion-referenced items may use techniques already covered but sometimes requires special treatment.

Worry about scorer reliability when the score depends on judgment.


Page 16: Reliability

More Final Words . . .

Reliability for subscores is problematic, since small clusters of items are usually quite unreliable.

For important decisions, get reliability > .90.

Be wary of short tests. To increase reliability, increase the number of items, exercises, or observations (see the sketch at the end of this list).

Occasionally check reliability of your classroom tests.

Be able to distinguish between reliability and validity.
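
The advice about lengthening tests can be sketched with the general Spearman-Brown prophecy formula, rnew = k × r / (1 + (k - 1) × r), which generalizes the split-half correction used earlier; k is the factor by which the test is lengthened with comparable items, and the starting reliability of .60 below is invented.

```python
# Spearman-Brown prophecy formula: estimated reliability if a test is
# lengthened by a factor k with comparable items. The starting reliability
# of .60 is invented for illustration.
def lengthened_reliability(r: float, k: float) -> float:
    return k * r / (1 + (k - 1) * r)

r_short = 0.60  # reliability of a short classroom test
for k in (1, 2, 3, 4):
    est = lengthened_reliability(r_short, k)
    print(f"{k}x as many items -> estimated reliability {est:.2f}")
```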
