test reliability & development using irt - jonathan … reliability & development using irt...

Test Reliability & Development Using IRT

University of Kansas

Item Response Theory

Stats Camp ‘07

Overview

• Reliability with IRT
  – Item and Test Information Functions
    • Concepts
    • Equations
    • Uses and Examples

• Optimal Test Design

Reliability with IRT

• We all know that reliability (precision) is a desirable property for an assessment.

• The more reliable a test is, the more precisely we can measure the construct.

• For any scaling procedure (IRT or CTT), as reliability goes up, the standard error of measurement goes down.


• In CTT, reliability is a one-number summary of test precision, and there is a corresponding single standard error of measurement that is used for any test score.

• In IRT, test precision is conceptualized as something called Information, which is conditional on the trait level being measured.
  – Some tests could measure certain trait levels very well but measure others poorly…


• A further advantage of IRT with respect to evaluating reliability is that we can consider the amount of Information an item and/or a test provides.

• In CTT, measures of item quality exist, but these are only indirectly related to what the reliability of the test will be.

Item Information Function

• “Item Information” indicates an item’s usefulness for assessing ability.

• By “usefulness” we basically mean how good an item is at distinguishing examinees with lower ability levels from those with higher ability levels.

• More Information means more Precision

[Figure: an item characteristic curve, P(u = 1 | θ) against Ability (θ), shown with the corresponding item information function, Info(θ) against Ability (θ).]


• Items are more informative where the slope of the ICC is steepest, which happens when…
  – bj is relatively close to θi,
  – aj is relatively high, and
  – cj is relatively low

• If cj = 0, an item provides its maximum information when θi = bj

[Figure: ICCs, P(u = 1 | θ), and item information functions, Info(θ), against Ability (θ) for items with a = 1.0, c = 0.0, and b = 1.0 or 2.0.]

[Figure: ICCs, P(u = 1 | θ), and item information functions, Info(θ), against Ability (θ) for items with b = -1.0, c = 0.2, and a = 1.0 or 0.5.]

[Figure: ICCs, P(u = 1 | θ), and item information functions, Info(θ), against Ability (θ) for items with a = 1.0, b = 0.0, and c = 0.0 or 0.2.]


• IMPORTANT: information is a function of θ, which means that an item could be very informative for some ability levels and relatively uninformative for others.

• Example: difficult items are informative for higher ability levels, but don’t tell us much about lower ability levels (because low-ability examinees get nearly all of those items wrong!)

[Figure: ICCs, P(u = 1 | θ), and item information functions, Info(θ), against Ability (θ) for items with c = 0.0, a = 1.2 or 0.8, and b = 1.0 or 0.0.]

Item Information Function for the 3-PL

$$
I_j(\theta) = \frac{[P_j'(\theta)]^2}{P_j(\theta)\,Q_j(\theta)} = \frac{D^2 a_j^2 (1 - c_j)}{\left[c_j + e^{D a_j(\theta - b_j)}\right]\left[1 + e^{-D a_j(\theta - b_j)}\right]^2}
$$
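The 3-PL information formula can be checked numerically. The sketch below is illustrative only: it assumes the usual 3-PL response function, P(θ) = c + (1 - c) / (1 + exp(-Da(θ - b))), and a scaling constant D = 1.7, neither of which is stated on this slide. The function names are hypothetical. It computes information two ways, from the closed form and from the squared slope of the ICC divided by PQ, and the two should agree.

```python
import math

D = 1.7  # scaling constant; 1.7 is a common convention and an assumption here


def p_3pl(theta, a, b, c):
    """Assumed 3-PL probability of a correct response, P_j(theta)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information_3pl(theta, a, b, c):
    """Closed-form 3-PL item information, I_j(theta)."""
    z = D * a * (theta - b)
    return (D**2 * a**2 * (1.0 - c)) / ((c + math.exp(z)) * (1.0 + math.exp(-z)) ** 2)


def item_information_from_slope(theta, a, b, c, h=1e-5):
    """Same quantity as [P'(theta)]^2 / (P(theta) Q(theta)), using a numerical slope."""
    p = p_3pl(theta, a, b, c)
    slope = (p_3pl(theta + h, a, b, c) - p_3pl(theta - h, a, b, c)) / (2.0 * h)
    return slope**2 / (p * (1.0 - p))


if __name__ == "__main__":
    # Illustrative item: a = 1.0, b = 0.0, c = 0.2
    for theta in (-1.0, 0.0, 1.0):
        closed = item_information_3pl(theta, 1.0, 0.0, 0.2)
        numeric = item_information_from_slope(theta, 1.0, 0.0, 0.2)
        print(f"theta = {theta:+.1f}  closed form = {closed:.4f}  slope form = {numeric:.4f}")
```

With c = 0 and θ = b the closed form reduces to D²a²/4, which is a convenient sanity check on any implementation.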

Notes on IIF

• The roles of aj and cj are easy to see
  – as aj increases, information increases
  – as cj increases, information decreases

• As ability moves away from bj (+ or -) the denominator increases, so information approaches zero.

Maximum Information

If cj = 0, then Information is maximized at bj

If cj > 0, then Information is maximized at an ability level slightly greater than bj

$$
\theta_{\max} = b_j + \frac{1}{D a_j} \ln\left[0.5\left(1 + \sqrt{1 + 8 c_j}\right)\right]
$$
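As a rough check on the maximum-information formula, the sketch below compares the closed-form location with a brute-force grid search over θ. It assumes D = 1.7 and uses hypothetical function names and item parameters; it is a sanity check rather than a prescribed procedure.

```python
import math

D = 1.7  # assumed scaling constant


def item_information_3pl(theta, a, b, c):
    """Closed-form 3-PL item information, I_j(theta)."""
    z = D * a * (theta - b)
    return (D**2 * a**2 * (1.0 - c)) / ((c + math.exp(z)) * (1.0 + math.exp(-z)) ** 2)


def theta_max(a, b, c):
    """Ability level at which a 3-PL item is most informative (closed form)."""
    return b + math.log(0.5 * (1.0 + math.sqrt(1.0 + 8.0 * c))) / (D * a)


def theta_max_grid(a, b, c, lo=-4.0, hi=4.0, steps=80001):
    """Brute-force search for the same maximum on an evenly spaced theta grid."""
    width = (hi - lo) / (steps - 1)
    grid = (lo + i * width for i in range(steps))
    return max(grid, key=lambda t: item_information_3pl(t, a, b, c))


if __name__ == "__main__":
    # Illustrative item: a = 1.0, b = 0.0, c = 0.2
    print("closed form:", round(theta_max(1.0, 0.0, 0.2), 3))
    print("grid search:", round(theta_max_grid(1.0, 0.0, 0.2), 3))
```

When c = 0 the logarithm term is ln(1) = 0, so the maximum sits exactly at bj, in line with the first statement on this slide.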

Test Information Function

• Just like we add up ICCs to get a TCC, we add up IIFs to get a TIF.

• Information will continue to increase as we add test items, therefore increasing precision.

• All things equal, longer tests provide increased measurement precision.

Test Information Function

• Defined for a set of items at each point along the ability (θ) scale

• Test information is influenced by the ‘quality’ and the number of test items

$$
I(\theta) = \sum_{j=1}^{n} I_j(\theta)
$$
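The summation above can be sketched in a few lines. The eight (a, b, c) triples are made up for illustration, D = 1.7 is an assumption, and the function name `tif` is hypothetical.

```python
import math

D = 1.7  # assumed scaling constant


def item_information_3pl(theta, a, b, c):
    """Closed-form 3-PL item information, I_j(theta)."""
    z = D * a * (theta - b)
    return (D**2 * a**2 * (1.0 - c)) / ((c + math.exp(z)) * (1.0 + math.exp(-z)) ** 2)


def tif(theta, items):
    """Test information: the sum of the item information functions at theta."""
    return sum(item_information_3pl(theta, a, b, c) for a, b, c in items)


# Hypothetical (a, b, c) parameters for an 8-item form, invented for this sketch.
ITEMS = [
    (0.8, -2.0, 0.2), (1.0, -1.5, 0.2), (1.2, -0.5, 0.2), (0.9, 0.0, 0.2),
    (1.1, 0.5, 0.2), (1.3, 1.0, 0.2), (0.7, 1.5, 0.2), (1.0, 2.0, 0.2),
]

if __name__ == "__main__":
    for theta in (-2, -1, 0, 1, 2):
        print(f"theta = {theta:+d}  I(theta) = {tif(theta, ITEMS):.3f}")
```

Because every term in the sum is positive, dropping an item can only lower the total at each θ, and adding one can only raise it.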

[Figure: four panels against Ability (θ): the item characteristic curves, P(u = 1 | θ); the test characteristic curve, E(X | θ), on a 0 to 8 scale; the item information functions, Info(θ); and the test information function, Info(θ).]

Conditional Error for Maximum Likelihood Estimates

• One of the great benefits of IRT scaling is that measurement precision and error can now be considered conditional on θ.

Conditional Error for Maximum Likelihood Estimates

• Standard error of an MLE is determined by:

$$
SE(\hat{\theta}) = \frac{1}{\sqrt{I(\hat{\theta})}}
$$

Conditional Standard Error

• The imprecision of ability estimation is therefore inversely related to the amount of Information with respect to ability that is available.

• Since Information increases with the quality and number of items, the SE conversely decreases…which hopefully makes some sense!

[Figure: 8-item Test Information Function. Info(θ) and SE(θ) against Ability (θ).]

[Figure: Info(θ) and SE(θ) against Ability (θ). Information may be spread across a relatively wide range…]

[Figure: Info(θ) and SE(θ) against Ability (θ). …or maximized around an ability level of interest (e.g., a cutscore).]

Info and SE Example

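$$
\text{At } \theta = 1.0,\quad I(\theta = 1) = 9
$$

$$
SE(\hat{\theta}) = \frac{1}{\sqrt{I(\hat{\theta})}} = \frac{1}{\sqrt{9}} = 0.33
$$

$$
\text{If } \hat{\theta} = 1.0,\quad SE(\hat{\theta}) = 0.33
$$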

Info and SE Example

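$$
\text{At } \theta = 0.0,\quad I(\theta = 0) = 3
$$

$$
SE(\hat{\theta}) = \frac{1}{\sqrt{I(\hat{\theta})}} = \frac{1}{\sqrt{3}} = 0.58
$$

$$
\text{If } \hat{\theta} = 0.0,\quad SE(\hat{\theta}) = 0.58
$$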

Info and SE Example

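$$
\text{At } \theta = -1.0,\quad I(\theta = -1) = 1
$$

$$
SE(\hat{\theta}) = \frac{1}{\sqrt{I(\hat{\theta})}} = \frac{1}{\sqrt{1}} = 1.0
$$

$$
\text{If } \hat{\theta} = -1.0,\quad SE(\hat{\theta}) = 1.0
$$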
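The three worked examples above all apply the same conversion from information to standard error. A minimal sketch, with a hypothetical function name, reproduces them:

```python
import math


def standard_error(information):
    """Conditional standard error of the ML ability estimate: 1 / sqrt(I(theta))."""
    if information <= 0:
        raise ValueError("information must be positive")
    return 1.0 / math.sqrt(information)


if __name__ == "__main__":
    # Information values taken from the three examples: I = 9, 3, and 1.
    for theta, info in ((1.0, 9.0), (0.0, 3.0), (-1.0, 1.0)):
        print(f"theta = {theta:+.1f}  I = {info:.0f}  SE = {standard_error(info):.2f}")
```

Note the square root: going from I = 1 to I = 9 multiplies information by nine but cuts the standard error only to one third.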

95% Confidence Interval

• Because MLEs are asymptotically normally distributed, we create a 95% confidence interval around a point estimate of ability by adding and subtracting 1.96 standard errors:

• Estimate ± 1.96 SE (recall critical values from a standard normal distribution)

[Figure: Standard Normal Distribution, with Probability on the vertical axis; the central area is 0.95 and each tail is 0.025.]


• For θ = 1, SE = 0.33 → 1.0 ± 0.65
  – 95% confident that the examinee’s true ability is between 0.35 and 1.65

• For θ = 0, SE = 0.58 → 0.0 ± 1.14
  – 95% confident that the examinee’s true ability is between -1.14 and 1.14

• For θ = -1, SE = 1.0 → -1.0 ± 1.96
  – 95% confident that the examinee’s true ability is between -2.96 and 0.96
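The interval arithmetic above can be sketched as follows. The function name is hypothetical, and the inputs are the rounded standard errors from the examples (0.33, 0.58, 1.0), which is why the half-widths come out as 0.65, 1.14, and 1.96.

```python
def confidence_interval_95(theta_hat, se, z=1.96):
    """Return (lower, upper) for the interval theta_hat +/- z * SE."""
    half_width = z * se
    return theta_hat - half_width, theta_hat + half_width


if __name__ == "__main__":
    for theta_hat, se in ((1.0, 0.33), (0.0, 0.58), (-1.0, 1.0)):
        lower, upper = confidence_interval_95(theta_hat, se)
        print(f"theta_hat = {theta_hat:+.1f}  SE = {se:.2f}  95% CI = [{lower:.2f}, {upper:.2f}]")
```

Using unrounded standard errors (for example 1/√3 instead of 0.58) would shift the limits slightly in the second decimal place.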


• As information increases…
  – SE decreases
  – CI becomes narrower
  – Increased trust in ability estimate

• As information decreases…
  – SE increases
  – CI becomes wider
  – Decreased trust in ability estimate

Notes on IIF and TIF

• Note that the contribution of Ij(θ) to I(θ) does not depend on the particular combination of test items.
  – Each item contributes independently

• This is a very big advantage of IRT over CTT: reliability can be described conditionally (as information), and it does not depend on the particular set of items.

Mini-CTT lesson

• In CTT, item discrimination (quality) is the item-total correlation

• This will depend on the item itself, but is also influenced by the other test items.

• Adding items changes the total score, thus changing the correlation.

• Therefore, it’s difficult to anticipate the reliability of a test when creating a form from a bank of previously piloted items, unless those items all appeared together.

CTT versus IRT

• In IRT, item quality is Information, which is affected by aj, bj, cj, and θ.

• An item’s information function will be independent of the other items on the test, as will its contribution to the TIF.

• Adding more and/or better items will increase TIF, but won’t impact any IIF.

• Therefore, it’s easy to anticipate the reliability of a test when creating a form from a bank of previously piloted items.

Excel Spreadsheet Demo

• Show Excel Spreadsheet containing eight items, their ICCs, TCC, IIFs, TIF and SE.

• Specify different item parameters and determine how changes affect the resulting graphs.

Uses of Item and Test Information Functions

1) Providing conditional SE of trait
2) Building a test to meet desired statistical specifications
3) Revising an existing test
4) Comparing tests

Conditional SE

• As previously stated, the precision (reliability) and imprecision (error) of a test scaled with IRT is conditional on θ.

• Tests may be better or worse for measuring certain trait levels

Test Development

• From a pool of previously piloted test items, IRT makes it relatively easy to switch items in and out and determine what the resulting Information function will be.

• This tells the test maker what the conditional standard errors will be, too.
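One way to picture switching items in and out is a small selection routine. The sketch below is only one possible approach, not a method described in these slides: it ranks a hypothetical item bank by information at a target ability level and keeps the top few. Because item contributions to the TIF are additive and independent, the top-ranked items at a single θ also give the largest possible test information at that θ. The bank, the cutscore of 1.0, D = 1.7, and all function names are assumptions.

```python
import math

D = 1.7  # assumed scaling constant


def item_information_3pl(theta, a, b, c):
    """Closed-form 3-PL item information, I_j(theta)."""
    z = D * a * (theta - b)
    return (D**2 * a**2 * (1.0 - c)) / ((c + math.exp(z)) * (1.0 + math.exp(-z)) ** 2)


# Hypothetical bank of previously piloted items: (item_id, a, b, c).
BANK = [
    ("item01", 0.6, -1.5, 0.20),
    ("item02", 1.4, -0.5, 0.15),
    ("item03", 1.1, 0.0, 0.25),
    ("item04", 1.5, 0.8, 0.20),
    ("item05", 0.9, 1.0, 0.10),
    ("item06", 1.3, 1.2, 0.20),
    ("item07", 0.7, 2.0, 0.20),
]


def info_at(theta, item):
    """Information that one bank entry provides at theta."""
    _, a, b, c = item
    return item_information_3pl(theta, a, b, c)


def pick_items(bank, target_theta, n_items):
    """Keep the n_items entries that are most informative at target_theta."""
    ranked = sorted(bank, key=lambda item: info_at(target_theta, item), reverse=True)
    return ranked[:n_items]


def form_information(theta, form):
    """Test information of a form: the sum of its item information values."""
    return sum(info_at(theta, item) for item in form)


if __name__ == "__main__":
    cutscore = 1.0
    form = pick_items(BANK, cutscore, 3)
    info = form_information(cutscore, form)
    print("selected:", [item[0] for item in form])
    print(f"I({cutscore}) = {info:.3f}, SE = {1.0 / math.sqrt(info):.3f}")
```

A real assembly would usually target information over a range of θ and respect content constraints, which this sketch ignores.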

Test Development

• Another benefit to test development is that multiple forms may be built to the same statistical specifications.

• This process is often referred to as “Pre-equating.”

• Building strictly parallel forms is always difficult, but these procedures can help.

Test Revision

• Likewise, test items may be removed from previously existing forms (e.g., to create a “short form” of a test).

• Test items may also need to be added if the previous form is found to be unreliable.

• Estimating the new reliability of the test is straightforward with IRT

Test Revision

• In CTT, such test revisions require the assumption that the deleted or added items are of comparable statistical quality to those already on the test.
  – Spearman-Brown prophecy formula
  – This may or may not be true!
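For reference, the Spearman-Brown prophecy formula predicts the reliability of a test whose length is changed by a factor k as kρ / (1 + (k - 1)ρ), where ρ is the current reliability. The sketch below uses hypothetical names and made-up reliability values; it rests on the assumption flagged above, that the added or removed items match the existing ones in quality.

```python
def spearman_brown(reliability, length_factor):
    """Predicted CTT reliability after changing test length by length_factor.

    Assumes added or removed items are statistically comparable to the rest.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


if __name__ == "__main__":
    current = 0.80  # illustrative reliability of the existing form
    print(f"doubled length: {spearman_brown(current, 2.0):.3f}")
    print(f"half length:    {spearman_brown(current, 0.5):.3f}")
```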

Comparing Tests

• When comparing the reliability (i.e., precision) of two test forms, it’s useful to determine the ratio of their information with respect to θ.

• This ratio is known as the relative efficiency of a test: RE(θ).

• Consider two previous example TIFs

[Figure: Info(θ) and SE(θ) against Ability (θ). Information targeted around a cutscore. We’ll call this “Form X”.]

[Figure: Info(θ) and SE(θ) against Ability (θ). Information spread across a wide range. We’ll call this “Form Y”.]

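$$
RE(\theta) = \frac{I_X(\theta)}{I_Y(\theta)}, \qquad I_X(\theta) \rightarrow \text{info for form X at } \theta, \quad I_Y(\theta) \rightarrow \text{info for form Y at } \theta
$$

$$
\text{Suppose at } \theta = 1:\quad I_X(\theta) = 9.0, \quad I_Y(\theta) = 3.6
$$

$$
\text{Then, } RE(\theta = 1) = \frac{9}{3.6} = 2.5
$$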
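The relative efficiency calculation is a single ratio, sketched below with a hypothetical function name and the information values from the example (9.0 for Form X and 3.6 for Form Y at θ = 1). Since SE = 1/√I, the square root of RE gives the ratio of the two forms' standard errors at that ability level.

```python
import math


def relative_efficiency(info_x, info_y):
    """RE(theta) = I_X(theta) / I_Y(theta) at a single ability level."""
    if info_y <= 0:
        raise ValueError("information for Form Y must be positive")
    return info_x / info_y


if __name__ == "__main__":
    re_at_1 = relative_efficiency(9.0, 3.6)
    print(f"RE(theta = 1) = {re_at_1:.2f}")
    # Because SE = 1 / sqrt(I), the ratio SE_Y / SE_X equals sqrt(RE).
    print(f"SE_Y / SE_X   = {math.sqrt(re_at_1):.2f}")
```

Values above 1 favor Form X at that θ, and values below 1 favor Form Y.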

[Figure: Info(θ) against Ability (θ). In the region θ = 1, Form X is 2.5 times more efficient than Form Y.]

[Figure: Info(θ) against Ability (θ). In the region θ ≈ 0.10, Form X is just as efficient as Form Y.]

[Figure: Info(θ) against Ability (θ). In the region θ = -1, Form X is LESS efficient than Form Y: RE(θ) = 0.23.]

[Figure: RE(θ) against Ability (θ). Form X is more efficient than Form Y above the point θ ≈ 0.1.]

[Figure: RE(θ) against Ability (θ). Form Y is more efficient than Form X below the point θ ≈ 0.1.]

Next…

• Test Score Equating using IRT
