Reliability in Language Testing


Upload: seray-tanyer

Posted on 15-Jul-2015


Introduction

The aim of this chapter is to identify potential sources of error in a given measure of communicative language ability and to minimize the effect of these factors on that measure.

We are concerned with errors of measurement (unreliability) because we know that test performance is affected by factors other than the abilities we want to measure.

When we minimize the effects of these various factors, we minimize measurement error and maximize reliability.

The central question is: ‘How much of an individual’s test performance is due to measurement error, or to factors other than the language ability we want to measure?’

Introduction

In this chapter, we discuss:

- Measurement error in test scores,
- The potential sources of this error,
- The different approaches to estimating the relative effects of these sources of error on test scores, and
- The considerations to be made in determining which of these approaches may be appropriate for a given testing situation.

Factors that affect language test scores

The examination of reliability depends upon distinguishing the effects of the abilities we want to measure from the effects of other factors.

If we wish to estimate how reliable our test scores are, we must begin with a set of definitions of the abilities we want to measure and of the other factors that we expect to affect test scores.

Factors that affect language test scores

1. Communicative language ability: specific abilities that determine how an individual performs on a given test. Example: in a test of sensitivity to register, the students who perform the best and receive the highest scores would be those with the highest level of sociolinguistic competence.

2. Test method facets: testing environment, the test rubric, the nature of input and expected response, the relationship between input and response

3. Personal attributes: individual characteristics (cognitive style and knowledge of particular content areas) - group characteristics (sex, race, and ethnic background)

4. Random factors: unpredictable and largely temporary conditions (mental alertness or emotional state) - uncontrolled differences in test method facets (changes in the test environment from one day to the next or differences in the way different test administrators carry out their responsibilities)

Factors that affect language test scores

1. The primary interest in using language tests is to make inferences about one or more components of an individual’s communicative language ability.

2. Random factors and test method facets are generally considered to be sources of measurement error (unreliability).

3. Personal attributes (i.e. sex, ethnic background, cognitive style and prior knowledge of content area) are discussed as sources of test bias, or test invalidity, and these will therefore be discussed in Chapter 7 (validity)

Theories and models of reliability

Any factors other than the ability being tested that affect test scores are potential sources of error that decrease the reliability of scores.

It is therefore essential to identify these sources of error and to estimate the magnitude of their effect on test scores.

The theories and models below differ in how they define these various influences on test scores.

1. Classical True Score Measurement Theory

Classical true score (CTS) measurement theory consists of a set of assumptions about the relationships between observed and true test scores and the factors that affect these scores.

Reliability is defined in the CTS theory in terms of true score variance.

True score: due to an individual’s level of ability / Error score: due to factors other than the ability being tested

observed score (the actual test score) = true score + error score
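This decomposition can be stated compactly. Writing x for the observed score, t for the true score, and e for the error score, and assuming (as CTS theory does) that true and error scores are uncorrelated:

```latex
x = t + e \qquad\Rightarrow\qquad \sigma_x^2 = \sigma_t^2 + \sigma_e^2,
\qquad r_{xx'} = \frac{\sigma_t^2}{\sigma_x^2} = 1 - \frac{\sigma_e^2}{\sigma_x^2}
```

Reliability is thus the proportion of observed score variance that is true score variance.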

1. Classical True Score Measurement Theory

Since we can never know the true scores of individuals, we can never know what the reliability is, but can estimate it from the observed scores.

The basis for all such estimates in the CTS model is the correlation between parallel tests.

Parallel tests: In order for two tests to be considered parallel, they must be measures of the same ability (equivalent, or alternate, forms).

If the observed scores on two parallel tests are highly correlated, these tests can be considered reliable indicators of the ability being measured.
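As a minimal sketch of this idea, the parallel-forms reliability estimate is simply the Pearson correlation between the two sets of observed scores. The score lists below are invented for illustration:

```python
# Parallel-forms reliability: correlate scores from two equivalent forms.

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Invented scores of eight test takers on two alternate forms.
form_a = [12, 15, 9, 18, 14, 11, 16, 13]
form_b = [13, 14, 10, 17, 15, 10, 17, 12]

print(round(pearson(form_a, form_b), 2))  # 0.93
```

A correlation this close to 1 would be taken as evidence that the two forms rank test takers consistently.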

1. Classical True Score Measurement Theory

Within the CTS model there are three approaches to estimating reliability, each of which addresses different sources of error:

a. Internal consistency estimates are concerned with sources of error such as differences in test-tasks and item formats, inconsistencies within and among scorers.

b. Stability estimates indicate how consistent test scores are over time.

c. Equivalence estimates provide an indication of the extent to which scores on alternate forms of a test are equivalent.

The estimates of reliability that these approaches yield are called reliability coefficients.

1. Classical True Score Measurement Theory

Internal consistency is concerned with how consistent test takers’ performances on different parts of the test are with each other.

Two approaches to estimating internal consistency

-an estimate based on correlation between two halves (the Spearman-Brown split-half estimate)

-estimates which are based on ratios of the variances of parts of the test – halves or items – to total test score variance (the Guttman split-half, the Kuder-Richardson formulae, and coefficient alpha)
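Coefficient alpha, for example, is computed from the ratio of the summed item variances to the total score variance. A minimal sketch, with an invented item score matrix:

```python
# Internal consistency: coefficient alpha from item-level scores.

def variance(xs):
    """Sample variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    """items: one list per item, holding that item's scores across test takers."""
    k = len(items)
    item_vars = sum(variance(col) for col in items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per person
    return (k / (k - 1)) * (1 - item_vars / variance(totals))

# Four dichotomously scored items, six test takers (invented data).
items = [
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1, 0],
]
print(round(cronbach_alpha(items), 2))  # 0.81
```

For dichotomous items this quantity coincides with the Kuder-Richardson formula 20.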

1. Classical True Score Measurement Theory

Rater consistency: When test scores are obtained subjectively (e.g. ratings of compositions or oral interviews), one source of error is inconsistency in the ratings.

Intra-rater reliability: In order to examine the reliability of ratings of a single rater, at least two independent ratings from this rater are obtained. This is accomplished by rating the individual samples once and then re-rating them at a later time in a different, random order.

Inter-rater reliability: the consistency of ratings across two different raters. In examining inter-rater consistency, one rating is obtained from each of the two raters, and the two sets of ratings are correlated.

1. Classical True Score Measurement Theory

Stability (test-retest reliability): In this approach, we administer the test twice to a group of individuals and then compute the correlation between the two sets of scores. This correlation can then be interpreted as an indication of how stable the scores are over time.

Equivalence (parallel forms reliability): In this approach, we try to estimate the reliability of alternate forms of a given test by administering both forms to a group of individuals. The correlation between the two sets of scores can then be computed.

2. Generalizability theory (G-theory)

Generalizability theory is an extension of the classical model

It enables test developers to examine several sources of variance simultaneously, and to distinguish systematic error from random error.

First, the test developer designs and conducts a study to investigate the sources of variance (a generalizability study, or G-study).

Depending on the outcome of this G-study, the test developer may revise the test or the procedures for administering it, or if the results are satisfactory, the test developer proceeds to the second stage, a decision study (D-study).

2. Generalizability theory (G-theory)

In a D-study, the test developer administers the test under operational conditions, in which the test will be used to make the decisions for which it is designed. Then, the test developer uses G-theory procedures to estimate the magnitude of the variance components.

Terms related to generalizability theory

Universe of generalization: the domain of uses or abilities (or both) to which we want test scores to generalize.

Universe of measures: the types of test scores we would be willing to accept as indicators of the ability to be measured.

2. Generalizability theory (G-theory)

Terms related to generalizability theory

Populations of persons: the group about whom we are going to make decisions or inferences

Universe score: the mean of a person’s scores on all measures from the universe of possible measures (similar to CTS-theory true score)

This conceptualization of generalizability reveals that a given estimate of generalizability is limited to the specific universe of measures and population of persons within which it is defined, and that a test score that is ‘True’ for all persons, times, and places simply does not exist.

2. Generalizability theory (G-theory)

Generalizability coefficients: The G-theory analog of the CTS-theory reliability coefficient is the generalizability coefficient, defined as the ratio of universe score variance to observed score variance:

generalizability coefficient = universe score variance / observed score variance

Estimation: In order to estimate the relative effect of different sources of variance on the observed scores, it is necessary to obtain multiple measures for each person under the different conditions for each facet

2. Generalizability theory (G-theory)

Estimation: One statistical procedure that can be used for estimating the relative effects of different sources of variance on test scores is analysis of variance (ANOVA).

Example: An oral interview: with different question forms, or sets of questions, and different interviewer/raters

Using ANOVA, we could obtain estimates for all the variance components in the design: (1) the main effects for persons, raters, and forms; (2) the two-way interactions between persons and raters, persons and forms, and forms and raters; and (3) a component that contains the three-way interaction among persons, raters, and forms, as well as the random variance.
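The mechanics can be illustrated with a simpler one-facet design: persons crossed with raters only. Variance components are estimated by equating the observed ANOVA mean squares to their expected values; the ratings matrix below is invented:

```python
# One-facet G-study sketch: persons crossed with raters (p x r design).
# Rows = persons, columns = raters (invented ratings).
ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]
n_p, n_r = len(ratings), len(ratings[0])
grand = sum(map(sum, ratings)) / (n_p * n_r)
person_means = [sum(row) / n_r for row in ratings]
rater_means = [sum(col) / n_p for col in zip(*ratings)]

# ANOVA sums of squares for the fully crossed design
ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
ss_res = ss_total - ss_p - ss_r

ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_res = ss_res / ((n_p - 1) * (n_r - 1))

# Expected-mean-square equations give the variance component estimates
var_p = (ms_p - ms_res) / n_r   # persons (universe score variance)
var_r = (ms_r - ms_res) / n_p   # raters
var_res = ms_res                # person-by-rater interaction + random error

# Generalizability coefficient for relative decisions with n_r raters
g_coef = var_p / (var_p + var_res / n_r)
print(round(g_coef, 2))  # 0.96
```

The full oral-interview example in the text would extend this to a three-facet persons × raters × forms design, but the estimation logic is the same.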

3. Standard Error of Measurement (SEM)

The approaches to estimating reliability that have been developed within both CTS theory and G-theory are based on group performance, and provide information for test developers and test users about how consistent the scores of groups are on a given test.

Reliability and generalizability coefficients provide no direct information about the accuracy of individual test scores.

There is thus a need for a single indicator of how much we would expect an individual’s test scores to vary.

The most useful indicator for this purpose is called the standard error of measurement.

The smaller the standard deviation of errors (the standard error of measurement, SEM), the more reliable the test.
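Under the CTS model the SEM is the score standard deviation scaled by the square root of one minus the reliability. A minimal sketch with illustrative numbers:

```python
# Standard error of measurement: SEM = s_x * sqrt(1 - r_xx),
# where s_x is the score standard deviation and r_xx the reliability estimate.

def sem(sd, reliability):
    return sd * (1 - reliability) ** 0.5

sd, r = 10.0, 0.91  # illustrative values
print(round(sem(sd, r), 1))  # 3.0

# A rough 95% band around an observed score of 70:
score = 70
half_width = 1.96 * sem(sd, r)
print(round(score - half_width, 1), round(score + half_width, 1))
```

This gives a band within which an individual's true score would be expected to fall, which is exactly the per-person information that group-based reliability coefficients do not supply.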

4. Item-response theory

Because of the limitations of CTS theory and G-theory, psychometricians have developed a number of mathematical models for relating an individual’s test performance to that individual’s level of ability.

Item response theory presents a more powerful approach in that it can provide sample-free estimates of individual's true scores, or ability levels, as well as sample-free estimates of measurement error at each ability level.

4. Item-response theory

The unidimensionality assumption: Most IRT models make the specific assumption that the items in a test measure a single, or unidimensional, ability or trait, and that the items form a unidimensional scale of measurement.

Item characteristic curve: Each specific IRT model makes specific assumptions about the relationship between a test taker’s ability and his or her performance on a given item. These assumptions are explicitly stated in a mathematical formula known as the item characteristic curve (ICC).
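The simplest case is the one-parameter (Rasch) model, in which the ICC depends only on the difference between ability theta and item difficulty b. A sketch with illustrative parameter values:

```python
# Rasch (one-parameter logistic) item characteristic curve:
# probability of a correct response as a function of ability theta
# and item difficulty b.
import math

def rasch_icc(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A test taker whose ability equals the item's difficulty
# has a 0.5 chance of answering correctly:
print(rasch_icc(0.0, 0.0))  # 0.5

# Ability one logit above the difficulty raises that chance:
print(round(rasch_icc(1.0, 0.0), 2))  # 0.73
```

More complex models add discrimination and guessing parameters, but the same curve-per-item logic applies.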

4. Item-response theory

Ability score: Recall that neither CTS theory nor G-theory provides an estimate of an individual’s level of ability. One of the advantages of IRT is that it provides estimates of individual test takers’ levels of ability.

Precision of measurement: Precision of measurement is addressed in the IRT concept of the item information function, which refers to the amount of information a given item provides for estimating an individual’s level of ability. The test information function, on the other hand, is the sum of the item information functions, each of which contributes independently to the total, and is a measure of how much information a test provides at different ability levels.

Reliability of criterion-referenced test scores

Norm-referenced (NR) test scores are most useful in situations in which comparative decisions are made, such as the selection of individuals for a program. Criterion-referenced (CR) test scores, on the other hand, are more useful for making ‘absolute’ decisions regarding mastery or non-mastery of the ability domain.

The concept of reliability applies to two aspects of criterion-referenced tests:

- the accuracy of the obtained score as an indicator of a ‘domain’ score (J. D. Brown (1989) has derived a formula)

- the consistency of the decisions that are based on CR test scores (Threshold loss agreement indices - Squared-error loss agreement indices)

Factors that affect reliability estimates

Length of test: long tests are generally more reliable than short ones

Difficulty of test and test score variance: the greater the score variance, the more reliable the tests will tend to be (Norm-referenced tests)

Cut-off score: the greater the differences between the cut-off score and the mean score, the greater will be the reliability (Criterion-referenced tests).
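The effect of test length can be made concrete with the Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened by a factor k, on the assumption that the added items are parallel to the originals:

```python
# Spearman-Brown prophecy formula: predicted reliability of a test
# lengthened by a factor k (added items assumed parallel to the originals).

def spearman_brown(r, k):
    return (k * r) / (1 + (k - 1) * r)

# Doubling a test whose reliability is 0.70 (illustrative values):
print(round(spearman_brown(0.70, 2), 2))  # 0.82
```

The same formula with k = 2 is what converts a split-half correlation into a full-test reliability estimate.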

Systematic measurement error

Systematic error is different from random error.

For example, if every form of a reading comprehension test contained passages from the area of “economics”, then the facet ‘passage content’ would be fixed to one condition - economics.

To the extent that test scores are influenced by individuals’ familiarity with this particular content area, as opposed to their reading comprehension ability, this facet will be a source of error in our measurement of reading comprehension. This is a kind of systematic error.

Systematic measurement error

The effects of systematic error:

- The general effect of systematic error is constant for all observations; it affects the scores of all individuals who take the test.

- The specific effect varies across individuals; it affects different individuals differentially

The effects of test method

Standardization of test method facets can introduce sources of systematic variance into test scores. When a single testing technique is used (e.g. the cloze test), the test might be a better indicator of individuals’ ability to take cloze tests than of their reading comprehension ability.

Conclusion

Any factors other than the ability being tested that affect test scores are potential sources of error that decrease the reliability of scores.

Therefore, it is essential that we be able to identify these sources of error and estimate the magnitude of their effect on test scores.