Testing in language programs (chapter 8)

Language Test Reliability. Teacher: Dr. Golshan. Prepared by: Tahere Bakhshi. November 2015. In the name of God

Upload: tahere-bakhshi

Post on 19-Jan-2017


TRANSCRIPT

Page 1: Testing in language programs (chapter 8)

Language Test Reliability
Teacher: Dr. Golshan

Prepared by: Tahere Bakhshi

November 2015

In the name of God

Page 2: Testing in language programs (chapter 8)

A test should have:
• Reliability: the same result under the same conditions.
• Validity: the test measures what it is meant to measure (for example, the size of the head and not something else).
• Usability or Practicality: not too difficult, and practical to use.

• The problem lies in measuring mental traits such as language proficiency, motivation and so on.
• Tests should measure consistently.
• What, then, are the potential sources of variance?

Page 3: Testing in language programs (chapter 8)

Variance

Variance measures how far a set of numbers is spread out.
• Variance of zero: all values are identical.
• Small variance: values lie close to the mean.
• High variance: values are spread out, far from the mean.
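The three cases above can be illustrated with a short sketch. The score lists are invented for illustration, and the population variance (dividing by N) is assumed here; the slides do not say which variant is intended.

```python
from statistics import pvariance

# Hypothetical score sets, made up for this illustration only.
identical = [70, 70, 70, 70]      # every student has the same score
clustered = [68, 70, 71, 71]      # scores sit close to the mean of 70
spread = [40, 55, 85, 100]        # scores lie far from the mean of 70

# pvariance divides the sum of squared deviations by N (population variance).
var_identical = pvariance(identical)
var_clustered = pvariance(clustered)
var_spread = pvariance(spread)

for name, value in [("identical", var_identical),
                    ("clustered", var_clustered),
                    ("spread", var_spread)]:
    print(f"{name}: variance = {value}")
```

All three lists share the same mean, so the difference in variance comes only from how widely the scores are spread.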

Page 4: Testing in language programs (chapter 8)

Sources of Variance

Meaningful Variance
Those sources of variance that are related to the purpose of the test. To achieve this, the items should be related to the purpose the test was designed for and to the students' knowledge of the topic. This is a test validity issue (see Table 8.1, p. 170).

Other factors unrelated to the aim of the test:

Measurement Error or Error Variance
Those sources of variance that are related to other, extraneous variables.

Page 5: Testing in language programs (chapter 8)

Types of issues related to error variance

1. Variance due to the environment: noise inside and outside the classroom, temperature, distractions, amount of space per person, lighting, ventilation, or other environmental factors.

2. Variance due to the administration procedure: test directions, quality of equipment, and timing (cassette or teacher). See Table 2.5, p. 35.

3. Variance due to examinees: the condition of the students (fatigue, health, hearing or vision), psychological factors (motivation, memory, concentration, forgetfulness, impulsiveness, carelessness and so on), and the students' testwiseness and strategies.

4. Variance due to the scoring procedure: errors in scoring and the subjective nature of the scoring procedure.

5. Variance due to the test and test items: printing, knowledge of the answer sheet, number of items, item selection, quality of test items, and test security.

These sources of measurement error should be minimized so that as little as possible of the variance in students' scores is error variance.

Page 6: Testing in language programs (chapter 8)

Reliability of NRTs

Reliable means dependable or trustworthy. A test is considered reliable if it would give us the same result over and over again.

How is reliability measured?
By comparing two sets of scores for a single assessment (such as two raters' scores for the same person). Once we have two sets of scores for a group of students, we can determine how similar they are by computing a statistic known as the reliability coefficient.

Reliability Coefficient: a numerical index of reliability, ranging from 0 to 1.

A number closer to 1 means high reliability. A low reliability coefficient indicates more error in the assessment results.

Reliability is considered good or acceptable if the reliability coefficient is .80 or above.
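As a rough sketch of how two sets of scores can be compared, the example below computes a Pearson correlation between two raters. Using the Pearson correlation as the coefficient is an assumption made for this illustration, and the function name `pearson_r` and both score lists are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each set of scores.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores given by two raters to the same six students.
rater_a = [12, 15, 9, 18, 14, 11]
rater_b = [13, 14, 10, 17, 15, 10]

r = pearson_r(rater_a, rater_b)
print(f"reliability estimate: {r:.2f}")
print("acceptable" if r >= 0.80 else "below the .80 guideline")
```

The same function applies whenever two sets of scores exist for one group, for instance two administrations of a test or two parallel forms.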

Page 7: Testing in language programs (chapter 8)
Page 8: Testing in language programs (chapter 8)

Three Basic Strategies to Estimate the Reliability of a Test:

1. Test-Retest Reliability:
• Situation: the same people take two administrations of the same test.
• Procedure: correlate the scores on the two administrations, which yields the coefficient of stability.
• Meaning: the extent to which scores on a test can be generalized over different occasions (temporal stability).
• Appropriate use: information about the stability of the trait over time.
• Disadvantages: requires two testing sessions; learning between sessions; test effect.

Page 9: Testing in language programs (chapter 8)

2. Parallel / Equivalent-Forms Reliability:

• Situation: testing of the same people on different but comparable forms of the test (Forms A and B).

• Procedure: correlate the scores from the two forms, which yields a coefficient of equivalence.

• Meaning: the consistency of response to different item samples (where testing is immediate) and across occasions (where testing is delayed).

• Appropriate use: to provide information about the equivalence of forms.

Ali usually ………… late at night. a. study b. studies c. studying

Reza often ………… the shopping in the afternoon. a. do b. does c. doing

Page 10: Testing in language programs (chapter 8)

3. Internal Consistency Reliability:
• Situation: a single administration of one test form. All items in an internally consistent scale assess the same construct.
• Procedure: divide the test into comparable halves and correlate the scores from both halves.
– Split-half with Spearman-Brown adjustment
– Kuder-Richardson #20 and #21
– Cronbach's alpha
• Meaning: consistency across the parts of a measuring instrument ("parts" = individual items or subgroups of items).
• Appropriate use: where the focus is on the degree to which the same characteristic is being measured. A measure of test homogeneity.

Page 11: Testing in language programs (chapter 8)

Internal Consistency Strategies

All items in the test should be homogeneous, and there should be a relationship among them.

• Split-Half Reliability

• Cronbach Alpha

• Kuder-Richardson Formulas

Page 12: Testing in language programs (chapter 8)

Split-Half Reliability
In split-half reliability we randomly divide all items that purport to measure the same construct into two sets. We administer the entire instrument to a sample of people and calculate the total score for each randomly divided half. The split-half reliability estimate, as shown in the figure, is simply the correlation between these two total scores. In the example it is .87. The halves can also be formed from odd and even items, with easy and difficult items equally distributed between them.
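A minimal sketch of the odd/even version of this procedure follows. The response matrix is invented (1 = correct, 0 = incorrect) and the helper name `correlate` is hypothetical; a random split would work the same way with a different choice of columns.

```python
from math import sqrt

def correlate(x, y):
    """Pearson correlation between two lists of half-test totals."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cross / sqrt(sum((a - mx) ** 2 for a in x) *
                        sum((b - my) ** 2 for b in y))

# Hypothetical data: one row per student, one column per item.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
]

# Items 1, 3, 5, 7 form one half; items 2, 4, 6, 8 form the other.
odd_totals = [sum(row[0::2]) for row in responses]
even_totals = [sum(row[1::2]) for row in responses]

half_test_r = correlate(odd_totals, even_totals)
print(f"correlation between halves: {half_test_r:.2f}")
```

This correlation describes a test only half as long as the full instrument, which is why it is normally adjusted afterwards.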

Page 13: Testing in language programs (chapter 8)

Spearman-Brown Prophecy Formula

$r_{kk} = \frac{k \, r_{11}}{1 + (k - 1) \, r_{11}}$

k = the number of items I WANT to estimate the reliability for, divided by the number of items I HAVE reliability for.
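A small sketch of the prophecy formula in code, reading r11 as the reliability already in hand and rkk as the estimate for a test k times as long. The function name `spearman_brown` is hypothetical, and the .87 value is the split-half correlation quoted in the split-half slide.

```python
def spearman_brown(r11, k):
    """Estimated reliability of a test k times as long as the one with reliability r11."""
    return (k * r11) / (1 + (k - 1) * r11)

# Split-half case: the full test has twice as many items as each half, so k = 2.
half_test_r = 0.87
full_test_r = spearman_brown(half_test_r, k=2)
print(f"full-test estimate: {full_test_r:.3f}")

# Shortening a test (k < 1) lowers the estimate instead.
print(f"half-length estimate: {spearman_brown(full_test_r, k=0.5):.3f}")
```

Applying the formula with k = 0.5 to the full-test estimate returns the original half-test value, which is a quick consistency check on the arithmetic.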

Page 14: Testing in language programs (chapter 8)

Cronbach Alpha
Cronbach's coefficient alpha is used when the item scores are other than 0 and 1 (such as Likert-scale items). It is advisable for essay items, problem-solving items and 5-point scaled items. It is based on two or more parts of the test and requires only one administration of the test.
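The slide gives no formula, so the sketch below assumes the usual form of coefficient alpha: k/(k - 1) times one minus the ratio of summed item variances to total-score variance. The Likert ratings are invented, and population variances are an assumption of this illustration.

```python
from statistics import pvariance

def cronbach_alpha(rows):
    """Coefficient alpha for a matrix with one row per person, one column per item."""
    k = len(rows[0])                       # number of items
    items = list(zip(*rows))               # transpose: one tuple of scores per item
    item_variance_sum = sum(pvariance(item) for item in items)
    total_variance = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)

# Hypothetical 5-point Likert ratings: five respondents, four items.
ratings = [
    [4, 5, 4, 5],
    [3, 3, 2, 3],
    [5, 4, 5, 5],
    [2, 2, 3, 2],
    [3, 4, 3, 4],
]

alpha = cronbach_alpha(ratings)
print(f"alpha = {alpha:.2f}")
```

Only one administration is needed: everything is computed from the item scores and the total scores of the same sitting.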

Page 15: Testing in language programs (chapter 8)

Kuder-Richardson Formulas
Kuder and Richardson believed that all items in a test are designed to measure a single trait. KR21 is the most practical, frequently used and convenient method of estimating reliability.

K-R20 = most advisable if the p values vary a lot.
K-R21 = most advisable if the items do not vary much in difficulty, i.e., the p values are more or less similar.

The KR21 formula is a simplified version of the KR20.
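The formulas themselves are not shown on the slide, so this sketch assumes their usual textbook forms: KR20 uses each item's p value (proportion correct), while KR21 needs only the number of items, the mean and the variance of total scores. The response matrix is invented and population variance is assumed.

```python
from statistics import mean, pvariance

def kr20(rows):
    """KR20 for 0/1 item scores; rows hold one student each."""
    k = len(rows[0])
    n = len(rows)
    p_values = [sum(item) / n for item in zip(*rows)]   # proportion correct per item
    sum_pq = sum(p * (1 - p) for p in p_values)
    total_variance = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum_pq / total_variance)

def kr21(rows):
    """KR21: the simplified version that treats all items as equally difficult."""
    k = len(rows[0])
    totals = [sum(row) for row in rows]
    m, variance = mean(totals), pvariance(totals)
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * variance))

# Hypothetical right/wrong responses: six students, eight items.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
]

kr20_value = kr20(responses)
kr21_value = kr21(responses)
print(f"KR20 = {kr20_value:.2f}, KR21 = {kr21_value:.2f}")
```

Because the p values in this made-up data vary a good deal, KR21 comes out lower than KR20, which matches the advice to prefer KR20 in that situation.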

Page 16: Testing in language programs (chapter 8)

Inter-rater Reliability
Having a sample of test papers (essays) scored independently by two examiners.

Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. It is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed.

Page 17: Testing in language programs (chapter 8)

Intra-rater Reliability
The degree of stability observed when a measurement is repeated under identical conditions by the same rater.
• Note: Intra-rater reliability makes it possible to determine the degree to which the results obtained by a measurement procedure can be replicated.

Page 18: Testing in language programs (chapter 8)

Standard Error of Measurement
• All test scores contain some error.
• For any test, the higher the reliability estimate, the lower the error.
• The standard error of measurement (S.E.M.) is the standard deviation of the error component of the scores: it indicates how far observed scores are expected to fall from true scores.
• It can be used to estimate a range within which a true score would likely fall.
• We never know the true score.
• By knowing the S.E.M. and by understanding the normal curve, we can assess the likelihood of the true score being within certain limits.
• The higher the reliability, the lower the standard error of measurement, and hence the more confidence we can place in the accuracy of a person's test score.
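To make the link between reliability and the S.E.M. concrete, the sketch below assumes the usual formula, S.E.M. = S × √(1 − r), where S is the standard deviation of the test scores and r the reliability estimate. The slide does not state this formula, and the numbers and function names are invented for illustration.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """S.E.M. from the score standard deviation and a reliability estimate."""
    return sd * sqrt(1.0 - reliability)

def true_score_band(observed, sem, z=1.0):
    """Range of observed score plus or minus z S.E.M. units."""
    return observed - z * sem, observed + z * sem

# Hypothetical test: standard deviation of 10 points, reliability of .91.
sem = standard_error_of_measurement(sd=10.0, reliability=0.91)
low, high = true_score_band(observed=75.0, sem=sem)
print(f"S.E.M. = {sem:.1f}; about 68% band for a score of 75: {low:.1f} to {high:.1f}")

# A less reliable test with the same spread gives a wider band.
weaker_sem = standard_error_of_measurement(sd=10.0, reliability=0.75)
print(f"S.E.M. at reliability .75 = {weaker_sem:.1f}")
```

Under the normal curve, a band of one S.E.M. on either side of the observed score covers roughly 68% of cases, and about two S.E.M. covers roughly 95%.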

Page 19: Testing in language programs (chapter 8)

Factors That Affect the Reliability Coefficient

• Test length
• Range of scores
• Item similarity

Page 20: Testing in language programs (chapter 8)

Questions?