Half a century of observing and doing language assessment and testing: Some lessons learned?

Sauli Takala, University of Gothenburg
September 21, 2010


Page 1

Half a century of observing and doing language assessment and testing: Some lessons learned?

Sauli Takala, University of Gothenburg

September 21, 2010

Page 2

• Lee J. Cronbach, one of the giants in measurement, testing and evaluation, gave a valedictory at the Board on Testing and Assessment's first-anniversary conference in 1994.

• It was intended to consist of casual reminiscences of the olden days. He was planning to say farewell and to convey how glad he was to see others fight the testing battles.

• He found that the plan for lightsome retrospection did not hold up: "Measurement specialists live in interesting times, and there are important matters to talk about." He was going to draw on past experiences extending back to his first involvement with a national educational assessment in 1939, but his topic was really "clear and present dangers".

• Is there some lesson for me/us here? What might be clear and present dangers in language testing and assessment?

Page 3

Some lightsome reminiscences of the past close to 50 years

• How it all started – 1965; from philology to testing/assessment
• Lado (1961) and Educational Measurement (1951) to the rescue
• Sweden – Torsten Lindblad and other Swedish colleagues
• IEA Six-Subject Survey (English, French): late 1960s/early 1970s
• IEA six-week international seminar on curriculum development in Gränna, August 1971; director of program: Benjamin S. Bloom (HFS)
• My main mentors: Robert Lado, John B. Carroll, Rebecca Valette, Alan Davies, Bernard Spolsky, Albert Pilliner, Sandra Savignon, Lyle Bachman, Charles A.; John Trim (CoE projects)
• Spolsky's AILA talk on "art or science" in 1976; AILA/Lund/Krashen
• Elana Shohamy's preliminary edition of a practical handbook (1985) – oral testing topical; her book was very helpful
• IEA International Study of Writing 1981-1984 (UIUC); L1; tested by Lyle; Ed Psych, Center for the Study of Reading
• ACTFL, ILR; FSI ("mother of all scales")
• DIALANG; Felly, Norman, Charles, Neus...; diagnosis, CEFR scales
• Ingrian returnees: different culture; computerized testing
• EALTA: Felly and Norman; Neus and Gudrun
• CoE: Manual, Reference Supplement; 2002-

Page 4

Some general reflections based on this experience:

• Language testing/assessment has developed new approaches reflecting developments in e.g. (applied) linguistics, psychology and test theory, and also in response to social/educational changes (norm-referenced, criterion-referenced, self-assessment, performance/authentic assessment...).
• Changes in the view of the purposes/uses of testing/assessment in education -> a broader/more responsive view (testing/assessment of, in, for education/learning)
• The stakeholders have become more varied and their role is better acknowledged -> increased transparency, co-operation
• Growing awareness of the impacts of testing/assessment -> standards, codes of ethics/good practice, international associations (ILTA, ALTE, EALTA...), washback, power, big business...
• Relationship between education and economic development: national assessments, international comparative assessments (new methodological challenges...)
• Technology and its impact on testing/assessment

Page 5

No man is an island .....

Pages 6-9: [image slides]

Page 10

J.L.M. Trim 2002

Page 11

EALTA Executive Committee: Turku, January 2009

Page 12

EALTA 6th Conference, Turku, June 2009

Page 13: [image slide]

Page 14

Some things I've learned – I think

Page 15

Some FAQs

• How many? – texts, tasks, items, raters...
• How difficult can/should tasks/items be?
• How well should tasks/items discriminate?
• How reliable should (sub)tests be?
• How high should agreement be among raters/judges?
• How many different test/assessment formats should be used?

What I have learned: no simple answers!! Some sensible guidelines...

Page 16

• Language testing and assessment is carried out in many contexts and for many purposes → there is a variety of forms/formats of testing and assessment.
• There is no one best test format, and there is hardly any inherently bad/wrong format either.
• Language testing cultures differ, and what is acceptable/normal in one testing culture may be taboo in another. Fashions also influence practice.
• The challenge is to make the format fit the purpose (cf. Gilbert & Sullivan, The Mikado).
• Thus, the purpose is where everything starts.

Page 17

There are many factors in test development

[Diagram of interacting factors: legal framework; examination framework; test specifications; test theory; applied linguistics; testing know-how; test construction; training of item writers; raters and interviewers; monitoring, feedback and research; test administration and rating of performances; the CEFR]

Page 18

More questions...

• So, why test/assess (at all)? Why not just teach and study? Functions of evaluation (of, in, for).
• Should good testing/assessment simply be the same as good teaching? Can they be basically identical? Do they have to fulfil the same criteria?
• How far should teaching determine testing/assessment, and how strongly should the assumed washback effect do so?
• What are reasonable criteria of good testing/assessment?
• Good questions? Do we have any good answers beyond "It depends"?

Page 19

A brief and non-technical definition of criteria of good practice in testing and assessment: all testing/assessment

• should give the test taker a good opportunity to show his/her language competence (this opportunity should be fair and perceived to be fair)
• should produce reliable information about real (de facto) language competence, so that the interpretations and decisions arrived at are appropriate and warranted
• should be practical, effective, economical, useful and beneficial

Page 20

A simple four-phase model of testing/assessment (1)

• Pre-response phase and its main tasks/activities: What?
• Response phase: How?
• Post-response phase: How good (are the results)?
• Reflective phase: How did it work? How can it be improved?

Page 21

A simple four-phase model (2)

Are the phases of equal importance?

• I argue they are not.
• The first phase, the quality of pre-response preparatory work, is the most important. It is the input to the subsequent phases. GIGO effect. Know-how counts!
• No statistical wizardry can turn a bad item/task into a good one. It's too late!
• Pilot testing/pretesting is not always possible. → All the more urgent: quality assurance from the start. Thorough "stimulus analysis".

Page 22

First phase: pre-response tasks

• planning
• specifications
• item/task review
• Are we reasonably happy about what we will ask learners/users to respond to?

Page 23

A simple four-phase model (3): Item/Task Review

• Cross-check drafts across languages: have all test developers who know a language sufficiently well read and comment on drafts (English, French, German, Spanish...) -> comparability
• Ask item writers and reviewers to discuss items using an agreed checklist, e.g.:
  – the difficulty of the passage on which the item is based
  – the clarity of the item (lack of clarity may lead to irrelevant impact on difficulty); a rough scale could be: very clear, not fully clear
  – the number of plausible options: 4, 3, 2, 1 (MC)
  – the amount of text that the correct option is based on: 1-2 sentences/lines, 3-5 sentences/lines, more than 5 sentences/lines
  – the amount of inference needed: very little, some, considerable
• Such analysis could lead to useful discussions (one way of recording the judgments is sketched below).
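One way to make such a checklist operational, sketched here purely as an illustration, is to store each reviewer's judgments as structured records so that ratings can be compared across reviewers. The record fields simply mirror the checklist above; the naming is mine, not part of the original procedure:

```python
from dataclasses import dataclass

@dataclass
class ItemReview:
    """One reviewer's judgment of one draft item; fields mirror the checklist."""
    item_id: str
    reviewer: str
    passage_difficulty: str   # judged difficulty of the passage the item is based on
    clarity: str              # rough scale: "very clear" / "not fully clear"
    plausible_options: int    # 4, 3, 2 or 1 (MC)
    evidence_span: str        # "1-2", "3-5" or ">5" sentences/lines
    inference_needed: str     # "very little" / "some" / "considerable"

# Items on which reviewers disagree (e.g. on clarity or the number of
# plausible options) are natural starting points for discussion.
```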

Page 24

F. Kaftandjieva

Phase 1 – Item review aims to provide information (based on experts' judgments) about test items/tasks in regard to their:

• content validity
• anticipated level of difficulty
• fairness
• technical quality

Page 25

F. Kaftandjieva

Item review: Fairness

1. Does the item contain any information that could be seen as offensive to any specific group?
2. Does the item include or imply any stereotypic depiction of any group?
3. Does the item portray any group as degraded in any way?
4. Does the item contain clues or information that could be seen to work to the benefit or detriment of any group?
5. Does the item contain any group-specific language or vocabulary (e.g., culture-related expressions, slang, or expressions that may be unfamiliar to examinees of either sex or of a particular age)?

GROUPS: gender, socio-economic, racial, regional, cultural, religious, ethnic, handicapped, age, other

Page 26

F. Kaftandjieva

Item review: Technical quality

• Does the item/task conform to the specifications in content and format?
• Are the directions clear, concise and complete?
• Is the item/task clear?
• Does the item/task conform to standard grammar and usage?
• Is the item/task independent of other items?
• Is the item/task free of unintended clues to the correct answer?
• Is the item/task free of tricky expressions or options?
• Is the item/task free of extraneous or confusing material?
• Is the item/task free of other obvious flaws in construction?
• Is the format unusual or complicated, such that it interferes with students' ability to answer the item correctly?
• Is a student's prior knowledge other than of the subject area being tested necessary to answer the item/task?
• Is the item/task content inaccurate or factually incorrect?

Page 27

A simple four-phase model (4)

What are the functions of the subsequent phases?

• Response phase: elicit/provide good, sufficient and fair samples of language performance
• Post-response phase: score/rate performances; sufficient agreement/reliability; sufficient statistical analyses
• Reflective phase: should not be neglected; vital for the development of know-how

Page 28

A simple four-phase model (5): How to develop all-round know-how

• One needs to be well read in the relevant literature (basic references, journals).
• One needs adequate basic statistical knowledge.
• One needs to be involved and interested in all phases.
• One needs continuous and concrete feedback on one's contributions.
• In sum: one needs (1) an adequate theoretical foundation, (2) solid practical evidence-based experience (feedback), and (3) reflection. Experience counts!

Page 29

A simple four-phase model (6): Avoiding pitfalls – some lessons learned

• Avoid asking questions which require a response of personal preferences, tastes, values (vulgar vs elegant: Felly: traffic item!).
• If a task/item proves difficult to construct and even after repeated efforts it still feels less than good, it is usually best to drop it, as it will usually not work. It is faster to write a new one! (vs. "Don't touch/abuse my items!")
• Beware of developing a narrow routine, using the same kind of approach ("personal fingerprint").

Page 30

Useful to know/be aware of...

Balancing between opportunities and constraints

Page 31

Traditional view of what the reliability of a test depends on:

1) The length of the test. More evidence, more reliable. But increasing length is no certain guarantee; quality counts.
2) How well the items/tasks discriminate. Discrimination varies depending on item difficulty. Good discrimination vs. "appropriate" difficulty?
3) How homogeneous/heterogeneous the test takers are as a group. More variance -> higher reliability.

Page 32

Ebel (1965, 89): appropriate difficulty; traditional approach

• Items of intermediate difficulty (p = 50%) discriminate well and enhance reliability.
• Due to the possibility of guessing, the ideal p-value is:
  – 75% for true-false items
  – 67% for 3-alternative MC items
  – 62% for 4-alternative MC items
• Ebel, Robert L. (1965/1979) Essentials of Educational Measurement. Englewood Cliffs, NJ: Prentice-Hall. (an excellent basic reference)
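A plausible reconstruction of the arithmetic behind these percentages (Ebel's own derivation may differ in detail): the ideal difficulty lies midway between the chance score and a perfect score, so for an item with k options,

```latex
p_{\text{ideal}} = \frac{1}{2}\left(1 + \frac{1}{k}\right)
% k = 2 (true-false):       (1 + 1/2)/2 = 0.75
% k = 3 (3-alternative MC): (1 + 1/3)/2 \approx 0.67
% k = 4 (4-alternative MC): (1 + 1/4)/2 \approx 0.62
```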

Page 33

Relationship between item difficulty and maximum discrimination power:

  Item per cent correct   Maximum discrimination
  95% or 5%               .19
  90% or 10%              .37
  85% or 15%              .55
  80% or 20%              .74
  75% or 25%              .92

One lesson learned: aim at well-discriminating items/tasks.

Page 34

Test length is related to reliability

Measurement error due to items (+/-) balances out better with more items: results from longer tests can usually be relied on more than results from shorter tests. Spearman-Brown prediction formula, for example: a 25-item test with a reliability of .70 -> 35 items: .766; 43 items: .80 (reaching .90 would require roughly 96 items). Such additional items need to be similar to the original items, i.e. homogeneous complementation. No free or even cheap lunch! By the same token, a 43-item test with a reliability of .80 can be homogeneously reduced to a 25-item test with a reliability of .70. (The risk of trying to economise.)
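The Spearman-Brown prediction formula behind these figures is rel_new = n·rel / (1 + (n - 1)·rel), where n is the ratio of new test length to old test length. A minimal sketch (the function name is mine):

```python
def spearman_brown(reliability: float, length_ratio: float) -> float:
    """Predicted reliability when a test is lengthened (or shortened)
    by `length_ratio`, assuming the added items are homogeneous."""
    n = length_ratio
    return n * reliability / (1 + (n - 1) * reliability)

# Lengthening a 25-item test with reliability .70:
print(spearman_brown(0.70, 35 / 25))  # ~0.766
print(spearman_brown(0.70, 43 / 25))  # ~0.80
# Shortening a 43-item test with reliability .80 back to 25 items:
print(spearman_brown(0.80, 25 / 43))  # ~0.70
```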

Page 35

Relationship between reliability and the number of cut scores/performance groups:

  Number of categories    2    3    4    5    6    7    8    9    10   11   12   13   14
  Number of cut scores    1    2    3    4    5    6    7    8    9    10   11   12   13
  Required reliability ≥ .800 .900 .941 .962 .973 .980 .985 .987 .990 .992 .993 .994 .995

Source: Felianka Kaftandjieva, 2008 (2010)

Page 36

The importance of good discrimination: an example (Listening + Reading Comprehension)

                             Adv.Eng  Int.Swe  Adv.Fin  Sh.Ger  Sh.Fre  Sh.Rus  Sh.Spa
                             2001     2005     2005     2005    2005    2005    2005
                             N  Rel.  N  Rel.  N  Rel.  N  Rel. N  Rel. N  Rel. N  Rel.
  Whole test                 60 .892  55 .881  40 .778  46 .832 50 .868 63 .881 50 .912
  50% of items               30 .801  28 .787  20 .654  23 .686 25 .759 31 .779 25 .819
  Delete items < .30 discr.  50 .904  45 .890  15 .669  24 .812 27 .858 42 .873 38 .912
  Delete items < .40 discr.  32 .883  29 .875  –  –     13 .809 30 .844 27 .899 –  –

If several cut scores are to be set, some well-discriminating (rather/very) easy and difficult items are needed.
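The procedure behind the last two rows is standard classical item analysis: estimate reliability (Cronbach's alpha), compute each item's corrected item-total (item-rest) correlation as the discrimination index, drop the weakly discriminating items, and re-estimate. A minimal sketch, assuming `scores` is a test-takers × items matrix of 0/1 item scores (the function names are mine):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (test-takers x items) score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def item_rest_correlations(scores: np.ndarray) -> np.ndarray:
    """Corrected item-total correlation: each item vs. the sum of the rest."""
    total = scores.sum(axis=1)
    return np.array([
        np.corrcoef(scores[:, i], total - scores[:, i])[0, 1]
        for i in range(scores.shape[1])
    ])

def drop_weak_items(scores: np.ndarray, threshold: float = 0.30) -> np.ndarray:
    """Keep only items whose discrimination meets the threshold."""
    return scores[:, item_rest_correlations(scores) >= threshold]

# alpha_before = cronbach_alpha(scores)
# alpha_after = cronbach_alpha(drop_weak_items(scores, 0.30))
```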

Page 37

[Bar chart: percentage choosing each option A/B/C, by score group]

Swedish B: LC/2004/Item 1 (n = 14231; P = 71%; D = +0.37)

Page 38

[Bar chart: percentage choosing each option A/B/C, by score group]

Swedish B, autumn 2004: LC/Item 2 (n = 14231; P = 68%; D = +0.20)

Page 39

L01 * LmcGr10 crosstabulation: option chosen on Item 1 (L01) by total-score group (LmcGr10; 1 = lowest, 10 = highest), counts and column percentages

  Option        1      2      3      4      5      6      7      8      9      10     Total
  (blank)       0      0      0      0      1      0      0      0      0      0      1
  A (key)  n    604    879    588    721    1498   773    1614   771    1317   1327   10092
           %    32.8   50.8   56.9   66.3   71.0   80.4   85.9   93.6   94.3   97.3   70.9
  B        n    574    337    153    114    197    62     90     14     28     7      1576
           %    31.2   19.5   14.8   10.5   9.3    6.5    4.8    1.7    2.0    0.5    11.1
  C        n    664    515    293    252    415    126    175    39     51     30     2560
           %    36.0   29.8   28.3   23.2   19.7   13.1   9.3    4.7    3.7    2.2    18.0
  Total    n    1842   1731   1034   1087   2111   961    1879   824    1396   1364   14229

Swedish B: LC, 2004, Item 1 (n = 14231; P = 71%; D = +0.37)
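A crosstab like the one above takes only a few lines to produce. This sketch assumes a data frame with one row per test taker, a column `answer` holding the option chosen and a column `total` holding the total test score; the column names are illustrative (the original output is from SPSS), and the score bands are formed here as simple deciles:

```python
import pandas as pd

def option_by_score_group(df: pd.DataFrame) -> pd.DataFrame:
    """Cross-tabulate the option chosen against total-score group,
    shown as column percentages (cf. the crosstab above)."""
    group = pd.qcut(df["total"], 10, labels=range(1, 11))
    return pd.crosstab(df["answer"], group, normalize="columns") * 100

# A well-discriminating item shows the keyed option's share rising
# steadily across the score groups (here from about 33% to 97%).
```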

Page 40

The importance of being good

• having good specifications to work on (a blueprint)
• having an appropriate selection of tasks: reading, listening, speaking, writing, interaction; choice of topics, text types/genres
• having an appropriate selection of cognitive operations (levels of processing)

Page 41

The importance of being good: Some ways of doing this

• having clear instructions for tasks
• having an appropriate range of test/assessment formats
• having relevant scoring criteria (prepared simultaneously with tasks, not afterwards; revise if necessary); inform test takers of them in an appropriate manner
• having good competence in scoring/rating (training, feedback – experience)

Page 42

F. Kaftandjieva

Scoring criteria should be:

• easily understood
• relevant to the learning outcome
• compatible with other criteria used in the rubric
• precise
• representative of the vocabulary of the discipline
• observable, requiring minimal interpretation
• unique, not overlapping with another criterion or trait

Page 43

F. Kaftandjieva

The 3R Formula of Item Quality

Write → Review → Revise → (repeat)

Page 44

Text coverage – readability/comprehensibility of texts. How do vocabulary size, text length and sample size influence the stability of estimating coverage? 23 samples from the British National Corpus and 26 texts of different lengths from the Time Almanac, estimated using 10 different sample sizes with 1000 iterations; means and standard deviations. Text coverage is estimated more stably when the vocabulary size is larger, the texts are longer and several samples are used. Stability is greater when several shorter texts are used rather than fewer longer texts. At intermediate-level proficiency (3,000 words), one long text would require about 1,750 words to reach sufficient stability, while the same result can be achieved with 4 texts of 250 words (1,000 words) or with 9 texts of 50 words (450 words).

-> It seems to pay to have several texts of varying length.

Chujo, K. & Utiyama, M. (2005) Understanding the role of text length, sample size and vocabulary size in determining text coverage. Reading in a Foreign Language, Vol. 17, No. 1, April 2005 (online journal at http://nflrc.hawaii.edu/rfl)
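The resampling logic of the study can be sketched as follows. This is only an illustration of the idea, not Chujo & Utiyama's actual procedure; `tokens` (a tokenized text collection) and `vocab` (e.g. the 3,000 most frequent word families) are assumed inputs:

```python
import random

def coverage(sample: list[str], vocab: set[str]) -> float:
    """Share of tokens in the sample that fall inside the vocabulary."""
    return sum(token in vocab for token in sample) / len(sample)

def coverage_stability(tokens: list[str], vocab: set[str],
                       text_len: int, n_texts: int, iters: int = 1000):
    """Mean and SD of the coverage estimate over `iters` draws of
    `n_texts` samples of `text_len` tokens each (cf. the 1000 iterations above)."""
    estimates = []
    for _ in range(iters):
        samples = [random.sample(tokens, text_len) for _ in range(n_texts)]
        estimates.append(sum(coverage(s, vocab) for s in samples) / n_texts)
    mean = sum(estimates) / iters
    sd = (sum((e - mean) ** 2 for e in estimates) / iters) ** 0.5
    return mean, sd

# e.g. compare one 1750-word text vs. four 250-word texts:
# coverage_stability(tokens, vocab, 1750, 1)
# coverage_stability(tokens, vocab, 250, 4)
```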

Page 45

Writing the choices: Use as Many Distractors as Possible but Three Seems to be a Natural Limit

A growing body of research supports the use of three options for conventional MC items (Andres & del Castillo, 1990; Bruno & Dirkzwager, 1995; Downing & Haladyna, 1997; Landrum, Cashin & Theis, 1993; Lord, 1977; Rodriguez, 1997; Sax & Michael, 1991; Trevisan, Sax & Michael, 1991, 1994). One issue is the way distractors perform with test-takers. A good distractor should be selected by low achievers and ignored by high achievers. As chapters 8 and 9 show, a science of option response validation exists and is expanding to include more graphical methods.

To summarize this research on the correct number of options, evidence exists to suggest a slight advantage to having more options per test item, but only if each distractor is doing its job. Haladyna & Downing (1996) found that the number of useful distractors per item, on average, for a well-developed standardized test was between one and two. Another implication of this research is that three options may be a natural limit for most MC items. Thus, item writers are often frustrated in finding a useful fourth or fifth option because they do not exist ("the option of despair").

Page 46

The advice given here is that one should write as many good distractors as one can, but should expect that only one or two will really work as intended. It does not matter how many distractors one produces for any given MC item, but it does matter that each distractor performs as intended. This advice runs counter to most standardized testing programs. Customarily, answer sheets are used with a predetermined number of options, such as four or five. However, both theory and research support the use of one or two distractors, so the existence of nonperforming distractors is nothing more than window dressing. Thus, test developers have the dilemma of producing unnecessary distractors, which do not operate as they should, for the appearance of the test, versus producing tests with varying numbers of options.

One criticism of using fewer instead of more options for an item is that guessing plays a greater role in determining a student's score. The use of fewer distractors will increase the chances of a student guessing the right answer. However, the probability that a test-taker will increase his or her score significantly over a 20-, 50-, or 100-item test by pure guessing is infinitesimal. The floor of a test containing three options per item, for a student who lacks knowledge and guesses randomly throughout the test, is 33% correct. Therefore, administering more test items will reduce the influence of guessing on the total test score. This logic is sound for two-option items as well, because the floor of the scale is 50% and the probability of a student making 20, 50, or 100 successful randomly correct guesses is very close to zero. (pp. 89-90)
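The "very close to zero" claim is easy to verify with the binomial distribution: under pure guessing with three options (p = 1/3), the probability of landing well above the 33% floor collapses as the test lengthens. A quick check (using 50% correct as the cut is my choice of illustration):

```python
from math import comb

def p_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): pure guessing on n items."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Probability of guessing at least half the items right with p = 1/3:
for n in (20, 50, 100):
    print(n, p_at_least(n // 2, n, 1 / 3))  # falls from ~0.09 to ~0.0003
```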

Page 47

F. Kaftandjieva

To summarize:

There is NO item format that is appropriate for all purposes and on all occasions.

Page 48

Instead of conclusions

Reflections

Page 49

Reflect on:

• what the key requirement is (fair/equal?)
• what test developers do – and ought to do
• theory & practice (hen – egg)
• comprehension – indicators of demonstrated competence (cf. Wittgenstein on inner processes)
• writing – usually testing drafting competence?
• optimization – satisficing (good practice; good enough; improvement) (Herbert Simon)
• optimization – avoiding avoidable pitfalls
• checklists, templates
• keeping everything as simple as possible
• first-hand experience – evidence of having made progress

Page 50