
TOPIC 1 OVERVIEW OF ASSESSMENT: CONTEXT, ISSUES AND TRENDS

1.0 SYNOPSIS

Topic 1 provides you with some meanings of test, measurement, evaluation

and assessment, some basic historical development in language assessment,

and the changing trends of language assessment in the Malaysian context.

1.1 LEARNING OUTCOMES

By the end of this topic, you will be able to:

1. define and explain the important terms of test, measurement, evaluation, and assessment;

2. examine the historical development in Language Assessment;

3. describe the changing trends in Language Assessment in the Malaysian context and discuss the contributing factors.

1.2 FRAMEWORK OF TOPICS


CONTENT

SESSION ONE (3 hours)

1.3 INTRODUCTION

Assessment and examinations are viewed as highly important in most Asian

countries such as Malaysia. Language tests and assessment have also

become a prevalent part of our education system. Often, public examination

results are taken as important national measures of school accountability.

While schools are ranked and classified according to their students’

performance in major public examinations, scores from language tests are

used to infer individuals’ language ability and to inform decisions we make

about those individuals.

In this topic, let us discuss the concept of measurement and its numerous definitions. We will also look into the historical development of language assessment and the changing trends of language assessment in our country.

1.4 DEFINITION OF TERMS – test, measurement, evaluation, and assessment.

1.4.1 Test

The four terms above are frequently used interchangeably in academic discussions. A test is a subset of assessment intended to measure a test-taker's language proficiency, knowledge, performance or skills. Testing is a type of assessment technique. It is a systematically prepared procedure that happens at a point in time, when a test-taker gathers all his abilities to achieve his ultimate performance because he knows that his responses are being evaluated and measured. A test is, first, a method of measuring a test-taker's ability, knowledge or performance in a given area; second, it must measure.


Bachman (1990), who was also quoted by Brown, defined a test as a process of quantifying a test-taker's performance according to explicit procedures or rules.

1.4.2 Assessment

Assessment is every so often a misunderstood term. Assessment is 'a comprehensive process of planning, collecting, analysing, reporting, and using information on students over time' (Gottlieb, 2006, p. 86). Mousavi (2009) is of the opinion that assessment is 'appraising or estimating the level of magnitude of some attribute of a person'. Assessment is an important aspect in the fields of language testing and educational measurement and perhaps the most challenging part of it. It is an ongoing process in educational practice, which involves a multitude of methodological techniques. It can consist of tests, projects, portfolios, anecdotal information and student self-reflection. A test may be assessed formally or informally, subconsciously or consciously, as well as incidentally or intentionally by an appraiser.

1.4.3 Evaluation

Evaluation is another confusing term. Many are confused between evaluation and testing. Evaluation does not necessarily entail testing. In reality, evaluation is involved when the results of a test (or other assessment procedure) are used for decision-making (Bachman, 1990, pp. 22-23). Evaluation involves the interpretation of information. If a teacher simply records numbers or makes check marks on a chart, that does not constitute evaluation. When a tester or marker evaluates, s/he "values" the results in such a way that the worth of the performance is conveyed to the test-taker. This is usually done with some reference to the consequences, either good or bad, of the performance. This is commonly practised in applied linguistics research, where the focus is often on describing processes, individuals, and groups, and the relationships among language use, the language use situation, and language ability.


Test scores are an example of measurement, and conveying the

“meaning” of those scores is evaluation. However, evaluation can occur

without measurement. For example, if a teacher appraises a student’s correct

oral response with words like "Excellent insight, Lilly!", it is evaluation.

1.4.4 Measurement

Measurement is the assigning of numbers to certain attributes of

objects, events, or people according to a rule-governed system. For our

purposes of language testing, we will limit the discussion to unobservable

abilities or attributes, sometimes referred to as traits, such as grammatical

knowledge, strategic competence or language aptitude. Similar to other types of assessment, measurement must be conducted according to explicit rules and procedures as spelled out in test specifications, criteria, and procedures for scoring. Measurement could be interpreted as the process of quantifying the

observed performance of classroom learners. Bachman (1990) cautioned us

to distinguish between quantitative and qualitative descriptions. Simply put,

the former involves assigning numbers (including rankings and letter grades)

to observed performance, while the latter consists of written descriptions, oral

feedback, and non-quantifiable reports.

The relationships among test, measurement, assessment, and their

uses are illustrated in Figure 1.

Figure 1: The relationship between tests, measurement and assessment. (Source: Bachman, 1990)


2.0 Historical development in language assessment

From the mid-1960s through the 1970s, language testing practice, as reflected in large-scale institutional language testing and in most language testing textbooks of the time, was informed essentially by a theoretical view of language ability as consisting of skills (listening, speaking, reading and writing) and components (e.g. grammar, vocabulary, pronunciation), and by an approach to test design that focused on testing isolated 'discrete points' of language, while the primary concern was with psychometric reliability (e.g. Lado, 1961; Carroll, 1968). Language testing research was dominated largely by the hypothesis that language proficiency consisted of a single unitary trait, and by a quantitative, statistical research methodology (Oller, 1979).

The 1980s saw other areas of expansion in language testing, most importantly, perhaps, in the influence of second language acquisition (SLA) research, which spurred language testers to investigate not only the effects on language test performance of a wide variety of factors such as field independence/dependence (e.g. Stansfield and Hansen, 1983; Hansen, 1984; Chapelle, 1988), academic discipline and background knowledge (e.g. Erickson and Molly, 1983; Alderson and Urquhart, 1985; Hale, 1988) and discourse domains (Douglas and Selinker, 1985), but also the strategies involved in the process of test-taking itself (e.g. Grotjahn, 1986; Cohen, 1987).

If the 1980s saw a broadening of the issues and concerns of language

testing into other areas of applied linguistics, the 1990s saw a continuation of

this trend. In this decade the field also witnessed expansions in a number of

areas:

a) research methodology;

b) practical advances;

c) factors that affect performance on language tests;

d) authentic, or performance, assessments; and

e) concerns with the ethics of language testing and professionalising the field.

The beginning of the new millennium is another exciting time for anyone

interested in language testing and assessment research. Current

developments in the fields of applied linguistics, language learning and

pedagogy, technological innovation, and educational measurement have

opened up some rich new research avenues.

3.0 Changing trends in Language Assessment-Malaysian context

History has clearly shown that teaching and assessment should be intertwined in education. Assessment and examinations are viewed as highly important in Malaysia. One does not need to look very far to see how important testing and assessment have become in our education system.

Often, public examination results are taken as important national measures of

school accountability. Schools are ranked and classified according to their

students’ performance in major public examinations. Just as assessment

impacts student learning and motivation, it also influences the nature of

instruction in the classroom. There has been considerable recent literature that

has promoted assessment as something that is integrated with instruction, and

not an activity that merely audits learning (Shepard, 2000). When assessment

is integrated with instruction, it informs teachers about what activities and

assignments will be most useful, what level of teaching is most appropriate,

and how summative assessments provide diagnostic information.

With this in mind, we have to look at the changing trends in assessment

particularly language assessment in this country, which has been carried out

mainly through the examination system until recent years. Starting from the

year 1845, written tests in schools were introduced for a number of subjects.

This trend in assessment continued with the intent to gauge (determine) the

effectiveness of the teaching-learning process. In Malaysia, the development

of formal evaluation and testing in education began after Independence.

Public examinations have long been the only measurement of students’


achievement. Figure 2 shows the stages/phases of development of the examination system in our country. The stages are as follows:

Pre-Independence

Razak Report

Rahman Talib Report

Cabinet Report

Malaysia Education Blueprint (2013-2025)

On 3rd May 1956, the Examination Unit (later known as Examination

Syndicate) in the Ministry of Education (MOE) was formed on the

recommendation of the Razak Report (1956). The main objective of the

Malaysia Examination Syndicate (MES) was to fulfil one of the Razak Report’s

recommendations, which was to establish a common examination system for

all the schools in the country.

In line with the on-going transformation of the national educational

system, the current scenario is gradually changing. A new evaluation system

known as the School Based Assessment (SBA) was introduced in 2002 as a

move away from traditional teaching to keep abreast with changing trends of

assessment and to gauge the competence of students by taking into

consideration both academic and extra-curricular achievements.

According to the Malaysian Ministry of Education (MOE), the new

assessment system aims to promote a combination of centralised and school-

based assessment. Malaysian Teacher Education Division (TED) is entrusted

by the Ministry of Education to formulate policies and guidelines to prepare

teachers for the new implementation of assessment. As emphasised in the

innovation of the student assessment, continuous school-based assessment is

administered at all grades and all levels. Additionally, students sit for common

public examinations at the end of each level. It is also a fact that the role of

teachers in the new assessment system is vital. Teachers will be empowered in assessing their students.


The Malaysia Education Blueprint was launched in September 2013, and with it, a three-wave initiative to revamp the education system over the next 12 years. One of its main focuses is to overhaul the national curriculum and examination system, widely seen as heavily content-based and insufficiently holistic. It is a timely move, given our poor results in the 2009 Programme for

International Student Assessment (PISA) tests. Based on the 2009

assessment, Malaysia lags far behind regional peers like Singapore, Japan,

South Korea, and Hong Kong in every category.

Poor performance in PISA is normally linked to students not being able to demonstrate higher order thinking skills. To remedy this, the Ministry of

Education has started to implement numerous changes to the examination

system. Two out of the three nationwide examinations that we currently

administer to primary and secondary students have gradually seen major

changes. Generally, the policies are ideal and impressive, but there are still a

few questions on feasibility that have been raised by concern parties.

Figure 2 below shows the development of educational evaluation in Malaysia

from pre-Independence until today.


Pre-Independence
Examinations were conducted according to the needs of schools or based on overseas examinations such as the Overseas School Certificate.

Implementation of the Razak Report (1956)
The Razak Report gave birth to the National Education Policy and the creation of the Examination Syndicate (LP). The LP conducted examinations such as the Cambridge examinations, the Malayan Secondary School Entrance Examination (MSSEE), and the Lower Certificate of Education (LCE) Examination.

Implementation of the Rahman Talib Report (1960)
The Rahman Talib Report recommended the following actions:
1. Extend the schooling age to 15 years old.
2. Automatic promotion to higher classes.
3. Multi-stream education (Aneka Jurusan).
The following changes in examinations were made:
- The entry of elective subjects in the LCE and SRP.
- The introduction of the Standard 5 Evaluation Examination.
- The introduction of Malaysia's Vocational Education Examination.
- The introduction of the Standard 3 Diagnostic Test (UDT).

Implementation of the Cabinet Report (1979)
The implementation of the Cabinet Report resulted in the evolution of the education system to its present state, especially with KBSR and KBSM. Adjustments were made in examinations to fulfil the new curriculum's needs and to ensure it is in line with the National Education Philosophy.

Implementation of the Malaysia Education Blueprint (2013-2025)
The emphasis is on School-Based Assessment (SBA), which was first introduced in 2002. It is a new system of assessment and one of the new areas where teachers are directly involved. The national examination and school-based assessments are being revamped in stages, whereby by 2016 at least 40% of the questions in the Ujian Penilaian Sekolah Rendah (UPSR) and 50% in the Sijil Pelajaran Malaysia (SPM) will be higher order thinking skills questions.

Figure 2: The development of educational evaluation in Malaysia
Source: Malaysia Examination Board (MES), http://apps.emoe.gov.my/1pm/maklumatam.htm

By and large, the role of the MES is to complement and complete the implementation of the national education policy. Its achievements are listed in Figure 3.

Figure 3: The achievements of the Malaysia Examination Syndicate (MES)
Source: Malaysia Examination Board (MES), http://apps.emoe.gov.my/1pm/maklumatam.htm

Exercise

Describe the stages involved in the development of educational evaluation in Malaysia.


Read more: http://www.nst.com.my/nation/general/school-based-assessment-plan-may-need-tweaking-1.166386

Tutorial question

Examine the contributing factors to the changing trends of language assessment. Create and present findings using graphic organisers.

TOPIC 2 ROLE AND PURPOSES OF ASSESSMENT IN TEACHING AND LEARNING

2.0 SYNOPSIS

Topic 2 provides you with an insight into the reasons/purposes of assessment. It

also looks at the different types of assessments and the classifications of tests

according to their purpose.

2.1 LEARNING OUTCOMES

By the end of this topic, you will be able to:

4. explain the reasons/purposes of assessment;

5. distinguish the differences between assessment of learning and assessment for learning;

6. name and differentiate the different test types.

2.2 FRAMEWORK OF TOPICS

Role and Purposes of Assessment in Teaching and Learning: Reasons/Purposes of Assessment; Assessment of Learning and Assessment for Learning; Types of Tests (Proficiency, Achievement, Diagnostic, Aptitude, and Placement Tests)


CONTENT

SESSION TWO (3 hours)

2.3 Reasons/Purpose of Assessment

Critical to educators is the use of assessment to both inform and guide

instruction. Using a wide variety of assessment tools allows a teacher to

determine which instructional strategies are effective and which need to be

modified. In this way, assessment can be used to improve classroom practice,

plan curriculum, and research one's own teaching practice. Of course,

assessment will always be used to provide information to children, parents,

and administrators. In the past, this information was primarily expressed by a

"grade". Increasingly, this information is being seen as a vehicle to empower

students to be self-reflective learners who monitor and evaluate their own

progress as they develop the capacity to be self-directed learners. In addition

to informing instruction and developing learners with the ability to guide their

own instruction, assessment data can be used by a school district to measure

student achievement, examine the opportunity for children to learn, and

provide the basis for the evaluation of the district's language programmes.

Assessment instruments, whether formal tests or informal assessments,

serve multiple purposes. Commercially designed and administered tests may

be used for measuring proficiency, placing students into one of several levels

of course, or diagnosing students’ strengths and weaknesses according to

specific linguistic categories, among other purposes. Classroom-based


teacher-made tests might be used to diagnose difficulty or measure

achievement in a given unit of a course. Specifying the purpose of an

assessment instrument and stating its objectives are essential first steps in

choosing, designing, revising, or adapting the procedure an educator will

finally use.

We need to rethink the role of assessment in effective schools, where "effective" means maximising learning for the greatest number of students. What uses of assessment are most likely to maximise student learning and well-being? How best can we use assessment in the service of student learning and well-being?

We have a traditional answer to these questions.  Our traditional answer says

that to maximise student learning we need to develop rigorous standardised

tests given once a year to all students at approximately the same time.  Then,

the results are used for accountability, identifying schools for additional

assistance, and certifying the extent to which individual students are “meeting

competency.”

Let us take a closer look at the two assessments below i.e. Assessment

of Learning and Assessment for Learning.

2.4 Assessment of Learning

Assessment of learning is the use of a task or an activity to measure,

record, and report on a student's level of achievement with regard to specific

learning expectations.

This traditional way of using assessment in the service of student

learning is assessment of learning - assessments that take place at a point in

time for the purpose of summarising the current status of student

achievement.  This type of assessment is also known as summative

assessment.

This summative assessment, the logic goes, will provide the focus to

improve student achievement, give everyone the information they need to


improve student achievement, and apply the pressure needed to motivate

teachers to work harder to teach and learn.

2.5 Assessment for Learning

Now compare this to assessment for learning.  Assessment for learning

is roughly equivalent (the same) to formative assessment - assessment

intended to promote further improvement of student learning during the

learning process.

Assessment for learning is more commonly known as formative and

diagnostic assessments.  Assessment for learning is the use of a task or an

activity for the purpose of determining student progress during a unit or block

of instruction.  Teachers are now afforded the chance to adjust classroom

instruction based upon the needs of the students.  Similarly, students are

provided valuable feedback on their own learning.  

Formative assessment is not a new idea to us as educators.  However,

during the past several years there has been literally an explosion of

applications linked to sound research. In this evolving conception, formative

assessment is more than testing frequently, although frequent information is

important.  Formative assessment also involves actually adjusting teaching to

take account of these frequent assessment results. Nonetheless (however),

formative assessment is even more than using information to plan next steps. 

Formative assessment seems to be most effective when students are involved

in their own assessment and goal setting.

2.6 Types of tests

The most common use of language tests is to identify strengths and

weaknesses in students’ abilities. For example, through testing we can

discover that a student has excellent oral abilities but a relatively low level of

reading comprehension. Information gleaned from tests also assists us in

deciding who should be allowed to participate in a particular course or

programme area. Another common use of tests is to provide information


about the effectiveness of programmes of instruction.

Henning (1987) identifies six kinds of information that tests provide about

students. They are:

o Diagnosis and feedback

o Screening and selection

o Placement

o Program evaluation

o Providing research criteria

o Assessment of attitudes and socio-psychological differences

Alderson, Clapham and Wall (1995) have a different classification

scheme. They sort tests into these broad categories: proficiency,

achievement, diagnostic, progress, and placement. Brown (2010), however,

categorised tests according to their purpose, namely achievement tests,

diagnostic tests, placement tests, proficiency test, and aptitude tests.

Proficiency Tests

Proficiency tests are not based on a particular curriculum or language

programme. They are designed to assess the overall language ability of

students at varying levels. They may also tell us how capable a

person is in a particular language skill area. Their purpose is to describe what

students are capable of doing in a language.

Proficiency tests are usually developed by external bodies such as

examination boards like Educational Testing Services (ETS) or Cambridge

ESOL. Some proficiency tests have been standardised for international use,

such as the American TOEFL test which is used to measure the English

language proficiency of foreign college students who wish to study in North-

American universities or the British-Australian IELTS test designed for those

who wish to study in the United Kingdom or Australia (Davies et al., 1999).


Achievement Tests

Achievement tests are similar to progress tests in that their purpose is

to see what a student has learned with regard to stated course outcomes.

However, they are usually administered at mid-and end- point of the semester

or academic year. The content of achievement tests is generally based on the

specific course content or on the course objectives. Achievement tests are

often cumulative, covering material drawn from an entire course or semester.

Diagnostic Tests

Diagnostic tests seek to identify those language areas in which a

student needs further help. Harris and McCann (1994, p. 29) point out that

where “other types of tests are based on success, diagnostic tests are based

on failure.” The information gained from diagnostic tests is crucial for further

course activities and providing students with remediation. Because diagnostic

tests are difficult to write, placement tests often serve a dual function of both

placement and diagnosis (Harris & McCann, 1994; Davies et al., 1999).

Aptitude Tests

This type of test no longer enjoys the widespread use it once had. An

aptitude test is designed to measure general ability or capacity to learn a

foreign language a priori (before taking a course) and ultimate predicted

success in that undertaking. Language aptitude tests were seemingly

designed to apply to the classroom learning of any language. In the United

States, two common standardised language aptitude tests once used were the

Modern Language Aptitude Test (MLAT; Carroll & Sapon, 1958) and the

Pimsleur Language Aptitude Battery (PLAB; Pimsleur, 1966). Since there is

no research to show unequivocally that these kinds of tasks predict

communicative success in a language, apart from untutored language

acquisition, standardised aptitude tests are seldom used today with the


exception of identifying foreign language disability (Stansfield & Reed, 2004).

Progress Tests

These tests measure the progress that students are making towards

defined course or programme goals. They are administered at various stages

throughout a language course to see what the students have learned, perhaps

after certain segments of instruction have been completed. Progress tests are

generally teacher produced and are narrower in focus than achievement tests

because they cover a smaller amount of material and assess fewer objectives.

Placement Tests

These tests, on the other hand, are designed to assess students’ level

of language ability for placement in an appropriate course or class. This type

of test indicates the level at which a student will learn most effectively. The

main aim is to create groups, which are homogeneous in level. In designing a

placement test, the test developer may choose to base the test content either

on a theory of general language proficiency or on learning objectives of the

curriculum. In the former, institutions may choose to use a well-established

proficiency test such as the TOEFL or IELTS exam and link it to curricular

benchmarks. In the latter, tests are based on aspects of the syllabus taught at

the institution concerned.

In some contexts, students are placed according to their overall rank in

the test results. At other institutions, students are placed according to their

level in each individual skill area. Elsewhere, placement test scores are used

to determine if a student needs any further instruction in the language or could

matriculate directly into an academic programme.

Discuss and present the various types of tests and assessment tasks that students have experienced.

Discuss the extent to which tests or assessment tasks serve their purpose.


The end of the topic. Happy reading!

TOPIC 3 BASIC TESTING TERMINOLOGY

3.0 SYNOPSIS

Topic 3 provides input on basic testing terminology. It looks at the definitions, purposes and differences of various tests.

3.1 LEARNING OUTCOMES

By the end of this topic, you will be able to:

7. explain the meaning and purpose of different types of language tests;

8. compare Norm-Referenced and Criterion-Referenced Tests, Formative and Summative Tests, and Objective and Subjective Tests.

3.2 FRAMEWORK OF TOPICS

Types of Tests: Norm-Referenced and Criterion-Referenced; Formative and Summative; Objective and Subjective

CONTENT

SESSION THREE (3 hours)

3.3 Norm-Referenced Test (NRT)

According to Brown (2010), in NRTs an individual test-taker’s score is

interpreted in relation to a mean (average score), median (middle score),

standard deviation (extent of variance in scores), and/or percentile rank. The

purpose of such tests is to place test-takers along a mathematical continuum

in rank order. In a test, scores are commonly reported back to the test-taker in

the form of a numerical score (for example, 250 out of 300) and a percentile rank (for instance, 78 percent), which denotes that the test-taker's score was higher than those of 78 percent of the total number of test-takers and lower than those of the remaining 22 percent in that administration. In other words, an NRT is administered to compare

an individual's performance with his peers' and/or compare a group with other groups. In School-Based Evaluation, the NRT is used for summative evaluation, such as in the end-of-year examination for the streaming and selection of students.
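To make these statistics concrete, the short Python sketch below is an added illustration (not part of the original module) using hypothetical raw scores: it computes the mean, the standard deviation, and a test-taker's percentile rank within one administration.

```python
from statistics import mean, stdev

# Hypothetical raw scores (out of 300) from one test administration
scores = [212, 250, 198, 231, 275, 240, 187, 256, 222, 244]

def percentile_rank(score, all_scores):
    """Percentage of test-takers whose scores fall below the given score."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

my_score = 250
print(f"mean = {mean(scores):.1f}")                 # average score
print(f"standard deviation = {stdev(scores):.1f}")  # extent of variance in scores
print(f"percentile rank of {my_score} = {percentile_rank(my_score, scores):.0f}%")
```

The percentile rank simply reports where one score sits relative to the rest of the group, which is exactly the rank-order information an NRT is designed to provide.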


3.4 Criterion-Referenced Test (CRT)

Gottlieb (2006), on the other hand, refers to criterion-referenced tests as

the collection of information about student progress or achievement in relation

to a specified criterion. In a standards-based assessment model, the

standards serve as the criteria or yardstick for measurement. Following

Glaser (1973), the word ‘criterion’ means the use of score values that can be

accepted as the index of a test-taker's attainment. Thus, CRTs are designed

to provide feedback to test-takers, mostly in the form of grades, on specific

course or lesson objectives. Curriculum Development Centre (2001) defines

CRT as an approach that provides information on a student's mastery based on

the criteria determined by the teacher. These criteria are based on learning

outcomes or objectives as specified in the syllabus. The main advantage of

CRTs is that they allow testers to make inferences about how much language proficiency (in the case of language proficiency tests) or knowledge and skills (in the case of academic achievement tests) test-takers/students originally have and their successive gains over time. As

opposed to NRTs, CRTs focus on a student's mastery of the subject matter (represented in the standards) along a continuum instead of ranking students

on a bell curve. Table 3 below shows the differences between Norm-

Referenced Test (NRT) and Criterion-Referenced Test (CRT).

Definition
NRT: A test that measures a student's achievement as compared to other students in the group.
CRT: An approach that provides information on a student's mastery based on a criterion specified by the teacher.

Purpose
NRT: Determine performance differences among individuals and groups.
CRT: Determine learning mastery based on a specified criterion and standard.

Test Item
NRT: From easy to difficult level and able to discriminate examinees' ability.
CRT: Guided by minimum achievement in the related objectives.

Frequency
NRT: Continuous assessment in the classroom.
CRT: Continuous assessment.

Appropriateness
NRT: Summative evaluation.
CRT: Formative evaluation.

Example
NRT: Public exams: UPSR, PMR, SPM, and STPM.
CRT: Mastery tests: monthly tests, coursework, projects, exercises in the classroom.

Table 3: The differences between Norm-Referenced Test (NRT) and Criterion-Referenced Test (CRT)

3.5 Formative Test

Formative test or assessment, as the name implies, is a kind of

feedback teachers give students while the course is progressing. Formative

assessment can be seen as assessment for learning. It is part of the

instructional process. We can think of formative assessment as "practice." With continual feedback, the teachers may assist students to improve their performance. The teachers point out what the students have done wrong

and help them to get it right. This can take place when teachers examine the

results of achievement and progress tests. Based on the results of formative

test or assessment, the teachers can suggest changes to the focus of

curriculum or emphasis on some specific lesson elements. On the other hand,

students may also need to change and improve. Due to the demanding nature

of formative testing, numerous teachers prefer not to adopt it, although giving back any assessed homework or achievement test presents both teachers and students with healthy and valuable learning opportunities.

3.6 Summative Test

Summative test or assessment, on the other hand, refers to the kind of

measurement that summarises what the student has learnt or gives a one-off measurement. In other words, summative assessment is assessment of

student learning. Students are more likely to experience assessment carried

out individually where they are expected to reproduce discrete language items

from memory. The results are then used to yield a school report and to determine what students know and do not know. It does not necessarily

provide a clear picture of an individual’s overall progress or even his/her full

potential, especially if s/he is hindered by the fear factor of physically sitting for


a test, but may provide straightforward and invaluable results for teachers to

analyse. It is given at a point in time to measure student achievement in

relation to a clearly defined set of standards, but it does not necessarily show

the way to future progress. It is given after learning is supposed to occur. End-of-year tests in a course and other general proficiency or public exams are some examples of summative tests or assessments. Table 3.1 shows

formative and summative assessments that are common in schools.

Formative Assessment: anecdotal records; quizzes and essays; diagnostic tests.
Summative Assessment: final exams; national exams (UPSR, PMR, SPM, STPM); entrance exams.

Table 3.1: Common formative and summative assessments in schools

3.7 Objective Test

According to BBC Teaching English, an objective test is a test that

consists of right or wrong answers or responses and thus it can be marked

objectively. Objective tests are popular because they are easy to prepare and

take, quick to mark, and provide a quantifiable and concrete result. They tend

to focus more on specific facts than on general ideas and concepts.

The types of objective tests include the following:

i. Multiple-choice items/questions;

ii. True-false items/questions;

iii. Matching items/questions; and

iv. Fill-in-the-blanks items/questions.

In this topic, let us focus on multiple-choice questions, which may look easy to construct but, in reality, are very difficult to build correctly. This is


congruent with the viewpoint of Hughes (2003, pp. 76-78), who warns against

many weaknesses of multiple-choice questions. The weaknesses include:

It may limit beneficial washback;

It may enable cheating among test-takers;

It is very challenging to write successful items;

This technique strictly limits what can be tested;

This technique tests only recognition knowledge;

It may encourage guessing, which may have a considerable effect on test scores.

Let’s look at some important terminology when designing multiple-choice

questions. This objective test item type involves five terms, namely:

1. Receptive or selective response

Items for which the test-taker chooses from a given set of responses (a selected response) rather than creating a response (commonly called a supply type of response).

2. Stem

Every multiple-choice item consists of a stem (the ‘body’ of the item

that presents a stimulus). The stem is the question or assignment in an item. It may be in a complete or open, positive or negative sentence form. The stem must be short, simple, compact and clear. However, it must not easily give away the right answer.

3. Options or alternatives

They are known as a list of possible responses to a test item.

There are usually between three and five options/alternatives to

choose from.

4. Key


This is the correct response; it may be either the one correct answer or the best answer. Usually, for a good item, the key is not obvious when compared to the distractors.

5. Distractors

A distractor is a 'disturber' that is included to draw students away from selecting the correct answer. An excellent distractor is almost the same as the correct answer, but is not correct.

When building multiple-choice items for both classroom-based and large-scale standardised tests, consider the four guidelines below:

i. Design each item to measure a single objective;

ii. State both stem and options as simply and directly as possible;

iii. Make certain that the intended answer is clearly the one correct one;

iv. (Optional) Use item indices to accept, discard or revise items.
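To tie the five terms together, the following minimal Python sketch (an added illustration, not part of the original module; the item, names and wording are hypothetical) shows how a multiple-choice item with a stem, options, a key, and distractors might be represented and scored objectively.

```python
from dataclasses import dataclass

@dataclass
class MultipleChoiceItem:
    stem: str             # the question or assignment presented to the test-taker
    options: list[str]    # the full set of alternatives to choose from
    key: str              # the correct (or best) response

    @property
    def distractors(self) -> list[str]:
        # every option that is not the key acts as a distractor
        return [opt for opt in self.options if opt != self.key]

    def score(self, response: str) -> int:
        # objective scoring: one mark if the key is chosen, zero otherwise
        return 1 if response == self.key else 0

item = MultipleChoiceItem(
    stem="She ___ to school every day.",
    options=["go", "goes", "going", "gone"],
    key="goes",
)
print(item.distractors)    # ['go', 'going', 'gone']
print(item.score("goes"))  # 1
```

Because the key is fixed in advance, any marker (or a program) scoring such an item will arrive at the same result, which is what makes the item objective.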

3.8 Subjective Test

Contrary to an objective test, a subjective test is evaluated by giving an

opinion, usually based on agreed criteria. Subjective tests include essay, short-answer, vocabulary, and take-home tests. Some students become very anxious about these tests because they feel their writing skills are not up to par. In reality, a subjective test provides more opportunity for test-takers to

show/demonstrate their understanding and/or in-depth knowledge and skills in

the subject matter. In this case, test takers might provide some acceptable,

alternative responses that the tester, teacher or test developer did not

predict. Generally, subjective tests will test the higher-order skills of analysis, synthesis, and evaluation. In short, subjective tests will enable students to be


more creative and critical. Table 3.2 shows various types of objective and

subjective assessments.

Objective Assessments: True/False items; multiple-choice items; multiple-response items; matching items.
Subjective Assessments: Extended-response items; restricted-response items; essays.

Table 3.2: Various types of objective and subjective assessments

Some have argued that the distinction between objective and subjective

assessments is neither useful nor accurate because, in reality, there is no

such thing as ‘objective’ assessment. In fact, all assessments are created with

inherent biases built into decisions about relevant subject matter and content,

as well as cultural (class, ethnic, and gender) biases.

Reflection

1. Objective test items are items that have only one answer or correct response. Describe in-depth the multiple-choice test item.

2. Subjective test items allow subjectivity in the responses given by the test-takers. Explain in detail the various types of subjective test items.

Discussion

1. Identify at least three differences between formative and summative assessment.

2. What are the strengths of multiple-choice items compared to essay items?

3. Informal assessments are often unreliable, yet they are still important in classrooms. Explain why this is the case, and defend your explanation with examples.

4. Compare and contrast Norm-Referenced Test with Criterion- Referenced Test.


TOPIC 4 BASIC PRINCIPLES OF ASSESSMENT

4.0 SYNOPSIS

Topic 4 defines the basic principles of assessment (reliability, validity,

practicality, washback, and authenticity) and the essential sub-categories

within reliability and validity.

4.1 LEARNING OUTCOMES

By the end of this topic, you will be able to:

1. define the basic principles of assessment (reliability, validity, practicality, washback, and authenticity) and the essential sub-categories within reliability and validity;

2. explain the differences between validity and reliability;

3. distinguish the different types of validity and reliability in tests and other instruments in language assessment.

4.2 FRAMEWORK OF TOPICS

Reliability; Validity; Practicality; Objectivity; Interpretability; Authenticity; Washback Effect

CONTENT

SESSION FOUR (3 hours)

4.3 INTRODUCTION

Assessment is a complex, iterative process requiring skills, understanding, and knowledge in the exercise of professional judgment. In

this process, there are five important criteria that the testers ought to look into

for “testing a test”: reliability, validity, practicality, washback and authenticity.

Since these five principles are context dependent, there is no priority order

implied in the order of presentation.

4.4 RELIABILITY (consistency)

Reliability means the degree to which an assessment tool produces

stable and consistent results. It is a concept that is easily misunderstood (Feldt & Brennan, 1989).

Reliability essentially denotes ‘consistency, stability, dependability, and

accuracy of assessment results' (McMillan, 2001a, p. 65, in Brown, G. et al., 2008). Since there is tremendous variability from teacher or tester to teacher/tester that affects student performance, reliability in planning,


implementing, and scoring student performances gives rise to valid

assessment.

Fundamentally, a reliable (trustworthy) test is consistent and

dependable. If a tester administers the same test to the same test-taker or

matched test-takers on two different occasions, the test should give the same results. In a validity chain, it is stated that test administrators need to be sure that the scoring of performance is carried out properly. If scores used by the tester do not reflect accurately what the test-taker actually did, would not be awarded by another marker, or would not be obtained on a similar assessment, then these scores lack reliability. Errors can occur in scoring in many ways, for example, giving Level 2 when another rater would give Level 4,

adding up marks wrongly, transcribing scores from test paper to database

inaccurately, students performing really well on the first half of the assessment

and poorly on the second half due to fatigue, and so on. Thus, lack of reliability in the scores students receive is a threat to validity.

According to Brown (2010), a reliable test can be described as follows:

- Consistent in its conditions across two or more administrations;
- Gives clear directions for scoring/evaluation;
- Has uniform rubrics for scoring/evaluation;
- Lends itself to consistent application of those rubrics by the scorer;
- Contains items/tasks that are unambiguous to the test-taker.

4.4.1 Rater Reliability

When humans are involved in the measurement procedure,

there is a tendency towards error, bias and subjectivity in determining the scores of the same test. There are two kinds of rater reliability, namely

inter-rater reliability and intra-rater reliability.

Inter-rater reliability refers to the degree of similarity between different testers or raters: can two or more testers/raters, without


influencing one another, give the same marks to the same set of scripts

(contrast with intra-rater reliability).

One way to test inter-rater reliability is to have each rater assign

each test item a score. For example, each rater might score

items on a scale from 1 to 10. Next, you would calculate the

correlation (connection) between the two ratings to determine the level

of inter-rater reliability. Another means of testing inter-rater reliability is

to have raters determine which category each observation falls into and

then calculate the percentage of agreement between the raters. So, if

the raters agree 8 out of 10 times, the test has an 80% inter-rater

reliability rate. Rater reliability is assessed by having two or more

independent judges score the test. The scores are then compared to

determine the consistency of the raters’ estimates.
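As a concrete illustration of the percent-agreement approach just described, the short Python sketch below (an added example with hypothetical ratings, not part of the original module) compares the categories two raters assigned to the same ten scripts and reports the proportion of scripts on which they agree.

```python
# Hypothetical categories assigned by two raters to the same ten scripts
rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
rater_b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass", "pass"]

# Count the scripts on which both raters assigned the same category
agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
percent_agreement = 100 * agreements / len(rater_a)

print(f"Raters agree on {agreements} of {len(rater_a)} scripts "
      f"= {percent_agreement:.0f}% inter-rater reliability")
# With 8 matching judgements out of 10, this prints 80%, as in the example above.
```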

Intra-rater reliability is an internal factor. In intra-rater reliability,

its main aim is consistency within the rater. For example, if a rater

(teacher) has many examination papers to mark and does not have

enough time to mark them, s/he might take much more care with the

first, say, ten papers, than the rest. This inconsistency will affect the

students’ scores; the first ten might get higher scores. In other

words, while inter-rater reliability involves two or more raters, intra-

rater reliability is the consistency of grading by a single rater.

Scores on a test are rated by a single rater/judge at different times.

When we grade tests at different times, we may become

inconsistent in our grading for various reasons. Some papers that are

graded during the day may get our full and careful attention, while

others that are graded towards the end of the day are very quickly

glossed over. As such, intra-rater reliability determines the

consistency of our grading.

Both inter-and intra-rater reliability deserve close attention in that

test scores are likely to vary from rater to rater or even from the same rater

(Clark, 1979).


4.4.2 Test Administration Reliability

There are a number of factors which influence test

administration reliability. Unreliability occurs due to outside

interference like noise, variations in photocopying, temperature

variations, the amount of light in various parts of the room, and even

the condition of desks and chairs. Brown (2010) stated that he once

witnessed the administration of a test of aural comprehension in which

an audio player was used to deliver items for comprehension, but due

to street noise outside the building, test-takers sitting next to open

windows could not hear the stimuli clearly. According to him, that was

a clear case of unreliability caused by the conditions of the test

administration.

4.4.3 Factors influencing Reliability

Figure 4.4.3 Factors that affect the reliability of a test

The outcome of a test is influenced by many factors.

Assuming that the factors are constant and not subject to

change, a test is considered to be reliable if the scores

are consistent and not different from other equivalent and

reliable test scores. However, tests are not free from

errors. Factors that affect the reliability of a test include

test length factors, teacher and student factors,


environment factors, test administration factors, and

marking factors.

a. Test length factors

In general, longer tests produce higher reliabilities. Because shorter tests depend more on chance and guessing, scores will be more accurate if the duration of the test is longer. An objective test has

higher consistency because it is not exposed to a variety of

interpretations. A valid test is said to be reliable, but a reliable test need not be valid; a consistent score does not necessarily measure what it is intended to measure. In addition, test items are only samples of the subject being tested, and variation between the samples used in two equivalent tests can be one of the causes of unreliable test outcomes.

b. Teacher-Student factors

In most cases, it is normal for teachers to construct and administer tests for their students. Thus, any good teacher-student

relationship would help increase the consistency of the results. Other

factors that contribute positively to the reliability of a test include the teacher's encouragement, positive mental and physical condition, familiarity with the test formats, and perseverance (determination) and

motivation.

c. Environment factors

An examination environment certainly influences test-takers and

their scores. Any favourable environment with comfortable chairs and

desks, good ventilation, sufficient light and space will improve the

reliability of the test. On the contrary, a non-conducive environment will

affect test-takers’ performance and test reliability.

d. Test administration factors


Because students' grades are dependent on the way tests are being

administered, test administrators should strive to provide clear and

accurate instructions, sufficient time and careful monitoring of tests to

improve the reliability of their tests. A test-retest technique can be used to determine test reliability.

e. Marking factors

Unfortunately, we human judges have many opportunities to introduce

error in our scoring of essays (Linn & Gronlund, 2000; Weigle, 2002). It is possible that our scoring invalidates many of the interpretations we would like to make based on this type of assessment. Brennan (1996) has reported that in large-scale, high-stakes marking, where panels are tightly trained and monitored, marker effects are small. Hence, it can

be concluded that in low-stakes, small-scale marking, there is

potentially a large error introduced by individual markers. It is also

common that different markers award different marks for the same

answer even with a prepared mark scheme. A marker’s assessment

may vary from time to time and in different situations. Conversely, this does not happen with objective types of tests since the responses are fixed. Thus, objectivity is a condition for reliability.

4.5 VALIDITY

Validity refers to the evidence base that can be provided about

appropriateness of the inferences, uses, and consequences that come from

assessment (McMillan, 2001a). Appropriateness has to do with the soundness (accuracy), trustworthiness, or legitimacy of the claims or inferences (conclusions) that testers would like to make on the basis of the obtained scores.

Clearly, we have to evaluate the whole assessment process and its constituent

(component) parts by how soundly (thoroughly) we can defend the

consequences that arise from the inferences and decisions we make. Validity,

in other words, is not a characteristic of a test or assessment, but a judgment, which can have varying degrees of strength.


So, the second characteristic of good tests is validity, which refers to

whether the test is actually measuring what it claims to measure. This is

important for us as we do not want to make claims concerning what a student

can or cannot do based on a test when the test is actually measuring

something else. Validity is usually determined logically although several types

of validity may use correlation coefficients.

According to Brown (2010), a valid test of reading ability actually

measures reading ability and not 20/20 vision, or previous knowledge of a

subject, or some other variables of questionable relevance. To measure

writing ability, one might ask students to write as many words as they can in

15 minutes, then simply count the words for the final score. Such a test is

practical (easy to administer) and the scoring quite dependable (reliable).

However, it would not constitute (represent) a valid test of writing ability

without taking into account its comprehensibility (clarity), rhetorical discourse

elements, and the organisation of ideas.

The following are the different types of validity:

Face validity: Do the assessment items appear to be appropriate?

Content validity: Does the assessment content cover what you want to assess? Have satisfactory samples of language and language skills been selected for testing?

Construct validity: Are you measuring what you think you're measuring? Is the test based on the best available theory of language and language use?

Concurrent (parallel) validity: Can you use the current test score to estimate scores of other criteria? Does the test correlate with other existing measures?

Predictive validity: Is it accurate for you to use your existing students’ scores to predict future students’ scores? Does the test successfully predict future outcomes?


It is fairly obvious that a valid assessment should have a good coverage of

the criteria (concepts, skills and knowledge) relevant to the purpose of the

examination. The important notion here is the purpose.

Figure 4.5: Types of Validity

4.5.1 Face validity

Face validity is validity which is “determined impressionistically;

for example by asking students whether the examination was

appropriate to the expectations" (Henning, 1987). Mousavi (2009) refers to face validity as the degree to which a test looks right, and appears to

measure the knowledge or abilities it claims to measure, based on the

subjective judgement of the examinees who take it, the administrative

personnel who decide on its use, and other psychometrically

unsophisticated observers.


It is pertinent (important) that a test looks like a test even at first impression. If students taking a test do not feel that the questions given to them are a test or part of a test, then the test may not be valid as the students may not take the questions seriously. The

test, hence, will not be able to measure what it claims to measure.

4.5.2 Content validity

Content validity "is concerned with whether or not the content of the test is sufficiently representative and comprehensive for the test to be a valid measure of what it is supposed to measure" (Henning, 1987). The most important step in ensuring content validity is to make sure all content domains are represented in the test. Another method to verify validity is through the use of a Table of Test Specification, which can give detailed information on each content area, level of skills, level of difficulty, number of items, and item representation for rating in each

content or skill or topic.

We can quite easily imagine taking a test after going through an

entire language course. How would you feel if at the end of the course,

your final examination consists of only one question that covers one

element of language from the many that were introduced in the course?

If the language course was a conversational course focusing on the

different social situations that one may encounter, how valid is a final

examination that requires you to demonstrate your ability to place an

order at a posh restaurant in a five-star hotel?

4.5.3 Construct validity

A construct is a psychological concept used in measurement.

Construct validity is the most obvious reflection of whether a test

measures what it is supposed to measure as it directly addresses the

issue of what it is that is being measured. In other words, construct

validity refers to whether the underlying theoretical constructs that the


test measures are themselves valid. Proficiency, communicative

competence, and fluency are examples of linguistic constructs; self-

esteem and motivation are psychological constructs.

Fundamentally, every issue in language learning and teaching involves theoretical constructs. Consider, for instance, assessing a student’s oral proficiency. To possess construct validity, the test should consist of various components of fluency: speed, rhythm,

juncture, (lack of) hesitations, and other elements within the construct of

fluency. Tests are, in a manner of speaking, operational definitions of

constructs in that their test tasks are the building blocks of the entity that

is being measured (see Davidson, Hudson, & Lynch, 1985; T.

McNamara, 2000).

4.5.4 Concurrent validity

Concurrent validity is the use of another more reputable and

recognised test to validate one’s own test. For example, suppose you

come up with your own new test and would like to determine the validity

of your test. If you choose to use concurrent validity, you would look for

a reputable test and compare your students’ performance on your test

with their performance on the reputable and acknowledged test. In

concurrent validity, a correlation coefficient is obtained between the two sets of scores, giving an actual numerical value. A high positive correlation of 0.7 to 1 indicates that the learners’ scores are relatively similar across the two tests or measures.
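As an illustration only (not part of the module text), the sketch below shows how such a correlation coefficient could be computed for two sets of scores. The score lists, the function name pearson_r and the interpretation comment are all assumptions made for this example.

```python
# A minimal sketch, assuming invented score data: estimating concurrent
# validity by correlating learners' scores on a new classroom test with
# their scores on an established, recognised test.
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

new_test   = [45, 60, 72, 55, 80, 66, 90, 38]   # scores on the teacher's new test
known_test = [50, 58, 75, 57, 78, 70, 88, 42]   # scores on the reputable test

r = pearson_r(new_test, known_test)
print(f"correlation = {r:.2f}")  # values from 0.7 to 1 suggest similar performance
```

The same calculation applies to predictive validity, for instance when placement-test scores are correlated with the GPA obtained several semesters later.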

For example, in a course unit whose objective is for students to

be able to orally produce voiced and unvoiced stops in all possible

phonetic environments, the results of one teacher’s unit test might be

compared with an independent assessment such as a commercially

produced test of similar phonemic proficiency. Since criterion-related

evidence usually falls into one of two categories of concurrent and

predictive validity, a classroom test designed to assess mastery of a


point of grammar in a communicative use will have criterion validity if

test scores are verified either by observed subsequent behaviour or by

other communicative measures of the grammar point in question.

4.5.5 Predictive validity

Predictive validity is closely related to concurrent validity in that it

too generates a numerical value. For example, the predictive validity of

a university language placement test can be determined several

semesters later by correlating the scores on the test to the GPA of the

students who took the test. Therefore, a test with high predictive validity

is a test that would yield predictable results in a later measure. A simple example of tests concerned with predictive validity is the trial national examination conducted at schools in Malaysia, as it is intended to predict the students’ performance on the actual SPM national examination (Norleha Ibrahim, 2009).

As mentioned earlier validity is a complex concept, yet it is

crucial to the teacher’s understanding of what makes a good test. It is

good to heed Messick’s (1989, p. 36) caution that validity is not an all-

or-none proposition and that various forms of validity may need to be

applied to a test in order to be satisfied with its overall effectiveness.

What are reliability and validity? What determines the reliability of a test?

What are the different types of validity? Describe any three types and cite examples.

http://www.2dix.com/pdf-2011/testing-and-evaluation-in-esl-pdf.php

4.5.6 Practicality

Although practicality is an important characteristic of tests, it is often a limiting factor in testing. There will be situations in which, after we


have already determined what we consider to be the most valid test, we

need to reconsider the format purely because of practicality issues. A

valid test of spoken interaction, for example, would require that the

examinees be relaxed, interact with peers and speak on topics that they

are familiar and comfortable with. This sounds like the kind of

conversations that people have with their friends while sipping afternoon

tea by the roadside stalls. Of course such a situation would be a highly

valid measure of spoken interaction – if we could set it up. Even attempting to do so would require hidden cameras, many telephone calls and a lot of money.

Therefore, a more practical form of the test especially if it is to be

administered at the national level as a standardised test, is to have a

short interview session of about fifteen minutes using perhaps a picture

or reading stimulus that the examinees would describe or discuss.

Therefore, practicality issues, although limiting in a sense, cannot be

dismissed if we are to come up with a useful assessment of language

ability. Practicality issues can involve economics or costs,

administration considerations such as time and scoring procedures, as

well as the ease of interpretation. Tests are only as good as how well

they are interpreted. Therefore tests that cannot be easily interpreted

will definitely cause many problems.

4.5.7 Objectivity

The objectivity of a test refers to the consistency of the teachers/examiners who mark the answer scripts, that is, the extent to which an examiner awards the same score to the same answer script. The test is said to have high objectivity when the examiner is able to give the same score to similar answers, guided by the mark scheme. An objective test has the highest level of objectivity because the scoring is not influenced by the examiner’s skills and emotions. Meanwhile, a subjective test is said to have the lowest objectivity. Research has shown that different examiners tend to award different scores to the same essay, and that the same examiner may give different scores to the same essay when re-marking it at different times.
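To make the idea of objectivity concrete, the short sketch below compares the marks two examiners award to the same answer scripts; the scores and the two simple indicators (exact agreement and mean difference) are illustrative assumptions, not measures prescribed by this module.

```python
# A small sketch, assuming invented marks: checking how consistently two
# examiners score the same set of answer scripts.
examiner_a = [18, 15, 12, 20, 9, 14]   # marks awarded by examiner A
examiner_b = [18, 14, 12, 19, 9, 15]   # marks awarded by examiner B on the same scripts

exact_agreement = sum(a == b for a, b in zip(examiner_a, examiner_b)) / len(examiner_a)
mean_difference = sum(abs(a - b) for a, b in zip(examiner_a, examiner_b)) / len(examiner_a)

print(f"exact agreement: {exact_agreement:.0%}")   # higher values suggest higher objectivity
print(f"mean difference: {mean_difference:.1f} marks")
```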

4.5.8 Washback effect

The term 'washback' or backwash (Hughes, 2003, p. 1) refers to the impact that tests have on teaching and learning. Such impact is usually seen as being negative: tests are said to force teachers to do things they do not necessarily wish to do. However, some have argued that tests are potentially also 'levers for change' in language education: the argument being that if a bad test has negative impact, a good test should or could have positive washback (Alderson, 1986b; Pearson, 1988).

Cheng, Watanabe, and Curtis (2004) devoted an entire anthology to the issue of washback, while Spratt (2005) challenged teachers to become agents of beneficial washback in their language classrooms. Brown (2010) discusses the factors that provide beneficial washback in a test. He mentions that such a test positively influences what and how teachers teach and students learn; offers learners a chance to prepare adequately; gives learners feedback that enhances their language development; is more formative in nature than summative; and provides conditions for peak performance by the learners.

In large-scale assessment, washback often refers to the effects

that tests have on instruction in terms of how students prepare for the

test. In classroom-based assessment, washback can have a number of

positive manifestations, ranging from the benefit of preparing and

reviewing for a test to the learning that accrues from feedback on one’s

performance. Teachers can provide information that “washes back” to

students in the form of useful diagnoses of strengths and weaknesses.


The challenge to teachers is to create classroom tests that serve

as learning devices through which washback is achieved. Students’

incorrect responses can become a platform for further improvements.

On the other hand, their correct responses need to be complimented,

especially when they represent accomplishments in a student’s

developing competence. Teachers can have various strategies in

providing guidance or coaching. Washback enhances a number of

basic principles of language acquisition namely intrinsic motivation,

autonomy, self-confidence, language ego, interlanguage, and strategic

investment, among others.

Washback is generally said to be either positive or negative.

Unfortunately, students and teachers tend to think of the negative

effects of testing such as “test-driven” curricula and only studying and

learning “what they need to know for the test”. Positive washback, or

what we prefer to call “guided washback” can benefit teachers, students

and administrators. Positive washback assumes that testing and

curriculum design are both based on clear course outcomes, which are

known to both students and teachers/testers. If students perceive that

tests are markers of their progress towards achieving these outcomes,

they have a sense of accomplishment. In short, tests must be part of

learning experiences for all involved. Positive washback occurs when a

test encourages good teaching practice.

Washback is particularly obvious when the tests or examinations

in question are regarded as vital and having a definite impact

on the student’s or test-taker’s future. We would expect, for example,

that national standardised examinations would have strong washback

effects compared to a school-based or classroom-based test.

4.5.9 Authenticity

Another major principle of language testing is authenticity. It is a concept that is difficult to define, particularly within the art and science of evaluating and designing tests. Brown (2010), citing Bachman and Palmer (1996), defines authenticity as “the degree of correspondence of the characteristics of a given language test task to the features of a target language task” (p. 23), and then suggests an agenda for identifying those target language tasks and for transforming them into valid test items.

Language learners are motivated to perform when they are faced

with tasks that reflect real world situations and contexts. Good testing

or assessment strives to use formats and tasks that reflect the types of

situation in which students would authentically use the target language.

Whenever possible, teachers should attempt to use authentic materials

in testing language skills.

4.6.0 Interpretability

Test interpretation encompasses all the ways that meaning is

assigned to the scores.  Proper interpretation requires knowledge

about the test, which can be obtained by studying its manual and other

materials along with current research literature with respect to its

use; no one should undertake the interpretation of scores on any test

without such study. In any test interpretation, the following

considerations should be taken into account.

A.  Consider Reliability:  Reliability is important because it is a

prerequisite to validity and because the degree to which a score may

vary due to measurement error is an important factor in its

interpretation.

B.  Consider Validity:  Proper test interpretation requires knowledge of

the validity evidence available for the intended use of the test.  Its

validity for other uses is not relevant.  Indeed, use of a measurement

for a purpose for which it was not designed may constitute misuse. 


      The nature of the validity evidence required for a test depends upon its

use.   

C.  Scores, Norms, and Related Technical Features:  The result of scoring a test or subtest is usually a number called a raw score, which by itself is not interpretable.  Additional steps are needed to translate the number into either a verbal description (e.g., pass or fail) or into a derived score (e.g., a standard score); a small illustrative sketch of this conversion follows item D below.  Less than full understanding of these procedures is likely to produce errors in interpretation and ultimately in counseling or other uses.

D.  Administration and Scoring Variation:  Stated criteria for score

interpretation assume standard procedures for administering and

scoring the test.  Departures from standard conditions and procedures

modify and often invalidate these criteria.
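As a purely illustrative sketch of the conversion mentioned in point C, the snippet below translates a raw score into a standard (z) score, a re-scaled derived score and a verbal description; the group statistics, the scaling (mean 100, SD 15) and the pass mark are assumptions for this example only.

```python
# Illustrative sketch, assuming an invented score distribution and cut-off.
def derived_score(raw, group_mean, group_sd, scale_mean=100, scale_sd=15):
    z = (raw - group_mean) / group_sd          # standard (z) score
    return z, scale_mean + scale_sd * z        # re-scaled standard score

raw = 64
z, standard = derived_score(raw, group_mean=55, group_sd=9)
description = "pass" if raw >= 50 else "fail"  # assumed cut-off, for illustration
print(f"raw={raw}, z={z:.2f}, standard score={standard:.0f}, result={description}")
```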

Study some of commercially produced tests and evaluate the authenticity of these tests/ test items.

Discuss the importance of authenticity in testing.

Based on samples of formative and summative assessments, discuss aspects of reliability/validity that must be considered in these assessments.

Discuss measures that a teacher can take to ensure high validity of language assessment for the primary classroom.


TOPIC 5 DESIGNING CLASSROOM LANGUAGE TEST

5.0 SYNOPSIS

Topic 5 exposes you to the stages of test construction, the preparation of the test blueprint/test specifications, the elements in the Test Specifications Guidelines, and the importance of following the guidelines for constructing test items. Then we look at the various test formats that are appropriate for language assessment.

5.1 LEARNING OUTCOMES

By the end of this topic, you will be able to:

1. identify the different stages of test construction

2. describe the features of a test specification

3. draw up a test specification that reflects both the purpose and the

objectives of the test

4. compare and contrast Bloom’s taxonomy and SOLO taxonomy

5. categorise test items according to Bloom’s taxonomy


6. discuss the elements of test items of high quality, reliability and

validity

7. identify the elements in a Test Specifications Guidelines

8. demonstrate an understanding of the importance of following the

guidelines for constructing test items

9. illustrate test formats that are appropriate and meet the

requirements of the learning outcomes

5.2 FRAMEWORK OF TOPICS

CONTENT

SESSION FIVE (3 hours)

5.3 Stages of Test Construction

Constructing a test is not an easy task; it requires a variety of skills

along with deep knowledge in the area for which the test is to be

constructed. The steps include:

i. determining
ii. planning
iii. writing
iv. preparing
v. reviewing
vi. pre-testing
vii. validating

5.3.1 Determining


The essential first step in testing is to make oneself perfectly

clear about what it is one wants to know and for what purpose. When

we start to construct a test, the following questions have to be

answered.

Who are the examinees?

What kind of test is to be made?

What is the precise purpose?

What abilities are to be tested?

How detailed and how accurate must the results be?

How important is the backwash effect?

What constraints are set by the unavailability of expertise, facilities,

time of construction, administration, and scoring?

What is the scope of the test?

5.3.2 Planning

The first form that the solution takes is a set of specifications for

the test. This will include information on: content, format and timing, criteria, levels of performance, and scoring procedures.

In this stage, the test constructor has to determine the content by

addressing the following:

Describing the purpose of the test;

Describing the characteristics of the test takers, the nature of the

population of the examinees for whom the test is being designed.

Defining the nature of the ability we want to measure;

Developing a plan for evaluating the qualities of test usefulness, which is the degree to which a test is useful for teachers and students; this includes six qualities: reliability, validity, authenticity, interactiveness, practicality, and impact;

Identifying resources and developing a plan for their allocation and

management;

Determining format and timing of the test;

Determining levels of performance;


Determining scoring procedures

5.3.3 Writing

Although writing items is time-consuming, writing good items is an art.

No one can expect to be able consistently to produce perfect items.

Some items will have to be rejected, others reworked. The best way to

identify items that have to be improved or abandoned is through

teamwork. Colleagues must really try to find fault; and despite the

seemingly inevitable emotional attachment that item writers develop to

items that they have created, they must be open to, and ready to

accept, the criticisms that are offered to them. Good personal relations

are a desirable quality in any test writing team.

Test items writers should possess the following characteristics:

They have to be experienced in test construction.

They have to be quite knowledgeable of the content of the test.

They should have the capacity to use language clearly and economically.

They have to be ready to sacrifice time and energy.

Another basic aspect in writing the items of the test is sampling.

Sampling means that test constructors choose widely from the whole

area of the course content. It is most unlikely that everything found

under the heading of 'Content’ in the specifications can be included in

any one version of the test. Choices have to be made for content

validity and for beneficial backwash. One should not concentrate solely

on elements known to be easy to test. Rather, the content of the test

should be a representative sample of the course material.

5.3.4 Preparing

One has to understand the major principles, techniques and experience

of preparing the test items. Not every teacher can make a good tester.

To construct different kinds of tests, the tester should observe some principles. In production-type tests, we have to bear in mind that no comments are necessary. Test writers should also try to avoid test items which can be answered through test-wiseness. Test-wiseness refers to the capacity of the examinees to utilise the characteristics and formats of the test to guess the correct answer.

5.3.5 Reviewing

Principles for reviewing test items:

The test should not be reviewed immediately after its construction,

but after some considerable time.

Other teachers or testers should review it. In a language test, it is

preferable if native speakers are available to review the test.

5.3.6 Pre-testing

After reviewing the test, it should be submitted to pre-testing.

The tester should administer the newly-developed test to a group of

examinees similar to the target group and the purpose is to analyse

every individual item as well as the whole test.

Numerical data (test results) should be collected to check the efficiency of each item; this should include item facility and discrimination.

5.3.7 Validating

Item Facility (IF) shows to what extent an item is easy or difficult. Items should be neither too easy nor too difficult. To measure the facility or easiness of an item, the following formula is used:

IF = number of correct responses (Σc) / total number of candidates (N)

Correspondingly, item difficulty is the proportion of wrong responses:

Σw / N (which equals 1 − IF)

The results of such equations range from 0 to 1. An item with a facility index of 0 is too difficult, and one with an index of 1 is too easy. The ideal item is one with a value of about 0.5, and the acceptable range for item facility is 0.37 to 0.63; that is, an item below 0.37 is difficult, and an item above 0.63 is easy.

Thus, tests which are too easy or too difficult for a given sample

population often show low reliability. As noted in Topic 4, reliability is

one of the complementary aspects of measurement.
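The short sketch below applies the item facility formula given above to one item; the responses are invented (1 = correct, 0 = wrong), and the classification simply follows the 0.37–0.63 acceptability range stated in the text.

```python
# Sketch of the item facility (IF) calculation, assuming invented responses.
responses = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]      # one item answered by N candidates

IF = sum(responses) / len(responses)            # correct responses / N

if IF < 0.37:
    verdict = "too difficult"
elif IF > 0.63:
    verdict = "too easy"
else:
    verdict = "acceptable"

print(f"IF = {IF:.2f} -> {verdict}")            # the ideal value is around 0.5
```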

5.4 Preparing Test Blueprint / Test Specifications

Test specifications (specs) for classroom use can be an outline of your

test (Brown, 2010), what it will “look like”. Consider your test

specs as a blueprint of the test that includes the following:

a description of its content

item types (methods, such as multiple-choice, cloze, etc.)

tasks (e.g. written essay, reading a short passage, etc.)

skills to be included

how the test will be scored

how it will be reported to students

For classroom purposes (Davidson & Lynch, 2002), the specs

are your guiding plan for designing an instrument that effectively fulfils

your desired principles, especially validity.
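As a minimal sketch of what such a blueprint might look like when written down, the example below records classroom test specs as plain data following the elements listed above; every detail (skills, item types, counts, weights, reporting) is invented for illustration.

```python
# A minimal sketch of a classroom test specification ("blueprint"); all
# details are invented assumptions, not a prescribed format.
test_specs = {
    "purpose": "end-of-unit achievement test for high-beginning adult learners",
    "sections": [
        {"skill": "listening", "item_type": "multiple-choice", "items": 10, "weight": 0.3},
        {"skill": "grammar (verb tenses)", "item_type": "cloze", "items": 15, "weight": 0.4},
        {"skill": "writing", "item_type": "guided paragraph", "items": 1, "weight": 0.3},
    ],
    "scoring": "answer key for objective sections; analytic rubric for writing",
    "reporting": "marked scripts returned with written feedback within one week",
}

total_weight = sum(section["weight"] for section in test_specs["sections"])
assert abs(total_weight - 1.0) < 1e-9, "section weights should sum to 1"
```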

It is vital to note that for large-scale standardised tests like Test

of English as a Foreign Language (TOEFL® Test), International

English Language Testing System (IELTS), Michigan English

Language Assessment Battery) MELAB, and the like, that are intended

to be widely distributed and thus are broadly generalised, test

specifications are much more formal and detailed (Spaan, 2006). They

are also usually confidential so that the institution that is designing the

test can ensure the validity of subsequent forms of a test.

Many language teachers claim that it is difficult to construct an item. In

reality, it is rather easy to develop an item if we are committed to the

planning of the measuring instruments to evaluate students’

achievement.

However, what exactly is an item in a test? An item is a tool, an instrument, an instruction or a question used to get feedback from test-takers; this feedback is evidence of something that is being measured and is useful information for measuring or asserting a construct. Items can be classified as recall items or thinking items. A recall item requires one to recall information in order to answer, while a thinking item requires test-takers to use their thinking skills to attempt it.

For instance, consider a grammar unit test that will be administered at the end of a three-week grammar course for high-beginning adult learners (Level 2). The students will be taking a test that covers verb tenses and two integrated skills (listening/speaking and reading/writing), and the grammar class they attend serves to reinforce the grammatical forms that they have learnt in the two earlier classes.

Based on the scenario above, the test specs that you design

might consist of four sequential steps:

1. a broad outline of how the test will be organised

2. which of the eight sub-skills you will test

3. what the various tasks and item types will be

4. how results will be scored, reported to students, and used in future class (washback)

Besides knowing the purpose of the test you are creating, you

are required to know as precisely as possible what it is you want to test.

Do not conduct a test hastily. Instead, you need to examine the

objectives for the unit you are testing carefully.

5.5 Bloom’s and SOLO Taxonomies

5.5.1 Bloom’s Taxonomy (Revised)

Bloom’s Taxonomy is a systematic way of describing how a learner’s performance develops from simple to complex levels in the affective, psychomotor and cognitive domains of learning. The Original

Taxonomy provided carefully developed definitions for each of the six

major categories in the cognitive domain. The categories were

Knowledge, Comprehension, Application, Analysis, Synthesis, and

Evaluation. With the exception of Application, each of these was broken into subcategories. The complete structure of the original Taxonomy is shown in Figure 5.1.

Figure 5.1: Original Terms of Bloom’s Taxonomy (adapted from Pohl, 2000, Learning to Think, Thinking to Learn, p. 8). Retrieved from: http://www.kurwongbss.qld.edu.au/thinking/Bloom/blooms.htm

The categories were ordered from simple to complex and from

concrete to abstract. Further, it was assumed that the original

Taxonomy represented a cumulative hierarchy; that is, mastery of each

simpler category was prerequisite to mastery of the next more complex

one. In the cognitive domain, there are six stages, namely Knowledge, Comprehension, Application, Analysis, Synthesis and Evaluation; traditional education has tended to base student learning largely in this domain. In the original Taxonomy, the

Knowledge category embodied both noun and verb aspects. The noun

or subject matter aspect was specified in Knowledge's extensive

subcategories. The verb aspect was included in the definition given to

Knowledge in that the student was expected to be able to recall or

recognise knowledge. This brought uni-dimensionality to the framework

at the cost of a Knowledge category that was dual in nature and thus

different from the other Taxonomic categories. In the 1990s, Anderson (a former student of Bloom) eliminated this inconsistency in the revised

Taxonomy by allowing these two aspects, the noun and verb, to form

separate dimensions, the noun providing the basis for the Knowledge dimension and the verb forming the basis for the Cognitive Process dimension, as shown in Figure 5.2.

Figure 5.2: Bloom’s Revised Taxonomy. Retrieved from: http://www.kurwongbss.qld.edu.au/thinking/Bloom/blooms.htm

In the revised Bloom’s Taxonomy, the names of six major

categories were changed from noun to verb forms. As the taxonomy

reflects different forms of thinking and thinking is an active process

verbs were used instead of nouns.

Besides, the subcategories of the six major categories were also

replaced by verbs and some subcategories were re-organised. The

knowledge category was renamed. Knowledge is an outcome or product of

thinking, not a form of thinking per se. Consequently, the word knowledge

was inappropriate to describe a category of thinking and was replaced with

the word remembering instead. Comprehension and synthesis were

retitled to understanding and creating respectively, in order to better reflect

the nature of the thinking defined in each category. Table 3 below

provides a summary of the above.

Table 3: The Cognitive Process Dimension

Level 1 – C1
Remember – Retrieve knowledge from long-term memory
• Recognising (Identifying): locating knowledge in long-term memory that is consistent with presented material
• Recalling (Retrieving): retrieving relevant knowledge from long-term memory

Level 2 – C2
Understand – Construct meaning from instructional messages, including oral, written, and graphic communication
• Interpreting (Clarifying, Paraphrasing, Representing, Translating): changing from one form of representation to another
• Exemplifying (Illustrating, Instantiating): finding a specific example or illustration of a concept or principle
• Classifying (Categorising, Subsuming): determining that something belongs to a category
• Summarising (Abstracting, Generalising): abstracting a general theme or major point(s)
• Inferring (Concluding, Extrapolating, Interpolating, Predicting): drawing a logical conclusion from presented information
• Comparing (Contrasting, Mapping, Matching): detecting correspondences between two ideas, objects, and the like
• Explaining (Constructing models): constructing a cause-and-effect model of a system

Level 3 – C3
Apply – Carry out or use a procedure in a given situation
• Executing (Carrying out): applying a procedure to a familiar task
• Implementing (Using): applying a procedure to an unfamiliar task

Analyse – Break material into its constituent parts and determine how the parts relate to one another and to an overall structure or purpose
• Differentiating (Discriminating, Distinguishing, Focusing, Selecting): distinguishing relevant from irrelevant parts or important from unimportant parts of presented material
• Organising (Finding coherence, Integrating, Outlining, Parsing, Structuring): determining how elements fit or function within a structure
• Attributing (Deconstructing): determining a point of view, bias, values, or intent underlying presented material

Evaluate – Make judgments based on criteria and standards
• Checking (Coordinating, Detecting, Monitoring, Testing): detecting inconsistencies or fallacies within a process or product; determining whether a process or product has internal consistency; detecting the effectiveness of a procedure as it is being implemented
• Critiquing (Judging): detecting inconsistencies between a product and external criteria; determining whether a product has external consistency; detecting the appropriateness of a procedure for a given problem

Create – Put elements together to form a coherent or functional whole; reorganise elements into a new pattern or structure
• Generating (Hypothesising): coming up with alternative hypotheses based on criteria
• Planning (Designing): devising a procedure for accomplishing some task
• Producing (Constructing): inventing a product

The Knowledge Domain
• Factual Knowledge: the basic elements students must know to be acquainted with a discipline or to solve problems in it
• Conceptual Knowledge: the interrelationships among the basic elements within a larger structure that enable them to function together
• Procedural Knowledge: how to do something; methods of inquiry; and criteria for using skills, algorithms, techniques, and methods
• Metacognitive Knowledge: knowledge of cognition in general as well as awareness and knowledge of one’s own cognition

5.5.2 SOLO Taxonomy

On the other hand, the SOLO taxonomy, which stands for the Structure of the Observed Learning Outcome, is a systematic way of describing how a learner’s performance develops from simple to complex levels in their learning. Biggs and Collis first introduced it in their 1982 study. There are five stages, namely Prestructural, Unistructural and Multistructural, which are in a quantitative phase, and Relational and Extended Abstract, which are in a qualitative phase.

Students find learning more complex as it advances. SOLO is

a means of classifying learning outcomes in terms of their complexity,

enabling teachers to assess students’ work in terms of its quality not of

how many bits of this and of that they got right. At first we pick up only

one or few aspects of the task (unistructural), then several aspects but


they are unrelated (multistructural), then we learn how to integrate

them into a whole (relational), and finally, we are able to generalise that

whole to as yet untaught applications (extended abstract). The diagram

below lists verbs typical of each such level.


Figure 5.3: SOLO Taxonomy

The SOLO taxonomy maps the complexity of a student’s work by linking

it to one of five phases:  little or no understanding (Prestructural), through a

simple and then more developed grasp of the topic (Unistructural and

Multistructural), to the ability to link the ideas and elements of a task together

(Relational) and finally (Extended Abstract) to understand the topic for

themselves, possibly going beyond the initial scope of the task (Biggs & Collis,

1982; Hattie & Brown, 2004). In their later research into multimodal learning,

Biggs & Collis noted that there was an ‘increase in the structural complexity of

their (the students’) responses’ (1991:64).

It may be useful to view the SOLO taxonomy as an integrated strategy,

to be used in lesson design, in task guidance and formative and summative

assessment (Smith & Colby, 2007; Black & William, 2009; Hattie, 2009; Smith,

2011). The structure of the taxonomy encourages viewing learning as an on-


going process, moving from simple recall of facts towards a deeper

understanding; that learning is a series of interconnected webs that can be

built upon and extended. Nückles et al. (2009: 261) elaborate:

Cognitive strategies such as organization and elaboration are at the heart of meaningful learning because they enable the learner to organize learning into a coherent structure and integrate new information with existing knowledge, thereby enabling deep understanding and long-term retention.

This would help to develop Smith’s (2011:92) “self-regulating, self-evaluating

learners who were well motivated by learning.”

A range of SOLO based techniques exist to assist teachers and

students. Use of constructive alignment (Biggs & Tang, 2009) encourages teachers to be more explicit when creating learning objectives, focusing on what the student should be able to do and at which level. This is essential for a student to make progress and allows for the creation of rubrics, for use in class (Black & Wiliam, 2009; Nückles et al., 2009; Huang, 2012), to make the process explicit to the student. Use of HOTS (viz. Higher Order Thinking Skills) maps (Hook & Mills, 2011) can scaffold in-depth discussion in English, encouraging students to:

Develop interpretations, use research and critical thinking effectively to develop their own answers, and write essays that engage with the critical conversation of the field (Linkon, 2005:247, cited in Allen, 2011).

It may also be helpful in providing a range of techniques for differentiated

learning (Anderson, 2007; Hook & Mills, 2012).

The SOLO taxonomy has a number of proponents. Hook & Mills

(2011:5) refer to it as ‘a model of learning outcomes that helps schools

develop a common understanding’. Moseley et al. (2005:306) advocates its

use as a ‘framework for developing the quality of assessment’ citing that it is


‘easily communicable to students’. Hattie (2012:54), in his wide-ranging

investigation into effective teaching and ‘visible learning’, outlines three levels

of understanding: surface, deep and conceptual. He indicates that:

The most powerful model for understanding these three levels and integrating them into learning intentions and success criteria is the SOLO model.

 

However, the taxonomy is not without critics; Chick (1998:20) believes

that ‘there is potential to misjudge the level of functioning’ and Chan et al.

(2002:512) criticises its ‘conceptual ambiguity’ stating that the ‘categorisation’

is ‘unstable’. In these two studies, the SOLO taxonomy was used primarily for

assessing completed work, so use throughout the teaching process may

alleviate these issues.

  An additional criticism, in particular when the taxonomy is compared

with that of Bloom (1956), is the SOLO taxonomy’s structure. Biggs & Collis

(1991) refers to the structure as a hierarchy, as does Moseley et al. (2005);

naturally, there are concerns when complex processes, such as human

thought, are categorised in this manner. However, Campbell et al. (1992)

explained the structure of the SOLO taxonomy as consisting of a series of

cycles (especially between the Unistructural, Multistructural and Relational

levels), which would allow for a development of breadth of knowledge as well

as depth.

However, SOLO taxonomy can be used not only in designing the

curriculum in terms of the learning outcomes intended, but also in

assessment. It can be effectively used for students to deconstruct exam

questions to understand marks awarded and as a vehicle for self-assessment

and peer-assessment.

5.6 Guidelines for constructing test items


Tests do not work without well-written test items. Test-takers appreciate

clearly written questions that do not attempt to trick or confuse them into

incorrect responses. The following presents the major characteristics of well-

written test items.

5.6.1 Aim of the test

Test item development is a critical step in building a test that properly

meets certain standards. A good test is only as good as the quality of the test

items. If the individual test items are not appropriate and do not perform well,

how can the test scores be meaningful? The topic to be evaluated (construct)

and where the evaluation is done (title/context) must be part of the curriculum.

If it is evaluated outside the curriculum, the curricular validity of the item can

be disputed. Therefore, test items must be developed to precisely measure the

objectives prescribed by the blueprint and meet quality standards.

5.6.2 Range of the topics to be tested

A test must measure the test-takers’ ability or proficiency in applying the

knowledge and principles on the topics that they have learnt. Ample

opportunity must be given to students to learn the topics that are to be

evaluated. This opportunity would include the availability of language

teachers, well-equipped facilities, and the expertise of the language teachers

in conducting the lessons and providing the skills and knowledge that would be

evaluated to the test-takers or students.

5.6.3 Range of skills to be tested

Test item writers should always attempt to write test items that measure

higher levels of cognitive processing. This is not an easy task. It should be a

goal of the writer to ensure their items have cognitive characteristics

exemplifying understanding, problem-solving, critical

thinking, analysis, synthesis, evaluation and interpretation rather than just

declarative knowledge. There are many theories that provide frameworks on

levels of thinking and Bloom’s taxonomy is often cited as a tool to use in item

writing. Always stick to writing important questions that indicate, and can predict, whether a test-taker is proficient at high levels of cognitive processing.

5.6.4 Test format

Test items should always follow a consistent design so that the

questioning process in itself does not add unnecessary difficulty to answering the

questions. Therefore a logical and consistent stimulus format for writing test

items can help expedite the laborious process of writing test items as well as

supply a format for asking basic questions. A format that provides an initial

starting structure to use in writing questions can be valuable for item writers.

When these formats are used, test takers can quickly read and understand the questions, since the format is expected. For example, to measure understanding of knowledge or facts, questions can begin with the following (a small sketch of reusable stems follows this list):

• What best defines ….?

• What is not a characteristic of ….?

• What is an example of ….?
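The small sketch below stores such reusable stem formats so that item writers can apply them consistently; the "knowledge / understanding" stems come from the list above, while the stems for the higher levels are assumptions added purely for illustration.

```python
# Sketch of reusable question-stem formats grouped loosely by cognitive level.
question_stems = {
    "knowledge / understanding": [            # stems listed in the text above
        "What best defines {term}?",
        "What is not a characteristic of {term}?",
        "What is an example of {term}?",
    ],
    "application": ["How would you use {term} in {situation}?"],       # assumed stem
    "analysis":    ["How are {x} and {y} related?"],                   # assumed stem
    "evaluation":  ["Which option best meets {criterion}, and why?"],  # assumed stem
}

# Drafting a lower-level item about the relative pronoun, for example:
stem = question_stems["knowledge / understanding"][0]
print(stem.format(term="a relative pronoun"))
```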

5.6.5 Level of difficulty

A test has a planned number of questions at a level of difficulty and

discrimination to best determine mastery and non-mastery performance states.

Test-takers should clearly understand what is needed in education and

language assessment to prepare for the examination and how much

experience performing certain activities would help in preparation. This should

be the road map that helps item writers create test items and helps test takers

understand what will be required of them to pass an examination. In any test

item construction, we must ensure that weak students can answer easy items, intermediate-proficiency students can answer easy and moderate items, and high-proficiency students can answer easy, moderate and advanced items. A reliable and valid test instrument should encompass all three levels of difficulty.
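A quick way to act on this advice is sketched below: the draft items are tagged by intended difficulty and the mix is checked before the paper is finalised; the tags and counts are invented for illustration.

```python
# Illustrative sketch, assuming invented difficulty tags for a draft paper.
from collections import Counter

draft_items = ["easy", "easy", "moderate", "moderate", "moderate",
               "advanced", "easy", "moderate", "advanced", "easy"]

counts = Counter(draft_items)
missing = {"easy", "moderate", "advanced"} - set(counts)

print(counts)                                   # distribution of items by difficulty
if missing:
    print("add items at these levels:", ", ".join(sorted(missing)))
```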

5.6.6 International and Cultural Considerations (bias)

In standardised tests when exams are distributed internationally, either

in a single language or translated to other languages, always refrain from

the use of slang, geographic references, historical references or dates

(holidays) that may not be understood by an international examinee. Tests

need to be adapted to other societies so that meaning is translated correctly and no particular group of test-takers is advantaged. Steps should be

taken to avoid item content that may bias gender, race or other cultural

groups.

What are the good characteristics of a test item?

Explain each characteristic of a test item in a graphic organiser.

http://books.google.com.my/books/about/Constructing_Test_Items.html?id=Ia3SGDfbaV0C&redir_esc=y

6.0 Test format

What is the difference between test format and test type? For example, when you want to introduce a new kind of test, say a reading test that is organised a little differently from the existing test items, what do you say: test format or test type? Test format refers to the layout of questions on a test. For example, the format of a test could be two essay questions, 50 multiple-choice questions, etc. For the sake of brevity, we will consider the outlines of some large-scale standardised tests.


UPSR

The Primary School Evaluation Test, also known as Ujian Penilaian Sekolah Rendah in Malay (commonly abbreviated as UPSR), is a national

examination taken by all pupils in our country at the end of their sixth year

in primary school before they leave for secondary school. It is prepared and

examined by the Malaysian Examinations Syndicate. This test consists of two

papers namely Paper 1 and Paper 2.

Paper 1 consists of multiple-choice questions answered on a standardised optical answer sheet that uses optical mark recognition for detecting answers, while Paper 2 comprises three sections, namely Sections A, B, and C.

TOEFL (Test of English as a Foreign Language)

The TOEFL test is administered in two ways: as an Internet-based test (TOEFL iBT™) and as a paper-based test (TOEFL PBT). Most of the 4,500+ test sites in the world use the TOEFL iBT. The TOEFL iBT® test is given in

English and administered via the Internet. There are four sections (listening,

reading, speaking and writing), which take a total of about four and a half

hours to complete.

IELTS Test Format

IELTS is a test of all four language skills – Listening, Reading, Writing &

Speaking. Test-takers will take the Listening, Reading and Writing tests all on

the same day one after the other, with no breaks in between. Depending on

the examinee’s test centre, one’s Speaking test may be on the same day as

the other three tests, or up to seven days before or after that. The total test

time is under three hours. The test format is illustrated below.


Figure 6: IELTS Test Format


TOPIC 6 ASSESSING LANGUAGE SKILLS CONTENT

6.0 SYNOPSIS

Topic 6 focuses on ways to assess language skills and language

content. It defines the types of test items used to assess language skills

and language content. It also provides teachers with suggestions on

ways a teacher can assess the listening, speaking, reading and writing

skills in a classroom. It also discusses concepts of and differences

between discrete point test, integrative test and communicative test.

6.1 LEARNING OUTCOMES

At the end of Topic 6, teachers will be able to:

Identify and carry out the different types of assessment to assess

language skills and language content

Understand and differentiate between objective and subjective

testing

Understand and differentiate between discrete point test,

integrative test and communicative test in assessing language.

6.2 FRAMEWORK OF TOPICS

CONTENT


SESSION SIX (6 hours)

6.2.1 Types of test items to assess language skills

a. Listening

Basically there are two kinds of listening tests: tests that test specific aspects

of listening, like sound discrimination; and task based tests which test skills in

accomplishing different types of listening tasks considered important for the

students being tested. In addition to this, Brown (2010) identified four types of

listening performance from which assessment could be considered.

i. Intensive : listening for perception of the components (phonemes, words,

intonation, discourse markers, etc.) of a larger stretch of language.

ii. Responsive : listening to a relatively short stretch of language ( a

greeting, question, command, comprehension check, etc.) in order to

make an equally short response

iii. Selective : processing stretches of discourse such as short monologues

for several minutes in order to “scan” for certain information. The

purpose of such performance is not necessarily to look for global or

general meaning but to be able to comprehend designated information

in a context of longer stretches of spoken language( such as classroom

directions from a teacher, TV or radio news items, or stories).

Assessment tasks in selective listening could ask students, for example,

to listen for names, numbers, grammatical category, directions (in a map

exercise), or certain facts and events.

iv. Extensive : listening to develop a top-down , global

understanding of spoken language. Extensive performance

ranges from listening to lengthy lectures to listening to a

conversation and deriving a comprehensive message or

purpose. Listening for the gist – or the main idea- and making

inferences are all part of extensive listening.


b. Speaking

In the assessment of oral production, both discrete feature

objective tests and integrative task-based tests are used. The first

type tests such skills as pronunciation, knowledge of what

language is appropriate in different situations, language required in

doing different things like describing, giving directions, giving

instructions, etc. The second type involves finding out if pupils can

perform different tasks using spoken language that is appropriate

for the purpose and the context. Task-based activities involve

describing scenes shown in a picture, participating in a discussion

about a given topic, narrating a story, etc. As in the listening

performance assessment tasks, Brown (2010) cited five categories for oral assessment.

1. Imitative . At one end of a continuum of types of speaking

performance is the ability to imitate a word or phrase or possibly

a sentence. Although this is a purely phonetic level of oral

production, a number of prosodic (intonation, rhythm,etc.), lexical

, and grammatical properties of language may be included in the

performance criteria. We are interested only in what is

traditionally labelled “pronunciation”; no inferences are made about the test-taker’s ability to understand or convey meaning or

to participate in an interactive conversation. The only role of

listening here is in the short-term storage of a prompt, just long

enough to allow the speaker to retain the short stretch of

language that must be imitated.

2. Intensive. The production of short stretches of oral language

designed to demonstrate competence in a narrow band of

grammatical, phrasal, lexical, or phonological relationships.

Examples of intensive assessment tasks include directed

response tasks (requests for specific production of speech),

reading aloud, sentence and dialogue completion, limited picture-


cued tasks including simple sentences, and translation up to the

simple sentence level.

3. Responsive. Responsive assessment tasks include interaction

and test comprehension but at somewhat limited level of very

short conversation, standard greetings, and small talk, simple

requests and comments, etc. The stimulus is almost always a

spoken prompt (to preserve authenticity) with one or two follow-up

questions or retorts:

A. Liza : Excuse me, do you have the time?

Don : Yeah. Six-fifteen.

B. Jo : What is the most urgent social problem today?

Sue : I would say bullying.

C. Lan : Hey, Shan, how’s it going?

Shan: Not bad, and yourself?

Lan : I’m good.

Shan: Cool. Okay gotta go.

4. Interactive. The difference between responsive and interactive

speaking is in the length and complexity of the interaction, which

sometimes includes multiple exchanges and/or multiple

participants. Interaction can be broken down into two types : (a)

transactional language, which has the purpose of exchanging

specific information, and (b) interpersonal exchanges, which have

the purpose of maintaining social relationships. (In the three

dialogues cited above, A and B are transactional, and C is

interpersonal).

5. Extensive (monologue). Extensive oral production tasks include

speeches, oral presentations, and storytelling, during which the

opportunity for oral interaction from listeners is either highly limited

(perhaps to nonverbal responses) or ruled out together. Language


style is more deliberative (planning is involved) and formal for

extensive tasks. It can include informal monologues such as

casually delivered speech (e.g., recalling a vacation in the

mountains, conveying recipes, recounting the plot of a novel or

movie).

c. Reading

Cohen (1994) discussed various types of reading and the meanings

assessed. He describes skimming and scanning as two different types

of reading. In the first, a respondent is given a lengthy passage and is

required to inspect it rapidly (skim) or read to locate specific

information (scan) within a short period of time. He also discusses

receptive reading or intensive reading which refers to “a form of

reading aimed at discovering exactly what the author seeks to

convey” (p. 218). This is the most common form of reading especially

in test or assessment conditions. Another type of reading is to read

responsively where respondents are expected to respond to some

point in a reading text through writing or by answering questions.

A reading text can also convey various kinds of meaning and reading

involves the interpretation or comprehension of these meanings. First,

grammatical meanings are meanings that are expressed through

linguistic structures such as complex and simple sentences and the

correct interpretation of those structures. A second meaning is

informational meaning which refers largely to the concept or messages

contained in the text. Respondents may be required to comprehend

merely the information or content of the passage and this may be

assessed through various means such as summary and précis writing.

Compared to grammatical or syntactic meaning, informational meaning

requires a more general understanding of a text rather than having to

pay close attention to the linguistic structure of sentences. A third

meaning contained in many texts is discourse meaning. This refers to

the perception of rhetorical functions conveyed by the text. One typical


function is discourse marking which adds cohesiveness to a text.

These words, such as unless, however, thus, therefore etc., are crucial

to the correct interpretation of a text and students may be assessed on

their ability to understand the discoursal meaning that they bring in the

passage. Finally, a fourth meaning which may also be an object of

assessment in a reading test is the meaning conveyed by the writer’s

tone. The writer’s tone – whether it is cynical, sarcastic or sad, for example – is

important in reading comprehension but may be quite difficult to

identify, especially by less proficient learners. Nevertheless, there can

be many situations where the reader is completely wrong in

comprehending a text simply because he has failed to perceive the

correct tone of the author.

d. Writing

Brown (2004) identifies three different genres of writing, which are

academic writing, job-related writing and personal writing, each of

which can be expanded to include many different examples. Fiction,

for example, may be considered as personal writing according to

Brown’s taxonomy. Brown (2010) identified four categories of written

performance that capture the range of written production which can

be used to assess writing skill.

1. Imitative. To produce written language, the learner must attain the

skills in the fundamental, basic tasks of writing letters, words,

punctuation, and brief sentences. This category includes the

ability to spell correctly and to perceive phoneme-grapheme

correspondences in the English spelling system. At this stage the

learners are trying to master the mechanics of writing. Form is

the primary focus while context and meaning are of secondary

concern.

2. Intensive (controlled). Beyond the fundamentals of imitative

writing are skills in producing appropriate vocabulary within a

context, collocation and idioms, and correct grammatical features

up to the length of a sentence. Meaning and context are


important in determining correctness and appropriateness but

most assessment tasks are more concerned with a focus on form

and are rather strictly controlled by the test design.

3. Responsive. Assessment tasks require learners to perform at a

limited discourse level, connecting sentences into a paragraph

and creating a logically connected sequence of two or three

paragraphs. Tasks relate to pedagogical directives, lists of

criteria, outlines, and other guidelines. Genres of writing include

brief narratives and descriptions, short reports, lab reports,

summaries, brief responses to reading, and interpretations of

charts and graphs. Form-focused attention is mostly at the

discourse level, with a strong emphasis on context and meaning.

4. Extensive. Extensive writing implies successful management of all

the processes and strategies of writing for all purposes, up to the

length of an essay, a term paper, a major research project report,

or even a thesis. Focus is on achieving a purpose, organizing and

developing ideas logically, using details to support or illustrate

ideas, demonstrating syntactic and lexical variety, and in many

cases, engaging in the process of multiple drafts to achieve a final

product. Focus on grammatical form is limited to occasional

editing and proofreading of a draft.

6.2.2 Objective and Subjective test

Tests have been categorized in many different ways. The most

familiar terms regarding tests are the objective and subjective

tests. We normally associate objective tests with multiple choice

question type tests and subjective tests with essays. However, to

be more accurate we will consider how the test is graded. Objective

tests are tests that are graded objectively while subjective tests are

thought to involve subjectivity in grading.

There are many examples of each type of test. Objective type tests

include the multiple choice test, true-false items and matching items, because each of these is graded objectively. In these examples of objective tests, there is only one correct response and the grader does not need to assess the response subjectively.
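The sketch below illustrates why such marking is objective: every response is checked against a fixed answer key, so any marker obtains the same total. The key and the candidate's responses are invented for illustration.

```python
# Sketch of objective (select-type) marking against a fixed answer key.
answer_key = {"Q1": "B", "Q2": "D", "Q3": "A", "Q4": "C"}
candidate  = {"Q1": "B", "Q2": "A", "Q3": "A", "Q4": "C"}

score = sum(candidate.get(q) == correct for q, correct in answer_key.items())
print(f"score: {score}/{len(answer_key)}")      # 3/4, regardless of who marks it
```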

Examples of the subjective test include essays and short answer

questions. However, some other common types of tests, such as the dictation test, fill-in-the-blank tests, as well as interviews and role plays, can be considered both subjective and objective: they fall on a continuum where some tests are more objective than others. As such, some of these tests would fall closer to one end of the continuum or the other.

Two other terms, select type tests and supply type tests are related

terms when we think of objective and subjective tests. In most

cases, objective tests are similar to select type tests where students

are expected to select or choose the answer from a list of options.

Just as a multiple choice question test is an objective type test, it

can also be considered a select type test. Similarly, tests involving

essay type questions are supply type as the students are expected

to supply the answer through their essay. How then would you

classify a fill in the blank type test? Definitely for this type of test,

the students need to supply the answer, but what is supplied is

merely a single word or a short phrase which differs tremendously

from an essay. It may therefore be helpful to once again consider a

continuum with supply type and select type items at each end of the

continuum respectively.

It is possible to now combine both continua as shown in Figure 6.1

with the two different test formats placed within the two continua:


Figure 6.1: Continua for different types of test formats

It is not by accident that we find there are few, if any, test formats that are either supply type and objective or select type and subjective. Select type tests tend to be objective while supply type tests tend to be subjective.

In addition to the above, Brown and Hudson (1998), have also suggested three broad categories to differentiate tests according to how students are expected to respond. These categories are the selected response tests, the constructed response tests, and the personal response tests. Examples of each of these types of tests are given in Table 6.1.

Table 6.1: Types of Tests According to Students’ Expected Response

Selected response: true-false, matching, multiple choice

Constructed response: fill-in, short answer, performance test

Personal response: conferences, portfolios, self- and peer assessments

Selected response assessments, according to Brown and Hudson (1998),

are assessment procedures in which “students typically do not create any

language” but rather “select the answer from a given list” (p. 658).

Constructed response assessment procedures require students to

“produce language by writing, speaking, or doing something else” (p.

660). Personal response assessments, on the other hand, require

students to produce language but also allow each student’s response to differ from the others’ and let students “communicate what they


want to communicate” (p. 663). These three types of tests, categorised

according to how students respond, are useful when we wish to

determine what students need to do when they attempt to answer test

questions.

6.2.3 Types of test items to assess language content

a. Discrete Point Test and Integrative Test

Language tests may also be categorised as either discrete point or

integrative. Discrete point tests examine one element at a time.

An integrative test, on the other hand, “requires the candidate to

combine many language elements in the completion of a

task” (Hughes, 1989: 16). It is a simultaneous measure of

knowledge and ability of a variety of language features, modes, or

skills.

A multiple choice type test is usually cited as an example of a

discrete point test while essays are commonly regarded as the

epitome of integrative tests. However, both the discrete point test

and the integrative test are a matter of degree. A test may be more

discrete point than another and similarly a test may be more

integrative than another. Perhaps the more important aspect is to be

aware of the discrete point or integrative nature of a test as we must

be careful of what we believe the test measures.

This brings us to the question of just how discrete point a multiple choice item really is. While it is definitely more discrete point

than an essay, it may still require more than just one skill or ability in

order to complete. Let’s say you are interested in testing a student’s

knowledge of the relative pronoun and decide to do so by using a

multiple choice test item. If he fails to answer this test item correctly,

would you conclude that the student has problems with the relative

pronoun? The answer may not be as straightforward as it seems.

The test is presented in textual form and therefore requires the


student to read. As such, even the multiple choice test item involves

some integration of language skills as this example shows, where in

addition to the grammatical knowledge of relative pronouns, the

student must also be able to read and understand the question.

Perhaps a clearer way of viewing the distinction between the

discrete point and the integrative test is to examine the perspective

each takes toward language. In the discrete point test, language is

seen to be made up of smaller units and it may be possible to test

language by testing each unit at a time. Testing knowledge of the

relative pronoun, for example, is certainly assessing the students on

a particular unit of language and not on the language as a whole. In

an integrative test, on the other hand, the perspective of language is

that of an integrated whole which cannot be broken up into smaller

units or elements. Hence, the testing of language should maintain

the integrity or wholeness of the language.

b. Communicative Test

As language teaching has emphasised the importance of

communication through the communicative approach, it is not surprising

that communicative tests have also been given prominence. A

communicative emphasis in testing involves many aspects, two of

which revolve around communicative elements in tests and meaningful

content. Both these aspects are briefly addressed in the following sub

sections:

Integrating Communicative Elements into Examinations

Alderson and Banerjee (2002) report on various studies that seem to

point to the difficulty in achieving authenticity in tests. They cite Spence-

Brown (2001) who posits that “the very act of assessment changes the

nature of a potentially authentic task and compromises authenticity” and

that “authenticity must be related to the implementation of an activity,

not to its design” (p. 99). In her study, students were required to


interview native speakers outside the classroom and submit a tape-

recording of the interview. While this activity seems quite authentic, the

students were observed to prepare for the interview by “rehearsing the

interview, editing the results, and engaging in spontaneous, but flawed

discourse” (Alderson & Banerjee, 2002: 99), all of which are inauthentic

when viewed in terms of real life situations. Alderson himself argues

that because candidates in language tests are not interested in

communicating but in displaying their language abilities, the test situation

is a communicative event in itself and therefore cannot be used to

replicate any real world event (p. 98).

Chalhoub-Deville (2003) argues for tests that take context into

consideration. She believes that there should be a “shift in focus of our

measurement from traditional examinations of the construct in terms of

response consistency, to investigations that systematically explore

inconsistent (which does not mean random) performances across

contexts” (p. 378). In the future, besides context, tests will also need to

integrate elements of communication such as topic initiation, topic

maintenance, and topic change in order for the test to become more

authentic and realistic. Due to issues of practicality, involving especially

the amount of time and extent of organisation to allow for such

communicative elements to emerge, it will not be an easy task to

achieve.

The idea of bringing communicative elements into the language test is

not a new one. In his review of communicative tests, Fulcher (2000)

notes the descriptors of a communicative test as suggested by several

theorists. The three principles of communicative tests that he highlights

are that communicative tests:

involve performance; are authentic; and are scored on real-life outcomes.


In short, the kinds of tests that we should expect more of in the future

will be communicative tests in which candidates actually have to

produce the language in an interactive setting involving some degree of

unpredictability which is typical of any language interaction situation.

These tests would also take the communicative purpose of the

interaction into consideration and require the student to interact with

language that is actual and unsimplified for the learner. Fulcher finally

points out that in a communicative test, “the only real criterion of

success … is the behavioural outcome, or whether the learner was able

to achieve the intended communicative effect” (p. 493). It is obvious

from this description that the communicative test may not be so easily

developed and implemented. Practical reasons may hinder some of the

demands listed. Nevertheless, a solution to this problem has to be

found in the near future in order to have valid language tests that are

purposeful and can stimulate positive washback in teaching and

learning.

Exercise 1

1. In your opinion and based on your teaching experience, how would you conduct the testing of reading, writing and speaking skills of your own

students? What are the methods that you employ? Share this with your classmates and exchange ideas.

2. Describe three different types of writing performance as suggested by Brown (2004) and relate them to academic writing, job-related writing and personal writing.


TOPIC 7 SCORING, GRADING AND ASSESSMENT CRITERIA

7.0 SYNOPSIS

Topic 7 focuses on the scoring, grading and assessment criteria. It provides

teachers with brief descriptions of the different approaches to scoring, namely objective, holistic and analytic.

7.1 LEARNING OUTCOMES

By the end of Topic 7, teachers will be able to:

Identify and differentiate the different approaches used in scoring

Use the different scoring approaches in assessing language

7.2 FRAMEWORK OF TOPICS


CONTENT

SESSION SEVEN (3 hours)

7.2.1 Objective approach

One approach to scoring is the objective approach. This scoring

approach relies on quantified methods of evaluating students’ writing. A

sample of how objective scoring is conducted is given by Bailey (1999) as

follows:

Establish standardization by limiting the length of the assessment: Count

the first 250 words of the essay.

Identify the elements to be assessed: Go through the essay up to the 250th

word underlining every mistake – from spelling and mechanics through

verb tenses, morphology, vocabulary, etc. Include every error that a literate

reader might note.

Operationalise the assessment: Assign a weight score to each error, from 3

to 1. A score of 3 is a severe distortion of readability or flow of ideas; 2 is a

moderate distortion; and 1 is a minor error that does not affect readability in

any significant way.

Quantify the assessment: Calculate the essay’s Correctness Score as a fraction with 250 (the number of words counted) as the numerator and the sum of all the error scores as the denominator, i.e. Correctness Score = 250 / (sum of error scores).
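The following is a minimal Python sketch of this Correctness Score calculation. The error weights in the list are hypothetical sample data; in practice they would come from a grader's markup of the first 250 words.

```python
# A sketch of Bailey's (1999) objective scoring procedure as described above.
# Each entry in `error_weights` is the severity (3, 2 or 1) assigned to one
# error found in the first 250 words; the list below is hypothetical.

def correctness_score(error_weights, words_counted=250):
    """Correctness Score = words counted / sum of the weighted error scores."""
    total_error = sum(error_weights)
    if total_error == 0:
        return float("inf")  # an error-free sample has no finite ratio
    return words_counted / total_error

errors = [3, 2, 1, 1, 2, 3, 1, 1, 2, 1, 1, 2]   # 12 hypothetical errors
print(round(correctness_score(errors), 2))       # 250 / 20 = 12.5
```

On this scale, fewer or less severe errors yield a higher Correctness Score.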

7.2.2 Holistic approach

In holistic scoring, the reader reacts to the students’ compositions as a

whole and a single score is awarded to the writing. Normally this score is

on a scale of 1 to 4, or 1 to 6, or even 1 to 10 (Bailey, 1998: 187). Each

score on the scale will be accompanied with general descriptors of ability.

The following is an example of a holistic scoring scheme based on a 6

point scale.


Table 7.1: Holistic Scoring Scheme

Source: S.S. Moya, Evaluation Assistance Center (EAC)-East, Georgetown

University, Washington

Rating 5-6: Vocabulary is precise, varied, and vivid. Organization is appropriate to the writing assignment and contains a clear introduction, development of ideas, and conclusion. Transition from one idea to another is smooth and provides the reader with a clear understanding that the topic is changing. Meaning is conveyed effectively. A few mechanical errors may be present but do not disrupt communication. Shows a clear understanding of writing and topic development.

Rating 4: Vocabulary is adequate for grade level. Events are organized logically, but some parts of the sample may not be fully developed. Some transition of ideas is evident. Meaning is conveyed but breaks down at times. Mechanical errors are present but do not disrupt communication. Shows a good understanding of writing and topic development.

Rating 3: Vocabulary is simple. Organization may be extremely simple or there may be evidence of disorganization. There are a few transitional markers or repetitive transitional markers. Meaning is frequently not clear. Mechanical errors affect communication. Shows some understanding of writing and topic development.

Rating 2: Vocabulary is limited and repetitious. Sample is comprised of only a few disjointed sentences. No transitional markers. Meaning is unclear. Mechanical errors cause serious disruption in communication. Shows little evidence of discourse understanding.

Rating 1: Responds with a few isolated words. No complete sentences are written. No evidence of concepts of writing.

Rating 0: No response.

The 6 point scale above includes broad descriptors of what a student’s essay

reflects for each band. It is quite apparent that graders using this scale are

expected to pay attention to vocabulary, meaning, organisation, topic

development and communication. Mechanics such as punctuation are

secondary to communication.

Bailey also describes another type of scoring related to the holistic approach

which she refers to as primary trait scoring. In primary trait scoring, a particular

functional focus is selected which is based on the purpose of the writing and

grading is based on how well the student is able to express that function. For

example, if the function is to persuade, scoring would be on how well the

author has been able to persuade the grader rather than how well organised

the ideas were, or how grammatical the structures in the essay were. This

approach to grading emphasises functional and communicative ability rather

than discrete linguistic ability and accuracy.

7.2.3 Analytic approach

Analytical scoring is a familiar approach to many teachers. In analytical

scoring, raters assess students’ performance on a variety of categories

which are hypothesised to make up the skill of writing. Content, for example,

is often seen as an important aspect of writing – i.e. is there substance to

what is written? Is the essay meaningful? Similarly, we may also want to

consider the organisation of the essay. Does the writer begin the essay with

an appropriate topic sentence?

Are there good transitions between paragraphs? Other categories that we

may want to also consider include vocabulary, language use and mechanics.

The following are some possible components used in assessing writing


ability using an analytical scoring approach and the suggested weightage

assigned to each:

Component: Weight
Content: 30 points
Organisation: 20 points
Vocabulary: 20 points
Language use: 25 points
Mechanics: 5 points

The points assigned to each component reflect the importance of

each of the components.
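As a simple illustration, the sketch below combines hypothetical component marks (each expressed as a proportion of the component's maximum) with the weights listed above; the component names and sample marks are assumptions for illustration only.

```python
# A sketch of an analytic scoring calculation using the weights listed above
# (content 30, organisation 20, vocabulary 20, language use 25, mechanics 5).

WEIGHTS = {"content": 30, "organisation": 20, "vocabulary": 20,
           "language_use": 25, "mechanics": 5}

def analytic_score(proportions):
    """Total score out of 100: sum of (proportion earned x component weight)."""
    return sum(proportions[c] * w for c, w in WEIGHTS.items())

# Hypothetical marks for one essay, each as a proportion of the component maximum.
sample = {"content": 0.8, "organisation": 0.7, "vocabulary": 0.75,
          "language_use": 0.6, "mechanics": 0.9}
print(analytic_score(sample))  # 24 + 14 + 15 + 15 + 4.5 = 72.5
```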

Comparing the Three Approaches

Each of the three scoring approaches has its own advantages and disadvantages. These are summarised in Table 7.2.

Table 7.2: Comparison of the Advantages and Disadvantages of the Three Approaches to Scoring Essays

Holistic
Advantages: quickly graded; provides a public standard that is understood by teachers and students alike; relatively higher degree of rater reliability; applicable to the assessment of many different topics; emphasises the students’ strengths rather than their weaknesses.
Disadvantages: the single score may actually mask differences across individual compositions; does not provide a lot of diagnostic feedback.

Analytical
Advantages: provides clear guidelines for grading in the form of the various components; allows the graders to consciously address important aspects of writing.
Disadvantages: writing ability is unnaturally split up into components.

Objective
Advantages: emphasises the students’ strengths rather than their weaknesses.
Disadvantages: still involves some degree of subjectivity; accentuates negative aspects of the learner’s writing without giving credit for what they can do well.


EXERCISE

1. Based on your understanding, draw a mind map to indicate the advantages and disadvantages of the three approaches to scoring essays.


TOPIC 8 ITEM ANALYSIS AND INTERPRETATION

8.0 SYNOPSIS

Topic 8 focuses on item analysis and interpretation. It provides teachers with

brief descriptions on basic statistics terminologies such as mode, median, mean,

standard deviation, standard score and interpretation of data. It will also look at

item analysis, which deals with item difficulty and item discrimination.

Teachers will also be introduced to distractor analysis in language assessment.

8.1 LEARNING OUTCOMES

By the end of Topic 8, teachers will be able to:

Identify and differentiate some basic statistics terminologies used.

Determine how well items discriminate using item discrimination; and

Analyse how well a distractor in a test item performs

8.2 FRAMEWORK OF TOPICS


CONTENT

SESSION EIGHT (6 hours)

8.2.1 Basic Statistics

Let us assume that you have just graded the test papers for your class. You

now have a set of scores. If a person were to ask you about the performance

of the students in your class, it would be very difficult to give all the scores in

the class. Instead, you may prefer to cite only one score.

Or perhaps you would like to report on the performance by giving some

values that would help provide a good indication of how the students in your

class performed. What values would you give? In this section, we will look at

two kinds of measures, namely measures of central tendency and measures

of dispersion. Both these types of measures are useful in score reporting.

Central tendency measures the point around which a set of scores tends to cluster. There are three major measures of central tendency. They are the

mode, median and mean.

MODE

The mode is the most frequently occurring raw score in a set of scores. The following is a set of scores: 15, 13, 12, 12, 13, 16, 13, 17, 14, 18. What is the mode for this set of scores? If you said 13, then


you are correct, as 13 occurs more often than any other score. A set of scores may have more than one mode; if there are two modes, then the set of scores is referred to as being bimodal.

MEDIAN

The median refers to the score that is in the middle of a set of scores when the scores are arranged in ascending or descending order. Take the seven scores 47, 65, 45, 54, 50, 52, 51. If we arrange them in order of value, we get 45, 47, 50, 51, 52, 54, 65. In this set of scores, the median is 51 as it is the middle score: there are three scores lower than it and an equal number of scores higher than it. What happens when there is an even number of scores? Let’s take the following set of scores as an example: 45, 47, 50, 51, 52, 53, 54, 65. As there is no single score in the middle, we take the two middle scores, add them up and divide by two. As such, the median is 51.5, since (51 + 52)/2 = 103/2 = 51.5. Always remember, however, that to find the median we must first arrange the scores in ascending or descending order of value.

MEAN

The mean of a set of test scores is the arithmetic mean or average and is calculated as ΣX/N, where Σ (sigma) refers to the sum of, X refers to the raw or observed scores, and N is the number of observed scores. Look at the following set of scores: 47, 65, 45, 54, 50, 52, 51.

The mean for this set of scores is 364/7 = 52
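The worked examples above can be verified with a few lines of Python, shown here purely as a quick check of the arithmetic.

```python
# Checking the mode, median and mean examples above with the statistics module.
from statistics import mode, median, mean

print(mode([15, 13, 12, 12, 13, 16, 13, 17, 14, 18]))   # 13
print(median([45, 47, 50, 51, 52, 54, 65]))              # 51 (odd number of scores)
print(median([45, 47, 50, 51, 52, 53, 54, 65]))          # 51.5 (even number of scores)
print(mean([47, 65, 45, 54, 50, 52, 51]))                # 364 / 7 = 52
```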

8.2.2 Standard deviation

Standard deviation refers to how much the scores deviate from the mean.

There are two methods of calculating the standard deviation: the deviation method and the raw score method. Both methods are illustrated in the sketch given after the two tables below.

To illustrate this, we will use the scores 20, 25 and 30. Using the deviation method, we come up with the following table:


Table 8.1:Calculating the Standard Deviation Using the Deviation Method

Using the raw score method, we can come up with the following:

Table 8.2 : Calculating the Standard Deviation Using the Raw Score Method


Both methods result in the same final value of 5. If you are calculating

standard deviation with a calculator, it is suggested that the deviation

method be used when there are only a few scores and the raw score

method be used when there are many scores. This is because when

there are many scores, it will be tedious to calculate the square of the

deviations and their sum.
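The sketch below shows both calculation routes in Python. It assumes the calculation divides by N - 1 (the sample standard deviation), since that is what yields the value of 5 reported above for the scores 20, 25 and 30.

```python
# Standard deviation of the scores 20, 25, 30 by the two routes described above,
# assuming an N - 1 denominator (which reproduces the value of 5 in the text).
from math import sqrt

scores = [20, 25, 30]
N = len(scores)

# Deviation method: s = sqrt( sum of (X - mean)^2 / (N - 1) )
mean = sum(scores) / N
dev_method = sqrt(sum((x - mean) ** 2 for x in scores) / (N - 1))

# Raw score method: s = sqrt( (sum of X^2 - (sum of X)^2 / N) / (N - 1) )
raw_method = sqrt((sum(x ** 2 for x in scores) - sum(scores) ** 2 / N) / (N - 1))

print(dev_method, raw_method)  # both print 5.0
```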

8.2.3 Standard score

Standardised scores are necessary when we want to make

comparisons across tests and measurements. Z scores and T scores

are the more common forms of standardised scores although you

may come up with your own standardised score. A standardised score

can be computed for every raw score in a set of scores for a test.

i. The Z score

The Z score is the basic standardised score. It is referred to as the

basic form as other computations of standardised scores must first

calculate the Z score. The formula used to calculate the Z score is as follows: Z = (X – mean) / standard deviation, where X is the raw score.


Table 8.3: Calculating the Z Score for a Set of Scores

Z score values are very small and usually range only from –2 to 2.

Such small values make the Z score inappropriate for score reporting, especially

for those unaccustomed to the concept. Imagine what a parent may

say if his child comes home with a report card with a Z score of 0.47 in

English Language! Fortunately, there is another form of standardised

score - the T score – with values that are more palatable to the

relevant parties.

ii. The T score

The T score is a standardised score which can be computed using the

formula 10 (Z) + 50. As such, the T score for students A, B, C, and D in

Table 8.3 are 10(-1.28) + 50; 10(-0.23) + 50; 10(0.47) + 50; and 10

(1.04) + 50 or 37.2, 47.7, 54.7, and 60.4 respectively. These values

seem perfectly appropriate compared to the Z score. The T score

average or mean is always 50 (i.e. it corresponds to a Z score of 0), which

connotes an average ability and the mid point of a 100 point scale.
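A short sketch of both conversions is given below. The raw score, mean and standard deviation used are hypothetical illustration values, chosen so that the result matches the Z score of 0.47 and T score of 54.7 mentioned above.

```python
# Converting a raw score to a Z score and then to a T score, as described above.

def z_score(raw, mean, sd):
    return (raw - mean) / sd        # Z = (X - mean) / standard deviation

def t_score(z):
    return 10 * z + 50              # T = 10(Z) + 50

z = z_score(raw=56, mean=52, sd=8.5)          # hypothetical values
print(round(z, 2), round(t_score(z), 1))      # 0.47 54.7
```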


8.2.4 Interpretation of data

The standardised score is actually a very important score if we want to

compare performance across tests and between students. Suppose, for example, that En. Abu wants to compare how his Form 2A students performed on two different tests. How can En. Abu solve this problem? He would need standardised scores in order to decide. This would require the following information:

Test 1: mean = 42, standard deviation = 7
Test 2: mean = 47, standard deviation = 8

Using the information above, En. Abu can find the Z score for each raw score reported as follows:

Table 8.4: Z Score for Form 2A

Based on Table 8.4, both Ali and Chong have a negative Z score as their total score for both tests. However, Chong has a higher Z score total (i.e. –1.07 compared to – 1.34) and therefore performed better when we take the performance of all the other students into consideration.
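The comparison can be sketched in a few lines of Python. The raw scores assigned to Ali and Chong below are hypothetical (Table 8.4's actual values are not reproduced here), but they are chosen so that the Z score totals match the -1.34 and -1.07 reported above.

```python
# Comparing two students across two tests via Z scores, using the test means
# and standard deviations given above (Test 1: 42, 7; Test 2: 47, 8).
# The raw scores below are hypothetical illustration values.

def z(raw, mean, sd):
    return (raw - mean) / sd

tests = [(42, 7), (47, 8)]                        # (mean, sd) for Test 1 and Test 2
students = {"Ali": [37, 42], "Chong": [38, 43]}   # hypothetical raw scores

for name, raws in students.items():
    total = sum(z(r, m, s) for r, (m, s) in zip(raws, tests))
    print(name, round(total, 2))
# Ali -1.34
# Chong -1.07  (the less negative total indicates the better relative performance)
```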

THE NORMAL CURVE

The normal curve is a hypothetical curve that is assumed to describe the distribution of many naturally occurring characteristics. It is assumed that if we were to sample a

particular characteristic such as the height of Malaysian men, then we will


find that while most will have an average height of perhaps 5 feet 4 inches,

there will be a few who will be relatively shorter and an equal number who

are relatively taller. By plotting the heights of all Malaysian men according to

frequency of occurrence, it is expected that we would obtain something

similar to a normal distribution curve. Similarly, test scores that measure any

characteristic such as intelligence, language proficiency or writing ability of a

specific population are also expected to provide us with a normal curve.

The following is a diagram illustrating how the normal curve would look like.

Figure 8.1: The normal distribution or Bell curve

The normal curve in Figure 8.1 is partitioned according to standard

deviations (i.e. – 4s, -3s, + 3s, + 4s) which are indicated on the horizontal

axis. The area of the curve between standard deviations is indicated in

percentage on the diagram. For example, the area between the mean (0

standard deviation) and +1 standard deviation is 34.13%. Similarly, the

area between the mean and –1 standard deviation is also 34.13%. As

such, the area between –1 and 1 standard deviations is 68.26%.


In using the normal curve, it is important to make a distinction between

standard deviation values and standard deviation scores. A standard

deviation value is a constant and is shown on the horizontal axis of the

diagram above. The standard deviation score, on the other hand, is the

obtained score when we use the standard deviation formula provided

earlier. So, if we find the score to be 5 as in the earlier example, then the

score for the standard deviation value of 1 is 5 and for the value of 2 is 5

x 2 = 10 and for the value of 3 is 15 and so on. Standard deviation values

of –1, -2, and –3 will have corresponding negative scores of –5, -10, and

–15.

8.2.5 Item analysis

a. Item difficulty

Item difficulty refers to how easy or difficult an item is. The formula

used to measure item difficulty is quite straightforward. It involves

finding out how many students answered an item correctly and

dividing it by the number of students who took this test. The formula is

therefore: item difficulty (p) = number of students who answered the item correctly / total number of students who took the test.

For example, if twenty students took a test and 15 of them correctly

answered item 1, then the item difficulty for item 1 is 15/20 or 0.75.

Item difficulty is always reported in decimal points and can range from

0 to 1. An item difficulty of 0 refers to an extremely difficult item with no

students getting the item correct and an item difficulty of 1 refers to an

easy item which all students answered correctly.

The appropriate difficulty level will depend on the purpose of the test.

According to Anastasi & Urbina (1997), if the test is to assess mastery,

then items with a difficulty level of 0.8 can be accepted. However, they

go on to describe that if the purpose of the test is for selection, then we

should utilise items whose difficulty values come closest to the desired


selection ratio –for example, if we want to select 20%, then we should

choose items with a difficulty index of 0.20.
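A minimal sketch of the item difficulty calculation, reproducing the worked example above (15 correct answers out of 20 test takers):

```python
# Item difficulty: proportion of test takers who answered the item correctly.

def item_difficulty(correct, total):
    return correct / total

print(item_difficulty(15, 20))  # 0.75, as in the worked example
```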

b. Item discrimination

Item discrimination is used to determine how well an item is able to

discriminate between good and poor students. Item discrimination values

range from –1 to 1. A value of –1 means that the item discriminates

perfectly, but in the wrong direction. This value would tell us that the

weaker students performed better on an item than the better students.

This is hardly what we want from an item and if we obtain such a value,

it may indicate that there is something not quite right with the item. It is

strongly recommended that we examine the item to see whether it is

ambiguous or poorly written. A discrimination value of 1 shows positive

discrimination with the better students performing much better than the

weaker ones – as is to be expected.

Let’s use the following instance as an example. Suppose you have just

conducted a twenty item test and obtained the following results:

Table 8.5: Item Discrimination


As there are twelve students in the class, 33% of this total would be 4

students. Therefore, the upper group and lower group will each consist

of 4 students each. Based on their total scores, the upper group would

consist of students L, A, E, and G while the lower group would consist of

students J, H, D and I.

We now need to look at the performance of these students for each item

in order to find the item discrimination index of each item.

For item 1, all four students in the upper group (L, A, E, and G)

answered correctly while only student H in the lower group answered

correctly. Using the discrimination index formula (the number answering correctly in the upper group minus the number answering correctly in the lower group, divided by the number of students in each group), we can plug in the numbers as follows: (4 – 1) / 4 = 0.75.

Two points should be noted. First, item discrimination is especially

important in norm referenced testing and interpretation as in such

instances there is a need to discriminate between good students who

do well in the measure and weaker students who perform poorly. In

criterion referenced tests, item discrimination does not have as

important a role. Secondly, the use of 33.3% of the total number of students who took the test to form each group is not fixed; any percentage between 27.5% and 35% may be used.
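The sketch below applies the common upper/lower group formula (an assumption here, since the module's own formula image is not reproduced) to the item 1 example: four upper-group students correct, one lower-group student correct, groups of four.

```python
# Item discrimination index: (correct in upper group - correct in lower group)
# divided by the number of students in each group.

def discrimination_index(upper_correct, lower_correct, group_size):
    return (upper_correct - lower_correct) / group_size

print(discrimination_index(4, 1, 4))   # 0.75 for item 1
print(discrimination_index(1, 4, 4))   # -0.75: an item working in the wrong direction
```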

c. Distractor analysis

Distractor analysis is an extension of item analysis, using techniques

that are similar to item difficulty and item discrimination. In distractor

analysis, however, we are no longer interested in how test takers select

the correct answer, but how the distractors were able to function

effectively by drawing the test takers away from the correct answer. The

number of times each distractor is selected is noted in order to

determine the effectiveness of the distractor. We would expect that the

distractor is selected by enough candidates for it to be a viable

distractor.

What exactly is an acceptable value? This depends to a large extent on

the difficulty of the item itself and what we consider to be an acceptable

item difficulty value for test items. If we are to assume that 0.7 is an

appropriate item difficulty value, then we should expect that the

remaining 0.3 be about evenly distributed among the distractors. 

Let us take the following test item as an example:

In the story, he was unhappy because _____________________________

A. it rained all day
B. he was scolded
C. he hurt himself
D. the weather was hot

Let us assume that 100 students took the test. If we assume that A is the

answer and the item difficulty is 0.7, then 70 students answered correctly.

What about the remaining 30 students and the effectiveness of the three

distractors? If all 30 selected D, then distractors B and C are useless in their

role as distractors. Similarly, if 15 students selected D and another 15


selected B, then C is not an effective distractor and should be replaced.

Therefore, the ideal situation would be for each of the three distractors to be

selected by an equal number of all students who did not get the answer

correct, i.e. in this case 10 students. Therefore the effectiveness of each

distractor can be quantified as 10/100 or 0.1 where 10 is the number of

students who selected that option and 100 is the total number of students

who took the test. This technique is similar to a difficulty index although the

result does not indicate the difficulty of each item, but rather the

effectiveness of the distractor. In the first situation described in this

paragraph, options A, B, C and D would have a difficulty index of 0.7, 0, 0,

and 0.3 respectively. If the distractors worked equally well, then the indices

would be 0.7, 0.1, 0.1, and 0.1. Unlike in determining the difficulty of an

item, the value of the difficulty index formula for the distractors must be

interpreted in relation to the indices for the other distractors.
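The proportions for the 100-student example can be sketched as follows; the counts below describe the "ideal" spread discussed above, with A as the key.

```python
# Distractor effectiveness as the proportion of all test takers choosing each option.

choices = {"A": 70, "B": 10, "C": 10, "D": 10}   # A is the key; B, C, D are distractors
total = sum(choices.values())

for option, count in choices.items():
    print(option, count / total)   # A 0.7, B 0.1, C 0.1, D 0.1
```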

From a different perspective, the item discrimination formula can also be

used in distractor analysis. The concept of upper groups and lower groups

would still remain, but the analysis and expectation would differ slightly from

the regular item discrimination that we have looked at earlier. Instead of

expecting a positive value, we should logically expect a negative value as

more students from the lower group should select distractors. Each

distractor can have its own item discrimination value in order to analyse how

the distractors work and ultimately refine the effectiveness of the test item

itself. 

Table 8.6: Selection of Distractors

Distractor A Distractor B Distractor C Distractor D

Item 1 8* 3 1 0

Item 2 2 8* 2 0

Item 3 4 8* 0 0

Item 4 1 3 8* 0

Item 5 5 0 0 7*

* indicates key

For Item 1, the discrimination index for each distractor can be calculated using the discrimination

index formula. From Table 8.5, we know that all the students in the upper group answered this item

correctly and only one student from the lower group did so. If we assume that the three remaining


students from the lower group all selected distractor B, then the discrimination index for item 1,

distractor B will be (0 – 3) / 4 = –0.75.

This negative value indicates that more students from the lower group selected the distractor compared

to students from the upper group. This result is to be expected of a distractor and a value of -1 to 0 is

preferred. 
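The same calculation in Python, reproducing the item 1 / distractor B example (no upper-group students and, by assumption, three lower-group students chose B, with groups of four):

```python
# Discrimination value for a single distractor: negative values are expected,
# since more lower-group than upper-group students should choose a distractor.

def distractor_discrimination(upper_chose, lower_chose, group_size):
    return (upper_chose - lower_chose) / group_size

print(distractor_discrimination(0, 3, 4))   # -0.75 for item 1, distractor B
```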

EXERCISE

1. Calculate the mean, mode, median and range of the following set of scores:

23, 24, 25, 23, 24, 23, 23, 26, 27, 22, 28.

2. What is a normal curve and what does this show? Does the final result always show a normal curve and how does this relate to standardised tests?


TOPIC 9 REPORTING OF ASSESSMENT DATA

9.0 SYNOPSIS

Topic 9 focuses on reporting assessment data. It provides teachers with brief

descriptions on the purposes of reporting and the reporting methods.

9.1 LEARNING OUTCOMES

By the end of Topic 9, teachers will be able to:

Understand the purposes of reporting of assessment data

Understand and use the different reporting methods in language assessment

9.2 FRAMEWORK OF TOPICS

CONTENT

SESSION NINE (3 hours)


9.2.1 Purposes of reporting

We can say that the main purpose of tests is to obtain information

concerning a particular behaviour or characteristic. Based on information

obtained from tests, several different types of decisions can be made.

Kubiszyn and Borich (2000) mention eight different types of decisions

made on the basis of information obtained from tests. These educational

decisions are shown in Figure 9.1

Figure 9.1: Eight Types of Decisions Made

Instructional decisions are made based on test results when, for

example, teachers decide to change or maintain their instructional

approach. If a teacher finds out that most of his class have failed his

test, there are many possible reactions he can have. The teacher

could evaluate the effectiveness of his own teaching or instructional

approach and implement the necessary changes. Tests yield scores

and teachers will have to make decisions in terms of the kind of

grades to give students. As grades are indicators of student

performance, teachers need to decide whether a student deserves a

high grade – perhaps an A – on the basis of some form of

assessment.


Traditionally, and perhaps for a long time to come, this assessment will be

in the form of tests. Sometimes, we give tests to find out the strengths

and weaknesses of our students.

Decisions related to selection, placement, counselling and guidance,

programme or curriculum, and administrative policy are all made at

levels higher than the classroom.

Administrators, educational agencies and institutions may be involved in

these decisions.

Selection and placement decisions are somewhat similar. However, a

selection decision relates to whether or not a student is selected for a

programme or for admission into an institution based on a test score.

Tests such as TOEFL and IELTS are often used by universities to

decide whether a candidate is suitable, and hence selected for

admission.

A placement decision, however, deals with where a candidate should

be placed based on performance on the test. A clear example is the

language placement examination for newly admitted students commonly

administered by many local and foreign universities.

Based on their performance on such a test, students are placed into

different language classes that are arranged according to proficiency

levels.

Counselling and guidance decisions are also made by relevant parties

such as counsellors and administrators on the basis of exam results.

Counsellors often give advice in terms of appropriate vocations for some

of their students. This advice is likely to be given on the basis of the

students’ own test scores. Programme or curriculum decisions reflect the

kinds of changes made to the educational programme or curriculum

based on examination results. Finally, there are also administrative

policy decisions that need to be made which are also greatly influenced

by test scores.


9.2.2 Reporting methods

Student achievement and progress can be reported using the following approaches:

i. Norm-Referenced Assessment and Reporting

Assessing and reporting a student's achievement and progress in

comparison to other students.

ii. Criterion-Referenced Assessment and Reporting

Assessing and reporting a student's achievement and progress in

comparison to predetermined criteria.

An outcomes-approach to assessment will provide information about

student achievement to enable reporting against a standards

framework.

iii. An Outcomes-Approach

Acknowledges that students, regardless of their class or grade, can be

working towards syllabus outcomes anywhere along the learning

continuum.

Principles of effective and informative assessment and reporting

Effective and informative assessment and reporting practice:

Has clear, direct links with outcomes

The assessment strategies employed by the teacher in the

classroom need to be directly linked to and reflect the syllabus

outcomes. Syllabus outcomes in stages will describe the standard

against which student achievement is assessed and reported.

Is integral to teaching and learning

Effective and informative assessment practice involves selecting

strategies that are naturally derived from well structured teaching

and learning activities. These strategies should provide information

concerning student progress and achievement that helps inform


ongoing teaching and learning as well as the diagnosis of areas of

strength and need.

Is balanced, comprehensive and varied

Effective and informative assessment practice involves teachers

using a variety of assessment strategies that give students multiple

opportunities, in varying contexts, to demonstrate what they know,

understand and can do in relation to the syllabus outcomes.

Effective and informative reporting of student achievement takes a

number of forms including traditional reporting, student profiles,

Basic Skills Tests, parent and student interviews, annotations on

student work, comments in workbooks, portfolios, certificates and

awards.

Is valid

Assessment strategies should accurately and appropriately assess

clearly defined aspects of student achievement. If a strategy does

not accurately assess what it is designed to assess, then its use is

misleading.

Valid assessment strategies are those that reflect the actual

intention of teaching and learning activities, based on syllabus

outcomes.

Where values and attitudes are expressed in syllabus outcomes,

these too should be assessed as part of student learning.

Is fair

Effective and informative assessment strategies are designed to

ensure equal opportunity for success regardless of students' age,

gender, physical or other disability, culture, background language,

socio-economic status or geographic location.

Engages the learner

Effective and informative assessment practice is student centred.

Ideally there is a cooperative interaction between teacher and

students, and among the students themselves.

The syllabus outcomes and the assessment processes to be used

should be made explicit to students. Students should participate in

the negotiation of learning tasks and actively monitor and reflect

upon their achievements and progress.


Values teacher judgement

Good assessment practice involves teachers making judgements,

on the weight of assessment evidence, about student progress

towards the achievement of outcomes.

Teachers can be confident a student has achieved an outcome

when the student has successfully demonstrated that outcome a

number of times, and in varying contexts.

The reliability of teacher judgement is enhanced when teachers

cooperatively develop a shared understanding of what constitutes

achievement of an outcome. This is developed through cooperative

programming and discussing samples of student work and

achievements within and between schools. Teacher judgement

based on well defined standards is a valuable and rich form of

student assessment.

Is time efficient and manageable

Effective and informative assessment practice is time efficient and

supports teaching and learning by providing constructive feedback to

the teacher and student that will guide further learning.

Teachers need to plan carefully the timing, frequency and nature of

their assessment strategies. Good planning ensures that assessment

and reporting is manageable and maximises the usefulness of the

strategies selected (for example, by addressing several outcomes in

one assessment task).

Recognises individual achievement and progress

Effective and informative assessment practice acknowledges that

students are individuals who develop differently. All students must be

given appropriate opportunities to demonstrate achievement.

Effective and informative assessment and reporting practice is

sensitive to the self esteem and general well-being of students,

providing honest and constructive feedback.

Values and attitudes outcomes are an important part of learning that

should be assessed and reported. They are distinct from knowledge,

understanding and skill outcomes.

Involves a whole school approach

An effective and informative assessment and reporting policy is

developed through a planned and coordinated whole school approach.

Decisions about assessment and reporting cannot be taken


independently of issues relating to curriculum, class groupings,

timetabling, programming and resource allocation.

Actively involves parents

Schools and their communities are responsible for jointly developing

assessment and reporting practices and policies according to their local

needs and expectations.

Schools should ensure full and informed participation by parents in the

continuing development and review of the school policy on reporting

processes.

Conveys meaningful and useful information

Reporting of student achievement serves a number of purposes, for a

variety of audiences. Students, parents, teachers, other schools and

employers are potential audiences. Schools can use student

achievement information at a number of levels including individual, class,

grade or school. This information helps identify students for targeted

intervention and can inform school improvement programs. The form of

the report must clearly serve its intended purpose and audience.

Effective and informative reporting acknowledges that students can be

demonstrating progress and achievement of syllabus outcomes across

stages, not just within stages.

Good reporting practice takes into account the expectations of the

school community and system requirements, particularly the need for

information about standards that will enable parents to know how their

children are progressing.

Student achievement and progress can be reported by comparing

students' work against a standards framework of syllabus outcomes,

comparing their prior and current learning achievements, or comparing

their achievements to those of other students. Reporting can involve a

combination of these methods. It is important for schools and parents to

explore which methods of reporting will provide the most meaningful and

useful information.

TOPIC 10 ISSUES AND CONCERNS RELATED TO ASSESSMENT IN MALAYSIAN PRIMARY SCHOOLS


10.0 SYNOPSIS

Topic 10 focuses on the issues and concerns related to assessment in the

Malaysian primary schools. It will look at how assessment is viewed and used

in Malaysia.

10.1 LEARNING OUTCOMES

By the end of Topic 10, teachers will be able to:

Understand some issues and concerns regarding assessment in the Malaysian primary schools

Understand Chapter 4 of the Malaysian Education Blueprint 2013-2025

Use the different types of assessment in assessing language in school (cognitive-level, school-based and alternative assessment)

10.2 FRAMEWORK OF TOPICS


CONTENT

SESSION TEN (3 hours)

10.3 Exam-oriented System

The educational administration in Malaysia is highly centralised with four

hierarchical levels; that is, federal, state, district and the lowest level, school.

Major decision- and policy-making takes place at the federal level, represented

by the Ministry of Education (MoE), which consists of the Curriculum

Development Centre, the school division, and the Malaysian Examination

Syndicate (MES).

The current education system in Malaysia is too examination-oriented and over-emphasises rote learning, with institutions of higher learning fast becoming mere diploma mills. Like most Asian countries (e.g., Gang 1996; Lim and Tan 1999; Choi 1999), Malaysia has so far focused on public

examination results as important determinants of students’ progression to

higher levels of education or occupational opportunities (Chiam 1984).

The Malaysian education system requires all students to sit for public

examinations at the end of each level of schooling. There are four public

examinations from primary to postsecondary education. These are the

Primary School Achievement Test (UPSR) at the end of six years of primary

education, the Lower Secondary Examination (PMR) at the end of another

three years’ schooling, the Malaysian Certificate of Education (SPM) at the

end of 11 years of schooling, and the Malaysian Higher School Certificate

Examination (STPM) or the Higher Malaysian Certificate for Religious

Education (STAM) at the end of 13 years’ schooling (MoE 2004).

Malaysia Education Blueprint 2013-2025

“In October 2011, the Ministry of Education launched a comprehensive review of the education system in Malaysia in order to develop a new National Education Blueprint. This


decision was made in the context of rising international education standards, the Government’s aspiration of better preparing Malaysia’s children for the needs of the 21st century, and increased public and parental expectations of education policy. Over the course of 11 months, the Ministry drew on many sources of input, from education experts at UNESCO, World Bank, OECD, and six local universities, to principals, teachers, parents, and students from every state in Malaysia. The result is a preliminary Blueprint that evaluates the performance of Malaysia’s education system against historical starting points and international benchmarks. The Blueprint also offers a vision of the education system and students that Malaysia both needs and deserves, and suggests 11 strategic and operational shifts that would be required to achieve that vision. The Ministry hopes that this effort will inform the national discussion on how to fundamentally transform Malaysia’s education system, and will seek feedback from across the community on this preliminary effort before finalising the Blueprint in December 2012.”

The examined Curriculum

In public debate, the issue of teaching to the test has often translated

into debates over whether the UPSR, PMR, and SPM examinations

should be abolished. Summative national examinations should not in

themselves have any negative impact on students. The challenge is that

these examinations do not currently test the full range of skills that the

education system aspires to produce. An external review by Pearson

Education Group of the English examination papers at UPSR

and SPM level noted that these assessments would benefit from

the inclusion of more questions testing higher-order thinking skills,

such as application, analysis, synthesis and evaluation. For example,

their analysis of the 2010 and 2011 English Language UPSR papers

showed that approximately 70% of the questions tested basic skills of

knowledge and comprehension.

The Examinations Syndicate (Lembaga Peperiksaan, LP) has started a series of reforms to ensure that, as per policy,

assessments are evaluating students holistically. In 2011, in parallel

with the KSSR, the LP rolled out the new PBS format that is intended


to be more holistic, robust, and aligned to the new standard-referenced

curriculum. There are four components to the new PBS:

▪ School assessment refers to written tests that assess subject

learning. The test questions and marking schemes are developed,

administered, scored, and reported by school teachers based on

guidance from LP;

▪ Central assessment refers to written tests, project work, or

oral tests (for languages) that assess subject learning. LP develops

the test questions and marking schemes. The tests are, however,

administered and marked by school teachers;

▪ Psychometric assessment refers to aptitude tests and a

personality inventory to assess students’ skills, interests, aptitude,

attitude and personality. Aptitude tests are used to assess students’

innate and acquired abilities, for example in thinking and problem

solving. The personality inventory is used to identify key traits and

characteristics that make up the students’ personality. LP develops

these instruments and provides guidelines for use. Schools are,

however, not required to comply with these guidelines; and

▪ Physical, sports, and co-curricular activities assessment

refers to assessments of student performance and participation

in physical and health education, sports, uniformed bodies, clubs,

and other non-school sponsored activities. Schools are given the

flexibility to determine how this component will be assessed.

The new format enables students to be assessed on a broader range of

output over a longer period of time. It also provides teachers with more

regular information to take the appropriate remedial actions for their

students. It is hoped that these changes will reduce the overall emphasis on

teaching to the test, so that teachers can focus more time on delivering

meaningful learning as stipulated in the curriculum.

In 2014, the PMR national examinations will be replaced with school


and centralised assessment. In 2016, a student’s UPSR grade will no longer

be derived from a national examination alone, but from a combination of PBS

and the national examination. The format of the SPM remains the same, with

most subjects assessed through the national examination, and some subjects

through a combination of examinations and centralised assessments.

10.4 Cognitive Levels of Assessment

Bloom's Taxonomy of Cognitive Levels: Knowledge, Comprehension, Application, Analysis, Synthesis, Evaluation

Knowledge

Recalling memorized information. May involve remembering a wide range of material from specific facts to complete theories, but all that is required is the bringing to mind of the appropriate information. Represents the lowest level of learning outcomes in the cognitive domain.

Learning objectives at this level: know common terms, know specific facts, know methods and procedures, know basic concepts, know principles.

Question verbs: Define, list, state, identify, label, name, who? when? where? what?

Comprehension

The ability to grasp the meaning of material. Translating material from one form to another (words to numbers), interpreting material (explaining or summarizing), estimating future trends (predicting consequences or effects). Goes one step beyond the simple remembering of material, and represent the lowest level of understanding.

Learning objectives at this level: understand facts and principles, interpret verbal material, interpret charts and graphs, translate verbal material to


mathematical formulae, estimate the future consequences implied in data, justify methods and procedures.

Question verbs: Explain, predict, interpret, infer, summarize, convert, translate, give example, account for, paraphrase x?

Application

The ability to use learned material in new and concrete situations. Applying rules, methods, concepts, principles, laws, and theories. Learning outcomes in this area require a higher level of understanding than those under comprehension.

Learning objectives at this level: apply concepts and principles to new situations, apply laws and theories to practical situations, solve mathematical problems, construct graphs and charts, demonstrate the correct usage of a method or procedure.

Question verbs: How could x be used to y? How would you show, make use of, modify, demonstrate, solve, or apply x to conditions y?

Analysis

The ability to break down material into its component parts. Identifying parts, analysis of relationships between parts, recognition of the organizational principles involved. Learning outcomes here represent a higher intellectual level than comprehension and application because they require an understanding of both the content and the structural form of the material.

Learning objectives at this level: recognize unstated assumptions, recognizes logical fallacies in reasoning, distinguish between facts and inferences, evaluate the relevancy of data, analyze the organizational structure of a work (art, music, writing).

Question verbs: Differentiate, compare / contrast, distinguish x from y, how does x affect or relate to y? why? how? What piece of x is missing / needed?

Synthesis

(By definition, synthesis cannot be assessed with multiple-choice questions. It appears here to complete Bloom's taxonomy.)

The ability to put parts together to form a new whole. This may involve the production of a unique communication (theme or speech), a plan of operations (research proposal), or a set of abstract relations (scheme for


classifying information). Learning outcomes in this area stress creative behaviors, with major emphasis on the formulation of new patterns or structure.

Learning objectives at this level: write a well organized paper, give a well organized speech, write a creative short story (or poem or music), propose a plan for an experiment, integrate learning from different areas into a plan for solving a problem, formulate a new scheme for classifying objects (or events, or ideas).

Question verbs: Design, construct, develop, formulate, imagine, create, change, write a short story and label the following elements:

Evaluation

The ability to judge the value of material (statement, novel, poem, research

report) for a given purpose. The judgments are to be based on definite

criteria, which may be internal (organization) or external (relevance to the

purpose). The student may determine the criteria or be given them. Learning

outcomes in this area are highest in the cognitive hierarchy because they

contain elements of all the other categories, plus conscious value judgments

based on clearly defined criteria.

Learning objectives at this level: judge the logical consistency of written

material, judge the adequacy with which conclusions are supported by data,

judge the value of a work (art, music, writing) by the use of internal criteria,

judge the value of a work (art, music, writing) by use of external standards of

excellence.

Question verbs: Justify, appraise, evaluate, judge x according to given criteria.

Which option would be better/preferable to party y?

10.5 School-based Assessment

The traditional system of assessment no longer satisfies the educational

and social needs of the third millennium. In the past few decades, many

countries have made profound reforms in their assessment systems.

Several educational systems have in turn introduced school-based

assessment as part of or instead of external assessment in their

certification. While examination bodies acknowledge the immense

potential of school-based assessment in terms of validity and flexibility,

they must at the same time guard against or deal with difficulties

related to reliability, quality control and quality assurance. In the debate

on school-based assessment, the issue of ‘why’ has been widely written

about and there is general agreement on the principles of validity of

this form of assessment.

Izard (2001) as well as Raivoce and Pongi (2001) explain that school-

based assessment (SBA) is often perceived as the process put in place

to collect evidence of what students have achieved, especially in

important learning outcomes that do not easily lend themselves to the

pen and paper tests. Daugherty (1994) clarifies that this type of

assessment has been recommended: …because of the gains in the

validity which can be expected when students’ performance on

assessed tasks can be judged in a greater range of contexts and more

frequently than is possible within the constraints of time-limited, written

examinations. However, as Raivoce and Pongi (2001) suggest, the

validity of SBA depends to a large extent on the various assessment

tasks students are required to perform.

Burton (1992) provides the following five rules of thumb that may be applied in the planning stage of school-based assessment:

1. The assessment should be appropriate to what is being assessed.

2. The assessment should enable the learner to demonstrate positive

achievement and reflect the learner’s strengths.

3. The criteria for successful performance should be clear to all

concerned

4. The assessment should be appropriate to all persons being assessed

5. The style of assessment should blend with the learning pattern so it

contributes to it.

In the Malaysian SBA context, assessment is intended to be:

• Assessment for and of learning
• Standard-referenced
• Holistic
• Integrated
• Balanced
• Robust

Components of SBA/PBS

1. Academic:
• School Assessment (using Performance Standards)
• Centralised Assessment

2. Non-academic:
• Physical Activities, Sports and Co-curricular Assessment (Pentaksiran Aktiviti Jasmani, Sukan dan Kokurikulum - PAJSK)
• Psychometric/Psychological Tests

Centralised Assessment
• Conducted and administered by teachers in schools using instruments, rubrics, guidelines, timelines and procedures prepared by the Examinations Syndicate (Lembaga Peperiksaan, LP)
• Monitoring and moderation conducted by the PBS Committee at the school, district and state education department levels, and by LP

School Assessment
• The emphasis is on collecting first-hand information about pupils’ learning based on curriculum standards
• Teachers plan the assessment, prepare the instruments and administer the assessment during the teaching and learning process
• Teachers mark pupils’ responses and report their progress continuously

10.6 Alternative Assessment

Alternative assessments are assessment procedures that differ from traditional notions and practices of testing with respect to format, performance, or implementation. It is likely that alternative assessment found its roots in writing assessment because of the need to provide continuous assessment rather than a single impromptu evaluation (Alderson & Banerjee, 2001).

As the term indicates, alternative assessments are assessment proposals that present “alternatives” to the more traditional examination formats. They have become more popular of late because of doubts raised about the ability of traditional assessment to elicit a fair and accurate measure of a student’s performance. Alternative assessment brings with it a complete set of perspectives that contrast with traditional tests and assessments. Table 10.1 illustrates some of the major differences between traditional and alternative assessments.

Table 10.1: Contrasting Traditional and “Alternative” Assessment
Source: Adapted from Bailey (1998: 207) and Puhl (1997: 5)

Traditional Assessment | Alternative Assessment
One-shot tests | Continuous, longitudinal assessment
Indirect tests | Direct tests
Inauthentic tests | Authentic assessment
Individual projects | Group projects
No feedback to learners | Feedback provided to learners
Speeded exams | Power exams
Decontextualised test tasks | Contextualised test tasks
Norm-referenced score reporting | Criterion-referenced score reporting
Standardised tests | Classroom-based tests
Summative | Formative
Product of instruction | Process of instruction
Intrusive | Integrated
Judgmental | Developmental
Teacher proof | Teacher mediated


In discussing alternative assessments, Herman et al. (1992: 6) list several of their common characteristics. They describe alternative assessments as assessments that:

• ask students to perform, create, produce, or do something;
• tap higher-level thinking and problem-solving skills;
• use tasks that represent meaningful instructional activities;
• invoke real-world applications;
• are scored by people, not machines, using human judgment; and
• require new instructional and assessment roles for teachers.

Alternative assessments are suggested largely due to a growing concern that traditional assessments are not able to accurately measure the abilities we are interested in. They are also seen to be more student-centred, as they cater for different learning styles, cultural and educational backgrounds, as well as different levels of language proficiency.

Tannenbaum (1996) comments that alternative assessments focus on documenting individual strengths and development, which assists in the teaching and learning process.

Nevertheless, although alternative assessments are compatible with the contemporary emphasis on the process as well as the product of learning (Croker, 1999), several shortcomings of alternative assessments have been noted. Perhaps the major limitation is that accounts of the benefits of alternative assessment tend to be “descriptive and persuasive, rather than research-based” (Alderson & Banerjee, 2001: 229). Alternative assessments are also said to be limited to the classroom and have not become part of mainstream assessment. Brown and Hudson (1998), in advocating alternative assessment, seem to have taken a safer approach by suggesting the term “alternatives in assessment”. They believe that educators should be familiar with all possible formats of assessment and decide on the format that best measures the ability or construct they are interested in. Hence, these alternatives would include all possible assessment formats, both traditional and informal.

Despite these limitations, alternative assessments present a viable and exciting option for eliciting and assessing students’ actual abilities. A number of test formats are considered alternative assessment formats:

• Physical demonstrations
• Pictorial products
• Reading response logs
• K-W-L (what I know / what I want to know / what I’ve learned) charts
• Dialogue journals
• Checklists
• Teacher-pupil conferences
• Interviews
• Performance tasks
• Portfolios
• Self-assessment
• Peer assessment

Portfolios

A well-known and commonly used alternative assessment is portfolio assessment. The contents of the portfolio become evidence of abilities, much like how we would use a test to measure the abilities of our students.

Bailey (1998, p. 218) describes a portfolio as containing four primary elements. First, it should have an introduction to the portfolio itself, which provides an overview of its contents. Bailey even suggests that this section include a reflective essay by the student in order to help express the student’s thoughts and feelings about the portfolio, perhaps explaining strengths and possible weaknesses, as well as why certain pieces are included in the portfolio.

Secondly, she argues that portfolios should have what she refers to as an academic works section. This section is meant to demonstrate the students’ “improvement or achievement in the major skill areas” (p. 218). The third section is described as a personal section, in which students may wish to include their journals, score reports of tests that they have sat for, as well as photographs and other items that illustrate their experiences with, and achievements in, the English language. Finally, an assessment section may contain evaluations made by peers and teachers, as well as self-evaluations.


Table 10.2: Contents of a Portfolio
Source: Adapted from Bailey (1998: 218)

Introductory Section:
• Overview
• Reflective essay

Academic Works Section:
• Samples of best work
• Samples of work demonstrating development

Personal Section:
• Journals
• Score reports
• Photographs
• Personal items

Assessment Section:
• Evaluation by peers
• Self-evaluation

The portfolio can be said to be a student’s personal documentation that helps demonstrate his or her ability and successes in the language. It may even require students to consciously select items that can document their own progress as learners. The actual compilation of the content of the portfolio is in itself a learning experience. Some suggest that students should attach a short reflection on each piece or item placed in the portfolio. Portfolio assessment, therefore, is both a learning and assessment experience. This dual function can be considered as one of the benefits of portfolio assessment.

Brown and Hudson (1998) summarise several other advantages of using portfolios in assessment. They discuss these advantages according to how the portfolio strengthens students’ learning, enhances the teacher’s role and improves the testing process. With respect to testing, the advantages of using the portfolio as an assessment instrument are listed as follows (pp. 664-665). Portfolios:

• enhance student and teacher involvement in assessment;
• provide opportunities for teachers to observe students using meaningful language to accomplish various authentic tasks in a variety of contexts and situations;
• permit the assessment of the multiple dimensions of language learning;
• provide opportunities for both students and teachers to work together and reflect on what it means to assess students’ language growth;
• increase the variety of information collected on students; and
• make teachers’ ways of assessing student work more systematic.


Self-Assessment and Peer Assessment

Two other common forms of alternative assessment are self-assessment and peer-assessment procedures. Both forms of assessment are strongly advocated by Puhl (1997), as she believes they are essential to continuous assessment, a cornerstone of alternative assessment. The benefits of self and peer assessment are especially evident in the formative stages of assessment, in which the development of the students’ abilities is emphasised.

Self-appraisals are also thought to be quite accurate and are said to increase student motivation. Puhl (1997) describes a case study in which she believes self-assessment forced the students to reread their essays and thereby make the necessary edits and corrections before handing them in. Nevertheless, in order for self-assessment to be useful and not a futile exercise, learners need to be trained and initially guided in performing their self-assessment. This training involves providing students with the rationale for self-assessment, an explanation of how it is intended to work, and how it is capable of helping them.

In language teaching and learning, self-assessment is relevant to assessing all the language skills. An example of self-assessment of the listening skill, specifically the comprehension of questions asked, is suggested by Cohen (1994) as follows:

Comprehension of questions asked:

5. I can always understand the questions without any difficulty and without having to ask for repetition.
4. I can usually understand questions, but I might occasionally ask for repetition.
3. I have difficulty with some questions, but I generally get the meaning.
2. I have difficulty understanding most questions, even after repetition.
1. I don’t understand questions well at all.

These questions are useful in the formative stages of assessment as they help students identify their own strengths and weaknesses and respond accordingly. By asking these types of self-assessment questions, students are expected to become more sensitive to their own learning and ultimately to perform better in the final summative evaluation at the end of the instructional programme.

Peer assessment differs from self-assessment in that it involves the social and emotional dimensions to a much greater extent. Peer assessment can be defined as a response in some form to other learners’ work (Puhl, 1997). It can be given by a group or an individual, and it can take “any of a variety of coding systems: the spoken word, the written word, checklists, questionnaires, nonverbal symbols, numbers along a scale, colours, etc.” (p. 8). Peer assessment requires that a student take up the role of “a critical friend” to another student in order to “support, challenge, and extend each other’s learning” (Brooks, 2002: 73). Among the reported benefits of peer assessment are that it can:

• remind learners they are not working in isolation;
• help create a community of learners;
• improve the product (“Two heads are better than one”);
• improve the process;
• motivate, even inspire;
• help learners be reflective; and
• stimulate meta-cognition.

EXERCISE

In your opinion, what are the advantages of using portfolios as a form of alternative assessment?

REFERENCES

Allen, I. J. (2011). Repriviledging reading: The negotiation of uncertainty. Pedagogy: Critical Approaches to Teaching Literature, Language, Composition, and Culture, 12(1), pp. 97-120. Available at: http://pedagogy.dukejournals.org/cgi/doi/10.1215/15314200-1416540 (Retrieved September 26, 2013)

Alderson, J. C. (1986b). Innovations in language testing? In M. Portal (Ed.), Innovations in language testing. pp. 93-105. Windsor: NFER/Nelson.


Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

Anderson, L. W. (Ed.), Krathwohl, D. R. (Ed.), Airasian, P. W., Cruikshank, K. A., Mayer, R. E., Pintrich, P. R., Raths, J., & Wittrock, M. C. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom's Taxonomy of Educational Objectives (Complete edition). New York: Longman.

Anderson, K. M. (2007). Differentiating instruction to include all students. Preventing School Failure, 51(3), pp. 49-54.

Bachman, L. F. (2004). Statistical Analyses for Language Assessment. pp. 22-23. Cambridge, UK: Cambridge University Press.

Biggs, J. B., & Collis, K. F. (1982). Evaluating the Quality of Learning: The SOLO taxonomy. New York, NY: Academic Press.

Biggs, J. B., & Collis, K. F. (1991). Multimodal learning and the quality of intelligent behaviour. In H. Rowe (Ed.), Intelligence: Reconceptualization and measurement. Hillsdale, NJ: Lawrence Erlbaum. pp. 57-75.

Biggs, J. B., & Tang, C. (2009). Applying constructive alignment to outcomes-based teaching and learning. Training material, “Quality Teaching for Learning in Higher Education” Workshop for Master Trainers. Ministry of Higher Education, Kuala Lumpur.

Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment. J. Gardiner (Ed.), Educational Assessment, Evaluation and Accountability, 1(1), pp. 5–31. Available at: http://eprints.ioe.ac.uk/1119/ (Retrieved 23 August 2013)

Bloom, B. S. (Ed.), Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook 1: Cognitive domain. New York: David McKay.

Bloom, B. S. (1956). Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York: David McKay Co Inc.

Brennan, R. L. (1996). Generalizability of performance assessments. In G. W. Phillips (Ed.), Technical issues in large-scale performance assessment (NCES 96-802) (pp. 19-58). Washington, DC: National Center for Education Statistics.


Brown, H. D., & Abeywickrama, P. (2010). Language Assessment: Principles and Classroom Practices. New York, NY: Pearson Education.

Brown, G., & Yule, G. (1983). Teaching the spoken language. Cambridge: Cambridge University Press.

Brown, H. D. (1994). Teaching by principles: An interactive approach to language pedagogy. Englewood Cliffs, NJ: Prentice Hall Regents.

Campbell, K. J., Watson, J. M., & Collis, K. F. (1992). Volume measurement and intellectual development. Journal of Structural Learning, 11, pp. 279-298.

Carroll, J. B., & Sapon, S. M. (1958). Modern Language Aptitude Test. New York, NY: The Psychological Corporation.

Cheng, L., Watanabe, Y., & Curtis, A. (Eds.). (2004). Washback in language testing: Research contexts and methods. Mahwah, NJ: Lawrence Erlbaum Associates.

Chick, H. (1998). Cognition in the formal modes: Research mathematics and the SOLO taxonomy. Mathematics Education Research Journal, 10(2), pp. 4-26.

Clark, J. (1979). Direct vs. semi-direct tests of speaking ability. In E. Briere & F. Hinofotis (Eds.), Concepts in language testing: Some recent studies (pp. 35-49). Washington, DC: TESOL.

Davidson, F., Hudson, T., & Lynch, B. (1985). Language testing: Operationalization in classroom measurement and L2 research. In M. Celce-Murcia (Ed.), Beyond basics: Issues and research in TESOL. pp. 137-152. Rowley, MA: Newbury House.

Davidson, F., & Lynch, B. (2002). Testcraft: A teacher’s guide to writing and using language test specifications. New Haven, CT: Yale University Press.

Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999). Dictionary of language testing. Cambridge: University of Cambridge Local Examinations Syndicate and Cambridge University Press.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational Measurement (3rd ed.). pp. 105-146. New York, NY: Macmillan.


Gottlieb, M. (2006). Assessing English Language Learners: Bridges from Language Proficiency to Academic Achievement. USA: Corwin Press.

Grotjahn, R. (1986). Test validation and cognitive psychology: Some methodological considerations. Language Testing, 3, pp. 158–85.

Hattie, J. (2009). Visible Learning. New York: Routledge.

Hattie, J. (2012). Visible Learning for Teachers: Maximizing Impact on Learning. Abingdon: Routledge.

Hattie, J., & Brown, G. (2004). Cognitive processes in asTTle: The SOLO taxonomy. University of Auckland/Ministry of Education. asTTle Technical Report 43.

Hook, P., & Mills, J. (2011). SOLO Taxonomy: A Guide for Schools. Book 1: A common language of learning. Laughton, UK: Essential Resources Educational Publishers.

Huang, S. C. (2012). English Teaching: Practice and Critique, 11(4), pp. 99–119.

Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge: Cambridge University Press.

Gavin, B. et al. (2008). An introduction to educational assessment, measurement and evaluation (2nd ed.). Australia: Pearson Education New Zealand.

McNamara, T. (2000). Language testing. Oxford, UK: Oxford University Press.

Linn, R. L., & Gronlund, N. E. (2000). Measurement and assessment in teaching. (8th ed.). Upper Saddle River, NJ: Merrill/Prentice Hall.

Ministry of Education Malaysia. (2013). Malaysia Education Blueprint 2013-2025. Putrajaya: Ministry of Education Malaysia.

McMillan, J. H. (2001a). Classroom assessment: Principles and practice for effective instruction (2nd ed.). Boston, MA: Allyn & Bacon.

Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement. pp. 13-103. New York, NY: Macmillan.


Moseley, D., Baumfield, V., Elliott, J., Gregson, M., Higgins, S., Miller, J., & Newton, D. (2005). Frameworks for Thinking: A handbook for teaching and learning. Cambridge: Cambridge University Press.

Mousavi, S. A. (2009). An encyclopedic dictionary of language testing (4th ed.). Tehran: Rahnama Publications.

Norleha Ibrahim. (2009). Management of measurement and evaluation module. Selangor: Open University Malaysia.

Nückles, M., Hübner, S., & Renkl, A. (2009). Enhancing self-regulated learning by writing learning protocols. Learning and Instruction, 19(3), pp. 259–271. Available at: http://linkinghub.elsevier.com/retrieve/pii/S0959475208000558 (Retrieved March 26, 2013).

Oller, J. W. (1979). Language tests at school: A pragmatic approach. London: Longman.

Pearson, I. (1988). Tests as levers for change. In D. Chamberlain & R. Baumgardner (Eds.), ESP in the classroom: Practice and evaluation (Vol. 128, pp. 98-107). London: Modern English Publications.

Pimsleur, P. (1966). Pimsleur Language Aptitude Battery. New York, NY: Harcourt, Brace & World.

Shepard, L. A. (2000). The role of assessment in a learning culture. Paper presented at the Annual Meeting of the American Educational Research Association. Available at: http://www.aera.net/meeting/am2000/wrap/praddr01.htm (Retrieved 10.8.2013)

Smith, A. (2011). High Performers: The Secrets of Successful Schools. Carmarthen: Crown House Publishing.

Smith, T. W., & Colby, S. A. (2007). Teaching for deep learning. The Clearing House, 80(5), pp. 205–211.


Spaan, M. (2006). Test and item specifications development. Language Assessment Quarterly, 3, pp. 71-79.

Spratt, M. (2005). Washback and the classroom: The implications for teaching and learning of studies of washback from exams. Language Teaching Research, 19, pp. 5-29.

Stansfield, C., & Reed, D. (2004). The story behind the Modern Language Aptitude Test: An interview with John B. Carroll (1916-2003). Language Assessment Quarterly, 1, pp. 43-56.

Websites

http://www.catforms.com/pages/Introduction-to-Test-Items.html (Retrieved 9.8.2013)

http://myenglishpages.com/blog/summative-formative-assessment/ (Retrieved 10.8.2013)

http://www.teachingenglish.org.uk/knowledge-database/objective-test (Retrieved 12.8.2013)

http://assessment.tki.org.nz/Using-evidence-for-learning/Concepts/Concept/Reliability-and-validity

MODULE WRITERS PANEL
PROGRAM PENSISWAZAHAN GURU (PPG)
DISTANCE EDUCATION MODE (PRIMARY EDUCATION)

NAME: NURLIZA BT [email protected]
QUALIFICATIONS: M.A. TESL, University of North Texas, USA; B.A. (Hons) English, North Texas State University, USA; Sijil Latihan Perguruan Guru Siswazah (Kementerian Pelajaran Malaysia)
WORK EXPERIENCE: 4 years as a secondary school teacher; 21 years as a lecturer at an IPG (Institute of Teacher Education)

NAME: ANG CHWEE [email protected]
QUALIFICATIONS: M.Ed. TESL, Universiti Teknologi Malaysia; B.Ed. (Hons.) Agri. Science/TESL, Universiti Pertanian Malaysia
WORK EXPERIENCE: 23 years as a secondary school teacher; 7 years as a lecturer at an IPG (Institute of Teacher Education)
