
OEPT TECHNICAL MANUAL

Oral English Proficiency Test

Purdue University

Oral English Proficiency Program


Table of contents

Acknowledgements
Introduction
The Oral English Proficiency Program
    Certification of oral English proficiency by test score
    Certification of oral English proficiency through the ENGL 620 course
Evolution of the OEPT Test
    The first OEPT test
        OEPT1 test structure and content
        Table 1: OEPT1 Item Summary
    The second OEPT test: the OEPT2
        OEPT2 Test item development
        Piloting of the OEPT2
        Rescaling the OEPT2
        OEPT2 Holistic Scale
        Table 2: OEPT2 Holistic Scale
        OEPT2 test structure and content
        Item response conditions
        Table 3: OEPT2 Item Summary
Resources for score users and prospective test takers
Scheduling, registration and administration of the OEPT
    Test scheduling and registration
    Test administration schedule and numbers
    Testing labs and computers
    Examinee ID requirements and orientation to the test process
    Test procedures and length
    Test recording storage
    Quality control
        Test survey
        Fake testing
OEPT Rating and Scoring
    Rating guidelines and process
    Scoring guidelines
    Rater Training
        Introduction to the OEPT test, scale and rating process
        Monthly rater training
OEPT Test Results
    Release of test results
    Test score data retention and examinee score records
    Monthly test data reports
    Requests for review of OEPT scores and early retests
    Policies for retests
Use of OEPT Test Scores
Reliability and Comparability of OEPT Scores
    Estimates of Reliability and Standard Error of Measurement
        Figure 1: Formula for Calculating SEM
        Table 4: Internal Consistency and SEM by Test Form for the OEPT2 for Year 2011
        Table 5: Internal Consistency and SEM by Test Form for the OEPT1 for Years 2005-2009
    Inter-correlations among item scores
        Table 6: Item Internal Consistency for the OEPT2 for Year 2011
        Table 7: Item Internal Consistency for the OEPT1 for Years 2005-2009
    Inter-rater reliability
        Table 8: Spearman Rank Correlation Coefficients of Inter-rater Reliability
        Table 9: Consensus Estimates of Inter-rater Reliability
        Figure 2: Partial Credit Model used for Many-facet Rasch Measurement
        Question 1: Do raters rate consistently on the OEPT?
        Table 10: Measurement Estimates of Inter-rater Reliability for August 2012 OEPT data
        Question 2: Do raters display the same level of leniency or severity on the OEPT?
        Figure 3: Wright Map for August 2012 OEPT data
        Table 11: Misclassification of Examinees on August 2012 OEPT
        Question 3: Do raters apply the OEPT scale in the same manner?
Other Test Statistics
    Table 12: OEPT Examinees by College and Academic Year
    Table 13: OEPT Score Distribution by Academic Year
    Table 14: OEPT Pass/Fail Rates by Academic Year
    Table 15: Reported Use of Practice Test by OEPT Examinees
References
Appendices
    Appendix 1: Test Survey screenshots
    Appendix 2: Text of Test Preparation Brochure

Acknowledgements

The following OEPP staff members contributed to this manual: Nancy Kauper (editor), Xun Yan

(Reliability and Comparability of OEPT Scores), April Ginther, Jennifer Redden, Kaz Mishima,

Lixia Cheng, and Ploy Thirakunkovit.

© 2013 Purdue University Oral English Proficiency Program


Introduction

The Oral English Proficiency Test, or OEPT, is designed to measure the oral English proficiency

of Purdue University graduate students whose first language is not English. OEPT test scores are

used by all departments at Purdue University in determining the eligibility of graduate students

for assignment of teaching duties involving direct classroom instruction of undergraduate

students.

The Oral English Proficiency Program

The Oral English Proficiency Program (OEPP) at Purdue University was established in 1987

under the auspices of the Office of the Provost. The OEPP was created to implement the Purdue

University policy which states that any person whose first language is not English and who holds

or is a candidate for appointment as a graduate teaching assistant must demonstrate adequate

oral English proficiency before assignment to duties involving direct contact with

undergraduate students. That is, all international teaching assistants whose first language is not

English must be “certified” for oral English proficiency.

Certification of oral English proficiency by test score

Any student whose first language is not English may be certified by the following test scores:

- A score of at least 27 on the speaking portion of the TOEFL iBT;
- A score of 8.0 or higher on the IELTS speaking section;
- A score of 76 or higher on the speaking portion of the Pearson Test of English (PTE Academic);
- A score of 50 or higher on Purdue’s Oral English Proficiency Test (OEPT).

Certification of oral English proficiency through the ENGL 620 course

The OEPP is responsible for certifying students who do not meet the above test score cut offs.

Students who fail the OEPT test are eligible to be enrolled, by recommendation of their

departments, in the English 620 course, Classroom communication in ESL for international

teaching assistants. This course is designed, taught, and administered by OEPP staff. Most

students who take ENGL 620 are certified after one semester of the course, but some are

required to repeat the course a second or, in some cases, a third time before being certified for

oral English proficiency. Although it is rare, some students are not certified after three

semesters of the course, the maximum number allowed.


Evolution of the OEPT Test

Prior to using the OEPT test, the OEPP used the SPEAK, an oral English proficiency test

published by Educational Testing Service (ETS), to screen prospective international teaching

assistants (ITAs). A major impetus for developing a local oral English proficiency test was that

institutional tests such as SPEAK and TOEFL did not have university TA-related content, and test

administration was costly and time consuming. By moving to the local computer-mediated test,

administration and rating time was reduced by two thirds.

In July 2000, the OEPP was awarded a Multimedia Instructional Development Center Grant to

develop a computer-based screening device to better serve the needs of the program and its

examinees. As a result, the Oral English Proficiency Test (OEPT) was developed and became

operational in 2001.

The first OEPT test

The OEPT, in use since 2001, is a computer-based semi-direct test of oral English proficiency

developed and administered by the OEPP on the main Purdue campus. There have been two

versions of the OEPT test. The first version was developed by April Ginther and Krishna P.C.

Madhavan and was written in Authorware.

A computer-mediated test was chosen as the format for the OEPT due to concerns of fairness:

all test takers experience the same test interface, without the variability introduced by different

interlocutors in face-to-face interviews. A computer-mediated test also allows flexibility with

respect to administration, item types, and data capture for rating and analysis.

OEPT1 test structure and content

The first OEPT test had eight parallel forms, each with ten rated items. Table 1 provides a summary

of the test items for the first OEPT.


Table 1: OEPT1 Item Summary

Item | Title | Abbr | Prompt | Expected Response
1 | Personal History | per | Video | Provide personal information.
2 | Read Aloud | ral | Text | Read a 200-word English text clearly.
3 | Summarize Graph | gp | Graph | Summarize and interpret information presented in a bar graph.
4 | Newspaper Headline | np | Text | Express a personal opinion about issues that occur in university settings.
5 | Compare and Contrast | cnc | Text | Based on available information presented in two charts, express a preference and explain your choice.
6 | Give Advice | rtc | Text | Provide advice to an undergraduate about a classroom issue.
7 | Pass on Information | psn | Text | Understand information presented in a text and pass the information on verbally to another person.
8 | Telephone Message | tel | Audio | Understand information presented in an audio recording and pass it on verbally to another person.
9 | Summarize Conversation 1 | svp | Video | Summarize the content of a short conversation after watching a video of the conversation twice.
10 | Summarize Conversation 2 | lvp | Video | Summarize the content of a longer conversation after watching a video of the conversation twice.

The second OEPT test: the OEPT2

The OEPT test is administered in ITaP (Information Technology at Purdue) computer labs on the

Purdue main campus. Daily and weekly updates by ITaP to the software settings on lab

computers impacted the OEPT1 test program, resulting in frequent problems with test

administration. By 2008 the Authorware test had become so problematic that the OEPP

director initiated a project to revise the test.


The current version of the test, often referred to as the OEPT2, was developed in 2008-2009.

April Ginther, Director of the OEPP, and Jennifer Redden, OEPP Program Coordinator,

supervised the test revision project.

OEPT2 Test item development

Test item development took place in the summer and fall of 2008. Some item types from the

OEPT1 were retained, such as the long conversation and reading aloud. Writers of new items

were Purdue graduate students in English as a Second Language (ESL) and Linguistics programs

who had experience in testing or teaching ESL at the college or university level.

By the end of the fall 2008 semester, writing of test items was completed and the items were

organized into six parallel forms. Four of these forms were designated to be used in the test,

while the other two forms were reserved to be posted on the OEPP website as practice tests.

From January until early March 2009, voice recording of test items was completed.

Software development of the new test was completed in March 2009. ITaP programmers

wrote the test in Java. Compared with the Authorware version, the Java interface was

reported as “good”, “easy to use”, and “much better than the previous one [the previous

version of the test]” by participants in the pilot project.

Piloting of the OEPT2

Piloting of the OEPT2 took place in late March and early April of 2009. Participants in the pilot

testing process were recruited from the student group who had taken the earlier version of the

OEPT, in order to represent a known range in oral English proficiency and language

backgrounds.

Using the ITaP computer system under fully operational conditions, the pilot test was

administered to 90 examinees. After completing the test, all pilot examinees completed a

written questionnaire. Ten of these pilot examinees were then interviewed individually and

asked to discuss various aspects of the test-taking experience.

Feedback about changes to the test was positive. Test takers reported that they liked the new

background color, screen layout, and shortened instructions, as well as the variety and random

order of easy and difficult items. Raters appreciated the improved recordings of examinees’ item

responses. Improvements to recording quality are due in part to improved headset technology

with USB connections.


In mid- to late April 2009, modifications were made to the test based, for the most part, on

pilot participants’ feedback and suggestions.

Rescaling the OEPT2

In summer 2009, a rescaling project was undertaken. The OEPT1 used a five-point rating scale,

with possible scores of 2, 3, 4, 5, and 6. The level 2 score was very rarely used and was deemed

unnecessary. Yang’s (2000) dissertation research using FACETS to analyze OEPT1 score data

suggested that another score level was called for between scores 4 and 5, the cutoff between

failing and passing.

The OEPT2 rescaling project consisted of several stages:

- Thirty pilot exams were selected according to the examinees’ previous scores on the OEPT1: five examinees at score level 3, ten each at levels 4 and 5, and five at level 6.
- Raters rated these 30 pilot tests using the 2-6 scale of the OEPT1.
- Next, raters were asked to rank these 30 exams into six levels: low, medium, and high fail; and low, medium, and high pass.
- Raters then rated another 30 pilot tests using a six-point scale.
- Raters were then divided into six groups, each writing prose descriptors for one of the six levels, using pilot tests scored at that level.
- Finally, raters rated a third batch of 30 exams using the six-point scale.

Prompted by Yang’s research, and as a result of the rescaling project, the OEPT1 five-point

rating scale of 2-3-4-5-6 was replaced with a six-point scale of 35-40-45-50-55-60. The cutoff

between failing and passing scores is 45/50.

OEPT2 Holistic Scale

The OEPT2 holistic scale descriptors (Table 2) are general guidelines for rating. More detailed

descriptors are provided to raters to help with difficult decisions. The descriptors have been

reviewed and revised periodically since 2009 to improve their efficacy in helping raters award

scores.

Once the new 6-point scale was in place, a new rater training software program was built for

the purpose of orienting OEPT raters to the new rating scale. This program gave established and

new raters extensive practice rating on the new scale in preparation for the August 2009 test

administrations, the largest testing period of the academic year with between 200 and 300

students tested. In August 2009 the OEPT2 was administered to 240 students and all tests were

rated using the new 6-point scale. Since that time the new version of the test has been

administered with no major technical problems.


Table 2: OEPT2 Holistic Scale (revised 11-8-2012)

Each level descriptor below combines the general proficiency level, the effort required of the listener, and the performance of the speaker.

60 — Excellent and consistent across items. Majority of items 60. Minimal listener effort required to adjust to accent. Frequent displays of lexico-syntactic sophistication and fluency. Speaker is at ease and confident fulfilling the task, elaborating a personalized message, using accurate English. Errors are minor and few.

55 — More than adequate. Mix of 55 and 60, with a few 50 if any. Little listener effort required to adjust to accent/prosody/intonation. Consistently intelligible, comprehensible, and coherent. Strong skills across items. Wide range of vocabulary and syntactic structures; generally sophisticated responses. Speaker may exert some noticeable effort or show minor fluency issues in elaborating a clear message to fulfill the task. Errors are minor.

50 — Adequate and ready for the classroom without support. Majority of items 50, possibly some 55 or very few 45. Acceptably small amount of listener effort required to adjust to accent/prosody/intonation. Consistently intelligible and comprehensible. Speaker may exert a little noticeable effort, but despite minor errors of grammar/vocabulary/stress/fluency, the message is adequately coherent, with correct information, some lexico-syntactic sophistication, and displays of automaticity and fluency.

45 — Borderline and inconsistent; minimally adequate for the classroom with support. Mix of 45 and 50, very few, if any, 40. Tolerable listener effort required to adjust. Consistently intelligible. Strengths and weaknesses across characteristics or items. Message is generally coherent, but may require more than a little noticeable effort for the speaker to compose, or delivery may be slow. Or the message may be clear and expressed fluently, but language use is somewhat simplistic.

40 — Limited; not ready for the classroom. Mix of 40 and 45, or a few 35, if any. Able to address prompts and complete responses. Consistent listener effort may be necessary. Message may be simplistic, unfocused, incomplete, or incorrect. May struggle somewhat to build sentences or an argument, or to articulate sounds. May be occasionally unintelligible, incomprehensible, or incoherent.

35 — Restricted; may need more than one semester of support. Mix of 35 and 40. Listening may require considerable effort. May be unintelligible or incoherent more than occasionally, OR may have marked deficiencies in at least three other areas: fluency, vocabulary, grammar/syntax, listening comprehension, articulation/pronunciation, prosody. May have difficulty completing responses.


OEPT2 test structure and content

The OEPT2 test has 2 practice items and 12 rated items. There are four parallel forms of the

test. The 2 practice items and 2 of the rated items (Item 1 - Area of Study, and Item 11 - Read

Aloud 1) are identical on all four test forms.

Table 3 below shows the title and expected response type of the two practice items and twelve

rated items of the OEPT2. The rated portion of the test is composed of five text-based items,

two graphics-based items, three listening tasks, and two read-aloud items.

Item response conditions

After each item is presented, examinees are given up to two minutes of preparation time

followed by up to two minutes of response time, with the exception of item 10 (short lecture)

which allows up to three minutes of response time.

The OEPT is designed to “bias for best” test taker performance. No special outside content or

cultural knowledge is required to respond to OEPT items; items are designed so that test takers

can easily use available information in order to formulate an appropriate response.

Furthermore, test takers are allowed to take notes and may take up to two minutes to prepare

their responses. By presenting all test takers with items that represent a variety of everyday

language, providing generous amounts of time to prepare and respond, and allowing test takers

to take notes, we offer favorable administrative conditions for all test takers. In addition, the

OEPT Practice Test, which is available on the OEPP website for anyone to use, is designed to

introduce test takers to the administrative conditions associated with a semi-direct test of oral

English.


Table 3: OEPT2 Item Summary

Item | Title | Abbr | Prompt type | Expected Response
P1 | Personal History 1 | warm1 | Text | Talk about your country, region, or city of origin.
P2 | Personal History 2 | warm2 | Text | Talk about your favorite holiday in your home country.
1 | Area of Study | aos | Text | Describe your area of study for an audience of people not in your field.
2 | Newspaper Headline | np | Text | Given an issue concerning university education, express an opinion and build an argument to support it.
3 | Compare and Contrast | cnc | Text | Based on 2 sets of given information, make a choice and explain why you made it.
4 | Pros and Cons | pros | Text | Consider a TA workplace issue, decide on a course of action, and discuss the possible consequences of that action.
5 | Respond to Complaint | rtc | Text | Give advice to an undergraduate concerning a course or classroom issue.
6 | Bar Chart | barc | Graph | Describe and interpret numerically-based, university-related data.
7 | Line Graph | lg | Graph | Describe and interpret numerically-based, university-related data.
8 | Telephone Message | tel | Audio | Relay a telephone message in a voicemail to a peer.
9 | Conversation | conv | Audio | Summarize a conversation between a student and professor.
10 | Short Lecture | sl | Audio | Summarize a lecture on a topic concerning graduate study.
11 | Read Aloud 1 - Sounds | ral1 | Text | Read aloud a short text containing all the major consonant and vowel sounds of English.
12 | Read Aloud 2 - Text | ral2 | Text | Read aloud a passage from a University policy statement containing complex, dense text.


Resources for score users and prospective test takers

The OEPP website located at http://www.purdue.edu/oepp/ provides an overview of the OEPP

and the OEPT, and provides a link to the OEPT Practice Test page at

http://oepttutorial.org/Default.aspx?p=test. The Practice Test page has links to the OEPT Video

Orientation which introduces viewers to Purdue, its policies about international teaching

assistants, and the OEPT; video clips of graduate students talking about life and study at

Purdue; two OEPT practice tests; the OEPT scale with descriptions of response characteristics

associated with different score levels; OEPT sample responses consisting of test response

recordings of two examinees who passed the test; and a contact form for users to report

technical difficulties or ask questions directly to the OEPP. Items on the OEPT Practice Tests are

taken from earlier versions of the OEPT, and the test format is identical to that of the OEPT2.

Watching the orientation videos, listening to the sample recordings, and taking the practice


tests are activities intended to help examinees understand the context of the OEPT, the format

of the test, and the types of speaking and listening skills test takers will need in order to pass

the test. The OEPP website is periodically updated to reflect the current status and

administration of the OEPT and to provide resources to support prospective examinees and

teaching assistants.

Scheduling, registration and administration of the OEPT

Test scheduling and registration

OEPP staff, generally the testing coordinator, determines a test administration schedule prior to

the beginning of each semester and makes a request to reserve ITaP lab space. Once ITaP has

reserved lab space, the OEPP secretaries enter the test schedule information into the OEPP web

application database. The test schedule is then supplied to the OEPP liaisons. (Each academic

department in the university has a designated OEPP liaison, generally the graduate program

secretary.) Liaisons then send the OEPP secretarial staff a list of names of their potential ITAs

along with their preferences for test session dates. The OEPP secretarial staff enters the

students’ names into the OEPP web application and registers the students into one of the

scheduled test sessions. Liaisons inform students of their test registration status and dates.

A pre-test information brochure was developed by the OEPP and made available to liaisons in

late January 2013. The brochure helps students prepare to take the OEPT and facilitates the registration, sign-in, and orientation processes.

preparation brochure can be seen on the OEPP web site and in Appendix 2.

Test administration schedule and numbers

Every August there are generally twenty-five separate test administrations, most of which take

place the week before fall semester classes begin. In September through November, and

January through April there are generally three test admins per month, and generally two or

three test admins in the month of July. Each test administration accommodates up to fifteen

examinees, depending on the size of the testing lab.

Testing labs and computers

The OEPT test is administered on campus in one of the ITaP PC computer labs. ITaP labs are

assigned to the OEPP for test administration based on lab availability. A variety of labs,

normally used for instruction or for general student use, has been designated for OEPT

administration. Computers in all ITaP labs are configured by ITaP and require Purdue user

names and passwords.


An ITaP lab is reserved for about four hours for each test administration. OEPP testing staff

spends from one to two hours setting up computer test stations prior to each day’s testing.

Test setup involves logging on to computers using a user name and secure password (both

provided by ITaP specifically for the OEPT), manually configuring the settings for the OEPP USB

headset microphones on each computer, doing a sound and recording check with the headset

at each computer, downloading and extracting the zipped test files from the server, loading the Java test program, and finally placing paper and pencils at each station. Since August of

2012, cardboard screens have also been placed around each computer testing station in an

effort to minimize distractions to test takers.

Examinee ID requirements and orientation to the test process

When the computer test stations are ready, usually 15 minutes before the designated test time,

one or two test administrators sign examinees in. Examinees must show an official photo ID

with their name written in the English alphabet, and must sign a registration list. Since the

summer of 2011, speakers of Chinese have been asked to indicate on the sign-in sheet which variety of Chinese is their native language.

distribute unique examinee IDs for logging on to the test program, explain the test process,

inform examinees about the survey that follows the test, and give instructions about using the

headsets and the desired speaking volume. Since 2012, examinees are asked to read the

instructions from a handout before the oral orientation given by a test administrator. Much of

the information included in this orientation is also available in the pre-test brochure.

Test procedures and length

After sign-in and orientation, examinees are free to choose a test station, log on to the test with

the examinee ID provided to them, and begin the test. They may proceed at their own pace.

Examinees wear a headset with microphone to record their responses. A sound and recording

check is included in the test introduction so that examinees can adjust sound levels before

beginning the actual test. Note taking is allowed on paper provided by the OEPP, but notes

may not be taken out of the lab by examinees. Most examinees take no more than one hour to

complete both the test and the after-test survey that is built into the program. At least one test

administrator is always present to proctor and to assist with any questions or problems that

may arise. After an examinee finishes the test and the survey, a test administrator collects the

examinee’s notes and copies the test recording to a flash drive.

Test recording storage

As each examinee completes and submits their test and survey, the files are automatically

uploaded to the OEPP testing directory on a secure Purdue server. The test recordings that

were copied to a flash drive are later copied onto a hard drive in the OEPP testing office. In a

rare circumstance in which the server is down and automatic upload is not successful, the backup


copy is the only copy of the examinees’ tests. These tests can later be added to the server

manually.

After completing the test, all examinees are given a green post-test brochure and directed to

read it after they leave. This brochure is entitled: The Oral English Proficiency Test: What

Happens Next? A Guide for Students and Their Departments. Here students can read about

OEPT scores and what they mean, about the English 620 course and the Professional

Development section of that course, and about various community resources for helping them

develop English language skills.

Quality control

Test survey

In order to control test and test administration quality and monitor examinee test preparation,

each examinee must fill out a test survey immediately after completing the OEPT. From 2001

until the 2009 OEPT2 test revision, the test survey was administered on paper. Since 2009, the

survey is part of the test program and survey responses are automatically uploaded to the

server and the OEPP web application database. The current version of the OEPT survey consists

of four parts and generally requires five to ten minutes to complete. Screen shots of the survey

can be seen in Appendix 1.

Survey statistics and written comments are reviewed every month and appropriate steps taken

to address any issues that may be actionable. Copies of survey results including written

comments are compiled in a ring binder kept in the testing offices.

Fake testing

A few days before test administration begins each month, fake testing and rating is done to

ensure that the test program and rating application run smoothly on the ITaP system. Because

ITaP makes updates to computers on a weekly basis, opportunities for incompatibilities

between the OEPT test program and the ITaP lab computers arise from time to time. These

problems are generally not difficult to resolve, but must be discovered enough in advance of

test administration to allow time for ITaP to fix the problems.


OEPT Rating and Scoring

Once a test is automatically uploaded to the testing directory on the secure server (or manually

added), testing staff assign tests to raters by means of the OEPP web application. Rating and

scoring are all done within the web application.

Rating guidelines and process

Raters access the OEPP web application and rate online. Normally a test is assigned to two

certified raters as well as to one or more apprentice raters, according to a design in which each

rater is paired with every other rater on at least one test so that appropriate comparisons can

be drawn between raters. Electronic copies of OEPT rater assignment sheets are stored on the

shared Y and T drives, and paper copies are compiled in ring binders labeled “OEPT Test Admin

Info” kept in the testing offices. These ring binders also hold copies of the examinee sign-in

sheets.
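The rater-pairing design described above can be illustrated with a short sketch. This is a hypothetical illustration only (the OEPP web application is not written in Python, and the function names are invented); it assumes a batch contains at least as many tests as there are rater pairs:

    # Hypothetical sketch: assign each test to a pair of certified raters so
    # that every rater is paired with every other rater on at least one test.
    from itertools import combinations, cycle

    def assign_raters(tests, raters):
        """Return (test, (rater_a, rater_b)) pairs, cycling through all rater pairs."""
        pairs = cycle(combinations(raters, 2))
        return [(test, next(pairs)) for test in tests]

    # Example: six tests and four raters cover all six possible pairs once.
    for test, pair in assign_raters(range(6), "ABCD"):
        print(test, pair)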

Raters assign a score to each item in addition to a holistic overall score, and are required to

write comments justifying their overall score and describing examinee strengths and

weaknesses. Raters must also indicate if they are confident in their rating, and if not confident,

they must explain why. There is also an opportunity to explain any technical issues that were

encountered. While rating an exam, raters can access the OEPT rating scale descriptors as well

as the exam item prompts in written form via links on the rating page.

Rating may be done anywhere that raters have access to the internet. OEPT rating guidelines

specify that rating must be done at a time and in a place where the rater is free from other

distractions and can devote their full attention to the rating task. The OEPP web app captures

and records the time of submission of the rater’s score as well as the amount of time the rater took

to rate the exam.

Scoring guidelines

After all certified raters have finished rating a particular exam, the testing coordinator is

responsible for assigning an overall score to the exam using the OEPP web application. Only the

scores of certified raters are taken into account when assigning overall scores.

If two raters assign the same score to an exam, that score is entered as the overall score. If the

two raters disagree by one score level, i.e., 5 points, and the two scores are on the same side of the pass/fail cutoff, the overall score will be the lower score in the case of a 35/40 split, or the higher score in the case of a 50/55 or 55/60 split. If the two raters’ scores are 10 points apart,

and they are still on the same side of the pass/fail cutoff, the final score of the test would be


the average of the two scores. For example, rater scores of 50 and 60 result in a final score

of 55.

However, if the two rater scores stand on different sides of the pass/fail cutoff, a third rater is

assigned to rate the test independently. If the third rater agrees with either of the first two

raters, the majority rule applies. For example, rater scores of 45, 55, and 55 would result in an

overall score of 55.

Since 2010, however, a score of 45, while still a failing score, allows an academic department to

assign that student to a classroom teaching position while the student concurrently takes the

ENGL 620 course. Due to this possible consequence of a 45 score, a third rater is also assigned

to a test with split scores involving a 45. That is, a 40/45 split or 35/45 split is assigned to a third

rater.

With the new 6-point scale and cases of third ratings, three raters sometimes assign three

different scores to a test. Tests that receive three different scores are generally not assigned to a fourth rater; instead, the final score will be on the same side of the cutoff as the majority,

and equal to the score that is closer to the cutoff. For example, in a 3-way split such as

35/40/50, two of the three scores are on the failing side of the cutoff, so the student fails the

test. 40 is closer to the cutoff; therefore, the final score of this test is a 40. In a split such as

45/55/60, the final score would be 55.
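The scoring rules above can be summarized algorithmically. The sketch below is a hypothetical illustration only (scoring decisions are made by the testing coordinator in the OEPP web application, not by this code, and the function names are invented); cases the rules leave to the coordinator’s judgment are returned as unresolved:

    # Hypothetical sketch of the overall-score decision rules described above.
    PASS_CUTOFF = 50        # 50, 55, 60 pass; 35, 40, 45 fail
    CUTOFF_MIDPOINT = 47.5  # the pass/fail boundary lies between 45 and 50

    def passes(score):
        return score >= PASS_CUTOFF

    def two_rater_score(a, b):
        """Overall score for two certified raters; None means assign a third rater."""
        if a == b:
            return a
        lo, hi = sorted((a, b))
        if passes(lo) != passes(hi):  # split across the pass/fail cutoff
            return None
        if 45 in (lo, hi):            # any split involving a 45 also goes to a third rater
            return None
        if hi - lo == 5:              # one level apart, same side of the cutoff
            return lo if hi < PASS_CUTOFF else hi  # 35/40 -> 35; 50/55 -> 55; 55/60 -> 60
        if hi - lo == 10:             # two levels apart, same side: average the scores
            return (lo + hi) // 2     # e.g., 50/60 -> 55
        return None                   # anything else is left to the coordinator

    def three_rater_score(a, b, c):
        """Majority rule; in a 3-way split, the majority side's score nearest the cutoff."""
        scores = (a, b, c)
        for s in set(scores):
            if scores.count(s) >= 2:
                return s
        majority_pass = sum(passes(s) for s in scores) >= 2
        side = [s for s in scores if passes(s) == majority_pass]
        return min(side, key=lambda s: abs(s - CUTOFF_MIDPOINT))

    # Examples from the text: 35/40/50 -> 40; 45/55/60 -> 55
    assert three_rater_score(35, 40, 50) == 40
    assert three_rater_score(45, 55, 60) == 55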

In occasional cases where rater scores are discrepant or unexpected, the testing coordinator

may assign additional raters and/or may review the test recording and make a scoring decision

based on rater scores and written comments along with her own judgment as a trained rater.

The test may also be flagged for discussion during rater training.

Rater Training

Introduction to the OEPT test, scale and rating process

Rater training and supervision is the responsibility of the OEPP testing coordinator. All OEPT

raters go through an introductory training process and rating apprenticeship before being

certified as raters. Trainees are first introduced to the test, the scale, and the rating process by

means of a custom-designed computer program called RITA which generally requires fifteen to

twenty hours to complete. The Testing Coordinator and Program Coordinator have

administrative access to the program allowing them to monitor trainee progress throughout

the program.

After completing the RITA training, apprentice raters take part in all regular rater training and

test rating activities, but apprentice scores assigned to actual OEPT tests do not contribute to

final score decisions. The OEPP web app allows testing staff to view apprentice item and overall


scores and written comments, as well as to extract data that allows statistical analyses of

apprentice rater scores and comparisons to certified rater scores.

Most rater trainees complete the RITA program in the summer and then apprentice during the fall

semester. During the fall semester apprentices rate approximately 50 tests from regular test

admins, complete monthly rater training assignments, and take part in monthly rater training

sessions. Most are then certified to rate during spring semester starting in January. Apprentice

rater statistics are periodically examined to ensure that apprentices are progressing toward

acceptable agreement with ‘certified’ raters. Occasionally a very good apprentice rater (often a

TA who has been working in the OEPP in another capacity and who is already familiar with the

test and rating scale) is certified after a shorter apprenticeship if they consistently have a high

level of agreement with the OEPT scores.

Monthly rater training

All raters take part in monthly training sessions which serve to align raters to the scale and to

each other. Prior to each session, raters are given an assignment to complete which may be

discussed during the training session. Assignments consist of tasks such as listening to

benchmark tests and writing descriptions of the test performances, rating tests, ranking several

tests, doing analytical scoring of items to focus on particular aspects of proficiency such as

fluency or intelligibility, and describing in writing and diagram form the scoring decision process

with reference to the holistic scale.

During training sessions, results from the assignment may be viewed and discussed, and more

benchmark tests are listened to, rated, analyzed, and discussed. Raters are informed of scoring

results of the prior month’s test administration, including rater agreement and correlations,

number of tests requiring third raters, and test score distributions. The 2-hour rater training

sessions generally take place on the Friday before the Monday of OEPT monthly test admins. In

August, rater training takes place in the morning of the first day of testing.

OEPT Test Results

Release of test results

The OEPP does not report test scores to examinees; all reporting of scores to examinees is done

by the departmental liaisons.

Due to confidentiality and university regulations prohibiting the reporting of scores by email, up

until 2011 OEPT score reports were compiled by testing staff after each test administration,

printed, and given to the OEPP secretarial staff, who then reported the scores by phone to the

liaisons in each academic department. In 2011 the OEPP web application was amended to allow

departmental liaisons direct access to student scores by logging on to a new departmental


score report function of the secure web application. This was an important improvement to the

score reporting process, reducing reporting error and saving OEPP testing and secretarial staff

significant time and effort.

Test score data retention and examinee score records

In 2005 Purdue’s Office of the Provost engaged Information Technology at Purdue (ITaP) to

support and complement the OEPP’s testing system. ITaP created the OEPP web application

which allows OEPP staff to register examinees for test sessions, assign raters to exams, rate

exams online, assign final scores to exams, create score reports, and indicate the ENGL 620

course and certification status of students. In addition, the web app allows OEPP staff to

generate cumulative data reports on examinee information such as age, sex, school,

department, native country, native language, test date, overall score, certification information,

and survey results, as well as rater statistics such as raters’ test item scores and overall test

scores.

ITaP maintains and services the OEPP web app and database and makes modifications when

necessary. The testing coordinator must communicate needs to the ITaP liaisons and determine

together with them when any modifications will be done. Test and web app modification work

depends upon the availability of ITaP programmers and is usually carried out only once or twice a

year. After the modifications have been made to the QA (quality assurance) version of the test

or web app, the testing coordinator is responsible for testing all changes and all affected

functions to see if everything works as it should. Other OEPP staff are often recruited to test

functions that they use. (A special QA login and password, supplied by ITaP, is required to

access the QA versions of the test and web app.) After all QA testing has been completed and

all modifications finalized, the changes are then scheduled to be applied to the actual test or

web app, which is called the Prod version. This entire process should take place well in advance

of any actual test admins or rating so that the Prod version can also be tested for bugs during a

period of little or no official use.

Monthly test data reports

After the monthly test administration and rating period, a report is compiled with test,

examinee, and rater information and statistics. These monthly reports include numbers of

examinees tested, score distributions, passing and failing rates, number of exams requiring

third raters, number of retests, rater agreement statistics, rater correlation matrices, and

examinee information by department and native country. In addition, test survey statistics and

written comments are extracted and compiled. These statistics, code, and reports are saved on

the T drive, and paper copies of the reports are compiled in ring binders kept in the testing

offices and labeled “Monthly Test Data Reports” and “Test Survey Responses”.


Requests for review of OEPT scores and early retests

Academic departments sometimes request that a test with a failing score be reviewed. Often

the request is made because faculty and staff consider the student’s oral English proficiency to

be adequate for certification. Sometimes the request originates with the student, but a student

request must be vetted by the academic department before the OEPP will consider it. A handful

of test review requests are made each year, usually after the August test administration.

Test reviews involve the testing coordinator examining the raters’ scores and comments and

listening to all or portions of the test, then making a recommendation to the director, who may

also review the test results and recording. A review may result in a student being granted

certification with no further testing or coursework, a student being granted a retest earlier than

the normal retest policy indicates, or a score standing as is. Occasionally, with a borderline test

performance, the OEPP allows the academic department to decide whether or not a student

has adequate oral English proficiency to fulfill the duties of a classroom TA in their department.

Policies for retests

Examinees are allowed to retake the test after one year, with the exception of examinees who

received a score of 45, who are allowed to retest after 6 months. Special requests to the OEPP

by academic departments to retest examinees earlier than this are taken into consideration and

often granted. Students who have taken the ENGL 620 course and not been certified after one

semester are generally not allowed to retake the test; these students must retake the course in

order to be certified.

Use of OEPT Test Scores

The OEPT is used to test the oral English proficiency of non-native English speaking graduate

students who are prospective TAs. A score of 50, 55, or 60 means a student is certified for oral

English proficiency and can be considered for a TA position involving the direct instruction of

undergraduate students.

The level of OEPT passing scores (50, 55, or 60) may contribute to an academic department’s

decisions about assigning TA positions for certain courses. For example, a department may feel

that a particular instructional assignment requiring lecturing several times a week is only appropriate for a graduate student who received a score of 55 or 60 on the OEPT.

An OEPT score of 45 means a student is borderline in terms of proficiency and must either take

the test again and pass it or take ENGL 620 in order to be certified. Students scoring 45 may be

allowed to hold an instructional TA position if their academic department makes prior


arrangements with the OEPP. In these cases, the student must be enrolled in ENGL 620 while

concurrently holding the teaching position.

A score of 40 or 35 indicates that a student is not ready for a classroom position due to

insufficient oral English proficiency. A student receiving a 40 or 35 will be placed on the waiting

list to take the ENGL 620 course taught by the OEPP. Students identified as high priority by their

departments are given first priority for course enrollment. OEPP approval is required to enroll

in the course; students may not enroll independently.

Students who take the ENGL 620 course are generally not retested with the OEPT.
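As a compact, hypothetical summary (not part of any OEPP system; names invented for illustration), the score interpretations above reduce to a simple mapping:

    # Hypothetical summary of the OEPT score interpretations described above.
    def oept_outcome(score):
        if score >= 50:     # 50, 55, 60: certified for oral English proficiency
            return "certified"
        if score == 45:     # borderline: retest or ENGL 620; may teach by prior arrangement
            return "borderline"
        return "not ready"  # 35 or 40: waiting list for the ENGL 620 course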

Test scores influence which section of the English 620 course a student is placed in. The OEPP

Program Coordinator makes every effort to place students of similar OEPT score levels in the

same section of the course. One or two sections of the course may be populated primarily by

students who scored 35 on the OEPT, while other sections are reserved for those who scored

45. There are normally 10 or 11 sections of the course offered each fall and each spring

semester, with 8 students assigned to each section. All course sections follow the same general

syllabus, but each section has instruction tailored to the levels and needs of the students

assigned to it.

Although graduate students must be certified in oral English proficiency to be eligible to teach

undergraduate courses, due to limited access to the ENGL 620 course, academic departments

are discouraged from establishing certification of oral English proficiency as a requirement of

graduation.

Reliability and Comparability of OEPT Scores

Estimates of Reliability and Standard Error of Measurement

Measures of internal consistency, or item reliability, indicate the likelihood that the

performance of individual examinees will be ranked the same across items. In other words,

internal consistency measures indicate if all test items are testing the same aspect of

examinees’ language proficiency, e.g., speaking. The item reliability coefficients for the OEPT

are expressed as Cronbach’s alpha coefficients. The item reliability index ranges from 0 to 1; a

reliability index of 0 means that test items are not at all consistent; a reliability index of 1

means that the test is perfectly consistent and all test items are testing the same ability or

competence.
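For illustration, Cronbach’s alpha for an examinee-by-item matrix of scores can be computed as follows. This is a generic sketch with made-up data, not the OEPP’s actual analysis code:

    import numpy as np

    def cronbach_alpha(scores):
        """Cronbach's alpha for a 2-D array of shape (examinees, items)."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                              # number of items
        item_variances = scores.var(axis=0, ddof=1)      # per-item variance
        total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Made-up scores for four examinees on three items, for illustration only
    print(round(cronbach_alpha([[4, 5, 4], [3, 3, 4], [5, 5, 6], [2, 3, 3]]), 2))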

Standard error of measurement (SEM) is an index of the amount of variation in scores that is

due to imprecise measurement. The index of SEM is operationalized by an estimate of the

average difference between observed scores (examinees’ OEPT scores) and true scores (what


examinees’ OEPT scores would hypothetically be without any measurement error). In other

words, SEM can be interpreted as the average standard deviation of individual examinees’ OEPT scores had they taken the OEPT a number of times over a short period. The SEM indices are calculated using the formula in Figure 1 below, where SEM

is the standard error of measurement, S is the standard deviation of the scaled scores, and α is

Cronbach’s alpha coefficient.

Figure 1: Formula for Calculating SEM

SEM = S · √(1 − α)
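As a quick worked check of the formula, using Form 1 of Table 4 below (S = 1.15, α = 0.97): SEM = 1.15 × √(1 − 0.97) ≈ 0.20, matching the tabled value:

    from math import sqrt

    def sem(sd, alpha):
        """Standard error of measurement: SEM = S * sqrt(1 - alpha)."""
        return sd * sqrt(1 - alpha)

    print(round(sem(1.15, 0.97), 2))  # 0.2, as reported for Form 1 in Table 4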

Tables 4 and 5 provide estimates of internal consistency and SEM for the OEPT2 test in 2011

and the OEPT1 test from 2005 to 2009 respectively. The reliability estimates for all four forms

are very high, indicating that the items consistently measure the same ability. In addition, the small values of SEM suggest that the OEPT measures examinees’ speaking ability with a high degree of precision.

Table 4: Internal Consistency and SEM by Test Form for the OEPT2 for Year 2011

Form | Scale | Item Reliability Coefficient | SD | SEM
1 | 1-6 | 0.97 | 1.15 | 0.20
2 | 1-6 | 0.98 | 1.10 | 0.16
3 | 1-6 | 0.98 | 1.26 | 0.18
4 | 1-6 | 0.95 | 1.24 | 0.29
Overall | 1-6 | 0.97 | 1.19 | 0.21


Table 5: Internal Consistency and SEM by Test Form for the OEPT1 for Years 2005-2009

Form      Scale   Item Reliability Coefficient   SD     SEM
1         3-6     0.99                           0.79   0.09
2         3-6     0.99                           0.82   0.10
3         3-6     0.99                           0.81   0.10
4         3-6     0.99                           0.82   0.10
5         3-6     0.99                           0.83   0.09
6         3-6     0.99                           0.81   0.09
7         3-6     0.99                           0.77   0.09
8         3-6     0.99                           0.82   0.09
Overall   3-6     0.99                           0.81   0.10

Inter-correlations among item scores

The correlation of scores across items is another way of presenting the item reliability, or internal consistency, of the test, providing detailed statistics on the correlation between each pair of items. These correlations are expressed as Spearman rank correlation coefficients ranging from 0 to 1. A coefficient of 1 indicates that two items rank examinees' performance identically and can thus be claimed to measure the same construct; a coefficient of 0 indicates that the two items are not similar at all. Tables 6 and 7 display the correlation matrices among items on the OEPT2 and the OEPT1. The correlation coefficients among items on the OEPT1 are very high, ranging from 0.82 to 0.92, which indicates that the OEPT1 items consistently test the same construct. The correlation coefficients among OEPT2 items are also high but cover a wider range, from 0.52 to 1.00. The difference in inter-item correlations between the two versions is expected because of the shift from a 4-point rating scale on the OEPT1 to a 6-point rating scale on the OEPT2, and because the same types of items are used to detect finer differences among examinees.
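For illustration, a minimal sketch of how an inter-item correlation matrix of this kind can be produced from item scores (hypothetical data; this uses SciPy's spearmanr rather than the SAS procedures used operationally):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical item scores: rows = examinees, columns = items
items = ["AOS", "NP", "CNC", "PRO"]
scores = np.array([
    [3, 3, 4, 4],
    [5, 5, 5, 5],
    [2, 2, 3, 3],
    [4, 4, 4, 5],
    [6, 6, 5, 6],
])

rho, _ = spearmanr(scores)  # pairwise Spearman correlations between columns
for i, name in enumerate(items):
    row = "  ".join(f"{rho[i, j]:.2f}" for j in range(i + 1))
    print(f"{name:>4}  {row}")  # prints the lower triangle of the matrix
```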


Table 6: Item Internal Consistency for the OEPT2 for Year 2011

        AOS   NP    CNC   PRO   RTC   HG    LG    TEL   CVN   SL    RAL1  RAL2
AOS    1.00
NP     1.00  1.00
CNC    0.75  0.75  1.00
PRO    0.76  0.76  0.96  1.00
RTC    0.63  0.63  0.78  0.80  1.00
HG     0.68  0.68  0.81  0.82  0.75  1.00
LG     0.68  0.68  0.82  0.82  0.73  0.93  1.00
TEL    0.55  0.55  0.65  0.66  0.80  0.77  0.76  1.00
CVN    0.63  0.63  0.75  0.76  0.71  0.88  0.87  0.84  1.00
SL     0.55  0.55  0.65  0.66  0.80  0.77  0.77  0.82  0.74  1.00
RAL1   0.56  0.56  0.66  0.66  0.63  0.77  0.77  0.66  0.73  0.80  1.00
RAL2   0.52  0.52  0.61  0.61  0.56  0.73  0.73  0.60  0.70  0.72  0.90  1.00

Table 7: Item Internal Consistency for the OEPT1 for Years 2005-2009

        PER   RAL   GP    NP    CNC   RTC   PSN   TEL   SVP   LVP
PER    1.00
RAL    0.83  1.00
GP     0.85  0.82  1.00
NP     0.86  0.84  0.88  1.00
CNC    0.85  0.82  1.00  0.88  1.00
RTC    0.88  0.85  0.87  0.89  0.87  1.00
PSN    0.87  0.85  0.87  0.89  0.87  0.90  1.00
TEL    0.86  0.85  0.85  0.87  0.85  0.89  0.89  1.00
SVP    0.88  0.86  0.87  0.88  0.87  0.90  0.90  0.89  1.00
LVP    0.87  0.85  0.88  0.89  0.88  0.90  0.90  0.89  0.92  1.00


Inter-rater reliability

Inter-rater reliability for the OEPT is reported as three different types of estimate: consistency, consensus, and measurement estimates (Stemler, 2004). Each type has different implications for inter-rater reliability and, combined, they enable a better interpretation of rater performance than any single inter-rater reliability statistic. Both the consistency estimate (Spearman rank correlation coefficient) and the consensus estimates (percent-agreement figures) were computed using SAS (version 9.3).
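For illustration, a minimal sketch of how the two consensus estimates can be computed from paired ratings (hypothetical double-rated scores on the 35-60 OEPT scale, which moves in 5-point steps; the OEPP's actual computations were done in SAS):

```python
def consensus_estimates(r1, r2, step=5):
    """Percent exact and adjacent agreement between two raters' scores."""
    pairs = list(zip(r1, r2))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) == step for a, b in pairs) / len(pairs)
    return exact, adjacent

# Hypothetical double-rated OEPT scores (scale: 35-60 in steps of 5)
rater1 = [50, 45, 40, 55, 50, 45, 35, 50]
rater2 = [50, 50, 40, 50, 45, 45, 40, 60]
exact, adjacent = consensus_estimates(rater1, rater2)
print(f"Exact: {exact:.0%}  Adjacent: {adjacent:.0%}")
```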

Table 8: Spearman Rank Correlation Coefficients of Inter-rater Reliability

Academic Year   Inter-rater Reliability
2009-2010       0.65
2010-2011       0.72
2011-2012       0.71

Table 9: Consensus Estimates of Inter-rater Reliability

Academic Year   Exact Agreement   Adjacent Agreement
2009-2010       39%               47%
2010-2011       46%               44%
2011-2012       47%               42%

Many-facet Rasch measurement (MFRM) analysis (Linacre, 1989) provides measurement estimates of inter-rater reliability and of rater performance on the OEPT by means of the FACETS computer program (Version 3.67; Linacre, 2010). The FACETS program models the relationship between three facets of the OEPT (examinee, rater, and item) by calibrating these facets, along with the rating scale, onto the same linear scale (i.e., the logit scale), thus allowing for comparisons of rater consistency, rater severity, and raters' application of the OEPT scale.


The MFRM analysis of the OEPT uses a partial credit model rather than a rating scale model: although each item is rated on the same scale, it was hypothesized that individual raters apply the OEPT scale in different ways, and the partial credit model makes it possible to examine the ratings of each individual rater (Myford, personal communication, October 22, 2012). A different model can be selected depending on the facet of interest; for example, a rating scale model could be used to examine examinees or items, under the hypothesis that the same OEPT rating scale is applied across raters. Even within the same model, a different mathematical specification is required if a different set of research questions is asked. The model used here is expressed in mathematical terms in Figure 2.

Figure 2: Partial Credit Model used for Many-facet Rasch Measurement

$$\ln\!\left[\frac{P_{nijk}}{P_{nij(k-1)}}\right] = B_n - D_i - C_j - F_{jk}$$

where $P_{nijk}$ = probability of examinee n being rated k on task i by rater j,

$P_{nij(k-1)}$ = probability of examinee n being rated k-1 on task i by rater j,

$B_n$ = oral English proficiency of examinee n, $D_i$ = difficulty of task i, $C_j$ = severity of rater j, and

$F_{jk}$ = difficulty of being rated k relative to being rated k-1 by rater j.
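To make the model concrete, the following minimal sketch computes the rating-category probabilities implied by one set of parameter values (all values are hypothetical; this illustrates the partial credit model above and is not the FACETS estimation routine):

```python
import math

def pcm_probabilities(B, D, C, F):
    """Category probabilities under the partial credit model.

    B: examinee proficiency, D: item difficulty, C: rater severity (logits).
    F: step difficulties F_j1..F_jm for this rater; categories run 0..m.
    """
    # P(k) is proportional to exp of the cumulative sum of (B - D - C - F_h),
    # h = 1..k; the sum for category 0 is empty, so its numerator is 1.
    numerators = [1.0]
    cum = 0.0
    for f in F:
        cum += B - D - C - f
        numerators.append(math.exp(cum))
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical values: able examinee, average item, slightly severe rater,
# five steps (six categories, matching a 6-point scale)
probs = pcm_probabilities(B=1.2, D=0.0, C=0.3, F=[-1.5, -0.5, 0.5, 1.5, 2.5])
for k, p in enumerate(probs):
    print(f"P(category {k}) = {p:.3f}")

# Adjacent-category log-odds recover the model: ln(P1/P0) = B - D - C - F_1
print(round(math.log(probs[1] / probs[0]), 3))  # 1.2 - 0.0 - 0.3 - (-1.5) = 2.4
```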

The Rasch model is specified to answer the following three questions:

Question 1: Do raters rate consistently on the OEPT?

The MFRM analysis provides individual-level statistical indicators of rater consistency via outfit and infit mean square residuals. The outfit mean square residual (outfit) is the unweighted mean of the squared standardized residuals and is particularly sensitive to extremely unexpected ratings, i.e., outliers. The infit mean square residual (infit) is the information-weighted mean of the squared residuals; it is less sensitive to outliers but more sensitive to unexpected patterns among the ratings that carry the most information about the examinee. While outlying ratings can be statistically adjusted, unexpected rating patterns are normally hard to remedy. Moreover, since rating patterns are associated with estimation precision, infit is often regarded as more important than outfit in evaluating rater consistency (Myford & Wolfe, 2003).
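For illustration, a minimal sketch of how the two statistics are computed from observed ratings, model-expected ratings, and model variances (all values hypothetical; FACETS reports these statistics directly):

```python
def fit_statistics(observed, expected, variance):
    """Outfit and infit mean squares for one rater's set of ratings.

    observed: assigned ratings; expected: model-expected ratings;
    variance: model variance of each observation.
    """
    n = len(observed)
    z_sq = [(o - e) ** 2 / v for o, e, v in zip(observed, expected, variance)]
    outfit = sum(z_sq) / n  # unweighted mean of squared standardized residuals
    infit = (sum((o - e) ** 2 for o, e in zip(observed, expected))
             / sum(variance))  # information-weighted mean square
    return outfit, infit

# Hypothetical ratings with model expectations and variances
obs = [4, 3, 5, 2, 4]
exp = [3.2, 3.9, 4.1, 2.6, 3.5]
var = [0.8, 0.9, 0.7, 0.8, 0.9]
outfit, infit = fit_statistics(obs, exp, var)
print(f"Outfit MSq = {outfit:.2f}, Infit MSq = {infit:.2f}")
```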

An example of measurement estimates of inter-rater reliability for the OEPT can be found in

Table 10, which reports both infit and outfit statistics as an indication of rater consistency. A

rule of thumb for checking fit statistics is to accept values between 0.5 and 2.0 for low-stakes tests (Linacre, 2002) and to apply a more restricted range of 0.7 to 1.3 for high-stakes tests (McNamara, 1996). To ensure better quality control, the examination of rater performance on the OEPT uses the latter, more restricted range. Accordingly, raters with an infit higher

than 1.3 are considered to rate inconsistently and unpredictably, whereas raters with an infit lower than 0.7 are too predictable in their ratings, which may flag potential rater effects such as central tendency or the halo effect. The fit statistics in Table 10 suggest that, in terms of outfit, raters R3061 and R3064 appear to have assigned some unexpected ratings; the infit statistics of all OEPT raters, however, fall within the more stringent range of 0.7 to 1.3, suggesting that all raters rated internally consistently.

Table 10: Measurement Estimates of Inter-rater Reliability for August 2012 OEPT data

Rater    Severity   Model SE   Infit MSq   Zstd    Outfit MSq   Zstd
R2581     1.66      0.08       1.00         0.00   0.98        -0.20
R3059     1.28      0.07       0.74        -4.30   0.83        -1.90
R3061     0.71      0.07       1.24         3.36   1.60         6.20
R3063     0.63      0.07       0.97        -0.40   0.97        -0.50
R2801     0.48      0.08       1.04         0.70   1.01         0.10
R3062     0.13      0.07       0.96         0.70   1.00         0.00
R3060    -0.46      0.08       1.11         1.70   1.02         0.20
R3040    -0.83      0.08       0.89        -1.70   0.95        -0.70
R2539    -0.83      0.09       0.92        -1.00   0.90        -1.20
R2359    -0.93      0.07       0.72        -4.70   0.68        -3.70
R3064    -1.04      0.06       1.00         0.00   2.34         6.20
Mean      0.07      0.07       0.96        -0.60   1.12         0.40
SD        0.91      0.01       0.14         2.30   0.44         3.00

Separation: 12.05; Reliability: 0.99

Fixed (all same) chi-square: 1629.6, df=10, p<0.001

Inter-rater agreement opportunities: 3010; Exact agreement: 33.2%; Expected: 37%


Question 2: Do raters display the same level of leniency or severity on the OEPT?

The MFRM analysis provides statistical indices of rater severity at both the group level and the individual level. At the group level, the MFRM analysis conducts a fixed chi-square test of the hypothesis that all raters are of the same severity level. For example, the output shown in Table 10 indicates a statistically significant difference in severity among OEPT raters (χ² = 1629.6, df = 10, p < 0.001). The group rater separation index offers similar evidence (G = 12.05): because the raters' standard errors of measurement are small, the Rasch model can distinguish as many as 16 statistically distinct levels of severity among OEPT raters. The reliability coefficient provided in Table 10 (r = 0.99) can be considered equivalent to the internal consistency reliability coefficient (Cronbach's alpha) of the OEPT in the classical test theory approach. At the individual level, rater severity is measured in logits, with a possible range between −∞ and +∞; positive values indicate severity and negative values indicate leniency. Table 10 shows that rater severity ranges from −1.04 logits to 1.66 logits, with rater R2581 being the most severe and rater R3064 the most lenient. However, the severity measures are closely distributed around 0 (M = 0.07, SD = 0.91), suggesting that no rater was extremely harsh or lenient in scoring. In addition, L1 and L2 English speaking raters did not show a significant difference in severity, as the severity measures of the L2 English speaking raters (R3040, R3060, and R3062) rank in the middle of the whole group.
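The figure of 16 distinct severity levels reported above can be checked with the strata formula commonly used in Rasch measurement, (4G + 1) / 3, where G is the separation index from Table 10 (a quick sketch, not the FACETS computation itself):

```python
def strata(separation: float) -> float:
    """Number of statistically distinct levels: (4G + 1) / 3 (Wright's strata)."""
    return (4 * separation + 1) / 3

print(round(strata(12.05)))  # separation G = 12.05 from Table 10 -> 16
```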

The Wright Map offers another way to investigate rater severity and its influence on examinees' scores. The example shown in Figure 3 confirms the output in Table 10, ranking rater R2581 at the top and rater R3064 at the bottom of the logit scale of severity measures (shown in the third column of the map). Furthermore, the range of rater severity measures is much narrower than the range of examinee proficiency measures (shown in the second column), indicating that variability in raters' severity levels had only a small impact on examinee scores.


Figure 3: Wright Map for August 2012 OEPT data


Therefore, although raters displayed varying degrees of severity, the variability was so small that it exerted very little influence on examinees' scores. In fact, a comparison between examinees' observed average scores and fair average scores (i.e., average scores after adjustment for rater severity) revealed that, among the 253 sample exams, there were only two false positives and two false negatives in the classification of examinees as pass or fail on the OEPT (see Table 11). That the differences between the observed and adjusted scores are so small also suggests that these misclassifications are within the margin of error.

Table 11: Misclassification of Examinees on August 2012 OEPT

Examinee   Total Count   Observed Average   Fair-M Average
13102      26            49.81              50.19
13030      26            48.85              50.32
13069      26            50.19              49.98
13055      26            50.77              49.76

Question 3: Do raters apply the OEPT scale in the same manner?

As well as aligning examinee proficiency, rater severity and item difficulty on the same logit

scale, the Wright Map visualizes how raters apply the OEPT scale in relation to examinees’ oral

English proficiency measures. If raters apply the OEPT scale in the same way, on the right-hand

side of the Wright Map we should be able to discern straight lines across raters between adjacent levels on the OEPT scale. In Figure 3, we can roughly discern two straight lines

separating OEPT 45 and 50, and 50 and 55, suggesting that OEPT raters had generally reached a

mutual interpretation of examinees’ oral English proficiency at the passing score level, i.e. OEPT

50. Nevertheless, the jagged lines between adjacent scores on the rest of the OEPT scale imply

that raters applied the OEPT scale in very different ways. FACETS can also perform rater

interaction analysis to find out whether raters rate subgroups of OEPT test takers, e.g. Chinese,

Indian, or Korean examinees, in the same way.

In summary, the MFRM analysis is employed to complement consistency and consensus

estimates of the inter-rater reliability of the OEPT for a better interpretation of rater

performance. It provides detailed information about rater performance at both the individual and group levels. For rater monitoring and training, MFRM analysis can also be used to track rater performance each semester and to examine the effectiveness of rater training over time.


Other Test Statistics

Table 12: OEPT Examinees by College and Academic Year

College                     2009-2010    2010-2011    2011-2012    Total
                            n     %      n     %      n     %      n     %
Engineering                 236   51%    204   41%    188   39%    628   43%
Science                     102   22%    131   26%    137   28%    370   26%
Agriculture                 24    5%     26    5%     38    8%     88    6%
Liberal Arts                17    4%     31    6%     29    6%     77    5%
Technology                  14    3%     24    5%     26    5%     64    4%
Graduate School             16    3%     15    3%     15    3%     46    3%
Health and Human Sciences   9     2%     24    5%     13    3%     46    3%
Management                  18    4%     9     2%     15    3%     42    3%
Education                   17    4%     13    3%     8     2%     38    3%
Pharmacy                    10    2%     11    2%     9     2%     30    2%
Veterinary Medicine         3     1%     8     2%     5     1%     16    1%
Total                       466          496          483          1446


Table 13: OEPT Score Distribution by Academic Year

            OEPT Score
Year        35    40    45    50    55    60
2009-2010   4%    26%   21%   30%   15%   4%
2010-2011   6%    25%   21%   34%   11%   3%
2011-2012   8%    17%   29%   32%   11%   3%

Table 14: OEPT Pass/Fail Rates by Academic Year

Year        Pass          Fail          Total
            n      %      n      %      n
2009-2010   228    49%    237    51%    466
2010-2011   239    48%    257    52%    496
2011-2012   222    46%    261    54%    483
Total       765    52%    690    48%    1446

Table 15: Reported Use of Practice Test by OEPT Examinees

Year        Yes           No            Total
            n      %      n      %      n
2009-2010   238    51%    228    49%    466
2010-2011   278    56%    218    44%    496
2011-2012   290    60%    193    40%    483
Total       806    56%    640    44%    1446


References

Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.

Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16, 878.

Linacre, J. M. (2010). Facets Rasch measurement computer program (Version 3.67.0) [Computer software]. Chicago: Winsteps.com.

McNamara, T. F. (1996). Measuring second language performance. London and New York: Addison Wesley Longman.

Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5-11.

Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386-422.

Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4). Retrieved December 9, 2012, from http://PAREonline.net/getvn.asp?v=9&n=4

Yang, R. (2010). A many-facet Rasch analysis of rater effects on an oral English proficiency test (Doctoral dissertation). Purdue University.


Appendices

Appendix 1: Test Survey Screenshots


Appendix 2: Text of Test Preparation Brochure

PURDUE ORAL ENGLISH PROFICIENCY PROGRAM

Preparing for the Oral English Proficiency Test: A Guide for Students and their Departments

Oral English Proficiency Program
810 Young Hall
155 S. Grant Street
West Lafayette, IN 47907
www.purdue.edu/oepp
Phone: 765-494-9380
Email: [email protected]

Why do you need to take the OEPT?

Your department may ask you to take the OEPT. Only your department can register you to take the OEPT.

The OEPT is one method your department uses to determine which of its graduate students may be best suited to assist with an undergraduate class at Purdue.

By University policy, any student whose first language is not English must be certified for oral English proficiency before being offered a position involving the direct instruction of undergraduate students.

If you did not meet any of the following minimum test scores for certification, and your department wants to hire you to work with undergraduates, you will need to take the OEPT:

o IELTS speaking: 8.0

o TOEFL iBT speaking: 27

o PTE speaking: 76

If you did not take any of the above tests, you may need to take the OEPT.


How to Prepare for the Test

Take the OEPT practice test at least once. There are two versions of the practice test.

The OEPT practice test, sample responses, rating scale, and a video orientation are on the OEPP web site and at:

http://oepttutorial.org/

Check out other test and orientation information on the OEPP web site:

http://www.purdue.edu/oepp/

Get your OEPT registration information from your department and record the info below. Bring this brochure with you to the test.

Date of test_______________

Location_________________

Test time_________________ (arrive 20 minutes before this)

Tips for Test Day

Arrive at the testing lab 20 minutes before the test begins. If you arrive more than 10 minutes after the posted test time, you may not be allowed to take the test.

Wait in the hallway until you are asked to enter the test room.

Have your Purdue ID or other official photo ID ready (passport or U.S. driver's license). We do not accept electronic IDs, IDs without photos, or IDs that are not in English.

If you do not have a photo ID, you will not be allowed to take the test.

Bring a bottle of water with a screw-on cap, but no other food or drink.

Turn off and put away all electronic devices before entering the lab.

Put all belongings (including cell phones) away in your bag, backpack, purse, etc. You must leave all bags in the front of the testing lab.

You will be provided with paper and pencils to take notes.

You may want to bring a sweater or jacket in case the lab is cold.


What to do during the test: Testing Dos and Don’ts

Do speak at a regular speed, not too fast and not too slow.

Do speak in a moderate volume. Don’t speak in a loud voice.

o Be aware that there will be noise in the testing lab. While every effort is made to

leave space between students, you may hear other test takers talking during the

test.

Don’t mute the headset.

Do give feedback on the short survey that follows the test.

Do finish the test and survey in 1½ hours. For example: begin at 10:30, finish and be out by 12:00.

When you finish the test and survey, before you leave

Leave your notes on your desk. Write your test ID number at the top of each page on

which you have taken notes.

Leave your test ID slip on your desk on top of your notes.

Raise your hand to call a test administrator to check your computer, test file, and notes.

Don't forget to pick up all your belongings at the front of the room on your way out.

Take a green brochure with you and read it later; it contains information about test

scores and the English 620 course.

For more information on the OEPP and its services, please visit the OEPP website or speak with

your department’s graduate studies office.