
Grading essays: Humans vs. machine

Updated 2/21/2011 11:49:20 AM

By Scott Jaschik, Inside Higher Ed

FAIRFAX, Va. — If a computer can win at Jeopardy, can one grade the essays of freshmen?

At George Mason University Saturday, at the Fourth International Conference on Writing Research, the Educational Testing Service presented evidence that a pilot test of automated grading of freshman writing placement tests at the New Jersey Institute of Technology showed that computer programs can be trusted with the job. The NJIT results represent the first "validity testing" — in which a series of tests is conducted to make sure that the scoring was accurate — that ETS has conducted of automated grading of college students' essays. Based on the positive results, ETS plans to sign up more colleges to grade placement tests in this way — and is already doing so.

But a writing scholar at the Massachusetts Institute of Technology presented research questioning the ETS findings, and arguing that the testing service's formula for automated essay grading favors verbosity over originality. Further, the critique suggested that ETS was able to get good results only because it tested short-answer essays with limited time for students — and an ETS official admitted that the testing service has not conducted any validity studies on longer-form, longer-timed writing.


Chaitanya Ramineni, an ETS researcher, outlined the study of NJIT's use of the testing service's E-Rater to grade writing placement essays. NJIT has freshmen write answers to short essay prompts and uses four prompts, arranged in various configurations of two prompts per student, with 30 minutes to write.

The testing service compared the results of E-Rater evaluations of students' papers to human grading, and to students' scores on the SAT writing test and the essay portion of the SAT writing test (which is graded by humans). ETS found very high correlations between the E-Rater grades and the SAT grades and, generally, between the E-Rater grades and the human grades of the placement test.
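The validity comparison described here comes down to score agreement between machine and human graders. A minimal sketch of that kind of check in Python, assuming one paired score per student (the scores and the helper function are illustrative, not the study's actual data or analysis):

```python
# Minimal sketch of a machine-vs.-human score-agreement check.
# The scores below are made up for illustration, not NJIT data.

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# One E-Rater score and one human score per student (illustrative).
e_rater_scores = [4, 3, 5, 2, 4, 3, 5, 1]
human_scores = [4, 3, 4, 2, 5, 3, 5, 2]

print(f"machine-human correlation: {pearson(e_rater_scores, human_scores):.2f}")
```

A high correlation here says only that the two graders rank essays similarly; as the critique below argues, it says nothing about whether either grader is rewarding the right qualities.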

In fact, Ramineni said, one of the problems that surfaced in the review was that some humans doing the evaluation were not scoring students' essays on some prompts in consistent ways, based on the rubric used by NJIT. While many writing instructors may not trust automated grading, she said, it is important to remember that "human scoring suffers from flaws."

Andrew Klobucar, assistant professor of humanities at NJIT, said that he has also noticed a key change in student behavior since the introduction of E-Rater. One of the constant complaints of writing instructors is that students won't revise. But at NJIT, Klobucar said, first-year students are willing to revise essays multiple times when they are reviewed through the automated system, and in fact have come to embrace revision if it does not involve turning in papers to live instructors.

Students appear to view handing in multiple versions of a draft to a human as "corrective, even punitive," in ways that discourage them, he said. Their willingness to submit drafts to E-Rater is a huge advance, he said, given that "the construction and revision of drafts is essential" for the students to become better writers.

After the ETS and NJIT presentations encouraging the use of automated grading, Les Perelman came forward as, he said, "the loyal opposition" to the idea. Perelman, director of writing across the curriculum at MIT, has a wide following among writing instructors for his critiques of standardized writing tests — even when graded by people.

He may be best known for his experiments psyching out the College Board by figuring out which words earn students high grades on the SAT essay, and then having students write horrific prose using those words, and earn high scores nonetheless.

Perelman did not dispute the possibility that automated essay grading may correlate highly with human grading in the NJIT experiment. The problem, he said, is that his research has demonstrated that there is a flaw in almost all standardized grading of short essays: in the short-essay, short-time-limit format, scoring correlates strongly with essay length, so the person who gets the most words on paper generally does better — regardless of writing quality, and regardless of human or computer grading.

In four separate studies of the SAT essay tests, Perelman explained, high correlations were found between length and score. Other writing tests — with times of one hour instead of 25 minutes — found that the correlation between length and score dropped by half. In more open-ended writing assignments, the correlation largely disappeared, he said.

After reviewing these nine tests, he said that the values rewarded by any formula that works this way (grading by humans in short time periods, but especially grading by computer) are likely to be suspect.

Perelman then critiqued the qualities that go into the ETS formula for automated grading. For instance, many parts of the formula look at ratios — the ratio of grammar errors to total number of words, the ratio of mechanics errors to word count, and so forth. Thus someone who writes lots of words, and keeps them simple (even to the point of nonsense), will do well.
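Perelman's objection to ratio features is easy to see in miniature: if the feature is errors divided by words, padding an essay with extra error-free words improves it. A hypothetical sketch (the feature and the numbers are illustrative, not E-Rater's actual formula):

```python
# Hypothetical ratio feature: grammar errors per word. This is an
# illustration of Perelman's point, not E-Rater's real formula.

def error_ratio(word_count: int, error_count: int) -> float:
    """Errors per word; lower looks 'cleaner' to a ratio feature."""
    return error_count / word_count

# The same five grammar errors appear in both essays.
concise = error_ratio(word_count=150, error_count=5)  # ~0.033
padded = error_ratio(word_count=450, error_count=5)   # ~0.011

# The padded essay's ratio is three times better, even if the extra
# 300 words are simple filler that adds nothing to the argument.
print(f"concise: {concise:.3f}  padded: {padded:.3f}")
```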

ETS says its computer program tests "organization" in part by looking at the number of "discourse units" — defined as having a thesis idea, a main statement, supporting sentences and so forth. But Perelman said that the reward in this measure of organization is for the number of units, not their quality. He said that under this rubric, discourse units could be flopped in any order and would receive the same score — based on quantity.
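The order-invariance complaint can be made concrete. A toy sketch, assuming an organization feature that only counts discourse units (unit detection is faked here as one paragraph per unit; the real system's detection is more involved):

```python
import random

# Toy organization feature that only counts discourse units.
# Treating each paragraph as one unit is a deliberate simplification.

def organization_score(units: list[str]) -> int:
    """Score organization by the number of discourse units present."""
    return len(units)

essay = ["Thesis paragraph.", "First supporting paragraph.",
         "Second supporting paragraph.", "Conclusion."]
shuffled = essay[:]
random.shuffle(shuffled)

# A pure count cannot tell a coherent ordering from a random one:
# both versions of the essay receive the same organization score.
assert organization_score(essay) == organization_score(shuffled)
```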

Other parts of the formula, he noted, punish creativity. For instance, the computer judges "topical analysis" by favoring "similarity of the essay's vocabulary to other previously scored essays in the top score category." "In other words, it is looking for trite, common vocabulary," Perelman said. "To use an SAT word, this is egregious." Word complexity is judged, among other things, by average word length, so, he suggested, students are rewarded for using "antidisestablishmentarianism," regardless of whether it really advances the essay. And the formula also explicitly rewards length of essay.
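Two of the features named here lend themselves to a direct sketch: vocabulary similarity to previously top-scored essays, and average word length as a complexity proxy. Both functions below are illustrative stand-ins, not ETS code:

```python
from collections import Counter
from math import sqrt

def vocabulary_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of word-count vectors; an illustrative
    stand-in for a 'topical analysis' feature."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def avg_word_length(text: str) -> float:
    """Word-complexity proxy: mean word length in characters."""
    words = text.split()
    return sum(len(w) for w in words) / len(words)

top_scored = "education reform requires testing standards and assessment"
echoing = "testing standards and assessment drive education reform"
novel = "grading by machine mistakes polysyllables for thought"

# Echoing the vocabulary of past high scorers beats saying something new.
print(vocabulary_similarity(top_scored, echoing))  # high overlap
print(vocabulary_similarity(top_scored, novel))    # no overlap

# And a single long word raises the complexity proxy regardless of fit.
print(avg_word_length("antidisestablishmentarianism is long"))
```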

Perelman went on to show how Lincoln would have received a poor grade on the Gettysburg Address (except perhaps for starting with "four score," since it was short and to the point). And he showed how the ETS rubric directly contradicts most of George Orwell's legendary rules of writing.

For instance, he noted that Orwell instructed us to "never use a metaphor, simile, or other figure of speech which you are used to seeing in print," to "never use a long word where a short one will do" and that "if it is possible to cut a word out, always cut it out." ETS would take off points for following all of that good advice, he said.

Perelman ended his presentation by flashing an image that he said represented the danger of going to automated grading just because we can: Frankenstein.

Paul Deane, of the ETS Center for Assessment, Design and Scoring, responded to Perelman by saying that he agreed with him on the need for study of automated grading of longer essays and of writing produced over longer periods of time than 30 minutes. He also said that ETS has worked safeguards into its program so that if someone, for instance, used words like "antidisestablishmentarianism" repeatedly, the person would not be able to earn a high score with a trick.

Generally, he said that the existing research is sufficient to demonstrate the value of automated grading, provided that it is used in the right ways. The computer "won't tell you if someone has written a prize essay," he said. But it can tell you if someone has "knowledge of academic English" and whether someone has the "fundamental skills" needed — enough information to use in placement decisions, along with other tools, as is the case at NJIT.

Automated grading evaluates "key parts of the writing construct," Deane said, even if it doesn't identify all of the writing skills or deficits of a given student.

