Scoring Essays Automatically Using Surface Features

Randy M. Kaplan, Susanne Wolff, Jill C. Burstein, Chi Lu, Don Rock, and Bruce Kaplan

GRE Board Report No. 94-21P

August 1998

This report presents the findings of a research project funded by and carried out under the auspices of the Graduate Record Examinations Board.

Educational Testing Service, Princeton, NJ 08541

********************

Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate Record Examinations Board Reports do not necessarily represent official Graduate Record Examinations Board position or policy.

********************

The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs, services, and employment policies are guided by that principle.

EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS, and GRE are registered trademarks of Educational Testing Service.

Copyright © 1998 by Educational Testing Service. All rights reserved.

Abstract

Surface features of essays include nonlinguistic characteristics, such as the total number of words per essay (essay length), sentence length, and word length, as well as linguistic characteristics, such as the total number of grammatical errors, the types of grammatical errors, or the kinds of grammatical constructions (e.g., passive or active) that appear in an essay. This study examines the feasibility of using linguistic and nonlinguistic surface characteristics in an automatic procedure for predicting essay scores. It builds on earlier work by R. M. Kaplan, Burstein, Lu, B. Kaplan, and Wolff (1995), Page (1966), and Page and Petersen (1995).

Introduction

As testing programs move away from multiple-choice items, the incorporation of an increasing number of constructed-response items in tests becomes inevitable. Essay items, a constructed-response item type, are becoming more and more common in large-scale testing programs. Scoring essay items is time-consuming, and because it requires human judges, and indeed multiple judges, it is quite costly. With cost such an important consideration in test design, the role of computers in the scoring process and their potential for cost reduction merit close attention.

Page (1966) claims that effective and accurate essay scoring can be done by computer. His method, analyses, and results are described in a series of papers referenced in Page and Petersen (1995). Page recommends an indirect, statistically based approach that relies on nonlinguistic surface characteristics of an essay. Surface features of an essay item include the total number of sentences, the number of words per essay, and the average length of words. A more linguistically-based approach might attempt to score an essay by producing a structured representation of syntactic and semantic characteristics and by analyzing this representation. By using surface features, Page circumvents the need to understand and represent the meaning (content) of an essay. His methodology nevertheless achieves scoring results that correlate .7 or better across multiple judges.

Page makes only passing reference to the surface criteria, or "proxes," employed by his system, and presents us essentially with a black box. Here we explore an approach similar to the one taken by Page, concentrating on surface characteristics of essay items and evaluating their use as proxes in our automated approach to large-scale scoring tasks. We scored the same 1,314 Praxis essays scored by the PEG system in the Page and Petersen (1995) study.¹ Our criteria, unlike Page's, can be evaluated as to whether or not they satisfy the general requirements of scoring systems employed by ETS.

¹ The rubric used to score the essays holistically can be found in the Appendix.

Criteria for automatic scoring

An automated scoring system developed and used by ETS must satisfy several requirements to ensure that the risks of employing it do not outweigh its benefits.

Criterion 1. Defensibility. Any scoring procedure must be educationally defensible. The score produced by a scoring procedure must be traceable to its source, namely, the rationale used to produce the score. The mechanism that created the score must be defensible; that is, it must be possible to explain rationally how the score was determined.

Criterion 2. Accuracy. Any scoring procedure (manual or automated) must maintain a specified level of accuracy. The accuracy of a scoring procedure is typically determined by comparing scores produced by the automated procedure to the scores produced by human raters. The more accurate the scoring procedure, the more highly correlated the two sets of scores.²

² We assume that the grade assigned by a human grader is always acceptable and is therefore considered correct when comparing human and machine scoring.

Criterion 3. Coachability. The scoring procedure should not be coachable. Suppose that a particular scoring procedure for essays bases its score on the number of words in the essay. It does not matter what is written or how it is written; all that matters is that the essay contains a specific number of words. A student can easily be coached with this information and write an essay consisting of nonsense words and nonsense sentences. Such an essay would receive the highest score possible. Although this example is extreme, it exemplifies an unacceptable situation. A scoring procedure should not be so transparent or simple that it can be discovered and coached to the test-taking community.

Criterion 4. Cost. The scoring procedure must be cost-effective. The purpose of an automatic procedure for scoring essays is to reduce the cost of scoring and to improve consistency. Therefore, the cost to score should not exceed an acceptable cost determined a priori. A second important cost is the one incurred in setting up the scoring procedure. Setup might include operations like creating a computer-based rubric for the computer-based scoring process. Setup costs may significantly increase expenses for an automated scoring process and should be taken into consideration when evaluating any new scoring procedure.

We have dwelt on these criteria for an acceptable scoring procedure at some length because a procedure that seems acceptable on the surface may, on closer analysis, fail to meet one or more of these criteria and must then be rejected.

Scoring with surface features

Surface features of natural language are those characteristics that can be observed directly; that is, no inference is necessary for their identification. An example of a surface feature present in any written or spoken utterance is the number of words in a sentence. There are computer programs designed specifically to identify and extract surface characteristics of texts -- grammar-checking programs.

A grammar-checking program automatically checks sentence construction, analyzes one or several consecutive sentences, and determines whether any grammar rules of the language have been violated. The idea is certainly worthwhile, but in practice the task is very difficult. Many grammar-checking programs can correctly account for only 35% to 40% of the errors in any written passage. One program we have evaluated accounts for up to 60% of the errors. Though grammar checkers do not have a high rate of error-detecting accuracy, they have proven useful in assisting writers to identify potential problems.

A recent study by Kaplan et al. (1995) developed a decision model for scoring essays written by nonnative English speakers. It evaluated the effectiveness of a variety of grammar-checking programs for predicting essay scores and selected the best-performing grammar checkers to construct the decision model. A total of 936 essays were analyzed. The essays were first scored by human experts using a set of criteria that corresponded closely to criteria used in grammar-checking programs. Two hundred forty-two characteristics across four grammar checkers were classified into categories of balance, cohesion, concision, discourse, elegance, emphasis, grammar, logic, precision, punctuation, relation, surface, transition, unity, and usage. Each essay was then automatically analyzed by calculating the number of errors in each of the categories. A logistic regression model was built from these error-category aggregates and then used to predict scores for the 936 essays. There was 30% agreement between the computed and the manual (human) scores.
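As a rough illustration of this kind of setup, the sketch below aggregates per-essay error counts into categories and fits a multinomial logistic regression to predict discrete scores. It is not the original ETS procedure: the data are simulated, only a handful of the fifteen category names are used, and scikit-learn is assumed as the modeling library.

```python
# Minimal sketch (not the original 1995 procedure): per-essay error counts,
# aggregated by category, feed a logistic regression that predicts the
# holistic score. All data below are simulated stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

n_essays = 936                                              # size of the 1995 sample
categories = ["grammar", "usage", "punctuation", "cohesion", "concision"]  # subset of the 15 categories

# Simulated aggregate error counts per category (stand-in for grammar-checker output)
X = rng.poisson(lam=3.0, size=(n_essays, len(categories)))
# Simulated human scores on a 0-6 holistic scale
y = rng.integers(0, 7, size=n_essays)

model = LogisticRegression(max_iter=1000)                   # multiclass handled automatically
model.fit(X, y)

predicted = model.predict(X)
print("exact agreement with human scores:", accuracy_score(y, predicted))
```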

One of the conclusions reached in this study was that the aggregation of raw errors into categories may have been detrimental to the score prediction process. A linguistic analysis confirmed that the raw error data may be better suited than the aggregate data for predicting scores.

A Characteristics-Based Scoring Procedure

We decided to analyze essays using an approach similar to that of Page and Petersen (1995) in conjunction with the one developed in Kaplan et al. (1995). The results of the Kaplan study were to guide the scoring procedure. Specifically, the Kaplan study had determined that, of the four grammar-checking programs, one was most accurate in predicting essay scores. This grammar checker, RightWriter, was chosen to analyze the essays in the current study. As discussed above, the raw error score data rather than the error score aggregates were used to construct the predictive regression model. The analysis procedure and results of the analysis are described in the two following sections.

Analysis

As indicated, the following analysis builds on earlier work (Kaplan et al., 1995; Page and Petersen, 1995) dealing with the feasibility of computerized scoring of essays. We propose several scoring models that employ surface characteristics, among them the best of the computerized procedures using a grammar checker (RightWriter) from our earlier analysis. Page and Petersen state that the fourth root of the length of an essay was a relatively accurate predictor of essay scores. Three of our models therefore include the fourth root of essay length as a scoring variable. The criterion for selecting the best of the five models is how well it predicts the average of the human ratings of the essays (a) in the validation sample of 1,014 student essays and (b) in a cross-validation sample of an additional 300 student essays. These are the two sample sets of the Page and Petersen study. The model variations were fitted on the validation sample and then replicated on the cross-validation sample. This validity model assumes that the average over six raters is a reasonably accurate criterion, and it has considerable face validity.
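The sketch below illustrates the modeling setup just described: least-squares models M1 through M5 fit on a validation sample and evaluated on a held-out cross-validation sample. It is a stand-in under stated assumptions, not the authors' analysis; the word counts, the 18 RightWriter-style predictors, and the ratings are all simulated, and scikit-learn is used for convenience.

```python
# Sketch of the validation / cross-validation design for models M1-M5.
# Features and ratings are simulated, not the actual Praxis or RightWriter data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)

def make_sample(n):
    """Simulate word counts, 18 RightWriter-style predictors, and average ratings."""
    words = rng.integers(50, 700, size=n).astype(float)
    rightwriter = rng.normal(size=(n, 18))
    # Ratings loosely tied to the fourth root of length, as the report suggests
    rating = 1.2 * words ** 0.25 + 0.3 * rightwriter[:, 0] + rng.normal(scale=0.5, size=n)
    return words, rightwriter, rating

w_val, rw_val, y_val = make_sample(1014)   # validation sample
w_cv, rw_cv, y_cv = make_sample(300)       # cross-validation sample

def features(model_id, words, rightwriter):
    """Assemble the predictor matrix for each model."""
    if model_id == "M1":                                   # 18 RightWriter predictors
        return rightwriter
    if model_id == "M2":                                   # linear word count
        return words[:, None]
    if model_id == "M3":                                   # fourth root of word count
        return (words ** 0.25)[:, None]
    if model_id == "M4":                                   # linear count and fourth root
        return np.column_stack([words, words ** 0.25])
    return np.column_stack([rightwriter, words, words ** 0.25])  # M5: combined

for m in ["M1", "M2", "M3", "M4", "M5"]:
    fit = LinearRegression().fit(features(m, w_val, rw_val), y_val)
    r2_fit = r2_score(y_val, fit.predict(features(m, w_val, rw_val)))
    r2_cv = r2_score(y_cv, fit.predict(features(m, w_cv, rw_cv)))
    print(f"{m}: validation R^2 = {r2_fit:.2f}, cross-validation R^2 = {r2_cv:.2f}")
```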

The validation sample consisted of 1,014 essays drawn from the computer-based writing assessment part of the Praxis Series. This assessment measures the writing skills of beginning teachers through essays on general topics. The 1,014 human-scored essays had been analyzed previously by the Page procedure but not by the RightWriter model. Table 1 presents the multiple correlation squared on the validation sample for five alternative prediction models:

(1) The RightWriter model (M1), which includes 18 predictors derived from the RightWriter grammar-checking program

(2) a model that includes just the linear count of words (M2)

(3) a model that includes just the fourth root of the length of the essay measured in number of words (which seems to be a component of the PEG model) (M3)

(4) a modification of the Page model that includes both the linear count and the fourth root of the length of the essay (M4)

(5) the combined model, i.e., using both the RightWriter predictors and the linear count and fourth root of the length of the essay (M5)

Table 1. Prediction Models on the Validation Sample

Model   Description                                               # of predictors   R²      F         Prob
M1      RightWriter                                               18
M2      # of words (linear count)                                 1                 .46
M3      # of words (fourth root)                                  1                 .49
M4      # of words, linear count and fourth root                  2                 .50     502.04    .000
M5      RightWriter + # of words, linear count and fourth root    20                .61     77.18     .000

Table 1 shows that essay length has a relatively large impact on the prediction of the average of the human ratings. Both the simple count of words per essay and the fourth root of the number of words do significantly better than all 18 predictors from RightWriter in the validation sample. The fact that the fourth root does better than the simple count (.49 vs. .46) suggests that the functional relationship between the number of words in the essay and the average essay rating is not strictly linear.

Figure 1 is a plot of the average score by average essay length for 11 groups of essays arranged by increasing score intervals along the word-length scale. Essays are assigned a score on a scale ranging from 0 to 6, with 0 being the poorest score assigned and 6 representing an excellent score. The relationship between essay length and average essay grades in Figure 1 shows a rather rapid acceleration for 100- to 400-word essays; the rate of increase slows as the essays become longer.

Figure 1. Average Essay Length (in words) vs. Average Essay Score

The fact that there is a significant increase in R² when information from both the essay length and the RightWriter predictors is included in the equation suggests that, although there is considerable overlap between the two kinds of measures, grammar checkers and measures of essay length are also measuring somewhat different abilities. The total predictable variance on the validation sample, that is, the R² with both the RightWriter predictors and the essay-length predictors in the "full" model (M5), can be partitioned into variance uniquely due to (1) grammar skills, (2) essay length, and (3) shared variance:

Predictable Variance = Variance Unique to Essay Length + Variance Unique to Grammar + Shared Variance

        .61  =  .21  +  .11  +  .29
       100%  =  35%  +  18%  +  47%

Almost 35% of the total predictable variance is unique to the measures of essay length, and about half of that is unique to the grammar measures. Clearly, essay length is measuring a writing skill other than the knowledge and use of grammar assessed by the RightWriter computer program.
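For readers who want to reproduce the arithmetic, the snippet below computes such a partition from the R² values of the full and restricted models (a commonality-analysis-style decomposition). The grammar-only R² of .40 is implied by the reported figures (.61 minus the .21 unique to essay length) rather than stated in the report, so it should be read as an inference.

```python
# Sketch of the variance partition from full- and restricted-model R^2 values.
# Small rounding differences from the report are expected, since the R^2
# values are themselves rounded to two decimals.
r2_full = 0.61          # M5: RightWriter + essay-length predictors
r2_length_only = 0.50   # M4: linear count and fourth root of essay length (Table 1)
r2_grammar_only = 0.40  # implied: r2_full minus the .21 unique to essay length (an inference)

unique_length = r2_full - r2_grammar_only    # variance only essay length explains
unique_grammar = r2_full - r2_length_only    # variance only the grammar checker explains
shared = r2_full - unique_length - unique_grammar

for name, value in [("unique to essay length", unique_length),
                    ("unique to grammar", unique_grammar),
                    ("shared", shared)]:
    print(f"{name}: {value:.2f} ({value / r2_full:.0%} of predictable variance)")
```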

The above results, however, need to be replicated on the cross-validation sample to ensure that they do indeed generalize to an independent sample. This is particularly critical when there are many predictors, as in the RightWriter model, and hence a potential for overfitting. Similarly, nonlinear functional relationships such as those described by essay length may also benefit unduly from overfitting.

Table 2 presents the cross-validated multiple correlations squared for the five models that were fitted in the validation sample.

All of the models seem to have cross-validated. In fact, there were some gains in predictive accuracy for the essay-length prediction models. An approximate partition of variance carried out on the cross-validated results yields the following:

Predictable Variance = Variance Unique to Essay Length + Variance Unique to Grammar + Shared Variance

        .65  =  .25  +  .05  +  .35
       100%  =  38%  +   8%  +  54%

Clearly, essay length continues to measure something that appears to be a proxy for components of writing skill not directly measured by the grammar checkers.

Figures 2 through 6 present additional results on the relative accuracy of the predictions of each of the models in the cross-validated sample.

Figure 2. Scoring results for M1 (exact: 56%; off by 1 point: 42%; off by 2 points: 2%; off by 3 points: 0%)

Figure 3. Scoring results for M2

Figure 4. Scoring results for M3

Figure 5. Scoring results for M4

Figure 6. Scoring results for M5

The pie charts show the number of scores that exactly match a human score as well as the number and extent of over- and under-prediction. For example, Figure 2 shows that the RightWriter model (M1) gave exact predictions for 166 out of 300 essays in the cross-validation sample. Similarly, 126 essays were either over- or under-predicted by one scale point on the six-point rating scale. Seven essays were off by two points on the six-point rating scale. When the RightWriter model (M1 in Figure 2) is compared with the simple fourth-root essay-length model (M3 in Figure 4), we see that the essay-length model shows a significant improvement over the grammar model in the number of exact "hits" and shows fewer of the more serious errors in prediction. These results are, of course, consistent with the multiple correlation results. It would seem that essay length measures some aspect of writing outside of simple grammar competence. The increased acceleration in the functional relationship between the number of words and the ratings in the neighborhood of 100 to 450 words suggests that essay length may be a prox for topic knowledge, which in turn reflects a coherent writing presentation. Some knowledge of the topic content would seem to be a necessary condition for a demonstration of writing skills in this type of essay. In addition to showing content knowledge, it may take 300 to 400 words to present evidence of writing ability. Word counts beyond the 450 or 500 level become superfluous for demonstrating writing ability. Reading exercises show similar results, where demonstrations of adult literacy are partly a function of how much prior familiarity the reader has with the content presented.
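The summary behind these pie charts can be reproduced from predicted and human scores with a few lines of code. The sketch below uses simulated scores on the 0-6 scale purely to show the calculation; the percentages it prints are not the study's results.

```python
# Sketch of the agreement summary: count how many essays match the human score
# exactly and how many are off by 1, 2, or 3 points. Scores here are simulated.
import numpy as np

rng = np.random.default_rng(2)
human = rng.integers(0, 7, size=300)                              # human scores for 300 essays
predicted = np.clip(human + rng.integers(-2, 3, size=300), 0, 6)  # stand-in model scores

diff = np.abs(predicted - human)
for k in range(4):
    share = np.mean(diff == k)
    print(f"off by {k} point(s): {share:.0%}")
```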

Note (see Appendix) that essay length is an implicit criterion in the rubric for human scoring of Praxis essays: developing ideas with examples and supporting statements goes hand in hand with essay length. This might explain in part the strong influence of essay length as a predictor of holistic scores.

The generalizability of measures of essay length to more scientific topics might be of interest. In science essays, the functional relationship might even be more nonlinear in the sense that the point at which additional words become superfluous may occur earlier. This could be readily handled by measuring essay length, possibly taking the fifth or sixth root rather than the fourth root.
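A quick numerical illustration of this point: higher-order roots of the word count flatten earlier than the fourth root, so extra words stop adding to the predicted score sooner. The word counts below are arbitrary.

```python
# Fourth, fifth, and sixth roots of a few illustrative word counts:
# the higher the root, the earlier additional words stop mattering.
for words in (100, 200, 400, 800):
    print(words, round(words ** 0.25, 2), round(words ** 0.2, 2), round(words ** (1 / 6), 2))
```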

Discussion

Our goal was to explore the possibility of predicting essay scores automatically. We constructed various scoring models, including two based on data produced by an off-the-shelf grammar-checking program. Four of our models (M2-M5) predicted exact scores 60% or more of the time, and scores within +/-1 of the human score over 90% of the time, within the cross-validation sample.

These results seem to indicate an extremely viable scoring procedure. However, the results should be examined with some caution and with regard to how well the scoring procedure meets the four criteria for scoring procedures discussed earlier. Accuracy is not at issue here: we can achieve, in general, approximately 90% scoring accuracy. Another question entirely is whether the procedure is defensible. In a commercial or public testing situation, the question will eventually arise as to how we arrived at a certain score for a particular item. In the case of our models, we might be hard put to provide a satisfying answer if the largest contributing factor to a score were the number of words in the essay. Clearly, this might not be acceptable for a short but to-the-point essay. Here, the scoring procedure would have to be augmented with some type of content analysis procedure. And this would put us back where we started: in search of a scoring procedure that does automated content analysis.

Similarly, if our model were chiefly based on essay length, coachability would be another problem. A coaching organization could readily prepare students to write essays tailored to our scoring procedure. In its simplest form, this could be the instruction to "write the longest essay more or less about the subject specified by the item." The essay would not necessarily have to make sense. Even a content analyzer might not be able to resolve the problem of scoring such an essay in a defensible way. Ideally, the scoring procedure would flag such an essay as a candidate for a human grader.

An automated scoring procedure relying on surface characteristics meets the “cost” criterion. The preparation cost is minimal (no rubric needs to be prepared), and the scoring procedure itself is fairly rapid.

Conclusions

An automated scoring procedure based solely on surface features of a writing passage, although cost-effective and, in most cases, accurate, carries with it some notable problems. First, such a procedure is not defensible unless we know precisely how the predictor(s) account for the various aspects of writing skills as analyzed by human graders. Second, the apparent coachability of a procedure based solely on these features would seem to make it unacceptable in a high-stakes testing program. On the other hand, it might be acceptable in conjunction with other, more complex manual or automated procedures. It could be deployed to screen papers for subsequent scoring, to add information to a manual scoring process, or to augment a more complex content-based analysis.

Scoring systems based on surface characteristics could be used in conjunction with human scoring of essays as a "second rater." With sufficient diagnostic information in addition to the score computed by such a procedure, such systems may prove useful tools for scoring essays, provided the results are always interpreted by a human grader as well.

References

Kaplan, R. M., Burstein, J., Lu, C., Rock, D., Kaplan, B., & Wolff, S. (1995). Evaluating a prototype essay scoring procedure using off-the-shelf software (Educational Testing Service Research Report RR-95-21). Princeton, NJ: Educational Testing Service.

Page, E. B., & Petersen, N. (1995, March). The computer moves into essay grading: Updating the ancient test. Phi Delta Kappan, 561-565.

Page, E. B. (1966, January). The imminence of grading essays by computer. Phi Delta Kappan, 238-243.

Appendix

SCORING GUIDE

PRAXIS I: PRE-PROFESSIONAL SKILLS AND CORE BATTERY WRITING TESTS

Readers will assign scores based on the following scoring guide. The essays must respond to the assigned task, although parts of the assignment may be treated by implication.

Scores

6 A 6 essay demonstrates a high degree of competence in response to the assignment but may have a few minor errors.

An essay in this category
- is well organized and coherently developed
- clearly explains or illustrates key ideas
- demonstrates syntactic variety
- clearly displays facility in the use of language
- is generally free from errors in mechanics, usage, and sentence structure

5 A 5 essay demonstrates clear competence in response to the assignment but may have minor errors.

An essay in this category
- is generally well organized and coherently developed
- explains or illustrates key ideas
- demonstrates some syntactic variety
- displays facility in the use of language
- is generally free from errors in mechanics, usage, and sentence structure

4 A 4 essay demonstrates competence in response to the assignment.

An essay in this category
- is adequately organized and developed
- explains or illustrates some of the key ideas
- demonstrates adequate facility with language
- may display some errors in mechanics, usage, or sentence structure, but not a consistent pattern of such errors

3 A 3 essay demonstrates some degree of competence in response to the assignment but is noticeably flawed.

An essay in this category reveals one or more of the following weaknesses:
- inadequate organization or development
- inadequate explanation or illustration of key ideas
- a pattern or accumulation of errors in mechanics, usage, or sentence structure
- limited or inappropriate word choice

2 A 2 essay demonstrates only limited competence and is seriously flawed.

An essay in this category reveals one or more of the following weaknesses:
- weak organization or very little development
- little or no relevant detail
- serious errors in mechanics, usage, sentence structure, or word choice

1 A 1 essay demonstrates fundamental deficiencies in writing skills.

An essay in this category contains serious and persistent writing errors, is incoherent, or is undeveloped.

Copyright © 1984, 1987, 1994 by Educational Testing Service. All rights reserved. Unauthorized reproduction is prohibited.