DOCUMENT RESUME

ED 134 599                                                TM 005 995

AUTHOR          Brown, David Lile
TITLE           Faculty Ratings and Student Grades: A Large-Scale
                Multivariate Analysis by Course Sections.
PUB DATE        Dec 74
NOTE            127p.; Ph.D. Dissertation, University of Connecticut
AVAILABLE FROM  Dr. David Lile Brown, 36 Mansfield Hollow Road
                Extension, Mansfield Center, Connecticut 06250 ($5.00)
EDRS PRICE      MF-$0.83 Plus Postage. HC Not Available from EDRS.
DESCRIPTORS     Bias; Effective Teaching; Factor Analysis; *Grades
                (Scholastic); *Higher Education; Multiple Regression
                Analysis; Predictor Variables; Rating Scales;
                *Statistical Analysis; *Student Evaluation of Teacher
                Performance; Validity
IDENTIFIERS     University of Connecticut
ABSTRACT
The purpose of this dissertation is to provide evidence bearing on the question of the influence of the grades students receive on their ratings of the college teachers who gave them those grades. Specifically, certain characteristics of the grade distribution within each course section are evaluated as predictors of the students' ratings of the teacher of that course section. Multivariate techniques were employed to evaluate data across an entire university. Over 30,000 anonymous student ratings of 2,360 course sections were collected after students had received final course grades, and without student or administrator knowledge that the ratings would be used in the study. Factor analysis was used to reduce the eight-item rating instrument to a single criterion variable. Subsequently, stepwise multiple regression analysis was used, both to reduce an initial battery of predictors to an optimally reduced subset, and to test the incremental importance of certain grading variables as predictors of the criterion. The primary implication of the study is that there is a relationship between grades and ratings, but it only accounts for about nine percent of the variance in the ratings. There is a significant bias, but factors other than grades must also be influencing student ratings. Whether or not these other factors are valid measures of teaching effectiveness remains to be determined, but one seemingly invalid factor (the grading bias) has been identified. (Author/RC)
FACULTY RATINGS AND STUDENT GRADES:
A LARGE-SCALE MULTIVARIATE ANALYSIS BY COURSE SECTIONS
BY
DAVID LILE BROWN
FIRST PUBLICATION, DECEMBER 1974
COPYRIGHT © DAVID LILE BROWN 1974
ABSTRACT
FACULTY RATINGS AND STUDENT GRADES:
A LARGE-SCALE MULTIVARIATE ANALYSIS BY COURSE SECTIONS
David Lile Brown, Ph.D.
The University of Connecticut, 1974
Effective teaching has been a primary educational goal
throughout history, and the evaluation of teaching is
therefore one of the central concerns of education. Student
ratings are often considered one of the best indicators of
teaching effectiveness, because the student is in the most
privileged position to view the teaching and to experience
its effects. The use of such ratings has become widespread,
but controversy rages over whether student ratings are valid
measures of teaching effectiveness, especially when used for
making decisions about faculty pay, promotion, and tenure.
It is possible that students are not mature enough to
appraise teaching or, worse, that they may be exploiting
rating systems to punish strict teachers and to reward
lenient ones. If so, this would have important implications
for the interpretation of student ratings. Proper
interpretation requires an answer to the research question:
What is the influence of the grades students receive on
their ratings of the college teachers who gave them those
grades? This question has not been answered satisfactorily
by previous research. The former evidence has been
conflicting and inadequate in a number of ways. The
present study, however, was intended to avoid many of the
shortcomings of previous studies.
This study employed multivariate techniques to evaluate
data across an entire university. Over 30,000 anonymous
student ratings of 2,360 course sections were collected
after students had received final course grades, and without
student or administrator knowledge that the ratings would
be used in this study. Factor analysis was used to reduce
the eight-item rating instrument to a single criterion
variable. Subsequently, stepwise multiple regression
analyses were used, both to reduce an initial battery of
predictors to an optimally reduced subset, and to test the
incremental importance of certain grading variables as
predictors of the criterion.
Results showed that the simple correlation between the
average student grade in each course section and the average
student rating of the teacher of that course section was
.35, p < .000001. Moreover, the average grade was the
single best predictor, of those available, of the average
rating; and when average grade was added to the optimally
reduced subset of other predictors, it significantly
improved the multiple correlation from .25 to .39,
F (4, 2345) = 60.13, p < .001.
The primary implication of the results of this study
is that there is a relationship between grades and ratings,
but it only accounts for about 9% of the variance in
ratings. In other words, there is a significant bias, but
factors other than grades must also be influencing student
ratings. Whether or not these other factors are valid
measures of teaching effectiveness remains to be determined,
but one seemingly invalid factor (the grading bias) has
been identified through this study. The interpretation of
student ratings should take this bias into account, and
methods should be devised to eliminate it.
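The 9% figure can be checked against the statistics reported above; it corresponds to the increment in the squared multiple correlation when the grading variables are added (a reader's arithmetic, not a computation stated in the text):

$$\Delta R^2 = .39^2 - .25^2 = .1521 - .0625 \approx .09$$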
FACULTY RATINGS AND STUDENT GRADES:
A LARGE-SCALE MULTIVARIATE ANALYSIS BY COURSE SECTIONS
David Lile Brown
B.S., Tufts University, 1968
M.A., University of Connecticut, 1969
A Dissertation
Submitted in Partial Fulfillment of the
Requirements for the Degree of
Doctor of Philosophy
at
The University of Connecticut
1974
Copyright by
David Lile Brown
1974
APPROVAL PAGE
Doctor of Philosophy Dissertation
FACULTY RATINGS AND STUDENT GRADES:
A LARGE-SCALE MULTIVARIATE ANALYSIS BY COURSE SECTIONS
Presented by
David Lile Brown, B.S., M.A.
Major Advisor
Associate Advisor
Associate Advisor
Associate Advisor
Associate Advisor
The University of Connecticut
1974
ACKNOWLEDGMENTS
This writer wishes to express his appreciation to:
Prof. Ellis B. Page, who, as his major advisor, has
been a tremendous source of inspiration and encouragement
during the past six years;
The other members of his advisory committee, Drs.
Steven V. Owen, Robert K. Gable, Richard W. Whinfield, and
Marvin Rothstein, whose suggestions, corrections, and
insights were most valuable in guiding this effort to a
successful conclusion;
The University of Connecticut Computer Center for the
use of computer facilities;
Dr. Dorothy C. Goodwin, Mrs. Shirley Malinowski, Mrs.
Althea McLaughlin, and Mrs. Meryl Kogan of the University of
Connecticut Bureau of Institutional Research for their
assistance with data collection and data editing;
Mr. Richard J. Stec and Miss Barbara J. Sochor of the
University of Connecticut Student Data Systems and
Programming Office for assistance with data collection and
computer tape copying;
Mrs. Katherine J. Brown and Mr. Rudy Voit of the
Registrar's Office for assistance with data collection; and
Lile H. and Helen W. Brown, his parents, for support,
patience, and encouragement during this effort.
TABLE OF CONTENTS

                                                          Page
ACKNOWLEDGMENTS                                            iii
LIST OF TABLES                                               v
LIST OF FIGURES                                             vi

Chapter
  I. STATEMENT AND DEFINITION OF THE PROBLEM                 1
       Statement of the Problem                              1
       Purpose of the Study                                  4
       Need for the Study                                    6
       Sources of the Data                                  18
 II. SUMMARY OF RELATED RESEARCH                            19
       Validity of Student Ratings                          19
       Related Issues                                       25
       Summary                                              30
III. PROCEDURE                                              31
       Subjects                                             31
       Instrument                                           31
       Initial Predictor Variables                          35
       Grading Variables                                    41
       Criterion Variables                                  42
       Statistical Analyses                                 43
       Summary                                              46
 IV. RESULTS                                                47
       Factor Analyses of the Rating Instrument's Items     47
       Reduction of the Initial Battery of
         Predictor Variables                                50
       Addition of the Five New Predictor Variables
         to the Optimally Reduced Subset                    60
       Parallel Results Using Other Criteria                62
       Summary                                              65
  V. DISCUSSION AND CONCLUSIONS                             66
       Discussion of Results and Implications               66
       Conclusions and Recommendations                      77
APPENDIX                                                    81
BIBLIOGRAPHY                                                99
LIST OF TABLES

Table                                                     Page
 1. Data for Figure 1, QPR Cutoff Points for
      Graduating Seniors, 1950-1974                         13
 2. Data for Figure 2, Median QPR's for All
      Undergraduates, 1952-1974                             14
 3. Data for Figure 3, Average of All A-Through-F
      Grades, 1961-1974                                     15
 4. Mean Differences Between Quantitative and
      Verbal SAT Scores of Undergraduates with
      Different Majors                                      39
 5. Factor Analyses of the Eight Items on the
      University of Connecticut Rating Scale
      for Instruction                                       48
 6. Product-moment Correlations Among the Eight
      Items on the University of Connecticut Rating
      Scale for Instruction for 1972 and 1973               49
 7. Means and Standard Deviations of the
      Predictor and Criterion Variables                     51
 8. Product-moment Correlations Among the
      Predictor and Criterion Variables                     52
 9. First Stepwise Multiple Regression Analysis--
      Reduction of the Initial Battery of
      Predictor Variables                                   59
10. Second Stepwise Multiple Regression Analysis--
      Addition of the Five Grading Variables
      to the Optimally Reduced Subset                       61
LIST OF FIGURES

Figure                                                    Page
 1. Quintile QPR cutoff points for graduating
      seniors, 1950-1974                                    10
 2. Median QPR's of all undergraduates after
      the end of each semester, 1952-1974                   11
 3. Average of all A-through-F grades given,
      semester by semester, 1961-1974                       12
 4. The University of Connecticut Rating
      Scale for Instruction                                 32
 5. Key to symbols used in Figures 6-18                     85
 6. Merge 1                                                 86
 7. First missing-data input                                87
 8. Merge 2                                                 88
 9. Sort 1                                                  89
10. Sort 2                                                  90
11. Merge 3                                                 91
12. Second missing-data input                               92
13. Merge 4                                                 93
14. Merge 5                                                 94
15. Third missing-data input                                95
16. Merge 6                                                 96
17. Regression run 1                                        97
18. Regression run 2                                        98
CHAPTER I
STATEMENT AND DEFINITION OF THE PROBLEM
Statement of the Problem
Considerable recent controversy has centered around
faculty evaluation methods, especially student ratings of
college teachers (Centra, 1973; Doyle & Whitely, 1974; Frey,
1974; Zelby, 1974). According to several researchers
(Bausell & Magoon, 1972a; Capozza, 1973; Carrier, Howard, &
Miller, 1974; Centra, 1973; Costin, Greenough, & Menges,
1971; French-Lazovik, 1974; Menzie, 1973; Treffinger &
Feldhusen, 1970), the use of student ratings to evaluate the
faculty has become so widespread that it is now almost taken
for granted at institutions of higher learning. Controversy
rages, however, over the validity of such ratings, especially
concerning "the trend toward formal, quantitative use of the
results of the evaluations in determinations of faculty
promotions and salaries" (Zelby, 1974, p. 1267).
Faculty reactions to student evaluation of teaching
range from acclaim to outrage. Supporters point to the need
for evaluation and to the potential that ratings may have
for improving education. Opponents emphasize that students
may not be proper judges, or that rating instruments may not
ask students the right questions.
The whole issue is complex, but especially this
question of validity. Are student ratings a valid measure
of teacher effectiveness? The answer depends, of course, on
definitions of validity and of teacher effectiveness. Since
there are no universally accepted definitions of these,
progress toward the development of such measures is impeded.
Nevertheless, attempts at progress commonly are made with a
rather vague notion of what "effective teaching" means.
If factual learning alone were the objective of
effective teaching, one could validly measure various
teachers' effectiveness through achievement testing. Indeed,
many would support such measures as extremely valid, but
others would insist that the "real goal in teaching is to
impart philosophical values or to inculcate a special
attitude toward learning rather than to simply help the
student to master the subject matter" (Frey, 1974, p. 47).
Still others would be inclined to consider the promotion of
factual learning primary, but in combination with certain
affective factors. How effective, for example, is the
teacher in drawing students' attention and affection? Or
how are the students made to feel about the specific subject
matter they are learning? Does the teacher make it
interesting?
Whatever view one takes of effective teaching, the
major issue relevant to student ratings is still validity:
Are student ratings a valid measure of effective teaching?
Some would argue that student ratings are valid because the
student, as the consumer, is in the best position to view
the teaching and to experience the effects of the teaching
(see Centra, 1974; Costin, Greenough, & Menges, 1971; Frey,
1974; McKeachie, 1969).
Others would argue, for various reasons, that student
ratings are not valid as a measure of teaching effectiveness.
For example, LeComte (1974), Peck (1971), and St.Onge (1974)
express strong doubts that students are in any position to
judge the scholarship offered by the faculty. According to
Costin et al. (1971), student ratings are typically
challenged on the grounds "that student ratings are
unreliable, that the ratings will favor an entertainer over
the instructor who gets his material across effectively,
that ratings are highly correlated with expected grades (a
hard grader would thus get poor ratings) and that students
are not competent judges of instruction since long-term
benefits of a course may not be clear at the time it is
rated" (p. 511).
One of the most important points of contention is
whether or not students can be objective enough for their
ratings to be valid. Even though student ratings are
necessarily subjective opinions, they could still be valid
measures of effective teaching if students were able to
identify effective teaching (or at least some component or
components thereof), and if their ratings were unbiased by
irrelevant variables. Some believe, however, that students
may bias their ratings in favor of lenient teachers, or at
least in favor of the teachers in whose courses they get the
highest grades. If so, this would have important
implications for the interpretation of such ratings. It is
even likely that such ratings would be at odds with larger
educational objectives. For instance, "given a specific
format, it is possible to adapt one's teaching technique to
obtain a good or bad evaluation and . . . a good evaluation
may be associated with a teaching technique of lesser
educational value than a poor evaluation" (Zelby, 1974, p.
1268).
Therefore, one may pose the research question: What is
the influence of the grades students receive on their
ratings of the college teachers who gave them those grades?
Purpose of the Study
The purpose of this study is to provide evidence
bearing on the research question. Specifically, certain
characteristics of the grade distribution within each course
section are evaluated as predictors of the students' ratings
of the teacher of that course section. The following are
the grading variables (a computational sketch follows the list):
1. Average of the grades in the course section,
2. Standard deviation of the grades,
3. Variance of the grades,
4. Skewness of the grades, and
5. Kurtosis of the grades.
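As a concrete illustration (referenced above), the following is a minimal computational sketch of these five variables for a single course section, assuming grades are already coded numerically (A = 4 through F = 0). The moment conventions shown (population standard deviation, excess kurtosis) are assumptions; the study does not specify which conventions its programs used.

import numpy as np
from scipy import stats

def grading_variables(grades):
    """The five grading variables for one course section."""
    g = np.asarray(grades, dtype=float)
    return {
        "mean": g.mean(),
        "sd": g.std(),                  # population standard deviation (assumed)
        "variance": g.var(),
        "skewness": stats.skew(g),
        "kurtosis": stats.kurtosis(g),  # excess kurtosis (normal = 0; assumed)
    }

# Hypothetical section with grades A, B, B, C, C, C, D, F:
print(grading_variables([4, 3, 3, 2, 2, 2, 1, 0]))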
Stepwise multiple regression analysis is employed to
evaluate these grading variables for their effectiveness as
predictors, when used in combination with certain other
predictors of such faculty ratings. These other predictors
constitute an "optimally reduced subset" of the following
variables:
1. Sex,
2. Appointment length,
3. Percentage employed,
4. Class size,
5. Number of years tenured,
6. Age,
7. Years since hiring,
8. Course level,
9. Title level,
10. Course location,
11. Department quantitativeness, and
12. Rate (percentage) of return of ratings.
An "optimally reduced subset" is defined as the subset which
had the lowest standard error of estimate during a prior
stepwise multiple regression analysis.
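A simplified sketch of this criterion follows, assuming a predictor matrix X (course sections by predictors) and a criterion vector y. The study used a standard stepwise program whose exact entry and removal rules are not reproduced here; this sketch uses plain forward selection and keeps the step at which the standard error of estimate was lowest.

import numpy as np

def see(X, y):
    """Standard error of estimate for an OLS fit with an intercept."""
    n, k = X.shape
    Xi = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    resid = y - Xi @ beta
    return np.sqrt(resid @ resid / (n - k - 1))

def optimally_reduced_subset(X, y):
    """Forward selection; keep the subset whose model had the lowest SEE."""
    remaining = list(range(X.shape[1]))
    chosen, best_subset, best_see = [], [], np.inf
    while remaining:
        # At each step, enter the predictor that most reduces the SEE.
        j = min(remaining, key=lambda c: see(X[:, chosen + [c]], y))
        chosen.append(j)
        remaining.remove(j)
        s = see(X[:, chosen], y)
        if s < best_see:
            best_see, best_subset = s, list(chosen)
    return best_subset, best_see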
The hypothesis of this study is that the grading
variables are strongly related to student ratings of college
teachers, and that they will significantly improve the best
prediction of those ratings attained by the other available
predictors.
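The incremental test named in this hypothesis is the standard F test for a gain in the squared multiple correlation when a block of predictors is added. The sketch below is illustrative rather than the study's program; the sample values echo the figures reported in the abstract (R rising from .25 to .39 over the 2,360 course sections), and the degrees of freedom are assumptions chosen to match the reported F(4, 2345).

from scipy import stats

def incremental_f(r2_reduced, r2_full, n, k_full, q):
    """F test for the gain in R-squared when q predictors are added.

    n is the number of cases, k_full the number of predictors in the
    full model, and q the number of predictors in the added block.
    """
    df1, df2 = q, n - k_full - 1
    f = ((r2_full - r2_reduced) / df1) / ((1.0 - r2_full) / df2)
    p = stats.f.sf(f, df1, df2)  # upper-tail probability
    return f, p

# Rounded R's from the abstract, hence an approximate F:
f, p = incremental_f(r2_reduced=0.25**2, r2_full=0.39**2,
                     n=2360, k_full=14, q=4)
print(f"F(4, 2345) = {f:.1f}, p = {p:.1e}")  # roughly F = 62, p far below .001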
Need for the Study
The evaluation of teaching is one of the central
concerns of education. Effective teaching has been a
primary educational goal throughout history. As mentioned
above, one commonly used measure of effective teaching is
evaluation by the student. It would be hoped that student
ratings would be valid and that they would promote effective
teaching through feedback to teachers and administrators.
However, it is not certain that students are mature enough,
wise enough, or objective enough to evaluate such
instructional behavior without bias. Are student ratings
valid measures of effective-teaching? Or are they more or
less a "payoff" in a sort of game between teachers and
students, in which teachers reward (or punish) students with
grades, and students respond in kind with ratings?
The conflicting evidence to date has not satisfactorily
answered this question. The extent and the importance of
the relationship between grades and ratings have not been
determined. Neither have any causes of such a relationship
been positively identified, partially because the
relationship itself has not been firmly established.
Former evidence on this question has been not only
conflicting, but also inadequate in a number of ways. The
results of previous studies will be summarized in chapter
II, but most of these studies had one or more of the
following potential shortcomings:
1. A relatively small sample size;
2. Sampling of only one or a few departments;
3. Sampling of only graduate assistant level teachers;
4. Sampling of team-taught courses;
5. Sampling of only freshman-level courses (or some
other level);
6. Contamination by voluntarism (using only teachers
who volunteer to be rated);
7. Contamination by sensitization (a new system is
initiated, or a class is upset by a researcher suddenly
appearing on the scene);
8. Contamination by knowledge of the study (students,
faculty, or administrators know ahead of time that the
ratings will be used in an "experiment");
9. Contamination by lack of anonymity protection for
the students;
10. An indefinite treatment (students did not actually
know what their final grade in the course would be at the
time of the ratings); and
11. A lack of multivariate analysis (often only simple
correlations among a few variables are reported, whereas
multivariate techniques permit simultaneous examination of
the influence of many variables and provide evidence of the
"extent as well as the nature of any such influence" (Doyle &
Whitely, 1974, p. 260)).
This present study does not suffer from any of the
above potential shortcomings, and it should provide better
evidence bearing on the research question than that
previously found.
A further need for this study is institutional. While
a somewhat similar study was conducted at the University of
Connecticut ten years ago, no such study has been done
recently. Changes have occurred in the rating instrument
since that study of 10 years ago (Garber, 1964), and
improvements have been made in the administrative procedures
connected with the collection of ratings. There is reason
to believe the data are less subject to collection errors
now than formerly.
Furthermore, and perhaps more importantly, over the last
10 years, grades at the University of Connecticut have taken
a sharp turn upward (see Figures 1-3 below). Whatever the
relationship between grades and ratings may have been 10
years ago, that relationship well may have changed in light
of the changes in grading, or in light of certain other
changes related to the rise of student power in general.
The advent, in the fall of 1968, of both pass-fail options
and liberalized policies regarding the dropping of courses
may have influenced the relationship between grades and
ratings in some way. The appointment of students to
formerly all-faculty committees is another recent change
which could possibly bear upon the research question.
Whether or not the trend in grading has had any effect
on the relationship between grades and ratings, it is a
striking trend and worthy of attention. Figure 1 shows the
trend over the last 25 years of the quality point ratios
(QPR's) of graduating seniors. The QPR is the grade average,
computed as the sum of the quality points (number of credits
in a course times the grade in that course, where "A" = 4,
"B" = 3, "C" = 2, "D" = 1, and "F" = 0) divided by the total
number of credits. At the University of Connecticut, QPR's
are multiplied by 10 and range from 0 to 40, but for the
sake of clarity to readers outside of the University of
Connecticut, the range used throughout this study is the
more prevalent 0 to 4. Figure 1 demonstrates that there was
a great deal of consistency from 1950 until 1967 or 1968,
but that the trend has been upward ever since. Table 1
provides the data used in Figure 1.
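As an illustration of the QPR formula just described, the following minimal sketch computes a QPR from hypothetical (credits, letter grade) pairs on the 0-to-4 scale adopted in this study:

GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def qpr(courses):
    """Quality point ratio: sum(credits x grade points) / sum(credits)."""
    quality_points = sum(credits * GRADE_POINTS[grade]
                         for credits, grade in courses)
    total_credits = sum(credits for credits, _ in courses)
    return quality_points / total_credits

# A 3-credit A, a 4-credit B, and a 3-credit C:
print(qpr([(3, "A"), (4, "B"), (3, "C")]))  # (12 + 12 + 6) / 10 = 3.0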
Figure 2 supplies further evidence of the trend in
grading at the University of Connecticut, showing the median
QPR's of all undergraduates after each semester. Note that
each spring semester is represented by the year marks along
the x-axis of Figure 2, while each fall semester is at the
half-way point between year marks. The upward trend in
median QPR's indicated by Figure 2 started about 1964, which
is, as one would expect, three or four years before the
start of the upward trend in graduating seniors' QPR's as
shown in Figure 1. The data for Figure 2 are given in Table
2. Data for the spring semester of 1970 are missing because
of a student strike near the end of that semester, following
the U.S. bombing of Cambodia. In many courses that semester,
final exams were cancelled and grades of "S" for
[Figure 1. Quintile QPR cutoff points for graduating seniors, 1950-1974. Separate curves for the 80th, 60th, 40th, and 20th percentiles and for the minimum QPR for graduation; y-axis: QPR, 2.0 to 4.0; x-axis: years, 1950-1974.]
[Figure 2. Median QPR's of all undergraduates after the end of each semester, 1952-1974. y-axis: QPR, 2.0 to 4.0; x-axis: years, 1952-1974.]
[Figure 3. Average of all A-through-F grades given, semester by semester, 1961-1974. Two curves: 100's and 200's level courses only, and 100's, 200's, and 300's level courses; y-axis: QPR; x-axis: years, 1961-1974.]
Table 1

Data for Figure 1, QPR Cutoff Points for
Graduating Seniors, 1950-1974

Graduating          Percentiles:
Class of      80th     60th     40th     20th

1950          2.80     2.49     2.26     2.04
1951          2.83     2.49     2.24     2.02
1952          2.80     2.46     2.21     1.99
1953          2.85     2.52     2.28     2.06
1954          2.81     2.50     2.23     2.03
1955          2.81     2.44     2.20     2.00
1956          2.84     2.51     2.24     2.00
1957          2.84     2.49     2.23     2.00
1958          2.82     2.48     2.23     1.93
1959          2.83     2.48     2.21     2.00
1960          2.82     2.48     2.22     2.08
1961          2.81     2.48     2.24     2.03
1962          2.85     2.47     2.24     2.04
1963          2.83     2.49     2.25     2.04
1964          2.81     2.43     2.25     2.05
1965          2.83     2.50     2.26     2.04
1966          2.83     2.50     2.27     2.06
1967          2.88     2.55     2.30     2.03
1968          2.89     2.54     2.30     2.06
1969          2.94     2.59     2.36     2.10
1970          3.04     2.70     2.44     2.17
1971          3.11     2.79     2.52     2.25
1972          3.19     2.89     2.62     2.33
1973          3.30     2.99     2.72     2.43
1974          3.32     3.02     2.77     2.47

Note. The minimum QPR for graduation (0th percentile) was
1.80 every year.
Table 2

Data for Figure 2, Median QPR's for
All Undergraduates, 1952-1974

Year    Spring Semester    Fall Semester

1952         2.27               2.21
1953         2.28               2.19
1954         2.27               2.21
1955         2.29               2.24
1956         2.31               2.24
1957         2.30               2.24
1958         2.28               2.20
1959         2.23               2.18
1960         2.31               2.23
1961         2.29               2.27
1962         2.19               2.25
1963         2.18               2.28
1964         2.35               2.26
1965         2.37               2.35
1966         2.45               2.43
1967         2.48               2.45
1968         2.56               2.53
1969         2.57               2.58
1970          --                2.72
1971         2.87               2.82
1972         3.02               3.04
1973         3.03               2.88
1974         3.02                --

Note. Data for the spring semester of 1970 and the fall
semester of 1974 (current semester) are missing.
Table 3

Data for Figure 3, Average of
All A-Through-F Grades, 1961-1974

Year    Spring Semester    Fall Semester

1961          --                2.23
1962         2.29               2.26
1963         2.31               2.27
1964          --                2.26
1965         2.34               2.33
1966         2.42               2.40
1967         2.44               2.48
1968         2.56               2.55
1969         2.64               2.59
1970         3.18               2.73
1971         2.82               2.79
1972         2.89               2.86
1973         2.89               2.80
1974         2.92                --

Note. Data for the following semesters are missing: spring
of 1961, spring of 1964, and fall of 1974 (current
semester).
"Satisfactory" were given. Some courses did have finals,
and some teachers used regular grades with or without final
exam scores. However, the median QPR for all undergraduates
was not computed by the administration because of the
drastic effect of the strike on grades. That effect is
clearly indicated in Figure 3.
Figure 3 shows the trend in the average of the "A-
through-F" grades given each semester. Again, the upward
trend since 1964 is clearly indicated. The inclusion,
starting with the spring semester of 1967, of 300's
(graduate) level courses would account for some of the rise
in this curve, but it would have resulted in a one-time
jump. The constant upward trend can not be explained by the
inclusion of these courses, which was unavoidable because of
a change in administrative procedure. The data for Figure 3
are shown in Table 3.
There is no indication that the ability level of the
students at the University of Connecticut has followed (or
preceded) such a drastic trend as that indicated by Figures
1-3 (personal communications with the Admissions Office). A
study by Baird and Feister (1972) found that a large number
of faculties tended to give out the same distribution of
grades over the years 1964-1968, even regardless of changes
in the ability level of the students in some cases. The
grading trend found at the University of Connecticut,
however, is quite the opposite (higher grades without any
such increase in student ability levels). This trend
seems to be part of a nationwide trend toward leniency
occurring since the period of the Baird and Feister study.
Recent reports involving some 23 campuses across the nation
("Cal State," 1974; Ladas, 1974; "Too Many A's," 1974) have
supplied evidence that a "grade glut has been spreading
across academe" ("Too Many A's," 1974, p. 106) in the last
few years.
It is not certain whether teachers are "simply being
generous" or "are bribing students with good grades to get
good grades themselves" ("Too Many A's," 1974, p. 106). But
there is definitely an increased leniency on the part of
this faculty, and it is possible that the relationship
between grades and ratings is not the same as it was 10
years ago. Furthermore, it is likely that the upward trend
in grading has not been universal across departments (cf.
Ladas, 1974; Postman, 1974) nor across all teachers and that,
as a result, the influence of grades, if any, on ratings
would be more pronounced for some teachers and for some
departments than for others.
Thus there are multiple reasons for conducting this
study. The research question has not been satisfactorily
answered; previous studies have suffered from several
shortcomings; and changes in the rating instrument and
grading practices have occurred at this university (and
apparently nationally). The present research is an attempt
to provide better and more up-to-date knowledge concerning
faculty evaluation by students.
Sources of the Data
This study uses data from the spring semester of 1973
at the University of Connecticut. Some of the data were
obtained from admissions records or other branches of the
university administration and were collected prior to 1973.
The evaluation data were collected by the Bureau of
Institutional Research in the customary manner established
by that office and followed each spring. Detailed
descriptions of the variables investigated and the rating
instrument used appear in chapter III.
No one, including the students and the administrators
who gathered the data, knew that the data would be used in
this present study. Grade data were obtained from the
registrar's records. Since ratings are done anonymously,
it is not possible to match the individual students' ratings
with their grades. Nevertheless, a great deal of information
about the distribution of grades in each course section is
known and can be matched with a large amount of other data
about the course section and its teacher. Thus, the
separate course sections (N = 2,360) were the units of
analysis for this study. The loss of the ability to match
grades and ratings for individual students is offset by
certain compensations, such as the anonymity of the ratings
data (which is what precludes the matching) and the ability
to study data across an entire university.
CHAPTER II
SUMMARY OF RELATED RESEARCH
This chapter presents a summary of the theories and
research which form the background to this study. The
research related to the major issue of the validity of
student ratings is covered first, followed by a summary of
the research on several related issues.
Validity of Student Ratings
In 1971, Costin, Greenough, and Menges conducted a
review of the literature on student ratings of college
teachers. They stated that faculty members could judge the
validity of student ratings "to the extent students'
subjective criteria match the faculty members' goals in
teaching" (p. 513), and thus, a determination must be made
of "the basis on which students make their judgments" (p.
514). In their discussion of various possible bases for
student judgments, Costin et al. cited conflicting evidence
as to whether "students may judge instruction on the basis
of its 'entertainment' value rather than on its information,
contribution to learning, or long-term usefulness" (p. 517).
They also stated that students may make rating judgments
based on the grades received or expected in the course.
This is the issue upon which this present study is focused.
Costin et al. reviewed 28 studies of the relationship
between student ratings and grades. Of those, 15 found no
significant relationship, 1 found a small negative
correlation, and 12 found significant, but small, positive
correlations. Typically the cbrrelations did not exceed
.30, but they could be considered collectively as evidence
that the true relationship may be low and positive.
In an earlier study at the University of Connecticut,
Garber (1964) used student ratings (responses to the eight
separate items on the then current rating instrument) to
predict "difference scores" (the differences between the
students' grades and their "expected grades"). Current
grade point averages were used as the expected grades.
Garber obtained a multiple correlation of .43; correction
for shrinkage yielded an R of .39 (a highly significant
multiple R, p < .001). This outcome was for the multiple
regression analysis using teachers as units of analysis.
Using students as units of analysis, Garber obtained
similar results (R = .31; correction for shrinkage yielded
an R of .28, p < .001). He concluded that the two behaviors,
student ratings of their teachers, and the direction in
which the teachers tended to grade the students (higher or
lower than their current grade point "averages) were either
directly or indirectly dependent on each other.
Bausell and Magoon (1972a), using a much larger sample
than most other studies (N = 12,000 ratings, from which
random stratified samples were drawn), "found strong,
consistent biases in both instructor and course ratings
which can be traced to (a) the grade the student expects to
receive, and (b) the discrepancy between the student's
expected grade and his GPA" (p. 1021). They concluded that
student ratings of instructors are "axiomatically valid for
their designed purpose, but must be interpreted with
caution" (p. 1022) since the assignment of low grades may
be proper in many cases, but would result in lower ratings.
The implication is that administrative use of ratings for
pay, promotion, and tenure decisions may.require great
caution.
A study by Treffinger and Feldhusen (1970) used
characteristics of the students to predict student ratings
of their teachers. They concluded that student variables,
including grades received, were not significant predictors.
They found that generalized pre-course ratings were most
often the best predictor of end-of-course ratings, and
therefore they concluded that the "student's rating of the
course is clearly a complex interaction of his initial
feelings, certain cognitive and affective characteristics
of teachers and pupils, and instructor performance" (p. 622).
Treffinger and Feldhusen noted that "it should be the
instructor's performance or ability as a teacher and
students' reliable perceptions and evaluations of the
performance which constitute the majority of the variance
in instructor ratings" (p. 622). They suggested that the
difference between pre- and post-course ratings might be
more useful than post-course ratings. But other research
(especially Holmes, 1972) indicates that such a difference
may be very heavily affected by changes in grades from
initial expected grades to end-of-course expected or actual
grades.
Bausell and Magoon (1972b) found that "rating changes
did occur as a function of changes in grade expectancy"
(p. 10). Holmes (1972) conducted an experiment in which
expected grades were "disconfirmed." "Half of the students
who deserved and expected A's or B's were given their
expected grades, while half were given a grade one step
lower than expected" (p. 130). Correct grades were given
after ratings were collected. He concluded that "if
students' grades disconfirm their expectancies, the students
will tend to deprecate the instructor's teaching performance
in areas other than his grading system" (p. 130, emphasis
added). These results support a theory that students are
not objective; that they use ratings as a "payoff;" and that
ratings might therefore be invalid.
Many studies have investigated the relationship between
ratings and achievement rather than grades. Such studies
are, of course, directly related to the validity issue since
it is obvious that high achievement constitutes a primary
goal of effective teaching. If students rate highest the
teachers from whom they learn the most, then student ratings
would have a certain validity. Furthermore, such studies
often provide important insights into the relationship
between ratings and grades insofar as achievement and grades
are closely interrelated.
One such study (Rodin and Rodin, 1972) stirred a great
deal of interest in the relationship between ratings and
grades. The authors found a significant negative
correlation between amount learned and ratings of the
teachers (teaching assistants in one large undergraduate
calculus course with 12 sections). Controversy was stirred
over their conclusion that "perhaps students resent
instructors who force them to work too hard and to learn
more than they wish" (p. 1166).
Another investigator (Capozza, 1973) agreed. He
concluded that "the hypothesis that students give good
ratings to classes in which they learn a great deal must be
rejected. Emotional factors such as grades and perhaps a
distaste for the hardships of learning seem to bias the
results in the opposite direction" (p. 127). Though his
sample (250 students in eight course sections) was larger
than many, Capozza himself considered it small.
Gessner (1973) found results opposite those of the
Rodins and Capozza. He criticized the Rodin and Rodin study
for its use of teaching assistants, whose instructional role
was apparently an ancillary one. Gessner's own data,
however, were based on only one course (N = 78), and it was
team-taught. There can be no generalization, then, across
courses or instructors.
A study by Potter, Nalini, and Lewandowski (1973)
employed a better research design than others (with students
and teachers randomly assigned to classes). Unfortunately,
however, teacher trainees were used. They found a modest,
but significant, correlation between ratings and achievement
when the effect of aptitude was partialled out. Although
their sample was small (N = 254 students in 24 course
sections), they observed that the magnitude of the
correlation between achievement and ratings seemed to be
lower when the correlation between achievement and aptitude
was higher, and vice versa. They concluded: "the stronger
the relation between aptitude and achievement, the lesser the
relation between achievement and rating" (p. 2). A
corollary of this conclusion might be that ratings may
suffer from a grading bias in direct proportion to how
arbitrary the grading is (cf. Holmes, 1972).
There are two conflicting explanations often proposed
for a negative correlation between achievement and ratings:
(a) better learners are more critical of teachers or (b)
there is resentment of the hard work teachers force students
to do in order to achieve. Rodin (1974), however, indicated
that we may not need either of these theoretical
explanations, since the true relationship well may be
positive. She reviewed several studies and found that most
of the correlations between ratings and amount learned
tended "to lie in the range r = .20 to r = .30 . . . which
indicates that amount learned accounts for only about 4 to 9
percent of the variance in student ratings" (p. 5). Rodin
concluded that student ratings are not indicative of amount
learned. She compared student ratings of teaching to
consumer ratings of the palatability of peanut butter, and
argued that, just because the consumer ratings may be poorly
related to laboratory ratings of nutritiveness, the validity
of palatability ratings is not thereby impugned (unless one
expects the palatability ratings to tell us all about peanut
butter). Rodin implied, therefore, that student ratings may
be a valid measure of one component of effective teaching,
and that grades could be one important influence on that
component.
Related Issues
Several issues, which have been hinted at above, are
directly related to validity. One is the possible presence
of a "halo-effect." That is, are global impressions at
work, masking the separate traits of effective teaching?
Do the individual items on rating instruments provide useful
information, or do they form large clusters of traits? Such
clusters of traits could be valid measures of teaching
effectiveness even if the separate items or traits were not.
Strong halo effects were noted by Royce (1960), Garber
(1964), Potter, Nalini, and Lewandowski (1973), and Widlak,
McDaniel, and Feldhusen (1973). Hoyt (1969) found a halo
effect in student ratings of courses and concluded that it
interfered with students' ability to discriminate fairly
among the traits of the course covered by the rating
instrument. Costin, Greenough, and Menges (1971), however,
cited several studies in support of a theory that ratings
on multiple attributes of instruction are low in halo
effect.
Widlak et al. reviewed a representative sample of 22
studies over the past 30 years. They concluded that, while
most rating instruments can be factored into at least two
components, there is likely to be a halo effect "so strong
. . . that the specific item ratings may have little
diagnostic value in assessing a teacher's strengths and
weaknesses and, ultimately, low potential for improving
teaching" (p. 10). Widlak et al. found three factors
throughout the studies they reviewed. They called these the
"Actor," "Interactor," and "Director" factors (referring to
roles of teaching). These might also be called
"Performance," "Rapport with students," and "Course
structure/difficulty" factors.
Another issue, somewhat related to the halo-effect
issue, is the persistence of first impressions or
preconceptions. The findings of Bausell and Magoon (1972b)
suggest that first impressions and preconceptions are
persistent. They found that ratings of teachers after the
first day of classes were very durable, and that they were
highly correlated (r = .67, n = 20 courses) with ratings
taken at the end of the course.
As mentioned above, Treffinger and Feldhusen (1970)
found that generalized pre-course ratings were most often
the best predictor of end-of-course ratings. Furthermore,
Parent (1971) suggests that ratings should be taken early
in the course to provide feedback to the teachers in time
to modify the teaching performance before the courses were
finished and "wasted." If early attitudes are persistent,
then end-of-course ratings would not improve on early ones, and Parent
suggests that end-of-course ratings should be eliminated.
If ratings were always done very early in the course, some
of the grading bias (if there were one) might be eliminated.
Some bias might remain, however, since expected grades
could still have an influence, as could grades on various
quizzes and papers.
On the one hand, the durability of first impressions
might indicate that student ratings are invalidated,
because they are not measures of teaching effectiveness
over the entire course. On the other hand, it may be that,
whatever student ratings are measuring, it simply does not
take very long to evaluate it (especially, perhaps, the
Actor factor). In the case of preconceptions, it could be
that there is accurate foreknowledge about the course or
instructor from word-of-mouth advice from other students,
or from prior experiences with the instructor himself or
with similar courses taught by others.
Costin et al. suggested that student ratings might be
shown to be valid if there were a high correlation between
student and peer ratings (agreement between different
judges). The results of the Bausell and Magoon (1972b)
study suggest that ratings by outside observers, including
other faculty members, would tend to correlate highly with
student ratings, even though the outside observers might
view only a few class sessions. Toug and Feldhusen (1973)
supported this view, as do the results of four studies
(r's from .30 to .63) cited by Costin et al. (1971). Centra
(1974), however, found that while there is agreement between
student and peer ratings of teachers, the peer ratings were
"less reliable" (low interjudge agreement among the peers).
He concluded that both student and peer ratings should be
used, but to measure different traits. Costin et al.
cautioned that peer ratings may be influenced by student
ratings through hearsay (see also Jaeger & Freijo, 1974).
It is possible, however, that a grading bias is responsible
for the difference between student and peer ratings, or at
least some portion of the difference. One implication is
that peer ratings might be used to partial out the grading
bias if one exists.
Still another issue relates to the definition and
measurement of effective teaching. It is the issue over
whether or not the good researcher is likely to be a good
teacher. Stallings and Singhal (1969) found a significant
correlation between teacher evaluations and research output
(measured by weighted combinations of the number of books,
articles, etc.). They concluded, as many others have, that
productive researchers are usually very good teachers, but
that students often have the stereotype that very active
researchers neglect their teaching. If pervasive, such a
stereotype would tend to lower the validity of student
ratings of researchers, since the ratings might be based on
a misconception. McDaniel and Feldhusen (1970) found that
students did bias their ratings against heavy researchers
or prolific writers; they theorized that the students may
be correct -- that researchers do neglect their teaching.
Many studies (notably Aleamoni, 1974; Bendig, 1952;
and Elmore and LaPointe, 1974) have examined relationships
between faculty ratings and certain characteristics of the
teachers or courses. There was, according to them, little
consensus about the characteristics of the "best" or "worst"
teachers or classes. At the University of Connecticut,
however, administrative findings over the past few years
indicate there are definite differences in ratings across
several variables. Historically, students at this
university rate male teachers slightly higher than female
teachers overall, but not on every item of the scale. In
general, the smaller the class size, the higher the ratings
of the instructor have been. Similarly, classes at the
branches have been rated higher than courses at the main
campus at Storrs, Connecticut. Officials have noted,
however, that this last difference ma:- be due to ,the fact
that the largest course sections are taught at Storrs.
Historical evidence also indicates that al_ the
University of Connecticut, the more advanced the course,
the higher the ratings. Tenured faculty have received
better ratings than non-tenured teachers. With respect to
age, a teacher's ratings seem to peak during the 31-to-40
age period, with ratings rising steadily to that point and
then declining only slightly and very slowly afterwards.
Thus, variables such as age, sex, tenure status, and class
size might be good predictors of student ratings.
Summary
A careful scrutiny of the literature reveals that very
few multivariate studies have been conducted in the area of
this problem, and that several criticisms (mentioned in
chapter I) may be leveled at most of the simpler, but more
numerous, correlational studies. Furthermore, there is
disagreement among the researchers, and directly conflicting
results from their studies have been confounded by the use
of many different research methods, different rating
instruments, and different sample characteristics (including
different units of analysis). In sum, the research question
has not been answered satisfactorily, though several
interesting concepts and theories have emerged which could
possibly prove helpful, once better evidence is obtained.
CHAPTER III
PROCEDURE
This chapter describes the subjects, the rating
instrument, the predictor variables, the criterion variables,
and the statistical methods used in this study. Appearing
in the appendix are macro flowcharts of the computer
programming steps required to assemble the data for analysis.
Subjects
The "subjects" in this study should be considered to
be the 2,360 course sections for which complete data could
be assembled. Since only 89 course sections were deleted
for missing data, this study covered 96.37% of the entire
population of 2,449 course sections evaluated following
the spring semester of 1973 at the University of Connecticut.
Over 30,000 rating forms were returned, representing about
55% of the students who were sent them.
Instrument
The University of Connecticut Rating Scale for
Instruction (UCRSI; Bureau of Institutional Research,
University of Connecticut, 1971) is presented in Figure 4.
[Figure 4. The University of Connecticut Rating Scale for Instruction (Form No. AA 118, 11/71). The machine-scored form, reproduced twice per sheet, carries spaces for the instructor, course, department, section, enrollment, and faculty employee number, and eight items rated on a 10-point scale between bipolar anchors:
1. Knowledge of Subject: "Knowledge of field inadequate" to "Highly expert"
2. Presentation of Material: "Very hard to follow" to "Makes subject very clear"
3. Balance of Breadth & Detail: "Gets bogged down in trivia" to "Good balance of breadth & detail"
4. Enthusiasm for Subject: "Seems irksome to him" to "Displays great enthusiasm"
5. Fairness in Marking: "Partial and prejudiced" to "Very fair and impartial"
6. Attitude Toward Student: "Unsympathetic and intolerant" to "Sympathetic and understanding"
7. Personal Mannerisms: "Distracting and irritating" to "Completely acceptable"
8. Over-All Summary As A Teacher: "Unsatisfactory" to "Outstanding"
The form closes with the instruction: "THIS RATING IS TO BE ENTIRELY IMPERSONAL. DO NOT MAKE ANY MARKS ON THIS PAGE WHICH COULD SERVE TO IDENTIFY YOU."]
The UCRSI is the instrument currently used at the university
in a program of faculty evaluation that had its origin in
the late 1940's. It started with a University Senate
concern with the quality of teaching. It was felt that
there might be an overemphasis on the "publish or perish"
ethic.
At first, the university evaluated only certain "target"
teachers (those for whom the administration requested
information, usually for career decisions). But by the late
1960's, the process had grown into one that covered every
faculty member teaching a "ratable" course section. The
nonratable course sections are those deemed not ratable in
the normal way, such as seminars, independent study courses,
field work, practice teaching, team-taught courses, graduate
assistant-taught courses, and others. The decision as to
whether or not a specific course section is ratable is made
by the individual department chairmen each year.
After students have received their grades for the
spring semester, and after they have moved to their summer
addresses, the rating forms are mailed to the students with
their teachers' names and their course sections already
filled in (see Figure 4). The university pays for postage
both ways, and the anonymity of the students is guaranteed
since students' names or identification numbers appear
nowhere on returned forms. The rate of return of ratings
has remained around 55% over the recent years.
Ratings are done only after each spring semester. This
is done for two reasons: (1) so that students will fill out
rating forms at their homes in the summer away from the
influences of their fellow students, and (2) especially so
that the cost of evaluation will not be doubled (by
repeating the evaluations each semester).
Results are machine tabulated and sent both to the
individual teachers and to their department chairmen.
Cumulative average scores (cumulative since 1969) accompany
each teacher's ratings for the recently ended course
sections. Since the Bureau of Institutional Research edits
each of the incoming rating forms by hand (including
checking for the clarity of markings), and since invalid
forms are removed, high accuracy is claimed for the ratings
data.
The ratings data are provided to promotion and tenure
committees by the department chairmen, along with other
pertinent information. There has been some concern at the
university-that someone might attribute significance to very
small differences in ratings, when in fact, only extreme
differences in ratings have any practical meaning. There
is also some concern about the validity of the UCRSI. There
is no certainty as to what it measures, although university
officials have noted that the ratings seem to have a
moderately high "reliability." That is, large samples of
student ratings show a high degree of agreement with each
other. What the students are agreeing about, however, is
not absolutely clear.
Initial Predictor Variables
The initial battery of predictors of student ratings
of college teachers included twelve variables. Presented
here are brief descriptions of each of these predictors and
of how each was derived.
Sex. The sex of each teacher was obtained from his or
her professional history, filed with the administration.
No cases were lost for missing data for this variable. The
characters M and F, for male and female, were converted to
values 1 and 2 respectively before computation began.
Appointment length. The number of months of each
teacher's appointment was also obtained from the
professional history. In most cases, teachers are appointed
each year for a term of 9 months. A few have 10-month
appointments, and several are for 11 months. The values 9,
10, and 11 were used for this variable, and no cases were
lost for missing data.
Percentage employed. Most teachers at the University
of Connecticut are employed full-time, regardless of their
appointment length. Several are part-time, however, and
the percentages vary. Values from 1 to 100, representing
percentages of full-time employment, were obtained from the
professional history, and no cases were lost for missing
data.
Class size. The number of students enrolled in each
course section at the end of the semester was obtained from
the grade distribution data. It equals the total number of
all grades given, including incompletes, etc. No cases were
lost for missing data for this variable.
Number of years tenured. The year that a teacher was
tenured (if he was) was obtained from the professional file.
The number of years since being tenured was computed (zero
if not tenured) by subtracting the tenure year from 1973.
No data were missing.
Age. The birthdate of each teacher was obtained from
the professional history file, and each teacher's
chronological age in completed years was computed as of
May 8, 1973 (the end of the spring semester).
Years employed at the University of Connecticut. The
hiring date of each teacher was also obtained from the
professional history. The number of years of continuous
service at the university was computed as of May 8, 1973
(the end of the spring semester) for each case. None of the
cases were lost on account of missing data for this variable.
Course level. Course numbers were obtained from the
rating responses records. The level of the course was then
computed by dividing the course number by 100 and dropping
the digits to the right of the decimal point. The resulting
values (0, 1, 2, 3, and 4) represent courses with numbers
1-99, 100-199, 200-299, 300-399, and 400-499 respectively.
No cases were lost for missing data for this variable.
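In code, this derivation amounts to integer division (a one-line sketch, not the study's actual merging program, which is flowcharted in the appendix):

def course_level(course_number: int) -> int:
    # Divide by 100 and drop the fraction: course 243 -> level 2.
    return course_number // 100

assert course_level(243) == 2  # numbers 200-299 map to level 2
assert course_level(95) == 0   # numbers 1-99 map to level 0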
Title level. Each teacher's classification code was
obtained from the professional file. A list of the
classification codes belonging to each title level
(Instructor, Assistant Professor, Associate Professor, and
Professor) was created with the aid of personnel from the
Bureau of Institutional Research for use in one of the
merging computer programs. Title levels (values 1, 2, 3,
and 4 for Instructor, Assistant Professor, Associate
Professor, and Professor respectively) were assigned to
each teacher according to his or her classification code.
Two course sections were lost since one teacher's
classification code could not be correctly grouped with any
one level.
Course location. A branch code was obtained for each
course section from the rating responses data. A value of
1 was assigned to the course location if the course was
taught at any of the branches (Hartford, Stamford,
Southeastern, or Hartford M.B.A.), and a value of 2 was
assigned to course sections taught at the Storrs main
campus. No cases were lost for missing data for this
variable.
Department quantitativeness. Student records for
juniors and seniors (majors are not declared earlier)
included each student's major department and his two
Scholastic Aptitude Test scores (Verbal and Quantitative).
For each department in which students declared a major, the
average quantitativeness (DQ_j) was computed as the average
difference between the Quantitative and Verbal SAT scores
of the students majoring in that department (j). The
formula used to compute the average quantitativeness for
each department was the following:

DQ_j = [Σ_i (SATQ_ij - SATV_ij)] / n_j        (1)
where, in the jth department, SATQ_ij is the ith student's
Quantitative SAT score, SATV_ij is the ith student's Verbal
SAT score, and n_j is the number of students majoring in that
department. The results of these computations are presented
in Table 4. Since some departments had very few students
as majors, the quantitativeness figures for those
departments should be interpreted cautiously. With small
n's, those departments may have radically different
quantitativeness figures from one year to the next.
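A minimal sketch of this computation in Python, assuming
student records are available as (department, SATQ, SATV)
tuples; the names are illustrative, not those of the study's
merge programs:

    from collections import defaultdict

    def dept_quantitativeness(records):
        """Average SATQ - SATV difference per department (equation 1)."""
        diffs = defaultdict(list)
        for dept, satq, satv in records:
            diffs[dept].append(satq - satv)
        return {dept: sum(d) / len(d) for dept, d in diffs.items()}

    records = [("Statistics", 700, 520), ("Statistics", 680, 530),
               ("English", 550, 610)]
    print(dept_quantitativeness(records))
    # {'Statistics': 165.0, 'English': -60.0}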
Note also that departments high in quantitativeness
have high positive values (maximum of 169.5), and
departments low in quantitativeness have low or negative
values (minimum of -135.0). Being very high or very low
(or anywhere between, for that matter) on this scale of
quantitativeness does not signify anything about the quality
of the department or its average SAT scores. One could
argue either way that it is better to be highly "verbal"
or highly "quantitative." Yet for suggesting a profile,
the scale appears to be very meaningful; it has high face
validity.
There are 10 departments in which no undergraduate
students may major (Aerospace R.O.T.C., Biobehavioral
Science, (general) Engineering, Interdepartmental,
Journalism, Linguistics, Metallurgy, Polish, Portuguese, and
Science). The 56 course sections that were taught in these
10 departments were excluded from this study because of
missing data for department quantitativeness.
Table 4
Mean Differences Between Quantitative and Verbal SAT Scores
of Undergraduates with Different Majors

Department Name                           Mean      N     S.D.
Statistics                               169.50      4    49.10
Civil Engineering                        119.99    101    73.10
Mathematics                              106.04    144    87.04
Chemical Engineering                     105.42     26    80.99
Mechanical Engineering                    95.89     35    71.18
Electrical Engineering                    94.64     78    96.26
Accounting                                82.23    140    81.77
Pharmacy                                  80.93     45    84.94
Finance                                   71.60    102    84.64
Business                                  68.06    190    87.95
Physics                                   64.68    190    70.33
Italian                                   64.50      2    46.50
Geography                                 62.25     12    70.35
Physical Education                        61.47     87    72.46
Geology                                   60.83     24    87.42
Chemistry                                 59.00     43    82.62
Agricultural Engineering                  58.50      8    84.80
Industrial Administration                 57.55     60    86.65
Agricultural Economics                    56.33      3    42.87
Marketing                                 52.44     94    81.47
Pre-Veterinary                            49.00     13    89.70
Nutritional Science                       43.00      7    76.93
Biology                                   41.86    345    92.16
Horticulture                              41.07     99    82.83
Animal Industries                         39.98     49    85.01
Economics                                 31.99    101    77.50
Agriculture and Natural Resources         29.00      1     0
Spanish                                   22.82     38    94.36
Physical Therapy                          19.25    181    90.33
Political Science                         18.85    189    84.80
Foods & Nutrition                         17.86     22    69.38
Family Economics                          17.57      7    76.56
German                                    16.57     14   119.31
Clothing, Textiles, & Interior Design     15.79     62    97.21
Speech                                    14.66    110    88.50
Child Development & Family Relations      14.10    174    81.78
Psychology                                13.49    321    93.86
History                                   13.37    188    94.02
Music                                     10.56     45    93.50
Medical Technology                        10.29     35    84.86
Sociology                                 10.16    225    89.27
Education                                  9.60    354    81.45
Art                                        6.11     82    77.10
Philosophy                                -3.46     33    79.87
Nursing                                   -4.24    193    76.47
Anthropology                              -5.30     57    86.03
French                                   -17.50     47    73.60
Dramatic Arts                            -23.02     47   104.45
Russian                                  -29.71      7   127.18
English                                  -33.09    401    88.84
Classics                                -135.00      1     0
Rate of return of ratings. For each course section,
the percentage of rating forms returned by the students
was computed. The number of ratings returned (obtained
from the ratings data) was divided by the enrollment in
the class (see class size). Values for rates of return
were permitted to range from .001 to 1.000, and no cases
were dropped on account of missing data.
Grading Variables
Average grade. The number of each type of grade given
in each course section was obtained from the grade
distribution data (from the registrar's records). Grades
such as "I" for Incomplete and "P" for Pass were ignored,
and the average of the "A-through-F" grades was computed
for each course section (values permitted from 0.00 to
4.00). There were 27 cases deleted on account of missing
data for this variable. In these 27 course sections, no
grades were given in the "A-through-F" range. Note that
these course sections differ from those in which all "F's"
were given (resulting in an average grade of 0.00), and
thus the 27 cases could not logically be included in the
study with any particular value for an average grade.
Standard deviation of grades, Variance of grades,
Skewness of grades, and Kurtosis of grades. These
characteristics of the distribution of the "A-through-F"
grades in each course section were computed at the same
time as the average grade, using the registrar's grade
distribution data. No further cases (beyond the 27 cases
lost for average grade missing data) were lost for missing
data for these variables.
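For illustration, all five grading variables can be computed
from a section's grade counts as in the Python sketch below.
The study does not print its exact moment formulas, so the
conventional moment-ratio definitions of skewness and kurtosis
are assumed here; all names are hypothetical.

    GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

    def grading_variables(counts):
        """counts: dict of A-through-F grade counts (I's and P's excluded)."""
        n = sum(counts.values())
        if n == 0:
            return None   # like the 27 sections dropped for missing data
        vals = [GRADE_POINTS[g] for g, c in counts.items() for _ in range(c)]
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n
        sd = var ** 0.5
        skew = sum((v - mean) ** 3 for v in vals) / (n * sd ** 3) if sd else 0.0
        kurt = sum((v - mean) ** 4 for v in vals) / (n * var ** 2) if var else 0.0
        return {"average": mean, "sd": sd, "variance": var,
                "skewness": skew, "kurtosis": kurt}

    print(grading_variables({"A": 10, "B": 12, "C": 5, "D": 2, "F": 1}))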
Criterion Variables
Ten potential criterion variables were computed for
each course section using the rating responses data. The
process of selecting the criteria used in the stepwise
multiple regression analyses is explained in the next
section and in the next chapter. Descriptions of the 10
potential criteria and how they were derived are presented
here.
Items 1-8. The average rating on each of the eight
items on the University of Connecticut Rating Scale for
Instruction (UCRSI, see Figure 4) was computed for each
course section. Four course sections (in which no one
answered Item 8) had to be deleted for missing data.
Average of Items 1 through 8. The average of the
responses to all of the items on the UCRSI was computed for
each course section using the ratings data. No cases were
dropped on account of missing data for this variable (none
beyond the four dropped for Item 8 missing data).
Average of Items 1 through 7. The average of the first
seven items on the UCRSI was also computed from the rating
responses data, and there were no missing data.
Statistical Analyses
In order to determine the number and nature of
components or dimensions underlying the,rating instrument
items, two principal components factor analyses (of 1972 and
1973 ratings data) were conducted Prior to the main effort
of this study. The results'of these (details presented in
chapter IV) were instrumental when it came to making
decisions about the chOice cf a criterion variable.
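The extraction itself can be sketched in a few lines of modern
code; the following is only an illustration of a principal
components analysis, not the study's own program, assuming
`ratings` is an array of course-section averages on the eight
items:

    import numpy as np

    def first_component(ratings):
        """First principal component of the 8 x 8 item correlation matrix."""
        corr = np.corrcoef(ratings, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(corr)      # ascending eigenvalues
        loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
        pct = 100.0 * eigvals[-1] / corr.shape[0]    # percentage of variance
        return eigvals[-1], loadings, pct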
One stepwise multiple regression analysis was employed
to reduce the initial battery of 12 predictor variables
(predicting the average of Items 1-8 on the UCRSI) to an
"optimally reduced subset" (see chapter I). The results of
this regression analysis are presented in chapter IV and
discussed in chapter V. As far as the procedure is
concerned, this regression analysis provided a multiple
correlation and an ordered, optimally reduced subset of
predictors, both of which were needed as input to following
steps.
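In outline, forward selection of this kind enters, at each
step, the predictor giving the largest gain in R². The Python
sketch below is illustrative only (the study used the SPSS
subprogram, not this code):

    import numpy as np

    def r_squared(X, y):
        """R² of an ordinary least-squares fit with an intercept."""
        X1 = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

    def forward_stepwise(X, y):
        """Enter predictors one at a time in order of R² improvement."""
        entered, steps = [], []
        while len(entered) < X.shape[1]:
            best = max((j for j in range(X.shape[1]) if j not in entered),
                       key=lambda j: r_squared(X[:, entered + [j]], y))
            entered.append(best)
            steps.append((best, r_squared(X[:, entered], y)))
        return steps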
A second stepwise multiple regression analysis was
performed using the optimally reduced subset found in the
first regression analysis and the five grading variables
(see chapter I) to predict the same criterion (average of
Items 1-8). First, the variables in the optimally reduced
subset entered this regression equation in the same order
that they entered the prior regression equation (i.e.,
ordered). Then, the grading variables were allowed to enter
the regression equation in the order of their ability to
improve the equation (i.e., floating). This served to test
the incremental importance of the grading variables as
predictors of ratings, given that certain other variables
were already in the regression equation. Thus, the
variables in the optimally reduced subset, in effect,
"partialled out" a chunk of the variance before the grading
variables were even considered.
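This two-stage entry can be sketched by extending the previous
fragment (reusing r_squared from above; illustrative only): the
optimally reduced subset is forced in first, in its fixed
order, and the grading variables then float in by greatest
improvement.

    def forced_then_floating(X, y, forced, floating):
        """Fixed-order entry of `forced`, then greedy entry of `floating`."""
        entered, steps = [], []
        for j in forced:                              # ordered entry
            entered.append(j)
            steps.append((j, r_squared(X[:, entered], y)))
        remaining = list(floating)
        while remaining:                              # floating entry
            best = max(remaining,
                       key=lambda j: r_squared(X[:, entered + [j]], y))
            remaining.remove(best)
            entered.append(best)
            steps.append((best, r_squared(X[:, entered], y)))
        return steps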
The results of the second regression analysis also
determined the relative importance of the grading variables
and the overall ability of the available predictor variables
to predict ratings. These results are also presented in
chapter IV and discussed in chapter V.
Cross-validation of the multiple regression analyses
was simulated through an examination of the estimated amount
of shrinkage in the multiple correlations using McNemar's
(1962, p. 184) shrinkage formula:

R' = {1 - (1 - R²)[(N - 1) / (N - n)]}^½        (2)

where R' is the multiple correlation after shrinkage, N is
the sample size, n is the number of predictor variables, and
R² is the multiple correlation squared.
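Equation 2 transcribes directly into code; a minimal sketch:

    def shrunken_r(r, n_obs, n_pred):
        """McNemar's shrinkage correction (equation 2)."""
        return (1.0 - (1.0 - r ** 2) * (n_obs - 1) / (n_obs - n_pred)) ** 0.5

    # With the figures reported in chapter IV (R = .25, N = 2,360, n = 10):
    print(round(shrunken_r(0.25, 2360, 10), 2))   # 0.24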
In order to test the significance of the increase in
the multiple correlation when the grading variables were
added to the optimal subset of other variables, an F-test
of significance was performed using another of McNemar's
(1962, p. 284) formulae:

F = [(R1² - R2²) / (m1 - m2)] / [(1 - R1²) / (N - m1 - 1)]        (3)

where R1² is the larger multiple correlation squared, R2²
is the smaller multiple correlation squared, N is the sample
size, m1 is the number of predictor variables associated
with R1, and m2 is the number of predictor variables
associated with R2, with degrees of freedom m1 - m2 and
N - m1 - 1.
The significance of the multiple correlations by
themselves was determined with F-tests of significance using
yet another of McNemar's (1962, p. 283) formulae:

F = (R² / m) / [(1 - R²) / (N - m - 1)]        (4)

where R² is the multiple correlation squared, m is the
number of predictor variables included in the multiple
correlation, and N is the sample size, with degrees of
freedom m and N - m - 1.
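Both F-tests also transcribe directly; an illustrative sketch:

    def f_increase(r1_sq, r2_sq, m1, m2, n_obs):
        """Equation 3: F for the increase in R², df (m1 - m2, n_obs - m1 - 1)."""
        return ((r1_sq - r2_sq) / (m1 - m2)) / ((1.0 - r1_sq) / (n_obs - m1 - 1))

    def f_multiple(r_sq, m, n_obs):
        """Equation 4: F for a multiple correlation, df (m, n_obs - m - 1)."""
        return (r_sq / m) / ((1.0 - r_sq) / (n_obs - m - 1))

    # The shrunken R's of .38 and .24 from chapter IV give roughly the
    # reported F (4, 2345) = 60.13 (exact agreement needs unrounded R's):
    print(round(f_increase(0.38 ** 2, 0.24 ** 2, 14, 10, 2360), 1))   # about 59.5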
The significance of the individual simple correlations
was determined using a significance table for correlation
coefficients.
Summary
Student ratings of college teachers at the University
of Connecticut during the spring 1973 semester were studied
to determine whether or not the addition of five new
predictor variables dealing with grades could significantly
improve an optimal set of predictors reduced from an initial
battery of predictors. Factor analysis was used to reduce
the eight-item rating instrument to a single criterion
variable. Stepwise multiple regression analysis was used,
both to reduce the initial battery of predictors to an
optimally reduced subset, and to test the incremental
importance of the grading variables as predictors of
average ratings.
CHAPTER IV
RESULTS
This chapter presents the results of the statistical
analyses of this study. The procedures used to obtain
these results are explained in chapter III, and a discussion
of these results is provided in chapter V.
Factor Analyses of the Rating Instrument's Items
Principal components factor analyses of two separate
years' ratings (1972 and 1973) yielded nearly identical
results. These results, presented in Table 5, show that
all eight items on the University of Connecticut Rating
Scale for Instruction (UCRSI) loaded heavily (.778 or
greater) on a single factor. Item 8, "Over-All Summary as
a Teacher," had a factor loading of over .97 both times.
Furthermore, the correlations among the eight items,
presented in Table 6, were all +.52 or greater and highly
significant (p << .000001). It should be noted that these
correlations are among course section averages.
The results of these preliminary factor analyses
indicated that there was essentially one global dimension
underlying the ratings data, and that a single criterion
variable representing that factor could be employed in the
subsequent stepwise multiple regression analyses.
Table 5
Factor Analyses of the Eight Items on the
University of Connecticut Rating Scale for Instruction

                                     1972        1973
                                   Factor 1    Factor 1
N                                    2417        2465
Eigenvalue                           6.030       6.011
Percentage of Variance              75.372      75.139

Items:                                  Factor Loadings:
1. Knowledge of Subject              .778        .783
2. Presentation of Material          .889        .895
3. Balance of Breadth & Detail       .887        .902
4. Enthusiasm for Subject            .852        .832
5. Fairness in Marking               .836        .824
6. Attitude Toward Student           .853        .852
7. Personal Mannerisms               .867        .863
8. Over-All Summary as a Teacher     .970        .971
Table 6
Product-moment Correlations Among the Eight Items
on the University of Connecticut
Rating Scale for Instruction for 1972 and 1973

Items:                    1      2      3      4      5      6      7
1. Knowledge of
   Subject
2. Presentation of      .66
   Material            (.66)
3. Balance of           .69    .88
   Breadth & Detail    (.66)  (.87)
4. Enthusiasm for       .74    .70    .69
   Subject             (.72)  (.70)  (.68)
5. Fairness in          .55    .63    .68    .58
   Marking             (.56)  (.64)  (.68)  (.64)
6. Attitude Toward      .52    .66    .66    .66    .81
   Student             (.54)  (.65)  (.64)  (.70)  (.81)
7. Personal             .55    .73    .74    .63    .71    .78
   Mannerisms          (.56)  (.74)  (.74)  (.67)  (.70)  (.77)
8. Over-All Summary     .74    .90    .89    .79    .77    .81    .82
   as a Teacher        (.74)  (.89)  (.87)  (.81)  (.77)  (.81)  (.83)

Note. 1972 correlations are in parentheses; N's for 1972
and 1973 were 2417 and 2465 respectively; all
correlations are significant, p << .000001.
The average of all eight items was selected as the primary
criterion for this study, but results were also obtained
for three other criteria considered possibly representative
of the factor (see below, Parallel Results Using Other
Criteria). One of these other criteria, Item 8 by itself,
was chosen on account of its extremely high factor loading.
The average of the other seven items was chosen as a
criterion for purposes of comparison with the first two
criteria, since all eight items had very high factor
loadings. Item 5 by itself was also used as a criterion in
order to determine how well the predictor variables could
predict the students' ratings of "grading fairness."
Reduction of the Initial Battery of Predictor Variables
The means and standard deviations of the 27 predictor
and criterion variables are presented in Table 7. They
were computed as part of this first stepwise multiple
regression analysis. It should be noted that these means
and standard deviations were computed across the 2,360
course sections and, therefore, that teachers are unequally
represented according to the number of course sections they
taught.
Table 8 shows the correlations among the 27 predictor
and criterion variables. These are the correlations that
were computed by the stepwise multiple regression analysis
Table 7
Means and Standard Deviations
of the Predictor and Criterion Variables

Variable                             Mean     S.D.
 1 Sex                               1.15      .36
 2 Appointment length                9.04      .28
 3 Percentage employed              98.73     7.68
 4 Class size                       22.48    28.34
 5 Number of years tenured           4.69     6.99
 6 Age                              41.31    10.10
 7 Years since hiring                7.59     7.45
 8 Course level                      1.79      .79
 9 Title level                       2.65     1.00
10 Course location                   1.80      .40
11 Department quantitativeness      33.75    43.35
12 Rate of return of ratings          .57      .16
13 Average grade                     2.94      .57
14 Standard deviation of grades       .72      .32
15 Variance of grades                 .61      .44
16 Skewness of grades                2.60     2.81
17 Kurtosis of grades               13.62    29.34
18 Item 1                            8.1      1.13
19 Item 2                            6.91     1.60
20 Item 3                            6.87     1.47
21 Item 4                            8.01     1.29
22 Item 5                            7.61     1.29
23 Item 6                            7.54     1.46
24 Item 7                            7.52     1.38
25 Item 8                            7.33     1.45
26 Average of Items 1-8              7.49     1.20
27 Average of Items 1-7              7.51     1.17

Note. N = 2,360 for each variable.
Table 8
Product-moment Correlations
Among the Predictor and Criterion Variables

Variable                      1      2      3      4      5      6
 1 Sex                      1.00   -.03   -.07    .00   -.09    .08
 2 Appointment length       -.03   1.00   -.10    .00   -.02    .00
 3 Percentage employed      -.07   -.10   1.00   -.01    .07    .01
 4 Class size                .00    .00   -.01   1.00   -.04   -.08
 5 No. of years tenured     -.09   -.02    .07   -.04   1.00    .68
 6 Age                       .08    .00    .01   -.08    .68   1.00
 7 Years since hiring       -.06    .00    .09   -.06    .95    .70
 8 Course level             -.12    .00   -.01   -.14    .05    .08
 9 Title level              -.20   -.03    .11   -.04    .62    .58
10 Course location          -.22    .06   -.01    .08    .15   -.03
11 Dept. quant.             -.09   -.02    .01   -.04    .08    .06
12 Rate of return            .05   -.01   -.02   -.06    .00    .02
13 Average grade            -.04    .08   -.01   -.16    .03    .03
14 Std. dev. of grades       .04   -.05   -.01    .20   -.03   -.07
15 Variance of grades        .04   -.05    .01    .14   -.03   -.07
16 Skewness of grades        .02   -.03    .01    .51   -.03   -.09
17 Kurtosis of grades        .02   -.02    .01    .77   -.03   -.07
18 Item 1                   -.00   -.02    .08   -.06    .15    .16
19 Item 2                    .05    .01    .05   -.02    .03   -.04
20 Item 3                    .05    .00    .05   -.04   -.00   -.06
21 Item 4                    .06    .01    .07   -.05    .05    .07
22 Item 5                    .02    .02    .01   -.06   -.03   -.04
23 Item 6                    .02    .03    .03   -.07    .01    .00
24 Item 7                    .06    .02    .05   -.06   -.03   -.07
25 Item 8                    .02    .01    .06   -.05    .03   -.00
26 Average of Items 1-8      .04    .01    .06   -.06    .03   -.00
27 Average of Items 1-7      .04    .01    .06   -.06    .02   -.00

Note. All r's > .04 (or < -.04) are significant, p < .05.
For r's > .10 (or < -.10), p < .000001.
Table 8 (Continued)

Variable                      7      8      9     10     11     12
 1 Sex                      -.06   -.12   -.20   -.22   -.09    .05
 2 Appointment length        .00    .00   -.03    .06   -.02   -.01
 3 Percentage employed       .09   -.01    .11   -.01    .01   -.02
 4 Class size               -.06   -.14   -.04    .08   -.04   -.06
 5 No. of years tenured      .95    .05    .62    .15    .08    .00
 6 Age                       .70    .08    .58   -.03    .06    .02
 7 Years since hiring       1.00    .02    .61    .10    .07    .00
 8 Course level              .02   1.00    .26    .34   -.01    .05
 9 Title level               .61    .26   1.00    .30    .06    .01
10 Course location           .10    .34    .30   1.00    .04   -.04
11 Dept. quant.              .07   -.01    .06    .04   1.00    .06
12 Rate of return            .00    .05    .01   -.04    .06   1.00
13 Average grade             .02    .56    .15    .31   -.15    .03
14 Std. dev. of grades      -.03   -.52   -.15   -.17    .15   -.07
15 Variance of grades       -.03   -.47   -.14   -.17    .19   -.04
16 Skewness of grades       -.04   -.38   -.11   -.09    .18   -.04
17 Kurtosis of grades       -.03   -.21   -.05   -.01    .10   -.02
18 Item 1                    .16    .11    .28    .02   -.01    .05
19 Item 2                    .03    .15    .09    .06   -.09    .01
20 Item 3                    .00    .16    .08    .04   -.03    .02
21 Item 4                    .06    .14    .14    .01   -.12    .00
22 Item 5                   -.03    .16    .06    .08    .02   -.01
23 Item 6                    .01    .19    .05    .10   -.04   -.03
24 Item 7                   -.03    .18    .06    .08   -.04    .01
25 Item 8                    .04    .15    .12    .06   -.04    .01
26 Average of Items 1-8      .03    .18    .12    .07   -.05    .01
27 Average of Items 1-7      .03    .19    .12    .07   -.05    .01
Table 8 (Continued)

Variable                     13     14     15     16     17     18
 1 Sex                      -.04    .04    .04    .02    .02   -.00
 2 Appointment length        .08   -.05   -.05   -.03   -.02   -.02
 3 Percentage employed      -.01   -.01    .01    .01    .01    .08
 4 Class size               -.16    .20    .14    .51    .77   -.06
 5 No. of years tenured      .03   -.03   -.03   -.03   -.03    .15
 6 Age                       .03   -.07   -.07   -.09   -.07    .16
 7 Years since hiring        .02   -.03   -.03   -.04   -.03    .16
 8 Course level              .56   -.52   -.47   -.38   -.21    .11
 9 Title level               .15   -.15   -.14   -.11   -.05    .28
10 Course location           .31   -.17   -.17   -.09   -.01    .02
11 Dept. quant.             -.15    .13    .19    .18    .10   -.01
12 Rate of return            .03   -.07   -.04   -.04   -.02    .05
13 Average grade            1.00   -.68   -.65   -.56   -.33    .14
14 Std. dev. of grades      -.68   1.00    .94    .78    .44   -.10
15 Variance of grades       -.65    .94   1.00    .86    .49   -.08
16 Skewness of grades       -.56    .78    .86   1.00    .82   -.07
17 Kurtosis of grades       -.33    .44    .49    .82   1.00   -.04
18 Item 1                    .14   -.10   -.08   -.07   -.04   1.00
19 Item 2                    .29   -.17   -.16   -.12   -.06    .67
20 Item 3                    .27   -.16   -.15   -.11   -.06    .68
21 Item 4                    .23   -.19   -.18   -.16   -.09    .73
22 Item 5                    .41   -.20   -.17   -.14   -.10    .54
23 Item 6                    .43   -.23   -.23   -.20   -.13    .51
24 Item 7                    .32   -.18   -.17   -.14   -.08    .54
25 Item 8                    .32   -.19   -.18   -.15   -.09    .74
26 Average of Items 1-8      .35   -.21   -.19   -.16   -.09    .77
27 Average of Items 1-7      .36   -.21   -.19   -.16   -.09    .77
Table 8 (Continued)
Variable 19 20 21 22 23 24
1 Sex .05 .05 .06 -.02 .02 .06
2 Appointment length .01 .00 .01 .02 .03 .02
3 Percentage employed .05 .05 .07 .01 .03 .05
4 Class size -.02 -.04 -.05 -.06 -.07 -.06
5 No. of years tenured .03 -.00 .05 -.03 .01 -.03
6 Age -.04 -.06 .07 -.04 .00 -.07
7 Years since hiring .03 .00 .06 -.03 .01 -.03
8 Course level .15 .16 .14 .16 .19 .18
9 Title level .09 .08 .14 .06 .05 .06
10 Course location .06 .04 .01 .08 .10 .08
11 Dept. quant. -.09 -.03 -.12 .02 -.04 -.04
12 Rate of return .01 .02 .00 -.01 -.03 .01
13 Average grade .29 .27 .23 .41 .43 .32
14 Std. dev. of grades -.17 -.16 -.19 -.20 -.23 -.18
15 Variance of grades -.16 -.15 -.18 -.17 -.23 -.17
16 Skewness of grades -.12 -.11 -.16 -.14 -.20 -.14
17 Kurtosis of grades -.06 -.06 -.09 -.10 -.13 -.08
18 Item 1 .67 .68 .73 .54 .51 .54
19 Item 2 1.00 .89 .70 .64 .66 .74
20 Item 3 .89 1.00 .69 .68 .67 .74
21 Item 4 .70 .69 1.00 .57 .65 .63
22 Item 5 .64 .68 .57 1.00 .81 .71
23 Item 6 .66 .67 .65 .81 1.00 .78
24 Item 7 .74 .74 .63 .71 .78 1.00
25 Item 8 .90 .89 .79 .78 .81 .83
26 Average of Items 1-8 .90 .91 .82 .83 .85 .87
27 Average of Items 1-7 .90 .90 .83 .83 .86 .87
Table 8 (Continued)
Variable 25 26 27
1 Sex .02 .04 .04
2 Appointment length .01 .01 .01
3 Percentage employed .06 .06 .06
4 Class size -.05 -.06 -.06
5 No. of years tenured .03 .03 .02
6 Age -.00 -.00 -.00
7 Years since hiring .04 .03 .03
8 Course level .15 .18 .19
9 Title level .12 .12 .12
10 Course location .06 .07 .07
11 Dept. quant. -.04 -.05 -.05
12 Rate of return .01 .01 .01
13 Average grade .32 .35 .36
14 Std. dev. of grades -.19 -.21 -.21
15 Variance of grades -.18 -.19 -.19
16 Skewness of grades -.15 -.16 -.16
17 Kurtosis of grades -.09 -.09 -.09
18 Item 1 .74 .77 .77
19 Item 2 .90 .90 .90
20 Item 3 .89 .91 .90
21 Item 4 .79 .82 .83
22 Item 5 .78 .83 .83
23 Item 6 .81 .85 .86
24 Item 7 .83 .87 .87
25 Item 8 1.00 .97 .96
26 Average of Items 1-8 .97 1.00 1.00
27 Average of Items 1-7 .96 1.00 1.00
subprogram of the Statistical Package for the Social
Sciences (SPSS, Nie et al., 1970). These correlations
were subsequently used by that subprogram to perform the
regression analyses. It should be noted that these are,
again, correlations among course section averages. Also,
because more cases were deleted for missing data for the
regression analyses (more variables involved) than for the
factor analyses, some of the correlations in Table 6 differ
very slightly from the corresponding ones in Table 8.
Given the large sample size, any of these correlations
that exceeds .035 is significantly different from zero
(p < .05). Furthermore, for any r that exceeds .048, p is
less than .01; and if r exceeds .10, then p is less than
.000001. Thus one may be fairly certain that even the
relatively weak relationships found in this study were not
attributable to chance variation.
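These thresholds can be checked against the t distribution;
the sketch below (requiring scipy, and assuming a one-tailed
test, which reproduces the quoted figures) is illustrative:

    from scipy import stats

    def critical_r(n_obs, alpha):
        """Smallest r significant at `alpha` (one-tailed) for N = n_obs."""
        t = stats.t.ppf(1.0 - alpha, df=n_obs - 2)
        return t / (t ** 2 + n_obs - 2) ** 0.5

    print(round(critical_r(2360, 0.05), 3))   # about .034, cf. the .035 quoted
    print(round(critical_r(2360, 0.01), 3))   # about .048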
The first major finding, bearing on the research
question, was the correlation between the average student
grade in each course section and the average student rating
of the teacher of that course section. This was found to
be .35 (p << .000001). The other correlations involving
these two variables are particularly interesting. For
example, a correlation of -.15 was found between the average
grade in each course section and the quantitativeness of the
department in which that course is taught. Also, the
correlation between average grade and Item 5 on the rating
instrument ("Fairness in Marking") was .41, and the
correlation between average grade and Item 6 ("Attitude
Toward Student") was .43. Discussion of the implications
of these and other results is withheld until chapter V.
The results of the first stepwise multiple regression
analysis are presented in Table 9. The initial battery of
12 predictors of ratings was reduced to an "optimally
reduced subset" (see chapters I and III). This subset
consisted of the first 10 variables shown in Table 9 (above
the line of dashes). The variables are listed in the rank
order of their importance in improving the predictability
of ratings by the regression equation. Also shown in Table
9, for each variable, are the standard error of estimate
after the variable's inclusion in the regression equation,
the multiple R, R², the increase in R² over the previous
step, the simple correlation with the criterion, and the
F-value that signifies the importance of the variable to
the regression equation as of the last step.
The multiple correlation produced by the optimally
reduced subset of 10 predictors was .25. Correction for
shrinkage yielded an R of .24. Although this multiple
correlation is not very large (and accounts for only
slightly more than 6% of the criterion variance), it is
highly significant, F (10, 2349) = 14.12, p < .001. It
should be noted that the first variable to enter the
regression equation, "course level," accounted for over
half of the criterion variance finally accounted for by
the entire optimally reduced subset of 10 predictors.
Table 9
First Stepwise Multiple Regression Analysis:
Reduction of the Initial Battery of Predictor Variables

                                            Increase  Std. Error
Rank Variable                R       R²     in R²     of Estimate  Simple r   F-Value  Signif.
  1  Course level          .18162  .03299   .03299    1.18363      .18162     52.190   ***
  2  Title level           .19699  .03881   .00582    1.18032      .12155     31.695   ***
  3  Age                   .21045  .04429   .00548    1.17719     -.00118     19.545   ***
  4  Sex                   .22995  .05288   .00859    1.17214      .03509     19.702   ***
  5  Percentage employed   .23486  .05516   .00228    1.17098      .05653      5.181   *
  6  Dept. quant.          .23889  .05707   .00191    1.17004     -.05062      4.775   *
  7  Class size            .24261  .05886   .00179    1.16918     -.05908      3.772   *
  8  Appointment length    .24366  .05937   .00051    1.16911      .01186      1.120   n.s.
  9  Years since hiring    .24431  .05969   .00032    1.16916      .03132      2.065   n.s.
 10  No. of years tenured  .24555  .06029   .00061    1.16903      .02543      1.242   n.s.
----------------------------------------------------------------------------------------------
 11  Course location       .24603  .06053   .00024    1.16913      .06555       .605   n.s.
 12  Rate of return        .24605  .06054   .00001    1.16938      .00768       .014   n.s.

*** p < .001     * p < .05     n.s. p > .05
Furthermore, even though all 12 predictor variables
increased the multiple R somewhat, only the first 10
reduced the standard error of estimate (optimality).
The F-values of the predictors as of the last step
show the final significance of each predictor. They also
demonstrate the fact that predictors may gain or lose
significance when other predictors enter the regression
equation. (The rank order of the F-values is not the same
as the order of inclusion of the variables into the
regression equation.)
Addition of the Five New Predictor Variables
to the Optimally Reduced Subset
The results of the second stepwise multiple regression
analysis are presented in Table 10. These results show that
the average of the student grades in each course section was
the single best predictor of the average rating of the
teacher of that course section. Furthermore, the addition
of the grading variables to the optimally reduced subset of
other predictors significantly improved the multiple
correlation from .25 to .39, F (4, 2345) = 60.13, p < .001.
The variable "average grade" by itself accounted for nearly
8.5% of the criterion variance (more than was accounted for
by the entire optimally reduced subset of other predictors).
A further indication of the importance of the grading
variables is provided by the fact that the variable "average
grade," when it entered the regression equation at step
Table 10
Second Stepwise Multiple Regression Analysis:
Addition of the Five Grading Variables to the Optimally Reduced Subset

                                            Increase  Std. Error
Rank Variable                R       R²     in R²     of Estimate  Simple r   F-Value  Signif.
  1  Course level          .18162  .03299   .03299    1.18363      .18162      1.333   n.s.
  2  Title level           .19699  .03881   .00582    1.18032      .12155     29.842   ***
  3  Age                   .21045  .04429   .00548    1.17719     -.00118     11.949   ***
  4  Sex                   .22995  .05288   .00859    1.17214      .03509     17.323   ***
  5  Percentage employed   .23486  .05516   .00228    1.17098      .05653      6.003   **
  6  Dept. quant.          .23889  .05707   .00191    1.17004     -.05062       .002   n.s.
  7  Class size            .24261  .05886   .00179    1.16918     -.05908      3.878   *
  8  Appointment length    .24366  .05937   .00051    1.16911      .01186       .177   n.s.
  9  Years since hiring    .24431  .05969   .00032    1.16916      .03132      3.167   *
 10  No. of years tenured  .24555  .06029   .00061    1.16903      .02543      3.298   *
----------------------------------------------------------------------------------------------
 11  Average grade         .38080  .14501   .08472    1.11533      .35333    206.038   ***
 12  Skewness of grades    .38445  .14780   .00279    1.11375     -.15866      3.191   *
 13  Standard deviation
     of grades             .38469  .14799   .00018    1.11386     -.20510      3.532   *
 14  Variance of grades    .38621  .14916   .00117    1.11334     -.19352      3.168   *
----------------------------------------------------------------------------------------------
 15  Kurtosis of grades    .38622  .14916   .00001    1.11357     -.09191       .019   n.s.

*** p < .001     ** p < .01     * p < .05     n.s. p > .05
number 11, completely dominated the regression equation.
It took over and rearranged the regression equation to such
an extent that the variables "course level" and "department
quantitativeness" lost nearly all of their significance as
contributors to the final regression equation. The extent
of the rearrangement is clearly indicated by the column of
F-values in Table 10 compared to the same column in Table 9.
The first 14 variables listed in Table 10 (above the
lower line of dashes) produced the multiple correlation
with the lowest standard error of estimate. That R was .39.
Correction for shrinkage yielded an R of .38. While this
is still not a very large multiple correlation (accounting
for only about 15% of the criterion variance), it is highly
significant by itself, F (14, 2345) = 28.28, p < .001. The
significance of the increase in the multiple correlation is
described above.
Parallel Results Using Other Criteria
The choice of the criterion variable for the above
regression analyses is explained above (see Factor Analyses
of the Rating Instrument's Items). As mentioned above,
three other criteria, Item 8 by itself, the average of the
other seven items, and Item 5 by itself, might represent
the single ratings factor as well as the average of all
eight items. In order to determine if the choice of the
criterion was important, three more pairs of stepwise
multiple regression analyses were performed. These analyses
were run for the three other criteria exactly as they were
for the first criterion. A reduction of the initial battery
of predictors was done first, and then the five grading
variables were added to the new optimally reduced subset in
each case.
The results of these analyses were nearly identical to
those obtained with the original criterion. Using "Item 8"
as the criterion, the reduction of the initial battery of
predictors resulted in a multiple correlation of .23. The
correction for shrinkage yielded an R of .22, F (9, 2350) =
13.01, p < .001. The variable "appointment length" did not
make a significant contribution to this regression equation,
though it did for the other three criteria. Thus, this
optimally reduced subset of predictors of "Item 8" included
only nine of the independent variables.
The addition of the grading variables to this optimally
reduced subset of nine predictors increased the multiple
correlation to .36, F (13, 2346) = 26.09, p < .001. The
correction for shrinkage did not lower this R. The increase
in the multiple correlation was also highly significant,
F (4, 2346) = 52.94, p < .001.
Similarly, when the average of the first seven items
on the UCRSI was used as the criterion, the reduction of the
initial battery of predictors resulted in a multiple
correlation of .25. Correction for shrinkage yielded an R
of .24, F (10, 2349) = 14.44, p < .001. This optimally
reduced subset of predictors included the same 10 variables
as the subset of predictors of the average of all eight
items, and there was only one minor difference in their
order: the variables "age" and "sex" were reversed, although
their respective F-values as of the last step were nearly
identical in both analyses.
The addition of the five grading variables in this case
increased the multiple correlation to .39, with the
correction for shrinkage yielding an R of .38, F (14, 2345)
= 28.72, p < .001. The increase in the multiple correlation
was also highly significant, F (4, 2345) = 60.75, p < .001.
The use of "Item 5" as the criterion yielded slightly
different results. The reduction of the initial battery of
predictors produced a lower multiple R than it did for the
other three criteria. However, it was still a highly
significant .19, F (9, 2350) = 9.29, p < .001. The
correction for shrinkage did not lower this R. The
resulting optimally reduced subset consisted of nine of the
predictors. "Course level," "age," and "title level" were
the best three predictors out of the initial battery of 12
(as in the other reductions), but the order of the less
important predictors was altered.
When the five grading variables were added to the
optimally reduced subset of predictors of "Item 5," the
multiple correlation increased to .45, F (13, 2346),
p << .001. The correction for shrinkage did not lower this
R. This increase (from the smallest R found in this study
to the largest) was, of course, very significant,
F (4, 2346) = 120.71, p << .001.
It should be noted that for all four pairs of
regression analyses, the reported F-values computed for
the increases in the multiple R's (on account of the
addition of the grading variables) were based on the
multiple R's after shrinkage. This is conservative
(slightly understating the F-value), but in this study it
resulted in no practical differences, since all of the
F-values were so highly significant.
Summary
Factor analyses of the eight-item rating instrument
showed that there was essentially one factor underlying the
UCRSI ratings data. This led to the choice of the average
of all eight items on the UCRSI as the criterion variable
to represent that factor. Multiple regression analyses
yielded low but highly significant multiple correlations.
Moreover, they showed that the addition of the grading
variables to the optimally reduced subset of other
predictors of ratings did indeed make a highly significant
improvement in the predictive efficiency of the regression
equation. Furthermore, practically identical results were
obtained for three other criterion variables.
CHAPTER V
DISCUSSION AND CONCLUSIONS
This chapter presents a discussion of the results of
this study and explores several implications of those
results. This chapter also provides some recommendations
in light of certain conclusions based on those results and
implications.
Discussion of Results and Implications
The results of this study apparently provide a unique
contribution to the literature on the research question:
What is the influence of the grades students receive on
their ratings of the college teachers who gave them those
grades? This study has used multivariate techniques, and
has studied a very large sample of course sections across
an entire university. Previous studies have typically
studied only one or a few course sections, or have
suffered from several other inadequacies (see chapter I).
This study did not suffer from those inadequacies, and thus
the results are probably more definitive and generalizable.
Furthermore, these results provide more up-to-date knowledge
in view of certain changes that have occurred in rating and
grading practices over the past decade (see chapter I).
The primary implication of the results of this study
is that there is an interrelationship between grades and
ratings. It would seem that both the direction and the
extent of the hypothesized "grading bias" have been
demonstrated. That is, students apparently tend to rate
lower those teachers from whom they receive lower grades,
and vice versa. However, in light of the fact that the
grading variables accounted for only about 9% of the
variance in ratings in this study, other factors (valid or
not) must also be influencing the students' ratings. In
other words, students did not simply "pay off" their teachers
with ratings in direct proportion to the grades they
received from those teachers. Rather, the students were
biased or influenced by their grades, overall, in such a
way that permitted other considerations also to be involved
in the rating process.
Possibly some of the students did strictly "pay off"
their teachers for grades received, while others were not
at all influenced by their grades. On the other hand, it
is possible that most or all of the students' ratings are
merely "shifted" by the grades received. That is, students
might rate teachers more or less validly, but plus or minus
a certain amount according to the grades received. Without
the ability to pair individual student grades and ratings
(given the anonymity of ratings), it could not be determined
whether some students were significantly more influenced by
their grades than were other students. This would be an
appropriate question for further research to answer
conclusively, but the previous studies that have used
paired data have often found essentially the same degree
of relationship between grades and ratings, in spite of
certain inadequacies or faults in those studies.
Overall average ratings are often the measure used by
administrators and department chairmen for making decisions
about faculty pay, promotion, and tenure, and this study has
shown the probable existence of an overall grading bias.
To some extent, it is irrelevant what the differences are
between individual students, when it comes to the grading
bias. It does not even matter whether this bias is conscious
or subconscious, so long as it is actually dependent upon
grades. Unless one could develop a way to count only the
ratings of those students who are not biased by grades, or
unless a conscious grading bias were subject to elimination
more than an unconscious bias, then the overall grading bias
will exist, whatever its makeup might be.
Even though the rate of return of the ratings used in
this study was only 55%, it has been about 55% for many
years, and these are ratings that are systematically
used for making administrative decisions about faculty
careers. That is, even though returned ratings may not be
generalizable across all of the students the faculty member
taught (since students who return ratings may not constitute
a random sample so far as their opinions are concerned),
such ratings are commonly used as though they were
representative of overall student opinion. The results
of this study should be generalizable to the many other
rating systems operating with similar rates of return.
It remains for future researchers to determine whether an
appreciably different rate of return would have any
influence on the relationships found in this study, but
these results are applicable to rating systems as they are
commonly used.
Also, to the extent that the course sections and
rating procedures at the University of Connecticut are
representative of those across the nation, the results of
this study are generalizable. The relationship between
grades and ratings probably varies from one institution
to another because of differences in the students, faculty,
grading systems, rating instruments, and rating procedures.
For example, the timing of ratings (before or after final
grades are awarded) is an important difference between
rating systems, even though previous research (e.g., Bausell
& Magoon, 1972a; Holmes, 1971) indicates that an expected
grade bias probably exists before the actual grade is
determined. Perhaps future researchers could use similar
rating procedures at a large number of institutions with
comparable grading practices to find out how generalizable
the results of this study are.
It is possible that there are other variables which,
when added to the final regression equation in this study,
would lower or eliminate the importance of the grading
variables as predictors of ratings. That is, it is possible
that grades and ratings correlate without there being any
causal relationship between them, and that ratings actually
depend completely on some other unknown factor or factors.
This is doubtful, however. The simple and multiple
correlations found in this study indicate a definite, if
partial, relationship. Furthermore, the very high
significance levels of these simple and multiple
correlations suggest that the results are not attributable
to chance variation. Also, certain previous findings (see
especially Holmes, 1972) provide strong (experimental)
evidence of a causal relationship between grades and
ratings.
The results of this study would have been even more
definitive, had the optimally reduced subset of the initial
battery of predictors accounted for a larger amount of the
variance in ratings. This would have lowered the chances
that any other variable exists which would account for more
of the variance in the ratings (thus suggesting that the
present results might be spurious). The more criterion
variance "partialled out" by the optimally reduced subset,
the more significant the increase attributable to the
grading variables would have been.
Ideally, nearly all of the variance in ratings would be
attributable to students' reliable perceptions of their
teachers' performances (Treffinger and Feldhusen, 1970).
If the grading bias were about the only exception to this
rule, it might explain why only about 6% of the criterion
variance could be accounted for by the 10 predictors in the
optimally reduced subset. That is, maybe there are no
predictors of ratings more significant than those found in
this study (outside of other variables which might tap the
students' opinions about their teachers in some other way).
Even with the grading variables added in, the final
regression equation in this study was able to account for
only about 15% of the criterion variance. With such a
large portion of the variance left unaccounted for, the
chances are theoretically greater that some variable could
disprove the existence of what seems to be the grading bias.
This writer doubts the existence of any such variable.
Most likely there are few accurate predictors of student
ratings of faculty performance.
One possibility is that student ratings are almost
totally invalid as measures of effective teaching anyway,
and relate more to student and faculty personality
interactions. Similarly, peer ratings may be measures of
personality conflicts or even secondhand student ratings.
Furthermore, without a consensual definition of effective
teaching (or the purposes of education, for that matter),
even unbiased judgments may not be highly related to each
other.
The results of studies by Treffinger and Feldhusen
(1970) and by Bausell and Magoon (1972b) indicate that
students' first impressions and preconceptions are
persistent and may be the best predictors of end-of-course
ratings. Thus, it is possible that researchers should be
looking for predictors of such first impressions or
preconceptions. Perhaps the grades students receive
constitute the major influence on ratings after the initial
opinion is formed.
Also, to the extent that there is a large portion of
error variance in the ratings data, the importance of the
grading bias found in this study looms even larger. That
is, if the reliability of the ratings is considerably lower
than 1.00, then the grading variables would account for more
than 9% of the systematic variance. For example, if the
reliability were .75, then the percentage of the systematic
variance accounted for by the grading variables would be
12% (.09 / .75). Unfortunately, there is no accurate
estimate of exactly how reliable student ratings are, nor
even how reliability would best be defined. But for the
purposes of this study, it is conservative to assume that
the reliability is nearly 1.00 and that the grading bias
is no more significant than indicated by the findings.
Several of the details of the results of this study
are worthy of mention. Notably, the results of the factor
analyses of the eight-item rating instrument (see Tables 5
and 6) are interesting in their own right, beyond their
utility in the selection of criteria. Whatever the global
impression of teachers may depend upon, it was apparently
measured in the same way by the rating instrument in 1972
and 1973. The single factors and the correlations between
items for the two years are strikingly similar. It is also
interesting to note the order of the sizes of the factor
loadings. "Knowledge of Subject" had the lowest factor
loading of all eight, and "Over-All Summary as a Teacher"
had the highest loading.
The finding of only one factor suggests that a halo
effect is present, and, as suggested by Hoyt (1969) and by
Widlak et al. (1973), the separate items may be of little
diagnostic value. This writer doubts that the high
correlations found between items represent any "true"
relationships between the eight traits that the instrument
attempts to measure. Rather, there seems to be a halo
effect confounding the meaning of the separate items. It
is possible that certain "profile effects" (indicative of
differences across the rating items) may exist, even though
masked by the halo effect. However, the diagnostic value
of such "high inference" items (very general and open to
varying interpretations) is questionable anyway, even if
there were no confounding influences. It is not at all
clear exactly what a teacher should do to improve his
ratings on such global traits.
Student ratings, therefore, may not be improving
teaching in the expected way. Specific, but global, items
apparently can not help the teacher improve his teaching.
In fact, with a grading bias present, ratings almost
assuredly act counter to the larger educational goals.
That is, while a full range of grades may be educationally
appropriate, the grading bias would act to diminish the
apparent effectiveness of teachers who do give out some low
grades, and it would act to reward the lenient teachers who
give out higher and less discriminating grades.
Of course, one optimistic explanation for a grading
bias is that many high grades accompanied by high ratings
in a course section might reflect superior teaching or a
highly successful class (perhaps using "contract grading").
However, this writer doubts that such is the general case.
External criteria, such as standardized achievement tests,
would be needed to substantiate any claim that such high
grades are deserved, especially in light of the grading
trend over the past several years (see chapter I).
It was hypothesized that certain departments at this
university were more lenient (in the awarding of grades)
than others. Specifically, it was thought that the "hard"
science and mathematical departments were giving out lower
grades than other departments, and that the faculty members
in those highly quantitative departments might be suffering
from lower ratings. The correlation of -.15 found between
average grade and department quantitativeness supports such
a theory, as does the correlation of -.05 between ratings
and department quantitativeness. Of course these
correlations do not prove any causality of the relationships;
in addition, the size of the correlations suggests the
relationships are slight.
Nevertheless, department quantitativeness was a
significant predictor of ratings until the grading variables
were allowed to enter the regression equation; then its
predictive potential was subsumed by the grading variables.
It would seem that the variable "department quantitativeness"
was providing some information about grades indirectly
(because of the relationship between department
quantitativeness and grades). However, when "better"
information about grades (the grading variables themselves)
entered the regression equation, then department
quantitativeness was of no further importance in predicting
ratings. The same situation occurred with the variable
"course level," which was the best predictor of ratings
before the grading variables were considered. The F-values
in Tables 9 and 10 demonstrate the extent of these variables'
loss in predictive potential.
It is possible that certain departments are suffering
more from the grading bias than others on account of the
differences in grading practices. Moreover, it is likely
that some individual teachers are suffering more than others
according to their grade distributions. The teachers with
lenient grade distributions are probably not always
the best teachers. Thus, the grading bias lowers the
probability that ratings could be valid as measures of
teaching effectiveness.
As suggested earlier, the effect of grades on ratings
may well be only partial, i.e., one of several influences
(see discussion of first impressions above). The simple
correlation between grades and ratings found in this study
was not high (.35), but it was very highly significant
because of the large sample size. Many previous studies,
in spite of serious shortcomings (see chapter I), also found
significant correlations of this approximate magnitude.
Typically the correlations did not exceed .30 (Costin et
al., 1971). The results of this study support the findings
of the large portion of the previous research which found
such correlations. Thus, this correlation of .35 between
grades and ratings, and the similar correlations found by
other researchers, could be quite accurate and indicate that
the "true" relationship between grades and ratings probably
lies in the vicinity of .30 to .35.
The correlation of .41 found between average grade and
Item 5 ("Fairness in Marking") indicates that this item is
especially sensitive to the grading bias. Students
evidently consider higher grades as "fair." This is not
very surprising. One might expect that a rating item like
"Fairness in Marking" would receive the brunt of the grading
bias. Actually, however, the grading bias apparently
affects all eight items (see correlations among average
grade and Items 1-8 in Table 8). Furthermore, the
correlation is highest (.43) for Item 6 ("Attitude Toward
Student"). Students apparently consider grades awarded as
a primary indication of their teachers' attitudes toward
students.
While Items 5 and 6 seem to be the most sensitive of
the eight items to the grading bias (having the highest
correlations with all five of the grading variables), other
considerations must also enter into the students' ratings
even of these two items. When Item 5 by itself was the
criterion variable (see Parallel Results Using Other
Criteria), the grading variables still managed to account
for only about 17% of the criterion variance. Thus, the
bias is partial even for Items 5 and 6, but it is pervasive
across all eight items. This finding supports the above
mentioned assertion by Bausell and Magoon (1972b) that
disappointed students "will tend to deprecate the
instructor's teaching performance in areas other than his
grading system" (p. 130, emphasis added).
Conclusions and Recommendations
It would seem that one important influence on student
ratings of college teachers has been identified. It is the
so-called "grading bias," and it apparently accounts for
about 9% of the variance in ratings. Variables which could
account for most of the rest of the variance, however, have
not been identified. Student ratings may or may not be
mostly valid in spite of the grading bias. There is no
proof either way. It remains for future research to answer
many such questions raised by the results of this study.
Even if researchers could find variables which would account
for the rest of the variance in the ratings, that would be
no guarantee that ratings are valid as measures of effective
teaching. Such variables, however, probably would offer
substantial clues as to whether or not ratings are valid,
depending on the identity of those variables.
Well defined educational goals are needed before
effective teaching can be defined, and external criteria of
effective teaching must be well defined before valid
measures of effectiveness can be firmly established. This
positive, constructive, and difficult work needs to be done.
The results of this study are, therefore, somewhat
negative. That is, a grading bias has been found which most
likely serves to lower the validity of student ratings.
Perhaps positive steps could be taken which would eliminate
the grading bias. If possible, methods should be devised
which would do just that. One possibility is that ratings
could be collected very early in the semester. This would
allow for feedback to the teachers in time for improvement
to occur before the end of the semester (assuming the
teachers would be responsive and that ratings would indicate
desired changes). However, there is still likely to be an
expected grade bias (Holmes, 1971), and there is still no
certainty about the validity of ratings even without any
grading bias.
The results of this study have demonstrated a grading
bias for a "high inference" rating instrument (containing
very general questions open to varying interpretations), but
radically different results might occur with so-called "low
inference scales" (asking more objective questions which
supposedly would be less subject to halo effects and biases).
Such scales may or may not eliminate the grading bias and
may or may not be valid as measures of effective teaching.
These facts remain to be determined by future researchers.
One might suggest that students could be "taught" how
to rate their teachers more fairly. The evidence of a halo
effect, however, suggests that students are unable to
separate their feelings about grades and teachers,
especially on "high inference scales." Or it might be
suggested that, if grading practices were completely fair
and non-arbitrary, then any bias caused by grades would
disappear. However, if grades were made fair (more justly
discriminating), it seems that grades would be lower, and
that this might in turn increase the bias.
Perhaps the most important conclusion for immediate
consumption is that, since grades do seem to influence
student ratings of college teachers, this bias should be
taken into account whenever one is interpreting student
ratings. Administrators, department chairmen, and promotion
and tenure committees across the nation should remember that
the grading bias exists and that certain teachers may suffer
from the bias more than others, depending on the grades they
give out. It seems intuitively obvious that one should not
reward teachers for leniency if students are expected to
work and to achieve. It would be better to reward teachers
according to the actual achievement levels reached by their
students (measured by standardized achievement tests).
One policy option might be to drop student ratings
altogether. They are costly, and possibly not worth the
cost, especially since they must be interpreted with caution.
Some reasons for not eliminating student ratings are: they
are well established; some institutional prestige is
associated with their use; many students want them; they at
least give the appearance of requiring teacher accountability;
and dropping them could conceivably lower faculty concern
for effective teaching. On the other hand, it might be
argued, better teaching and fairer grading would occur
without the faculty's probable fear of reprisals on ratings
(Ladas, 1974; "Too Many A's," 1974).
This writer recommends that university officials and
faculties review the above implications, and decide how best
to serve the educational needs involved. If student ratings
can not be made objective and valid, if the grading bias can
not be eliminated, and if student ratings can not be dropped
outright, then it would seem two possible paths are suggested.
Either decision makers should ignore the ratings, or they
should combine them with other measures of teacher
effectiveness (perhaps achievement tests or indices of the
amount of work done by the students) in such a way that
ratings would not encourage teachers to be slack or to
demand too little.
APPENDIX
MACRO FLOWCHARTS
OF THE DATA PROCESSING STEPS USED IN THIS STUDY
Figures 6 through 18 represent the major data
processing steps required to accumulate the data for this
study. Figure 5 provides a key to the symbols used in
those figures, and the letters and numerals within the
symbols in the figures are the names of the tapes,
documents, and operations symbolized. Keypunching and
manual data lookup operations are labeled but not named.
A brief description of the purpose of each step follows.
Merge 1. This first step matched and merged data from
the "instructor header record" tape (one instructor header
record per course section evaluated) with data from the
"professional history file" tape. The merged data were
output onto both paper and tape, and information about
missing data was also printed on the paper output for use
in the next step.
First missing data input. In this step, the
information obtained in the merge 1 step was used to look
up and punch onto cards the data missing from the
professional history file tape. The resulting deck of
cards was retained for input into the merge 5 step below.
Merge 2. This step matched and merged the faculty
evaluitions data with the merged output from the merge I
step. The individuai ratings data were used to compute .
means, N's, and-standard deviations for each'item on the
rating scale, and the reoalts were output-onto both paper
and tape.
Sort 1. The records on the output tape from the merge
2 step abolie were sorted by branch, department, cowse,
number, and Section number. The sorted records were output
onto another tape for later use in the merge 3 step.
Sort 2. The grade distribution data were sorted as in
sort 1 above so that the course sections would be in the
same sequence on both tapes for input into the merge 3
program.
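Both sorts order the records on the same composite key, so that the two tapes can be read in step by the merge 3 program. A one-line modern equivalent, with invented records:

    # Sort on the composite key (branch, department, course number, section).
    records = [{"branch": 1, "dept": "MATH", "course": 101, "section": 2},
               {"branch": 1, "dept": "ENGL", "course": 200, "section": 1}]
    records.sort(key=lambda r: (r["branch"], r["dept"], r["course"], r["section"]))
    print(records)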
Merge 3. This program matched and merged the grade
distribution data with the other data previously assembled
for each course section. Computations of average grades
and standard deviations of grades in the course sections
were made at this point, and the merged results were output
onto paper and tape. Also on the paper output was
information about missing data detected by this program.
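The grade computations amount to a weighted mean and standard deviation over the section's grade distribution. The sketch below assumes the conventional A = 4 through F = 0 point scale (an assumption; the scale is defined in the body of the study) and invented counts:

    import math

    # Section mean grade and SD from a grade distribution (counts invented).
    points = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
    dist = {"A": 6, "B": 12, "C": 9, "D": 2, "F": 1}
    n = sum(dist.values())
    mean_grade = sum(points[g] * k for g, k in dist.items()) / n
    var = sum(k * (points[g] - mean_grade) ** 2 for g, k in dist.items()) / n
    print(round(mean_grade, 2), round(math.sqrt(var), 2))  # population SD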
Second missing data input. The information on missing
data from merge 3 was used, in this step, to create a deck
of cards containing that missing data. The cards were
retained for input into the merge 5 step below.
Merge 4. This program matched students' records of
their Quantitative and Verbal Scholastic Aptitude Test scores
with their declared majors in order to compute for each
department the variable "department quantitativeness." The
results of these computations were output onto paper and
cards. The deck of cards was retained for input into the
merge 5 program.
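The exact formula for department quantitativeness is given in the body of the study; purely for illustration, the sketch below aggregates students' quantitative scores by department of declared major, taking a simple mean (a Q-minus-V difference or Q-to-V ratio would change only the aggregated quantity):

    from collections import defaultdict

    # Hypothetical (dept of major, SAT-Q, SAT-V) records.
    students = [("MATH", 680, 560), ("MATH", 640, 600), ("ENGL", 520, 650)]
    totals = defaultdict(lambda: [0, 0])  # dept -> [sum of SAT-Q, count]
    for dept, sat_q, sat_v in students:
        totals[dept][0] += sat_q
        totals[dept][1] += 1
    print({d: s / n for d, (s, n) in totals.items()})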
Merge 5. The purpose of this step was to insert the
data on cards (from previous steps) into the course section
records, to compute several new variables from old ones (age
from date of birth, for example), and to edit the data for
the acceptability of the values. The results were output
onto paper and tape, and information about missing data and
improper values was also printed on the output paper.
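The derive-and-edit logic can be pictured as below; the cutoff date and the acceptable age range are invented for the example, since the study's actual edit criteria are not restated here:

    from datetime import date

    # Derive age from date of birth and flag improper values (merge 5 edit).
    def derive_and_edit(record, as_of=date(1974, 1, 1)):
        dob = record["date_of_birth"]
        age = as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))
        record["age"] = age
        if not 20 <= age <= 75:  # range check is illustrative only
            record.setdefault("flags", []).append("age out of range")
        return record

    print(derive_and_edit({"date_of_birth": date(1940, 6, 15)}))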
Third missing data input. The information on missing
or unacceptable data from the merge 5 program was used to
create a deck of cards containing the missing data. This
deck was retained for input into the merge 6 program below.
Merge 6. This program was used to insert the missing
data discovered in the merge 5 step into the final data
records, which were output onto paper and tape.
Regression run 1. This step accomplished the reduction
of the initial battery of predictor variables using the
stepwise multiple regression analysis subprogram of the
Statistical Package for the Social Sciences (SPSS; Nie et
al., 1970). The results of this analysis were used to make
the ordered list of variables (the reduced set in the order
of their inclusion in the regression equation) for input
into the next step.
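The SPSS routine itself is documented in Nie et al. (1970); as a conceptual stand-in, the following Python sketch performs a forward stepwise selection, entering at each step the predictor that most improves R-squared and stopping when the improvement becomes trivial. (Real stepwise programs use an F-to-enter criterion rather than this simplified gain threshold, and the data here are random stand-ins, not the study's variables.)

    import numpy as np

    def r_squared(X, y):
        X1 = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

    def forward_stepwise(X, y, min_gain=0.01):
        chosen, remaining, best = [], list(range(X.shape[1])), 0.0
        while remaining:
            r2, j = max((r_squared(X[:, chosen + [j]], y), j) for j in remaining)
            if r2 - best < min_gain:
                break  # no remaining predictor adds enough
            chosen.append(j); remaining.remove(j); best = r2
        return chosen, best  # variables in order of inclusion

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))
    y = 2 * X[:, 0] - X[:, 3] + rng.normal(size=200)
    print(forward_stepwise(X, y))  # expect columns 0 and 3 to be chosen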
Regression run 2. This step was another stepwise
multiple regression analysis using SPSS as above, but with
the five grading variables added to the optimally reduced
subset of predictors found in regression run 1.
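The logic of testing the increment contributed by the grading variables is the standard hierarchical F ratio for the gain in R-squared when a block of predictors is added; the figures below are invented stand-ins, not the study's results:

    # F test for the increment in R-squared when k_full - k_reduced
    # predictors (here, a block of grading variables) are added.
    def incremental_f(r2_full, r2_reduced, n, k_full, k_reduced):
        num = (r2_full - r2_reduced) / (k_full - k_reduced)
        den = (1 - r2_full) / (n - k_full - 1)
        return num / den  # df = (k_full - k_reduced, n - k_full - 1)

    print(round(incremental_f(0.40, 0.31, 2000, 15, 10), 1))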
Figure 5. Key to symbols used in Figures 6 through 18.
(The key identifies these flowchart symbols: magnetic computer tape; paper document or computer output paper; processing step or program; deck of computer cards; keypunching step; manual operation; sorting operation.)
[Figures 6 through 18, flowcharts of the steps described above, appear here in the original; only their captions are reproduced. Tape, document, and deck names visible within the symbols include PROFMD, LTAB866, TAB869, MRG5MD, VERMAT, DEPTCD, MAJORS, TABDQT, DEPTQT, TABRR1, and TABRR2.]
Figure 6. Merge 1.
Figure 7. First missing data input.
Figure 8. Merge 2.
Figure 9. Sort 1.
Figure 10. Sort 2.
Figure 11. Merge 3.
Figure 12. Second missing data input.
Figure 13. Merge 4.
Figure 14. Merge 5.
Figure 15. Third missing data input.
Figure 16. Merge 6.
Figure 17. Regression run 1.
Figure 18. Regression run 2.
BIBLIOGRAPHY
Aleamoni, L. M., & Graham, M. H. The relationship between CEQ ratings and instructor's rank, class size, and course level. Journal of Educational Measurement, 1974, 11, 189-202.
American Psychological Association. Publication manual of the American Psychological Association (Rev. ed.). Washington, D.C.: Author, 1967.
American Psychological Association. Publication manual of the American Psychological Association (2nd ed.). Washington, D.C.: Author, 1974.
Anikeef, A. M. Factors affecting student evaluation of college faculty members. Journal of Applied Psychology, 1953, 37, 458-460.
Bain, P. T. The pass-fail option: The congruence between the rationale for and student reasons in electing it. Journal of Educational Research, 1973, 66.
Baird, L., & Feister, W. J. Grading standards: The relation of changes in average student ability to the average grades awarded. American Educational Research Journal, 1972, 9, 431-442.
Bausell, R. B., & Magoon, J. Expected grade in a course, grade point average, and student ratings of the course and the instructor. Educational and Psychological Measurement, 1972, 32, 1013-1023. (a)
Bausell, R. B., & Magoon, J. The persistence of first impressions in course and instructor evaluations. Paper presented at the meeting of the American Educational Research Association, Chicago, April 1972. (b)
Bendig, A. W. A preliminary study of the effect of academic level, sex, and course variables on the student rating of psychology instructors. Journal of Psychology, 1952, 34, 21-26.
Bendig, A. W. The relation of level of course achievement to students' instructor and course ratings in introductory psychology. Educational and Psychological Measurement, 1953, 13, 437-448. (a)
Bendig, A. W. Student achievement in introductory psychology and student ratings of the competence and empathy of their instructors. Journal of Psychology, 1953, 36, 427-433. (b)
Bendig, A. W. A factor analysis of student ratings of psychology instructors on the Purdue Scale. Journal of Educational Psychology, 1954, 45, 385-393.
Best, J. W. Research in education (2nd ed.). Englewood Cliffs, New Jersey: Prentice-Hall, 1970.
Blum, M. L. An investigation of the relation existing between students' grades and their ratings of the instructors' ability to teach. Journal of Educational Psychology, 1936, 27, 217-221.
Brown, G. D. System/360 job control language. New York: Wiley, 1970.
Cal State to probe system's grading practices. The Chronicle of Higher Education, 1974, 8(18), 2.
Capozza, D. R. Student evaluations, grades and learning in economics. Western Economic Journal, 1973, 11, 127.
Carrier, N. A., Howard, G. S., & Miller, W. G. Course evaluation: When? Journal of Educational Psychology, 1974, 66, 609-613.
Centra, J. A. Effectiveness of student feedback in modifying college instruction. Journal of Educational Psychology, 1973, 65, 395-401.
Centra, J. A. College teaching: Who should evaluate it? Findings, 1974, 1(1), 5-8.
Committee on the Student in Higher Education. The student in higher education. New Haven, Connecticut: The Hazen Foundation, 1968.
Cooley, W. W., & Lohnes, P. R. Multivariate data analysis. New York: Wiley, 1971.
Costin, F., Greenough, W. T., & Menges, R. J. Student ratings of college teaching: Reliability, validity and usefulness. Review of Educational Research, 1971, 41, 511-535.
Costin, F., & Grush, J. E. Personality correlates of teacher-student behavior in the college classroom. Journal of Educational Psychology, 1973, 65, 35-44.
Darlington, R. B. Multiple regression in psychological research and practice. Psychological Bulletin, 1968, 69, 161-182.
Downgrading no-grade. Time, 1974, 103(5), 66.
Doyle, K. O., Jr., & Whitely, S. E. Student ratings as criteria for effective teaching. American Educational Research Journal, 1974, 11, 259-274.
Duke University Computation Center. TSAR user's manual. Durham, North Carolina: Author, 1971.
Dziuban, C. D., & Shirkey, E. C. On the psychometric assessment of correlation matrices. American Educational Research Journal, 1974, 11, 211-216.
Eble, K. E. Improving college teaching. Phi Delta Kappan, 1971, 52, 283-285. (a)
Eble, K. E. Study indicates teaching is poor. College Management, 1971, 6, 29. (b)
Eble, K. E., & The Conference on Career Development. Career development of the effective college teacher. Washington, D.C.: American Association of University Professors.
Elmore, P. B., & LaPointe, K. A. Effects of teacher sex and student sex on the evaluation of college instructors. Journal of Educational Psychology, 1974, 66, 386-389.
Etaugh, A. Reliability of college grades and grade point averages: Some implications for prediction of academic performance. Educational and Psychological Measurement, 1972, 32, 1043-1049.
Fahey, G. L. Student rating of teaching: Some questionable assumptions. Paper presented at the conference at the University of Pittsburgh Institute for Higher Education, Pittsburgh, December 1970.
Flieger, H. If Johnny gets an "A." U.S. News and World Report, 1973, 75(2), 72.
French-Lazovik, G. Predictability of students' evaluations of college teachers from component ratings. Journal of Educational Psychology, 1974, 66, 373-385.
Frey, P. W. The ongoing debate: Student evaluation of teaching. Change, 1974, 6(1), pp. 47-48; 64.
Gadzella, B. M. College students' views and ratings of an ideal professor. College and University, 1968, 44, 80-96.
Garber, H. Certain factors underlying the relationship between course grades and student judgments of college teachers (Doctoral dissertation, University of Connecticut, 1964). Dissertation Abstracts, 1964, 6512. (University Microfilms No. 66-00847)
Garber, H. Some relationships between course grades and student judgments of college teachers. Paper presented at the meeting of the American Educational Research Association, New York, February 1967.
Gessner, P. K. Evaluation of instruction. Science, 1973, 180, 566-570.
Granzin, K. L., & Painter, J. J. A new explanation for students' course evaluation tendencies. American Educational Research Journal, 1973, 10, 115-124.
Group for Human Development in Higher Education. Faculty development in a time of retrenchment. New Rochelle, New York: Change Magazine, 1974.
Heilman, J. D., & Armentrout, W. D. The rating of college teachers on ten traits by their students. Journal of Educational Psychology, 1936, 27, 197-216.
Hills, J. R. Consistent college grading standards through equating. Educational and Psychological Measurement, 1972, 32, 137-146.
Holmes, D. S. The relationship between expected grades and students' evaluations of their instructors. Educational and Psychological Measurement, 1971, 31, 951-957.
Holmes, D. S. Effects of grades and disconfirmed grade expectations on students' evaluations of their instructor. Journal of Educational Psychology, 1972, 63, 130-133.
Jaeger, R. M., & Freijo, T. D. Some psychometric questions in the evaluation of professors. Journal of Educational Psychology, 1974, 66, 416-423.
Kawecki, G. Teacher evaluations: Are they accurate? Connecticut Daily Campus, 1974, 77(119), pp. 1; 7.
Keefer, K. E. Characteristics of students who make accurate and inaccurate self-predictions of college achievement. Journal of Educational Research, 1971, 64, 401-404.
Kerlinger, F. N. Foundations of behavioral research (2nd ed.). New York: Holt, Rinehart and Winston, 1973.
Kerlinger, F. N., & Pedhazur, E. J. Multiple regression in behavioral research. New York: Holt, Rinehart and Winston, 1973.
Kirschenbaum, H., Napier, R., & Simon, S. B. Wad-ja-get? The grading game in American education. New York: Hart, 1971.
Ladas, H. Grades: Standardizing the unstandardized standard. Phi Delta Kappan, 1974, 56, 185-187.
LeComte, E. Does student power corrupt? National Review, 1974, 26, 1166.
Lee, C. B. (Ed.). Improving college teaching. Washington, D.C.: American Council on Education, 1967.
Maeroff, G. I. Students' scores again show drop. New York Times, 1973, 123(42), pp. 1; 26.
Malinowski, S. N. Not discarded. Connecticut Daily Campus, 1974, 77(100), 2.
McCracken, D. D. A guide to FORTRAN IV programming. New York: Wiley, 1965.
McDaniel, E. D., & Feldhusen, J. F. Relationships between faculty ratings and indexes of service and scholarship. Proceedings of the 78th Annual Convention of the American Psychological Association, 1970, 619-620. (Summary)
McKeachie, W. J. Student ratings of faculty. AAUP Bulletin, 1969, 55, 439-444.
McKeachie, W. J. Research on student ratings of teaching. Paper presented at the conference at the University of Pittsburgh Institute for Higher Education, Pittsburgh, December 1970.
McKeachie, W. J. The decline and fall of the laws of learning. Educational Researcher, 1974, 3(3), 7-11.
McNemar, Q. Psychological statistics (3rd ed.). New York: Wiley, 1962.
Menzie, J. C. An analysis of the process of teacher evaluation in the community college. Los Angeles, Calif.: University of California at Los Angeles, 1973. (ERIC Document Reproduction Service No. ED 083 960)
Moellenberg, W. To grade or not to grade; is that the question? College and University, 1973, 49, 5-13.
Nie, N. H., Bent, D. H., & Hull, C. H. Statistical package for the social sciences. New York: McGraw-Hill, 1970.
Nunnally, J. C. Psychometric theory. New York: McGraw-Hill, 1967.
Ockerman, E. W. Student input in changing grading systems. College and University, 1972, 47, 390-399.
Owen, S. V. Student ratings of teacher competence: A cautionary note. Paper presented at the joint meeting of the Northeastern Educational Research Association and the National Council on Measurement in Education, Ellenville, New York, November 1974.
Pambookian, H. S. Initial level of student evaluation of instruction as a source of influence on instructor change after feedback. Journal of Educational Psychology, 1974, 66, 52-56.
Pape, R. Evaluate instructors each semester. Connecticut Daily Campus, 1974, 77(106), 2. (a)
Pape, R. Evaluation in midstream. Connecticut Daily Campus, 1974, 77(96), 2. (b)
Parent, E. R., Vaughan, C. E., & Wharton, K. A new approach to course evaluation. Journal of Higher Education, 1971, 42, 133-138.
Peck, R. Can students evaluate their education? Education Digest, 1971, 36, 29-31.
Postman, N. A D+ for Mr. Ladas. Phi Delta Kappan, 1974, 56, 187-188.
Potter, D., Nalin, P., & Lewandowski, A. The relation of student achievement and student ratings of teachers. Paper presented at the meeting of the American Educational Research Association, New Orleans, March 1973.
Rodin, M. Students evaluate teachers. Change, 1974, 6(2), 5.
Rodin, M., & Rodin, B. Student evaluations of teachers. Science, 1972, 177, 1164-1166.
Rummel, R. J. Applied factor analysis. Evanston, Illinois: Northwestern University Press, 1970.
Silverberg, S. Grad finds grades inflating. Connecticut Daily Campus, 1974, 77(113), 2.
Slysz, W. D. An evaluation of statistical software in the social sciences. Communications of the ACM, 1974, 17, 326-332.
Spady, W. G. Authority, conflict, and teacher effectiveness. Educational Researcher, 1973, 2(1), 4-10.
Stallings, W. A., & Singhal, S. Some observations on the relationships between research productivity and student evaluations of courses and teaching. Paper presented at the meeting of the American Educational Research Association, Los Angeles, February 1969.
Stanley, J. C. K-R 20 as the stepped-up mean item intercorrelation. National Council on Measurements Used in Education Yearbook, 1957, 14, 78-92.
St. Onge, K. R. Let's get back to essentials. Change, 1974, 6(5), pp. 6-7; 62.
Sullivan, A. M., & Skanes, G. R. Validity of student evaluation of teaching and the characteristics of successful instructors. Journal of Educational Psychology, 1974, 66, 584-590.
Tatsuoka, M. M., & Tiedeman, D. V. Statistics as an aspect of scientific method in research on teaching. Chapter IV in N. L. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally & Co., 1963.
Too many A's. Time, 1974, 104(20), 106.
Touq, M. S., & Feldhusen, J. F. The relationship between students' ratings of instructors and their participation in classroom discussion. Paper presented at the meeting of the National Council on Measurement in Education, New Orleans, February 1973.
Treffinger, D. J., & Feldhusen, J. F. Predicting students' ratings of instruction. Proceedings of the 78th Annual Convention of the American Psychological Association, 1970, 621-622. (Summary)
Turnbull, W. W. Testing shifts from 'which' to 'how.' New York Times, 1974, 123(42), 82.
Vacon, B. Faculty evaluate grade distribution. Connecticut Daily Campus, 1974, 77(100), pp. 1; 4. (a)
Vacon, B. Grades: Do they pass the test? Connecticut Daily Campus, 1974, 77(105), pp. 1; 5. (b)
Vacon, B. New requirements limit students on Dean's List. Connecticut Daily Campus, 1974, 77(73), 3. (c)
Veldman, D. J. Fortran programming for the behavioral sciences. New York: Holt, Rinehart and Winston, 1967.
Voeks, V. W., & French, G. M. Are student ratings of teachers affected by grades? Journal of Higher Education, 1960, 31, 330-334.
Watkins, B. T. A-to-F grading system heavily favored by undergraduate, graduate institutions. The Chronicle of Higher Education, 1973, 8(8), 2.
Widlak, F. K., & Feldhusen, J. F. Factor analyses of an instructor rating scale. Paper presented at the meeting of the American Educational Research Association, New Orleans, February 1973.
Winer, B. J. Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill, 1971.
Worthington, L. H., & Grant, C. V. Factors of academic success: A multivariate analysis. Journal of Educational Research, 1971, 65, 7-10.
Zelby, L. W. Student-faculty evaluation. Science, 1974, 183, 1267-1270.