DOCUMENT RESUME

ED 134 599                                                TM 005 995

AUTHOR          Brown, David Lile
TITLE           Faculty Ratings and Student Grades: A Large-Scale
                Multivariate Analysis by Course Sections.
PUB DATE        Dec 74
NOTE            127p.; Ph.D. Dissertation, University of Connecticut
AVAILABLE FROM  Dr. David Lile Brown, 36 Mansfield Hollow Road
                Extension, Mansfield Center, Connecticut 06250 ($5.00)
EDRS PRICE      MF-$0.83 Plus Postage. HC Not Available from EDRS.
DESCRIPTORS     Bias; Effective Teaching; Factor Analysis; *Grades
                (Scholastic); *Higher Education; Multiple Regression
                Analysis; Predictor Variables; Rating Scales;
                *Statistical Analysis; *Student Evaluation of Teacher
                Performance; Validity
IDENTIFIERS     University of Connecticut
ABSTRACT
The purpose of this dissertation is to provide evidence bearing on the question of the influence of the grades students receive on their ratings of the college teachers who gave them those grades. Specifically, certain characteristics of the grade distribution within each course section are evaluated as predictors of the students' ratings of the teacher of that course section. Multivariate techniques were employed to evaluate data across an entire university. Over 30,000 anonymous student ratings of 2,360 course sections were collected after students had received final course grades, and without student or administrator knowledge that the ratings would be used in the study. Factor analysis was used to reduce the eight-item rating instrument to a single criterion variable. Subsequently, stepwise multiple regression analysis was used, both to reduce an initial battery of predictors to an optimally reduced subset, and to test the incremental importance of certain grading variables as predictors of the criterion. The primary implication of the study is that there is a relationship between grades and ratings, but it only accounts for about nine percent of the variance in the ratings. There is a significant bias, but factors other than grades must also be influencing student ratings. Whether or not these other factors are valid measures of teaching effectiveness remains to be determined, but one seemingly invalid factor (the grading bias) has been identified. (Author/RC)
FACULTY RATINGS AND STUDENT GRADES:
A LARGE-SCALE MULTIVARIATE ANALYSIS BY COURSE SECTIONS
BY
DAVID LILE BROWN
FIRST PUBLICATION, DECEMBER 1974
COPYRIGHT © DAVID LILE BROWN 1974
ABSTRACT
FACULTY RATINGS AND STUDENT GRADES:
A LARGE-SCALE MULTIVARIATE ANALYSIS BY COURSE SECTIONS
David Lile Brown, Ph.D.
The University of Connecticut, 1974
Effective teaching has been a primary educational goal
throughout history, and the evaluation of teaching is
therefore one of the central concerns of education. Student
ratings are often considered one of the best indicators of
teaching effectiveness, because the student is in the most
privileged position to view the teaching and to experience
its effects. The use of such ratings has become widespread,
but controversy rages over whether student ratings are valid
measures of teaching effectiveness, especially when used for
making decisions about faculty pay, promotion, and tenure.
It is possible that students are not mature enough to
appraise teaching or, worse, that they may be exploiting
rating systems to punish strict teachers and to reward
lenient ones. If so, this would have important implications
for the interpretation of student ratings. Proper
interpretation requires an answer to the research question:
What is the influence of the grades students receive on
their ratings of the college teachers who gave them those
grades? This question has not been answered satisfactorily
by previous research. The former evidence has been
conflicting and inadequate in a number of ways. The
present study, however, was intended to avoid many of the
shortcomings of previous studies.
This study employed multivariate techniques to evaluate
data across an entire university. Over 30,000 anonymous
student ratings of 2,360 course sections were collected
after students had received final course grades, and without
student or administrator knowledge that the ratings would
be used in this study. Factor analysis was used to reduce
the eight-item rating instrument to a single criterion
variable. Subsequently, stepwise multiple regression
analyses were used, both to reduce an initial battery of
predictors to an optimally reduced subset, and to test the
incremental importance of certain grading variables as
predictors of the criterion.
Results showed that the simple correlation between the
average student grade in each course section and the average
student rating of the teacher of that course section was
.35, p < .000001. Moreover, the average grade was the
single best predictor, of those available, of the average
rating; and when average grade was added to the optimally
reduced subset of other predictors, it significantly
improved the multiple correlation from .25 to .39,
F (4, 2345) = 60.13, p < .001.
The primary implication of the results of this study
is that there is a relationship between grades and ratings,
but it only accounts for about 9% of the variance in
ratings. In other words, there is a significant bias, but
factors other than grades must also be influencing student
ratings. Whether or not these other factors are valid
measures of teaching effectiveness remains to be determined,
but one seemingly invalid factor (the grading bias) has
been identified through this study. The interpretation of
student ratings should take this bias into account, and
methods should be devised to eliminate it.
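The 9% figure can be checked against the statistics reported above; it corresponds to the increment in the squared multiple correlation when the grading variables are added (a reader's arithmetic, not a computation stated in the text):

$$\Delta R^2 = .39^2 - .25^2 = .1521 - .0625 \approx .09$$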
FACULTY RATINGS AND STUDENT GRADES:
A LARGE-SCALE MULTIVARIATE ANALYSIS BY COURSE SECTIONS
David Lile Brown
B.S., Tufts University, 1968
M.A., University of Connecticut, 1969
A Dissertation
Submitted in Partial Fulfillment of the
Requirements for the Degree of
Doctor of Philosophy
at
The University of Connecticut
1974
Copyright by
David Lile Brown
1974
APPROVAL PAGE
Doctor of Philosophy Dissertation
FACULTY RATINGS AND STUDENT GRADES:
A LARGE-SCALE MULTIVARIATE ANALYSIS BY COURSE SECTIONS
Presented by
David Lile Brown, B.S., M.A.
Major Advisor
Associate Advisor
Associate Advisor
Associate Advisor
Associate Advisor
The University of Connecticut
1974
ACKNOWLEDGMENTS
This writer wishes to express his appreciation to:
Prof. Ellis B. Page, who, as his major advisor, has
been a tremendous source of inspiration and encouragement
during the past six years;
The other members of his advisory committee, Drs.
Steven V. Owen, Robert K. Gable, Richard W. Whinfield, and
Marvin Rothstein, whose suggestions, corrections, and
insights were most valuable in guiding this effort to a
successful conclusion;
The University of Connecticut Computer Center for the
use of computer facilities;
Dr. Dorothy C. Goodwin, Mrs. Shirley Malinowski, Mrs.
Althea McLaughlin, and Mrs. Meryl Kogan of the University of
Connecticut Bureau of Institutional Research for their
assistance with data collection and data editing;
Mr. Richard J. Stec and Miss Barbara J. Sochor of the
University of Connecticut Student Data Systems and
Programming Office for assistance with data collection and
computer tape copying;
Mrs. Katherine J. Brown and Mr. Rudy Voit of the
Registrar's Office for assistance with data collection; and
Lile H. and Helen W. Brown, his parents, for support,
patience, and encouragement during this effort.
TABLE OF CONTENTS

                                                          Page
ACKNOWLEDGMENTS                                            iii
LIST OF TABLES                                               v
LIST OF FIGURES                                             vi

Chapter
  I. STATEMENT AND DEFINITION OF THE PROBLEM                 1
       Statement of the Problem                              1
       Purpose of the Study                                  4
       Need for the Study                                    6
       Sources of the Data                                  18
 II. SUMMARY OF RELATED RESEARCH                            19
       Validity of Student Ratings                          19
       Related Issues                                       25
       Summary                                              30
III. PROCEDURE                                              31
       Subjects                                             31
       Instrument                                           31
       Initial Predictor Variables                          35
       Grading Variables                                    41
       Criterion Variables                                  42
       Statistical Analyses                                 43
       Summary                                              46
 IV. RESULTS                                                47
       Factor Analyses of the Rating Instrument's Items     47
       Reduction of the Initial Battery of
         Predictor Variables                                50
       Addition of the Five New Predictor Variables
         to the Optimally Reduced Subset                    60
       Parallel Results Using Other Criteria                62
       Summary                                              65
  V. DISCUSSION AND CONCLUSIONS                             66
       Discussion of Results and Implications               66
       Conclusions and Recommendations                      77
APPENDIX                                                    81
BIBLIOGRAPHY                                                99
LIST OF TABLES

Table                                                     Page
 1. Data for Figure 1, QPR Cutoff Points for
      Graduating Seniors, 1950-1974                         13
 2. Data for Figure 2, Median QPR's for All
      Undergraduates, 1952-1974                             14
 3. Data for Figure 3, Average of All A-Through-F
      Grades, 1961-1974                                     15
 4. Mean Differences Between Quantitative and
      Verbal SAT Scores of Undergraduates with
      Different Majors                                      39
 5. Factor Analyses of the Eight Items on the
      University of Connecticut Rating Scale
      for Instruction                                       48
 6. Product-moment Correlations Among the Eight
      Items on the University of Connecticut Rating
      Scale for Instruction for 1972 and 1973               49
 7. Means and Standard Deviations of the
      Predictor and Criterion Variables                     51
 8. Product-moment Correlations Among the
      Predictor and Criterion Variables                     52
 9. First Stepwise Multiple Regression Analysis--
      Reduction of the Initial Battery of
      Predictor Variables                                   59
10. Second Stepwise Multiple Regression Analysis--
      Addition of the Five Grading Variables
      to the Optimally Reduced Subset                       61
LIST OF FIGURES

Figure                                                    Page
 1. Quintile QPR cutoff points for graduating
      seniors, 1950-1974                                    10
 2. Median QPR's of all undergraduates after
      the end of each semester, 1952-1974                   11
 3. Average of all A-through-F grades given,
      semester by semester, 1961-1974                       12
 4. The University of Connecticut Rating
      Scale for Instruction                                 32
 5. Key to symbols used in Figures 6-18                     85
 6. Merge 1                                                 86
 7. First missing-data input                                87
 8. Merge 2                                                 88
 9. Sort 1                                                  89
10. Sort 2                                                  90
11. Merge 3                                                 91
12. Second missing-data input                               92
13. Merge 4                                                 93
14. Merge 5                                                 94
15. Third missing-data input                                95
16. Merge 6                                                 96
17. Regression run 1                                        97
18. Regression run 2                                        98
CHAPTER I
STATEMENT AND DEFINITION OF THE PROBLEM
Statement of the Problem
Considerable recent controversy has centered around
faculty evaluation methods, especially student ratings of
college teachers (Centra, 1973; Doyle & Whitely, 1974; Frey,
1974; Zelby, 1974). According to several researchers
(Bausell & Magoon, 1972a; Capozza, 1973; Carrier, Howard, &
Miller, 1974; Centra, 1973; Costin, Greenough, & Menges,
1971; French-Lazovik, 1974; Menzie, 1973; Treffinger &
Feldhusen, 1970), the use of student ratings to evaluate the
faculty has become so widespread that it is now almost taken
for granted at institutions of higher learning. Controversy
rages, however, over the validity of such ratings, especially
concerning "the trend toward formal, quantitative use of the
results of the evaluations in determinations of faculty
promotions and salaries" (Zelby, 1974, p. 1267).
Faculty reactions to student evaluation of teaching
range from acclaim to outrage. Supporters point to the need
for evaluation and to the potential that ratings may have
for improving education. Opponents emphasize that students
may not be proper judges, or that rating instruments may not
ask students the right questions.
The whole issue is complex, but especially this
question of validity. Are student ratings a valid measure
of teacher effectiveness? The answer depends, of course, on
definitions of validity and of teacher effectiveness. Since
there are no universally accepted definitions of these,
progress toward the development of such measures is impeded.
Nevertheless, attempts at progress commonly are made with a
rather vague notion of what "effective teaching" means.
If factual learning alone were the objective of
effective teaching, one could validly measure various
teachers' effectiveness through achievement testing. Indeed,
many would support such measures as extremely valid, but
others would insist that the "real goal in teaching is to
impart philosophical values or to inculcate a special
attitude toward learning rather than to simply help the
student to master the subject matter" (Frey, 1974, p. 47).
Still others would be inclined to consider the promotion of
factual learning primary, but in combination with certain
affective factors. How effective, for example, is the
teacher in drawing students' attention and affection? Or
how are the students made to feel about the specific subject
matter they are learning? Does the teacher make it
interesting?
Whatever view one takes of effective teaching, the
major issue relevant to student ratings is still validity:
Are student ratings a valid measure of effective teaching?
Some would argue that student ratings are valid because the
student, as the consumer, is in the best position to view
the teaching and to experience the effects of the teaching
(see Centra, 1974; Costin, Greenough, & Menges, 1971; Frey,
1974; McKeachie, 1969).
Others would argue, for various reasons, that student
ratings are not valid as a measure of teaching effectiveness.
For example, LeComte (1974), Peck (1971), and St.Onge (1974)
express strong doubts that students are in any position to
judge the scholarship offered by the faculty. According to
Costin et al. (1971), student ratings are typically
challenged on the grounds "that student ratings are
unreliable, that the ratings will favor an entertainer over
the instructor who gets his material across effectively,
that ratings are highly correlated with expected grades (a
hard grader would thus get poor ratings) and that students
are not competent judges of instruction since long-term
benefits of a course may not be clear at the time it is
rated" (p. 511).
One of the most important points of contention is
whether or not students can be objective enough for their
ratings to be valid. Even though student ratings are
necessarily subjective opinions, they could still be valid
measures of effective teaching if students were able to
identify effective teaching (or at least some component or
components thereof), and if their ratings were unbiased by
irrelevant variables. Some believe, however, that students
may bias their ratings in favor of lenient teachers, or at
least in favor of the teachers in whose courses they get the
highest grades. If so, this would have important
implications for the interpretation of such ratings. It is
even likely that such ratings would be at odds with larger
educational objectives. For instance, "given a specific
format, it is possible to adapt one's teaching technique to
obtain a good or bad evaluation and . . . a good evaluation
may be associated with a teaching technique of lesser
educational value than a poor evaluation" (Zelby, 1974, p.
1268).
Therefore, one may pose the research question: What is
the influence of the grades students receive on their
ratings of the college teachers who gave them those grades?
Purpose of the Study
The purpose of this study is to provide evidence
bearing on the research question. Specifically, certain
characteristics of the grade distribution within each course
section are evaluated as predictors of the students' ratings
of the teacher of that course section. The following are
the grading variables (a computational sketch follows the list):
1. Average of the grades in the course section,
2. Standard deviation of the grades,
3. Variance of the grades,
4. Skewness of the grades, and
5. Kurtosis of the grades.
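As a concrete illustration (referenced above), the following is a minimal computational sketch of these five variables for a single course section, assuming grades are already coded numerically (A = 4 through F = 0). The moment conventions shown (population standard deviation, excess kurtosis) are assumptions; the study does not specify which conventions its programs used.

import numpy as np
from scipy import stats

def grading_variables(grades):
    """The five grading variables for one course section."""
    g = np.asarray(grades, dtype=float)
    return {
        "mean": g.mean(),
        "sd": g.std(),                  # population standard deviation (assumed)
        "variance": g.var(),
        "skewness": stats.skew(g),
        "kurtosis": stats.kurtosis(g),  # excess kurtosis (normal = 0; assumed)
    }

# Hypothetical section with grades A, B, B, C, C, C, D, F:
print(grading_variables([4, 3, 3, 2, 2, 2, 1, 0]))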
Stepwise multiple regression analysis is employed to
evaluate these grading variables for their effectiveness as
predictors, when used in combination with certain other
predictors of such faculty ratings. These other predictors
constitute an "optimally reduced subset" of the following
variables:
1. Sex,
2. Appointment length,
3. Percentage employed,
4. Class size,
5. Number of years tenured,
6. Age,
7. Years since hiring,
8. Course level,
9. Title level,
10. Course location,
11. Department quantitativeness, and
12. Rate (percentage) of return of ratings.
An "optimally reduced subset" is defined as the subset which
had the lowest standard error of estimate during a prior
stepwise multiple regression analysis.
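A simplified sketch of this criterion follows, assuming a predictor matrix X (course sections by predictors) and a criterion vector y. The study used a standard stepwise program whose exact entry and removal rules are not reproduced here; this sketch uses plain forward selection and keeps the step at which the standard error of estimate was lowest.

import numpy as np

def see(X, y):
    """Standard error of estimate for an OLS fit with an intercept."""
    n, k = X.shape
    Xi = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    resid = y - Xi @ beta
    return np.sqrt(resid @ resid / (n - k - 1))

def optimally_reduced_subset(X, y):
    """Forward selection; keep the subset whose model had the lowest SEE."""
    remaining = list(range(X.shape[1]))
    chosen, best_subset, best_see = [], [], np.inf
    while remaining:
        # At each step, enter the predictor that most reduces the SEE.
        j = min(remaining, key=lambda c: see(X[:, chosen + [c]], y))
        chosen.append(j)
        remaining.remove(j)
        s = see(X[:, chosen], y)
        if s < best_see:
            best_see, best_subset = s, list(chosen)
    return best_subset, best_see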
The hypothesis of this study is that the grading
variables are strongly related to student ratings of college
teachers, and that they will significantly improve the best
prediction of those ratings attained by the other available
predictors.
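The incremental test named in this hypothesis is the standard F test for a gain in the squared multiple correlation when a block of predictors is added. The sketch below is illustrative rather than the study's program; the sample values echo the figures reported in the abstract (R rising from .25 to .39 over the 2,360 course sections), and the degrees of freedom are assumptions chosen to match the reported F(4, 2345).

from scipy import stats

def incremental_f(r2_reduced, r2_full, n, k_full, q):
    """F test for the gain in R-squared when q predictors are added.

    n is the number of cases, k_full the number of predictors in the
    full model, and q the number of predictors in the added block.
    """
    df1, df2 = q, n - k_full - 1
    f = ((r2_full - r2_reduced) / df1) / ((1.0 - r2_full) / df2)
    p = stats.f.sf(f, df1, df2)  # upper-tail probability
    return f, p

# Rounded R's from the abstract, hence an approximate F:
f, p = incremental_f(r2_reduced=0.25**2, r2_full=0.39**2,
                     n=2360, k_full=14, q=4)
print(f"F(4, 2345) = {f:.1f}, p = {p:.1e}")  # roughly F = 62, p far below .001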
Need for the Study
The evaluation of teaching is one of the central
concerns of education. Effective teaching has been a
primary educational goal throughout history. As mentioned
above, one commonly used measure of effective teaching is
evaluation by the student. It would be hoped that student
ratings would be valid and that they would promote effective
teaching through feedback to teachers and administrators.
However, it is not certain that students are mature enough,
wise enough, or objective enough to evaluate such
instructional behavior without bias. Are student ratings
valid measures of effective-teaching? Or are they more or
less a "payoff" in a sort of game between teachers and
students, in which teachers reward (or punish) students with
grades, and students respond in kind with ratings?
The conflicting evidence to date has not satisfactorily
answered this question. The extent and the importance of
the relationship between grades and ratings have not been
determined. Neither have any causes of such a relationship
been positively identified, partially because the
relationship itself has not been firmly established.
Former evidence on this question has been not only
conflicting, but also inadequate in a number of ways. The
results of previous studies will be summarized in chapter
II, but most of these studies had one or more of the
following potential shortcomings:
1. A relatively small sample size;
2. Sampling of only one or a few departments;
3. Sampling of only graduate assistant level teachers;
4. Sampling of team-taught courses;
5. Sampling of only freshman-level courses (or some
other level);
6. Contamination by voluntarism (using only teachers
who volunteer to be rated);
7. Contamination by sensitization (a new system is
initiated, or a class is upset by a researcher suddenly
appearing on the scene);
8. Contamination by knowledge of the study (students,
faculty, or administrators know ahead of time that the
ratings will be used in an "experiment");
9. Contamination by lack of anonymity protection for
the students;
10. An indefinite treatment (students did not actually
know what their final grade in the course would be at the
time of the ratings); and
11. A lack of multivariate analysis (often only simple
correlations among a few variables are reported, whereas
multivariate techniques permit simultaneous examination of
the influence of many variables and provide evidence of the
"extent as well as the nature of any such influence" (Doyle &
Whitely, 1974, p. 260)).
This present study does not suffer from any of the
above potential shortcomings, and it should provide better
evidence bearing on the research question than that
previously found.
A further need for this study is institutional. While
a somewhat similar study was conducted at the University of
Connecticut ten years ago, no such study has been done
recently. Changes have occurred in the rating instrument
since that study of 10 years ago (Garber, 1964), and
improvements have been made in the administrative procedures
connected with the collection of ratings. There is reason
to believe the data are less subject to collection errors
now than formerly.
Furthermore, and perhaps more importantly, over the last
10 years, grades at the University of Connecticut have taken
a sharp turn upward (see Figures 1-3 below). Whatever the
relationship between grades and ratings may have been 10
years ago, that relationship well may have changed in light
of the changes in grading, or in light of certain other
changes related to the rise of student power in general.
The advent, in the fall of 1968, of both pass-fail options
and liberalized policies regarding the dropping of courses
may have influenced the relationship between grades and
ratings in some way. The appointment of students to
formerly all-faculty committees is another recent change
which could possibly bear upon the research question.
Whether or not the trend in grading has had any effect
on the relationship between grades and ratings, it is a
striking trend and worthy of attention. Figure 1 shows the
trend over the last 25 years of the quality point ratios
(QPR's) of graduating seniors. The QPR is the grade average,
computed as the sum of the quality points (number of credits
in a course times the grade in that course, where "A" = 4,
"B" = 3, "C" = 2, "D" = 1, and "F" = 0) divided by the total
number of credits. At the University of Connecticut, QPR's
are multiplied by 10 and range from 0 to 40, but for the
sake of clarity to readers outside of the University of
Connecticut, the range used throughout this study is the
more prevalent 0 to 4. Figure 1 demonstrates that there was
a great deal of consistency from 1950 until 1967 or 1968,
but that the trend has been upward ever since. Table 1
provides the data used in Figure 1.
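As an illustration of the QPR formula just described, the following minimal sketch computes a QPR from hypothetical (credits, letter grade) pairs on the 0-to-4 scale adopted in this study:

GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def qpr(courses):
    """Quality point ratio: sum(credits x grade points) / sum(credits)."""
    quality_points = sum(credits * GRADE_POINTS[grade]
                         for credits, grade in courses)
    total_credits = sum(credits for credits, _ in courses)
    return quality_points / total_credits

# A 3-credit A, a 4-credit B, and a 3-credit C:
print(qpr([(3, "A"), (4, "B"), (3, "C")]))  # (12 + 12 + 6) / 10 = 3.0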
Figure 2 supplies further evidence of the trend in
grading at the University of Connecticut, showing the median
QPR's of all undergraduates after each semester. Note that
each spring semester is represented by the year marks along
the x-axis of Figure 2, while each fall semester is at the
half-way point between year marks. The upward trend in
median QPR's indicated by Figure 2 started about 1964, which
is, as one would expect, three or four years before the
start of the upward trend in graduating seniors' QPR's as
shown in Figure 1. The data for Figure 2 are given in Table
2. Data for the spring semester of 1970 are missing because
of a student strike near the end of that semester, following
the U.S. bombing of Cambodia. In many courses that semester,
final exams were cancelled and grades of "S" for
[Figure 1. Quintile QPR cutoff points for graduating seniors, 1950-1974. Separate curves for the 80th, 60th, 40th, and 20th percentiles and for the minimum QPR for graduation; y-axis: QPR, 2.0 to 4.0; x-axis: years, 1950-1974.]
[Figure 2. Median QPR's of all undergraduates after the end of each semester, 1952-1974. y-axis: QPR, 2.0 to 4.0; x-axis: years, 1952-1974.]
[Figure 3. Average of all A-through-F grades given, semester by semester, 1961-1974. Two curves: 100's and 200's level courses only, and 100's, 200's, and 300's level courses; y-axis: QPR; x-axis: years, 1961-1974.]
Table 1

Data for Figure 1, QPR Cutoff Points for
Graduating Seniors, 1950-1974

Graduating          Percentiles:
Class of      80th     60th     40th     20th

1950          2.80     2.49     2.26     2.04
1951          2.83     2.49     2.24     2.02
1952          2.80     2.46     2.21     1.99
1953          2.85     2.52     2.28     2.06
1954          2.81     2.50     2.23     2.03
1955          2.81     2.44     2.20     2.00
1956          2.84     2.51     2.24     2.00
1957          2.84     2.49     2.23     2.00
1958          2.82     2.48     2.23     1.93
1959          2.83     2.48     2.21     2.00
1960          2.82     2.48     2.22     2.08
1961          2.81     2.48     2.24     2.03
1962          2.85     2.47     2.24     2.04
1963          2.83     2.49     2.25     2.04
1964          2.81     2.43     2.25     2.05
1965          2.83     2.50     2.26     2.04
1966          2.83     2.50     2.27     2.06
1967          2.88     2.55     2.30     2.03
1968          2.89     2.54     2.30     2.06
1969          2.94     2.59     2.36     2.10
1970          3.04     2.70     2.44     2.17
1971          3.11     2.79     2.52     2.25
1972          3.19     2.89     2.62     2.33
1973          3.30     2.99     2.72     2.43
1974          3.32     3.02     2.77     2.47

Note. The minimum QPR for graduation (0th percentile) was
1.80 every year.
Table 2

Data for Figure 2, Median QPR's for
All Undergraduates, 1952-1974

Year    Spring Semester    Fall Semester

1952         2.27               2.21
1953         2.28               2.19
1954         2.27               2.21
1955         2.29               2.24
1956         2.31               2.24
1957         2.30               2.24
1958         2.28               2.20
1959         2.23               2.18
1960         2.31               2.23
1961         2.29               2.27
1962         2.19               2.25
1963         2.18               2.28
1964         2.35               2.26
1965         2.37               2.35
1966         2.45               2.43
1967         2.48               2.45
1968         2.56               2.53
1969         2.57               2.58
1970          --                2.72
1971         2.87               2.82
1972         3.02               3.04
1973         3.03               2.88
1974         3.02                --

Note. Data for the spring semester of 1970 and the fall
semester of 1974 (current semester) are missing.
Table 3

Data for Figure 3, Average of
All A-Through-F Grades, 1961-1974

Year    Spring Semester    Fall Semester

1961          --                2.23
1962         2.29               2.26
1963         2.31               2.27
1964          --                2.26
1965         2.34               2.33
1966         2.42               2.40
1967         2.44               2.48
1968         2.56               2.55
1969         2.64               2.59
1970         3.18               2.73
1971         2.82               2.79
1972         2.89               2.86
1973         2.89               2.80
1974         2.92                --

Note. Data for the following semesters are missing: spring
of 1961, spring of 1964, and fall of 1974 (current
semester).
"Satisfactory" were given. Some courses did have finals,
and some teachers used regular grades with or without final
exam scores. However, the median QPR for all undergraduates
was not computed by the administration because of the
drastic effect of the strike on grades. That effect is
clearly indicated in Figure 3.
Figure 3 shows the trend in the average of the "A-
through-F" grades given each semester. Again, the upward
trend since 1964 is clearly indicated. The inclusion,
starting with the spring semester of 1967, of 300's
(graduate) level courses would account for some of the rise
in this curve, but it would have resulted in a one-time
jump. The constant upward trend can not be explained by the
inclusion of these courses, which was unavoidable because of
a change in administrative procedure. The data for Figure 3
are shown in Table 3.
There is no indication that the ability level of the
students at the University of Connecticut has followed (or
preceded) such a drastic trend as that indicated by Figures
1-3 (personal communications with the Admissions Office). A
study by Baird and Feister (1972) found that a large number
of faculties tended to give out the same distribution of
grades over the years 1964-1968, even regardless of changes
in the ability level of the students in some cases. The
grading trend found at the University of Connecticut,
however, is quite the opposite (higher grades without any
such increase in student ability levels). This trend
seems to be part of a nationwide trend toward leniency
occurring since the period of the Baird and Feister study.
Recent reports involving some 23 campuses across the nation
("Cal State," 1974; Ladas, 1974; "Too Many A's," 1974) have
supplied evidence that a "grade glut has been spreading
across academe" ("Too Many A's," 1974, p. 106) in the last
few years.
It is not certain whether teachers are "simply being
generous" or "are bribing students with good grades to get
good grades themselves" ("Too Many A's," 1974, p. 106). But
there is definitely an increased leniency on the part of
this faculty, and it is possible that the relationship
between grades and ratings is not the same as it was 10
years ago. Furthermore, it is likely that the upward trend
in grading has not been universal across departments (cf.
Ladas, 1974; Postman, 1974) nor across all teachers and that,
as a result, the influence of grades, if any, on ratings
would be more pronounced for some teachers and for some
departments than for others.
Thus there are multiple reasons for conducting this
study. The research question has not been satisfactorily
answered; previous studies have suffered from several
shortcomings; and changes in the rating instrument and
grading practices have occurred at this university (and
apparently nationally). The present research is an attempt
to provide better and more up-to-date knowledge concerning
faculty evaluation by students.
Sources of the Data
This study uses data from the spring semester of 1973
at the University of Connecticut. Some of the data were
obtained from admissions records or other branches of the
university administration and were collected prior to 1973.
The evaluation data were collected by the Bureau of
Institutional Research in the customary manner established
by that office and followed each spring. Detailed
descriptions of the variables investigated and the rating
instrument used appear in chapter III.
No one, including the students and the administrators
who gathered the data, knew that the data would be used in
this present study. Grade data were obtained from the
registrar's records. Since ratings are done anonymously,
it is not possible to match the individual students' ratings
with their grades. Nevertheless, a great deal of information
about the distribution of grades in each course section is
known and can be matched with a large amount of other data
about the course section and its teacher. Thus, the
separate course sections (N = 2,360) were the units of
analysis for this study. The loss of the ability to match
grades and ratings for individual students is offset by
certain compensations, such as the anonymity of the ratings
data (which is what precludes the matching) and the ability
to study data across an entire university.
CHAPTER II
SUMMARY OF RELATED RESEARCH
This chapter presents a summary of the theories and
research which form the background to this study. The
research related to the major issue of the validity of
student ratings is covered first, followed by a summary of
the research on several related issues.
Validity of Student Ratings
In 1971, Costin, Greenough, and Menges conducted a
review of the literature on student ratings of college
teachers. They stated that faculty members could judge the
validity of student ratings "to the extent students'
subjective criteria match the faculty members' goals in
teaching" (p. 513), and thus, a determination must be made
of "the basis on which students make their judgments" (p.
514). In their discussion of various possible bases for
student judgments, Costin et al. cited conflicting evidence
as to whether "students may judge instruction on the basis
of its 'entertainment' value rather than on its information,
contribution to learning, or long-term usefulness" (p. 517).
They also stated that students may make rating judgments
based on the grades received or expected in the course.
This is the issue upon which this present study is focused.
Costin et al. reviewed 28 studies of the relationship
between student ratings and grades. Of those, 15 found no
significant relationship, 1 found a small negative
correlation, and 12 found significant, but small, positive
correlations. Typically the cbrrelations did not exceed
.30, but they could be considered collectively as evidence
that the true relationship may be low and positive.
In an earlier study at the University of Connecticut,
Garber (1964) used student ratings (responses to the eight
separate items on the then current rating instrument) to
predict "difference scores" (the differences between the
students' grades and their "expected grades"). Current
grade point averages were used as the expected grades.
Garber obtained a multiple correlation of .43; correction
for shrinkage yielded an R of .39 (a highly significant
multiple R, p < .001). This outcome was for the multiple
regression analysis using teachers as units of analysis.
Using students as units of analysis, Garber obtained
similar results (R = .31; correction for shrinkage yielded
an R of .28, p < .001). He concluded that the two behaviors,
student ratings of their teachers, and the direction in
which the teachers tended to grade the students (higher or
lower than their current grade point "averages) were either
directly or indirectly dependent on each other.
Bausell and Magoon (1972a), using a much larger sample
than most other studies (N = 12,000 ratings, from which
random stratified samples were drawn), "found strong,
consistent biases in both instructor and course ratings
which can be traced to (a) the grade the student expects to
receive, and (b) the discrepancy between the student's
expected grade and his GPA" (p. 1021). They concluded that
student ratings of instructors are "axiomatically valid for
their designed purpose, but must be interpreted with
caution" (p. 1022) since the assignment of low grades may
be proper in many cases, but would result in lower ratings.
The implication is that administrative use of ratings for
pay, promotion, and tenure decisions may.require great
caution.
A study by Treffinger and Feldhusen (1970) used
characteristics of the students to predict student ratings
of their teachers. They concluded that student variables,
including grades received, were not significant predictors.
They found that generalized pre-course ratings were most
often the best predictor of end-of-course ratings, and
therefore they concluded that the "student's rating of the
course is clearly a complex interaction of his initial
feelings, certain cognitive and affective characteristics
of teachers and pupils, and instructor performance" (p. 622).
Treffinger and Feldhusen noted that "it should be the
instructor's performance or ability as a teacher and
students' reliable perceptions and evaluations of the
performance which constitute the majority of the variance
in instructor ratings" (p. 622). They suggested that the
difference between pre- and post-course ratings might be
more useful than post-course ratings. But other research
(especially Holmes, 1972) indicates that such a difference
may be very heavily affected by changes in grades from
initial expected grades to end-of-course expected or actual
grades.
Bausell and Magoon (1972b) found that "rating changes
did occur as a function of changes in grade expectancy"
(p. 10). Holmes (1972) conducted an experiment in which
expected grades were "disconfirmed." "Half of the students
who deserved and expected A's or B's were given their
expected grades, while half were given a grade one step
lower than expected" (p. 130). Correct grades were given
after ratings were collected. He concluded that "if
students' grades disconfirm their expectancies, the students
will tend to deprecate the instructor's teaching performance
in areas other than his grading system" (p. 130, emphasis
added). These results support a theory that students are
not objective; that they use ratings as a "payoff;" and that
ratings might therefore be invalid.
Many studies have investigated the relationship between
ratings and achievement rather than grades. Such studies
are, of course, directly related to the validity issue since
it is obvious that high achievement constitutes a primary
goal of effective teaching. If students rate highest the
teachers from whom they learn the most, then student ratings
would have a certain validity. Furthermore, such studies
often provide important insights into the relationship
between ratings and grades insofar as achievement and grades
are closely interrelated.
One such study (Rodin and Rodin, 1972) stirred a great
deal of interest in the relationship between ratings and
grades. The authors found a significant negative
correlation between amount learned and ratings of the
teachers (teaching assistants in one large undergraduate
calculus course with 12 sections). Controversy was stirred
over their conclusion that "perhaps students resent
instructors who force them to work too hard and to learn
more than they wish" (p. 1166).
Another investigator (Capozza, 1973) agreed. He
concluded that "the hypothesis that students give good
ratings to classes in which they learn a great deal must be
rejected. Emotional factors such as grades and perhaps a
distaste for the hardships of learning seem to bias the
results in the opposite direction" (p. 127). Though his
sample (250 students in eight course sections) was larger
than many, Capozza himself considered it small.
Gessner (1973) found results opposite those of the
Rodins and Capozza. He criticized the Rodin and Rodin study
for its use of teaching assistants, whose instructional role
was apparently an ancillary one. Gessner's own data,
however, were based on only one course (N = 78), and it was
team-taught. There can be no generalization, then, across
courses or instructors.
A study by Potter, Nalini, and Lewandowski (1973)
employed a better research design than others (with students
and teachers randomly assigned to classes). Unfortunately,
however, teacher trainees were used. They found a modest,
but significant, correlation between ratings and achievement
when the effect of aptitude was partialled out. Although
their sample was small (N = 254 students in 24 course
sections), they observed that the magnitude of the
correlation between achievement and ratings seemed to be
lower when the correlation between achievement and aptitude
was higher, and vice versa. They concluded: "the stronger
the relation between aptitude and achievement, the lesser the
relation between achievement and rating" (p. 2). A
corollary of this conclusion might be that ratings may
suffer from a grading bias in direct proportion to how
arbitrary the grading is (cf. Holmes, 1972).
There are two conflicting explanations often proposed
for a negative correlation between achievement and ratings:
(a) better learners are more critical of teachers or (b)
there is resentment of the hard work teachers force students
to do in order to achieve. Rodin (1974), however, indicated
that we may not need either of these theoretical
explanations, since the true relationship well may be
positive. She reviewed several studies and found that most
of the correlations between ratings and amount learned
tended "to lie in the range r = .20 to r = .30 . . . which
indicates that amount learned accounts for only about 4 to 9
percent of the variance in student ratings" (p. 5). Rodin
concluded that student ratings are not indicative of amount
learned. She compared student ratings of teaching to
consumer ratings of the palatability of peanut butter, and
argued that, just because the consumer ratings may be poorly
related to laboratory ratings of nutritiveness, the validity
of palatability ratings is not thereby impugned (unless one
expects the palatability ratings to tell us all about peanut
butter). Rodin implied, therefore, that student ratings may
be a valid measure of one component of effective teaching,
and that grades could be one important influence on that
component.
Related Issues
Several issues, which have been hinted at above, are
directly related to validity. One is the possible presence
of a "halo-effect." That is, are global impressions at
work, masking the separate traits of effective teaching?
Do the individual items on rating instruments provide useful
information, or do they form large clusters of traits? Such
clusters of traits could be valid measures of teaching
effectiveness even if the separate items or traits were not.
Strong halo effects were noted by Royce (1960), Garber
(1964), Potter, Nalini, and Lewandowski (1973), and Widlak,
McDaniel, and Feldhusen (1973). Hoyt (1969) found a halo
effect in student ratings of courses and concluded that it
interfered with students' ability to discriminate fairly
among the traits of the course covered by the rating
instrument. Costin, Greenough, and Menges (1971), however,
cited several studies in support of a theory that ratings
on multiple attributes of instruction are low in halo
effect.
Widlak et al. reviewed a representative sample of 22
studies over the past 30 years. They concluded that, while
most rating instruments can be factored into at least two
components, there is likely to be a halo effect "so strong
. . . that the specific item ratings may have little
diagnostic value in assessing a teacher's strengths and
weaknesses and, ultimately, low potential for improving
teaching" (p. 10). Widlak et al. found three factors
throughout the studies they reviewed. They called these the
"Actor," "Interactor," and "Director" factors (referring to
roles of teaching). These might also be called
"Performance," "Rapport with students," and "Course
structure/difficulty" factors.
Another issue, somewhat related to the halo-effect
issue, is the persistence of first impressions or
preconceptions. The findings of Bausell and Magoon (1972b)
suggest that first impressions and preconceptions are
persistent. They found that ratings of teachers after the
first day of classes were very durable, and that they were
highly correlated (r = .67, n = 20 courses) with ratings
taken at the end of the course.
As mentioned above, Treffinger and Feldhusen (1970)
found that generalized pre-course ratings were most often
the best predictor of end-of-course ratings. Furthermore,
Parent (1971) suggests that ratings should be taken early
in the course to provide feedback to the teachers in time
to modify the teaching performance before the courses were
finished and "wasted." If early attitudes are persistent,
then end-of-course ratings would not improve on early ones, and Parent
suggests that end-of-course ratings should be eliminated.
If ratings were always done very early in the course, some
of the grading bias (if there were one) might be eliminated.
Some bias might remain, however, since expected grades
could still have an influence, as could grades on various
quizzes and papers.
On the one hand, the durability of first impressions
might indicate that student ratings are invalidated,
because they are not measures of teaching effectiveness
over the entire course. On the other hand, it may be that,
whatever student ratings are measuring, it simply does not
take very long to evaluate it (especially, perhaps, the
Actor factor). In the case of preconceptions, it could be
that there is accurate foreknowledge about the course or
instructor from word-of-mouth advice from other students,
or from prior experiences with the instructor himself or
with similar courses taught by others.
Costin et al. suggested that student ratings might be
shown to be valid if there were a high correlation between
student and peer ratings (agreement between different
judges). The results of the Bausell and Magoon (1972b)
study suggest that ratings by outside observers, including
other faculty members, would tend to correlate highly with
student ratings, even though the outside observers might
view only a few class sessions. Toug and Feldhusen (1973)
supported this view, as do the results of four studies
(r's from .30 to .63) cited by Costin et al. (1971). Centra
(1974), however, found that while there is agreement between
student and peer ratings of teachers, the peer ratings were
"less reliable" (low interjudge agreement among the peers).
He concluded that both student and peer ratings should be
used, but to measure different traits. Costin et al.
cautioned that peer ratings may be influenced by student
ratings through hearsay (see also Jaeger & Freijo, 1974).
It is possible, however, that a grading bias is responsible
for the difference between student and peer ratings, or at
least some portion of the difference. One implication is
that peer ratings might be used to partial out the grading
bias if one exists.
Still another issue relates to the definition and
measurement of effective teaching. It is the issue over
whether or not the good researcher is likely to be a good
teacher. Stallings and Singhal (1969) found a significant
correlation between teacher evaluations and research output
(measured by weighted combinations of the number of books,
articles, etc.). They concluded, as many others have, that
productive researchers are usually very good teachers, but
that students often have the stereotype that very active
researchers neglect their teaching. If pervasive, such a
stereotype would tend to lower the validity of student
ratings of researchers, since the ratings might be based on
a misconception. McDaniel and Feldhusen (1970) found that
students did bias their ratings against heavy researchers
or prolific writers; they theorized that the students may
be correct -- that researchers do neglect their teaching.
Many studies (notably Aleamoni, 1974; Bendig, 1952;
and Elmore and LaPointe, 1974) have examined relationships
between faculty ratings and certain characteristics of the
teachers or courses. There was, according to them, little
consensus about the characteristics of the "best" or "worst"
teachers or classes. At the University of Connecticut,
however, administrative findings over the past few years
indicate there are definite differences in ratings across
several variables. Historically, students at this
university rate male teachers slightly higher than female
teachers overall, but not on every item of the scale. In
general, the smaller the class size, the higher the ratings
of the instructor have been. Similarly, classes at the
branches have been rated higher than courses at the main
campus at Storrs, Connecticut. Officials have noted,
however, that this last difference ma:- be due to ,the fact
that the largest course sections are taught at Storrs.
Historical evidence also indicates that al_ the
University of Connecticut, the more advanced the course,
the higher the ratings. Tenured faculty have received
better ratings than non-tenured teachers. With respect to
age, a teacher's ratings seem to peak during the 31-to-40
age period, with ratings rising steadily to that point and
then declining only slightly and very slowly afterwards.
Thus, variables such as age, sex, tenure status, and class
size might be good predictors of student ratings.
Summary
A careful scrutiny of the literature reveals that very
few multivariate studies have been conducted in the area of
this problem, and that several criticisms (mentioned in
chapter I) may be leveled at most of the simpler, but more
numerous, correlational studies. Furthermore, there is
disagreement among the researchers, and directly conflicting
results from their studies have been confounded by the use
of many different research methods, different rating
instruments, and different sample characteristics (including
different units of analysis). In sum, the research question
has not been answered satisfactorily, though several
interesting concepts and theories have emerged which could
possibly prove helpful, once better evidence is obtained.
CHAPTER III
PROCEDURE
This chapter describes the subjects, the rating
instrument, the predictor variables, the criterion variables,
and the statistical methods used in this study. Appearing
in the appendix are macro flowcharts of the computer
programming steps required to assemble the data for analysis.
Subjects
The "subjects" in this study should be considered to
be the 2,360 course sections for which complete data could
be assembled. Since only 89 course sections were deleted
for missing data, this study covered 96.37% of the entire
population of 2,449 course sections evaluated following
the spring semester of 1973 at the University of Connecticut.
Over 30,000 rating forms were returned, representing about
55% of the students who were sent them.
Instrument
The University of Connecticut Rating Scale for
Instruction (UCRSI; Bureau of Institutional Research,
University of Connecticut, 1971) is presented in Figure 4.
[Figure 4. The University of Connecticut Rating Scale for Instruction (Form No. AA 118, 11/71). The machine-scored form, reproduced twice per sheet, carries spaces for the instructor, course, department, section, enrollment, and faculty employee number, and eight items rated on a 10-point scale between bipolar anchors:
1. Knowledge of Subject: "Knowledge of field inadequate" to "Highly expert"
2. Presentation of Material: "Very hard to follow" to "Makes subject very clear"
3. Balance of Breadth & Detail: "Gets bogged down in trivia" to "Good balance of breadth & detail"
4. Enthusiasm for Subject: "Seems irksome to him" to "Displays great enthusiasm"
5. Fairness in Marking: "Partial and prejudiced" to "Very fair and impartial"
6. Attitude Toward Student: "Unsympathetic and intolerant" to "Sympathetic and understanding"
7. Personal Mannerisms: "Distracting and irritating" to "Completely acceptable"
8. Over-All Summary As A Teacher: "Unsatisfactory" to "Outstanding"
The form closes with the instruction: "THIS RATING IS TO BE ENTIRELY IMPERSONAL. DO NOT MAKE ANY MARKS ON THIS PAGE WHICH COULD SERVE TO IDENTIFY YOU."]
The UCRSI is the instrument currently used at the university
in a program of faculty evaluation that had its origin in
the late 1940's. It started with a University Senate
concern with the quality of teaching. It was felt that
there might be an overemphasis on the "publish or perish"
ethic.
At first, the university evaluated only certain "target"
teachers (those for whom the administration requested
information, usually for career decisions). But by the late
1960's, the process had grown into one that covered every
faculty member teaching a "ratable" course section. The
nonratable course sections are those deemed not ratable in
the normal way, such as seminars, independent study courses,
field work, practice teaching, team-taught courses, graduate
assistant-taught courses, and others. The decision as to
whether or not a specific course section is ratable is made
by the individual department chairmen each year.
After students have received their grades for the
spring semester, and after they have moved to their summer
addresses, the rating forms are mailed to the students with
their teachers' names and their course sections already
filled in (see Figure 4). The university pays for postage
both ways, and the anonymity of the students is guaranteed
since students' names or identification numbers appear
nowhere on returned forms. The rate of return of ratings
has remained around 55% over the recent years.
Ratings are done only after each spring semester. This
is done for two reasons: (1) so that students will fill out
rating forms at their homes in the summer away from the
influences of their fellow students, and (2) especially so
that the cost of evaluation will not be doubled (by
repeating the evaluations each semester).
Results are machine tabulated and sent both to the
individual teachers and to their department chairmen.
Cumulative average scores (cumulative since 1969) accompany
each teacher's ratings for the recently ended course
sections. Since the Bureau of Institutional Research edits
each of the incoming rating forms by hand (including
checking for the clarity of markings), and since invalid
forms are removed, high accuracy is claimed for the ratings
data.
The ratings data are provided to promotion and tenure
committees by the department chairmen, along with other
pertinent information. There has been some concern at the
university-that someone might attribute significance to very
small differences in ratings, when in fact, only extreme
differences in ratings have any practical meaning. There
is also some concern about the validity of the UCRSI. There
is no certainty as to what it measures, although university
officials have noted that the ratings seem to have a
moderately high "reliability." That is, large samples of
student ratings show a high degree of agreement with each
other. What the students are agreeing about, however, is
not absolutely clear.
Initial Predictor Variables
The initial battery of predictors of student ratings
of college teachers included twelve variables. Presented
here are brief descriptions of each of these predictors and
of how each was derived.
Sex. The sex of each teacher was obtained from his or
her professional history, filed with the administration.
No cases were lost for missing data for this variable. The
characters M and F, for male and female, were converted to
values 1 and 2 respectively before computation began.
Appointment length. The number of months of each
teacher's appointment was also obtained from the
professional history. In most cases, teachers are appointed
each year for a term of 9 months. A few have 10-month
appointments, and several are for 11 months. The values 9,
10, and 11 were used for this variable, and no cases were
lost for missing data.
Percentage employed. Most teachers at the University
of Connecticut are employed full-time, regardless of their
appointment length. Several are part-time, however, and
the percentages vary. Values from 1 to 100, representing
percentages of full-time employment, were obtained from the
professional history, and no cases were lost for missing
data.
Class size. The number of students enrolled in each
course section at the end of the semester was obtained from
the grade distribution data. It equals the total number of
all grades given, including incompletes, etc. No cases were
lost for missing data for this variable.
Number of years tenured. The year that a teacher was
tenured (if he was) was obtained from the professional file.
The number of years since being tenured was computed (zero
if not tenured) by subtracting the tenure year from 1973.
No data were missing.
Age. The birthdate of each teacher was obtained from
the professional history file, and each teacher's
chronological age in completed years was computed as of
May 8, 1973 (the end of the spring semester).
Years employed at the University of Connecticut. The
hiring date of each teacher was also obtained from the
professional history. The number of years of continuous
service at the university was computed as of May 8, 1973
(the end of the spring semester) for each case. None of the
cases were lost on account of missing data for this variable.
Course level. Course numbers were obtained from the
rating responses records. The level of the course was then
computed by dividing the course number by 100 and dropping
the digits to the right of the decimal point. The resulting
values (0, 1, 2, 3, and 4) represent courses with numbers
1-99, 100-199, 200-299, 300-399, and 400-499 respectively.
No cases were lost for missing data for this variable.
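In code, this derivation amounts to integer division (a one-line sketch, not the study's actual merging program, which is flowcharted in the appendix):

def course_level(course_number: int) -> int:
    # Divide by 100 and drop the fraction: course 243 -> level 2.
    return course_number // 100

assert course_level(243) == 2  # numbers 200-299 map to level 2
assert course_level(95) == 0   # numbers 1-99 map to level 0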
Title level. Each teacher's classification code was
obtained from the professional file. A list of the
classification codes belonging to each title level
(Instructor, Assistant Professor, Associate Professor, and
Professor) was created with the aid of personnel from the
Bureau of Institutional Research for use in one of the
merging computer programs. Title levels (values 1, 2, 3,
and 4 for Instructor, Assistant Professor, Associate
Professor, and Professor respectively) were assigned to
each teacher according to his or her classification code.
Two course sections were lost since one teacher's
classification code could not be correctly grouped with any
one level.
Course location. A branch code was obtained for each
course section from the rating responses data. A value of
1 was assigned to the course location if the course was
taught at any of the branches (Hartford, Stamford,
Southeastern, or Hartford M.B.A.), and a value of 2 was
assigned to course sections taught at the Storrs main
campus. No cases were lost for missing data for this
variable.
Department quantitativeness. Student records for
juniors and seniors (majors are not declared earlier)
included each student's major department and his two
Scholastic Aptitude Test scores (Verbal and Quantitative).
For each department in which students declared a major, the
average quantitativeness (DQ_j) was computed as the average
difference between the Quantitative and Verbal SAT scores
of the students majoring in that department (j). The
formula used to compute the average quantitativeness for
each department was the following:

DQ_j = [Σ_i (SATQ_ij - SATV_ij)] / n_j        (1)
where, in the jth department, SATQ_ij is the ith student's
Quantitative SAT score, SATV_ij is the ith student's Verbal
SAT score, and n_j is the number of students majoring in that
department. The results of these computations are presented
in Table 4. Since some departments had very few students
as majors, the quantitativeness figures for those
departments should be interpreted cautiously. With small
n's, those departments may have radically different
quantitativeness figures from one year to the next.
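A minimal sketch of this computation in Python, assuming
student records are available as (department, SATQ, SATV)
tuples; the names are illustrative, not those of the study's
merge programs:

    from collections import defaultdict

    def dept_quantitativeness(records):
        """Average SATQ - SATV difference per department (equation 1)."""
        diffs = defaultdict(list)
        for dept, satq, satv in records:
            diffs[dept].append(satq - satv)
        return {dept: sum(d) / len(d) for dept, d in diffs.items()}

    records = [("Statistics", 700, 520), ("Statistics", 680, 530),
               ("English", 550, 610)]
    print(dept_quantitativeness(records))
    # {'Statistics': 165.0, 'English': -60.0}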
Note also that departments high in quantitativeness
have high positive values (maximum of 169.5), and
departments low in quantitativeness have low or negative
values (minimum of -135.0). Being very high or very low
(or anywhere between, for that matter) on this scale of
quantitativeness does not signify anything about the quality
of the department or its average SAT scores. One could
argue either way that it is better to be highly "verbal"
or highly "quantitative." Yet for suggesting a profile,
the scale appears to be very meaningful; it has high face
validity.
There are 10 departments in which no undergraduate
students may major (Aerospace R.O.T.C., Biobehavioral
Science, (general) Engineering, Interdepartmental,
Journalism, Linguistics, Metallurgy, Polish, Portuguese, and
Science). The 56 course sections that were taught in these
10 departments were excluded from this study because of
missing data for department quantitativeness.
Table 4
Mean Differences Between Quantitative and Verbal SAT Scores
of Undergraduates with Different Majors

Department Name                           Mean      N     S.D.
Statistics                               169.50      4    49.10
Civil Engineering                        119.99    101    73.10
Mathematics                              106.04    144    87.04
Chemical Engineering                     105.42     26    80.99
Mechanical Engineering                    95.89     35    71.18
Electrical Engineering                    94.64     78    96.26
Accounting                                82.23    140    81.77
Pharmacy                                  80.93     45    84.94
Finance                                   71.60    102    84.64
Business                                  68.06    190    87.95
Physics                                   64.68    190    70.33
Italian                                   64.50      2    46.50
Geography                                 62.25     12    70.35
Physical Education                        61.47     87    72.46
Geology                                   60.83     24    87.42
Chemistry                                 59.00     43    82.62
Agricultural Engineering                  58.50      8    84.80
Industrial Administration                 57.55     60    86.65
Agricultural Economics                    56.33      3    42.87
Marketing                                 52.44     94    81.47
Pre-Veterinary                            49.00     13    89.70
Nutritional Science                       43.00      7    76.93
Biology                                   41.86    345    92.16
Horticulture                              41.07     99    82.83
Animal Industries                         39.98     49    85.01
Economics                                 31.99    101    77.50
Agriculture and Natural Resources         29.00      1     0
Spanish                                   22.82     38    94.36
Physical Therapy                          19.25    181    90.33
Political Science                         18.85    189    84.80
Foods & Nutrition                         17.86     22    69.38
Family Economics                          17.57      7    76.56
German                                    16.57     14   119.31
Clothing, Textiles, & Interior Design     15.79     62    97.21
Speech                                    14.66    110    88.50
Child Development & Family Relations      14.10    174    81.78
Psychology                                13.49    321    93.86
History                                   13.37    188    94.02
Music                                     10.56     45    93.50
Medical Technology                        10.29     35    84.86
Sociology                                 10.16    225    89.27
Education                                  9.60    354    81.45
Art                                        6.11     82    77.10
Philosophy                                -3.46     33    79.87
Nursing                                   -4.24    193    76.47
Anthropology                              -5.30     57    86.03
French                                   -17.50     47    73.60
Dramatic Arts                            -23.02     47   104.45
Russian                                  -29.71      7   127.18
English                                  -33.09    401    88.84
Classics                                -135.00      1     0
Rate of return of ratings. For each course section,
the percentage of rating forms returned by the students
was computed. The number of ratings returned (obtained
from the ratings data) was divided by the enrollment in
the class (see class size). Values for rates of return
were permitted to range from .001 to 1.000, and no cases
were dropped on account of missing data.
Grading Variables
Average grade. The number of each type of grade given
in each course section was obtained from the grade
distribution data (from the registrar's records). Grades
such as "I" for Incomplete and "P" for Pass were ignored,
and the average of the "A-through-F" grades was computed
for each course section (values permitted from 0.00 to
4.00). There were 27 cases deleted on account of missing
data for this variable. In these 27 course sections, no
grades were given in the "A-through-F" range. Note that
these course sections differ from those in which all "F's"
were given (resulting in an average grade of 0.00), and
thus the 27 cases could not logically be included in the
study with any particular value for an average grade.
Standard deviation of grades, Variance of grades,
Skewness of grades, and Kurtosis of grades. These
characteristics of the distribution of the "A-through-F"
grades in each course section were computed at the same
time as the average grade, using the registrar's grade
distribution data. No further cases (beyond the 27 cases
lost for average grade missing data) were lost for missing
data for these variables.
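For illustration, all five grading variables can be computed
from a section's grade counts as in the Python sketch below.
The study does not print its exact moment formulas, so the
conventional moment-ratio definitions of skewness and kurtosis
are assumed here; all names are hypothetical.

    GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

    def grading_variables(counts):
        """counts: dict of A-through-F grade counts (I's and P's excluded)."""
        n = sum(counts.values())
        if n == 0:
            return None   # like the 27 sections dropped for missing data
        vals = [GRADE_POINTS[g] for g, c in counts.items() for _ in range(c)]
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n
        sd = var ** 0.5
        skew = sum((v - mean) ** 3 for v in vals) / (n * sd ** 3) if sd else 0.0
        kurt = sum((v - mean) ** 4 for v in vals) / (n * var ** 2) if var else 0.0
        return {"average": mean, "sd": sd, "variance": var,
                "skewness": skew, "kurtosis": kurt}

    print(grading_variables({"A": 10, "B": 12, "C": 5, "D": 2, "F": 1}))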
Criterion Variables
Ten potential criterion variables were computed for
each course section using the rating responses data. The
process of selecting the criteria used in the stepwise
multiple regression analyses is explained in the next
section and in the next chapter. Descriptions of the 10
potential criteria and how they were derived are presented
here.
Items 1-8. The average rating on each of the eight
items on the University of Connecticut Rating Scale for
Instruction (UCRSI, see Figure 4) was computed for each
course section. Four course sections (in which no one
answered Item 8) had to be deleted for missing data.
Average of Items 1 through 8. The average of the
responses to all of the items on the UCRSI was computed for
each course section using the ratings data. No cases were
dropped on account of missing data for this variable (none
beyond the four dropped for Item 8 missing data).
Average of Items 1 through 7. The average of the first
seven items on the UCRSI was also computed from the rating
responses data, and there were no missing data.
Statistical Analyses
In order to determine the number and nature of
components or dimensions underlying the,rating instrument
items, two principal components factor analyses (of 1972 and
1973 ratings data) were conducted Prior to the main effort
of this study. The results'of these (details presented in
chapter IV) were instrumental when it came to making
decisions about the chOice cf a criterion variable.
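The extraction itself can be sketched in a few lines of modern
code; the following is only an illustration of a principal
components analysis, not the study's own program, assuming
`ratings` is an array of course-section averages on the eight
items:

    import numpy as np

    def first_component(ratings):
        """First principal component of the 8 x 8 item correlation matrix."""
        corr = np.corrcoef(ratings, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(corr)      # ascending eigenvalues
        loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
        pct = 100.0 * eigvals[-1] / corr.shape[0]    # percentage of variance
        return eigvals[-1], loadings, pct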
One stepwise multiple regression analysis was employed
to reduce the initial battery of 12 predictor variables
(predicting the average of Items 1-8 on the UCRSI) to an
"optimally reduced subset" (see chapter I). The results of
this regression analysis are presented in chapter IV and
discussed in chapter V. As far as the procedure is
concerned, this regression analysis provided a multiple
correlation and an ordered, optimally reduced subset of
predictors, both of which were needed as input to following
steps.
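In outline, forward selection of this kind enters, at each
step, the predictor giving the largest gain in R². The Python
sketch below is illustrative only (the study used the SPSS
subprogram, not this code):

    import numpy as np

    def r_squared(X, y):
        """R² of an ordinary least-squares fit with an intercept."""
        X1 = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

    def forward_stepwise(X, y):
        """Enter predictors one at a time in order of R² improvement."""
        entered, steps = [], []
        while len(entered) < X.shape[1]:
            best = max((j for j in range(X.shape[1]) if j not in entered),
                       key=lambda j: r_squared(X[:, entered + [j]], y))
            entered.append(best)
            steps.append((best, r_squared(X[:, entered], y)))
        return steps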
A second stepwise multiple regression analysis was
performed using the optimally reduced subset found in the
first regression analysis and the five grading variables
(see chapter I) to predict the same criterion (average of
Items 1-8). First, the variables in the optimally reduced
subset entered this regression equation in the same order
that they entered the prior regression equation (i.e.,
ordered). Then, the grading variables were allowed to enter
the regression equation in the order of their ability to
improve the equation (i.e., floating). This served to test
the incremental importance of the grading variables as
predictors of ratings, given that certain other variables
were already in the regression equation. Thus, the
variables in the optimally reduced subset, in effect,
"partialled out" a chunk of the variance before the grading
variables were even considered.
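This two-stage entry can be sketched by extending the previous
fragment (reusing r_squared from above; illustrative only): the
optimally reduced subset is forced in first, in its fixed
order, and the grading variables then float in by greatest
improvement.

    def forced_then_floating(X, y, forced, floating):
        """Fixed-order entry of `forced`, then greedy entry of `floating`."""
        entered, steps = [], []
        for j in forced:                              # ordered entry
            entered.append(j)
            steps.append((j, r_squared(X[:, entered], y)))
        remaining = list(floating)
        while remaining:                              # floating entry
            best = max(remaining,
                       key=lambda j: r_squared(X[:, entered + [j]], y))
            remaining.remove(best)
            entered.append(best)
            steps.append((best, r_squared(X[:, entered], y)))
        return steps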
The results of the second regression analysis also
determined the relative importance of the grading variables
and the overall ability of the available predictor variables
to predict ratings. These results are also presented in
chapter IV and discussed in chapter V.
Cross-validation of the multiple regression analyses
was simulated through an examination of the estimated amount
of shrinkage in the multiple correlations using McNemar's
(1962, p. 184) shrinkage formula:

R' = {1 - (1 - R²)[(N - 1) / (N - n)]}^½        (2)

where R' is the multiple correlation after shrinkage, N is
the sample size, n is the number of predictor variables, and
R² is the multiple correlation squared.
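Equation 2 transcribes directly into code; a minimal sketch:

    def shrunken_r(r, n_obs, n_pred):
        """McNemar's shrinkage correction (equation 2)."""
        return (1.0 - (1.0 - r ** 2) * (n_obs - 1) / (n_obs - n_pred)) ** 0.5

    # With the figures reported in chapter IV (R = .25, N = 2,360, n = 10):
    print(round(shrunken_r(0.25, 2360, 10), 2))   # 0.24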
In order to test the significance of the increase in
the multiple correlation when the grading variables were
added to the optimal subset of other variables, an F-test
of significance was performed using another of McNemar's
(1962, p. 284) formulae:

F = [(R1² - R2²) / (m1 - m2)] / [(1 - R1²) / (N - m1 - 1)]        (3)

where R1² is the larger multiple correlation squared, R2²
is the smaller multiple correlation squared, N is the sample
size, m1 is the number of predictor variables associated
with R1, and m2 is the number of predictor variables
associated with R2, with degrees of freedom m1 - m2 and
N - m1 - 1.
The significance of the multiple correlations by
themselves was determined with F-tests of significance using
yet another of McNemar's (1962, p. 283) formulae:

F = (R² / m) / [(1 - R²) / (N - m - 1)]        (4)

where R² is the multiple correlation squared, m is the
number of predictor variables included in the multiple
correlation, and N is the sample size, with degrees of
freedom m and N - m - 1.
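Both F-tests also transcribe directly; an illustrative sketch:

    def f_increase(r1_sq, r2_sq, m1, m2, n_obs):
        """Equation 3: F for the increase in R², df (m1 - m2, n_obs - m1 - 1)."""
        return ((r1_sq - r2_sq) / (m1 - m2)) / ((1.0 - r1_sq) / (n_obs - m1 - 1))

    def f_multiple(r_sq, m, n_obs):
        """Equation 4: F for a multiple correlation, df (m, n_obs - m - 1)."""
        return (r_sq / m) / ((1.0 - r_sq) / (n_obs - m - 1))

    # The shrunken R's of .38 and .24 from chapter IV give roughly the
    # reported F (4, 2345) = 60.13 (exact agreement needs unrounded R's):
    print(round(f_increase(0.38 ** 2, 0.24 ** 2, 14, 10, 2360), 1))   # about 59.5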
The significance of the individual simple correlations
was determined using a significance table for correlation
coefficients.
Summary
Student ratings of college teachers at the University
of Connecticut during the spring 1973 semester were studied
to determine whether or not the addition of five new
predictor variables dealing with grades could significantly
improve an optimal set of predictors reduced from an initial
battery of predictors. Factor analysis was used to reduce
the eight-item rating instrument to a single criterion
variable. Stepwise multiple regression analysis was used,
both to reduce the initial battery of predictors to an
optimally reduced subset, and to test the incremental
importance of the grading variables as predictors of
average ratings.
CHAPTER IV
RESULTS
This chapter presents the results of the statistical
analyses of this study. The procedures used to obtain
these results are explained in chapter III, and a discussion
of these results is provided in chapter V.
Factor Analyses of the Rating Instrument's Items
Principal components factor analyses of two separate
years' ratings (1972 and 1973) yielded nearly identical
results. These results, presented in Table 5, show that
all eight items on the University of Connecticut Rating
Scale for Instruction (UCRSI) loaded heavily (.778 or
greater) on a single factor. Item 8, "Over-All Summary as
a Teacher," had a factor loading of over .97 both times.
Furthermore, the correlations among the eight items,
presented in Table 6, were all +.52 or greater and highly
significant (p << .000001). It should be noted that these
correlations are among course section averages.
The results of these preliminary factor analyses
indicated that there was essentially one global dimension
underlying the ratings data, and that a single criterion
variable representing that factor could be employed in the
subsequent stepwise multiple regression analyses.
Table 5
Factor Analyses of the Eight Items on the
University of Connecticut Rating Scale for Instruction

                                     1972        1973
                                   Factor 1    Factor 1
N                                    2417        2465
Eigenvalue                           6.030       6.011
Percentage of Variance              75.372      75.139

Items:                                  Factor Loadings:
1. Knowledge of Subject              .778        .783
2. Presentation of Material          .889        .895
3. Balance of Breadth & Detail       .887        .902
4. Enthusiasm for Subject            .852        .832
5. Fairness in Marking               .836        .824
6. Attitude Toward Student           .853        .852
7. Personal Mannerisms               .867        .863
8. Over-All Summary as a Teacher     .970        .971
Table 6
Product-moment Correlations Among the Eight Items
on the University of Connecticut
Rating Scale for Instruction for 1972 and 1973

Items:                    1      2      3      4      5      6      7
1. Knowledge of
   Subject
2. Presentation of      .66
   Material            (.66)
3. Balance of           .69    .88
   Breadth & Detail    (.66)  (.87)
4. Enthusiasm for       .74    .70    .69
   Subject             (.72)  (.70)  (.68)
5. Fairness in          .55    .63    .68    .58
   Marking             (.56)  (.64)  (.68)  (.64)
6. Attitude Toward      .52    .66    .66    .66    .81
   Student             (.54)  (.65)  (.64)  (.70)  (.81)
7. Personal             .55    .73    .74    .63    .71    .78
   Mannerisms          (.56)  (.74)  (.74)  (.67)  (.70)  (.77)
8. Over-All Summary     .74    .90    .89    .79    .77    .81    .82
   as a Teacher        (.74)  (.89)  (.87)  (.81)  (.77)  (.81)  (.83)

Note. 1972 correlations are in parentheses; N's for 1972
and 1973 were 2417 and 2465 respectively; all
correlations are significant, p << .000001.
The average of all eight items was selected as the primary
criterion for this study, but results were also obtained
for three other criteria considered possibly representative
of the factor (see below, Parallel Results Using Other
Criteria). One of these other criteria, Item 8 by itself,
was chosen on account of its extremely high factor loading.
The average of the other seven items was chosen as a
criterion for purposes of comparison with the first two
criteria, since all eight items had very high factor
loadings. Item 5 by itself was also used as a criterion in
order to determine how well the predictor variables could
predict the students' ratings of "grading fairness."
Reduction of the Initial Battery of Predictor Variables
The means and standard deviations of the 27 predictor
and criterion variables are presented in Table 7. They
were computed as part of this first stepwise multiple
regression analysis. It should be noted that these means
and standard deviations were computed across the 2,360
course sections and, therefore, that teachers are unequally
represented according to the number of course sections they
taught.
Table 8 shows the correlations among the 27 predictor
and criterion variables. These are the correlations that
were computed by the stepwise multiple regression analysis
Table 7
Means and Standard Deviations
of the Predictor and Criterion Variables

Variable                             Mean     S.D.
 1 Sex                               1.15      .36
 2 Appointment length                9.04      .28
 3 Percentage employed              98.73     7.68
 4 Class size                       22.48    28.34
 5 Number of years tenured           4.69     6.99
 6 Age                              41.31    10.10
 7 Years since hiring                7.59     7.45
 8 Course level                      1.79      .79
 9 Title level                       2.65     1.00
10 Course location                   1.80      .40
11 Department quantitativeness      33.75    43.35
12 Rate of return of ratings          .57      .16
13 Average grade                     2.94      .57
14 Standard deviation of grades       .72      .32
15 Variance of grades                 .61      .44
16 Skewness of grades                2.60     2.81
17 Kurtosis of grades               13.62    29.34
18 Item 1                            8.1      1.13
19 Item 2                            6.91     1.60
20 Item 3                            6.87     1.47
21 Item 4                            8.01     1.29
22 Item 5                            7.61     1.29
23 Item 6                            7.54     1.46
24 Item 7                            7.52     1.38
25 Item 8                            7.33     1.45
26 Average of Items 1-8              7.49     1.20
27 Average of Items 1-7              7.51     1.17

Note. N = 2,360 for each variable.
Table 8
Product-moment Correlations
Among the Predictor and Criterion Variables

Variable                      1      2      3      4      5      6
 1 Sex                      1.00   -.03   -.07    .00   -.09    .08
 2 Appointment length       -.03   1.00   -.10    .00   -.02    .00
 3 Percentage employed      -.07   -.10   1.00   -.01    .07    .01
 4 Class size                .00    .00   -.01   1.00   -.04   -.08
 5 No. of years tenured     -.09   -.02    .07   -.04   1.00    .68
 6 Age                       .08    .00    .01   -.08    .68   1.00
 7 Years since hiring       -.06    .00    .09   -.06    .95    .70
 8 Course level             -.12    .00   -.01   -.14    .05    .08
 9 Title level              -.20   -.03    .11   -.04    .62    .58
10 Course location          -.22    .06   -.01    .08    .15   -.03
11 Dept. quant.             -.09   -.02    .01   -.04    .08    .06
12 Rate of return            .05   -.01   -.02   -.06    .00    .02
13 Average grade            -.04    .08   -.01   -.16    .03    .03
14 Std. dev. of grades       .04   -.05   -.01    .20   -.03   -.07
15 Variance of grades        .04   -.05    .01    .14   -.03   -.07
16 Skewness of grades        .02   -.03    .01    .51   -.03   -.09
17 Kurtosis of grades        .02   -.02    .01    .77   -.03   -.07
18 Item 1                   -.00   -.02    .08   -.06    .15    .16
19 Item 2                    .05    .01    .05   -.02    .03   -.04
20 Item 3                    .05    .00    .05   -.04   -.00   -.06
21 Item 4                    .06    .01    .07   -.05    .05    .07
22 Item 5                    .02    .02    .01   -.06   -.03   -.04
23 Item 6                    .02    .03    .03   -.07    .01    .00
24 Item 7                    .06    .02    .05   -.06   -.03   -.07
25 Item 8                    .02    .01    .06   -.05    .03   -.00
26 Average of Items 1-8      .04    .01    .06   -.06    .03   -.00
27 Average of Items 1-7      .04    .01    .06   -.06    .02   -.00

Note. All r's > .04 (or < -.04) are significant, p < .05.
For r's > .10 (or < -.10), p < .000001.
Table 8 (Continued)

Variable                      7      8      9     10     11     12
 1 Sex                      -.06   -.12   -.20   -.22   -.09    .05
 2 Appointment length        .00    .00   -.03    .06   -.02   -.01
 3 Percentage employed       .09   -.01    .11   -.01    .01   -.02
 4 Class size               -.06   -.14   -.04    .08   -.04   -.06
 5 No. of years tenured      .95    .05    .62    .15    .08    .00
 6 Age                       .70    .08    .58   -.03    .06    .02
 7 Years since hiring       1.00    .02    .61    .10    .07    .00
 8 Course level              .02   1.00    .26    .34   -.01    .05
 9 Title level               .61    .26   1.00    .30    .06    .01
10 Course location           .10    .34    .30   1.00    .04   -.04
11 Dept. quant.              .07   -.01    .06    .04   1.00    .06
12 Rate of return            .00    .05    .01   -.04    .06   1.00
13 Average grade             .02    .56    .15    .31   -.15    .03
14 Std. dev. of grades      -.03   -.52   -.15   -.17    .15   -.07
15 Variance of grades       -.03   -.47   -.14   -.17    .19   -.04
16 Skewness of grades       -.04   -.38   -.11   -.09    .18   -.04
17 Kurtosis of grades       -.03   -.21   -.05   -.01    .10   -.02
18 Item 1                    .16    .11    .28    .02   -.01    .05
19 Item 2                    .03    .15    .09    .06   -.09    .01
20 Item 3                    .00    .16    .08    .04   -.03    .02
21 Item 4                    .06    .14    .14    .01   -.12    .00
22 Item 5                   -.03    .16    .06    .08    .02   -.01
23 Item 6                    .01    .19    .05    .10   -.04   -.03
24 Item 7                   -.03    .18    .06    .08   -.04    .01
25 Item 8                    .04    .15    .12    .06   -.04    .01
26 Average of Items 1-8      .03    .18    .12    .07   -.05    .01
27 Average of Items 1-7      .03    .19    .12    .07   -.05    .01
Table 8 (Continued)

Variable                     13     14     15     16     17     18
 1 Sex                      -.04    .04    .04    .02    .02   -.00
 2 Appointment length        .08   -.05   -.05   -.03   -.02   -.02
 3 Percentage employed      -.01   -.01    .01    .01    .01    .08
 4 Class size               -.16    .20    .14    .51    .77   -.06
 5 No. of years tenured      .03   -.03   -.03   -.03   -.03    .15
 6 Age                       .03   -.07   -.07   -.09   -.07    .16
 7 Years since hiring        .02   -.03   -.03   -.04   -.03    .16
 8 Course level              .56   -.52   -.47   -.38   -.21    .11
 9 Title level               .15   -.15   -.14   -.11   -.05    .28
10 Course location           .31   -.17   -.17   -.09   -.01    .02
11 Dept. quant.             -.15    .13    .19    .18    .10   -.01
12 Rate of return            .03   -.07   -.04   -.04   -.02    .05
13 Average grade            1.00   -.68   -.65   -.56   -.33    .14
14 Std. dev. of grades      -.68   1.00    .94    .78    .44   -.10
15 Variance of grades       -.65    .94   1.00    .86    .49   -.08
16 Skewness of grades       -.56    .78    .86   1.00    .82   -.07
17 Kurtosis of grades       -.33    .44    .49    .82   1.00   -.04
18 Item 1                    .14   -.10   -.08   -.07   -.04   1.00
19 Item 2                    .29   -.17   -.16   -.12   -.06    .67
20 Item 3                    .27   -.16   -.15   -.11   -.06    .68
21 Item 4                    .23   -.19   -.18   -.16   -.09    .73
22 Item 5                    .41   -.20   -.17   -.14   -.10    .54
23 Item 6                    .43   -.23   -.23   -.20   -.13    .51
24 Item 7                    .32   -.18   -.17   -.14   -.08    .54
25 Item 8                    .32   -.19   -.18   -.15   -.09    .74
26 Average of Items 1-8      .35   -.21   -.19   -.16   -.09    .77
27 Average of Items 1-7      .36   -.21   -.19   -.16   -.09    .77
Table 8 (Continued)
Variable 19 20 21 22 23 24
1 Sex .05 .05 .06 -.02 .02 .06
2 Appointment length .01 .00 .01 .02 .03 .02
3 Percentage employed .05 .05 .07 .01 .03 .05
4 Class size -.02 -.04 -.05 -.06 -.07 -.06
5 No. of years tenured .03 -.00 .05 -.03 .01 -.03
6 Age -.04 -.06 .07 -.04 .00 -.07
7 Years since hiring .03 .00 .06 -.03 .01 -.03
8 Course level .15 .16 .14 .16 .19 .18
9 Title level .09 .08 .14 .06 .05 .06
10 Course location .06 .04 .01 .08 .10 .08
11 Dept. quant. -.09 -.03 -.12 .02 -.04 -.04
12 Rate of return .01 .02 .00 -.01 -.03 .01
13 Average grade .29 .27 .23 .41 .43 .32
14 Std. dev. of grades -.17 -.16 -.19 -.20 -.23 -.18
15 Variance of grades -.16 -.15 -.18 -.17 -.23 -.17
16 Skewness of grades -.12 -.11 -.16 -.14 -.20 -.14
17 Kurtosis of grades -.06 -.06 -.09 -.10 -.13 -.08
18 Item 1 .67 .68 .73 .54 .51 .54
19 Item 2 1.00 .89 .70 .64 .66 .74
20 Item 3 .89 1.00 .69 .68 .67 .74
21 Item 4 .70 .69 1.00 .57 .65 .63
22 Item 5 .64 .68 .57 1.00 .81 .71
23 Item 6 .66 .67 .65 .81 1.00 .78
24 Item 7 .74 .74 .63 .71 .78 1.00
25 Item 8 .90 .89 .79 .78 .81 .83
26 Average of Items 1-8 .90 .91 .82 .83 .85 .87
27 Average of Items 1-7 .90 .90 .83 .83 .86 .87
Table 8 (Continued)
Variable 25 26 27
1 Sex .02 .04 .04
2 Appointment length .01 .01 .01
3 Percentage employed .06 .06 .06
4 Class size -.05 -.06 -.06
5 No. of years tenured .03 .03 .02
6 Age -.00 -.00 -.00
7 Years since hiring .04 .03 .03
8 Course level .15 .18 .19
9 Title level .12 .12 .12
10 Course location .06 .07 .07
11 Dept. quant. -.04 -.05 -.05
12 Rate of return .01 .01 .01
13 Average grade .32 .35 .36
14 Std. dev. of grades -.19 -.21 -.21
15 Variance of grades -.18 -.19 -.19
16 Skewness of grades -.15 -.16 -.16
17 Kurtosis of grades -.09 -.09 -.09
18 Item 1 .74 .77 .77
19 Item 2 .90 .90 .90
20 Item 3 .89 .91 .90
21 Item 4 .79 .82 .83
22 Item 5 .78 .83 .83
23 Item 6 .81 .85 .86
24 Item 7 .83 .87 .87
25 Item 8 1.00 .97 .96
26 Average of Items 1-8 .97 1.00 1.00
27 Average of Items 1-7 .96 1.00 1.00
subprogram of the Statistical Package for the Social
Sciences (SPSS, Nie et al., 1970). These correlations
were subsequently used by that subprogram to perform the
regression analyses. It should be noted that these are,
again, correlations among course section averages. Also,
because more cases were deleted for missing data for the
regression analyses (more variables involved) than for the
factor analyses, some of the correlations in Table 6 differ
very slightly from the corresponding ones in Table 8.
Given the large sample size, any of these correlations
that exceeds .035 is significantly different from zero
(p < .05). Furthermore, for any r that exceeds .048, p is
less than .01; and if r exceeds .10, then p is less than
.000001. Thus one may be fairly certain that even the
relatively weak relationships found in this study were not
attributable to chance variation.
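These thresholds can be checked against the t distribution;
the sketch below (requiring scipy, and assuming a one-tailed
test, which reproduces the quoted figures) is illustrative:

    from scipy import stats

    def critical_r(n_obs, alpha):
        """Smallest r significant at `alpha` (one-tailed) for N = n_obs."""
        t = stats.t.ppf(1.0 - alpha, df=n_obs - 2)
        return t / (t ** 2 + n_obs - 2) ** 0.5

    print(round(critical_r(2360, 0.05), 3))   # about .034, cf. the .035 quoted
    print(round(critical_r(2360, 0.01), 3))   # about .048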
The first major finding, bearing on the research
question, was the correlation between the average student
grade in each course section and the average student rating
of the teacher of that course section. This was found to
be .35 (p << .000001). The other correlations involving
these two variables are particularly interesting. For
example, a correlation of -.15 was found between the average
grade in each course section and the quantitativeness of the
department in which that course is taught. Also, the
correlation between average grade and Item 5 on the rating
instrument ("Fairness in Marking") was .41, and the
correlation between average grade and Item 6 ("Attitude
Toward Student") was .43. Discussion of the implications
of these and other results is withheld until chapter V.
The results of the first stepwise multiple regression
analysis are presented in Table 9. The initial battery of
12 predictors of ratings was reduced to an "optimally
reduced subset" (see chapters I and III). This subset
consisted of the first 10 variables shown in Table 9 (above
the line of dashes). The variables are listed in the rank
order of their importance in improving the predictability
of ratings by the regression equation. Also shown in Table
9, for each variable, are the standard error of estimate
after the variable's inclusion in the regression equation,
the multiple R, R², the increase in R² over the previous
step, the simple correlation with the criterion, and the
F-value that signifies the importance of the variable to
the regression equation as of the last step.
The multiple correlation produced by the optimally
reduced subset of 10 predictors was .25. Correction for
shrinkage yielded an R of .24. Although this multiple
correlation is not very large (and accounts for only
slightly more than 6% of the criterion variance), it is
highly significant, F (10, 2349) = 14.12, p < .001. It
should be noted that the first variable to enter the
regression equation, "course level," accounted for over
half of the criterion variance finally accounted for by
the entire optimally reduced subset of 10 predictors.
Table 9
First Stepwise Multiple Regression Analysis:
Reduction of the Initial Battery of Predictor Variables

                                            Increase  Std. Error
Rank Variable                R       R²     in R²     of Estimate  Simple r   F-Value  Signif.
  1  Course level          .18162  .03299   .03299    1.18363      .18162     52.190   ***
  2  Title level           .19699  .03881   .00582    1.18032      .12155     31.695   ***
  3  Age                   .21045  .04429   .00548    1.17719     -.00118     19.545   ***
  4  Sex                   .22995  .05288   .00859    1.17214      .03509     19.702   ***
  5  Percentage employed   .23486  .05516   .00228    1.17098      .05653      5.181   *
  6  Dept. quant.          .23889  .05707   .00191    1.17004     -.05062      4.775   *
  7  Class size            .24261  .05886   .00179    1.16918     -.05908      3.772   *
  8  Appointment length    .24366  .05937   .00051    1.16911      .01186      1.120   n.s.
  9  Years since hiring    .24431  .05969   .00032    1.16916      .03132      2.065   n.s.
 10  No. of years tenured  .24555  .06029   .00061    1.16903      .02543      1.242   n.s.
----------------------------------------------------------------------------------------------
 11  Course location       .24603  .06053   .00024    1.16913      .06555       .605   n.s.
 12  Rate of return        .24605  .06054   .00001    1.16938      .00768       .014   n.s.

*** p < .001     * p < .05     n.s. p > .05
Furthermore, even though all 12 predictor variables
increased the multiple R somewhat, only the first 10
reduced the standard error of estimate (optimality).
The F-values of the predictors as of the last step
show the final significance of each predictor. They also
demonstrate the fact that predictors may gain or lose
significance when other predictors enter the regression
equation. (The rank order of the F-values is not the same
as the order of inclusion of the variables into the
regression equation.)
Addition of the Five New Predictor Variables
to the Optimally Reduced Subset
The results of the second stepwise multiple regression
analysis are presented in Table 10. These results show that
the average of the student grades in each course section was
the single best predictor of the average rating of the
teacher of that course section. Furthermore, the addition
of the grading variables to the optimally reduced subset of
other predictors significantly improved the multiple
correlation from .25 to .39, F (4, 2345) = 60.13, p < .001.
The variable "average grade" by itself accounted for nearly
8.5% of the criterion variance (more than was accounted for
by the entire optimally reduced subset of other predictors).
A further indication of the importance of the grading
variables is provided by the fact that the variable "average
grade," when it entered the regression equation at step
Table 10
Second Stepwise Multiple Regression Analysis:
Addition of the Five Grading Variables to the Optimally Reduced Subset

                                            Increase  Std. Error
Rank Variable                R       R²     in R²     of Estimate  Simple r   F-Value  Signif.
  1  Course level          .18162  .03299   .03299    1.18363      .18162      1.333   n.s.
  2  Title level           .19699  .03881   .00582    1.18032      .12155     29.842   ***
  3  Age                   .21045  .04429   .00548    1.17719     -.00118     11.949   ***
  4  Sex                   .22995  .05288   .00859    1.17214      .03509     17.323   ***
  5  Percentage employed   .23486  .05516   .00228    1.17098      .05653      6.003   **
  6  Dept. quant.          .23889  .05707   .00191    1.17004     -.05062       .002   n.s.
  7  Class size            .24261  .05886   .00179    1.16918     -.05908      3.878   *
  8  Appointment length    .24366  .05937   .00051    1.16911      .01186       .177   n.s.
  9  Years since hiring    .24431  .05969   .00032    1.16916      .03132      3.167   *
 10  No. of years tenured  .24555  .06029   .00061    1.16903      .02543      3.298   *
----------------------------------------------------------------------------------------------
 11  Average grade         .38080  .14501   .08472    1.11533      .35333    206.038   ***
 12  Skewness of grades    .38445  .14780   .00279    1.11375     -.15866      3.191   *
 13  Standard deviation
     of grades             .38469  .14799   .00018    1.11386     -.20510      3.532   *
 14  Variance of grades    .38621  .14916   .00117    1.11334     -.19352      3.168   *
----------------------------------------------------------------------------------------------
 15  Kurtosis of grades    .38622  .14916   .00001    1.11357     -.09191       .019   n.s.

*** p < .001     ** p < .01     * p < .05     n.s. p > .05
number 11, completely dominated the regression equation.
It took over and rearranged the regression equation to such
an extent that the variables "course level" and "department
quantitativeness" lost nearly all of their significance as
contributors to the final regression equation. The extent
of the rearrangement is clearly indicated by the column of
F-values in Table 10 compared to the same column in Table 9.
The first 14 variables listed in Table 10 (above the
lower line of dashes) produced the multiple correlation
with the lowest standard error of estimate. That R was .39.
Correction for shrinkage yielded an R of .38. While this
is still not a very large multiple correlation (accounting
for only about 15% of the criterion variance), it is highly
significant by itself, F (14, 2345) = 28.28, p < .001. The
significance of the increase in the multiple correlation is
described above.
Parallel Results Using Other Criteria
The choice of the criterion variable for the above
regression analyses is explained above (see Factor Analyses
of the Rating Instrument's Items). As mentioned above,
three other criteria, Item 8 by itself, the average of the
other seven items, and Item 5 by itself, might represent
the single ratings factor as well as the average of all
eight items. In order to determine if the choice of the
criterion was important, three more pairs of stepwise
multiple regression analyses were performed. These analyses
were run for the three other criteria exactly as they were
for the first criterion. A reduction of the initial battery
of predictors was done first, and then the five grading
variables were added to the new optimally reduced subset in
each case.
The results of these analyses were nearly identical to
those obtained with the original criterion. Using "Item 8"
as the criterion, the reduction of the initial battery of
predictors resulted in a multiple correlation of .23. The
correction for shrinkage yielded an R of .22, F (9, 2350) =
13.01, p < .001. The variable "appointment length" did not
make a significant contribution to this regression equation,
though it did for the other three criteria. Thus, this
optimally reduced subset of predictors of "Item 8" included
only nine of the independent variables.
The addition of the grading variables to this optimally
reduced subset of nine predictors increased the multiple
correlation to .36, F (13, 2346) = 26.09, p < .001. The
correction for shrinkage did not lower this R. The increase
in the multiple correlation was also highly significant,
F (4, 2346) = 52.94, p < .001.
Similarly, when the average of the first seven items
on the UCRSI was used as the criterion, the reduction of the
initial battery of predictors resulted in a multiple
correlation of .25. Correction for shrinkage yielded an R
of .24, F (10, 2349) = 14.44, p < .001. This optimally
reduced subset of predictors included the same 10 variables
as the subset of predictors of the average of all eight
items, and there was only one minor difference in their
order: the variables "age" and "sex" were reversed, although
their respective F-values as of the last step were nearly
identical in both analyses.
The addition of the five grading variables in this case
increased the multiple correlation to .39, with the
correction for shrinkage yielding an R of .38, F (14, 2345)
= 28.72, p < .001. The increase in the multiple correlation
was also highly significant, F (4, 2345) = 60.75, p < .001.
The use of "Item 5" as the criterion yielded slightly
different results. The reduction of the initial battery of
predictors produced a lower multiple R than it did for the
other three criteria. However, it was still a highly
significant .19, F (9, 2350) = 9.29, p < .001. The
correction for shrinkage did not lower this R. The
resulting optimally reduced subset consisted of nine of the
predictors. "Course level," "age," and "title level" were
the best three predictors out of the initial battery of 12
(as in the other reductions), but the order of the less
important predictors was altered.
When the five grading variables were added to the
optimally reduced subset of predictors of "Item 5," the
multiple correlation increased to .45, F (13, 2346),
p << .001. The correction for shrinkage did not lower this
R. This increase (from the smallest R found in this study
to the largest) was, of course, very significant,
F (4, 2346) = 120.71, p << .001.
It should be noted that for all four pairs of
regression analyses, the reported F-values computed for
the increases in the multiple R's (on account of the
addition of the grading variables) were based on the
multiple R's after shrinkage. This is conservative
(slightly understating the F-value), but in this study it
resulted in no practical differences, since all of the
F-values were so highly significant.
Summary
Factor analyses of the eight-item rating instrument
showed that there was essentially one factor underlying the
UCRSI ratings data. This led to the choice of the average
of all eight items on the UCRSI as the criterion variable
to represent that factor. Multiple regression analyses
yielded low but highly significant multiple correlations.
Moreover, they showed that the addition of the grading
variables to the optimally reduced subset of other
predictors of ratings did indeed make a highly significant
improvement in the predictive efficiency of the regression
equation. Furthermore, practically identical results were
obtained for three other criterion variables.
CHAPTER V
DISCUSSION AND CONCLUSIONS
This chapter presents a discussion of the results of
this study and explores several implications of those
results. This chapter also provides some recommendations
in light of certain conclusions based on those results and
implications.
Discussion of Results and Implications
The results of this study apparently provide a unique
contribution to the literature on the research question:
What is the influence of the grades students receive on
their ratings of the college teachers who gave them those
grades? This study has used multivariate techniques, and
has studied a very large sample of course sections across
an entire university. Previous studies have typically
studied only one or a few course sections, or have
suffered from several other inadequacies (see chapter I).
This study did not suffer from those inadequacies, and thus
the results are probably more definitive and generalizable.
Furthermore, these results provide more up-to-date knowledge
in view of certain changes that have occurred in rating and
grading practices over the past decade (see chapter I).
The primary implication of the results of this study
is that there is an interrelationship between grades and
ratings. It would seem that both the direction and the
extent of the hypothesized "grading bias" have been
demonstrated. That is, students apparently tend to rate
lower those teachers from whom they receive lower grades,
and vice versa. However, in light of the fact that the
grading variables accounted for only about 9% of the
variance in ratings in this study, other factors (valid or
not) must also be influencing the students' ratings. In
other words, students did not simply "pay off" their teachers
with ratings in direct proportion to the grades they
received from those teachers. Rather, the students were
biased or influenced by their grades, overall, in such a
way that permitted other considerations also to be involved
in the rating process.
Possibly some of the students did strictly "pay off"
their teachers for grades received, while others were not
at all influenced by their grades. On the other hand, it
is possible that most or all of the students' ratings are
merely "shifted" by the grades received. That is, students
might rate teachers more or less validly, but plus or minus
a certain amount according to the grades received. Without
the ability to pair individual student grades and ratings
(given the anonymity of ratings), it could not be determined
whether some students were significantly more influenced by
their grades than were other students. This would be an
appropriate question for further research to answer
conclusively, but the previous studies that have used
paired data have often found essentially the same degree
of relationship between grades and ratings, in spite of
certain inadequacies or faults in those studies.
Overall average ratings are often the measure used by
administrators and department chairmen for making decisions
about faculty pay, promotion, and tenure, and this study has
shown the probable existence of an overall grading bias.
To some extent, it is irrelevant what the differences are
between individual students, when it comes to the grading
bias. It does not even matter whether this bias is conscious
or subconscious, so long as it is actually dependent upon
grades. Unless one could develop a way to count only the
ratings of those students who are not biased by grades, or
unless a conscious grading bias were subject to elimination
more than an unconscious bias, then the overall grading bias
will exist, whatever its makeup might be.
Even though the rate of return of the ratings used in
this study was only 55%, it has been about 55% for many
years, and these are ratings that are systematically
used for making administrative decisions about faculty
careers. That is, even though returned ratings may not be
generalizable across all of the students the faculty member
taught (since students who return ratings may not constitute
a random sample so far as their opinions are concerned),
such ratings are commonly used as though they were
representative of overall student opinion. The results
of this study should be generalizable to the many other
rating systems operating with similar rates of return.
It remains for future researchers to determine whether an
appreciably different rate of return would have any
influence on the relationships found in this study, but
these results are applicable to rating systems as they are
commonly used.
Also, to the extent that the course sections and
rating procedures at the University of Connecticut are
representative of those across the nation, the results of
this study are generalizable. The relationship between
grades and ratings probably varies from one institution
to another because of differences in the students, faculty,
grading systems, rating instruments, and rating procedures.
For example, the timing of ratings (before or after final
grades are awarded) is an important difference between
rating systems, even though previous research (e.g., Bausell
& Magoon, 1972a; Holmes, 1971) indicates that an expected
grade bias probably exists before the actual grade is
determined. Perhaps future researchers could use similar
rating procedures at a large number of institutions with
comparable grading practices to find out how generalizable
the results of this study are.
It is possible that there are other variables which,
when added to the final regression equation in this study,
would lower or eliminate the importance of the grading
variables as predictors of ratings. That is, it is possible
that grades and ratings correlate without there being any
causal relationship between them, and that ratings actually
depend completely on some other unknown factor or factors.
This is doubtful, however. The simple and multiple
correlations found in this study indicate a definite, if
partial, relationship. Furthermore, the very high
significance levels of these simple and multiple
correlations suggest that the results are not attributable
to chance variation. Also, certain previous findings (see
especially Holmes, 1972) provide strong (experimental)
evidence of a causal relationship between grades and
ratings.
The results of this study would have been even more
definitive, had the optimally reduced subset of the initial
battery of predictors accounted for a larger amount of the
variance in ratings. This would have lowered the chances
that any other variable exists which would account for more
of the variance in the ratings (thus suggesting that the
present results might be spurious). The more criterion
variance "partialled out" by the optimally reduced subset,
the more significant the increase attributable to the
grading variables would have been.
Ideally, nearly all of the variance in ratings would be
attributable to students' reliable perceptions of their
teachers' performances (Treffinger and Feldhusen, 1970).
If the grading bias were about the only exception to this
rule, it might explain why only about 6% of the criterion
variance could be accounted for by the 10 predictors in the
optimally reduced subset. That is, maybe there are no
predictors of ratings more significant than those found in
this study (outside of other variables which might tap the
students' opinions about their teachers in some other way).
Even with the grading variables added in, the final
regression equation in this study was able to account for
only about 15% of the criterion variance. With such a
large portion of the variance left unaccounted for, the
chances are theoretically greater that some variable could
disprove the existence of what seems to be the grading bias.
This writer doubts the existence of any such variable.
Most likely there are few accurate predictors of student
ratings of faculty performance.
One possibility is that student ratings are almost
totally invalid as measures of effective teaching anyway,
and relate more to student and faculty personality
interactions. Similarly, peer ratings may be measures of
personality conflicts or even secondhand student ratings.
Furthermore, without a consensual definition of effective
teaching (or the purposes of education, for that matter),
even unbiased judgments may not be highly related to each
other.
The results of studies by Treffinger and Feldhusen
(1970) and by Bausell and Magoon (1972b) indicate that
students' first impressions and preconceptions are
persistent and may be the best predictors of end-of-course
ratings. Thus, it is possible that researchers should be
looking for predictors of such first impressions or
preconceptions. Perhaps the grades students receive
constitute the major influence on ratings after the initial
opinion is formed.
Also, to the extent that there is a large portion of
error variance in the ratings data, the importance of the
grading bias found in this study looms even larger. That
is, if the reliability of the ratings is considerably lower
than 1.00, then the grading variables would account for more
than 9% of the systematic variance. For example, if the
reliability were .75, then the percentage of the systematic
variance accounted for by the grading variables would be
12% (.09 / .75). Unfortunately, there is no accurate
estimate of exactly how reliable student ratings are, nor
even how reliability would best be defined. But for the
purposes of this study, it is conservative to assume that
the reliability is nearly 1.00 and that the grading bias
is no more significant than indicated by the findings.
Several of the details of the results of this study
are worthy of mention. Notably, the results of the factor
analyses of the eight-item rating instrument (see Tables 5
and 6) are interesting in their own right, beyond their
utility in the selection of criteria. Whatever the global
impression of teachers may depend upon, it was apparently
measured in the same way by the rating instrument in 1972
and 1973. The single factors and the correlations between
items for the two years are strikingly similar. It is also
interesting to note the order of the sizes of the factor
loadings. "Knowledge of Subject" had the lowest factor
loading of all eight, and "Over-All Summary as a Teacher"
had the highest loading.
The finding of only one factor suggests that a halo
effect is present, and, as suggested by Hoyt (1969) and by
Widlak et al. (1973), the separate items may be of little
diagnostic value. This writer doubts that the high
correlations found between items represent any "true"
relationships between the eight traits that the instrument
attempts to measure. Rather, there seems to be a halo
effect confounding the meaning of the separate items. It
is possible that certain "profile effects" (indicative of
differences across the rating items) may exist, even though
masked by the halo effect. However, the diagnostic value
of such "high inference" items (very general and open to
varying interpretations) is questionable anyway, even if
there were no confounding influences. It is not at all
clear exactly what a teacher should do to improve his
ratings on such global traits.
Student ratings, therefore, may not be improving
teaching in the expected way. Specific, but global, items
apparently can not help the teacher improve his teaching.
In fact, with a grading bias present, ratings almost
assuredly act counter to the larger educational goals.
That is, while a full range of grades may be educationally
appropriate, the grading bias would act to diminish the
apparent effectiveness of teachers who do give out some low
grades, and it would act to reward the lenient teachers who
give out higher and less discriminating grades.
Of course, one optimistic explanation for a grading
bias is that many high grades accompanied by high ratings
in a course section might reflect superior teaching or a
highly successful class (perhaps using "contract grading").
However, this writer doubts that such is the general case.
External criteria, such as standardized achievement tests,
would be needed to substantiate any claim that such high
grades are deserved, especially in light of the grading
trend over the past several years (see chapter I).
It was hypothesized that certain departments at this
university were more lenient (in the awarding of grades)
than others. Specifically, it was thought that the "hard"
science and mathematical departments were giving out lower
grades than other departments, and that the faculty members
in those highly quantitative departments might be suffering
from lower ratings. The correlation of -.15 found between
average grade and department quantitativeness supports such
a theory, as does the correlation of -.05 between ratings
and department quantitativeness. Of course these
correlations do not prove any causality of the relationships;
in addition, the size of the correlations suggests the
relationships are slight.
Nevertheless, department quantitativeness was a
significant predictor of ratings until the grading variables
were allowed to enter the regression equation; then its
predictive potential was subsumed by the grading variables.
It would seem that the variable "department quantitativeness"
was providing some information about grades indirectly
(because of the relationship between department
quantitativeness and grades). However, when "better"
information about grades (the grading variables themselves)
entered the regression equation, then department
quantitativeness was of no further importance in predicting
ratings. The same situation occurred with the variable
"course level," which was the best predictor of ratings
before the grading variables were considered. The F-values
in Tables 9 and 10 demonstrate the extent of these variables'
loss in predictive potential.
It is possible that certain departments are suffering
more from the grading bias than others on account of the
differences in grading practices. Moreover, it is likely
that some individual teachers are suffering more than others
according to their grade distributions. The teachers with
lenient grade distributions are probably not always
the best teachers. Thus, the grading bias lowers the
probability that ratings could be valid as measures of
teaching effectiveness.
As suggested earlier, the effect of grades on ratings
may well be only partial, i.e., one of several influences
(see discussion of first impressions above). The simple
correlation between grades and ratings found in this study
was not high (.35), but it was very highly significant
because of the large sample size. Many previous studies,
in spite of serious shortcomings (see chapter I), also found
significant correlations of this approximate magnitude.
Typically the correlations did not exceed .30 (Costin et
al., 1971). The results of this study support the findings
of the large portion of the previous research which found
such correlations. Thus, this correlation of .35 between
grades and ratings, and the similar correlations found by
other researchers, could be quite accurate and indicate that
the "true" relationship between grades and ratings probably
lies in the vicinity of .30 to .35.
The correlation of .41 found between average grade and
Item 5 ("Fairness in Marking") indicates that this item is
especially sensitive to the grading bias. Students
evidently consider higher grades as "fair." This is not
very surprising. One might expect that a rating item like
"Fairness in Marking" would receive the brunt of the grading
bias. Actually, however, the grading bias apparently
affects all eight items (see correlations among average
grade and Items 1-8 in Table 8). Furthermore, the
correlation is highest (.43) for Item 6 ("Attitude Toward
Student"). Students apparently consider grades awarded as
a primary indication of their teachers' attitudes toward
students.
While Items 5 and 6 seem to be the most sensitive of
the eight items to the grading bias (having the highest
correlations with all five of the grading variables), other
considerations must also enter into the students' ratings
even of these two items. When Item 5 by itself was the
criterion variable (see Parallel Results Using Other
Criteria), the grading variables still managed to account
for only about 17% of the criterion variance. Thus, the
bias is partial even for Items 5 and 6, but it is pervasive
across all eight items. This finding supports the above
mentioned assertion by Bausell and Magoon (1972b) that
disappointed students "will tend to deprecate the
instructor's teaching performance in areas other than his
grading system" (p. 130, emphasis added).
Conclusions and Recommendations
It would seem that one important influence on student
ratings of college teachers has been identified. It is the
so-called "grading bias," and it apparently accounts for
about 9% of the variance in ratings. Variables which could
account for most of the rest of the variance, however, have
not been identified. Student ratings may or may not be
mostly valid in spite of the grading bias. There is no
proof either way. It remains for future research to answer
many such questions raised by the results of this study.
Even if researchers could find variables which would account
for the rest of the variance in the ratings, that would be
no guarantee that ratings are valid as measures of effective
teaching. Such variables, however, probably would offer
substantial clues as to whether or not ratings are valid,
depending on the identity of those variables.
Well defined educational goals are needed before
effective teaching can be defined, and external criteria of
effective teaching must be well defined before valid
measures of effectiveness can be firmly established. This
positive, constructive, and difficult work needs to be done.
The results of this study are, therefore, somewhat
negative. That is, a grading bias has been found which most
likely serves to lower the validity of student ratings.
Perhaps positive steps could be taken which would eliminate
the grading bias. If possible, methods should be devised
which would do just that. One possibility is that ratings
could be collected very early in the semester. This would
allow for feedback to the teachers in time for improvement
to occur before the end of the semester (assuming the
teachers would be responsive and that ratings would indicate
desired changes). However, there is still likely to be an
expected grade bias (Holmes, 1971), and there is still no
certainty about the validity of ratings even without any
grading bias.
The results of this study have demonstrated a grading
bias for a "high inference" rating instrument (containing
very general questions open to varying interpretations), but
radically different results might occur with so-called "low
inference scales" (asking more objective questions which
supposedly would be less subject to halo effects and biases).
Such scales may or may not eliminate the grading bias and
may or may not be valid as measures of effective teaching.
These facts remain to be determined by future researchers.
One might suggest that students could be "taught" how
to rate their teachers more fairly. The evidence of a halo
effect, however, suggests that students are unable to
separate their feelings about grades and teachers,
especially on "high inference scales." Or it might be
suggested that, if grading practices were completely fair
and non-arbitrary, then any bias caused by grades would
disappear. However, if grades were made fair (more justly
discriminating), it seems that grades would be lower, and
that this might in turn increase the bias.
Perhaps the most important conclusion for immediate
consumption is that, since grades do seem to influence
student ratings of college teachers, this bias should be
taken into account whenever one is interpreting student
ratings. Administrators, department chairmen, and promotion
and tenure committees across the nation should remember that
the grading bias exists and that certain teachers may suffer
from the bias more than others, depending on the grades they
give out. It seems intuitively obvious that one should not
reward teachers for leniency if students are expected to
work and to achieve. It would be better to reward teachers
according to the actual achievement levels reached by their
students (measured by standardized achievement tests).
One policy option might be to drop student ratings
altogether. They are costly, and possibly not worth the
cost, especially since they must be interpreted with caution.
Some reasons for not eliminating student ratings are: they
are well established; some institutional prestige is
associated with their use; many students want them; they at
least give the appearance of requiring teacher accountability;
and dropping them could conceivably lower faculty concern
for effective teaching. On the other hand, it might be
argued, better teaching and fairer grading would occur
without the faculty's probable fear of reprisals on ratings
(Ladas, 1974; "Too Many A's," 1974).
This writer recommends that university officials and
faculties review the above implications, and decide how best
to serve the educational needs involved. If student ratings
can not be made objective and valid, if the grading bias can
not be eliminated, and if student ratings can not be dropped
outright, then it would seem two possible paths are suggested.
Either decision makers should ignore the ratings, or they
should combine them with other measures of teacher
effectiveness (perhaps achievement tests or indices of the
amount of work done by the students) in such a way that
ratings would not encourage teachers to be slack or to
demand too little.
APPENDIX
MACRO FLOWCHARTS
OF THE DATA PROCESSING STEPS USED IN THIS STUDY
Figures 6 through 18 represent the major data
processing steps required to accumulate the data for this
study. Figure 5 provides a key to the symbols used in
those figures, and the letters and numerals within the
symbols in the figures are the names of the tapes,
documents, and operations symbolized. Keypunching and
manual data lookup operations are labeled but not named.
A brief description of the purpose of each step follows.
Merge 1. This first step matched and merged data from
the "instructor header record" tape (one instructor header
record per course section evaluated) with data from the
"professional history file" tape. The merged data were
output onto both paper and tape, and information about
missing data was also printed on the paper output for use
in the next step.
First missing data input. In this step, the
information obtained in the merge 1 step was used to look
up and punch onto cards the data missing from the
professional history file tape. The resulting deck of
cards was retained for input into the merge 5 step below.
Merge 2. This step matched and merged the faculty
evaluitions data with the merged output from the merge I
step. The individuai ratings data were used to compute .
means, N's, and-standard deviations for each'item on the
rating scale, and the reoalts were output-onto both paper
and tape.
Sort 1. The records on the output tape from the merge
2 step abolie were sorted by branch, department, cowse,
number, and Section number. The sorted records were output
onto another tape for later use in the merge 3 step.
Sort 2. The grade distribution data were sorted as in
sort 1 above so that the course sections would be in the
same sequence on both tapes for input into the merge 3
program.
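Both sorts order the records on the same composite key, so that the two tapes can be read in step by the merge 3 program. A one-line modern equivalent, with invented records:

    # Sort on the composite key (branch, department, course number, section).
    records = [{"branch": 1, "dept": "MATH", "course": 101, "section": 2},
               {"branch": 1, "dept": "ENGL", "course": 200, "section": 1}]
    records.sort(key=lambda r: (r["branch"], r["dept"], r["course"], r["section"]))
    print(records)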
Merge 3. This program matched and merged the grade
distribution data with the other data previously assembled
for each course section. Computations of average grades
and standard deviations of grades in the course sections
were made at this point, and the merged results were output
onto paper and tape. Also on the paper output was
information about missing data detected by this program.
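The grade computations amount to a weighted mean and standard deviation over the section's grade distribution. The sketch below assumes the conventional A = 4 through F = 0 point scale (an assumption; the scale is defined in the body of the study) and invented counts:

    import math

    # Section mean grade and SD from a grade distribution (counts invented).
    points = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
    dist = {"A": 6, "B": 12, "C": 9, "D": 2, "F": 1}
    n = sum(dist.values())
    mean_grade = sum(points[g] * k for g, k in dist.items()) / n
    var = sum(k * (points[g] - mean_grade) ** 2 for g, k in dist.items()) / n
    print(round(mean_grade, 2), round(math.sqrt(var), 2))  # population SD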
Second missing data input. The information on missing
data from merge 3 was used, in this step, to create a deck
of cards containing that missing data. The cards were
retained for input into the merge 5 step below.
Merge 4. This program matched students' records of
their Quantitative and Verbal Scholastic Aptitude Test scores
with their declared majors in order to compute for each
department the variable "department quantitativeness." The
results of these computations were output onto paper and
cards. The deck of cards was retained for input into the
merge 5 program.
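The exact formula for department quantitativeness is given in the body of the study; purely for illustration, the sketch below aggregates students' quantitative scores by department of declared major, taking a simple mean (a Q-minus-V difference or Q-to-V ratio would change only the aggregated quantity):

    from collections import defaultdict

    # Hypothetical (dept of major, SAT-Q, SAT-V) records.
    students = [("MATH", 680, 560), ("MATH", 640, 600), ("ENGL", 520, 650)]
    totals = defaultdict(lambda: [0, 0])  # dept -> [sum of SAT-Q, count]
    for dept, sat_q, sat_v in students:
        totals[dept][0] += sat_q
        totals[dept][1] += 1
    print({d: s / n for d, (s, n) in totals.items()})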
Merge 5. The purpose of this step was to insert the
data on cards (from previous steps) into the course section
records, to compute several new variables from old ones (age
from date of birth, for example), and to edit the data for
the acceptability of the values. The results were output
onto paper and tape, and information about missing data and
improper values was also printed on the output paper.
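The derive-and-edit logic can be pictured as below; the cutoff date and the acceptable age range are invented for the example, since the study's actual edit criteria are not restated here:

    from datetime import date

    # Derive age from date of birth and flag improper values (merge 5 edit).
    def derive_and_edit(record, as_of=date(1974, 1, 1)):
        dob = record["date_of_birth"]
        age = as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))
        record["age"] = age
        if not 20 <= age <= 75:  # range check is illustrative only
            record.setdefault("flags", []).append("age out of range")
        return record

    print(derive_and_edit({"date_of_birth": date(1940, 6, 15)}))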
Third missing data input. The information on missing
or unacceptable data from the merge 5 program was used to
create a deck of cards containing the missing data. This
deck was retained for input into the merge 6 program below.
Merge 6. This program was used to insert the missing
data discovered in the merge 5 step into the final data
records, which were output onto paper and tape.
Regression run 1. This step accomplished the reduction
of the initial battery of predictor variables using the
stepwise multiple regression analysis subprogram of the
Statistical Package for the Social Sciences (SPSS; Nie et
al., 1970). The results of this analysis were used to make
the ordered list of variables (the reduced set in the order
of their inclusion in the regression equation) for input
into the next step.
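The SPSS routine itself is documented in Nie et al. (1970); as a conceptual stand-in, the following Python sketch performs a forward stepwise selection, entering at each step the predictor that most improves R-squared and stopping when the improvement becomes trivial. (Real stepwise programs use an F-to-enter criterion rather than this simplified gain threshold, and the data here are random stand-ins, not the study's variables.)

    import numpy as np

    def r_squared(X, y):
        X1 = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

    def forward_stepwise(X, y, min_gain=0.01):
        chosen, remaining, best = [], list(range(X.shape[1])), 0.0
        while remaining:
            r2, j = max((r_squared(X[:, chosen + [j]], y), j) for j in remaining)
            if r2 - best < min_gain:
                break  # no remaining predictor adds enough
            chosen.append(j); remaining.remove(j); best = r2
        return chosen, best  # variables in order of inclusion

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))
    y = 2 * X[:, 0] - X[:, 3] + rng.normal(size=200)
    print(forward_stepwise(X, y))  # expect columns 0 and 3 to be chosen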
Regression run 2. This step was another stepwise
multiple regression analysis using SPSS as above, but with
the five grading variables added to the optimally reduced
subset of predictors found in regression run 1.
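The logic of testing the increment contributed by the grading variables is the standard hierarchical F ratio for the gain in R-squared when a block of predictors is added; the figures below are invented stand-ins, not the study's results:

    # F test for the increment in R-squared when k_full - k_reduced
    # predictors (here, a block of grading variables) are added.
    def incremental_f(r2_full, r2_reduced, n, k_full, k_reduced):
        num = (r2_full - r2_reduced) / (k_full - k_reduced)
        den = (1 - r2_full) / (n - k_full - 1)
        return num / den  # df = (k_full - k_reduced, n - k_full - 1)

    print(round(incremental_f(0.40, 0.31, 2000, 15, 10), 1))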
Figure 5. Key to symbols used in Figures 6 through 18.
(The key identifies these flowchart symbols: magnetic computer tape; paper document or computer output paper; processing step or program; deck of computer cards; keypunching step; manual operation; sorting operation.)
[Figures 6 through 18, flowcharts of the steps described above, appear here in the original; only their captions are reproduced. Tape, document, and deck names visible within the symbols include PROFMD, LTAB866, TAB869, MRG5MD, VERMAT, DEPTCD, MAJORS, TABDQT, DEPTQT, TABRR1, and TABRR2.]
Figure 6. Merge 1.
Figure 7. First missing data input.
Figure 8. Merge 2.
Figure 9. Sort 1.
Figure 10. Sort 2.
Figure 11. Merge 3.
Figure 12. Second missing data input.
Figure 13. Merge 4.
Figure 14. Merge 5.
Figure 15. Third missing data input.
Figure 16. Merge 6.
Figure 17. Regression run 1.
Figure 18. Regression run 2.
BIBLIOGRAPHY
Aleamoni, L. M., & Graham, M. H. The relationship between CEQ ratings and instructor's rank, class size, and course level. Journal of Educational Measurement, 1974, 11, 189-202.
American Psychological Association. Publication manual of the American Psychological Association (Rev. ed.). Washington, D.C.: Author, 1967.
American Psychological Association. Publication manual of the American Psychological Association (2nd ed.). Washington, D.C.: Author, 1974.
Anikeef, A. M. Factors affecting student evaluation of college faculty members. Journal of Applied Psychology, 1953, 37, 458-460.
Bain, P. T. The pass-fail option: The congruence between the rationale for and student reasons in electing it. Journal of Educational Research, 1973, 66.
Baird, L., & Feister, W. J. Grading standards: The relation of changes in average student ability to the average grades awarded. American Educational Research Journal, 1972, 9, 431-442.
Bausell, R. B., & Magoon, J. Expected grade in a course, grade point average, and student ratings of the course and the instructor. Educational and Psychological Measurement, 1972, 32, 1013-1023. (a)
Bausell, R. B., & Magoon, J. The persistence of first impressions in course and instructor evaluations. Paper presented at the meeting of the American Educational Research Association, Chicago, April 1972. (b)
Bendig, A. W. A preliminary study of the effect of academic level, sex, and course variables on the student rating of psychology instructors. Journal of Psychology, 1952, 34, 21-26.
Bendig, A. W. The relation of level of course achievement to students' instructor and course ratings in introductory psychology. Educational and Psychological Measurement, 1953, 13, 437-448. (a)
Bendig, A. W. Student achievement in introductory psychology and student ratings of the competence and empathy of their instructors. Journal of Psychology, 1953, 36, 427-433. (b)
Bendig, A. W. A factor analysis of student ratings of psychology instructors on the Purdue Scale. Journal of Educational Psychology, 1954, 45, 385-393.
Best, J. W. Research in education (2nd ed.). Englewood Cliffs, New Jersey: Prentice-Hall, 1970.
Blum, M. L. An investigation of the relation existing between students' grades and their ratings of the instructors' ability to teach. Journal of Educational Psychology, 1936, 27, 217-221.
Brown, G. D. System/360 job control language. New York: Wiley, 1970.
Cal State to probe system's grading practices. The Chronicle of Higher Education, 1974, 8(18), 2.
Capozza, D. R. Student evaluations, grades and learning in economics. Western Economic Journal, 1973, 11, 127.
Carrier, N. A., Howard, G. S., & Miller, W. G. Course evaluation: When? Journal of Educational Psychology, 1974, 66, 609-613.
Centra, J. A. Effectiveness of student feedback in modifying college instruction. Journal of Educational Psychology, 1973, 65, 395-401.
Centra, J. A. College teaching: Who should evaluate it? Findings, 1974, 1(1), 5-8.
Committee on the Student in Higher Education. The student in higher education. New Haven, Connecticut: The Hazen Foundation, 1968.
Cooley, W. W., & Lohnes, P. R. Multivariate data analysis. New York: Wiley, 1971.
Costin, F., Greenough, W. T., & Menges, R. J. Student ratings of college teaching: Reliability, validity and usefulness. Review of Educational Research, 1971, 41, 511-535.
Costin, F., & Grush, J. E. Personality correlates of teacher-student behavior in the college classroom. Journal of Educational Psychology, 1973, 65, 35-44.
Darlington, R. B. Multiple regression in psychological research and practice. Psychological Bulletin, 1968, 69, 161-182.
Downgrading no-grade. Time, 1974, 103(5), 66.
Doyle, K. O., Jr., & Whitely, S. E. Student ratings as criteria for effective teaching. American Educational Research Journal, 1974, 11, 259-274.
Duke University Computation Center. TSAR user's manual. Durham, North Carolina: Author, 1971.
Dziuban, C. D., & Shirkey, E. C. On the psychometric assessment of correlation matrices. American Educational Research Journal, 1974, 11, 211-216.
Eble, K. E. Improving college teaching. Phi Delta Kappan, 1971, 52, 283-285. (a)
Eble, K. E. Study indicates teaching is poor. College Management, 1971, 6, 29. (b)
Eble, K. E., & The Conference on Career Development. Career development of the effective college teacher. Washington, D.C.: American Association of University Professors.
Elmore, P. B., & LaPointe, K. A. Effects of teacher sex and student sex on the evaluation of college instructors. Journal of Educational Psychology, 1974, 66, 386-389.
Etaugh, A. Reliability of college grades and grade point averages: Some implications for prediction of academic performance. Educational and Psychological Measurement, 1972, 32, 1043-1049.
Fahey, G. L. Student rating of teaching: Some questionable assumptions. Paper presented at the conference at the University of Pittsburgh Institute for Higher Education, Pittsburgh, December 1970.
Flieger, H. If Johnny gets an "A." U.S. News and World Report, 1973, 75(2), 72.
French-Lazovik, G. Predictability of students' evaluations of college teachers from component ratings. Journal of Educational Psychology, 1974, 66, 373-385.
Frey, P. W. The ongoing debate: Student evaluation of teaching. Change, 1974, 6(1), pp. 47-48; 64.
Gadzella, B. M. College students' views and ratings of an ideal professor. College and University, 1968, 44, 80-96.
Garber, H. Certain factors underlying the relationship between course grades and student judgments of college teachers (Doctoral dissertation, University of Connecticut, 1964). Dissertation Abstracts, 1964, 6512. (University Microfilms No. 66-00847)
Garber, H. Some relationships between course grades and student judgments of college teachers. Paper presented at the meeting of the American Educational Research Association, New York, February 1967.
Gessner, P. K. Evaluation of instruction. Science, 1973, 180, 566-570.
Granzin, K. L., & Painter, J. J. A new explanation for students' course evaluation tendencies. American Educational Research Journal, 1973, 10, 115-124.
Group for Human Development in Higher Education. Faculty development in a time of retrenchment. New Rochelle, New York: Change Magazine, 1974.
Heilman, J. D., & Armentrout, W. D. The rating of college teachers on ten traits by their students. Journal of Educational Psychology, 1936, 27, 197-216.
Hills, J. R. Consistent college grading standards through equating. Educational and Psychological Measurement, 1972, 32, 137-146.
Holmes, D. S. The relationship between expected grades and students' evaluations of their instructors. Educational and Psychological Measurement, 1971, 31, 951-957.
Holmes, D. S. Effects of grades and disconfirmed grade expectations on students' evaluations of their instructor. Journal of Educational Psychology, 1972, 63, 130-133.
Jaeger, R. M., & Freijo, T. D. Some psychometric questions in the evaluation of professors. Journal of Educational Psychology, 1974, 66, 416-423.
Kawecki, G. Teacher evaluations: Are they accurate? Connecticut Daily Campus, 1974, 77(119), pp. 1; 7.
Keefer, K. E. Characteristics of students who make accurate and inaccurate self-predictions of college achievement. Journal of Educational Research, 1971, 64, 401-404.
Kerlinger, F. N. Foundations of behavioral research (2nd ed.). New York: Holt, Rinehart and Winston, 1973.
Kerlinger, F. N., & Pedhazur, E. J. Multiple regression in behavioral research. New York: Holt, Rinehart and Winston, 1973.
Kirschenbaum, H., Napier, R., & Simon, S. B. Wad-ja-get? The grading game in American education. New York: Hart, 1971.
Ladas, H. Grades: Standardizing the unstandardized standard. Phi Delta Kappan, 1974, 56, 185-187.
LeComte, E. Does student power corrupt? National Review, 1974, 26, 1166.
Lee, C. B. (Ed.). Improving college teaching. Washington, D.C.: American Council on Education, 1967.
Maeroff, G. I. Students' scores again show drop. New York Times, 1973, 123(42), pp. 1; 26.
Malinowski, S. N. Not discarded. Connecticut Daily Campus, 1974, 77(100), 2.
McCracken, D. D. A guide to FORTRAN IV programming. New York: Wiley, 1965.
McDaniel, E. D., & Feldhusen, J. F. Relationships between faculty ratings and indexes of service and scholarship. Proceedings of the 78th Annual Convention of the American Psychological Association, 1970, 619-620. (Summary)
McKeachie, W. J. Student ratings of faculty. AAUP Bulletin, 1969, 55, 439-444.
McKeachie, W. J. Research on student ratings of teaching. Paper presented at the conference at the University of Pittsburgh Institute for Higher Education, Pittsburgh, December 1970.
McKeachie, W. J. The decline and fall of the laws of learning. Educational Researcher, 1974, 3(3), 7-11.
McNemar, Q. Psychological statistics (3rd ed.). New York: Wiley, 1962.
Menzie, J. C. An analysis of the process of teacher evaluation in the community college. Los Angeles, Calif.: University of California at Los Angeles, 1973. (ERIC Document Reproduction Service No. ED 083 960)
Moellenberg, W. To grade or not to grade; is that the question? College and University, 1973, 49, 5-13.
Nie, N. H., Bent, D. H., & Hull, C. H. Statistical package for the social sciences. New York: McGraw-Hill, 1970.
Nunnally, J. C. Psychometric theory. New York: McGraw-Hill, 1967.
Ockerman, E. W. Student input in changing grading systems. College and University, 1972, 47, 390-399.
Owen, S. V. Student ratings of teacher competence: A cautionary note. Paper presented at the joint meeting of the Northeastern Educational Research Association and the National Council on Measurement in Education, Ellenville, New York, November 1974.
Pambookian, H. S. Initial level of student evaluation of instruction as a source of influence on instructor change after feedback. Journal of Educational Psychology, 1974, 66, 52-56.
Pape, R. Evaluate instructors each semester. Connecticut Daily Campus, 1974, 77(106), 2. (a)
Pape, R. Evaluation in midstream. Connecticut Daily Campus, 1974, 77(96), 2. (b)
Parent, E. R., Vaughan, C. E., & Wharton, K. A new approach to course evaluation. Journal of Higher Education, 1971, 42, 133-138.
Peck, R. Can students evaluate their education? Education Digest, 1971, 36, 29-31.
Postman, N. A D+ for Mr. Ladas. Phi Delta Kappan, 1974, 56, 187-188.
Potter, D., Nalin, P., & Lewandowski, A. The relation of student achievement and student ratings of teachers. Paper presented at the meeting of the American Educational Research Association, New Orleans, March 1973.
Rodin, M. Students evaluate teachers. Change, 1974, 6(2), 5.
Rodin, M., & Rodin, B. Student evaluations of teachers. Science, 1972, 177, 1164-1166.
Rummel, R. J. Applied factor analysis. Evanston, Illinois: Northwestern University Press, 1970.
Silverberg, S. Grad finds grades inflating. Connecticut Daily Campus, 1974, 77(113), 2.
Slysz, W. D. An evaluation of statistical software in the social sciences. Communications of the ACM, 1974, 17, 326-332.
Spady, W. G. Authority, conflict, and teacher effectiveness. Educational Researcher, 1973, 2(1), 4-10.
Stallings, W. A., & Singhal, S. Some observations on the relationships between research productivity and student evaluations of courses and teaching. Paper presented at the meeting of the American Educational Research Association, Los Angeles, February 1969.
Stanley, J. C. K-R 20 as the stepped-up mean item intercorrelation. National Council on Measurements Used in Education Yearbook, 1957, 14, 78-92.
St. Onge, K. R. Let's get back to essentials. Change, 1974, 6(5), pp. 6-7; 62.
Sullivan, A. M., & Skanes, G. R. Validity of student evaluation of teaching and the characteristics of successful instructors. Journal of Educational Psychology, 1974, 66, 584-590.
Tatsuoka, M. M., & Tiedeman, D. V. Statistics as an aspect of scientific method in research on teaching. Chapter IV in N. L. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally & Co., 1963.
Too many A's. Time, 1974, 104(20), 106.
Touq, M. S., & Feldhusen, J. F. The relationship between students' ratings of instructors and their participation in classroom discussion. Paper presented at the meeting of the National Council on Measurement in Education, New Orleans, February 1973.
Treffinger, D. J., & Feldhusen, J. F. Predicting students' ratings of instruction. Proceedings of the 78th Annual Convention of the American Psychological Association, 1970, 621-622. (Summary)
Turnbull, W. W. Testing shifts from 'which' to 'how.' New York Times, 1974, 123(42), 82.
Vacon, B. Faculty evaluate grade distribution. Connecticut Daily Campus, 1974, 77(100), pp. 1; 4. (a)
Vacon, B. Grades: Do they pass the test? Connecticut Daily Campus, 1974, 77(105), pp. 1; 5. (b)
Vacon, B. New requirements limit students on Dean's List. Connecticut Daily Campus, 1974, 77(73), 3. (c)
Veldman, D. J. Fortran programming for the behavioral sciences. New York: Holt, Rinehart and Winston, 1967.
Voeks, V. W., & French, G. M. Are student ratings of teachers affected by grades? Journal of Higher Education, 1960, 31, 330-334.
Watkins, B. T. A-to-F grading system heavily favored by undergraduate, graduate institutions. The Chronicle of Higher Education, 1973, 8(8), 2.
Widlak, F. K., & Feldhusen, J. F. Factor analyses of an instructor rating scale. Paper presented at the meeting of the American Educational Research Association, New Orleans, February 1973.
Winer, B. J. Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill, 1971.
Worthington, L. H., & Grant, C. V. Factors of academic success: A multivariate analysis. Journal of Educational Research, 1971, 65, 7-10.
Zelby, L. W. Student-faculty evaluation. Science, 1974, 183, 1267-1270.