Response Styles in Student Evaluation of Teaching
by
Edgar Andrés Valencia Acuña
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Department of Curriculum, Teaching and Learning Ontario Institute for Studies in Education
University of Toronto
© Copyright by Edgar Andrés Valencia Acuña 2017
Abstract
Student Evaluation of Teaching (SET) typically refers to the use of summated rating scales to
measure teaching quality based on students’ reports. SET is widely used in post-secondary
education institutions for informing teacher professional development, curriculum revision,
personnel decisions, and for institutional accountability. The literature on SET validity is
abundant but often atheoretical, its evidence is inconclusive, and it pays scant attention to
content and response process. Specifically, research examining whether students rely on response
styles, responding independently of item content, is rare. Types of response styles are
acquiescence/disacquiescence (tendency to agree/disagree across items), extreme (tendency to
endorse extreme response options across items), and midpoint response styles (tendency to use
the midpoint option across items). Evidence of a substantial degree of response style would
reduce the validity of SET scores as a measure of teaching quality and their utility for informing
formative and summative decisions due to overestimation or underestimation of the actual level
of teaching quality and artificial changes in the relationship to other variables. Three topics
examined in the study are the degree to which SET scores are affected by response styles,
differences in the extent to which SET scores are affected by response styles across measurement
conditions, and the degree to which response styles moderate differences in SET scores between
female and male teachers. Responses to a SET summated rating scale from N=5,921 education
graduate students were analyzed. Student-level indexes of response styles suggest a high degree
of acquiescence in the direction of teaching quality overestimation, and no disacquiescence,
extreme, or midpoint response styles. The 2 (academic department) × 2 (program type) × 6
(academic session) ANOVAs on the response style indexes suggest no statistically significant
differences across measurement conditions. Finally, multiple linear regression analysis indicates
a statistically significant moderator effect of acquiescence on the difference in SET scores
between female and male teachers. The discussion addresses implications of the findings for
developers and users of SET summated rating scales, alternative interpretations of the observed
pattern of responses, limitations, and suggestions for future research.
Acknowledgments
Many people contributed, directly or indirectly, to my being able to complete this thesis after
five years in the program. My first thanks go to my family, especially my daughter Magdalena
Sofía. I sincerely hope that the fruits of this work will benefit her lastingly. My parents Pilar
and Eduardo were an important source of support in difficult times, as were my siblings Erik and
Marlene. I will gladly and constantly try to repay their generous and selfless help.
Verónica Santelices above all, together with Sandy Taut, Jorge Manzi, David Huepe, and Natalia
Salas, were the ones who encouraged my interest in pursuing a doctorate. I am enormously grateful
to them for the simple but significant act of making me believe that I had what it took to apply
to a program abroad and to complete my thesis successfully.
In Canada I received the support of many people. My special thanks go to a group of friends who
kept one another company through good times and bad. The first of them is Elizabeth Rosales, who
became my closest friend and accomplice. Fannie supported me in the moments when I needed it
most. Bryant and Faye López (together with Alma and Coco) taught me a great many lessons about
tango and about life. With Alejandra and Sam I shared fun and hard moments, and they supported me
during the difficult final stretch of my doctorate. My doctorate would not be the enriching
experience it is without this group of excellent people. Other friends who made the journey more
fun are Tugce, Irene, Frances, Desmond, Claire, Junko, Anahit, Ilinca, Yecid, Angela, Mariana,
and Felipe. I sincerely thank all of them for all their support.
Table of Contents
Acknowledgments
Table of Contents
List of Tables
List of Figures
List of Equations
List of Appendices
Chapter 1
Introduction
1.1 Student Evaluation of Teaching
1.2 SET Summated Rating Scales
1.3 Issues in SET
1.4 Relevance and Rationale
1.5 Focus of Study
1.6 Summary of Structure
Chapter 2
Literature Review
2.1 The Target Construct in SET
2.1.1 Vague Definition of the Target Construct
2.1.2 Teaching Quality
2.1.3 SET and Teaching Quality
2.2 Validity of SET
2.2.1 Classical Test Theory
2.2.2 Definition of Validity
2.2.3 SET Validity Findings
2.3 Response Styles in SET
2.3.1 Approaches to Examine Response Styles
2.3.2 Manifest Variable Approach and Same Items
2.3.3 Types of Response Styles
2.3.4 Response Styles in SET
2.4 Summary and Limitations
2.4.1 Summary
2.4.2 Limitations in SET Validity Research
2.4.3 Focus of Study
Chapter 3
Methodology
3.1 Participants
3.2 Instrument
3.3 Administration
3.4 Data Analysis
3.4.1 Research Question 1
3.4.2 Research Question 2
3.4.3 Research Question 3
3.4.4 Software
Chapter 4
Results
4.1 Distribution of Responses
4.2 Research Question 1
4.3 Research Question 2
4.3.1 Summary Statistics
4.3.2 ANOVA Results
4.4 Research Question 3
4.4.1 Part 1: Differences Teacher’s Gender
4.4.2 Part 2: ARS Moderator Effect
4.4.3 Practical Significance
Chapter 5
Discussion
5.1 Summary and Implications
5.1.1 Implications
5.1.2 Recommendations
5.2 Alternative Interpretation of Findings
5.2.1 High Level of Teaching Quality
5.2.2 Construct Underrepresentation
5.2.3 Ceiling Effect
5.2.4 Online Survey Mode
5.2.5 Strong Satisficing
5.2.6 Evaluation Goals
5.3 Limitations and Future Research
5.3.1 Use of Manifest Variable Approach
5.3.2 Use of Observational Data
5.3.3 Use of a Quantitative Approach
5.3.4 Future Research
References
Appendices
Copyright Acknowledgements
List of Tables
Tables 1 to 9
List of Figures
Figures 1 to 3
List of Equations
Equations 1 to 5
Chapter 1
Introduction
Chapter 1 introduces the present study, which examines response styles in the administration of
a student evaluation of teaching (SET) summated rating scale at a post-secondary education
institution. Section 1.1 summarizes the most relevant aspects of
SET as a method for measuring teaching quality using the report of students. Section 1.2
explains the key attributes of a summated rating scale, the most popular mode of asking students
about teaching. Section 1.3 outlines important issues affecting the utilization of SET. Section 1.4
explains the relevance and rationale of the study. Section 1.5 states the focus of the study and
Section 1.6 portrays the structure and content of the remaining chapters.
1.1 Student Evaluation of Teaching
The close relationship between teaching and learning justifies the need for reliable, accurate,
and useful information about teaching, both to promote good teaching and, in turn, to enhance
students’ learning (Joint Committee on Standards for Educational Evaluation, 2009).
The evaluation of teaching can identify the relative strengths and weaknesses of individual
teachers, suggest themes for the professional development of a group of teachers, or support the
social recognition of outstanding teaching among members of an educational community. Undoubtedly,
the most popular way to retrieve information about teaching in post-secondary education
institutions is through the report of students (Berk, 2005; Johnson, 2000; Zabaleta, 2007),
formally referred to as Student Evaluation of Teaching (SET).
One explanation for the popularity of SET is that most institutions collect information about
teaching from students through standardized questionnaires, which are inexpensive and
straightforward to implement and report (Penny, 2003; Spooren, Brockx, & Mortelmans, 2013).
Other methods of teaching evaluation such as observation
protocols and portfolios are methodologically more complex to develop and usually require
trained evaluators, increasing time and costs. Another proposed explanation for the popularity
of SET is the lack of alternative methods of teaching evaluation supported by validity evidence
(Marsh, 1997). However, the larger body of validity evidence for SET could simply reflect its
popularity, which stems from the convenience of implementing standardized questionnaires.
What SET intends to measure, that is, the target construct in SET, is often vaguely defined.
There is great diversity of content in the literature, and multiple terms are used
interchangeably without a single agreed meaning. The concept of teaching quality (Fenstermacher
& Richardson, 2005) can help reconcile these differences in content and terminology.
Teaching quality encompasses two related yet qualitatively different aspects of teaching: good
teaching referring to the quality of the teaching task, and successful teaching referring to
teaching that produces learning. SET may relate to students’ report of good teaching, students’
report of successful teaching, or both.
SET currently informs multiple types of decisions. SET originally informed the improvement of
instruction, curriculum, and programs. Starting in the 1970s, SET began to increasingly inform
administrative and personnel decisions, including retention, tenure, and promotion of faculty.
SET is also used for departmental and institutional accountability, following trends in
post-secondary education administration. In practice, SET frequently serves more than one of
these purposes simultaneously (Aylett & Gregory, 1996; Spooren et al., 2013).
1.2 SET Summated Rating Scales
There are multiple ways in which students can report information about teaching quality, for
instance, through individual interviews or focus groups. However, in a vast majority of cases,
SET is based on standardized questionnaires (Spooren et al., 2013), specifically summated rating
scales.
Summated rating scales are one of the most utilized tools in the social sciences and education for
the measurement of attitudes, opinions, personality, and emotional states among other constructs
(Spector, 1992). Summated rating scales are utilized to retrieve information about the past,
present or future, and about the respondent (self-report), about others (other-report), or about
external objects or events.
All summated rating scales including the ones found in the context of SET share four basic
characteristics (Spector, 1992):
1. A rating scale contains multiple items.
2. Items measure a property or attribute that varies quantitatively.
3. An item is a statement and participants are asked to choose the response option that
best reflects their response to the statement.
4. Items have no right answer.
The central idea underlying the use of a summated rating scale is that the sum of responses to
individual items (the total score) reflects the level of the target construct. In the case of a SET
summated rating scale, the total score reflects the magnitude of teaching quality.
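The summation idea can be illustrated with a minimal sketch. The five-point scale, the six item responses, and the function name `summated_score` are assumptions made for this example; they are not details of the instrument used in the study.

```python
from statistics import mean

def summated_score(responses, low=1, high=5):
    """Return (total, mean) for one respondent's item responses.

    Assumes every item shares the same low..high response scale
    (a hypothetical 1-5 scale by default).
    """
    if not responses:
        raise ValueError("at least one item response is required")
    for r in responses:
        if not low <= r <= high:
            raise ValueError(f"response {r} is outside the {low}-{high} scale")
    return sum(responses), mean(responses)

# Hypothetical responses of one student to a six-item scale
total, average = summated_score([4, 5, 4, 3, 5, 4])
```

Reporting the mean alongside the sum keeps the score on the metric of the response scale, which may be easier to interpret when instruments differ in length.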
Because SET scores are the sum of responses to a group of content-related items (rather than
responses to an individual item), their level of measurement is interval (Brown, 2011; Carifio & Perla, 2007).
Properties of interval measures are magnitude and equal intervals (Kaplan & Saccuzzo, 2008).
Magnitude informs the amount of teaching quality. Possible uses of this property are the
comparison of relative strengths and weaknesses among different teaching attributes, or the
identification of teachers with higher, lower or equal level of teaching quality.
The property of equal intervals means that the difference between any two adjacent points on the
response scale is the same (Kaplan & Saccuzzo, 2008). Equal intervals allow arithmetic operations on
scores and the application of descriptive and inferential statistics such as correlational analyses
and analysis of variance (Brown, 2011). An example is the calculation of differences in SET
scores between female and male teachers. Such differences are meaningful only when the level
of measurement is at least interval.
The interpretation of a SET score as a measure of teaching quality is a function of the items
(Messick, 1995). For instance, a teacher can obtain a SET score of 3.0 from the sum of items
based on a scale ranging from 1 (low level of teaching quality) to 5 (high level of teaching
quality). Aspects such as the item wording, item position, and the number, order, and labels in
the response scale can affect scores (Smyth, Dillman, & Christian, 2009; Tourangeau, Rips, &
Rasinski, 2000). Consequently, SET scores greatly depend on the way students interpret items
and utilize the response scale.
The interpretation of a SET score as a measure of teaching quality depends not only on the items.
Other aspects that can influence the interpretation of SET scores are students and the context of
the measurement (Messick, 1995). For instance, a group of first-year engineering students with
little exposure to teaching in post-secondary education would have a very different
conceptualization of teaching quality than a group of education graduate students. An instrument
with a high proportion of items targeting general attributes of teaching quality may not
differentiate between novice and expert teachers whereas an instrument including specific and
more complex aspects of teaching quality can effectively distinguish between these two groups.
Scores obtained at a department in which teacher evaluation is an important priority to support
continuous teaching improvement may have a different meaning for students, teachers and other
stakeholders than scores obtained at a department in which SET informs personnel decisions and
accountability.
1.3 Issues in SET
The definition and measurement of teaching quality are two challenging tasks as documented by
decades of educational research (Berliner, 2005; Fenstermacher & Richardson, 2005). For
instance, Popham (1992) placed the search for valid teaching evaluation alongside two of
humanity’s other perennial quests: the Holy Grail and the Fountain of Youth. The evaluation of
teaching based on students’ reports in post-secondary education institutions is no exception to
the challenges of properly defining and measuring teaching quality.
A first important issue affecting the interpretation and use of SET scores relates to content.
The definition of the target construct in SET often relies on a weak theory of teaching quality,
and the SET literature offers little evidence that the content of SET summated rating scales is
appropriate (Penny, 2003).
A second relevant issue affecting SET relates to the same cause that explains its popularity: the
use of summated rating scales. Penny (2003) correctly pointed out that SET scores are no more
valid than the method used to collect the information. A common assumption among developers and
users of summated rating scales is that the total score accurately reflects the target
construct, which implies that no extraneous influences affect participants’ responses
(Cronbach, 1946; Wetzel, Böhnke, & Brown, 2016). The measurement literature describes
various ways in which this assumption fails (Spector, 1991; Viswanathan, 2005).
Therefore, rather than assuming validity, developers and users need to provide evidence for the
validity of SET scores. Validity refers to an overall judgment of the extent to which theory and
evidence support the intended interpretation and use of scores (American Educational Research
Association, American Psychological Association, & National Council on Measurement in
Education, 2014). Specifically, evidence should support that 1) no aspects of the target
construct definition are excluded from the instrument content (construct underrepresentation) and
2) responses are not severely influenced by processes extraneous or irrelevant to the intended
interpretation and use of scores (construct-irrelevant variance).
A third important issue relates to the type of evidence collected to support the validity of SET
scores. Despite the vast amount of relevant literature, limited attention has been paid to a
fundamental aspect of the use of summated rating scales: the response process (Penny, 2003).
Instead, a considerable amount of research reports discriminant evidence, that is, the
correlation between SET scores and irrelevant variables such as the gender of the teacher. It
seems more reasonable to identify sources of construct-irrelevant variance by carefully examining
and reporting aspects of the response process itself before advancing to the examination of the
relationship between SET scores and other variables.
1.4 Relevance and Rationale
Measurement concepts such as validity, reliability, comparability, and fairness “are not just
measurement principles; they are social values” (Messick, 1995, p. 5, italics in the original).
Scores from measurement tools need systematic examination to warrant adherence to these
social values.
The most recent revision of the Standards for Educational and Psychological Testing (AERA, APA, &
NCME, 2014) argues that a proper interpretation and use of scores can result in wiser and more
equitable decisions about individuals and programs, whereas improper use of scores might lead
to an adverse impact on test-takers and other stakeholders.
The extended use of SET as a method of teaching evaluation in post-secondary education
institutions and the use of SET scores for informing formative decisions justify the need for
evidence supporting the validity of those scores (Joint Committee on Standards for Educational
Evaluation, 2009). The demand for sound validity evidence increases when SET scores inform
high-stakes decisions such as hiring and faculty promotion.
When students’ responses to a SET summated rating scale are affected by construct
underrepresentation or construct-irrelevant variance, there is less support for the interpretation
of scores as a measure of teaching quality. These two problems also affect the use of scores for
formative and summative decisions. Systematically examining these two problems allows them to be
controlled and minimized, strengthening subsequent decisions based on
scores.
Finally, SET is not an objective measure of teaching quality. Instead, scores come from
students’ responses to a summated rating scale, a tool prone to numerous sources of
construct-irrelevant variance, which increases the need for careful examination of SET. The
study focuses on a specific source of construct-irrelevant variance affecting scores obtained
from summated rating scales.
1.5 Focus of Study
In the measurement literature, a well-known source of construct-irrelevant variance that affects
scores from summated rating scales is the participant’s systematic tendency to use the response
scale in a stereotyped or aberrant manner, referred to as response styles (Cronbach, 1946;
Paulhus, 1991; Van Vaerenbergh & Thomas, 2013; Viswanathan, 2005). Response styles are
ways in which the respondent uses the response scale in a manner inconsistent with the
intended interpretation and use of scores. As a result, the total score would confound the
target construct with the response style. The confound would affect not only the interpretation
of total scores as a measure of the target construct but also the psychometric properties of the
scale and subsequent statistical analysis of scores (Viswanathan, 2005).
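To make the notion of a response style concrete, the sketch below computes simple proportion-based indexes for one respondent. The index definitions (shares of responses above the midpoint, below the midpoint, at the endpoints, and at the midpoint), the five-point scale, and the function name `response_style_indexes` are assumptions for illustration; the indexes actually used in the study are defined in the methodology chapter and may differ.

```python
def response_style_indexes(responses, low=1, high=5):
    """Proportion-based response style indexes for one respondent.

    Hypothetical definitions, assuming an odd-numbered low..high scale:
      ARS  - share of responses above the midpoint (acquiescence)
      DARS - share of responses below the midpoint (disacquiescence)
      ERS  - share of responses at either endpoint (extreme)
      MRS  - share of responses exactly at the midpoint
    """
    n = len(responses)
    if n == 0:
        raise ValueError("at least one item response is required")
    midpoint = (low + high) / 2
    return {
        "ARS": sum(r > midpoint for r in responses) / n,
        "DARS": sum(r < midpoint for r in responses) / n,
        "ERS": sum(r in (low, high) for r in responses) / n,
        "MRS": sum(r == midpoint for r in responses) / n,
    }

# Hypothetical responses of one student to an eight-item scale
indexes = response_style_indexes([5, 5, 4, 5, 3, 4, 5, 4])
```

Under these assumed definitions, a respondent who mostly selects options on the agreement side obtains a high ARS value regardless of which items were asked, which is the sense in which a response style is independent of content.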
The present study examines the degree to which scores obtained from the administration of a
SET summated rating scale at a large teacher education institution in Southern Ontario are
affected by response styles.
Evidence of response styles would jeopardize the interpretation of SET scores as a measure of
teaching quality for low-stakes decisions (for instance, informing teaching improvement) and
high-stakes decisions (for example, personnel decisions). Conversely, evidence ruling out response
styles would support (along with other types of evidence) the interpretation and use of SET as a
measure of teaching quality for formative and summative purposes.
Research examining how response styles affect SET scores addresses one of the issues in the
literature mentioned earlier: the lack of validity evidence based on response process. As an
example, in a systematic literature review including studies on SET validity since 2000, there is
almost no mention of the problem of response styles¹ (Spooren et al., 2013). Similarly, a review
of the literature covering articles since the 1970s mentions only one specific type of response
style, halo, as a potential source of construct-irrelevant variance affecting SET scores
(Gravestock & Gregor-Greenleaf, 2008). Therefore, systematically examining the plausibility of
response styles in SET appears to be a significant contribution to the literature.
The validity of scores is also a function of persons and the context of measurement (Messick,
1995a). The study also examines differences in the degree to which SET scores are affected by
response styles across three measurement conditions: the academic department, the type of
graduate program, and the session.
Lastly, considering that SET scores inform summative decisions, and that response styles affect
the subsequent statistical analysis of total scores, the study examines whether response styles
moderate the observed difference in SET scores between female and male teachers.
The three research questions that guide the study are:
1. To what extent are SET scores affected by response styles?
2. What are the differences in the degree to which SET scores are affected by response
styles across measurement conditions?
3. Is there a difference in SET scores between female and male teachers, and to what
extent do response styles moderate such difference?
1 Response styles or other related terms such as response bias, response set, rater bias, rater effects, and rater error.
1.6 Summary of Structure
The study is organized into five chapters: 1) introduction, 2) literature review,
3) methodology, 4) results, and 5) discussion.
Chapter 2 presents a review of literature that sustains the examination of response styles in the
context of SET. The literature review covers four subjects: 1) a definition of the target construct
in SET, 2) validity of SET, 3) response styles and evidence of response styles in the context of
SET, and 4) summary and limitations in SET literature.
Chapter 3 describes the study’s methodology including the population of students, characteristics
of the SET summated rating scale, the procedure of administration, intended interpretation and
use of SET scores, and data analysis strategy followed to produce evidence for each of the three
research questions presented above.
The last two chapters present the results of the statistical analysis of SET data (Chapter 4) and a
discussion of these results (Chapter 5). Chapter 5 discusses implications of the findings for SET
developers and users, possible alternative interpretations of the reported findings, limitations of
the study and recommendations for future research.
Chapter 2
Literature Review
Chapter 2 presents a review of relevant literature in support of the examination of response styles
in the context of SET. The literature review covers the following four subjects: the definition of
the target construct in SET (Section 2.1), the validity of SET (Section 2.2), evidence of response
styles in SET scores (Section 2.3), and limitations in SET literature (Section 2.4).
Section 2.1 discusses the general question of what SET intends to measure. The literature review
suggests that the definition of the target construct in SET is a problematic issue. The study
proposes and defines teaching quality as the target construct in SET.
Section 2.2 discusses the validity of SET in three logically connected parts. The first part
explains the theory underlying the use of summated rating scales and provides context for
understanding the concept of validity of scores. The second part presents an overview and a
current definition of the concept of validity along with a description of types of validity
evidence. The third part summarizes findings from multiple types of SET validity evidence
reported in the literature.
Section 2.3 defines response styles, describes types of response styles documented in the
measurement literature along with the procedures used to examine them, and summarizes findings from the
few studies reporting response styles in the context of SET.
Lastly, Section 2.4 offers a summary of the key points addressed in the literature review,
identifies limitations that sustain the examination of response styles in the context of SET, and
outlines the research questions of the study.
2.1 The Target Construct in SET
A sound theory sustaining the development of measurement tools is of the utmost importance as
recognized by The Standards for Educational and Psychological Testing (AERA et al., 2014) and
the Personnel Evaluation Standards (Joint Committee on Standards for Educational Evaluation,
2009). Some essential functions of a theory are the definition of the target construct and its
attributes, operationalization, and explanation of the relationship with related and unrelated
variables, for instance, whether subgroups by gender or race should differ in their levels of the
target construct.
SET literature also recognizes the crucial importance of a sound theory underlying the
measurement of teaching quality. The lack of theory supporting SET is precisely the first and
most important issue that negatively affects the definition of the target construct and the
interpretation and use of this tool of teaching evaluation (Marsh, 1987; Ory & Ryan, 2001;
Penny, 2003; Spooren et al., 2013).
The literature reports that, instead of being grounded in a theory of teaching quality, SET instruments
are often home-made or ad-hoc questionnaires constructed by adapting items from pre-existing
instruments (Marsh, 1987; Ory & Ryan, 2001; Penny, 2003). The weak or absent theory
supporting the development of SET is expressed in a low consistency in the number and nature
of attributes of teaching quality across instruments (Spooren et al., 2013). For instance, in their
analysis of eleven instruments published in literature since 2000, Spooren et al. (2013) reported
that the number of attributes of teaching quality varies between two and twelve, and the most
common content is the overall attribute of quality of instruction. As examples, other attributes of
teaching quality included in SET instruments are helpfulness of the teacher, teacher’s
enthusiasm, the level of care and support offered by the teacher, organization of the course,
clarity of course objectives, and quality of assessments.
2.1.1 Vague Definition of the Target Construct
A fundamental function of a theory in the context of the development of a measurement tool is
the definition of the target construct, and a logical consequence of the lack of theory sustaining
SET is a vague definition of the target construct. In this regard, there is no univocal
understanding in the literature on what SET is supposed to measure.
There are two expressions of the vague definition of the target construct in SET: 1) ambiguity in
the term to refer to the intended target construct; 2) interpreting SET scores as a simple measure
of student’s satisfaction.
SET literature is plagued with various terms to refer to the intended target construct. Examples of
these terms are teacher quality, good teaching, teacher efficacy, teaching performance, and
teacher effectiveness. For instance, in a review of research on SET validity, Marsh (1997) refers
to teacher effectiveness as the target construct in SET suggesting that scores should support the
improvement of teaching quality. In their overview of findings on SET validity, Spooren et al.
(2013) interchangeably refer to teaching quality, effective teaching, good teaching, and teaching
effectiveness. Penny (2003), while analyzing limitations of research on SET validity, alternates
between the concepts teaching quality and teaching effectiveness. None of the previous authors
offer a precise definition of teaching quality or any of the other terms.
A second expression of the vague definition of the target construct is the interpretation of SET
scores as a measure of students’ satisfaction with the teacher. In this context, students are
considered customers (Spooren et al., 2013), and satisfaction refers to the level of happiness of
students about teaching (Penny, 2003).
There are three situations in which students’ satisfaction with the teacher is utilized to interpret
SET scores. The first situation occurs when SET scores are interpreted simultaneously as a
measure of teaching quality and a measure of students’ satisfaction, making the two terms
equivalent (for instance, MacNell, 2015; Boring, 2015). A second situation occurs when
administrators interpret SET scores for institutional accountability or from a managerial
perspective. The interpretation of SET scores simply changes from teaching quality to students’
satisfaction with the course or teacher (Kuwaiti & Subbarayalu, 2015; Spooren et al., 2013;
Valsan & Sproule, 2008; Boring, 2015). A third situation occurs when SET scores are interpreted
as a measure of students’ satisfaction as a hypothesis to explain an observed relationship between
SET scores and an irrelevant variable (Zabaleta, 2007; Penny, 2003; Boring, Ottoboni, & Stark,
2016). In all the previous examples, the use of students’ satisfaction as target construct in SET is
not grounded in theory or empirical evidence and reflects an arbitrary interpretation of SET
scores.
In summary, there is often a lack of a theory of teaching quality underlying the development of
SET reflected in home-made or ad-hoc instruments, substantial differences in content, and a
vague definition of the target construct that interchangeably refers to teaching quality, teaching
performance, effectiveness, or students’ satisfaction.
2.1.2 Teaching Quality
The vague definition of the target construct underlying the development of SET is not
surprising. According to Berliner (2005), defining teaching quality is difficult, the concept of quality
is often ineffable, and quality involves a judgment that always depends on the specific context.
The study follows the distinction between good teaching and successful teaching (Berliner,
2005; Fenstermacher & Richardson, 2005; Ingvarson & Rowe, 2008), two separate but related
components of teaching quality. The difference between the concepts of good and successful
teaching lies in the criteria used for judging each. Whereas good teaching refers to aspects
of the instruction itself (the task of teaching), successful teaching is teaching that produces
learning.
2.1.2.1 Good Teaching
Good teaching is teaching “that comports with morally defensible standards and rationally sound
principles of instructional practice” (Fenstermacher & Richardson, 2000, p. 7). Good teaching
occurs when “the content taught accords with disciplinary standards of adequacy and
completeness, and that the methods employed are age-appropriate, morally defensible, and
undertaken with the intention of enhancing the learner’s competence with respect to the content
studied” (p. 9). Under this definition, highly qualified teachers “provide evidence that certain
qualities of teaching are frequently present in the everyday experiences of their students”
(Berliner, 2005, p. 207).
Two fundamental features of good teaching are: 1) good teaching is normative, and 2) good
teaching is contextual. Good teaching refers to “what is expected of people in a position”
(Berliner, 2005, p. 207). The norm derives from the unique setting in which teaching and
learning happen.
The differences in the understanding of teaching and learning across three dominant approaches in
Education illustrate the normative and contextual character of good teaching.
Good teaching could refer to teaching centered on the transmission of content
(positivist perspective), teaching as pedagogical content knowledge expertise and the
transformation of students’ cognitive ability (cognitive perspective), or teaching as facilitation of
students’ deep understanding based on personally relevant experiences (constructivist
perspective) (Fenstermacher & Richardson, 2000).
Good teaching involves at least three components referred to as acts of teaching (Fenstermacher
& Richardson, 2005):
1. Logical acts, such as defining, demonstrating, modeling, explaining, and correcting;
2. Psychological acts, such as caring, motivating, encouraging, rewarding, punishing,
planning, and evaluating; and
3. Moral acts, such as showing honesty, courage, tolerance, compassion, respect, and
fairness.
Good teaching is a necessary but not sufficient condition for learning. Evidence of good teaching
does not imply that students will learn. Learning is a complex process and is also affected by 1)
willingness and effort by the student, 2) a social environment supportive of teaching and
learning, and 3) opportunity to teach and learn (Fenstermacher & Richardson, 2005; Ingvarson &
Rowe, 2008). When these other conditions of learning are satisfied, then good teaching can turn
into successful teaching.
2.1.2.2 Successful Teaching
Successful teaching refers to “teaching that yields the intended learning” (Fenstermacher &
Richardson, 2005, p. 6). Successful teaching relates to students’ achievement, and more precisely
to changes in achievement between two moments: time 1 when the student lacks a certain
content, and time 2 when the student acquires the content. Successful teaching implies that the
teacher possesses the content, intends to impart the content, and engages the student in a
relationship that allows the student to acquire the content (Fenstermacher & Richardson, 2005).
Successful teaching means that the student acquires the content “to some reasonable and
acceptable level of proficiency” (p. 9).
Determining successful teaching involves numerous logical and methodological challenges
(Berliner, 2005). A first issue is the gathering of evidence of students’ achievement at time 1 and
time 2 to determine the degree of learning. The second problem is to link teaching and learning
causally. Specifically, isolating teaching effect from other factors affecting learning (such as
willingness and effort by the student, or the social environment) is technically complex.
Statistical methods that attempt to isolate teachers’ contribution to students’ learning are “filled
with psychometric problems” (Berliner, 2005).
2.1.3 SET and Teaching Quality
The previous definition of teaching quality has implications for SET design, score
interpretation, and use. SET summated rating scales can ask students about the two
components of teaching quality: good and successful teaching.
Regarding the measurement of good teaching, students can report the logical, psychological, and
moral acts of teaching because they have multiple opportunities to observe teaching over the
length of a course and can provide a justified report of the extent to which those acts are present in
their academic experiences.
SET items measuring good teaching can take the form of a report of others, report of external
objects or events, or self-report. Some examples are: “the teacher presented the content in a
challenging manner” (other-report), “the content of the course was challenging” (report of an
object or event), or “I felt challenged by the course content” (self-report).
Regarding successful teaching, a student can report the amount of learning and the degree to
which teaching contributed to his/her learning. SET items measuring successful teaching can
take the form of a self-report (e.g. “I learned a great deal in this course”), report of an object
(“the course contributed a great deal to my understanding of the material”), or other-report (“The
teacher was effective in enhancing my understanding of the course material”).
The measurement of successful teaching in the context of SET faces a significant challenge: the
poor accuracy of the self-report of learning. Two sources that inform this issue
are the training evaluation and self-assessment literatures. The self-report of the amount of learning is
considered a reaction, which refers to an attitude towards the training (Kirkpatrick, 1977, 1977,
1979). A reaction differs from the measurement of learning, in which specific knowledge, skills,
or attitudes pertaining to training goals are assessed using objective measures. The empirical evidence
consistently indicates that reactions are not predictive of actual learning and training impact
(Alliger, Tannenbaum, Bennett Jr., Traver, & Shotland, 1997; Salas & Cannon-Bowers, 2001).
Similarly, research on individual self-assessment indicates that people often overestimate their
level of knowledge and skills (Dunning & Helzer, 2014; Zell & Krizan, 2014), and such
evidence includes the self-assessment of students in postsecondary education institutions (Boud
& Falchikov, 1989; Ward, Gruppen, & Regehr, 2002). For instance, Bowman (2010) reported
that the correlation coefficients between college students’ self-report of learning and objective
measures of learning of the same constructs were virtually zero. Although self-assessment is a
valuable tool in the context of the development of metacognitive skills and self-regulated
learning (Pintrich, 2002), the high inaccuracy of self-assessment scores makes its utilization in
the context of teacher evaluation problematic.
2.2 Validity of SET
The widespread use of SET as a measure of teaching quality in postsecondary education
institutions, along with the use of SET scores for informing multiple types of decisions, helps
explain the vast number of studies on SET, specifically studies examining its validity.
Validity in the context of SET directly relates to the theory underlying the use of summated
rating scales, classical test theory (CTT). CTT allows the identification of the multiple elements
affecting the interpretation of scores obtained from summated rating scales.
2.2.1 Classical Test Theory
CTT provided the first formal foundation for the measurement of psychological and educational
constructs. The conception of CTT relates to three advancements that occurred at the beginning of
the 20th century (Traub, 1997): 1) the realization that all measurement contains some degree of
error; 2) the conception of error of measurement as a random variable; 3) the concept of
correlation and its method of calculation.
A fundamental proposition in CTT is that the two components of an individual’s observed score
are her/his true level of the target construct (true score) and measurement error. Equation 1
expresses the previous proposition (Kline, 2005; Spector, 1992):
𝑂 = 𝑇 + 𝐸
Equation 1
In Equation 1, 𝑂 represents an individual’s observed score, 𝑇 represents his/her true score, and 𝐸
represents measurement error.
Three other important propositions in CTT are (Traub, 1997):
1. Measurement error (𝐸) is a random latent variable.
2. Measurement error has zero covariance with the true score latent variable (𝐸 and 𝑇 are
uncorrelated).
3. Measurement error is independent of the error component of other measures.
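As a brief worked illustration, the propositions above imply a decomposition of observed-score variance. The symbols \sigma^2 and \rho_{OO'} are notation introduced here for the sketch and are not taken from the cited sources. The derivation assumes only Equation 1 and the zero covariance between 𝑇 and 𝐸.

```latex
\begin{align*}
O &= T + E, \qquad \operatorname{Cov}(T, E) = 0 \\
\sigma^2_O &= \sigma^2_T + \sigma^2_E + 2\operatorname{Cov}(T, E) = \sigma^2_T + \sigma^2_E \\
\rho_{OO'} &= \frac{\sigma^2_T}{\sigma^2_O} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
\end{align*}
```

Under these assumptions, the ratio \rho_{OO'} (commonly called reliability in CTT) approaches 1 as the error variance shrinks relative to the true-score variance.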
The error component 𝐸 is typically referred to as random error (Kline, 2005; Spector, 1992) and
indicates the degree to which scores are not consistent across repetitions of the measurement, for
instance across the multiple items in a summated rating scale. Non-systematic factors affecting
the measurement introduce random error. Examples of non-systematic factors are brief
fluctuations in mood and motivation, language difficulties, ambiguous items, uncontrolled
administration conditions (e.g. noise), distraction, memory/attention vacillations,
mechanical/motor vacillations, illness, fatigue, emotional strain, chance and non-contingent
responding2 (Viswanathan, 2005). The random error component of a measurement affects the
reliability of scores. Utilizing multiple items, one of the four key characteristics of a summated
rating scale, is one strategy to minimize random error and increase reliability (Spector, 1992;
Viswanathan, 2005; AERA et al., 2014).
2 Non-contingent responding, also called careless responding (Wetzel et al., 2016), inconsistent responding, or random
responding, is sometimes defined as a type of response style (McGrath et al., 2010; Viswanathan, 2005) because it
refers to responding independently of content. However, non-contingent responding does not introduce systematic
error as other types of response styles do (defined later in the document) because this type of measurement error occurs
when respondents vary their responses in an unsystematic manner (McGrath et al., 2010), hence introducing
random error.
2.2.1.1 Construct-Irrelevant Variance
A noteworthy expansion of Equation 1 in the context of scores obtained from summated rating
scales is the following (Spector, 1992):
𝑂 = 𝑇 + 𝐸 + 𝐵
Equation 2
Equation 2 presents three (instead of two) components of an individual’s observed score: her/his
true score (𝑇), random error (𝐸) and a new source of measurement error (𝐵) reflecting construct-
irrelevant variance (AERA et al., 2014). The literature often refers to this component as bias3 (Spector, 1992)
or systematic measurement error (Viswanathan, 2005). For consistency with the Standards for
Educational and Psychological Testing, the preferred term in the study is construct-irrelevant
variance.
Construct-irrelevant variance is systematic variation in observed scores that does not reflect true
score and is caused by processes extraneous (or irrelevant) to the intended interpretation and use
of scores (AERA et al., 2014).
Examples of sources of construct-irrelevant variance in summated rating scales are leading
questions and the use of unbalanced response categories. A common response style such as the
respondents’ tendency to agree or disagree can also introduce construct-irrelevant variance
(Viswanathan, 2005; James, Demaree, & Wolf, 1984).
Construct-irrelevant variance is not randomly distributed, its mean differs from zero, and its
effect cannot be removed by using multiple items. Instead, developers of summated rating scales
need to control and minimize potential sources of construct-irrelevant variance (Spector, 1992).
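The contrast between random error and construct-irrelevant variance under Equation 2 can be sketched with a small simulation. This is an illustration under stated assumptions, not an analysis from the study: the function names, the normal distributions, and all numeric settings (true-score mean and spread, error standard deviation, a constant 𝐵 of 0.4) are hypothetical choices made for the example.

```python
import random
import statistics


def simulate_scale_scores(n_respondents, n_items, bias, error_sd=1.0, seed=0):
    """Simulate scale scores where each item response is O = T + E + B.

    Returns two lists: true scores and observed scale scores (item means).
    All numeric settings are illustrative assumptions, not SET estimates.
    """
    rng = random.Random(seed)
    true_scores, observed_scores = [], []
    for _ in range(n_respondents):
        t = rng.gauss(3.0, 0.5)  # hypothetical true level of the construct
        # Each item gets fresh random error E but the same systematic part B.
        items = [t + rng.gauss(0.0, error_sd) + bias for _ in range(n_items)]
        true_scores.append(t)
        observed_scores.append(statistics.fmean(items))
    return true_scores, observed_scores


def deviation_summary(true_scores, observed_scores):
    """Mean and standard deviation of (observed - true) across respondents."""
    deviations = [o - t for o, t in zip(observed_scores, true_scores)]
    return statistics.fmean(deviations), statistics.stdev(deviations)


if __name__ == "__main__":
    for k in (1, 5, 20):
        t, o = simulate_scale_scores(n_respondents=2000, n_items=k, bias=0.4)
        mean_dev, sd_dev = deviation_summary(t, o)
        print(f"items={k:2d}  mean deviation={mean_dev:.2f}  sd={sd_dev:.2f}")
```

In this sketch, the spread of the deviations narrows as items are added, while their mean stays near the assumed value of 𝐵, which is consistent with the point that multiple items reduce random error but not construct-irrelevant variance.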
3 A second meaning of bias not related to the purpose of the study is “construct underrepresentation or construct
irrelevance components of a test that affect the performance of different groups” (AERA, APA, NCME, 2014, p.
686) such as groups based on gender or race. Bias from this perspective is the focus of the examination of test
fairness (i.e. item and test bias) and not part of the focus of this specific study.
2.2.1.2 Additive and Correlational Error
A common and reasonable use of SET is summarizing responses from a group of students (for
instance, students taught by the same teacher) using the mean or another method of data
aggregation. Construct-irrelevant variance can affect scores in two ways when scores are
aggregated across individuals: as additive error and as correlational error (Viswanathan, 2005).
Additive error increases or reduces observed scores by a constant magnitude across all
individuals. An example of additive error is the measurement of height in a
group of persons with their shoes on. The observed height will be higher than the actual height
(measured barefoot) by a constant (the height of the shoe). In this example, the observed height
across individuals contains a constant deviation from the actual height in a positive direction.
In summated rating scales, examples of sources of additive error are leading questions,
interviewer bias, and unbalanced response categories (Viswanathan, 2005). All these sources
would affect all respondents in the same manner.
A consequence of additive error is the underestimation or overestimation of the actual level of
the target construct (true score). SET scores influenced by additive error would be lower or
higher than the actual level of teaching quality.
Additive error can also affect the correlation coefficient between observed scores and other
variables due to a reduction in scores variance. A reduction in scores variance may occur when
additive error lowers or increases scores towards one of the ends of the response scale
(Viswanathan, 2005).
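The two consequences of additive error described above can be illustrated with a minimal simulation. It is a sketch under assumptions introduced here: a hypothetical 1 to 5 response scale, normally distributed true scores, a made-up criterion variable, and a constant shift of 1.5 points. The helper names are hypothetical as well.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def shift_scores(scores, shift, low=None, high=None):
    """Add a constant to every score, optionally clamping to the scale range."""
    shifted = [s + shift for s in scores]
    if low is not None:
        shifted = [max(low, s) for s in shifted]
    if high is not None:
        shifted = [min(high, s) for s in shifted]
    return shifted


def simulate(n=2000, seed=1):
    """Return hypothetical true scores and a criterion related to them."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(3.5, 0.6) for _ in range(n)]
    criterion = [t + rng.gauss(0.0, 0.6) for t in true_scores]
    return true_scores, criterion


if __name__ == "__main__":
    true_scores, criterion = simulate()
    unbounded = shift_scores(true_scores, 1.5)
    bounded = shift_scores(true_scores, 1.5, low=1.0, high=5.0)
    rows = [("true", true_scores), ("shifted", unbounded),
            ("shifted, 1-5 scale", bounded)]
    for label, scores in rows:
        print(label,
              round(statistics.fmean(scores), 2),
              round(statistics.pvariance(scores), 2),
              round(pearson(scores, criterion), 2))
```

Without scale bounds, the constant shift changes the mean but leaves the correlation with the criterion unchanged. Once the shifted scores pile up at the upper end of the bounded scale, the variance shrinks and the correlation is attenuated, which is the mechanism described in the preceding paragraph.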
The second type of construct-irrelevant variance is correlational error, caused by a non-constant
source of systematic error. An example is the measurement of the weight of one group of
individuals (for example, women) after breakfast and another group (for example, men) before
breakfast. In this case, a comparison of weight between women and men would indicate a smaller
difference in weight than the difference obtained if all individuals stepped on the weight scale
before breakfast.
In summated rating scales, correlational error is produced by “different individuals responding in
consistently different ways over and above true differences in the construct” (Viswanathan,
2005, p. 15). Correlational error affects the relationship between observed scores and other
variables because “consistent differences across individuals over and above the construct being
measured may be positively correlated, negatively correlated, or not correlated with the
construct” (Viswanathan, 2005, p. 16). In the case of summated rating scales, an example of
correlational error may occur when several constructs are measured using the same method
(common method factor), leading to inflated relationships between items (Viswanathan, 2005).
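A small simulation can illustrate how a respondent-specific tendency shared across two measures inflates their observed relationship. The sketch assumes two constructs whose true scores are independent, a hypothetical per-respondent tendency added to both measures, and arbitrary standard deviations. None of these values or names come from the study.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def simulate_shared_style(n=2000, style_sd=0.6, seed=2):
    """Simulate two unrelated constructs measured with the same method.

    Each respondent carries a personal tendency (style) that is added to
    both observed scores, so it varies across individuals (non-constant).
    """
    rng = random.Random(seed)
    true_1, true_2, observed_1, observed_2 = [], [], [], []
    for _ in range(n):
        t1 = rng.gauss(3.0, 0.6)  # true score on construct 1
        t2 = rng.gauss(3.0, 0.6)  # true score on construct 2, independent
        style = rng.gauss(0.0, style_sd)  # hypothetical response tendency
        true_1.append(t1)
        true_2.append(t2)
        observed_1.append(t1 + style + rng.gauss(0.0, 0.3))
        observed_2.append(t2 + style + rng.gauss(0.0, 0.3))
    return true_1, true_2, observed_1, observed_2


if __name__ == "__main__":
    t1, t2, o1, o2 = simulate_shared_style()
    print("correlation of true scores:    ", round(pearson(t1, t2), 2))
    print("correlation of observed scores:", round(pearson(o1, o2), 2))
```

In this setup the true scores are close to uncorrelated, yet the observed scores correlate positively because both contain the same respondent-specific component.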
In summary, CTT defines two components of observed scores: true score and measurement
error. Two sources of measurement error are random error and construct-irrelevant variance.
Random measurement error affects the reliability of scores and is often minimized using multiple
items. Construct-irrelevant variance is systematic variation in observed scores not related to true
score. Two types of construct-irrelevant variance are additive and correlational error. Additive
error produces underestimation or overestimation of true score, and correlational error produces
changes in the coefficients of correlation with other variables.
The propositions underlying CTT should encourage developers and users of measurement tools
in educational contexts to provide evidence supporting that sources of construct-irrelevant
variance do not excessively influence observed scores. The concept of validity offers a
framework for this task, the systematic evaluation of the intended interpretation and use of scores
(AERA et al., 2014).
2.2.2 Definition of Validity
2.2.2.1 Overview
The concept of validity has evolved, and a summary of previous conceptualizations
contributes to the understanding of current definitions and limitations in SET validity literature.
Following Kane (2001), three models of validity precede modern conceptualizations: criterion
validation, content validation, and construct validation.
The criterion validation model defines validity simply as the level of accuracy of a test, in which
scores are expected to estimate or predict a criterion. According to Kane, the criterion model was
popular between 1950 and 1970 for the validation of selection and placement decisions in which a
common standard was the candidate’s actual level of performance in a task. In the context of the
measurement of educational and psychological constructs, the identification of a suitable
criterion is challenging, turning the validity of the criterion itself into a problem (Kane, 2001).
Research in SET shares the difficulty of identifying a valid criterion of teaching quality (Marsh
& Roche, 1997).
A proposed solution to the lack of a valid criterion was the “review of the test content by subject-
matter experts” (Kane, 2001, p. 320), which would provide evidence of content relevance and
representativeness of the measure, referred to as content validity. According to Kane, validation
of educational achievement tests between 1950 and 1970 typically relied on the content validation
model. Two limitations of content-validation are that the experts’ judgment often shows a strong
confirmatory bias and a review of the content does not provide direct evidence of the validity of
the inferences made from scores (Kane, 2001).
Construct validation originally served the validation of the theory predicting the relationships
among constructs used in clinical assessment and worked as a complement to the criterion and
content validation models (Kane, 2001). The construct validation model was proposed and
utilized for the validation of psychological constructs grounded in strong theory. The validity of
the intended interpretation of scores is evaluated regarding “how well the observed scores satisfy
the theory” (p. 321). For instance, if the observations are consistent with the relationships among
constructs predicted by the theory, the theory underlying the measurement and the measurement
itself are both valid (Kane, 2001).
The construct validation model impacted the conceptualizations of validity in three ways (Kane,
2001). First, this model recognizes the importance of theory for defining and measuring
constructs. Second, the model recognizes the need to state clearly the intended interpretation of
scores before evaluating the validity of scores. Third, the model introduces the concept of
challenging proposed score interpretations and “the importance of considering possible alternate
interpretations” (p. 324).
2.2.2.2 Unified Concept of Validity
In the context of multiple validity models co-existing by the end of the 1970s, researchers were
“highly opportunistic in the choice of validity evidence” (Kane, 2001, p. 323). In response to
this situation, current conceptualizations of validity use the construct validity model as an
umbrella to integrate criterion and content validity, not as different types of validity but as
different kinds of evidence of validity (Kane, 2001).
As a unified concept, validity refers to an “overall evaluative judgment” (Messick, 1995a, p. 5)
on “the degree to which evidence and theory support the interpretations of test scores4 for
proposed uses of tests” (AERA et al., 2014, p. 59).
An important aspect of the concept of validity is that both evidence and theory need to relate to a
specific interpretation and use of scores (AERA et al., 2014). For instance, when validity
evidence only supports a formative use of SET scores (e.g. improvement of teaching), new
pertinent evidence should be provided in support of the use of SET scores for other purposes
such as personnel decisions or tenure. If scores inform multiple uses, then evidence needs to
support each of these multiple uses.
Another important aspect of the concept of validity is its evolving nature. The interpretation of
scores depends on items, persons, and the conditions of measurement. When any of these aspects
vary across replications of the measurement, validity evidence justifying the intended
interpretation and use of scores in this new instance should be provided (Messick, 1995b).
In the case of SET, the validity of scores depends on the group of items included in the
summated rating scale. The validity of SET scores would change if the population of students
changes. Similarly, measurement conditions such as timing (mid-term, end of the term), the
anonymity of responses and the mode of administration could also affect the validity of scores.
The overall context of the evaluation (e.g. academic department, type of program, discipline),
and time (session) could also influence the validity of SET scores (Spooren et al., 2013).
Evidence of validity can emerge by considering two types of rival hypotheses that challenge the
intended interpretation of scores: construct underrepresentation and construct-irrelevant variance
(AERA et al., 2014). For instance, an examination of a SET instrument revealing that item content
excludes important attributes of teaching quality (for example, lack of items targeting the moral
acts of teaching) would indicate construct underrepresentation. Evidence indicating the influence
of gender stereotype (which is not part of the definition of teaching quality) on students’
responses to a SET summated rating scale would indicate construct-irrelevant variance.
Construct underrepresentation and construct-irrelevant variance adversely impact the use of
SET scores for formative and summative decisions.
4 Test refers to any kind of measurement tool based on a standardized procedure (AERA, APA, & NCME, 2014),
and includes SET summated rating scales.
2.2.2.3 Types of Validity Evidence
There are four sources of validity evidence that can help support the intended interpretation and
use of test scores. Sources of validity evidence are evidence based on content, response process,
internal structure, and relationship to other variables (AERA et al., 2014).
Evidence based on content refers to the extent to which aspects such as themes, wording, the
format of items, administration and scoring reflect the target construct as defined by the
developer of the measurement tool. Content should also appropriately match the intended use of
scores. An example is examining whether items from a SET summated rating scale cover all the
aspects of teaching quality that would serve the purpose of informing teaching improvement.
Evidence based on the response process involves examining propositions about the expected
cognitive aspects involved in rating items. Examples are examining whether students use
appropriate criteria, whether students are investing enough cognitive effort in the rating task, or
whether irrelevant criteria such as teachers' personality, appearance, or gender affect students’
ratings.
Evidence based on internal structure pertains to the extent to which the relationships between
items and dimensions included in the measurement tool match the observed responses. An
example is examining the dimensionality of scores (unidimensional or multidimensional) and
testing the extent to which the expected relationships among items and dimensions satisfactorily
explain the observed relationships in the data.
Evidence based on the relationship to other variables examines the extent to which these
relationships are consistent with the intended interpretation and use of scores. Three types of
relationships to other variables are convergent, discriminant, and test-criterion.
Convergent evidence refers to the examination of the relationship between the target construct
and variables theoretically related, for instance, evaluating whether SET scores converge with
other measures of teaching quality such as a classroom observation protocol scored by trained
observers. A statistically significant relationship between SET scores and the objective measure
of teaching quality provides convergent validity evidence.
Discriminant evidence pertains to variables less related to the target construct, for instance,
evaluating whether SET scores correlate with constructs such as students’ satisfaction with the
course, teacher’s personality, teacher’s attractiveness or gender. The lack of a statistically
significant relationship between SET scores and the irrelevant variable provides discriminant
validity evidence.
Finally, test-criterion evidence pertains to the examination of the relationship between test scores
and expected outcomes, for instance, examining whether SET scores from multiple teachers
predict students’ final grades (Marsh & Roche, 1997) under the assumption that the level of
teaching quality significantly explains students’ learning.
2.2.3 SET Validity Findings
The two preceding subsections (2.2.1 and 2.2.2) propose that all measurement contains error,
explain how observed scores can differ from true score (the actual level of the target construct),
and offer a framework to evaluate the intended interpretation and use of scores.
Subsection 2.2.3 summarizes findings on SET validity, organized into three topics. The first
topic is the overall evaluation of SET validity from accumulated empirical evidence. The
second topic is a summary of discriminant evidence, an important focus within SET research.
The third topic pertains to evidence of differences in SET scores between female and male
teachers, one of the most examined irrelevant variables in studies reporting discriminant
evidence, which specifically informs the third research question in the study.
2.2.3.1 Overall Evaluation
There are contradictory positions regarding the overall validity of SET scores based on
accumulated empirical evidence. Whereas early literature is more positive towards the validity of
SET scores, recent literature seems more critical (Gravestock & Gregor-Greenleaf, 2008).
Early literature defends the use of SET as a valid measure of teaching quality based on high
coefficients of reliability and evidence based on relationship to other variables, specifically
convergent and discriminant evidence (Gravestock & Gregor-Greenleaf, 2008).
An example of the positive attitude towards SET is Greenwald (1997) who characterized
research on SET during the 1970s as mainly concerned with the influence of students’ grade
expectation on SET scores (discriminant evidence). Greenwald claimed that those concerns were
“effectively answered and largely put to rest by subsequent research” (p. 1184). In fact,
Greenwald reported a decline in the number of studies on SET validity during the 1990s, and he
speculated that this decline was the result of prior research resolving the major issues regarding
SET validity.
Two highly-cited publications by Herbert Marsh (Marsh, 1987; Marsh & Roche, 1997) examined
the major issues in SET mentioned by Greenwald. These concerns include reliability of scores,
the internal structure of SET, relationship to other variables, and the perceived utility of SET.
A first conclusion reported by Marsh is that the reliability of SET scores is high and that scores
are consistent across students evaluating the same teacher.
A second conclusion reported by Marsh is that SET scores are multidimensional rather than
unidimensional. Findings support the nine dimensions of the Students’ Evaluation of Educational
Quality (SEEQ) instrument. The dimensions of teaching quality included are Learning/Value,
Instructor Enthusiasm, Organization/Clarity, Group Interaction, Individual Rapport, Breadth of
Coverage, Examinations/Grading, Assignments/Reading, and Workload/Difficulty.
Evidence based on relationship to other variables summarized by Marsh includes convergent,
discriminant and test-criterion. Convergent evidence indicates that SET scores correlate with
teaching evaluation by other sources, such as self-assessment (teacher versus students), and
trained external observers.
As reported by Marsh, discriminant evidence indicates that SET scores reflect teaching quality
rather than the quality of the course. The correlation between SET scores from different teachers
who taught the same course content is close to zero, whereas the correlation between SET scores
from the same teacher in different courses is above r = 0.6. Also related to discriminant
evidence, Marsh reported that SET scores are only weakly correlated, or not correlated at all,
with irrelevant variables. The
variables reported by Marsh are students’ prior subject interest, expected grade and actual grade,
course workload or difficulty, class size, the level of the course (graduate, undergraduate), year
in school, the gender of the teacher, academic discipline, purpose of evaluation, administrative
conditions, and students’ personality.
Lastly, Marsh reports that SET is perceived as useful by teachers when appropriate support is
offered, by students in course selection, and by administrators for use in personnel decisions.
Marsh’s positive findings regarding SET are mostly (but not exclusively) based on evidence from
one specific instrument of his authorship (Marsh, 1982). He acknowledges that “many
instruments fail to provide a comprehensive evaluation of theoretically sound, multiple
dimensions of teaching quality, thus undermining their usefulness, particularly for diagnostic
feedback” (Marsh, 1997, p. 1188). However, many authors have subsequently echoed the above
and other positive validity findings to argue in favor of the overall validity of SET⁵ (Theall &
Franklin, 2001). There is a tendency in the literature to re-interpret these positive validity results
authoritatively and omit the words of caution regarding the use of SET made by the original
authors (Johnson, 2000).
Since Marsh’s report, an overwhelming amount of evidence has become available, and recent
literature seems to retreat from the previous favorable appraisal of SET. The main current
concern regarding SET is the fundamental question of whether scores reflect the intended
target construct, teaching quality (Boring, Ottoboni, & Stark, 2016; Penny, 2003;
Stark & Freishtat, 2014).
Recent literature suggests at least caution when using SET scores, mostly because of doubts
about what SET measures compared to what it intends to measure (Penny, 2003). Among the
more critical appraisals of SET, Olivares (2003) concludes that SET scores “are not
appropriate for drawing inferences regarding teaching effectiveness” (p. 240). Valsan and
Sproule (2008) argue that findings from validity research are misleading because the construct
teaching quality “has no verifiable empirical content” (p. 940). Stark and Freishtat (2014)
conclude that “there is no consensus on what SET does measure” (p. 13).
5 Implicitly, these authors refer to the validity of the SET summated rating scale rather than the validity of SET scores.
As in the case of early literature, current research bases its concerns about SET on
evidence of relationship to other variables, specifically discriminant evidence. In contrast,
there is little attention to evidence based on content and response process (Ory & Ryan, 2001;
Penny, 2003). Specifically, current SET research is characterized by a lack of support for “content
relevance, adequacy of coverage, empirical and theoretical analysis of rating forms, the scores
and any action based on them” (Penny, 2003, p. 401). Also, there is little or no information about
the validity of scores based on results from proper psychometric analysis (Penny, 2003; Spooren
et al., 2013).
Two recent and notable examples of research on SET validity providing evidence based on
response process are Gee (2017) and Bassett, Cleveland, Acorn, Nix, and Snyder (2017), who
examined the response strategies and motivation of students in the context of two SET summated
rating scales administered in the UK context.
Following the analysis of think-aloud protocols, Gee (2017) reported that students did not
invest enough cognitive processing in rating SET items. Students relied on superficial
response strategies, for instance, providing the same response to all items. Additionally, students
reported that personal and power relationships motivated them to inflate their scores,
for instance, with the goal of rewarding friendly teachers or presenting themselves
as non-confrontational students.
Bassett et al. (2017) reported insufficient student effort after analyzing responses to improbable
items included in a SET instrument. An example of an improbable item is “the instructor never even
attempted to answer any student question related to the course.” The share of students
endorsing these unlikely statements was high, ranging from 24% to 69%.
The study also reported that only 20% of students indicated that they responded to
all items seriously. Lastly, students reported that they did not believe that administrators or
teachers would use the results of the evaluation.
Despite the two previous examples of studies providing evidence based on response process,
most research on SET focuses on discriminant evidence. The following subsection summarizes
findings on SET discriminant evidence with an emphasis on the relationship between SET scores
and teacher’s gender, one of the most examined irrelevant variables in SET literature.
2.2.3.2 Discriminant Evidence
Discriminant evidence refers to the test of the relationship between the target construct and
variables that theoretically are not related to the target construct (irrelevant variables).
Discriminant evidence is often obtained using experimental designs and correlational analysis
(AERA et al., 2014).
A lack of relationship between SET scores and an irrelevant variable (for instance, teacher’s
gender) would provide support in favor of the validity of SET scores because no relationship is
expected based on theory. A significant association between SET scores and an irrelevant
variable would reflect that the two variables are not independent (for instance, when SET scores
are higher for female teachers), a finding inconsistent with the theory that sustains the
measurement of teaching quality.
The lack of independence between SET scores and an irrelevant variable can suggest 1) a true
relationship between the two variables (for instance, a different meaning of the target construct
among subgroups), 2) construct underrepresentation, or 3) construct-irrelevant variance (AERA
et al., 2014). Discriminant evidence on its own does not indicate which of the previous three
alternatives explains the observed relationship. The lack of independence between the target
construct and the irrelevant variable should encourage further investigation, for instance, a
revision of the theory sustaining SET, a review of the content, and examination of the response
process.
In the context of SET, studies providing discriminant evidence are described as examining “biasing
factors” (Bassin, 1974; Bonitz, 2011; Gravestock & Gregor-Greenleaf, 2008; Olivares, 2003;
Penny, 2003; Spooren et al., 2013; Theall & Franklin, 2001). Here the term “bias” has a related yet
different meaning from the definition presented previously in the study (bias as
construct-irrelevant variance). Instead, a “biasing factor” simply indicates a theoretically
irrelevant variable that correlates with SET scores.
The list of variables considered irrelevant in the context of SET is extensive and includes
(Spooren et al., 2013):
• Student’s background variables: gender, age, and maturation.
• Student’s academic variables: class attendance, student’s effort, expected grade, students’
course performance (examinations and final grades), students’ goals orientation, the
discrepancy between expected-actual grade, grading leniency, pre-course interest, change
in course interest.
• Teacher’s variables: gender, reputation, research productivity, teaching experience, age,
language background (native versus ESL), race, tenure, rank, sexual orientation, and
personality traits such as charisma, personality, physical attractiveness, fairness, attitudes
toward students, image compatibility (ideal versus actual teacher), likability, and initial
impression.
• Course’s variables: size, attendance rate, difficulty, discipline, workload, year in the
program, type (lab versus lecture), elective versus required, general versus specific content,
syllabus tone (friendly versus unfriendly).
Spooren et al. (2013) and Stark & Freishtat (2014) summarize evidence indicating that the student
variables with a statistically significant correlation with SET scores are cognitive background,
class attendance, effort, and grade expectation. Teacher variables with a statistically significant
correlation with SET scores are gender, reputation, experience, and age. Course variables with a
statistically significant correlation with SET scores are size, attendance rate, and course
difficulty.
Unfortunately, reports summarizing these relationships only indicate statistical significance and
exclude an interpretation of practical significance (i.e. effect size indexes). Other authors
conclude that these relationships, although statistically significant, are very small or even
trivial (Cashin, 1995; Marsh, 1987; Marsh & Roche, 2000; Penny, 2003). To date, there is no
consensus on the actual importance of these irrelevant variables in the context of SET
discriminant validity evidence.
2.2.3.3 Teacher’s Gender
The gender of the teacher is one of the most examined theoretically irrelevant variables in SET
validity research (Bonitz, 2011). The relationship between SET scores and the gender of the
teacher is a concern in this study from the perspective of examining the effect of response styles
on subsequent statistical analysis of SET scores. The following section summarizes recent
findings from studies reporting differences in SET scores by teacher’s gender.
In general, findings from studies examining differences in SET scores between female and male
teachers are similar to those from studies examining other irrelevant variables: empirical
evidence is inconclusive, and the practical significance of statistically significant differences is
usually small or trivial (Marsh & Roche, 1997). However, some authors claim that SET scores
are “biased” by teacher’s gender based on small or trivial coefficients of practical significance.
Two recent examples are Boring et al. (2016) and Stark & Freishtat (2014).
There are three published studies based on experimental designs that analyze the effect of
perceived gender of the teacher on SET scores (Arbuckle & Williams, 2003; Bonitz, 2011;
MacNell, Driscoll, & Hunt, 2015). In these studies, researchers utilized methods that allow the
manipulation of the teacher’s gender. One method is the use of a gender-neutral audio lecture as
teaching format (gender manipulation is in the SET questionnaire) (Arbuckle & Williams, 2003).
A second method is the use of an online course as teaching format (gender manipulation is in
course’s description and material) (MacNell et al., 2015). A third method is the use of vignettes
in survey experiments (Bonitz, 2011).
Arbuckle & Williams (2003) utilized a 2 (teacher’s gender) x 2 (teacher’s age) x 2 (student’s
gender) experimental design in the context of an audio lecture about “Stages of Relationship
Building” attended by college students. The authors reported that the same lecture was rated
higher when students believed that the teacher was a male under 35 than when students believed
that the teacher was a male over 55, a female under 35, or a female over 55, F(9, 330) =
2.63, p = .006, partial η² = .07.⁶ The partial η² implies that teacher’s gender accounted for 7% of
the variance in SET scores, a difference indicating a medium effect.⁷
A second experiment, conducted with undergraduate students attending an online introductory
course on anthropology/sociology (MacNell et al., 2015), revealed that students tended to assign
higher scores to teachers identified as male than to teachers identified as female, regardless of
the teacher’s actual gender. However, the effect of gender identity on SET scores was not
statistically significant. Re-analysis of the data using nonparametric tests (Boring et al., 2016)
confirmed the original findings. The nonparametric tests revealed differences between female and
male teachers in only three of 14 items using an alpha level of .05. No differences were
observed in total SET scores. However, the original study reported that gender identity
explained 13% of the variance in SET scores (R² = 0.13), which indicates a medium practical
significance of the difference. In comparison, the actual gender of the teacher explained
less than 1% of the variance of SET scores (R² = 0.01).
6 Effect size was calculated from reported MANOVA results.
7 Interpretation of effect size follows Cohen’s (1988) rule of thumb: small, medium and large.
The third experiment examined undergraduate psychology students evaluating a short vignette
describing a hypothetical teacher (Bonitz, 2011). Results from a 2 (teacher’s gender) x 2
(student’s gender) x 2 (course type: counseling psychology or research methods) experimental
design indicate no main or interaction effect of teacher’s gender identity on SET scores, F(1,
602) = 0.13, p = .72, 95% CI for the difference in means = [-0.10, 0.18]. Teacher’s gender
explained less than 0.1% of the variance of SET scores (partial η² < .001),⁸ which reflects no
practical significance of the difference by teacher’s gender.
In summary, findings from experimental studies are mixed. One study reported a practically and
statistically significant difference in SET scores favoring male teachers (Arbuckle & Williams,
2003). One study reported a practically but not statistically significant difference in favor of
male teachers (MacNell et al., 2015). Lastly, one study reported neither a practically nor a
statistically significant difference between female and male teachers (Bonitz, 2011).
Evidence of differences in SET scores between female and male teachers from observational
studies reaches statistical significance more often, but its practical significance is small or trivial.
For instance, Basow and Montgomery (2005) utilized a 2 (teacher gender) x 2 (student gender) x
3 (teacher’s rank) ANOVA on SET scores from students enrolled at a liberal arts college. They
reported a statistically significant main effect of teacher’s gender, F(6, 682) = 4.32, p < .001, η²
= .036. The practical significance⁹ of the difference in SET scores favoring female over male
teachers is small.
8 Effect size was calculated from reported ANOVA results.
9 Effect size was calculated from reported ANOVA results.
Smith et al. (2007) reported a statistically significant main effect of teacher’s gender on SET
scores from undergraduate communication students, F(1, 10955) = 146.90, p < .001, η² = .01,
indicating that the advantage of female over male teachers in SET scores has small practical
significance.
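The footnotes above note that several effect sizes were calculated from reported ANOVA results. The following is a minimal sketch of one common conversion, not necessarily the exact procedure used in the study: for a single effect, partial η² can be recovered from the F statistic and its degrees of freedom as (F × df_effect) / (F × df_effect + df_error). The function names are hypothetical, and the .01 / .06 / .14 cut-offs are the conventional benchmark values associated with Cohen (1988), used here only as a rough guide.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from a reported F statistic."""
    scaled_f = f_value * df_effect
    return scaled_f / (scaled_f + df_error)


def eta_squared_label(eta_sq: float) -> str:
    """Rough label using the conventional .01 / .06 / .14 benchmarks (assumed here)."""
    if eta_sq < 0.01:
        return "trivial"
    if eta_sq < 0.06:
        return "small"
    if eta_sq < 0.14:
        return "medium"
    return "large"


# F statistics as reported in the text above.
smith = partial_eta_squared(146.90, 1, 10955)  # Smith et al. (2007)
bonitz = partial_eta_squared(0.13, 1, 602)     # Bonitz (2011)

print(f"Smith et al.: {smith:.3f} ({eta_squared_label(smith)})")
print(f"Bonitz: {bonitz:.4f} ({eta_squared_label(bonitz)})")
```

With a very large sample, as in Smith et al. (2007), even a highly significant F corresponds to a small share of explained variance, which is the distinction between statistical and practical significance drawn in this section.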
McPherson, Jewell, and Kim (2009) used regression analysis on SET scores from undergraduate
economics students. They found that the unstandardized regression coefficient of being a male
teacher on SET scores was B = .094 (β = 0.14; p < .001) for teachers of principles of economics
classes and B = .07 (β = 0.11; p < .01) for teachers of upper-level economics classes, after
controlling for student variables such as grade expectations, response rate, class size, and
teacher characteristics including teaching experience, race, and rank (adjunct versus
tenure-track). The practical significance of the difference expressed in the standardized regression
coefficients and differences in means¹⁰ is small, with Cohen’s d = 0.19 for principles of economics
classes and d = 0.15 for upper-level economics courses. No statistically significant effect of
teacher’s gender on SET scores was found in graduate students in economics using a similar
approach (McPherson & Jewell, 2007).
Finally, Boring (2015) reported findings from an observational study that applied a generalized
ordered logit model to SET responses from first-year undergraduate students at a French university.
The study found that male teachers are more likely to receive the highest response
option from male students, and female teachers are less likely to be assigned the higher response
options by both female and male students. Permutation tests were conducted using the same data
to better account for noncompliance with score distribution assumptions¹¹ (Boring et al., 2016).
Findings based on these nonparametric tests led to conclusions similar to those of the
previous report: male teachers received higher scores than female teachers, with an overall
correlation coefficient of r = .09, p < .001, and coefficients ranging from r = .04 (p = .63) to r
= .11 (p = .10) across disciplines (History, Microeconomics, Political Science). Although
interpreted as “large and statistically significant” differences (p. 1), the practical significance of
these correlation coefficients is small using Cohen’s rule of thumb for interpreting effect size (J.
Cohen, 1988).
10 Standardized correlation coefficients and effect sizes were calculated from reported unstandardized regression
coefficients and descriptive statistics.
11 According to Boring (2015), teachers are not a random and independent sample from a normally distributed
population with equal variance and different means by gender, indicating that the null hypothesis is unrealistic.
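As a rough illustration of the rule of thumb applied to correlation coefficients, the sketch below squares r to obtain the share of variance explained and compares |r| with the conventional benchmark values of .10, .30 and .50 associated with Cohen (1988). The function names and the "below small" label are assumptions made for this example, and the benchmarks are guides rather than strict cut-offs.

```python
def variance_explained(r: float) -> float:
    """Share of variance in one variable accounted for by the other (r squared)."""
    return r ** 2


def correlation_label(r: float) -> str:
    """Rough label using the conventional .10 / .30 / .50 benchmarks (assumed here)."""
    size = abs(r)
    if size < 0.10:
        return "below small"
    if size < 0.30:
        return "small"
    if size < 0.50:
        return "medium"
    return "large"


# Correlations between teacher's gender and SET scores reported in the text above.
for r in (0.09, 0.04, 0.11):
    share = variance_explained(r)
    print(f"r = {r:.2f}: {share:.2%} of variance, {correlation_label(r)}")
```

Read this way, an overall r of .09 corresponds to less than 1% of the variance in SET scores, which supports interpreting the difference as small at most.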
In summary, findings from observational studies examining differences in SET scores by
teacher’s gender are mixed. Two studies reported higher SET scores for female teachers (Basow
& Montgomery, 2005; Smith et al., 2007). The same number of studies reported higher scores
for male teachers (Boring, 2015; Boring et al., 2016; McPherson et al., 2009). Only one study
found no difference in SET scores by teacher’s gender (McPherson & Jewell, 2007). These
recent findings from experimental and observational studies are consistent with previously
published reviews indicating inconsistent results and small or no practical significance of the
differences in SET scores between female and male teachers (Gravestock & Gregor-Greenleaf,
2008; Marsh & Roche, 1997; Spooren et al., 2013).
2.3 Response Styles in SET
A common belief among developers and users of measurement tools in educational contexts is
that observed scores are determined exclusively by the target construct that the tool intends to
measure (Cronbach, 1946; Wetzel, Böhnke, et al., 2016). Underlying this interpretation
of observed scores as reflecting only true score is the assumption that processes irrelevant to the
definition of the target construct do not influence responses. In other words, developers and users
assume that there is no construct-irrelevant variance in observed scores.
However, the measurement literature has long recognized that irrelevant factors often
influence responses to measurement tools such as summated rating scales. One instance occurs
when a student shows a tendency to agree, disagree, or select extreme options in the response
scale across items. These response patterns suggest processes irrelevant to the target construct.
Such tendencies are response styles, well-documented sources of construct-irrelevant
variance affecting summated rating scales (AERA et al., 2014; Cronbach, 1946;
Viswanathan, 2005; Wetzel, Böhnke, et al., 2016).
A response style is defined as the systematic tendency to respond to questionnaire items
irrespective of their content (Paulhus, 1991; Viswanathan, 2005; Wetzel et al., 2016).
Specifically, a response style is a stereotyped or aberrant individual response pattern across items
and is attributed to an individual tendency to favor certain response options over others
(Macmillan & Douglas, 1990).
As an expression of an individual process not related to the instrument content, response styles
can reduce the validity of scores as a source of construct-irrelevant variance (AERA et al., 2014)
affecting the interpretation and use of observed scores by introducing additive and correlational
error (Viswanathan, 2005).
2.3.1 Approaches to Examine Response Styles
There are different strategies for examining response styles in scores obtained from summated
rating scales. Two essential differences among strategies are 1) the use of additional items versus
the same items that measure the target construct, and 2) the use of a manifest versus latent
variable approach (Wetzel, Böhnke, et al., 2016). Consequently, there are four ways of measuring
response styles: 1) same items with a manifest variable approach, 2) same items with a latent
variable approach, 3) additional items with a manifest variable approach, and 4) additional items
with a latent variable approach. According to Wetzel et al. (2016), the most popular strategy for
examining response styles is the calculation of frequency indexes using the same items as the
target construct. Methods for examining response styles based on latent variable approaches are
very recent, and no systematic review and comparison of methods is available yet.
The study examines response styles in the context of secondary SET data with no additional
items measuring response styles included in the instrument. Hence, the section focuses on
explaining the rationale of examining response styles using the same items measuring the target
construct and a manifest variable approach.
2.3.2 Manifest Variable Approach and Same Items
The examination of response styles using a manifest variable approach and the same items as
the target construct can be found early in the measurement literature (Wetzel, Böhnke, et al.,
2016). Two notable examples are the halo effect (Thorndike, 1920) and leniency/severity and range
restriction (Kingsbury, 1922).
Thorndike (1920) noticed that ratings of a priori relatively independent
traits such as intelligence, industry, technical skill, reliability, leadership, and character, made by
superiors of industrial employees and aviation cadets, were highly and evenly correlated.
According to Thorndike, correlations were “higher than reality” and “too much alike” (p. 25).
Thorndike believed that superiors rated these independent aspects of their subordinates under
“a marked tendency to think of the person in general as rather good or rather inferior and to
color the judgments of the qualities by this general feeling” (p. 25). Thorndike called this
error in the judgment of independent attributes halo.
Kingsbury (1922) examined how managers scored a group of employees across seventeen
attributes (e.g. vitality, alertness, enthusiasm, loyalty), comparing scores against the normal
probability curve. Like Thorndike, Kingsbury identified halo in managers’
evaluations of employees and explained this tendency as the influence of “amiable quality in the
employee, good appearance, tact, etc.” or “a brusk manner, unpleasant voice, or other socially
irritating trait” (p. 380). Kingsbury also noticed that managers would use “wrong or changing
quantitative standards” leading to “high marker” managers (leniency) and “low marker”
managers (severity) (p. 380). Finally, Kingsbury observed that some managers provided
ratings that were too uniform, obscuring differences among employees (range restriction).
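To make the rating patterns described by Kingsbury concrete, the sketch below summarizes each rater by the mean and the standard deviation of the scores given across ratees. This is an illustrative manifest indicator under stated assumptions, not Kingsbury's procedure or the method used in the study: the function name, the 1 to 5 scale, and the data are hypothetical. An unusually high or low mean would be consistent with a “high marker” or “low marker” pattern, and a standard deviation close to zero would be consistent with range restriction.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def rater_summary(ratings: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard deviation) of one rater's scores across ratees."""
    if len(ratings) < 2:
        raise ValueError("at least two ratings are needed")
    return mean(ratings), pstdev(ratings)


# Hypothetical scores given by two managers to the same five employees (1-5 scale).
manager_a = [5, 5, 4, 5, 5]  # high and nearly uniform scores
manager_b = [1, 3, 5, 2, 4]  # scores spread across the whole scale

mean_a, sd_a = rater_summary(manager_a)
mean_b, sd_b = rater_summary(manager_b)
print(f"Manager A: mean = {mean_a:.2f}, sd = {sd_a:.2f}")
print(f"Manager B: mean = {mean_b:.2f}, sd = {sd_b:.2f}")
```

Such summaries only describe how the response scale was used. They cannot show whether the employees rated by Manager A were in fact uniformly strong.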
The two examples above illustrate an early use of manifest variables and the same items as a
strategy for examining response styles. Also, the examples above suggest that manifest variable
approaches assume certain attributes regarding the utilization of the response scale, distribution
of scores, and relationships among items. These assumptions depend on the intended interpretation
and use of scores (for example, recruitment, professional development, retention, promotion, and
firing).
Examples of these assumptions are that respondents should discriminate among independent
attributes (Thorndike, 1920), that respondents should not excessively agree with all questionnaire
items (Lentz, 1938), or that scores should match a normal distribution allowing discrimination
among participants (Kingsbury, 1922). These propositions are necessary for properly informing
formative and summative decisions based on scores.
Some authors propose that violations of these propositions reflect measurement error
attributable to limitations in participants’ ability to provide accurate responses (for instance,
Murphy & Balzer, 1989; Saal, Downey, & Lahey, 1980). However, other authors believe that
these violations could reflect participants’ strategic thinking regarding the utilization of the
evaluation results (Murphy & Balzer, 1989; Murphy & Cleveland, 1995; Murphy, Cleveland,
Skattebo, & Kinney, 2004). In fact, the goal of the evaluation could affect respondents’
motivation to provide accurate responses, a point discussed in Chapter 5.
One limitation affects the use of a manifest variable approach for the examination
of response styles. A manifest variable approach cannot separate response style variance from
target construct variance using the same items; only a latent variable approach would allow
the separation between target construct and response style using the same items. Consequently, a
proper interpretation of analyses relying on a manifest variable approach is that SET scores
reflect both teaching quality and response styles. Such evidence serves a diagnostic purpose,
informing more elaborate analyses that use a latent variable approach. Chapter 5
expands on the utilization of a latent variable approach in the context of the limitations of the
study.
A second limitation, shared by manifest and latent variable approaches, is the lack of information
concerning the causes of response styles. In other words, these approaches indicate “what” is
happening in the data but not “why.” The value of these approaches is in informing about a
potential source of construct-irrelevant variance affecting SET validity (the “what”) that, once
identified, needs to be subsequently addressed. The examination of causes explaining response
styles falls beyond the scope of the study. However, some hypotheses are presented later in
Chapter 5.
2.3.3 Types of Response Styles
A definition and an examination method using a manifest variable approach and the same items
as the target construct are presented below for acquiescence/disacquiescence response styles,
extreme response style, midpoint response style, halo, and range restriction.
2.3.3.1 ARS/DRS
Acquiescence response style (ARS), or yea-saying, refers to the tendency to agree with
statements (or more generally, endorse the highest response option) irrespective of the content of
items. Disacquiescence response style (DRS), or nay-saying, is the tendency to disagree with
statements (or endorse the lowest response option) regardless of the content of items (McGrath,
Mitchell, Kim, & Hough, 2010; Paulhus, 1991; Spector, 1991; Viswanathan, 2005; Wetzel,
Böhnke, et al., 2016).
Possible causes of ARS and DRS are complex, ambiguous, vague, or neutral item wording;
uncertainty or low cognitive ability in the respondent; and distraction and time pressure
(Paulhus, 1991; Viswanathan, 2005). ARS and DRS are also consequences of strong
satisficing (Barge & Gehlbach, 2012; Krosnick, Narayan, & Smith, 1996) and possibly triggered
when students want to avoid negative consequences of the evaluation results on teachers
(Murphy & Cleveland, 1995; Murphy et al., 2004).
The most popular method for examining ARS/DRS is calculating an individual frequency index
based on the proportion of responses stating the most positive (ARS) or negative (DRS) response
option across all items. Some authors calculate the index including all agreement/disagreement
response options across questionnaire items (Richardson, 2012; van Herk, Poortinga, &
Verhallen, 2004), whereas others report the number of responses utilizing the “Yes” response
options across questionnaire items (Spooren, Mortelmans, & Thijssen, 2012). When the rating
scale includes negatively and positively worded items, an alternative procedure is calculating the
proportion for each type of item (negatively and positively worded) and then averaging the two
indexes.
As a proportion of the total number of responses, a value close to +1 would indicate agreement
(or disagreement) with all items, and a value close to 0 would indicate no acquiescence /
disacquiescence.
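The frequency index described above can be sketched in a few lines of Python. This is only an illustration, assuming a five-point scale coded 1 to 5 with missing answers coded as None; the function name frequency_index and the example answers are hypothetical and are not taken from the study.

```python
def frequency_index(responses, options):
    """Proportion of a respondent's valid answers that fall in `options`."""
    valid = [r for r in responses if r is not None]  # ignore missing answers
    if not valid:
        raise ValueError("respondent has no valid answers")
    return sum(r in options for r in valid) / len(valid)


# Hypothetical respondent answering eight items on a 1-5 scale.
answers = [5, 5, 4, 5, 1, 5, 3, 5]

ars = frequency_index(answers, {5})            # highest option only
drs = frequency_index(answers, {1})            # lowest option only
ars_broad = frequency_index(answers, {4, 5})   # all agreement options

print(ars, drs, ars_broad)
```

The narrow version (highest or lowest option only) and the broad version (all agreement options) correspond to the two ways of building the index mentioned in the text; which one is appropriate depends on the definition adopted.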
Leniency and severity are two terms related to acquiescence/disacquiescence and useful when
several participants (for instance, students) evaluate the same target (for example, teachers).
Leniency is the tendency to provide scores spuriously high regardless of the dimension, and
severity (also stringency or harshness) is the tendency to score spuriously low irrespective of the
dimension (Kingsbury, 1922; Viswanathan, 2005; Wolfe, 2004; Wetzel et al., 2016).
Leniency and severity do not necessarily reflect students using the highest or lowest response
option, as in acquiescence/disacquiescence, but unusually high or low responses compared to other
students. Because of their similarity with acquiescence/disacquiescence and limitations in the
SET data (explained in Chapter 3), the study only examines ARS/DRS.
2.3.3.2 ERS
Extreme response style (ERS) refers to the tendency to respond using the extremes of the scale
regardless of content (Paulhus, 1991; Viswanathan, 2005; Wetzel et al., 2016; McGrath et al.,
2010).
Paulhus (1991) mentions situational factors such as ambiguity, emotional arousal, and speediness
as possible triggers of ERS. Viswanathan (2005) mentions “intolerance for ambiguity or
dogmatism, anxiety, respondents lacking appropriate cognitive schemas, or content that is
meaningful, important, or involving to respondents” as causes of extreme response style (p. 141).
A frequency index of ERS is the proportion of extreme categories that a participant endorses
across all questionnaire items (Viswanathan, 2005; Wetzel et al., 2016; van Herk et al., 2004;
Richardson, 2012). When ARS and DRS are computed from the highest and lowest response options
only, the sum of the ARS and DRS indexes equals the ERS index.
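The relation between the three indexes can be checked with a short Python sketch. It assumes a 1 to 5 response scale and ARS/DRS indexes built from the highest and lowest options only; the function name ers_index and the answers are hypothetical.

```python
import math


def ers_index(responses, low=1, high=5):
    """Proportion of valid answers that use either end of the scale."""
    valid = [r for r in responses if r is not None]
    if not valid:
        raise ValueError("respondent has no valid answers")
    return sum(r in (low, high) for r in valid) / len(valid)


# Hypothetical respondent answering eight items on a 1-5 scale.
answers = [5, 5, 4, 5, 1, 5, 3, 5]

ers = ers_index(answers)
ars = answers.count(5) / len(answers)  # highest option only
drs = answers.count(1) / len(answers)  # lowest option only

# With these narrow definitions the ERS index is the sum of ARS and DRS.
print(ers, math.isclose(ers, ars + drs))
```

The identity no longer holds if ARS or DRS is computed from all agreement or disagreement options, because those broader indexes also count non-extreme categories.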
2.3.3.3 MRS
Midpoint response style (MRS) (Viswanathan, 2005; Wetzel et al., 2016), also called neutral or
moderacy bias (McGrath et al., 2010), is the tendency to respond using the middle point of the
scale.
Possible causes of MRS are “evasiveness, indecision, or indifference” (Viswanathan, 2005, p.
136). A frequency index of midpoint responding is the proportion of responses using the
midpoint across all items.
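A minimal Python sketch of the midpoint index follows. It assumes options coded 1 to n with an odd n, since a scale with an even number of options has no single midpoint; the function name mrs_index and the answers are hypothetical.

```python
def mrs_index(responses, n_options=5):
    """Proportion of valid answers that use the scale midpoint."""
    if n_options % 2 == 0:
        raise ValueError("a scale with an even number of options has no midpoint")
    midpoint = (n_options + 1) // 2  # 3 on a 1-5 scale
    valid = [r for r in responses if r is not None]
    if not valid:
        raise ValueError("respondent has no valid answers")
    return sum(r == midpoint for r in valid) / len(valid)


# Hypothetical respondent answering eight items on a 1-5 scale.
answers = [3, 3, 4, 3, 2, 3, 3, 5]
mrs = mrs_index(answers)
print(mrs)
```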
Central tendency (Kingsbury, 1922), also called centrality (Wolfe, 2004), is a concept related to MRS.
Central tendency is “the propensity to award a restricted range of scores around the mean (or
mode or median) and to avoid awarding extreme scores” (Leckie & Baird, 2011, p. 400).
Central tendency is defined relative to a measure of central tendency of the score distribution
(Saal et al., 1980), and it differs from midpoint response style because the mean (or mode or
median) of the score distribution is not necessarily the midpoint of the scale. Central tendency
pertains to the evaluation of the same target (for instance, a teacher) by several participants (for
example, students). Because of limitations in the SET data and the similarity between midpoint
response style and central tendency, the study reports MRS.
2.3.3.4 Range Restriction
Range restriction (Murphy & Balzer, 1989), also called response range (Viswanathan, 2005), refers to
the tendency to use the response scale narrowly. Range restriction helps identify participants who
are too uniform in their scoring (Saal et al., 1980). Logically, scores affected by ARS/DRS
(leniency/severity), ERS, or MRS (central tendency) would also reflect range restriction. The
measurement of range restriction relates to the evaluation of the same target (for instance, a
teacher) by several participants (for example, students), and due to limitations in the SET data,
the calculation of a range restriction index is not feasible. However, the standard deviation of
SET scores is reported and interpreted as part of the results of the data analysis.
2.3.3.5 Halo
Halo (Thorndike, 1920; Kingsbury, 1922; Leckie & Baird, 2011) is the tendency to provide
“highly correlated ratings across a range of criteria” even to “conceptually unrelated items”
(Wetzel et al., 2016, p. 10). Halo can indicate a respondent’s failure to discriminate among scale
dimensions.
A method for examining halo is interpreting the matrix of intercorrelations among dimensions (Saal et al.,
1980), for instance, calculating the correlation between SET dimensions using the mean score of
teachers across students. Systematically high correlations among dimensions are an indication of
halo. The same analysis can be performed using Principal Component Analysis or Factor
Analysis. The presence of one component or factor explaining a high proportion of variance
indicates halo as a likely problem affecting scores.
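Both checks can be illustrated with a small Python sketch. It assumes hypothetical mean scores of six teachers on three dimensions, and it obtains the share of variance of the first principal component from the largest eigenvalue of the correlation matrix by power iteration, a generic numerical technique chosen here for brevity and not a procedure described in the cited sources.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(dimensions):
    """Intercorrelation matrix for a list of dimension score lists."""
    return [[pearson(a, b) for b in dimensions] for a in dimensions]


def first_component_share(corr, iterations=500):
    """Share of total variance carried by the first principal component."""
    k = len(corr)
    # Slightly uneven start vector, assumed not orthogonal to the main axis.
    v = [1.0 + i / k for i in range(k)]
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
    largest_eigenvalue = sum(a * b for a, b in zip(v, w))
    return largest_eigenvalue / k  # the trace of a correlation matrix is k


# Hypothetical mean scores of six teachers on three dimensions.
dim_a = [4.1, 3.2, 4.6, 2.9, 3.8, 4.4]
dim_b = [4.0, 3.0, 4.7, 3.1, 3.9, 4.2]
dim_c = [4.3, 3.3, 4.5, 2.8, 3.6, 4.5]

corr = correlation_matrix([dim_a, dim_b, dim_c])
share = first_component_share(corr)
print(corr, share)
```

In this made-up example all intercorrelations are high, so a single component carries most of the variance, which is the pattern the text describes as an indication of halo.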
2.3.4 Evidence of Response Styles in SET
Studies on response styles are relatively new in the context of SET. Hence, the influence of
response styles on SET scores needs further exploration (Spooren et al., 2013). Response styles
already examined in the context of SET are acquiescence (Yorke, 2009; Richardson, 2012;
Spooren et al., 2012), leniency/severity (Rantanen, 2013), and extreme response style
(Richardson, 2012). There is one study examining the effect of response styles on the use of
scores in subsequent statistical analyses (Richardson, 2012). Findings from these four studies
suggest that response styles might affect the intended interpretation of SET scores as a measure
of teaching quality and the subsequent analysis of SET data.
Yorke (2009) was the first author to bring attention to the issue of response styles in the context of
SET, specifically examining acquiescence. The study utilized a manifest variable approach and
examined ARS using the same items as the target construct. Yorke reported no evidence of
ARS in a summated rating scale measuring students’ experience of teaching and learning.
In the study, Yorke (2009) showed that the distribution of item responses (using the Kolmogorov–Smirnov
test) was not affected by factors pertaining to the response scale (for instance, reversing the order of
presentation of response options). Also, the score distribution was not affected by changes in item
wording, for example, variation in the number of positively and negatively worded items and their
order of presentation. Yorke concluded that responses reflected content rather than acquiescence.
A problem with Yorke’s (2009) study is that it did not use a standard method for examining
acquiescence (for instance, a frequency index). Furthermore, by definition, acquiescence pertains
to an individual response pattern, an aspect missing in Yorke’s study. The lack of a common
measure of acquiescence leaves unanswered the question of the degree to which acquiescence
affected SET scores in the study and makes comparison with empirical evidence
from other sources impossible.
Richardson (2012) reported findings of response styles in scores from an instrument named
Course Experience Questionnaire (CEQ). The analysis relied on frequency indexes calculated
from the same items measuring SET. Richardson reported that the average level of acquiescence
response style among students (examined as a proportion of individual responses across items
using the highest two response options in a five-point response scale) was 0.30 for positively
worded items and 0.32 for negatively worded items. The average level of extreme response style
(examined as a proportion of individual responses across questionnaire items using the first and
last response options in the five-point response scale) was 0.35 for positively worded items and
0.45 for negatively worded items. Richardson (2012) also reported that the levels of
acquiescence and extreme response style correlated with students’ marks. The coefficients of
correlation between the levels of response styles and students’ marks varied between r = .23
(positively worded items) and r = .32 (negatively worded items). Finally, the study reported that
the variation in students’ marks explained by a measure of learning styles dropped from 21.8% to
18.9% after statistically controlling for response styles.
Spooren et al. (2012) examined acquiescence response style in SET scores using a latent variable
approach, structural equation modeling, and the same items as the ones measuring SET. The
study compared three measurement models¹². A first model reflects the theoretical dimensions in
the SET summated rating scale (model 1). A second model adds to model 1 a common factor
explaining additional variance across all SET items. A third model adds to model 1 specific
common factors explaining additional variance across items within SET dimensions.
The fit¹³ between model 1 and the observed data was reasonable (RMSEA = .051; CFI = .989;
AIC = 134.884). Only model 2 showed a better fit with the observed data than model 1 (RMSEA
= .045; CFI = .992; AIC = 119.370). The relationship between the common factor from model 2
and a frequency index of acquiescence response style was subsequently estimated using a
structural equation model. Contrary to the authors’ expectation, the correlation between the latent
variable and the frequency index was low, implying that the common factor is not explained by
acquiescence response style. The authors proposed to interpret the common factor as halo, suggesting
that scores might reflect variables such as the instructor’s charisma or teacher professionalism.
There are two conceptual problems in the study reported by Spooren et al. (2012). First, factor
analysis is a standard method for examining halo, not acquiescence. Another problem is the
notable lack of theory when interpreting the common factor (or halo) as teacher’s charisma or
teacher professionalism. A simpler and more plausible explanation is a common method factor
due to aspects of the measurement that are similar across items, such as the response scale or
items worded positively (Viswanathan, 2005; AERA et al., 2014).
¹² Measurement model refers to the internal structure of the measurement, the relationships between items (manifest
variables) and dimensions (latent variables) that underlie the development of a rating scale (Brown, 2006).
¹³ Brown’s (2006) guidelines for interpreting reasonable fit between model and observed data are values of RMSEA
close to or below .05 and values of CFI close to or above .95. The Akaike Information Criterion (AIC) is an information
criterion index that serves the purpose of comparing models that differ in the number of factors; a lower value
indicates a better model fit.
One last publication on response styles in the context of SET reported evidence of
leniency/stringency on scores based on the results from a generalizability study (manifest
variable approach) using the same items as the target construct (Rantanen, 2013). In the study,
SET total variance was decomposed into three components using a hierarchical linear model
approach: teacher, students (individualized by an anonymous identification), and items. The
proportion of variance explained by students was 16.8% in comparison with the 24% explained
by teachers and 46.4% of residual variance (not explained by students, teachers, or items). The
author interpreted the percentage of variance explained by students as students using the
response scale in a systematically lenient (tendency to assign high scores) or stringent (tendency
to assign low scores) manner independently of the teacher. However, the study does not report
the proportion of students exhibiting lenient or severe responding. More generally, findings
from the study seem to suggest that students are not discriminating across teachers, which
could indicate leniency or severity but also central tendency or range restriction.
2.4 Summary and Limitations
2.4.1 Summary
The main conclusion from the first subject presented in this review of the literature is that the
definition of the target construct in SET is problematic. Often there is weak or no theory of
teaching quality underlying the development of SET summated rating scales. Weak theory
relates to home-made and ad-hoc instruments, a high diversity of content among instruments,
and a vague definition of the target construct. Furthermore, studies employ terms such as
teaching efficacy, teaching effectiveness, teaching quality, and student satisfaction
interchangeably.
A vague definition of the target construct is a first aspect that casts doubt on the validity of SET
scores. A summated rating scale based on poorly defined conceptual domains of teaching quality
would encourage students to base their responses on their understanding of quality (Spooren et
al., 2013) or to report inaccurate or biased information (Valsan & Sproule, 2008).
In this study, the definition of teaching quality distinguishes two related yet different aspects:
good teaching (the quality of teaching task) and successful teaching (teaching that contributes to
learning). Considering that SET literature shows a vague definition of the target construct, the
interpretation of findings pertains to “teacher quality” in general without distinction between
good and successful teaching. The distinction between good and successful teaching also serves
the purpose of describing the content of the specific SET summated rating scale examined in this
study along with hypotheses regarding how response styles vary depending on item content.
The second subject presented in this literature review relates to the validity of SET. There are
contradictory positions regarding the overall validity of SET scores based on accumulated
evidence. Early literature is positive towards the validity of SET scores, and recent literature is
critical. The most important current issue is the fundamental question of whether SET scores
reflect teaching quality. The way in which researchers address this matter is mostly through
evidence based on relationship to other variables, and specifically, discriminant evidence.
Furthermore, there is little or no consideration of evidence based on content or response process.
Discriminant evidence is one of the most important types of evidence utilized in the evaluation
of SET score validity. The list of irrelevant variables examined includes characteristics of the
student, teacher, and course. Findings from this vein of research are inconclusive. An example is
the examination of differences between SET scores by the gender of the teacher. Findings from
experimental and observational studies often do not reach statistical significance. Indexes of
effect size indicate a small or no practical significance of the differences in SET scores between
female and male teachers.
The third subject addressed in the literature review is response styles in the context of SET.
Overall, empirical findings are inconclusive because of the low number of studies, weaknesses in
the methodology, and differences in the types of response styles examined. The literature has not
yet explored topics such as differences in how response styles affect SET scores across different
measurement conditions, or how response styles affect the relationship between SET and other
variables. However, the findings recommend the examination of response styles as a potential
source of construct-irrelevant variance in SET scores because they can affect inferences about
the true level of teaching quality. The definition of validity also encourages the continuous
examination of sources of construct-irrelevant variance across items, persons and settings, and
summated rating scales are prone to response styles.
2.4.2 Limitations in SET Validity Research
The most significant limitation in SET validity research is the same major issue affecting the
development, interpretation, and use of SET scores: a weak theory of teaching quality informing
validation efforts. Validation in the context of weak theory is a recognized problem in the
measurement field and is known as a weak program of construct validity (Cronbach,
1988; Kane, 2001).
A weak program conveys the risk of turning validation into “sheer exploratory empiricism”
(Cronbach, 1988, p. 11) in which “any evidence even remotely connected to the test score is
relevant to validity” (Kane, 2001, p. 326). A weak program suggests the possibility of
researchers being “highly opportunistic in the choice of validity evidence” (Kane, 2001, p. 323).
To counter a weak program, researchers should state their hypotheses “as explicit as possible, then
devising deliberate challenges” (Cronbach, 1988, p. 12). This study concludes that the
overemphasis on one specific type of validity evidence to the detriment of others is an expression of
a weak program in the context of SET validity research.
Other significant limitations in SET validity research that emerge from contrasting the literature
and the definition of validity offered in section 2.2.3 are:
1. There is no explicit connection between validity evidence and intended uses of SET
scores.
2. Interpretation of validity relates to SET as an abstraction or to a specific instrument
rather than scores.
3. There is a tendency to generalize validity findings over instruments, persons, and
settings without providing relevant validity evidence.
4. Little importance is given to validity evidence based on content and
response process.
There are specific limitations affecting studies providing discriminant evidence. Consistent
with a weak program, Marsh (1997) summarizes these studies as “atheoretical, methodologically
flawed, and not based on well-articulated operational definition of bias [construct-irrelevant
variance]” (p. 1190).
A serious problem in studies providing discriminant evidence affects the interpretation of a
statistically significant correlation coefficient between SET scores and irrelevant variables.
Specifically, the literature often interprets a lack of independence between target construct and
an irrelevant variable as a sign of construct-irrelevant variance (“bias”) (Boring, 2015; Olivares,
2003; Stark & Freishtat, 2014). Two other plausible interpretations that are often ignored are: 1)
construct underrepresentation and 2) a true relationship between variables. Studies examining
discriminant evidence do not provide further analysis confirming that the correlation between
SET scores and an irrelevant variable represents construct-irrelevant variance nor acknowledge
that the correlational design does not indicate which of the three plausible explanations is
correct.
Another serious problem pertaining to SET discriminant evidence is the failure to interpret the
practical significance of these relationships, for example, by omitting indexes of effect size. An
example is relying on p-values of small or even trivial correlation coefficients to wrongly
conclude that “SET are biased against female teachers by an amount that is large and statistically
significant” (Boring et al., 2016, p. 1).
The third problem is that the little importance given to validity evidence based on content
and response process casts doubt on the meaning of the correlation coefficients between
(problematic) SET scores and other variables. Sources of construct-irrelevant variance can
explain the correlation coefficients between SET scores and other variables due to additive and
correlational error.
2.4.3 Focus of Study
The present study presents evidence to evaluate the interpretation of SET scores as a measure of
teaching quality for informing formative and summative decisions at a large teacher education
institution. The evidence relates to the examination of response styles, an important source of
construct-irrelevant variance in summated rating scales (Viswanathan, 2005; Wetzel, Böhnke, et
al., 2016) but scarcely examined in the context of SET. As a source of construct-irrelevant
variance, response styles can produce overestimation or underestimation of the true level of
teaching quality due to additive error, affecting formative and summative decisions.
Considering that the validity of scores depends on the conditions of measurement (Messick,
1995b), the degree to which SET scores are affected by response styles could differ across
conditions such as the academic department, the type of graduate program, and the session.
Finally, response styles can change the relationship between SET scores and other variables due
to correlational error, affecting summative decisions and analysis pertaining to discriminant
evidence. Furthermore, identifying sources of construct-irrelevant variance is a reasonable first step
preceding the examination of the relationship between SET scores and other variables.
The three research questions that guide the study are:
1. To what extent are SET scores affected by response styles?
2. What are the differences in the degree to which SET scores are affected by response
styles across measurement conditions?
3. Is there a difference in SET scores between female and male teachers, and to what
extent do response styles moderate such difference?
Chapter 3
Methodology
Chapter 3 describes the study’s methodology, including the population of students,
characteristics of the SET summated rating scale, the procedure of administration specifying
intended interpretation and use of SET scores, and the data analysis strategy followed to produce
evidence for each of the three research questions presented above.
3.1 Participants
The present study analyzes students’ evaluation of teaching from an institute of education that is part of
a large public research university located in Southern Ontario. Institute and university
authorities reviewed a research request for access to SET data, and following the request’s
approval, the manager of SET at the institute submitted students’ responses along with
information about the instrument development and administration.
For confidentiality reasons, the institute submitted the information in an anonymized manner to
prevent identification of teachers. Additionally, the institute did not collect any form of
identification of students during the instrument administration.
The total number of students’ evaluations of teaching is 6,133. The study excluded students
enrolled in special programs (other than Master or Ph.D./Ed.D.; 114 cases) and cases with
missing values across all items (98 cases).
As presented in Table 1, the number of students included in the analysis is 5,921, distributed
among two departments¹⁴ (A and B), two types of academic program (Master and Ph.D./Ed.D.),
and six academic sessions (Summer 2014 to Winter 2016).
¹⁴ The total number of academic departments at the institute of education is four; however, authorization was
granted for examining SET data from only two departments.
Table 1
Number of students by academic department, program type, and session
Academic    Program        Summer    Fall  Winter  Summer    Fall  Winter   Total
Department  Type             2014    2014    2015    2015    2015    2016
A           Master            751     233     991     433     159     443   3,010
            Ph.D./Ed.D.        98      19     101      68      12      71     369
B           Master            422     226     444     410     265     324   2,091
            Ph.D./Ed.D.        80      31     122      86      38      94     451
Total                       1,351     509   1,658     997     474     932   5,921
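The counts reported in this section can be cross-checked with simple arithmetic. The Python sketch below re-enters the cell values of Table 1 and the two exclusion counts given in the text; the variable names are illustrative only.

```python
# Cell counts from Table 1, ordered Summer 2014 to Winter 2016.
table1 = {
    ("A", "Master"): [751, 233, 991, 433, 159, 443],
    ("A", "Ph.D./Ed.D."): [98, 19, 101, 68, 12, 71],
    ("B", "Master"): [422, 226, 444, 410, 265, 324],
    ("B", "Ph.D./Ed.D."): [80, 31, 122, 86, 38, 94],
}

row_totals = {group: sum(cells) for group, cells in table1.items()}
column_totals = [sum(column) for column in zip(*table1.values())]
grand_total = sum(row_totals.values())

# Evaluations received minus the two exclusions described in the text.
analyzed = 6133 - 114 - 98

print(row_totals)
print(column_totals)
print(grand_total, analyzed)
```

The row, column, and grand totals computed this way agree with the margins of Table 1, and the grand total matches the number of evaluations remaining after the exclusions.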
The number of courses¹⁵ with students’ evaluations of teaching across departments, programs,
and sessions is 462, with an average of 12.82 evaluations per course (SD = 6.76), varying between 1 and 53.
The number of teachers across courses, departments, programs, and sessions is 159, and each
teacher received between 1 and 201 students’ evaluations (SD = 32.98). The percentage of
female teachers (71.1%) largely exceeds the percentage of male teachers (28.9%) and is close
to the overall proportion of female teachers at the institute of education for the specific period
covered by the data (68%)¹⁶.
The data submitted by the institute has two characteristics that limit the examination of response
styles. First, severity/leniency, central tendency, and range restriction require identification of
students and teachers, and the data provided does not include such information. A second
limitation is the lack of measurement of constructs unrelated to teaching quality, which results in
the impossibility of examining halo. Only one additional variable, teacher’s gender, was
submitted along with the students’ evaluations with the specific purpose of examining research
question number three.
¹⁵ Refers to a specific instance of a course in which teacher, section (if multiple), and session are specific, for
instance, course Research Methods in Education, teacher John Doe, section 1, summer 2016.
¹⁶ Personal communication with institute’s SET manager.
3.2 Instrument
The SET summated rating scale includes eight items measuring aspects of the “learning
experience of students” during the length of a course. The instrument also includes a general item
assessing the “overall experience” during the course. The “overall” item uses a different response
format than the other eight. Therefore, the analysis excluded the item. There are no differences in
the instrument content across departments, program type, or sessions.
The prompt utilized in the instrument to present content and explain the response procedure is
the following:
“You are presented with a series of statements about aspects of a course
learning experience. Using the scale provided, please indicate the extent to
which each aspect was part of your course experience.”
The response format utilized by students is a five-point Likert-type scale with the following labels: 1)
not at all, 2) somewhat, 3) moderately, 4) mostly, and 5) a great deal.
As reported by the institute’s SET manager, a commission selected the instrument content from a
bank of items developed by the university’s teaching support unit. The selection of the content
involved a “rigorous consultation phase” with “faculty, programs, and departments.”¹⁷ No other
information regarding the development of the instrument was available at the time of writing
this report.
¹⁷ Personal communication with institute’s SET manager.
The eight items included in the SET summated rating scale are:
1. “I found the course intellectually stimulating.”
2. “The course provided me with a deeper understanding of the subject matter.”
3. “The instructor created a course atmosphere that was conducive to my learning.”
4. “Course projects, assignments, tests, and/or exams improved my understanding of the
course material.”
5. “Course projects, assignments, tests, and/or exams provided opportunity for me to
demonstrate an understanding of the course material.”
6. “The instructor explained the learning objectives for the course.”
7. “The course instructor demonstrated respect for diversity (e.g., race, gender, ability,
religion, sexual orientation, etc.) in the classroom.”
8. “The course instructor encouraged students to express their own ideas in the class.”
The SET summated rating scale combines a self-report item (item 1), report-of-objects items
targeting the course and course components (items 2, 4 and 5), and report-of-other items
targeting the teacher (items 3, 6, 7 and 8). Content covers the two components of teaching quality
(good and successful teaching) and the three types of acts of teaching (logical, psychological and
moral). Table 2 summarizes the type of report and type of content for each item.
Table 2
Type of report and content in SET summated rating scale
Item    Type of Report  Aspect of Teaching Quality  Type of Act of Teaching
Item 1  Self-report     Good Teaching               Logical
Item 2  Object-report   Successful Teaching
Item 3  Other-report    Good Teaching               Psychological
Item 4  Object-report   Successful Teaching
Item 5  Object-report   Good Teaching               Logical
Item 6  Other-report    Good Teaching               Logical
Item 7  Other-report    Good Teaching               Moral
Item 8  Other-report    Good Teaching               Psychological
Table 2 shows that six items measure good teaching, with three items covering logical acts (items
1, 5, and 6), two items covering psychological acts (items 3 and 8), and one item measuring a moral
act (item 7). Two items measure successful teaching (items 2 and 4).
The diversity of content summarized in Table 2 recommends the interpretation of SET scores as an
overall measure of teaching quality. The reliability of the responses to the summated rating scale
as estimated by Cronbach’s alpha coefficient is 0.93, indicating a high internal consistency
(and low random error), supporting the utilization of the total score.
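For reference, Cronbach’s alpha for k items is k/(k − 1) multiplied by one minus the ratio of the sum of the item variances to the variance of the total score. The Python sketch below applies this standard formula to a small made-up data set; it assumes complete cases only, the function name cronbach_alpha is hypothetical, and it is not the software or the data used in the study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete cases.

    rows: one list of k item scores per respondent.
    """
    k = len(rows[0])
    item_variances = [variance(item) for item in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers of five respondents to three items (1-5 scale).
rows = [
    [5, 5, 4],
    [4, 4, 4],
    [3, 3, 2],
    [2, 3, 2],
    [4, 5, 5],
]

alpha = cronbach_alpha(rows)
print(round(alpha, 2))
```

Alpha approaches 1 when respondents order themselves similarly on every item, which is why a high value supports adding the items into a total score.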
There are two aspects of the instrument’s content calling for a cautious interpretation of SET
scores as a measure of teaching quality. Although the consultation phase of content selection
described above reflects the normative and contextual characteristics of the definition of good
teaching, the first problem is the low number of items measuring the different acts of teaching,
which suggests a potential problem of construct under-representation. Second, the instrument
includes items measuring successful teaching, reflecting the importance of teaching to foster
learning in this educational institution. However, the validity of “reaction” items or self-
assessment items as measures of teacher’s contribution to students’ learning is dubious based on
the previous literature review, casting doubt on their utility for informing formative and
summative decisions. Therefore, the first finding of the study is that the SET content reflects
partial aspects of good teaching and includes problematic items pertaining to successful teaching.
3.3 Administration
Close to the end of the academic session, students received an institutional message by email for
each course in which they were enrolled, inviting them to participate in an online course evaluation survey.
Two follow-up messages encouraging students to fill out the course evaluation survey followed
the original invitation. The participation in the course evaluation survey was voluntary, and the
response rate was 65% in department A, and 71% in department B18.
No intended interpretation of SET scores was offered to students at any component of the
instrument administration according to the information available at the time of elaborating this
report. However, the introductory paragraph of the instrument stated the intended use of SET in
the following manner: “Your feedback is important to us” (…) “teaching evaluations are
designed to improve course offerings and may be considered in promotion or tenure decisions for
faculty.” The introduction also stated the anonymity of responses, and that teachers would
receive the results only after the submission of the course final grades, probably as a means to
minimize the influence of students’ grade expectations on responses.
The instrument content and administration procedure indicate neither the intended
interpretation of scores nor the use of scores for formative purposes. However, institutional
18 Personal communication with institutional SET manager.
documents state that SET assesses the “effectiveness of teaching”¹⁹ and declare four intended
uses of SET²⁰, summarized below:
1. Provide formative data to instructors for the continuous improvement of their
teaching.
2. Inform members of the institution about teaching.
3. Provide data for summative evaluation of teaching including annual merit, tenure, and
promotion review.
4. Provide data for program and curriculum review.
A logical conclusion of the comparison between instrument content and institutional documents
is that students received incomplete information regarding the intended interpretation and
planned uses of SET scores.
3.4 Data Analysis
The data analysis strategy comprises four parts. The first part is the report of responses
distribution using graphs and descriptive statistics. The report includes histograms for each item
and measures of central tendency, dispersion, and shape of the distribution for each item and
SET scores (total score). The remaining parts are related to the three research questions in the
study.
3.4.1 Research Question 1
The analysis of the degree to which SET scores are affected by response styles relies on a
manifest variable approach (Paulhus, 1991; Wetzel, Böhnke, et al., 2016). The study reports four
frequency indexes of response styles: acquiescence (ARS), disacquiescence (DRS), extreme
19 “Guidelines for the assessment of teaching”; despite what the document states, the analysis of the instrument
content suggests that the target construct is teaching quality.
20 “Policy on the Student Evaluation of Teaching in Courses”
(ERS), and midpoint (MRS) response styles²¹. The frequency indexes of response
styles are calculated per student and consider responses to all items.
The calculation of a frequency index supposes that the systematic choice of a specific response
option across items reflects the response style. Table 3 presents the relationship between the
selection of a response option and a response style. The proportion of responses matching the
scoring in Table 3 represents the degree to which SET scores are affected by response styles.
Table 3
Scoring for calculating frequency indexes of response styles
Response Style   Not at all   Somewhat   Moderately   Mostly   A great deal
ARS              0            0          0            0        1
DRS              1            0          0            0        0
MRS              0            0          1            0        0
ERS              1            0          0            0        1
Note: ARS = index of acquiescence response style; DRS = index of disacquiescence response style; ERS =
index of extreme response styles; MRS = index of midpoint response style.
An example is the calculation of the frequency index of acquiescence. The value of ARS index
for a specific student is the sum of responses matching “A great deal” (symbolized as 1 in Table
3) divided by the number of items (eight, which transforms the sum into a proportion). The
choice of any other response option by the student does not reflect acquiescence (symbolized as
zero in Table 3). ARS index can vary between 0 (no acquiescence) and 1 (maximum
acquiescence). The same rationale applies to the other response style indexes: DRS, MRS, and
ERS.
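As an illustration of the scoring in Table 3, the following Python sketch computes the four frequency indexes for one student. It is not the STATA code used in the study; the function name, the numeric coding of response options (1 = “Not at all” to 5 = “A great deal”), and the example responses are assumptions made for this illustration.

```python
# Scoring keys from Table 3: which response options (coded 1-5) count toward each index.
SCORING = {
    "ARS": {5},     # "A great deal"
    "DRS": {1},     # "Not at all"
    "MRS": {3},     # "Moderately"
    "ERS": {1, 5},  # either extreme option
}


def response_style_indexes(responses):
    """Return, for one student, the proportion of items matching each scoring key."""
    n_items = len(responses)
    return {
        name: sum(r in options for r in responses) / n_items
        for name, options in SCORING.items()
    }


# Hypothetical student answering the eight items (not data from the study).
student = [5, 5, 4, 5, 3, 5, 5, 1]
indexes = response_style_indexes(student)
print(indexes)
```

Each index is a proportion between 0 and 1, so a student who chose “A great deal” on five of the eight items obtains ARS = 5/8.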
In addition to the four indexes of response styles, the study reports an index of acquiescence
relative to disacquiescence (ARSR). The index is the difference between the proportion of
positive (“mostly” and “a great deal”) and negative (“not at all” and “somewhat”) responses
across items. The ARSR index summarizes both acquiescence and disacquiescence and is less
correlated with extreme response style (van Herk, 2004). ARSR index can vary between -1
21 The limitations in the data described in subsection 3.1 prevent the examination of halo, leniency/severity, central tendency and
range restriction.
(maximum disacquiescence) and 1 (maximum acquiescence) and is reported exclusively to
compare results from the study with other studies in the context of SET.
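The relative index can be sketched in the same way. The snippet below is a minimal illustration, assuming response options are coded 1 to 5 so that 4 and 5 are the positive options and 1 and 2 the negative ones; the function name and the example response patterns are hypothetical.

```python
def arsr_index(responses):
    """Proportion of positive responses (4 or 5) minus proportion of negative ones (1 or 2)."""
    n_items = len(responses)
    positive = sum(r >= 4 for r in responses)
    negative = sum(r <= 2 for r in responses)
    return (positive - negative) / n_items


# Hypothetical response patterns across the eight items (not data from the study).
all_high = arsr_index([5, 4, 5, 5, 4, 5, 5, 5])  # only positive options
mixed = arsr_index([5, 5, 4, 5, 3, 5, 5, 1])     # six positive, one negative, one midpoint
all_low = arsr_index([1, 2, 1, 1, 2, 1, 2, 2])   # only negative options
print(all_high, mixed, all_low)
```

Midpoint responses count toward neither proportion, which is why the index stays within the range of -1 to 1.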
3.4.2 Research Question 2
Research question 2 relates to differences in the degree to which SET scores are affected by
response styles across conditions of measurement. The report of results includes 1) descriptive
statistics of response style indexes by academic department, program type, and academic session
and 2) results from analysis of variance (ANOVA). Specifically, a 2 (department) x 2 (program
type) x 6 (academic session) factorial ANOVA was conducted on response style indexes to
determine differences in the degree to which response styles vary across these measurement
conditions.
3.4.3 Research Question 3
The last issue pertains to determining differences in SET scores between female and male
teachers (part 1) and the extent to which response styles moderate such difference (part 2).
Research question 3 focuses only on acquiescence response style because findings from section
4.2 (research question 1) indicate that acquiescence is the most relevant type of response style
affecting SET scores in the study.
The two parts embedded in research question 3 are addressed using linear regression analysis
(Kenny, 1979; Baron & Kenny, 1986; J. Cohen et al., 2003). Researchers can use linear
regression analysis for explanation or prediction of a dependent variable using one or multiple
independent variables. In the study, linear regression analysis is used for explanation and informs
1) whether teacher’s gender explains differences in SET scores and 2) the moderator effect of
acquiescence on explaining differences in SET scores between female and male teachers. The
following subsections describe in detail the way in which the two parts embedded in question 3
are answered using linear regression analysis.
3.4.3.1 Part 1: Differences by Teacher’s Gender
The linear regression model expressed in Equation 3 informs about the magnitude and direction
of the difference in SET scores between female and male teachers:
$Y_{SET} = B_0 + B_1 \, Gender + e$
Equation 3
Equation 3 indicates that SET scores (𝑌SET) are explained by three elements: a constant (𝐵0), a
regression coefficient related to teacher’s gender (𝐵1), and a term indicating error of prediction
(𝑒). 𝐵1 expresses the direction and magnitude of the differences in SET scores between female
and male teachers.
Teacher’s gender is a dichotomous variable coded as a dummy variable. In this analysis, a value
of “1” represents a female teacher, and a value of “0” represents a male teacher. The coding
enables the interpretation of the constant (𝐵0) as the mean of SET scores for male teachers, and
the regression coefficient of teacher’s gender (𝐵1) as the magnitude and direction of the
difference between female and male teachers. A positive value of 𝐵1 indicates that SET scores
are higher among female teachers, and a negative value indicates that SET scores are higher
among male teachers.
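The interpretation of the two coefficients under dummy coding can be checked numerically. The following sketch fits the simple regression by ordinary least squares on a small set of made-up scores; the function name and all values are hypothetical and serve only to show that the constant equals the mean for the group coded 0 and the slope equals the difference between group means.

```python
from statistics import mean


def simple_ols(x, y):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data: gender dummy (1 = female teacher, 0 = male teacher) and SET scores.
gender = [0, 0, 0, 1, 1, 1]
set_scores = [4.0, 4.5, 3.5, 4.5, 5.0, 4.0]

b0, b1 = simple_ols(gender, set_scores)
male_mean = mean(s for g, s in zip(gender, set_scores) if g == 0)
female_mean = mean(s for g, s in zip(gender, set_scores) if g == 1)
print(b0, b1, male_mean, female_mean)
```

With this coding, a positive slope means that the group coded 1 (female teachers) has the higher mean.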
In addition to the constant (𝐵0) and regression coefficient of teacher’s gender (𝐵1), the following
information is provided and reported as result of the analysis (J. Cohen et al., 2003; Ellis, 2010;
Kenny, 1979):
• Test of significance of the null hypothesis indicating no linear relationship between
teacher’s gender and SET scores (𝐻0: 𝐵1 = 0).
• The standard error of estimate (SE) reflecting the estimated population standard deviation
of the residuals of estimating SET scores from teacher’s gender (𝑒). SE is interpreted as a
measure of the imprecision of the regression coefficient.
• Standardized regression coefficient of teacher’s gender ($\beta_1 = B_1 \cdot sd_{gender} / sd_{SET}$), which is
independent from the original scale of the variables and allows comparisons among studies.
• The coefficient of determination (R², and adjR²) which indicates the proportion of the
variance of SET scores accounted for by teacher’s gender. R² is a measure of the strength of
the relationship (effect size). adjR² is R² adjusted by sample size.
• Test of significance (F test) of the null hypothesis indicating that R² is zero (𝐻0: 𝑅² = 0).
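The quantities in the list above are related by standard textbook formulas, sketched below in Python. The function names and the input values are hypothetical; n denotes the number of observations and k the number of predictors (k = 1 in Equation 3).

```python
def standardized_beta(b, sd_x, sd_y):
    """Rescale an unstandardized coefficient b into standard-deviation units."""
    return b * sd_x / sd_y


def adjusted_r2(r2, n, k):
    """R-squared adjusted for sample size n and number of predictors k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """F statistic for H0: R-squared = 0, with k and n - k - 1 degrees of freedom."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Hypothetical inputs chosen only to illustrate the formulas.
beta = standardized_beta(b=0.5, sd_x=0.5, sd_y=1.0)
adj = adjusted_r2(r2=0.10, n=102, k=1)
f_value = f_statistic(r2=0.10, n=102, k=1)
print(beta, adj, f_value)
```

With a large sample and a single predictor, the adjustment is small, so R² and adjR² are expected to be close.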
3.4.3.2 Part 2: ARS Moderator Effect
The operationalization of a moderator effect is the statistically significant interaction between the
moderator and an independent variable using multiple linear regression analysis (Baron &
Kenny, 1986). The multiple linear regression model presented in Equation 4 informs about the
role of acquiescence response style as moderator of the difference in SET scores between female
and male teachers.
$Y_{SET} = B_0 + B_1 \, Gender + B_2 \, ARS_C + B_3 \, (ARS_C \times Gender) + e$
Equation 4
Equation 4 indicates that SET scores (𝑌SET) are explained by five elements: a constant (𝐵0), three
regression coefficients related to teacher’s gender (𝐵1), degree of acquiescence response style
(𝐵2), the interaction between acquiescence and teacher’s gender (𝐵3), and a term expressing
error of prediction (𝑒).
The coding of teacher’s gender is the same as in part 1. The degree of acquiescence response
style is centered at the grand mean to enhance the interpretation of the constant (𝐵0) and
regression coefficients (𝐵2, 𝐵3), a recommended procedure for interpreting interaction terms in
regression analysis (J. Cohen et al., 2003). Equation 5 shows the transformation of original
acquiescence values into grand-mean centered values.
$ARS_C = ARS - \bar{x}_{ARS}$
Equation 5
In Equation 5, the grand-mean centered degree of acquiescence for a specific student (𝐴𝑅𝑆𝐶) is
the difference between his/her original degree of acquiescence (𝐴𝑅𝑆) and the mean of
acquiescence across all students in the sample ($\bar{x}_{ARS}$). A grand-mean centered value of
acquiescence equal to zero indicates an average degree of acquiescence, negative values indicate
a lower than average degree of acquiescence, and positive values indicate a higher than
average degree of acquiescence.
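The transformation in Equation 5 can be sketched in a few lines of Python. The function name and the ARS values below are hypothetical and are used only to show the effect of centering.

```python
from statistics import mean


def grand_mean_center(values):
    """Subtract the mean of all values (the grand mean) from each value."""
    grand_mean = mean(values)
    return [v - grand_mean for v in values]


# Hypothetical ARS indexes for five students (not data from the study).
ars = [0.25, 0.5, 0.75, 1.0, 0.0]
ars_c = grand_mean_center(ars)
print(ars_c)
```

After centering, the values sum to zero, so a centered value of zero corresponds to a student with an average degree of acquiescence.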
The constant (𝐵0) in Equation 4 reflects the conditional mean of SET scores for male teachers
with an average degree of acquiescence; the regression coefficient of teacher’s gender (𝐵1)
reflects the magnitude and direction of the difference between female and male teachers
controlling for acquiescence; the regression coefficient of acquiescence (𝐵2) reflects the amount
of change in SET scores when the degree of acquiescence changes by one unit; and the regression
coefficient of the interaction term (𝐵3) indicates how much the difference between female and
male teachers changes as acquiescence varies from lower to higher values.
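One way to read the interaction term is to compute the predicted difference between female and male teachers at a given level of centered acquiescence, which under Equation 4 is 𝐵1 + 𝐵3 × 𝐴𝑅𝑆𝐶. The sketch below illustrates this; the coefficient values are hypothetical and are not results from the study.

```python
def predicted_set(gender, ars_c, b0, b1, b2, b3):
    """Predicted SET score under Equation 4 (gender: 1 = female, 0 = male)."""
    return b0 + b1 * gender + b2 * ars_c + b3 * ars_c * gender


def gender_gap(ars_c, b1, b3):
    """Predicted female-minus-male difference at a centered acquiescence value."""
    return b1 + b3 * ars_c


# Hypothetical coefficients chosen only for illustration.
coef = {"b0": 4.0, "b1": 0.2, "b2": 1.0, "b3": -0.5}

gap_at_mean = gender_gap(0.0, coef["b1"], coef["b3"])  # average acquiescence
gap_high = gender_gap(0.4, coef["b1"], coef["b3"])     # above-average acquiescence
check = predicted_set(1, 0.4, **coef) - predicted_set(0, 0.4, **coef)
print(gap_at_mean, gap_high, check)
```

In this made-up case the difference equals 𝐵1 at average acquiescence and shrinks as acquiescence increases, which is the kind of pattern a negative interaction coefficient would describe.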
In addition to the constant (𝐵0) and regression coefficients (𝐵1, 𝐵2, and 𝐵3) from Equation 4, the
following information is provided as result of the analysis:
• Test of significance of the null hypothesis indicating no linear relationship between
independent variables and SET scores (𝐻0: 𝐵𝐼 = 0).
• The standard error of estimate (SE) indicating the estimated population standard deviation
of the residuals of estimating SET scores from an independent variable. SE is a measure of
the imprecision of the regression coefficient.
• Standardized regression coefficients ($\beta_I = B_I \cdot sd_{IV} / sd_{SET}$), which are scale-free estimates of
regression coefficients and allow comparability among predictors and other studies.
• The coefficient of multiple determination (R² and adjR²) indicates the proportion of SET
scores variance accounted for by all the independent variables. R² is a measure of the strength
of a relationship between dependent and independent variables (effect size). adjR² is R²
adjusted by sample size and the number of predictors.
• Test of significance (F test) of the null hypothesis indicating that R² is zero (𝐻0: 𝑅² = 0).
Part two of research question 3 focuses on the test of significance of the interaction term (𝐵3) as
indication of the moderator effect of acquiescence on the difference in SET scores between
female and male teachers. Two-way plots of predicted SET scores against degree of
acquiescence are presented to enhance the interpretation of the interaction term and help
understand the results of the moderation analysis. The study reports two indexes of the practical
significance of regression coefficients (eta square and Cohen’s f square) along with guidelines
for interpreting these indexes in the context of moderator analysis.
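For reference, a common way to express Cohen’s f square for an interaction term is as the increment in explained variance when the term is added, relative to the variance left unexplained by the full model. The sketch below uses that form; whether the study computes the index exactly this way is an assumption, and the function name and R² values are hypothetical.

```python
def cohens_f2_increment(r2_full, r2_reduced):
    """Cohen's f-squared for the terms added when moving from the reduced to the full model."""
    return (r2_full - r2_reduced) / (1 - r2_full)


# Hypothetical R-squared values for models without and with the interaction term.
f2 = cohens_f2_increment(r2_full=0.37, r2_reduced=0.30)
print(round(f2, 3))
```

When the interaction adds nothing to the explained variance, the two R² values coincide and the index is zero.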
The report of regression analysis described in part 1 and part 2 includes results at the individual
(students’ responses), course-level (students’ responses aggregated by course) and teacher
(students’ responses aggregated by teacher) levels of analysis. These levels of analysis have
practical relevance considering that researchers and users of SET data routinely aggregate results
at the course and the teacher level of analysis, for instance, when used by administrators (Stark &
Freishtat, 2014).
3.4.4 Software
The data was imported from the original file in Microsoft Excel format to STATA 13.1
(StataCorp, 2013) to conduct most of the statistical analysis reported in the study. STATA do
files containing the commands for data management and transformation (i.e. response styles
indexes), and analyses are available for further reference.
Chapter 4
Results
Chapter 4 reports the analysis of data pertaining to the examination of response styles in the context
of responses to a SET summated rating scale at a large teacher education institution. There are
four sections in Chapter 4. Section 4.1 reports item and SET scores distribution. Section 4.2
reports findings pertaining to the degree to which SET scores are affected by response styles.
Section 4.3 reports differences in the extent to which SET scores are affected by response styles
across measurement conditions. Section 4.4 reports findings related to the effect of acquiescence
response style in moderating differences in SET scores between male and female teachers.
4.1 Distribution of Responses
Figure 1 (on page 59) contains eight item histograms from 5,921 students’ responses to the SET
summated rating scale. Histograms indicate the proportion of students scoring each of the
options available on the response scale.
Examination of histograms in Figure 1 reveals that students utilized the full range of response
options on the scale. However, students utilized the options asymmetrically. In general, students
preferred the two highest response options (5 = “A great deal”; 4 = “Mostly”) in each of the eight
items included in the instrument. The proportion of students endorsing either of the two highest
response options in the scale is around 80%, obtained by adding the percentage of students
endorsing response options 4 and 5.
Two extreme cases are item 7 (“The instructor demonstrated respect for diversity”) and item 8
(“The instructor encouraged students to express their own ideas”). In these two cases, the
proportion of students endorsing the highest response option is over 80%.
Figure 1
SET item distribution (N=5,921)
Table 4 reports measures of central tendency (mean, median), dispersion (standard deviation),
skewness, kurtosis, and percentiles (5, 25, 50, 75, and 95) for items and SET score. SET score
keeps the same metric as the original items because the sum of SET items (total score) is divided
by the number of items.
Table 4
Descriptive statistics for SET items and overall score (N=5,921)
Variable Mean SD Skew Kurt P5 P25 P50 P75 P95
Item 1 4.18 1.01 -1.20 3.77 2.00 4.00 4.00 5.00 5.00
Item 2 4.27 0.99 -1.35 4.14 2.00 4.00 5.00 5.00 5.00
Item 3 4.21 1.08 -1.37 4.05 2.00 4.00 5.00 5.00 5.00
Item 4 4.22 0.98 -1.25 3.97 2.00 4.00 5.00 5.00 5.00
Item 5 4.27 0.95 -1.35 4.34 2.00 4.00 5.00 5.00 5.00
Item 6 4.35 0.92 -1.53 4.92 2.00 4.00 5.00 5.00 5.00
Item 7 4.70 0.68 -2.87 12.11 3.00 5.00 5.00 5.00 5.00
Item 8 4.62 0.80 -2.43 8.92 3.00 5.00 5.00 5.00 5.00
SET score 4.35 0.77 -1.56 5.34 2.75 4.00 4.62 5.00 5.00
Note: Skew=Skewness; Kurt=Kurtosis
Table 4 indicates that the average response to an item fluctuated between 4.18 (item 1) and 4.70
(item 7) with small standard deviations (about one point or less). The negative skewness
and high kurtosis reflect that responses are concentrated around the highest value on the
response scale, with a long tail towards the lowest values. Values of kurtosis over three indicate
that the peak of the distribution is taller than the peak of a normal distribution (Moors, 1986),
and this is the case for each SET item.
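The reference value of three for kurtosis corresponds to the moment-based definition, under which a normal distribution has a kurtosis of exactly three. The sketch below computes skewness and kurtosis from central moments; it assumes that this definition matches the one behind Table 4, and the function name and response vectors are hypothetical.

```python
def shape_statistics(values):
    """Moment-based skewness and kurtosis (a normal distribution has kurtosis 3)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - m) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical responses concentrated at the top of a 1-5 scale.
skew_top, kurt_top = shape_statistics([5, 5, 5, 5, 5, 4, 4, 1])
# Hypothetical responses spread evenly across the scale.
skew_even, kurt_even = shape_statistics([1, 2, 3, 4, 5])
print(skew_top, kurt_top, skew_even, kurt_even)
```

Responses piled at the top of the scale with a few low values give a negative skewness, whereas an evenly spread set of responses gives a skewness of zero.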
Table 4 also indicates that the median (P50) in seven out of eight items is five (“a great deal”),
which means that at least 50% of students marked the highest response option available on the scale in
almost every item.
As reported in the histograms, items 7 and 8 are two extreme cases showing a negatively skewed
and leptokurtic shape (kurtosis over 3), indicating that the distribution of responses concentrates
towards the highest value, with a tall peak and a long tail towards the lowest values.
The distribution of SET scores follows the same pattern as the distribution of individual items,
with a mean between the two highest response options in the scale (M = 4.35), and small
standard deviation (SD = .77). The negative skewness reflects that the tail of the SET scores
distribution is longer towards the lower values on the response scale, and the kurtosis over three
indicates that the peak of the distribution is taller than the peak of a normal distribution.
4.2 Research Question 1
Based on visual examination and summary statistics of items and SET score, responses seem
consistent with acquiescence rather than disacquiescence, extreme, or midpoint response styles.
However, specific student-level response style frequency indexes are necessary to describe
systematic patterns of responses across items. Table 5 reports summary statistics (mean, standard
deviation, skewness, kurtosis, and percentiles 5, 25, 50, 75 and 95) for each response style index
(ARS, ARSR, DRS, ERS, and MRS).
Table 5
Summary statistics for response style indexes (N=5,921)
Variable Mean SD Skew Kurt P5 P25 P50 P75 P95
ARS 0.59 0.36 -0.28 1.61 0.00 0.25 0.62 1.00 1.00
ARSR 0.78 0.43 -2.28 7.63 -0.25 0.75 1.00 1.00 1.00
DRS 0.02 0.09 7.16 62.01 0.00 0.00 0.00 0.00 0.12
ERS 0.61 0.35 -0.30 1.67 0.00 0.25 0.62 1.00 1.00
MRS 0.09 0.16 2.17 7.90 0.00 0.00 0.00 0.12 0.50
Note: Skew = Skewness; Kurt = Kurtosis; P = Percentile; ARS = index of acquiescence response style;
ARSR = relative index of acquiescence response style; DRS = index of disacquiescence response style; ERS
= index of extreme response styles; MRS = index of midpoint response style.
Table 5 shows the students’ average value for each response style index. The first two indexes,
ARS (M = .59, SD = .36) and ARSR (M = .78, SD = .43), indicate that SET scores are
substantially affected by acquiescence. At least half of the students scored five out of eight items
(P50 = .62) using the highest response option in the scale (ARS), and at least half of the students
scored all eight items (P50 = 1) using either the highest or second highest response option
(ARSR). The proportion of students using exclusively the option “5” across all items is 30.10%,
and the percentage of students using the two highest response options across items is 64.52%.
Despite the high proportion of students systematically utilizing the highest or two highest options
in the response scale across items, not all students responded consistently with acquiescence
response style. However, the proportion of these students is small. For instance, the proportion of
students who scored every item using response options 1 to 4, never using option 5, is 11.26%.
Table 5 also indicates that the degree to which SET scores are affected by ERS is also high. The
students’ average ERS index is M = .61 (SD = .35) indicating that in general students scored
almost five items choosing either the highest (“a great deal”) or lowest (“not at all”) response
option.
Students’ average values for the other two response style indexes, DRS and MRS, are M = .02
(SD = .09) and M = .09 (SD = .16), respectively, indicating that SET scores are not affected by
these two types of response styles.
The index of ERS needs careful interpretation. ARS and ERS account for the proportion of
answers using the highest option in the response scale (plus the lowest response option in the
case of ERS). A logical conclusion is that the ARS index explains ERS. The Pearson's r
coefficient between ARS and ERS indexes is r = .97, p < .001, indicating a near perfect linear
relationship which leads to exclude ERS as affecting responses. As a reference, Table 6 reports
the correlation coefficients among all five response styles indexes.
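The overlap between the two indexes can be written out from the scoring in Table 3, where the ERS row is the sum of the ARS and DRS rows. In the worked equation below, the subscript i (one student) and the overbar (sample mean) are notation introduced for this illustration; the numeric values are the means reported in Table 5.

```latex
\begin{align*}
ERS_i &= ARS_i + DRS_i \\
\overline{ERS} &= \overline{ARS} + \overline{DRS} = .59 + .02 = .61
\end{align*}
```

Because DRS is close to zero for most students (M = .02, SD = .09), ERS is almost entirely determined by ARS, which is in line with the correlation of r = .97.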
Table 6
Correlation coefficients (lower triangle) and statistical significance level (upper triangle)
among response style indexes
        ARS     ARSR    DRS     ERS     MRS
ARS     --      0.00    0.000   0.00    0.00
ARSR    0.64    --      0.000   0.00    0.00
DRS     -0.27   -0.60   --      0.01    0.12
ERS     0.97    0.51    -0.03   --      .000
MRS     -0.59   -0.54   0.02    -0.61   --
Note: ARS = index of acquiescence response style; ARSR = relative index of acquiescence response style;
DRS = index of disacquiescence response style; ERS = index of extreme response styles; MRS = index of
midpoint response style. Diagonal cells are marked --.
In summary, findings reported in this section reveal that a large proportion of students
show a systematic tendency to respond to SET items using the highest response options on the
scale, suggesting that SET scores are affected by acquiescence response style. Students do not seem
to be influenced by other types of response styles such as disacquiescence or midpoint response style.
The high values of ERS index only reflect ARS and not the systematic use of the two extreme
response options.
4.3 Research Question 2
The report of findings pertaining to research question 2 includes summary statistics of response
styles indexes by department (A and B), program type (Master and Ph.D./Ed.D.), and session
(Summer 2014 to Winter 2016), and results from analysis of variance (ANOVA).
4.3.1 Summary Statistics
Table 7 presents mean and standard deviation for ARS, ARSR, DRS, ERS, and MRS across the
different conditions.
Table 7
Descriptive statistics of response styles indexes by measurement conditions
ARS ARSR DRS ERS MRS
M SD M SD M SD M SD M SD
Department: A 0.56 0.37 0.73 0.47 0.02 0.10 0.58 0.36 0.10 0.17
Department: B 0.64 0.35 0.83 0.37 0.01 0.07 0.65 0.34 0.07 0.15
Program: Master 0.58 0.37 0.76 0.45 0.02 0.09 0.60 0.35 0.09 0.16
Program: Ph.D./Ed.D. 0.67 0.33 0.86 0.31 0.01 0.05 0.68 0.33 0.07 0.13
Session: 2014 Summer 0.59 0.37 0.81 0.40 0.01 0.06 0.60 0.36 0.08 0.15
Session: 2014 Fall 0.59 0.36 0.79 0.42 0.02 0.10 0.61 0.34 0.08 0.15
Session: 2015 Winter 0.58 0.37 0.75 0.48 0.03 0.11 0.61 0.34 0.09 0.16
Session: 2015 Summer 0.61 0.36 0.80 0.42 0.01 0.08 0.63 0.36 0.08 0.16
Session: 2015 Fall 0.58 0.36 0.77 0.43 0.01 0.07 0.60 0.35 0.09 0.16
Session: 2016 Winter 0.60 0.37 0.77 0.43 0.01 0.07 0.61 0.35 0.09 0.17
Note: ARS = index of acquiescence response style; ARSR = relative index of acquiescence response style;
DRS = index of disacquiescence response style; ERS = index of extreme response styles; MRS = index of
midpoint response style.
Summary statistics from Table 7 suggest differences in ARS and ERS indexes across
measurement conditions. For instance, students in department A show lower values of ARS (M =
.56, SD = .37), ARSR (M = .73, SD = .47), and ERS (M = .58, SD = .36) than the values of ARS
(M = .64, SD = .35), ARSR (M = .83, SD = .37) and ERS (M = .65, SD = .34) found among
students in department B. Master students show lower values of ARS (M = .58, SD = .37) and
ARSR (M = .76, SD = .45) than Ph.D./Ed.D. students (ARS M = .67, SD = .33; ARSR M = .86,
SD = .31). Similar differences are observed in the case of ERS index. The average value of ERS
index among Master students (M = .60, SD = .35) is lower than among Ph.D./Ed.D. students (M
= .68, SD = .33).
The average values of ARS, ARSR, and ERS indexes across sessions are similar. Values of ARS
narrowly vary between .58 (2015 Winter) and .61 (2015 Summer). Values of ARSR vary
between .75 (2015 Winter) and .81 (2014 Summer). Values of ERS vary between .60 (2014
Summer, and 2015 Fall) and .63 (2015 Summer).
Regarding DRS and MRS indexes, average values by academic department, program type, and
session are approximately the same as those reported in section 4.2. Average values of DRS
index are no greater than .03 across conditions. Similarly, average values of MRS index are not
greater than .10 across conditions. Since the values of DRS and MRS indexes are very close to
zero in all the conditions analyzed, the subsequent analysis focuses on ARS, ARSR, and ERS.
4.3.2 ANOVA Results
Three three-way ANOVAs were conducted to examine the effects of the academic department,
the type of program, and the session on ARS, ARSR and ERS indexes respectively. The analysis
utilized an alpha level of .01 (probability of rejecting the null hypothesis when the null
hypothesis is true).
Results from ANOVA on ARS index indicate a statistically significant effect of department, F(1,
5920) = 26.75, p < .01, partial η² < .01, and program type, F(1, 5920) = 14.44, p < .01, partial
η² < .01. There was no statistically significant effect of session, F(5, 5920) = 0.78, p = .56,
partial η² < .01, of the interactions between academic department x program type, F(1, 5920) =
6.23, p = .01, partial η² < .01, academic department x session, F(5, 5920) = 0.91, p = .47,
partial η² < .01, and type of program x session, F(5, 5920) = 0.82, p = .43, partial η² < .01, or of
the three-way interaction, F(5, 5920) = 1.16, p = .32, partial η² < .01. Post hoc pairwise
comparison of means across levels of department and academic program revealed that ARS was
statistically significantly higher in department B than A, 95% CI [0.05, 0.11], and among
Ph.D./Ed.D. students than Master students, 95% CI [0.03, 0.09].
ANOVA results on ARSR index are consistent with those reported for ARS. Academic
department, F(1, 5920) = 24.70, p < .01, partial η² < .01, and program type, F(1, 5920) = 8.63, p
< .01, partial η² < .01, had a statistically significant effect on ARSR. Post hoc pairwise
comparison of means across levels of department and academic program revealed that ARSR was
statistically significantly higher in department B than A, 95% CI [0.06, 0.14], and among
Ph.D./Ed.D. students than Master students, 95% CI [0.02, 0.10]. Like the ANOVA results on
ARS index, neither session, F(5, 5920) = 1.20, p = .30, partial η² < .01, the interactions
between academic department x type of program, F(1, 5920) = 2.84, p = .09, partial η² < .01,
academic department x session, F(5, 5920) = 0.90, p = .47, partial η² < .01, program type x
session, F(5, 5920) = 0.60, p = .69, partial η² < .01, nor the three-way interaction, F(5, 5920) =
1.52, p = .18, partial η² < .01, had a statistically significant effect on ARSR.
ANOVA results on ERS indicate that academic department, F(1, 5920) = 23.80, p < .01, partial
η² < .01, and program type, F(1, 5920) = 11.11, p < .01, partial η² < .01, had a statistically
significant effect on ERS. Once again, post hoc pairwise comparison of means across levels of
department and academic program revealed that ERS was statistically significantly higher in
department B than A, 95% CI [0.05, 0.11], and among Ph.D./Ed.D. students than Master
students, 95% CI [0.02, 0.09]. Neither session, F(5, 5920) = 0.85, p = .51, partial η² < .01, nor any
of the two-way interactions, academic department x program type, F(1, 5920) = 6.59, p = .01,
partial η² < .01, academic department x session, F(5, 5920) = 0.97, p = .45, partial η² < .01, and
type of program x session, F(5, 5920) = 0.77, p = .57, partial η² < .01, nor the three-way
interaction, F(5, 5920) = 1.09, p = .36, partial η² < .01, had a statistically significant effect on
ERS.
Findings from ANOVA provide no evidence to support that ARS, ARSR, and ERS indexes differ
over academic sessions (failed to reject the null hypothesis). Findings from ANOVA support that
ARS, ARSR, and ERS indexes differ across departments and type of program (rejected the null
hypothesis). However, the small proportion of variance accounted for by these two conditions, with
partial η² less than .01 (or less than 1% of variability explained), indicates that there is no
practical significance of these statistically significant differences. Therefore, the degree to which
SET scores are affected by acquiescence and extreme response styles is consistent across the
measurement conditions examined.
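The small effect sizes can be recovered approximately from the reported F ratios, since partial η² equals F × df(effect) divided by F × df(effect) + df(error). The sketch below applies this identity to two of the results reported for the ARS index; the function name is hypothetical, and treating the second reported degrees of freedom (5920) as the error degrees of freedom is an assumption.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Department effect on ARS as reported in the text: F(1, 5920) = 26.75.
eta_department = partial_eta_squared(26.75, 1, 5920)
# Session effect on ARS as reported in the text: F(5, 5920) = 0.78.
eta_session = partial_eta_squared(0.78, 5, 5920)
print(round(eta_department, 4), round(eta_session, 4))
```

Both values fall below .01, which is consistent with the partial η² < .01 reported for these effects.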
4.4 Research Question 3
The section reports findings from linear regression analysis pertaining to differences in SET scores
between female and male teachers (part 1), and multiple regression analysis pertaining to the
moderator effect of acquiescence in such difference (part 2).
4.4.1 Part 1: Differences by Teacher’s Gender
Table 8 presents results from linear regression analysis of SET scores on teacher’s gender at the
individual (students), course (students’ responses aggregated by course) and teacher (students’
responses aggregated by teacher) level of analysis.
Table 8
Summary of linear regression analysis for testing teachers’ gender differences
                         Student-Level (N = 5,921)   Course-Level (N = 462)   Teacher-Level (N = 159)
Parameter                𝐵      SE     𝛽             𝐵      SE     𝛽          𝐵      SE     𝛽
Constant (𝐵0)            4.26   .01*   .*            4.26   .03*   .*         4.28   .05    .
Teacher’s Gender (𝐵1)    .12    .02*   .08*          .14    .04*   .15*       .09    .06    .11
R²                       .005*                       .023*                    .013
adjR²                    .005*                       .021*                    .006
F                        34.3*                       11.3*                    2.1
Note: *p < .01.
Results from regression analysis at the student-level show that the average SET score for male
teachers is 𝐵0 = 4.26 and that female teachers receive higher scores than male teachers as
indicated by a regression coefficient 𝐵1 = .12. The difference between female and male teachers
is statistically different from zero (t = 5.86, p < .01) and the proportion of variance of SET scores
accounted for by teacher’s gender is also statistically different from zero (F(1, 5919) = 34.3, p < .01).
However, the practical significance of this difference is rather trivial as indicated by the
standardized regression coefficient (𝛽1 = .08) and the proportion of explained variance (R² = .005).
Specifically, the magnitude of 𝛽1 falls below the threshold of a small effect (J. Cohen, 1988;
Ellis, 2010), and the proportion of variance of SET scores accounted for by teacher’s gender is less
than 1%.
Results from regression analysis at the course-level are similar to those reported at the
student-level. Average SET score for male teachers is 𝐵0 = 4.26. Female teachers received higher scores
than male teachers (𝐵1 = .14). The difference in SET scores between female and male teachers at
the course-level is statistically different from zero (t = 3.36, p < .01), and the proportion of
variance of SET scores accounted for by teacher’s gender is also statistically different from zero
(F(1, 460) = 11.3, p < .01). The magnitude of the difference at the course-level is higher than at
the individual level as suggested by the standardized regression coefficient (𝛽1 = .15) and the
2.3% of variance of SET scores accounted for by teacher’s gender (R² = .023; see Table 8). In this case, the
practical significance of the difference in SET scores by teacher’s gender at the course-level is
small.
Finally, results from regression analysis at the teacher-level slightly depart from those reported at
the student and course-level. Consistent with the previous results, at the teacher-level the
average SET score for male teachers is 𝐵0 = 4.28, and female teachers receive higher scores than
male teachers (𝐵1 = .09). However, the difference in SET scores between female and male
teachers is not statistically different from zero as reported by the test of null hypothesis of the
regression coefficient (t = 1.44, p = .15). The proportion of variance of SET scores explained by
teacher’s gender is also not statistically different from zero (F(1, 157) = 2.1, p = .15). However,
the practical significance of the difference in SET scores by teacher’s gender at the teacher-level
is equivalent to the one obtained at the course-level and can be considered small.
In summary, results from regression analysis indicate that there is a statistically significant
difference in SET scores in favor of female teachers over male teachers at the student and
course-level of analysis. Additionally, the level of analysis seems to affect the difference in SET
scores between female and male teachers, which is highest at the course-level of analysis.
Regardless of the level of analysis, the magnitude of the difference suggests a small practical
significance of the difference in SET scores by teacher’s gender.
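The statistics reported in Part 1 can be cross-checked with two standard identities: in a regression with a single predictor, the overall F statistic equals the square of the predictor's t statistic, and R2 can be recovered from F and its degrees of freedom. The following is a minimal sketch of that check; the function names are illustrative, and small discrepancies from the tabled values are to be expected because the inputs are rounded.

```python
def f_from_t(t):
    """With one predictor, the overall F equals the squared t of its slope."""
    return t ** 2


def r2_from_f(f, df_model, df_resid):
    """Recover R-squared from an F statistic and its degrees of freedom."""
    return (f * df_model) / (f * df_model + df_resid)


# (level, reported t, residual degrees of freedom) as given in the text
reported = [("student", 5.86, 5919), ("course", 3.36, 460), ("teacher", 1.44, 157)]

for level, t, df_resid in reported:
    f = f_from_t(t)
    print(level, round(f, 1), round(r2_from_f(f, 1, df_resid), 3))
```

The squared t values can be compared with the F row of the table, and the recovered R2 values with the R2 row.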
4.4.2 Part 2: ARS Moderator Effect
Table 9 reports the results from multiple linear regression analysis to test the moderator effect of
acquiescence on the difference in SET scores between female and male teachers at the individual
(students), course (students’ responses aggregated by course) and teacher (students’ responses
aggregated by teacher) levels of analysis. For the sake of completeness, Table 9 presents all the
relevant information from multiple regression analysis. However, the focus of this section is on
the difference in SET scores between female and male teachers after statistically controlling for
acquiescence (𝐵1) and, more importantly, the interaction term (𝐵3) as the operationalization of the
moderator effect of acquiescence.
Table 9
Summary of multiple linear regression analysis for testing moderator effect
                        Student-Level (N = 5921)    Course-Level (N = 462)      Teacher-Level (N = 159)
Parameter               𝐵        SE      𝛽          𝐵        SE      𝛽          𝐵        SE      𝛽
Constant (𝐵0)           4.34     .00*    .*         4.37     .01*    .*         4.35*    .01     .*
Teacher’s Gender (𝐵1)   .01      .01*    .01*       .00      .02*    .00*       .00*     .02     .00*
ARSc (𝐵2)               1.89     .02*    .89*       2.16     .07*    .98*       2.18*    .10     1.00*
Interaction (𝐵3)        -.10     .02*    -.04*      -.19     .08*    -.07+      -.19*    .12     -.07*
R2                      .74*                        .86*                        0.89*
adjR2                   .74*                        .86*                        0.88*
F                       5845.4*                     980.6*                      419.2*
*p < .01. +p < .05.
Results at the student-level show that the difference in SET scores between female and male
teachers (𝐵1 = .01) is not statistically different from zero (t = -1.65, p = .10) when the level of
acquiescence is average. The magnitude of the standardized regression coefficient (𝛽1 = .01)
suggests no practical significance of this difference. The interaction between acquiescence and
teacher’s gender is statistically different from zero (t = -3.43, p < .01). Specifically, when the
degree of acquiescence changes by one unit, the difference in SET scores between female and
male teachers changes by 𝐵3 = -.10.
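The interaction term can be read as the change in the gender gap across levels of acquiescence. The following is a minimal sketch, assuming that teacher's gender is coded 0 = male and 1 = female and that ARSc is acquiescence centered at its mean (both inferred from the interpretation of 𝐵0 and 𝐵1, not stated in Table 9). The function name and the ARSc values are illustrative and are not taken from the data.

```python
def gender_gap(b1, b3, ars_c):
    """Predicted female-minus-male difference in SET scores at a given ARSc.

    Under SET = B0 + B1*female + B2*ARSc + B3*female*ARSc, the difference
    between female (1) and male (0) teachers is B1 + B3*ARSc.
    """
    return b1 + b3 * ars_c


# Student-level coefficients from Table 9; the ARSc values are hypothetical.
for ars_c in (-0.5, 0.0, 0.5):
    print(ars_c, round(gender_gap(0.01, -0.10, ars_c), 3))
```

Because 𝐵3 is negative, the predicted gap in favor of female teachers shrinks as ARSc increases, and under this model it changes sign once ARSc exceeds -𝐵1/𝐵3.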
The moderator effect of acquiescence at the individual level is represented graphically in Figure 2.
The two-way graph presents predicted values of SET scores against the degree of acquiescence
for female and male teachers (separate lines). A gray horizontal line indicates the grand-mean of
SET scores across students. The graph suggests that the difference in SET scores in favor of
female teachers increases at lower values of acquiescence, and the difference in SET scores in
favor of female teachers decreases at higher values of acquiescence.
Figure 2
Moderator effect of acquiescence response style at the student-level
Results from multiple regression analysis at the course-level are consistent with those reported at
the individual level. The difference in SET scores between female and male teachers is not
statistically different from zero (t = -0.15, p =.88) when the level of acquiescence is average. The
interaction between acquiescence and teacher’s gender is statistically different from zero (t = -2.35,
p < .05). Specifically, when the degree of acquiescence changes by one unit, the difference in
SET scores between female and male teachers changes by 𝐵3 = -.19, a larger moderation
effect than the one reported at the individual level.
The moderator effect of acquiescence at the course level is presented graphically in Figure 3. The
difference in SET scores in favor of female teachers increases at lower values of acquiescence,
and the difference in SET scores between female and male teachers decreases and is reversed at
higher values of acquiescence.
Figure 3
Moderator effect of acquiescence response style at the course-level
Results from multiple regression analysis at the teacher level also suggest no statistically
significant difference in SET scores between female and male teachers when the level of
acquiescence is average (t = 0.09, p=.09). Although the size of the regression coefficient of the
interaction (𝐵3 = -.19) is the same as the one obtained at the course-level, the interaction term
is not statistically different from zero (t = -1.54, p = .12), probably because of the smaller number
of teachers compared to the number of courses and students and the higher standard error
(SE).
In summary, at the individual, course, and teacher levels of analysis, comparisons of SET
scores between female and male teachers differ before (part 1) and after statistically controlling
for acquiescence (part 2). Specifically, when acquiescence is held constant, differences by gender
of the teacher are not statistically different from zero. Furthermore, the degree of acquiescence
moderates the differences in SET scores between female and male teachers. At lower values of
acquiescence, female teachers receive higher SET scores than male teachers. At higher values of
acquiescence, the difference is reduced and inverted in favor of male teachers at the course-level
of analysis. In other words, a high level of acquiescence hides differences in SET scores favoring
female teachers over male teachers. The moderator effect of acquiescence is statistically
significant at the student and course level of analysis but not at the teacher level of analysis.
The practical significance of acquiescence as moderator of the difference in SET scores by the
gender of the teacher is discussed separately in the next subsection. Additionally, the level of
analysis seems to affect the statistical conclusion but not the practical significance conclusion
regarding the moderator effect of acquiescence.
4.4.3 Practical Significance
A challenge in assessing the practical importance of the moderator effect of acquiescence in the
context of SET is the lack of reference values against which to compare the results of the study. Of the few
studies examining response styles in SET scores, none has reported moderation
effects.
Following general guidelines for ascertaining practical significance of standardized regression
coefficients, the magnitude of the interaction at the course-level (𝛽3 = -.07) does not reach the
threshold for interpreting the effect as small (L. Cohen, Manion, & Morrison, 2007), leading to
the conclusion that there is no practical significance of the difference. Using eta squared as a measure
of practical significance, the interaction term at the course level accounts for 1% of the variance
of SET scores (𝜂2 = .01), which is also lower than the value for interpreting such an effect as small (J.
Cohen, 1988). Similarly, using Cohen’s f squared, the magnitude of the interaction between
acquiescence and teacher’s gender (𝑓2 = .015) falls below the threshold for interpreting the
effect as small (J. Cohen, 1988). All three indexes of practical significance suggest that the
statistically significant moderator effect is irrelevant. However, Ellis (2010) and Kenny (2015)
point out that the average size of the moderator effect for categorical variables (but not
continuous variables) across research in Psychology as measured by Cohen’s 𝑓2 is .002
(Aguinis, Beaty, Boik, & Pierce, 2005). In this context, Kenny (2015) suggests a more realistic
standard of practical significance for Cohen’s f squared of 0.005, 0.01, and 0.025 for small,
medium, and large effects, respectively. With little information against which to compare findings, the study proposes that
an interaction effect of 𝑓2 = .015 has a medium practical significance when considering that
differences in SET scores between female and male teachers can change substantively at low
degrees versus high degrees of acquiescence at the course level as seen in Figure 3.
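The two sets of benchmarks discussed above lead to different labels for the same effect. The following is a minimal sketch of that comparison, assuming the conventional f squared benchmarks of .02, .15, and .35 attributed to J. Cohen (1988) and the moderator-specific values of .005, .01, and .025 cited from Kenny (2015). The function and constant names are illustrative.

```python
def classify(f2, small, medium, large):
    """Label an f-squared value against (small, medium, large) cut-offs."""
    if f2 >= large:
        return "large"
    if f2 >= medium:
        return "medium"
    if f2 >= small:
        return "small"
    return "below small"


COHEN_1988 = (0.02, 0.15, 0.35)     # conventional f-squared benchmarks
KENNY_2015 = (0.005, 0.01, 0.025)   # benchmarks for moderator effects cited above

f2 = 0.015  # interaction effect at the course level
print(classify(f2, *COHEN_1988))  # below small
print(classify(f2, *KENNY_2015))  # medium
```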
Overall, the evidence presented in the study suggests that acquiescence response style can hide
real differences in SET scores between female and male teachers and that the degree of
acquiescence affects comparisons of SET scores by teacher’s gender.
Chapter 5
Discussion
Chapter 5 addresses three aspects of the findings reported in the study pertaining to the examination
of response styles in the context of SET. The first section presents a summary of findings and
discusses their implications for SET developers and users (Section 5.1). The second section
reviews six alternative interpretations of the results other than response styles and
discusses their plausibility (Section 5.2). The last section describes the limitations of the study
affecting interpretation and proposes guidelines for future research (Section 5.3).
5.1 Summary and Implications
SET developers often rely on the convenience of summated rating scales for asking students
about teaching quality in post-secondary education institutions. Response styles are well-
documented sources of construct-irrelevant variance that can influence the interpretation and use
of scores from summated rating scales. However, research ruling out response styles as evidence
to support the validity of SET scores is scarce and flawed. Therefore, the main topic covered in
the study is the extent to which SET scores obtained from graduate students enrolled at a teacher
education institution are affected by response styles.
Analysis of SET data revealed that a high proportion of students systematically endorsed the
highest option in the response scale across SET items, a pattern consistent with acquiescence
response style. The analysis revealed that the high degree of extreme response style reflects
acquiescence and not the tendency to choose the two extreme options on the response scale. The
analysis showed that disacquiescence and midpoint response styles do not affect SET scores.
The literature review provided two examples of studies reporting acquiescence in the context of
SET. Using a non-standard index of acquiescence, Spooren et al. (2012) reported that only 8.4%
of the students selected “Yes” answers (“rather agree,” “agree,” or “totally agree”) to 10 or more
SET items out of 15. The comparable result in this study is 80.34%. Richardson (2012) reported
an index of ARSR =.30 (SD = .38) from the administration of the Course Experience
Questionnaire, and an index of ARSR = .28 (SD = .19) from the administration of the Revised
Approaches to Studying Inventory. The comparable result in this study is ARSR = .78 (SD = .43)
(Table 5). Overall, the average value of acquiescence reported in this study is higher than in the
two previous examples.
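The count-based index attributed to Spooren et al. (2012) above can be illustrated with a short computation: the share of respondents who chose an agreeing option on at least a given number of items. The following is a minimal sketch under stated assumptions. The numeric coding of the response options, the toy data, and the function name are hypothetical and are not the procedure used in either study.

```python
def share_agreeing(responses, agree_codes, min_items):
    """Proportion of respondents who chose an agreeing option on at least
    `min_items` items. `responses` is a list of per-student answer lists."""
    hits = 0
    for answers in responses:
        n_agree = sum(1 for a in answers if a in agree_codes)
        if n_agree >= min_items:
            hits += 1
    return hits / len(responses)


# Hypothetical 6-point coding where 4, 5, 6 stand for the three agreeing options.
toy = [
    [6] * 15,               # agrees with all 15 items
    [4] * 10 + [2] * 5,     # agrees with exactly 10 items
    [5] * 9 + [1] * 6,      # agrees with only 9 items
    [3] * 15,               # never agrees
]
print(share_agreeing(toy, {4, 5, 6}, 10))  # 0.5
```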
The study examined differences in the degree to which SET scores are affected by response
styles across three measurement conditions available in the SET data: academic department, type
of program, and session. The descriptive analysis suggested no differences in the degree of
disacquiescence and midpoint response style across measurement conditions. ANOVA indicated
higher degrees of acquiescence and extreme response styles in department B versus department
A, and among Ph.D./Ed.D. students versus Master students, and no statistically significant effect
of session or of the interaction among the three conditions. Although differences by department
and program type are statistically significant, the practical significance of these differences is
rather trivial, and the conclusion is that response styles are consistent across the measurement
conditions examined.
Literature suggests that response styles are consistent within the course of a questionnaire
measuring different constructs (Kam & Zhou, 2015; Plieninger, 2016; Wetzel, Böhnke, et al.,
2016). Literature also suggests that response styles are stable across time (Billiet & Davidov,
2008; Weijters, Geuens, & Schillewaert, 2010; Wetzel, Lüdtke, Zettler, & Böhnke, 2016).
Overall, the study complements prior evidence and suggests that response styles are consistent
across measurement conditions. These findings provide no hypothesis to devise methods for
controlling and minimizing response styles. Future research needs to examine differences in the
degree of response styles across other measurement conditions.
The last topic explored in the study is the extent to which response styles affect the subsequent
use of SET data, specifically the relationship between SET scores and other variables. Before
statistically controlling the effect of acquiescence, SET scores are slightly higher for female
teachers than male teachers. After statistically removing acquiescence, there is no statistically
significant difference in SET scores between female and male teachers. The moderation analysis
showed that acquiescence changes the difference in SET scores between female and male
teachers. The practical significance of the moderator effect of acquiescence is small using
realistic criteria for interpreting effect size indexes.
In general, the findings presented in the study do not rule out the plausibility of the influence
of acquiescence response style on SET scores. On the contrary, the findings suggest that SET
scores might simultaneously reflect teaching quality and acquiescence response style and that
SET scores might overestimate the actual level of teaching quality. The confound is consistent
across departments, program type, and sessions, and affects the use of SET scores in subsequent
statistical analysis.
In addition to the presence of construct-irrelevant variance, SET scores are based on only six
items measuring three types of tasks of teaching, suggesting construct-underrepresentation of
good teaching. The instrument includes two items measuring successful teaching, a construct
whose validity shows little empirical support.
5.1.1 Implications
The most severe negative consequence of acquiescence response style in SET scores is the
impossibility of determining the true level of teaching quality based on students’ report because
acquiescence affects measures of central tendency (overestimation) and dispersion (narrowing
the distribution of scores, range restriction).
When considering the use of SET scores for formative purposes, an instructor would interpret a
high evaluation score as purely reflecting his/her teaching ability, a common belief among users
of summated rating scales. However, SET scores seem influenced to a considerable extent by a
student tendency unrelated to teaching quality.
The overestimation and range restriction of SET scores limit the utility of scores for informing
teaching improvement as intended originally by the instrument developer. For instance, a teacher
receiving a report indicating that all the attributes of teaching quality are at the highest possible
level can hardly use the information for teaching improvement because there is no attribute to
improve. Similarly, an instructor may not find useful the comparison of his/her results with the
results from colleagues because most of the other teachers would also show the highest level of
teaching quality. Therefore, the study concludes that SET scores should not be utilized for
formative purposes.
Acquiescence affects not only formative uses but also summative uses of SET scores. For
instance, rankings of teachers based on raw scores and on scores after controlling for acquiescence
response style differ substantively in the SET data from this study. The Spearman rank
correlation coefficient between these two scores is rs = -.16, p = .00. As a consequence, administrators
would wrongly believe that a teacher with a score above a pre-established limit (cut-off)
possesses higher teaching ability, possibly qualifying for recognition or promotion based on
inaccurate teaching evaluation scores. After controlling for acquiescence, the same teacher would
likely not pass the previous cut-off.
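The comparison of two rankings can be illustrated with a small computation of the Spearman rank correlation, which is the Pearson correlation of the ranks. The following is a minimal sketch using only the standard library. The scores are hypothetical teacher-level values chosen for illustration, not data from the study, and the function names are illustrative.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical teacher-level scores, before and after adjustment.
raw = [4.9, 4.8, 4.7, 4.5, 4.2]
adjusted = [4.1, 4.4, 4.0, 4.5, 4.3]
print(round(spearman(raw, adjusted), 2))
```

A coefficient near 1 would mean that the adjustment leaves the ordering of teachers essentially unchanged; values near zero or below indicate that the two rankings disagree.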
The main implication for current users of scores is caution when utilizing SET for formative and
summative purposes. For instance, users could utilize scores for identifying only teachers in high
need of professional development. Users could also utilize scores to identify exemplary teachers.
Until complementary evidence of validity based on content and response process becomes
available, users should refrain from making summative decisions based on scores. In that regard,
institutions can reduce the weight of SET in personnel and administrative decisions, reducing the
consequences associated with the use of SET scores.
Response styles can also artificially inflate reliability coefficients (James, Demaree, & Wolf,
1984; Wetzel, Böhnke, et al., 2016). Cronbach’s alpha coefficient in the study shows that the
estimated lower bound of the reliability of the SET scale is .93, indicating a high internal
consistency but also suggesting redundancy among items.
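For reference, Cronbach's alpha is computed from the number of items, the sum of the item variances, and the variance of the total score. The following is a minimal sketch of the standard formula. The function names are illustrative and the responses are hypothetical, not data from the study.

```python
def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses from five students to three items.
toy = [[5, 5, 5], [5, 5, 4], [4, 4, 4], [3, 4, 3], [5, 4, 5]]
print(round(cronbach_alpha(toy), 2))
```

When every respondent gives the same answer to all items, the items are perfectly redundant and alpha reaches its maximum of 1, which is why a very high coefficient can signal redundancy or a shared response tendency rather than measurement quality alone.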
Acquiescence jeopardizes the inferences regarding relationship to other variables (Paulhus, 1991;
Viswanathan, 2005). In the case of the study, differences in SET scores between female and male
teachers are higher before statistically removing acquiescence. Probably, acquiescence produced
range restriction due to summative error and introduced correlational error, producing the
reported moderator effect. The results imply that minimizing acquiescence would increase the
difference in SET scores in favor of female teachers. Such a difference would affect the formative
and summative use of SET scores and calls for a careful examination of instrument content,
response process, and the theory sustaining the development of the SET instrument.
Finally, acquiescence can explain the observed relationship between SET scores and irrelevant
variables. Therefore, research on SET validity that contributes evidence based on relationships to
other variables, including discriminant evidence, first needs to rule out the influence of response
styles (or any other source of construct-irrelevant variance) from SET scores under the risk of
providing non-realistic estimates of the size, statistical and practical significance of these
relationships.
Findings call for caution in the interpretation and utilization of SET scores and raise a relevant
concern that needs subsequent examination by SET developers and users.
5.1.2 Recommendations
Proving that a scale does not measure bias (construct-irrelevant variance) is essential in
validation (Spector, 1992). Therefore, the examination of response styles and other sources of
construct-irrelevant variance is a primary task to ensure that interpretation and use of scores are
valid: “proliferation of tests of high sounding psychological constructs in disregard of response
bias [styles] is a conspicuous waste of research” (Loevinger, 1959, p. 306).
The first and most important recommendation is to ground the development and
validation of SET summated rating scales in sound theory of teaching and learning.
Simultaneously, the development of SET summated rating scales should adhere to standard
procedures in the measurement field. A first step in test-development is the definition of the
target construct and the creation of test specifications (or blueprint) that identify levels of
performance according to the intended use of scores. A second step is the development of items
based on test specifications. In this regard, a preliminary pilot testing of items should be
conducted. Cognitive interviews and think-aloud protocols can inform about the response
process and potential sources of construct-irrelevant variance in responses to items. A third step
is item analysis to ensure that items have appropriate difficulty and discrimination levels.
Finally, the definition of cut-off scores should rely on standard-setting methods.
A second recommendation is a note of caution about the interpretation of scores from
summated rating scales as a measure of teachers’ actual level of performance. As mentioned in
Section 1.2, items from summated rating scales do not have a right or wrong answer, and they are
not appropriate for inferring performance or ability (Spector, 1992). Therefore, SET scores are
not a perfect measure of teacher’s performance or teaching ability. SET scores are proxies of true
teaching quality as reported by students, and the accuracy of SET scores (from test-criterion
evidence) is still unknown. In the interpretation of SET scores, users should also keep in mind
that defining and measuring teaching quality is a difficult task, and that SET scores will contain
measurement error, either random error or construct-irrelevant variance. These two limitations in
the use of summated rating scales in the context of teaching evaluation discourage the utilization
of SET scores for summative decisions and accountability when there is limited validity
evidence to support the use of scores for those types of decisions.
A third recommendation relates to the implementation of methods for minimizing and
controlling response styles. Specifically, methods for controlling and reducing acquiescence
should address the student, the instrument, and the conditions of measurement.
SET developers can enhance students’ competence in the use of SET summated rating scales in
different ways. Examples are developing a students’ scoring manual, explaining the relevance of
SET scores to students, and encouraging their intelligent and careful participation (Kingsbury,
1922). Asking students to provide scores for clearly stated formative purposes in low-stakes
contexts can help avoid score inflation, reduce strong satisficing, and encourage effort and
motivation for optimizing responses.
There is no consensus on how to counteract response styles by manipulating instrument or
measurement conditions without introducing other types of construct-irrelevant variance in
scores (Wetzel, Böhnke, et al., 2016). An example is the popular recommendation of balancing
positively and negatively worded items to counteract acquiescence. This strategy can affect
respondents’ accuracy due to higher task demands in item interpretation. The use of positively and
negatively worded items also introduces a method effect that can change the internal structure and
reliability of scales (Zhang & Savalei, 2015).
Three feasible strategies to minimize acquiescence by manipulating instrument features are 1)
positively packing the response scale (Lam & Klockars, 1982), 2) the use of a wide scale (Lam
& Stevens, 1994), and 3) the use of the expanded format (Zhang & Savalei, 2015). These three
strategies can lower mean scores from summated rating scales. The efficacy of these three
strategies in the context of SET rating scales is unknown and demands further research.
A positively packed scale is a response scale in which the labels (or anchors) are not equally
spaced. An example of an equally spaced scale is one with the labels poor, need improvement,
satisfactory, quite good, and excellent. A positively packed version of the same scale would
utilize the labels poor, fair, good, very good, excellent (Lam & Klockars, 1982). In the example,
only two anchors cover the distance between the first anchor and midpoint (poor, fair), and four
anchors cover the distance between the midpoint and the last anchor (fair, good, very good,
excellent).
A wide scale encompasses the use of semantically broader labels or anchors in the response scale
maintaining the same number of options. The current instrument utilizes “not at all” and “a great
deal” as the lowest and highest response options. A wider response scale can utilize “never”
(replacing “not at all”) and “always” (replacing “a great deal”).22
The expanded format involves presenting each item and response option simultaneously as one
statement. The following example presents the expanded format with a positively packed scale
and wide scale anchors using an item from the SET summated rating scale presented in the
study:
• “I never found the course intellectually stimulating.”
• “I sometimes found the course intellectually stimulating.”
• “I often found the course intellectually stimulating.”
• “I almost every time found the course intellectually stimulating.”
• “I always found the course intellectually stimulating.”
Another recommendation for developers is the inclusion of additional items to measure the level
of acquiescence in responses. An acquiescence frequency index based on responses to a specific
scale can reflect the level of acquiescence independently from responses to SET items. The
literature recommends at least 30 items of diverse content, with an equal number of positively and
negatively worded items, to properly measure acquiescence (Kam, 2015). Background questions
asking students about the level of relevance, effort and honesty in responses and questions
targeting potential reasons to increase scores (inducements, power relationships, level of
consequence of scores) can also suggest and explain the presence of acquiescence.
One last method for controlling and minimizing response styles is utilizing statistical methods,
for instance, linear regression analysis (Webster, 1958) as utilized in this study. Regression
analysis makes it possible to statistically remove or control for acquiescence in SET scores. The predicted
22 Notice that frequency anchors fit better the report of students’ experience in the course than the original anchors.
value of SET scores would reflect the expected level of teaching quality when the level of
acquiescence is constant.
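One common way to hold acquiescence constant with linear regression is to regress SET scores on the acquiescence index and keep the residual, re-expressed around the grand mean. The following is a minimal sketch under that assumption. It is not a reproduction of the exact model used in the study, and the function names are illustrative.

```python
def ols_simple(x, y):
    """Intercept and slope of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


def adjust_for_ars(set_scores, ars):
    """SET scores with the linear effect of acquiescence removed.

    Each adjusted score is the grand mean plus the regression residual, i.e.
    the score expected if the respondent's acquiescence were at the mean.
    """
    a, b = ols_simple(ars, set_scores)
    grand_mean = sum(set_scores) / len(set_scores)
    return [grand_mean + (y - (a + b * x)) for x, y in zip(ars, set_scores)]
```

By construction, the adjusted scores have the same mean as the raw scores and are uncorrelated with the acquiescence index, so any remaining differences between teachers cannot be attributed to a linear effect of acquiescence.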
Item response theory (IRT) allows the prediction of teaching quality while statistically controlling for
the level of acquiescence, as in the case of regression analysis. However, IRT implies a
higher level of precision in the estimation of teaching quality and acquiescence at the expense of
increased mathematical and computational complexity.
IRT “models the probability of ticking a certain response option as a function of the underlying
latent variable” (Van Vaerenbergh & Thomas, 2013, p. 207). IRT assumes that the production of
responses depends on the interaction between the student (or measurement object) and an item
(measurement agent). Responses depend on the level of the trait of the person (called person’s
position) and the difficulty of the item (Wu, Adams, Wilson, & Haldane, 2007). For instance,
students should provide higher SET scores if they experience higher levels of teaching quality.
IRT assumes that responses are only explained by the level of teaching quality, in other words,
that responses are unidimensional. These assumptions are the foundation of the mathematical
formulas utilized in IRT models (Chiang, Green, & Cox, 2009). Section b in Figure 4 illustrates
the unidimensionality assumption of SET scores.
A specific example of an IRT model for diagnosing response styles is the
Multidimensional Rating Scale Model (MRSM). MRSM can estimate teaching quality and
acquiescence using the same or different items (Wetzel, Böhnke, et al., 2016; Wetzel & Carstensen,
2015). The rating scale (RS) model extends the one-parameter IRT model for
dichotomous items to summated rating scales whose items share a multiple-category
response format. An example of such response format is the Likert-type scale,
commonly used in SET summated rating scales. The multidimensional item response model
allows the measurement of multiple latent variables underlying a multidimensional test (Wu et
al., 2007). In a multidimensional within-item response model, responses to a single item can
reflect two or more latent variables.
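To make the rating scale model concrete, the following minimal sketch computes the category probabilities for a single item under the unidimensional rating scale model, in which each step between adjacent categories depends on the person position, the item location, and a threshold shared by all items. The multidimensional extension is not shown. The function name, parameter names, and parameter values are illustrative assumptions.

```python
import math


def rsm_probabilities(theta, delta, taus):
    """Category probabilities under the unidimensional rating scale model.

    theta: person position; delta: item location; taus: thresholds shared by
    all items, one per step between adjacent response categories.
    """
    # Cumulative sums of (theta - delta - tau_k); the lowest category gets 0.
    numerators = [0.0]
    for tau in taus:
        numerators.append(numerators[-1] + (theta - delta - tau))
    exps = [math.exp(v) for v in numerators]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical four-category item with symmetric thresholds.
for theta in (0.0, 2.0):
    print(theta, [round(p, 3) for p in rsm_probabilities(theta, 0.0, [-1.0, 0.0, 1.0])])
```

As the person position increases relative to the item location, probability mass shifts toward the highest category, which is the pattern a unidimensional model attributes entirely to the trait being measured.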
Figure 4
(a) The intended SET measurement model and (b) a rival measurement model with acquiescence
response style (ARS)
TQ=Teaching Quality; ARS=Acquiescence Response Style
In the case of this study, the MRSM allows the measurement of teaching quality and
acquiescence simultaneously using the same eight items included in the instrument (represented
in Section b in Figure 4). MRSM also allows the operationalization of acquiescence using the same
definition utilized for the calculation of frequency indexes.
MRSM offers a method for determining which model (intended, section a in Figure 4, or rival,
section b in Figure 4) better reproduces the observed relationships in the data. The selection of
the model with better fit is conducted by observing deviance and Akaike’s Information Criterion
(AIC) statistics (Wu et al., 2007). Lower values of deviance and AIC indicate the relatively better-fitting
model. The likelihood ratio test is a 𝜒2 test of the difference in deviance between two competing
models: the null hypothesis is that model 1 (rival) fits the data as well as model 2 (intended)
(Osteen, 2010; Wu et al., 2007).
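The model comparison described above can be sketched numerically. The example below assumes that deviance is -2 times the log-likelihood, that AIC is deviance plus twice the number of estimated parameters, and that the intended model is nested within the rival model so that the difference in deviance follows a chi-square distribution. The function names are illustrative, and the deviances and parameter counts in the usage lines are hypothetical, not results from the study.

```python
import math


def aic(deviance, n_params):
    """Akaike's Information Criterion from deviance (-2 log-likelihood)."""
    return deviance + 2 * n_params


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        terms = [z ** j / math.factorial(j) for j in range(df // 2)]
        return math.exp(-z) * sum(terms)
    terms = [z ** (j + 0.5) / math.gamma(j + 1.5) for j in range((df - 1) // 2)]
    return math.erfc(math.sqrt(z)) + math.exp(-z) * sum(terms)


def likelihood_ratio_test(dev_restricted, k_restricted, dev_general, k_general):
    """Chi-square statistic, df, and p-value for two nested models."""
    stat = dev_restricted - dev_general
    df = k_general - k_restricted
    return stat, df, chi2_sf(stat, df)


# Hypothetical deviances and parameter counts for the two models.
stat, df, p = likelihood_ratio_test(1000.0, 10, 990.0, 12)
print(aic(1000.0, 10), aic(990.0, 12), stat, df, round(p, 4))
```

A small p-value would indicate that the more general model reproduces the data better than the restricted one by more than chance would allow; the AIC comparison adds a penalty for the additional parameters.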
Two recommendations for policy makers can contribute to increasing the validity of SET in the
long term: 1) the creation of a task force on teaching standards and 2) the creation of standards of
teaching evaluation in postsecondary education. Teaching standards address the issue of a lack of
theory about teaching and learning in post-secondary education. Teaching standards covering all
the important aspects of teaching and learning in post-secondary education settings can guide the
development of test specifications for SET instruments and other measures of teaching quality.
Teaching standards can also guide teaching training and professional development. Standards for
teaching evaluation in postsecondary education would help develop valid and fair teacher
evaluations and reduce the lack of expertise on educational measurement (Onwuegbuzie, Daniel,
& Collins, 2009). These standards should rely on recent measurement theory and a current
definition of validity. An example of teaching evaluation standards in the K-12 context is “The
personnel evaluation standards: how to assess systems for evaluating educators” (Joint
Committee on Standards for Educational Evaluation, 2009).
A final recommendation for researchers is against the reification of SET validity evidence from
literature reviews and other sources. Contrary to the claim that empirical evidence supports that
SET scores are valid (Olivares, 2003; Ory, 2001; Theall & Franklin, 2001), the evidence rather
supports that SET scores can be valid under certain conditions (Marsh & Roche, 1997). The
generalization of validity findings from individual studies to legitimize the use of other
instruments or the same instrument on different populations and measurement conditions often
occurs in the SET literature (Johnson, 2000). The opposite reaction, denying SET scores validity
based on the findings from individual studies, as in Boring et al. (2016) and Stark & Freishtat
(2014), should also be avoided. Both practices are inconsistent with current and accepted
definitions of score validity and validation (Messick, 1989). SET developers and users should
continuously examine and report validity evidence for each specific measurement instance that
involves not only the use of different items but also differences in populations and settings.
5.2 Alternative Interpretation of Findings
The findings presented in the study suggest (but do not prove) the existence of processes not
relevant to the intended interpretation and use of SET scores as a measure of teaching quality for
formative and summative decisions. Limitations inherent to the study design (discussed later)
demand complementary types of evidence to fully understand “why” a substantial proportion of
students endorsed the highest response option in the scale across items.
The alternative interpretations discussed here are 1) a high level of teaching quality, 2) construct
underrepresentation, 3) a ceiling effect, 4) the influence of survey mode, 5) strong satisficing,
and 6) the influence of evaluation goals on responses. These alternative interpretations can either
challenge or complement the interpretation of findings presented in the study and help inform
future research.
5.2.1 High Level of Teaching Quality
A first alternative interpretation is precisely the one that the study attempts to challenge: that the
actual level of teaching quality is high, students provided responses exclusively based on
content, and observed scores reflect true score.
The above alternative interpretation faces the same limitation as the present study. Without
complementary validity evidence addressing why students consistently endorsed the highest
response option across items, the claim that responses are based exclusively on content without
the influence of irrelevant processes is tentative.
On the contrary, at least one reason makes this alternative interpretation problematic.
Specifically, there would be no need for evaluation if teaching quality were expected to show a
negatively skewed distribution caused by students endorsing the highest response options across
items. Such a score distribution suggests that there is no attribute of teaching
quality to improve and that most teachers show a similarly high level of teaching quality.
Therefore, a large proportion of students endorsing the highest response options across
items is coherent with neither formative nor summative decisions, and for this
reason, interpreting observed SET scores as reflecting true teaching quality contradicts
the proposed use of scores.
5.2.2 Construct Underrepresentation
Another alternative interpretation is that observed scores reflect true teaching ability but the
content covers aspects of teaching quality that are easy for most teachers to achieve. This possibility
refers to construct underrepresentation, another problem that reduces score validity. As
demonstrated in Section 3.3, the instrument includes few items measuring the different acts of
teaching (logical, psychological, and moral), suggesting construct underrepresentation and
recommending a careful interpretation of SET scores as a partial measure of teaching quality.
Therefore, construct underrepresentation is a plausible alternative interpretation of the pattern of
responses reported in the study that also affects the intended use of scores for formative and
summative decisions.
5.2.3 Ceiling Effect
Ceiling effect (Hessling, Traxel, & Schmidt, 2004; Masino & Lam, 2014) is a third alternative
interpretation of the pattern of responses observed in the study. The principal difference between
ceiling effect and acquiescence response style is that the first (ceiling effect) attributes the
observed data pattern to instrument issues and assumes that responding reflects the target
construct, whereas the second (acquiescence) attributes the observed data pattern to processes
not related to the content.
Low item difficulty, meaning that items are easy to endorse, can cause a ceiling effect because of
either content (for instance, construct underrepresentation, as discussed above) or features of the
response scale, for example, an inappropriate format with few or poorly labeled anchors.
One way of minimizing a ceiling effect is to use a response scale that allows better
discrimination across levels of teaching quality. Higher discrimination can be achieved by
increasing the number of response options, modifying the labels in the response scale (Hessling
et al., 2004), or modifying item wording as in the case of the wide format discussed previously
(Lam & Stevens, 1994).
Ceiling effect and acquiescence response style are not mutually exclusive interpretations.
However, it is uncertain whether modifications to the response scale to minimize ceiling effect would
lead to a smaller proportion of responses consistent with acquiescence response style, because of the
additional problem of construct underrepresentation.
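As an illustration of how a ceiling pattern might be screened for in item-level data, the following minimal sketch computes the proportion of responses at the highest option and the skewness of the responses. The function names, the 1-5 coding, and the example responses are hypothetical and do not come from the study data. Neither statistic can separate a ceiling effect from acquiescence; both only describe the shape of the distribution.

```python
from statistics import mean, pstdev


def ceiling_proportion(scores, max_option):
    """Share of responses that fall on the highest response option."""
    return sum(1 for s in scores if s == max_option) / len(scores)


def skewness(scores):
    """Population skewness: the mean of the cubed z-scores."""
    m = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        raise ValueError("skewness is undefined when all scores are equal")
    return sum(((s - m) / sd) ** 3 for s in scores) / len(scores)


# Hypothetical responses to one item on a 1-5 agreement scale.
responses = [5, 5, 5, 4, 5, 5, 3, 5, 4, 5]

print(ceiling_proportion(responses, max_option=5))  # 0.7
print(skewness(responses))  # negative: the tail points toward low scores
```

A large ceiling proportion together with negative skewness matches the pattern described in this section, but deciding between the competing interpretations still requires evidence about the response process.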
5.2.4 Online Survey Mode
Along with a potential ceiling effect due to instrument design issues, an important aspect of the
measurement procedure in the study that may help explain the negatively skewed and
narrow distribution of SET scores is the use of online survey mode. Whereas the other two
comparable studies utilized paper and postal surveys (Richardson, 2005; Spooren et al., 2012),
the mode of administration in the study was online. Survey mode is known to introduce mode-
specific types of error on responses (Smyth et al., 2009). Findings of differences in SET scores
between online and paper-based modes of administration are mixed. A group of studies indicates
no difference between modes of administration (Avery, Bryant, Mathios, Kang, & Bell, 2006;
Dommeyer, 2004; Stowell, Addison, & Smith, 2012). Another group of studies reports both
higher scores in online versus paper-based questionnaires (Bruns, Rupert, & Zhang, 2011;
Burton, Civitano, & Steiner-Grossman, 2012; Morrison, 2013) and higher scores in paper-based
versus online mode questionnaires (Capa-Aydin, 2016). None of the previous studies specifically
report differences in acquiescence or other response styles across modes of administration.
Considering that at least in certain cases online mode is related to inflated SET scores, online
survey mode seems a likely alternative interpretation that demands further examination.
5.2.5 Strong Satisficing
The pattern of responses reported in the study is also consistent with strong satisficing and the
use of an anchor-and-adjustment strategy.
Satisficing theory is a framework for exploring suboptimal survey responses. The theory predicts
that respondents will choose the first satisfactory or acceptable response alternative rather than
the optimal response (Krosnick, 1999; Krosnick & Alwin, 1987). Satisficing theory assumes that
responding to a survey question requires a significant amount of cognitive work that respondents
may not be interested in doing.
Respondents can save cognitive work in several ways (Barge & Gehlbach, 2012). For instance,
respondents can use the anchor and adjustment strategy, in which the “response to an initial
survey item provides a cognitive anchor from which they insufficiently adjust in answering the
subsequent item” (Gehlbach & Barge, 2012, p. 419). Anchor and adjustment can result in a
participant agreeing with all the statements in a questionnaire (acquiescence response style).
Three conditions increase satisficing: 1) greater task difficulty, 2) lower
respondent ability, and 3) lower respondent motivation to optimize (Krosnick, 1999).
“Strong satisficing” occurs when respondents process questions superficially and provide an
arbitrary or random response.
In the case of the study, conditions 1) and 2) for satisficing are not plausible. Participants in the
study are graduate students enrolled at a teacher education institution and are familiar with teaching
and learning concepts. This population of students is at least as capable of providing valid
responses as (if not more capable than) other populations of students with lower levels of acquiescence
(undergraduate students enrolled in programs not related to Education). The third condition for
satisficing seems a more plausible cause of the response pattern in the data. Perhaps students
were motivated to complete the online questionnaire, but not motivated enough to provide
accurate responses or optimize. Strong satisficing, and specifically a lower respondent
motivation to optimize, is a likely explanation that can help understand “why” students were
acquiescent in their responses.
5.2.6 Evaluation Goals
The last aspect that could lead to high SET scores is related to the intended use of the evaluation,
or evaluation goals, whether implicit or explicit in the evaluation context.
Wetzel et al. (2016) argue that the context of measurement can potentially affect participants’
motivation to provide accurate answers and trigger different types of response styles. Known
examples of the influence of the evaluation context are “for subordinates to exhibit positive
leniency when describing supervisors, and for judges to select neutral response alternatives when
items are ambiguous or when the judges wish to be evasive” (James, Demaree, & Wolf, 1984,
p.90).
One important aspect to consider is the level of personal involvement with the goals of the
evaluation. A high personal involvement and perceiving the evaluation as relevant and useful to
society can help reduce response styles in low-stakes contexts. The same is not true in high-stakes
contexts, in which only costly and inefficient modifications to the scoring process can
minimize a small group of response styles (Wetzel, Böhnke, et al., 2016).
There is only one published study comparing students’ internal evaluation goals and SET scores
(Murphy et al., 2004). Students rated how important the following goals were in
their judgment: 1) identifying the instructor’s weaknesses, 2) identifying the instructor’s
strengths, 3) providing fair ratings, and 4) motivating instructors. The study reported a positive
relationship between the importance ratings of the four evaluation goals and SET scores, with
r² ranging from 0.07 to 0.36 (pilot study) and 0.09 to 0.45 (main study). The authors conclude
that “raters [students] pursuing different goals tend to give different ratings, even when they have
observed the same performance” (p. 162). The previous study did not report goals related to
summative decisions as in the case of the present study.
Evaluation goals are relevant to the discussion of SET score validity because the utilization of
SET simultaneously for formative, summative and accountability purposes can introduce
conflicting goals and incentives for score inflation (Penny, 2003; Spooren et al., 2012; Yorke,
2009). As an example, teachers report attempts to artificially increase their evaluation scores by
introducing behaviors such as inducements, pre-evaluation actions, manipulation, watching
during SET, providing academic extras, and grading leniency (Simpson & Siguaw, 2000).
Students seem to highly value formative decisions based on SET scores (Chen & Hoshower,
2003; Ernst, 2014).
In the case of the present study, ambiguously stated and possibly conflicting evaluation goals
presented to students during instrument administration might have influenced students’ internal
goals. Possible internal goals causing inflated SET scores are 1) an attempt to avoid the negative
consequences of low scores for teachers, 2) low involvement and perceived relevance (related
to satisficing), 3) an attempt to motivate instructors by endorsing high scores, and 4) students’
response to instructors’ inducements. The plausibility of these four explanations requires further
examination.
5.3 Limitations and Future Research
Three important limitations affect the implications based on findings reported in this study: 1)
the approach used to examine response styles, 2) the use of secondary SET data, and 3) general
limitations inherent to quantitative research methodology.
5.3.1 Use of Manifest Variable Approach
The study utilized a manifest variable approach to study response styles. Frequency indexes of
response styles are easy to compute and interpret. However, some authors defend approaches
based on more sophisticated mathematical models because of the confound between the response
style and the target construct inherent to frequency indexes calculated from the same items as
the target construct (Section 2.3.1). Latent variable models can effectively separate target
construct variance from response style variance (Bolt & Johnson, 2009; Wetzel, Böhnke, et al.,
2016).
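To make the idea of a frequency index concrete, the sketch below computes one index per response style for a single respondent as a simple proportion of item responses. The function name, the 1-5 coding (higher values meaning stronger agreement), and the example ratings are assumptions made for illustration. The exact index definitions used in the study are those given in Section 2.3.1 and are not reproduced here.

```python
def response_style_indexes(responses, n_options=5):
    """Return manifest (frequency) indexes for one respondent.

    responses: integer ratings coded 1..n_options, higher = more agreement.
    Each index is the proportion of items answered in the given way.
    """
    if not responses:
        raise ValueError("at least one response is required")
    n = len(responses)
    # Scale midpoint: 3 on a 5-point scale. With an even number of
    # options no rating equals the midpoint, so that index is 0.
    midpoint = (n_options + 1) / 2
    return {
        "acquiescence": sum(r > midpoint for r in responses) / n,
        "disacquiescence": sum(r < midpoint for r in responses) / n,
        "extreme": sum(r in (1, n_options) for r in responses) / n,
        "midpoint": sum(r == midpoint for r in responses) / n,
    }


# Hypothetical ratings from one student across eight items.
student = [5, 5, 4, 5, 3, 5, 5, 5]
indexes = response_style_indexes(student)
print(indexes)
```

Because these proportions are computed from the same items that measure teaching quality, a high acquiescence index here cannot by itself distinguish a response style from genuinely high teaching quality, which is the confound noted above.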
An example of a latent variable approach is the use of structural equation modelling
for examining acquiescence (Ferrando, Morales-Vives, & Lorenzo-Seva, 2016). Another
example is the use of item response models to examine midpoint and extreme response styles,
either as categorical latent variables (Tutz & Berger, 2016) or as continuous latent variables
(Wetzel & Carstensen, 2015). Methods for examining response styles based on latent variable
approaches are very recent, and no systematic review and comparison of methods is available yet
(Wetzel, Böhnke, et al., 2016).
Regardless of the clear advantages of more sophisticated statistical models, the high degree of
acquiescence affecting SET scores in this study is on its own sufficient evidence of
construct-irrelevant variance. In other situations, results from a manifest variable approach might
lead to a less precise diagnosis, and a latent variable approach, such as the MRSM presented in
Section 5.1.2, would help supplement those results.
5.3.2 Use of Observational Data
The lack of control over content and administration procedure associated with the use of secondary
data also limited the study in several ways. First, the inclusion of non-related constructs would
have allowed the examination of halo effect. Second, the inclusion of (anonymized) individual
identification of students would have allowed a more extensive and complete analysis of response
styles by including severity/leniency, central tendency, and range restriction. A significant limitation
pertaining to content is the problem of construct underrepresentation reported in Section 3.2,
suggesting the inclusion of more content targeting good teaching. Finally, observational data
limited the analysis of differences in the degree to which response styles differ by type of report
(self-report, other-report, report of objects) and type of content (logical, psychological or moral
acts of teaching).
5.3.3 Use of a Quantitative Approach
Following a process of rational argumentation based on the concept of validity and a review of
the literature, the study tested the plausibility of an alternative interpretation of SET scores,
answering the question of “what” (source of construct-irrelevant variance) might explain SET
scores other than the target construct. The quantitative strategy followed in the study is a
reasonable first step to determine whether response styles might represent a potential problem for
SET scores interpretation and subsequent use for formative and summative purposes. The
strategy is useful when observational data is available and the number of students is high, as in
the case of the present study. The design does not address the question of “why” a high
proportion of students relied on a response pattern consistent with acquiescence. Future research
needs to address the inherent lack of depth of the study design.
5.3.4 Future Research
The quantitative nature of the research design utilized in the study provides strong initial
evidence of response styles as a source of construct-irrelevant variance in SET scores at this
specific educational institution. However, alternative interpretations and limitations discussed
above recommend further research.
A research design aimed at examining the validity of SET scores should gather evidence
that illuminates possible causes of response styles, specifically from the response process,
addressing one of the limitations in current SET validity research along with the lack of strong
theory. In this regard, a mixed methods research design can provide well-sustained conclusions
about a target phenomenon (Creswell & Clark, 2010; Greene, Caracelli, & Graham, 1989) and
about causality (Howe, 2012). Mixed methods approaches to validation are increasingly used and
strongly recommended in the literature (Koskey, Sondergeld, Stewart, & Pugh, 2016; Luyt, 2012;
Morell & Tan, 2009; Onwuegbuzie, Bustamante, & Nelson, 2010). Thus, some of the limitations
of this study can be addressed by a mixed methods design.
A study aimed at examining causes of response styles in SET scores can rely on an experimental
design (quantitative phase) and think-aloud protocols (qualitative phase). Possible independent
variables that are expected to affect response styles are evaluation relevance (higher relevance
would increase students’ motivation and reduce satisficing) and the level of consequences of the
evaluation (high stakes evaluation would lead to higher scores).
In the experimental phase, relevance can be manipulated, for instance, by suggesting that the
teacher reads each student’s report (and within this factor, scores can be anonymous or
non-anonymous to further increase personal relevance) and that the use of scores would benefit future
students through subsequent teaching development support and course modifications. The level of
consequences can be manipulated by indicating that scores would inform administrative decisions,
such as removing a teacher from a course the next academic session, or that scores would affect
annual personnel evaluation. An ambiguity condition with vague relevance and level of consequences
would reflect a typical SET administration, as in the case of the institution in this study. Changes
in the instrument context, for instance, in the invitation email and in the introductory paragraph of
the SET summated rating scale, can produce the experimental manipulation.
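The random assignment step of such an experiment could be sketched as follows. The condition labels, the function name, and the fixed seed are hypothetical choices made for illustration, and the four conditions are only one possible reading of the design described above. The sketch balances group sizes by shuffling students and then dealing them out in turn.

```python
import random
from collections import Counter

# Hypothetical condition labels; the proposed design may define others.
CONDITIONS = [
    "high_relevance_anonymous",
    "high_relevance_non_anonymous",
    "high_consequences",
    "ambiguous",
]


def assign_conditions(student_ids, conditions=CONDITIONS, seed=2017):
    """Randomly assign each student to one condition with balanced groups."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    # Deal shuffled students out in turn so group sizes differ by at most one.
    return {
        sid: conditions[i % len(conditions)]
        for i, sid in enumerate(shuffled)
    }


# Ten hypothetical student identifiers.
ids = [f"student_{i:02d}" for i in range(10)]
assignment = assign_conditions(ids)
group_sizes = Counter(assignment.values())
print(group_sizes)
```

Each condition would then receive its own version of the invitation email and introductory paragraph, so that any difference in response style indexes across groups can be attributed to the manipulation rather than to self-selection.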
A second phase involving the use of think-aloud protocols conducted under the same
experimental conditions would provide narrative evidence of students’ thinking during the
process of responding. Findings from think-aloud protocols can provide information about how
the experimental manipulation affects response styles. Also, qualitative evidence can suggest
other sources of construct-irrelevant variance that might reduce the validity of SET scores.
Finally, future research should address how the type of report (self-report, other-report, object-
report) and type of content (logical, psychological, and moral acts of teaching) affect response
styles. Such research should utilize a summated rating scale that properly covers all the attributes
of good teaching relevant for the specific discipline and educational context.
References
Aguinis, H., Beaty, J. C., Boik, R. J., & Pierce, C. A. (2005). Effect size and power in assessing
moderating effects of categorical variables using multiple regression: a 30-year review.
The Journal of Applied Psychology, 90(1), 94–107. https://doi.org/10.1037/0021-
9010.90.1.94
Alliger, G. M., Tannenbaum, S. I., Bennett Jr., W., Traver, H., & Shotland, A. (1997). A meta-
analysis of the relations among training criteria. Personnel Psychology, 50(2), 341–358.
https://doi.org/10.1111/j.1744-6570.1997.tb00911.x
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (2014). Standards for educational and
psychological testing. Washington, DC: American Educational Research Association.
Arbuckle, J., & Williams, B. D. (2003). Students’ Perceptions of Expressiveness: Age and
Gender Effects on Teacher Evaluations. Sex Roles, 49(9–10), 507–516.
https://doi.org/10.1023/A:1025832707002
Avery, R. J., Bryant, W. K., Mathios, A., Kang, H., & Bell, D. (2006). Electronic Course
Evaluations: Does an Online Delivery System Influence Student Evaluations? The
Journal of Economic Education, 37(1), 21–37. https://doi.org/10.3200/JECE.37.1.21-37
Aylett, R., & Gregory, K. (1996). Evaluating Teacher Quality in Higher Education. Psychology
Press.
Barge, S., & Gehlbach, H. (2012). Using the Theory of Satisficing to Evaluate the Quality of
Survey Data. Research in Higher Education, 53(2), 182–200.
Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social
psychological research: Conceptual, strategic, and statistical considerations. Journal of
Personality and Social Psychology, 51(6), 1173–1182. https://doi.org/10.1037/0022-
3514.51.6.1173
Basow, S. A., & Montgomery, S. (2005). Student Ratings and Professor Self-Ratings of College
Teaching: Effects of Gender and Divisional Affiliation. Journal of Personnel Evaluation
in Education, 18(2), 91–106. https://doi.org/10.1007/s11092-006-9001-8
Bassett, J., Cleveland, A., Acorn, D., Nix, M., & Snyder, T. (2017). Are they paying attention?
Students’ lack of motivation and attention potentially threaten the utility of course
evaluations. Assessment & Evaluation in Higher Education, 42(3), 431–442.
https://doi.org/10.1080/02602938.2015.1119801
Bassin, W. M. (1974). A Note on the Biases in Students’ Evaluations of Instructors. The Journal
of Experimental Education, 43(1), 16–17.
https://doi.org/10.1080/00220973.1974.10806298
Berk, R. A. (2005). Survey of 12 strategies to measure teaching effectiveness. International
Journal of Teaching and Learning in Higher Education, 17(1), 48–62.
Berliner, D. C. (2005). The near impossibility of testing for teacher quality. Journal of Teacher
Education, 56(3), 205–213. https://doi.org/10.1177/0022487105275904
Billiet, J. B., & Davidov, E. (2008). Testing the Stability of an Acquiescence Style Factor Behind
Two Interrelated Substantive Variables in a Panel Design. Sociological Methods &
Research, 36(4), 542–562. https://doi.org/10.1177/0049124107313901
Bolt, D. M., & Johnson, T. R. (2009). Addressing Score Bias and Differential Item Functioning
Due to Individual Differences in Response Style. Applied Psychological Measurement,
33(5), 335–352. https://doi.org/10.1177/0146621608329891
Bonitz, V. S. (2011). Student Evaluation of Teaching: Individual Differences and Bias Effects.
Graduate Theses and Dissertations. Paper 12211. Retrieved from
http://lib.dr.iastate.edu/etd/1221
Boring, A. (2015). Gender Biases in student evaluations of teachers (Documents de Travail de
l’OFCE No. 2015–13). Observatoire Francais des Conjonctures Economiques (OFCE).
Retrieved from http://econpapers.repec.org/paper/fcedoctra/1513.htm
Boring, A., Ottoboni, K., & Stark, P. B. (2016). Student Evaluations of Teaching (Mostly) Do
Not Measure Teaching Effectiveness. ScienceOpen Research, 0(0), 1–11.
https://doi.org/10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1
Boud, D., & Falchikov, N. (1989). Quantitative studies of student self-assessment in higher
education: a critical analysis of findings. Higher Education, 18(5), 529–549.
Bowman, N. (2010). Can 1st-Year College Students Accurately Report Their Learning and
Development? American Educational Research Journal, 47(2), 466–496.
https://doi.org/10.3102/0002831209353595
Brown, J. D. (2011). Questions and answers about language testing statistics: Likert items and
scales of measurement? Retrieved July 12, 2017, from
http://hosted.jalt.org/test/bro_34.htm
Bruns, S. M., Rupert, T. J., & Zhang, Y. (2011). Effects of Converting Student Evaluations of
Teaching from Paper to Online Administration. In Advances in Accounting Education:
Teaching and Curriculum Innovations (Vol. 12, pp. 167–192). Emerald Group Publishing
Limited. Retrieved from http://www.emeraldinsight.com/doi/full/10.1108/S1085-
4622%282011%290000012010
Burton, W. B., Civitano, A., & Steiner-Grossman, P. (2012). Online versus paper evaluations:
differences in both quantitative and qualitative data. Journal of Computing in Higher
Education, 24(1), 58–69. https://doi.org/10.1007/s12528-012-9053-3
Capa-Aydin, Y. (2016). Student evaluation of instruction: comparison between in-class and
online methods. Assessment & Evaluation in Higher Education, 41(1), 112–126.
https://doi.org/10.1080/02602938.2014.987106
Carifio, J., & Perla, R. J. (2007). Ten common misunderstandings, misconceptions, persistent
myths and urban legends about Likert scales and Likert response formats and their
antidotes. Journal of Social Sciences, 3(3), 106–116.
Cashin, W. E. (1995). Student Ratings of Teaching: The Research Revisited. IDEA Paper No. 32.
Retrieved from http://eric.ed.gov/?id=ED402338
Chen, T., & Hoshower, L. B. (2003). Student Evaluation of Teaching Effectiveness: An
assessment of student perception and motivation. Assessment & Evaluation in Higher
Education, 28(1), 71–88. https://doi.org/10.1080/02602930301683
Chiang, K. S., Green, K. E., & Cox, E. O. (2009). Rasch analysis of the Geriatric Depression
Scale-Short Form. The Gerontologist, 49(2), 262–275.
https://doi.org/10.1093/geront/gnp018
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale,
NJ: Routledge.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003).
Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.).
Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Cohen, L., Manion, L., & Morrison, K. (2007). Research Methods in Education (6th ed.).
London; New York: Routledge.
Creswell, J. W., & Clark, V. L. P. (2010). Designing and Conducting Mixed Methods Research
(2nd ed.). Los Angeles: Sage Publications.
Cronbach, L. J. (1946). Response sets and test validity. Educational and Psychological
Measurement, 6, 475–494.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. Braun (Eds.),
Test validity (pp. 3–17). Hillsdale, NJ: Lawrence Erlbaum.
Dommeyer, C. J., Baum, P., Hanna, R. W., & Chapman, K. S. (2004). Gathering Faculty Teaching Evaluations by In-Class and Online
Surveys: Their Effects on Response Rates and Evaluations. Assessment & Evaluation in
Higher Education, 29(5), 611–623.
Dunning, D., & Helzer, E. G. (2014). Beyond the Correlation Coefficient in Studies of Self-
Assessment Accuracy Commentary on Zell & Krizan (2014). Perspectives on
Psychological Science, 9(2), 126–130. https://doi.org/10.1177/1745691614521244
Ellis, P. D. (2010). The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the
Interpretation of Research Results. Cambridge University Press.
Ernst, D. (2014). Expectancy theory outcomes and student evaluations of teaching. Educational
Research and Evaluation, 20(7–8), 536–556.
https://doi.org/10.1080/13803611.2014.997138
Fenstermacher, G. D., & Richardson, V. (2005). On making determinations of quality in
teaching. The Teachers College Record, 107(1), 186–213.
Ferrando, P. J., Morales-Vives, F., & Lorenzo-Seva, U. (2016). Assessing and Controlling
Acquiescent Responding When Acquiescence and Content Are Related: A
Comprehensive Factor-Analytic Approach. Structural Equation Modeling: A
Multidisciplinary Journal, 23(5), 713–725.
https://doi.org/10.1080/10705511.2016.1185723
Gee, N. (2017). A study of student completion strategies in a Likert-type course evaluation
survey. Journal of Further and Higher Education, 41(3), 340–350.
https://doi.org/10.1080/0309877X.2015.1100717
Gehlbach, H., & Barge, S. (2012). Anchoring and Adjusting in Questionnaire Responses. Basic
and Applied Social Psychology, 34(5), 417–433.
https://doi.org/10.1080/01973533.2012.711691
Gravestock, P., & Gregor-Greenleaf, E. (2008). Student Course Evaluations: Research, Models
and Trends. Toronto: Higher Education Quality Council of Ontario.
Greene, J. C., Caracelli, V. J., & Graham, W. F. (1989). Toward a Conceptual Framework for
Mixed-Method Evaluation Designs. Educational Evaluation and Policy Analysis, 11(3),
255–274. https://doi.org/10.3102/01623737011003255
Hessling, R. M., Traxel, N. M., & Schmidt, T. J. (2004). Ceiling Effect. In M. Lewis-Beck, A.
Bryman, & T. Futing Liao (Eds.), The SAGE Encyclopedia of Social Science Research
Methods. Thousand Oaks, CA:
Sage Publications, Inc. Retrieved from http://methods.sagepub.com/reference/the-sage-
encyclopedia-of-social-science-research-methods/n102.xml
Howe, K. R. (2012). Mixed Methods, Triangulation, and Causal Explanation. Journal of Mixed
Methods Research, 1558689812437187. https://doi.org/10.1177/1558689812437187
Ingvarson, L., & Rowe, K. (2008). Conceptualising and Evaluating Teacher Quality: Substantive
and Methodological Issues. Australian Journal of Education, 52(1), 5–35.
https://doi.org/10.1016/j.apmr.2010.02.005
James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability
with and without response bias. Journal of Applied Psychology, 69(1), 85–98.
https://doi.org/10.1037/0021-9010.69.1.85
Johnson, R. (2000). The Authority of the Student Evaluation Questionnaire. Teaching in Higher
Education, 5(4), 419–434. https://doi.org/10.1080/713699176
Joint Committee on Standards for Educational Evaluation. (2009). The personnel evaluation
standards: how to assess systems for evaluating educators (2nd ed.). Thousand Oaks, CA:
Corwin Press.
Kam, C. C. S. (2015). Further Considerations in Using Items With Diverse Content to Measure
Acquiescence. Educational and Psychological Measurement, 76(1), 164–174.
https://doi.org/10.1177/0013164415586831
Kam, C. C. S., & Zhou, M. (2015). Does Acquiescence Affect Individual Items Consistently?
Educational and Psychological Measurement, 75(5), 764–784.
https://doi.org/10.1177/0013164414560817
Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement,
38(4), 319–342. https://doi.org/10.1111/j.1745-3984.2001.tb01130.x
Kaplan, R. M., & Saccuzzo, D. P. (2008). Psychological Testing: Principles, Applications, and
Issues (7th ed.). Belmont, CA: Wadsworth Publishing.
Kenny, D. A. (1979). Correlation and Causality (1st ed.). New York: John Wiley &
Sons Inc.
Kenny, D. A. (2015, March 31). Moderator Variables. Retrieved June 20, 2017, from
http://davidakenny.net/cm/moderation.htm#GG
Kingsbury, F. A. (1922). Analyzing ratings and training raters. Journal of Personnel Research
(Pre-1986), 1(000008), 377.
Kirkpatrick, D. L. (1977). Evaluating Training Programs: Evidence vs. Proof. Training and
Development Journal, 77(11), 9–12.
Kirkpatrick, D. L. (1979). Techniques for evaluating training programs. Classic Writings on
Instructional Technology, 1, 231–241.
Kline, T. J. B. (2005). Psychological Testing: A Practical Approach to Design and Evaluation.
Thousand Oaks, Calif: Sage Publications.
Koskey, K. L. K., Sondergeld, T. A., Stewart, V. C., & Pugh, K. J. (2016). Applying the Mixed
Methods Instrument Development and Construct Validation Process: The Transformative
Experience Questionnaire. Journal of Mixed Methods Research, 1558689816633310.
https://doi.org/10.1177/1558689816633310
Krosnick, J. A. (1999). Survey Research. Annual Review of Psychology, 50(1), 537–567.
https://doi.org/10.1146/annurev.psych.50.1.537
Krosnick, J. A., & Alwin, D. F. (1987). An Evaluation of a Cognitive Theory of Response-Order
Effects in Survey Measurement. Public Opinion Quarterly, 51(2), 201–219.
https://doi.org/10.1086/269029
Krosnick, J. A., Narayan, S., & Smith, W. R. (1996). Satisficing in surveys: Initial evidence. New
Directions for Evaluation, 1996(70), 29–44. https://doi.org/10.1002/ev.1033
Kuwaiti, A. A., & Subbarayalu, A. V. (2015). Appraisal of students experience survey (SES) as a
measure to manage the quality of higher education in the Kingdom of Saudi Arabia: an
institutional study using six sigma model. Educational Studies, 0(0), 1–14.
https://doi.org/10.1080/03055698.2015.1043977
Lam, T. C. M., & Klockars, A. J. (1982). Anchor Point Effects on the Equivalence of
Questionnaire Items. Journal of Educational Measurement, 19(4), 317–22.
Lam, T. C. M., & Stevens, J. J. (1994). Effects of Content Polarization, Item Wording, and
Rating Scale Width on Rating Response. Applied Measurement in Education, 7(2), 141–
158. https://doi.org/10.1207/s15324818ame0702_3
Leckie, G., & Baird, J. A. (2011). Rater Effects on Essay Scoring: A Multilevel Analysis of
Severity Drift, Central Tendency, and Rater Experience. Journal of Educational
Measurement, 48(4), 399–418. https://doi.org/10.1111/j.1745-3984.2011.00152.x
Lentz, T. F. (1938). Acquiescence as a factor in the measurement of personality. Psychological
Bulletin, 35(9), 659.
Loevinger, J. (1959). Theory and techniques of assessment. Annual Review of Psychology, 10,
287–316. https://doi.org/10.1146/annurev.ps.10.020159.001443
Luyt, R. (2012). A Framework for Mixing Methods in Quantitative Measurement Development,
Validation, and Revision: A Case Study. Journal of Mixed Methods Research, 6(4), 294–
316. https://doi.org/10.1177/1558689811427912
Macmillan, N. A., & Creelman, C. D. (1990). Response bias: Characteristics of detection theory,
threshold theory, and “nonparametric” indexes. Psychological Bulletin, 107(3), 401–413.
https://doi.org/10.1037/0033-2909.107.3.401
MacNell, L., Driscoll, A., & Hunt, A. N. (2015). What’s in a Name: Exposing Gender Bias in
Student Ratings of Teaching. Innovative Higher Education, 40(4), 291–303.
https://doi.org/10.1007/s10755-014-9313-4
Marsh, H. W. (1982). SEEQ: A Reliable, Valid, and Useful Instrument for Collecting Students’
Evaluations of University Teaching. British Journal of Educational Psychology, 52(1),
77–95. https://doi.org/10.1111/j.2044-8279.1982.tb02505.x
Marsh, H. W. (1987). Students’ evaluations of University teaching: Research findings,
methodological issues, and directions for future research. International Journal of
Educational Research, 11(3), 253–388. https://doi.org/10.1016/0883-0355(87)90001-2
Marsh, H. W., & Roche, L. A. (1997). Making students’ evaluations of teaching effectiveness
effective: The critical issues of validity, bias, and utility. American Psychologist, 52(11),
1187–1197. https://doi.org/10.1037/0003-066X.52.11.1187
Marsh, H. W., & Roche, L. A. (2000). Effects of grading leniency and low workload on students’
evaluations of teaching: Popular myth, bias, validity, or innocent bystanders? Journal of
Educational Psychology, 92(1), 202–228. https://doi.org/10.1037/0022-0663.92.1.202
Masino, C., & Lam, T. C. M. (2014). Choice of rating scale labels: implication for minimizing
patient satisfaction response ceiling effect in telemedicine surveys. Telemedicine Journal
and E-Health: The Official Journal of the American Telemedicine Association, 20(12),
1150–1155. https://doi.org/10.1089/tmj.2013.0350
McGrath, R. E., Mitchell, M., Kim, B. H., & Hough, L. (2010). Evidence for response bias as a
source of error variance in applied assessment. Psychological Bulletin, 136(3), 450–470.
https://doi.org/10.1037/a0019216
McPherson, M. A., & Jewell, R. T. (2007). Leveling the Playing Field: Should Student
Evaluation Scores be Adjusted? Social Science Quarterly, 88(3), 868–881.
https://doi.org/10.1111/j.1540-6237.2007.00487.x
McPherson, M. A., Jewell, R. T., & Kim, M. (2009). What Determines Student Evaluation
Scores? A Random Effects Analysis of Undergraduate Economics Classes. Eastern
Economic Journal, 35(1), 37–51.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13–
110). New York, NY: Macmillan.
Messick, S. (1995a). Standards of validity and the validity of standards in performance
assessment. Educational Measurement: Issues and Practice, 14(4), 5–8.
Messick, S. (1995b). Validity of psychological assessment: Validation of inferences from
persons’ responses and performances as scientific inquiry into score meaning. American
Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741
Moors, J. J. A. (1986). The Meaning of Kurtosis: Darlington Reexamined. The American
Statistician, 40(4), 283–284. https://doi.org/10.1080/00031305.1986.10475415
Morell, L., & Tan, R. J. B. (2009). Validating for Use and Interpretation: A Mixed Methods
Contribution Illustrated. Journal of Mixed Methods Research, 3(3), 242–264.
https://doi.org/10.1177/1558689809335079
Morrison, K. (2013). Online and paper evaluations of courses: a literature review and case study.
Educational Research and Evaluation, 19(7), 585–604.
https://doi.org/10.1080/13803611.2013.834608
Murphy, K. R., & Balzer, W. K. (1989). Rater errors and rating accuracy. Journal of Applied
Psychology, 74(4), 619–624. https://doi.org/10.1037/0021-9010.74.4.619
Murphy, K. R., & Cleveland, J. (1995). Understanding performance appraisal: Social,
organizational, and goal-based perspectives. Thousand Oaks, CA: Sage Publications.
Murphy, K. R., Cleveland, J. N., Skattebo, A. L., & Kinney, T. B. (2004). Raters Who Pursue
Different Goals Give Different Ratings. Journal of Applied Psychology, 89(1), 158–164.
https://doi.org/10.1037/0021-9010.89.1.158
Olivares, O. J. (2003). A Conceptual and Analytic Critique of Student Ratings of Teachers in the
USA with Implications for Teacher Effectiveness and Student Learning. Teaching in
Higher Education, 8(2), 233–245. https://doi.org/10.1080/1356251032000052465
Onwuegbuzie, A. J., Bustamante, R. M., & Nelson, J. A. (2010). Mixed Research as a Tool for
Developing Quantitative Instruments. Journal of Mixed Methods Research, 4(1), 56–78.
https://doi.org/10.1177/1558689809355805
Onwuegbuzie, A. J., Daniel, L. G., & Collins, K. M. T. (2009). A meta-validation model for
assessing the score-validity of student teaching evaluations. Quality & Quantity, 43(2),
197–209. https://doi.org/10.1007/s11135-007-9112-4
Ory, J. C. (2001). Faculty Thoughts and Concerns About Student Ratings. New Directions for
Teaching and Learning, 2001(87), 3–15. https://doi.org/10.1002/tl.23
Ory, J. C., & Ryan, K. (2001). How Do Student Ratings Measure Up to a New Validity
Framework? New Directions for Institutional Research, 2001(109), 27–44.
https://doi.org/10.1002/ir.2
Osteen, P. (2010). An Introduction to Using Multidimensional Item Response Theory to Assess
Latent Factor Structures. Journal of the Society for Social Work and Research, 1(2), 66–
82. https://doi.org/10.5243/jsswr.2010.6
Paulhus, D. L. (1991). Measurement and control of response bias. In J. P. Robinson, P. R. Shaver,
& L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes
(pp. 17–59). San Diego, CA, US: Academic Press.
Penny, A. R. (2003). Changing the Agenda for Research into Students’ Views about University
Teaching: Four shortcomings of SRT research. Teaching in Higher Education, 8(3), 399–
411. https://doi.org/10.1080/13562510309396
Pintrich, P. R. (2002). The Role of Metacognitive Knowledge in Learning, Teaching, and
Assessing. Theory Into Practice, 41(4), 219–225.
https://doi.org/10.1207/s15430421tip4104_3
Plieninger, H. (2016). Mountain or Molehill? A Simulation Study on the Impact of Response
Styles. Educational and Psychological Measurement, 0013164416636655.
https://doi.org/10.1177/0013164416636655
Popham, W. J. (1992). Educational Evaluation (3rd ed.). Boston: Pearson.
Rantanen, P. (2013). The number of feedbacks needed for reliable evaluation. A multilevel
analysis of the reliability, stability and generalisability of students’ evaluation of teaching.
Assessment & Evaluation in Higher Education, 38(2), 224–239.
https://doi.org/10.1080/02602938.2011.625471
Richardson, J. T. E. (2005). Students’ Approaches to Learning and Teachers’ Approaches to
Teaching in Higher Education. Educational Psychology, 25(6), 673–680.
https://doi.org/10.1080/01443410500344720
Richardson, J. T. E. (2012). The role of response biases in the relationship between students’
perceptions of their courses and their approaches to studying in higher education. British
Educational Research Journal, 38(3), 399–418.
https://doi.org/10.1080/01411926.2010.548857
Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the
psychometric quality of rating data. Psychological Bulletin, 88(2), 413–428.
https://doi.org/10.1037/0033-2909.88.2.413
Salas, E., & Cannon-Bowers, J. A. (2001). The Science of training: A decade of progress. Annual
Review of Psychology, 52(1), 471–499. https://doi.org/10.1146/annurev.psych.52.1.471
Simpson, P. M., & Siguaw, J. A. (2000). Student Evaluations of Teaching: An Exploratory Study
of the Faculty Response. Journal of Marketing Education, 22(3), 199–213.
https://doi.org/10.1177/0273475300223004
Smith, S. W., Yoo, J. H., Farr, A. C., Salmon, C. T., & Miller, V. D. (2007). The Influence of
Student Sex and Instructor Sex on Student Ratings of Instructors: Results from a College
of Communication. Women’s Studies in Communication, 30(1), 64–77.
https://doi.org/10.1080/07491409.2007.10162505
Smyth, J. D., Dillman, D. A., & Christian, L. M. (2009). Context effects in Internet surveys: New
issues and evidence. In A. N. Joinson, K. Y. A. McKenna, T. Postmes, & U.-D. Reips
(Eds.), Oxford Handbook of Internet Psychology. Oxford, UK: Oxford University Press.
Retrieved from
http://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199561803.001.0001/oxfordhb-9780199561803-e-027
Spector, P. E. (1991). Summated Rating Scale Construction: An Introduction (1st ed.). Newbury
Park, CA: Sage.
Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the Validity of Student Evaluation of
Teaching: The State of the Art. Review of Educational Research, 83(4), 598–642.
https://doi.org/10.3102/0034654313496870
Spooren, P., Mortelmans, D., & Thijssen, P. (2012). ‘Content’ versus ‘style’: Acquiescence in
student evaluation of teaching? British Educational Research Journal, 38(1), 3–21.
https://doi.org/10.1080/01411926.2010.523453
Stark, P., & Freishtat, R. (2014). An Evaluation of Course Evaluations. ScienceOpen Research.
Retrieved from https://www.scienceopen.com/document/id/ad8a9ac9-8c60-432a-ba20-4402a2a38df4
StataCorp. (2013). Stata Statistical Software: Release 13. College Station, TX: StataCorp LP.
Stowell, J. R., Addison, W. E., & Smith, J. L. (2012). Comparison of online and classroom-based
student evaluations of instruction. Assessment & Evaluation in Higher Education, 37(4),
465–473. https://doi.org/10.1080/02602938.2010.545869
Theall, M., & Franklin, J. (2001). Looking for Bias in All the Wrong Places: A Search for Truth
or a Witch Hunt in Student Ratings of Instruction? New Directions for Institutional
Research, 2001(109), 45–56. https://doi.org/10.1002/ir.3
Thorndike, E. L. (1920). A constant error in psychological ratings. Journal of Applied
Psychology, 4(1), 25–29.
https://doi.org/10.1037/h0071663
Tourangeau, R., Rips, L. J., & Rasinski, K. (2000). The Psychology of Survey Response (1st ed.).
Cambridge, UK: Cambridge University Press.
Traub, R. E. (1997). Classical Test Theory in Historical Perspective. Educational Measurement:
Issues and Practice, 16(4), 8–14. https://doi.org/10.1111/j.1745-3992.1997.tb00603.x
Tutz, G., & Berger, M. (2016). Response Styles in Rating Scales: Simultaneous Modeling of
Content-Related Effects and the Tendency to Middle or Extreme Categories. Journal of
Educational and Behavioral Statistics, 41(3), 239–268.
https://doi.org/10.3102/1076998616636850
Valsan, C., & Sproule, R. (2008). The invisible hands behind the student evaluation of teaching:
the rise of the new managerial elite in the governance of higher education. Journal of
Economic Issues, 939–958.
van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response Styles in Rating Scales:
Evidence of Method Bias in Data From Six EU Countries. Journal of Cross-Cultural
Psychology, 35(3), 346–360. https://doi.org/10.1177/0022022104264126
Van Vaerenbergh, Y., & Thomas, T. D. (2013). Response Styles in Survey Research: A Literature
Review of Antecedents, Consequences, and Remedies. International Journal of Public
Opinion Research, 25(2), 195–217. https://doi.org/10.1093/ijpor/eds021
Viswanathan, M. (2005). Measurement Error and Research Design. Thousand Oaks, CA: SAGE
Publications, Inc. Retrieved from http://dx.doi.org/10.4135/9781412984935.n3
Ward, M., Gruppen, L., & Regehr, G. (2002). Measuring Self-assessment: Current State of the
Art. Advances in Health Sciences Education, 7(1), 63–80.
Webster, H. (1958). Correcting personality scales for response sets or suppression effects.
Psychological Bulletin, 55(1), 62–64. https://doi.org/10.1037/h0048031
Weijters, B., Geuens, M., & Schillewaert, N. (2010). The stability of individual response styles.
Psychological Methods, 15(1), 96–110. https://doi.org/10.1037/a0018721
Wetzel, E., Böhnke, J., & Brown, A. (2016). Response Biases. In F. T. L. Leong, D. Bartram, F.
Cheung, K. F. Geisinger, & D. Iliescu (Eds.), The ITC International Handbook of Testing
and Assessment (1st ed., pp. 349–363). New York: Oxford University Press.
Wetzel, E., & Carstensen, C. H. (2015). Multidimensional Modeling of Traits and Response
Styles. European Journal of Psychological Assessment, 1–13.
https://doi.org/10.1027/1015-5759/a000291
Wetzel, E., Lüdtke, O., Zettler, I., & Böhnke, J. R. (2016). The Stability of Extreme Response
Style and Acquiescence Over 8 Years. Assessment, 23(3), 279–291.
https://doi.org/10.1177/1073191115583714
Wolfe, E. W. (2004). Identifying rater effects using latent trait models. Psychology Science,
46(1), 35–51.
Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ACER ConQuest Version 2.0:
Generalised Item Response Modelling Software. ACER Press.
Yorke, M. (2009). ‘Student experience’ surveys: some methodological considerations and an
empirical investigation. Assessment & Evaluation in Higher Education, 34(6), 721–739.
https://doi.org/10.1080/02602930802474219
Zabaleta, F. (2007). The use and misuse of student evaluations of teaching. Teaching in Higher
Education, 12(1), 55–76. https://doi.org/10.1080/13562510601102131
Zell, E., & Krizan, Z. (2014). Do People Have Insight Into Their Abilities? A Metasynthesis.
Perspectives on Psychological Science, 9(2), 111–125.
https://doi.org/10.1177/1745691613518075
Zhang, X., & Savalei, V. (2015). Improving the Factor Structure of Psychological Scales: The
Expanded Format as an Alternative to the Likert Scale Format. Educational and
Psychological Measurement, 0013164415596421.
https://doi.org/10.1177/0013164415596421
Appendices
Copyright Acknowledgements