Response Styles in Student Evaluation of Teaching

by

Edgar Andrés Valencia Acuña

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Department of Curriculum, Teaching and Learning

Ontario Institute for Studies in Education

University of Toronto

© Copyright by Edgar Andrés Valencia Acuña 2017

Abstract

Student Evaluation of Teaching (SET) typically refers to the use of summated rating scales to

measure teaching quality based on students’ reports. SET is widely used in post-secondary

education institutions for informing teacher professional development, curriculum revision,

personnel decisions, and for institutional accountability. The literature on SET validity is

abundant but often atheoretical, its evidence is inconclusive, and it pays scant attention to

content and response process. Specifically, research examining whether students rely on response

styles, responding independently of item content, is rare. Types of response styles include

acquiescence/disacquiescence (tendency to agree/disagree across items), extreme (tendency to

endorse extreme response options across items), and midpoint response styles (tendency to use

the midpoint option across items). Evidence of a substantial degree of response style would

reduce the validity of SET scores as a measure of teaching quality and their utility for informing

formative and summative decisions due to overestimation or underestimation of the actual level

of teaching quality and artificial changes in the relationship to other variables. Three topics

examined in the study are the degree to which SET scores are affected by response styles,

differences in the extent to which SET scores are affected by response styles across measurement

conditions, and the degree to which response styles moderate differences in SET scores between

female and male teachers. Responses to a SET summated rating scale from N=5,921 education

graduate students were analyzed. Student-level indexes of response styles suggest a high degree

of acquiescence in the direction of teaching quality overestimation, and no disacquiescence,

extreme, or midpoint response styles. A series of 2 (academic department) × 2 (program type) × 6

(academic session) ANOVAs on the response style indexes suggests no statistically significant

differences across measurement conditions. Finally, multiple linear regression analysis indicates

a statistically significant moderator effect of acquiescence on the difference in SET scores

between female and male teachers. The discussion addresses implications of the findings for

developers and users of SET summated rating scales, alternative interpretations of the observed

pattern of responses, limitations, and suggestions for future research.

Acknowledgments

Many people contributed, directly or indirectly, to my completing this thesis after five years

in the program. My first thanks go to my family, especially my daughter Magdalena Sofía. I

sincerely hope that the fruits of this work will bring her lasting benefit. My parents, Pilar and

Eduardo, were an important source of support in difficult times, as were my siblings Erik and

Marlene. I will gladly and constantly try to repay their generous and selfless help.

Verónica Santelices above all, together with Sandy Taut, Jorge Manzi, David Huepe, and

Natalia Salas, encouraged my interest in pursuing a doctorate. I am enormously grateful to them

for the simple but meaningful act of making me believe that I had what it takes to apply to a

program abroad and to finish my thesis successfully.

In Canada I received support from many people. My special thanks go to a group of friends with

whom I shared both good and bad times. The first of them is Elizabeth Rosales, who became my

closest friend and confidante. Fannie supported me when I needed it most. Bryant and Faye López

(along with Alma and Coco) taught me a great many lessons about tango and about life. With

Alejandra and Sam I shared fun and hard moments, and they supported me through the difficult

final stretch of my doctorate. My doctorate would not be the enriching experience it is without

this group of excellent people. Other friends who made the journey more enjoyable are Tugce,

Irene, Frances, Desmond, Claire, Junko, Anahit, Ilinca, Yecid, Angela, Mariana, and Felipe. I

sincerely thank all of them for all their support.

Table of Contents

Acknowledgments

Table of Contents

List of Tables

List of Figures

List of Equations

List of Appendices

Chapter 1

Introduction

1.1 Student Evaluation of Teaching

1.2 SET Summated Rating Scales

1.3 Issues in SET

1.4 Relevance and Rationale

1.5 Focus of Study

1.6 Summary of Structure

Chapter 2

Literature Review

2.1 The Target Construct in SET

2.1.1 Vague Definition of the Target Construct

2.1.2 Teaching Quality

2.1.3 SET and Teaching Quality

2.2 Validity of SET

2.2.1 Classical Test Theory

2.2.2 Definition of Validity

2.2.3 SET Validity Findings

2.3 Response Styles in SET

2.3.1 Approaches to Examine Response Styles

2.3.2 Manifest Variable Approach and Same Items

2.3.3 Types of Response Styles

2.3.4 Response Styles in SET

2.4 Summary and Limitations

2.4.1 Summary

2.4.2 Limitations in SET Validity Research

2.4.3 Focus of Study

Chapter 3

Methodology

3.1 Participants

3.2 Instrument

3.3 Administration

3.4 Data Analysis

3.4.1 Research Question 1

3.4.4 Software

Chapter 4

Results

4.1 Distribution of Responses

4.2 Research Question 1

4.3.1 Summary Statistics

4.3.2 ANOVA Results

4.4.1 Part 1: Differences Teacher’s Gender

4.4.2 Part 2: ARS Moderator Effect

4.4.3 Practical Significance

Chapter 5

Discussion

5.1 Summary and Implications

5.1.1 Implications

5.1.2 Recommendations

5.2 Alternative Interpretation of Findings

5.2.1 High Level of Teaching Quality

5.2.2 Construct Underrepresentation

5.2.3 Ceiling Effect

5.2.4 Online Survey Mode

5.2.5 Strong Satisficing

5.2.6 Evaluation Goals

5.3 Limitations and Future Research

5.3.1 Use of Manifest Variable Approach

5.3.2 Use of Observational Data

5.3.3 Use of a Quantitative Approach

5.3.4 Future Research

References

Appendices

Copyright Acknowledgements

List of Tables

Table 1

Table 2

Table 3

Table 4

Table 5

Table 6

Table 7

Table 8

Table 9

List of Figures

Figure 1

Figure 2

Figure 3

List of Equations

Equation 1

Equation 2

Equation 3

Equation 4

Equation 5

Chapter 1

Introduction

Chapter 1 introduces the present study, which examines response styles in the

context of the administration of a student evaluation of teaching (SET) summated rating scale

at a post-secondary education institution. Section 1.1 summarizes the most relevant aspects of

SET as a method for measuring teaching quality using the report of students. Section 1.2

explains the key attributes of a summated rating scale, the most popular mode of asking students

about teaching. Section 1.3 outlines important issues affecting the utilization of SET. Section 1.4

explains the relevance and rationale of the study. Section 1.5 states the focus of the study and

Section 1.6 outlines the structure and content of the remaining chapters.

1.1 Student Evaluation of Teaching

The close relationship between teaching and learning justifies the need for reliable, accurate, and

useful information about teaching in order to promote good teaching and subsequently enhance students’

learning (Joint Committee on Standards for Educational Evaluation, 2009).

The evaluation of teaching can inform relative strengths and weaknesses of individual teachers,

themes for the professional development of a group of teachers, or the social

recognition of outstanding teaching among members of an educational community. Undoubtedly,

the most popular way to retrieve information about teaching in post-secondary education

institutions is through the report of students (Berk, 2005; Johnson, 2000; Zabaleta, 2007),

formally referred to as Student Evaluation of Teaching (SET).

One explanation for the popularity of SET is that most institutions collect information

about teaching from students through standardized questionnaires, which are inexpensive and

straightforward to implement and whose results are easy to report (Penny, 2003; Spooren,

Brockx, & Mortelmans, 2013). Other methods of teaching evaluation such as observation

protocols and portfolios are methodologically more complex to develop and usually require

trained evaluators, increasing time and costs. Another proposed explanation for the popularity

of SET is the lack of alternative methods of teaching evaluation supported by validity evidence

(Marsh, 1997). However, the greater amount of validity evidence for SET could simply reflect its

popularity, which in turn stems from the convenience of implementing standardized questionnaires.

What SET intends to measure, or the target construct in SET, is often vaguely defined. There is a

great diversity of content in the literature, and multiple terms are used interchangeably and in a

non-univocal manner. The concept of teaching quality (Fenstermacher & Richardson, 2005) can

serve the purpose of standardizing the differences in content and terms existing in the literature.

Teaching quality encompasses two related yet qualitatively different aspects of teaching: good

teaching referring to the quality of the teaching task, and successful teaching referring to

teaching that produces learning. SET may relate to students’ report of good teaching, students’

report of successful teaching, or both.

SET currently informs multiple types of decisions. SET originally informed the improvement of

instruction, curriculum, and programs. Starting in the 1970s, SET began to

increasingly inform administrative and personnel decisions including retention, tenure, and

promotion of faculty. SET is also utilized for departmental and institutional accountability

following trends in post-secondary education administration. In practice, SET frequently serves

more than one of these purposes simultaneously (Aylett & Gregory, 1996; Spooren et al.,

2013).

1.2 SET Summated Rating Scales

There are multiple ways in which students can report information about teaching quality, for

instance, through individual interviews or focus groups. However, in a vast majority of cases,

SET is based on standardized questionnaires (Spooren et al., 2013), specifically summated rating

scales.

Summated rating scales are one of the most utilized tools in the social sciences and education for

the measurement of attitudes, opinions, personality, and emotional states among other constructs

(Spector, 1992). Summated rating scales are utilized to retrieve information about the past,

present or future, and about the respondent (self-report), about others (other-report), or about

external objects or events.

All summated rating scales, including the ones found in the context of SET, share four basic

characteristics (Spector, 1992):

1. A rating scale contains multiple items.

2. Items measure a property or attribute that varies quantitatively.

3. An item is a statement and participants are asked to choose the response option that

best reflects their response to the statement.

4. Items have no right answer.

The central idea underlying the use of a summated rating scale is that the sum of responses to

individual items (the total score) reflects the level of the target construct. In the case of a SET

summated rating scale, the total score reflects the magnitude of teaching quality.
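
To make the idea of a total score concrete, the following minimal Python sketch sums one student's responses to a hypothetical four-item scale. The function names, the four example responses, and the 1-to-5 response scale are illustrative assumptions and are not taken from the instrument analyzed in this study.

```python
def summated_score(responses, n_options=5):
    """Return the total score: the sum of one student's item responses.

    Assumes every item is rated on the same 1..n_options scale and that
    higher options indicate higher teaching quality (no reverse-keyed items).
    """
    for r in responses:
        if not 1 <= r <= n_options:
            raise ValueError(f"response {r} is outside 1..{n_options}")
    return sum(responses)


def mean_item_score(responses, n_options=5):
    """Return the total score divided by the number of items."""
    return summated_score(responses, n_options) / len(responses)


# Hypothetical responses of one student to four items.
example = [4, 5, 3, 4]
print(summated_score(example))   # 16
print(mean_item_score(example))  # 4.0
```

Dividing the total by the number of items keeps the result on the metric of the response scale, which can make scores from scales of different lengths easier to compare.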

Because SET scores are the sum of responses to a group of content-related items (rather than responses

to an individual item), their level of measurement is interval (Brown, 2011; Carifio & Perla, 2007).

Properties of interval measures are magnitude and equal intervals (Kaplan & Saccuzzo, 2008).

Magnitude indicates the amount of teaching quality. Possible uses of this property are the

comparison of relative strengths and weaknesses among different teaching attributes, or the

identification of teachers with higher, lower or equal level of teaching quality.

The property of equal intervals means that the difference between any two adjacent points on the response

scale is the same (Kaplan & Saccuzzo, 2008). Equal intervals allow arithmetic operations on

scores and the application of descriptive and inferential statistics such as correlational analyses

and analysis of variance (Brown, 2011). An example is the calculation of differences in SET

scores between female and male teachers. Such differences are meaningful only when the level

of measurement is at least interval.

The interpretation of a SET score as a measure of teaching quality is a function of the items

(Messick, 1995). For instance, a teacher can obtain a SET score of 3.0 from the sum of items

based on a scale ranging from 1 (low level of teaching quality) to 5 (high level of teaching

quality). Aspects such as the item wording, item position, and the number, order, and labels in

the response scale can affect scores (Smyth, Dillman, & Christian, 2009; Tourangeau, Rips, &

Rasinski, 2000). Consequently, SET scores greatly depend on the way students interpret items

and utilize the response scale.

The interpretation of SET scores as a measure of teaching quality depends not only on items.

Other aspects that can influence the interpretation of SET scores are students and the context of

the measurement (Messick, 1995). For instance, a group of first-year engineering students with

little exposure to teaching in post-secondary education would have a very different

conceptualization of teaching quality than a group of education graduate students. An instrument

with a high proportion of items targeting general attributes of teaching quality may not

differentiate between novice and expert teachers whereas an instrument including specific and

more complex aspects of teaching quality can effectively distinguish between these two groups.

Scores obtained at a department in which teacher evaluation is an important priority to support

continuous teaching improvement may have a different meaning for students, teachers and other

stakeholders than scores obtained at a department in which SET informs personnel decisions and

accountability.

1.3 Issues in SET

The definition and measurement of teaching quality are two challenging tasks as documented by

decades of educational research (Berliner, 2005; Fenstermacher & Richardson, 2005). For

instance, Popham (1992) placed the search for valid teaching evaluation alongside two other

perennial quests of humanity: the Holy Grail and the Fountain of Youth. The evaluation of

teaching based on students’ report in post-secondary education institutions is not an exception to

the challenges of properly defining and measuring teaching quality.

A first important issue affecting the interpretation and use of SET scores relates to the content.

The definition of the target construct in SET often relies on a weak theory about teaching quality

and the SET literature offers little evidence that the content of SET summated rating scales is

appropriate (Penny, 2003).

A second relevant issue affecting SET relates to the same cause explaining its popularity: the

utilization of summated rating scales. Penny (2003) correctly pointed out that SET scores are no

more valid than the method utilized to retrieve the information. A common assumption

among developers and users of summated rating scales is that the total score accurately reflects the

target construct, which implies that no extraneous influences affect participants’ responses

(Cronbach, 1946; Wetzel, Böhnke, & Brown, 2016). The measurement literature describes

various ways in which this assumption can be violated (Spector, 1991; Viswanathan, 2005).

Therefore, rather than assuming validity, developers and users need to provide evidence that SET scores

are valid. Validity refers to an overall judgment of the extent to which theory and evidence

support the intended interpretation and use of scores (American Educational Research

Association, American Psychological Association, & National Council on Measurement in

Education, 2014). Specifically, evidence should show that 1) no aspects of the target

construct definition are excluded from the instrument content (construct underrepresentation) and 2)

responses are not severely influenced by processes extraneous or irrelevant to the intended

interpretation and use of scores (construct-irrelevant variance).

A third important issue relates to the type of evidence collected to support the validity of SET

scores. Despite the vast amount of relevant literature, limited attention has been paid to a

fundamental aspect of the use of summated rating scales: the response process (Penny, 2003).

Instead, a considerable amount of research reports discriminant evidence, that is, the correlation

between SET scores and irrelevant variables such as the gender of the teacher. It seems

more reasonable to identify sources of construct-irrelevant variance by carefully examining

and reporting aspects of the response process itself before advancing into the examination of the

relationship between SET scores and other variables.

1.4 Relevance and Rationale

Measurement concepts such as validity, reliability, comparability and fairness “are not just

measurement principles; they are social values” (Messick, 1995, p. 5; italics in the original).

Scores from measurement tools need a systematic examination to warrant adherence to these

social values.

The most recent revision of the Standards for Educational and Psychological Testing (AERA, APA, &

NCME, 2014) argues that a proper interpretation and use of scores can result in wiser and more

equitable decisions about individuals and programs, whereas improper use of scores might lead

to an adverse impact on test-takers and other stakeholders.

The widespread use of SET as a method of teaching evaluation in post-secondary education

institutions and the use of SET scores for informing formative decisions justify the need for

evidence supporting the validity of those scores (Joint Committee on Standards for Educational

Evaluation, 2009). The need for sound validity evidence increases when SET scores inform

high-stakes decisions such as hiring and faculty promotion.

When students’ responses to a SET summated rating scale are affected by construct

underrepresentation or construct-irrelevant variance, there is lower support for the interpretation

of scores as a measure of teaching quality. These two problems also affect the use of scores for

formative and summative decisions. The systematic examination of these two types of problems

allows their subsequent control and minimization, strengthening subsequent decisions based on

scores.

Finally, SET is not an objective measure of teaching quality. Instead, scores come from

students’ responses to a summated rating scale, a tool prone to numerous sources of

construct-irrelevant variance, which increases the need for careful examination of SET. The focus

of the study relates to a specific source of construct-irrelevant variance affecting scores obtained from

summated rating scales.

1.5 Focus of Study

In the measurement literature, a well-known source of construct-irrelevant variance that affects

scores from summated rating scales is the participant’s systematic tendency to use the response

scale in a stereotyped or aberrant manner, referred to as response styles (Cronbach, 1946;

Paulhus, 1991; Van Vaerenbergh & Thomas, 2013; Viswanathan, 2005). Response styles are

ways in which the respondent utilizes the response scale in a manner inconsistent with the

intended interpretation and use of scores. As a result, the total score would confound

the target construct with the response style. This confounding would affect not only the interpretation

of total scores as a measure of the target construct but also psychometric properties of the scale

and subsequent statistical analysis of scores (Viswanathan, 2005).
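
As an illustration of how such tendencies can be quantified at the student level, the sketch below computes simple proportion-based indexes for the response styles named in the Abstract (acquiescence, disacquiescence, extreme, and midpoint). It assumes a 5-point agreement scale in which options above the midpoint express agreement; the function name and the example responses are hypothetical, and these proportions are not necessarily the indexes used in this study.

```python
def response_style_indexes(responses, n_options=5):
    """Return proportion-based response style indexes for one student.

    Assumes an odd number of response options (1..n_options), so that a
    single midpoint option exists, and that higher options mean agreement.
    """
    n = len(responses)
    midpoint = (n_options + 1) / 2
    return {
        # Share of items answered on the "agree" side of the scale.
        "acquiescence": sum(r > midpoint for r in responses) / n,
        # Share of items answered on the "disagree" side of the scale.
        "disacquiescence": sum(r < midpoint for r in responses) / n,
        # Share of items answered with the lowest or highest option.
        "extreme": sum(r in (1, n_options) for r in responses) / n,
        # Share of items answered with the middle option.
        "midpoint": sum(r == midpoint for r in responses) / n,
    }


# Hypothetical responses of one student to eight items.
indexes = response_style_indexes([5, 5, 4, 5, 3, 5, 4, 5])
print(indexes)
```

For the hypothetical responses above, the indexes are 0.875 (acquiescence), 0.0 (disacquiescence), 0.625 (extreme), and 0.125 (midpoint). A high index alone does not separate a response style from genuinely high teaching quality, which is why the interpretation of such indexes requires additional evidence.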

The present study examines the degree to which scores obtained from the administration of a

SET summated rating scale at a large teacher education institution in Southern Ontario are

affected by response styles.

Evidence of response styles would jeopardize the interpretation of SET scores as a measure of

teaching quality for low-stakes decisions (for instance, informing teaching improvement) and

high-stakes decisions (for example, personnel decisions). Conversely, evidence ruling out response

styles would support (along with other types of evidence) the interpretation and use of SET as a

measure of teaching quality for formative and summative purposes.

Research examining how response styles affect SET scores addresses one of the issues in the

literature mentioned earlier: the lack of validity evidence based on response process. As an

example, in a systematic literature review including studies on SET validity since 2000, there is

almost no mention of the problem of response styles¹ (Spooren et al., 2013). Similarly, a review

of the literature covering articles since the 1970s mentions only one specific type of response

style, halo, as a potential source of construct-irrelevant variance affecting SET scores

(Gravestock & Gregor-Greenleaf, 2008). Therefore, systematically examining the plausibility of

response styles in SET appears to be a significant contribution to the literature.

The validity of scores is also a function of persons and the context of measurement (Messick,

1995a). The study also examines differences in the degree to which SET scores are affected by

response styles across three measurement conditions: the academic department, the type of

graduate program, and the session.

Lastly, considering that SET scores inform summative decisions, and that response styles affect

the subsequent statistical analysis of total scores, the study examines whether response styles

moderate the observed difference in SET scores between female and male teachers.

The three research questions that guide the study are:

1. To what extent are SET scores affected by response styles?

2. What are the differences in the degree to which SET scores are affected by response

styles across measurement conditions?

1 Response styles or other related terms such as response bias, response set, rater bias, rater effects, and rater error.

3. Is there a difference in SET scores between female and male teachers, and to what

extent do response styles moderate such difference?

1.6 Summary of Structure

The study comprises five chapters: 1) introduction, 2) literature review,

3) methodology, 4) results, and 5) discussion.

Chapter 2 presents a review of literature that sustains the examination of response styles in the

context of SET. The literature review covers four subjects: 1) a definition of the target construct

in SET, 2) validity of SET, 3) response styles and evidence of response styles in the context of

SET, and 4) summary and limitations in SET literature.

Chapter 3 describes the study’s methodology including the population of students, characteristics

of the SET summated rating scale, the procedure of administration, intended interpretation and

use of SET scores, and data analysis strategy followed to produce evidence for each of the three

research questions presented above.

The last two chapters present the results of the statistical analysis of SET data (Chapter 4) and a

discussion of these results (Chapter 5). Chapter 5 discusses implications of the findings for SET

developers and users, possible alternative interpretations of the reported findings, limitations of

the study and recommendations for future research.

Chapter 2

Literature Review

Chapter 2 presents a review of relevant literature in support of the examination of response styles

in the context of SET. The literature review covers the following four subjects: the definition of

the target construct in SET (Section 2.1), the validity of SET (Section 2.2), evidence of response

styles in SET scores (Section 2.3), and limitations in SET literature (Section 2.4).

Section 2.1 discusses the general question of what SET intends to measure. The literature review

suggests that the definition of the target construct in SET is a problematic issue. The study

proposes and defines teaching quality as the target construct in SET.

Section 2.2 discusses the validity of SET in three logically connected parts. The first part

explains the theory underlying the use of summated rating scales and provides context for

understanding the concept of validity of scores. The second part presents an overview and a

current definition of the concept of validity along with a description of types of validity

evidence. The third part summarizes findings from multiple types of SET validity evidence

reported in the literature.

Section 2.3 defines response styles, describes types of response styles documented in the

measurement literature along with the procedures used to examine them, and summarizes findings from the

few studies reporting response styles in the context of SET.

Lastly, Section 2.4 offers a summary of the key points addressed in the literature review,

identifies limitations that sustain the examination of response styles in the context of SET, and

outlines the research questions of the study.

2.1 The Target Construct in SET

A sound theory sustaining the development of measurement tools is of the utmost importance as

recognized by The Standards for Educational and Psychological Testing (AERA et al., 2014) and

the Personnel Evaluation Standards (Joint Committee on Standards for Educational Evaluation,

2009). Some essential functions of a theory are the definition of the target construct and its

attributes, operationalization, and explanation of the relationship with related and not related

variables, for instance, whether subgroups by gender or race should differ in their levels of the

target construct.

SET literature also recognizes the crucial importance of a sound theory underlying the

measurement of teaching quality. The lack of theory supporting SET is precisely the first and

most important issue that negatively affects the definition of the target construct and the

interpretation and use of this tool of teaching evaluation (Marsh, 1987; Ory & Ryan, 2001;

Penny, 2003; Spooren et al., 2013).

The literature reports that, instead of being grounded in a theory of teaching quality, SET instruments

are often home-made or ad-hoc questionnaires constructed by adapting items from pre-existing

instruments (Marsh, 1987; Ory & Ryan, 2001; Penny, 2003). The weak or absent theory

supporting the development of SET is reflected in low consistency in the number and nature

of attributes of teaching quality across instruments (Spooren et al., 2013). For instance, in their

analysis of eleven instruments published in the literature since 2000, Spooren et al. (2013) reported

that the number of attributes of teaching quality varies between two and twelve, and the most

common content is the overall attribute of quality of instruction. As examples, other attributes of

teaching quality included in SET instruments are helpfulness of the teacher, teacher’s

enthusiasm, the level of care and support offered by the teacher, organization of the course,

clarity of course objectives, and quality of assessments.

2.1.1 Vague Definition of the Target Construct

A fundamental function of a theory in the context of the development of a measurement tool is

the definition of the target construct, and a logical consequence of the lack of theory sustaining

SET is a vague definition of the target construct. In this regard, there is no univocal

understanding in the literature on what SET is supposed to measure.

There are two expressions of the vague definition of the target construct in SET: 1) ambiguity in

the term to refer to the intended target construct; 2) interpreting SET scores as a simple measure

of students’ satisfaction.

SET literature is plagued with various terms to refer to the intended target construct. Examples of

these terms are teacher quality, good teaching, teacher efficacy, teaching performance, and

teacher effectiveness. For instance, in a review of research on SET validity, Marsh (1997) refers

to teacher effectiveness as the target construct in SET suggesting that scores should support the

improvement of teaching quality. In their overview of findings on SET validity, Spooren et al.

(2013) refer interchangeably to teaching quality, effective teaching, good teaching, and teaching

effectiveness. Penny (2003), while analyzing limitations of research on SET validity, alternates

between the concepts teaching quality and teaching effectiveness. None of the previous authors

offer a precise definition of teaching quality or any of the other terms.

A second expression of the vague definition of the target construct is the interpretation of SET

scores as a measure of students’ satisfaction with the teacher. In this context, students are

considered customers (Spooren et al., 2013), and satisfaction refers to the level of happiness of

students about teaching (Penny, 2003).

There are three situations in which students’ satisfaction with the teacher is utilized to interpret

SET scores. The first situation occurs when SET scores are interpreted simultaneously as a

measure of teaching quality and a measure of students’ satisfaction, making the two terms

equivalent (for instance, MacNell, 2015; Boring, 2015). A second situation occurs when

administrators interpret SET scores for institutional accountability or from a managerial

perspective. The interpretation of SET scores simply changes from teaching quality to students’

satisfaction with the course or teacher (Kuwaiti & Subbarayalu, 2015; Spooren et al., 2013;

Valsan & Sproule, 2008; Boring, 2015). A third situation occurs when SET scores are interpreted

as a measure of students’ satisfaction as a hypothesis to explain an observed relationship between

SET scores and an irrelevant variable (Zabaleta, 2007; Penny, 2003; Boring, Ottoboni, & Stark,

2016). In all the previous examples, the use of students’ satisfaction as target construct in SET is

not grounded in theory or empirical evidence and reflects an arbitrary interpretation of SET

scores.

In summary, there is often a lack of a theory of teaching quality underlying the development of

SET reflected in home-made or ad-hoc instruments, substantial differences in content, and a

vague definition of the target construct that refers interchangeably to teaching quality, teaching

performance, effectiveness, or students’ satisfaction.

2.1.2 Teaching Quality

The vague definition of the target construct underlying the development of SET is not

surprising. According to Berliner (2005), defining teaching quality is difficult, the concept of quality

is often ineffable, and quality involves a judgment that always depends on the specific context.

The study follows the distinction between good teaching and successful teaching (Berliner,

2005; Fenstermacher & Richardson, 2005; Ingvarson & Rowe, 2008), two separate but related

components of teaching quality. The difference between the concepts of good and successful

teaching is the criteria used for judging one and another. Whereas good teaching refers to aspects

of the instruction itself (the task of teaching), successful teaching is teaching that produces

learning.

2.1.2.1 Good Teaching

Good teaching is teaching “that comports with morally defensible standards and rationally sound

principles of instructional practice” (Fenstermacher & Richardson, 2000, p. 7). Good teaching

occurs when “the content taught accords with disciplinary standards of adequacy and

completeness, and that the methods employed are age-appropriate, morally defensible, and

undertaken with the intention of enhancing the learner’s competence with respect to the content

studied” (p. 9). Under this definition, highly qualified teachers “provide evidence that certain

qualities of teaching are frequently present in the everyday experiences of their students”

(Berliner, 2005, p. 207).

Two fundamental features of good teaching are: 1) good teaching is normative, and; 2) good

teaching is contextual. Good teaching refers to “what is expected of people in a position”

(Berliner, 2005, p. 207). The norm derives from the unique setting in which teaching and

learning happen.

An example of the normative and contextual character of good teaching is considering the

differences in the understanding of teaching and learning across three dominant approaches in

Education. Good teaching could refer to teaching centered on the transmission of content

(positivist perspective), teaching as pedagogical content knowledge expertise and the

transformation of students’ cognitive ability (cognitive perspective), or teaching as facilitation of

students’ deep understanding based on personally relevant experiences (constructivist

perspective) (Fenstermacher & Richardson, 2000).

Good teaching involves at least three components referred to as acts of teaching (Fenstermacher

& Richardson, 2005):

1. Logical acts such as defining, demonstrating, modeling, explaining, and correcting.

2. Psychological acts such as caring, motivating, encouraging, rewarding, punishing,

planning, and evaluating, and;

3. Moral acts, such as showing honesty, courage, tolerance, compassion, respect, and

fairness.

Good teaching is a necessary but not sufficient condition for learning. Evidence of good teaching

does not imply that students will learn. Learning is a complex process and is also affected by 1)

willingness and effort by the student, 2) a social environment supportive of teaching and

learning, and 3) opportunity to teach and learn (Fenstermacher & Richardson, 2005; Ingvarson &

Rowe, 2008). When these other conditions of learning are satisfied, then good teaching can turn

into successful teaching.

2.1.2.2 Successful Teaching

Successful teaching refers to “teaching that yields the intended learning” (Fenstermacher &

Richardson, 2005, p. 6). Successful teaching relates to students’ achievement, and more precisely

to changes in achievement between two moments: time 1 when the student lacks a certain

content, and time 2 when the student acquires the content. Successful teaching implies that the

teacher possesses the content, intends to impart the content, and engages the student in a

relationship that allows the student to acquire the content (Fenstermacher & Richardson, 2005).

Successful teaching means that the student acquires the content “to some reasonable and

acceptable level of proficiency” (p.9).

Determining successful teaching involves numerous logical and methodological challenges

(Berliner, 2005). A first issue is the gathering of evidence of student’s achievement at time 1 and

time 2 to determine the degree of learning. The second problem is to link teaching and learning

causally. Specifically, isolating teaching effect from other factors affecting learning (such as

willingness and effort by the student, or the social environment) is technically complex.

Statistical methods that attempt to isolate teacher’s contribution to students learning are “filled

with psychometric problems” (Berliner, 2005).

2.1.3 SET and Teaching Quality

The previous definition of teaching quality has implications for SET design, scores

interpretation, and use. SET summated rating scales can ask students about the two

components of teaching quality: good and successful teaching.

Pertaining to the measurement of good teaching, students can report the logical, psychological and

moral acts of teaching because they have multiple opportunities to observe teaching over the

length of a course and provide a justified report of the extent to which those acts are present in

their academic experiences.

SET items measuring good teaching can take the form of a report of others, report of external

objects or events, or self-report. Some examples are: “the teacher presented the content in a

challenging manner” (other-report), “the content of the course was challenging” (report of an

object or event), or “I felt challenged by the course content” (self-report).

Pertaining to successful teaching, a student can report the amount of learning and the degree to

which teaching contributed to his/her learning. SET items measuring successful teaching can

take the form of a self-report (e.g. “I learned a great deal in this course”), report of an object

(“the course contributed a great deal to my understanding of the material”), or other-report (“The

teacher was effective in enhancing my understanding of the course material”).

The measurement of successful teaching in the context of SET faces a significant challenge: the

poor accuracy of the self-report of learning. Two sources informing about this issue

are training evaluation and self-assessment literature. The self-report of the amount of learning is

considered a reaction, which refers to an attitude towards the training (Kirkpatrick, 1977, 1977,

1979). A reaction differs from the measurement of learning, in which specific knowledge, skills

or attitudes pertaining to training goals are assessed using objective measures. The empirical evidence

consistently indicates that reactions are not predictive of actual learning and training impact

(Alliger, Tannenbaum, Bennett Jr., Traver, & Shotland, 1997; Salas & Cannon-Bowers, 2001).

Similarly, research on individual self-assessment indicates that people often overestimate their

level of knowledge and skills (Dunning & Helzer, 2014; Zell & Krizan, 2014), and such

evidence includes the self-assessment of students in postsecondary education institutions (Boud

& Falchikov, 1989; Ward, Gruppen, & Regehr, 2002). For instance, Bowman (2010) reported

that the correlation coefficients between college students’ self-report of learning and objective

measures of learning of the same constructs were virtually zero. Although self-assessment is a

valuable tool in the context of the development of metacognitive skills and self-regulated

learning (Pintrich, 2002), the high inaccuracy of self-assessment scores makes its utilization in

the context of teacher evaluation problematic.

2.2 Validity of SET

The extended use of SET as a measure of teaching quality in postsecondary education

institutions along with the use of SET scores for informing multiple types of decisions help

explain the vast number of studies on SET, specifically studies examining its validity.

Validity in the context of SET directly relates to the theory underlying the use of summated

rating scales, classical test theory (CTT). CTT allows the identification of the multiple elements

affecting the interpretation of scores obtained from summated rating scales.

2.2.1 Classical Test Theory

CTT provided the first formal foundation for the measurement of psychological and educational

constructs. The conception of CTT relates to three advancements that occurred at the beginning of

the 20th century (Traub, 1997): 1) the realization that all measurement contains some degree of

error; 2) the conception of error of measurement as a random variable; 3) the concept of

correlation and its method of calculation.

A fundamental proposition in CTT is that the two components of an individual’s observed score

are her/his true level of the target construct (true score) and measurement error. Equation 1

expresses the previous proposition (Kline, 2005; Spector, 1992):

𝑂 = 𝑇 + 𝐸

Equation 1

In Equation 1, 𝑂 represents an individual’s observed score, 𝑇 represents his/her true score, and 𝐸

represents measurement error.

Three other important propositions in CTT are (Traub, 1997):

1. Measurement error (𝐸) is a random latent variable.

2. Measurement error has zero covariance with true score latent variable (𝐸 and 𝑇 are

independent).

3. Measurement error is independent of the error component of other measures.

The error component 𝐸 is typically referred to as random error (Kline, 2005; Spector, 1991) and

indicates the degree to which scores are not consistent across repetitions of the measurement, for

instance across the multiple items in a summated rating scale. Non-systematic factors affecting

the measurement introduce random error. Examples of non-systematic factors are brief

fluctuations in mood and motivation, language difficulties, ambiguous items, uncontrolled

administration conditions (e.g. noise), distraction, memory/attention vacillations,

mechanical/motor vacillations, illness, fatigue, emotional strain, chance and non-contingent

responding2 (Viswanathan, 2005). The random error component of a measurement affects the

reliability of scores. Utilizing multiple items, one of the four key characteristics of a summated

2 Non-contingent responding, careless responding (Wetzel et al., 2016), inconsistent responding, and random

responding, is sometimes defined as a type of response style (McGrath et al., 2010; Viswanathan, 2005) because it

refers to responding independently of content. However, non-contingent responding does not introduce systematic

error as other types of response styles (defined later in the document) because this type of measurement error occurs

when respondents vary their responses in an unsystematic manner (McGrath et al., 2010), hence introducing

random error.

rating scale, is one strategy to minimize random error and increase reliability (Spector, 1992;

Viswanathan, 2005; AERA et al., 2014).
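
The effect of using multiple items on random error can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not part of the study's analysis: item responses are treated as continuous values generated from O = T + E with normally distributed error, and the true score (4.0), error standard deviation (1.0), number of items, and the function name simulate_scale_scores are hypothetical choices made only for illustration.

```python
import random
import statistics


def simulate_scale_scores(true_score, n_items, error_sd, n_replications, seed=0):
    """Return scale scores (item means) over repeated administrations, with O = T + E per item."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_replications):
        # Each item response is the true score plus independent random error.
        items = [true_score + rng.gauss(0.0, error_sd) for _ in range(n_items)]
        scores.append(statistics.fmean(items))
    return scores


# Hypothetical values: true score of 4.0 and an error SD of 1.0 per item.
single_item = simulate_scale_scores(4.0, n_items=1, error_sd=1.0, n_replications=2000)
ten_items = simulate_scale_scores(4.0, n_items=10, error_sd=1.0, n_replications=2000)

sd_single = statistics.stdev(single_item)
sd_ten = statistics.stdev(ten_items)
print(f"SD of scores with 1 item:   {sd_single:.2f}")
print(f"SD of scores with 10 items: {sd_ten:.2f}")
```

Under these assumptions, the spread of the ten-item score across replications is close to the single-item error SD divided by the square root of 10, while the average score stays near the true score.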

2.2.1.1 Construct-Irrelevant Variance

A noteworthy expansion of Equation 1 in the context of scores obtained from summated rating

scales is the following (Spector, 1992):

𝑂 = 𝑇 + 𝐸 + 𝐵

Equation 2

Equation 2 presents three (instead of two) components of an individual’s observed score: her/his

true score (𝑇), random error (𝐸) and a new source of measurement error (𝐵) reflecting construct-

irrelevant variance (AERA et al., 2014). Literature often utilizes the terms bias3 (Spector, 1992)

and systematic measurement error (Viswanathan, 2005). For consistency with the Standards for

Educational and Psychological Testing, the preferred term in the study is construct-irrelevant

variance.

Construct-irrelevant variance is systematic variation in observed scores that does not reflect true

score and is caused by processes extraneous (or irrelevant) to the intended interpretation and use

of scores (AERA et al., 2014).

Examples of sources of construct-irrelevant variance in summated rating scales are leading

questions and the use of unbalanced response categories. A common response style such as the

respondents’ tendency to agree or disagree can also introduce construct-irrelevant variance

(Viswanathan, 2005; James, Demaree, & Wolf, 1984).

Construct-irrelevant variance is not randomly distributed, the mean differs from zero, and its

effect cannot be erased utilizing multiple items. Instead, developers of summated rating scales

need to control and minimize potential sources of construct-irrelevant variance (Spector, 1992).
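
The claim that multiple items cannot erase construct-irrelevant variance can be illustrated with a short derivation. The sketch below assumes, for simplicity, k items whose random errors E_j are independent with mean zero and equal variance, and a bias B that is constant across items (for example, acquiescence on items worded in the same direction). These assumptions, and the symbols k, j, \bar{O}, and \sigma_E^2, are introduced here for illustration and are not part of Equation 2.

```latex
\begin{align*}
O_j &= T + E_j + B, \qquad j = 1, \dots, k \\
\bar{O} &= \frac{1}{k}\sum_{j=1}^{k} O_j = T + B + \frac{1}{k}\sum_{j=1}^{k} E_j \\
\mathrm{E}\left[\bar{O} \mid T, B\right] &= T + B, \qquad
\mathrm{Var}\left(\bar{O} \mid T, B\right) = \frac{\sigma_E^2}{k}
\end{align*}
```

Under these assumptions, adding items shrinks the random component toward zero, but the mean score converges to T + B rather than to T.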

3 A second meaning of bias not related to the purpose of the study is “construct underrepresentation or construct

irrelevance components of a test that affect the performance of different groups” (AERA, APA, NCME, 2014, p.

686) such as groups based on gender or race. Bias from this perspective is the focus of the examination of test

fairness (i.e. item and test bias) and not part of the focus of this specific study.

2.2.1.2 Additive and Correlational Error

A common and reasonable use of SET is summarizing responses from a group of students (for

instance, students taught by the same teacher) using the mean or another method of data

aggregation. Construct-irrelevant variance can affect scores in two ways when scores are

aggregated across individuals: as additive error and as correlational error (Viswanathan, 2005).

Additive error increases or reduces observed scores by a constant magnitude similarly across all

individuals. An example of additive error is the measurement of height in a

group of persons with their shoes on. The observed height will be higher than the actual height

(measured barefoot) by a constant (the height of the shoe). In this example, the observed height

across individuals contains a constant deviation from the actual height in a positive direction.

In summated rating scales, examples of sources of additive error are leading questions,

interviewer bias, and unbalanced response categories (Viswanathan, 2005). All these sources

would affect all respondents in the same manner.

A consequence of additive error is the underestimation or overestimation of the actual level of

the target construct (true score). SET scores influenced by additive error would be lower or

higher than the actual level of teaching quality.

Additive error can also affect the correlation coefficient between observed scores and other

variables due to a reduction in scores variance. A reduction in scores variance may occur when

additive error lowers or increases scores towards one of the ends of the response scale

(Viswanathan, 2005).
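
The two consequences of additive error described above can be illustrated with a toy example. The following is a minimal sketch with hypothetical data, not the study's data: five respondents with true scores on a 1 to 5 scale, an arbitrary external variable, and an assumed constant shift of 2 points. The helper pearson_r is written out from the definition of the correlation coefficient.

```python
import statistics


def pearson_r(x, y):
    """Pearson correlation computed from its definition."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / (ssx * ssy) ** 0.5


# Hypothetical true scores on a 1-5 scale and an external variable.
true_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
other_var = [2.0, 1.0, 4.0, 3.0, 5.0]

additive_error = 2.0  # assumed constant shift applied to every respondent
shifted = [t + additive_error for t in true_scores]
# Observed scores cannot exceed the top of the response scale (ceiling at 5).
observed = [min(s, 5.0) for s in shifted]

print("mean:", statistics.fmean(true_scores), "->", statistics.fmean(observed))
print("variance:", statistics.pvariance(true_scores), "->", statistics.pvariance(observed))
print("r without ceiling:", pearson_r(shifted, other_var))
print("r with ceiling:", pearson_r(observed, other_var))
```

In this sketch the constant shift alone raises the mean but leaves the correlation with the external variable unchanged. The correlation drops only once the shifted scores pile up at the top of the response scale and lose variance.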

The second type of construct-irrelevant variance is correlational error, caused by a non-constant

source of systematic error. An example is measuring the weight of one group of

individuals (for example, women) after breakfast and another group (for example, men) before

breakfast. In this case, a comparison of weight between women and men would indicate a smaller

difference in weight than the difference obtained if all individuals stepped on the scale

before breakfast.

In summated rating scales, correlational error is produced by “different individuals responding in

consistently different ways over and above true differences in the construct” (Viswanathan,

2005, p. 15). Correlational error affects the relationship between observed scores and other

variables because “consistent differences across individuals over and above the construct being

measured may be positively correlated, negatively correlated, or not correlated with the

construct” (Viswanathan, 2005, p. 16). In the case of summated rating scales, an example of

correlational error may occur when several constructs are measured using the same method

(common method factor), leading to inflated relationships between items (Viswanathan, 2005).
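
The common method example can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the study's analysis: two constructs with uncorrelated true scores are measured with the same method, and each respondent carries a person-specific response tendency (the variable style) that enters both observed scores. The sample size, variances, and seed are hypothetical values chosen for illustration.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation computed from its definition."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / (ssx * ssy) ** 0.5


rng = random.Random(42)
n = 5000

# True scores on two constructs, generated independently of each other.
true_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
true_b = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Person-specific response tendency shared by both measures (assumed SD of 1.0).
style = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Observed score = true score + shared response tendency + random error.
obs_a = [t + s + rng.gauss(0.0, 0.5) for t, s in zip(true_a, style)]
obs_b = [t + s + rng.gauss(0.0, 0.5) for t, s in zip(true_b, style)]

r_true = pearson_r(true_a, true_b)
r_obs = pearson_r(obs_a, obs_b)
print(f"correlation between true scores:     {r_true:.2f}")
print(f"correlation between observed scores: {r_obs:.2f}")
```

With these assumed variances, the expected correlation between observed scores is Var(style) / (Var(true) + Var(style) + Var(error)) = 1 / 2.25, or roughly 0.44, even though the true scores are unrelated.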

In summary, CTT defines two components of observed scores: true score and measurement

error. Two sources of measurement error are random error and construct-irrelevant variance.

Random measurement error affects the reliability of scores and is often minimized using multiple

items. Construct-irrelevant variance is systematic variation in observed scores not related to true

score. Two types of construct-irrelevant variance are additive and correlational error. Additive

error produces underestimation or overestimation of true score, and correlational error produces

changes in the coefficients of correlation with other variables.

The propositions underlying CTT should encourage developers and users of measurement tools

in educational contexts to provide evidence supporting that sources of construct-irrelevant

variance do not excessively influence observed scores. The concept of validity offers a

framework for this task, the systematic evaluation of the intended interpretation and use of scores

(AERA et al., 2014).

2.2.2 Definition of Validity

2.2.2.1 Overview

The concept of validity has evolved, and a summary of previous conceptualizations would

contribute to the understanding of current definitions and limitations in SET validity literature.

Following Kane (2001), three models of validity precede modern conceptualizations: criterion

validation, content validation, and construct validation.

The criterion validation model defines validity simply as the level of accuracy of a test, in which

scores are expected to estimate or predict a criterion. According to Kane, the criterion model was

popular between 1950 and 1970 for the validation of selection and placement decisions, in which a

common standard was the candidate’s actual level of performance in a task. In the context of the

measurement of educational and psychological constructs, the identification of a suitable

criterion is challenging, turning the validity of the criterion itself into a problem (Kane, 2001).

Research in SET shares the difficulty of identifying a valid criterion of teaching quality (Marsh

& Roche, 1997).

A proposed solution to the lack of a valid criterion was the “review of the test content by subject-

matter experts” (Kane, 2001, p. 320), which would provide evidence of content relevance and

representativeness of the measure, referred to as content validity. According to Kane, validation

of educational achievement tests between 1950 and 1970 typically relied on the content validation

model. Two limitations of content-validation are that the experts’ judgment often shows a strong

confirmatory bias and a review of the content does not provide direct evidence of the validity of

the inferences made from scores (Kane, 2001).

Construct validation originally served the validation of the theory predicting the relationships

among constructs used in clinical assessment and worked as a complement to the criterion and

content validation models (Kane, 2001). The construct validation model was proposed and

utilized for the validation of psychological constructs grounded in strong theory. The validity of

the intended interpretation of scores is evaluated regarding “how well the observed scores satisfy

the theory” (p. 321). For instance, if the observations are consistent with the relationships among

constructs predicted by the theory, the theory underlying the measurement and the measurement

itself are both valid (Kane, 2001).

The construct validation model impacted the conceptualizations of validity in three ways (Kane,

2001). First, this model recognizes the importance of theory for defining and measuring

constructs. Second, the model recognizes the need to clearly state the intended interpretation of

scores before evaluating the validity of scores. Third, the model introduces the concept of

challenging proposed score interpretations and “the importance of considering possible alternate

interpretations” (p. 324).

2.2.2.2 Unified Concept of Validity

In the context of multiple validity models co-existing by the end of the 1970s, researchers were

“highly opportunistic in the choice of validity evidence” (Kane, 2001, p. 323). In response to

this situation, current conceptualizations of validity use the construct validity model as an

umbrella to integrate criterion and content validity, not as different types of validity but as

different kinds of evidence of validity (Kane, 2001).

As a unified concept, validity refers to an “overall evaluative judgment” (Messick, 1995a, p. 5)

on “the degree to which evidence and theory support the interpretations of test scores4 for

proposed uses of tests” (AERA et al., 2014, p. 59).

An important aspect of the concept of validity is that both evidence and theory need to relate to a

specific interpretation and use of scores (AERA et al., 2014). For instance, when validity

evidence only supports a formative use of SET scores (e.g. improvement of teaching), new

pertinent evidence should be provided in support of the use of SET scores for other purposes

such as personnel decisions or tenure. If scores inform multiple uses, then evidence needs to

support each of these multiple uses.

Another important aspect of the concept of validity is its evolving nature. The interpretation of

scores depends on items, persons, and the conditions of measurement. When any of these aspects

vary across replications of the measurement, validity evidence justifying the intended

interpretation and use of scores in this new instance should be provided (Messick, 1995b).

In the case of SET, the validity of scores depends on the group of items included in the

summated rating scale. The validity of SET scores would change if the population of students

changes. Similarly, measurement conditions such as timing (mid-term, end of the term), the

anonymity of responses and the mode of administration could also affect the validity of scores.

The overall context of the evaluation (e.g. academic department, type of program, discipline),

and time (session) could also influence the validity of SET scores (Spooren et al., 2013).

Evidence of validity can emerge by considering two types of rival hypotheses that challenge the

intended interpretation of scores: construct underrepresentation and construct-irrelevant variance

(AERA et al., 2014). For instance, an examination of a SET instrument revealing that item content

excludes important attributes of teaching quality (for example, lack of items targeting the moral

acts of teaching) would indicate construct underrepresentation. Evidence indicating the influence

4 Test refers to any kind of measurement tool based on a standardized procedure (AERA, APA, & NCME, 2014),

and includes SET summated rating scales.

of gender stereotype (which is not part of the definition of teaching quality) on students’

responses to a SET summated rating scale would indicate construct-irrelevant variance.

Construct underrepresentation and construct-irrelevant variance adversely impact the use of

SET scores for formative and summative decisions.

2.2.2.3 Types of Validity Evidence

There are four sources of validity evidence that can help support the intended interpretation and

use of test scores. Sources of validity evidence are evidence based on content, response process,

internal structure, and relationship to other variables (AERA et al., 2014).

Evidence based on content refers to the extent to which aspects such as themes, wording, the

format of items, administration and scoring reflect the target construct as defined by the

developer of the measurement tool. Content should also appropriately match the intended use of

scores. An example is examining whether items from a SET summated rating scale cover all the

aspects of teaching quality that would serve the purpose of informing teaching improvement.

Evidence based on the response process involves examining propositions about the expected

cognitive aspects involved in rating items. Examples are examining whether students use

appropriate criteria, whether students are investing enough cognitive effort in the rating task, or

whether irrelevant criteria such as teachers' personality, appearance, or gender affect students’

ratings.

Evidence based on internal structure pertains to the extent to which the relationships between

items and dimensions included in the measurement tool match the observed responses. An

example is examining the dimensionality of scores (unidimensional or multidimensional) and

testing the extent to which the expected relationships among items and dimensions satisfactorily

explain the observed relationships in the data.

Evidence based on the relationship to other variables examines the extent to which these

relationships are consistent with the intended interpretation and use of scores. Three types of

relationships to other variables are convergent, discriminant, and test-criterion.

Convergent evidence refers to the examination of the relationship between the target construct

and variables theoretically related, for instance, evaluating whether SET scores converge with


other measures of teaching quality such as a classroom observation protocol scored by trained

observers. A statistically significant relationship between SET scores and the objective measure

of teaching quality provides convergent validity evidence.

Discriminant evidence pertains to variables less related to the target construct, for instance,

evaluating whether SET scores correlate with constructs such as students’ satisfaction with the

course, teacher’s personality, teacher’s attractiveness or gender. The lack of a statistically

significant relationship between SET scores and the irrelevant variable provides discriminant

validity evidence.

Finally, test-criterion evidence pertains to the examination of the relationship between test scores

and expected outcomes, for instance, examining whether SET scores from multiple teachers

predict students’ final grades (Marsh & Roche, 1997) under the assumption that the level of

teaching quality significantly explains students’ learning.
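
The convergent and discriminant checks described in this subsection can be illustrated with a minimal sketch. All data below are hypothetical values invented for illustration (one value per teacher), and the helper name `pearson_r` is an assumption, not part of any cited study. With a dichotomous variable such as gender, the Pearson coefficient is equivalent to a point-biserial correlation. A full analysis would also test statistical significance, which this sketch omits.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (n >= 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical teacher-level data, invented for illustration only.
set_scores = [3.1, 3.8, 4.2, 2.9, 4.5, 3.6]          # mean SET score per teacher
observer_ratings = [2.8, 3.9, 4.0, 3.0, 4.4, 3.3]    # trained-observer protocol
teacher_is_female = [1, 1, 1, 0, 0, 0]               # theoretically irrelevant variable

# Convergent evidence: SET scores should relate to another measure of teaching quality.
convergent_r = pearson_r(set_scores, observer_ratings)

# Discriminant evidence: SET scores should be unrelated to an irrelevant variable.
discriminant_r = pearson_r(set_scores, teacher_is_female)

print(f"convergent r = {convergent_r:.2f}, discriminant r = {discriminant_r:.2f}")
```

In this invented data set the first coefficient is high and the second is close to zero, which is the pattern that would support the intended interpretation of scores.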

2.2.3 SET Validity Findings

The two preceding subsections (2.2.1 and 2.2.2) propose that all measurement contains error,

explain how observed scores can differ from true score (the actual level of the target construct),

and offer a framework to evaluate the intended interpretation and use of scores.

Subsection 2.2.3 summarizes findings on SET validity organized into three subjects. The first

subject is the overall evaluation of SET validity from accumulated empirical evidence. The

second subject is a summary of discriminant evidence, an important focus within SET research.

The third subject pertains to evidence of differences in SET scores between female and male

teachers, one of the most examined irrelevant variables in studies reporting discriminant

evidence that specifically informs the third research question in the study.

2.2.3.1 Overall Evaluation

There are contradictory positions regarding the overall validity of SET scores based on

accumulated empirical evidence. Whereas early literature is more positive towards the validity of

SET scores, recent literature seems more critical (Gravestock & Gregor-Greenleaf, 2008).


Early literature defends the use of SET as a valid measure of teaching quality based on high

coefficients of reliability and evidence based on relationship to other variables, specifically

convergent and discriminant evidence (Gravestock & Gregor-Greenleaf, 2008).

An example of the positive attitude towards SET is Greenwald (1997) who characterized

research on SET during the 1970s as mainly concerned with the influence of students’ grade

expectation on SET scores (discriminant evidence). Greenwald claimed that those concerns were

“effectively answered and largely put to rest by subsequent research” (p. 1184). In fact,

Greenwald reported a decline in the number of studies on SET validity during the 1990s, and he

speculated that this decline was the result of prior research resolving the major issues regarding

SET validity.

Two highly-cited publications by Herbert Marsh (Marsh, 1987; Marsh & Roche, 1997) examined

the major issues in SET mentioned by Greenwald. These concerns include reliability of scores,

the internal structure of SET, relationship to other variables, and the perceived utility of SET.

A first conclusion reported by Marsh is that the reliability of SET scores is high and that scores

are consistent across students evaluating the same teacher.

A second conclusion reported by Marsh is that SET scores are multidimensional rather than

unidimensional. Findings support the nine dimensions of the Students’ Evaluation of Educational

Quality (SEEQ) instrument. The dimensions of teaching quality included are Learning/Value,

Instructor Enthusiasm, Organization/Clarity, Group Interaction, Individual Rapport, Breadth of

Coverage, Examinations/Grading, Assignments/Reading, and Workload/Difficulty.

Evidence based on relationship to other variables summarized by Marsh includes convergent,

discriminant and test-criterion. Convergent evidence indicates that SET scores correlate with

teaching evaluation by other sources, such as self-assessment (teacher versus students), and

trained external observers.

As reported by Marsh, discriminant evidence indicates that SET scores reflect teaching quality

rather than the quality of the course. The relationship between SET scores from teachers that

taught the same course-content is close to zero, and the correlation between SET scores from the

same teacher in different courses is above r = 0.6. Related to discriminant evidence as well, Marsh


reported that SET scores are only weakly or not correlated at all with irrelevant variables. The

variables reported by Marsh are students’ prior subject interest, expected grade and actual grade,

course workload or difficulty, class size, the level of the course (graduate, undergraduate), year

in school, the gender of the teacher, academic discipline, purpose of evaluation, administrative

conditions, and students’ personality.

Lastly, Marsh reports that SET is perceived as useful by teachers when appropriate support is

offered, by students in course selection, and by administrators for use in personnel decisions.

Marsh’s positive findings regarding SET are mostly (but not exclusively) based on evidence from

one specific instrument of his authorship (Marsh, 1982). He acknowledges that “many

instruments fail to provide a comprehensive evaluation of theoretically sound, multiple

dimensions of teaching quality, thus undermining their usefulness, particularly for diagnostic

feedback” (Marsh, 1997, p. 1188). However, many authors have subsequently echoed the above

and other positive validity findings to argue in favor of the overall validity of SET⁵ (Theall &

Franklin, 2001). There is a tendency in the literature to re-interpret these positive validity results

authoritatively and omit the words of caution regarding the use of SET made by the original

authors (Johnson, 2000).

Since Marsh’s report, an overwhelming amount of evidence has become available, and recent

literature seems to recede from the previous favorable appraisal of SET. The current main aspect

of concern regarding SET is the fundamental question on whether scores reflect the intended

target construct, teaching quality (Boring, Ottoboni, & Stark, 2016; Penny, 2003;

Stark & Freishtat, 2014).

Recent literature suggests at least caution when using SET scores mostly because of the doubts

about what SET measures compared to what it intends to measure (Penny, 2003). Among the

more critical appraisals of SET, Olivares (2003) concludes that SET scores “are not

appropriate for drawing inferences regarding teaching effectiveness” (p. 240). Valsan and

Sproule (2008) argue that findings from validity research are misleading because the construct

5 Implicitly, these authors refer to validity of SET summated rating scale rather than validity of SET scores.


teaching quality “has no verifiable empirical content” (p. 940). Stark and Freishtat (2014)

conclude that “there is no consensus on what SET does measure” (p. 13).

As in the case of early literature, current research expresses concern about SET based on validity

evidence based on relationship to other variables, specifically discriminant evidence. In fact,

there is little attention to evidence based on content and response process (Ory & Ryan, 2001;

Penny, 2003). Specifically, current SET research is characterized by a lack of support for “content

relevance, adequacy of coverage, empirical and theoretical analysis of rating forms, the scores

and any action based on them” (Penny, 2003, p. 401). Also, there is little or no information about

the validity of scores based on results from proper psychometric analysis (Penny, 2003; Spooren

et al., 2013).

Two recent and remarkable examples of research on SET validity providing evidence based on

response process are Gee (2017) and Bassett, Cleveland, Acorn, Nix, and Snyder (2017), who

examined response strategies and motivation of students in the context of two SET summated

rating scales administered in the UK.

Following the analysis of think-aloud protocols, Gee (2017) reported that students did not

engage in sufficient cognitive processing when rating SET items. Students relied on superficial

response strategies, for instance, providing the same response to all items. Additionally, students

reported that they felt motivated to inflate their scores influenced by personal and power

relationships, for instance, with the goal of rewarding friendly teachers or presenting themselves

as non-confrontational students.

Bassett et al. (2017) reported insufficient student effort after analyzing responses to improbable

items included in a SET instrument. An example of an improbable item is “the instructor never even

attempted to answer any student question related to the course.” The level of endorsement

of these unlikely statements was high, fluctuating between 24% and 69%

of students. The study also reported that only 20% of students indicated that they responded to

all items seriously. Lastly, students reported that they did not believe that administrators or

teachers would use the results of the evaluation.

Despite the two previous examples of studies providing evidence based on response process,

most research on SET focuses on discriminant evidence. The following subsection summarizes


findings on SET discriminant evidence with an emphasis on the relationship between SET scores

and teacher’s gender, one of the most examined irrelevant variables in SET literature.

2.2.3.2 Discriminant Evidence

Discriminant evidence refers to the test of the relationship between the target construct and

variables that theoretically are not related to the target construct (irrelevant variables).

Discriminant evidence is often obtained using experimental designs and correlational analysis

(AERA et al., 2014).

A lack of relationship between SET scores and an irrelevant variable (for instance, teacher’s

gender) would provide support in favor of the validity of SET scores because no relationship is

expected based on theory. A significant association between SET scores and an irrelevant

variable would reflect that the two variables are not independent (for instance, when SET scores

are higher for female teachers), a finding inconsistent with the theory that sustains the

measurement of teaching quality.

The lack of independence between SET scores and an irrelevant variable can suggest 1) a true

relationship between the two variables (for instance, a different meaning of the target construct

among subgroups), 2) construct underrepresentation, or 3) construct-irrelevant variance (AERA

et al., 2014). Discriminant evidence on its own does not indicate which of the previous three

alternatives explains the observed relationship. The lack of independence between the target

construct and the irrelevant variable should encourage further investigation, for instance, a

revision of the theory sustaining SET, a review of the content, and examination of the response

process.

In the context of SET, studies providing discriminant evidence are known as examining “biasing

factors” (Bassin, 1974; Bonitz, 2011; Gravestock & Gregor-Greenleaf, 2008; Olivares, 2003;

Penny, 2003; Spooren et al., 2013; Theall & Franklin, 2001). The term “bias” has a related, yet

different meaning than the definition presented previously in the study, bias as construct-

irrelevant variance. Instead, a “biasing factor” simply indicates a theoretically irrelevant variable

that correlates with SET scores.

The list of variables considered irrelevant in the context of SET is extensive and includes

(Spooren et al., 2013):


• Student’s background variables: gender, age, and maturation.

• Student’s academic variables: class attendance, student’s effort, expected grade, students’

course performance (examinations and final grades), students’ goals orientation, the

discrepancy between expected-actual grade, grading leniency, pre-course interest, change

in course interest.

• Teacher’s variables: gender, reputation, research productivity, teaching experience, age,

language background (native versus ESL), race, tenure, rank, sexual orientation, and

personality traits such as charisma, personality, physical attractiveness, fairness, attitudes

toward students, image compatibility (ideal versus actual teacher), likability, and initial

impression.

• Course’s variables: size, attendance rate, difficulty, discipline, workload, year in the

program, type (lab versus lecture), elective versus required, general versus specific content,

syllabus tone (friendly versus unfriendly).

Spooren et al. (2013) and Stark & Freishtat (2014) summarize evidence indicating that student’s

variables with a statistically significant correlation with SET scores are cognitive background,

class attendance, effort, and grade expectation. Teacher variables with a statistically significant

correlation with SET scores are gender, reputation, experience, and age. Course variables with a

statistically significant correlation with SET scores are size, attendance rate, and course

difficulty.

Unfortunately, reports summarizing these relationships only indicate statistical significance and

exclude an interpretation of practical significance (i.e., effect size indexes). Although these relationships are

statistically significant, other authors conclude that their relevance is very small or even

trivial (Cashin, 1995; Marsh, 1987; Marsh & Roche, 2000; Penny, 2003). Until now, there is no

complete consensus on the actual importance of these irrelevant variables in the context of SET

discriminant validity evidence.

2.2.3.3 Teacher’s Gender

The gender of the teacher is one of the most examined theoretically irrelevant variables in SET

validity research (Bonitz, 2011). The relationship between SET scores and the gender of the

teacher is a concern in this study from the perspective of examining the effect of response styles


on subsequent statistical analysis of SET scores. The following section summarizes recent

findings from studies reporting differences in SET scores by teacher’s gender.

In general, findings from studies examining differences in SET scores between female and male

teachers are similar to those from studies examining other irrelevant variables: empirical

evidence is inconclusive, and the practical significance of statistically significant differences is

usually small or trivial (Marsh & Roche, 1997). However, some authors claim that SET scores

are “biased” by teacher’s gender based on small or trivial coefficients of practical significance.

Two recent examples are Boring et al. (2016) and Stark & Freishtat (2014).

There are three published studies based on experimental designs that analyze the effect of

perceived gender of the teacher on SET scores (Arbuckle & Williams, 2003; Bonitz, 2011;

MacNell, Driscoll, & Hunt, 2015). In these studies, researchers utilized methods that allow the

manipulation of the teacher’s gender. One method is the use of a gender-neutral audio lecture as

teaching format (gender manipulation is in the SET questionnaire) (Arbuckle & Williams, 2003).

A second method is the use of an online course as teaching format (gender manipulation is in

course’s description and material) (MacNell et al., 2015). A third method is the use of vignettes

in survey experiments (Bonitz, 2011).

Arbuckle & Williams (2003) utilized a 2 (teacher’s gender) x 2 (teacher’s age) x 2 (student’s

gender) experimental design in the context of an audio lecture about “Stages of Relationship

Building” attended by college students. The authors reported that the same lecture was rated

higher when students believed that the teacher was a male under 35 than when students believed

that the teacher was a male over 55, a female under 35, or a female over 55, F(9, 330) =

2.63, p = .006, partial η² = .07⁶. The partial η² implies that teacher’s gender accounted for 7% of

the variance in SET scores, a difference indicating a medium effect⁷.

A second experiment conducted with undergraduate students attending an online introductory

course in anthropology/sociology (MacNell et al., 2015) revealed that students tended to assign

higher scores to male teachers over female teachers regardless of the actual gender. However, the

6 Effect size was calculated from reported MANOVA results.

7 Interpretation of effect size follows Cohen’s (1988) rule of thumb: small, medium and large.


effect of gender identity on SET scores was not statistically significant. Re-analysis of the data

using nonparametric tests (Boring et al., 2016) confirmed the original findings. The non-

parametric tests revealed differences between female and male teachers in only three items out of

14 items using an alpha level of .05. No differences were observed in total SET scores.

However, the original study reported that gender identity explained 13% of the variance in SET

scores (R² = 0.13), which indicates a medium practical significance of the difference. In

comparison, the actual gender of the teacher explained less than 1% of the variance of SET

scores (R² = 0.01).

The third experiment examined undergraduate psychology students evaluating a short vignette

describing a hypothetical teacher (Bonitz, 2011). Results from a 2 (teacher’s gender) x 2

(student’s gender) x 2 (course type: counseling psychology or research methods) experimental

design indicate no main or interaction effect of teacher’s gender identity on SET scores, F(1,

602) = 0.13, p = .72, 95% CI for the difference in means = [-0.10, 0.18]. Teacher’s gender

explained less than 0.1% of the variance of SET scores (partial η² < .001)⁸, which reflects no

practical significance of the difference by teacher’s gender.

In summary, findings from experimental studies are mixed. One study reported a practically and

statistically significant difference in SET scores favoring male teachers (Arbuckle & Williams,

2003). One study reported a practically but not statistically significant difference in favor of male teachers

(MacNell et al., 2015). Lastly, one study reported neither a practically nor a statistically significant

difference between female and male teachers (Bonitz, 2011).

Evidence of differences in SET scores between female and male teachers from observational

studies reaches statistical significance more often, but its practical significance is small or trivial.

For instance, Basow and Montgomery (2005) utilized a 2 (teacher gender) x 2 (student gender) x

3 (teacher’s rank) ANOVA on SET scores from students enrolled at a liberal arts college. They

reported a statistically significant main effect of teacher’s gender, F(6, 682) = 4.32, p < .001, η²

= .036. The practical significance⁹ of the difference in SET scores favoring female over male

teachers is small.

8 Effect size was calculated from reported ANOVA results.

9 Effect size was calculated from reported ANOVA results.


Smith et al. (2007) reported a statistically significant main effect of teacher’s gender on SET

scores from undergraduate communication students, F(1, 10955) = 146.90, p < .001, η² = .01,

indicating a small practical difference in SET scores favoring female over male

teachers.
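
The footnotes in this section note that several effect sizes were calculated from reported ANOVA results, but the formula is not shown. The sketch below assumes the standard conversion from an F statistic and its degrees of freedom to partial η². The function names and the labelling cut-offs (.01, .06, .14, the values commonly attributed to Cohen, 1988) are assumptions made for illustration, not the author's documented procedure.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from a reported F statistic.

    Uses the identity eta_p^2 = (F * df_effect) / (F * df_effect + df_error).
    """
    if f_value < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and degrees of freedom positive")
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)

def effect_size_label(eta_sq):
    """Rule-of-thumb label; the cut-offs are assumed conventional benchmarks."""
    if eta_sq < 0.01:
        return "below small"
    if eta_sq < 0.06:
        return "small"
    if eta_sq < 0.14:
        return "medium"
    return "large"

# Values reported in this section: F(1, 10955) = 146.90 and F(1, 602) = 0.13.
smith_eta = partial_eta_squared(146.90, 1, 10955)
bonitz_eta = partial_eta_squared(0.13, 1, 602)

print(f"Smith et al. (2007): {smith_eta:.3f} ({effect_size_label(smith_eta)})")
print(f"Bonitz (2011): {bonitz_eta:.4f} ({effect_size_label(bonitz_eta)})")
```

With a very large error df, as in Smith et al. (2007), even a large F corresponds to a small share of explained variance, which is why statistical and practical significance can diverge.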

McPherson, Jewell, and Kim (2009) used regression analysis on SET scores from undergraduate

economics students. They found that the unstandardized regression coefficient of being a male

teacher on SET scores was B = .094 (β = 0.14; p < .001) for teachers of principles of economics

classes and B = .07 (β = 0.11; p < .01) for teachers of upper-level economics classes, after

controlling for student’s variables such as grade expectations, response rate, class size, and

teacher’s characteristics including teaching experience, race, and rank (adjunct versus tenure-

track). The practical significance of the difference expressed in the standardized regression

coefficients and differences in means¹⁰ is small, with Cohen’s d = 0.19 for principles of economics

classes and d = 0.15 for upper-level economics courses. No statistically significant effect of

teacher’s gender on SET scores was found in graduate students in economics using a similar

approach (McPherson & Jewell, 2007).

Finally, Boring (2015) reported findings from an observational study using SET responses from

first-year undergraduate students at a French university using a generalized ordered logit model.

The study found that male teachers are more likely to receive the highest response

option from male students, and female teachers are less likely to be assigned the higher response

options by both female and male students. Permutation tests were conducted using the same data

to better account for noncompliance with score distribution assumptions¹¹ (Boring et al., 2016).

Findings based on nonparametric tests (permutation tests) led to conclusions similar to those of the

previous report: male teachers received higher scores than female teachers, with an overall

correlation coefficient of r = 0.09, p < .001 and coefficients ranging from r = .04 (p = .63) to r

= .11 (p = .10) across disciplines (History, Microeconomics, Political Science). Although

interpreted as “large and statistically significant” differences (p. 1), the practical significance of

10 Standardized correlation coefficients and effect sizes were calculated from reported unstandardized regression

coefficients and descriptive statistics.

11 According to Boring (2015) teachers are not a random and independent sample from a normally distributed

population with equal variance and different means by gender, indicating that the null hypothesis is unrealistic.


these correlation coefficients is small using Cohen’s rule of thumb for interpreting effect size (J.

Cohen, 1988).
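
A short worked conversion may help show why a coefficient of this size is read as small. The identity below is standard and is not a calculation reported by the cited authors; the benchmark values are the conventional rule-of-thumb cut-offs for r attributed to Cohen (1988) and are stated here as an assumption about which thresholds apply.

```latex
\[
r^2 = (0.09)^2 = 0.0081 \approx 0.8\%
\]
\[
\text{benchmarks for } |r|:\quad 0.10\ (\text{small}),\quad 0.30\ (\text{medium}),\quad 0.50\ (\text{large})
\]
```

Under these assumptions, the overall coefficient corresponds to less than 1% of shared variance and sits at the lower edge of the "small" range.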

In summary, findings from observational studies examining differences in SET scores by

teacher’s gender are mixed. Two studies reported higher SET scores for female teachers (Basow

& Montgomery, 2005; Smith et al., 2007). The same number of studies reported higher scores

for male teachers (Boring, 2015; Boring et al., 2016; McPherson et al., 2009). Only one study

found no difference in SET scores by teacher’s gender (McPherson & Jewell, 2007). These

recent findings from experimental and observational studies are consistent with previously

published reviews indicating inconsistent results and small or no practical significance of the

differences in SET scores between female and male teachers (Gravestock & Gregor-Greenleaf,

2008; Marsh & Roche, 1997; Spooren et al., 2013).

2.3 Response Styles in SET

A common belief among developers and users of measurement tools in educational contexts is

that observed scores are determined exclusively by the target construct that the tool intends to

measure (Cronbach, 1946; Wetzel, Böhnke, et al., 2016). Underlying the previous interpretation

of observed scores as only reflecting true score is the assumption that processes irrelevant to the definition of the

target construct are not influencing responses. In other words, developers and users assume that

there is no construct-irrelevant variance in observed scores.

In this regard, the measurement literature has long recognized that irrelevant factors often

influence responses to measurement tools such as summated rating scales. An instance occurs

when a student shows the tendency to agree, disagree, or select extreme options in the response

scale across items. These response patterns suggest processes irrelevant to the target construct.

The previous examples pertain to response styles, well-documented sources of construct-

irrelevant variance affecting summated rating scales (AERA et al., 2014; Cronbach, 1946;

Viswanathan, 2005; Wetzel, Böhnke, et al., 2016).

A response style is defined as the systematic tendency to respond to questionnaire items

irrespective of their content (Paulhus, 1991; Viswanathan, 2005; Wetzel et al., 2016).

Specifically, a response style is a stereotyped or aberrant individual response pattern across items


and is attributed to an individual tendency to favor certain response options over others

(Macmillan & Douglas, 1990).

As an expression of an individual process not related to the instrument content, response styles

can reduce the validity of scores as a source of construct-irrelevant variance (AERA et al., 2014)

affecting the interpretation and use of observed scores by introducing additive and correlational

error (Viswanathan, 2005).

2.3.1 Approaches to Examine Response Styles

There are different strategies for examining response styles in scores obtained from summated

rating scales. Two essential differences among strategies are 1) the use of additional items versus

the same items that measure the target construct, and 2) the use of a manifest versus latent

variable approach (Wetzel, Böhnke, et al., 2016). Consequently, four different ways of measuring

response styles are 1) same items with a manifest variable approach, 2) same items with a latent

variable approach, 3) additional items with a manifest variable approach, and 4) additional items

with a latent variable approach. According to Wetzel et al. (2016), the most popular strategy for

examining response styles is the calculation of frequency indexes using the same items as the

target construct. Methods for examining response styles based on latent variable approaches are

very recent, and no systematic review and comparison of methods is available yet.

The study examines response styles in the context of secondary SET data with no additional

items measuring response styles included in the instrument. Hence, the section focuses on

explaining the rationale of examining response styles using the same items measuring the target

construct and a manifest variable approach.
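
A minimal sketch of the frequency-index idea follows, assuming an odd-numbered agreement scale (1 to 5 by default) and one list of item responses per student. The function name and the exact index definitions (share of responses above the midpoint for acquiescence, below it for disacquiescence, at either end for extreme, and at the midpoint for midpoint response style) are assumptions made for illustration; the operational definitions used in the study may differ.

```python
def response_style_indexes(responses, scale_min=1, scale_max=5):
    """Proportion-based response-style indexes for one respondent.

    responses: item ratings given by a single student to the same items
    that measure the target construct (no additional items).
    """
    if not responses:
        raise ValueError("responses must not be empty")
    if any(r < scale_min or r > scale_max for r in responses):
        raise ValueError("response outside the rating scale")
    n = len(responses)
    midpoint = (scale_min + scale_max) / 2  # e.g. 3 on a 1-5 scale
    return {
        "ARS": sum(r > midpoint for r in responses) / n,   # agreement side
        "DRS": sum(r < midpoint for r in responses) / n,   # disagreement side
        "ERS": sum(r in (scale_min, scale_max) for r in responses) / n,  # endpoints
        "MRS": sum(r == midpoint for r in responses) / n,  # midpoint option
    }

# Hypothetical student who mostly picks the top of the scale.
student = [5, 5, 4, 5, 3, 5, 5, 4]
indexes = response_style_indexes(student)
print(indexes)
```

Because the indexes are computed from the same items that measure teaching quality, a high value cannot by itself distinguish a response style from a genuinely favorable evaluation. On a scale with an even number of options the midpoint index is always zero under this definition.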

2.3.2 Manifest Variable Approach and Same Items

The examination of response styles using a manifest variable approach and the same items as

the target construct can be found early in the measurement literature (Wetzel, Böhnke, et al.,

2016). Two remarkable examples are the halo effect (Thorndike, 1920) and leniency/severity and range

restriction (Kingsbury, 1922).

Thorndike (1920) noticed that ratings of a priori relatively independent

traits such as intelligence, industry, technical skill, reliability, leadership, and character made by


superiors of industrial employees and aviation cadets were highly and evenly correlated.

According to Thorndike, correlations were “higher than reality” and “too much alike” (p. 25).

Thorndike believed that superiors rated these independent aspects of their subordinates affected

by “a marked tendency to think of the person in general as rather good or rather inferior and to

color the judgments of the qualities by this general feeling” (p. 25). Thorndike called this

error in the judgment of independent attributes halo.

Kingsbury (1922) examined how managers scored a group of employees across seventeen

attributes (e.g. vitality, alertness, enthusiasm, loyalty) comparing scores against the normal

probability curve. As in the case of Thorndike, Kingsbury also identified halo in managers’

evaluations of employees and explained this tendency as the influence of “amiable quality in the

employee, good appearance, tact, etc.” or “a brusk manner, unpleasant voice, or other socially

irritating trait” (p. 380). Kingsbury also noticed that managers would use “wrong or changing

quantitative standards” leading to “high marker” managers (leniency) and “low marker”

managers (severity) (p. 380). Finally, Kingsbury described that some managers would provide

ratings that were too uniform obscuring differences among employees (range restriction).
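
The rating tendencies described by Thorndike and Kingsbury can be summarized with simple manifest statistics per rater. The sketch below is an illustration under stated assumptions: the function name, the three summary statistics, and the 1-to-5 scale are hypothetical choices and are not the indexes used by either author. Each row holds one ratee's ratings across several attributes.

```python
from statistics import mean, pstdev

def rater_summary(ratings, scale_min=1, scale_max=5):
    """Manifest summaries of one rater's use of the rating scale.

    ratings: list of rows, one row per ratee, one value per attribute.
    """
    if not ratings or any(len(row) == 0 for row in ratings):
        raise ValueError("ratings must contain at least one non-empty row")
    midpoint = (scale_min + scale_max) / 2
    all_values = [value for row in ratings for value in row]
    ratee_means = [mean(row) for row in ratings]
    return {
        # Positive: a "high marker"; negative: a "low marker".
        "mean_minus_midpoint": mean(all_values) - midpoint,
        # Small values: ratings too uniform across ratees (range restriction).
        "spread_across_ratees": pstdev(ratee_means),
        # Small values: attributes of the same ratee rated alike (consistent with halo).
        "spread_within_ratee": mean([pstdev(row) for row in ratings]),
    }

# Hypothetical manager rating three employees on three attributes.
manager = [[5, 5, 5], [5, 4, 5], [5, 5, 4]]
summary = rater_summary(manager)
print(summary)
```

In this invented example the rater scores well above the scale midpoint and barely differentiates among employees or attributes. As with any manifest approach, these summaries cannot tell whether the pattern reflects the rater's tendency or genuinely similar employees.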

The two examples above illustrate an early use of manifest variables and the same items as a

strategy for examining response styles. Also, the examples above suggest that manifest variable

approaches assume certain attributes regarding the utilization of the response scale, distribution

of scores, and relationships among items. These assumptions depend on intended interpretation

and use of scores (for example, recruitment, professional development, retention, promotion, and

firing).

Examples of these assumptions are that respondents should discriminate among independent

attributes (Thorndike, 1920), that respondents should not excessively agree with all questionnaire

items (Lentz, 1938), or that scores should match a normal distribution allowing discrimination

among participants (Kingsbury, 1922). These propositions are necessary for properly informing

formative and summative decisions based on scores.

Some authors propose that violations of these propositions reflect error in the measurement

attributable to limitations in participants’ ability to provide accurate responses (for instance,

Murphy & Balzer, 1989; Saal, Downey, & Lahey, 1980). However, other authors believe that

these violations could reflect participants’ strategic thinking regarding the utilization of the


evaluation results (Murphy & Balzer, 1989; Murphy & Cleveland, 1995; Murphy, Cleveland,

Skattebo, & Kinney, 2004). In fact, the goal of the evaluation could affect respondents’

motivation to provide accurate responses, discussed in Chapter 5.

There is a limitation affecting the utilization of a manifest variable approach for the examination

of response styles. A manifest variable approach cannot separate response styles variance from

target construct variance using the same items, and only a latent variable approach would allow

the separation between target construct and response style using the same items. Consequently, a

proper interpretation from analysis relying on a manifest variable approach is that SET scores

would reflect both teaching quality and response styles. Such evidence would serve a diagnostic

purpose, informing more elaborate analyses utilizing a latent variable approach. Chapter 5

expands on the utilization of a latent variable approach in the context of the limitations of the

study.

A second limitation shared by manifest and latent variable approaches is the lack of information

concerning the causes of response styles. In other words, these approaches indicate “what” is

happening in the data but not “why.” The value of these approaches is in informing about a

potential source of construct-irrelevant variance affecting SET validity (the “what”) that, once

identified, needs to be subsequently addressed. The examination of causes explaining response

styles falls beyond the scope of the study. However, some hypotheses are presented later in

Chapter 5.

2.3.3 Types of Response Styles

A definition and examination method using a manifest variable approach and the same items

as the target construct is presented below for acquiescence/disacquiescence response styles,

extreme response style, midpoint response style, halo and range restriction.

2.3.3.1 ARS/DRS

Acquiescence response style (ARS), or yea-saying, refers to the tendency to agree with

statements (or more generally, endorse the highest response option) irrespective of the content of

items. Disacquiescence response style (DRS) (or nay-saying) is the tendency to disagree with

statements (or endorse the lowest response option) regardless of the content of items (McGrath,


Mitchell, Kim, & Hough, 2010; Paulhus, 1991; Spector, 1991; Viswanathan, 2005; Wetzel,

Böhnke, et al., 2016).

Possible causes of ARS and DRS are complex, ambiguous, vague, or neutral item wording;

uncertainty or low cognitive ability in the respondent; and distraction and time

pressure (Paulhus, 1991; Viswanathan, 2005). ARS and DRS are also consequences of strong

satisficing (Barge & Gehlbach, 2012; Krosnick, Narayan, & Smith, 1996) and are possibly triggered

when students want to avoid negative consequences of the evaluation results on teachers

(Murphy & Cleveland, 1995; Murphy et al., 2004).

The most popular method for examining ARS/DRS is calculating an individual frequency index

based on the proportion of responses stating the most positive (ARS) or negative (DRS) response

option across all items. Some authors calculate the index including all agreement/disagreement

response options across questionnaire items (Richardson, 2012; van Herk, Poortinga, &

Verhallen, 2004), whereas others report the number of responses utilizing the “Yes” response

options across questionnaire items (Spooren, Mortelmans, & Thijssen, 2012). When the rating

scale includes negatively and positively worded items, an alternative procedure is calculating the

proportion for each type of item (negatively and positively worded) and then averaging the two

indexes.

As a proportion of the total number of responses, a value close to +1 would indicate agreement

(or disagreement) with all items, and a value close to 0 would indicate no acquiescence /

disacquiescence.
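The frequency indexes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure followed by any of the cited authors: the function names `frequency_index` and `balanced_index`, the coding of the response scale (1 = lowest option, 5 = highest option), and the example responses are all hypothetical.

```python
from typing import Optional, Sequence, Set


def frequency_index(responses: Sequence[Optional[int]], options: Set[int]) -> float:
    """Proportion of a respondent's valid answers that fall in `options`."""
    answered = [r for r in responses if r is not None]  # skip missing answers
    if not answered:
        raise ValueError("respondent has no valid responses")
    return sum(r in options for r in answered) / len(answered)


def balanced_index(positive_items, negative_items, options):
    """Average of the index computed separately for positively and
    negatively worded items (the alternative procedure for mixed scales)."""
    return (frequency_index(positive_items, options)
            + frequency_index(negative_items, options)) / 2


# Hypothetical respondent: eight items on a scale coded 1 (lowest) to 5 (highest).
student = [5, 5, 4, 5, 3, 5, 5, 1]

ars = frequency_index(student, {5})           # most positive option only
drs = frequency_index(student, {1})           # most negative option only
ars_broad = frequency_index(student, {4, 5})  # all agreement options
print(ars, drs, ars_broad)  # 0.625 0.125 0.75
```

The `options` argument makes the two variants found in the literature explicit: an index restricted to the end-point option and an index that counts every agreement option.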

Leniency/Severity are two terms related to acquiescence/disacquiescence and useful when

several participants (for instance, students) evaluate the same target (for example, teachers).

Leniency is the tendency to provide scores spuriously high regardless of the dimension, and

severity (also stringency or harshness) is the tendency to score spuriously low irrespective of the

dimension (Kingsbury, 1922; Viswanathan, 2005; Wolfe, 2004; Wetzel et al., 2016).

Leniency/Severity do not necessarily reflect students using the highest or lowest response option

as in acquiescence/disacquiescence, but unusually high or low responses compared to other

students. Because of their similarity with acquiescence/disacquiescence and limitations in the

SET data (explained in Chapter 3), the study only examines ARS/DRS.


2.3.3.2 ERS

Extreme response style (ERS) refers to the tendency to respond using the extremes of the scale

regardless of content (Paulhus, 1991; Viswanathan, 2005; Wetzel et al., 2016; McGrath et al.,

2010).

Paulhus (1991) mentions situational factors such as ambiguity, emotional arousal, and speediness

as possible triggers of ERS. Viswanathan (2005) mentions “intolerance for ambiguity or

dogmatism, anxiety, respondents lacking appropriate cognitive schemas, or content that is

meaningful, important, or involving to respondents” as causes of extreme response style (p. 141).

A frequency index of ERS is the proportion of extreme categories that a participant endorses

across all questionnaire items (Viswanathan, 2005; Wetzel et al., 2016; van Herk et al., 2004;

Richardson, 2012). When ARS and DRS are computed from the highest and lowest response options only, their sum equals the ERS index.
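The relation between the ERS index and the ARS and DRS indexes can be checked with a short Python sketch. The helper `option_share`, the five-point coding, and the responses are hypothetical, and the equality holds only when ARS and DRS count the end-point options alone.

```python
def option_share(responses, options):
    """Proportion of responses that use one of the given response options."""
    return sum(r in options for r in responses) / len(responses)


LOWEST, HIGHEST = 1, 5  # assumed end points of a five-point scale

# Hypothetical respondent answering eight items.
student = [5, 1, 4, 5, 3, 1, 5, 2]

ars = option_share(student, {HIGHEST})          # 3 of 8 responses
drs = option_share(student, {LOWEST})           # 2 of 8 responses
ers = option_share(student, {LOWEST, HIGHEST})  # 5 of 8 responses

# With end-point definitions, ERS is the sum of ARS and DRS.
assert ers == ars + drs
print(ars, drs, ers)  # 0.375 0.25 0.625
```

If ARS instead counts the two highest options, as in some of the studies reviewed later, the sum of ARS and DRS no longer equals ERS.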

2.3.3.3 MRS

Midpoint response style (MRS) (Viswanathan, 2005; Wetzel et al., 2016), also called neutral or

moderacy bias (McGrath et al., 2010), is the tendency to score using the middle point on the

scale.

Possible causes of MRS are “evasiveness, indecision, or indifference” (Viswanathan, 2005, p.

136). A frequency index of midpoint responding is the proportion of responses using the

midpoint across all items.

Central tendency (Kingsbury, 1922), also called centrality (Wolfe, 2004), is a concept related to MRS.

Central tendency is “the propensity to award a restricted range of scores around the mean (or

mode or median) and to avoid awarding extreme scores” (Leckie & Baird, 2011, p. 400).

Central tendency is defined relative to a measure of central tendency of the scores distribution

(Saal et al., 1980), and it differs from midpoint response style because the mean (or mode,

median) of the scores distribution is not necessarily the midpoint of the scale. Central tendency

pertains to the evaluation of the same target (for instance, a teacher) by several participants (for

example, students). Because of limitations in the SET data and the similarity between midpoint

response style and central tendency, the study reports MRS.


2.3.3.4 Range Restriction

Range restriction (Murphy & Balzer, 1989), also called response range (Viswanathan, 2005), refers to

the tendency to use the response scale narrowly. Range restriction helps identify participants who

are too uniform in their scoring (Saal et al., 1980). Logically, scores affected by ARS/DRS

(leniency/severity), ERS, or MRS (central tendency) would also reflect range restriction. The

measurement of range restriction relates to the evaluation of the same target (for instance, a

teacher) by several participants (for example, students), and due to limitations in the SET data,

the calculation of a range restriction index is not feasible. However, the standard deviation of

SET scores is reported and interpreted as the result of the data analysis.

2.3.3.5 Halo

Halo (Thorndike, 1920; Kingsbury, 1922; Leckie & Baird, 2011) is the tendency to provide

“highly correlated ratings across a range of criteria” even to “conceptually unrelated items”

(Wetzel et al., 2016, p. 10). Halo can indicate a respondent’s failure to discriminate among scale

dimensions.

A method for examining halo is interpreting the matrix of dimension intercorrelations (Saal et al.,

1980), for instance, calculating the correlation between SET dimensions using the mean score of

teachers across students. Systematically high correlations among dimensions are an indication of

halo. The same analysis can be performed using Principal Component Analysis or Factor

Analysis. The presence of one component or factor explaining a high proportion of variance

indicates halo as a likely problem affecting scores.
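The two checks for halo described above (dimension intercorrelations and the share of variance carried by a first component) can be sketched as follows. The code is a minimal illustration that uses only the Python standard library; the dimension names, the teacher mean scores, and the use of power iteration to approximate the largest eigenvalue are assumptions made for the example. A numerical library would normally be used for the eigenvalue step.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def correlation_matrix(dimensions):
    """Matrix of intercorrelations among dimension scores."""
    return [[pearson(a, b) for b in dimensions] for a in dimensions]


def first_component_share(matrix, iterations=200):
    """Share of total variance carried by the first principal component.

    The largest eigenvalue is approximated by power iteration and divided
    by the number of dimensions (the trace of a correlation matrix).
    """
    k = len(matrix)
    vector = [1.0] * k

    def multiply(v):
        return [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]

    for _ in range(iterations):
        product = multiply(vector)
        norm = sum(p * p for p in product) ** 0.5
        vector = [p / norm for p in product]
    eigenvalue = sum(v * p for v, p in zip(vector, multiply(vector)))
    return eigenvalue / k


# Hypothetical mean scores of six teachers on three SET dimensions.
clarity = [4.1, 3.2, 4.8, 2.9, 3.7, 4.4]
climate = [4.0, 3.0, 4.7, 3.1, 3.6, 4.5]
assessment = [3.9, 3.3, 4.9, 2.8, 3.5, 4.2]

matrix = correlation_matrix([clarity, climate, assessment])
share = first_component_share(matrix)
print(round(share, 2))  # a value close to 1 is consistent with halo
```

A high share does not by itself separate halo from dimensions that are truly related, so the result is better read as a diagnostic signal than as a measurement of halo.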

2.3.4 Evidence of Response Styles in SET

Studies on response styles are relatively new in the context of SET. Hence, the influence of

response styles on SET scores needs further exploration (Spooren et al., 2013). Response styles

already examined in the context of SET are acquiescence (Yorke, 2009; Richardson, 2012;

Spooren et al., 2012), leniency/severity (Rantanen, 2013), and extreme response style

(Richardson, 2012). There is one study examining the effect of response styles on the use of

scores in subsequent statistical analyses (Richardson, 2012). Findings from these four studies

suggest that response styles might affect the intended interpretation of SET scores as a measure

of teaching quality and the subsequent analysis of SET data.


Yorke (2009) was the first author to bring attention to the issue of response styles in the context of

SET, specifically examining acquiescence. The study utilized a manifest variable approach and

examined ARS using the same items as the target construct. Yorke reported no evidence of

ARS in a summated rating scale measuring students’ experience of teaching and learning.

In the study, Yorke (2009) showed that the distribution of item responses (using the Kolmogorov–Smirnov

test) was not affected by factors pertaining to the response scale (for instance, reversing the order of

presentation of response options). Also, scores distribution was not affected by changes in item

wording, for example, variation in the number of positively and negatively worded items and their

order of presentation. Yorke concluded that responses reflected content rather than acquiescence.

A problem with Yorke’s (2009) study is that it did not use a standard method for examining

acquiescence (for instance, a frequency index). Furthermore, by definition, acquiescence pertains

to an individual response pattern, an aspect missing in Yorke’s study. The lack of a common

measure of acquiescence leaves unanswered the question of the degree to which acquiescence

affected SET scores in the study and makes comparison with empirical evidence

from other sources impossible.

Richardson (2012) reported findings of response styles in scores from an instrument named

Course Experience Questionnaire (CEQ). The analysis relied on frequency indexes calculated

from the same items measuring SET. Richardson reported that the average level of acquiescence

response style among students (examined as a proportion of individual responses across items

using the highest two response options in a five-point response scale) was 0.30 for positively

worded items and 0.32 for negatively worded items. The average level of extreme response style

(examined as a proportion of individual responses across questionnaire items using the first and

last response option in the five-point response scale) was 0.35 for positively worded items and

0.45 for negatively worded items. Richardson (2012) also reported that the levels of

acquiescence and extreme response style correlated with students’ marks. The coefficients of

correlation between the levels of response styles and students’ marks varied between r = .23

(positively worded items) and r = .32 (negatively worded items). Finally, the study reported that

the variation in students’ marks explained by a measure of learning styles dropped from 21.8% to

18.9% after statistically controlling for response styles.


Spooren et al. (2012) examined acquiescence response style in SET scores using a latent variable

approach, structural equation modeling, and the same items as the ones measuring SET. The

study compared three measurement models (see footnote 12). A first model reflects the theoretical dimensions in

the SET summated rating scale (model 1). A second model adds to model 1 a common factor

explaining additional variance across all SET items. A third model adds to model 1 specific

common factors explaining additional variance across items within SET dimensions.

The fit (see footnote 13) between model 1 and the observed data was reasonable (RMSEA = .051; CFI = .989;

AIC = 134.884). Only model 2 showed a better fit with the observed data than model 1 (RMSEA

= .045; CFI = .992; AIC = 119.370). The relationship between the common factor from model 2

and a frequency index of acquiescence response style was subsequently estimated using a

structural equation model. Contrary to the authors’ expectation, the correlation between the latent

variable and the frequency index was low, implying that the common factor is not explained by

acquiescence response style. The authors proposed interpreting the common factor as halo, suggesting

that scores might reflect variables such as instructor’s charisma or teacher professionalism.

There are two conceptual problems in the study reported by Spooren et al. (2012). First, factor

analysis is a standard method of examination of halo, not acquiescence. Another problem is the

notable lack of theory when interpreting the common factor (or halo) as teacher’s charisma or

teacher professionalism. A simple and more plausible explanation is a common method factor

due to aspects of the measurement that are similar across items, such as the response scale or

items worded positively (Viswanathan, 2005; AERA et al., 2014).

One last publication on response styles in the context of SET reported evidence of

leniency/stringency on scores based on the results from a generalizability study (manifest

variable approach) using the same items as the target construct (Rantanen, 2013). In the study,

SET total variance was decomposed into three components using a hierarchical linear model

approach: teacher, students (individualized by an anonymous identification), and items. The

proportion of variance explained by students was 16.8% in comparison with the 24% explained

by teachers and 46.4% of residual variance (not explained by students, teachers, or items). The

author interpreted the percentage of variance explained by students as students using the

response scale in a systematically lenient (tendency to assign high scores) or stringent (tendency

to assign low scores) manner independently of the teacher. However, the study does not report

the proportion of students exhibiting lenient or severe responding. More generally, findings

from the study seem to suggest that students are not discriminating across teachers, and that

could indicate leniency or severity but also central tendency or range restriction.

12 Measurement model refers to the internal structure of the measurement, the relationships between items (manifest

variables) and dimensions (latent variables) that underlies the development of a rating scale (Brown, 2006).

13 Brown’s (2006) guidelines for interpreting reasonable fit between model and observed data are values of RMSEA

close to or below 0.05 and values of CFI close to or above .95. Akaike Information Criterion (AIC) is an information

criterion index and serves the purpose of comparing across models that differ in the number of factors; a lower value

indicates a better model fit.

2.4 Summary and Limitations

2.4.1 Summary

The main conclusion from the first subject presented in this review of the literature is that the

definition of the target construct in SET is problematic. Often there is weak or no theory of

teaching quality underlying the development of SET summated rating scales. Weak theory

relates to home-made and ad-hoc instruments, a high diversity of content among instruments,

and a vague definition of the target construct. Furthermore, studies employ terms such as

teaching efficacy, teaching effectiveness, teaching quality, and student satisfaction

interchangeably.

A vague definition of the target construct is a first aspect that casts doubt on the validity of SET

scores. A summated rating scale based on poorly defined conceptual domains of teaching quality

would encourage students to base their responses on their understanding of quality (Spooren et

al., 2013) or to report inaccurate or biased information (Valsan & Sproule, 2008).

In this study, the definition of teaching quality distinguishes two related yet different aspects:

good teaching (the quality of teaching task) and successful teaching (teaching that contributes to

learning). Considering that SET literature shows a vague definition of the target construct, the

interpretation of findings pertains to “teacher quality” in general without distinction between

good and successful teaching. The distinction between good and successful teaching also serves

the purpose of describing the content of the specific SET summated rating scale examined in this

study along with hypotheses regarding how response styles vary depending on item content.


The second subject presented in this literature review relates to the validity of SET. There are

contradictory positions regarding the overall validity of SET scores based on accumulated

evidence. Early literature is positive towards the validity of SET scores, and recent literature is

critical. The current most important issue is the fundamental question of whether SET scores

reflect teaching quality. The way in which researchers address this matter is mostly through

evidence based on relationship to other variables, and specifically, discriminant evidence.

Furthermore, there is little or no consideration of evidence based on content or response process.

Discriminant evidence is one of the most important types of evidence utilized in the evaluation

of SET scores validity. The list of irrelevant variables examined includes characteristics of the

student, teacher, and course. Findings from this vein of research are inconclusive. An example is

the examination of differences between SET scores by the gender of the teacher. Findings from

experimental and observational studies often do not reach statistical significance. Indexes of

effect size indicate a small or no practical significance of the differences in SET scores between

female and male teachers.

The third subject addressed in the literature review is response styles in the context of SET.

Overall, empirical findings are inconclusive because of the low number of studies, weaknesses in

the methodology, and differences in the types of response styles examined. The literature has not

yet explored topics such as differences in how response styles affect SET scores across different

measurement conditions, or how response styles affect the relationship between SET and other

variables. However, the findings recommend the examination of response styles as a potential

source of construct-irrelevant variance in SET scores because they can affect inferences about

the true level of teaching quality. The definition of validity also encourages the continuous

examination of sources of construct-irrelevant variance across items, persons and settings, and

summated rating scales are prone to response styles.

2.4.2 Limitations in SET Validity Research

The most significant limitation in SET validity research is the same major issue affecting the

development, interpretation, and use of SET scores: a weak theory of teaching quality informing

validation efforts. Validation in the context of weak theory is a recognized problem in the

measurement field and is known as a weak program of construct validity (Cronbach,

1988; Kane, 2001).


A weak program conveys the risk of turning validation into “sheer exploratory empiricism”

(Cronbach, 1988, p. 11) in which “any evidence even remotely connected to the test score is

relevant to validity” (Kane, 2001, p. 326). A weak program suggests the possibility of

researchers being “highly opportunistic in the choice of validity evidence” (Kane, 2001, p. 323).

Under a weak program, researchers should state their hypothesis “as explicit as possible, then

devising deliberate challenges” (Cronbach, 1988, p. 12). This study concludes that the

overemphasis on one specific type of validity evidence to the detriment of others is an expression of

a weak program in the context of SET validity research.

Other significant limitations in SET validity research that emerge from contrasting the literature

and the definition of validity offered in section 2.2.3 are:

1. There is no explicit connection between validity evidence and intended uses of SET

scores.

2. Interpretation of validity relates to SET as an abstraction or to a specific instrument

rather than scores.

3. There is a tendency to generalize validity findings over instruments, persons, and

settings without providing relevant validity evidence.

4. Little importance is given to validity evidence based on content and

response process.

There are specific limitations affecting studies providing discriminant evidence. Consistent

with a weak program, Marsh (1997) summarizes these studies as “atheoretical, methodologically

flawed, and not based on well-articulated operational definition of bias [construct-irrelevant

variance]” (p. 1190).

A serious problem in studies providing discriminant evidence affects the interpretation of a

statistically significant correlation coefficient between SET scores and irrelevant variables.

Specifically, the literature often interprets a lack of independence between target construct and

an irrelevant variable as a sign of construct-irrelevant variance (“bias”) (Boring, 2015; Olivares,

2003; Stark & Freishtat, 2014). Two other plausible interpretations, often ignored, are: 1)


construct underrepresentation and 2) a true relationship between variables. Studies examining

discriminant evidence do not provide further analysis confirming that the correlation between

SET scores and an irrelevant variable represents construct-irrelevant variance nor acknowledge

that the correlational design does not indicate which of the three plausible explanations is

correct.

Another serious problem pertaining to SET discriminant evidence is the lack of interpretation of the

practical significance of these relationships, for example, by omitting indexes of effect size. An

example is relying on p-values of small or even trivial correlation coefficients to wrongly

conclude that “SET are biased against female teachers by an amount that is large and statistically

significant” (Boring et al., 2016, p. 1).

The third problem is that the scant importance given to validity evidence based on content

and response process casts doubt on the meaning of the correlation coefficients between

(problematic) SET scores and other variables. Sources of construct-irrelevant variance can

explain the correlation coefficients between SET scores and other variables due to additive and

correlational error.

2.4.3 Focus of Study

The present study presents evidence to evaluate the interpretation of SET scores as a measure of

teaching quality for informing formative and summative decisions at a large teacher education

institution. The evidence relates to the examination of response styles, an important source of

construct-irrelevant variance in summated rating scales (Viswanathan, 2005; Wetzel, Böhnke, et

al., 2016) but scarcely examined in the context of SET. As a source of construct-irrelevant

variance, response styles can produce overestimation or underestimation of the true level of

teaching quality due to additive error, affecting formative and summative decisions.

Considering that the validity of scores depends on the conditions of measurement (Messick,

1995b), the degree to which SET scores are affected by response styles could differ across

conditions such as the academic department, the type of graduate program, and the session.

Finally, response styles can change the relationship between SET scores and other variables due

to correlational error, affecting summative decisions and analysis pertaining to discriminant


evidence. Furthermore, identifying sources of construct-irrelevant variance is a reasonable first step

preceding the examination of the relationship between SET scores and other variables.

The three research questions that guide the study are:

1. To what extent are SET scores affected by response styles?

2. What are the differences in the degree to which SET scores are affected by response

styles across measurement conditions?

3. Is there a difference in SET scores between female and male teachers, and to what

extent do response styles moderate such difference?


Chapter 3

Methodology

Chapter 3 describes the study’s methodology, including the population of students,

characteristics of the SET summated rating scale, the procedure of administration specifying

intended interpretation and use of SET scores, and the data analysis strategy followed to produce

evidence for each of the three research questions presented above.

3.1 Participants

The present study analyzes students’ evaluation of teaching from an institute of education that is part of

a large public research university located in Southern Ontario. Institute and university

authorities reviewed a research request for access to SET data, and following the request’s

approval, the manager of SET at the institute submitted students’ responses along with

information about the instrument development and administration.

For confidentiality reasons, the institute submitted the information in an anonymized manner to

prevent identification of teachers. Additionally, the institute did not collect any form of

identification of students during the instrument administration.

The total number of students’ evaluations of teaching is 6,133. The study excluded students

enrolled in special programs (other than Master or Ph.D./Ed.D.; 114 cases) and cases with

missing values across all items (98 cases).

As presented in Table 1, the number of students included in the analysis is 5,921 distributed

among two departments (A and B; see footnote 14), two types of academic program (Master and Ph.D./Ed.D.)

and six academic sessions (Summer 2014 to Winter 2016).

14 The total number of academic departments at the institute of education is four; however, authorization was

granted for examining SET data from only two departments.


Table 1

Number of students by academic department, program type, and session

Academic    Program        Summer    Fall  Winter  Summer    Fall  Winter
Department  Type             2014    2014    2015    2015    2015    2016   Total
A           Master            751     233     991     433     159     443   3,010
            Ph.D./Ed.D.        98      19     101      68      12      71     369
B           Master            422     226     444     410     265     324   2,091
            Ph.D./Ed.D.        80      31     122      86      38      94     451
Total                       1,351     509   1,658     997     474     932   5,921
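The marginal totals in Table 1 and the exclusions reported in the text can be cross-checked with a few lines of Python. The sketch only re-adds the counts printed in the table; the variable names are hypothetical and nothing here comes from the original data file.

```python
# Counts transcribed from Table 1, ordered Summer 2014 to Winter 2016.
counts = {
    ("A", "Master"):      [751, 233, 991, 433, 159, 443],
    ("A", "Ph.D./Ed.D."): [98, 19, 101, 68, 12, 71],
    ("B", "Master"):      [422, 226, 444, 410, 265, 324],
    ("B", "Ph.D./Ed.D."): [80, 31, 122, 86, 38, 94],
}

# Row totals per department and program type, column totals per session.
row_totals = {group: sum(values) for group, values in counts.items()}
session_totals = [sum(column) for column in zip(*counts.values())]
grand_total = sum(session_totals)

# Exclusions reported in the text: 114 special-program cases and
# 98 cases with missing values across all items, out of 6,133.
retained = 6133 - 114 - 98

print(row_totals)
print(session_totals)         # [1351, 509, 1658, 997, 474, 932]
print(grand_total, retained)  # 5921 5921
```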

The number of courses (see footnote 15) with students’ evaluations of teaching across departments, programs,

and sessions is 462, with an average of 12.82 evaluations per course (SD = 6.76), varying between 1 and 53.

The number of teachers across courses, departments, programs, and sessions is 159, and each

teacher received between 1 and 201 students’ evaluations (SD = 32.98). The percentage of

female teachers (71.1%) far exceeds the percentage of male teachers (28.9%) and is close

to the overall proportion of female teachers at the institute of education for the specific period

covered by the data (68%; see footnote 16).

The data submitted by the institute has two characteristics that limit the examination of response

styles. First, severity/leniency, central tendency, and range restriction require identification of

students and teachers, and the data provided does not include such information. A second

limitation is the lack of measurement of constructs unrelated to teaching quality, which results in

the impossibility of examining halo. Only one additional variable, teacher’s gender, was

submitted along with the students’ evaluations with the specific purpose of examining research

question number three.

3.2 Instrument

The SET summated rating scale includes eight items measuring aspects of the “learning

experience of students” during the length of a course. The instrument also included a general item

assessing the “overall experience” during the course. The “overall” item uses a different response

format than the other eight. Therefore, the analysis excluded the item. There are no differences in

the instrument content across departments, program type, or sessions.

15 Refers to a specific instance of a course in which teacher, section (if multiple) and session are specific, for

instance, course Research Methods in Education, teacher John Doe, section 1, summer 2016.

16 Personal communication with institute’s SET manager.

The prompt utilized in the instrument to present content and explain the response procedure is

the following:

“You are presented with a series of statements about aspects of a course

learning experience. Using the scale provided, please indicate the extent to

which each aspect was part of your course experience.”

The response format utilized by students is a five-point Likert-type scale with the following labels: 1)

not at all, 2) somewhat, 3) moderately, 4) mostly, and 5) a great deal.

As reported by the institute’s SET manager, a commission selected the instrument content from a

bank of items developed by the university’s teaching support unit. The selection of the content

involved a “rigorous consultation phase” with “faculty, programs, and departments” (see footnote 17). No other

information regarding the development of the instrument was available at the time of writing

this report.

The eight items included in the SET summated rating scale are:

1. “I found the course intellectually stimulating.”

2. “The course provided me with a deeper understanding of the subject matter.”

3. “The instructor created a course atmosphere that was conducive to my learning.”

4. “Course projects, assignments, tests, and/or exams improved my understanding of the course material.”

5. “Course projects, assignments, tests, and/or exams provided opportunity for me to demonstrate an understanding of the course material.”

6. “The instructor explained the learning objectives for the course.”

7. “The course instructor demonstrated respect for diversity (e.g., race, gender, ability, religion, sexual orientation, etc.) in the classroom.”

8. “The course instructor encouraged students to express their own ideas in the class.”

17 Personal communication with institute’s SET manager.

The SET summated rating scale combines a self-report item (item 1), report-of-objects items

targeting the course and course components (items 2, 4 and 5), and report-of-other items

targeting the teacher (items 3, 6, 7 and 8). Content covers the two components of teaching quality

(good and successful teaching) and the three types of acts of teaching (logical, psychological and

moral). Table 2 summarizes the type of report and type of content for each item.

Table 2

Type of report and content in SET summated rating scale

Item    Type of Report  Aspect of Teaching Quality  Type of Act of Teaching
Item 1  Self-report     Good Teaching               Logical
Item 2  Object-report   Successful Teaching
Item 3  Other-report    Good Teaching               Psychological
Item 4  Object-report   Successful Teaching
Item 5  Object-report   Good Teaching               Logical
Item 6  Other-report    Good Teaching               Logical
Item 7  Other-report    Good Teaching               Moral
Item 8  Other-report    Good Teaching               Psychological

Table 2 shows that six items measure good teaching, with three items covering logical acts (items

1, 5, and 6), two items covering psychological acts (items 3 and 8), and one item measuring a moral

act (item 7). Two items measure successful teaching (items 2 and 4).

The diversity of content summarized in Table 2 recommends the interpretation of SET scores as an

overall measure of teaching quality. The reliability of the responses to the summated rating scale

as estimated by Cronbach’s alpha coefficient is 0.93, indicating a high internal consistency

(and low random error), supporting the utilization of the total score.
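For readers unfamiliar with the coefficient, Cronbach's alpha can be computed from the item variances and the variance of the total score. The Python sketch below shows the standard formula on a small hypothetical data set of five respondents and four items; it does not reproduce the value of 0.93 reported above, which comes from the actual SET responses, and the function name is an assumption.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(rows[0])          # number of items
    items = list(zip(*rows))  # one tuple of scores per item
    item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variances / total_variance)


# Hypothetical responses on a 1-5 scale (rows = respondents, columns = items).
responses = [
    [5, 4, 5, 4],
    [3, 3, 4, 3],
    [4, 4, 4, 5],
    [2, 1, 2, 2],
    [5, 5, 4, 5],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))  # 0.95
```

Values close to 1 indicate that the items vary together across respondents, which is the sense in which a high alpha supports summing the items into a total score.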


There are two aspects of the instrument’s content calling for a cautious interpretation of SET

scores as a measure of teaching quality. Although the consultation phase of content selection

described above reflects the normative and contextual characteristics of the definition of good

teaching, the first problem is the low number of items measuring the different acts of teaching,

which suggests a potential problem of construct under-representation. Second, the instrument

includes items measuring successful teaching, reflecting the importance of teaching to foster

learning in this educational institution. However, the validity of “reaction” items or self-

assessment items as measures of teacher’s contribution to students’ learning is dubious based on

the previous literature review, casting doubt on their utility for informing formative and

summative decisions. Therefore, the first finding of the study is that the SET content reflects

partial aspects of good teaching and includes problematic items pertaining to successful teaching.

3.3 Administration

Close to the end of the academic session, students received an institutional message by email for

each course they were enrolled in, inviting them to participate in an online course evaluation survey.

Two follow-up messages encouraging students to fill out the course evaluation survey followed

the original invitation. The participation in the course evaluation survey was voluntary, and the

response rate was 65% in department A, and 71% in department B18.

No intended interpretation of SET scores was offered to students at any component of the

instrument administration according to the information available at the time of elaborating this

report. However, the introductory paragraph of the instrument stated the intended use of SET in

the following manner: “Your feedback is important to us” (…) “teaching evaluations are

designed to improve course offerings and may be considered in promotion or tenure decisions for

faculty.” The introduction also stated the anonymity of responses, and that teachers would

receive the results only after the submission of the course final grades, probably as a means to

minimize the influence of students’ grade expectation on responses.

The instrument content and administration procedure indicate neither the intended

interpretation of scores nor the use of scores for formative purposes. However, institutional

18 Personal communication with institutional SET manager.


documents state that SET assesses the “effectiveness of teaching19” and declare four intended

uses of SET20 summarized below:

1. Provide formative data to instructors for the continuous improvement of their

teaching.

2. Inform members of the institution about teaching.

3. Provide data for summative evaluation of teaching including annual merit, tenure, and

promotion review.

4. Provide data for program and curriculum review.

A logical conclusion of the comparison between instrument content and institutional documents

is that students received incomplete information regarding the intended interpretation and

planned uses of SET scores.

3.4 Data Analysis

The data analysis strategy comprises four parts. The first part is the report of responses

distribution using graphs and descriptive statistics. The report includes histograms for each item

and measures of central tendency, dispersion, and shape of the distribution for each item and

SET scores (total score). The remaining parts are related to the three research questions in the

study.

3.4.1 Research Question 1

The analysis of the degree to which SET scores are affected by response styles relies on a

manifest variable approach (Paulhus, 1991; Wetzel, Böhnke, et al., 2016). The study reports four

frequency indexes of response styles: acquiescence (ARS), disacquiescence (DRS), extreme

19 “Guidelines for the assessment of teaching”; despite what the document states, the analysis of the instrument content suggests that the target construct is teaching quality.

20 “Policy on the Student Evaluation of Teaching in Courses”


(ERS), and midpoint (MRS) response styles21. The calculation of frequency indexes of response

styles is per student and considers responses to all items.

The calculation of a frequency index supposes that the systematic choice of a specific response

option across items reflects the response style. Table 3 presents the relationship between the

selection of a response option and a response style. The proportion of responses matching the

scoring in Table 3 represents the degree to which SET scores are affected by response styles.

Table 3

Scoring for calculating frequency indexes of response styles

Response Style Not at all Somewhat Moderately Mostly A great deal

ARS 0 0 0 0 1

DRS 1 0 0 0 0

MRS 0 0 1 0 0

ERS 1 0 0 0 1

Note: ARS = index of acquiescence response style; DRS = index of disacquiescence response style; ERS =

index of extreme response styles; MRS = index of midpoint response style.

An example is the calculation of the frequency index of acquiescence. The value of ARS index

for a specific student is the sum of responses matching “A great deal” (symbolized as 1 in Table

3) divided by the number of items (eight, which transforms the sum into a proportion). The

choice of any other response option by the student does not reflect acquiescence (symbolized as

zero in Table 3). ARS index can vary between 0 (no acquiescence) and 1 (maximum

acquiescence). The same rationale applies to the other response style indexes: DRS, MRS, and

ERS.
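The scoring in Table 3 can be illustrated with a minimal sketch. The code below is not the Stata do-file used in the study; it assumes responses are coded 1 ("Not at all") to 5 ("A great deal"), and the function name, the `SCORING` dictionary, and the example student are hypothetical.

```python
# Scoring from Table 3: response options (coded 1-5) counted by each index.
SCORING = {
    "ARS": {5},     # "A great deal"
    "DRS": {1},     # "Not at all"
    "MRS": {3},     # "Moderately"
    "ERS": {1, 5},  # either extreme option
}


def response_style_indexes(responses):
    """Return, per index, the proportion of responses matching its scoring."""
    n_items = len(responses)
    return {
        style: sum(r in options for r in responses) / n_items
        for style, options in SCORING.items()
    }


# Hypothetical student answering the eight SET items.
student = [5, 5, 4, 5, 3, 5, 5, 1]
indexes = response_style_indexes(student)
print(indexes)
```

For this hypothetical student, five of the eight responses are "A great deal", so the ARS index is 5/8. Because the ERS scoring counts both extreme options, the ERS index of any student equals the sum of that student's ARS and DRS indexes.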

In addition to the four indexes of response styles, the study reports an index of acquiescence

relative to disacquiescence (ARSR). The index is the difference between the proportion of

positive (“mostly” and “a great deal”) and negative (“not at all” and “somewhat”) responses

across items. The ARSR index summarizes both acquiescence and disacquiescence and is less

correlated with extreme response style (van Herk, 2004). ARSR index can vary between -1

21 The limitations in the data described in subsection 3.1 prevent the examination of halo, leniency/severity, central tendency and

range restriction.


(maximum disacquiescence) and 1 (maximum acquiescence) and is reported exclusively to

compare results from the study with other studies in the context of SET.
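The ARSR index can be sketched in the same way. The snippet below is an illustration under the assumption that responses are coded 1 to 5, with 4 and 5 counted as positive and 1 and 2 as negative; the function name and the example response vectors are hypothetical.

```python
def arsr_index(responses):
    """Proportion of positive (4, 5) minus proportion of negative (1, 2) responses."""
    n_items = len(responses)
    positive = sum(r >= 4 for r in responses)
    negative = sum(r <= 2 for r in responses)
    return (positive - negative) / n_items


mixed = arsr_index([5, 5, 4, 5, 3, 5, 5, 1])  # six positive, one negative
all_positive = arsr_index([4] * 8)            # maximum acquiescence
all_negative = arsr_index([1, 2] * 4)         # maximum disacquiescence
print(mixed, all_positive, all_negative)
```

A midpoint response ("Moderately") counts toward neither proportion, which is why the first example yields (6 - 1) / 8 rather than 1.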


3.4.2 Research Question 2

Research question 2 relates to differences in the degree to which SET scores are affected by

response styles across conditions of measurement. The report of results includes 1) descriptive

statistics of response style indexes by academic department, program type, and academic session

and 2) results from analysis of variance (ANOVA). Specifically, a 2 (department) x 2 (program

type) x 6 (academic session) factorial ANOVA was conducted on response style indexes to

determine differences in the degree to which response styles vary across these measurement

conditions.


3.4.3 Research Question 3

The last issue pertains to determining differences in SET scores between female and male

teachers (part 1) and the extent to which response styles moderate such difference (part 2).

Research question 3 only focuses on acquiescence response style because findings from section

4.2 (research question 1) indicate that acquiescence is the most relevant type of response style

affecting SET scores in the study.

The two parts embedded in research question 3 are addressed using linear regression analysis

(Kenny, 1979; Baron & Kenny, 1986; J. Cohen et al., 2003). Researchers can use linear

regression analysis for explanation or prediction of a dependent variable using one or multiple

independent variables. In the study, linear regression analysis is used for explanation and informs

1) whether teacher’s gender explains differences in SET scores and 2) the moderator effect of

acquiescence on explaining differences in SET scores between female and male teachers. The

following subsections describe in detail the way in which the two parts embedded in question 3

are answered using linear regression analysis.

3.4.3.1 Part 1: Differences by Teacher’s Gender

The linear regression model expressed in Equation 3 informs about the magnitude and direction

of the difference in SET scores between female and male teachers:


$Y_{SET} = B_0 + B_1\,Gender + e$

Equation 3

Equation 3 indicates that SET scores (𝑌SET) are explained by three elements: a constant (𝐵0), a

regression coefficient related to teacher’s gender (𝐵1), and a term indicating error of prediction

(𝑒). 𝐵1 expresses the direction and magnitude of the differences in SET scores between female

and male teachers.

Teacher’s gender is a dichotomous variable coded as a dummy variable. In this analysis, a value

of “1” represents a female teacher, and a value of “0” represents a male teacher. The coding

enables the interpretation of the constant (𝐵0) as the mean of SET scores for male teachers, and

the regression coefficient of teacher’s gender (𝐵1) as the magnitude and direction of the

difference between female and male teachers. A positive value of 𝐵1 indicates that SET scores

are higher among female teachers, and a negative value indicates that SET scores are higher

among male teachers.

In addition to the constant (𝐵0) and regression coefficient of teacher’s gender (𝐵1), the following

information is provided and reported as result of the analysis (J. Cohen et al., 2003; Ellis, 2010;

Kenny, 1979):

• Test of significance of the null hypothesis indicating no linear relationship between

teacher’s gender and SET scores (𝐻0: 𝐵1 = 0).

• The standard error of estimate (SE) reflecting the estimated population standard deviation

of the residuals of estimating SET scores from teacher’s gender (𝑒). SE is interpreted as a

measure of the imprecision of the regression coefficient.

• Standardized regression coefficient of teacher’s gender ($\beta_1 = B_1 \cdot sd_{Gender} / sd_{SET}$), which is independent from the original scale of the variables and allows comparisons among studies.

• The coefficient of determination (R2, and adjR2) which indicates the proportion of the

variance of SET scores accounted for by teacher’s gender. R2 is a measure of the strength of

the relationship (effect size). adjR2 is R2 adjusted by sample size.

• Test of significance (F test) of the null hypothesis indicating that R2 is zero (𝐻0: 𝑅2 = 0).
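The interpretation of the coefficients under dummy coding can be checked with a small numerical sketch. The data below are hypothetical (not taken from the study), and the helper `simple_ols` is an illustrative closed-form least-squares fit rather than the Stata procedure used for the analysis.

```python
from statistics import mean, stdev


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical data: gender coded 1 = female teacher, 0 = male teacher.
gender = [0, 0, 0, 0, 1, 1, 1, 1]
set_score = [4.0, 4.5, 3.5, 4.0, 4.5, 5.0, 4.0, 4.5]

b0, b1 = simple_ols(gender, set_score)

male_mean = mean(y for g, y in zip(gender, set_score) if g == 0)
female_mean = mean(y for g, y in zip(gender, set_score) if g == 1)

# Standardized coefficient: B1 rescaled by the ratio of standard deviations.
beta1 = b1 * stdev(gender) / stdev(set_score)

# Coefficient of determination from residual and total sums of squares.
ss_res = sum((y - (b0 + b1 * g)) ** 2 for g, y in zip(gender, set_score))
ss_tot = sum((y - mean(set_score)) ** 2 for y in set_score)
r2 = 1 - ss_res / ss_tot

print(b0, b1, beta1, r2)
```

In this sketch the constant reproduces the mean for male teachers, the slope reproduces the female minus male difference in means, and, because there is a single predictor, the squared standardized coefficient equals R2.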


3.4.3.2 Part 2: ARS Moderator Effect

The operationalization of a moderator effect is the statistically significant interaction between the

moderator and an independent variable using multiple linear regression analysis (Baron &

Kenny, 1986). The multiple linear regression model presented in Equation 4 informs about the

role of acquiescence response style as moderator of the difference in SET scores between female

and male teachers.

$Y_{SET} = B_0 + B_1\,Gender + B_2\,ARS_C + B_3\,(ARS_C \times Gender) + e$

Equation 4

Equation 4 indicates that SET scores (𝑌SET) are explained by five elements: a constant (𝐵0), three

regression coefficients related to teacher’s gender (𝐵1), degree of acquiescence response style

(𝐵2), the interaction between acquiescence and teacher’s gender (𝐵3), and a term expressing

error of prediction (𝑒).

The coding of teacher’s gender is the same as in part 1. The degree of acquiescence response

style is centered at the grand mean to enhance the interpretation of the constant (𝐵0) and

regression coefficients (𝐵2, 𝐵3), a recommended procedure for interpreting interaction terms in

regression analysis (J. Cohen et al., 2003). Equation 5 shows the transformation of original

acquiescence values into grand-mean centered values.

$ARS_C = ARS - \bar{X}_{ARS}$

Equation 5

In Equation 5, the grand-mean centered degree of acquiescence for a specific student (𝐴𝑅𝑆𝐶) is

the difference between his/her original degree of acquiescence (𝐴𝑅𝑆) and the mean of

acquiescence across all students in the sample ($\bar{X}_{ARS}$). A grand-mean centered value of

acquiescence equal to zero indicates an average degree of acquiescence, negative values indicate

a lower than average degree of acquiescence, and positive values indicate a higher than

average degree of acquiescence.

The constant (𝐵0) in Equation 4 reflects the conditional mean of SET scores for male teachers

with an average degree of acquiescence; the regression coefficient of teacher’s gender (𝐵1)


reflects the magnitude and direction of the difference between female and male teachers

controlling for acquiescence; the regression coefficient of acquiescence (𝐵2) reflects the amount

of change in SET scores when the degree of acquiescence changes by one unit; and the regression

coefficient of the interaction term (𝐵3) indicates how much the difference between female and

male teachers changes as acquiescence varies from lower to higher values.

In addition to the constant (𝐵0) and regression coefficients (𝐵1, 𝐵2, and 𝐵3) from Equation 4, the

following information is provided as result of the analysis:

• Test of significance of the null hypothesis indicating no linear relationship between

independent variables and SET scores (𝐻0: 𝐵𝐼 = 0).

• The standard error of estimate (SE) indicating the estimated population standard deviation

of the residuals of estimating SET scores from an independent variable. SE is a measure of

the imprecision of the regression coefficient.

• Standardized regression coefficients ($\beta_I = B_I \cdot sd_{IV} / sd_{SET}$), which are scale-free estimates of regression coefficients and allow comparability among predictors and other studies.

• The coefficient of multiple determination (R2 and adjR2) indicates the proportion of SET

scores variance accounted for by all the independent variables. R2 is a measure of the strength

of a relationship between dependent and independent variables (effect size). adjR2 is R2

adjusted by sample size and the number of predictors.

• Test of significance (F test) of the null hypothesis indicating that R2 is zero (𝐻0: 𝑅2 = 0).
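The roles of the four coefficients in Equation 4 can be illustrated with a short sketch on hypothetical data. The sketch relies on one simplification that is an assumption of this example and not the procedure of the study: when the moderated model contains a dummy variable, a continuous predictor, and their product, its least-squares point estimates coincide with those of two separate simple regressions, one per gender group. The helper `simple_ols` and all data values are illustrative.

```python
from statistics import mean


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical data: gender (1 = female, 0 = male), ARS index, SET score.
gender = [0, 0, 0, 0, 1, 1, 1, 1]
ars = [0.0, 0.25, 0.75, 1.0, 0.0, 0.25, 0.75, 1.0]
set_score = [3.55, 3.65, 4.35, 4.45, 3.45, 3.975, 4.525, 5.05]

# Equation 5: grand-mean centering of acquiescence.
grand_mean = mean(ars)
ars_c = [a - grand_mean for a in ars]

# Fit one line per gender group on the centered predictor.
male = [(a, y) for g, a, y in zip(gender, ars_c, set_score) if g == 0]
female = [(a, y) for g, a, y in zip(gender, ars_c, set_score) if g == 1]
int_m, slope_m = simple_ols(*zip(*male))
int_f, slope_f = simple_ols(*zip(*female))

# Map the group-specific lines onto the coefficients of Equation 4.
b0 = int_m              # male teachers at average acquiescence
b1 = int_f - int_m      # gender difference at average acquiescence
b2 = slope_m            # change in SET per unit of ARS for male teachers
b3 = slope_f - slope_m  # change in the gender difference per unit of ARS


def gender_gap(ars_centered):
    """Predicted female minus male difference at a centered ARS value."""
    return b1 + b3 * ars_centered


print(b0, b1, b2, b3, gender_gap(-0.5), gender_gap(0.5))
```

In this hypothetical example the predicted gender difference is zero for students half a unit below the mean of acquiescence and grows to half a point for students half a unit above it, which is the pattern a non-zero interaction coefficient describes. The sketch yields point estimates only; standard errors and significance tests require the full regression output.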

Part two of research question 3 focuses on the test of significance of the interaction term (𝐵3) as

indication of the moderator effect of acquiescence on the difference in SET scores between

female and male teachers. Two-way plots of predicted SET scores against degree of

acquiescence are presented to enhance the interpretation of the interaction term and help

understand the results of the moderation analysis. The study reports two indexes of the practical

significance of regression coefficients (eta squared and Cohen’s f squared) along with guidelines

for interpreting these indexes in the context of moderator analysis.
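As a point of reference, one common formulation of Cohen's f squared for an interaction term compares the model with and without that term. The expression below is offered as an assumption about the general form of the index, since the exact formula applied in the study is not stated in this section; the subscripts "full" (Equation 4) and "reduced" (Equation 4 without the interaction term) are labels introduced here.

```latex
f^2 = \frac{R^2_{\text{full}} - R^2_{\text{reduced}}}{1 - R^2_{\text{full}}}
```

Under this formulation, f squared expresses the variance uniquely attributable to the interaction relative to the variance left unexplained by the full model.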

The report of regression analysis described in part 1 and part 2 includes results at the individual

(students’ responses), course-level (students’ responses aggregated by course) and teacher

(students’ responses aggregated by teacher) levels of analysis. These levels of analysis have


practical relevance considering that researchers and users of SET data routinely aggregate results

at the course and the teacher level of analysis, for instance, when used by administrators (Stark &

Freishtat, 2014).

3.4.4 Software

The data was imported from the original file in Microsoft Excel format to STATA 13.1

(StataCorp, 2013) to conduct most of the statistical analysis reported in the study. STATA do

files containing the commands for data management and transformation (i.e. response styles

indexes), and analyses are available for further reference.


Chapter 4

Results

Chapter 4 reports the analysis of data pertaining to the examination of response styles in the context

of responses to a SET summated rating scale at a large teacher education institution. There are

four sections in Chapter 4. Section 4.1 reports item and SET scores distribution. Section 4.2

reports findings pertaining to the degree to which SET scores are affected by response styles.

Section 4.3 reports differences in the extent to which SET scores are affected by response styles

across measurement conditions. Section 4.4 reports findings related to the effect of acquiescence

response style in moderating differences in SET scores between male and female teachers.

4.1 Distribution of Responses

Figure 1 contains eight item histograms from 5,921 students’ responses to the SET

summated rating scale. Histograms indicate the proportion of students scoring each of the

options available on the response scale.

Examination of histograms in Figure 1 reveals that students utilized the full range of response

options on the scale. However, students utilized the options asymmetrically. In general, students

preferred the two highest response options (5 = “A great deal”; 4 = “Mostly”) in each of the eight

items included in the instrument. The proportion of students endorsing either of the two highest

response options in the scale is around 80%, obtained by adding the percentages of students

endorsing response options 4 and 5.

Two extreme cases are item 7 (“The instructor demonstrated respect for diversity”) and item 8

(“The instructor encouraged students to express their own ideas”). In these two cases, the

proportion of students endorsing the highest response option is over 80%.


Figure 1

SET item distribution (N=5,921)


Table 4 reports measures of central tendency (mean, median), dispersion (standard deviation),

skewness, kurtosis, and percentiles (5, 25, 50, 75 and 95) for items and SET score. SET score

keeps the same metric as the original items because the sum of SET items (total score) is divided

by the number of items.

Table 4

Descriptive statistics for SET items and overall score (N=5,921)

Variable Mean SD Skew Kurt P5 P25 P50 P75 P95

Item 1 4.18 1.01 -1.20 3.77 2.00 4.00 4.00 5.00 5.00

Item 2 4.27 0.99 -1.35 4.14 2.00 4.00 5.00 5.00 5.00

Item 3 4.21 1.08 -1.37 4.05 2.00 4.00 5.00 5.00 5.00

Item 4 4.22 0.98 -1.25 3.97 2.00 4.00 5.00 5.00 5.00

Item 5 4.27 0.95 -1.35 4.34 2.00 4.00 5.00 5.00 5.00

Item 6 4.35 0.92 -1.53 4.92 2.00 4.00 5.00 5.00 5.00

Item 7 4.70 0.68 -2.87 12.11 3.00 5.00 5.00 5.00 5.00

Item 8 4.62 0.80 -2.43 8.92 3.00 5.00 5.00 5.00 5.00

SET score 4.35 0.77 -1.56 5.34 2.75 4.00 4.62 5.00 5.00

Note: Skew=Skewness; Kurt=Kurtosis

Table 4 indicates that the average response to an item fluctuated between 4.18 (item 1) and 4.70

(item 7) with small standard deviations (one point or less than one point). The negative skewness

and high kurtosis reflect that responses are distributed closely around the highest value on the

response scale, with a large tail towards the lowest values. Values of kurtosis over three indicate

that the peak of the distribution is greater than the peak of a normal distribution (Moors, 1986),

and this is the case of each SET item.
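The reading of skewness and kurtosis can be illustrated with a moment-based sketch in which a normal distribution has a kurtosis of three. The formulas below are the standard moment ratios and are assumed, not confirmed, to match the definitions used by the statistical software in the study; the response vectors are hypothetical.

```python
from statistics import mean


def skewness_kurtosis(values):
    """Return moment-based skewness and (non-excess) kurtosis."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - m) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical item: most responses at the top of the scale, a few low ones.
top_heavy = [5, 5, 5, 5, 5, 5, 4, 4, 3, 1]
skew, kurt = skewness_kurtosis(top_heavy)

# Hypothetical symmetric item for comparison.
sym_skew, sym_kurt = skewness_kurtosis([1, 2, 3, 4, 5])

print(skew, kurt, sym_skew, sym_kurt)
```

The top-heavy vector yields negative skewness and kurtosis above three, the same qualitative pattern reported for the SET items, whereas the symmetric vector has zero skewness.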

Table 4 also indicates that the median (P50) in seven out of eight items is five (“a great deal”),

which means that at least 50% of students marked the highest response option available on the scale in

almost every item.

As reported in the histograms, items 7 and 8 are two extreme cases showing a negatively skewed

and leptokurtic shape (kurtosis over 3), indicating that the distribution of responses concentrates

towards the highest value, with a tall peak and a large tail towards the lowest values.

The distribution of SET scores follows the same pattern as the distribution of individual items,

with a mean between the two highest response options in the scale (M = 4.35), and small


standard deviation (SD = .77). The negative skewness reflects that the tail of the SET scores

distribution is longer towards the lower values on the response scale, and the kurtosis over three

indicates that the peak of the distribution is taller than the peak of a normal distribution.

4.2 Research Question 1

Based on visual examination and summary statistics of items and SET score, responses seem

consistent with acquiescence rather than disacquiescence, extreme, or midpoint response styles.

However, specific student-level response style frequency indexes are necessary to describe

systematic patterns of responses across items. Table 5 reports summary statistics (mean, standard

deviation, skewness, kurtosis, and percentiles 5, 25, 50, 75 and 95) for each response style index

(ARS, ARSR, DRS, ERS, and MRS).

Table 5

Summary statistics for response style indexes (N=5,921)

Variable Mean SD Skew Kurt P5 P25 P50 P75 P95

ARS 0.59 0.36 -0.28 1.61 0.00 0.25 0.62 1.00 1.00

ARSR 0.78 0.43 -2.28 7.63 -0.25 0.75 1.00 1.00 1.00

DRS 0.02 0.09 7.16 62.01 0.00 0.00 0.00 0.00 0.12

ERS 0.61 0.35 -0.30 1.67 0.00 0.25 0.62 1.00 1.00

MRS 0.09 0.16 2.17 7.90 0.00 0.00 0.00 0.12 0.50

Note: Skew = Skewness; Kurt = Kurtosis; P = Percentile; ARS = index of acquiescence response style;

ARSR = relative index of acquiescence response style; DRS = index of disacquiescence response style; ERS

= index of extreme response styles; MRS = index of midpoint response style.

Table 5 shows the students’ average value for each response style index. The first two indexes,

ARS (M = .59, SD = .36) and ARSR (M = .78, SD = .43) indicate that SET scores are

substantially affected by acquiescence. At least half of the students scored five out of eight items

(P50 = .62) using the highest response option in the scale (ARS), and at least half of the students

scored all eight items (P50 = 1) using either the highest or second highest response option

(ARSR). The proportion of students using exclusively the option “5” across all items is 30.10%,

and the percentage of students using the two highest response options across items is 64.52%.

Despite the high proportion of students systematically utilizing the highest or two highest options

in the response scale across items, not all students responded consistently with acquiescence


response style. However, the proportion of these students is small. For instance, the proportion of

students who scored every item using response options 1 to 4, without ever using option 5, is 11.26%.

Table 5 also indicates that the degree to which SET scores are affected by ERS is also high. The

students’ average ERS index is M = .61 (SD = .35) indicating that in general students scored

almost five items choosing either the highest (“a great deal”) or lowest (“not at all”) response

option.

Students’ average values for the other two response style indexes, DRS and MRS, are M = .02

(SD = .09) and M = .09 (SD = .16), respectively, indicating that SET scores are not affected by

these two types of response styles.

The index of ERS needs careful interpretation. ARS and ERS account for the proportion of

answers using the highest option in the response scale (plus the lowest response option in the

case of ERS). A logical conclusion is that the ARS index explains ERS. The Pearson's r

coefficient between ARS and ERS indexes is r = .97, p < .001, indicating a near perfect linear

relationship which leads to exclude ERS as affecting responses. As a reference, Table 6 reports

the correlation coefficients among all five response styles indexes.
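The near perfect correlation follows from the scoring in Table 3, in which the ERS scoring is the sum of the ARS and DRS scorings. For each student i, the three indexes are therefore linked by the identity below (the subscript i is notation introduced here):

```latex
ERS_i = ARS_i + DRS_i
```

Because the DRS index is close to zero for most students (M = .02), ERS is almost entirely determined by ARS, which is also why the mean of ERS (.61) equals the sum of the means of ARS (.59) and DRS (.02).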

Table 6

Correlation coefficients (lower triangle) and statistical significance level (upper triangle)

among response style indexes

        ARS     ARSR    DRS     ERS     MRS
ARS     –       0.00    0.000   0.00    0.00
ARSR    0.64    –       0.000   0.00    0.00
DRS     -0.27   -0.60   –       0.01    0.12
ERS     0.97    0.51    -0.03   –       .000
MRS     -0.59   -0.54   0.02    -0.61   –

Note: ARS = index of acquiescence response style; ARSR = relative index of acquiescence response style;

DRS = index of disacquiescence response style; ERS = index of extreme response styles; MRS = index of

midpoint response style.

In summary, findings reported in this section reveal that a significant proportion of students

show a systematic tendency to respond to SET items using the highest response options on the

scale, suggesting that SET scores are affected by acquiescence response style. Students seem not

influenced by other types of response styles such as disacquiescence or midpoint response style.


The high values of ERS index only reflect ARS and not the systematic use of the two extreme

response options.


4.3 Research Question 2

The report of findings pertaining to research question 2 includes summary statistics of response

styles indexes by department (A and B), program type (Master and Ph.D./Ed.D.), and session

(Summer 2014 to Winter 2016), and results from analysis of variance (ANOVA).

4.3.1 Summary Statistics

Table 7 presents mean and standard deviation for ARS, ARSR, DRS, ERS, and MRS across the

different conditions.

Table 7

Descriptive statistics of response styles indexes by measurement conditions

ARS ARSR DRS ERS MRS

M SD M SD M SD M SD M SD

Department: A 0.56 0.37 0.73 0.47 0.02 0.10 0.58 0.36 0.10 0.17

Department: B 0.64 0.35 0.83 0.37 0.01 0.07 0.65 0.34 0.07 0.15

Program: Master 0.58 0.37 0.76 0.45 0.02 0.09 0.60 0.35 0.09 0.16

Program: Ph.D./Ed.D. 0.67 0.33 0.86 0.31 0.01 0.05 0.68 0.33 0.07 0.13

Session: 2014 Summer 0.59 0.37 0.81 0.40 0.01 0.06 0.60 0.36 0.08 0.15

Session: 2014 Fall 0.59 0.36 0.79 0.42 0.02 0.10 0.61 0.34 0.08 0.15

Session: 2015 Winter 0.58 0.37 0.75 0.48 0.03 0.11 0.61 0.34 0.09 0.16

Session: 2015 Summer 0.61 0.36 0.80 0.42 0.01 0.08 0.63 0.36 0.08 0.16

Session: 2015 Fall 0.58 0.36 0.77 0.43 0.01 0.07 0.60 0.35 0.09 0.16

Session: 2016 Winter 0.60 0.37 0.77 0.43 0.01 0.07 0.61 0.35 0.09 0.17

Note: ARS = index of acquiescence response style; ARSR = relative index of acquiescence response style;

DRS = index of disacquiescence response style; ERS = index of extreme response styles; MRS = index of

midpoint response style.

Summary statistics from Table 7 suggest differences in ARS and ERS indexes across

measurement conditions. For instance, students in department A show lower values of ARS (M =

.56, SD = .37), ARSR (M = .73, SD = .47), and ERS (M = .58, SD = .36) than the values of ARS

(M = .64, SD = .35), ARSR (M = .83, SD = .37) and ERS (M = .65, SD = .34) found among

students in department B. Master students show lower values of ARS (M = .58, SD = .37) and

ARSR (M = .76, SD = .45) than Ph.D./Ed.D. students (ARS M = .67, SD = .33; ARSR M = .86,

SD = .31). Similar differences are observed in the case of ERS index. The average value of ERS


index among Master students (M = .60, SD = .35) is lower than among Ph.D./Ed.D. students (M

= .68, SD = .33).

The average value of ARS, ARSR, and ERS indexes across sessions are similar. Values of ARS

narrowly vary between .58 (2015 Winter) and .61 (2015 Summer). Values of ARSR vary

between .75 (2015 Winter) and .81 (2014 Summer). Values of ERS vary between .60 (2014

Summer, and 2015 Fall) and .63 (2015 Summer).

Regarding DRS and MRS indexes, average values by academic department, program type, and

session are approximately the same as those reported in section 4.2. Average values of DRS

index are no greater than .03 across conditions. Similarly, average values of MRS index are not

greater than .10 across conditions. Since the values of DRS and MRS indexes are very close to

zero in all the conditions analyzed, the subsequent analysis focuses on ARS, ARSR, and ERS.

4.3.2 ANOVA Results

Three three-way ANOVAs were conducted to examine the effects of the academic department,

the type of program, and the session on ARS, ARSR and ERS indexes respectively. The analysis

utilized an alpha level = .01 (probability of rejecting the null hypothesis when the null

hypothesis is true).

Results from ANOVA on ARS index indicate a statistically significant effect of department, F(1, 5920) = 26.75, p < .01, partial η2 < .01, and program type, F(1, 5920) = 14.44, p < .01, partial η2 < .01. There was no statistically significant effect of session, F(5, 5920) = 0.78, p = .56, partial η2 < .01, of the interactions between academic department x program type, F(1, 5920) = 6.23, p = .01, partial η2 < .01, academic department x session, F(5, 5920) = 0.91, p = .47, partial η2 < .01, and type of program x session, F(5, 5920) = 0.82, p = .43, partial η2 < .01, nor of the three-way interaction, F(5, 5920) = 1.16, p = .32, partial η2 < .01. Post hoc pairwise comparison of means across levels of department and academic program revealed that ARS was statistically significantly higher in department B than A, 95% CI [0.05, 0.11], and among Ph.D./Ed.D. students than Master students, 95% CI [0.03, 0.09].

ANOVA results on ARSR index are consistent with those reported for ARS. Academic department, F(1, 5920) = 24.70, p < .01, partial η2 < .01, and program type, F(1, 5920) = 8.63, p < .01, partial η2 < .01, had a statistically significant effect on ARSR. Post hoc pairwise comparison of means across levels of department and academic program revealed that ARSR was statistically significantly higher in department B than A, 95% CI [0.06, 0.14], and among Ph.D./Ed.D. students than Master students, 95% CI [0.02, 0.10]. Like the ANOVA results on ARS index, neither session, F(5, 5920) = 1.20, p = .30, partial η2 < .01, the interactions between academic department x type of program, F(1, 5920) = 2.84, p = .09, partial η2 < .01, academic department x session, F(5, 5920) = 0.90, p = .47, partial η2 < .01, and program type x session, F(5, 5920) = 0.60, p = .69, partial η2 < .01, nor the three-way interaction, F(5, 5920) = 1.52, p = .18, partial η2 < .01, had a statistically significant effect on ARSR.

ANOVA results on ERS indicate that academic department, F(1, 5920) = 23.80, p < .01, partial η2 < .01, and program type, F(1, 5920) = 11.11, p < .01, partial η2 < .01, had a statistically significant effect on ERS. Once again, post hoc pairwise comparison of means across levels of department and academic program revealed that ERS was statistically significantly higher in department B than A, 95% CI [0.05, 0.11], and among Ph.D./Ed.D. students than Master students, 95% CI [0.02, 0.09]. Neither session, F(5, 5920) = 0.85, p = .51, partial η2 < .01, nor any of the two-way interactions, academic department x program type, F(1, 5920) = 6.59, p = .01, partial η2 < .01, academic department x session, F(5, 5920) = 0.97, p = .45, partial η2 < .01, and type of program x session, F(5, 5920) = 0.77, p = .57, partial η2 < .01, nor the three-way interaction, F(5, 5920) = 1.09, p = .36, partial η2 < .01, had a statistically significant effect on ERS.

Findings from ANOVA provide no evidence to support that ARS, ARSR, and ERS indexes differ

over academic sessions (failed to reject the null hypothesis). Findings from ANOVA support that

ARS, ARSR, and ERS indexes differ across departments and type of program (rejected the null

hypothesis). However, the small proportion of variance accounted for by these two conditions, with

partial η2 less than .01 (or less than 1% of variability explained) indicates that there is no

practical significance of these statistically significant differences. Therefore, the degree to which

SET scores are affected by acquiescence and extreme response styles is consistent across the

measurement conditions examined.
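The size of partial η2 can be approximated from a reported F statistic and its degrees of freedom. The sketch below applies the standard conversion from F to partial eta squared, using the F values and degrees of freedom exactly as reported above for the ARS index; the function name is hypothetical, and the results are approximations for illustration, not output from the original analysis.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F statistics for ARS as reported in the text: department and program type.
eta_department = partial_eta_squared(26.75, 1, 5920)
eta_program = partial_eta_squared(14.44, 1, 5920)

print(round(eta_department, 4), round(eta_program, 4))
```

Both values fall well below .01, in line with the reported partial η2 < .01: with close to 6,000 observations, even an F statistic above 25 corresponds to less than half a percent of explained variability.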



4.4 Research Question 3

The section reports findings from linear regression analysis pertaining to differences in SET scores

between female and male teachers (part 1), and multiple regression analysis pertaining to the

moderator effect of acquiescence in such difference (part 2).

4.4.1 Part 1: Differences by Teacher’s Gender

Table 8 presents results from linear regression analysis of SET scores on teacher’s gender at the

individual (students), course (students’ responses aggregated by course) and teacher (students’

responses aggregated by teacher) level of analysis.

Table 8

Summary of linear regression analysis for testing teachers’ gender differences

                        Student-Level (N=5921)   Course-Level (N=462)   Teacher-Level (N=159)
Parameter               𝐵      SE     𝛽          𝐵      SE     𝛽        𝐵      SE     𝛽
Constant (𝐵0)           4.26   .01*   .*         4.26   .03*   .*       4.28   .05    .
Teacher’s Gender (𝐵1)   .12    .02*   .08*       .14    .04*   .15*     .09    .06    .11
R2                      .005*                    .023*                  .013
adjR2                   .005*                    .021*                  .006
F                       34.3*                    11.3*                  2.1

*p < .01.

Results from regression analysis at the student-level show that the average SET score for male

teachers is 𝐵0= 4.26 and that female teachers receive higher scores than male teachers as

indicated by a regression coefficient 𝐵1= .12. The difference between female and male teachers

is statistically different from zero (t = 5.86, p < .01) and the proportion of variance of SET scores

accounted for by teacher’s gender is also statistically different from zero (F(1, 5919) = 34.3, p < .01).

However, the practical significance of this difference is rather trivial as indicated by the

standardized regression coefficient (𝛽1=.08) and the proportion of explained variance (R2=.005).

Specifically, the magnitude of 𝛽1 falls below the threshold of a small effect (J. Cohen, 1988;

Ellis, 2010), and the proportion of variance of SET scores accounted for by teacher’s gender is less

than 1%.

Results from regression analysis at the course-level are similar to those reported at the student-level.

The average SET score for male teachers is 𝐵0= 4.26. Female teachers received higher scores

than male teachers (𝐵1= .14). The difference in SET scores between female and male teachers at

the course-level is statistically different from zero (t = 3.36, p < .01), and the proportion of

variance of SET scores accounted for by teacher’s gender is also statistically different from zero

(F(1, 460) = 11.3, p < .01). The magnitude of the difference at the course-level is higher than at

the individual level as suggested by the standardized regression coefficient (𝛽1 = .14) and the

2.1% of variance of SET scores accounted for by teacher’s gender (R2 = .021). In this case, the

practical significance of the difference in SET scores by teacher’s gender at the course-level is

small.

Finally, results from regression analysis at the teacher-level slightly depart from those reported at

the student and course-level. Consistent with the previous results, at the teacher-level the

average SET score for male teachers is 𝐵0 = 4.28, and female teachers receive higher scores than

male teachers (𝐵1 = .09). However, the difference in SET scores between female and male

teachers is not statistically different from zero, as indicated by the test of the null hypothesis for the

regression coefficient (t = 1.44, p = .15). The proportion of variance of SET scores explained by

teacher’s gender is also not statistically different from zero (F(1, 157) = 2.01, p = .15). However,

the practical significance of the difference in SET scores by teacher’s gender at the teacher-level

is equivalent to the one obtained at the course-level and can be considered small.
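As a quick consistency check on the values reported for Table 8: in a regression with a single predictor, the overall F statistic equals the square of the t statistic for the slope. The minimal Python sketch below applies that identity to the t values given in the text; the dictionary layout is an illustrative assumption, and small gaps reflect rounding in the reported values.

```python
# t statistics for teacher's gender reported in the text, and F values from Table 8.
reported = {
    "student": {"t": 5.86, "F": 34.3},
    "course": {"t": 3.36, "F": 11.3},
    "teacher": {"t": 1.44, "F": 2.1},
}


def f_from_t(t):
    """With one predictor, the model F statistic is the squared slope t statistic."""
    return t ** 2


for level, stats in reported.items():
    implied_f = f_from_t(stats["t"])
    print(f"{level}: t^2 = {implied_f:.2f}, reported F = {stats['F']}")
```

The squared t values agree with the tabled F values to within rounding at all three levels of analysis.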

In summary, results from regression analysis indicate that there is a statistically significant

difference in SET scores in favor of female teachers over male teachers at the student and

course-level of analysis. Additionally, the level of analysis seems to affect the difference in SET

scores between female and male teachers, which is highest at the course-level of analysis.

Regardless of the level of analysis, the magnitude of the difference suggests a small practical

significance of the difference in SET scores by teacher’s gender.

4.4.2 Part 2: ARS Moderator Effect

Table 9 reports the results from multiple linear regression analysis to test the moderator effect of

acquiescence on the difference in SET scores between female and male teachers at the individual

(students), course (students’ responses aggregated by course) and teacher (students’ responses

aggregated by teacher) levels of analysis. For the sake of completeness, Table 9 presents all the

relevant information from multiple regression analysis. However, the focus of this section is on

the difference in SET scores between female and male teachers after statistically controlling for

acquiescence (𝐵1) and, more importantly, the interaction term (𝐵3) as the operationalization of the

moderator effect of acquiescence.

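Table 9

Summary of multiple linear regression analysis for testing moderator effect

Sample sizes: Student-Level N = 5921; Course-Level N = 462; Teacher-Level N = 159.

| Parameter | Student 𝐵 | Student SE | Student 𝛽 | Course 𝐵 | Course SE | Course 𝛽 | Teacher 𝐵 | Teacher SE | Teacher 𝛽 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Constant (𝐵0) | 4.34 | .00* | .* | 4.37 | .01* | .* | 4.35* | .01 | .* |
| Teacher’s Gender (𝐵1) | .01 | .01* | .01* | .00 | .02* | .00* | .00* | .02 | .00* |
| ARSc (𝐵2) | 1.89 | .02* | .89* | 2.16 | .07* | .98* | 2.18* | .10 | 1.00* |
| Interaction (𝐵3) | -.10 | .02* | -.04* | -.19 | .08* | -.07+ | -.19* | .12 | -.07* |
| R2 | .74* | | | .86* | | | 0.89* | | |
| adjR2 | .74* | | | .86* | | | 0.88* | | |
| F | 5845.4* | | | 980.6* | | | 419.2* | | |

*p < .01. +p < .05.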

Results at the student-level show that the difference in SET scores between female and male

teachers (𝐵1 = .01) is not statistically different from zero (t = -1.65, p = .10) when the level of

acquiescence is average. The magnitude of the standardized regression coefficient (𝛽1 = .01)

suggests no practical significance of this difference. The interaction between acquiescence and

teacher’s gender is statistically different from zero (t = -3.43, p < .01). Specifically, when the

degree of acquiescence increases by one unit, the difference in SET scores between female and

male teachers changes by 𝐵3 = -.10.

The moderator effect of acquiescence at the individual level is represented graphically in Figure 2.

The two-way graph presents predicted values of SET scores against the degree of acquiescence

for female and male teachers (separate lines). A gray horizontal line indicates the grand-mean of

SET scores across students. The graph suggests that the difference in SET scores in favor of

female teachers increases at lower values of acquiescence, and the difference in SET scores in

favor of female teachers decreases at higher values of acquiescence.

Figure 2

Moderator effect of acquiescence response style at the student-level

Results from multiple regression analysis at the course-level are consistent with those reported at

the individual level. The difference in SET scores between female and male teachers is not

statistically different from zero (t = -0.15, p =.88) when the level of acquiescence is average. The

interaction between acquiescence and teacher’s gender is statistically different from zero

(t = -2.35, p < .05). Specifically, when the degree of acquiescence increases by one unit, the difference in

SET scores between female and male teachers changes by 𝐵3 = -.19, a larger moderation

effect than the one reported at the individual level.

The moderator effect of acquiescence at the course level is presented graphically in Figure 3. The

difference in SET scores in favor of female teachers increases at lower values of acquiescence,

and the difference in SET scores between female and male teachers decreases and is reversed at

higher values of acquiescence.
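The moderation result can be read as a conditional difference. Assuming the model behind Table 9 has the usual form SET = B0 + B1·Gender + B2·ARSc + B3·(Gender × ARSc), with Gender coded 0 for male and 1 for female teachers and ARSc a mean-centered acquiescence index (a coding the text implies but does not state here), the female-minus-male difference at a given ARSc is B1 + B3·ARSc. The sketch below plugs in the rounded course-level coefficients; the ARSc values are arbitrary illustration points, not values taken from the data.

```python
def gender_gap(b1, b3, ars_centered):
    """Predicted female-minus-male difference in SET score at a centered ARS value.

    Assumes SET = b0 + b1*gender + b2*ars_c + b3*gender*ars_c with gender = 1 for female.
    """
    return b1 + b3 * ars_centered


# Rounded course-level coefficients from Table 9.
B1_COURSE, B3_COURSE = 0.00, -0.19

# Illustrative centered acquiescence values (hypothetical, chosen only to show the pattern).
for ars_c in (-0.5, 0.0, 0.5):
    gap = gender_gap(B1_COURSE, B3_COURSE, ars_c)
    print(f"ARSc = {ars_c:+.1f}: predicted gap = {gap:+.3f}")
```

With these rounded coefficients the predicted gap is positive below the mean of acquiescence and negative above it, which matches the reversal described for the course level.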

Figure 3

Moderator effect of acquiescence response style at the course-level

Results from multiple regression analysis at the teacher level also suggest no statistically

significant difference in SET scores between female and male teachers when the level of

acquiescence is average (t = 0.09, p=.09). Although the size of the regression coefficient of the

interaction (𝐵3= -.19) is the same as the one obtained at the course-level, the interaction term

is not statistically different from zero (t = -1.54, p=0.12), probably because of the smaller number

of teachers compared to the number of courses and students, and the higher standard error

(SE).

In summary, at the individual, course, and teacher levels of analysis, comparisons of SET

scores between female and male teachers differ before (part 1) and after statistically controlling

for acquiescence (part 2). Specifically, when acquiescence is held constant, differences by gender

of the teacher are not statistically different from zero. Furthermore, the degree of acquiescence

moderates the differences in SET scores between female and male teachers. At lower values of

acquiescence, female teachers receive higher SET scores than male teachers. At higher values of

acquiescence, the difference is reduced and inverted in favor of male teachers at the course-level

of analysis. In other words, a high level of acquiescence hides differences in SET scores favoring

female teachers over male teachers. The moderator effect of acquiescence is statistically

significant at the student and course level of analysis but not at the teacher level of analysis.

The practical significance of acquiescence as moderator of the difference in SET scores by the

gender of the teacher is discussed separately in the next subsection. Additionally, the level of

analysis seems to affect the statistical conclusion but not the practical significance conclusion

regarding the moderator effect of acquiescence.

4.4.3 Practical Significance

A challenge in assessing the practical importance of the moderator effect of acquiescence in the

context of SET is the lack of reference points against which to compare the results of the study. Of the few

studies examining response styles in SET scores, none has reported moderation

effects.

Following general guidelines for ascertaining practical significance of standardized regression

coefficients, the magnitude of the interaction at the course-level (|𝛽3| = .07) does not reach the

threshold for interpreting the effect as small (L. Cohen, Manion, & Morrison, 2007), leading to

the conclusion that the effect has no practical significance. Using eta squared as a measure

of practical significance, the interaction term at the course level accounts for 1% of the variance

of SET scores (𝜂2 = .01), and is also lower than the value for interpreting such an effect as small (J.

Cohen, 1988). Similarly, using Cohen’s f squared, the magnitude of the interaction between

acquiescence and teacher’s gender (𝑓2 = .015) falls below the threshold for interpreting the

effect as small (J. Cohen, 1988). All three indexes of practical significance suggest that the

statistically significant moderator effect is irrelevant. However, Ellis (2010) and Kenny (2015)

point out that the average size of the moderator effect for categorical variables (but not

continuous variables) across research in Psychology as measured by Cohen’s 𝑓2 is .002

(Aguinis, Beaty, Boik, & Pierce, 2005). In this context, Kenny (2015) suggests a more realistic

standard of practical significance for Cohen’s f squared of 0.005, 0.01, and 0.025 for small,

medium, and large effects. With little information to compare findings, the study proposes that

an interaction effect of 𝑓2=.015 has a medium practical significance when considering that

differences in SET scores between female and male teachers can change substantively at low

degrees versus high degrees of acquiescence at the course level as seen in Figure 3.
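The contrast between the two sets of benchmarks can be made explicit. The sketch below classifies an f squared value under the conventional Cohen (1988) cut-offs of .02, .15, and .35 and under the values attributed to Kenny (2015) in the text; the function name and the labels are illustrative assumptions.

```python
COHEN_1988 = (0.02, 0.15, 0.35)      # conventional small, medium, large cut-offs for f^2
KENNY_2015 = (0.005, 0.01, 0.025)    # cut-offs for moderation effects cited in the text


def classify_f2(f2, cutoffs):
    """Return the label of the largest benchmark that f2 reaches."""
    small, medium, large = cutoffs
    if f2 >= large:
        return "large"
    if f2 >= medium:
        return "medium"
    if f2 >= small:
        return "small"
    return "below small"


f2_interaction = 0.015  # course-level interaction effect reported above
print("Cohen (1988):", classify_f2(f2_interaction, COHEN_1988))
print("Kenny (2015):", classify_f2(f2_interaction, KENNY_2015))
```

The same value falls below the small threshold under the conventional cut-offs and in the medium band under the moderation-specific ones, which is the basis for the interpretation proposed here.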

Overall, the evidence presented in the study suggests that acquiescence response style can hide

real differences in SET scores between female and male teachers and that the degree of

acquiescence affects comparisons of SET scores by teacher’s gender.

Chapter 5

Discussion

Chapter 5 addresses three aspects of the findings reported in the study pertaining to the examination

of response styles in the context of SET. The first section presents a summary of findings and

discusses their implications for SET developers and users (Section 5.1). The second section

reviews six alternative interpretations of the results other than response styles, and rationally

discusses their plausibility (Section 5.2). The last part describes the limitations in the study

affecting interpretation and proposes guidelines for future research (Section 5.3).

5.1 Summary and Implications

SET developers often rely on the convenience of summated rating scales for asking students

about teaching quality in post-secondary education institutions. Response styles are well-

documented sources of construct-irrelevant variance that can influence the interpretation and use

of scores from summated rating scales. However, research ruling out response styles as evidence

to support the validity of SET scores is scarce and flawed. Therefore, the main topic covered in

the study is the extent to which SET scores obtained from graduate students enrolled at a teacher

education institution are affected by response styles.

Analysis of SET data revealed that a high proportion of students systematically endorsed the

highest option in the response scale across SET items, a pattern consistent with acquiescence

response style. The analysis revealed that the high degree of extreme response style reflects

acquiescence and not the tendency to choose the two extreme options on the response scale. The

analysis showed that disacquiescence and midpoint response styles do not affect SET scores.

The literature review provided two examples of studies reporting acquiescence in the context of

SET. Using a non-standard index of acquiescence, Spooren et al. (2012) reported that only 8.4%

of the students selected “Yes” answers (“rather agree,” “agree,” or “totally agree”) to 10 or more

SET items out of 15. The comparable result in this study is 80.34%. Richardson (2012) reported

an index of ARSR =.30 (SD = .38) from the administration of the Course Experience

Questionnaire, and an index of ARSR = .28 (SD = .19) from the administration of the Revised

Approaches to Studying Inventory. The comparable result in this study is ARSR = .78 (SD = .43)

(Table 5). Overall, the average value of acquiescence reported in this study is higher than in the

two previous examples.

The study examined differences in the degree to which SET scores are affected by response

styles across three measurement conditions available in the SET data: academic department, type

of program, and session. The descriptive analysis suggested no differences in the degree of

disacquiescence and midpoint response style across measurement conditions. ANOVA indicated

higher degrees of acquiescence and extreme response styles in department B versus department

A, and among Ph.D./Ed.D. students versus Master’s students, and no statistically significant effect

of session or of the interaction among the three conditions. Although differences by department

and program type are statistically significant, the practical significance of these differences is

rather trivial, and the conclusion is that response styles are consistent across the measurement

conditions examined.

Literature suggests that response styles are consistent within the course of a questionnaire

measuring different constructs (Kam & Zhou, 2015; Plieninger, 2016; Wetzel, Böhnke, et al.,

2016). Literature also suggests that response styles are stable across time (Billiet & Davidov,

2008; Weijters, Geuens, & Schillewaert, 2010; Wetzel, Lüdtke, Zettler, & Böhnke, 2016).

Overall, the study complements prior evidence and suggests that response styles are consistent

across measurement conditions. These findings provide no hypothesis to devise methods for

controlling and minimizing response styles. Future research needs to examine differences in the

degree of response styles across other measurement conditions.

The last topic explored in the study is the extent to which response styles affect the subsequent

use of SET data, specifically the relationship between SET scores and other variables. Before

statistically controlling the effect of acquiescence, SET scores are slightly higher for female

teachers than male teachers. After statistically removing acquiescence, there is no statistically

significant difference in SET scores between female and male teachers. The moderation analysis

showed that acquiescence changes the difference in SET scores between female and male

teachers. The practical significance of the moderator effect of acquiescence is small using

realistic criteria for interpreting effect size indexes.

In general, the findings presented in the study do not rule out the plausibility of the influence

of acquiescence response style on SET scores. On the contrary, the findings suggest that SET

scores might reflect simultaneously teaching quality and acquiescence response style and that

SET scores might overestimate the actual level of teaching quality. The confound is consistent

across departments, program type, and sessions, and affects the use of SET scores in subsequent

statistical analysis.

In addition to the presence of construct-irrelevant variance, SET scores are based on only six

items measuring three types of tasks of teaching, suggesting construct-underrepresentation of

good teaching. The instrument includes two items measuring successful teaching, a construct

whose validity shows little empirical support.

5.1.1 Implications

The most severe negative consequence of acquiescence response style in SET scores is the

impossibility of determining the true level of teaching quality based on students’ report because

acquiescence affects measures of central tendency (overestimation) and dispersion (narrowing

the distribution of scores, range restriction).

When considering the use of SET scores for formative purposes, an instructor would interpret a

high evaluation score as purely reflecting his/her teaching ability, a common belief among users

of summated rating scales. However, SET scores seem influenced to a considerable extent by a

student tendency unrelated to teaching quality.

The overestimation and range restriction of SET scores limit the utility of scores for informing

teaching improvement as intended originally by the instrument developer. For instance, a teacher

receiving a report indicating that all the attributes of teaching quality are at the highest possible

level can hardly use the information for teaching improvement because there is no attribute to

improve. Similarly, an instructor may not find useful the comparison of his/her results with the

results from colleagues because most of the other teachers would also show the highest level of

teaching quality. Therefore, the study concludes that SET scores should not be utilized for

formative purposes.

Acquiescence affects not only formative uses but also summative uses of SET scores. For

instance, a ranking of teachers based on raw scores and scores after controlling for acquiescence

response style differs substantively in the SET data from this study. The Spearman rank

correlation coefficient between these two scores is rs = -.16, p = .00. For instance, administrators

would wrongly believe that a teacher with a certain score above a pre-established limit (cut-off)

possesses higher teaching ability, possibly qualifying for recognition or promotion based on

inaccurate teaching evaluation scores. After controlling for acquiescence, the same teacher would

likely not pass the same cut-off.

The main implication for current users of scores is caution when utilizing SET for formative and

summative purposes. For instance, users could utilize scores for identifying only teachers in high

need of professional development. Users could also utilize scores to identify exemplary teachers.

Until complementary evidence of validity based on content and response process becomes

available, users should refrain from making summative decisions based on scores. In that regard,

institutions can reduce the weight of SET in personnel and administrative decisions, reducing the

consequences associated with the use of SET scores.

Response styles can also artificially increment reliability coefficients (James, Demaree, & Wolf,

1984; Wetzel, Böhnke, et al., 2016). Cronbach’s alpha coefficient in the study shows that the

estimated lower bound of the reliability of the SET scale is .93, indicating a high internal

consistency but also suggesting redundancy among items.
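For reference, Cronbach's alpha for a k-item scale is k/(k - 1) × (1 - sum of item variances / variance of the total score). A minimal Python sketch follows; the response matrix is invented for illustration and is not the study's data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    items = list(zip(*rows))                       # one tuple of scores per item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical responses from five students to three items on a 1-5 scale.
responses = [
    [5, 5, 4],
    [4, 4, 4],
    [5, 4, 5],
    [3, 3, 2],
    [4, 5, 4],
]
print(round(cronbach_alpha(responses), 3))
```

Because alpha rises as items covary more strongly, any shared tendency to pick the same option across items, including acquiescence, pushes the coefficient upward regardless of its source.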

Acquiescence jeopardizes the inferences regarding relationship to other variables (Paulhus, 1991;

Viswanathan, 2005). In the case of the study, differences in SET scores between female and male

teachers are higher before statistically removing acquiescence. Probably, acquiescence produced

range restriction due to summative error and introduced correlational error, producing the

reported moderator effect. The results imply that minimizing acquiescence would increment the

difference in SET scores in favor of female teachers. Such difference would affect the formative

and summative use of SET scores and encourages a careful examination of instrument content,

response process, and theory sustaining the development of the SET instrument.

Finally, acquiescence can explain the observed relationship between SET scores and irrelevant

variables. Therefore, research on SET validity contributing evidence based on relationship to

other variables including discriminant evidence needs first to rule out the influence of response

styles (or any other source of construct-irrelevant variance) from SET scores under the risk of

providing non-realistic estimates of the size, statistical and practical significance of these

relationships.

Findings call for caution in the interpretation and utilization of SET scores and raise a relevant

concern that needs subsequent examination by SET developers and users.

5.1.2 Recommendations

Proving that a scale does not measure bias (construct-irrelevant variance) is essential in

validation (Spector, 1992). Therefore, the examination of response styles and other sources of

construct-irrelevant variance is a primary task to ensure that interpretation and use of scores are

valid: “proliferation of tests of high sounding psychological constructs in disregard of response

bias [styles] is a conspicuous waste of research” (Loevinger, 1959, p. 306).

The first and most relevant recommendation is to base the development and

validation of SET summated rating scales on sound theory of teaching and learning.

Simultaneously, the development of SET summated rating scales should adhere to standard

procedures in the measurement field. A first step in test-development is the definition of the

target construct and the creation of test specifications (or blueprint) that identify levels of

performance according to the intended use of scores. A second step is the development of items

based on test specifications. In this regard, a preliminary pilot testing of items should be

conducted. Cognitive interviews and think-aloud protocols can inform about the response

process and potential sources of construct-irrelevant variance in responses to items. A third step

is item analysis to ensure that items have appropriate difficulty and discrimination levels.

Finally, the definition of cut-off scores should rely on standard-setting methods.

A second recommendation is a note of caution in the interpretation of scores from

summated rating scales as a measure of teachers’ actual level of performance. As mentioned in

Section 1.2, items from summated rating scales do not have a right or wrong answer, and they are

not appropriate for inferring performance or ability (Spector, 1992). Therefore, SET scores are

not a perfect measure of teacher’s performance or teaching ability. SET scores are proxies of true

teaching quality as reported by students, and the accuracy of SET scores (from test-criterion

evidence) is still unknown. In the interpretation of SET scores, users should also keep in mind

that defining and measuring teaching quality is a difficult task, and that SET scores will contain

measurement error, either random error or construct-irrelevant variance. These two limitations in

the use of summated rating scales in the context of teaching evaluation discourage the utilization

of SET scores for summative decisions and accountability when there is limited validity

evidence to support the use of scores for those types of decisions.

A third recommendation relates to the implementation of methods for minimizing and

controlling response styles. Specifically, methods for controlling and reducing acquiescence

should address the student, the instrument, and the conditions of measurement.

SET developers can enhance students’ competence in the use of SET summated rating scales in

different ways. Examples are developing a students’ scoring manual, explaining the relevance of

SET scores to students, and encouraging their intelligent and careful participation (Kingsbury,

1922). Asking students to provide scores for clearly stated formative purposes in low-stakes

contexts can help avoid score inflation, reduce strong satisficing and encourage effort and

motivation for optimizing responses.

There is no consensus on how to counteract response styles by manipulating instrument or

measurement conditions without introducing other types of construct-irrelevant variance in

scores (Wetzel, Böhnke, et al., 2016). An example is the popular recommendation of balancing

positively and negatively worded items to counteract acquiescence. This strategy can affect

respondents’ accuracy due to higher task demands in item interpretation. The use of positively and

negatively worded items also introduces a method effect that can change the internal structure and

reliability of scales (Zhang & Savalei, 2015).

Three feasible strategies to minimize acquiescence by manipulating instrument features are 1)

positively packing the response scale (Lam & Klockars, 1982), 2) the use of a wide scale (Lam

& Stevens, 1994) and 3) the use of the expanded format (Zhang & Savalei, 2015). These three

strategies can lower mean scores from summated rating scales. The efficacy of these three

strategies in the context of SET rating scales is unknown and demands further research.

A positively packed scale is a response scale in which the labels (or anchors) are not equally

spaced. An example of an equally spaced scale is the use of the labels poor, needs improvement,

satisfactory, quite good, and excellent. A positively packed version of the same scale would

utilize the labels poor, fair, good, very good, excellent (Lam & Klockars, 1982). In the example,

only two anchors cover the distance between the first anchor and midpoint (poor, fair), and four

anchors cover the distance between the midpoint and the last anchor (fair, good, very good,

excellent).

A wide scale encompasses the use of semantically broader labels or anchors in the response scale

maintaining the same number of options. The current instrument utilizes “not at all” and “a great

deal” as the lowest and highest response options. A wider response scale can utilize “never”

(replaces “not at all”), and “always” (replaces “a great deal.”)22.

The expanded format involves presenting each item and response option simultaneously as one

statement. The following example presents the expanded format with a positively packed scale

and wide scale anchors using an item from the SET summated rating scale presented in the

study:

• “I never found the course intellectually stimulating.”

• “I sometimes found the course intellectually stimulating.”

• “I often found the course intellectually stimulating.”

• “I almost every time found the course intellectually stimulating.”

• “I always found the course intellectually stimulating.”

Another recommendation for developers is the inclusion of additional items to measure the level

of acquiescence in responses. An acquiescence frequency index based on responses to a specific

scale can reflect the level of acquiescence independently from responses to SET items. The

literature recommends at least 30 items of diverse content, with an equal number of positively and

negatively worded items, to properly measure acquiescence (Kam, 2015). Background questions

asking students about the level of relevance, effort and honesty in responses and questions

targeting potential reasons to increase scores (inducements, power relationships, level of

consequence of scores) can also suggest and explain the presence of acquiescence.

22 Notice that frequency anchors fit the report of students’ experience in the course better than the original anchors.

One last method for controlling and minimizing response styles is utilizing statistical methods,

for instance, linear regression analysis (Webster, 1958) as utilized in this study. Regression

analysis makes it possible to statistically remove or control for acquiescence in SET scores. The predicted

value of SET scores would reflect the expected level of teaching quality when the level of

acquiescence is constant.

Item response theory (IRT) allows the prediction of teaching quality while statistically controlling for

the level of acquiescence, just like in the case of regression analysis. However, IRT implies a

higher level of precision in the estimation of teaching quality and acquiescence at the expense of

increased mathematical and computational complexity.

IRT “models the probability of ticking a certain response option as a function of the underlying

latent variable” (Van Vaerenbergh & Thomas, 2013, p. 207). IRT assumes that the production of

responses depends on the interaction between the student (or measurement object) and an item

(measurement agent). Responses depend on the level of the trait of the person (called person’s

position) and the difficulty of the item (Wu, Adams, Wilson, & Haldane, 2007). For instance,

students should provide higher SET scores if they experience higher levels of teaching quality.

IRT assumes that responses are only explained by the level of teaching quality, in other words,

that responses are unidimensional. These assumptions are the foundation of the mathematical

formulas utilized in IRT models (Chiang, Green, & Cox, 2009). Section b in Figure 4 illustrates

the unidimensionality assumption of SET scores.

A specific example of an IRT model in the context of diagnosing response styles is the

Multidimensional Rating Scale Model (MRSM). MRSM can estimate teaching quality and

acquiescence using the same or different items (Wetzel, Böhnke, et al., 2016; Wetzel & Carstensen,

2015). The rating scale (RS) model is an extension of the one-parameter IRT model for

dichotomous items to responses to summated rating scales that have in common a

multiple-category response format. An example of such response format is the Likert-type scale,

commonly used in SET summated rating scales. The multidimensional item response model

allows the measurement of multiple latent variables underlying a multidimensional test (Wu et

al., 2007). In a multidimensional within-item response model, responses to a single item can

reflect two or more latent variables.
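To make the rating scale model concrete, the sketch below computes category probabilities for one item under the unidimensional rating scale model, in which the probability of category x is proportional to the exponential of the sum, over the steps up to x, of (theta - delta - tau_j), with the empty sum for the lowest category equal to zero. It does not implement the multidimensional extension with an acquiescence dimension, and the parameter values are hypothetical.

```python
from math import exp


def rsm_probabilities(theta, delta, taus):
    """Category probabilities under the rating scale model.

    theta: person position; delta: item difficulty (location);
    taus: threshold parameters shared across items, one per step between categories.
    """
    cumulative = [0.0]                             # exponent for the lowest category
    for tau in taus:
        cumulative.append(cumulative[-1] + (theta - delta - tau))
    weights = [exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical five-category item: four thresholds, one item location.
taus = [-1.5, -0.5, 0.5, 1.5]
for theta in (-1.0, 0.0, 2.0):
    probs = rsm_probabilities(theta, delta=0.0, taus=taus)
    print(theta, [round(p, 3) for p in probs])
```

As the person position rises relative to the item location, probability mass shifts toward the higher categories, which is the behavior the intended measurement model attributes to teaching quality alone.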

Figure 4

(a) The intended SET measurement model and (b) a rival measurement model with acquiescence

response style (ARS)

TQ=Teaching Quality; ARS=Acquiescence Response Style

In the case of this study, the MRSM allows the measurement of teaching quality and

acquiescence simultaneously using the same eight items included in the instrument (represented

in Section b in Figure 4). MRSM allows the operationalization of acquiescence using the same

definition utilized for the calculation of frequency indexes.

MRSM offers a method for determining which model (intended, section a in Figure 4, or rival,

section b in Figure 4) better reproduces the observed relationships in the data. The selection of

the model with better fit is conducted by observing deviance and Akaike’s Information Criterion

(AIC) statistics (Wu et al., 2007). Lower values of deviance and AIC indicate the relatively better fitting

model. The likelihood ratio test is a 𝜒2 test of the difference in deviance between two competing

models: the null hypothesis is that model 1 (rival) fits the data as well as model 2 (intended)

(Osteen, 2010; Wu et al., 2007).
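The comparison can be sketched with the standard definitions: AIC equals the deviance plus twice the number of estimated parameters, and the likelihood ratio statistic equals the difference in deviance between nested models, with degrees of freedom equal to the difference in parameter counts. The sketch assumes the intended model is nested within the rival model; the deviances and parameter counts are hypothetical placeholders, not estimates from this study.

```python
def aic(deviance, n_params):
    """Akaike's Information Criterion from a deviance (-2 log-likelihood)."""
    return deviance + 2 * n_params


def likelihood_ratio(deviance_simple, params_simple, deviance_complex, params_complex):
    """Return the chi-square statistic and degrees of freedom for nested models."""
    return deviance_simple - deviance_complex, params_complex - params_simple


# Hypothetical fit results for the intended (simpler) and rival (more complex) models.
intended = {"deviance": 10250.0, "n_params": 12}
rival = {"deviance": 10180.0, "n_params": 14}

chi2, df = likelihood_ratio(intended["deviance"], intended["n_params"],
                            rival["deviance"], rival["n_params"])
CRITICAL_01_DF2 = 9.21  # chi-square critical value for alpha = .01 with df = 2

print("AIC intended:", aic(**intended))
print("AIC rival:", aic(**rival))
print("LR chi-square:", chi2, "df:", df, "exceeds critical value:", chi2 > CRITICAL_01_DF2)
```

In this made-up case both criteria point the same way: the rival model has the lower AIC and the drop in deviance exceeds the critical value, so the equal-fit hypothesis would be rejected.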

Two recommendations for policy makers can contribute to increasing the validity of SET in the

long term: 1) the creation of a task force on teaching standards; 2) the creation of standards of

teaching evaluation in postsecondary education. Teaching standards address the issue of a lack of

theory about teaching and learning in post-secondary education. Teaching standards covering all

the important aspects of teaching and learning in post-secondary education settings can guide the

development of test specifications for SET instruments and other measures of teaching quality.

Teaching standards can also guide teacher training and professional development. Standards for

teaching evaluation in postsecondary education would help develop valid and fair teacher

evaluations and reduce the lack of expertise on educational measurement (Onwuegbuzie, Daniel,

& Collins, 2009). These standards should rely on recent measurement theory and a current

definition of validity. An example of teaching evaluation standards in the K-12 context is “The

personnel evaluation standards: how to assess systems for evaluating educators” (Joint

Committee on Standards for Educational Evaluation, 2009).

A final recommendation for researchers is against the reification of SET validity evidence from

literature reviews and other sources. Contrary to the claim that empirical evidence shows that

SET scores are valid (Olivares, 2003; Ory, 2001; Theall & Franklin, 2001), the evidence

supports that SET scores can be valid under certain conditions (Marsh & Roche, 1997). The

generalization of validity findings from individual studies to legitimate the use of other

instruments or the same instrument on different populations and measurement conditions often

occurs in the SET literature (Johnson, 2000). The opposite reaction, denying the validity of SET scores

based on findings from individual studies, as in Boring et al. (2016) and Stark & Freishtat

(2014), should also be avoided. Such generalization from individual studies is inconsistent with current and accepted

definitions of score validity and validation (Messick, 1989). SET developers and users should

continuously examine and report validity evidence for each specific measurement instance that

involves not only the use of different items but also differences in populations and settings.

5.2 Alternative Interpretation of Findings

The findings presented in the study suggest, but do not prove, the existence of processes not

relevant to the intended interpretation and use of SET scores as a measure of teaching quality for

formative and summative decisions. Limitations inherent to the study design (discussed later)

demand complementary types of evidence to fully understand “why” a substantial proportion of

students endorsed the highest response option in the scale across items.

The alternative interpretations discussed here are 1) a high level of teaching quality, 2) construct-

underrepresentation problem, 3) ceiling effect, 4) influence of survey mode, 5) strong satisficing,

and 6) evaluation goals influenced responses. These alternative interpretations can either

challenge or complement the interpretation of findings presented in the study and help inform

future research.

5.2.1 High Level of Teaching Quality

A first alternative interpretation is precisely the one that the study attempts to challenge: that the

actual level of teaching quality is high, students provided responses exclusively based on

content, and observed scores reflect true score.

The above alternative interpretation faces the same limitation as the present study. Without

complementary validity evidence addressing why students consistently endorsed the highest

response option across items, the claim that responses are based exclusively on content without

the influence of irrelevant processes is tentative.

On the contrary, there is at least one reason that makes this alternative interpretation problematic.

Specifically, there would be no need for evaluation when teaching quality is expected to show a

negatively skewed distribution caused by students endorsing the highest response options across

items. An examination of such score distribution suggests that there is no attribute of teaching

quality to improve and that most of the teachers show a similarly high level of teaching quality.

Therefore, a substantial proportion of students endorsing the highest response options across

items is coherent with neither formative nor summative decisions, and for this

reason, interpreting observed SET scores as reflecting true teaching quality contradicts

the proposed use of scores.

5.2.2 Construct Underrepresentation

Another alternative interpretation is that observed scores reflect true teaching ability but content

includes aspects of teaching quality easy to achieve for most teachers. The previous possibility

refers to construct underrepresentation, another problem that reduces score validity. As

demonstrated in Section 3.3, the instrument includes few items measuring the different acts of

teaching (logical, psychological, and moral), suggesting construct underrepresentation and

recommending a careful interpretation of SET scores as a partial measure of teaching quality.

Therefore, construct-underrepresentation is a plausible alternative interpretation of the pattern of

responses reported in the study that also affects the intended use of scores for formative and

summative decisions.

5.2.3 Ceiling Effect

Ceiling effect (Hessling, Traxel, & Schmidt, 2004; Masino & Lam, 2014) is a third alternative

interpretation of the pattern of responses observed in the study. The principal difference between

ceiling effect and acquiescence response style is that the first (ceiling effect) attributes the

observed data pattern to instrument issues and assumes that responding reflects the target

construct, whereas the second (acquiescence) attributes the observed data pattern to processes

not related to the content.

Low item difficulty, meaning that items are easy to endorse, can cause ceiling effect either due to

content (for instance, construct underrepresentation, as discussed above) or by features of the

response scale, for example, inappropriate format with few or not properly labeled anchors.
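
A rough screen for a possible ceiling effect on a single item is to look at the share of responses at the highest option together with the skewness of the distribution. The sketch below is only illustrative: the responses are hypothetical, a 5-point scale is assumed, and the function name ceiling_summary is invented here. A high share at the top option and a negative skewness do not by themselves separate a ceiling effect from acquiescence or from genuinely high teaching quality.

```python
from statistics import mean, pstdev


def ceiling_summary(scores, max_option):
    """Return the share of responses at the top option and the skewness."""
    n = len(scores)
    share_at_ceiling = sum(1 for s in scores if s == max_option) / n
    m, sd = mean(scores), pstdev(scores)
    # Population skewness: mean cubed standardized deviation (0 if no variance).
    skew = 0.0 if sd == 0 else sum(((s - m) / sd) ** 3 for s in scores) / n
    return share_at_ceiling, skew


# Hypothetical responses to one item on a 5-point scale.
item = [5, 5, 5, 4, 5, 3, 5, 5, 4, 5]
share, skew = ceiling_summary(item, max_option=5)
```
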

A way of minimizing ceiling effect is the use of a response scale that allows a better

discrimination across levels of teaching quality. A higher discrimination can be achieved by

increasing the number of response options, modifying the labels in the response scale (Hessling

et al., 2004), or modifying item wording as in the case of the wide format discussed previously

(Lam & Stevens, 1994).

Ceiling effect and acquiescence response style are not mutually exclusive interpretations.

However, it is uncertain that modifications in the response scale to minimize ceiling effect would

lead to a smaller proportion of responses consistent with acquiescence response style because of the

additional problem of construct-underrepresentation.

5.2.4 Online Survey Mode

Along with a potential ceiling effect due to instrument design issues, an important aspect of the

measurement procedure in the study that may help explain the negatively skewed and

narrow distribution of SET scores is the use of online survey mode. Whereas the other two

comparable studies utilized paper and postal surveys (Richardson, 2005; Spooren et al., 2012),

the mode of administration in the study was online. Survey mode is known to introduce mode-

specific types of error on responses (Smyth et al., 2009). Findings of differences in SET scores

between online and paper-based modes of administration are mixed. A group of studies indicates

no difference between modes of administration (Avery, Bryant, Mathios, Kang, & Bell, 2006;

Dommeyer, 2004; Stowell, Addison, & Smith, 2012). Another group of studies reports both

higher scores in online versus paper-based questionnaires (Bruns, Rupert, & Zhang, 2011;

Burton, Civitano, & Steiner-Grossman, 2012; Morrison, 2013) and higher scores in paper-based

versus online mode questionnaires (Capa-Aydin, 2016). None of the previous studies specifically

report differences in acquiescence or other response styles across modes of administration.

Considering that at least in certain cases online mode is related to inflated SET scores, online

survey mode seems a likely alternative interpretation that demands further examination.

5.2.5 Strong Satisficing

The pattern of responses reported in the study is also consistent with strong satisficing and the

use of the anchor and adjustment strategy.

Satisficing theory is a framework for exploring suboptimal survey responses. The theory predicts

that respondents will choose the first satisfactory or acceptable response alternative rather than

the optimal response (Krosnick, 1999; Krosnick & Alwin, 1987). Satisficing assumes that

responding to a survey question requires a significant amount of cognitive work that respondents may

not be interested in investing.

Respondents can save cognitive work in several ways (Barge & Gehlbach, 2012). For instance,

respondents can use the anchor and adjustment strategy, in which the “response to an initial

survey item provides a cognitive anchor from which they insufficiently adjust in answering the

subsequent item” (Gehlbach & Barge, 2012, p. 419). Anchor and adjustment can result in a

participant agreeing with all the statements in a questionnaire (acquiescence response style).

Three conditions increase the use of a satisficing response strategy: 1) greater task difficulty, 2) lower

respondent ability, and 3) lower respondent motivation to optimize (Krosnick, 1999).

“Strong satisficing” occurs when respondents process questions superficially and provide an

arbitrary or random response.

In the case of the study, conditions for satisficing 1) and 2) are not plausible. Participants in the

study are graduate students enrolled at a teacher education institution, familiar with teaching

and learning concepts. The population of students is at least as capable of providing valid

responses (if not more so) as other populations of students with lower levels of acquiescence

(undergraduate students enrolled in programs not related to Education). The third condition for

satisficing seems a more plausible cause of the response pattern in the data. Perhaps students

were motivated to complete the online questionnaire, but not motivated enough to provide

accurate responses or optimize. Strong satisficing, and specifically a lower respondent

motivation to optimize, is a likely explanation that can help understand “why” students were

acquiescent in their responses.

5.2.6 Evaluation Goals

The last aspect that could lead to high SET scores is related to the intended use of the evaluation,

or evaluation goals, either those implicit or explicit in the evaluation context.

Wetzel et al. (2016) argue that the context of measurement can potentially affect participants’

motivation to provide accurate answers and trigger different types of response styles. Known

examples of the influence of the evaluation context are “for subordinates to exhibit positive

leniency when describing supervisors, and for judges to select neutral response alternatives when

items are ambiguous or when the judges wish to be evasive” (James, Demaree, & Wolf, 1984,

p.90).

One important aspect to consider is the level of personal involvement with the goals of the

evaluation. A high personal involvement and perceiving the evaluation as relevant and useful to

society can help reduce response styles in low stakes contexts. The same is not true in high

stakes contexts in which only costly and inefficient modifications in the scoring process can

minimize a small group of response styles (Wetzel, Böhnke, et al., 2016).

There is only one published study comparing students’ internal evaluation goals and SET scores

(Murphy et al., 2004). Students rated the degree to which the following goals were important in

their judgment: 1) identifying the instructor’s weaknesses, 2) identifying the instructor’s

strengths, 3) providing fair ratings, and 4) motivating instructors. The study reported a positive

relationship between the scores of importance of the four evaluation goals and SET scores, with

r² ranging from 0.07 to 0.36 (pilot study) and 0.09 to 0.45 (main study). The authors conclude

that “raters [students] pursuing different goals tend to give different ratings, even when they have

observed the same performance” (p. 162). The previous study did not report goals related to

summative decisions as in the case of the present study.

Evaluation goals are relevant to the discussion of SET scores validity because the utilization of

SET simultaneously for formative, summative and accountability purposes can introduce

conflicting goals and incentives for score inflation (Penny, 2003; Spooren et al., 2012; Yorke,

2009). As an example, teachers report attempts to artificially increase their evaluation scores by

introducing behaviors such as inducements, pre-evaluation actions, manipulation, watching

during SET, providing academic extras, and grading leniency (Simpson & Siguaw, 2000).

Students seem to highly value formative decisions based on SET scores (Chen & Hoshower,

2003; Ernst, 2014).

In the case of the present study, ambiguously stated and possibly conflicting evaluation goals

presented to students during instrument administration might have influenced students’ internal

goals. Possible internal goals causing inflated SET scores are 1) attempt to avoid negative

consequences of low scores on teachers, 2) a low involvement and perceived relevance (related

to satisficing), 3) attempt to motivate instructors by endorsing high scores, and 4) students’

response to instructor’s inducements. The plausibility of these four explanations requires further

examination.

5.3 Limitations and Future Research

Three important limitations affect the implications based on findings reported in this study: 1)

the response styles examination approach, 2) the use of secondary SET data, and 3) general limitations

inherent to quantitative research methodology.

5.3.1 Use of Manifest Variable Approach

The study utilized a manifest variable approach to study response styles. Frequency indexes of

response styles are easy to compute and interpret. However, some authors defend approaches

based on more sophisticated mathematical models because of the confound between the response

style and the target construct inherent to frequency indexes calculated from the same items as

the target construct (Section 2.3.1). Latent variable models can effectively separate target

construct variance from response style variance (Bolt & Johnson, 2009; Wetzel, Böhnke, et al.,

2016).
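
To make concrete how easily frequency indexes can be computed, the sketch below derives proportion-based indexes for a single respondent. It is an assumption-laden illustration rather than the study's scoring procedure: it assumes a 5-point agreement scale coded 1 to 5, treats options above the midpoint as agreement, and uses a hypothetical respondent answering eight items. The function name response_style_indexes and the index labels are invented for this example.

```python
def response_style_indexes(responses, n_options=5):
    """Proportion-based indexes for one respondent's answers (coded 1..n_options)."""
    n = len(responses)
    midpoint = (n_options + 1) / 2   # 3 on a 5-point scale; no midpoint if even
    return {
        "ARS": sum(r > midpoint for r in responses) / n,    # agreement side
        "DARS": sum(r < midpoint for r in responses) / n,   # disagreement side
        "ERS": sum(r in (1, n_options) for r in responses) / n,  # extreme options
        "MRS": sum(r == midpoint for r in responses) / n,   # midpoint option
    }


# Hypothetical student answering eight items on a 5-point agreement scale.
student = [5, 5, 4, 5, 5, 5, 3, 5]
idx = response_style_indexes(student)
```

Because every index is computed from the same items that measure teaching quality, a high ARS value here is equally compatible with a genuinely positive evaluation, which is the confound described above.
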

An example of the latent variable approach is the use of Structural Equation Modelling

for examining acquiescence (Ferrando, Morales-Vives, & Lorenzo-Seva, 2016). Another

example is the use of item response models to examine midpoint and extreme response style,

either as categorical latent variables (Tutz & Berger, 2016) or as continuous latent variables

(Wetzel & Carstensen, 2015). Methods for examining response styles based on latent variable

approaches are very recent, and no systematic review and comparison of methods is available yet

(Wetzel, Böhnke, et al., 2016).

Regardless of the clear advantage of more sophisticated statistical models, the high degree of

acquiescence affecting SET scores in this study is on its own sufficient evidence of construct-

irrelevant variance. In other situations, results from a manifest variable approach might lead to a

less precise diagnosis, and a latent variable approach, such as the MRSM presented in Section

5.1.2, would help supplement those results.

5.3.2 Use of Observational Data

The lack of control on content and administration procedure associated with the use of secondary

data also limited the study in several ways. First, the inclusion of non-related constructs would

allow the examination of halo effect. Second, the inclusion of (anonymized) individual

identification of students would allow a more extensive and complete analysis of response styles

by including severity/leniency, central tendency, and range restriction. A significant limitation

pertaining to content is the problem of construct-underrepresentation reported in Section 3.2,

suggesting the inclusion of more content targeting good teaching. Finally, observational data

limited the analysis of differences in the degree to which response styles differ by type of report

(self-report, other-report, report of objects) and type of content (logical, psychological or moral

acts of teaching).

5.3.3 Use of a Quantitative Approach

Following a process of rational argumentation based on the concept of validity and a review of

the literature, the study tested the plausibility of an alternative interpretation of SET scores,

answering the question of “what” (source of construct-irrelevant variance) might explain SET

scores other than the target construct. The quantitative strategy followed in the study is a

reasonable first step to determine whether response styles might represent a potential problem for

SET scores interpretation and subsequent use for formative and summative purposes. The

strategy is useful when observational data is available and the number of students is high, as in

the case of the present study. The design does not address the question of “why” a high

proportion of students relied on a response pattern consistent with acquiescence. Future research

needs to address the inherent lack of depth of the study design.

5.3.4 Future Research

The quantitative nature of the research design utilized in the study provides strong initial

evidence of response styles as a source of construct-irrelevant variance in SET scores at this

specific educational institution. However, alternative interpretations and limitations discussed

above recommend further research.

A research design aimed at examining the validity of SET scores should gather evidence

that illuminates possible causes of response styles, specifically from the response process,

addressing one of the limitations in current SET validity research along with the lack of strong

theory. In this regard, a mixed method research design can provide well-sustained conclusions

about a target phenomenon (Creswell & Clark, 2010; Greene, Caracelli, & Graham, 1989) and

about causality (Howe, 2012). Mixed method approaches to validation are increasingly used and

strongly recommended in the literature (Koskey, Sondergeld, Stewart, & Pugh, 2016; Luyt, 2012;

Morell & Tan, 2009; Onwuegbuzie, Bustamante, & Nelson, 2010). Thus, part of the limitations

of this study can be addressed by a mixed methods design.

A study aimed to examine causes of response styles in SET scores can rely on an experimental

design (quantitative phase) and think-aloud protocols (qualitative phase). Possible independent

variables that are expected to affect response styles are evaluation relevance (higher relevance

would increase students’ motivation and reduce satisficing) and the level of consequences of the

evaluation (high stakes evaluation would lead to higher scores).

In the experimental phase, relevance can be manipulated, for instance, by suggesting that the

teacher reads each student’s report (and within this factor, scores can be anonymous or non-

anonymous to further increase personal relevance) and that the use of scores would benefit future

students through subsequent teaching development support and course modifications. Indicating

that scores would inform administrative decisions such as removing a teacher from a course the

next academic session or that scores would impact annual personnel evaluation manipulates the

level of consequences. An ambiguity condition with vague relevance and level of consequences

would reflect a typical SET administration as in the case of the institution in this study. Changes

in the instrument context, for instance, in the invitation email and in the introductory paragraph in

the SET summated rating scale, can produce the experimental manipulation.

A second phase involving the use of think-aloud protocols conducted under the same

experimental conditions would provide narrative evidence of students’ thinking during the

process of responding. Findings from think-aloud protocols can provide information about how

the experimental manipulation affects response styles. Also, qualitative evidence can suggest

other sources of construct-irrelevant variance that might reduce the validity of SET scores.

Finally, future research should address how the type of report (self-report, other-report, object-

report) and type of content (logical, psychological, and moral acts of teaching) affect response

styles. Such research should utilize a summated rating scale that properly covers all the attributes

of good teaching relevant for the specific discipline and educational context.

References

Aguinis, H., Beaty, J. C., Boik, R. J., & Pierce, C. A. (2005). Effect size and power in assessing

moderating effects of categorical variables using multiple regression: a 30-year review.

The Journal of Applied Psychology, 90(1), 94–107. https://doi.org/10.1037/0021-

9010.90.1.94

Alliger, G. M., Tannenbaum, S. I., Bennett Jr., W., Traver, H., & Shotland, A. (1997). A meta-

analysis of the relations among training criteria. Personnel Psychology, 50(2), 341–358.

https://doi.org/10.1111/j.1744-6570.1997.tb00911.x

American Educational Research Association, American Psychological Association, & National

Council on Measurement in Education. (2014). Standards for educational and

psychological testing. Washington, DC: American Educational Research Association.

Arbuckle, J., & Williams, B. D. (2003). Students’ Perceptions of Expressiveness: Age and

Gender Effects on Teacher Evaluations. Sex Roles, 49(9–10), 507–516.

https://doi.org/10.1023/A:1025832707002

Avery, R. J., Bryant, W. K., Mathios, A., Kang, H., & Bell, D. (2006). Electronic Course

Evaluations: Does an Online Delivery System Influence Student Evaluations? The

Journal of Economic Education, 37(1), 21–37. https://doi.org/10.3200/JECE.37.1.21-37

Aylett, R., & Gregory, K. (1996). Evaluating Teacher Quality in Higher Education. Psychology

Press.

Barge, S., & Gehlbach, H. (2012). Using the Theory of Satisficing to Evaluate the Quality of

Survey Data. Research in Higher Education, 53(2), 182–200.

Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social

psychological research: Conceptual, strategic, and statistical considerations. Journal of

Personality and Social Psychology, 51(6), 1173–1182. https://doi.org/10.1037/0022-

3514.51.6.1173

Basow, S. A., & Montgomery, S. (2005). Student Ratings and Professor Self-Ratings of College

Teaching: Effects of Gender and Divisional Affiliation. Journal of Personnel Evaluation

in Education, 18(2), 91–106. https://doi.org/10.1007/s11092-006-9001-8

Bassett, J., Cleveland, A., Acorn, D., Nix, M., & Snyder, T. (2017). Are they paying attention?

Students’ lack of motivation and attention potentially threaten the utility of course

evaluations. Assessment & Evaluation in Higher Education, 42(3), 431–442.

https://doi.org/10.1080/02602938.2015.1119801

Bassin, W. M. (1974). A Note on the Biases in Students’ Evaluations of Instructors. The Journal

of Experimental Education, 43(1), 16–17.

https://doi.org/10.1080/00220973.1974.10806298

Berk, R. A. (2005). Survey of 12 strategies to measure teaching effectiveness. International

Journal of Teaching and Learning in Higher Education, 17(1), 48–62.

Berliner, D. C. (2005). The near impossibility of testing for teacher quality. Journal of Teacher

Education, 56(3), 205–213. https://doi.org/10.1177/0022487105275904

Billiet, J. B., & Davidov, E. (2008). Testing the Stability of an Acquiescence Style Factor Behind

Two Interrelated Substantive Variables in a Panel Design. Sociological Methods &

Research, 36(4), 542–562. https://doi.org/10.1177/0049124107313901

Bolt, D. M., & Johnson, T. R. (2009). Addressing Score Bias and Differential Item Functioning

Due to Individual Differences in Response Style. Applied Psychological Measurement,

33(5), 335–352. https://doi.org/10.1177/0146621608329891

Bonitz, V. S. (2011). Student Evaluation of Teaching: Individual Differences and Bias Effects.

Graduate Theses and Dissertations. Paper 12211. Retrieved from

http://lib.dr.iastate.edu/etd/1221

Boring, A. (2015). Gender Biases in student evaluations of teachers (Documents de Travail de

l’OFCE No. 2015–13). Observatoire Francais des Conjonctures Economiques (OFCE).

Retrieved from http://econpapers.repec.org/paper/fcedoctra/1513.htm

Boring, A., Ottoboni, K., & Stark, P. B. (2016). Student Evaluations of Teaching (Mostly) Do

Not Measure Teaching Effectiveness. ScienceOpen Research, 0(0), 1–11.

https://doi.org/10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1

Boud, D., & Falchikov, N. (1989). Quantitative studies of student self-assessment in higher

education: a critical analysis of findings. Higher Education, 18(5), 529–549.

Bowman, N. (2010). Can 1st-Year College Students Accurately Report Their Learning and

Development? American Educational Research Journal, 47(2), 466–496.

https://doi.org/10.3102/0002831209353595

Brown, J. D. (2011). Questions and answers about language testing statistics: Likert items and

scales of measurement? Retrieved July 12, 2017, from

http://hosted.jalt.org/test/bro_34.htm

Bruns, S. M., Rupert, T. J., & Zhang, Y. (2011). Effects of Converting Student Evaluations of

Teaching from Paper to Online Administration. In Advances in Accounting Education:

Teaching and Curriculum Innovations (Vol. 12, pp. 167–192). Emerald Group Publishing

Limited. Retrieved from http://www.emeraldinsight.com/doi/full/10.1108/S1085-

4622%282011%290000012010

Burton, W. B., Civitano, A., & Steiner-Grossman, P. (2012). Online versus paper evaluations:

differences in both quantitative and qualitative data. Journal of Computing in Higher

Education, 24(1), 58–69. https://doi.org/10.1007/s12528-012-9053-3

Capa-Aydin, Y. (2016). Student evaluation of instruction: comparison between in-class and

online methods. Assessment & Evaluation in Higher Education, 41(1), 112–126.

https://doi.org/10.1080/02602938.2014.987106

Carifio, J., & Perla, R. J. (2007). Ten common misunderstandings, misconceptions, persistent

myths and urban legends about Likert scales and Likert response formats and their

antidotes. Journal of Social Sciences, 3(3), 106–116.

Cashin, W. E. (1995). Student Ratings of Teaching: The Research Revisited. IDEA Paper No. 32.

Retrieved from http://eric.ed.gov/?id=ED402338

Chen, T., & Hoshower, L. B. (2003). Student Evaluation of Teaching Effectiveness: An

assessment of student perception and motivation. Assessment & Evaluation in Higher


Chiang, K. S., Green, K. E., & Cox, E. O. (2009). Rasch analysis of the Geriatric Depression

Scale-Short Form. The Gerontologist, 49(2), 262–275.

https://doi.org/10.1093/geront/gnp018

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale,

N.J: Routledge.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003).

Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.).

Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Cohen, L., Manion, L., & Morrison, K. (2007). Research Methods in Education (6th ed.).

London; New York: Routledge.

Creswell, J. W., & Clark, V. L. P. (2010). Designing and Conducting Mixed Methods Research

(2nd ed.). Los Angeles: Sage Publications.

Cronbach, L. J. (1946). Response sets and test validity. Educational and Psychological

Measurement, 6, 475–494.

Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. Braun (Eds.),

Test validity (pp. 3–17). Hillsdale, NJ: Lawrence Erlbaum.

Dommeyer, C. J., Baum, P., Hanna, R. W., & Chapman, K. S. (2004). Gathering Faculty Teaching Evaluations by In-Class and Online

Surveys: Their Effects on Response Rates and Evaluations. Assessment & Evaluation in

Higher Education, 29(5), 611–623.

Dunning, D., & Helzer, E. G. (2014). Beyond the Correlation Coefficient in Studies of Self-

Assessment Accuracy Commentary on Zell & Krizan (2014). Perspectives on

Psychological Science, 9(2), 126–130. https://doi.org/10.1177/1745691614521244

Ellis, P. D. (2010). The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the

Interpretation of Research Results. Cambridge University Press.

Ernst, D. (2014). Expectancy theory outcomes and student evaluations of teaching. Educational

Research and Evaluation, 20(7–8), 536–556.

https://doi.org/10.1080/13803611.2014.997138

Fenstermacher, G. D., & Richardson, V. (2005). On making determinations of quality in

teaching. The Teachers College Record, 107(1), 186–213.

Ferrando, P. J., Morales-Vives, F., & Lorenzo-Seva, U. (2016). Assessing and Controlling

Acquiescent Responding When Acquiescence and Content Are Related: A

Comprehensive Factor-Analytic Approach. Structural Equation Modeling: A

Multidisciplinary Journal, 23(5), 713–725.

https://doi.org/10.1080/10705511.2016.1185723

Gee, N. (2017). A study of student completion strategies in a Likert-type course evaluation

survey. Journal of Further and Higher Education, 41(3), 340–350.

https://doi.org/10.1080/0309877X.2015.1100717

Gehlbach, H., & Barge, S. (2012). Anchoring and Adjusting in Questionnaire Responses. Basic

and Applied Social Psychology, 34(5), 417–433.

https://doi.org/10.1080/01973533.2012.711691

Gravestock, P., & Gregor-Greenleaf, E. (2008). Student Course Evaluations: Research, Models

and Trends. Toronto: Higher Education Quality Council of Ontario.

Greene, J. C., Caracelli, V. J., & Graham, W. F. (1989). Toward a Conceptual Framework for

Mixed-Method Evaluation Designs. Educational Evaluation and Policy Analysis, 11(3),

255–274. https://doi.org/10.3102/01623737011003255

Hessling, R. M., Traxel, N. M., & Schmidt, T. J. (2004). Ceiling Effect. In M. Lewis-Beck, A.

Bryman, & T. Futing Liao (Eds.), The SAGE Encyclopedia of Social Science Research

Methods. Thousand Oaks, CA:

Sage Publications, Inc. Retrieved from http://methods.sagepub.com/reference/the-sage-

encyclopedia-of-social-science-research-methods/n102.xml

Howe, K. R. (2012). Mixed Methods, Triangulation, and Causal Explanation. Journal of Mixed

Methods Research, 1558689812437187. https://doi.org/10.1177/1558689812437187

Ingvarson, L., & Rowe, K. (2008). Conceptualising and Evaluating Teacher Quality: Substantive

and Methodological Issues. Australian Journal of Education, 52(1), 5–35.

https://doi.org/10.1016/j.apmr.2010.02.005

James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability

with and without response bias. Journal of Applied Psychology, 69(1), 85–98.

https://doi.org/10.1037/0021-9010.69.1.85

Johnson, R. (2000). The Authority of the Student Evaluation Questionnaire. Teaching in Higher


Joint Committee on Standards for Educational Evaluation. (2009). The personnel evaluation

standards: how to assess systems for evaluating educators (2nd ed). Thousand Oaks, CA:

Corwin Press.

Kam, C. C. S. (2015). Further Considerations in Using Items With Diverse Content to Measure

Acquiescence. Educational and Psychological Measurement, 76(1), 164–174.

https://doi.org/10.1177/0013164415586831

Kam, C. C. S., & Zhou, M. (2015). Does Acquiescence Affect Individual Items Consistently?

Educational and Psychological Measurement, 75(5), 764–784.

https://doi.org/10.1177/0013164414560817

Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement,

38(4), 319–342. https://doi.org/10.1111/j.1745-3984.2001.tb01130.x

Kaplan, R. M., & Saccuzzo, D. P. (2008). Psychological Testing: Principles, Applications, and

Issues (7th ed.). Belmont, CA: Wadsworth Publishing.

Kenny, D. A. (1979). Correlation and Causality (1st ed.). New York: John Wiley &

Sons Inc.

Kenny, D. A. (2015, March 31). Moderator Variables. Retrieved June 20, 2017, from

http://davidakenny.net/cm/moderation.htm#GG

Kingsbury, F. A. (1922). Analyzing ratings and training raters. Journal of Personnel Research

(Pre-1986), 1(000008), 377.

Kirkpatrick, D. L. (1977). Evaluating Training Programs: Evidence vs. Proof. Training and

Development Journal, 77(11), 9–12.

Kirkpatrick, D. L. (1979). Techniques for evaluating training programs. Classic Writings on

Instructional Technology, 1, 231–241.

Kline, T. J. B. (2005). Psychological Testing: A Practical Approach to Design and Evaluation.

Thousand Oaks, CA: Sage Publications.

Koskey, K. L. K., Sondergeld, T. A., Stewart, V. C., & Pugh, K. J. (2016). Applying the Mixed

Methods Instrument Development and Construct Validation Process: The Transformative

Experience Questionnaire. Journal of Mixed Methods Research, 1558689816633310.

https://doi.org/10.1177/1558689816633310

Krosnick, J. A. (1999). Survey Research. Annual Review of Psychology, 50(1), 537–567.

https://doi.org/10.1146/annurev.psych.50.1.537

Krosnick, J. A., & Alwin, D. F. (1987). An Evaluation of a Cognitive Theory of Response-Order

Effects in Survey Measurement. Public Opinion Quarterly, 51(2), 201–219.

https://doi.org/10.1086/269029

Krosnick, J. A., Narayan, S., & Smith, W. R. (1996). Satisficing in surveys: Initial evidence. New

Directions for Evaluation, 1996(70), 29–44. https://doi.org/10.1002/ev.1033

Kuwaiti, A. A., & Subbarayalu, A. V. (2015). Appraisal of students experience survey (SES) as a

measure to manage the quality of higher education in the Kingdom of Saudi Arabia: an

institutional study using six sigma model. Educational Studies, 0(0), 1–14.

https://doi.org/10.1080/03055698.2015.1043977

Lam, T. C. M., & Klockars, A. J. (1982). Anchor Point Effects on the Equivalence of

Questionnaire Items. Journal of Educational Measurement, 19(4), 317–322.

Lam, T. C. M., & Stevens, J. J. (1994). Effects of Content Polarization, Item Wording, and

Rating Scale Width on Rating Response. Applied Measurement in Education, 7(2), 141–

158. https://doi.org/10.1207/s15324818ame0702_3

Leckie, G., & Baird, J. A. (2011). Rater Effects on Essay Scoring: A Multilevel Analysis of

Severity Drift, Central Tendency, and Rater Experience. Journal of Educational

Measurement, 48(4), 399–418. https://doi.org/10.1111/j.1745-3984.2011.00152.x

Lentz, T. F. (1938). Acquiescence as a factor in the measurement of personality. Psychological

Bulletin, 35(9), 659.

Loevinger, J. (1959). Theory and techniques of assessment. Annual Review of Psychology, 10,

287–316. https://doi.org/10.1146/annurev.ps.10.020159.001443

Luyt, R. (2012). A Framework for Mixing Methods in Quantitative Measurement Development,

Validation, and Revision: A Case Study. Journal of Mixed Methods Research, 6(4), 294–

316. https://doi.org/10.1177/1558689811427912

Macmillan, N. A., & Creelman, C. D. (1990). Response bias: Characteristics of detection theory,

threshold theory, and “nonparametric” indexes. Psychological Bulletin, 107(3), 401–413.

https://doi.org/10.1037/0033-2909.107.3.401

MacNell, L., Driscoll, A., & Hunt, A. N. (2015). What’s in a Name: Exposing Gender Bias in

Student Ratings of Teaching. Innovative Higher Education, 40(4), 291–303.

https://doi.org/10.1007/s10755-014-9313-4

Marsh, H. W. (1982). SEEQ: A Reliable, Valid, and Useful Instrument for Collecting Students’

Evaluations of University Teaching. British Journal of Educational Psychology, 52(1),

77–95. https://doi.org/10.1111/j.2044-8279.1982.tb02505.x

Marsh, H. W. (1987). Students’ evaluations of University teaching: Research findings,

methodological issues, and directions for future research. International Journal of

Educational Research, 11(3), 253–388. https://doi.org/10.1016/0883-0355(87)90001-2

Marsh, H. W., & Roche, L. A. (1997). Making students’ evaluations of teaching effectiveness

effective: The critical issues of validity, bias, and utility. American Psychologist, 52(11),

1187–1197. https://doi.org/10.1037/0003-066X.52.11.1187

Marsh, H. W., & Roche, L. A. (2000). Effects of grading leniency and low workload on students’

evaluations of teaching: Popular myth, bias, validity, or innocent bystanders? Journal of

Educational Psychology, 92(1), 202–228. https://doi.org/10.1037/0022-0663.92.1.202

Masino, C., & Lam, T. C. M. (2014). Choice of rating scale labels: implication for minimizing

patient satisfaction response ceiling effect in telemedicine surveys. Telemedicine Journal

and E-Health: The Official Journal of the American Telemedicine Association, 20(12),

1150–1155. https://doi.org/10.1089/tmj.2013.0350

McGrath, R. E., Mitchell, M., Kim, B. H., & Hough, L. (2010). Evidence for response bias as a

source of error variance in applied assessment. Psychological Bulletin, 136(3), 450–470.

https://doi.org/10.1037/a0019216

McPherson, M. A., & Jewell, R. T. (2007). Leveling the Playing Field: Should Student

Evaluation Scores be Adjusted? Social Science Quarterly, 88(3), 868–881.

https://doi.org/10.1111/j.1540-6237.2007.00487.x

McPherson, M. A., Jewell, R. T., & Kim, M. (2009). What Determines Student Evaluation

Scores? A Random Effects Analysis of Undergraduate Economics Classes. Eastern

Economic Journal, 35(1), 37–51.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13–

110). New York, NY: Macmillan.

Messick, S. (1995a). Standards of validity and the validity of standards in performance

assessment. Educational Measurement: Issues and Practice, 14(4), 5–8.

Messick, S. (1995b). Validity of psychological assessment: Validation of inferences from

persons’ responses and performances as scientific inquiry into score meaning. American

Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741

Moors, J. J. A. (1986). The Meaning of Kurtosis: Darlington Reexamined. The American

Statistician, 40(4), 283–284. https://doi.org/10.1080/00031305.1986.10475415

Morell, L., & Tan, R. J. B. (2009). Validating for Use and Interpretation: A Mixed Methods

Contribution Illustrated. Journal of Mixed Methods Research, 3(3), 242–264.

https://doi.org/10.1177/1558689809335079

Morrison, K. (2013). Online and paper evaluations of courses: a literature review and case study.

Educational Research and Evaluation, 19(7), 585–604.

https://doi.org/10.1080/13803611.2013.834608

Murphy, K. R., & Balzer, W. K. (1989). Rater errors and rating accuracy. Journal of Applied

Psychology, 74(4), 619–624. https://doi.org/10.1037/0021-9010.74.4.619

Murphy, K. R., & Cleveland, J. (1995). Understanding performance appraisal: social,

organizational, and goal-based perspectives. Thousand Oaks, CA: Sage Publications.

Murphy, K. R., Cleveland, J. N., Skattebo, A. L., & Kinney, T. B. (2004). Raters Who Pursue

Different Goals Give Different Ratings. Journal of Applied Psychology, 89(1), 158–164.

https://doi.org/10.1037/0021-9010.89.1.158

Olivares, O. J. (2003). A Conceptual and Analytic Critique of Student Ratings of Teachers in the

USA with Implications for Teacher Effectiveness and Student Learning. Teaching in

Higher Education, 8(2), 233–245. https://doi.org/10.1080/1356251032000052465

Onwuegbuzie, A. J., Bustamante, R. M., & Nelson, J. A. (2010). Mixed Research as a Tool for

Developing Quantitative Instruments. Journal of Mixed Methods Research, 4(1), 56–78.

https://doi.org/10.1177/1558689809355805

Onwuegbuzie, A. J., Daniel, L. G., & Collins, K. M. T. (2009). A meta-validation model for

assessing the score-validity of student teaching evaluations. Quality & Quantity, 43(2),

197–209. https://doi.org/10.1007/s11135-007-9112-4

Ory, J. C. (2001). Faculty Thoughts and Concerns About Student Ratings. New Directions for

Teaching and Learning, 2001(87), 3–15. https://doi.org/10.1002/tl.23

Ory, J. C., & Ryan, K. (2001). How Do Student Ratings Measure Up to a New Validity

Framework? New Directions for Institutional Research, 2001(109), 27–44.

https://doi.org/10.1002/ir.2

Osteen, P. (2010). An Introduction to Using Multidimensional Item Response Theory to Assess

Latent Factor Structures. Journal of the Society for Social Work and Research, 1(2), 66–

82. https://doi.org/10.5243/jsswr.2010.6

Paulhus, D. L. (1991). Measurement and control of response bias. In J. P. Robinson, P. R. Shaver,

& L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes

(pp. 17–59). San Diego, CA, US: Academic Press.

Penny, A. R. (2003). Changing the Agenda for Research into Students’ Views about University

Teaching: Four shortcomings of SRT research. Teaching in Higher Education, 8(3), 399–

411. https://doi.org/10.1080/13562510309396

Pintrich, P. R. (2002). The Role of Metacognitive Knowledge in Learning, Teaching, and

Assessing. Theory Into Practice, 41(4), 219–225.

https://doi.org/10.1207/s15430421tip4104_3

Plieninger, H. (2016). Mountain or Molehill? A Simulation Study on the Impact of Response

Styles. Educational and Psychological Measurement, 0013164416636655.

https://doi.org/10.1177/0013164416636655

Popham, W. J. (1992). Educational Evaluation (3rd ed.). Boston: Pearson.

Rantanen, P. (2013). The number of feedbacks needed for reliable evaluation. A multilevel

analysis of the reliability, stability and generalisability of students’ evaluation of teaching.

Assessment & Evaluation in Higher Education, 38(2), 224–239.

https://doi.org/10.1080/02602938.2011.625471

Richardson, J. T. E. (2005). Students’ Approaches to Learning and Teachers’ Approaches to

Teaching in Higher Education. Educational Psychology, 25(6), 673–680.

https://doi.org/10.1080/01443410500344720

Richardson, J. T. E. (2012). The role of response biases in the relationship between students’

perceptions of their courses and their approaches to studying in higher education. British

Educational Research Journal, 38(3), 399–418.

https://doi.org/10.1080/01411926.2010.548857

Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the

psychometric quality of rating data. Psychological Bulletin, 88(2), 413–428.

https://doi.org/10.1037/0033-2909.88.2.413

Salas, E., & Cannon-Bowers, J. A. (2001). The Science of training: A decade of progress. Annual

Review of Psychology, 52(1), 471–499. https://doi.org/10.1146/annurev.psych.52.1.471

Simpson, P. M., & Siguaw, J. A. (2000). Student Evaluations of Teaching: An Exploratory Study

of the Faculty Response. Journal of Marketing Education, 22(3), 199–213.

https://doi.org/10.1177/0273475300223004

Smith, S. W., Yoo, J. H., Farr, A. C., Salmon, C. T., & Miller, V. D. (2007). The Influence of

Student Sex and Instructor Sex on Student Ratings of Instructors: Results from a College

of Communication. Women’s Studies in Communication, 30(1), 64–77.

https://doi.org/10.1080/07491409.2007.10162505

Smyth, J. D., Dillman, D. A., & Christian, L. M. (2009). Context effects in Internet surveys: New

issues and evidence. In A. N. Joinson, K. Y. A. McKenna, T. Postmes, & U.-D. Reips

(Eds.), Oxford Handbook of Internet Psychology. Oxford, UK: Oxford University Press.

Retrieved from

http://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199561803.001.0001/oxfordhb-9780199561803-e-027

Spector, P. E. (1991). Summated Rating Scale Construction: An Introduction (1st ed.). Newbury

Park, CA: Sage.

Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the Validity of Student Evaluation of

Teaching: The State of the Art. Review of Educational Research, 83(4), 598–642.

https://doi.org/10.3102/0034654313496870

Spooren, P., Mortelmans, D., & Thijssen, P. (2012). ‘Content’ versus ‘style’: Acquiescence in

student evaluation of teaching? British Educational Research Journal, 38(1), 3–21.

https://doi.org/10.1080/01411926.2010.523453

Stark, P., & Freishtat, R. (2014). An Evaluation of Course Evaluations. ScienceOpen Research.

Retrieved from https://www.scienceopen.com/document/id/ad8a9ac9-8c60-432a-ba20-4402a2a38df4

StataCorp. (2013). Stata Statistical Software: Release 13. College Station, TX: StataCorp LP.

Stowell, J. R., Addison, W. E., & Smith, J. L. (2012). Comparison of online and classroom-based

student evaluations of instruction. Assessment & Evaluation in Higher Education, 37(4),

465–473. https://doi.org/10.1080/02602938.2010.545869

Theall, M., & Franklin, J. (2001). Looking for Bias in All the Wrong Places: A Search for Truth

or a Witch Hunt in Student Ratings of Instruction? New Directions for Institutional

Research, 2001(109), 45–56. https://doi.org/10.1002/ir.3

Thorndike, E. L. (1920). A constant error in psychological ratings. Journal of Applied

Psychology, 4(1), 25–29.

https://doi.org/10.1037/h0071663

Tourangeau, R., Rips, L. J., & Rasinski, K. (2000). The Psychology of Survey Response (1st ed.).

Cambridge, UK: Cambridge University Press.

Traub, R. E. (1997). Classical Test Theory in Historical Perspective. Educational Measurement:

Issues and Practice, 16(4), 8–14. https://doi.org/10.1111/j.1745-3992.1997.tb00603.x

Tutz, G., & Berger, M. (2016). Response Styles in Rating Scales: Simultaneous Modeling of

Content-Related Effects and the Tendency to Middle or Extreme Categories. Journal of

Educational and Behavioral Statistics, 41(3), 239–268.

https://doi.org/10.3102/1076998616636850

Valsan, C., & Sproule, R. (2008). The invisible hands behind the student evaluation of teaching:

the rise of the new managerial elite in the governance of higher education. Journal of

Economic Issues, 939–958.

van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response Styles in Rating Scales:

Evidence of Method Bias in Data From Six EU Countries. Journal of Cross-Cultural

Psychology, 35(3), 346–360. https://doi.org/10.1177/0022022104264126

Van Vaerenbergh, Y., & Thomas, T. D. (2013). Response Styles in Survey Research: A Literature

Review of Antecedents, Consequences, and Remedies. International Journal of Public

Opinion Research, 25(2), 195–217. https://doi.org/10.1093/ijpor/eds021

Viswanathan, M. (2005). Measurement Error and Research Design. Thousand Oaks, CA: SAGE

Publications, Inc. Retrieved from http://dx.doi.org/10.4135/9781412984935.n3

Ward, M., Gruppen, L., & Regehr, G. (2002). Measuring Self-assessment: Current State of the

Art. Advances in Health Sciences Education, 7(1), 63–80.

Webster, H. (1958). Correcting personality scales for response sets or suppression effects.

Psychological Bulletin, 55(1), 62–64. https://doi.org/10.1037/h0048031

Weijters, B., Geuens, M., & Schillewaert, N. (2010). The stability of individual response styles.

Psychological Methods, 15(1), 96–110. https://doi.org/10.1037/a0018721

Wetzel, E., Böhnke, J., & Brown, A. (2016). Response Biases. In F. T. L. Leong, D. Bartram, F.

Cheung, K. F. Geisinger, & D. Iliescu (Eds.), The ITC International Handbook of Testing

and Assessment (1st ed., pp. 349–363). New York: Oxford University Press.

Wetzel, E., & Carstensen, C. H. (2015). Multidimensional Modeling of Traits and Response

Styles. European Journal of Psychological Assessment, 1–13.

https://doi.org/10.1027/1015-5759/a000291

Wetzel, E., Lüdtke, O., Zettler, I., & Böhnke, J. R. (2016). The Stability of Extreme Response

Style and Acquiescence Over 8 Years. Assessment, 23(3), 279–291.

https://doi.org/10.1177/1073191115583714

Wolfe, E. W. (2004). Identifying rater effects using latent trait models. Psychology Science,

46(1), 35–51.

Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ACER ConQuest Version 2.0:

Generalised Item Response Modelling Software. ACER Press.

Yorke, M. (2009). ‘Student experience’ surveys: some methodological considerations and an

empirical investigation. Assessment & Evaluation in Higher Education, 34(6), 721–739.

https://doi.org/10.1080/02602930802474219

Zabaleta, F. (2007). The use and misuse of student evaluations of teaching. Teaching in Higher

Education.

Zell, E., & Krizan, Z. (2014). Do People Have Insight Into Their Abilities? A Metasynthesis.

Perspectives on Psychological Science, 9(2), 111–125.

https://doi.org/10.1177/1745691613518075

Zhang, X., & Savalei, V. (2015). Improving the Factor Structure of Psychological Scales: The

Expanded Format as an Alternative to the Likert Scale Format. Educational and

Psychological Measurement, 0013164415596421.

https://doi.org/10.1177/0013164415596421

Appendices

Copyright Acknowledgements