Ronald Carriveau, Ph.D., University of North Texas
Richard Herrington, Ph.D., University of North Texas
http://www.unt.edu/rss/rich/IUPUI/
Outcomes
An introduction to the UNT Student Evaluation of Teaching Effectiveness (SETE) survey.
Steps for developing a teaching evaluation survey.
An online handbook of Questions and Answers regarding survey development and implementation.
An introduction to the validity study process and scale score development (with online access).
The Value of Teacher Evaluations
Clicker survey on participants’ opinions
1. How many years have you been teaching or working at the college or university level?
A. 1 to 4 years
B. 5 to 10 years
C. More than 10 years
2. During your career, how often have you been evaluated by students as a job requirement?
A. Seldom or never
B. Often
C. Almost always
3. To what degree do you consider student evaluations of teachers to be high stakes?
A. Low
B. Medium
C. High
4. How important are teacher evaluations for improving instruction?
A. Not important
B. Somewhat important
C. Very important
5. How valid are student evaluations as a measure of teaching effectiveness?
A. Not valid
B. Moderately valid
C. Highly valid
6. How much of an influence is the instructor’s enthusiasm and style on student ratings?
A. Small influence
B. Moderate influence
C. Big influence
7. To what degree are you involved with teacher evaluation at your institution?
A. Small
B. Moderate
C. High
Why Student Evaluation of Teacher Effectiveness
Growing concern for accountability and self accountability
Need evaluation of instructor in terms of their instruction
Used in decisions about pay, promotion, and tenure.
Provide feedback to faculty on their instruction
Improve teaching/instruction
Improve student learning
Primary Challenge
How do we get from the raw scores on a teacher effectiveness survey to validated scale scores that are psychometrically and legally defensible?
Need to overcome the lack of validity inherent in using raw score means and medians.
Development Challenges
Need for a clearly defined purpose and what exactly to measure
Transparency and inclusion to get faculty buy-in
Conduct a comprehensive item selection process
Conduct faculty and student focus groups
Develop a defensible item response scale
Committee’s charge from the Provost: Need a common scale across all courses for individual effectiveness scores and for making comparisons
Committee recommended purpose: Measure ‘teaching effectiveness’ versus ‘course effectiveness.’
HB 2504 – Transparency: Texas law requires posting of teacher evaluation scores
Transparency And Inclusion For Faculty Buy-in
Committee makeup: Professors, Lecturers, Faculty Senators, Staff, Student
Faculty Senate Executive Board and Chair/President: Get support of Chair/President first
Faculty Senate: Make initial presentation and give updates
Challenge: the Senate may resent that the directive came from the Provost (i.e. administration) rather than with Senate input and approval
The item selection process vs. writing the items
Four step process starting with 3000 plus items
Initial screening – measurement intent and clarity
Rubric Scoring #1 – teaching vs course effectiveness
Rubric Scoring #2 - match items to effectiveness factors
Faculty and Student validation via Focus Groups and Surveys – importance, observable, student judgment
Faculty And Student Focus Groups
Stratified random sampling to select faculty across all colleges by proportion of enrollment (i.e. representativeness).
Standardized focus group questions and consent form for IRB: Necessary for focus groups, field test, and research
Qualitative analysis: Analyze all focus group comments, before and after field test
Developmental field test: Small group of students with post-administration analysis and discussion
Faculty and Student Agreement On Items
Clicker Slides
My instructor is knowledgeable about the course content.
A. Something the student can judge
B. Not something the student can judge
The book my instructor chose for the course was appropriate.
A. Something the student can judge
B. Not something the student can judge
My instructor used appropriate instructional materials.
A. Something the student can judge
B. Not something the student can judge
Developing The Item Response Scale
Research based
Multiple references, one primary reference as a guide.
Ronald A. Berk, Thirteen Strategies to Measure College Teaching: A Consumer’s Guide to Rating Scale Construction and Assessment.
A four point agreement scale
Be prepared for challenges: anchor points on scale, no middle point, no NA option, positive/negative statements
1 = Strongly Disagree, 2 = Disagree, 3 = Agree, 4 = Strongly Agree
What are the factors of teaching effectiveness that are measured?
In addition to the overall construct of teaching effectiveness, there are three specific factors that are being measured. The twelve items on the SETE were chosen from the final pool of 28 usable items, and the four items chosen for each factor best represent the overall factor of teaching effectiveness (SETE rating instrument - not to be used without legal permissions – Copyright © 2010 by University of North Texas).
Factor 1: Organization and explanation of materials
1. My instructor explains difficult material clearly.
2. My instructor communicates at a level that I can understand.
3. My instructor makes requirements clear.
4. My instructor identifies relationships between and among topics.
Factor 2: Learning Environment
1. My instructor establishes a climate of respect.
2. My instructor is available to me on matters pertaining to the course.
3. My instructor respects diverse talents.
4. My instructor creates an atmosphere in which ideas can be exchanged freely.
Factor 3: Self Regulated Learning
1. My instructor gives assignments that are stimulating to me.
2. My instructor encourages me to develop new viewpoints.
3. My instructor arouses my curiosity.
4. My instructor stimulates my creativity.
Challenges and Questions From Faculty and Senate
What is the purpose of the SETE?
Facilitate student evaluation of their instructors, allowing university-wide comparison in key areas (Charge from Provost). Should not be the only measure.
Provide a measure of teaching effectiveness as perceived by students.
Provide scores for a particular instructor that can be used for self evaluation and improvement and for measuring improvement over time.
Provide scale scores that can be aggregated into group scores for use by administrators.
Provide teacher evaluation information for UNT.
Satisfy the requirements of House Bill 2504 that calls for transparency in reporting and posting to the web.
Challenges and Questions Cont.
How were the SETE items selected?
It was determined that every effort should be made to find existing surveys and published lists of survey items and to evaluate them for usefulness versus writing new items.
Validity evidence to support selection decisions was collected through seven faculty focus groups, four student focus groups, faculty and student interviews, results from a survey sent to faculty, surveys sent to students, and an item tryout field test, plus scoring rubric results from the committee members’ evaluation of items.
An item selection process started with a pool of approximately 3,000 survey items, including all current UNT department surveys and items published by other universities that are used by over 100 universities. After an initial screening process looking at measurement intent and clarity, this large pool was narrowed to 1,488 items. Evaluating these items with rating scales reduced this number to 788, and a second evaluation matching items to specific effectiveness elements reduced the number to 346.
Challenges and Questions cont.
How were the SETE items selected, cont.
Using specific scoring criteria to qualify items for inclusion, committee members reduced the number of items to 51.
These 51 items were then presented to students in a developmental field test, and a final draft selection of 38 items was based on the field test results and faculty focus groups.
A final review was conducted using the criteria of student viewpoint, student observability, statement measurability, conformity to the research elements, duplication, and universality in terms of class size and in terms of online and in-class administration. The result of this process is the final survey item pool of 28 statements.
Over 400 people were involved in the process (after the validity study started, over 550 people had been involved).
Challenges and Questions From Faculty and Senate
How were the final 12 survey items chosen?
The second phase of the SETE development included three teams made up of faculty and staff who specialized in or had experience with assessment development and psychometrics. Team A conducted the validity study and psychometrics; Team B administered spring pilot tests of the SETE items and conducted follow-up faculty and student focus groups; Team C developed open-ended response items.
The final 28 SETE items were pilot tested using a stratified sampling across the University. The pilot test was administered at the end of the Spring semester 2009, and a validity study team was assembled to analyze the data, validate the model fit, conduct item reduction studies, and develop a scoring methodology. The result of the psychometric work was the 12 item survey that was administered across the university in the fall of 2009.
Challenges and Questions Cont.
Why was a four point scale used for the SETE?
Research shows that after five points there are diminishing returns in terms of reliability. Additionally, information may be lost if the scale exceeds the respondent’s ability to discriminate among the anchor points. A 28 item survey with a 4-point scale can yield high reliability coefficients.
It was determined that four anchor points were appropriate using a response scale of 1) Strongly Disagree, 2) Disagree, 3) Agree, and 4) Strongly agree.
Challenges and Questions Cont.
Why is there no midpoint position on the scale (i.e. neutral, uncertain, or undecided)?
Information is lost when a midpoint position is included in a set of bi-polar (i.e. both positive and negative) anchors that are intended to measure the degree (intensity) of a respondent’s opinion.
The neutral mid-point is also problematic because it lowers the mean for a teacher who receives a high score and adds no compensation for a teacher who receives a low score. From a measurement viewpoint, nothing is gained from a neutral response.
Berk (2006) states that, “For rating scales used to measure teaching effectiveness, it is recommended that the midpoint position be omitted and an even-numbered scale be used, such as 4 or 6 points.”
Challenges and Questions Cont.
Why is there no NA (not applicable) choice?
The use of NA was avoided because the teacher effectiveness scale will be used for a class level analysis, and every time a student chooses NA, that student’s scale score will be different because one or more of the items will not be part of the score. This is a major problem in terms of measurement, analysis, and validity.
The committee recognized that there are class conditions across the university (even on the teacher effectiveness only scale) that would require an NA option, so it followed recommended procedures for identifying which items might require an NA. Faculty and students were each asked to identify those items which they felt could not be observed by students across all classes and thus would require an NA. Identified items were eliminated.
Challenges and Questions Cont.
Why is there no item reversal (negative and positive items) to address response set bias?
This type of bias is referred to as acquiescence, the tendency to agree or give positive responses regardless of the content of the items (similar to Halo effect). A strategy used to minimize the effect of this survey taking behavior is to word half of the statements positively and the other half negatively (but in random order). However, this method does not eliminate (or reduce) the bias itself; it simply cancels out the effect of the bias, so that its net effect is reduced to zero.
Berk (2006) recommends that reversals may be appropriate for some scales, but not for teacher effectiveness scales because the positive/negative reversals can be confusing and result in increased response time and response errors. The SETE effectiveness scale is designed to rate the teacher’s positive behaviors, not negative ones.
Challenges and Questions Cont.
What is the applicability of SETE items to courses delivered online?
Application of the SETE items to online courses was a major consideration of the committee. Expertise in delivering online instruction was well represented in the committee. Additionally, input was gathered from faculty and student groups. Several online courses were included in the SETE field test in order to do a comparison of online versus not-online student responses.
The structural equation modeling used to confirm the structure of the student responses included online courses. Faculty and student review groups were convened at the beginning of the Fall 2009 semester to confirm final recommendations regarding the usefulness of SETE survey items for online courses.
Challenges and Questions Cont.
What do my SETE scores mean? How should they be interpreted?
Your SETE scores are a measure of your students’ perception of your teaching effectiveness. The scores are based on a scale across the University. In other words, all individual scores are on the same scale, so that a score of, for example, 600 for a teacher of a particular course in a particular department or college has the same meaning in terms of teaching effectiveness as a score of 600 for a teacher of a particular course in a different department or college. To help with score interpretation, the following factor descriptions of effectiveness are provided on the individual teacher reports.
Challenges and Questions cont.
How should they be interpreted?
Each of the three effectiveness factors has its own unique scale and thus each teacher gets a separate scale score for each factor. The overall construct of Teacher Effectiveness also has its own scale score, and thus is not simply the average of the factor scores.
Rating               Organization and Explanation   Learning Environment   Self-Regulated Learning   Overall Effectiveness
Highly Effective     710 – 981                      659 – 972              747 – 998                 702 – 998
Effective            438 – 709                      347 – 658              495 – 746                 406 – 701
Somewhat Effective   167 – 437                      35 – 346               243 – 494                 111 – 405
Challenges and Questions Cont.
What do my factor scores mean?
A high score means that you were perceived as having a high degree of what is shown in the description below. A low score means a low degree of what is shown.
Factor 1: Organization and Explanation of Materials
This score reflects the student’s perception of how well the instructor: makes the course requirements and student learning outcomes clear to the students; gives assignments, activities, and materials that are helpful and that contribute to understanding the subject; explains difficult material clearly; shows the relationships among topics and new concepts; and evaluates student work in ways that are helpful to learning.
Factor 2: Learning Environment
This score reflects the student’s perception of how well the instructor: establishes a climate of mutual respect and encouragement; motivates students to work and engage in learning; is available and encouraging; is skillful in actively engaging students in learning; and provides useful feedback.
Factor 3: Self-regulated Learning
This score reflects the student’s perception of how well the instructor guides and encourages self-directed learning in which the student is encouraged: to be open to the viewpoints of others; to develop new viewpoints; to connect course topics to a wider understanding of the subject; and to contribute to the learning process.
Challenges and Questions cont.
Is my overall SETE score the average of my three factor scores?
No. The overall construct of Teacher Effectiveness has its own scale score, and thus is not simply the average of the three factor scores. A measurement model with appropriate external control variables is used in determining how items should be weighted when calculating individual scale scores for each factor. This estimation process provides a reasonably fair and unbiased estimate of the individual scale scores as well as providing a high degree of reliability and generalizability to the scale scores.
Challenges and Questions cont.
The response rate for my class was very low. Does this make my SETE score invalid?
No, the scoring methodology used allows response rates as low as one or two students to be put on the SETE scale. A weighted average methodology is used that is similar to what is used for the GRE and similar tests so that a valid point estimate can be calculated for these lower response rates in terms of class size and scale scores obtained. This lower response rate estimate may not be as interpretable (i.e. error free) as one based on a higher response rate, but it is psychometrically sound and usable and will be necessary when looking at scores over time.
Challenges and Questions cont.
What if there is a dramatic spike downward in my scale scores for a particular semester?
A dramatic spike upward or downward for a particular semester, relative to scores from several semesters, can be a concern when looking at continuous improvement. To address this, starting with the Fall 2010 administration, prediction methodologies will be applied that will use information from previous semesters to smooth the scores across semesters so that a fairer and more reasonable interpretation of effectiveness scores can be made for purposes of continuous improvement.
Structure Of The SETE (Rich)
Two aspects of the theoretical model are of interest:
The behavioral domains
The inter-item relationships of the item domains
A particular factor analysis model is used to represent our theoretical SETE model:
Bi-factor model – a general effectiveness (G) factor and three sub-domain factors which are relatively independent in relation to G
Theoretical Model
Higher order construct is Teaching Effectiveness
Teaching effectiveness is modeled as three sub-domains:
Organization and explanation of materials
Learning environment
Self-regulated learning
[Figure: SETE Bifactor Latent Variable Model: One General Factor and Three Specific Content Factors. The twelve SETE item descriptors are linked to the General Teaching Factor and to the three specific factors (Organization and explanation, Learning environment, Self-regulated learning).]
Measurement Model Estimation, Survey Sampling Issues, and Validation
OVERVIEW
Measurement Issues
Factor Analysis
Short Form Item Selection
Survey Sampling Design
Case Weighting: Using Inverse Probability Weighting
External Control Variables
Contextual Effects and Multi-level ANOVA
Scale Score Development
Missing Values
Future Refinements for Implementation
Software and Data Processing
Questions
Measurement Issues
Domain of SETE – the behavioral objectives of teaching effectiveness
Domain – the elements to which a variable is limited
Domain score - true score of an infinitely long set of items (i.e. mean on all possible items)
Item domain – the universe of all possible items under consideration
Reliability
A reliability coefficient is the square of the correlation between the observed score and the domain score (varies between 0 and 1)
The reliability of the SETE varies between .90 and .95
It is the percentage of true score variance out of the total variance (true score variance plus error variance)
Construct Validity – the extent to which an item (score) measures the attribute it was designed to measure
Factor analysis is a primary methodological tool for examining the internal structure of items and provides evidence of validity
Homogeneous items within factors provide evidence of validity
Extensive conceptual/semantic analysis also lends evidence of validity (i.e. student and faculty focus group evaluations of the items)
Measurement Issues (cont.)
Generalizability - degree to which observed scores generalize to domain scores
A generalizability coefficient measures the generalizability of the observed items to a corresponding infinite behavioral domain
Omega coefficient (varies between 0 and 1)
The Omega coefficient is a measure of construct validity
Omega is both a reliability coefficient and a generalizability coefficient
The Omega coefficient of the 12 item SETE varies between .90 and .95
Measurement Invariance – the domain(s), as represented by the internal structure of the items, do not vary across various sub-samples of the population
Sizes of factor loadings do not vary substantially across sub-samples (e.g. demographic variables - department, student major, gender, etc.)
Factor loading patterns do not vary substantially across sub-samples
Predictive Validity – the observed score accurately predicts other outcome measures of related interest (e.g. current semester rating predicts next semester rating with high accuracy)
Factor Analysis
Factor analysis is often referred to as both a measurement error modeling method and a latent variable modeling method
Latent score is defined as a true score which is an unobservable outcome (i.e. factor scores)
Observed score = True score + Error (O = T + E)
Observed score variance = True score variance + Error variance
Diagram convention: squares – observed; circles – unobserved
Inter-correlation of multiple item scores contributes to our estimates of true scores and true score variance
In classical measurement theory, error is inferred: E = O – T
The ordinality of items suggests the use of a “polychoric correlation” matrix rather than a “Pearson correlation” matrix; both were used
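The decomposition O = T + E and the description of reliability as the share of true score variance in total variance can be checked with a small simulation. This is only an illustrative sketch, not part of the SETE analysis: the function names, the sample size, and the standard deviations (3 for true scores, 1 for error) are assumptions chosen so that the theoretical reliability is 0.90.

```python
import random
import statistics


def simulate_scores(n, true_sd, error_sd, seed=0):
    """Simulate n observed scores as O = T + E with independent T and E."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, true_sd) for _ in range(n)]
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return true_scores, observed


def pearson(x, y):
    """Pearson correlation computed from sums of squares."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical values: true-score SD of 3 and error SD of 1 give a
# theoretical reliability of 9 / (9 + 1) = 0.90.
true_scores, observed = simulate_scores(20000, true_sd=3.0, error_sd=1.0)

# Reliability as true score variance over observed score variance.
variance_ratio = statistics.pvariance(true_scores) / statistics.pvariance(observed)

# Reliability as the squared correlation between observed and true scores.
squared_corr = pearson(true_scores, observed) ** 2

print(round(variance_ratio, 3), round(squared_corr, 3))
```

Both quantities should land close to 0.90 and close to each other, which is the sense in which the two definitions of reliability on the slides agree.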
Factor Analysis (cont.)
[Figure: graphic schematic of a simple three-item factor]
Factor Analysis (cont.)
The factor structure of SETE is representative of more than 100,00 responses collected over three terms, and is currently representative of two different institutions - University of North Texas & Texas Woman’s University
Confirmatory Factor Analysis (CFA) fit results
RMSE<.04 (less than .05 is considered excellent)
GOF>.97 (greater than .95 is considered excellent)
Short Form Item Selection
A short form is used whenever the administration of the long form is problematic (e.g. fatigue)
The development of short forms (e.g. 12 item version of SETE) is a common practice
The developer of the short form should demonstrate the relative equivalence between the long form (e.g. 28 item version of SETE) and short form in terms of reliability and generalizability (the 12 and 28 item SETE both have reliability and generalizability greater than .90)
The “iterative” use of factor analysis (e.g. three factors with 4 items per factor) in the selection of short forms can be problematic (e.g. fitting – removing item(s) – refitting - etc.)
Each successive fit ignores all possible/potential multivariate relationships between items
This produces sub-optimal subsets of items
Short Form Item Selection (cont.)
The 12 item SETE was selected using a heuristic optimization method called Ant Colony Optimization (ACO)
Optimal subsets were selected from the final list of 28 items such that certain properties were maximized or minimized:
Maximized: large item loadings (correlation between observed score and latent score minus measurement error)
Maximized: large correlations of factor scores with other external measures (e.g. I would recommend this course to other people)
Maximized: large model fit indices (e.g. goodness of fit: 0-1)
Minimized: small RMSE (distance between model and data)
ACO selection is automated and produces near optimal results – it evaluates candidate configurations of items and selects those configurations that optimize important psychometric criteria
Multiple subsets (parallel forms) can be obtained automatically as well; these multiple subsets can be rank ordered in terms of fit
Survey Sampling Design
Delivery of course surveys through the web presents sampling issues
Non-random response produces response bias
Existing “clusters” of the student population produce data that bias model estimates of teaching effectiveness
One approach is to post-stratify on “cluster” variables, thereby reducing this confounding influence
The predominant approach in the survey sampling literature is to use case weighting and re-sampling methods to estimate and reduce this bias (e.g. inverse probability weighting)
Case Weighting: Using Inverse Probability Weighting (IPW)
Survey sampling methodology provides algorithms that are useful in estimating non-responder bias in survey samples
Inverse probability weighting (IPW) uses background variables (external control variables) to estimate “probability response classes”
e.g. background demographic information – dept., college, major, gender, class size, grade assigned, grade expected, etc.
The inverse of these probabilities down-weights high probability response classes, and gives more weight to low response probability classes
The effect of IPW is to reduce the relationship of the background variables with the principal outcome measure of interest (teaching effectiveness)
IPW also reduces the effect of bias on the model being estimated
e.g. relatively unbiased item loadings in the factor analysis
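The weighting idea can be illustrated with a toy calculation. The sketch below is an assumption-laden simplification, not the SETE procedure: it uses a single background variable (anticipated grade) as the response class, estimates each class's response probability as respondents divided by enrolled students, and weights each respondent by the inverse of that probability. All names and numbers are hypothetical.

```python
from collections import Counter


def response_probabilities(enrolled, respondent_classes):
    """Estimate each class's response probability as respondents / enrolled."""
    counts = Counter(respondent_classes)
    return {cls: counts[cls] / n for cls, n in enrolled.items()}


def ipw_mean(records, enrolled):
    """Mean rating with each respondent weighted by 1 / P(response | class)."""
    probs = response_probabilities(enrolled, [cls for cls, _ in records])
    weights = [1.0 / probs[cls] for cls, _ in records]
    total = sum(w * rating for w, (_, rating) in zip(weights, records))
    return total / sum(weights)


# Hypothetical course: 100 students expect an A and 100 expect a C, but the
# A group responds at 80% and the C group at only 20%.
enrolled = {"expects_A": 100, "expects_C": 100}
records = [("expects_A", 3.5)] * 80 + [("expects_C", 2.5)] * 20

unweighted = sum(rating for _, rating in records) / len(records)
weighted = ipw_mean(records, enrolled)
print(unweighted, weighted)
```

In this toy case the unweighted mean is pulled toward the over-represented high-response class, while the weighted mean gives the two classes the influence they would have had if everyone had responded.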
External Control Variables (Background Variables)
The 12 item SETE modeling process looked at course level, faculty level, and student level background variables
9 background measures were selected from a total of 17 variables
Prior to this, the 17 variables were pre-selected from 30 variables based on relevance
Course level: course size, in class vs. internet, instruction type, time course held
Faculty level: status (e.g. lecturer, assist. prof., full prof.), age, number of years employed at institution, gender of instructor
Student level: anticipated grade, actual grade assigned, mean GPA, academic level, current course load, student’s gender, total credits earned, pre-requisites present (yes/no)
Department and student major were handled by using “multi-level ANOVA” methods (contextual modeling) – more on this later
External Control Variables (cont.)
Selecting the “best” external control variables is important, since non-relevant variables increase error variability and reduce the predictive validity of the SETE items
Bayesian Model Averaging (BMA) was used to select the best subset of variables from 17 in predicting general teaching effectiveness (G)
The BMA model selection strategy can select models that have relatively better prediction accuracy, compared to models with smaller posterior probabilities
External Control Variables (cont.)
The most relevant background variables accounted for about 9% of the total variance in general teaching effectiveness (G)
6 student level variables, 2 course level variables, and 1 faculty level variable
A relative importance metric was generated for the 9 variables
This metric decomposes the variance accounted for in G (9%) into non-overlapping components of variance
Relative importance allows a rank ordering of the importance of variables
IPW with these 9 external control variables reduced the 9% variance accounted for in G to about 2-3%
External Control Variables (cont.)
The academic literature on background variables influencing student ratings of faculty indicates that anticipated grade and class size are two of the largest influences on outcome ratings
Larger classes are associated with lower effectiveness ratings
Lower anticipated grade is associated with lower effectiveness ratings
Our preliminary results support these findings from the literature
IPW can reduce the influence of background variable bias on rating outcomes
The 12 item SETE (without IPW) already has a relatively small amount of bias with regard to the background variables investigated (9%)
This bias was further reduced to a negligible amount (2-3%)
Department and student major have non-significant effects on mean effectiveness rating (G)
We attribute these small effect sizes to the care with which the original 28 items were selected
Contextual Effects: Department and Student Major
Context effects refer to the differential influence of “level” specific variables (contextual variables) on outcome measures of interest
Examples:
students are nested within courses
students can be nested within student majors
courses are nested within departments
departmental influences on teaching may vary across departments
students will respond more “similarly” (as compared to other students in other courses) since they are exposed to the same instructor
different courses may have different course sizes
the same course can vary in size across semesters
The nested structure of response units creates what is known as “within-class correlation” (also known as intra-class correlation)
Within-class correlation may not bias mean estimates, but will likely create large bias for confidence intervals (conf. intervals too narrow)
The predictive accuracy of course means will also be lower
Contextual effects can be modeled with “multi-level” ANOVA methods
Multi-level ANOVA
Analysis of variance (ANOVA) models between group and within group variation
Ideally, we would expect that between department variation in SETE ratings would be small as compared to within department variation on SETE ratings
The nested structure creates within course correlation (intra-class correlation – also known as intra-class reliability)
Courses are nested within departments and vary in course size
This contributes to differences in response consistency across courses (i.e. differences in reliability across courses)
A strategy in dealing with varying reliabilities and varying course sizes is to base course ratings on a weighted average
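The relationship between between-group variation, within-group variation, and intra-class correlation can be made concrete with a one-way ANOVA calculation. The sketch below uses the standard ICC(1) formula for balanced groups, (MSB - MSW) / (MSB + (n - 1) MSW); it is an illustration with made-up ratings, and the slides do not say that this particular estimator was the one used for the SETE.

```python
import statistics


def icc_oneway(groups):
    """ICC(1) from a one-way ANOVA; assumes every group has the same size."""
    k = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal group sizes")
    grand = statistics.fmean([x for g in groups for x in g])
    means = [statistics.fmean(g) for g in groups]
    # Mean square between groups and mean square within groups.
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    ) / (k * (n - 1))
    return (ms_between - ms_within) / (ms_between + (n - 1) * ms_within)


# Hypothetical ratings on the 1-4 scale from three courses, three students each.
ratings_by_course = [[4, 4, 3], [2, 2, 3], [3, 3, 3]]
icc = icc_oneway(ratings_by_course)
print(round(icc, 3))
```

A value near 1 means students in the same course answer alike relative to students in other courses; a value near 0 means course membership explains little of the variation.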
Multi-level ANOVA
This weighted course average is based on:
weighted_course_mean = (r) * course_mean + (1 – r) * pop_mean
where r is the course reliability, course_mean is the mean of the course, and pop_mean is the mean of all course means
Course averages with low reliabilities are moved toward the population mean
Course averages with high reliabilities do not move as much toward the population mean
Reliabilities are a function of course size and response consistency
Weighted averages calculated in this manner are sometimes referred to as “Hierarchical Bayes” estimates or “Empirical Bayes” estimates
Hierarchical Bayes estimates “borrow strength” across groups (pooling information across groups) to estimate group means
Hierarchical Bayes estimates are very good at reducing error (i.e. prediction error)
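A minimal sketch of the weighted course mean follows. The shrinkage formula itself is taken from the slide; how r is computed is not stated there, so the sketch assumes the Spearman-Brown form r = n * ICC / (1 + (n - 1) * ICC), which makes r depend on course size (n) and response consistency (ICC). The function names and all numbers are hypothetical.

```python
def course_reliability(n_responses, icc):
    """Spearman-Brown reliability of a mean of n responses (assumed form of r)."""
    return n_responses * icc / (1.0 + (n_responses - 1) * icc)


def weighted_course_mean(course_mean, pop_mean, r):
    """Shrink a course mean toward the population mean by its reliability r."""
    return r * course_mean + (1.0 - r) * pop_mean


# Hypothetical numbers: population mean 3.0, intra-class correlation 0.2,
# and two courses that both have a raw mean of 3.8.
pop_mean, icc = 3.0, 0.2
small = weighted_course_mean(3.8, pop_mean, course_reliability(2, icc))
large = weighted_course_mean(3.8, pop_mean, course_reliability(40, icc))
print(round(small, 3), round(large, 3))
```

With two respondents the course mean is pulled most of the way toward the population mean; with forty respondents it stays close to its raw value, which is the "borrowing strength" behavior described above.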
Scale Score Development
Scale Score Development (cont.)
Missing Values
Accounting for non-response (missing values) is important for reducing bias in model estimates (e.g. means, factor loadings)
Simple (but inadequate) methods for dealing with missing values include: removing records with missing data, and mean substitution
Better methods exist that take into account the multivariate patterns in the complete and missing data when making a “data imputation” (e.g. maximum likelihood, multiple imputation)
Missing data patterns in SETE data are estimated using “k-nearest neighbor imputation” within a course
Nearest neighbors are records that have similar completed data patterns
Within a course, the average of the k-nearest neighbors’ completed data is used to impute the value for a variable that is missing its value
k-nearest neighbors assumes missing at random (MAR) – i.e. missing data only depends on the observed data; able to take advantage of multivariate relationships in the completed data
The drawback of k-nearest neighbors is that it does not include a component to model random variation; consequently, uncertainty in the imputed value is underestimated
Missing Values (cont.)
Data within a course:
v1  v2  v3  v4   v5  v6
3   3   4   3    4   4    -|
3   3   4   3    4   4     |- 4 nearest
3   2   4   4    4   4     |  neighbors
3   2   4   4    4   4    -|
3   2   4   ?    4   4    before imputation
3   2   4   3.5  4   4    after imputation (v4 = 3.5 is the imputed value)
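The worked example above can be reproduced with a short sketch of k-nearest neighbor imputation. The distance measure is not specified on the slides; Euclidean distance over the variables the incomplete record did answer is assumed here, and the function name is hypothetical.

```python
def knn_impute(target, complete_rows, k):
    """Fill None entries in target with the mean of the k nearest complete rows."""
    observed = [i for i, v in enumerate(target) if v is not None]

    def distance(row):
        # Euclidean distance over the variables the target actually answered.
        return sum((row[i] - target[i]) ** 2 for i in observed) ** 0.5

    neighbors = sorted(complete_rows, key=distance)[:k]
    return [
        v if v is not None else sum(row[i] for row in neighbors) / len(neighbors)
        for i, v in enumerate(target)
    ]


# The six-variable course data from the slide; v4 is missing in the last record.
course = [
    [3, 3, 4, 3, 4, 4],
    [3, 3, 4, 3, 4, 4],
    [3, 2, 4, 4, 4, 4],
    [3, 2, 4, 4, 4, 4],
]
incomplete = [3, 2, 4, None, 4, 4]
imputed = knn_impute(incomplete, course, k=4)
print(imputed)
```

With k = 4 the four neighbors contribute v4 values of 3, 3, 4, and 4, so the imputed value is their mean, 3.5, matching the slide. Because the result is a plain average, it carries no random component, which is the drawback noted earlier.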
Further Refinements for Implementation
Future implementations of SETE will focus on using SETE scores to predict teaching effectiveness one time period ahead (e.g. following semester) – dynamic measurement model
Current course ratings become a prior distribution of ratings used in predicting the following semester ratings
The current course ratings are calculated as a weighted average of the previous semester’s ratings (prior) and the current semester’s ratings (data) to produce a posterior estimate of the current semester’s rating
This posterior distribution of ratings is essentially a weighted average that lies between the previous semester rating and the current semester rating
This posterior distribution of ratings becomes the prior distribution for the following semester
This estimation procedure creates a moving average across semesters (weighted average)
This moving average reduces unwanted variability across semesters due to fluctuating course size and also minimizes the effect of outlying semester ratings
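The prior-to-posterior updating described above can be sketched as a running weighted average. The slides do not say how the weights are chosen, so this sketch assumes a single fixed weight on the current semester's data (simple exponential smoothing); in a full Bayesian treatment the weight would instead depend on the precision of the prior and of the data. The function name and scores are hypothetical.

```python
def smooth_ratings(ratings, data_weight):
    """Carry a weighted average forward: each posterior becomes the next prior."""
    if not 0.0 <= data_weight <= 1.0:
        raise ValueError("data_weight must be between 0 and 1")
    smoothed = []
    prior = None
    for rating in ratings:
        if prior is None:
            posterior = rating  # first semester: no prior information yet
        else:
            posterior = data_weight * rating + (1.0 - data_weight) * prior
        smoothed.append(posterior)
        prior = posterior
    return smoothed


# Hypothetical scale scores for four semesters, with a dip in the third.
semesters = [600, 620, 400, 610]
smoothed = smooth_ratings(semesters, data_weight=0.5)
print(smoothed)
```

The third-semester dip from 620 to 400 is damped to 505 in the smoothed series, and the estimate recovers gradually in the following semester, which illustrates how an outlying semester has less effect on the reported trend.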
Software and Data Processing
Two software systems were used in developing the SETE: Mplus and R
Mplus is commercial software available at http://www.statmodel.com
R is a free, open-source software system available at http://www.r-project.org/
Any Questions?
References:
Test Theory: A Unified Treatment (Roderick P. McDonald, 1999)
Bayesian Analysis for the Social Sciences (Simon Jackman, 2010)
Model Assisted Survey Sampling (Sarndal et al., 1992)
Introduction To Variance Estimation, 2nd ed. (Wolter, 1997)
Forecasting with Exponential Smoothing (Hyndman, Koehler, Ord, Snyder, 2008)
Relative Importance: Grömping, U. (2007), Estimators of Relative Importance in Linear Regression Based on Variance Decomposition, The American Statistician, 61, 139-147.
Bayesian Model Averaging: http://www.stat.washington.edu/raftery/Research/bma.html
ACO: Leite, Huang & Marcoulides (2008), Item Selection for the Development of Short Forms of Scales Using an Ant Colony Optimization Algorithm, Multivariate Behavioral Research, v43 n3, p411-431.
Slide download: http://www.unt.edu/rss/rich/IUPUI/
MEETING THE CHALLENGES OF DEVELOPING A TEACHING EFFECTIVENESS INSTRUMENT THAT MEASURES COURSES ACROSS THE CAMPUS ON A COMMON SCALE
Contact Information
Slide Download Available at: https://www.unt.edu/rss/rich/IUPUI/
Ron [email protected]
Richard [email protected]