Evaluation Methodologies
Internal Validity Construct Validity External Validity
* In the context of a research study, i.e., not measurement validity.
Validity*
Generally relevant only to studies of causal relationships, which require:
◦ Temporal precedence
◦ Correlation
◦ No plausible alternative explanation
Key question: can the outcome be attributed to causes other than the designed interventions?
◦ If so, internal validity likely needs to be tightened up.
Internal Validity
Threats to Internal Validity
◦ Single Group Threats
◦ Multiple Group Threats
◦ Social threats to internal validity
Internal Validity
Single Group Threats
Imagine an educational program where two different testing regimens are used.
In one, an intervention and then a post-test are used.
In the second, a pre-test, an intervention, and a post-test are used.
What are the single group threats for this design?
Single Group Threats
◦ History (something else happened at the same time)
◦ Maturation (the change would have happened over the same period anyway)
◦ Testing (testing itself induced an effect)
◦ Instrumentation (changes in the testing)
◦ Mortality (attrition in study participants)
◦ Regression (regression to the mean)
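The regression threat is easy to underestimate, so a small simulation may help. The sketch below is illustrative only: the score distribution, noise level, and 10% cutoff are assumptions, not values from the study described above. It selects the top scorers on a noisy pre-test and shows that their post-test mean drifts back toward the population mean even though no intervention is applied.

```python
import random
import statistics


def simulate_regression_to_mean(n=10_000, top_fraction=0.1, seed=0):
    """Return (pre-test mean, post-test mean) for the top pre-test scorers."""
    rng = random.Random(seed)
    # Assumed model: a stable true ability plus independent noise on each test.
    ability = [rng.gauss(50, 10) for _ in range(n)]
    pre = [a + rng.gauss(0, 10) for a in ability]
    post = [a + rng.gauss(0, 10) for a in ability]  # no intervention at all
    # Select the participants who scored highest on the pre-test.
    cutoff = sorted(pre, reverse=True)[int(n * top_fraction) - 1]
    selected = [i for i in range(n) if pre[i] >= cutoff]
    return (
        statistics.mean(pre[i] for i in selected),
        statistics.mean(post[i] for i in selected),
    )


pre_mean, post_mean = simulate_regression_to_mean()
print(f"top scorers: pre-test mean {pre_mean:.1f}, post-test mean {post_mean:.1f}")
```

A group chosen for extreme pre-test scores will tend to look "worse" (or, for low scorers, "better") on the post-test regardless of the program, which is why the pre-test/post-test design alone cannot separate this effect from the intervention.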
Internal Validity
Suppose for the previous study we had multiple groups instead of single groups?
Multiple Group Threats are variations on the Single Group Threat with selection bias added. If the added second group is a control, for instance, it must be selected in a way that makes it fully comparable to the first group (random assignment).
If participants cannot be randomly assigned, then we get quasi-experimental design.
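Random assignment can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical list of participant IDs and an even split into two groups; real studies often add blocking or balancing steps that are not shown here.

```python
import random


def randomly_assign(participants, seed=None):
    """Shuffle participants and split them into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical participant IDs, for illustration only.
participants = [f"P{i:02d}" for i in range(1, 21)]
treatment, control = randomly_assign(participants, seed=42)
print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because group membership depends only on the shuffle, any pre-existing difference between participants is equally likely to land in either group, which is what makes the control group comparable.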
Multiple Group Threats
Applicable to social sciences (because people do not react simply to stimuli)
◦ Diffusion (people in treatment groups talk to one another)
◦ Compensatory rivalry (treatment groups know what is happening and develop a rivalry)
◦ Resentful demoralization (same as above, but with an opposite sign)
◦ Compensatory equalization (researchers or others equalize groups)
Social Interaction Threats
Are the results valid for other persons in other places and at other times?
◦ Do they generalize?
Types of generalization
Threats to external validity
External Validity
Generalizations
◦ Sampling Model: try to make certain that your study groups are a random sample of the population you wish your generalization to extend to.
◦ “Proximal Similarity”: measure or stratify the sample on the things you cannot randomize.
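Stratifying a sample can be sketched as follows. This is only an illustration under assumptions: the population records, the `site` attribute, and the 10% sampling fraction are hypothetical, and proportional allocation is one of several possible stratification schemes.

```python
import random
from collections import defaultdict


def stratified_sample(population, key, fraction, seed=None):
    """Draw the same fraction at random from every stratum defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in population:
        strata[key(person)].append(person)
    sample = []
    for members in strata.values():
        # Keep at least one member per stratum so no group disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 80 urban and 20 rural participants.
population = (
    [{"id": i, "site": "urban"} for i in range(80)]
    + [{"id": i, "site": "rural"} for i in range(80, 100)]
)
sample = stratified_sample(population, key=lambda p: p["site"], fraction=0.1, seed=1)
print(len(sample), "sampled;",
      sum(p["site"] == "rural" for p in sample), "of them rural")
```

The sample keeps the population's urban-to-rural proportions by construction, so a generalization to that population does not hinge on a lucky draw for that attribute.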
External Validity
Threats to external validity
◦ People
◦ Places
◦ Times
External Validity
An assessment of how well ideas or theories are translated into actual programs.
Mapping of concrete activities into theoretical constructs.
Construct Validity
Formal articulations:
◦ Nomological network (Cronbach and Meehl, 1955): researchers were to establish a theoretical framework of what to measure, an empirical framework of how to measure it, and the linkages between the two.
◦ Multitrait-Multimethod Matrix (Campbell and Fiske, 1959): convergent concepts should show higher correlations; divergent concepts should show lower correlations.
◦ Pattern matching (Trochim, 1985): linking a theoretical pattern with an operational pattern.
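The convergent/divergent idea behind the Multitrait-Multimethod Matrix can be illustrated with synthetic data. The sketch below is an assumption-laden toy, not the Campbell and Fiske procedure: the two traits, the "survey" and "observed" methods, and the noise level are all made up, and shared method variance is deliberately not modeled.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


rng = random.Random(0)
n = 500
# Two independent latent traits (hypothetical).
trait_a = [rng.gauss(0, 1) for _ in range(n)]
trait_b = [rng.gauss(0, 1) for _ in range(n)]
# Trait A measured by two methods; trait B measured by one.
a_survey = [t + rng.gauss(0, 0.5) for t in trait_a]
a_observed = [t + rng.gauss(0, 0.5) for t in trait_a]
b_survey = [t + rng.gauss(0, 0.5) for t in trait_b]

convergent = pearson(a_survey, a_observed)    # same trait, different methods
discriminant = pearson(a_survey, b_survey)    # different traits, same method
print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```

If a real instrument showed the opposite pattern, with measures of different traits correlating as strongly as measures of the same trait, that would be evidence against the construct as operationalized.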
Construct Validity
Threats to Construct Validity
◦ Poorly defined constructs
◦ Mono-operation bias: the construct is larger than the single program / treatment you devised.
◦ Mono-method bias: the construct is larger than the limited set of measurements you devised.
◦ Test and treatment interaction: measurement changes the treatment group.
◦ Other threats generally fall under “labeling” threats: a construct is essentially a metaphor, and if not precisely articulated, differing meanings can be held by different persons.
Construct Validity
Social Threats to Construct Validity
◦ Hypothesis guessing: participants guess at the purpose of your study and attempt to game it.
◦ Evaluation apprehension: if apprehension causes participants to do poorly (or to pose as doing well), then the apprehension becomes a confounding factor.
◦ Researcher expectancies: researcher expectancies confound the outcome.
  ◦ Hawthorne effect: people change behavior when observed.
  ◦ Rosenthal effect: researcher expectations can change outcomes even when subjects are uninformed.
Construct Validity
Wake Up and Smell the Coffee
Authors see methodology as intellectual infrastructure.
Believe that rapid change in CS produces outdated methodology.
Three key claims:
◦ Workloads used need to be appropriate
◦ Experimental design needs to be appropriate
◦ Analysis needs to be rigorous
Wake Up Overview
For this paper, the authors focus on Java
◦ Java incorporates modern language features (type safety, memory management, secure execution)
◦ Authors believe that these features make previous benchmarking practice untenable:
  ◦ Tradeoffs due to garbage collection, where heap size is a control variable
  ◦ Non-determinism due to adaptive optimization and sampling technologies
  ◦ System warm-up from dynamic class loading and just-in-time compilation
Wake Up: Focus
Authors created a suite (DaCapo) of benchmark tools suitable for research. The suite consists of open source applications.
DaCapo validates the diversity of its benchmarks by running a variety of tests and then applying PCA (principal component analysis).
Authors point to “cherry picking” research by Perez, showing that reducing the diversity of measures leads to more ambiguous and incorrect conclusions.
Wake Up: Workloads
The authors in their results show four ways to evaluate garbage collection. Any specific measure can be “gamed” to produce a desired result.
Classic comparison of Fortran / C / C++: control for host platform and language runtime.
New comparisons: control for host platform, language runtime, heap size, nondeterminism and warm-up.
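One way to picture the larger design space is to enumerate every combination of the controlled variables before running anything. The sketch below is hypothetical throughout: the benchmark names, collector names, heap multipliers, and invocation count are placeholders rather than settings from the paper, and warm-up handling (for example, discarding early iterations within each run) is left out.

```python
from itertools import product


def build_plan(benchmarks, collectors, heap_multipliers, invocations):
    """List every run needed to cover the full cross product of settings."""
    return [
        {"benchmark": b, "collector": c, "heap_multiplier": h, "invocation": i}
        for b, c, h, i in product(
            benchmarks, collectors, heap_multipliers, range(invocations)
        )
    ]


# Placeholder names and values, for illustration only.
plan = build_plan(
    benchmarks=["bench_a", "bench_b"],
    collectors=["gc_x", "gc_y"],
    heap_multipliers=[1.5, 2.0, 3.0],  # assumed: multiples of a minimum heap size
    invocations=5,                     # repeated runs to expose nondeterminism
)
print(len(plan), "runs planned")
```

Writing the plan out first makes it obvious when a comparison silently fixes one variable, such as reporting a single heap size.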
Wake Up: Experimental Design
To obtain meaningful data from noisy estimates, data must be collected and aggregated.
Current practices sometimes lack statistical rigor.
Presenting all the results from the suite (as opposed to one number) will reduce “cherry picking”.
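Aggregating noisy measurements can be as simple as reporting a mean with a confidence interval instead of a single number. The sketch below is a minimal example under assumptions: the timings are invented, and the interval uses a normal approximation, whereas a Student's t interval would be more appropriate for a sample this small.

```python
import math
import statistics


def mean_with_ci(samples, confidence=0.95):
    """Return (mean, low, high) using a normal-approximation interval."""
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * sem, mean + z * sem


# Invented execution times in milliseconds from repeated runs of one benchmark.
timings_ms = [412.0, 398.5, 405.2, 421.7, 401.3, 409.9, 415.4, 403.8]
mean, low, high = mean_with_ci(timings_ms)
print(f"mean {mean:.1f} ms, 95% CI [{low:.1f}, {high:.1f}] ms")
```

Two configurations whose intervals overlap heavily should not be reported as a clear win for either one.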
Wake Up: Analysis