
Evaluation Methodologies

Internal Validity
Construct Validity
External Validity

* In the context of a research study, i.e., not measurement validity.

Validity*

Generally relevant only to studies with causal relationships.
◦ Temporal precedence (the cause comes before the effect)
◦ Correlation (cause and effect vary together)
◦ No plausible alternative explanation

Key question: can the outcome be attributed to causes other than the designed interventions?
◦ If so, it is likely that internal validity needs to be tightened up.

Internal Validity

Threats to Internal Validity
◦ Single Group Threats
◦ Multiple Group Threats
◦ Social threats to internal validity

Internal Validity

Single Group Threats

Imagine an educational program where two different testing regimens are used.

In one, an intervention and then a post-test are used.

In the second, a pre-test, an intervention, and a post-test are used.

What are the single group threats for this design?

Single Group Threats
◦ History (something else happened at the same time as the intervention)
◦ Maturation (the change would have happened anyway over the same period)
◦ Testing (testing itself induced an effect)
◦ Instrumentation (changes in the testing)
◦ Mortality (attrition in study participants)
◦ Regression (regression to the mean)
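The regression threat can be made concrete with a small simulation. This is a minimal sketch on synthetic data, not taken from the slides: the function name, the score distributions, and the 10% selection rule are all assumptions chosen for illustration. No intervention is applied at all, yet a group selected for low pre-test scores still appears to improve on the post-test.

```python
import random


def simulate_regression_to_mean(n=10_000, seed=0):
    """Return (pre_mean, post_mean) for the lowest pre-test scorers.

    Hypothetical setup: every participant has a stable true ability, and
    each test adds independent noise. Nothing happens between the tests.
    """
    rng = random.Random(seed)
    ability = [rng.gauss(50, 10) for _ in range(n)]
    pre = [a + rng.gauss(0, 10) for a in ability]
    post = [a + rng.gauss(0, 10) for a in ability]

    # Select the lowest-scoring 10% on the pre-test, as a remedial
    # program might do.
    cutoff = sorted(pre)[n // 10]
    selected = [i for i in range(n) if pre[i] <= cutoff]

    pre_mean = sum(pre[i] for i in selected) / len(selected)
    post_mean = sum(post[i] for i in selected) / len(selected)
    return pre_mean, post_mean


pre_mean, post_mean = simulate_regression_to_mean()
print(f"selected group: pre-test mean {pre_mean:.1f}, post-test mean {post_mean:.1f}")
```

The selected group's post-test mean moves back toward the population mean even though no treatment exists, which is why a single-group design that selects on extreme scores cannot attribute the gain to the intervention.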

Internal Validity

Suppose for the previous study we had multiple groups instead of single groups?

Multiple Group Threats are variations on the Single Group Threat with selection bias added. If the added second group is a control, for instance, it must be selected in a way that makes it fully comparable to the first group (random assignment).

If participants cannot be randomly assigned, then we get a quasi-experimental design.
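Random assignment itself is simple to sketch. The helper below is a hypothetical illustration (the name `randomly_assign` and the participant labels are assumptions): it shuffles the participant list and splits it in half, so that neither the researcher nor the participants choose who ends up in which group.

```python
import random


def randomly_assign(participants, seed=None):
    """Split participants into (treatment, control) groups at random."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical participant identifiers.
people = [f"participant_{i}" for i in range(10)]
treatment, control = randomly_assign(people, seed=42)
print("treatment:", treatment)
print("control:  ", control)
```

A fixed seed is used here only so the split can be reproduced; in a real study the assignment procedure would be documented rather than hard-coded.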

Multiple Group Threats

Applicable to social sciences (because people do not react simply to stimuli)
◦ Diffusion (people in different treatment groups talk to one another)
◦ Compensatory rivalry (treatment groups know what is happening and develop a rivalry)
◦ Resentful demoralization (same as above, but with the opposite sign)
◦ Compensatory equalization (researchers or others equalize groups)

Social Interaction Threats

Are the results valid for other persons in other places and at other times?
◦ Do they generalize?

Types of generalization

Threats to external validity

External Validity

Generalizations
◦ Sampling Model: try to make certain that your study groups are a random sample of the population you wish your generalization to extend to.
◦ “Proximal Similarity”: measure or stratify the sample on the things you cannot randomize.
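Stratifying a sample, as mentioned above, can be sketched as follows. This is an illustration under stated assumptions, not a method given in the slides: the function name `stratified_sample`, the `stratum_of` callback, and the urban/rural population are all hypothetical. Each stratum is sampled at the same fraction so the sample keeps the population's proportions on the stratifying variable.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, fraction, seed=None):
    """Draw the same fraction from every stratum of the population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)

    sample = []
    for members in strata.values():
        # Take at least one unit so that small strata are not dropped.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 80 urban and 20 rural schools.
schools = [("urban", i) for i in range(80)] + [("rural", i) for i in range(20)]
chosen = stratified_sample(schools, stratum_of=lambda s: s[0], fraction=0.1, seed=7)
print(sum(1 for s in chosen if s[0] == "urban"), "urban,",
      sum(1 for s in chosen if s[0] == "rural"), "rural")
```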

External Validity

Threats to external validity
◦ People
◦ Places
◦ Times

External Validity

An assessment of how well ideas or theories are translated into actual programs.

Mapping of concrete activities into theoretical constructs.

Construct Validity

Formal articulations:
◦ Nomological network (Cronbach and Meehl, 1955): researchers were to establish a theoretical framework of what to measure, an empirical framework of how to measure it, and the linkages between the two.
◦ Multitrait-Multimethod Matrix (Campbell and Fiske, 1959): convergent concepts should show higher correlations; divergent concepts should show lower correlations.
◦ Pattern matching (Trochim, 1985): linking a theoretical pattern with an operational pattern.
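The multitrait-multimethod idea can be illustrated with a small correlation check. This is a minimal sketch on synthetic data: the traits "A" and "B", the methods "survey" and "observation", and the noise levels are assumptions, and the sketch ignores shared method variance, which a full matrix would also examine. Two measures of the same trait should correlate strongly (convergent), while measures of different traits should correlate weakly (discriminant).

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def synthetic_measures(n=500, seed=1):
    """Two independent hypothetical traits, each measured by two noisy methods."""
    rng = random.Random(seed)
    trait_a = [rng.gauss(0, 1) for _ in range(n)]
    trait_b = [rng.gauss(0, 1) for _ in range(n)]

    def noisy(values):
        return [v + rng.gauss(0, 0.5) for v in values]

    return {
        ("A", "survey"): noisy(trait_a),
        ("A", "observation"): noisy(trait_a),
        ("B", "survey"): noisy(trait_b),
        ("B", "observation"): noisy(trait_b),
    }


measures = synthetic_measures()
# Same trait, different methods: expected to be high.
convergent = pearson(measures[("A", "survey")], measures[("A", "observation")])
# Different traits, same method: expected to be low.
discriminant = pearson(measures[("A", "survey")], measures[("B", "survey")])
print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```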

Construct Validity

Threats to Construct Validity
◦ Poorly defined constructs
◦ Mono-operation bias: the construct is larger than the single program / treatment you devised.
◦ Mono-method bias: the construct is larger than the limited set of measurements you devised.
◦ Test and treatment interaction: measurement changes the treatment group.
◦ Other threats generally fall under “labeling” threats: a construct is essentially a metaphor, and if it is not precisely articulated, differing meanings can be held by different persons.

Construct Validity

Social Threats to Construct Validity
◦ Hypothesis guessing: participants guess at the purpose of your study and attempt to game it.
◦ Evaluation apprehension: if apprehension causes participants to do poorly (or to pose as doing well), then the apprehension becomes a confounding factor.
◦ Researcher expectancies: researcher expectancies confound the outcome.
  ◦ Hawthorne effect: people change behavior when observed.
  ◦ Rosenthal effect: researcher expectations can change outcomes even when subjects are uninformed.

Construct Validity

Wake Up and Smell the Coffee

Authors see methodology as intellectual infrastructure.

Believe that rapid change in CS produces outdated methodology.

Three key claims:
◦ Workloads used need to be appropriate
◦ Experimental design needs to be appropriate
◦ Analysis needs to be rigorous

Wake Up Overview

For this paper, the authors focus on Java.
◦ Java includes modern language features (type safety, automatic memory management, secure execution).

◦ Authors believe that these features make previous benchmarks untenable:

Tradeoffs due to garbage collection where heap size is a control variable

Non-determinism due to adaptive optimization and sampling technologies

System warm-up from dynamic class loading and just-in-time compilation

Wake Up: Focus

Authors created a suite (DaCapo) of benchmark tools suitable for research. The suite consists of open source applications.

The authors validate DaCapo's diversity by running a variety of tests and then applying principal component analysis (PCA) to the results.

Authors point to “cherry picking” research by Perez, showing that reducing the diversity of measures increases the number of ambiguous and incorrect conclusions.

Wake Up: Workloads

The authors in their results show four ways to evaluate garbage collection. Any specific measure can be “gamed” to produce a desired result.

Classic comparison of Fortran / C / C++: control for host platform and language runtime.

New comparisons: control for host platform, language runtime, heap size, nondeterminism and warm-up.
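One way to picture the extra control variables is as a run plan that crosses every benchmark with several heap sizes and repeated invocations, while fixing the number of warm-up iterations. The sketch below is hypothetical: the function `build_run_plan`, the field names, the placeholder benchmark names, and the chosen values are assumptions for illustration, not the paper's actual configuration.

```python
from itertools import product


def build_run_plan(benchmarks, heap_multipliers, invocations, warmup_iterations):
    """List every (benchmark, heap size, invocation) combination to run.

    heap_multipliers are multiples of each benchmark's minimum heap size;
    repeated invocations expose nondeterminism, and warm-up iterations are
    run before timing starts.
    """
    plan = []
    for bench, heap, inv in product(benchmarks, heap_multipliers, range(invocations)):
        plan.append({
            "benchmark": bench,
            "heap_multiplier": heap,
            "invocation": inv,
            "warmup_iterations": warmup_iterations,
        })
    return plan


# Placeholder benchmark names and settings, chosen only for illustration.
plan = build_run_plan(
    benchmarks=["bench_a", "bench_b"],
    heap_multipliers=[1.5, 2.0, 3.0],
    invocations=5,
    warmup_iterations=3,
)
print(len(plan), "runs; first:", plan[0])
```

Writing the plan down as data makes it harder to silently drop a heap size or an invocation that produced an inconvenient result.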

Wake Up: Experimental Design

To obtain meaningful data from noisy measurements, data must be collected over repeated runs and aggregated.

Current practices sometimes lack statistical rigor.

Presenting all the results from the suite (as opposed to one number) will reduce “cherry picking”.
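As one possible form of aggregation, repeated timings can be reported as a mean with a confidence interval rather than as a single number. The sketch below is an assumption about how this might be done, not the authors' procedure: the function name `mean_with_ci` and the sample timings are made up, and it uses a normal approximation, whereas a t-distribution would be more appropriate for small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_with_ci(samples, confidence=0.95):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    m = mean(samples)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(samples) / sqrt(len(samples))
    return m, m - half_width, m + half_width


# Made-up execution times (seconds) from five invocations of one benchmark.
timings = [10, 12, 11, 13, 9]
m, lower, upper = mean_with_ci(timings)
print(f"mean {m:.2f} s, 95% CI [{lower:.2f}, {upper:.2f}]")
```

If the intervals of two systems overlap heavily, the data do not support claiming that one is faster than the other.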

Wake Up: Analysis
