questions raised by results - pennsylvania state university

Questions Raised by Results

Questions raised by significant results

External validity
  Do results generalize to other levels of the IV?
  Do other variables moderate the IV’s effect?

Construct validity (Was the control group good enough?)

Simple two-group experiments may not be good enough.

Multi-Group Experiments

Improve construct validity
Improve external validity
Answer more questions

Example

[Figure: a simple experiment (treatment: No / Yes) compared with a multi-group experiment (amount of treatment: No / Medium / High)]

Benefits of Multi-Group Experiment

Better ability to estimate the effects of different amounts (levels) of a treatment
Better ability to rule out confounding variables
May help discover significant relationships
May help more accurately map the functional relationship

Summary: Multi-level experiments have more external validity than simple experiments.

Group Assignment in Multi-Group Experiments

Similar procedure to the simple experiment:
  Using a random number table
  Using a die (rather than a coin)
  Using Excel
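The same random assignment can be scripted instead of rolling a die. The following is a minimal sketch using Python's standard `random` module; the function name `assign_to_groups`, the participant IDs, and the three group labels are illustrative assumptions, not part of the slides. It shuffles the participants and deals them out in turn so that group sizes stay balanced.

```python
import random

def assign_to_groups(participants, groups, seed=None):
    """Randomly assign participants to groups of (nearly) equal size."""
    rng = random.Random(seed)      # the seed only makes the example repeatable
    shuffled = list(participants)
    rng.shuffle(shuffled)          # the random order plays the role of the die
    assignment = {g: [] for g in groups}
    for i, person in enumerate(shuffled):
        # Deal participants out like cards so sizes differ by at most one.
        assignment[groups[i % len(groups)]].append(person)
    return assignment

participants = [f"P{i:02d}" for i in range(1, 13)]  # hypothetical IDs
groups = ["none", "medium", "high"]                 # amount of treatment
result = assign_to_groups(participants, groups, seed=42)
for group, members in result.items():
    print(group, members)
```

Dealing from a shuffled list, rather than rolling once per participant, is a design choice that guarantees equal group sizes.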

Statistics Methods

t tests ANOVA
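With two groups a t test is enough; with three or more groups a one-way ANOVA compares all group means at once. As a rough illustration of what the F statistic measures, the sketch below computes it by hand with the standard library only. The function name `one_way_anova_f` and the scores are made up for the example; in practice a statistics package (for instance SciPy's `scipy.stats.f_oneway`) would also report the p value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k, n_total = len(groups), len(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical scores for no, medium, and high amounts of treatment.
scores = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(scores)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```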

Control Group

No treatment at all? May create noise.
  Example: medicine. Taking medicine may itself create a psychological effect.
  Subjects may guess what you are trying to do.

Solution: provide a "fake" treatment, similar to the real treatment in form only.
  Placebo pills
  Systems with a similar appearance, but different treatments

Factorial Design

Why multiple factors? Effects of multiple factors on subjects
  Main effect: the effect of an individual factor, independently
  Interaction: the effect of two factors jointly

Most often seen design: 2 x 2

2 x 2 Factorial Design

Two factors
Two levels of each factor: present or absent
Four conditions

What do researchers look for?
  Two main effects: Does each factor affect subjects?
  The interaction between the two factors: Do the two factors interfere with each other? (This concerns external validity.)

Interactions Are Important

A simple experiment is about a main effect.
  It may be weak in generalization: other, non-studied factors may interfere.
  e.g., drugs: the active ingredient plus other chemicals people may take in their daily life

Interesting questions are often about interactions (e.g., drug safety).

External validity questions are questions involving interactions.

An Example

A study on the effect of using IM on productivity
Two groups of employees
  Experimental group: use IM
  Control group: don’t use IM

Someone Questions Your Results

“I saw people using IM more while on the move. Does using IM on the move have anything to do with productivity?”
Two factors: IM and mobile

You need a 2 x 2 design:
  Use IM: yes/no
  On mobile device: yes/no

Your data from the study
  Dependent variable: How many orders do they make per week?

Different Types of Result

Main effect of both factors, and interaction
Main effect of both factors, but no interaction
Main effect of one factor, and interaction
Main effect of one factor, but no interaction
Interaction with no main effect of either factor
No interaction, no main effect

Two main effects, with interaction

              Not mobile   Mobile
Not use IM    70           75
Use IM        73           83
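The cell means of the IM example (70, 75, 73, 83 orders per week) can be turned into main effects and an interaction with simple arithmetic. The sketch below is illustrative only; the function name `effects_2x2` and its argument order are assumptions. It describes the pattern in the means and does not test significance, which would require the raw data and an ANOVA.

```python
def effects_2x2(a0b0, a0b1, a1b0, a1b1):
    """Main effects and interaction from the four cell means of a 2 x 2 design.

    a = first factor (0 = absent, 1 = present), b = second factor.
    """
    main_a = (a1b0 + a1b1) / 2 - (a0b0 + a0b1) / 2   # average effect of factor a
    main_b = (a0b1 + a1b1) / 2 - (a0b0 + a1b0) / 2   # average effect of factor b
    # Difference of differences: does the effect of b change with a?
    interaction = (a1b1 - a1b0) - (a0b1 - a0b0)
    return main_a, main_b, interaction

# a = use IM, b = on a mobile device
im_effect, mobile_effect, interaction = effects_2x2(70, 75, 73, 83)
print(im_effect, mobile_effect, interaction)  # 5.5 7.5 5
```

Here being mobile adds 5 orders without IM but 10 orders with IM; that difference of 5 is the interaction.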

[Tables: the same 2 x 2 layout (rows: Not use IM / Use IM; columns: Not mobile / Mobile) for each of the remaining cases]
Two main effects, with no interaction
One main effect, with interaction
One main effect, with no interaction
No main effect, with interaction
No main effect, no interaction

Interpretation of Results

Two main effects without interaction: It all adds up.
Two main effects with interaction: The two factors may amplify or impede each other’s effect.
One main effect without interaction: The effect of the factor is independent.
One main effect with interaction: One factor may have no effect on its own, but can affect the other one (e.g., a catalyst).
No main effect with interaction: The effect of one factor depends on the other, although neither main effect is significant.

In Sum

Factorial design allows you to study multiple factors in one study.

More treatments
More complex results
More difficult to design and execute

Matched Pairs, Within-Subjects, and Mixed Design

Different Kinds of Experiments

Field experiments
  In a natural setting, rather than in a lab
  Advantages: external validity, construct validity
  Disadvantages: independence of groups and process control are harder to ensure

Matched pairs design
  Create two similar groups based on a certain criterion
  Advantage: internal validity
  Problems: matching could be hard and inaccurate
  Other factors: individual differences

Within-subjects designs
  Each subject goes through all treatments
  Comparing different treatments of the same subject
  Advantage: internal validity

Mixed designs in factorial design
  One factor: within-subjects
  The other: matched pairs or between-subjects

[Figure: how the experimental group relates to the control group in a field experiment, a matched pairs design, and a within-subjects design]

Matched Pairs Design: Procedure

Form matched pairs
Randomly assign one member of each pair to the treatment condition, the other to the control condition
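The two-step procedure can be sketched in code. This is one simple way to do it, assuming a single numeric matching variable: rank participants by their score, pair neighbors in the ranking, then flip a coin within each pair. The function name `matched_pairs_assignment`, the participant IDs, and the pretest scores are hypothetical.

```python
import random

def matched_pairs_assignment(scores, seed=None):
    """Pair participants with similar scores, then randomize within pairs.

    scores: dict mapping participant -> matching-variable score.
    Returns a list of (treatment_member, control_member) tuples.
    """
    rng = random.Random(seed)                  # seed only for repeatability
    ranked = sorted(scores, key=scores.get)    # neighbors in rank are most similar
    # With an odd number of participants the last one is left unpaired.
    pairs = [ranked[i:i + 2] for i in range(0, len(ranked) - 1, 2)]
    assignment = []
    for pair in pairs:
        rng.shuffle(pair)                      # coin flip within the pair
        assignment.append((pair[0], pair[1]))
    return assignment

# Hypothetical pretest scores used as the matching variable.
scores = {"P1": 52, "P2": 71, "P3": 55, "P4": 90, "P5": 68, "P6": 88}
pairs = matched_pairs_assignment(scores, seed=7)
for treatment, control in pairs:
    print(f"treatment: {treatment}  control: {control}")
```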

Considerations in Using Matched Pairs Designs

Finding an effective matching variable (e.g., race, education, age, …)

External validity
  Advantage: don’t restrict the subject population (can have a heterogeneous group)
  Disadvantage: results may not generalize to participants who haven’t done the matching task

Construct validity is weakened because matching may tip off participants about the hypothesis

Analysis of Data in the Matched Pairs Design

Not the between-subjects t test (observations are not independent)

Dependent t test: mean of the differences within pairs / standard error of the differences
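As a worked illustration of the dependent t test, the sketch below computes t as the mean within-pair difference divided by the standard error of those differences, with n - 1 degrees of freedom (n = number of pairs). The function name `dependent_t` and the scores are invented for the example; a statistics package would also supply the p value.

```python
from math import sqrt
from statistics import mean, stdev

def dependent_t(treatment, control):
    """Return (t, df) for paired observations listed in the same pair order."""
    diffs = [t - c for t, c in zip(treatment, control)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)   # sample SD of differences / sqrt(n)
    return mean(diffs) / standard_error, n - 1

# Hypothetical scores for four matched pairs.
treatment = [12, 15, 11, 14]
control = [10, 12, 10, 10]
t_value, df = dependent_t(treatment, control)
print(f"t({df}) = {t_value:.3f}")
```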

Within-Subjects (Repeated Measures) Designs

Considerations in using within-subjects designs
  Increased power
  Order effects harm internal validity

Danger: Order Effects

Four Sources of Order Effects

Practice effects: gradually improved performance
Fatigue effects: gradually deteriorated performance
Treatment carryover effects: previous treatments affect the following treatments
Sensitization: subjects can guess what your IVs and DVs are, and may play along. This hurts both construct validity and internal validity (why?)

Dealing with Order Effects

Minimizing each individual threat (practice, fatigue, carryover, sensitization):
  Allow sufficient practice, make tasks interesting, use few levels, allow sufficient time between treatments, make treatment levels less noticeable, etc.

Mixing up sequences to try to balance out order effects: randomizing and counterbalancing
  Randomized or counterbalanced within-subjects designs

Randomized Within-Subjects Design

Randomly determine the sequence of treatments for each participant.

Bet on luck!
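To see why randomizing is a bet on luck, here is a small sketch that gives each participant an independently shuffled order and then counts how often each condition comes first. The function name `randomized_sequences`, the participant IDs, and the condition labels are illustrative assumptions. With few participants the counts are often uneven, so order effects are not guaranteed to cancel.

```python
import random
from collections import Counter

def randomized_sequences(participants, conditions, seed=None):
    """Give every participant an independently shuffled treatment order."""
    rng = random.Random(seed)   # seed only makes the example repeatable
    return {p: rng.sample(conditions, k=len(conditions)) for p in participants}

participants = [f"P{i}" for i in range(1, 9)]   # hypothetical IDs
conditions = ["A", "B", "C"]
sequences = randomized_sequences(participants, conditions, seed=1)

# How often does each condition appear in the first position?
first_counts = Counter(seq[0] for seq in sequences.values())
for participant, seq in sequences.items():
    print(participant, " ".join(seq))
print(dict(first_counts))
```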

We need a better solution to rule out the possibility of order effects.

Counterbalanced Within-Subjects Design

Design a set of sequences such that
  Every condition appears in every position the same number of times
  Every condition precedes every other condition just as many times as it follows that condition

Randomly assign participants to your sequences

Examples

Two conditions: A – B and B – A

Three conditions
Treatment order   1 2 3
Group 1           A B C
Group 2           B C A
Group 3           C A B
Group 4           C B A
Group 5           A C B
Group 6           B A C
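The two counterbalancing requirements can be checked mechanically. The sketch below is an illustration with a hypothetical helper, `is_counterbalanced`: it counts how often each condition occupies each position and how often each condition comes before each other one. Using all six orders of three conditions satisfies both rules, while the three cyclic orders alone balance position but not precedence.

```python
from collections import Counter
from itertools import permutations

def is_counterbalanced(sequences):
    """Check both counterbalancing rules for a list of condition orders."""
    conditions = sorted(sequences[0])
    positions = range(len(conditions))
    # (condition, position) -> number of sequences with that placement
    position_counts = Counter(
        (c, i) for seq in sequences for i, c in enumerate(seq))
    # (a, b) -> number of sequences in which a comes somewhere before b
    precedes = Counter(
        (a, b) for seq in sequences
        for i, a in enumerate(seq) for b in seq[i + 1:])
    same_positions = len(
        {position_counts[(c, i)] for c in conditions for i in positions}) == 1
    symmetric = all(
        precedes[(a, b)] == precedes[(b, a)]
        for a in conditions for b in conditions if a != b)
    return same_positions and symmetric

all_orders = [list(p) for p in permutations("ABC")]
cyclic_only = [list("ABC"), list("BCA"), list("CAB")]
print(len(all_orders), is_counterbalanced(all_orders))  # 6 True
print(is_counterbalanced(cyclic_only))                  # False
```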

Latin Square

Helps you get the sequence of conditions

4 x 4 Latin square
Treatment order   1 2 3 4
Group 1           A B D C
Group 2           B C A D
Group 3           C D B A
Group 4           D A C B
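The 4 x 4 square above can be reproduced with a common construction for an even number of conditions: build the first row as 1, 2, n, 3, n-1, ... and obtain each later row by shifting every entry one condition forward. The sketch below is one way to code this; the function name `balanced_latin_square` is hypothetical, and the construction is a textbook technique that the slides do not name.

```python
def balanced_latin_square(conditions):
    """Build a balanced Latin square for an even number of conditions."""
    n = len(conditions)
    if n % 2 != 0:
        raise ValueError("this construction needs an even number of conditions")
    # First row as index offsets: 0, 1, n-1, 2, n-2, ...
    first_row = [0]
    low, high = 1, n - 1
    while len(first_row) < n:
        first_row.append(low)
        low += 1
        if len(first_row) < n:
            first_row.append(high)
            high -= 1
    # Each later row shifts every entry one condition forward (wrapping around).
    return [[conditions[(offset + row) % n] for offset in first_row]
            for row in range(n)]

square = balanced_latin_square(["A", "B", "C", "D"])
for group, order in enumerate(square, start=1):
    print(f"Group {group}:", " ".join(order))
```

In this square each condition also immediately follows every other condition exactly once, which helps with carryover effects.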

Pros and Cons of Counterbalanced Within-Subjects Designs

Balances out order effects
Provides information not only about the effect of treatment, but also about the effect of order (trials, position) and sequence
May require more subjects
Analysis is more sophisticated; ANOVA is often required.

Conclusion

Experimental studies
  Manipulate IVs
  Compare different groups
  Study several factors

Good internal validity: counterbalanced design, process control
Improved external validity and construct validity

Experimental design Process

Craft Task

Designing a good task needs creative thinking!

Running an Experimental Study

The Book
  A guidebook for studies involving human users
  Offers detailed guidelines

Practice and Experience

Running an experiment takes practice.
Experience is very important, in particular for the design of an experiment.

We can only discuss some general issues in this course.
Traditional psychology departments offer semester-long courses on experimental design.

Use my research as an example

Issues to Consider

Experimental Design
Test Space, Instrument, Apparatus
Data Collection
Subject Recruitment
Experimental Protocols
Experimental Process
Pilot Test

Experimental Design

Most difficult part
May take a long time to get a "satisfactory" design
  Several rounds
  Various factors to consider: tasks, subjects, equipment, etc.
Based on your hypothesis

Key considerations
  Tasks
    Incorporate both IVs and DVs with good validity
    Not harmful to subjects
    Not boring
    Reasonable time span
  Treatments

Experimental Design: Example

My hypothesis
Multiscale collaboration is more effective in helping people deal with complex information at different levels of detail.

What is Multiscale Collaboration?

Gulliver’s Travels

Multiscale Collaborative Virtual Environments (mCVE)

Multiple users work together but from different scales
Cross-scale collaboration between ants and giants
Users have different interaction domains: visual information, navigation, manipulation, …

mCVE Example

Experimental Study on mCVE

IVs: Multiscale information presentation and interaction; collaboration
DV: Task completion time in interacting with complex information
Task: Searching for objects with specific features in a large area

IVs and DV

Multiscale factors
  Information presentation
    Large area: global-level information to assist search
    Specific features: local-level details to assist object identification
  Movement
    Giant steps to move quickly but roughly
    Baby steps to move accurately

Collaboration factors
  Information sharing: where the other person is
  Movement: the giant can carry the ant to move quickly

DV: how long it takes to complete a task

Expected Results

Multiscale + collaboration > multiscale
Multiscale + collaboration > collaboration
Multiscale > no multiscale
Collaboration > no collaboration
Multiscale + collaboration > no multiscale + no collaboration
Multiscale ? collaboration

Treatments

Determined by your IVs
Two factors
  Multiscale: Yes/No
  Collaboration: Yes/No
  2 x 2 factorial design

Interested in the impact of collaboration style on user performance
Three different collaboration styles: role-free, one as a guide, one as a carrier

2 x 2 + 2: six treatments
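The count of six treatments can be checked by enumerating the design. The sketch below assumes that the three collaboration styles apply only to the condition that has both multiscale and collaboration, which is one reading of "2 x 2 + 2"; the labels VE, mVE, CVE, and mCVE follow the naming used elsewhere in these slides, and the variable names are illustrative.

```python
from itertools import product

styles = ["role-free", "guide", "carrier"]   # collaboration styles

treatments = []
for multiscale, collaboration in product([False, True], repeat=2):
    # Label convention: "m" prefix for multiscale, "C" prefix for collaboration.
    label = ("m" if multiscale else "") + ("CVE" if collaboration else "VE")
    if multiscale and collaboration:
        # Assumption: only this cell is split by collaboration style.
        treatments.extend(f"{label} ({style})" for style in styles)
    else:
        treatments.append(label)

print(len(treatments))   # 6
for treatment in treatments:
    print(treatment)
```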

Construct Validity

Variables
  Multiscale: cross-scale information access, cross-scale action
  Collaboration: dividing a task and conquering it in parallel
  Performance: task completion time

Other Issues

Making the task fun: from a simple search to defusing a bomb
Providing feedback to motivate people
Should not reveal your true intention

Test Space, Instrument, Apparatus

Test space
  A usability lab would be ideal: quiet, distraction-free, easy for observation, with recording capabilities, etc.
  If you have no access to a usability testing lab:
    Set up a space appropriate for the experiment
    Private space for the subject
  Never do an experiment in a public place.

Instrument
  Capturing user performance
  Recording devices: audio and/or video
  Computer to capture data automatically if possible
  Often operated by experimenters

Apparatus
  Used by subjects directly
  Usually computers and related peripheral devices
  Also includes software tools

Interaction Devices

How people interact with a computer varies from person to person.
This may lead to errors, e.g., delays in moving and clicking the mouse.

Use methods that avoid or reduce the potential errors where possible.

In My Study

Subjects needed to control movement and scale.
Mouse control could be a concern.

My approach
  Using the keyboard
  Providing a key-function map
  Labeling the function of each key
  Using simple language

Software Tools

Different levels of function and fidelity

Truly functional systems
  Can deal with any actions by subjects
  No need to worry about "misbehavior" by subjects

Partially functional systems
  Support primary tasks
  Need to watch subjects closely and carefully to prevent the system from malfunctioning

Mock-up systems
  Example: Wizard-of-Oz style system
  A system offering only a user interface and other functions
  A human experimenter sitting behind the scenes provides results based on user inputs

What You Should Pick …

Depends on your goal, your technical skills/resources, your time, etc.

What is your interest? Design + evaluation vs. evaluation only

Can you develop the system or have someone do it for you?

Do you have one year or five years?

In My Study,

I built a fully functional system.
  Years of work
  Java-based

Data Collection

After the design, you should have a clear idea about what data to collect.
  Directly related to the DV
  Objective data collected through a computer or instruments

Other data to collect
  Demographic data of subjects
  Subjective evaluation data

Methods to Collect Data
  If software tools are involved and you can modify the software, write code to collect performance data.
  Otherwise, have someone dedicated to collecting performance data.

Using pre- and post-test questionnaires
  Pre-test questionnaire: demographic data, background, relevant skills, etc.
  Post-test questionnaire: feedback, assessment, and perception of relevant tasks and phenomena

In My Study,

Performance data collection was coded into the system:
  Time stamp
  Action type: moving, scaling
  Related parameters: scale (size), location
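To illustrate what built-in performance logging can look like, here is a minimal sketch that writes one time-stamped record per user action. It is not the logger from the study, which was part of a Java-based system; the class name `PerformanceLogger`, the CSV layout, and the use of a two-dimensional `x`, `y` location are all assumptions made for the example.

```python
import csv
import io
import time

FIELDS = ["timestamp", "action", "scale", "x", "y"]   # assumed record layout

class PerformanceLogger:
    """Append one time-stamped CSV row per user action."""

    def __init__(self, stream, clock=time.time):
        self._writer = csv.DictWriter(stream, fieldnames=FIELDS)
        self._writer.writeheader()
        self._clock = clock          # injectable so tests can fake the time

    def log(self, action, scale, x, y):
        self._writer.writerow({
            "timestamp": f"{self._clock():.3f}",   # seconds since the epoch
            "action": action,                      # e.g. "moving" or "scaling"
            "scale": scale,
            "x": x,
            "y": y,
        })

buffer = io.StringIO()               # stand-in for an open log file
logger = PerformanceLogger(buffer)
logger.log("moving", scale=1.0, x=12.5, y=-3.0)
logger.log("scaling", scale=4.0, x=12.5, y=-3.0)
print(buffer.getvalue())
```

Task completion time, the DV, can then be derived afterwards from the first and last timestamps of a task.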

Questionnaires

Data File

http://zhang.ist.psu.edu/teaching/505/Data_Sample_Collected.txt

Surveys

Short surveys are often included in usability studies.
  Pre-test questionnaire: user demographic and other relevant data
  Post-test questionnaire: user feedback on the system
  Paper or online

Subject Recruitment

General population
  Representative enough
  Considering confounding factors

Where to get the subjects?
  Sampling from the population you target
  Convenience vs. representativeness
  People on campus

How to get them?
  Subject pools
  General public: email, ads, flyers, etc.

How many to get?
  Experience
  Calculation based on desired power
  More is better, if you have sufficient resources.
  Costs are based on the length of your study.
  Always have backup subjects: no-shows, failed sessions, insufficient subjects, etc.
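A power-based calculation of the number of subjects can be approximated in a few lines. The sketch below uses the common normal approximation for comparing two independent group means with a two-sided test, n per group = 2 * ((z_(1 - alpha/2) + z_power) / d)^2, where d is the expected standardized effect size. The function name `n_per_group` and the default values are assumptions; dedicated power-analysis tools use the t distribution and give slightly larger numbers, and within-subjects designs need a different formula.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-group, two-sided comparison."""
    z = NormalDist()                          # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)        # critical value for the test
    z_power = z.inv_cdf(power)                # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(n_per_group(0.5))   # medium effect: 63 per group
print(n_per_group(0.8))   # large effect: 25 per group
```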

My Case

Targeted population: general
  Students are OK.
  With 3D experience

I recruited my participants from campus.
  Emails
  Flyers

The number of subjects
  Calculated with a statistics tool
  24 participants (12 pairs)

Results: Time Comparison (in seconds)

[Figure: task completion time in seconds (axis ticks 50 to 200) for VE, mVE, CVE, and mCVE, grouped by Non-Collaboration vs. Collaboration and Non-Multiscale vs. Multiscale]

Multiscale collaboration can be helpful for cross-scale tasks.

In-Depth Study on Multiscale Collaboration

2 x 2 within-subjects design

                  Non-Collaboration   Collaboration
Non-Multiscale    VE                  CVE
Multiscale        mVE                 mCVE (NoRole, GUIDE, MOVE)

+ 2: the two additional mCVE collaboration styles

Results: Time Comparison (in seconds)

[Figure: task completion time in seconds (axis ticks 50 to 200) for VE, mVE, CVE, mCVE-NoRole, mCVE-GUIDE, and mCVE-MOVE, grouped by Non-Collaboration vs. Collaboration and Non-Multiscale vs. Multiscale]

Homework

Critique a research paper.

You need to list:
  The hypothesis
  IVs and DVs
  The task for the experiment
  Factors in the factorial design
  Approaches to counterbalance treatments

You need to point out at least one flaw in the experimental design or execution.
  Why?
  The impact(s): external, internal, or construct validity
  Modification
