questions raised by results - pennsylvania state university
Questions Raised by Results
Questions raised by significant results
External validity
Do results generalize to other levels of the IV?
Do other variables moderate the IV’s effect?
Construct validity (Was the control group good enough?)
Simple two-group experiments may not be good enough
Multi-Group Experiments
Improve construct validity Improve external validity Answer more questions
No Med High
Amount of Treatment
No Yes
Example
Benefits of Multi-Group Experiment
Better ability to estimate the effects of different amounts (levels) of a treatment
Better ability to rule out confounding variables
May help discover significant relationships
May help more accurately map the functional relationship
Summary: Multi-level experiments have more external validity than simple experiments
Group Assignment in Multi-Group Experiment
Similar procedure to that of a simple experiment
Using a random number table
Using a die (rather than a coin)
Using Excel
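The same random assignment can also be scripted. The following is a minimal sketch, assuming twelve hypothetical participant IDs and the three treatment levels (No, Med, High) from the earlier slide; the function name `assign_to_groups` is illustrative, not from the course material.

```python
import random

def assign_to_groups(participants, groups, seed=None):
    """Randomly assign participants to groups of (nearly) equal size."""
    rng = random.Random(seed)      # a fixed seed makes the assignment repeatable
    shuffled = list(participants)
    rng.shuffle(shuffled)          # random order plays the role of the die
    assignment = {group: [] for group in groups}
    for i, person in enumerate(shuffled):
        # Deal the shuffled participants out to the groups in turn.
        assignment[groups[i % len(groups)]].append(person)
    return assignment

# Hypothetical participant IDs, for illustration only.
participants = [f"P{i:02d}" for i in range(1, 13)]
assignment = assign_to_groups(participants, ["No", "Med", "High"], seed=42)
for group, members in assignment.items():
    print(group, members)
```

Shuffling first and then dealing keeps the group sizes equal, which rolling a die for each participant does not guarantee.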
Statistics Methods
t tests ANOVA
Control Group
No treatment at all? May create noise
Example: medicine. Taking medicine may create a psychological effect.
Subjects may guess what you are trying to do
Solution: provide a "fake" treatment, similar to the real treatment in form only
Placebo pills
Systems with similar appearance, but different treatments
Factorial Design
Why multiple factors? Effects of multiple factors on subjects Main effect Effect of individual factors independently
Interaction Effect of two factors jointly
Most often seen design: 2 x 2
2X2 Factorial Design
Two factors Two levels of each factor: presence or not. Four conditions
What do researchers expect? Two simple main effects Does a factor affect subjects?
The interaction between two factors Do these two factors interfere with each other? Concerning external validity
Interactions Are Important
Simple experiment is about main effect. Maybe weak in generalization Other non-studied factors may interfere. e.g., drugs: active ingredient + other chemicals people may
take in their daily life.
Interesting questions are often about interactions e.g., drug safety
External validity questions are questions involving interactions
An Example
A study on the effect of using IM on productivity Two groups of employees Experimental group: use IM Control group: don’t use IM
Don’t use Use
Someone Questions Your Results
“I saw people using IM more on the move. Does the use of IM on the move have anything to do with productivity?” Two factors: IM and mobile
You need a 2 x 2 design Use IM: yes/no On mobile device: yes/no
Your data from the study Dependent variable: How many orders do they
make per week?
Different Types of Result
Main effect of both factors, and interaction
Main effect of both factors, but no interaction
Main effect of one factor, and interaction
Main effect of one factor, but no interaction
Interaction with no main effect of either factor
No interaction, no main effect
              Not Mobile   Mobile
Not use IM        70         75
Use IM            73         83
Two main effects, with interaction
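The main effects and the interaction in this example can be read off the four cell means (70, 75, 73, 83). The sketch below assumes the rows are Not use IM / Use IM and the columns are Not Mobile / Mobile. It only describes the pattern in the means; whether each effect is statistically significant would still need an ANOVA on the raw data.

```python
# Cell means (orders per week), assuming rows = IM use, columns = mobile use.
cells = {
    ("no_im", "not_mobile"): 70,
    ("no_im", "mobile"): 75,
    ("im", "not_mobile"): 73,
    ("im", "mobile"): 83,
}

def mean(values):
    return sum(values) / len(values)

# Main effect of IM: average of the IM row minus average of the no-IM row.
im_effect = (mean([cells["im", "not_mobile"], cells["im", "mobile"]])
             - mean([cells["no_im", "not_mobile"], cells["no_im", "mobile"]]))

# Main effect of mobile: average of the mobile column minus the other column.
mobile_effect = (mean([cells["no_im", "mobile"], cells["im", "mobile"]])
                 - mean([cells["no_im", "not_mobile"], cells["im", "not_mobile"]]))

# Interaction: does the mobile effect differ between IM users and non-users?
interaction = ((cells["im", "mobile"] - cells["im", "not_mobile"])
               - (cells["no_im", "mobile"] - cells["no_im", "not_mobile"]))

print("IM main effect:", im_effect)          # 5.5
print("Mobile main effect:", mobile_effect)  # 7.5
print("Interaction contrast:", interaction)  # 5
```

A nonzero interaction contrast means the lines in an interaction plot are not parallel: here, going mobile adds 10 orders for IM users but only 5 for non-users.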
The same Not Mobile / Mobile by Not use IM / Use IM table illustrates each of the remaining patterns:
Two main effects, with no interaction
One main effect, with interaction
One main effect, with no interaction
No main effect, with interaction
No main effect, no interaction
Interpretation of Results
Two main effects without interaction: It all adds up.
Two main effects with interaction: The two factors may amplify or impede each other’s effects.
One main effect without interaction: The effect of the factor is independent.
One main effect with interaction: One factor may have no effect of its own, but can affect the other one (e.g., a catalyst).
No main effect with interaction: The effect of one factor depends on the other, although neither main effect is significant.
In Sum
Factorial design allows you to study multiple factors in one study.
More treatments More complex results More difficult to design and execute
Matched Pairs, Within-Subjects,
and Mixed Design
Different Kinds of Experiments
Field experiments
In a natural setting, rather than in a lab
Advantages: external validity, construct validity
Disadvantages: independence of groups, and process control
Matched pairs design
Create two similar groups based on a certain criterion
Advantage: internal validity
Problems: matching could be hard and inaccurate
Other factors: individual differences
Within-subjects designs
Each subject goes through all treatments
Comparing different treatments of the same subject
Advantages: internal validity
Mixed designs in factorial design
One factor: within-subjects
The other: matched pairs or between-subjects
[Figure: experimental group vs. control group in field experiment, matched pairs, and within-subjects designs]
Matched Pairs Design: Procedure
Form matched pairs Randomly assign one member of each pair to
the treatment condition, the other to the control condition
Considerations in Using Matched Pairs Designs Finding an effective matching variable e.g., race, education, age, …
External validity Advantage: Don’t restrict subject population (can
have heterogeneous group) Disadvantage: Results may not be generalized to
participants who haven’t done the matching task Construct validity weakened because matching
may tip off participants about hypothesis
Analysis of Data in the Matched Pairs Design
Not the between-subjects t test (observations are not independent)
Dependent t test: Differences between pairs/ standard error of differences
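As a rough illustration of the dependent t test described above, the statistic is the mean of the pair differences divided by the standard error of those differences, with n - 1 degrees of freedom for n pairs. The scores below are made up for the example and do not come from any study; `dependent_t` is an illustrative helper name.

```python
import math
import statistics

def dependent_t(treatment, control):
    """Return (t, df) for a dependent (paired) t test."""
    if len(treatment) != len(control):
        raise ValueError("Each treatment score needs a matched control score")
    diffs = [t - c for t, c in zip(treatment, control)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the differences: sample SD of differences / sqrt(n).
    se_diff = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / se_diff, n - 1

# Hypothetical scores for five matched pairs (illustration only).
treatment = [12, 15, 11, 14, 13]
control = [10, 12, 11, 11, 12]
t_value, df = dependent_t(treatment, control)
print(f"t({df}) = {t_value:.3f}")
```

The resulting t is compared against the critical value for n - 1 degrees of freedom, not the 2n - 2 of a between-subjects test, because the unit of analysis is the pair.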
Within-Subjects (Repeated Measures) Designs
Considerations in using within-subjects designs Increased power Order effects harm internal validity
Danger: Order Effects
Four Sources of Order Effects Practice effects Gradually improved performances
Fatigue effects Gradually deteriorated performances
Treatment carryover effects Previous treatments affect the following treatments.
Sensitization Subjects can guess what your IVs and DVs are, and
may play along with it. Hurt both construct validity and internal validity (why)
Dealing with Order Effects Minimizing each individual threat Practice, fatigue, carryover, sensitization
Allow sufficient practice, make tasks interesting, use few levels, allow sufficient time between treatments, make treatment level less noticeable, etc.
Mixing up sequences to try to balance out order effects: Randomizing and counterbalancing Randomized or counterbalanced within-subjects
designs
Randomized Within-Subjects Design Randomly determine the sequence of treatments for
each participant
Bet on luck!
We need a better solution to rule out the possibility of order effects.
Counterbalanced Within-Subjects Design
Design a set of sequences such that Every condition appears in every position the
same number of times Every condition precedes every other
condition just as many times as it follows that condition
Randomly assign participants to your sequences
Examples
Two conditions: A – B and B – A
Three conditions
Treatment order   1  2  3
Group 1           A  B  C
Group 2           B  C  A
Group 3           C  A  B
Group 4           C  B  A
Group 5           A  C  B
Group 6           B  A  C
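With three conditions, the six sequences listed are simply every possible ordering of A, B, and C. A minimal sketch of generating them and checking the two counterbalancing requirements follows; the helper names are illustrative, and the printed group numbering will not necessarily match the slide's.

```python
from itertools import combinations, permutations

def complete_counterbalancing(conditions):
    """Return every possible ordering of the conditions."""
    return [list(order) for order in permutations(conditions)]

def position_counts(sequences):
    """Count how often each condition appears in each position."""
    counts = {}
    for seq in sequences:
        for pos, cond in enumerate(seq):
            counts.setdefault(cond, [0] * len(seq))[pos] += 1
    return counts

def precedes_counts(sequences):
    """Count, for each ordered pair (x, y), how often x comes before y."""
    counts = {}
    for seq in sequences:
        for x, y in combinations(seq, 2):  # x is earlier than y in seq
            counts[(x, y)] = counts.get((x, y), 0) + 1
    return counts

sequences = complete_counterbalancing(["A", "B", "C"])
for group, seq in enumerate(sequences, start=1):
    print(f"Group {group}:", " ".join(seq))

print(position_counts(sequences))  # every condition: twice in each position
print(precedes_counts(sequences))  # every ordered pair: three times
```

Complete counterbalancing needs k! sequences for k conditions (24 for four conditions, 120 for five), which is why a smaller balanced set is often preferred as the number of conditions grows.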
Latin Square
Helps you get the sequence of conditions
4 x 4 Latin square
Treatment order   1  2  3  4
Group 1           A  B  D  C
Group 2           B  C  A  D
Group 3           C  D  B  A
Group 4           D  A  C  B
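One common way to build such a square for an even number of conditions is to start the first row with the offsets 0, 1, n-1, 2, n-2, ... and shift every later row by one. The sketch below uses that construction (an assumption about how the slide's square was made, though it yields the same four rows) and checks that each condition immediately follows every other condition exactly once.

```python
def balanced_latin_square(conditions):
    """Build a balanced Latin square for an even number of conditions."""
    n = len(conditions)
    if n % 2 != 0:
        raise ValueError("This construction needs an even number of conditions")
    # First-row offsets: 0, 1, n-1, 2, n-2, ...
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    # Each later row shifts every entry of the first row by one more step.
    return [[conditions[(x + row) % n] for x in first] for row in range(n)]

def immediate_pairs(square):
    """Count how often each condition is directly followed by each other one."""
    counts = {}
    for seq in square:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

square = balanced_latin_square(["A", "B", "C", "D"])
for group, seq in enumerate(square, start=1):
    print(f"Group {group}:", " ".join(seq))

pairs = immediate_pairs(square)
```

With four sequences instead of the 24 that complete counterbalancing would need, every condition still appears once in each position and directly follows every other condition once.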
Pros and Cons of Counterbalanced Within-Subjects Designs Balances out order effects Provides information not only about the effect
of treatment, but also about the effect of order (trials, position) and sequence
May require more subjects Analysis is more sophisticated ANOVA is often required.
Conclusion Experiment studies Manipulate IVs Compare different groups Study several factors Good internal validity
Counterbalanced design, process control Improved external validity and construct validity
Experimental design Process
Craft Task
Designing a good task needs creative thinking!
Running an Experimental Study
The Book Guide book for
studies involving human users
Offers detailed guidelines
Practice and Experience
Running an experiment needs practice.
Experience is very important, in particular to the design of an experiment
We can only discuss some general issues in this course. Semester-long courses on experimental design in
traditional psychology department
Use my research as an example
Issues to Consider
Experimental Design Test Space, Instrument, Apparatus Data Collection Subject recruitment Experimental Protocols Experimental Process Pilot Test
Experimental Design Most difficult part May take a long time to get a "satisfactory" design Several rounds Various factors to consider
Tasks, subjects, equipment, etc. Based on your hypothesis Key considerations Tasks
Incorporate both IVs and DVs with good validities Not harmful to subjects Not boring Reasonable time span
Treatments
Experimental Design: Example
My hypothesis
Multiscale collaboration is more effective in helping people deal with complex information at different levels of detail.
What is Multiscale Collaboration?
Gulliver’s Travels
Multiscale Collaborative Virtual Environments (mCVE)
Multiple users work together but from different scales Cross-scale collaboration between ants and
giants Users have different interaction domains
Visual information, navigation, manipulation, …
mCVE Example
Experimental Study on mCVE IVs: Multiscale information presentation and
interaction Collaboration
DV: Task completion time in interacting with complex
information Task: Searching objects with specific features in a large
area
IVs and DV Multiscale factors Information presentation
Large area: Global-level information to assist search Specific features: local-level details to assist object identification
Movement Giant steps to move quickly but roughly Baby steps to move accurately
Collaboration factors Information sharing: where the other person is Movement: the giant can carry the ant to move quickly.
DV: how long it takes to complete a task
Expected Results
Multiscale + collaboration > multiscale
Multiscale + collaboration > collaboration
Multiscale > no multiscale
Collaboration > no collaboration
Multiscale + collaboration > no multiscale + no collaboration
Multiscale ? collaboration
Treatments Determined by your IVs Two factors Multiscale: Yes/No Collaboration: Yes/No 2X2 factorial design
Interested in the impact of collaboration style on user performance Three different collaboration styles
Role-free One as a guide One as a carrier
2 x 2 + 2: six treatments
Construct Validity Variables Multiscale Cross-scale information access Cross-scale action
Collaboration Dividing a task and conquering it in parallel
Performances Task completion time
Other Issues
Making the task fun
Simple search -> defusing a bomb
Providing feedback to motivate people
Should not reveal your true intention.
Test Space, Instrument, Apparatus
Test space A usability lab would be
ideal. Quiet, distraction free,
easy for observation, recording capabilities, etc.
Test Space, Instrument, Apparatus If no access to a usability testing lab Set up space appropriate for experiment Private space for subject
Never do an experiment in a public place.
Test Space, Instrument, Apparatus
Instrument Capturing user performance Recording devices: audio and/or video Computer to capture data automatically if possible
Often operated by experimenters Apparatus Used by subjects directly Usually computers and related peripheral devices Also include software tools
Interaction Devices
How people interact with a computer varies from person to person. May lead to errors. e.g., mouse moving and clicking delay
Methods to avoid or reduce the potential errors if possible
In My Study Subjects needed to control movement and
scale. Mouse control could be a concern.
My approach Using keyboard Providing a key-function map Labeling the function of each key Using simple language
Software Tools Different levels of functions and fidelity Truly functional systems Can deal with any actions by subjects No need to worry about "misbehaviors" by subjects
Partially functional systems Support primary tasks Need to watch subjects closely and carefully Preventing the system from malfunctioning
Mock-up systems Example: Wizard-of-Oz style system A system only offering user interface and other functions A human experimenter sitting behind to provide results based on user
inputs
What You Should Pick …
Depends on your goal, your technical skills/resources, your time, etc.
What is your interest? Design + evaluation vs. evaluation only
Can you develop the system or have someone do it for you?
Do you have one year or five years?
In My Study,
I built up a fully functional system Years of work Java-based
Data Collection
After the design, you should have a clear idea about what data to collect. Directly related to DV. Objective data collected through computer or
instruments Other data to collect Demographic data of subjects Subjective evaluation data
Methods to Collect Data
If software tools are involved and the software can be modified, it is better to write code to collect performance data.
Otherwise, have someone dedicated to collecting performance data
Using pre- and post-test questionnaires
Pre-test questionnaire: demographic data, background, relevant skills, etc.
Post-test questionnaire: feedback, assessment, and perception on relevant tasks and phenomena
In My Study,
Performance data collection was coded into the system Time stamp Action type: moving, scaling Related parameters: scale (size), location
Questionnaires
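To give a feel for what coding performance data collection into the system can look like, here is a small sketch that logs one row per user action with a time stamp, action type, and parameters. It is not the study's actual code (that system was Java-based) and does not reproduce its data file format; the field names and the `log_action` helper are hypothetical.

```python
import csv
import io
import time

# Hypothetical column layout, for illustration only.
FIELDS = ["timestamp", "action", "scale", "x", "y"]

def log_action(writer, action, scale, location, timestamp=None):
    """Write one row describing a single user action."""
    if timestamp is None:
        timestamp = time.time()  # seconds since the epoch
    writer.writerow({
        "timestamp": timestamp,
        "action": action,        # e.g., "moving" or "scaling"
        "scale": scale,          # the user's current scale (size)
        "x": location[0],
        "y": location[1],
    })

# An in-memory buffer stands in for the log file in this sketch.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
log_action(writer, "moving", 1.0, (10.0, 4.5), timestamp=0.0)
log_action(writer, "scaling", 8.0, (10.0, 4.5), timestamp=2.3)
print(buffer.getvalue())
```

Task completion time, the DV, can then be derived afterwards as the difference between the last and first time stamps of a trial.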
Data File
http://zhang.ist.psu.edu/teaching/505/Data_Sample_Collected.txt
Surveys
Short surveys are often included in usability studies Pre-test questionnaire User demographic and other relevant data
Post-test User feedback on system
Paper or online
Subject Recruitment General population
Representative enough Considering confounding factors
Where to get the subjects? Sampling from the population you target
Convenience vs. representative People on campus
How to get them? Subject pools General public: email, ads, flyers, etc.
How many to get? Experience Calculation based on desired power. More is better, if you have sufficient resources.
Costs are based on the length of your study Always have backup subjects
No-shows, failed studies, insufficient subjects, etc.
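A rough sense of what "calculation based on desired power" involves: the sketch below uses the standard normal approximation for comparing two independent groups with a two-sided test, where the number of subjects per group is about 2 * ((z for alpha/2 + z for power) / d)^2 and d is the expected standardized effect size. This is an approximation under those assumptions; dedicated power analysis tools use the t distribution and handle other designs, such as within-subjects comparisons, which need different formulas.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a two-group comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Smaller expected effects need many more subjects.
for d in (0.8, 0.5, 0.3):
    print(f"d = {d}: about {n_per_group(d)} subjects per group")
```

The steep growth as d shrinks is the reason for estimating the expected effect size, from pilot data or earlier studies, before recruiting.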
My Case Targeted population: general Students are OK. With 3D experiences
I recruited my participants from campus. Emails Flyers
The number of subjects Calculated with a statistics tool 24 participants (12 pairs)
Results: Time Comparison (in seconds)
[Figure: task completion time (axis from 50 to 200 seconds) for VE, mVE, CVE, and mCVE; Non-Collaboration vs. Collaboration, Non-Multiscale vs. Multiscale]
Multiscale collaboration can be helpful for cross-scale tasks.
In-Depth Study on Multiscale Collaboration
2 x 2 within-subjects design
                 Non-Collaboration   Collaboration
Non-Multiscale   VE                  CVE
Multiscale       mVE                 mCVE - MOVE, mCVE - NoRole, mCVE - GUIDE
+ 2
Results: Time Comparison (in seconds)
[Figure: task completion time (axis from 50 to 200 seconds) for VE, mVE, CVE, mCVE - MOVE, mCVE - GUIDE, and mCVE - NoRole; Non-Collaboration vs. Collaboration, Non-Multiscale vs. Multiscale]
Homework Critique a research paper
You need to list
The hypothesis
IVs and DVs
The task for the experiment
Factors in the factorial design
Approaches to counterbalance treatments
You need to point out at least one flaw in the experimental design or execution.
Why?
The impact(s): external, internal, or construct validity
Modification