
Crowdsourcing Software Engineering Studies:

Opportunities and Perils

Sebastian Elbaum (based on work performed with Kathryn Stolee)

[Slide graphics: relative populations of Software Engineers and ICSE Researchers, annotated 90% and 20%]
Crowdsourcing Services (examples)

Companies with hard problems connect with people interested in solving them: 1,000+ problems, 200,000+ solvers

Photographers connect with people who need stock photography: 3,000,000+ members

Companies with scientific problems connect with retired scientists: 1,000+ companies, 5,000+ scientists

People with many small tasks connect with a scalable workforce: 100,000+ tasks, 100,000+ workers

Who are the workers? (“Conducting behavioral research on Amazon’s Mechanical Turk”, Winter Mason & Siddharth Suri, Behavior Research Methods, 2011)

• Median 30 years old and $30K salary

• 69% of U.S. workers: “Mechanical Turk is a fruitful way to spend free time and get some cash”

• Majority from the US and India

• Work for at least $1.40/hour, $4.50/hour on average (see the sketch below)

• Completion time is correlated with pay, but not linearly
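To make the pay figures concrete, here is a minimal Python sketch that turns a target hourly wage into a per-task reward; the hourly rates are the ones Mason & Suri report, while the 10-minute task duration is a hypothetical estimate for a study of your own.

```python
# Minimal sketch: derive a per-task reward from a target hourly wage.
# Hourly figures from Mason & Suri (2011); the 10-minute duration is a
# hypothetical estimate, not a value from the talk.

def reward_per_task(hourly_wage_usd: float, minutes_per_task: float) -> float:
    """Reward (USD) that pays the target wage for the estimated duration."""
    return round(hourly_wage_usd * minutes_per_task / 60.0, 2)

print(reward_per_task(1.40, 10))  # observed floor   -> 0.23
print(reward_per_task(4.50, 10))  # reported average -> 0.75
```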

MTurk workflow: researchers create tasks and set up the study; workers search for, select, complete, and submit tasks; researchers then verify the results. A minimal sketch of the task-creation step follows.
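The researcher side of that workflow is scriptable. Below is a hedged sketch of the task-creation step using boto3's MTurk client (a real AWS API); the title, reward, study URL, and qualification threshold are illustrative placeholders, not values from the talk.

```python
# Sketch of the "create tasks" step via boto3 (AWS SDK for Python).
# All study-specific values below are hypothetical placeholders.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint: rehearse the study without paying real workers.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# An ExternalQuestion embeds your own study page inside the MTurk frame.
QUESTION_XML = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.org/my-se-study</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>"""

hit = mturk.create_hit(
    Title="Compare two small programs (10 min)",
    Description="Judge which of two code snippets is easier to understand.",
    Reward="0.75",                     # USD, passed as a string
    MaxAssignments=25,                 # distinct workers per task
    AssignmentDurationInSeconds=30 * 60,
    LifetimeInSeconds=14 * 24 * 3600,  # keep the task up for two weeks
    Question=QUESTION_XML,
    QualificationRequirements=[{
        # Built-in "PercentAssignmentsApproved" system qualification:
        # only workers with >= 95% approved work can see the task.
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],
    }],
)
print("Created HIT:", hit["HIT"]["HITId"])
```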

[Slide graphic: ICSE researchers crowdsourcing their studies to software engineers]

Potential

• Access to a population of software engineers

• Low cost

• Speedy / adaptive experimentation

Initial Try

• I may have solved the SE empirical challenge! Look how many answers I am getting with a few dollars!

• Oh… some of those answers are not that useful. Are these real software engineers?

• They are completing my exercise in seconds. How? Damn… they are gaming the system.

• Ouch, I need to check thousands of answers.

• OK, now let’s give them a “real” SE task.

Kathryn T. Stolee, Sebastian G. Elbaum: Exploring the use of crowdsourcing to support empirical studies in software engineering. ESEM 2010

Goal: evaluate the impact of smells/refactoring on end-user programmers’ preferences and understanding.

Workflow in Mechanical Turk

Workers: Search for Tasks → Select Task → Complete Task → Submit Task

Experimental Task in Mechanical Turk

Experiment Definition → Design → Selection → Instrumentation → Operation → Analysis

• 22 participants, 188 tasks completed, 2 weeks, $42

• Results supported the hypothesis

Potential (revisited)

• Access to a population of software engineers

• Low cost

• Speedy / adaptive experimentation

“… Academics are now taking advantage of Turk, and, from my own experience with the difficulties of recruiting students to experiments, I suspect Turk’s use will only increase.” — Scientific American, 2011

Venue | Task | Soft. Engs. | Cost
ESEM 2010 / TSE 2013 | Compare two mashups, determine outcome (10 min) | ~25 (30% SE), 188 tasks | $42
FSE NIER 2012 / ESEM 2013 | Write a small program specification as input/output (compared with a class exercise, 15 min) | ~25, ~100 tasks | $25
TOSEM 2014 | Rank code search results from various tools, provide qualified feedback (10 min) | ~50, ~300 tasks | $300
… | Survey on competing scenarios for an emerging technology (15 min) | 1000+, 1000+ tasks | $600

Cost per task under a dollar!
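A quick back-of-the-envelope check of that claim, using the approximate figures from the table:

```python
# Cost per task for each row of the table above (approximate figures).
studies = [
    ("ESEM 2010 / TSE 2013",        42,  188),
    ("FSE NIER 2012 / ESEM 2013",   25,  100),
    ("TOSEM 2014",                 300,  300),
    ("Emerging-technology survey", 600, 1000),
]
for venue, cost_usd, tasks in studies:
    print(f"{venue}: ${cost_usd / tasks:.2f} per task")
# -> $0.22, $0.25, $1.00, $0.60 per task: at or under a dollar throughout
```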

Can we get …

• X software engineers to participate? → Yes, given $ and T

• the K kind of software engineers? → Yes, given QA and $

• X software engineers to do T? → For some Ts

• X software engineers to do a T seriously? → For some Ts, given QA and $

• …

$

• Setting a baseline pay

• Market forces

• Enough to motivate software engineers

• Enough not to motivate others

• Too high, and it is perceived as coming from “baiting” requesters

• Ethical concerns

• Multiple deployments

• Still an IRB “cost” (with good reason)

Task

• Decompose the SE problem into tasks

• Small enough to attract participants / keep cost low

• Large enough to be valuable to SE research

• Provide motivation other than money

• Design of experiments

• Tasks are small: many are needed to test a hypothesis, and even more to study the SE problem

• Bundled tasks help attract subjects

• Control for learning, gaming, … (see the sketch below)
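As one concrete way to bundle tasks while controlling for learning, here is a minimal sketch (task names are hypothetical) that rotates task order across workers, so no task systematically benefits from practice on the others:

```python
# Minimal sketch: bundle small tasks and rotate their order per worker,
# so learning effects are spread evenly across tasks (a simple
# Latin-square-style rotation). Task names are hypothetical.
TASKS = ["mashup_A_vs_B", "mashup_C_vs_D", "mashup_E_vs_F"]

def bundle_for(worker_index: int) -> list[str]:
    """Tasks for one worker, rotated by their arrival order."""
    shift = worker_index % len(TASKS)
    return TASKS[shift:] + TASKS[:shift]

for w in range(3):
    print(w, bundle_for(w))
# 0 ['mashup_A_vs_B', 'mashup_C_vs_D', 'mashup_E_vs_F']
# 1 ['mashup_C_vs_D', 'mashup_E_vs_F', 'mashup_A_vs_B']
# 2 ['mashup_E_vs_F', 'mashup_A_vs_B', 'mashup_C_vs_D']
```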

Quality

• Qualification checks and pre-tests

• Embed obvious/repeated, mostly verifiable questions to check for robots, gamers, and level of attention (see the sketch below)

• Set a performance threshold for payment, or pay a little to all but use the good workers to seed the next study

• Control revision costs with tasks on tasks

• Multiple deployments

• Compare the performance of subjects inside/outside MTurk
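One way to implement the embedded-check idea is a “gold” question with a known answer, screened before approval. A hedged sketch follows: list_assignments_for_hit and approve_assignment are real boto3 MTurk calls, while the HIT id, question identifier, and expected answer are placeholders.

```python
# Sketch: screen submissions against an embedded "gold" question before
# paying. Placeholders: "HIT_ID", "gold_check", and the expected answer.
import boto3
import xml.etree.ElementTree as ET

NS = {"mt": "http://mechanicalturk.amazonaws.com/"
            "AWSMechanicalTurkDataSchemas/2005-10-01/QuestionFormAnswers.xsd"}

def gold_answer(answer_xml: str, question_id: str = "gold_check") -> str | None:
    """Pull the worker's answer to the gold question out of the answer XML."""
    root = ET.fromstring(answer_xml)
    for ans in root.findall("mt:Answer", NS):
        if ans.findtext("mt:QuestionIdentifier", namespaces=NS) == question_id:
            return ans.findtext("mt:FreeText", namespaces=NS)
    return None

mturk = boto3.client("mturk", region_name="us-east-1")
resp = mturk.list_assignments_for_hit(
    HITId="HIT_ID", AssignmentStatuses=["Submitted"])

for a in resp["Assignments"]:
    if gold_answer(a["Answer"]) == "3":   # e.g., "how many lines print?"
        mturk.approve_assignment(AssignmentId=a["AssignmentId"])
    else:
        # Flag for manual review rather than auto-rejecting: attention
        # checks produce false positives too.
        print("suspect submission from", a["WorkerId"])
```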

Limitations

Accidental:

• Assignment of subjects to tasks (self-selection bias)

• Single users (small tasks, no collaboration)

• Rewards are mainly monetary

Essential:

• Separation from subjects (cannot observe or interact much)

• Very limited context information

• Mismatch with many SE problems

Alternative Infrastructures

Tempered enthusiasm

• Not all SE problems can be broken into small tasks

• Many SE problems require a team and communication

• Many SE problems require time to develop

• Proof by MTurking

• Balancing task design, $, and thresholds is tricky

• Lack of contact with, and context about, subjects

Charge

• Great as an initial empirical vehicle (better than ugrads :)

• Could be better:

• A pool of pre-qualified workers

• Capabilities to design more complex studies

• A communication stream to the worker for follow-up (+context)

• Ability to control the development environment

• …

MTurk for Software Engineering Studies?

Crowdsourcing Software Engineering Studies:

Opportunities and Perils

Sebastian Elbaum (based on work performed with Kathryn Stolee)