Example: Build an “Adult Web Site” Classifier

Crowdsourcing using Mechanical Turk: Quality Management and Scalability – Panos Ipeirotis, New York University


TRANSCRIPT

Page 1: Crowdsourcing using Mechanical Turk

Quality Management and Scalability

Panos Ipeirotis – New York University

Page 2: Example: Build an “Adult Web Site” Classifier

Need a large number of hand-labeled sites. Get people to look at sites and classify them as:
– G (general audience)
– PG (parental guidance)
– R (restricted)
– X (porn)

Cost/Speed Statistics
– Undergrad intern: 200 websites/hr, cost: $15/hr

Page 3: Example: Build an “Adult Web Site” Classifier

Need a large number of hand-labeled sites. Get people to look at sites and classify them as:
– G (general audience)
– PG (parental guidance)
– R (restricted)
– X (porn)

Cost/Speed Statistics
– Undergrad intern: 200 websites/hr, cost: $15/hr
– MTurk: 2,500 websites/hr, cost: $12/hr

Page 4: Bad news: Spammers!

Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience).

Page 5: Improve Data Quality through Repeated Labeling

Get multiple, redundant labels from multiple workers. Pick the correct label based on majority vote.

The probability of correctness increases with the number of workers and with the quality of the workers:
– 1 worker: 70% correct
– 11 workers: 93% correct
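A quick way to sanity-check these numbers is the binomial majority-vote calculation below. This is a minimal sketch (not from the slides) that assumes binary labels, independent workers of equal quality, and ties counted as wrong.

```python
from math import comb

def majority_correct(n_workers: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent workers, each correct
    with probability p_correct, yields the right label (binary-label approximation)."""
    need = n_workers // 2 + 1  # votes needed for a strict majority
    return sum(
        comb(n_workers, k) * p_correct**k * (1 - p_correct)**(n_workers - k)
        for k in range(need, n_workers + 1)
    )

print(majority_correct(1, 0.70))   # 0.70
print(majority_correct(11, 0.70))  # ~0.92, in line with the ~93% quoted on the slide
```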

Page 6: Using redundant votes, we can infer worker quality

Look at our spammer friend ATAMRO447HWJQ together with the other nine workers.

Our “friend” ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer…

We can compute error rates for each worker.

Error rates for ATAMRO447HWJQ:
– P[X → X] = 0.847%   P[X → G] = 99.153%
– P[G → X] = 0.053%   P[G → G] = 99.947%
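The slides do not show how these error rates are computed, but the idea can be sketched: use the redundant votes to approximate the true label, then tally each worker's confusion. The sketch below assumes a hypothetical input of the form {item: {worker: label}}; the talk's actual tool uses a more careful EM-style estimator (in the spirit of Dawid & Skene), so this is only a first approximation.

```python
from collections import Counter, defaultdict

def estimate_error_rates(labels):
    """Rough per-worker confusion estimate: treat the majority vote on each item
    as the true label, then count how often each worker maps that label to the
    label they actually gave."""
    counts = defaultdict(Counter)              # counts[worker][(true, given)]
    for item, votes in labels.items():
        majority = Counter(votes.values()).most_common(1)[0][0]
        for worker, given in votes.items():
            counts[worker][(majority, given)] += 1
    rates = {}
    for worker, conf in counts.items():
        totals = Counter()
        for (true, _given), n in conf.items():
            totals[true] += n
        rates[worker] = {(t, g): n / totals[t] for (t, g), n in conf.items()}
    return rates
```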

Page 7: Rejecting spammers and benefits

Random answers → error rate = 50%

Average error rate for ATAMRO447HWJQ: 49.6%
– P[X → X] = 0.847%   P[X → G] = 99.153%
– P[G → X] = 0.053%   P[G → G] = 99.947%

Action: REJECT and BLOCK

Results:
– Over time you block all spammers
– Spammers learn to avoid your HITs
– You can decrease redundancy, as the quality of the remaining workers is higher
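The reject/block decision boils down to one number: the worker's average error rate compared with the ~50% expected from random answers. A minimal sketch of that computation, using the error rates from the previous slide (equal class weights assumed; the actual tool also accounts for class priors and misclassification costs):

```python
def average_error_rate(confusion):
    """Average probability of giving a wrong label, assuming each true class
    is equally likely (a simplification of the tool's weighted version)."""
    classes = {t for (t, _g) in confusion}
    per_class_error = [
        sum(p for (t, g), p in confusion.items() if t == true and g != true)
        for true in classes
    ]
    return sum(per_class_error) / len(classes)

atamro = {("X", "X"): 0.00847, ("X", "G"): 0.99153,
          ("G", "X"): 0.00053, ("G", "G"): 0.99947}
print(average_error_rate(atamro))  # ~0.496 -- indistinguishable from random guessing on two classes
```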

Page 8: Too much theory?

Demo and open source implementation available at: http://qmturk.appspot.com

Input:
– Labels from Mechanical Turk
– Some “gold” data (optional)
– Cost of incorrect labelings (e.g., X → G costlier than G → X; see the sketch below)

Output:
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality
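The cost input matters because the corrected label should minimize expected cost, not just follow the most probable class. A hedged sketch of that decision step (the posterior and cost numbers are made up for illustration; this is not the qmturk tool's code):

```python
def min_cost_label(posterior, cost):
    """Choose the label with minimum expected misclassification cost.
    posterior: P[true class | all worker votes]; cost[(true, assigned)],
    with zero cost when a pair is missing."""
    classes = list(posterior)
    return min(
        classes,
        key=lambda assigned: sum(posterior[t] * cost.get((t, assigned), 0.0) for t in classes),
    )

posterior = {"G": 0.55, "X": 0.45}            # made-up posterior for one site
cost = {("X", "G"): 10.0, ("G", "X"): 1.0}    # showing porn to a general audience is the costlier mistake
print(min_cost_label(posterior, cost))        # "X": the asymmetric cost overrides the bare majority
```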

Page 9: (no transcript text)
Page 10: (no transcript text)
Page 11: How to handle free-form answers?

Q: “My task does not have discrete answers….”
A: Break it into two HITs:
– a “Create” HIT (e.g., transcribe a caption)
– a “Vote” HIT (“Correct or not?”)

Example: Collect URLs

The Vote HIT controls the quality of the Creation HIT; redundancy controls the quality of the Voting HIT.

Catch: if the “creation” answers are very good, voting workers just vote “yes”.
– Solution: add some random noise (e.g., misspell a word) so inattentive voters can be caught.
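One way to implement the “add some random noise” trick is to plant corrupted items among the real ones and check that voters reject them. A small sketch under that reading (the probe rate, the field names, and the character-swap misspelling are my choices, not the talk's):

```python
import random

def build_vote_tasks(creations, probe_rate=0.1, seed=0):
    """Wrap each creation in a 'Correct or not?' voting task and deliberately
    corrupt a random fraction (swap two adjacent characters) so that voters
    who blindly answer 'yes' can be spotted."""
    rng = random.Random(seed)
    tasks = []
    for text in creations:
        if rng.random() < probe_rate and len(text) > 3:
            i = rng.randrange(len(text) - 1)
            corrupted = text[:i] + text[i + 1] + text[i] + text[i + 2:]
            tasks.append({"text": corrupted, "is_probe": True, "expected_vote": "no"})
        else:
            tasks.append({"text": text, "is_probe": False, "expected_vote": None})
    return tasks
```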

Page 12: But my free-form answer is not just right or wrong…

Use three HITs: a “Create” HIT, an “Improve” HIT, and a “Compare” HIT.

– Creation HIT (e.g., describe the image; the prompt is “Describe this”)
– Improve HIT (e.g., improve the description)
– Compare HIT (voting): which is better?

TurKit toolkit: http://groups.csail.mit.edu/uid/turkit/
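The control flow behind the version 1 → version 8 progression on the next page can be sketched as a simple loop. This is a pseudocode-style Python sketch, not TurKit's actual API; the three callables stand in for posting HITs and collecting worker answers.

```python
def iterative_description(image, create, improve, compare_vote, rounds=8):
    """Sketch of the Create -> Improve -> Compare pattern, in the spirit of
    TurKit's iterative tasks but not its API."""
    best = create(image)                             # "Create" HIT: initial description
    for _ in range(rounds):
        candidate = improve(image, best)             # "Improve" HIT: edit the current best
        if compare_vote(image, candidate, best) == "candidate":  # "Compare" HIT (majority vote)
            best = candidate
    return best
```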

Page 13: Iteratively improved image descriptions

Successive worker-produced versions of the description:

version 1: “A parial view of a pocket calculator together with some coins and a pen.”

version 2: “A view of personal items a calculator, and some gold and copper coins, and a round tip pen, these are all pocket and wallet sized item used for business, writting, calculating prices or solving math problems and purchasing items.”

version 3: “A close-up photograph of the following items: A CASIO multi-function calculator. A ball point pen, uncapped. Various coins, apparently European, both copper and gold. Seems to be a theme illustration for a brochure or document cover treating finance, probably personal finance.”

version 4: “…Various British coins; two of £1 value, three of 20p value and one of 1p value. …”

version 8: “A close-up photograph of the following items: A CASIO multi-function, solar powered scientific calculator. A blue ball point pen with a blue rubber grip and the tip extended. Six British coins; two of £1 value, three of 20p value and one of 1p value. Seems to be a theme illustration for a brochure or document cover treating finance - probably personal finance.”

Page 14: Future: Break big tasks into simple ones and build a workflow

Running experiment: crowdsource big tasks (e.g., a tourist guide)

My Boss is a Robot (mybossisarobot.com) – Nikki Kittur (Carnegie Mellon) + Jim Giles (New Scientist)

– Identify sights worth checking out (one tip per worker), then vote and rank
– Write brief tips for each monument (one tip per worker), then vote and rank
– Aggregate tips into a meaningful summary, and iterate to improve…

Page 15: Thank you! Questions?

“A Computer Scientist in a Business School”
http://behind-the-enemy-lines.blogspot.com/

Email: [email protected]

Page 16: Correcting biases

Classifying sites as G, PG (P), R, X. Sometimes workers are careful but biased:
– classifies G → P and P → R
– average error rate: too high

Is she a spammer?

Error rates for the CEO of a company detecting offensive content (and a parent):

P[G → G] = 20.0%   P[G → P] = 80.0%    P[G → R] = 0.0%     P[G → X] = 0.0%
P[P → G] = 0.0%    P[P → P] = 0.0%     P[P → R] = 100.0%   P[P → X] = 0.0%
P[R → G] = 0.0%    P[R → P] = 0.0%     P[R → R] = 100.0%   P[R → X] = 0.0%
P[X → G] = 0.0%    P[X → P] = 0.0%     P[X → R] = 0.0%     P[X → X] = 100.0%

Page 17: Correcting biases

For ATLJIK76YH1TF, we simply need to “reverse the errors” (technical details omitted) and separate error from bias.

True error rate: ~9%

Error rates for worker ATLJIK76YH1TF:

P[G → G] = 20.0%   P[G → P] = 80.0%    P[G → R] = 0.0%     P[G → X] = 0.0%
P[P → G] = 0.0%    P[P → P] = 0.0%     P[P → R] = 100.0%   P[P → X] = 0.0%
P[R → G] = 0.0%    P[R → P] = 0.0%     P[R → R] = 100.0%   P[R → X] = 0.0%
P[X → G] = 0.0%    P[X → P] = 0.0%     P[X → R] = 0.0%     P[X → X] = 100.0%
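A hedged sketch of what “reverse the errors” means in practice: Bayes-invert the worker's confusion matrix, so that her biased but consistent labels are mapped back to the likely true labels. The uniform prior below is an assumption for illustration, and the talk's estimator (which also separates bias from the ~9% true error) is more involved than this.

```python
def posterior_over_true(confusion, prior, observed):
    """Bayes-invert a worker's confusion matrix:
    P[true | worker said `observed`] is proportional to P[observed | true] * P[true].
    A consistent bias (e.g., G almost always reported as P) is therefore still informative."""
    unnorm = {t: confusion.get((t, observed), 0.0) * prior[t] for t in prior}
    z = sum(unnorm.values()) or 1.0
    return {t: p / z for t, p in unnorm.items()}

# The biased worker from the slide: G -> P 80% of the time, P always -> R, etc.
confusion = {("G", "G"): 0.2, ("G", "P"): 0.8,
             ("P", "R"): 1.0, ("R", "R"): 1.0, ("X", "X"): 1.0}
prior = {"G": 0.25, "P": 0.25, "R": 0.25, "X": 0.25}   # assumed uniform prior
print(posterior_over_true(confusion, prior, "P"))  # a "P" vote from this worker really means G
print(posterior_over_true(confusion, prior, "R"))  # an "R" vote means P or R, never X
```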

Page 18: Scaling Crowdsourcing: Use Machine Learning

Human labor is expensive, even when paying cents. We need to scale crowdsourcing.

Basic idea: build a machine learning model from the crowdsourced answers and use it instead of humans.

[Diagram: data from existing crowdsourced answers → automatic model (through machine learning); new case → automatic model → automatic answer]
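A minimal sketch of “build a model from the crowdsourced answers”, using scikit-learn as an illustrative choice (the talk does not prescribe a library). It assumes we already have (site_text, label) pairs with labels in {"G", "PG", "R", "X"}.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_from_crowd(site_texts, crowd_labels):
    """Fit a text classifier on the (possibly noisy) crowd-derived labels."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(site_texts, crowd_labels)
    return model

# model = train_from_crowd(site_texts, corrected_labels)   # data from existing crowdsourced answers
# model.predict(["...text of a new, unlabeled site..."])   # automatic answer for a new case
```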

Page 19: Tradeoffs for Automatic Models: Effect of Noise

– Get more data → improve model accuracy
– Improve data quality → improve classification

Example case: porn or not?

[Chart: accuracy vs. number of examples (Mushroom dataset), for data quality = 50%, 60%, 80%, and 100%]
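The chart's qualitative message (noisy labels cap the benefit of more data) can be reproduced with a small simulation. This sketch uses synthetic scikit-learn data rather than the Mushroom dataset and my own model choices, so the numbers will not match the slide's curves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def accuracy_curve(quality, sizes, seed=0):
    """Train on increasingly large samples whose labels are correct only
    `quality` of the time, and report held-out accuracy."""
    rng = np.random.default_rng(seed)
    X, y = make_classification(n_samples=5000, n_features=20, random_state=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=seed)
    noisy = np.where(rng.random(len(y_tr)) < quality, y_tr, 1 - y_tr)  # keep a label with prob `quality`
    return [LogisticRegression(max_iter=1000).fit(X_tr[:n], noisy[:n]).score(X_te, y_te)
            for n in sizes]

for quality in (0.5, 0.6, 0.8, 1.0):
    print(quality, accuracy_curve(quality, sizes=[20, 100, 300]))
```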

Page 20: Example: Build an “Adult Web Site” Classifier

Confide

ntConfide

nt

Automatic Model

(through machine

learning)

Automatic Model

(through machine

learning)

Scaling Crowdsourcing: Iterative training

Use machine when confident, humans otherwise

Retrain with new human input → improve model → reduce need for humans

Get human(s)

to answer

Get human(s)

to answer

New CaseNew Case

Not confident

Not confident

Automatic

Answer

Automatic

Answer

Data from existing

crowdsourced answers

Data from existing

crowdsourced answers
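A sketch of the routing loop in the diagram. Here `model`, `ask_humans`, and `retrain` are assumed helpers (not an API from the talk), the scikit-learn-style `predict_proba` interface is an assumption, and the 0.9 threshold is illustrative.

```python
CONFIDENCE_THRESHOLD = 0.9   # illustrative value

def answer(case, model, labeled_pool, ask_humans, retrain):
    probs = model.predict_proba([case])[0]       # model's confidence over labels
    if probs.max() >= CONFIDENCE_THRESHOLD:
        return model.classes_[probs.argmax()]    # confident: automatic answer
    label = ask_humans(case)                     # not confident: fall back to the crowd
    labeled_pool.append((case, label))           # new human input...
    retrain(model, labeled_pool)                 # ...retrains and improves the model over time
    return label
```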

Page 21: Tradeoffs for Automatic Models: Effect of Noise (repeat of Page 19)

– Get more data → improve model accuracy
– Improve data quality → improve classification

Example case: porn or not?

[Chart: accuracy vs. number of examples (Mushroom dataset), for data quality = 50%, 60%, 80%, and 100%]

Page 22: Example: Build an “Adult Web Site” Classifier

Not confident

Not confident

Confiden

tConfiden

t

Automatic Model

(through machine

learning)

Automatic Model

(through machine

learning)

Scaling Crowdsourcing: Iterative training, with noise

Use machine when confident, humans otherwise Ask as many humans as necessary to ensure quality

Get human(s)

to answer

Get human(s)

to answer

New CaseNew Case

Automatic

Answer

Automatic

Answer

Confident for quality?

Not confident

for quality?

Data from existing

crowdsourced answers

Data from existing

crowdsourced answers
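“Ask as many humans as necessary to ensure quality” suggests adding workers one at a time until the label is certain enough. A minimal sketch under that reading: the worker accuracy, confidence target, and cap are assumptions, and `ask_one_worker` stands in for posting one more assignment.

```python
from collections import Counter

def label_until_confident(case, ask_one_worker, worker_accuracy=0.7,
                          target_confidence=0.95, max_workers=11):
    """Collect votes until the margin of the leading label makes it confident
    enough, using a simple two-hypothesis Bayes update with a uniform assumed
    worker accuracy."""
    votes = Counter()
    for _ in range(max_workers):
        votes[ask_one_worker(case)] += 1
        (top, n_top), = votes.most_common(1)
        margin = n_top - (sum(votes.values()) - n_top)
        odds = (worker_accuracy / (1 - worker_accuracy)) ** margin  # likelihood ratio for the margin
        if odds / (1 + odds) >= target_confidence:
            return top
    return votes.most_common(1)[0][0]
```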