accuracy sas-redmore-2014-2

Upload: sredmore

Post on 20-Aug-2015

Category:

Technology


TRANSCRIPT

Page 1: Accuracy sas-redmore-2014-2

The Opportunity of Accuracy

No, you can't always get what you want You can't always get what you want You can't always get what you want

And if you try sometime you find You get what you need

Seth Redmore, VP Marketing and Product Management @sredmore, @lexalytics, http://www.lexalytics.com/blog

Page 2: Accuracy sas-redmore-2014-2

©2014 Lexalytics Inc. All rights reserved. 2

Accuracy? Opportunity? What?

Care Lots

Care Some

Don't Care

A very rough estimate of “companies that care about accuracy…”

Page 3: Accuracy sas-redmore-2014-2

Agenda

• “Accuracy” is imprecise

Sentiment is personal

Precision/Recall/F1

Different applications require different balance

Precision and Recall are bounded by Inter-Rater Agreement

• How to tune

• How to crowdsource

Page 4: Accuracy sas-redmore-2014-2

“Accuracy” is imprecise.

• Because sentiment is personal (e.g. over what dataset is sentiment “accurate”?)

• Because you may care more about precision, or you may care more about recall

Page 5: Accuracy sas-redmore-2014-2

Sentiment Accuracy is Personal!

• “Wells Fargo lost $200M last month”

• “Kölnisch wasser smells like my grandmother.”

• “Taco Bell is like Russian Roulette for your ass, but it’s worth the risk.”

• “We’re switching to Direct TV.”

• “Microsoft is dropping their prices.”

Page 6: Accuracy sas-redmore-2014-2

Precision, Recall, F1

• Precision:

“of the items you coded, what % are correct?”

• Recall:

“of all the possible items that match the code, what % did you retrieve?”

• F1 is the harmonic mean of precision and recall

F1 = 2 * (precision * recall) / (precision + recall)
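As a minimal sketch of the three definitions above, the following computes precision, recall, and F1 for one sentiment label. The function name, the label strings, and the sample data are illustrative assumptions, not part of the original slides.

```python
def precision_recall_f1(gold, predicted, label):
    """Score one label (e.g. "neg") against hand-coded gold labels."""
    pairs = list(zip(gold, predicted))
    # True positives: coded as `label` and actually `label`.
    tp = sum(1 for g, p in pairs if p == label and g == label)
    # False positives: coded as `label` but actually something else.
    fp = sum(1 for g, p in pairs if p == label and g != label)
    # False negatives: actually `label` but coded as something else.
    fn = sum(1 for g, p in pairs if p != label and g == label)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical hand-coded labels and machine output for six documents.
gold = ["neg", "neg", "pos", "neu", "neg", "pos"]
predicted = ["neg", "pos", "pos", "neg", "neg", "neu"]

p, r, f1 = precision_recall_f1(gold, predicted, "neg")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this sample, two of the three items coded "neg" are correct and two of the three true "neg" items are retrieved, so all three scores come out to about 0.67.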

Page 7: Accuracy sas-redmore-2014-2

Different apps require different balance

• High precision -> Social media trending

Want to know that what you’re graphing has absolutely no crap

• High recall -> Customer support requests

Really don’t want to miss even a single pissed off customer, even at the cost of having to filter through lots of not-upset customers
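One common way to move between the two ends of this balance is to change the cutoff at which a scored item gets flagged. The sketch below assumes a hypothetical classifier that emits an "upset" score between 0 and 1 for each support request; the scores, labels, and thresholds are made up for illustration.

```python
def precision_recall_at(threshold, scores, is_upset):
    """Flag every item scoring at or above `threshold`, then score the flags."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, u in zip(flagged, is_upset) if f and u)
    fp = sum(1 for f, u in zip(flagged, is_upset) if f and not u)
    fn = sum(1 for f, u in zip(flagged, is_upset) if not f and u)
    # Convention used here: with nothing flagged (or nothing to find),
    # the corresponding score is reported as 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical "upset customer" scores and hand-coded truth.
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
is_upset = [True, True, False, True, False, True, False, False]

strict = precision_recall_at(0.80, scores, is_upset)   # trending-style setting
lenient = precision_recall_at(0.25, scores, is_upset)  # support-style setting
print("strict :", strict)
print("lenient:", lenient)
```

With the strict cutoff everything flagged is truly upset but half of the upset customers are missed; with the lenient cutoff no upset customer is missed, at the cost of reading through some who are not upset.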

[Figure labels: HIGH PRECISION, HIGH RECALL]

Page 8: Accuracy sas-redmore-2014-2

Sentiment F1 scores (and “accuracy”) bounded by IRA

• MPQA Corpus

Wiebe et al., 2005. “Annotating expressions of opinions and emotions in language.” Language Resources and Evaluation, 39:165-210

Grad students, 40 hours of training, 16k sentences, ~80% IRA

• To ponder: If people max out at 80%, how can a machine be scored any better?

• Answer: it can’t.

A machine will do a “poor” job of scoring content that people can’t agree on.
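To make the ceiling concrete, here is a rough sketch of how agreement between two human raters might be measured. Raw percent agreement is the figure the slide's "~80% IRA" most naturally corresponds to; Cohen's kappa is added here as a standard chance-corrected variant and is not mentioned in the slides. The rater labels are invented for illustration.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters chose the same label."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own observed rates.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two human raters on the same ten sentences.
rater_a = ["pos", "neg", "neu", "neg", "pos", "neu", "neg", "pos", "neu", "neg"]
rater_b = ["pos", "neg", "neg", "neg", "pos", "neu", "neu", "pos", "neu", "neg"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

Scoring a machine against either rater with the same functions shows the point of the slide: where the two humans disagree, the machine is necessarily "wrong" against at least one of them.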

Page 9: Accuracy sas-redmore-2014-2

So, you want to maximize your own accuracy?

• Get a clear goal on what you’re optimizing for

Precision/recall

What is “sentiment”: does an opinion have to be explicitly expressed, or not?

Bounds of neutral

• Score a set of content yourself

• Crowdsource
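The "bounds of neutral" item above can be illustrated with a small sketch. It assumes a hypothetical sentiment engine that returns a numeric score where negative values mean negative sentiment; the default bounds of -0.1 and 0.1 are arbitrary placeholders, not values from the slides or from any particular product.

```python
def label_from_score(score, neutral_low=-0.1, neutral_high=0.1):
    """Map a numeric sentiment score to a label using a neutral band."""
    if score < neutral_low:
        return "negative"
    if score > neutral_high:
        return "positive"
    return "neutral"


# Hypothetical scores for five documents.
scores = [-0.6, -0.05, 0.0, 0.08, 0.45]

narrow_band = [label_from_score(s) for s in scores]
wide_band = [label_from_score(s, -0.5, 0.5) for s in scores]
print(narrow_band)
print(wide_band)
```

Widening the band pushes borderline items into neutral, which tends to raise precision on the positive and negative labels and lower their recall, so the bounds are worth fixing before anyone starts scoring content.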

Page 10: Accuracy sas-redmore-2014-2

Tuning

Page 11: Accuracy sas-redmore-2014-2

Mturk Jargon

• Worker

The individual scoring the doc.

• Requester

You

• HIT

Human Intelligence Task

(Work unit)

• Quals

Which workers can work on your task?

Page 12: Accuracy sas-redmore-2014-2

Crowdsourcing Flowchart

Page 13: Accuracy sas-redmore-2014-2

Worker Qualifications

• Control which workers get to work on your HITs

• Amazon has built-in qualifications (Categorization Masters)

~20% more expensive

Opaque process

Workers don’t earn anything extra for holding them

• Manage your list tightly

Build a small qual test, open up to a limited set of users

Manually add workers to qualification list

• Use a “# of accepted HITs > 5000” requirement (or some other threshold)

• Check against gold set

• Drop workers, don’t reject HITs
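The last two bullets can be sketched as a simple check that runs outside of Mechanical Turk. It assumes you have exported worker answers as (worker ID, document ID, label) tuples and have a small gold set you scored yourself; the function name, the 80% accuracy bar, and the minimum of three gold items are illustrative assumptions.

```python
from collections import defaultdict


def workers_to_drop(answers, gold, min_accuracy=0.8, min_gold_seen=3):
    """Return workers whose accuracy on gold documents is below the bar.

    Workers who have seen fewer than `min_gold_seen` gold documents are
    left alone, since there is too little evidence to judge them.
    """
    seen = defaultdict(int)
    correct = defaultdict(int)
    for worker_id, doc_id, label in answers:
        if doc_id in gold:  # only gold documents count toward the check
            seen[worker_id] += 1
            correct[worker_id] += int(label == gold[doc_id])
    return sorted(
        w for w in seen
        if seen[w] >= min_gold_seen and correct[w] / seen[w] < min_accuracy
    )


# Hypothetical gold set and worker answers ("d9" is not a gold document).
gold = {"d1": "neg", "d2": "pos", "d3": "neu"}
answers = [
    ("W1", "d1", "neg"), ("W1", "d2", "pos"), ("W1", "d3", "neu"),
    ("W2", "d1", "pos"), ("W2", "d2", "pos"), ("W2", "d3", "pos"),
    ("W3", "d1", "pos"), ("W3", "d9", "neg"),
]

drop_list = workers_to_drop(answers, gold)
print(drop_list)
```

The result is a list of workers to remove from your qualification list going forward. Their already-submitted HITs are still approved, in line with "drop workers, don't reject HITs".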

Page 15: Accuracy sas-redmore-2014-2

Mturk compensation

• It is unfortunate that “crowdsourcing” is sometimes a sophisticated term for trying to get the cheapest work possible.

• Sophisticated Mturkers (the ones you want doing your work) look for a lower bound of $6/hr.

• Don’t rely on “sweatshop” labor.

• You cannot rely on the little Mturk compensation app; the only way to fairly judge compensation is to do some of the HITs yourself.

• Lexalytics aims for $8-10/hr for our sentiment scoring work. Some projects pay more, some less.

• If you are using a 3rd-party service that you know is crowdsourcing your work, please go to http://mturk.com and check what rates *they* are paying the workers for your HITs.
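Once you have timed yourself on a batch of HITs, the arithmetic for judging compensation is short. The sketch below is an illustration only; the $0.05 reward, the 20- and 30-second timings, and the $9/hr target (chosen from inside the $8-10/hr range above) are assumed figures.

```python
def effective_hourly_rate(reward_per_hit, seconds_per_hit):
    """Hourly earnings for a worker completing HITs at a steady pace."""
    hits_per_hour = 3600 / seconds_per_hit
    return reward_per_hit * hits_per_hour


def reward_for_target(target_hourly_rate, seconds_per_hit):
    """Per-HIT reward needed to reach a target hourly rate."""
    return target_hourly_rate * seconds_per_hit / 3600


# Assumed figures: you timed yourself at 20 seconds per HIT at $0.05 each.
rate = effective_hourly_rate(0.05, 20)

# If a harder task takes 30 seconds per HIT, what reward reaches $9/hr?
needed = reward_for_target(9.00, 30)

print(f"effective rate: ${rate:.2f}/hr")
print(f"reward needed:  ${needed:.3f} per HIT")
```

Use your own measured time per HIT as the input, since timings you have not verified yourself are exactly what the slide warns against relying on.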

Page 16: Accuracy sas-redmore-2014-2

Summary

• There’s opportunity in caring about accuracy, since not everyone does.

• Sentiment is personal and precision/recall are bounded by Inter-Rater Agreement

• Understand your content: what’s positive/negative for you?

• Understand how you need to balance precision and recall

• Score some of your own content

• Write some instructions

• Gather a set of workers

• Set them loose (and pay them right!)

Page 17: Accuracy sas-redmore-2014-2