The Opportunity of Accuracy
No, you can't always get what you want
You can't always get what you want
You can't always get what you want
And if you try sometime you find
You get what you need
Seth Redmore, VP Marketing and Product Management
@sredmore, @lexalytics, http://www.lexalytics.com/blog
©2014 Lexalytics Inc. All rights reserved.
Accuracy? Opportunity? What?
[Pie chart] A very rough estimate of “companies that care about accuracy”: Care Lots / Care Some / Don't Care
Agenda
• “Accuracy” is imprecise
  Sentiment is personal
  Precision / Recall / F1
  Different applications require different balance
  Precision and recall are bounded by inter-rater agreement
• How to tune
• How to crowdsource
“Accuracy” is imprecise.
• Because sentiment is personal (e.g. over what dataset is sentiment “accurate”?)
• Because you may care more about precision, or you may care more about recall
Sentiment Accuracy is Personal!
• “Wells Fargo lost $200M last month”
• “Kölnisch wasser smells like my grandmother.”
• “Taco Bell is like Russian Roulette for your ass, but it’s worth the risk.”
• “We’re switching to Direct TV.”
• “Microsoft is dropping their prices.”
Precision, Recall, F1
• Precision:
  “of the items you coded, what % are correct?”
• Recall:
  “of all the possible items that match the code, what % did you retrieve?”
• F1 is the harmonic mean of precision and recall:
  2*((precision*recall)/(precision+recall))
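The definitions above reduce to a few lines of Python; the counts in the example are made up for illustration.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# Example: a sentiment classifier flags 80 documents as negative;
# 60 of those really are negative, and it misses 40 actual negatives.
p, r, f1 = precision_recall_f1(true_positives=60, false_positives=20, false_negatives=40)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")  # precision=0.75 recall=0.60 F1=0.67
```

Note that F1, as a harmonic mean, is dragged toward whichever of precision or recall is lower, which is why trading one against the other (next slide) matters.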
Different apps require different balance
• High precision -> Social media trending
Want to know that what you’re graphing has absolutely no crap
• High recall -> Customer support requests
Really don’t want to miss even a single pissed off customer, even at the cost of having to filter through lots of not-upset customers
Sentiment F1 scores (and “accuracy”) are bounded by inter-rater agreement (IRA)
• MPQA Corpus
Wiebe et al., 2005. “Annotating expressions of opinions and emotions in language.” Language Resources and Evaluation, 39:165–210.
Grad students, 40 hours of training, 16k sentences, ~80% IRA
• To ponder: If people max out at 80%, how can a machine be scored any better?
• Answer: it can’t.
A machine will do a “poor” job of scoring content that people can’t agree on.
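Raw inter-rater agreement is just the fraction of items two annotators label the same way; here is a minimal sketch with made-up labels (chance-corrected measures such as Cohen's kappa are stricter, but this is the quantity the ~80% figure above refers to).

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assign the same label."""
    assert len(labels_a) == len(labels_b), "annotators must label the same items"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical sentiment labels from two annotators over ten sentences.
rater1 = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "neu"]
rater2 = ["pos", "neg", "pos", "pos", "neg", "neu", "neu", "neg", "pos", "neu"]
print(percent_agreement(rater1, rater2))  # 0.8
```

If humans agree only 80% of the time, a machine scoring 80% against either human is already at the ceiling: the remaining 20% is content people themselves can't agree on.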
So, you want to maximize your own accuracy?
• Get a clear goal on what you're optimizing for
  Precision vs. recall
  What counts as “sentiment”: does an opinion have to be explicitly expressed?
  Bounds of “neutral”
• Score a set of content yourself
• Crowdsource
Tuning
MTurk Jargon
• Worker
The individual scoring the doc.
• Requester
You
• HIT
Human Intelligence Task
(Work unit)
• Quals
Qualifications: control which workers can work on your task
Crowdsourcing Flowchart
Worker Qualifications
• Control which workers get to work on your HITs
• Amazon has built-in qualifications (Categorization Masters)
~20% more expensive
Opaque process
Workers don't get paid anything extra for holding them
• Manage your list tightly: build a small qual test, open it up to a limited set of workers
Manually add workers to your qualification list
• Require “# of accepted HITs > 5000” (or some other threshold)
• Check against gold set
• Drop workers, don’t reject HITs
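The gold-set check can be sketched like this; the items, labels, and 75% threshold are illustrative assumptions, not a prescribed setup. Note the last bullet: a failing worker is dropped from the qualification list, but their submitted HITs are still accepted and paid.

```python
# Screen workers by accuracy on a "gold" set of items with known labels.
GOLD = {"item1": "pos", "item2": "neg", "item3": "neu", "item4": "pos"}

def passes_gold_check(worker_answers, gold=GOLD, threshold=0.75):
    """worker_answers: {item_id: label} for the gold items this worker saw."""
    scored = [(item, label) for item, label in worker_answers.items() if item in gold]
    if not scored:
        return False  # no overlap with the gold set; can't judge this worker yet
    correct = sum(gold[item] == label for item, label in scored)
    return correct / len(scored) >= threshold

# A hypothetical worker who got 3 of 4 gold items right stays on the list.
answers = {"item1": "pos", "item2": "neg", "item3": "pos", "item4": "pos"}
print(passes_gold_check(answers))  # True
```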
Communication with MTurkers
• Boards
http://www.cloudmebaby.com/
http://www.turkernation.com/
http://www.mturkforum.com/
http://www.mturkgrind.com/forum.php
http://mturkwiki.net/forum/
http://www.reddit.com/r/HITsWorthTurkingFor
http://www.reddit.com/r/mturk/
• Turkopticon
http://turkopticon.ucsd.edu/
MTurk Compensation
• It is unfortunate that “crowdsourcing” is sometimes a sophisticated term for trying to get the cheapest work possible.
• Sophisticated MTurkers (the ones you want doing your work) look for a lower bound of $6/hr.
• Don't rely on “sweatshop” labor.
• You can't rely on the little MTurk compensation app; the only way to fairly judge compensation is to do some of the work yourself.
• Lexalytics aims for $8-10/hr for our sentiment scoring work. Some projects pay more, some less.
• If you are using a 3rd-party service that you know is doing your crowdsourcing, go to http://mturk.com and check what rates *they* are paying the workers for your HITs.
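Turning a target hourly rate into a per-HIT price is simple arithmetic; the 30-second timing and $9/hr rate below are illustrative, not Lexalytics' numbers. Time yourself on a batch of HITs first to get a realistic seconds-per-HIT figure.

```python
def price_per_hit(seconds_per_hit, target_hourly_rate):
    """Per-HIT payment (in dollars) that yields the target hourly rate."""
    return target_hourly_rate * seconds_per_hit / 3600

# If scoring one sentence takes ~30 seconds and you target $9/hr:
print(f"${price_per_hit(30, 9.00):.3f} per HIT")  # $0.075 per HIT
```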
Summary
• There’s opportunity in caring about accuracy, since not everyone does.
• Sentiment is personal and precision/recall are bounded by Inter-Rater Agreement
• Understand your content: what's positive/negative for you?
• Understand how you need to balance precision and recall
• Score some of your own content
• Write some instructions
• Gather a set of workers
• Set them loose (and pay them right!)