The Opportunity of Accuracy
No, you can't always get what you want
You can't always get what you want
You can't always get what you want
And if you try sometime you find
You get what you need
Seth Redmore, VP Marketing and Product Management
@sredmore, @lexalytics, http://www.lexalytics.com/blog
©2014 Lexalytics Inc. All rights reserved.
Accuracy? Opportunity? What?
[Pie chart] A very rough estimate of “companies that care about accuracy”: Care Lots / Care Some / Don't Care
Agenda
• “Accuracy” is imprecise
  Sentiment is personal
  Precision / Recall / F1
  Different applications require different balance
  Precision and recall are bounded by inter-rater agreement
• How to tune
• How to crowdsource
“Accuracy” is imprecise.
• Because sentiment is personal (e.g. over what dataset is sentiment “accurate”?)
• Because you may care more about precision, or you may care more about recall
Sentiment Accuracy is Personal!
• “Wells Fargo lost $200M last month”
• “Kölnisch wasser smells like my grandmother.”
• “Taco Bell is like Russian Roulette for your ass, but it’s worth the risk.”
• “We’re switching to Direct TV.”
• “Microsoft is dropping their prices.”
Precision, Recall, F1
• Precision:
  “of the items you coded, what % are correct?”
• Recall:
  “of all the possible items that match the code, what % did you retrieve?”
• F1 is the harmonic mean of precision and recall:
  2*((precision*recall)/(precision+recall))
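The definitions above reduce to a few lines of Python; the counts in the example are made up for illustration.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# Example: a sentiment classifier flags 80 documents as negative;
# 60 of those really are negative, and it misses 40 actual negatives.
p, r, f1 = precision_recall_f1(true_positives=60, false_positives=20, false_negatives=40)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")  # precision=0.75 recall=0.60 F1=0.67
```

Note that F1, as a harmonic mean, is dragged toward whichever of precision or recall is lower, which is why trading one against the other (next slide) matters.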
Different apps require different balance
• High precision -> Social media trending
Want to know that what you’re graphing has absolutely no crap
• High recall -> Customer support requests
Really don’t want to miss even a single pissed off customer, even at the cost of having to filter through lots of not-upset customers
Sentiment F1 scores (and “accuracy”) are bounded by inter-rater agreement (IRA)
• MPQA Corpus
Wiebe et al., 2005. “Annotating expressions of opinions and emotions in language.” Language Resources and Evaluation, 39:165–210.
Grad students, 40 hours of training, 16k sentences, ~80% IRA
• To ponder: If people max out at 80%, how can a machine be scored any better?
• Answer: it can’t.
A machine will do a “poor” job of scoring content that people can’t agree on.
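Raw inter-rater agreement is just the fraction of items two annotators label the same way; here is a minimal sketch with made-up labels (chance-corrected measures such as Cohen's kappa are stricter, but this is the quantity the ~80% figure above refers to).

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assign the same label."""
    assert len(labels_a) == len(labels_b), "annotators must label the same items"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical sentiment labels from two annotators over ten sentences.
rater1 = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "neu"]
rater2 = ["pos", "neg", "pos", "pos", "neg", "neu", "neu", "neg", "pos", "neu"]
print(percent_agreement(rater1, rater2))  # 0.8
```

If humans agree only 80% of the time, a machine scoring 80% against either human is already at the ceiling: the remaining 20% is content people themselves can't agree on.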
So, you want to maximize your own accuracy?
• Get a clear goal on what you're optimizing for
  Precision vs. recall
  What counts as “sentiment”: does an opinion have to be explicitly expressed?
  Bounds of “neutral”
• Score a set of content yourself
• Crowdsource
Tuning
MTurk Jargon
• Worker
The individual scoring the doc.
• Requester
You
• HIT
Human Intelligence Task
(Work unit)
• Quals
Qualifications: control which workers can work on your task
Crowdsourcing Flowchart
Worker Qualifications
• Control which workers get to work on your HITs
• Amazon has built-in qualifications (Categorization Masters)
~20% more expensive
Opaque process
Workers don't get paid anything extra for holding them
• Manage your list tightly: build a small qual test, open it up to a limited set of workers
Manually add workers to your qualification list
• Require “# of accepted HITs > 5000” (or some other threshold)
• Check against gold set
• Drop workers, don’t reject HITs
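The gold-set check can be sketched like this; the items, labels, and 75% threshold are illustrative assumptions, not a prescribed setup. Note the last bullet: a failing worker is dropped from the qualification list, but their submitted HITs are still accepted and paid.

```python
# Screen workers by accuracy on a "gold" set of items with known labels.
GOLD = {"item1": "pos", "item2": "neg", "item3": "neu", "item4": "pos"}

def passes_gold_check(worker_answers, gold=GOLD, threshold=0.75):
    """worker_answers: {item_id: label} for the gold items this worker saw."""
    scored = [(item, label) for item, label in worker_answers.items() if item in gold]
    if not scored:
        return False  # no overlap with the gold set; can't judge this worker yet
    correct = sum(gold[item] == label for item, label in scored)
    return correct / len(scored) >= threshold

# A hypothetical worker who got 3 of 4 gold items right stays on the list.
answers = {"item1": "pos", "item2": "neg", "item3": "pos", "item4": "pos"}
print(passes_gold_check(answers))  # True
```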
Communication with MTurkers
• Boards
http://www.cloudmebaby.com/
http://www.turkernation.com/
http://www.mturkforum.com/
http://www.mturkgrind.com/forum.php
http://mturkwiki.net/forum/
http://www.reddit.com/r/HITsWorthTurkingFor
http://www.reddit.com/r/mturk/
• Turkopticon
http://turkopticon.ucsd.edu/
MTurk Compensation
• It is unfortunate that “crowdsourcing” is sometimes a sophisticated term for trying to get the cheapest work possible.
• Sophisticated MTurkers (the ones you want doing your work) look for a lower bound of $6/hr.
• Don't rely on “sweatshop” labor.
• You can't rely on the little MTurk compensation app; the only way to fairly judge compensation is to do some of the work yourself.
• Lexalytics aims for $8-10/hr for our sentiment scoring work. Some projects pay more, some less.
• If you are using a 3rd-party service that you know is doing your crowdsourcing, go to http://mturk.com and check what rates *they* are paying the workers for your HITs.
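Turning a target hourly rate into a per-HIT price is simple arithmetic; the 30-second timing and $9/hr rate below are illustrative, not Lexalytics' numbers. Time yourself on a batch of HITs first to get a realistic seconds-per-HIT figure.

```python
def price_per_hit(seconds_per_hit, target_hourly_rate):
    """Per-HIT payment (in dollars) that yields the target hourly rate."""
    return target_hourly_rate * seconds_per_hit / 3600

# If scoring one sentence takes ~30 seconds and you target $9/hr:
print(f"${price_per_hit(30, 9.00):.3f} per HIT")  # $0.075 per HIT
```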
Summary
• There’s opportunity in caring about accuracy, since not everyone does.
• Sentiment is personal and precision/recall are bounded by Inter-Rater Agreement
• Understand your content: what's positive/negative for you?
• Understand how you need to balance precision and recall
• Score some of your own content
• Write some instructions
• Gather a set of workers
• Set them loose (and pay them right!)