How can crowdsourcing and machine learning improve speech technology?
Joao Freitas, Daniela Braga
CSW Global London
April 14th 2016
April 2016 definedcrowd2
How many of you have tried speech recognition?
Speech Technology is everywhere
And it is starting to understand you…
What it takes to get there
Large amounts of data
Deep Learning
3000+ hours speech recordings + transcription
200+ words with pronunciations
0.5M natural language variants + semantic annotation
Language and Product dependent!
DefinedCrowd landscape
We serve the data needs of the AI and ML landscape.
We’re a SaaS company that collects and enriches training data for AI,
combining crowdsourcing and ML.
The world before DefinedCrowd
Louis, Speech Scientist
User Goal
Wants to test if the Chinese acoustic model works for Mandarin speakers in Singapore
What does he do?
Hires:
• A few vendors
• 1 PM
• 1 Dev
• 1 Chinese LE in-house
What does he get?
50 hours of raw speech with…
• Poor quality (~20% garbage)
• Unknown sources
• Long wait
The world after DefinedCrowd
Andy, Speech Scientist
User Goal
Wants to test if the Chinese acoustic model works for Mandarin speakers in Singapore
What does he do?
Subscribes to our platform
How does he do it?
• Picks a template
• Adjusts settings and picks the crowd
• Launches the job
• Collects the data
What does he get?
50 hours of pure speech with…
• High quality
• 100% transparency
• 50% faster throughput
Our platform – enterprise side
Unique crowd model
US: 200+
Brazil: 200+
Taiwan: 100+
Russia: 200+
Japan: 100+
Korea: 100+
Ukraine: 100+
Spain: 100+
Portugal: 100+
France: 100+
Germany: 100+
Denmark: 50+
Sweden: 50+
Finland: 50+
Netherlands: 50+
Italy: 100+
Greece: 100+
Czech Republic: 100+
Poland: 100+
Turkey: 100+
Belgium: 50+
Australia: 100+
New Zealand: 50+
Mexico: 100+
Puerto Rico: 100+
Canada: 100+
China: 200+
Vietnam: 50+
Thailand: 50+
Malaysia: 50+
Singapore: 50+
India: 100+
30+ countries
100+ dialects
3,000 crowd members
We know a lot about our crowd
Languages & Dialects
User Activity
Job Performance
School & Courses
Profile Info
Other Jobs
Why is Machine Learning relevant for Crowdsourcing?
How we use Machine Learning
We learn from metadata to provide recommendations to customers and crowd members
How we detect spam
Raw data
• Logging system
• Behavior measures
Data Processing
• Clean data
• Transform data
Feature Extraction
• Task-related measures (e.g. average duration)
• Session duration
• Execution peaks
• Consensus score
• Real-time audits
Classification & Analysis
• Detect outliers/anomalies
• Predict task/job duration
OUTLIER
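The outlier-detection step can be illustrated with a minimal sketch. It assumes a single feature per crowd member (average task duration in seconds) and uses a median-based modified z-score; the slides do not say which classifier the platform uses, so the function name `mad_outliers`, the threshold of 3.5 and the sample data are all hypothetical.

```python
from statistics import median

def mad_outliers(values_by_worker, threshold=3.5):
    """Return worker ids whose value is far from the crowd median.

    Uses the modified z-score 0.6745 * |x - median| / MAD.
    Illustrative only: not the platform's actual spam detector.
    """
    values = list(values_by_worker.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # at least half the workers share the median value; nothing to flag
    flagged = []
    for worker, value in values_by_worker.items():
        score = 0.6745 * abs(value - med) / mad
        if score > threshold:
            flagged.append(worker)
    return flagged

# Hypothetical feature: average seconds spent per task by each crowd member.
avg_task_seconds = {"w1": 41.0, "w2": 38.5, "w3": 44.0,
                    "w4": 40.2, "w5": 3.1, "w6": 39.7}
suspects = mad_outliers(avg_task_seconds)
print(suspects)
```

Here only `w5`, who finishes tasks in about three seconds, is flagged. A real pipeline would combine several of the features listed above rather than a single duration measure.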
Example of Results I
Same results – Different perspective
Another Dimension
Quality in our platform
1. Combined score of Qualification Tests
2. Real-time Audits and Reviews
3. Majority Vote
4. Overall Majority
5. Worker Expertise
6. Task Subjectiveness
7. …
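As an illustration of how majority vote and worker expertise might be combined, here is a minimal sketch of expertise-weighted voting. The slides do not describe the platform's actual scoring formula; the function `weighted_majority`, the default weight of 1.0 and the sample votes are assumptions made for the example.

```python
from collections import defaultdict

def weighted_majority(votes, expertise=None):
    """Aggregate (worker_id, label) votes into a label and a consensus score.

    Each vote counts with the worker's expertise weight (1.0 if unknown).
    Ties go to the label that was seen first. Illustrative only.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    expertise = expertise or {}
    totals = defaultdict(float)
    for worker, label in votes:
        totals[label] += expertise.get(worker, 1.0)
    winner = max(totals, key=totals.get)
    # Consensus score: share of the total weight behind the winning label.
    consensus = totals[winner] / sum(totals.values())
    return winner, consensus

votes = [("w1", "yes"), ("w2", "no"), ("w3", "yes")]

# Plain majority vote: every worker has the same weight.
label, consensus = weighted_majority(votes)

# Same votes, but the hypothetical expert w2 carries three times the weight.
expert_label, expert_consensus = weighted_majority(votes, {"w2": 3.0})
print(label, round(consensus, 2), expert_label, round(expert_consensus, 2))
```

With equal weights the answer is "yes" with two thirds of the weight behind it. Once `w2` is treated as an expert the answer flips to "no", which shows why expertise and task subjectiveness matter alongside the raw vote count.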
Other predictions using Machine Learning
Best quality / budget tradeoff
Best match between job and crowd member
Expected quality
When will a job finish (even before it starts)
[Figure: Quality, Time, Cost]
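To make the "when will a job finish" prediction concrete, here is a naive throughput baseline, not the machine-learning model the slide refers to. It assumes the average task duration and the number of active crowd members are already known from historical data; the function `estimate_job_hours` and all the numbers are hypothetical.

```python
def estimate_job_hours(num_tasks, avg_task_seconds, active_workers,
                       judgments_per_task=1):
    """Rough job duration in hours, assuming workers operate in parallel.

    A back-of-the-envelope baseline: total work divided by crowd size.
    """
    if active_workers <= 0:
        raise ValueError("active_workers must be positive")
    total_seconds = num_tasks * judgments_per_task * avg_task_seconds
    return total_seconds / active_workers / 3600

# Hypothetical job: 6000 tasks, 3 judgments each, 30 s per judgment, 25 workers.
hours = estimate_job_hours(6000, 30, 25, judgments_per_task=3)
print(hours)
```

A learned model would replace the fixed averages with predictions based on the job template, the crowd profile and past activity, but the same quantities feed the quality, time and cost tradeoff.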
definedcrowd
Intelligent data for AI
contacts: [email protected]@[email protected]