DEEP LEARNING IN BUSINESS CONVERSATION ANALYSIS
ANTHONY SCODARY, GRIDSPACE
WONKYUM LEE, GRIDSPACE
INTRO
“Which translation speech recognition so and so forth I mean there's a whole bunch of amazing applications that are made possible by deep learning and so internet service providers are using it for internal application development.
And then lastly what you mentioned as cloud service providers and basically because of the adoption of gp use and because of the success of kuta and so many applications are now able to be accelerate on gp use so that we can extend the capabilities of moore's law so that we can continue.
You'd have the benefits of of computing acceleration, which which in the cloud means reducing cost.
And that's on the serve cloud service provider side of of the Internet company so that would be amazon web services as the Google compute cloud.”
OVERVIEW
1. Business Conversations
2. Recognition
3. Analysis
1. Business Conversations
PROTOCOLS
SIGNAL PROCESSING
PROTOCOLS
- Symbol Set (Lexicon)
- Rules (Syntax)
- Meaning (Semantics)
SINK
TYPES OF PROTOCOLS
SOURCE MEDIUM
TYPES OF PROTOCOLS: ENDPOINTS
FROM \ TO   NATURE          MACHINE       HUMAN
NATURE      BIRDCALL        SEISMOGRAPH   GROWLING
MACHINE     ELECTRIC FENCE  TCP           FIRE ALARM
HUMAN       “SIT”           SIRI          SPEECH
TYPES OF PROTOCOLS: H2H MEDIA
BANDWIDTH
INFORMATION DENSITY
SMS
VOICEMAIL
CHAT
MISSED CALL
POSTCARD
WAVING
SPEECH
WHY DO WE STILL TALK?
- Fast
- Innate
- Layered
- Synchronous
- Dense in meaning
ORGANIZATIONS
INTERNAL COMMUNICATION
- Calls, Meetings, Hallway Chats
- Documents, Email, Chat, SMS
EXTERNAL COMMUNICATION
- Support Calls, In-Person Sales
- Chat Support, Social Media, Email
Mostly lost today
THIS DATA MATTERS
2. Recognition
REAL-TIME CALL ANALYSIS
ASR
DSP
SCANNER
CLASSIFIER
Feature Extraction (MFCC)
Acoustic Model (GMM)
Lexicon
Language Model
“hello”
Conventional ASR: a combination of blocks, each designed by a different field of expertise
GMM-HMM: 1980-2010
ASR
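How these blocks combine can be written as the standard textbook decoding rule for statistical ASR. The slides do not state this equation, so the notation is an assumption: $X$ is the sequence of feature vectors (e.g. MFCCs) and $W$ is a candidate word sequence.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X) \;=\; \arg\max_{W} \; p(X \mid W)\, P(W)
```

Under this reading, the acoustic model and lexicon together score $p(X \mid W)$, the language model supplies $P(W)$, and the decoder searches for the best-scoring $W$.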
Lots of tuning to improve accuracy
Robust features, speaker adaptation, application-specific LM
ASR
Feature Extraction (MFCC)
Acoustic Model
Lexicon
Language Model
“hello”
Replacing acoustic model with deep neural net
DNN-HMM: 30%-40% improvement (2011-2017)
ASR
All-in-one Deep Learning Model
“hello”
Someday in the near future: replacing the whole pipeline with one neural net
End-to-End ASR: active research in progress
ASR
[Figure: ASR error rate over the decades (in academia); y-axis: WER (log scale)]
- Simple linear model (GMM)
- Advanced linear model (GMM-SAT-DT)
- Deep learning model
- End-to-end deep learning (under development)
- “Human parity”
ASR HISTORY
“However, it’s still NOT easy on real-world business conversational speech”
Language Challenge
- Domain-specific terminology (company name, product name, …)
- Spontaneous speech (natural conversation)
- Accent, dialect, mispronunciation
Acoustic Challenge
- Noise (background, channel)
- Acoustic effects (reverberation, Lombard effect)
- Variability across speakers
- Microphone placement (near/far field)
ASR CHALLENGES
Data is King!
- General conversational data + in-domain data (training with in-domain data improves accuracy by 15-30%)
- Simulated data with a variety of noise helps! (improves accuracy by 10-15%)
- Data collection with semi-supervised training helps
LARGE-SCALE DATA PROCESSING
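The simulated-data point can be illustrated by mixing a noise recording into clean speech at a chosen signal-to-noise ratio. This is a minimal sketch in plain Python, not the pipeline used by the authors; the function names (`mix_at_snr`, `measured_snr_db`), the synthetic sine-wave "speech", the Gaussian noise, and the 10 dB target are all assumptions made for illustration.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so the mix has the requested SNR in dB.

    The noise is looped (or truncated) to match the length of the speech.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        return list(speech)  # silent noise: nothing to add
    # SNR_dB = 20 * log10(rms_speech / rms_noise), solved for the noise RMS.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20.0))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """SNR of a mix, recovering the added noise as (mixed - speech)."""
    residual = [m - s for m, s in zip(mixed, speech)]
    return 20.0 * math.log10(rms(speech) / rms(residual))


# Hypothetical inputs: one second of a 220 Hz tone at 8 kHz, plus white noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 8000) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(3000)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Sweeping `snr_db` over a range of values and drawing from several noise sources is one simple way to get "a variety of noise" out of a fixed set of clean recordings.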
Multi-GPU Training
- 4x Titan X with parallel training
- One week for full training on 25k hours of audio
- 80x faster than a 32-core CPU machine
LARGE-SCALE DATA PROCESSING
Real-time adaptive processing
- Online i-vector adaptation (5-10% improvement)
  - Speaker characteristics
  - Environmental noise
  - Accent & dialect
- Context-based grammar adaptation (recognizes domain-specific terms)
REAL-TIME ADAPTIVE PROCESSING
State-of-the-art deep learning model
- Time-delay neural network (TDNN)
- Computation optimization (subsampling, bi-phone, etc.)
- WFST framework for search
“Purely sequence-trained neural networks for ASR based on lattice-free MMI”, Interspeech 2016
WER: 5~6% (Capital Market Model), 12~15% (Customer Intelligence Model)
Real-time factor: 0.3-0.35
STATE OF THE ART DEEP LEARNING MODEL
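For readers unfamiliar with the two metrics quoted above, here is a small sketch of how they are commonly computed. Word error rate is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words; real-time factor is processing time divided by audio duration, so values below 1 mean faster than real time. The function names and the example sentences are made up for illustration and are not taken from the talk.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missed)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


# Hypothetical example: one substitution and one insertion over 7 reference words.
wer = word_error_rate(
    "please hold while i transfer your call",
    "please hold will i transfer the your call",
)  # 2 errors / 7 reference words

# Hypothetical timing: 18 s of compute for a 60 s recording.
rtf = real_time_factor(18.0, 60.0)
```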
3. Analysis
IS TRANSCRIPTION REALLY WHAT YOU WANT ANYWAY?
STUFF WITH ACTUAL USE TO COMPANIES
- Prediction
- Classification
- Summarization
- Entity Extraction
- Anomaly Detection
“ARTIFICIAL INTELLIGENCE”
ARITHMETIC
GRAPH SEARCH
CHESS
IMAGE RECOGNITION
CONVERSATION
EMOTION
CONSCIOUSNESS
ABOVE THIS LINE THIS SURELY IS “REAL” INTELLIGENCE
“ARTIFICIAL INTELLIGENCE”
TECHNOLOGY REVOLUTION
WASTE OF MONEY AND TIME
“ARTIFICIAL INTELLIGENCE”
We treat industry needs as an engineering task.
ANALYSIS
1. Speech is complex.
Let models decide what features matter for a task or application.
ANALYSIS
2. Speech is high dimensional.
Datasets must be large enough to train large models to match.
ANALYSIS
3. Conversational speech is noisy.
Large, well-augmented datasets are necessary to be robust.
ANALYSIS
ANALYSIS
aardvark
zebra
One-hot (D dimensions)
ℝ³⁰⁰
ℝ⁴⁰
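The step from a D-dimensional one-hot vector to a dense vector can be sketched as a table lookup. The example below assumes the 300 on the slide is a word-embedding dimension; the tiny vocabulary, the random (untrained) weights, and the helper names are hypothetical. It shows that multiplying a one-hot vector by the embedding matrix gives the same result as reading out one row.

```python
import random


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def vec_times_matrix(vector, matrix):
    """Row vector times matrix, written out in plain Python."""
    cols = len(matrix[0])
    return [sum(v * row[c] for v, row in zip(vector, matrix)) for c in range(cols)]


# Hypothetical toy vocabulary; a real vocabulary has D in the tens of thousands.
vocab = ["aardvark", "king", "queen", "zebra"]
word_to_id = {word: i for i, word in enumerate(vocab)}
D = len(vocab)
EMBED_DIM = 300  # assumed from the slide's R^300

# Untrained random weights, one row of EMBED_DIM values per vocabulary word.
rng = random.Random(0)
embedding = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)] for _ in range(D)]


def embed(word):
    """Dense vector for a word: a single row lookup."""
    return embedding[word_to_id[word]]


via_one_hot = vec_times_matrix(one_hot(word_to_id["zebra"], D), embedding)
via_lookup = embed("zebra")
```

In practice the rows are learned during training, so that words used in similar contexts end up with nearby vectors.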
ANALYSIS
KING
QUEEN
BROTHER
SISTER
MAN
WOMAN
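The word pairs above are the classic illustration of vector arithmetic on word embeddings: the offset from MAN to WOMAN is roughly the same as the offset from KING to QUEEN. The sketch below uses hand-picked 3-dimensional toy vectors, which are an assumption made purely so the arithmetic works out exactly; learned embeddings only approximate this behavior.

```python
import math

# Hand-made toy vectors (not learned): axes loosely mean gender, royalty, sibling.
toy_vectors = {
    "man":     [1.0, 0.0, 0.0],
    "woman":   [-1.0, 0.0, 0.0],
    "king":    [1.0, 1.0, 0.0],
    "queen":   [-1.0, 1.0, 0.0],
    "brother": [1.0, 0.0, 1.0],
    "sister":  [-1.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' using vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, as is conventional for this test.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


royal = analogy("man", "woman", "king", toy_vectors)
sibling = analogy("man", "woman", "brother", toy_vectors)
```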
ANALYSIS
i have no political party actually
~~~‘democrat’
ANALYSIS
API
gridspace.com
QUESTIONS?