SPOClab signal processing and oral communication
Computational Linguistics, 5 December 2012
Frank Rudzicz
Scientist, Toronto Rehabilitation Institute
Assistant professor, Department of Computer Science, University of Toronto
An introduction to SPOClab
• SPOClab (Signal Processing and Oral Communication) is a new lab at the intersection of Computer Science and the Toronto Rehabilitation Institute.
• Our purpose is to produce software that helps people with disabilities* communicate.
• Today’s talk is about how we will pursue that purpose.
(*) e.g., neuro-motor and cognitive disorders, psychological trauma.
Communicating with machines
with prep. 2a [indicates] a participant in an action … e.g., Automatic speech recognition
with prep. 6a [indicates] means … or instrumentality … e.g., assistive technology
• Our work will generally involve two codependent themes.
Dysarthria
Automatic speech recognition (ASR)
[Figure: the utterance “open the pod bay doors”, with a language model and an acoustic model.]
Dysarthria
Neuro-motor articulatory difficulties resulting in unintelligible speech.
Can computers do better?
[Figure: Dysarthria and ASR word accuracy. Word recognition accuracy (%) versus number of Gaussians (2 to 16), for non-dysarthric and dysarthric speech.]
Acoustic ambiguity
[Figure panels: Non-dysarthric, Dysarthric]
Is this acoustic behaviour indicative of underlying articulatory behaviour?
Articulatory knowledge
/m/ /n/ /ng/
TORGO 2.0
• TORGO was built for ASR for speakers with cerebral palsy. What about:
  i. Other neuro-motor deficits? Less-verbal patients?
  ii. Alternatives to electromagnetic articulography (e.g., video)?
  iii. A focus on articulatory gestures?
TORGO online
• In 2013 we hope to have a server capable of receiving Voice over IP (VoIP) calls.
• Can we trawl the web for data? Can we record automated dialogues with random callers (as in MIT’s Jupiter)?
Classifying all this data
[Figure: candidate models: conditional random fields, neural networks, support vector machines, and dynamic Bayes nets.]
Beyond discretized articulation
I. Dynamic speech gestures
We require a theoretical framework to represent relevant and continuous articulatory motion.
Task dynamics represents speech as goal-based reconfigurations of the vocal tract: 𝑀𝑧′′ + 𝐵𝑧′ + 𝐾(𝑧 − 𝑧0)
[Figure: tract variables over time for ‘pub’: tongue body constriction degree, glottis, lip aperture.]
We wish to do classification in a low-dimensional and informative space that incorporates goal-based and long-term dynamics.
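As a rough illustration of the task-dynamic expression on this slide, the sketch below integrates a single tract variable z as a damped mass-spring system moving toward a target z0. It assumes the homogeneous form 𝑀𝑧′′ + 𝐵𝑧′ + 𝐾(𝑧 − 𝑧0) = 0 with critical damping; the function name, the parameter values, and the lip-aperture reading of z are illustrative assumptions, not taken from the talk.

```python
import math

def simulate_gesture(z_init, z_target, mass=1.0, stiffness=100.0,
                     duration=1.0, dt=0.001):
    """Integrate M z'' + B z' + K (z - z0) = 0 with semi-implicit Euler.

    Damping is set to the critical value B = 2 * sqrt(M * K), so the
    tract variable approaches its target without oscillating.
    All parameter values are hypothetical.
    """
    damping = 2.0 * math.sqrt(mass * stiffness)
    z, velocity = z_init, 0.0
    trajectory = [z]
    for _ in range(int(round(duration / dt))):
        accel = -(damping * velocity + stiffness * (z - z_target)) / mass
        velocity += accel * dt   # update velocity first (semi-implicit)
        z += velocity * dt       # then position, using the new velocity
        trajectory.append(z)
    return trajectory

# Hypothetical lip-aperture gesture: close the lips from 10 mm to 0 mm.
lip_aperture = simulate_gesture(z_init=10.0, z_target=0.0)
```

Each gesture is then described by a handful of parameters (target, stiffness, damping) rather than by a frame-by-frame trajectory, which is one way to read "low-dimensional" here.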
II. Acoustic-articulatory inversion
[Figure: mixture density network. Input acoustics feed a hidden layer; the output layer gives mixture parameters 𝝎𝟎, 𝝁𝟎, 𝝈𝟎, ..., 𝝈𝒏.]
[Figure: intensity map of estimated tongue tip constriction over time.]
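A minimal sketch of the forward pass of a mixture density network, assuming a single tanh hidden layer and a one-dimensional articulatory target. The layer sizes, the random weights, and the names `mdn_forward` and `mixture_density` are hypothetical; a real system would train the weights on paired acoustic and articulatory data.

```python
import numpy as np

def mdn_forward(x, params, n_components):
    """Map one acoustic feature vector to Gaussian mixture parameters."""
    hidden = np.tanh(params["W1"] @ x + params["b1"])
    out = params["W2"] @ hidden + params["b2"]
    k = n_components
    logits, means, log_sigmas = out[:k], out[k:2 * k], out[2 * k:3 * k]
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()          # softmax: mixing weights sum to 1
    sigmas = np.exp(log_sigmas)       # exponential keeps sigmas positive
    return weights, means, sigmas

def mixture_density(y, weights, means, sigmas):
    """Evaluate p(y | x) = sum_i w_i * N(y; mu_i, sigma_i^2) on a grid y."""
    y = np.asarray(y, dtype=float)[:, None]
    norm = 1.0 / (np.sqrt(2.0 * np.pi) * sigmas)
    components = norm * np.exp(-0.5 * ((y - means) / sigmas) ** 2)
    return components @ weights

rng = np.random.default_rng(0)
n_in, n_hidden, n_components = 13, 8, 3    # hypothetical sizes
params = {
    "W1": rng.normal(scale=0.5, size=(n_hidden, n_in)),
    "b1": np.zeros(n_hidden),
    "W2": rng.normal(scale=0.5, size=(3 * n_components, n_hidden)),
    "b2": np.zeros(3 * n_components),
}
frame = rng.normal(size=n_in)              # stand-in for one acoustic frame
weights, means, sigmas = mdn_forward(frame, params, n_components)
grid = np.linspace(-10.0, 10.0, 2001)      # candidate articulatory positions
density = mixture_density(grid, weights, means, sigmas)
```

Stacking one such density per acoustic frame, side by side, would give an intensity map over time of the kind shown on the slide.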
III. The noisy channel
Is dysarthria a distortion of non-dysarthric speech? … or are they both distortions of a common abstraction?
[Figure, first model: non-dysarthric speech, 𝑌𝑐 → 𝑃(𝑌𝑑|𝑌𝑐) → dysarthric speech, 𝑌𝑑.]
[Figure, second model: abstract speech, 𝑋 → 𝑃(𝑌𝑑|𝑋) → dysarthric speech, 𝑌𝑑; abstract speech, 𝑋 → 𝑃(𝑌𝑐|𝑋) → non-dysarthric speech, 𝑌𝑐.]
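To make the second model concrete, here is a toy sketch of noisy-channel decoding over three phonemes: given an observed dysarthric unit 𝑌𝑑, it recovers the most probable abstract unit 𝑋 by Bayes' rule, 𝑃(𝑋|𝑌𝑑) ∝ 𝑃(𝑋) 𝑃(𝑌𝑑|𝑋). The prior, the channel probabilities, and the `decode` helper are all invented for illustration.

```python
# Hypothetical prior over abstract units X (here, three nasal phonemes).
prior = {"m": 0.5, "n": 0.3, "ng": 0.2}

# Hypothetical channel P(Y_d | X): how each intended phoneme tends to
# surface in dysarthric speech. Each row sums to 1.
channel = {
    "m":  {"m": 0.5, "n": 0.4, "ng": 0.1},
    "n":  {"m": 0.3, "n": 0.5, "ng": 0.2},
    "ng": {"m": 0.1, "n": 0.3, "ng": 0.6},
}

def decode(observed):
    """Return the most probable X and the full posterior P(X | Y_d)."""
    joint = {x: prior[x] * channel[x][observed] for x in prior}
    evidence = sum(joint.values())              # P(Y_d = observed)
    posterior = {x: p / evidence for x, p in joint.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior

best, posterior = decode("n")
```

With these made-up numbers, an observed /n/ decodes to an intended /m/: the prior on /m/ and its tendency to surface as /n/ together outweigh the direct /n/ reading.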
Super-duper ASR
[Figure: abstract speech 𝑋 produces 𝑌𝑐 via 𝑃(𝑌𝑐|𝑋) and 𝑌𝑑 via 𝑃(𝑌𝑑|𝑋).]
How might we combine a noisy channel model, acoustic-articulatory inversion and a dynamical model of speech production within a speech recognition system?
Feedback and biological plausibility
• In task dynamics, 𝑀𝑧′′ + 𝐵𝑧′ + 𝐾(𝑧 − 𝑧0) ignores or takes for granted:
  1. Feedback, especially acoustic, proprioceptive, and tactile feedback.
  2. Unit selection – words and syllable structures are known in advance.
  3. Grammar and vocabulary.
  4. Semantics.
• We want a more biologically plausible model of speech perception/production.
• Control-theoretic neural networks?
• Can we include representations of the brain and its pathologies?
Interpreting brain signals
• Hidden Markov models are sometimes used to classify electroencephalographic data.
• Can we improve accuracy with more advanced models?
• What features and sensor locations are most informative?
• How to remove artifacts from very noisy signals?
• How to elicit imagined words?
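As a baseline for the questions above, HMM-based classification can be sketched as follows: train one HMM per class, score a new sequence under each with the forward algorithm, and pick the class with the highest likelihood. The sketch assumes EEG features have already been quantized into three discrete symbols; the two class models, their parameters, and the names `log_likelihood` and `classify` are hypothetical.

```python
import numpy as np

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | HMM) for discrete symbols."""
    alpha = start * emit[:, obs[0]]
    scale = alpha.sum()
    log_prob = np.log(scale)
    alpha = alpha / scale
    for symbol in obs[1:]:
        # Propagate through the transitions, then weight by the emission.
        alpha = (alpha @ trans) * emit[:, symbol]
        scale = alpha.sum()
        log_prob += np.log(scale)
        alpha = alpha / scale       # rescale to avoid numerical underflow
    return log_prob

# Two hypothetical 2-state models over 3 quantized EEG feature symbols.
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])      # trans[i, j] = P(next state j | state i)
models = {
    "rest":    np.array([[0.7, 0.2, 0.1],
                         [0.5, 0.3, 0.2]]),
    "imagery": np.array([[0.1, 0.2, 0.7],
                         [0.2, 0.3, 0.5]]),
}

def classify(obs):
    """Pick the class whose HMM assigns the sequence the highest likelihood."""
    scores = {name: log_likelihood(obs, start, trans, emit)
              for name, emit in models.items()}
    return max(scores, key=scores.get), scores

label, scores = classify([2, 2, 1, 2, 0, 2])
```

Real EEG features are continuous and noisy, so a practical system would more likely use Gaussian or mixture emissions; the discrete case is used here only to keep the example short.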
Dysarthria
Talking to machines
• Dictation: “My hands are in the air.”
• Telephony: “Buy ticket... AC490...” “yes”
• Multimodal interaction: “Put this there.”
Talking to humans
1. Noise reduction
Spectral subtraction removes environmental signal noise.
[Figure panels: Before, After]
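A minimal sketch of spectral subtraction, assuming the first few frames of the recording contain only noise. The function name, frame length, number of noise frames, and spectral floor are illustrative choices, not values from the talk.

```python
import numpy as np

def spectral_subtraction(signal, frame_len=256, noise_frames=10, floor=0.02):
    """Subtract an estimated noise magnitude spectrum from each frame."""
    hop = frame_len // 2
    # Periodic Hann window: at 50% overlap the shifted windows sum to 1.
    window = np.hanning(frame_len + 1)[:-1]
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)
    magnitude, phase = np.abs(spectra), np.angle(spectra)

    # Assumption: the first `noise_frames` frames contain noise only.
    noise_mag = magnitude[:noise_frames].mean(axis=0)

    # Subtract the noise estimate; a small spectral floor keeps the
    # magnitudes from going negative.
    cleaned = np.maximum(magnitude - noise_mag, floor * magnitude)

    # Resynthesize with the noisy phase and overlap-add the frames.
    out_frames = np.fft.irfft(cleaned * np.exp(1j * phase),
                              n=frame_len, axis=1)
    output = np.zeros(len(signal))
    for i, frame in enumerate(out_frames):
        output[i * hop:i * hop + frame_len] += frame
    return output

# Synthetic example: half a second of silence, then a 440 Hz tone,
# all corrupted by white noise.
rng = np.random.default_rng(0)
sr = 8000
t = np.arange(2 * sr) / sr
tone = np.where(t >= 0.5, np.sin(2 * np.pi * 440 * t), 0.0)
noisy = tone + 0.1 * rng.normal(size=t.size)
denoised = spectral_subtraction(noisy)
```

The method only removes noise that is roughly stationary, since the noise spectrum is estimated once and reused for every frame.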
2. De-voicing consonants
The “voice bar”
3. ‘Splicing’: Deletions and insertions
• Deleted sounds are patched with synthetic equivalents.
• Inserted sounds (e.g., ‘stuttering’) are simply removed.
[Figure: examples for ‘feelin’ and ‘pronounced’.]
4. Tempo morphing
• Dysarthric speech tends to be much slower (often 3x) than typical speech.
• We squish sonorants in time to be closer to their expected length.
• A phase vocoder squishes (or stretches) the length of a signal without affecting its pitch or frequency characteristics.
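The idea behind the phase vocoder can be sketched as below. This is a textbook version written for illustration, not the lab's implementation: it assumes a Hann window at 75% overlap, and the function names, FFT size, and test tone are hypothetical choices.

```python
import numpy as np

def stft(signal, n_fft=512, hop=128):
    """Short-time Fourier transform; returns an array of (frames, bins)."""
    window = np.hanning(n_fft + 1)[:-1]          # periodic Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spectra, n_fft=512, hop=128):
    """Windowed overlap-add resynthesis."""
    window = np.hanning(n_fft + 1)[:-1]
    output = np.zeros(n_fft + hop * (len(spectra) - 1))
    for i, frame in enumerate(np.fft.irfft(spectra, n=n_fft, axis=1)):
        output[i * hop:i * hop + n_fft] += frame * window
    # For a Hann window at 75% overlap the summed squared window is
    # constant away from the edges, so one scale factor suffices.
    return output / ((window ** 2).sum() / hop)

def time_stretch(signal, rate, n_fft=512, hop=128):
    """rate > 1 shortens the signal, rate < 1 lengthens it; pitch is kept."""
    spectra = stft(signal, n_fft, hop)
    n_frames, n_bins = spectra.shape
    steps = np.arange(0, n_frames - 1, rate)     # fractional frame positions
    expected = 2.0 * np.pi * hop * np.arange(n_bins) / n_fft
    phase = np.angle(spectra[0])
    stretched = np.empty((len(steps), n_bins), dtype=complex)
    for out_idx, step in enumerate(steps):
        left = int(step)
        frac = step - left
        a, b = spectra[left], spectra[left + 1]
        # Interpolate magnitudes between the two neighbouring frames.
        magnitude = (1.0 - frac) * np.abs(a) + frac * np.abs(b)
        stretched[out_idx] = magnitude * np.exp(1j * phase)
        # Deviation of the measured phase advance from the expected one,
        # wrapped to [-pi, pi], gives each bin's true frequency.
        delta = np.angle(b) - np.angle(a) - expected
        delta -= 2.0 * np.pi * np.round(delta / (2.0 * np.pi))
        phase = phase + expected + delta
    return istft(stretched, n_fft, hop)

sr = 8000
t = np.arange(sr) / sr
slow_sonorant = np.sin(2 * np.pi * 440 * t)      # stand-in for a sonorant
faster = time_stretch(slow_sonorant, rate=2.0)   # roughly half the duration
```

Because the phases are advanced at each bin's own frequency rather than simply resampled, the output is shorter but its spectral peak stays at the original pitch.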
5. Formant ambiguity
Can we separate the vowels so that they are more mutually distinct?
[Figure panels: Non-dysarthric, Dysarthric]
5. Formant morphing
[Figure panels: Before, After]
Multimodal interaction (MMI)
• Can a touch screen augment speech transformation?
  • e.g., mixing a database of canned phrases with natural speech.
• How would word/phrase prediction and correction work in this context?
• How can modern virtual keyboards be modified to help people with physical disabilities? With cognitive disabilities?
• Integrating concurrent streams of communication can, e.g.:
  • Enable more natural and efficient expression, and
  • Reduce ambiguity in any one of those streams.
Put this there.
• ‘Ambient intelligence’ – speech interfaces in the environment.
• Emergency scenarios (e.g., reacting to falls)
  • e.g., HomeLab: “do you want me to call for help?”
• Can be used to guide an individual through daily tasks.
  • e.g., HomeLab: “don’t forget to turn the faucet off!”
• Crucially, this involves detecting and correcting breakdowns in communication.
• We will need ASR for individuals with dementia.
• Can we specialize ASR models for cognitive deficits?
• Each of the vocabulary, language model, and grammar may differ from those for the general public.
• What are their effects on ASR performance?
• How to limit or adjust these dynamically?
Interfaces for automated dialogues
• Simple dialogue with a mobile robot is now being tested in HomeLab.
• Are alternative modes appropriate?
  • e.g., could a digital assistant be useful on tablets or on the TV?
• How do we measure success beyond completion of daily tasks?
SPOClab will build software to help people with disabilities communicate. This is a deliberately broad goal.
We will build advanced models of speech production/perception. These will be used within augmented speech recognition. We will build brain-machine interfaces that model speech production and perception as abstract dynamical systems.
We will build systems that help to make people more intelligible to others. We will support aging in place by helping individuals with cognitive disorders be more capable and more independent.