frank rudzicz - university of torontofrank/download/communicating... · frank rudzicz scientist,...

SPOClab signal processing and oral communication

Computational Linguistics, 5 December 2012

Frank Rudzicz Scientist, Toronto Rehabilitation Institute

Assistant professor, Department of Computer Science University of Toronto


An introduction to SPOClab

• SPOClab (Signal Processing and Oral Communication) is a new lab intersecting Computer Science and the Toronto Rehabilitation Institute.

• Our purpose is to produce software that helps people with disabilities* communicate.

• Today’s talk is about how we will pursue that purpose.

(*) e.g., neuro-motor and cognitive disorders, psychological trauma.

Introduction 2


Introduction 3

Communicating with machines

with, prep., sense 2a: [indicates] a participant in an action … e.g., automatic speech recognition

with, prep., sense 6a: [indicates] means … or instrumentality … e.g., assistive technology

• Our work will generally involve two codependent themes.


Introduction 4

Dysarthria


Automatic speech recognition (ASR)

[Figure: ASR of the utterance “open the pod bay doors”, with a language model and an acoustic model]

Background 5
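As a rough illustration of how a language model and an acoustic model combine, a recognizer can pick the word sequence W that maximizes log P(O|W) + log P(W). The sketch below is a toy under stated assumptions: the candidate sentences, the scores, and the names `decode`, `lm_logprob`, and `am_logprob` are all invented for illustration and do not come from any real toolkit or from this talk.

```python
import math

def decode(candidates, lm_logprob, am_logprob, lm_weight=1.0):
    """Return the candidate W maximizing log P(O|W) + lm_weight * log P(W)."""
    return max(candidates, key=lambda w: am_logprob[w] + lm_weight * lm_logprob[w])

# Toy scores, invented for illustration only.
candidates = ["open the pod bay doors", "open the pot bay doors"]
lm_logprob = {candidates[0]: math.log(0.010), candidates[1]: math.log(0.001)}
am_logprob = {candidates[0]: math.log(0.30), candidates[1]: math.log(0.35)}

best = decode(candidates, lm_logprob, am_logprob)
acoustic_only = decode(candidates, lm_logprob, am_logprob, lm_weight=0.0)
```

With these made-up numbers the acoustic model alone slightly prefers the wrong sentence, and the language model corrects it. That is the kind of help that becomes less reliable when the acoustic evidence is as ambiguous as it is in dysarthric speech.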


Dysarthria

Neuro-motor articulatory difficulties resulting in unintelligible speech.

Can computers do better?

Background 6


[Figure: Dysarthria and ASR word accuracy. x-axis: number of Gaussians (2 to 16); y-axis: word recognition accuracy (%), 0 to 90; series: non-dysarthric, dysarthric]

Background 7


Acoustic ambiguity

[Figure panels: non-dysarthric, dysarthric]

Is this acoustic behaviour indicative of underlying articulatory behaviour?

Background 8


Articulatory knowledge

[Figure panels: /m/, /n/, /ng/]

Background 9


TORGO 2.0

• TORGO was built for ASR for speakers with cerebral palsy. What about:

i. Other neuro-motor deficits? Less-verbal patients?

ii. Alternatives to electromagnetic articulography (e.g., video)?

iii. A focus on articulatory gestures?

Data collection 10


TORGO online

Data collection 11

• In 2013 we hope to have a server capable of receiving Voice over IP (VoIP) calls.

• Can we trawl the web for data? Can we record automated dialogues with random callers (as in MIT’s Jupiter)?


Classifying all this data

• Conditional random fields

• Neural networks

• Support vector machines

• Dynamic Bayes nets

[Figure: graphical model with node rows q1 q2 q3, o1 o1 o1, and l1 l2 l3]

Articulatory speech recognition 12


Beyond discretized articulation

13


I. Dynamic speech gestures

Task dynamics represents speech as goal-based reconfigurations of the vocal tract: $Mz'' + Bz' + K(z - z_0)$

We wish to do classification in a low-dimensional and informative space that incorporates goal-based and long-term dynamics.

We require a theoretical framework to represent relevant and continuous articulatory motion.

[Figure: gestures for the word ‘pub’ plotted over time; tract variables shown: tongue body constriction degree, lip aperture, glottis]

Dynamic speech gestures 14
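To make the task-dynamic expression concrete, the sketch below integrates a single tract variable z toward its target z0. It assumes the usual homogeneous form M z'' + B z' + K (z - z0) = 0 with critical damping (B = 2·sqrt(M·K)). The function name `simulate_gesture`, the parameter values, and the arbitrary units are illustrative assumptions, not values from the talk.

```python
import math

def simulate_gesture(z_init, z0, M=1.0, K=100.0, B=None, dt=0.001, steps=2000):
    """Integrate M*z'' + B*z' + K*(z - z0) = 0 with semi-implicit Euler."""
    if B is None:
        B = 2.0 * math.sqrt(M * K)  # critical damping: approach the target without oscillating
    z, v = z_init, 0.0
    trajectory = [z]
    for _ in range(steps):
        a = -(B * v + K * (z - z0)) / M  # acceleration from damping and stiffness
        v += a * dt
        z += v * dt
        trajectory.append(z)
    return trajectory

# Hypothetical lip-aperture gesture: start open (10) and close toward a target of 0.
traj = simulate_gesture(z_init=10.0, z0=0.0)
```

The trajectory moves smoothly from its starting value to the target, which is the sense in which a gesture is a goal-based reconfiguration of the vocal tract rather than a discrete articulatory label.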


II. Acoustic-articulatory inversion

[Figure: mixture density network mapping input acoustics through a hidden layer to an output layer of mixture parameters 𝝎𝟎, 𝝁𝟎, 𝝈𝟎, …, 𝝈𝒏]

[Figure: intensity map of estimated tongue tip constriction over time]

Acoustic-articulatory inversion 15
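A mixture density network outputs the parameters of a probability density over the articulatory target instead of a single point estimate. The sketch below shows only that last step, turning raw outputs into a one-dimensional Gaussian mixture density. It assumes the slide's ω, μ, and σ denote mixture weights, means, and standard deviations. The function name `mdn_density` and the softmax/exponential parameterization are common choices assumed here, not details confirmed by the talk.

```python
import math

def mdn_density(y, raw_weights, means, log_sigmas):
    """Evaluate p(y | x) = sum_k w_k * N(y; mu_k, sigma_k^2) from raw network outputs."""
    # Softmax turns unconstrained outputs into mixture weights that sum to 1.
    m = max(raw_weights)
    exps = [math.exp(r - m) for r in raw_weights]
    total = sum(exps)
    weights = [e / total for e in exps]
    density = 0.0
    for w, mu, log_s in zip(weights, means, log_sigmas):
        sigma = math.exp(log_s)  # exponentiation keeps sigma positive
        norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
        density += w * norm * math.exp(-0.5 * ((y - mu) / sigma) ** 2)
    return density

# Hypothetical outputs for one acoustic frame: two components over tongue tip constriction.
p = mdn_density(0.2, raw_weights=[0.5, -0.5], means=[0.0, 1.0], log_sigmas=[-1.0, -0.5])
```

Evaluating this density on a grid of y values for each frame would give one column of an intensity map like the one on the slide.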


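III. The noisy channel

Is dysarthria a distortion of non-dysarthric speech? … or are they both distortions of a common abstraction?

• Direct distortion: non-dysarthric speech, $Y_c$, passes through the channel $P(Y_d \mid Y_c)$ to give dysarthric speech, $Y_d$.

• Common abstraction: abstract speech, $X$, gives dysarthric speech, $Y_d$, through $P(Y_d \mid X)$ and non-dysarthric speech, $Y_c$, through $P(Y_c \mid X)$.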

The noisy channel 16
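The two views can be related with a short worked equation. Assuming a discrete abstraction X, and assuming that dysarthric and non-dysarthric speech are conditionally independent given X (both are assumptions of this sketch, not claims from the talk), the direct channel is obtained by marginalizing over the abstraction:

```latex
P(Y_d \mid Y_c) = \sum_{X} P(Y_d \mid X)\, P(X \mid Y_c),
\qquad
P(X \mid Y_c) = \frac{P(Y_c \mid X)\, P(X)}{\sum_{X'} P(Y_c \mid X')\, P(X')}
```

Under these assumptions the direct-distortion model is a special case of the common-abstraction model, which is one reason the second view may be the more general starting point.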


Super-duper ASR

[Figure: abstract speech $X$ generates $Y_c$ through $P(Y_c \mid X)$ and $Y_d$ through $P(Y_d \mid X)$]

How might we combine a noisy channel model, acoustic-articulatory inversion and a dynamical model of speech production within a speech recognition system?

Super-duper ASR 17


Feedback and biological plausibility

A biologically plausible model 18

• In task dynamics, $Mz'' + Bz' + K(z - z_0)$ ignores or takes for granted:

1. Feedback, especially acoustic, proprioceptive, and tactile feedback.

2. Unit selection: words and syllable structures are known in advance.

3. Grammar and vocabulary.

4. Semantics.

• We want a more biologically plausible model of speech perception/production.

• Control-theoretic neural networks?

• Can we include representations of the brain and its pathologies?


Interpreting brain signals

Interpreting brain signals 19

• Hidden Markov models are sometimes used to classify electroencephalographic data.

• Can we improve accuracy with more advanced models?

• What features and sensor locations are most informative?

• How to remove artifacts from very noisy signals?

• How to elicit imagined words?
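As a baseline for the questions above, HMM classification usually means training one model per class and choosing the class whose model gives the observed sequence the highest likelihood. The sketch below uses the scaled forward algorithm on discrete symbols. It assumes the EEG features have already been quantized to symbol indices, and the two toy models and the names `forward_loglik` and `classify` are invented for illustration.

```python
import math

def forward_loglik(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under one HMM (scaled forward algorithm)."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by the emission probability.
            alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                     for s in range(n)]
        scale = sum(alpha)  # rescale at every step to avoid numerical underflow
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik

def classify(obs, models):
    """Pick the class whose HMM assigns the sequence the highest likelihood."""
    return max(models, key=lambda name: forward_loglik(obs, *models[name]))

# Toy models as (start, transition, emission); all numbers are made up.
models = {
    "class_a": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.2, 0.8]]),
    "class_b": ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]),
}
label_steady = classify([0, 0, 0, 0], models)
label_alternating = classify([0, 1, 0, 1], models)
```

In this toy setup a steady sequence is better explained by the "sticky" model and a rapidly alternating one by the uniform model. Real EEG would need continuous emission densities and careful artifact handling, which this sketch leaves out.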


Introduction 20

Dysarthria


Talking to machines

[Figure: example utterances (“Put this there.”, “My hands are in the air.”, “Buy ticket... AC490...”, “yes”) for three application areas: telephony, dictation, and multimodal interaction]

Talking to machines 21


Talking to humans

Talking to humans 22


1. Noise reduction

Spectral subtraction removes environmental signal noise.

[Figure panels: before, after]

Noise reduction 23
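The core of spectral subtraction is small enough to sketch. The version below works on magnitude spectra that are assumed to have been computed already (for example by a short-time Fourier transform, not shown), estimates the noise from noise-only frames, and subtracts it bin by bin. The function name `spectral_subtraction` and the flooring rule are illustrative assumptions and not necessarily the variant used in this work.

```python
def spectral_subtraction(frames, noise_frames, floor=0.01):
    """Subtract an average noise magnitude spectrum from each frame's magnitude spectrum."""
    n_bins = len(noise_frames[0])
    # Estimate the noise as the mean magnitude per frequency bin over noise-only frames.
    noise = [sum(f[k] for f in noise_frames) / len(noise_frames) for k in range(n_bins)]
    cleaned = []
    for frame in frames:
        # Flooring at a fraction of the original magnitude avoids negative magnitudes.
        cleaned.append([max(mag - noise[k], floor * mag) for k, mag in enumerate(frame)])
    return cleaned

# Toy magnitudes with three frequency bins, invented for illustration.
noise_frames = [[1.0, 1.0, 1.0], [1.0, 3.0, 1.0]]
cleaned = spectral_subtraction([[5.0, 2.5, 0.5]], noise_frames)
```

To resynthesize audio, the cleaned magnitudes would typically be recombined with the phase of the noisy signal before inverting the transform.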


2. De-voicing consonants

The “voice bar”

De-voicing consonants 24


3. ‘Splicing’: Deletions and insertions

• Deleted sounds are patched with synthetic equivalents.

• Inserted sounds (e.g., ‘stuttering’) are simply removed.

[Figure: examples using the words ‘feelin’ and ‘pronounced’]

‘Splicing’: Deletions and insertions 25


4. Tempo morphing

• Dysarthric speech tends to be a lot (often 3x) slower than typical speech.

• We squish sonorants in time to be closer to their expected length.

• A phase vocoder squishes (or stretches) the length of a signal without affecting its pitch or frequency characteristics.

Tempo morphing 26
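Before a phase vocoder can be applied, each sonorant needs a compression rate. The sketch below covers only that bookkeeping step: it compares each segment's observed duration with an expected duration and returns a per-segment rate. The phase vocoder itself is not shown, and the segment format, the `max_rate` cap, and the durations are hypothetical.

```python
def sonorant_rates(segments, expected_durations, max_rate=3.0):
    """For each (phone, duration_s, is_sonorant) segment, return a time-compression rate."""
    rates = []
    for phone, duration, is_sonorant in segments:
        expected = expected_durations.get(phone)
        if not is_sonorant or expected is None or duration <= expected:
            rates.append(1.0)  # leave non-sonorants and already-short segments unchanged
        else:
            rates.append(min(duration / expected, max_rate))
    return rates

# Hypothetical segmentation of 'feel' with made-up durations in seconds.
segments = [("f", 0.10, False), ("iy", 0.30, True), ("l", 0.08, True)]
rates = sonorant_rates(segments, {"iy": 0.15, "l": 0.08})
```

A rate of 2.0 means the segment should be played back in half its original time. A pitch-preserving time-scale modification would then apply each rate to its segment.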


5. Formant ambiguity

Can we separate the vowels so that they are more mutually distinct?

[Figure panels: non-dysarthric, dysarthric]

Formant ambiguity 27


5. Formant morphing

[Figure panels: before, after]

Formant morphing 28
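One simple way to picture "more mutually distinct" vowels is to push each vowel's (F1, F2) target away from the centroid of the vowel space. The sketch below does only that geometric step. It is an illustrative assumption and not the morphing method used in this work, and the function name `expand_vowel_space` and the formant values in Hz are invented.

```python
def expand_vowel_space(formants, factor=1.5):
    """Scale each vowel's (F1, F2) point away from the centroid of all vowels."""
    n = len(formants)
    c1 = sum(f1 for f1, _ in formants.values()) / n
    c2 = sum(f2 for _, f2 in formants.values()) / n
    return {vowel: (c1 + factor * (f1 - c1), c2 + factor * (f2 - c2))
            for vowel, (f1, f2) in formants.items()}

# Made-up formant targets (Hz) for two vowels that sit too close together.
targets = expand_vowel_space({"iy": (300.0, 2300.0), "aa": (700.0, 1100.0)}, factor=2.0)
```

A factor above 1 increases the distance between every pair of vowels by the same ratio. A practical system would also need to keep the new targets within a physiologically plausible range.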


Multimodal interaction (MMI)

• Can a touch screen augment speech transformation?

• e.g., mixing a database of canned phrases with natural speech.

• How would word/phrase prediction and correction work in this context?

• How can modern virtual keyboards be modified to help people with physical disabilities? With cognitive disabilities?

Multimodal interaction 29
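As a starting point for the word/phrase prediction question, the sketch below predicts likely next words from bigram counts over a small database of canned phrases. The phrases and the function names `build_bigrams` and `predict_next` are hypothetical, and a real system would need smoothing and personalization.

```python
from collections import Counter, defaultdict

def build_bigrams(phrases):
    """Count which word follows which across a list of canned phrases."""
    following = defaultdict(Counter)
    for phrase in phrases:
        words = phrase.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following

def predict_next(following, prev_word, k=3):
    """Return up to k most frequent continuations of prev_word."""
    return [word for word, _ in following[prev_word.lower()].most_common(k)]

# Hypothetical canned phrases.
bigrams = build_bigrams(["I want water", "I want tea", "I want water please", "I need help"])
suggestions = predict_next(bigrams, "want")
```

On a touch screen, the suggestions could be offered as buttons, so that a recognized or typed word narrows the choice to a few taps.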


• Integrating concurrent streams of communication can, e.g.:

• Enable more natural and efficient expression, and

• Reduce ambiguity in any one of those streams.

Put this there.
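The ambiguity-reduction point can be sketched with a product rule. If speech alone cannot resolve what "this" refers to, a touch or pointing stream can. The sketch assumes the two streams are conditionally independent given the referent and that the prior over referents is uniform. The function name `fuse` and all probabilities are invented for illustration.

```python
def fuse(speech_probs, touch_probs):
    """Combine two streams' probabilities over the same referents and renormalize."""
    joint = {k: speech_probs[k] * touch_probs.get(k, 0.0) for k in speech_probs}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("the two streams share no plausible referent")
    return {k: v / total for k, v in joint.items()}

# Speech cannot tell which object "this" is; the touch stream favours the cup.
fused = fuse({"cup": 0.5, "book": 0.5}, {"cup": 0.9, "book": 0.1})
```

Here the fused estimate simply inherits the confidence of the touch stream. When both streams are partly informative, the product sharpens the estimate beyond either one alone.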


• ‘Ambient intelligence’: speech interfaces in the environment.

• Emergency scenarios (e.g., reacting to falls)

• e.g., HomeLab: “do you want me to call for help?”

• Can be used to guide an individual through daily tasks.

• e.g., HomeLab: “don’t forget to turn the faucet off!”

• Crucially, this involves detecting and correcting breakdowns in communication.


• We will need ASR for individuals with dementia.

• Can we specialize ASR models for cognitive deficits?

• Each of the vocabulary, language model, and grammar may differ from those for the general public.

• What are their effects on ASR performance?

• How to limit or adjust these dynamically?
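One standard way to adjust a language model toward a particular speaker or population is linear interpolation between a general model and a specialized one. The sketch below does this for unigram distributions only. The function name `interpolate`, the weight `lam`, and the toy vocabularies are assumptions for illustration, not the approach proposed in the talk.

```python
def interpolate(p_general, p_user, lam=0.5):
    """Linear interpolation of two unigram distributions: lam*p_user + (1-lam)*p_general."""
    vocab = set(p_general) | set(p_user)
    return {w: lam * p_user.get(w, 0.0) + (1.0 - lam) * p_general.get(w, 0.0)
            for w in vocab}

# Made-up distributions: the specialized model concentrates on household vocabulary.
p_general = {"the": 0.5, "faucet": 0.1, "stocks": 0.4}
p_user = {"the": 0.5, "faucet": 0.5}
mixed = interpolate(p_general, p_user, lam=0.5)
```

Changing `lam` over time is one simple answer to the "adjust dynamically" question: a larger value trusts the specialized model more as evidence about the individual accumulates.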


Interfaces for automated dialogues

• Simple dialogue with a mobile robot is now being tested in HomeLab.

• Are alternative modes appropriate?

• e.g., could a digital assistant be useful on tablets or on the TV?

• How do we measure success beyond completion of daily tasks?

Interfaces for automated dialogues 33


SPOClab will build software to help people with disabilities communicate. This is a deliberately broad goal.

We will build advanced models of speech production/perception. These will be used within augmented speech recognition. We will build brain-machine interfaces that model speech production and perception as abstract dynamical systems.

We will build systems that help to make people more intelligible to others. We will support aging in place by helping individuals with cognitive disorders be more capable and more independent.
