SPOClab signal processing and oral communication

Computational Linguistics, 5 December 2012

Frank Rudzicz Scientist, Toronto Rehabilitation Institute

Assistant professor, Department of Computer Science University of Toronto


An introduction to SPOClab

• SPOClab (Signal Processing and Oral Communication) is a new lab at the intersection of Computer Science and the Toronto Rehabilitation Institute.

• Our purpose is to produce software that helps people with disabilities* communicate.

• Today’s talk is about how we will pursue that purpose.

(*) e.g., neuro-motor and cognitive disorders, psychological trauma.

Introduction 2


Introduction 3

Communicating with machines

• with, prep. 2a: [indicates] a participant in an action … e.g., automatic speech recognition.
• with, prep. 6a: [indicates] means … or instrumentality … e.g., assistive technology.

• Our work will generally involve two codependent themes.


Introduction 4

Dysarthria


Automatic speech recognition (ASR)

[Diagram: language model and acoustic model; example utterance “open the pod bay doors”.]

Background 5
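The two components are usually combined in the noisy-channel formulation: choose the word sequence W that maximizes P(O | W) P(W) for the acoustics O. The sketch below is a minimal illustration, assuming a short list of candidate transcriptions and caller-supplied log-probability functions; the `lm_weight` parameter is a common practical addition, not something the slide specifies, and the scores in the example are invented.

```python
import math

def decode(hypotheses, acoustic_logprob, lm_logprob, lm_weight=1.0):
    """Pick the hypothesis W maximizing log P(O | W) + lm_weight * log P(W).

    acoustic_logprob(W) stands in for the acoustic model, log P(O | W), and
    lm_logprob(W) for the language model, log P(W). Both are supplied by the
    caller; real recognizers search a lattice rather than a short list."""
    best, best_score = None, -math.inf
    for words in hypotheses:
        score = acoustic_logprob(words) + lm_weight * lm_logprob(words)
        if score > best_score:
            best, best_score = words, score
    return best, best_score

# Toy scores, invented for illustration: the acoustics slightly prefer "pot",
# but the language model strongly prefers "pod".
am = {"open the pod bay doors": -12.0, "open the pot bay doors": -11.5}
lm = {"open the pod bay doors": -8.0, "open the pot bay doors": -14.0}
print(decode(am, am.get, lm.get))  # ('open the pod bay doors', -20.0)
```

When the acoustics are ambiguous, as they often are for dysarthric speech, the language model carries more of the decision.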


Dysarthria: neuro-motor articulatory difficulties resulting in unintelligible speech.

Can computers do better?

Background 6


[Figure: Dysarthria and ASR word accuracy. Word recognition accuracy (%), 0 to 90, against number of Gaussians, 2 to 16, for non-dysarthric and dysarthric speakers.]

Background 7
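The y-axis above is word recognition accuracy. A common definition counts the substitutions S, deletions D, and insertions I in a minimum edit-distance alignment against the N reference words: accuracy = (N − S − D − I) / N. The sketch below assumes that definition and whitespace-tokenized transcripts; the example sentences are hypothetical.

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy (%) = 100 * (N - S - D - I) / N.

    N is the number of reference words; S + D + I is the minimum word-level
    edit distance. Requires a non-empty reference. Because insertions are
    penalized, the result can be negative."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the first i reference words and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = prev[j] + 1       # reference word i is missing
            insertion = curr[j - 1] + 1  # hypothesis word j is extra
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    errors = prev[len(hyp)]
    return 100.0 * (len(ref) - errors) / len(ref)

print(word_accuracy("open the pod bay doors", "open the pot bay doors"))  # 80.0
```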


Acoustic ambiguity

[Figure: non-dysarthric and dysarthric speech.]

Is this acoustic behaviour indicative of underlying articulatory behaviour?

Background 8


Articulatory knowledge

/m/ /n/ /ng/

Background 9


TORGO 2.0

• TORGO was built for ASR of speakers with cerebral palsy. What about:
  i. Other neuro-motor deficits? Less-verbal patients?
  ii. Alternatives to electromagnetic articulography (e.g., video)?
  iii. A focus on articulatory gestures?

Data collection 10


TORGO online

Data collection 11

• In 2013 we hope to have a server capable of receiving Voice over IP (VoIP) calls.
• Can we trawl the web for data? Can we record automated dialogues with random callers (as in MIT’s Jupiter)?


Classifying all this data

• Conditional random fields
• Neural networks
• Support vector machines
• Dynamic Bayes nets

[Diagram: a dynamic Bayes net with hidden states q1, q2, q3, …, observations o1, o2, o3, …, and labels l1, l2, l3, ….]

Articulatory speech recognition 12
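A chain of hidden states q_t, each emitting an observation o_t, is the structure of a hidden Markov model, the simplest dynamic Bayes net. For that case, the most likely hidden-state sequence can be recovered with the Viterbi algorithm. The sketch below assumes discrete observations and log-probabilities stored in dicts; the two lip-aperture states and all probabilities in the example are hypothetical.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely hidden-state sequence for an HMM.

    log_start[s], log_trans[s][t] and log_emit[s][o] are log-probabilities
    stored in (nested) dicts. Returns (path, log-probability of that path)."""
    # best[s]: log-probability of the best partial path ending in state s
    best = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for o in observations[1:]:
        prev, best, pointers = best, {}, {}
        for t in states:
            # Best predecessor of state t at this frame
            s_max = max(states, key=lambda s: prev[s] + log_trans[s][t])
            best[t] = prev[s_max] + log_trans[s_max][t] + log_emit[t][o]
            pointers[t] = s_max
        backpointers.append(pointers)
    # Trace back from the best final state
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1], best[last]

log = math.log
states = ["closed", "open"]  # hypothetical lip-aperture labels
log_start = {"closed": log(0.5), "open": log(0.5)}
log_trans = {"closed": {"closed": log(0.9), "open": log(0.1)},
             "open": {"closed": log(0.1), "open": log(0.9)}}
log_emit = {"closed": {"low": log(0.9), "high": log(0.1)},
            "open": {"low": log(0.1), "high": log(0.9)}}
path, score = viterbi(["low", "low", "high", "high"], states, log_start, log_trans, log_emit)
print(path)  # ['closed', 'closed', 'open', 'open']
```

Richer dynamic Bayes nets add further variables per frame (such as the labels l_t), but decoding still follows this dynamic-programming pattern.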


Beyond discretized articulation

13


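I. Dynamic speech gestures

• We require a theoretical framework to represent relevant and continuous articulatory motion.
• Task dynamics represents speech as goal-based reconfigurations of the vocal tract: $M\ddot{z} + B\dot{z} + K(z - z_0) = 0$, where $z$ is the vector of tract variables (e.g., lip aperture), $z_0$ their targets, and $M$, $B$, and $K$ the mass, damping, and stiffness coefficients.
• We wish to do classification in a low-dimensional and informative space that incorporates goal-based and long-term dynamics.

[Figure: gestures for ‘pub’ over time, on the tract variables lip aperture, tongue body constriction degree, and glottis.]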

Dynamic speech gestures 14
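To make the equation concrete, the sketch below integrates it for a single tract variable. It assumes critical damping, B = 2√(MK), under which the variable approaches its target without overshoot. It also assumes semi-implicit Euler integration, and the stiffness, time step, and units in the example are hypothetical.

```python
import math

def simulate_gesture(z_init, z_target, stiffness, mass=1.0, damping=None,
                     dt=0.001, duration=0.3):
    """Integrate M z'' + B z' + K (z - z0) = 0 for one tract variable.

    Uses semi-implicit Euler steps of size dt (seconds). If damping is None,
    critical damping B = 2 * sqrt(M * K) is assumed. Returns z at each step."""
    if damping is None:
        damping = 2.0 * math.sqrt(mass * stiffness)
    z, velocity = z_init, 0.0
    trajectory = [z]
    for _ in range(int(round(duration / dt))):
        acceleration = -(damping * velocity + stiffness * (z - z_target)) / mass
        velocity += acceleration * dt  # update velocity first...
        z += velocity * dt             # ...then position with the new velocity
        trajectory.append(z)
    return trajectory

# Hypothetical lip closure for the /p/ of 'pub': aperture 10 -> 0 (arbitrary units)
trajectory = simulate_gesture(z_init=10.0, z_target=0.0, stiffness=1000.0)
print(f"final aperture: {trajectory[-1]:.4f}")
```

A gesture in this view is fully described by a handful of parameters (target, stiffness, timing), which is what makes the space low-dimensional.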


II. Acoustic-articulatory inversion

[Diagram: a mixture density network mapping input acoustics through a hidden layer to an output layer of mixture parameters $\omega_0, \mu_0, \sigma_0, \ldots, \sigma_n$.]

[Figure: intensity map of estimated tongue tip constriction over time.]

Acoustic-articulatory inversion 15
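A mixture density network outputs, for each acoustic frame, the weights ω, means μ, and standard deviations σ of a Gaussian mixture over articulatory positions. This lets the inversion represent several plausible vocal-tract configurations for the same sound. The sketch below shows only the output layer and the training loss for one scalar articulatory target (e.g., tongue tip constriction). It assumes the usual softmax and exponential transforms; the hidden layers and training loop are omitted.

```python
import math

def mdn_params(raw_weights, raw_means, raw_log_sigmas):
    """Map raw network outputs to valid mixture parameters:
    softmax for the weights (omega), identity for the means (mu),
    exp for the standard deviations (sigma), so that sigma > 0."""
    m = max(raw_weights)  # subtract the max for numerical stability
    exps = [math.exp(w - m) for w in raw_weights]
    total = sum(exps)
    weights = [e / total for e in exps]
    sigmas = [math.exp(s) for s in raw_log_sigmas]
    return weights, list(raw_means), sigmas

def mixture_density(z, weights, means, sigmas):
    """p(z | acoustics) = sum_k omega_k * N(z; mu_k, sigma_k^2) for scalar z."""
    return sum(
        w / (math.sqrt(2 * math.pi) * s) * math.exp(-0.5 * ((z - mu) / s) ** 2)
        for w, mu, s in zip(weights, means, sigmas)
    )

def mdn_loss(z, raw_weights, raw_means, raw_log_sigmas):
    """Negative log-likelihood of one articulatory target z (the training loss)."""
    params = mdn_params(raw_weights, raw_means, raw_log_sigmas)
    return -math.log(mixture_density(z, *params))

# Two equally weighted components at -1 and +1: a bimodal, "one-to-many" estimate
print(mdn_params([0.0, 0.0], [1.0, -1.0], [0.0, 0.0]))
```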


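III. The noisy channel

Is dysarthria a distortion of non-dysarthric speech? … or are they both distortions of a common abstraction?

• Direct model: non-dysarthric speech, $Y_c$, is distorted into dysarthric speech, $Y_d$, through $P(Y_d \mid Y_c)$.
• Common-abstraction model: abstract speech, $X$, is realized as dysarthric speech, $Y_d$, through $P(Y_d \mid X)$, and as non-dysarthric speech, $Y_c$, through $P(Y_c \mid X)$.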

The noisy channel 16
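Under the common-abstraction view, one way to map dysarthric speech toward non-dysarthric speech is to infer the abstraction X from Y_d and then re-realize it as Y_c. The sketch below works over small discrete sets and assumes Y_c and Y_d are conditionally independent given X. The phoneme categories and probabilities in the example are hypothetical.

```python
def translate(y_d, prior_x, p_yd_given_x, p_yc_given_x):
    """P(Y_c | Y_d) under the common-abstraction model, assuming Y_c and Y_d
    are conditionally independent given X:
        P(x | y_d)   is proportional to P(y_d | x) P(x)
        P(y_c | y_d) = sum_x P(y_c | x) P(x | y_d)
    All distributions are dicts over small discrete sets."""
    joint = {x: p_yd_given_x[x].get(y_d, 0.0) * px for x, px in prior_x.items()}
    evidence = sum(joint.values())
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the model")
    result = {}
    for x, p_joint in joint.items():
        posterior = p_joint / evidence  # P(x | y_d)
        for y_c, p in p_yc_given_x[x].items():
            result[y_c] = result.get(y_c, 0.0) + p * posterior
    return result

# Hypothetical example: abstract phonemes /b/ and /p/; the dysarthric
# realization "B" is ambiguous between them.
prior_x = {"b": 0.5, "p": 0.5}
p_yd_given_x = {"b": {"B": 0.8, "P": 0.2}, "p": {"B": 0.6, "P": 0.4}}
p_yc_given_x = {"b": {"b": 1.0}, "p": {"p": 1.0}}
print(translate("B", prior_x, p_yd_given_x, p_yc_given_x))  # about 0.57 for 'b', 0.43 for 'p'
```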


Super-duper ASR

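[Diagram: abstract speech $X$ generates non-dysarthric speech $Y_c$ through $P(Y_c \mid X)$ and dysarthric speech $Y_d$ through $P(Y_d \mid X)$.]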

How might we combine a noisy channel model, acoustic-articulatory inversion and a dynamical model of speech production within a speech recognition system?

Super-duper ASR 17


Feedback and biological plausibility

A biologically plausible model 18

• In task dynamics, the equation $M\ddot{z} + B\dot{z} + K(z - z_0) = 0$ ignores or takes for granted:
  1. Feedback, especially acoustic, proprioceptive, and tactile feedback.
  2. Unit selection: words and syllable structures are known in advance.
  3. Grammar and vocabulary.
  4. Semantics.

• We want a more biologically plausible model of speech perception/production.
  • Control-theoretic neural networks?
  • Can we include representations of the brain and its pathologies?


Interpreting brain signals

Interpreting brain signals 19

• Hidden Markov models are sometimes used to classify electroencephalographic (EEG) data.
• Can we improve accuracy with more advanced models?
• What features and sensor locations are most informative?
• How can we remove artifacts from very noisy signals?
• How can we elicit imagined words?
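The usual HMM classification recipe trains one HMM per class (e.g., per imagined word) and assigns a new recording to the class whose model gives it the highest likelihood, computed with the forward algorithm. The sketch below assumes EEG frames have already been quantized into discrete symbols and that all probabilities are nonzero. Real systems more often use continuous (e.g., Gaussian) emissions, and feature extraction is left open, as on the slide. The single-state models in the example are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_likelihood(observations, hmm):
    """Forward algorithm: log P(observations | hmm).

    hmm = (states, log_start, log_trans, log_emit), with log-probabilities in
    nested dicts and observations drawn from a discrete symbol set."""
    states, log_start, log_trans, log_emit = hmm
    alpha = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    for o in observations[1:]:
        alpha = {
            t: logsumexp([alpha[s] + log_trans[s][t] for s in states]) + log_emit[t][o]
            for t in states
        }
    return logsumexp(list(alpha.values()))

def classify(observations, models):
    """Return the label whose HMM gives the observations the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(observations, models[label]))

def one_state_hmm(p_low):
    # Hypothetical single-state model emitting quantized EEG symbols "low"/"high"
    return (["s"], {"s": 0.0}, {"s": {"s": 0.0}},
            {"s": {"low": math.log(p_low), "high": math.log(1 - p_low)}})

models = {"yes": one_state_hmm(0.8), "no": one_state_hmm(0.3)}
print(classify(["low", "low", "high"], models))  # yes
```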


Introduction 20

Dysarthria


Talking to machines

[Figure: example exchanges (“Put this there.”, “My hands are in the air.”, “Buy ticket... AC490...” / “yes”) for three applications: telephony, dictation, and multimodal interaction.]

Talking to machines 21


Talking to humans

Talking to humans 22


1. Noise reduction

Spectral subtraction removes environmental signal noise.

[Figure: speech before and after spectral subtraction.]

Noise reduction 23
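Spectral subtraction estimates the noise spectrum from stretches assumed to contain no speech, then subtracts that estimate from every frame. The sketch below shows the core of the magnitude-subtraction variant on single frames. It uses a naive O(n²) DFT to stay dependency-free and omits windowing and overlap-add. The `floor` parameter, a spectral floor that limits "musical noise", is an assumption rather than a detail from the slide.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT, returning the real part (the input spectra here are conjugate-symmetric)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def noise_magnitude(noise_frames):
    """Average magnitude spectrum over frames assumed to contain only noise."""
    spectra = [[abs(c) for c in dft(f)] for f in noise_frames]
    return [sum(col) / len(spectra) for col in zip(*spectra)]

def spectral_subtract(frame, noise_mag, floor=0.0):
    """Subtract the noise magnitude in each frequency bin, keep the noisy phase,
    and clamp the result at floor * noise magnitude so it never goes negative."""
    cleaned = []
    for c, n in zip(dft(frame), noise_mag):
        mag = max(abs(c) - n, floor * n)
        cleaned.append(mag * cmath.exp(1j * cmath.phase(c)))
    return idft(cleaned)
```

In practice the frame length would be a few hundred samples and an FFT would replace the naive DFT; the arithmetic per bin is the same.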


2. De-voicing consonants

The “voice bar”

De-voicing consonants 24


3. ‘Splicing’: Deletions and insertions

• Deleted sounds are patched with synthetic equivalents.
• Inserted sounds (e.g., ‘stuttering’) are simply removed.

[Figure: splicing examples, ‘feelin’ and ‘pronounced’.]

‘Splicing’: Deletions and insertions 25
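To decide which sounds were inserted or deleted, what was said must first be aligned with what was expected. The sketch below assumes phone-level transcriptions are already available. It uses Python's `difflib.SequenceMatcher` as a stand-in aligner, not necessarily the one used in this work. The phone labels and the ("synthetic", phone) placeholder are hypothetical, and producing the synthetic audio itself is not shown.

```python
from difflib import SequenceMatcher

def splice(observed, expected):
    """Align observed phones with the expected pronunciation, then:
    - drop sounds the speaker inserted (e.g., repetitions when stuttering);
    - patch sounds the speaker deleted with a synthetic placeholder.
    Returns (source, phone) pairs, where source is 'natural' or 'synthetic'."""
    output = []
    matcher = SequenceMatcher(None, observed, expected, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            # Matches, and substitutions (not handled by splicing), stay natural
            output.extend(("natural", p) for p in observed[i1:i2])
        elif tag == "insert":
            # Expected but missing from the speech: the speaker deleted it
            output.extend(("synthetic", p) for p in expected[j1:j2])
        # tag == "delete": extra sounds in the speech are dropped
    return output

print(splice(["p", "p", "p", "r", "ow"], ["p", "r", "ow"]))
```

`SequenceMatcher` finds long matching blocks rather than a guaranteed minimum edit script, which is adequate for a sketch but may differ from a phonetically weighted alignment.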


4. Tempo morphing

• Dysarthric speech tends to be much slower (often 3x) than typical speech.
• We squish sonorants in time so that they are closer to their expected length.
• A phase vocoder squishes (or stretches) the length of a signal without affecting its pitch or frequency characteristics.

Tempo morphing 26
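Before a phase vocoder can squish each sonorant, each segment needs a stretch factor. The sketch below computes one per segment. It assumes a phone-level segmentation with positive durations in seconds and a table of expected durations. The sonorant inventory, the labels, and the rule of shortening but never lengthening are illustrative assumptions, and the phase vocoder itself is not shown.

```python
# Hypothetical sonorant inventory (ARPAbet-style labels): vowels, nasals, liquids, glides.
SONORANTS = {"aa", "ae", "ah", "eh", "ih", "iy", "uw", "m", "n", "ng", "l", "r", "w", "y"}

def tempo_targets(segments, expected_durations):
    """Return (phone, observed_s, target_s, rate) for each segment.

    Sonorants are squished toward their expected duration (never lengthened);
    other segments are left alone. rate = observed / target is the speed-up
    a phase vocoder would apply to that segment (rate > 1 shortens it)."""
    plan = []
    for phone, observed in segments:
        if phone in SONORANTS and phone in expected_durations:
            target = min(observed, expected_durations[phone])
        else:
            target = observed
        plan.append((phone, observed, target, observed / target))
    return plan

# Hypothetical segments: a long /m/ and /ah/ around a stop /p/
for row in tempo_targets([("m", 0.24), ("p", 0.05), ("ah", 0.30)], {"m": 0.08, "ah": 0.10}):
    print(row)
```

Leaving obstruents untouched keeps bursts and frication intact, since those are short events that time-scaling can smear.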


5. Formant ambiguity

Can we separate the vowels so that they are more mutually distinct?

[Figure: vowels in non-dysarthric and dysarthric speech.]

Formant ambiguity 27


5. Formant morphing

[Figure: vowels before and after formant morphing.]

Formant morphing 28
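One simple way to make vowels more mutually distinct is to pull each token's formants part of the way toward typical values for its vowel. This assumes each token's vowel identity and formant frequencies are already known. It is an illustrative technique, not necessarily the one used here; the target table and `alpha` are hypothetical, and resynthesizing audio with the new formants is not shown.

```python
def morph_formants(formants, vowel, targets, alpha=0.5):
    """Move a vowel token's formant frequencies (Hz) toward target values.

    formants: observed (F1, F2, ...) for one vowel token.
    targets: hypothetical table of typical formants per vowel label.
    alpha: 0 leaves the token unchanged, 1 moves it onto the target."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    goal = targets[vowel]
    return tuple(f + alpha * (t - f) for f, t in zip(formants, goal))

# Hypothetical centralized /iy/ token pulled halfway toward a target
print(morph_formants((500.0, 1500.0), "iy", {"iy": (300.0, 2300.0)}))  # (400.0, 1900.0)
```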


Multimodal interaction (MMI)

• Can a touch screen augment speech transformation?
  • e.g., mixing a database of canned phrases with natural speech.
• How would word/phrase prediction and correction work in this context?
• How can modern virtual keyboards be modified to help people with physical disabilities? With cognitive disabilities?

Multimodal interaction 29


• Integrating concurrent streams of communication can, e.g.:
  • enable more natural and efficient expression, and
  • reduce ambiguity in any one of those streams.

Put this there.
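In “Put this there.”, speech alone does not say what ‘this’ and ‘there’ refer to, and a touch alone does not say what to do; together they are unambiguous. The sketch below shows one fusion step. It assumes word timestamps from the recognizer and timestamped touch events. The deictic word list, the nearest-in-time rule, and `max_gap` are illustrative assumptions.

```python
DEICTICS = {"this", "that", "here", "there"}  # hypothetical word list

def resolve_deictics(words, touches, max_gap=1.0):
    """Bind each deictic word to the closest-in-time touch event.

    words: (word, time_s) pairs from the recognizer.
    touches: (target, time_s) pairs from the touch screen.
    Returns (word, target_or_None) pairs; each touch is used at most once,
    and only if it falls within max_gap seconds of the word."""
    unused = list(touches)
    resolved = []
    for word, t in words:
        target = None
        if word.lower() in DEICTICS and unused:
            nearest = min(unused, key=lambda touch: abs(touch[1] - t))
            if abs(nearest[1] - t) <= max_gap:
                target = nearest[0]
                unused.remove(nearest)
        resolved.append((word, target))
    return resolved

print(resolve_deictics([("put", 0.0), ("this", 0.4), ("there", 1.2)],
                       [("lamp", 0.5), ("table", 1.3)]))
# [('put', None), ('this', 'lamp'), ('there', 'table')]
```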


• ‘Ambient intelligence’: speech interfaces in the environment.
• Emergency scenarios (e.g., reacting to falls)
  • e.g., HomeLab: “do you want me to call for help?”
• Speech interfaces can be used to guide an individual through daily tasks.
  • e.g., HomeLab: “don’t forget to turn the faucet off!”
• Crucially, this involves detecting and correcting breakdowns in communication.


• We will need ASR for individuals with dementia.
• Can we specialize ASR models for cognitive deficits?
  • The vocabulary, language model, and grammar may each differ from those for the general public.
  • What are their effects on ASR performance?
  • How can we limit or adjust these dynamically?


Interfaces for automated dialogues

• Simple dialogue with a mobile robot is now being tested in HomeLab.
• Are alternative modes appropriate?
  • e.g., could a digital assistant be useful on tablets or on the TV?
• How do we measure success beyond completion of daily tasks?

Interfaces for automated dialogues 33


SPOClab will build software to help people with disabilities communicate. This is a deliberately broad goal.

We will build advanced models of speech production/perception. These will be used within augmented speech recognition. We will build brain-machine interfaces that model speech production and perception as abstract dynamical systems.

We will build systems that help make people more intelligible to others. We will support aging in place by helping individuals with cognitive disorders become more capable and more independent.
