

Page 1: A Distributed System for Recognizing Home Automation Commands and Distress Calls in the Italian Language

A Distributed System for Recognizing Home Automation Commands and Distress Calls in the Italian Language

Emanuele Principi 1, Stefano Squartini 1, Francesco Piazza 1, Danilo Fuselli 2, and Maurizio Bonifazi 2

1 A3Lab, Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy
2 FBT Elettronica Spa, Via Paolo Soprani 1, Recanati (MC), Italy

[email protected]

1. Introduction

Ageing of population in the most developed countries is posing a great challenge in social healthcare systems [1].

Population Ageing

Ambient Assisted Living: use of technologies for supporting and assisting people in their own homes.

Solution: Ambient Assisted Living

• Current technologies employ application-dependent sensors, such as vital sensors, cameras, or microphones.

• Recent studies demonstrated that people prefer non-invasive sensors, such as microphones [2, 3].

• Several works in the literature recognize emergencies based on audio signals only [2, 4, 5, 6].

Different Approaches to the Problem

A new system that integrates technologies for monitoring the acoustic environment, hands-free communication, and home automation services.

Idea

2. The proposed system

• The system is composed of two functional units:

– Local Multimedia Control Unit (LMCU): it recognizes distress calls and home automation commands, and manages the hands-free communication.

– Central Management and Processing Unit (CMPU): it integrates home and content delivery automation services, and it manages emergency states.

• Private Branch eXchange (PBX): it interfaces the system with the Public Switched Telephone Network (PSTN) and analogue phone devices.

[Figure: system overview. The LMCUs on the home network connect to the CMPU and to the PBX, which bridges the system to the Internet and the PSTN.]

Overview

[Figure: LMCU processing chain. The microphone signal passes through the VAD, interference cancellation of known sources (television, radio, etc.), and speech recognition in the "monitoring" state; the AEC and the VoIP stack are active in the "calling" state.]

• One microphone acquires the input speech signal.

• A Voice Activity Detector selects the audio segments that contain the voice signal.

• An interference cancellation module removes the sounds coming from known audio sources (e.g., from a television or a radio).

• Two operating states:

– monitoring: the speech recognition engine is active and detects home automation commands and distress calls.

– calling: it occurs when a phone call is ongoing; the speech recognition engine is disabled, and the acoustic echo cancellation and the VoIP stack are active.

• Hands-free communication is guaranteed by a VoIP stack and an AEC.

• Hardware platform: based on ARM processors and embedded Linux.

The Local Multimedia Control Unit
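The VAD stage above can be sketched with a simple frame-energy rule. This is illustrative only: the poster does not name the actual VAD algorithm, and the frame length and threshold below are assumptions.

```python
# Minimal energy-based voice activity detector (illustrative sketch; the
# actual VAD algorithm of the LMCU is not specified in this poster).

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def vad(samples, frame_len=160, threshold=1e-3):
    """Split into non-overlapping frames and keep those above the threshold."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [f for f in frames if frame_energy(f) > threshold]
```

With 16 kHz audio, 160 samples correspond to a 10 ms frame; a practical detector would also add hangover smoothing so that short pauses inside an utterance are not cut.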

• x(n): the far-end signal.

• s(n): the near-end speech of a local speaker.

• w(n): the room impulse response.

• d(n) = s(n) + w(n) ∗ x(n): the signal acquired by the microphone.

• e(n): the residual echo.

• Acoustic Echo Cancellation algorithms remove the echo produced by the acoustic coupling between the loudspeaker and the microphone.

• In the system presented in this paper, the algorithm implemented in the Speex codec has been used [7].

• Interference cancellation: the same algorithm is employed to reduce the degrading effects of known sources of interference (e.g., a television).

Acoustic Echo & Interference Cancellation
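As an illustration of the signal model above, the sketch below removes the echo w(n) * x(n) from d(n) with a time-domain NLMS adaptive filter. Note that the Speex canceller actually used in the system is a frequency-domain (MDF) algorithm [7]; this is only a didactic stand-in, and the tap count and step size are illustrative.

```python
# Time-domain NLMS echo canceller sketch. d(n) = s(n) + w(n) * x(n) is the
# microphone signal; the filter estimates the echo path w(n) and outputs the
# residual e(n).

def nlms_aec(x, d, taps=8, mu=0.5, eps=1e-8):
    """Return the residual e(n) = d(n) - w_hat(n) * x(n)."""
    w_hat = [0.0] * taps          # echo path estimate
    buf = [0.0] * taps            # newest far-end sample first
    e = []
    for n in range(len(x)):
        buf = [x[n]] + buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w_hat, buf))   # estimated echo
        err = d[n] - y
        e.append(err)
        norm = sum(xi * xi for xi in buf) + eps          # NLMS normalization
        w_hat = [wi + mu * err * xi / norm for wi, xi in zip(w_hat, buf)]
    return e
```

During double-talk (near-end speech s(n) active), adaptation would normally be slowed or frozen; the Speex algorithm handles this by adjusting the learning rate [7].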

Interspeech, Lyon, France, 25-29 August 2013

Page 2

3. Commands and Distress Calls Recognition

• Commands and distress calls recognition is performed by means of the PocketSphinx [8] speech recognition engine.

• The support for semi-continuous Hidden Markov Models and for fixed-point arithmetic makes the engine suitable for embedded devices.

The engine: PocketSphinx

• Mel-Frequency Cepstral Coefficients (MFCC) are currently employed in the majority of speech recognition systems.

• Power Normalized Cepstral Coefficients (PNCC) [a] have been recently proposed to improve the robustness against distortions (e.g., noise and reverberation) [9].

• The final feature vector is composed of 13 mean-normalized static coefficients and their first and second derivatives.

• Number of arithmetic operations: PNCC 17516, MFCC 13010.

[Figure: PNCC extraction pipeline. Speech undergoes pre-emphasis, STFT and squared magnitude, gammatone frequency integration, medium-time processing, time-frequency normalization, mean power normalization, a power-function non-linearity, and DCT with mean normalization; first and second derivatives are appended to obtain the final features.]

[a] A free library for PNCC extraction is available at http://www.a3lab.dii.univpm.it/projects/libpncc

Feature Extraction: Power Normalized Cepstral Coefficients
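The 13 static coefficients plus their first and second derivatives give the 39-dimensional vectors mentioned above. The assembly can be sketched as follows; the two-frame difference used here for the derivatives is a simplification of the multi-frame regression windows used in real front-ends.

```python
# Sketch of assembling the 39-dimensional feature vector: 13 mean-normalized
# static cepstral coefficients plus first and second derivatives.

def mean_normalize(frames):
    """Subtract the per-coefficient mean computed over the utterance."""
    dim = len(frames[0])
    means = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return [[f[i] - means[i] for i in range(dim)] for f in frames]

def deltas(frames):
    """First-order differences, with edge frames replicated."""
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, len(frames) - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def feature_vectors(static):
    """static: list of 13-dimensional cepstral vectors -> 39-dim vectors."""
    static = mean_normalize(static)
    d1 = deltas(static)           # first derivatives
    d2 = deltas(d1)               # second derivatives
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```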

[Diagram: PNCC extraction feeds acoustic model training.]

APASCI Training Set: utterances: 1310; minutes: 105; speakers: 60; female/male: 50%.

Acoustic Model: states per phone: 3; skip: no; Gaussians per state: 4; senones: 200.

• Training has been performed using the APASCI corpus [10].

– It is composed of Italian speech utterances recorded in a quiet room with a vocabulary of about 3000 words.

– Training has been performed on the training set part of the speaker independent subset.

– Acoustic model parameters have been chosen on the test set part of the speaker independent subset (860 utterances spoken by 20 males and 20 females, for a total duration of about 69 minutes).

• These values have been selected using the APASCI speaker independent test set as development set and a word loop grammar as language model. The obtained word recognition accuracy is 64.5%.

Acoustic Model

• The current version of PocketSphinx (version 0.8) is not able to perform rejection using generic word models [11] or with garbage models as in keyword spotting [12].

• The "confidence score" associated to each result is reliable only for vocabularies of more than 100 words, a number exceeding the vocabulary size of the application scenario taken into consideration.

• The approach followed to reject out-of-grammar words is inspired by the technique of Vertanen [13].

– It consists in training a special "garbage phone" that represents a generic word phone, replacing the actual phonetic transcription of a percentage of the vocabulary words with a sequence of garbage phones.

– Then, a garbage word whose phonetic transcription is the single garbage phone is introduced in the recognition dictionary.

– Here, training of the garbage phone has been performed substituting 10% of the phonetic transcriptions of the words present in the training vocabulary.

– Words have been chosen so that each phone of the Italian language is equally represented, so that they are all trained with the same amount of data.

Out of Grammar Rejection
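The dictionary manipulation described above can be sketched as follows. The phone-balanced word selection used in the paper is replaced here by a simple deterministic pick, so this is illustrative only.

```python
# Sketch of the garbage-phone training step: replace the phonetic
# transcription of a fraction (here 10%) of the training vocabulary with a
# sequence of "GP" garbage phones of the same length. Word selection is
# deterministic here; the paper balances it over the Italian phones.

def garbage_dictionary(lexicon, fraction=0.10):
    """lexicon: dict word -> list of phones. Returns a modified copy."""
    words = sorted(lexicon)
    n_garbage = max(1, int(len(words) * fraction))
    out = dict(lexicon)
    for w in words[:n_garbage]:
        out[w] = ["GP"] * len(lexicon[w])
    # the recognition dictionary also gains a garbage word whose
    # transcription is the single garbage phone
    out["GARBAGE"] = ["GP"]
    return out
```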

• Robustness to mismatches between training and testing conditions can be improved by adapting the acoustic model.

• Before using the system for the first time, the user is asked to speak a set of phonetically rich sentences so that all phones of the Italian language are trained using the same amount of data.

• Adaptation is performed by means of Maximum Likelihood Linear Regression(MLLR).

Speaker adaptation
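MLLR adapts the Gaussian means of the acoustic model with an affine transform mu' = A * mu + b estimated on the adaptation sentences. Estimating A and b requires the usual EM statistics, so the sketch below only shows how an already-estimated transform is applied to the means.

```python
# Applying an MLLR mean transform: each Gaussian mean mu is mapped to
# A * mu + b. A single global transform is assumed for simplicity;
# practical systems may use several regression classes.

def apply_mllr(means, A, b):
    """means: list of mean vectors; A: matrix as list of rows; b: bias."""
    def matvec(M, v):
        return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]
    return [[m + bi for m, bi in zip(matvec(A, mu), b)] for mu in means]
```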

• A simple finite state grammar where the word GARBAGE is mapped to the GP phone in the dictionary.

• Sequences of one or more garbage words are automatically ignored.

• Stop-words, such as articles and prepositions, are also ignored by the system.

public <SENTENCE> = <COMMAND> | <DISTRESS_CALL> | GARBAGE;

<COMMAND> = ((accendi|spegni) la luce) |

((alza|abbassa) la temperatura);

<DISTRESS_CALL> = aiuto | ambulanza;

Language Model
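The filtering described above, with garbage words and stop-words dropped from the recognizer output before the command is interpreted, can be sketched as follows; the stop-word list here is an illustrative assumption.

```python
# Post-processing of the recognition hypothesis: sequences of garbage words
# and stop-words (articles, prepositions, ...) are ignored. The stop-word
# list below is an assumed example, not the one used in the system.

STOP_WORDS = {"la", "il", "lo", "le", "di", "a"}

def clean_hypothesis(words):
    """Drop GARBAGE tokens and stop-words from the recognized word list."""
    return [w for w in words
            if w != "GARBAGE" and w.lower() not in STOP_WORDS]
```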

4. The ITAAL speech corpus

• ITAAL is a new speech corpus of home automation commands and distress calls in Italian [a].

• People read three groups of sentences in Italian: home automation commands (e.g., “close the door”), distress calls (e.g., “help!”, “ambulance!”) and phonetically rich sentences. These have been extracted from the “speaker independent” set of the APASCI corpus and they cover all the phones of the Italian language.

• Every sentence was spoken both in normal and shouted conditions, with the home automation commands and the phonetically rich sentences pronounced without emotional inflection. In contrast, people were asked to speak distress calls as if they were frightened.

[a] Free samples are available at http://www.a3lab.dii.univpm.it/projects/itaal

Description

                      Commands   Distress calls   APASCI
Number of sentences      15            5              6
Total words              45            8             52
Unique words             18            6             51
Duration (minutes)       39           13             17
Repetitions               3            3              1

Number of speakers: 20; female/male ratio: 50%; average age: 41.70 ± 11.17; SNR (headset/distant): 51.46 dB / 34.08 dB.

Details

Recording equipment

Room setup

[Figure: distributions of F0 (Hz) and of the active speech level (dB) for normal and shouted speech.]

Pitch and active speech level


Page 3

5. Experiments

• Signals: headset and array central microphone of ITAAL corpus.

• Four recognition tasks:

1. recognition of home automation commands uttered in normal speaking style
2. recognition of distress calls uttered in normal speaking style
3. recognition of home automation commands uttered in shouted speaking style
4. recognition of distress calls uttered in shouted speaking style

• Each task has been evaluated with and without adapting the acoustic model.

• Adaptation strategy: acoustic model adaptation has been performed separately for each speaker, for each talking style (normal, shouted) and for each microphone (headset, distant) on the phonetically rich sentences of ITAAL.

• Evaluation metric: Sentence Recognition Accuracy, defined as the number of correctly recognized sentences divided by the total number of sentences.

Description
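The evaluation metric above can be expressed directly in code: a sentence counts as correct only when its entire transcription matches the reference.

```python
# Sentence Recognition Accuracy: percentage of test sentences whose entire
# transcription is recognized correctly.

def sentence_accuracy(references, hypotheses):
    """Both arguments are equal-length lists of sentence strings."""
    correct = sum(1 for r, h in zip(references, hypotheses) if r == h)
    return 100.0 * correct / len(references)
```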

                     Commands             Distress calls        Average
                     Normal    Shouted    Normal    Shouted
Headset   Baseline    94.22     87.89      99.67     93.00       93.70
          MLLR        95.77     96.11     100.00     95.33       96.80
Distant   Baseline    19.13     10.67      70.67     45.33       36.45
          MLLR        37.15     37.67      85.00     72.67       58.12

Sentence recognition results

• The accuracy on the headset microphone signals exceeds 90% in each task, with the only exception being shouted commands recognition.

• In shouted speaking style, both the headset and the distant microphone signals exhibit a similar performance decrease.

• The performance difference between normal and shouted conditions is evident also on the distant microphone signals, and this is consistent with the results in [14].

• The high mismatch between training and testing conditions, mainly due to reverberation (the T60 of the room is 0.72 s), is the main cause of the performance reduction.

• Consider that the acoustic model parameters have been chosen using the APASCI test set, thus on noise- and reverberation-free signals.

Comments on “Baseline” results

• On the headset microphone, the introduction of speaker adaptation produces a significant performance improvement over non-adapted results.

– Besides the acoustic conditions, the improvement can be attributed also to the different accent between the APASCI corpus speakers (northern Italy) and the ITAAL ones (central Italy).

– Shouted speaking style improvements are more significant.

• On the distant-talking microphone, the improvement is more evident, and it reaches an average of 21.67 percentage points.

• Distress calls recognition can be considered satisfactory for being employed in a real scenario.

Comments on “MLLR” results

6. Conclusions and Future Developments

• A system for emergency detection and remote assistance in a smart home has been presented.

• It is composed of multiple local multimedia control units that actively monitor the acoustic environment, and one central management and processing unit that coordinates the system.

• The acoustic environment is monitored by means of a voice activity detector, an interference removal module, and a speech recognizer based on PocketSphinx.

• The detection of a distress call automatically triggers a phone call to a contact in the address book.

• Hands-free communication is possible by means of a VoIP stack based on the SIP protocol, and the Speex acoustic echo canceller.

• The system has been evaluated on ITAAL, a new speech corpus of home automation commands and distress calls in Italian.

• Recognition results showed that, with MLLR speaker adaptation, the system achieves a distress call recognition accuracy of 85.00% on normal speaking style utterances and of 72.67% on shouted speaking style ones.

Conclusions

• Algorithms will be integrated to improve the recognition accuracy on shouted speech, for example detecting the speaking style as in [15, 16].

• Multiple microphone channels will be employed to increase the performance in distant talking conditions.

• A subjective evaluation will be carried out to establish the effectiveness of the overall system.

Future works

References

[1] K. Giannakouris, “Aging characterizes the demographic perspectives of the European societies,” 2008, Eurostat 72.

[2] S. Goetze, J. Schroder, and S. Gerlach, “Acoustic monitoring and localization for social care,” Journal of Computing Science and Engineering, vol. 6, no. 1, pp. 40–50, 2012.

[3] M. Ziefle, C. Rocker, and A. Holzinger, “Medical technology in smart homes: Exploring the user’s perspective on privacy, intimacy and trust,” in Proc. of COMPSACW, Jul. 18-22 2011, pp. 410–415.

[4] M. Vacher, A. Fleury, J.-F. Serignat, N. Noury, and H. Glasson, “Preliminary evaluation of speech/sound recognition for telemedicine application in a real environment,” in Proc. of Interspeech, Brisbane, Australia, Sep. 2008, pp. 496–499.

[5] M. Vacher, B. Lecouteux, and F. Portet, “Recognition of voice commands by multisource ASR and noise cancellation in a smart home environment,” in Proc. of Eusipco, Bucharest, Romania, Aug. 27-31 2012, pp. 1663–1667.

[6] K. Drossos, A. Floros, K. Agavanakis, N.-A. Tatlas, and N.-G. Kanellopoulos, “Emergency voice/stress-level combined recognition for intelligent house applications,” in Proc. of the 132nd AES Convention, Budapest, Hungary, Apr. 26-29 2012, pp. 1–11.

[7] J.-M. Valin, “On Adjusting the Learning Rate in Frequency Domain Echo Cancellation With Double-Talk,” IEEE Trans. on Audio, Speech, and Lang. Process., vol. 15, no. 3, pp. 1030–1034, 2007.

[8] D. Huggins-Daines, M. Kumar, A. Chan, A. Black, M. Ravishankar, and A. Rudnicky, “PocketSphinx: A free, real-time continuous speech recognition system for hand-held devices,” in Proc. of ICASSP, vol. 1, Toulouse, France, May 15-19 2006, pp. 185–188.

[9] C. Kim and R. M. Stern, “Power-normalized cepstral coefficients (PNCC) for robust speech recognition,” in Proc. of ICASSP, Kyoto, Japan, Mar. 2012, pp. 4101–4104.

[10] B. Angelini, F. Brugnara, D. Falavigna, D. Giuliani, R. Gretter, and M. Omologo, “Automatic segmentation and labeling of English and Italian speech databases,” in Proc. of Eurospeech, Berlin, Germany, Sep. 22-25 1993, pp. 653–656.

[11] I. Bazzi and J. Glass, “Modeling out-of-vocabulary words for robust speech recognition,” in Proc. of ICSLP, vol. 1, Beijing, China, 2000, pp. 401–404.

[12] E. Principi, S. Cifani, C. Rocchi, S. Squartini, and F. Piazza, “Keyword spotting based system for conversation fostering in tabletop scenarios: Preliminary evaluation,” in Proc. of HSI, Catania, Italy, May 21-23 2009, pp. 216–219.

[13] K. Vertanen, “Baseline WSJ acoustic models for HTK and Sphinx: Training recipes and recognition experiments,” Cavendish Laboratory, University of Cambridge, Tech. Rep., 2006.

[14] P. Zelinka, M. Sigmund, and J. Schimmel, “Impact of vocal effort variability on automatic speech recognition,” Speech Communication, vol. 54, pp. 732–742, 2012.

[15] J. Pohjalainen, P. Alku, and T. Kinnunen, “Shout detection in noise,” in Proc. of ICASSP, Prague, Czech Republic, May 22-27 2011, pp. 4968–4971.

[16] N. Obin, “Cries and whispers: classification of vocal effort in expressive speech,” in Proc. of Interspeech, Portland, OR, USA, Sep. 9-13 2012, pp. 2234–2237.
