Real-Time Non-Intrusive Speech Quality Estimation: A Signal-Based Model


Upload: adil-raja

Post on 20-Jul-2015


Page 1: Real-Time Non-Intrusive Speech Quality Estimation: A Signal-Based Model

REAL-TIME, NON-INTRUSIVE SPEECH QUALITY ESTIMATION: A SIGNAL-BASED MODEL

MUHAMMAD ADIL RAJA, COLIN FLANAGAN
WIRELESS ACCESS RESEARCH CENTRE
DEPARTMENT OF ELECTRONIC AND COMPUTER ENGINEERING
UNIVERSITY OF LIMERICK, LIMERICK, IRELAND

THE NEED FOR SPEECH QUALITY ESTIMATION
Telecommunication networks strive to provide better quality of service (QoS).
Speech quality estimation is required to ensure that a network is functioning properly.
The estimate is based on the human experience of a call.

SPEECH QUALITY ASSESSMENT METHODOLOGIES
Two approaches to speech quality assessment:
1. Subjective assessment
2. Objective assessment

SUBJECTIVE ASSESSMENT OF SPEECH QUALITY
Speech quality is estimated by human listeners.
Advantage – reliable results.
Limitations:
1. Expensive
2. Time-consuming
3. Laborious
4. Lack of repeatability
Mean Opinion Score (MOS) is the measure of quality.
5 – Excellent

OBJECTIVE ASSESSMENT OF SPEECH QUALITY

A fast, reliable, automated computer program is used to estimate human perception of speech quality.
Two approaches:
1. Intrusive assessment
2. Non-intrusive assessment
Output – MOS-LQO

OBJECTIVE ASSESSMENT OF SPEECH QUALITY
Intrusive Assessment

The signal under test is compared against a corresponding reference signal.
Advantages:
1. The most reliable artificial means of estimating speech quality
2. Tests can be repeated easily
Limitations:
1. Consumes considerable computing resources
2. Not useful for continuous monitoring of quality, because a reference signal is required

OBJECTIVE ASSESSMENT OF SPEECH QUALITY
ITU-T P.862 (PESQ)
A methodology whereby automated computer software predicts speech quality.
The PESQ algorithm is the current ITU-T Recommendation for intrusive speech quality estimation.
The speech signal is mapped from the time domain to a time-frequency representation using the psychophysical equivalents of frequency and intensity.

FIGURE : VoIP system

OBJECTIVE ASSESSMENT OF SPEECH QUALITY
ITU-T P.862 (PESQ)
A methodology whereby automated computer software predicts speech quality.
It has shown a high correlation with various ITU-T benchmark tests.
For 30 ITU-T subjective tests, Pearson's correlation coefficient (R) was 0.935.
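As a reminder of what the quoted figure measures, the following is a minimal sketch of Pearson's correlation coefficient between subjective MOS values and objective predictions. The function name `pearson_r` and the sample numbers are illustrative assumptions, not data from the ITU-T tests.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centred cross-products and squares.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical subjective MOS values and objective predictions.
    subjective = [1.8, 2.5, 3.1, 3.9, 4.4]
    predicted = [2.0, 2.4, 3.3, 3.7, 4.5]
    print(pearson_r(subjective, predicted))
```

A value close to 1 means the objective scores rise and fall with the subjective scores; it does not by itself say that the two agree in absolute terms.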

OBJECTIVE ASSESSMENT OF SPEECH QUALITY
Non-Intrusive Assessment
A challenging problem, since no reference signal is available.
Two approaches exist:
1. Signal-based models
2. Parametric models

Signal-based models
Recent approaches are based on emulating:
1. The human speech production model
2. The psychoacoustic processing of the human ear
ITU-T P.563 is the current Recommendation.

OBJECTIVE ASSESSMENT OF SPEECH QUALITY
Parametric Measurement of VoIP Quality

Quality is estimated as a function of transport-layer metrics and other measurable quantities.
Relevant metrics may be:
  Packet loss rate
  Variable delay – jitter
  End-to-end delay
  ...
Aimed at real-time and continuous evaluation of quality.
Presupposes that all link metrics are known.

VOICE OVER IP (VOIP)
Packet-based communication channel
Uses wire-line speech codecs
Linear Predictive Coding (LPC) is in vogue
Coded frames are packetized into RTP/UDP
The Internet is used for transport
The receiver performs the reverse process

FIGURE : VoIP system
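To make the packetization step concrete, here is a minimal sketch of wrapping one coded frame in a fixed 12-byte RTP header and unwrapping it at the receiver. It assumes no padding, no header extension and no CSRC entries; the function names and the sample frame bytes are hypothetical, and the UDP transport itself is left out.

```python
import struct

RTP_VERSION = 2
HEADER_FORMAT = "!BBHII"  # network byte order, 12 bytes in total
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def build_rtp_packet(payload, seq, timestamp, ssrc, payload_type=0, marker=False):
    """Prefix a coded frame with a minimal RTP header (assumed layout: no CSRC list)."""
    byte0 = RTP_VERSION << 6  # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack(
        HEADER_FORMAT, byte0, byte1,
        seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF,
    )
    return header + payload


def parse_rtp_packet(packet):
    """Receiver side: split the header fields from the coded frame."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack(
        HEADER_FORMAT, packet[:HEADER_SIZE]
    )
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[HEADER_SIZE:],
    }


if __name__ == "__main__":
    frame = bytes([0x11, 0x22, 0x33])  # hypothetical coded speech frame
    packet = build_rtp_packet(frame, seq=1, timestamp=160, ssrc=0xCAFE)
    print(parse_rtp_packet(packet))
```

The sequence number is what lets a receiver detect lost packets, and the timestamp is what lets it measure jitter, which is why these two fields matter for quality monitoring.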

RESEARCH GOAL
Derivation of a signal-based, non-intrusive speech quality estimation model.
Hybrid optimization is used, combining Genetic Programming (GP) and Genetic Algorithms (GA).

SIGNAL-BASED METHODS
Preprocessing
Feature extraction
Perceptual mapping

ITU-T P.563 SCHEMATIC

FIGURE : VoIP system

PREPROCESSING
Level normalization to -26 dBov
Two additional versions of the normalized signal are formed to emulate the frequency response of:
1. A standard telephony handset
2. Cordless and mobile phones
Voice activity detection (VAD) is performed to:
  Discard speech sections shorter than 12 ms
  Join sections that are less than 200 ms apart
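The level normalization and the VAD clean-up rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the P.563 reference code: samples are floats with full scale at 1.0, the level is taken as plain RMS over the whole signal, speech sections are given as (start_ms, end_ms) pairs, and sections are joined before short ones are discarded (the slide does not state the order).

```python
import math


def normalize_level(samples, target_dbov=-26.0):
    """Scale samples so that their RMS level equals target_dbov (full scale = 1.0)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)  # silent input: nothing to scale
    gain = 10.0 ** (target_dbov / 20.0) / rms
    return [s * gain for s in samples]


def clean_segments(segments, min_len_ms=12.0, max_gap_ms=200.0):
    """Join sections less than max_gap_ms apart, then drop sections shorter than min_len_ms."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < max_gap_ms:
            # Close enough to the previous section: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_len_ms]


if __name__ == "__main__":
    # Hypothetical VAD output in milliseconds.
    raw = [(0, 100), (150, 400), (1000, 1005), (2000, 2300)]
    print(clean_segments(raw))
```

With the hypothetical input shown, the first two sections are 50 ms apart and are joined, while the isolated 5 ms section is discarded.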


FEATURE EXTRACTION
Feature extraction is performed by three modules:
  Vocal tract model and LPC analysis
  Intrusive quality analysis with a pseudo reference signal
  Distortion-specific parameters (temporal clipping, noise, frame erasures, etc.)
43 features are obtained altogether, depicting various characteristics of speech quality.

VOCAL TRACT MODEL
The vocal tract is modeled as a series of concatenated tubes to reveal anomalies in the speech signal.
Statistics of these anomalies form speech features.

INTRUSIVE QUALITY ANALYSIS WITH A PSEUDO REFERENCE SIGNAL
A pseudo reference signal is formed from the distorted signal:
  An LP analysis is performed on the distorted signal
  The LP analysis yields coefficients that are quantized to fit the vocal tract of a typical talker
  A pseudo reference speech signal is formed
  Comparing the distorted speech signal with the pseudo reference signal yields a coarse estimate of speech quality, i.e. a feature
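The LP analysis step can be illustrated with the textbook autocorrelation method and the Levinson-Durbin recursion. This is a generic sketch, not the P.563 procedure: the quantization to a typical talker's vocal tract and the synthesis of the pseudo reference are not shown, and the function names are assumptions.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation of one frame for lags 0..max_lag."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the LP normal equations from autocorrelation values r[0..order].

    Returns (a, reflection, error) where a[0] == 1 and the prediction of
    x[n] is -sum(a[k] * x[n - k] for k in 1..order).
    """
    a = [1.0] + [0.0] * order
    error = r[0]
    reflection = []
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # degenerate (e.g. silent) frame
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error
        reflection.append(k)
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        error *= 1.0 - k * k
    return a, reflection, error


if __name__ == "__main__":
    # Autocorrelation of an idealized first-order process with correlation 0.5.
    coeffs, refl, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
    print(coeffs, refl, err)
```

The reflection coefficients produced by the recursion are the quantities that relate an LP model to a concatenated-tube picture of the vocal tract.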

MAPPING TO SPEECH QUALITY
Based on a restricted set of key parameters, the signal is assigned to a dominant distortion class.
Distortion classes are:
1. Unnaturalness of speech
2. Noise
3. Interruptions, mutes and temporal clipping
A two-step mapping process is followed:
1. An initial quality estimate is made as a linear combination of a subset of features that fall under a particular distortion class
2. A final quality estimate is made as a linear combination of the initial quality estimate and 11 additional features

THE PROPOSED MODEL

ITU-T P.563 has been chosen for feature extraction.
Reasons:
1. P.563 is the current state-of-the-art algorithm for speech quality estimation
2. It computes the most numerous and varied features relevant to speech quality
A new mapping is derived by employing symbolic regression.

SIGNIFICANCE OF VOICE ACTIVITY DETECTION (VAD)

Speakers remain silent for 50% of the conversation.
Sending only talkspurt frames saves bandwidth.
Mapping only talkspurt packets/frames reduces deviation from the target MOS ...
since talkspurt frames have a different impact on listeners' perception.
Hence mlrVAD and mblVAD are calculated for mapping.

A BRIEF INTRO TO GP
GP is a machine learning technique inspired by biological evolution; a branch of evolutionary computation.
Aimed at evolving program expressions/computer code.
Each individual encodes a symbolic expression.
Solution representation:
  A tree structure is the most popular representation.
  Other representations include graphs and linear structures such as arrays.
  The result is readily compilable source code.
Primary application area is modeling:
  Commercial application - predicting stock indices.
  Scientific application - modeling physical processes.
  Engineering application - reverse engineering, circuit design, regression, classification.
  Data mining.
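As an illustration of the tree representation, a symbolic expression can be held as nested tuples and evaluated recursively. The encoding, the function table and the variable names below are assumptions chosen for brevity, not the representation used in the experiments.

```python
import math
import operator

# Function name -> (callable, arity). A small illustrative subset.
FUNCTIONS = {
    "+": (operator.add, 2),
    "-": (operator.sub, 2),
    "*": (operator.mul, 2),
    "sin": (math.sin, 1),
    "cos": (math.cos, 1),
}


def evaluate(tree, variables):
    """Evaluate an expression tree; leaves are variable names or numeric constants."""
    if isinstance(tree, str):
        return variables[tree]
    if isinstance(tree, (int, float)):
        return float(tree)
    name, *args = tree
    func, arity = FUNCTIONS[name]
    if len(args) != arity:
        raise ValueError(f"{name} expects {arity} argument(s)")
    return func(*(evaluate(arg, variables) for arg in args))


def size(tree):
    """Number of nodes in the tree (function nodes plus leaves)."""
    if isinstance(tree, tuple):
        return 1 + sum(size(arg) for arg in tree[1:])
    return 1


if __name__ == "__main__":
    # Encodes 2.0 * x + sin(y); x and y are hypothetical input features.
    expr = ("+", ("*", 2.0, "x"), ("sin", "y"))
    print(evaluate(expr, {"x": 3.0, "y": 0.0}), size(expr))
```

Crossover and subtree mutation then amount to swapping or replacing sub-tuples, which is one reason the tree form is popular.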

A SIMPLIFIED GP BREEDING CYCLE
GP uses four steps to solve problems:
1. Generate an initial population of random compositions of the functions and terminals of the problem (computer programs).
     Functions: plus, minus, times, divide, sin, cos, log, power, sqrt.
     Terminals: can be variables (network traffic parameters) and constants.
2. Execute each program in the population and assign it a fitness value according to how well it solves the problem.
     Minimization of MSE.
3. Copy the best existing programs (selection).
     Roulette wheel selection - fitness-proportionate selection.
     Tournament selection.
4. Create new computer programs by mutation and crossover.
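The two selection schemes named in the cycle can be sketched as below, with fitness expressed as an MSE to be minimized. The weight transform `1 / (1 + mse)` used for the roulette wheel is an assumption made so that lower error gets a larger slice; the individuals and fitness values are hypothetical.

```python
import random


def tournament_select(population, fitness, size, rng):
    """Pick `size` distinct individuals at random and return the one with the lowest MSE."""
    contenders = rng.sample(range(len(population)), size)
    best = min(contenders, key=lambda i: fitness[i])
    return population[best]


def roulette_select(population, fitness, rng):
    """Fitness-proportionate selection; lower MSE maps to a larger selection weight."""
    weights = [1.0 / (1.0 + f) for f in fitness]
    return rng.choices(population, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    programs = ["p0", "p1", "p2", "p3"]   # stand-ins for evolved programs
    mse = [0.40, 0.31, 0.36, 0.52]        # hypothetical fitness values
    print(tournament_select(programs, mse, size=2, rng=rng))
    print(roulette_select(programs, mse, rng=rng))
```

A larger tournament size increases selection pressure: when the tournament covers the whole population, the best individual is always chosen.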

EXPERIMENTAL SETUP
Two GP experiments were performed with various configurations.
Commonalities:
  Each experiment comprised 50 runs
  Each run spanned 100 generations

A SIMPLIFIED GP BREEDING CYCLE: A SYMBOLIC REPRESENTATION

GP EXPERIMENTS
Common Parameters

TABLE : Common Parameters of GP experiments
Parameter                    Value
Population Size              3,000
Initial Tree Depth           6
Selection                    LPP Tournament
Tournament Size              7
Genetic Operators            Crossover, Subtree Mutation and Point Mutation
Operators Probability Type   Adaptive
Operator Probabilities       0.95, 0.1, 0.1
Survival                     Elitist
Generation Gap               1
Function Set                 +, -, *, /, sin, cos, log10, loge, sqrt, power
Terminal Set                 Random numbers in [-6, 6]; P.563 features

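GP EXPERIMENTS
Experimental Details

Experiment 1: Linear scaling MSE_s

$$\mathrm{MSE}_s(y, t) = \frac{1}{n} \sum_{i=1}^{n} \bigl(t_i - (a + b\,y_i)\bigr)^2 \qquad (1)$$

$$a = \bar{t} - b\,\bar{y}, \qquad b = \frac{\operatorname{cov}(t, y)}{\operatorname{var}(y)} \qquad (2)$$

Here y is the output of a GP individual, t is the target MOS, and the bars denote means over the n samples.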
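Equations (1) and (2) translate directly into a few lines of code. The sketch below is a plain reading of those formulas; the function names are illustrative, and the fallback of b = 0 for a constant output is an assumption the slides do not address.

```python
def linear_scaling(y, t):
    """Slope b and intercept a of equation (2): b = cov(t, y) / var(y), a = mean(t) - b * mean(y)."""
    n = len(y)
    mean_y = sum(y) / n
    mean_t = sum(t) / n
    cov_ty = sum((ti - mean_t) * (yi - mean_y) for yi, ti in zip(y, t)) / n
    var_y = sum((yi - mean_y) ** 2 for yi in y) / n
    b = cov_ty / var_y if var_y > 0.0 else 0.0  # assumed fallback for constant output
    a = mean_t - b * mean_y
    return a, b


def scaled_mse(y, t):
    """Equation (1): MSE between targets t and the linearly scaled outputs a + b * y."""
    a, b = linear_scaling(y, t)
    return sum((ti - (a + b * yi)) ** 2 for yi, ti in zip(y, t)) / len(y)


if __name__ == "__main__":
    outputs = [1.0, 2.0, 3.0, 4.0]   # hypothetical raw GP outputs
    targets = [3.0, 5.0, 7.0, 9.0]   # hypothetical targets, exactly 1 + 2 * output
    print(linear_scaling(outputs, targets), scaled_mse(outputs, targets))
```

Because a and b are the least-squares optimum for each individual, evolution only has to find an expression with the right shape; the scale and offset come for free.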

GP EXPERIMENTS
Experimental Details

Experiment 2:
A GA is used to tune the leaf coefficients of the 30 best GP trees.
A real-valued GA (chromosomes of type double) was used, with a population size of 100 and 15 generations per run.
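A real-valued GA for coefficient tuning might look like the sketch below. Only the population size of 100 and the 15 generations come from the slide; truncation selection, arithmetic crossover, Gaussian mutation, the mutation width `sigma` and the function name `tune_coefficients` are all assumptions made for illustration.

```python
import random


def tune_coefficients(loss, initial, pop_size=100, generations=15, sigma=0.1, seed=0):
    """Tune a list of real coefficients so that loss(coefficients) decreases."""
    rng = random.Random(seed)
    # Seed the population with the original coefficients plus perturbed copies.
    population = [list(initial)] + [
        [c + rng.gauss(0.0, sigma) for c in initial] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=loss)
        parents = population[: pop_size // 2]   # truncation selection (assumed)
        children = [population[0]]              # elitism: keep the best unchanged
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            w = rng.random()
            # Arithmetic crossover followed by Gaussian mutation (both assumed).
            child = [w * x + (1.0 - w) * y + rng.gauss(0.0, sigma) for x, y in zip(p1, p2)]
            children.append(child)
        population = children
    return min(population, key=loss)


if __name__ == "__main__":
    # Hypothetical loss with its optimum at coefficients (1.0, -2.0).
    def loss(c):
        return (c[0] - 1.0) ** 2 + (c[1] + 2.0) ** 2

    print(tune_coefficients(loss, [0.5, -1.5]))
```

In the setting described above, `loss` would evaluate a fixed GP tree with the candidate leaf coefficients substituted in and return its MSE on the training data.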

ON DATA COLLECTION
Eight MOS-labeled databases were employed in this research, which include:
  Speech samples of male and female speakers
  Distortion conditions:
    Signal-correlated noise
    Frame erasures
    Bit errors
    Transcoding
    Front-end clipping
    Low bit-rate coding
    Speech level variation
1,100 samples (70%) were used for training
468 samples (30%) were used for testing

STATISTICAL ANALYSIS

TABLE : Statistical analysis of the GP experiments and derived models

(a) MSE statistics for best individuals of 50 runs for Experiments 1 & 2

            Experiment 1                  Experiment 2
Stats       MSEtr    MSEte    Size        MSEtr    MSEte    Size
Mean        0.3673   0.3488   35.58       0.3618   0.3441   36.16
Std. Dev.   0.0172   0.0183   13.9972     0.0159   0.0169   17.5875
Max.        0.4049   0.4026   70          0.3885   0.3817   102
Min.        0.3239   0.3146   12          0.3271   0.3071   18

(b) Results of Mann-Whitney-Wilcoxon significance test

               Experiment 1
               RMSEtr   RMSEte   Size
Experiment 2   0        0        0
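For readers unfamiliar with the test in part (b), the sketch below computes the Mann-Whitney U statistic for two samples and a two-sided p-value from the normal approximation, without tie or continuity correction. It is a generic illustration: the function names are assumptions, the sample values are hypothetical, and how the table encodes its entries is not stated in the slides.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value from the normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / std_u
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p


if __name__ == "__main__":
    # Hypothetical per-run test MSEs from two experiments.
    runs_a = [0.35, 0.36, 0.34, 0.37, 0.33]
    runs_b = [0.34, 0.35, 0.33, 0.36, 0.32]
    print(mann_whitney_u(runs_a, runs_b))
```

With 50 runs per experiment the normal approximation is usually considered adequate; for very small samples an exact test would be preferable.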


COMPARISON WITH ITU-T P.563
Prediction gain with respect to ITU-T P.563:

$$\%PG = \frac{\mathrm{MSE}_{P.563} - \mathrm{MSE}_p}{\mathrm{MSE}_{P.563}} \times 100 \qquad (3)$$

where MSE_P.563 and MSE_p represent the MSE of ITU-T P.563 and of the proposed model, respectively, with respect to the reference MOS.

TABLE : Performance results of the proposed model versus the reference implementation of ITU-T P.563 in terms of MSE

            ITU-T P.563   GP-Based Model   Percentage Enhancement
Training    0.3937        0.3415           9.89
Testing     0.3674        0.3071           16.41
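The prediction gain of equation (3) can be checked against the table with a short calculation. The function name `prediction_gain` is an illustrative choice; the numbers are the MSE values printed in the table.

```python
def prediction_gain(mse_p563, mse_model):
    """Percentage prediction gain of the proposed model over ITU-T P.563, equation (3)."""
    return (mse_p563 - mse_model) / mse_p563 * 100.0


if __name__ == "__main__":
    print(round(prediction_gain(0.3674, 0.3071), 2))  # testing row
    print(round(prediction_gain(0.3937, 0.3415), 2))  # training row
```

The testing row reproduces the tabulated 16.41. Applying the same formula to the training row as printed gives about 13.26 rather than 9.89, so one of the training figures may have been altered in transcription.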

SELECTED FEATURES
Average pitch: differentiates between unnatural male and female voices; 50% overlapped.
Final VTP average: VTP is an array that stores the cross-sectional areas of the emulated vocal tract tubes. This feature is the mean area of the last tube over the entire length of the signal.
ART average: ART (articulators) refers to the front, middle and rear cavities of the human vocal tract. This feature represents the average size of the rear cavity.
Basic voice quality: result of the second feature extraction module.
LPC kurtosis/skewness and absolute LPC skewness: statistics of the 21 LPCs of the speech signal.
Spectral clarity: difference between the values of the pitch harmonics and the non-harmonic spectral components in the gaps between the harmonics.
Estimated segmental SNR: detects the presence of signal-correlated noise.

CONCLUSIONS
A speech quality estimation model is derived with the aid of GP.
Both standard and hybridized GP have been employed.
The new model outperforms the reference implementation of ITU-T P.563.
The resulting model is a function of a reduced set of parameters.
This may also result in a computationally efficient model for speech quality estimation.