NORTEL NETWORKS CONFIDENTIAL

Evaluation of Objective Quality Estimators: Methods used with Voice Models & Implications for Video Testing

Leigh Thorpe

Nortel CTO Services Group

VQEG Ottawa Meeting, Sept 10-14, 2007

Overview

• Evaluation goals and analysis approach

• Database characteristics

• Subjective results: internal consistency

• Specific characteristics of measurement: resolution & performance on specific types of impairment


Evaluation of Measurement Models

Want to understand how well the model predicts quality as rated by users

Need to assess performance against an evaluation database
1. How close are the predictions for a set of test cases to the subjective ratings for those same cases?
2. Does the model differentiate neighbouring points in the correct direction?

Interested in three aspects of performance:
Accuracy: is the model good at predicting the subjective rating
Resolution/Monotonicity:


Three methods of analysis

(1) Graphical: scatterplot and regression line
• plot subjective scores on x-axis, objective measure on y-axis.
• the spread of dots shows visually how closely the variables track each other and how close their relationship is to the ideal (the main diagonal)
• by inspection can see how subgroups behave compared to overall performance.

(2) The correlation coefficient
• r, the Pearson Product-Moment Correlation, measures strength of linear relationship, the tendency for two variables to increase or decrease together
• does not indicate how close the values of the two variables are
• perfect correlation gives r = 1 or −1; no relationship gives r = 0
• the measurement units for the two variables may be same or different
• the number of points and the dynamic range of the variables (difference from highest to lowest) will each affect the value of the correlation coefficient

(3) The Standard Error of Estimates (SEE)
• a measure of deviation of the dependent variable from its regression line
• can compute a score for subsets of the conditions tested
• SEE is a measure of deviation: smaller is better. The closer the points are to the line (the better the prediction), the smaller the SEE value.
• SEE is a measure of dispersion similar to standard deviation, and behaves like standard deviation
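The three quantities above can be computed in a few lines. The following is a minimal sketch, not the analysis code used for the study: the data values are hypothetical, and SEE is taken here as the RMS deviation from the regression line (dividing by n), whereas some texts divide by n − 2.

```python
import math

def linear_fit(x, y):
    """Least-squares regression line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def see(x, y, slope, intercept):
    """RMS deviation of y from the given line (assumption: divide by n)."""
    squared = [(yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical per-condition values: subjective MOS (x) and model output (y).
mos = [1.3, 1.9, 2.4, 2.8, 3.3, 3.7, 4.1, 4.5]
model = [1.6, 1.8, 2.7, 2.6, 3.5, 3.4, 4.0, 4.2]

slope, intercept = linear_fit(mos, model)
print(f"regression line: y = {slope:.3f}x + {intercept:.3f}")
print(f"r   = {pearson_r(mos, model):.3f}")
print(f"SEE = {see(mos, model, slope, intercept):.3f}")
```

The scatterplot itself is omitted to keep the sketch dependency-free; any plotting tool can draw the points and the fitted line from `slope` and `intercept`.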


Performance on subgroups of points

What correlation tells us

Computing the correlation coefficient for a subgroup can mislead us about how the subgroup relates to the overall group.

The red points show a different relationship between the variables than is seen for the overall group.

The correlation for those points tells us about their relationship to each other, but not to the rest of the data.

[Figure: scatterplot in which a subgroup of points is highlighted in red; r values shown on the slide: 0.83, 0.94]

What SEE tells us

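[Figure: scatterplot with regression line, marking the deviations from the line for two groups of points (yellow and red)]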

Analogous to a standard deviation, SEE is the square root of the average of squared deviations. It is the RMS deviation from the regression line for a given set of points. It can be calculated for any set of points with sufficient n, say n ≥ 6.

Compare two groups of points: SEE is smaller for the yellow deviations than for the red deviations.

SEE is in the same units as the variable for which it captures the variation. For this example, SEE has the units of y.
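As a sketch of how SEE might be computed per subgroup, the code below fits one regression line to all points and then takes the RMS deviation from that line for each group. The group names and data are hypothetical, the minimum of six points follows the suggestion above, and measuring each subgroup against the overall line (rather than its own line) is an assumption of this sketch.

```python
import math

MIN_POINTS = 6  # suggested minimum n for a meaningful SEE

def linear_fit(points):
    """Least-squares line through (x, y) pairs; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def see_for_points(points, slope, intercept):
    """RMS deviation of the points' y values from the given line."""
    if len(points) < MIN_POINTS:
        raise ValueError(f"need at least {MIN_POINTS} points")
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical (subjective, objective) pairs for two subgroups.
groups = {
    "tight": [(1.0, 1.1), (1.8, 1.7), (2.6, 2.7), (3.4, 3.3), (4.2, 4.3), (5.0, 4.9)],
    "loose": [(1.0, 1.5), (1.8, 1.3), (2.6, 3.1), (3.4, 2.9), (4.2, 4.7), (5.0, 4.5)],
}

all_points = [p for pts in groups.values() for p in pts]
slope, intercept = linear_fit(all_points)

see_by_group = {name: see_for_points(pts, slope, intercept)
                for name, pts in groups.items()}
see_by_group["overall"] = see_for_points(all_points, slope, intercept)

for name, value in see_by_group.items():
    print(f"{name:8s} SEE = {value:.2f}")
```

Because every group is scored against the same line and in the units of y, the per-group values can be compared directly with each other and with the overall value.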


Evaluation Samples: The “Database”

The evaluation database consists of:
• a number of samples of the signal of interest
• a mean subjective rating for each sample

Ideally,
• the database should contain samples (test cases) covering the full range of types and levels of impairments that the model will encounter in usage conditions.
• single database: all subjects have rated all test cases
• where multiple databases are used, there should be sufficient common test cases across the databases to show whether the subjective ratings line up


Criteria used for new Voice Qual Database

• Cover a broad range of impairment types and levels
  • different types of codecs, range of packet loss
  • background noise (for these cases, noise is in the reference)
  • combinations of these: coding, noise, packet loss, tandeming

• Two languages: English, French

• Multiple talkers
  • eight: four per language

• Include conditions that will challenge candidate methods
  • time warping (temporal shift) and noise reduction

• A large number of judgments to obtain stable scores
  • We used n = 60 for each sample


Effect of Truncating Quality Range

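[Figure: P.563 compared to Subjective Ratings (per-condition), full-range database. Axes: ACR MOS and P.563, both 1 to 5. Linear fit: y = 0.581x + 1.2929, R² = 0.7167, r = 0.85]

[Figure: P.563 compared to Subjective Ratings (per-condition), restricted-range database. Axes: ACR MOS and P.563, both 1 to 5. Linear fit: y = 0.2858x + 2.4483, R² = 0.2858, r = 0.53]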

This small-range database is simulated from the above by restricting the range of subjective values. (Care was taken in the simulation to keep the number of points about the same.) The range restriction reduced the correlation coefficient from 0.85 to 0.53.
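The effect can be reproduced with synthetic data. The sketch below is an illustration under stated assumptions, not the simulation used for the slide: the noise level, the restricted interval (3.0 to 4.5 MOS), the seed and the sample size are all arbitrary choices. As in the slide, the two sets are compared with the same number of points.

```python
import math
import random

def pearson_r(points):
    """Pearson r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2007)  # fixed seed so the run is repeatable

# Hypothetical full-range data: subjective score spread over 1..5,
# objective score equal to the subjective score plus Gaussian noise.
full = []
for _ in range(300):
    mos = rng.uniform(1.0, 5.0)
    full.append((mos, mos + rng.gauss(0.0, 0.6)))

# Keep only the points whose subjective score lies in a narrow band.
restricted = [(m, s) for m, s in full if 3.0 <= m <= 4.5]

# Compare against a full-range subset with the same number of points.
same_n_full = full[:len(restricted)]

r_full = pearson_r(same_n_full)
r_restricted = pearson_r(restricted)
print(f"n = {len(restricted)} points in each set")
print(f"full range:       r = {r_full:.2f}")
print(f"restricted range: r = {r_restricted:.2f}")
```

The model's error is identical in both sets; only the spread of the subjective scores differs, and that alone lowers r.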


Database details

Languages tested separately;
• listeners were native speakers of language heard

Samples 6 – 8 sec duration
• each made up of two unrelated sentences from same talker

Four talkers per language; talkers crossed with conditions
• 1304 samples (326 x 4)

Test room ambient noise low

Presented at nominal telephone listening volume

Too many samples to complete in one session:
• samples were divided across four test sessions
• each session included one instance of each condition
• the four talkers were represented equally in all sessions
• therefore, every listener heard every test case, but not always with the same talker


Internal Consistency of Database: English

[Figure: Subjective Ratings: Internal Consistency (per-condition, data split arbitrarily), English samples. x-axis: ratings by first group of 30 subjects (ACR MOS); y-axis: ratings by second group of 30 subjects (ACR MOS); both axes 1 to 5. Linear fit: y = 1.0315x - 0.1297, R² = 0.9903, r = 0.995]

This is the upper limit of performance that can be detected with this database.

[Figure: English Database: Internal Consistency (per-condition means, arbitrary split). Axes: one half vs. other half. r = 0.995]

The variability of these samples indicates a resolution of about 0.25 MOS, as would be expected for n = 30 (i.e., half of the 60 ratings per sample).
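A split-half check of this kind can be sketched as follows. The ratings here are simulated (60 hypothetical subjects, 20 hypothetical conditions, an assumed rating noise of 0.8 MOS), and the "arbitrary split" is implemented as alternating subjects; none of these choices come from the study itself.

```python
import math
import random

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def split_half_r(ratings):
    """ratings[s][c] is the score subject s gave to condition c.

    Splits the subjects into two halves (even and odd index), computes the
    per-condition mean for each half, and correlates the two sets of means.
    """
    first, second = ratings[0::2], ratings[1::2]
    n_conditions = len(ratings[0])
    mean_first = [sum(s[c] for s in first) / len(first)
                  for c in range(n_conditions)]
    mean_second = [sum(s[c] for s in second) / len(second)
                   for c in range(n_conditions)]
    return pearson_r(mean_first, mean_second)

# Simulated ACR ratings on the 1..5 scale (all parameters are assumptions).
rng = random.Random(60)
true_quality = [1.0 + 4.0 * c / 19 for c in range(20)]

def simulate_rating(quality):
    return min(5, max(1, round(quality + rng.gauss(0.0, 0.8))))

ratings = [[simulate_rating(q) for q in true_quality] for _ in range(60)]
print(f"split-half r = {split_half_r(ratings):.3f}")
```

A model cannot be shown to agree with the subjective means better than the two halves of the panel agree with each other, which is why this value serves as a ceiling.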


Internal Consistency of Database: French

[Figure: French Database: Internal Consistency (per-condition, arbitrary split), French samples. x-axis: one half (ACR MOS); y-axis: other half (ACR MOS); both axes 1 to 5. r = 0.995]


Correlation Coefficient (r) by Algorithm

Subj Data    Model A   Model B   Model C   Model D
English      0.93      0.92      0.85      0.90
French       0.90      0.90      0.78      0.87
Merged       0.91      0.90      0.82      0.83
Averaged*    0.93      0.92      0.84      0.91

* This is the correlation for French and English scores averaged together, not the average of the correlation coefficients!
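The distinction drawn in the note can be made concrete with a toy example. The sketch assumes that "averaged" means averaging the English and French scores condition by condition (both subjective and objective) and then computing one correlation; the five conditions and their scores are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def r_of_averaged_scores(mos_a, obj_a, mos_b, obj_b):
    """Average the two languages per condition, then correlate once."""
    mos_avg = [(a + b) / 2 for a, b in zip(mos_a, mos_b)]
    obj_avg = [(a + b) / 2 for a, b in zip(obj_a, obj_b)]
    return pearson_r(mos_avg, obj_avg)

# Hypothetical per-condition scores for the same five conditions.
mos_eng = [1.0, 2.0, 3.0, 4.0, 5.0]
obj_eng = [1.5, 1.5, 3.5, 3.5, 5.0]
mos_fr = [1.0, 2.0, 3.0, 4.0, 5.0]
obj_fr = [0.5, 2.5, 2.5, 4.5, 5.0]

r_eng = pearson_r(mos_eng, obj_eng)
r_fr = pearson_r(mos_fr, obj_fr)
mean_of_rs = (r_eng + r_fr) / 2
r_avg_scores = r_of_averaged_scores(mos_eng, obj_eng, mos_fr, obj_fr)

print(f"r English = {r_eng:.3f}, r French = {r_fr:.3f}")
print(f"average of the two coefficients = {mean_of_rs:.3f}")
print(f"correlation of averaged scores  = {r_avg_scores:.3f}")
```

In this constructed case the per-language errors cancel when the scores are averaged, so the two quantities differ visibly; with real data the gap may be smaller, but the two are still different statistics.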


Results for Model A

[Figure: PESQ (raw) compared to Subjective Ratings (per-condition). Axes 1 to 5; axis label: ACR MOS. Series: P.862 (raw) ENG, P.862 (raw) FR. r = 0.93]

The spread of these points shows that Model A can resolve subjective quality to no better than about 0.5 MOS.


Results for Model C


[Figure: scatterplot, axes 1 to 5; axis label: ACR MOS. Series: P.563 ENG, P.563 FR. r = 0.84]

This model shows a tendency to compress the range of its output score, relative to the subjective scores.

There are a number of outliers in the lower left quadrant.

The mid-range resolution is about 3/4 MOS.


Example: data plotted by subgroup

[Figure: ANIQUE+, per condition, plotted by subgroup; axes 1 to 5. Legend: Clean, MNRU, Codecs, Random Packet Loss, Packet Loss CR, Bursty Packet Loss, Packet Loss CB, Temporal Clipping, Background Noise, Noise with packet loss, Noise Reduction]


Example of results for subgroups

SEE* values (Combined)

Category                Model A   Model B   Model C   Model D
MNRU                    0.41      0.49      0.29      0.30
Codecs                  0.27      0.29      0.29      0.18
Random Packet Loss      0.23      0.31      0.23      0.29
Constrained Random PL   0.24      0.34      0.22      0.29
Bursty Packet Loss      0.16      0.23      0.32      0.33
Constrained Bursty PL   0.20      0.26      0.21      0.17
Temporal Clipping       0.30      0.32      0.22      0.29
Noise                   0.22      0.32      0.41      0.27
Noise + Packet Loss     0.25      0.32      0.40      0.26
Noise Reduction         0.22      0.32      0.28      0.21
Overall                 0.23      0.30      0.30      0.26

*based on means across languages


What can we learn from the voice metric testing that can assist in evaluation of video metrics?

1. Ensure the use of a range of quality in the subjective test samples (next slide).
• this can affect the correlation observed

2. Include all the impairments you are going to want to assess with the model, or that may be encountered in signals that pass through networks.

3. Within reason, any subjective metric can be used, as long as it is sufficiently sensitive to the variation in quality over the range used. It doesn’t need to be MOS.

4. Collect data from as many viewers as practicable
• n > 30 if possible

5. Examine internal consistency of subjective ratings

6. Examine performance of the models on subgroups within the data
• select a statistic that provides an unbiased result.
• (r is not unbiased in this application).
• SEE statistic provides credible alternative

7. Examine resolution and monotonicity
• quantitative metrics??


Interpreting regression and correlation

[Figure: four example scatterplots with regression lines; the captions follow]

Weak relationship: the points fall far from the line, and the cloud of points is about as long as it is wide. It looks as though a line in any direction would be as good.

Strong relationship and the line is very similar to the diagonal: on average, the objective measure is closely tracking subjective score. For MOS prediction, this is the most desirable result.

Strong relationship, but the line is canted relative to the diagonal: the objective measure is using a smaller range than the subjective score. Note: the value of the correlation coefficient does not indicate whether the line tracks the diagonal.

Deviation from linear: the objective measure follows the diagonal for the lower portion, but underestimates the quality of the conditions in the upper range. We can compute a regression line, but it will not account for the non-linearity. We could compute a best fit curve, but there is no “correlation” statistic to indicate the strength of a non-linear relationship.
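The "canted line" case can be checked numerically. In this sketch (hypothetical data; the compression factor of 0.5 around a midpoint of 3.0 is an arbitrary choice), an objective measure that tracks the diagonal is squeezed into half its range: the slope of the regression line halves, while r does not change at all.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def slope_of_fit(x, y):
    """Slope of the least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

# Hypothetical subjective scores and an objective measure near the diagonal.
mos = [1.2, 1.7, 2.3, 2.9, 3.4, 3.8, 4.3, 4.8]
objective = [1.4, 1.6, 2.5, 2.8, 3.3, 4.0, 4.2, 4.7]

# The same measure with its output range compressed around the midpoint.
compressed = [3.0 + 0.5 * (v - 3.0) for v in objective]

r_diag, r_cant = pearson_r(mos, objective), pearson_r(mos, compressed)
slope_diag, slope_cant = slope_of_fit(mos, objective), slope_of_fit(mos, compressed)

print(f"near diagonal: slope = {slope_diag:.2f}, r = {r_diag:.3f}")
print(f"compressed:    slope = {slope_cant:.2f}, r = {r_cant:.3f}")
```

This is why the regression line (or the scatterplot) has to be inspected alongside r when the goal is to predict MOS values rather than merely rank conditions.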


Working with correlation (1)

Correlation coefficients cannot be averaged. Why not?

[Figure: three scatterplots labeled Database A, Database B, and Databases A & B Merged. r values shown on the slide: 0.94, 0.92, 0.93, 0.65]

Correlation is not a linear process, and so the correlations cannot be treated with linear operations (like averaging).
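A constructed example makes the same point with numbers. The two hypothetical databases below each show a very high correlation, but the objective scores in one are offset relative to the other, so the correlation over the merged points is far lower than the average of the two coefficients. The data are invented and are not the databases shown on the slide.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical Database A: lower subjective scores, objective scores run high.
mos_a = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
obj_a = [2.6, 3.0, 3.6, 3.9, 4.5, 4.9]

# Hypothetical Database B: higher subjective scores, objective scores run low.
mos_b = [2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
obj_b = [1.1, 1.4, 2.1, 2.4, 3.0, 3.5]

r_a = pearson_r(mos_a, obj_a)
r_b = pearson_r(mos_b, obj_b)
mean_of_rs = (r_a + r_b) / 2

# Merging means pooling the points, then computing a single correlation.
r_merged = pearson_r(mos_a + mos_b, obj_a + obj_b)

print(f"r (A) = {r_a:.3f}, r (B) = {r_b:.3f}")
print(f"average of coefficients = {mean_of_rs:.3f}")
print(f"r (A and B merged)      = {r_merged:.3f}")
```

The average of the coefficients hides the disagreement between the databases; the merged correlation reveals it.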


Nortel Database

Summary of Impairment Conditions

326 cases x 4 talkers x 2 languages = 2608 test samples in the database

Category                No. of Cases   Range of Quality
Clean                   2              High quality only
MNRU                    7              5 - 35 dBQ
Codecs                  7              G.711, G.729, AMR, tandem
Random Packet Loss      54             1% - 10% PL, 10, 20, 30 ms packets
Constrained Random PL   22             same speech & mask for each codec
Bursty Packet Loss      54             1% - 10% PL, 10, 20, 30 ms packets
Constrained Bursty PL   22             same speech & mask for each codec
Temporal Clipping       21             15-60 ms clip, +/-80 ms shift, 120 ms mute
Noise                   33             20, 10, 0 dB SNR, Hoth, car, babble, street
Noise + Packet Loss     54             2%, 4%, random & bursty
Noise Reduction         48             good and poor noise reduction algorithm
Total                   326


Results for Model A by subgroup

[Figure: scatterplot by subgroup, English; axes 1 to 5. Panel labels: ANIQUE+ per condition; P.862 (raw) per condition. Legend: Clean, MNRU, Codecs, Random Packet Loss, Packet Loss CR, Bursty Packet Loss, Packet Loss CB, Temporal Clipping, Background Noise, Noise with packet loss, Noise Reduction]
