2 NORTEL NETWORKS CONFIDENTIAL
Evaluation of Objective Quality Estimators: Methods used with Voice Models & Implications for Video Testing
Leigh Thorpe
Nortel CTO Services Group
VQEG Ottawa Meeting, Sept 10-14, 2007
Overview
• Evaluation goals and analysis approach
• Database characteristics
• Subjective results: internal consistency
• Specific characteristics of measurement: resolution & performance on specific types of impairment
Evaluation of Measurement Models
Want to understand how well the model predicts quality as rated by users
Need to assess performance against an evaluation database
1. How close are the predictions for a set of test cases to the subjective ratings for those same cases?
2. Does the model differentiate neighbouring points in the correct direction?
Interested in three aspects of performance:
Accuracy: is the model good at predicting the subjective rating
Resolution/Monotonicity:
Three methods of analysis
(1) Graphical: scatterplot and regression line
• plot subjective scores on x-axis, objective measure on y-axis.
• the spread of dots shows visually how closely the variables track each other and how close their relationship is to the ideal (the main diagonal)
• by inspection can see how subgroups behave compared to overall performance.
(2) The correlation coefficient
• r, the Pearson Product-Moment Correlation, measures strength of linear relationship, the tendency for two variables to increase or decrease together
• does not indicate how close the values of the two variables are
• perfect correlation gives r = 1 or −1; no relationship gives r = 0
• the measurement units for the two variables may be same or different
• the number of points and the dynamic range of the variables (difference from highest to lowest) will each affect the value of the correlation coefficient
(3) The Standard Error of Estimates (SEE)
• a measure of deviation of the dependent variable from its regression line
• can compute a score for subsets of the conditions tested
• SEE is a measure of deviation: smaller is better. The closer the points are to the line (the better the prediction), the smaller the SEE value.
• SEE is a measure of dispersion similar to standard deviation, and behaves like standard deviation
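For reference, the two statistics can be written out as follows. This is a sketch under stated assumptions: x_i denotes the subjective score and y_i the objective measure for test case i (matching the axis convention above), a and b are the least-squares intercept and slope, and SEE is taken as the RMS deviation from the regression line, as the slides describe it. Many textbooks divide by n − 2 rather than n; the slides do not say which was used.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}

\hat{y}_i = a + b\,x_i, \qquad
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
a = \bar{y} - b\,\bar{x}

\mathrm{SEE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
```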
Performance on subgroups of points
What correlation tells us
Computing the correlation coefficient for a subgroup can mislead us about how the subgroup relates to the overall group.
The red points show a different relationship between the variables than is seen for the overall group.
The correlation for those points tells us about their relationship to each other, but not to the rest of the data.
[Figure: scatterplot with one subgroup of points shown in red. Labels: r = 0.83, r = 0.94.]
What SEE tells us
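[Figure: scatterplot with regression line; deviations from the line are shown in yellow for one group of points and in red for another.]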
Analogous to a standard deviation, SEE is the square root of the average of squared deviations. It is the RMS deviation from the regression line for a given set of points. It can be calculated for any set of points with sufficient n, say n ≥ 6.
Compare two groups of points: SEE is smaller for the yellow deviations than for the red deviations.
SEE is in the same units as the variable for which it captures the variation. For this example, SEE has the units of y.
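The contrast between the two statistics on a subgroup can be sketched in a few lines of Python. The data below are hypothetical (not taken from the Nortel database), and the sketch assumes that subgroup deviations are measured from the regression line fitted to all points, with SEE computed as a plain RMS deviation.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def see(xs, ys, slope, intercept):
    """RMS deviation of the points from the given line (same units as y)."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical scores: a main group near the diagonal, and a subgroup that is
# tightly linear within itself but sits well below the main trend.
main_x = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
main_y = [1.1, 1.4, 2.1, 2.4, 3.1, 3.4, 4.1, 4.4, 5.1]
sub_x = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
sub_y = [1.0, 1.3, 1.5, 1.8, 2.0, 2.3]

# One regression line for the whole data set.
slope, intercept = fit_line(main_x + sub_x, main_y + sub_y)

r_sub = pearson_r(sub_x, sub_y)                # relationship within the subgroup only
see_main = see(main_x, main_y, slope, intercept)
see_sub = see(sub_x, sub_y, slope, intercept)  # subgroup deviation from the overall line

print(f"subgroup r = {r_sub:.3f}")
print(f"SEE main = {see_main:.2f}, SEE subgroup = {see_sub:.2f}")
```

In this toy data the subgroup's r is close to 1 even though its points lie furthest from the overall line; the subgroup SEE is what exposes the mismatch.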
Evaluation Samples: The “Database”
The evaluation database consists of:
• a number of samples of the signal of interest
• a mean subjective rating for each sample
Ideally,
• the database should contain samples (test cases) covering the full range of types and levels of impairments that the model will encounter in usage conditions.
• single database: all subjects have rated all test cases
• where multiple databases are used,
  • there should be sufficient common test cases across the databases to show whether the subjective ratings line up
Criteria used for new Voice Qual Database
• Cover a broad range of impairment types and levels
  • different types of codecs, range of packet loss
  • background noise (for these cases, noise is in the reference)
  • combinations of these: coding, noise, packet loss, tandeming
• Two languages: English, French
• Multiple talkers
  • eight: four per language
• Include conditions that will challenge candidate methods
  • time warping (temporal shift) and noise reduction
• A large number of judgments to obtain stable scores
  • We used n = 60 for each sample
Effect of Truncating Quality Range
[Figure: P.563 compared to Subjective Ratings (per-condition), full quality range. Axes run from 1 to 5; axis label: ACR MOS; series: P.563 with linear fit. Fit: y = 0.581x + 1.2929, R² = 0.7167, r = 0.85.]
[Figure: P.563 compared to Subjective Ratings (per-condition), restricted quality range. Same axes and series. Fit: y = 0.2858x + 2.4483, R² = 0.2858, r = 0.53.]
The small-range database is simulated from the full-range one by restricting the range of subjective values. (Care was taken in the simulation to keep the number of points about the same.) The range restriction reduced the correlation coefficient from 0.85 to 0.53.
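The range-restriction effect can be reproduced with synthetic data. The sketch below is illustrative only: the linear model, the noise level (0.5), the two ranges, and the point count are all assumptions, not values from the P.563 evaluation. Both data sets use the same number of points and the same noise, so only the range of the x variable differs.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate(lo, hi, n, noise_sd, rng):
    """Subjective scores uniform in [lo, hi]; objective score = subjective + noise."""
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    ys = [x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

rng = random.Random(2007)  # fixed seed so the run is repeatable

# Same number of points and same prediction noise; only the quality range differs.
full_x, full_y = simulate(1.0, 5.0, 200, 0.5, rng)
narrow_x, narrow_y = simulate(3.0, 4.0, 200, 0.5, rng)

r_full = pearson_r(full_x, full_y)
r_narrow = pearson_r(narrow_x, narrow_y)

print(f"full range   r = {r_full:.2f}")
print(f"narrow range r = {r_narrow:.2f}")
```

The prediction error is identical in both cases, yet the narrow-range correlation comes out clearly lower, which is why r alone is a poor basis for comparing models across databases with different quality ranges.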
Database details
Languages tested separately
• listeners were native speakers of language heard
Samples 6 – 8 sec duration
• each made up of two unrelated sentences from same talker
Four talkers per language; talkers crossed with conditions
• 1304 samples (326 x 4)
Test room ambient noise low
Presented at nominal telephone listening volume
Too many samples to complete in one session:
• samples were divided across four test sessions
• each session included one instance of each condition
• the four talkers were represented equally in all sessions
• therefore, every listener heard every test case, but not always with the same talker
Internal Consistency of Database: English
[Figure: English Database: Internal Consistency (per-condition means, data split arbitrarily). x-axis: ratings by first group of 30 subjects (ACR MOS, 1 to 5); y-axis: ratings by second group of 30 subjects (ACR MOS, 1 to 5). Fit: y = 1.0315x - 0.1297, R² = 0.9903, r = 0.995.]
English samples.
This is the upper limit of performance that can be detected with this database.
The variability of these samples indicates a resolution of about 0.25 MOS, as would be expected for n = 30 (i.e., half).
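The split-half check can be sketched as follows. Everything about the data here is hypothetical: 50 conditions, 60 simulated listeners, and a per-vote noise standard deviation of 0.7 are illustrative assumptions, not properties of the Nortel database. The procedure itself is the one described above: split the listeners arbitrarily into two groups of 30, compute per-condition means for each group, and correlate the two sets of means.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(30)  # fixed seed so the run is repeatable
n_conditions, n_subjects = 50, 60

# Hypothetical "true" quality of each condition on the 1-5 scale.
true_quality = [rng.uniform(1.0, 5.0) for _ in range(n_conditions)]

def vote(quality):
    """One simulated ACR vote: true quality plus noise, rounded and clamped to 1..5."""
    return min(5, max(1, round(quality + rng.gauss(0.0, 0.7))))

# ratings[c][s] is the vote of subject s on condition c.
ratings = [[vote(q) for _ in range(n_subjects)] for q in true_quality]

# Arbitrary split of the listeners into two halves.
subjects = list(range(n_subjects))
rng.shuffle(subjects)
half_a, half_b = subjects[:30], subjects[30:]

mos_a = [sum(row[s] for s in half_a) / len(half_a) for row in ratings]
mos_b = [sum(row[s] for s in half_b) / len(half_b) for row in ratings]

r_split = pearson_r(mos_a, mos_b)
print(f"split-half r = {r_split:.3f}")
```

The resulting coefficient gives a ceiling for what any objective model could be expected to achieve against the same ratings.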
Internal Consistency of Database: French
[Figure: French Database: Internal Consistency (per-condition, arbitrary split). x-axis: one half of the subjects (ACR MOS, 1 to 5); y-axis: other half (ACR MOS, 1 to 5). r = 0.995.]
French samples
Correlation Coefficient (r) by Algorithm
Subj Data    Model A   Model B   Model C   Model D
English      0.93      0.92      0.85      0.90
French       0.90      0.90      0.78      0.87
Merged       0.91      0.90      0.82      0.83
Averaged*    0.93      0.92      0.84      0.91
* This is the correlation for French and English scores averaged together, not the average of the correlation coefficients!
Results for Model A
[Figure: PESQ (raw) compared to Subjective Ratings (per-condition). Axes run from 1 to 5; axis label: ACR MOS; series: P.862 (raw) ENG, P.862 (raw) FR. r = 0.93.]
The spread of these points shows that Model A can resolve subjective quality to no better than about 0.5 MOS.
Results for Model C
[Figure: P.563 compared to Subjective Ratings (per-condition). Axes run from 1 to 5; axis label: ACR MOS; series: P.563 ENG, P.563 FR. r = 0.84.]
This model shows a tendency to compress the range of its output score, relative to the subjective scores.
There are a number of outliers in the lower left quadrant.
The mid-range resolution is about 3/4 MOS.
Example: data plotted by subgroup
[Figure: scatterplot, axes from 1 to 5; axis label: ANIQUE+ per condition. Points keyed by subgroup: Clean, MNRU, Codecs, Random Packet Loss, Packet Loss CR, Bursty Packet Loss, Packet Loss CB, Temporal Clipping, Background Noise, Noise with packet loss, Noise Reduction.]
Example of results for subgroups
SEE* values (Combined)
Category                Model A   Model B   Model C   Model D
MNRU                    0.41      0.49      0.29      0.30
Codecs                  0.27      0.29      0.29      0.18
Random Packet Loss      0.23      0.31      0.23      0.29
Constrained Random PL   0.24      0.34      0.22      0.29
Bursty Packet Loss      0.16      0.23      0.32      0.33
Constrained Bursty PL   0.20      0.26      0.21      0.17
Temporal Clipping       0.30      0.32      0.22      0.29
Noise                   0.22      0.32      0.41      0.27
Noise + Packet Loss     0.25      0.32      0.40      0.26
Noise Reduction         0.22      0.32      0.28      0.21
Overall                 0.23      0.30      0.30      0.26
* based on means across languages
What can we learn from the voice metric testing that can assist in evaluation of video metrics?
1. Ensure the use of a wide range of quality in the subjective test samples (next slide).
  • this can affect the correlation observed
2. Include all the impairments you are going to want to assess with the model, or that may be encountered in signals that pass through networks.
3. Within reason, any subjective metric can be used, as long as it is sufficiently sensitive to the variation in quality over the range used. It doesn’t need to be MOS.
4. Collect data from as many viewers as practicable
  • n > 30 if possible
5. Examine internal consistency of subjective ratings
6. Examine performance of the models on subgroups within the data
  • select a statistic that provides an unbiased result.
  • (r is not unbiased in this application).
  • SEE statistic provides credible alternative
7. Examine resolution and monotonicity
  • quantitative metrics??
Interpreting regression and correlation
[Figure: four scatterplots with regression lines, one for each of the cases described below.]
Weak relationship: the points fall far from the line, and the cloud of points is about as long as it is wide. It looks as though a line in any direction would be as good.
Strong relationship and the line is very similar to the diagonal: on average, the objective measure is closely tracking subjective score. For MOS prediction, this is the most desirable result.
Strong relationship, but the line is canted relative to the diagonal: the objective measure is using a smaller range than the subjective score. Note: the value of the correlation coefficient does not indicate whether the line tracks the diagonal.
Deviation from linear: the objective measure follows the diagonal for the lower portion, but underestimates the quality of the conditions in the upper range. We can compute a regression line, but it will not account for the non-linearity. We could compute a best fit curve, but there is no “correlation” statistic to indicate the strength of a non-linear relationship.
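The "canted line" case can be illustrated numerically. The data below are hypothetical: an objective measure that compresses the 1 to 5 subjective range into roughly 2 to 4, with a small alternating error added. The sketch reports both the correlation coefficient and the regression slope, since only the slope shows whether the line tracks the diagonal.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical subjective scores spanning the full 1-5 range.
subjective = [1.0 + 0.5 * i for i in range(9)]

# Hypothetical objective measure that compresses the range (slope 0.5),
# with a small alternating error of +/- 0.1.
objective = [0.5 * x + 1.5 + 0.1 * (-1) ** i for i, x in enumerate(subjective)]

r = pearson_r(subjective, objective)
slope, intercept = fit_line(subjective, objective)

print(f"r = {r:.3f}")
print(f"regression line: y = {slope:.2f}x + {intercept:.2f} (the diagonal is y = 1.00x + 0.00)")
```

The correlation is high even though the fitted slope is only about half that of the diagonal, so a strong r by itself does not show that the model predicts MOS values accurately.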
Working with correlation (1)
Correlation coefficients cannot be averaged. Why not?
[Figure: three scatterplots. Database A: r = 0.94. Database B: r = 0.92. Databases A & B Merged: r = 0.65. A further label, r = 0.93, matches the average of the two separate coefficients.]
Correlation is not a linear process, and so the correlations cannot be treated with linear operations (like averaging).
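A small numerical sketch shows why averaging coefficients misleads. The two "databases" below are hypothetical: each is tightly linear in its own right, but the second is offset by 1.5 points, as might happen when two subjective tests do not line up. The correlation of the merged data is computed directly and compared with the average of the two separate coefficients.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: the same x values in both databases, small alternating error.
xs = [1.0 + 0.5 * i for i in range(9)]
wobble = [0.2 * (-1) ** i for i in range(9)]

a_x, a_y = xs, [x + w for x, w in zip(xs, wobble)]        # Database A: on the diagonal
b_x, b_y = xs, [x - 1.5 + w for x, w in zip(xs, wobble)]  # Database B: offset by 1.5

r_a = pearson_r(a_x, a_y)
r_b = pearson_r(b_x, b_y)
r_merged = pearson_r(a_x + b_x, a_y + b_y)  # correlation of the pooled points
r_averaged = (r_a + r_b) / 2                # NOT a valid summary of the pooled data

print(f"r_a = {r_a:.3f}, r_b = {r_b:.3f}")
print(f"average of coefficients = {r_averaged:.3f}")
print(f"coefficient of merged data = {r_merged:.3f}")
```

The merged coefficient falls well below either individual value, because the offset between the databases adds scatter that neither coefficient sees on its own.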
Nortel Database
Summary of Impairment Conditions
326 cases x 4 talkers x 2 languages = 2608 test samples in the database
Category                No. of Cases   Range of Quality
MNRU                    7              5 - 35 dBQ
Clean                   2              High quality only
Codecs                  7              G.711, G.729, AMR, tandem
Random Packet Loss      54             1% - 10% PL, 10, 20, 30 ms packets
Constrained Random PL   22             same speech & mask for each codec
Bursty Packet Loss      54             1% - 10% PL, 10, 20, 30 ms packets
Constrained Bursty PL   22             same speech & mask for each codec
Temporal Clipping       21             15-60 ms clip, +/-80 ms shift, 120 ms mute
Noise                   33             20, 10, 0 dB SNR, Hoth, car, babble, street
Noise + Packet Loss     54             2%, 4%, random & bursty
Noise Reduction         48             good and poor noise reduction algorithm
Total                   326
Results for Model A by subgroup
[Figure: scatterplot by subgroup, English, axes from 1 to 5. Axis labels: ANIQUE+ per condition; P.862 (raw) per condition. Points keyed by subgroup: Clean, MNRU, Codecs, Random Packet Loss, Packet Loss CR, Bursty Packet Loss, Packet Loss CB, Temporal Clipping, Background Noise, Noise with packet loss, Noise Reduction.]