8/13/2019 Artigo Panel Check
ORIGINAL PAPER
Analysing sensory panel performance in a proficiency test using the PanelCheck software
Oliver Tomic · Giorgio Luciano · Asgeir Nilsen · Grethe Hyldig · Kirsten Lorensen · Tormod Næs
Received: 5 May 2009 / Revised: 28 October 2009 / Accepted: 9 November 2009 / Published online: 2 December 2009
© Springer-Verlag 2009
Abstract This paper discusses statistical methods and a workflow strategy for comparing performance across multiple sensory panels that participated in a proficiency test (also referred to as an inter-laboratory test). Performance comparison and analysis are based on a data set collected from 26 sensory panels carrying out profiling on the same set of candy samples. The candy samples were produced according to an experimental design using design factors such as sugar and acid level. Because of the exceptionally large amount of data and the availability of multiple statistical and graphical tools in the PanelCheck software, a workflow is proposed that guides the user through the data analysis process. This allows practitioners and non-statisticians to get an overview of panel performances in a rapid manner without the need to be familiar with details of the statistical methods. Visualisation of data analysis results plays an important role, as this provides a time-saving and efficient way of screening and investigating sensory panel performances. Most of the statistical methods used in this paper are available in the open source software PanelCheck, which may be downloaded and used for free.
Keywords Proficiency test · Inter-laboratory test · Sensory profiling · Performance visualisation · PanelCheck
Introduction
Trained sensory panels are important tools for assessing the
quality of food and non-food products. There are, however,
a number of problems related to the training, stability, and
maintenance of the quality of such panels. A number of
methods have been developed that may help to achieve
better panel performance [1–5]. These techniques can detect lack of precision (repeatability), disagreement (reproducibility), and the ability or inability to discriminate between samples. This type of information is very useful
for improving data quality in future sessions through
increased and more targeted training on problematic issues.
Larger companies maintaining sensory panels at multi-
ple geographic locations are often subject to additional
challenges. For example, thoroughly carried out quality
control and product development require that all sensory
panels are well calibrated with one another, eliminating
potential shift between the panels and allowing for comparison of their results. When multiple sensory panels are to evaluate the same set of samples, global performance issues (across multiple sensory panels) might add to
already existing local performance issues (within one
sensory panel). This further complicates comparison of
results from each involved panel. Techniques for profi-
ciency tests are available, but most of them are developed
for classical chemical inter-laboratory comparisons (see,
e.g. [6]) and with less focus on some of the more specific
aspects of sensory analysis such as those indicated above.
Important contributions to the proficiency test literature are available [7–9]. In these papers, classical ANOVA,
O. Tomic (✉) · G. Luciano · A. Nilsen · T. Næs
Nofima Mat AS, Osloveien 1, 1430 Ås, Norway
e-mail: [email protected]

G. Hyldig
DTU Aqua, National Institute of Aquatic Resources, Technical University of Denmark, Søltofts Plads, Build. 221, 2800 Lyngby, Denmark

K. Lorensen
Chew Tech I/S, Vejle, Denmark
Eur Food Res Technol (2010) 230:497–511
DOI 10.1007/s00217-009-1185-y
Principal Component Analysis (PCA), Multiple Factor
Analysis (MFA), and Generalised Procrustes Analysis
(GPA) are used for studying intra- and inter-laboratory
variation.
The main focus of the present paper is to discuss and to
illustrate how techniques developed specifically for per-
formance visualisation of a single sensory panel [5] can
also be applied for comparing multiple panels. Some of the techniques are related to the methods mentioned above,
while others are new in this context. Univariate as well as
multivariate statistical methods will be presented and used
in this paper. The univariate methods highlight differences
for each attribute separately while the multivariate methods
look at differences at a more general level taking into
account also correlations between the attributes. All presented techniques are graphically oriented and should therefore be easy to understand by practitioners and non-statisticians. A major issue is to stress how the techniques
can be used to highlight or visualise various types of differences between the assessors and the panels. Furthermore, a workflow suggesting how to progress with the data
analysis and how to use the methods available in the
PanelCheck software will also be proposed. This allows for
rapid and efficient analysis of sensory profiling data, both
in case of one or multiple panels. The software provides an
intuitive and easy-to-use graphical user interface that
handles all statistical computations in the background and
visualises results in different types of plots. This enables
the practitioner and non-statistician to concentrate on per-
formance analysis rather than spending time on trying to
apply algorithms on data by themselves. The open source PanelCheck software may be downloaded, distributed, and used for free (http://www.panelcheck.com) [10].
Experimental
The dataset discussed here is the result of a joint pro-
ject between Danish, Norwegian, Swedish, and English
research institutes and commercial companies. In all, 26
panels were involved in the project (research as well as
industry panels) with one of the aims being to investigate
performance of multiple sensory panels with the Panel-
Check software.
The samples studied were five candies (wine gums)
produced according to an experimental design with two
design factors, i.e. sugar level and acid content: A1 (high
sugar, low acid), A2 (high sugar, high acid), B (medium
sugar, low acid), C1 (low sugar, low acid), C2 (low sugar,
high acid). All samples were produced at LEAF Denmark.
The evaluation of the samples had to be performed within
1 month after production. LEAF Denmark guaranteed that the samples did not change their sensory properties within this
period. The candy samples were tested by each of the 26
participating panels.
Each sensory panel received detailed instructions about
sample preparation and evaluation. The sensory panel at
LEAF performed sensory profiling on the samples and
suggested nine sensory attributes, which the remaining 25
sensory panels were to use for profiling. Two samples
(A1, C1) were used for training and calibration by all sensory panels. Sample C2 was used as a reference
sample for maximum intensity of attribute acidic fla-
vour. For the remaining attributes, either sample A1 or
C1 were used as reference for low or high intensity. All
attributes were evaluated on an intensity scale from 0 (no
intensity) to 15 (high intensity). Water was used to clean
the palate between each sample. The nine attributes used
to describe the samples were: transparency, acidic flavour, sweet taste, raspberry flavour, sugar coat (the thickness of the sugar peel visible on the cut wine gum piece), biting strength in the mouth (referred to as biting), hardness, elasticity in the mouth (referred to as elasticity), and sticking to teeth in the mouth (referred to as sticking).
Each of the 5 samples was evaluated in 3 replicates,
resulting in a total of 15 samples to be tested by each panel.
One piece of wine gum weighed 3.5 g. In each serving, the
assessors got four to five pieces of which one was cut in
half by the sensory staff, allowing the assessors to score the appearance attributes. For those panels that did not have
access to specific software for automatic randomisation of
candy samples, a Latin square design was provided as an
example for serving order. All 26 sensory evaluations took
place in June 2007.

Table 1 shows an overview of the 26 panels, indicating their number of assessors, the size of each panel's data matrix, and the size of the data used for the first part of the analysis that included all panels.
Methods
In the following section, the univariate and multivariate
statistical methods used for data analysis will be discussed.
The results of these methods are visualised in various plots
helping non-statisticians to visually detect performance
issues without having to know all details on the statistical
methods. It should be emphasised that the real strength of
these methods is revealed only when using them together.
Each plot has its own special feature that represents an
element of unique information, but their joint information content is what really provides a holistic overview of the performance of the investigated panels. The methods will
be presented in an order that complies with the suggested
data analysis workflow (see Work flow strategy).
The same workflow may be applied for one sensory panel at a time as well as for multiple sensory panels. In
this sense, one needs to think in terms of groups and
individuals such that the statistical methods may be
applied appropriately. When analysing performance of one
sensory panel, the panel as a whole represents the group
level while the assessors represent the individual level.
This changes, however, when applying the same methods
on data from multiple panels. Here, the group of 26 panels represents the group level, whereas each single panel represents the individual level. In other words,
the group of 26 panels will be treated as one large panel
with each panel representing one assessor. How this is done in practice will be elaborated later in Data merging.
In the description of the statistical methods in Mixed model ANOVA for assessing the importance of attributes to Profile and line plots below (considering performance analysis of only one panel) we let j = 1,…,J denote the samples tested, m = 1,…,M the replicates, k = 1,…,K the attributes and i = 1,…,I the assessors. We let Xi denote the data matrix of assessor i with J*M rows and K columns. That means that Xi of each assessor is of dimension (J*M) × K in any of the 26 data sets. For the candy data set, the dimension of Xi then is (5*3) × 9, with J = 5 samples, M = 3 replicates and K = 9 attributes.
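As a concrete illustration of this notation, the candy dimensions can be written down directly (a hypothetical sketch of our own, not PanelCheck code; the row ordering of Xi is an assumption):

```python
import numpy as np

# Candy proficiency-test dimensions: J samples, M replicates,
# K attributes; I varies per panel (here, say, 10 assessors).
J, M, K, I = 5, 3, 9, 10

# Raw data matrix Xi of one assessor: (J*M) rows x K columns,
# assumed ordered sample-by-sample with M replicate rows per sample.
Xi = np.zeros((J * M, K))

# Replicate-averaged matrix Xi_av (used later by Tucker-1): J x K.
Xi_av = Xi.reshape(J, M, K).mean(axis=1)
```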
Mixed model ANOVA for assessing the importance
of attributes
As a first step, mixed model (2- or 3-way) ANOVA can
be used for assessing the importance of the used sensory
attributes in detecting significant differences between the
samples. The method is based on modelling samples,
assessors and their interactions in two-way ANOVA or
samples, assessors, replicates and their interactions in
three-way ANOVA, and then testing for the sample effect
by the use of regular F tests. In each case (2-way or
3-way), the assessor and interaction effects are assumed
to be random [11]. Only attributes that are significant at a certain level (in the case presented here, a 5% significance
Table 1 Overview of all sensory panels that participated in the proficiency test (J, M, I: the number of tested products, replicates and assessors, respectively)

Sensory panel   Number of assessors   Number of data rows in raw data (J*M*I)   Number of data rows used in global analysis
P01             7                     105                                       15
P02             11                    165                                       15
P03             8                     120                                       15
P04             11                    165                                       15
P05             10                    150                                       15
P06             15                    225                                       15
P07             9                     135                                       15
P08             3                     45                                        15
P09             11                    165                                       15
P10             8                     120                                       15
P11             7                     105                                       15
P12             7                     105                                       15
P13             5                     75                                        15
P14             7                     105                                       15
P15             8                     120                                       15
P16             8                     120                                       15
P17             9                     135                                       15
P18             8                     120                                       15
P19             6                     90                                        15
P20             7                     105                                       15
P21             6                     90                                        15
P22             10                    150                                       15
P23             11                    165                                       15
P24             7                     105                                       15
P25             10                    150                                       15
P26             4                     60                                        15
Total           213                   3,195                                     390
level was chosen) for product effect are considered for
further analysis. In case of the candy data, two-way
ANOVA was used since the replicates of the tested
samples were served in random order. If sample replicates
were served systematically, say one replicate per session,
three-way ANOVA (with main effects for assessor, sam-
ple and replicate and interactions between them) should
be considered instead. The reason for this is that by testing the replicates in separate sessions, it is likely that
be introduced into the data. The replicate effect in three-
way ANOVA then indicates whether a significant sys-
tematic session based variation in the data is present or
not.
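The sample F test in the mixed model can be sketched as follows (an illustrative Python sketch of our own, not PanelCheck's implementation; the function name and the balanced assessor x sample x replicate layout are assumptions). With assessor and interaction effects random, the sample mean square is tested against the interaction mean square:

```python
import numpy as np

def mixed_two_way_anova(scores):
    """Two-way mixed ANOVA for one attribute.

    scores: array of shape (I, J, M) -- assessors x samples x
    replicates, assumed balanced. Sample is fixed; assessor and the
    assessor*sample interaction are random, so the sample effect is
    tested against the interaction mean square.
    Returns (F_sample, df1, df2)."""
    I, J, M = scores.shape
    grand = scores.mean()
    sample_means = scores.mean(axis=(0, 2))      # one mean per sample
    assessor_means = scores.mean(axis=(1, 2))    # one mean per assessor
    cell_means = scores.mean(axis=2)             # assessor x sample cells

    ss_sample = I * M * np.sum((sample_means - grand) ** 2)
    ss_inter = M * np.sum(
        (cell_means - sample_means[None, :]
         - assessor_means[:, None] + grand) ** 2)

    ms_sample = ss_sample / (J - 1)
    ms_inter = ss_inter / ((I - 1) * (J - 1))
    return ms_sample / ms_inter, J - 1, (I - 1) * (J - 1)
```

A large F relative to the F(df1, df2) distribution then flags the attribute as discriminating between samples.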
Tucker 1 plots
In the next step, the multivariate analysis method Tucker-1
[12,13] is applied in order to get an overview of assessor and panel performance using multiple attributes. Tucker-1 is essentially PCA on an unfolded data matrix consisting of all individual matrices Xi,av aligned horizontally. Here, Xi,av represents the matrix of one assessor of dimension J × K, where the sample scores are averaged across replicates, hence the av in the index. This means that the dimension of the unfolded matrix is J × (I*K). In the case of our candy data set the dimension would be 5 × (10*9), with J = 5 samples, K = 9 attributes and, say, I = 10 assessors [if the panel consists of 10 assessors, as is the case for example for panels P05, P22 and P25 (see Table 1)]. I will of course vary according to the number of assessors in the panel, and consequently so will the dimension (I*K).
PCA on this unfolded matrix provides two types of plots
that are of interest: a common scores plot and a correlation
loadings plot. The common scores plot shows how the
tested J samples relate to each other, i.e. it visualises
similarities and dissimilarities between the samples along
the found principal components. This plot gives no direct
information on assessor or panel performance, but it is a
valuable visualisation tool that helps the user to roughly
and quickly investigate whether the panel could distinguish
between the samples or not by taking the explained vari-
ances into account. If the explained variance in the first few
(usually two) PCs is relatively high, large systematic
variation is present in the data, which again may indicate
that the panel discriminates well between the samples.
Note that the explained variance for a Tucker-1 common
scores plot generally is somewhat lower for the first few
PCs compared to those from PCA on the ordinary con-
sensus average matrix. This is because the Tucker-1 anal-
ysis is based on many more variables and therefore more
noise is present in the data.
The correlation loadings plot provides performance
information on each assessor and the sensory panel as a
whole. The plot contains I*K dots, with each dot repre-
senting one assessor-attribute combination (e.g. attribute
sweet taste of assessor 5, etc.). By highlighting different
dots, either those of one assessor or those of one attribute,
one can visualise the performance of individual assessors or the whole panel. The position of the dots within the plot provides information on how well an individual or the panel as a whole performs. The more noise the attribute of a particular assessor contains, the closer the dot will appear to the origin, i.e. the middle of the plot. The more systematic information an attribute of an assessor contains, the
closer it will appear to the outer ellipse (100% explained
variance for that attribute, see Fig. 4). The inner ellipse
represents 50% explained variance and can be considered
as a rule-of-thumb lower boundary of how much explained
variance an attribute should at least have to be considered
as good enough. It is recommended to also consult higher PCs, since some assessors might have much systematic variance in dimensions other than PC1 and PC2 and thus initially appear as noisy. Detailed information on the
statistical aspects and interpretations of Tucker-1 common
scores plot and correlation plots are given in [3].
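The unfolding itself is simple; the sketch below (our own illustrative Python with assumed names, not PanelCheck's implementation) computes the common scores and the PC1/PC2 correlation loadings:

```python
import numpy as np

def tucker1(assessor_matrices):
    """Tucker-1: PCA on a horizontally unfolded matrix.

    assessor_matrices: list of I arrays, each the replicate-averaged
    J x K matrix Xi_av of one assessor. The matrices are concatenated
    to a J x (I*K) matrix, column-centred and decomposed by SVD.
    Returns the common scores, one row of PC1/PC2 correlation loadings
    per assessor-attribute column, and the per-PC explained variances."""
    X = np.hstack(assessor_matrices)            # J x (I*K)
    Xc = X - X.mean(axis=0)                     # centre each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                              # common scores, J x ncomp
    # correlation loading = correlation of a column with a score vector
    corr = np.array([[np.corrcoef(Xc[:, v], scores[:, c])[0, 1]
                      for c in (0, 1)] for v in range(Xc.shape[1])])
    explained = s ** 2 / np.sum(s ** 2)         # per-PC explained variance
    return scores, corr, explained
```

Dots near the outer (unit) ellipse of the correlation loadings plot then correspond to rows of `corr` with norm close to 1.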
Manhattan plots
Manhattan plots in general provide an alternative way to
visualise systematic variation in data sets as described
earlier [3]. They can be considered as a screening tool for
quick identification of assessors that perform very differently from the other assessors. The information visualised by Manhattan plots may be computed with different statistical methods. In this paper, the Manhattan plots visualise information as implemented in the PanelCheck
software. This means that PCA is applied on the individual
data matrices Xi,av and the explained variance for each
attribute is then visualised in the Manhattan plots. For the
candy data at hand, I*K explained variances will be given. This means that if the panel consists of, say, I = 10 assessors, the number of explained variances would be 10*9, given K = 9
attributes.
Manhattan plots (see example in Fig. 6) visualise, in
shades of grey, how much of the variability of each attri-
bute and each assessor can be explained by the principal
components (vertical axis). A dark colour tone indicates
that only a small proportion of the variance has been
explained, while a light colour tone indicates the opposite.
Extreme points are black (0% explained variance) and
white (100% explained variance). Typically, the colour
will be darker for PC1 and then get lighter with each
additional PC from top to bottom as the explained variance
shown is cumulative over each PC. In other words, the
explained variance at PC3 is the sum of the explained
variances of PC1, PC2 and PC3. The lighter the colour tone in a Manhattan plot is for a specific assessor-attribute combination, the more systematic variation is present.
The explained variances may be sorted either by
assessor or by attribute, depending on what is the main
focus of investigation. When interested in checking performance between assessors, one may investigate a total number of I plots consisting of K columns, where each plot represents one assessor and each column within the plots represents one attribute. Here, one may look for similar
colour patterns among the assessors and detect assessors
that differ much from the others. If interested in how well an attribute is understood and used by the panel, one may consider a total number of K plots consisting of I columns, where each plot represents one attribute and each column represents one assessor. Here, one may investigate whether an attribute achieves high explained variances with only a few PCs or if many PCs are necessary.
Moreover, it can be detected whether some assessors may have more systematic variance with fewer PCs than other
assessors. In this sense, Manhattan plots may be used as a
screening tool for quick detection of assessors that behave
very differently or attributes that are not well explained relative to one another. Both plotting variants are
implemented in PanelCheck. More detailed information on
the statistical aspects and interpretations of Manhattan
plots are presented in [3].
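The cumulative per-attribute explained variances underlying such a plot can be computed roughly as below (an illustrative Python sketch under our own naming, not PanelCheck code; columns are assumed non-constant):

```python
import numpy as np

def manhattan_values(Xi_av):
    """Cumulative per-attribute explained variance from PCA on one
    assessor's replicate-averaged matrix Xi_av (J x K).

    Returns an array of shape (ncomp, K); row a holds the fraction of
    each attribute's variance explained by PCs 1..a+1 (0 maps to black,
    1 to white in the Manhattan plot)."""
    Xc = Xi_av - Xi_av.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    total = np.sum(Xc ** 2, axis=0)          # per-attribute total SS
    rows = []
    recon = np.zeros_like(Xc)
    for a in range(len(s)):
        recon += s[a] * np.outer(U[:, a], Vt[a])   # add one PC
        resid = np.sum((Xc - recon) ** 2, axis=0)  # leftover SS per column
        rows.append(1 - resid / total)
    return np.array(rows)
```

Because each row accumulates one more PC, the values can only increase from top to bottom, matching the darker-to-lighter pattern described above.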
Plots based on one-way ANOVA
A discussion of the one-way ANOVA model in the context of panel performance can be found in [5, 14]. From the one-way
ANOVA model, we obtain three statistical quantities (F, p
and MSE values) that are used to generate the so-called F
plot, MSE plot and p*MSE plot as available in the Pan-
elCheck software. These three statistical quantities are
acquired by applying one-way ANOVA on each individual
data matrix Xi and provide information on sample discrimination and repeatability for each assessor. The three
plots are described in more detail below.
F plot
F plots are based on F values, which contain information
on discrimination performance of each assessor. A total of
I*K F values are computed and may be presented in a bar
diagram with each bar representing the attribute of one
specific assessor. The bar diagram can be accompanied
with horizontal lines indicating different significance lev-
els. Typically, 1 and 5% level of significance are used for
this purpose. Generally, the higher an value Fvalue of an
individual assessor, the greater the ability of that assessor
to discriminate between tested samples. If differences
between the tested samples are present, one should expect
the assessors to obtain high Fvalues greater, ideally higher
than those corresponding to 1 and 5% level of significance.
MSE plot
The MSE values are the mean square errors (random error variance estimates) from the one-way ANOVA model.
They can be used as a measure of repeatability for each
assessor. A total of I*K MSE values are computed and can be plotted in a bar diagram, very similar to the F values in the F plot. If an assessor almost perfectly repeats her/
himself, this value should be close to zero. The lower the repeatability of a certain assessor, the higher his/her MSE will be. The MSE value should, however, always be considered together with the F values in order to get a realistic
overview of the assessor's performance. An assessor aiming for low MSE values can achieve this by scoring all samples about alike, thus reducing differences between replicates. However, such an assessor will clearly
have no discriminative power in the analysis, as the
respective F values will also be very low. If differences between the samples are present, an assessor should ideally have high F values and low MSE values.
p*MSE plots
In a p*MSE plot [15] the assessors' ability to detect dif-
ferences between samples is plotted against their repeat-
ability using the p values and MSE values from one-way
ANOVA calculations. A total of I*K pairs of p and MSE values are computed and plotted in a scatter plot. They can be presented together in various ways (for instance, all at
the same time, only for one attribute at a time or only for
one assessor at a time) and with highlighting of the
assessors or attributes that one is particularly interested
in. In an ideal situation all assessors should achieve low
p values and low MSE values for all attributes [15] if
differences between the samples are really present, thus
ending up in the lower left corner of the plot.
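The three quantities behind the F, MSE and p*MSE plots can be computed per assessor roughly as below (our own Python sketch; the function name and row ordering are assumptions, and scipy is used only for the tail probability of the F distribution):

```python
import numpy as np
from scipy import stats

def one_way_anova(Xi, J, M):
    """One-way ANOVA per attribute for one assessor.

    Xi: (J*M) x K matrix, rows assumed ordered sample-by-sample with
    M replicate rows per sample. Returns per-attribute (F, p, MSE);
    high F / low p signal discrimination, low MSE good repeatability."""
    K = Xi.shape[1]
    X = Xi.reshape(J, M, K)
    grand = X.mean(axis=(0, 1))
    sample_means = X.mean(axis=1)                         # J x K
    ss_between = M * np.sum((sample_means - grand) ** 2, axis=0)
    ss_within = np.sum((X - sample_means[:, None, :]) ** 2, axis=(0, 1))
    ms_between = ss_between / (J - 1)
    mse = ss_within / (J * (M - 1))                       # repeatability
    F = ms_between / mse                                  # discrimination
    p = stats.f.sf(F, J - 1, J * (M - 1))                 # upper tail prob.
    return F, p, mse
```

Plotting p against MSE for all I*K assessor-attribute pairs then reproduces the p*MSE scatter, with well-performing assessors in the lower left corner.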
Profile and line plots
Profile plots visualise how each assessor ranks and rates the
tested samples compared to the other assessors and the
panel consensus for a certain attribute (see example in Fig. 11). Each line represents one assessor (sample averages across replicates), whereas the single bold line represents the panel consensus (sample averages across assessors and replicates). The tested samples are ranked
along the horizontal axis according to the panel consensus
from left to right with increasing intensity for that attribute.
The vertical axis represents the scores (average across
replicates) of the particular assessor for the samples. In
case of high agreement between assessors, the assessor
lines follow the consensus line closely. With increasing
disagreement, the line of each assessor will follow its own
course and the plot will appear as more cluttered.
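The data behind such a profile plot, consensus ranking plus one line per assessor, can be prepared as sketched below (a hypothetical helper of our own, not PanelCheck code):

```python
import numpy as np

def profile_plot_data(scores):
    """Data behind a profile plot for one attribute.

    scores: array of shape (I, J, M) -- assessors x samples x
    replicates for a single attribute. Returns the sample order (by
    increasing panel consensus), the consensus line, and one
    replicate-averaged line per assessor, ready for plotting."""
    assessor_lines = scores.mean(axis=2)        # I x J, average replicates
    consensus = assessor_lines.mean(axis=0)     # panel consensus, length J
    order = np.argsort(consensus)               # left-to-right ranking
    return order, consensus[order], assessor_lines[:, order]
```

Plotting each row of the returned assessor lines against the bold consensus line then shows agreement (lines hugging the consensus) or disagreement (a cluttered plot).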
Each line plot [14] represents one sample, showing its average scores on each attribute in the form of a line connecting each attribute from left to right (see example in Fig. 8). In addition, raw data scores may be superimposed,
indicating how individual assessors have scored on the
particular sample. The vertical line for each attribute dis-
plays the scoring range used by all assessors for that given
attribute and each symbol represents one of multiple scores
provided by the panel.
Data merging and work flow strategy
In this section, we describe how to prepare and merge the sensory profiling data of the 26 panels prior to import into the PanelCheck software. Furthermore, we propose a workflow that suggests how to progress with the data analysis, i.e. which plots to use first and, depending on the information found, which plots to use further on. All methods
used here are integrated in the PanelCheck software and
thus may be accessed easily. The only exception is the
method used in PCA for investigating basic structure in
data. This particular analysis, however, can be easily
carried out using any multivariate statistics software
package that gives access to PCA. This work flow may also
be applied to single data sets from one panel.
Data merging
Before analysing the 26 data sets, some data pre-processing
and re-arranging is necessary. There are several possibili-
ties of how data may be merged prior to import into the
PanelCheck software.
Raw data
The most obvious way would be to concatenate all data sets
vertically, which practically would result in a single large
sensory panel with 213 assessors accumulated over all
the 26 panels. The dimension of this matrix would then be 3,195 × 9 (see Table 1, last row). By choosing this
approach, individual information on all 213 assessors is
preserved and available in the plots. In return, however,
interpretation might become cumbersome and challenging,
as some of the plots get crowded and unreadable with so
many assessors. Given this situation, performance issues on
individuals or a particular panel as a whole may be difficult
to identify. With fewer panels, though, this may be a valid
approach as the number of assessors also will be lower.
Sample averages across assessors and replicates
for each panel
Another possibility is to compute consensus sample aver-
ages for each panel across assessors and replicates. By doing so, one will have available 26 new consensus data matrices of dimension J × K. For the candy data at hand, the dimension of these matrices then is 5 × 9, with J = 5 samples and K = 9 attributes. The next step would be to
concatenate these consensus matrices vertically, resulting
in a merged data matrix of dimension (26*5) × 9 and import it into PanelCheck. In this case, each panel is treated as if it were an individual assessor in a sensory panel consisting of 26 assessors. Unfortunately, with this
approach, one loses information on repeatability and per-
formance of individual assessors, since the sample averages were computed across assessors and replicates and information on these two factors is lost. Hence, the plots visualising repeatability performance (the F plot, MSE plot and p*MSE plot) are not available in this case.
Sample averages across assessors for each panel
A third alternative is to compute sample replicate aver-
ages for each panel across the assessors of that particular
panel. This will lead to 26 data matrices of dimension (J*M) × K, which for the candy data at hand means (5*3) × 9, with J = 5 samples and M = 3 replicates. This is also indicated in Table 1. The resulting data matrix then is of dimension (26*15) × 9 when concatenating all 26 data matrices vertically, and is ready for import into PanelCheck. In this way, each panel is again treated as if it were an individual assessor in a sensory panel consisting of 26 assessors, but this time information on
repeatability is available as replicate information is pre-
served on the panel level.
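The two averaging variants can be sketched as below (our own illustrative Python; function names and the assessor-major, sample-major row ordering are assumptions, not PanelCheck code):

```python
import numpy as np

def merge_replicate_preserving(panel_data, J=5, M=3, K=9):
    """Average each panel across its assessors only, keeping the
    replicate structure: one (J*M) x K matrix per panel, stacked
    vertically. Rows of each input are assumed ordered
    assessor-by-assessor, each assessor contributing J*M rows."""
    blocks = []
    for X in panel_data:
        I_p = X.shape[0] // (J * M)               # assessors in this panel
        blocks.append(X.reshape(I_p, J * M, K).mean(axis=0))
    return np.vstack(blocks)                       # (n_panels*J*M) x K

def merge_consensus(panel_data, J=5, M=3, K=9):
    """Average each panel across assessors AND replicates: one J x K
    consensus matrix per panel, stacked vertically (repeatability
    information is lost). Each assessor's rows are assumed ordered
    sample-by-sample, M replicate rows per sample."""
    blocks = []
    for X in panel_data:
        I_p = X.shape[0] // (J * M)
        blocks.append(X.reshape(I_p, J, M, K).mean(axis=(0, 2)))
    return np.vstack(blocks)                       # (n_panels*J) x K
```

The first variant is the one used for the global analysis below, since its output still carries replicate information per panel.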
Data merging approaches used in this study
For the first part of the analysis (Global analysis of all 26
panels) where all 26 panels are investigated, the approach
described in Sample averages across assessors for each
panel was chosen, since it provides a performance over-
view over all panels and at the same time preserves
information on replicates on a panel level. This approach
can be seen as a middle way between the two approaches described above in Raw data and Sample averages across assessors and replicates for each panel. It is a
valuable approach when a large amount of data with many
panels is given, as in this study.
For the second part of the analysis (Local analysis of
panels P05, P17 and P25), focus is turned to only three of
the 26 panels and the individual assessors that belong to
them. These three panels (P05, P17 and P25) were identified as differing somewhat from the other panels on a number of attributes, as visualised with Tucker-1 plots (see
Results). Given this situation where only three panels are
to be analysed in detail, the data setup as described in Raw data is an appropriate approach. With the amount of raw data greatly reduced (down to three panels from 26), information on individual assessors will be more readable in the plots.
Work flow strategy
The proposed work flow strategy (Fig. 1) is by no means a
hard rule that represents the perfect general approach to
analysing and visualising all types of sensory profiling
data. It should rather be seen as a guide or path that one
may follow when analysing a new data set, and which may
be left at any time in the data analysis process. Since each data set may have its own unique characteristics, it may
require a unique approach and a different order of methods
and plots to be used for analysis.
In the proposed workflow, a good starting point could be an ANOVA (either two-way or three-way) to identify significant attributes at the 5% significance level, i.e. P < 0.05.
Non-significant attributes close to significance may also be
considered, since only a few noisy assessors might be
enough to make the attribute switch from significant to non-significant. Attributes which are far from being significant (say, P values of 0.1 and above) may be disregarded, based on the high likelihood that differences between the tested samples are not present. Preferably, this cut-off limit needs to be chosen by the panel leader, who has full
knowledge about the tested products and knows how well
the assessors of his/her sensory panel usually perform.
For the next step, one may consult Tucker-1 and Man-
hattan plots. Tucker-1 correlation loadings plots as imple-
mented in the PanelCheck software are based on replicate
averages, i.e. they do not contain information on repeat-
ability. They do, however, provide some quick diagnostics
that may be confirmed with other plots especially suited to
visualise that particular kind of problem. Depending on
how the assessors are distributed over the plots one may
identify possible disagreement in sample ranking, poor
sample discrimination ability, or crossover effects (turning the intensity scale upside down). Manhattan plots
may be used as a screening tool to identify deviating per-
formances based on the patterns found in the plots.
The next plots suggested are those based on one-way
ANOVA carried out on the individual data matrices Xi of each assessor. Those plots are the p*MSE plot, F plot and MSE plot. If an assessor lies, e.g., close to the centre of the
Tucker-1 correlation loadings plot the reason for this often
is poor discrimination ability of that particular assessor
compared to others closer to the outer ellipse. This may be
confirmed by the p*MSE plot or F plot. If poor sample
discrimination cannot be confirmed by either one-way
ANOVA plot another likely scenario might be ranking
disagreement. In this case, the problematic assessor does
not agree with the underlying structure found by Tucker-1 in the first two PCs. This particular assessor might discriminate well between the samples, however not in the
same way as the panel consensus. Therefore, such an
assessor may show systematic variation in PC3 or higher.
This may be confirmed by profile plots.
If none of the plots mentioned above allows for a con-
clusion, one might want to consult line plots for visualising
the raw data of every sample. Studying details on the raw
data might help to reveal issues that are not caught with
other plots. With the help of the workflow, one may analyse one attribute at a time and finish the analysis when performance on all attributes has been evaluated.
Results
In this section, we will first investigate the sensory profiling data of all 26 participating panels ('Global analysis of all 26 panels') before going further into detail by looking at the performance of only a few selected panels ('Local analysis of panels P05, P17 and P25') that vary somewhat from most other panels.
Global analysis of all 26 panels
Two-way ANOVA
Following the workflow shown in Fig. 1, a two-way ANOVA (details in 'Mixed model ANOVA for assessing the importance of attributes') was computed first. The results are shown in Fig. 2. All attributes were significant with P < 0.001; hence, all attributes were kept for further analysis.
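As an illustration of the kind of model computed here, the product effect in a balanced two-way mixed model (product fixed, panel random, so the product mean square is tested against the product-by-panel interaction) can be sketched as follows. This is a simplified sketch with our own function name and an assumed samples × panels × replicates layout, not the exact model used by PanelCheck:

```python
import numpy as np

def mixed_anova_product_f(scores):
    """Product effect in a balanced two-way mixed model for one
    attribute. scores has shape (samples, panels, replicates);
    product is fixed, panel is random, so the product mean square
    is tested against the product-by-panel interaction.
    Returns the F value and its degrees of freedom."""
    I, J, K = scores.shape
    grand = scores.mean()
    m_i = scores.mean(axis=(1, 2))           # sample means
    m_j = scores.mean(axis=(0, 2))           # panel means
    m_ij = scores.mean(axis=2)               # cell means
    ss_sample = J * K * np.sum((m_i - grand) ** 2)
    ss_inter = K * np.sum(
        (m_ij - m_i[:, None] - m_j[None, :] + grand) ** 2)
    ms_sample = ss_sample / (I - 1)
    ms_inter = ss_inter / ((I - 1) * (J - 1))
    return ms_sample / ms_inter, I - 1, (I - 1) * (J - 1)
```

A large F value relative to the F distribution with the returned degrees of freedom indicates that the attribute discriminates between the products.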
PCA for investigating basic structure in data
The purpose of this analysis step is to get a quick and
general overview over how the data is structured and to
identify panels that may differ greatly in regard to how they
perceive differences between the tested samples. This is
done by applying PCA to the merged data set as described in 'Sample averages across assessors and replicates for each panel'. The results are reported in Fig. 3a–c, showing the explained variance, scores and loadings, respectively.
Eur Food Res Technol (2010) 230:497–511

Figure 3a shows that the first two principal components explain 93% of the total variability contained in the data set. Figure 3b shows how the samples are distributed in the multivariate space. Each sample is represented 78 times (3 replicates × 26 panels) with a fairly good separation
between the samples. The scores plot shows that the first
axis discriminates between samples A2, B and C1 on one
side versus A1 and C2 on the other. The samples in the
latter group are characterised by high intensity for the
attributes sweet taste, sugar coat and to a certain extent
acidic flavour and raspberry flavour (see loadings plot in
Fig.3c). The former group is characterised by high
intensity for attributes sticking, transparency, elastic-
ity, hardness and biting. Along the second axis, thereseems to be a split between samples A2 and B (samples on
the left side of the scores plot) and A1 and C2 (samples on
right side of score the plot). This tendency is strongly
related to attribute 'acidic flavour', with high intensity for samples A2 and C2 and low intensities for samples A1 and B. This is in accordance with the experimental design described above ('Experimental'). Attribute 'sweet taste' also seems to contribute to the split along PC2, although not to the same degree as 'acidic flavour'. Moreover, there
Fig. 1 Proposed workflow for the analysis of assessor and panel performance
seems to be no clear coherence with the sugar content in the samples, as one would expect from the experimental design. Nonetheless, the scores plot shows that the panels are in good agreement regarding how the samples differ from each other, except for one evaluation of sample C2 and one evaluation of A1. Other than that, there are no anomalies to be detected, which rules out severe differences between any of the panels.
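The PCA used in this overview step can be sketched with a plain SVD on the column-centred data. This is a hypothetical minimal sketch (PanelCheck additionally reports cross-validated explained variance, which is not reproduced here):

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD of the column-centred matrix X (rows x
    attributes), e.g. 390 x 9 when five samples are scored by 26
    panels in 3 replicates. Returns scores, loadings and the
    fraction of total variance explained by each of the first
    n_components."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = s ** 2 / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]
```

Plotting the returned scores against each other gives the scores plot of Fig. 3b; the loadings give Fig. 3c.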
Tucker-1 and Manhattan plots of all panels
For the next step, Tucker-1 correlation loadings plots (Fig. 4) are utilised to identify attributes with potential performance issues. For the data at hand, nine identical plots are given (for nine attributes), each with one attribute highlighted at a time.
By screening through the plots, one can see that the
overall performance between the 26 panels can be con-
sidered as very good for most of the attributes. A very large
part of the variation in the data is explained using only PC1
and PC2. The total amount of variance explained by PC1
and PC2 is 98.6%, with PC1 and PC2 explaining 92.6 and 6.0%, respectively. From previous experience, we can state that this number is very high compared to other data sets, despite the high number of 234 variables (26 panels × 9 attributes). One important reason for this is that much of the noise was eliminated by averaging sample scores over assessors. The plots show that none of the panels falls inside the inner ellipse for any attribute, meaning that all of them have more than 50% of their variation explained by PC1 and PC2. For all attributes
Fig. 2 Product effect in the two-way ANOVA model based on 26 panels. All attributes are significant with P < 0.001 and are included in further analysis
Fig. 3 a Explained variances from PCA on the data described in 'Sample averages across assessors for each panel'. The upper (full) and lower (dashed) lines visualise the calibrated and validated explained variance, respectively. b PCA scores plot visualising how the 26 panels discriminated between the five tested samples. c PCA loadings plot showing how the attributes contributed to the variation in the merged data set
except 'acidic flavour', 'sweet taste' and 'raspberry flavour', the 26 panels show very good agreement, as they are well clustered at the outer ellipse. For the three attributes mentioned above there is some disagreement, since the panels are more spread out along the outer ellipse. Attributes 'acidic flavour' and 'sweet taste' are the only attributes contributing to systematic variation in PC2.
Furthermore, it is obvious that panel P01 disagrees with the other panels on attribute 'sticking', since it is located on the opposite side from the other panels. From previous experience, it is known that such a situation is caused by turning the scale upside down, i.e. confusing high and low intensity. This assumption is confirmed by the profile plot for attribute 'sticking' shown in Fig. 5: panel P01 seems to have confused high and low intensity for the tested samples. Moreover, one may observe in Fig. 4 that panel P19 has less systematic variation for attribute 'elasticity' compared to the other panels. A profile plot of this attribute (not shown) reveals that panel P19 ranks the samples identically to the consensus; however, its intensity
differences between the samples deviate somewhat from
that of the consensus. This is why panel P19 lies in the same direction as the remaining panels in the Tucker-1 correlation loadings plots but does not align as well with the other panels.

After screening through the Tucker-1 plots, one may consult Manhattan plots (Fig. 6) for comparison of the
systematic variation for a specific attribute across all
Fig. 4 Nine identical Tucker-1 plots, each highlighting one of the nine attributes used in the profiling. There is some variation between the panels for attributes 'acidic flavour', 'sweet taste' and 'raspberry flavour'
Fig. 5 Profile plot of attribute 'sticking'. Panel P01 clearly stands out from the other panels because of opposite scoring on high and low intensity of the tested samples
panels. The Manhattan plots confirm what was shown in
the Tucker-1 plots. The attributes 'acidic flavour', 'sweet taste' and 'raspberry flavour' need two or more principal components to reach a high level of explained variance.
For the remaining attributes, all panels reach a high percentage of explained variance already after one principal component. The only exception is attribute 'elasticity', where one can easily see that panel P19 differs from the other panels. The lone dark bar indicates that panel P19 has less systematic variance for this attribute than the other panels and needs three to four principal components before its explained variance is comparable with that of the other panels. For this attribute, all panels have an explained variance higher than or very close to 99% using only PC1, except panel P19 with only 62% after PC1. After PC3, the cumulative explained variance of 90% for panel P19 is still somewhat lower than those of the other panels. With four PCs, panel P19 reaches 100% explained variance.
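The numbers behind one column of a Manhattan plot can be sketched as follows: run a PCA on one panel's standardised data and, for each attribute, compute the fraction of its variance reproduced by the first k components. This is a hedged sketch under assumed standardisation, with our own function name; PanelCheck's exact preprocessing may differ:

```python
import numpy as np

def manhattan_values(X, max_pc=4):
    """For one panel's data X (rows x attributes), return an
    (max_pc, n_attributes) array holding the fraction of each
    attribute's variance reproduced by the first k principal
    components, k = 1..max_pc. In the plot, dark bars correspond
    to low values and light bars to high values."""
    Xc = X - X.mean(axis=0)
    Xc = Xc / Xc.std(axis=0)                 # assumes non-constant columns
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    total = (Xc ** 2).sum(axis=0)
    out = np.empty((max_pc, Xc.shape[1]))
    for k in range(1, max_pc + 1):
        Xk = (U[:, :k] * s[:k]) @ Vt[:k]     # rank-k reconstruction
        resid = ((Xc - Xk) ** 2).sum(axis=0)
        out[k - 1] = 1.0 - resid / total
    return out
```

A panel like P19, which needs three to four components for 'elasticity', would show a low value in the first rows of the corresponding column.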
p*MSE, MSE and F plots based on one-way ANOVA
The p*MSE plots are not presented here since sample
discrimination is highly significant for all attributes across
all panels. Of the 234 given p values (26 panels × 9 attributes), the highest was P = 0.037.
In Fig. 7a and b, the F and MSE plots are presented, respectively. As can be seen, some of the panels have a much higher F value than others, even though all of them are significant at the 1% level. The horizontal lines indicating F values at the 1 and 5% levels cannot be seen here, since some F values are extremely high; both lines therefore fall onto the horizontal axis, as their corresponding F values are extremely low compared to the highest F values in the plot. When investigating the panels' discrimination ability one can see, for instance, that panel P11 has relatively low F values compared to those of panel P20. At the same time, the MSE values (Fig. 7b) of panel P11 are relatively high. This indicates that panel P11 is somewhat
Fig. 6 Nine Manhattan plots, one for each attribute, visualising systematic variation from individual PCA on the data of each panel. Vertical axes represent the number of PCs used and their corresponding cumulative explained variance. Horizontal axes represent the respective sensory panels. Black colour corresponds to 0% explained variance, whereas white colour corresponds to 100% explained variance
less precise and has a lower capability of detecting differences. Panel P20, on the other hand, has relatively low MSE values (good repeatability) combined with relatively high F values (good sample discrimination), indicating a much better performance than panel P11. Panel P21 is an example where high F values are achieved, however, coupled with high MSE values. In other words, this panel discriminates well between the tested samples, but less precisely so than panel P20. In terms of performance, panel P21 may be ranked between panel P20 (good) and panel P11 (not as good). Still, panel P11 shows an acceptable performance, since its F values are all significant at the 1% level. Note that the F and MSE plots provide no information on sample ranking differences and that these two plots alone therefore are not sufficient for a complete evaluation of panel performance. It should be mentioned that both plots could also be sorted by attribute to check which of the attributes have the lowest/highest variance and the best ability to distinguish between samples.
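The two statistics plotted here come from a one-way ANOVA per panel and attribute, which can be sketched as below. A minimal sketch assuming a balanced samples × replicates layout; the function name is ours:

```python
import numpy as np

def f_and_mse(scores):
    """One-way ANOVA diagnostics for a single panel (or assessor)
    and attribute. scores has shape (samples, replicates). Returns
    the F value (sample discrimination, F plot) and the MSE
    (repeatability, MSE plot)."""
    I, K = scores.shape
    grand = scores.mean()
    m_i = scores.mean(axis=1)                # sample means
    ss_between = K * np.sum((m_i - grand) ** 2)
    ss_within = np.sum((scores - m_i[:, None]) ** 2)
    ms_between = ss_between / (I - 1)
    mse = ss_within / (I * (K - 1))          # pooled within-sample variance
    return ms_between / mse, mse
```

A high F combined with a low MSE corresponds to the favourable pattern described for panel P20; a high F with a high MSE corresponds to the pattern of panel P21.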
Line plots
Figure 8 shows line plots of the five tested samples. The plots highlight that for every sample and attribute there is a varying degree of variability across the panels (vertical lines indicating the spread of the scores). This variability could be due to, for instance, local differences in calibration. This is particularly true for attribute 9 ('transparency'). For attribute 5 ('sticking'), however, there seems to be a higher degree of agreement among the panels.
Local analysis of panels P05, P17 and P25
After studying all the panels' averages across assessors (based on data as described in 'Sample averages across assessors for each panel'), the data of panels P05, P17 and P25 were analysed in more detail. These three panels were picked over others because they differ from each other for the attributes 'acidic flavour', 'sweet taste' and 'raspberry flavour': they are spread somewhat in terms of location in the Tucker-1 plots of these three attributes.
Since we are now focusing on only three panels and we
wish to analyse in more detail why these three panels differ
somewhat for the attributes mentioned above, we will use
their raw data from here on. In order to do that, the raw data needs to be merged as described in 'Raw data' before being imported into PanelCheck. When merging the raw
data of panels P05, P17 and P25, the resulting data matrix will be of dimension (435 × 9), with panel P05 contributing 150 rows (10 assessors × 5 samples × 3 replicates), panel P17 contributing 135 rows (9 assessors × 5 samples × 3 replicates) and panel P25 contributing 150 rows (10 assessors × 5 samples × 3 replicates). See Table 1 for details on panel sizes. This new data set in practice represents one new large panel consisting of 29 individuals (10 + 9 + 10 assessors). By using the same methods as before, the performance of the individuals belonging to these three panels can now be visualised.
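The merging step described above can be sketched as a simple row-wise stacking with relabelled assessors. The dictionary layout and function name below are hypothetical, chosen only to illustrate the bookkeeping:

```python
import numpy as np

def merge_panels(raw):
    """Stack raw panel data into one 'super panel'. raw maps panel
    name -> (data, n_assessors), where data is a (rows x attributes)
    array with rows ordered assessor-by-assessor. Returns the merged
    matrix and one global assessor label per row."""
    blocks, labels = [], []
    for panel, (X, n_assessors) in raw.items():
        rows_per_assessor = X.shape[0] // n_assessors
        blocks.append(X)
        for a in range(1, n_assessors + 1):
            labels += ["%s-%d" % (panel, a)] * rows_per_assessor
    return np.vstack(blocks), labels
```

For panels of 10, 9 and 10 assessors scoring 5 samples in 3 replicates, this yields the 435 × 9 matrix with 29 distinct assessor labels described in the text.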
Mixed model ANOVA
Mixed model ANOVA again reports that all attributes are significant at level P < 0.001, meaning that this new panel consisting of 29 individuals discriminated well between the samples (plot not shown). Again, all attributes were considered for further analysis.
Tucker-1 plots
Tucker-1 plots based on raw data from the three selected panels (Fig. 9) confirm what was spotted in the Tucker-1 plots above (Fig. 4) based on the data from all 26 panels. There is substantial disagreement across assessors
Fig. 7 a F plot visualising the panels' ability to discriminate between the tested samples for each attribute. Panel P21 discriminates less between the samples than, for example, panel P20. The horizontal lines indicating F values at significance levels 1 and 5% are not visible, as they are very low and therefore fall onto the horizontal axis. b MSE plot visualising the repeatability of each panel. Panel P21 obviously has a weaker performance regarding repeatability than, for example, panel P06
for the attributes 'acidic flavour', 'sweet taste' and especially 'raspberry flavour', indicating that further improvement in agreement across assessors is possible. Although the assessors are somewhat scattered over the correlation loadings plots, most of them have high explained variances for the first two PCs. This indicates that the majority
Fig. 8 Five line plots where each plot represents the data of one sample. Vertical axes represent intensity scores. Horizontal axes represent the nine sensory attributes
Fig. 9 Nine identical Tucker-1 plots, each highlighting one of the nine attributes used in the profiling. The plots are based on raw data of panels P05, P17 and P25
discriminates well between the samples, but that there might be disagreement on sample ranking for a particular attribute. Studying the three correlation loadings plots in detail confirms this by revealing that the assessors of each panel tend to form clusters of their own within the plot. For the remaining six attributes ('transparency', 'sugar coat', 'biting', 'hardness', 'elasticity' and 'sticking') overall agreement is very good. These results were confirmed by the Manhattan plots (not shown).
p*MSE plots
As opposed to the situation above with all 26 panels, where sample discrimination was highly significant ('p*MSE, MSE and F plots based on one-way ANOVA'), in this case the p*MSE plot (Fig. 10) gives a valuable contribution to understanding individual differences. It can be seen that for attribute 'raspberry flavour' panel P17, and to a certain extent panel P05, are less capable of detecting differences between the samples (larger P values) than panel P25. Moreover, random noise is generally larger for panels P05 and P17. This indicates that the individuals of panel P25, and therefore panel P25 as a group, perform much better than panels P05 and P17.
Profile plot for panels
Profile plots (Fig. 11) show that the disagreement in evaluating the samples is strongest for the attributes 'acidic flavour', 'sweet taste' and particularly 'raspberry flavour', as already observed in the Tucker-1 plots in Fig. 9. For attributes 'transparency', 'sugar coat', 'biting', 'hardness' and 'elasticity' the profiles are very alike for most of the assessors, with very few exceptions. For attribute 'sticking', three assessors of panel P17 (individuals P17-1, P17-4 and P17-9) generally rated the samples with the highest intensity (B, C1 and A2) lower than the remaining assessors.
Fig. 10 p*MSE plot for attribute 'raspberry flavour' for panels P05, P17 and P25
Fig. 11 Nine profile plots, one for each attribute, visualising sample intensity and rankings for each assessor in panels P05, P17 and P25. Vertical axes represent sample intensity scores. Horizontal axes represent the five tested samples sorted by intensity based on consensus. The circle highlights three deviating assessors belonging to panel P17 (assessors P17-1, P17-4 and P17-9)
Summary and discussion
In this paper, we have presented how to extract critical information on panel performance from a proficiency test. In the example described here, 26 sensory panels tested a set of 5 candy samples, produced according to an experimental design, in 3 replicates using 9 attributes. Since the panels varied in size, from 3 assessors at the least to 15 assessors at the most, the size of the data from each panel varied accordingly. We demonstrated how to arrange the large amount of data prior to analysis and which methods to use in the analysis process. For this, we proposed a general workflow that may be used as a guide through the data analysis process, but which is not forced upon the user.
For the data at hand, performance analysis was carried out first at a global level, based on data from all 26 panels, where each panel was treated as if it were an individual assessor. This means that rather than visualising the performance of individuals, it is the performance of panels as a whole compared to other panels that is visualised. As a result of this process, three of the 26 panels were identified for further analysis at a more detailed local level. This included performance visualisation of individual assessors from each of these three panels. In both cases, the same methods were applied to gather information on performance. The methods used were mixed model ANOVA, the Tucker-1 plot, Manhattan plot, one-way ANOVA based F plot, MSE plot, p*MSE plot, profile plot and line plot. The reason for using multiple plots is that each of the plots contains unique information on panel and assessor performance. Their joint information content provides a more complete performance overview of individual assessors and their sensory panel (local level) or of sensory panels compared with each other (global level). Performance information from such an analysis can then be used by panel leaders as feedback to improve overall panel performance and the performance of individual assessors.
Acknowledgments Thanks to Rikke Lazarotti at LEAF Denmark for production of the wine gum samples and for providing access to the sensory profiling data. We would like to thank the Research Council of Norway (project number 168152/110), The Foundation for Research Levy on Agricultural Products (Norway) and The Danish Food Industry Agency for project funding.