8/13/2019 Artigo Panel Check
ORIGINAL PAPER
Analysing sensory panel performance in a proficiency test using the PanelCheck software
Oliver Tomic · Giorgio Luciano · Asgeir Nilsen · Grethe Hyldig · Kirsten Lorensen · Tormod Næs
Received: 5 May 2009 / Revised: 28 October 2009 / Accepted: 9 November 2009 / Published online: 2 December 2009
© Springer-Verlag 2009
Abstract This paper discusses statistical methods and a workflow strategy for comparing performance across multiple sensory panels that participated in a proficiency test (also referred to as an inter-laboratory test). Performance comparison and analysis are based on a data set collected from 26 sensory panels carrying out profiling on the same set of candy samples. The candy samples were produced according to an experimental design using design factors such as sugar and acid level. Because of the exceptionally large amount of data and the availability of multiple statistical and graphical tools in the PanelCheck software, a workflow is proposed that guides the user through the data analysis process. This allows practitioners and non-statisticians to get an overview of panel performances in a rapid manner without the need to be familiar with details of the statistical methods. Visualisation of data analysis results plays an important role, as this provides a time-saving and efficient way of screening and investigating sensory panel performances. Most of the statistical methods used in this paper are available in the open source software PanelCheck, which may be downloaded and used for free.
Keywords Proficiency test · Inter-laboratory test · Sensory profiling · Performance visualisation · PanelCheck
Introduction
Trained sensory panels are important tools for assessing the
quality of food and non-food products. There are, however,
a number of problems related to the training, stability, and
maintenance of the quality of such panels. A number of
methods have been developed that may help to achieve
better panel performance [1–5]. These techniques can detect lack of precision (repeatability), disagreement (reproducibility), and the ability or inability to discriminate between samples. This type of information is very useful
for improving data quality in future sessions through
increased and more targeted training on problematic issues.
Larger companies maintaining sensory panels at multi-
ple geographic locations are often subject to additional
challenges. For example, thoroughly carried out quality
control and product development require that all sensory
panels are well calibrated with one another, eliminating
potential shift between the panels and allowing for comparison of their results. When multiple sensory panels are to evaluate the same set of samples, global performance issues (across multiple sensory panels) might add to
already existing local performance issues (within one
sensory panel). This further complicates comparison of
results from each involved panel. Techniques for profi-
ciency tests are available, but most of them are developed
for classical chemical inter-laboratory comparisons (see,
e.g. [6]) and with less focus on some of the more specific
aspects of sensory analysis such as those indicated above.
Important contributions to the proficiency test literature are available [7–9]. In these papers, classical ANOVA,
O. Tomic (✉) · G. Luciano · A. Nilsen · T. Næs
Nofima Mat AS, Osloveien 1, 1430 Ås, Norway
e-mail: [email protected]

G. Hyldig
DTU Aqua, National Institute of Aquatic Resources, Technical University of Denmark, Søltofts Plads, Build. 221, 2800 Lyngby, Denmark

K. Lorensen
Chew Tech I/S, Vejle, Denmark
Eur Food Res Technol (2010) 230:497–511
DOI 10.1007/s00217-009-1185-y
Principal Component Analysis (PCA), Multiple Factor
Analysis (MFA), and Generalised Procrustes Analysis
(GPA) are used for studying intra- and inter-laboratory
variation.
The main focus of the present paper is to discuss and to
illustrate how techniques developed specifically for per-
formance visualisation of a single sensory panel [5] can
also be applied for comparing multiple panels. Some of the techniques are related to the methods mentioned above,
while others are new in this context. Univariate as well as
multivariate statistical methods will be presented and used
in this paper. The univariate methods highlight differences
for each attribute separately while the multivariate methods
look at differences at a more general level taking into
account also correlations between the attributes. All presented techniques are graphically oriented and should therefore be easy to understand by practitioners and non-statisticians. A major issue is to stress how the techniques
can be used to highlight or visualise various types of differences between the assessors and the panels. Furthermore, a workflow suggesting how to progress with the data
analysis and how to use the methods available in the
PanelCheck software will also be proposed. This allows for
rapid and efficient analysis of sensory profiling data, both
in case of one or multiple panels. The software provides an
intuitive and easy-to-use graphical user interface that
handles all statistical computations in the background and
visualises results in different types of plots. This enables
the practitioner and non-statistician to concentrate on per-
formance analysis rather than spending time on trying to
apply algorithms on data by themselves. The open source PanelCheck software may be downloaded, distributed, and used for free (http://www.panelcheck.com) [10].
Experimental
The dataset discussed here is the result of a joint pro-
ject between Danish, Norwegian, Swedish, and English
research institutes and commercial companies. In all, 26
panels were involved in the project (research as well as
industry panels) with one of the aims being to investigate
performance of multiple sensory panels with the Panel-
Check software.
The samples studied were five candies (wine gums)
produced according to an experimental design with two
design factors, i.e. sugar level and acid content: A1 (high
sugar, low acid), A2 (high sugar, high acid), B (medium
sugar, low acid), C1 (low sugar, low acid), C2 (low sugar,
high acid). All samples were produced at LEAF Denmark.
The evaluation of the samples had to be performed within
1 month after production. LEAF Denmark guaranteed that the samples did not change their sensory properties within this
period. The candy samples were tested by each of the 26
participating panels.
Each sensory panel received detailed instructions about
sample preparation and evaluation. The sensory panel at
LEAF performed sensory profiling on the samples and
suggested nine sensory attributes, which the remaining 25
sensory panels were to use for profiling. Two samples
(A1, C1) were used for training and calibration by all sensory panels. Sample C2 was used as a reference
sample for maximum intensity of attribute acidic fla-
vour. For the remaining attributes, either sample A1 or
C1 were used as reference for low or high intensity. All
attributes were evaluated on an intensity scale from 0 (no
intensity) to 15 (high intensity). Water was used to clean
the palate between each sample. The nine attributes used
to describe the samples were: transparency, acidic flavour, sweet taste, raspberry flavour, sugar coat (the thickness of the sugar peel visible on the cut wine gum piece), biting strength in the mouth (referred to as biting), hardness, elasticity in the mouth (referred to as elasticity), and sticking to teeth in the mouth (referred to as sticking).
Each of the 5 samples was evaluated in 3 replicates,
resulting in a total of 15 samples to be tested by each panel.
One piece of wine gum weighed 3.5 g. In each serving, the
assessors got four to five pieces of which one was cut in
half by the sensory staff, allowing the assessors to score the appearance attributes. For those panels that did not have
access to specific software for automatic randomisation of
candy samples, a Latin square design was provided as an
example for serving order. All 26 sensory evaluations took
place in June 2007.

Table 1 shows an overview of the 26 panels, indicating their number of assessors, the size of each panel's data matrix, and the size of the data used for the first part of the analysis that included all panels.
Methods
In the following section, the univariate and multivariate
statistical methods used for data analysis will be discussed.
The results of these methods are visualised in various plots
helping non-statisticians to visually detect performance
issues without having to know all details on the statistical
methods. It should be emphasised that the real strength of
these methods is revealed only when using them together.
Each plot has its own special feature that represents an
element of unique information, but their joint information content is what really provides a holistic overview of the performance of the investigated panels. The methods will
be presented in an order that complies with the suggested
data analysis workflow (see Work flow strategy).
The same workflow may be applied for one sensory panel at a time as well as for multiple sensory panels. In
this sense, one needs to think in terms of groups and
individuals such that the statistical methods may be
applied appropriately. When analysing performance of one
sensory panel, the panel as a whole represents the group
level while the assessors represent the individual level.
This changes, however, when applying the same methods
on data from multiple panels. Here, the group of 26 panels represents the group level, whereas each single panel represents the individual level. In other words,
the group of 26 panels will be treated as one large panel
with each panel representing one assessor. How this is done in practice will be elaborated later in Data merging.
In the description of the statistical methods in Mixed model ANOVA for assessing the importance of attributes to Profile and line plots below (considering performance analysis of only one panel) we let j = 1,…,J denote the samples tested, m = 1,…,M the replicates, k = 1,…,K the attributes and i = 1,…,I the assessors. We let Xi denote the data matrix of assessor i with J*M rows and K columns. That means that Xi of each assessor is of dimension (J*M) × K in any of the 26 data sets. For the candy data set, the dimension of Xi then is (5*3) × 9, with J = 5 samples, M = 3 replicates and K = 9 attributes.
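As a concrete illustration of this notation, the candy dimensions can be written down directly (a hypothetical sketch of our own, not PanelCheck code; the row ordering of Xi is an assumption):

```python
import numpy as np

# Candy proficiency-test dimensions: J samples, M replicates,
# K attributes; I varies per panel (here, say, 10 assessors).
J, M, K, I = 5, 3, 9, 10

# Raw data matrix Xi of one assessor: (J*M) rows x K columns,
# assumed ordered sample-by-sample with M replicate rows per sample.
Xi = np.zeros((J * M, K))

# Replicate-averaged matrix Xi_av (used later by Tucker-1): J x K.
Xi_av = Xi.reshape(J, M, K).mean(axis=1)
```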
Mixed model ANOVA for assessing the importance
of attributes
As a first step, mixed model (2- or 3-way) ANOVA can
be used for assessing the importance of the used sensory
attributes in detecting significant differences between the
samples. The method is based on modelling samples,
assessors and their interactions in two-way ANOVA or
samples, assessors, replicates and their interactions in
three-way ANOVA, and then testing for the sample effect
by the use of regular F tests. In each case (2-way or
3-way), the assessor and interaction effects are assumed
to be random [11]. Only attributes that are significant at a certain level (in the case presented here, a 5% significance
Table 1 Overview of all sensory panels that participated in the proficiency test (J, M, I: the number of tested products, replicates and assessors, respectively)

Sensory panel   Number of assessors   Number of data rows in raw data (J*M*I)   Number of data rows used in global analysis
P01             7                     105                                       15
P02             11                    165                                       15
P03             8                     120                                       15
P04             11                    165                                       15
P05             10                    150                                       15
P06             15                    225                                       15
P07             9                     135                                       15
P08             3                     45                                        15
P09             11                    165                                       15
P10             8                     120                                       15
P11             7                     105                                       15
P12             7                     105                                       15
P13             5                     75                                        15
P14             7                     105                                       15
P15             8                     120                                       15
P16             8                     120                                       15
P17             9                     135                                       15
P18             8                     120                                       15
P19             6                     90                                        15
P20             7                     105                                       15
P21             6                     90                                        15
P22             10                    150                                       15
P23             11                    165                                       15
P24             7                     105                                       15
P25             10                    150                                       15
P26             4                     60                                        15
Total           213                   3,195                                     390
level was chosen) for product effect are considered for
further analysis. In case of the candy data, two-way
ANOVA was used since the replicates of the tested
samples were served in random order. If sample replicates
were served systematically, say one replicate per session,
three-way ANOVA (with main effects for assessor, sam-
ple and replicate and interactions between them) should
be considered instead. The reason for this is that by testing the replicates in separate sessions, it is likely that
be introduced into the data. The replicate effect in three-
way ANOVA then indicates whether a significant sys-
tematic session based variation in the data is present or
not.
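The sample F test in the mixed model can be sketched as follows (an illustrative Python sketch of our own, not PanelCheck's implementation; the function name and the balanced assessor x sample x replicate layout are assumptions). With assessor and interaction effects random, the sample mean square is tested against the interaction mean square:

```python
import numpy as np

def mixed_two_way_anova(scores):
    """Two-way mixed ANOVA for one attribute.

    scores: array of shape (I, J, M) -- assessors x samples x
    replicates, assumed balanced. Sample is fixed; assessor and the
    assessor*sample interaction are random, so the sample effect is
    tested against the interaction mean square.
    Returns (F_sample, df1, df2)."""
    I, J, M = scores.shape
    grand = scores.mean()
    sample_means = scores.mean(axis=(0, 2))      # one mean per sample
    assessor_means = scores.mean(axis=(1, 2))    # one mean per assessor
    cell_means = scores.mean(axis=2)             # assessor x sample cells

    ss_sample = I * M * np.sum((sample_means - grand) ** 2)
    ss_inter = M * np.sum(
        (cell_means - sample_means[None, :]
         - assessor_means[:, None] + grand) ** 2)

    ms_sample = ss_sample / (J - 1)
    ms_inter = ss_inter / ((I - 1) * (J - 1))
    return ms_sample / ms_inter, J - 1, (I - 1) * (J - 1)
```

A large F relative to the F(df1, df2) distribution then flags the attribute as discriminating between samples.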
Tucker 1 plots
In the next step, the multivariate analysis method Tucker-1
[12,13] is applied in order to get an overview of assessor and panel performance using multiple attributes. Tucker-1 is essentially PCA on an unfolded data matrix consisting of all individual matrices Xi,av aligned horizontally. Here, Xi,av represents the matrix of one assessor of dimension J × K, where the sample scores are averaged across replicates, hence the av in the index. This means that the dimension of the unfolded matrix is J × (I*K). In the case of our candy data set the dimension would be 5 × (10*9), with J = 5 samples, K = 9 attributes and, say, I = 10 assessors [if the panel consists of 10 assessors, as is the case for example for panels P05, P22 and P25 (see Table 1)]. I will of course vary according to the number of assessors in the panel, and consequently so will the dimension (I*K).
PCA on this unfolded matrix provides two types of plots
that are of interest: a common scores plot and a correlation
loadings plot. The common scores plot shows how the
tested J samples relate to each other, i.e. it visualises
similarities and dissimilarities between the samples along
the found principal components. This plot gives no direct
information on assessor or panel performance, but it is a
valuable visualisation tool that helps the user to roughly
and quickly investigate whether the panel could distinguish
between the samples or not by taking the explained vari-
ances into account. If the explained variance in the first few
(usually two) PCs is relatively high, large systematic
variation is present in the data, which again may indicate
that the panel discriminates well between the samples.
Note that the explained variance for a Tucker-1 common
scores plot generally is somewhat lower for the first few
PCs compared to those from PCA on the ordinary con-
sensus average matrix. This is because the Tucker-1 anal-
ysis is based on many more variables and therefore more
noise is present in the data.
The correlation loadings plot provides performance
information on each assessor and the sensory panel as a
whole. The plot contains I*K dots, with each dot repre-
senting one assessor-attribute combination (e.g. attribute
sweet taste of assessor 5, etc.). By highlighting different
dots, either those of one assessor or those of one attribute,
one can visualise the performance of individual assessors or the whole panel. The position of the dots within the plot provides information on how well an individual or the panel as a whole performs. The more noise the attribute of a particular assessor contains, the closer the dot will appear to the origin, i.e. the middle of the plot. The more systematic information an attribute of an assessor contains, the
closer it will appear to the outer ellipse (100% explained
variance for that attribute, see Fig. 4). The inner ellipse
represents 50% explained variance and can be considered
as a rule-of-thumb lower boundary of how much explained
variance an attribute should at least have to be considered
as good enough. It is recommended to also consult higher PCs, since some assessors might have much systematic variance in dimensions other than PC1 and PC2 and thus initially appear as noisy. Detailed information on the
statistical aspects and interpretations of Tucker-1 common
scores plot and correlation plots are given in [3].
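The unfolding itself is simple; the sketch below (our own illustrative Python with assumed names, not PanelCheck's implementation) computes the common scores and the PC1/PC2 correlation loadings:

```python
import numpy as np

def tucker1(assessor_matrices):
    """Tucker-1: PCA on a horizontally unfolded matrix.

    assessor_matrices: list of I arrays, each the replicate-averaged
    J x K matrix Xi_av of one assessor. The matrices are concatenated
    to a J x (I*K) matrix, column-centred and decomposed by SVD.
    Returns the common scores, one row of PC1/PC2 correlation loadings
    per assessor-attribute column, and the per-PC explained variances."""
    X = np.hstack(assessor_matrices)            # J x (I*K)
    Xc = X - X.mean(axis=0)                     # centre each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                              # common scores, J x ncomp
    # correlation loading = correlation of a column with a score vector
    corr = np.array([[np.corrcoef(Xc[:, v], scores[:, c])[0, 1]
                      for c in (0, 1)] for v in range(Xc.shape[1])])
    explained = s ** 2 / np.sum(s ** 2)         # per-PC explained variance
    return scores, corr, explained
```

Dots near the outer (unit) ellipse of the correlation loadings plot then correspond to rows of `corr` with norm close to 1.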
Manhattan plots
Manhattan plots in general provide an alternative way to
visualise systematic variation in data sets as described
earlier [3]. They can be considered as a screening tool for
quick identification of assessors that perform very differently from the other assessors. The information visualised by Manhattan plots may be computed with different statistical methods. In this paper, the Manhattan plots visualise information as implemented in the PanelCheck
software. This means that PCA is applied on the individual
data matrices Xi,av and the explained variance for each
attribute is then visualised in the Manhattan plots. For the
candy data at hand, I*K explained variances will be given. This means that if the panel consists of, say, I = 10 assessors, the number of explained variances would be 10*9, given K = 9
attributes.
Manhattan plots (see example in Fig. 6) visualise, in
shades of grey, how much of the variability of each attri-
bute and each assessor can be explained by the principal
components (vertical axis). A dark colour tone indicates
that only a small proportion of the variance has been
explained, while a light colour tone indicates the opposite.
Extreme points are black (0% explained variance) and
white (100% explained variance). Typically, the colour
will be darker for PC1 and then get lighter with each
additional PC from top to bottom as the explained variance
shown is cumulative over each PC. In other words, the
explained variance at PC3 is the sum of the explained
variances of PC1, PC2 and PC3. The lighter the colour tone in a Manhattan plot is for a specific assessor-attribute combination, the more systematic variation is present.
The explained variances may be sorted either by
assessor or by attribute, depending on what is the main
focus of investigation. When interested in checking performance between assessors, one may investigate a total number of I plots consisting of K columns, where each plot represents one assessor and each column within the plots represents one attribute. Here, one may look for similar
colour patterns among the assessors and detect assessors
that differ much from the others. If interested in how well an attribute is understood and used by the panel, one may consider a total number of K plots consisting of I columns, where each plot represents one attribute and each column represents one assessor. Here, one may investigate whether an attribute achieves high explained variances with only a few PCs or if many PCs are necessary.
Moreover, it can be detected whether some assessors may have more systematic variance with fewer PCs than other
assessors. In this sense, Manhattan plots may be used as a
screening tool for quick detection of assessors that behave
very differently or attributes that are not well explained relative to one another. Both plotting variants are
implemented in PanelCheck. More detailed information on
the statistical aspects and interpretations of Manhattan
plots are presented in [3].
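The cumulative per-attribute explained variances underlying such a plot can be computed roughly as below (an illustrative Python sketch under our own naming, not PanelCheck code; columns are assumed non-constant):

```python
import numpy as np

def manhattan_values(Xi_av):
    """Cumulative per-attribute explained variance from PCA on one
    assessor's replicate-averaged matrix Xi_av (J x K).

    Returns an array of shape (ncomp, K); row a holds the fraction of
    each attribute's variance explained by PCs 1..a+1 (0 maps to black,
    1 to white in the Manhattan plot)."""
    Xc = Xi_av - Xi_av.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    total = np.sum(Xc ** 2, axis=0)          # per-attribute total SS
    rows = []
    recon = np.zeros_like(Xc)
    for a in range(len(s)):
        recon += s[a] * np.outer(U[:, a], Vt[a])   # add one PC
        resid = np.sum((Xc - recon) ** 2, axis=0)  # leftover SS per column
        rows.append(1 - resid / total)
    return np.array(rows)
```

Because each row accumulates one more PC, the values can only increase from top to bottom, matching the darker-to-lighter pattern described above.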
Plots based on one-way ANOVA
A discussion of the one-way ANOVA model in the context of panel performance can be found in [5, 14]. From the one-way
ANOVA model, we obtain three statistical quantities (F, p
and MSE values) that are used to generate the so-called F
plot, MSE plot and p*MSE plot as available in the Pan-
elCheck software. These three statistical quantities are
acquired by applying one-way ANOVA on each individual
data matrix Xi and provide information on sample discrimination and repeatability for each assessor. The three
plots are described in more detail below.
F plot
F plots are based on F values, which contain information
on discrimination performance of each assessor. A total of
I*K F values are computed and may be presented in a bar
diagram with each bar representing the attribute of one
specific assessor. The bar diagram can be accompanied
with horizontal lines indicating different significance lev-
els. Typically, 1 and 5% level of significance are used for
this purpose. Generally, the higher an value Fvalue of an
individual assessor, the greater the ability of that assessor
to discriminate between tested samples. If differences
between the tested samples are present, one should expect
the assessors to obtain high Fvalues greater, ideally higher
than those corresponding to 1 and 5% level of significance.
MSE plot
The MSE values are the mean square errors (random error variance estimates) from the one-way ANOVA model.
They can be used as a measure of repeatability for each
assessor. A total of I*K MSE values are computed and can be plotted in a bar diagram, very similar to the F values in the F plot. If an assessor almost perfectly repeats her/
himself, this value should be close to zero. The lower the repeatability of a certain assessor, the higher his/her MSE will be. The MSE value should, however, always be considered together with the F values in order to get a realistic
overview of the assessor's performance. An assessor aiming for low MSE values can achieve this by scoring all samples about alike, thus reducing differences between replicates. However, such an assessor will clearly
have no discriminative power in the analysis, as the
respective F values will also be very low. If differences between the samples are present, an assessor should ideally have high F values and low MSE values.
p*MSE plots
In a p*MSE plot [15] the assessors' ability to detect dif-
ferences between samples is plotted against their repeat-
ability using the p values and MSE values from one-way
ANOVA calculations. A total of I*K pairs of p and MSE values are computed and plotted in a scatter plot. They can be presented together in various ways (for instance, all at
the same time, only for one attribute at a time or only for
one assessor at a time) and with highlighting of the
assessors or attributes that one is particularly interested
in. In an ideal situation all assessors should achieve low
p values and low MSE values for all attributes [15] if
differences between the samples are really present, thus
ending up in the lower left corner of the plot.
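The three quantities behind the F, MSE and p*MSE plots can be computed per assessor roughly as below (our own Python sketch; the function name and row ordering are assumptions, and scipy is used only for the tail probability of the F distribution):

```python
import numpy as np
from scipy import stats

def one_way_anova(Xi, J, M):
    """One-way ANOVA per attribute for one assessor.

    Xi: (J*M) x K matrix, rows assumed ordered sample-by-sample with
    M replicate rows per sample. Returns per-attribute (F, p, MSE);
    high F / low p signal discrimination, low MSE good repeatability."""
    K = Xi.shape[1]
    X = Xi.reshape(J, M, K)
    grand = X.mean(axis=(0, 1))
    sample_means = X.mean(axis=1)                         # J x K
    ss_between = M * np.sum((sample_means - grand) ** 2, axis=0)
    ss_within = np.sum((X - sample_means[:, None, :]) ** 2, axis=(0, 1))
    ms_between = ss_between / (J - 1)
    mse = ss_within / (J * (M - 1))                       # repeatability
    F = ms_between / mse                                  # discrimination
    p = stats.f.sf(F, J - 1, J * (M - 1))                 # upper tail prob.
    return F, p, mse
```

Plotting p against MSE for all I*K assessor-attribute pairs then reproduces the p*MSE scatter, with well-performing assessors in the lower left corner.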
Profile and line plots
Profile plots visualise how each assessor ranks and rates the
tested samples compared to the other assessors and the
panel consensus for a certain attribute (see example in Fig. 11). Each line represents one assessor (sample averages across replicates), whereas the single bold line represents the panel consensus (sample averages across assessors and replicates). The tested samples are ranked
along the horizontal axis according to the panel consensus
from left to right with increasing intensity for that attribute.
The vertical axis represents the scores (average across
replicates) of the particular assessor for the samples. In
case of high agreement between assessors, the assessor
lines follow the consensus line closely. With increasing
disagreement, the line of each assessor will follow its own
course and the plot will appear as more cluttered.
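The data behind such a profile plot, consensus ranking plus one line per assessor, can be prepared as sketched below (a hypothetical helper of our own, not PanelCheck code):

```python
import numpy as np

def profile_plot_data(scores):
    """Data behind a profile plot for one attribute.

    scores: array of shape (I, J, M) -- assessors x samples x
    replicates for a single attribute. Returns the sample order (by
    increasing panel consensus), the consensus line, and one
    replicate-averaged line per assessor, ready for plotting."""
    assessor_lines = scores.mean(axis=2)        # I x J, average replicates
    consensus = assessor_lines.mean(axis=0)     # panel consensus, length J
    order = np.argsort(consensus)               # left-to-right ranking
    return order, consensus[order], assessor_lines[:, order]
```

Plotting each row of the returned assessor lines against the bold consensus line then shows agreement (lines hugging the consensus) or disagreement (a cluttered plot).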
Each line plot [14] represents one sample, showing its average scores on each attribute in the form of a line connecting each attribute from left to right (see example in Fig. 8). In addition, raw data scores may be superimposed,
indicating how individual assessors have scored on the
particular sample. The vertical line for each attribute dis-
plays the scoring range used by all assessors for that given
attribute and each symbol represents one of multiple scores
provided by the panel.
Data merging and work flow strategy
In this section, we describe how to prepare and merge the sensory profiling data of the 26 panels prior to import into the PanelCheck software. Furthermore, we propose a workflow that suggests how to progress with the data analysis, i.e. which plots to use first and, depending on the information found, which plots to use further on. All methods
used here are integrated in the PanelCheck software and
thus may be accessed easily. The only exception is the
method used in PCA for investigating basic structure in
data. This particular analysis, however, can be easily
carried out using any multivariate statistics software
package that gives access to PCA. This work flow may also
be applied to single data sets from one panel.
Data merging
Before analysing the 26 data sets, some data pre-processing
and re-arranging is necessary. There are several possibili-
ties of how data may be merged prior to import into the
PanelCheck software.
Raw data
The most obvious way would be to concatenate all data sets
vertically, which practically would result in a single large
sensory panel with 213 assessors accumulated over all
the 26 panels. The dimension of this matrix would then be 3,195 × 9 (see Table 1, last row). By choosing this
approach, individual information on all 213 assessors is
preserved and available in the plots. In return, however,
interpretation might become cumbersome and challenging,
as some of the plots get crowded and unreadable with so
many assessors. Given this situation, performance issues on
individuals or a particular panel as a whole may be difficult
to identify. With fewer panels, though, this may be a valid
approach as the number of assessors also will be lower.
Sample averages across assessors and replicates
for each panel
Another possibility is to compute consensus sample aver-
ages for each panel across assessors and replicates. By doing so, one will have available 26 new consensus data matrices of dimension J × K. For the candy data at hand, the dimension of these matrices then is 5 × 9, with J = 5 samples and K = 9 attributes. The next step would be to
concatenate these consensus matrices vertically, resulting
in a merged data matrix of dimension (26*5) × 9 and import it into PanelCheck. In this case, each panel is treated as if it were an individual assessor in a sensory panel consisting of 26 assessors. Unfortunately, with this
approach, one loses information on repeatability and per-
formance of individual assessors, since the sample averages were computed across assessors and replicates and information on these two factors is lost. Hence, the plots visualising repeatability performance (the F plot, MSE plot and p*MSE plot) are not available in this case.
Sample averages across assessors for each panel
A third alternative is to compute sample replicate aver-
ages for each panel across the assessors of that particular
panel. This will lead to 26 data matrices of dimension (J*M) × K, which for the candy data at hand means (5*3) × 9, with J = 5 samples and M = 3 replicates. This is also indicated in Table 1. The resulting data matrix then is of dimension (26*15) × 9 when concatenating all 26 data matrices vertically, and is ready for import into PanelCheck. In this way, each panel is again treated as if it were an individual assessor in a sensory panel consisting of 26 assessors, but this time information on
repeatability is available as replicate information is pre-
served on the panel level.
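The two averaging variants can be sketched as below (our own illustrative Python; function names and the assessor-major, sample-major row ordering are assumptions, not PanelCheck code):

```python
import numpy as np

def merge_replicate_preserving(panel_data, J=5, M=3, K=9):
    """Average each panel across its assessors only, keeping the
    replicate structure: one (J*M) x K matrix per panel, stacked
    vertically. Rows of each input are assumed ordered
    assessor-by-assessor, each assessor contributing J*M rows."""
    blocks = []
    for X in panel_data:
        I_p = X.shape[0] // (J * M)               # assessors in this panel
        blocks.append(X.reshape(I_p, J * M, K).mean(axis=0))
    return np.vstack(blocks)                       # (n_panels*J*M) x K

def merge_consensus(panel_data, J=5, M=3, K=9):
    """Average each panel across assessors AND replicates: one J x K
    consensus matrix per panel, stacked vertically (repeatability
    information is lost). Each assessor's rows are assumed ordered
    sample-by-sample, M replicate rows per sample."""
    blocks = []
    for X in panel_data:
        I_p = X.shape[0] // (J * M)
        blocks.append(X.reshape(I_p, J, M, K).mean(axis=(0, 2)))
    return np.vstack(blocks)                       # (n_panels*J) x K
```

The first variant is the one used for the global analysis below, since its output still carries replicate information per panel.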
Data merging approaches used in this study
For the first part of the analysis (Global analysis of all 26
panels) where all 26 panels are investigated, the approach
described in Sample averages across assessors for each
panel was chosen, since it provides a performance over-
view over all panels and at the same time preserves
information on replicates on a panel level. This approach
can be seen as a middle way between the two approaches described above in Raw data and Sample averages across assessors and replicates for each panel. It is a
valuable approach when a large amount of data with many
panels is given, as in this study.
For the second part of the analysis (Local analysis of
panels P05, P17 and P25), focus is turned to only three of
the 26 panels and the individual assessors that belong to
them. These three panels (P05, P17 and P25) were identified as differing somewhat from the other panels on a number of attributes, as visualised with Tucker-1 plots (see
Results). Given this situation where only three panels are
to be analysed in detail, the data setup as described in Raw data is an appropriate approach. With the amount of raw data greatly reduced (down to three panels from 26), information on individual assessors will be more readable in the plots.
Work flow strategy
The proposed work flow strategy (Fig. 1) is by no means a
hard rule that represents the perfect general approach to
analysing and visualising all types of sensory profiling
data. It should rather be seen as a guide or path that one
may follow when analysing a new data set, and which may
be left at any time in the data analysis process. Since each data set may have its own unique characteristics, it may
require a unique approach and a different order of methods
and plots to be used for analysis.
In the proposed workflow, a good starting point could be an ANOVA (either two-way or three-way) to identify significant attributes at the 5% significance level, i.e. P < 0.05.
Non-significant attributes close to significance may also be
considered, since only a few noisy assessors might be
enough to make the attribute switch from significant to non-significant. Attributes which are far from being significant (say, P values of 0.1 and above) may be disregarded, based on the high likelihood that differences between the tested samples are not present. Preferably, this cut-off limit needs to be chosen by the panel leader, who has full
knowledge about the tested products and knows how well
the assessors of his/her sensory panel usually perform.
For the next step, one may consult Tucker-1 and Man-
hattan plots. Tucker-1 correlation loadings plots as imple-
mented in the PanelCheck software are based on replicate
averages, i.e. they do not contain information on repeat-
ability. They do, however, provide some quick diagnostics
that may be confirmed with other plots especially suited to
visualise that particular kind of problem. Depending on
how the assessors are distributed over the plots one may
identify possible disagreement in sample ranking, poor
sample discrimination ability, or crossover effects (turning the intensity scale upside down). Manhattan plots
may be used as a screening tool to identify deviating per-
formances based on the patterns found in the plots.
The next plots suggested are those based on one-way
ANOVA carried out on the individual data matrices Xi of each assessor. Those plots are the p*MSE plot, F plot and MSE plot. If an assessor lies, e.g., close to the centre of the
Tucker-1 correlation loadings plot the reason for this often
is poor discrimination ability of that particular assessor
compared to others closer to the outer ellipse. This may be
confirmed by the p*MSE plot or F plot. If poor sample
discrimination cannot be confirmed by either one-way
ANOVA plot another likely scenario might be ranking
disagreement. In this case, the problematic assessor does
not agree with the underlying structure found by Tucker-1 in the first two PCs. This particular assessor might discriminate well between the samples, however not in the
same way as the panel consensus. Therefore, such an
assessor may show systematic variation in PC3 or higher.
This may be confirmed by profile plots.
If none of the plots mentioned above allows for a con-
clusion, one might want to consult line plots for visualising
the raw data of every sample. Studying details on the raw
data might help to reveal issues that are not caught with
other plots. With the help of the workflow, one may analyse one attribute at a time and finish the analysis when performance on all attributes has been evaluated.
Results
In this section, we will first investigate the sensory profiling data of all 26 participating panels ('Global analysis of all 26 panels') before going further into detail by looking at the performance of only a few selected panels ('Local analysis of panels P05, P17 and P25') that vary somewhat from most other panels.
Global analysis of all 26 panels
Two-way ANOVA
Following the workflow shown in Fig. 1, a two-way ANOVA (details in 'Mixed model ANOVA for assessing the importance of attributes') was computed first. The results are shown in Fig. 2. All attributes were significant with P < 0.001; hence, all attributes were kept for further analysis.
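As an illustration of the kind of model computed here, the product effect in a balanced two-way mixed model (product fixed, panel random, so the product mean square is tested against the product-by-panel interaction) can be sketched as follows. This is a simplified sketch with our own function name and an assumed samples × panels × replicates layout, not the exact model used by PanelCheck:

```python
import numpy as np

def mixed_anova_product_f(scores):
    """Product effect in a balanced two-way mixed model for one
    attribute. scores has shape (samples, panels, replicates);
    product is fixed, panel is random, so the product mean square
    is tested against the product-by-panel interaction.
    Returns the F value and its degrees of freedom."""
    I, J, K = scores.shape
    grand = scores.mean()
    m_i = scores.mean(axis=(1, 2))           # sample means
    m_j = scores.mean(axis=(0, 2))           # panel means
    m_ij = scores.mean(axis=2)               # cell means
    ss_sample = J * K * np.sum((m_i - grand) ** 2)
    ss_inter = K * np.sum(
        (m_ij - m_i[:, None] - m_j[None, :] + grand) ** 2)
    ms_sample = ss_sample / (I - 1)
    ms_inter = ss_inter / ((I - 1) * (J - 1))
    return ms_sample / ms_inter, I - 1, (I - 1) * (J - 1)
```

A large F value relative to the F distribution with the returned degrees of freedom indicates that the attribute discriminates between the products.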
PCA for investigating basic structure in data
The purpose of this analysis step is to get a quick and
general overview over how the data is structured and to
identify panels that may differ greatly in regard to how they
perceive differences between the tested samples. This is
done by applying PCA to the merged data set as described in 'Sample averages across assessors and replicates for each panel'. The results are reported in Fig. 3a–c, showing the explained variance, scores and loadings, respectively.
Eur Food Res Technol (2010) 230:497–511

Figure 3a shows that the first two principal components explain 93% of the total variability contained in the data set. Figure 3b shows how the samples are distributed in the multivariate space. Each sample is represented 78 times (3 replicates × 26 panels) with a fairly good separation
between the samples. The scores plot shows that the first
axis discriminates between samples A2, B and C1 on one
side versus A1 and C2 on the other. The samples in the
latter group are characterised by high intensity for the
attributes sweet taste, sugar coat and to a certain extent
acidic flavour and raspberry flavour (see loadings plot in
Fig.3c). The former group is characterised by high
intensity for attributes sticking, transparency, elastic-
ity, hardness and biting. Along the second axis, thereseems to be a split between samples A2 and B (samples on
the left side of the scores plot) and A1 and C2 (samples on
right side of score the plot). This tendency is strongly
related to attribute 'acidic flavour', with high intensity for samples A2 and C2 and low intensities for samples A1 and B. This is in accordance with the experimental design described above ('Experimental'). Attribute 'sweet taste' also seems to contribute to the split along PC2, although not to the same degree as 'acidic flavour'. Moreover, there
Fig. 1 Proposed workflow for the analysis of assessor and panel performance
seems to be no clear coherence with the sugar content in the samples, as one would expect from the experimental design. Nonetheless, the scores plot shows that the panels are in good agreement regarding how the samples differ from each other, except for one evaluation of sample C2 and one evaluation of A1. Other than that, there are no anomalies to be detected, which rules out severe differences between any of the panels.
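The PCA used in this overview step can be sketched with a plain SVD on the column-centred data. This is a hypothetical minimal sketch (PanelCheck additionally reports cross-validated explained variance, which is not reproduced here):

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD of the column-centred matrix X (rows x
    attributes), e.g. 390 x 9 when five samples are scored by 26
    panels in 3 replicates. Returns scores, loadings and the
    fraction of total variance explained by each of the first
    n_components."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = s ** 2 / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]
```

Plotting the returned scores against each other gives the scores plot of Fig. 3b; the loadings give Fig. 3c.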
Tucker-1 and Manhattan plots of all panels
For the next step, Tucker-1 correlation loadings plots (Fig. 4) are utilised to identify attributes with potential performance issues. For the data at hand, nine identical plots are given (for nine attributes), each with one attribute highlighted at a time.
By screening through the plots, one can see that the
overall performance between the 26 panels can be con-
sidered as very good for most of the attributes. A very large
part of the variation in the data is explained using only PC1
and PC2. The total amount of variance explained by PC1
and PC2 is 98.6%, with PC1 and PC2 explaining 92.6 and 6.0%, respectively. From previous experience, we can state that this number is very high compared to other data sets, despite the high number of 234 variables (26 panels × 9 attributes). One important reason for this is that much of the noise was eliminated by averaging sample scores over assessors. The plots show that none of the panels falls inside the inner ellipse for any attribute, meaning that all of them have more than 50% of their variation explained by PC1 and PC2. For all attributes
Fig. 2 Product effect in the two-way ANOVA model based on 26 panels. All attributes are significant with P < 0.001 and are included in further analysis
Fig. 3 a Explained variances from PCA on the data described in 'Sample averages across assessors for each panel'. The upper (full) and lower (dashed) lines visualise the calibrated and validated explained variance, respectively. b PCA scores plot visualising how the 26 panels discriminated between the five tested samples. c PCA loadings plot showing how the attributes contributed to the variation in the merged data set
except 'acidic flavour', 'sweet taste' and 'raspberry flavour', the 26 panels show very good agreement, as they are well clustered at the outer ellipse. For the three attributes mentioned above there is some disagreement, since the panels are more spread out along the outer ellipse. Attributes 'acidic flavour' and 'sweet taste' are the only attributes contributing to systematic variation in PC2.
Furthermore, it is obvious that panel P01 disagrees with the other panels on attribute 'sticking', since it is located on the opposite side from the other panels. From previous experience, it is known that such a situation is caused by turning the scale upside down, i.e. confusing high and low intensity. This assumption is confirmed by the profile plot for attribute 'sticking' shown in Fig. 5: panel P01 seems to have confused high and low intensity for the tested samples. Moreover, one may observe in Fig. 4 that panel P19 has less systematic variation for attribute 'elasticity' compared to the other panels. A profile plot of this attribute (not shown) reveals that panel P19 ranks the samples identically to the consensus; however, its intensity
differences between the samples deviate somewhat from
that of the consensus. This is why panel P19 lies in the same direction as the remaining panels in the Tucker-1 correlation loadings plots but does not align as well with the other panels.

After screening through the Tucker-1 plots, one may consult Manhattan plots (Fig. 6) for comparison of the
systematic variation for a specific attribute across all
Fig. 4 Nine identical Tucker-1 plots, each highlighting one of the nine attributes used in the profiling. There is some variation between the panels for attributes 'acidic flavour', 'sweet taste' and 'raspberry flavour'
Fig. 5 Profile plot of attribute 'sticking'. Panel P01 clearly stands out from the other panels because of opposite scoring on high and low intensity of the tested samples
panels. The Manhattan plots confirm what was shown in
the Tucker-1 plots. The attributes 'acidic flavour', 'sweet taste' and 'raspberry flavour' need two or more principal components to reach a high level of explained variance.
For the remaining attributes, all panels reach a high percentage of explained variance already after one principal component. The only exception is attribute 'elasticity', where one can easily see that panel P19 differs from the other panels. The lone dark bar indicates that panel P19 has less systematic variance for this attribute than the other panels and needs three to four principal components before its explained variance is comparable with that of the other panels. For this attribute, all panels have an explained variance higher than or very close to 99% using only PC1, except panel P19 with only 62% after PC1. After PC3, the cumulative explained variance of 90% for panel P19 is still somewhat lower than those of the other panels. With four PCs, panel P19 reaches 100% explained variance.
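The numbers behind one column of a Manhattan plot can be sketched as follows: run a PCA on one panel's standardised data and, for each attribute, compute the fraction of its variance reproduced by the first k components. This is a hedged sketch under assumed standardisation, with our own function name; PanelCheck's exact preprocessing may differ:

```python
import numpy as np

def manhattan_values(X, max_pc=4):
    """For one panel's data X (rows x attributes), return an
    (max_pc, n_attributes) array holding the fraction of each
    attribute's variance reproduced by the first k principal
    components, k = 1..max_pc. In the plot, dark bars correspond
    to low values and light bars to high values."""
    Xc = X - X.mean(axis=0)
    Xc = Xc / Xc.std(axis=0)                 # assumes non-constant columns
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    total = (Xc ** 2).sum(axis=0)
    out = np.empty((max_pc, Xc.shape[1]))
    for k in range(1, max_pc + 1):
        Xk = (U[:, :k] * s[:k]) @ Vt[:k]     # rank-k reconstruction
        resid = ((Xc - Xk) ** 2).sum(axis=0)
        out[k - 1] = 1.0 - resid / total
    return out
```

A panel like P19, which needs three to four components for 'elasticity', would show a low value in the first rows of the corresponding column.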
p*MSE, MSE and F plots based on one-way ANOVA
The p*MSE plots are not presented here since sample
discrimination is highly significant for all attributes across
all panels. Of the 234 given p values (26 panels × 9 attributes), the highest was P = 0.037.
In Fig. 7a and b, the F and MSE plots are presented, respectively. As can be seen, some of the panels have a much higher F value than others, even though all of them are significant at the 1% level. The horizontal lines indicating F values at the 1 and 5% levels cannot be seen here, since some F values are extremely high; both lines therefore fall onto the horizontal axis, as their corresponding F values are extremely low compared to the highest F values in the plot. When investigating the panels' discrimination ability one can see, for instance, that panel P11 has relatively low F values compared to those of panel P20. At the same time, the MSE values (Fig. 7b) of panel P11 are relatively high. This indicates that panel P11 is somewhat
Fig. 6 Nine Manhattan plots, one for each attribute, visualising systematic variation from individual PCA on the data of each panel. Vertical axes represent the number of PCs used and their corresponding cumulative explained variance. Horizontal axes represent the respective sensory panels. Black colour corresponds to 0% explained variance, whereas white colour corresponds to 100% explained variance
less precise and has a lower capability of detecting differences. Panel P20, on the other hand, has relatively low MSE values (good repeatability) combined with relatively high F values (good sample discrimination), indicating a much better performance than panel P11. Panel P21 is an example where high F values are achieved, however, coupled with high MSE values. In other words, this panel discriminates well between the tested samples, but less precisely so than panel P20. In terms of performance, panel P21 may be ranked between panel P20 (good) and panel P11 (not as good). Still, panel P11 shows an acceptable performance, since its F values are all significant at the 1% level. Note that the F and MSE plots provide no information on sample ranking differences and that these two plots alone therefore are not sufficient for a complete evaluation of panel performance. It should be mentioned that both plots could also be sorted by attribute to check which of the attributes have the lowest/highest variance and the best ability to distinguish between samples.
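The two statistics plotted here come from a one-way ANOVA per panel and attribute, which can be sketched as below. A minimal sketch assuming a balanced samples × replicates layout; the function name is ours:

```python
import numpy as np

def f_and_mse(scores):
    """One-way ANOVA diagnostics for a single panel (or assessor)
    and attribute. scores has shape (samples, replicates). Returns
    the F value (sample discrimination, F plot) and the MSE
    (repeatability, MSE plot)."""
    I, K = scores.shape
    grand = scores.mean()
    m_i = scores.mean(axis=1)                # sample means
    ss_between = K * np.sum((m_i - grand) ** 2)
    ss_within = np.sum((scores - m_i[:, None]) ** 2)
    ms_between = ss_between / (I - 1)
    mse = ss_within / (I * (K - 1))          # pooled within-sample variance
    return ms_between / mse, mse
```

A high F combined with a low MSE corresponds to the favourable pattern described for panel P20; a high F with a high MSE corresponds to the pattern of panel P21.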
Line plots
Figure 8 shows line plots of the five tested samples. The plots highlight that for every sample and attribute there is a varying degree of variability across the panels (vertical lines indicating the spread of the scores). This variability could be due to, for instance, local differences in calibration. This is particularly true for attribute 9 ('transparency'). For attribute 5 ('sticking'), however, there seems to be a higher degree of agreement among the panels.
Local analysis of panels P05, P17 and P25
After studying all the panels' averages across assessors (based on data as described in 'Sample averages across assessors for each panel'), the data of panels P05, P17 and P25 were analysed in more detail. These three panels were picked over others because they differ from each other for the attributes 'acidic flavour', 'sweet taste' and 'raspberry flavour': they are spread somewhat in terms of location in the Tucker-1 plots of these three attributes.
Since we are now focusing on only three panels and we
wish to analyse in more detail why these three panels differ
somewhat for the attributes mentioned above, we will use
their raw data from here on. In order to do that, the raw data needs to be merged as described in 'Raw data' before being imported into PanelCheck. When merging the raw
data of panels P05, P17 and P25, the resulting data matrix will be of dimension (435 × 9), with panel P05 contributing 150 rows (10 assessors × 5 samples × 3 replicates), panel P17 contributing 135 rows (9 assessors × 5 samples × 3 replicates) and panel P25 contributing 150 rows (10 assessors × 5 samples × 3 replicates). See Table 1 for details on panel sizes. This new data set in practice represents one new large panel consisting of 29 individuals (10 + 9 + 10 assessors). By using the same methods as before, the performance of the individuals belonging to these three panels can now be visualised.
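The merging step described above can be sketched as a simple row-wise stacking with relabelled assessors. The dictionary layout and function name below are hypothetical, chosen only to illustrate the bookkeeping:

```python
import numpy as np

def merge_panels(raw):
    """Stack raw panel data into one 'super panel'. raw maps panel
    name -> (data, n_assessors), where data is a (rows x attributes)
    array with rows ordered assessor-by-assessor. Returns the merged
    matrix and one global assessor label per row."""
    blocks, labels = [], []
    for panel, (X, n_assessors) in raw.items():
        rows_per_assessor = X.shape[0] // n_assessors
        blocks.append(X)
        for a in range(1, n_assessors + 1):
            labels += ["%s-%d" % (panel, a)] * rows_per_assessor
    return np.vstack(blocks), labels
```

For panels of 10, 9 and 10 assessors scoring 5 samples in 3 replicates, this yields the 435 × 9 matrix with 29 distinct assessor labels described in the text.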
Mixed model ANOVA
Mixed model ANOVA again reports that all attributes are significant at level P < 0.001, meaning that this new panel consisting of 29 individuals discriminated well between the samples (plot not shown). Again, all attributes were considered for further analysis.
Tucker-1 plots
Tucker-1 plots based on raw data from the three selected panels (Fig. 9) confirm what was spotted in the Tucker-1 plots above (Fig. 4) based on the data from all 26 panels. There is substantial disagreement across assessors
Fig. 7 a F plot visualising the panels' ability to discriminate between the tested samples for each attribute. Panel P21 discriminates less between the samples than, for example, panel P20. The horizontal lines indicating F values at significance levels 1 and 5% are not visible, as they are very low and therefore fall onto the horizontal axis. b MSE plot visualising the repeatability of each panel. Panel P21 obviously has a weaker performance regarding repeatability than, for example, panel P06
for the attributes 'acidic flavour', 'sweet taste' and especially 'raspberry flavour', indicating that further improvement in agreement across assessors is possible. Although the assessors are somewhat scattered over the correlation loadings plots, most of them have high explained variances for the first two PCs. This indicates that the majority
Fig. 8 Five line plots where each plot represents the data of one sample. Vertical axes represent intensity scores. Horizontal axes represent the nine sensory attributes
Fig. 9 Nine identical Tucker-1 plots, each highlighting one of the nine attributes used in the profiling. The plots are based on raw data of panels P05, P17 and P25
discriminates well between the samples, but that there might be disagreement on sample ranking for a particular attribute. Studying the three correlation loadings plots in detail confirms this by revealing that the assessors of each panel tend to form clusters of their own within the plot. For the remaining six attributes ('transparency', 'sugar coat', 'biting', 'hardness', 'elasticity' and 'sticking') overall agreement is very good. These results were confirmed by the Manhattan plots (not shown).
p*MSE plots
As opposed to the situation above with all 26 panels, where sample discrimination was highly significant ('p*MSE, MSE and F plots based on one-way ANOVA'), in this case the p*MSE plot (Fig. 10) gives a valuable contribution to understanding individual differences. It can be seen that for attribute 'raspberry flavour' panel P17, and to a certain extent panel P05, are less capable of detecting differences between the samples (larger P values) than panel P25. Moreover, random noise is generally larger for panels P05 and P17. This indicates that the individuals of panel P25, and therefore panel P25 as a group, perform much better than panels P05 and P17.
Profile plot for panels
Profile plots (Fig. 11) show that the disagreement in evaluating the samples is strongest for the attributes 'acidic flavour', 'sweet taste' and particularly 'raspberry flavour', as already observed in the Tucker-1 plots in Fig. 9. For attributes 'transparency', 'sugar coat', 'biting', 'hardness' and 'elasticity' the profiles are very alike for most of the assessors, with very few exceptions. For attribute 'sticking', three assessors of panel P17 (individuals P17-1, P17-4 and P17-9) generally rated the samples with the highest intensity (B, C1 and A2) lower than the remaining assessors.
Fig. 10 p*MSE plot for attribute 'raspberry flavour' for panels P05, P17 and P25
Fig. 11 Nine profile plots, one for each attribute, visualising sample intensity and rankings for each assessor in panels P05, P17 and P25. Vertical axes represent sample intensity scores. Horizontal axes represent the five tested samples sorted by intensity based on consensus. The circle highlights three deviating assessors belonging to panel P17 (assessors P17-1, P17-4 and P17-9)
Summary and discussion
In this paper, we have presented how to extract critical information on panel performance from a proficiency test. In the example described here, 26 sensory panels tested a set of 5 candy samples, produced according to an experimental design, in 3 replicates using 9 attributes. Since the panels varied in size, from 3 assessors at the least to 15 assessors at the most, the size of the data from each panel varied accordingly. We demonstrated how to arrange the large amount of data prior to analysis and which methods to use in the analysis process. For this, we proposed a general workflow that may be used as a guide through the data analysis process, but which is not forced upon the user.
For the data at hand, performance analysis was carried out first at a global level, based on data from all 26 panels, where each panel was treated as if it were an individual assessor. This means that rather than visualising the performance of individuals, it is the performance of panels as a whole compared to other panels that is visualised. As a result of this process, three of the 26 panels were identified for further analysis at a more detailed local level. This included performance visualisation of individual assessors from each of these three panels. In both cases, the same methods were applied to gather information on performance. The methods used were mixed model ANOVA, the Tucker-1 plot, Manhattan plot, one-way ANOVA based F plot, MSE plot, p*MSE plot, profile plot and line plot. The reason for using multiple plots is that each of the plots contains unique information on panel and assessor performance. Their joint information content provides a more complete performance overview of individual assessors and their sensory panel (local level) or of sensory panels compared with each other (global level). Performance information from such an analysis can then be used by panel leaders as feedback to improve overall panel performance and the performance of individual assessors.
Acknowledgments Thanks to Rikke Lazarotti at LEAF Denmark for production of the wine gum samples and for providing access to the sensory profiling data. We would like to thank the Research Council of Norway (project number 168152/110), The Foundation for Research Levy on Agricultural Products (Norway) and The Danish Food Industry Agency for project funding.