

ORIGINAL PAPER

Analysing sensory panel performance in a proficiency test using the PanelCheck software

Oliver Tomic · Giorgio Luciano · Asgeir Nilsen · Grethe Hyldig · Kirsten Lorensen · Tormod Næs

Received: 5 May 2009 / Revised: 28 October 2009 / Accepted: 9 November 2009 / Published online: 2 December 2009
© Springer-Verlag 2009

Abstract This paper discusses statistical methods and a workflow strategy for comparing performance across multiple sensory panels that participated in a proficiency test (also referred to as an inter-laboratory test). Performance comparison and analysis are based on a data set collected from 26 sensory panels carrying out profiling on the same set of candy samples. The candy samples were produced according to an experimental design using design factors such as sugar and acid level. Because of the exceptionally large amount of data and the availability of multiple statistical and graphical tools in the PanelCheck software, a workflow is proposed that guides the user through the data analysis process. This allows practitioners and non-statisticians to get an overview of panel performances in a rapid manner without needing to be familiar with the details of the statistical methods. Visualisation of data analysis results plays an important role, as it provides a time-saving and efficient way of screening and investigating sensory panel performances. Most of the statistical methods used in this paper are available in the open source software PanelCheck, which may be downloaded and used for free.

Keywords Proficiency test · Inter-laboratory test · Sensory profiling · Performance visualisation · PanelCheck

    Introduction

Trained sensory panels are important tools for assessing the quality of food and non-food products. There are, however, a number of problems related to the training, stability, and maintenance of the quality of such panels. A number of methods have been developed that may help to achieve better panel performance [1–5]. These techniques can detect lack of precision (repeatability), disagreement (reproducibility), and the ability or inability to discriminate between samples. This type of information is very useful for improving data quality in future sessions through increased and more targeted training on problematic issues.

Larger companies maintaining sensory panels at multiple geographic locations are often subject to additional challenges. For example, thoroughly carried out quality control and product development require that all sensory panels are well calibrated with one another, eliminating potential shift between the panels and allowing for comparison of their results. When multiple sensory panels are to evaluate the same set of samples, global performance issues (across multiple sensory panels) might add to already existing local performance issues (within one sensory panel). This further complicates comparison of results from each involved panel. Techniques for proficiency tests are available, but most of them are developed for classical chemical inter-laboratory comparisons (see, e.g. [6]) and with less focus on some of the more specific aspects of sensory analysis such as those indicated above. Important contributions to the proficiency test literature are available [7–9]. In these papers, classical ANOVA,

O. Tomic (✉) · G. Luciano · A. Nilsen · T. Næs
Nofima Mat AS, Osloveien 1, 1430 Ås, Norway
e-mail: [email protected]

G. Hyldig
DTU Aqua, National Institute of Aquatic Resources,
Technical University of Denmark, Søltofts Plads,
Build. 221, 2800 Lyngby, Denmark

K. Lorensen
Chew Tech I/S, Vejle, Denmark


Eur Food Res Technol (2010) 230:497–511

    DOI 10.1007/s00217-009-1185-y


Principal Component Analysis (PCA), Multiple Factor Analysis (MFA), and Generalised Procrustes Analysis (GPA) are used for studying intra- and inter-laboratory variation.

The main focus of the present paper is to discuss and to illustrate how techniques developed specifically for performance visualisation of a single sensory panel [5] can also be applied for comparing multiple panels. Some of the techniques are related to the methods mentioned above, while others are new in this context. Univariate as well as multivariate statistical methods will be presented and used in this paper. The univariate methods highlight differences for each attribute separately, while the multivariate methods look at differences at a more general level, taking into account also correlations between the attributes. All presented techniques are graphically oriented and should therefore be easy to understand for practitioners and non-statisticians. A major issue is to stress how the techniques can be used to highlight or visualise various types of differences between the assessors and the panels. Furthermore, a workflow suggesting how to progress with the data analysis and how to use the methods available in the PanelCheck software will also be proposed. This allows for rapid and efficient analysis of sensory profiling data, both in the case of one panel and of multiple panels. The software provides an intuitive and easy-to-use graphical user interface that handles all statistical computations in the background and visualises results in different types of plots. This enables the practitioner and non-statistician to concentrate on performance analysis rather than spending time on trying to apply algorithms to data by themselves. The open source PanelCheck software may be downloaded, distributed, and used for free (http://www.panelcheck.com) [10].

    Experimental

The dataset discussed here is the result of a joint project between Danish, Norwegian, Swedish, and English research institutes and commercial companies. In all, 26 panels were involved in the project (research as well as industry panels), with one of the aims being to investigate the performance of multiple sensory panels with the PanelCheck software.

The samples studied were five candies (wine gums) produced according to an experimental design with two design factors, i.e. sugar level and acid content: A1 (high sugar, low acid), A2 (high sugar, high acid), B (medium sugar, low acid), C1 (low sugar, low acid), C2 (low sugar, high acid). All samples were produced at LEAF Denmark. The evaluation of the samples had to be performed within 1 month after production. LEAF Denmark guaranteed that the samples did not change their sensory properties within this period. The candy samples were tested by each of the 26 participating panels.

Each sensory panel received detailed instructions about sample preparation and evaluation. The sensory panel at LEAF performed sensory profiling on the samples and suggested nine sensory attributes, which the remaining 25 sensory panels were to use for profiling. Two samples (A1, C1) were used for training and calibration by all sensory panels. Sample C2 was used as a reference sample for maximum intensity of the attribute acidic flavour. For the remaining attributes, either sample A1 or C1 was used as a reference for low or high intensity. All attributes were evaluated on an intensity scale from 0 (no intensity) to 15 (high intensity). Water was used to clean the palate between each sample. The nine attributes used to describe the samples were: transparency, acidic flavour, sweet taste, raspberry flavour, sugar coat (the thickness of the sugar peel visible on the cut wine gum piece), biting strength in the mouth (referred to as biting), hardness, elasticity in the mouth (referred to as elasticity), and sticking to teeth in the mouth (referred to as sticking).

Each of the 5 samples was evaluated in 3 replicates, resulting in a total of 15 samples to be tested by each panel. One piece of wine gum weighed 3.5 g. In each serving, the assessors got four to five pieces, of which one was cut in half by the sensory staff, allowing the assessors to score the appearance attributes. For those panels that did not have access to specific software for automatic randomisation of candy samples, a Latin square design was provided as an example for serving order. All 26 sensory evaluations took place in June 2007. Table 1 shows an overview of the 26 panels, indicating their number of assessors, the size of the data matrix of each panel, and the size of the data used for the first part of the analysis that included all panels.

    Methods

In the following section, the univariate and multivariate statistical methods used for data analysis will be discussed. The results of these methods are visualised in various plots, helping non-statisticians to visually detect performance issues without having to know all the details of the statistical methods. It should be emphasised that the real strength of these methods is revealed only when they are used together. Each plot has its own special feature that represents an element of unique information, but their joint information content is what really provides a holistic overview of the performance of the investigated panels. The methods will be presented in an order that complies with the suggested data analysis workflow (see "Work flow strategy").


The same workflow may be applied for one sensory panel at a time as well as for multiple sensory panels. In this sense, one needs to think in terms of groups and individuals such that the statistical methods may be applied appropriately. When analysing the performance of one sensory panel, the panel as a whole represents the group level while the assessors represent the individual level. This changes, however, when applying the same methods to data from multiple panels. Here, the group consisting of 26 panels represents the group level, whereas each single panel represents the individual level. In other words, the group of 26 panels will be treated as one large panel with each panel representing one assessor. How this is done in practice will be elaborated later in "Data merging".

In the description of the statistical methods in "Mixed model ANOVA for assessing the importance of attributes" to "Profile and line plots" below (considering performance analysis of only one panel) we let j = 1,…,J denote the samples tested, m = 1,…,M the replicates, k = 1,…,K the attributes and i = 1,…,I the assessors. We let Xi denote the data matrix of assessor i with J*M rows and K columns. That means that Xi of each assessor is of dimension (J*M) × K in any of the 26 data sets. For the candy data set, the dimension of Xi then is (5*3) × 9, with J = 5 samples, M = 3 replicates and K = 9 attributes.
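As a small illustration of this layout (a hypothetical sketch with made-up scores, not data from the study), an assessor's matrix Xi can be represented as an array with J*M rows and K columns:

```python
import numpy as np

# Hypothetical illustration of the notation above: one assessor's data
# matrix X_i has J*M rows (J samples x M replicates) and K columns
# (attributes), with scores on the 0-15 intensity scale.
J, M, K = 5, 3, 9                          # candy data dimensions
rng = np.random.default_rng(0)             # made-up scores
X_i = rng.uniform(0, 15, size=(J * M, K))

assert X_i.shape == (15, 9)                # (J*M) x K
```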

Mixed model ANOVA for assessing the importance of attributes

As a first step, mixed model (2- or 3-way) ANOVA can be used for assessing the importance of the used sensory attributes in detecting significant differences between the samples. The method is based on modelling samples, assessors and their interactions in two-way ANOVA, or samples, assessors, replicates and their interactions in three-way ANOVA, and then testing for the sample effect by the use of regular F tests. In each case (2-way or 3-way), the assessor and interaction effects are assumed to be random [11]. Only attributes that are significant at a certain level (in the case presented here, a 5% significance

Table 1 Overview of all sensory panels that participated in the proficiency test

Sensory panel   Number of assessors   Number of data rows   Number of data rows
                                      in raw data (J*M*I)   used in global analysis
P01             7                     105                   15
P02             11                    165                   15
P03             8                     120                   15
P04             11                    165                   15
P05             10                    150                   15
P06             15                    225                   15
P07             9                     135                   15
P08             3                     45                    15
P09             11                    165                   15
P10             8                     120                   15
P11             7                     105                   15
P12             7                     105                   15
P13             5                     75                    15
P14             7                     105                   15
P15             8                     120                   15
P16             8                     120                   15
P17             9                     135                   15
P18             8                     120                   15
P19             6                     90                    15
P20             7                     105                   15
P21             6                     90                    15
P22             10                    150                   15
P23             11                    165                   15
P24             7                     105                   15
P25             10                    150                   15
P26             4                     60                    15
Total           213                   3,195                 390

J, M, I: the number of tested products, replicates and assessors, respectively


level was chosen) for the product effect are considered for further analysis. In the case of the candy data, two-way ANOVA was used since the replicates of the tested samples were served in random order. If sample replicates were served systematically, say one replicate per session, three-way ANOVA (with main effects for assessor, sample and replicate and interactions between them) should be considered instead. The reason for this is that by testing the replicates in separate sessions, it is likely that additional systematic variance between the replicates will be introduced into the data. The replicate effect in three-way ANOVA then indicates whether a significant systematic session-based variation in the data is present or not.
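For the two-way case, the mixed-model F test for the sample effect divides the sample mean square by the assessor-by-sample interaction mean square, the appropriate denominator when assessor and interaction effects are random. The following sketch, with made-up data and one attribute handled at a time, illustrates the computation; it is an assumption-laden illustration in our own notation, not the PanelCheck implementation:

```python
import numpy as np

def sample_effect_f(scores):
    """Two-way mixed-model ANOVA F statistic for the sample effect.

    scores: array of shape (I, J, M) -- assessors x samples x replicates
    for ONE attribute. Assessor and assessor*sample effects are random,
    so MS(sample) is tested against MS(interaction).
    Returns F and its degrees of freedom (J-1, (I-1)*(J-1)).
    """
    I, J, M = scores.shape
    grand = scores.mean()
    sample_means = scores.mean(axis=(0, 2))      # length J
    assessor_means = scores.mean(axis=(1, 2))    # length I
    cell_means = scores.mean(axis=2)             # I x J

    ss_sample = I * M * ((sample_means - grand) ** 2).sum()
    ss_inter = M * ((cell_means
                     - assessor_means[:, None]
                     - sample_means[None, :] + grand) ** 2).sum()

    ms_sample = ss_sample / (J - 1)
    ms_inter = ss_inter / ((I - 1) * (J - 1))
    return ms_sample / ms_inter, (J - 1, (I - 1) * (J - 1))

# Made-up data with a clear sample effect: 10 assessors, 5 samples,
# 3 replicates, small noise around sample-specific intensity levels.
rng = np.random.default_rng(42)
levels = np.array([2.0, 5.0, 8.0, 11.0, 14.0])
scores = levels[None, :, None] + rng.normal(0, 0.5, size=(10, 5, 3))

F, df = sample_effect_f(scores)
# With this strong effect, F far exceeds the 5% critical value
# (about 2.6 for df = (4, 36)).
assert F > 2.6
```

A p value would follow by referring F to the F distribution with these degrees of freedom; attributes whose sample effect is significant at the chosen level are then kept for further analysis.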

Tucker-1 plots

In the next step, the multivariate analysis method Tucker-1 [12, 13] is applied in order to get an overview of assessor and panel performance using multiple attributes. Tucker-1 is essentially PCA on an unfolded data matrix consisting of all individual matrices Xi,av aligned horizontally. Here, Xi,av represents the matrix of one assessor of dimension J × K where the sample score is based on the average across replicates, hence indicated with av in the index. This means that the dimension of this unfolded matrix is J × (I*K). In the case of our candy data set the dimension would be 5 × (10*9), with J = 5 samples and K = 9 attributes, if the panel consists of, say, I = 10 assessors [as is the case, for example, for panels P05, P22 and P25 (see Table 1)]. I will of course vary according to the number of assessors in the panel, and consequently so will the dimension (I*K).

PCA on this unfolded matrix provides two types of plots that are of interest: a common scores plot and a correlation loadings plot. The common scores plot shows how the tested J samples relate to each other, i.e. it visualises similarities and dissimilarities between the samples along the found principal components. This plot gives no direct information on assessor or panel performance, but it is a valuable visualisation tool that helps the user to roughly and quickly investigate whether the panel could distinguish between the samples or not by taking the explained variances into account. If the explained variance in the first few (usually two) PCs is relatively high, large systematic variation is present in the data, which again may indicate that the panel discriminates well between the samples. Note that the explained variance for a Tucker-1 common scores plot generally is somewhat lower for the first few PCs compared to those from PCA on the ordinary consensus average matrix. This is because the Tucker-1 analysis is based on many more variables and therefore more noise is present in the data.

The correlation loadings plot provides performance information on each assessor and the sensory panel as a whole. The plot contains I*K dots, with each dot representing one assessor-attribute combination (e.g. attribute sweet taste of assessor 5, etc.). By highlighting different dots, either those of one assessor or those of one attribute, one can visualise the performance of individual assessors or the whole panel. The position of the dots within the plot provides information on how well an individual or the panel as a whole performs. The more noise the attribute of a particular assessor contains, the closer the dot will appear to the origin, i.e. the middle of the plot. The more systematic information an attribute of an assessor contains, the closer it will appear to the outer ellipse (100% explained variance for that attribute, see Fig. 4). The inner ellipse represents 50% explained variance and can be considered a rule-of-thumb lower boundary for how much explained variance an attribute should at least have to be considered good enough. It is recommended also to consult higher PCs, since some assessors might have much systematic variance in dimensions other than PC1 and PC2 and thus initially appear noisy. Detailed information on the statistical aspects and interpretations of Tucker-1 common scores plots and correlation loadings plots is given in [3].
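The unfolding and the two plot ingredients can be sketched as follows (a minimal illustration with made-up data; the function and variable names are ours, not PanelCheck's):

```python
import numpy as np

def tucker1(assessor_matrices, n_pc=2):
    """Tucker-1 sketch: PCA (via SVD) on the horizontally unfolded matrix.

    assessor_matrices: list of I arrays X_i_av, each J x K
    (replicate-averaged scores). Returns the common scores (J x n_pc)
    and the correlation loadings: for each PC, the correlation between
    the score vector and each of the I*K assessor-attribute columns.
    """
    X = np.hstack(assessor_matrices)          # J x (I*K) unfolded matrix
    Xc = X - X.mean(axis=0)                   # column-centre
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = (U * s)[:, :n_pc]                # common scores
    corr = np.array([[np.corrcoef(scores[:, p], Xc[:, v])[0, 1]
                      for v in range(Xc.shape[1])]
                     for p in range(n_pc)])   # n_pc x (I*K)
    return scores, corr

# Made-up data: I = 10 assessors, J = 5 samples, K = 9 attributes.
rng = np.random.default_rng(7)
mats = [rng.uniform(0, 15, size=(5, 9)) for _ in range(10)]
scores, corr = tucker1(mats)

assert scores.shape == (5, 2)                 # J x 2 common scores
assert corr.shape == (2, 90)                  # one dot per assessor-attribute
assert np.all(corr[0]**2 + corr[1]**2 <= 1 + 1e-9)
```

The last assertion reflects why every assessor-attribute dot falls inside the outer (100%) ellipse: with orthogonal score vectors, the squared correlations of a column with PC1 and PC2 sum to at most 1, and the inner 50% ellipse marks where that sum reaches 0.5.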

    Manhattan plots

Manhattan plots in general provide an alternative way to visualise systematic variation in data sets, as described earlier [3]. They can be considered a screening tool for quick identification of assessors that perform very differently from the other assessors. The information visualised by Manhattan plots may be computed with different statistical methods. In this paper, the Manhattan plots visualise information as implemented in the PanelCheck software. This means that PCA is applied on the individual data matrices Xi,av and the explained variance for each attribute is then visualised in the Manhattan plots. For the candy data at hand, I*K explained variances will be given. This means that if the panel consists of, say, I = 10 assessors, the number of explained variances would be 10*9, given K = 9 attributes.

Manhattan plots (see example in Fig. 6) visualise, in shades of grey, how much of the variability of each attribute and each assessor can be explained by the principal components (vertical axis). A dark colour tone indicates that only a small proportion of the variance has been explained, while a light colour tone indicates the opposite. The extreme points are black (0% explained variance) and white (100% explained variance). Typically, the colour will be darker for PC1 and then get lighter with each additional PC from top to bottom, as the explained variance shown is cumulative over the PCs. In other words, the


explained variance at PC3 is the sum of the explained variances of PC1, PC2 and PC3. The lighter the colour tone in a Manhattan plot for a specific assessor-attribute combination, the more systematic variation is present.

The explained variances may be sorted either by assessor or by attribute, depending on the main focus of the investigation. When interested in checking performance between assessors, one may investigate a total number of I plots consisting of K columns, where each plot represents one assessor and each column within the plots represents one attribute. Here, one may look for similar colour patterns among the assessors and detect assessors that differ much from the others. If interested in how well an attribute is understood and used by the panel, one may consider a total number of K plots consisting of I columns, where each plot represents one attribute and each column represents one assessor. Here, one may investigate whether an attribute achieves high explained variances with only a few PCs or whether many PCs are necessary. Moreover, it can be detected whether some assessors have more systematic variance with fewer PCs than other assessors. In this sense, Manhattan plots may be used as a screening tool for quick detection of assessors that behave very differently or attributes that are not well explained relative to one another. Both plotting variants are implemented in PanelCheck. More detailed information on the statistical aspects and interpretations of Manhattan plots is presented in [3].
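The quantity shaded in a Manhattan plot, the per-attribute explained variance accumulated over PCs from a PCA on one assessor's matrix Xi,av, can be computed as in this sketch (made-up data; function and variable names are our own):

```python
import numpy as np

def cumulative_attribute_variance(X_i_av, n_pc):
    """Cumulative per-attribute explained variance from PCA on one
    assessor's replicate-averaged matrix (J x K) -- the values a
    Manhattan plot shades from black (0%) to white (100%).

    Returns an (n_pc, K) array; row a holds, for each attribute, the
    fraction of its variation explained by the first a+1 PCs.
    """
    Xc = X_i_av - X_i_av.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    total = (Xc ** 2).sum(axis=0)             # per-attribute total variation
    out = np.empty((n_pc, Xc.shape[1]))
    for a in range(1, n_pc + 1):
        recon = (U[:, :a] * s[:a]) @ Vt[:a]   # rank-a reconstruction
        out[a - 1] = 1.0 - ((Xc - recon) ** 2).sum(axis=0) / total
    return out

rng = np.random.default_rng(3)
X = rng.uniform(0, 15, size=(5, 9))           # J = 5 samples, K = 9 attributes
ev = cumulative_attribute_variance(X, n_pc=4)

assert np.all(np.diff(ev, axis=0) >= -1e-9)   # cumulative: rows never decrease
assert np.allclose(ev[-1], 1.0)               # J-1 = 4 PCs explain everything
```

Arranging these values by assessor or by attribute gives the two plotting variants described above.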

    Plots based on one-way ANOVA

A discussion of the one-way ANOVA model in the context of panel performance can be found in [5, 14]. From the one-way ANOVA model, we obtain three statistical quantities (F, p and MSE values) that are used to generate the so-called F plot, MSE plot and p*MSE plot available in the PanelCheck software. These three statistical quantities are acquired by applying one-way ANOVA on each individual data matrix Xi and provide information on sample discrimination and repeatability for each assessor. The three plots are described in more detail below.

    F plot

    Fplots are based on Fvalues, which contain information

    on discrimination performance of each assessor. A total of

    I*K Fvalues are computed and may be presented in a bar

    diagram with each bar representing the attribute of one

    specific assessor. The bar diagram can be accompanied

    with horizontal lines indicating different significance lev-

    els. Typically, 1 and 5% level of significance are used for

    this purpose. Generally, the higher an value Fvalue of an

    individual assessor, the greater the ability of that assessor

    to discriminate between tested samples. If differences

    between the tested samples are present, one should expect

    the assessors to obtain high Fvalues greater, ideally higher

    than those corresponding to 1 and 5% level of significance.

    MSE plot

    The MSE values are the mean square errors (random errorvariance estimates) from the one-way ANOVA model.

    They can be used as a measure of repeatability for each

    assessor. A total ofI*KMSE values are computed and can

    be plotted in a bar diagram very similar to the Fvalues in

    the F plot. If an assessor almost perfectly repeats her/

    himself, this value should be close to zero. The less the

    repeatability of a certain assessor, the higher his/her MSE

    will be. The MSE value should, however, always be con-

    sidered together with the Fvalues in order to get a realistic

    overview over the assessors performance. An assessor

    aiming for low MSE values can achieve this through

    scoring about all samples alike, thus reducing differencesbetween replicates. However, such an assessor will clearly

    have no discriminative power in the analysis, as the

    respective F values will also be very low. If differences

    between the samples are given, an assessor should ideally

    have high Fvalues and low MSE values.

    p*MSE plots

    In a p*MSE plot [15] the assessors ability to detect dif-

    ferences between samples is plotted against their repeat-

    ability using the p values and MSE values from one-way

    ANOVA calculations. A total of I*Kpairs ofp and MSEvalues are computed and plotted in a scatter plot. They can

    be presented together in various ways (for instance all at

    the same time, only for one attribute at a time or only for

    one assessor at a time) and with highlighting of the

    assessors or attributes that one is particularly interested

    in. In an ideal situation all assessors should achieve low

    p values and low MSE values for all attributes [15] if

    differences between the samples are really present, thus

    ending up in the lower left corner of the plot.
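The F and MSE values behind these three plots come from a one-way ANOVA (samples as groups) on each assessor's raw matrix Xi. A sketch for a single assessor, with a deliberately simple made-up example; the p value would additionally require the F distribution (e.g. scipy.stats.f.sf(F, J-1, J*(M-1))), which we omit here:

```python
import numpy as np

def oneway_f_mse(X_i, J, M):
    """One-way ANOVA per attribute for one assessor.

    X_i: (J*M, K) raw scores, rows ordered sample by sample with M
    replicates each. Returns K F values (discrimination) and K MSE
    values (repeatability), the quantities behind the F, MSE and
    p*MSE plots.
    """
    X = X_i.reshape(J, M, -1)                       # J x M x K
    sample_means = X.mean(axis=1)                   # J x K
    grand = X.mean(axis=(0, 1))                     # K
    ss_between = M * ((sample_means - grand) ** 2).sum(axis=0)
    ss_within = ((X - sample_means[:, None, :]) ** 2).sum(axis=(0, 1))
    ms_between = ss_between / (J - 1)
    mse = ss_within / (J * (M - 1))                 # mean square error
    return ms_between / mse, mse

# Tiny worked example: J = 2 samples, M = 2 replicates, K = 1 attribute.
X_i = np.array([[0.0], [0.1], [1.0], [1.1]])
F, mse = oneway_f_mse(X_i, J=2, M=2)

assert np.isclose(F[0], 200.0)    # strong discrimination
assert np.isclose(mse[0], 0.005)  # small error -> good repeatability
```

A large F with a small MSE is exactly the ideal combination described above: the assessor separates the samples while scoring replicates consistently.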

    Profile and line plots

Profile plots visualise how each assessor ranks and rates the tested samples compared to the other assessors and the panel consensus for a certain attribute (see example in Fig. 11). Each line represents one assessor (sample averages across replicates), whereas the single bold line represents the panel consensus (sample averages across assessors and replicates). The tested samples are ranked along the horizontal axis according to the panel consensus, from left to right with increasing intensity for that attribute.


The vertical axis represents the scores (averages across replicates) of the particular assessor for the samples. In case of high agreement between assessors, the assessor lines follow the consensus line closely. With increasing disagreement, the line of each assessor will follow its own course and the plot will appear more cluttered.

Each line plot [14] represents one sample, showing its average scores on each attribute in the form of a line connecting the attributes from left to right (see example in Fig. 8). In addition, raw data scores may be superimposed, indicating how individual assessors have scored the particular sample. The vertical line for each attribute displays the scoring range used by all assessors for that attribute, and each symbol represents one of the multiple scores provided by the panel.

    Data merging and work flow strategy

In this section, we describe how to prepare and merge the sensory profiling data of the 26 panels prior to import into the PanelCheck software. Furthermore, we propose a workflow that suggests how to progress with the data analysis, i.e. which plots to use first and, depending on the information found, which plots to use further on. All methods used here are integrated in the PanelCheck software and thus may be accessed easily. The only exception is PCA as used for investigating the basic structure in the data. This particular analysis, however, can easily be carried out using any multivariate statistics software package that gives access to PCA. This workflow may also be applied to single data sets from one panel.

    Data merging

Before analysing the 26 data sets, some data pre-processing and re-arranging is necessary. There are several possibilities for how the data may be merged prior to import into the PanelCheck software.

    Raw data

The most obvious way would be to concatenate all data sets vertically, which would in practice result in a single large sensory panel with 213 assessors accumulated over all the 26 panels. The dimension of this matrix would then be 3,195 × 9 (see Table 1, last row). By choosing this approach, individual information on all 213 assessors is preserved and available in the plots. In return, however, interpretation might become cumbersome and challenging, as some of the plots get crowded and unreadable with so many assessors. Given this situation, performance issues of individuals or of a particular panel as a whole may be difficult to identify. With fewer panels, though, this may be a valid approach, as the number of assessors will also be lower.

Sample averages across assessors and replicates for each panel

Another possibility is to compute consensus sample averages for each panel across assessors and replicates. By doing so, one obtains 26 new consensus data matrices of dimension J × K. For the candy data at hand, the dimension of these matrices then is (5 × 9), with J = 5 samples and K = 9 attributes. The next step would be to concatenate these consensus matrices vertically, resulting in a merged data matrix of dimension (26*5) × 9, and import it into PanelCheck. In this case, each panel is treated as if it were an individual assessor in a sensory panel consisting of 26 assessors. Unfortunately, with this approach one loses information on repeatability and on the performance of individual assessors, since the sample averages were computed across assessors and replicates and information on these two factors is lost. Hence, the plots visualising repeatability performance, i.e. the F plot, MSE plot and p*MSE plot, are not available in this case.

    Sample averages across assessors for each panel

A third alternative is to compute sample replicate averages for each panel across the assessors of that particular panel. This will lead to 26 data matrices of dimension (J*M) × K, which for the candy data at hand means (5*3) × 9, with J = 5 samples and M = 3 replicates. This is also indicated in Table 1. The resulting data matrix then is of dimension (26*15) × 9 when concatenating all 26 data matrices vertically, and it is ready for import into PanelCheck. In this way, each panel is again treated as if it were an individual assessor in a sensory panel consisting of 26 assessors, but this time information on repeatability is available, as replicate information is preserved at the panel level.
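The three merging layouts can be sketched as follows (made-up panel sizes drawn to match the range in Table 1; the variable names are ours):

```python
import numpy as np

# Raw data per panel: a (I_p, J*M, K) array of assessor matrices.
J, M, K = 5, 3, 9
rng = np.random.default_rng(1)
panels = [rng.uniform(0, 15, size=(int(rng.integers(3, 16)), J * M, K))
          for _ in range(26)]
n_assessors = sum(p.shape[0] for p in panels)

# 1) Raw data: stack every assessor of every panel vertically.
raw = np.vstack([p.reshape(-1, K) for p in panels])

# 2) Consensus averages: mean over assessors AND replicates per panel.
consensus = np.vstack([p.reshape(p.shape[0], J, M, K).mean(axis=(0, 2))
                       for p in panels])

# 3) Averages across assessors only: replicates preserved per panel.
replicate_kept = np.vstack([p.mean(axis=0) for p in panels])

assert raw.shape == (n_assessors * J * M, K)
assert consensus.shape == (26 * J, K)            # (26*5) x 9
assert replicate_kept.shape == (26 * J * M, K)   # (26*15) x 9
```

In the study itself (213 assessors in total), approach 1 yields the 3,195 × 9 matrix and approach 3 the 390 × 9 matrix listed in Table 1.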

    Data merging approaches used in this study

For the first part of the analysis ("Global analysis of all 26 panels"), where all 26 panels are investigated, the approach described in "Sample averages across assessors for each panel" was chosen, since it provides a performance overview of all panels and at the same time preserves information on replicates at the panel level. This approach can be seen as a middle way between the two approaches described above in "Raw data" and "Sample averages across assessors and replicates for each panel". It is a valuable approach when a large amount of data with many panels is given, as in this study.


For the second part of the analysis ("Local analysis of panels P05, P17 and P25"), the focus is turned to only three of the 26 panels and the individual assessors that belong to them. These three panels (P05, P17 and P25) were identified as differing somewhat from the other panels on a number of attributes, as visualised with Tucker-1 plots (see "Results"). Given this situation, where only three panels are to be analysed in detail, the data setup described in "Raw data" is an appropriate approach. With the amount of raw data greatly reduced (down to three panels from 26), information on individual assessors will be more readable in the plots.

    Work flow strategy

The proposed workflow strategy (Fig. 1) is by no means a hard rule that represents the perfect general approach to analysing and visualising all types of sensory profiling data. It should rather be seen as a guide or path that one may follow when analysing a new data set, and which may be left at any time in the data analysis process. Since each data set may have its own unique characteristics, it may require a unique approach and a different order of methods and plots to be used for analysis.

In the proposed workflow, a good starting point could be a (either two-way or three-way) ANOVA to identify significant attributes at the 5% significance level, i.e. P < 0.05. Non-significant attributes close to significance may also be considered, since only a few noisy assessors might be enough to make an attribute switch from significant to non-significant. Attributes that are far from being significant (say, P values of 0.1 and above) may be disregarded, based on the high likelihood that differences between the tested samples are not present. Preferably, this cut-off limit should be chosen by the panel leader, who has full knowledge of the tested products and knows how well the assessors of his/her sensory panel usually perform.
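To make this screening step concrete, the product-effect test of a mixed-model two-way ANOVA for one attribute can be sketched as below. This is a minimal illustration of the standard construction (product fixed, assessor random, so the product effect is tested against the assessor × product interaction); the function name and data layout are our own, not PanelCheck's API.

```python
import numpy as np
from scipy.stats import f as f_dist

def product_effect_pvalue(Y):
    """P value of the product effect for one attribute.

    Y has shape (assessors, products, replicates). With assessors treated
    as a random effect, the product mean square is tested against the
    assessor x product interaction mean square.
    """
    a, p, r = Y.shape
    grand = Y.mean()
    prod_means = Y.mean(axis=(0, 2))   # one mean per product
    ass_means = Y.mean(axis=(1, 2))    # one mean per assessor
    cell_means = Y.mean(axis=2)        # (a, p) assessor x product means

    ss_prod = a * r * np.sum((prod_means - grand) ** 2)
    ss_int = r * np.sum((cell_means
                         - ass_means[:, None]
                         - prod_means[None, :]
                         + grand) ** 2)
    df_prod, df_int = p - 1, (a - 1) * (p - 1)
    F = (ss_prod / df_prod) / (ss_int / df_int)
    return f_dist.sf(F, df_prod, df_int)
```

Attributes with P < 0.05 would be kept; those between 0.05 and roughly 0.1 could be kept at the panel leader's discretion, as argued above.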

For the next step, one may consult Tucker-1 and Manhattan plots. Tucker-1 correlation loadings plots as implemented in the PanelCheck software are based on replicate averages, i.e. they do not contain information on repeatability. They do, however, provide quick diagnostics that may be confirmed with other plots especially suited to visualising that particular kind of problem. Depending on how the assessors are distributed over the plots, one may identify possible disagreement in sample ranking, poor sample discrimination ability, or crossover effects caused by turning the intensity scale upside down. Manhattan plots may be used as a screening tool to identify deviating performances based on the patterns found in the plots.
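The idea behind Tucker-1 correlation loadings can be sketched as follows: a PCA on the unfolded samples × (assessors · attributes) matrix, followed by correlating each original variable with the component scores. A variable outside the inner ellipse satisfies r1² + r2² > 0.5, i.e. more than 50% of its variance is explained by PC1 and PC2. This is our own minimal sketch, not PanelCheck's implementation.

```python
import numpy as np

def tucker1_correlation_loadings(X, n_comp=2):
    """Correlation loadings from PCA on an unfolded Tucker-1 matrix.

    X: (samples, assessors * attributes), e.g. replicate-averaged scores.
    Returns an (n_variables, n_comp) array holding the correlation of
    each column of X with each principal-component score vector.
    Assumes no column of X is constant.
    """
    Xc = X - X.mean(axis=0)                       # column-centre
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_comp] * s[:n_comp]           # PCA scores T = U S
    corr = np.empty((X.shape[1], n_comp))
    for j in range(X.shape[1]):
        for k in range(n_comp):
            corr[j, k] = np.corrcoef(Xc[:, j], scores[:, k])[0, 1]
    return corr
```

Because the score vectors are orthogonal and centred, each variable's squared correlations over the first two components sum to at most 1, which is what the outer (100%) and inner (50%) ellipses of the plot express.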

The next plots suggested are those based on one-way ANOVA carried out on the individual data matrices Xi of each individual. These plots are the p*MSE plot, F plot and MSE plot. If an assessor lies, e.g., close to the centre of the Tucker-1 correlation loadings plot, the reason is often poor discrimination ability of that particular assessor compared to others closer to the outer ellipse. This may be confirmed by the p*MSE plot or F plot. If poor sample discrimination cannot be confirmed by either one-way ANOVA plot, another likely scenario is ranking disagreement. In this case, the problematic assessor does not agree with the underlying structure found by Tucker-1 in the first two PCs. This particular assessor might discriminate well between the samples, but not in the same way as the panel consensus. Therefore, such an assessor may show systematic variation in PC3 or higher. This may be confirmed by profile plots.
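The quantities behind these three plots come from an ordinary one-way ANOVA per assessor and attribute, which might be sketched like this (assuming each assessor's scores for one attribute are arranged as a samples × replicates array; the function name is illustrative):

```python
import numpy as np
from scipy.stats import f as f_dist

def assessor_f_mse(scores):
    """One-way ANOVA statistics for one assessor and one attribute.

    scores: (samples, replicates) raw intensity scores.
    Returns (F, MSE, p): a high F value indicates good sample
    discrimination, a low MSE good repeatability; the p*MSE plot then
    places each assessor at the point (p, MSE).
    """
    n_samp, n_rep = scores.shape
    grand = scores.mean()
    sample_means = scores.mean(axis=1)
    ss_between = n_rep * np.sum((sample_means - grand) ** 2)
    ss_within = np.sum((scores - sample_means[:, None]) ** 2)
    df_b, df_w = n_samp - 1, n_samp * (n_rep - 1)
    ms_between, mse = ss_between / df_b, ss_within / df_w
    F = ms_between / mse
    return F, mse, f_dist.sf(F, df_b, df_w)
```

Plotting F (or p) against MSE for all assessors of a panel then reproduces the kind of comparison discussed above: good performers sit at high F and low MSE.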

If none of the plots mentioned above allows for a conclusion, one might want to consult line plots for visualising the raw data of every sample. Studying details of the raw data might help to reveal issues that are not caught by other plots. With the help of the workflow, one may analyse one attribute at a time and finish the analysis when performance on all attributes has been evaluated.

    Results

In this section, we will first investigate the sensory profiling data of all 26 participating panels (Global analysis of all 26 panels) before going into more detail by looking at the performance of a few selected panels (Local analysis of panels P05, P17 and P25) that vary somewhat from most other panels.

    Global analysis of all 26 panels

    Two-way ANOVA

Following the workflow shown in Fig. 1, a two-way ANOVA (details in Mixed model ANOVA for assessing the importance of attributes) was computed first. The results are shown in Fig. 2. All attributes were significant with P < 0.001; hence, all attributes were kept for further analysis.

    PCA for investigating basic structure in data

The purpose of this analysis step is to get a quick and general overview of how the data is structured and to identify panels that may differ greatly in regard to how they perceive differences between the tested samples. This is done by applying PCA on the merged data set as described in Sample averages across assessors and replicates for each panel. The results are reported in Fig. 3a–c, showing the explained variance, scores and loadings, respectively. Figure 3a shows that the first two principal components


explain 93% of the total variability contained in the data set. Figure 3b shows how the samples are distributed in the multivariate space. Each sample is represented 78 times (3 replicates × 26 panels), with a fairly good separation between the samples. The scores plot shows that the first axis discriminates between samples A2, B and C1 on one side versus A1 and C2 on the other. The samples in the latter group are characterised by high intensity for the attributes sweet taste, sugar coat and, to a certain extent, acidic flavour and raspberry flavour (see loadings plot in Fig. 3c). The former group is characterised by high intensity for the attributes sticking, transparency, elasticity, hardness and biting. Along the second axis, there seems to be a split between samples A2 and B (samples on the left side of the scores plot) and A1 and C2 (samples on the right side of the scores plot). This tendency is strongly related to the attribute acidic flavour, with high intensity for samples A2 and C2 and low intensities for samples A1 and B. This is in accordance with the experimental design described above (Experimental). Attribute sweet taste also seems to contribute to the split along PC2, although not to the same degree as acidic flavour.

Fig. 1 Proposed workflow for the analysis of assessor and panel performance

Moreover, there


seems to be no clear coherence with the sugar content in the samples, as one would expect from the experimental design. Nonetheless, the scores plot shows that the panels are in good agreement regarding how the samples differ from each other, except for one evaluation of sample C2 and one evaluation of A1. Other than that, there are no anomalies to be detected, which rules out severe differences between any of the panels.

    Tucker-1 and Manhattan plots of all panels

For the next step, Tucker-1 correlation loadings plots (Fig. 4) are utilised to identify attributes with potential performance issues. For the data at hand, nine identical plots are given (one for each of the nine attributes), with a different attribute highlighted in each.

By screening through the plots, one can see that the overall performance across the 26 panels can be considered very good for most of the attributes. A very large part of the variation in the data is explained using only PC1 and PC2. The total amount of variance explained by PC1 and PC2 is 98.6%, with PC1 and PC2 explaining 92.6 and 6.0%, respectively. From previous experience, we can state that this number is very high compared to other data sets, despite the high number of 234 variables (26 panels × 9 attributes). One important reason for this is that much of the noise was eliminated by averaging sample scores over assessors. The plots show that none of the panels lies within the inner ellipse for any attribute, meaning that for all of them more than 50% of their variation is explained by PC1 and PC2. For all attributes

Fig. 2 Product effect in the two-way ANOVA model based on 26 panels. All attributes are significant with P < 0.001 and are included in further analysis

Fig. 3 a Explained variances from PCA on the data described in Sample averages across assessors for each panel. The upper (full) and lower (dashed) lines visualise the calibrated and validated explained variance, respectively. b PCA on the same data: the scores plot visualises how the 26 panels discriminated between the five tested samples. c PCA on the same data: the loadings plot shows how the attributes contributed to the variation in the merged data set


except acidic flavour, sweet taste and raspberry flavour, the 26 panels show very good agreement, as they are well clustered at the outer ellipse. For the three attributes mentioned above, there is some disagreement, since the panels are more spread out along the outer ellipse. Attributes acidic flavour and sweet taste are the only attributes contributing to systematic variation in PC2. Furthermore, it is obvious that panel P01 disagrees with the other panels on attribute sticking, since it is located on the opposite side from the other panels. From previous experience, it is known that such a situation is caused by turning the scale upside down, i.e. confusing high and low intensity. This assumption is confirmed by the profile plot for attribute sticking shown in Fig. 5: panel P01 seems to have confused high and low intensity for the tested samples. Moreover, one may observe in Fig. 4 that panel P19 has less systematic variation for attribute elasticity compared to the other panels. A profile plot of attribute elasticity (not shown) reveals that panel P19 ranks the samples identically to the consensus; however, its intensity differences between the samples deviate somewhat from those of the consensus. This is why panel P19 lies in the same direction as the remaining panels in the Tucker-1 correlation loadings plots, but does not align as well with the other panels.

After screening through the Tucker-1 plots, one may

consult Manhattan plots (Fig. 6) for a comparison of the systematic variation for a specific attribute across all panels.

Fig. 4 Nine identical Tucker-1 plots, each highlighting one of the nine attributes used in the profiling. There is some variation between the panels for attributes acidic flavour, sweet taste and raspberry flavour

Fig. 5 Profile plot of attribute sticking. Panel P01 clearly stands out from the other panels because of opposite scoring on high and low intensity of the tested samples

The Manhattan plots confirm what was shown in

the Tucker-1 plots. The attributes acidic flavour, sweet taste and raspberry flavour need two or more principal components to reach a high level of explained variance. For the remaining attributes, all panels reach a high percentage of explained variance already after one principal component. The only exception is attribute elasticity, where one can easily see that panel P19 differs from the other panels. The lone dark bar indicates that panel P19 has less systematic variance for this attribute than the other panels and needs three to four principal components before its explained variance is comparable with the other panels. For this attribute, all panels have an explained variance that is higher than or very close to 99% using only PC1, except panel P19 with only 62% after PC1. After PC3, the cumulative explained variance of 90% for panel P19 is still somewhat lower than that of the other panels. With 4 PCs, panel P19 reaches 100% explained variance.
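The numbers shaded in a Manhattan plot are cumulative per-attribute explained variances from a PCA on each panel's (or assessor's) data. A minimal sketch of how they could be computed (our own illustration, not PanelCheck's implementation):

```python
import numpy as np

def cumulative_explained_per_attribute(X):
    """Per-attribute cumulative explained variance from PCA on one panel.

    X: (samples, attributes) data for one panel. Returns an
    (n_components, attributes) array; row k holds the share of each
    attribute's variance explained by the first k+1 components -- the
    values a Manhattan plot shades from black (0%) to white (100%).
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    total = np.sum(Xc ** 2, axis=0)               # total SS per attribute
    out = np.empty((len(s), X.shape[1]))
    for k in range(1, len(s) + 1):
        recon = (U[:, :k] * s[:k]) @ Vt[:k]       # rank-k reconstruction
        resid = np.sum((Xc - recon) ** 2, axis=0)
        out[k - 1] = 1.0 - resid / total
    return out
```

A column that stays dark for several rows, like panel P19's elasticity above, needs many components before its variance is accounted for, signalling less systematic variation.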

    p*MSE, MSE and F plots based on one-way ANOVA

The p*MSE plots are not presented here, since sample discrimination is highly significant for all attributes across all panels. Of the 234 given P values (26 panels × 9 attributes), the highest was P = 0.037.

In Fig. 7a and b, the F and MSE plots are presented, respectively. As can be seen, some of the panels have a much higher F value than others, even though all of them are significant at the 1% level. The horizontal lines indicating F values at the 1 and 5% levels cannot be seen here, since some F values are extremely high; both lines therefore fall onto the horizontal axis, as their corresponding F values are extremely low compared to the highest F values in the plot. When investigating the panels' discrimination ability, one can see, for instance, that panel P11 has relatively low F values compared to those of panel P20. At the same time, the MSE values (Fig. 7b) of panel P11 are relatively high. This indicates that panel P11 is somewhat

Fig. 6 Nine Manhattan plots, one for each attribute, visualising systematic variation from individual PCA on the data of each panel. Vertical axes represent the number of PCs used and their corresponding cumulative explained variance. Horizontal axes represent the respective sensory panels. Black corresponds to 0% explained variance, whereas white corresponds to 100% explained variance


less precise and has a lower capability of detecting differences. Panel P20, on the other hand, has relatively low MSE values (good repeatability) combined with relatively high F values (good sample discrimination), indicating a much better performance than panel P11. Panel P21 is an example where high F values are achieved, but coupled with high MSE values. In other words, this panel discriminates well between the tested samples, but less precisely so than panel P20. In terms of performance, panel P21 may be ranked between panel P20 (good) and panel P11 (not as good). Still, panel P11 shows an acceptable performance, since its F values are all significant at the 1% level. Note that the F and MSE plots provide no information on sample ranking differences; these two plots alone are therefore not sufficient for a complete evaluation of panel performance. It should be mentioned that both plots could also be sorted by attribute, to check which of the attributes have the lowest/highest variance and the best ability to distinguish between samples.

    Line plots

Figure 8 shows line plots of the five tested samples. The plots highlight that for every sample and attribute there is a varying degree of variability across the panels (vertical lines indicating the spread of the scores). This variability could be due to, for instance, local differences in calibration. This is particularly true for attribute 9 (transparency). For attribute 5 (sticking), however, there seems to be a higher degree of agreement among the panels.

    Local analysis of panels P05, P17 and P25

After studying all the panels' averages across assessors (based on data as described in Sample averages across assessors for each panel), the data of panels P05, P17 and P25 were analysed in more detail. These three panels were picked over others because they differ from each other on the attributes acidic flavour, sweet taste and raspberry flavour; they are spread somewhat in terms of location in the Tucker-1 plots of these three attributes.

Since we are now focusing on only three panels, and we wish to analyse in more detail why these three panels differ somewhat for the attributes mentioned above, we will use their raw data from here on. To do that, their raw data needs to be merged as described in Raw data before being imported into PanelCheck. When merging the raw data of panels P05, P17 and P25, the resulting data matrix will be of dimension (435 × 9), with panel P05 contributing 150 rows (10 assessors × 5 samples × 3 replicates), panel P17 contributing 135 rows (9 assessors × 5 samples × 3 replicates) and panel P25 contributing 150 rows (10 assessors × 5 samples × 3 replicates). See Table 1 for details on panel sizes. This new data set in practice represents one new large panel consisting of 29 individuals (10 + 9 + 10 assessors). By using the same methods as before, the performance of individuals belonging to one of these three panels can now be visualised.
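Such a merge amounts to stacking the panels' long-format raw data while keeping assessor identities distinct across panels. A sketch with pandas (column names here are hypothetical; the actual PanelCheck import format may differ):

```python
import pandas as pd

def merge_panels(frames, panel_ids):
    """Stack raw-data frames from several panels into one 'super panel'.

    frames: one long-format DataFrame per panel with columns 'assessor',
    'sample', 'replicate' plus one column per attribute (assumed layout).
    Assessor labels are prefixed with their panel id, so that e.g.
    assessor 1 of P05 and assessor 1 of P17 remain distinct individuals.
    """
    renamed = []
    for pid, df in zip(panel_ids, frames):
        df = df.copy()
        df["assessor"] = pid + "-" + df["assessor"].astype(str)
        renamed.append(df)
    return pd.concat(renamed, ignore_index=True)
```

With the panel sizes given above (10, 9 and 10 assessors, 5 samples, 3 replicates), the merged frame has 150 + 135 + 150 = 435 rows and 29 distinct assessor labels.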

    Mixed model ANOVA

Mixed model ANOVA again reports that all attributes are significant at level P < 0.001, meaning that this new panel consisting of 29 individuals discriminated well between the samples (plot not shown). Again, all attributes were considered for further analysis.

    Tucker-1 plots

Tucker-1 plots based on raw data from the three selected panels (Fig. 9) confirm what was spotted in the Tucker-1 plots above (Fig. 4) based on the data from all 26 panels. There is substantial disagreement across assessors

Fig. 7 a F plot visualising the panels' ability to discriminate between the tested samples for each attribute. Panel P21 discriminates less between the samples than, for example, panel P20. The horizontal lines indicating F values at significance levels 1 and 5% are not visible, as they are very low and therefore fall onto the horizontal axis. b MSE plot visualising the repeatability of each panel. Panel P21 obviously has a weaker performance regarding repeatability than, for example, panel P06


for the attributes acidic flavour, sweet taste and especially raspberry flavour, indicating that further improvement in agreement across assessors is possible. Although the assessors are somewhat scattered over the correlation loadings plots, most of them have high explained variances for the first two PCs. This indicates that the majority

Fig. 8 Five line plots, each representing the data of one sample. Vertical axes represent intensity scores. Horizontal axes represent the nine sensory attributes

Fig. 9 Nine identical Tucker-1 plots, each highlighting one of the nine attributes used in the profiling. The plots are based on raw data of panels P05, P17 and P25


discriminates well between the samples, but that there might be disagreement on sample ranking for a particular attribute. Studying the three correlation loadings plots in detail confirms this by revealing that the assessors of each panel tend to form clusters of their own within the plot. For the remaining six attributes (transparency, sugar coat, biting, hardness, elasticity and sticking), overall agreement is very good. These results were confirmed by the Manhattan plots (not shown).

    p*MSE plots

In contrast to the situation above with all 26 panels, where significance was high throughout (p*MSE, MSE and F plots based on one-way ANOVA), in this case the p*MSE plots (Fig. 10) give a valuable contribution to understanding individual differences. It can be seen that for attribute raspberry flavour, panel P17 and, to a certain extent, panel P05 are less capable of detecting differences between the samples (larger P values) than panel P25. Moreover, random noise is generally larger for panels P05 and P17. This indicates that the individuals of panel P25, and therefore panel P25 as a group, perform much better than panels P05 and P17.

    Profile plot for panels

Profile plots (Fig. 11) show that the disagreement in evaluating the samples is strongest for the attributes acidic flavour, sweet taste and particularly raspberry flavour, as already observed in the Tucker-1 plots in Fig. 9. For the attributes transparency, sugar coat, biting, hardness and elasticity, the profiles are very alike for most of the assessors, with very few exceptions. For attribute sticking, three assessors of panel P17 (individuals P17-1, P17-4 and P17-9) generally rated the samples with the highest intensity (B, C1 and A2) lower than the remaining assessors.

Fig. 10 p*MSE plot for attribute raspberry flavour for panels P05, P17 and P25

Fig. 11 Nine profile plots, one for each attribute, visualising sample intensity and rankings for each assessor in panels P05, P17 and P25. Vertical axes represent sample intensity scores. Horizontal axes represent the five tested samples, sorted by intensity based on consensus. The circle highlights three deviating assessors belonging to panel P17 (assessors P17-1, P17-4 and P17-9)


    Summary and discussion

In this paper, we have presented how to extract critical information on panel performance from a proficiency test. In the example described here, 26 sensory panels tested a set of five candy samples, produced according to an experimental design, in three replicates using nine attributes. Since the panels varied in size, from 3 assessors at the least to 15 at the most, the size of the data from each panel varied accordingly. We demonstrated how to arrange the large amount of data prior to analysis and which methods to use in the analysis process. For this, we proposed a general workflow that may be used as a guide through the data analysis process, but which is not forced upon the user.

For the data at hand, performance analysis was first carried out at a global level, based on data from all 26 panels, where each panel was treated as if it were an individual assessor. This means that rather than visualising the performance of individuals, it is the performance of panels as a whole, compared to other panels, that is visualised. As a result of this process, three of the 26 panels were identified for further analysis at a more detailed local level. This included performance visualisation of the individual assessors from each of these three panels. In both cases, the same methods were applied to gather information on performance: mixed model ANOVA, Tucker-1 plots, Manhattan plots, one-way ANOVA based F plots, MSE plots, p*MSE plots, profile plots and line plots. The reason for using multiple plots and methods is that each plot contains unique information on panel and assessor performance. Their joint information content provides a more complete performance overview of individual assessors and their sensory panel (local level), or of sensory panels compared with each other (global level). Performance information from such an analysis can then be used by panel leaders as feedback to improve overall panel performance and the performance of individual assessors.

Acknowledgments Thanks to Rikke Lazarotti at LEAF Denmark for production of the wine gum samples and for providing access to the sensory profiling data. We would like to thank the Research Council of Norway (project number 168152/110), The Foundation for Research Levy on Agricultural Products (Norway) and The Danish Food Industry Agency for project funding.
