Physiology as a Tool for UX and Usability Testing
A comparative study of pupil size and other physiological measures
MALIN FORNE
Master of Science Thesis Stockholm, Sweden 2012
DH224X, Master’s Thesis in Human-Computer Interaction (30 ECTS credits)
Degree Programme in Media Technology (300 credits)
Royal Institute of Technology, year 2012
Supervisor at CSC was Ylva Fernaeus
Examiner was Kristina Höök
TRITA-CSC-E 2012:082
ISRN-KTH/CSC/E--12/082--SE
ISSN-1653-5715
Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.kth.se/csc
Abstract

The purpose of this degree project is to investigate how physiological measures, such as heart rate, skin conductance and EEG (i.e. electrical brain activity), may be useful in UX and usability testing. One physiological research method is discussed in more detail, i.e. the measurement of pupil size, or pupillometry. The study seeks to answer the following questions: 1. What can we find out about human emotions and cognition by measuring and analyzing variations in pupil size? 2. What can we find out about human emotions and cognition by using other popular physiological measurement methods? 3. How does pupillometry compare to other physiological measures for the purpose of UX and usability testing?
In order to answer these questions, an extensive literature review was carried out. In
addition, a minor pupillometric study was carried out, in order to practically investigate the
potential of pupil size as a tool for UX and usability testing. The study concludes that although it is
not possible to ‘measure’ the thoughts and emotions experienced during a usability test,
physiological measurement may help identify significant episodes of human-computer interaction,
such as instances of elevated emotion, frustration or intense cognitive effort. It was found that a
large number of physiological signals could be useful for this purpose, and that all have their
respective pros and cons. Ultimately, the choice of measure will depend on the context of study. The
study also showed that there is never just one possible explanation to an observed physiological
reaction. Therefore, physiological data should always be interpreted in relation to the context in
which it was collected, as well as the subject’s own account of the experience.
Sammanfattning (Swedish abstract, translated)

This degree project aims to investigate how physiological measurement methods, such as heart rate, skin conductance, EEG (i.e. electrical brain activity) and pupil size, can be used in usability testing. The primary focus of the study is pupillometry, i.e. measurements of how the diameter of the pupil changes over time. The work is based on the following questions: 1. What can we learn about people's emotional and cognitive processes by measuring and analyzing variations in pupil size? 2. What can we learn about people's emotional and cognitive processes by using other popular physiological measurement methods? 3. How does pupillometry compare to other physiological measurement methods for use in usability testing?

To answer these questions, an extensive literature review was carried out, together with a smaller empirical study in which the potential of pupillometry as a measurement method was investigated in practice. Overall, the study shows that physiological measurement methods can be used to identify events of particular significance during a usability test, such as episodes of high cognitive load, strong emotion or frustration in the user. A large number of physiological measurement methods proved useful for that purpose, and the choice of method depends mainly on the specific conditions of the study at hand. However, the study also shows that there is rarely a single unambiguous explanation for a given physiological change. Physiological data should therefore always be interpreted in relation to the context in which they were recorded, as well as each participant's own account of the user experience.
Table of Contents
1 Introduction ......................................................................................................................................................... 5
1.1 Background .................................................................................................................................................................... 5
1.1.1 Physiological Measures .................................................................................................................................... 5
1.1.2 Eye Tracking ......................................................................................................................................................... 6
1.1.3 Pupillometry ......................................................................................................................................................... 6
1.2 Purpose of Study .......................................................................................................................................................... 7
1.2.1 Research Questions ............................................................................................................................................ 7
1.2.2 Limitations ............................................................................................................................................................. 7
1.3 Method ............................................................................................................................................................................. 7
2 Theoretical Foundation ................................................................................................................................... 9
2.1 UX and Usability Testing .......................................................................................................................................... 9
2.2 Cognition in HCI ......................................................................................................................................................... 10
2.2.1 Understanding User Cognition .................................................................................................................... 10
2.2.2 Cognitive Load Assessment .......................................................................................................................... 11
2.3 Emotion in HCI ........................................................................................................................................................... 11
2.3.2 Understanding User Emotion ...................................................................................................................... 12
2.3.3 Studying User Emotion ................................................................................................................................... 13
3 Physiological Measures ................................................................................................................................ 15
3.1 Physiological Context ............................................................................................................................................... 15
3.2 Using Physiological Measures .............................................................................................................................. 16
3.3 Common Measures ................................................................................................................................................... 17
3.3.1 Cardiovascular activity ................................................................................................................................... 17
3.3.2 Skin Conductance .............................................................................................................................................. 19
3.3.3 Electrical Brain Activity ................................................................................................................................. 21
4 Pupillometry ..................................................................................................................................................... 24
4.1 Pupillary Movements ............................................................................................................................................... 24
4.1.1 Optical Reflexes ................................................................................................................................................. 24
4.1.2 Reflex Dilation .................................................................................................................................................... 25
4.2 Measuring Pupil Size ................................................................................................................................................ 26
4.3 Previous Studies ........................................................................................................................................................ 29
4.3.1 Pupillometry in Affect Recognition ........................................................................................................... 29
4.3.2 Cognitive Pupillometry ................................................................................................................................... 30
4.3.3 Dealing with the Light Reflex ....................................................................................................................... 33
4.4 Pilot Study .................................................................................................................................................................... 35
4.4.1 Participants ......................................................................................................................................................... 35
4.4.2 Equipment and Procedure ............................................................................................................................ 35
4.4.3 Cognitive Tasks .................................................................................................................................................. 36
4.4.4 Affective Stimuli ................................................................................................................................................ 37
4.4.5 Results and Analysis ........................................................................................................................................ 38
4.4.6 Lessons Learned ................................................................................................................................................ 43
5 Discussion and Analysis ............................................................................................................................... 45
5.1 Interpreting Physiological Data ........................................................................................................................... 45
5.2 Challenges for UX and Usability Testing .......................................................................................................... 47
5.3 Evaluation of Measures ........................................................................................................................................... 49
6 Conclusion ......................................................................................................................................................... 51
7 Bibliography ..................................................................................................................................................... 52
List of Abbreviations
ANS = Autonomic Nervous System
CNS = Central Nervous System
EEG = Electroencephalography
GSR = Galvanic Skin Response
HCI = Human-Computer Interaction
HR = Heart Rate
HRV = Heart Rate Variability
ICA = Index of Cognitive Activity
MPD = Mean Pupil Diameter
PD = Pupil Diameter
PNS = Peripheral Nervous System
SC = Skin Conductance
TERP = Task-Evoked Pupillary Response
UX = User Experience
Degree Project Report
Malin Jönsson Forne, 2012
5
1 Introduction
As interactive technologies become increasingly important in our everyday lives, the human-
computer interaction community has slowly moved beyond a strict focus on usability, and started to
consider the entire user experience, or UX. This means that systems are no longer assessed solely in
terms of their ability to enhance user performance, but also on their ability to motivate, entertain,
amuse or satisfy their users (Preece et al., 2002). In order to assess such factors, usability
researchers must understand more about the cognitive and emotional processes that are evoked as
users interact with a system.
A common way to address emotional and cognitive aspects in usability testing today is
through retrospective self report; that is, users are asked to describe or answer questions about
their experience after the task has been completed, either verbally or through a questionnaire of
some sort (Sherman, 2007). While such strategies are certainly useful, they are limited in their
capacity to identify changes in emotional or cognitive processing over the course of the test (unless
the user is constantly interrupted with questions, which would of course have a negative impact on
the authenticity of the user experience). These considerations have spurred an interest in
complementary methods for the assessment of the user experience.
1.1 Background
1.1.1 Physiological Measures
Within research areas such as psychology and neurology, it has long been known that emotional
and cognitive processes give rise to measurable physiological responses in the human body. For
example, it has been found that the pupil dilates in response to cognitive or emotionally toned
stimuli (e.g. Goldwater 1972, Loewenfeld, 1993), which makes eye tracking an interesting
measurement technique. Other methods of recording physiological measures include GSR (Galvanic
Skin Response), which is associated with increased sweat production, cardiovascular measures,
which include heart rate and heart rate variability, and EEG (Electroencephalography), which
reflects electrical activity along the scalp.
Naturally, physiological measures are not some magic key to the human mind. Measures of
bodily reactions do not, as one might be tempted to believe, enable us to draw definite conclusions
about what a person is thinking or feeling at a given time. The analysis of bodily reactions may,
however, provide some additional clues to the user experience. How these clues may be obtained,
and what new insights they may lead to, is what this study aims to investigate. Focus will lie on the
measurement of pupil size (i.e. pupillometry), but this method will also be compared to other,
perhaps more commonly used physiological measures.
1.1.2 Eye Tracking
This master thesis project is carried out in collaboration with Tobii Technology, one of the leading
producers of eye tracking systems. As the term suggests, eye tracking is technology that allows us to
track and record eye movements. Although these systems come in different forms, most modern
eye trackers (including those produced by Tobii) use a combination of infrared light sources and
infrared video cameras to determine the point of regard (Tullis & Albert, 2008). When in use,
(invisible) near infrared light is pointed to the eye of the user, creating a strong reflection in the
retina (known as the “bright pupil”). In addition to this, a small but sharp glint, called the corneal
reflection, appears on the cornea of the eye. These reflections are recorded by the infrared camera,
and their relative positions are then used to calculate the point of regard (Duchowski, 2007). Eye
tracking technology can be used for a variety of different purposes, but this study will focus on the
use of eye tracking in usability and UX research.
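The pupil-centre/corneal-reflection technique described above can be sketched as a mapping problem: after calibration, the vector from the corneal glint to the pupil centre corresponds (approximately) to an on-screen point of regard. The fragment below is a deliberately simplified linear illustration of that idea, not Tobii's actual algorithm; real trackers fit more elaborate (often polynomial) calibration models, and the coefficients here are made up.

```python
import numpy as np

def estimate_gaze(pupil_center, corneal_reflection, calib_matrix, calib_offset):
    """Map the glint-to-pupil vector to screen coordinates.

    A real eye tracker uses a richer calibration model fitted per user;
    this linear version only illustrates the principle.
    """
    v = np.asarray(pupil_center, float) - np.asarray(corneal_reflection, float)
    return calib_matrix @ v + calib_offset

# Hypothetical calibration: 100 screen pixels per unit of glint-pupil offset,
# centred on a 1920x1080 display.
A = np.array([[100.0, 0.0], [0.0, 100.0]])
b = np.array([960.0, 540.0])

gaze = estimate_gaze(pupil_center=(2.5, -1.0),
                     corneal_reflection=(1.5, 0.0),
                     calib_matrix=A, calib_offset=b)
print(gaze)  # estimated point of regard in screen pixels
```

In practice the calibration stage, where the user fixates known screen points, is what determines the mapping; the geometry above only shows why the two reflections together suffice to recover gaze direction.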
Eye tracking has become an increasingly popular tool in usability testing over the past few
years, as systems become more reliable and easy to use (Tullis & Albert, 2008). Nowadays, eye
tracking technology can be incorporated into a computer monitor, or even into a pair of glasses,
which makes it suitable for many different settings. In a typical eye tracking study, gaze data is
collected as users perform some given task(s). The data may then be subjected to statistical
analysis, or visualized to show what users looked at, for how long and/or in what order. The
relevance of such studies is founded on the eye-mind hypothesis, which holds that what people are
looking at is usually the same as what they are thinking about (Just & Carpenter, 1976). In other
words, we may presume that if we track the movements of a person’s gaze, we can follow along the
path of her attention (Duchowski, 2007). Although there are certainly exceptions to this rule, eye
tracking is considered a useful tool for studying user attention in usability testing (Nielsen &
Pernice, 2010).
1.1.3 Pupillometry
While the eye-tracking applications described so far are interesting and widely used, they are not
the focus of this thesis. Instead, I concentrate on a particular kind of eye tracking called
pupillometry, in which (changes in) pupil size are measured. Most people know that pupil size
varies with the intensity of light, but pupillary movements may also be related to cognitive and/or
emotional processes. This makes them a potential source of new insights into the user experience.
As eye movements are recorded with an eye tracker, pupil size data is usually collected in
the process (Tullis & Albert, 2008). Nevertheless, this data is seldom analyzed, as focus often lies on
gaze patterns (i.e. where the subject was looking during the course of the experiment). If we want
to gain insights into cognitive and emotional processes, however, pupil size data might be a
valuable resource. Pupil dilations might tell us if a certain part of the interaction was particularly
complex (i.e. gave rise to a high cognitive load) or if there was some part that caused frustration.
With a greater understanding of how pupillometry might be used to analyze cognitive and
emotional activities, and how pupil size compares to other physiological measures available,
usability researchers could make more informed choices when preparing a study. Moreover, it
might lead to a better harnessing of the data obtained in the study.
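As a toy illustration of how such pupil size data might be screened (not a method prescribed by this thesis), the sketch below baseline-corrects a pupil-diameter trace and flags samples where dilation exceeds a threshold. Both the baseline window and the 0.3 mm threshold are arbitrary assumptions for the example.

```python
import statistics

def flag_dilation_episodes(pupil_mm, baseline_window=5, threshold_mm=0.3):
    """Return indices where pupil diameter exceeds the resting baseline.

    baseline_window: number of initial samples treated as a resting baseline.
    threshold_mm: dilation above baseline considered noteworthy.
    Both defaults are arbitrary choices for illustration only.
    """
    baseline = statistics.mean(pupil_mm[:baseline_window])
    return [i for i, d in enumerate(pupil_mm) if d - baseline > threshold_mm]

# Synthetic trace (mm): a dilation around samples 7-9 might mark
# a demanding sub-task or an emotionally toned event.
trace = [3.0, 3.1, 3.0, 2.9, 3.0, 3.1, 3.2, 3.5, 3.6, 3.4, 3.1, 3.0]
print(flag_dilation_episodes(trace))  # → [7, 8, 9]
```

Flagged samples would then be inspected against screen recordings or the user's own account, since (as later chapters stress) a dilation alone does not reveal its cause.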
1.2 Purpose of Study
The purpose of the present study is to investigate how different physiological measures may be
used to analyze emotional and cognitive processes in the context of UX and usability testing. The
study contributes to the field of Human-Computer Interaction (HCI) by providing an overview of
different physiological measures that may be used to investigate users’ emotional and cognitive
processes. Moreover, the different measures are evaluated and compared with respect to factors
that are particularly important in usability testing, such as unobtrusiveness, robustness and
simplicity of use.
1.2.1 Research Questions
RQ1. What can we find out about human emotions and cognition by measuring and analyzing
variations in pupil size?
RQ2. What can we find out about human emotions and cognition by using other popular
physiological measurement methods?
RQ3. How does pupillometry compare to other physiological measures for the purpose of UX
and usability testing?
1.2.2 Limitations
This study approaches physiological measures from the usability researcher’s point of view, and the
conclusions drawn are thereby specific to that context. Furthermore, focus lies on the use of
pupillometry in UX and usability, which means that other physiological measures will not be
described in the same detail.
1.3 Method
In order to answer the research questions, a literature review was carried out. In the first phase of
the study, the aim was to get an overview of relevant research areas, such as Usability Testing,
Affective Computing, Psychology, Neuroscience and Physiology, as well as to understand their
respective relevance to this study. Furthermore, some initial questions had to be investigated,
including:
What is cognition?
What is emotion?
How are emotion and cognition relevant to HCI (Human-Computer Interaction)?
What constitutes a “good” method for usability testing?
What methods are used today to study emotion and cognition in usability testing?
What are the pros and cons of using physiological measures?
What is eye tracking, and pupillometry?
What other physiological measures might be relevant to this study?
Answers to these questions were sought in a wide variety of sources, spanning over research areas
such as Human-Computer Interaction, Affective Computing, Cognitive Psychology, Emotion
Research and Psychophysiology. After the initial questions had been explored, focus shifted to the
core research questions of the study. First, pupillometry was investigated in further detail. Some of
the main questions to be answered in this phase were:
How does pupillometry work theoretically?
How can it be implemented practically (in usability testing)?
What can pupillometry reveal about human emotion and cognition?
Given the vast amount of pupillometric research available, I started out from some extensive
reviews, in particular Loewenfeld (1993), but also Goldwater (1972) and Beatty (1982, 2000).
This gave me a good overview of the knowledge in the field, and it also helped me identify some of
the most important studies conducted before the turn of the century, which were then examined in
greater detail. Thereafter, more recent studies, particularly those relating to HCI, were examined, in
order to understand the current state of the art. Other physiological measures identified in the
introductory phase were investigated in a similar, though less thorough, manner. Studies and
reviews of affective computing were particularly useful in this phase.
Although the core of this study is a literature review, a minor pupillometric study was
conducted in order to gain some additional insights. The main purposes of the study were:
1. To practically investigate how pupil size measurements may be incorporated in a simple eye tracking study.
2. To investigate whether we can measure pupil dilation in response to cognitive or emotional stimuli without extensive data processing, technical skills or time consumption.
3. To gain some practical experience of pupillometric research, in order to better understand the challenges involved.
The study consisted of two parts, one in which subjects performed simple math problems
(cognitive task), and one in which they were presented with emotionally toned pictures (affective
stimuli). Meanwhile, pupil size measurements were performed with a Tobii Eye Tracker. For a
detailed description of the study, please refer to section 4.4.
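Analyses of this kind often average baseline-corrected pupil responses across repeated trials, yielding a task-evoked pupillary response (TERP) waveform. The sketch below uses synthetic numbers, not the pilot study's actual data, and a two-sample baseline chosen purely for brevity.

```python
def terp(trials, baseline_samples=2):
    """Average task-evoked pupillary response across trials.

    Each trial is a list of pupil-diameter samples of equal length; the
    first `baseline_samples` of every trial serve as its own baseline.
    Returns the mean baseline-corrected waveform.
    """
    corrected = []
    for trial in trials:
        base = sum(trial[:baseline_samples]) / baseline_samples
        corrected.append([s - base for s in trial])
    n = len(corrected)
    return [sum(col) / n for col in zip(*corrected)]

# Two synthetic trials with slightly different resting baselines;
# per-trial baseline correction removes that offset before averaging.
trials = [[3.0, 3.0, 3.2, 3.4, 3.1],
          [3.2, 3.2, 3.5, 3.6, 3.3]]
print(terp(trials))
```

Averaging across trials is what makes the small task-evoked dilations (typically fractions of a millimetre) stand out from sample-to-sample noise.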
In the last phase of the study, the different physiological measures were compared with
respect to factors that might be important in the context of usability testing. The result of this
analysis was used to create a model, which provides a collective view of the different measures and
their value when studying emotions and cognition in the context of UX and usability testing.
2 Theoretical Foundation
The aim of this chapter is to provide a theoretical foundation for the present study, by introducing
the reader to some of the core concepts related to the study. In the first section, I provide a brief
introduction to UX and usability testing. Thereafter, the concepts of cognition and emotion will be
discussed in turn, especially with regard to their relevance to human-computer interaction (HCI), as
well as to the present study.
2.1 UX and Usability Testing
Usability can be described as “the extent to which a product can be used by specified users to
achieve specified goals [...] in a specified context” (ISO 9241-11:1998). More specifically, the term is
often broken down into a set of design goals, including (Rubin and Chisnell, 2008):
Effectiveness (How good is the system at doing what it is supposed to do?)
Efficiency (Does the system allow users to sustain a high level of productivity?)
Learnability (How easy is it to start using the system?)
Satisfaction (What are the user’s perceptions, feelings, and opinions of the product?)
Traditionally, usability goals have mostly been concerned with improving the productivity of users
interacting with a system, and design goals such as efficiency and effectiveness can certainly be
important for systems intended to support working practices (Preece et al., 2002). However, the
growth of leisure and entertainment uses of technology means that users often have other goals in
interacting with a system than mere productivity. HCI practitioners today must thus expand their
design thinking to include other possible values of technology, such as fun, enjoyment and
emotional engagement (Isbister & Höök, 2009). Such design goals are often associated with the
concept of user experience, or UX, which has become an increasingly important concept in HCI over
the last decade or so (Harbich & Hassenzahl, 2008). Traditional aspects of usability are certainly
part of what makes up the user experience, but UX is not limited to the specific moment in time
when an interaction takes place. On the contrary, the UX point-of-view stresses that users’
evaluations of interactive experiences evolve beyond the end of the interaction itself, and good
experiences can give rise to revisitable good moods and enduring, re-evokable memories (Cockton,
2008). Therefore, the most fundamental error interaction designers make in the design process is
to sketch things, without connecting those things to good experiences and outcomes for the people
who will interact with them (ibid.).
In this study, I discuss different methods that may be applied in the evaluation of interactive
systems, i.e. UX and usability testing. Usability testing allows for more informed design decisions,
and may serve as a way to ensure that important design goals are being met by the product or
prototype in question (Rubin and Chisnell, 2008). Ideally, an iterative cycle of tests is performed
during the course of system development, in order to gradually shape or mold a usable product into
place (ibid.). However, usability testing may also be used to evaluate existing interfaces, or to
compare two or more alternative design solutions.
Typically, usability testing involves observing representative end users using the system or product
to perform realistic tasks (Rubin and Chisnell, 2008). The basic approach originates from classic
empirical research methods, but has been adapted to fit the fast-paced, highly pressurized
commercial environment in which most interactive systems are developed (ibid.). In usability
research, it may for example be impossible or inappropriate to use large numbers of test subjects,
or to adopt the strict control of the testing environment which is often required in academic
research. This is particularly true today, when rigid, sequential “waterfall” methodologies of
software development are increasingly being replaced by more flexible, iterative or agile
development processes (cf. Dingsøyr et al., 2010).
2.2 Cognition in HCI
The term cognition refers to all aspects of human thinking and reasoning, including processes such
as perception, attention and memory (Preece et al., 2002). Understanding more about these
processes may be of great value for the design and evaluation of interactive systems, especially if
focus lies on design goals such as efficiency, effectiveness and learnability. Naturally, the way a user
interface is designed will affect how well users can perceive relevant information, understand
important functions and remember how to carry out tasks. In cognitive science, these capacities are
often described as limited resources. In order to optimally support human-computer interaction,
interaction designers must thus take the limitations of our cognitive capacities into account.
2.2.1 Understanding User Cognition
The concept of mental workload or cognitive load provides a useful framework for understanding
the limitations of user’s cognitive capacities. Cognitive load theory is based on the notion of a
limited “working memory”, which is involved in all conscious cognitive activity (Hollender et al.,
2010). As additional items are added to the pile of information that needs to be actively processed,
cognitive load increases. Too much simultaneous processing will lead to cognitive overload, making
it impossible for users to complete the task at hand (ibid.). Interfaces that require too much mental
effort may thus create user frustration, or cause the user to abandon a task altogether. However, a
monotonous task with too little cognitive stimulation may also act as a stressor. A typical
monotonous or boring situation is when the demands for sustained attention are high, but little
new information is conveyed (Kecklund et al., 2004). This may occur during uneventful
motorway driving, or when the task is to monitor an industrial process. If motivation is high, the
individual may compensate for the lack of stimulation by mobilizing extra energy. However, such
responses are effortful, and can only be sustained over short periods of time. Eventually, the person
will experience boredom and fatigue, which may reduce productivity and, in some cases, even
result in dangerous situations (ibid.). In order to optimally support user practices, human-
computer interfaces should thus provide an adequate amount of cognitive stimulation, avoiding
underload as well as overload.
Attention can also be described as a limited cognitive resource, which needs to be allocated
to ongoing events (Kecklund et al., 2004). Attention is slow, sequential and difficult to sustain for
more than brief periods of time (Kahneman, 1973). Therefore, successful human-computer
interaction requires that the user can effectively manage her attention between different elements
of an interface. Research has shown that poorly timed interruptions, due to for example instant
messages, an incoming email or a system alert, have a negative impact on user performance, especially
if the user is actively engaged in a demanding task (Bailey et al., 2006). According to Iqbal et al.
(2004), an attractive solution would be to develop systems that could identify moments of low
mental workload, in which users may be interrupted at a minimal cost.
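The idea attributed to Iqbal et al. can be reduced to a simple gate: hold notifications back while some (externally supplied) workload estimate is high, and release them when it drops. The workload scale and threshold below are placeholders, not values from any cited study.

```python
def deliver_or_defer(notifications, workload_estimate, threshold=0.5):
    """Split pending notifications by current estimated mental workload.

    workload_estimate: a value in [0, 1] assumed to come from some
    physiological or behavioural model (a placeholder here);
    threshold is an arbitrary cut-off for illustration.
    Returns (to_deliver, to_defer).
    """
    if workload_estimate > threshold:
        return [], list(notifications)   # user is busy: hold everything
    return list(notifications), []       # low workload: safe to interrupt

print(deliver_or_defer(["new mail"], workload_estimate=0.8))  # deferred
print(deliver_or_defer(["new mail"], workload_estimate=0.2))  # delivered
```

The open research problem, of course, lies entirely in producing a trustworthy `workload_estimate`, which is where the physiological measures discussed in this thesis come in.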
2.2.2 Cognitive Load Assessment
There are three main categories of mental workload assessment: performance-related, subjective
and physiological. In the first case, the cognitive load demanded by a certain (primary) task is
typically evaluated by measuring the performance on another, secondary task (Cegerra & Chevalier,
2008). For example, users may be asked to use a driving simulator while engaging in conversation. In this
case, a complicated traffic situation is likely to cause gaps in the conversation, which would be
interpreted as a sign of increased cognitive load in the subject.
By contrast, subjective measures rely on the subjects’ own reports of their experience.
Several scales have been developed to formalize these ratings, for example the NASA-TLX
procedure, which is often considered to be the most accurate (ibid.). Although such ratings may be
a good reflection of the user’s subjective experience, they are limited in one respect: the ratings are
usually obtained after a task has been completed, which means they do not give any account of the
change in cognitive load over the course of a task.
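To make the subjective procedure concrete, the weighted NASA-TLX score is commonly computed by rating six subscales from 0 to 100 and weighting each subscale by the number of times it was chosen in 15 pairwise comparisons. The following sketch illustrates this calculation; the ratings and weights shown are invented for illustration:

```python
# Illustrative sketch of the weighted NASA-TLX scoring procedure.
# Each subscale is rated 0-100; each weight is the number of times that
# subscale was preferred across the 15 pairwise comparisons.

def nasa_tlx_score(ratings, weights):
    """Overall workload = sum(rating * weight) / 15."""
    assert sum(weights.values()) == 15  # 15 pairwise comparisons in total
    return sum(ratings[s] * weights[s] for s in ratings) / 15.0

ratings = {"mental": 70, "physical": 20, "temporal": 55,
           "performance": 40, "effort": 65, "frustration": 50}
weights = {"mental": 5, "physical": 0, "temporal": 3,
           "performance": 2, "effort": 4, "frustration": 1}
print(nasa_tlx_score(ratings, weights))  # about 60.3
```

Note that a score like this still summarizes the whole task after the fact; it cannot show how the load varied while the task was in progress.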
The third means of cognitive load assessment is through physiological measures, which is
the main focus of this study. These measures include pupil size, heart rate and EEG, and will be
further discussed in Chapter 3 and 4 of this report.
2.3 Emotion in HCI
Computer use has often been regarded as a purely rational activity, in which emotions are secondary,
or can even get in the way of successful interaction (Picard, 1997). However, interest in emotional
aspects of HCI is increasing. In his book Emotional Design (2005), Don Norman put it this way:
“In the 1980s [...] I addressed utility and usability, function and form, all in a logical, dispassionate way
— even though I am infuriated by poorly designed objects. But now I’ve changed. [...] Sure, utility and
usability are important, but without fun and pleasure, joy and excitement, and yes, anxiety and anger,
fear and rage, our lives would be incomplete.”
What Norman describes is the importance of user experience, and in particular how it relates to
emotion. If the interaction with software can evoke strong positive feelings, then the user is more
likely to come back and use the system again and again. However, negative emotions may also be an
important part of great experiences, especially when it comes to leisure use of technology, such as
computer games. Gilbert Cockton (2008) argues that the most challenging game interactions can be
both unpleasant and frustrating; but finally completing a game after weeks of struggle can be
immensely satisfying. In this case, what makes the interaction worthwhile for the user is the sense
of achievement (ibid.). In other cases, however, negative user affect such as stress or frustration
may lead to critical errors, or prevent completion of a task altogether.
In Western thinking, emotion and cognition have traditionally been regarded as separate
processes (Höök, 2012). In the 1990’s, however, researchers began to understand that emotional
and cognitive processes are interrelated, and may interact in ways that are important for intelligent
functioning (Picard, 2001). Rosalind W. Picard was among the first researchers to explicitly address
the role of affect in human-computer interaction, when she published the book Affective Computing
in 1997. Picard introduced affective computing as a new field of research, concerned with
“computing that relates to, arises from, or deliberately influences emotions” (Picard, 1997). While
this definition is generally held, the starting point of Picard’s work was more specific. Coming from
the field of artificial intelligence (AI), Picard suggested that machine intelligence should include
skills of emotional intelligence. This was a major shift in thinking at the time, since previous AI
efforts had primarily focused on mathematical, verbal and perceptual capabilities (Picard, 2001).
So far, much research within the field of affective computing has focused on emotion
recognition, often through the measurement and analysis of physiological measures. According to
Picard (2003), the goal of this research is to eventually design computers that will better serve
people’s needs by recognizing and responding to user emotion. However, other researchers in
affective computing argue that rather than trying to ‘measure’ user emotion, we should try to make
people’s emotional experiences available for reflection (Höök, 2012). For example, Sanches et al.
(2010) describe the development of a mobile stress management tool called Affective Health. The
system measures heart rate, skin conductance (see 3.3) and body movement, and uses the data to
create a visualisation that users can reflect on and interpret themselves. According to the authors,
such a system avoids a “reductionist and sometimes erroneous” automatic interpretation from
physiological signals to emotion labels (ibid.).
2.3.2 Understanding User Emotion
To date, there is no universally accepted definition of ‘emotion’. However, initiatives have been taken
to establish an HCI-specific ‘working definition’ of the term, starting out from four basic assumptions
(Crane & Peter, 2008):
1. Emotions are multifaceted processes that unfold over time.
2. Emotions are induced by internal or external events.
3. Emotions manifest themselves through multiple channels, resulting in specific physiological patterns.
4. Emotion channels are loosely coupled and may interact in complex ways.
Note that this definition refers to emotions as processes. This choice of words underlines the fact
that emotions are not stable ‘states’, but subject to continuous change. This, of course,
complicates any efforts to label or ‘measure’ affective experiences. Another important assumption
stated above is that emotions manifest themselves through multiple channels, or ‘modalities’. This
notion is widely agreed upon among emotion researchers. Although different sets of modalities
have been suggested, the following three are usually mentioned (Scherer, 2005):
1. Subjective experiences (what a person is actually feeling).
2. Motor expressions (face, voice, gestures).
3. Bodily symptoms (any physiological changes in the body).
Note that this description clarifies the distinction between ‘emotion’ and ‘feeling’, two concepts
which are easily confused; feeling is just one part of what constitutes an emotion, i.e. the subjective
experience. The term ‘affect’, by contrast, is often used as a synonym of ‘emotion’ (e.g. Picard, 1997),
a practice which is adopted in this thesis as well.
Another topic of debate within the affective sciences relates to how emotions should be
described or modeled. On a high level of abstraction, current emotion theories can be divided into
two approaches: discrete and dimensional emotions (Partala, 2005). The discrete approach starts
out from emotion labels used in everyday language, such as ‘anger’, ‘fear’ and ‘happiness’, and
attempts to categorize affective processes according to these labels (cf. Ekman et al., 1982). The
dimensional approach starts out from a set of basic dimensions, each of which is defined by a pair of
opposite adjectives. The most commonly used dimensions are valence, ranging from pleasant
(positive valence) to unpleasant (negative valence), and arousal, ranging from calm to excited
(Scheirer et al., 2001). These dimensions make up the x and y axes of ‘emotional space’, into which all
emotions can be categorized based on their different characteristics. According to Partala (2005),
most scientists currently agree that the discrete and dimensional approaches are complementary,
and that both may be more or less useful, depending on the context of study.
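As a simple illustration of how the two approaches can complement each other, discrete emotion labels may be placed as points in the two-dimensional valence-arousal space, and an observed state can then be related to the nearest label. The coordinates below are invented for illustration; actual placements vary between studies:

```python
import math

# Hypothetical (valence, arousal) coordinates in [-1, 1]; real placements
# of discrete labels in emotional space differ between studies.
EMOTION_SPACE = {
    "happiness":   (0.8, 0.5),
    "anger":       (-0.6, 0.8),
    "fear":        (-0.7, 0.7),
    "sadness":     (-0.7, -0.4),
    "contentment": (0.6, -0.5),
}

def nearest_label(valence, arousal):
    """Map a point in valence-arousal space to the closest discrete label."""
    return min(EMOTION_SPACE,
               key=lambda e: math.dist((valence, arousal), EMOTION_SPACE[e]))

print(nearest_label(0.7, -0.3))  # a pleasant, calm state -> 'contentment'
```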
As previously mentioned, emotion and cognition are not separate processes, but closely
related to one another. Today, many researchers talk about a cognitive component of emotion,
arguing that emotional experiences are determined by the cognitive evaluation or appraisal of
events (Scherer, 2005). For example, barely passing an exam is not an inherently happy or sad
event; the emotional reaction depends on the subjective evaluation of the result, in relation to
expectations of the outcome.
2.3.3 Studying User Emotion
In the previous section, I introduced three ‘modalities of emotion’ (i.e. subjective experience, motor
expression and bodily symptoms). These modalities point out a direction for emotion research: If
we accept that emotions manifest themselves through different channels, then the way to
understand emotions would be to ‘tune in’ to one or more of these channels.
The first modality of emotion, i.e. what a person is actually feeling, is of course hard to
‘measure’. Naturally, only the person having an emotion can know what it feels like, and the only
way to extract at least some of that information is by asking the person. In usability testing, this is
usually done through a more or less structured interview, or by using a questionnaire of some sort
(Madrigal & McCain, 2009).
Motor expression, or ‘body language’ to use a more common term, has been extensively
studied in emotion research (Haag et al., 2004). This modality includes gestures, posture and facial
expressions; in short, every emotional expression that can be observed by the people around us. A
disadvantage of using motor expression for emotion research is that people can control or “fake”
their body language, at least to some degree. For example, a person may choose to conceal a feeling
of disappointment with a smile. Naturally, this may lead to misinterpretations, in particular if the
emotion recognition is performed by a computer, which may not be able to interpret the
surrounding circumstances.
The third modality of emotion, and the focus of this study, is bodily symptoms. This modality
includes all physical reactions that are associated with an emotion. Some researchers argue that
physiological measurement is a particularly promising method for affect recognition, because these
measures are less susceptible to environmental interference or voluntary masking than for example
facial expressions (cf. Picard, 1997). Physiological measurement has been extensively researched
within affective computing, and some researchers in the field argue that reliable affect recognition
could be achieved through the integration of several physiological measures (cf. Hudlicka, 2003).
The use of physiological measures is further discussed in the next chapter.
3 Physiological Measures
This chapter serves as a general introduction to the study of physiological responses, and provides
a more detailed description of some of the most popular measures. The first section puts
physiological responses into context, briefly describing their role in the human body and how they
are brought about. Thereafter, I discuss some general issues regarding the practical implementation
of physiological measures in usability testing. Finally, the third section of the chapter provides an
introduction to some of the most popular physiological measures, i.e. cardiovascular activity, skin
conductance and EEG. Pupillometry, being the main focus of this study, will be discussed separately
in Chapter 4 of the report.
3.1 Physiological Context
The human nervous system can be divided into a central and a peripheral system, which are each
responsible for different parts of the body. The central nervous system (CNS) includes the spinal
cord and the brain, and can be described as the body’s control center. The spinal cord is responsible
for simple reflexes and serves as a pathway between the brain and other parts of the body. The
brain is responsible for all cognitive processing, including perception, memory and thought, but is
also the center of emotions (Chanel et al., 2009).
The peripheral nervous system (PNS) can be described as the body’s communication system,
and acts mainly below the level of consciousness. The PNS is responsible for carrying signals from
the CNS to the rest of the body, but it also transfers sensory information from the organs (e.g. eyes,
ears and skin) back to the brain, where it is processed and interpreted. Of special relevance to this
study is the autonomic nervous system (ANS), which is often described as a subdivision of PNS.
However, current research underlines the integrated nature of the human nervous system, and has
found that there are actually close interactions between its central and autonomic divisions
(Kreibig 2010). The primary task of ANS is to provide quick and reliable responses to surrounding
events, preparing the body for appropriate action (ibid.). This can only be attained by the
coordination and integration of neurological activity, from the highest level in the cortex down to
the spinal cord and peripheral nervous system (ibid.).
There are two branches of ANS, the sympathetic and the parasympathetic branch, which are
responsible for different bodily responses. When fully activated, the sympathetic division of ANS
prepares the body for a crisis that may require sudden, intense physical activity: heart and
respiration rates rise, sweating breaks out and alertness increases (Barreto et al., 2007). This is
known as the ‘fight or flight’ response, and may be experienced in highly emotional or stressful
situations (Partala, 2005). By contrast, the parasympathetic division of ANS brings the body back
from the emergency state, and is associated with effective emotion regulation and restoration of
energy (ibid.).
In addition to stress and emotion, cognitive factors may also influence ANS activity. In
particular, the activation of the sympathetic branch is associated with high levels of cognitive
workload. This increased activation or arousal can often lead to improved cognitive performance, at
least up to a certain point. However, the effect can only be sustained over brief periods of time
(Kecklund & Åkerstedt, 2004). Parasympathetic activity, on the other hand, has been associated
with enhanced attention (Rantanen et al., 2010).
3.2 Using Physiological Measures
Evidence that human physiology responds to a variety of mental events has been available since the
19th century (Ward & Marsden, 2003). Skin conductance, respiration, electrical brain activity,
muscle tension, pupillary size and cardiovascular activity have all been reported to vary in response
to factors such as task difficulty, levels of attention, experiences of frustration and emotionally
toned stimuli (Andreassi, 2000). Therefore, it has been proposed that physiological data might be a
valuable tool for usability testing, as it could help identify elements and events of cognitive or
emotional relevance to the user (Ward & Marsden, 2003).
However, the integration of physiological measures in usability testing has some inherent
difficulties. First of all, most existing studies have been performed in tightly controlled
experimental settings. This goes against one of the basic requirements of usability testing, namely
that the test conditions should be as close to “real-world” use as possible. Thus, if physiological
measures are to be applied to the less controlled conditions of usability testing, then great care
must be taken in the design of testing procedures (Ward & Marsden, 2003). Another challenge lies
in the interpretation of data, since the same kind of physiological responses may be observed for
different mental states, such as frustration, surprise or increased cognitive effort (ibid.). Therefore,
a correct interpretation requires knowledge of the context in which the data was obtained. In order
to better understand the results, it is thus advisable to record additional observations along with
the physiological measurements, such as comments, observed behaviors and subjective ratings of
events (Kecklund & Åkerstedt, 2004).
Another important issue in physiological measurement is referred to as the baseline
problem: How do we establish a reference response for a given physiological measure, against
which other obtained values may be compared? What is, for example the “normal” heart rate, pupil
size or skin conductance? Unfortunately, physiological responses are highly individual, which
generally makes between-subject comparisons misleading (Gunes & Pantic, 2010). In addition,
significant variations that are unrelated to emotional or cognitive factors may be observed within
subjects, depending on for example environmental factors (temperature, humidity etc.), time of day
or the subject’s pre-trial activities (Ward & Marsden, 2003). This makes it impossible to establish
any critical or cut-off values for physiological measures, corresponding to, for example, a particular
emotional state or level of mental effort (Kecklund & Åkerstedt, 2004). Instead, observed variations
must always be interpreted in relation to the baseline for the specific subject, time and context in
which data was collected.
A common approach to the baseline problem is to use the average response obtained over a
period of time before trial onset, during which no significant stimulus is presented (cf. Dufresne et al.,
2010). Subjects may, for example, be sitting in a dark room or in front of a blank screen for some
time, while their physiological responses are being recorded. A problem with this method is that
although little external stimulation is presented to the subject, it is impossible to control his or her
thoughts or state of mind, which may be influenced by a bad day at work, a pleasant memory, or
any other internal stimuli. Another approach to the baseline problem is to define the reference
response as the average value obtained for the measure over the course of the experimental
session.
A typical case in which the baseline may be of value is when comparing or averaging
physiological responses over multiple subjects. In such cases, the following formula (where R is the
physiological response under study) may be used to normalize the individual results (e.g. Dufresne
et al., 2010):

R_norm = (R - R_baseline) / R_baseline

The normalization process allows for a more accurate averaging of results. For example, certain test
subjects may have a natural tendency to sweat more, or an inherently faster heart rate than others.
Without baseline correction of the results, these individuals would have a larger impact on the
averaged results than the other participants (Beatty & Lucero-Wagoner, 2000).
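A minimal sketch of such a baseline correction, assuming the response is expressed as a relative deviation from the subject's own baseline, might look as follows:

```python
def normalize(response, baseline):
    """Relative deviation from the subject's own baseline:
    (R - R_baseline) / R_baseline."""
    return (response - baseline) / baseline

# Two subjects with different resting heart rates, but the same relative
# increase, yield the same normalized value:
print(normalize(72.0, 60.0))  # 0.2, i.e. 20% above baseline
print(normalize(96.0, 80.0))  # 0.2 as well
```

Because both subjects contribute the same normalized value for the same relative change, neither dominates the averaged result.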
As mentioned in the previous chapter, many researchers argue for the integration of several
measures in order to obtain a collective understanding of a user’s mental state. Gunes and Pantic
(2010) describe two main approaches to this problem: feature and decision level fusion. In decision-
level fusion, the different features obtained are analyzed separately. Once a classification has been
made for each feature, the results are combined to produce the final hypothesis. This method
typically assumes that the different features are independent from each other, which is often not
the case (heart rate, for example, is influenced by respiration patterns). However, the assumption of
mutual independence makes the problem of data fusion more manageable. Feature-level fusion is
somewhat more challenging, and becomes even more so as the number of features increases. This is
particularly true if the measures obtained have very different temporal properties, either because
the measurement devices have different sampling rates, or because the different responses are
inherently out of sync (e.g. heart rate and EEG). In such cases, it is particularly important to make
sure data from different sources are time-stamped correctly (ibid.).
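A decision-level fusion step can be illustrated with a simple majority vote over the labels produced by each channel's classifier; the channel names and labels here are hypothetical:

```python
from collections import Counter

def decision_level_fusion(channel_labels):
    """Combine per-channel classifications into a final hypothesis by
    majority vote, treating the channels as if they were independent."""
    return Counter(channel_labels.values()).most_common(1)[0][0]

# Hypothetical per-channel classifier outputs for one time window:
labels = {"heart_rate": "high_arousal",
          "skin_conductance": "high_arousal",
          "eeg": "low_arousal"}
print(decision_level_fusion(labels))  # 'high_arousal'
```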
3.3 Common Measures
There is no ‘gold standard’ for physiological measurement; instead, each measure has its pros and
cons (Kecklund & Åkerstedt 2004). However, this section will describe some of the measures that
seem particularly important or relevant to the present study, i.e. skin conductance, cardiovascular
activity and EEG. According to Chanel et al. (2009), skin conductance (GSR) and heart rate are
almost always included in affect recognition, an observation that I support based on my own
literature review. Therefore, it seems only natural that these measures should be discussed here. I
have also chosen to look into the use and future potential of EEG, which is an up-and-coming
technology in HCI research. An introduction to each of these measures is provided in this chapter,
while pupillometry, being the focus of the study, will be discussed separately in the next chapter.
3.3.1 Cardiovascular activity
Cardiovascular activity refers to activity of the heart, and includes parameters such as heart rate,
heart rate variability and blood volume pressure. There are two common ways to measure
cardiovascular activity: Electrocardiogram (ECG) and Photoplethysmography (PPG; Park 2009).
ECG measures the electrical pulse produced by the heart every time it contracts to pump out blood.
This method requires (at least) three electrodes, which can be attached to both arms, both legs or
the chest. Arm or leg placement is considered more practical for HCI research, but the
distance to the heart makes the signal more vulnerable to noise caused by for example body
movement or internal organ activity (ibid.).
While ECG monitors the electrical activity of the heart, PPG concentrates on its mechanical activity, by
measuring the blood flowing in and out of a toe or finger. This information is typically obtained by
placing a sensor on the toe or finger, while infrared light is emitted into the skin. Because the level
of light absorption changes with the amount of blood flowing underneath the skin, it is possible to
retrieve the heart rate from this measurement. A downside of PPG is that, due to the rather long
distance from the toes/fingers to the heart, the blood flow may not always be strong enough for the
sensor to record the PPG. In general, finger placement gives a slightly more reliable signal than toe
placement (Park 2009). On the other hand, having a sensor placed on the finger while interacting
with a computer may have a negative impact on the user experience, which may in turn influence
the obtained data.
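The principle of retrieving heart rate from a PPG waveform can be sketched as follows: locate the pulse peaks in the signal and convert the mean inter-peak interval to beats per minute. This is a simplified illustration on a synthetic signal; real PPG data would require filtering and more robust peak detection:

```python
import math

def heart_rate_from_ppg(signal, fs):
    """Estimate heart rate (BPM) from a PPG waveform sampled at fs Hz,
    by locating pulse peaks and averaging the inter-peak intervals."""
    threshold = (max(signal) + min(signal)) / 2
    peaks = [i for i in range(1, len(signal) - 1)
             if signal[i] > threshold
             and signal[i] >= signal[i - 1]
             and signal[i] > signal[i + 1]]
    intervals = [(b - a) / fs for a, b in zip(peaks, peaks[1:])]
    return 60.0 / (sum(intervals) / len(intervals))

# Synthetic pulse wave: one beat per second, sampled at 40 Hz
fs = 40
signal = [math.sin(2 * math.pi * i / fs) for i in range(fs * 5)]
print(round(heart_rate_from_ppg(signal, fs)))  # 60
```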
Cardiovascular monitoring is perhaps mostly associated with medical contexts, where it
may be used to identify elevated risks of heart disease or to evaluate the efficiency of a treatment
(cf. Kecklund & Åkerstedt, 2004). For this purpose, considerable efforts have been made to produce
wearable measuring devices, which may accompany people in their everyday lives. Various
solutions have been proposed, including devices that are worn on the finger, forehead, wrist or ear
region (Poh et al., 2011). These efforts are equally interesting for UX and usability research, as
interactive systems are no longer limited to stationary office environments. However, most of the
wearables proposed so far must still be connected to additional hardware (for power and
data acquisition), which may be bulky and cumbersome to handle (ibid.). Poh et al. (2011)
proposed an alternative approach to the problem, introducing a system called the Heartphones. The
idea is to integrate measuring equipment into devices that users are already familiar with, in this
case a (smart) mobile phone and a pair of modified earphones. The system uses PPG technology for
unobtrusive measurement of cardiovascular activity, so that users are free to carry out their
everyday tasks (ibid.). Another attractive solution was developed by Yoo et al. (2006), who
developed a wrist-band type PPG device with a Bluetooth communication interface to provide
mobility.
Heart rate (HR) is perhaps the most straightforward measure of cardiovascular activity. In a
review of ANS activity in emotions, Sylvia Kreibig (2010) provides a summary of the findings
related to HR response. She reports that HR has been found to increase for a number of negative
emotions (e.g. anger, anxiety, embarrassment, fear, crying sadness) as well as for some positive
emotions (e.g. happiness, joy) and surprise (which is hard to classify on a valence scale). A decrease
in HR, on the contrary, is observed when people experience affection, contentment or non-crying
sadness — emotions that, according to Kreibig, all involve an element of passivity. These findings
support the rather unsurprising conclusion that heart rate is a reflection of the level of autonomic
activation of an organism, associated with activation of the sympathetic branch of ANS.
However, heart rate is not only a reflection of sympathetic nervous system activity.
Research has demonstrated that the parasympathetic nervous system causes the heart to slow
down when we pay close attention to a stimulus, perhaps to allow the body to calm down until
proper assessment of the situation has been reached (Park, 2009). This knowledge may be useful in
usability testing, because it could help indicate whether a particular object or feature caught the
user’s attention or not. However, it should be noted that this response only occurs when a
subject is attending to external stimuli; internal processing, like solving a math problem, is instead
associated with increased HR (ibid.).
Heart rate variability (HRV), or sinus arrhythmia, is a measure of the fluctuations of the beat-to-
beat interval of the heart. The HRV response is influenced by a number of factors, including
physical activity, body posture, respiration, cognitive effort and state of arousal (Berntson et al.,
1997). According to Kecklund & Åkerstedt (2004), the heart's ability to beat faster or slower in
response to changing mental or physical demands tends to decrease as the level of stress or
cognitive workload increases. These states are thus associated with a decrease in HRV.
According to Rowe et al. (1998), HRV has been found to respond to transitions from rest to task
conditions in a large number of studies focusing on mental workload. When it comes to affective
assessment, some studies have suggested that HRV may be sensitive not only to the level of arousal,
but also to the emotional valence of a stimulus (cf. Rantanen et al., 2010).
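Two common time-domain HRV measures, the standard deviation of the beat-to-beat intervals (SDNN) and the root mean square of successive differences (RMSSD), can be sketched as follows; the interval series below are invented for illustration:

```python
import math

def hrv_measures(rr_intervals_ms):
    """SDNN and RMSSD from successive beat-to-beat (RR) intervals in ms."""
    n = len(rr_intervals_ms)
    mean_rr = sum(rr_intervals_ms) / n
    sdnn = math.sqrt(sum((x - mean_rr) ** 2 for x in rr_intervals_ms) / (n - 1))
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    rmssd = math.sqrt(sum(d ** 2 for d in diffs) / len(diffs))
    return sdnn, rmssd

# Hypothetical interval series: at rest the intervals vary more (higher
# HRV) than under cognitive load (lower HRV).
rest = [850, 870, 830, 880, 820, 860]
load = [700, 705, 698, 702, 699, 701]
print(hrv_measures(rest))
print(hrv_measures(load))
```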
When studying cardiovascular activity, it is important to be aware of the effects of changes
in activity or posture on the heart’s activity. This is perhaps not that problematic in usability
testing, where subjects are often asked to sit in front of a computer while performing the test.
However, it might be a problem for ambulatory assessment of a mobile interface, or if the target of
evaluation is a video game with an element of physical interaction.
Another unwanted influence on HR and HRV is respiration. In general, inhalation is
associated with inhibition of parasympathetic activity, which causes a temporary increase in heart
rate, while the opposite effect is observed during exhalation (Berntson et al., 1997). Therefore,
some researchers have suggested that respiration rate should be measured along with
cardiovascular activity, in order to control for the effects of breathing on the obtained signal (ibid.).
Another disadvantage of these measures is the dual influence of the sympathetic and
parasympathetic nervous systems on cardiovascular activity. This complicates the interpretation of
data, because the signal obtained is not informative of the respective branch’s influence on cardiac
functioning (Kreibig 2010). For example, acceleration of heart rate may be caused by increased
arousal (sympathetic activation), but it may also be an indication of decreased attention to external
stimuli (parasympathetic deactivation). Therefore, it is particularly important to consider the
context when analyzing cardiovascular responses (Park, 2009).
3.3.2 Skin Conductance
Skin conductance (SC) or galvanic skin response (GSR) is a well known indication of arousal, and
has long been used for example in lie detectors (Kecklund & Åkerstedt, 2004). In essence, GSR is a
reflection of sweat production; increased sweating leads to more moisture in the skin, a lower
electrical resistance and therefore higher conductance.
In order to obtain the galvanic skin response, a small electrical current is passed through
the skin, using a pair of electrodes (Barreto et al., 2007). These electrodes are usually placed either
on the palms or the soles, because these body parts have a particularly high concentration of sweat
glands (Park, 2009). However, using the hands or feet as measuring points may be somewhat
problematic. Most human-computer interfaces require free use of the hands to obtain successful
interaction, which makes sensors placed on the palm a significant limitation. Using the soles for
data acquisition may seem like the better choice, but unfortunately, this requires subjects to
remove their socks and keep their feet lifted throughout the session, to keep the sensors from
touching the floor (ibid.). These restrictions could certainly have a negative impact on the user
experience, and thus influence the outcome of the test. However, less intrusive applications of GSR
are under development. For example, Ming-Zher Poh et al. (2010) describe a wireless wristband
with built in electrodes, which can be used to measure skin conductance during everyday activities.
According to the authors, the components used for the sensor can be purchased off the shelf for
approximately $150. This may be compared to commercial systems (such as Flexcomp Infiniti,
www.thoughttechnology.com), which may cost over $6000 (Poh et al., 2010).
So what can we learn by monitoring skin conductance? Naturally, activity of the sweat
glands increases with physical activity; but in addition to this, GSR has been found to increase in
response to most affective states. The explanation for this phenomenon lies in the action
preparation associated with most affective states (Kreibig 2010). The most obvious examples of this
are perhaps anger – associated with preparation for fight – and fear – associated with preparation
for flight. By contrast, some emotional states are associated with a decrease in electrodermal
activity, which may in turn be taken as an indication of decreased motor preparation. This is true
for sadness, which is typically experienced when a loss has occurred that cannot be undone; relief,
which is experienced after a threat has passed; and contentment, which is experienced when a
satisfactory outcome has been attained. In these cases, the significant event has already occurred,
which makes further action futile (ibid.). Thus, it is only natural that activity in the sweat glands
decreases.
Importantly, skin conductance is directly related to activation of the sympathetic division of
ANS, and thus independent of parasympathetic activity (Park, 2009). This is an advantage, because it
means that GSR is less open to misinterpretation than many other physiological measures (such as
heart rate and pupil size, which are influenced by both divisions of ANS).
A disadvantage of GSR is the fact that it is hard to link an observed response to a particular
point in time. All measures of peripheral activity have relatively long response latency, typically a
few seconds, but GSR is particularly slow, with response latencies somewhere around 3 to 6
seconds from stimulus onset (Chanel et al. 2009, Park 2009). One reason for the large variation in
reaction time is that rather than constantly producing sweat, human sweat glands tend to ‘spout
out’ sweat. For this reason, it is not recommended to use GSR to identify the exact moment when a
response was triggered. Instead, researchers should calculate the average response over a period of
time, for example the duration of a task or the presentation of a stimulus, and then compare the
result to other such units (Park, 2009). In this way, researchers may compare the level of stress
elicited by different stimuli or types of tasks.
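Such window-based averaging could be sketched as follows; the sampling rate, conductance values and task windows are hypothetical:

```python
from statistics import mean

def mean_gsr_per_task(samples, fs, task_windows):
    """Average skin conductance over each task's time window, instead of
    trying to pin the slow GSR response to exact moments in time.
    task_windows maps a task name to a (start_s, end_s) pair."""
    return {task: mean(samples[int(start * fs):int(end * fs)])
            for task, (start, end) in task_windows.items()}

fs = 4  # samples per second
samples = [2.0] * 40 + [3.5] * 40  # conductance in microsiemens
result = mean_gsr_per_task(samples, fs, {"task_a": (0, 10), "task_b": (10, 20)})
print(result)  # {'task_a': 2.0, 'task_b': 3.5}
```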
In 2003, Ward & Marsden conducted a study in which they investigated whether a combination of
skin conductance (SC), heart rate (HR) and blood volume pulse (BVP) could be useful as a tool for
usability testing. Data was collected under rather loosely controlled HCI situations, with the aim of
identifying typical physiological patterns relating to different HCI events. The results revealed large
variations in range and magnitude of the GSR, both between different individuals and within the
same individual on different occasions. However, when the results were converted to percentage
variations, some general patterns could be observed (Ward & Marsden, 2003):
In low-stress situations with no significant events, there was a steady decrease in both
SC and HR, suggesting lowered levels of arousal. No sudden changes occurred after an initial
“settling down” period of 2-3 minutes.
During “normal” use of software in realistic situations, considerable fluctuations in HR, SC
and BVP were observed, although responses would remain around the same general level
Degree Project Report
Malin Jönsson Forne, 2012
21
through most of the interaction. However, when a known usability problem was encountered
(in this case a difficult-to-find link), a rapid increase in SC was observed.
Following an unexpected HCI event (in this case the appearance of an alert box),
participants would exhibit increases in SC and HR, indicating a sudden increase in arousal.
After a latency of 1 second, the most extreme response for SC was an increase of 63% over
the following 9 seconds. The unexpected stimuli would also produce increased fluctuation
in the physiological data.
Ward & Marsden concluded that the data seemed to indicate a relationship between physiological
measures (SC in particular) and different kinds of HCI events. However, it was also observed that
the experiments took place under rather loosely controlled situations, which allowed for a number
of uncontrolled sources of variability (ibid.).
3.3.3 Electrical Brain Activity
In addition to peripheral measures of physiological activity (like the ones mentioned above), there
are a number of ways to assess central processing. The human brain contains approximately 100
billion neurons, which communicate either through tiny electrical impulses, or by exchanging
chemicals, called neurotransmitters (Lee & Tan 2006). Every event, behavior, thought or emotion
produces millions of such impulses in the brain, which may be measured with technologies such as
EEG (electroencephalography), fMRI (functional Magnetic Resonance Imaging) or PET-scanning
(Positron emission tomography). Thereby, it is possible to analyze the activity in different regions
of the brain (e.g. the frontal, visual and motor cortex), and to identify recurring patterns.
While the PET and fMRI methodologies have many advantages, including a high spatial
resolution, EEG is currently considered the most suitable alternative for usability testing (Lee &
Tan, 2006, Chanel et al. 2009, Antonenko et al. 2010). First of all, modern EEG is comparatively
cheap and less intrusive than the alternative methods, which either require subjects to lie still
during data acquisition (fMRI) or to ingest substances before trial onset (PET; Antonenko et al.
2010). Moreover, EEG has a very short response latency compared to PET and fMRI, both of which
rely on variations in cerebral blood flow, which might not appear until several seconds after
emotion onset. On the downside, however, EEG requires direct contact between test subject and
measuring equipment, unlike for example fMRI (ibid.).
Unlike PET and fMRI, EEG measures electrical activity in the brain in a direct manner, by
placing electrodes along the scalp. An electrical impulse transmitted from a single neuron is too tiny
to be detected by the EEG, but the coordinated activity of large groups of neurons may result in
electrical fields that are strong enough to be measured from outside the skull (Lee & Tan 2006). The
signal obtained from each measuring point is passed through a differential amplifier, and the
resulting EEG is a waveform reflecting voltage variation over time. However, some electrical
impulses may be lost or scattered before they reach a measuring point, which means that the
obtained EEG is at best a crude representation of brain activity (ibid.).
A great challenge involved in using EEG relates to the presence of measuring artifacts, which
originate from electrical impulses that are unrelated to cerebral activity. Such artifacts may
originate from muscle tension, heart beats, eye blinks or body movement of any kind. Furthermore,
the electroencephalograph may pick up signals from electronic equipment in the test environment.
However, most contemporary EEG systems are equipped with robust software, which may facilitate
data analysis by removing some of the most common artifacts (Chanel et al. 2009).
Once the EEG has been obtained, the signal is usually analyzed by looking at the spectral power
in a set of standard frequency bands, which have been found to correspond to certain types of
neural activity (Lee & Tan, 2006). The different components of the EEG are extracted through signal
processing techniques, such as Fourier transformation. At present, it is believed that the brain
generates at least four basic rhythms or wave patterns. These are (Antonenko et al. 2010):
Delta waves (<4 Hz)
Theta waves (4-7 Hz)
Alpha waves (8-12 Hz)
Beta(-low) waves (>12 Hz)
In addition to those frequencies, the following two are often added to the list:
Beta-high waves (20-30 Hz)
Gamma waves (>30 Hz)
As we can see, the basic components of the EEG response form a continuum from low to high
frequencies. The naming of the components may seem confusing to anyone familiar with the Greek
alphabet, but reflects the order in which the different rhythms were discovered. In healthy
individuals, the low frequency delta waves are only present during sleep, while faster alpha waves
dominate when a subject is awake but inattentive (ibid.).
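As a rough illustration of this kind of spectral analysis, the following Python sketch (using NumPy) estimates the power in the alpha and theta bands from a discrete Fourier transform. The signal here is a synthetic stand-in for real EEG, built from one alpha-band and one weaker theta-band sine component:

```python
import numpy as np

def band_power(signal, rate_hz, low, high):
    """Power in the frequency band [low, high) Hz, estimated from the
    squared magnitude of the real FFT."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate_hz)
    mask = (freqs >= low) & (freqs < high)
    return spectrum[mask].sum()

# Synthetic 1-second "EEG" trace: a 10 Hz (alpha) component plus a
# weaker 6 Hz (theta) component. Real EEG would be far noisier.
rate = 256
t = np.arange(rate) / rate
eeg = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 6 * t)
alpha = band_power(eeg, rate, 8, 12)
theta = band_power(eeg, rate, 4, 7)
# The stronger alpha component yields the larger band power.
```

In practice, windowed estimators (e.g. Welch's method) are typically preferred over a single raw FFT, but the band-masking idea is the same.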
Another way to analyze EEG responses is to extract the event-related potentials, or ERPs.
This technique is often used in studies that investigate the EEG response to a specific task or
stimulus. The most common way to extract the ERP is through data averaging, which means that
the amplitude values over short epochs of time are averaged to create a new waveform (Coles &
Rugg 1995). The background EEG, i.e. brain activity that is unrelated to the significant stimulus, is
assumed to vary randomly, and will therefore tend to average to zero (ibid.). What is left after the
averaging is therefore largely a representation of the event related activity. Once the ERP is
obtained, principal component analysis (PCA) may be applied to identify its different components,
which may give information about cognitive states (ibid.). However, ERPs have a limited potential
for usability testing, because it typically requires presenting stimuli at regulated timings and under
carefully controlled conditions (Lee & Tan 2006).
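The averaging logic behind ERP extraction can be demonstrated with synthetic data: a fixed waveform is added to random background activity at known onsets, and averaging the time-locked epochs recovers it while the background tends toward zero. All signals below are artificial:

```python
import numpy as np

def extract_erp(eeg, onsets, epoch_len):
    """Average fixed-length epochs time-locked to each stimulus onset."""
    epochs = np.stack([eeg[o:o + epoch_len] for o in onsets])
    return epochs.mean(axis=0)

rng = np.random.default_rng(0)
n, epoch_len = 20000, 50
signal = rng.normal(0.0, 1.0, n)             # random background EEG
response = np.hanning(epoch_len)             # the "true" event-related waveform
onsets = np.arange(100, n - epoch_len, 100)  # 199 stimulus onsets
for o in onsets:
    signal[o:o + epoch_len] += response      # same response at every onset

erp = extract_erp(signal, onsets, epoch_len)
# With ~200 epochs, the background largely averages out and the
# recovered waveform closely matches the embedded response.
max_error = np.abs(erp - response).max()
```

This also shows why ERP studies need many controlled, precisely timed stimulus presentations: the quality of the recovered waveform depends directly on the number of averaged epochs.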
Lee and Tan (2006) observed that many HCI researchers were hesitant to explore the
domain of EEG, either because they felt that they lacked the required knowledge, or because of the
high cost of owning and maintaining the equipment. Traditional EEG systems are
indeed expensive, with high-end devices costing approximately USD 20,000-25,000 (ibid.).
Moreover, the equipment is difficult to handle and highly obtrusive. In typical medical applications,
between 16 and 25 flat metal discs (i.e. electrodes) are placed along the scalp using a sticky paste,
and each electrode is connected by wires to the recording machine (A.D.A.M. Medical Encyclopedia,
2012). However, recent technological advancements have allowed less intrusive implementations
of EEG, using caps or dry electrodes. Usability researchers can now gain access to (very) simple
wireless EEG headsets at prices starting from just under 100 USD (see e.g.
http://www.neurosky.com/ or http://www.emotiv.com/). While such devices
demonstrate that the use of EEG is no longer limited to laboratory settings, the number of
measuring points they provide is very small, which could compromise the value of the results
(Chanel et al. 2009).
In 2006, Lee & Tan performed a study in which they investigated the potential of a low-cost, 2-
channel EEG system (retailing at approximately 1500 USD). Two similar experiments were
conducted, of which only the second will be described here. The goal of the experiment was to
distinguish between three different tasks, based on differences in the resulting EEG patterns. Eight
subjects performed the following tasks, involving the computer game Halo (Microsoft Game
Studios):
Rest: Participants were asked to relax and fixate their eyes on the screen. No interaction
with the game occurred. (This task was used as the baseline).
Solo: Participants used keyboard and mouse to navigate through the game and interact with
objects in the environment. However, no enemies were visible in this task.
Play: Participants played against other participants, including an expert player who made
sure subjects were engaged in the game throughout the task.
The test sessions took place in an unmodified office environment, containing several computers,
fluorescent lights and other potential sources of noise. Due to the high variance in EEG properties
between individuals, the task classification procedure was performed separately for each
participant. The result was a mean classification accuracy of 92.4%, indicating that low-cost EEG
equipment could be sufficient for simply performing task classification and detection (Lee & Tan,
2006).
In a review of the use of EEG for cognitive load assessment, Antonenko et al. (2010) report
that two components of the EEG are sensitive to task difficulty manipulations: alpha and theta.
Alpha waves are dominant when subjects are awake but inattentive, and have been found to
decrease in response to mental effort. Theta activity, by contrast, has been found to increase with
cognitive load. Theta and alpha waves can thus be combined to assess mental effort (Antonenko et
al. 2010).
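Since alpha decreases and theta increases with mental effort, the two observations could in principle be combined into a single index, for example a theta/alpha ratio. The sketch below is only an illustration of that idea; the band-power values are made-up numbers, not empirical data:

```python
def cognitive_load_index(theta_power, alpha_power):
    """Illustrative index: theta rises and alpha falls with mental
    effort, so the theta/alpha ratio should grow under higher load."""
    return theta_power / alpha_power

# Hypothetical band powers for an easy and a hard task.
easy_task = cognitive_load_index(theta_power=4.0, alpha_power=8.0)
hard_task = cognitive_load_index(theta_power=6.0, alpha_power=5.0)
# The harder task yields the larger index.
```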
So far, relatively few studies have investigated the practical usefulness of EEG for emotion
assessment (Chanel et al. 2009). The amygdala is the main seat of emotions in the brain, but the
pre-frontal cortex is also involved in affective processing, especially the appraisal of emotional
stimuli. When subjects are confronted with emotional stimuli, we can observe different response
patterns in the pre-frontal cortex depending on the valence properties of the stimuli; negative
stimuli are associated with high alpha activity, while positive stimuli are associated with low alpha
activity (Davidson et al. 2003). In a study from 2009, Chanel et al. investigated the potential of EEG
for distinguishing between three affective states: positive excitement (high arousal, positive
valence), negative excitement (high arousal, negative valence) and calm-neutral (low arousal,
neutral valence). Using 64-electrode EEG equipment, they obtained a classification accuracy of 70%
when different sets of EEG features were combined. The accuracy increased further when
peripheral measures (GSR, respiration and blood volume pressure) were combined with the EEG
analysis (Chanel et al., 2009).
4 Pupillometry
This chapter revolves around pupillometry, i.e. the study of pupillary movements. In the first
section, I provide an introduction to pupillary movements, and explain some of the most important
factors that may have an impact on pupil size. Thereafter, I explain how pupillometric data may be
obtained and analyzed, and discuss some practical issues related to pupil size measurement. In
section three, I provide an overview of previous studies dealing with emotion, cognition and
pupillary movements, based on my literature review of the subject. Finally, I present a minor pilot
study, in which I investigate the potential and practical challenges of pupillometry as a tool for UX
and usability testing.
4.1 Pupillary Movements
For anyone who wishes to understand pupillometric data, it is important to know that pupil size is
not determined by one single factor, but by a complex interaction between different processes in the
body. This section provides an introduction to some of the most important pupillary movements,
and the factors that lie behind them.
All pupillary movements are governed by two antagonistic sets of muscles in the iris: the
sphincter pupillae and the dilator pupillae. The sphincter muscles constrict the pupil when activated,
whereas contraction of the dilator muscles is associated with pupil enlargement (Beatty & Lucero-
Wagoner, 2000). The two sets of muscles thus work as a reciprocal system, in which activation of
one muscle group is accompanied by inhibition of the other (Loewenfeld, 1993). Thus, the diameter
of the pupil is determined by the relative activation of the two muscle groups.
Both dilation and constriction of the pupil are controlled mainly by the autonomic nervous
system, but while activation of the dilator muscles is linked to the sympathetic branch of ANS, the
sphincter muscles are controlled by the parasympathetic branch (Beatty & Lucero-Wagoner, 2000).
4.1.1 Optical Reflexes
The primary function of the pupil is to control the amount of light that enters the eye, much like
changing the aperture of a camera lens. In dim light conditions, the pupil dilates to allow more light
to enter the eye, while in bright light conditions, the pupil constricts to shut out some of the light.
This is referred to as the light reflex (Beatty & Lucero-Wagoner, 2000). In humans, pupil diameter
may vary from less than 1 to more than 9 mm due to luminance conditions (ibid.). Another optical
response is the accommodation response, or near reflex. This reflex allows the eye to adapt to
different fixation distances by changing the curvature of the lens, and thereby the pupil diameter
(ibid).
The reflexes described above both have clear-cut optical functions, and are unrelated to
cognitive and emotional processing. For the purposes of this study, therefore, they may be regarded
as disturbing factors. The accommodation response is problematic in settings that require varying
fixation distances, but is less important in cases where subjects are looking at a fixed
computer screen throughout the test session (as is usually the case in usability testing). The light
reflex, by contrast, cannot easily be overlooked. Even very slight changes in luminance levels can
trigger a response, which makes the light reflex an issue for any study involving visual stimuli.
As stated by Irene E. Loewenfeld (1993):
“Anyone familiar with the low threshold of the pupillary light reflex knows, of course, that it is
impossible to change from one picture to a recognizably different one without the likelihood of a
pupillary change.“
Loewenfeld further concludes that it is not enough to just control the overall brightness of a picture
or a computer screen, as some researchers have attempted.
In addition to the relatively large scale movements of the light reflex, there are also tiny
oscillations of pupil size, which increase in frequency with intensity of illumination (Loewenfeld
1993). These continuous oscillations, sometimes referred to as pupillary unrest, are absent only in
dim light or darkness (when they are replaced by slower, pulsing movements, called “fatigue
waves”).
4.1.2 Reflex Dilation
It has long been known that in conscious, healthy individuals, any sensory, emotional or mental
stimulus (with the exception of light) elicits pupillary dilation (Loewenfeld 1993). As early as 1910,
the German neurologist Oswald Bumke concluded:
“We know today that every mental event, every physical effort, every impulse of will, each activation of
attention, and especially each affect causes pupillary dilation” (translated in Loewenfeld, 1993).
More recent studies have provided solid evidence for this statement (see e.g. Goldwater, 1972,
Loewenfeld, 1993). The kind of pupillary movements listed above, caused by cognitive or emotional
factors rather than optical phenomena, are generally referred to as reflex dilation. This response is
typically observed around 300 to 500 ms after stimulus onset, and has a peak amplitude of less than
0.5 mm (Beatty & Lucero-Wagoner, 2000). Like the light reflex, the dilation
response is a fleeting movement, accompanied by continuous oscillations of pupil size. However,
the fluctuations associated with the dilation response are more irregular and sharp than those linked
to the light reflex, often exhibiting large jumps followed by rapid declines in pupil diameter
(Marshall, 2000).
As Bumke's words suggest, reflex dilation may occur in response to stimuli that are
cognitive or emotional, internal or external. This is bad news for this study, because it makes it hard
to determine the cause of an observed dilation; should it be interpreted as a sign of emotional
arousal, cognitive load, physical effort or something completely different? In 1979, Stanners et al.
observed that most pupillometric research had so far focused on either a cognitive or an affective
(arousal) interpretation of pupillary responses. As we shall see in the following section, this
observation still holds today, over three decades later. However, Stanners et al. (1979) conducted a
study in which they investigated the interaction between cognitive and emotional effects on pupil
size. They found that arousal manipulations had an influence on pupil size only when the cognitive
demands of the task were minimal. The authors concluded that cognitive demands take priority
over arousal factors as determinants of pupillary response. Beatty (1982) came to a similar
conclusion in a review article, where he concluded that emotional factors are relatively
unimportant as determinants of pupil size in information-processing tasks. According to Beatty,
emotional factors are more likely to affect the baseline pupillary diameter, rather than the phasic
responses studied in cognitive pupillometry.
American psychologist Sandra P. Marshall has suggested a method for separating the
emotionally driven pupillary responses from those that are cognitively driven. In a patent accepted
in 2003 (U.S. Pat. No. 6,572,562), Marshall describes an approach based on comparisons between
the respective responses obtained for the left and right eye. According to Marshall, differences
between pupillary responses are reflective of the difference between the two brain hemispheres,
i.e. the “left brain” (associated with logical and analytical thinking) and the “right brain”
(associated with creative, emotional and intuitive thinking). This is an interesting approach, but so far, it does
not seem to have caught the attention of the pupillometric research community. Therefore, it is
hard to draw any conclusions concerning the validity of Marshall's approach.
It is important to observe that cognitive and emotional stimuli can only evoke pupil dilation
(i.e. enlargement); constriction of the pupil can only occur in response to light. In other words, the
light and dilation reflexes have opposite impacts on pupil size. Because reflex dilations have
relatively small amplitude compared to the light response, even small changes in light conditions
may be enough for the light reflex to “overrule” a dilation response. For example, a sudden flash of
light, which should on the one hand produce reflex dilation, and on the other hand cause light-
induced constriction of the pupil, will normally result in a decrease in pupil size (Loewenfeld 1993).
Different approaches to dealing with this problem will be discussed in the next section of this chapter.
Another important feature of reflex dilation is that its magnitude depends on the cognitive
or emotional significance of a given stimulus to the individual. For example, Beatty (1982) describes
an experiment in which subjects had to identify occurrences of a specific tone in one ear, while
tones of another frequency were presented in the other ear. He found that small but reliable
dilations occurred in response to the relevant tone (i.e. the one that had cognitive significance),
while no variation in pupil size was observed in response to the tones that were not attended to.
This underlines the importance of task formulation in pupillometric studies (as in all usability
studies). Similarly, it has been observed that when a stimulus is repeated at monotonous intervals,
the dilation response gradually decreases, as the subject becomes habituated to the stimulus.
However, this is not always true; if the stimulus has some annoying feature, its emotional impact,
and therefore the pupillary response, may instead increase over time (Loewenfeld 1993).
4.2 Measuring Pupil Size
The pupillary system is a very sensitive, low-noise source of psychophysiological data, which can be
measured in a number of different ways (Beatty & Lucero-Wagoner 2000). Early studies (e.g. Hess
& Polt 1960) simply photographed the eye at a given sampling rate, projected the pictures
obtained on a large screen and then measured the pupil with a regular ruler. While this method
proved precise enough to detect large-scale variations in pupil size, it was both labor-intensive and
limited in temporal resolution. Over the last half century, custom pupillometry systems, i.e.
pupillometers, have gradually emerged (Klinger & Hanrahan, 2008). Today, there are hand-held
pupillometers on the market which are both precise and practical to use (see for example
www.neuroptics.com). Recently however, several research groups within the field of HCI have
started to take advantage of the pupillometric capabilities of eye trackers for pupil size
measurement (Klinger et al., 2008). A major advantage of eye tracking is of course that both gaze
and pupil data are recorded with the same equipment, which means that more information is
available for the analysis, without complicating the data collection procedure.
Although eye tracking has been around for more than 150 years, it has only recently started
to reach its full potential (Bartels & Marshall, 2012). Over the last few years, the field of eye tracking
has developed rapidly, resulting in systems that are more powerful, easier to handle and less
obtrusive (ibid). There are two main types of eye tracking systems available on the market today:
remote and head-mounted eye trackers. The basic methodology is the same for the two types, as
both rely on the video-based solution described in the introduction to this report. However, there
are some important differences, which will be discussed in the following.
Head-mounted eye tracking systems have one important advantage: the tracking unit is
fixed to the user’s head, which means that the relative position of the eyes and the tracker stays the
same as the user moves her head. Thereby, gaze data can be recorded while the user is walking
around and performing everyday tasks. This feature is of course important if the target of your
study is a vending machine, or some other “real-world” object. On the other hand, the fact that
physical contact is required between user and tracking device may be perceived as a disadvantage.
For example, Marshall (2002) reported that some of her experimental subjects were bothered by
wearing a head-mounted eye tracker, and that this may have affected the results of her study. It
should be observed, however, that recent technical developments have resulted in less obtrusive
devices, such as glasses with built-in eye tracking capabilities (see for example
www.eyetracking-glasses.com and www.tobiiglasses.com/scientificresearch).
If the target of your study is a web-page, a computer game or some other desktop-based
user interface (which is usually the case in HCI research), then it might be preferable to use remote
eye tracking, which eliminates the need for physical contact between user and eye tracking device
altogether. As mentioned in the introduction to this report, modern eye tracking may be
incorporated into a system that resembles a standard desktop monitor, which allows for highly
unobtrusive data collection (Klinger et al., 2008). Modern remote trackers can compensate for head
movements, as long as the user does not turn away from the screen (cf. Tobii Technology, 2010).
Until recently, however, HCI researchers mostly used head-mounted eye trackers for pupillometric
studies, because remote systems were not considered precise enough for that purpose (Klinger et
al., 2008). Over the last few years, however, several studies, including Klinger et al. (2008),
Palinko et al. (2010) and Bartels and Marshall (2012), have shown that remote eye tracking does
provide enough precision for detailed pupil size analysis.
Another thing that sets different eye tracking systems apart is the way in which the pupil size
is determined. In video-based eye tracking, the optical sensor registers an image of the eyes, which
may then be used to calculate pupil size. However, the way in which pupil size is extracted from the
pupil image differs between different systems. One method is simply to count the number of pixels
encompassed by the pupil in the eye image (Klinger et al., 2008). However, the value obtained with
this method will be affected by changes in gaze direction, because of the curvature of the lens (cf.
Pomplun & Sunkara, 2003). For example, a subject looking straight at the camera will yield a pupil
image that occupies a larger number of pixels than if the pupil had been captured from the
side. Another problem with the pixel-counting approach is that the pupil image is not always
perfect; artifacts such as eyelids, eyelashes, shadows and reflections from the environment may
cause partial occlusion of the pupil, which may also result in inaccurate estimations (Kumar et al.,
2009). Another common approach is to calculate the pupil diameter as the length of the major axis
of an ellipse fitted to the pupil image (Klinger et al., 2008). This solution eliminates some of the
problems involved in pixel-counting, but may instead yield some minor errors due to non-circular
pupil shapes (ibid.). More recently, new eye tracking systems (such as Tobii’s T/X series eye
trackers; Tobii Technology, 2010) have adopted more sophisticated algorithms, in which the pupil
image is used to calculate a 3D model of the eye. According to Tobii, this method provides a pupil
size that is closer to the external, physical size of the pupil than can be obtained by measuring pupil
size directly from the eye image. However, when performing pupillometric studies, the exact size
of the pupil in millimeters is often less important than the change in pupil size over time.
Nevertheless, it might be helpful to know which pupil measurement approach is applied by your
eye tracking device, in order to better understand the potential sources of error.
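For illustration, the two simpler approaches (pixel counting and ellipse fitting) can be expressed as small formulas. The pixel scale and measurement values below are hypothetical, not taken from any particular eye tracker:

```python
import math

def diameter_from_pixel_count(pixel_count, mm_per_pixel):
    """Pixel counting: treat the pupil image as a filled circle and
    recover the diameter from its area."""
    area_mm2 = pixel_count * mm_per_pixel ** 2
    return 2.0 * math.sqrt(area_mm2 / math.pi)

def diameter_from_ellipse(major_axis_px, mm_per_pixel):
    """Ellipse fitting: report the major axis of the fitted ellipse,
    which is less sensitive to foreshortening at oblique gaze angles."""
    return major_axis_px * mm_per_pixel

# Hypothetical scale and measurements.
scale = 0.05                                 # mm per pixel (made up)
d_pixels = diameter_from_pixel_count(2827, scale)
d_ellipse = diameter_from_ellipse(60, scale)
# Both yield roughly a 3 mm pupil for these example values.
```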
Once pupillometric data has been collected, the next challenge is to perform adequate data
processing, in order to extract the relevant information. One of the most common measures to be
extracted from pupillometric data is the mean pupil diameter (MPD), which is calculated as the
average pupil diameter over a given interval of time (e.g. the duration of a task), minus the baseline
diameter (Beatty & Lucero-Wagoner, 2000). An advantage of MPD is that it is rather insensitive
random variations in the data, due to for example eye blinks (depending, of course, on the severity
and frequency of measurement errors). On the other hand, there are some sources of bias to
consider when analyzing the averaged pupil size. For example, trial length may vary across
subjects, which will have consequences for the value obtained when data from different trials are
combined. Unless some kind of weighting procedure is adopted, a subject who needed more time to
complete the tasks will have larger impact on the obtained average (ibid.). In such cases, it may be
better to use peak dilation, which is an equally straightforward measure; the baseline diameter is
simply subtracted from the maximum value obtained in the interval. However, it is important to keep
in mind that because this measure consists of a single value, it is more vulnerable to random
variations in the data than MPD (ibid.).
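The two measures can be sketched as follows; the baseline and trace values below are hypothetical:

```python
import statistics

def mean_pupil_dilation(trace_mm, baseline_mm):
    """MPD: average diameter over the interval, minus baseline (mm)."""
    return statistics.mean(trace_mm) - baseline_mm

def peak_dilation(trace_mm, baseline_mm):
    """Peak dilation: maximum diameter over the interval, minus
    baseline (mm). Being a single sample, it is more vulnerable to
    stray artifacts than the mean."""
    return max(trace_mm) - baseline_mm

baseline = 3.0                               # hypothetical baseline, mm
trace = [3.1, 3.2, 3.3, 3.2, 3.1]            # hypothetical task-evoked trace
mpd = mean_pupil_dilation(trace, baseline)   # ~0.18 mm
peak = peak_dilation(trace, baseline)        # ~0.30 mm
```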
If the pupillary response is to be analyzed in more detail, it is important to address blink
artifacts in the data before further processing is performed. There are several possible approaches
to blink detection, but most solutions start by identifying data losses or values that fall below a
certain threshold value for the approximate duration of an eye blink (70-100 ms according to
Marshall 2000). Such occurrences are then removed or compensated through linear interpolation
(cf. Marshall 2000, Janita et al. 2010, Gao et al. 2010). Because lid-closure is associated with a slight
dilation and reconstriction of the pupil, due to the resulting change in light conditions, a few data
points before and after the blink should also be included in the blink removal (Loewenfeld 1993).
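A simple threshold-and-interpolate scheme along these lines might look as follows. The threshold, padding and trace values are illustrative assumptions, not parameters from the cited studies:

```python
def remove_blinks(trace, threshold=1.0, pad=2):
    """Replace samples below `threshold` (plus `pad` neighbours on each
    side, covering the lid-closure dilation/reconstriction) by linear
    interpolation between the surrounding valid samples."""
    n = len(trace)
    bad = set()
    for i, v in enumerate(trace):
        if v < threshold:
            bad.update(range(max(0, i - pad), min(n, i + pad + 1)))
    clean = list(trace)
    i = 0
    while i < n:
        if i in bad:
            j = i
            while j in bad:           # find the end of this bad run
                j += 1
            left = clean[i - 1] if i > 0 else None
            right = clean[j] if j < n else None
            left = left if left is not None else right
            right = right if right is not None else left
            span = j - i + 1
            for k in range(i, j):     # linearly bridge the gap
                clean[k] = left + (k - i + 1) / span * (right - left)
            i = j
        else:
            i += 1
    return clean

# Hypothetical 3 mm pupil trace where a blink is recorded as data loss (0.0):
trace = [3.0, 3.0, 3.0, 3.0, 0.0, 0.0, 0.0, 3.0, 3.0, 3.0]
clean = remove_blinks(trace)          # the gap is bridged smoothly
```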
Another issue to consider in pupillometric studies relates to the presentation of data. At first
glance, it may seem reasonable to present the result in terms of percent dilation from baseline.
However, Beatty and Lucero-Wagoner conclude in a review from 2000 that the generally adopted
convention is to report both baseline diameter and pupillary diameter in millimeters. According to
the authors, this is a more appropriate practice, since all available evidence suggests that the
magnitude of the task-evoked pupillary response is independent of baseline diameter.
Consequently, a percent dilation approach would result in larger responses in cases where the
baseline diameter is small, and smaller responses in cases where the baseline was large, even
though the actual (absolute) dilation might have been the same (Beatty & Lucero-Wagoner, 2000).
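A small numerical example makes the bias concrete: the same absolute dilation yields different percentage values at different baselines (the millimeter values are made up):

```python
def percent_dilation(baseline_mm, dilated_mm):
    """Dilation expressed as a percentage of the baseline diameter."""
    return 100.0 * (dilated_mm - baseline_mm) / baseline_mm

# The same 0.3 mm task-evoked response at two different baselines:
small_baseline = percent_dilation(3.0, 3.3)   # ~10 %
large_baseline = percent_dilation(6.0, 6.3)   # ~5 %
# Percent reporting makes an identical absolute response look twice as
# large at the smaller baseline, which is the bias Beatty and
# Lucero-Wagoner warn against.
```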
4.3 Previous Studies
This section provides a review of the most central pupillometric findings related to this study. The
first part is a review of studies where pupil size is used for affect recognition, in research fields such
as affective computing. In the second part, the focus is on studies related to cognitive science and, in
particular, cognitive load assessment. Finally, I present some previous attempts to eliminate light
induced changes in pupil size from pupillometric data.
4.3.1 Pupillometry in Affect Recognition
Never has the pupil been as popular as in the 1960s and 1970s, following a series of articles on
pupil size measurement by E.H. Hess and James Polt (1960, 1964). While many of their conclusions
were essentially reconfirmations of what had already been known, one particular finding has been
the source of considerable controversy (Loewenfeld 1993). In a highly influential article, Hess
(1965) reported that the pupil reacted with “extreme dilation” to interesting or pleasing visual
stimuli, while displeasing stimuli caused “extreme constriction”. This “bi-directional” theory on
pupillary responses (i.e. that the pupil could either dilate or constrict in response to emotional
stimuli) gave promise of a marvelous new method, which would allow scientists and market
researchers to assign an “interest value” to everything from consumer products to political
candidates (Loewenfeld 1993).
In the years that followed, pupil size measurement was adopted in both commercial and
academic research, where it was used as a means to detect attitudes of like or dislike towards
package designs, different foods, nude pictures or human faces, just to name a few (Loewenfeld
1993). Unfortunately, these studies relied on false promises. In 1993, Irene E. Loewenfeld published
an extensive review of pupillary research (including more than 100 studies dealing with the bi-
directional theory), in which she concluded:
“It has been shown over and over again that [...] emotional stimuli and all other sensory and
psychologic stimuli - with the exception of light and of stimuli that alter the eye’s near point of vision -
do not constrict the pupil but dilate it.” (p. 663)
This does not mean that emotional stimuli do not affect pupil diameter at all, only that there is no
bi-directional relationship between valence and pupil size; positive and negative emotions both
result in pupil enlargement. According to Loewenfeld, the findings obtained by Hess and some of
his followers were probably experimental artifacts, resulting from the influence of luminance
conditions, as all of these studies used visual stimuli. This is of course bad news for this study,
because it means that pupil measurement alone cannot tell us whether a user is frustrated
(negative valence) or delighted (positive valence) with an interface. It can, however, distinguish
between states of different emotional arousal.
Although affective pupillometry did not live up to the promises of early studies, it has
caught some attention in the field of affective computing. In 2002, Partala & Surakka investigated
the potential of pupillometry as a tool for affective computing, using a modern (50 Hz) eye tracking
system. In the study, subjects were confronted with emotional sounds with different valence, for
example a baby laughing (positive), a baby crying (negative) or an office background sound
(neutral). By using auditory rather than visual stimuli, Partala & Surakka limited the impact of the
light reflex. The study revealed significantly larger pupillary responses to both positive and
Degree Project Report
Malin Jönsson Forne, 2012
30
negative stimuli, as compared to neutral stimuli. Once again it was concluded that pupillary
responses cannot be used to discriminate between different emotional valences; however, it does
vary with different levels of arousal (both positive and negative stimuli were arousing, while the
neutral stimuli were not).
In a similar study from 2008, Bradley et al. investigated the pupillary responses to
emotionally toned pictures from the International Affective Picture System (IAPS; Lang et al, 2005).
In this study, however, heart rate and skin conductance were measured concurrently, in order to
confirm the assumption that pupillary changes are mediated by sympathetic and parasympathetic
activation (if so, a co-variation between the different physiological measures would be observed).
The selection of stimuli from the IAPS consisted of an equal number of neutral, pleasant and
unpleasant pictures, making up a total of 96 pictures. The mean luminosity levels of the pictures
were adapted (using Adobe Photoshop), so that the mean luminosity was the same for each of the
three picture sets. Once again, the study showed that pupillary responses were larger when viewing
emotionally arousing pictures, regardless of whether they were pleasant or unpleasant. This
pattern was closely paralleled by the skin conductance response. For heart rate, however, a
different response pattern was found, in which pleasant and neutral pictures prompted very similar
responses, while unpleasant pictures prompted a significantly larger cardiac deceleration
(parasympathetic activation, see 3.3.1). The authors concluded that, taken together, the data
provided strong support for the hypothesis that pupillary responses to affective stimuli are
associated with an increase in sympathetic activity (Bradley et al., 2008).
More recently, a number of studies in affective computing have investigated the usefulness
of pupillometry as an indication of different emotional states. Barreto, Gao and colleagues
measured pupil size together with other physiological signals, in order to compare how well the
different measures could distinguish between different stress levels (Barreto et al., 2007, Gao et al.,
2010). The stimulus used for stress elicitation was the same in both studies: a classical Stroop Color-
Word Test, in which users are asked to identify color-words on a screen (e.g. “red”), without being
distracted by the actual color they are written in.
In the first study by Barreto et al. (2007), four physiological measures were used: Pupil
Diameter (PD), Galvanic Skin Response (GSR), Blood Volume Pulse (BVP) and Skin Temperature
(ST). After the data had been collected, each measure was normalized in order to eliminate
individual differences, using values obtained in an introductory phase as a baseline. Then, the
different signals were evaluated in terms of their ability to discriminate between high and low-
stress segments of the interaction. In the case of pupil diameter, the average value of PD over each
segment was used. The results showed significantly more discriminating potential for PD than for
the other measures, while ST showed particularly limited potential (Barreto et al., 2007).
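The baseline-normalization step described above can be sketched as follows. This is only an illustration of the general idea, not Barreto et al.'s exact procedure; the function name and all sample values are invented.

```python
import numpy as np

def normalize_to_baseline(signal, baseline):
    """Express a physiological signal as relative deviation from its
    resting baseline, so values from different people become comparable."""
    b = float(np.mean(baseline))
    return (np.asarray(signal) - b) / b

# Hypothetical pupil diameter samples (mm) for one participant:
baseline = [3.0, 3.1, 2.9, 3.0]        # introductory (relaxed) phase
stress_segment = [3.6, 3.7, 3.5, 3.8]  # high-stress interaction segment

normalized = normalize_to_baseline(stress_segment, baseline)
# The segment mean, expressed relative to baseline, can then be compared
# across participants and across the other physiological measures.
print(normalized.mean())
```

In Barreto et al.'s study, segment averages of normalized signals such as this one were what entered the high-stress versus low-stress comparison.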
In the following study by Gao et al. (2010), PD was compared to GSR and BVP in a similar
experimental set-up. In addition to the Stroop Test, however, occasional flashes of light were added
as a stimulus, in order to see whether the light reflex could be cancelled out. The first step of the
signal processing was to remove interruptions in the PD signal due to blinking. The signal was
passed through a low-pass filter and interruptions were compensated by linear interpolation. Then,
a so-called adaptive interference canceller (AIC) was used to divide the obtained pupillary response
into one signal of interest (changes caused by affective responses) and one interference signal
(changes caused by the light reflex). The GSR and BVP signals were also processed before they were
used for affective assessment. Once again, the goal was to discriminate between stressed and
relaxed states, and once again, the PD signal gave significantly better results as compared to the
other measures (77.78% accuracy compared to 54.44% for the best alternative). Moreover, when
GSR and BVP were combined with PD, the accuracy actually decreased slightly (to 76.67%). Note
that these results were obtained in spite of the temporary illumination increases. The authors
concluded that pupil diameter may be one of the most important signals to involve in affective
recognition (Gao et al., 2010).
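The first pre-processing steps reported by Gao et al. (filling blink gaps by linear interpolation, then low-pass filtering) can be sketched roughly as below. The threshold and the simple moving-average filter are my own illustrative choices, not theirs, and the adaptive interference canceller is omitted.

```python
import numpy as np

def preprocess_pd(pd_signal, valid_mask, kernel=5):
    """Fill blink gaps by linear interpolation, then smooth with a
    moving-average filter as a crude low-pass stage."""
    pd_signal = np.asarray(pd_signal, dtype=float)
    x = np.arange(len(pd_signal))
    # Interpolate across samples where the tracker lost the pupil (blinks)
    filled = np.interp(x, x[valid_mask], pd_signal[valid_mask])
    window = np.ones(kernel) / kernel
    return np.convolve(filled, window, mode="same")

pd = np.array([3.0, 3.1, 0.0, 0.0, 3.2, 3.3, 3.4])  # zeros = blink samples
valid = pd > 0.5
clean = preprocess_pd(pd, valid)
```

A real pipeline would also need to handle gaps at the very start or end of a recording, which `np.interp` simply clamps to the nearest valid value.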
4.3.2 Cognitive Pupillometry
The relation between pupil size and mental effort has been extensively studied in
psychophysiology, and is referred to as cognitive pupillometry (Beatty & Lucero-Wagoner 2000).
One of the earliest studies of the so-called task-evoked pupillary response (TEPR) was performed by
Hess and Polt in 1964. They studied pupillary responses while subjects performed mental
arithmetic, which gradually increased in complexity (e.g. 7*8, 13*14, 16*23). Pupil size was
measured using a camera, photographing the right eye with a sample rate of two frames per second
(i.e. 2 Hz). When averaged over all five test subjects, the results showed a gradual increase in pupil
diameter as more complex calculations were performed. A few years later, Kahneman (1966)
conducted a similar study, in which subjects were asked to remember strings of digits. Again, it was
found that pupil diameter (of the right eye) increased with task difficulty, that is, as the number of
digits in the string increased. Kahneman concluded that pupil size could be used as a measure of
memory load, or the amount of material in active processing. Subsequent studies have provided
repeated evidence for the ability of the pupil to reflect task difficulty, regardless of the nature of the
task. Typically, the pupil dilated within a second after a task was presented, returning to baseline
immediately after the answer had been given (Goldwater 1972).
While most early studies used the average pupil diameter (or MPD) as a measure of
cognitive activity, more recent work has applied complex data processing to extract the relevant
information from pupillary responses. The studies by Gao, Barreto et al. described in the previous
section are one example of more complex procedures, although their focus was on affect recognition.
For cognitive pupillometry, a data processing module called the index of cognitive activity (ICA) has
been used in a large number of studies over the last decade or so (Palinko et al. 2010). Instead of
using average pupil diameter, ICA measures the number of abrupt discontinuities per second in the
PD signal (for a more detailed description of the procedure, please refer to the next section). The
index was developed by Sandra P. Marshall, and was considered original enough to be granted a
patent in 2000 (U.S. Pat. No. 6,102,870).
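Marshall's actual index relies on wavelet decomposition and is patented, so the following is only a rough illustration of the underlying idea of counting abrupt discontinuities per second, with an arbitrary fixed threshold standing in for the wavelet step.

```python
import numpy as np

def discontinuities_per_second(pd_signal, sample_rate, threshold):
    """Count abrupt sample-to-sample jumps in a pupil-diameter signal.
    A crude stand-in for the ICA idea; the real index uses wavelet
    analysis rather than a fixed threshold."""
    jumps = np.abs(np.diff(pd_signal)) > threshold
    duration = len(pd_signal) / sample_rate
    return jumps.sum() / duration

# Hypothetical 1-second recording at 10 Hz with two abrupt dilations
pd = np.array([3.0, 3.0, 3.2, 3.2, 3.2, 3.2, 3.45, 3.45, 3.45, 3.45])
rate = discontinuities_per_second(pd, sample_rate=10, threshold=0.15)
print(rate)  # 2.0 discontinuities per second
```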
The effectiveness of the ICA for detecting variations in cognitive load has been verified in a
number of studies. For example, Marshall, Pleydell-Pearce & Dickson (2002) demonstrated that the
ICA increased with task difficulty for a simple interactive task. In addition, they found that the index
could be used to detect strategy shifts (which are usually associated with a change in cognitive
load). Perhaps more importantly, however, the results generally corresponded to those found in
EEG studies of the same task. The authors concluded that pupil size measurement, being cheaper
and more portable than EEG, could potentially be used as a precursor to EEG studies, or to validate
findings obtained with EEG in the field.
As we have seen, there is extensive evidence for the correlation between pupil size and cognitive
workload for simple cognitive tasks. But how can this knowledge be implemented and benefited
from in the context of HCI? Iqbal et al. (2004, 2005) focused on just that in a series of studies
investigating how pupillometry might be used to manage user attention in HCI. As previously
mentioned (see 2.2.1), empirical evidence suggests that interruptions are less disruptive when they
occur during a period of low mental workload, rather than when the user is actively engaged in a
task, which means that efficient timing of system notifications could have a positive effect on user
performance (Bailey et al., 2006). In two consecutive studies (2004 & 2005), Iqbal et al. used a
head-mounted eye tracker (EyeLink II) to measure pupillary movements while users performed a
number of cognitive tasks. The first study (Iqbal et al., 2004) involved four task categories, each of
which had two levels of difficulty (i.e. easy/difficult): reading comprehension, mathematical
reasoning, visual search and sorting emails (using drag and drop). The baseline pupil diameter was
obtained before the first task, while subjects fixated on a blank screen for 10 seconds. In addition to
pupillometric data, subjective ratings of difficulty and completion time were collected, in order to
validate the workload reflected in the PD response. Once the data collection had been performed,
the percentage change in pupil size (PCPS) was computed for each user, using the following formula
(cf. 3.2):
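The formula referred to is missing from this copy. Judging from the name of the measure and from how it is commonly defined, it is presumably the relative deviation from the baseline diameter:

```latex
\mathrm{PCPS} = \frac{PD_{\mathrm{measured}} - PD_{\mathrm{baseline}}}{PD_{\mathrm{baseline}}}
```

Averaging this quantity over a task (or subtask) then yields the APCPS used in the analysis.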
In order to compare the mental workload of the different tasks, the average PCPS was computed
over each task (8 in total). If the method was successful, the difficult version of each task would
render a higher average PCPS (APCPS) than the easier version. In the first analysis, however, only
the search task rendered a statistically significant PD difference between high and low mental
workload. The authors attributed this result to the hierarchical nature of the other three task
categories. For example, the email sorting task does involve a cognitive component, but it also
includes a motor component (i.e. the dragging and dropping of emails). This structure means that
the same level of mental workload will not be sustained over the entire period of task execution,
which might explain the unpredicted results. In a second analysis, therefore, the tasks were
decomposed into several subtasks. This time, a good correlation between pupil size and cognitive
load was observed.
In a second study, Iqbal et al. (2005) built on these results to further explore the workload
changes involved in interactive tasks. This time, two different tasks were used: route planning and
text editing. Both tasks involved carefully controlled subtasks, which were designed to be
representative of those involved in typical interactive tasks, e.g. selection, data entry, memory store
and recall, information processing, reasoning and motor movements. Again, APCPS was found to
vary in a predictable manner among subtasks, according to the level of cognitive workload imposed
by the task. Moreover, a significant decrease in APCPS was observed at task boundaries. The
authors suggested that an Index of Opportunity may be derived from the PD signal, indicating
moments where interruptions may occur at a lower cost (Iqbal et al., 2005).
As previously mentioned (see 4.2), recent studies within the field of HCI have often used
remote rather than head-mounted eye tracking for pupil size measurements. For example, Palinko
et al. (2010, 2011, 2012) performed a series of studies in which they used remote eye tracking in a
driving simulator. In the first study (Palinko et al., 2010), a new measure of cognitive load - the
mean pupil diameter change rate (MPDCR) - was introduced and evaluated. Subjects were
instructed to drive (primary task) while engaging in a word game with a front seat passenger
(secondary task). The PD signal was used to extract the mean pupil diameter change (MPDC), and
MPDCR was then calculated as the first difference of the MPDC curve. Both MPDC and MPDCR were
found to correspond well with driving performance and expected changes in cognitive load (based
on task difficulty). An advantage of these measures, according to the authors, is that they are both
rather insensitive to measurement artifacts (as compared to for example ICA), due to the averaging
process. The authors also concluded that MPDCR might be more useful than MPDC when it comes to
observing rapid changes in pupil size.
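The relationship between the two measures amounts to a simple numerical differentiation; as a minimal sketch (the MPDC samples are invented), the MPDCR is the first difference of the MPDC curve:

```python
import numpy as np

# Hypothetical mean pupil diameter change (MPDC) samples, in mm
mpdc = np.array([0.00, 0.05, 0.15, 0.20, 0.18])

# MPDCR: first difference of MPDC, i.e. the rate at which the
# averaged pupil diameter is changing between samples
mpdcr = np.diff(mpdc)
print(mpdcr)
```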
In the study described above (Palinko et al. 2010), the authors dealt with the influence of
the light reflex by confirming that the illumination of the screen did not vary by more than +/-
5% from the average illumination. Based on this, they made the assumption that the light reflex did not
significantly influence pupil diameter. However, more recent studies by Palinko & Kun (2011,
2012) have focused specifically on the interaction between cognitive load and luminance conditions
and the resulting effect on pupillary responses. These studies are described in the next section.
4.3.3 Dealing with the Light Reflex
Whether we use pupil size to investigate the emotional responses to different interface designs or
the cognitive load imposed by an interactive task, the pupillary light reflex must always be taken
into consideration. This is one of the greatest challenges involved in pupillometric research, and
different attempts have been made to separate the light-induced variations in pupil size from
responses that relate to mental events. One such approach is the index of cognitive activity (Marshall,
2000), which measures the number of abrupt discontinuities in the PD signal over each second of a
trial. In order to separate the light-induced discontinuities from those that are cognitively driven,
the ICA makes use of the somewhat different signal patterns associated with the light reflex on the
one hand, and the dilation reflex on the other (see 3.1). These two
components are decomposed from the original signal by means of wavelet analysis (using the
MatLab Wavelet Toolbox). Naturally, the ICA procedure also includes blink-removal and de-noising
of the signal (ibid).
In a paper from 2002, Marshall describes a simple validation study, in which the claimed
light reflex separation is put to the test. In the study, the obtained ICA for four different conditions
are compared: light plus cognitive effort, light plus no cognitive effort, dark plus cognitive effort and
dark plus no cognitive effort. The results demonstrate that ICA does indeed vary with different
levels of mental workload, but is rather insensitive to changes in illumination (Marshall, 2002).
A somewhat similar approach to separating the different sources of pupillary movements
was investigated by Jainta & Baccino (2010). They used principal component analysis (PCA) to
identify a set of three independent components in the PD signal, and found that only one of them
varied in response to shifts in cognitive demand. The authors concluded that even though further
research is required, there might be a traceable component which uniquely reflects the effort a
subject mobilizes to perform a task (Jainta & Baccino, 2010).
Pomplun & Sunkara (2003) suggested yet another approach to light reflex elimination.
They designed a simple interactive computer game, in which different geometric shapes appeared
on the screen. When a blue circle appeared, users were supposed to fixate it with their eyes while
pressing a button, in order to make the circle disappear from the screen. If they did not manage to
do so before a certain time had elapsed, the blue circle would explode. The task had three levels of
difficulty (easy, medium, and hard), which were obtained by varying the speed at which new items
appeared on the screen. Each level of difficulty was also combined with two different levels of
illumination (i.e. black or white background), resulting in a total of six different conditions. The
results confirmed that both the illumination conditions and the level of difficulty had a significant
effect on pupil size. Data analysis also revealed that there was no interaction between the two
factors. Based on these results, the authors suggested a possible solution to the light reflex-
problem, consisting of an additional pre-trial calibration, in which display brightness would be
varied in a systematic manner. Thereby, it would be possible to determine the participant’s pupil
size as a function of display brightness. The amount of pupil dilation induced by cognitive workload
could then be computed by subtracting the calibration value for the current display brightness from
the current pupil size (Pomplun & Sunkara, 2003).
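Pomplun & Sunkara's proposed correction could be sketched as follows; the calibration points and measurement values are invented for illustration.

```python
import numpy as np

# Pre-trial calibration: baseline pupil size (mm) recorded at a few
# systematically varied display-brightness levels (hypothetical values)
cal_brightness = np.array([0.0, 50.0, 100.0])   # percent of max brightness
cal_pupil = np.array([5.0, 4.0, 3.2])           # pupil shrinks with brightness

def cognitive_dilation(current_pupil, current_brightness):
    """Estimate the light-driven pupil size from the calibration curve
    and subtract it, leaving the workload-driven dilation."""
    expected = np.interp(current_brightness, cal_brightness, cal_pupil)
    return current_pupil - expected

# A 4.3 mm pupil at 50% brightness: roughly 0.3 mm attributable to workload
print(cognitive_dilation(4.3, 50.0))
```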
Palinko and Kun built on this idea in two recent studies (2011, 2012), in which they further
investigate the interaction between illumination and cognitive load as determinants of pupil size.
In contrast to Pomplun and Sunkara, Palinko and Kun did not only calculate the average response
over each task condition, but performed a more detailed analysis of the momentary pupillary
response. In the most recent study (Palinko & Kun, 2012), subjects were asked to perform three
different tasks:
In the Illumination Task (IT), a static image of three trucks was presented to the user. One
truck was almost black (10% of maximum brightness), the second was medium gray (50%
brightness) and the third was nearly white (90% brightness). Test subjects were instructed to
fixate on a target (two zeros), which moved from one truck to another every 9 seconds.
In the Visual Vigilance Task (VVT), subjects watched a sequence of numbers counting
upwards, with the instruction that every 6th number could be out of order. If so, the
participants were instructed to press a button, in order to indicate that they had detected
the faulty number.
In the Combination Task (CT), participants performed the two other tasks simultaneously.
The two zeros used as fixation target in the IT were now replaced by the sequence of
numbers used in the VVT.
The goal of the data analysis was to separate the pupillary response derived from each component
of the combination task by subtracting the responses obtained in the other two tasks. The first step
was thus to analyze the IT and VVT separately. The IT was analyzed by calculating the PD response
for each instance where a subject moved their point of gaze from one truck to a brighter truck (black
to white, black to gray, or gray to white). The results were averaged over each participant and each
such transition. Next, the VVT was analyzed by calculating the average pupil size for the different
positions in the number sequence (1-6). As expected, a significantly larger pupil size was obtained
for every 6th number, where subjects had to decide whether the number was out of order. Now, the
averaged pupil diameter during the IT could be subtracted from the responses obtained during the
combination task. The result was a curve that was very similar to that obtained in the VVT. In other
words, the light-induced changes in pupil size were successfully extracted from the CT-signal, so
that only the cognitively driven variations remained. While these results are encouraging, the
authors observe that the tasks used in the experiment are highly simplified, and that more research
is needed if a similar method is to be applied to more complex tasks (Palinko & Kun, 2012).
4.4 Pilot Study
In addition to the literature review, a simple pilot study was carried out, in which the pupillary
responses of two test subjects were recorded while they were exposed to simple cognitive and
emotional stimuli. The main purposes of the study were:
1. To practically investigate how pupil size measurements may be incorporated in a simple
eye tracking study.
2. To investigate whether the pupillary response to cognitive or emotional stimuli may be
studied without extensive technical skills, data processing, or time consumption.
3. To gain some practical experience of pupillometric research, in order to better understand
the challenges involved in data collection and analysis, and thereby improve the quality of
the discussion provided in this report.
4.4.1 Participants
Three test subjects took part in the experiment: one male (26 years old) and two females (23 and 28
years old). There are several reasons why the number of participants was so small. First, the
degree project was limited in terms of time, and a larger number of participants would have meant
more time spent on data analysis. Second, the goal of the pilot study was merely to test the
potential of the technology, not to render statistically significant data. It could be noted here that
similar constraints are not uncommon in usability testing, where the degree of confidence in the
results acquired must usually be balanced against limitations in terms of time and financial
resources (Rubin & Chisnell, 2008).
4.4.2 Equipment and Procedure
The pilot study was performed at Tobii Technology's headquarters in Danderyd, Sweden. The first
two test sessions were carried out on the same occasion, while the third test subject performed the
test on a later occasion. This allowed for a few minor tweaks in the study design between the two
occasions. These changes will be further discussed in the following sections.
Gaze data was collected with the Tobii TX300 Eye Tracker (see figure 4.1), which is currently
one of Tobii’s most advanced (remote) trackers. The TX300 has a sampling rate of 300 Hz (300
samples per second) and can compensate for head movements that occur within a box of 37*17 cm
(at 65 cm from the screen). Pupil diameter is calculated for each eye separately, with algorithms
that compensate for differences in tracking distance and gaze angle, as well as for distortions
caused by the spherical shape of the eye. (Tobii Technology, 2012)
The test sessions were carried out in a usability and market research studio at Tobii
Technology. The studio has no windows, which makes it easier to control the illumination
conditions, and the electric light was kept at the same level throughout the test sessions. While
performing the tasks, the participants were seated at a desk in front of the eye tracker (see figure
4.1). Before the test could commence, a calibration procedure was performed, to make sure that the
tracker could identify the subjects’ eyes. Thereafter, the participants were guided through the test
by text instructions appearing on the screen. The participants had access to a keyboard, which was
used to trigger new instructions. The sessions were controlled and monitored by a person (myself)
sitting outside the visual field of the test subject, in order to avoid distractions caused by
movements in the periphery of the visual field. Because all instructions appeared on the screen,
interaction between user and test moderator (which might have caused the user to turn away from
the screen) was not necessary once the eye tracking session had commenced.
The study was designed in Tobii Studio, a software tool dedicated to the design, recording
and analysis of eye tracking data. Cognitive and affective stimuli (see the following sections) were
presented as video-clips, in order to ensure correct timing of the events.
Figure 4.1: Test Set-Up and Equipment
A. The test moderator (on the left) was seated behind the test subject (in the middle). B. Tobii TX300 Eye Tracker.
4.4.3 Cognitive Tasks
The study consisted of two parts, one in which subjects performed simple math problems
(cognitive task), and one in which they were confronted with emotionally toned pictures (affective
stimuli). The cognitive task consisted of four math problems with two levels of difficulty (see figure
4.2 below), which were presented in the following order: easy 1, difficult 1, easy 2, and difficult 2.
The math problems used were adapted from cognitive study material from a workshop at the
EyeTrackConf conference in Uppsala, Sweden, in 2010. The subjects were given 10 seconds to solve
each problem, after which the next problem appeared automatically. Once (and if) they managed to
come up with a solution, the subjects were instructed to say it out loud. The performance data
could thereby be used to verify the assumed variations in difficulty between the different sub-tasks.
Figure 4.2: Cognitive Stimuli
The difficult math problem (to the right) should evoke a higher level of cognitive load than the easier one (to the left).
The visual characteristics of the stimuli were carefully controlled in order to avoid the occurrence
of light-induced changes in pupil size during the task. Therefore, the math problems were given the
same background color, and the numbers were placed in a similar way for all four problems. A
baseline stimulus was also created, in which the numbers in the picture were replaced by X’s. The
baseline picture was presented to the user before the real task began (no other stimulus was
presented in-between).
However, there was a twist to the carefully controlled luminance levels. The last difficult
task (Difficult 2) was deliberately given a slightly brighter gray background color (luminance 81
instead of 85 in the CIELab color space). The difference in luminance level between the stimuli can
be observed in figure 4.2, where the difficult stimulus has a slightly brighter background. My
hypothesis was that the light-induced pupil constriction caused by the change in background color
would cancel out the expected dilation due to the increased difficulty of the task (see 4.1.2).
4.4.4 Affective Stimuli
In the second part of the study, subjects were presented with four emotionally toned pictures. The
pictures were selected from the Geneva Affective PicturE Database (GAPED; Dan-Glauser & Scherer,
2011), which is available online at www.affective-sciences.org/researchmaterial. Each picture in
the database is assigned indexes for valence and arousal, on a scale from 1 to 100, which are
based on the subjective rating of sixty subjects (ibid.). Four pictures were chosen from the library
(see figure 4.3 below), based on their specified valence and arousal indexes. The chromatic
characteristics of the pictures were also taken into account, because their colors had to be matched
in order to obtain the same overall luminance. This was done using the Match Color Tool in Adobe
Photoshop.
Figure 4.3: Affective Stimuli
The pictures used as affective stimuli had the following affective characteristics, as specified in the GAPED:
1. Positive valence (92.1), low arousal (27.5).
2. Negative valence (15.6), high arousal (66.3).
3. Neutral valence (51.3), low arousal (26.2).
4. Positive valence (91.3), high arousal (57.6).
4.4.5 Results and Analysis
The first step of the analysis was to verify that the stimuli used had in fact evoked the cognitive or
emotional responses intended. For the cognitive task this was done by looking at user performance
for the different tasks. As seen in table 4.1, it was clear that the difficult tasks did cause more
trouble than the easier ones. In fact, none of the participants managed to solve any of the difficult
problems in the time given.
Table 4.1: Task Performance

Task                   Correct answers reported
Easy 1                 3 of 3
Difficult 1            0 of 3
Easy 2                 2 of 3
Difficult 2 (Bright)   0 of 3

The responses to the affective stimuli were verified by looking at the valence and arousal values reported
by the three participants. As we can see in table 4.2 below, the affective characteristics reported by
the users were rather consistent with the intended experience of the stimuli. For example, the
second picture (the wounded horse) did yield a low valence rating (1.3/5 on average), but a high
rating for arousal (4.3/5). It may be noted, however, that pictures number 1 and 4, which were
supposed to have different arousal characteristics, received very similar ratings by the subjects (3.3
vs. 3.7 on the arousal scale). Partly, this result may be attributed to the fact that the arousal indexes
assigned to the pictures in GAPED were not that different to begin with (27.5 vs. 57.6). However, it
could also have to do with the way in which the question was posed.

Table 4.2: Affective Ratings

Stimuli              User 1     User 2     User 3     Average
1. Positive          V:5  A:3   V:5  A:3   V:5  A:4   V:5.0  A:3.3
2. Negative Arousal  V:2  A:4   V:1  A:4   V:1  A:5   V:1.3  A:4.3
3. Neutral           V:3  A:1   V:3  A:1   V:3  A:2   V:3.0  A:1.3
4. Positive Arousal  V:4  A:3   V:5  A:4   V:5  A:4   V:4.7  A:3.7
(V = Valence, A = Arousal)

Before the pupillometric data could be analyzed, some pre-processing of the results was required.
First, instances of lost tracking were removed. The first two test sessions had rather few instances
of lost tracking, with over 90% successful tracking for both individuals (which means that the
tracker could identify the eyes >90% of the time). However, there was an overrepresentation
lost data in the cognitive segment of the recording. The reason for this might have been that I
experienced a problem with the screen settings at the first test occasion, which caused the numbers
to take up a larger proportion of the screen than intended, even reaching the edges of the screen.
This may have caused instances of lost tracking, since tracking accuracy decreases as the point of
gaze moves closer to the edges of the screen.
Naturally, the screen settings were fixed before the second test occasion (user 3).
Nevertheless, a rather poor overall quality of data was obtained in this session, with only 58%
successful tracking. This was probably due to the fact that the subject wore eye make-up at the
occasion, which is known to make it harder for the tracker to identify the pupils (Tobii Technology,
2010). After instances of lost data had been eliminated, the average size of the right and left pupils
was calculated for each subject and each data point. This average pupil size was then used for the
analysis reported in the following. I did, however, make a quick comparison of the average pupil
size obtained for each eye and each segment of the test respectively (affective stimuli/cognitive
tasks), but found no systematic correlation between the larger pupil (right/left) and the type of task
(cognitive/affective).
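The pre-processing described above (discarding samples with lost tracking, then averaging the two pupils) can be sketched as below; the sample values are invented, with -1 standing in for lost tracking.

```python
import numpy as np

def average_pupil(left, right, lost_value=-1.0):
    """Drop samples where either eye was lost, then average the
    left and right pupil diameters for each remaining sample."""
    left, right = np.asarray(left), np.asarray(right)
    valid = (left != lost_value) & (right != lost_value)
    return (left[valid] + right[valid]) / 2.0

left = [3.2, 3.3, -1.0, 3.4]   # mm; -1 marks lost tracking
right = [3.0, 3.1, 3.0, -1.0]
print(average_pupil(left, right))
```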
Cognitive Tasks
The next step of the data analysis was to calculate the mean pupil diameter (MPD) for each task in
the cognitive section of the test. When analyzed separately, the first two sessions resulted in very
similar trends for the MPDs. Figure 4.4 below shows the averaged results for the first two test
subjects (user 1 & 2).
Figure 4.4: MPD for Mental Arithmetics (user 1 & 2)
The first conclusion that can be drawn from the figure above is that there was a clear difference in
MPD between the baseline period, during which no cognitive task was performed, and the mental
calculation periods, during which the subjects had to mobilize some cognitive effort. In other words,
we may conclude that the pupil did respond to differences in cognitive load (and/or increased
stress due to the time pressure), at least to some degree. We can also conclude that the increased
background illumination of the last stimulus does seem to have counteracted the effect of the
increased cognitive demand, at least partially.
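The task-wise averaging behind these MPD values can be sketched as a simple grouping step; the (task, diameter) record layout and the task labels are assumptions, since the report does not state how segments were annotated in the export.

```python
# Minimal sketch of the MPD computation: group pupil samples by task
# segment and average each group. Labels like "baseline" are assumptions.
def mpd_per_task(labelled_samples):
    """Mean pupil diameter (mm) per task, from (task, diameter) pairs."""
    groups = {}
    for task, diameter in labelled_samples:
        groups.setdefault(task, []).append(diameter)
    return {task: sum(d) / len(d) for task, d in groups.items()}
```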
Nevertheless, one aspect of the results presented in figure 4.4 is not in line with my initial
assumptions: pupil size did not vary systematically with task difficulty. Instead, MPD increased for
each task, regardless of the level of difficulty (with the exception of the last, brighter stimulus). One
explanation for this trend may lie in the fact that subjects were not given the chance to relax
between the cognitive tasks. Even when they did manage to solve the tasks in the time given, the
answer was reported just before the next stimulus appeared, giving them no time to prepare for the
new stimulus. Therefore, it is no surprise if the stress levels of the participants (and therefore their
pupil size) increased continually over the course of the tasks. In order to validate this hypothesis,
the cognitive task procedure was changed slightly before the third test session (user 3). This time,
the user was given 15 seconds to solve each math problem, and the baseline appeared for 5 seconds
between every task, to give the user some time to rest and prepare for the next problem. The
results are presented in figure 4.5 below.
Figure 4.5: MPD for Mental Arithmetics (user 3)
The diagram above indicates that the small changes made to the cognitive tasks did affect the
cognitive load experienced by the participant. This time, the first difficult task (which had the same
background luminance as the other stimuli) gave rise to the largest MPD. The results were thus in
line with my initial assumption that more difficult problems would result in a higher cognitive load,
and thereby a higher MPD. For the last difficult problem, the increase in luminance seems to have
“balanced out” the reflex dilation caused by the difficult task, so that the result was a MPD that was
equal to the baseline. This was also in line with my expectations. On the other hand, the fact that the
MPDs obtained for the two easy problems were actually slightly lower than the baseline was rather
surprising. Part of the explanation may lie in the fact that the subject reported the correct answers
to these problems several seconds before the end of the task, which means that her cognitive load
should have been low during the last seconds of the stimuli presentation (cf. Kahneman, 1966). This
would of course have affected the average obtained over the whole course of the stimuli
presentation. This possible explanation may be verified by looking at figure 4.6 on the next page,
which shows how the pupil diameter changed over time during the cognitive tasks. The curve is
smoothed with a moving average function, which means that blink artifacts and presumed
measurement errors have been smoothed out. The bars at the bottom of the chart indicate the
different task phases. Fortunately, this segment of the data for participant three contained
relatively few instances of lost tracking.
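The moving-average smoothing used for these trend curves can be sketched as a centred sliding window; the window length here is an arbitrary assumption, as the report does not specify one.

```python
# Centred moving average for a pupil-diameter trace; windows are shortened
# at the edges so the output has the same length as the input.
def moving_average(trace, window=5):
    half = window // 2
    smoothed = []
    for i in range(len(trace)):
        lo, hi = max(0, i - half), min(len(trace), i + half + 1)
        smoothed.append(sum(trace[lo:hi]) / (hi - lo))
    return smoothed
```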
Figure 4.6: Trend for Cognitive Tasks (user 3, x = time, y = mm)
As we can see in figure 4.6, the pupillary responses varied in a rather predictable manner during
the cognitive tasks, with dips at each of the no-task phases (baseline or rest), and abrupt dilations
during the task phases. As expected, all of the task segments resulted in higher peak dilations than
the baseline, and at least for the first easy task, it is clear that the pupils constricted after the
response had been reported (after about half the time given), which explains why the average pupil
size obtained was so low.
Affective Stimuli
Figure 4.7 below presents MPDs obtained for each of the affective stimuli. Once again, the results
are averaged over the first two test sessions, which gave very similar results. In order to facilitate
comparison with the cognitive results, I have used the same scale for the two diagrams. Thus, we
may easily observe that the cognitive tasks gave rise to greater changes in MPD than did the
affective stimuli. This is not too surprising, since the cognitive tasks demanded a higher degree of
user engagement, as compared to the more passive nature of the picture-viewing.
Figure 4.7: MPD for Affective Pictures (user 1 & 2, y = mm)
The diagram in figure 4.7 reveals some rather surprising results. First of all, the neutral, low arousal
image resulted in a rather high MPD, second only to the last, positive arousal picture. However, the
most surprising result is the fact that the second picture, which was rated as the most arousing by
the test subjects (4/5), resulted in the lowest MPD. However, there is a logical explanation. The
results indicate that the reflex dilation might have been counteracted by a light response, and this
is actually the case. During the data analysis, I went back to examine the affective pictures again,
and realized that I had used the wrong version of the second picture in the test session. The version
used did indeed have a higher overall luminance than the other pictures, which explains the low MPD
obtained for the negative arousal stimuli. The mistake was corrected prior to the third test session,
but unfortunately, the affective section of that recording contained too large data losses for any
analysis to be based on it. Instead, the data obtained during the first test session was analyzed in
more detail. Figure 4.8 below shows how the PD of user 1 changed over the course of the affective
stimuli presentation (the curve is smoothed with a moving average function). The horizontal
lines in the figure indicate the points at which the stimuli changed from one picture to the next.
Figure 4.8: Trend for Affective Pictures (user 1, x = time, y = mm)
By looking at figure 4.8, we can make the following observations about the pupillary response to
the different pictures:
For the first affective stimulus (the baby), there was a rather steady increase in pupil
diameter, indicating an increased emotional arousal in the subject. The first sharp dilation
begins around 400 ms (0.4 s) after stimulus onset, which is in line with the latency reported
by Partala and Surakka (2003), who studied the pupillary response to affective sounds.
For the second, negative arousal stimulus (wounded horse), we can observe a constriction
of the pupils during the first two seconds of the stimulus presentation, which is probably an
effect of the light reflex. Again, there is a response latency of about 400 ms, this time
followed by a sharp constriction of the pupils, which is (again) in line with the light-reflex
latency reported in previous studies (e.g. Palinko et al., 2012). However, there seems to be no
obvious explanation for the rather sharp dilation and reconstriction that follow after the initial
light response.
As might be expected, the neutral stimulus (street sign) that followed after the brighter
second picture resulted in a redilation of the pupils. However, there is a sharp decrease in
pupil size at the end of the stimulus presentation, which has no obvious explanation.
The positive arousal stimulus starts off with a sharp dilation (without any latency) and
reconstriction which is hard to explain. However, it is followed by an increasing trend which
is indicative of increased arousal in the subject.
As we can see, some of the features of figure 4.8 are in line with what might be expected, while
others seem to have no obvious explanation. In the end, it is hard to draw any definite conclusions
based on the data obtained. One reason for this is that although the overall brightness of the
different pictures was matched (for at least three images), there were still luminance variations
within each picture. Thus, it is not unlikely that some of the variations in pupil size were evoked as
the subjects changed their point of regard (this theory might be verified by analyzing the gaze data
in relation to the pupil size, but that would be a time-consuming endeavor). But it is also important
to note that the test was based on a very simple stimulus-response understanding of human
emotion. In real life, people seldom react as we expect them to. Even if the users’ cognitive
assessments of the emotional values associated with different pictures were similar to the pre-
defined affective characteristics, it does not necessarily follow that their experience was.
4.4.6 Lessons Learned
Although there were a few flaws in the test procedure, and even though the quality of data was
partly poor, the pilot study did serve its purpose in pointing out some of the challenges involved in
practical pupillometry. Some challenges relate to eye tracking in general. It is a known fact that
some subjects are easier to track than others, and that factors such as wearing glasses or eye
make-up may lead to problems in the data collection. This difficulty was mainly experienced in the
third test session, where the test subject had not been instructed to avoid eye make-up; had she
been, the recording might have been more successful. A first take-away from the pilot study is
therefore that participants should be given some basic instructions before arriving at the test
facility, and that they should be asked whether or not they wear glasses. In a ‘sharp’ study, it may
also be advisable to over-recruit slightly, in order to compensate for trials that are unsuccessful.
When it comes to the specific case of pupillometric studies, it is clear that the software tool
used to design the study, Tobii Studio, is not (yet) adapted for the analysis of pupil data. More
common visualizations used in eye tracking, which focus on where subjects are looking during the
interaction (e.g. so-called heat-maps and gaze plots), can be generated automatically in Tobii Studio,
which gives access to rather effective data analysis. When it comes to pupil size, no automatic
processing is provided, which means that the extraction of relevant features must be done manually
for each participant (which is time-consuming even for a minor study like this one); unless some
script is developed for data processing. Either way, some data processing skills on the part of the
experimenter are required.
When it comes to dealing with the light reflex, the study clearly demonstrated that even
small variations in stimuli illumination will produce pupil constriction. In other words, a strict
control of visual stimuli is necessary if we want to draw conclusions about the user’s cognitive and
emotional processes based on pupil data, unless some measure is taken to eliminate the effect of
the light reflex. On the other hand, such strict control is hard to achieve without changing the nature
of the interactive experience we wish to evaluate. Therefore, the development of reliable,
automatic procedures for separating the different components of the pupillary response is a key
concern if pupil size is to become a truly applicable tool in usability testing.
5 Discussion and Analysis
In this chapter, I come back to the core research questions of the present study. In the first section, I
discuss what physiological measures may tell us about human emotion and cognition, and discuss
some of the important considerations involved in the interpretation of physiological data (RQ 1 &
2). Thereafter, I discuss the specific challenges involved in UX and usability testing, and how
physiological measurement may be incorporated in such contexts. In the third and last section, I try
to define what would make up a truly valuable physiological measurement method for UX and
usability testing, and discuss how well these criteria are met by the different measures investigated
in this study (RQ 3).
5.1 Interpreting Physiological Data
In the present report, I have referred to a large number of studies investigating the link between
physiological measures and human mental processes. All in all, the research conducted in this area
provides extensive evidence that both cognitive and emotional processing is associated with
measurable physiological changes in the human body, affecting parameters such as heart rate, heart
rate variability, skin conductance, electrical brain activity and pupil size. The problem, however, is
that physiological measures do not only capture changes that are related to human cognition and
emotion, but may in fact be influenced by a large number of variables, including body posture,
hormonal levels and environmental aspects (such as room temperature, electrical equipment and
luminance conditions). As noted earlier in this report, great care must thus be taken in the analysis
and interpretation of physiological signals. Park (2009) suggests that before data is collected, all
factors that may result in unwanted interference with the results should be eliminated, and that
after data has been collected, researchers should go back and reconsider if there is any room for
alternative interpretations (ibid.).
Most of the studies reviewed in the present work were performed in controlled laboratory
settings. This approach is certainly a good way to ensure a high quality of data; on the other hand, it
may raise the question of external validity of the results. As pointed out by Picard (2010),
conclusions about the real world may be misinformed if based on the artificial or simulated. The
main reason for this is probably not the strict experimental control of laboratory environments;
rather, it has to do with what the test situation means to the user, and his or her motivation for
performing the tasks at hand. Clearly, the act of buying a journey online means something different
to the user in a real-life situation, where he or she is actually going to experience the journey after
buying it, as compared to a test situation, where the task is performed for the mere purpose of
evaluation. It should also be noted that the emotional and cognitive reactions observed in the
laboratory may not solely be related to the experimental stimuli, i.e. the task at hand, but could also
be evoked by the test situation as such. An example of this effect is the so-called “white-coat
hypertension” discussed in medical literature, which refers to the phenomenon of high blood
pressure demonstrated in the clinic, but not at home (Wilhelm & Grossman, 2010). It seems
reasonable to assume that a similar effect could appear in usability testing (and might have
appeared in the pilot study presented in this paper). For example, Ward & Marsden (2003)
observed that when the experimenter appeared and began asking questions (after the participants
had experienced a quiet “settling-in” period), all participants showed large increases in skin
conductance, indicating elevated levels of arousal.
However, provided that we actually manage to isolate the physiological responses related to
the target of study, and provided that these reactions are reasonably similar to those we might
expect in real life situations - what conclusions about cognitive or emotional processes may be
drawn from physiological data? When it comes to cognitive processes, most studies have focused on
the relationship between cognitive load and physiology. The most commonly used measures for
this purpose include HRV, EEG and pupil size, all of which (when validated against subjective and
performance-related measures) have been found to respond to changes in cognitive workload in a
predictable manner (e.g. Berntson et al., 1997; Antonenko et al., 2010; Beatty & Lucero-Wagoner,
2000).
When it comes to affective computing and emotion research in general, the goal of
physiological measurement has often been a more fine-grained classification of mental processes,
as compared to the one-dimensional scale of cognitive load. The two-dimensional valence-arousal
scale is a commonly used tool for this purpose. Although this model provides a highly simplified
view of human emotion, it is considered effective enough to distinguish between most emotional
categories used in everyday language (Mehrabian & Russell, 1974). When it comes to the arousal
dimension, there is little controversy that a large number of emotional states (such as joy,
frustration, fear, and surprise) are associated with activation of the sympathetic nervous system,
resulting in responses such as elevated heart rate, increased sweating and dilation of the pupils.
Determining the valence of emotion, on the other hand, seems more complicated, and the
usefulness of ANS responses for this purpose is still a topic of debate in emotion research (cf.
Kreibig, 2010). As previously mentioned, some studies (e.g. Rantanen et al., 2010) have suggested
that cardiovascular activity may be used to distinguish between pleasant and unpleasant emotions.
However, EEG is probably the most reliable source of information for this purpose, at least if
relatively sophisticated equipment is used.
However sophisticated the methodologies we come up with, it is important to keep in mind that
physiology alone can never tell us what a person is thinking or feeling at a given time. Also, bodily
reactions are only one part of what constitutes an emotion (see 2.3.1). Therefore, many researchers
(including Kecklund & Åkerstedt, 2004; Ward & Marsden, 2003) argue that physiological data
should always be interpreted in relation to other sources of information, such as knowledge of
context, interview data and the user’s subjective ratings of the experience. Indeed, such an
approach may prove highly valuable in usability testing. The greatest advantage of physiological
measures, as compared to subjective ratings or interviews, is perhaps the fact that they may be
recorded continuously while a user is engaging in an interactive task. In this way, physiological data
may be useful as a way to help users go back and remember what they experienced after a test has
been performed. Today, a method called Retrospective Think Aloud (RTA) is commonly applied in
usability testing, especially in eye tracking studies (Tobii Technology, 2009). In RTA, the interaction
is replayed to the user, while he or she is asked to comment on his or her thoughts, choices and actions. A
similar methodology could be applied for other physiological measures, provided that the data can
be visualized in a way that is accessible to the test participants. By combining physiological data
with the user’s own account of the interaction, we could perhaps get at least a little bit closer to
understanding the user experience.
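A replay-support step like the one suggested above could be as simple as flagging time points where a physiological signal rises well above its session baseline, so the moderator knows where to pause the playback during a retrospective session. The threshold rule below is a made-up illustration, not an established method.

```python
# Flag candidate moments for retrospective review: samples that exceed the
# session mean by more than `k` standard deviations. The rule is illustrative.
import statistics

def flag_episodes(signal, timestamps, k=2.0):
    """Return timestamps where the signal is unusually elevated."""
    mean = statistics.fmean(signal)
    sd = statistics.pstdev(signal)
    return [t for s, t in zip(signal, timestamps) if s > mean + k * sd]
```

A flagged timestamp would then be matched against the screen recording, and the user asked what he or she experienced at that moment.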
5.2 Challenges for UX and Usability Testing
As we have seen in this study, there is substantial evidence for the link between psychological
processes and physiological signals. However, most studies in the field of psychophysiology have
measured responses to simple stimuli, such as affective pictures or mental arithmetics, and data
collection has almost exclusively been performed in more or less controlled laboratory settings.
These criteria can seldom be met in usability testing; partly because of operational constraints, and
partly because strict control of the ‘stimulus’ cannot be achieved without changing the nature of the
interaction under study.
According to Duchowski (2007), there are at least three operational constraints associated with
system evaluation (such as usability testing). These are (ibid.):
Time
Money
Personnel
All of these constraints are highly relevant to the commercial environment in which most usability
testing takes place. As discussed earlier (see 2.1), the quality of results must often be balanced
against the time and money available for testing. As a result, “quick and dirty” assessment tools are
often chosen over more advanced research methods like the ones discussed in the present study
(cf. Madrigal & McClain, 2009). Physiological measures often require extensive data processing for
relevant information to be extracted, which may take more person-hours than can be justified by the
financial return. Therefore, sophisticated systems that are valuable for scientific purposes may not
necessarily be attractive in the commercial context in which most usability testing takes place.
Ultimately, it all comes down to return on investment: will the money you spend on measuring
equipment, training and data analysis generate enough profit or savings for the company to
motivate the expense? Hopefully, technological developments will continue to make more
advanced technology available at more affordable prices, preferably with a high degree of automatic
processing and visualization to facilitate data analysis. Ultimately, commercial interests will
probably be a determining factor for this development.
In order for new methodologies to be incorporated in every-day usability testing, it is not
enough that they are effective, accurate and affordable. Perhaps even more important is that it is
easy for the test leader to apply the technology in a practical test situation. Monitoring a test
without advanced measuring equipment may be complicated enough; therefore, it is no wonder if
usability practitioners are reluctant to add cumbersome measuring equipment,
electrodes that need to be correctly placed (and prevented from falling off during the test session)
or eye calibrations that may or may not be successful. Moreover, not all UX and usability
practitioners will possess the technical competence required to perform detailed data analysis.
Again, technical advancements that facilitate the practical measurement and analysis of
physiological measures are necessary for these techniques to be truly useful in usability testing.
When it comes to pupillometry, two factors seem particularly important as determinants of
its future in usability testing. First, if pupil size is to be of any practical help in the evaluation of
human-computer interaction, then there must be reliable procedures for light reflex elimination
available. A few promising approaches to this problem have emerged in this study. One is to use
spectral analysis to separate the different components of the pupillary response. One version of
this approach (the ICA; Marshall, 2000) is already commercially available from Eyetracking Inc.
(http://www.eyetracking.com), and similar attempts have been made by, for example, Janita &
Baccino (2010). Another approach to light reflex elimination would be to use some kind of
pre-trial calibration to determine which light-induced responses to expect during the interaction.
Those values would then be subtracted from the pupillary response obtained, resulting in a signal
that would reflect reflex dilation alone. However, this method has only been tested for
highly simplified tasks, and it is unclear whether it could be applied to the more complex tasks
involved in typical HCI. A third possible approach could be to utilize the point-of-gaze data provided
by the eye tracker, and combine that information with data concerning the luminance level of each
pixel in the screen, at each moment of the interaction. This would of course require very high
tracking precision, as well as extremely exact synchronization between different sources of
information.
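As a toy illustration of this third approach, the expected light response at the gazed pixel could be estimated from a per-pixel luminance map and subtracted from the measured diameter. Everything here, including the linear light-response model and its gain and baseline parameters, is an assumption for illustration only.

```python
# Toy gaze-contingent light-reflex correction: subtract a linear estimate
# of the light-driven pupil size at the gazed pixel from each sample.
# The linear model and its parameters are illustrative assumptions.
def light_corrected(pupil_mm, gaze_points, luminance_map,
                    gain=-0.5, baseline=3.5):
    """Return the residual (non-light) component of each pupil sample."""
    residuals = []
    for diameter, (x, y) in zip(pupil_mm, gaze_points):
        expected = baseline + gain * luminance_map[y][x]  # brighter -> smaller
        residuals.append(diameter - expected)
    return residuals
```

In practice, as noted above, this would demand very precise tracking and tight synchronisation between the gaze stream and the screen content.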
The second factor that seems particularly important for pupil size to be integrated in
usability testing relates to the complexity of analysis. Because of the operational constraints often
associated with usability testing, it seems unlikely that usability practitioners would make practical
use of pupillometry, unless data processing and visualization of the pupillary response becomes
less effortful and time consuming than it is today. Easy-to-use analysis tools would also make
pupillary responses more accessible, even to usability practitioners that are not so “good with
numbers”, or who lack the technical competence required to perform detailed analysis.
A recurring topic of discussion in this report has been the obtrusiveness of different
measurement methods. As mentioned earlier, usability testing is about observing representative
end users using a product to perform representative tasks, preferably in a context that is
representative of “real world” usage. A test situation that is too different from the typical use case
will have very limited value, because the results may bear little relation to how actual users will
experience the product in real-life settings. Almost all physiological measurement techniques
require sensors to be placed on the body. This may disrupt the user experience, at least to some
degree. Although measuring equipment is getting smaller and less cumbersome to handle, it may
still add to the already awkward situation of being monitored while interacting with a system. On
this point, remote eye tracking has an advantage over other available measurement techniques,
since no physical sensors are required. Built-in eye-tracking monitors may also look pretty much
like any other computer screen, which should add to the authenticity of the test situation. However,
as pointed out by Park (2009), eye tracking can also be perceived as artificial. For example, it
requires the user to sit in more or less the same position during the course of the test session,
which may create unnatural tension in the subject.
Another challenge, which relates to the previous one, is that there is often a trade-off
between high quality measurements and obtrusiveness. For example, fMRI provides excellent
spatial resolution of brain activity, but is (so far) unsuitable for real-world usability testing. EEG is
less obtrusive, but has low spatial resolution and high presence of noise. As discussed above,
unobtrusiveness is an important factor in usability, but there is no general rule to apply when
evaluating different alternatives; all decisions must be based on the particular goals of the study at
hand.
Today, digital interfaces are not only accessed through stationary computers, and
consequently usability testing is not only performed in front of traditional computer monitors. If
physiological data are to be integrated in such contexts, additional challenges need to be taken into
consideration. First of all, not all measuring equipment is suitable for ambulatory assessment,
either because the recording devices are not wireless, or because they are too cumbersome to carry
around. However, great progress has been made in this area over the last few years. In the case of
pupillometry, ambulatory assessment may be achieved with eye tracking glasses. These may not be
unobtrusive enough for users to ‘forget’ that they are taking part in a user study, but they are
certainly practical enough to allow for easy transport as well as free body movement during the
recording session. Similar progress has been made in the field of EEG, where wireless caps with
built-in electrodes are now available on the market. However, GSR and cardiovascular measures are
probably the most practical alternatives for ambulatory data collection, as the technology needed
for data collection may nowadays be incorporated into a simple wristband or (in the case of the
latter) a pair of modified headphones paired with a smartphone.
However, ambulatory usability testing adds yet another difficulty to the use of physiological
measurement, i.e. movement artifacts. When people are not restricted in terms of mobility, the level
of noise in all physiological measures tends to increase (Gunes & Pantic, 2010). This is no surprise,
given that body movements may be responsible for pupil dilation, elevated heart rate, increased
skin conductance and artifacts in the EEG, all of which must be considered a form of noise when the
study focus lies on cognitively and/or emotionally driven responses.
5.3 Evaluation of Measures
The literature review presented here shows that there is no ‘gold standard’ for physiological
measurement, but that all measures have their respective pros and cons. However, a few criteria
have emerged from the analysis, which seem particularly important for a physiological measure to
be both valuable and suitable for usability testing. In the following, these criteria are used to
evaluate and compare the measures investigated in this study, i.e. cardiovascular activity (HR &
HRV), skin conductance (SC), electroencephalography (EEG) and pupil diameter (PD).
Affordability
It is of course hard to say where to draw the line between affordable and too expensive, and there
are also large variations in price for the same measurement method, depending on the sophistication
of the equipment you choose to buy. However, skin conductance (SC) and heart rate (HR) seem to be the
most affordable alternatives in general, as the technical equipment required to obtain these
responses is rather simple.
Unobtrusiveness
A lot of progress is being made in this area. Today, there are small and easy-to-wear systems
available for both HR and SC monitoring, although less obtrusive body placement may be associated
with an increase in measurement noise. Eye tracking is also a rather unobtrusive technology, as
modern systems do not require any contact between subject and tracker. EEG is probably the most
intrusive alternative today, although progress is being made in this area as well.
Information Density
What I mean by this criterion is that a truly useful measure should provide as much valuable
information about the user state as possible. When it comes to cognitive assessment, HRV, EEG and
PD have all been found to be good measures of cognitive load, although I have not found any
conclusive evidence that any of these measures would be more useful than the others for this
purpose. However, PD is usually measured with eye-tracking, which gives access to a large amount
of additional information concerning users’ visual attention.
For affective assessment, all measures discussed in this study may be used to indicate
instances of elevated arousal. However, EEG must be considered the most informative measure in
this respect, since it can provide more detailed information about the parts of the brain that are
activated during different phases of the interaction.
Simplicity of Use
By this criterion, I mean that the ideal measure should be easy to implement in every day usability
testing. Thus, the equipment should be easy to set up and learn how to use, even for people with
modest technical skills. This criterion is of course hard to evaluate without practical experience of
the respective measurement methodologies, and there are probably considerable differences
between different systems and manufacturers. However, from my experience with eye tracking, I
can conclude that at least for this particular system, around half an hour was enough to figure out
how to set up the system. That being said, I did experience some bumps in the road, such as my
trouble with the screen settings. However, the greater challenge when it comes to eye tracking is
probably to be aware of and learn to deal with the different factors that may limit the trackability of
a subject.
Simplicity of Analysis
This criterion has to do with how easy it is to extract relevant information from the data obtained.
Again, it is hard to make a statement without practical experience of the different measurement
methods, but it seems that both HR and SC would be easier to analyze than EEG, which requires
rather extensive knowledge about the workings of the brain. When it comes to pupillometry, the
difficulties involved in the analysis have already been discussed in the previous section of this
chapter.
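The relative simplicity of cardiovascular analysis can be illustrated with a short sketch: RMSSD, a standard time-domain HRV metric, needs nothing more than the successive differences between inter-beat (RR) intervals. The RR values below are hypothetical, not data from the study.

```python
import math

def rmssd(rr_intervals_ms):
    """Root mean square of successive differences between RR intervals (ms).
    A standard time-domain HRV metric; higher values are commonly taken to
    reflect greater parasympathetic (vagal) influence on heart rate."""
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Hypothetical RR intervals (milliseconds) from a resting subject
rr = [812, 798, 825, 840, 810, 795, 830]
print(round(rmssd(rr), 1))  # → 24.2
```

Nothing comparably compact exists for EEG, where even a basic workload index requires filtering, artifact rejection and spectral analysis of multiple channels.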
Robustness
If physiological measurements are to be of any practical help in usability testing, they need to
tolerate collection under relatively loosely controlled conditions. Unfortunately, all measures
discussed in this report are affected by unwanted artifacts from factors such as physical activity,
temperature and luminance conditions. This is an important challenge for all contexts where
physiological measures are used, and an area where further development is needed, in order to
better separate different sources of influence on our physiology.
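One common, if only partial, remedy in pupillometry is subtractive baseline correction: expressing task-evoked dilation relative to a pre-stimulus window. This removes slow drifts and the subject's tonic pupil size, but not stimulus-driven luminance effects, which must instead be controlled in the stimulus design itself (e.g. luminance-matched stimuli). The sketch below uses hypothetical values.

```python
def baseline_corrected_dilation(pupil_samples, baseline_samples):
    """Subtractive baseline correction: express each task-window pupil
    sample relative to the mean diameter of a pre-stimulus baseline
    window. Removes slow drift and tonic pupil size, NOT luminance
    artifacts -- those must be handled in the stimulus design."""
    baseline = sum(baseline_samples) / len(baseline_samples)
    return [s - baseline for s in pupil_samples]

pre = [3.0, 3.2]   # hypothetical pre-stimulus pupil diameters (mm)
task = [3.5, 3.6]  # hypothetical task-window diameters (mm)
print([round(d, 2) for d in baseline_corrected_dilation(task, pre)])  # → [0.4, 0.5]
```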
6 Conclusion
As we have seen, there is no single ‘gold standard’ for physiological measurement in UX and
usability testing. Instead, it was found that cardiovascular measures, skin conductance, EEG and
pupillometry may all be more or less useful, depending on the context of study. Although none of
these methods allows for an absolute measurement of the thoughts or emotions experienced during
a usability test, they may help identify elements of the interaction that are particularly important or
interesting, such as instances of elevated cognitive load, frustration or other emotional reactions.
However, usability researchers should be aware that there is never just one possible explanation for
an observed physiological reaction. Therefore, physiological responses should always be
interpreted in relation to the context in which the data were collected, as well as to the users’ own
account of their experience.
7 Bibliography
A.D.A.M., Inc., 2005. A.D.A.M. Medical Encyclopedia: EEG. [WWW Document]. URL
http://www.nlm.nih.gov/medlineplus/ency/article/003931.htm
Andreassi, J.L., 2000. Psychophysiology: human behavior and physiological response. Lawrence
Erlbaum Associates, Mahwah, N.J.
Antonenko, P., Paas, F., Grabner, R., Gog, T., 2010. Using Electroencephalography to Measure
Cognitive Load. Educational Psychology Review 22, pp. 425–438.
Bailey, B.P., Konstan, J.A., 2006. On the need for attention-aware systems: Measuring effects of
interruption on task performance, error rate, and affective state. Computers in Human
Behavior 22, pp. 685–708.
Barreto, A., Zhai, J., Rishe, N., Gao, Y., n.d. Significance of Pupil Diameter Measurements for the
Assessment of Affective State in Computer Users, in: Elleithy, K. (Ed.), Advances and
Innovations in Systems, Computing Sciences and Software Engineering. Springer
Netherlands, Dordrecht, pp. 59–64.
Bartels, M., Marshall, S.P., 2012. Measuring cognitive workload across different eye tracking
hardware platforms. ACM Press, p. 161.
Beatty, J., 1982. Task-evoked pupillary responses, processing load, and the structure of processing
resources. Psychological Bulletin 91, pp. 276–292.
Beatty, J., Lucero-Wagoner, B., 2000. Pupillary System, in: Cacioppo, J.T., Tassinary, L.G.,
Berntson, G.G. (Eds.), Handbook of Psychophysiology, 2nd ed. Cambridge University Press,
New York, pp. 142–162.
Berntson, G.G., Thomas Bigger, J., Eckberg, D.L., Grossman, P., Kaufmann, P.G., Malik, M., Nagaraja,
H.N., Porges, S.W., Saul, J.P., Stone, P.H., Van Der Molen, M.W., 1997. Heart rate variability:
Origins, methods, and interpretive caveats. Psychophysiology 34, pp. 623–648.
Bradley, M.M., Miccoli, L., Escrig, M.A., Lang, P.J., 2008. The pupil as a measure of emotional arousal
and autonomic activation. Psychophysiology 45, pp. 602–607.
Brooke, J., 1996. SUS: a “quick and dirty” usability scale, in: Jordan, P.W., Thomas, B.,
Weerdmeester, B.A., McClelland, A.L. (Eds.), Usability Evaluation in Industry. Taylor and
Francis, London.
Cegarra, J., Chevalier, A., 2008. The use of Tholos software for combining measures of mental
workload: Toward theoretical and methodological improvements. Behavior Research
Methods 40, pp. 988–1000.
Chanel, G., Kierkels, J.J.M., Soleymani, M., Pun, T., 2009. Short-term emotion assessment in a recall
paradigm. International Journal of Human-Computer Studies 67, pp. 607–627.
Coles, M.G.H., Rugg, M.D., 1995. The ERP and cognitive psychology: Conceptual issues, in: Rugg,
M.D., Coles, M.G.H. (Eds.), Electrophysiology of mind: Event-related brain potentials and
cognition. Oxford University Press, New York, pp. 27–39.
Crane, E., Peter, C., 2006. A working definition for HCI specific emotion research, in: Peter, C.,
Beale, R., Crane, E., Axelrod, L., Blyth, G. (Eds.), 2008. Emotion in HCI: Joint Proceedings of
the 2005, 2006, and 2007 International Workshops, pp. 54–61.
Dan-Glauser, E.S., Scherer, K.R., 2011. The Geneva affective picture database (GAPED): a new 730-
picture database focusing on valence and normative significance. Behavior Research
Methods 43, pp. 468–477.
Dingsøyr, T., Dybå, T., Moe, N. B., 2010. Agile Software Development: Current Research and Future
Directions. Springer Berlin Heidelberg: Berlin, Heidelberg.
Duchowski, A.T., 2003. Eye tracking methodology: theory and practice. Springer, London.
Dufresne, A., Courtemanche, F., Prom Tep, S., Sénécal, S., 2010. Physiological Measures, Eye and
Task Analysis to Track User Reactions in User Generated Content. Proceedings of Measuring
Behavior 2010, pp. 218–222.
Ekman, P., Levenson, R.W., Friesen, W.V., 1983. Autonomic Nervous System Activity Distinguishes
among Emotions. Science 221, pp. 1208–1210.
Gao, Y., Barreto, A., Adjouadi, M., 2010. Affective Assessment of a Computer User through the
Processing of the Pupil Diameter Signal, in: Sobh, T., Elleithy, K. (Eds.), Innovations in
Computing Sciences and Software Engineering. Springer Netherlands, Dordrecht, pp. 189–
194.
Goldwater, B.C., 1972. Psychological significance of pupillary movements. Psychological Bulletin 77,
pp. 340–355.
Gunes, H., Pantic, M., 2010. Automatic, Dimensional and Continuous Emotion Recognition.
International Journal of Synthetic Emotions 1, pp. 68–99.
Haag, A., Goronzy, S., Schaich, P., Williams, J., 2004. Emotion Recognition Using Bio-sensors: First
Steps towards an Automatic System, in: André, E., Dybkjær, L., Minker, W., Heisterkamp, P.
(Eds.), Affective Dialogue Systems. Springer Berlin Heidelberg, Berlin, pp. 36–48.
Harbich, S., Hassenzahl, M., 2008. Beyond Task Completion in the Workplace: Execute, Engage,
Evolve, Expand. Affect and Emotion in Human-Computer Interaction 2008, pp. 154-162.
Hess, E.H., 1965. Attitude and Pupil Size. Scientific American 212, pp. 46–54.
Hess, E.H., Polt, J.M., 1964. Pupil Size in Relation to Mental Activity during Simple Problem-Solving.
Science 143, pp. 1190–1192.
Hollender, N., Hofmann, C., Deneke, M., Schmitz, B., 2010. Integrating cognitive load theory and
concepts of human–computer interaction. Computers in Human Behavior 26, pp. 1278–
1288.
Höök, K., 2012. Affective Computing: Affective Interaction and Technology as Experience, in:
Soegaard, M., Dam, R.F. (Eds.), Encyclopedia of Human-Computer Interaction. The
Interaction-Design.org Foundation, Aarhus, Denmark. [WWW Document]. URL
http://www.interaction-design.org/encyclopedia/affective_computing.html
Hudlicka, E., 2003. To feel or not to feel: The role of affect in human–computer interaction.
International Journal of Human-Computer Studies 59, pp. 1–32.
Iqbal, S.T., Zheng, X.S., Bailey, B.P., 2004. Task-evoked pupillary response to mental workload in
human-computer interaction. ACM Press, p. 1477.
Isbister, K., Höök, K., 2009. On Being Supple: In Search of Rigor without Rigidity in Meeting New
Design and Evaluation Challenges for HCI Practitioners. CHI 2009, Boston, MA, USA.
Jainta, S., Baccino, T., 2010. Analyzing the pupil response due to increased cognitive demand: an
independent component analysis study. International Journal of Psychophysiology 77, pp. 1–7.
Kahneman, D., 1973. Attention and effort. Prentice-Hall, Englewood Cliffs, N.J.
Kahneman, D., Beatty, J., 1966. Pupil Diameter and Load on Memory. Science 154, pp. 1583–1585.
Kecklund, G., Åkerstedt, T., 2004. Report on methods and classification of stress, inattention and
emotional states. [WWW Document]. URL http://www.sensation-eu.org/span/pdf/
sens_d_112.pdf
Klingner, J., Kumar, R., Hanrahan, P., 2008. Measuring the task-evoked pupillary response with a
remote eye tracker. ACM Press, p. 69.
Cockton, G., 2008. Designing worth – connecting preferred means to desired ends. Interactions,
July + August 2008, pp. 54–57.
Kreibig, S.D., 2010. Autonomic nervous system activity in emotion: A review. Biological Psychology
84, pp. 394–421.
Kumar, N.K., Kohlbecher, S., Schneider, E., 2009. A novel approach to video-based pupil tracking,
IEEE International Conference on Systems, Man and Cybernetics, SMC 2009, pp. 1255-1262.
Lang, P.J., Bradley, M.M., Cuthbert, B.N., 2008. International affective picture system (IAPS):
Affective ratings of pictures and instruction manual. Technical Report A-8. University of
Florida, Gainesville, FL.
Lee, J.C., Tan, D.S., 2006. Using a low-cost electroencephalograph for task classification in HCI
research. ACM Press, p. 81.
Loewenfeld, I.E., 1993. The pupil: anatomy, physiology, and clinical applications, Vol. 1. Iowa State
University Press, Ames.
Madrigal, D., McClain, B., 2009. Testing the User Experience: Consumer Emotions and Brand
Success. [WWW Document]. URL http://www.uxmatters.com/mt/archives/2009/10/
testing-the-user-experience-consumer-emotions-and-brand-success.php
Marshall, S.P., 2000. Method and apparatus for eye tracking and monitoring pupil dilation to
evaluate cognitive activity. U.S. Patent 6,090,051.
Marshall, S.P., 2002. The Index of Cognitive Activity: measuring cognitive workload. IEEE, pp. 75–
79.
Marshall, S.P., 2003. Methods for monitoring affective brain function. U.S. Patent 6,572,562.
Marshall, S.P., Pleydell-Pearce, C.W., Dickson, B.T., 2003. Integrating psychophysiological measures
of cognitive workload and eye movements to detect strategy shifts, in: Proceedings of the
Thirty-Sixth Annual Hawaii International Conference on System Sciences, p. 6.
Mehrabian, A., Russell, J.A., 1974. An approach to environmental psychology. MIT Press,
Cambridge, MA, USA.
Nielsen, J., Pernice, K., 2010. Eyetracking web usability. New Riders, Berkeley, CA.
Norman, D.A., 2004. Emotional design: why we love (or hate) everyday things. Basic Books, New
York.
Poh, M.Z., Swenson, N.C., Picard, R.W., 2010. A Wearable Sensor for Unobtrusive, Long-Term
Assessment of Electrodermal Activity. IEEE Transactions on Biomedical Engineering 57, pp.
1243–1252.
Poh, M.Z., Kim, K., Goessling, A., Swenson, N.C., Picard, R.W., 2011. Cardiovascular Monitoring Using
Earphones and a Mobile Device. IEEE Pervasive Computing, IEEE Computer Society Digital
Library. URL http://doi.ieeecomputersociety.org/10.1109/MPRV.2010.91
Palinko, O., Kun, A.L., Shyrokov, A., Heeman, P., 2010. Estimating cognitive load using remote eye
tracking in a driving simulator. ACM Press, p. 141.
Palinko, O., Kun, A.L., 2011. Exploring the Influence of Light and Cognitive Load on Pupil Diameter
in Driving Simulator Studies. Proceedings of Driving Assessment 2011.
Palinko, O., Kun, A.L., 2012. Exploring the Effects of Visual Cognitive Load and Illumination on Pupil
Diameter in Driving Simulators. Eye Tracking Research and Applications 2012.
Park, B., 2009. Psychophysiology as a Tool for HCI Research: Promises and Pitfalls, in: Jacko, J.A.
(Ed.), Human-Computer Interaction. New Trends. Springer Berlin Heidelberg, Berlin,
Heidelberg, pp. 141–148.
Partala, T., 2005. Affective information in human-computer interaction. Doctoral dissertation,
Department of Computer Sciences, in: Dissertations in interactive technology, 1. Tampere
University Press, Tampere.
Partala, T., Surakka, V., 2003. Pupil size variation as an indication of affective processing.
International Journal of Human-Computer Studies 59, pp. 185–198.
Picard, R.W., 1997. Affective computing. MIT Press, Cambridge, Mass.
Picard, R.W., Vyzas, E., Healey, J., 2001. Toward machine emotional intelligence: analysis of affective
physiological state. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, pp.
1175–1191.
Picard, R.W., 2003. Affective computing: challenges. International Journal of Human-Computer
Studies 59, pp. 55–64.
Pomplun, M., Sunkara, S., 2003. Pupil dilation as an indicator of cognitive workload in human-
computer interaction, in: Harris, D., Duffy, V., Smith, M., Stephanidis, C. (Eds.), Human-
Centred Computing: Cognitive, Social, and Ergonomic Aspects. Vol. 3 of the Proceedings of
the 10th International Conference on Human-Computer Interaction, pp. 542–546.
Preece, J., Rogers, Y., Sharp, H., 2002. Interaction design: beyond human-computer interaction. J.
Wiley & Sons, New York, NY.
Rowe, D.W., Sibert, J., Irwin, D., 1998. Heart rate variability. ACM Press, pp. 480–487.
Rubin, J., Chisnell, D., 2008. Handbook of usability testing how to plan, design, and conduct effective
tests [WWW Document]. URL http://www.books24x7.com/marc.asp?bookid=25203
Sanches, P., Kosmack Vaara, E., Sjölinder, M., Weymann, C., Höök, K., 2010. Affective Health –
designing for empowerment rather than stress diagnosis. CHI 2010.
Scheirer, J., Fernandez, R., Klein, J., Picard, R.W., 2002. Frustrating the user on purpose: a step
toward building an affective computer. Interacting with Computers 14, pp. 93–118.
Scherer, K.R., 2005. What are emotions? And how can they be measured? Social Science Information
44, pp. 695–729.
Sherman, P., 2007. How Do Users Really Feel About Your Design? [WWW Document]. URL
http://www.uxmatters.com/mt/archives/2007/09/how-do-users-really-feel-about-your-
design.php
Stanners, R., Coulter, M., Sweet, A., Murphy, P., 1979. The pupillary response as an indicator of
arousal and cognition. Motivation and Emotion 3, pp. 319–340.
Tobii Technology Inc., 2009. Guidelines for Using the Retrospective Think Aloud Protocol with Eye
Tracking. [WWW Document]. URL http://www.tobii.com/Global/Analysis/Training/
WhitePapers/ RTA_guidelines_eyetracking_tobii_shortpaper.pdf
Tobii Technology Inc., 2010. Tobii Eye Tracking: An introduction to eye tracking and Tobii Eye Trackers.
[WWW Document]. URL http://www.tobii.com/eye-tracking-research/global/
library/white-papers/tobii-eye-tracking-white-paper/
Tobii Technology Inc., 2010. Tobii TX300 Eye Tracker. [WWW Document]. URL
http://www.tobii.com/Global/ Analysis/Downloads/Product_Descriptions/
Tobii_TX300_EyeTracker_Product_Description.pdf
Tullis, T., Albert, B., 2008. Measuring the user experience: collecting, analyzing, and presenting
usability metrics. Elsevier/Morgan Kaufmann, Amsterdam.
Ward, R., Marsden, P., 2003. Physiological responses to different WEB page designs. International
Journal of Human-Computer Studies 59, pp. 199–212.
Wilhelm, F.H., Grossman, P., 2010. Emotions beyond the laboratory: Theoretical fundaments, study
design, and analytic strategies for advanced ambulatory assessment. Biological Psychology
84, pp. 552–569.
Xu, J., Wang, Y., Chen, F., Choi, H., Li, G., Chen, S., Hussain, S., 2011. Pupillary response based
cognitive workload index under luminance and emotional changes. ACM Press, p. 1627.