Physiology as a Tool for UX and Usability Testing
A comparative study of pupil size and other physiological measures
MALIN FORNE
Master of Science Thesis Stockholm, Sweden 2012
DH224X, Master’s Thesis in Human-Computer Interaction (30 ECTS credits)
Degree Programme in Media Technology (300 credits)
Royal Institute of Technology, year 2012
Supervisor at CSC was Ylva Fernaeus
Examiner was Kristina Höök
TRITA-CSC-E 2012:082
ISRN-KTH/CSC/E--12/082--SE
ISSN-1653-5715
Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.kth.se/csc
Abstract

The purpose of this degree project is to investigate how physiological measures, such as heart rate, skin conductance and EEG (i.e. electrical brain activity), may be useful in UX and usability testing. One physiological research method is discussed in more detail, i.e. the measurement of pupil size, or pupillometry. The study seeks to answer the following questions: 1. What can we find out about human emotions and cognition by measuring and analyzing variations in pupil size? 2. What can we find out about human emotions and cognition by using other popular physiological measurement methods? 3. How does pupillometry compare to other physiological measures for the purpose of UX and usability testing?
In order to answer these questions, an extensive literature review was carried out. In
addition, a minor pupillometric study was carried out, in order to practically investigate the
potential of pupil size as a tool for UX and usability testing. The study concludes that although it is
not possible to ‘measure’ the thoughts and emotions experienced during a usability test,
physiological measurement may help identify significant episodes of human-computer interaction,
such as instances of elevated emotion, frustration or intense cognitive effort. It was found that a
large number of physiological signals could be useful for this purpose, and that all have their
respective pros and cons. Ultimately, the choice of measure will depend on the context of study. The
study also showed that there is never just one possible explanation to an observed physiological
reaction. Therefore, physiological data should always be interpreted in relation to the context in
which it was collected, as well as the subject’s own account of the experience.
Sammanfattning (Swedish abstract, translated)

This degree project aims to investigate how physiological measurement methods, such as heart rate, skin conductance, EEG (i.e. electrical brain activity) and pupil size, can be used in usability testing. The primary focus of the study is pupillometry, i.e. measurements of how the diameter of the pupil changes over time. The work is based on the following questions: 1. What can we learn about people's emotional and cognitive processes by measuring and analyzing variations in pupil size? 2. What can we learn about people's emotional and cognitive processes by using other popular physiological measurement methods? 3. How does pupillometry compare to other physiological measurement methods for use in usability testing?

To answer these questions, an extensive literature review was carried out, together with a smaller empirical study in which the potential of pupillometry as a measurement method was investigated in practice. Overall, the study shows that physiological measurement methods can be used to identify events of particular significance during a usability test, such as episodes of high cognitive load, strong emotion or frustration in the user. A large number of physiological measurement methods proved useful for that purpose, and the choice of method depends mainly on the specific conditions of the study at hand. However, the study also shows that there is rarely a single unambiguous explanation for a given physiological change. Physiological data should therefore always be interpreted in relation to the context in which they were recorded, as well as each participant's own account of the user experience.
Table of Contents
1 Introduction ......................................................................................................................................................... 5
1.1 Background .................................................................................................................................................................... 5
1.1.1 Physiological Measures .................................................................................................................................... 5
1.1.2 Eye Tracking ......................................................................................................................................................... 6
1.1.3 Pupillometry ......................................................................................................................................................... 6
1.2 Purpose of Study .......................................................................................................................................................... 7
1.2.1 Research Questions ............................................................................................................................................ 7
1.2.2 Limitations ............................................................................................................................................................. 7
1.3 Method ............................................................................................................................................................................. 7
2 Theoretical Foundation ................................................................................................................................... 9
2.1 UX and Usability Testing .......................................................................................................................................... 9
2.2 Cognition in HCI ......................................................................................................................................................... 10
2.2.1 Understanding User Cognition .................................................................................................................... 10
2.2.2 Cognitive Load Assessment .......................................................................................................................... 11
2.3 Emotion in HCI ........................................................................................................................................................... 11
2.3.2 Understanding User Emotion ...................................................................................................................... 12
2.3.3 Studying User Emotion ................................................................................................................................... 13
3 Physiological Measures ................................................................................................................................ 15
3.1 Physiological Context ............................................................................................................................................... 15
3.2 Using Physiological Measures .............................................................................................................................. 16
3.3 Common Measures ................................................................................................................................................... 17
3.3.1 Cardiovascular activity ................................................................................................................................... 17
3.3.2 Skin Conductance .............................................................................................................................................. 19
3.3.3 Electrical Brain Activity ................................................................................................................................. 21
4 Pupillometry ..................................................................................................................................................... 24
4.1 Pupillary Movements ............................................................................................................................................... 24
4.1.1 Optical Reflexes ................................................................................................................................................. 24
4.1.2 Reflex Dilation .................................................................................................................................................... 25
4.2 Measuring Pupil Size ................................................................................................................................................ 26
4.3 Previous Studies ........................................................................................................................................................ 29
4.3.1 Pupillometry in Affect Recognition ........................................................................................................... 29
4.3.2 Cognitive Pupillometry ................................................................................................................................... 30
4.3.3 Dealing with the Light Reflex ....................................................................................................................... 33
4.4 Pilot Study .................................................................................................................................................................... 35
4.4.1 Participants ......................................................................................................................................................... 35
4.4.2 Equipment and Procedure ............................................................................................................................ 35
4.4.3 Cognitive Tasks .................................................................................................................................................. 36
4.4.4 Affective Stimuli ................................................................................................................................................ 37
4.4.5 Results and Analysis ........................................................................................................................................ 38
4.4.6 Lessons Learned ................................................................................................................................................ 43
5 Discussion and Analysis ............................................................................................................................... 45
5.1 Interpreting Physiological Data ........................................................................................................................... 45
5.2 Challenges for UX and Usability Testing .......................................................................................................... 47
5.3 Evaluation of Measures ........................................................................................................................................... 49
6 Conclusion ......................................................................................................................................................... 51
7 Bibliography ..................................................................................................................................................... 52
List of Abbreviations
ANS = Autonomic Nervous System
CNS = Central Nervous System
EEG = Electroencephalography
GSR = Galvanic Skin Response
HCI = Human-Computer Interaction
HR = Heart Rate
HRV = Heart Rate Variability
ICA = Index of Cognitive Activity
MPD = Mean Pupil Diameter
PD = Pupil Diameter
PNS = Peripheral Nervous System
SC = Skin Conductance
TERP = Task-Evoked Pupillary Response
UX = User Experience
Degree Project Report
Malin Jönsson Forne, 2012
5
1 Introduction
As interactive technologies become increasingly important in our everyday lives, the human-
computer interaction community has slowly moved beyond a strict focus on usability, and started to
consider the entire user experience, or UX. This means that systems are no longer assessed solely in
terms of their ability to enhance user performance, but also on their ability to motivate, entertain,
amuse or satisfy their users (Preece et al., 2002). In order to assess such factors, usability
researchers must understand more about the cognitive and emotional processes that are evoked as
users interact with a system.
A common way to address emotional and cognitive aspects in usability testing today is
through retrospective self report; that is, users are asked to describe or answer questions about
their experience after the task has been completed, either verbally or through a questionnaire of
some sort (Sherman, 2007). While such strategies are certainly useful, they are limited in their
capacity to identify changes in emotional or cognitive processing over the course of the test (unless
the user is constantly interrupted with questions, which would of course have a negative impact on
the authenticity of the user experience). These considerations have spurred an interest in
complementary methods for the assessment of the user experience.
1.1 Background
1.1.1 Physiological Measures
Within research areas such as psychology and neurology, it has long been known that emotional
and cognitive processes give rise to measurable physiological responses in the human body. For
example, it has been found that the pupil dilates in response to cognitive or emotionally toned
stimuli (e.g. Goldwater 1972, Loewenfeld, 1993), which makes eye tracking an interesting
measurement technique. Other methods of recording physiological measures include GSR (Galvanic
Skin Response), which is associated with increased sweat production, cardiovascular measures,
which include heart rate and heart rate variability, and EEG (Electroencephalography), which
reflects electrical activity along the scalp.
Naturally, physiological measures are not some magic key to the human mind. Measures of
bodily reactions do not, as one might be tempted to believe, enable us to draw definite conclusions
about what a person is thinking or feeling at a given time. The analysis of bodily reactions may,
however, provide some additional clues to the user experience. How these clues may be obtained,
and what new insights they may lead to, is what this study aims to investigate. Focus will lie on the
measurement of pupil size (i.e. pupillometry), but this method will also be compared to other,
perhaps more commonly used physiological measures.
1.1.2 Eye Tracking
This master thesis project is carried out in collaboration with Tobii Technology, one of the leading
producers of eye tracking systems. As the term suggests, eye tracking is technology that allows us to
track and record eye movements. Although these systems come in different forms, most modern
eye trackers (including those produced by Tobii) use a combination of infrared light sources and
infrared video cameras to determine the point of regard (Tullis & Albert, 2008). When in use,
(invisible) near infrared light is pointed to the eye of the user, creating a strong reflection in the
retina (known as the “bright pupil”). In addition to this, a small but sharp glint, called the corneal
reflection, appears on the cornea of the eye. These reflections are recorded by the infrared camera,
and their relative positions are then used to calculate the point of regard (Duchowski, 2007). Eye
tracking technology can be used for a variety of different purposes, but this study will focus on the
use of eye tracking in usability and UX research.
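The pupil-centre/corneal-reflection technique described above can be sketched as a mapping problem: after calibration, the vector from the corneal glint to the pupil centre corresponds (approximately) to an on-screen point of regard. The fragment below is a deliberately simplified linear illustration of that idea, not Tobii's actual algorithm; real trackers fit more elaborate (often polynomial) calibration models, and the coefficients here are made up.

```python
import numpy as np

def estimate_gaze(pupil_center, corneal_reflection, calib_matrix, calib_offset):
    """Map the glint-to-pupil vector to screen coordinates.

    A real eye tracker uses a richer calibration model fitted per user;
    this linear version only illustrates the principle.
    """
    v = np.asarray(pupil_center, float) - np.asarray(corneal_reflection, float)
    return calib_matrix @ v + calib_offset

# Hypothetical calibration: 100 screen pixels per unit of glint-pupil offset,
# centred on a 1920x1080 display.
A = np.array([[100.0, 0.0], [0.0, 100.0]])
b = np.array([960.0, 540.0])

gaze = estimate_gaze(pupil_center=(2.5, -1.0),
                     corneal_reflection=(1.5, 0.0),
                     calib_matrix=A, calib_offset=b)
print(gaze)  # estimated point of regard in screen pixels
```

In practice the calibration stage, where the user fixates known screen points, is what determines the mapping; the geometry above only shows why the two reflections together suffice to recover gaze direction.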
Eye tracking has become an increasingly popular tool in usability testing over the past few
years, as systems become more reliable and easy to use (Tullis & Albert, 2008). Nowadays, eye
tracking technology can be incorporated into a computer monitor, or even into a pair of glasses,
which makes it suitable for many different settings. In a typical eye tracking study, gaze data is
collected as users perform some given task(s). The data may then be subjected to statistical
analysis, or visualized to show what users looked at, for how long and/or in what order. The
relevance of such studies is founded on the eye-mind hypothesis, which holds that what people are
looking at is usually the same as what they are thinking about (Just & Carpenter, 1976). In other
words, we may presume that if we track the movements of a person’s gaze, we can follow along the
path of her attention (Duchowski, 2007). Although there are certainly exceptions to this rule, eye
tracking is considered a useful tool for studying user attention in usability testing (Nielsen &
Pernice, 2010).
1.1.3 Pupillometry
While the eye-tracking applications described so far are interesting and widely used, they are not
the focus of this thesis. Instead, I concentrate on a particular kind of eye tracking called
pupillometry, in which (changes in) pupil size are measured. Most people know that pupil size
varies with the intensity of light, but pupillary movements may also be related to cognitive and/or
emotional processes. This makes them a potential source of new insights into the user experience.
As eye movements are recorded with an eye tracker, pupil size data is usually collected in
the process (Tullis & Albert, 2008). Nevertheless, this data is seldom analyzed, as focus often lies on
gaze patterns (i.e. where the subject was looking during the course of the experiment). If we want
to gain insights into cognitive and emotional processes, however, pupil size data might be a
valuable resource. Pupil dilations might tell us if a certain part of the interaction was particularly
complex (i.e. gave rise to a high cognitive load) or if there was some part that caused frustration.
With a greater understanding of how pupillometry might be used to analyze cognitive and
emotional activities, and how pupil size compares to other physiological measures available,
usability researchers could make more informed choices when preparing a study. Moreover, it
might lead to a better harnessing of the data obtained in the study.
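As a toy illustration of how such pupil size data might be screened (not a method prescribed by this thesis), the sketch below baseline-corrects a pupil-diameter trace and flags samples where dilation exceeds a threshold. Both the baseline window and the 0.3 mm threshold are arbitrary assumptions for the example.

```python
import statistics

def flag_dilation_episodes(pupil_mm, baseline_window=5, threshold_mm=0.3):
    """Return indices where pupil diameter exceeds the resting baseline.

    baseline_window: number of initial samples treated as a resting baseline.
    threshold_mm: dilation above baseline considered noteworthy.
    Both defaults are arbitrary choices for illustration only.
    """
    baseline = statistics.mean(pupil_mm[:baseline_window])
    return [i for i, d in enumerate(pupil_mm) if d - baseline > threshold_mm]

# Synthetic trace (mm): a dilation around samples 7-9 might mark
# a demanding sub-task or an emotionally toned event.
trace = [3.0, 3.1, 3.0, 2.9, 3.0, 3.1, 3.2, 3.5, 3.6, 3.4, 3.1, 3.0]
print(flag_dilation_episodes(trace))  # → [7, 8, 9]
```

Flagged samples would then be inspected against screen recordings or the user's own account, since (as later chapters stress) a dilation alone does not reveal its cause.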
1.2 Purpose of Study
The purpose of the present study is to investigate how different physiological measures may be
used to analyze emotional and cognitive processes in the context of UX and usability testing. The
study contributes to the field of Human-Computer Interaction (HCI) by providing an overview of
different physiological measures that may be used to investigate users’ emotional and cognitive
processes. Moreover, the different measures are evaluated and compared with respect to factors
that are particularly important in usability testing, such as unobtrusiveness, robustness and
simplicity of use.
1.2.1 Research Questions
RQ1. What can we find out about human emotions and cognition by measuring and analyzing
variations in pupil size?
RQ2. What can we find out about human emotions and cognition by using other popular
physiological measurement methods?
RQ3. How does pupillometry compare to other physiological measures for the purpose of UX
and usability testing?
1.2.2 Limitations
This study approaches physiological measures from the usability researcher’s point of view, and the
conclusions drawn are thereby specific to that context. Furthermore, focus lies on the use of
pupillometry in UX and usability, which means that other physiological measures will not be
described in the same detail.
1.3 Method
In order to answer the research questions, a literature review was carried out. In the first phase of
the study, the aim was to get an overview of relevant research areas, such as Usability Testing,
Affective Computing, Psychology, Neuroscience and Physiology, as well as to understand their
respective relevance to this study. Furthermore, some initial questions had to be investigated,
including:
What is cognition?
What is emotion?
How are emotion and cognition relevant to HCI (Human-Computer Interaction)?
What constitutes a “good” method for usability testing?
What methods are used today to study emotion and cognition in usability testing?
What are the pros and cons of using physiological measures?
What is eye tracking, and pupillometry?
What other physiological measures might be relevant to this study?
Answers to these questions were sought in a wide variety of sources, spanning over research areas
such as Human-Computer Interaction, Affective Computing, Cognitive Psychology, Emotion
Research and Psychophysiology. After the initial questions had been explored, focus shifted to the
core research questions of the study. First, pupillometry was investigated in further detail. Some of
the main questions to be answered in this phase were:
How does pupillometry work theoretically?
How can it be implemented practically (in usability testing)?
What can pupillometry reveal about human emotion and cognition?
Given the vast amount of pupillometric research available, I started out from some extensive
reviews, in particular Loewenfeld (1993), but also Goldwater (1972) and Beatty (1982, 2000).
This gave me a good overview of the knowledge in the field, and it also helped me identify some of
the most important studies conducted before the turn of the century, which were then examined in
greater detail. Thereafter, more recent studies, particularly those relating to HCI, were examined, in
order to understand the current state of the art. Other physiological measures identified in the
introductory phase were investigated in a similar, though less thorough, manner. Studies and
reviews of affective computing were particularly useful in this phase.
Although the core of this study is a literature review, a minor pupillometric study was
conducted in order to gain some additional insights. The main purposes of the study were:
1. To practically investigate how pupil size measurements may be incorporated in a simple eye tracking study.
2. To investigate whether we can measure pupil dilation in response to cognitive or emotional stimuli without extensive data processing, technical skills or time consumption.
3. To gain some practical experience of pupillometric research, in order to better understand the challenges involved.
The study consisted of two parts, one in which subjects performed simple math problems
(cognitive task), and one in which they were presented with emotionally toned pictures (affective
stimuli). Meanwhile, pupil size measurements were performed with a Tobii Eye Tracker. For a
detailed description of the study, please refer to section 4.4.
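Analyses of this kind often average baseline-corrected pupil responses across repeated trials, yielding a task-evoked pupillary response (TERP) waveform. The sketch below uses synthetic numbers, not the pilot study's actual data, and a two-sample baseline chosen purely for brevity.

```python
def terp(trials, baseline_samples=2):
    """Average task-evoked pupillary response across trials.

    Each trial is a list of pupil-diameter samples of equal length; the
    first `baseline_samples` of every trial serve as its own baseline.
    Returns the mean baseline-corrected waveform.
    """
    corrected = []
    for trial in trials:
        base = sum(trial[:baseline_samples]) / baseline_samples
        corrected.append([s - base for s in trial])
    n = len(corrected)
    return [sum(col) / n for col in zip(*corrected)]

# Two synthetic trials with slightly different resting baselines;
# per-trial baseline correction removes that offset before averaging.
trials = [[3.0, 3.0, 3.2, 3.4, 3.1],
          [3.2, 3.2, 3.5, 3.6, 3.3]]
print(terp(trials))
```

Averaging across trials is what makes the small task-evoked dilations (typically fractions of a millimetre) stand out from sample-to-sample noise.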
In the last phase of the study, the different physiological measures were compared with
respect to factors that might be important in the context of usability testing. The result of this
analysis was used to create a model, which provides a collective view of the different measures and
their value when studying emotions and cognition in the context of UX and usability testing.
2 Theoretical Foundation
The aim of this chapter is to provide a theoretical foundation for the present study, by introducing
the reader to some of the core concepts related to the study. In the first section, I provide a brief
introduction to UX and usability testing. Thereafter, the concepts of cognition and emotion will be
discussed in turn, especially with regard to their relevance to human-computer interaction (HCI), as
well as to the present study.
2.1 UX and Usability Testing
Usability can be described as “the extent to which a product can be used by specified users to
achieve specified goals [...] in a specified context” (ISO 9241-11:1998). More specifically, the term is
often broken down into a set of design goals, including (Rubin and Chisnell, 2008):
Effectiveness (How good is the system at doing what it is supposed to do?)
Efficiency (Does the system allow users to sustain a high level of productivity?)
Learnability (How easy is it to start using the system?)
Satisfaction (What are the user’s perceptions, feelings, and opinions of the product?)
Traditionally, usability goals have mostly been concerned with improving the productivity of users
interacting with a system, and design goals such as efficiency and effectiveness can certainly be
important for systems intended to support working practices (Preece et al., 2002). However, the
growth of leisure and entertainment uses of technology means that users often have other goals in
interacting with a system than mere productivity. HCI practitioners today must thus expand their
design thinking to include other possible values of technology, such as fun, enjoyment and
emotional engagement (Isbister & Höök, 2009). Such design goals are often associated with the
concept of user experience, or UX, which has become an increasingly important concept in HCI over
the last decade or so (Harbich & Hassenzahl, 2008). Traditional aspects of usability are certainly
part of what makes up the user experience, but UX is not limited to the specific moment in time
when an interaction takes place. On the contrary, the UX point-of-view stresses that users’
evaluations of interactive experiences evolve beyond the end of the interaction itself, and good
experiences can give rise to revisitable good moods and enduring, re-evokable memories (Cockton,
2008). Therefore, the most fundamental error interaction designers make in the design process is
to sketch things, without connecting those things to good experiences and outcomes for the people
who will interact with them (ibid.).
In this study, I discuss different methods that may be applied in the evaluation of interactive
systems, i.e. UX and usability testing. Usability testing allows for more informed design decisions,
and may serve as a way to ensure that important design goals are being met by the product or
prototype in question (Rubin and Chisnell, 2008). Ideally, an iterative cycle of tests is performed
during the course of system development, in order to gradually shape or mold a usable product into
place (ibid.). However, usability testing may also be used to evaluate existing interfaces, or to
compare two or more alternative design solutions.
Typically, usability testing involves observing representative end users using the system or product
to perform realistic tasks (Rubin and Chisnell, 2008). The basic approach originates from classic
empirical research methods, but has been adapted to fit the fast-paced, highly pressurized
commercial environment in which most interactive systems are developed (ibid.). In usability
research, it may for example be impossible or inappropriate to use large numbers of test subjects,
or to adopt the strict control of the testing environment which is often required in academic
research. This is particularly true today, when rigid, sequential “waterfall” methodologies of
software development are increasingly being replaced by more flexible, iterative or agile
development processes (cf. Dingsøyr et al., 2010).
2.2 Cognition in HCI
The term cognition refers to all aspects of human thinking and reasoning, including processes such
as perception, attention and memory (Preece et al., 2002). Understanding more about these
processes may be of great value for the design and evaluation of interactive systems, especially if
focus lies on design goals such as efficiency, effectiveness and learnability. Naturally, the way a user
interface is designed will affect how well users can perceive relevant information, understand
important functions and remember how to carry out tasks. In cognitive science, these capacities are
often described as limited resources. In order to optimally support human-computer interaction,
interaction designers must thus take the limitations of our cognitive capacities into account.
2.2.1 Understanding User Cognition
The concept of mental workload or cognitive load provides a useful framework for understanding
the limitations of user’s cognitive capacities. Cognitive load theory is based on the notion of a
limited “working memory”, which is involved in all conscious cognitive activity (Hollender et al.,
2010). As additional items are added to the pile of information that needs to be actively processed,
cognitive load increases. Too much simultaneous processing will lead to cognitive overload, making
it impossible for users to complete the task at hand (ibid.). Interfaces that require too much mental
effort may thus create user frustration, or cause the user to abandon a task altogether. However, a
monotonous task with too little cognitive stimulation may also act as a stressor. A typical
monotonous or boring situation is when the demands for sustained attention are high, but little
new information is conveyed (Kecklund et al., 2004). This may occur during uneventful
motorway driving, or when the task is to monitor an industrial process. If motivation is high, the
individual may compensate for the lack of stimulation by mobilizing extra energy. However, such
responses are effortful, and can only be sustained over short periods of time. Eventually, the person
will experience boredom and fatigue, which may reduce productivity and, in some cases, even
result in dangerous situations (ibid.). In order to optimally support user practices, human-
computer interfaces should thus provide an adequate amount of cognitive stimulation, avoiding
underload as well as overload.
Attention can also be described as a limited cognitive resource, which needs to be allocated
to ongoing events (Kecklund et al., 2004). Attention is slow, sequential and difficult to sustain for
more than brief periods of time (Kahneman, 1973). Therefore, successful human-computer
interaction requires that the user can effectively manage her attention between different elements
of an interface. Research has shown that poorly timed interruptions, due to for example instant
messages, an incoming email or a system alert, have a negative impact on user performance, especially
if the user is actively engaged in a demanding task (Bailey et al., 2006). According to Iqbal et al.
(2004), an attractive solution would be to develop systems that could identify moments of low
mental workload, in which users may be interrupted at a minimal cost.
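The idea attributed to Iqbal et al. can be reduced to a simple gate: hold notifications back while some (externally supplied) workload estimate is high, and release them when it drops. The workload scale and threshold below are placeholders, not values from any cited study.

```python
def deliver_or_defer(notifications, workload_estimate, threshold=0.5):
    """Split pending notifications by current estimated mental workload.

    workload_estimate: a value in [0, 1] assumed to come from some
    physiological or behavioural model (a placeholder here);
    threshold is an arbitrary cut-off for illustration.
    Returns (to_deliver, to_defer).
    """
    if workload_estimate > threshold:
        return [], list(notifications)   # user is busy: hold everything
    return list(notifications), []       # low workload: safe to interrupt

print(deliver_or_defer(["new mail"], workload_estimate=0.8))  # deferred
print(deliver_or_defer(["new mail"], workload_estimate=0.2))  # delivered
```

The open research problem, of course, lies entirely in producing a trustworthy `workload_estimate`, which is where the physiological measures discussed in this thesis come in.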
2.2.2 Cognitive Load Assessment
There are three main categories of mental workload assessment: performance-related, subjective
and physiological. In the first case, the cognitive load demanded by a certain (primary) task is
typically evaluated by measuring the performance on another, secondary task (Cegerra & Chevalier,
2008). For example, users may be asked to use a driving simulator while engaging in conversation. In this
case, a complicated traffic situation is likely to cause gaps in the conversation, which would be
interpreted as a sign of increased cognitive load in the subject.
By contrast, subjective measures rely on the subjects’ own reports of their experience.
Several scales have been developed to formalize these ratings, for example the NASA-TLX
procedure, which is often considered to be the most accurate (ibid.). Although such ratings may be
a good reflection of the user’s subjective experience, they are limited in one respect: the ratings are
usually obtained after a task has been completed, which means they do not give any account of the
change in cognitive load over the course of a task.
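To make the subjective procedure concrete, the weighted NASA-TLX score is commonly computed by rating six subscales from 0 to 100 and weighting each subscale by the number of times it was chosen in 15 pairwise comparisons. The following sketch illustrates this calculation; the ratings and weights shown are invented for illustration:

```python
# Illustrative sketch of the weighted NASA-TLX scoring procedure.
# Each subscale is rated 0-100; each weight is the number of times that
# subscale was preferred across the 15 pairwise comparisons.

def nasa_tlx_score(ratings, weights):
    """Overall workload = sum(rating * weight) / 15."""
    assert sum(weights.values()) == 15  # 15 pairwise comparisons in total
    return sum(ratings[s] * weights[s] for s in ratings) / 15.0

ratings = {"mental": 70, "physical": 20, "temporal": 55,
           "performance": 40, "effort": 65, "frustration": 50}
weights = {"mental": 5, "physical": 0, "temporal": 3,
           "performance": 2, "effort": 4, "frustration": 1}
print(nasa_tlx_score(ratings, weights))  # about 60.3
```

Note that a score like this still summarizes the whole task after the fact; it cannot show how the load varied while the task was in progress.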
The third means of cognitive load assessment is through physiological measures, which is
the main focus of this study. These measures include pupil size, heart rate and EEG, and will be
further discussed in Chapter 3 and 4 of this report.
2.3 Emotion in HCI
Computer use has often been regarded as a purely rational activity, in which emotions are secondary,
or can even get in the way of successful interaction (Picard, 1997). However, interest in emotional
aspects of HCI is increasing. In his book Emotional Design (2005), Don Norman put it this way:
“In the 1980s [...] I addressed utility and usability, function and form, all in a logical, dispassionate way
— even though I am infuriated by poorly designed objects. But now I’ve changed. [...] Sure, utility and
usability are important, but without fun and pleasure, joy and excitement, and yes, anxiety and anger,
fear and rage, our lives would be incomplete.”
What Norman describes is the importance of user experience, and in particular how it relates to
emotion. If the interaction with software can evoke strong positive feelings, then the user is more
likely to come back and use the system again and again. However, negative emotions may also be an
important part of great experiences, especially when it comes to leisure use of technology, such as
computer games. Gilbert Cockton (2008) argues that the most challenging game interactions can be
both unpleasant and frustrating; but finally completing a game after weeks of struggle can be
immensely satisfying. In this case, what makes the interaction worthwhile for the user is the sense
of achievement (ibid.). In other cases, however, negative user affect such as stress or frustration
may lead to critical errors, or prevent completion of a task altogether.
In Western thinking, emotion and cognition have traditionally been regarded as separate
processes (Höök, 2012). In the 1990’s, however, researchers began to understand that emotional
and cognitive processes are interrelated, and may interact in ways that are important for intelligent
functioning (Picard, 2001). Rosalind W. Picard was among the first researchers to explicitly address
the role of affect in human-computer interaction, when she published the book Affective Computing
in 1997. Picard introduced affective computing as a new field of research, concerned with
“computing that relates to, arises from, or deliberately influences emotions” (Picard, 1997). While
this definition is generally held, the starting point of Picard’s work was more specific. Coming from
the field of artificial intelligence (AI), Picard suggested that machine intelligence should include
skills of emotional intelligence. This was a major shift in thinking at the time, since previous AI
efforts had primarily focused on mathematical, verbal and perceptual capabilities (Picard, 2001).
So far, much research within the field of affective computing has focused on emotion
recognition, often through the measurement and analysis of physiological measures. According to
Picard (2003), the goal of this research is to eventually design computers that will better serve
people’s needs by recognizing and responding to user emotion. However, other researchers in
affective computing argue that rather than trying to ‘measure’ user emotion, we should try to make
people’s emotional experiences available for reflection (Höök, 2012). For example, Sanches et al.
(2010) describe the development of a mobile stress management tool called Affective Health. The
system measures heart rate, skin conductance (see 3.3) and body movement, and uses the data to
create a visualisation that users can reflect on and interpret themselves. According to the authors,
such a system avoids a “reductionist and sometimes erroneous” automatic interpretation from
physiological signals to emotion labels (ibid.).
2.3.2 Understanding User Emotion
To date, there is no universally accepted definition of ‘emotion’. However, initiatives have been taken
to establish an HCI-specific ‘working definition’ of the term, starting out from four basic assumptions
(Crane & Peter, 2008):
1. Emotions are multifaceted processes that unfold over time.
2. Emotions are induced by internal or external events.
3. Emotions manifest themselves through multiple channels, resulting in specific physiological patterns.
4. Emotion channels are loosely coupled and may interact in complex ways.
Note that this definition refers to emotions as processes. This choice of words underlines the fact
that emotions are not stable ‘states’, but subject to continuous change. This, of course,
complicates any efforts to label or ‘measure’ affective experiences. Another important assumption
stated above is that emotions manifest themselves through multiple channels, or ‘modalities’. This
notion is widely agreed upon among emotion researchers. Although different sets of modalities
have been suggested, the following three are usually mentioned (Scherer, 2005):
1. Subjective experiences (what a person is actually feeling).
2. Motor expressions (face, voice, gestures).
3. Bodily symptoms (any physiological changes in the body).
Note that this description clarifies the distinction between ‘emotion’ and ‘feeling’, two concepts
which are easily confused; feeling is just one part of what constitutes an emotion, i.e. the subjective
experience. The term ‘affect’, by contrast, is often used as a synonym of ‘emotion’ (e.g. Picard, 1997),
a practice which is adopted in this thesis as well.
Another topic of debate within the affective sciences relates to how emotions should be
described or modeled. On a high level of abstraction, current emotion theories can be divided into
two approaches: discrete and dimensional emotions (Partala, 2005). The discrete approach starts
out from emotion labels used in everyday language, such as ‘anger’, ‘fear’ and ‘happiness’, and
attempts to categorize affective processes according to these labels (cf. Ekman et al., 1982). The
dimensional approach starts out from a set of basic dimensions, each of which is defined by a pair of
opposite adjectives. The most commonly used dimensions are valence, ranging from pleasant
(positive valence) to unpleasant (negative valence), and arousal, ranging from calm to excited
(Scheirer et al., 2001). These dimensions make up the x and y axes of ‘emotional space’, into which all
emotions can be categorized based on their different characteristics. According to Partala (2005),
most scientists currently agree that the discrete and dimensional approaches are complementary,
and that both may be more or less useful, depending on the context of study.
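As a simple illustration of how the two approaches can complement each other, discrete emotion labels may be placed as points in the two-dimensional valence-arousal space, and an observed state can then be related to the nearest label. The coordinates below are invented for illustration; actual placements vary between studies:

```python
import math

# Hypothetical (valence, arousal) coordinates in [-1, 1]; real placements
# of discrete labels in emotional space differ between studies.
EMOTION_SPACE = {
    "happiness":   (0.8, 0.5),
    "anger":       (-0.6, 0.8),
    "fear":        (-0.7, 0.7),
    "sadness":     (-0.7, -0.4),
    "contentment": (0.6, -0.5),
}

def nearest_label(valence, arousal):
    """Map a point in valence-arousal space to the closest discrete label."""
    return min(EMOTION_SPACE,
               key=lambda e: math.dist((valence, arousal), EMOTION_SPACE[e]))

print(nearest_label(0.7, -0.3))  # a pleasant, calm state -> 'contentment'
```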
As previously mentioned, emotion and cognition are not separate processes, but closely
related to one another. Today, many researchers talk about a cognitive component of emotion,
arguing that emotional experiences are determined by the cognitive evaluation or appraisal of
events (Scherer, 2005). For example, barely passing an exam is not an inherently happy or sad
event; the emotional reaction depends on the subjective evaluation of the result, in relation to
expectations of the outcome.
2.3.3 Studying User Emotion
In the previous section, I introduced three ‘modalities of emotion’ (i.e. subjective experience, motor
expression and bodily symptoms). These modalities point out a direction for emotion research: If
we accept that emotions manifest themselves through different channels, then the way to
understand emotions would be to ‘tune in’ to one or more of these channels.
The first modality of emotion, i.e. what a person is actually feeling, is of course hard to
‘measure’. Naturally, only the person having an emotion can know what it feels like, and the only
way to extract at least some of that information is by asking the person. In usability testing, this is
usually done through a more or less structured interview, or by using a questionnaire of some sort
(Madrigal & McCain, 2009).
Motor expression, or ‘body language’ to use a more common term, has been extensively
studied in emotion research (Haag et al., 2004). This modality includes gestures, posture and facial
expressions; in short, every emotional expression that can be observed by the people around us. A
disadvantage of using motor expression for emotion research is that people can control or “fake”
their body language, at least to some degree. For example, a person may choose to conceal a feeling
of disappointment with a smile. Naturally, this may lead to misinterpretations, in particular if the
emotion recognition is performed by a computer, which may not be able to interpret the
surrounding circumstances.
The third modality of emotion, and the focus of this study, is bodily symptoms. This modality
includes all physical reactions that are associated with an emotion. Some researchers argue that
physiological measurement is a particularly promising method for affect recognition, because these
measures are less susceptible to environmental interference or voluntary masking than for example
facial expressions (cf. Picard, 1997). Physiological measurement has been extensively researched
within affective computing, and some researchers in the field argue that reliable affect recognition
could be achieved through the integration of several physiological measures (cf. Hudlicka, 2003).
The use of physiological measures is further discussed in the next chapter.
3 Physiological Measures
This chapter serves as a general introduction to the study of physiological responses, and provides
a more detailed description of some of the most popular measures. The first section puts
physiological responses into context, briefly describing their role in the human body and how they
are brought about. Thereafter, I discuss some general issues regarding the practical implementation
of physiological measures in usability testing. Finally, the third section of the chapter provides an
introduction to some of the most popular physiological measures, i.e. cardiovascular activity, skin
conductance and EEG. Pupillometry, being the main focus of this study, will be discussed separately
in Chapter 4 of the report.
3.1 Physiological Context
The human nervous system can be divided into a central and a peripheral system, which are each
responsible for different parts of the body. The central nervous system (CNS) includes the spinal
cord and the brain, and can be described as the body’s control center. The spinal cord is responsible
for simple reflexes and serves as a pathway between the brain and other parts of the body. The
brain is responsible for all cognitive processing, including perception, memory and thought, but is
also the center of emotions (Chanel et al., 2009).
The peripheral nervous system (PNS) can be described as the body’s communication system,
and acts mainly below the level of consciousness. The PNS is responsible for carrying signals from
the CNS to the rest of the body, but it also transfers sensory information from the organs (e.g. eyes,
ears and skin) back to the brain, where it is processed and interpreted. Of special relevance to this
study is the autonomic nervous system (ANS), which is often described as a subdivision of PNS.
However, current research underlines the integrated nature of the human nervous system, and has
found that there are actually close interactions between its central and autonomic divisions
(Kreibig 2010). The primary task of ANS is to provide quick and reliable responses to surrounding
events, preparing the body for appropriate action (ibid.). This can only be attained by the
coordination and integration of neurological activity, from the highest level in the cortex down to
the spinal cord and peripheral nervous system (ibid.).
There are two branches of ANS, the sympathetic and the parasympathetic branch, which are
responsible for different bodily responses. When fully activated, the sympathetic division of ANS
prepares the body for a crisis that may require sudden, intense physical activity: heart and
respiration rates rise, sweating breaks out and alertness increases (Barreto et al., 2007). This is
known as the ‘fight or flight’ response, and may be experienced in highly emotional or stressful
situations (Partala, 2005). By contrast, the parasympathetic division of ANS brings the body back
from the emergency state, and is associated with effective emotion regulation and restoration of
energy (ibid.).
In addition to stress and emotion, cognitive factors may also influence ANS activity. In
particular, the activation of the sympathetic branch is associated with high levels of cognitive
workload. This increased activation or arousal can often lead to improved cognitive performance, at
least up to a certain point. However, the effect can only be sustained over brief periods of time
(Kecklund & Åkerstedt, 2004). Parasympathetic activity, on the other hand, has been associated
with enhanced attention (Rantanen et al., 2010).
3.2 Using Physiological Measures
Evidence that human physiology responds to a variety of mental events has been available since the
19th century (Ward & Marsden, 2003). Skin conductance, respiration, electrical brain activity,
muscle tension, pupillary size and cardiovascular activity have all been reported to vary in response
to factors such as task difficulty, levels of attention, experiences of frustration and emotionally
toned stimuli (Andreassi, 2000). Therefore, it has been proposed that physiological data might be a
valuable tool for usability testing, as it could help identify elements and events of cognitive or
emotional relevance to the user (Ward & Marsden, 2003).
However, the integration of physiological measures in usability testing has some inherent
difficulties. First of all, most existing studies have been performed in tightly controlled
experimental settings. This goes against one of the basic requirements of usability testing, namely
that the test conditions should be as close to “real-world” use as possible. Thus, if physiological
measures are to be applied to the less controlled conditions of usability testing, then great care
must be taken in the design of testing procedures (Ward & Marsden, 2003). Another challenge lies
in the interpretation of data, since the same kind of physiological responses may be observed for
different mental states, such as frustration, surprise or increased cognitive effort (ibid.). Therefore,
a correct interpretation requires knowledge of the context in which the data was obtained. In order
to better understand the results, it is thus advisable to record additional observations along with
the physiological measurements, such as comments, observed behaviors and subjective ratings of
events (Kecklund & Åkerstedt, 2004).
Another important issue in physiological measurement is referred to as the baseline
problem: How do we establish a reference response for a given physiological measure, against
which other obtained values may be compared? What is, for example the “normal” heart rate, pupil
size or skin conductance? Unfortunately, physiological responses are highly individual, which
generally makes between-subject comparisons misleading (Gunes & Pantic, 2010). In addition,
significant variations that are unrelated to emotional or cognitive factors may be observed within
subjects, depending on for example environmental factors (temperature, humidity etc.), time of day
or the subject’s pre-trial activities (Ward & Marsden, 2003). This makes it impossible to establish
any critical or cut-off values for physiological measures, corresponding to, for example, a particular
emotional state or level of mental effort (Kecklund & Åkerstedt, 2004). Instead, observed variations
must always be interpreted in relation to the baseline for the specific subject, time and context in
which data was collected.
A common approach to the baseline problem is to use the average response obtained over a
period of time before trial onset, during which no significant stimulus is presented (cf. Dufresne et al.,
2010). Subjects may, for example, be sitting in a dark room or in front of a blank screen for some
time, while their physiological responses are being recorded. A problem with this method is that
although little external stimulation is presented to the subject, it is impossible to control his or her
thoughts or state of mind, which may be influenced by a bad day at work, a pleasant memory, or
any other internal stimuli. Another approach to the baseline problem is to define the reference
response as the average value obtained for the measure over the course of the experimental
session.
A typical case in which the baseline may be of value is when comparing or averaging
physiological responses over multiple subjects. In such cases, the following formula (where R is the
physiological response under study) may be used to normalize the individual results (e.g. Dufresne
et al., 2010):

R_norm = (R - R_baseline) / R_baseline

The normalization process allows for a more accurate averaging of results. For example, certain test
subjects may have a natural tendency to sweat more, or an inherently faster heart rate than others.
Without baseline correction of the results, these individuals would have a larger impact on the
averaged results than the other participants (Beatty & Lucero-Wagoner, 2000).
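A minimal sketch of such a baseline correction, assuming the response is expressed as a relative deviation from the subject's own baseline, might look as follows:

```python
def normalize(response, baseline):
    """Relative deviation from the subject's own baseline:
    (R - R_baseline) / R_baseline."""
    return (response - baseline) / baseline

# Two subjects with different resting heart rates, but the same relative
# increase, yield the same normalized value:
print(normalize(72.0, 60.0))  # 0.2, i.e. 20% above baseline
print(normalize(96.0, 80.0))  # 0.2 as well
```

Because both subjects contribute the same normalized value for the same relative change, neither dominates the averaged result.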
As mentioned in the previous chapter, many researchers argue for the integration of several
measures in order to obtain a collective understanding of a user’s mental state. Gunes and Pantic
(2010) describe two main approaches to this problem: feature and decision level fusion. In decision-
level fusion, the different features obtained are analyzed separately. Once a classification has been
made for each feature, the results are combined to produce the final hypothesis. This method
typically assumes that the different features are independent from each other, which is often not
the case (heart rate, for example, is influenced by respiration patterns). However, the assumption of
mutual independence makes the problem of data fusion more manageable. Feature-level fusion is
somewhat more challenging, and becomes even more so as the number of features increases. This is
particularly true if the measures obtained have very different temporal properties, either because
the measurement devices have different sampling rates, or because the different responses are
inherently out of sync (e.g. heart rate and EEG). In such cases, it is particularly important to make
sure data from different sources are time-stamped correctly (ibid.).
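A decision-level fusion step can be illustrated with a simple majority vote over the labels produced by each channel's classifier; the channel names and labels here are hypothetical:

```python
from collections import Counter

def decision_level_fusion(channel_labels):
    """Combine per-channel classifications into a final hypothesis by
    majority vote, treating the channels as if they were independent."""
    return Counter(channel_labels.values()).most_common(1)[0][0]

# Hypothetical per-channel classifier outputs for one time window:
labels = {"heart_rate": "high_arousal",
          "skin_conductance": "high_arousal",
          "eeg": "low_arousal"}
print(decision_level_fusion(labels))  # 'high_arousal'
```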
3.3 Common Measures
There is no ‘gold standard’ for physiological measurement; instead, each measure has its pros and
cons (Kecklund & Åkerstedt 2004). However, this section will describe some of the measures that
seem particularly important or relevant to the present study, i.e. skin conductance, cardiovascular
activity and EEG. According to Chanel et al. (2009), skin conductance (GSR) and heart rate are
almost always included in affect recognition, an observation that I support based on my own
literature review. Therefore, it seems only natural that these measures should be discussed here. I
have also chosen to look into the use and future potential of EEG, which is an up-and-coming
technology in HCI research. An introduction to each of these measures is provided in this chapter,
while pupillometry, being the focus of the study, will be discussed separately in the next chapter.
3.3.1 Cardiovascular activity
Cardiovascular activity refers to activity of the heart, and includes parameters such as heart rate,
heart rate variability and blood volume pressure. There are two common ways to measure
cardiovascular activity: Electrocardiogram (ECG) and Photoplethysmography (PPG; Park 2009).
ECG measures the electrical pulse produced by the heart every time it contracts to pump out blood.
This method requires (at least) three electrodes, which can be attached to both arms, both legs or
the chest. Arm or leg placement is considered more practical for HCI research, but the
distance to the heart makes the signal more vulnerable to noise caused by for example body
movement or internal organ activity (ibid.).
While ECG monitors the electrical activity of the heart, PPG concentrates on its mechanical activity, by
measuring the blood flowing in and out of a toe or finger. This information is typically obtained by
placing a sensor on the toe or finger, while infrared light is emitted into the skin. Because the level
of light absorption changes with the amount of blood flowing underneath the skin, it is possible to
retrieve the heart rate from this measurement. A downside of PPG is that, due to the rather long
distance from the toes/fingers to the heart, the blood flow may not always be strong enough for the
sensor to record the PPG. In general, finger placement gives a slightly more reliable signal than toe
placement (Park 2009). On the other hand, having a sensor placed on the finger while interacting
with a computer may have a negative impact on the user experience, which may in turn influence
the obtained data.
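The principle of retrieving heart rate from a PPG waveform can be sketched as follows: locate the pulse peaks in the signal and convert the mean inter-peak interval to beats per minute. This is a simplified illustration on a synthetic signal; real PPG data would require filtering and more robust peak detection:

```python
import math

def heart_rate_from_ppg(signal, fs):
    """Estimate heart rate (BPM) from a PPG waveform sampled at fs Hz,
    by locating pulse peaks and averaging the inter-peak intervals."""
    threshold = (max(signal) + min(signal)) / 2
    peaks = [i for i in range(1, len(signal) - 1)
             if signal[i] > threshold
             and signal[i] >= signal[i - 1]
             and signal[i] > signal[i + 1]]
    intervals = [(b - a) / fs for a, b in zip(peaks, peaks[1:])]
    return 60.0 / (sum(intervals) / len(intervals))

# Synthetic pulse wave: one beat per second, sampled at 40 Hz
fs = 40
signal = [math.sin(2 * math.pi * i / fs) for i in range(fs * 5)]
print(round(heart_rate_from_ppg(signal, fs)))  # 60
```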
Cardiovascular monitoring is perhaps mostly associated with medical contexts, where it
may be used to identify elevated risks of heart disease or to evaluate the efficiency of a treatment
(cf. Kecklund & Åkerstedt, 2004). For this purpose, considerable efforts have been made to produce
wearable measuring devices, which may accompany people in their everyday lives. Various
solutions have been proposed, including devices that are worn on the finger, forehead, wrist or ear
region (Poh et al., 2011). These efforts are equally interesting for UX and usability research, as
interactive systems are no longer limited to stationary office environments. However, most of the
wearables proposed so far must still be connected to additional hardware (for power and
data acquisition), which may be bulky and cumbersome to handle (ibid.). Poh et al. (2011)
proposed an alternative approach to the problem, introducing a system called the Heartphones. The
idea is to integrate measuring equipment into devices that users are already familiar with, in this
case a (smart) mobile phone and a pair of modified earphones. The system uses PPG technology for
unobtrusive measurement of cardiovascular activity, so that users are free to carry out their
everyday tasks (ibid.). Another attractive solution was developed by Yoo et al. (2006), who
developed a wrist-band type PPG device with a Bluetooth communication interface to provide
mobility.
Heart rate (HR) is perhaps the most straightforward measure of cardiovascular activity. In a
review of ANS activity in emotions, Sylvia Kreibig (2010) provides a summary of the findings
related to HR response. She reports that HR has been found to increase for a number of negative
emotions (e.g. anger, anxiety, embarrassment, fear, crying sadness) as well as for some positive
emotions (e.g. happiness, joy) and surprise (which is hard to classify on a valence scale). A decrease
in HR, on the contrary, is observed when people experience affection, contentment or non-crying
sadness — emotions that, according to Kreibig, all involve an element of passivity. These findings
support the rather unsurprising conclusion that heart rate is a reflection of the level of autonomic
activation of an organism, associated with activation of the sympathetic branch of ANS.
However, heart rate is not only a reflection of sympathetic nervous system activity.
Research has demonstrated that the parasympathetic nervous system causes the heart to slow
down when we pay close attention to a stimulus, perhaps to allow the body to calm down until
proper assessment of the situation has been reached (Park, 2009). This knowledge may be useful in
usability testing, because it could help indicate whether a particular object or feature caught the
user’s attention or not. However, it should be noted that this response only occurs when a
subject is attending to external stimuli; internal processing, like solving a math problem, is instead
associated with increased HR (ibid.).
Heart rate variability (HRV), or sinus arrhythmia, is a measure of the fluctuations of the beat-to-
beat interval of the heart. The HRV response is influenced by a number of factors, including
physical activity, body posture, respiration, cognitive effort and state of arousal (Berntson et al.,
1997). According to Kecklund & Åkerstedt (2004), the heart's ability to beat faster or slower in
response to changing mental or physical demands tends to decrease as the level of stress or
cognitive workload increases. These states are thus associated with a decrease in HRV.
According to Rowe et al. (1998), HRV has been found to respond to transitions from rest to task
conditions in a large number of studies focusing on mental workload. When it comes to affective
assessment, some studies have suggested that HRV may be sensitive not only to the level of arousal,
but also to the emotional valence of a stimulus (cf. Rantanen et al., 2010).
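Two common time-domain HRV measures, the standard deviation of the beat-to-beat intervals (SDNN) and the root mean square of successive differences (RMSSD), can be sketched as follows; the interval series below are invented for illustration:

```python
import math

def hrv_measures(rr_intervals_ms):
    """SDNN and RMSSD from successive beat-to-beat (RR) intervals in ms."""
    n = len(rr_intervals_ms)
    mean_rr = sum(rr_intervals_ms) / n
    sdnn = math.sqrt(sum((x - mean_rr) ** 2 for x in rr_intervals_ms) / (n - 1))
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    rmssd = math.sqrt(sum(d ** 2 for d in diffs) / len(diffs))
    return sdnn, rmssd

# Hypothetical interval series: at rest the intervals vary more (higher
# HRV) than under cognitive load (lower HRV).
rest = [850, 870, 830, 880, 820, 860]
load = [700, 705, 698, 702, 699, 701]
print(hrv_measures(rest))
print(hrv_measures(load))
```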
When studying cardiovascular activity, it is important to be aware of the effects of changes
in activity or posture on the heart’s activity. This is perhaps not that problematic in usability
testing, where subjects are often asked to sit in front of a computer while performing the test.
However, it might be a problem for ambulatory assessment of a mobile interface, or if the target of
evaluation is a video game with an element of physical interaction.
Another unwanted influence on HR and HRV is respiration. In general, inhalation is
associated with inhibition of parasympathetic activity, which causes a temporary increase in heart
rate, while the opposite effect is observed during exhalation (Berntson et al., 1997). Therefore,
some researchers have suggested that respiration rate should be measured along with
cardiovascular activity, in order to control for the effects of breathing on the obtained signal (ibid.).
Another disadvantage of these measures is the dual influence of the sympathetic and
parasympathetic nervous systems on cardiovascular activity. This complicates the interpretation of
data, because the signal obtained is not informative of the respective branch’s influence on cardiac
functioning (Kreibig 2010). For example, acceleration of heart rate may be caused by increased
arousal (sympathetic activation), but it may also be an indication of decreased attention to external
stimuli (parasympathetic deactivation). Therefore, it is particularly important to consider the
context when analyzing cardiovascular responses (Park, 2009).
3.3.2 Skin Conductance
Skin conductance (SC) or galvanic skin response (GSR) is a well known indication of arousal, and
has long been used for example in lie detectors (Kecklund & Åkerstedt, 2004). In essence, GSR is a
reflection of sweat production; increased sweating leads to more moisture in the skin, a lower
electrical resistance and therefore higher conductance.
In order to obtain the galvanic skin response, a small electrical current is passed through
the skin, using a pair of electrodes (Barreto et al., 2007). These electrodes are usually placed either
on the palms or the soles, because these body parts have a particularly high concentration of sweat
glands (Park, 2009). However, using the hands or feet as measuring points may be somewhat
problematic. Most human-computer interfaces require free use of the hands to obtain successful
interaction, which makes sensors placed on the palm a significant limitation. Using the soles for
data acquisition may seem like the better choice, but unfortunately, this requires subjects to
remove their socks and keep their feet lifted throughout the session, to keep the sensors from
touching the floor (ibid.). These restrictions could certainly have a negative impact on the user
experience, and thus influence the outcome of the test. However, less intrusive applications of GSR
are under development. For example, Ming-Zher Poh et al. (2010) describe a wireless wristband
with built in electrodes, which can be used to measure skin conductance during everyday activities.
According to the authors, the components used for the sensor can be purchased off the shelf for
approximately $150. This may be compared to commercial systems (such as Flexcomp Infiniti,
www.thoughttechnology.com), which may cost over $6000 (Poh et al., 2010).
So what can we learn by monitoring skin conductance? Naturally, activity of the sweat
glands increases with physical activity; but in addition to this, GSR has been found to increase in
response to most affective states. The explanation for this phenomenon lies in the action
preparation associated with most affective states (Kreibig 2010). The most obvious examples of this
are perhaps anger – associated with preparation for fight – and fear – associated with preparation
for flight. By contrast, some emotional states are associated with a decrease in electrodermal
activity, which may in turn be taken as an indication of decreased motor preparation. This is true
for sadness, which is typically experienced when a loss has occurred that cannot be undone; relief,
which is experienced after a threat has passed; and contentment, which is experienced when a
satisfactory outcome has been attained. In these cases, the significant event has already occurred,
which makes further action futile (ibid.). Thus, it is only natural that activity in the sweat glands
decreases.
Importantly, skin conductance is directly related to activation of the sympathetic division of
ANS, and thus independent of parasympathetic activity (Park, 2009). This is an advantage, because it
means that GSR is less open to misinterpretation than many other physiological measures (such as
heart rate and pupil size, which are influenced by both divisions of ANS).
A disadvantage of GSR is the fact that it is hard to link an observed response to a particular
point in time. All measures of peripheral activity have relatively long response latency, typically a
few seconds, but GSR is particularly slow, with response latencies somewhere around 3 to 6
seconds from stimulus onset (Chanel et al. 2009, Park 2009). One reason for the large variation in
reaction time is that rather than constantly producing sweat, human sweat glands tend to ‘spout
out’ sweat. For this reason, it is not recommended to use GSR to identify the exact moment when a
response was triggered. Instead, researchers should calculate the average response over a period of
time, for example the duration of a task or the presentation of a stimulus, and then compare the
result to other such units (Park, 2009). In this way, researchers may compare the level of stress
elicited by different stimuli or types of tasks.
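Such window-based averaging could be sketched as follows; the sampling rate, conductance values and task windows are hypothetical:

```python
from statistics import mean

def mean_gsr_per_task(samples, fs, task_windows):
    """Average skin conductance over each task's time window, instead of
    trying to pin the slow GSR response to exact moments in time.
    task_windows maps a task name to a (start_s, end_s) pair."""
    return {task: mean(samples[int(start * fs):int(end * fs)])
            for task, (start, end) in task_windows.items()}

fs = 4  # samples per second
samples = [2.0] * 40 + [3.5] * 40  # conductance in microsiemens
result = mean_gsr_per_task(samples, fs, {"task_a": (0, 10), "task_b": (10, 20)})
print(result)  # {'task_a': 2.0, 'task_b': 3.5}
```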
In 2003, Ward & Marsden conducted a study in which they investigated whether a combination of
skin conductance (SC), heart rate (HR) and blood volume pulse (BVP) could be useful as a tool for
usability testing. Data was collected under rather loosely controlled HCI situations, with the aim of
identifying typical physiological patterns relating to different HCI events. The results revealed large
variations in range and magnitude of the GSR, both between different individuals and within the
same individual on different occasions. However, when the results were converted to percentage
variations, some general patterns could be observed (Ward & Marsden, 2003):
In low-stress situations with no significant events, there was a steady decrease in both
SC and HR, suggesting lowered levels of arousal. No sudden changes occurred after an initial
“settling down” period of 2-3 minutes.
During “normal” use of software in realistic situations, considerable fluctuations in HR, SC
and BVP were observed, although responses would remain around the same general level
Degree Project Report
Malin Jönsson Forne, 2012
21
through most of the interaction. However, when a known usability problem was encountered
(in this case a difficult-to-find link), a rapid increase in SC was observed.
Following an unexpected HCI event (in this case the appearance of an alert box),
participants would exhibit increases in SC and HR, indicating a sudden increase in arousal.
After a latency of 1 second, the most extreme response for SC was an increase of 63% over
the following 9 seconds. The unexpected stimuli would also produce increased fluctuation
in the physiological data.
Ward & Marsden concluded that the data seemed to indicate a relationship between physiological
measures (SC in particular) and different kinds of HCI events. However, it was also observed that
the experiments took place under rather loosely controlled situations, which allowed for a number
of uncontrolled sources of variability (ibid.).
3.3.3 Electrical Brain Activity
In addition to peripheral measures of physiological activity (like the ones mentioned above), there
are a number of ways to assess central processing. The human brain contains approximately 100
billion neurons, which communicate either through tiny electrical impulses, or by exchanging
chemicals, called neurotransmitters (Lee & Tan 2006). Every event, behavior, thought or emotion
produces millions of such impulses in the brain, which may be measured with technologies such as
EEG (electroencephalography), fMRI (functional Magnetic Resonance Imaging) or PET-scanning
(Positron emission tomography). Thereby, it is possible to analyze the activity in different regions
of the brain (e.g. the frontal, visual and motor cortex), and to identify recurring patterns.
While the PET and fMRI methodologies have many advantages, including a high spatial
resolution, EEG is currently considered the most suitable alternative for usability testing (Lee &
Tan, 2006, Chanel et al. 2009, Antonenko et al. 2010). First of all, modern EEG is comparatively
cheap and less intrusive than the alternative methods, which either require subjects to lie still
during data acquisition (fMRI) or to ingest substances before trial onset (PET; Antonenko et al.
2010). Moreover, EEG has a very short response latency compared to PET and fMRI, both of which
rely on variations in cerebral blood flow, which might not appear until several seconds after
emotion onset. On the downside, however, EEG requires direct contact between test subject and
measuring equipment, unlike for example fMRI (ibid.).
Unlike PET and fMRI, EEG measures electrical activity in the brain in a direct manner, by
placing electrodes along the scalp. An electrical impulse transmitted from a single neuron is too tiny
to be detected by the EEG, but the coordinated activity of large groups of neurons may result in
electrical fields that are strong enough to be measured from outside the skull (Lee & Tan 2006). The
signal obtained from each measuring point is passed through a differential amplifier, and the
resulting EEG is a waveform reflecting voltage variation over time. However, some electrical
impulses may be lost or scattered before they reach a measuring point, which means that the
obtained EEG is at best a crude representation of brain activity (ibid.).
A great challenge involved in using EEG relates to the presence of measuring artifacts, which
originate from electrical impulses that are unrelated to cerebral activity. Such artifacts may
originate from muscle tension, heart beats, eye blinks or body movement of any kind. Furthermore,
the electroencephalograph may pick up signals from electronic equipment in the test environment.
However, most contemporary EEG systems are equipped with robust software, which may facilitate
data analysis by removing some of the most common artifacts (Chanel et al. 2009).
Once the EEG has been obtained, the signal is usually analyzed by looking at the spectral power
in a set of standard frequency bands, which have been found to correspond to certain types of
neural activity (Lee & Tan, 2006). The different components of the EEG are extracted through signal
processing techniques, such as Fourier transformation. At present, it is believed that the brain
generates at least four basic rhythms or wave patterns. These are (Antonenko et al. 2010):
Delta waves (<4 Hz)
Theta waves (4-7 Hz)
Alpha waves (8-12 Hz)
Beta(-low) waves (>12 Hz)
In addition to those frequencies, the following two are often added to the list:
Beta-high waves (20-30 Hz)
Gamma waves (>30 Hz)
As we can see, the basic components of the EEG response form a continuum from low to high
frequencies. The naming of the components may seem confusing to anyone familiar with the Greek
alphabet, but reflects the order in which the different rhythms were discovered. In healthy
individuals, the low frequency delta waves are only present during sleep, while faster alpha waves
dominate when a subject is awake but inattentive (ibid.).
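As a rough illustration of this kind of spectral analysis, the following Python sketch (using NumPy) estimates the power in the alpha and theta bands from a discrete Fourier transform. The signal here is a synthetic stand-in for real EEG, built from one alpha-band and one weaker theta-band sine component:

```python
import numpy as np

def band_power(signal, rate_hz, low, high):
    """Power in the frequency band [low, high) Hz, estimated from the
    squared magnitude of the real FFT."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate_hz)
    mask = (freqs >= low) & (freqs < high)
    return spectrum[mask].sum()

# Synthetic 1-second "EEG" trace: a 10 Hz (alpha) component plus a
# weaker 6 Hz (theta) component. Real EEG would be far noisier.
rate = 256
t = np.arange(rate) / rate
eeg = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 6 * t)
alpha = band_power(eeg, rate, 8, 12)
theta = band_power(eeg, rate, 4, 7)
# The stronger alpha component yields the larger band power.
```

In practice, windowed estimators (e.g. Welch's method) are typically preferred over a single raw FFT, but the band-masking idea is the same.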
Another way to analyze EEG responses is to extract the event-related potentials, or ERPs.
This technique is often used in studies that investigate the EEG response to a specific task or
stimulus. The most common way to extract the ERP is through data averaging, which means that
the amplitude values over short epochs of time are averaged to create a new waveform (Coles &
Rugg 1995). The background EEG, i.e. brain activity that is unrelated to the significant stimulus, is
assumed to vary randomly, and will therefore tend to average to zero (ibid.). What is left after the
averaging is therefore largely a representation of the event related activity. Once the ERP is
obtained, principal component analysis (PCA) may be applied to identify its different components,
which may give information about cognitive states (ibid.). However, ERPs have a limited potential
for usability testing, because it typically requires presenting stimuli at regulated timings and under
carefully controlled conditions (Lee & Tan 2006).
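The averaging logic behind ERP extraction can be demonstrated with synthetic data: a fixed waveform is added to random background activity at known onsets, and averaging the time-locked epochs recovers it while the background tends toward zero. All signals below are artificial:

```python
import numpy as np

def extract_erp(eeg, onsets, epoch_len):
    """Average fixed-length epochs time-locked to each stimulus onset."""
    epochs = np.stack([eeg[o:o + epoch_len] for o in onsets])
    return epochs.mean(axis=0)

rng = np.random.default_rng(0)
n, epoch_len = 20000, 50
signal = rng.normal(0.0, 1.0, n)             # random background EEG
response = np.hanning(epoch_len)             # the "true" event-related waveform
onsets = np.arange(100, n - epoch_len, 100)  # 199 stimulus onsets
for o in onsets:
    signal[o:o + epoch_len] += response      # same response at every onset

erp = extract_erp(signal, onsets, epoch_len)
# With ~200 epochs, the background largely averages out and the
# recovered waveform closely matches the embedded response.
max_error = np.abs(erp - response).max()
```

This also shows why ERP studies need many controlled, precisely timed stimulus presentations: the quality of the recovered waveform depends directly on the number of averaged epochs.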
Lee and Tan (2006) observed that many HCI researchers were hesitant to explore the
domain of EEG, either because they felt that they lacked the required knowledge, or because of the
high cost of owning and maintaining the equipment. Traditional EEG systems are
indeed expensive, with high-end devices costing approximately USD 20,000-25,000 (ibid.).
Moreover, the equipment is difficult to handle and highly obtrusive. In typical medical applications,
between 16 and 25 flat metal discs (i.e. electrodes) are placed along the scalp using a sticky paste,
and each electrode is connected by wires to the recording machine (A.D.A.M. Medical Encyclopedia,
2012). However, recent technological advancements have allowed less intrusive implementations
of EEG, using caps or dry electrodes. Usability researchers can now gain access to (very) simple
wireless EEG headsets at prices starting from just under 100 USD (see e.g.
http://www.neurosky.com/ or http://www.emotiv.com/). While such devices
demonstrate that the use of EEG is no longer limited to laboratory settings, the number of
measuring points they provide is very small, which could compromise the value of the results
(Chanel et al. 2009).
In 2006, Lee & Tan performed a study in which they investigated the potential of a low-cost, 2-
channel EEG system (retailing at approximately 1500 USD). Two similar experiments were
conducted, of which only the second will be described here. The goal of the experiment was to
distinguish between three different tasks, based on differences in the resulting EEG patterns. Eight
subjects performed the following tasks, involving the computer game Halo (Microsoft Game
Studios):
Rest: Participants were asked to relax and fixate their eyes on the screen. No interaction
with the game occurred. (This task was used as the baseline).
Solo: Participants used keyboard and mouse to navigate through the game and interact with
objects in the environment. However, no enemies were visible in this task.
Play: Participants played against other participants, including an expert player who made
sure subjects were engaged in the game throughout the task.
The test sessions took place in an unmodified office environment, containing several computers,
fluorescent lights and other potential sources of noise. Due to the high variance in EEG properties
between individuals, the task classification procedure was performed separately for each
participant. The result was a mean classification accuracy of 92.4%, indicating that low-cost EEG
equipment could be sufficient for simply performing task classification and detection (Lee & Tan,
2006).
In a review of the use of EEG for cognitive load assessment, Antonenko et al. (2010) report
that two components of the EEG are sensitive to task difficulty manipulations: alpha and theta.
Alpha waves are dominant when subjects are awake but inattentive, and have been found to
decrease in response to mental effort. Theta activity, by contrast, has been found to increase with
cognitive load. Theta and alpha waves can thus be combined to assess mental effort (Antonenko et
al. 2010).
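Since alpha decreases and theta increases with mental effort, the two observations could in principle be combined into a single index, for example a theta/alpha ratio. The sketch below is only an illustration of that idea; the band-power values are made-up numbers, not empirical data:

```python
def cognitive_load_index(theta_power, alpha_power):
    """Illustrative index: theta rises and alpha falls with mental
    effort, so the theta/alpha ratio should grow under higher load."""
    return theta_power / alpha_power

# Hypothetical band powers for an easy and a hard task.
easy_task = cognitive_load_index(theta_power=4.0, alpha_power=8.0)
hard_task = cognitive_load_index(theta_power=6.0, alpha_power=5.0)
# The harder task yields the larger index.
```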
So far, relatively few studies have investigated the practical usefulness of EEG for emotion
assessment (Chanel et al. 2009). The amygdala is the main seat of emotions in the brain, but the
pre-frontal cortex is also involved in affective processing, especially the appraisal of emotional
stimuli. When subjects are confronted with emotional stimuli, we can observe different response
patterns in the pre-frontal cortex depending on the valence properties of the stimuli; negative
stimuli are associated with high alpha activity, while positive stimuli are associated with low alpha
activity (Davidson et al. 2003). In a study from 2009, Chanel et al. investigated the potential of EEG
for distinguishing between three affective states: positive excitement (high arousal, positive
valence), negative excitement (high arousal, negative valence) and calm-neutral (low arousal,
neutral valence). Using 64-electrode EEG equipment, they obtained a classification accuracy of 70%
when different sets of EEG features were combined. The accuracy increased further when
peripheral measures (GSR, respiration and blood volume pressure) were combined with the EEG
analysis (Chanel et al., 2009).
4 Pupillometry
This chapter revolves around pupillometry, i.e. the study of pupillary movements. In the first
section, I provide an introduction to pupillary movements, and explain some of the most important
factors that may have an impact on pupil size. Thereafter, I explain how pupillometric data may be
obtained and analyzed, and discuss some practical issues related to pupil size measurement. In
section three, I provide an overview of previous studies dealing with emotion, cognition and
pupillary movements, based on my literature review of the subject. Finally, I present a minor pilot
study, in which I investigate the potential and practical challenges of pupillometry as a tool for UX
and usability testing.
4.1 Pupillary Movements
For anyone who wishes to understand pupillometric data, it is important to know that pupil size is
not determined by one single factor, but by a complex interaction between different processes in the
body. This section provides an introduction to some of the most important pupillary movements,
and the factors that lie behind them.
All pupillary movements are governed by two antagonistic sets of muscles in the iris: the
sphincter pupillae and the dilator pupillae. The sphincter muscles constrict the pupil when activated,
whereas contraction of the dilator muscles is associated with pupil enlargement (Beatty & Lucero-
Wagoner, 2000). The two sets of muscles thus work as a reciprocal system, in which activation of
one muscle group is accompanied by inhibition of the other (Loewenfeld, 1993). Thus, the diameter
of the pupil is determined by the relative activation of the two muscle groups.
Both dilation and constriction of the pupil are controlled mainly by the autonomic nervous
system, but while activation of the dilator muscles is linked to the sympathetic branch of ANS, the
sphincter muscles are controlled by the parasympathetic branch (Beatty & Lucero-Wagoner, 2000).
4.1.1 Optical Reflexes
The primary function of the pupil is to control the amount of light that enters the eye, much like
changing the aperture of a camera lens. In dim light conditions, the pupil dilates to allow more light
to enter the eye, while in bright light conditions, the pupil constricts to shut out some of the light.
This is referred to as the light reflex (Beatty & Lucero-Wagoner, 2000). In humans, pupil diameter
may vary from less than 1 to more than 9 mm due to luminance conditions (ibid.). Another optical
response is the accommodation response, or near reflex. This reflex allows the eye to adapt to
different fixation distances by changing the curvature of the lens, and thereby the pupil diameter
(ibid).
The reflexes described above both have clear-cut optical functions, and are unrelated to
cognitive and emotional processing. For the purposes of this study, therefore, they may be regarded
as disturbing factors. The accommodation response is problematic in settings that require varying
fixation distances, but is less important in cases where subjects are looking at a fixed
computer screen throughout the test session (as is usually the case in usability testing). The light
reflex, by contrast, cannot easily be overlooked. Even very slight changes in luminance levels can
trigger a response, which makes the light reflex an issue for any study involving visual stimuli.
As stated by Irene E. Loewenfeld (1993):
“Anyone familiar with the low threshold of the pupillary light reflex knows, of course, that it is
impossible to change from one picture to a recognizably different one without the likelihood of a
pupillary change.“
Loewenfeld further concludes that it is not enough to just control the overall brightness of a picture
or a computer screen, as some researchers have attempted.
In addition to the relatively large scale movements of the light reflex, there are also tiny
oscillations of pupil size, which increase in frequency with intensity of illumination (Loewenfeld
1993). These continuous oscillations, sometimes referred to as pupillary unrest, are absent only in
dim light or darkness (when they are replaced by slower, pulsing movements, called “fatigue
waves”).
4.1.2 Reflex Dilation
It has long been known that in conscious, healthy individuals, any sensory, emotional or mental
stimulus (with the exception of light) elicits pupillary dilation (Loewenfeld 1993). As early as 1910,
the German neurologist Oswald Bumke concluded:
“We know today that every mental event, every physical effort, every impulse of will, each activation of
attention, and especially each affect causes pupillary dilation” (translated in Loewenfeld, 1993).
More recent studies have provided solid evidence for this statement (see e.g. Goldwater, 1972,
Loewenfeld, 1993). The kind of pupillary movements listed above, caused by cognitive or emotional
factors rather than optical phenomena, are generally referred to as reflex dilation. This response is
typically observed around 300 to 500 ms after stimulus onset, and has a peak amplitude of less than
0.5 mm (Beatty & Lucero-Wagoner, 2000). Like the light reflex, the dilation
response is a fleeting movement, accompanied by continuous oscillations of pupil size. However,
the fluctuations associated with the dilation response are more irregular and sharp than those linked
to the light reflex, often exhibiting large jumps followed by rapid declines in pupil diameter
(Marshall, 2000).
As Bumke's words suggest, reflex dilation may occur in response to stimuli that are
cognitive or emotional, internal or external. This is bad news for this study, because it makes it hard
to determine the cause of an observed dilation; should it be interpreted as a sign of emotional
arousal, cognitive load, physical effort or something completely different? In 1979, Stanners et al.
observed that most pupillometric research had so far focused on either a cognitive or an affective
(arousal) interpretation of pupillary responses. As we shall see in the following section, this
observation still holds today, over three decades later. However, Stanners et al. (1979) conducted a
study in which they investigated the interaction between cognitive and emotional effects on pupil
size. They found that arousal manipulations had an influence on pupil size only when the cognitive
demands of the task were minimal. The authors concluded that cognitive demands take priority
over arousal factors as determinants of pupillary response. Beatty (1982) came to a similar
conclusion in a review article, where he concluded that emotional factors are relatively
unimportant as determinants of pupil size in information-processing tasks. According to Beatty,
emotional factors are more likely to affect the baseline pupillary diameter, rather than the phasic
responses studied in cognitive pupillometry.
American psychologist Sandra P. Marshall has suggested a method for separating the
emotionally driven pupillary responses from those that are cognitively driven. In a patent accepted
in 2003 (U.S. Pat. No. 6,572,562), Marshall describes an approach based on comparisons between
the respective responses obtained for the left and right eye. According to Marshall, differences
between pupillary responses are reflective of the difference between the two brain hemispheres,
i.e. the “left brain” (associated with logical and analytical thinking) and the “right brain”
(associated with creative, emotional and intuitive thinking). This is an interesting approach, but so far, it does
not seem to have caught the attention of the pupillometric research community. Therefore, it is
hard to draw any conclusions concerning the validity of Marshall's approach.
It is important to observe that cognitive and emotional stimuli can only evoke pupil dilation
(i.e. enlargement); constriction of the pupil can only occur in response to light. In other words, the
light and dilation reflexes have opposite impacts on pupil size. Because reflex dilations have
relatively small amplitude compared to the light response, even small changes in light conditions
may be enough for the light reflex to “overrule” a dilation response. For example, a sudden flash of
light, which should on the one hand produce reflex dilation, and on the other hand cause light-
induced constriction of the pupil, will normally result in a decrease in pupil size (Loewenfeld 1993).
Different approaches to dealing with this problem will be discussed in the next section of this chapter.
Another important feature of reflex dilation is that its magnitude depends on the cognitive
or emotional significance of a given stimulus to the individual. For example, Beatty (1982) describes
an experiment in which subjects had to identify occurrences of a specific tone in one ear, while
tones of another frequency were presented in the other ear. He found that small but reliable
dilations occurred in response to the relevant tone (i.e. the one that had cognitive significance),
while no variation in pupil size was observed in response to the tones that were not attended to.
This underlines the importance of task formulation in pupillometric studies (as in all usability
studies). Similarly, it has been observed that when a stimulus is repeated at monotonous intervals,
the dilation response gradually decreases, as the subject becomes habituated to the stimulus.
However, this is not always true; if the stimulus has some annoying feature, its emotional impact,
and therefore the pupillary response, may instead increase over time (Loewenfeld 1993).
4.2 Measuring Pupil Size
The pupillary system is a very sensitive, low-noise source of psychophysiological data, which can be
measured in a number of different ways (Beatty & Lucero-Wagoner 2000). Early studies (e.g. Hess
& Polt 1960) simply photographed the eye at a given sampling rate, projected the pictures
obtained on a large screen and then measured the pupil with a regular ruler. While this method
proved precise enough to detect large-scale variations in pupil size, it was both labor-intensive and
limited in temporal resolution. Over the last half century, custom pupillometry systems, i.e.
pupillometers, have gradually emerged (Klinger & Hanrahan, 2008). Today, there are hand-held
pupillometers on the market which are both precise and practical to use (see for example
www.neuroptics.com). Recently however, several research groups within the field of HCI have
started to take advantage of the pupillometric capabilities of eye trackers for pupil size
measurement (Klinger et al., 2008). A major advantage of eye tracking is of course that both gaze
and pupil data are recorded with the same equipment, which means that more information is
available for the analysis, without complicating the data collection procedure.
Although eye tracking has been around for more than 150 years, it has only recently started
to reach its full potential (Bartels & Marshall, 2012). Over the last few years, the field of eye tracking
has developed rapidly, resulting in systems that are more powerful, easier to handle and less
obtrusive (ibid). There are two main types of eye tracking systems available on the market today:
remote and head-mounted eye trackers. The basic methodology is the same for the two types, as
both rely on the video-based solution described in the introduction to this report. However, there
are some important differences, which will be discussed in the following.
Head-mounted eye tracking systems have one important advantage: the tracking unit is
fixed to the user’s head, which means that the relative position of the eyes and the tracker stays the
same as the user moves her head. Thereby, gaze data can be recorded while the user is walking
around and performing everyday tasks. This feature is of course important if the target of your
study is a vending machine, or some other “real-world” object. On the other hand, the fact that
physical contact is required between user and tracking device may be perceived as a disadvantage.
For example, Marshall (2002) reported that some of her experimental subjects were bothered by
wearing a head-mounted eye tracker, and that this may have affected the results of her study. It
should be observed, however, that recent technical developments have resulted in less obtrusive
devices, such as glasses with built-in eye tracking capabilities (see for example
www.eyetracking-glasses.com and www.tobiiglasses.com/scientificresearch).
If the target of your study is a web-page, a computer game or some other desktop-based
user interface (which is usually the case in HCI research), then it might be preferable to use remote
eye tracking, which eliminates the need for physical contact between user and eye tracking device
altogether. As mentioned in the introduction to this report, modern eye tracking may be
incorporated into a system that resembles a standard desktop monitor, which allows for highly
unobtrusive data collection (Klinger et al., 2008). Modern remote trackers can compensate for head
movements, as long as the user does not turn away from the screen (cf. Tobii Technology, 2010).
Until recently, however, HCI researchers mostly used head-mounted eye trackers for pupillometric
studies, because remote systems were not considered precise enough for that purpose (Klinger et
al., 2008). Over the last few years, however, several studies, including Klinger et al. (2008),
Palinko et al. (2010) and Bartels and Marshall (2012), have shown that remote eye tracking does
provide enough precision for detailed pupil size analysis.
Another thing that sets different eye tracking systems apart is the way in which the pupil size
is determined. In video-based eye tracking, the optical sensor registers an image of the eyes, which
may then be used to calculate pupil size. However, the way in which pupil size is extracted from the
pupil image differs between different systems. One method is simply to count the number of pixels
encompassed by the pupil in the eye image (Klinger et al., 2008). However, the value obtained with
this method will be affected by changes in gaze direction, because of the curvature of the lens (cf.
Pomplun & Sunkara, 2003). For example, a subject looking straight at the camera will yield a pupil
image that occupies a larger number of pixels than if the pupil had been captured from the
side. Another problem with the pixel-counting approach is that the pupil image is not always
perfect; artifacts such as eyelids, eyelashes, shadows and reflections from the environment may
cause partial occlusion of the pupil, which may also result in inaccurate estimations (Kumar et al.,
2009). Another common approach is to calculate the pupil diameter as the length of the major axis
of an ellipse fitted to the pupil image (Klinger et al., 2008). This solution eliminates some of the
problems involved in pixel-counting, but may instead yield some minor errors due to non-circular
pupil shapes (ibid.). More recently, new eye tracking systems (such as Tobii’s T/X series eye
trackers; Tobii Technology, 2010) have adopted more sophisticated algorithms, in which the pupil
image is used to calculate a 3D model of the eye. According to Tobii, this method provides a pupil
size that is closer to the external, physical size of the pupil than can be obtained by measuring pupil
size directly from the eye image. However, when performing pupillometric studies, the exact size
of the pupil in millimeters is often less important than the change in pupil size over time.
Nevertheless, it might be helpful to know which pupil measurement approach is applied by your
eye tracking device, in order to better understand the potential sources of error.
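For illustration, the two simpler approaches (pixel counting and ellipse fitting) can be expressed as small formulas. The pixel scale and measurement values below are hypothetical, not taken from any particular eye tracker:

```python
import math

def diameter_from_pixel_count(pixel_count, mm_per_pixel):
    """Pixel counting: treat the pupil image as a filled circle and
    recover the diameter from its area."""
    area_mm2 = pixel_count * mm_per_pixel ** 2
    return 2.0 * math.sqrt(area_mm2 / math.pi)

def diameter_from_ellipse(major_axis_px, mm_per_pixel):
    """Ellipse fitting: report the major axis of the fitted ellipse,
    which is less sensitive to foreshortening at oblique gaze angles."""
    return major_axis_px * mm_per_pixel

# Hypothetical scale and measurements.
scale = 0.05                                 # mm per pixel (made up)
d_pixels = diameter_from_pixel_count(2827, scale)
d_ellipse = diameter_from_ellipse(60, scale)
# Both yield roughly a 3 mm pupil for these example values.
```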
Once pupillometric data has been collected, the next challenge is to perform adequate data
processing, in order to extract the relevant information. One of the most common measures to be
extracted from pupillometric data is the mean pupil diameter (MPD), which is calculated as the
average pupil diameter over a given interval of time (e.g. the duration of a task), minus the baseline
diameter (Beatty & Lucero-Wagoner, 2000). An advantage of MPD is that it is rather insensitive
random variations in the data, due to for example eye blinks (depending, of course, on the severity
and frequency of measurement errors). On the other hand, there are some sources of bias to
consider when analyzing the averaged pupil size. For example, trial length may vary across
subjects, which will have consequences for the value obtained when data from different trials are
combined. Unless some kind of weighting procedure is adopted, a subject who needed more time to
complete the tasks will have larger impact on the obtained average (ibid.). In such cases, it may be
better to use peak dilation, which is an equally straightforward measure; the baseline diameter is
simply subtracted from the maximum value obtained in the interval. However, it is important to keep
in mind that because this measure consists of a single value, it is more vulnerable to random
variations in the data than MPD (ibid.).
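The two measures can be sketched as follows; the baseline and trace values below are hypothetical:

```python
import statistics

def mean_pupil_dilation(trace_mm, baseline_mm):
    """MPD: average diameter over the interval, minus baseline (mm)."""
    return statistics.mean(trace_mm) - baseline_mm

def peak_dilation(trace_mm, baseline_mm):
    """Peak dilation: maximum diameter over the interval, minus
    baseline (mm). Being a single sample, it is more vulnerable to
    stray artifacts than the mean."""
    return max(trace_mm) - baseline_mm

baseline = 3.0                               # hypothetical baseline, mm
trace = [3.1, 3.2, 3.3, 3.2, 3.1]            # hypothetical task-evoked trace
mpd = mean_pupil_dilation(trace, baseline)   # ~0.18 mm
peak = peak_dilation(trace, baseline)        # ~0.30 mm
```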
If the pupillary response is to be analyzed in more detail, it is important to address blink
artifacts in the data before further processing is performed. There are several possible approaches
to blink detection, but most solutions start by identifying data losses or values that fall below a
certain threshold value for the approximate duration of an eye blink (70-100 ms according to
Marshall 2000). Such occurrences are then removed or compensated through linear interpolation
(cf. Marshall 2000, Janita et al. 2010, Gao et al. 2010). Because lid-closure is associated with a slight
dilation and reconstriction of the pupil, due to the resulting change in light conditions, a few data
points before and after the blink should also be included in the blink removal (Loewenfeld 1993).
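A simple threshold-and-interpolate scheme along these lines might look as follows. The threshold, padding and trace values are illustrative assumptions, not parameters from the cited studies:

```python
def remove_blinks(trace, threshold=1.0, pad=2):
    """Replace samples below `threshold` (plus `pad` neighbours on each
    side, covering the lid-closure dilation/reconstriction) by linear
    interpolation between the surrounding valid samples."""
    n = len(trace)
    bad = set()
    for i, v in enumerate(trace):
        if v < threshold:
            bad.update(range(max(0, i - pad), min(n, i + pad + 1)))
    clean = list(trace)
    i = 0
    while i < n:
        if i in bad:
            j = i
            while j in bad:           # find the end of this bad run
                j += 1
            left = clean[i - 1] if i > 0 else None
            right = clean[j] if j < n else None
            left = left if left is not None else right
            right = right if right is not None else left
            span = j - i + 1
            for k in range(i, j):     # linearly bridge the gap
                clean[k] = left + (k - i + 1) / span * (right - left)
            i = j
        else:
            i += 1
    return clean

# Hypothetical 3 mm pupil trace where a blink is recorded as data loss (0.0):
trace = [3.0, 3.0, 3.0, 3.0, 0.0, 0.0, 0.0, 3.0, 3.0, 3.0]
clean = remove_blinks(trace)          # the gap is bridged smoothly
```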
Another issue to consider in pupillometric studies relates to the presentation of data. At first
glance, it may seem reasonable to present the result in terms of percent dilation from baseline.
However, Beatty and Lucero-Wagoner conclude in a review from 2000 that the generally adopted
convention is to report both baseline diameter and pupillary diameter in millimeters. According to
the authors, this is a more appropriate practice, since all available evidence suggests that the
magnitude of the task-evoked pupillary response is independent of baseline diameter.
Consequently, a percent dilation approach would result in larger responses in cases where the
baseline diameter is small, and smaller responses in cases where the baseline was large, even
though the actual (absolute) dilation might have been the same (Beatty & Lucero-Wagoner, 2000).
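A small numerical example makes the bias concrete: the same absolute dilation yields different percentage values at different baselines (the millimeter values are made up):

```python
def percent_dilation(baseline_mm, dilated_mm):
    """Dilation expressed as a percentage of the baseline diameter."""
    return 100.0 * (dilated_mm - baseline_mm) / baseline_mm

# The same 0.3 mm task-evoked response at two different baselines:
small_baseline = percent_dilation(3.0, 3.3)   # ~10 %
large_baseline = percent_dilation(6.0, 6.3)   # ~5 %
# Percent reporting makes an identical absolute response look twice as
# large at the smaller baseline, which is the bias Beatty and
# Lucero-Wagoner warn against.
```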
4.3 Previous Studies
This section provides a review of the most central pupillometric findings related to this study. The
first part is a review of studies where pupil size is used for affect recognition, in research fields such
as affective computing. In the second part, the focus is on studies related to cognitive science and, in
particular, cognitive load assessment. Finally, I present some previous attempts to eliminate light
induced changes in pupil size from pupillometric data.
4.3.1 Pupillometry in Affect Recognition
Never has the pupil been as popular as in the 1960s and 1970s, following a series of articles on
pupil size measurement by E.H. Hess and James Polt (1960, 1964). While many of their conclusions
were essentially reconfirmations of what had already been known, one particular finding has been
the source of considerable controversy (Loewenfeld 1993). In a highly influential article, Hess
(1965) reported that the pupil reacted with “extreme dilation” to interesting or pleasing visual
stimuli, while displeasing stimuli caused “extreme constriction”. This “bi-directional” theory on
pupillary responses (i.e. that the pupil could either dilate or constrict in response to emotional
stimuli) gave promise of a marvelous new method, which would allow scientists and market
researchers to assign an “interest value” to everything from consumer products to political
candidates (Loewenfeld 1993).
In the years that followed, pupil size measurement was adopted in both commercial and
academic research, where it was used as a means to detect attitudes of like or dislike towards
package designs, different foods, nude pictures or human faces, just to name a few (Loewenfeld
1993). Unfortunately, these studies relied on false promises. In 1993, Irene E. Loewenfeld published
an extensive review of pupillary research (including more than 100 studies dealing with the bi-
directional theory), in which she concluded:
“It has been shown over and over again that [...] emotional stimuli and all other sensory and
psychologic stimuli - with the exception of light and of stimuli that alter the eye’s near point of vision -
do not constrict the pupil but dilate it.” (p. 663)
This does not mean that emotional stimuli do not affect pupil diameter at all, only that there is no
bi-directional relationship between valence and pupil size; positive and negative emotions both
result in pupil enlargement. According to Loewenfeld, the findings obtained by Hess and some of
his followers were probably experimental artifacts, resulting from the influence of luminance
conditions, as all of these studies used visual stimuli. This is of course bad news for this study,
because it means that pupil measurement alone cannot tell us whether a user is frustrated
(negative valence) or delighted (positive valence) with an interface. It can, however, distinguish
between states of different emotional arousal.
Although affective pupillometry did not live up to the promises of early studies, it has
caught some attention in the field of affective computing. In 2002, Partala & Surakka investigated
the potential of pupillometry as a tool for affective computing, using a modern (50 Hz) eye tracking
system. In the study, subjects were confronted with emotional sounds with different valence, for
example a baby laughing (positive), a baby crying (negative) or an office background sound
(neutral). By using auditory rather than visual stimuli, Partala & Surakka limited the impact of the
light reflex. The study revealed significantly larger pupillary responses to both positive and
Degree Project Report
Malin Jönsson Forne, 2012
30
negative stimuli, as compared to neutral stimuli. Once again it was concluded that pupillary
responses cannot be used to discriminate between different emotional valences; however, it does
vary with different levels of arousal (both positive and negative stimuli were arousing, while the
neutral stimuli were not).
In a similar study from 2008, Bradley et al. investigated the pupillary responses to
emotionally toned pictures from the International Affective Picture System (IAPS; Lang et al, 2005).
In this study, however, heart rate and skin conductance were measured concurrently, in order to
confirm the assumption that pupillary changes are mediated by sympathetic and parasympathetic
activation (if so, a co-variation between the different physiological measures would be observed).
The selection of stimuli from the IAPS consisted of an equal number of neutral, pleasant and
unpleasant pictures, making up a total of 96 pictures. The mean luminosity levels of the pictures
were adapted (using Adobe Photoshop), so that the mean luminosity was the same for each of the
three picture sets. Once again, the study showed that pupillary responses were larger when viewing
emotionally arousing pictures, regardless of whether they were pleasant or unpleasant. This
pattern was closely paralleled by the skin conductance response. For heart rate, however, a
different response pattern was found, in which pleasant and neutral pictures prompted very similar
responses, while unpleasant pictures prompted a significantly larger cardiac deceleration
(parasympathetic activation, see 3.3.1). The authors concluded that, taken together, the data
provided strong support for the hypothesis that pupillary responses to affective stimuli are
associated with an increase in sympathetic activity (Bradley et al., 2008).
More recently, a number of studies in affective computing have investigated the usefulness
of pupillometry as an indication of different emotional states. Barreto, Gao and colleagues
measured pupil size together with other physiological signals, in order to compare how well the
different measures could distinguish between different stress levels (Barreto et al., 2007, Gao et al.,
2010). The stimulus used for stress elicitation was the same in both studies: a classical Stroop Color-
Word Test, in which users are asked to identify color-words on a screen (e.g. “red”), without being
distracted by the actual color they are written in.
In the first study by Barreto et al. (2007), four physiological measures were used: Pupil
Diameter (PD), Galvanic Skin Response (GSR), Blood Volume Pulse (BVP) and Skin Temperature
(ST). After the data had been collected, each measure was normalized in order to eliminate
individual differences, using values obtained in an introductory phase as a baseline. Then, the
different signals were evaluated in terms of their ability to discriminate between high and low-
stress segments of the interaction. In the case of pupil diameter, the average value of PD over each
segment was used. The results showed significantly more discriminating potential for PD than for
the other measures, while ST showed particularly limited potential (Barreto et al., 2007).
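The baseline-normalization step described above can be sketched as follows. This is only an illustration of the general idea, not Barreto et al.'s exact procedure; the function name and all sample values are invented.

```python
import numpy as np

def normalize_to_baseline(signal, baseline):
    """Express a physiological signal as relative deviation from its
    resting baseline, so values from different people become comparable."""
    b = float(np.mean(baseline))
    return (np.asarray(signal) - b) / b

# Hypothetical pupil diameter samples (mm) for one participant:
baseline = [3.0, 3.1, 2.9, 3.0]        # introductory (relaxed) phase
stress_segment = [3.6, 3.7, 3.5, 3.8]  # high-stress interaction segment

normalized = normalize_to_baseline(stress_segment, baseline)
# The segment mean, expressed relative to baseline, can then be compared
# across participants and across the other physiological measures.
print(normalized.mean())
```

In Barreto et al.'s study, segment averages of normalized signals such as this one were what entered the high-stress versus low-stress comparison.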
In the following study by Gao et al. (2010), PD was compared to GSR and BVP in a similar
experimental set-up. In addition to the Stroop Test, however, occasional flashes of light were added
as a stimulus, in order to see whether the light reflex could be cancelled out. The first step of the
signal processing was to remove interruptions in the PD signal due to blinking. The signal was
passed through a low-pass filter and interruptions were compensated by linear interpolation. Then,
a so-called adaptive interference canceller (AIC) was used to divide the obtained pupillary response
into one signal of interest (changes caused by affective responses) and one interference signal
(changes caused by the light reflex). The GSR and BVP signals were also processed before they were
used for affective assessment. Once again, the goal was to discriminate between stressed and
relaxed states, and once again, the PD signal gave significantly better results as compared to the
other measures (77.78% accuracy compared to 54.44% for the best alternative). Moreover, when
GSR and BVP were combined with PD, the accuracy actually decreased slightly (to 76.67%). Note
that these results were obtained in spite of the temporary illumination increases. The authors
concluded that pupil diameter may be one of the most important signals to involve in affective
recognition (Gao et al., 2010).
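The first pre-processing steps reported by Gao et al. (filling blink gaps by linear interpolation, then low-pass filtering) can be sketched roughly as below. The threshold and the simple moving-average filter are my own illustrative choices, not theirs, and the adaptive interference canceller is omitted.

```python
import numpy as np

def preprocess_pd(pd_signal, valid_mask, kernel=5):
    """Fill blink gaps by linear interpolation, then smooth with a
    moving-average filter as a crude low-pass stage."""
    pd_signal = np.asarray(pd_signal, dtype=float)
    x = np.arange(len(pd_signal))
    # Interpolate across samples where the tracker lost the pupil (blinks)
    filled = np.interp(x, x[valid_mask], pd_signal[valid_mask])
    window = np.ones(kernel) / kernel
    return np.convolve(filled, window, mode="same")

pd = np.array([3.0, 3.1, 0.0, 0.0, 3.2, 3.3, 3.4])  # zeros = blink samples
valid = pd > 0.5
clean = preprocess_pd(pd, valid)
```

A real pipeline would also need to handle gaps at the very start or end of a recording, which `np.interp` simply clamps to the nearest valid value.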
4.3.2 Cognitive Pupillometry
The relation between pupil size and mental effort has been extensively studied in
psychophysiology, and is referred to as cognitive pupillometry (Beatty & Lucero-Wagoner 2000).
One of the earliest studies of the so-called task-evoked pupillary response (TEPR) was performed by
Hess and Polt in 1964. They studied pupillary responses while subjects performed mental
arithmetic, which gradually increased in complexity (e.g. 7*8, 13*14, 16*23). Pupil size was
measured using a camera, photographing the right eye with a sample rate of two frames per second
(i.e. 2 Hz). When averaged over all five test subjects, the results showed a gradual increase in pupil
diameter as more complex calculations were performed. A few years later, Kahneman (1966)
conducted a similar study, in which subjects were asked to remember strings of digits. Again, it was
found that pupil diameter (of the right eye) increased with task difficulty, that is, as the number of
digits in the string increased. Kahneman concluded that pupil size could be used as a measure of
memory load, or the amount of material in active processing. Subsequent studies have provided
repeated evidence for the ability of the pupil to reflect task difficulty, regardless of the nature of the
task. Typically, the pupil dilated within a second after a task was presented, returning to baseline
immediately after the answer had been given (Goldwater 1972).
While most early studies used the average pupil diameter (or MPD) as a measure of
cognitive activity, more recent work has applied complex data processing to extract the relevant
information from pupillary responses. The studies by Gao, Barreto et al. described in the previous
section are one example of more complex procedures, although their focus was on affect recognition.
For cognitive pupillometry, a data processing module called the index of cognitive activity (ICA) has
been used in a large number of studies over the last decade or so (Palinko et al. 2010). Instead of
using average pupil diameter, ICA measures the number of abrupt discontinuities per second in the
PD signal (for a more detailed description of the procedure, please refer to the next section). The
index was developed by Sandra P. Marshall, and was considered original enough to be granted a
patent in 2000 (U.S. Pat. No. 6,102,870).
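Marshall's actual index relies on wavelet decomposition and is patented, so the following is only a rough illustration of the underlying idea of counting abrupt discontinuities per second, with an arbitrary fixed threshold standing in for the wavelet step.

```python
import numpy as np

def discontinuities_per_second(pd_signal, sample_rate, threshold):
    """Count abrupt sample-to-sample jumps in a pupil-diameter signal.
    A crude stand-in for the ICA idea; the real index uses wavelet
    analysis rather than a fixed threshold."""
    jumps = np.abs(np.diff(pd_signal)) > threshold
    duration = len(pd_signal) / sample_rate
    return jumps.sum() / duration

# Hypothetical 1-second recording at 10 Hz with two abrupt dilations
pd = np.array([3.0, 3.0, 3.2, 3.2, 3.2, 3.2, 3.45, 3.45, 3.45, 3.45])
rate = discontinuities_per_second(pd, sample_rate=10, threshold=0.15)
print(rate)  # 2.0 discontinuities per second
```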
The effectiveness of the ICA for detecting variations in cognitive load has been verified in a
number of studies. For example, Marshall, Pleydell-Pearce & Dickson (2002) demonstrated that the
ICA increased with task difficulty for a simple interactive task. In addition, they found that the index
could be used to detect strategy shifts (which are usually associated with a change in cognitive
load). Perhaps more importantly, however, the results generally corresponded to those found in
EEG studies of the same task. The authors concluded that pupil size measurement, being cheaper
and more portable than EEG, could potentially be used as a precursor to EEG studies, or to validate
findings obtained with EEG in the field.
As we have seen, there is extensive evidence for the correlation between pupil size and cognitive
workload for simple cognitive tasks. But how can this knowledge be implemented and benefited
from in the context of HCI? Iqbal et al. (2004, 2005) focused on just that in a series of studies
investigating how pupillometry might be used to manage user attention in HCI. As previously
mentioned (see 2.2.1), empirical evidence suggests that interruptions are less disruptive when they
occur during a period of low mental workload, rather than when the user is actively engaged in a
task, which means that efficient timing of system notifications could have a positive effect on user
performance (Bailey et al., 2006). In two consecutive studies (2004 & 2005), Iqbal et al. used a
head-mounted eye tracker (EyeLink II) to measure pupillary movements while users performed a
number of cognitive tasks. The first study (Iqbal et al., 2004) involved four task categories, each of
which had two levels of difficulty (i.e. easy/difficult): reading comprehension, mathematical
reasoning, visual search and sorting emails (using drag and drop). The baseline pupil diameter was
obtained before the first task, while subjects fixated on a blank screen for 10 seconds. In addition to
pupillometric data, subjective ratings of difficulty and completion time were collected, in order to
validate the workload reflected in the PD response. Once the data collection had been performed,
the percentage change in pupil size (PCPS) was computed for each user, using the following formula
(cf. 3.2):
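The formula referred to is missing from this copy. Judging from the name of the measure and from how it is commonly defined, it is presumably the relative deviation from the baseline diameter:

```latex
\mathrm{PCPS} = \frac{PD_{\mathrm{measured}} - PD_{\mathrm{baseline}}}{PD_{\mathrm{baseline}}}
```

Averaging this quantity over a task (or subtask) then yields the APCPS used in the analysis.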
In order to compare the mental workload of the different tasks, the average PCPS was computed
over each task (8 in total). If the method was successful, the difficult version of each task would
render a higher average PCPS (APCPS) than the easier version. In the first analysis, however, only
the search task rendered a statistically significant PD difference between high and low mental
workload. The authors attributed this result to the hierarchical nature of the other three task
categories. For example, the email sorting task does involve a cognitive component, but it also
includes a motor component (i.e. the dragging and dropping of emails). This structure means that
the same level of mental workload will not be sustained over the entire period of task execution,
which might explain the unpredicted results. In a second analysis, therefore, the tasks were
decomposed into several subtasks. This time, a good correlation between pupil size and cognitive
load was observed.
In a second study, Iqbal et al. (2005) built on these results to further explore the workload
changes involved in interactive tasks. This time, two different tasks were used: route planning and
text editing. Both tasks involved carefully controlled subtasks, which were designed to be
representative of those involved in typical interactive tasks, e.g. selection, data entry, memory store
and recall, information processing, reasoning and motor movements. Again, APCPS was found to
vary in a predictable manner among subtasks, according to the level of cognitive workload imposed
by the task. Moreover, a significant decrease in APCPS was observed at task boundaries. The
authors suggested that an Index of Opportunity may be derived from the PD signal, indicating
moments where interruptions may occur at a lower cost (Iqbal et al., 2005).
As previously mentioned (see 4.2), recent studies within the field of HCI have often used
remote rather than head-mounted eye tracking for pupil size measurements. For example, Palinko
et al. (2010, 2011, 2012) performed a series of studies in which they used remote eye tracking in a
driving simulator. In the first study (Palinko et al., 2010), a new measure of cognitive load - the
mean pupil diameter change rate (MPDCR) - was introduced and evaluated. Subjects were
instructed to drive (primary task) while engaging in a word game with a front seat passenger
(secondary task). The PD signal was used to extract the mean pupil diameter change (MPDC), and
MPDCR was then calculated as the first difference of the MPDC curve. Both MPDC and MPDCR were
found to correspond well with driving performance and expected changes in cognitive load (based
on task difficulty). An advantage of these measures, according to the authors, is that they are both
rather insensitive to measurement artifacts (as compared to for example ICA), due to the averaging
process. The authors also concluded that MPDCR might be more useful than MPDC when it comes to
observing rapid changes in pupil size.
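The relationship between the two measures amounts to a simple numerical differentiation; as a minimal sketch (the MPDC samples are invented), the MPDCR is the first difference of the MPDC curve:

```python
import numpy as np

# Hypothetical mean pupil diameter change (MPDC) samples, in mm
mpdc = np.array([0.00, 0.05, 0.15, 0.20, 0.18])

# MPDCR: first difference of MPDC, i.e. the rate at which the
# averaged pupil diameter is changing between samples
mpdcr = np.diff(mpdc)
print(mpdcr)
```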
In the study described above (Palinko et al. 2010), the authors dealt with the influence of
the light reflex by confirming that the illumination of the screen did not vary by more than +/-
5% from the average illumination. Based on this, they made the assumption that the light reflex did not
significantly influence pupil diameter. However, more recent studies by Palinko & Kun (2011,
2012) have focused specifically on the interaction between cognitive load and luminance conditions
and the resulting effect on pupillary responses. These studies are described in the next section.
4.3.3 Dealing with the Light Reflex
Whether we use pupil size to investigate the emotional responses to different interface designs or
the cognitive load imposed by an interactive task, the pupillary light reflex must always be taken
into consideration. This is one of the greatest challenges involved in pupillometric research, and
different attempts have been made to separate the light-induced variations in pupil size from
responses that relate to mental events. One such approach is the index of cognitive activity (Marshall,
2000), which measures the number of abrupt discontinuities in the PD signal over each second of a
trial. In order to separate the light-induced discontinuities from those that are cognitively driven,
the ICA makes use of the somewhat different signal patterns associated with the light reflex on the
one hand, and the dilation reflex on the other (see 3.1). These two
components are decomposed from the original signal by means of wavelet analysis (using the
MatLab Wavelet Toolbox). Naturally, the ICA procedure also includes blink-removal and de-noising
of the signal (ibid).
In a paper from 2002, Marshall describes a simple validation study, in which the claimed
light reflex separation is put to the test. In the study, the obtained ICA for four different conditions
are compared: light plus cognitive effort, light plus no cognitive effort, dark plus cognitive effort and
dark plus no cognitive effort. The results demonstrate that ICA does indeed vary with different
levels of mental workload, but is rather insensitive to changes in illumination (Marshall, 2002).
A somewhat similar approach to separating the different sources of pupillary movements
was investigated by Jainta & Baccino (2010). They used principal component analysis (PCA) to
identify a set of three independent components in the PD signal, and found that only one of them
varied in response to shifts in cognitive demand. The authors concluded that even though further
research is required, there might be a traceable component which uniquely reflects the effort a
subject mobilizes to perform a task (Jainta & Baccino, 2010).
Pomplun & Sunkara (2003) suggested yet another approach to light reflex elimination.
They designed a simple interactive computer game, in which different geometric shapes appeared
on the screen. When a blue circle appeared, users were supposed to fixate it with their eyes while
pressing a button, in order to make the circle disappear from the screen. If they did not manage to
do so before a certain time had elapsed, the blue circle would explode. The task had three levels of
difficulty (easy, medium, and hard), which were obtained by varying the speed at which new items
appeared on the screen. Each level of difficulty was also combined with two different levels of
illumination (i.e. black or white background), resulting in a total of six different conditions. The
results confirmed that both the illumination conditions and the level of difficulty had a significant
effect on pupil size. Data analysis also revealed that there was no interaction between the two
factors. Based on these results, the authors suggested a possible solution to the light reflex-
problem, consisting of an additional pre-trial calibration, in which display brightness would be
varied in a systematic manner. Thereby, it would be possible to determine the participant’s pupil
size as a function of display brightness. The amount of pupil dilation induced by cognitive workload
could then be computed by subtracting the calibration value for the current display brightness from
the current pupil size (Pomplun & Sunkara, 2003).
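Pomplun & Sunkara's proposed correction could be sketched as follows; the calibration points and measurement values are invented for illustration.

```python
import numpy as np

# Pre-trial calibration: baseline pupil size (mm) recorded at a few
# systematically varied display-brightness levels (hypothetical values)
cal_brightness = np.array([0.0, 50.0, 100.0])   # percent of max brightness
cal_pupil = np.array([5.0, 4.0, 3.2])           # pupil shrinks with brightness

def cognitive_dilation(current_pupil, current_brightness):
    """Estimate the light-driven pupil size from the calibration curve
    and subtract it, leaving the workload-driven dilation."""
    expected = np.interp(current_brightness, cal_brightness, cal_pupil)
    return current_pupil - expected

# A 4.3 mm pupil at 50% brightness: roughly 0.3 mm attributable to workload
print(cognitive_dilation(4.3, 50.0))
```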
Palinko and Kun built on this idea in two recent studies (2011, 2012), in which they further
investigate the interaction between illumination and cognitive load as determinants of pupil size.
In contrast to Pomplun and Sunkara, Palinko and Kun did not only calculate the average response
over each task condition, but performed a more detailed analysis of the momentary pupillary
response. In the most recent study (Palinko & Kun, 2012), subjects were asked to perform three
different tasks:
In the Illumination Task (IT), a static image of three trucks was presented to the user. One
truck was almost black (10% of maximum brightness), the second was medium gray (50%
brightness) and the third was nearly white (90% brightness). Test subjects were instructed to
fixate on a target (two zeros), which moved from one truck to another every 9 seconds.
In the Visual Vigilance Task (VVT), subjects watched a sequence of numbers counting
upwards, with the instruction that every 6th number could be out of order. If so, the
participants were instructed to press a button, in order to indicate that they had detected
the faulty number.
In the Combination Task (CT), participants performed the two other tasks simultaneously.
The two zeros used as fixation target in the IT were now replaced by the sequence of
numbers used in the VVT.
The goal of the data analysis was to separate the pupillary response derived from each component
of the combination task by subtracting the responses obtained in the other two tasks. The first step
was thus to analyze the IT and VVT separately. The IT was analyzed by calculating the PD response
for each instance where a subject moved their point of gaze from one truck to a brighter truck (black
to white, black to gray, or gray to white). The results were averaged over each participant and each
such transition. Next, the VVT was analyzed by calculating the average pupil size for the different
positions in the number sequence (1-6). As expected, a significantly larger pupil size was obtained
for every 6th number, where subjects had to decide whether the number was out of order. Now, the
averaged pupil diameter during the IT could be subtracted from the responses obtained during the
combination task. The result was a curve that was very similar to that obtained in the VVT. In other
words, the light-induced changes in pupil size were successfully extracted from the CT-signal, so
that only the cognitively driven variations remained. While these results are encouraging, the
authors observe that the tasks used in the experiment are highly simplified, and that more research
is needed if a similar method is to be applied to more complex tasks (Palinko & Kun, 2012).
4.4 Pilot Study
In addition to the literature review, a simple pilot study was carried out, in which the pupillary
responses of two test subjects were recorded while they were exposed to simple cognitive and
emotional stimuli. The main purposes of the study were:
1. To practically investigate how pupil size measurements may be incorporated in a simple
eye tracking study.
2. To investigate whether the pupillary response to cognitive or emotional stimuli may be
studied without extensive technical skills, data processing, or time consumption.
3. To gain some practical experience of pupillometric research, in order to better understand
the challenges involved in data collection and analysis, and thereby improve the quality of
the discussion provided in this report.
4.4.1 Participants
Three test subjects took part in the experiment: one male (26 years old) and two females (23 and 28
years old). There are several reasons why the number of participants was so small. First, the
degree project was limited in terms of time, and a larger number of participants would have meant
more time spent on data analysis. Second, the goal of the pilot study was merely to test the
potential of the technology, not to render statistically significant data. It could be noted here that
similar constraints are not uncommon in usability testing, where the degree of confidence in the
results acquired must usually be balanced against limitations in terms of time and financial
resources (Rubin & Chisnell, 2008).
4.4.2 Equipment and Procedure
The pilot study was performed at Tobii Technology's headquarters in Danderyd, Sweden. The first
two test sessions were carried out on the same occasion, while the third test subject performed the
test on a later occasion. This allowed for a few minor tweaks in the study design between the two
occasions. These changes will be further discussed in the following sections.
Gaze data was collected with the Tobii TX300 Eye Tracker (see figure 4.1), which is currently
one of Tobii’s most advanced (remote) trackers. The TX300 has a sampling rate of 300 Hz (300
samples per second) and can compensate for head movements that occur within a box of 37*17 cm
(at 65 cm from the screen). Pupil diameter is calculated for each eye separately, with algorithms
that compensate for differences in tracking distance and gaze angle, as well as for distortions
caused by the spherical shape of the eye. (Tobii Technology, 2012)
The test sessions were carried out in a usability and market research studio at Tobii
Technology. The studio has no windows, which makes it easier to control the illumination
conditions, and the electric light was kept at the same level throughout the test sessions. While
performing the tasks, the participants were seated at a desk in front of the eye tracker (see figure
4.1). Before the test could commence, a calibration procedure was performed, to make sure that the
tracker could identify the subjects’ eyes. Thereafter, the participants were guided through the test
by text instructions appearing on the screen. The participants had access to a keyboard, which was
used to trigger new instructions. The sessions were controlled and monitored by a person (myself)
sitting outside the visual field of the test subject, in order to avoid distractions caused by
movements in the periphery of the visual field. Because all instructions appeared on the screen,
interaction between user and test moderator (which might have caused the user to turn away from
the screen) was not necessary once the eye tracking session had commenced.
The study was designed in Tobii Studio, a software tool dedicated to the design, recording
and analysis of eye tracking data. Cognitive and affective stimuli (see the following sections) were
presented as video-clips, in order to ensure correct timing of the events.
Figure 4.1: Test Set-Up and Equipment
A. The test moderator (on the left) was seated behind the test subject (in the middle). B. Tobii TX300 Eye Tracker.
4.4.3 Cognitive Tasks
The study consisted of two parts, one in which subjects performed simple math problems
(cognitive task), and one in which they were confronted with emotionally toned pictures (affective
stimuli). The cognitive task consisted of four math problems with two levels of difficulty (see figure
4.2 below), which were presented in the following order: easy 1, difficult 1, easy 2, and difficult 2.
The math problems used were adapted from cognitive study material from a workshop at the
EyeTrackConf conference in Uppsala, Sweden, in 2010. The subjects were given 10 seconds to solve
each problem, after which the next problem appeared automatically. Once (and if) they managed to
come up with a solution, the subjects were instructed to say it out loud. The performance data
could thereby be used to verify the assumed variations in difficulty between the different sub-tasks.
Figure 4.2: Cognitive Stimuli
The difficult math problem (to the right) should evoke a higher level of cognitive load than the easier one (to the left).
The visual characteristics of the stimuli were carefully controlled in order to avoid the occurrence
of light-induced changes in pupil size during the task. Therefore, the math problems were given the
same background color, and the numbers were placed in a similar way for all four problems. A
baseline stimulus was also created, in which the numbers in the picture were replaced by X’s. The
baseline picture was presented to the user before the real task began (no other stimulus was
presented in-between).
However, there was a twist to the carefully controlled luminance levels. The last difficult
task (Difficult 2) was deliberately given a slightly brighter gray background color (luminance 81
instead of 85 in the CIELab color space). The difference in luminance level between the stimuli can
be observed in figure 4.2, where the difficult stimulus has a slightly brighter background. My
hypothesis was that the light-induced pupil constriction caused by the change in background color
would cancel out the expected dilation due to the increased difficulty of the task (see 4.1.2).
4.4.4 Affective Stimuli
In the second part of the study, subjects were presented with four emotionally toned pictures. The
pictures were selected from the Geneva Affective PicturE Database (GAPED; Dan-Glauser & Scherer,
2011), which is available online at www.affective-sciences.org/researchmaterial. Each picture in
the database is assigned indexes for valence and arousal, on a scale from 1 to 100, which are
based on the subjective rating of sixty subjects (ibid.). Four pictures were chosen from the library
(see figure 4.3 below), based on their specified valence and arousal indexes. The chromatic
characteristics of the pictures were also taken into account, because their colors had to be matched
in order to obtain the same overall luminance. This was done using the Match Color Tool in Adobe
Photoshop.
Figure 4.3: Affective Stimuli
The pictures used as affective stimuli had the following affective characteristics, as specified in the GAPED:
1. Positive valence (92.1), low arousal (27.5).
2. Negative valence (15.6), high arousal (66.3).
3. Neutral valence (51.3), low arousal (26.2).
4. Positive valence (91.3), high arousal (57.6).
4.4.5 Results and Analysis
The first step of the analysis was to verify that the stimuli used had in fact evoked the cognitive or
emotional responses intended. For the cognitive task this was done by looking at user performance
for the different tasks. As seen in table 4.1, it was clear that the difficult tasks did cause more
trouble than the easier ones. In fact, none of the participants managed to solve any of the difficult
problems in the time given.
Table 4.1: Task Performance

Task                   Correct answers reported
Easy 1                 3 of 3
Difficult 1            0 of 3
Easy 2                 2 of 3
Difficult 2 (Bright)   0 of 3

The responses to the affective stimuli were verified by looking at the valence and arousal values reported
by the three participants. As we can see in table 4.2 below, the affective characteristics reported by
the users were rather consistent with the intended experience of the stimuli. For example, the
second picture (the wounded horse) did yield a low valence rating (1.3/5 on average), but a high
rating for arousal (4.3/5). It may be noted, however, that pictures number 1 and 4, which were
supposed to have different arousal characteristics, received very similar ratings by the subjects (3.3
vs. 3.7 on the arousal scale). Partly, this result may be attributed to the fact that the arousal indexes
assigned to the pictures in GAPED were not that different to begin with (27.5 vs. 57.6). However, it
could also have to do with the way in which the question was posed.

Table 4.2: Affective Ratings

Stimuli              User 1     User 2     User 3     Average
1. Positive          V:5  A:3   V:5  A:3   V:5  A:4   V:5.0  A:3.3
2. Negative Arousal  V:2  A:4   V:1  A:4   V:1  A:5   V:1.3  A:4.3
3. Neutral           V:3  A:1   V:3  A:1   V:3  A:2   V:3.0  A:1.3
4. Positive Arousal  V:4  A:3   V:5  A:4   V:5  A:4   V:4.7  A:3.7
(V = Valence, A = Arousal)

Before the pupillometric data could be analyzed, some pre-processing of the results was required.
First, instances of lost tracking were removed. The first two test sessions had rather few instances
of lost tracking, with over 90% successful tracking for both individuals (which means that the
tracker could identify the eyes >90% of the time). However, there was an overrepresentation
lost data in the cognitive segment of the recording. The reason for this might have been that I
experienced a problem with the screen settings at the first test occasion, which caused the numbers
to take up a larger proportion of the screen than intended, even reaching the edges of the screen.
This may have caused instances of lost tracking, since tracking accuracy decreases as the point of
gaze moves closer to the edges of the screen.
Naturally, the screen settings were fixed before the second test occasion (user 3).
Nevertheless, a rather poor overall quality of data was obtained in this session, with only 58%
successful tracking. This was probably due to the fact that the subject wore eye make-up at the
occasion, which is known to make it harder for the tracker to identify the pupils (Tobii Technology,
2010). After instances of lost data had been eliminated, the average size of the right and left pupils
was calculated for each subject and each data point. This average pupil size was then used for the
analysis reported in the following. I did, however, make a quick comparison of the average pupil
size obtained for each eye and each segment of the test respectively (affective stimuli/cognitive
tasks), but found no systematic correlation between the larger pupil (right/left) and the type of task
(cognitive/affective).
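The pre-processing described above (discarding samples with lost tracking, then averaging the two pupils) can be sketched as below; the sample values are invented, with -1 standing in for lost tracking.

```python
import numpy as np

def average_pupil(left, right, lost_value=-1.0):
    """Drop samples where either eye was lost, then average the
    left and right pupil diameters for each remaining sample."""
    left, right = np.asarray(left), np.asarray(right)
    valid = (left != lost_value) & (right != lost_value)
    return (left[valid] + right[valid]) / 2.0

left = [3.2, 3.3, -1.0, 3.4]   # mm; -1 marks lost tracking
right = [3.0, 3.1, 3.0, -1.0]
print(average_pupil(left, right))
```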
Cognitive Tasks
The next step of the data analysis was to calculate the mean pupil diameter (MPD) for each task in
the cognitive section of the test. When analyzed separately, the first two sessions resulted in very
similar trends for the MPDs. Figure 4.4 below shows the averaged results for the first two test
subjects (user 1 & 2).
Figure 4.4: MPD for Mental Arithmetics (user 1 & 2)
The first conclusion that can be drawn from the figure above is that there was a clear difference in
MPD between the baseline period, during which no cognitive task was performed, and the mental
calculation periods, during which the subjects had to mobilize some cognitive effort. In other words,
we may conclude that the pupil did respond to differences in cognitive load (and/or increased
stress due to the time pressure), at least to some degree. We can also conclude that the increased
background illumination of the last stimulus does seem to have counteracted the effect of the
increased cognitive demand, at least partially.
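The task-wise averaging behind these MPD values can be sketched as a simple grouping step; the (task, diameter) record layout and the task labels are assumptions, since the report does not state how segments were annotated in the export.

```python
# Minimal sketch of the MPD computation: group pupil samples by task
# segment and average each group. Labels like "baseline" are assumptions.
def mpd_per_task(labelled_samples):
    """Mean pupil diameter (mm) per task, from (task, diameter) pairs."""
    groups = {}
    for task, diameter in labelled_samples:
        groups.setdefault(task, []).append(diameter)
    return {task: sum(d) / len(d) for task, d in groups.items()}
```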
Nevertheless, one aspect of the results presented in figure 4.4 is not in line with my initial
assumptions: pupil size did not vary systematically with task difficulty. Instead, MPD increased for
each task, regardless of the level of difficulty (with the exception of the last, brighter stimulus). One
explanation for this trend may lie in the fact that subjects were not given the chance to relax
between the cognitive tasks. Even when they did manage to solve the tasks in the time given, the
answer was reported just before the next stimulus appeared, giving them no time to prepare for the
new stimulus. Therefore, it is no surprise if the stress levels of the participants (and therefore their
pupil size) increased continually over the course of the tasks. In order to validate this hypothesis,
the cognitive task procedure was changed slightly before the third test session (user 3). This time,
the user was given 15 seconds to solve each math problem, and the baseline appeared for 5 seconds
between every task, to give the user some time to rest and prepare for the next problem. The
results are presented in figure 4.5 below.
Figure 4.5: MPD for Mental Arithmetics (user 3)
The diagram above indicates that the small changes made to the cognitive tasks did affect the
cognitive load experienced by the participant. This time, the first difficult task (which had the same
background luminance as the other stimuli) gave rise to the largest MPD. The results were thus in
line with my initial assumption that more difficult problems would result in a higher cognitive load,
and thereby a higher MPD. For the last difficult problem, the increase in luminance seems to have
“balanced out” the reflex dilation caused by the difficult task, so that the result was a MPD that was
equal to the baseline. This was also in line with my expectations. On the other hand, the fact that the
MPDs obtained for the two easy problems were actually slightly lower than the baseline was rather
surprising. Part of the explanation may lie in the fact that the subject reported the correct answers
to these problems several seconds before the end of the task, which means that her cognitive load
should have been low during the last seconds of the stimuli presentation (cf. Kahneman, 1966). This
would of course have affected the average obtained over the whole course of the stimuli
presentation. This possible explanation may be verified by looking at figure 4.6 on the next page,
which shows how the pupil diameter changed over time during the cognitive tasks. The curve is
smoothed with a moving average function, which means that blink artifacts and presumed
measurement errors have been smoothed out. The bars at the bottom of the chart indicate the
different task phases. Fortunately, this segment of the data for participant three contained
relatively few instances of lost tracking.
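The moving-average smoothing used for these trend curves can be sketched as a centred sliding window; the window length here is an arbitrary assumption, as the report does not specify one.

```python
# Centred moving average for a pupil-diameter trace; windows are shortened
# at the edges so the output has the same length as the input.
def moving_average(trace, window=5):
    half = window // 2
    smoothed = []
    for i in range(len(trace)):
        lo, hi = max(0, i - half), min(len(trace), i + half + 1)
        smoothed.append(sum(trace[lo:hi]) / (hi - lo))
    return smoothed
```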
Figure 4.6: Trend for Cognitive Tasks (user 3, x = time, y = mm)
As we can see in figure 4.6, the pupillary responses varied in a rather predictable manner during
the cognitive tasks, with dips at each of the no-task phases (baseline or rest), and abrupt dilations
during the task phases. As expected, all of the task segments resulted in higher peak dilations than
the baseline, and at least for the first easy task, it is clear that the pupils constricted after the
response had been reported (after about half the time given), which explains why the average pupil
size obtained was so low.
Affective Stimuli
Figure 4.7 below presents MPDs obtained for each of the affective stimuli. Once again, the results
are averaged over the first two test sessions, which gave very similar results. In order to facilitate
comparison with the cognitive results, I have used the same scale for the two diagrams. Thus, we
may easily observe that the cognitive tasks gave rise to greater changes in MPD than did the
affective stimuli. This is not too surprising, since the cognitive tasks demanded a higher degree of
user engagement, as compared to the more passive nature of the picture-viewing.
Figure 4.7: MPD for Affective Pictures (user 1 & 2, y = mm)
The diagram in figure 4.7 reveals some rather surprising results. First of all, the neutral, low arousal
image resulted in a rather high MPD, second only to the last, positive arousal picture. However, the
most surprising result is the fact that the second picture, which was rated as the most arousing by
the test subjects (4/5), resulted in the lowest MPD. However, there is a logical explanation. The
results indicate that the reflex dilation might have been counteracted by a light response, and this
is actually the case. During the data analysis, I went back to examine the affective pictures again,
and realized that I had used the wrong version of the second picture in the test session. The version
used did indeed have a higher overall luminance than the other pictures, which explains the low MPD
obtained for the negative arousal stimuli. The mistake was corrected prior to the third test session,
but unfortunately, the affective section of that recording contained too large data losses for any
analysis to be based on it. Instead, the data obtained during the first test session was analyzed in
more detail. Figure 4.8 below shows how the PD of user 1 changed over the course of the affective
stimuli presentation (the curve is smoothed with a moving average function). The horizontal
lines in the figure indicate the points at which the stimuli changed from one picture to the next.
Figure 4.8: Trend for Affective Pictures (user 1, x = time, y = mm)
By looking at figure 4.8, we can make the following observations about the pupillary response to
the different pictures:
For the first affective stimulus (the baby), there was a rather steady increase in pupil
diameter, indicating an increased emotional arousal in the subject. The first sharp dilation
begins around 400 ms (0.4 s) after stimulus onset, which is in line with the latency reported
by Partala and Surakka (2003), who studied the pupillary response to affective sounds.
For the second, negative arousal stimulus (wounded horse), we can observe a constriction
of the pupils during the first two seconds of the stimulus presentation, which is probably an
effect of the light reflex. Again, there is a response latency of about 400 ms, this time
followed by a sharp constriction of the pupils, which is (again) in line with the light-reflex
latency reported in previous studies (e.g. Palinko et al., 2012). However, there seems to be no
obvious explanation for the rather sharp dilation and reconstriction that follow after the initial
light response.
As might be expected, the neutral stimulus (street sign) that followed after the brighter
second picture resulted in a redilation of the pupils. However, there is a sharp decrease in
pupil size at the end of the stimulus presentation, which has no obvious explanation.
The positive arousal stimulus starts off with a sharp dilation (without any latency) and
reconstriction which is hard to explain. However, it is followed by an increasing trend which
is indicative of increased arousal in the subject.
As we can see, some of the features of figure 4.8 are in line with what might be expected, while
others seem to have no obvious explanation. In the end, it is hard to draw any definite conclusions
based on the data obtained. One reason for this is that although the overall brightness of the
different pictures was matched (for at least three images), there were still luminance variations
within each picture. Thus, it is not unlikely that some of the variations in pupil size were evoked as
the subjects changed their point of regard (this theory might be verified by analyzing the gaze data
in relation to the pupil size, but that would be a time-consuming endeavor). But it is also important
to note that the test was based on a very simple stimulus-response understanding of human
emotion. In real life, people seldom react as we expect them to. Even if the users’ cognitive
assessments of the emotional values associated with different pictures were similar to the pre-
defined affective characteristics, it does not necessarily follow that their experience was.
4.4.6 Lessons Learned
Although there were a few flaws in the test procedure, and even though the quality of data was
partly poor, the pilot study did serve its purpose in pointing out some of the challenges involved in
practical pupillometry. Some challenges relate to eye tracking in general. It is a known fact that
some subjects are easier to track than others, and that factors such as wearing glasses or eye
make-up may lead to problems in the data collection. This difficulty was mainly experienced in the
third test session, where the test subject had not been instructed to avoid eye make-up; had she
been, the recording might have been more successful. A first take-away from the pilot study is
therefore that participants should be given some basic instructions before arriving at the test
facility, and that they should be asked whether or not they wear glasses. In a ‘sharp’ study, it may
also be advisable to over-recruit slightly, in order to compensate for trials that are unsuccessful.
When it comes to the specific case of pupillometric studies, it is clear that the software tool
used to design the study, Tobii Studio, is not (yet) adapted for the analysis of pupil data. More
common visualizations used in eye tracking, which focus on where subjects are looking during the
interaction (e.g. so-called heat-maps and gaze plots), can be generated automatically in Tobii Studio,
which gives access to rather effective data analysis. When it comes to pupil size, no automatic
processing is provided, which means that the extraction of relevant features must be done manually
for each participant (which is time-consuming even for a minor study like this one); unless some
script is developed for data processing. Either way, some data processing skills on the part of the
experimenter are required.
When it comes to dealing with the light reflex, the study clearly demonstrated that even
small variations in stimuli illumination will produce pupil constriction. In other words, a strict
control of visual stimuli is necessary if we want to draw conclusions about the user’s cognitive and
emotional processes based on pupil data, unless some measure is taken to eliminate the effect of
the light reflex. On the other hand, such strict control is hard to achieve without changing the nature
of the interactive experience we wish to evaluate. Therefore, the development of reliable,
automatic procedures for separating the different components of the pupillary response is a key
concern if pupil size is to become a truly applicable tool in usability testing.
5 Discussion and Analysis
In this chapter, I come back to the core research questions of the present study. In the first section, I
discuss what physiological measures may tell us about human emotion and cognition, and discuss
some of the important considerations involved in the interpretation of physiological data (RQ 1 &
2). Thereafter, I discuss the specific challenges involved in UX and usability testing, and how
physiological measurement may be incorporated in such contexts. In the third and last section, I try
to define what would make up a truly valuable physiological measurement method for UX and
usability testing, and discuss how well these criteria are met by the different measures investigated
in this study (RQ 3).
5.1 Interpreting Physiological Data
In the present report, I have referred to a large number of studies investigating the link between
physiological measures and human mental processes. All in all, the research conducted in this area
provides extensive evidence that both cognitive and emotional processing is associated with
measurable physiological changes in the human body, affecting parameters such as heart rate, heart
rate variability, skin conductance, electrical brain activity and pupil size. The problem, however, is
that physiological measures do not only capture changes that are related to human cognition and
emotion, but may in fact be influenced by a large number of variables, including body posture,
hormonal levels and environmental aspects (such as room temperature, electrical equipment and
luminance conditions). As noted earlier in this report, great care must thus be taken in the analysis
and interpretation of physiological signals. Park (2009) suggests that before data is collected, all
factors that may result in unwanted interference with the results should be eliminated, and that
after data has been collected, researchers should go back and reconsider if there is any room for
alternative interpretations (ibid.).
Most of the studies reviewed in the present work were performed in controlled laboratory
settings. This approach is certainly a good way to ensure a high quality of data; on the other hand, it
may raise the question of external validity of the results. As pointed out by Picard (2010),
conclusions about the real world may be misinformed if based on the artificial or simulated. The
main reason for this is probably not the strict experimental control of laboratory environments;
rather, it has to do with what the test situation means to the user, and his or her motivation for
performing the tasks at hand. Clearly, the act of buying a journey online means something different
to the user in a real-life situation, where he or she is actually going to experience the journey after
buying it, as compared to a test situation, where the task is performed for the mere purpose of
evaluation. It should also be noted that the emotional and cognitive reactions observed in the
laboratory may not solely be related to the experimental stimuli, i.e. the task at hand, but could also
be evoked by the test situation as such. An example of this effect is the so-called “white-coat
hypertension” discussed in medical literature, which refers to the phenomenon of high blood
pressure demonstrated in the clinic, but not at home (Wilhelm & Grossman, 2010). It seems
reasonable to assume that a similar effect could appear in usability testing (and might have
appeared in the pilot study presented in this paper). For example, Ward & Marsden (2003)
observed that when the experimenter appeared and began asking questions (after the participants
had experienced a quiet “settling-in” period), all participants showed large increases in skin
conductance, indicating elevated levels of arousal.
However, provided that we actually manage to isolate the physiological responses related to
the target of study, and provided that these reactions are reasonably similar to those we might
expect in real life situations - what conclusions about cognitive or emotional processes may be
drawn from physiological data? When it comes to cognitive processes, most studies have focused on
the relationship between cognitive load and physiology. The most commonly used measures for
this purpose include HRV, EEG and pupil size, all of which (when validated against subjective and
performance-related measures) have been found to respond to changes in cognitive workload in a
predictable manner (e.g. Berntson et al., 1997; Antonenko et al., 2010; Beatty & Lucero-Wagoner,
2000).
When it comes to affective computing and emotion research in general, the goal of
physiological measurement has often been a more fine-grained classification of mental processes,
as compared to the one-dimensional scale of cognitive load. The two-dimensional valence-arousal
scale is a commonly used tool for this purpose. Although this model provides a highly simplified
view of human emotion, it is considered effective enough to distinguish between most emotional
categories used in everyday language (Mehrabian & Russell, 1974). When it comes to the arousal
dimension, there is little controversy that a large number of emotional states (such as joy,
frustration, fear, and surprise) are associated with activation of the sympathetic nervous system,
resulting in responses such as elevated heart rate, increased sweating and dilation of the pupils.
Determining the valence of emotion, on the other hand, seems more complicated, and the
usefulness of ANS responses for this purpose is still a topic of debate in emotion research (cf.
Kreibig, 2010). As previously mentioned, some studies (e.g. Rantanen et al., 2010) have suggested
that cardiovascular activity may be used to distinguish between pleasant and unpleasant emotions.
However, EEG is probably the most reliable source of information for this purpose, at least if
relatively sophisticated equipment is used.
However sophisticated the methodologies we come up with, it is important to keep in mind that
physiology alone can never tell us what a person is thinking or feeling at a given time. Also, bodily
reactions are only one part of what constitutes an emotion (see 2.3.1). Therefore, many researchers
(including Kecklund & Åkerstedt, 2004; Ward & Marsden, 2003) argue that physiological data
should always be interpreted in relation to other sources of information, such as knowledge of
context, interview data and the user’s subjective ratings of the experience. Indeed, such an
approach may prove highly valuable in usability testing. The greatest advantage of physiological
measures, as compared to subjective ratings or interviews, is perhaps the fact that they may be
recorded continuously while a user is engaging in an interactive task. In this way, physiological data
may be useful as a way to help users go back and remember what they experienced after a test has
been performed. Today, a method called Retrospective Think Aloud (RTA) is commonly applied in
usability testing, especially in eye tracking studies (Tobii Technology, 2009). In RTA, the interaction
is replayed to the user, while he or she is asked to comment on his or her thoughts, choices and actions. A
similar methodology could be applied for other physiological measures, provided that the data can
be visualized in a way that is accessible to the test participants. By combining physiological data
with the user’s own account of the interaction, we could perhaps get at least a little bit closer to
understanding the user experience.
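A replay-support step like the one suggested above could be as simple as flagging time points where a physiological signal rises well above its session baseline, so the moderator knows where to pause the playback during a retrospective session. The threshold rule below is a made-up illustration, not an established method.

```python
# Flag candidate moments for retrospective review: samples that exceed the
# session mean by more than `k` standard deviations. The rule is illustrative.
import statistics

def flag_episodes(signal, timestamps, k=2.0):
    """Return timestamps where the signal is unusually elevated."""
    mean = statistics.fmean(signal)
    sd = statistics.pstdev(signal)
    return [t for s, t in zip(signal, timestamps) if s > mean + k * sd]
```

A flagged timestamp would then be matched against the screen recording, and the user asked what he or she experienced at that moment.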
5.2 Challenges for UX and Usability Testing
As we have seen in this study, there is substantial evidence for the link between psychological
processes and physiological signals. However, most studies in the field of psychophysiology have
measured responses to simple stimuli, such as affective pictures or mental arithmetics, and data
collection has almost exclusively been performed in more or less controlled laboratory settings.
These criteria can seldom be met in usability testing; partly because of operational constraints, and
partly because strict control of the ‘stimulus’ cannot be achieved without changing the nature of the
interaction under study.
According to Duchowski (2007), there are at least three operational constraints associated with
system evaluation (such as usability testing). These are (ibid.):
Time
Money
Personnel
All of these constraints are highly relevant to the commercial environment in which most usability
testing takes place. As discussed earlier (see 2.1), the quality of results must often be balanced
against the time and money available for testing. As a result, “quick and dirty” assessment tools are
often chosen over more advanced research methods like the ones discussed in the present study
(cf. Madrigal & McClain, 2009). Physiological measures often require extensive data processing for
relevant information to be extracted, which may take more person-hours than can be justified by the
financial return. Therefore, sophisticated systems that are valuable for scientific purposes may not
necessarily be attractive in the commercial context in which most usability testing takes place.
Ultimately, it all comes down to return on investment: will the money you spend on measuring
equipment, training and data analysis generate enough profit or savings for the company to
motivate the expense? Hopefully, technological developments will continue to make more
advanced technology available at more affordable prices, preferably with a high degree of automatic
processing and visualization to facilitate data analysis. Ultimately, commercial interests will
probably be a determining factor for this development.
In order for new methodologies to be incorporated in every-day usability testing, it is not
enough that they are effective, accurate and affordable. Perhaps even more important is that it is
easy for the test leader to apply the technology in a practical test situation. Monitoring a test
without advanced measuring equipment may be complicated enough; therefore, it is no wonder if
usability practitioners are reluctant to add cumbersome measuring equipment,
electrodes that need to be correctly placed (and prevented from falling off during the test session)
or eye calibrations that may or may not be successful. Moreover, not all UX and usability
practitioners will possess the technical competence required to perform detailed data analysis.
Again, technical advancements that facilitate the practical measurement and analysis of
physiological measures are necessary for these techniques to be truly useful in usability testing.
When it comes to pupillometry, two factors seem particularly important as determinants of
its future in usability testing. First, if pupil size is to be of any practical help in the evaluation of
human-computer interaction, then there must be reliable procedures for light reflex elimination
available. A few promising approaches to this problem have emerged in this study. One is to use
spectral analysis to separate the different components of the pupillary response. One version of
this approach (the ICA; Marshall, 2000) is already commercially available from Eyetracking Inc.
(http://www.eyetracking.com), and similar attempts have been made by, for example, Janita &
Baccino (2010). Another approach to light reflex elimination would be to use some kind of
pre-trial calibration to determine which light-induced responses to expect during the interaction.
Those values would then be subtracted from the pupillary response obtained, resulting in a signal
that would reflect reflex dilation alone. However, this method has only been tested for
highly simplified tasks, and it is unclear whether it could be applied to the more complex tasks
involved in typical HCI. A third possible approach could be to utilize the point-of-gaze data provided
by the eye tracker, and combine that information with data concerning the luminance level of each
pixel in the screen, at each moment of the interaction. This would of course require very high
tracking precision, as well as extremely exact synchronization between different sources of
information.
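As a toy illustration of this third approach, the expected light response at the gazed pixel could be estimated from a per-pixel luminance map and subtracted from the measured diameter. Everything here, including the linear light-response model and its gain and baseline parameters, is an assumption for illustration only.

```python
# Toy gaze-contingent light-reflex correction: subtract a linear estimate
# of the light-driven pupil size at the gazed pixel from each sample.
# The linear model and its parameters are illustrative assumptions.
def light_corrected(pupil_mm, gaze_points, luminance_map,
                    gain=-0.5, baseline=3.5):
    """Return the residual (non-light) component of each pupil sample."""
    residuals = []
    for diameter, (x, y) in zip(pupil_mm, gaze_points):
        expected = baseline + gain * luminance_map[y][x]  # brighter -> smaller
        residuals.append(diameter - expected)
    return residuals
```

In practice, as noted above, this would demand very precise tracking and tight synchronisation between the gaze stream and the screen content.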
The second factor that seems particularly important for pupil size to be integrated in
usability testing relates to the complexity of analysis. Because of the operational constraints often
associated with usability testing, it seems unlikely that usability practitioners would make practical
use of pupillometry, unless data processing and visualization of the pupillary response becomes
less effortful and time consuming than it is today. Easy-to-use analysis tools would also make
pupillary responses more accessible, even to usability practitioners that are not so “good with
numbers”, or who lack the technical competence required to perform detailed analysis.
A recurring topic of discussion in this report has been the obtrusiveness of different
measurement methods. As mentioned earlier, usability testing is about observing representative
end users using a product to perform representative tasks, preferably in a context that is
representative of “real world” usage. A test situation that is too different from the typical use case
will have very limited value, because the results may bear little relation to how actual users will
experience the product in real-life settings. Almost all physiological measurement techniques
require sensors to be placed on the body. This may disrupt the user experience, at least to some
degree. Although measuring equipment is getting smaller and less cumbersome to handle, it may
still add to the already awkward situation of being monitored while interacting with a system. On
this point, remote eye tracking has an advantage over other available measurement techniques,
since no physical sensors are required. Built-in eye-tracking monitors may also look pretty much
like any other computer screen, which should add to the authenticity of the test situation. However,
as pointed out by Park (2009), eye tracking can also be perceived as artificial. For example, it
requires the user to sit in more or less the same position during the course of the test session,
which may create unnatural tension in the subject.
Another challenge, which relates to the previous one, is that there is often a trade-off
between high quality measurements and obtrusiveness. For example, fMRI provides excellent
spatial resolution of brain activity, but is (so far) unsuitable for real-world usability testing. EEG is
less obtrusive, but has low spatial resolution and high presence of noise. As discussed above,
unobtrusiveness is an important factor in usability, but there is no general rule to apply when
evaluating different alternatives; all decisions must be based on the particular goals of the study at
hand.
Today, digital interfaces are not only accessed through stationary computers, and
consequently usability testing is not only performed in front of traditional computer monitors. If
physiological data are to be integrated in such contexts, additional challenges need to be taken into
consideration. First of all, not all measuring equipment is suitable for ambulatory assessment,
either because the recording devices are not wireless, or because they are too cumbersome to carry
around. However, great progress has been made in this area over the last few years. In the case of
pupillometry, ambulatory assessment may be achieved with eye tracking glasses. These may not be
unobtrusive enough for users to ‘forget’ that they are taking part in a user study, but they are
certainly practical enough to allow for easy transport as well as free body movement during the
recording session. Similar progress has been made in the field of EEG, where wireless caps with
built-in electrodes are now available on the market. However, GSR and cardiovascular measures are
probably the most practical alternatives for ambulatory data collection, as the technology needed
for data collection may nowadays be incorporated into a simple wristband or (in the case of the
latter) a pair of modified headphones paired with a smartphone.
However, ambulatory usability testing adds yet another difficulty to the use of physiological
measurement, i.e. movement artifacts. When people are not restricted in terms of mobility, the level
of noise in all physiological measures tends to increase (Gunes & Pantic, 2010). This is no surprise,
given that body movements may be responsible for pupil dilation, elevated heart rate, increased
skin conductance and artifacts in the EEG, all of which must be considered a form of noise when the
study focus lies on cognitively and/or emotionally driven responses.
5.3 Evaluation of Measures
The literature review presented here shows that there is no ‘gold standard’ for physiological
measurement, but that all measures have their respective pros and cons. However, a few criteria
have emerged from the analysis, which seem particularly important for a physiological measure to
be both valuable and suitable for usability testing. In the following, these criteria are used to
evaluate and compare the measures investigated in this study, i.e. cardiovascular activity (HR &
HRV), skin conductance (SC), electroencephalography (EEG) and pupil diameter (PD).
Affordability
It is of course hard to say where to draw the line between affordable and too expensive, and there
are also large variations in price for the same measurement method, depending on the sophistication
of the equipment you choose to buy. However, skin conductance (SC) and heart rate (HR) seem to be the
most affordable alternatives in general, as the technical equipment required to obtain these
responses is rather simple.
Unobtrusiveness
A lot of progress is being made in this area. Today, there are small and easy-to-wear systems
available for both HR and SC monitoring, although less obtrusive body placement may be associated
with an increase in measurement noise. Eye tracking is also a rather unobtrusive technology, as
modern systems do not require any contact between subject and tracker. EEG is probably the most
intrusive alternative today, although progress is being made in this area as well.
Information Density
What I mean by this criterion is that a truly useful measure should provide as much valuable
information about the user state as possible. When it comes to cognitive assessment, HRV, EEG and
PD have all been found to be good measures of cognitive load, although I have not found any
conclusive evidence that any of these measures would be more useful than the others for this
purpose. However, PD is usually measured with eye-tracking, which gives access to a large amount
of additional information concerning users’ visual attention.
For affective assessment, all measures discussed in this study may be used to indicate
instances of elevated arousal. However, EEG must be considered the most informative measure in
this respect, since it can provide more detailed information about the parts of the brain that are
activated during different phases of the interaction.
Simplicity of Use
By this criterion, I mean that the ideal measure should be easy to implement in every day usability
testing. Thus, the equipment should be easy to set up and learn how to use, even for people with
modest technical skills. This criterion is of course hard to evaluate without practical experience of
the respective measurement methodologies, and there are probably considerable differences
between different systems and manufacturers. However, from my experience with eye tracking, I
can conclude that at least for this particular system, around half an hour was enough to figure out
how to set up the system. That being said, I did experience some bumps in the road, such as my
trouble with the screen settings. However, the greater challenge when it comes to eye tracking is
probably to be aware of and learn to deal with the different factors that may limit the trackability of
a subject.
Simplicity of Analysis
This criterion has to do with how easy it is to extract relevant information from the data obtained.
Again, it is hard to make a statement without practical experience of the different measurement
methods, but it seems that both HR and SC would be easier to analyze than EEG, which requires
rather extensive knowledge about the workings of the brain. When it comes to pupillometry, the
difficulties involved in the analysis have already been discussed in the previous section of this
chapter.
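The relative simplicity of cardiovascular analysis can be illustrated with a short sketch: RMSSD, a standard time-domain HRV metric, needs nothing more than the successive differences between inter-beat (RR) intervals. The RR values below are hypothetical, not data from the study.

```python
import math

def rmssd(rr_intervals_ms):
    """Root mean square of successive differences between RR intervals (ms).
    A standard time-domain HRV metric; higher values are commonly taken to
    reflect greater parasympathetic (vagal) influence on heart rate."""
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Hypothetical RR intervals (milliseconds) from a resting subject
rr = [812, 798, 825, 840, 810, 795, 830]
print(round(rmssd(rr), 1))  # → 24.2
```

Nothing comparably compact exists for EEG, where even a basic workload index requires filtering, artifact rejection and spectral analysis of multiple channels.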
Robustness
If physiological measurements are to be of any practical help in usability testing, they need to
tolerate collection under relatively loosely controlled conditions. Unfortunately, all measures
discussed in this report are affected by unwanted artifacts from factors such as physical activity,
temperature and luminance conditions. This is an important challenge for all contexts where
physiological measures are used, and an area where further development is needed, in order to
better separate different sources of influence on our physiology.
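One common, if only partial, remedy in pupillometry is subtractive baseline correction: expressing task-evoked dilation relative to a pre-stimulus window. This removes slow drifts and the subject's tonic pupil size, but not stimulus-driven luminance effects, which must instead be controlled in the stimulus design itself (e.g. luminance-matched stimuli). The sketch below uses hypothetical values.

```python
def baseline_corrected_dilation(pupil_samples, baseline_samples):
    """Subtractive baseline correction: express each task-window pupil
    sample relative to the mean diameter of a pre-stimulus baseline
    window. Removes slow drift and tonic pupil size, NOT luminance
    artifacts -- those must be handled in the stimulus design."""
    baseline = sum(baseline_samples) / len(baseline_samples)
    return [s - baseline for s in pupil_samples]

pre = [3.0, 3.2]   # hypothetical pre-stimulus pupil diameters (mm)
task = [3.5, 3.6]  # hypothetical task-window diameters (mm)
print([round(d, 2) for d in baseline_corrected_dilation(task, pre)])  # → [0.4, 0.5]
```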
6 Conclusion
As we have seen, there is no single ‘gold standard’ for physiological measurement in UX and
usability testing. Instead, it was found that cardiovascular measures, skin conductance, EEG and
pupillometry may all be more or less useful, depending on the context of study. Although none of
these methods allows for an absolute measurement of the thoughts or emotions experienced during
a usability test, they may help identify elements of the interaction that are particularly important or
interesting, such as instances of elevated cognitive load, frustration or other emotional reactions.
However, usability researchers should be aware that there is never just one possible explanation for
an observed physiological reaction. Therefore, physiological responses should always be
interpreted in relation to the context in which the data were collected, as well as to the users’ own
account of their experience.
7 Bibliography
A.D.A.M., Inc., 2005. A.D.A.M. Medical Encyclopedia: EEG. [WWW Document]. URL
http://www.nlm.nih.gov/medlineplus/ency/article/003931.htm
Andreassi, J.L., 2000. Psychophysiology: human behavior and physiological response. Lawrence
Erlbaum Associates, Mahwah, N.J.
Antonenko, P., Paas, F., Grabner, R., Gog, T., 2010. Using Electroencephalography to Measure
Cognitive Load. Educational Psychology Review 22, pp. 425–438.
Bailey, B.P., Konstan, J.A., 2006. On the need for attention-aware systems: Measuring effects of
interruption on task performance, error rate, and affective state. Computers in Human
Behavior 22, pp. 685–708.
Barreto, A., Zhai, J., Rishe, N., Gao, Y., n.d. Significance of Pupil Diameter Measurements for the
Assessment of Affective State in Computer Users, in: Elleithy, K. (Ed.), Advances and
Innovations in Systems, Computing Sciences and Software Engineering. Springer
Netherlands, Dordrecht, pp. 59–64.
Bartels, M., Marshall, S.P., 2012. Measuring cognitive workload across different eye tracking
hardware platforms. ACM Press, p. 161.
Beatty, J., 1982. Task-evoked pupillary responses, processing load, and the structure of processing
resources. Psychological Bulletin 91, pp. 276–292.
Beatty, J., Lucero-Wagoner, B., 2000. Pupillary System, in: Cacioppo, J.T., Tassinary, L.G.,
Berntson, G.G. (Eds.), Handbook of Psychophysiology, 2nd ed. Cambridge University Press,
New York, pp. 142–162.
Berntson, G.G., Thomas Bigger, J., Eckberg, D.L., Grossman, P., Kaufmann, P.G., Malik, M., Nagaraja,
H.N., Porges, S.W., Saul, J.P., Stone, P.H., Van Der Molen, M.W., 1997. Heart rate variability:
Origins, methods, and interpretive caveats. Psychophysiology 34, pp. 623–648.
Bradley, M.M., Miccoli, L., Escrig, M.A., Lang, P.J., 2008. The pupil as a measure of emotional arousal
and autonomic activation. Psychophysiology 45, pp. 602–607.
Brooke, J., 1996. SUS: a “quick and dirty” usability scale, in: Jordan, P.W., Thomas, B.,
Weerdmeester, B.A., McClelland, A.L. (Eds.), Usability Evaluation in Industry. Taylor and
Francis, London.
Cegarra, J., Chevalier, A., 2008. The use of Tholos software for combining measures of mental
workload: Toward theoretical and methodological improvements. Behavior Research
Methods 40, pp. 988–1000.
Chanel, G., Kierkels, J.J.M., Soleymani, M., Pun, T., 2009. Short-term emotion assessment in a recall
paradigm. International Journal of Human-Computer Studies 67, pp. 607–627.
Coles, M.G.H., Rugg, M.D., 1995. The ERP and cognitive psychology: Conceptual issues, in: Rugg,
M.D., Coles, M.G.H. (Eds.), Electrophysiology of mind: Event-related brain potentials and
cognition. Oxford University Press, New York, pp. 27–39.
Crane, E., Peter, C., 2006. A working definition for HCI specific emotion research, in: Peter, C.,
Beale, R., Crane, E., Axelrod, L., Blyth, G. (Eds.), 2008. Emotion in HCI: Joint Proceedings of
the 2005, 2006, and 2007 International Workshops, pp. 54–61.
Dan-Glauser, E.S., Scherer, K.R., 2011. The Geneva affective picture database (GAPED): a new 730-
picture database focusing on valence and normative significance. Behavior Research
Methods 43, pp. 468–477.
Dingsøyr, T., Dybå, T., Moe, N. B., 2010. Agile Software Development: Current Research and Future
Directions. Springer Berlin Heidelberg: Berlin, Heidelberg.
Duchowski, A.T., 2003. Eye tracking methodology: theory and practice. Springer, London.
Dufresne, A., Courtemanche, F., Prom Tep, S., Sénécal, S., 2010. Physiological Measures, Eye and
Task Analysis to Track User Reactions in User Generated Content. Proceedings of Measuring
Behavior 2010, pp. 218–222.
Ekman, P., Levenson, R.W., Friesen, W.V., 1983. Autonomic Nervous System Activity Distinguishes
among Emotions. Science 221, pp. 1208–1210.
Gao, Y., Barreto, A., Adjouadi, M., 2010. Affective Assessment of a Computer User through the
Processing of the Pupil Diameter Signal, in: Sobh, T., Elleithy, K. (Eds.), Innovations in
Computing Sciences and Software Engineering. Springer Netherlands, Dordrecht, pp. 189–
194.
Goldwater, B.C., 1972. Psychological significance of pupillary movements. Psychological Bulletin 77,
pp. 340–355.
Gunes, H., Pantic, M., 2010. Automatic, Dimensional and Continuous Emotion Recognition.
International Journal of Synthetic Emotions 1, pp. 68–99.
Haag, A., Goronzy, S., Schaich, P., Williams, J., 2004. Emotion Recognition Using Bio-sensors: First
Steps towards an Automatic System, in: André, E., Dybkjær, L., Minker, W., Heisterkamp, P.
(Eds.), Affective Dialogue Systems. Springer Berlin Heidelberg, Berlin, pp. 36–48.
Harbich, S., Hassenzahl, M., 2008. Beyond Task Completion in the Workplace: Execute, Engage,
Evolve, Expand. Affect and Emotion in Human-Computer Interaction 2008, pp. 154-162.
Hess, E.H., 1965. Attitude and Pupil Size. Scientific American 212, pp. 46–54.
Hess, E.H., Polt, J.M., 1964. Pupil Size in Relation to Mental Activity during Simple Problem-Solving.
Science 143, pp. 1190–1192.
Hollender, N., Hofmann, C., Deneke, M., Schmitz, B., 2010. Integrating cognitive load theory and
concepts of human–computer interaction. Computers in Human Behavior 26, pp. 1278–
1288.
Höök, K., 2012. Affective Computing: Affective Interaction and Technology as Experience, in:
Soegaard, M., Dam, R.F. (Eds.), Encyclopedia of Human-Computer Interaction. The
Interaction-Design.org Foundation, Aarhus, Denmark. [WWW Document]. URL
http://www.interaction-design.org/encyclopedia/affective_computing.html
Hudlicka, E., 2003. To feel or not to feel: The role of affect in human–computer interaction.
International Journal of Human-Computer Studies 59, pp. 1–32.
Iqbal, S.T., Zheng, X.S., Bailey, B.P., 2004. Task-evoked pupillary response to mental workload in
human-computer interaction. ACM Press, p. 1477.
Isbister, K., Höök, K., 2009. On Being Supple: In Search of Rigor without Rigidity in Meeting New
Design and Evaluation Challenges for HCI Practitioners. CHI 2009, Boston, MA, USA.
Jainta, S., Baccino, T., 2010. Analyzing the pupil response due to increased cognitive demand: an
independent component analysis study. International Journal of Psychophysiology 77, pp. 1–7.
Kahneman, D., 1973. Attention and effort. Prentice-Hall, Englewood Cliffs, N.J.
Kahneman, D., Beatty, J., 1966. Pupil Diameter and Load on Memory. Science 154, pp. 1583–1585.
Kecklund, G., Åkerstedt, T., 2004. Report on methods and classification of stress, inattention and
emotional states. [WWW Document]. URL http://www.sensation-eu.org/span/pdf/
sens_d_112.pdf
Klingner, J., Kumar, R., Hanrahan, P., 2008. Measuring the task-evoked pupillary response with a
remote eye tracker. ACM Press, p. 69.
Cockton, G., 2008. Designing worth – connecting preferred means to desired ends. Interactions,
July + August 2008, pp. 54–57.
Kreibig, S.D., 2010. Autonomic nervous system activity in emotion: A review. Biological Psychology
84, pp. 394–421.
Kumar, N.K., Kohlbecher, S., Schneider, E., 2009. A novel approach to video-based pupil tracking,
IEEE International Conference on Systems, Man and Cybernetics, SMC 2009, pp. 1255-1262.
Lang, P.J., Bradley, M.M., Cuthbert, B.N., 2008. International affective picture system (IAPS):
Affective ratings of pictures and instruction manual. Technical Report A-8. University of
Florida, Gainesville, FL.
Lee, J.C., Tan, D.S., 2006. Using a low-cost electroencephalograph for task classification in HCI
research. ACM Press, p. 81.
Loewenfeld, I.E., 1993. The pupil: anatomy, physiology, and clinical applications, Vol. 1. Iowa State
University Press, Ames.
Madrigal, D., McClain, B., 2009. Testing the User Experience: Consumer Emotions and Brand
Success. [WWW Document]. URL http://www.uxmatters.com/mt/archives/2009/10/
testing-the-user-experience-consumer-emotions-and-brand-success.php
Marshall, S.P., 2000. Method and apparatus for eye tracking and monitoring pupil dilation to
evaluate cognitive activity. U.S. Patent 6,090,051.
Marshall, S.P., 2002. The Index of Cognitive Activity: measuring cognitive workload. IEEE, pp. 75–
79.
Marshall, S.P., 2003. Methods for monitoring affective brain function. U.S. Patent 6,572,562.
Marshall, S.P., Pleydell-Pearce, C.W., Dickson, B.T., 2003. Integrating psychophysiological measures
of cognitive workload and eye movements to detect strategy shifts, in: Proceedings of the
Thirty-Sixth Annual Hawaii International Conference on System Sciences, p. 6.
Mehrabian, A., Russell, J.A., 1974. An approach to environmental psychology. MIT Press,
Cambridge, MA, USA.
Nielsen, J., Pernice, K., 2010. Eyetracking web usability. New Riders, Berkeley, CA.
Norman, D.A., 2004. Emotional design: why we love (or hate) everyday things. Basic Books, New
York.
Poh, M.Z., Swenson, N.C., Picard, R.W., 2010. A Wearable Sensor for Unobtrusive, Long-Term
Assessment of Electrodermal Activity. IEEE Transactions on Biomedical Engineering 57, pp.
1243–1252.
Poh, M.Z., Kim, K., Goessling, A., Swenson, N.C., Picard, R.W., 2011. Cardiovascular Monitoring Using
Earphones and a Mobile Device. IEEE Pervasive Computing, IEEE Computer Society Digital
Library. URL http://doi.ieeecomputersociety.org/10.1109/MPRV.2010.91
Palinko, O., Kun, A.L., Shyrokov, A., Heeman, P., 2010. Estimating cognitive load using remote eye
tracking in a driving simulator. ACM Press, p. 141.
Palinko, O., Kun, A.L., 2011. Exploring the Influence of Light and Cognitive Load on Pupil Diameter
in Driving Simulator Studies. Proceedings of Driving Assessment 2011.
Palinko, O., Kun, A.L., 2012. Exploring the Effects of Visual Cognitive Load and Illumination on Pupil
Diameter in Driving Simulators. Eye Tracking Research and Applications 2012.
Park, B., 2009. Psychophysiology as a Tool for HCI Research: Promises and Pitfalls, in: Jacko, J.A.
(Ed.), Human-Computer Interaction. New Trends. Springer Berlin Heidelberg, Berlin,
Heidelberg, pp. 141–148.
Partala, T., 2005. Affective information in human-computer interaction. Doctoral dissertation,
Department of Computer Sciences, in: Dissertations in interactive technology, 1. Tampere
University Press, Tampere.
Partala, T., Surakka, V., 2003. Pupil size variation as an indication of affective processing.
International Journal of Human-Computer Studies 59, pp. 185–198.
Picard, R.W., 1997. Affective computing. MIT Press, Cambridge, Mass.
Picard, R.W., Vyzas, E., Healey, J., 2001. Toward machine emotional intelligence: analysis of affective
physiological state. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, pp.
1175–1191.
Picard, R.W., 2003. Affective computing: challenges. International Journal of Human-Computer
Studies 59, pp. 55–64.
Pomplun, M., Sunkara, S., 2003. Pupil dilation as an indicator of cognitive workload in human-
computer interaction, in: Harris, D., Duffy, V., Smith, M., Stephanidis, C. (Eds.), Human-
Centred Computing: Cognitive, Social, and Ergonomic Aspects. Vol. 3 of the Proceedings of
the 10th International Conference on Human-Computer Interaction, pp. 542–546.
Preece, J., Rogers, Y., Sharp, H., 2002. Interaction design: beyond human-computer interaction. J.
Wiley & Sons, New York, NY.
Rowe, D.W., Sibert, J., Irwin, D., 1998. Heart rate variability. ACM Press, pp. 480–487.
Rubin, J., Chisnell, D., 2008. Handbook of usability testing how to plan, design, and conduct effective
tests [WWW Document]. URL http://www.books24x7.com/marc.asp?bookid=25203
Sanches, P., Kosmack Vaara, E., Sjölinder, M., Weymann, C., Höök, K., 2010. Affective Health –
designing for empowerment rather than stress diagnosis. CHI 2010.
Scheirer, J., Fernandez, R., Klein, J., Picard, R.W., 2002. Frustrating the user on purpose: a step
toward building an affective computer. Interacting with Computers 14, pp. 93–118.
Scherer, K.R., 2005. What are emotions? And how can they be measured? Social Science Information
44, pp. 695–729.
Sherman, P., 2007. How Do Users Really Feel About Your Design? [WWW Document]. URL
http://www.uxmatters.com/mt/archives/2007/09/how-do-users-really-feel-about-your-
design.php
Stanners, R., Coulter, M., Sweet, A., Murphy, P., 1979. The pupillary response as an indicator of
arousal and cognition. Motivation and Emotion 3, pp. 319–340.
Tobii Technology Inc., 2009. Guidelines for Using the Retrospective Think Aloud Protocol with Eye
Tracking. [WWW Document]. URL http://www.tobii.com/Global/Analysis/Training/
WhitePapers/ RTA_guidelines_eyetracking_tobii_shortpaper.pdf
Tobii Technology Inc., 2010. Tobii Eye Tracking: An introduction to eye tracking and Tobii Eye Trackers.
[WWW Document]. URL http://www.tobii.com/eye-tracking-research/global/
library/white-papers/tobii-eye-tracking-white-paper/
Tobii Technology Inc., 2010. Tobii TX300 Eye Tracker. [WWW Document]. URL
http://www.tobii.com/Global/ Analysis/Downloads/Product_Descriptions/
Tobii_TX300_EyeTracker_Product_Description.pdf
Tullis, T., Albert, B., 2008. Measuring the user experience: collecting, analyzing, and presenting
usability metrics. Elsevier/Morgan Kaufmann, Amsterdam.
Ward, R., Marsden, P., 2003. Physiological responses to different WEB page designs. International
Journal of Human-Computer Studies 59, pp. 199–212.
Wilhelm, F.H., Grossman, P., 2010. Emotions beyond the laboratory: Theoretical fundaments, study
design, and analytic strategies for advanced ambulatory assessment. Biological Psychology
84, pp. 552–569.
Xu, J., Wang, Y., Chen, F., Choi, H., Li, G., Chen, S., Hussain, S., 2011. Pupillary response based
cognitive workload index under luminance and emotional changes. ACM Press, p. 1627.