
The Personalization of Mobile Health Interventions

JULIAN ANDRES RAMOS ROJAS

Thesis committee:
Anind K. Dey (Co-Chair), Information School, University of Washington
Mayank Goel (Co-Chair), Human-Computer Interaction Institute, Carnegie Mellon University
Carissa Low, Department of Medicine, University of Pittsburgh
Tanzeem Choudhury, Department of Information Science, Cornell University
Robert Kraut, Human-Computer Interaction Institute, Carnegie Mellon University

A thesis proposal submitted in fulfilment of the requirements for the degree of

Doctor of Philosophy

Human-Computer Interaction Institute
School of Computer Science
Carnegie Mellon University

Pittsburgh, PA

26 November 2019

Abstract

Personalized medicine is the adjustment of medical treatment by taking into account

people’s unique demographics, genetic makeup, and lifestyle. This approach, however, relies

on domain knowledge that is often limited and forces medical practitioners to explore multiple

treatments with a patient until finding an appropriate one. During this process, patients

are on their own: They have to remember the specifics of the treatment, and they need to

identify when and what treatment to put into practice. To overcome these challenges, I

envision equipping the most popular computing device, the mobile phone, with the means

to personalize and provide health interventions. This personalized mobile health approach

would give access to health interventions to anyone with a phone, and it would be especially

impactful for populations that lack access to basic health services.

At the core of this proposal, I investigate methods for the personalization of mobile health

interventions using artificial intelligence (AI), smartphones and wearables, and the patient’s

feedback. In my work so far, I have explored two fundamental challenges: when to intervene

(identifying intervention points) and what treatment to use (treatment selection). I approached

these challenges by integrating human-computer interaction work in interruptibility (i.e.,

receptivity) and contextual bandits, an AI method for solving sequential decision-making

problems. This work was applied to a sleep intervention and compared to standard clinical

treatment. The results show that my integrated approach is as good as or better than clinical

treatment, and for a stratum of the study’s sample, the results are clinically meaningful.

For my remaining thesis work, I propose to investigate methods for how to predict the

short-term effect of a treatment (models of effects), and how to predict patient adherence

to treatment (models of behavior). Mobile health researchers have identified the proposed

work as crucial for the advancement of the field. Behavior models are necessary for reducing

intervention burden and increasing adherence to the intervention. Models of effects can

inform the direction and strength of treatments. My hypothesis is that both models could be used


to compute the expected value of treatment effect. This expected value could be used to select

the best treatment: one that takes into account the effect and adherence to treatment. I plan to

use these models to augment the treatment selection previously used in my SleepU system,

which will then be deployed to college students in a sleep intervention. This model-based

approach for a mobile health intervention will be compared against my completed work

that does not use an explicit model of treatment effect and adherence, and a survey-based

approach, where treatment is selected from the patient’s own preferences and forecasts of treatment effects. Additionally, I will measure each patient’s adherence gains from using

this model-based approach. The overall results from this work will inform the development

and deployment of effective and efficient personalized mobile health interventions in the real

world.

Acknowledgements

To be written as a part of the final dissertation.


Contents

Abstract
Acknowledgements
Contents
List of Figures

Chapter 1 Introduction
1.1 The value of dynamic mobile health interventions
1.2 The elements of a mobile health intervention
1.2.1 Distal outcomes
1.2.2 Proximal outcomes
1.2.3 Decision points
1.2.4 Intervention points
1.2.5 Available treatments
1.2.6 Tailoring variables
1.2.7 Treatment selection
1.3 Challenges in the personalization of mobile health interventions
1.3.1 Identifying intervention points using mobile-receptivity (completed)
1.3.2 Treatment selection and receptivity (completed)
1.3.3 Development of a personalized model of effects (proposed)
1.3.4 Development of models of behavior (proposed)
1.3.5 A models-based approach to select initial treatment (proposed)

Chapter 2 Identifying intervention points using mobile-receptivity (completed)
2.1 Mobile-receptivity and interruptibility
2.1.1 Detecting interruptibility
2.1.2 Features
2.2 Mobile-receptivity detection
2.2.1 Data collection
2.2.2 Pre-processing
2.2.3 Classifier and performance evaluation

Chapter 3 Treatment selection and receptivity (completed)
3.1 A framework for the personalization of mobile health interventions
3.1.1 Sleep interventions
3.1.2 Related mobile health interventions
3.2 PECAM Components
3.2.1 Sensor input
3.2.2 Communication Module
3.2.3 Decision-making module: Defining the selection of a health recommendation as a reinforcement learning problem
3.2.4 Framework connection to behavior change theories
3.3 Deployment and testing
3.4 Method
3.4.1 Study design
3.4.2 Participants
3.4.3 Interventions
3.4.4 Measures
3.4.5 Analysis plan
3.5 Results
3.5.1 H1) The combination of a mobile-receptivity detector and a decision-making module produces better sleep outcomes than a traditional sleep hygiene appointment intervention
3.5.2 H2) Delivering sleep recommendations at mobile-receptivity states increases their operationalization
3.5.3 H3) The SleepU app increased motivation
3.5.4 Summary of results

Chapter 4 Development of a personalized model of effects (proposed)
4.1 Related work
4.2 Proposed work
4.3 Evaluation
4.4 Envisioned results

Chapter 5 Development of models of behavior (proposed)
5.1 Related work
5.2 Evaluation
5.3 Envisioned results
5.4 Alternative plan

Chapter 6 A models-based approach to select initial treatment (proposed)
6.1 Related work
6.2 Simulation
6.3 Study
6.3.1 Study protocol
6.3.2 Power analysis
6.4 Envisioned results

Chapter 7 General timeline for the proposal

References

List of Figures

1.1 Traditional vs mobile health intervention cycle
3.1 The PECAM Framework
3.2 Communication module strategy selection process
3.3 Fogg Behavior Model example for a sleep recommendation
3.4 SleepU walkthrough and screenshots
3.5 Study design
3.6 Sleep duration changes over the semester
3.7 Actionability rates for all participants and sub-groups
3.8 Actionability of different notification delivery mechanisms

CHAPTER 1

Introduction

Personalized medicine, or precision medicine, is the tailoring of health interventions to take into account genes, environment, and lifestyle. A national initiative built around this idea was introduced in 2015 by United States President Barack Obama (Collins and Varmus, 2015) and was later renamed the All of Us program (Sankar and Parker, 2017), a project that is currently active in the United States. The value of personalized health interventions comes from trying only the treatments that are most likely to succeed; this approach not only reduces the time needed to achieve improved clinical outcomes, but also decreases costs and improves patients’ overall quality of care by minimizing side effects (Jameson and Longo, 2015). However, precision medicine is still a nascent field, and it requires further advancement of medical techniques for characterizing patients, larger biological databases, and enhanced mobile health technology. Mobile health technology has emerged as a promising path to personalized medicine, not only as a way to monitor patients 24/7 and collect previously unreachable data, but also as a way to support real-time interaction with the patient that could potentially improve engagement and empowerment.

Personalization has traditionally been a process in which both the patient and physician are

involved: The clinician first provides a treatment based on experience, patient’s preference and

goal of treatment; after acquiring evidence (Ashley, 2015) of success or failure in achieving

the desired outcomes, the clinician proceeds to adjust or completely change the treatment. The

need for personalization comes from two main sources that are not necessarily exclusive: Gaps

in medical and personal knowledge. Medical knowledge may be insufficient to understand

adherence or treatment effect for an individual. Personal knowledge means an individual may

not be aware of her own preferences, treatment adherence (how the patient will comply with


treatment), or side effects of treatment (e.g., unknown allergies); this means that even when the science is precise, gaps in knowledge still play a role, and this is a challenge that may only be solved through trial and error. This manual personalization process is inefficient: it can

take a long time to find a treatment that works; the multiple visits to the physician have a monetary cost; and meanwhile the patient has to go through unwarranted treatment that could have side effects and may lead to the patient giving up on treatment.

Another problem is that patients are on their own when it comes to self-monitoring and

self-managing their treatment, two crucial components of self-efficacy: an individual’s belief

in their innate ability to achieve goals (e.g., take medication on time, exercise more, etc.).

Without self-efficacy, behavior change is not viable. Mobile health (mHealth) researchers have

shown the feasibility of using Artificial Intelligence (AI) methods and mobile sensors (Paredes et al., 2013; Rabbi, Aung, Zhang, & Choudhury, 2014; Rahman, Czerwinski, Gilad-Bachrach, & Johns, 2015; Sano, Johns, & Czerwinski, n.d.) to personalize health interventions. Also, there

has been work looking at the design of tools for patients that support self-management (e.g.,

blood glucose levels (Desai, Levine, Albers, & Mamykina, 2017)). In my thesis, I propose

to further advance the field by studying and testing ways to personalize the elements of

mobile health interventions. Personalization, tailoring and individualization will be used

interchangeably in this proposal as they refer to the same concept in this line of work.

1.1 The value of dynamic mobile health interventions

Just-In-Time Adaptive Interventions in mobile health (Nahum-Shani et al., 2018), referred to simply as mobile health interventions in this proposal, are interventions that are delivered via a mobile device and are tailored in a dynamic fashion, i.e., changes to the health intervention are based on sensor data or user feedback and are performed multiple times over the duration of the intervention.

Mobile health interventions are a type of dynamic computer-tailored health intervention, where dynamic means the intervention is adjusted multiple times over its duration. In comparison, traditional computer-tailored interventions are not dynamic (they are static): tailoring is usually done at most once, at the beginning of treatment. Dynamic computer-tailored health interventions have increased efficacy in comparison to static health interventions (Krebs, Prochaska, & Rossi, n.d.). Besides the value of being more effective than a static health intervention, mobile health interventions have the added benefit

that they can accompany the patient at all times: A mobile health intervention can both reach

(push) or be reached by (pull) the patient at any time and place (Smith et al., 2016). Ultimately

one of the most promising roles of a mobile health intervention is to support the patient at

the time and place where treatment is put into practice, and this is a role that even the best

medical care cannot provide.

Mobile health interventions are defined by components that are not present in traditional

health interventions due to the intrinsic capabilities of mobile computing devices that make

health interventions readily available anytime and anywhere. Some of these elements have

been identified in the literature (Nahum-Shani et al., 2018), while other elements are extended (e.g., available treatments, tailoring variables, treatment selection) or first defined (intervention points, initial treatment) in this proposal to better match the nature of mobile health interventions.

To better illustrate some of these elements, Figure 1.1 shows a general mobile health intervention cycle compared to a traditional health intervention. The following are the elements of a

mobile health intervention considered throughout this proposal:

1.2 The elements of a mobile health intervention

1.2.1 Distal outcomes

These are defined as the set of outcomes that are the ultimate goal of the intervention (Nahum-

Shani et al., 2018). This is also referred to as the primary clinical outcome. For example,

in drug rehabilitation, the distal outcome is the elimination of drug use; in sleep hygiene,

it is the improvement of sleep health factors. Distal outcomes are very important to health interventions; however, they are usually difficult to use for day-to-day treatment adjustment:


FIGURE 1.1: This diagram shows a basic health intervention cycle and each stakeholder. The patient first gets a diagnosis, afterwards the initial treatment follows, and then there are treatment adjustments some time after. In order to select the initial treatment, the doctor needs to take into account the patient’s demographics, genomics, lifestyle, and other factors. After the initial treatment, the patient goes back to the doctor and, depending on the health state, the treatment may be adjusted. In a mobile health intervention, the process is the same, but every decision is taken autonomously. Also, treatment adjustment does not have to occur at a fixed point in time; it can happen within days or hours, depending on the disease. This new model of health, however, has three main challenges: 1) how to select the initial treatment, 2) when to deliver the treatment, and 3) how to select a treatment. These challenges are explained thoroughly in section 1.3.

There is usually a long time between the administration of treatment and the observation of

change. Distal outcomes alone are not sufficient to measure the intermediate success of a

health intervention; however, they are crucial for the design of a health intervention. Distal

outcomes are usually domain specific.


1.2.2 Proximal outcomes

These are any outcomes that could potentially lead to the desired distal outcome as mediating

or direct factors affecting the distal outcome (Nahum-Shani et al., 2018). Typical examples of

a proximal outcome are mediators of behavior change like motivation (“BJ Fogg’s Behavior Model,” 2016; Michie, van Stralen, & West, 2011) and self-efficacy (Bandura, 1976). Proximal

outcomes apply not only to behavioral interventions but also to pharmacological treatments

that rely on basic behaviors of the patient like taking pills at specified times; in this case,

adherence to treatment is a crucial factor: Patients’ failure to adhere to medication regimes causes 33 to 69% of medication-related hospitalizations and accounts for $100 billion in annual health care costs (Osterberg & Blaschke, 2004). Proximal outcomes are not domain specific, but they are

adapted to each intervention. As an example, adherence for a pharmacological treatment

can be measured by counting how many times a patient takes a pill on time, while in a sleep

intervention, it could be measured by the number of times the participant fills out a sleep

diary. In both cases the construct is the same, but the measure is specific to the intervention.
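To make the distinction concrete, the same adherence construct can be computed from a single intervention-specific event count. A minimal sketch in Python; the event counts below are illustrative, not from the study:

```python
def adherence_rate(completed_events: int, expected_events: int) -> float:
    """Generic adherence construct: the fraction of expected treatment
    events the patient actually completed."""
    return completed_events / expected_events if expected_events else 0.0

# Intervention-specific instantiations of the same construct (hypothetical counts):
pill_adherence = adherence_rate(completed_events=25, expected_events=28)   # pills taken on time
diary_adherence = adherence_rate(completed_events=20, expected_events=28)  # sleep diary entries filled
```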

1.2.3 Decision points

These are the points in time, or more generally contexts (e.g., location, time of day, mood),

where a health intervention is adjusted (Nahum-Shani et al., 2018). Such adjustment could

be based on a combination of sensor input, patient feedback, computational feedback (i.e.,

estimates of future outcomes from a model) or even physician’s feedback. These decision

points may or may not be of importance depending on the application and the computing of a

decision can be decoupled from the delivery. As an example, for a sleep intervention using

sleep-related outcomes, decision points could occur every day after waking up, or they could

be computed right before the moment of delivery. Assuming the sleep treatment depends

only on the previous night of sleep, there is no difference between computing a decision right

before treatment is delivered or as soon as the night of sleep data is available (after waking

up). In contrast, in an intervention for increasing physical activity based on steps, right before

delivering an intervention, an estimate of the current number of steps is necessary in order to

suggest the number of steps left to meet a pre-defined goal. In general, interventions where


the target of the intervention involves an ever-changing process (like a step count) will require

a decision point close to delivery.

1.2.4 Intervention points

These are the points in time or more generally context where a health intervention is delivered

to the patient. An important differentiator of intervention points is whether they are vulnerable

or opportunistic states (Nahum-Shani et al., 2018). Vulnerable states are those leading to

undesirable or dangerous outcomes; as an example, a stressful situation could be a vulnerable

state for a person going through drug rehabilitation since such an event could lead to relapse.

Opportunistic states are contexts used to improve health outcomes without a necessary

connection between the health outcome and treatment. As an example, the same individual

going through rehabilitation may benefit from sporadic and randomly timed reminders to

engage in positive social interactions and exercise. A key construct to find the best intervention

points is receptivity: “an individual’s transient ability and/or willingness to receive, process

and utilize just-in-time support”. This construct, rooted in the dual-process model of supportive communication (Burleson, n.d.), states that supportive communication (e.g., a

sleep recommendation) can result in positive changes in behavior when the recipient is

motivated to process and enact the message. The identification of receptivity is crucial

for finding opportunistic intervention points. Although there has not been work looking

at the detection of receptive states from sensor streams or data in general, researchers in

human-computer interaction (HCI) have a well-established body of work on two similar concepts: interruptibility and engagement. There are multiple definitions of interruptibility, but

for this proposal I refer to interruptibility as the idea that people have moments during the

day when they are available to be interrupted. At such times, an interruption has a low enough cost that it is acceptable (Ho & Intille, 2004; Okoshi et al., n.d.).

Interruptibility has been studied around computer use and more recently mobile phone use,

and as such all of this body of work is centered on finding interruptible states when an

individual is interacting with a computer or a mobile phone. More recently, HCI researchers

have looked at engagement detection (Pielot et al., 2017), an extension of interruptibility


detection, where the goal is to detect not only when an individual can be interrupted but also

when the individual further engages with the content of the interruption. An easy way to

differentiate the two follows: When an individual receives an SMS and does not even look at

it, the individual is not interruptible; when the individual glances at the SMS, the individual

is interruptible; lastly, when the individual looks at the SMS, opens it and even replies to

the sender or further engages in a task related to it, the individual has been engaged. In this

work, we use engagement detection as a proxy for detecting receptivity; however, we make the

distinction that detecting a state of engagement may not always result in the detection of a

receptive state given that receptivity is more involved and depends on variables intrinsic to

the individual like ability and motivation to engage with a stimulus. All of these concepts

are related in the following way: interruptibility precedes engagement, and engagement precedes receptivity. Interruptibility is necessary but not sufficient for engagement; likewise,

engagement is necessary but not sufficient for receptivity, and receptivity implies an individual

is interruptible and engaged. Despite the importance of receptivity, and its related constructs of

engagement and interruptibility, there has not yet been any work using detection of receptivity

to trigger the delivery of a health intervention. However, some researchers have already

started including receptivity in their study protocols for future studies (Kramer et al., 2019).

1.2.4.1 Initial treatment

In this proposal, I further refine the definition of intervention points to include the initial treatment. The initial treatment refers to the state in which the intervention starts and is delivered to an individual. There are two options for how an intervention could start: 1) The intervention could start with a treatment picked at random among the possibilities

for treatment. This is the less ideal case; however, it is realistic in situations where there is

not enough knowledge about the patient to perform any kind of personalization. Also, this

could be an option for interventions that are trying to fulfill both research and clinical goals; if the initial treatment is uniformly randomized, this stage is a micro-randomized trial (Klasnja et al., n.d.), and the data generated from it can be used for causal inference. At later

decision points the intervention could move away from a uniform probability distribution,


however, the data generated from that point forward cannot be used for causal inference

because treatment is not provided in a random fashion and instead is focused on the clinical

goal. 2) The intervention could start with a treatment picked using variables that help

identify the subset of treatments that have a higher chance of succeeding at achieving the

target outcome of the intervention. This treatment selection can be performed by means of

expert knowledge where a physician could look for specific demographic variables or other

signs. This treatment selection could also be performed using computational models that

can estimate from clinical health records or biological databases, possible outcomes based

on demographics or genetic makeup. Another possibility is to use a mixed approach where

physicians rely on computational models and their own knowledge to determine the best

course of treatment.
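A minimal sketch of the first option, uniform randomization at each decision point, which is what makes this initial stage a micro-randomized trial; the treatment list and log format are hypothetical:

```python
import random

TREATMENTS = ["sleep_tip_A", "sleep_tip_B", "sleep_tip_C"]  # hypothetical options

def randomized_initial_treatment(context: dict, log: list) -> str:
    """Pick a treatment uniformly at random and log (context, treatment,
    propensity) so the resulting data supports causal inference."""
    treatment = random.choice(TREATMENTS)
    log.append({"context": context, "treatment": treatment,
                "propensity": 1.0 / len(TREATMENTS)})
    return treatment
```

Once the intervention moves away from this uniform distribution to pursue the clinical goal, the logged assignments no longer come from a randomized design, which is why the later data cannot be used for causal inference.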

1.2.5 Available treatments

These are referred to as intervention options in the literature and are the different types of treatment that are available for delivery at any given point. Here, I add "available" to highlight the changing nature of the patient's context, and how that context ultimately changes her ability to put health treatments into practice. Nahum-Shani (Nahum-Shani et al., 2018) further defines, as part of the available treatments, the medium of delivery (e.g., SMS, email, phone call), the type (advice, feedback), and even the quantity of the treatment (e.g., the dosage of a medication or the number of times a health recommendation is provided).

1.2.6 Tailoring variables

Traditionally, tailoring variables have been focused on the patient receiving the interventions

and as such, these variables provide information about the individual that helps decide when and what intervention to provide (Nahum-Shani et al., 2018). However, it is important to note that, from a mobile health intervention point of view, intervention options must be

dependent on the context of the individual receiving the intervention and the computational


resources available (e.g., battery levels, data available, internet connection). The context

of the individual can define the content of the intervention; as an example, reminding a

person to exercise when they are ready to go to bed is not only counter-intuitive, it is

also frustrating. Similarly, taking into account computational resources should limit which

recommendations are suggested to those that have enough support from data collected on that

particular individual or when the intervention is a task that requires computational resources

to complete; if such a task relies on having an internet connection and connectivity is not

available, the system should automatically provide other tasks that are available under the

current circumstances. Tailoring variables are domain and system specific.

1.2.7 Treatment selection

Treatment selection or decision rules (Nahum-Shani et al., 2018) are the underlying mechan-

ism that uses the tailoring variables to select intervention options. The decision rules pick

the intervention treatment (intervention options) based on the variables being tracked during

the intervention (tailoring variables). More broadly, these rules are not necessarily static

and can adapt to evidence of treatment or patient feedback in order to increase treatment

efficacy, engagement or any other proximal or distal intervention outcomes. This is a key

difference from traditional approaches to treatment selection; in the context of mobile health interventions, treatment selection is not static, and treatments are updated on a data-driven basis. An example of this approach is MyBehavior (Rabbi et al., 2014), a system that uses

a stochastic method to determine the best intervention to provide based on sensor data and

personal preferences.
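In its simplest, static form, a decision rule is just a mapping from tailoring variables to an intervention option. A hypothetical sketch, with illustrative variables and treatments; adaptive treatment selection replaces rules like this with a learned, data-driven policy:

```python
def decision_rule(context: dict) -> str:
    """Static decision rule: map tailoring variables to an intervention option."""
    if context["hours_to_bedtime"] < 1:
        return "wind_down_routine"       # close to bedtime: suggest a low-effort option
    if context["had_caffeine_today"] and context["hour_of_day"] >= 14:
        return "avoid_caffeine_reminder"
    return "exercise_suggestion"
```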

As shown in this section, the elements of a mobile health intervention are not fundamentally different from those of a traditional health intervention; however, the nature of a mobile health intervention provides new challenges and opportunities for improved health care. The first difference is in the initial treatment selection: in a traditional health intervention, the physician uses her expertise and medical knowledge to decide, whereas in a mobile health intervention, the initial treatment could be chosen in a data-driven fashion. Another difference is that in a mobile health intervention, intervention points do not need to be fixed; they can be tailored in ways that are not bound by the availability of a physician, the time of day, or even geographic location. Instead, a mobile health intervention could intervene at any time, as needed. Last, a mobile health intervention could decide treatment at any intervention point in an objective manner by using available data. In the next section, all of these challenges and their possible solutions are illustrated.

1.3 Challenges in the personalization of mobile health

interventions

Mobile health researchers have identified several aspects necessary to achieve full personaliz-

ation of health interventions. These challenges arise naturally and are rooted in the different

elements of a mobile health intervention; I first explore two fundamental challenges: when to intervene (identifying intervention points) and what treatment to use (treatment selection). After solving the above, the next challenge is to get individuals to install and try a mobile health intervention app: In 2018, mobile phone users uninstalled 28% of the health apps installed on their phones (of Apps, 2018). One possible way to minimize the uninstall rate, in the context

of a mobile health intervention, is to focus on the initial treatments. As will be described

in chapter 4, the initial treatments could be improved by using a prior estimated from the

integrated model of behavior and effect that can then be fed to a contextual bandit, which can

then pick sleep recommendations from the beginning that are most likely to be followed and

have a positive outcome on sleep. Moving forward, I will refer to this problem as the initial treatment challenge: how to select treatments at the beginning of the study that are more likely to keep the patient engaged, the intervention's burden low, and the distal health outcomes at a satisfactory level. I plan to solve the initial treatment challenge by first estimating a model of the short-term effect of treatment, i.e., a model that can estimate the direction and strength of a treatment's effect on a health outcome. Second, I want to estimate a model of behavior: a model able to estimate how likely a given treatment is to fit the lifestyle and preferences of a patient, and thus provide the likelihood that the patient will comply with treatment. And last, I want to use the effects and behavior models to compute the long-term effect of treatment, which will take into account both the patient's estimated preference and the strength of effect.

In this section, I provide a brief description of each of the challenges as well as the completed

and proposed work related to them. All of the completed and proposed work in this thesis

generalizes to many different health interventions, but due to time and space constraints

the work is focused around the automation of sleep hygiene, a well-known sleep health intervention. Details about the definition, importance, and treatment of sleep are provided in section 3.1.1. I now provide a general breakdown of the challenges:

1.3.1 Identifying intervention points using mobile-receptivity

(completed)

The first challenge is the identification of an intervention point; given the nature of mobile health interventions, this requires finding the best possible context for delivery of treatment. Context is not limited to time: it could include location, weather, current activity, cognitive state, or constructs specific to mobile health interventions like mobile-receptivity (defined in section 2.1), among others. Identifying intervention points is crucial for the success of a

health intervention. This challenge has not yet been explored in human-computer interaction; most intervention work has been limited to passive approaches, where the intervention treatments appear as part of the smartphone's home screen or when the user decides to look for them. In already completed work, I show how intervention points can be identified

by estimating mobile-receptivity, a measurable construct of receptivity (Nahum-Shani et

al., 2018) through a machine learning classifier built from smartphone sensor data. This

mobile-receptivity detector was used in the context of a sleep health intervention to identify

the best delivery times for sleep recommendations. The classifier performance at identifying

receptive times in general is shown in section 2.2. The effect of using receptivity in a sleep intervention is shown in chapter 3.


1.3.2 Treatment selection and receptivity (completed)

After the identification of a time for treatment, selection of treatment is the next challenge.

Selecting a treatment in the context of a mobile health intervention is a challenging process:

From a very small amount of data, the method chosen for treatment selection should be

able to pick those treatments that will result in the highest increase in the distal outcome.

Although there is work on the topic looking at personalized (Rabbi et al., 2015) and cohort-driven (Daskalova et al., 2018) treatment selection, that work is mostly

focused on the reinforcement of positive behaviors. In this proposal, I present a method

that generalizes a multi-armed bandit method, in a computationally tractable fashion, to

include contextual data for the selection of health recommendations and works in tandem

with a mobile-receptivity detector that recognizes the best times for delivery of treatment. In

comparison to previous work (Rabbi et al., 2015; Daskalova et al., 2018), this method recommends new treatments to participants and may also reinforce existing ones. This novel approach was implemented and tested in the context of a sleep health intervention. Results of this intervention, as well as details about the system, can be found in chapter 3.
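To make the decision-making concrete, below is a minimal sketch of a contextual bandit of the LinUCB family applied to treatment selection. It illustrates the general technique only; it is not the exact algorithm or reward design used in my system, and the feature and reward choices are hypothetical:

```python
import numpy as np

class LinUCB:
    """Contextual bandit: one linear reward model per treatment (arm),
    plus an upper-confidence bonus to balance exploration and exploitation."""

    def __init__(self, n_arms: int, n_features: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(n_features) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(n_features) for _ in range(n_arms)]  # per-arm reward sums

    def select(self, x: np.ndarray) -> int:
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                        # estimated reward weights
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Hypothetical use: x = sensed tailoring variables at a receptive moment;
# reward = 1.0 if the delivered sleep recommendation was followed, else 0.0.
bandit = LinUCB(n_arms=5, n_features=8)
x = np.random.rand(8)
arm = bandit.select(x)
bandit.update(arm, x, reward=1.0)
```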

1.3.3 Development of a personalized model of effects (proposed)

The challenge of treatment selection can be overcome without an initial model of effects, i.e., a model capable of estimating the strength and direction of a treatment's effect on a target outcome.

This model is estimated over the course of the mobile health intervention and it is used by

the treatment selection method. However, that approach is slow and puts a high strain on the

patient by forcing the exploration of treatments that may be onerous, painful or inefficient.

Having a personalized model of effects for each patient can potentially save time, keep the

patient engaged and improve the overall efficiency and efficacy of a mobile health intervention.

Despite all the advantages of using an effects model, its estimation and use in the context of

a mobile health intervention has remained elusive until the time of writing of this proposal.

For my thesis, I propose to estimate such models for the intervention options of a sleep

hygiene intervention. Estimating the effects of the different sleep recommendations on sleep health is very challenging and will require the use of, and comparison across, techniques like


hierarchical linear models, probabilistic graphical models like Hidden Markov models and

structural equation modelling. The precise estimation of such effects is very challenging and

likely infeasible. However, approximate estimates, or estimates that can provide the direction of a treatment's effect or rankings among the available treatments, are suitable approaches for making this model feasible. I foresee the comparison and implementation of this approach as the main contribution of this part of the proposed work, as described in chapter 4.
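As one concrete instance of the techniques named above, a hierarchical (mixed-effects) linear model can estimate per-treatment effects on a sleep outcome while pooling across participants. The sketch below uses statsmodels on synthetic data; all column names are hypothetical placeholders for logged intervention data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical log: one row per participant-night, with the treatment
# delivered that day and the following night's sleep duration (hours).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "participant": rng.integers(0, 20, size=400),
    "treatment": rng.choice(["A", "B", "C"], size=400),
    "sleep_duration": rng.normal(7.0, 1.0, size=400),
})

# A random intercept per participant captures individual baselines; the
# fixed effects of C(treatment) estimate each recommendation's direction
# and strength relative to a reference treatment.
model = smf.mixedlm("sleep_duration ~ C(treatment)", data, groups=data["participant"])
result = model.fit()
print(result.summary())  # coefficient signs give direction; magnitudes give strength
```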

1.3.4 Development of models of behavior (proposed)

Mobile health researchers have identified the value of models of patient behavior as a way to

inform a mobile health intervention (Hekler et al., n.d.; Nilsen et al., 2016; Riley et al., n.d.;

Tewari & Murphy, 2016). Models capable of estimating people’s preference or likelihood

for following a specific sequence of situations and actions have been used to estimate the

behavioral differences among different types of drivers, routes a cab driver may prefer while

navigating a city or even how people will move around an office environment or parking

structure. In my thesis, I propose to use a modified version of those models to estimate the

likelihood of people's behavior in the future, and to then use these models together with a model of the effects of treatment to estimate the expectation of treatment outcomes. The contribution of this work lies in the adaptation, implementation, and comparison of inverse reinforcement learning models for mobile health interventions; further details are described in chapter 5.
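The proposed work centers on inverse reinforcement learning. As a much simpler stand-in that conveys the role such a model plays, the sketch below fits a supervised classifier mapping context and treatment to the probability that the patient follows the recommendation; the columns and synthetic data are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical delivery log: one row per delivered recommendation, with
# context features and whether the patient followed it (adhered: 0/1).
rng = np.random.default_rng(0)
log = pd.DataFrame({
    "hour_of_day": rng.integers(0, 24, size=300),
    "at_home": rng.integers(0, 2, size=300),
    "steps_last_hour": rng.integers(0, 2000, size=300),
    "treatment_id": rng.integers(0, 5, size=300),
    "adhered": rng.integers(0, 2, size=300),
})
features = ["hour_of_day", "at_home", "steps_last_hour", "treatment_id"]

behavior_model = LogisticRegression(max_iter=1000)
behavior_model.fit(log[features], log["adhered"])

# P(adherence | context, treatment): the quantity the behavior model
# supplies to downstream treatment selection.
p_adhere = behavior_model.predict_proba(log[features])[:, 1]
```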

1.3.5 A models-based approach to select initial treatment (proposed)

The integration of models that take into account not only the effect of an intervention but also other aspects of the individual, like preferences and routine behavior, is fundamental (Nilsen et al., 2016) to guarantee personalization. Using models that take into

account patient preference and effects of treatment to select treatment can possibly achieve

better outcomes than a mechanism that selects a treatment but ignores those factors (Nilsen et

al., 2016). In this proposal, I want to investigate methods for merging models of effects and

behavior with the goal of decreasing burden and increasing efficacy of the initial treatment.


As a main approach, I plan to use a model of behavior to estimate the probability of daily

life situations together with the decisions taken by an individual in relation to a sleep health

intervention. Such probability estimates over a span of days or weeks provide a simulation of

an individual’s behavior. This simulation then can be combined with the effects of treatment

to compute the expectation of treatment, and the expectation can be computed for all of the

available treatments. The expectation of treatment is an estimate of the long-term effects of

the intervention. The simulation results could also be used to estimate a confidence interval

for each of the intervention treatments. The expectations computed from the simulations

could further inform day to day treatment selection by trying to maximize long-term effects

or they could inform the selection of the initial treatment (Tewari & Murphy, 2016). The

contribution of this part of the proposal lies in the implementation and testing of a method for

merging the model of behavior and treatment effects. Furthermore, this method will be used

to pick the initial treatments in the context of a sleep health intervention deployed to college students in the spring of 2020. Details about the study are provided in chapter 6.
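A minimal sketch of the combination step described above: simulate contexts from the behavior model, weight each treatment's short-term effect by the probability of adherence, and rank treatments by the resulting expected value. The function arguments are hypothetical stand-ins for the models proposed in chapters 4 and 5:

```python
import numpy as np

def expected_treatment_value(treatment, simulated_contexts, p_adhere, short_term_effect):
    """Expectation of a treatment's long-term value: the average, over
    contexts sampled from the behavior model, of adherence probability
    times the estimated short-term effect in that context."""
    values = np.array([p_adhere(c, treatment) * short_term_effect(c, treatment)
                       for c in simulated_contexts])
    return values.mean()  # the spread of `values` can also inform an interval estimate

# Hypothetical initial-treatment selection over the available treatments:
# best = max(TREATMENTS, key=lambda t: expected_treatment_value(
#     t, simulated_contexts, p_adhere, short_term_effect))
```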

CHAPTER 2

Identifying intervention points using mobile-receptivity (completed)

Intervention points can be broadly defined as contexts (time, location, etc.) where treatment is delivered. Following the definition provided by Nahum-Shani et al. (2018) for

intervention points, this work is focused on the identification of opportunistic states defined as

contexts where the patient is not in a vulnerable state but is in a state where she has the "ability

or willingness to receive, process and utilize just-in-time support". Receptivity identification

is crucial for the success of mobile health interventions (Nahum-Shani et al., 2018), but it

may be impossible to measure since it requires the sensing of constructs like willingness

or contextual ability. Although there has not been any work looking at the detection of

receptive states from sensor streams, researchers in human-computer interaction (HCI) have

a well-established body of work on a very close concept: interruptibility. This section summarizes the most prominent and recent work on interruptibility detection from mobile phone sensors. That body of work inspires the definition of mobile-receptivity presented in section 2.1, a construct very close to receptivity, adapted for mobile health interventions and constrained to be measurable through mobile phone sensors or similar technologies. Using this definition, I implemented and tested a mobile-receptivity detector. The detector is a machine learning model trained using four weeks of mobile phone data from 37 people.

Performance of the receptivity detector is provided at the end of this section. This mobile-

receptivity detector was used in a randomized clinical trial as a trigger for the delivery of a

sleep health intervention presented in chapter 3. Details about the mobile-receptivity detector

implementation are provided in section 2.2.


2.1 Mobile-receptivity and interruptibility

Interruptibility is closely related to receptivity (Nahum-Shani et al., 2014); however, there is not a single definition of interruptibility, and instead it has been studied under different terms:

• Interruptibility (Okoshi et al., 2016; Ho and Intille, 2005): the idea that people have

moments during the day when they are available to be interrupted. At such times, an interruption has a low enough cost that it is acceptable.

• Attention (Pielot et al., 2014; Pielot et al., 2015): The idea that people are busy and

have moments of attention that they can direct towards something other than their

current task.

• Boredom (Pielot et al., 2015): the idea that people intentionally seek information

and ways to entertain themselves.

• Engagement (Pielot et al., 2017) with the information presented: Users not only

attend to a notification but click on it to find out more about it. Engagement detection is a step forward in the direction of receptivity detection, and it is well differentiated from interruptibility work, which has mostly focused on finding a moment where the user is reachable by a notification or another type of alert (Pielot et al., 2017).

Instead, engagement detection aims to estimate user states where they are likely to

engage with the content provided.

All of these concepts are related in the following way: interruptibility precedes engagement, and engagement precedes receptivity. Interruptibility is necessary but not sufficient for engagement; likewise, engagement is necessary but not sufficient for receptivity, and receptivity

implies an individual is interruptible and engaged. Despite the importance of receptivity,

and its related constructs of engagement and interruptibility, there has not yet been any

work looking at the detection of receptivity to trigger the delivery of a health intervention.

However, some researchers have considered including receptivity in future studies (Kramer

et al., 2019), as a fundamental part of mobile health interventions. In this work, we bridge

interruptibility and receptivity under a new term, mobile-receptivity: A state in which an


individual has the cognitive ability to stop their current task to read and make sense of a

notification related to a health treatment in the context of a mobile health intervention. In

practice, this can be measured by means of observing when the user clicks and reads through

a push notification from a mobile phone application. Although mobile-receptivity is more

constrained than interruptibility, much of the related work and lessons learned in building

models of interruptibility can be used for building models of mobile-receptivity.

2.1.1 Detecting interruptibility

Although interruptibility itself is not sufficient for identifying mobile-receptivity states, many

of the methods and features used are useful for detecting mobile-receptivity. The preferred

method for building models of interruptibility is by using machine learning classifiers. Re-

searchers have used different classifiers to build successful interruptibility detectors; however,

the preferred classifiers are decision trees and random forests (Pielot et al., 2014; Ho and

Intille, 2005; Pielot et al., 2017; Katevas et al., 2017; Okoshi et al., 2016; Dingler and

Pielot, 2015). The performance of models of interruptibility has been measured mainly in

two different ways: leaving a subset of users out at random, or cross-validation in which data is randomized without taking into account time or user independence. The latter evaluation is the most prevalent in the literature and accounts for the best results. This is expected, due to cross-validation's over-optimistic results on time series data, where the independence assumption is broken; as a result, work that splits the data by user has lower, but more realistic, performance, closer to what can be expected in a real-world deployment. A majority of the work in this domain reports accuracy, precision, and recall. Engagement work (Pielot et al., 2017) shows the lowest performance; however, this is expected, since engagement is only a small subset of interruptible situations and a much more difficult event to detect. Table 2.1 summarizes the interruptibility detection work:


Paper | Method | Evaluation | A | P | R | F1 | Feature selection
Didn't You See My Message? (Pielot et al. 2014) | Random Forests | Random cross-validation | 0.68 | – | – | – | Wrapper (accuracy)
Using Context-Aware Computing (Ho et al. 2005) | Decision Tree | Data split into train and test | 0.91 | – | – | – | –
Beyond Interruptibility (Pielot et al. 2017) | XGBoost | Cross-validation randomizing over random groups of people | 0.89 | 0.218 | 0.540 | 0.31 | Feature selection
People's Interruptibility In-the-Wild (Tsubouchi et al. 2017) | Linear regression | Live evaluation of the model; the performance metric was reduced user response time: 49% (54 to 27 minutes) | – | – | – | – | –
Continual Prediction of Notification (Katevas et al. 2017) | RNNs, XGBoost | Cross-validation including grid search for XGBoost | 0.7 (AUC) | 0.8 | 0.5 | 0.61 | Feature selection
Towards Attention-Aware (Okoshi et al. 2016) | Random forests | – | 0.82 | 0.82 | 0.82 | 0.82 | –
I'll Be There for You (Dingler et al. 2013) | Random forests | – | 0.79 | 0.77 | 0.82 | 0.79 | –
When Attention Is Not Scarce (Pielot et al. 2015) | – | Random cross-validation | 0.83 | – | – | – | –
InterruptMe (Pejovic et al. 2014) | Adaboost | Random cross-validation | 0.73 | 0.36 | 0.48 | 0.41 | –
Using Decision-Theoretic (Rosenthal et al. 2011) | Logistic regression | – | 0.9 | – | – | – | –

TABLE 2.1: All the articles, including method, evaluation, and performance results. A (Accuracy), P (Precision), R (Recall).

2.1.2 Features

In terms of the data used to build the classifiers, there is an ever increasing number of features

used for detecting interruptibility in the literature. The number of features used has varied

from 4 to more than 300 and there is not a general agreement on what features should be used.

However (Pielot et al., 2017) presents an all-encompassing categorization of the different

features used that is informative and allows for flexibility in implementation. Furthermore, all of the features used in other works fall into one of the categories described by (Pielot et al., 2017), so it is recommended to use them in any interruptibility detector:


• Communication activity: Computer-mediated communication. This group includes

features that show how often a user is using the phone to communicate with others

by, e.g., sending or receiving messages, or making or replying to phone calls. For

instance, a user that just got distracted by an incoming phone call might not be open

to further interruptions. Examples of Communication Activity features are: number

of SMS messages received in the past hour, time since the last incoming phone call,

or category of the app that created the last notification.

• Context: Features related to the situation of the mobile phone user, i.e., his or her

environmental context. The context of use often determines whether it is appropriate

or safe to interact with the mobile phone. For instance, being at home during the

weekend may indicate opportune moments for interruption, whereas being at work

during the morning may indicate the opposite. Examples of Context features are:

time of day, estimated current distance from home, recent levels of motion activity,

or average ambient noise level during the last five minutes.

• Phone status: Features related to the status of the mobile phone. For instance, a

device with screen status ‘unlocked’ indicates that the user is currently using the

phone, thus a notification might be interrupting a concurrent task. Examples of

Phone Status features are: the current ringer mode, the charging state of the battery,

or current screen status (off, on, unlocked).

• Usage patterns: The type and intensity of usage of the phone. For instance, a user

engaged in playing a game or watching a video may be less open to an interruption,

whereas while surfing on the Internet might provide a better moment. Examples of

Phone Usage features are: number of apps launched in the 10 minutes prior to the

notification, average data usage of the current day, battery drain levels in the last

hour, number of device unlocks, screen orientation changes, or number of photos

taken during the day.

Demographics is another category; however, it has mainly covered age and gender, and no other variables have been studied. The importance of the features by category was studied by (Pielot et al., 2017); in that work, the ranking of categories from best to worst at predicting interruptibility was: Context (1), Communication (2), Usage Patterns (2), Demographics (3), Phone Status (3). A feature analysis was performed by (Pielot, Dingler, Pedro, & Oliver, 2014); using the same categorization as in (Pielot et al., 2017), the ranking becomes: Communication (1), Context (2), Demographics (3), Usage Patterns (4). These results show that, consistently, Communication and Context are the most important categories.

2.2 Mobile-receptivity detection

The main goal of mobile-receptivity detection is to detect receptivity states when people

are nearby or interacting with their phone, and use this state to remind the individual about

actionable health treatments. It is worth noting that although there are no applications of mobile-receptivity detectors in mobile health interventions, interruptibility classifiers have already been used outside the lab setting to increase news readership in Japan (Okoshi et al., 2018). For my thesis, I built a mobile-receptivity classifier using most of the findings from previous work (Pielot et al., 2014; Ho et al., 2005; Pielot et al., 2017; Okoshi et al., 2016; Katevas et al., 2017; Pielot et al., 2015).

The classifier uses most of the features identified in (Pielot et al., 2017) (communication

activity, context, phone status and usage patterns) and was trained using data from the baseline

phase (4 weeks) of the sleep intervention study described in more detail in chapter 3. Below,

we provide a detailed description of how the mobile-receptivity classifier works and was

evaluated.

2.2.1 Data collection

Data for building the mobile-receptivity detector was collected from the baseline phase (i.e., the first 4 weeks, without any intervention) of the sleep health intervention study described in Chapter 3. The app, which did not interact in any way with the participant, collected smartphone sensor data while running in the background. The app collected a total of 88 different features summarized as: Communication activity (e.g., number of SMS received, time since last phone call, etc.); Context (e.g., light, proximity, activity from Google's activity recognition API); Phone status (e.g., battery level, time since unlocked, number of times locked in the day, etc.);


Usage patterns (e.g., number of apps interacted with, number of UI events, etc.). The app computed and stored the features every second as long as the phone was not asleep.

2.2.2 Pre-processing

Pre-processing during training of the classifier was kept simple to ease implementation and

avoid computational overhead for its future live use as part of a sleep intervention. The first

pre-processing step was to use a sliding window of 5 minutes, and to compute features like

mean, max, min and standard deviation over each window. After that, values were normalized

using a min-max scaler, using pre-stored min and max values to keep consistency across the

classifier training and live deployment.

As in (Pielot et al., 2017), labels for mobile-receptivity states are obtained when the phone

user not only checks a notification but further engages in it by clicking on it.
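A minimal sketch of this pre-processing pipeline is shown below, assuming the raw stream arrives as a [seconds x features] array and that the min/max values were stored at training time; the non-overlapping window step is a simplification.

    import numpy as np

    WINDOW = 300  # 5-minute window over per-second samples

    def window_stats(raw):
        """Summarize one window (shape [seconds, n_features]) into
        its per-feature mean, max, min and standard deviation."""
        return np.concatenate([raw.mean(axis=0), raw.max(axis=0),
                               raw.min(axis=0), raw.std(axis=0)])

    def normalize(x, stored_min, stored_max):
        """Min-max scale with the values stored at training time so
        that live deployment stays consistent with training."""
        return (x - stored_min) / np.maximum(stored_max - stored_min, 1e-9)

    def preprocess(stream, stored_min, stored_max):
        """Slide the window over the stream and yield normalized
        window summaries (non-overlapping steps, for simplicity)."""
        for start in range(0, len(stream) - WINDOW + 1, WINDOW):
            yield normalize(window_stats(stream[start:start + WINDOW]),
                            stored_min, stored_max)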

2.2.3 Classifier and Performance evaluation

We used a MultiLayer Perceptron (MLP) from the scikit-learn library (Pedregosa et al., 2011)

for our mobile-receptivity classifier. Although state-of-the-art models mostly use Decision

Trees, for our implementation we needed the flexibility of a classifier capable of learning

from batches of data (online-learning), allowing us to train a classifier as soon as data arrives

from each participant instead of waiting for all participants to finish their baseline phase. This

functionality is available for the MLP but not for Random Forests or Decision Trees. After the model was trained, it was translated into Android-Java using sklearn-porter (Morawiec).
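The online-learning capability that motivated this choice is exposed in scikit-learn through partial_fit. A minimal sketch follows, where batches is an assumed iterator of (X_batch, y_batch) arrays arriving per participant, and the hidden-layer size is an arbitrary placeholder:

    from sklearn.neural_network import MLPClassifier

    def train_online(batches):
        """Incrementally train on batches as each participant's baseline
        data arrives, instead of waiting for all participants to finish."""
        clf = MLPClassifier(hidden_layer_sizes=(64,), random_state=0)
        first = True
        for X_batch, y_batch in batches:
            if first:
                # all classes must be declared on the first call
                clf.partial_fit(X_batch, y_batch, classes=[0, 1])
                first = False
            else:
                clf.partial_fit(X_batch, y_batch)
        return clf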

The performance was evaluated using leave-one-out validation stratified by participant and is shown in Table 2.2. The mobile-receptivity classifier has better performance (88% accuracy, F1 score = 0.54) than the state-of-the-art engagement classifier (Pielot et al., 2017: Precision = 0.2, Recall = 0.5, F1 score = 0.3).


    Accuracy    Precision    Recall    F1 score
    0.88        0.44         0.74      0.54

TABLE 2.2: Performance of the mobile-receptivity detector
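For reference, per-participant hold-out evaluation of this kind can be expressed with scikit-learn's LeaveOneGroupOut, where the groups identify participants. This is a hedged sketch, not the exact evaluation script used in this work; model hyperparameters are placeholders.

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import f1_score

    def leave_one_participant_out(X, y, participant_ids):
        """Train on all-but-one participant, test on the held-out
        participant, and average F1 across folds."""
        scores = []
        for train_idx, test_idx in LeaveOneGroupOut().split(X, y, participant_ids):
            clf = MLPClassifier(hidden_layer_sizes=(64,), random_state=0)
            clf.fit(X[train_idx], y[train_idx])
            scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))
        return float(np.mean(scores))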

CHAPTER 3

Treatment selection and receptivity (completed)

In this proposal, the process of personalization is defined as the solution of two different but

interrelated problems: Detecting a mobile-receptivity state for delivery of the intervention and

selecting treatment based on health outcome and compliance. To solve these challenges, this proposal presents PECAM, a Personalized and Context-Aware Mobile health intervention framework. This chapter first introduces PECAM and its components, then reviews sleep and sleep-intervention work in HCI, and ends with the results from a sleep intervention using PECAM, delivered in the spring of 2019 to 30 college students.

3.1 A framework for the personalization of mobile health

interventions

PECAM models a health intervention as a reinforcement learning problem incorporating the

way the patient interacts with her phone through notifications. Under PECAM as shown in

figure 3.1, we have a health intervention delivered through a phone that uses a communication

module to decide the context when the different treatments should be delivered, and a decision-

making module that decides which health treatment to deliver. The patient interacts with the

phone through notifications and she could decide to accept (i.e., read and enact the health

treatment although not necessarily immediately), dismiss (i.e., it is considered irrelevant in

the current context) or ignore (i.e., the patient is engaged in a task and did not pay attention at

all to the notification). After the patient decides what to do with the health treatment provided,

she goes about her everyday life represented as the environment (i.e., all of the external factors


that could have an effect on the patient’s decision, motivation and ability or constraints with

respect to the health treatment provided).


FIGURE 3.1: The PECAM framework. PECAM models the way a health intervention can be delivered through a phone to a patient, taking into account the way the patient interacts with the phone. The starting point is the phone, where multiple sensor streams, including onboard sensors and external ones like a wearable, reach the phone for pre-processing and other purposes. In the phone there is a communication module that uses the sensor streams to decide the right context to deliver a health treatment. In tandem, a decision-making module selects, using sensor data and the patient's feedback, the health recommendation (i.e., treatment) that is most likely to be followed and that has the best impact on the health outcome of interest, as measured through some subset of the sensor streams available. A health recommendation is then delivered in the form of a notification to the participant, who may decide to read, dismiss or ignore the recommendation. The consequences of the patient's decision will have an impact that is measurable through the sensors module. The data derived from the sensors module is then used by the communication module and the decision-making module.

For the remainder of this proposal, sleep hygiene (Posner and Gehrman, 2011) is used as the domain in which most of the ideas and methods presented are tested; however, the general framework of this project can be applied to other domains such as weight management, stress management, and physical activity, among other health interventions. The next section briefly describes the importance of sleep and related work in HCI.


3.1.1 Sleep interventions

Sleep in humans is defined as a natural state of unconsciousness where responses to external

stimuli are reduced. Sleep is reversible and occurs at regular intervals that are independent

of many other physiological processes. Sleep has a fundamental role for many essential

processes in the human body that regulate learning (Stickgold et al., 2001; Yang et al., 2014),

memory (Rasch and Born, 2013; Stickgold et al., 2001), weight (Nagai et al., 2013), mood

(Walker, 2009) and cardiovascular health (Wolk et al., 2005) among other processes. Sleep

is multidimensional; there is not a single factor that captures overall sleep quality. Instead,

sleep is defined using the following sleep health (Buysse, 2014) factors: Sleep duration,

the total amount of sleep obtained in a 24-hour period; Sleep efficiency, the ease of falling

asleep and returning to sleep, calculated as the percentage of time asleep out of the total time spent in bed; Timing, the time of occurrence of sleep within a 24-hour day; Alertness, the ability to maintain attentive wakefulness; and Quality, the subjective assessment of sleep.

The ideal sleep hygiene intervention has two components: Sleep Hygiene Education and

Sleep Hygiene Recommendations. The education component refers to teaching individuals

about the importance of sleep and its relation to general health. The recommendations are

a set of practices that are meant to improve sleep. A sleep hygiene intervention usually

starts with the education component and then sleep hygiene recommendations are introduced.

Sleep hygiene recommendations are usually taught by an expert clinician who first does

a sleep assessment to determine the individual’s most salient sleep problem and after that

proceeds to create a personalized plan of treatment: finding a set of recommendations that

are aligned with the patient’s goal, preferences, and desired outcomes. This personalized

plan of treatment however is not static; the individual usually starts trying a small set of sleep

recommendations. After some time, usually weeks, depending on outcomes from this first

plan, the clinician may suggest alternative recommendations in a follow-up visit. This process

is repeated until the desired outcomes are achieved. Oftentimes, however, health services

only provide a limited number of follow-ups or none at all, and these follow-ups are usually

weeks apart. In the meantime, the individual may be wasting time and effort trying out sleep

recommendations that do not work for her and this could result in her dropping the sleep


intervention altogether. In summary, personalization is challenging, time-consuming and

prone to error, and can take from weeks to months due to the modifications to treatment and

limited availability of clinicians, if it even succeeds at all. It is worth noting that there is high

variability in the delivery of a sleep hygiene intervention; for example, at some colleges and

universities, both the education and recommendations components are delivered in the context

of a classroom, but in such a format, there is no personalization of treatment or follow up.

In the best case scenario, sleep hygiene is provided over multiple individual sessions by an

experienced clinician.

One of the earliest works in HCI related to sleep interventions is ShutEye (Bauer et al., 2012),

a smartphone application that shows Sleep Hygiene recommendations at appropriate times in

the background of the home-screen of a user’s smartphone. ShutEye modified the background

of the home-screen to display activities that were encouraged or discouraged depending only

on the time of the day and sleep hygiene recommendations, and did so without sensing sleep-

related parameters. Although the study was exploratory, there was a decrease in subjective

sleepiness score for 8 out of 12 participants.

Horsch et al. (Horsch et al., 2017) demonstrated that the use of reminders increased

adherence to automated parts of a CBT-I based intervention. This intervention was delivered

through a smartphone application that contained a sleep diary, a relaxation exercise, sleep

overview graphs, and reminders (set by the participant) to use the sleep diary and perform

the relaxation exercises. As part of their results, they show that reminders can improve

intervention adherence.

Daskalova et al. present SleepCoacher (Daskalova et al., 2016), a framework for self-

experimentation with sleep recommendations. The system works by using the phone as a

sleep parameters sensor (sleep duration, time to bed, time out of bed, awakenings, etc.). Sleep

measurements are collected over a baseline period of five days and then correlations are

estimated for observed sleep related behaviors (time to bed, sleep environment, etc.) and

sleep related outcomes (awakening, sleep duration, efficiency). SleepCoacher then selects the

pair of sleep behavior-outcomes with the highest correlation, finds a corresponding template

generated by sleep experts, and then asks the participant to follow this behavior for 5 days,


followed by 5 days of no-intervention, then another 5 days of the same recommendation. The

total duration of the final study was 3 weeks with 17 participants. This intervention only

provides one recommendation to each participant. SleepCoacher, given its high correlation

selection algorithm, operates by reinforcing the participant’s behavior that shows the highest

correlation with a positive sleep outcome. In terms of outcomes as an intervention, 2 of the

17 participants showed improvements (Hedge’s g>=0.5) in their respective target variable

(frequency of awakenings, self-reported restfulness and time to fall asleep). In a different

project, Daskalova demonstrates the usage of a cohort-based approach for sleep health

intervention (Daskalova et al., 2018). This method for providing recommendations is based

on providing sleep recommendations for a new patient by looking at data from people with

similar demographics. Once a cohort is identified for a new patient, the sleep-related measure that is most dissimilar from the cohort's is chosen as the sleep target. Then, the sleep

recommendation with the highest positive effect on the sleep target selected is provided to the

participant. Their results show that cohort-based recommendations resulted in an increase of

17 minutes in sleep duration but this result was not statistically significant.

In summary, sleep interventions in HCI are still at an exploratory stage; however, they are very promising. Most of these interventions were based on or are an extension of sleep hygiene recommendations (Daskalova et al., 2016), and the use of daily reminders has shown promising results (Horsch et al., 2017) at increasing adherence to the intervention.

3.1.2 Related mobile health interventions

Mobile health researchers have shown the feasibility of using Artificial Intelligence (AI)

methods and mobile sensors (Rabbi et al., 2016; Paredes et al., 2014; Sano et al., 2017;

Rahman et al., 2016) to personalize health interventions. Paredes et al. (Paredes et al., 2014) present a stress intervention that uses a contextual bandit and the Upper Confidence Bound method to provide stress recommendations through a mobile phone. Their results show a close-to-significant decrease in perceived stress for participants in the ML condition, as well as another effect on coping mechanisms.


Yom-Tov et al. (Yom-Tov et al., 2017) present a system that uses a contextual bandit to personalize

the type of message received to encourage physical activity. The goal of the study was to

increase physical activity to improve health of type 2 diabetes patients. The results show a

positive effect of the system in increasing physical activity and reduction of glucose levels.

The method works as follows: first, a pseudo-random policy is used to collect data for a couple of months. After that, a policy is estimated and used in the study. The policy itself is a linear regression model using features that summarize the state and features that capture the actions as indicator functions. All these features are then used to predict the effect of actions and patient state. To select an action, Boltzmann sampling is performed over the different actions and model outputs. The stochastic nature of the method allows for variability in the treatment provided, so the estimated best treatment is not always the one delivered.
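To illustrate the Boltzmann-sampling step, here is a rough sketch; the action values, the temperature parameter and the function name are stand-ins, not the authors' exact implementation.

    import numpy as np

    def boltzmann_sample(action_values, temperature=1.0, rng=None):
        """Sample an action index with probability proportional to
        exp(value / temperature): higher-valued actions are favored,
        but every action keeps a nonzero probability, which yields
        the variability in treatment described above."""
        rng = rng or np.random.default_rng()
        z = np.asarray(action_values, dtype=float) / temperature
        z -= z.max()                          # numerical stability
        probs = np.exp(z) / np.exp(z).sum()
        return int(rng.choice(len(probs), p=probs)), probs

    # e.g., predicted effects of three message types for the current state
    action, probs = boltzmann_sample([0.2, 0.5, 0.1], temperature=0.5)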

Rabbi et al. introduced MyBehavior (Rabbi et al., 2016), a mobile application that auto-

matically generates recommendations for a healthy lifestyle. MyBehavior uses participant-

provided preferences together with location, activity and food intake logs to suggest recom-

mendations to reduce calorie intake and increase calorie expenditure. MyBehavior was tested

in a multiple baseline (Dallery et al., 2013) design study consisting of a baseline period of 3

weeks, then 2, 3 or 4 weeks of the control condition followed by 7-9 weeks of the treatment

condition. The study was conducted with 16 participants that were ready to act (n=7) or

acting (n=9) towards healthier behavior change previous to the study. MyBehavior delivers

recommendations through an on-screen widget that also shows real-time updates of calorie

intake and expenditure, and chronological summaries of physical activities and food intake.

During the baseline condition, participants do not receive any recommendations, however

they have access to all the tracking information from the app. During the control condition,

participants receive random recommendations from a set of 42 pre-defined recommendations.

During the treatment condition, participants receive recommendations that are adapted to

participant preferences and outcomes. MyBehavior generates the recommendations using two

separate EXP3 (Auer et al., 2002) multi-armed bandits (one for food and another for exercise)

and a Pareto frontier method (Roberts et al.). Together, these two methods find the

recommendation with the best outcomes and with the highest participant preference. When

using the MyBehavior app, participants followed 1.2 more recommendations (p<0.0005),


walked for 10.1 more minutes (p<0.005), burned 42.1 more calories in non-walking exercises (p<0.05) and consumed 56.1 fewer calories (p<0.05) each day. Rabbi et al. followed

MyBehavior with MyBehaviorCBP (Rabbi et al., 2018), which uses a very similar method

for providing suggestions for pain management. For a thorough review of myBehavior, see

(Aung et al., 2017).

Liao et al. (Liao et al., 2019) present a general method for the estimation of vulnerable times from historical data. At the time of this proposal, their results are derived from simulations; however, the authors plan on using this method in a real-world deployment of a physical activity

intervention for hypertension. Their results are very encouraging and show the value of

methods for the delivery of interventions at vulnerable times.

Overall, all of the systems and methods (Rabbi et al., 2016; Yom-Tov et al., 2017; Paredes et al., 2014; Liao et al., 2019) produce very positive results; however, most of them (Rabbi et al.,

2016; Yom-Tov et al., 2017; Paredes et al., 2014), with the exception of Liao’s (Liao et al.,

2019), lack a mechanism for proactively delivering health recommendations at opportunistic

or vulnerable times and instead they rely entirely on the user’s willingness or a predefined

time to receive recommendations. This lack of a delivery mechanism limits the effect of the intervention to participants who are actively engaged with the intervention.

As a consequence, in this proposal, health recommendations are pushed to participants in

a more proactive way by displaying sleep recommendations that are relevant for the time

of the day, and at times when we detect that the patient is in a receptivity context. Also,

treatment is further personalized by using contextual bandits which can better tailor the sleep

recommendations for different contexts.


3.2 PECAM Components

3.2.1 Sensor input

The PECAM framework uses two kinds of sensors to support the functions of the communication module and the decision-making module: phone sensors and external sensors. The phone provides several physical sensors and several virtual sensors used by the communication module to estimate mobile-receptivity. An example of a physical sensor is the accelerometer, which is used to estimate general activities such as walking, jogging, being still, and being in a vehicle through the Google activity recognition API. Examples of virtual sensors are estimates of how many

touches per second are produced by the user, number of calls placed in the last hour, number

of text messages received in the last hour, etc. External sensors are any sensor that is not on

the phone such as a wearable device or digital scale. For SleepU, a Fitbit was used as the

external sensor, which uses its own accelerometers and gyroscopes to estimate basic sleep

stages such as asleep, asleep movement and awakenings. The Fitbit API sleep estimates

were used as an input for the decision-making module. This very specific implementation

of the framework uses the phone and wearable as the source for sensor streams but could

be expanded to other sensors including bed and next-to-the-bed sensors to more accurately

estimate sleep stages.

3.2.2 Communication Module

The communication module is in charge of deciding when to deliver a health intervention in

the form of a notification to the user. The communication module’s main goal is to detect

mobile-receptivity states and use them to remind people of actionable health recommendations

for the current time of day, as chosen by the decision-making module. For SleepU, a mobile-receptivity classifier was built as described in section 2.2.

Although the main goal is to show health recommendations to the user during mobile-

receptivity states, those states are limited to the times when the user is interacting with and


next to the phone. This means that there are times when the user could be in a mobile-

receptivity state but this cannot be detected. To overcome this challenge, the communication

module is stochastic: every hour of the day, as shown in figure 3.2, it

decides at random whether to use the mobile-receptivity classifier for the next hour or to pick a

random time during the next hour to interrupt the participant. The probability that the mobile-

receptivity classifier is used decreases over each time period (i.e., morning, afternoon, evening)

so that a random time will always be picked in the last hour if a recommendation for that time

period has not been seen by the user. To avoid overwhelming the user, the communication

module only sends a notification once per hour during each time period, and only if the user

has not already viewed a recommendation for that time period. Before 9am, notifications are only sent when the user is classified as being in a mobile-receptivity state, to avoid having the notification disrupt the user's sleep.
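A simplified sketch of this per-hour choice is shown below. The linear decay schedule is an assumption for illustration only; the text specifies just that the receptivity-classifier probability is one at the start of the period and that a random time is guaranteed by the last hour.

    import random

    def choose_strategy(hour_in_period, hours_in_period):
        """At the start of each hour, decide whether to rely on the
        mobile-receptivity classifier or to schedule a random minute
        within the next hour. hour_in_period counts 0..hours_in_period-1.
        Assumed schedule: P(classifier) decays linearly from 1 to 0,
        so an unseen recommendation is always pushed at a random time
        before the period ends."""
        p_classifier = 1.0 - hour_in_period / max(hours_in_period - 1, 1)
        if random.random() < p_classifier:
            return "receptivity_classifier"
        return "random_time_this_hour"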

FIGURE 3.2: Communication module strategy selection process. Since the mobile-receptivity detector may not work at all times, every hour the communication module decides at random whether to use the mobile-receptivity detector or a random time during the next hour to push the health recommendation. The probabilities of picking either strategy change over time, with the mobile-receptivity detector having the highest probability at the beginning of the period and the random strategy having the highest probability by the end of the period. Notice how the probabilities are one at the beginning and end; this guarantees that at the beginning the mobile-receptivity detector is used and, at the end, if the patient has not seen a recommendation yet, it will be displayed for sure at a random time.


3.2.3 Decision-making module: Defining the selection of a health

recommendation as a reinforcement learning problem

For PECAM, personalization is defined as the selection of an appropriate health recommenda-

tion based on two different factors: Health outcome and compliance. For SleepU, the sleep

health outcomes taken into account are sleep duration and efficiency. Compliance is defined

as whether or not the participant followed a sleep recommendation. The selection of a health

recommendation is defined as a reinforcement learning problem. Reinforcement learning problems are those related to sequential decision making, in which an agent interacts with an environment by taking actions, and the goal of the agent is to maximize the reward obtained after each action, accumulated over a period of time. In a mobile health

intervention, the agent is the app providing the intervention, the available actions for the app

are the different health treatments that the app can provide to the patient, and the reward is a

measurement of the health outcome of interest.

In the context of the sleep recommendations problem, the agent is the SleepU app, the

available actions for the app are the different sleep recommendations that can be shown to the

user, and the reward is defined as the harmonic mean of sleep duration and sleep efficiency.

Compliance is used to control updates to the estimates of possible rewards for each action;

when a recommendation is followed, an update occurs, otherwise there is no update since

there is no new information for making an update. In summary, the SleepU app is selecting

and displaying sleep recommendations to a participant while trying to maximize the following

day’s sleep duration and efficiency of the participant. For completeness, it is assumed the

following about the sleep recommendation problem, although this generalizes to many other

health recommendation problems:

(1) The probability distribution of the actions' rewards is unknown: this means that one cannot easily assume a known probability distribution (e.g., Gaussian) for how the reward (i.e., the health outcome) is distributed for each of the actions. This is also referred to as an unknown data generation model (Bubeck et al., 2012).


(2) The change in sleep duration and efficiency has low variance: although it is expected

to see differences in sleep duration and efficiency after a participant follows a sleep

recommendation, large changes are not expected from one day to the next.

For instance, a participant that has a sleep efficiency of 70% is not going to change

to 99% following any given recommendation for one day. This assumption is highly

dependent on the domain. As an example, in a physical activity intervention with the

goal of increasing daily steps, the average number of daily steps taken could change

drastically. However, depending on the time-frame, for example, if the goal is to

achieve an average weekly number of steps, then the changes may not be as drastic.

(3) Selecting a health recommendation is a non-stationary problem: although there

is likely to be a single recommendation that produces the best health outcomes

for a participant at any given time, this recommendation is likely to change. This

is a problem that has been identified as a common challenge for mobile health

interventions.

3.2.3.1 Contextual bandit

Contextual bandits (Lattimore and Szepesvári, 2019) are a generalization of the bandit

algorithms capable of dealing with context. Contextual bandits are typically used in web

advertising where the goal is to maximize click through rate by deciding, for example, on

location and topic of an ad given a particular set of contextual features like age, time of day and

season. For the implementation of PECAM presented in this work, we chose to use contextual

bandits as opposed to other methods like Q-Learning or SARSA, because contextual bandits

are more sample-efficient (i.e., learn with a smaller data set). However, applications with

access to big data sets or a large pool of participants could use more sophisticated methods.

In order to make our sleep recommendation problem computationally tractable, we divided

the context based on time into three different non-overlapping periods: morning (6:01am to 12pm), afternoon (12:01pm to 6pm) and evening (6:01pm to 6am). Then for each period we use a different EXP3 (Auer et al., 2002) multi-armed bandit. As defined in (Lattimore and

Szepesvári, 2019), this particular usage of multiple multi-armed bandits for different contexts


corresponds to a contextual bandit. For other health interventions, a similar approach could

be taken, where the contextual factor with the most weight in the intervention is discretized and individual MABs then deal with each context separately (a sketch of this routing is shown below).
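A minimal sketch of this routing follows; it assumes a per-period bandit class exposing a select() method (e.g., the EXP3 sketch given after Algorithm 1 below), and the class and argument names are illustrative.

    def period_of(hour):
        """Map an hour of day (0-23; minutes ignored, so boundary
        handling is approximate) to the three non-overlapping periods."""
        if 6 < hour <= 12:
            return "morning"      # 6:01am - 12pm
        if 12 < hour <= 18:
            return "afternoon"    # 12:01pm - 6pm
        return "evening"          # 6:01pm - 6am

    class PerPeriodBandit:
        """Coarse contextual bandit: one independent multi-armed bandit
        per time period, as described in the text."""

        def __init__(self, arms_by_period, make_bandit):
            # arms_by_period: dict period -> list of recommendations;
            # make_bandit: factory returning a MAB for n arms (e.g., EXP3)
            self.arms = arms_by_period
            self.bandits = {p: make_bandit(len(a)) for p, a in arms_by_period.items()}

        def select(self, hour):
            period = period_of(hour)
            arm = self.bandits[period].select()
            return period, arm, self.arms[period][arm]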

EXP3 works by selecting a recommendation at random from a multinomial distribution.

The EXP3 algorithm is described in algorithm 1. There are many different multi-armed

bandit methods such as the Upper Bound Confidence Interval, Thompson sampling, etc.,

however EXP3 provides the best theoretical guarantees given the assumptions of the sleep

recommendation problem. EXP3 assumes the environment is adversarial; in such an en-

vironment, whenever the bandit picks a specific sleep recommendation, the environment

can foresee the decision rule and pick a different sleep recommendation as the best at any

given time. Although the actual environment for the sleep recommendation problem may

not be adversarial, working under that assumption prepares the bandit for the worst possible

conditions, and as such, EXP3 is guaranteed to only make a finite number of mistakes and to

adapt to a non-stationary environment. EXP3 has no assumptions about the data generation

model (Bubeck et al., 2012). The Upper Confidence Bound (UCB) approach breaks under

low-variance problems (Kuleshov and Precup, 2014). Lastly, EXP3 has already been used in

related work (e.g., (Rabbi et al., 2016)) with successful results.

EXP3 in the context of the sleep recommendations starts with a uniform probability over

each of the recommendations. When a recommendation has a positive sleep outcome (high

efficiency and/or high sleep duration), the probability of that recommendation is increased

slightly while all the other recommendations’ probabilities are decreased. In order to make

the problem computationally feasible, three different multi-armed bandits (MABs) are used:

one for each period of the day (morning, afternoon and evening). A short version of the sleep

recommendations handled by each of the MABs is shown in table 3.1. To decide in which

period of the day each recommendation should appear, we worked together with a CMU sleep

clinician. This took into account that planning is an important part of some activities. For

example, for the recommendation "Avoid exercising 4 hours before bed time", the goal is

to help the student change the exercising time to the morning or afternoon and not to drop

exercising. Therefore, the best time to remind them about it is in the morning.


Algorithm 1: EXP3 algorithm adapted for the sleep recommendations problem.

    Initialization: $w^{(0)}_n = 1$, for $n = 1, \dots, N$
    for $t = 1, \dots, T$ do
        $\beta = \sqrt{\log(k) / (k \cdot t)}$
        Select recommendation $i$:
            $\phi^{(t-1)} = \sum_{n=1}^{N} w^{(t-1)}_n$
            $i \sim \mathrm{Multinomial}\!\left(w^{(t-1)} / \phi^{(t-1)}\right)$
        Compute sleep score:
            $s^{(t)} = \mathbb{1}\!\left(i \in r^{(t-1)}\right) \cdot H\!\left(\mathrm{sleepD}^{(t-1)}, \mathrm{sleepE}^{(t-1)}\right)$
        Update:
            $w^{(t)}_n = w^{(t-1)}_n \cdot e^{-\beta \cdot \ell(s^{(t)}) / p^{(t)}_n}$
    end

where $w^{(t)}_n$ is the weight for recommendation $n$ at time $t$, $p^{(t)}_n = w^{(t)}_n / \phi^{(t)}$ is the probability of selecting a recommendation, $\mathrm{sleepD}$ is the sleep duration in hours capped at 7 and divided by 7, $\mathrm{sleepE}$ is the sleep efficiency, $H(\cdot)$ is the harmonic mean, and $\mathbb{1}(i \in r^{(t-1)})$ is one if the recommendation $i$ pushed by the app to the participant is reported as followed ($i \in r^{(t-1)}$).

More details on how these recommendations were selected and displayed are provided in the study design section (3.4.1). In other domains, other MABs could be more adequate; as an example, in a

health intervention where it is necessary to optimize the health treatment very quickly (i.e.,

the number of opportunities to try out treatments is limited) and non-stationarity is not an

issue, the UCB MAB may be a better option.

3.2.4 Framework connection to behavior change theories

The design choices behind PECAM are mainly based on self-efficacy (Bandura, 1977) theory,

the Fogg Behavior Model (Fogg, 2009), the COM-B framework (Michie et al., 2011) and

closely follows the design guidelines for Just in time adaptive interventions (Nahum-Shani

et al., 2017). Self-efficacy theory posits that behavior change can only be achieved once the

individual has a perception of success towards the execution of a task. In this proposal, I

posit that achieving high self-efficacy is context-dependent: Even if an individual has high

efficacy for a given task, this task can only be executed under very specific circumstances,

so ultimately success is dictated by the individual’s ability and context. As an example,

an individual may be able and willing to stop drinking coffee to improve sleep outcomes,


TABLE 3.1: Sleep Hygiene recommendations used in the SleepU app

    MAB        Sleep Recommendation
    Morning    Keep record of your sleep with a diary (this app's diary counts!)
               Avoid exercising 4 hours before bedtime
               Always keep the daytime routine
    Afternoon  Go to bed and wake up at the same time every day
               Avoid caffeine 6 hours before bedtime
               Avoid alcohol 6 hours before bedtime
               Avoid naps
               Avoid heavy meals before bedtime
    Evening    Sleep only when sleepy
               Get out of bed when not asleep in 20 mins and calm down until sleepy
               Use bed only for sleep and sex
               Perform a sleep routine
               Take a bath 1-2 hours before bedtime
               Avoid watching the clock
               Make the bed environment conducive to sleep

however due to habit, this individual may only remember to avoid coffee once inside a coffee

shop, at which point surrendering to habit is easier than restraint. In such a case, a reminder

that arrives with enough time to allow the individual to avoid this particular habit could have

succeeded in helping.

Reminders driven by context and receptivity are also motivated by the Fogg Behavior Model

(FBM). This model posits that behavior is composed of three different factors: motivation,

ability and triggers. Under the FBM, for any individual to succeed at behavior change, she

needs to be motivated, needs to have the ability to perform the behavior and needs a trigger

to perform this behavior. Take, as an example in the context of a sleep intervention, the

recommendation to "avoid drinking coffee 6 hours before bedtime"; an individual’s ability

level to perform this recommendation varies over the course of the day as shown in figure 3.3,


FIGURE 3.3: Fogg Behavior Model adaptation of the recommendation "avoid caffeine 6 hours before bedtime". The horizontal blue and red rectangle shows how the ability to enact a recommendation depends on time of day.


where the morning and afternoon are among the best times to provide this recommendation,

while an evening reminder cannot result in behavior change since the window of opportunity

for succeeding has already passed. In the SleepU app, the FBM trigger is a notification

delivered to the user’s phone. COM-B (Michie et al., 2011), a behavior change framework,

relates several causal factors (e.g., capability, opportunity and motivation) for the performance

of volitional behaviour including the influence of extrinsic factors. COM-B was derived

from an exhaustive literature review and the summarization of nineteen different behavior

change frameworks. In comparison to FBM, COM-B considers the role of motivation at a

broader level in the performance of behavior mediated by ability and opportunity (triggers

under the FBM). However, COM-B goes further and suggests that motivation, capability

and opportunity are also influenced by the performance of the behavior. This implies that

motivation can increase as a patient engages more with a behavior resulting in a positive

health outcome.

3.3 Deployment and testing

Using the PECAM framework, the SleepU app was implemented: an Android application

that uses a Fitbit wearable, user’s feedback and phone sensor data to personalize a sleep

hygiene intervention. The app was designed to only give sleep recommendations to the user,


while other functionality like tracking and visualization of sleep or other behaviors was not

part of the app. By avoiding this other functionality, any effect of the app on sleep-related

outcomes can be more directly attributed to the sleep recommendations and delivery method

provided by the SleepU app. Also, it has been shown that tracking of behaviors can be

detrimental in domains like weight loss intervention (Jakicic et al., 2016).

What follows is a walk-through of the SleepU app. At installation, the app will ask the user to

connect to her Fitbit account and ask for the necessary permissions to automatically access

sleep-related data. The next day at 9 am, the app will push a notification to the user asking

her to fill out a standard sleep diary (figure 3.4:1) (i.e., time to bed and wake up). If the

user starts interacting with her phone before 9am and the communication module detects a

mobile-receptivity state, the app will push a notification about the sleep diary at that time.

After the user fills out the sleep diary, the app immediately uses the Fitbit data and diary

responses to update the probability distributions of the recommendations and selects the sleep

recommendations for the day. The probability estimates for EXP3 are updated daily using

the harmonic mean of sleep duration and efficiency and whether the participant followed

the recommendations provided or not. After the updates, SleepU pushes the morning sleep

recommendation. SleepU provides one sleep recommendation at 3 different time periods:

morning, afternoon and evening. SleepU selects which recommendations to show using

the EXP3 multi-armed bandit (MAB). Notifications for each time period from SleepU are

stopped once the user has seen a sleep recommendation for their current time period. SleepU

can tell that a recommendation has been seen because the notification does not directly display the recommendation text, as shown in figure 3.4:2, but instead says: "I

have a new sleep recommendation for you!”. This mechanism forces the user to click on the

notification to read the recommendation. When the notification is clicked, the SleepU app

is opened and displays the sleep recommendation. In summary, SleepU will push at least 3

notifications a day (one for each period) and a maximum of one notification per hour between

9am and 12am. Participants in the study were free to mute the notifications or ignore them.

After the first day, while filling out the sleep diary, the user is also asked about the sleep

recommendations shown the previous day and whether she followed any of them. Sleep


FIGURE 3.4: Different screenshots from the SleepU app. Left to right: 1) SleepU diary entry; the user gets a reminder at 9am to fill out the diary, and if they checked their phone earlier than that, the receptivity classifier could trigger a notification to fill out the sleep diary. 2) The app pushing a notification to the user about a new sleep recommendation being available; the actual recommendation text is omitted in the notification. 3) A sleep recommendation, viewable after the notification is clicked on. 4) Main screen of the app, which gives the user access to the sleep recommendations selected for her for the current day, with the other sleep recommendations hidden.

recommendations that were followed then result in an update in the probability estimates of

their respective MAB; this updates the probabilities for all recommendations for that MAB.

The recommendations in the SleepU app (figure 3.4:3) are a slight modification, for improved

readability, of the sleep hygiene recommendations offered by sleep clinicians (Centre for

Clinical Interventions) with a single illustration related to the recommendation. The home

screen of the app (figure 3.4:4) provides access to all the sleep recommendations already

provided for the current day.

The SleepU app has four different mechanisms for triggering the delivery of a recommenda-

tion: User, Random, Mobile-receptivity and Diary. User-triggered recommendations refer to

when the user checks the app’s recommendations on her own volition and does not involve

filling out the sleep diary or receiving a notification from the app. In this scenario, the

participant goes on her own to the phone and looks at any of the 3 sleep recommendations

available for the day (morning, afternoon and evening), available from the app’s home screen.

Random-triggered recommendations are those scheduled and shown in a notification at a

random time by the SleepU app, as explained in section 3.2.2. Mobile-Receptivity-triggered


recommendations are those shown as a notification to the user after the mobile-receptivity

detector identified a receptive state. Lastly, the Diary-triggered recommendations are those

checked right after filling out the sleep diary; again in this case the participant could check

any of the morning, afternoon or evening recommendations available.

3.4 Method

3.4.1 Study design

We conducted a 12-week-long, within-subjects randomized clinical study with 37 college

students from Carnegie Mellon University (CMU). The study design is shown in figure 3.5.

After screening, participants in the study were assigned at random to two different groups:

app-first group or sleep-appointment-first group. Randomization also took into account

group balance by gender. All participants were exposed to three different study phases (each

approximately 4 weeks long) in varying order depending on their group assignment: Baseline,

in this phase only data collection took place; App-Intervention, in this phase students were

asked to install the SleepU app on their phones; Sleep-appointment, in this phase students

were asked to attend a sleep health appointment in which a standard sleep hygiene intervention

was delivered by a sleep clinician. The one-time sleep-appointment was provided by the

university health center of CMU and it is part of their standard university wellness program

provided at no cost to students. The two different groups were created to counterbalance

any possible order effect. Due to limited availability of the sleep-appointment, participants

starting the study late, cancellations of the sleep appointment, and the semester calendar, the

study duration varied slightly among participants.

3.4.2 Participants

Participants were recruited using flyers and Facebook posts at university groups in the

beginning of January 2019. Participants eligible to participate in the study had to comply

with two different sets of requirements: demographics and health-related. Demographic


FIGURE 3.5: Study design. All study phases lasted 4 weeks, with the exception of screening. The Qs indicate times in the study when the participants filled out a battery of questionnaires, as explained in section 3.4.4.

requirements were: 18 to 25 years old and with an active undergraduate student status at

CMU. Participants in the study were screened for on-going problematic substance use (i.e.,

drugs, alcohol or nicotine) and sleep disorders (i.e., apnea, narcolepsy, chronic insomnia). Participants with a substance use problem or a sleep disorder were not accepted in the study. These exclusion criteria were necessary because participants with these issues need specialized sleep treatment; a standard sleep hygiene approach will not work for them and could worsen ongoing sleep problems like insomnia. Procedures were approved by our university's

institutional review board, and all participants provided informed consent. All participants

were provided at no cost with a Fitbit Flex2, a wrist-worn wearable with sensors that measures

steps and sleep (awake vs. asleep). Participants were compensated with 10 dollars (US) for

each week of data logged in the study, and, as an extra incentive, those filling out 80% or more

of the diaries were allowed to keep the Fitbit Flex2. The participants were not compensated

for using the SleepU app’s sleep intervention functionality (e.g., checking or following sleep

recommendations).

3.4.3 Interventions

3.4.3.1 Sleep-appointment

A sleep-appointment intervention was scheduled with the university health center after a

participant joined the study, but the appointment occurred at the beginning of this intervention

period. During this 45-minute to 1-hour appointment, a sleep clinician covered basics

about sleep, performed a sleep assessment using the Pittsburgh Sleep Quality Index (PSQI)


(Buysse et al., 1989), and went over relevant sleep hygiene recommendations. The sleep

clinicians at our university follow recommendations from the Australian Centre for Clinical

Interventions (Centre for Clinical Interventions). After the sleep-appointment, as part of our

study, Fitbit data and sleep diary questionnaires were recorded for 4 weeks.

3.4.3.2 App-intervention

Participants would get a link from the research coordinator with instructions on how to install

the SleepU app on their phones. After installation, participants had the app on their phones

for 4 weeks and they uninstalled the app afterwards. Specific details about the way the app

works and the delivery of the sleep recommendations can be found in section 3.3.

3.4.4 Measures

After screening, participants that joined the study were asked to fill out a battery of question-

naires related to sleep health and other related and proximal outcomes after each phase of

the study. The questionnaires capture mechanistic proximal outcomes and measures of psycho-social or physiological processes that are thought to mediate health behavior change,

as suggested by (Klasnja and Veeraraghavan, 2018). The questionnaires used were: the

Pittsburgh Sleep Quality Index (PSQI) (Buysse et al., 1989), Sleep Practices and Attitudes

(Grandner et al., 2014), Sleep beliefs scale (Adan et al., 2006), Perceived stress scale (Cohen

et al., 1994), Morningness - Eveningness questionnaire (Horne and Östberg, 1976) and a

Readiness to change motivation towards healthy sleep related behaviors questionnaire (i.e.,

motivation questionnaire). We created the motivation questionnaire from a readiness ruler,

a questionnaire that measures the patient’s health stage as defined in the transtheoretical

model of behavior change (Prochaska and Velicer, 1997). The readiness ruler has been used

for smoking cessation (Biener and Abrams, 1991) and alcohol rehabilitation interventions

(Heather et al., 2008). Our modification consisted of adjusting the text content for sleep

hygiene recommendations and decreasing the number of options from 10 to 8 options for

improved readability on a mobile phone, on which participants were filling out the question-

naire.


In our motivation questionnaire, we asked participants to rate their readiness for each of the 14

different sleep recommendations listed in table 3.1, excluding the sleep diary recommendation

since we directly compensated participants for the diary entries. The readiness levels used

a scale from 1 to 7: 1) Not ready at all, 3) Thinking about it, 5) Planning and making a

commitment, 7) Actively/Already doing it, and a Does not apply to me option (e.g., the coffee

recommendation for participants that do not drink coffee).

Additionally, participants were asked to fill out a standard sleep diary, everyday, during the 12

weeks duration of the study. During the baseline and sleep-appointment phases, participants

received an email with a link to a website form every morning. For the app-intervention

phase, participants received the sleep diary prompt on the SleepU app (via a notification)

with three extra questions asking whether the participant followed any of the three sleep

recommendations generated by SleepU the previous day.

Sleep duration and efficiency were collected continuously during the 12 weeks of the study

except when the Fitbit was being charged. Participants in the study were instructed to wear

the Fitbit Flex2 at all times including while taking a shower and while sleeping, with the

exception of recharging. The device does not collect any data while recharging or when the

user does not wear it.

3.4.5 Analysis plan

The main hypotheses we tested are:

(1) H1: The combination of a mobile-receptivity detector and a decision-making mod-

ule results in better sleep duration and efficiency than a traditional sleep hygiene

intervention.

(2) H2: Delivering sleep recommendations at mobile-receptivity states results in higher operationalization than recommendations users access on their own or through alternative mechanisms.

(3) H3: The SleepU app increased sleep-related motivation.


For H1) The combination of a mobile-receptivity detector and a decision-making module

produces better sleep outcomes than a traditional sleep hygiene intervention, we compared the

app-intervention against the baseline and the standard sleep-appointment intervention. The

SleepU app and the sleep-appointment have the same base information (e.g.,, the same sleep

recommendations), however SleepU learns over time to select and remind the participant only

about recommendations that result in an increased sleep duration or efficiency and during

detected mobile-receptivity states. This means that since the content of both interventions is

the same, any difference in their outcomes should only come from the differences between

SleepU and the sleep-appointment. For this comparison, we performed a regression to analyze

sleep outcome variance across the phases, taking into account interaction effects between

study phase and group. In addition to evaluating outcomes on the entire study sample, for the

post-hoc analysis, we looked at smaller groups based on their sleep duration and motivation

at baseline. Using the baseline Fitbit’s sleep duration data, two groups were formed by

separating our study population into short-sleepers (<7 hours per day) and long-sleepers (>=7

hours per day); this same post-hoc analysis was used in a recent pilot study of sleep hygiene

(Levenson et al., 2016). This grouping is further supported by expert consensus and national

guidelines that state adults should sleep for more than 7 hours, and that individuals already

sleeping for 7 hours are generally not expected to increase their sleep duration. Similarly, we

created two more groups based on motivation to put sleep recommendations into practice.

According to self-efficacy theory and the transtheoretical model of behavior change(Prochaska

and Velicer, 1997), we expect participants with a readiness to change level of 5 or higher

(e.g., action or maintenance) to be very motivated to put sleep recommendations into practice,

while those that are at levels less than 5 (e.g., pre-contemplation, contemplation, preparation),

to be less motivated and hence will have a smaller or no change in sleep related outcomes.

For H2) Delivering sleep recommendations at mobile-receptivity states increases their opera-

tionalization, we looked at how participants interacted with the different recommendations

during the app-intervention phase of the study and whether they put them into practice the

next day or not. For testing this hypothesis, we computed the total actionability rate of each

type of triggering mechanism (e.g., user, random, mobile-receptivity, diary), where we define

actionability to be the total number of recommendations followed of each type divided by


the total number of recommendations seen of each type. Actionability was first computed for

each participant and type and then the median was used as the summary for all participants in

the study. The median was used instead of the mean due to the small sample size and the data

not following a well-defined probability distribution.
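A sketch of this computation in pandas follows; the column names are assumed for illustration, not taken from the study's actual data schema.

    import pandas as pd

    def median_actionability(df: pd.DataFrame) -> pd.Series:
        """df has one row per recommendation shown, with columns
        'participant', 'trigger' (user/random/receptivity/diary) and
        'followed' (0/1). The mean of 'followed' per participant and
        trigger equals followed/seen; the median then summarizes each
        trigger type across participants."""
        per_participant = df.groupby(["participant", "trigger"])["followed"].mean()
        return per_participant.groupby("trigger").median()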

For H3) The SleepU app increased motivation, we looked at the differences in scores for the

motivation questionnaires administered at the end of each phase of the study. To understand

the effect of phase and group, we applied a two-way ANOVA.

To analyze the Fitbit sleep data (sleep duration and efficiency) across study phases, sleep data

was evaluated for normality using a Shapiro-Wilk normality test and for homogeneous vari-

ance using a Fligner-Killeen test. For normally distributed data with homogeneous variance,

a standard ANOVA was used; otherwise the Aligned Rank Transform for Nonparametric

Factorial ANOVAs was used (ART) (Wobbrock et al., 2011). ANOVAs were followed by

pairwise comparisons using paired t-tests. ARTs were followed by pairwise comparisons

using a Wilcoxon signed-rank test. The appropriate effect size estimate (r, Cohen's d) was

used in each case.
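The test-selection logic can be sketched with scipy, which provides shapiro and fligner for these checks. This simplified version ignores the repeated-measures structure of the design and leaves the ART step to R's ARTool, so it should be read as an illustration rather than the analysis script itself.

    from scipy.stats import shapiro, fligner, f_oneway

    def choose_test(*phase_samples, alpha=0.05):
        """Run Shapiro-Wilk per phase and Fligner-Killeen across phases;
        use a standard ANOVA when normality and variance homogeneity
        hold, otherwise fall back to the Aligned Rank Transform."""
        normal = all(shapiro(s).pvalue > alpha for s in phase_samples)
        homogeneous = fligner(*phase_samples).pvalue > alpha
        if normal and homogeneous:
            return "anova", f_oneway(*phase_samples)
        return "aligned_rank_transform", None  # performed in R with ARTool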

Following advice from the American Statistical Association (Wasserstein et al., 2016; Wasserstein et al., 2019), results are presented comprehensively, including both successes and failures.

For hypothesis tests, also as suggested, we report the p-values and explicitly avoid using the

term "statistically significant" (Thiese et al., 2016; Kim and Bang, 2016) and instead we trust

that researchers can make their own judgments. We also present both adjusted and un-adjusted

p-values as suggested by (Rothman, 1990), recognizing that for planned comparisons, p-value

adjustment is not necessary (Saville, 1990).

We performed our data analysis using Python 3.6 and the libraries numpy (Oliphant, 2006), scipy (Jones et al., 2001), matplotlib (Hunter, 2007), seaborn (Waskom et al., 2017) and pandas (McKinney et al., 2010). Hypothesis testing was performed using R 3.6.1 and the ARTool package (Kay and Wobbrock, 2016). Both exploratory data analysis and hypothesis

testing were conducted in Jupyter notebooks.


3.5 Results

After screening, 37 participants were invited to join the study. Of those, 30 participants (22

Female, 7 Male, 1 Undisclosed) finished the study. 17 participants were in the sleep-first

group (3 male) and 13 in the app-first group (4 male). The average length of the study was

84.4 days (min=69, max=96). We used the Fitbit data for hypothesis testing of changes in

sleep duration and efficiency. We excluded the Fitbit data from 4 participants due to large

amounts of missing data during some of the study phases. The resulting dataset has a total of

26 participants and it was used to test H1 and H2 and in their related post-hoc analyses. For

readiness to change motivation (H3), we used all of the 30 participants’ responses available

because only questionnaire data was necessary. A breakdown of the number of participants

and post-hoc analysis groups is shown in table 3.2. A summary of all of the results related to

sleep duration and efficiency (H1) is shown in table 3.3 and results related to motivation (H3)

are shown in table 3.4. The sleep progression during the semester of the sleep-intervention

first and app-intervention first groups is shown in figure 3.6.


FIGURE 3.6: Sleep duration changes over the semester. From left to right: 1) Sleep duration for the sleep-appointment-first and app-intervention-first groups; S.A corresponds to the Sleep-Appointment phase and A.I corresponds to the App-Intervention phase. 2) Sleep duration for motivated vs. less-motivated participants. 3) Sleep duration for short vs. long sleepers.


TABLE 3.2: Participant distribution for the different analyses

                      H1 and H2                               H3
                      Short-sleepers  Long-sleepers  Total    Short-sleepers  Long-sleepers  Total
    Motivated         7               6              13       7               6              13
    Less-motivated    3               10             13       3               14             17
    Total             10              16             26       10              20             30

3.5.1 H1) The combination of a mobile-receptivity detector and a

decision-making module produces better sleep outcomes than a

traditional sleep hygiene appointment intervention

On average, sleep duration for the participants while in the app-intervention was maintained

(small increase of 4.2 minutes) from their baseline sleep (p = 0.51, padj = 1.0, r=0.09).

In comparison to the sleep-appointment, participants slept 19.2 more minutes while in the

app-intervention (p = 0.016, padj = 0.049, r = 0.32). Sleep efficiency across all participants

in the different phases did not change; for the different sub-samples, the differences are minor

(2%), which for 7 hours of sleep account for only 8.4 minutes difference. These changes in

efficiency are not clinically meaningful and as such are not discussed any further.

Sleep duration for short-sleepers when experiencing the app-intervention (SleepU) increased

by 36 minutes from their baseline duration (p = 0.043, padj = 0.13, d = 1.12) and it was also

24 minutes longer than when experiencing the sleep-appointment intervention (p = 0.068,

padj = 0.2, d = 0.8). Sleep duration for motivated participants experiencing the app-

intervention increased 19.8 minutes in comparison to their baseline (p = 0.09, padj = 0.27,

d = 0.37) and was also 22.2 minutes longer than when experiencing the sleep-appointment

intervention (p = 0.03, padj = 0.09, d = 0.48). Similar results were achieved in a recent pilot

study by sleep researchers (Levenson et al., 2016) in a similar population of short-sleepers

applying a traditional sleep hygiene intervention plus a social comparison component.

Sleep duration for long-sleepers experiencing the app-intervention decreased by 16 minutes

in comparison to their baseline phase (p = 0.43, padj = 1.0, r = 0.14), but this was still

15 minutes higher in comparison to when they were experiencing the sleep-appointment

intervention (p = 0.10, padj = 0.31, r = 0.29). Similarly, sleep duration for less-motivated


participants during the app-intervention decreased by 15.6 minutes in comparison to their

baseline phase (p = 0.59, padj = 1.0, r = 0.11) and was 15 minutes higher in comparison to

when they experienced the sleep-appointment intervention (p = 0.15, padj = 0.44, r = 0.29).

Based on these results, we can confirm H1, that the combination of the mobile-receptivity

detector and decision-making module in the SleepU app resulted in better sleep outcomes

than the traditional sleep hygiene appointment intervention.

3.5.2 H2) Delivering sleep recommendations at mobile-receptivity

states increases their operationalization

The median actionability rate of each type of recommendation across all participants is shown

in figure 3.7. A Friedman test did not reveal any statistically significant differences across the four

types for all participants (p = 0.16) or the post-hoc groups (short-sleepers, long-sleepers,

motivated and less-motivated). However, sleep recommendations delivered via the mobile-

receptivity detector had a median actionability of 75% for all participants, in comparison to 50% for the other mechanisms. For some groups, like short-sleepers, actionability was as high

as 86%. While the results shown in figure 3.7 are promising, we cannot confirm H2, that

mobile-receptivity increased operationalization of the sleep recommendations.

3.5.3 H3) The SleepU app increased motivation

There was a change in average motivation for all participants from 4.6 in the baseline phase

to 5.17 when experiencing the app-intervention (p = 0.02, padj = 0.059, r=0.55). There

was also a change for all participants from 4.6 in baseline to 5.11 when experiencing the

sleep-appointment intervention (p = 0.057, padj = 0.171, r=0.45). There was no difference

in motivation between the app-intervention and the sleep-appointment intervention (p = 0.705, padj = 1.0, r = 0.07). Motivation was also measured before (4.5) and after (4.6) the baseline phase, but there was no difference between the two (p = 0.65, r = 0.13).


FIGURE 3.7: Actionability rates for all participants, short- and long-sleepers, and motivated and less-motivated participants, by type of notification trigger mechanism (diary, random, mobile-receptivity and user). The numbers at the bottom of each plot are the average total number of notifications for each type and group.

Short-sleepers’ average motivation changed from 4.9 in the baseline phase to 5.38 when

experiencing the app-intervention (p = 0.28, padj = 0.83, r=0.59). There was also a change in

motivation for short-sleepers from 4.9 in baseline to 5.2 in the sleep-appointment intervention

(p = 0.41, padj = 1.0, r=0.39).

Long-sleepers’ average motivation changed from 4.5 in the baseline phase to 5.0 when

experiencing the app-intervention (p = 0.076, padj = 0.23, r=0.53). Also, there was a change

from 4.5 in baseline to 5.0 in the sleep-appointment intervention (p = 0.08, padj = 0.25,

r=0.49).

Motivated students’ average motivation changed slightly from 5.4 in the baseline phase to 5.6 in the app-intervention (p = 0.21, padj = 0.65, r = 0.52). Also, there was a change from 5.4 in

baseline to 5.7 in the sleep-appointment intervention (p = 0.093, padj = 0.28, r=0.40).

For the less-motivated students, average motivation changed from 4.0 in the baseline phase to

4.79 in the app-intervention (p = 0.0056, padj = 0.017, r=0.57). Also, there was a change

from 4.0 in baseline to 4.64 in the sleep-appointment intervention (p = 0.0093, padj = 0.028,

r=0.49).


Based on these results, we can confirm H3 that SleepU increased motivation over the baseline

phase, but to the same degree as the sleep-appointment intervention.

3.5.4 Summary of results

In general, the results support our main hypothesis (H1) that the combination of a mobile-

receptivity detector and our decision-making module results in better sleep outcomes than a

traditional sleep hygiene intervention. However, the impact was only seen on sleep duration

and not on sleep efficiency. For the different sub-groups, our combined method always resulted

in a higher sleep duration in comparison to the sleep-appointment intervention.

For H2, the results are encouraging but do not support our hypothesis. Although there is

a measurement of the actionability for all of the different types of recommendation trigger

mechanisms, the results are only observational and not causal; i.e., evaluating this effect would

require a second study in which we deliver the sleep recommendations using a single type of

notification mechanism for a few days each to compare; unfortunately this was not feasible to

do during our study.

For H3, the results not only demonstrate that our SleepU app improved motivation, but also

that this improvement is comparable to that produced by the sleep-appointment intervention.

Additionally, we found that motivation during the screening phase and by the end of the

baseline phase did not change by a large amount. This result shows that filling out the sleep

diary daily and being involved in a sleep study (without receiving any intervention) does

not have a visible effect on participant motivation and hence we would not expect to see a

behavioral change either.

The results demonstrate the value of using a mobile-receptivity detector and a contextual bandit to detect contexts in which to intervene and to select treatment, compared against a standard sleep intervention delivered by an experienced clinician. Our results demonstrate that our system overall (H1) is as good as or better than an in-person, individual, one-hour sleep intervention delivered by an experienced clinician. This result shows that it is possible and effective to scale a behavioral intervention like sleep hygiene in the form of an Android application.


Outcome (Method, Sample): Baseline | Sleep-A. | App-I | Baseline vs App-I | Baseline vs Sleep-A. | Sleep-A. vs App-I

Duration (ART, All): 7.24 | 6.99 | 7.31 | p = 0.51, padj = 1.0, r = 0.09 | p = 0.084, padj = 0.25, r = 0.24 | p = 0.016, padj = 0.049, r = 0.32

Efficiency (ART, All): 94.31% | 94.25% | 94.34% | phase: p = 0.96; group: p = 0.73; phase-group interaction: p = 0.45

Duration (ANOVA, Motivated): 7.17 | 7.13 | 7.50 | p = 0.09, padj = 0.27, d = 0.37 | p = 0.69, padj = 1.0, d = 0.04 | p = 0.03, padj = 0.09, d = 0.48

Efficiency (ANOVA, Motivated): 94.58% | 93.95% | 93.99% | p = 0.07, padj = 0.21, d = 0.26 | p = 0.09, padj = 0.28, d = 0.28 | p = 0.92, padj = 1.0, d = 0.01

Duration (ART, Less-motivated): 7.374 | 6.86 | 7.115 | p = 0.59, padj = 1.0, r = 0.11 | p = 0.08, padj = 0.24, r = 0.34 | p = 0.15, padj = 0.44, r = 0.29

Efficiency (ANOVA, Less-motivated): 94.04% | 94.55% | 94.69% | phase: p = 0.291; group: p = 0.679; phase-group interaction: p = 0.833

Duration (ANOVA, Short-sleepers): 6.3 | 6.5 | 6.9 | p = 0.043, padj = 0.13, d = 1.12 | p = 0.334, padj = 1.0, d = 0.2 | p = 0.068, padj = 0.2, d = 0.8

Efficiency (ANOVA, Short-sleepers): 94.82% | 94.11% | 94.39% | phase: p = 0.03; group: p = 0.978; phase-group interaction: p = 0.02

Duration (ART, Long-sleepers): 7.82 | 7.30 | 7.55 | p = 0.43, padj = 1.0, r = 0.14 | p = 0.004, padj = 0.01, r = 0.48 | p = 0.10, padj = 0.31, r = 0.29

Efficiency (ANOVA, Long-sleepers): 93.99% | 94.33% | 94.31% | phase: p = 0.672; group: p = 0.628; phase-group interaction: p = 0.875

TABLE 3.3: Hypothesis testing results related to sleep duration and efficiency. Sleep-A: Sleep-appointment; App-I: App-intervention.

Due to the pervasiveness of mobile phones (median 45% ownership for developing economies and 75% for developed economies (Taylor and Silver, 2019)), this opens up the possibility of delivering this kind of intervention with ease to large groups of people, with the

only limiting but not insurmountable factor being the usage of a Fitbit for tracking sleep; the

user could alternatively log sleep data manually every day in the app. Despite the promising results, we can only partially attribute this outcome to the inclusion of the mobile-receptivity detector (H2).


Outcome (Method, Sample): Baseline | Sleep-A. | App-I | Baseline vs App-I | Baseline vs Sleep-A. | Sleep-A. vs App-I

Motivation (ART, All): 4.662 | 5.116 | 5.17 | p = 0.02, padj = 0.059, r = 0.55 | p = 0.057, padj = 0.171, r = 0.45 | p = 0.705, padj = 1.0, r = 0.07

Motivation (ART, Short-sleepers): 4.942 | 5.256 | 5.388 | p = 0.28, padj = 0.83, r = 0.59 | p = 0.41, padj = 1.0, r = 0.39 | p = 0.77, padj = 1.0, r = 0.03

Motivation (ART, Long-sleepers): 4.522 | 5.047 | 5.062 | p = 0.076, padj = 0.23, r = 0.53 | p = 0.08, padj = 0.25, r = 0.49 | p = 0.956, padj = 1.0, r = 0.04

Motivation (ART, Motivated): 5.4 | 5.7 | 5.6 | p = 0.21, padj = 0.65, r = 0.52 | p = 0.093, padj = 0.28, r = 0.40 | —

Motivation (ART, Less-motivated): 4.082 | 4.648 | 4.799 | p = 0.0056, padj = 0.017, r = 0.57 | p = 0.0093, padj = 0.028, r = 0.49 | p = 0.3388, padj = 1.0, r = 0.19

TABLE 3.4: Hypothesis testing results related to motivation. Sleep-A: Sleep-appointment; App-I: App-intervention.

We also fully corroborated that the system is as persuasive as an experienced

clinician in helping users to feel motivated to follow sleep recommendations (H3).

When experiencing the app-intervention, students had a small increase in sleep duration of

4 minutes, which means they mostly maintained an already healthy sleep duration from the

baseline period. In contrast, when experiencing the sleep-appointment intervention, students

lost 15 minutes (p = 0.084, padj = 0.25) of sleep duration in comparison to their baseline.

We further investigated this sleep duration loss or maintenance among students and we found

that less-motivated students and long-sleepers already had an average healthy daily sleep

duration of 7.3 hours or higher during the baseline. Both groups had a decrease in sleep

duration during the sleep-appointment and app-intervention phases, as shown in figure 3.6; however, this decrease was smaller for the app-intervention. This pattern of losing sleep as the

academic semester advances for college students was also found by the StudentLife project

(Wang et al., 2014), and may result from an ever-increasing workload during the

academic term that forces students to sacrifice sleep in order to finish homework or prepare

for exams.


Students in the motivated group had a borderline healthy sleep duration (7.1 hours) at baseline.

These students gained 19.8 minutes of sleep during the app-intervention in comparison to their

baseline and 22.2 minutes in comparison to the sleep-appointment. These students maintained

their baseline sleep duration during the sleep-appointment phase (average decrease of only

2.4 minutes). These results show that sleep duration gains depend on whether there is actual

room for improvement. The maintenance of sleep duration in long-sleepers and less-motivated

students shows that the outcome of a sleep intervention in general may not always result in

an increase of sleep duration or any other sleep related outcome like efficiency, but rather

the maintenance of baseline values. The results also show that, even under these circumstances, SleepU helps students who are naturally moving from a higher to a lower sleep duration due to external pressure (e.g., increasing stress from the academic semester) by minimizing their losses in sleep duration.

Short-sleepers in our study had the most to gain in comparison to all the other students. They

started with an average sleep duration of 6.3 hours in the baseline phase, and saw an increase

of 12 minutes while in the sleep-appointment and 36 minutes while in the app-intervention.

In this case, their gains in sleep duration were three times higher with our personalization method than with a standard sleep-appointment. Moreover, based on our results for H1 and H2, the improvement appears to come from our unique approach to personalization. It is worth

noting that this 36-minute increase in sleep duration for short-sleepers is clinically significant. In a 2013 study (Haack et al., 2013) with pre-hypertension patients, it was shown that an increase of 36 minutes in sleep duration results in a significant decrease in blood pressure.

The participants in that study were also short-sleepers and the study duration was 6 weeks.

There was a change in average motivation for all participants from 4.6 in baseline to 5.17 in the app-intervention and to 5.11 in the sleep-appointment. This change means that participants as

a whole moved from a stage of contemplation to an action stage for both treatments. This

finding not only shows that SleepU is comparable to a sleep clinician in its persuasiveness, it

also demonstrates that even though motivation was similar across both the sleep-appointment


and the app-intervention, sleep duration did not remain the same and instead, after the sleep-

appointment, participants’ sleep duration was usually lower. In other words, despite having

the same interest and possibly having the same intentions, only during the app-intervention

phase of the study were the students able to succeed at improving their sleep. It is not

surprising that students would only show a positive behavior change while using the SleepU

app. The app was a frequent and contextual reminder of different things to do to improve sleep.

Once the app was no longer being used (as they started the sleep-appointment intervention),

their prioritization of sleep health remained the same; however, their success at improving or

maintaining their sleep was lower.

Using mobile-receptivity-triggered notifications holds a lot of promise, even though their actionability was not that different from that of the other triggers. As shown in figure 3.7,

mobile-receptivity has the highest median actionability for all participants and across all

sub-groups investigated. For short-sleepers, the actionability of mobile-receptivity-triggered

notifications has a median of 86% while the next highest actionability is for user-triggered

notifications, which has a median of about 50%; this is a substantial difference despite the lack

of statistical power. Similarly, for motivated participants, actionability for mobile-receptivity-triggered notifications has a median of 80%, while for user-triggered notifications,

it is again about 50%. Further evaluation of actionability over different days of the study

(figure 3.8) shows that the actionability for mobile-receptivity was increasing over the course

of the intervention. This is evident from the distribution over actionability "moving" from

the lower left (low actionability early in the phase) to the top right (increased actionability

late in the phase) for both motivated participants and short-sleepers. This indicates that the

study length was not long enough to reach the highest level of actionability possible by this

mechanism.

The results from our study are in line with those of a similar study by sleep researchers

(Levenson et al., 2016) where a sleep health intervention plus a social comparison component

was applied. However, that intervention (Levenson et al., 2016) relied on expert sleep

clinicians personalizing the sleep intervention for each participant in the study. In comparison, SleepU is cheaper and more scalable, since the app runs all necessary computing locally


FIGURE 3.8: Actionability of mobile-receptivity-generated sleep recommendations by day in the app-intervention phase for motivated and short-sleeper participants. This plot shows how, by the end of the intervention phase, actionability was increasing over time for both groups (more density in the top right of each plot). The lack of actionability early in the phase is a natural consequence of the mechanism inside the communications module that does not use the mobile-receptivity detector when the user checks the recommendations on her own.

on the user’s phone and does not require a sleep clinician. Although our instantiation used

a Fitbit to track sleep, in a different instantiation, the app can rely instead solely on user

self-reports of sleep duration, or on one of the increasing number of mobile phone-based

assessments of sleep duration (e.g., SleepScore).

In terms of limitations, all of the effects found in this study are short-term; long-term effects

could not be evaluated given the length and study design. Future work will address this

issue by increasing the time over which students interact with SleepU. The scope of the

study also limits our results to the college student population, although some of our findings

may generalize across populations with similar constraints and behaviors (e.g., high school

students). Another limitation of the current work is that there was a monetary incentive to

install SleepU and fill out daily sleep diaries. We did not see any effect of filling out sleep

diaries on participants’ motivation; however, just getting individuals to install and try a mobile phone app could be challenging. In 2018, mobile phone users uninstalled 28% of the health apps installed on their phones (Business of Apps, 2018). One possible way to minimize the uninstall


rate is to focus on the user’s first interactions with the app and, in the specific case of a mobile health intervention, on the first treatment. As will be described in chapter 4, the first

treatments could be improved by using a prior that helps the contextual bandit pick sleep

recommendations from the beginning that are most likely to be followed and have a positive

outcome on sleep.

CHAPTER 4

Development of a personalized model of effects (proposed)

A model of effects is a mathematical model that can estimate the direction and/or strength of the effect of a treatment on a health outcome. These models are usually some form of generalized linear regression and are estimated to measure the average long-term effect of treatment after a randomized clinical trial. However, such an effects model is usually not personalized but instead is only capable of giving population- or study-sample-level estimates. For this part

of the proposed work, I will investigate methods for estimating a personalized model of effects: a model capable of estimating, from a small amount of behavioral data (days or weeks), the effects of the treatments of a health intervention for an individual for whom treatment-related data does not yet exist. These models can be used to inform the selection of the initial treatment, as will be shown in chapter 6.

Mobile health researchers have recognized the selection of a good initial treatment (a good initial policy (Tewari and Murphy, 2017)) as an important problem in the development of mobile health interventions. Bad initial treatments can have a negative impact on health and user engagement. Tewari and Murphy (2017) further argue that although this could be done by including expert knowledge, such knowledge can be very difficult to capture accurately, since experts may not be able to take context into account.

A model of effects could be used beyond the selection of the initial treatment; however, for the specific work proposed in chapter 6, the informed prior built from this part of the project is enough to inform the initial treatment and any future intervention point. This is possible because in this proposal I am using a contextual bandit method that starts with a uniform distribution as its prior; this could be replaced by the prior estimated from the model of effects.


Over time, the contextual bandit will personalize (update) this prior to the outcomes of the

health intervention.

4.1 Related work

Mobile health researchers have usually relied on models of effects that are estimated during the intervention itself (Rabbi et al., 2016; Daskalova et al., 2016; Paredes et al., 2014; Yom-Tov et al., 2017) (i.e., no model of effects exists at the beginning of the intervention). More

recent approaches have looked at cohort-based modeling (Daskalova et al., 2018), a method very similar to collaborative filtering (Aggarwal and others, 2016; Breese et al., 1998). In cohort-based modeling, treatment for a new patient in a sleep intervention is provided based on its effectiveness for similar patients: the new patient’s behavioral data is used to build a sleep health profile, and this profile is used to select a cohort from a pool of existing patients. Then, the patient’s sleep health aspects are compared to the cohort’s, and the worst one is selected as the target of the sleep intervention. Treatment is then selected for the individual by looking at the cohort’s best treatment for that target.

4.2 Proposed work

Estimating a model of effects is challenging: Any personalized model of effects, due to the

small amount of data available, is likely to overfit (i.e., cannot generalize to observations

that are not in the training data set). In general, for personalized models of effects the main constraints are that the method has to learn from a small number of observations and yet should be able to generalize. Although this sounds like an impossible task, this constraint does not mean that the data, or even the model, must be limited to the patient’s own.

For this part of the proposal, two main approaches will be compared in terms of predictive power and computing complexity (a minimal sketch of both appears after the list):

(1) Cohort-based approach: Under this approach, formulated by Daskalova et al. (2018), a profile is created using relevant demographic and behavioral variables, and using those variables a cohort is selected from a pool of available data. Using the cohort’s data, a model of effects can be estimated.

(2) Model of effects including demographics: In this approach, demographic and behavioral variables are directly included in the model, and a single model is used to estimate effects for any participant.
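The sketch below illustrates both approaches under assumed data shapes; all names, sizes, and values are illustrative stand-ins rather than the study’s actual features:

# Minimal sketch of both approaches under assumed data shapes; all names,
# sizes, and values are illustrative, not the study's actual features.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n, d_profile, d_treat = 26, 6, 4                  # participants, profile dims, treatments
X_profile = rng.normal(size=(n, d_profile))       # demographics + baseline behavior
X_treat = np.eye(d_treat)[rng.integers(0, d_treat, size=n)]  # one-hot treatment
y = rng.normal(size=n)                            # stand-in treatment outcomes
new_profile = rng.normal(size=(1, d_profile))     # incoming participant

# (1) Cohort-based: select similar people, then fit effects on their data only.
nn = NearestNeighbors(n_neighbors=10).fit(X_profile)
_, idx = nn.kneighbors(new_profile)
cohort = idx[0]
cohort_model = Ridge().fit(X_treat[cohort], y[cohort])

# (2) Single population model with demographics included as inputs.
pooled_model = Ridge().fit(np.hstack([X_profile, X_treat]), y)

# Predicted effect of treatment 0 for the new participant under each approach.
t0 = np.eye(d_treat)[[0]]
print(cohort_model.predict(t0))
print(pooled_model.predict(np.hstack([new_profile, t0])))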

4.3 Evaluation

In order to understand the predictive power of each of the approaches listed above, they will be used to predict the effect of sleep recommendations on sleep duration. I will be using the dataset already collected during the sleep intervention study described in chapter 3. The evaluation metrics are the standard performance metrics used in machine learning to evaluate classifiers (accuracy, precision, recall and F1 score), measures of fit like R2, and measures of fit that penalize model complexity like the Akaike Information Criterion (AIC) (Akaike, 1998) and the Bayesian Information Criterion (BIC) (Schwarz and others, 1978). The evaluation will be done by replicating the conditions under which these models would be deployed: there is an available pool of data, and using demographics and baseline data, a subset of that pool is used to estimate effects models.
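For reference, for a model with k parameters, maximized likelihood L̂, and n observations, these two criteria are defined as

\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln(n) - 2\ln\hat{L}

with lower values indicating a better trade-off between fit and complexity.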

The available dataset has in total 12 weeks of data from 26 participants. It includes Fitbit data, such as sleep-related measurements (duration and efficiency) and physical activity data (steps and exercise intensity). Daily self-reported data includes caffeine consumption, cognitive activities before bed, and sleep disruptions. For 4 of the 12 weeks, the participants self-reported whether they followed the sleep recommendations provided by our SleepU app, and so we have data related to the effectiveness of specific sleep recommendations.

4.4 Envisioned results

From this work, I foresee the creation of specific machine learning pipelines that will allow for creating models of effects for new participants from a very small number of observations.


As a secondary contribution, I foresee the comparison of the cohort-based approach against the single population-level model that includes demographics.

CHAPTER 5

Development of models of behavior (proposed)

In order to capture patients’ behavior from sensor data, I plan to use a Markov decision process and methods from inverse reinforcement learning (IRL), like maximum causal entropy, to estimate the probabilities and costs. IRL is a general method that models state-action tuples of an agent and environment with the goal of discovering a reward function and the underlying policy (i.e., the decision-making process) that the agent uses to interact with the world (Ng et al., 2000). IRL models can inform a mobile health intervention by estimating the preference for a treatment under specific contexts: an IRL model can estimate the likelihood of an observed state (context) and action (treatment).
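To make this concrete, below is a minimal sketch of tabular maximum-entropy IRL on a toy MDP, where states stand in for contexts and actions for treatments; the transition matrix, features, and demonstrations are all synthetic stand-ins rather than study data:

# Minimal sketch of tabular maximum-entropy IRL on a synthetic MDP. States
# stand in for contexts, actions for treatments; P, phi, and demos are
# illustrative stand-ins, not study data.
import numpy as np

rng = np.random.default_rng(0)
S, A, F, gamma = 5, 3, 4, 0.9                 # states, actions, features, discount
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] = next-state distribution
phi = rng.normal(size=(S, F))                 # per-state feature vectors
demos = [[(0, 1), (2, 0), (4, 2)]]            # observed (state, action) sequences

def soft_policy(w, iters=100):
    """Soft value iteration for the reward r(s) = phi(s) . w."""
    r = phi @ w
    V = np.zeros(S)
    for _ in range(iters):
        Q = r[:, None] + gamma * (P @ V)      # Q[s, a]
        V = np.log(np.exp(Q).sum(axis=1))     # soft maximum over actions
    return np.exp(Q - V[:, None])             # stochastic policy pi[s, a]

def state_visitation(pi, start, T):
    """Expected state visitation frequencies when following pi from start."""
    d = np.zeros(S); d[start] = 1.0
    total = np.zeros(S)
    for _ in range(T):
        total += d
        d = np.einsum("s,sa,sat->t", d, pi, P)
    return total

w = np.zeros(F)
emp = sum(phi[s] for traj in demos for s, _ in traj)   # empirical feature counts
for _ in range(200):                                   # gradient ascent on log-likelihood
    pi = soft_policy(w)
    D = state_visitation(pi, start=demos[0][0][0], T=len(demos[0]))
    w += 0.01 * (emp - D @ phi)                        # match feature expectations
print("estimated reward weights:", w)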

Traditional approaches, like generalized regression models, are not well equipped to deal with decision-making or to consider the effect of context on behavior. Additional constraints, like how behavior is related to the optimization of general behavioral and contextual preferences and outcomes, are also overlooked in traditional methods. These behavioral preferences are strongly related to self-perceptions of ability and context, which have a strong influence on behavior change.

Other approaches, like a general regression model, do not take into account the dynamics of the environment (i.e., how a patient transitions between contexts). Approaches that do take dynamics into account often do not fit human behavior data well because they overly constrain the shape of the probability distributions (e.g., Hidden Markov Models fitted with Expectation Maximization). IRL models, in contrast, can model dynamics and have estimation methods available (e.g., maximum causal entropy) that are better suited to modeling human behavior (e.g., by allowing for probability distributions that have a higher entropy).


5.1 Related work

In previous work, IRL has been able to capture the underlying policy behind everyday activities like driving a cab (Ziebart et al., 2008), people’s walking trajectories around an office (Kitani et al., 2012), and the driving behaviors of aggressive vs. non-aggressive drivers (Banovic et al., 2017), and even to uncover and transfer the policy of an expert aerobatic RC-helicopter pilot (Abbeel et al., 2010). IRL modeling in the context of a mobile health intervention can be used to uncover people’s policy: how people make decisions in the real world and how those decisions are affected by their current physiological state and the state of the environment.

For this proposal, I am particularly interested in how IRL models can improve the effectiveness

of health interventions by informing how a patient would respond behaviorally to specific

treatments. As an example, in a mobile health intervention a system could recommend that a patient exercise more at specific times of day and in specific places; however, such a recommendation, if ill-informed, may cause the participant harm or lead them to give up because the treatment is too strenuous or does not accommodate the patient’s lifestyle. Instead, using an IRL-based approach, it could be estimated whether particular treatments are likely to be followed by the patient. Based on that, personalization of the health intervention is not only achievable, but more likely to succeed. To investigate the use of IRL in mobile health interventions, I will be using the data already collected from the sleep intervention study described in chapter 3.

One of the first research questions I will be investigating is whether we need individual models

for each participant, models for subsets of participants or a population model. I have explored

similar research in the past (Gjoreski et al., 2015; Hong et al., 2012; Hong et al., 2015)

in the context of activity recognition. In that work, I found that neither the population-level nor the individual-level models were the most useful; instead, a middle-ground approach produced the best results. Although this hints towards a middle-ground approach in the domain of modeling behaviors as well, there are key differences with respect to activity recognition and, in general, the topic needs more research.


5.2 Evaluation

For this part of the project, I will be comparing individual, middle-ground, and population-level behavior modeling. The evaluation metrics are those usually used in machine learning to evaluate classifiers (accuracy, precision, recall and F1 score), measures of fit like R2, and measures of fit that penalize model complexity like the Akaike Information Criterion (AIC) (Akaike, 1998) and the Bayesian Information Criterion (BIC) (Schwarz and others, 1978). Once behavioral models are established, a collaborative filtering approach will be used to find the best behavioral models for new participants.

5.3 Envisioned results

From this part of the proposed work, I foresee the creation of an inverse reinforcement

learning approach that leverages already-available behavioral data to estimate a behavioral model for a new participant.

5.4 Alternative plan

Although IRL methods are well known and have been used in complex domains, their performance with a very limited amount of data is unknown. For that reason, if this approach does not work well enough, I plan to use collaborative filtering and content-based filtering as alternatives. Collaborative filtering (Breese et al., 1998) is an approach for recommender systems: systems that try to predict the preference of a person for a particular item. Collaborative filtering starts with a usually incomplete set of preferences for a user; those preferences are then used to find other users in a database with similar preferences, and new items are then suggested from the pool of items preferred by those similar users. This is very similar to the cohort-based approach described in chapter 4.

Another applicable method is content-based filtering (Aggarwal and others, 2016). In this approach, items are described through a series of properties. Those properties are used as


input to a machine learning classifier, which can then estimate which properties of those items are desirable to the user and, additionally, may be able to predict whether a new item is desirable in general. In the context of a sleep intervention, I could create several dimensions for each sleep recommendation, like difficulty of performance, applicable time of day, perceived value, etc. Once those dimensions are created, a classifier could be built to predict whether or not a participant will like a particular sleep recommendation.
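A minimal sketch of this idea, with hypothetical recommendation dimensions and adherence labels, could look like this:

# Minimal sketch of content-based filtering for sleep recommendations; the
# property dimensions and adherence labels are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: difficulty (0-1), applicable in the evening (0/1), perceived value (0-1).
items = np.array([
    [0.2, 1, 0.8],
    [0.9, 0, 0.4],
    [0.5, 1, 0.6],
    [0.7, 1, 0.3],
])
followed = np.array([1, 0, 1, 0])    # this participant's past adherence

clf = LogisticRegression().fit(items, followed)
new_item = np.array([[0.3, 1, 0.7]])
print("probability of following:", clf.predict_proba(new_item)[0, 1])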

CHAPTER 6

A models-based approach to select initial treatment (proposed)

After figuring out the best techniques for estimating models of effects and behavior, I will use those models in the context of a sleep health intervention to select the initial treatment; this models-based approach will also be used to construct priors for the contextual bandits. Priors are probability estimates for each of the sleep recommendations available in the study that inform how likely a recommendation is to be followed and to improve sleep duration and efficiency. The goal of using a prior is to speed up learning, reduce variance, and decrease the impact of noise.

The general idea is to replicate and re-use the same system from the former sleep intervention study, with the difference that this time the contextual bandit will not start selecting treatments using a uniform probability distribution; instead, the probabilities will be informed by the estimated models of behavior and effects. Another key difference is that the study is going to be 8 weeks long in total, with 6 weeks of intervention (the previous version had only 4 weeks of intervention). There will be only minimal changes to the app so that the results from this new study can be compared to the earlier study.
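As a simplified illustration of this seeding idea (ignoring context for brevity, and with made-up probabilities), a Thompson-sampling bandit over sleep recommendations could swap its uniform Beta(1, 1) priors for pseudo-counts derived from the models:

# Minimal sketch of the seeding idea, ignoring context for brevity: a
# Thompson-sampling bandit over sleep recommendations whose Beta priors come
# either from uniform Beta(1, 1) counts (as in the earlier study) or from
# pseudo-counts derived from the models; all numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_recs = 4
uniform_prior = np.ones((n_recs, 2))                 # Beta(1, 1) per recommendation

# Hypothetical model-estimated probabilities that each recommendation is
# followed and improves sleep, converted into Beta pseudo-counts.
p_model = np.array([0.7, 0.4, 0.55, 0.3])
strength = 10.0                                      # how much the models are trusted
informed_prior = np.stack([p_model * strength, (1 - p_model) * strength], axis=1)

def select(counts):
    """Thompson sampling: sample each posterior, deliver the best-looking arm."""
    return int(np.argmax(rng.beta(counts[:, 0], counts[:, 1])))

def update(counts, arm, success):
    counts[arm, 0 if success else 1] += 1            # conjugate posterior update

arm = select(informed_prior)                         # first treatment, already informed
update(informed_prior, arm, success=True)            # personalize with the outcome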

6.1 Related work

Selecting an initial treatment or a starting policy is a general problem that is also found in the

field of contextual bandits. Zhang et al. (2019), for example, explore the problem of combining expert advice (supervised labels) with the feedback acquired by a contextual bandit to solve the starting-policy problem. In Liao’s work, the expert advice and the contextual bandit are assumed to be misaligned, e.g., the preferences of the


expert and those experienced by the contextual bandit may not be the same. In the context of a

mobile health intervention, expert advice could be health interventions provided by a clinician

while the contextual bandit feedback is provided by the patient. If the clinician did not assess the patient’s preferences well enough, there will be misalignment between the two and the intervention will not be as effective. Yet another example is estimating the effects of an intervention using a model of effects and using that model’s estimates as expert advice; here again, if the preferences of the patient are not taken into account, misalignment will occur and generate adverse outcomes.

Mobile health researchers have mostly used uniform priors (Paredes et al., 2014; Rabbi et al., 2016; Yom-Tov et al., 2017), which is equivalent to assuming that all treatments are equally good. Liao et al. (2019) are among the first to consider, for future studies, the use of an informative prior from a previous study to initialize the reinforcement learning algorithm.

6.2 Simulation

Although the main goal of this work is to test this approach in a real deployment, the first step

is to simulate and estimate empirical bounds on possible outcomes. There will be two main approaches to simulation:

(1) Abstract simulation. In this simulation, the goal is to gain an empirical understanding of the impact of priors varying from very close to very far from the real probability values by seeding those different priors to the contextual bandit. For this simulation, no real sleep data will be used; instead, idealized data generated from normal distributions that approximate the ranges and values observed in the actual study will be used.

(2) Sleep-personas simulation. In this simulation, the goal is to create several personas from the data collected in our sleep study and then use those profiles to again simulate what happens with priors varying from very close to very far from the real probability values. The personas used will vary from very predictable (i.e., low variance and low entropy) to unpredictable (high variance and high entropy). These estimates will provide bounds on how the quality of the priors will affect different types of people in a real deployment.

In general, the outcomes from the simulations are estimates of the consequences of having a range of priors, varying from close to the patient’s underlying treatment preferences and effects to completely uninformative. The measures used to evaluate the performance of the proposed approaches are borrowed from the transfer learning in reinforcement learning literature (Taylor and Stone, 2009), namely: Jumpstart (gains in treatment outcomes during the first week of intervention), Speed (time to achieve the long-term treatment outcome), and Generalization (improvement gains on the treatment outcome in the long term). The results from this simulation will inform how to use and implement the priors in a real-world deployment.
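A minimal sketch of how these three measures could be computed from a simulated per-day outcome trace (the trace and target below are illustrative):

# Minimal sketch of the three transfer measures computed over a simulated
# per-day outcome trace; the trace and target below are illustrative.
import numpy as np

def jumpstart(trace, days=7):
    """Mean outcome during the first week of intervention."""
    return float(np.mean(trace[:days]))

def speed(trace, target):
    """Days until the outcome first reaches the long-term target (None if never)."""
    hits = np.nonzero(np.asarray(trace) >= target)[0]
    return int(hits[0]) if hits.size else None

def generalization(trace, days=7):
    """Mean outcome over the final week, i.e., long-term gains."""
    return float(np.mean(trace[-days:]))

trace = np.linspace(6.3, 7.2, 42)    # e.g., sleep hours over a 6-week phase
print(jumpstart(trace), speed(trace, target=7.0), generalization(trace))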

6.3 Study

The main goal of this study is the measurement and comparison of two different approaches for selecting the initial treatment in the context of a mobile health intervention for sleep. The first approach is to select the initial treatment using a model of behavior and a model of effects. The second approach is to ask participants in the study, via a survey, to estimate their preferences and to forecast the effects of each of the treatments in the intervention. These estimates will then be used to select the initial treatment. The main reason to include this survey

approach is twofold: First, although model-driven approaches are attractive, their application is limited to settings where there is access to a data scientist or an expert capable of estimating such models; in comparison, surveys are available to anyone. Second, even if the model-driven approach is better, it is important to understand how much better it can be and whether it is justified to use a method that is computationally and economically more expensive than a simple survey.

The study will follow a between-subjects design. Both groups will interact with the same system as in the earlier study (chapter 3); the only difference between them will be the method used for estimating the initial treatment and the priors for the contextual bandits.


The main research questions are the following:

(1) How do the model-based and the survey approaches compare?

(2) How does an approach using an informed prior compare against uniform priors (as in the former study)?

(3) Does giving better initial treatments improve adherence to subsequent treatments?

For research questions 1 and 2, the measurements that will be compared are sleep duration,

sleep efficiency, motivation to improve sleep, jumpstart, speed, and generalization. For

research question 3, adherence will be measured by looking at how many of the sleep recommendations are followed and seen in the app. Motivation will also be measured in this case

and compared to motivation levels for the same stage in the former study.

In terms of the questionnaires, I will be expanding the measures to collect health outcomes that may be affected by improved (or worsened) sleep, such as general mood, attention, memory, and stress. This is an important step since sleep is fundamental to many biological processes.

The main goal here is to collect data that can inform about measurable consequences of

changes in sleep duration.

For screening, I will be using the same criteria as in the former study, which excluded participants with sleep disorders or problematic substance use.

6.3.1 Study protocol

The study will follow a between-subjects design. After screening, participants are assigned at random to one of two groups: survey approach vs. models-based approach. The study length is 8 weeks, with 2 weeks of baseline and 6 weeks of the sleep intervention.

For the entire duration of the study, participants will wear a Fitbit device at all times (we are still deciding between a Fitbit Inspire and an Inspire HR). Throughout the study, participants will answer a daily questionnaire in which they will be asked whether they performed any of the sleep recommendations. These questions will be asked even during the baseline period.


6.3.2 Power analysis

I estimated the sample size required for this study following (Sakpal, 2010):

u_1 = 0.7, \quad u_2 = 0.5, \quad \mathrm{std} = 0.3, \quad Z_{\alpha/2} = 1.96, \quad Z_{\beta} = 0.84

n = \frac{(Z_{\alpha/2} + Z_{\beta})^2 \cdot 2 \cdot \mathrm{std}^2}{(u_1 - u_2)^2} = 35.28

Here u1 is the average actionability of the model-based approach during the first two weeks, u2 is the average actionability using a uniform prior, Zα/2 is the Z-score for a significance level of 0.05, Zβ is the Z-score for a power of 0.8, and n is the sample size for each group. According to the above estimate, the sample size per group should be 36 people (rounding 35.28 up), for a total study sample of 72 participants.
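A quick check of the arithmetic:

# Quick check of the sample-size arithmetic above.
z_alpha2, z_beta = 1.96, 0.84
u1, u2, std = 0.7, 0.5, 0.3
n = (z_alpha2 + z_beta) ** 2 * 2 * std ** 2 / (u1 - u2) ** 2
print(n)   # 35.28 -> round up to 36 per group, 72 in total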

6.4 Envisioned results

For the simulation part, I expect to find that priors that are offset from the real probability estimates are very damaging and may increase the time it takes to find the optimal treatment in proportion to the difference in magnitude between the real and the estimated priors. Informative priors, on the other hand, can truly accelerate the process, likewise in proportion to the difference between the prior and the real probabilities.

For the study part of the proposed work, I expect to find that the model-based approach is significantly superior to a uniform prior and only slightly better than the survey approach. My main hypothesis is that the survey approach may be enough in non-critical interventions.


My main reasoning behind this hypothesis is that although people may not have perfect information about their preferences and the forecast effects of treatment, they can still provide a relatively good estimate that is better than a uniform prior.

CHAPTER 7

General timeline for the proposal

The following timeline includes all of the activities described in chapters 4, 5 and 6:

Activity                                           | Start date     | End date       | Description

Models of effects and behavior                     | November 20th  | January 20th   | Estimating and testing the performance of behavior and effects models from already collected data.

Android app upgrades                               | November 20th  | February 1st   | Improvements and updates to the app.

IRB                                                | December 8th   | January 8th    | IRB submission and changes.

Ideal priors simulation with contextual bandits    | November 20th  | December 2nd   |

Persona priors simulation with contextual bandits  | January 20th   | February 4th   |

Study recruitment                                  | January 15th   | February 15th  | The study will be deployed simultaneously at the University of Pittsburgh, Carlow and Carnegie Mellon University.

Study start                                        | February 20th  | April 9th      | End of semester: U. Pittsburgh April 20th, CMU May 19th, Carlow April 24th.

Data wrangling                                     | May 1st        | June 1st       | Initial data cleaning, transformation, and visualization.

Hypothesis testing                                 | June 2nd       | August 2nd     |

Papers writing                                     | July 1st       | End of September |

Thesis writing                                     | August 1st     | End of October |

Thesis defense                                     | Early December |                |

TABLE 7.1: Timeline


References

[Abbeel et al.2010] Pieter Abbeel, Adam Coates, and Andrew Y Ng. 2010. Autonomous heli-copter aerobatics through apprenticeship learning. The International Journal of RoboticsResearch, 29(13):1608–1639.

[Adan et al.2006] ANA Adan, Marco Fabbri, Vincenzo Natale, and Gemma Prat. 2006.Sleep beliefs scale (sbs) and circadian typology. Journal of Sleep Research, 15(2):125–132.

[Aggarwal and others2016] Charu C Aggarwal et al. 2016. Recommender systems. Springer.

[Akaike1998] Hirotogu Akaike. 1998. Information theory and an extension of the maximumlikelihood principle. In Selected papers of hirotugu akaike, pages 199–213. Springer.

[Auer et al.2002] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire.2002. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77.

[Aung et al.2017] Min Hane Aung, Mark Matthews, and Tanzeem Choudhury. 2017. Sensingbehavioral symptoms of mental health and delivering personalized interventions usingmobile technologies. Depression and anxiety, 34(7):603–609.

[Bandura1977] Albert Bandura. 1977. Self-efficacy: toward a unifying theory of behavioralchange. Psychological review, 84(2):191.

[Banovic et al.2017] Nikola Banovic, Anqi Wang, Yanfeng Jin, Christie Chang, JulianRamos, Anind Dey, and Jennifer Mankoff. 2017. Leveraging human routine modelsto detect and generate human behaviors. In Proceedings of the 2017 CHI Conference onHuman Factors in Computing Systems, pages 6683–6694. ACM.

[Bauer et al.2012] Jared Bauer, Sunny Consolvo, Benjamin Greenstein, Jonathan Schooler,Eric Wu, Nathaniel F. Watson, and Julie Kientz. 2012. ShutEye: Encouraging Awarenessof Healthy Sleep Recommendations with a Mobile, Peripheral Display. In Proceedingsof the 2012 ACM annual conference on Human Factors in Computing Systems - CHI ’12,page 1401, New York, New York, USA. ACM Press.

[Biener and Abrams1991] Lois Biener and David B Abrams. 1991. The contemplationladder: validation of a measure of readiness to consider smoking cessation. HealthPsychology, 10(5):360.

72

REFERENCES 73

[Breese et al.1998] John S Breese, David Heckerman, and Carl Kadie. 1998. Empiricalanalysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenthconference on Uncertainty in artificial intelligence, pages 43–52. Morgan KaufmannPublishers Inc.

[Bubeck et al.2012] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. 2012. Regret analysis ofstochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® inMachine Learning, 5(1):1–122.

[Buysse et al.1989] Daniel J Buysse, Charles F Reynolds III, Timothy H Monk, Susan RBerman, and David J Kupfer. 1989. The pittsburgh sleep quality index: a new instrumentfor psychiatric practice and research. Psychiatry research, 28(2):193–213.

[Buysse2014] Daniel J Buysse. 2014. Sleep health: can we define it? does it matter? Sleep,37(1):9–17.

[Centre for Clinical Interventions] Australia Centre for Clinical Interventions. Sleep hygiene.

[Cohen et al.1994] Sheldon Cohen, T Kamarck, R Mermelstein, et al. 1994. Perceived stressscale. Measuring stress: A guide for health and social scientists, 10.

[Collins and Varmus2015] Francis S Collins and Harold Varmus. 2015. A new initiative onprecision medicine. New England journal of medicine, 372(9):793–795.

[Dallery et al.2013] Jesse Dallery, Rachel N Cassidy, and Bethany R Raiff. 2013. Single-caseexperimental designs to evaluate novel technology-based health interventions. Journal ofmedical Internet research, 15(2):e22, feb.

[Daskalova et al.2016] Nediyana Daskalova, Danaë Metaxa-Kakavouli, Adrienne Tran,Nicole Nugent, Julie Boergers, John McGeary, and Jeff Huang. 2016. Sleepcoacher:A personalized automated self-experimentation system for sleep recommendations. InProceedings of the 29th Annual Symposium on User Interface Software and Technology,pages 347–358. ACM.

[Daskalova et al.2018] Nediyana Daskalova, Bongshin Lee, Jeff Huang, Chester Ni, and Jes-sica Lundin. 2018. Investigating the effectiveness of cohort-based sleep recommendations.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies,2(3):101.

[Dingler and Pielot2015] Tilman Dingler and Martin Pielot. 2015. I’ll be there for you:Quantifying attentiveness towards mobile messaging. In Proceedings of the 17th Inter-national Conference on Human-Computer Interaction with Mobile Devices and Services,pages 1–5. ACM.

[Fogg2009] Bj Fogg. 2009. A behavior model for persuasive design. Proceedings of the 4thInternational Conference on Persuasive Technology - Persuasive ’09, page 1.

74 REFERENCES

[Gjoreski et al.2015] Hristijan Gjoreski, Simon Kozina, Matjaz Gams, Mitja Lustrek,Juan Antonio Álvarez-García, Jin-Hyuk Hong, Julian Ramos, Anind K Dey, MaurizioBocca, and Neal Patwari. 2015. Competitive live evaluations of activity-recognitionsystems. IEEE Pervasive Computing, 14(1):70–77.

[Grandner et al.2014] Michael A Grandner, Nicholas Jackson, Nalaka S Gooneratne, andNirav P Patel. 2014. The development of a questionnaire to assess sleep-related practices,beliefs, and attitudes. Behavioral sleep medicine, 12(2):123–142.

[Haack et al.2013] Monika Haack, Jorge Serrador, Daniel Cohen, Norah Simpson, HansMeier-Ewert, and Janet M Mullington. 2013. Increasing sleep duration to lower beat-to-beat blood pressure: a pilot study. Journal of sleep research, 22(3):295–304.

[Heather et al.2008] Nick Heather, David Smailes, and Paul Cassidy. 2008. Development ofa readiness ruler for use with alcohol brief interventions. Drug and alcohol dependence,98(3):235–240.

[Ho and Intille2005] Joyce Ho and Stephen S Intille. 2005. Using context-aware computingto reduce the perceived burden of interruptions from mobile devices. In Proceedings of theSIGCHI conference on Human factors in computing systems, pages 909–918. ACM.

[Hong et al.2012] Jin-Hyuk Hong, Julian Ramos, Choonsung Shin, and Anind K Dey. 2012.An activity recognition system for ambient assisted living environments. In InternationalCompetition on Evaluating AAL Systems Through Competitive Benchmarking, pages 148–158. Springer.

[Hong et al.2015] Jin-Hyuk Hong, Julian Ramos, and Anind K Dey. 2015. Toward personal-ized activity recognition systems with a semipopulation approach. IEEE Transactions onHuman-Machine Systems, 46(1):101–112.

[Horne and Östberg1976] Jim A Horne and Olov Östberg. 1976. A self-assessment question-naire to determine morningness-eveningness in human circadian rhythms. Internationaljournal of chronobiology.

[Horsch et al.2017] Corine Horsch, Sandor Spruit, Jaap Lancee, Rogier van Eijk, Robbert JanBeun, Mark Neerincx, and Willem-Paul Brinkman. 2017. Reminders make people adherebetter to a self-help sleep intervention. Health and technology, 7(2-3):173–188.

[Hunter2007] John D Hunter. 2007. Matplotlib: A 2d graphics environment. Computing inscience & engineering, 9(3):90–95.

[Jakicic et al.2016] John M Jakicic, Kelliann K Davis, Renee J Rogers, Wendy C King,Marsha D Marcus, Diane Helsel, Amy D Rickman, Abdus S Wahed, and Steven H Belle.2016. Effect of wearable technology combined with a lifestyle intervention on long-termweight loss: the idea randomized clinical trial. Jama, 316(11):1161–1171.

REFERENCES 75

[Jameson and Longo2015] J Larry Jameson and Dan L Longo. 2015. Precision medi-cine—personalized, problematic, and promising. Obstetrical & gynecological survey,70(10):612–614.

[Jones et al.2001 ] Eric Jones, Travis Oliphant, Pearu Peterson, et al. 2001–. SciPy: Opensource scientific tools for Python. [Online; accessed ].

[Katevas et al.2017] Kleomenis Katevas, Ilias Leontiadis, Martin Pielot, and Joan Serrà.2017. Continual prediction of notification attendance with classical and deep networkapproaches. arXiv preprint arXiv:1712.07120.

[Kay and Wobbrock2016] Matthew Kay and J Wobbrock. 2016. Artool: aligned ranktransform for nonparametric factorial anovas. R package version 0.10, 2.

[Kim and Bang2016] Jeehyoung Kim and Heejung Bang. 2016. Three common misuses ofp values. Dental hypotheses, 7(3):73.

[Kitani et al.2012] Kris M Kitani, Brian D Ziebart, James Andrew Bagnell, and MartialHebert. 2012. Activity forecasting. In European Conference on Computer Vision, pages201–214. Springer.

[Klasnja and Veeraraghavan2018] Predrag Klasnja and Eric B Veeraraghavan. 2018. Rethink-ing evaluations of mhealth systems for behavior change. GetMobile: Mobile Computingand Communications, 22(2):11–14.

[Kramer et al.2019] Jan-Niklas Kramer, Florian Künzler, Varun Mishra, Bastien Presset,David Kotz, Shawna Smith, Urte Scholz, and Tobias Kowatsch. 2019. Investigatingintervention components and exploring states of receptivity for a smartphone app topromote physical activity: protocol of a microrandomized trial. JMIR research protocols,8(1):e11540.

[Kuleshov and Precup2014] Volodymyr Kuleshov and Doina Precup. 2014. Algorithms formulti-armed bandit problems. arXiv preprint arXiv:1402.6028.

[Lattimore and Szepesvári2019] Tor Lattimore and Csaba Szepesvári. 2019. Bandit al-gorithms.

[Levenson et al.2016] Jessica C Levenson, Elizabeth Miller, Bethany L Hafer, Mary F Re-idell, Daniel J Buysse, and Peter L Franzen. 2016. Pilot study of a sleep health promotionprogram for college students. Sleep health, 2(2):167–174.

[Liao et al.2019] Peng Liao, Kristjan Greenewald, Predrag Klasnja, and Susan Murphy.2019. Personalized heartsteps: A reinforcement learning algorithm for optimizing physicalactivity. arXiv preprint arXiv:1909.03539.

[McKinney and others2010] Wes McKinney et al. 2010. Data structures for statisticalcomputing in python. In Proceedings of the 9th Python in Science Conference, volume445, pages 51–56. Austin, TX.

76 REFERENCES

[Michie et al.2011] Susan Michie, Maartje M Van Stralen, and Robert West. 2011. Thebehaviour change wheel: a new method for characterising and designing behaviour changeinterventions. Implementation science, 6(1):42.

[Morawiec] Darius Morawiec. sklearn-porter. Transpile trained scikit-learn estimators to C,Java, JavaScript and others.

[Nagai et al.2013] Masato Nagai, Yasutake Tomata, Takashi Watanabe, Masako Kakizaki,and Ichiro Tsuji. 2013. Association between sleep duration, weight gain, and obesity forlong period. Sleep Medicine, 14(2):206–210.

[Nahum-Shani et al.2017] Inbal Nahum-Shani, Shawna N Smith, Bonnie J Spring, Linda MCollins, Katie Witkiewitz, Ambuj Tewari, and Susan A Murphy. 2017. Just-in-timeadaptive interventions (jitais) in mobile health: key components and design principles forongoing health behavior support. Annals of Behavioral Medicine, 52(6):446–462.

[Ng et al.2000] Andrew Y Ng, Stuart J Russell, et al. 2000. Algorithms for inverse reinforce-ment learning. In Icml, volume 1, page 2.

[of Apps2018] Business of Apps. 2018. Mobile app uninstall rate after 30 days.

[Okoshi et al.2016] Tadashi Okoshi, Hiroki Nozaki, Jin Nakazawa, Hideyuki Tokuda, JulianRamos, and Anind K Dey. 2016. Towards attention-aware adaptive notification on smartphones. Pervasive and Mobile Computing, 26:17–34.

[Oliphant2006] Travis E Oliphant. 2006. A guide to NumPy, volume 1. Trelgol PublishingUSA.

[Paredes et al.2014] Pablo Paredes, Ran Gilad-Bachrach, Mary Czerwinski, Asta Roseway,Kael Rowan, and Javier Hernandez. 2014. Poptherapy: Coping with stress throughpop-culture. In Proceedings of the 8th International Conference on Pervasive ComputingTechnologies for Healthcare, pages 109–117. ICST (Institute for Computer Sciences,Social-Informatics and . . . .

[Pedregosa et al.2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machinelearning in Python. Journal of Machine Learning Research, 12:2825–2830.

[Pielot et al.2014] Martin Pielot, Rodrigo De Oliveira, Haewoon Kwak, and Nuria Oliver.2014. Didn’t you see my message?: predicting attentiveness to mobile instant messages. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages3319–3328. ACM.

[Pielot et al.2015] Martin Pielot, Tilman Dingler, Jose San Pedro, and Nuria Oliver. 2015.When attention is not scarce-detecting boredom from mobile phone usage. In Proceedingsof the 2015 ACM international joint conference on pervasive and ubiquitous computing,

REFERENCES 77

pages 825–836. ACM.

[Pielot et al.2017] Martin Pielot, Bruno Cardoso, Kleomenis Katevas, Joan Serrà, AleksandarMatic, and Nuria Oliver. 2017. Beyond interruptibility: Predicting opportune moments toengage mobile phone users. Proceedings of the ACM on Interactive, Mobile, Wearable andUbiquitous Technologies, 1(3):91.

[Posner and Gehrman2011] Donn Posner and Philip R. Gehrman. 2011. Sleep Hygiene.Academic Press, jan.

[Prochaska and Velicer1997] James O Prochaska and Wayne F Velicer. 1997. The trans-theoretical model of health behavior change. American Journal of Health Promotion,12(1):38–48.

[Rabbi et al.2016] Mashfiqui Rabbi, Min Hane Aung, Mi Zhang, and Tanzeem Choudhury.2016. MyBehavior: Automatic Personalized Health Feedback from User Behaviors andPreferences using Smartphones. In Proceedings of the 2015 ACM International JointConference on Pervasive and Ubiquitous Computing - UbiComp ’15, pages 707–718, NewYork, New York, USA. ACM Press.

[Rabbi et al.2018] Mashfiqui Rabbi, Min SH Aung, Geri Gay, M Cary Reid, and TanzeemChoudhury. 2018. Feasibility and acceptability of mobile phone–based auto-personalizedphysical activity recommendations for chronic pain self-management: Pilot study on adults.Journal of medical Internet research, 20(10):e10147.

[Rahman et al.2016] Tauhidur Rahman, Mary Czerwinski, Ran Gilad-Bachrach, and PaulJohns. 2016. Predicting about-to-eat moments for just-in-time eating intervention. InProceedings of the 6th International Conference on Digital Health Conference, pages141–150. ACM.

[Rasch and Born2013] Björn Rasch and Jan Born. 2013. About Sleep's Role in Memory. Physiological Reviews, 93(2):681–766.

[Roberts et al.] Mary Catherine Roberts, Avery St Dizier, and Joshua Vaughan. Multiobjective optimization: Portfolio optimization based on goal programming methods.

[Rothman1990] Kenneth J Rothman. 1990. No adjustments are needed for multiple comparisons. Epidemiology, pages 43–46.

[Sakpal2010] Tushar Sakpal. 2010. Sample size estimation in clinical trial. Perspectives in Clinical Research, 1(2):67.

[Sankar and Parker2017] Pamela L Sankar and Lisa S Parker. 2017. The Precision Medicine Initiative's All of Us Research Program: an agenda for research on its ethical, legal, and social issues. Genetics in Medicine, 19(7):743.

[Sano et al.2017] Akane Sano, Paul Johns, and Mary Czerwinski. 2017. Designing opportune stress intervention delivery timing using multi-modal data. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pages 346–353. IEEE.

[Saville1990] Dave J Saville. 1990. Multiple comparison procedures: the practical solution. The American Statistician, 44(2):174–180.

[Schwarz and others1978] Gideon Schwarz et al. 1978. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.

[Stickgold et al.2001] R. Stickgold, J. A. Hobson, R. Fosse, and M. Fosse. 2001. Sleep, learning, and dreams: Off-line memory reprocessing. Science, 294(5544):1052–1057.

[Taylor and Stone2009] Matthew E Taylor and Peter Stone. 2009. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685.

[Taylor Kyle2019] Kyle Taylor and Laura Silver. 2019. Smartphone ownership is growing rapidly around the world, but not always equally.

[Tewari and Murphy2017] Ambuj Tewari and Susan A Murphy. 2017. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer.

[Thiese et al.2016] Matthew S Thiese, Brenden Ronna, and Ulrike Ott. 2016. P value interpretations and considerations. Journal of Thoracic Disease, 8(9):E928.

[Walker2009] Matthew P. Walker. 2009. The role of sleep in cognition and emotion. Annals of the New York Academy of Sciences, 1156:168–197.

[Wang et al.2014] Rui Wang, Fanglin Chen, Zhenyu Chen, Tianxing Li, Gabriella Harari, Stefanie Tignor, Xia Zhou, Dror Ben-Zeev, and Andrew T Campbell. 2014. StudentLife: assessing mental health, academic performance and behavioral trends of college students using smartphones. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 3–14. ACM.

[Waskom et al.2017] Michael Waskom, Olga Botvinnik, Drew O'Kane, Paul Hobson, Saulius Lukauskas, David C Gemperline, Tom Augspurger, Yaroslav Halchenko, John B. Cole, Jordi Warmenhoven, Julian de Ruiter, Cameron Pye, Stephan Hoyer, Jake Vanderplas, Santi Villalba, Gero Kunter, Eric Quintero, Pete Bachant, Marcel Martin, Kyle Meyer, Alistair Miles, Yoav Ram, Tal Yarkoni, Mike Lee Williams, Constantine Evans, Clark Fitzgerald, Brian, Chris Fonnesbeck, Antony Lee, and Adel Qalieh. 2017. mwaskom/seaborn: v0.8.1 (September 2017), September.

[Wasserstein et al.2016] Ronald L Wasserstein, Nicole A Lazar, et al. 2016. The ASA's statement on p-values: context, process, and purpose. The American Statistician, 70(2):129–133.

[Wasserstein et al.2019] Ronald L Wasserstein, Allen L Schirm, and Nicole A Lazar. 2019. Moving to a world beyond "p < 0.05". The American Statistician, 73(sup1):1–19.

[Wobbrock et al.2011] Jacob O Wobbrock, Leah Findlater, Darren Gergle, and James J Higgins. 2011. The aligned rank transform for nonparametric factorial analyses using only ANOVA procedures. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 143–146. ACM.

[Wolk et al.2005] Robert Wolk, Apoor S. Gami, Arturo Garcia-Touchard, Virend K. Somers, and S. H. Rahimtoola. 2005. Sleep and cardiovascular disease. Current Problems in Cardiology, 30(12):625–662.

[Yang et al.2014] Guang Yang, Cora Sau Wan Lai, Joseph Cichon, Lei Ma, Wei Li, and Wen-Biao Gan. 2014. Sleep promotes branch-specific formation of dendritic spines after learning. Science, 344(6188):1173–1178.

[Yom-Tov et al.2017] Elad Yom-Tov, Guy Feraru, Mark Kozdoba, Shie Mannor, Moshe Tennenholtz, and Irit Hochberg. 2017. Encouraging physical activity in patients with diabetes: intervention using a reinforcement learning system. Journal of Medical Internet Research, 19(10):e338.

[Zhang et al.2019] Chicheng Zhang, Alekh Agarwal, Hal Daumé III, John Langford, and Sahand N Negahban. 2019. Warm-starting contextual bandits: robustly combining supervised and bandit feedback. arXiv preprint arXiv:1901.00301.

[Ziebart et al.2008] Brian D Ziebart, Andrew L Maas, Anind K Dey, and J Andrew Bagnell. 2008. Navigate like a cabbie: Probabilistic reasoning from observed context-aware behavior. In Proceedings of the 10th International Conference on Ubiquitous Computing, pages 322–331. ACM.