
Dalarna Licentiate Theses in Microdata Analysis 11

The reinforcement learning method
A feasible and sustainable control strategy for efficient occupant-centred building operation in smart cities

ROSS MAY

Microdata Analysis
School of Technology and Business Studies
Dalarna University, Borlänge, Sweden
2019


Licentiate thesis presented at Dalarna University to be publicly examined in B310, Borlänge, Friday, 1 November 2019 at 10:00 for the Degree of Licentiate of Philosophy. The examination will be conducted in English. Opponent: Assistant Professor Zoltan Nagy (The University of Texas at Austin).

Abstract

May, R. 2019. The reinforcement learning method. A feasible and sustainable control strategy for efficient occupant-centred building operation in smart cities. Dalarna Licentiate Theses in Microdata Analysis 11. Borlänge: Dalarna University. ISBN 978-91-88679-03-1.

Over half of the world's population lives in urban areas, a trend which is expected to only grow as we move further into the future. This increasing urbanisation presents challenges for the management of urban infrastructure systems. As an essential infrastructure of any city, the energy system presents itself as one of the biggest challenges. As cities expand in population and economically, global energy consumption increases and, as a result, so do greenhouse gas (GHG) emissions. Renewable energy and energy efficiency have been shown to be key strategies for attaining the 2030 Agenda's sustainable development goal on energy (SDG 7). As the largest contributor to climate change, the building sector is responsible for more than half of the global final energy consumption and GHG emissions. As people spend most of their time indoors, the demand for energy is made worse as a result of maintaining the comfort level of the indoor environment. However, the emergence of the smart city and the internet of things (IoT) offers the opportunity for the smart management of buildings. Focusing on the latter strategy towards attaining SDG 7, intelligent building control offers significant potential for saving energy while respecting occupant comfort (OC). Most intelligent control strategies, however, rely on complex mathematical models which require a great deal of expertise to construct, thereby costing time and money. Furthermore, if these models are inaccurate, then energy is wasted and the comfort of occupants is decreased. Moreover, any change in the physical environment, such as a retrofit, results in obsolete models which must be re-identified to match the new state of the environment. This model-based approach seems unsustainable, and so a new model-free alternative is proposed. One such alternative is the reinforcement learning (RL) method. This method provides an elegant solution for managing the tradeoff between energy efficiency and OC within the smart city and, more importantly, for achieving SDG 7. To address the feasibility of RL as a sustainable control strategy for efficient occupant-centred building operation, a comprehensive review of RL for controlling OC in buildings as well as a case study implementing RL for improving OC via a window system are presented. The outcomes of each suggest that RL is a feasible solution; however, more work is required to address current open issues, such as the cooperative multi-agent RL (MARL) needed for multi-occupant/multi-zonal buildings.

Keywords: Markov decision processes, Reinforcement learning, Control, Building, Indoor comfort, Occupant

Ross May, School of Technology and Business Studies, Microdata Analysis

© Ross May 2019

ISBN 978-91-88679-03-1
urn:nbn:se:du-30613 (http://urn.kb.se/resolve?urn=urn:nbn:se:du-30613)


List of papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Mengjie Han, Ross May, Xingxing Zhang, Xinru Wang, Song Pan, Da Yan, Yuan Jin, and Liguo Xu. "A review of reinforcement learning methodologies for controlling occupant comfort in buildings", Sustainable Cities and Society, 51:101748, November 2019.

II Mengjie Han, Ross May, Xingxing Zhang, Xinru Wang, Song Pan, Da Yan, and Yuan Jin. "A novel reinforcement learning method for improving occupant comfort via window opening and closing", (Under review), June 2019.

Reprints were made with permission from the publishers.


Contents

1 Introduction
    1.0.1 Cities - instigators of change
    1.0.2 Intelligent buildings and their control

2 A hypothesis
    2.1 A strategy for corroborating the hypothesis - pre-investigation
        2.1.1 Data requirements

3 Elements of MDPs and RL
    3.1 Elements
    3.2 MDPs
    3.3 Policies and Value Functions
    3.4 Bellman Optimality Equation
    3.5 Solution methods

4 Post-investigation
    4.1 Task 1 (Paper I): A comprehensive review
        4.1.1 Background
        4.1.2 Outcome of the review
    4.2 Task 2 (Paper II): A Case study
        4.2.1 Background
        4.2.2 Description of the case study
        4.2.3 Outcome of the case study

5 Conclusion
    5.1 Contributions
        5.1.1 BCSs
        5.1.2 MDA

References
Papers


1. Introduction

According to the United Nations, 55% of the world's population lives in urban areas, such as towns and cities. This figure is expected to grow to 68% by the year 2050 [6]. Given this trend in increasing urbanisation, challenges arise in terms of the management of the various urban infrastructure systems that make up the urban landscape [6, 19]. Most notably, challenges relating to the management of the energy system stand out among the rest.

As we all know, energy is essential to the proper functioning of any urban area, and in particular, cities. Without energy, other urban infrastructures, for example, buildings and transportation, cease to work properly, if at all. Moreover, as cities expand both in population and economically, global energy consumption increases and in turn, greenhouse gas (GHG) emissions [1, 4]. With the 2030 Agenda for Sustainable Development in mind, achieving sustainable energy, therefore, poses the greatest challenge to the successful management of this infrastructure, and more importantly, to realising the 2030 agenda.

As laid out by the United Nations General Assembly, the 2030 agenda consists of 17 Sustainable Development Goals (SDGs) and 169 targets [5]. Their guiding policy is to end poverty and create socioeconomic success, while protecting the planet. It is the goal on energy, SDG 7, with its close connections to the successful achievement of other SDGs, that plays a particularly important role [22]. The main targets of this goal include:
• ensuring universal access to affordable, reliable and modern energy services,
• increasing substantially the share of renewable energy in the global energy mix, and
• doubling the global rate of improvement in energy efficiency.

The latter two targets are of particular interest, due to the distinctive roles they play in enabling the energy transition - from fossil-based to zero-carbon sources - and hence in achieving sustainable energy, which, as already alluded to, is a central component of the success of the agenda [22, 2]. In fact, Gielen et al. have shown that both renewable energy and energy efficiency are essential to this transformation [22]. Further, an analysis by the International Renewable Energy Agency (IRENA) has highlighted that 90% of the required reductions in anthropogenic CO2 emissions - the most common of the GHG emissions [4] - can be achieved through the successful implementation of the aforementioned measures [3].

1.0.1 Cities - instigators of change

To achieve the radical changes laid out by the energy transformation, one requires leaders to lead the way. For the energy transformation, cities, which may be viewed as sociotechnical systems, are our forerunners.

Cities are major consumers of natural resources, and account for 60-80% of greenhouse gas (GHG) emissions [35]. They are also hotbeds of economic wealth, and of technological and social innovation, and thus they have been identified as key for sustainable development [30]. Among the different infrastructures that make up a city, the built environment has the largest impact on energy consumption and pollution. It alone accounts for 62% of final energy use and 55% of GHG emissions [8]. Within this infrastructure, the building sector is the largest contributor to climate change. It is responsible for more than half of the global final energy consumption and GHG emissions [41]. Furthermore, as people spend between 80-90% of their time indoors, the demand for energy is further exacerbated as a consequence of maintaining the comfort level of the indoor environment [31, 35]. Consequently, the efficient operation and maintenance of buildings becomes crucial to achieving energy efficiency - one of the keys to a sustainable future.

The smart city - our saving grace

Fortunately, most cities around the world today have embraced information and communication technologies (ICTs), giving rise to what is known colloquially as the "smart" city [18, 26]. With the ever growing pace of technological advancements making up this digitalisation - from telecommunications right through to the individual data generation devices - we find ourselves with an abundance of data at our disposal. This has many implications, but the focal one for this thesis is the smart management - smart operation and maintenance - of infrastructure, and in particular, buildings (known in this context as "intelligent" buildings) [18, 13].

1.0.2 Intelligent buildings and their control

An intelligent building, in general, is one composed of many different sensors measuring, for example, temperature, CO2, occupancy levels, et cetera. These monitoring devices are combined through a building management system (BMS), along with electronic actuators, with the sole purpose of aiding the intelligent control of the indoor environment via various protocols and communication interfaces [13].

The essential function of a BMS is the building control system (BCS). This is usually designed to reduce energy consumption while maintaining indoor comfort at a certain level, in response to dynamic climate and operational conditions. With an advanced control method, the BCS is able not only to take advantage of real-time data to produce the desired comfort level, but also to minimise the operational and maintenance cost, and in turn improve the building's energy performance. In fact, the reduction of energy consumption through building control systems is known to vary between 5% and 20% [16]. Intelligent building control, therefore, offers significant potential for saving energy.

In recent years the activity around understanding the interplay between the building and its occupants has come increasingly into the spotlight [45, 43]. The presence/absence of occupants, their behaviour, and their interaction with the various building systems affect both the comfort level and the building's energy performance in ways current building standards and codes inadequately address [32]. These facts call for advanced control methods which use knowledge about the occupants (presence/absence patterns, behaviour, etc.) and which enable occupant feedback as a measure of the occupants' satisfaction with the environment.

A question of compromise

Given the above insights, we find ourselves facing a tradeoff between occupant comfort (OC) and energy efficiency. If we approach this from a physics point of view, disregarding the occupants, then building designers are able to create highly energy efficient buildings. An extreme example of this tack was seen during the oil embargoes of the 1970s, when designers were led to the creation of airtight buildings, resulting in reduced health of their occupants and the first cases of sick building syndrome (SBS) [15]. However, leaving the building systems at the discretion of the inhabitants would clearly result in energy wastage and not necessarily a more comfortable environment [23]. For example, leaving a window open in hot weather when the outdoor air quality is poor would result in energy wastage because an air conditioning unit, for example, would have to work harder and for longer in order to cool the room down. It would further worsen the indoor air quality (IAQ) due to the natural exchange of the outside air with the air in the indoor environment.

We therefore require an alternative solution that sensibly respects all of the objectives in a coherent way. And one such alternative, as already alluded to, is "intelligent" control. Pushing this idea even further, we would also want an intelligent control that can adapt autonomously to its environment and to the occupants inhabiting this space. Thus, we would also like to include occupant feedback in the control loop. This then begs the question: which advanced control method would enable us to accomplish such requirements?

An answer to this question can be found in Chapter 2, but before proceeding to this, a word or two is in order with regards to the notion of agent(s).

Agent(s)

Up to now we haven't had the need to talk about this concept, but as we progress further into this thesis, the concept of "agent(s)" becomes unavoidable.

Succinctly, an agent can be thought of as an object (animate or inanimate) that does something in an environment [33]. In the context of this thesis, then, an agent could be an occupant acting in an indoor environment, but equally it could also be a thermostat acting in the same indoor environment. So we have a natural agent and an artificial agent acting in the same environment. While the natural agent (the occupant in this case) in this environment is central to this dissertation, when discussing an agent or multiple agents acting in an environment, it should be understood as referring to the artificial agents and not the natural ones, i.e. occupants. In a building control problem, the occupants can, through feedback, offer crucial information to the agent(s), thereby aiding the agent(s) in creating a better indoor environment, but in terms of experience sharing, information exchange, skill learning and open dialogues among the agents, we mean the artificial agents.

Having clarified this point, the time is now ripe to move on to the next chapter of our discussion and to find an answer to the above question.


2. A hypothesis

To follow up the question posed at the end of the previous chapter, I believe, first of all, a summary is in order regarding the requirements one would desire from an advanced control method - requirements which led us to the question being posed in the first place. That is, we desire an advanced control method to be able to

1. sensibly balance the tradeoff between occupant comfort and energy consumption
2. adapt autonomously to its environment and its occupants
3. enable occupant feedback to be fed into its control logic

Having identified these as the main properties making up an intelligent control, it is the last two properties which captured my attention the most. If we consider the state-of-the-art regarding existing advanced control techniques, it becomes apparent that there are two approaches one could take. Either we use complex mathematical models for the environment and its dynamics, or we take a model-free approach. While the first approach has its place, and in a lot of instances is the correct approach to take, there comes a point at which the system one is trying to model becomes far too complicated (requiring models at many different levels). And if any one of these models is inaccurate, then the system as a whole will be an inaccurate representation of the problem at hand, consequently leading to fallacious results. Furthermore, if the building in question is retrofitted, then the models become obsolete [42].

All of the aforementioned issues with models require expert knowledge to rectify, which costs not only money and time, but also discomfort to the occupants as well as wastage of energy - as a result of inefficient building operation. Moreover, addressing the final two requirements listed above seems to be either impossible or extremely difficult under a model-based approach.

A model-free control technique, however, can be an alternative solution to such challenges when it is applied together with real-time control strategies. One such model-free control method is Reinforcement Learning (RL). This leads me to make the following hypothesis:

Hypothesis 1. RL is a feasible and sustainable control strategy for efficient occupant-centred building operation in smart cities.


The idea of reinforcement learning originated from the term "optimal control", which emerged in the late 1950s. Here a problem was formulated by designing a controller to minimise (or maximise) a measure of the behaviour of a system evolving with time [39]. In 1957, Richard Bellman [9] came up with the concept of Markov Decision Processes (MDPs), or finite MDPs, a fundamental theory of RL, to formulate optimal control problems. This framework enables the essential features of an agent interacting, temporally, with an environment to achieve a goal to be abstracted out and used to derive methods for solving the optimal control problem.

Given this MDP framework, the agent of RL learns how to map situations/states to actions so as to maximise a numerical delayed reward signal. It doesn't need to have a "teacher" telling it what action to take but, rather, makes decisions by implementing a trial-and-error search and recognising the delayed reward from the environment that the agent interacts with [39]. This trial-and-error search process leads to a sequence of (state, action, reward) triples, and through this experience the agent is able to learn the action to take in each situation that leads to the greatest total reward.

RL, in a sense, is the core of machine learning techniques. In the context of artificial intelligence, RL allows the agent to automatically determine behaviours - and more importantly, emergent behaviours - which cannot be achieved by other types of machine learning such as supervised learning and unsupervised learning.

Having briefly described what RL is, one can immediately, at least in my opinion, start to understand the basis for the formulation of the above hypothesis. It is sustainable in the sense that, as an agent in an unknown territory and given objectives to work towards (i.e. a goal), it can, through trial and error learning over time, determine appropriate actions which satisfy its objectives and can adapt accordingly to any changes within its surroundings (for instance, retrofits) without the need of expert human intervention; furthermore, occupant feedback can be included in its control logic via sensory input, allowing it to also adapt to the occupant(s), thus turning it into an occupant-centred control. It is efficient because it enables the balance of OC against energy consumption to be achieved in the best possible way via its goal-oriented nature. All of this is made possible within the smart city, with its abundant collection of data, collected in real-time.

It seems, theoretically at least, that the qualities of being sustainable, efficient (w.r.t. OC and energy use), and occupant-centric fall out of RL quite naturally. But is it practical? I believe it is, and that this will become ever more apparent with the natural evolution of the advancement of ICTs and the Internet of Things (IoT). That being said, a belief is not evidence in support of a proposed claim, and, as with any scientific discipline, a strategy needs to be constructed for corroborating the statement.

2.1 A strategy for corroborating the hypothesis - pre-investigation

To find evidence in support of my claim, a first step, as in any other academic discipline, should include a review to identify best practice from the literature. However, given the boldness of the hypothesis, a simple literature review wouldn't be enough. To really address the statement being made, a comprehensive review was deemed the best approach.

In the literature there already seem to be review works exhibiting evidence in support of RL as a good method for controlling energy consumption. One such extensive review in this area is the work by Vázquez-Canteli and Nagy [41]. However, there have been limited works reviewing RL for controlling OC, and in particular, there have been none analysing the performance of RL in this regard from a methodological point of view or identifying the future tasks for the field. Thus, the first step in trying to verify the claim would include a comprehensive review of RL in this area.

A natural next step to further support the claim would be to carry out a case study to further strengthen the basis for the feasibility of RL as a control method for OC. Thus, to summarise, the following tasks were chosen for forming a strategy to address the hypothesis:
• Task 1: Carry out a comprehensive review of RL methodologies for controlling OC in buildings.
• Task 2: Carry out a case study of the feasibility of RL for controlling OC.

2.1.1 Data requirements

For Task 1, given its nature, the setting and data requirements for carrying out the task were already ideal, and so I refer the reader to Section 4.1 for specifics. However, for Task 2 the ideal situation is probably not attainable for a number of reasons, some of which include time, money, and resources. Aware of this fact, one should like to have, as a benchmark, an idea about the ideal setting and the features one would like to observe and collect data on, which would help as much as possible towards verifying the hypothesis. For the task at hand, Task 2, I list below some of the things we would require from such a case study. These are:


1. An experimental room in a building containing a system that an occupant interacts with to affect their comfort in some way
2. Depending on this system, physical sensors collecting relevant environment variables, both outside and inside, that affect OC
3. A physical sensor to record the presence/non-presence of the occupant
4. A physical sensor to record the interaction of the occupant with the system
5. A device to record the satisfaction of the occupant with his/her surroundings
6. An actuator for controlling the system
7. A computer (such as a BMS) integrated with the sensors and actuator
8. An RL control strategy as an application in the computer for passing signals to the actuator based on the sensor readings and feedback from the occupant
9. A monitoring system for monitoring the performance of the agent over time

Given the nature of the problem and its many-faceted details, the design of such an experiment would likely be a collaborative work involving many different skill sets. Nevertheless, the list above, I believe, covers most of the things one would want from such an experiment.

Before moving on to the outcomes of the Tasks listed above, I feel now is a good time to formalise RL by talking about its elements and its underlying framework, MDPs. Thus, the next chapter gives an introduction to the formal theory of RL as well as some solution methods used in practice. Following this chapter, I will present the outcomes of the tasks. The last chapter will conclude the thesis, giving a brief summary about what we've seen, and some of the contributions of this work both in terms of BCS theory and the field of microdata analysis (MDA).


3. Elements of MDPs and RL

3.1 Elements

In a dynamic sequential decision-making process, the state, St ∈ S, refers to a specific condition of the environment at discrete time steps, t = 0, 1, . . . . By realising and responding to the environment, the agent chooses a deterministic or stochastic action, At ∈ A, that tries to maximise future returns, and receives an instant reward, Rt+1 ∈ R, as the agent transfers to the new state, St+1. The reward is usually represented by a quantitative measurement. In the diagram in Figure 3.1 we see how a sequence of states, actions, and rewards is generated to form an MDP.

3.2 MDPs

The Markov property tells us that the future is independent of the past and depends only on the present. In Figure 3.1, St and Rt are the outcomes after taking an action, and are considered as random variables. The joint probability distribution for St and Rt is thus defined as,

p(s′, r|s, a) = P[St = s′, Rt = r|St−1 = s, At−1 = a], (3.1)

where s, s′ ∈ S, r ∈ R, and a ∈ A.

Figure 3.1. The interaction between agent and environment in an MDP: in state St the agent selects an action At, and the environment responds with a reward Rt+1 and a new state St+1.

It can be seen from equation 3.1 that the distribution of state and reward at time, t, depends only on the state and action one step before. Equation 3.1 implies the basic rule of how the MDP works, and one can easily determine the marginal transition probabilities, p(s′|s, a), from the following sum,

p(s′|s, a) = P[St = s′ | St−1 = s, At−1 = a] = ∑_{r∈R} p(s′, r|s, a). (3.2)

Equation 3.3 below gives the expected reward, r(s, a), by using the marginal distribution of Rt,

r(s, a) = E[Rt | St−1 = s, At−1 = a] = ∑_{r∈R} r ∑_{s′∈S} p(s′, r|s, a). (3.3)

Equations 3.2 and 3.3 are used for solving the optimal value functions presented in the next section.
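To make equations 3.2 and 3.3 concrete, the short Python sketch below builds a toy two-state, two-action MDP with an assumed joint distribution p(s′, r|s, a) and recovers the marginal transition probabilities and expected rewards by summation. The states, actions, rewards and probabilities are purely illustrative assumptions and are not taken from either paper.

```python
# Joint distribution p(s', r | s, a) for a toy MDP with states {0, 1},
# actions {"open", "close"} and rewards {-1.0, 0.0}. All values are illustrative.
p_joint = {
    (0, "open"):  {(1, 0.0): 0.8, (0, -1.0): 0.2},
    (0, "close"): {(0, 0.0): 0.9, (1, -1.0): 0.1},
    (1, "open"):  {(1, 0.0): 0.7, (0, -1.0): 0.3},
    (1, "close"): {(0, 0.0): 0.6, (1, -1.0): 0.4},
}

def transition_prob(s_next, s, a):
    """Equation 3.2: p(s'|s,a) is the sum over rewards of p(s',r|s,a)."""
    return sum(prob for (sp, _r), prob in p_joint[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """Equation 3.3: r(s,a) is the sum over r and s' of r * p(s',r|s,a)."""
    return sum(r * prob for (_sp, r), prob in p_joint[(s, a)].items())

print(transition_prob(1, 0, "open"))   # 0.8
print(expected_reward(0, "open"))      # -0.2
```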

3.3 Policies and Value Functions

A policy, π, is a distribution over actions given states. It fully defines the behaviour of an agent by telling the agent how to act when it is in different states. The policy itself is either deterministic or stochastic [39]. In a stochastic setting, the probability of taking an action, a, in state, s, is given by

π(a|s) = P[At = a|St = s]. (3.4)

The policy can be considered as a function of actions, and the overall goal of RL is to find the optimal policy given a state, s. An optimal policy tries to maximise the expected future return,

Gt = Rt+1 + γRt+2 + γ²Rt+3 + · · · = ∑_{k=0}^{∞} γ^k Rt+k+1,

from time-step, t, where γ ∈ [0, 1] is the discount parameter. Given a policy, π, the state-value function, vπ(s), and the action-value function, qπ(s, a), are two useful measures in RL that can be estimated from the data. In an MDP, vπ(s) is defined as the expectation of the return starting from state, s,

vπ(s) = Eπ[Gt|St = s], (3.5)

for all s ∈ S.

In practical applications, vπ(s) is more applicable for model-based problems, that is to say, when we know the environment's dynamics, p(s′, r|s, a), whereas the action-value function, qπ(s, a), is more useful in the model-free context, when p(s′, r|s, a) is not known. The reason for this will become clear when we consider optimal value functions in Section 3.4.

Figure 3.2. Backup diagrams for the optimal value functions: (a) v∗(s) backed up over the actions a available in state s, (b) q∗(s, a) backed up over the reward r and the successor states s′, and (c) q∗(s, a) backed up over the successor state-action pairs (s′, a′).

When the full environment or the model is unknown, episodic simulations are often used to estimate

qπ(s, a) = Eπ[Gt|St = s, At = a], (3.6)

for all s ∈ S and a ∈ A.

The task of finding the optimal policy, π∗, is achieved by evaluating either the optimal state-value function

v∗(s) = max_π vπ(s), (3.7)

or the optimal action-value function

q∗(s, a) = max_π qπ(s, a). (3.8)

3.4 Bellman Optimality Equation

One way of optimising equations 3.7 and 3.8 is to make use of the recursive relationships between two states or actions in a sequential order. Since the procedures are similar, we only present the relationship starting from the action-values, i.e. the Bellman optimality equation for q∗(s, a) [10].

The backup diagrams in Figure 3.2 show relationships between the value function and a state or state-action pairs. Figure 3.2 (a) depicts the optimal state-value function when taking an action. The agent looks at each of the possible actions it might take and selects the action with maximum action-value, which tells the agent how good the state is. That is,

v∗(s) = max_a q∗(s, a). (3.9)

Similarly, figure 3.2 (b) evaluates the dynamic and stochastic environment when an action is taken. Each of the states it ends up in has an optimal value. Thus, the optimal action-value counts the immediate expected reward, r(s, a) - from equation 3.3 - and a discounted optimal state-value,

q∗(s, a) = r(s, a) + γ ∑_{s′∈S} p(s′|s, a) v∗(s′). (3.10)

Thus, as shown in figure 3.2 (c), the Bellman optimality equation for q∗(s, a) is obtained by substituting equation 3.9 into equation 3.10 to give,

q∗(s, a) = r(s, a) + γ ∑_{s′∈S} p(s′|s, a) max_{a′} q∗(s′, a′). (3.11)

In a similar way we can derive the Bellman optimality equation for v∗(s). Both of them are the fundamental expressions for MDPs. The recursive relationship assists in splitting the current value function into the immediate reward and the value of the next action. This relationship is exploited in techniques used for solving the MDP. Once we know q∗, we simply choose the action a that maximises q∗(s, a). Unlike v∗, there is no need to do one-step-ahead searches, for which we would require the environment's dynamics, since q∗(s, a) has implicitly already stored the results of these searches.
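As an illustration of how the recursive structure of equation 3.11 is exploited, the sketch below performs Q-value iteration on a small assumed MDP: it repeatedly replaces q(s, a) with r(s, a) + γ ∑_{s′} p(s′|s, a) max_{a′} q(s′, a′) until the values stop changing, and then reads off the greedy policy. The dynamics and rewards are toy assumptions for demonstration only.

```python
# Toy, assumed MDP: marginal transitions P[(s, a)][s'] and expected rewards R[(s, a)].
states = [0, 1]
actions = ["open", "close"]
P = {
    (0, "open"):  {0: 0.2, 1: 0.8},
    (0, "close"): {0: 0.9, 1: 0.1},
    (1, "open"):  {0: 0.3, 1: 0.7},
    (1, "close"): {0: 0.6, 1: 0.4},
}
R = {(0, "open"): -0.2, (0, "close"): -0.1,
     (1, "open"): -0.3, (1, "close"): -0.4}
gamma = 0.9

q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(1000):
    delta = 0.0
    for s in states:
        for a in actions:
            # Bellman optimality backup, equation 3.11
            backup = R[(s, a)] + gamma * sum(
                P[(s, a)][sp] * max(q[(sp, ap)] for ap in actions) for sp in states
            )
            delta = max(delta, abs(backup - q[(s, a)]))
            q[(s, a)] = backup
    if delta < 1e-8:
        break

# Once q* is known, the optimal policy simply picks the action maximising q*(s, a)
policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
print(policy)
```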

3.5 Solution methods

In practice, we solve Equation 3.11 approximately using iterative solution methods [39, 36]. In systems with small and discrete state or state-action sets, it is preferable to formulate the estimations using look-up tables with one entry for each state or state-action pair. This tabular approach is straightforward to implement with convergence guarantees [39]. These tabular methods start with a random value function and update to an improved value function in an iterative process until convergence to Q(S, A) ≈ q∗. The optimal policy is made by selecting the action that optimises the value function at a certain state. The two most common tabular methods are Q-learning and SARSA, which are known as temporal-difference (TD) learning methods. These methods use the next time step, t + 1, to make an update to the current estimate, Q(St, At), using the observed reward along the way and Q(St+1, At+1).


The former is an off-policy TD control method and the latter is an on-policy TD control method. With off-policy methods the policy being learned is different from the one being followed, whereas the latter approach follows the policy being learned. One of the advantages of the former approach is that the optimal policy can be learned whilst following a different control strategy, for example, a model predictive control (MPC) or rule-based control (RBC) strategy [39, 42]. The beauty of these methods is that they are simply expressed, can be applied online (i.e. they need only wait one time step), and require minimal computational resources. These are the methods employed in Paper II.
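To make the two update rules concrete, the sketch below writes them out in Python for a generic look-up table; the ε-greedy behaviour policy and the hyperparameter values are illustrative assumptions, and only the update equations themselves follow the standard Q-learning and SARSA formulations in [39].

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1   # illustrative hyperparameters
Q = defaultdict(float)                  # look-up table Q[(state, action)], initialised to zero

def epsilon_greedy(state, actions):
    """Behaviour policy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next, actions):
    """Off-policy TD update: bootstrap from the greedy action in the next state."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy TD update: bootstrap from the action actually taken in the next state."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```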

For large MDP problems, however, we do not always want to separately track the trajectory of each entry of the look-up table. The parameterised value function approximation q̂(s, a; w) ≈ qπ(s, a) gives a mapping from the state-action pair to a function value, for which there are many mapping functions available, the most powerful being deep neural networks (DNN). The combination of DNNs with RL forms what is known as deep RL and is behind many of the major breakthroughs, such as that witnessed with the game of Go [37].
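A minimal sketch of the parameterised alternative is given below, using a simple linear approximator q̂(s, a; w) = wᵀx(s, a) rather than a deep network; the feature mapping is a crude placeholder of my own, and the update shown is the usual semi-gradient Q-learning step rather than anything specific to the papers.

```python
import numpy as np

alpha, gamma = 0.01, 0.9
n_features = 8
w = np.zeros(n_features)                  # weights of q_hat(s, a; w) = w . x(s, a)

def features(state, action):
    """Placeholder feature mapping x(s, a); in practice this is problem-specific."""
    x = np.zeros(n_features)
    x[hash((state, action)) % n_features] = 1.0   # crude one-hot encoding for illustration
    return x

def q_hat(state, action):
    return w @ features(state, action)

def semi_gradient_q_update(s, a, r, s_next, actions):
    """w <- w + alpha * (target - q_hat(s, a)) * x(s, a); the gradient of w.x w.r.t. w is x."""
    global w
    target = r + gamma * max(q_hat(s_next, a2) for a2 in actions)
    w = w + alpha * (target - q_hat(s, a)) * features(s, a)
```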

I have only touched the surface here with regards to RL solution methods; for a much richer explanation I refer the interested reader to the main text in this area by Sutton and Barto [39].

The purpose of this section was really to give one a feel for how the RL problem is approached in practice and to offer the reader a brief insight into the algorithms implemented in Paper II, presented in the next chapter.


4. Post-investigation

If we recall from Chapter 2, two tasks were proposed for forming a strategy to corroborate Hypothesis 1. In response to this strategy, two papers, one for each task, have been produced, the outcomes of which are summarised below in order of task. I begin by giving a brief background to each work; following this, I then draw upon the main findings from each.

4.1 Task 1 (Paper I): A comprehensive review

4.1.1 Background

We already know from earlier in this thesis that most people in the world spend between 80-90% of the day indoors. As a result of this, the comfort of the occupants of these indoor environments becomes very important. Indeed, maintaining the comfort factors that make up the comfort level of an occupant not only improves their feeling of comfort, it also improves their health, morale, working efficiency and productivity [28].

There are three key components that jointly influence the comfort level of a building occupant. These are thermal comfort, visual comfort, and indoor air quality (IAQ) [11, 31, 21]. While building design is a key player in achieving a comfortable indoor environment, the interrelation between the three aforementioned factors makes it extremely hard to find sound designs satisfying conflicting goals such as natural ventilation vs heating/cooling loss [44].

The BMS, however, via the BCS and its application of an advanced control method, is able to balance this multi-objective task to reach a comfortable indoor environment when responding, in real-time, to dynamic climate and operational conditions. Furthermore, through the application of this advanced control technique, the operational and maintenance costs are reduced and in turn the energy efficiency of the building is increased [29].

As alluded to earlier, a building's performance, both in terms of OC and energy efficiency, depends critically on the presence/non-presence of occupants and their interaction with the building via, for example, a window system, light switch, etc. [32]. Thus, achieving a truly adaptive indoor building environment, that is to say, one which improves/maintains the comfort and adapts to its occupants in real-time automatically, becomes an incredibly complex task.

Current advanced control methods used for addressing this problem are based on mathematical models: building models, building system models, weather forecast models, etc. But due to the complexity of the problem at hand, these models tend not to be accurate enough in their predictions, thus leading to inefficient building operation. A new control technique, one which can adapt to its environment - occupants included - without a priori knowledge of specific models, is therefore needed. RL, I believe, offers a solution to this problem.

Since there exists no comprehensive review of RL for controlling OC (unlike energy), the aims of the work were therefore to

1. methodologically review the empirical works on how RL methods have been implemented for comfort control in buildings, and
2. provide instructive directions for future research.

4.1.2 Outcome of the review

Findings

In general, the model-free RL technique exhibited promising results, achieving at least as good, and in a lot of cases better, results than other methods both in terms of OC and energy efficiency. There was one instance in which an MPC - under a perfect model of environmental dynamics - outperformed RL. However, under an imperfect model, MPC was inferior to model-free RL. The majority of the works concentrated on thermal comfort, with IAQ and lighting appearing much less frequently. Also, only 5 of the reviewed articles incorporated occupancy patterns and/or human feedback into the control loop - which are crucial for occupant-centric building operation.

Further analysis looked at the algorithm class, exploration vs exploitation strategy, agent perspectives, and actual physical studies - as compared with simulation studies. The value-based learning algorithms dominate the literature, especially Q-learning - due to its ease of implementation. Most of the studies employed the naïve ε-greedy approach to addressing the exploration vs exploitation dilemma. Only 4 out of the 33 core articles took a more sophisticated approach, with good results. A systematic study of these approaches, however, was found to be lacking. For multi-agent systems (MAS), the learning strategy is still limited to applying single-agent RL algorithms to multi-agent settings. Finally, six of the reviewed articles exhibited a case study in an actual real building. In general, the outcomes were good.


There need to be, however, more studies investigating the adaptability of RL to multi-occupant rooms.

Open issues

The findings of the review seem to suggest the RL method as a suitable method for controlling OC. Furthermore, as well as improving the comfort level of occupants as compared to other control techniques, it has also been shown, in those articles including energy as an additional objective, to improve energy efficiency. Still, the field of RL for OC control in buildings has seen much less interest as compared with building energy control, and there remain many challenges and hence opportunities - for example occupant-centric control - to explore in this area. Some of the key takeaways are the following:
• There is a lack of documentation regarding the integration of computation infrastructures with BMSs. To achieve real-time control, this integration is crucial and must be explored.
• Thermal comfort is seen as the most important comfort factor; however, the automation of smart buildings is an integrated system, and thus, from a comfort perspective, there needs to be more work including the other key comfort factors - IAQ and lighting - in the objective mix of BCSs.
• The feasibility of multi-agent RL (MARL) for controlling the indoor environment has been limited. There have been no studies examining the performance of cooperative MARL for multi-occupant/multi-zonal settings. From a practical point of view, solving this ought to enable more efficient building operation.
• There need to be more studies addressing the inclusion of occupancy patterns and/or occupant feedback in the control logic. Inclusion of these patterns and feedback is a crucial step to achieving truly occupant-centric building control in smart buildings.
• Finally, there needs to be more work in line with [42] working towards creating a standard framework, similar to OpenAI Gym [14], coupling building simulation with advanced RL and other control strategies which can be tested and compared in a reproducible manner (a minimal sketch of such an interface is given after this list).
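To make the last point more tangible, the following sketch outlines what such a standardised building-control environment could look like using the classic OpenAI Gym interface [14]. The observation variables, action set, reward and dynamics here are placeholders of my own choosing, not an implementation from either paper.

```python
import gym
import numpy as np
from gym import spaces

class BuildingComfortEnv(gym.Env):
    """Hypothetical occupant-comfort environment exposing the standard Gym API."""

    def __init__(self):
        # Observation: indoor temperature, outdoor temperature, AQI, window state (0/1)
        self.observation_space = spaces.Box(
            low=np.array([-10.0, -30.0, 0.0, 0.0], dtype=np.float32),
            high=np.array([40.0, 45.0, 500.0, 1.0], dtype=np.float32),
        )
        self.action_space = spaces.Discrete(2)   # 0 = keep window state, 1 = switch it
        self.state = None

    def reset(self):
        self.state = np.array([22.0, 10.0, 80.0, 0.0], dtype=np.float32)
        return self.state

    def step(self, action):
        # A building simulator or surrogate model would update the state here.
        next_state = self.state.copy()
        reward = 0.0          # e.g. negative thermal and air-quality discomfort
        done = False
        return next_state, reward, done, {}
```

Coupling building simulators to such an interface would allow different control strategies to be trained and compared with the same code, which is exactly the kind of reproducibility the point above calls for.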

4.2 Task 2 (Paper II): A Case study

4.2.1 Background

The outcomes from the first article make clear a strong basis for the feasibility and potential of RL as a control technique for OC. With this in mind, the work of this section empirically investigates the application of RL to window opening and closing - from the aspect of occupant behaviour (OB) - in a naturally ventilated building, with the goal of improving OC. This problem might at first seem rather mundane given the grandiose beginnings of this thesis; however, the problem of window opening/closing in a naturally ventilated building is an important one, as is highlighted below.

As we’ve seen, enabling a comfortable indoor environment both froma passive building design perspective and adaptive building operationperspective are crucial for achieving a sustainable society. For example,only 11% and 26% of the combined buildings surveyed in US, Canada,and Finland, showed more than 80% of occupants satisfied with ther-mal comfort and IAQ, respectively [24]. Another survey done in Den-mark of inhabitants of residential housing revealed 54% of them hadat least one problem related to indoor comfort and a majority of whichdid not try to search for information on how to solve the problem [20].

In a building with natural ventilation, indoor comfort depends largely on the control of the window systems. Distinct from heating, ventilation and air conditioning units, the control of windows changes the indoor environment through naturally exchanging the air with the outdoor environment, and therefore does not demand additional energy. Through this natural exchange of air with the outdoors, the comfort level of occupants can be improved by providing fresh air, reducing the level of CO2, and cooling the indoors. However, arbitrary and customary window control by the occupants could also worsen the indoor environment. For example, keeping a window open when the outdoor air quality becomes poor may have a negative effect on indoor comfort. This inability to sense the slow deterioration of the environment makes intelligent automation of the window system incredibly attractive for improving OC.

The driving forces for changing the window state - i.e. open to closed or closed to open - are made up of a number of factors, but it has been observed that physical environmental factors are the most direct drivers [17]. In fact, occupants' window opening/closing behaviour can be explained by thermal comfort and IAQ alone [7, 25, 27, 38, 40]. Taking these factors as the objectives for optimising OC via window opening/closing, the state-of-the-art control strategies applied to this problem so far have all relied on complex mathematical modelling to some degree, and in most cases used a simulation engine to generate the building environment. Furthermore, none have looked at the control problem from the aspect of occupants' behaviour. The objectives of this work were therefore to

1. propose a model-free reinforcement learning method for controlling windows using a data-driven approach to mimic the change in indoor temperature at each step of the agent-interaction process
2. empirically verify the control methodology's feasibility by comparing the agent's behaviour with that of the occupant's behaviour
3. set a theoretical basis for including human feedback into the control loop or full intelligent control based on the occupant's historical behaviour

4.2.2 Description of the case study

The case study used secondary data collected from an office building in Beijing during the transition season (March 16 - May 15, 2015). As the outdoor temperature is moderate, natural ventilation is preferred. The experimental room consisted of a single door and a south pointing push-pull window. The same single occupant, following the university working routine, was in the room during the collection period. A number of environmental variables were collected - both inside and outside variables - at a time resolution of 10 minutes. A vector, made up of the window position as well as direct environmental factors that have impacts on the position change of the window - such as indoor temperature and air quality index (AQI) - was used to represent the state of the environment. The action set was made up of two points, namely, switch and inaction. The continuous components of the vector were discretised in order to facilitate tabular methods. Finally, the reward reflected the discomfort and consisted of indoor temperature and AQI.
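As an illustration of how such a state and reward could be encoded, the sketch below discretises the indoor temperature and AQI into bins and combines them with the window position into a tabular state, and defines a penalty-style reward. The bin edges, comfort band and weights are hypothetical choices for demonstration; they are not the exact values used in Paper II.

```python
import numpy as np

# Hypothetical discretisation: bin edges for indoor temperature (deg C) and AQI.
TEMP_BINS = np.array([18.0, 20.0, 22.0, 24.0, 26.0, 28.0])
AQI_BINS = np.array([50.0, 100.0, 150.0, 200.0])

ACTIONS = ("inaction", "switch")   # "switch" toggles the window between open and closed

def encode_state(window_open, indoor_temp, aqi):
    """Map raw observations onto a discrete state usable as a look-up table key."""
    return (int(window_open),
            int(np.digitize(indoor_temp, TEMP_BINS)),
            int(np.digitize(aqi, AQI_BINS)))

def reward(indoor_temp, aqi, comfort_band=(20.0, 26.0), w_temp=1.0, w_aqi=0.01):
    """Negative discomfort: penalise deviation from a comfort band and poor air quality."""
    low, high = comfort_band
    temp_penalty = max(0.0, low - indoor_temp) + max(0.0, indoor_temp - high)
    return -(w_temp * temp_penalty + w_aqi * aqi)

print(encode_state(window_open=False, indoor_temp=27.3, aqi=120.0))   # (0, 5, 2)
print(reward(27.3, 120.0))                                            # approximately -2.5
```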

Tabular Q-learning and SARSA were used as our testing algorithms for adaptive window control, and we used a Recurrent Neural Network with Long Short-Term Memory (RNN-LSTM) to predict the inside temperature in order to mimic the impact on the environment when an action was made (see Figure 4.1).
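The sketch below indicates how the components in Figure 4.1 could fit together in code: at each 10-minute step the agent picks an action ε-greedily, a surrogate model (an RNN-LSTM in Paper II, here an unspecified predict_indoor_temp stub) supplies the resulting indoor temperature, and a tabular Q-learning update is applied. The function names, data layout and training schedule are my own illustrative assumptions rather than the paper's exact implementation.

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = defaultdict(float)

def predict_indoor_temp(history, window_open):
    """Stub for the RNN-LSTM surrogate: given recent observations and the new window
    state, return the predicted indoor temperature at the next time step."""
    raise NotImplementedError

def run_episode(observations, encode_state, reward, actions=("inaction", "switch")):
    """One pass over a day of 10-minute observations, learning with tabular Q-learning."""
    window_open = False
    history = []
    for obs in observations:                       # obs: dict with indoor_temp, aqi, etc.
        history.append(obs)
        s = encode_state(window_open, obs["indoor_temp"], obs["aqi"])
        # epsilon-greedy behaviour policy
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        if a == "switch":
            window_open = not window_open
        # Surrogate environment: the LSTM mimics the effect of the action on temperature
        next_temp = predict_indoor_temp(history, window_open)
        r = reward(next_temp, obs["aqi"])
        s_next = encode_state(window_open, next_temp, obs["aqi"])
        # Q-learning update of the look-up table
        target = r + gamma * max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
```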

4.2.3 Outcome of the case study

Findings

The RNN-LSTM prediction model was trained according to the 70%-30% training/validation split rule and tested on a hold-out set. In general, the model showed good performance, with a 0.2°C root mean squared error (RMSE). This error was deemed too small for the occupants' sensory receptors to detect, thus verifying the RNN-LSTM model as a good model for simulating the temperature change w.r.t. an action in the environment.


Figure 4.1. Depiction of the program flow of RL coupled with the RNN-LSTM: the agent sends an action to the environment, the RNN-LSTM predicts the resulting indoor temperature (Tin), and the environment returns the new state and reward to the agent.


Because of limited computing time, the learning outcomes for the RL agents (one for each RL algorithm) were illustrated for a single day (April 8th) as a prototype. To evaluate the learning performances of the agents we used both cumulative reward and cumulative penalties per epoch. Both Q-learning and SARSA were able to increase their reward and reduce their penalties, with SARSA slightly outperforming Q-learning. The performances of the agents were further evaluated against the actual occupant. In general, the agents provided much better policies than the occupant, measured in terms of both cumulative reward and penalties. In fact, the RL control strategy improved thermal comfort and indoor air quality by more than 90% when compared against the observed occupant behaviour data. A possible reason for this is the occupant's inertial thinking in the morning, failing to sense the gradual deterioration of the environment.

Open issues

While we have seen good results from the current work, we are still at the early stage of understanding the behaviours of window opening and closing. To create a more personal indoor environment, human feedback is crucial. Including occupant feedback in the control logic of the RL agent will not only continuously correct the reward function in the process of learning, but will also increase the actual learning experience, resulting in an occupant-centred policy.

In moving away from a single-occupant setting and into a multi-occupant setting, the individual human effects would need to be individually treated in order to achieve the best balance. Algorithms built for cooperative MARL should, and need to be, explored in this case for their feasibility.

Finally, while discretisation of the state space allowed tabular solution methods to be applied to the problem, in so doing we have sacrificed some realism in the representation of the environment. Hence, solution methods for training an agent with a continuous state space, in particular deep RL, need to be investigated for their feasibility.


5. Conclusion

Wrapping everything up, I'd like to make some concluding remarks about what we have seen so far and what we can expect to see. We began this overview by first considering Agenda 2030 and its SDGs, and in particular SDG 7 - the goal on energy, the most pressing goal w.r.t. climate mitigation and for achieving many of the other SDGs [22, 34]. We then saw that the two core strategies for achieving the so-called energy transition are renewable energy technologies and energy efficiency [22]. Focusing on the latter strategy w.r.t. the biggest contributors to climate change, namely buildings, our attention was drawn to the notion of building control as a key enabler of energy efficiency, and in particular, how the smart city and the IoT, with their abundance of data, offer the opportunity for the smart management of buildings via a BMS and its core function, the BCS.

However, in seeking energy efficiency we must also consider seriously the inhabitants of these buildings, for otherwise we risk creating an environment which is simply uninhabitable, consequently resulting in ill health, lack of morale, losses in work efficiency [28], etc., and thus ultimately leading to an unsustainable building environment. And so, the comfort of occupants plays a critical role in the drive towards energy efficient buildings, and "intelligent" controls that are able to find the right balance between occupant comfort and energy efficiency are required. Given the complexity of such a task, attacking it with purely model-based solutions seems, as has been argued, to be an unsustainable approach. This observation leads one to consider model-free solutions as an alternative approach, and in particular RL - a method which closely resembles the way we humans learn, with ever increasing support for RL as one of the functions of the human brain [12]. My belief is that, through RL, we can achieve sustainable and efficient occupant-centred building operation in the smart city.

The purpose of this thesis has been to offer supporting evidence for this claim, and in particular Hypothesis 1. In the literature there is already, I believe, evidence of this from the energy dimension, and one such extensive review is the work by Vázquez-Canteli and Nagy [41]. From the comfort dimension, however, a comprehensive review addressing the feasibility of RL as a control strategy for occupant comfort was found lacking. Thus, a first step to gathering support for Hypothesis 1 involved addressing this gap. To follow up this review, further supporting evidence was sought in the form of a case study involving the important problem of efficiently controlling window opening and closing in a naturally ventilated building. In short, the findings from both investigations point positively to RL as a feasible solution for sustainable and efficient occupant-centred building operation in the smart city.

We are, however, still at an early stage of this journey and there remains much more work to be done. For example, the integration of computation and communications infrastructures with BMSs is still quite fuzzy, lacking implementation details in many cases; we still do not know about the feasibility of cooperative MARL in building control; there are still many open questions regarding the best way to incorporate occupancy patterns as well as occupant feedback into the control logic; the application of deep RL has been limited, and many avenues remain to be explored here; and the coupling of comfort components such as IAQ and visual comfort within BCSs has received far less attention than thermal comfort, even though the comfort level of an occupant should be viewed holistically and, where possible, all comfort components should be considered. Addressing these open issues will help greatly towards achieving efficient occupant-centric building operation and ultimately towards the creation of "sustainable" buildings.

Of course, the open issues listed above are not exhaustive and there are, I am sure, many others. But in terms of this thesis and the knowledge I have at this point in time, these are the ones that strike me as among the most pressing. It is my belief that solving them will take us a step further towards achieving Agenda 2030.

5.1 Contributions

Below is a summary of the contributions of this thesis. These contributions not only advance the field of BCSs, the main topic of discussion in both articles, but also contribute to the advancement of MDA. Accordingly, I have grouped the contributions by field.

5.1.1 BCSs

Up to now there had been no review comprehensively analysing the implementation of RL for occupant comfort control. Furthermore, the analysis of applications of RL for OC control in multi-agent environments has highlighted an important gap in the literature relating to the feasibility of cooperative MARL. A novel model-free RL method for controlling a window w.r.t. OC in an office building has also been demonstrated: by applying a data-driven approach for simulating the environment, an RL agent was able to learn a policy of window opening behaviour that far exceeded the observed occupant's behaviour. Finally, the works have highlighted the potential of RL as a sustainable forerunner for efficient occupant-centric building operation in the evolving smart city.
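To give a sense of how such an agent can be trained against a learned environment model, the sketch below runs tabular Q-learning for the open/close decision against a toy surrogate. The surrogate, comfort band and hyperparameters are placeholders and do not reproduce the models or settings of the case study.

    # Sketch (Python): tabular Q-learning for window open/close control trained
    # against a data-driven surrogate of the room. `toy_surrogate`, the comfort
    # band and all hyperparameters are placeholders, not the models or settings
    # used in the case study.

    import random
    import numpy as np

    TEMP_BINS = np.arange(15.0, 30.0, 1.0)     # discretised indoor temperature
    ACTIONS = (0, 1)                            # 0 = close window, 1 = open window
    Q = np.zeros((len(TEMP_BINS) + 1, len(ACTIONS)))

    def toy_surrogate(temp, action, outdoor_temp=12.0):
        """Stand-in for the learned model: opening drifts the room towards outdoors."""
        return temp + (0.3 * (outdoor_temp - temp) if action == 1 else 0.1)

    def comfort_reward(temp, low=20.0, high=24.0):
        """Zero inside the comfort band, negative in proportion to the violation."""
        return 0.0 if low <= temp <= high else -min(abs(temp - low), abs(temp - high))

    def train(episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        for _ in range(episodes):
            temp = 21.0
            for _ in range(96):                               # one simulated day
                s = int(np.digitize(temp, TEMP_BINS))
                if random.random() < epsilon:                 # epsilon-greedy exploration
                    a = random.choice(ACTIONS)
                else:
                    a = int(np.argmax(Q[s]))
                next_temp = toy_surrogate(temp, a)
                s_next = int(np.digitize(next_temp, TEMP_BINS))
                target = comfort_reward(next_temp) + gamma * np.max(Q[s_next])
                Q[s, a] += alpha * (target - Q[s, a])         # one-step Q-learning update
                temp = next_temp
        return Q

Running train() yields a table whose greedy policy maps each temperature bin to an open or close decision; in the case study the toy surrogate is replaced by a model learned from the occupant's own data.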

5.1.2 MDA

As a normative science, MDA is concerned with solving important problems through analysis and the application of appropriate methods to data collected about the problem. In this thesis I have brought to attention the important problem of controlling window opening and closing behaviour in naturally ventilated buildings, and have demonstrated RL as a feasible method for solving it. Furthermore, the demonstration of an RNN-LSTM as an accurate model for mimicking the change in temperature due to a window action, effectively acting as a surrogate model that approximates this physical phenomenon, identifies RNN-LSTMs as another tool for microdata analysts when solving temporal problems with many features. In addition, the coupling of RL with an RNN-LSTM has been identified as a powerful technique, one that may be suitable for other similar problems within the domain of MDA.
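As a rough sketch of what such a surrogate can look like, the snippet below trains a small LSTM to predict the next indoor temperature from a short history of measurements that includes the window state. The feature set, sequence length, layer size and the use of Keras are assumptions for illustration, and the random arrays merely stand in for real sensor data.

    # Sketch (Python/Keras): an LSTM surrogate predicting the next indoor
    # temperature from a short measurement history that includes the window state.
    # Feature set, sequence length, layer size and the random stand-in data are
    # illustrative assumptions, not the model or data used in the thesis.

    import numpy as np
    from tensorflow import keras

    N_TIMESTEPS = 12    # e.g. one hour of 5-minute readings (an assumption)
    N_FEATURES = 4      # e.g. indoor temp, outdoor temp, occupancy, window state

    surrogate = keras.Sequential([
        keras.layers.LSTM(32, input_shape=(N_TIMESTEPS, N_FEATURES)),
        keras.layers.Dense(1),                  # indoor temperature at the next step
    ])
    surrogate.compile(optimizer="adam", loss="mse")

    # Placeholder training data with the right shapes; real data would come from
    # the building's sensors and the logged window actions.
    X = np.random.rand(256, N_TIMESTEPS, N_FEATURES).astype("float32")
    y = np.random.rand(256).astype("float32")
    surrogate.fit(X, y, epochs=5, batch_size=32, verbose=0)

    # In the RL loop, the agent appends its candidate window action to the recent
    # history and asks the surrogate what temperature it would observe next.
    next_temp = float(surrogate.predict(X[:1], verbose=0)[0, 0])

Coupled with an RL agent, such a surrogate allows policies to be trained against data rather than by trial and error on the real building, which is what makes the combination attractive for MDA problems with rich temporal structure.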

There are also open issues relevant to the field of MDA. These include cooperative MARL in building control, the development and application of deep RL for building control, and how to incorporate occupancy patterns and feedback into the control logic.


References

[1] Cities are at the frontline of the energy transition. https://www.iea.org/newsroom/news/2016/september/cities-are-at-the-frontline-of-the-energy-transition.html.
[2] Energy Transition. https://www.irena.org/energytransition.
[3] Global Energy Transformation: A Roadmap to 2050 (2019 Edition). page 52.
[4] Perspectives for the Energy Transition: The Role of Energy Efficiency. https://webstore.iea.org/perspectives-for-the-energy-transition-the-role-of-energy-efficiency.
[5] United Nations 2030 Agenda for Sustainable Development. https://www.un.org/ga/search/view_doc.asp?symbol=A/RES/70/1&Lang=E.
[6] 68% of the world population projected to live in urban areas by 2050, says UN. https://www.un.org/development/desa/en/news/population/2018-revision-of-world-urbanization-prospects.html, May 2018.
[7] Rune Andersen, Valentina Fabi, Jorn Toftum, Stefano P. Corgnati, and Bjarne W. Olesen. Window opening behaviour modelled from measurements in Danish dwellings. Building and Environment, 69:101–113, November 2013.
[8] John E. Anderson, Gebhard Wulfhorst, and Werner Lang. Energy analysis of the built environment—A review and outlook. Renewable and Sustainable Energy Reviews, 44:149–158, April 2015.
[9] Richard Bellman. A Markovian Decision Process. Indiana Univ. Math. J., 6(4):679–684, 1957.
[10] Richard Bellman. Dynamic Programming. Science, 153(3731):34–37, July 1966.
[11] Abhinandana Boodi, Karim Beddiar, Malek Benamour, Yassine Amirat, and Mohamed Benbouzid. Intelligent Systems for Building Energy and Occupant Comfort Optimization: A State of the Art Review and Recommendations. Energies, 11(10):2604, October 2018.
[12] Matthew Botvinick, Sam Ritter, Jane X. Wang, Zeb Kurth-Nelson, Charles Blundell, and Demis Hassabis. Reinforcement Learning, Fast and Slow. Trends in Cognitive Sciences, 23(5):408–422, May 2019.
[13] M. R. Brambley, P. Haves, S. C. McDonald, P. Torcellini, D. Hansen, D. R. Holmberg, and K. W. Roth. Advanced Sensors and Controls for Building Applications: Market Assessment and Potential R&D Pathways. page 162.
[14] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv:1606.01540 [cs], June 2016.
[15] Konstantinos Dalamagkidis and Dionysia Kolokotsa. Reinforcement Learning for Building Environmental Control. In Cornelius Weber, Mark Elshaw, and Norbert Michael, editors, Reinforcement Learning. I-Tech Education and Publishing, January 2008.
[16] Bing Dong and Khee Poh Lam. A real-time model predictive control for building heating and cooling systems based on the occupancy behavior pattern detection and local weather forecasting. Building Simulation, 7(1):89–106, February 2014.
[17] Valentina Fabi, Rune Vinther Andersen, Stefano Corgnati, and Bjarne W. Olesen. Occupants' window opening behaviour: A literature review of factors influencing occupant behaviour and models. Building and Environment, 58:188–198, December 2012.
[18] Matthias Finger. Management of Smart Urban Infrastructures MOOC (IGLUS).
[19] Matthias Finger. Management of Urban Infrastructures MOOC (IGLUS).
[20] Monika Frontczak, Rune Vinther Andersen, and Pawel Wargocki. Questionnaire survey on factors influencing comfort with indoor environmental quality in Danish housing. Building and Environment, 50:56–64, April 2012.
[21] Monika Frontczak and Pawel Wargocki. Literature survey on how different factors influence human comfort in indoor environments. Building and Environment, 46(4):922–937, April 2011.
[22] Dolf Gielen, Francisco Boshell, Deger Saygin, Morgan D. Bazilian, Nicholas Wagner, and Ricardo Gorini. The role of renewable energy in the global energy transformation. Energy Strategy Reviews, 24:38–50, April 2019.
[23] Mengjie Han, Ross May, Xingxing Zhang, Xinru Wang, Song Pan, Da Yan, Yuan Jin, and Liguo Xu. A review of reinforcement learning methodologies for controlling occupant comfort in buildings. Sustainable Cities and Society, 51:101748, November 2019.
[24] C. Huizenga, S. Abbaszadeh, Leah Zagreus, and Edward A. Arens. Air quality and thermal comfort in office buildings: Results of a large indoor environmental quality survey. Proceedings of Healthy Buildings 2006, 3:393–397, 2006.
[25] Wufeng Jin, Ningning Zhang, and Junwei He. Experimental Study on the Influence of a Ventilated Window for Indoor Air Quality and Indoor Thermal Environment. Procedia Engineering, 121:217–224, January 2015.
[26] Rob Kitchin. The real-time city? Big data and smart urbanism. page 14, 2014.
[27] Nan Li, Juncheng Li, Ruijuan Fan, and Hongyuan Jia. Probability of occupant operation of windows during transition seasons in office buildings. Renewable Energy, 73:84–91, January 2015.
[28] Nianping Li, Haijiao Cui, Chihui Zhu, Xuhan Zhang, and Lin Su. Grey preference analysis of indoor environmental factors using sub-indexes based on Weber/Fechner's law and predicted mean vote. Indoor and Built Environment, 25(8):1197–1208, December 2016.
[29] Vangelis Marinakis, Charikleia Karakosta, Haris Doukas, Styliani Androulaki, and John Psarras. A building automation and control tool for remote and real time monitoring of energy consumption. Sustainable Cities and Society, 6:11–15, February 2013.
[30] Kes McCormick, Lena Neij, and Stefan Anderberg. Sustainable Urban Transformation and the Green Urban Economy. 2012.
[31] June Young Park and Zoltan Nagy. Comprehensive analysis of the relationship between thermal comfort and building control research - A data-driven literature review. Renewable and Sustainable Energy Reviews, 82:2664–2679, February 2018.
[32] June Young Park, Mohamed M. Ouf, Burak Gunay, Yuzhen Peng, William O'Brien, Mikkel Baun Kjærgaard, and Zoltan Nagy. A critical review of field implementations of occupant-centric building controls. Building and Environment, 165:106351, November 2019.
[33] David L. Poole and Alan K. Mackworth. Artificial Intelligence: Foundations of Computational Agents. Cambridge University Press, New York, NY, USA, 2nd edition, 2017.
[34] David Rolnick, Priya L. Donti, Lynn H. Kaack, Kelly Kochanski, Alexandre Lacoste, Kris Sankaran, Andrew Slavin Ross, Nikola Milojevic-Dupont, Natasha Jaques, Anna Waldman-Brown, Alexandra Luccioni, Tegan Maharaj, Evan D. Sherwin, S. Karthik Mukkavilli, Konrad P. Kording, Carla Gomes, Andrew Y. Ng, Demis Hassabis, John C. Platt, Felix Creutzig, Jennifer Chayes, and Yoshua Bengio. Tackling Climate Change with Machine Learning. arXiv:1906.05433 [cs, stat], June 2019.
[35] Agnes Schuurmans, Susanne Dyrbøl, and Fanny Guay. Buildings in Urban Regeneration. Sustainable Cities - Authenticity, Ambition and Dream, November 2018.
[36] David Silver. Introduction to Reinforcement Learning, May 2015.
[37] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, October 2017.
[38] Francesca Stazi, Federica Naspi, Giulia Ulpiani, and Costanzo Di Perna. Indoor air quality and thermal comfort optimization in classrooms developing an automatic system for windows opening and closing. Energy and Buildings, 139:732–746, March 2017.
[39] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, Massachusetts, second edition, 2018.
[40] Ryan A. Tanner and Gregor P. Henze. Stochastic control optimization for a mixed mode building considering occupant window opening behaviour. Journal of Building Performance Simulation, 7(6):427–444, November 2014.
[41] José R. Vázquez-Canteli and Zoltán Nagy. Reinforcement learning for demand response: A review of algorithms and modeling techniques. Applied Energy, 235:1072–1089, February 2019.
[42] José R. Vázquez-Canteli, Stepan Ulyanin, Jérôme Kämpf, and Zoltán Nagy. Fusing TensorFlow with building energy simulation for intelligent energy management in smart cities. Sustainable Cities and Society, 45:243–257, February 2019.
[43] Andreas Wagner, Liam O'Brien, and Mackenzie Building. October 2018, updated after approval by IEA EBC. page 23.
[44] Weimin Wang, Radu Zmeureanu, and Hugues Rivard. Applying multi-objective genetic algorithms in green building design optimization. Building and Environment, 40(11):1512–1525, November 2005.
[45] Da Yan and Tianzhen Hong. Definition and Simulation of Occupant Behavior in Buildings. page 172.