
2018 Building Performance Modeling Conference and SimBuild co-organized by ASHRAE and IBPSA-USA
Chicago, IL, September 26-28, 2018

A DEEP REINFORCEMENT LEARNING APPROACH TO USING WHOLE BUILDING ENERGY MODEL FOR HVAC OPTIMAL CONTROL

Zhiang Zhang1, Adrian Chong2, Yuqi Pan3, Chenlu Zhang1, Siliang Lu1, and Khee Poh Lam1,2

1Carnegie Mellon University, Pittsburgh, PA, USA
2National University of Singapore, Singapore

3Ghafari Associates, MI, USA

ABSTRACT
Whole building energy models (BEM) are difficult to use in classical model-based optimal control (MOC) because of their high dimensionality and heavy computational demand. This study proposes a novel deep reinforcement learning framework for using BEM in the MOC of HVAC systems. A case study based on a real office building in Pennsylvania is presented to demonstrate the workflow, including building modeling, model calibration and deep reinforcement learning training. The learned optimal control policy can potentially achieve 15% heating energy savings by simply controlling the heating system supply water temperature.

INTRODUCTION
Whole building energy models (BEM) have been widely used in the building design stage for design optimization and code compliance. As a BEM contains rich information related to building system control and operation, extending the life-cycle of the BEM from the building design stage to the building operation stage has become a popular concept in recent years.

One way to integrate BEM into building operation is model-based optimal control (MOC) of HVAC systems. However, most BEM engines are high-order, non-differentiable models, so conventional MOC algorithms cannot be applied directly. Furthermore, most BEM engines have intensive computational demand and hence can be too slow to meet real-time control requirements. For example, an EnergyPlus-based model predictive control (MPC) must limit its prediction horizon to one time step to save computational cost if it is used for real-time control (Zhang and Lam 2017), or the algorithm can only be used offline to pre-calculate the optimal control actions once a day (Ascione et al. 2016).

Reinforcement learning control provides a suitable way to use BEM in the MOC of HVAC systems because the optimal control policy is developed off-line by reinforcement learning using the model (a.k.a. the simulator). Conventional studies of MOC with reinforcement learning for HVAC systems usually use simple reinforcement learning methods, such as tabular Q-learning (Liu and Henze 2006) and linear state-value function approximation (Dalamagkidis et al. 2007). Such methods require careful design of the control algorithm and cannot easily scale to large, complicated buildings.

Deep reinforcement learning (DRL) has recently been proposed for optimal control (Mnih et al. 2013); it integrates deep learning into the conventional reinforcement learning framework. Compared to conventional reinforcement learning, DRL can solve complicated control problems and makes "end-to-end" control possible (i.e., raw observations as inputs, raw control actions as outputs). However, DRL-based MOC for HVAC systems is still at an infant stage of research, with very limited existing studies that use only simple hypothetical building models as case studies (Wei, Wang, and Zhu 2017; Wang, Velswamy, and Huang 2017; Li et al. 2017).

This study proposes a framework that uses BEM for the MOC of HVAC systems by DRL and can be deployed in a real building. The processes, including building modeling, model calibration and DRL training, are explained using a real-life office building as the case study. The energy efficiency and thermal comfort performance of the proposed control method are analyzed. In addition, the optimization problems encountered in DRL training are discussed.

DRL BACKGROUND
Reinforcement Learning (RL)
According to Sutton and Barto (2017), a standard RL problem involves a learning agent interacting with an environment over a number of discrete steps to learn how to maximize the reward returned by the environment. In this case, the learning agent (hereafter called the agent) is a virtual entity that performs the control actions, the environment is the controlled HVAC system, and the reward is the human-designed feedback from the environment representing how good the current environment state is. One step of agent-environment interaction can be expressed as a tuple $(S_t, A_t, S_{t+1}, R_{t+1})$, where $S_t$ is the environment's state at time $t$, $A_t$ is the control action of the agent at time $t$, $S_{t+1}$ is the resulting environment state after the agent has taken the action, and $R_{t+1}$ is the reward received by the agent from the environment. Ultimately, the goal of RL control is to learn an optimal control policy $\pi: S_t \to A_t$ that maximizes the accumulated future reward $\sum_{t}^{\infty} R_t$.

The above-mentioned standard RL problem is a Markov decision process (MDP) if it obeys the Markov property. Most RL algorithms implicitly assume the environment is an MDP. However, empirically, problems can still be solved even when the environment is not strictly an MDP, such as Atari games (Mnih et al. 2013).

In RL, there are three important concepts: the state-value function, the action-value function and the advantage function, as shown in Equations (1), (2) and (3):

$$v_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right], \qquad (1)$$

$$q_\pi(s,a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right], \qquad (2)$$

$$a_\pi(s,a) = q_\pi(s,a) - v_\pi(s), \qquad (3)$$

where $\gamma$ is the reward discount factor (Sutton and Barto 2017). Intuitively, the state-value function represents how good a state is, the action-value function represents how good an action is, and the advantage function describes how good an action is relative to its state. Note that $\mathbb{E}_{a \sim \pi(s)}[a_\pi(s,a)] = 0$. For the optimal policy $\pi^*$,

$$v_{\pi^*}(s) = \max_a q_{\pi^*}(s,a) = \max_a \mathbb{E}\left[R_{t+1} + \gamma v_{\pi^*}(S_{t+1}) \,\big|\, S_t = s, A_t = a\right]. \qquad (4)$$

In deep reinforcement learning, deep learning models are used to represent the value functions and the policy function, i.e. $v_\pi(s,\theta)$, $q_\pi(s,a,\theta)$ and $\pi(s,a,\theta)$, where $\theta$ denotes the function parameters.
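For illustration only (this sketch is not from the original study), the quantities in Equations (1)-(3) can be estimated from a single recorded trajectory: the discounted return is a Monte-Carlo estimate of $q_\pi$, and subtracting a critic's value prediction gives a sampled advantage. The reward and value numbers below are placeholders.

```python
# Illustrative only: discounted return and advantage estimates for Equations (1)-(3),
# computed from one sampled trajectory of rewards and predicted state values.
import numpy as np

def discounted_returns(rewards: np.ndarray, gamma: float) -> np.ndarray:
    """G_t = sum_k gamma^k * R_{t+k+1}, computed backwards over one episode."""
    returns = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = np.array([-0.2, -0.1, -0.4, 0.0])   # R_{t+1}, ..., R_{t+4} (placeholder values)
values = np.array([-0.5, -0.4, -0.3, -0.1])   # v(S_t) predicted by a critic (placeholder)
gamma = 0.99

returns = discounted_returns(rewards, gamma)  # Monte-Carlo estimate of q(s_t, a_t)
advantages = returns - values                 # sampled estimate of a(s_t, a_t), Eq. (3)
print(returns, advantages)
```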

Policy Gradient
Policy gradient (Sutton and Barto 2017) is the type of RL training method used in this study. The goal of the policy gradient method is to learn the parameters $\theta$ of $\pi_\theta(s,a) = \Pr(a|s,\theta)$ that maximize the average reward per time step $J(\theta)$:

$$J(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a R_s^a \, \pi_\theta(s,a), \qquad (5)$$

where $d^{\pi_\theta}(s)$ is the stationary distribution of state $s$ for the Markov chain starting from $s_0$ and following the policy $\pi_\theta$, and $R_s^a$ is the reward received by the agent in state $s$ when taking action $a$.

We can use gradient ascent to maximize Equation (5). The gradient of $J(\theta)$ with respect to $\theta$ is given by:

$$\begin{aligned}
\nabla_\theta J(\theta) &= \sum_s d^{\pi_\theta}(s) \sum_a R_s^a \, \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} & (6a)\\
&= \sum_s d^{\pi_\theta}(s) \sum_a R_s^a \, \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a) & (6b)\\
&= \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a) \, q_{\pi_\theta}(s,a)\right] & (6c)\\
&= \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a) \left(q_{\pi_\theta}(s,a) - v(s)\right)\right] & (6d)\\
&= \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a) \, a_{\pi_\theta}(s,a)\right], & (6e)
\end{aligned}$$

where Equation (6c) follows from the policy gradient theorem, and (6d) is obtained by subtracting the zero-valued "baseline" term $\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, v(s)]$ to reduce the variance of $q_{\pi_\theta}$. The policy gradient in the form of Equation (6e) is called advantage actor critic (A2C).
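The baseline subtraction in Equation (6d) is unbiased because $\mathbb{E}_{a\sim\pi(s)}[\nabla_\theta \log \pi_\theta(s,a)] = 0$. The short numerical check below (our own illustration, not code from the paper) verifies this identity for a softmax policy over three discrete actions, using the standard softmax log-probability gradient.

```python
# Numerical check that the baseline term in Eq. (6d) has zero expectation:
# sum_a pi(a|s) * d/d(logits) log pi(a|s) = 0 for a softmax policy.
import numpy as np

logits = np.array([0.7, -0.3, 1.2])            # unnormalized preferences for 3 actions
probs = np.exp(logits) / np.exp(logits).sum()  # softmax policy pi(a|s)

# Gradient of log pi(a|s) w.r.t. the logits: one_hot(a) - probs (standard softmax result).
grad_log_pi = np.eye(len(logits)) - probs      # row a = gradient of log pi(a|s)

expectation = probs @ grad_log_pi              # E_{a~pi}[grad log pi(a|s)]
print(np.allclose(expectation, 0.0))           # True: subtracting v(s) adds no bias
```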

Asynchronous Advantage Actor Critic (A3C)
The state-of-the-art deep reinforcement learning variant of A2C, Asynchronous Advantage Actor Critic (A3C) (Mnih et al. 2016), is the specific deep reinforcement learning method used in this study.

In the A3C method, rather than having only one agent interact with the environment, a number of agents interact with copies of the same environment independently but update the same global action-value or policy function network asynchronously. Also asynchronously, the agents update their own action-value or policy function networks to match the global one at a certain frequency. The purpose of this method is to ensure that the $(S_t, A_t, S_{t+1}, R_{t+1})$ tuples used to train the global network are independent. Compared to non-asynchronous methods, A3C significantly reduces memory usage and training time. Details of the algorithm can be found in (Mnih et al. 2016).

According to Equation (6d), two deep neural network models are needed for A3C: $\pi_\theta(s,a)$ to approximate the policy and $v_{\theta_v}(s)$ to approximate the state-value function. The one-step update for $\theta$ is (following Equations (4) and (6))

$$\begin{aligned}
\theta &\leftarrow \theta + \alpha \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\left(q_{\pi_\theta}(s,a) - v_{\theta_v}(s)\right)\right]\\
&= \theta + \alpha \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\left(R' + \gamma v_{\theta_v}(s') - v_{\theta_v}(s)\right)\right],
\end{aligned} \qquad (7)$$

and the one-step update for $\theta_v$ is (by applying the mean squared loss function)

$$\begin{aligned}
\theta_v &\leftarrow \theta_v - \alpha \mathbb{E}_{\pi_\theta}\left[\partial\left(v_{true} - v_{\theta_v}(s)\right)^2 / \partial\theta_v\right]\\
&= \theta_v - \alpha \mathbb{E}_{\pi_\theta}\left[\partial\left(R' + \gamma v_{\theta_v}(s') - v_{\theta_v}(s)\right)^2 / \partial\theta_v\right].
\end{aligned} \qquad (8)$$

In Equations (7) and (8), $\alpha$ is the gradient step size, $R'$ is the actual reward received, and $s'$ is the next state reached from state $s$ after taking action $a$.
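A minimal single-worker sketch of the one-step updates in Equations (7) and (8), written with PyTorch autograd for readability, is shown below. This is our illustration; the paper does not state its implementation library, and `policy_net`/`value_net` are assumed to map a state vector to action logits and a scalar value, respectively.

```python
# One-step actor-critic update corresponding to Equations (7) and (8).
# policy_net maps a state vector to action logits; value_net maps it to a scalar value.
import torch
import torch.nn.functional as F

def one_step_update(policy_net, value_net, policy_opt, value_opt,
                    s, a, r, s_next, gamma=0.99):
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    with torch.no_grad():
        td_target = r + gamma * value_net(s_next)       # R' + gamma * v_theta_v(s')

    value = value_net(s)                                # v_theta_v(s)
    advantage = (td_target - value).detach()            # estimate of q - v in Eq. (7)

    log_prob = F.log_softmax(policy_net(s), dim=-1)[a]  # log pi_theta(s, a)
    policy_loss = -log_prob * advantage                 # descending this ascends Eq. (7)
    value_loss = (td_target - value).pow(2).sum()       # squared TD error, Eq. (8)

    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```

In A3C, many workers run this kind of update concurrently against shared global parameters; the single-worker version above only illustrates the gradient targets.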


Figure 1: Relationship among the RL Algorithms

Figure 2: BEM-DRL Control Framework

The relationship among the RL algorithms discussed in this section is shown in Figure 1. The specific RL algorithm used in this study, A3C, is a deep learning extension of A2C, and A2C is a policy gradient method in RL.

BEM-DRL CONTROL FRAMEWORK
The BEM-DRL optimal control framework for HVAC systems is shown in Figure 2 and includes four steps:

1. BEM modeling: The building and its HVAC system are first modeled using a BEM application. A BEM built in the building design stage can be reused in this control framework.

2. HVAC model calibration: The BEM needs to be calibrated against real HVAC operation data.

3. DRL agent training: The DRL agent is trained off-line using the calibrated BEM to learn the optimal control policy (see the interface sketch after this list). The DRL state, reward and action design is determined based on building sensor data availability, the control optimization objectives and the HVAC system control capability.

4. Actual HVAC system control: The trained DRL agent controls the actual HVAC system. In this process, the agent continuously refines its control policy by learning from the actual feedback from the HVAC system.
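The sketch below illustrates the kind of gym-style interface that off-line DRL training against a calibrated BEM (step 3) requires. It is an assumption added here for clarity, not the authors' tooling: the actual co-simulation bridge to EnergyPlus is not shown, and the class and method names are illustrative.

```python
# Hypothetical gym-style interface for off-line DRL training against a calibrated BEM.
# The co-simulation bridge to EnergyPlus is intentionally left abstract.
from abc import ABC, abstractmethod
from typing import List, Tuple


class CalibratedBEMEnv(ABC):
    """One episode corresponds to one simulated heating season of the calibrated BEM."""

    @abstractmethod
    def reset(self) -> List[float]:
        """Restart the simulation and return the initial (normalized) state."""

    @abstractmethod
    def step(self, action_index: int) -> Tuple[List[float], float, bool]:
        """Advance the BEM by one control time step (5 minutes in this study).

        Returns the next state, the reward computed from the resulting
        observations, and a flag indicating the end of the simulated episode.
        """


def rollout(env: CalibratedBEMEnv, policy, n_steps: int) -> float:
    """Collect one trajectory with a given policy; returns the accumulated reward."""
    state, total_reward = env.reset(), 0.0
    for _ in range(n_steps):
        state, reward, done = env.step(policy(state))
        total_reward += reward
        if done:
            break
    return total_reward
```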

This framework will be demonstrated using the IntelligentWorkplace (IW) building located in Pittsburgh, PA, USAto control its heating system. In this study, the last step of

Figure 3: The Geometry Rendering of the IW BEM (ren-dered by BuildSimHub (BuildSimHub, Inc. 2018))

Figure 4: “Mullion” Radiant Heating System in IW

the framework is not yet implemented. A virtual buildingmodel is used instead of the actual building to validate theAI agent’s control performance.

BUILDING MODELING
Overview
The case study building, IW, is a one-level office building built in 1997 on top of an existing building, with a complete building automation system installed. It has a floor area of around 600 m2, with about 20 regular occupants and a 30-person classroom. EnergyPlus v8.3 (U.S. Department of Energy 2015) is used to create the BEM. An overview of the geometry of the model is shown in Figure 3.

Radiant Heating System
IW uses a water-based radiant heating system that is integrated with the mullions of the window frames, as shown in Figure 4. This system has a very slow thermal response. On winter mornings, it usually takes more than 2 hours for the indoor air temperature to reach its setpoint.

Current Heating System Control Logic
The hot water for the radiant system is supplied by a district heating plant, as shown in Figure 5. A steam-to-water heat exchanger located on the B2 level of the main building supplies the hot water to the system. Currently, the heat exchanger is enabled when the outdoor air temperature is below 10 C. The radiant system is also supplied with cold water from the district cooling plant in summer. Valve 2 and valve 3 in Figure 5 are on/off valves and operate in a synchronized way to switch between heating and cooling mode.


Figure 5: Schematic Diagram of the IW Radiant System Water Flow (valve 1: recirculation valve; valve 2: hot/cold water inlet valve; valve 3: hot/cold water outlet valve)

The supply water temperature of the radiant system is controlled by valve 1 in Figure 5, which changes the proportion of return water to hot water supplied to the radiant system. The current control logic keeps the supply water flow rate constant and changes the supply water temperature. The supply water temperature setpoint is calculated by a PID controller based on the difference between the zones' average indoor air temperature and its setpoint.

MODEL CALIBRATION
Calibration Objective
The calibration objective should be determined based on the requirements of the control problem. In this study, the optimal control aims to reduce heating energy consumption while maintaining an acceptable indoor thermal comfort level. Therefore, the BEM of IW is calibrated for heating energy consumption and indoor average air temperature. Note that the average indoor air temperature is used as the representation of the indoor thermal response.

This study focuses on control for the heating season. Only three months of measured data, from Jan 1st 2017 to Mar 31st 2017 at a 5-minute interval, are used for the calibration. The heating energy consumption is a calculated value based on the measured radiant system inlet/outlet water temperature and water mass flow rate.
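For reference, a minimal sketch of how heating energy can be derived from the measured water-side quantities, using the standard sensible-heat relation and an assumed constant water specific heat of 4.186 kJ/(kg·K). The exact procedure used by the building automation system is not detailed in the paper, so treat this as an illustration.

```python
# Heating power/energy derived from measured radiant-loop water conditions:
# Q = m_dot * c_p * (T_inlet - T_outlet), integrated over the 5-minute sampling interval.
CP_WATER = 4.186  # kJ/(kg*K), assumed constant for hot water

def heating_energy_kwh(m_dot_kg_s: float, t_inlet_c: float, t_outlet_c: float,
                       interval_s: float = 300.0) -> float:
    """Energy delivered to the radiant loop over one sampling interval, in kWh."""
    power_kw = m_dot_kg_s * CP_WATER * (t_inlet_c - t_outlet_c)  # kJ/s == kW
    return power_kw * interval_s / 3600.0                        # kW * h

# Example: 1.2 kg/s supplied at 45 C, returning at 38 C, over a 5-minute interval.
print(round(heating_energy_kwh(1.2, 45.0, 38.0), 3))  # ~2.93 kWh
```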

Bayesian Calibration
Bayesian calibration using the method proposed by Chong et al. (2017) is performed for the IW model. The statistical formulation of the Bayesian calibration in this study is

$$y(x) = \zeta(x) + \epsilon(x) = \eta(x, c^*) + \epsilon(x), \qquad (9)$$

where $y$ is the observed building behavior, $\zeta$ is the true building behavior, $\eta$ is the output of the building energy model, $x$ denotes the observable input parameters, $c$ denotes the unknown calibration parameters, and $\epsilon$ is the observation error.

Table 1: IW Model Calibration Parameters
No.  Parameter                                       Range
1    Mullion insulation (polyisocyanate) thickness   1 mm to 5 mm
2    Internal mass area*                             200 to 1000 m2/zone
3    Mullion radiant surface area†                   6.7% to 26.7% of the external wall
4    Infiltration rate                               0.01 to 0.30 ACH
Note: *The internal mass is modeled in the BEM as 5 cm thick concrete. †The Mullion system of IW is abstracted to radiant surfaces in the BEM.

To adapt the method of Chong et al. (2017) to multi-objective calibration, multiple objectives have to be combined into a single objective. In this study, a simple linear method, shown in Equation (10), is used to combine multiple objectives ($y_n$) into one:

$$y = \mu_1 y_1 + \mu_2 y_2 + \ldots + \mu_n y_n, \quad \text{s.t.} \;\; \sum_{1}^{n} \mu_n = 1. \qquad (10)$$
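A small sketch of the linear combination in Equation (10), assuming the objective time series have already been put on comparable scales (the scaling step is not detailed in the paper):

```python
# Linear combination of calibration objectives, Equation (10): y = mu1*y1 + ... + mun*yn.
from typing import List
import numpy as np

def combine_objectives(objectives: List[np.ndarray], weights: List[float]) -> np.ndarray:
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * y for w, y in zip(weights, objectives))

y1 = np.array([21.3, 21.6, 22.0])   # e.g. average indoor air temperature series (placeholder)
y2 = np.array([10.2, 12.5, 9.8])    # e.g. heating energy series (placeholder)
y = combine_objectives([y1, y2], [0.0, 1.0])  # the weighting found to work best in this study
```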

Input Parameters and Calibration Parameters
The input parameters for the BEM calibration generally include real weather conditions, such as outdoor air temperature, humidity, solar radiation, wind speed and wind direction. In addition, actual HVAC system operation parameters should be included based on the requirements of the control problem. In this study, the control parameter of the optimal control algorithm is the Mullion system supply water temperature setpoint. Therefore, the input parameters also include the real Mullion system supply water temperature setpoint and the on/off status of the steam heat exchanger (shown in Figure 5).

The calibration parameters are selected by sensitivity analysis with the Morris method (Morris 1991). The selected calibration parameters are listed in Table 1 with their calibration ranges. The ranges are determined based on experience.

Calibration Results
In this study, there are only two calibration objectives: the indoor average air temperature (y1) and the heating energy consumption (y2). Different combinations of µ1 and µ2 were tested, and the calibration evaluation criteria are the mean bias error (MBE) and the coefficient of variation of the root mean square error (CVRMSE). Interestingly, µ1 = 0, µ2 = 1 gives the best accuracy for both the average indoor air temperature and the heating energy consumption. This is possibly because the heating energy consumption and the average indoor air temperature are highly correlated in this model, so capturing the heating energy consumption accurately also captures the average indoor air temperature accurately. However, further study should be conducted on multi-objective calibration with the Bayesian method.

The calibration results are shown in Table 2. MBE and CVRMSE for the average indoor air temperature are calculated using the 5-minute interval data, because air temperature is related to indoor thermal comfort and should be checked very frequently during the MOC of HVAC; an MBE of less than 1% is achieved. Heating energy consumption MBE and CVRMSE are calculated at hourly and daily intervals. Even though the CVRMSE for hourly heating energy consumption is relatively poor, the hourly MBE and daily CVRMSE are still within the acceptable range. As the accumulated value is more important than the spot value for HVAC energy consumption, this accuracy is acceptable.

Table 2: Calibration Results: Jan 1st to Mar 31st of 2017 with µ1 = 0, µ2 = 1
Objective                         Results
Average Indoor Air Temperature    5-min MBE: 0.52%; 5-min CVRMSE: 4.82%
Heating Energy Consumption        Hourly MBE: 0.43%; Hourly CVRMSE: 35.96%; Daily CVRMSE: 10.46%
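For reference, a sketch of the MBE and CVRMSE metrics reported in Table 2, written as they are commonly defined (e.g., in ASHRAE Guideline 14); the paper does not spell out its exact formulas, so treat these definitions as an assumption.

```python
# Calibration metrics: mean bias error (MBE) and coefficient of variation of the
# root mean square error (CVRMSE), both expressed as percentages of the measured data.
import numpy as np

def mbe_percent(measured: np.ndarray, simulated: np.ndarray) -> float:
    return 100.0 * np.sum(measured - simulated) / np.sum(measured)

def cvrmse_percent(measured: np.ndarray, simulated: np.ndarray) -> float:
    rmse = np.sqrt(np.mean((measured - simulated) ** 2))
    return 100.0 * rmse / np.mean(measured)

# Resample the 5-minute series to hourly or daily sums before applying these functions
# to reproduce the reporting intervals used in Table 2.
```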

DRL TRAINING
State, Action and Reward Design
For reinforcement learning, the state, action and reward design is critical for learning convergence and control performance. To take advantage of the deep reinforcement learning method, we use only raw observable or controllable parameters in our state, action and reward design, with no extra data manipulation.

Table 3: State: Site Observations (SiteOb)
No.  Site Observation
1    Day of the week
2    Hour of the day
3    Outdoor air temperature (C)
4    Outdoor air relative humidity (%)
5    Wind speed (m/s)
6    Wind direction (degrees from north)
7    Diffuse solar radiation (W/m2)
8    Direct solar radiation (W/m2)

Table 4: State: Building Observations (BldOb)
No.  Building Observation
1    IW Steam Heat Exchanger Enable Setpoint (HESSP, C)*
2    IW Average PPD (PPD)†
3    IW Radiant System Supply Water Temperature Setpoint (MULSSP, C)
4    IW Average Indoor Air Temperature (IAT, C)
5    IAT Setpoint (IATSSP, C)
6    IW Occupancy Mode Flag (OCCU)‡
7    IW Average Heating Demand Since Last Observation (Ehvac, kW)
Note: *The outdoor air temperature setpoint below which the IW steam heat exchanger is enabled. †PPD is short for Predicted Percentage of Dissatisfied, which is calculated by the BEM engine with the assumptions Clo = 1.0, Met = 1.2 and vair = 0.137 m/s. ‡The occupancy mode flag is determined by a fixed schedule (the flag is 1 between 7:00 AM and 7:00 PM on weekdays, and between 8:00 AM and 6:00 PM on weekends).

The state in reinforcement learning is what the DRL agent observes at each control step. In this study, the state consists of site observations (SiteOb) and building observations (BldOb), as shown in Table 3 and Table 4. The final state observation for deep reinforcement learning is:

$$state = \left\{\{SiteOb, BldOb\}_t,\; \{SiteOb, BldOb\}_{t-1},\; \ldots,\; \{SiteOb, BldOb\}_{t-n}\right\}, \qquad (11)$$

where $t$ is the current control step and $n$ is the number of history control steps considered. As buildings usually have a slow thermal response, it is important to include the history observations so that the environment is approximately an MDP for the deep reinforcement learning. It should be noted that each item in the state is normalized to between 0 and 1 to ease the optimization of the deep neural network; min-max normalization is used.
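A sketch of the state assembly in Equation (11): min-max normalization of each raw observation, followed by stacking the current observation with the previous n observations. This is our illustration; the padding behavior at the start of an episode is an assumption.

```python
# State construction per Equation (11): min-max normalize each raw observation,
# then stack the current observation with the previous n observations.
from collections import deque
import numpy as np

class StateStacker:
    def __init__(self, obs_min: np.ndarray, obs_max: np.ndarray, n_history: int = 3):
        self.obs_min, self.obs_max = obs_min, obs_max
        self.buffer = deque(maxlen=n_history + 1)   # current step plus n history steps

    def _normalize(self, raw_obs: np.ndarray) -> np.ndarray:
        return (raw_obs - self.obs_min) / (self.obs_max - self.obs_min)

    def push(self, raw_obs: np.ndarray) -> np.ndarray:
        """Add one raw {SiteOb, BldOb} vector and return the stacked, normalized state."""
        self.buffer.append(self._normalize(np.asarray(raw_obs, dtype=float)))
        frames = list(self.buffer)
        # Pad with copies of the oldest observation until the history window is full.
        while len(frames) < self.buffer.maxlen:
            frames.insert(0, frames[0])
        return np.concatenate(frames)
```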

Table 5: DRL Agent Actions
No.  RadSSP   HXOp      No.  RadSSP   HXOp
1    N/A*     0         ...  ...      ...
2    20 C     1         10   60 C     1
3    25 C     1         11   65 C     1
Note: *If the IW steam heat exchanger is turned off, the IW radiant system supply water temperature setpoint is no longer needed.
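One possible way to decode a discrete action index from Table 5 into a setpoint command is sketched below. The 5 C spacing between 20 C and 65 C is inferred from the listed rows (the intermediate rows are elided in the table), so the mapping is illustrative.

```python
# Decode a discrete action index (Table 5) into (RadSSP in deg C, HXOp on/off).
# Setpoints from 20 C to 65 C in 5 C steps are inferred from the table's listed rows.
from typing import Optional, Tuple

RADSSP_CHOICES = [20 + 5 * i for i in range(10)]   # 20, 25, ..., 65 deg C

def decode_action(index: int) -> Tuple[Optional[float], int]:
    """Index 1 turns the heat exchanger off; indices 2..11 select a supply setpoint."""
    if index == 1:
        return None, 0                 # HX off: supply water setpoint not applicable
    return float(RADSSP_CHOICES[index - 2]), 1

print(decode_action(1), decode_action(2), decode_action(11))  # (None, 0) (20.0, 1) (65.0, 1)
```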

The action is how the DRL agent controls the environment. In this study, the actions are the IW radiant system supply water temperature setpoint (RadSSP) and the on/off status of the IW steam heat exchanger (HXOp). The action space is discrete, as shown in Table 5.

Figure 6: Policy and State-value Function Network Architecture

The reward design determines the control optimization objective. The objective of the control method is to minimize HVAC energy consumption and maximize thermal comfort. The design of the reward function may significantly affect the convergence of the deep reinforcement learning and the final control performance. A flexible reward function combining the HVAC energy consumption and the indoor thermal comfort is shown in Equation (12); in our experiments, this reward design has good convergence properties.

$$reward = \begin{cases} -\Big[\tau\left([PPD - 0.1]_+ \cdot \rho\right)^2 + \beta \, E_{hvac}\Big]_0^1, & Occu = 1\\[4pt] -\Big[\tau \, \lambda \, [Stpt_{low} - IAT]_+ + \beta \, E_{hvac}\Big]_0^1, & Occu = 0 \end{cases} \qquad (12)$$

where $\tau, \beta, \rho, \lambda, Stpt_{low}$ are tunable hyperparameters; $[x]_+$ denotes $\max(x, 0)$ and $[\cdot]_0^1$ denotes clipping to the range $[0, 1]$. $\tau$ and $\beta$ control the relative importance of HVAC energy efficiency versus indoor thermal comfort in the optimization; $\rho$ is a scale factor to penalize large PPD values; $\lambda$ is the penalty level for indoor air temperature violations during unoccupied hours; and $Stpt_{low}$ is the indoor air temperature penalty threshold. Note that all parameters are normalized.
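A sketch of Equation (12) as we read it from the partly garbled source: a clipped penalty that is negated, so the reward lies in [-1, 0]. The sign and clipping convention is our assumption, the inputs are assumed to be normalized as in the paper, and the default hyperparameter values below (including lambda) are placeholders rather than the authors' settings.

```python
# Reward per Equation (12), as reconstructed here: a clipped, negated penalty combining
# normalized heating demand with a comfort term (PPD when occupied, an indoor-air-
# temperature setpoint violation when unoccupied). Sign/clipping convention is our reading.
def reward(ppd: float, iat: float, stpt_low: float, e_hvac: float, occupied: bool,
           tau: float = 1.0, beta: float = 2.5, rho: float = 10.0, lam: float = 0.5) -> float:
    if occupied:
        comfort_penalty = tau * (max(ppd - 0.1, 0.0) * rho) ** 2   # PPD expressed as a fraction
    else:
        comfort_penalty = tau * lam * max(stpt_low - iat, 0.0)     # unoccupied setpoint violation
    penalty = comfort_penalty + beta * e_hvac                      # e_hvac normalized to [0, 1]
    return -min(max(penalty, 0.0), 1.0)                            # clip to [0, 1], then negate
```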

DRL Training Experiment Setup
The neural network architecture for this study is shown in Figure 6. A3C needs two function approximation neural networks, one for the policy and the other for the state-value function; therefore, a shared multi-layer feed-forward neural network is used. The output of the shared network is fed into a Softmax layer and a linear layer in parallel, where the Softmax layer outputs the policy and the linear layer outputs the state value.

The shared network in Figure 6 has 4 hidden layers, each with 512 hidden units and rectifier nonlinearity. RMSProp (Tieleman and Hinton 2012) is used for optimization. The learning rate is fixed at 0.0001 and the RMSProp decay factor is 0.90. All gradients during back-propagation are clipped to an L2 norm of at most 5.0.

16 A3C agents interact with the environment in parallel, and the total number of interactions is 10 million (about 600K per A3C agent). The history window n in the state function (Equation (11)) is 3.

An entropy term is added to the DRL update function (Equation (7)) to encourage random exploration by the agent (Mnih et al. 2016). The original update function (Equation (7)) therefore becomes

$$\theta \leftarrow \theta + \alpha \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\left(R' + \gamma v_{\theta_v}(s') - v_{\theta_v}(s)\right) + \kappa \nabla_\theta H(\pi_\theta(s))\right], \qquad (13)$$

where $H$ is the entropy term and $\kappa$ is a tunable hyperparameter controlling the exploration level (a larger value encourages more exploration). In this study, $\kappa$ is piecewise constant, with values 1, 0.1, 0.05 and 0.01 before 2 million, 4 million, 6 million and 10 million interaction steps, respectively.

The calibrated BEM of IW is used for DRL training and testing in this study. The DRL agent is trained on the heating season (Jan to Mar) using TMY3 weather data. The trained DRL agent is then tested on the heating season using the actual Jan-Mar weather data of 2017. The BEM simulation time step and the control time step are both 5 minutes. The $Stpt_{low}$ in Equation (12) is the indoor air temperature setpoint calculated by the IW BAS control logic.
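The sketch below mirrors the shared policy/state-value network of Figure 6 and the optimizer settings described above. It is written in PyTorch as an assumption (the paper does not state its deep learning framework), and the state dimension is derived from Tables 3-4 (8 + 7 observations) with the history window n = 3.

```python
# Shared trunk (4 x 512 ReLU) with a softmax policy head and a linear state-value head,
# trained with RMSProp (lr = 1e-4, decay 0.90) and L2 gradient-norm clipping at 5.0.
import torch
import torch.nn as nn

class A3CNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 512, depth: int = 4):
        super().__init__()
        layers, in_dim = [], state_dim
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        self.trunk = nn.Sequential(*layers)
        self.policy_head = nn.Linear(hidden, n_actions)   # logits -> softmax policy
        self.value_head = nn.Linear(hidden, 1)            # scalar state value

    def forward(self, state: torch.Tensor):
        h = self.trunk(state)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h)

net = A3CNetwork(state_dim=15 * 4, n_actions=11)          # 15 observations x (1 + 3 history steps)
optimizer = torch.optim.RMSprop(net.parameters(), lr=1e-4, alpha=0.90)
# During training, clip gradients before each optimizer step:
# torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=5.0)
```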

Discussion and Results
The DRL performance is analyzed in terms of its optimization performance and its control performance. The optimization performance is evaluated by the quality of the DRL training convergence (convergence of the accumulated reward), and the control performance in this study is evaluated by the heating energy consumption and thermal comfort.

The DRL training in this study usually converges after around 6.5 million interactions, which takes about 10 hours on a Xeon E5 CPU (no GPU is used). However, it is found that for some choices of the hyperparameters, the accumulated reward fluctuates or converges to a low level during DRL training. This is because of the very slow thermal response of IW, due to its radiant system, and the insufficient heating capacity of that system. As a result, the control actions taken by the DRL agent may not take effect immediately in terms of the environment observations. For example, in winter it may take more than 1 hour for the IW average indoor air temperature to rise by 1 C after the supply water temperature of the radiant system has been set to the maximum. This phenomenon is known as the "delayed reward problem" in DRL. It is found that careful selection of the reward function hyperparameters and control action repeats can mitigate this problem to some extent. However, a more structured solution should be sought in the future.

The ultimate goal of the optimal control is to maximize indoor thermal comfort and minimize heating energy consumption. However, the two goals cannot be fulfilled simultaneously. A good balance must be found between heating energy consumption and indoor thermal comfort, and this balance is determined by the reward function in the DRL. Due to the highly complicated relationship between thermal comfort and heating energy consumption, it is difficult to know exactly which reward function leads to the best balance. Therefore, the hyperparameters in the reward function (Equation (12)) must be tuned.

Table 6 shows the control performance results of the DRL for selected experiments. It is interesting to find that the control performance results for different hyperparameters are not necessarily intuitive. We would expect that a bigger β and a smaller ρ should lead to lower heating energy consumption and worse indoor thermal comfort; however, cases 4, 5 and 6 in Table 6 do not follow that pattern. Also, action repeats lower the flexibility of the control algorithm and hence should lead to worse control performance; however, Table 6 shows that cases with an action repeat of 3 generally perform better than cases with an action repeat of 1. Such counter-intuitive results are caused by the non-ideal optimization of the DRL training. As DRL is in fact a non-convex optimization problem, finding the global optimal solution (or even establishing its existence) is not guaranteed. A small change in the hyperparameters may cause the DRL agent to converge to a different local optimum. In addition, the reward is significantly delayed in this study, which adds further difficulty to the optimization problem. For example, case 5 consumes less heating energy and gives better indoor thermal comfort than case 2, which means the DRL agent in case 2 was stuck in a worse local optimum during training. Of the six experiments in Table 6, case 6 performs comparatively the best: it saves 15% of heating energy with only slightly worse indoor thermal comfort in the testing model.

Table 6: DRL Control Performance for IW in Heating Season (selected)
           Hyperparameters                Training Model                        Testing Model
#          Action Repeat*  τ, β, ρ†       Heating        PPD mean  PPD std     Heating        PPD mean  PPD std
                                          Energy (kWh)   (%)       (%)         Energy (kWh)   (%)       (%)
Basecase   N/A             N/A            45302          10.48     4.48        43709          9.46      5.59
1          1               1.0, 1.5, 20   52806          8.72      4.31        47522          8.23      2.46
2          1               1.0, 2.5, 20   44549          11.58     5.35        39484          11.11     4.53
3          1               1.0, 2.5, 10   40101          16.09     9.46        37238          14.2      8.65
4          3               1.0, 1.5, 20   42255          11.46     4.26        38550          10.63     3.34
5          3               1.0, 2.5, 20   43532          10.63     4.23        39109          10.44     3.75
6          3               1.0, 2.5, 10   42104          11.49     4.24        37131          11.71     3.76
Note: *Action repeat means the DRL agent repeats the same control action for multiple time steps. †ρ is determined by ρ = 1/(PPDthres/100 − 0.1); ρ = 20 and ρ = 10 correspond to PPDthres = 15 and PPDthres = 20 respectively, meaning the reward function (Equation (12)) returns its minimum value if the PPD exceeds PPDthres.

Over-fitting is a major problem for almost all machine learning methods. In BEM-DRL, the DRL agent may over-fit to the weather and building thermal behavior patterns of the simulator. Table 6 shows the control performance of the trained DRL agent on the testing model, which uses the same IW BEM but different weather data from the training model. Over-fitting to the simulator's weather patterns is not detected: the energy saving and thermal comfort results on the testing model are comparable to or even better than those on the training model. However, further analysis of over-fitting to the simulator's thermal behavior patterns is required after the trained DRL agent is deployed in the real building.

CONCLUSION AND FUTURE WORK
This study develops a BEM-DRL framework for HVAC optimal control targeting energy efficiency and thermal comfort. The framework workflow includes building modeling using BEM tools, model calibration and DRL training. The framework is applied to a case study, which controls the supply water temperature of the radiant heating system of an office building in Pennsylvania, USA. A Bayesian method is used for the model calibration, and less than 1% modeling error is achieved for both indoor average air temperature and heating energy consumption. The A3C algorithm is used for DRL training. It is found that the selection of the hyperparameters significantly affects the DRL convergence and control performance because of the non-convex nature of DRL optimization and the slow thermal response of the case study building. The best control policy trained by DRL can achieve 15% heating energy savings with indoor thermal comfort similar to the base case.

Future work will include implementing the trained DRL agent in the real building to test its real control performance. In addition, the energy saving results should be analyzed further to establish their statistical significance. The "delayed reward" problem will be studied further to find a more structured solution. More case studies should also be conducted for different types of HVAC systems to test the robustness of the framework.

ACKNOWLEDGMENT
The first author would like to thank the China Scholarship Council for the financial support of his PhD study. We would also like to thank Linan Zhang from Carnegie Mellon University for help with LaTeX.

REFERENCES
Ascione, Fabrizio, Nicola Bianco, Claudio De Stasio, Gerardo Maria Mauro, and Giuseppe Peter Vanoli. 2016. "Simulation-based model predictive control by the multi-objective optimization of building energy performance and thermal comfort." Energy and Buildings 111:131-144.

BuildSimHub, Inc. 2018. BuildSimHub.

Chong, Adrian, Khee Poh Lam, Matteo Pozzi, and Junjing Yang. 2017. "Bayesian calibration of building energy models with large datasets." Energy & Buildings 154:343-355.

Dalamagkidis, K., D. Kolokotsa, K. Kalaitzakis, and G. S. Stavrakakis. 2007. "Reinforcement learning for energy conservation and comfort in buildings." Building and Environment 42 (7): 2686-2698.



Li, Yuanlong, Yonggang Wen, Kyle Guan, and Dacheng Tao. 2017. Transforming Cooling Optimization for Green Data Center via Deep Reinforcement Learning.

Liu, Simeng, and Gregor P. Henze. 2006. "Experimental analysis of simulated reinforcement learning control for active and passive building thermal storage inventory: Part 2: Results and analysis." Energy and Buildings 38 (2): 148-161.

Mnih, Volodymyr, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, and Koray Kavukcuoglu. 2016. "Asynchronous Methods for Deep Reinforcement Learning." Proceedings of the 33rd International Conference on Machine Learning, ICML'16. New York, NY, USA: JMLR.org, 1928-1937.

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. "Playing Atari With Deep Reinforcement Learning." NIPS Deep Learning Workshop.

Morris, Max D. 1991. "Factorial Sampling Plans for Preliminary Computational Experiments." Technometrics 33 (2): 161-174.

Sutton, Richard S., and Andrew G. Barto. 2017. Reinforcement Learning: An Introduction. Second edition. Cambridge, MA, USA: MIT Press.

Tieleman, Tijmen, and Geoffrey Hinton. 2012. "Lecture 6.5-rmsprop: Divide the Gradient by a Running Average of Its Recent Magnitude." COURSERA: Neural Networks for Machine Learning 4:26-31.

U.S. Department of Energy. 2015. EnergyPlus 8.3.0.

Wang, Yuan, Kirubakaran Velswamy, and Biao Huang. 2017. "A Long-Short Term Memory Recurrent Neural Network Based Reinforcement Learning Controller for Office Heating Ventilation and Air Conditioning Systems." Processes 5, no. 46.

Wei, Tianshu, Yanzhi Wang, and Qi Zhu. 2017. "Deep Reinforcement Learning for Building HVAC Control." Proceedings of the 54th Annual Design Automation Conference 2017. Austin, TX, USA.

Zhang, Zhiang, and Khee Poh Lam. 2017. "An Implementation Framework of Model Predictive Control for HVAC Systems: A Case Study of EnergyPlus Model-Based Predictive Control." ASHRAE 2017 Annual Conference. Long Beach, CA, USA.

NOMENCLATURE
t          Control time step
S          Environment's state in RL
A          Control action of the RL agent
R          Reward received by the RL agent
π          Control policy of the RL agent
E          Expected value
γ          Reward discounting factor of RL
v, q, a    State-value, action-value, advantage value
θ          Function parameter
Pr         Probability
d          Stationary distribution
α          Gradient step size
H          Entropy of a distribution
y          Observed building behavior
ζ          True building behavior
ε          Building behavior observation error
η          Simulated building behavior
x, c       Calibration input and unknown parameter
µ, τ, β, λ, ρ, κ    Tunable hyperparameters
