
If You Measure It, Can You Improve It? Exploring The Value of Energy Disaggregation

Nipun Batra∗
Indraprastha Institute of Information Technology

[email protected]

Amarjeet Singh
Indraprastha Institute of Information Technology

[email protected]

Kamin Whitehouse
University of Virginia

Charlottesville, [email protected]

ABSTRACT

Over the past few years, dozens of new techniques have been proposed for more accurate energy disaggregation, but the jury is still out on whether these techniques can actually save energy and, if so, whether higher accuracy translates into higher energy savings. In this paper, we explore both of these questions. First, we develop new techniques that use disaggregated power data to provide actionable feedback to residential users. We evaluate these techniques using power traces from 240 homes and find that they can detect homes that need feedback with as much as 84% accuracy. Second, we evaluate whether existing energy disaggregation techniques provide power traces with sufficient fidelity to support the feedback techniques that we created and whether more accurate disaggregation results translate into more energy savings for the users. Results show that feedback accuracy is very low even while disaggregation accuracy is high. These results indicate a need to revisit the metrics by which disaggregation is evaluated.

Categories and Subject Descriptors
D.2.8 [Load disaggregation]: Metrics

General Terms
Energy feedback

Keywords
NILM

1. INTRODUCTION

Over the past few years, dozens of new techniques have been proposed for more accurate energy disaggregation, but the jury is still out on whether these techniques can actually save energy and, if so, whether higher accuracy translates into higher energy savings. In this paper, we explore both of these questions.

Energy disaggregation is the process of estimating the power draw of individual electrical loads based on metering of their aggregate power. Research in this area is broadly motivated by the philosophy of Lord Kelvin: “If you can’t measure it, you can’t improve it,” but is the opposite also true? Disaggregation techniques can be used to help users identify the major energy consumers in their home or business, and some users will pore over this data to achieve significant energy savings [3]. However, several studies have shown that just providing energy data does not necessarily translate to long-term energy savings. After the novelty wears off, users experience a rebound effect as unsustainable energy-saving actions unwind themselves and as users tire of sifting through too much data [1]. Other studies have shown more sustainable effects by providing more targeted feedback or recommendations for simple actions [10]. Can disaggregation techniques be used to generate the types of targeted, actionable feedback that would produce sustainable energy savings?

∗This work was done while the author was a visiting researcher at University of Virginia.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]’15, November 4–5, 2015, Seoul, South Korea. © 2015 ACM. ISBN 978-1-4503-3981-0/15/11 ...$15.00.

DOI: http://dx.doi.org/10.1145/2821650.2821660.

In this paper, we present the exploration of two research questions that are highly relevant to the research community interested in energy disaggregation. First, we explore whether disaggregated power data can be used to provide actionable feedback to residential users, and whether that feedback is likely to save energy. We focus on feedback about refrigerators and HVAC, because they contribute significantly to overall home energy consumption and are available in most homes. We develop a model that breaks the power trace of a refrigerator into three parts: baseline (when no one is using the fridge), defrost (energy consumption when the fridge is in defrost mode) and usage (energy consumption due to fridge usage). Then, we develop techniques to identify users with 1) much more energy due to fridge usage than the norm, 2) much more energy due to defrost than the norm, or 3) fridges that are malfunctioning or misconfigured, even during baseline operation. We evaluate our model using a dataset with power traces from 95 refrigerators. Results indicate that our model can break down fridge usage into its three components with only 4% error. Additionally, the three types of feedback could help users save up to 23%, 25% and 26% of their fridge energy usage, respectively. These techniques provide targeted feedback with specific actions, e.g. fix or repair the fridge, and so we expect these energy savings to be sustainable. Similarly, we develop new techniques to differentiate homes with and without setback schedules on the HVAC system based on their HVAC power traces and outdoor weather patterns. This information can be used to give feedback to install a programmable thermostat. We evaluate these techniques with power traces from 58 homes and results indicate that our techniques can classify homes with 84% accuracy. Based on these results, we conclude that disaggregation does indeed have the potential to provide targeted, actionable feedback that could lead to sustainable energy savings.

Second, we explore whether existing energy disaggregation techniques provide power traces with sufficient fidelity to support the feedback techniques that we created, and whether more accurate disaggregation results translate into more energy savings for the users. To do this, we re-evaluate the feedback techniques above using power traces produced by disaggregation algorithms instead of those produced by direct submetering. We use three benchmark algorithms provided in an open source toolkit called NILMTK [7]. We verified that these algorithms and the parameters we use produce disaggregation accuracies comparable to or better than the best results published in the literature. Nonetheless, the feedback techniques that we developed become almost completely ineffective when using the disaggregated energy traces. In some cases, they failed to identify over 70% of the homes that should be getting feedback and falsely flagged an additional 14% of homes that should not receive feedback.

To conclude, we discuss why feedback accuracy is low even while disaggregation accuracy is high: accurate energy breakdown feedback (e.g. “Your fridge accounts for 8% of your energy bill”) can be given even if the power traces have many errors as long as those errors average out over time. However, more targeted and actionable feedback (e.g. “Your fridge is defrosting too often; fix the seal.”) depends on specific features of the power traces. Our results indicate that the disaggregation community needs to revisit the metrics by which it measures progress. Part of this process will be to look through the lens of applications, including but not limited to the feedback techniques presented in this paper, to find the aspects of power traces that are most important. After all, “what you measure is what you get.”

2. RELATED WORK

The field of NILM was founded by George Hart in the early 1980s. His early works were motivated by the development of low-cost and easy-to-use methods for utilities to carry out residential appliance load research [16]. Existing load research methods during that time instrumented the individual branches and appliances. Three potential applications of NILM were proposed in the early works: i) controlling deferrable loads, ii) providing detailed energy usage to the end user, and iii) identifying faulty appliances.

Between the early 1980s and late 2000s, the field made steady progress. In the late 2000s, governments started rolling out smart meters and it became easier than ever before to collect data for evaluating NILM. Prior to smart meter rollouts, work in the field primarily consisted of pilot deployments. In 2011, the REDD [20] data set for NILM research was released. Its aim was to mimic the progress made in computer vision research by the availability of public data sets. Since then, the field has shown exponential growth.¹ More than 10 public data sets have been released in the past five years [7, 17]. This exponential growth in the field has also seen a lot of differences in assumptions and evaluation metrics. Recently, there has been an interest in high frequency features which, unlike most previous work, can identify low power appliances [14, 13, 2]. A subset of the research has viewed the disaggregation problem from the perspective of correctly identifying the operational state of an appliance and thus evaluates NILM accuracy using metrics such as precision and recall on appliance states. Such work is often motivated by applications in activity recognition [14, 26, 28]. Other research often looks at the disaggregation problem as providing a fine-grained electricity bill, where the consumers can see how much each appliance costs them per month. Such work often uses metrics such as percentage of energy correctly identified. Such was the variety of metrics used in the literature that it became virtually impossible to compare two papers. To ease the comparison of NILM research, there have been recent efforts to standardise NILM metrics and provide benchmark algorithms [7, 8].

¹ http://blog.oliverparson.co.uk/2015/03/overview-of-nilm-field.html

Recently, there has been an increased focus on developing NILM applications related to providing energy feedback. In terms of the techniques and evaluation we propose in this paper, there are three works that relate well to ours. Chen et al. [9] did a study on 124 apartments from an apartment complex having the same appliances and amenities, where they collected hourly appliance-level energy consumption. They explain the variation in fridge energy across homes to be caused by behavioural differences. They estimate the energy savings possible if fridges older than 10 years are replaced by newer efficient fridges. Our work differs from theirs by evaluating feedback models on disaggregated power traces. Since scaling appliance-level metering remains a huge challenge, we believe that there is a lot of value in evaluating the feedback on disaggregated power traces. Further, we evaluate our feedback methods on a wide range of homes that have variable appliances and amenities, unlike the data set used by Chen et al.

Parson et al. [24] also target feedback on the value of shifting to a new fridge across 117 homes from the UK. Our work is similar to theirs as they also give feedback based on disaggregated power traces. A key differentiating factor between our approach and the work by Parson et al. and Chen et al. is that rather than dismissing a high energy consuming fridge as inefficient, our fridge model enables us to determine whether high energy consumption is due to high usage or simply due to higher fridge capacity. Importantly, our work proposes feedback methods which are more fine-grained than providing feedback just based on appliance energy usage, which can be highly misleading. For instance, when comparing the summer HVAC usage of two homes, one in a colder and one in a warmer climate, feedback based only on HVAC energy usage may indicate that the home in the warmer climate is doing worse. Instead, the energy feedback needs to consider the climate before providing feedback.

Barker et al. [4] make a case for emphasizing NILM applications over accuracy. Their evaluation deals with the “long” execution times associated with disaggregation using current NILM algorithms, which effectively rule out a host of real-time applications. Our work is in the same vein, but instead does an empirical evaluation of energy feedback methods in an offline fashion. We believe that even before we address the issue of real-time applications, we need to evaluate the accuracy associated with the intended applications. Our work also shows the efficacy of the proposed feedback methods on a large number of homes.

3. DATA SETS

We now describe the two data sets that we will be using throughout the rest of this paper. To assess the value of energy disaggregation, we need a data set containing a large number of homes. We thus use the Dataport data set [22], which is the largest publicly available dataset containing submetered and aggregate electricity consumption. The first release of the data set contains per-minute power readings across different appliances from 240 homes in Austin, Texas from January through July 2014. More recently, a newer version of the data set has been released which contains data from 800 homes for close to 3 years. In addition to power data from different appliances, the data set contains information on energy audits, home survey and internal temperature for a subset of homes. Since our fridge work predates the latest release, we use the first release made available in NILMTK [7] format consisting of data from 240 homes for our fridge analysis.

The data set contains power data logged every minute for 172 fridges. Of these, we filtered out 77 fridges that had data collection problems such as missing data and multiple appliances on the same sensor. We use the remaining 95 fridges for evaluation of our proposed techniques. The data set also contains temperature setpoint data from 2013. Since the initial release does not have electricity data from 2013, we use the 2013 data from the newer release for our HVAC feedback analysis. We use the 58 homes having both the setpoint and power data information in our analysis.

We also collected data from four identical fridges operated in identical ambient conditions across four floors of the computer science building at UVa. We installed Hobo loggers² to collect power data at 1 Hz frequency from these four fridges. For one of the fridges, to which we had easy access, we collected door status for both doors and the freezer unit and internal temperature data at 1 Hz frequency, in addition to the power data. We collected data under different controlled and uncontrolled settings for two weeks.

4. APPLIANCE ENERGY MODELLING

Having described the data sets that we use, we now discuss energy models for fridge and HVAC, both of which contribute significantly to overall home energy consumption and are available in most homes. The key idea behind these energy models is to extract features from the power data which serve as the basis for the energy feedback methods that we later describe in Section 5.

4.1 Fridge energy modelling

A fridge is a compressor-based appliance where the motor duty cycles to maintain the fridge at a set temperature. When the compressor is ON, the refrigerant transfers heat from inside the fridge to the outside [11]. The compressor turns ON and OFF at a small offset temperature above and below the set temperature. Since the fridge is operated at a lower temperature than the surroundings, there is always heat leakage from the outside into the inside of the fridge, which is proportional to the temperature difference between the fridge setpoint and ambient temperature. In the absence of fridge usage (such as opening the fridge door), the compressor typically duty cycles at the same rate, shown as the baseline compressor usage in Figure 1, which occurs in the early morning hours of the shown fridge. Each time the fridge is opened, the leakage from the ambient environment increases and the compressor has to run longer to remove this extra heat. The addition of items in the fridge also causes the compressor to run longer due to the increased thermal mass. Both these factors cause an increase in the duty percentage of the fridge. The increased compressor ON and decreased compressor OFF durations are shown as usage in Figure 1. For efficient running of the fridge, fridges defrost periodically to get rid of frost developed on the cooling coil. Defrosting is done via the defrost heater and introduces heat into the system, which is removed in the next few compressor cycles having higher duty percentage. These cycles can be seen in Figure 1.

² http://www.onsetcomp.com

[Figure 1 plot: fridge power (W, 0 to 700) against time of day (08-Apr, 00:00 to 21:00), with regions labelled Baseline, Defrost, Increased compressor runtime due to defrost, and Usage.]

Figure 1: Breakdown of fridge energy consumption into baseline, defrost and usage

Thus, the fridge energy consumption can be broken down into three components: usage, defrost and baseline. We now describe the procedure for breaking down fridge energy into these three components:

1. Finding baseline duty percentage: Duty percentage of a fridge cycle (c) is given by the ratio of the compressor ON duration to the total fridge cycle duration:

$$\text{Duty percentage}(c) = \frac{\text{ON duration}(c)}{\text{ON duration}(c) + \text{OFF duration}(c)}$$

Baseline duty percentage is found as the median of the duty percentage during early morning hours (1 to 5 AM) over the duration of the dataset. Using the median makes the estimate robust to days on which a home has high fridge usage during these hours.

2. Finding defrost energy: Defrost energy comprises two parts: energy consumption when the fridge is in the defrost state and the extra energy consumed in the regular compressor cycles that follow the defrost state. We assume that a defrost cycle affects the next D compressor cycles. For these D cycles, the extra energy consumed is found from the additional duty percentage over the baseline of the compressor cycles following the defrost cycle as:

$$
\begin{aligned}
\text{Extra compressor energy due to defrost} = \sum_{c=1}^{D} \; & (\text{Duty percentage}(c) - \text{Baseline duty percentage}) \\
& \times (\text{ON duration}(c) + \text{OFF duration}(c)) \\
& \times \text{Fridge compressor power consumption}
\end{aligned}
\tag{1}
$$

Energy consumption when the fridge is in the defrost state can be trivially calculated.

3. Finding usage energy: As a prerequisite to finding usage energy, we need to first find usage cycles, which we define as fridge cycles that are affected by fridge usage. After removing the defrost cycles and the subsequent D cycles, we look for cycles having duty percentage that is P% more than the baseline duty percentage. The intuition behind choosing a parameter P is that fridges may show some inherent variation in duty cycle percentage independent of usage. We assume that this variation is within P% of the baseline duty percentage. After finding these U usage cycles, the usage energy can be calculated as:

$$
\begin{aligned}
\text{Usage energy} = \sum_{c=1}^{U} \; & (\text{Duty percentage}(c) - \text{Baseline duty percentage}) \\
& \times (\text{ON duration}(c) + \text{OFF duration}(c)) \\
& \times \text{Fridge compressor power consumption}
\end{aligned}
\tag{2}
$$

4. Finding baseline energy: All the cycles that are not affected by defrost or usage contribute towards baseline energy, and their energy consumption can be summed to find baseline energy.
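The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the `(on_seconds, off_seconds)` cycle representation, the `defrost_after` input and all numbers in the example are assumptions made here. In particular, "P% more than the baseline" is interpreted as a relative threshold, i.e. duty percentage above baseline × (1 + P).

```python
from statistics import median


def duty_percentage(on_s, off_s):
    """Duty percentage of one compressor cycle: ON / (ON + OFF)."""
    return on_s / (on_s + off_s)


def extra_energy_wh(cycles, baseline_duty, compressor_w):
    """Energy above baseline over the given cycles, as in Equations (1) and (2), in Wh."""
    total_ws = 0.0
    for on_s, off_s in cycles:
        extra_duty = duty_percentage(on_s, off_s) - baseline_duty
        total_ws += extra_duty * (on_s + off_s) * compressor_w
    return total_ws / 3600.0


def label_cycles(cycles, defrost_after, baseline_duty, D=3, P=0.15):
    """Label each cycle as 'defrost', 'usage' or 'baseline'.

    cycles: list of (on_seconds, off_seconds) compressor cycles in time order.
    defrost_after: indices of the first compressor cycle after each defrost
        event (assumed to be detected elsewhere).
    P: relative threshold above the baseline duty percentage (assumption).
    """
    labels = ["baseline"] * len(cycles)
    for start in defrost_after:
        for i in range(start, min(start + D, len(cycles))):
            labels[i] = "defrost"
    for i, (on_s, off_s) in enumerate(cycles):
        above = duty_percentage(on_s, off_s) > baseline_duty * (1 + P)
        if labels[i] == "baseline" and above:
            labels[i] = "usage"
    return labels


# Hypothetical cycles: three from the 1 to 5 AM window, four from daytime.
night = [(600, 1400), (610, 1390), (590, 1410)]
baseline = median(duty_percentage(on, off) for on, off in night)

day = [(900, 1100), (800, 1200), (620, 1380), (1000, 1000)]
labels = label_cycles(day, defrost_after=[0], baseline_duty=baseline, D=2, P=0.15)
usage_cycles = [c for c, lab in zip(day, labels) if lab == "usage"]
usage_wh = extra_energy_wh(usage_cycles, baseline, compressor_w=100)
```

The same `extra_energy_wh` helper serves both the defrost and the usage sums because Equations (1) and (2) differ only in which cycles they range over.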

4.1.1 Evaluation of fridge model

We now evaluate the accuracy of our fridge modelling approach. We use our collected data from the UVa CS building for this evaluation as the Dataport data set does not have labels for fridge usage. Using door sensor data, we manually annotated 3 days of usage cycles for the fridge that we had instrumented in our data set. We found that the defrost cycle impacts the next 3 cycles, and we thus chose D=3. It should be noted that choosing a slightly different value of D only marginally changes the usage and defrost energy numbers, since defrost cycles are easily outnumbered by regular cycles. The other parameter in our evaluation, the percentage threshold (P) for labelling usage cycles, is more important due to the expected high number of usage cycles.

We now define the three metrics used to evaluate our fridge modelling:

1. % Usage energy error for fridge, which suggests how accurately our model captures the energy usage when a fridge is being actively used:

$$\frac{|\text{Predicted fridge usage energy} - \text{Actual fridge usage energy}|}{\text{Actual fridge usage energy}} \times 100\%$$

2. Precision on fridge usage cycles:

$$\frac{|\text{Correctly predicted fridge usage cycles}|}{\#\ \text{Predicted fridge usage cycles}}$$

3. Recall on fridge usage cycles:

$$\frac{|\text{Correctly predicted fridge usage cycles}|}{\#\ \text{Total fridge usage cycles}}$$

Figure 2 shows the usage energy error, precision and recall on usage cycles as they vary with P. At a P of 11-16%, the usage energy error is less than 2%. Usage energy error remains below 4% for P between 9% and 24%, showing that the prediction remains useful within a wide percentage threshold. A precision of 1 is not observed until P = 17% due to the presence of a single fridge cycle having a high duty percentage despite being unrelated to usage. This is due to the fact that rare cycles may show an inherent deviation from the regular duty percentage. At P = 11%, the recall drops from 1. This is due to a usage cycle which shows less than 10% deviation from baseline duty percentage. We can conclude that our model is applicable even within a broad range of parameters.
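As a small illustration of the three metrics, the sketch below computes them from predicted and annotated usage cycles. The function names and the representation of usage cycles as sets of cycle indices are assumptions made for this example.

```python
def usage_energy_error_pct(predicted_wh, actual_wh):
    """Absolute usage energy error as a percentage of the actual usage energy."""
    return abs(predicted_wh - actual_wh) / actual_wh * 100.0


def precision_recall(predicted, actual):
    """Precision and recall on usage cycles.

    predicted, actual: sets of cycle indices labelled as usage cycles by the
    model and by the manual annotation, respectively.
    """
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical labels: the model flags cycles 1-4, the annotation has 2, 3 and 5.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 5})
error = usage_energy_error_pct(predicted_wh=104.0, actual_wh=100.0)
```

Sweeping the threshold P and recomputing these three values for each setting would produce curves of the kind described for Figure 2.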

[Figure 2 plot: usage energy error, precision and recall (as proportions, 0.0 to 1.0) against percentage threshold P (5 to 35).]

Figure 2: Our model for breaking fridge energy into usage, baseline and defrost is accurate to within 4% energy error for a wide range of percentage thresholds above baseline duty percentage.

4.2 HVAC energy modelling

Across the globe, HVAC is the single largest contributor to a home’s energy bill [25]. By optimising the HVAC setpoint schedule, up to 30% of HVAC energy can be saved [21]. Giving homes feedback on their setpoint schedule is likely to have a big impact. Thus, we try to build an HVAC model to predict setpoint temperature from HVAC energy data. Since HVAC energy usage is highly dependent on external weather conditions, we incorporate weather data into our HVAC model. While we explain our model for the cooling season (summers, when HVAC is used for cooling), it is equally applicable to the heating season. Our model is based on the following assumptions:

1. HVAC energy is impacted by weather conditions such as humidity, wind speed and temperature.

2. HVAC energy consumption is proportional to the difference in external temperature and home setpoint temperature.

3. Programmable thermostats use the following four setpoint times: night hours from 10 PM to 6 AM; morning hours from 6 AM to 8 AM; work hours from 8 AM to 6 PM; evening hours from 6 PM to 10 PM. These times are as per the schedule times reported by EnergyStar.gov [12].

4. HVAC energy during an hour is zero if the HVAC was not used during this hour.

Based on the first assumption, we have: HVAC energy ∝ humidity; HVAC energy ∝ wind speed. Based on the second assumption, we have HVAC energy ∝ (External temperature − internal temperature setpoint). Based on the third assumption, we have four different temperature setpoints during the day. We use four proportionality constants (a1 through a4) corresponding to these four setpoint times, describing how strongly the temperature delta between external and setpoint temperature affects HVAC energy consumption. To convert our HVAC model into a regression model, we add a binary variable (is it nth hour) which is 1 if the data is from the nth hour and 0 otherwise. We also use a binary variable indicating if HVAC was used during the nth hour based on the fourth assumption. Combining all of the above, our HVAC model expresses the energy consumed in the nth hour of the day as follows:

[Figure 3 plot: predicted setpoint error (°F, −25 to 15) for the Evening, Morning, Night and Work setpoints.]

Figure 3: The predicted setpoint temperatures from our HVAC model have a high offset from actual setpoint temperatures.

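$$
\begin{aligned}
\text{HVAC energy}(n) ={} & a_1 \times \big[(\text{External temperature}(n) - \text{Night hours setpoint}) \\
& \qquad \times \text{Is it 0th hour} \times \text{Is HVAC used}(n) \\
& \quad + \dots \\
& \quad + (\text{External temperature}(n) - \text{Night hours setpoint}) \\
& \qquad \times \text{Is it 5th hour} \times \text{Is HVAC used}(n)\big] \\
& + a_2 \times \dots \\
& + a_3 \times \dots \\
& + a_4 \times \dots \\
& + a_5 \times \text{humidity}(n) \\
& + a_6 \times \text{wind speed}(n)
\end{aligned}
\tag{3}
$$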

Our non-linear model has a total of 10 parameters: a1 through a6 and four setpoint temperatures.
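To make the structure of Equation (3) concrete, the following sketch evaluates the model for a single hour. It covers only the forward model, not the parameter fitting. The function names, the dictionary-based parameter layout and the example values are assumptions made here, as is the mapping of a2, a3 and a4 to the morning, work and evening periods (the text only shows a1 paired with the night setpoint).

```python
def setpoint_period(hour):
    """Map an hour of day (0-23) to one of the four assumed schedule periods."""
    if hour >= 22 or hour < 6:
        return "night"    # 10 PM to 6 AM
    if hour < 8:
        return "morning"  # 6 AM to 8 AM
    if hour < 18:
        return "work"     # 8 AM to 6 PM
    return "evening"      # 6 PM to 10 PM


def hvac_energy(hour, external_temp, humidity, wind_speed, hvac_used, a, setpoints):
    """Evaluate Equation (3) for one hour.

    a: dict with keys 'night', 'morning', 'work', 'evening' (a1..a4),
       'humidity' (a5) and 'wind' (a6).
    setpoints: dict of the four setpoint temperatures keyed by period.
    The 'is it nth hour' indicators reduce to selecting the period of `hour`.
    """
    period = setpoint_period(hour)
    delta = external_temp - setpoints[period]
    used = 1 if hvac_used else 0
    return (a[period] * delta * used
            + a["humidity"] * humidity
            + a["wind"] * wind_speed)


# Hypothetical parameter values, for illustration only.
a = {"night": 0.1, "morning": 0.2, "work": 0.3, "evening": 0.4,
     "humidity": 0.5, "wind": 0.25}
setpoints = {"night": 76, "morning": 78, "work": 82, "evening": 78}
energy_2pm = hvac_energy(14, 92, 0.5, 4, True, a, setpoints)
```

A fitting routine would compare these predictions with measured hourly HVAC energy and adjust the six coefficients and four setpoints to minimise the squared residuals.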

4.2.1 Evaluation of HVAC model

We now evaluate our HVAC model on its ability to learn the temperature setpoints. We calculate hourly HVAC energy usage for the 58 homes containing both HVAC power and setpoint information. This forms the LHS of Equation 3. We download hourly weather data from the Forecast.io web service³ and use linear interpolation to fill missing readings, similar to the work done by Rogers et al. [27]. Finally, we use non-linear least squares minimisation with the Python lmfit package⁴ to estimate the 10 parameters in our model. We also constrain learnt setpoints to be between 60 and 90 °F.

Figure 3 shows that our model is inadequate in accurately predicting setpoint temperatures. This is most likely due to the fact that some of the coefficients in our model are not independent and the fact that our model does not consider thermal mass of the building. Our main objective is finding homes which need HVAC setpoint feedback. While an accurate prediction of setpoint temperature would have allowed us to do the same, in Section 5.4, we explore machine learning based solutions to use the parameters from our HVAC model to predict homes needing setpoint feedback. A key takeaway which we see later in Section 5.4 is that these learnt parameters are useful in providing feedback to homes for setpoint optimisation.

³ http://forecast.io
⁴ http://lmfit.github.io/lmfit-py/

[Figure 4 plot: usage energy % (0 to 40) against proportion of usage cycles (0.0 to 1.0), one point per home.]

Figure 4: 13 out of 95 homes (shown in red) from the Dataport data set can be given feedback based on their fridge usage, potentially saving up to 23% of fridge energy.

5. ENERGY FEEDBACK METHODS

In this section, we develop and demonstrate some examples of how NILM could be used to provide feedback to users to reduce their energy usage based on the appliance energy modelling we previously discussed. These are only examples, and the analysis presented later in this paper would apply to any applications of NILM.

5.1 Fridge usage feedback

Having shown that we can accurately break down fridge energy into usage, defrost and baseline, we now show how we can give feedback to homes based on this breakdown. In this section, we target homes based on fridge usage, where the potential feedback could be to reduce interactions with the fridge, increase the temperature setpoint, etc. We use outlier detection based on a robust estimator of covariance [15] to detect such homes. The outlier detection method is applied on two dimensions: usage energy % and proportion of usage cycles. We apply this outlier detection method on the 95 homes from the Dataport data set. We divide this two-dimensional home data into four quadrants through the medians on usage energy % and proportion of usage cycles. Figure 4 shows the homes that can be given feedback based on their fridge usage energy in red. The black ellipse is the boundary outside which points are predicted to be outliers. Feedback can be given to homes in the first quadrant (shown in green), which have a high proportion of usage cycles and high usage energy. Homes in this category have a lot of cycles affected by usage and thus have high usage energy. 13 homes fall into this category and can save up to 23% of their fridge usage energy. Energy saving potential is calculated as the difference between current energy consumption and median energy consumption. There are no homes in the second quadrant, which denotes homes that have a small proportion of cycles affected by usage and yet have a high usage energy contribution. Such homes could possibly have few interactions with the fridge but a high usage energy due to a low fridge internal setpoint, where each interaction with the fridge leads to a lot of heat flow from the outside.
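The quadrant split and the saving estimate can be sketched as follows. This is a simplified illustration under stated assumptions: it covers only the median-based quadrants and the "difference from the median" saving potential, and deliberately omits the robust-covariance outlier detection step. The function name, input layout and example values are hypothetical.

```python
from statistics import median


def usage_feedback(homes):
    """Find first-quadrant homes and their saving potential.

    homes: dict mapping home id -> (usage_energy_pct, proportion_usage_cycles).
    Returns a dict mapping home id -> saving potential in percentage points,
    computed as the home's usage energy % minus the median usage energy %.
    """
    med_energy = median(energy for energy, _ in homes.values())
    med_cycles = median(prop for _, prop in homes.values())
    return {
        home: energy - med_energy
        for home, (energy, prop) in homes.items()
        if energy > med_energy and prop > med_cycles  # first quadrant only
    }


# Hypothetical homes: (usage energy %, proportion of usage cycles).
homes = {
    "a": (10, 0.2),
    "b": (12, 0.3),
    "c": (30, 0.8),
    "d": (8, 0.1),
    "e": (14, 0.5),
}
candidates = usage_feedback(homes)
```

In a fuller pipeline, the candidate set would additionally be restricted to homes that fall outside the outlier boundary fitted on the same two dimensions.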

5.2 Fridge defrost feedback

Our method for providing feedback based on defrost is similar to the method of providing feedback based on usage. High defrost energy could be indicative of a broken fridge seal. We use outlier detection methods on two dimensions: defrost energy % and number of defrost cycles per day, and give feedback to the homes lying in the first and the second quadrant. Number of defrost cycles per day is more interpretable and relatable than proportion of defrost cycles (which is going to be a very small floating point number). Figure 5 shows the homes that can be given feedback based on their fridge defrost energy. 15 out of 95 homes fall into the first quadrant, and 2 homes fall into the second quadrant. These 17 homes can save up to 25% of their fridge energy. While homes in the first quadrant have high defrost energy due to a high number of defrost cycles, homes in the second quadrant are likely to have a fridge malfunction whereby a fridge remains in the defrost state for a long time.

[Figure 5 plot: defrost energy % (0 to 40) against number of defrost cycles per day (0.0 to 2.5), one point per home.]

Figure 5: 17 out of 95 homes (shown in red) from the Dataport data set can be given feedback based on their fridge defrost energy, potentially saving up to 25% of fridge energy.

5.3 Fridge power feedback

We next looked into providing feedback in case we know the make and age of a fridge, and we have data from fridges of the identical make and age. Ideally, all such fridges should have similar power draw. However, we found four such pairs in the Dataport data set (LG, Frigidaire and two of Samsung) where one fridge of the pair has a significantly higher steady state and transient power. Transient power is defined as the short duration power when the fridge compressor motor starts. This power is higher than the steady state power, which is defined as the power draw of the fridge once the transient has ended. Figure 6 shows these four fridges and the differences in their steady state and transient powers. In order to eliminate the hypothesis that such differences could arise due to the difference in ambient conditions of these fridges, we also add in this figure the four General Electric fridges from our deployment. Three of them have a <steady state, transient> power consumption of <80, 100> Watts, while the fourth one has <120, 1310> Watts. Since these four fridges were operated under identical ambient conditions, the possibility of ambient conditions causing a power difference between these is ruled out. The arrows in the figure point towards the fridge consuming extra power. These fridges consume up to 26% more energy than their identical counterparts, where extra energy consumption is found by estimating the energy consumption if the fridge operated with lower steady state power. In order to reduce the false positive rate in giving such feedback about fridge malfunction, we can choose to give feedback when the difference in steady state power is at least 10%, where we assume that fridges can record up to 10% variation in their power consumption owing to several factors including measurement errors.
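The 10% rule for identical fridges can be illustrated with a short check. This is a sketch under assumptions: the function name is hypothetical, and the difference is measured relative to the lower of the two steady state powers, which the text does not specify.

```python
def flag_fridge(steady_a_w, steady_b_w, threshold=0.10):
    """Compare two fridges of identical make and age.

    Returns 'a' or 'b' for the fridge drawing more steady state power when
    the relative difference is at least `threshold`, otherwise None.
    The difference is taken relative to the lower reading (assumption).
    """
    low, high = sorted((steady_a_w, steady_b_w))
    if (high - low) / low >= threshold:
        return "a" if steady_a_w > steady_b_w else "b"
    return None


# Steady state powers of 80 W and 120 W, as in the deployment described above.
flagged = flag_fridge(80, 120)
# A 5% difference stays within the assumed measurement variation.
not_flagged = flag_fridge(100, 105)
```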

Figure 6: Identical fridges with the same model and age can have differences of 10% or more in steady state power levels. Feedback about failing or misconfigured fridges can save up to 26% energy.

[Figure 7 confusion matrix: rows are true labels, columns are predicted labels.]

True label     Predicted: Feedback    Predicted: No Feedback
No Feedback    5                      19
Feedback       30                     4

Figure 7: Our techniques correctly classify 84.4% of the homes as either having or not having a setpoint schedule, based on submetered HVAC data.

5.4 HVAC setpoint feedback

We previously saw in Section 4.2.1 that our HVAC model produces an offset in the learnt setpoint temperatures. Instead of using the learnt setpoint temperatures directly to find homes needing HVAC setpoint feedback, we use machine learning methods for the same. We calculate an HVAC efficiency score for the 58 homes in the Dataport data set on a scale of 0 to 4 based on recommended setpoint temperatures from EnergyStar [12] as follows: 1) Morning score = 1 if morning setpoint temperature > 78 °F, 0 otherwise; 2) Evening score = 1 if evening setpoint temperature > 78 °F, 0 otherwise; 3) Work hours score = 1 if work hours setpoint > 85 °F, 0 if setpoint <= 78 °F, (85 − setpoint)/7 otherwise; and 4) Night score = 1 if setpoint > 82 °F, 0 if setpoint <= 78 °F, (82 − setpoint)/4 otherwise. We decide that the 34 homes that have an overall score of 2 or less can be given feedback to optimise their HVAC setpoints.
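The scoring rule can be written out directly. The sketch below transcribes the four rules exactly as printed in the text, including the intermediate branches (85 − setpoint)/7 and (82 − setpoint)/4; the function names and the example setpoints are assumptions made here.

```python
def hvac_efficiency_score(morning, evening, work, night):
    """HVAC efficiency score on a 0 to 4 scale; setpoints in degrees Fahrenheit."""
    morning_score = 1.0 if morning > 78 else 0.0
    evening_score = 1.0 if evening > 78 else 0.0

    if work > 85:
        work_score = 1.0
    elif work <= 78:
        work_score = 0.0
    else:
        work_score = (85 - work) / 7   # intermediate branch, as printed

    if night > 82:
        night_score = 1.0
    elif night <= 78:
        night_score = 0.0
    else:
        night_score = (82 - night) / 4  # intermediate branch, as printed

    return morning_score + evening_score + work_score + night_score


def needs_feedback(score):
    """Homes with an overall score of 2 or less are candidates for feedback."""
    return score <= 2


# Hypothetical setpoints for one home.
score = hvac_efficiency_score(morning=80, evening=76, work=82, night=80)
```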

In addition to the 10 parameters of the HVAC model, we add features such as the total energy used in work, morning, night and evening hours and the number of minutes the HVAC system was on during these times to our machine learning methods. We use 2-fold cross validation and a grid search on the feature space to find that the feature set <a1, a3, Energy in evening hours, Mins HVAC usage in morning hours> used with the Random Forest classifier gives the best accuracy of 84.4%, as shown in Figure 7.
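The search over feature subsets with 2-fold cross validation can be illustrated as below. This is a sketch, not the procedure used in the paper: to stay within the standard library, a nearest-centroid classifier stands in for the Random Forest, the two folds are simply the even-indexed and odd-indexed samples, and all names and the toy data are hypothetical.

```python
from itertools import combinations


def fit_centroids(X, y):
    """Stand-in classifier: one mean vector per class."""
    centroids = {}
    for label in set(y):
        members = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict_centroid(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))


def two_fold_accuracy(rows, labels, columns):
    """Accuracy over two folds (even-indexed and odd-indexed samples)."""
    folds = [list(range(0, len(rows), 2)), list(range(1, len(rows), 2))]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        project = lambda i: [rows[i][c] for c in columns]
        model = fit_centroids([project(i) for i in train_idx],
                              [labels[i] for i in train_idx])
        correct += sum(predict_centroid(model, project(i)) == labels[i]
                       for i in test_idx)
    return correct / len(rows)


def best_feature_subset(rows, labels, subset_size):
    """Score every feature subset of the given size and keep the best one."""
    scored = ((two_fold_accuracy(rows, labels, cols), cols)
              for cols in combinations(range(len(rows[0])), subset_size))
    return max(scored, key=lambda pair: pair[0])


# Toy data: only feature 0 separates the two classes.
rows = [[1, 5, 3], [2, 1, 9], [11, 4, 2], [12, 2, 8],
        [1, 6, 7], [2, 3, 1], [11, 5, 6], [12, 2, 4]]
labels = [0, 0, 1, 1, 0, 0, 1, 1]
best = best_feature_subset(rows, labels, subset_size=1)
print(best)
```

Exhaustive enumeration is affordable only for small feature spaces, since the number of subsets grows combinatorially with the number of candidate features.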

                                                       Fridge                             HVAC
Authors       Year  Dataset    #Homes  Algorithm       RMSE (W)  Error Energy %  F-score  RMSE (W)  Error Energy %  F-score
Kolter [19]   2012  REDD [20]  6       Additive FHMM   -         62.5 ∆          -        -         -               -
Parson [23]   2012  REDD [20]  6       Difference HMM  83        55              -        -         -               -
Parson [24]   2014  Colden     117     Bayesian HMM    45 (column unclear)
Batra [7]     2014  iAWE [6]   1       FHMM            -         50              0.8      -         30              0.9
Current work        Dataport   240     CO *            85        19              0.65     600       15              0.87
Current work        Dataport   240     FHMM *          95        20              0.63     650       18              0.89
Current work        Dataport   240     Hart            82        21              0.72     890       23              0.76

Table 1: Benchmark algorithms on the Dataport data set give comparable performance to existing literature. * Both CO and FHMM achieve best performance for N=2, top-K=3. ∆ Kolter’s paper includes a slightly different metric from which we derived this number.

6. EVALUATION OF NILM FOR FEEDBACK

Having described our methods for providing energy feedback to homes based on submetered data, and having shown that these models can give good feedback, we now evaluate how accurately current NILM approaches reproduce this feedback. We first describe the experimental setup for evaluating NILM performance on the Dataport data set.

6.1 Experimental setup

We use NILMTK [7] to perform our NILM experiments. We use the three reference implementations made available in NILMTK, which we now describe.

6.1.1 NILM models

Combinatorial optimisation (CO): CO was proposed by George Hart in his seminal NILM paper [16]. CO models each appliance as having a fixed number of states and assigns a power level to each of these states. The optimisation involves finding the combination of appliance states which minimises the difference between the predicted and the observed aggregate power:

$$\hat{x}_t^{(n)} = \underset{x_t^{(n)}}{\arg\min} \left| y_t - \sum_{n=1}^{N} y_t^{(n)} \right| \tag{4}$$

Here, $x_t^{(n)}$ is the state of the $n$th appliance at time $t$, $y_t$ is the aggregate power consumption at time $t$, and $y_t^{(n)}$ is the power consumption of the $n$th appliance at time $t$. The time complexity of CO is exponential in the number of appliances, and hence it becomes intractable for a large number of appliances.
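Equation 4 can be illustrated with a brute-force search over all joint appliance states. This is a minimal sketch and not the NILMTK implementation: the function name, the appliance list and the power levels are hypothetical, and each appliance is given two states (OFF and ON).

```python
from itertools import product


def combinatorial_optimisation(aggregate_w, state_powers):
    """Return the joint state minimising |y_t - sum_n y_t^(n)| and its error.

    state_powers holds, for each appliance, the power level of each state.
    """
    best_states, best_error = None, float("inf")
    # Enumerate every combination of per-appliance states.
    for states in product(*(range(len(levels)) for levels in state_powers)):
        predicted = sum(levels[s] for levels, s in zip(state_powers, states))
        error = abs(aggregate_w - predicted)
        if error < best_error:
            best_states, best_error = states, error
    return best_states, best_error


# Hypothetical power levels in Watts: [OFF, ON] for fridge, HVAC and kettle.
state_powers = [[0, 100], [0, 2500], [0, 1500]]
states, err = combinatorial_optimisation(2620, state_powers)
print(states, err)  # fridge ON, HVAC ON, kettle OFF; residual of 20 W
```

The loop visits every joint state, which is the source of the exponential cost noted above.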

Factorial hidden Markov model (FHMM): The FHMM approach is more recent and builds upon the finite state machines suggested by Hart [16]. In an FHMM, each appliance is modelled as a hidden Markov model (HMM), where the hidden component is its state and the observed component is its power draw. Like CO, FHMM has time complexity exponential in the number of appliances and thus becomes intractable for a large number of appliances.

Hart’s steady-state algorithm: In his seminal work, Hart presented an event based approach that we hereafter refer to as Hart’s algorithm [16]. This approach finds events in the aggregate power time series and assigns them to appliances. An event is said to occur when the aggregate power changes by more than a threshold. During the training phase, Hart’s algorithm pairs rising and falling edges whose magnitude difference falls within a threshold. Next, it clusters these rising-falling pairs, where each cluster represents an appliance. Since this algorithm is unsupervised, it requires manual labelling to assign appliance labels to these clusters.
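The event detection and edge pairing steps can be sketched as follows. This is a simplified illustration, not Hart's full algorithm or the NILMTK implementation: the thresholds, the function names and the toy signal are hypothetical, steady-state filtering is omitted, and the clustering step that follows pairing is left out.

```python
def detect_edges(power, threshold_w=50):
    """Return (index, delta) for each sample-to-sample change above threshold_w."""
    edges = []
    for i in range(1, len(power)):
        delta = power[i] - power[i - 1]
        if abs(delta) > threshold_w:
            edges.append((i, delta))
    return edges


def pair_edges(edges, tolerance_w=30):
    """Pair each rising edge with the earliest later unmatched falling edge
    whose magnitude is within tolerance_w of the rise."""
    pairs, used = [], set()
    for i, (t_on, rise) in enumerate(edges):
        if rise <= 0:
            continue
        for j in range(i + 1, len(edges)):
            t_off, fall = edges[j]
            if j in used or fall >= 0:
                continue
            if abs(rise + fall) <= tolerance_w:
                # Store switch-on time, switch-off time and mean magnitude.
                pairs.append((t_on, t_off, (rise - fall) / 2))
                used.add(j)
                break
    return pairs


# Toy aggregate signal in Watts: a 100 W load runs from sample 2 to 8,
# and a 1500 W load runs from sample 4 to 6, on top of a 50 W base load.
signal = [50, 50, 150, 150, 1650, 1650, 150, 150, 50, 50]
pairs = pair_edges(detect_edges(signal))
print(pairs)
```

Each resulting pair carries a magnitude, which is the quantity a subsequent clustering step would group into appliances.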

These three algorithms cover both classes of disaggregation algorithms: CO and FHMM are non-event based, while Hart’s algorithm is event based. While FHMM is a recent development, Hart’s is the seminal work. Although we do not evaluate other NILM algorithms (primarily because their implementations are unavailable), we believe that the chosen algorithms cover NILM approaches sufficiently broadly.

6.1.2 NILM metrics

Having discussed the NILM models, we now discuss the conventional NILM metrics used to evaluate them. We use the standard definitions of NILM metrics as made available in NILMTK [7].

1. % Error in Energy: $\frac{|\text{Predicted energy} - \text{Actual energy}|}{\text{Actual energy}} \times 100\%$

2. Root Mean Squared Error (RMSE) Power: $\sqrt{\frac{1}{N} \sum_{i=1}^{N} (\text{Predicted power}_i - \text{Actual power}_i)^2}$

3. F-score: First, disaggregation is converted to a binary classification problem where an appliance is ON if it consumes more than a threshold and OFF otherwise. Next, the standard definition of F-score is used on this binary classification task.
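The three metrics above can be computed as in the sketch below. It is not the NILMTK code: the function names, the 10 W ON threshold and the toy traces are hypothetical, and power is assumed to be sampled at a fixed rate so that summed power is proportional to energy.

```python
import math


def percent_error_in_energy(predicted, actual):
    """|predicted energy - actual energy| / actual energy, in percent."""
    predicted_energy, actual_energy = sum(predicted), sum(actual)
    return abs(predicted_energy - actual_energy) / actual_energy * 100


def rmse_power(predicted, actual):
    """Root mean squared error between two power traces."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def f_score(predicted, actual, on_threshold_w=10):
    """F-score of the ON/OFF classification derived from a power threshold."""
    pred_on = [p > on_threshold_w for p in predicted]
    act_on = [a > on_threshold_w for a in actual]
    tp = sum(p and a for p, a in zip(pred_on, act_on))
    fp = sum(p and not a for p, a in zip(pred_on, act_on))
    fn = sum(a and not p for p, a in zip(pred_on, act_on))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy traces in Watts: the prediction gets one ON sample right and one wrong.
actual = [0, 100, 100, 0]
predicted = [0, 90, 0, 90]
print(percent_error_in_energy(predicted, actual))
print(rmse_power(predicted, actual))
print(f_score(predicted, actual))
```

In this toy case the energy error is small even though half of the predicted ON samples are misplaced in time, which shows that the metrics capture different aspects of a trace.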

6.1.3 Parameter optimisation and training strategy

Having discussed the metrics used for evaluating NILM performance, we now discuss the tunable parameters in these NILM models. Since both CO and FHMM become computationally intractable as the number of appliances grows, NILM researchers often select the top-K appliances in terms of energy consumption to reduce the state space. Another parameter in these models is the number of states (N) used to model an appliance (2 states means that an appliance can either be ON or OFF). We vary K from 3 to 6 and N from 2 to 4 and find the accuracy of disaggregation for both fridge and HVAC. We used half of the data for training and the other half for evaluating disaggregation.
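To make the size of this parameter sweep concrete, the sketch below enumerates the (K, N) grid and the number of joint states each setting implies, assuming every one of the K appliances is modelled with the same N states. The names are hypothetical.

```python
from itertools import product

K_VALUES = range(3, 7)  # top-K appliances: 3 to 6
N_VALUES = range(2, 5)  # states per appliance: 2 to 4


def joint_state_count(k, n):
    """Number of joint states when each of k appliances has n states."""
    return n ** k


grid = [(k, n, joint_state_count(k, n)) for k, n in product(K_VALUES, N_VALUES)]
for k, n, count in grid:
    print(f"K={k} N={n} joint states={count}")
```

The count ranges from 8 joint states at K=3, N=2 to 4096 at K=6, N=4, which is why restricting K keeps CO and FHMM tractable.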

6.1.4 NILM accuracy

[Figure 8: three panels (CO, FHMM, Hart) with usage energy % on the y-axis (0 to 100) and an x-axis from 0.0 to 1.0. CO: #FN = 11, #FP = 8. FHMM: #FN = 9, #FP = 13. Hart: #FN = 7, #FP = 7.]

Figure 8: NILM algorithms show poor accuracy in identifying homes which need feedback for high fridge usage energy. Red dots indicate the homes which should be getting feedback based on analysis of submetered fridge data, while these algorithms would give feedback to all homes in the green region outside the elliptical boundary.

We now present the results of NILM evaluation on the Dataport data set. We also compare our results with the state of the art. From Table 1, we can see that for both fridge and HVAC, the benchmark algorithms we use are comparable in performance to existing literature. We could not include several recent works, for various reasons. Shao et al. [29] and Kim et al. [18] define precision and recall in terms of identification of appliance power within bounds, and it is non-trivial to convert their metrics into ours. Barker et al. [5] show that the performance of their tracking algorithm is comparable to Additive FHMM, which we already consider in our comparison. Kolter et al. [20] do not provide appliance level metrics. Since none of the above-mentioned works gave results on HVAC disaggregation under residential settings, we used the numbers given in the benchmark evaluation accompanying NILMTK [7]. It should be noted that many of the other approaches we compare with in Table 1 make fewer assumptions, for example about the availability of training data. However, this does not affect our argument, since they do not achieve substantially better performance according to conventional NILM metrics.

6.2 Fridge usage feedback

Having established that our NILM performance is on par with the state of the art, we now examine how accurately we can provide fridge usage feedback with the disaggregated power trace. Figure 8 shows that all three NILM algorithms have poor accuracy in identifying homes that need feedback for high fridge usage. False negatives (FN) are those homes that should be getting feedback but are not, and false positives (FP) are those homes that would wrongly get feedback. We now explain the reasons for the poor accuracy of the NILM algorithms used.

During the night hours, when typically only background appliances such as the fridge are running, Hart’s algorithm has good disaggregation accuracy. Due to this, Hart’s algorithm closely matches the baseline duty percentage computed on submetered data, as shown in Figure 9. However, Hart’s algorithm is susceptible to detecting false events and missing true events, especially during active hours when appliances similar in magnitude to the fridge may be operating. Thus, Hart’s algorithm underpredicts and overpredicts fridge compressor cycle durations during the day, creating a deviation in fridge usage. While the change in predicted cycle durations has a minimal impact on conventional metrics, it has a significant impact on the fridge usage energy metric. The median baseline duty percentage found by CO and FHMM is higher than the median baseline duty percentage on submetered data. Owing to the higher baseline duty percentage, usage energy in these homes is lower than submetered, thereby explaining the high false negative rate. The reason behind CO and FHMM finding a high baseline duty percentage is that the objective function in both these algorithms includes minimising the difference between the aggregate power and the sum of power for the predicted appliances. To satisfy this objective, these algorithms predict the fridge to be ON longer than it actually is during the night hours, when typically few loads are used. The high false positive rate can be explained by the small number of homes for which the baseline duty percentage learnt is much lower than that for submetered. This causes these homes to have a high usage energy, and thus to be predicted as candidates for feedback.

[Figure 9: baseline duty percentage (y-axis, 0.0 to 1.0) for CO, FHMM, Hart and Submetered.]

Figure 9: The baseline duty percentage found on Hart’s disaggregated power traces matches closely to the submetered one, while CO and FHMM show a wide variation from submetered.

6.3 Fridge defrost feedback

We find that our approach of breaking down fridge energy into baseline, defrost and usage is unable to find even a single defrost cycle when fed the disaggregated power data. This is due to the inadequacy of the NILM methods used in effectively learning and disaggregating the defrost state. CO and FHMM rely on KMeans and Expectation Maximisation algorithms respectively for learning the different states of an appliance. Because defrost events are rare in comparison to regular usage, these algorithms are not able to accurately associate a cluster with the defrost state. Instead, these algorithms try to find multiple clusters to explain the variation in fridge power when the compressor is ON. Hart’s algorithm, which relies on pairing rising and falling edges of similar magnitude in the power signal, is unable to learn the defrost state because the rising and falling edges of the defrost state have significantly different magnitudes.

[Figure 10: three panels (CO, FHMM, Hart) plotting predicted power (y-axis) against GT power (x-axis). CO: FP = 29. FHMM: FP = 18. Hart: FP = 44.]

Figure 10: All NILM algorithms estimated the steady state power levels of at least some fridges (shown in green) with errors over 10%, which means that estimates are not accurate enough to reliably detect malfunctioning fridges based on power draw.

[Figure 11: three confusion matrices of true labels (rows: No Feedback, Feedback) against predicted labels. CO: 13, 11 / 20, 14. FHMM: 8, 16 / 24, 10. Hart: 13, 11 / 25, 9.]

Figure 11: The accuracy of classifying homes into those with setback schedules decreases from 84% with submetered power traces to 53%, 69%, and 62% respectively with power traces produced by the three NILM algorithms.

6.4 Fridge power feedback

We now show the efficacy of feedback based on fridge power given NILM power traces. Since there were only 4 homes in the data set having a corresponding fridge of the same make and age, we evaluate this feedback assuming that for each fridge in the data set we had a corresponding identical fridge. For the identical fridge, we use the actual steady state power as its learnt steady state power. Ideally, none of these 95 fridges should be getting feedback based on fridge power. Figure 10 shows that NILM algorithms produce a high number of false positives due to estimating the steady state power levels with errors over 10%.

Hart’s algorithm learns a higher than actual steady state power for a large number of fridges. This can be explained by its clustering strategy during the learning stage, where pairs of rising-falling edges are clustered. Clustering is susceptible to learning fewer clusters than actual appliances, and thus some of the learnt clusters could span multiple appliances.

For CO and FHMM, the high number of false positives can be explained by the fact that using N=2 states may be optimal for NILM metrics, but is suboptimal for learning fridge steady state power. For N=3, the number of false positives reduces to 17 and 5 for CO and FHMM respectively. Between CO and FHMM, the better performance of FHMM can be attributed to its modelling of temporal relationships between states. Thus, in comparison to CO, it is less prone to assigning clusters to power values that don’t correspond to an actual fridge state.

6.5 HVAC setpoint feedback

We now evaluate the efficacy of HVAC feedback based on disaggregated power traces. Figure 11 shows that the accuracy of classifying homes into those with setback schedules decreases significantly for all NILM algorithms. We now explain the low classification accuracy based on the features used by the Random Forest classifier. Of the four features used, a1 and a3 are hard to interpret, and thus we provide an explanation based on Mins HVAC usage during morning hours. Most of the HVAC usage in the data set occurs during the night hours. Thus, NILM accuracy is likely to be highly dependent on night time HVAC disaggregation. Since only HVAC and fridge would typically be used in the night, and HVAC has a distinct, much higher power signature than the fridge, NILM accuracy for HVAC is decent (as per Table 1). However, during the morning hours, when there is typically more activity in the home, NILM accuracy for HVAC is expected to be lower. In Figure 12, we compare the error in prediction of minutes of HVAC usage for the different algorithms relative to submetered data. It can be seen that for all algorithms, accuracy is higher in the night. Thus, despite not having a high impact on NILM accuracy, the high error in predicting minutes of HVAC usage in the morning affects our classification accuracy.

[Figure 12: error in prediction of minutes of HVAC usage (%) (x-axis, 0 to 25) for CO, FHMM and Hart, shown separately for Morning and Night.]

Figure 12: NILM algorithms have high accuracy overall, but have higher error in the morning because other appliances are being used. However, the morning hours are critical to inferring whether a home has a setback schedule.
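The reported classification accuracies can be recomputed from the cell counts shown in Figures 7 and 11. The sketch below assumes that rows are true labels (No Feedback, Feedback) and columns are predicted labels (Feedback, No Feedback); this ordering is an inference, chosen because it reproduces the percentages quoted in the captions.

```python
# Cell counts read from Figures 7 and 11.
# Assumed layout: rows = true (No Feedback, Feedback),
# columns = predicted (Feedback, No Feedback).
CONFUSION = {
    "Submetered": [[5, 19], [30, 4]],
    "CO": [[13, 11], [20, 14]],
    "FHMM": [[8, 16], [24, 10]],
    "Hart": [[13, 11], [25, 9]],
}


def accuracy(matrix):
    """Fraction of homes whose predicted label matches the true label."""
    (false_pos, true_neg), (true_pos, false_neg) = matrix
    total = false_pos + true_neg + true_pos + false_neg
    return (true_pos + true_neg) / total


for name, matrix in CONFUSION.items():
    print(f"{name}: {accuracy(matrix):.1%}")
```

Each matrix sums to the 58 homes scored in Section 5.4, with 24 homes in the No Feedback row and 34 in the Feedback row.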

7. CONCLUSIONS

In this paper, we show that disaggregated power data has the potential to provide targeted, actionable energy feedback to homes. However, we found that state-of-the-art NILM is not accurate enough to enable such feedback. We believe that the community needs to revisit the metrics for gauging NILM performance, and our work is a step in that direction. We finally conclude that “If you can measure it, you may not necessarily be able to improve it.”

8. LIMITATIONS AND FUTURE WORK

Our HVAC energy model is far from perfect. In fact, if it were perfect in predicting HVAC temperature setpoints, we would not need to train a classifier on top. However, it must be noted that our work is tangential to such HVAC modelling and can build upon better HVAC models to provide feedback. In the future, we would like to improve our HVAC model with the aim of more accurate feedback. We plan to develop energy feedback models for some of the other appliances having a high energy impact, such as the water heater. We also plan to use long term trends in finding homes needing feedback. For instance, over time a fridge may develop an anomaly causing excessive energy usage. Further, this work shows a bound on the possible savings and does not explore whether users can actually adopt and apply the advice successfully, which we leave for future work.

9. ACKNOWLEDGEMENTS

This work was funded in part by NSF awards 1305362, 1038271, and 0845761. We would like to thank TCS Research and Development for supporting the first author through a PhD fellowship and the Department of Electronics and Information Technology (DeitY), Government of India for funding the project (Grant Number DeitY/RD/ITEA/4(2)/2012).

10. REFERENCES

[1] W. Abrahamse, L. Steg, C. Vlek, and T. Rothengatter. A review of intervention studies aimed at household energy conservation. Journal of environmental psychology, 25(3):273–291, 2005.

[2] K. Anderson, A. Ocneanu, D. Benitez, D. Carlson, A. Rowe, and M. Berges. BLUED: A fully labeled public dataset for Event-Based Non-Intrusive load monitoring research. In Proceedings of 2nd KDD Workshop on Data Mining Applications in Sustainability, Beijing, China, 2012.

[3] K. C. Armel, A. Gupta, G. Shrimali, and A. Albert. Is disaggregation the holy grail of energy efficiency? The case of electricity. Energy Policy, 52:213–234, 2013.

[4] S. Barker, S. Kalra, D. Irwin, and P. Shenoy. Nilm redux: The case for emphasizing applications over accuracy. In NILM-2014 Workshop, 2014.

[5] S. Barker, S. Kalra, D. Irwin, and P. Shenoy. Powerplay: creating virtual power meters through online load tracking. In Proceedings of the 1st ACM Conference on Embedded Systems for Energy-Efficient Buildings, pages 60–69. ACM, 2014.

[6] N. Batra, M. Gulati, A. Singh, and M. B. Srivastava. It’s Different: Insights into home energy consumption in India. In Proceedings of the Fifth ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Buildings, 2013.

[7] N. Batra, J. Kelly, O. Parson, H. Dutta, W. Knottenbelt, A. Rogers, A. Singh, and M. Srivastava. Nilmtk: An open source toolkit for non-intrusive load monitoring. In Proceedings of the 5th international conference on Future energy systems, pages 265–276. ACM, 2014.

[8] C. Beckel, W. Kleiminger, R. Cicchetti, T. Staake, and S. Santini. The eco data set and the performance of non-intrusive load monitoring algorithms. In Proceedings of the 1st ACM Conference on Embedded Systems for Energy-Efficient Buildings. ACM, 2014.

[9] V. L. Chen, M. A. Delmas, W. J. Kaiser, and S. L. Locke. What can we learn from high-frequency appliance-level energy metering? results from a field experiment. Energy Policy, 77:164–175, 2015.

[10] S. Darby. The effectiveness of feedback on energy consumption. A Review for DEFRA of the Literature on Metering, Billing and direct Displays, 2006.

[11] R. J. Dossat and T. J. Horan. Principles of refrigeration, volume 3. Wiley, 1961.

[12] EnergyStar.gov. Programmable thermostats for consumers.

[13] M. Gulati, S. S. Ram, and A. Singh. An in depth study into using emi signatures for appliance identification. In Proceedings of the 1st ACM Conference on Embedded Systems for Energy-Efficient Buildings, pages 70–79. ACM, 2014.

[14] S. Gupta, M. S. Reynolds, and S. N. Patel. Electrisense: single-point sensing using emi for electrical event detection and classification in the home. In Proceedings of the 12th ACM international conference on Ubiquitous computing, pages 139–148. ACM, 2010.

[15] W. J. D. Haan and A. T. Levin. A practitioner’s guide to robust covariance matrix estimation, 1996.

[16] G. W. Hart. Nonintrusive appliance load monitoring. Proceedings of the IEEE, 80(12):1870–1891, 1992.

[17] J. Kelly, N. Batra, O. Parson, H. Dutta, W. Knottenbelt, A. Rogers, A. Singh, and M. Srivastava. Nilmtk v0.2: a non-intrusive load monitoring toolkit for large scale data sets: demo abstract. In Proceedings of the 1st ACM Conference on Embedded Systems for Energy-Efficient Buildings, pages 182–183. ACM, 2014.

[18] H. Kim, M. Marwah, M. F. Arlitt, G. Lyon, and J. Han. Unsupervised disaggregation of low frequency power measurements. In SDM, volume 11, pages 747–758. SIAM, 2011.

[19] J. Z. Kolter and T. Jaakkola. Approximate Inference in Additive Factorial HMMs with Application to Energy Disaggregation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 1472–1482, La Palma, Canary Islands, 2012.

[20] J. Z. Kolter and M. J. Johnson. REDD: A public data set for energy disaggregation research. In Proceedings of 1st KDD Workshop on Data Mining Applications in Sustainability, San Diego, CA, USA, 2011.

[21] J. Lu, T. Sookoor, V. Srinivasan, G. Gao, B. Holben, J. Stankovic, E. Field, and K. Whitehouse. The smart thermostat: using occupancy sensors to save energy in homes. In Proceedings of the 8th ACM Conference on Embedded Networked Sensor Systems. ACM, 2010.

[22] O. Parson, G. Fisher, A. Hersey, N. Batra, J. Kelly, A. Singh, W. Knottenbelt, and A. Rogers. Dataport and nilmtk: A building data set designed for non-intrusive load monitoring. In Third IEEE Global Conference on Signal and Information Processing.

[23] O. Parson, S. Ghosh, M. Weal, and A. Rogers. Non-intrusive load monitoring using prior models of general appliance types. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, pages 356–362, Toronto, ON, Canada, 2012.

[24] O. Parson, S. Ghosh, M. Weal, and A. Rogers. An unsupervised training method for non-intrusive appliance load monitoring. Artificial Intelligence, 217:1–19, 2014.

[25] L. Perez-Lombard, J. Ortiz, and C. Pout. A review on buildings energy consumption information. Energy and buildings, 40(3):394–398, 2008.

[26] J. Ranjan and K. Whitehouse. Object hallmarks: Identifying object users using wearable wrist sensors.

[27] A. Rogers, S. Ghosh, R. Wilcock, and N. R. Jennings. A scalable low-cost solution to provide personalised home heating advice to households. In Proceedings of the 5th ACM Workshop on Embedded Systems For Energy-Efficient Buildings, pages 1–8. ACM, 2013.

[28] M. Saha, S. Thakur, A. Singh, and Y. Agarwal. Energylens: combining smartphones with electricity meter for accurate activity detection and user annotation. In Proceedings of the 5th international conference on Future energy systems, pages 289–300. ACM, 2014.

[29] H. Shao, M. Marwah, and N. Ramakrishnan. A temporal motif mining approach to unsupervised energy disaggregation. In Proceedings of the 1st International Workshop on Non-Intrusive Load Monitoring, Pittsburgh, PA, USA, volume 7, 2012.
