
Managing and mining

smart meter data – at scale

CSE Project Showcase 9 July 2013

Twitter: @cse_bristol #SmartMeterData

Introduction

Contents

- Introduction to the project, the data, and its applications

- Managing SM data at scale

- Getting valuable knowledge out of SM data

- Demo: Smart Meter Analytics, Scaled by Hadoop (SMASH)

- Where next?

- Discussion

Introduction

Project Background

“Generating Value from Smart Electricity Meter Data”

18-month TSB-supported collaboration

CSE, University of Bristol, SSE and Western Power Distribution

Three themes:

• Managing the data at scale
• Extracting useful knowledge
• Integrating the above in a user-facing application

Introduction

The data

A half-hourly timeseries for each smart meter / register

Content: date, time, consumption in the half hour.

For a single register: 17,520 records per year.

This is what 18 months of data looks like: [chart not reproduced]

Introduction

The data

EDRP:

• 18 months
• 16,250 smart metered households
• 16,250 smart electricity meters
• 9,364 smart gas meters
• 670m half-hourly records (E: 420m, G: 250m)
• 40GB of raw csv file data

Post rollout, per year, domestic only:

• 25m smart metered households
• 25m smart electricity meters
• 20m smart gas meters
• 800 billion half-hourly records (E: 450Bn, G: 350Bn)
• 50TB of raw csv file data

EDRP ~ 0.1% of a year’s domestic data
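These figures can be sanity-checked with simple arithmetic. A minimal sketch in Python, using the meter counts above and an assumed ~60 bytes per CSV row (derived from the 40GB / 670m-row EDRP figures):

```python
# Back-of-envelope check of the post-rollout record counts quoted above.
READINGS_PER_YEAR = 48 * 365                     # 17,520 half-hourly records per register

elec_records = 25_000_000 * READINGS_PER_YEAR    # ~438 billion (slide: ~450 Bn)
gas_records = 20_000_000 * READINGS_PER_YEAR     # ~350 billion (slide: ~350 Bn)
total_records = elec_records + gas_records       # ~788 billion (slide: ~800 Bn)

# The EDRP extract was ~40 GB for 670m rows, i.e. roughly 60 bytes per CSV row.
raw_csv_tb = total_records * 60 / 1e12           # ~47 TB (slide: ~50 TB)
print(f"{total_records:,} records, ~{raw_csv_tb:.0f} TB of raw CSV per year")
```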

Introduction

What might we use it for?

Improve existing processes

• Settlement
• Billing, reconciliation, audit
• Demand profiling
• Customer profiling & segmentation

New processes not possible without HH data at scale

• Localised prediction
• Distribution network planning and modelling
• Automated DSM – prediction and verification
• System state detection
• Individualised consumer energy services

Introduction

What are the essential processes?

Ingestion – getting the data into the system

Storage – keeping it there securely

Analysis and reporting

• Ad-hoc queries
• Transaction reports
• Descriptives and summaries (e.g. OLAP)
• Mining and modelling
• Visualisation

Data management & processing

More fundamentally

Moving data between storage, memory and CPU

Transforming it in the CPU into desired forms

There are physical constraints on the speed of this.

(These are relevant at the scale of smart meter datasets).

Data management & processing

Single machine RDBMS

[Diagram: single-machine architecture. CPU ~2.5GHz; memory ~10s of GB per machine (~1000 MB/s); storage ~1TB per disk (~100 MB/s)]

Using SQL Server to sum half hourly consumption:

• 4 bn records: ~ 1 hour
• 40 bn records: ~ 10 hours
• 1 year's worth: ~ 200 hours
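These timings are roughly what a disk-bound job would give. A rough sketch, assuming ~60 bytes per record (from the 40GB / 670m-row EDRP figures) and the ~100 MB/s disk read speed above:

```python
# Rough check: summing 4 bn half-hourly records on one machine is disk-bound.
BYTES_PER_RECORD = 60      # ~40 GB / 670m rows in the EDRP extract
DISK_READ_BPS = 100e6      # ~100 MB/s single-disk sequential read

records = 4_000_000_000
hours = records * BYTES_PER_RECORD / DISK_READ_BPS / 3600
print(f"~{hours:.1f} hours just to read the data")   # ~0.7 hours vs ~1 hour observed
```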

Data management & processing

Single machine RDBMS

Problem: the throughput of a single machine has not kept up with the growth in the size of datasets.

Solution: harness multiple individual machines (‘horizontal scaling’).

Problem: this is difficult and expensive using traditional relational database applications.

Data management & processing

Solution

Move away from traditional databases and use a purpose-designed (‘big data’) framework to get horizontal scaling:

• 1 machine (~£10k): 2.5GHz, 1 GB/s memory, 100 MB/s disk; a year's data in ~ a week
• 10 node cluster (~£50k): 25GHz, 10 GB/s memory, 1 GB/s disk; a year's data in ~ a day
• 100 node cluster (~£300k): 250GHz, 100 GB/s memory, 10 GB/s disk; a year's data in ~ an hour
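A minimal sketch of where those timings come from, assuming the aggregation stays disk-bound and a year of rollout data is ~50 TB of raw CSV (the figure quoted earlier):

```python
# Back-of-envelope: disk-bound processing time versus cluster size.
YEAR_OF_DATA_BYTES = 50e12      # ~50 TB of raw CSV for a year of domestic data
DISK_BPS_PER_NODE = 100e6       # ~100 MB/s of disk throughput per machine

for nodes in (1, 10, 100):
    hours = YEAR_OF_DATA_BYTES / (nodes * DISK_BPS_PER_NODE) / 3600
    print(f"{nodes:>3} node(s): ~{hours:.1f} hours")
# 1 node:    ~139 hours (of the order of a week once real-world overheads are added)
# 10 nodes:  ~14 hours  (~ a day)
# 100 nodes: ~1.4 hours (~ an hour)
```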

Data management & processing

Hadoop

Designed to solve the problem of exponentially growing data volumes (originally, Google's searchable copy of the web).

Harness a large number of commodity machines and low cost networking and storage.

Software takes a job (query, calculation, whatever) and ‘maps’ it out across the cluster.

In parallel each node locally processes a subset of the problem, before the results are ‘reduced’ back to a single dataset.

(Hence ‘Map/Reduce’)
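As an illustrative sketch only (not the project's actual code), the "sum half-hourly consumption per meter" job can be expressed as a Hadoop Streaming mapper and reducer; the CSV layout (meter_id, date, time, kwh) and the script name are assumptions:

```python
#!/usr/bin/env python3
# Illustrative Hadoop Streaming job: total consumption per meter.
# Assumed input layout: meter_id,date,time,kwh (one half-hourly reading per line).
import sys

def mapper():
    """Map: emit (meter_id, kwh) for every reading on stdin."""
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) >= 4:
            print(f"{fields[0]}\t{fields[3]}")

def reducer():
    """Reduce: Hadoop sorts by key, so sum consecutive runs of the same meter."""
    current_id, total = None, 0.0
    for line in sys.stdin:
        meter_id, kwh = line.rstrip("\n").split("\t")
        if meter_id != current_id and current_id is not None:
            print(f"{current_id}\t{total}")
            total = 0.0
        current_id = meter_id
        total += float(kwh)
    if current_id is not None:
        print(f"{current_id}\t{total}")

if __name__ == "__main__":
    # e.g. hadoop jar hadoop-streaming.jar -mapper "sum_kwh.py map" -reducer "sum_kwh.py reduce" ...
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Each mapper runs against a locally stored block of the input file, and the framework delivers the mapped output to the reducers grouped and sorted by meter id.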

Data management & processing

Experiments: SQL Server

Single high-performance machine: bottlenecked by the speed of the hard drive.

[Chart: aggregation query performance versus dataset size; SQL Server rows per second plotted against number of records, up to ~6 billion records (~400GB)]

Data management & processing

Experiments: Hadoop

11 node physical cluster (~£50k hardware cost)

[Chart: aggregation query performance versus dataset size; SMASH rows per second plotted against number of records, up to ~40 billion records (~2,500GB)]

Data management & processing

Experiments compared

Not straightforward to get SQL Server to run over ~10 Bn records.

[Chart: aggregation query performance versus dataset size; SMASH and SQL Server rows per second overlaid, up to ~40 billion records (~2,500GB)]

Data management & processing

Experiments: growing the cluster

Fixed dataset size of 500m records.

[Chart: aggregation query performance versus cluster size; SMASH speed in records per second against cluster size from 1 to 11 nodes, with a near-linear trend (R² = 0.91)]

Data management & processing

Hadoop

Pros

• Open source software – free and customisable
• Adjustable data redundancy (data is replicated over the cluster)
• Incrementally scalable – on both performance and cost measures: just add machines, system adapts automatically
• Responsive and cooperative developer community

Cons

• Not the last word in user-friendliness (but this is changing)
• Sledgehammer to crack a nut below a certain scale
• Less mature (but rapidly developing) software ecosystem
• Algorithms must fit the framework

Conclusion: low cost option for smart meter data processing

Data mining and visualisation

Finding value in the data

Improve existing processes

• Settlement
• Billing, reconciliation, audit
• Demand profiling
• Customer profiling & segmentation

New processes not possible without HH data at scale

• Localised prediction
• Distribution network planning and modelling
• Automated DSM – prediction and verification
• System state detection
• Individualised consumer energy services

Data mining and visualisation

Finding value in the data

Collaborative approach with industry partners to identify business needs

Focus on:

(1) Datamining for subgroup discovery – classifying end users

(2) Cluster analysis on demand data – finding profiles

(3) Innovative visualisation of consumption data and datamining results

Data mining and visualisation

Subgroup discovery

“Pattern features”: 14 variables describing each household

• Income, geography, access to gas, size of house, value of house, etc.

“Target features”: describe the behaviour of interest

• Profile error: how different is usage from the assigned profile?

Outputs:

• Groups of households with significantly different profile errors
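A minimal illustration of the target feature and a crude subgroup scan in Python/pandas; the column names, input file, and error definition here are assumptions rather than the project's actual method:

```python
# Sketch: % annual profile error per household, then a crude subgroup scan.
import pandas as pd

# Assumed columns: annual_kwh (metered), profile_kwh (implied by the assigned
# profile class), plus pattern features such as income_band and has_gas.
df = pd.read_csv("households.csv")

df["profile_error_pct"] = 100 * (df["annual_kwh"] - df["profile_kwh"]) / df["profile_kwh"]
overall = df["profile_error_pct"].mean()

# Does any single pattern-feature value pick out households whose mean
# error differs markedly from the population as a whole?
for feature in ["income_band", "has_gas", "house_size_band"]:
    by_group = df.groupby(feature)["profile_error_pct"].mean()
    print((by_group - overall).round(1).sort_values())
```

Proper subgroup discovery searches over combinations of pattern-feature values and scores each candidate subgroup, but the target feature is the same idea.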

Data mining and visualisation

Subgroup discovery

Looking at % annual profile error against sociodemographics.

[A sequence of slides showing charts of % annual profile error broken down by sociodemographic subgroups; charts not reproduced in this transcript]

Data mining and visualisation

Clustering

Can we use demand data to create better profiles?

Define target features: the properties of the demand waveform that are of interest

Two examples: using imposed and emergent properties.

Each using 3 clusters.

Data mining and visualisation

Clustering

E.g. 1: the average weekday represented as 5 pairs of numbers

Data mining and visualisation

Clustering

E.g. 2: Frequency spectrum of the demand timeseries
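A sketch of the two representations and the 3-cluster analysis, assuming an array of average-weekday profiles (48 half-hours per household); the choice of (mean, peak) per time band and of 10 low-order frequency terms are illustrative assumptions, not the project's exact features:

```python
# Sketch: two ways of turning half-hourly demand into cluster features,
# then k-means with 3 clusters (as in the slides).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
demand = rng.random((100, 48))  # placeholder for real average-weekday profiles

# E.g. 1: an imposed representation - the average weekday reduced to
# 5 time bands, each summarised by (mean, peak): 5 pairs of numbers.
bands = np.array_split(np.arange(48), 5)
imposed = np.column_stack(
    [np.c_[demand[:, b].mean(axis=1), demand[:, b].max(axis=1)] for b in bands]
)

# E.g. 2: an emergent representation - magnitudes of the low-order terms
# of the demand profile's frequency spectrum.
spectrum = np.abs(np.fft.rfft(demand, axis=1))[:, :10]

for name, features in [("imposed", imposed), ("spectrum", spectrum)]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
    print(name, np.bincount(labels))
```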

Data mining and visualisation

Cluster analysis

Project competition results (the University won)

[Chart: average % difference from the cluster centroid for each clustering approach, axis from 0.25 to 0.35]

Data mining and visualisation

Conclusions from datamining

Subgroup discovery results suggest the approach is useful as long as you have metadata on the households

Cluster analysis work suggests it is possible to improve on the standard profile classes using SM data

Further work needs to be carried out on more representative datasets

There are many other potential applications!

The SMASH application

Web application

Installation of Hadoop on UoB and CSE clusters

• 11 node physical cluster at the university (£50k)
• 8 node virtual cluster at CSE (£15k)

Integration of a range of Hadoop-friendly data management components

Development of a proof-of-concept web application for user interaction, job management, visualisation etc.

Deployment on both clusters

The SMASH application

Web application

Currently running on the CSE virtual Hadoop cluster

Generating Value from SM Data

Where next?

We have a proof-of-concept system developed with TSB R&D funding support.

We have mastered the underlying technologies and established that this approach has the potential to be a low-cost solution to a number of industry data challenges.

On a technical level the next steps are to:

• Further develop the web application
• Refine the datamining algorithms (with more data)
• Implement selected DM algorithms directly on the cluster

On a policy/programme level we want to ensure this knowledge is incorporated into SM rollout infrastructure decision making.

Questions and discussion

@cse_bristol #SmartMeterData

Contacts:

Simon Roberts simon.roberts@cse.org.uk

Joshua Thumim joshua.thumim@cse.org.uk

Web: www.cse.org.uk

Sign up to our monthly e-news through our website

Follow us on Twitter @cse_bristol
