British Gas Connected Homes: Data Engineering
TRANSCRIPT
![Page 1: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/1.jpg)
Data Engineering at British Gas Connected Homes
Josep Casals | @jcasals | Cassandra Summit 2015, Santa Clara, CA
![Page 2: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/2.jpg)
British Gas / Connected Homes
• British Gas is a 200-year-old company
• Connected Homes is BG’s IoT “startup”
• Leader in the UK’s connected home market
![Page 3: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/3.jpg)
Data Sources
• Gas and electricity meter readings
• Thermostat temperature data
• Connected boiler data
• Real time energy consumption data
• Introducing motion sensors, window and door sensors, etc.
![Page 4: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/4.jpg)
Meter Data
• Millions of gas and electricity customers
• ~600k smart meters
• Readings every 30 minutes from smart meters
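For a rough sense of scale, the two figures above imply the daily reading volume worked out below. This is back-of-the-envelope arithmetic from the slide's numbers only: it assumes every smart meter reports on schedule and treats "~600k" as exactly 600,000.

```python
# Figures taken from the slide; everything else is plain arithmetic.
SMART_METERS = 600_000                 # "~600k smart meters"
READINGS_PER_DAY = (24 * 60) // 30     # one reading every 30 minutes -> 48

daily_readings = SMART_METERS * READINGS_PER_DAY
print(f"{daily_readings:,} smart meter readings per day")  # 28,800,000
```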
![Page 5: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/5.jpg)
Machine Learning applied to Meter Data
• Energy disaggregation
• Similar homes comparison
• Smart meters used in indirect algorithms for non-smart customers
![Page 6: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/6.jpg)
Connected Thermostats
• > 200k Connected Thermostats
• Temperature data time series
![Page 7: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/7.jpg)
Connected Boilers
• Proactive maintenance
• Failure detection
![Page 8: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/8.jpg)
In-Home Displays in a Mobile App
• Data every 10 seconds
• Still needs an access device connected to the router
• Allows real time mobile alerts
![Page 9: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/9.jpg)
Technologies we use
Technologies we are trying
![Page 10: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/10.jpg)
Our Engineering Process
• Two points of friction at the intersection between teams
• Sharing datasets is problematic
• Real infrastructure too different from real environments
• New technologies too hard to deploy
• Time to production > 6 months
![Page 11: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/11.jpg)
Solution #1: Data Ops
• Data oriented DevOps instead of service oriented DevOps:
• Stateful instead of stateless
• Jobs instead of config
• Resource management instead of resource partitioning
![Page 12: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/12.jpg)
Solution #1: Data Ops
• Ansible and Docker:
1. Smooth transition from development and testing to production
2. Blue/green deployments
3. Swarm / Mesos + Docker = better use of infrastructure
• Time to production down to < 2 months :-|
![Page 13: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/13.jpg)
Future Solution #2: Data Science Environment
• Ideally Data Science models should be plug and play
• Python and R dataframes in Spark are promising, but data scientists don’t feel the need for Spark
• Data scientists prefer to work with relational DBs
• We need to find a way to make production datasets available to them
![Page 14: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/14.jpg)
Future Solution #2: Data Science Environment
• Possible solutions we are investigating are:
• Automated exports into a data science relational DB
• Spark SQL server
• Automatically generated environment images
• Objective is to reduce implementation time for new features to < 1 month
![Page 15: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/15.jpg)
Use Case | High Consumption Alerts
• The red dot on top is what we want to detect
• The green bottom dots are the baseline plus the fridge
![Page 16: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/16.jpg)
High Consumption Alerts | Data Ingest
• Very high volume of messages (every 10 seconds)
• Kafka partitions help us cope with volume
• (experimental) we’re trying Samza for quick sliding-window type transformations
• We often miss reads, so the Samza job also does basic interpolation
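The basic interpolation mentioned above could look something like the following. This is a minimal plain-Python sketch, not the Samza job from the talk. It assumes readings arrive as `(epoch_seconds, watts)` tuples sorted by time, that the 10-second cadence from the slide applies, and that linear interpolation is good enough; the name `fill_missing` is hypothetical.

```python
from typing import List, Tuple

STEP_SECONDS = 10  # reading cadence stated in the talk

Reading = Tuple[int, float]  # (epoch_seconds, watts); an assumed message shape


def fill_missing(readings: List[Reading], step: int = STEP_SECONDS) -> List[Reading]:
    """Insert linearly interpolated readings wherever samples were missed.

    `readings` must be sorted by timestamp. Existing readings are kept as-is.
    """
    if not readings:
        return []
    filled = [readings[0]]
    for (t0, v0), (t1, v1) in zip(readings, readings[1:]):
        t = t0 + step
        while t < t1:
            # Position of the missing sample between its two neighbours (0..1).
            position = (t - t0) / (t1 - t0)
            filled.append((t, v0 + position * (v1 - v0)))
            t += step
        filled.append((t1, v1))
    return filled


# Two reads were missed between t=0 and t=30.
print(fill_missing([(0, 100.0), (30, 130.0)]))
```

Linear interpolation is only one option; holding the last known value would be another reasonable choice for short gaps.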
![Page 17: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/17.jpg)
High Consumption Alerts | Spark Streaming with Cassandra
• Real time data comes from Kafka
• Cassandra stores historical usage information
• A Spark Streaming job combines both and applies a machine learning algorithm to generate high usage alerts
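To make the shape of that combination concrete, here is a small sketch in plain Python. The talk does not say which machine learning algorithm is used, so a simple "mean plus k standard deviations" threshold stands in for it here. The dict `historical` plays the role of the Cassandra table, the list `stream` plays the role of the Kafka feed, and all names and values are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, Iterable, Iterator, List, Tuple


def alert_threshold(history: List[float], k: float = 3.0) -> float:
    """Baseline mean plus k standard deviations of past usage (a stand-in rule)."""
    return mean(history) + k * pstdev(history)


def high_usage_alerts(
    stream: Iterable[Tuple[str, float]],
    historical: Dict[str, List[float]],
    k: float = 3.0,
) -> Iterator[Tuple[str, float]]:
    """Yield (customer_id, watts) for live readings above the customer's threshold."""
    for customer_id, watts in stream:
        history = historical.get(customer_id)
        if not history:
            continue  # no baseline yet, so this reading cannot be judged
        if watts > alert_threshold(history, k):
            yield customer_id, watts


# Made-up data: "c1" has a baseline around 100 W, "c2" has no history.
historical = {"c1": [100.0, 110.0, 90.0, 105.0, 95.0]}
stream = [("c1", 104.0), ("c1", 500.0), ("c2", 900.0)]
alerts = list(high_usage_alerts(stream, historical))
print(alerts)  # [('c1', 500.0)]
```

In a real Spark Streaming job the per-customer baseline would be joined from Cassandra for each micro-batch rather than held in a dict.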
![Page 18: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/18.jpg)
High Consumption Alerts | Overall Architecture
• Getting the partitions right is very important for scalability
• The Spark-Cassandra connector preserves Cassandra (C*) partitioning
• It’s important to match Kafka partitioning to CassandraRDD partitioning
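One way to read the last point: if the producer chooses the Kafka partition with the same function and the same partition count that the data model uses to pick a Cassandra bucket, then whatever consumes Kafka partition p only needs data from bucket p. The toy sketch below illustrates that idea only. It does not use Kafka, Spark, or the connector, and `NUM_PARTITIONS`, `kafka_partition`, and `cassandra_bucket` are hypothetical names. The hash is the multiplicative hash given later in the talk.

```python
import math
from collections import defaultdict

A = (math.sqrt(5) - 1) / 2   # constant from the talk's hashing slide
NUM_PARTITIONS = 8           # hypothetical; shared by the topic and the buckets


def partition_for(key: int, m: int = NUM_PARTITIONS) -> int:
    """h(k) = floor(m * frac(k * A)), clamped in case of float rounding."""
    return min(m - 1, math.floor(m * ((key * A) % 1.0)))


def kafka_partition(customer_id: int) -> int:
    """Partition the producer would write this customer's messages to."""
    return partition_for(customer_id)


def cassandra_bucket(customer_id: int) -> int:
    """Bucket the data model would store this customer's history in."""
    return partition_for(customer_id)


# For each Kafka partition, collect the buckets its customers live in.
buckets_per_partition = defaultdict(set)
for customer_id in range(1, 1001):
    buckets_per_partition[kafka_partition(customer_id)].add(cassandra_bucket(customer_id))

# Each partition maps to exactly one bucket: its own number.
print(dict(buckets_per_partition))
```

If the two sides used different functions or different counts, each Kafka partition would fan out to many buckets and the streaming job would have to shuffle or read across the cluster.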
![Page 19: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/19.jpg)
High Consumption Alerts | Main Spark loop
![Page 20: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/20.jpg)
Data Partitioning
• Data systems like Cassandra or Kafka scale by partitioning data
• Given enough partitions, any technology can work
• We need a simple hashing algorithm that works the same in many languages and across technologies
![Page 21: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/21.jpg)
Cassandra data modelling with buckets
• Using a hashing function that is uniform and deterministic, we can cope with time series data for any number of customers
• One of our preferred strategies is to use buckets
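A bucketing scheme of this kind might be sketched as follows. The transcript does not include the actual table definition, so the key layout here is an assumption. CRC-32 stands in for "a uniform and deterministic hash" because it is easy to reproduce for string ids. `NUM_BUCKETS`, `bucket_for`, and `partition_key` are hypothetical names.

```python
import zlib
from collections import Counter
from typing import Tuple

NUM_BUCKETS = 16  # hypothetical bucket count


def bucket_for(customer_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministic bucket for a customer id (CRC-32 is a stand-in hash)."""
    return zlib.crc32(customer_id.encode("utf-8")) % num_buckets


def partition_key(customer_id: str, day: str) -> Tuple[int, str]:
    """Assumed partition key layout: (bucket, day).

    A hypothetical table could then use
    PRIMARY KEY ((bucket, day), customer_id, reading_time).
    """
    return (bucket_for(customer_id), day)


# Check how evenly 10,000 made-up customer ids spread over the buckets.
counts = Counter(bucket_for(f"customer-{i}") for i in range(10_000))
for bucket in sorted(counts):
    print(bucket, counts[bucket])
```

With a fixed number of buckets per day, the number of partitions stays bounded however many customers are added, and reading one day of data means querying `NUM_BUCKETS` partitions.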
![Page 22: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/22.jpg)
h(k) = ⌊m * frac(kA)⌋
• Multiplicative hashing is our preferred simple partitioning algorithm
• k = the key being hashed; frac(x) = the fractional part of x
• m = number of partitions
• A ≈ (√5−1)/2 = 0.6180339887... (the golden ratio conjugate, 1/φ)
• Online example: jsfiddle.net/joscas/yfp72fq5
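The formula above translates almost directly into code. The following is a minimal Python sketch written from the formula on the slide, not a copy of the linked jsfiddle. It assumes integer keys; the function name and the clamp against floating-point rounding are additions for illustration.

```python
import math

A = (math.sqrt(5) - 1) / 2  # 0.6180339887..., as given on the slide


def multiplicative_hash(k: int, m: int) -> int:
    """h(k) = floor(m * frac(k * A)), returning a partition in 0..m-1."""
    if m <= 0:
        raise ValueError("m must be a positive number of partitions")
    frac = (k * A) % 1.0  # fractional part of k * A
    # frac is below 1, but m * frac could round up to m; clamp to be safe.
    return min(m - 1, math.floor(m * frac))


# Ten consecutive keys over ten partitions land in ten different partitions.
print([multiplicative_hash(k, 10) for k in range(1, 11)])
# [6, 2, 8, 4, 0, 7, 3, 9, 5, 1]
```

Because it uses only a multiply, a fractional part, and a floor on IEEE 754 doubles, the same few lines can be ported to most languages. For very large keys the product `k * A` loses fractional precision in double arithmetic, so a fixed-point integer variant may be preferable there.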
![Page 23: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/23.jpg)
Summary
• Increase in productivity with portable environments (Ansible, Docker, Mesos)
• Getting partitions straight is essential
• Using a simple common hashing algorithm across technologies and languages is very helpful
![Page 24: British Gas Connected Homes: Data Engineering](https://reader036.vdocuments.us/reader036/viewer/2022062901/58f167e61a28abc0338b456b/html5/thumbnails/24.jpg)
Summary
• Streaming technologies are rapidly evolving
• Spark Streaming is complex but has many advantages (Spark’s excellent integration with Cassandra, Spark’s ML libraries, etc.)
• Kafka ticks a lot of boxes for large scale distributed real time data systems