IoT Connected Brewery
1© Cloudera, Inc. All rights reserved.
Harnessing Data within Hadoop in the Connected Brewery: Kafka, Spark Streaming, and Kudu
Jason [email protected]
Internet of Things (IoT)
$1.7 Trillion
In Value
20% Annual Growth
30 Billion Things
250 Million
Connected Vehicles
Source - IDC & Gartner Estimates
Internet of Things
IoT Markets - 2020
IoT Will Drive An Explosion of Data…
Data expected to explode to 44 ZB by 2020
Source: IDC
44 Trillion GB!
80% of data will be unstructured
Value is maximized when data is combined with other sources
Value of Data is multiplied when you combine and correlate it with other data from relevant sources
Improvement in value that can be unlocked by combining data from multiple IoT applications and sources
SOURCE: McKinsey Global Institute analysis
Interoperability would significantly improve performance by combining sensor data from different machines and systems to provide decision makers with an integrated view of performance
40%
The IoT Ecosystem
Consumer
Industrial
IoT Gateway
Data Center
Data Analytics
Sensors/ Things
Data Characteristics
• Unstructured
• Intermittent
• Volume & Variety
Gateway
• Data Routing
• Edge-Processing
• Edge-Storage
Sensors/ Things
• To grow by 50X
• Drop in prices by 70% in last 5 years
Data Storage, Processing & Analytics
IoT Data Characteristics
• More processing in the cloud
• Analytics on the cloud
IoT Data Analytics
• Key to Value Creation
• Combine data from multiple sources & types
• Drive business insights
IoT Data Characteristics
• Distributed Data Processing
• Cloud & On-Premise
Cloud
IoT Attributes
• Low-powered devices, possibly battery powered
• Highly distributed
• Gateway/controller, possibly a mesh network
• Compact messages
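As an illustration of what "compact messages" can mean on a low-bandwidth link, here is a minimal sketch of a fixed-width binary encoding for one temperature reading. The field layout, the function names, and the sample values are assumptions made for this example; they are not taken from the talk.

```python
import json
import struct

# Hypothetical wire layout (an assumption, not from the talk):
#   H = sensor id (uint16)
#   I = Unix timestamp in seconds (uint32)
#   h = temperature in hundredths of a degree Celsius (int16)
# "<" selects little-endian with no padding, so every message is 8 bytes.
READING = struct.Struct("<HIh")


def encode_reading(sensor_id, timestamp, temp_c):
    """Pack one temperature reading into a fixed 8-byte message."""
    return READING.pack(sensor_id, timestamp, round(temp_c * 100))


def decode_reading(payload):
    """Unpack an 8-byte message into (sensor_id, timestamp, temp_c)."""
    sensor_id, timestamp, centi_c = READING.unpack(payload)
    return sensor_id, timestamp, centi_c / 100


# The same reading as a binary message and as JSON text, for size comparison.
message = encode_reading(7, 1_460_000_000, 18.25)
as_json = json.dumps({"sensor_id": 7, "timestamp": 1_460_000_000, "temp_c": 18.25})
```

The binary form trades readability and schema flexibility for size, which is the kind of trade-off a battery-powered device on a slow last mile tends to favor.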
IoT Challenges
• Multiple protocols (Z-Wave, Zigbee, Thread, etc.)
• Distributed, low-power devices may mean data arrives from multiple locations
• Devices may power off to save battery or move away from the controller, so late data must be handled
• Calibration between devices may be limited
• Very fast and bursty traffic
• Low-bandwidth last mile
Use Cases
• Yes, Contrived
• But a good excuse to:
  • Brew Beer
  • Buy more sensors and microprocessors
• Sorry Wife
Use Case - Calibration
• Sensors need to be calibrated continually
• Calibration takes resources and downtime
• Instead use historical raw data
• Calibrate on known values
• For temperature sensors use boiling point and triple point
• Temperature sensor is typically linear between these points
• Fit a curve instead
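The calibration idea can be sketched in a few lines. This is a minimal example, assuming water as the reference substance (triple point at 0.01 °C, nominal boiling point of 100 °C at standard pressure) and made-up raw readings. It shows a two-point linear calibration and, as the simplest possible "curve", an ordinary least-squares line fitted over historical (raw, reference) pairs. The function names are hypothetical, and a real sensor might need a higher-order fit.

```python
TRIPLE_POINT_C = 0.01    # triple point of water
BOILING_POINT_C = 100.0  # nominal boiling point at standard pressure


def two_point_calibration(raw_at_triple, raw_at_boiling):
    """Return a function mapping raw readings to degrees Celsius,
    assuming the sensor is linear between the two reference points."""
    slope = (BOILING_POINT_C - TRIPLE_POINT_C) / (raw_at_boiling - raw_at_triple)

    def calibrate(raw):
        return TRIPLE_POINT_C + slope * (raw - raw_at_triple)

    return calibrate


def least_squares_line(raw_values, reference_values):
    """Fit reference = slope * raw + intercept over historical pairs."""
    n = len(raw_values)
    mean_raw = sum(raw_values) / n
    mean_ref = sum(reference_values) / n
    sxx = sum((x - mean_raw) ** 2 for x in raw_values)
    sxy = sum((x - mean_raw) * (y - mean_ref)
              for x, y in zip(raw_values, reference_values))
    slope = sxy / sxx
    return slope, mean_ref - slope * mean_raw


# Made-up raw readings for a sensor that reads about 1.5 degrees high.
calibrate = two_point_calibration(raw_at_triple=1.51, raw_at_boiling=101.5)
slope, intercept = least_squares_line([1.51, 51.5, 101.5], [0.01, 50.0, 100.0])
```

With more historical points at known values, the least-squares fit averages out noise in individual readings, which the two-point method cannot do.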
Use Case - Optimize Models
• Kalman Filter is used to estimate a variable in the presence of noise
  • Need to know accuracy of sensor
  • Usually published by manufacturer but generalized
  • Accuracy can degrade over time
• PID Controller
  • 3 parameters control performance
  • Parameters different for each application
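To make the tunable quantities concrete, here is a minimal sketch of both models, assuming a scalar state (for example, one fermenter temperature). In the Kalman filter, `measurement_variance` stands in for the sensor accuracy mentioned above; in the PID controller, `kp`, `ki`, and `kd` are the three parameters. Class and parameter names are chosen for this example and are not from the talk.

```python
class ScalarKalmanFilter:
    """One-dimensional Kalman filter for a slowly varying quantity."""

    def __init__(self, initial_estimate, initial_variance,
                 process_variance, measurement_variance):
        self.estimate = initial_estimate
        self.variance = initial_variance
        self.process_variance = process_variance
        # Sensor noise variance: the value one would tune from historical data.
        self.measurement_variance = measurement_variance

    def update(self, measurement):
        # Predict: the state is assumed constant, uncertainty grows.
        predicted_variance = self.variance + self.process_variance
        # Correct: blend prediction and measurement by their uncertainties.
        gain = predicted_variance / (predicted_variance + self.measurement_variance)
        self.estimate += gain * (measurement - self.estimate)
        self.variance = (1 - gain) * predicted_variance
        return self.estimate


class PIDController:
    """Textbook PID controller with proportional, integral, derivative gains."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.previous_error = None

    def update(self, measurement, dt):
        error = self.setpoint - measurement
        self.integral += error * dt
        # No derivative term on the first call, since there is no prior error.
        if self.previous_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

A larger `measurement_variance` makes the filter trust each new reading less, so an outdated accuracy figure for a degraded sensor shows up directly as a poorly tuned filter.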
Use Case - Predictive Maintenance
• No, not just for heavy machinery
• Sensors fail too
• Can save money by not replacing too early
• More importantly, schedule downtime
• Better model with more data: sensors in the same application across many factories
Technologies
• Apache Kafka
  • Messaging Framework – Scalable, Fault Tolerant
  • Pub/Sub
  • Retains Data
• Apache Spark
  • General Purpose Distributed Processing Framework
  • Multiple Components including Streaming
  • Streaming continually processes data
• Apache Kudu
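The "Pub/Sub" and "Retains Data" points for Kafka can be illustrated with a toy in-memory log. This is a conceptual sketch only: it is not the Kafka client API, and the `RetainedLog` class and its methods are hypothetical names invented for this example. It shows that reading does not remove messages and that each consumer keeps its own position.

```python
class RetainedLog:
    """Toy append-only log: messages are kept after being read."""

    def __init__(self):
        self._messages = []
        self._offsets = {}  # consumer name -> next offset to read

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=10):
        """Return the next unread messages for this consumer."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch
```

Because the log retains data, a consumer that starts later (or restarts) can still read from the beginning, which is what lets a streaming job and a batch loader share the same topic.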
Kudu for IoT
Why it matters
Kudu use cases
Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes
• Machine data analytics
  • Example: IoT, Connected Cars, Network threat detection
  • Workload: Inserts, scans, lookups
• Time series
  • Examples: Streaming market data, fraud detection / prevention, risk monitoring
  • Workload: Inserts, updates, scans, lookups
• Online reporting
  • Example: Operational data store (ODS)
  • Workload: Inserts, updates, scans, lookups
How would we build the Analytics System Today?
• HDFS excels at:
  • Full table scans
  • Ad-hoc analytics
[Diagram: Sensors, Kafka / Pub-sub, Events, Ingest, Today’s Partition, Yesterday’s Partition, Historic Data, Analyst]
Ingest:
1. Have we accumulated enough data?
2. Flush into HDFS
Handling Late Arriving Data
[Diagram: real-time writes go to HDFS (Storage) partitions /cars/01-13/, /cars/01-14/, /cars/01-15/. A reconnecting car says "I’m back! I’ll upload yesterday’s data!" and sends data from 1-13.]
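The late-data problem with date-partitioned storage can be sketched as follows. This is a minimal example, assuming events are routed by the date they occurred rather than the date they arrived; the function names and the year in the sample dates are made up for illustration.

```python
from collections import defaultdict
from datetime import date


def partition_path(event_date, root="/cars"):
    """Directory for an event, chosen by when it happened."""
    return f"{root}/{event_date:%m-%d}/"


def route_events(events, today):
    """Group (event_date, payload) pairs by partition and list the
    partitions other than today's that received data."""
    by_partition = defaultdict(list)
    for event_date, payload in events:
        by_partition[partition_path(event_date)].append(payload)
    late = sorted(p for p in by_partition if p != partition_path(today))
    return dict(by_partition), late
```

Every path returned in `late` is a partition that was already written and now has to be reopened or rewritten, which is the extra work an append-oriented file layout imposes when devices reconnect with old data.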
Hybrid big data analytics pipeline
Before Kudu
[Diagram: Sensors, Kafka / Pub-sub, Events, Consumer, HBase, Random Reads, Snapshot & Convert to Parquet, Compact late arriving data, HDFS (Storage), Analytics, Analyst]
Hybrid big data analytics pipeline
After Kudu
[Diagram: Sensors, Kafka / Pub-sub, Events, Consumer, Kudu, Random Reads, Analytics, Analyst]
Kudu supports a simultaneous combination of sequential and random reads and writes
What Kudu is *NOT*
• Not a SQL interface itself
  • It’s just the storage layer
• Not an application that runs on HDFS
  • It’s an alternative, native Hadoop storage engine
• Not a replacement for HDFS or HBase
  • Select the right storage for the right use case
Kudu Trade-Offs (vs HBase)
• Random updates will be slower
  • HBase model allows random updates without incurring a disk seek
  • Kudu requires a key lookup before update, Bloom lookup before insert
• Single-row reads may be slower
  • Columnar design is optimized for scans
  • Future: may introduce “column groups” for applications where single-row access is more important
Demo