near realtime processing over hbase
TRANSCRIPT
![Page 1: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/1.jpg)
Near-Realtime Processing over HBase
Ryan Brush
@ryanbrush
![Page 2: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/2.jpg)
Topics
- The story so far
- Complementing MapReduce with stream-based processing
- Techniques and lessons
- Query and search
- The future
![Page 3: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/3.jpg)
The story so far...
![Page 4: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/4.jpg)
Chart Search
![Page 5: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/5.jpg)
Chart Search
- Information extraction
- Semantic markup of documents
- Related concepts in search results
- Processing latency: tens of minutes
![Page 6: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/6.jpg)
Medical Alerts
![Page 7: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/7.jpg)
Medical Alerts
- Detect health risks in incoming data
- Notify clinicians to address those risks
- Quickly include new knowledge
- Processing latency: single-digit minutes
![Page 8: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/8.jpg)
Exploring live data
![Page 9: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/9.jpg)
Exploring live data
- Novel ways of exploring records
- Pre-computed models matching users’ access patterns
- Very fast load times
- Processing latency: seconds or faster
![Page 10: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/10.jpg)
And many others
Population analytics
Care coordination
Personalized health plans
- Data sets growing at hundreds of GBs per day
- Approaching 1 petabyte total data
- Rate is increasing; expecting multi-petabyte data sets
![Page 11: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/11.jpg)
A trend towards competing needs
- Analyze all data holistically
- Quickly apply incremental updates
![Page 12: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/12.jpg)
A trend towards competing needs
MapReduce
- (Re-)process all data
- Move computation to data
- Output is a pure function of the input
- Assumes set of static input
Stream
- Incremental updates
- Move data to computation
- Needs to clean up outdated state
- Input may be incomplete or out of order
Both processing models are necessary and the underlying logic must be the same
![Page 13: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/13.jpg)
A trend towards competing needs
Speed Layer
Batch Layer
http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
![Page 14: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/14.jpg)
A trend towards competing needs
Speed Layer
- Low latency (seconds to process)
- Move data to computation
- Hours of data
- Incremental updates
Batch Layer
- High latency (minutes or hours to process)
- Move computation to data
- Years of data
- Bulk loads
http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
![Page 15: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/15.jpg)
A trend towards competing needs
Realtime Layer
- Stream-based
- Storm
Batch Layer
- MapReduce
- Hadoop
![Page 16: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/16.jpg)
Into the rabbit hole
- A ride through the system
- Techniques and lessons learned along the way
![Page 17: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/17.jpg)
Data ingestion
- Stream data into HTTPS service
- Content stored as Protocol Buffers
- Mirror the raw data as simply as possible
/source:1/document:123
/source:2/allergy:345
/source:2/document:456
/source:2/order:234
…
/source:n/prescription:789
[Diagram: Source Systems 1 … N → HTTPS → Collector Service → HBase]
![Page 18: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/18.jpg)
Process incoming data
- Initially modeled after Google Percolator
- “Notification” records indicate changes
- Scan for notifications
Data Table
source:1/document:123
source:2/allergy:345
source:2/document:456
. . .
source:150/order:71
Notification Table (scan for updates)
source:1/document:123
source:150/order:71
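As a rough sketch of the notification idea on this slide: every write to the data table also leaves a marker in a notification table, so a consumer only scans the (small) notification table instead of the whole data set. The two dictionaries below are in-memory stand-ins, not an HBase client, and the function names are hypothetical.

```python
# In-memory stand-ins for the two HBase tables on the slide (not a real client).
data_table = {}
notification_table = {}

def write_record(row_key, content):
    """Store the record and leave a notification marking it as changed."""
    data_table[row_key] = content
    notification_table[row_key] = True

def scan_notifications():
    """Return changed records in row-key order, as a table scan would."""
    return [(key, data_table[key]) for key in sorted(notification_table)]

# One record was written earlier and already processed: it has no notification.
data_table["source:2/allergy:345"] = b"old, already processed"

write_record("source:1/document:123", b"doc bytes")
write_record("source:150/order:71", b"order bytes")

pending = scan_notifications()
print([key for key, _ in pending])
```

Only the two freshly written keys come back from the scan; the older allergy record is untouched.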
![Page 19: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/19.jpg)
But there’s a catch…
- Percolator-style notification records require external coordination
- More infrastructure to build, maintain
- …so let’s use HBase’s primitives
![Page 20: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/20.jpg)
Process incoming data (scan for updates)
- Consumers scan for items to process
- Atomically claim lease records (CheckAndPut)
- Clear the record and notifications when done
- ~3000 notifications per second per node

| Row Key | Qualifiers (lease record and keys of updated items) |
| --- | --- |
| split:0 | 0000_LEASE, source:2/allergy:345, source:150/order:71, … |
| split:1 | 0000_LEASE, source:4/problem:78, source:205/document:52, … |
| . . . | |
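The lease scheme above can be sketched with an atomic compare-and-set. This is a minimal in-memory model, assuming a `FakeTable` class that imitates the semantics of an HBase check-and-put; it is not the speaker's code and not the HBase API. The number of splits, the CRC-based split assignment, and all function names are assumptions for illustration, and lease expiry for crashed consumers is left out.

```python
import threading
import zlib

NUM_SPLITS = 4          # assumed number of notification rows ("split:0" .. "split:3")
LEASE = "0000_LEASE"    # lease qualifier, as shown on the slide

class FakeTable:
    """In-memory stand-in for an HBase table with an atomic check-and-put."""
    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def put(self, row, qualifier, value):
        with self._lock:
            self._rows.setdefault(row, {})[qualifier] = value

    def check_and_put(self, row, qualifier, expected, value):
        """Write value only if the cell currently equals expected (None = absent)."""
        with self._lock:
            cells = self._rows.setdefault(row, {})
            if cells.get(qualifier) != expected:
                return False
            cells[qualifier] = value
            return True

    def get_row(self, row):
        with self._lock:
            return dict(self._rows.get(row, {}))

    def delete(self, row, qualifiers):
        with self._lock:
            cells = self._rows.get(row, {})
            for qualifier in qualifiers:
                cells.pop(qualifier, None)

def split_for(key):
    """Spread notifications over several rows so no single region gets hot."""
    return "split:%d" % (zlib.crc32(key.encode()) % NUM_SPLITS)

def notify(table, key):
    table.put(split_for(key), key, b"")

def claim_and_process(table, split_row, consumer_id, handler):
    """Claim the lease on one split row, process its notifications, then clear them."""
    if not table.check_and_put(split_row, LEASE, None, consumer_id):
        return 0  # another consumer already holds the lease
    keys = [q for q in table.get_row(split_row) if q != LEASE]
    for key in keys:
        handler(key)
    table.delete(split_row, keys + [LEASE])
    return len(keys)

table = FakeTable()
key = "source:2/allergy:345"
notify(table, key)
seen = []
claimed = claim_and_process(table, split_for(key), "consumer-1", seen.append)
```

Because only one `check_and_put` on the lease cell can succeed, two consumers scanning the same split never process it at the same time; a consumer that loses the race simply moves on to another split.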
![Page 21: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/21.jpg)
Advantages
- No additional infrastructure
- Leverages HBase guarantees
- No lost data
- No stranded data due to machine failure
- Robust to volume spikes of tens of millions of records
![Page 22: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/22.jpg)
Downsides
- Weak ordering guarantees
- Processing must be idempotent
- Lots of garbage from deleted cells
- Schedule major compactions!
- Must split to avoid hot regions
- Potentially better options emerging
- Apache Kafka with replication
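To illustrate why "processing must be idempotent" matters here: with weak ordering and leases that can be retried, the same notification may be handled more than once. A small sketch, assuming a hypothetical `process` function and an in-memory dictionary in place of the output table, shows a handler whose repeated execution leaves the same result.

```python
processed_table = {}  # in-memory stand-in for the processed-data table

def process(row_key, raw):
    """Derive output purely from the input and write it under a deterministic key."""
    processed_table["summary/" + row_key] = raw.strip().upper()

# The same notification may be delivered several times; the result is unchanged.
for _ in range(3):
    process("source:2/allergy:345", " penicillin ")

print(processed_table)
```

A handler that appended to a list or incremented a counter on each call would not have this property and would need de-duplication.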
![Page 23: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/23.jpg)
Measure Everything
- Instrumented HBase client to see effective performance
- We use Coda Hale’s Metrics API and Graphite Reporter
- Revealed impact of hot HBase regions on clients
![Page 24: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/24.jpg)
The story so far
[Diagram: Source Systems 1 … N → HTTPS → Collector Service → load data → HBase (Data, Notifications); Incremental Processors scan HBase for updates]
![Page 25: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/25.jpg)
Into the Storm
- Storm: scalable processing of data in motion
- Complements HBase and Hadoop
- Guaranteed message processing in a distributed environment
- Notifications scanned by a Storm Spout
![Page 26: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/26.jpg)
Processing with Storm
[Diagram: Source Systems 1 … N → HTTPS → Collector Service → HBase (Raw Data) → Spout → Bolts → HBase (Processed Data) → Apps and Services]
![Page 27: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/27.jpg)
Challenges of incremental updates
- Incomplete data
- Outdated state
- Difficult to reason about changing state and timing conditions
![Page 28: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/28.jpg)
Handling Incomplete Data
| Row Key | Summary Family | Staging Family |
| --- | --- | --- |
| document:1 | | page:1 |
Incoming data
- Process (map) components into a staging family
![Page 29: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/29.jpg)
Handling Incomplete Data
| Row Key | Summary Family | Staging Family |
| --- | --- | --- |
| document:1 | | page:1, page:3 |
Incoming data
- Process (map) components into a staging family
![Page 30: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/30.jpg)
Handling Incomplete Data
| Row Key | Summary Family | Staging Family |
| --- | --- | --- |
| document:1 | | page:1, page:2, page:3 |
Incoming data
- Process (map) components into a staging family
![Page 31: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/31.jpg)
Handling Incomplete Data
| Row Key | Summary Family | Staging Family |
| --- | --- | --- |
| document:1 | document_summary | page:1, page:2, page:3 |

Incoming data
- Process (map) components into a staging family
- Merge (reduce) components when everything is available
- Many cases need no merge phase; consuming apps simply read all of the components
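The staging-then-merge flow on these slides could look roughly like the following. It is a sketch under one explicit assumption: the processor knows how many pages the document should have (`expected_pages`), which the slides do not specify. The row is modeled as a plain dictionary with a "staging" and a "summary" family.

```python
def stage_page(row, page_no, text, expected_pages):
    """Map step: store one component; the reduce step runs once all pages are present."""
    row["staging"]["page:%d" % page_no] = text
    if len(row["staging"]) == expected_pages:
        ordered = sorted(row["staging"], key=lambda q: int(q.split(":")[1]))
        row["summary"]["document_summary"] = " ".join(row["staging"][q] for q in ordered)
    return row

row = {"summary": {}, "staging": {}}
stage_page(row, 1, "first", expected_pages=3)
stage_page(row, 3, "third", expected_pages=3)   # arrives out of order
summary_before = dict(row["summary"])           # still empty: page 2 is missing
stage_page(row, 2, "second", expected_pages=3)
print(row["summary"])
```

Pages can arrive in any order; the summary family stays empty until the last missing page lands in the staging family.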
![Page 32: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/32.jpg)
Outdated State
Incoming Data
- Time 0: Alice lives in Chicago
- Time 1: Alice lives in New York
Processed Data
- Chicago resident index
- New York resident index

- Big Data
  - MapReduce: rebuild processed data
  - Outdated state is simply ignored
- Fast Updates
  - ACID database: simply update Alice’s location
- Big and Fast: it gets complicated
![Page 33: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/33.jpg)
Outdated State: Reconcile on Read
[Diagram: Historical Data (MapReduce Output) and Incremental Updates → Merge → Application]
- Akin to Marz’s Lambda Architecture
- Data stores optimized for specific workloads
- Keeps processing models independent
- Adds complexity at read time, but simpler overall
- Not available in commodity app stacks
- Probably best approach when and if higher-level abstractions emerge
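A minimal sketch of reconcile-on-read, using the Alice example from the previous slide: the batch view and the incremental updates are kept separately, and the application merges them per query, letting the newer timestamp win. The data layout and the extra resident "bob" are assumptions made for illustration.

```python
def read_residents(city, batch_view, realtime_updates):
    """Merge the batch view with newer incremental updates at read time."""
    location = dict(batch_view)  # person -> (city, timestamp)
    for person, (new_city, ts) in realtime_updates.items():
        if person not in location or ts > location[person][1]:
            location[person] = (new_city, ts)
    return sorted(p for p, (c, _) in location.items() if c == city)

batch_view = {"alice": ("Chicago", 0), "bob": ("Chicago", 0)}   # MapReduce output
realtime_updates = {"alice": ("New York", 1)}                   # incremental updates

chicago = read_residents("Chicago", batch_view, realtime_updates)
new_york = read_residents("New York", batch_view, realtime_updates)
print(chicago, new_york)
```

Neither store is ever corrected in place; the stale Chicago entry for Alice is simply overridden during the merge.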
![Page 34: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/34.jpg)
Outdated State: Reconcile on Write
Incoming Data
- Time 0: Alice lives in Chicago
- Time 1: Alice lives in New York
Processed Data
- Chicago resident index
- New York resident index

- Keep history of your incoming data
- When the event at Time 1 occurs, read that history and update both indexes
- Works with many existing data stores
- Adds complexity to processing logic
- Data store must handle MapReduce and realtime loads; may not be optimal
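Reconcile-on-write can be sketched as follows: the processor keeps the history of incoming data, and when a new event arrives it looks up the previous state and fixes both affected index entries in the same step. The dictionaries and the function name are hypothetical stand-ins for the real data store.

```python
from collections import defaultdict

history = {}                        # person -> last known city (history of incoming data)
resident_index = defaultdict(set)   # city -> set of residents (processed data)

def on_location_event(person, city):
    """Update both the old and the new index entries when a person moves."""
    previous = history.get(person)
    if previous is not None and previous != city:
        resident_index[previous].discard(person)   # clean up outdated state
    resident_index[city].add(person)
    history[person] = city

on_location_event("alice", "Chicago")    # Time 0
on_location_event("alice", "New York")   # Time 1
print(dict(resident_index))
```

Readers see a single, already-consistent index, at the cost of extra logic and an extra read on every write.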
![Page 35: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/35.jpg)
Different models, same logic
- Incremental updates like a rolling MapReduce
- Functions are the center of the universe (not InputFormats or Messages)
- Write logic as pure functions, coordinate with higher-level libraries
  - Storm
  - Apache Crunch
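One way to read "functions are the center of the universe": the business logic lives in a pure function, and thin batch and streaming drivers both call it. The sketch below uses plain Python drivers as stand-ins for Crunch and Storm; the record shape and all names are assumptions.

```python
def extract_city(record):
    """Pure function: the output depends only on the input record."""
    return record["person"], record["city"].strip().title()

def run_batch(records):
    """Batch driver: reprocess every record and rebuild the output from scratch."""
    return dict(extract_city(r) for r in records)

def run_incremental(state, record):
    """Streaming driver: apply the same function to one new record."""
    person, city = extract_city(record)
    state[person] = city
    return state

records = [
    {"person": "alice", "city": " chicago"},
    {"person": "alice", "city": "new york "},
]

batch_out = run_batch(records)
stream_out = {}
for r in records:
    run_incremental(stream_out, r)
print(batch_out, stream_out)
```

Because both drivers share `extract_city`, the batch and incremental paths cannot drift apart in how they interpret a record.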
![Page 36: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/36.jpg)
Getting complicated?
- Incremental logic is complex and error prone
- Use MapReduce as a failsafe
[Diagram: Source Systems 1 … N → HTTPS → Collector Service → HBase (Raw Data) → Spout → Bolts → HBase (Processed Data) → Apps and Services; MapReduce also runs from Raw Data to Processed Data]
![Page 37: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/37.jpg)
Reprocess during uptime
- Deploy new incremental processing logic
- “Older” timestamps produced by MapReduce
- The most recently written cell in HBase need not be the logical newest

| Row Key | Document Family |
| --- | --- |
| document:1 | {doc, ts=50} |
| document:2 | {doc, ts=100} |

Real-time incremental update adds: {doc, ts=300}
MapReduce outputs add: {doc, ts=200}, {doc, ts=200}
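The timestamp trick can be modeled in a few lines: cells are versioned by an explicit timestamp, and a read returns the version with the highest timestamp regardless of the order in which the writes arrived. The list-of-tuples cell store below is an in-memory stand-in, and the values are made up to mirror the timestamps on the slide.

```python
def put(cells, value, ts):
    """Store a versioned cell; the order of writes does not matter."""
    cells.append((ts, value))

def get_latest(cells):
    """Return the value with the highest timestamp, not the most recent write."""
    return max(cells, key=lambda cell: cell[0])[1]

cells = []
put(cells, "doc v1", 50)
put(cells, "doc from real-time update", 300)
put(cells, "doc from MapReduce reprocessing", 200)   # written last, logically older

latest = get_latest(cells)
print(latest)
```

The MapReduce output lands after the real-time update but carries an older timestamp, so it fills in history without clobbering the newer value.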
![Page 38: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/38.jpg)
Completing the Picture
[Diagram: Source Systems 1 … N → HTTPS → Collector Service → HBase (Raw Data) → Spout → Bolts → HBase (Processed Data) → Apps and Services; MapReduce also runs from Raw Data to Processed Data]
![Page 39: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/39.jpg)
Completing the Picture
[Diagram: Source Systems 1 … N → HTTPS → Collector Service → HBase (Raw Data) → Spout → Bolts → HBase (Processed Data) → Apps, Services and Search Indexes; MapReduce also runs from Raw Data to Processed Data]
![Page 40: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/40.jpg)
Building indexes with MapReduce
- A shard per task
- Build index in Hadoop
- Copy to index hosts
[Diagram: three Map Tasks, each running Embedded Solr and producing one Index Shard]
![Page 41: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/41.jpg)
Pushing incremental updates
- POST new records
- Bursts can overwhelm target hosts
- Consumers must deal with transient failures
[Diagram: a Processor pushes a data stream to three Solr Shards, each with a Replica]
![Page 42: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/42.jpg)
Pulling indexes from HBase
- Custom Solr plugin scans a range of HBase rows
- Time-based scan to get only updates
- Pulls items to index from HBase
- Cleanly recovers from volume spikes and transient failures
HBase rows: person:1, person:2, . . ., person:n, person:n + 1, …, person:m
[Diagram: each Solr shard scans its own range of HBase rows]
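The pull model might be sketched like this: each shard owns a row range, remembers the time of its last successful poll, and asks only for rows modified since then. The `scan` function and `ShardIndexer` class are hypothetical in-memory stand-ins for the HBase scan and the custom Solr plugin; handling of clock skew between writers and the indexer is left out.

```python
def scan(table, start_row, stop_row, min_ts):
    """Return rows in [start_row, stop_row) modified at or after min_ts."""
    return [(key, value) for key, (value, ts) in sorted(table.items())
            if start_row <= key < stop_row and ts >= min_ts]

class ShardIndexer:
    """Pulls updates for one row range (hypothetical stand-in for the Solr plugin)."""
    def __init__(self, start_row, stop_row):
        self.start_row, self.stop_row = start_row, stop_row
        self.last_ts = 0
        self.index = {}

    def poll(self, table, now):
        for key, value in scan(table, self.start_row, self.stop_row, self.last_ts):
            self.index[key] = value
        self.last_ts = now  # advanced only after a fully successful poll

# row key -> (content, last-modified timestamp)
table = {"person:1": ("a", 10), "person:2": ("b", 20), "person:7": ("c", 15)}
shard = ShardIndexer("person:1", "person:5")
shard.poll(table, now=25)
first_index = dict(shard.index)

table["person:2"] = ("b2", 30)   # an update arrives later
shard.poll(table, now=35)
print(shard.index)
```

If a poll fails partway through, `last_ts` is not advanced, so the next poll rescans the same window; that is how this model recovers from spikes and transient failures without losing updates.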
![Page 43: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/43.jpg)
A note on schema: simplify it!
- Heterogeneous row keys efficient but hard to reason about
- Must inspect row key to know what it is
- Mismatches tools like Pig or Hive

| Row Key | Qualifiers |
| --- | --- |
| person:1/name | <content> |
| person:1/address | <content> |
| person:1/friend:1 | <content> |
| person:1/friend:2 | <content> |
| person:2/name | <content> |
| … | |
| person:n/name | <content> |
| person:n/friend:m | <content> |
![Page 44: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/44.jpg)
Logical parent per row
- The row is the unit of locality
- Tabular layout is easy to understand
- No lost efficiency for most cases
- See "HBase Schema Design", Ian Varley, HBaseCon 2012

| Row Key | Qualifiers |
| --- | --- |
| person:1 | name:<…> address:<…> friend:1:<…> friend:2:<…> |
| person:2 | name:<…> address:<…> friend:1:<…> |
| . . . | |
| person:n | name:<…> address:<…> friend:1:<…> |
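To make the difference between the two layouts concrete, here is a small sketch that regroups the heterogeneous-row-key layout from the previous slide into one row per logical parent. The sample values are made up, and the dictionaries stand in for HBase rows and qualifiers.

```python
from collections import defaultdict

def to_parent_rows(flat_cells):
    """Group 'person:N/qualifier' row keys into one row per person."""
    rows = defaultdict(dict)
    for row_key, content in flat_cells.items():
        parent, qualifier = row_key.split("/", 1)
        rows[parent][qualifier] = content
    return dict(rows)

flat = {
    "person:1/name": "Alice",
    "person:1/friend:2": "person:2",
    "person:2/name": "Bob",
}

rows = to_parent_rows(flat)
print(rows)
```

In the regrouped form every row is a person, so a reader (or a tool such as Pig or Hive) no longer has to parse the row key to find out what kind of record it holds.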
![Page 45: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/45.jpg)
The path forward
![Page 46: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/46.jpg)
This pattern has been successful… but complexity is our biggest enemy
![Page 47: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/47.jpg)
We may be in the assembly language era of big data
![Page 48: Near Realtime Processing over HBase](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a78d021a28ab266e8b4871/html5/thumbnails/48.jpg)
Higher-level abstractions for these patterns will emerge
It’s going to be fun