Sept 17 2013 - THUG - HBase a Technical Introduction


DESCRIPTION

HBase Technical Introduction. This deck includes a description of memory design, write path, read path, some operational tidbits, SQL on HBase (Phoenix and Hive), as well as HOYA (HBase on YARN).

TRANSCRIPT

Page 1: Sept 17 2013 - THUG - HBase a Technical Introduction

Deep Dive content by Hortonworks, Inc. is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

HBase Technical Deep Dive Sept 17 2013 – Toronto Hadoop User Group Adam Muise [email protected]

Page 2: Deep Dive Agenda

• Background – (how did we get here?)

• High-level Architecture – (where are we?)

• Anatomy of a RegionServer – (how does this thing work?)

• Using HBase – (where do we go from here?)

Page 3: Background

Page 4: So what is a BigTable anyway?

• BigTable paper from Google, 2006, Chang et al. – “Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.” – http://research.google.com/archive/bigtable.html

• Key Features:
  – Distributed storage across a cluster of machines
  – Random, online read and write data access
  – Schemaless data model (“NoSQL”)
  – Self-managed data partitions

Page 5: Modern Datasets Break Traditional Databases

> 10x more always-connected mobile devices than seen in the PC era.
> Sensor, video and other machine-generated data easily exceeds 100 TB/day.
> Traditional databases can’t serve modern application needs.

Page 6: Apache HBase: The Database For Big Data

More data is the key to richer application experiences and deeper insights.

With HBase you can:
• Ingest and retain more data, to petabyte scale and beyond.
• Store and access huge data volumes with low latency.
• Store data of any structure.
• Use the entire Hadoop ecosystem to gain deep insight on your data.

Page 7: HBase At A Glance

[Figure: CLIENT LAYER → HBASE LAYER → HDFS LAYER, annotated with the numbered points below.]

1. Clients automatically load balanced across the cluster.
2. Scales linearly to handle any load.
3. Data stored in HDFS allows automated failover.
4. Analyze data with any Hadoop tool.

Page 8: HBase: Real-Time Data on Hadoop

>  Read, Write, Process and Query data in real time using Hadoop infrastructure.

Page 9: HBase: High Availability

> Data safely protected in HDFS.
> Failed nodes are automatically recovered.
> No single point of failure, no manual intervention.

[Figure: two HBase nodes, with their data replicated across multiple HDFS nodes.]

Page 10: HBase: Multi-Datacenter Replication

> Replicate data to 2 or more datacenters.
> Load balancing or disaster recovery.

Page 11: HBase: Seamless Hadoop Integration

> HBase makes deep analytics simple using any Hadoop tool.
> Query with Hive, process with Pig, classify with Mahout.


Page 12: Apache Hadoop in Review

• Apache Hadoop Distributed Filesystem (HDFS)
  – Distributed, fault-tolerant, throughput-optimized data storage
  – Uses a filesystem analogy, not structured tables
  – The Google File System, 2003, Ghemawat et al. – http://research.google.com/archive/gfs.html

• Apache Hadoop MapReduce (MR)
  – Distributed, fault-tolerant, batch-oriented data processing
  – Line- or record-oriented processing of the entire dataset
  – “[Application] schema on read”
  – MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat – http://research.google.com/archive/mapreduce.html

For more on writing MapReduce applications, see “MapReduce Patterns, Algorithms, and Use Cases” http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

Page 13: High-level Architecture

Page 14: Logical Architecture

• [Big]Tables consist of billions of rows, millions of columns

• Records ordered by rowkey
  – Inserts require a sort: write-side overhead
  – Applications can take advantage of the sort

• Continuous sequences of rows partitioned into Regions
  – Regions partitioned at row boundaries, according to size (bytes)

• Regions automatically split when they grow too large

• Regions automatically distributed around the cluster
  – “Hands-free” partition management (mostly) – though tables can also be created pre-split, as sketched below
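A minimal sketch of pre-splitting with the 0.94-era admin API, to avoid hot-spotting a single Region during initial load (the table name, family name, and split keys are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable"); // hypothetical
    desc.addFamily(new HColumnDescriptor("cf1"));

    // Seed the table with Region boundaries at rowkeys "d", "h", "l",
    // rather than starting with one Region and waiting for splits.
    byte[][] splitKeys = {
        Bytes.toBytes("d"), Bytes.toBytes("h"), Bytes.toBytes("l")
    };
    admin.createTable(desc, splitKeys);
    admin.close();
  }
}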

Page 15: Logical Architecture – Distributed, persistent partitions of a BigTable

[Figure: Table A, rowkeys a through p, partitioned into Region 1 (a–d), Region 2 (e–h), Region 3 (i–l), and Region 4 (m–p). The Regions are spread across Region Server 7 (Table A Regions 1 and 2, plus Regions of Tables G and L), Region Server 86 (Table A Region 3, plus Regions of Tables C and F), and Region Server 367 (Table A Region 4, plus Regions of Tables C, E, and P).]

Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of Regions.

Page 16: Physical Architecture

• RegionServers collocate with DataNodes
  – Tight MapReduce integration
  – Opportunity for data-local online processing via coprocessors (experimental)

• HBase Master process manages Region assignment

• ZooKeeper is the configuration glue

• Clients communicate directly with RegionServers (data path)
  – Horizontally scales client load
  – Significantly harder for a single ignorant process to DoS the cluster

• For DDL operations, clients communicate with the HBase Master

• No persistent state in Master or ZooKeeper
  – Recover from an HDFS snapshot
  – See also: AWS Elastic MapReduce’s HBase restore path

Page 17: Physical Architecture – Distribution and Data Path

[Figure: HBase clients (Java apps, the HBase Shell, and a REST/Thrift gateway) talk to a ZooKeeper ensemble and directly to RegionServers, each collocated with an HDFS DataNode; the HBase Master sits alongside the NameNode.]

Legend:
- An HBase RegionServer is collocated with an HDFS DataNode.
- HBase clients communicate directly with Region Servers for sending and receiving data.
- HMaster manages Region assignment and handles DDL operations.
- Online configuration state is maintained in ZooKeeper.
- HMaster and ZooKeeper are NOT involved in the data path.

Page 18: Logical Data Model

• Table as a sorted map of maps: {rowkey => {family => {qualifier => {version => value}}}}
  – Think: nested OrderedDictionary (C#), TreeMap (Java)

• Basic data operations: GET, PUT, DELETE (sketched below)

• SCAN over a range of key-values
  – The benefit of the sorted-rowkey business
  – This is how you implement any kind of “complex query”

• GET, SCAN support Filters
  – Push application logic to the RegionServers

• INCREMENT, CheckAnd{Put,Delete}
  – Server-side, atomic data operations
  – Require a read lock, can be contentious

• No: secondary indices, joins, multi-row transactions
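A minimal sketch of these operations against the 0.94-era Java client (the table, family, qualifier, and row names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class BasicOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    HTable table = new HTable(conf, "mytable");       // hypothetical table

    // PUT: rowkey -> family -> qualifier -> value
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf1"), Bytes.toBytes("foo"), Bytes.toBytes("hello"));
    table.put(put);

    // GET: point access by rowkey
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    byte[] value = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("foo"));

    // SCAN over a rowkey range, with a Filter evaluated on the RegionServers
    Scan scan = new Scan(Bytes.toBytes("row1"), Bytes.toBytes("row9"));
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("cf1"), Bytes.toBytes("foo"),
        CompareOp.EQUAL, Bytes.toBytes("hello")));
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      System.out.println(Bytes.toString(r.getRow()));
    }
    scanner.close();

    // DELETE a whole row (or narrow it to specific columns/versions)
    table.delete(new Delete(Bytes.toBytes("row1")));
    table.close();
  }
}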

Page 19: Logical Data Model – A sparse, multi-dimensional, sorted map

Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can be multiple versions of a value.
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.

[Figure: sample contents of Table A – rows “a” and “b”, column families cf1 and cf2, with each value addressed by rowkey, column family, column qualifier, and timestamp. Some cells (e.g. “foo”) hold multiple timestamped versions, and values range from strings and numbers to a 3.6 kb PNG thumbnail.]

Page 20: Anatomy of a RegionServer

Page 21: Storage Machinery

• RegionServers host N Regions, as assigned by the Master
  – Common case: Region data is local to the RegionServer/DataNode

• Each column family is stored in isolation from the others
  – “Column-family oriented” storage
  – NOT the same as column-oriented storage

• Key-values managed by an “HStore”
  – Combined view over data on disk + in-memory edits
  – A Region manages one HStore for each column family

• On disk: key-values stored sorted in “StoreFiles”
  – StoreFiles composed of an ordered sequence of “Blocks”
  – Also carries a BloomFilter to minimize Block access

• In memory: “MemStore” maintains a heap of recent edits
  – Not to be confused with the “BlockCache”
  – This structure is essentially a log-structured merge tree (LSM-tree)* with MemStore C0 and StoreFiles C1


* http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf

Page 22:

Storage Machinery – Implementing the data model

[Figure: inside a RegionServer – one HLog (WAL) and one BlockCache, plus multiple HRegions; each HRegion holds one HStore per column family, and each HStore holds a MemStore plus StoreFiles (HFiles) persisted on HDFS.]

Legend:
- A RegionServer contains a single WAL, single BlockCache, and multiple Regions.
- A Region contains multiple Stores, one for each Column Family.
- A Store consists of multiple StoreFiles and a MemStore.
- A StoreFile corresponds to a single HFile.
- HFiles and the WAL are persisted on HDFS.

Page 23:

Write Path (Storage Machinery cont.)

• Write summary:
  1. Log edit to HLog (WAL)
  2. Record in MemStore
  3. ACK write

• Data events recorded to a WAL on HDFS, for durability
  – After failures, edits in the WAL are replayed during recovery
  – WAL appends are immediate, in the critical write path

• Data collected in the “MemStore”, until a “flush” writes new HFiles
  – Flush is automatic, based on configuration (size, or staleness interval)
  – Flush clears WAL entries corresponding to MemStore entries
  – Flush is deferred, not in the critical write path

• HFiles are merge-sorted during “Compaction”
  – Small files compacted into larger files
  – Old records discarded (major compaction only)
  – Lots of disk and network IO (see the sketch below)
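Two client-visible corners of this machinery, sketched with the 0.94-era API (table and family names hypothetical): skipping the WAL trades durability for write throughput, and flushes or major compactions can be requested explicitly rather than waiting for the automatic triggers.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathKnobs {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // hypothetical

    // Skip step 1 (the WAL append): faster, but the edit is lost if
    // the RegionServer dies before the MemStore is flushed.
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf1"), Bytes.toBytes("foo"), Bytes.toBytes("bulk"));
    put.setWriteToWAL(false);
    table.put(put);
    table.close();

    // Request a flush (MemStore -> new HFile) and a major compaction
    // (merge-sort StoreFiles, discard old records). Both are
    // asynchronous requests served by the RegionServer.
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.flush("mytable");
    admin.majorCompact("mytable");
    admin.close();
  }
}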

Page 24:

Write Path – Storing a KeyValue

[Figure: the RegionServer internals from Page 22, annotated with the numbered steps below.]

Legend:
1. A MutateRequest is received by the RegionServer.
2. A WALEdit is appended to the HLog.
3. The new KeyValues are written to the MemStore.
4. The RegionServer acknowledges the edit with a MutateResponse.

Page 25:

Read Path (Storage Machinery, cont.)

• Read summary:
  1. Evaluate query predicate
  2. Materialize results from Stores
  3. Batch results to client

• Scanners opened over all relevant StoreFiles + MemStore
  – The “BlockCache” maintains recently accessed Blocks in memory
  – A BloomFilter is used to skip irrelevant Blocks
  – Predicate matches accumulate and are sorted to return ordered rows

• Same Scanner APIs used for GET and SCAN – different access patterns, different optimization strategies (see the sketch below)
  – SCAN: HDFS is optimized for the throughput of long sequential reads; consider a larger Block size for more data per seek
  – GET: the BlockCache maintains hot Blocks for point access; consider a more granular BloomFilter
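These trade-offs surface in both the schema and the client APIs. A hedged 0.94-era sketch (table, family, and the specific tuning values are hypothetical): the column family sets the Block size and BloomFilter granularity, while the Scan controls RPC batching and BlockCache usage.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.regionserver.StoreFile;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Schema-side: a scan-heavy family might take a larger Block size;
    // a GET-heavy one might take a more granular (ROWCOL) BloomFilter.
    HTableDescriptor desc = new HTableDescriptor("mytable"); // hypothetical
    HColumnDescriptor cf = new HColumnDescriptor("cf1");
    cf.setBlocksize(128 * 1024);                       // more data per seek
    cf.setBloomFilterType(StoreFile.BloomType.ROWCOL); // skip more Blocks
    desc.addFamily(cf);
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);
    admin.close();

    // Client-side: fetch many rows per RPC, and avoid polluting the
    // BlockCache with blocks a one-off full scan will never revisit.
    HTable table = new HTable(conf, "mytable");
    Scan scan = new Scan();
    scan.setCaching(500);        // rows per RPC round-trip
    scan.setCacheBlocks(false);  // keep hot GET blocks in the cache
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      // process row
    }
    scanner.close();
    table.close();
  }
}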

Page 26:

Read Path – Serving a single read request

[Figure: the RegionServer internals from Page 22, annotated with the numbered steps below.]

Legend:
1. A GetRequest is received by the RegionServer.
2. StoreScanners are opened over the appropriate StoreFiles and the MemStore.
3. Blocks identified as potential matches are read from HDFS if not already in the BlockCache.
4. KeyValues are merged into the final set of Results.
5. A GetResponse containing the Results is returned to the client.

Page 27: Using HBase

Page 28: For what kinds of workloads is it well suited?

• It depends on how you tune it, but…

• HBase is good for:
  – Large datasets
  – Sparse datasets
  – Loosely coupled (denormalized) records
  – Lots of concurrent clients

• Try to avoid:
  – Small datasets (unless you have *lots* of them)
  – Highly relational records
  – Schema designs requiring transactions

Page 29: HBase Use Cases

[Figure: use cases plotted against four characteristics – Flexible Schema, Huge Data Volume, High Read Rate, High Write Rate.]

• Machine-Generated Data
• Distributed Messaging
• Real-Time Analytics
• Object Store
• User Profile Management

Page 30: HBase Example Use Case: Major Hard Drive Manufacturer

• Goal: detect defective drives before they leave the factory.

• Solution:
  – Stream sensor data to HBase as it is generated by their test battery.
  – Perform real-time analysis as data is added, and deep analytics offline.

• HBase a perfect fit:
  – Scalable enough to accommodate all 250+ TB of data needed.
  – Seamless integration with Hadoop analytics tools.

• Result:
  – Went from processing only 5% of drive test data to 100%.

Page 31: Other Example HBase Use Cases

• Facebook messaging and counts
• Time series data
• Exposing Machine Learning models (like risk sets)
• Large message set store-and-forward, especially in social media
• Geospatial indexing
• Indexing the Internet

Page 32: How does it integrate with my infrastructure?

• Horizontally scale application data
  – Highly concurrent, read/write access
  – Consistent, persisted shared state
  – Distributed online data processing via Coprocessors (experimental)

• Gateway between online services and offline storage/analysis
  – Staging area to receive new data
  – Serve online “views” on datasets in HDFS
  – Glue between batch (HDFS, MR1) and online (CEP, Storm) systems

Page 33: What data semantics does it provide?

• GET, PUT, DELETE key-value operations
• SCAN for queries
• INCREMENT, CAS server-side atomic operations (sketched below)
• Row-level write atomicity
• MapReduce integration
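A sketch of the atomic operations with the 0.94-era client (table, family, and the counter/state columns are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AtomicOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // hypothetical

    byte[] row = Bytes.toBytes("user42");
    byte[] cf = Bytes.toBytes("cf1");

    // INCREMENT: server-side atomic counter, no read-modify-write race
    long hits = table.incrementColumnValue(row, cf, Bytes.toBytes("hits"), 1);

    // CAS (CheckAndPut): apply the Put only if the current value of
    // cf1:state still equals "PENDING"
    Put put = new Put(row);
    put.add(cf, Bytes.toBytes("state"), Bytes.toBytes("ACTIVE"));
    boolean applied = table.checkAndPut(
        row, cf, Bytes.toBytes("state"), Bytes.toBytes("PENDING"), put);

    System.out.println("hits=" + hits + " applied=" + applied);
    table.close();
  }
}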

Page 34:

Creating a table in HBase

#!/bin/sh
# Small script to setup the hbase table used by OpenTSDB.

test -n "$HBASE_HOME" || {  #A
  echo >&2 'The environment variable HBASE_HOME must be set'
  exit 1
}

test -d "$HBASE_HOME" || {
  echo >&2 "No such directory: HBASE_HOME=$HBASE_HOME"
  exit 1
}

TSDB_TABLE=${TSDB_TABLE-'tsdb'}
UID_TABLE=${UID_TABLE-'tsdb-uid'}
COMPRESSION=${COMPRESSION-'LZO'}

exec "$HBASE_HOME/bin/hbase" shell <<EOF
create '$UID_TABLE',  #B
  {NAME => 'id', COMPRESSION => '$COMPRESSION'},  #B
  {NAME => 'name', COMPRESSION => '$COMPRESSION'}  #B

create '$TSDB_TABLE',  #C
  {NAME => 't', COMPRESSION => '$COMPRESSION'}  #C
EOF

#A From environment, not parameter
#B Make the tsdb-uid table with column families id and name
#C Make the tsdb table with the t column family

# Script taken from HBase in Action – Chapter 7

Page 35: Coprocessors in a nutshell

• Two types of coprocessors: Observers and Endpoints

• Coprocessors are Java code executed in each RegionServer

• Observer
  – Similar to a database trigger
  – Available Observer types: RegionObserver, WALObserver, MasterObserver
  – Mainly used to extend pre/post logic around RegionServer events, WAL events, or DDL events

• Endpoint
  – Sort of like a UDF
  – Extends the HBase client API with functions exposed to the user
  – Still executed on the RegionServer
  – Often used for sums/aggregations (HBase packs in an aggregation example)

• BE VERY CAREFUL WITH COPROCESSORS
  – They run in your RegionServers, and buggy code can take down your cluster (a minimal Observer sketch follows below)
  – See the HOYA details to help mitigate risk
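A minimal RegionObserver sketch against the 0.94-era coprocessor API (the audit logic and the "meta" family are hypothetical). It is loaded via table or site configuration and runs inside every hosting RegionServer, which is exactly why buggy coprocessor code is dangerous:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Trigger-style Observer: runs before every Put on regions of tables
// that load it. An exception or a slow call here stalls client writes.
public class AuditObserver extends BaseRegionObserver {
  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                     Put put, WALEdit edit, boolean writeToWAL)
      throws IOException {
    // Hypothetical audit trail: tag every edit with an ingest timestamp.
    put.add(Bytes.toBytes("meta"), Bytes.toBytes("ingested"),
            Bytes.toBytes(System.currentTimeMillis()));
  }
}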

Page 36:

What about operational concerns?

• Balance memory and IO for reads
  – Contention between random and sequential access
  – Configure Block size and BlockCache based on access patterns
  – Additional resources:
    – “HBase: Performance Tuners,” http://labs.ericsson.com/blog/hbase-performance-tuners
    – “Scanning in HBase,” http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html

• Balance IO for writes
  – Provision hardware with more spindles/TB
  – Configure L1 (compactions, region size, &c.) based on write pattern
  – Balance contention between maintaining L1 and serving reads
  – Additional resources:
    – “Configuring HBase Memstore: what you should know,” http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/
    – “Visualizing HBase Flushes And Compactions,” http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/

Page 37: Operational Tidbits

• Decommissioning a node outright results in a downed server; use “graceful_stop.sh” to offload the workload from the RegionServer first

• Use “zk_dump” to find all of your RegionServers and see how your ZooKeeper instances are faring

• Use “status ‘summary’” or “status ‘detailed’” for a count of live/dead servers, average load, and file counts

• Use “balancer” to automatically balance regions if HBase is set to auto-balance

• When using “hbase hbck” to diagnose and fix issues, RTFM!

Page 38: SQL and HBase – Hive and Phoenix over HBase

Page 39: Phoenix over HBase

• Phoenix is a SQL shim over HBase

• https://github.com/forcedotcom/phoenix

• HBase has fast write capabilities, so Phoenix allows for fast simple queries (no joins) and fast upserts

• Phoenix implements its own JDBC driver, so you can use your favorite tools (a sketch follows below)
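A hedged sketch of that JDBC path for the 2013-era forcedotcom Phoenix releases, assuming the jdbc:phoenix:<zookeeper quorum> URL form and driver class of those releases; the metrics table and its columns are hypothetical (the quorum host reuses node1.hadoop from the Hive example on Page 44):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixExample {
  public static void main(String[] args) throws Exception {
    // Assumed driver class name from the 2013-era forcedotcom releases
    Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver");
    // The JDBC URL names the ZooKeeper quorum of the HBase cluster
    Connection conn = DriverManager.getConnection("jdbc:phoenix:node1.hadoop");

    // Fast upsert: Phoenix's insert-or-update statement
    PreparedStatement upsert = conn.prepareStatement(
        "UPSERT INTO metrics (host, ts, cpu) VALUES (?, ?, ?)");
    upsert.setString(1, "web01");
    upsert.setLong(2, System.currentTimeMillis());
    upsert.setDouble(3, 0.42);
    upsert.executeUpdate();
    conn.commit(); // Phoenix batches upserts until commit

    // Fast simple query -- no joins
    ResultSet rs = conn.createStatement()
        .executeQuery("SELECT host, MAX(cpu) FROM metrics GROUP BY host");
    while (rs.next()) {
      System.out.println(rs.getString(1) + " " + rs.getDouble(2));
    }
    conn.close();
  }
}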

Page 40: Phoenix over HBase

Page 41: Hive over HBase

• Hive can be used directly with HBase

• Hive queries the table through the “HBaseStorageHandler”, which supplies the MapReduce input format

• The Storage Handler has hooks for:
  – Getting input/output formats
  – Metadata operations: CREATE TABLE, DROP TABLE, etc.

• The Storage Handler is a table-level concept
  – Does not support Hive partitions or buckets

• Hive does not need to include all columns from the HBase table

Page 42: Hive over HBase

Page 43: Hive over HBase

Page 44:

Hive and Phoenix over HBase

> hive
add jar /usr/lib/hbase/hbase-0.94.6.1.3.0.0-107-security.jar;
add jar /usr/lib/hbase/lib/zookeeper.jar;
add jar /usr/lib/hbase/lib/protobuf-java-2.4.0a.jar;
add jar /usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.0.0-107.jar;

set hbase.zookeeper.quorum=node1.hadoop;

CREATE EXTERNAL TABLE phoenix_mobilelograw(
  key string, ip string, ts string, code string,
  d1 string, d2 string, d3 string, d4 string,
  properties string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,F:IP,F:TS,F:CODE,F:D1,F:D2,F:D3,F:D4,F:PROPERTIES")
TBLPROPERTIES ("hbase.table.name" = "MOBILELOGRAW");

set hive.hbase.wal.enabled=false;
INSERT OVERWRITE TABLE phoenix_mobilelograw SELECT * FROM hive_mobilelograw;
set hive.hbase.wal.enabled=true;

Page 45: HBase Roadmap

Page 46: Hortonworks Focus Areas for HBase

Simplified Operations:
• Intelligent Compaction
• Automated Rebalancing
• Ambari Management:
  – Snapshot / Revert
  – Multimaster HA
  – Cross-site Replication
  – Backup / Restore
• Ambari Monitoring:
  – Latency metrics
  – Throughput metrics
  – Heatmaps
  – Region visualizations

Database Functionality:
• First-Class Datatypes
• SQL Interface Support
• Indexes
• Security:
  – Encryption
  – More Granular Permissions
• Performance:
  – Stripe Compactions
  – Short Circuit Read for Hadoop 2
  – Row and Entity Groups
  – Deeper Hive/Pig Interop

Page 47: HBase Roadmap Details: Operations

• Snapshots:
  – Protect data or restore to a point in time.

• Intelligent Compaction:
  – Compact when the system is lightly utilized.
  – Avoid “compaction storms” that can break SLAs.

• Ambari Operational Improvements:
  – Configure multi-master HA.
  – Simple setup/configuration for replication.
  – Manage and schedule snapshots.
  – More visualizations, more health checks.

Page 48: HBase Roadmap Details: Data Management

• Datatypes:
  – First-class datatypes offer performance benefits and better interoperability with tools and other databases.

• SQL Interface (Preview):
  – SQL interface for simplified analysis of data within HBase.
  – JDBC driver allows embedding in existing applications.

• Security:
  – Granular permissions on data within HBase.

Page 49: HOYA – HBase On YARN

Page 50: HOYA?

• The new YARN resource negotiation layer in Hadoop allows non-MapReduce applications to run on a Hadoop grid – why not let HBase take advantage of this capability?

• https://github.com/hortonworks/hoya/

• HOYA is a YARN application that provisions RegionServers based on an HBase cluster configuration

• HOYA helps bring HBase under YARN resource management and paves the way for advanced resource management with HBase

• HOYA can be used to spin up transient HBase clusters during MapReduce or other jobs

Page 51: A quick YARN refresher…

Page 52: The 1st Generation of Hadoop: Batch

HADOOP 1.0: Built for Web-Scale Batch Apps

[Figure: separate single-application silos – BATCH, INTERACTIVE, ONLINE – each running on its own HDFS.]

• All other usage patterns must leverage that same infrastructure
• Forces the creation of silos for managing mixed workloads

Page 53: A Transition From Hadoop 1 to 2

HADOOP 1.0:
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)

Page 54: A Transition From Hadoop 1 to 2

HADOOP 1.0:
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)

HADOOP 2.0:
• HDFS (redundant, reliable storage)
• YARN (cluster resource management)
• MapReduce (data processing)
• Others (data processing)

Page 55: The Enterprise Requirement: Beyond Batch

To become an enterprise-viable data platform, customers have told us they want to store ALL DATA in one place and interact with it in MULTIPLE WAYS, simultaneously and with predictable levels of service.

[Figure: BATCH, INTERACTIVE, STREAMING, GRAPH, IN-MEMORY, HPC MPI, ONLINE, and OTHER workloads all running over HDFS (redundant, reliable storage).]

Page 56: YARN: Taking Hadoop Beyond Batch

• Created to manage resource needs across all uses

• Ensures predictable performance & QoS for all apps

• Enables apps to run “IN” Hadoop rather than “ON” it
  – Key to leveraging all other common services of the Hadoop platform: security, data lifecycle management, etc.

[Figure: applications run natively IN Hadoop – BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), OTHER (Search, Weave, …) – all on YARN (cluster resource management) over HDFS2 (redundant, reliable storage).]

Page 57: HOYA Architecture

Page 58: Key HOYA Design Goals

1. Create on-demand HBase clusters
2. Maintain multiple HBase cluster configurations and implement them as required (i.e. high-load scenarios)
3. Isolation – sandbox clusters running different versions of HBase or with different coprocessors
4. Create transient HBase clusters for MapReduce or other processing
5. Elasticity of clusters for analytics, data ingest, and project-based work
6. Leverage the scheduling in YARN to ensure HBase can be a good Hadoop cluster tenant

Page 59:

Time to call it an evening. We all have important work to do…

Page 60:

Thank you….


hbaseinaction.com

For more information, check out HBase: The Definitive Guide or HBase in Action.