Sept 17 2013 - THUG - HBase a Technical Introduction


DESCRIPTION

HBase Technical Introduction. This deck includes a description of memory design, write path, read path, some operational tidbits, SQL on HBase (Phoenix and Hive), as well as HOYA (HBase on YARN).

TRANSCRIPT

Page 1: Sept 17 2013 - THUG - HBase a Technical Introduction

Deep Dive content by Hortonworks, Inc. is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

HBase Technical Deep Dive Sept 17 2013 – Toronto Hadoop User Group Adam Muise [email protected]

Page 2: Deep Dive Agenda

• Background – (how did we get here?)

• High-level Architecture – (where are we?)

• Anatomy of a RegionServer – (how does this thing work?)

• Using HBase – (where do we go from here?)

Page 3: Background

Page 4: So what is a BigTable anyway?

• BigTable paper from Google, 2006, Chang et al. – “Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.” – http://research.google.com/archive/bigtable.html

• Key Features:
  – Distributed storage across a cluster of machines
  – Random, online read and write data access
  – Schemaless data model (“NoSQL”)
  – Self-managed data partitions

Page 5: Modern Datasets Break Traditional Databases

> 10x more always-connected mobile devices than seen in the PC era.
> Sensor, video and other machine-generated data easily exceeds 100 TB/day.
> Traditional databases can’t serve modern application needs.

Page 6: Apache HBase: The Database For Big Data

More data is the key to richer application experiences and deeper insights.

With HBase you can:
• Ingest and retain more data, to petabyte scale and beyond.
• Store and access huge data volumes with low latency.
• Store data of any structure.
• Use the entire Hadoop ecosystem to gain deep insight on your data.

Page 7: HBase At A Glance

[Figure: CLIENT LAYER → HBASE LAYER → HDFS LAYER, annotated with the numbered points below.]

1. Clients automatically load balanced across the cluster.
2. Scales linearly to handle any load.
3. Data stored in HDFS allows automated failover.
4. Analyze data with any Hadoop tool.

Page 8: HBase: Real-Time Data on Hadoop

>  Read, Write, Process and Query data in real time using Hadoop infrastructure.

Page 9: HBase: High Availability

> Data safely protected in HDFS.
> Failed nodes are automatically recovered.
> No single point of failure, no manual intervention.

[Figure: two HBase nodes, with their data replicated across multiple HDFS nodes.]

Page 10: HBase: Multi-Datacenter Replication

> Replicate data to 2 or more datacenters.
> Load balancing or disaster recovery.

Page 11: HBase: Seamless Hadoop Integration

> HBase makes deep analytics simple using any Hadoop tool.
> Query with Hive, process with Pig, classify with Mahout.


Page 12: Apache Hadoop in Review

• Apache Hadoop Distributed Filesystem (HDFS)
  – Distributed, fault-tolerant, throughput-optimized data storage
  – Uses a filesystem analogy, not structured tables
  – The Google File System, 2003, Ghemawat et al. – http://research.google.com/archive/gfs.html

• Apache Hadoop MapReduce (MR)
  – Distributed, fault-tolerant, batch-oriented data processing
  – Line- or record-oriented processing of the entire dataset
  – “[Application] schema on read”
  – MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat – http://research.google.com/archive/mapreduce.html

For more on writing MapReduce applications, see “MapReduce Patterns, Algorithms, and Use Cases” http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

Page 13: High-level Architecture

Page 14: Logical Architecture

• [Big]Tables consist of billions of rows, millions of columns

• Records ordered by rowkey
  – Inserts require a sort: write-side overhead
  – Applications can take advantage of the sort

• Continuous sequences of rows partitioned into Regions
  – Regions partitioned at row boundaries, according to size (bytes)

• Regions automatically split when they grow too large

• Regions automatically distributed around the cluster
  – “Hands-free” partition management (mostly) – though tables can also be created pre-split, as sketched below
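A minimal sketch of pre-splitting with the 0.94-era admin API, to avoid hot-spotting a single Region during initial load (the table name, family name, and split keys are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable"); // hypothetical
    desc.addFamily(new HColumnDescriptor("cf1"));

    // Seed the table with Region boundaries at rowkeys "d", "h", "l",
    // rather than starting with one Region and waiting for splits.
    byte[][] splitKeys = {
        Bytes.toBytes("d"), Bytes.toBytes("h"), Bytes.toBytes("l")
    };
    admin.createTable(desc, splitKeys);
    admin.close();
  }
}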

Page 15: Logical Architecture – Distributed, persistent partitions of a BigTable

[Figure: Table A, rowkeys a through p, partitioned into Region 1 (a–d), Region 2 (e–h), Region 3 (i–l), and Region 4 (m–p). The Regions are spread across Region Server 7 (Table A Regions 1 and 2, plus Regions of Tables G and L), Region Server 86 (Table A Region 3, plus Regions of Tables C and F), and Region Server 367 (Table A Region 4, plus Regions of Tables C, E, and P).]

Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of Regions.

Page 16: Physical Architecture

• RegionServers collocate with DataNodes
  – Tight MapReduce integration
  – Opportunity for data-local online processing via coprocessors (experimental)

• HBase Master process manages Region assignment

• ZooKeeper is the configuration glue

• Clients communicate directly with RegionServers (data path)
  – Horizontally scales client load
  – Significantly harder for a single ignorant process to DoS the cluster

• For DDL operations, clients communicate with the HBase Master

• No persistent state in Master or ZooKeeper
  – Recover from an HDFS snapshot
  – See also: AWS Elastic MapReduce’s HBase restore path

Page 17: Physical Architecture – Distribution and Data Path

[Figure: HBase clients (Java apps, the HBase Shell, and a REST/Thrift gateway) talk to a ZooKeeper ensemble and directly to RegionServers, each collocated with an HDFS DataNode; the HBase Master sits alongside the NameNode.]

Legend:
- An HBase RegionServer is collocated with an HDFS DataNode.
- HBase clients communicate directly with Region Servers for sending and receiving data.
- HMaster manages Region assignment and handles DDL operations.
- Online configuration state is maintained in ZooKeeper.
- HMaster and ZooKeeper are NOT involved in the data path.

Page 18: Logical Data Model

• Table as a sorted map of maps: {rowkey => {family => {qualifier => {version => value}}}}
  – Think: nested OrderedDictionary (C#), TreeMap (Java)

• Basic data operations: GET, PUT, DELETE (sketched below)

• SCAN over a range of key-values
  – The benefit of the sorted-rowkey business
  – This is how you implement any kind of “complex query”

• GET, SCAN support Filters
  – Push application logic to the RegionServers

• INCREMENT, CheckAnd{Put,Delete}
  – Server-side, atomic data operations
  – Require a read lock, can be contentious

• No: secondary indices, joins, multi-row transactions
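A minimal sketch of these operations against the 0.94-era Java client (the table, family, qualifier, and row names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class BasicOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    HTable table = new HTable(conf, "mytable");       // hypothetical table

    // PUT: rowkey -> family -> qualifier -> value
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf1"), Bytes.toBytes("foo"), Bytes.toBytes("hello"));
    table.put(put);

    // GET: point access by rowkey
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    byte[] value = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("foo"));

    // SCAN over a rowkey range, with a Filter evaluated on the RegionServers
    Scan scan = new Scan(Bytes.toBytes("row1"), Bytes.toBytes("row9"));
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("cf1"), Bytes.toBytes("foo"),
        CompareOp.EQUAL, Bytes.toBytes("hello")));
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      System.out.println(Bytes.toString(r.getRow()));
    }
    scanner.close();

    // DELETE a whole row (or narrow it to specific columns/versions)
    table.delete(new Delete(Bytes.toBytes("row1")));
    table.close();
  }
}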

Page 19: Logical Data Model – A sparse, multi-dimensional, sorted map

Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can be multiple versions of a value.
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.

[Figure: sample contents of Table A – rows “a” and “b”, column families cf1 and cf2, with each value addressed by rowkey, column family, column qualifier, and timestamp. Some cells (e.g. “foo”) hold multiple timestamped versions, and values range from strings and numbers to a 3.6 kb PNG thumbnail.]

Page 20: Anatomy of a RegionServer

Page 21: Storage Machinery

• RegionServers host N Regions, as assigned by the Master
  – Common case: Region data is local to the RegionServer/DataNode

• Each column family is stored in isolation from the others
  – “Column-family oriented” storage
  – NOT the same as column-oriented storage

• Key-values managed by an “HStore”
  – Combined view over data on disk + in-memory edits
  – A Region manages one HStore for each column family

• On disk: key-values stored sorted in “StoreFiles”
  – StoreFiles composed of an ordered sequence of “Blocks”
  – Also carries a BloomFilter to minimize Block access

• In memory: “MemStore” maintains a heap of recent edits
  – Not to be confused with the “BlockCache”
  – This structure is essentially a log-structured merge tree (LSM-tree)* with MemStore C0 and StoreFiles C1


* http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf

Page 22:

Storage Machinery – Implementing the data model

[Figure: inside a RegionServer – one HLog (WAL) and one BlockCache, plus multiple HRegions; each HRegion holds one HStore per column family, and each HStore holds a MemStore plus StoreFiles (HFiles) persisted on HDFS.]

Legend:
- A RegionServer contains a single WAL, single BlockCache, and multiple Regions.
- A Region contains multiple Stores, one for each Column Family.
- A Store consists of multiple StoreFiles and a MemStore.
- A StoreFile corresponds to a single HFile.
- HFiles and the WAL are persisted on HDFS.

Page 23:

Write Path (Storage Machinery cont.)

• Write summary:
  1. Log edit to HLog (WAL)
  2. Record in MemStore
  3. ACK write

• Data events recorded to a WAL on HDFS, for durability
  – After failures, edits in the WAL are replayed during recovery
  – WAL appends are immediate, in the critical write path

• Data collected in the “MemStore”, until a “flush” writes new HFiles
  – Flush is automatic, based on configuration (size, or staleness interval)
  – Flush clears WAL entries corresponding to MemStore entries
  – Flush is deferred, not in the critical write path

• HFiles are merge-sorted during “Compaction”
  – Small files compacted into larger files
  – Old records discarded (major compaction only)
  – Lots of disk and network IO (see the sketch below)
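Two client-visible corners of this machinery, sketched with the 0.94-era API (table and family names hypothetical): skipping the WAL trades durability for write throughput, and flushes or major compactions can be requested explicitly rather than waiting for the automatic triggers.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathKnobs {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // hypothetical

    // Skip step 1 (the WAL append): faster, but the edit is lost if
    // the RegionServer dies before the MemStore is flushed.
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf1"), Bytes.toBytes("foo"), Bytes.toBytes("bulk"));
    put.setWriteToWAL(false);
    table.put(put);
    table.close();

    // Request a flush (MemStore -> new HFile) and a major compaction
    // (merge-sort StoreFiles, discard old records). Both are
    // asynchronous requests served by the RegionServer.
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.flush("mytable");
    admin.majorCompact("mytable");
    admin.close();
  }
}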

Page 24:

Write Path – Storing a KeyValue

[Figure: the RegionServer internals from Page 22, annotated with the numbered steps below.]

Legend:
1. A MutateRequest is received by the RegionServer.
2. A WALEdit is appended to the HLog.
3. The new KeyValues are written to the MemStore.
4. The RegionServer acknowledges the edit with a MutateResponse.

Page 25:

Read Path (Storage Machinery, cont.)

• Read summary:
  1. Evaluate query predicate
  2. Materialize results from Stores
  3. Batch results to client

• Scanners opened over all relevant StoreFiles + MemStore
  – The “BlockCache” maintains recently accessed Blocks in memory
  – A BloomFilter is used to skip irrelevant Blocks
  – Predicate matches accumulate and are sorted to return ordered rows

• Same Scanner APIs used for GET and SCAN – different access patterns, different optimization strategies (see the sketch below)
  – SCAN: HDFS is optimized for the throughput of long sequential reads; consider a larger Block size for more data per seek
  – GET: the BlockCache maintains hot Blocks for point access; consider a more granular BloomFilter
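These trade-offs surface in both the schema and the client APIs. A hedged 0.94-era sketch (table, family, and the specific tuning values are hypothetical): the column family sets the Block size and BloomFilter granularity, while the Scan controls RPC batching and BlockCache usage.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.regionserver.StoreFile;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Schema-side: a scan-heavy family might take a larger Block size;
    // a GET-heavy one might take a more granular (ROWCOL) BloomFilter.
    HTableDescriptor desc = new HTableDescriptor("mytable"); // hypothetical
    HColumnDescriptor cf = new HColumnDescriptor("cf1");
    cf.setBlocksize(128 * 1024);                       // more data per seek
    cf.setBloomFilterType(StoreFile.BloomType.ROWCOL); // skip more Blocks
    desc.addFamily(cf);
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);
    admin.close();

    // Client-side: fetch many rows per RPC, and avoid polluting the
    // BlockCache with blocks a one-off full scan will never revisit.
    HTable table = new HTable(conf, "mytable");
    Scan scan = new Scan();
    scan.setCaching(500);        // rows per RPC round-trip
    scan.setCacheBlocks(false);  // keep hot GET blocks in the cache
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      // process row
    }
    scanner.close();
    table.close();
  }
}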

Page 26:

Read Path – Serving a single read request

[Figure: the RegionServer internals from Page 22, annotated with the numbered steps below.]

Legend:
1. A GetRequest is received by the RegionServer.
2. StoreScanners are opened over the appropriate StoreFiles and the MemStore.
3. Blocks identified as potential matches are read from HDFS if not already in the BlockCache.
4. KeyValues are merged into the final set of Results.
5. A GetResponse containing the Results is returned to the client.

Page 27: Using HBase

Page 28: For what kinds of workloads is it well suited?

• It depends on how you tune it, but…

• HBase is good for:
  – Large datasets
  – Sparse datasets
  – Loosely coupled (denormalized) records
  – Lots of concurrent clients

• Try to avoid:
  – Small datasets (unless you have *lots* of them)
  – Highly relational records
  – Schema designs requiring transactions

Page 29: HBase Use Cases

[Figure: use cases plotted against four characteristics – Flexible Schema, Huge Data Volume, High Read Rate, High Write Rate.]

• Machine-Generated Data
• Distributed Messaging
• Real-Time Analytics
• Object Store
• User Profile Management

Page 30: HBase Example Use Case: Major Hard Drive Manufacturer

• Goal: detect defective drives before they leave the factory.

• Solution:
  – Stream sensor data to HBase as it is generated by their test battery.
  – Perform real-time analysis as data is added, and deep analytics offline.

• HBase a perfect fit:
  – Scalable enough to accommodate all 250+ TB of data needed.
  – Seamless integration with Hadoop analytics tools.

• Result:
  – Went from processing only 5% of drive test data to 100%.

Page 31: Other Example HBase Use Cases

• Facebook messaging and counts
• Time series data
• Exposing Machine Learning models (like risk sets)
• Large message set store-and-forward, especially in social media
• Geospatial indexing
• Indexing the Internet

Page 32: How does it integrate with my infrastructure?

• Horizontally scale application data
  – Highly concurrent, read/write access
  – Consistent, persisted shared state
  – Distributed online data processing via Coprocessors (experimental)

• Gateway between online services and offline storage/analysis
  – Staging area to receive new data
  – Serve online “views” on datasets in HDFS
  – Glue between batch (HDFS, MR1) and online (CEP, Storm) systems

Page 33: What data semantics does it provide?

• GET, PUT, DELETE key-value operations
• SCAN for queries
• INCREMENT, CAS server-side atomic operations (sketched below)
• Row-level write atomicity
• MapReduce integration
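A sketch of the atomic operations with the 0.94-era client (table, family, and the counter/state columns are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AtomicOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // hypothetical

    byte[] row = Bytes.toBytes("user42");
    byte[] cf = Bytes.toBytes("cf1");

    // INCREMENT: server-side atomic counter, no read-modify-write race
    long hits = table.incrementColumnValue(row, cf, Bytes.toBytes("hits"), 1);

    // CAS (CheckAndPut): apply the Put only if the current value of
    // cf1:state still equals "PENDING"
    Put put = new Put(row);
    put.add(cf, Bytes.toBytes("state"), Bytes.toBytes("ACTIVE"));
    boolean applied = table.checkAndPut(
        row, cf, Bytes.toBytes("state"), Bytes.toBytes("PENDING"), put);

    System.out.println("hits=" + hits + " applied=" + applied);
    table.close();
  }
}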

Page 34:

Creating a table in HBase

#!/bin/sh
# Small script to setup the hbase table used by OpenTSDB.

test -n "$HBASE_HOME" || {  #A
  echo >&2 'The environment variable HBASE_HOME must be set'
  exit 1
}

test -d "$HBASE_HOME" || {
  echo >&2 "No such directory: HBASE_HOME=$HBASE_HOME"
  exit 1
}

TSDB_TABLE=${TSDB_TABLE-'tsdb'}
UID_TABLE=${UID_TABLE-'tsdb-uid'}
COMPRESSION=${COMPRESSION-'LZO'}

exec "$HBASE_HOME/bin/hbase" shell <<EOF
create '$UID_TABLE',  #B
  {NAME => 'id', COMPRESSION => '$COMPRESSION'},  #B
  {NAME => 'name', COMPRESSION => '$COMPRESSION'}  #B

create '$TSDB_TABLE',  #C
  {NAME => 't', COMPRESSION => '$COMPRESSION'}  #C
EOF

#A From environment, not parameter
#B Make the tsdb-uid table with column families id and name
#C Make the tsdb table with the t column family

# Script taken from HBase in Action – Chapter 7

Page 35: Coprocessors in a nutshell

• Two types of coprocessors: Observers and Endpoints

• Coprocessors are Java code executed in each RegionServer

• Observer
  – Similar to a database trigger
  – Available Observer types: RegionObserver, WALObserver, MasterObserver
  – Mainly used to extend pre/post logic around RegionServer events, WAL events, or DDL events

• Endpoint
  – Sort of like a UDF
  – Extends the HBase client API with functions exposed to the user
  – Still executed on the RegionServer
  – Often used for sums/aggregations (HBase packs in an aggregation example)

• BE VERY CAREFUL WITH COPROCESSORS
  – They run in your RegionServers, and buggy code can take down your cluster (a minimal Observer sketch follows below)
  – See the HOYA details to help mitigate risk
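A minimal RegionObserver sketch against the 0.94-era coprocessor API (the audit logic and the "meta" family are hypothetical). It is loaded via table or site configuration and runs inside every hosting RegionServer, which is exactly why buggy coprocessor code is dangerous:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Trigger-style Observer: runs before every Put on regions of tables
// that load it. An exception or a slow call here stalls client writes.
public class AuditObserver extends BaseRegionObserver {
  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                     Put put, WALEdit edit, boolean writeToWAL)
      throws IOException {
    // Hypothetical audit trail: tag every edit with an ingest timestamp.
    put.add(Bytes.toBytes("meta"), Bytes.toBytes("ingested"),
            Bytes.toBytes(System.currentTimeMillis()));
  }
}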

Page 36:

What about operational concerns?

• Balance memory and IO for reads
  – Contention between random and sequential access
  – Configure Block size and BlockCache based on access patterns
  – Additional resources:
    – “HBase: Performance Tuners,” http://labs.ericsson.com/blog/hbase-performance-tuners
    – “Scanning in HBase,” http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html

• Balance IO for writes
  – Provision hardware with more spindles/TB
  – Configure L1 (compactions, region size, &c.) based on write pattern
  – Balance contention between maintaining L1 and serving reads
  – Additional resources:
    – “Configuring HBase Memstore: what you should know,” http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/
    – “Visualizing HBase Flushes And Compactions,” http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/

Page 37: Operational Tidbits

• Decommissioning a node outright results in a downed server; use “graceful_stop.sh” to offload the workload from the RegionServer first

• Use “zk_dump” to find all of your RegionServers and see how your ZooKeeper instances are faring

• Use “status ‘summary’” or “status ‘detailed’” for a count of live/dead servers, average load, and file counts

• Use “balancer” to automatically balance regions if HBase is set to auto-balance

• When using “hbase hbck” to diagnose and fix issues, RTFM!

Page 38: SQL and HBase – Hive and Phoenix over HBase

Page 39: Phoenix over HBase

• Phoenix is a SQL shim over HBase

• https://github.com/forcedotcom/phoenix

• HBase has fast write capabilities, so Phoenix allows for fast simple queries (no joins) and fast upserts

• Phoenix implements its own JDBC driver, so you can use your favorite tools (a sketch follows below)
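A hedged sketch of that JDBC path for the 2013-era forcedotcom Phoenix releases, assuming the jdbc:phoenix:<zookeeper quorum> URL form and driver class of those releases; the metrics table and its columns are hypothetical (the quorum host reuses node1.hadoop from the Hive example on Page 44):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixExample {
  public static void main(String[] args) throws Exception {
    // Assumed driver class name from the 2013-era forcedotcom releases
    Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver");
    // The JDBC URL names the ZooKeeper quorum of the HBase cluster
    Connection conn = DriverManager.getConnection("jdbc:phoenix:node1.hadoop");

    // Fast upsert: Phoenix's insert-or-update statement
    PreparedStatement upsert = conn.prepareStatement(
        "UPSERT INTO metrics (host, ts, cpu) VALUES (?, ?, ?)");
    upsert.setString(1, "web01");
    upsert.setLong(2, System.currentTimeMillis());
    upsert.setDouble(3, 0.42);
    upsert.executeUpdate();
    conn.commit(); // Phoenix batches upserts until commit

    // Fast simple query -- no joins
    ResultSet rs = conn.createStatement()
        .executeQuery("SELECT host, MAX(cpu) FROM metrics GROUP BY host");
    while (rs.next()) {
      System.out.println(rs.getString(1) + " " + rs.getDouble(2));
    }
    conn.close();
  }
}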

Page 40: Phoenix over HBase

Page 41: Hive over HBase

• Hive can be used directly with HBase

• Hive queries the table through the “HBaseStorageHandler”, which supplies the MapReduce input format

• The Storage Handler has hooks for:
  – Getting input/output formats
  – Metadata operations: CREATE TABLE, DROP TABLE, etc.

• The Storage Handler is a table-level concept
  – Does not support Hive partitions or buckets

• Hive does not need to include all columns from the HBase table

Page 42: Hive over HBase

Page 43: Hive over HBase

Page 44:

Hive and Phoenix over HBase

> hive
add jar /usr/lib/hbase/hbase-0.94.6.1.3.0.0-107-security.jar;
add jar /usr/lib/hbase/lib/zookeeper.jar;
add jar /usr/lib/hbase/lib/protobuf-java-2.4.0a.jar;
add jar /usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.0.0-107.jar;

set hbase.zookeeper.quorum=node1.hadoop;

CREATE EXTERNAL TABLE phoenix_mobilelograw(
  key string, ip string, ts string, code string,
  d1 string, d2 string, d3 string, d4 string,
  properties string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,F:IP,F:TS,F:CODE,F:D1,F:D2,F:D3,F:D4,F:PROPERTIES")
TBLPROPERTIES ("hbase.table.name" = "MOBILELOGRAW");

set hive.hbase.wal.enabled=false;
INSERT OVERWRITE TABLE phoenix_mobilelograw SELECT * FROM hive_mobilelograw;
set hive.hbase.wal.enabled=true;

Page 45: HBase Roadmap

Page 46: Hortonworks Focus Areas for HBase

Simplified Operations:
• Intelligent Compaction
• Automated Rebalancing
• Ambari Management:
  – Snapshot / Revert
  – Multimaster HA
  – Cross-site Replication
  – Backup / Restore
• Ambari Monitoring:
  – Latency metrics
  – Throughput metrics
  – Heatmaps
  – Region visualizations

Database Functionality:
• First-Class Datatypes
• SQL Interface Support
• Indexes
• Security:
  – Encryption
  – More Granular Permissions
• Performance:
  – Stripe Compactions
  – Short Circuit Read for Hadoop 2
  – Row and Entity Groups
  – Deeper Hive/Pig Interop

Page 47: HBase Roadmap Details: Operations

• Snapshots:
  – Protect data or restore to a point in time.

• Intelligent Compaction:
  – Compact when the system is lightly utilized.
  – Avoid “compaction storms” that can break SLAs.

• Ambari Operational Improvements:
  – Configure multi-master HA.
  – Simple setup/configuration for replication.
  – Manage and schedule snapshots.
  – More visualizations, more health checks.

Page 48: HBase Roadmap Details: Data Management

• Datatypes:
  – First-class datatypes offer performance benefits and better interoperability with tools and other databases.

• SQL Interface (Preview):
  – SQL interface for simplified analysis of data within HBase.
  – JDBC driver allows embedding in existing applications.

• Security:
  – Granular permissions on data within HBase.

Page 49: HOYA – HBase On YARN

Page 50: HOYA?

• The new YARN resource negotiation layer in Hadoop allows non-MapReduce applications to run on a Hadoop grid – why not let HBase take advantage of this capability?

• https://github.com/hortonworks/hoya/

• HOYA is a YARN application that provisions RegionServers based on an HBase cluster configuration

• HOYA helps bring HBase under YARN resource management and paves the way for advanced resource management with HBase

• HOYA can be used to spin up transient HBase clusters during MapReduce or other jobs

Page 51: A quick YARN refresher…

Page 52: The 1st Generation of Hadoop: Batch

HADOOP 1.0: Built for Web-Scale Batch Apps

[Figure: separate single-application silos – BATCH, INTERACTIVE, ONLINE – each running on its own HDFS.]

• All other usage patterns must leverage that same infrastructure
• Forces the creation of silos for managing mixed workloads

Page 53: A Transition From Hadoop 1 to 2

HADOOP 1.0:
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)

Page 54: A Transition From Hadoop 1 to 2

HADOOP 1.0:
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)

HADOOP 2.0:
• HDFS (redundant, reliable storage)
• YARN (cluster resource management)
• MapReduce (data processing)
• Others (data processing)

Page 55: The Enterprise Requirement: Beyond Batch

To become an enterprise-viable data platform, customers have told us they want to store ALL DATA in one place and interact with it in MULTIPLE WAYS, simultaneously and with predictable levels of service.

[Figure: BATCH, INTERACTIVE, STREAMING, GRAPH, IN-MEMORY, HPC MPI, ONLINE, and OTHER workloads all running over HDFS (redundant, reliable storage).]

Page 56: YARN: Taking Hadoop Beyond Batch

• Created to manage resource needs across all uses

• Ensures predictable performance & QoS for all apps

• Enables apps to run “IN” Hadoop rather than “ON” it
  – Key to leveraging all other common services of the Hadoop platform: security, data lifecycle management, etc.

[Figure: applications run natively IN Hadoop – BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), OTHER (Search, Weave, …) – all on YARN (cluster resource management) over HDFS2 (redundant, reliable storage).]

Page 57: HOYA Architecture

Page 58: Key HOYA Design Goals

1. Create on-demand HBase clusters
2. Maintain multiple HBase cluster configurations and implement them as required (i.e. high-load scenarios)
3. Isolation – sandbox clusters running different versions of HBase or with different coprocessors
4. Create transient HBase clusters for MapReduce or other processing
5. Elasticity of clusters for analytics, data ingest, and project-based work
6. Leverage the scheduling in YARN to ensure HBase can be a good Hadoop cluster tenant

Page 59:

Time to call it an evening. We all have important work to do…

Page 60:

Thank you….


hbaseinaction.com

For more information, check out HBase: The Definitive Guide or HBase in Action.