TriHUG January 2012 Talk by Chris Shain

HBase: An Introduction. Christopher Shain, Software Development Lead, Tresata


DESCRIPTION

Intro to Apache HBase

TRANSCRIPT

Page 1: TriHUG January 2012 Talk by Chris Shain

HBase: An Introduction

Christopher ShainSoftware Development LeadTresata

Page 2

About Tresata

Hadoop for Financial Services

First completely Hadoop-powered analytics application

Widely recognized as “Big Data Startup to Watch”

Winner of the 2011 NCTA Award for Emerging Tech Company of the Year

Based in Charlotte, NC. We are hiring! [email protected]

Page 3

About Me

Software Development Lead at Tresata

Background in Financial Services IT: End-User Applications, Data Warehousing, ETL

Email: [email protected] Twitter: @chrisshain

Page 4

OK, enough about me.

What is HBase? From http://hbase.apache.org: “HBase is the Hadoop database.”

I think this is a confusing statement.

Page 5

What isn’t HBase?

‘Database’, to many, means: Transactions, Joins, Indexes.

HBase has none of these (more on this later).

Page 6

What is HBase, really?

HBase is a data storage platform designed to work hand-in-hand with Hadoop: Distributed, Failure-tolerant, Semi-structured, Low latency, Strictly consistent, HDFS-aware, “NoSQL”.

Page 7

Where did HBase come from?

Need for a low-latency, distributed datastore with unlimited horizontal scale.

Hadoop (MapReduce) doesn’t provide low latency.

Traditional RDBMSs don’t scale out horizontally.

Page 8

History of HBase

November 2006: Google BigTable whitepaper published: http://research.google.com/archive/bigtable.html

February 2007: Initial HBase prototype

October 2007: First ‘usable’ HBase

January 2008: HBase becomes Apache subproject of Hadoop

March 2009: HBase 0.20.0

May 10th, 2010: HBase becomes Apache Top Level Project

Page 9

Common Use Cases

Web Indexing

Social Graph

Messaging (Email etc.)

Page 10

HBase Architecture

HBase is written almost entirely in Java; JVM clients are first-class citizens.

[Diagram: HBase Master coordinating four RegionServers; JVM clients connect directly, while non-JVM clients go through a proxy (Thrift or REST)]

Page 11

Data Organization in HBase

All data is stored in Tables.

Table rows have exactly one Key, and all rows in a table are physically ordered by key.

Tables have a fixed number of Column Families (more on this later!).

Each row can have many Columns in each column family.

Each column has a set of values, each with a timestamp.

Each row:family:column:timestamp combination represents coordinates for a Cell.
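The row:family:column:timestamp coordinates above behave like nested sorted maps. A minimal plain-Java sketch of this layout (not the HBase client API; class and method names are illustrative):

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of HBase's logical layout:
// row -> column family -> column -> timestamp -> value.
// Timestamps sort descending so the newest cell comes first.
public class CellCoordinates {
    static NavigableMap<String, NavigableMap<String, NavigableMap<String,
            NavigableMap<Long, String>>>> table = new TreeMap<>();

    static void put(String row, String family, String column, long ts, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<Long, String>(Collections.reverseOrder()))
             .put(ts, value);
    }

    // Read the newest version of a cell.
    static String get(String row, String family, String column) {
        return table.get(row).get(family).get(column).firstEntry().getValue();
    }

    public static void main(String[] args) {
        put("Chris", "Attributes", "Age", 1L, "30");
        put("Chris", "Attributes", "Age", 2L, "31"); // a later write shadows the earlier one
        System.out.println(get("Chris", "Attributes", "Age")); // prints 31
    }
}
```

Note that rows sort by key in the outermost map, which is exactly what makes range scans cheap.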

Page 12

What is a Column Family?

Defined by the Table.

A Column Family is a group of related columns with its own name.

All columns must be in a column family.

Each row can have a completely different set of columns for a column family.

Row    Column Family    Columns
Chris  Friends          Friends:Bob
Bob    Friends          Friends:Chris, Friends:James
James  Friends          Friends:Bob

Page 13

Rows and Cells

Not exactly the same as rows in a traditional RDBMS.

Key: a byte array (usually a UTF-8 String).

Data: Cells, qualified by column family, column, and timestamp (not shown here).

Row Key  Column Family        Column              Cell Value
         (Defined by Table)   (Defined by Row)    (Created with Columns)
Chris    Attributes           Attributes:Age      30
         Attributes           Attributes:Height   68
         Friends              Friends:Bob         1 (Bob’s a cool guy)
         Friends              Friends:Jane        0 (Jane and I don’t get along)

Page 14

Cell Timestamps

All cells are created with a timestamp.

The column family defines how many versions of a cell to keep.

Updates always create a new cell.

Deletes create a tombstone (more on that later).

Queries can include an “as-of” timestamp to return point-in-time values.

Page 15

Deletes

HBase deletes are a form of write called a “tombstone”.

Indicates that “beyond this point any previously written value is dead”.

Old values can still be read using point-in-time queries.

Timestamp  Write Type       Resulting Value  Point-In-Time Value “as of” T+1
T + 0      PUT (“Foo”)      “Foo”            “Foo”
T + 1      PUT (“Bar”)      “Bar”            “Bar”
T + 2      DELETE           <none>           “Bar”
T + 3      PUT (“Foo Too”)  “Foo Too”        “Bar”
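The table above can be reproduced with a toy versioned cell in plain Java. This is a simplified model, not the HBase API: the tombstone marker is an assumed placeholder, and real HBase delete masking is more nuanced (e.g. tombstones persist until a major compaction):

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of tombstone semantics: a delete is just another write,
// and point-in-time reads ignore anything written after the "as of" time.
public class Tombstones {
    static final String TOMBSTONE = "__DELETE__"; // assumed marker, not real HBase
    static NavigableMap<Long, String> cell = new TreeMap<>();

    static void put(long ts, String value) { cell.put(ts, value); }
    static void delete(long ts) { cell.put(ts, TOMBSTONE); }

    // The newest write at or before asOf wins; a tombstone reads as null.
    static String getAsOf(long asOf) {
        Map.Entry<Long, String> e = cell.floorEntry(asOf);
        return (e == null || TOMBSTONE.equals(e.getValue())) ? null : e.getValue();
    }

    public static void main(String[] args) {
        put(0, "Foo");
        put(1, "Bar");
        delete(2);
        put(3, "Foo Too");
        System.out.println(getAsOf(1)); // Bar  (the point-in-time column in the table)
        System.out.println(getAsOf(2)); // null (the tombstone hides earlier values)
        System.out.println(getAsOf(3)); // Foo Too
    }
}
```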

Page 16

Use Case – Time Series

Requirement: Store real-time stock tick data.

Requirement: Accommodate many simultaneous readers & writers.

Requirement: Allow for reading of the current price for any ticker at any point in time.

Ticker  Timestamp     Sequence  Bid     Ask
IBM     09:15:03:001  1         179.16  179.18
MSFT    09:15:04:112  2         28.25   28.27
GOOG    09:15:04:114  3         624.94  624.99
IBM     09:15:04:155  4         179.18  179.19

Page 17

Time Series Use Case – RDBMS Solution:

Historical Prices:

Keys  Column           Data Type
PK    Ticker           Varchar
PK    Timestamp        DateTime
PK    Sequence_Number  Integer
      Bid_Price        Decimal
      Ask_Price        Decimal

Latest Prices:

Keys  Column     Data Type
PK    Ticker     Varchar
      Bid_Price  Decimal
      Ask_Price  Decimal

Page 18

Time Series Use Case – HBase Solution:

Row Key: [Ticker].[Rev_Timestamp].[Rev_Sequence_Number]

Family:Column: Prices:Bid, Prices:Ask

HBase throughput will scale linearly with the # of nodes.

No need to keep a separate “latest price” table: a scan starting at “ticker” will always return the latest price row.
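The reversed-timestamp row key above can be sketched in plain Java. The zero-padding and “.” separator are illustrative assumptions; the point is that reversing the numbers makes newer ticks sort first:

```java
// Sketch of the [Ticker].[Rev_Timestamp].[Rev_Sequence_Number] row key.
public class TickRowKey {
    static String rowKey(String ticker, long timestampMillis, long sequence) {
        long revTs = Long.MAX_VALUE - timestampMillis;
        long revSeq = Long.MAX_VALUE - sequence;
        // Zero-padded so lexicographic (byte) order matches numeric order.
        return String.format("%s.%019d.%019d", ticker, revTs, revSeq);
    }

    public static void main(String[] args) {
        String older = rowKey("IBM", 1000L, 1);
        String newer = rowKey("IBM", 2000L, 4);
        // The newer tick sorts first, so a scan from "IBM" sees it immediately.
        System.out.println(newer.compareTo(older) < 0); // true
    }
}
```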

Page 19

HBase Data Partitioning

HBase scales horizontally

Needs to split data over many RegionServers

Regions are the unit of scale

Page 20

Regions & RegionServers

All HBase tables are broken into 1 or more regions

Regions have a start row key and an end row key

Each Region lives on exactly one RegionServer

RegionServers may host many Regions

When RegionServers die, Master detects this and assigns Regions to other RegionServers

Page 21

Region Distribution

.META. Table:

Table  Region                 RegionServer
Users  “Aaron” – “George”     Node01
Users  “George” – “Matthew”   Node02
Users  “Matthew” – “Zachary”  Node01

“Users” Table:

Row Keys in Region “Aaron” – “George”: “Aaron”, “Bob”, “Chris”

Row Keys in Region “George” – “Matthew”: “George”

Row Keys in Region “Matthew” – “Zachary”: “Matthew”, “Nancy”, “Zachary”

Page 22

HBase Architecture

Deceptively simple

Page 23

HBase Architecture

[Diagram: HBase Master and a Backup HBase Master, coordinated by a ZooKeeper Cluster, managing four RegionServers; JVM clients connect directly, non-JVM clients go through a proxy (Thrift or REST)]

Page 24

Component Roles & Responsibilities

ZooKeeper: Keeps track of which server is the current HBase Master.

HBase Master: Keeps track of the Region/RegionServer mapping, manages the -ROOT- and .META. tables, and is responsible for updating ZooKeeper when these change.

Page 25

More Component Roles & Responsibilities

RegionServer: Stores table regions.

Clients: Need to be smarter than RDBMS clients. They first connect to ZooKeeper to find the RegionServer for a given Table/Region, then connect directly to that RegionServer to interact with the data. All connections are over Hadoop RPC; non-JVM clients use a proxy (Thrift or REST (Stargate)).

Page 26

HBase Catalog Tables

-ROOT- Table:

Row Key: .META.[region]
Columns: info:regioninfo, info:server, info:serverstartcode
(info:server points to the RegionServer hosting that .META. region)

.META. Table:

Row Key: [table],[region start key],[region id]
Columns: info:regioninfo, info:server, info:serverstartcode
(info:server points to the RegionServer hosting that regular user table region)

Page 27

HBase Master Reliability

HBase Master is not necessarily a single point of failure (SPOF): multiple masters can be running, and the current ‘active’ Master is controlled via ZooKeeper. Make sure you have enough ZooKeeper nodes!

The Master is not needed for client connectivity: clients connect directly to ZooKeeper to find Regions, and everything the Master does can be put off until one is elected.

Page 28

HBase Master & ZooKeeper

[Diagram: a ZooKeeper Quorum of three ZooKeeper Nodes coordinating one current HBase Master and two standby HBase Masters]

Page 29

HBase RegionServer Reliability

HBase tolerates RegionServer failure when running on HDFS: data is replicated by HDFS (dfs.replication setting).

There are lots of issues around fsync and failure before data is flushed, some probably still not fixed. Thus, data can still be lost if a node fails after a write.

The HDFS NameNode is still a SPOF, even for HBase.

Page 30

RegionServer Write Ahead Log

Similar to the log in many RDBMSs.

All operations are by default written to the log before being considered ‘committed’ (can be overridden for ‘disposable fast writes’).

The log can be replayed when a region is moved to another RegionServer.

One WAL per RegionServer.

[Diagram: Writes go to the WAL, which is flushed periodically (10s by default), and to the MemStore, which is flushed to new HFiles when it gets too big]

Page 31

HBase RegionServer & HDFS

[Diagram: a RegionServer hosts a Region containing its Log and several Stores; each Store has a MemStore and StoreFiles (HFiles), written through the HDFS Client to blocks replicated across multiple HDFS DataNodes]

Page 32

HBase Data Locality on HDFS

A RegionServer is not guaranteed to be on the same physical node as its data.

Compaction causes the RegionServer to write preferentially to the local node, but this is a function of the HDFS client, not HBase.

Page 33

Flushes and Compactions

All data is in memory initially (memstore).

HBase is a write-only system: modifications and deletes are just writes with later timestamps. This is a function of HDFS being append-only.

Eventually old writes need to be discarded, so there are 2 types of compactions: Minor and Major.

Page 34

Flushes

All HBase edits are initially stored in memory (memstore).

Flushes occur when the memstore reaches a certain size: by default 67,108,864 bytes (64 MB), controlled by the hbase.hregion.memstore.flush.size configuration property.

Each flush creates a new HFile.
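As a sketch, the flush threshold mentioned above would be set in hbase-site.xml like this (128 MB is purely an illustrative value):

```xml
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <!-- Flush a region's memstore to a new HFile once it reaches this many bytes -->
  <value>134217728</value>
</property>
```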

Page 35

Minor Compaction

Triggered when a certain number of HFiles have been created for a given Region Store (plus some other conditions): by default 3 HFiles, controlled by the hbase.hstore.compactionThreshold configuration property.

Compacts the most recent HFiles into one. By default, uses the RegionServer-local HDFS node.

Does not eliminate deletes; only touches the most recent HFiles.

NOTE: All column families are compacted at once (this might change in the future).

Page 36

Major Compaction

Triggered every 24 hours (with a random offset) or manually. Large HBase installations usually leave this to manual operators.

Re-writes all HFiles into one.

Processes deletes: eliminates tombstones and erases earlier entries.

Page 37

Guarantees

HBase does not have transactions. However:

Row-level modifications are atomic: all modifications to a row will succeed or fail as a unit.

Gets are consistent for a given point in time, but Scans may return 2 rows from different points in time.

All data read has been ‘durably stored’. This does NOT mean flushed to disk: it can still be lost!

Page 38

Appendix

Page 39

Do's and Don'ts: Schema Design

DO: Design your schema for linear range scans on your most common queries. Scans are the most efficient way to query a lot of rows quickly.

DON’T: Use more than 2 or 3 column families. Some operations (flushing and compacting) operate on the whole region.

DO: Be aware of the relative cardinality of column families. Wildly differing cardinality leads to sparsity and bad scanning results.

Page 40

Do's and Don'ts: Schema Design

DO: Be mindful of the size of your row and column keys. They are used in indexes and queries, and can be quite large!

DON’T: Use monotonically increasing row keys. They can lead to hotspots on writes.

DO: Store timestamp keys in reverse. Rows in a table need to be read in order, and usually you want the most recent.
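One common workaround for the monotonic-key hotspot above is to salt the row key with a hash-derived bucket prefix, spreading sequential writes across regions. A plain-Java sketch (the bucket count and separator are assumptions, not from the talk):

```java
// Salted row keys: sequential timestamps no longer share a key prefix,
// so writes spread across BUCKETS pre-split regions instead of one.
public class SaltedKey {
    static final int BUCKETS = 8; // assumed number of pre-split regions

    static String saltedKey(String naturalKey) {
        int bucket = Math.floorMod(naturalKey.hashCode(), BUCKETS);
        return bucket + "|" + naturalKey;
    }

    public static void main(String[] args) {
        System.out.println(saltedKey("20120115093001"));
        System.out.println(saltedKey("20120115093002"));
    }
}
```

The trade-off: reading a time range now requires one scan per bucket, so salting suits write-heavy workloads best.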

Page 41

Do's and Don'ts: Queries

DO: Query single rows using exact-match on key (Gets), or Scans for multiple rows. Scans allow efficient I/O vs. multiple Gets.

DON’T: Use regex-based or non-prefix column filters. They are very inefficient.

DO: Tune the scan cache and batch size parameters. This drastically improves performance when returning lots of rows.

Page 42

HBase Architecture

Deceptively simple.

[Diagram: HBase Master and four RegionServers; JVM clients connect directly, non-JVM clients go through a proxy (Thrift or REST)]

Page 43

HBase Master & ZooKeeper

[Diagram: a ZooKeeper Quorum of three ZooKeeper Nodes coordinating one current HBase Master and two standby HBase Masters]

Page 44

HBase RegionServer & HDFS

[Diagram: a RegionServer hosts a Region containing its Log and several Stores; each Store has a MemStore and StoreFiles (HFiles), written through the HDFS Client to blocks replicated across multiple HDFS DataNodes]

Page 45

Use Case – User Preferences

Requirement: Store an arbitrary set of preferences for all users.

Requirement: Each user may choose to store a different set of preferences.

Requirement: Preferences may be of different data types (Strings, Integers, etc).

Requirement: Developers will add new preference options all the time, so we shouldn’t need to modify the database structure when adding them.

Page 46

User Preferences Use Case – Relational Solution:

One possible RDBMS solution: a Key/Value table, with all values as strings. Flexible, but wastes space.

Keys  Column           Data Type
PK    UserID           Int
PK    PreferenceName   Varchar
      PreferenceValue  Varchar

Page 47

User Preferences Use Case – HBase Solution:

Store all preferences in the Preferences column family.

Preference name as column name, preference value as a (serialized) byte array. The HBase client library provides methods for serializing many common data types.

Row Key  Family       Column     Value
Chris    Preferences  Age        30
Chris    Preferences  Hometown   “Mineola, NY”
Joe      Preferences  Birthdate  11/13/1987
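The per-type serialization described above can be sketched without the HBase dependency; this mimics what a Bytes-style utility does (the class and method names here are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Each preference value is stored as a plain byte array, so differently
// typed values can live side by side in one column family.
public class PreferenceBytes {
    static byte[] fromInt(int v) { return ByteBuffer.allocate(4).putInt(v).array(); }
    static byte[] fromString(String v) { return v.getBytes(StandardCharsets.UTF_8); }

    static int toInt(byte[] b) { return ByteBuffer.wrap(b).getInt(); }
    static String toString(byte[] b) { return new String(b, StandardCharsets.UTF_8); }

    public static void main(String[] args) {
        byte[] age = fromInt(30);                    // e.g. Preferences:Age
        byte[] hometown = fromString("Mineola, NY"); // e.g. Preferences:Hometown
        System.out.println(toInt(age));          // 30
        System.out.println(toString(hometown));  // Mineola, NY
    }
}
```

Adding a new preference is just a new column qualifier in an existing row; no schema change is needed.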