HBase Internals
TRANSCRIPT
Matteo Bertozzi | @Cloudera
March 2013 - Hadoop Summit Europe
HBase Storage Internals, Present and Future!
What is HBase?
• Open-source storage manager that provides random read/write on top of HDFS
• Provides Tables with a “Key:Column/Value” interface
• Dynamic columns (qualifiers), no schema needed
• “Fixed” column groups (families)
• table[row:family:column] = value
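The logical model above can be sketched with plain Python dicts — this is just an illustration of the `table[row:family:column] = value` idea, not the HBase client API:

```python
# Conceptual sketch of HBase's logical model: a map of
# row -> {"family:qualifier" -> value}. Illustrative only.
table = {}

def put(row, family, qualifier, value):
    # Qualifiers are dynamic: no schema declaration needed.
    table.setdefault(row, {})[f"{family}:{qualifier}"] = value

def get(row, family, qualifier):
    return table.get(row, {}).get(f"{family}:{qualifier}")

put("user1", "cf", "name", "alice")   # 'cf' is a fixed column family
put("user1", "cf", "email", "a@x.io")  # new qualifier added on the fly
print(get("user1", "cf", "name"))      # alice
```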
HBase Ecosystem
• Apache Hadoop HDFS for data durability and reliability (Write-Ahead Log)
• Apache ZooKeeper for distributed coordination
• Apache Hadoop MapReduce: built-in support for running MapReduce jobs
How HBase Works
"View from 10000ft"
Master, Region Servers and Regions
[Diagram: a Client talks to ZooKeeper and the Master; each Region Server hosts a set of Regions, with the data stored on HDFS]
• Region Server
• Server that hosts a set of Regions
• Responsible for handling reads and writes
• Region
• The basic unit of scalability in HBase
• Subset of the table's data
• Contiguous, sorted range of rows stored together
• Master
• Coordinates the HBase cluster
• Assignment/balancing of the Regions
• Handles admin operations
• create/delete/modify table, …
Autosharding and the .META. Table
• A Region is a subset of the table's data
• When there is too much data in a Region…
• a split is triggered, creating 2 regions
• The association "Region -> Server" is stored in a system table
• The location of .META. is stored in ZooKeeper
Table      Start Key  Region ID  Region Server
testTable  Key-00     1          machine01.host
testTable  Key-31     2          machine03.host
testTable  Key-65     3          machine02.host
testTable  Key-83     4          machine01.host
…          …          …          …
users      Key-AB     1          machine03.host
users      Key-KG     2          machine02.host
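A sketch of how a client can resolve a key to a region using a .META.-like table: since regions cover contiguous, sorted key ranges, the lookup is a binary search over start keys. The rows below mirror the testTable example above; this is not the real .META. layout.

```python
import bisect

# (start_key, region_id, server) rows for one table, sorted by
# start key. Hypothetical data mirroring the table above.
meta = [("Key-00", 1, "machine01.host"),
        ("Key-31", 2, "machine03.host"),
        ("Key-65", 3, "machine02.host"),
        ("Key-83", 4, "machine01.host")]

def locate(key):
    starts = [row[0] for row in meta]
    # The region owning the key is the last one whose start key <= key.
    i = bisect.bisect_right(starts, key) - 1
    return meta[max(i, 0)]

print(locate("Key-50"))  # ('Key-31', 2, 'machine03.host')
```

A real client caches these lookups, so .META. is not scanned on every request.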
machine01: Region 1 (testTable), Region 4 (testTable)
machine02: Region 3 (testTable), Region 1 (users)
machine03: Region 2 (testTable), Region 2 (users)
The Write Path – Create a New Table
• The client asks the Master to create a new table
• hbase> create 'myTable', 'cf'
• The Master
• Stores the table information ("schema")
• Creates Regions based on the key-splits provided
• if no splits are provided, a single region is created by default
• Assigns the Regions to the Region Servers
• The assignment Region -> Server is written to a system table called ".META."
[Diagram: the Client sends createTable() to the Master, which stores the table "metadata" and assigns ("enables") the Regions across the Region Servers]
The Write Path – "Inserting" Data
• table.put(row-key:family:column, value)
• The client asks ZooKeeper for the location of .META.
• The client scans .META. searching for the Region Server responsible for the key
• The client asks that Region Server to insert/update/delete the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible for the key
• The operation is written to a Write-Ahead Log (WAL)
• …and the KeyValues are added to the in-memory store: the "MemStore"
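The server-side part of this flow — WAL first, MemStore second — can be sketched as follows. All names are illustrative; in reality the WAL lives on HDFS and the MemStore is a concurrent sorted structure.

```python
# Sketch of the server-side write path: append to the WAL for
# durability, then apply to the in-memory MemStore. Illustrative only.
wal = []        # append-only log (stored on HDFS in reality)
memstore = {}   # in-memory buffer, kept sorted at flush time

def handle_put(key, value):
    wal.append(("put", key, value))  # 1. durability first
    memstore[key] = value            # 2. then the in-memory store

handle_put("row1:cf:col", "v1")
print(wal[0])  # ('put', 'row1:cf:col', 'v1')
```

The ordering matters: if the server crashes after the WAL append, the edit can be replayed; if the MemStore were updated first, an acknowledged write could be lost.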
[Diagram: the Client asks ZooKeeper "Where is .META.?", scans .META. on its Region Server, then sends "Insert KeyValue" to the Region Server hosting the target Region]
The Write Path – Append Only to Random R/W
• Files in HDFS are
• Append-only
• Immutable once closed
• HBase provides random writes?
• …not really, from a storage point of view
• KeyValues are stored in memory and written to disk under pressure
• Don't worry, your data is safe in the WAL!
• (the Region Server can recover data from the WAL in case of a crash)
• But this allows HBase to sort data by key before writing it to disk
• Deletes are like inserts, but with a "remove me" flag
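The buffer-then-flush behavior, including delete tombstones, can be sketched like this (illustrative data structures, not the actual MemStore implementation):

```python
# Sketch: KeyValues accumulate in a MemStore and are flushed to an
# immutable, key-sorted "file" under memory pressure. Deletes are
# just writes carrying a tombstone marker. Illustrative only.
memstore = {}
store_files = []  # each flush appends a new sorted, immutable list

def put(key, value):
    memstore[key] = value

def delete(key):
    memstore[key] = ("DELETED",)  # tombstone, not an in-place removal

def flush():
    global memstore
    store_files.append(sorted(memstore.items()))  # sorted by key on write
    memstore = {}  # start a fresh buffer

put("Key1", "v1")
put("Key0", "v0")
delete("Key1")
flush()
print(store_files[0])  # [('Key0', 'v0'), ('Key1', ('DELETED',))]
```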
[Diagram: a Region Server with its Regions and WAL; sorted KeyValues (Key0 – value 0 … Key5 – value 5) sit in the MemStore before being flushed to Store Files (HFiles)]
The Read Path – "Reading" Data
• The client asks ZooKeeper for the location of .META.
• The client scans .META. searching for the Region Server responsible for the key
• The client asks that Region Server to get the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible for the key
• The MemStore and the Store Files are scanned to find the key
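The last step — scanning the MemStore and the store files — works because newer versions shadow older ones. A minimal sketch (files modeled as dicts for brevity):

```python
# Sketch of a point read: check the MemStore first, then store files
# from newest to oldest; the first version found wins. Illustrative only.
memstore = {"Key1": "v1.2"}
store_files = [  # oldest first
    {"Key0": "v0.0", "Key1": "v1.0"},
    {"Key1": "v1.1", "Key2": "v2.0"},
]

def get(key):
    if key in memstore:
        return memstore[key]
    for sf in reversed(store_files):  # newest file first
        if key in sf:
            return sf[key]
    return None

print(get("Key1"))  # v1.2 (MemStore shadows older store files)
```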
[Diagram: the Client asks ZooKeeper "Where is .META.?", scans .META. on its Region Server, then sends "Get Key" to the Region Server hosting the target Region]
The Read Path – Append Only to Random R/W
• Each flush creates a new file
• Each file has its KeyValues sorted by key
• Two or more files can contain the same key (updates/deletes)
• To find a key you need to scan all the files
• …with some optimizations
• Filter files by start/end key
• Keep a bloom filter on each file
File 1: Key0 – value 0.0, Key2 – value 2.0, Key3 – value 3.0, Key5 – value 5.0, Key8 – value 8.0, Key9 – value 9.0
File 2: Key0 – value 0.1, Key1 – value 1.0, Key5 – [deleted], Key6 – value 6.0, Key7 – value 7.0
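Both pruning optimizations can be sketched together. Here a plain set stands in for the Bloom filter (a real Bloom filter is probabilistic: it can say "maybe present" but never gives false negatives); the file metadata is hypothetical.

```python
# Sketch of file pruning on reads: skip a file when the key falls
# outside its [first, last] key range, or when its "Bloom filter"
# (a plain set here) says the key is definitely absent.
files = [
    {"first": "Key0", "last": "Key9", "keys": {"Key0", "Key2", "Key3"}},
    {"first": "Key5", "last": "Key9", "keys": {"Key5", "Key8", "Key9"}},
]

def candidate_files(key):
    return [f for f in files
            if f["first"] <= key <= f["last"]  # start/end key filter
            and key in f["keys"]]              # bloom-filter check

print(len(candidate_files("Key2")))  # 1: only the first file survives
```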
HBase Store File Format
HFile
HFile Format
• Only sequential writes, just append(key, value)
• Large sequential reads are better
• Why group records into blocks?
• Easy to split
• Easy to read
• Easy to cache
• Easy to index (if records are sorted)
• Block compression (snappy, lz4, gz, …)
[Diagram: an HFile is a sequence of blocks, each with a header and records (Record 0 … Record N), followed by the block indexes (Index 0 … Index N) and a trailer]
Key/Value (record) layout:
Key Length : int
Value Length : int
Key : byte[]
Value : byte[]
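The record layout above — two length-prefix ints followed by the raw key and value bytes — can be encoded and decoded with a few lines of Python (a simplified sketch of the framing, not HBase's actual serializer):

```python
import struct

# Record framing per the layout above: 4-byte big-endian key length,
# 4-byte big-endian value length, then key bytes, then value bytes.
def encode_record(key: bytes, value: bytes) -> bytes:
    return struct.pack(">ii", len(key), len(value)) + key + value

def decode_record(buf: bytes):
    klen, vlen = struct.unpack_from(">ii", buf, 0)
    key = buf[8:8 + klen]
    value = buf[8 + klen:8 + klen + vlen]
    return key, value

rec = encode_record(b"row1", b"hello")
print(decode_record(rec))  # (b'row1', b'hello')
```

Because every record is self-describing, a block can be scanned by repeatedly reading two ints and skipping ahead — which is what makes the append-only format easy to read and index.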
Data Block Encoding
"On-disk" KeyValue layout:
Row Length : short
Row : byte[]
Family Length : byte
Family : byte[]
Qualifier : byte[]
Timestamp : long
Type : byte
• "Be aware of the data"
• Block Encoding allows the keys to be compressed based on what we know
• Keys are sorted… prefixes are likely similar in most cases
• One file contains keys from one family only
• Timestamps are "similar": we can store the diff
• Type is "put" most of the time…
Optimize the Read Path
Compactions
• Reduces the number of files to look into during a scan
• Removes duplicated keys (updated values)
• Removes deleted keys
• Creates a new file by merging the content of two or more files
• Removes the old files
File 1: Key0 – value 0.0, Key2 – value 2.0, Key3 – value 3.0, Key5 – value 5.0, Key8 – value 8.0, Key9 – value 9.0
File 2: Key0 – value 0.1, Key1 – value 1.0, Key4 – value 4.0, Key5 – [deleted], Key6 – value 6.0, Key7 – value 7.0
Merged: Key0 – value 0.1, Key1 – value 1.0, Key2 – value 2.0, Key3 – value 3.0, Key4 – value 4.0, Key6 – value 6.0, Key7 – value 7.0, Key8 – value 8.0, Key9 – value 9.0
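A minimal sketch of the merge step — newer versions win, tombstones are dropped (a major compaction; minor compactions may retain tombstones, and real files are merged with a streaming k-way merge rather than dicts):

```python
# Sketch of a compaction: merge two key-sorted files, keep only the
# newest version of each key, and drop delete tombstones.
# Files are (key, value) lists; the second argument is newer.
def compact(older, newer):
    merged = dict(older)
    merged.update(newer)  # newer versions overwrite older ones
    return sorted((k, v) for k, v in merged.items() if v != "[deleted]")

f1 = [("Key0", "0.0"), ("Key2", "2.0"), ("Key5", "5.0")]
f2 = [("Key0", "0.1"), ("Key1", "1.0"), ("Key5", "[deleted]")]
print(compact(f1, f2))
# [('Key0', '0.1'), ('Key1', '1.0'), ('Key2', '2.0')]
```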
Pluggable Compactions
• Try different algorithms
• Be aware of the data
• Time series? I guess no updates from the '80s
• Be aware of the requests
• Compact based on statistics
• which files are hot and which are not
• which keys are hot and which are not
Zero-Copy Snapshots and Table Clones
Snapshots
What Is a Snapshot?
• "a Snapshot is not a copy of the table"
• a Snapshot is a set of metadata information
• The table "schema" (column families and attributes)
• The Regions information (start key, end key, …)
• The list of Store Files
• The list of active WALs
[Diagram: the Master coordinates two Region Servers via ZooKeeper; each Region Server holds Regions, a WAL, and Store Files (HFiles)]
How Does Taking a Snapshot Work?
• The Master orchestrates the Region Servers
• the communication is done via ZooKeeper
• using a "2-phase commit like" transaction (prepare/commit)
• Each Region Server is responsible for taking its "piece" of the snapshot
• For each Region, it stores the metadata information needed
• (list of Store Files, WALs, region start/end keys, …)
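The prepare/commit flow can be sketched as below. In reality the messages travel through ZooKeeper znodes and failures trigger an abort; here they are plain method calls on hypothetical objects:

```python
# Sketch of the 2-phase-commit-like snapshot flow: the Master asks
# every Region Server to "prepare" its piece of the snapshot, and
# only if all succeed does it tell them to "commit". Illustrative only.
class RegionServer:
    def __init__(self, name):
        self.name = name
        self.prepared = False
        self.committed = False

    def prepare(self):
        self.prepared = True  # gather region metadata, file lists, …
        return True

    def commit(self):
        self.committed = True  # finalize this server's piece

def take_snapshot(servers):
    if all(rs.prepare() for rs in servers):  # phase 1: prepare
        for rs in servers:
            rs.commit()                      # phase 2: commit
        return True
    return False  # any failure would abort the snapshot

servers = [RegionServer("rs1"), RegionServer("rs2")]
print(take_snapshot(servers))  # True
```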
Cloning a Table from a Snapshot
• hbase> clone_snapshot 'snapshotName', 'tableName' …
• Creates a new table with the data "contained" in the snapshot
• No data copies involved
• HFiles are immutable, and shared between tables and snapshots
• You can insert/update/remove data from the new table
• No repercussions on the snapshot, the original table, or other cloned tables
Compactions & Archiving
• HFiles are immutable, and shared between tables and snapshots
• On compaction or table deletion, files are removed from disk
• If one of these files is referenced by a snapshot or a cloned table
• the file is moved to an "archive" directory
• and deleted later, when there are no references to it
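The archive rule is essentially reference counting on files. A sketch with hypothetical names (file and reference identifiers are made up):

```python
# Sketch of archiving: a compacted/deleted HFile is moved to an
# archive directory while snapshots or clones still reference it,
# and only physically deleted once no references remain.
references = {"hfile-001": {"snapshot-A", "clone-T"}}
archive = set()

def remove_file(name):
    if references.get(name):
        archive.add(name)  # still referenced: archive, don't delete
        return "archived"
    archive.discard(name)  # no references left: safe to delete
    return "deleted"

def drop_reference(name, ref):
    references[name].discard(ref)

print(remove_file("hfile-001"))      # archived
drop_reference("hfile-001", "snapshot-A")
drop_reference("hfile-001", "clone-T")
print(remove_file("hfile-001"))      # deleted
```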
What can be improved?
Future
0.96 is coming up
• Moving RPC to Protobuf
• Allows rolling upgrades with no surprises
• HBase Snapshots
• Pluggable Compactions
• Remove -ROOT-
• Table Locks
0.98 and Beyond
• Transparent Table/Column-Family Encryption
• Cell-level security
• Multiple WALs per Region Server (MTTR)
• Data Placement Awareness (MTTR)
• Data Type Awareness
• Compaction policies, based on the data needs
• Managing blocks directly (instead of files)
Questions? Matteo Bertozzi | @Cloudera