HBase internals

HBase Storage Internals, present and future!
Matteo Bertozzi | @Cloudera
March 2013 - Hadoop Summit Europe


TRANSCRIPT

Page 1: HBase internals


HBase Storage Internals, present and future!

Page 2: HBase internals

What is HBase?

• Open source Storage Manager that provides random read/write on top of HDFS

• Provides Tables with a “Key:Column/Value” interface

• Dynamic columns (qualifiers), no schema needed

• “Fixed” column groups (families)

• table[row:family:column] = value

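The “Key:Column/Value” model above can be mimicked with plain maps. A minimal sketch (plain Python dicts, not the HBase client API; all names are illustrative):

```python
# Illustrative sketch only: an HBase table behaves like a sorted map of
# table[row][family:qualifier] = value, with qualifiers created on the fly.

table = {}  # row-key -> {family:qualifier -> value}

def put(row, family, qualifier, value):
    # Families are "fixed" (declared at table creation); qualifiers are dynamic.
    table.setdefault(row, {})[f"{family}:{qualifier}"] = value

put("row-1", "cf", "name", "Matteo")
put("row-1", "cf", "anything", "no schema change needed")  # dynamic column

print(table["row-1"]["cf:name"])  # -> Matteo
```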

Page 3: HBase internals

HBase ecosystem


• Apache Hadoop HDFS for data durability and reliability (Write-Ahead Log)

• Apache ZooKeeper for distributed coordination

• Apache Hadoop MapReduce: built-in support for running MapReduce jobs


Page 4: HBase internals


How HBase Works

“View from 10000ft”

Page 5: HBase internals


Master, Region Servers and Regions

[Diagram: a Client talks to ZooKeeper and the Master; each Region Server hosts several Regions, with data stored on HDFS]

• Region Server

• Server that contains a set of Regions

• Responsible for handling reads and writes

• Region

• The basic unit of scalability in HBase

• Subset of the table’s data

• Contiguous, sorted range of rows stored together.

• Master

• Coordinates the HBase Cluster

• Assignment/Balancing of the Regions

• Handles admin operations

• create/delete/modify table, …

Page 6: HBase internals

Autosharding and .META. table

• A Region is a Subset of the table’s data

• When there is too much data in a Region…

• a split is triggered, creating 2 regions

• The association “Region -> Server” is stored in a System Table

• The location of .META. is stored in ZooKeeper

Table      Start Key  Region ID  Region Server
testTable  Key-00     1          machine01.host
testTable  Key-31     2          machine03.host
testTable  Key-65     3          machine02.host
testTable  Key-83     4          machine01.host
…          …          …          …
users      Key-AB     1          machine03.host
users      Key-KG     2          machine02.host

machine01: Region 1 - testTable, Region 4 - testTable
machine02: Region 3 - testTable, Region 1 - users
machine03: Region 2 - testTable, Region 2 - users
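Because regions are contiguous, sorted row ranges, a client can resolve the owning server with a binary search over the start keys it read from .META. A hypothetical sketch using the testTable rows above:

```python
# Sketch of how a client resolves "Region -> Server" from .META.: regions are
# a sorted list of (start_key, region_id, server); a key belongs to the region
# with the greatest start key <= key. Host names are the example values above.
import bisect

meta = [  # .META. rows for "testTable", sorted by start key
    ("Key-00", 1, "machine01.host"),
    ("Key-31", 2, "machine03.host"),
    ("Key-65", 3, "machine02.host"),
    ("Key-83", 4, "machine01.host"),
]
start_keys = [r[0] for r in meta]

def locate(key):
    # rightmost region whose start key is <= key
    idx = bisect.bisect_right(start_keys, key) - 1
    return meta[idx]

print(locate("Key-40"))  # -> ('Key-31', 2, 'machine03.host')
```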

Page 7: HBase internals

The Write Path – Create a New Table


• The client asks the master to create a new Table

• hbase> create 'myTable', 'cf'

• The Master

• Stores the Table information (“schema”)

• Creates Regions based on the key-splits provided

• if no splits are provided, one single region by default

• Assigns the Regions to the Region Servers

• The assignment Region -> Server is written to a system table called “.META.”

[Diagram: the Client calls createTable() on the Master, which stores the Table “Metadata” and assigns (“enable”) the Regions to the Region Servers]
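The Master’s region creation from key splits can be sketched as follows (illustrative only, not HBase code; N split points yield N+1 regions, and a single open-ended region when none are given):

```python
# Sketch: derive region boundaries from user-provided split keys.
# None stands for an open-ended boundary (first/last region).

def regions_from_splits(splits):
    boundaries = [None] + sorted(splits) + [None]
    # consecutive boundary pairs are the (start, end) of each region
    return list(zip(boundaries[:-1], boundaries[1:]))

print(regions_from_splits([]))                    # -> [(None, None)]  one region
print(regions_from_splits(["Key-31", "Key-65"]))  # -> three regions
```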

Page 8: HBase internals

The Write Path – “Inserting” data


• table.put(row-key:family:column, value)

• The client asks ZooKeeper for the location of .META.

• The client scans .META. searching for the Region Server responsible for handling the Key

• The client asks that Region Server to insert/update/delete the specified key/value.

• The Region Server processes the request and dispatches it to the Region responsible for handling the Key

• The operation is written to a Write-Ahead Log (WAL)

• …and the KeyValues are added to the Store: “MemStore”

[Diagram: the Client asks ZooKeeper “Where is .META.?”, scans .META., then sends “Insert KeyValue” to the owning Region Server]
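The write path above, reduced to its two essential steps (WAL append first, MemStore second), might look like this sketch (assumed class and method names, not the real RegionServer internals):

```python
# Minimal write-path sketch: every mutation is appended to the WAL for
# durability before being applied to the in-memory MemStore.

class Region:
    def __init__(self):
        self.wal = []       # append-only Write-Ahead Log
        self.memstore = {}  # key -> value, sorted when flushed to disk

    def put(self, key, value):
        self.wal.append(("put", key, value))  # durability first
        self.memstore[key] = value            # then the in-memory store

region = Region()
region.put("row-1:cf:col", "v1")
print(region.wal[0])  # the WAL holds the operation for crash recovery
```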

Page 9: HBase internals

The Write Path – Append Only to Random R/W


• Files in HDFS are

• Append-Only

• Immutable once closed

• HBase provides Random Writes?

• …not really from a storage point of view

• KeyValues are stored in memory and written to disk under memory pressure

• Don’t worry, your data is safe in the WAL!

• (The Region Server can recover data from the WAL in case of a crash)

• But this allows sorting data by Key before writing it to disk

• Deletes are like Inserts but with a “remove me flag”

[Diagram: a Region Server with its WAL; each Region’s MemStore holds sorted KeyValues (Key0 – value 0 … Key5 – value 5) that are flushed to Store Files (HFiles)]
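The MemStore-to-disk step can be sketched as follows: an in-memory map is written out as a single immutable, key-sorted file, with deletes represented as flagged records (a simplified model, not HBase code):

```python
# Sketch of "append-only random writes": the MemStore is flushed as one
# immutable, key-sorted store file; a delete is just a record carrying a
# tombstone marker instead of a value.

TOMBSTONE = ("DELETED", None)  # stand-in for the "remove me flag"

def flush(memstore):
    # produce an immutable "store file": a key-sorted tuple of records
    return tuple(sorted(memstore.items()))

memstore = {"Key3": "value 3", "Key1": "value 1", "Key2": TOMBSTONE}
store_file = flush(memstore)
print([k for k, _ in store_file])  # -> ['Key1', 'Key2', 'Key3']
```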

Page 10: HBase internals

The Read Path – “reading” data


• The client asks ZooKeeper for the location of .META.

• The client scans .META. searching for the Region Server responsible for handling the Key

• The client asks that Region Server to get the specified key/value.

• The Region Server processes the request and dispatches it to the Region responsible for handling the Key

• MemStore and Store Files are scanned to find the key

[Diagram: the Client asks ZooKeeper “Where is .META.?”, scans .META., then sends “Get Key” to the owning Region Server]

Page 11: HBase internals

The Read Path – Append Only to Random R/W


• On each flush a new file is created

• Each file has KeyValues sorted by key

• Two or more files can contain the same key (updates/deletes)

• To find a Key you need to scan all the files

• …with some optimizations

• Filter files by Start/End Key

• Having a bloom filter on each file

File 1: Key0 – value 0.0, Key2 – value 2.0, Key3 – value 3.0, Key5 – value 5.0, Key8 – value 8.0, Key9 – value 9.0

File 2: Key0 – value 0.1, Key1 – value 1.0, Key4 – value 4.0, Key5 – [deleted], Key6 – value 6.0, Key7 – value 7.0
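The multi-file lookup with its two optimizations (start/end-key filtering, and a per-file membership check standing in for the bloom filter) can be sketched as:

```python
# Read-path sketch: scan files newest-first, skipping any file whose key
# range cannot contain the key; the newest occurrence wins, and a tombstone
# hides all older values of that key.

DELETED = object()  # tombstone marker for this sketch

def get(key, files):
    # files: newest first; each is (min_key, max_key, {key: value})
    for min_key, max_key, records in files:
        if not (min_key <= key <= max_key):
            continue  # start/end-key filtering (bloom filters skip similarly)
        if key in records:
            value = records[key]
            return None if value is DELETED else value
    return None

files = [
    ("Key0", "Key7", {"Key0": "value 0.1", "Key5": DELETED}),      # newer flush
    ("Key0", "Key9", {"Key0": "value 0.0", "Key5": "value 5.0"}),  # older flush
]
print(get("Key0", files))  # -> value 0.1 (newest wins)
print(get("Key5", files))  # -> None (deleted)
```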

Page 12: HBase internals


HBase Store File Format

HFile

Page 13: HBase internals

HFile Format


• Only Sequential Writes, just append(key, value)

• Large Sequential Reads are better

• Why grouping records in blocks?

• Easy to split

• Easy to read

• Easy to cache

• Easy to index (if records are sorted)

• Block Compression (snappy, lz4, gz, …)

HFile layout:

Block: Header, Record 0, Record 1, …, Record N
Block: Header, Record 0, Record 1, …, Record N
…
Index: Index 0, …, Index N
Trailer

Key/Value (record):

Key Length : int
Value Length : int
Key : byte[]
Value : byte[]
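The record layout above maps naturally onto length-prefixed byte strings. A sketch using Python’s struct module (the big-endian int encoding here is an assumption of this sketch, not a statement about the exact HFile wire format):

```python
# Sketch of the Key/Value record layout: two ints for the lengths,
# followed by the raw key and value bytes.
import struct

def encode_record(key: bytes, value: bytes) -> bytes:
    return struct.pack(">ii", len(key), len(value)) + key + value

def decode_record(buf: bytes, offset: int = 0):
    key_len, val_len = struct.unpack_from(">ii", buf, offset)
    offset += 8  # skip the two 4-byte length fields
    key = buf[offset:offset + key_len]
    value = buf[offset + key_len:offset + key_len + val_len]
    return key, value

buf = encode_record(b"row-1", b"hello")
print(decode_record(buf))  # -> (b'row-1', b'hello')
```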

Page 14: HBase internals

Data Block Encoding


“on-disk” KeyValue:

Row Length : short
Row : byte[]
Family Length : byte
Family : byte[]
Qualifier : byte[]
Timestamp : long
Type : byte

• “Be aware of the data”

• Block Encoding allows compressing the Key based on what we know

• Keys are sorted… prefixes may be similar in most cases

• One file contains keys from one Family only

• Timestamps are “similar”, we can store the diff

• Type is “put” most of the time…
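A minimal prefix-encoding sketch of the idea (illustrative only, not the actual HBase DataBlockEncoding implementations): since keys are sorted, each key can be stored as the length of the prefix shared with the previous key plus the differing suffix, and timestamps can likewise be stored as diffs.

```python
# Sketch: encode each sorted key as (shared-prefix length, suffix).
import os

def prefix_encode(keys):
    out, prev = [], ""
    for key in keys:
        common = len(os.path.commonprefix([prev, key]))
        out.append((common, key[common:]))  # (shared prefix length, suffix)
        prev = key
    return out

keys = ["row-0001", "row-0002", "row-0010"]
print(prefix_encode(keys))  # -> [(0, 'row-0001'), (7, '2'), (6, '10')]
```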

Page 15: HBase internals


Optimize the read-path

Compactions

Page 16: HBase internals

Compactions


• Reduce the number of files to look into during a scan

• Removing duplicated keys (updated values)

• Removing deleted keys

• Creates a new file by merging the content of two or more files

• Removes the old files

File 1: Key0 – value 0.0, Key2 – value 2.0, Key3 – value 3.0, Key5 – value 5.0, Key8 – value 8.0, Key9 – value 9.0

File 2: Key0 – value 0.1, Key1 – value 1.0, Key4 – value 4.0, Key5 – [deleted], Key6 – value 6.0, Key7 – value 7.0

Merged: Key0 – value 0.1, Key1 – value 1.0, Key2 – value 2.0, Key4 – value 4.0, Key6 – value 6.0, Key7 – value 7.0, Key8 – value 8.0, Key9 – value 9.0
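The merge in the example above can be sketched as follows (simplified model, not HBase code: the newer file wins on duplicate keys, and tombstones plus the values they shadow are dropped):

```python
# Sketch of a compaction: merge two sorted files into one, keeping the
# updated value for duplicated keys and removing deleted keys entirely.

DELETED = None  # tombstone marker for this sketch

def compact(newer, older):
    merged = dict(older)
    merged.update(newer)  # newer file wins on duplicate keys
    # drop tombstones (and the older values they already shadowed)
    return {k: v for k, v in sorted(merged.items()) if v is not DELETED}

file1 = {"Key0": "value 0.0", "Key2": "value 2.0", "Key5": "value 5.0"}
file2 = {"Key0": "value 0.1", "Key1": "value 1.0", "Key5": DELETED}  # newer
print(compact(file2, file1))
# Key0 keeps the updated value 0.1; Key5 disappears entirely
```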

Page 17: HBase internals

Pluggable Compactions


• Try different algorithms

• Be aware of the data

• Time Series? I guess no updates from the 80s

• Be aware of the requests

• Compact based on statistics

• which files are hot and which are not

• which keys are hot and which are not


Page 18: HBase internals


Zero-Copy Snapshots and Table Clones

Snapshots

Page 19: HBase internals

What Is a Snapshot?


• “a Snapshot is not a copy of the table”

• a Snapshot is a set of metadata information

• The table “schema” (column families and attributes)

• The Regions information (start key, end key, …)

• The list of Store Files

• The list of WALs of the active Region Servers

[Diagram: the Master and ZooKeeper coordinating two Region Servers, each with its Regions, WAL and Store Files (HFiles)]

Page 20: HBase internals

How Does Taking a Snapshot Work?

• The master orchestrates the RSs

• the communication is done via ZooKeeper

• using a “2-phase commit like” transaction (prepare/commit)

• Each RS is responsible for taking its “piece” of the snapshot

• For each Region, it stores the metadata information needed

• (list of Store Files, WALs, region start/end keys, …)

• (list of Store Files, WALs, region start/end keys, …)

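The prepare/commit flow might be sketched like this (hedged: plain function calls stand in for the real ZooKeeper-based coordination, and all names are illustrative):

```python
# Sketch of the "2-phase commit like" snapshot transaction: every Region
# Server prepares its piece (region metadata, store file lists); the master
# commits only if all participants succeeded.

def take_snapshot(region_servers):
    # phase 1 (prepare): each RS gathers its piece of the snapshot
    pieces = [rs["prepare"]() for rs in region_servers]
    if not all(pieces):
        return None  # abort: some participant failed to prepare
    # phase 2 (commit): assemble the snapshot manifest from all pieces
    return {"regions": pieces}

rs1 = {"prepare": lambda: {"region": "r1", "files": ["hfile-001"]}}
rs2 = {"prepare": lambda: {"region": "r2", "files": ["hfile-002"]}}
print(take_snapshot([rs1, rs2]))
```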

Page 21: HBase internals

Cloning a Table from a Snapshot

21

• hbase> clone_snapshot 'snapshotName', 'tableName' …

• Creates a new table with the data “contained” in the snapshot

• No data copies involved

• HFiles are immutable, and shared between tables and snapshots

• You can insert/update/remove data from the new table

• No repercussions on the snapshot, original tables or other cloned tables
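The zero-copy property can be sketched with plain lists of file references (illustrative file names, not the real snapshot manifest format):

```python
# Sketch: tables and snapshots hold references to the same immutable HFiles.
# A clone starts from the snapshot's file list; new writes only add files to
# the clone and never touch the shared, immutable ones.

snapshot = {"files": ["hfile-001", "hfile-002"]}  # metadata only, no data copy

clone = {"files": list(snapshot["files"])}  # new table referencing same files
clone["files"].append("hfile-777")          # a flush in the clone adds a file

print(snapshot["files"])  # unchanged -> ['hfile-001', 'hfile-002']
```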

Page 22: HBase internals

Compactions & Archiving


• HFiles are immutable, and shared between tables and snapshots

• On compaction or table deletion, files are removed from disk

• If one of these files is referenced by a snapshot or a cloned table

• The file is moved to an “archive” directory

• And deleted later, when there are no references to it
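The archiving rule can be sketched as a reference check before physical deletion (illustrative names, not the real HFile cleaner classes):

```python
# Sketch: a file compacted away (or belonging to a dropped table) may only be
# physically deleted once no snapshot or clone references it; until then it
# is parked in an "archive" directory.

def cleanup(file_name, referencing_snapshots):
    if any(file_name in snap for snap in referencing_snapshots):
        return "archive/" + file_name  # still referenced: move, don't delete
    return None                        # no references: safe to delete

snapshots = [{"hfile-001"}, {"hfile-002"}]
print(cleanup("hfile-001", snapshots))  # -> archive/hfile-001
print(cleanup("hfile-999", snapshots))  # -> None
```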

Page 23: HBase internals


What can be improved?

Future

Page 24: HBase internals

0.96 is coming up

• Moving RPC to Protobuf

• Allows rolling upgrades with no surprises

• HBase Snapshots

• Pluggable Compactions

• Remove -ROOT-

• Table Locks


Page 25: HBase internals

0.98 and Beyond


• Transparent Table/Column-Family Encryption

• Cell-level security

• Multiple WALs per Region Server (MTTR)

• Data Placement Awareness (MTTR)

• Data Type Awareness

• Compaction policies, based on the data needs

• Managing blocks directly (instead of files)

Page 26: HBase internals



Questions? Matteo Bertozzi | @Cloudera
