HBase Internals
TRANSCRIPT
Matteo Bertozzi | @Cloudera
March 2013 - Hadoop Summit Europe
HBase Storage Internals, Present and Future!
What is HBase?
• Open-source storage manager that provides random read/write on top of HDFS
• Provides Tables with a “Key:Column/Value” interface
• Dynamic columns (qualifiers), no schema needed
• “Fixed” column groups (families)
• table[row:family:column] = value
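The logical model above can be sketched with plain Python dicts — this is just an illustration of the `table[row:family:column] = value` idea, not the HBase client API:

```python
# Conceptual sketch of HBase's logical model: a map of
# row -> {"family:qualifier" -> value}. Illustrative only.
table = {}

def put(row, family, qualifier, value):
    # Qualifiers are dynamic: no schema declaration needed.
    table.setdefault(row, {})[f"{family}:{qualifier}"] = value

def get(row, family, qualifier):
    return table.get(row, {}).get(f"{family}:{qualifier}")

put("user1", "cf", "name", "alice")   # 'cf' is a fixed column family
put("user1", "cf", "email", "a@x.io")  # new qualifier added on the fly
print(get("user1", "cf", "name"))      # alice
```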
HBase Ecosystem
• Apache Hadoop HDFS for data durability and reliability (Write-Ahead Log)
• Apache ZooKeeper for distributed coordination
• Apache Hadoop MapReduce: built-in support for running MapReduce jobs
How HBase Works
"View from 10000ft"
Master, Region Servers and Regions
[Diagram: a Client talks to ZooKeeper and the Master; each Region Server hosts a set of Regions, with the data stored on HDFS]
• Region Server
• Server that hosts a set of Regions
• Responsible for handling reads and writes
• Region
• The basic unit of scalability in HBase
• Subset of the table's data
• Contiguous, sorted range of rows stored together
• Master
• Coordinates the HBase cluster
• Assignment/balancing of the Regions
• Handles admin operations
• create/delete/modify table, …
Autosharding and the .META. Table
• A Region is a subset of the table's data
• When there is too much data in a Region…
• a split is triggered, creating 2 regions
• The association "Region -> Server" is stored in a system table
• The location of .META. is stored in ZooKeeper
Table      Start Key  Region ID  Region Server
testTable  Key-00     1          machine01.host
testTable  Key-31     2          machine03.host
testTable  Key-65     3          machine02.host
testTable  Key-83     4          machine01.host
…          …          …          …
users      Key-AB     1          machine03.host
users      Key-KG     2          machine02.host
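A sketch of how a client can resolve a key to a region using a .META.-like table: since regions cover contiguous, sorted key ranges, the lookup is a binary search over start keys. The rows below mirror the testTable example above; this is not the real .META. layout.

```python
import bisect

# (start_key, region_id, server) rows for one table, sorted by
# start key. Hypothetical data mirroring the table above.
meta = [("Key-00", 1, "machine01.host"),
        ("Key-31", 2, "machine03.host"),
        ("Key-65", 3, "machine02.host"),
        ("Key-83", 4, "machine01.host")]

def locate(key):
    starts = [row[0] for row in meta]
    # The region owning the key is the last one whose start key <= key.
    i = bisect.bisect_right(starts, key) - 1
    return meta[max(i, 0)]

print(locate("Key-50"))  # ('Key-31', 2, 'machine03.host')
```

A real client caches these lookups, so .META. is not scanned on every request.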
machine01: Region 1 (testTable), Region 4 (testTable)
machine02: Region 3 (testTable), Region 1 (users)
machine03: Region 2 (testTable), Region 2 (users)
The Write Path – Create a New Table
• The client asks the Master to create a new table
• hbase> create 'myTable', 'cf'
• The Master
• Stores the table information ("schema")
• Creates Regions based on the key-splits provided
• if no splits are provided, a single region is created by default
• Assigns the Regions to the Region Servers
• The assignment Region -> Server is written to a system table called ".META."
[Diagram: the Client sends createTable() to the Master, which stores the table "metadata" and assigns ("enables") the Regions across the Region Servers]
The Write Path – "Inserting" Data
• table.put(row-key:family:column, value)
• The client asks ZooKeeper for the location of .META.
• The client scans .META. searching for the Region Server responsible for the key
• The client asks that Region Server to insert/update/delete the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible for the key
• The operation is written to a Write-Ahead Log (WAL)
• …and the KeyValues are added to the in-memory store: the "MemStore"
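The server-side part of this flow — WAL first, MemStore second — can be sketched as follows. All names are illustrative; in reality the WAL lives on HDFS and the MemStore is a concurrent sorted structure.

```python
# Sketch of the server-side write path: append to the WAL for
# durability, then apply to the in-memory MemStore. Illustrative only.
wal = []        # append-only log (stored on HDFS in reality)
memstore = {}   # in-memory buffer, kept sorted at flush time

def handle_put(key, value):
    wal.append(("put", key, value))  # 1. durability first
    memstore[key] = value            # 2. then the in-memory store

handle_put("row1:cf:col", "v1")
print(wal[0])  # ('put', 'row1:cf:col', 'v1')
```

The ordering matters: if the server crashes after the WAL append, the edit can be replayed; if the MemStore were updated first, an acknowledged write could be lost.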
[Diagram: the Client asks ZooKeeper "Where is .META.?", scans .META. on its Region Server, then sends "Insert KeyValue" to the Region Server hosting the target Region]
The Write Path – Append Only to Random R/W
• Files in HDFS are
• Append-only
• Immutable once closed
• HBase provides random writes?
• …not really, from a storage point of view
• KeyValues are stored in memory and written to disk under pressure
• Don't worry, your data is safe in the WAL!
• (the Region Server can recover data from the WAL in case of a crash)
• But this allows HBase to sort data by key before writing it to disk
• Deletes are like inserts, but with a "remove me" flag
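The buffer-then-flush behavior, including delete tombstones, can be sketched like this (illustrative data structures, not the actual MemStore implementation):

```python
# Sketch: KeyValues accumulate in a MemStore and are flushed to an
# immutable, key-sorted "file" under memory pressure. Deletes are
# just writes carrying a tombstone marker. Illustrative only.
memstore = {}
store_files = []  # each flush appends a new sorted, immutable list

def put(key, value):
    memstore[key] = value

def delete(key):
    memstore[key] = ("DELETED",)  # tombstone, not an in-place removal

def flush():
    global memstore
    store_files.append(sorted(memstore.items()))  # sorted by key on write
    memstore = {}  # start a fresh buffer

put("Key1", "v1")
put("Key0", "v0")
delete("Key1")
flush()
print(store_files[0])  # [('Key0', 'v0'), ('Key1', ('DELETED',))]
```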
[Diagram: a Region Server with its Regions and WAL; sorted KeyValues (Key0 – value 0 … Key5 – value 5) sit in the MemStore before being flushed to Store Files (HFiles)]
The Read Path – "Reading" Data
• The client asks ZooKeeper for the location of .META.
• The client scans .META. searching for the Region Server responsible for the key
• The client asks that Region Server to get the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible for the key
• The MemStore and the Store Files are scanned to find the key
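The last step — scanning the MemStore and the store files — works because newer versions shadow older ones. A minimal sketch (files modeled as dicts for brevity):

```python
# Sketch of a point read: check the MemStore first, then store files
# from newest to oldest; the first version found wins. Illustrative only.
memstore = {"Key1": "v1.2"}
store_files = [  # oldest first
    {"Key0": "v0.0", "Key1": "v1.0"},
    {"Key1": "v1.1", "Key2": "v2.0"},
]

def get(key):
    if key in memstore:
        return memstore[key]
    for sf in reversed(store_files):  # newest file first
        if key in sf:
            return sf[key]
    return None

print(get("Key1"))  # v1.2 (MemStore shadows older store files)
```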
[Diagram: the Client asks ZooKeeper "Where is .META.?", scans .META. on its Region Server, then sends "Get Key" to the Region Server hosting the target Region]
The Read Path – Append Only to Random R/W
• Each flush creates a new file
• Each file has its KeyValues sorted by key
• Two or more files can contain the same key (updates/deletes)
• To find a key you need to scan all the files
• …with some optimizations
• Filter files by start/end key
• Keep a bloom filter on each file
File 1: Key0 – value 0.0, Key2 – value 2.0, Key3 – value 3.0, Key5 – value 5.0, Key8 – value 8.0, Key9 – value 9.0
File 2: Key0 – value 0.1, Key1 – value 1.0, Key5 – [deleted], Key6 – value 6.0, Key7 – value 7.0
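Both pruning optimizations can be sketched together. Here a plain set stands in for the Bloom filter (a real Bloom filter is probabilistic: it can say "maybe present" but never gives false negatives); the file metadata is hypothetical.

```python
# Sketch of file pruning on reads: skip a file when the key falls
# outside its [first, last] key range, or when its "Bloom filter"
# (a plain set here) says the key is definitely absent.
files = [
    {"first": "Key0", "last": "Key9", "keys": {"Key0", "Key2", "Key3"}},
    {"first": "Key5", "last": "Key9", "keys": {"Key5", "Key8", "Key9"}},
]

def candidate_files(key):
    return [f for f in files
            if f["first"] <= key <= f["last"]  # start/end key filter
            and key in f["keys"]]              # bloom-filter check

print(len(candidate_files("Key2")))  # 1: only the first file survives
```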
HBase Store File Format
HFile
HFile Format
• Only sequential writes, just append(key, value)
• Large sequential reads are better
• Why group records into blocks?
• Easy to split
• Easy to read
• Easy to cache
• Easy to index (if records are sorted)
• Block compression (snappy, lz4, gz, …)
[Diagram: an HFile is a sequence of blocks, each with a header and records (Record 0 … Record N), followed by the block indexes (Index 0 … Index N) and a trailer]
Key/Value (record) layout:
Key Length : int
Value Length : int
Key : byte[]
Value : byte[]
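The record layout above — two length-prefix ints followed by the raw key and value bytes — can be encoded and decoded with a few lines of Python (a simplified sketch of the framing, not HBase's actual serializer):

```python
import struct

# Record framing per the layout above: 4-byte big-endian key length,
# 4-byte big-endian value length, then key bytes, then value bytes.
def encode_record(key: bytes, value: bytes) -> bytes:
    return struct.pack(">ii", len(key), len(value)) + key + value

def decode_record(buf: bytes):
    klen, vlen = struct.unpack_from(">ii", buf, 0)
    key = buf[8:8 + klen]
    value = buf[8 + klen:8 + klen + vlen]
    return key, value

rec = encode_record(b"row1", b"hello")
print(decode_record(rec))  # (b'row1', b'hello')
```

Because every record is self-describing, a block can be scanned by repeatedly reading two ints and skipping ahead — which is what makes the append-only format easy to read and index.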
Data Block Encoding
"On-disk" KeyValue layout:
Row Length : short
Row : byte[]
Family Length : byte
Family : byte[]
Qualifier : byte[]
Timestamp : long
Type : byte
• "Be aware of the data"
• Block Encoding allows the keys to be compressed based on what we know
• Keys are sorted… prefixes are likely similar in most cases
• One file contains keys from one family only
• Timestamps are "similar": we can store the diff
• Type is "put" most of the time…
Optimize the Read Path
Compactions
• Reduces the number of files to look into during a scan
• Removes duplicated keys (updated values)
• Removes deleted keys
• Creates a new file by merging the content of two or more files
• Removes the old files
File 1: Key0 – value 0.0, Key2 – value 2.0, Key3 – value 3.0, Key5 – value 5.0, Key8 – value 8.0, Key9 – value 9.0
File 2: Key0 – value 0.1, Key1 – value 1.0, Key4 – value 4.0, Key5 – [deleted], Key6 – value 6.0, Key7 – value 7.0
Merged: Key0 – value 0.1, Key1 – value 1.0, Key2 – value 2.0, Key3 – value 3.0, Key4 – value 4.0, Key6 – value 6.0, Key7 – value 7.0, Key8 – value 8.0, Key9 – value 9.0
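A minimal sketch of the merge step — newer versions win, tombstones are dropped (a major compaction; minor compactions may retain tombstones, and real files are merged with a streaming k-way merge rather than dicts):

```python
# Sketch of a compaction: merge two key-sorted files, keep only the
# newest version of each key, and drop delete tombstones.
# Files are (key, value) lists; the second argument is newer.
def compact(older, newer):
    merged = dict(older)
    merged.update(newer)  # newer versions overwrite older ones
    return sorted((k, v) for k, v in merged.items() if v != "[deleted]")

f1 = [("Key0", "0.0"), ("Key2", "2.0"), ("Key5", "5.0")]
f2 = [("Key0", "0.1"), ("Key1", "1.0"), ("Key5", "[deleted]")]
print(compact(f1, f2))
# [('Key0', '0.1'), ('Key1', '1.0'), ('Key2', '2.0')]
```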
Pluggable Compactions
• Try different algorithms
• Be aware of the data
• Time series? I guess no updates from the '80s
• Be aware of the requests
• Compact based on statistics
• which files are hot and which are not
• which keys are hot and which are not
Zero-Copy Snapshots and Table Clones
Snapshots
What Is a Snapshot?
• "a Snapshot is not a copy of the table"
• a Snapshot is a set of metadata information
• The table "schema" (column families and attributes)
• The Regions information (start key, end key, …)
• The list of Store Files
• The list of active WALs
[Diagram: the Master coordinates two Region Servers via ZooKeeper; each Region Server holds Regions, a WAL, and Store Files (HFiles)]
How Does Taking a Snapshot Work?
• The Master orchestrates the Region Servers
• the communication is done via ZooKeeper
• using a "2-phase commit like" transaction (prepare/commit)
• Each Region Server is responsible for taking its "piece" of the snapshot
• For each Region, it stores the metadata information needed
• (list of Store Files, WALs, region start/end keys, …)
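The prepare/commit flow can be sketched as below. In reality the messages travel through ZooKeeper znodes and failures trigger an abort; here they are plain method calls on hypothetical objects:

```python
# Sketch of the 2-phase-commit-like snapshot flow: the Master asks
# every Region Server to "prepare" its piece of the snapshot, and
# only if all succeed does it tell them to "commit". Illustrative only.
class RegionServer:
    def __init__(self, name):
        self.name = name
        self.prepared = False
        self.committed = False

    def prepare(self):
        self.prepared = True  # gather region metadata, file lists, …
        return True

    def commit(self):
        self.committed = True  # finalize this server's piece

def take_snapshot(servers):
    if all(rs.prepare() for rs in servers):  # phase 1: prepare
        for rs in servers:
            rs.commit()                      # phase 2: commit
        return True
    return False  # any failure would abort the snapshot

servers = [RegionServer("rs1"), RegionServer("rs2")]
print(take_snapshot(servers))  # True
```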
Cloning a Table from a Snapshot
• hbase> clone_snapshot 'snapshotName', 'tableName' …
• Creates a new table with the data "contained" in the snapshot
• No data copies involved
• HFiles are immutable, and shared between tables and snapshots
• You can insert/update/remove data from the new table
• No repercussions on the snapshot, the original table, or other cloned tables
Compactions & Archiving
• HFiles are immutable, and shared between tables and snapshots
• On compaction or table deletion, files are removed from disk
• If one of these files is referenced by a snapshot or a cloned table
• the file is moved to an "archive" directory
• and deleted later, when there are no references to it
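The archive rule is essentially reference counting on files. A sketch with hypothetical names (file and reference identifiers are made up):

```python
# Sketch of archiving: a compacted/deleted HFile is moved to an
# archive directory while snapshots or clones still reference it,
# and only physically deleted once no references remain.
references = {"hfile-001": {"snapshot-A", "clone-T"}}
archive = set()

def remove_file(name):
    if references.get(name):
        archive.add(name)  # still referenced: archive, don't delete
        return "archived"
    archive.discard(name)  # no references left: safe to delete
    return "deleted"

def drop_reference(name, ref):
    references[name].discard(ref)

print(remove_file("hfile-001"))      # archived
drop_reference("hfile-001", "snapshot-A")
drop_reference("hfile-001", "clone-T")
print(remove_file("hfile-001"))      # deleted
```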
What can be improved?
Future
0.96 is coming up
• Moving RPC to Protobuf
• Allows rolling upgrades with no surprises
• HBase Snapshots
• Pluggable Compactions
• Remove -ROOT-
• Table Locks
0.98 and Beyond
• Transparent Table/Column-Family Encryption
• Cell-level security
• Multiple WALs per Region Server (MTTR)
• Data Placement Awareness (MTTR)
• Data Type Awareness
• Compaction policies, based on the data needs
• Managing blocks directly (instead of files)
Questions? Matteo Bertozzi | @Cloudera