![Page 1: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/1.jpg)
1New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data:BigTable and HBase
![Page 2: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/2.jpg)
2New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
HBase and BigTable● HBase is Hadoop's counterpart of Google's BigTable● BigTable meets the need for a highly scalable storage
system for structured data● Provides random and (almost) real-time data access ● Works on top of Google File System● Data is structured into entities (records), aggregated into few
huge files and indexed● Not a relational database. Offers typical create, read, update
and delete (CRUD) ops. plus scan of keys● Not ACID guarantees● Used by many applications in Google
![Page 3: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/3.jpg)
3New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
Hadoop's and Google's stacks
Hadoop MapReduce
Hadoop Distributed File System
HBase
Hadoop
ZooKeeper
Google MapReduce
Google File System
BigTable
Chubby
![Page 4: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/4.jpg)
4New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
HBase/BigTable Tables● “A Bigtable is a sparse, distributed, persistent
multidimensional sorted map”
Map → Associates keys to valuesSorted → Ordered by key (efficient look-ups)
Multidimensional → Key is formed by several valuesPersistent → Once written, it is there until removed
Distributed → Stored across different nodesSparse → Many (most) values are not defined
![Page 5: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/5.jpg)
5New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
Hbase/BigTable Datamodel● Rows are composed of columns, which are grouped
into column families● Column families group semantically related values● Each column family is stored as one file (HFile) in HDFS● One column family can have millions of columns● Each column is referenced by family:qualifier
● A row key is an array of bytes. Keys are ordered lexicographically
● A column value is denoted a cell. They have timestamps. Old values are kept
![Page 6: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/6.jpg)
6New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
BigTable vs. Relational Datamodels● Students/courses database
Students
Id (primary key)
name
gender
age
St_Co_Rel
st_id
co_id
mark
Course
Id (primary key)
title
description
row column families
info: course:
<student_id> info:nameinfo:genderinfo:age
course:course_id=mark
row column families
info: course:
<course_id> info:titleinfo:descript
student:student_id=mark
Relational Model
BigTable Model (identifiers handled by app programmers, no referencial integrity)
![Page 7: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/7.jpg)
7New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
val1
val2
ts 5
Hbase/BigTable Datamodel View
r_key_1
r_key_22
r_key_3
...
ROW KEYS
lexicographicallyordered
arbitraryarray ofbytes
COLUMN FAMILY COLUMN FAMILYcf1:col_a cf1:col_b cf2:col_y cf2:col_z
val3
ts 9
ts 10
arbitraryarray ofbytes
humanreadablename
ordered
cellvalues
time
(Table, RowKey, Family, Column, Timestamp) → Value
aa ts 3
![Page 8: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/8.jpg)
8New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
val1
val2
ts 5
Hbase/BigTable Data Storage
r_key_1
r_key_22
r_key_3
...
ROW KEYS COLUMN FAMILY COLUMN FAMILY
cf1:col_acf2:col_y
val3
ts 9
ts 10
Each column familyin its own HFile files
aa ts 3
cf1:col_b
Columns with no valueare not stored
Stored in HDFS, split into blocks (64 Kbs per block)and with a block index at the end of the HFile
![Page 9: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/9.jpg)
9New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
HBase Regions● For scalability, tables are split into regions by key● Each region is assigned to a region server
ROW KEYS
region
Region Server Region Server
![Page 10: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/10.jpg)
10New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
Region Server Internal ArchitectureRegion Server
HLog
1)
HRegion
Store table t2, column family cf2
(Write Ahead Log)
2)
MemStore
(in-memory map)
key
key
key
Data
Data
Data
Store
HRegion
A Store keeps the keys of a certain region of some column family of some table.
![Page 11: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/11.jpg)
11New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
Region Server Internal ArchitectureRegion Server
HLog
1)
HRegion
Store table t1, column family cf1
(Write Ahead Log)
2)
MemStore
(in-memory map)
Store table t2, column family cf2
HRegion
When the MemStore gets full, data is written to a HFile. Several HFiles per Store can be present.
HFile
HFile
BlockBlock
Index
HFile
![Page 12: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/12.jpg)
12New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
HBase over HDFS
HDFS DataNode
Region Server
HLog HFile HFile
HDFS DataNode
HDFS NameNode
HDFS Client middleware
HFile
Region Server
HLog HFile HFile
HDFS Client middleware
HFile
![Page 13: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/13.jpg)
13New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
Reading Data● When reading data by row key the RegionServer
attending the corresponding region is queried● All HRegions which store a column family whose
data is requested by the query must be checked● For that, the in-memory map and the HBase files must
all be read (merged read)● HBase files can use bloom filters to speed-up readings
● Unless otherwise specified, only the last version of each value is returned
![Page 14: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/14.jpg)
14New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
Writing, Updating, and Deleting Data● Rows are written on the in-memory map of the
corresponding RegionServer● Update operations just write a new version of data● Row deletion depends on where the row is located
● Rows in the in-memory map are just deleted● HBase files are immutable! Deletion markers are used
● To prevent HBase files from using too much space they are merged● Rows marked as deleted and old versions of data are
removed
![Page 15: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/15.jpg)
15New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
HBase ROOT and .META Tables● Used to organize data in HBase● Special system tables● Clients use them to locate which RegionServer
serves a certain key of a certain table● They are partitioned into regions and served by
RegionServers as normal tables
![Page 16: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/16.jpg)
16New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
Example of Lookup for data in HBaseZooKeeper
/hbase/root-region-server → “RS R”
User
Lookup #1
Region Server R
Region ROOT
.META.,,1
.META.,tableN,123456789
...
“RS M1”“RS M2”...
Lookup #2
Region Server M1
Region .META.,,1
tableA,,98766789tableA,row-520,99995432...
“RS U1”“RS U2”...
Region Server M2
Region .META.,t,tableN,123456789
tableN,,12344321tableN,row-333,12345791...
“RS U3”“RS U4”...
Region Server U1
Region tableA,,98766789
row-1…
……
……
……
Lookup #4
Lookup #3
Region Server U2
Region tableA,row520,99994532
row-520…
……
……
……
![Page 17: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/17.jpg)
17New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
Zookeeper● Zookeeper is used to coordinate members of distributed systems● Implements a hierarchical namespace where clients write/read to share
state information. Each element in the path can store data and other elements
● A typical Zookeeper deployment has several servers, grouped in an ensemble. The middleware takes care of consistency, load balancing, crash recovery...
/
/app1
/app2
/app2/info1
/app2/info2
ZookeeperServer
ZookeeperServer
ZookeeperServer
Leader
Client ClientClient Client
Ensemble
Distributed system to coordinate
![Page 18: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/18.jpg)
18New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
HMaster● Splits the key space of all tables and assigns the
resulting regions to the present RegionServers● Balances load by re-assigning regions● Handles metadata changes requests from clients● It is NOT involved in read/write operations● It uses Zookeeper to keep track of RegionServers
and to emit information for clients (like which RS holds the ROOT table)
![Page 19: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/19.jpg)
19New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
HBase Architecture
RegionServer
RegionServer
Region Server
User MasterZooKeeper
r/wdata
HLog
HRegion
Store HFile HFile
HDFS
manage tables and regions
create/delete tables and col. families
![Page 20: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/20.jpg)
20New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
HBase Client API● HBaseAdmin is used to:
● Create/delete tables: createTable(); deleteTable()
● Add/delete column families to tables: addColumn(); deleteColumn()
HBaseAdmin hbAdm = new HbaseAdmin(HBaseConfiguration.create());hbAdm.createTable(new HTableDescriptor(“TestTable”));
hbAdm.addColumn(“TestTable”, new HcolumnDescriptor(“TestColFamily”));
![Page 21: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/21.jpg)
21New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
HBase Client API (2)● HTable class is the basic data access entity:
● Read data with get(); getScanner(Scan)
● Write data with put(); checkAndPut()
● Delete data with delete(); checkAndDelete()
Get get = new Get(Bytes.toBytes(“testRow”));Result result = testTable.get(get);for(byte[] family: result.keySet()) for(byte qual: result.get(family).keySet()) for(Long ts: result.get(family).get(qual).keySet()) String val = Bytes.toString( Result.get(family).get(qual).get(ts));
Put put = new Put(Bytes.toBytes(“testRow”));put.add(Bytes.toBytes(“testFam”), Bytes.toBytes(“testQual”), Bytes.toBytes(“value”));testTable.put(put);
![Page 22: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/22.jpg)
22New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
HBase Advanced Features● Filters: When scanning tables, Filter instances
refine the results returned to the client
● Counters: support for read-and-update atomic operations
● Coprocessors: to extend HBase functionality with users' custom code that it is run by the framework. Example: secondary indexes, access control...
User
RegionServer
RegionServer
Scan Filter
![Page 23: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/23.jpg)
23New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
Bibliography
● (Paper) “Bigtable: A Distributed Storage System for Structured Data”. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. ACM Transactions on Computer Systems, Volume 26 Issue 2, 2008.
● (Book) “HBase: The Definitive Guide” (2nd edition). Lars George. O'Reilly Media, Inc., 2011
![Page 24: Storage of Structured Data: BigTable and HBaselsd.ls.fi.upm.es/nuevas-tendencias-en-sistemas-distribuidos/HBase_2.pdf · Storage of Structured Data: BigTable and HBase HBase/BigTable](https://reader030.vdocuments.us/reader030/viewer/2022040221/5e2e81cd30c9e13f1077d96b/html5/thumbnails/24.jpg)
24New Trends In Distributed Systems
MSc Software and Systems
Storage of Structured Data: BigTable and HBase
(Example with HBase)