Adding Value to HBase with IBM InfoSphere BigInsights and Big SQL
DESCRIPTION
This is the extended deck I used for my presentation at the Information On Demand 2013 conference, Session Number 1687: Adding Value to HBase with IBM InfoSphere BigInsights and Big SQL. This presentation covers accessing HBase using Big SQL. It starts by going over general HBase concepts, then delves into how Big SQL adds an SQL layer on top of HBase (via the HBase storage handler), secondary index support, queries, and more.
TRANSCRIPT
Adding Value to HBase with IBM InfoSphere BigInsights and Big SQL
Session Number 1687
Piotr Pruski
@ppruski
2
Please note
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
3
Agenda
Introduction to HBase
Big SQL HBase Storage Handler
– Column mapping
– Data encoding
– Data load
Secondary Indexes
Querying
Recommendations and limitations
Logs and Troubleshooting
Highlights and HBase use cases
4
HBase Basics
Client/server database
– Master and a set of region servers
Key-value store
– Key and value are byte arrays
– Efficient access using row key
Different from relational databases
– No types: all data is stored as bytes
– No schema: rows can have different sets of columns
5
HBase Data Model
Table
– Contains column families
Column family
– Logical and physical grouping of columns
Column
– Exists only when inserted
– Can have multiple versions
– Each row can have a different set of columns
– Each column identified by its key
Row key
– Implicit primary key
– Used for storing ordered rows
– Efficient queries using row key
HBTABLE
Row key | Value
11111 | cf_data: {'cq_name': 'name1', 'cq_val': 1111}; cf_info: {'cq_desc': 'desc11111'}
22222 | cf_data: {'cq_name': 'name2', 'cq_val': 2013 @ ts=2013, 'cq_val': 2012 @ ts=2012}

HFiles (cf_data):
11111 cf_data cq_name name1 @ ts1
11111 cf_data cq_val 1111 @ ts1
22222 cf_data cq_name name2 @ ts1
22222 cf_data cq_val 2013 @ ts1
22222 cf_data cq_val 2012 @ ts2

HFile (cf_info):
11111 cf_info cq_desc desc11111 @ ts1
6
More on the HBase Data Model
There is no schema for an HBase table in the RDBMS sense
– Except that one has to declare the column families
• Since they determine the physical on-disk organization
– Thus every row can have a different set of columns
HBase is described as a key-value store
Each key-value pair is versioned
– The version can be a timestamp or an integer
– Updating a column just adds a new version
All data are byte arrays, including table names, column family names, and column names (also called column qualifiers)
Key/value structure: Key = (Row, Column Family, Column Qualifier, Timestamp); Value = the cell bytes
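The logical layout above can be sketched as a nested map. This is an illustrative model only, not the HBase API: real HBase stores all keys and values as bytes, and plain strings are used here for readability.

```python
# Illustrative model of HBase's logical data layout:
# table -> row key -> column family -> qualifier -> {timestamp: value}
from collections import defaultdict

def make_table():
    return defaultdict(            # row key
        lambda: defaultdict(       # column family
            lambda: defaultdict(dict)))  # qualifier -> {ts: value}

t = make_table()
t["22222"]["cf_data"]["cq_val"][2012] = "2012"
t["22222"]["cf_data"]["cq_val"][2013] = "2013"  # new version, old one kept

# Reads return the latest version by default.
latest_ts = max(t["22222"]["cf_data"]["cq_val"])
print(latest_ts, t["22222"]["cf_data"]["cq_val"][latest_ts])
```

Note how an "update" is simply a new version under the same qualifier, matching the versioning behaviour described above.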
7
HBase Cluster Architecture
HDFS / GPFS
Region Server …
Master
…Client
ZooKeeper Peer
ZooKeeper Quorum
ZooKeeper Peer
HBase master assigns regions and performs load balancing
Client finds region server addresses in ZooKeeper
Client reads and writes row by accessing the region server
ZooKeeper is used for coordination / monitoring
[Diagram: each region server hosts multiple regions; each region stores its data in HFiles; coprocessors run on each region server]
8
BigInsights - Big SQL
Big SQL brings robust SQL support to the Hadoop ecosystem
Driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC and ODBC compliant tools should continue to function
• Data warehouse augmentation is a very common use case for Hadoop
While highly scalable, MapReduce is notoriously difficult to use
SQL support opens the data to a much wider audience
Making data in BigInsights accessible to SQL-capable tools
– Cognos BI
– Microstrategy
– Tableau
– …
9
Big Data for a Query-able Archive
BigInsights (Hadoop) and InfoSphere Warehouse / Netezza
• Cognos BI can issue SQL queries against data managed by Apache Hive in BigInsights
• The IBM Big Data platform supports bi-directional queries between BigInsights and the EDW
• Key benefits:
• Existing SQL-based applications can leverage the Big Data platform
• EDW optimized from a size and performance perspective
• Provides cost-effective and flexible big data storage and analysis
[Diagram: Cognos Insight and Cognos BI Server (Explore & Analyze, Report & Act) connect over SQL, with bi-directional query support between BigInsights and the warehouse via InfoSphere Optim]
10
Big SQL HBase Storage Handler
Maps SQL columns to HBase data: column mapping
Handles serialization/deserialization of data (SerDe)
Efficiently handles SQL queries by pushing down predicates
[Diagram: input data (delimited files, warehouse, JDBC application) flows into Big SQL; the HBase Storage Handler and SerDe sit between Big SQL and HBase/DFS. SQL queries pass through the query optimizer (compile time: process hints) and the query analyzer (runtime: HBase scan limits, filters, index usage) before query results are returned]
11
Column Mapping
Mapping HBase row key/columns to SQL columns
– Supports one-to-one and one-to-many mappings
One-to-one mapping
– Single HBase entity mapped to a single SQL column
[Diagram: one-to-one mapping between the HBase row and SQL columns — key (11111) → id, cf_data:cq_name (name1) → name, cf_data:cq_val (1111) → value, cf_info:cq_desc (desc11111) → desc]
12
Create Table: One to One Mapping
CREATE HBASE TABLE HBTABLE (
  id INT,
  name VARCHAR(10),
  value INT,
  desc VARCHAR(20)
)
COLUMN MAPPING (
  key mapped by (id),
  cf_data:cq_name mapped by (name),
  cf_data:cq_val mapped by (value),
  cf_info:cq_desc mapped by (desc)
);
The key mapping is required. An HBase column is identified by family:qualifier.
13
One to Many Column Mapping
Single HBase entity mapped to multiple SQL columns
Composite key
– HBase row key mapped to multiple SQL columns
Dense column
– One HBase column mapped to multiple SQL columns
[Diagram: HBase row with key 11111_ac11, cf_data:cq_names = fname1_lname1, cf_data:cq_acct = 11111#11#0.25, mapped to the SQL columns (userid, acc_no, first_name, last_name, balance, min_bal, interest)]
14
Create Table: One to Many Mapping
CREATE HBASE TABLE DENSE_TABLE (
  userid INT,
  acc_no VARCHAR(10),
  first_name VARCHAR(10),
  last_name VARCHAR(10),
  balance DOUBLE,
  min_bal DOUBLE,
  interest DOUBLE
)
COLUMN MAPPING (
  key mapped by (userid, acc_no),                          -- composite key
  cf_data:cq_names mapped by (first_name, last_name),      -- dense column
  cf_data:cq_acct mapped by (balance, min_bal, interest)   -- dense column: list of SQL columns
);
15
Why Use One-to-Many Mapping?
HBase is very verbose
– Stores a lot of information for each value
– Primarily intended for sparse data
<row> <columnfamily> <columnqualifier> <timestamp> <value>
Saves storage space
– Sample table with 9 columns, 1.5 million rows
– One-to-one mapping: 522 MB
– One-to-many mapping: 276 MB
Improves query response time
– Query results also return the entire key for each value
– select * query on the sample table:
• One-to-one mapping: 1m 31s
• One-to-many mapping: 1m 2s
16
Data encoding
HBase stores all data as arrays of bytes
– The application decides how to encode/decode the bytes
Big SQL uses the Hive SerDe interface for serialization/deserialization
Supports two types of data encoding: string and binary
Encoding can be specified at the HBase row key/column level
[Diagram: HBase row key 11111_ac11 (string), cf_data:cq_names = fname1_lname1 (string), cf_data:cq_acct = 0x000001… (binary), mapped to the SQL columns (userid, acc_no, first_name, last_name, balance, min_bal, interest)]
17
String encoding
Default encoding
Value is converted to string and stored as UTF-8 bytes
Separator to identify parts in one to many mapping– Default separator: \u0000
CREATE HBASE TABLE DENSE_TABLE_STR (
  userid INT,
  acc_no VARCHAR(10),
  first_name VARCHAR(10),
  last_name VARCHAR(10),
  balance DOUBLE,
  min_bal DOUBLE,
  interest DOUBLE
)
COLUMN MAPPING (
  key mapped by (userid, acc_no) separator '_',
  cf_data:cq_names mapped by (first_name, last_name) separator '_',
  cf_data:cq_acct mapped by (balance, min_bal, interest) separator '#'
);
A different separator can be specified for each column and for the row key. The default separator for string encoding is the null byte (\u0000).
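A minimal sketch of how a composite row key is assembled and split apart under string encoding. The separator and field names follow the example above; the helper functions themselves are hypothetical illustrations, not Big SQL internals.

```python
SEP = "_"  # separator declared in the COLUMN MAPPING clause

def encode_key(userid, acc_no, sep=SEP):
    # String encoding: every part is rendered as text and joined.
    return sep.join([str(userid), acc_no])

def decode_key(key, sep=SEP):
    # Split back into the SQL column values.
    userid, acc_no = key.split(sep, 1)
    return int(userid), acc_no

k = encode_key(11111, "ac11")
assert k == "11111_ac11"
assert decode_key(k) == (11111, "ac11")
```

If the separator character can occur inside the data itself, decoding becomes ambiguous — one reason the load utility rejects such rows under string encoding.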
18
String Encoding: Pros and Cons
Readable format and easier to port across applications
Useful to map existing data
Numeric data is not collated correctly
– HBase stores data as bytes
– Lexicographic ordering
Slow
– Parsing strings is expensive
[Diagram: existing HBase table row 11111_ac11 / fname1_lname1 / 10000#10#0.25 mapped through an external Big SQL table to (userid, acc_no, first_name, last_name, balance, min_bal, interest). Lexicographic ordering of numeric strings: '2' sorts after '10', '9' sorts after '10']
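The collation problem can be shown in a few lines of plain Python (not Big SQL): sorting the textual form of numbers gives byte order, not numeric order.

```python
nums = [2, 9, 10, 100]

# String encoding stores the decimal text, so HBase sorts the bytes
# lexicographically: "10" < "100" < "2" < "9".
as_strings = sorted(str(n) for n in nums)
print(as_strings)   # ['10', '100', '2', '9']

# Numeric order survives only if keys are zero-padded (or binary encoded).
padded = sorted(str(n).zfill(4) for n in nums)
print(padded)       # ['0002', '0009', '0010', '0100']
```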
19
External Tables
Useful to map tables that already exist in HBase
– Data in external tables is not pre-validated
Can create multiple views of same table
create external hbase table externalhbase_table (
  user INT, acc string, balance double, min_bal double, interest double
)
column mapping (
  key mapped by (user, acc),
  cf_data:cq_acct mapped by (balance, min_bal, interest) separator '#'
)
hbase table name 'dense_table';
HBase tables created using Hive HBase storage handler cannot be read by Big SQL– Need to create external tables for this
Things to note:
– Dropping an external table only drops the metadata
– Cannot create a secondary index on external tables
Use subset of data from dense_table
20
Binary Encoding
Data encoded using a sortable binary representation
Separators handled internally
– Escaped to avoid the issue of a separator occurring within the data
CREATE HBASE TABLE MIXED_ENCODING (
  C1 INT, C2 INT, C3 INT,
  C4 VARCHAR(10),
  C5 DECIMAL(5,2),
  C6 SMALLINT
)
COLUMN MAPPING (
  KEY MAPPED BY (C1, C2, C3) ENCODING BINARY,
  CF1:COL1 MAPPED BY (C4, C5) SEPARATOR '|',
  CF2:COL1 MAPPED BY (C6) ENCODING BINARY
);
[Diagram: key = 0x000000000000000100000000000000020000000000000003 (binary), cf1:col1 = foo|97.31 (string), cf2:col1 = 0x0000DEAF (binary)]
If encoding is not specified, string is used as the default.
21
Binary Encoding: Pros and Cons
Faster
Numeric types collated correctly including negative numbers
CREATE HBASE TABLE WEATHER (temp INT, date TIMESTAMP, humidity DOUBLE)
COLUMN MAPPING (
  key mapped by (temp, date),
  cf:cq mapped by (humidity)
)
default encoding binary;
Limited portability
Input rows (temp, date, humidity):
100, 2012-06-10 17:00:00:000, 40.25
-17, 2012-12-12 17:00:00:000, 30.25
95, 2012-06-05 17:00:00:000, 50.25

Binary-encoded row keys, stored in correct numeric order (-17, 95, 100):
\x01\x7F\xFF\xFF\xEF\x012012-12-12 17:00:00:000\x00
\x01\x80\x00\x00_\x012012-06-05 17:00:00:000\x00
\x01\x80\x00\x00d\x012012-06-10 17:00:00:000\x00

Binary-encoded cf:cq values (humidity 30.25, 50.25, 40.25):
\x01\xC0>@\x00\x00\x00\x00\x00
\x01\xC0I \x00\x00\x00\x00\x00
\x01\xC0D \x00\x00\x00\x00\x00
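The hex dump above shows 100 encoded as \x01\x80\x00\x00d: a tag byte followed by the 32-bit big-endian value with its sign bit flipped. Flipping the sign bit is a standard order-preserving trick, sketched below; treating the leading \x01 as a type tag is my reading of the dump, not a documented format.

```python
import struct

def encode_int(n):
    # Flip the sign bit so that signed numeric order equals unsigned
    # byte order: -17 -> 0x7FFFFFEF, 95 -> 0x8000005F, 100 -> 0x80000064.
    return struct.pack(">I", (n + (1 << 31)) % (1 << 32))

temps = [100, -17, 95]
ordered = sorted(temps, key=encode_int)  # byte-wise sort of the encodings
assert ordered == sorted(temps)          # matches numeric order
print([encode_int(t).hex() for t in ordered])
```

This is why binary encoding collates negative numbers correctly while string encoding does not.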
22
Load Data
Load HBase
– Loads data from delimited files
– A column list can be specified
load hbase data inpath 'file:///input.dat'
delimited fields terminated by '|'
into table hbtable (name, value, desc, id);
Load FROM
– Loads data from a (JDBC) source outside of a BigInsights cluster
Insert command available
insert into hbtable (name, value, desc, id)
values ('name5', 5555, 'desc55555', 55555);
The file can be on DFS or local to the Big SQL server. The column list is optional; if not specified, the column ordering in the table definition is used.
23
Load Data: Upsert
HBase ensures uniqueness of row key
Upserts can be confusing: no errors, but fewer rows!
Combine multiple columns to make the row key unique:
key mapped by (id, name)
Delimited file: 10 rows → Load: 10 rows affected → select count(*) from hbtable: 7 rows
[Diagram: load of a 10-row delimited file into HBTABLE with key mapped by (id). Input rows such as
11111, name1, 1111, desc11111
11111, name9, 9999, desc99999
22222, name2, 2222, desc22222
…
collide on the row key, so after the load the table holds
11111, name9, 9999, desc99999 @ts1
22222, name2, 2222, desc22222 @ts1
…
with 11111, name1, 1111, desc11111 surviving only as an older version @ts0. With the composite key (id, name), the stored row keys become 11111\x00name1, 11111\x00name9, 22222\x00name2, and no rows are lost]
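The overwrite behaviour is easy to reproduce with a plain dict, which, like an HBase table, keeps one logical row per key (versions aside). Illustration only; the row values are from the example above.

```python
rows = [
    ("11111", ("name1", 1111, "desc11111")),
    ("11111", ("name9", 9999, "desc99999")),  # same key: silent upsert
    ("22222", ("name2", 2222, "desc22222")),
]

table = {}
for key, value in rows:
    table[key] = value        # later load wins, no error raised

print(len(rows), "rows loaded,", len(table), "rows visible")

# With a composite key (id, name) the two "11111" rows stay distinct:
table2 = {(k, v[0]): v for k, v in rows}
assert len(table2) == 3
```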
24
Force Key Unique
Use the force key unique option when creating a table
CREATE HBASE TABLE HBTABLE_FORCE_KEY_UNIQUE (
  id INT,
  name VARCHAR(10),
  value INT,
  desc VARCHAR(20)
)
COLUMN MAPPING (
  key mapped by (id) force key unique,
  cf_data:cq_name mapped by (name),
  cf_data:cq_val mapped by (value),
  cf_info:cq_desc mapped by (desc)
);
Load adds a UUID to the row key
Prevents data loss
Inefficient:
– Stores more data
– Slower queries
11111\x00b71c95d8-ffdd-4d49-9015-2fdd6f7dcdf4, name1, 1111, desc11111
11111\x00ea780078-9893-4bf7-95d8-cb9ca4b2427f, name9, 9999, desc99999
22222\x00a90885b0-418b-49ac-a6f6-aa73273b57ca, name2, 2222, desc22222
…
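The effect of force key unique can be sketched as follows. The null byte + UUID suffix follows the example rows above; the helper is an illustration, not the loader's actual code.

```python
import uuid

def unique_key(user_key):
    # Append a null byte and a random UUID so repeated user keys never
    # collide -- at the cost of longer keys and slower point lookups.
    return user_key + "\x00" + str(uuid.uuid4())

k1, k2 = unique_key("11111"), unique_key("11111")
assert k1 != k2                     # no upsert can occur
assert k1.startswith("11111\x00")   # prefix scans on id still work
```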
25
Load Data: Error Handling
Option to continue and log error rows
– LOG ERROR ROWS IN FILE 'filename'
Common errors
– Separator exists within data for string encoding
– Invalid numeric types
Always count the number of rows after loading
– Load always reports the total number of rows that it handled
[Diagram: key mapped by (id, name) separator '-', id defined as INTEGER. Input file:
11111, name1, 1111, desc11111
11111, name9, 9999, desc99999
22222, name-2, 2222, desc22222
3333a, name3, 3333, desc33333
…
Load reports 4 rows affected, but the HBase table holds only 2 rows (11111-name1 and 11111-name9). Error file (2 rows): 22222, name-2, 2222, desc22222 (separator '-' occurs within the data) and 3333a, name3, 3333, desc33333 (invalid integer)]
key mapped by (id, name) separator ‘-’id defined as integer
26
Options to Speed up Load
Disable WAL
– Data loss can happen if a region server crashes
LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT' DELIMITED FIELDS TERMINATED BY '|' INTO TABLE ORDERS DISABLE WAL;
Increase write buffer
– set hbase.client.write.buffer=8388608;
27
Secondary Index Support
Self-maintaining secondary indexes
– Stored in an HBase table
– Populated using a MapReduce index builder
– Kept up to date using a synchronous coprocessor
[Diagram: a client query passes through Big SQL's query optimizer (compile time: process hints) and query analyzer (runtime: use index?) to the HBase storage handler and SerDe. create index triggers the MapReduce index builder over the data regions; an index coprocessor maintains the index table's regions as data changes; lookups issue batched get requests against the data table]
28
Index Creation and Usage
create hbase table dt (id int, c1 string, c2 string, c3 string, c4 string, c5 string)
column mapping (
  key mapped by (id),
  f:a mapped by (c1, c2, c3),
  f:b mapped by (c4, c5)
);

create index ixc3 on table dt (c3) as 'hbase';

Automatic index usage
– Range scan on the index table to get matching row key(s) in the base table
– Batched get requests to the base table with the matched row key(s)
[Diagram: data table dt rows — bt1, c11_c21_c31, c41_c51; bt2, c12_c22_c32, c42_c52; bt3, c13_c23_c33, c43_c53 — and index table dt_ixc3 with keys c31_bt1, c32_bt2, c33_bt3. For the query c3=c32: if index ixc3 exists, range scan the index table (start row = c32, stop row = c32++) to find bt2, then get row bt2 from the data table; otherwise, full table scan]
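The two-step plan above (index range scan, then batched gets) can be sketched with two dicts standing in for the two HBase tables. The index-key layout value + '_' + base row key mirrors the dt_ixc3 illustration; the code itself is a hypothetical model, not the storage handler.

```python
data = {   # base table dt: row key -> packed column values
    "bt1": "c11_c21_c31, c41_c51",
    "bt2": "c12_c22_c32, c42_c52",
    "bt3": "c13_c23_c33, c43_c53",
}
index = {  # index table dt_ixc3: c3 value + "_" + base row key
    "c31_bt1": None, "c32_bt2": None, "c33_bt3": None,
}

def query_c3(value):
    # 1. Range scan on the index table (start = value, stop = value++).
    matches = [k for k in sorted(index) if k.startswith(value + "_")]
    row_keys = [k.split("_", 1)[1] for k in matches]
    # 2. Batched gets against the base table.
    return [data[rk] for rk in row_keys]

print(query_c3("c32"))
```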
29
Index Pros and Cons
Fast key-based lookups for queries that return limited data
Not beneficial if there are too many matches
No statistics to make the decision in the compiler
– Use the useindex hint to make explicit choices
Index maintenance adds latency to data load
– When loading a big data set, drop the index and recreate it afterwards
The LOAD FROM option bypasses index maintenance
– Uses HBase bulk load, which writes to HFiles directly
30
Column Family Options
Compression
– compression(gz)
Bloom filters
– NONE, ROW, ROWCOL
In-memory columns
– in memory, no in memory

create hbase table colopt_table (key string, c1 string)
column mapping (
  key mapped by (key),
  cf1:c1 mapped by (c1)
)
column family options (
  cf1 compression(gz) bloom filter(row) in memory
);
31
Query Handling
Projection pushdown
Predicate pushdown
– Point scan
– Range scan
– Automatic index usage
– Filters
Query Hints
32
Sample Data
TPCH orders table with 1.5 million rows
drop table if exists orders;
CREATE HBASE TABLE ORDERS (
  O_ORDERKEY BIGINT,
  O_CUSTKEY INTEGER,
  O_ORDERSTATUS VARCHAR(1),
  O_TOTALPRICE FLOAT,
  O_ORDERDATE TIMESTAMP,
  O_ORDERPRIORITY VARCHAR(15),
  O_CLERK VARCHAR(15),
  O_SHIPPRIORITY INTEGER,
  O_COMMENT VARCHAR(79)
)
column mapping (
  key mapped by (O_ORDERKEY, O_CUSTKEY),
  cf:d mapped by (O_ORDERSTATUS, O_TOTALPRICE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT),
  cf:od mapped by (O_ORDERDATE)
)
default encoding binary;

LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT' DELIMITED FIELDS TERMINATED BY '|' INTO TABLE ORDERS;
33
Projection Pushdown
Get only columns required by the query
Limit data retrieved to the client
select * from orders
go -m discard
1500000 rows in results (first row: 0.21s; total: 1m1.77s)
Log: HBase scan details: { …, families={cf=[d, od]}, … }

select o_totalprice from orders
go -m discard
1500000 rows in results (first row: 0.19s; total: 21.27s)
Log: HBase scan details: { …, families={cf=[d]}, … }

select o_orderdate from orders
go -m discard
1500000 rows in results (first row: 0.36s; total: 36.24s)
Log: HBase scan details: { …, families={cf=[od]}, … }

Projection happens at the HBase column level
– For composite keys and dense columns, the entire value is retrieved to the client
– It is efficient to pack columns that are queried together
Note: the response time is higher for the o_orderdate query even though it retrieves less data than the o_totalprice query, because the timestamp type is more expensive to decode.
34
Predicate Pushdown: Point Scan
With full row key
Big SQL can combine predicates on row key parts
set force local on;
select o_orderkey, o_totalprice from orders where o_custkey=1 and o_orderkey=454791;
+--------------+
| o_totalprice |
+--------------+
| 208660.75000 |
+--------------+
1 row in results (first row: 0.14s; total: 0.14s)

Log: Found a row scan by combining all composite key parts.
[Diagram: row keys 1#454791, 1#579908, 1#3868359, 1#4273923, 1#4808192, 1#5133509, … (key = o_custkey # o_orderkey). Query o_custkey=1 and o_orderkey=454791 → start row = 1#454791, stop row = 1#454791]
35
Predicate Pushdown: Partial row Scan
select o_orderkey, o_totalprice from orders where o_custkey=1;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|     454791 |  74602.81250 |
|     579908 |  54048.26172 |
|    3868359 | 123076.84375 |
|    4273923 |  95911.00781 |
|    4808192 |  65478.05078 |
|    5133509 | 174645.93750 |
+------------+--------------+
6 rows in results (first row: 0.13s; total: 0.13s)

Log: Found a row scan that uses the first 1 part(s) of the composite key.
[Diagram: row keys 1#454791, 1#579908, 1#3868359, 1#4273923, 1#4808192, 1#5133509, 2#430243, … (key = o_custkey # o_orderkey). Query o_custkey=1 → start row = 1, stop row = 1++. Predicate(s) on leading part(s) of the row key enable a partial row scan]
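The scan-limit derivation in this and the neighbouring slides can be sketched for the composite key (o_custkey, o_orderkey). The '#' separator matches the slide illustrations; the "++" increment is simplified here to appending a high byte, and the function is a model, not the query analyzer.

```python
def scan_limits(custkey=None, orderkey=None, sep="#"):
    # Full key known -> point scan; leading part only -> partial row scan;
    # predicate on a non-leading part only -> full table scan.
    if custkey is not None and orderkey is not None:
        k = f"{custkey}{sep}{orderkey}"
        return k, k                      # point scan: start == stop
    if custkey is not None:
        start = str(custkey)
        return start, start + "\xff"     # simplified "start++" stop row
    return "", ""                        # empty limits: full table scan

print(scan_limits(1, 454791))
print(scan_limits(1))
print(scan_limits(orderkey=454791))
```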
36
Predicate Pushdown: Range Scan
With range predicates
select o_orderkey,o_totalprice from orders where o_custkey < 3;
Log: Found a row scan that uses the first 1 part(s) of the composite key.
Log: HBase scan details: { …, stopRow=\x01\x80\x00\x00\x03, startRow=, … }
1#454791 …
1# 5133509 2#430243
… 4#164711
1#454791 …
1# 5133509 2#430243
… 4#164711
……
keyo_custkey o_orderkey columns
Queryo_custkey<3
start row= stop row=3#
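The stopRow bytes in the log (\x01\x80\x00\x00\x03 for the value 3) are consistent with an order-preserving binary encoding of signed integers. A common way to achieve this, shown here as an assumption about the encoding rather than Big SQL's documented format, is to flip the sign bit so big-endian byte comparison matches signed-integer order (the leading \x01 would be a separate marker byte).

```python
import struct

def encode_int32(v: int) -> bytes:
    """Flip the sign bit of a 32-bit signed int, then emit big-endian
    bytes. Negative values sort below positives under plain byte
    comparison, so HBase's lexicographic key ordering matches the
    numeric ordering."""
    return struct.pack(">I", (v ^ 0x80000000) & 0xFFFFFFFF)

# matches the \x80\x00\x00\x03 seen in the scan log for o_custkey<3
assert encode_int32(3) == b"\x80\x00\x00\x03"
# byte order agrees with numeric order, including negatives
assert encode_int32(-1) < encode_int32(0) < encode_int32(3)
```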
37
Predicate Pushdown: Full Table Scan
This is an example of a case where predicates are not pushed down.
If there are predicates on non-leading parts of row key
set force local on;
select o_orderkey,o_totalprice from orders where o_orderkey=454791;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|     454791 |  74602.81250 |
+------------+--------------+
1 row in results(first row: 32.13s; total: 32.13s)
Log: HBase scan details:{ .. , stopRow=, startRow=, … }
38
Automatic Index Usage
select * from orders where o_clerk='Clerk#000000999'
go -m discard
1472 rows in results(first row: 1.63s; total: 30.32s)

create index ix_clerk on table orders (o_clerk) as 'hbase';
0 rows affected (total: 3m57.82s)

select * from orders where o_clerk='Clerk#000000999'
go -m discard
1472 rows in results(first row: 3.60s; total: 3.65s)
Index query successful
Index used automatically
For composite index, rules similar to composite row key apply
– Parts will be combined where possible
– With partial value for composite index, range scan done on index table
Multiple indexes on a table
– Index to be used is randomly chosen
– Specify useIndex hint to make use of a specific index
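The index path can be sketched as below. This is a hypothetical model, not Big SQL's implementation: an index table maps each indexed value to the base-table row keys containing it, and the lookup issues batched gets against the base table (the batch size echoing the rowcachesize hint described later). The row keys and table contents here are made up.

```python
# Hypothetical secondary-index lookup sketch (not Big SQL's code).
index = {
    "Clerk#000000999": ["1#454791", "7#579908"],  # made-up row keys
}

def index_lookup(value, base_get, batch_size=2000):
    """Scan the index table for `value`, then fetch the matching base
    rows in batches of `batch_size` gets (cf. the rowcachesize hint)."""
    row_keys = index.get(value, [])
    rows = []
    for i in range(0, len(row_keys), batch_size):
        batch = row_keys[i : i + batch_size]
        rows.extend(base_get(k) for k in batch)
    return rows

base_table = {"1#454791": "row-a", "7#579908": "row-b"}
assert index_lookup("Clerk#000000999", base_table.get) == ["row-a", "row-b"]
```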
39
Pushing down Filters into HBase
Filters do not avoid full table scan
– Some filters can skip certain sections, e.g., PrefixFilter
Limits rows returned to the client
Limits data returned to client
– Key-only filters
select o_orderkey from orders where o_custkey>100000 and o_orderstatus='P'
go -m discard
12819 rows in results(first row: 1.12s; total: 6.80s)
Found a row scan that uses the first 1 part(s) of composite key. HBase filter list created using AND. HBase scan details:{… , filter=FilterList AND (1/1): [SingleColumnValueFilter (cf, d, EQUAL, \x01P\x00)], stopRow=, startRow=\x01\x80\x01\x86\xA1, …}
Row scan, plus a column filter as there is a predicate on the leading part of a dense column
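The key property of column-value filters, that they limit what is returned rather than what is read, can be modelled with a toy generator. This is an illustration of the semantics, not HBase's code.

```python
def single_column_value_filter(rows, column, expected):
    """Toy model of HBase's SingleColumnValueFilter: every row is
    still scanned server-side, but only rows whose `column` equals
    `expected` are returned to the client. The filter saves network
    traffic, not region-server I/O."""
    for key, cols in rows:
        if cols.get(column) == expected:
            yield key, cols

# rows with a dense column 'd' holding order status
rows = [("k1", {"d": "P"}), ("k2", {"d": "F"}), ("k3", {"d": "P"})]
assert [k for k, _ in single_column_value_filter(rows, "d", "P")] == ["k1", "k3"]
```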
40
Key Only Tables
Big SQL allows creation of tables without specifying any HBase column
create hbase table KEY_ONLY_TABLE (k1 string, k2 string, k3 string)
column mapping (key mapped by (k1, k2, k3));
select * from KEY_ONLY_TABLE;
Only row key or parts of row key requested. Applying filters.…
HBase scan details:{… families={}, filter=FilterList AND (2/2): [FirstKeyOnlyFilter, KeyOnlyFilter], …}
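The effect of the FirstKeyOnlyFilter + KeyOnlyFilter pair in the log can be sketched with a toy model (an illustration of the semantics, not HBase's implementation): the scan yields only row keys, with all column values stripped.

```python
def key_only_scan(rows):
    """Toy model of FirstKeyOnlyFilter + KeyOnlyFilter combined:
    visit only the first cell of each row and strip its value, so
    just the row key crosses the network."""
    for key, _cols in rows:
        yield key

# a key-only table: all the data lives in the composite row key
rows = [("a#b#c", {"f": "x"}), ("d#e#f", {"f": "y"})]
assert list(key_only_scan(rows)) == ["a#b#c", "d#e#f"]
```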
41
Predicate Precedence
When a query contains multiple predicates, the following precedence applies:
– Row scan
– Index
– Filters
• Row filters
• Column filters
Filters will be applied along with row scans
Filters cannot be combined with index lookups
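The precedence rules above can be condensed into a small decision sketch. This is an illustrative model of the stated rules, not Big SQL's planner.

```python
def choose_access_path(has_leading_key_pred, has_index_pred, has_other_pred):
    """Illustrative precedence: a usable row-key predicate wins, then
    a secondary index, then filters. Filters combine with row scans
    but not with index lookups, per the rules on this slide."""
    if has_leading_key_pred:
        return "row scan" + (" + filters" if has_other_pred else "")
    if has_index_pred:
        return "index lookup"  # remaining predicates applied client-side
    return "full scan + filters" if has_other_pred else "full scan"

assert choose_access_path(True, False, True) == "row scan + filters"
assert choose_access_path(False, True, True) == "index lookup"
assert choose_access_path(False, False, True) == "full scan + filters"
```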
Multiple predicates: Use of row and column filter
select o_orderkey, o_custkey, o_orderdate from orders where o_orderdate=cast('1996-12-09' as timestamp) or o_custkey=2;
HBase filter list created using OR.
HBase scan details:{… , filter=FilterList OR (2/2):
[SingleColumnValueFilter (cf, od, EQUAL, \x011996-12-09 00:00:00.000\x00),
PrefixFilter \x01\x80\x00\x00\x02],
cacheBlocks=false, stopRow=, startRow=, … }
The OR condition prevents use of a row scan; a row filter (PrefixFilter) is used along with a column filter.
42
Accessmode Hint
Will run the query locally in the Big SQL server
– Useful to avoid map reduce overhead

Very important for HBase point queries
– This is not detected currently by the compiler
– Specify the accessmode='local' hint when getting a limited set of data from HBase
Specify at query level
select o_orderkey from orders /*+ accessmode='local' +*/ where o_custkey=1 and o_orderkey=454791;
Specify at session level
– set force local on
– set force commands override query level hints
43
HBase Hints
rowcachesize (default=2000)
– Used as scan cache setting
– Also used to determine number of get requests to batch in index lookups
colbatchsize (default=100)
useindex (‘false’ to avoid index usage)
select o_orderkey from orders /*+ rowcachesize=10000 +*/ where o_custkey>5000
go -m discard
1450136 rows in results(first row: 22.67s; total: 27.46s)
HBase scan details:{... , caching=10000, ...}
rowcachesize can also be set using the set command:
– set hbase.client.scanner.caching=10000;
44
Recommendations
Row key design is the most important factor
– Try to combine predicates that are most commonly used into row key columns
– Do not make the row key too long
Use short names for HBase column families and column qualifiers
– f:q instead of mycolumnfamily:mycolumnqualifier
Check if key only tables can be used
Pack columns that are queried together into dense columns
– Use the column that is used as a query predicate as the prefix
– Create indexes for columns that do not have repeating values and are queried often
Separate columns that are rarely or never queried into a different column family
Set hbase.client.scanner.caching to an optimum value
Ensure even data distribution
45
Limitations
No diagnostic info about HBase pushdown
– How the HBase storage handler pushes down a query is decided only at runtime
– Predicate handling details are logged at INFO level
– Many examples of log messages are covered in the previous slides
No auto detection of local vs MR mode
– Currently depends on user-specified hints
Statistics not available
– Big SQL does not have a framework to collect statistics
– Query optimizations can be improved with availability of useful statistics
Map type not supported
– Big SQL does not support the map data type
– Hive HBase handler supports map data type and many-to-one mapping
• Mapping an entire HBase column family to a map data type
46
Logs and Troubleshooting
Big SQL logs
– Look for the rewritten query
– More information in Big SQL logs if the query is run in local mode
Map Reduce logs
– Predicate handling information in map task log when run in MR mode
HBase web GUI
– http://<hostname>:60010/master-status
47
Big SQL HBase Handler Highlights
Support for composite key/dense columns
Pushdown for efficient execution of queries
Support for secondary indexes
Binary encoding (collated correctly)
Key only tables
Support for hints to make query optimization decisions
48
Scenarios that can leverage HBase features
Point queries
– Queries that return a single row of result
– Row can be determined using row key or secondary index
• Not all queries using a secondary index are point queries
Queries with projections
– If a query requires only a few columns
– Projection happens at HBase column level
Data maintenance using upserts
– Loading different values for columns using the same row key
49
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
© Copyright IBM Corporation 2013. All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, and InfoSphere BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
Piotr Pruski
@ppruski
Thank You
Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Full credit to Deepa Remesh
Acknowledgements