Challenges of SQL on Hadoop: A Story from the Trenches


Page 1: Challenges of SQL on Hadoop: A story from the trenches

© 2015 IBM Corporation

HHS-2479: Challenges of SQL on Hadoop: A story from the trenches

Scott C. Gray ([email protected]), Senior Architect and STSM, IBM Open Platform with Apache Hadoop

Page 2: Challenges of SQL on Hadoop: A story from the trenches

2

Why SQL on Hadoop?

Why are you even asking? This should be obvious by now! :)

Hadoop is designed for any data

Doesn't impose any structure

Extremely flexible

At the lowest levels it is API based

Requires strong programming expertise

Steep learning curve

Even simple operations can be tedious

Why not use SQL where its strengths shine?

Familiar widely used syntax

Separation of what you want vs. how to get it

Robust ecosystem of tools

Page 3: Challenges of SQL on Hadoop: A story from the trenches

3

SQL Engines Everywhere!

SQL engines are springing up everywhere and maturing at an incredible pace!

In some cases, the richness of SQL in these engines matches or surpasses that of traditional data warehouses

<ShamelessPlug>e.g. IBM’s Big SQL</ShamelessPlug>

Robust SQL plus inexpensive, easily expandable and reliable clustering leads to a deep burning desire…


Page 4: Challenges of SQL on Hadoop: A story from the trenches

4

The Data Model Plop

Ditch that pricy data warehouse and plop it on Hadoop!

Problem solved, right?

[Diagram: the database partitions of an expensive ($$$) data warehouse being plopped onto an inexpensive ($) Hadoop cluster]

Page 5: Challenges of SQL on Hadoop: A story from the trenches

5

Whoa…Hold On There, Buckaroo

Your plan may just work! Or…it may not

Even moving your application from one traditional data warehouse to another requires:

Planning

Tuning

An intimate understanding of architectural differences between products

Hadoop’s architecture adds another level of potential impedance mismatch that needs to be considered as well.

[Diagram: the same warehouse database partitions headed for Hadoop, now with a question mark hanging over the move and the price tags ($$$ vs. $)]

Page 6: Challenges of SQL on Hadoop: A story from the trenches

6

Challenges of SQL on Hadoop

Hadoop’s architecture presents significant challenges to matching the functionality of a traditional data warehouse

Everything is a tradeoff though!

Hadoop’s architecture helps solve problems that have challenged data warehouses

It opens data processing well beyond just relational

This presentation will discuss a (very small) subset of the following challenges and ways in which some projects are addressing them:

Data placement

Indexing

Data manipulation

Security

File formats out the wazoo (that’s a technical term)

Caching (buffer pools)

Optimization and data ownership

Competing workloads

Page 7: Challenges of SQL on Hadoop: A story from the trenches

7

Disclaimer(s)

This presentation is not designed to scare you or dishearten you

Only to educate you and make you think and plan

And to teach you about the bleeding edge technologies that will solve all your problems!

I work on one of these SQL engines (IBM’s Big SQL)

I wouldn’t be doing so if it weren’t solving real problems for real customers

There are a LOT of technologies out there

I’m sure I’m missing at least one of your favorites. Sorry.

Call me out when I’m wrong. I want to learn too!

Page 8: Challenges of SQL on Hadoop: A story from the trenches

8

Data Placement

Most DWs rely heavily on controlled data placement

Data is explicitly partitioned across the cluster

A particular node “owns” a known subset of data

Partitioning tables on the same key(s) and on the same nodes allows for co-located processing

The fundamental design of HDFS explicitly implements “random” data placement

No matter which node writes a block, there is no guarantee a copy will live on that node

Rebalancing HDFS can move blocks around

So, no co-located processing without bending over backwards (more on this later)

[Diagram: a query coordinator over partitions A, B, and C, each owning its own slice of tables T1 and T2, contrasted with HDFS]
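For contrast, a sketch of how a traditional MPP warehouse pins placement to a distribution key so that equi-joins on that key stay local. DB2 DPF-flavored DISTRIBUTE BY syntax is shown purely as an illustration; the table and column names are invented and exact syntax varies by product.

-- Both tables are hash-distributed on the same key, so rows with equal
-- cust_id land on the same partition and the join below needs no data movement.
CREATE TABLE customers (
  cust_id INT NOT NULL,
  name    VARCHAR(60)
) DISTRIBUTE BY HASH (cust_id);

CREATE TABLE orders (
  order_id INT NOT NULL,
  cust_id  INT NOT NULL,
  amount   DECIMAL(10,2)
) DISTRIBUTE BY HASH (cust_id);

-- Co-located join: each partition joins only its own slice of both tables.
SELECT c.name, SUM(o.amount)
FROM customers c JOIN orders o ON c.cust_id = o.cust_id
GROUP BY c.name;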

Page 9: Challenges of SQL on Hadoop: A story from the trenches

9

Query Processing Without Data Placement

Without co-location the options for join processing are limited

Redistribution join

DB engines read and filter “local” blocks for each table

Records with the same key are shipped to the same node to be joined

In the worst case, both joined tables are moved in their entirety!

Doesn’t really work well for non-equijoins (!=, <, >, etc.)

Hash Join

Smaller, or heavily filtered, tables are shipped to all other nodes

An in-memory hash table is used for very fast joins

Can still lead to a lot of network traffic to move the small table

Tricks like bloom filters can help optimize these types of joins

[Diagrams: broadcast join and hash join data flows, showing copies of T1 and T2 being shipped between DB engines over the network]
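To make the small-table case concrete, Hive-family engines expose the broadcast variant as a map-side join. The setting and hint below are standard Hive; the table names are invented for illustration.

-- Let the engine auto-convert to a map-side (broadcast) join when the
-- smaller table fits in memory.
SET hive.auto.convert.join=true;

-- Or hint it explicitly: dim_store is shipped to every node and joined
-- against locally scanned blocks of the large fact table.
SELECT /*+ MAPJOIN(dim_store) */ s.region, SUM(f.sales)
FROM fact_sales f
JOIN dim_store s ON f.store_id = s.store_id
GROUP BY s.region;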

Page 10: Challenges of SQL on Hadoop: A story from the trenches

10

Data Placement – Explicit Placement Policies

HDFS

HDFS has supported a pluggable data placement policy for some time now

This could be used to keep blocks for specific tables “together”

HDFS doesn’t know the data in the blocks, so it would be an “all or nothing” policy

• A full copy of both tables together on a given host

• Can be more granular by placing hive-style partitions together (next slide)

What do you do when a host “fills up”?

I’m not aware of any SQL engine that leverages this feature now

HBase (e.g. HBASE-10576)

HBase today takes advantage of HDFS write behavior such that table regions are “most likely” local

There are projects underway to cause the HBase balancer to split tables (regions) together

This nicely solves the problem of a host “filling up”

Obviously, this is restricted to HBase storage only

Page 11: Challenges of SQL on Hadoop: A story from the trenches

11

Data Placement – Partitioning Without Placement

Without explicit placement, the next best thing is reducing the amount of data to be scanned

In Hive, “partitioning” allows for subdividing data by the value in a set of columns

Queries only access the directories required to satisfy the query

Typically cannot be taken advantage of when joining on the partitioning column

Scanning a lot of partitions can be quite expensive!

Other platforms, like Jethrodata, similarly allow for range partitioning into directories

Allows for more control over the number of directories/data
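A minimal HiveQL sketch of this, with invented table and column names: each partition value becomes its own directory, and a query that filters on the partition column only reads the matching directories.

-- Each distinct sale_date value becomes its own HDFS directory,
-- e.g. .../sales/sale_date=2015-06-01/
CREATE TABLE sales (
  order_id BIGINT,
  store_id INT,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)
STORED AS PARQUET;

-- Partition pruning: only the June 2015 directories are scanned.
SELECT store_id, SUM(amount)
FROM sales
WHERE sale_date BETWEEN '2015-06-01' AND '2015-06-30'
GROUP BY store_id;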

Page 12: Challenges of SQL on Hadoop: A story from the trenches

12

Avoid-the-Join: Nested Data Types

One way to avoid the cost of joins is to physically

nest related data E.g. store data as nested JSON, AVRO, etc.

Each department row contains all employees

Apache Drill allows this with no schema provided!

Impala is adding language support to simplify such queries

An ARRAY-of-STRUCT is implicitly treated as a table

Aggregates can be applied to arrays

Dynamic JSON schema discovery

Big SQL Example

CREATE HADOOP TABLE DEPARTMENT (
  DEPT_ID   INT NOT NULL,
  DEPT_NAME VARCHAR(30) NOT NULL,
  ...
  EMPLOYEES ARRAY<STRUCT<
    EMP_ID:   INT,
    EMP_NAME: VARCHAR(30),
    SALARY:   DECIMAL(10,2)
    ...
  >>
) ROW FORMAT SERDE 'com.myco.MyJsonSerDe'

SELECT D.DEPT_NAME, SUM(E.SALARY)
FROM DEPARTMENT D,
     UNNEST(D.EMPLOYEES) AS E

Apache Drill example (no schema required)

SELECT DEPT_NAME, SUM(E.SALARY)
FROM (SELECT D.DEPT_NAME, FLATTEN(D.EMPLOYEES) E
      FROM `myfile.json` D)

Page 13: Challenges of SQL on Hadoop: A story from the trenches

13

Avoid-the-Join: Gotchas

While avoiding the cost of the joins, nested data types have some downsides:

The row size can become very large

Most storage formats must completely read the entire row even when the complex column is not being used

You are no longer relational!

• Becomes expensive to slice the data another way

Page 14: Challenges of SQL on Hadoop: A story from the trenches

14

Indexing, It’s a Challenge!

HDFS’ random block placement is problematic for traditional indexing

An index is typically just a data file organized by indexed columns

Each block in the index file will, of course, be randomly scattered

Each index entry will point to data in the base data, which is ALSO randomly scattered!

This sort of “global” index will work for smaller point or range queries

Network I/O costs grow as the scan range increases on the index

Many SQL engines allow users to just drop data files into a directory to make it available

How does the index know it needs to be updated?

[Diagram: data (D) and index (I) blocks scattered randomly across the cluster nodes]

Page 15: Challenges of SQL on Hadoop: A story from the trenches

15

Indexing and Hadoop Legacy

Hive-derived database engines use standard Hadoop classes to allow access to any data

InputFormat – Used to interpret, split, and read a given file type

OutputFormat – Used to write data into a given file type

These interfaces are great!

They were established with the very first version of Hadoop (MapReduce, specifically)

They are ubiquitous

You can turn literally any file format into a table!

But…the interface lacks any kind of “seek” operation!

A feature necessary to implement an index
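For context, this is how those legacy interfaces surface in HiveQL: a table can name its InputFormat/OutputFormat classes directly. The classes below are Hadoop's stock text-file implementations; the table itself is invented.

-- Expose an arbitrary file format as a table by naming its
-- InputFormat/OutputFormat classes.
CREATE TABLE web_logs (line STRING)
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';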

Page 16: Challenges of SQL on Hadoop: A story from the trenches

16

Hive “Indexes”

Hive has supported indexing since 0.8, but they are barely indexes in the traditional sense

They are limited in utility

No other Hive derived SQL solution uses them

The index table contains

One row for each [index-values, block-offset] pair

A set of bits for each row in the block (1 = a row contains the indexed values)

This sort of index is useful for

Indexing any file type, regardless of format

Skipping base table blocks that don’t contain matching values

Avoiding interpretation of data in rows that don’t match the index

You still have to read each matching block in its entirety (up to the last “1”)

CREATE INDEX IDX1 ON T1 (A, B) ROW FORMAT DELIMITED

A    B           Block Offset   Bits
CA   San Jose    6371541        011010010000…
CA   San Jose    4718461        110100000111…
CA   Berkeley    1747665        110000000011…
NY   New York    1888828        1111111100001…
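For reference, a sketch of the fuller Hive DDL behind an index like the one above. AS 'BITMAP' (or 'COMPACT'), WITH DEFERRED REBUILD, and ALTER INDEX ... REBUILD are standard Hive index syntax; the bitmap handler matches the per-row bits shown in the table, and the names mirror the slide's example.

-- The index is stored as another table of [index values, block offset, bits] rows.
CREATE INDEX IDX1 ON TABLE T1 (A, B)
AS 'BITMAP'
WITH DEFERRED REBUILD
ROW FORMAT DELIMITED;

-- The index table is only (re)populated when explicitly rebuilt; simply
-- dropping new files into the table's directory does not update it.
ALTER INDEX IDX1 ON T1 REBUILD;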

Page 17: Challenges of SQL on Hadoop: A story from the trenches

17

Block Level Indexing and Synopsis

The latest trend in indexing is with “smarter” file formats

Exemplified by Parquet and ORC

These formats typically

Store data in a compressed columnar(-ish) format

Store indexing and/or statistical information within each block

Can be configured with search criteria prior to reading data

Index and data are intimately tied together and always in sync

Optimizations include

Skipping of blocks that do not match your search criteria

Quickly seeking within a block to data matching your search criteria

You still have to at least “peek” at every block

Fetching a single row out of a billion will still take some time

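A small HiveQL sketch of leaning on these formats; hive.optimize.index.filter and the ORC bloom filter table property are real knobs, while the table is invented.

-- ORC stores min/max statistics (and optional bloom filters) alongside
-- the data in each stripe.
CREATE TABLE events (
  event_id BIGINT,
  user_id  BIGINT,
  event_ts TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns'='user_id');

-- Push search criteria down into the reader so non-matching
-- stripes/row groups are skipped instead of decoded.
SET hive.optimize.index.filter=true;

SELECT COUNT(*) FROM events WHERE user_id = 42;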

Page 18: Challenges of SQL on Hadoop: A story from the trenches

18

Indexing in HBase

All HBase tables are inherently partitioned and indexed on row key

Provides near-RDBMS levels of performance for fetching on row key (yay!!)

At a non-negligible cost in writes due to index maintenance (boo!!)

And requires persistent servers (memory, CPU) instead of just simple flat files

Today HBase has no native secondary index support (it’s coming!)

But there are many solutions that will provide them for you….

Page 19: Challenges of SQL on Hadoop: A story from the trenches

19

Secondary Indexes in HBase

Most secondary index solutions for HBase store the index in another HBase table

The index is automatically maintained via HBase co-processors (kind of like triggers)

There is a measurable cost to index maintenance

Big SQL is exploring using co-processors to store index data outside of HBase

E.g. using a Lucene index

Stored locally with each region server

CREATE INDEX IDX1 ON T1 (C2, C3)

T1
Row Key   C1      C2       C3
12345     Frank   Martin   44
12346     Mary    Lee      22

T1_IDX1
Row Key     Pointer
Martin|44   12345
Lee|22      12346

Page 20: Challenges of SQL on Hadoop: A story from the trenches

20

Secondary Indexes in HBase

Global Index (e.g. Phoenix, Big SQL)

Index table is maintained like a regular HBase table (regions are randomly scattered)

Index data likely not co-located with base data

Good for point or small scan queries

Suffers from “network storm” during large index scans

[Diagram: global index — T1 and T1_IDX1 regions spread across region servers independently, so index and base regions rarely share a server]

Local Index (e.g. Phoenix)

Custom HBase balancer ensures index data is co-located with base data

• There is a small chance that it will be remote

No network hop to go from index to base data

BUT, for a given index key all index region servers must be polled

• Potentially more expensive for single row lookups

[Diagram: local index — each region server hosts a T1 region together with its matching T1_IDX1 region]
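A sketch of the two variants in Phoenix DDL; CREATE INDEX and CREATE LOCAL INDEX are real Phoenix statements, and the table and columns follow the earlier T1 example.

-- Global index: materialized as a separate HBase table whose regions are
-- placed independently of T1's regions.
CREATE INDEX IDX1_GLOBAL ON T1 (C2, C3);

-- Local index: index rows are kept with the base table's regions, trading
-- per-lookup fan-out for index-to-data locality.
CREATE LOCAL INDEX IDX1_LOCAL ON T1 (C2, C3);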

Page 21: Challenges of SQL on Hadoop: A story from the trenches

21

Data Manipulation (DML) In a Non-Updateable World

Another “gotcha” of HDFS is that it only supports write and append

Modifying data is difficult without the ability to update data!

Variable block length and block append (HDFS-3689) may allow some crude modification features

As a result, you’ll notice very few SQL solutions support DML operations

Those that do support it have to bend over backwards to accommodate the file system

Modifications are logged next to original data

Reads of original data are merged

Page 22: Challenges of SQL on Hadoop: A story from the trenches

22

Hive ACID Tables

Hive 0.14 introduced “ACID” tables (which aren’t quite ACID yet!)

Modifications are logged, in row order, next to the base data files

During read, delta file changes are “merged” into the base data

Minor compaction merges delta files together; major compaction rebuilds the base data

Not suitable for OLTP

A single row update still scans all base data and produces one delta file
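A minimal sketch of what Hive 0.14 requires to enable this; bucketing, ORC storage, and the transactional table property are the documented prerequisites, while the table itself is invented.

-- ACID tables must be bucketed, stored as ORC, and flagged transactional.
CREATE TABLE accounts (
  acct_id INT,
  balance DECIMAL(12,2)
)
CLUSTERED BY (acct_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- The update is written as a delta file next to the base ORC data and
-- merged in at read time; compaction later folds the deltas back in.
UPDATE accounts SET balance = balance - 100 WHERE acct_id = 42;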

Page 23: Challenges of SQL on Hadoop: A story from the trenches

23

Data Modification in HBase

HBase takes a similar approach

Changes are logged to the write-ahead-log (WAL)

Final view of the row is cached in memory

Base data (HFILES) are periodically rebuilt by merging changes

HBase achieves OLTP levels of performance by caching changes in memory

Hive supports “UPSERT” semantics

It is still difficult (costly) to implement SQL UPDATE semantics

[Diagram: a region server with its write-ahead log (WAL), in-memory cache, and HFILEs on disk]
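As an assumed illustration of the UPSERT-vs-UPDATE gap, Apache Phoenix exposes HBase puts through SQL; UPSERT INTO is real Phoenix syntax, but the column layout below for the earlier T1 example is hypothetical.

-- A Phoenix UPSERT maps directly onto an HBase put: if the row key exists
-- it is overwritten, otherwise it is inserted, with no read required.
UPSERT INTO T1 (ROW_KEY, C1, C2, C3) VALUES ('12347', 'Ada', 'Byron', 36);

-- SQL UPDATE semantics, by contrast, imply read-modify-write (and searched
-- predicates), which is why they are costlier to implement on HBase.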

Page 24: Challenges of SQL on Hadoop: A story from the trenches

24

Security

Most Hadoop SQL solutions are built on “open” data formats. This is good!

E.g. delimited, JSON, Parquet, ORC, etc.

Data can be manipulated by different tools (SQL, Pig, Spark, etc.)

Use the right tool for the right job!

But, there lurks a significant security challenge!

HDFS has a very coarse, file-level security model

(Good) SQL solutions provide very fine-grained access controls

Object level, column level, row level, data masking rules, etc.

Providing a consistent view of security across technologies is a huge challenge

Page 25: Challenges of SQL on Hadoop: A story from the trenches

25

Common SQL Security Models

Impersonation

Optional mode for Hive

All operations executed as connected user

HDFS provides security enforcement

Pros

Permissions are always “in sync”

Cons

GRANT/REVOKE is impossible

Fine-grained access control is impossible

Permissions managed outside of SQL

Server Based

Provided by most SQL solutions

All operations executed as server owner

Server owner is typically a privileged user

Pros

“Traditional” SQL security model

GRANT/REVOKE is supported

Fine-grained access control supported

Cons

Data is owned by the server

Users must be granted explicit permission

Page 26: Challenges of SQL on Hadoop: A story from the trenches

26

Permission Synchronization

Solutions like Apache Sentry attempt to bridge the gap between security models

The SQL server runs as a single privileged user

GRANT/REVOKE operations are also translated to HDFS permission changes

Object ownership translates to file ownership

But, this model falls apart quickly

GRANT SELECT ON T1 TO BOB
  →  hdfs dfs -setfacl -m user:bob:r-- /path/to/T1

GRANT SELECT ON T1 (C1, C2, C3) TO BOB
  →  ????

CREATE PERMISSION BRANCH_A ON HR_STAFF
  FOR ROWS WHERE (
    VERIFY_ROLE_FOR_USER(SESSION_USER, 'BRANCH_A_ROLE') = 1
    AND HR_STAFF.BRANCH_NAME = 'Branch_A')
  ENFORCED FOR ALL ACCESS ENABLE
  →  ????

Page 27: Challenges of SQL on Hadoop: A story from the trenches

27

Security Delegation

New projects are popping up to solve this problem!!

Dedicated scan-and-filter I/O engines

Client applications provide scan-and-filter criteria

Engine enforces security policies (from Hive metastore)

Also provides other optimizations (caching, performance optimizations)

LLAP (Live Long and Process)

Developed as part of Hive

In-memory vectorized predicate evaluation

Data caching

RecordService

Currently part of Sentry

C++ based runtime (from Impala), highly optimized

[Diagram: compute/data nodes where MR, Pig, Spark, and SQL clients all read HDFS through a shared Hive I/O engine]

Page 28: Challenges of SQL on Hadoop: A story from the trenches

28

Oh, There’s So Much More!!

Other areas and technologies I would have liked to have covered:

File formats out the wazoo

• Trade-offs in a proprietary format vs. being completely agnostic

Caching

• How file formats and compression make efficient caching difficult

Schema-discovery and schema-less querying

• What if the data doesn’t have a rigid schema? (Hint: Drill It)

Optimization and data ownership

• How do you optimize a query if you have no statistics? (dynamic vs. static optimization)

Competing workloads

• How does the database deal with competing workloads from other Hadoop tools?

“Sir Not-Appearing-In-This-Film”

Page 29: Challenges of SQL on Hadoop: A story from the trenches

29

Thank You!

Thanks for putting up with me

Queries? (Optimized of course!)