Apache Hive 2.0: SQL, Speed, Scale by Alan Gates


TRANSCRIPT

Page 1: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates

1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Page 2: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates

Apache Hive 2: SQL, Speed, Scale
Alan Gates
Hive PMC Member
Co-founder Hortonworks

Page 3: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Acknowledgements

The Apache Hive community for building all this awesome tech

Content of some of these slides based on earlier presentations by Sergey Shelukhin and Siddarth Seth

alias Hive='Apache Hive'
alias Hadoop='Apache Hadoop'
alias Spark='Apache Spark'
alias Tez='Apache Tez'
alias Parquet='Apache Parquet'
alias ORC='Apache ORC'
alias Calcite='Apache Calcite'

Page 4: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Apache Hive History

Initially Hive provided SQL on Hadoop
– Provided a table view instead of file view of data
– Translated SQL to MapReduce
– Mostly used for ETL (Extract Transform Load)
– Big, batch, high start up time

Around 2012 it became clear users wanted to do all data warehousing on Hadoop, not just batch ETL

Hive has shifted over time to focus on traditional data warehousing problems
– Still does large ETL well
– Now also can be used for analytics, reporting
– Work being done to better support BI (Business Intelligence) tools

Not OLTP, very focused on backend analytics

Page 5: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Hive 1.x and 2.x

New feature development in Hive moving at a fast pace
– Stressful for those who use Hive for its original purpose (ETL-type SQL on MapReduce)
– Realizing the full potential of Hive as a data warehouse on Hadoop requires more changes

Compromise: follow Hadoop’s example, split into stable and new feature lines

1.x
– Stable
– Backwards compatible
– Ongoing bug fixes

2.x
– Major new features
– Backwards compatible where possible, but some things will be broken
– Hive 2.0 released February 15, 2016 – Not considered production ready
– Hive 2.1 released June 20, 2016 – Getting closer, but still beta

Page 6: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Hive 2 Releases

2.0 released February 2016
– 1039 JIRAs resolved with 2.0 as fix version

• 666 bugs
• 140 improvements or new features

– 2.0.1 released in May with 68 bug fixes

2.1 released June 2016
– 633 JIRAs resolved

• 391 bugs
• 128 new features or improvements

Page 7: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Hive 2.0 New Features Overview

HPLSQL

LLAP

Hive-On-Spark Improvements

Cost Based Optimizer Improvements

Many, many new features and bug fixes I will not have time to cover

Plus some really cool stuff being worked on now that I will touch on at the end

Page 8: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Adding Procedural SQL: HPLSQL

Procedural SQL, akin to Oracle’s PL/SQL and Teradata’s stored procedures
– Adds cursors, loops (FOR, WHILE, LOOP), branches (IF), HPLSQL procedures, exceptions (SIGNAL)

Aims to be compatible with all major dialects of procedural SQL to maximize re-use of existing scripts

Currently external to Hive, communicates with Hive via JDBC
– User runs command using hplsql binary
– Goal is to tightly integrate it so that Hive’s parser can execute HPLSQL, store HPLSQL procedures, etc.
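To make the flavor concrete, here is a minimal HPL/SQL sketch; the table name web_logs is hypothetical and minor dialect details may differ between HPLSQL versions. The script is run with the hplsql binary rather than through Hive directly.

  CREATE PROCEDURE check_row_count()
  BEGIN
    DECLARE cnt BIGINT;
    -- the embedded query is submitted to Hive over JDBC by the hplsql tool
    SELECT COUNT(*) INTO cnt FROM web_logs;
    IF cnt > 1000000 THEN
      PRINT 'web_logs is large: ' || cnt || ' rows';
    ELSE
      PRINT 'web_logs has ' || cnt || ' rows';
    END IF;
  END;

  CALL check_row_count();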

Page 9: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Sub-second Queries in Hive: LLAP (Live Long and Process)

Persistent daemons
– Saves time on process start up (eliminates container allocation and JVM start up time)
– All code JITed within a query or two

Data caching with an async I/O elevator
– Hot data cached in memory (columnar aware, so only hot columns cached)
– When possible, work is scheduled on the node with the data cached; if not, work runs on another node

Operators can be executed inside LLAP when it makes sense
– Large, ETL style queries usually don’t make sense
– User code not run in LLAP for security

Working on interface to allow other data engines to read securely in parallel

Beta in 2.0, 2.1

Page 10: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Hive With LLAP Execution Options

[Diagram: example DAGs (Tez AM coordinating map tasks M, LLAP tasks T, and reducers R) for the three execution options: Tez only, LLAP + Tez, and LLAP only]
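A minimal sketch of how a session might select between these options, assuming the hive.execution.mode and hive.llap.execution.mode settings available in Hive 2.x (exact names, values, and defaults can vary by release):

  SET hive.execution.engine=tez;     -- LLAP work is still coordinated by a Tez AM
  SET hive.execution.mode=llap;      -- route query fragments to the LLAP daemons
  SET hive.llap.execution.mode=all;  -- e.g. all / map / only / none, controlling how much
                                     -- of the plan is allowed to run inside LLAP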

Page 11: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Hive 2 with LLAP: 25x+ Performance Boost

[Chart: per-query time in seconds (lower is better) for Hive 1 / Tez vs. Hive 2 / LLAP, with speedup (x factor) on a secondary axis; Hive 2 with LLAP averages 26x faster than Hive 1]

Page 12: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Performance Compared with an MPP System (Apache Impala)

Page 13: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


LLAP Ongoing Work

Currently in beta, hope to have it stable by early next year

Read only, write path being worked on (see HIVE-14535)

ACID integration being worked on

User must decide whether query runs fully in LLAP, mixed mode, or not at all
– Should be handled by CBO

Currently only reads ORC files, other options being added

Currently only integrates with Tez as an engine

Page 14: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Improvements to Hive on Spark

Dynamic partition pruning

Make use of spark persistence for self-join, self-union, and CTEs

Vectorized map-join and other map-join improvements

Parallel order by

Pre-warming of containers

Support for Spark 1.5

Many bug fixes

Presenter notes
1. DPP: Implemented in two sequential jobs. The first one processes the pruning part, saving the dynamic values on HDFS. The second job uses these values to filter out unwanted partitions. Not fully tested yet.
2. Spark RDD persistence is used to store the temporary results from repeated subqueries to avoid re-computation. This is similar to a materialized view and happens automatically. It is especially useful for self-join, self-union, and CTE cases.
3. Vectorized map-join and an optimized hashtable for map-join. These are very similar to Tez.
4. Use the parallel order by provided by Spark to do global sorting without limiting to one reducer. Internally, however, Spark does the sampling.
5. Wait for a few seconds after the SparkContext is created before submitting the job, to make sure enough executors are launched. SparkContext allows a job to be submitted right away, even if the executors are still starting up. Parallelism at the reducer is partially determined by the number of available executors at the time the job is submitted. This is useful for short-lived sessions, such as those launched by Oozie.
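To illustrate the dynamic partition pruning case, a query shaped like the following benefits (tables and columns here are hypothetical; sales is partitioned by sale_date): the qualifying sale_date values are computed from the dimension side at runtime, so only the matching partitions of sales are scanned.

  SELECT s.store_id, SUM(s.amount) AS total_amount
  FROM sales s
  JOIN dates d ON s.sale_date = d.d_date   -- join key is the partition column of sales
  WHERE d.d_quarter = '2016Q1'             -- filter values known only after scanning dates
  GROUP BY s.store_id;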
Page 15: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Cost Based Optimizer (CBO) Improvements

Hive’s CBO uses Calcite
– Not all optimization rules migrated yet, but 2.0 continues work towards that

CBO on by default in 2.0 (wasn’t in 1.x)

Main focus of CBO work has been BI queries (using TPC-DS as guide)
– Some work on machine generated queries, since tools generate some funky queries

Focus on improving stats collection and estimating stats more accurately between operators in the plan
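As a sketch of how the CBO gets its inputs (table and column names are hypothetical, loosely following the TPC-DS naming mentioned above): statistics are gathered with ANALYZE TABLE and then consulted when planning a query.

  SET hive.cbo.enable=true;                                  -- already the default in 2.0
  ANALYZE TABLE store_sales COMPUTE STATISTICS;              -- table-level stats (row count, size)
  ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;  -- column stats (NDV, min/max, null counts)
  EXPLAIN
  SELECT ss_store_sk, COUNT(*) FROM store_sales GROUP BY ss_store_sk;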

Page 16: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


And Many, Many More

• SQL Standard Auth is the default authorization (actually works)
• First pass at storing metadata in HBase instead of RDBMS
• CLI mode for beeline (WIP to replace and deprecate CLI in Hive 2.*)
• Codahale-based metrics (also in 1.3)
• HS2 Web UI
• Stability improvements and bug fixes for ACID (almost production ready now)
• Native vectorized mapjoin, vectorized reducesink, improved vectorized GBY, etc.
• Improvements to Parquet performance (PPD, memory manager, etc.)
• ORC schema evolution
• Improvements to windowing functions, refactoring ORC before split, SIMD optimizations, new LIMIT syntax, parallel compilation in HS2, improvements to Tez session management, many more

Page 17: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Features Being Worked on in Hive 2

LLAP GA

Vastly improved performance for ACID tables
– Reworked the way ACID delta files are stored and organized

Adding MERGE for ACID (a sketch of the targeted syntax follows this list)

Support for transactional inserts for non-ACID tables

Vast readability improvements in EXPLAIN

Materialized views

Integration with Druid

Continued progress towards full SQL 2011
– INTERSECT, EXCEPT, extended subquery support, non-equijoins

Hive on Spark support for Spark 2.0
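The MERGE support mentioned above targets the SQL-standard form; a sketch of the intended shape (table and column names hypothetical, and the final Hive syntax may differ):

  MERGE INTO customers t
  USING customer_updates s
  ON t.customer_id = s.customer_id
  WHEN MATCHED THEN
    UPDATE SET email = s.email, last_seen = s.last_seen
  WHEN NOT MATCHED THEN
    INSERT VALUES (s.customer_id, s.email, s.last_seen);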

Page 18: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Hive 2 Incompatibilities

Java 7 & 8 supported, 6 no longer supported

Requires Hadoop 2.x, Hadoop 1.x no longer supported

MapReduce deprecated, Tez or Spark recommended instead
– At some future date MR will be removed

Some configuration defaults changed, e.g.
– bucketing enforced by default
– metadata schema no longer created if it is missing
– SQL Standard authorization used by default (see the sketch after this list)

We plan to remove Hive CLI in the future and replace with beeline CLI
– Makes it easier for users to deploy secure clusters where all access is via [OJ]DBC
– It is cleaner to maintain one code path
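Because SQL Standard authorization becomes the default, routine access control is done with role-based GRANT statements; a minimal sketch, assuming hypothetical role, user, and table names (check the authorization documentation for exact syntax):

  CREATE ROLE analyst;                              -- roles are managed by an admin
  GRANT analyst TO USER alice;                      -- user name alice is hypothetical
  GRANT SELECT ON TABLE web_logs TO ROLE analyst;   -- table name web_logs is hypothetical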

Page 19: Apache Hive 2.0 SQL, Speed, Scale by Alan Gates


Thank You