graham mossman - sql and high performance computing on hadoop

© 2014 EXASOL AG

1

SQL and high performance compu3ng on Hadoop

Graham Mossman, Senior Solu;on Engineer, EXASOL

© 2014 EXASOL AG

2

I Love My Lawnmower ...

... because it cuts my grass well

© 2014 EXASOL AG

3

But ...

... it‘s quite a struggle cuBng my hedge

© 2014 EXASOL AG

4

And ...

... it isn‘t good at making apple sauce

© 2014 EXASOL AG

5

And don‘t even thinking about...

... using it to cut hair

© 2014 EXASOL AG

6

Hadoop today is …

§  S;ll Open Source ! §  Began with HDFS and Map/Reduce §  Now comprises a number of addi;onal technologies

§  File systems §  (e.g. Tachyon)

§  Cluster Managers §  (e.g. YARN + Mesos)

§  Execu;on Engines §  (e.g. Tez, Spark etc.)

§  Analy;cal Layer and Applica;ons §  (e.g. Hive, Pig, various SQL on Hadoop)

© 2014 EXASOL AG

7

Hadoop With Everything

§  Hadoop was invented to more easily distribute the Nutch and Lucene applica;ons across a cluster of machines. §  Map/Reduce – distributed processing §  HDFS – distributed file system

§  Began to be used for …. just about everything. §  But not all processing tasks are like indexing the Internet §  Hadoop started to acract cri;cism

§  But usually when it was being used for something it wasn’t designed for

© 2014 EXASOL AG

8

Definitely NOT jobs for Hadoop

§  Word processing

§  Payroll system

§  Anything on a single computer

§  Anything with “small” data

© 2014 EXASOL AG

9

Analy3cal Queries

§  “GROUP BY“ logic

§  i.e. not concerned with individual data items

§  Analy;cal Func;ons §  MAX, MEDIAN, MIN, SUM, COUNT, STANDARD DEVIATION …

§  Table joins, nested sub-‐queries Usually short-‐running, ad-‐hoc and submiced many at a ;me.

© 2014 EXASOL AG

10

Map/Reduce and HDFS : the wrong tools for Analy3cs ?

§  Queries tend to be short : fault tolerance is less important §  If chance of failure in a 5 hour batch is 1 in 300 § Chance of failure in a 5 second query is 1 in 1,000,000

§  Queries tend to be short : start-‐up ;me is significant §  a 20 second start-‐up ;me is NOT OK on a 5 second query

§  A number of projects started to address these issues §  e.g. “Hot containers” in Hive on Tez to reduce start-‐up ;me

© 2014 EXASOL AG

11

Example taken from Reynold Xin’s 2012 “Shark: Hive (SQL) on Spark” presentation

Map/Reduce: the wrong language for Analy3cs ?

Stage 0: Map-Shuffle-ReduceMapper(row) { fields = row.split("\t") emit(fields[0], fields[1]);}Reducer(key, values) { sum = 0; for (value in values) { sum += value; } emit(key, sum);}

Stage 1: Map-ShuffleMapper(row) { ... emit(page_views, page_name);}... shuffleStage 2: Localdata = open("stage1.out")for (i in 0 to 10) { print(data.getNext())}

© 2014 EXASOL AG

12

Equivalent in SQL

SELECT page_name, SUM(page_views) views

FROM wikistats GROUP BY page_nameORDER BY views DESC LIMIT 10;

© 2014 EXASOL AG

13

The SQL language

§  Portable § Well-‐defined standards exist §  No detailed knowledge of the plaporm required

§  e.g. you don’t need to manage memory §  SQL is assumed by a lot of repor;ng tools § Widely used and understood even by non-‐technical people

© 2014 EXASOL AG

14

I‘m not saying that SQL is perfect

•  Try writing the simple Hadoop “Word Count” example in pure SQL

•  Or try to “sessionise” weblog data

•  Or anything with data that is not structured•  “Which part of STRUCTURED Query Language

don’t you understand …?!”

•  All I’m saying is that is an excellent language for analytical queries.

© 2014 EXASOL AG

15

Hadoop could handle SQL (via Hive), but historically …

§  High Latency §  Restricted SQL op;ons §  All but simple table joins were difficult §  Licle support for compression & indexing § Merv Adrian (Gartner Research -‐ 2014)

§ “What is remarkable is that Hadoop does SQL. Just don’t expect it to do it well”

§  Result : EVERYTHING looked good compared to Hive

© 2014 EXASOL AG

16

Everyone s3ll likes to compare themselves to Hive

© 2014 EXASOL AG

17

EXASOL being no excep3on !

© 2014 EXASOL AG

18

Hive con3nues to be improved … §  Completed §  Views (HIVE-‐1143)

§  Par;;oned Views (HIVE-‐1941)

§  Storage Handlers (HIVE-‐705)

§  HBase Integra;on §  HBase Bulk Load

§  Locking (HIVE-‐1293)

§  Indexes (HIVE-‐417)

§  Bitmap Indexes (HIVE-‐1803)

§  Filter Pushdown (HIVE-‐279) §  Table-‐level Sta;s;cs (HIVE-‐1361)

§  Dynamic Par;;ons

§  Binary Data Type (HIVE-‐2380)

§  Decimal Precision and Scale Support §  HCatalog

§  HiveServer2 (HIVE-‐2935)

§  Column Sta;s;cs in Hive (HIVE-‐1362)

§  List Bucke;ng (HIVE-‐3026)

§  Group By With Rollup (HIVE-‐2397) §  Enhanced Aggrega;on, Cube, Grouping

and Rollup (HIVE-‐3433)

§  Op;mizing Skewed Joins (HIVE-‐3086)

§  Correla;on Op;mizer (HIVE-‐2206)

§  Hive on Tez (HIVE-‐4660) §  Vectorized Query

Execu;on (HIVE-‐4160)

§  In Progress §  Atomic Insert/Update/Delete

(HIVE-‐5317)

§  Transac;on Manager (HIVE-‐5843)

§  Cost Based Op;mizer in Hive (HIVE-‐5775)

§  Proposed §  Spa;al Queries

§  Theta Join (HIVE-‐556)

§  JDBC Storage Handler

§  MapJoin Op;miza;on §  Proposal to standardize and expand

Authoriza;on in Hive

§  Dependent Tables (HIVE-‐3466)

§  AccessServer

§  Type Qualifiers in Hive §  MapJoin & Par;;on Pruning

(HIVE-‐5119)

§  SQL Standard based secure authoriza;on (HIVE-‐5837)

§  Updatable Views (HIVE-‐1143) §  Hive on Spark (HIVE-‐7292)

© 2014 EXASOL AG

19

The dream data architecture for analy3cs …

§ Based on the SQL language § but leverages Hadoop’s extreme scalability §  and Hadoop’s fault tolerance § while not compromising on speed.

Could it please also have some maturity ? And be easy to use ?

© 2014 EXASOL AG

20

The current reality

§  SQL on SQL, which is arguably § Less scalable § Less fault tolerant § Less good with unstructured data

§  SQL on Hadoop, which is arguably § Less mature § Less easy to use § Slower

© 2014 EXASOL AG

21

Choices for SQL and Hadoop

§  SQL AND HADOOP §  A Connector

§  HADOOP ON SQL §  User Defined Func;ons

§  SQL ON HADOOP §  Something like Hive, but becer

© 2014 EXASOL AG

22

Op3on 1 – SQL AND HADOOP

Run SQL-‐on-‐SQL and Hadoop-‐on-‐Hadoop and use a connector to join the two systems

Pros §  Minimal impact (SQL and Hadoop worlds can func;on as before) §  Easier to implement

Cons §  Network ! §  Challenge of op;mising across two technologies

© 2014 EXASOL AG

23

Op3on 2 – HADOOP ON SQL

§  Bring Map/Reduce into the Parallel database §  For example using Java User Defined Func;ons

select my_java_map_func1on(words) a_word, count(*) word_count from DOCUMENTS group by 1

§  Doesn’t benefit from Hadoop’s storage advantages

© 2014 EXASOL AG

24

Op3on 3 -‐ SQL ON HADOOP

Build a rela;onal database on Hadoop storage §  Impala (Cloudera) §  S;nger (Hortonworks) §  Presto (Facebook) §  SparkSQL (UC Berkeley) §  HAWQ (Pivotal) §  BigSQL (IBM) §  Apache Phoenix (for HBase) §  Apache Tajo §  Apache Drill §  etc etc etc …. AND DON‘T FORGET HIVE !

© 2014 EXASOL AG

25

Four possible market outcomes…

§  Hadoop and SQL databases are on a collision course – only one will survive §  No sign of that so far

§  They are complementary – both will survive §  Probably -‐ the challenge is how to make them work together

§  They will merge and become one §  Some indica;ons this is already star;ng to happen

§  Something even more amazing will come along and replace them both §  Some;mes this happens – Spark ?

© 2014 EXASOL AG

26

My Personal Opinionated Opinion

Becer to use a tool that has been made for the job A purpose-‐built tool will always beat one made originally for another purpose.

graham mossman - sql and high performance computing on hadoop

Technology