graham mossman - sql and high performance computing on hadoop
DESCRIPTION
As usage of Hadoop has broadened, alternative frameworks are emerging which offer more high performance computing based on SQL and ‘SQL-like’ technologies. In this talk, Graham considers how Hadoop and SQL alternatives compare in terms of use cases, best practice, functionality and performance. He goes on to consider future possible market outcomes with regard to Hadoop and SQL architectures and integration options.TRANSCRIPT
© 2014 EXASOL AG
1
SQL and high performance compu3ng on Hadoop
Graham Mossman, Senior Solu;on Engineer, EXASOL
© 2014 EXASOL AG
2
I Love My Lawnmower ...
... because it cuts my grass well
© 2014 EXASOL AG
3
But ...
... it‘s quite a struggle cuBng my hedge
© 2014 EXASOL AG
4
And ...
... it isn‘t good at making apple sauce
© 2014 EXASOL AG
5
And don‘t even thinking about...
... using it to cut hair
© 2014 EXASOL AG
6
Hadoop today is …
§ S;ll Open Source ! § Began with HDFS and Map/Reduce § Now comprises a number of addi;onal technologies
§ File systems § (e.g. Tachyon)
§ Cluster Managers § (e.g. YARN + Mesos)
§ Execu;on Engines § (e.g. Tez, Spark etc.)
§ Analy;cal Layer and Applica;ons § (e.g. Hive, Pig, various SQL on Hadoop)
© 2014 EXASOL AG
7
Hadoop With Everything
§ Hadoop was invented to more easily distribute the Nutch and Lucene applica;ons across a cluster of machines. § Map/Reduce – distributed processing § HDFS – distributed file system
§ Began to be used for …. just about everything. § But not all processing tasks are like indexing the Internet § Hadoop started to acract cri;cism
§ But usually when it was being used for something it wasn’t designed for
© 2014 EXASOL AG
8
Definitely NOT jobs for Hadoop
§ Word processing
§ Payroll system
§ Anything on a single computer
§ Anything with “small” data
© 2014 EXASOL AG
9
Analy3cal Queries
§ “GROUP BY“ logic
§ i.e. not concerned with individual data items
§ Analy;cal Func;ons § MAX, MEDIAN, MIN, SUM, COUNT, STANDARD DEVIATION …
§ Table joins, nested sub-‐queries Usually short-‐running, ad-‐hoc and submiced many at a ;me.
© 2014 EXASOL AG
10
Map/Reduce and HDFS : the wrong tools for Analy3cs ?
§ Queries tend to be short : fault tolerance is less important § If chance of failure in a 5 hour batch is 1 in 300 § Chance of failure in a 5 second query is 1 in 1,000,000
§ Queries tend to be short : start-‐up ;me is significant § a 20 second start-‐up ;me is NOT OK on a 5 second query
§ A number of projects started to address these issues § e.g. “Hot containers” in Hive on Tez to reduce start-‐up ;me
© 2014 EXASOL AG
11
Example taken from Reynold Xin’s 2012 “Shark: Hive (SQL) on Spark” presentation
Map/Reduce: the wrong language for Analy3cs ?
Stage 0: Map-Shuffle-ReduceMapper(row) { fields = row.split("\t") emit(fields[0], fields[1]);}Reducer(key, values) { sum = 0; for (value in values) { sum += value; } emit(key, sum);}
Stage 1: Map-ShuffleMapper(row) { ... emit(page_views, page_name);}... shuffleStage 2: Localdata = open("stage1.out")for (i in 0 to 10) { print(data.getNext())}
© 2014 EXASOL AG
12
Equivalent in SQL
SELECT page_name, SUM(page_views) views
FROM wikistats GROUP BY page_nameORDER BY views DESC LIMIT 10;
© 2014 EXASOL AG
13
The SQL language
§ Portable § Well-‐defined standards exist § No detailed knowledge of the plaporm required
§ e.g. you don’t need to manage memory § SQL is assumed by a lot of repor;ng tools § Widely used and understood even by non-‐technical people
© 2014 EXASOL AG
14
I‘m not saying that SQL is perfect
• Try writing the simple Hadoop “Word Count” example in pure SQL
• Or try to “sessionise” weblog data
• Or anything with data that is not structured• “Which part of STRUCTURED Query Language
don’t you understand …?!”
• All I’m saying is that is an excellent language for analytical queries.
© 2014 EXASOL AG
15
Hadoop could handle SQL (via Hive), but historically …
§ High Latency § Restricted SQL op;ons § All but simple table joins were difficult § Licle support for compression & indexing § Merv Adrian (Gartner Research -‐ 2014)
§ “What is remarkable is that Hadoop does SQL. Just don’t expect it to do it well”
§ Result : EVERYTHING looked good compared to Hive
© 2014 EXASOL AG
16
Everyone s3ll likes to compare themselves to Hive
© 2014 EXASOL AG
17
EXASOL being no excep3on !
© 2014 EXASOL AG
18
Hive con3nues to be improved … § Completed § Views (HIVE-‐1143)
§ Par;;oned Views (HIVE-‐1941)
§ Storage Handlers (HIVE-‐705)
§ HBase Integra;on § HBase Bulk Load
§ Locking (HIVE-‐1293)
§ Indexes (HIVE-‐417)
§ Bitmap Indexes (HIVE-‐1803)
§ Filter Pushdown (HIVE-‐279) § Table-‐level Sta;s;cs (HIVE-‐1361)
§ Dynamic Par;;ons
§ Binary Data Type (HIVE-‐2380)
§ Decimal Precision and Scale Support § HCatalog
§ HiveServer2 (HIVE-‐2935)
§ Column Sta;s;cs in Hive (HIVE-‐1362)
§ List Bucke;ng (HIVE-‐3026)
§ Group By With Rollup (HIVE-‐2397) § Enhanced Aggrega;on, Cube, Grouping
and Rollup (HIVE-‐3433)
§ Op;mizing Skewed Joins (HIVE-‐3086)
§ Correla;on Op;mizer (HIVE-‐2206)
§ Hive on Tez (HIVE-‐4660) § Vectorized Query
Execu;on (HIVE-‐4160)
§ In Progress § Atomic Insert/Update/Delete
(HIVE-‐5317)
§ Transac;on Manager (HIVE-‐5843)
§ Cost Based Op;mizer in Hive (HIVE-‐5775)
§ Proposed § Spa;al Queries
§ Theta Join (HIVE-‐556)
§ JDBC Storage Handler
§ MapJoin Op;miza;on § Proposal to standardize and expand
Authoriza;on in Hive
§ Dependent Tables (HIVE-‐3466)
§ AccessServer
§ Type Qualifiers in Hive § MapJoin & Par;;on Pruning
(HIVE-‐5119)
§ SQL Standard based secure authoriza;on (HIVE-‐5837)
§ Updatable Views (HIVE-‐1143) § Hive on Spark (HIVE-‐7292)
© 2014 EXASOL AG
19
The dream data architecture for analy3cs …
§ Based on the SQL language § but leverages Hadoop’s extreme scalability § and Hadoop’s fault tolerance § while not compromising on speed.
Could it please also have some maturity ? And be easy to use ?
© 2014 EXASOL AG
20
The current reality
§ SQL on SQL, which is arguably § Less scalable § Less fault tolerant § Less good with unstructured data
§ SQL on Hadoop, which is arguably § Less mature § Less easy to use § Slower
© 2014 EXASOL AG
21
Choices for SQL and Hadoop
§ SQL AND HADOOP § A Connector
§ HADOOP ON SQL § User Defined Func;ons
§ SQL ON HADOOP § Something like Hive, but becer
© 2014 EXASOL AG
22
Op3on 1 – SQL AND HADOOP
Run SQL-‐on-‐SQL and Hadoop-‐on-‐Hadoop and use a connector to join the two systems
Pros § Minimal impact (SQL and Hadoop worlds can func;on as before) § Easier to implement
Cons § Network ! § Challenge of op;mising across two technologies
© 2014 EXASOL AG
23
Op3on 2 – HADOOP ON SQL
§ Bring Map/Reduce into the Parallel database § For example using Java User Defined Func;ons
select my_java_map_func1on(words) a_word, count(*) word_count from DOCUMENTS group by 1
§ Doesn’t benefit from Hadoop’s storage advantages
© 2014 EXASOL AG
24
Op3on 3 -‐ SQL ON HADOOP
Build a rela;onal database on Hadoop storage § Impala (Cloudera) § S;nger (Hortonworks) § Presto (Facebook) § SparkSQL (UC Berkeley) § HAWQ (Pivotal) § BigSQL (IBM) § Apache Phoenix (for HBase) § Apache Tajo § Apache Drill § etc etc etc …. AND DON‘T FORGET HIVE !
© 2014 EXASOL AG
25
Four possible market outcomes…
§ Hadoop and SQL databases are on a collision course – only one will survive § No sign of that so far
§ They are complementary – both will survive § Probably -‐ the challenge is how to make them work together
§ They will merge and become one § Some indica;ons this is already star;ng to happen
§ Something even more amazing will come along and replace them both § Some;mes this happens – Spark ?
© 2014 EXASOL AG
26
My Personal Opinionated Opinion
Becer to use a tool that has been made for the job A purpose-‐built tool will always beat one made originally for another purpose.