overview of stinger interactive query for hive

23
Overview of S+nger: Interac+ve Query for Hive @ddkaiser linkedin.com/in/dkaiser slideshare.net/ddkaiser [email protected] [email protected] OC Big Data Meetup #1 May 21, 2014 David Kaiser

Upload: david-kaiser

Post on 11-Aug-2014

260 views

Category:

Data & Analytics


2 download

DESCRIPTION

Presentation given to the OC Big Data Meetup Group. http://www.meetup.com/OCBigData

TRANSCRIPT

Page 1: Overview of stinger   interactive query for hive

Overview  of  S+nger:  Interac+ve  Query  for  Hive    

@ddkaiser  linkedin.com/in/dkaiser  slideshare.net/ddkaiser  [email protected]  [email protected]    

OC  Big  Data  Meetup  #1  May  21,  2014  

David  Kaiser  

Page 2: Overview of stinger   interactive query for hive

Who Am I?

David  Kaiser  

20+  years  experience  with  Linux    3  years  experience  with  Hadoop    Career  experiences:  •  Data  Warehousing  •  Geospa+al  Analy+cs  •  Open-­‐source  Solu+ons  and  Architecture  

 Employed  at  Hortonworks  as  a  Senior  Solu+ons  Engineer    

@ddkaiser  linkedin.com/in/dkaiser  slideshare.net/ddkaiser  [email protected]  [email protected]    

Page 3: Overview of stinger   interactive query for hive

Overview of Stinger: Interactive Query for Hive

• Abstract: – Hadoop  is  about  so  much  more  than  batch  processing.    With  the  recent  release  of  Hadoop  2,  there  have  been  many  new  approaches  for  increased  applica+on  performance.  

– Hive  is  the  most  used  SQL  implementa+on  on  Hadoop.    – Hive  provides  the  most  amount  of  SQL  compa+bility  on  Hadoop.  

– But…        Hive  is  Slow.            – Hive  WAS  Slow.  

– This  talk  will  discuss  the  S+nger  ini+a+ve,  which  improved  Hive  performance  over  100x.  

Page 4: Overview of stinger   interactive query for hive

S"nger  Project  (announced  February  2013)  

Batch AND Interactive SQL-in-Hadoop Stinger Initiative A broad, community-based effort to drive the next generation of HIVE

   Hive  0.13,  April  2014  

•  Hive  on  Apache  Tez  •  Query  Service  •  Buffer  Cache  •  Cost  Based  Op+mizer  (Op+q)  •  Vectorized  Processing  

 

Hive  0.11,  May  2013:  •  Base  Op+miza+ons  •  SQL  Analy+c  Func+ons  •  ORCFile,  Modern  File  Format  

Hive  0.12,  October  2013:  

•  VARCHAR,  DATE  Types  •  ORCFile  predicate  pushdown  •  Advanced  Op+miza+ons  •  Performance  Boosts  via  YARN  

Speed  Improve  Hive  query  performance  by  100X  to  allow  for  interac+ve  query  +mes  (seconds)  

Scale  The  only  SQL  processing  in  Hadoop  designed  for  queries  that  scale  from  TB  to  PB  

SQL  Support  broadest  range  of  SQL  seman+cs  for  analy+c  applica+ons  running  against  Hadoop  

Goals:  

An Open Community at its finest: Apache Hive Contribution

1,672 Jira Tickets Closed

145 Developers

44 Companies

~400,000 Lines Of Code Added…

13 Months

Page 5: Overview of stinger   interactive query for hive

Outcomes from the Stinger Project

Page 5

Feature   Descrip"on   Benefit  

Tez  Integra+on   Tez  is  significantly  beeer  engine  than  MapReduce   Latency  

Vectorized  Query   Take  advantage  of  modern  hardware  by  processing  thousand-­‐row  blocks  rather  than  row-­‐at-­‐a-­‐+me.   Throughput  

Query  Planner  

Using  extensive  sta+s+cs  now  available  in  Metastore  to  beeer  plan  and  op+mize  query,  including  predicate  pushdown  during  compila+on  to  eliminate  por+ons  of  input  (beyond  par++on  pruning)  

Latency  

ORC  File   Columnar,  type  aware  format  with  indices   Latency  

Cost  Based  Op+mizer  (Op+q)  

Join  re-­‐ordering  and  other  op+miza+ons  based  on  column  sta+s+cs  including  histograms  etc.   Latency  

Hive  as  a  Service   Leaves  engine  running  between  sessions   Latency  

Buffer  Cache   Leaves  most  used  HDFS  file  blocks  in  memory   Latency  

Page 6: Overview of stinger   interactive query for hive

Hadoop 2: Moving Past MapReduce

Page  6  

HADOOP  1.0  

HDFS  (redundant,  reliable  storage)  

MapReduce  (cluster  resource  management  

 &  data  processing)  

HDFS2  (redundant,  highly-­‐available  &  reliable  storage)  

YARN  (cluster  resource  management)  

MapReduce  (data  processing)  

Others  

HADOOP  2.0  

Single  Use  System  Batch  Apps  

Mul/  Purpose  Pla5orm  Batch,  Interac/ve,  Online,  Streaming,  …  

Page 7: Overview of stinger   interactive query for hive

Apache Tez as the new Primitive

HDFS2  (redundant,  reliable  storage)  

Tez  (execu+on  engine)  

YARN  (cluster  resource  management)  

HADOOP  2.0  

MapReduce  as  Base   Apache  Tez  as  Base  

HDFS  (redundant,  reliable  storage)  

MapReduce  (cluster  resource  management  

 &  data  processing)  

Pig  (data  flow)  

Hive  (sql)  

 Others  (Cascading)  

 

HADOOP  1.0  Data  Flow  

Pig  SQL  Hive  

 Others  (Cascading)  

 

Batch  MapReduce  

Slider  (con+nuous  execu+on)  

Online    Data    

Processing  HBase,  Accumulo  

Real  Time    Stream    

Processing  Storm  

Page 8: Overview of stinger   interactive query for hive

Complete Open Source Stack •  YARN is the logical extension of Apache Hadoop

–  Complements  HDFS,  the  data  reservoir    

•  Resource Management for the Enterprise Data Lake –  Shared,  secure,  mul+-­‐tenant  Hadoop  

Allows for all processing in Open-Source Hadoop

Page  8  

HDFS2  (Redundant,  Reliable  Storage)  

YARN  (Cluster  Resource  Management)      

BATCH  (MapReduce)  

INTERACTIVE  (Tez)  

STREAMING  (Storm,  S4,…)  

GRAPH  (Giraph)  

IN-­‐MEMORY  (Spark)  

HPC  MPI  (OpenMPI)  

ONLINE  (HBase)  

OTHER  (Search)  (Weave…)  

Page 9: Overview of stinger   interactive query for hive

Feature   Descrip"on   Benefit  

Tez  Session   Overcomes  Map-­‐Reduce  job-­‐launch  latency  by  pre-­‐launching  Tez  AppMaster   Latency  

Tez  Container  Pre-­‐Launch  

Overcomes  Map-­‐Reduce  latency  by  pre-­‐launching  hot  containers  ready  to  serve  queries.   Latency  

Tez  Container  Re-­‐Use  

Finished  maps  and  reduces  pick  up  more  work  rather  than  exi+ng.  Reduces  latency  and  eliminates  difficult  split-­‐size  tuning.  Out  of  box  performance!  

Latency  

Run+me  re-­‐configura+on  of  DAG  

Run+me  query  tuning  by  picking  aggrega+on  parallelism  using  online  query  sta+s+cs   Throughput  

Tez  In-­‐Memory  Cache   Hot  data  kept  in  RAM  for  fast  access.   Latency  

Complex  DAGs   Tez  Broadcast  Edge  and  Map-­‐Reduce-­‐Reduce  paeern  improve  query  scale  and  throughput.   Throughput  

Hive On Tez - Execution

Page 10: Overview of stinger   interactive query for hive

ORC File Advantages Sustained Query Times Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale

131  GB  (78%  Smaller)  

File  Size  Comparison  Across  Encoding  Methods  Dataset:  TPC-­‐DS  Scale  500  Dataset  

221  GB  (62%  Smaller)  

Encoded  with  Text  

Encoded  with  RCFile  

Encoded  with  ORCFile  

Encoded  with  Parquet  

505  GB  (14%  Smaller)  

585  GB  (Original  Size)   •  Larger  Block  Sizes    

•  Columnar  format  arranges  columns  adjacent  within  the  file  for  compression  &  fast  access  

Impala  

Hive  12  

Smaller Footprint Better encoding with ORC in Apache Hive 0.12 reduces resource requirements for your cluster.

Page 11: Overview of stinger   interactive query for hive

ORCFile  File  Format  

Page 11

Query-­‐Op"mized:  Split-­‐able,  columnar  storage  file    Efficient  Reads:  Break  into  large  “stripes”  of  data  for  efficient  read    Fast  Filtering:  Built  in  index,  min/max,  metadata  for  fast  filtering  blocks  -­‐  bloom  filters  if  desired    Efficient  Compression:  Decompose  complex  row  types  into  primi+ves:  massive  compression  and  efficient  comparisons  for  filtering    Precomputa"on:  Built  in  aggregates  per  block  (min,  max,  count,  sum,  etc.)    

Page 12: Overview of stinger   interactive query for hive

A Journey to SQL Compliance Evolu"on  of  SQL  Compliance  in  Hive  SQL  Datatypes   SQL  Seman"cs  INT/TINYINT/SMALLINT/BIGINT   SELECT,  INSERT  

FLOAT/DOUBLE   GROUP  BY,  ORDER  BY,  HAVING  

BOOLEAN   JOIN  on  explicit  join  key  

ARRAY,  MAP,  STRUCT,  UNION   Inner,  outer,  cross  and  semi  joins  

STRING   Sub-­‐queries  in  the  FROM  clause  

BINARY   ROLLUP  and  CUBE  

TIMESTAMP   UNION  

DECIMAL   Standard  aggrega+ons  (sum,  avg,  etc.)  

DATE   Custom  Java  UDFs  

VARCHAR   Windowing  func+ons  (OVER,  RANK,  etc.)  

CHAR   Advanced  UDFs  (ngram,  XPath,  URL)  

Interval  Types   Sub-­‐queries  for  IN/NOT  IN,  HAVING  

JOINs  in  WHERE  Clause  

INSERT/UPDATE/DELETE  

Legend  Hive  10  or  earlier  

Roadmap  

Hive  11  

Hive  12  

Hive  13  

Page 13: Overview of stinger   interactive query for hive

Tez – Execution Performance •  Performance gains over Map Reduce

–  Eliminate  replicated  write  barrier  between  successive  computa+ons.  –  Eliminate  job  launch  overhead  of  workflow  jobs.  –  Eliminate  extra  stage  of  map  reads  in  every  workflow  job.  –  Eliminate  queue  and  resource  conten+on  suffered  by  workflow  jobs  that  are  started  aper  a  predecessor  job  completes.  

Page  13  

Pig/Hive  -­‐  MR   Pig/Hive  -­‐  Tez  

Page 14: Overview of stinger   interactive query for hive

Hive  –  MR   Hive  –  Tez  

Hive-on-MR vs. Hive-on-Tez SELECT  a.state,  COUNT(*),  AVERAGE(c.price)  FROM  a  JOIN  b  on  (a.id  =  b.id)  JOIN  c  on  (a.itemId  =  c.itemId)  GROUP  by  a.state

SELECT  a.state  

JOIN  (a,  c)  SELECT  c.price  

SELECT  b.id  

JOIN(a,  b)  GROUP  BY  a.state  

COUNT(*)  AVERAGE(c.price)  

M M M

R R

M M

R

M M

R

M M

R

HDFS  

HDFS  

HDFS  

M M M

R R

R

M M

R

R

SELECT  a.state,  c.itemId  

JOIN  (a,  c)  

JOIN(a,  b)  GROUP  BY  a.state  

COUNT(*)  AVERAGE(c.price)  

SELECT  b.id  

Tez  avoids  unneeded  writes  to  HDFS  

Page 15: Overview of stinger   interactive query for hive

Vectorization • Rewrite all operations to operate on blocks of 1K+ records, rather than one record at a time

• Block is array of Java scalars, not Objects (eliminate Objects – compounding GC gains over time)

• Avoids many function calls, CPU pipeline stalls

•  Size to fit in L1 cache, avoid cache misses

Page  15  

Page 16: Overview of stinger   interactive query for hive

Stinger Phase 3: Unlocking Interactive Query

S"nger  Phase  3:  Features  and  Benefits  

Container  Pre-­‐Launch   Overcomes  Java  VM  startup  latency  by  pre-­‐launching  hot  containers  ready  to  serve  queries  

Container  Re-­‐Use  Finished  Maps  and  Reduces  pick  up  more  work  rather  than  exi+ng.  Reduces  latency  and  eliminates  difficult  split  size  tuning  

Tez  Integra+on   Tez  Broadcast  Edge  and  Map-­‐Reduce-­‐Reduce  paeern  improve  query  scale  and  throughput  

In-­‐Memory  Cache   Hot  data  kept  in  RAM  for  fast  access  

Page 17: Overview of stinger   interactive query for hive

Quantifying Stinger

Page 17

Hive 10 Hive 0.13 (Phase 3) Hive 0.11 (Phase 1)

190x  Improvement  

1400s

39s

7.2s

TPC-­‐DS  Query  27  

3200s

65s

14.9s

TPC-­‐DS  Query  82  

200x  Improvement  

Query  27:  Pricing  Analy"cs  using  Star  Schema  Join    Query  82:  Inventory  Analy"cs  Joining  2  Large  Fact  Tables  

All  Results  at  Scale  Factor  200  (Approximately  200GB  Data)  

Page 18: Overview of stinger   interactive query for hive

41.1s

4.2s

39.8s

4.1s TPC-­‐DS  Query  52   TPC-­‐DS  Query  55  

Query  Time  in  Seconds  

Speed: Delivering Interactive Query

Test  Cluster:  •  200  GB  Data  (ORCFile)  •  20  Nodes,  24GB  RAM  each,  6x  disk  each    

Hive 0.12

Hive 0.13 (Phase 3)

Query  52:  Star  Schema  Join  with  group-­‐by,  order-­‐by  on  different  keys  Query  55:  Star  Schema  Join  with  group-­‐by,  order-­‐by  on  different  keys  

Page 19: Overview of stinger   interactive query for hive

22s

9.8s

31s

6.7s TPC-­‐DS  Query  28   TPC-­‐DS  Query  12  

Query  Time  in  Seconds  

Speed: Delivering Interactive Query

Test  Cluster:  •  200  GB  Data  (ORCFile)  •  20  Nodes,  24GB  RAM  each,  6x  disk  each    

Hive 0.12

Hive 0.13 (Phase 3)

Query  28:  Four  sub-­‐query  join  (Vectoriza"on)  Query  12:  Star  Join  over  range  of  dates  (M-­‐R-­‐R  palern)  

Page 20: Overview of stinger   interactive query for hive

Hortonworks Confidential © 2014

Speed@Scale: Large Scale Implementation

Page 20

http://blogs.cisco.com/datacenter/hdp

Cisco Engineering Blog Post Independent assessment by Cisco UCS Team Benchmark @ 30TB

Page 21: Overview of stinger   interactive query for hive

Hortonworks Confidential © 2014

Speed@Scale: Large Scale Implementation

Page 21

https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/

Facebook Engineering Blog Post Hortonworks engineering team worked on ORCFile Facebook provided improvements to ORCFile, working with Hortonworks Hive is used for efficient analytics on the largest Hadoop Data Warehouse site Ultimate Scale Data Analysis

Page 22: Overview of stinger   interactive query for hive

Your Fastest On-ramp to Enterprise Hadoop™!

Page  22  

hep://hortonworks.com/products/hortonworks-­‐sandbox/  

The  Sandbox  lets  you  experience  Apache  Hadoop  from  the  convenience  of  your  own  laptop  –  no  data  center,  no  cloud  and  no  internet  connec+on  needed!    The  Hortonworks  Sandbox  is:  •  A  free  download:    hep://hortonworks.com/products/hortonworks-­‐sandbox/  •  A  complete,  self  contained  virtual  machine  with  Apache  Hadoop  pre-­‐configured  •  A  personal,  portable  and  standalone  Hadoop  environment  •  A  set  of  hands-­‐on,  step-­‐by-­‐step  tutorials  that  allow  you  to  learn  and  explore  Hadoop  

Page 23: Overview of stinger   interactive query for hive

Ques+ons?  

@ddkaiser  linkedin.com/in/dkaiser  slideshare.net/ddkaiser  [email protected]  [email protected]    

David  Kaiser