stinger hadoop summit june 2013

16
Putting the Sting in Hive Page 1 Alan F. Gates @alanfgates

Upload: alanfgates

Post on 27-Jan-2015

114 views

Category:

Technology


7 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Stinger hadoop summit june 2013

Putting the Sting in Hive

Page 1

Alan F. Gates

@alanfgates

Page 2: Stinger hadoop summit june 2013

© 2013 Hortonworks

Stinger Overview

Page 2

•An initiative, not a project or product• Includes changes to Hive and a new project Tez•Two main goals

–Improve Hive performance 100x over Hive 0.10–Extend Hive SQL to include features needed for analytics

•Hive will support:–BI tools connecting to Hadoop–Analysts performing ad-hoc, interactive queries–Still excellent at the large batch jobs it is used for today

Page 3: Stinger hadoop summit june 2013

© 2013 Hortonworks

Stinger Mileposts

Page 3

Stinger Phase 3• Buffer Cache• Cost Based

Optimizer

Stinger Phase 2• YARN Resource Mgmnt• Hive on Apache Tez• Query Service• Vectorized Operators

Stinger Phase 1• Base Optimizations• SQL Analytics• ORCFile Format

1 2 Improve existing tools & preserve investments

Enable Hive to support interactive workloads

Released in Hive 0.11

Current Work

Roadmap

Page 4: Stinger hadoop summit june 2013

© 2013 Hortonworks

Hive Performance Gains in 0.11

Page 4

• Enable star joins by improving Hive’s map join (aka broadcast join)–Where possible do in single map only task–When not possible push larger tables to separate tasks

• Collapse adjacent jobs where possible–Hive has lots of M->MR type plans, collapse these to MR–Collapse adjacent jobs on sufficiently similar keys when

feasible– join followed by group– join followed by order– group followed by order

• Improvements in sort merge bucket (SMB) joins

Page 5: Stinger hadoop summit june 2013

© Hortonworks Inc. 2013

Improvements in Map Joins

• TPC-DS Query 27, Scale=200, 10 EC2 nodes (40 disks)

Query 270

200

400

600

800

1000

1200

1400

16001501.895

1176.479

631.026999999999

39.985

Text

Linear (Text)

RCFile

Partitioned RCFile

Partitioned RCFile + Optimizations

Page 5

Page 6: Stinger hadoop summit june 2013

© Hortonworks Inc. 2013

Before

Page 6

Page 7: Stinger hadoop summit june 2013

© Hortonworks Inc. 2013

After

Page 7

Page 8: Stinger hadoop summit june 2013

© Hortonworks Inc. 2013

Improvements in SMB Joins

• TPC-DS Query 82, Scale=200, 10 EC2 nodes (40 disks)

Query 820

500

1000

1500

2000

2500

3000

35003257.692

2862.669

255.641

71.114

Text

RCFile

Partitioned RCFile

Partitioned RCFile + Optimizations

Page 8

Page 9: Stinger hadoop summit june 2013

© 2013 Hortonworks

New Technologies in Hive

Page 9

• All covered in depth in other talks– See Owen’s, Eric’s, and Jitendra’s talk ORC File & Vectorization at 4:25 today

• Tez – A new execution engine for relational tools such as Hive– No need to use MapReduce, instead provides general DAG execution– Data moved between tasks via socket, disk, or HDFS based on performance / re-

startability trade off– Provides standing service to greatly reduce query start time

• ORCFile – A rewrite of RCFile– Columnar– Tightly integrated with Hive’s type model, including support for nested types– Much better compression – Supports projection and filter push down

• Vectorization – Rewriting operators to take advantage of modern processors– Based on work done in MonetDB– Rewrite operators to radically reduce number of function calls, branch prediction

misses, and cache misses

Page 10: Stinger hadoop summit june 2013

© Hortonworks Inc. 2013

Standard Queries

Page 10

Query 27_x000d_Scale 200

Query 82_x000d_Scale 200

Query 27_x000d_Scale

1000

Query 82_x000d_Scale

1000

0

50

100

150

200

250

300

260

165

38

77

142

296

38 42

6780

Query 27 Star JoinQuery 82 Fact Table Join

Hive 0.10, RC File

Hive 0.11 CP, RC File

Hive 0.11 CP, ORC File

Page 11: Stinger hadoop summit june 2013

© Hortonworks Inc. 2013

Performance Trajectory

Page 11

Hive 10_x000d_T

ext

Hive 10_x000d_R

C

Hive 11_x000d_R

C

Hive 11_x000d_O

RC

Hive 11 CP_x000d_O

RC, Tez_x000d_Vectoriza-

tion

0X

5X

10X

15X

20X

25X

1X2X

12X11X

21X

Query 27 Speedup

Hive 10_x000d_

Text

Hive 10_x000d_

RC

Hive 11_x000d_

RC

Hive 11_x000d_

ORC

Hive 11 CP_x000d_ORC, Tez

0X

10X

20X

30X

40X

50X

60X

70X

80X

90X

1X

14X

44X

57X

78X

Query 82 Speedup

Page 12: Stinger hadoop summit june 2013

© Hortonworks Inc. 2013

Query 12 – Demonstrating MRR

Page 12

RC File _x000d_Scale 200

ORC File _x000d_Scale 200

RC File _x000d_Scale 1000

ORC File _x000d_Scale 1000

0

10

20

30

40

50

60

70

80

55 54

75

65

35 34

55

46

Query 12 - MRR Optimization

Traditional _x000d_Map-ReduceTez Map_x000d_Reduce Reduce

Elap

sed

Tim

e (s

econ

ds)

Page 13: Stinger hadoop summit june 2013

© 2013 Hortonworks

Hive Performance Up Next

Page 13

• Push down start up time - even for queries that spend less than a second running on the cluster, there is ~15 seconds of start up time– Tez service will remove Hadoop startup issues– Need to reduce time for the metadata access– Need intelligent file caching so that hot tables can be kept in memory

• Keep working on the optimizer– Y Smart work from Ohio State University– Start using statistics to make intelligent decisions about how many mappers and

reducers to spawn – maybe in Hive, maybe in Tez– Start using statistics to choose between competing plan options

• Buffer Cache– Coordinate with HDFS team to determine caching strategy

Page 14: Stinger hadoop summit june 2013

© 2013 Hortonworks

Extending Hive SQL in 0.11

Page 14

• DECIMAL data type – for fixed precision calculation (e.g. currency)• OVER clause

– PARTITION BY, ORDER BY, ROWS BETWEEN/FOLLOWING/PRECEDING

– Works with existing aggregate functions– New analytic and window functions added

– ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, LEAD, FIRST_VALUE, LAST_VALUE, NTILE, CUME_DIST, PERCENT_RANK

SELECT salesperson, AVG(salesprice) OVER (PARTITION BY region ORDER BY date

ROWS BETWEEN 10 PRECEEDING AND 10 FOLLOWING)FROM sales;

Page 15: Stinger hadoop summit june 2013

© 2013 Hortonworks

Extending Hive SQL Post 0.11

Page 15

• Subqueries in WHERE– Non-correlated first– [NOT] IN first, then extend to (in)equalities and EXISTS

• Datatype conformance – Hive has Java type model, add support for SQL types:– DATE– CHAR() and VARCHAR()– add precision and scale to decimal and float– aliases for standard SQL types (BLOB = binary, CLOB = string, integer =

int, real/number = decimal)

• Security– Add security checks to views, indices, functions, etc.– Secure GRANT and REVOKE

Page 16: Stinger hadoop summit june 2013

© 2013 Hortonworks

Questions

Page 16