sql in hadoop

21
SQL in Hadoop Munich, 21 January 2014 Sven Bayer

Upload: bayersven

Post on 15-Jan-2015

257 views

Category:

Technology


1 download

DESCRIPTION

This presentation describes the Query Compiler of Hive for MapReduce. The architecture of the Hive Query Compiler is explained. Additionally, the compilation of a SQL-query to a MapReduce-Job is shown. This presentation was created with the a presentation of Takeshi Nakano.

TRANSCRIPT

Page 1: SQL in Hadoop

SQL in Hadoop

Munich, 21 January 2014

Sven Bayer

Page 2: SQL in Hadoop

QAware GmbH

QAware 321 January 2014

■45 employees

■Software Engineering

■Quality

■Agility

■Projects

■Software diagnosis

■Individual Software solutions

■Customers

■Automotive, Energy, Retail,

Telecommunications, and others

Page 3: SQL in Hadoop

Agenda

QAware 421 January 2014

1. Motivation

2. Big Data

3. MapReduce

4. Hadoop

5. Hive

6. Hive Query Compiler

7. Discussion

Page 4: SQL in Hadoop

Motivation

QAware 521 January 2014

■Hadoop processes

huge data on clusters

■Hive provides

SQL for Hadoop

■Hive generates complex

MapReduce jobs from SQL

■How does Hive convert

SQL to MapReduce?

Masterthesis:

Software-metrics + time tracking

Page 5: SQL in Hadoop

Big Data

QAware 621 January 2014

■Defined by 4 V‘s

■Volume

■Velocity

■Variety

■Veracity

Page 6: SQL in Hadoop

MapReduce

QAware 721 January 2014

■In 2004 published by Google

■MapReduce is highly scalable on clusters

■Big Data can be processed with MapReduce

■Consists mainly of a Map and Reduce function

Page 7: SQL in Hadoop

Example: Word-Count-Algorithm

QAware 821 January 2014

MapReduce

Hadoop uses MapReduce.

There isa Map phase.

There is a Reduce phase,a Map phase.

There isa Map phase.

Input Map

{Hadoop,1},{uses,1},

{MapReduce,1}

Sort, Shuffle

{There,1}, {is,1}, {a,1},

{Map,1}, {phase,1}

{There,1}, {is,1}, {a,1}, {Reduce,1}, {phase,1}, {a,1},

{Map,1}, {phase,1}

{There,1}, {is,1}, {a,1}, {Map,1},

{phase,1}

{Hadoop,1}

{uses,1}

{MapReduce,1}

{There,1}, {There,1}, {There, 1}

{is,1}, {is,1},{is, 1},

{a,1}, {a,1},{a,1}, {a,1}

{Map,1}, {Map,1}, {Map, 1}

{phase,1}, {phase,1}, {phase, 1}, {phase,1}

{Reduce,1}

Reduce

Hadoop 1uses 1MapReduce 1

{There,[1,1,1]},{is,[1,1,1]}, {a,[1,1,1,1]}

{Hadoop,[1]},{uses,[1]},

{MapReduce,[1]}

{Map,[1,1,1]},{phase,[1,1,1,1]},

{Reduce,[1]}

There 3is 3a 4

Map 3phase 4Reduce 1

Output

Page 8: SQL in Hadoop

MapReduce – In practice

QAware 921 January 2014

■Get the users with the products that they watched

■Get these products with their numbers, makers, price

and filter the products on „audi“

access_log

product

Input Map

{pNo1,user1},{pNo1,user2},{pNo2,user3}

Sort, Shuffle

{pNo1,{audi,30€}},{pNo2,{audi,50€}},{pNo3,{bmw,60€}}

+Filtering on „audi“

{pNo1,user1},{pNo1,user2},

{pNo1,{audi,30€}}

{pNo2,user3},{pNo2,{audi,50€}}

Reduce

pNo1 user1,audi,30€,pNo1 user2,audi,30€,pNo2 user3,audi,50€

Join on product_no+

{pNo1,[user1,user2,{audi,30€}]},{pNo2,[user3,{audi,50€}]}

Output

pNo1 audi 30€

pNo2 audi 50€

pNo3 bmw 60€

id1 user1 pNo1

id2 user2 pNo1

id3 user3 pNo2

Page 9: SQL in Hadoop

Hadoop

QAware 1021 January 2014

■In 2006 initiated by Yahoo

■Hadoop cluster

■Highly scalalbe for Big Data

■Hadoop architecture

Hadoop Common

HDFS

YARN (MapReduce)

Page 10: SQL in Hadoop

Hive

QAware 1121 January 2014

■Built on top of Hadoop

■MapReduce

■HDFS

■Provides HiveQL queries for Hadoop

■Compiles HiveQL to MapReduce

Page 11: SQL in Hadoop

Metastore

CLIJDBC/ODBC

Driver

Web-UI

Thrift Server

Execution Engine

Query Compiler

Parser

Semantic Analyzer

Logical Plan Generator

Logical Optimizer

Physical Plan Generator

Physical Optimizer

Hive architecture

QAware 1221 January 2014

Hive

Legend

Framework

Component

Framework

Component

Cal

l of

a co

mp

on

ent

Page 12: SQL in Hadoop

Hive Query Compiler

QAware 21 January 2014

Start HiveQL

Parser AST

Semantic Analyzer

QB

Logical Plan

Generator

QB Tree

Logical Optimizer

QB Tree

Physical Plan

Generator

Phys. Plan

Physical Optimizer

Phys. Plan

Execution Engine

End

Hive

Page 13: SQL in Hadoop

Parser

QAware 1521 January 2014

SELECT a.user, a.product_no, p.maker, p.price

FROM access_log a JOIN product p

ON (a.product_no = p.product_no)

WHERE p.maker = `audi`;

Hive Query Compiler

HiveQL Parser AST

ParserSemantic Analyzer

Logical Plan Generator

Logical Optimizer

Physical Plan Generator

Physical Optimizer

Execution Engine

access_log

product

pNo1 audi 30€

pNo2 audi 50€

pNo3 bmw 60€

id1 user1 pNo1

id2 user2 pNo1

id3 user3 pNo2

Page 14: SQL in Hadoop

Parser

QAware 1621 January 2014

■… WHERE p.maker = `audi`;

Hive Query Compiler

HiveQL Parser AST

ParserSemantic Analyzer

Logical Plan Generator

Logical Optimizer

Physical Plan Generator

Physical Optimizer

Execution Engine

Page 15: SQL in Hadoop

Semantic Analyzer

QAware 1721 January 2014

Hive Query Compiler

ASTSemantic Analyzer

QB

ParserSemantic Analyzer

Logical Plan Generator

Logical Optimizer

Physical Plan Generator

Physical Optimizer

Execution Engine

Query Block FROM-Clause

MetaData ParseInfo

Alias to Table Info“a”=Table Info(“access_log”)“p”=Table Info(“product”)

AST of Join-Expression

Page 16: SQL in Hadoop

Logical Plan Generator

QAware 1821 January 2014

Hive Query Compiler

QBLogical Plan Generator

QB Tree

ParserSemantic Analyzer

Logical Plan Generator

Logical Optimizer

Physical Plan Generator

Physical Optimizer

Execution Engine

TableScanOperatorTS_0

TableScanOperatorTS_1

ReduceSinkOperatorRS_2

ReduceSinkOperatorRS_3

JoinOperatorJOIN_4

FilterOperatorFIL_5

(maker = ‘audi’)

SelectOperatorSEL_6

FileSinkOperatorFS_7

Page 17: SQL in Hadoop

Logical Optimizer

QAware 1921 January 2014

Hive Query Compiler

QB Tree

Logical Optimizer

QB Tree

ParserSemantic Analyzer

Logical Plan Generator

Logical Optimizer

Physical Plan Generator

Physical Optimizer

Execution Engine

TableScanOperatorTS_0

TableScanOperatorTS_1

ReduceSinkOperatorRS_2

ReduceSinkOperatorRS_3

JoinOperatorJOIN_4

SelectOperatorSEL_6

FileSinkOperatorFS_7

FilterOperatorFIL_8

(maker = ‘audi’)

Page 18: SQL in Hadoop

Phyiscal Plan Generator

QAware 2021 January 2014

Hive Query Compiler

QB Tree

Physical Plan Generator

Phys. Plan

ParserSemantic Analyzer

Logical Plan Generator

Logical Optimizer

Physical Plan Generator

Physical Optimizer

Execution Engine

MapRedTask (Stage-1/root)

Reducer

Mapper Mapper

TableScanOperatorTS_0

TableScanOperatorTS_1

ReduceSinkOperatorRS_2

ReduceSinkOperatorRS_3

JoinOperatorJOIN_4

SelectOperatorSEL_6

FileSinkOperatorFS_7

FilterOperatorFIL_8

(maker= ‘audi’)

MoveTask (Stage-0)

StatsTask (Stage-2)

Page 19: SQL in Hadoop

Physical Optimizer

QAware 2121 January 2014

■Optimizes the Physical Plan

■Transforms a plan with Joins to multiple MapReduce jobs

■Converts tasks including a Join to a MapJoin

HiveQL-Verarbeitung

Phys. Plan

Physical Optimizer

Phys. Plan

ParserSemantic Analyzer

Logical Plan Generator

Logical Optimizer

Physical Plan Generator

Physical Optimizer

Execution Engine

Page 20: SQL in Hadoop

Execution Engine

QAware 2221 January 2014

■MapReduce job is serialized as plan.xml

■Returns the result

■Temporary place

■Table

HiveQL-Verarbeitung

ParserSemantic Analyzer

Logical Plan Generator

Logical Optimizer

Physical Plan Generator

Physical Optimizer

Execution Engine

Phys. Plan

Execution Engine

Page 21: SQL in Hadoop

Discussion

QAware 2321 January 2014

■Hive brings SQL to Hadoop

■Advantages of Hive

■Reduces developer workload

■No need for manual coding of MapReduce jobs

■Easy migration for systems interacting with SQL

■Disadvantages of Hive

■High latency

■Outlook for Hive

■Apache Tez with container reusage, Mapper reduction in DAG

■Alternatives for Hive

■Impala, Shark, Presto, Lingual