sql in hadoop
DESCRIPTION
This presentation describes the Query Compiler of Hive for MapReduce. The architecture of the Hive Query Compiler is explained. Additionally, the compilation of a SQL-query to a MapReduce-Job is shown. This presentation was created with the a presentation of Takeshi Nakano.TRANSCRIPT
SQL in Hadoop
Munich, 21 January 2014
Sven Bayer
QAware GmbH
QAware 321 January 2014
■45 employees
■Software Engineering
■Quality
■Agility
■Projects
■Software diagnosis
■Individual Software solutions
■Customers
■Automotive, Energy, Retail,
Telecommunications, and others
Agenda
QAware 421 January 2014
1. Motivation
2. Big Data
3. MapReduce
4. Hadoop
5. Hive
6. Hive Query Compiler
7. Discussion
Motivation
QAware 521 January 2014
■Hadoop processes
huge data on clusters
■Hive provides
SQL for Hadoop
■Hive generates complex
MapReduce jobs from SQL
■How does Hive convert
SQL to MapReduce?
Masterthesis:
Software-metrics + time tracking
Big Data
QAware 621 January 2014
■Defined by 4 V‘s
■Volume
■Velocity
■Variety
■Veracity
MapReduce
QAware 721 January 2014
■In 2004 published by Google
■MapReduce is highly scalable on clusters
■Big Data can be processed with MapReduce
■Consists mainly of a Map and Reduce function
Example: Word-Count-Algorithm
QAware 821 January 2014
MapReduce
Hadoop uses MapReduce.
There isa Map phase.
There is a Reduce phase,a Map phase.
There isa Map phase.
Input Map
{Hadoop,1},{uses,1},
{MapReduce,1}
Sort, Shuffle
{There,1}, {is,1}, {a,1},
{Map,1}, {phase,1}
{There,1}, {is,1}, {a,1}, {Reduce,1}, {phase,1}, {a,1},
{Map,1}, {phase,1}
{There,1}, {is,1}, {a,1}, {Map,1},
{phase,1}
{Hadoop,1}
{uses,1}
{MapReduce,1}
{There,1}, {There,1}, {There, 1}
{is,1}, {is,1},{is, 1},
{a,1}, {a,1},{a,1}, {a,1}
{Map,1}, {Map,1}, {Map, 1}
{phase,1}, {phase,1}, {phase, 1}, {phase,1}
{Reduce,1}
Reduce
Hadoop 1uses 1MapReduce 1
{There,[1,1,1]},{is,[1,1,1]}, {a,[1,1,1,1]}
{Hadoop,[1]},{uses,[1]},
{MapReduce,[1]}
{Map,[1,1,1]},{phase,[1,1,1,1]},
{Reduce,[1]}
There 3is 3a 4
Map 3phase 4Reduce 1
Output
MapReduce – In practice
QAware 921 January 2014
■Get the users with the products that they watched
■Get these products with their numbers, makers, price
and filter the products on „audi“
access_log
product
Input Map
{pNo1,user1},{pNo1,user2},{pNo2,user3}
Sort, Shuffle
{pNo1,{audi,30€}},{pNo2,{audi,50€}},{pNo3,{bmw,60€}}
+Filtering on „audi“
{pNo1,user1},{pNo1,user2},
{pNo1,{audi,30€}}
{pNo2,user3},{pNo2,{audi,50€}}
Reduce
pNo1 user1,audi,30€,pNo1 user2,audi,30€,pNo2 user3,audi,50€
Join on product_no+
{pNo1,[user1,user2,{audi,30€}]},{pNo2,[user3,{audi,50€}]}
Output
pNo1 audi 30€
pNo2 audi 50€
pNo3 bmw 60€
id1 user1 pNo1
id2 user2 pNo1
id3 user3 pNo2
Hadoop
QAware 1021 January 2014
■In 2006 initiated by Yahoo
■Hadoop cluster
■Highly scalalbe for Big Data
■Hadoop architecture
Hadoop Common
HDFS
YARN (MapReduce)
Hive
QAware 1121 January 2014
■Built on top of Hadoop
■MapReduce
■HDFS
■Provides HiveQL queries for Hadoop
■Compiles HiveQL to MapReduce
Metastore
CLIJDBC/ODBC
Driver
Web-UI
Thrift Server
Execution Engine
Query Compiler
Parser
Semantic Analyzer
Logical Plan Generator
Logical Optimizer
Physical Plan Generator
Physical Optimizer
Hive architecture
QAware 1221 January 2014
Hive
Legend
Framework
Component
Framework
Component
Cal
l of
a co
mp
on
ent
Hive Query Compiler
QAware 21 January 2014
Start HiveQL
Parser AST
Semantic Analyzer
QB
Logical Plan
Generator
QB Tree
Logical Optimizer
QB Tree
Physical Plan
Generator
Phys. Plan
Physical Optimizer
Phys. Plan
Execution Engine
End
Hive
Parser
QAware 1521 January 2014
SELECT a.user, a.product_no, p.maker, p.price
FROM access_log a JOIN product p
ON (a.product_no = p.product_no)
WHERE p.maker = `audi`;
Hive Query Compiler
HiveQL Parser AST
ParserSemantic Analyzer
Logical Plan Generator
Logical Optimizer
Physical Plan Generator
Physical Optimizer
Execution Engine
access_log
product
pNo1 audi 30€
pNo2 audi 50€
pNo3 bmw 60€
id1 user1 pNo1
id2 user2 pNo1
id3 user3 pNo2
Parser
QAware 1621 January 2014
■… WHERE p.maker = `audi`;
Hive Query Compiler
HiveQL Parser AST
ParserSemantic Analyzer
Logical Plan Generator
Logical Optimizer
Physical Plan Generator
Physical Optimizer
Execution Engine
Semantic Analyzer
QAware 1721 January 2014
Hive Query Compiler
ASTSemantic Analyzer
QB
ParserSemantic Analyzer
Logical Plan Generator
Logical Optimizer
Physical Plan Generator
Physical Optimizer
Execution Engine
Query Block FROM-Clause
MetaData ParseInfo
Alias to Table Info“a”=Table Info(“access_log”)“p”=Table Info(“product”)
AST of Join-Expression
Logical Plan Generator
QAware 1821 January 2014
Hive Query Compiler
QBLogical Plan Generator
QB Tree
ParserSemantic Analyzer
Logical Plan Generator
Logical Optimizer
Physical Plan Generator
Physical Optimizer
Execution Engine
TableScanOperatorTS_0
TableScanOperatorTS_1
ReduceSinkOperatorRS_2
ReduceSinkOperatorRS_3
JoinOperatorJOIN_4
FilterOperatorFIL_5
(maker = ‘audi’)
SelectOperatorSEL_6
FileSinkOperatorFS_7
Logical Optimizer
QAware 1921 January 2014
Hive Query Compiler
QB Tree
Logical Optimizer
QB Tree
ParserSemantic Analyzer
Logical Plan Generator
Logical Optimizer
Physical Plan Generator
Physical Optimizer
Execution Engine
TableScanOperatorTS_0
TableScanOperatorTS_1
ReduceSinkOperatorRS_2
ReduceSinkOperatorRS_3
JoinOperatorJOIN_4
SelectOperatorSEL_6
FileSinkOperatorFS_7
FilterOperatorFIL_8
(maker = ‘audi’)
Phyiscal Plan Generator
QAware 2021 January 2014
Hive Query Compiler
QB Tree
Physical Plan Generator
Phys. Plan
ParserSemantic Analyzer
Logical Plan Generator
Logical Optimizer
Physical Plan Generator
Physical Optimizer
Execution Engine
MapRedTask (Stage-1/root)
Reducer
Mapper Mapper
TableScanOperatorTS_0
TableScanOperatorTS_1
ReduceSinkOperatorRS_2
ReduceSinkOperatorRS_3
JoinOperatorJOIN_4
SelectOperatorSEL_6
FileSinkOperatorFS_7
FilterOperatorFIL_8
(maker= ‘audi’)
MoveTask (Stage-0)
StatsTask (Stage-2)
Physical Optimizer
QAware 2121 January 2014
■Optimizes the Physical Plan
■Transforms a plan with Joins to multiple MapReduce jobs
■Converts tasks including a Join to a MapJoin
HiveQL-Verarbeitung
Phys. Plan
Physical Optimizer
Phys. Plan
ParserSemantic Analyzer
Logical Plan Generator
Logical Optimizer
Physical Plan Generator
Physical Optimizer
Execution Engine
Execution Engine
QAware 2221 January 2014
■MapReduce job is serialized as plan.xml
■Returns the result
■Temporary place
■Table
HiveQL-Verarbeitung
ParserSemantic Analyzer
Logical Plan Generator
Logical Optimizer
Physical Plan Generator
Physical Optimizer
Execution Engine
Phys. Plan
Execution Engine
Discussion
QAware 2321 January 2014
■Hive brings SQL to Hadoop
■Advantages of Hive
■Reduces developer workload
■No need for manual coding of MapReduce jobs
■Easy migration for systems interacting with SQL
■Disadvantages of Hive
■High latency
■Outlook for Hive
■Apache Tez with container reusage, Mapper reduction in DAG
■Alternatives for Hive
■Impala, Shark, Presto, Lingual