www.rubicon.nl
Hadoop: From Hive with Stinger to Tez
Jan Pieter Posthuma
March 5, 2015
2
Introduction
Jan Pieter Posthuma, Microsoft Data Consultant
Rubicon, local consultancy firm in the Netherlands
Architect role at multiple projects
Analysis Services, Reporting Services, Big Data, HDInsight, Cloud BI, Power BI
http://twitter.com/jppp
http://linkedin.com/[email protected]
3
Agenda
Hadoop
Hive
Stinger
Tez
4
Hadoop
Hadoop is a collection of software to create a data-intensive distributed cluster running on commodity hardware:
‘store and process the data on the Internet in a simple, scalable and economically feasible way’
Widely accepted by database vendors as a solution for unstructured data
Microsoft partners with Hortonworks and delivers the Hortonworks Data Platform (HDP) as Microsoft HDInsight (now on Windows and Linux)
Available on premises and as an Azure service
Hortonworks Data Platform (HDP): 100% open source!
5
Why SQL on Hadoop?
Hadoop is great for cost, but MapReduce is too difficult.
SQL on Hadoop makes Hadoop real and gives me scale that traditional SQL can’t offer.
I’m deleting important data because it’s too expensive to store it.
6
Hive
Developed at Facebook to address traditional RDBMS limitations.
300+ PB of data under management.
600+ TB of data loaded daily.
60,000+ Hive queries per day.
More than 1,000 users per day.
Initial Apache release in April 2009.
Problem: Hive is bound to MapReduce, which adds latency; it needs higher performance.
7
Stinger
‘Making Apache Hive 100 Times Faster’
Hortonworks blog, February 2013
SQL Engine: Vectorized SQL Engine
Columnar Storage: ORCFile
Distributed Execution: Apache Tez
Vectorized SQL Engine + ORCFile + Apache Tez = 100X
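As a rough illustration of how the three Stinger pieces show up in a Hive session, the sketch below switches the execution engine to Tez, turns on vectorized execution, and queries an ORC-backed table. The two property names are existing Hive settings (available from Hive 0.13); the table `sales_orc` and its columns are hypothetical and only there to make the example self-contained. Defaults differ per distribution and version, so treat this as a starting point rather than a tuned configuration.

```sql
-- Session-level settings; defaults vary per distribution and Hive version.
SET hive.execution.engine=tez;               -- run queries on Tez instead of MapReduce
SET hive.vectorized.execution.enabled=true;  -- process rows in batches (vectorized engine)

-- Hypothetical ORC-backed table, used only for illustration.
CREATE TABLE IF NOT EXISTS sales_orc (
  sale_id BIGINT,
  store   STRING,
  amount  DOUBLE
)
STORED AS ORC;

-- A simple aggregate that can benefit from all three components.
SELECT store, SUM(amount) AS total_amount
FROM sales_orc
GROUP BY store;
```

Setting `hive.execution.engine=mr` in the same session switches back to MapReduce, which makes it easy to compare run times for one query.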
8
ORCFiles
Started by Hortonworks to improve on the existing RCFile format, with input from Microsoft, and designed to work together with the query engine (QE) and Tez
Two goals:
Improve query speed
Improve storage efficiency
CREATE TABLE … STORED AS ORC
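A slightly fuller sketch of the statement above, assuming a hypothetical text-format source table `page_views_text` that is copied into an ORC table. The table and column names are invented for illustration. `orc.compress` is an existing ORC table property, and ZLIB is its usual default, so the `TBLPROPERTIES` clause is shown only to make the choice visible.

```sql
-- Hypothetical source table stored as delimited text.
CREATE TABLE IF NOT EXISTS page_views_text (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Same schema, stored as ORC with explicit compression.
CREATE TABLE IF NOT EXISTS page_views_orc (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");

-- Rewrite the text data into ORC files.
INSERT OVERWRITE TABLE page_views_orc
SELECT user_id, url, view_time
FROM page_views_text;
```

Queries against `page_views_orc` then read only the columns they reference, which is where the two goals (query speed and storage efficiency) come from.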
9
Yarn
10
Tez
11
Stinger TPC-DS Benchmark at 30 Terabyte Scale
Sample of 50 queries from TPC-DS at 30 terabyte scale.
Average 52x query speedup, maximum 160x query speedup.
Total benchmark time decreased from 7.8 days to 9.3 hours. (3)
The cost-based optimizer added in Hive 0.14 gave an additional 2.5x speedup.
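The cost-based optimizer only helps when it has statistics to work with. The sketch below shows one way to enable it and collect table and column statistics. The property names are existing Hive settings; the table `sales_orc` is hypothetical. It does not reproduce the benchmark setup, which the slide does not describe.

```sql
-- Enable the cost-based optimizer and let it use stored statistics.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- Hypothetical table, used only for illustration.
CREATE TABLE IF NOT EXISTS sales_orc (
  sale_id BIGINT,
  store   STRING,
  amount  DOUBLE
)
STORED AS ORC;

-- Collect table-level and column-level statistics for the optimizer.
ANALYZE TABLE sales_orc COMPUTE STATISTICS;
ANALYZE TABLE sales_orc COMPUTE STATISTICS FOR COLUMNS sale_id, store, amount;

-- Inspect the plan the optimizer chooses.
EXPLAIN
SELECT store, SUM(amount) AS total_amount
FROM sales_orc
GROUP BY store;
```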
12
Stinger.Next
Stinger.Next (in 3 phases)
Transactions with ACID semantics: allow users to easily modify data with inserts, updates and deletes. This extends Hive from the traditional write-once, read-often system to support analytics over changing data.
Sub-second queries: allow users to deploy Hive for interactive dashboards and exploratory analytics that have more demanding response-time requirements. Emergence of LLAP (Live Long and Process) and Hive on Spark.
SQL:2011 Analytics: allows rich reporting to be deployed on Hive faster, more simply and reliably using standard SQL. A powerful cost-based optimizer ensures complex queries and tool-generated queries run fast. Hive now provides the full expressive power that enterprise SQL users have enjoyed, but at Hadoop scale.
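To make the ACID phase concrete, here is a minimal sketch of a transactional table as introduced in Hive 0.14. In that version the table has to be bucketed, stored as ORC, and flagged as transactional. The settings shown are existing client-side properties; the metastore also needs the compactor enabled (`hive.compactor.initiator.on`, `hive.compactor.worker.threads`), which is server configuration and not shown here. The table `customers_acid` and its rows are hypothetical.

```sql
-- Client-side settings for Hive 0.14-style transactions.
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Hypothetical transactional table: bucketed, ORC, transactional=true.
CREATE TABLE IF NOT EXISTS customers_acid (
  id   INT,
  name STRING,
  city STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional" = "true");

-- Inserts, updates and deletes on the same table.
INSERT INTO TABLE customers_acid VALUES
  (1, 'Alice', 'Utrecht'),
  (2, 'Bob', 'Amsterdam');

UPDATE customers_acid SET city = 'Rotterdam' WHERE id = 2;

DELETE FROM customers_acid WHERE id = 1;
```

The bucketing column (`id` here) cannot be changed by an `UPDATE`, which is why the example updates `city` instead.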
13
Apache Hive: Modern Architecture
Storage
Columnar Storage: ORCFile, Parquet
Unstructured Data: JSON, CSV, Text, Avro, Custom, Weblog
Engine
SQL Engines: Row Engine, Vector Engine
SQL Support: SQL:2011, Optimizer, HCatalog, HiveServer2
Cache
Block Cache: Linux Cache
Vector Cache: LLAP
Distributed Execution
Hadoop 1: MapReduce
Hadoop 2: Tez, Spark
Persistent Server
Legend: Historical, Current, In Development
14
Questions?
15
Links
Microsoft Big Data: http://www.microsoft.com/bigdata
Hortonworks: http://www.hortonworks.com
Try it yourself via Windows Azure HDInsight: http://azure.com/hdinsight
16
Useful resources
http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final/
http://hortonworks.com/blog/stinger-next-enterprise-sql-hadoop-scale-apache-hive/
http://hortonworks.com/labs/stinger/
http://hortonworks.com/blog/100x-faster-hive/
http://www.slideshare.net/hugfrance/recent-enhancements-to-apache-hive-query-performance?qid=2cd74ce1-e863-436c-a1ab-52a513c61a27&v=default&b=&from_search=10
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
http://www.slideshare.net/oom65/orc-andvectorizationhadoopsummit
http://hortonworks.com/blog/microsofts-contributions-to-the-stinger-initiative-and-apache-hive/