coordinating the many tools of big data - apache hcatalog, apache pig and apache hive. alan gates at...

Coordinating the Many Tools of Big Data Page 1 Alan F. Gates @alanfgates Big Data Spain 2012 http://www.bigdataspain.org/

Post on 24-Dec-2014

DESCRIPTION

Session presented at the Big Data Spain 2012 Conference, 16th Nov 2012, ETSI Telecomunicacion UPM, Madrid. www.bigdataspain.org

More info: http://www.bigdataspain.org/es-2012/conference/coordinating-many-tools-of-big-data/alan-gates

TRANSCRIPT

Page 1: Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apache Hive. ALAN GATES at Big Data Spain 2012

Coordinating the Many Tools of Big Data

Page 1

Alan F. Gates

@alanfgates

Big Data Spain 2012

http://www.bigdataspain.org/

© Hortonworks 2012

Big Data = Terabytes, Petabytes, …

Page 2

Image Credit: Gizmodo

But It Is Also Complex Algorithms

Page 3

• An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs (user defined functions) in Pig. This equation uses stochastic gradient descent to do machine learning across their data:

$$w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla \ell\big(f(x; w^{(t)}), y\big)$$
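
As a rough illustration of the update rule above, here is a minimal Python sketch of stochastic gradient descent. The logistic model, the constant step size, the toy data, and names such as `sgd_step` are assumptions made for this example; the slide does not say which model or loss Twitter used, and this is not their Pig UDF code.

```python
import math
import random

def predict(w, x):
    """f(x; w): logistic prediction for feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, x, y, gamma):
    """One update: new w = w - gamma * gradient of the loss at (x, y)."""
    error = predict(w, x) - y  # gradient of the logistic loss w.r.t. z
    return [wi - gamma * error * xi for wi, xi in zip(w, x)]

def train(examples, dims, gamma=0.5, steps=200, seed=0):
    """Repeat the update on one randomly chosen example per step."""
    rng = random.Random(seed)
    w = [0.0] * dims
    for _ in range(steps):
        x, y = rng.choice(examples)
        w = sgd_step(w, x, y, gamma)
    return w

# Toy data (made up): first feature is a bias term, and the label is 1
# when the second feature is positive.
examples = [([1.0, 2.0], 1), ([1.0, 1.5], 1), ([1.0, -1.0], 0), ([1.0, -2.5], 0)]
w = train(examples, dims=2)
```

Each step touches a single example, which is what makes the method attractive when the full data set is too large to scan for every update.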

Pre-Cloud: One Tool per Machine

Page 4

• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (SAS).

[Diagram: Data Warehouse, Statistical Analysis, Cube/MOLAP, OLTP, Data Mart]

Cloud: Many Tools One Platform

Page 5

• Users no longer want to be concerned with what platform their data is in – just apply the tool to it

• SQL no longer the only or primary data access tool

[Diagram: Data Warehouse, Statistical Analysis, Data Mart, Cube/MOLAP, OLTP]

Upside - Pick the Right Tool for the Job

Page 6

Downside – Tools Don’t Play Well Together

Page 7

• Hard for users to share data between tools
  – Different storage formats
  – Different data models
  – Different user defined function interfaces

Downside – Wasted Developer Time

Page 8

• Wastes developer time, since each tool supplies the same functionality redundantly

[Diagram: Pig and Hive, each with its own Parser, Optimizer, Physical Planner, and Executor; a Metadata component is also shown]

Downside – Wasted Developer Time

Page 9

[Diagram: the same Pig and Hive components (Parser, Optimizer, Physical Planner, Executor, Metadata), now with a region marked "Overlap"]

Conclusion: We Need Services

Page 10

• We need to find a way to share services where we can.
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense

Hadoop = Distributed Data Operating System

Page 11

Service | Hadoop Component | Single Node Analogue
Table Management | HCatalog | RDBMS
User access control | Hadoop | /etc/passwd, file system permissions, etc.
Resource management | YARN | Process management
Notification | HCatalog | Signals, semaphores, mutexes
REST/Connectors | HCatalog, Hive, HBase, Oozie | Network layer
Batch data processing | Data Virtual Machine | JVM

Legend: Exists / Pieces exist in this component / To be built

HCatalog – Table Management

Page 13

• Opens up Hive’s tables to other tools inside and outside Hadoop
• Presents tools with a table paradigm that abstracts away storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access

Data Access Without HCatalog

Page 14

[Diagram: Metastore and HDFS at the base. Hive uses the Metastore Client, InputFormat/OutputFormat, and SerDe; MapReduce uses InputFormat/OutputFormat; Pig uses Load/Store]

Data & Metadata Access With HCatalog

Page 15

[Diagram: Metastore and HDFS at the base. Hive uses the Metastore Client, InputFormat/OutputFormat, and SerDe; MapReduce uses HCatInputFormat/HCatOutputFormat; Pig uses HCatLoader/HCatStorer; an External System connects through REST]

Without HCatalog

Page 16

Feature | MapReduce | Pig | Hive
Record format | Key value pairs | Tuple | Record
Data model | User defined | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists
Schema | Encoded in app | Declared in script or read by loader | Read from metadata
Data location | Encoded in app | Declared in script | Read from metadata
Data format | Encoded in app | Declared in script | Read from metadata

With HCatalog

Page 17

Feature | MapReduce + HCatalog | Pig + HCatalog | Hive
Record format | Record | Tuple | Record
Data model | int, float, string, maps, structs, lists | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists
Schema | Read from metadata | Read from metadata | Read from metadata
Data location | Read from metadata | Read from metadata | Read from metadata
Data format | Read from metadata | Read from metadata | Read from metadata
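
To make the "Read from metadata" entries concrete, the sketch below models a tiny in-process table catalog in Python. Everything in it (`TableCatalog`, `TableInfo`, the sample table and its path) is hypothetical and is not the HCatalog API. It only illustrates the idea that each tool asks one shared service for schema, location, and format instead of encoding them in every application or script.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableInfo:
    schema: tuple      # (column name, type) pairs
    location: str      # where the data lives
    data_format: str   # how the data is stored

class TableCatalog:
    """Hypothetical stand-in for a shared table-management service."""

    def __init__(self):
        self._tables = {}

    def register(self, name, info):
        self._tables[name] = info

    def lookup(self, name):
        return self._tables[name]

def columns_for_tool(catalog, table):
    """Any tool can call this instead of hard-coding the schema."""
    return [column for column, _ in catalog.lookup(table).schema]

catalog = TableCatalog()
catalog.register(
    "clicks",  # made-up table name
    TableInfo(
        schema=(("user", "string"), ("url", "string"), ("ts", "int")),
        location="/data/clicks",  # made-up path
        data_format="rcfile",
    ),
)

# Two "tools" see the same columns without either declaring them locally.
pig_view = columns_for_tool(catalog, "clicks")
mapreduce_view = columns_for_tool(catalog, "clicks")
```

If the table's location or format changes, only the catalog entry changes; none of the callers need to be edited.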

YARN – Resource Manager

Page 18

• Hadoop 1.0: HDFS plus MapReduce
• Hadoop 2.0: HDFS plus YARN Resource Manager, an interface for developers to write parallel applications on top of the Hadoop cluster
• The Resource Manager provides:
  – applications a way to request resources in the cluster
  – allocation and scheduling of machine resources to the applications
• MapReduce is now an application provided inside YARN
• Other systems have been ported to YARN, such as Spark (a cluster computing system that focuses on in-memory operations) and Storm (streaming computations)
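
The two Resource Manager duties listed above (accepting resource requests, then allocating and scheduling) can be sketched with a toy Python model. This is not the YARN API: `ToyResourceManager`, the notion of a fixed pool of identical containers, and the first-in, first-out policy are all assumptions chosen to keep the example short.

```python
from collections import deque

class ToyResourceManager:
    """Toy model of request handling and scheduling; not the YARN API."""

    def __init__(self, total_containers):
        self.free = total_containers
        self.pending = deque()   # requests waiting for capacity
        self.allocated = {}      # application name -> containers held

    def request(self, app, containers):
        """Applications ask for resources; requests queue up in order."""
        self.pending.append((app, containers))
        self._schedule()

    def release(self, app):
        """An application finishes and returns its containers."""
        self.free += self.allocated.pop(app, 0)
        self._schedule()

    def _schedule(self):
        """Grant queued requests first-in, first-out while capacity allows."""
        while self.pending and self.pending[0][1] <= self.free:
            app, containers = self.pending.popleft()
            self.free -= containers
            self.allocated[app] = self.allocated.get(app, 0) + containers

rm = ToyResourceManager(total_containers=10)
rm.request("mapreduce-job", 6)
rm.request("spark-job", 6)     # must wait: only 4 containers are free
waiting_before = len(rm.pending)
rm.release("mapreduce-job")    # frees capacity, so the queued request is granted
```

The point of the sketch is that the manager knows nothing about what the applications compute, which is why MapReduce and other engines can share one cluster.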

Architectural Comparison

Page 19

[Diagram: Hadoop 1.0 and Hadoop 2.0 architectures side by side]

Data Virtual Machine – Shared Batch Processing

Page 20

• Recall our previous diagram of Pig and Hive

[Diagram: Pig and Hive, each with its own Parser, Optimizer, Physical Planner, and Executor, plus Metadata, with a region marked "Overlap"]

A VM That Provides

Page 21

• Standard operators (equivalent of Java byte codes):
  – Project
  – Select
  – Join
  – Aggregate
  – Sort
  – …
• An optimizer that could
  – Choose appropriate implementation of an operator based on physical data characteristics
  – Dynamically re-optimize the plan based on information gathered executing the plan
• Shared execution layer
  – Can provide its own YARN application master and improve on MapReduce paradigm for batch processing
• Shared User Defined Function (UDF) framework
  – user code works across systems
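
A small Python sketch may help picture the "standard operators" idea: a handful of relational operators that any front end could target. The function names and the list-of-dicts row format are assumptions for illustration; the slides describe a proposal, not an existing API.

```python
from itertools import groupby
from operator import itemgetter

def project(rows, columns):
    """Keep only the named columns."""
    return [{c: row[c] for c in columns} for row in rows]

def select(rows, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def join(left, right, key):
    """Inner join on a single key column, using a hash index on `right`."""
    index = {}
    for rrow in right:
        index.setdefault(rrow[key], []).append(rrow)
    return [{**lrow, **rrow} for lrow in left for rrow in index.get(lrow[key], [])]

def sort(rows, key):
    """Order rows by one column."""
    return sorted(rows, key=itemgetter(key))

def aggregate(rows, key, column, fn):
    """Group by `key` and apply `fn` to the grouped values of `column`."""
    grouped = groupby(sort(rows, key), key=itemgetter(key))
    return [{key: k, column: fn([r[column] for r in grp])} for k, grp in grouped]

# Made-up sample data.
users = [{"uid": 1, "name": "ana"}, {"uid": 2, "name": "luis"}]
visits = [{"uid": 1, "ms": 30}, {"uid": 2, "ms": 5}, {"uid": 1, "ms": 12}]

# A plan expressed only in terms of the shared operators.
plan = join(visits, users, "uid")
plan = select(plan, lambda row: row["ms"] >= 10)
plan = aggregate(plan, "name", "ms", sum)
result = project(sort(plan, "name"), ["name", "ms"])
```

Because the plan is just a sequence of shared operators, a script language and a SQL dialect could both compile to it and reuse one optimizer and one execution layer.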

Taking Advantage of YARN – MR*

Page 22

[Diagram: Map, Reduce, Map, Reduce stages, with HDFS]

Taking Advantage of YARN – MR*

Page 23

[Diagram: Map, Reduce, Map, Reduce stages, with HDFS; callout: "Why do I need these maps?"]

Taking Advantage of YARN – MR*

Page 24

[Diagram: the Map, Reduce, Map, Reduce pipeline with HDFS, beside a Map, Reduce, Reduce pipeline]

• Removed an entire write/read cycle of HDFS
• Still want to checkpoint sometimes

Taking Advantage of YARN – In Memory Data Transfer

Page 25

[Diagram: Map tasks and Reduce tasks]

Taking Advantage of YARN – In Memory Data Transfer

Page 26

[Diagram: Map tasks and Reduce tasks; callout: "These are writes to disk"]

Switching shuffle to in memory instead of on disk
• Better performance
• Data must also be spilled to disk for retry-ability and to handle memory overflow
• Will benefit from stronger guarantees of simultaneous execution
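
The memory-overflow point above can be sketched as a buffer that keeps records in memory and spills to disk once a limit is passed. This is a simplified illustration only: `ShuffleBuffer`, the record-count limit, and the use of `pickle` and temporary files are assumptions, and the sketch does not cover the retry-ability requirement, which would need the data written to disk regardless.

```python
import os
import pickle
import tempfile

class ShuffleBuffer:
    """Keeps records in memory and spills to disk past a record limit."""

    def __init__(self, max_records_in_memory):
        self.max_records = max_records_in_memory
        self.memory = []
        self.spill_files = []

    def add(self, record):
        self.memory.append(record)
        if len(self.memory) > self.max_records:
            self._spill()

    def _spill(self):
        """Write the in-memory records to a temporary file and clear them."""
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as out:
            pickle.dump(self.memory, out)
        self.spill_files.append(path)
        self.memory = []

    def read_all(self):
        """Yield spilled records first, then whatever is still in memory."""
        for path in self.spill_files:
            with open(path, "rb") as src:
                yield from pickle.load(src)
        yield from self.memory

    def close(self):
        """Delete the temporary spill files."""
        for path in self.spill_files:
            os.remove(path)
        self.spill_files = []

buffer = ShuffleBuffer(max_records_in_memory=3)
for i in range(10):
    buffer.add(("key", i))
records = list(buffer.read_all())
spills = len(buffer.spill_files)
buffer.close()
```

The reader sees one ordered stream either way, so the consumer does not need to know whether a given record came from memory or from a spill file.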

On the Fly Optimization

Page 27

• Traditionally databases do all optimization up front based on statistics
  – But often there are no statistics in Hadoop
  – Languages like Pig Latin allow very long series of operations that make up-front estimates unrealistic
• Observation: as the system operates on the data it can gather basic statistics and change the subsequent operators based on this information

[Diagram: MR Job, MR Job, Hash Join]

On the Fly Optimization

Page 28

[Diagram: MR Job, MR Job, Hash Join; annotation: "Output fits in memory"]

On the Fly Optimization

Page 29

[Diagram: two plans: MR Job, MR Job, Hash Join; and MR Job, MR Job, Map-side Join, with the annotation "Load into distributed cache"]
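
The decision these slides describe, choosing the join only after seeing how large an upstream job's output is, can be sketched in Python as follows. The function names, the row-count threshold, and the two sample jobs are hypothetical; a real system would measure bytes rather than rows and would actually rewrite the remaining plan.

```python
def choose_join(observed_rows, memory_limit_rows):
    """Pick the join strategy after seeing how big the upstream output is."""
    if observed_rows <= memory_limit_rows:
        return "map-side join"   # small enough to load into the distributed cache
    return "hash join"           # keep the originally planned join

def run_plan(upstream_job, memory_limit_rows):
    """Run the upstream job, gather a basic statistic, then plan the join."""
    output = list(upstream_job())   # the statistic is gathered while executing
    strategy = choose_join(len(output), memory_limit_rows)
    return strategy, output

def small_job():
    # Hypothetical upstream job whose filter keeps only a few rows.
    return (n for n in range(1_000) if n % 250 == 0)

def large_job():
    # Hypothetical upstream job that keeps every row.
    return (n for n in range(1_000))

small_strategy, small_output = run_plan(small_job, memory_limit_rows=100)
large_strategy, _ = run_plan(large_job, memory_limit_rows=100)
```

No up-front statistics are needed: the row count comes for free from running the earlier stage, which is exactly the information an up-front optimizer lacks.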

Thank You Big Data Spain

Page 30