coordinating the many tools of big data - apache hcatalog, apache pig and apache hive. alan gates at...

Coordinating the Many Tools of Big Data

Alan F. Gates

@alanfgates

Big Data Spain 2012

http://www.bigdataspain.org/

© Hortonworks 2012

Big Data = Terabytes, Petabytes, …

Image Credit: Gizmodo

But It Is Also Complex Algorithms

• An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs (user defined functions) in Pig. This equation uses stochastic gradient descent to do machine learning across their data:

$$ w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla (f(x; w^{(t)}), y) $$
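As a rough illustration of the update rule above, here is a minimal Python sketch. It is not Twitter's Pig UDF. It assumes a linear model f(x; w) = w · x, a squared-error loss, and a constant step size γ, none of which are stated on the slide.

```python
# Minimal SGD sketch: w(t+1) = w(t) - gamma * gradient of the loss at one example.
# Assumptions (not from the slide): linear model, squared-error loss, constant gamma.

def predict(w, x):
    """Linear model f(x; w) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_step(w, x, y, gamma):
    """One update using the gradient of 0.5 * (f(x; w) - y)^2 with respect to w."""
    error = predict(w, x) - y
    return [wi - gamma * error * xi for wi, xi in zip(w, x)]

def train(examples, gamma=0.1, epochs=200):
    """Run SGD over (x, y) pairs, one example at a time."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            w = sgd_step(w, x, y, gamma)
    return w

# Toy data generated from y = 2*x1 + 1 (the second feature is a constant bias term).
examples = [([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0), ([2.0, 1.0], 5.0)]
weights = train(examples)
```

Because each step touches a single example, the same loop could in principle be spread across many records, which is what makes the technique attractive at this scale.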

Pre-Cloud: One Tool per Machine

• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (SAS).

[Diagram: separate systems for Data Warehouse, Statistical Analysis, Cube/MOLAP, OLTP and Data Mart]


Cloud: Many Tools One Platform

• Users no longer want to be concerned with what platform their data is in – just apply the tool to it

• SQL no longer the only or primary data access tool

[Diagram: Data Warehouse, Statistical Analysis, Data Mart, Cube/MOLAP and OLTP on one platform]


Upside - Pick the Right Tool for the Job

Downside – Tools Don’t Play Well Together

• Hard for users to share data between tools
  – Different storage formats
  – Different data models
  – Different user defined function interfaces

Downside – Wasted Developer Time

• Wastes developer time since each tool supplies redundant functionality

[Diagram: Pig and Hive stacks, each with its own Parser, Optimizer, Physical Planner and Executor, plus Metadata; the duplicated layers are marked "Overlap"]


Conclusion: We Need Services

• We need to find a way to share services where we can.
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense

Hadoop = Distributed Data Operating System

Service | Hadoop Component | Single Node Analogue
Table Management | HCatalog | RDBMS
User access control | Hadoop | /etc/passwd, file system permissions, etc.
Resource management | YARN | Process management
Notification | HCatalog | Signals, semaphores, mutexes
REST/Connectors | HCatalog, Hive, HBase, Oozie | Network layer
Batch data processing | Data Virtual Machine | JVM

[Legend, color-coded on the slide: Exists / Pieces exist in this component / To be built]


HCatalog – Table Management

• Opens up Hive’s tables to other tools inside and outside Hadoop
• Presents tools with a table paradigm that abstracts away storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access

Data Access Without HCatalog

[Diagram: Hive accesses the Metastore through the Metastore Client and HDFS through its SerDe and InputFormat/OutputFormat; MapReduce uses InputFormat/OutputFormat and Pig uses Load/Store]

Data & Metadata Access With HCatalog

[Diagram: Hive accesses the Metastore through the Metastore Client and HDFS through its SerDe and InputFormat/OutputFormat; MapReduce uses HCatInputFormat/HCatOutputFormat and Pig uses HCatLoader/HCatStorer; an External System connects via REST]

Without HCatalog

Feature | MapReduce | Pig | Hive
Record format | Key value pairs | Tuple | Record
Data model | User defined | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists
Schema | Encoded in app | Declared in script or read by loader | Read from metadata
Data location | Encoded in app | Declared in script | Read from metadata
Data format | Encoded in app | Declared in script | Read from metadata

With HCatalog

Feature | MapReduce + HCatalog | Pig + HCatalog | Hive
Record format | Record | Tuple | Record
Data model | int, float, string, maps, structs, lists | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists
Schema | Read from metadata | Read from metadata | Read from metadata
Data location | Read from metadata | Read from metadata | Read from metadata
Data format | Read from metadata | Read from metadata | Read from metadata
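
The effect of "read from metadata" can be sketched with a toy catalog. This is a hypothetical Python model, not the HCatalog API: the names `TableInfo`, `Catalog` and `read_table` are made up for illustration, and an in-memory CSV string stands in for files on HDFS.

```python
# Toy model of a shared table catalog (hypothetical names; NOT the HCatalog API).
# Every tool asks the catalog for schema, location and format instead of
# hard-coding them in the application or script.
import csv
import io
from dataclasses import dataclass

@dataclass
class TableInfo:
    schema: list      # column names and types, e.g. [("user", str), ("clicks", int)]
    location: str     # where the data lives (a key into STORAGE here)
    data_format: str  # how the data is encoded

# Stand-in for a file system: location -> raw file contents.
STORAGE = {"/warehouse/clicks": "alice,3\nbob,5\n"}

class Catalog:
    def __init__(self):
        self._tables = {}

    def register(self, name, info):
        self._tables[name] = info

    def lookup(self, name):
        return self._tables[name]

def read_table(catalog, name):
    """Return records as dicts, using only metadata from the catalog."""
    info = catalog.lookup(name)
    if info.data_format != "csv":
        raise ValueError(f"unsupported format: {info.data_format}")
    rows = csv.reader(io.StringIO(STORAGE[info.location]))
    return [
        {col: cast(value) for (col, cast), value in zip(info.schema, row)}
        for row in rows
    ]

catalog = Catalog()
catalog.register(
    "clicks",
    TableInfo(schema=[("user", str), ("clicks", int)],
              location="/warehouse/clicks",
              data_format="csv"),
)

# Any "tool" that calls read_table shares one code path and one data model.
records = read_table(catalog, "clicks")
total_clicks = sum(r["clicks"] for r in records)
```

The caller names only the table; moving the data or changing its format would mean updating the catalog entry rather than every script that reads it.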

YARN – Resource Manager

• Hadoop 1.0: HDFS plus MapReduce
• Hadoop 2.0: HDFS plus YARN Resource Manager, an interface for developers to write parallel applications on top of the Hadoop cluster
• The Resource Manager provides:
  – applications a way to request resources in the cluster
  – allocation and scheduling of machine resources to the applications
• MapReduce is now an application provided inside YARN
• Other systems have been ported to YARN such as Spark (cluster computing system that focuses on in memory operations) and Storm (streaming computations)
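
The request and allocation relationship described above can be sketched as a toy model. This is not the YARN API: `ResourceManager`, `request` and `release` are hypothetical names, and resources are reduced to a single memory figure in MB.

```python
# Toy resource manager (hypothetical; NOT the YARN API).
# Applications request resources; the manager allocates them on a node with
# enough free capacity and refuses requests the cluster cannot satisfy.

class ResourceManager:
    def __init__(self, node_capacity_mb):
        # node name -> free memory in MB
        self.free = dict(node_capacity_mb)
        self.allocations = []  # (app, node, mb) triples

    def request(self, app, mb):
        """Allocate mb on the node with the most free memory, or return None."""
        node = max(self.free, key=self.free.get)
        if self.free[node] < mb:
            return None
        self.free[node] -= mb
        self.allocations.append((app, node, mb))
        return node

    def release(self, app):
        """Return all resources held by an application to the cluster."""
        for held_app, node, mb in self.allocations:
            if held_app == app:
                self.free[node] += mb
        self.allocations = [a for a in self.allocations if a[0] != app]

rm = ResourceManager({"node1": 4096, "node2": 2048})
first = rm.request("mapreduce-job", 3072)    # fits on node1
second = rm.request("storm-topology", 2048)  # node2 now has the most free memory
third = rm.request("spark-job", 2048)        # no node has enough memory left
rm.release("mapreduce-job")
fourth = rm.request("spark-job", 2048)       # succeeds after the release
```

The manager knows nothing about what each application computes, which is the property that lets MapReduce, Spark and Storm share one cluster.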

Architectural Comparison

[Diagram: architecture of Hadoop 1.0 side by side with Hadoop 2.0]

Data Virtual Machine – Shared Batch Processing

• Recall our previous diagram of Pig and Hive

[Diagram: Pig and Hive stacks, each with its own Parser, Optimizer, Physical Planner and Executor, plus Metadata; the duplicated layers are marked "Overlap"]

A VM That Provides

• Standard operators (equivalent of Java byte codes):
  – Project
  – Select
  – Join
  – Aggregate
  – Sort
  – …
• An optimizer that could
  – Choose appropriate implementation of an operator based on physical data characteristics
  – Dynamically re-optimize the plan based on information gathered executing the plan
• Shared execution layer
  – Can provide its own YARN application master and improve on MapReduce paradigm for batch processing
• Shared User Defined Function (UDF) framework
  – user code works across systems
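
To make the idea of shared standard operators concrete, here is a small Python sketch in which each operator is a plain function over lists of records. The function names and sample data are illustrative assumptions; the slide does not specify an operator API.

```python
# Toy relational operators over lists of dict records (illustrative only).
from collections import defaultdict

def project(rows, columns):
    return [{c: r[c] for c in columns} for r in rows]

def select(rows, predicate):
    return [r for r in rows if predicate(r)]

def join(left, right, key):
    """Inner hash join on a shared key column."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def aggregate(rows, key, column):
    """Sum `column` per distinct value of `key`."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[key]] += r[column]
    return [{key: k, column: v} for k, v in totals.items()]

def sort(rows, column):
    return sorted(rows, key=lambda r: r[column])

users = [{"uid": 1, "country": "ES"}, {"uid": 2, "country": "US"}]
clicks = [{"uid": 1, "n": 3}, {"uid": 2, "n": 5}, {"uid": 1, "n": 4}]

# A plan is just a composition of operators; any front end could emit it.
plan = sort(
    aggregate(join(select(clicks, lambda r: r["n"] > 3), users, "uid"), "country", "n"),
    "n",
)
```

A Pig Latin script and a Hive query that both compile down to a composition like `plan` could then share one optimizer and one execution layer.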


[Diagram: two chained Map/Reduce jobs with HDFS between them, compared with a single job of Map, Reduce, Reduce]

• Removed an entire write/read cycle of HDFS
• Still want to checkpoint sometimes

Taking Advantage of YARN – In Memory Data Transfer

[Diagram: Map tasks feeding Reduce tasks, with the transfers between them labeled "These are writes to disk"]

Switching shuffle to in memory instead of on disk
• Better performance
• Data must also be spilled to disk for retry-ability and to handle memory overflow
• Will benefit from stronger guarantees of simultaneous execution
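
The trade-off above can be sketched with a shuffle buffer that keeps records in memory and spills to disk only on overflow. This is a hypothetical illustration, not Hadoop's shuffle implementation; `ShuffleBuffer` and its record-count limit are assumptions.

```python
# Toy shuffle buffer: in memory first, spill to disk on overflow (illustrative only).
import json
import os
import tempfile

class ShuffleBuffer:
    def __init__(self, max_records_in_memory):
        self.max_records = max_records_in_memory
        self.memory = []
        self.spill_files = []

    def add(self, key, value):
        self.memory.append((key, value))
        if len(self.memory) > self.max_records:
            self._spill()

    def _spill(self):
        """Write the in-memory records to a temporary file and clear the buffer."""
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "w") as f:
            for key, value in self.memory:
                f.write(json.dumps([key, value]) + "\n")
        self.spill_files.append(path)
        self.memory = []

    def records(self):
        """Yield spilled records first, then whatever is still in memory."""
        for path in self.spill_files:
            with open(path) as f:
                for line in f:
                    key, value = json.loads(line)
                    yield key, value
        yield from self.memory

    def cleanup(self):
        for path in self.spill_files:
            os.remove(path)
        self.spill_files = []

buf = ShuffleBuffer(max_records_in_memory=2)
for i, word in enumerate(["a", "b", "c", "d"]):
    buf.add(word, i)
shuffled = list(buf.records())
spill_count = len(buf.spill_files)
buf.cleanup()
```

When everything fits in memory no file is ever written, which is where the performance gain comes from; the spill path exists only for overflow. Retry-ability would additionally require persisting data that did fit in memory, which this sketch does not attempt.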

On the Fly Optimization

• Traditionally databases do all optimization up front based on statistics
  – But often there are not statistics in Hadoop
  – Languages like Pig Latin allow very long series of operations that make up front estimates unrealistic

• Observation: as the system operates on the data it can gather basic statistics and change the subsequent operators based on this information

[Diagram: MR Job, MR Job, Hash Join]
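
The observation above can be sketched as a join that picks its strategy only after it has seen the size of its inputs. This is an illustrative Python sketch; the threshold, the function names, and the choice of a sort-merge join as the fallback are assumptions, not something the slide specifies.

```python
# Toy runtime re-optimization: choose the join algorithm from observed row counts.

def hash_join(small, large, key):
    """Build a hash table on the small input and probe it with the large one."""
    table = {}
    for row in small:
        table.setdefault(row[key], []).append(row)
    return [{**s, **l} for l in large for s in table.get(l[key], [])]

def merge_join(left, right, key):
    """Sort both inputs and merge them; needs no in-memory hash table."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            # Emit all pairs for this key value.
            k = j
            while k < len(right) and right[k][key] == left[i][key]:
                out.append({**left[i], **right[k]})
                k += 1
            i += 1
    return out

def adaptive_join(left, right, key, hash_limit=1000):
    """Pick a strategy using row counts observed after the inputs were produced."""
    small, large = sorted((left, right), key=len)
    if len(small) <= hash_limit:
        return "hash", hash_join(small, large, key)
    return "merge", merge_join(left, right, key)

dims = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
facts = [{"id": 2, "v": 10}, {"id": 1, "v": 20}, {"id": 2, "v": 30}]

strategy, joined = adaptive_join(dims, facts, "id")
forced_strategy, merged = adaptive_join(dims, facts, "id", hash_limit=1)
```

Both strategies return the same rows; only the cost differs, so the decision can safely be deferred until the earlier jobs have reported how much data they produced.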
