strata + hadoop world 2012: data science on hadoop: how cloudera impala unlocks new productivity and...

Data Science on Hadoop: How Cloudera Impala Unlocks New Productivity and Insights
Justin Erickson | Product Manager
Marcel Kornacker | Software Engineer
Ravikumar Visweswara | Software Engineer
October 2012

Why Data Scientists Love Hadoop

• SCALABILITY
  • Massive volumes of data

• FLEXIBILITY
  • Data preparation & analytics in 1 environment
  • Highly flexible environment for creating & testing machine learning models

• LOW COST
  • 10% of the cost/TB under management

Already query Hadoop using Hive

Already load data into CDH every 90 mins or less

Already use HBase for real-time data access

67% 51% 54%

Source: Cloudera customer survey August 2012

Hadoop Use Cases Moving to Real-Time

Need faster queries on Hadoop data

Move data from Hadoop to RDBMS for interactive SQL

See value today in consolidating to a single platform

78% 71% 62%

Source: Cloudera customer survey August 2012

But Hadoop Isn’t Fast Enough

Beyond Batch – The Next Stage for Hadoop

HADOOP TODAY IS TOO SLOW

MapReduce is batch
Simple queries can take minutes or tens of minutes

CURRENT DATA MANAGEMENT IS TOO COMPLEX

Optimized for rigid schemas & special-purpose applications
Redundant data storage & processes
Very expensive systems: $20K-150K / TB

Cloudera Enterprise RTQ
Real-Time Query for Data Stored in Hadoop. Powered by Cloudera Impala.

FAMILIAR Supports Hive SQL

FAST 4-30X faster than Hive over MapReduce

INTEGRATED Uses existing drivers, integrates with existing metastore, works with leading BI tools

100% OPEN SOURCE Flexible, cost-effective, no lock-in

EASY TO USE Deploy & operate with Cloudera Manager

FLEXIBLE Supports multiple storage engines & file formats

Cloudera Now Powered by Impala

• With Impala:
  • Real-time SQL queries
  • Native distributed query engine
  • Optimized for low latency

• Provides:
  • Answers as fast as you can ask
  • The ability for everyone to ask questions of all data
  • Big data storage and analytics together

• Unified Storage:
  • Supports HDFS and HBase
  • Flexible file formats

• Unified Metastore
• Unified Security
• Unified Client Interfaces:
  • ODBC, SQL syntax, Hue Beeswax

[Diagram: platform stack BEFORE IMPALA and WITH IMPALA. Layer labels: USER INTERFACE, META DATA, BATCH PROCESSING, REAL-TIME ACCESS (IMPALA), STORAGE]

Cloudera Impala Details

[Architecture diagram, repeated across several build slides. Three nodes are shown; each runs a Query Planner, Query Coordinator, and Query Exec Engine alongside an HDFS DN and HBase. A SQL App connects via ODBC. Shared services: Hive Metastore, HDFS NN, YARN, State Store.]

Callouts, in slide order:

• SQL Request: Common Hive SQL and interface
• Unified metadata and scheduler
• Fully MPP Distributed
• Local Direct Reads
• In Memory Transfers
• SQL Results
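The request flow in the diagram (a coordinator fans work out to an exec engine on every node, each engine reads its local data, and partial results are merged in memory) can be illustrated with a small sketch. This is not Impala code: `NODE_DATA`, `execute_fragment`, and `coordinate` are hypothetical names, the "nodes" are plain Python lists, and threads stand in for separate machines. It only mirrors the scatter-gather idea behind a query such as `SELECT country, SUM(clicks) ... GROUP BY country`.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for rows stored on each node's local HDFS DataNode.
NODE_DATA = {
    "node1": [("US", 3), ("DE", 1)],
    "node2": [("US", 2), ("FR", 5)],
    "node3": [("DE", 4), ("FR", 1)],
}

def execute_fragment(node):
    """Exec-engine role: aggregate only the rows local to this node."""
    partial = Counter()
    for country, clicks in NODE_DATA[node]:
        partial[country] += clicks
    return partial

def coordinate(nodes):
    """Coordinator role: send the fragment to every node, then merge partials."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(execute_fragment, nodes))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)

result = coordinate(sorted(NODE_DATA))
```

Each node returns a small partial aggregate rather than its raw rows, which is the reason this pattern keeps network traffic low compared with shipping all data to one machine.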

Advantages of Our Approach

• No high-latency MapReduce batch processing
• Local processing avoids network bottlenecks
• No costly data format conversion overhead
• All data immediately query-able
• Single machine pool to scale
• All machines available to both Impala and MapReduce
• Single, open, and unified metadata and scheduler

[Diagram comparing approaches, with labels: MapReduce, Side Storage, Remote Query. Components shown: Query Node, Query Engine, NN, DN, MR, Hive, HDFS]
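The point that local processing avoids network bottlenecks depends on scheduling each scan on a machine that already holds the data. A minimal sketch of that idea follows, assuming a made-up block-to-replica map; `BLOCK_REPLICAS` and `assign_local_scans` are hypothetical names, and the greedy least-loaded rule is one simple choice for illustration, not a description of Impala's actual scheduler.

```python
from collections import defaultdict

# Hypothetical block-to-replica map, similar in spirit to what an
# HDFS NameNode reports about where each block is stored.
BLOCK_REPLICAS = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node2", "node3"],
    "blk_3": ["node1", "node3"],
    "blk_4": ["node2", "node3"],
}

def assign_local_scans(block_replicas):
    """Assign each block to the least-loaded node that stores a replica of it."""
    load = defaultdict(int)
    assignment = {}
    for block in sorted(block_replicas):
        # Only nodes holding a replica are candidates, so every read is local.
        # Ties on load are broken by node name to keep the result deterministic.
        node = min(block_replicas[block], key=lambda n: (load[n], n))
        assignment[block] = node
        load[node] += 1
    return assignment

assignment = assign_local_scans(BLOCK_REPLICAS)
```

Because every block is scanned where a replica lives, no block has to cross the network before it is processed; only the (usually much smaller) intermediate results do.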

Cloudera Impala Demo

Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop

SPEED TO INSIGHT
• Get answers as fast as you can ask questions
• Interactive analytics directly on source data
• No jumping between data silos

COST SAVINGS
• Reduce duplicate storage with EDW
• Reduce data movement for interactive analysis
• Leverage existing tools and employee skills

FULL FIDELITY ANALYSIS
• Ask questions of all your data
• No information loss from aggregation or conforming to relational schemas for analysis

DISCOVERABILITY
• Single metadata store from origination through analysis
• No need to hunt through multiple data silos

Cloudera powers real-time data hub

The Challenge:
• Needs to understand 2 years of clickstream data for greater insight
• Legacy system cannot scale for data processing and analytics

The Solution:
• Cloudera Enterprise – 4 Petabytes
• One single scalable platform for big data for archive, ETL & analytics with real-time BI
• Running Impala

So Expedia can optimize end-user data-driven search results and maximize Google AdWords spend.

Validated Beta Partners
