Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unlocks New Productivity and...

Upload: cloudera-inc

Post on 20-Aug-2015

TRANSCRIPT

Page 1: Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unlocks New Productivity and Insights
Page 2:

Data Science on Hadoop: How Cloudera Impala Unlocks New Productivity and Insights
Justin Erickson | Product Manager
Marcel Kornacker | Software Engineer
Ravikumar Visweswara | Software Engineer
October 2012

Page 3:

Why Data Scientists Love Hadoop

• SCALABILITY
  • Massive volumes of data
• FLEXIBILITY
  • Data preparation & analytics in 1 environment
  • Highly flexible environment for creating & testing machine learning models
• LOW COST
  • 10% the cost/TB under management

Page 4:

Hadoop Use Cases Moving to Real-Time

• 67%: Already query Hadoop using Hive
• 51%: Already load data into CDH every 90 mins or less
• 54%: Already use HBase for real-time data access

Source: Cloudera customer survey August 2012

Page 5:

But Hadoop Isn’t Fast Enough

• 78%: Need faster queries on Hadoop data
• 71%: Move data from Hadoop to RDBMS for interactive SQL
• 62%: See value today in consolidating to a single platform

Source: Cloudera customer survey August 2012

Page 6:

Beyond Batch – The Next Stage for Hadoop

HADOOP TODAY IS TOO SLOW
• MapReduce is batch
• Simple queries can take minutes / tens of minutes

CURRENT DATA MANAGEMENT IS TOO COMPLEX
• Optimized for rigid schemas & special purpose applications
• Redundant data storage & processes
• Very expensive systems: $20K-150K / TB

Page 7:

Cloudera Enterprise RTQ
Real-Time Query for Data Stored in Hadoop
Powered by Cloudera Impala.

• FAMILIAR: Supports Hive SQL
• FAST: 4-30X faster than Hive over MapReduce
• INTEGRATED: Uses existing drivers, integrates with existing metastore, works with leading BI tools
• 100% OPEN SOURCE: Flexible, cost-effective, no lock-in
• EASY TO USE: Deploy & operate with Cloudera Manager
• FLEXIBLE: Supports multiple storage engines & file formats
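
To make the "4-30X faster than Hive over MapReduce" claim concrete, the sketch below applies that range to a query time. The function name and the 10-minute starting point are hypothetical, chosen only because the deck says simple queries can take minutes or tens of minutes. It is plain arithmetic on the slide's numbers, not a benchmark.

```python
# Illustrative arithmetic only: applies the slide's claimed 4-30X range
# to a hypothetical Hive-over-MapReduce query time.


def impala_time_range(hive_seconds, min_speedup=4.0, max_speedup=30.0):
    """Return (best_case, worst_case) runtimes in seconds for the claimed range."""
    if hive_seconds < 0:
        raise ValueError("hive_seconds must be non-negative")
    best = hive_seconds / max_speedup   # largest claimed speedup
    worst = hive_seconds / min_speedup  # smallest claimed speedup
    return best, worst


if __name__ == "__main__":
    # Hypothetical 10-minute Hive query (600 seconds).
    best, worst = impala_time_range(600)
    print(f"{best:.0f} s to {worst:.0f} s")  # prints "20 s to 150 s"
```

Under these assumptions, a query that takes ten minutes in Hive would land somewhere between 20 seconds and two and a half minutes.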

Page 8:

Cloudera Now Powered by Impala

• With Impala:
  • Real-time SQL queries
  • Native distributed query engine
  • Optimized for low latency
• Provides:
  • Answers as fast as you can ask
  • The ability for everyone to ask questions of all data
  • Big data storage and analytics together
• Unified Storage:
  • Supports HDFS and HBase
  • Flexible file formats
• Unified Metastore
• Unified Security
• Unified Client Interfaces:
  • ODBC, SQL syntax, Hue Beeswax

[Diagram: BEFORE IMPALA and WITH IMPALA platform stacks, with layers labeled STORAGE, META DATA, BATCH PROCESSING, REAL-TIME ACCESS (IMPALA), and USER INTERFACE]

Page 9:

Cloudera Impala Details

[Architecture diagram: SQL App and ODBC; three nodes, each with a Query Planner, Query Coordinator, and Query Exec Engine alongside an HDFS DN and HBase; HDFS NN, Hive Metastore, YARN, and State Store]

• Common Hive SQL and interface
• Unified metadata and scheduler
• Fully MPP Distributed
• Local Direct Reads

Page 10:

Cloudera Impala Details

[Architecture diagram as on Page 9, highlighting: Common Hive SQL and interface; SQL Request]

Page 11:

Cloudera Impala Details

[Architecture diagram as on Page 9, highlighting: Unified metadata and scheduler]

Page 12:

Cloudera Impala Details

[Architecture diagram as on Page 9, highlighting: Fully MPP Distributed]

Page 13:

Cloudera Impala Details

[Architecture diagram as on Page 9, highlighting: Local Direct Reads]

Page 14:

Cloudera Impala Details

[Architecture diagram as on Page 9, highlighting: SQL Results; In Memory Transfers]
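
The flow these diagram slides walk through (a SQL request arrives, work is spread across the nodes, each node processes its own data, and results are passed back in memory) resembles a scatter-gather pattern. The sketch below is a toy illustration of that pattern in Python and is not Impala's implementation. The names `NODE_DATA`, `exec_fragment`, and `coordinate` are hypothetical, as is the sample data, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds a local slice of a clicks table
# as (country, clicks) rows. Names and values are made up for illustration.
NODE_DATA = {
    "node1": [("US", 3), ("DE", 1)],
    "node2": [("US", 2), ("FR", 4)],
    "node3": [("DE", 5), ("FR", 1)],
}


def exec_fragment(rows):
    """Stand-in for a per-node exec engine: aggregate only the local rows."""
    partial = {}
    for country, clicks in rows:
        partial[country] = partial.get(country, 0) + clicks
    return partial


def coordinate(node_data):
    """Stand-in for a coordinator: fan out fragments, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partials = list(pool.map(exec_fragment, node_data.values()))
    merged = {}
    for partial in partials:
        for country, clicks in partial.items():
            merged[country] = merged.get(country, 0) + clicks
    return dict(sorted(merged.items()))


if __name__ == "__main__":
    # Roughly: SELECT country, SUM(clicks) FROM clicks GROUP BY country
    print(coordinate(NODE_DATA))  # prints {'DE': 6, 'FR': 5, 'US': 5}
```

Each node returns only a small partial aggregate instead of its raw rows, which is the part of the pattern that keeps the merge step cheap.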

Page 15:

Advantages of Our Approach

• No high-latency MapReduce batch processing
• Local processing avoids network bottlenecks
• No costly data format conversion overhead
• All data immediately query-able
• Single machine pool to scale
• All machines available to both Impala and MapReduce
• Single, open, and unified metadata and scheduler

[Diagram comparing approaches labeled MapReduce, Side Storage, and Remote Query; component labels: Query Node, Query Engine, NN, DN, MR, Hive, HDFS]
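
One way to picture "local processing avoids network bottlenecks" is a planner that hands each data block to a query node that already stores a replica of it. The sketch below is a simplified illustration under that assumption, not Impala's scheduler. The function `assign_scans`, the block ids, and the host names are all hypothetical.

```python
def assign_scans(block_replicas, query_nodes):
    """Assign each block to a query node, preferring a node with a local replica.

    block_replicas: dict mapping block id -> list of hosts holding a replica.
    query_nodes: list of hosts that run a query engine.
    Returns a dict mapping host -> list of (block_id, is_local) tuples.
    """
    if not query_nodes:
        raise ValueError("need at least one query node")
    plan = {host: [] for host in query_nodes}
    for block_id in sorted(block_replicas):
        # Hosts that both store this block and run a query engine.
        local = [h for h in block_replicas[block_id] if h in plan]
        # Fall back to a remote read only when no such host exists.
        candidates = local or query_nodes
        # Pick the least-loaded candidate; ties go to the first in list order.
        host = min(candidates, key=lambda h: len(plan[h]))
        plan[host].append((block_id, bool(local)))
    return plan


if __name__ == "__main__":
    # Hypothetical layout: "n9" stores data but runs no query engine.
    blocks = {
        "b1": ["n1", "n2"],
        "b2": ["n1", "n3"],
        "b3": ["n2", "n3"],
        "b4": ["n9"],
    }
    for host, scans in assign_scans(blocks, ["n1", "n2", "n3"]).items():
        print(host, scans)
```

In this toy layout only block `b4` needs a remote read, because the single host that stores it runs no query engine.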

Page 16:

Cloudera Impala Demo

Page 17:

Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop

SPEED TO INSIGHT
• Get answers as fast as you can ask questions
• Interactive analytics directly on source data
• No jumping between data silos

COST SAVINGS
• Reduce duplicate storage with EDW
• Reduce data movement for interactive analysis
• Leverage existing tools and employee skills

FULL FIDELITY ANALYSIS
• Ask questions of all your data
• No information loss from aggregation or conforming to relational schemas for analysis

DISCOVERABILITY
• Single metadata store from origination through analysis
• No need to hunt through multiple data silos

Page 18:

Cloudera powers real-time data hub

The Challenge:
• Needs to understand 2 years of clickstream data for greater insight
• Legacy system cannot scale for data processing and analytics

The Solution:
• Cloudera Enterprise – 4 Petabytes
• One single scalable platform for big data for archive, ETL & analytics with real-time BI
• Running Impala

So Expedia can optimize end user data-driven search results and maximize Google AdWords spend.

Page 19:

Validated Beta Partners

Page 20: