Data Science on Hadoop: How Cloudera Impala Unlocks New Productivity and Insights
Justin Erickson | Product Manager
Marcel Kornacker | Software Engineer
Ravikumar Visweswara | Software Engineer
October 2012
Why Data Scientists Love Hadoop
• SCALABILITY
  • Massive volumes of data
• FLEXIBILITY
  • Data preparation & analytics in 1 environment
  • Highly flexible environment for creating & testing machine learning models
• LOW COST
  • 10% the cost/TB under management
Hadoop Use Cases Moving to Real-Time
• 67%: Already query Hadoop using Hive
• 51%: Already load data into CDH every 90 mins or less
• 54%: Already use HBase for real-time data access
Source: Cloudera customer survey August 2012
But Hadoop Isn’t Fast Enough
• 78%: Need faster queries on Hadoop data
• 71%: Move data from Hadoop to RDBMS for interactive SQL
• 62%: See value today in consolidating to a single platform
Source: Cloudera customer survey August 2012
Beyond Batch – The Next Stage for Hadoop
HADOOP TODAY IS TOO SLOW
• MapReduce is batch
• Simple queries can take minutes / tens of minutes
CURRENT DATA MANAGEMENT IS TOO COMPLEX
• Optimized for rigid schemas & special purpose applications
• Redundant data storage & processes
• Very expensive systems: $20K-150K / TB
Cloudera Enterprise RTQ
Real-Time Query for Data Stored in Hadoop. Powered by Cloudera Impala.
• FAMILIAR: Supports Hive SQL
• FAST: 4-30X faster than Hive over MapReduce
• INTEGRATED: Uses existing drivers, integrates with existing metastore, works with leading BI tools
• 100% OPEN SOURCE: Flexible, cost-effective, no lock-in
• EASY TO USE: Deploy & operate with Cloudera Manager
• FLEXIBLE: Supports multiple storage engines & file formats
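To make the 4-30X figure concrete, the short calculation below converts a Hive-over-MapReduce runtime into the range of runtimes that speedup would imply. It is only arithmetic on the claimed range: the 10-minute baseline is a hypothetical example, not a measurement from the deck, and the function name `impala_time_range` is made up for illustration.

```python
def impala_time_range(hive_seconds, min_speedup=4.0, max_speedup=30.0):
    """Return (best_case, worst_case) runtimes in seconds implied by a speedup range."""
    return hive_seconds / max_speedup, hive_seconds / min_speedup

# Hypothetical baseline: a query that takes 10 minutes on Hive over MapReduce.
best, worst = impala_time_range(600)
print(f"{best:.0f}s to {worst:.0f}s")
```

Under that assumption the same query would land somewhere between 20 seconds and 2.5 minutes.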
Cloudera Now Powered by Impala
• With Impala:
  • Real-time SQL queries
  • Native distributed query engine
  • Optimized for low latency
• Provides:
  • Answers as fast as you can ask
  • Lets everyone ask questions of all data
  • Big data storage and analytics together
• Unified Storage:
  • Supports HDFS and HBase
  • Flexible file formats
• Unified Metastore
• Unified Security
• Unified Client Interfaces: ODBC, SQL syntax, Hue Beeswax
[Diagram: BEFORE IMPALA vs. WITH IMPALA. Layers shown: USER INTERFACE, META DATA, BATCH PROCESSING, REAL-TIME ACCESS (IMPALA), STORAGE]
Cloudera Impala Details
[Architecture diagram, shown as a sequence of builds]
• Client: SQL App, connecting via ODBC
• Shared services: Hive Metastore, HDFS NN, YARN, State Store
• Three nodes, each running: Query Planner, Query Coordinator, Query Exec Engine, HDFS DN, HBase
Callouts, in build order:
• SQL Request: Common Hive SQL and interface
• Unified metadata and scheduler
• Fully MPP Distributed
• Local Direct Reads
• In Memory Transfers
• SQL Results
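The flow in the architecture diagram (a coordinator fans a request out to every node, each node works on its own local data, and partial results are merged before being returned) can be illustrated with a toy sketch. This is not Impala code: the node names, the sample rows, and the functions `local_partial_count` and `coordinate` are hypothetical, and threads stand in for separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds only its local rows of (page, user).
NODE_DATA = {
    "node1": [("home", "u1"), ("search", "u2"), ("home", "u3")],
    "node2": [("search", "u1"), ("search", "u4")],
    "node3": [("home", "u2"), ("checkout", "u1")],
}

def local_partial_count(rows):
    """Runs per node: aggregate the node's local rows into partial counts."""
    return Counter(page for page, _user in rows)

def coordinate(node_data):
    """Coordinator: send the same work to every node, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partials = list(pool.map(local_partial_count, node_data.values()))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merge step: sum the per-page counts
    return dict(total)

result = coordinate(NODE_DATA)
print(result)
```

The sketch corresponds roughly to a `SELECT page, COUNT(*) ... GROUP BY page` query: only small partial aggregates travel to the coordinator, not the raw rows.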
Advantages of Our Approach
• No high-latency MapReduce batch processing
• Local processing avoids network bottlenecks
• No costly data format conversion overhead
• All data immediately query-able
• Single machine pool to scale
• All machines available to both Impala and MapReduce
• Single, open, and unified metadata and scheduler
[Diagram comparing alternative approaches, labeled MapReduce, Side Storage, and Remote Query; components shown: Query Node, Query Engine, Hive, MR, HDFS, NN, DN]
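The point that local processing avoids network bottlenecks can be sketched as a scheduling rule: send each block's scan to a node that already stores a replica of that block. The example below is a minimal illustration under assumed inputs, not a description of Impala's actual scheduler; the block map, the node names, and `assign_local_scans` are hypothetical.

```python
from collections import defaultdict

# Hypothetical block map: block id -> nodes that hold a replica of it.
BLOCK_REPLICAS = {
    "blk_1": ["dn1", "dn2"],
    "blk_2": ["dn2", "dn3"],
    "blk_3": ["dn1", "dn3"],
    "blk_4": ["dn2", "dn3"],
}

def assign_local_scans(block_replicas):
    """Assign each block to the least-loaded node that already stores a replica."""
    load = defaultdict(int)
    assignment = {}
    for block in sorted(block_replicas):
        replicas = block_replicas[block]
        # Fewest scans assigned so far wins; ties go to the lowest node name.
        node = min(replicas, key=lambda n: (load[n], n))
        assignment[block] = node
        load[node] += 1
    return assignment

assignment = assign_local_scans(BLOCK_REPLICAS)
print(assignment)
```

Because every scan is placed on a replica holder, no block has to cross the network before it is read, which is the contrast the slide draws with side-storage and remote-query designs.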
Cloudera Impala Demo
Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop
SPEED TO INSIGHT
• Get answers as fast as you can ask questions
• Interactive analytics directly on source data
• No jumping between data silos
COST SAVINGS
• Reduce duplicate storage with EDW
• Reduce data movement for interactive analysis
• Leverage existing tools and employee skills
FULL FIDELITY ANALYSIS
• Ask questions of all your data
• No information loss from aggregation or conforming to relational schemas for analysis
DISCOVERABILITY
• Single metadata store from origination through analysis
• No need to hunt through multiple data silos
Cloudera powers real-time data hub
The Challenge:
• Needs to understand 2 years of clickstream data for greater insight
• Legacy system cannot scale for data processing and analytics
The Solution:
• Cloudera Enterprise – 4 Petabytes
• One single scalable platform for Big Data for archive, ETL & analytics with real-time BI
• Running Impala
So Expedia can optimize end user data-driven search results and maximize Google AdWords spend.
Validated Beta Partners