The Platform for Big Data
Amr Awadallah | CTO, Founder, Cloudera, [email protected], twitter: @awadallah
©2012 Cloudera, Inc. All Rights Reserved.
The Problems with Current Data Systems
Diagram: Instrumentation → Collection (mostly append) → Storage Only Grid (original raw data) → ETL Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps
1. Moving Data To Compute Doesn’t Scale
2. Archiving = Premature Data Death
3. Can’t Explore Original High Fidelity Raw Data
The Solution: A Combined Storage/Compute Layer
Diagram: Instrumentation → Collection (mostly append) → Hadoop: Storage + Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps
1. Scalable Throughput For ETL & Aggregation (ETL Acceleration)
2. Keep Data Alive Forever (Active Archive)
3. Data Exploration & Advanced Analytics
So What is Apache Hadoop?
• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license).
• Core Hadoop has two main systems:
  • Hadoop Distributed File System: self-healing high-bandwidth clustered storage.
  • MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
• Key business values:
  • Flexibility – Store any data, run any analysis.
  • Scalability – Start at 1 TB/3 nodes, grow to petabytes/1000s of nodes.
  • Economics – Cost per TB at a fraction of traditional options.
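The "data programming abstraction" behind MapReduce can be illustrated with a tiny in-memory word count. This is only a sketch of the map, shuffle, and reduce steps in plain Python, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would distribute each step across nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because the developer only writes the map and reduce functions, the same logic could in principle run over three nodes or thousands without being rewritten.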
The Hadoop Big Bang
• Fastest sort of a TB: 62 seconds over 1,460 nodes
• Sorted a PB in 16.25 hours over 3,658 nodes
Hadoop World 2009, 500 attendees
The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
• Schema must be created before any data can be loaded.
• An explicit load operation has to take place which transforms data to DB internal serialization format.
• New columns must be added explicitly before new data for such columns can be loaded into the database.
Pros:
• OLAP is Fast
• Standards/Governance
Schema-on-Read (Hadoop):
• Data is simply copied to the file store, no transformation is needed.
• A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).
• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
Pros:
• Load is Fast
• Flexibility/Agility
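The late-binding idea behind schema-on-read can be sketched in a few lines of Python. This is an illustration only: `RAW_STORE` and `read_with_schema` are hypothetical names, JSON lines stand in for whatever raw format is stored, and the reader function plays the role of a SerDe without using Hive's actual SerDe interface.

```python
import json

# Raw events are kept exactly as they arrived; nothing is transformed at load time.
RAW_STORE = [
    '{"user": "ann", "page": "/home"}',
    '{"user": "bob", "page": "/cart", "referrer": "ad-17"}',
]


def read_with_schema(raw_lines, columns):
    """Apply the schema at read time: parse each line and pull out the columns.

    Fields missing from a record come back as None instead of failing the read.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield tuple(record.get(col) for col in columns)


# The first version of the reader only knows about two columns.
v1 = list(read_with_schema(RAW_STORE, ["user", "page"]))

# Updating the reader exposes "referrer" retroactively for data already stored.
v2 = list(read_with_schema(RAW_STORE, ["user", "page", "referrer"]))
```

The stored data never changes between the two reads; only the column list applied at read time does, which is why new fields can appear without a reload.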
Scalability: Scalable Software Development
Grows without requiring developers to re-architect their algorithms/application.
AUTO SCALE
Economics: Return on Byte
• Return on Byte (ROB) = value to be extracted from a byte divided by the cost of storing that byte.
• If ROB is < 1, the data gets buried in the tape wasteland; hence the need for more economical active storage.
Low ROB
High ROB
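The Return on Byte ratio defined on this slide can be worked through with a short calculation. The sketch below is an assumption-laden illustration: the function names and the example figures are hypothetical, and value and cost only need to be expressed in the same unit (for example, currency per TB).

```python
def return_on_byte(value_extracted, storage_cost):
    """ROB = value extracted from the data / cost of storing it (same units)."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost


def stays_active(value_extracted, storage_cost):
    """Data is worth keeping on active storage only when ROB is at least 1."""
    return return_on_byte(value_extracted, storage_cost) >= 1


# Hypothetical figures: the same data is worth 50 per TB under both options.
rob_expensive = return_on_byte(50.0, 200.0)  # costly storage: 0.25
rob_cheap = return_on_byte(50.0, 25.0)       # cheaper storage: 2.0
```

With these made-up numbers, the same data moves from low ROB to high ROB purely because the storage cost drops, which is the economic argument the slide makes.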
The Big Data Platform: CDH4 – June 2012
• Hadoop Core Kernel: MapReduce, HDFS
• Coordination: Apache ZooKeeper
• Data Integration: Apache Flume, Apache Sqoop
• Fast Read/Write Access: Apache HBase
• Batch Processing Languages: Apache Pig, Apache Hive
• Interactive SQL: Impala
• Data Processing Lib: DataFu for Pig
• Data Mining Lib: Apache Mahout
• Metadata: Apache Hive MetaStore
• Job Workflow: Apache Oozie
• Web Console: Hue
• Connectivity: ODBC/JDBC/FUSE/HTTPS
• Cloud Deployment: Apache Whirr
• Build/Test: Apache Bigtop
• Cloudera Manager Free Edition (Installation Wizard)
CDH in the Enterprise Data Stack
Diagram components:
• Data sources: Logs, Files, Web Data, Relational Databases
• Connectors: Flume, Sqoop
• Connected systems: Enterprise Data Warehouse, Online Serving Systems
• Tools: IDEs, BI / Analytics, Enterprise Reporting, Modeling Tools, Meta Data / ETL Tools, Cloudera Manager, Web/Mobile Applications
• Users: System Operators, Engineers, Analysts, Business Users, Data Scientists, Data Architects, Customers
• Interfaces: ODBC, JDBC, NFS, HTTP
HBase versus HDFS
HDFS:
Use For:
• Fact tables that are mostly append only and require sequential full table scans.
Optimized For:
• Large Files
• Sequential Access (High Throughput)
• Append Only
HBase:
Use For:
• Dimension tables which are updated frequently and require random low-latency lookups.
Optimized For:
• Small Records
• Random Access (Low Latency)
• Atomic Record Updates
Not Suitable For:
• Low Latency Interactive OLAP.
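The two access patterns compared on this slide can be contrasted with a toy in-memory model. This is a sketch only: `AppendOnlyFacts` and `KeyedDimensions` are hypothetical classes that mimic how the slide suggests using HDFS and HBase, and they do not call either system's client API.

```python
class AppendOnlyFacts:
    """HDFS-style usage: rows are only appended and are read by full scans."""

    def __init__(self):
        self._rows = []

    def append(self, row):
        self._rows.append(row)

    def scan(self):
        """Sequentially read every row in arrival order."""
        return iter(self._rows)


class KeyedDimensions:
    """HBase-style usage: small records updated and fetched by key."""

    def __init__(self):
        self._by_key = {}

    def put(self, key, record):
        """Insert or overwrite the single record stored under key."""
        self._by_key[key] = record

    def get(self, key):
        """Random lookup of one record by key."""
        return self._by_key.get(key)


facts = AppendOnlyFacts()
facts.append({"product_id": 1, "amount": 20})
facts.append({"product_id": 2, "amount": 5})
facts.append({"product_id": 1, "amount": 10})

products = KeyedDimensions()
products.put(1, {"name": "widget"})
products.put(2, {"name": "gadget"})
products.put(1, {"name": "widget v2"})  # update one record in place

# Full scan over the facts, with one keyed lookup per row into the dimensions.
revenue = {}
for row in facts.scan():
    name = products.get(row["product_id"])["name"]
    revenue[name] = revenue.get(name, 0) + row["amount"]
```

The fact rows are never modified after being written, while the dimension record for product 1 is overwritten, mirroring the split the slide recommends.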
Use Case Examples
• Retail: Price Optimization
• Media: Content Targeting
• Finance: Fraud Detection
• Manufacturing: Diagnostics
• Info Services: Satellite Imagery
• Agriculture: Seed Optimization
• Power: Smart Consumption
Core Benefits of the Platform for Big Data
1. FLEXIBILITY
• STORE ANY DATA
• RUN ANY ANALYSIS
• KEEPS PACE WITH THE RATE OF CHANGE OF INCOMING DATA
2. SCALABILITY
• PROVEN GROWTH TO PBs/1,000s OF NODES
• NO NEED TO REWRITE QUERIES, AUTOMATICALLY SCALES
• KEEPS PACE WITH THE RATE OF GROWTH OF INCOMING DATA
3. ECONOMICS
• COST PER TB AT A FRACTION OF OTHER OPTIONS
• KEEP ALL OF YOUR DATA ALIVE IN AN ACTIVE ARCHIVE
• POWERING THE DATA BEATS ALGORITHM MOVEMENT
Thank you!
Amr Awadallah, CTO, Founder, Cloudera, Inc. <[email protected]> @awadallah