Hadoop: Distributed Data Processing
Amr Awadallah, Founder/CTO, Cloudera, Inc.
ACM Data Mining SIG
Thursday, January 25th, 2010
Wednesday, January 27, 2010
Amr Awadallah, Cloudera Inc 2
Outline
▪ Scaling for Large Data Processing
▪ What is Hadoop?
▪ HDFS and MapReduce
▪ Hadoop Ecosystem
▪ Hadoop vs RDBMSes
▪ Conclusion
Current Storage Systems Can’t Compute

[Diagram] Instrumentation feeds a Collection tier, which writes (mostly append) into a Storage Farm for Unstructured Data (20TB/day). An ETL Grid loads a subset into an RDBMS (200GB/day) that serves Interactive Apps. Ad hoc Queries & Data Mining go unserved ("non-consumption"), and the filer heads are a bottleneck.
The Solution: A Store-Compute Grid

[Diagram] Instrumentation feeds a Collection tier, which writes (mostly append) into a combined Storage + Computation grid. ETL and Aggregations run inside the grid and load an RDBMS serving Interactive Apps, while "Batch" Apps and Ad hoc Queries & Data Mining run directly on the grid.
What is Hadoop?

▪ A scalable, fault-tolerant grid operating system for data storage and processing
▪ Its scalability comes from the marriage of:
  ▪ HDFS: Self-Healing, High-Bandwidth Clustered Storage
  ▪ MapReduce: Fault-Tolerant Distributed Processing
▪ Operates on unstructured and structured data
▪ A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)
▪ Open source under the friendly Apache License
▪ http://wiki.apache.org/hadoop/
Hadoop History

▪ 2002-2004: Doug Cutting and Mike Cafarella start working on Nutch
▪ 2003-2004: Google publishes the GFS and MapReduce papers
▪ 2004: Cutting adds DFS & MapReduce support to Nutch
▪ 2006: Yahoo! hires Cutting; Hadoop spins out of Nutch
▪ 2007: The NY Times converts 4TB of archives over 100 EC2 instances
▪ 2008: Web-scale deployments at Yahoo!, Facebook, Last.fm
▪ April 2008: Yahoo! does the fastest sort of a TB, 3.5 minutes over 910 nodes
▪ May 2009:
  ▪ Yahoo! does the fastest sort of a TB, 62 seconds over 1460 nodes
  ▪ Yahoo! sorts a PB in 16.25 hours over 3658 nodes
▪ June 2009, Oct 2009: Hadoop Summit (750), Hadoop World (500)
▪ September 2009: Doug Cutting joins Cloudera
Hadoop Design Axioms

1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Should Move to Data
4. Simple Core, Modular and Extensible
HDFS: Hadoop Distributed File System

▪ Block Size = 64MB
▪ Replication Factor = 3
▪ Cost/GB is a few ¢/month vs $/month
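As a rough illustration of what those two numbers mean (this is just the arithmetic, not HDFS code; `hdfs_footprint` is a name made up for this sketch): a file is split into 64MB blocks, and each block is stored on three different data nodes.

```python
import math

BLOCK_SIZE_MB = 64   # default HDFS block size on the slide
REPLICATION = 3      # default replication factor on the slide

def hdfs_footprint(file_size_mb):
    """Return (num_blocks, raw_storage_mb) for a file of the given size."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is replicated, so raw cluster storage is ~3x the file size.
    return blocks, file_size_mb * REPLICATION

blocks, raw = hdfs_footprint(1000)  # a ~1GB file
print(blocks, raw)  # 16 blocks, 3000 MB of raw cluster storage
```

The large block size keeps per-block metadata on the Name Node small and favors long sequential reads; the replication is what makes the storage self-healing.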
MapReduce: Distributed Processing
MapReduce Example for Word Count

[Diagram] The input is divided into splits (Split 1 ... Split i ... Split N). Each split feeds a map task (Map 1 ... Map i ... Map M), which takes (docid, text) pairs such as "To Be Or Not To Be?" and emits (word, count) pairs. A shuffle phase sorts the pairs by word and routes them to reduce tasks (Reduce 1 ... Reduce i ... Reduce R), each of which sums the counts for a word (e.g. Be: 5 + 12 + 7 + 6 = 30) and writes sorted (word, sum of counts) records to its output file (Output File 1 ... Output File R).

The same computation, expressed as SQL and as a shell pipeline:

SELECT word, COUNT(1) FROM docs GROUP BY word;

cat *.txt | mapper.pl | sort | reducer.pl > out.txt
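The shell pipeline above can be sketched in Python. The `mapper` and `reducer` functions here are illustrative stand-ins for the Perl scripts on the slide, and `sorted` plays the role of the shuffle:

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower().strip("?.,!"), 1)

def reducer(pairs):
    # Reduce phase: pairs arrive sorted by word (the "shuffle"),
    # so consecutive runs of the same word can simply be summed.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

docs = ["To Be Or Not To Be?"]
shuffled = sorted(mapper(docs))  # stands in for `sort`
counts = dict(reducer(shuffled))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In real Hadoop the map tasks run in parallel on the nodes holding each split, and the framework performs the sort/shuffle between the map and reduce phases.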
Hadoop High-Level Architecture

▪ Hadoop Client: contacts the Name Node for data, or the Job Tracker to submit jobs
▪ Name Node: maintains the mapping of file blocks to data node slaves
▪ Job Tracker: schedules jobs across task tracker slaves
▪ Data Node: stores and serves blocks of data
▪ Task Tracker: runs tasks (work units) within a job
▪ Data Nodes and Task Trackers share the same physical nodes
Apache Hadoop Ecosystem

▪ HDFS (Hadoop Distributed File System): the storage layer
▪ MapReduce (job scheduling/execution system) on top of HDFS, with Streaming/Pipes APIs
▪ HBase (key-value store) on HDFS
▪ Pig (Data Flow) and Hive (SQL) on MapReduce, feeding BI Reporting and ETL Tools
▪ Avro (Serialization) and ZooKeeper (Coordination) alongside the whole stack
▪ Sqoop, for moving data between Hadoop and RDBMSes
Use The Right Tool For The Right Job

Hadoop, when to use?
▪ Affordable Storage/Compute
▪ Structured or Not (Agility)
▪ Resilient Auto Scalability

Relational Databases, when to use?
▪ Interactive Reporting (<1sec)
▪ Multistep Transactions
▪ Interoperability
Economics of Hadoop

▪ Typical Hardware:
  ▪ Two quad-core Nehalems
  ▪ 24GB RAM
  ▪ 12 x 1TB SATA disks (JBOD mode, no need for RAID)
  ▪ 1 Gigabit Ethernet card
▪ Cost: $5K/node
▪ Effective HDFS Space:
  ▪ ¼ reserved for temp shuffle space, which leaves 9TB/node
  ▪ 3-way replication leads to 3TB effective HDFS space/node
  ▪ But assuming 7x compression, that becomes ~20TB/node
▪ Effective cost per user TB: $250/TB
▪ Other solutions cost in the range of $5K to $100K per user TB
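The per-TB figure follows from straightforward arithmetic, reproduced here as a sanity check (illustrative only; the slide rounds 21TB down to ~20TB to get ~$250/TB):

```python
# Sanity check of the slide's cost arithmetic.
cost_per_node = 5000         # dollars per node
raw_tb = 12                  # 12 x 1TB SATA disks
hdfs_tb = raw_tb * 3 / 4     # 1/4 reserved for temp shuffle space -> 9TB
replicated_tb = hdfs_tb / 3  # 3-way replication -> 3TB effective
user_tb = replicated_tb * 7  # ~7x compression -> ~21TB of user data
cost_per_user_tb = cost_per_node / user_tb
print(round(cost_per_user_tb))  # ~238, in line with the slide's ~$250/TB
```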
Sample Talks from Hadoop World ‘09

▪ VISA: Large Scale Transaction Analysis
▪ JP Morgan Chase: Data Processing for Financial Services
▪ China Mobile: Data Mining Platform for Telecom Industry
▪ Rackspace: Cross Data Center Log Processing
▪ Booz Allen Hamilton: Protein Alignment using Hadoop
▪ eHarmony: Matchmaking in the Hadoop Cloud
▪ General Sentiment: Understanding Natural Language
▪ Yahoo!: Social Graph Analysis
▪ Visible Technologies: Real-Time Business Intelligence
▪ Facebook: Rethinking the Data Warehouse with Hadoop and Hive
Slides and Videos at http://www.cloudera.com/hadoop-world-nyc
Cloudera Desktop
Conclusion

Hadoop is a data grid operating system that provides an economically scalable solution for storing and processing large amounts of unstructured or structured data over long periods of time.
Contact Information

Amr Awadallah, CTO, Cloudera Inc.
[email protected]
http://twitter.com/awadallah

Online Training Videos and Info:
http://cloudera.com/hadoop-training
http://cloudera.com/blog
http://twitter.com/cloudera
(c) 2008 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0