apache hadoop - big data engineering
TRANSCRIPT
Apache HadoopBig Data Engineering
Prepared by:● Islam Elbanna● Mahmoud Hanafy
Presented by:● Ahmed Mahran
Outlines
1. Introduction2. History3. Assumptions 4. Architecture
a. Case Studyb. MapReduce Designc. Code Exampled. Main Modulese. Access Procedure
5. Hadoop Modes6. MapReduce 1 VS MapReduce 2 (YARN)7. Questions
Outlines
1. Introduction2. History3. Assumptions 4. Architecture
a. Case Studyb. MapReduce Designc. Code Exampled. Main Modulese. Access Procedure
5. Hadoop Modes6. MapReduce 1 VS MapReduce 2 (YARN)7. Questions
Introduction
What is Hadoop?"Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each providing computation and storage"
Open Source software + Hardware commodity = IT Cost reduction
Introduction - Cont.
Why Hadoop ?● Performance● Storage● Scalability● Fault tolerance● Cost efficiency (Commodity Machines)
Introduction - Cont.
What is Hadoop used for ?● Searching● Log processing● Recommendation system● Analytics● Video and Image analysis
Introduction - Cont.
Who uses Hadoop ?● Amazon● Facebook● Google● IBM● New York Times● Yahoo● Twitter● LinkedIn● …
Introduction - Cont.
Hadoop RDBMS
Non-Structured/Structured data Structured data
Scale Out Scale Up
Procedural/Functional programming Declarative Queries
Offline batch processing Online/Batch Transactions
Petabytes Gigabytes
Key Value Pairs Predefined fields
Hadoop Vs RDBMS
Introduction - Cont.
Problem: 20+ billion web pages x 20KB = 400+ terabytes
One computer can read 30-35 MB/sec from disk~ Four months to read the web (Time).~1,000 hard drives just to store the web (Storage).
Introduction - Cont.
Solution: same problem with 1000 machines < 3 hoursBut we need:● Communication and coordination● Recovering from machine failure● Status reporting● Debugging● Optimization
Distributed System
Introduction - Cont.
Distributed systems● Cluster of machines● Distributed Storage● Distributed Computing
Introduction - Cont.
Distributed systems● Cluster of machines● Distributed Storage● Distributed Computing
Distributed systems● Cluster of machines● Distributed Storage● Distributed Computing
Introduction - Cont.
Introduction - Cont.
Distributed systems● Cluster of machines● Distributed Storage● Distributed Computing
Outlines
1. Introduction2. History3. Assumptions 4. Architecture
a. Case Studyb. MapReduce Designc. Code Exampled. Main Modulese. Access Procedure
5. Hadoop Modes6. MapReduce 1 VS MapReduce 2 (YARN)7. Questions
History
● 2002-2004 Started as a sub-project of Apache Nutch.
● 2003-2004 Google published Google File System (GFS) and MapReduce Framework Paper.
● 2004 Doug Cutting and Mike Cafarella implemented Google’s frameworks in Nutch.
● In 2006 Yahoo hires Doug Cutting to work on Hadoop with a dedicated team.
● In 2008 Hadoop became Apache Top Level Project.
Outlines
1. Introduction2. History3. Assumptions 4. Architecture
a. Case Studyb. MapReduce Designc. Code Exampled. Main Modulese. Access Procedure
5. Hadoop Modes6. MapReduce 1 VS MapReduce 2 (YARN)7. Questions
Assumptions
● Hardware Failure● Streaming Data Access● Large Data Sets● Simple Coherency Model● Moving Computation is Cheaper than Moving Data● Software Platform Portability
Outlines
1. Introduction2. History3. Assumptions 4. Architecture
a. Case Studyb. MapReduce Designc. Code Exampled. Main Modulese. Access Procedure
5. Hadoop Modes6. MapReduce 1 VS MapReduce 2 (YARN)7. Questions
Architecture
Hadoop designed and built on two independent frameworks
Hadoop = HDFS + MapReduce
HDFS: is a reliable distributed file system that provides high-throughput access to data.● File divided into blocks 64MB (default)● Each block replicated 3 times (default)
MapReduce: is a framework for performing high performance distributed data processing.
Outlines
1. Introduction2. History3. Assumptions 4. Architecture
a. Case Studyb. MapReduce Designc. Code Exampled. Main Modulese. Access Procedure
5. Hadoop Modes6. MapReduce 1 VS MapReduce 2 (YARN)7. Questions
Case Study: Word Count
Problem: We need to calculate word frequencies in billions of web pages● Input: Files with one document per
record● Output: List of words and their
frequencies in the whole documents
Case Study: Solution
Outlines
1. Introduction2. History3. Assumptions 4. Architecture
a. Case Studyb. MapReduce Designc. Code Exampled. Main Modulese. Access Procedure
5. Hadoop Modes6. MapReduce 1 VS MapReduce 2 (YARN)7. Questions
Architecture - Cont.
MapReduce Design● Map● Reduce● Shuffle & Sort
Case Study: Map Phase
● Specify a map function that takes a key/value pairkey = document URLvalue = document contents
● Output of map function is key/value pairs.In our case, output(word, “1”) once per word in the document
Case Study: Reduce Phase● MapReduce library gathers together all pairs with the same key
(shuffle/sort)● The reduce function combines the values for a key
In our case, compute the sum
● Output of reduce will be like that
Architecture - Cont.
MapReduce Design● Map: extract
something you care about from each record.
Architecture - Cont.
MapReduce Design● Reduce :
aggregate, summarize, filter, or transform mapper output
Architecture - Cont.
MapReduce Design Overall View:
Architecture - Cont.
MapReduce Design● Shuffle & Sort :
redirect the mapper output to the right reducer
Case Study: Overall View
Outlines
1. Introduction2. History3. Assumptions 4. Architecture
a. Case Studyb. MapReduce Designc. Code Exampled. Main Modulese. Access Procedure
5. Hadoop Modes6. MapReduce 1 VS MapReduce 2 (YARN)7. Questions
Architecture - Cont.
MapReduce Programmer specifies two primary methods:
map(k1, v1) → <k2, v2>reduce(k2, list<v2>) → <k3, v3>
Case Study : Code ExampleMap Function
Case Study : Code ExampleReduce Function
Hadoop not only JAVA (streaming)
Outlines
1. Introduction2. History3. Assumptions 4. Architecture
a. Case Studyb. MapReduce Designc. Code Exampled. Main Modulese. Access Procedure
5. Hadoop Modes6. MapReduce 1 VS MapReduce 2 (YARN)7. Questions
Architecture - Cont.
Main Modules● File System (HDFS)
⚪ Name Node⚪ Secondary Name Node⚪ Data Node
● MapReduce Framework⚪ Job Tracker⚪ Task Tracker
Architecture - Cont.Main Modules● File System (HDFS)
⚪ Name Node⚪ Secondary Name Node⚪ Data Node⚪
Architecture - Cont.
Main Modules● MapReduce Framework
⚪ Job Tracker⚪ Task Tracker
Outlines
1. Introduction2. History3. Assumptions 4. Architecture
a. Case Studyb. MapReduce Designc. Code Exampled. Main Modulese. Access Procedure
5. Hadoop Modes6. MapReduce 1 VS MapReduce 2 (YARN)7. Questions
Architecture - Cont.
Access Procedure● Read From HDFS● Write to HDFS
Architecture - Cont.
Access Procedure● Read From HDFS● Write to HDFS
Architecture - Cont.
Access Procedure● Read From HDFS● Write to HDFS
Architecture - Cont.
Tasks distribution Procedure:JobTracker choses the nodes to execute the tasks to achieve the data locality principle
Outlines
1. Introduction2. History3. Assumptions 4. Architecture
a. Case Studyb. MapReduce Designc. Code Exampled. Main Modulese. Access Procedure
5. Hadoop Modes6. MapReduce 1 VS MapReduce 2 (YARN)7. Questions
Hadoop Modes
Hadoop Modes● Standalone● Pseudo-Distributed● Fully-Distributed
Outlines
1. Introduction2. History3. Assumptions 4. Architecture
a. Case Studyb. MapReduce Designc. Code Exampled. Main Modulese. Access Procedure
5. Hadoop Modes6. MapReduce 1 VS MapReduce 2 (YARN)7. Questions
MapReduce 1 Vs MapReduce 2(YARN)
Outlines
1. Introduction2. History3. Assumptions 4. Architecture
a. Case Studyb. MapReduce Designc. Code Exampled. Main Modulese. Access Procedure
5. Hadoop Modes6. MapReduce 1 VS MapReduce 2 (YARN)7. Questions
Questions
References
● Book “Hadoop in Action” by Chuck Lam● Book “Hadoop The Definitive Guide” by Tom Wbite● http://hadoop.apache.org/● http://en.wikipedia.org/wiki/Apache_Hadoop ● https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/ ● http://www.slideshare.net/emcacademics/milind-hadoop-trainingbrazil● http://www.slideshare.net/PhilippeJulio/hadoop-architecture● http://www.slideshare.net/rantav/introduction-to-map-reduce● http://www.slideshare.net/sudhakara_st/hadoop-intruduction?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=
&from_search=2● http://www.slideshare.net/ZhijieShen/hadoop-summit-san-jose-2014?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=q
f1&b=&from_search=12● http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig● http://www.slideshare.net/phobeo/introduction-to-data-processing-using-hadoop-and-pig● http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam?qid=a1
4580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=&from_search=1
Thanks