Apache Hadoop - Big Data Engineering

Prepared by:
● Islam Elbanna
● Mahmoud Hanafy

Presented by:
● Ahmed Mahran

Outline

1. Introduction
2. History
3. Assumptions
4. Architecture
   a. Case Study
   b. MapReduce Design
   c. Code Example
   d. Main Modules
   e. Access Procedure
5. Hadoop Modes
6. MapReduce 1 vs. MapReduce 2 (YARN)
7. Questions

Introduction

What is Hadoop?

"Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each providing computation and storage."

Open-source software + commodity hardware = IT cost reduction

Introduction - Cont.

Why Hadoop?
● Performance
● Storage
● Scalability
● Fault tolerance
● Cost efficiency (commodity machines)


What is Hadoop used for?
● Searching
● Log processing
● Recommendation systems
● Analytics
● Video and image analysis


Who uses Hadoop?
● Amazon
● Facebook
● Google
● IBM
● New York Times
● Yahoo
● Twitter
● LinkedIn
● …


Hadoop vs. RDBMS

Hadoop                               RDBMS
-----------------------------------  ---------------------------
Unstructured/structured data         Structured data
Scale out                            Scale up
Procedural/functional programming    Declarative queries
Offline batch processing             Online/batch transactions
Petabytes                            Gigabytes
Key/value pairs                      Predefined fields


Problem: 20+ billion web pages x 20 KB = 400+ terabytes

● One computer can read 30-35 MB/sec from disk, so it needs about four months just to read the web (time).
● About 1,000 hard drives are needed just to store the web (storage).

Solution: the same problem spread across 1,000 machines takes roughly 3 hours. But we need:
● Communication and coordination
● Recovering from machine failure
● Status reporting
● Debugging
● Optimization
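The arithmetic behind this estimate can be checked with a short sketch. It assumes decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) and a perfectly even split of the data across machines with no coordination overhead; the function name is illustrative only.

```python
# Back-of-the-envelope check of the "read the whole web" estimate.
PAGES = 20e9              # 20 billion web pages (figure from the slide)
PAGE_SIZE = 20e3          # 20 KB per page, decimal units assumed
SECONDS_PER_DAY = 86_400

def read_time_seconds(total_bytes, bytes_per_sec, machines=1):
    """Time to scan total_bytes when the data is split evenly across machines."""
    return total_bytes / (bytes_per_sec * machines)

total_bytes = PAGES * PAGE_SIZE   # 4e14 bytes = 400 TB

for rate_mb in (30, 35):
    rate = rate_mb * 1e6
    single_days = read_time_seconds(total_bytes, rate) / SECONDS_PER_DAY
    cluster_hours = read_time_seconds(total_bytes, rate, machines=1000) / 3600
    print(f"{rate_mb} MB/s: {single_days:.0f} days on 1 machine, "
          f"{cluster_hours:.1f} hours on 1,000 machines")
```

Under these assumptions a single machine needs between four and five months, and 1,000 machines need a little over three hours, which is the same order of magnitude as the figures quoted above.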

Distributed Systems

● Cluster of machines
● Distributed storage
● Distributed computing

History

● 2002-2004: Hadoop began as part of the Apache Nutch web search project.

● 2003-2004: Google published the Google File System (GFS) paper (2003) and the MapReduce paper (2004).

● 2004: Doug Cutting and Mike Cafarella implemented Google’s frameworks in Nutch.

● 2006: Yahoo hired Doug Cutting to work on Hadoop with a dedicated team.

● 2008: Hadoop became an Apache top-level project.

Assumptions

● Hardware failure
● Streaming data access
● Large data sets
● Simple coherency model
● Moving computation is cheaper than moving data
● Software platform portability

Architecture

Hadoop is designed and built on two independent frameworks:

Hadoop = HDFS + MapReduce

HDFS: a reliable distributed file system that provides high-throughput access to data.
● Each file is divided into blocks of 64 MB (the default in Hadoop 1.x; Hadoop 2.x defaults to 128 MB).
● Each block is replicated 3 times (default).
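To make the block and replication defaults concrete, here is a minimal sketch that computes how many blocks a file is split into and how much raw disk space its replicas occupy. The function name `hdfs_footprint` is hypothetical, and the sketch assumes the 64 MB block size and replication factor of 3 quoted above.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size quoted above
REPLICATION = 3                 # default replication factor quoted above

def hdfs_footprint(file_size, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, number of block replicas, raw bytes stored)."""
    blocks = (file_size + block_size - 1) // block_size   # ceiling division
    # The last block only occupies as much disk as the data it holds,
    # so raw usage is file size times replication, not blocks * block_size.
    return blocks, blocks * replication, file_size * replication

one_gb = 1024 ** 3
print(hdfs_footprint(one_gb))       # (16, 48, 3221225472)
print(hdfs_footprint(one_gb + 1))   # one extra byte starts a 17th block
```

A 1 GB file therefore becomes 16 blocks and 48 block replicas spread across the cluster, and losing any single machine still leaves at least two copies of every block.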

MapReduce: a framework for high-performance distributed data processing.

Case Study: Word Count

Problem: We need to calculate word frequencies in billions of web pages.
● Input: files with one document per record
● Output: a list of words and their frequencies across all documents

Case Study: Solution

Architecture - Cont.

MapReduce Design
● Map
● Reduce
● Shuffle & Sort

Case Study: Map Phase

● Specify a map function that takes a key/value pair:
  key = document URL
  value = document contents

● The output of the map function is a set of key/value pairs. In our case, it outputs (word, “1”) once per word in the document.

Case Study: Reduce Phase

● The MapReduce library gathers together all pairs with the same key (shuffle/sort).

● The reduce function combines the values for a key. In our case, it computes the sum.

● The output of reduce is one (word, total count) pair per distinct word.
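The map, shuffle/sort, and reduce phases of the word count case study can be sketched in memory as follows. This is a single-process illustration and not Hadoop code: the function names and sample documents are hypothetical, words are lowercased as an added assumption, and the integer 1 is emitted instead of the string “1”.

```python
from collections import defaultdict

def map_fn(url, contents):
    """Map: emit (word, 1) once per word occurrence in the document."""
    for word in contents.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle/sort: group all values that share a key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(word, counts):
    """Reduce: combine the values for one key; here, sum them."""
    return word, sum(counts)

def word_count(documents):
    mapped = (pair for url, text in documents.items() for pair in map_fn(url, text))
    return [reduce_fn(word, counts) for word, counts in shuffle(mapped)]

docs = {  # hypothetical input: URL -> document contents
    "http://example.com/a": "the quick brown fox",
    "http://example.com/b": "the lazy dog and the fox",
}
for word, count in word_count(docs):
    print(word, count)
```

For these two documents the result contains ("the", 3) and ("fox", 2), with every other word counted once.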


MapReduce Design
● Map: extract something you care about from each record.


MapReduce Design
● Reduce: aggregate, summarize, filter, or transform the mapper output.


MapReduce Design: Overall View


MapReduce Design
● Shuffle & Sort: redirect the mapper output to the right reducer.
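One way to picture how mapper output reaches "the right reducer" is hash partitioning: the key is hashed and taken modulo the number of reducers, so every pair with the same key lands on the same reducer. Hadoop's default partitioner works on this principle using the key's Java hash code; the sketch below substitutes CRC32 as a stable hash, and its function names are hypothetical.

```python
import zlib

def partition(key, num_reducers):
    """Pick the reducer for a key: the same key always maps to the same reducer."""
    # crc32 is a stand-in for a stable hash; Python's built-in hash() for str
    # is randomized per process, so it would not be reproducible across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle_and_sort(mapper_output, num_reducers):
    """Route each (key, value) pair to its reducer, then sort each reducer's input by key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in mapper_output:
        buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(bucket) for bucket in buckets]

pairs = [("fox", 1), ("the", 1), ("dog", 1), ("the", 1), ("fox", 1)]
for reducer_id, bucket in enumerate(shuffle_and_sort(pairs, num_reducers=2)):
    print(reducer_id, bucket)
```

Which reducer receives which word depends on the hash values, but both ("the", 1) pairs always end up in the same bucket, which is what lets a single reducer compute the complete sum for that word.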

Case Study: Overall View

The MapReduce programmer specifies two primary methods:

map(k1, v1) → list<k2, v2>
reduce(k2, list<v2>) → list<k3, v3>
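The two signatures can be written down as types, together with a tiny single-process driver that plays the role of the framework. This is a sketch under stated assumptions: `run_job`, `level_mapper`, `count_reducer`, and the sample log lines are all hypothetical, and the log-level count is only an illustration of the log processing use case.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple, TypeVar

K1, V1 = TypeVar("K1"), TypeVar("V1")
K2, V2 = TypeVar("K2"), TypeVar("V2")
K3, V3 = TypeVar("K3"), TypeVar("V3")

# map(k1, v1) -> zero or more <k2, v2> pairs
Mapper = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce(k2, list<v2>) -> zero or more <k3, v3> pairs
Reducer = Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]]

def run_job(records, mapper, reducer):
    """Apply mapper to every record, group values by key, then apply reducer per key."""
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            grouped[k2].append(v2)
    output = []
    for k2 in sorted(grouped):
        output.extend(reducer(k2, grouped[k2]))
    return output

def level_mapper(offset, line):
    yield line.split()[0], 1          # k2 = log level, v2 = 1

def count_reducer(level, ones):
    yield level, sum(ones)            # k3 = log level, v3 = number of lines

# Hypothetical input: (byte offset, log line) pairs
logs = [(0, "INFO job started"), (17, "WARN slow disk"), (32, "INFO job finished")]
print(run_job(logs, level_mapper, count_reducer))   # [('INFO', 2), ('WARN', 1)]
```

Only the mapper and reducer are specific to the problem; the grouping in between is generic, which is why the framework can own it.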

Case Study: Code Example (Map Function)

Case Study: Code Example (Reduce Function)

Hadoop is not only Java (Hadoop Streaming)
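Hadoop Streaming lets the mapper and reducer be any executables that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below shows the word count logic in that line-oriented style, written as plain Python generators so it can be run locally; the function names and sample lines are hypothetical, and the `sorted` call stands in for the sort the framework performs between the two phases.

```python
from itertools import groupby

def mapper(lines):
    """Streaming-style mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Streaming-style reducer: input lines arrive sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

lines = ["the quick fox\n", "the lazy dog\n"]
mapped = sorted(mapper(lines))        # stands in for the framework's sort
for out in reducer(mapped):
    print(out)
```

In an actual streaming job, each function would live in its own script that iterates over `sys.stdin` and prints its output, and the two scripts would be passed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options.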

Main Modules

● File System (HDFS)
  ⚪ Name Node
  ⚪ Secondary Name Node
  ⚪ Data Node

● MapReduce Framework
  ⚪ Job Tracker
  ⚪ Task Tracker

Access Procedure
● Read from HDFS
● Write to HDFS


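Task Distribution Procedure: The JobTracker chooses the nodes that execute the tasks so as to honor the data locality principle, meaning each task runs on, or as close as possible to, a node that already stores its input block.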
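The locality-first idea can be illustrated with a greedy sketch: for each input block, prefer a node that holds a replica and has a free slot, and fall back to any free node otherwise. This is a deliberately simplified model, not the JobTracker's actual scheduler, which also takes rack locality into account and hands out tasks in response to TaskTracker heartbeats; the function name, node names, and block names are hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Greedy, locality-first assignment of one map task per input block.

    block_locations: {block_id: [nodes holding a replica]}
    free_slots:      {node: number of free task slots}
    Returns {block_id: (node, is_data_local)}.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignment = {}
    for block, replicas in block_locations.items():
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                break  # no capacity left; remaining tasks have to wait
            node, is_local = candidates[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment

blocks = {"b1": ["n1", "n2", "n3"], "b2": ["n1", "n2", "n4"], "b3": ["n1", "n3", "n4"]}
slots = {"n1": 1, "n2": 1, "n5": 1}
print(assign_tasks(blocks, slots))
# {'b1': ('n1', True), 'b2': ('n2', True), 'b3': ('n5', False)}
```

Blocks b1 and b2 are processed where their data already lives, while b3 runs on n5 and has to pull its block over the network, which is the case the locality principle tries to minimize.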

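Hadoop Modes

● Standalone: everything runs in a single JVM against the local file system.
● Pseudo-distributed: all Hadoop daemons run on one machine.
● Fully distributed: the daemons run across a cluster of machines.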

MapReduce 1 vs. MapReduce 2 (YARN)

Questions

References

● Book: “Hadoop in Action” by Chuck Lam
● Book: “Hadoop: The Definitive Guide” by Tom White
● http://hadoop.apache.org/
● http://en.wikipedia.org/wiki/Apache_Hadoop
● https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/
● http://www.slideshare.net/emcacademics/milind-hadoop-trainingbrazil
● http://www.slideshare.net/PhilippeJulio/hadoop-architecture
● http://www.slideshare.net/rantav/introduction-to-map-reduce
● http://www.slideshare.net/sudhakara_st/hadoop-intruduction?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=&from_search=2
● http://www.slideshare.net/ZhijieShen/hadoop-summit-san-jose-2014?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=&from_search=12
● http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
● http://www.slideshare.net/phobeo/introduction-to-data-processing-using-hadoop-and-pig
● http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=&from_search=1

Thanks