apache hadoop - big data engineering

Post on 16-Apr-2017

Apache Hadoop: Big Data Engineering

Prepared by:
● Islam Elbanna
● Mahmoud Hanafy

Presented by:
● Ahmed Mahran

Outline

1. Introduction
2. History
3. Assumptions
4. Architecture
   a. Case Study
   b. MapReduce Design
   c. Code Example
   d. Main Modules
   e. Access Procedure
5. Hadoop Modes
6. MapReduce 1 vs. MapReduce 2 (YARN)
7. Questions

Introduction

What is Hadoop?

"Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each providing computation and storage."

Open-source software + commodity hardware = IT cost reduction

Introduction - Cont.

Why Hadoop?
● Performance
● Storage
● Scalability
● Fault tolerance
● Cost efficiency (commodity machines)

Introduction - Cont.

What is Hadoop used for?
● Searching
● Log processing
● Recommendation systems
● Analytics
● Video and image analysis

Introduction - Cont.

Who uses Hadoop?
● Amazon
● Facebook
● Google
● IBM
● New York Times
● Yahoo
● Twitter
● LinkedIn
● …

Introduction - Cont.

Hadoop vs. RDBMS

Hadoop                              | RDBMS
Unstructured/structured data        | Structured data
Scale out                           | Scale up
Procedural/functional programming   | Declarative queries
Offline batch processing            | Online/batch transactions
Petabytes                           | Gigabytes
Key-value pairs                     | Predefined fields

Introduction - Cont.

Problem: 20+ billion web pages x 20 KB = 400+ terabytes

One computer can read 30-35 MB/sec from disk:
● About four months to read the web (time).
● About 1,000 hard drives just to store the web (storage).

Introduction - Cont.

Solution: the same problem spread over 1,000 machines takes only a few hours (about 3 to 4 at the read rate above). But we need:
● Communication and coordination
● Recovering from machine failure
● Status reporting
● Debugging
● Optimization

Distributed System
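The back-of-the-envelope numbers on the previous two slides can be reproduced with a short sketch. It assumes decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) and perfectly parallel reads with no coordination overhead; the function name is illustrative only.

```python
PAGES = 20e9                 # 20+ billion web pages (from the slide)
PAGE_SIZE_BYTES = 20e3       # 20 KB per page, decimal units assumed
READ_RATES = (30e6, 35e6)    # 30-35 MB/sec sequential disk read

def read_time_seconds(total_bytes, bytes_per_sec, machines=1):
    """Time to scan the data when it is split evenly across `machines`."""
    return total_bytes / (bytes_per_sec * machines)

total_bytes = PAGES * PAGE_SIZE_BYTES    # 4e14 bytes = 400 TB

for rate in READ_RATES:
    days_single = read_time_seconds(total_bytes, rate) / 86400
    hours_cluster = read_time_seconds(total_bytes, rate, machines=1000) / 3600
    print(f"{rate / 1e6:.0f} MB/s: {days_single:.0f} days on one machine, "
          f"{hours_cluster:.1f} hours on 1,000 machines")
```

The results land in the same range as the slide's estimates; the exact figures depend on the assumed read rate and ignore the coordination costs listed above.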

Introduction - Cont.

Distributed systems
● Cluster of machines
● Distributed storage
● Distributed computing

History

● 2002-2004: Started as a sub-project of Apache Nutch.

● 2003-2004: Google published the Google File System (GFS) and MapReduce papers.

● 2004: Doug Cutting and Mike Cafarella implemented Google’s frameworks in Nutch.

● 2006: Yahoo hired Doug Cutting to work on Hadoop with a dedicated team.

● 2008: Hadoop became an Apache top-level project.

Assumptions

● Hardware failure
● Streaming data access
● Large data sets
● Simple coherency model
● Moving computation is cheaper than moving data
● Software platform portability

Architecture

Hadoop is designed and built on two independent frameworks:

Hadoop = HDFS + MapReduce

HDFS: a reliable distributed file system that provides high-throughput access to data.
● Each file is divided into blocks of 64 MB (the default in Hadoop 1.x; Hadoop 2.x and later default to 128 MB)
● Each block is replicated 3 times (default)

MapReduce: a framework for high-performance distributed data processing.
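To make the HDFS defaults above concrete, here is a minimal sketch of how many blocks and replicas a file produces. The function name and the example file size are illustrative assumptions; the 64 MB block size and replication factor of 3 are the defaults quoted on the slide.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw storage in MB) for one file.

    The last block only occupies as much space as the data it holds,
    so raw storage is the file size times the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb

blocks, replicas, raw_mb = hdfs_footprint(1000)   # hypothetical 1,000 MB file
print(blocks, replicas, raw_mb)
```

For the hypothetical 1,000 MB file this gives 16 blocks, 48 block replicas spread over the cluster, and 3,000 MB of raw disk usage.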

Case Study: Word Count

Problem: We need to calculate word frequencies in billions of web pages.
● Input: files with one document per record
● Output: a list of words and their frequencies across all documents

Case Study: Solution

Architecture - Cont.

MapReduce Design
● Map
● Reduce
● Shuffle & Sort

Case Study: Map Phase

● Specify a map function that takes a key/value pair:
  key = document URL
  value = document contents

● The output of the map function is a set of key/value pairs. In our case, output (word, “1”) once per word in the document.

Case Study: Reduce Phase

● The MapReduce library gathers together all pairs with the same key (shuffle/sort).
● The reduce function combines the values for a key. In our case, it computes the sum.

● The output of reduce is one (word, total count) pair per distinct word.
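The map and reduce phases described above can be simulated in a single process. This is a sketch of the idea, not Hadoop's API: the function names, the in-memory shuffle, and the two sample documents are all illustrative assumptions, and the sketch emits the integer 1 rather than the string “1”.

```python
from collections import defaultdict

def map_fn(url, contents):
    """Map: emit (word, 1) once per word occurrence in the document."""
    for word in contents.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle/sort: gather all values that share a key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_fn(word, counts):
    """Reduce: combine the values for one key; here, sum them."""
    return word, sum(counts)

def word_count(documents):
    mapped = [pair for url, text in documents.items() for pair in map_fn(url, text)]
    return dict(reduce_fn(word, counts) for word, counts in shuffle(mapped).items())

# Hypothetical input: key = document URL, value = document contents
docs = {
    "http://example.com/a": "the quick brown fox",
    "http://example.com/b": "the lazy dog the end",
}
result = word_count(docs)
print(result)
```

In a real cluster the three steps run on different machines; only the data flow is the same.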

Architecture - Cont.

MapReduce Design
● Map: extract something you care about from each record.

Architecture - Cont.

MapReduce Design
● Reduce: aggregate, summarize, filter, or transform mapper output.

Architecture - Cont.

MapReduce Design: Overall View

Architecture - Cont.

MapReduce Design
● Shuffle & Sort: redirect the mapper output to the right reducer.
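How "the right reducer" is chosen can be sketched with hash partitioning. The formula follows the shape of Hadoop's default HashPartitioner, (hash & Integer.MAX_VALUE) % numReduceTasks; the Java-style string hash below is a stand-in chosen for illustration (it matches Java's String.hashCode for ASCII input) and is not the hash Hadoop applies to its own key types.

```python
def string_hash(s):
    """Java-style 32-bit string hash (h = 31*h + char), used here as a stand-in."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret as a signed 32-bit integer, as Java would
    return h - 0x100000000 if h >= 0x80000000 else h

def partition(key, num_reducers):
    """Pick the reducer index for a key: clear the sign bit, then take the modulus."""
    return (string_hash(key) & 0x7FFFFFFF) % num_reducers

# Every occurrence of a key maps to the same reducer, whichever mapper emitted it.
for word in ["the", "quick", "the", "fox"]:
    print(word, "-> reducer", partition(word, 4))
```

Because the partition depends only on the key, all values for one word end up at a single reducer, which is what lets that reducer compute the full sum.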

Case Study: Overall View

Architecture - Cont.

The MapReduce programmer specifies two primary methods:

map(k1, v1) → list<k2, v2>
reduce(k2, list<v2>) → list<k3, v3>

Case Study: Code Example (Map Function)

Case Study: Code Example (Reduce Function)

Hadoop is not limited to Java (Hadoop Streaming)
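With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. The following is a sketch of a word-count pair in Python; putting both roles in one script and selecting them with a "map" or "reduce" command-line argument is an assumption made here for brevity.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum counts per word; the input must already be sorted by key,
    which is what the shuffle/sort step guarantees."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Hypothetical convention: the first argument selects the role.
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    if role in ("map", "reduce"):
        stage = mapper if role == "map" else reducer
        for out in stage(sys.stdin):
            print(out)
```

The two roles would be passed to the streaming jar through its -mapper and -reducer options. Locally, the same flow can be approximated by piping the mapper output through sort into the reducer.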

Architecture - Cont.

Main Modules
● File System (HDFS)
  ⚪ Name Node
  ⚪ Secondary Name Node
  ⚪ Data Node
● MapReduce Framework
  ⚪ Job Tracker
  ⚪ Task Tracker

Architecture - Cont.

Access Procedure
● Read from HDFS
● Write to HDFS
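The read and write procedures can be illustrated with a toy in-memory model. It captures only the division of labour: the Name Node holds metadata (which blocks make up a file and where they live), while the data itself moves between the client and the Data Nodes. All class and function names are hypothetical, replica placement is a plain round-robin, and the client writes every replica itself, whereas real HDFS forwards replicas from one Data Node to the next.

```python
from itertools import cycle

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}                 # block_id -> bytes

class NameNode:
    """Metadata only: file path -> ordered list of (block_id, replica nodes)."""
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.files = {}
        self._next = cycle(range(len(datanodes)))

    def allocate_block(self, path):
        start = next(self._next)
        n = len(self.datanodes)
        targets = [self.datanodes[(start + i) % n]
                   for i in range(min(self.replication, n))]
        block_id = f"{path}#blk{len(self.files.setdefault(path, []))}"
        self.files[path].append((block_id, targets))
        return block_id, targets

    def block_locations(self, path):
        return self.files[path]

def write_file(namenode, path, data, block_size):
    """Write: ask the Name Node where each block goes, then send it to the Data Nodes."""
    for offset in range(0, len(data), block_size):
        block_id, targets = namenode.allocate_block(path)
        for node in targets:
            node.blocks[block_id] = data[offset:offset + block_size]

def read_file(namenode, path):
    """Read: ask the Name Node for block locations, then fetch each block from a replica."""
    return b"".join(replicas[0].blocks[block_id]
                    for block_id, replicas in namenode.block_locations(path))

nodes = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(nodes)
write_file(nn, "/logs/a.txt", b"hello hadoop world", block_size=8)
print(read_file(nn, "/logs/a.txt"))
```

Note that file contents never pass through the Name Node in this model, which mirrors why a single metadata server can front a large cluster.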

Architecture - Cont.

Task Distribution Procedure: The JobTracker chooses the nodes that execute the tasks so as to preserve data locality, i.e. it prefers nodes that already hold a task's input data.
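The data locality principle can be sketched as a greedy assignment: give each task to a node that already stores its input block if one has a free slot, and fall back to any free node otherwise. This is a simplified illustration with hypothetical names, not the JobTracker's actual scheduling algorithm, which also has to account for things this sketch ignores, such as rack topology.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one map task per block.

    block_locations: {block_id: [nodes holding a replica]}
    free_slots:      {node: number of free task slots}
    Returns {block_id: (chosen node, is_data_local)}.
    """
    slots = dict(free_slots)          # work on a copy
    assignment = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        candidates = local or [n for n, s in slots.items() if s > 0]
        if not candidates:
            break                     # no capacity left; remaining tasks wait
        node = max(candidates, key=lambda n: slots[n])
        slots[node] -= 1
        assignment[block] = (node, bool(local))
    return assignment

# Hypothetical cluster state
blocks = {"b1": ["n1"], "b2": ["n2"], "b3": ["n1", "n2"]}
slots = {"n1": 1, "n2": 1, "n4": 1}
plan = assign_tasks(blocks, slots)
print(plan)
```

In this example b1 and b2 run where their data lives, while b3 has to run on n4 and read its block over the network because both replica holders are busy.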

Hadoop Modes

● Standalone
● Pseudo-Distributed
● Fully-Distributed

MapReduce 1 vs. MapReduce 2 (YARN)

Questions

Thanks
