Apache Hadoop - Big Data Engineering

Upload: badr

Post on 16-Apr-2017

Page 1: Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering

Prepared by:
● Islam Elbanna
● Mahmoud Hanafy

Presented by:
● Ahmed Mahran

Outline

1. Introduction
2. History
3. Assumptions
4. Architecture
   a. Case Study
   b. MapReduce Design
   c. Code Example
   d. Main Modules
   e. Access Procedure
5. Hadoop Modes
6. MapReduce 1 vs. MapReduce 2 (YARN)
7. Questions

Introduction

What is Hadoop?

"Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each providing computation and storage."

Open-source software + commodity hardware = IT cost reduction

Introduction - Cont.

Why Hadoop?
● Performance
● Storage
● Scalability
● Fault tolerance
● Cost efficiency (commodity machines)

Introduction - Cont.

What is Hadoop used for?
● Searching
● Log processing
● Recommendation systems
● Analytics
● Video and image analysis

Introduction - Cont.

Who uses Hadoop?
● Amazon
● Facebook
● Google
● IBM
● New York Times
● Yahoo
● Twitter
● LinkedIn
● …

Introduction - Cont.

Hadoop vs. RDBMS

Hadoop                               | RDBMS
Unstructured/structured data         | Structured data
Scale out                            | Scale up
Procedural/functional programming    | Declarative queries
Offline batch processing             | Online/batch transactions
Petabytes                            | Gigabytes
Key-value pairs                      | Predefined fields

Introduction - Cont.

Problem: 20+ billion web pages x 20KB = 400+ terabytes

One computer can read 30-35 MB/sec from disk.
● ~4 months to read the web (time)
● ~1,000 hard drives just to store the web (storage)

Introduction - Cont.

Solution: the same problem on 1,000 machines takes about 3 hours.

But we need:
● Communication and coordination
● Recovery from machine failure
● Status reporting
● Debugging
● Optimization
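The figures on these two slides can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes), the upper read rate of 35 MB/sec, and an ideal cluster in which every machine reads an equal share of the data in parallel with no coordination cost. The function and constant names are illustrative only.

```python
# Back-of-the-envelope check of the slide figures.
# Assumption: decimal units (1 KB = 10**3 bytes, 1 MB = 10**6 bytes).
PAGES = 20e9         # 20 billion web pages
PAGE_SIZE = 20e3     # 20 KB per page, in bytes
READ_RATE = 35e6     # 35 MB/sec sequential disk read, in bytes per second


def total_bytes(pages=PAGES, page_size=PAGE_SIZE):
    """Size of the whole data set in bytes."""
    return pages * page_size


def read_seconds(n_bytes, machines=1, rate=READ_RATE):
    """Ideal read time: the data is split evenly and read in parallel."""
    return n_bytes / (machines * rate)


size = total_bytes()
print(f"data size:      {size / 1e12:.0f} TB")
print(f"1 machine:      {read_seconds(size) / 86400:.0f} days")
print(f"1,000 machines: {read_seconds(size, machines=1000) / 3600:.1f} hours")
```

Under these assumptions a single machine needs roughly 132 days (a little over four months), and 1,000 machines need about 3.2 hours, which is in the same range as the slide's estimate. Real clusters add the overheads listed above, so this is a lower bound rather than a prediction.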

Distributed System

Introduction - Cont.

Distributed systems
● Cluster of machines
● Distributed storage
● Distributed computing

History

● 2002-2004: Started as a sub-project of Apache Nutch.

● 2003-2004: Google published the Google File System (GFS) paper (2003) and the MapReduce paper (2004).

● 2004: Doug Cutting and Mike Cafarella implemented Google's frameworks in Nutch.

● 2006: Yahoo hired Doug Cutting to work on Hadoop with a dedicated team.

● 2008: Hadoop became an Apache top-level project.

Assumptions

● Hardware failure
● Streaming data access
● Large data sets
● Simple coherency model
● Moving computation is cheaper than moving data
● Software platform portability

Architecture

Hadoop is designed and built on two independent frameworks:

Hadoop = HDFS + MapReduce

HDFS: a reliable distributed file system that provides high-throughput access to data.
● Each file is divided into blocks (64 MB by default in Hadoop 1.x; 128 MB in Hadoop 2.x and later)
● Each block is replicated 3 times (default)

MapReduce: a framework for high-performance distributed data processing.
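To make the HDFS defaults concrete, the following sketch estimates how many blocks a file occupies and how much raw disk space its replicas consume. It uses the 64 MB block size and replication factor of 3 from the slide; the function name `hdfs_footprint` is made up for this illustration and is not part of any Hadoop API.

```python
import math

MB = 1024 * 1024


def hdfs_footprint(file_size, block_size=64 * MB, replication=3):
    """Return (number of blocks, block replicas stored, raw bytes used)."""
    blocks = math.ceil(file_size / block_size)
    replicas = blocks * replication
    # The last block only occupies the bytes it actually holds,
    # so raw usage is the file size times the replication factor.
    raw_bytes = file_size * replication
    return blocks, replicas, raw_bytes


blocks, replicas, raw = hdfs_footprint(1000 * MB)
print(blocks, replicas, raw // MB)  # 16 48 3000
```

A 1,000 MB file therefore becomes 16 blocks, stored as 48 block replicas spread over the cluster, and uses about 3,000 MB of raw disk.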

Case Study: Word Count

Problem: We need to calculate word frequencies in billions of web pages.
● Input: Files with one document per record
● Output: List of words and their frequencies across all documents

Case Study: Solution

Architecture - Cont.

MapReduce Design
● Map
● Reduce
● Shuffle & Sort

Case Study: Map Phase

● Specify a map function that takes a key/value pair:
  key = document URL
  value = document contents

● The output of the map function is a set of key/value pairs. In our case, it outputs (word, “1”) once per word in the document.
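A minimal sketch of such a map function in plain Python (not the Hadoop API) is shown here. The tokenization rule (lowercase, split on anything that is not a letter, digit, or apostrophe) and the example URL are assumptions made for the illustration, and the count is emitted as the integer 1 rather than the string “1”.

```python
import re


def map_word_count(url, contents):
    """Map: (document URL, document contents) -> (word, 1) pairs."""
    # Assumed tokenization: lowercase runs of letters, digits, and apostrophes.
    for word in re.findall(r"[a-z0-9']+", contents.lower()):
        yield word, 1


pairs = list(map_word_count("http://example.com/a", "the cat saw the dog"))
print(pairs)  # [('the', 1), ('cat', 1), ('saw', 1), ('the', 1), ('dog', 1)]
```

Note that the mapper does no counting itself: a word that occurs twice is simply emitted twice, and the totals are left to the reduce phase.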

Case Study: Reduce Phase

● The MapReduce library gathers together all pairs with the same key (shuffle/sort).
● The reduce function combines the values for a key. In our case, it computes the sum.
● The output of reduce is one (word, total count) pair per distinct word.
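The reduce phase can be sketched in plain Python as well. Here `shuffle` stands in for the grouping that the MapReduce library performs between the two phases, and the input list is a hand-written sample of mapper output; both names are illustrative, not Hadoop identifiers.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_word_count(word, counts):
    """Reduce: (word, [1, 1, ...]) -> (word, total)."""
    return word, sum(counts)


mapped = [("the", 1), ("cat", 1), ("saw", 1), ("the", 1), ("dog", 1)]
result = [reduce_word_count(word, counts) for word, counts in shuffle(mapped)]
print(result)  # [('cat', 1), ('dog', 1), ('saw', 1), ('the', 2)]
```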

Architecture - Cont.

MapReduce Design
● Map: extract something you care about from each record.

Architecture - Cont.

MapReduce Design
● Reduce: aggregate, summarize, filter, or transform the mapper output.

Architecture - Cont.

MapReduce Design Overall View:

Architecture - Cont.

MapReduce Design
● Shuffle & Sort: redirect the mapper output to the right reducer.
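One common way to decide which reducer is "the right one" is to hash the key and take the result modulo the number of reducers, so that every pair with the same key lands on the same reducer. The sketch below illustrates that idea in plain Python. It uses CRC32 as a stand-in hash (an assumption for this example, chosen because Python's built-in string hash changes between runs), and the function names are hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Pick the reducer for a key; the same key always maps to the same reducer."""
    # Stand-in hash function for the illustration, not Hadoop's own.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapper_output, num_reducers):
    """Route each (key, value) pair to its reducer, then sort each reducer's input by key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in mapper_output:
        buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(bucket, key=lambda pair: pair[0]) for bucket in buckets]


mapper_output = [("the", 1), ("cat", 1), ("saw", 1), ("the", 1), ("dog", 1)]
for reducer_id, bucket in enumerate(shuffle_and_sort(mapper_output, num_reducers=2)):
    print(reducer_id, bucket)
```

Which reducer a given word ends up on depends on the hash, but both ("the", 1) pairs are guaranteed to arrive at the same one, which is what lets that reducer compute the complete total.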

Case Study: Overall View

Architecture - Cont.

MapReduce: the programmer specifies two primary methods:

map(k1, v1) → <k2, v2>
reduce(k2, list<v2>) → <k3, v3>
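To show how these two signatures fit together, here is a small single-process driver that runs any mapper and reducer of that shape over an in-memory list of records. It is a teaching sketch under obvious assumptions (everything fits in memory, no parallelism, no fault tolerance), and `run_job` is a hypothetical name. The example job counts requests per HTTP status code in a made-up three-line web server log.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run map, shuffle/sort, and reduce over an in-memory list of (k1, v1) records."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):           # map(k1, v1) -> <k2, v2>
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):                    # shuffle & sort: one group per key
        output.extend(reducer(k2, groups[k2]))   # reduce(k2, list<v2>) -> <k3, v3>
    return output


# Hypothetical job: count requests per HTTP status code.
def status_mapper(line_number, line):
    yield line.split()[-1], 1   # the status code is the last field of the line


def status_reducer(status, ones):
    yield status, sum(ones)


log = [(0, "GET /index.html 200"), (1, "GET /missing 404"), (2, "GET /about.html 200")]
print(run_job(log, status_mapper, status_reducer))  # [('200', 2), ('404', 1)]
```

The driver never looks inside the keys or values, which is the point of the model: the programmer supplies only the two functions, and the framework owns everything in between.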

Case Study: Code Example
Map Function

Case Study: Code Example
Reduce Function

Hadoop is not only Java (Hadoop Streaming)
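With Hadoop Streaming, the mapper and reducer can be any executables that read lines from standard input and write tab-separated key/value lines to standard output; the framework sorts the mapper output by key before it reaches the reducer. The sketch below shows word count in that style. To stay runnable without a cluster, the two steps are written as functions over iterables of lines and chained locally, which imitates a `mapper | sort | reducer` pipeline; in a real streaming job each function would be fed `sys.stdin` from its own script.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word of input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for: cat input.txt | mapper | sort | reducer
lines = ["the cat saw the dog\n", "the dog ran\n"]
for out in reducer(sorted(mapper(lines))):
    print(out)
```

Unlike the Java API, a streaming reducer does not receive a ready-made list of values per key. It sees one sorted stream of lines and has to detect the key boundaries itself, which is what `groupby` does here.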

Architecture - Cont.

Main Modules
● File System (HDFS)
  ⚪ Name Node
  ⚪ Secondary Name Node
  ⚪ Data Node
● MapReduce Framework
  ⚪ Job Tracker
  ⚪ Task Tracker

Architecture - Cont.

Access Procedure
● Read from HDFS
● Write to HDFS

Architecture - Cont.

Task distribution procedure: The JobTracker chooses the nodes on which to execute the tasks so as to achieve data locality, that is, to run each task on or near a node that already stores its input data.
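The data locality principle can be illustrated with a toy node-selection function. This is a deliberately simplified sketch, not Hadoop's scheduler: it assumes the scheduler knows which nodes hold a replica of the task's input block and how many free task slots each node has, and it ignores rack awareness entirely. All names (`pick_node`, `node1`, and so on) are invented for the example.

```python
def pick_node(block_replicas, free_slots):
    """Choose a node for a map task over one input block.

    block_replicas: names of the nodes that store a replica of the block.
    free_slots: dict mapping node name -> number of free task slots.
    Returns (node, is_local), or (None, False) if no node has a free slot.
    """
    for node in block_replicas:              # prefer a node that already holds the data
        if free_slots.get(node, 0) > 0:
            return node, True
    for node, slots in free_slots.items():   # otherwise fall back to any free node
        if slots > 0:
            return node, False
    return None, False


free = {"node1": 0, "node2": 1, "node3": 2}
print(pick_node(["node1", "node2"], free))  # ('node2', True)
print(pick_node(["node1"], free))           # ('node2', False)
```

In the first call the task runs where its data lives and nothing crosses the network. In the second call the only replica holder is busy, so the block has to be read remotely, which is the case the scheduler tries to avoid.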

Hadoop Modes

● Standalone
● Pseudo-Distributed
● Fully-Distributed

MapReduce 1 vs. MapReduce 2 (YARN)

Questions

Thanks