apache hadoop - big data engineering

Apache HadoopBig Data Engineering

Prepared by:● Islam Elbanna● Mahmoud Hanafy

Presented by:● Ahmed Mahran

Outlines

1. Introduction2. History3. Assumptions 4. Architecture

a. Case Studyb. MapReduce Designc. Code Exampled. Main Modulese. Access Procedure

5. Hadoop Modes6. MapReduce 1 VS MapReduce 2 (YARN)7. Questions

Outlines

Introduction

What is Hadoop?"Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each providing computation and storage"

Open Source software + Hardware commodity = IT Cost reduction

Introduction - Cont.

Why Hadoop ?● Performance● Storage● Scalability● Fault tolerance● Cost efficiency (Commodity Machines)

What is Hadoop used for ?● Searching● Log processing● Recommendation system● Analytics● Video and Image analysis

Who uses Hadoop ?● Amazon● Facebook● Google● IBM● New York Times● Yahoo● Twitter● LinkedIn● …

Hadoop RDBMS

Non-Structured/Structured data Structured data

Scale Out Scale Up

Procedural/Functional programming Declarative Queries

Offline batch processing Online/Batch Transactions

Petabytes Gigabytes

Key Value Pairs Predefined fields

Hadoop Vs RDBMS

Problem: 20+ billion web pages x 20KB = 400+ terabytes

One computer can read 30-35 MB/sec from disk~ Four months to read the web (Time).~1,000 hard drives just to store the web (Storage).

Solution: same problem with 1000 machines < 3 hoursBut we need:● Communication and coordination● Recovering from machine failure● Status reporting● Debugging● Optimization

Distributed System

Distributed systems● Cluster of machines● Distributed Storage● Distributed Computing

Outlines

History

● 2002-2004 Started as a sub-project of Apache Nutch.

● 2003-2004 Google published Google File System (GFS) and MapReduce Framework Paper.

● 2004 Doug Cutting and Mike Cafarella implemented Google’s frameworks in Nutch.

● In 2006 Yahoo hires Doug Cutting to work on Hadoop with a dedicated team.

● In 2008 Hadoop became Apache Top Level Project.

Outlines

Assumptions

● Hardware Failure● Streaming Data Access● Large Data Sets● Simple Coherency Model● Moving Computation is Cheaper than Moving Data● Software Platform Portability

Outlines

Architecture

Hadoop designed and built on two independent frameworks

Hadoop = HDFS + MapReduce

HDFS: is a reliable distributed file system that provides high-throughput access to data.● File divided into blocks 64MB (default)● Each block replicated 3 times (default)

MapReduce: is a framework for performing high performance distributed data processing.

Outlines

Case Study: Word Count

Problem: We need to calculate word frequencies in billions of web pages● Input: Files with one document per

record● Output: List of words and their

frequencies in the whole documents

Case Study: Solution

Outlines

Architecture - Cont.

MapReduce Design● Map● Reduce● Shuffle & Sort

Case Study: Map Phase

● Specify a map function that takes a key/value pairkey = document URLvalue = document contents

● Output of map function is key/value pairs.In our case, output(word, “1”) once per word in the document

Case Study: Reduce Phase● MapReduce library gathers together all pairs with the same key

(shuffle/sort)● The reduce function combines the values for a key

In our case, compute the sum

● Output of reduce will be like that

MapReduce Design● Map: extract

something you care about from each record.

MapReduce Design● Reduce :

aggregate, summarize, filter, or transform mapper output

MapReduce Design Overall View:

MapReduce Design● Shuffle & Sort :

redirect the mapper output to the right reducer

Case Study: Overall View

Outlines

MapReduce Programmer specifies two primary methods:

map(k1, v1) → <k2, v2>reduce(k2, list<v2>) → <k3, v3>

Case Study : Code ExampleMap Function

Case Study : Code ExampleReduce Function

Hadoop not only JAVA (streaming)

Outlines

Main Modules● File System (HDFS)

⚪ Name Node⚪ Secondary Name Node⚪ Data Node

● MapReduce Framework⚪ Job Tracker⚪ Task Tracker

Architecture - Cont.Main Modules● File System (HDFS)

⚪ Name Node⚪ Secondary Name Node⚪ Data Node⚪

Main Modules● MapReduce Framework

⚪ Job Tracker⚪ Task Tracker

Outlines

Access Procedure● Read From HDFS● Write to HDFS

Tasks distribution Procedure:JobTracker choses the nodes to execute the tasks to achieve the data locality principle

Outlines

Hadoop Modes

Hadoop Modes● Standalone● Pseudo-Distributed● Fully-Distributed

Outlines

MapReduce 1 Vs MapReduce 2(YARN)

Outlines

Questions

References

● Book “Hadoop in Action” by Chuck Lam● Book “Hadoop The Definitive Guide” by Tom Wbite● http://hadoop.apache.org/● http://en.wikipedia.org/wiki/Apache_Hadoop ● https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/ ● http://www.slideshare.net/emcacademics/milind-hadoop-trainingbrazil● http://www.slideshare.net/PhilippeJulio/hadoop-architecture● http://www.slideshare.net/rantav/introduction-to-map-reduce● http://www.slideshare.net/sudhakara_st/hadoop-intruduction?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=

&from_search=2● http://www.slideshare.net/ZhijieShen/hadoop-summit-san-jose-2014?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=q

f1&b=&from_search=12● http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig● http://www.slideshare.net/phobeo/introduction-to-data-processing-using-hadoop-and-pig● http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam?qid=a1

4580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=&from_search=1

Thanks

apache hadoop - big data engineering

Software

apache hadoop today & tomorrow - snia€¦ · apache hadoop...

scale-out storage infrastructure for apache hadoop* big...

apache metron and apache spot big data tools for...

accelerating big data processing with hadoop, spark and...

the executive's guide to big data - wwt started with big...

review the practice of hadoop and spark to analyze big...

big data: concepts, techniques et démonstration de apache...

high performance big data (hibd): accelerating hadoop...

big data - hadoop/mapreduce - github pages · enter .....

big data: using arcgis with apache hadoop · big data:...

introduction to apache...

performance-analyse von apache spark und apache...

nosql, apache solr and apache hadoop

hadoop summit sj 2016: next gen big data analytics with...

big neuroscience infrastructure · 2017. 3. 1. · • rdma...

ndn-hadoop: exploring applicability of ndn for big...

apache hadoop hbase

polystore,julia,andproduc3vity inabig*dataworld · big data...

apache hadoop big data technology

applying apache hadoop to nasa's big climate data use cases