Hadoop Release 2.0

DESCRIPTION

A presentation on the features of the Hadoop 2.0 framework.

TRANSCRIPT

Page 1: Hadoop Release 2.0

Page 2: Hadoop Release 2.0

How It Works…

LIVE On-Line classes

Class recordings in Learning Management System (LMS)

Module wise Quizzes, Coding Assignments

24x7 on-demand technical support

Project work on large Datasets

Online certification exam

Lifetime access to the LMS

Complimentary Java Classes

Page 3: Hadoop Release 2.0

Module 1: Understanding Big Data, Hadoop Architecture

Module 2: Introduction to Hadoop 2.x, Data Loading Techniques, Hadoop Project Environment

Module 3: Hadoop MapReduce Framework, Programming in MapReduce

Module 4: Advanced MapReduce, YARN (MRv2) Architecture, Programming in YARN

Module 5: Analytics Using Pig, Understanding Pig Latin

Module 6: Analytics Using Hive, Understanding Hive QL

Module 7: NoSQL Databases, Understanding HBase, ZooKeeper

Module 8: Real-World Datasets and Analysis, Project Discussion

Course Topics

Page 4: Hadoop Release 2.0

What is Big Data?

Limitations of the existing solutions

Solving the problem with Hadoop

Introduction to Hadoop

Hadoop Eco-System

Hadoop Core Components

HDFS Architecture

MapReduce Job execution

Anatomy of a File Write and Read

Hadoop 2.0 (YARN or MRv2) Architecture

Topics for Today

Page 5: Hadoop Release 2.0

Lots of Data (Terabytes or Petabytes)

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

Systems and enterprises generate huge amounts of data, from terabytes to petabytes of information.

NYSE generates about one terabyte of new trade data per day, which is used to perform stock-trading analytics to determine trends for optimal trades.

What Is Big Data?

Page 6: Hadoop Release 2.0

Unstructured Data is Exploding

Page 7: Hadoop Release 2.0

Volume, Velocity, Variety

Characteristics of Big Data

IBM’s Definition

Page 8: Hadoop Release 2.0

Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you guys think and answer my questions.

Annie’s Introduction

Page 9: Hadoop Release 2.0

Annie’s Question

Map the following to corresponding data type:

- XML Files

- Word Docs, PDF files, Text files

- E-Mail body

- Data from Enterprise systems (ERP, CRM etc.)

Page 10: Hadoop Release 2.0

XML Files -> Semi-structured data

Word Docs, PDF files, Text files -> Unstructured Data

E-Mail body -> Unstructured Data

Data from Enterprise systems (ERP, CRM etc.) -> Structured Data

Annie’s Answer

Page 11: Hadoop Release 2.0

More on Big Data: http://www.edureka.in/blog/the-hype-behind-big-data/

Why Hadoop: http://www.edureka.in/blog/why-hadoop/

Opportunities in Hadoop: http://www.edureka.in/blog/jobs-in-hadoop/

Big Data: http://en.wikipedia.org/wiki/Big_Data

IBM’s definition – Big Data Characteristics: http://www-01.ibm.com/software/data/bigdata/

Further Reading

Page 12: Hadoop Release 2.0

Web and e-tailing: Recommendation Engines, Ad Targeting, Search Quality, Abuse and Click Fraud Detection

Telecommunications: Customer Churn Prevention, Network Performance Optimization, Calling Data Record (CDR) Analysis, Analyzing the Network to Predict Failure

http://wiki.apache.org/hadoop/PoweredBy

Common Big Data Customer Scenarios

Page 13: Hadoop Release 2.0

Government: Fraud Detection and Cyber Security, Welfare Schemes, Justice

Healthcare & Life Sciences: Health Information Exchange, Gene Sequencing, Serialization, Healthcare Service Quality Improvements, Drug Safety

http://wiki.apache.org/hadoop/PoweredBy

Common Big Data Customer Scenarios (Contd.)

Page 14: Hadoop Release 2.0

Banks and Financial Services: Modeling True Risk, Threat Analysis, Fraud Detection, Trade Surveillance, Credit Scoring and Analysis

Retail: Point-of-Sale Transaction Analysis, Customer Churn Analysis, Sentiment Analysis

http://wiki.apache.org/hadoop/PoweredBy

Common Big Data Customer Scenarios (Contd.)

Page 15: Hadoop Release 2.0

Insight into data can provide a business advantage.

Some key early indicators can mean fortunes to a business.

More data enables more precise analysis.

Case Study: Sears Holdings Corporation

*Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process customer activity and sales data.

Hidden Treasure

Page 16: Hadoop Release 2.0

http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?

Data flow: Instrumentation → Collection → a storage-only grid holding the original raw data (mostly append), with an ETL compute grid feeding an RDBMS (aggregated data) that serves BI reports and interactive apps.

90% of the ~2 PB is archived in the storage-only grid; a meagre 10% of the ~2 PB of data is available for BI.

1. Can’t explore original high-fidelity raw data

2. Moving data to compute doesn’t scale

3. Premature data death

Limitations of Existing Data Analytics Architecture

Page 17: Hadoop Release 2.0

*Sears moved to a 300-Node Hadoop cluster to keep 100% of its data available for processing rather than a meagre 10% as was the case with existing Non-Hadoop solutions.

Data flow: Instrumentation → Collection (mostly append) → Hadoop: a storage + compute grid providing both storage and processing, with the RDBMS (aggregated data) still serving BI reports and interactive apps. The entire ~2 PB of data is available for processing, and no data is archived.

1. Data exploration & advanced analytics

2. Scalable throughput for ETL & aggregation

3. Keep data alive forever

Solution: A Combined Storage + Compute Layer

Page 18: Hadoop Release 2.0

Differentiating factors: Accessible, Robust, Simple, Scalable

Hadoop Differentiating Factors

Page 19: Hadoop Release 2.0

RDBMS / EDW / MPP / NoSQL vs. Hadoop:

Data Types: structured vs. multi- and unstructured

Processing: limited or no data processing vs. processing coupled with data

Governance: standards & structured vs. loosely structured

Schema: required on write vs. required on read

Speed: reads are fast vs. writes are fast

Cost: software license vs. support only

Resources: known entity vs. growing, complex, wide

Best-Fit Use: interactive OLAP analytics, complex ACID transactions, operational data store vs. data discovery, processing unstructured data, massive storage/processing

Hadoop – It’s about Scale And Structure

Page 20: Hadoop Release 2.0

Read 1 TB of data:

1 machine: 4 I/O channels, each channel 100 MB/s

10 machines: 4 I/O channels each, 100 MB/s

Why DFS?

Page 21: Hadoop Release 2.0

Read 1 TB of data:

1 machine: 4 I/O channels, each channel 100 MB/s → 45 minutes

10 machines: 4 I/O channels each, 100 MB/s

Why DFS?

Page 22: Hadoop Release 2.0

Read 1 TB of data:

1 machine: 4 I/O channels, each channel 100 MB/s → 45 minutes

10 machines: 4 I/O channels each, 100 MB/s → 4.5 minutes

Why DFS?
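
The arithmetic behind these numbers: at 4 I/O channels × 100 MB/s, one machine reads at 400 MB/s, so 1 TB (about 1,000,000 MB) takes roughly 2,500 seconds, i.e. about 42 minutes (rounded to 45 on the slide). Ten machines reading in parallel deliver about 4,000 MB/s in aggregate, cutting this to roughly 250 seconds, about 4.5 minutes.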

Page 23: Hadoop Release 2.0

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

It is an open-source data management framework with scale-out storage and distributed processing.

What Is Hadoop?

Page 24: Hadoop Release 2.0

Hadoop features: Reliable, Economical, Flexible, Scalable

Hadoop Key Characteristics

Page 25: Hadoop Release 2.0

Hadoop is a framework that allows for the distributed processing of:

- Small Data Sets

- Large Data Sets

Annie’s Question

Page 26: Hadoop Release 2.0

Annie’s Answer

Large Data Sets. Hadoop is also capable of processing small data sets; however, to experience the true power of Hadoop, the data needs to be in terabytes, because that is where an RDBMS takes hours (or fails outright) while Hadoop does the same in a couple of minutes.

Page 27: Hadoop Release 2.0

Apache Oozie (workflow)

Hive (DW system), Pig Latin (data analysis), Mahout (machine learning)

MapReduce framework, HBase

HDFS (Hadoop Distributed File System)

Flume and Sqoop (import or export: Flume brings in unstructured or semi-structured data, Sqoop handles structured data)

Hadoop Eco-System

Page 28: Hadoop Release 2.0

Hadoop and MapReduce magic in action

Write intelligent applications using Apache Mahout

https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

LinkedIn Recommendations

Machine Learning with Mahout

Page 29: Hadoop Release 2.0

Hadoop is a system for large-scale data processing.

It has two main components:

HDFS – Hadoop Distributed File System (Storage)

Distributed across “nodes”

Natively redundant

NameNode tracks locations

MapReduce (Processing)

Splits a task across processors, “near” the data, and assembles the results

Self-healing, high bandwidth, clustered storage

JobTracker manages the TaskTrackers

Hadoop Core Components
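
To make the MapReduce component concrete, here is a minimal word-count sketch against the Hadoop 2.x org.apache.hadoop.mapreduce API (the class names are illustrative): the map step runs near the data on each input split, and the reduce step assembles the results.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: emit (word, 1) for every word in this node's input split.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// Reduce step: assemble the per-word counts emitted by the mappers.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(word, new IntWritable(sum));
  }
}
```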

Page 30: Hadoop Release 2.0

Each slave node runs a Data Node (the HDFS cluster) paired with a TaskTracker (the MapReduce engine); the Name Node and the Job Tracker run on the admin node and manage them.

Hadoop Core Components (Contd.)

Page 31: Hadoop Release 2.0

The NameNode holds the metadata (name, replicas, …), e.g. /home/foo/data, 3, …. Clients send metadata ops to the NameNode and read or write blocks directly on the DataNodes; the NameNode sends block ops to the DataNodes, which replicate blocks to one another across racks (Rack 1, Rack 2).

HDFS Architecture

Page 32: Hadoop Release 2.0

NameNode:

the master of the system

maintains and manages the blocks present on the DataNodes

DataNodes:

slaves deployed on each machine that provide the actual storage

responsible for serving read and write requests from the clients

Main Components Of HDFS

Page 33: Hadoop Release 2.0

Metadata in memory: the entire metadata is kept in main memory; there is no demand paging of FS metadata.

Types of metadata: the list of files, the list of blocks for each file, the list of DataNodes for each block, and file attributes (e.g. access time, replication factor).

A transaction log records file creations, file deletions, etc.

The NameNode stores metadata only (e.g. METADATA: /user/doug/hinfo -> 1 3 5, /user/doug/pdetail -> 4 2) and keeps track of the overall file directory structure and the placement of data blocks.

NameNode Metadata
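
Much of this metadata is visible to clients through the FileSystem API. A minimal sketch, assuming a configured client (the path reuses the slide's illustrative example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataPeek {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/user/doug/hinfo")); // illustrative path

    // File attributes kept by the NameNode
    System.out.println("replication = " + st.getReplication());
    System.out.println("access time = " + st.getAccessTime());

    // List of DataNodes for each block of the file
    for (BlockLocation block : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println(block);
    }
  }
}
```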

Page 34: Hadoop Release 2.0

Secondary NameNode:

Not a hot standby for the NameNode

Connects to the NameNode every hour*

Does housekeeping and keeps a backup of the NameNode metadata

The saved metadata can be used to rebuild a failed NameNode

The NameNode remains a single point of failure; the Secondary NameNode’s contract is, in effect: “You give me metadata every hour, I will keep it safe.”

Secondary Name Node

Page 35: Hadoop Release 2.0

The NameNode:

a) is the “Single Point of Failure” in a cluster

b) runs on ‘Enterprise-class’ hardware

c) stores meta-data

d) All of the above

Annie’s Question

Page 36: Hadoop Release 2.0

Annie’s Answer

All of the above. The NameNode stores metadata and runs on reliable, high-quality hardware because it is a single point of failure in a Hadoop cluster.

Page 37: Hadoop Release 2.0

Annie’s Question

When the NameNode fails, the Secondary NameNode takes over instantly and prevents cluster failure:

a) TRUE

b) FALSE

Page 38: Hadoop Release 2.0

Annie’s Answer

False. The Secondary NameNode is used for creating NameNode checkpoints. A NameNode can be manually recovered using the ‘edits’ log and ‘FSImage’ stored by the Secondary NameNode.

Page 39: Hadoop Release 2.0

1. The user copies the input files to DFS.

2. The user submits the job to the client.

3. The client gets the input files’ info from DFS.

4. The client creates the splits.

5. The client uploads the job information (Job.xml, Job.jar) to DFS.

6. The client submits the job to the JobTracker.

JobTracker

Page 40: Hadoop Release 2.0

6. The client submits the job to the JobTracker.

7. The JobTracker initializes the job in the job queue.

8. The JobTracker reads the job files (Job.xml, Job.jar) from DFS.

9. The JobTracker creates the maps and reduces: as many maps as there are input splits.

JobTracker (Contd.)

Page 41: Hadoop Release 2.0

10. The TaskTrackers (on hosts H1–H4) send heartbeats to the JobTracker.

11. The JobTracker picks tasks from the job queue (data-local if possible).

12. The JobTracker assigns the tasks to the TaskTrackers.

JobTracker (Contd.)
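
Steps 1–6 above are what a driver program triggers on the client side. A minimal sketch using the Hadoop 2.x Job API, reusing the illustrative mapper and reducer from the earlier word-count sketch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);             // ships the job jar to DFS
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0])); // input files; splits come from these
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);     // submits the job and waits
  }
}
```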

Page 42: Hadoop Release 2.0

Which of the following daemons does the Hadoop framework pick for scheduling a task?

a) namenode

b) datanode

c) task tracker

d) job tracker

Annie’s Question

Page 43: Hadoop Release 2.0

Annie’s Answer

The JobTracker takes care of all the job scheduling and assigns the tasks to the TaskTrackers.

Page 44: Hadoop Release 2.0

1. The HDFS client calls create on the Distributed File System.

2. The Distributed File System asks the NameNode to create the file.

3. The client writes data.

4. Write packets flow down a pipeline of DataNodes.

5. Ack packets flow back up the pipeline.

6. The client closes the stream.

7. The Distributed File System signals complete to the NameNode.

Anatomy of A File Write

Page 45: Hadoop Release 2.0

1. The HDFS client calls open on the Distributed File System.

2. The Distributed File System gets the block locations from the NameNode.

3. The client calls read.

4. Data is read from the DataNode holding the first blocks.

5. Reading continues from the DataNodes holding the subsequent blocks.

Anatomy of A File Read
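
Both flows run through the same client-side API. A minimal sketch of a write followed by a read using org.apache.hadoop.fs.FileSystem, with an illustrative path and contents:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration()); // the Distributed File System handle
    Path file = new Path("/tmp/hello.txt");              // illustrative path

    // Write: create() goes to the NameNode; data flows down a DataNode pipeline.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
    } // closing the stream leads to the final "complete" to the NameNode

    // Read: open() fetches the block locations; bytes come from the DataNodes.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
  }
}
```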

Page 46: Hadoop Release 2.0

Replication and Rack Awareness
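
The replication factor is a per-file attribute, and the NameNode places the replicas with rack awareness. A minimal sketch of changing it for an existing file through the FileSystem API (the path and factor are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Ask for 3 replicas of each block; the NameNode schedules the extra copies.
    fs.setReplication(new Path("/tmp/hello.txt"), (short) 3);
  }
}
```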

Page 47: Hadoop Release 2.0

In HDFS, blocks of a file are written in parallel; however, the replication of the blocks is done sequentially:

a) TRUE

b) FALSE

Annie’s Question

Page 48: Hadoop Release 2.0

Annie’s Answer

True. A file is divided into blocks; these blocks are written in parallel, but the block replication happens in sequence.

Page 49: Hadoop Release 2.0

Annie’s Question

A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file?

a) The client can read up to the block that was successfully written.

b) The client can read up to the last bit successfully written.

c) An exception will be thrown.

d) The client cannot see the file until it has finished copying.

Page 50: Hadoop Release 2.0

Annie’s Answer

The client can read up to the last successfully written data block. The answer is (a).

Page 51: Hadoop Release 2.0

HDFS: an Active NameNode and a Standby NameNode share edit logs; all namespace edits are logged to shared NFS storage by a single writer (fencing), and the standby reads the edit logs and applies them to its own namespace, covering the checkpointing role of the Secondary NameNode. A DataNode runs on every slave node.

YARN: a Resource Manager arbitrates cluster resources, and a Node Manager on each slave node hosts Containers and per-application App Masters. Clients submit applications to the Resource Manager.

Hadoop 2.x (YARN or MRv2)
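
A minimal sketch of talking to the ResourceManager through the YarnClient API, listing the NodeManagers it is tracking; it assumes a reachable Hadoop 2.x cluster configuration on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListNodes {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new Configuration());
    yarn.start();
    // Ask the ResourceManager for the running NodeManagers and their containers.
    for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
      System.out.println(node.getNodeId() + " containers=" + node.getNumContainers());
    }
    yarn.stop();
  }
}
```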

Page 52: Hadoop Release 2.0

Apache Hadoop and HDFS: http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/

Apache Hadoop HDFS Architecture: http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/

Hadoop 2.0 and YARN: http://www.edureka.in/blog/apache-hadoop-2-0-and-yarn/

Further Reading

Page 53: Hadoop Release 2.0

Set up the Hadoop development environment using the documents present in the LMS.

Hadoop Installation – Setup Cloudera CDH3 Demo VM

Hadoop Installation – Setup Cloudera CDH4 QuickStart VM

Execute Linux Basic Commands

Execute HDFS Hands On commands

Attempt the Module-1 Assignments present in the LMS.

Module-2 Pre-work

Page 54: Hadoop Release 2.0

Thank You! See You in Class Next Week.