Hadoop: An Introduction

By: Rishi Arora

www.rishiarora.com

Companies by Estimated Number of Servers

Source : http://www.ibmbigdatahub.com/infographic/four-vs-big-data

WHY BIG DATA ?

WHY NOW ?

2,500 exabytes of new information in 2012 with Internet as primary driver

Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year

Source : An IDC White Paper- As the Economy Contracts, the Digital Universe

Problems with Big Data?

Disk Reads and Writes Are Slow

1 TB drives are read at about 100 MB/s

Use Disks in Parallel

1 HDD = 100 MB/s

100 HDDs = 10 GB/s

Solution
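The arithmetic behind reading disks in parallel can be sketched as follows. This is a rough estimate only: it assumes decimal units, a perfectly even spread of data across drives, and no network or coordination overhead. The function name `read_time_seconds` is illustrative.

```python
# Back-of-the-envelope read times using the figures from the slide.
DRIVE_SIZE_MB = 1_000_000      # 1 TB in decimal megabytes
THROUGHPUT_MB_PER_S = 100      # sequential read speed of a single drive


def read_time_seconds(data_mb, drives):
    """Time to read data_mb spread evenly over `drives` disks read in parallel."""
    return data_mb / (THROUGHPUT_MB_PER_S * drives)


single = read_time_seconds(DRIVE_SIZE_MB, 1)
parallel = read_time_seconds(DRIVE_SIZE_MB, 100)

print(f"1 drive:    {single:.0f} s (about {single / 3600:.1f} hours)")
print(f"100 drives: {parallel:.0f} s")
```

Under these assumptions one drive needs close to three hours to read 1 TB, while 100 drives working together need under two minutes.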

Problem #2

Hardware Failure

Single Machine Failure

Keep Multiple Copies of Data

Solution
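A minimal sketch of why multiple copies protect against a single machine failure. The round-robin placement, the replication factor of 3, and the node and block names are assumptions made for illustration; this is not how any particular system chooses replica locations.

```python
REPLICATION = 3  # assumed number of copies per block


def place_replicas(block_ids, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Requires replication <= len(nodes) so that the copies land on different nodes.
    """
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_0", "blk_1", "blk_2", "blk_3", "blk_4"]

placement = place_replicas(blocks, nodes)
survivors = readable_blocks(placement, failed_nodes={"node2"})
print("all blocks readable after node2 fails:", survivors == set(blocks))
```

With a single copy per block, the same failure would make every block stored only on the failed node unreadable.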

Problem #3

Merge Data from Different Reads

Keep Multiple Copies of Data

Solution

Only completed results need to be taken into consideration; failed results need to be ignored

Data needs to be compressed to be sent across the network
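The two points above can be sketched together: merge partial results from successful task attempts only, then compress the merged result before sending it over the network. The record layout (`task`, `status`, `output`) and the status strings are hypothetical; `zlib` and `json` are from the Python standard library and stand in for whatever codec and serialization a real cluster would use.

```python
import json
import zlib


def merge_completed(attempts):
    """Sum partial counts from successful attempts; failed attempts are ignored."""
    totals = {}
    for attempt in attempts:
        if attempt["status"] != "SUCCEEDED":
            continue
        for key, value in attempt["output"].items():
            totals[key] = totals.get(key, 0) + value
    return totals


def pack(result):
    """Serialize and compress a result for transfer."""
    return zlib.compress(json.dumps(result, sort_keys=True).encode("utf-8"))


def unpack(payload):
    """Reverse of pack(): decompress and deserialize."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


attempts = [
    {"task": "t1", "status": "SUCCEEDED", "output": {"a": 2, "b": 1}},
    {"task": "t2", "status": "FAILED", "output": {"a": 99}},       # ignored
    {"task": "t2", "status": "SUCCEEDED", "output": {"a": 1, "c": 4}},  # retry
]

merged = merge_completed(attempts)
payload = pack(merged)
print(merged, "->", len(payload), "compressed bytes")
```

The failed attempt of `t2` contributes nothing; only its successful retry is counted.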

DISTRIBUTED

FAULT TOLERANT

SCALABLE

FLEXIBLE

INTELLIGENT

Source : http://www.2wicklers.com/insight-into-my-hadoop-cluster/

Hadoop Components

• HDFS – Distributed File Manager

• Map Reduce

HDFS Overview

• Designed for a modest number of large files (millions instead of billions)

• Sequential access, not random access

• Write once, read many

• Data is split into chunks and stored on multiple nodes as blocks

• Namenode maintains the block locations

• Blocks get replicated over the data nodes

• Single namespace, accessible universally

• Computation is moved to the data – data locality
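The HDFS ideas above (splitting into blocks, a namenode tracking block locations, and scheduling work near the data) can be illustrated with a toy model. Everything here is an assumption for illustration: the class `ToyNameNode`, the 16-byte block size, the two-copy round-robin placement, and the node names are invented and do not reflect the real HDFS API or defaults.

```python
BLOCK_SIZE = 16  # bytes; tiny on purpose, real HDFS blocks are far larger


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into fixed-size chunks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyNameNode:
    """Holds metadata only: which data nodes store which block."""

    def __init__(self):
        self.block_locations = {}

    def register(self, block_id, node):
        self.block_locations.setdefault(block_id, []).append(node)

    def pick_node_for_task(self, block_id, idle_nodes):
        """Prefer an idle node that already stores the block (data locality)."""
        for node in self.block_locations[block_id]:
            if node in idle_nodes:
                return node
        return sorted(idle_nodes)[0]  # no local copy available: remote read


data = b"write once, read many times, sequentially"
blocks = split_into_blocks(data)

namenode = ToyNameNode()
nodes = ["dn1", "dn2", "dn3"]
for block_id in range(len(blocks)):
    # Two copies per block, placed round-robin (an arbitrary toy policy).
    namenode.register(block_id, nodes[block_id % len(nodes)])
    namenode.register(block_id, nodes[(block_id + 1) % len(nodes)])

print(namenode.block_locations)
print(namenode.pick_node_for_task(0, idle_nodes={"dn2", "dn3"}))
```

The namenode never touches the file contents; it only answers the question of where each block lives, which is what lets a scheduler send the computation to a node that already has the data.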

Map Reduce Overview

• Tasks are distributed to multiple nodes

• Each node processes the data stored in that node

• Consists of two phases:

    • Map – reads input data and outputs intermediate keys and values

    • Reduce – values with the same key are sent to the same reducer for further processing
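The two phases can be sketched in a single process with the classic word-count example. This is plain Python, not the Hadoop API: the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative, and the explicit `shuffle` step stands in for the grouping by key that the framework performs between map and reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map: read one input record and emit intermediate (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key so each key reaches exactly one reducer call."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values that share a key."""
    return key, sum(values)


lines = [
    "big data needs big clusters",
    "data moves to compute or compute moves to data",
]

intermediate = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(intermediate).items())
print(counts)
```

On a cluster, each map call would run on the node holding its input block and the grouped values would travel over the network to the reducers.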

Hadoop Ecosystem

[Diagram labels: HDFS, HDFS v2, YARN, and the components below]

• ZooKeeper – Coordinator

• Flume – Log Collector

• Sqoop – Data Exchanger

• Workflow

• Pig – Scripting

• Hive – SQL Query

• Machine Learning

• Column Store

Thank You !!
