Hadoop: An Introduction
By: Rishi Arora
www.rishiarora.com
Companies by Estimated Number of Servers
Source : http://www.ibmbigdatahub.com/infographic/four-vs-big-data
WHY BIG DATA ?
WHY NOW ?
2,500 exabytes of new information in 2012 with Internet as primary driver
Digital universe grew by 62% last year to 800K petabytes and will grow to
1.2 “zettabytes” this year
Source : An IDC White Paper- As the Economy Contracts, the Digital Universe
Problems with Big Data?
Problem #1
Disk Reads and Writes Are Slow
1 TB drives are read at about 100 MB/sec
Use Disks in Parallel
1 HDD = 100 MB/sec
100 HDDs = 10 GB/sec
Solution
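The arithmetic behind this slide can be checked with a short sketch. It assumes decimal units (1 TB = 1,000,000 MB) and data striped perfectly evenly across the drives; real clusters will not reach this ideal. With the slide's figures, one drive needs roughly 10,000 seconds (about 2.8 hours) to read 1 TB, while 100 drives working in parallel need about 100 seconds.

```python
# Back-of-the-envelope read times, using the figures from the slide.
# Assumptions: decimal units (1 TB = 1,000,000 MB) and perfectly even striping.

DRIVE_SIZE_MB = 1_000_000      # one 1 TB drive
THROUGHPUT_MB_PER_SEC = 100    # sustained read speed of a single drive


def read_time_seconds(data_mb: float, num_disks: int) -> float:
    """Time to read data_mb when it is spread evenly over num_disks drives."""
    aggregate_throughput = THROUGHPUT_MB_PER_SEC * num_disks
    return data_mb / aggregate_throughput


one_disk = read_time_seconds(DRIVE_SIZE_MB, 1)
hundred_disks = read_time_seconds(DRIVE_SIZE_MB, 100)

print(f"1 disk:    {one_disk:,.0f} s (about {one_disk / 3600:.1f} hours)")
print(f"100 disks: {hundred_disks:,.0f} s")
```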
Problem #2
Hardware Failure
Single Machine Failure
Keep Multiple Copies of Data
Solution
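Why extra copies help can be sketched with a little probability. The example assumes that machines fail independently and uses a made-up 1% chance that any one machine is down; both are illustrative assumptions, not figures from the slides. Under those assumptions the chance of losing every copy shrinks geometrically with the number of copies.

```python
def probability_all_copies_lost(p_machine_failure: float, copies: int) -> float:
    """Chance that every copy is unavailable at once.

    Assumes each copy lives on a different machine and that machines
    fail independently of one another.
    """
    return p_machine_failure ** copies


# Hypothetical failure probability, chosen only for illustration.
P_FAILURE = 0.01

for copies in (1, 2, 3):
    risk = probability_all_copies_lost(P_FAILURE, copies)
    print(f"{copies} copy/copies: probability of losing all = {risk:.0e}")
```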
Problem #3
Merge Data from Different Reads
Keep Multiple Copies of Data
Solution
Only completed results need to be taken into consideration
and failed results need to be ignored
Data needs to be compressed to be sent across the network
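The two points above can be sketched in plain Python. The task reports, their field names, and the `pack`/`unpack` helpers are hypothetical, and the standard-library `zlib` module stands in for whatever compression codec a real cluster would use. Only reports marked as completed contribute to the merged result; the failed attempt is skipped.

```python
import json
import zlib
from collections import Counter

# Hypothetical task reports: each worker returns a status and partial counts.
task_reports = [
    {"task": "part-0", "status": "completed", "counts": {"error": 3, "warn": 1}},
    {"task": "part-1", "status": "failed", "counts": {"error": 99}},
    {"task": "part-1", "status": "completed", "counts": {"error": 2}},  # retry
    {"task": "part-2", "status": "completed", "counts": {"warn": 4}},
]


def merge_completed(reports):
    """Sum partial counts, skipping any report that did not complete."""
    merged = Counter()
    for report in reports:
        if report["status"] != "completed":
            continue  # failed attempts are ignored entirely
        merged.update(report["counts"])
    return dict(merged)


def pack(result):
    """Serialize and compress a result before sending it over the network."""
    return zlib.compress(json.dumps(result, sort_keys=True).encode("utf-8"))


def unpack(payload):
    """Reverse of pack(): decompress and deserialize."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


merged = merge_completed(task_reports)
payload = pack(merged)
print(merged, len(payload), "bytes on the wire")
```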
DISTRIBUTED
FAULT TOLERANT
SCALABLE
FLEXIBLE
INTELLIGENT
Hadoop Components
HDFS (Distributed File Manager)
MapReduce
HDFS Overview
• Designed for a modest number of large files (millions rather than billions)
• Sequential access, not random access
• Write once, read many
• Data is split into chunks and stored on multiple nodes as blocks
• The NameNode maintains the block locations
• Blocks are replicated across the DataNodes
• Single namespace, universally accessible
• Computation is moved to the data (data locality)
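The block and NameNode ideas from the HDFS overview can be illustrated with a toy model. This is not the HDFS API: the 4-byte block size, the node names, and the round-robin placement are all assumptions chosen to keep the example readable (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
BLOCK_SIZE = 4                            # bytes; tiny on purpose
REPLICATION = 3                           # copies kept of each block
DATANODES = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical node names


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, datanodes=DATANODES, replication=REPLICATION):
    """Toy NameNode metadata: block id -> datanodes holding a replica.

    Round-robin placement; replicas are distinct as long as
    replication <= len(datanodes).
    """
    return {
        block_id: [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(b"write once, read many")
namenode = place_blocks(len(blocks))

for block_id, nodes in namenode.items():
    print(block_id, blocks[block_id], "->", nodes)
```

A scheduler that wants data locality would consult this map and run the task for a block on one of the nodes listed for it.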
Map Reduce Overview
• Tasks are distributed to multiple nodes
• Each node processes the data stored in that node
• Consists of two phases:
• Map – Reads input data and outputs intermediate
keys and values
• Reduce – Values with the same key are sent to
the same reducer for further processing
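The two phases can be sketched as a word count in plain Python. This runs in a single process and does not use the Hadoop API; the function names and the sample lines are illustrative assumptions. The `shuffle` step stands in for the framework's job of routing all values with the same key to the same reducer.

```python
from collections import defaultdict


def map_phase(line):
    """Map: read one input record and emit intermediate (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share a key."""
    return key, sum(values)


lines = ["big data big clusters", "data moves to clusters"]

intermediate = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(word_counts)
```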
HDFS
HDFS v2 YARN
Hadoop Ecosystem
• ZooKeeper: Coordinator
• Flume: Log Collector
• Sqoop: Data Exchanger
• Workflow
• Pig: Scripting
• Hive: SQL Query
• Machine Learning
• Column Store
Thank You !!