Map Reduce & HDFS with Hadoop

Big Data: Hadoop 2.0 Map Reduce / HDFS 2.0 @diego_pacheco Software Architect | Agile Coach

Upload: diego-pacheco

Posted on 20-Jan-2015


TRANSCRIPT

Page 1: Map reduce & HDFS with Hadoop

Big Data: Hadoop 2.0 Map Reduce / HDFS 2.0

@diego_pacheco Software Architect | Agile Coach

Page 2: Map reduce & HDFS with Hadoop

Big Data

Pages 3-10: Map reduce & HDFS with Hadoop (image-only slides, no transcript text)
Page 11: Map reduce & HDFS with Hadoop

Hadoop - Cases

Page 12: Map reduce & HDFS with Hadoop

Hadoop

Page 13: Map reduce & HDFS with Hadoop

HDFS: Hadoop Distributed File System

4000 nodes: 14PB storage

Page 14: Map reduce & HDFS with Hadoop

HDFS – Assumptions and Goals

• Hardware Failure: with hundreds or thousands of machines, failures are expected.

• Streaming Data Access: batch processing; high throughput rather than low latency.

• Large Data Sets: terabytes of data and millions of files in a single instance; works on a cluster and scales out.

• Simple Coherency Model: write-once-read-many (create, read, close, no changes) simplifies coherency and maximizes throughput; perfect for Map/Reduce.

• Moving Computation instead of Moving Data: moving computation is much cheaper with huge data and minimizes network traffic. HDFS moves the computation close to the data.

• Software and Hardware Portability: easily portable.

Page 15: Map reduce & HDFS with Hadoop

HDFS

• Very large distributed file system: 10k nodes, 100M files, 10 PB
• Works with commodity hardware
• File replication: detects and recovers from failures
• Optimized for batch processing
• Files split into 128 MB blocks
• Blocks replicated across N DataNodes
• Data coherency: write once, read many; only appends to existing files
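The block splitting and replication described above can be sketched in a few lines. This is an illustrative model, not the real NameNode logic; the function names and the round-robin placement policy are assumptions made for the example.

```python
# Sketch of HDFS-style block splitting and replica placement.
# Illustrative only; not Hadoop's actual NameNode implementation.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS 2.x default block size

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) pairs covering the whole file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(num_blocks: int, datanodes: list, replication: int = 3):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    placement = []
    for b in range(num_blocks):
        nodes = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        placement.append(nodes)
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
print(len(blocks))                             # 3 blocks: 128 + 128 + 44 MB
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

Note that only the last block is smaller than the block size: a 300 MB file occupies two full 128 MB blocks plus one 44 MB block.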

Page 16: Map reduce & HDFS with Hadoop

HDFS - Architecture

Page 17: Map reduce & HDFS with Hadoop

HDFS 2.0 - Federation

Page 18: Map reduce & HDFS with Hadoop

Hadoop

Page 19: Map reduce & HDFS with Hadoop

Map Reduce

Page 20: Map reduce & HDFS with Hadoop

Today: Parallelism per file

A single LARGE file is processed by a single thread: no parallelism.

Page 21: Map reduce & HDFS with Hadoop

Map/Reduce: Unit of data

Task 0: 0..64 MB | Task 1: 64..128 MB | Task 2: 128..192 MB | Task 3: 192..256 MB

Each task processes one unit of data.
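The per-task byte ranges on the slide follow directly from splitting the input by the unit size. A minimal sketch (the function name is illustrative):

```python
# Sketch: derive per-task byte ranges for a 256 MB input split into 64 MB units.
UNIT = 64 * 1024 * 1024  # 64 MB unit of data per task

def task_ranges(total_bytes: int, unit: int = UNIT):
    """Return (task_id, start, end) tuples covering the input."""
    return [(i, start, min(start + unit, total_bytes))
            for i, start in enumerate(range(0, total_bytes, unit))]

for task_id, start, end in task_ranges(256 * 1024 * 1024):
    print(f"Task {task_id}: {start // 2**20}..{end // 2**20} MB")
# Task 0: 0..64 MB
# Task 1: 64..128 MB
# Task 2: 128..192 MB
# Task 3: 192..256 MB
```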

Page 22: Map reduce & HDFS with Hadoop

Today: Network issue

Page 23: Map reduce & HDFS with Hadoop

Map/Reduce: Local Read

Task 0 on Node 0 (0..64 MB) | Task 1 on Node 1 (64..128 MB) | Task 2 on Node 2 (128..192 MB) | Task 3 on Node 3 (192..256 MB)

• Local read: no need for a network copy
• Data is read from many disks in parallel
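The locality idea above can be sketched as a tiny scheduler that prefers a node already holding a task's block, so the read stays local. The data structures and the greedy policy are assumptions for illustration, not the real Hadoop scheduler.

```python
# Sketch: locality-aware task assignment. Prefer a free node that holds a
# replica of the task's block; fall back to a remote read otherwise.
# Illustrative policy only, not Hadoop's actual scheduling logic.

def assign_tasks(block_locations: dict, free_nodes: set):
    """block_locations: task -> list of nodes holding a replica of its block."""
    assignment = {}
    for task, replicas in block_locations.items():
        local = [n for n in replicas if n in free_nodes]
        node = local[0] if local else next(iter(free_nodes))  # remote fallback
        assignment[task] = node
        free_nodes.discard(node)
    return assignment

locations = {
    "task0": ["node0", "node2"],
    "task1": ["node1", "node3"],
    "task2": ["node2", "node0"],
    "task3": ["node3", "node1"],
}
print(assign_tasks(locations, {"node0", "node1", "node2", "node3"}))
# every task lands on a node holding its block, so all reads are local
```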

Page 24: Map reduce & HDFS with Hadoop

Map/Reduce: The Magic!

A single hard drive reads at 75 MB/s. With 12 hard drives per machine:

12 × 75 MB/s × 4,000 machines ≈ 3.4 TB/second
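The slide's aggregate-throughput arithmetic, spelled out (using the 4,000-node cluster size quoted earlier in the deck):

```python
# Aggregate read throughput: disks per machine x per-disk speed x machines.
disks_per_machine = 12
mb_per_sec_per_disk = 75
machines = 4000

total_mb_per_sec = disks_per_machine * mb_per_sec_per_disk * machines
total_tb_per_sec = total_mb_per_sec / 1024 / 1024  # MB/s -> TB/s
print(round(total_tb_per_sec, 1))  # ≈ 3.4 TB/s
```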

Page 25: Map reduce & HDFS with Hadoop

Map

Page 26: Map reduce & HDFS with Hadoop

Reduce
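The map and reduce phases named on these two slides can be illustrated with the classic word count. This is an in-memory conceptual sketch of the model (map emits key/value pairs, a shuffle groups by key, reduce aggregates each group), not the Hadoop API.

```python
# Minimal in-memory MapReduce word count, mirroring the Hadoop model:
# map emits (word, 1) pairs, shuffle groups them by key, reduce sums each group.
from collections import defaultdict

def map_phase(line: str):
    """Map: emit (word, 1) for every word in the input line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped values for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hadoop stores data in HDFS", "Hadoop runs map reduce over HDFS"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(shuffle(pairs)))  # word -> count; 'hadoop' and 'hdfs' appear twice
```

In Hadoop the same three stages run distributed: mappers execute next to their HDFS blocks, the framework shuffles intermediate pairs across the network, and reducers aggregate per key.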

Page 27: Map reduce & HDFS with Hadoop

Big Data: Hadoop 2.0 Map Reduce / HDFS 2.0

@diego_pacheco Software Architect | Agile Coach

Obrigado! Thank You!