Map Reduce & HDFS with Hadoop
DESCRIPTION
Map Reduce & HDFS with Hadoop
TRANSCRIPT
Big Data: Hadoop 2.0 Map Reduce / HDFS 2.0
@diego_pacheco Software Architect | Agile Coach
Big Data
Hadoop - Cases
Hadoop
HDFS: Hadoop Distributed File System
4,000 nodes: 14 PB of storage
HDFS – Assumptions and Goals
• Hardware Failure: hundreds or thousands of machines; failures are expected, not exceptional.
• Streaming Data Access: built for batch processing; high throughput matters more than low latency.
• Large Data Sets: terabytes of data, millions of files in a single instance; works on a cluster and scales out.
• Simple Coherency Model: write-once-read-many (create, read, close, no changes) maximizes coherency and throughput, a perfect fit for Map/Reduce.
• Moving Computation instead of Moving Data: with huge data sets, moving the computation is far cheaper and minimizes network traffic. HDFS moves the computation close to the data.
• Software and Hardware Portability: easily portable across platforms.
HDFS
• Very large distributed file system: 10k nodes, 100M files, 10 PB
• Works with commodity hardware
• File replication; detects and recovers from failures
• Optimized for batch processing
• Files are split into 128 MB blocks
• Blocks are replicated across N DataNodes
• Data coherency: write once, read many; only append to existing files
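The block-splitting rule above can be sketched in plain Python. This is a conceptual model, not the real HDFS API: it only shows how a file's byte range decomposes into fixed-size blocks, each of which HDFS would then replicate to N DataNodes.

```python
# Sketch (not the real HDFS API): splitting a file into fixed-size blocks.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS 2.x default
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs, one per HDFS block."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Note that the last block is allowed to be smaller than 128 MB; HDFS does not pad files up to a full block.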
HDFS - Architecture
HDFS 2.0 - Federation
Hadoop
Map Reduce
Today: Parallelism per File
A single LARGE file, processed by a single thread: no parallelism.
Map/Reduce: Unit of data
Task 0: 0..64 MB | Task 1: 64..128 MB | Task 2: 128..192 MB | Task 3: 192..256 MB
Each task processes one unit of data.
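The task-per-split layout above is simple arithmetic, sketched here in Python. The `tasks_for` helper and the 64 MB split size are taken from the slide, not from Hadoop's actual `InputFormat` machinery.

```python
# Sketch: one map task per 64 MB input split, as on the slide.
SPLIT = 64 * 1024 * 1024  # 64 MB unit of data per task

def tasks_for(file_size, split=SPLIT):
    """Return (task_id, start_offset, end_offset) for each split."""
    n_tasks = (file_size + split - 1) // split  # ceiling division
    return [(i, i * split, min((i + 1) * split, file_size))
            for i in range(n_tasks)]

# A 256 MB file yields Tasks 0..3 over 0..64, 64..128, 128..192, 192..256 MB.
tasks = tasks_for(256 * 1024 * 1024)
```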
Today: Network issue
Tasks 0..3 process the ranges 0..64 MB, 64..128 MB, 128..192 MB, and 192..256 MB, but those blocks live on Nodes 0..3. A task reading a block stored on another node must copy it over the network first.
Map/Reduce: Local Read
• Local read: no need for a network copy
• Data is read from many disks in parallel
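The local-read idea is a scheduling decision: run each map task on a node that already holds a replica of its block. A minimal sketch, with hypothetical node names (the real Hadoop scheduler is far more involved):

```python
# Sketch: prefer a node that holds a replica, so the map task reads
# from its local disk instead of copying the block over the network.
def pick_node(replica_nodes, free_nodes):
    """Return (chosen_node, is_local_read)."""
    for node in replica_nodes:
        if node in free_nodes:
            return node, True          # data-local: read from local disk
    return next(iter(free_nodes)), False  # remote: needs a network copy

# Block replicated on node1 and node3; only node3 and node4 have free slots.
node, local = pick_node(["node1", "node3"], {"node3", "node4"})
```

With the default replication factor of 3, most tasks find at least one replica on a free node, which is why local reads dominate in practice.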
Map/Reduce: The Magic!
A single hard drive reads at ~75 MB/second.
With 12 hard drives per machine:
12 × 75 MB/second × 4,000 nodes ≈ 3.4 TB/second
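The slide's arithmetic checks out; a quick calculation (using binary units, 1 TB = 1024⁴ bytes):

```python
# Aggregate read bandwidth when every disk in the cluster streams in parallel.
MB = 1024 ** 2
TB = 1024 ** 4

per_disk = 75 * MB      # one drive streams ~75 MB/s
disks_per_node = 12
nodes = 4000

aggregate = per_disk * disks_per_node * nodes
print(aggregate / TB)   # ≈ 3.43 TB/s
```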
Map
Reduce
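The Map and Reduce phases named above can be shown with the canonical word-count example. This is a minimal in-memory sketch of the programming model, not Hadoop itself: map emits (key, value) pairs, a shuffle step groups values by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Shuffle (group values by key), then Reduce (sum each group)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["big data", "big cluster"]))
# counts == {"big": 2, "data": 1, "cluster": 1}
```

In Hadoop the map tasks run in parallel, one per input split, and the framework performs the shuffle across the network before the reduce tasks run.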
Obrigado! Thank You!