Map Reduce & HDFS with Hadoop
DESCRIPTION
Map Reduce & HDFS with Hadoop
TRANSCRIPT
Big Data: Hadoop 2.0 Map Reduce / HDFS 2.0
@diego_pacheco Software Architect | Agile Coach
Big Data
Hadoop - Cases
Hadoop
HDFS: Hadoop Distributed File System
4,000 nodes: 14 PB of storage
HDFS – Assumptions and Goals
• Hardware Failure: hundreds or thousands of machines; failures are expected, not exceptional.
• Streaming Data Access: built for batch processing; high throughput matters more than low latency.
• Large Data Sets: terabytes of data, millions of files in a single instance; works on a cluster and scales out.
• Simple Coherency Model: write-once-read-many (create, read, close, no changes) maximizes coherency and throughput, a perfect fit for Map/Reduce.
• Moving Computation instead of Moving Data: with huge data sets, moving the computation is far cheaper and minimizes network traffic. HDFS moves the computation close to the data.
• Software and Hardware Portability: easily portable across platforms.
HDFS
• Very large distributed file system: 10k nodes, 100M files, 10 PB
• Works with commodity hardware
• File replication; detects and recovers from failures
• Optimized for batch processing
• Files are split into 128 MB blocks
• Blocks are replicated across N DataNodes
• Data coherency: write once, read many; only append to existing files
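The block-splitting rule above can be sketched in plain Python. This is a conceptual model, not the real HDFS API: it only shows how a file's byte range decomposes into fixed-size blocks, each of which HDFS would then replicate to N DataNodes.

```python
# Sketch (not the real HDFS API): splitting a file into fixed-size blocks.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS 2.x default
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs, one per HDFS block."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Note that the last block is allowed to be smaller than 128 MB; HDFS does not pad files up to a full block.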
HDFS - Architecture
HDFS 2.0 - Federation
Hadoop
Map Reduce
Today: Parallelism per File
A single LARGE file, processed by a single thread: no parallelism.
Map/Reduce: Unit of data
Task 0: 0..64 MB | Task 1: 64..128 MB | Task 2: 128..192 MB | Task 3: 192..256 MB
Each task processes one unit of data.
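The task-per-split layout above is simple arithmetic, sketched here in Python. The `tasks_for` helper and the 64 MB split size are taken from the slide, not from Hadoop's actual `InputFormat` machinery.

```python
# Sketch: one map task per 64 MB input split, as on the slide.
SPLIT = 64 * 1024 * 1024  # 64 MB unit of data per task

def tasks_for(file_size, split=SPLIT):
    """Return (task_id, start_offset, end_offset) for each split."""
    n_tasks = (file_size + split - 1) // split  # ceiling division
    return [(i, i * split, min((i + 1) * split, file_size))
            for i in range(n_tasks)]

# A 256 MB file yields Tasks 0..3 over 0..64, 64..128, 128..192, 192..256 MB.
tasks = tasks_for(256 * 1024 * 1024)
```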
Today: Network issue
Tasks 0..3 process the ranges 0..64 MB, 64..128 MB, 128..192 MB, and 192..256 MB, but those blocks live on Nodes 0..3. A task reading a block stored on another node must copy it over the network first.
Map/Reduce: Local Read
• Local read: no need for a network copy
• Data is read from many disks in parallel
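The local-read idea is a scheduling decision: run each map task on a node that already holds a replica of its block. A minimal sketch, with hypothetical node names (the real Hadoop scheduler is far more involved):

```python
# Sketch: prefer a node that holds a replica, so the map task reads
# from its local disk instead of copying the block over the network.
def pick_node(replica_nodes, free_nodes):
    """Return (chosen_node, is_local_read)."""
    for node in replica_nodes:
        if node in free_nodes:
            return node, True          # data-local: read from local disk
    return next(iter(free_nodes)), False  # remote: needs a network copy

# Block replicated on node1 and node3; only node3 and node4 have free slots.
node, local = pick_node(["node1", "node3"], {"node3", "node4"})
```

With the default replication factor of 3, most tasks find at least one replica on a free node, which is why local reads dominate in practice.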
Map/Reduce: The Magic!
A single hard drive reads at ~75 MB/second.
With 12 hard drives per machine:
12 × 75 MB/second × 4,000 nodes ≈ 3.4 TB/second
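The slide's arithmetic checks out; a quick calculation (using binary units, 1 TB = 1024⁴ bytes):

```python
# Aggregate read bandwidth when every disk in the cluster streams in parallel.
MB = 1024 ** 2
TB = 1024 ** 4

per_disk = 75 * MB      # one drive streams ~75 MB/s
disks_per_node = 12
nodes = 4000

aggregate = per_disk * disks_per_node * nodes
print(aggregate / TB)   # ≈ 3.43 TB/s
```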
Map
Reduce
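The Map and Reduce phases named above can be shown with the canonical word-count example. This is a minimal in-memory sketch of the programming model, not Hadoop itself: map emits (key, value) pairs, a shuffle step groups values by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Shuffle (group values by key), then Reduce (sum each group)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["big data", "big cluster"]))
# counts == {"big": 2, "data": 1, "cluster": 1}
```

In Hadoop the map tasks run in parallel, one per input split, and the framework performs the shuffle across the network before the reduce tasks run.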
Obrigado! Thank You!