Hadoop: A Hands-on Introduction



DESCRIPTION

An introduction to Hadoop. This seminar was aimed not at IT engineers but at NLP specialists and cognitive scientists. See the blog post for more information on this presentation.

TRANSCRIPT

Page 1: Hadoop: A Hands-on Introduction

Hadoop: A Hands-on Introduction

Claudio Martella
Elia Bruni

9 November 2011


Page 2: Hadoop: A Hands-on Introduction

Outline

• What is Hadoop

• Why is Hadoop

• How is Hadoop

• Hadoop & Python

• Some NLP code

• A more complicated problem: Eva


Page 3: Hadoop: A Hands-on Introduction

A bit of Context

• 2003: first MapReduce library @ Google

• 2003: GFS paper

• 2004: MapReduce paper

• 2005: Apache Nutch uses MapReduce

• 2006: Hadoop was born

• 2007: first 1,000-node cluster at Y!


Page 4: Hadoop: A Hands-on Introduction

An Ecosystem

• HDFS & MapReduce

• Zookeeper

• HBase

• Pig & Hive

• Mahout

• Giraph

• Nutch

Page 5: Hadoop: A Hands-on Introduction

Traditional way

• Design a high-level Schema

• You store data in an RDBMS

• Which has very poor write throughput

• And doesn’t scale very much

• When you talk about terabytes of data

• Expensive Data Warehouse


Page 6: Hadoop: A Hands-on Introduction

BigData & NoSQL

• Store first, think later

• Schema-less storage

• Analytics

• Petabyte scale

• Offline processing


Page 7: Hadoop: A Hands-on Introduction

Vertical Scalability

• Extremely expensive

• Requires expertise in distributed systems and concurrent programming

• Lacks real fault-tolerance


Page 8: Hadoop: A Hands-on Introduction

Horizontal Scalability

• Built on top of commodity hardware

• Easy-to-use programming paradigms

• Fault-tolerance through replication


Page 9: Hadoop: A Hands-on Introduction

1st Assumptions

• Data to process does not fit on one node.

• Each node is commodity hardware.

• Failure happens.

Spread your data among your nodes and replicate it.


Page 10: Hadoop: A Hands-on Introduction

2nd Assumptions

• Moving computation is cheap.

• Moving data is expensive.

• Distributed computing is hard.

Move computation to data, with simple paradigm.


Page 11: Hadoop: A Hands-on Introduction

3rd Assumptions

• Systems run on spinning hard disks.

• Disk seek >> disk scan.

• Many small files are expensive.

Base the paradigm on scanning large files.


Page 12: Hadoop: A Hands-on Introduction

Typical Problem

• Collect and iterate over many records

• Filter and extract something from each

• Shuffle & sort these intermediate results

• Group-by and aggregate them

• Produce final output set


Page 13: Hadoop: A Hands-on Introduction

Typical Problem

• Collect and iterate over many records

• Filter and extract something from each

• Shuffle & sort these intermediate results

• Group-by and aggregate them

• Produce final output set

MAP: collect and iterate, filter and extract. REDUCE: group-by and aggregate, produce the final output. (The shuffle & sort in between is done by the framework.)


Page 14: Hadoop: A Hands-on Introduction

Quick example

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

• (frank, index.html)

• (index.html, 10/Oct/2000)

• (index.html, http://www.example.com/start.html)
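For example, the first pair above, (frank, index.html), could come out of a very small Hadoop Streaming-style mapper in Python. This is only an illustrative sketch, not code from the talk; the whitespace-based field positions are an assumption that happens to fit the log line above.

#!/usr/bin/env python
# Illustrative mapper: emit (user, requested page) pairs from
# Apache common-log lines read on stdin.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 7:
        continue                      # skip malformed lines
    user = fields[2]                  # e.g. "frank"
    page = fields[6]                  # e.g. "/index.html"
    print("%s\t%s" % (user, page))    # tab-separated key/value for Hadoop Streaming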


Page 15: Hadoop: A Hands-on Introduction

MapReduce

• Programmers define two functions:

★ map (key, value) → (key', value')*

★ reduce (key', [value'+]) → (key'', value'')*

• Can also define:

★ combine (key, value) → (key', value')*

★ partitioner: k' → partition
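As a concrete illustration of the two required functions (a sketch, not code from the slides), here is a word count written for Hadoop Streaming in Python; the file names mapper.py and reducer.py are hypothetical.

#!/usr/bin/env python
# mapper.py -- map(key, value) -> (key', value')*
# Hadoop Streaming feeds one input line at a time on stdin;
# we emit tab-separated (word, 1) pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- reduce(key', [value'+]) -> (key'', value'')*
# Streaming delivers the mapper output sorted by key, so all
# values for the same key arrive on consecutive lines.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word != current and current is not None:
        print("%s\t%d" % (current, count))   # flush the finished key
        count = 0
    current = word
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))

It could then be launched with the streaming jar, roughly: hadoop jar hadoop-streaming.jar -input in/ -output out/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (jar path and options depend on the installation).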


Page 16: Hadoop: A Hands-on Introduction

MapReduce

• Programmers specify two functions:

map (k, v) → <k', v'>*
reduce (k', v') → <k', v'>*
All values with the same key are reduced together.

• Usually, programmers also specify:

partition (k', number of partitions) → partition for k'
Often a simple hash of the key, e.g. hash(k') mod n; this lets reduce operations for different keys run in parallel.
combine (k', v') → <k', v'>*
Mini-reducers that run in memory after the map phase, used as an optimization to reduce network traffic.

• Implementations: Google has a proprietary implementation in C++; Hadoop is an open-source implementation in Java.

[Figure: map tasks turn input pairs (k1,v1) … (k6,v6) into intermediate pairs; shuffle and sort aggregates values by key; reduce tasks emit (r1,s1), (r2,s2), (r3,s3). With the intermediate pairs (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,9), the reducers receive a → [1, 5], b → [2, 7], c → [2, 3, 6, 9].]
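To make the figure concrete, a few lines of plain Python (illustration only, not Hadoop code) can mimic what shuffle-and-sort and a hash partitioner do with those pairs; the toy partition function below is a stand-in for hash(k') mod n.

# Group the figure's intermediate pairs by key, then show which
# reducer a simple partitioner would send each key to.
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("c", 3), ("c", 6),
         ("a", 5), ("c", 2), ("b", 7), ("c", 9)]

grouped = defaultdict(list)
for k, v in pairs:                 # shuffle & sort: aggregate values by key
    grouped[k].append(v)

num_reducers = 3
for k in sorted(grouped):
    # deterministic stand-in for partition(k', n) = hash(k') mod n
    reducer = sum(ord(c) for c in k) % num_reducers
    print(k, grouped[k], "-> reducer", reducer)
# a [1, 5], b [2, 7], c [2, 3, 6, 9]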


Page 17: Hadoop: A Hands-on Introduction

MapReduce daemons

• JobTracker: it's the Master; it schedules jobs, assigns tasks to nodes, collects heart-beats from the workers, and reschedules tasks for fault-tolerance.

• TaskTracker: it's the Worker; it runs on each slave node and runs (multiple) Mappers and Reducers, each in its own JVM.


Page 18: Hadoop: A Hands-on Introduction

[Figure: MapReduce execution overview, redrawn from (Dean and Ghemawat, OSDI 2004). The user program forks a Master and a set of workers; the Master assigns map and reduce tasks; map workers read the input splits and write intermediate files to their local disks; reduce workers remote-read those intermediate files and write the final output files.]

How do we get data to the workers? In a traditional cluster, compute nodes read it over the network from shared storage (NAS/SAN). What's the problem here? Moving that much data to the computation is expensive, so Hadoop co-locates storage and computation and moves the computation to the data instead.


Page 19: Hadoop: A Hands-on Introduction

HDFS daemons

• NameNode: it's the Master; it keeps the filesystem metadata in memory along with the file-to-block-to-node mapping, decides replication and block placement, and collects heart-beats from the nodes.

• DataNode: it's the Slave; it stores the (64MB) blocks of the files and serves reads and writes directly.


Page 20: Hadoop: A Hands-on Introduction

GFS: Design Decisions

• Files stored as chunks: fixed size (64MB).

• Reliability through replication: each chunk replicated across 3+ chunkservers.

• Single master to coordinate access and keep metadata: simple centralized management.

• No data caching: little benefit due to large data sets and streaming reads.

• Simplify the API: push some of the issues onto the client.

[Figure: GFS architecture, redrawn from (Ghemawat et al., SOSP 2003). The application talks to a GFS client, which sends (file name, chunk index) to the GFS master and gets back (chunk handle, chunk locations); it then exchanges (chunk handle, byte range) and chunk data directly with the GFS chunkservers, which store the chunks on their local Linux file systems. The master keeps the file namespace (e.g. /foo/bar → chunk 2ef0), sends instructions to the chunkservers, and receives chunkserver state in return.]


Page 21: Hadoop: A Hands-on Introduction

Transparent to the programmer

• Worker-to-data assignment

• Map / Reduce assignment to nodes

• Management of synchronization

• Management of communication

• Fault-tolerance and restarts


Page 22: Hadoop: A Hands-on Introduction

Take home recipe

• Scan-based computation (no random I/O)

• Big datasets

• Divide-and-conquer class algorithms

• No communication between tasks


Page 23: Hadoop: A Hands-on Introduction

Not good for

• Real-time / Stream processing

• Graph processing

• Computation without locality

• Small datasets


Page 24: Hadoop: A Hands-on Introduction

Questions?


Page 25: Hadoop: A Hands-on Introduction

Baseline solution


Page 26: Hadoop: A Hands-on Introduction

What we attacked

• You don’t want to parse the file many times

• You don’t want to re-calculate the norm

• You don't want to compute 0 * v products for the zero entries


Page 27: Hadoop: A Hands-on Introduction

Our solution

line format: <string> <norm> [<col> <value>]*

Dense matrix (each row is a vector; most entries are zero):

0    1.3  0  0    7.1  1.1
1.2  0    0  0    0    3.4
0    5.7  0  0    1.1  2
5.1  0    0  4.6  0    10
0    0    0  1.6  0    0

Each line stores only the non-zero entries of a row, with their column indices, plus the row's precomputed norm. For example, the fourth row (the vector for "cat") becomes:

cat 12.1313 0 5.1 3 4.6 5 10

i.e. norm ≈ 12.1313, then the pairs (column, value): (0, 5.1), (3, 4.6), (5, 10).
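A small Python sketch (not the code from the talk; the row values below are the fourth row of the example matrix) of how a dense row can be written in this line format:

import math

def to_sparse_line(label, row):
    # Encode one dense vector as: <string> <norm> [<col> <value>]*
    # Only non-zero entries are kept and the norm is computed once, up front.
    nonzero = [(col, val) for col, val in enumerate(row) if val != 0]
    norm = math.sqrt(sum(val * val for _, val in nonzero))
    pairs = " ".join("%d %g" % (col, val) for col, val in nonzero)
    return "%s %.4f %s" % (label, norm, pairs)

print(to_sparse_line("cat", [5.1, 0, 0, 4.6, 0, 10]))
# prints: cat 12.1314 0 5.1 3 4.6 5 10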


Page 28: Hadoop: A Hands-on Introduction

Benchmarking

• serial python (single-core): 7 minutes

• java+hadoop (single-core): 2 minutes

• serial python (big file): 18 days

• java+hadoop (parallel, big file): 8 hours

• it makes sense: 18 days / 3.5 (the single-core Java+Hadoop speed-up, 7 min vs 2 min) ≈ 5.14 days, and 5.14 days / 14 (parallel tasks) ≈ 8.8 hours, close to the measured 8 hours
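A quick back-of-the-envelope check of that last line in Python:

days = 18
single_core_speedup = 7.0 / 2.0       # serial Python vs Java+Hadoop on one core
parallel_tasks = 14
hours = days * 24 / single_core_speedup / parallel_tasks
print(round(hours, 1))                # ~8.8 hours, close to the measured 8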
