Hadoop: A Hands-on Introduction



DESCRIPTION

An introduction to Hadoop. This seminar was aimed not at IT engineers but at NLP specialists and cognitive scientists. See the blog post for more information on this presentation.

TRANSCRIPT

Page 1: Hadoop: A Hands-on Introduction

Hadoop: A Hands-on Introduction

Claudio Martella
Elia Bruni

9 November 2011


Page 2: Hadoop: A Hands-on Introduction

Outline

• What is Hadoop

• Why is Hadoop

• How is Hadoop

• Hadoop & Python

• Some NLP code

• A more complicated problem: Eva


Page 3: Hadoop: A Hands-on Introduction

A bit of Context

• 2003: first MapReduce library @ Google

• 2003: GFS paper

• 2004: MapReduce paper

• 2005: Apache Nutch uses MapReduce

• 2006: Hadoop was born

• 2007: first 1,000-node cluster at Y!


Page 4: Hadoop: A Hands-on Introduction

An Ecosystem

• HDFS & MapReduce

• Zookeeper

• HBase

• Pig & Hive

• Mahout

• Giraph

• Nutch

Page 5: Hadoop: A Hands-on Introduction

Traditional way

• Design a high-level Schema

• You store data in an RDBMS

• Which has very poor write throughput

• And doesn’t scale very much

• When you talk about terabytes of data

• Expensive Data Warehouse


Page 6: Hadoop: A Hands-on Introduction

BigData & NoSQL

• Store first, think later

• Schema-less storage

• Analytics

• Petabyte scale

• Offline processing


Page 7: Hadoop: A Hands-on Introduction

Vertical Scalability

• Extremely expensive

• Requires expertise in distributed systems and concurrent programming

• Lacks real fault-tolerance


Page 8: Hadoop: A Hands-on Introduction

Horizontal Scalability

• Built on top of commodity hardware

• Easy-to-use programming paradigms

• Fault-tolerance through replication


Page 9: Hadoop: A Hands-on Introduction

1st Assumptions

• Data to process does not fit on one node.

• Each node is commodity hardware.

• Failure happens.

Spread your data among your nodes and replicate it.


Page 10: Hadoop: A Hands-on Introduction

2nd Assumptions

• Moving computation is cheap.

• Moving data is expensive.

• Distributed computing is hard.

Move computation to data, with simple paradigm.


Page 11: Hadoop: A Hands-on Introduction

3rd Assumptions

• Systems run on spinning hard disks.

• Disk seek >> disk scan.

• Many small files are expensive.

Base the paradigm on scanning large files.


Page 12: Hadoop: A Hands-on Introduction

Typical Problem

• Collect and iterate over many records

• Filter and extract something from each

• Shuffle & sort these intermediate results

• Group-by and aggregate them

• Produce final output set


Page 13: Hadoop: A Hands-on Introduction

Typical Problem

• Collect and iterate over many records

• Filter and extract something from each

• Shuffle & sort these intermediate results

• Group-by and aggregate them

• Produce final output set

MAP: collect and iterate, filter and extract. REDUCE: group-by and aggregate, produce the final output. (The shuffle & sort in between is done by the framework.)


Page 14: Hadoop: A Hands-on Introduction

Quick example

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

• (frank, index.html)

• (index.html, 10/Oct/2000)

• (index.html, http://www.example.com/start.html)
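For example, the first pair above, (frank, index.html), could come out of a very small Hadoop Streaming-style mapper in Python. This is only an illustrative sketch, not code from the talk; the whitespace-based field positions are an assumption that happens to fit the log line above.

#!/usr/bin/env python
# Illustrative mapper: emit (user, requested page) pairs from
# Apache common-log lines read on stdin.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 7:
        continue                      # skip malformed lines
    user = fields[2]                  # e.g. "frank"
    page = fields[6]                  # e.g. "/index.html"
    print("%s\t%s" % (user, page))    # tab-separated key/value for Hadoop Streaming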


Page 15: Hadoop: A Hands-on Introduction

MapReduce

• Programmers define two functions:

★ map (key, value) → (key', value')*

★ reduce (key', [value'+]) → (key'', value'')*

• Can also define:

★ combine (key, value) → (key', value')*

★ partitioner: k' → partition
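As a concrete illustration of the two required functions (a sketch, not code from the slides), here is a word count written for Hadoop Streaming in Python; the file names mapper.py and reducer.py are hypothetical.

#!/usr/bin/env python
# mapper.py -- map(key, value) -> (key', value')*
# Hadoop Streaming feeds one input line at a time on stdin;
# we emit tab-separated (word, 1) pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- reduce(key', [value'+]) -> (key'', value'')*
# Streaming delivers the mapper output sorted by key, so all
# values for the same key arrive on consecutive lines.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word != current and current is not None:
        print("%s\t%d" % (current, count))   # flush the finished key
        count = 0
    current = word
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))

It could then be launched with the streaming jar, roughly: hadoop jar hadoop-streaming.jar -input in/ -output out/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (jar path and options depend on the installation).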


Page 16: Hadoop: A Hands-on Introduction

MapReduce

• Programmers specify two functions:

map (k, v) → <k', v'>*
reduce (k', v') → <k', v'>*
All values with the same key are reduced together.

• Usually, programmers also specify:

partition (k', number of partitions) → partition for k'
Often a simple hash of the key, e.g. hash(k') mod n; this lets reduce operations for different keys run in parallel.
combine (k', v') → <k', v'>*
Mini-reducers that run in memory after the map phase, used as an optimization to reduce network traffic.

• Implementations: Google has a proprietary implementation in C++; Hadoop is an open-source implementation in Java.

[Figure: map tasks turn input pairs (k1,v1) … (k6,v6) into intermediate pairs; shuffle and sort aggregates values by key; reduce tasks emit (r1,s1), (r2,s2), (r3,s3). With the intermediate pairs (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,9), the reducers receive a → [1, 5], b → [2, 7], c → [2, 3, 6, 9].]
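To make the figure concrete, a few lines of plain Python (illustration only, not Hadoop code) can mimic what shuffle-and-sort and a hash partitioner do with those pairs; the toy partition function below is a stand-in for hash(k') mod n.

# Group the figure's intermediate pairs by key, then show which
# reducer a simple partitioner would send each key to.
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("c", 3), ("c", 6),
         ("a", 5), ("c", 2), ("b", 7), ("c", 9)]

grouped = defaultdict(list)
for k, v in pairs:                 # shuffle & sort: aggregate values by key
    grouped[k].append(v)

num_reducers = 3
for k in sorted(grouped):
    # deterministic stand-in for partition(k', n) = hash(k') mod n
    reducer = sum(ord(c) for c in k) % num_reducers
    print(k, grouped[k], "-> reducer", reducer)
# a [1, 5], b [2, 7], c [2, 3, 6, 9]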


Page 17: Hadoop: A Hands-on Introduction

MapReduce daemons

• JobTracker: it's the Master; it schedules jobs, assigns tasks to nodes, collects heart-beats from the workers, and reschedules tasks for fault-tolerance.

• TaskTracker: it's the Worker; it runs on each slave node and runs (multiple) Mappers and Reducers, each in its own JVM.


Page 18: Hadoop: A Hands-on Introduction

[Figure: MapReduce execution overview, redrawn from (Dean and Ghemawat, OSDI 2004). The user program forks a Master and a set of workers; the Master assigns map and reduce tasks; map workers read the input splits and write intermediate files to their local disks; reduce workers remote-read those intermediate files and write the final output files.]

How do we get data to the workers? In a traditional cluster, compute nodes read it over the network from shared storage (NAS/SAN). What's the problem here? Moving that much data to the computation is expensive, so Hadoop co-locates storage and computation and moves the computation to the data instead.


Page 19: Hadoop: A Hands-on Introduction

HDFS daemons

• NameNode: it's the Master; it keeps the filesystem metadata in memory along with the file-to-block-to-node mapping, decides replication and block placement, and collects heart-beats from the nodes.

• DataNode: it's the Slave; it stores the (64MB) blocks of the files and serves reads and writes directly.


Page 20: Hadoop: A Hands-on Introduction

GFS: Design Decisions

• Files stored as chunks: fixed size (64MB).

• Reliability through replication: each chunk replicated across 3+ chunkservers.

• Single master to coordinate access and keep metadata: simple centralized management.

• No data caching: little benefit due to large data sets and streaming reads.

• Simplify the API: push some of the issues onto the client.

[Figure: GFS architecture, redrawn from (Ghemawat et al., SOSP 2003). The application talks to a GFS client, which sends (file name, chunk index) to the GFS master and gets back (chunk handle, chunk locations); it then exchanges (chunk handle, byte range) and chunk data directly with the GFS chunkservers, which store the chunks on their local Linux file systems. The master keeps the file namespace (e.g. /foo/bar → chunk 2ef0), sends instructions to the chunkservers, and receives chunkserver state in return.]


Page 21: Hadoop: A Hands-on Introduction

Transparent to the programmer

• Worker-to-data assignment

• Map / Reduce assignment to nodes

• Management of synchronization

• Management of communication

• Fault-tolerance and restarts


Page 22: Hadoop: A Hands-on Introduction

Take home recipe

• Scan-based computation (no random I/O)

• Big datasets

• Divide-and-conquer class algorithms

• No communication between tasks


Page 23: Hadoop: A Hands-on Introduction

Not good for

• Real-time / Stream processing

• Graph processing

• Computation without locality

• Small datasets


Page 24: Hadoop: A Hands-on Introduction

Questions?


Page 25: Hadoop: A Hands-on Introduction

Baseline solution


Page 26: Hadoop: A Hands-on Introduction

What we attacked

• You don’t want to parse the file many times

• You don’t want to re-calculate the norm

• You don't want to compute 0 * v products for the zero entries


Page 27: Hadoop: A Hands-on Introduction

Our solution

line format: <string> <norm> [<col> <value>]*

Dense matrix (each row is a vector; most entries are zero):

0    1.3  0  0    7.1  1.1
1.2  0    0  0    0    3.4
0    5.7  0  0    1.1  2
5.1  0    0  4.6  0    10
0    0    0  1.6  0    0

Each line stores only the non-zero entries of a row, with their column indices, plus the row's precomputed norm. For example, the fourth row (the vector for "cat") becomes:

cat 12.1313 0 5.1 3 4.6 5 10

i.e. norm ≈ 12.1313, then the pairs (column, value): (0, 5.1), (3, 4.6), (5, 10).
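A small Python sketch (not the code from the talk; the row values below are the fourth row of the example matrix) of how a dense row can be written in this line format:

import math

def to_sparse_line(label, row):
    # Encode one dense vector as: <string> <norm> [<col> <value>]*
    # Only non-zero entries are kept and the norm is computed once, up front.
    nonzero = [(col, val) for col, val in enumerate(row) if val != 0]
    norm = math.sqrt(sum(val * val for _, val in nonzero))
    pairs = " ".join("%d %g" % (col, val) for col, val in nonzero)
    return "%s %.4f %s" % (label, norm, pairs)

print(to_sparse_line("cat", [5.1, 0, 0, 4.6, 0, 10]))
# prints: cat 12.1314 0 5.1 3 4.6 5 10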


Page 28: Hadoop: A Hands-on Introduction

Benchmarking

• serial python (single-core): 7 minutes

• java+hadoop (single-core): 2 minutes

• serial python (big file): 18 days

• java+hadoop (parallel, big file): 8 hours

• it makes sense: 18 days / 3.5 (the single-core Java+Hadoop speed-up, 7 min vs 2 min) ≈ 5.14 days, and 5.14 days / 14 (parallel tasks) ≈ 8.8 hours, close to the measured 8 hours
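A quick back-of-the-envelope check of that last line in Python:

days = 18
single_core_speedup = 7.0 / 2.0       # serial Python vs Java+Hadoop on one core
parallel_tasks = 14
hours = days * 24 / single_core_speedup / parallel_tasks
print(round(hours, 1))                # ~8.8 hours, close to the measured 8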
