
Page 1:

Hadoop/MapReduce as a Platform for Data-Intensive Computing

Jimmy Lin, University of Maryland (currently at Twitter)

Friday, December 2, 2011

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Page 2:

Our World: Large Data

Source: Wikipedia (Hard disk drive)

Page 3:

How much data?

9 PB of user data + >50 TB/day (11/2011)

Google processes 20 PB a day (2008)

36 PB of user data + 80-90 TB/day (6/2010)

Wayback Machine: 3 PB + 100 TB/month (3/2009)

LHC: ~15 PB a year (at full capacity)

LSST: 6-10 PB a year (~2015)

640K ought to be enough for anybody.

150 PB on 50k+ servers running 15k apps

S3: 449B objects, peak 290k requests/second (7/2011)

Page 4:

Source: Wikipedia (Everest)

Why large data? Science, Engineering, Commerce

Page 5:

Science: emergence of the 4th Paradigm

Data-intensive e-Science

Maximilien Brice, © CERN

Page 6:

Engineering: the unreasonable effectiveness of data

Count and normalize!

Source: Wikipedia (Three Gorges Dam)

Page 7:

Commerce: know thy customers

Data → Insights → Competitive advantages

Source: Wikipedia (Shinjuku, Tokyo)

Page 8:

cheap commodity clusters + simple, distributed programming models
= data-intensive computing for the masses!

(or utility computing)

Source: flickr (turtlemom_nancy/2046347762)

Data nirvana requires the right infrastructure: store, manage, organize, analyze, distribute, visualize, …

How large data? Why large data?

Page 9:

Divide and conquer (divide et impera): chop the problem into smaller parts, then combine the partial results.

[Figure: a unit of “Work” is partitioned into w1, w2, w3; workers process the partitions in parallel, producing partial results r1, r2, r3, which are combined into the final “Result”.]

Source: Wikipedia (Forest)

Page 10:

Parallel computing is hard!

[Figure: two traditional models. Message passing: processors P1-P5 exchange messages directly. Shared memory: processors P1-P5 read and write a common memory.]

Different programming models

Different programming constructs: mutexes, condition variables, barriers, …; masters/slaves, producers/consumers, work queues, …

Fundamental issues: scheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, …

Common problems: livelock, deadlock, data starvation, priority inversion, …; dining philosophers, sleeping barbers, cigarette smokers, …

Architectural issues: Flynn’s taxonomy (SIMD, MIMD, etc.), network topology, bisection bandwidth, UMA vs. NUMA, cache coherence

The reality: the programmer shoulders the burden of managing concurrency… (I want my students developing new algorithms, not debugging race conditions)

Page 11:

Source: Ricardo Guimarães Herrmann

Page 12:

Source: MIT OpenCourseWare

Page 13:

Page 14:

Source: NY Times (6/14/2006)

The datacenter is the computer!

Page 15:

MapReduce

Page 16:

MapReduce: functional programming meets distributed processing

Independent per-record processing in parallel; aggregation of intermediate results to generate the final output

Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
All values with the same key are sent to the same reducer.

The execution framework handles everything else: scheduling; data management and transport; synchronization; errors and faults.

Recall “count and normalize”? Perfect!
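To ground the abstraction, here is a minimal word-count sketch against Hadoop’s Java MapReduce API, the “count” half of count-and-normalize (the class name and the choice to use a combiner are illustrative, not prescribed by the slides):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // map(k, v) -> (word, 1)*: one intermediate pair per token
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce(word, [1, 1, ...]) -> (word, count): the framework guarantees
  // all values for a key arrive at the same reducer, so summing is safe
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run with two arguments, an input directory and a nonexistent output directory. Reusing the reducer as a combiner is safe here because summation is associative and commutative.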

Page 17:

[Figure: the MapReduce dataflow. Four mappers consume input pairs (k1, v1) … (k6, v6) and emit intermediate pairs: (a, 1) (b, 2); (c, 3) (c, 6); (a, 5) (c, 2); (b, 7) (c, 8). Shuffle and sort aggregates values by key: a → [1, 5], b → [2, 7], c → [2, 3, 6, 8]. Three reducers process the groups and emit output pairs (r1, s1), (r2, s2), (r3, s3).]

Page 18:

[Figure: MapReduce execution overview. (1) The user program submits the job to the master. (2) The master schedules map tasks and reduce tasks onto workers. (3) Map-phase workers read the input files as splits 0-4 and (4) write intermediate files to local disk. (5) Reduce-phase workers remotely read the intermediate files and (6) write the output files (output file 0, output file 1).]

Adapted from Dean and Ghemawat (OSDI 2004)


Page 19:

MapReduce Implementations

Google has a proprietary implementation in C++, with bindings in Java and Python.

Hadoop is an open-source implementation in Java. Development was led by Yahoo; it is used in production, is now an Apache project, and has a rapidly expanding software ecosystem.

Page 20:

Source: Wikipedia (Rosetta Stone)

Statistical Machine Translation (Chris Dyer)

Page 21:

[Figure: the statistical machine translation pipeline. Training data (parallel sentences, e.g. “vi la mesa pequeña” / “i saw the small table”) feeds word alignment and then phrase extraction, which yields phrase pairs such as (vi, i saw) and (la mesa pequeña, the small table) for the translation model. Target-language text (e.g. “he sat at the table”, “the service was good”) trains the language model. The decoder combines the two models to translate a foreign input sentence, “maria no daba una bofetada a la bruja verde”, into an English output sentence, “mary did not slap the green witch”.]

\hat{e}_1^I = \operatorname*{argmax}_{e_1^I} P(e_1^I \mid f_1^J) = \operatorname*{argmax}_{e_1^I} P(e_1^I)\,P(f_1^J \mid e_1^I)
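The second equality is one application of Bayes’ rule, a standard step worth making explicit (nothing here is specific to these slides):

\operatorname*{argmax}_{e_1^I} P(e_1^I \mid f_1^J)
  = \operatorname*{argmax}_{e_1^I} \frac{P(e_1^I)\,P(f_1^J \mid e_1^I)}{P(f_1^J)}
  = \operatorname*{argmax}_{e_1^I} P(e_1^I)\,P(f_1^J \mid e_1^I)

since P(f_1^J) is constant with respect to e_1^I. The first factor, P(e_1^I), is the language model; the second, P(f_1^J \mid e_1^I), is the translation model.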

Page 22:

Translation as a Tiling Problem

[Figure: the Spanish input “Maria no dio una bofetada a la bruja verde” is covered with overlapping candidate phrase tiles: Mary; not; did not; no; did not give; give a slap to the witch green; slap; a slap; to the; to; the; green witch; the witch; by. The decoder searches for a consistent tiling, here yielding “Mary did not slap the green witch”.]

\hat{e}_1^I = \operatorname*{argmax}_{e_1^I} P(e_1^I \mid f_1^J) = \operatorname*{argmax}_{e_1^I} P(e_1^I)\,P(f_1^J \mid e_1^I)

Page 23:

The Data Bottleneck

“Every time I fire a linguist, the performance of our … system goes up.”- Fred Jelinek

Page 24:

[Figure: the statistical machine translation pipeline again, as on page 21: training data → word alignment → phrase extraction → translation model; target-language text → language model; both feed the decoder, which translates “maria no daba una bofetada a la bruja verde” into “mary did not slap the green witch”.]

We’ve built MapReduce implementations of these two components! (2008)

\hat{e}_1^I = \operatorname*{argmax}_{e_1^I} P(e_1^I \mid f_1^J) = \operatorname*{argmax}_{e_1^I} P(e_1^I)\,P(f_1^J \mid e_1^I)

Page 25:

HMM Alignment: Giza

[Figure: HMM word-alignment training time for Giza on a single-core commodity server.]

Page 26:

HMM Alignment: MapReduce

[Figure: HMM word-alignment training time: the MapReduce implementation on a 38-processor cluster vs. Giza on a single-core commodity server.]

Page 27:

HMM Alignment: MapReduce

[Figure: the same comparison against an idealized baseline of 1/38 of the single-core commodity server’s time, i.e., perfect linear speedup.]

Page 28:

What’s the point? The optimally parallelized version doesn’t exist!

MapReduce occupies a sweet spot in the design space for a large class of problems:
Fast… in terms of running time and scaling characteristics
Easy… in terms of programming effort
Cheap… in terms of hardware costs

Page 29:

Source: Wikipedia (DNA)

Sequence Assembly (Michael Schatz)

Page 30:

Strangely-Formatted Manuscript: Dickens, A Tale of Two Cities

Text written on a long spool

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

Page 31:

… With Duplicates: Dickens, A Tale of Two Cities

“Backup” on four more copies

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

Page 32:

Shredded Book Reconstruction: Dickens accidentally shreds the manuscript

How can he reconstruct the text? 5 copies × 138,656 words / 5 words per fragment = 138k fragments. The short fragments from every copy are mixed together; some fragments are identical.

[Figure: the five copies, each shredded into five-word fragments and interleaved into one jumbled pile; because the text repeats itself, many fragments from different copies are identical. For reference, each intact copy reads: “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …”]

Page 33:

Greedy Assembly

[Figure: greedy assembly joins fragments by maximal overlap. The distinct fragments in play: It was the best of / was the best of times, / the best of times, it / best of times, it was / of times, it was the / times, it was the worst / times, it was the age / it was the worst of / the worst of times, it / it was the age of / was the age of wisdom, / was the age of foolishness, / the age of wisdom, it / age of wisdom, it was / of wisdom, it was the. After “of times, it was the”, both “times, it was the worst” and “times, it was the age” extend the chain equally well.]

The repeated sequences make the correct reconstruction ambiguous!

Alternative: model sequence reconstruction as a graph problem…

Page 34:

de Bruijn Graph Construction

D_k = (V, E)
V = all length-k subfragments (k < l)
E = directed edges between consecutive subfragments (nodes overlap by k−1 words)

The locally constructed graph reveals the global structure; overlaps between sequences are computed implicitly.

Original fragment: It was the best of
Directed edge: It was the best → was the best of

de Bruijn, 1946; Idury and Waterman, 1995; Pevzner, Tang, and Waterman, 2001
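A minimal sketch of this construction in plain Java (illustrative, not the talk’s MapReduce version; the class and method names are mine): each fragment contributes its length-k word windows as nodes and a directed edge between each consecutive pair of windows.

import java.util.*;

public class DeBruijnBuild {
  // Nodes are length-k word windows; an edge links each window to the next
  // (the two windows overlap by k-1 words). Duplicate edges collapse in the
  // set, which is how identical fragments from different copies merge.
  public static Map<String, Set<String>> build(List<String> fragments, int k) {
    Map<String, Set<String>> adj = new LinkedHashMap<>();
    for (String fragment : fragments) {
      String[] w = fragment.trim().split("\\s+");
      for (int i = 0; i + k < w.length; i++) {
        String u = String.join(" ", Arrays.copyOfRange(w, i, i + k));
        String v = String.join(" ", Arrays.copyOfRange(w, i + 1, i + k + 1));
        adj.computeIfAbsent(u, node -> new LinkedHashSet<>()).add(v);
      }
    }
    return adj;
  }

  public static void main(String[] args) {
    // The slide's example: a five-word fragment with k = 4 yields one edge.
    Map<String, Set<String>> g = build(Arrays.asList("It was the best of"), 4);
    System.out.println(g);  // {It was the best=[was the best of]}
  }
}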

Page 35:

de Bruijn Graph Assembly

[Figure: the word-level de Bruijn graph for the shredded text. Nodes: It was the best; was the best of; the best of times,; best of times, it; of times, it was; times, it was the; it was the worst; was the worst of; the worst of times,; worst of times, it; it was the age; was the age of; the age of wisdom,; age of wisdom, it; of wisdom, it was; wisdom, it was the; the age of foolishness.]

A unique Eulerian tour of the graph reconstructs the original text. If a unique tour does not exist, try to simplify the graph as much as possible.
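To make the tour concrete, here is a hedged sketch of standard Hierholzer’s algorithm (not code from the talk): it consumes every edge of a directed multigraph exactly once, assuming adjacency is given as queues of unused successors and that an Eulerian path from the start node actually exists. Stitching consecutive nodes back together (they overlap by k−1 words) then reproduces the text.

import java.util.*;

public class EulerianPath {
  // Hierholzer's algorithm: walk until stuck, backtrack, splice in detours.
  public static List<String> path(Map<String, Deque<String>> adj, String start) {
    Deque<String> stack = new ArrayDeque<>();
    List<String> tour = new ArrayList<>();
    stack.push(start);
    while (!stack.isEmpty()) {
      String v = stack.peek();
      Deque<String> out = adj.get(v);
      if (out != null && !out.isEmpty()) {
        stack.push(out.poll());   // follow an unused edge
      } else {
        tour.add(stack.pop());    // no edges left here: emit and backtrack
      }
    }
    Collections.reverse(tour);    // nodes in visit order, every edge used once
    return tour;
  }
}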

Page 36:

de Bruijn Graph Assembly

[Figure: the same graph after compressing unbranched paths. Nodes: It was the best of times, it; of times, it was the; it was the worst of times, it; it was the age of; the age of wisdom, it was the; the age of foolishness.]

A unique Eulerian tour of the graph reconstructs the original text. If a unique tour does not exist, try to simplify the graph as much as possible.

Page 37:

[Figure: a sequencer samples the subject genome into a pile of short, overlapping reads, e.g. GATGCTTACTATGCGGGCCCC, CGGTCTAATGCTTACTATGC, AATGCTTACTATGCGGGCCCCTT, …, some containing errors; assembly must reconstruct the genome from the reads.]

Human genome: 3 Gbp. A few billion short reads (~100 GB compressed data).

Page 38:

Short Read Assembly

Genome assembly as finding an Eulerian tour of the de Bruijn graph. Human genome: >3B nodes, >10B edges.

Present short-read assemblers require tremendous computation:
Velvet (serial): >2 TB of RAM (Zerbino & Birney, 2008)
ABySS (MPI): 168 cores × ~96 hours (Simpson et al., 2009)
SOAPdenovo (pthreads): 40 cores × 40 hours, >140 GB RAM (Li et al., 2010)

Can we get by with MapReduce on commodity clusters? Horizontal scaling out in the cloud!

Page 39:

Graph Compression

Challenges:
Nodes are stored on different machines
Nodes can only access their direct neighbors

Randomized solution:
Randomly assign H (heads) / T (tails) to each compressible node
Compress H → T links
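A hedged sketch of one such round (my reading of the randomized scheme on the slide, not Contrail’s actual code; the data layout of successor pointers and node labels is assumed): every chain node flips a coin, and each head absorbs a tail successor, so each decision needs only neighbor-local state, which is exactly what one MapReduce round can supply.

import java.util.*;

public class CompressionRound {
  // next: successor pointers along compressible (unbranched) chains;
  // label: the sequence text each node currently stands for.
  // One round: flip H/T per node; an H node whose successor flipped T
  // absorbs that successor. A node takes part in at most one merge per
  // round, so all merges commute and order of iteration does not matter.
  public static void round(Map<String, String> next,
                           Map<String, String> label, Random rng) {
    Map<String, Boolean> heads = new HashMap<>();
    for (String node : label.keySet()) heads.put(node, rng.nextBoolean());
    for (String u : new ArrayList<>(label.keySet())) {
      String v = next.get(u);
      if (v == null) continue;                 // chain end, or already merged away
      if (heads.get(u) && !heads.get(v)) {     // H -> T: merge v into u
        label.put(u, label.get(u) + " " + label.get(v));
        if (next.containsKey(v)) next.put(u, next.get(v));
        else next.remove(u);                   // u inherits "end of chain"
        next.remove(v);
        label.remove(v);
      }
    }
  }
}

Each round removes a constant expected fraction of chain nodes, which matches the roughly geometric shrinkage shown in the next four slides.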

Page 40:

Fast Graph Compression

Initial Graph: 42 nodes

Page 41:

Fast Graph Compression

Round 1: 26 nodes (38% savings)

Page 42:

Fast Graph Compression

Round 2: 15 nodes (64% savings)

Page 43:

Fast Graph Compression

Round 3: 6 nodes (86% savings)

Page 44:

Fast Graph Compression

Round 4: 5 nodes (88% savings)

Page 45:

Contrail: De Novo Assembly of the Human Genome

African male NA18507 (SRA000271, Bentley et al., 2008). Input: 3.5B 36bp reads, 210bp insert (~40x coverage).

Stage         N       Max
Initial       >7 B    27 bp
Compressed    >1 B    303 bp
Clip Tips     5.0 M   14,007 bp
Pop Bubbles   4.2 M   20,594 bp

[Figure: clipping tips removes short dead-end branches B and B′ hanging off a node A; popping bubbles merges redundant parallel paths B and B′ between nodes A and C.]

Page 46:

Aside: How to do this better… MapReduce is a poor abstraction for graphs

No separation of computation from graph structure; poor locality means unnecessary data movement

Bulk-synchronous parallel (BSP) as a better model: Google’s Pregel, and its open-source clone, Giraph

Interesting (open?) question: how many hammers and how many nails?
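For contrast, a sketch of the vertex-centric BSP style (a hypothetical interface in the spirit of Pregel/Giraph; the real APIs differ in names and plumbing): logic is written per vertex per superstep, so state stays with the graph structure and only messages cross the network.

import java.util.List;

// Hypothetical Pregel-style vertex handle (illustrative only).
interface Vertex<V, M> {
  V getValue();
  void setValue(V value);
  void sendToNeighbors(M message);
  void voteToHalt();  // vertex sleeps until an incoming message reawakens it
}

// Example computation: minimum-label propagation (connected components).
class MinLabelPropagation {
  void compute(Vertex<Long, Long> vertex, List<Long> messages, long superstep) {
    long best = vertex.getValue();
    for (long m : messages) best = Math.min(best, m);  // smallest label seen
    if (superstep == 0 || best < vertex.getValue()) {
      vertex.setValue(best);         // state lives with the vertex...
      vertex.sendToNeighbors(best);  // ...only deltas move between machines
    }
    vertex.voteToHalt();
  }
}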

Page 47:

Source: flickr (stuckincustoms/4051325193)

Page 48:

Source: Wikipedia (Tide)

Commoditization of large-data processing capabilities allows us to ride the rising tide!

Science, Engineering, Commerce

Page 49:

Source: flickr (60in3/2338247189)

Page 50:

Best thing since sliced bread? Distributed programming models:
MapReduce is the first, definitely not the only, and probably not even the best. Alternatives: Pig, Dryad/DryadLINQ, Pregel, etc.

It’s all about the right level of abstraction: the von Neumann architecture won’t cut it anymore.

Separating the what from the how: the developer specifies the computation that needs to be performed; the execution framework handles the actual execution and hides system-level details from developers.

Page 51:

Source: NY Times (6/14/2006)

The datacenter is the computer!

Page 52:

What are the appropriate abstractions for the datacenter computer?

What new abstractions do applications demand?

What exciting applications do new abstractions enable?

How do we achieve true impact and change the world?

Page 53:

Education: teaching students to “think at web scale”

Rise of the data scientist: necessary skill set?

Source: flickr (infidelic/3008675635)

Page 54:

Questions?