CS345a: Data Mining, Jure Leskovec, Stanford University


Page 1:

CS345a: Data Mining
Jure Leskovec, Stanford University

Page 2:

(Diagram) Machine Learning, Statistics: CPU + Memory
(Diagram) "Classical" Data Mining: CPU + Memory + Disk

Page 3:

20+ billion web pages x 20KB = 400+ TB
1 computer reads 30-35 MB/sec from disk
  ~4 months to read the web
~1,000 hard drives to store the web
Even more to do something with the data

Page 4:

Web data sets can be very large: tens to hundreds of terabytes
Cannot mine on a single server
Standard architecture emerging:
  Cluster of commodity Linux nodes
  Gigabit Ethernet interconnect
How to organize computations on this architecture?
  Mask issues such as hardware failure

Page 5:

Traditional big-iron box (circa 2003): 8 2GHz Xeons, 64GB RAM, 8TB disk, 758,000 USD
Prototypical Google rack (circa 2003): 176 2GHz Xeons, 176GB RAM, ~7TB disk, 278,000 USD
In Aug 2006 Google had ~450,000 machines

Page 6:

(Diagram: cluster architecture)
2-10 Gbps backbone between racks
1 Gbps between any pair of nodes in a rack (one switch per rack)
Each rack contains 16-64 nodes (each node: CPU, Mem, Disk)

Page 7:

Large-scale computing for data mining problems on commodity hardware
  PCs connected in a network
  Need to process huge datasets on large clusters of computers
Challenges:
  How do you distribute computation?
  Distributed programming is hard
  Machines fail
Map-Reduce addresses all of the above
  Google's computational/data manipulation model
  Elegant way to work with big data

Page 8:

Yahoo's collaboration with academia
  Foster open research
  Focus on large-scale, highly parallel computing
Seed facility: M45, Datacenter in a Box (DiB)
  1,000 nodes, 4,000 cores, 3TB RAM, 1.5PB disk
  High-bandwidth connection to the Internet
  Located on the Yahoo! corporate campus
  Among the world's top 50 supercomputers

Page 9:

Implications of such a computing environment:
  Single-machine performance does not matter; add more machines
  Machines break: one server may stay up 3 years (1,000 days); if you have 1,000 servers, expect to lose 1 per day
How can we make it easy to write distributed programs?

Page 10:

Idea:
  Bring computation close to the data
  Store files multiple times for reliability
Need:
  Programming model: Map-Reduce
  Infrastructure, file system: Google GFS, Hadoop HDFS

Page 11:

First-order problem: if nodes can fail, how can we store data persistently?
Answer: Distributed File System
  Provides a global file namespace
  Google GFS; Hadoop HDFS; Kosmix KFS
Typical usage pattern:
  Huge files (100s of GB to TB)
  Data is rarely updated in place
  Reads and appends are common

Page 12:

Reliable distributed file system for petabyte scale
  Data kept in 64-megabyte "chunks" spread across thousands of machines
  Each chunk replicated, usually 3 times, on different machines
  Seamless recovery from disk or machine failure
(Diagram: chunks C0, C1, C2, C3, C5 and D0, D1 replicated across chunk servers 1, 2, 3, ..., N)
Bring computation directly to the data!

Page 13:

Chunk servers
  File is split into contiguous chunks
  Typically each chunk is 16-64MB
  Each chunk replicated (usually 2x or 3x)
  Try to keep replicas in different racks
Master node (a.k.a. Name Node in HDFS)
  Stores metadata
  Might be replicated
Client library for file access
  Talks to the master to find chunk servers
  Connects directly to chunk servers to access data

Page 14:

We have a large file of words, one word per line
Count the number of times each distinct word appears in the file
Sample application: analyze web server logs to find popular URLs

Page 15:

Case 1: Entire file fits in memory
Case 2: File too large for memory, but all <word, count> pairs fit in memory
Case 3: File on disk, too many distinct words to fit in memory
  sort datafile | uniq -c
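A minimal Python sketch of Case 2 (the <word, count> table fits in memory even though the file does not), assuming a plain-text file with one word per line; the file name used in the usage comment is only an example:

import sys
from collections import Counter

def count_words(path):
    # Stream the file line by line; only the <word, count> table is held in memory.
    counts = Counter()
    with open(path) as f:
        for line in f:
            word = line.strip()
            if word:
                counts[word] += 1
    return counts

if __name__ == "__main__":
    # Usage (example): python count_words.py datafile
    for word, count in count_words(sys.argv[1]).most_common():
        print(count, word)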

Page 16:

To make it slightly harder, suppose we have a large corpus of documents
Count the number of times each distinct word occurs in the corpus:
  words(docs/*) | sort | uniq -c
  where words takes a file and outputs the words in it, one to a line
The above captures the essence of MapReduce
  Great thing is that it is naturally parallelizable
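For completeness, a sketch of the hypothetical words helper in Python (the tokenization rule is an assumption, not specified on the slide); it prints every word in the given files, one per line, so the pipeline above becomes python words.py docs/* | sort | uniq -c:

import re
import sys

def words(paths):
    # Emit every word in the given files, one per line, lowercased.
    for path in paths:
        with open(path) as f:
            for line in f:
                for w in re.findall(r"[a-z']+", line.lower()):
                    print(w)

if __name__ == "__main__":
    words(sys.argv[1:])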

Page 17:

Read a lot of data
Map: extract something you care about
Shuffle and sort
Reduce: aggregate, summarize, filter, or transform
Write the data
Outline stays the same; map and reduce change to fit the problem

Page 18:

Program specifies two primary methods:
  Map(k, v) -> <k', v'>*
  Reduce(k', <v'>*) -> <k', v''>*
All v' with the same k' are reduced together and processed in v' order
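Written out as Python type aliases, a sketch of the two signatures (the type-variable names are illustrative, not part of the original notation):

from typing import Callable, Iterable, Tuple, TypeVar

K = TypeVar("K")    # input key, e.g. a document name
V = TypeVar("V")    # input value, e.g. the document text
K2 = TypeVar("K2")  # intermediate key k'
V2 = TypeVar("V2")  # intermediate value v'
V3 = TypeVar("V3")  # final value v''

# Map(k, v) -> <k', v'>*
MapFn = Callable[[K, V], Iterable[Tuple[K2, V2]]]

# Reduce(k', <v'>*) -> <k', v''>*
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[Tuple[K2, V3]]]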

Page 19:

(Diagram: word count, step by step)
MAP: reads input and produces a set of (key, value) pairs   [provided by the programmer]
Group by key: collect all pairs with the same key
Reduce: collect all values belonging to the key and output   [provided by the programmer]

Big document (read sequentially): "The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now -- the robotics we're doing -- is what we're going to need to do to build any work station or habitat structure on the moon or Mars,' said Allard Beutel."
Map output: (the, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) ...
Group by key (only sequential reads): (crew, 1) (crew, 1); (space, 1); (the, 1) (the, 1) (the, 1); (shuttle, 1); (recently, 1) ...
Reduce output: (crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) ...

Page 20:

map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(result)
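A runnable Python rendering of this pseudocode in the Hadoop Streaming style (a sketch under assumed file and mode names, not the course's reference solution): the mapper and reducer read lines from stdin, and an external sort stands in for the shuffle, e.g. cat docs/* | python wc.py map | sort | python wc.py reduce:

import sys
from itertools import groupby

def mapper():
    # value: document text arriving on stdin; emit (word, 1) as tab-separated lines.
    for line in sys.stdin:
        for w in line.split():
            print(f"{w}\t1")

def reducer():
    # stdin is assumed to be sorted by key, so equal words arrive adjacent to each other.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        result = sum(int(count) for _, count in group)
        print(f"{word}\t{result}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()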

Page 21:

The Map-Reduce environment takes care of:
  Partitioning the input data
  Scheduling the program's execution across a set of machines
  Handling machine failures
  Managing required inter-machine communication
Allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed cluster

Page 22:

(Diagram: data flow for a big document)
MAP: reads input and produces a set of (key, value) pairs
Group by key: collect all pairs with the same key
Reduce: collect all values belonging to the key and output

Page 23:

Programmer specifies:
  Map and Reduce and input files
Workflow:
  Read inputs as a set of key-value pairs
  Map transforms input kv-pairs into a new set of k'v'-pairs
  Sorts & shuffles the k'v'-pairs to output nodes
  All k'v'-pairs with a given k' are sent to the same reduce
  Reduce processes all k'v'-pairs grouped by key into new k''v''-pairs
  Write the resulting pairs to files
All phases are distributed, with many tasks doing the work
(Diagram: Input 0/1/2 -> Map 0/1/2 -> shuffle -> Reduce 0/1 -> Out 0/1)
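A single-machine sketch of this workflow in Python, for illustration only (no distribution, no external sort, no fault tolerance): map, group by key, then reduce.

from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    # Map: turn input (k, v) pairs into intermediate (k', v') pairs.
    intermediate = []
    for k, v in inputs:
        intermediate.extend(map_fn(k, v))

    # Shuffle: group all v' with the same k' (stands in for sort & shuffle).
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce: process each group into final (k'', v'') pairs.
    output = []
    for k2, values in groups.items():
        output.extend(reduce_fn(k2, values))
    return output

# Example: word count over two tiny "documents".
docs = [("doc1", "the crew of the space shuttle"), ("doc2", "the space shuttle")]
wc_map = lambda name, text: [(w, 1) for w in text.split()]
wc_reduce = lambda word, counts: [(word, sum(counts))]
print(sorted(map_reduce(docs, wc_map, wc_reduce)))

The example slides later in the deck (Pages 45-48) sketch map_fn/reduce_fn pairs that plug directly into this driver.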

Page 24: (diagram only; no text content)

Page 25:

Input and final output are stored on a distributed file system
  Scheduler tries to schedule map tasks "close" to the physical storage location of the input data
Intermediate results are stored on the local FS of map and reduce workers
Output is often the input to another map-reduce task

Page 26:

Master data structures:
  Task status: (idle, in-progress, completed)
  Idle tasks get scheduled as workers become available
  When a map task completes, it sends the master the locations and sizes of its R intermediate files, one for each reducer
  Master pushes this info to reducers
Master pings workers periodically to detect failures

Page 27:

Map worker failure:
  Map tasks completed or in-progress at the worker are reset to idle
  Reduce workers are notified when a task is rescheduled on another worker
Reduce worker failure:
  Only in-progress tasks are reset to idle
Master failure:
  MapReduce task is aborted and the client is notified

Page 28:

Fine-granularity tasks: map tasks >> machines
  Minimizes time for fault recovery
  Can pipeline shuffling with map execution
  Better dynamic load balancing
Often use 200,000 map & 5,000 reduce tasks, running on 2,000 machines

Pages 29-39: (diagrams only; no text content)

Page 40:

Slow workers significantly lengthen the completion time:
  Other jobs on the machine
  Bad disks
  Weird things
Solution: near the end of a phase, spawn backup copies of tasks
  Whichever one finishes first "wins"
Effect: dramatically shortens job completion time

Page 41:

Backup tasks reduce job time
System deals with failures

Page 42:

Often a map task will produce many pairs of the form (k,v1), (k,v2), ... for the same key k
  E.g., popular words in Word Count
Can save network time by pre-aggregating at the mapper:
  combine(k1, list(v1)) -> v2
  Usually the same as the reduce function
Works only if the reduce function is commutative and associative
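A sketch of pre-aggregation at the mapper, continuing the word-count example (the names are illustrative); the point is only that combine has the same shape as the reduce function and is safe here because addition is commutative and associative:

from collections import defaultdict

def combine(mapper_output):
    # Pre-aggregate the (k, v) pairs produced by one map task
    # before they cross the network.
    partial = defaultdict(int)
    for word, count in mapper_output:
        partial[word] += count
    return list(partial.items())

# One map task's output: many ("the", 1) pairs ...
mapped = [("the", 1), ("crew", 1), ("the", 1), ("the", 1)]
# ... shrinks to [("the", 3), ("crew", 1)] before the shuffle.
print(combine(mapped))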

Page 43:

Inputs to map tasks are created by contiguous splits of the input file
For reduce, we need to ensure that records with the same intermediate key end up at the same worker
System uses a default partition function, e.g., hash(key) mod R
Sometimes useful to override
  E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file
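A sketch of the two partition functions in Python; R is the number of reduce tasks (the value here is only an example). A stable hash (zlib.crc32) stands in for hash(), since Python's built-in string hash is randomized between runs:

import zlib
from urllib.parse import urlparse

R = 16  # number of reduce tasks (example value)

def default_partition(key):
    # hash(key) mod R
    return zlib.crc32(key.encode()) % R

def host_partition(url):
    # hash(hostname(URL)) mod R: all URLs from one host go to the same
    # reducer, and therefore end up in the same output file.
    return zlib.crc32(urlparse(url).netloc.encode()) % R

assert host_partition("http://example.com/a") == host_partition("http://example.com/b")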

Page 44:

Input does not have to be big
  E.g., want to simulate disease spreading in a (small) social network
Input: each line: node id, virus parameters (death, birth rate)
Map: reads a line of input and simulates the virus
  Output: triplets (node id, virus id, hit time)
Reduce: collect the node IDs and see which nodes are most vulnerable

Page 45:

Suppose we have a large web corpus
Let's look at the metadata file
  Lines of the form (URL, size, date, ...)
For each host, find the total number of bytes, i.e., the sum of the page sizes for all URLs from that host
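A sketch of the map and reduce functions for this task, assuming each metadata line looks like "URL size date ..." with whitespace-separated fields (the exact layout is an assumption); they fit the local driver sketched after Page 23:

from urllib.parse import urlparse

def map_fn(_, line):
    # Emit (host, size in bytes) for each metadata record.
    url, size = line.split()[:2]
    yield urlparse(url).netloc, int(size)

def reduce_fn(host, sizes):
    # Total bytes across all URLs from this host.
    yield host, sum(sizes)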

Page 46:

Statistical machine translation:
  Need to count the number of times every 5-word sequence occurs in a large corpus of documents
Easy with MapReduce:
  Map: extract (5-word sequence, count) from each document
  Reduce: combine counts
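A sketch of the map side: slide a 5-word window over each document and emit (sequence, 1); the reduce side is the same summing reducer as in word count:

def map_fn(doc_name, text):
    words = text.split()
    # Every 5-word sequence in the document, emitted with count 1.
    for i in range(len(words) - 4):
        yield " ".join(words[i:i + 5]), 1

def reduce_fn(sequence, counts):
    yield sequence, sum(counts)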

Page 47:

Find all occurrences of a given pattern in a very large set of files
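A sketch of distributed grep as map and reduce functions (the pattern shown is only an example): map emits every matching line, and reduce is the identity, simply copying the matches to the output:

import re

PATTERN = re.compile(r"error")  # example pattern

def map_fn(file_name, line):
    # Emit (file name, line) whenever the line matches the pattern.
    if PATTERN.search(line):
        yield file_name, line

def reduce_fn(file_name, lines):
    # Identity reduce: pass the matching lines through unchanged.
    for line in lines:
        yield file_name, line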

Page 48:

Given a directed graph as an adjacency list:
  src1: dest11, dest12, ...
  src2: dest21, dest22, ...
Construct the graph in which all the links are reversed
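A sketch of link reversal as map and reduce functions: map emits one (dest, src) pair per edge, and reduce gathers, for each destination, all sources pointing to it, which is exactly one line of the reversed adjacency list:

def map_fn(_, line):
    # Input line: "src: dest1, dest2, ..."
    src, dests = line.split(":", 1)
    for dest in dests.split(","):
        yield dest.strip(), src.strip()

def reduce_fn(dest, srcs):
    # One adjacency-list entry of the reversed graph.
    yield dest, sorted(srcs)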

Page 49:

Google
  Not available outside Google
Hadoop
  An open-source implementation in Java
  Uses HDFS for stable storage
  Download: http://lucene.apache.org/hadoop/
Aster Data
  Cluster-optimized SQL database that also implements MapReduce
  Made available free of charge for this class

Page 50:

Ability to rent computing by the hour
  Additional services, e.g., persistent storage
We will be using Amazon's "Elastic Compute Cloud" (EC2)
Aster Data and Hadoop can both be run on EC2
In discussions with Amazon to provide access free of charge for the class

Page 51:

Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
  http://labs.google.com/papers/mapreduce.html
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System
  http://labs.google.com/papers/gfs.html

Page 52:

• Hadoop Wiki
  – Introduction
    • http://wiki.apache.org/lucene-hadoop/
  – Getting Started
    • http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
  – Map/Reduce Overview
    • http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
    • http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
  – Eclipse Environment
    • http://wiki.apache.org/lucene-hadoop/EclipseEnvironment
• Javadoc
  – http://lucene.apache.org/hadoop/docs/api/

Page 53:

• Releases from Apache download mirrors
  – http://www.apache.org/dyn/closer.cgi/lucene/hadoop/
• Nightly builds of source
  – http://people.apache.org/dist/lucene/hadoop/nightly/
• Source code from subversion
  – http://lucene.apache.org/hadoop/version_control.html

Page 54:

Programming model inspired by functional language primitives
Partitioning/shuffling similar to many large-scale sorting systems
  NOW-Sort ['97]
Re-execution for fault tolerance
  BAD-FS ['04] and TACC ['97]
Locality optimization has parallels with Active Disks/Diamond work
  Active Disks ['01], Diamond ['04]
Backup tasks similar to Eager Scheduling in the Charlotte system
  Charlotte ['96]
Dynamic load balancing solves a similar problem as River's distributed queues
  River ['99]