CS224: Advanced Topics in Data Management Spring 2008
Computing PageRank Using Hadoop
(+Introduction to MapReduce)
Alexander Behm, Ajey Shah
University of California, Irvine
Instructor: Prof. Chen Li
Outline
- Introduction to MapReduce
  - Motivation + Goals
  - MapReduce Paradigm + Example
- Introduction to Hadoop
  - Architecture
  - Our setup
- Computing PageRank using MapReduce
  - Link Analysis
  - Matrix Multiplication
Motivation for MapReduce
How can we process huge amounts of data quickly (think web-scale)?
- Mainframe (one big machine): expensive, single vendor, hard to scale radically, single point of failure
- COTS cluster (many small machines): cheap components, many vendors, easy to scale
COTS clusters are very popular because of their price and scalability.
Their main drawback is the complexity of programming parallel applications on them.
(COTS = commodity off the shelf)
Motivation for MapReduce
What are the main challenges of programming a COTS cluster?
1. Fault tolerance (many machines means many failures)
2. Transparency: how to hide the underlying details of the cluster
3. Scheduling and load balancing

Parallel programming models form a spectrum:
- High exposure to the programmer: complex programming, high efficiency, long development time
- Low exposure to the programmer: simple programming, lower efficiency, short development time
MapReduce sits at the low-exposure, simple-programming end of this spectrum.
MapReduce Goals
- Provide an easy but general model for programmers to use cluster resources
- Hide network communication (i.e., RPCs)
- Hide storage details; file chunks are automatically distributed and replicated
- Provide transparent fault tolerance: failed tasks are automatically rescheduled on live nodes
- High throughput and automatic load balancing, e.g., scheduling tasks on nodes that already have the data
(RPC = Remote Procedure Call)
MapReduce is NOT…
- an operating system
- a programming language
- meant for online processing
- Hadoop (Hadoop is an implementation of MapReduce)
MapReduce is a programming paradigm!
MapReduce Flow

Input -> Map -> Sort -> Reduce -> Output

1. Split the input into key-value pairs; for each K-V pair, call Map.
2. Each Map produces a new set of K-V pairs.
3. Sort and group the intermediate pairs by key.
4. For each distinct key, call Reduce(K, V[ ]); this produces one K-V pair per distinct key.
5. The output is a set of key-value pairs.
MapReduce WordCount Example

Input: a file containing words
  Hello World Bye World
  Hello Hadoop Bye Hadoop
  Bye Hadoop Hello Hadoop

Output: the number of occurrences of each word
  Bye 3
  Hadoop 4
  Hello 3
  World 2

How can we do this within the MapReduce framework?
Basic idea: parallelize on the lines of the input file!
MapReduce WordCount Example

Input (one Map call per line):
  1, "Hello World Bye World"
  2, "Hello Hadoop Bye Hadoop"
  3, "Bye Hadoop Hello Hadoop"

Map(K, V) {
  For each word w in V
    Collect(w, 1);
}

Map Output:
  <Hello,1> <World,1> <Bye,1> <World,1>
  <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1>
  <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>
MapReduce WordCount Example

Reduce(K, V[ ]) {
  Int count = 0;
  For each v in V
    count += v;
  Collect(K, count);
}

Map Output:
  <Hello,1> <World,1> <Bye,1> <World,1>
  <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1>
  <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>

Internal Grouping (one Reduce call per distinct key):
  <Bye: 1, 1, 1>
  <Hadoop: 1, 1, 1, 1>
  <Hello: 1, 1, 1>
  <World: 1, 1>

Reduce Output:
  <Bye, 3> <Hadoop, 4> <Hello, 3> <World, 2>
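The whole map, group, and reduce flow above can be traced in a single Java process. This is a sketch for intuition only (the class and method names are ours, not Hadoop's); a real Hadoop job distributes these phases across machines:

```java
import java.util.*;

// Single-process simulation of the WordCount data flow above:
// Map emits <word, 1>, the framework groups by key, Reduce sums the values.
public class WordCountSim {

    // "Map" phase: emit a <word, 1> pair for every word in every line.
    static List<Map.Entry<String, Integer>> map(String[] lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // "Grouping" + "Reduce": group the pairs by key, then sum each key's values.
    static SortedMap<String, Integer> wordCount(String[] lines) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : map(lines))
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        String[] input = {
            "Hello World Bye World",
            "Hello Hadoop Bye Hadoop",
            "Bye Hadoop Hello Hadoop"
        };
        System.out.println(wordCount(input)); // {Bye=3, Hadoop=4, Hello=3, World=2}
    }
}
```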
Introduction to Hadoop
- Open-source implementation of MapReduce by Apache
- Java software framework
- In use at and supported by Yahoo!
- Hadoop consists of the following components:
  - Processing: MapReduce
  - Storage: HDFS, HBase (Hadoop's counterpart to Google Bigtable)

Some Webmap size data @Yahoo!:
- Number of links between pages in the index: roughly 1 trillion
- Size of output: over 300 TB, compressed!
- Number of cores used to run a single MapReduce job: over 10,000
- Raw disk used in the production cluster: over 5 petabytes
(source: http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html)
Typical Hadoop Setup
Our Hadoop Setup

MASTER: peach
- NameNode, JobTracker, DataNode, TaskTracker

SLAVES (connected to the master via a switch):
- watermelon: DataNode, TaskTracker
- cherry: DataNode, TaskTracker
- avocado: DataNode, TaskTracker
- blueberry: DataNode, TaskTracker
Our Hadoop Setup
Demo: Hadoop Admin Pages!
Storage: HDFS
- Single NameNode: manages metadata and block placement
- DataNodes: store blocks
Job Execution Diagram

Run Application -> JobTracker -> TaskTracker, TaskTracker, TaskTracker, …
Each TaskTracker runs several Tasks; the JobTracker and TaskTrackers together form the Hadoop "black box".
Processing: Hadoop MapReduce
Input -> Map -> Shuffle -> Reduce -> Output
Using Hadoop To Program
Your Map and Reduce classes each extend MapReduceBase and implement the Mapper(…) and Reducer(…) interfaces, respectively.
Sample Map Class

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable>
{
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException
  {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens())
    {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Sample Reduce Class

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable>
{
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException
  {
    int sum = 0;
    while (values.hasNext())
    {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Running a Job
Demo: Show WordCount Example
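For reference, a job like this demo is typically wired together and submitted by a driver class using Hadoop's old (org.apache.hadoop.mapred) API, along the lines of the standard WordCount tutorial. This sketch is not runnable outside a Hadoop installation, and the exact method names vary slightly across early Hadoop versions:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    // (The Map and Reduce inner classes from the previous slides go here.)

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Types of the final output keys and values.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        // args[0] = HDFS input directory, args[1] = HDFS output directory.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf); // blocks until the job completes
    }
}
```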
Project 5: PageRank on Hadoop
Link Analysis

Crawled Pages -> Link Extractor -> Output
Output format: #|colNum|NumOfRows|<R,val>…..<R,val>|#.....
PageRank on MapReduce
Very Basic PageRank Algorithm

Input: PageRankVector, DistributionMatrix

ComputePageRank {
  Until converged {
    PageRankVector = DistributionMatrix * PageRankVector;
  }
}

Output: PageRankVector

Challenges:
- Storage of the matrix and vector
- Parallel matrix multiplication
- Determining convergence
- Implementation on Hadoop
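The "until converged" loop above is power iteration. A minimal in-memory sketch on a hypothetical 3-page graph (the example matrix and convergence threshold are our own illustration; the distributed version of the multiply is covered on later slides):

```java
import java.util.Arrays;

// In-memory power iteration for the basic algorithm above:
// repeatedly set v = M * v until v stops changing.
public class PageRankSketch {

    // One step: result[i] = sum over j of M[i][j] * v[j].
    static double[] multiply(double[][] m, double[] v) {
        double[] result = new double[v.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++)
                result[i] += m[i][j] * v[j];
        return result;
    }

    // Iterate until the L1 change between iterations drops below epsilon.
    static double[] pageRank(double[][] m, double epsilon) {
        int n = m.length;
        double[] v = new double[n];
        Arrays.fill(v, 1.0 / n);            // uniform start vector
        while (true) {
            double[] next = multiply(m, v);
            double diff = 0;
            for (int i = 0; i < n; i++) diff += Math.abs(next[i] - v[i]);
            v = next;
            if (diff < epsilon) return v;   // converged
        }
    }

    public static void main(String[] args) {
        // Column-stochastic distribution matrix for a toy 3-page graph:
        // page 1 links to 2 and 3, page 2 links to 3, page 3 links to 1.
        double[][] m = {
            {0.0, 0.0, 1.0},
            {0.5, 0.0, 0.0},
            {0.5, 1.0, 0.0}
        };
        // Prints approximately [0.4, 0.2, 0.4].
        System.out.println(Arrays.toString(pageRank(m, 1e-9)));
    }
}
```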
PageRank on MapReduce
Why is storage a challenge?

UCI domain: 500,000 pages, assuming 4 bytes per entry:
- Size of vector: 500,000 * 4 bytes = 2,000,000 bytes = 2 MB
- Size of matrix: 500,000 * 500,000 * 4 bytes = 10^12 bytes = 1 TB

This assumes a fully connected graph. Clearly that is very unrealistic for web pages!

Solution: a sparse matrix. But row-wise or column-wise?
It depends on the usage patterns (i.e., how we do parallel matrix multiplication, update the matrix, etc.)!
PageRank on MapReduce
Parallel Matrix Multiplication
Requirement: make it work! A simple but practical solution:

M x V: every row of M is "combined" with V, yielding one element of M x V.

Intuition:
- Parallelize on rows: each parallel task computes one final value
- Use a row-wise sparse matrix, so the above can be done easily!
(A column-wise layout is actually better for PageRank.)
PageRank on MapReduce

Original Matrix (rows and columns 1-6):
     1 2 3 4 5 6
  1  0 0 0 0 1 0
  2  0 1 0 1 0 0
  3  1 1 0 0 0 0
  4  0 0 0 1 1 0
  5  1 1 0 0 0 0
  6  0 1 0 0 0 1

Stored as a Row-Wise Sparse Matrix (row -> list of <column, value> pairs):
  1 -> <5,1>
  2 -> <2,1> <4,1>
  3 -> <1,1> <2,1>
  4 -> <4,1> <5,1>
  5 -> <1,1> <2,1>
  6 -> <2,1> <6,1>

New Storage Requirements:
- UCI domain: 500,000 pages
- Assuming 4 bytes per entry and at most 100 outgoing links per page:
- Size of matrix: 500,000 * 100 * (4 + 4) bytes = 400 * 10^6 bytes = 400 MB

Notice: no more random access!
PageRank on MapReduce

Map(Key, Row) {
  Vector v = getVector();
  Double sum = 0;
  For each Element e in Row
    sum += e.value * v.at(e.columnNumber);
  Collect(Key, sum);
}

Reduce(Key, Value) {
  Collect(Key, Value);
}

Map-Reduce procedures for parallel matrix*vector multiplication using a row-wise sparse matrix. The Reduce is the identity, because each Map call already emits one final value per row. (Note that sum must be a floating-point type: PageRank values are fractional.)
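The Map body above can be sanity-checked in plain Java against the 6x6 row-wise sparse matrix from the earlier slide. This single-process sketch replaces getVector() with an explicit argument; Row and mapRow are our own names, not part of Hadoop:

```java
import java.util.*;

// Single-process version of the Map procedure above: each sparse row of
// <column, value> pairs is combined with the vector to produce one
// element of M * V.
public class SparseMatVec {

    // One sparse row: parallel arrays of column indices (1-based, as on
    // the slide) and values.
    static class Row {
        int[] cols; double[] vals;
        Row(int[] cols, double[] vals) { this.cols = cols; this.vals = vals; }
    }

    // The Map body: sum += e.value * v.at(e.columnNumber).
    static double mapRow(Row row, double[] v) {
        double sum = 0;
        for (int i = 0; i < row.cols.length; i++)
            sum += row.vals[i] * v[row.cols[i] - 1]; // 1-based -> 0-based
        return sum;
    }

    public static void main(String[] args) {
        // Row-wise sparse form of the 6x6 example matrix.
        Row[] m = {
            new Row(new int[]{5},    new double[]{1}),
            new Row(new int[]{2, 4}, new double[]{1, 1}),
            new Row(new int[]{1, 2}, new double[]{1, 1}),
            new Row(new int[]{4, 5}, new double[]{1, 1}),
            new Row(new int[]{1, 2}, new double[]{1, 1}),
            new Row(new int[]{2, 6}, new double[]{1, 1}),
        };
        double[] v = {1, 2, 3, 4, 5, 6};
        double[] result = new double[m.length];
        for (int r = 0; r < m.length; r++)   // one Map call per row
            result[r] = mapRow(m[r], v);
        System.out.println(Arrays.toString(result)); // [5.0, 6.0, 3.0, 9.0, 3.0, 8.0]
    }
}
```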
Matrix Vector Multiplication
Demo: Show Matrix-Vector Multiplication
Hadoop: Implementing Your Own File Format

HDFS File -> InputFormat -> InputSplits -> RecordReaders -> Map tasks

InputFormat:
- Splits the file into chunks ("InputSplits")
- Byte-oriented
- Provides an appropriate RecordReader

InputSplit:
- Filename, start offset, end offset, hosts in HDFS

RecordReader:
- Record-oriented
- Extracts records from an InputSplit
References
[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Sixth Symposium on Operating Systems Design and Implementation (OSDI '04), San Francisco, CA, December 2004.
[2] http://www.cs.cmu.edu/~knigam/15-505/HW1.html
[3] http://bnrg.cs.berkeley.edu/~adj/cs16x/Nachos/project2.html
[4] http://lucene.apache.org/hadoop/