CS224: Advanced Topics in Data Management Spring 2008
Computing PageRank Using Hadoop
(+Introduction to MapReduce)
Alexander Behm, Ajey Shah
University of California, Irvine
Instructor: Prof. Chen Li
Outline
- Introduction to MapReduce
  - Motivation + Goals
  - MapReduce Paradigm + Example
- Introduction to Hadoop
  - Architecture
  - Our setup
- Computing PageRank using MapReduce
  - Link Analysis
  - Matrix Multiplication
Motivation for MapReduce
How can we process huge amounts of data quickly (think web-scale)?
- Mainframe (one big machine): expensive, single vendor, hard to scale radically, single point of failure
- COTS cluster (many small machines): cheap components, many vendors, easy to scale
COTS clusters are very popular because of their price and scalability.
Their main drawback is the complexity of programming parallel applications on them.
(COTS = commodity off the shelf)
Motivation for MapReduce
What are the main challenges of programming a COTS cluster?
1. Fault tolerance (many machines means many failures)
2. Transparency: how to hide the underlying details of the cluster
3. Scheduling and load balancing

Parallel programming models form a spectrum:
- High exposure to the programmer: complex programming, high efficiency, long development time
- Low exposure to the programmer: simple programming, lower efficiency, short development time
MapReduce sits at the low-exposure, simple-programming end of this spectrum.
MapReduce Goals
- Provide an easy but general model for programmers to use cluster resources
- Hide network communication (i.e., RPCs)
- Hide storage details; file chunks are automatically distributed and replicated
- Provide transparent fault tolerance: failed tasks are automatically rescheduled on live nodes
- High throughput and automatic load balancing, e.g., scheduling tasks on nodes that already have the data
(RPC = Remote Procedure Call)
MapReduce is NOT…
- an operating system
- a programming language
- meant for online processing
- Hadoop (Hadoop is an implementation of MapReduce)
MapReduce is a programming paradigm!
MapReduce Flow

Input -> Map -> Sort -> Reduce -> Output

1. Split the input into key-value pairs; for each K-V pair, call Map.
2. Each Map produces a new set of K-V pairs.
3. Sort and group the intermediate pairs by key.
4. For each distinct key, call Reduce(K, V[ ]); this produces one K-V pair per distinct key.
5. The output is a set of key-value pairs.
MapReduce WordCount Example

Input: a file containing words
  Hello World Bye World
  Hello Hadoop Bye Hadoop
  Bye Hadoop Hello Hadoop

Output: the number of occurrences of each word
  Bye 3
  Hadoop 4
  Hello 3
  World 2

How can we do this within the MapReduce framework?
Basic idea: parallelize on the lines of the input file!
MapReduce WordCount Example

Input (one Map call per line):
  1, "Hello World Bye World"
  2, "Hello Hadoop Bye Hadoop"
  3, "Bye Hadoop Hello Hadoop"

Map(K, V) {
  For each word w in V
    Collect(w, 1);
}

Map Output:
  <Hello,1> <World,1> <Bye,1> <World,1>
  <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1>
  <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>
MapReduce WordCount Example

Reduce(K, V[ ]) {
  Int count = 0;
  For each v in V
    count += v;
  Collect(K, count);
}

Map Output:
  <Hello,1> <World,1> <Bye,1> <World,1>
  <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1>
  <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>

Internal Grouping (one Reduce call per distinct key):
  <Bye: 1, 1, 1>
  <Hadoop: 1, 1, 1, 1>
  <Hello: 1, 1, 1>
  <World: 1, 1>

Reduce Output:
  <Bye, 3> <Hadoop, 4> <Hello, 3> <World, 2>
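The whole map, group, and reduce flow above can be traced in a single Java process. This is a sketch for intuition only (the class and method names are ours, not Hadoop's); a real Hadoop job distributes these phases across machines:

```java
import java.util.*;

// Single-process simulation of the WordCount data flow above:
// Map emits <word, 1>, the framework groups by key, Reduce sums the values.
public class WordCountSim {

    // "Map" phase: emit a <word, 1> pair for every word in every line.
    static List<Map.Entry<String, Integer>> map(String[] lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // "Grouping" + "Reduce": group the pairs by key, then sum each key's values.
    static SortedMap<String, Integer> wordCount(String[] lines) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : map(lines))
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        String[] input = {
            "Hello World Bye World",
            "Hello Hadoop Bye Hadoop",
            "Bye Hadoop Hello Hadoop"
        };
        System.out.println(wordCount(input)); // {Bye=3, Hadoop=4, Hello=3, World=2}
    }
}
```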
Introduction to Hadoop
- Open-source implementation of MapReduce by Apache
- Java software framework
- In use at and supported by Yahoo!
- Hadoop consists of the following components:
  - Processing: MapReduce
  - Storage: HDFS, HBase (Hadoop's counterpart to Google Bigtable)

Some Webmap size data @Yahoo!:
- Number of links between pages in the index: roughly 1 trillion
- Size of output: over 300 TB, compressed!
- Number of cores used to run a single MapReduce job: over 10,000
- Raw disk used in the production cluster: over 5 petabytes
(source: http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html)
Typical Hadoop Setup
Our Hadoop Setup

MASTER: peach
- NameNode, JobTracker, DataNode, TaskTracker

SLAVES (connected to the master via a switch):
- watermelon: DataNode, TaskTracker
- cherry: DataNode, TaskTracker
- avocado: DataNode, TaskTracker
- blueberry: DataNode, TaskTracker
Our Hadoop Setup
Demo: Hadoop Admin Pages!
Storage: HDFS
- Single NameNode: manages metadata and block placement
- DataNodes: store blocks
Job Execution Diagram

Run Application -> JobTracker -> TaskTracker, TaskTracker, TaskTracker, …
Each TaskTracker runs several Tasks; the JobTracker and TaskTrackers together form the Hadoop "black box".
Processing: Hadoop MapReduce
Input -> Map -> Shuffle -> Reduce -> Output
Using Hadoop To Program
Your Map and Reduce classes each extend MapReduceBase and implement the Mapper(…) and Reducer(…) interfaces, respectively.
Sample Map Class

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable>
{
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException
  {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens())
    {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Sample Reduce Class

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable>
{
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException
  {
    int sum = 0;
    while (values.hasNext())
    {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Running a Job
Demo: Show WordCount Example
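For reference, a job like this demo is typically wired together and submitted by a driver class using Hadoop's old (org.apache.hadoop.mapred) API, along the lines of the standard WordCount tutorial. This sketch is not runnable outside a Hadoop installation, and the exact method names vary slightly across early Hadoop versions:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    // (The Map and Reduce inner classes from the previous slides go here.)

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Types of the final output keys and values.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        // args[0] = HDFS input directory, args[1] = HDFS output directory.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf); // blocks until the job completes
    }
}
```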
Project 5: PageRank on Hadoop
Link Analysis

Crawled Pages -> Link Extractor -> Output
Output format: #|colNum|NumOfRows|<R,val>…..<R,val>|#.....
PageRank on MapReduce
Very Basic PageRank Algorithm

Input: PageRankVector, DistributionMatrix

ComputePageRank {
  Until converged {
    PageRankVector = DistributionMatrix * PageRankVector;
  }
}

Output: PageRankVector

Challenges:
- Storage of the matrix and vector
- Parallel matrix multiplication
- Determining convergence
- Implementation on Hadoop
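The "until converged" loop above is power iteration. A minimal in-memory sketch on a hypothetical 3-page graph (the example matrix and convergence threshold are our own illustration; the distributed version of the multiply is covered on later slides):

```java
import java.util.Arrays;

// In-memory power iteration for the basic algorithm above:
// repeatedly set v = M * v until v stops changing.
public class PageRankSketch {

    // One step: result[i] = sum over j of M[i][j] * v[j].
    static double[] multiply(double[][] m, double[] v) {
        double[] result = new double[v.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++)
                result[i] += m[i][j] * v[j];
        return result;
    }

    // Iterate until the L1 change between iterations drops below epsilon.
    static double[] pageRank(double[][] m, double epsilon) {
        int n = m.length;
        double[] v = new double[n];
        Arrays.fill(v, 1.0 / n);            // uniform start vector
        while (true) {
            double[] next = multiply(m, v);
            double diff = 0;
            for (int i = 0; i < n; i++) diff += Math.abs(next[i] - v[i]);
            v = next;
            if (diff < epsilon) return v;   // converged
        }
    }

    public static void main(String[] args) {
        // Column-stochastic distribution matrix for a toy 3-page graph:
        // page 1 links to 2 and 3, page 2 links to 3, page 3 links to 1.
        double[][] m = {
            {0.0, 0.0, 1.0},
            {0.5, 0.0, 0.0},
            {0.5, 1.0, 0.0}
        };
        // Prints approximately [0.4, 0.2, 0.4].
        System.out.println(Arrays.toString(pageRank(m, 1e-9)));
    }
}
```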
PageRank on MapReduce
Why is storage a challenge?

UCI domain: 500,000 pages, assuming 4 bytes per entry:
- Size of vector: 500,000 * 4 bytes = 2,000,000 bytes = 2 MB
- Size of matrix: 500,000 * 500,000 * 4 bytes = 10^12 bytes = 1 TB

This assumes a fully connected graph. Clearly that is very unrealistic for web pages!

Solution: a sparse matrix. But row-wise or column-wise?
It depends on the usage patterns (i.e., how we do parallel matrix multiplication, update the matrix, etc.)!
PageRank on MapReduce
Parallel Matrix Multiplication
Requirement: make it work! A simple but practical solution:

M x V: every row of M is "combined" with V, yielding one element of M x V.

Intuition:
- Parallelize on rows: each parallel task computes one final value
- Use a row-wise sparse matrix, so the above can be done easily!
(A column-wise layout is actually better for PageRank.)
PageRank on MapReduce

Original Matrix (rows and columns 1-6):
     1 2 3 4 5 6
  1  0 0 0 0 1 0
  2  0 1 0 1 0 0
  3  1 1 0 0 0 0
  4  0 0 0 1 1 0
  5  1 1 0 0 0 0
  6  0 1 0 0 0 1

Stored as a Row-Wise Sparse Matrix (row -> list of <column, value> pairs):
  1 -> <5,1>
  2 -> <2,1> <4,1>
  3 -> <1,1> <2,1>
  4 -> <4,1> <5,1>
  5 -> <1,1> <2,1>
  6 -> <2,1> <6,1>

New Storage Requirements:
- UCI domain: 500,000 pages
- Assuming 4 bytes per entry and at most 100 outgoing links per page:
- Size of matrix: 500,000 * 100 * (4 + 4) bytes = 400 * 10^6 bytes = 400 MB

Notice: no more random access!
PageRank on MapReduce

Map(Key, Row) {
  Vector v = getVector();
  Double sum = 0;
  For each Element e in Row
    sum += e.value * v.at(e.columnNumber);
  Collect(Key, sum);
}

Reduce(Key, Value) {
  Collect(Key, Value);
}

Map-Reduce procedures for parallel matrix*vector multiplication using a row-wise sparse matrix. The Reduce is the identity, because each Map call already emits one final value per row. (Note that sum must be a floating-point type: PageRank values are fractional.)
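The Map body above can be sanity-checked in plain Java against the 6x6 row-wise sparse matrix from the earlier slide. This single-process sketch replaces getVector() with an explicit argument; Row and mapRow are our own names, not part of Hadoop:

```java
import java.util.*;

// Single-process version of the Map procedure above: each sparse row of
// <column, value> pairs is combined with the vector to produce one
// element of M * V.
public class SparseMatVec {

    // One sparse row: parallel arrays of column indices (1-based, as on
    // the slide) and values.
    static class Row {
        int[] cols; double[] vals;
        Row(int[] cols, double[] vals) { this.cols = cols; this.vals = vals; }
    }

    // The Map body: sum += e.value * v.at(e.columnNumber).
    static double mapRow(Row row, double[] v) {
        double sum = 0;
        for (int i = 0; i < row.cols.length; i++)
            sum += row.vals[i] * v[row.cols[i] - 1]; // 1-based -> 0-based
        return sum;
    }

    public static void main(String[] args) {
        // Row-wise sparse form of the 6x6 example matrix.
        Row[] m = {
            new Row(new int[]{5},    new double[]{1}),
            new Row(new int[]{2, 4}, new double[]{1, 1}),
            new Row(new int[]{1, 2}, new double[]{1, 1}),
            new Row(new int[]{4, 5}, new double[]{1, 1}),
            new Row(new int[]{1, 2}, new double[]{1, 1}),
            new Row(new int[]{2, 6}, new double[]{1, 1}),
        };
        double[] v = {1, 2, 3, 4, 5, 6};
        double[] result = new double[m.length];
        for (int r = 0; r < m.length; r++)   // one Map call per row
            result[r] = mapRow(m[r], v);
        System.out.println(Arrays.toString(result)); // [5.0, 6.0, 3.0, 9.0, 3.0, 8.0]
    }
}
```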
Matrix Vector Multiplication
Demo: Show Matrix-Vector Multiplication
Hadoop: Implementing Your Own File Format

HDFS File -> InputFormat -> InputSplits -> RecordReaders -> Map tasks

InputFormat:
- Splits the file into chunks ("InputSplits")
- Byte-oriented
- Provides an appropriate RecordReader

InputSplit:
- Filename, start offset, end offset, hosts in HDFS

RecordReader:
- Record-oriented
- Extracts records from an InputSplit
References
[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Sixth Symposium on Operating Systems Design and Implementation (OSDI '04), San Francisco, CA, December 2004.
[2] http://www.cs.cmu.edu/~knigam/15-505/HW1.html
[3] http://bnrg.cs.berkeley.edu/~adj/cs16x/Nachos/project2.html
[4] http://lucene.apache.org/hadoop/