dryad and dryadlinq aditya akella cs 838: lecture 6
TRANSCRIPT
![Page 1: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/1.jpg)
Dryad and DryadLINQ
Aditya AkellaCS 838: Lecture 6
![Page 2: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/2.jpg)
Distributed Data-ParallelProgramming using Dryad
ByAndrew Birrell, Mihai Budiu,
Dennis Fetterly, Michael Isard, Yuan Yu
Microsoft Research Silicon Valley
EuroSys 2007
![Page 3: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/3.jpg)
Dryad goals
• General-purpose execution environment for distributed, data-parallel applications– Concentrates on throughput not latency– Assumes private data center
• Automatic management of scheduling, distribution, fault tolerance, etc.
![Page 4: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/4.jpg)
A typical data-intensive query(Quick Skim)
var logentries = from line in logs where !line.StartsWith("#") select new LogEntry(line);var user = from access in logentries where access.user.EndsWith(@"\aditya") select access;var accesses = from access in user group access by access.page into pages select new UserPageCount("aditya", pages.Key, pages.Count());var htmAccesses = from access in accesses where access.page.EndsWith(".htm") orderby access.count descending select access;
![Page 5: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/5.jpg)
Steps in the queryvar logentries = from line in logs where !line.StartsWith("#") select new LogEntry(line);var user = from access in logentries where access.user.EndsWith(@"\aditya") select access;var accesses = from access in user group access by access.page into pages select new UserPageCount("aditya", pages.Key, pages.Count());var htmAccesses = from access in accesses where access.page.EndsWith(".htm") orderby access.count descending select access;
Go through logs and keep only lines that are not comments. Parse each line into a LogEntry object.
Go through logentries and keep only entries that are accesses by aditya.
Group aditya’s accesses according to what page they correspond to. For each page, count the occurrences.
Sort the pages aditya has accessed according to access frequency.
![Page 6: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/6.jpg)
Serial executionvar logentries = from line in logs where !line.StartsWith("#") select new LogEntry(line);var user = from access in logentries where access.user.EndsWith(@"\aditya") select access;var accesses = from access in user group access by access.page into pages select new UserPageCount("aditya", pages.Key, pages.Count());var htmAccesses = from access in accesses where access.page.EndsWith(".htm") orderby access.count descending select access;
For each line in logs, do…
For each entry in logentries, do..
Sort entries in user by page. Then iterate over sorted list, counting the occurrences of each page as you go.
Re-sort entries in access by page frequency.
![Page 7: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/7.jpg)
Parallel executionvar logentries = from line in logs where !line.StartsWith("#") select new LogEntry(line);var user = from access in logentries where access.user.EndsWith(@"\aditya") select access;var accesses = from access in user group access by access.page into pages select new UserPageCount("aditya", pages.Key, pages.Count());var htmAccesses = from access in accesses where access.page.EndsWith(".htm") orderby access.count descending select access;
![Page 8: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/8.jpg)
How does Dryad fit in?
• Many programs can be represented as a distributed execution graph– The programmer may not have to know this
• “SQL-like” queries: LINQ
• Dryad will run them for you
![Page 9: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/9.jpg)
Runtime
• Services– Name server– Daemon
• Job Manager– Centralized coordinating process– User application to construct graph– Linked with Dryad libraries for scheduling vertices
• Vertex executable– Dryad libraries to communicate with JM– User application sees channels in/out– Arbitrary application code, can use local FS
V V V
![Page 10: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/10.jpg)
Job = Directed Acyclic Graph
Processingvertices Channels
(file, pipe, shared memory)
Inputs
Outputs
![Page 11: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/11.jpg)
What’s wrong with MapReduce?
• Literally Map then Reduce and that’s it…– Reducers write to replicated storage
• Complex jobs pipeline multiple stages– No fault tolerance between stages
• Map assumes its data is always available: simple!
• Output of Reduce: 2 network copies, 3 disks– In Dryad this collapses inside a single process– Big jobs can be more efficient with Dryad
![Page 12: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/12.jpg)
What’s wrong with Map+Reduce?
• Join combines inputs of different types• “Split” produces outputs of different types
– Parse a document, output text and references• Can be done with Map+Reduce
– Ugly to program– Hard to avoid performance penalty– Some merge joins very expensive
• Need to materialize entire cross product to disk
![Page 13: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/13.jpg)
How about Map+Reduce+Join+…?
• “Uniform” stages aren’t really uniform
![Page 14: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/14.jpg)
How about Map+Reduce+Join+…?
• “Uniform” stages aren’t really uniform
![Page 15: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/15.jpg)
Graph complexity composes
• Non-trees common• E.g. data-dependent re-partitioning
– Combine this with merge trees etc.
Distribute to equal-sized ranges
Sample to estimate histogram
Randomly partitioned inputs
![Page 16: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/16.jpg)
Scheduler state machine
• Scheduling is independent of semantics– Vertex can run anywhere once all its inputs
are ready• Constraints/hints place it near its inputs
– Fault tolerance• If A fails, run it again• If A’s inputs are gone, run upstream vertices again
(recursively)• If A is slow, run another copy elsewhere and use
output from whichever finishes first
![Page 17: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/17.jpg)
Some Case Studies(Take a peek offline)
Starts at slide 39…
![Page 18: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/18.jpg)
DryadLINQA System for General-Purpose
Distributed Data-Parallel Computing
ByYuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu,Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey
Microsoft Research Silicon ValleyOSDI 2008
![Page 19: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/19.jpg)
Distributed Data-Parallel Computing
• Research problem: How to write distributed data-parallel programs for a compute cluster?
• The DryadLINQ programming model– Sequential, single machine programming abstraction– Same program runs on single-core, multi-core, or cluster– Familiar programming languages– Familiar development environment
![Page 20: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/20.jpg)
DryadLINQ Overview
Automatic query plan generation by DryadLINQ Automatic distributed execution by Dryad
![Page 21: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/21.jpg)
LINQ• Microsoft’s Language INtegrated Query
– Available in Visual Studio products• A set of operators to manipulate datasets in .NET
– Support traditional relational operators• Select, Join, GroupBy, Aggregate, etc.
– Integrated into .NET programming languages• Programs can call operators• Operators can invoke arbitrary .NET functions
• Data model– Data elements are strongly typed .NET objects– Much more expressive than SQL tables
• Highly extensible– Add new custom operators– Add new execution providers
![Page 22: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/22.jpg)
LINQ System Architecture
PLINQ
Local machine
.Netprogram(C#, VB, F#, etc)
Execution engines
Query
Objects
LINQ-to-SQL
DryadLINQ
LINQ-to-ObjLIN
Q p
rovi
der i
nter
face
Scalability
Single-core
Multi-core
Cluster
![Page 23: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/23.jpg)
24
Recall: Dryad Architecture
Files, TCP, FIFO, Networkjob schedule
data plane
control plane
NS PD PDPD
V V V
Job manager cluster
![Page 24: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/24.jpg)
A Simple LINQ Example: Word Count
Count word frequency in a set of documents:
var docs = [A collection of documents];var words = docs.SelectMany(doc => doc.words);var groups = words.GroupBy(word => word);var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
![Page 25: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/25.jpg)
Word Count in DryadLINQ
Count word frequency in a set of documents:
var docs = DryadLinq.GetTable<Doc>(“file://docs.txt”);var words = docs.SelectMany(doc => doc.words);var groups = words.GroupBy(word => word);var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
counts.ToDryadTable(“counts.txt”);
![Page 26: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/26.jpg)
Distributed Execution of Word Count
SM
DryadLINQGB
S
LINQ expression
IN
OUT
Dryad execution
![Page 27: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/27.jpg)
DryadLINQ System Architecture
28
DryadLINQClient machine
(11)
Distributedquery plan
.NET program
Query Expr
Data center
Output TablesResults
Input TablesInvoke Query
Output DryadTable
Dryad Execution
.Net Objects
JM
ToTable
foreach
Vertexcode
![Page 28: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/28.jpg)
DryadLINQ Internals• Distributed execution plan
– Static optimizations: pipelining, eager aggregation, etc.– Dynamic optimizations: data-dependent partitioning,
dynamic aggregation, etc.
• Automatic code generation– Vertex code that runs on vertices– Channel serialization code– Callback code for runtime optimizations– Automatically distributed to cluster machines
• Separate LINQ query from its local context– Distribute referenced objects to cluster machines– Distribute application DLLs to cluster machines
![Page 29: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/29.jpg)
30
Execution Plan for Word Count
(1)
SM
GB
S
SM
Q
GB
C
D
MS
GB
Sum
SelectMany
sort
groupby
count
distribute
mergesort
groupby
Sum
pipelined
pipelined
![Page 30: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/30.jpg)
31
Execution Plan for Word Count
(1)
SM
GB
S
SM
Q
GB
C
D
MS
GB
Sum
(2)
SM
Q
GB
C
D
MS
GB
Sum
SM
Q
GB
C
D
MS
GB
Sum
SM
Q
GB
C
D
MS
GB
Sum
![Page 31: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/31.jpg)
32
MapReduce in DryadLINQ
MapReduce(source, // sequence of Ts mapper, // T -> Ms keySelector, // M -> K reducer) // (K, Ms) -> Rs{ var map = source.SelectMany(mapper); var group = map.GroupBy(keySelector); var result = group.SelectMany(reducer); return result; // sequence of Rs}
![Page 32: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/32.jpg)
Map-Reduce Plan(When reduce is combiner-enabled)
M
Q
G1
C
D
MS
G2
R
M
Q
G1
C
D
MS
G2
R
M
Q
G1
C
D
MS
G2
R
MS
G2
R
map
sort
groupby
combine
distribute
mergesort
groupby
reduce
mergesort
groupby
reducem
apD
ynam
ic a
ggre
gatio
nre
duce
![Page 33: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/33.jpg)
PageRank: A more complex example(Take a peek offline)
Starts at slide 56…
![Page 34: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/34.jpg)
LINQ System Architecture
PLINQ
Local machine
.Netprogram(C#, VB, F#, etc)
Execution engines
Query
Objects
LINQ-to-SQL
DryadLINQ
LINQ-to-ObjLIN
Q p
rovi
der i
nter
face
Scalability
Single-core
Multi-core
Cluster
![Page 35: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/35.jpg)
36
Combining with PLINQ
Query
DryadLINQ
PLINQ
subquery
![Page 36: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/36.jpg)
37
Combining with LINQ-to-SQL
DryadLINQ
Subquery Subquery Subquery Subquery Subquery
Query
LINQ-to-SQL LINQ-to-SQL
![Page 37: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/37.jpg)
38
ImageProcessing
Cosmos DFSSQL Servers
Software Stack
Windows Server
Cluster Services
Azure Platform
Dryad
DryadLINQ
Windows Server
Windows Server
Windows Server
Other Languages
CIFS/NTFS
MachineLearning
GraphAnalysis
DataMining
Applications
…Other Applications
![Page 38: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/38.jpg)
Future Directions• Goal: Use a cluster as if it is a single computer
– Dryad/DryadLINQ represent a modest step
• On-going research– What can we write with DryadLINQ?
• Where and how to generalize the programming model?
– Performance, usability, etc.• How to debug/profile/analyze DryadLINQ apps?
– Job scheduling• How to schedule/execute N concurrent jobs?
– Caching and incremental computation• How to reuse previously computed results?
– Static program checking• A very compelling case for program analysis?• Better catch bugs statically than fighting them in the cloud?
![Page 39: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/39.jpg)
Dryad Case Studies
![Page 40: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/40.jpg)
SkyServer DB Query
• 3-way join to find gravitational lens effect• Table U: (objId, color) 11.8GB• Table N: (objId, neighborId) 41.8GB• Find neighboring stars with similar colors:
– Join U+N to findT = U.color,N.neighborId where U.objId = N.objId
– Join U+T to findU.objId where U.objId = T.neighborID
and U.color ≈ T.color
![Page 41: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/41.jpg)
D D
MM 4n
SS 4n
YY
H
n
n
X Xn
U UN N
U U
• Took SQL plan• Manually coded in Dryad• Manually partitioned data
SkyServer DB query
u: objid, color
n: objid, neighborobjid
[partition by objid]
select
u.color,n.neighborobjid
from u join n
where
u.objid = n.objid
(u.color,n.neighborobjid)
[re-partition by n.neighborobjid]
[order by n.neighborobjid]
[distinct]
[merge outputs]
select
u.objid
from u join <temp>
where
u.objid = <temp>.neighborobjid and
|u.color - <temp>.color| < d
![Page 42: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/42.jpg)
Optimization
D
M
S
Y
X
M
S
M
S
M
S
U N
U
D D
MM 4n
SS 4n
YY
H
n
n
X Xn
U UN N
U U
![Page 43: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/43.jpg)
Optimization
D
M
S
Y
X
M
S
M
S
M
S
U N
U
D D
MM 4n
SS 4n
YY
H
n
n
X Xn
U UN N
U U
![Page 44: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/44.jpg)
0.0
2.0
4.0
6.0
8.0
10.0
12.0
14.0
16.0
0 2 4 6 8 10
Number of Computers
Spe
ed-u
pDryad In-Memory
Dryad Two-pass
SQLServer 2005
![Page 45: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/45.jpg)
Query histogram computation
• Input: log file (n partitions)• Extract queries from log partitions• Re-partition by hash of query (k buckets)• Compute histogram within each bucket
![Page 46: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/46.jpg)
Naïve histogram topology
Q Q
R
Q
R k
k
k
n
n
is:Each
R
is:
Each
MS
C
P
C
S
C
S
D
P parse lines
D hash distribute
S quicksort
C count occurrences
MS merge sort
![Page 47: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/47.jpg)
Efficient histogram topologyP parse lines
D hash distribute
S quicksort
C count occurrences
MS merge sort
M non-deterministic merge
Q' is:Each
R
is:
Each
MS
C
M
P
C
S
Q'
RR k
T
k
n
T
is:
Each
MS
D
C
![Page 48: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/48.jpg)
RR
T
Q’
MS►C►D
M►P►S►C
MS►C
P parse lines D hash distribute
S quicksort MS merge sort
C count occurrences M non-deterministic merge
R
![Page 49: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/49.jpg)
MS►C►D
M►P►S►C
MS►C
P parse lines D hash distribute
S quicksort MS merge sort
C count occurrences M non-deterministic merge
RR
T
R
Q’Q’Q’Q’
![Page 50: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/50.jpg)
MS►C►D
M►P►S►C
MS►C
P parse lines D hash distribute
S quicksort MS merge sort
C count occurrences M non-deterministic merge
RR
T
R
Q’Q’Q’Q’
T
![Page 51: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/51.jpg)
MS►C►D
M►P►S►C
MS►C
P parse lines D hash distribute
S quicksort MS merge sort
C count occurrences M non-deterministic merge
RR
T
R
Q’Q’Q’Q’
T
![Page 52: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/52.jpg)
P parse lines D hash distribute
S quicksort MS merge sort
C count occurrences M non-deterministic merge
MS►C►D
M►P►S►C
MS►C RR
T
R
Q’Q’Q’Q’
T
![Page 53: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/53.jpg)
P parse lines D hash distribute
S quicksort MS merge sort
C count occurrences M non-deterministic merge
MS►C►D
M►P►S►C
MS►C RR
T
R
Q’Q’Q’Q’
T
![Page 54: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/54.jpg)
Final histogram refinement
Q' Q'
RR 450
TT 217
450
10,405
99,713
33.4 GB
118 GB
154 GB
10.2 TB
1,800 computers
43,171 vertices
11,072 processes
11.5 minutes
![Page 55: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/55.jpg)
Optimizing Dryad applications
• General-purpose refinement rules• Processes formed from subgraphs
– Re-arrange computations, change I/O type• Application code not modified
– System at liberty to make optimization choices• High-level front ends hide this from user
– SQL query planner, etc.
![Page 56: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/56.jpg)
DryadLINQ: PageRank example
![Page 57: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/57.jpg)
An Example: PageRankRanks web pages by propagating scores along hyperlink structure
Each iteration as an SQL query:
1. Join edges with ranks2. Distribute ranks on edges3. GroupBy edge destination4. Aggregate into ranks5. Repeat
![Page 58: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/58.jpg)
One PageRank Step in DryadLINQ// one step of pagerank: dispersing and re-accumulating rankpublic static IQueryable<Rank> PRStep(IQueryable<Page> pages, IQueryable<Rank> ranks){ // join pages with ranks, and disperse updates var updates = from page in pages join rank in ranks on page.name equals rank.name select page.Disperse(rank);
// re-accumulate. return from list in updates from rank in list group rank.rank by rank.name into g select new Rank(g.Key, g.Sum());}
![Page 59: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/59.jpg)
The Complete PageRank Program
var pages = DryadLinq.GetTable<Page>(“file://pages.txt”); var ranks = pages.Select(page => new Rank(page.name, 1.0)); // repeat the iterative computation several times for (int iter = 0; iter < iterations; iter++) { ranks = PRStep(pages, ranks); }
ranks.ToDryadTable<Rank>(“outputranks.txt”);
public struct Page { public UInt64 name; public Int64 degree; public UInt64[] links;
public Page(UInt64 n, Int64 d, UInt64[] l) { name = n; degree = d; links = l; }
public Rank[] Disperse(Rank rank) { Rank[] ranks = new Rank[links.Length]; double score = rank.rank / this.degree; for (int i = 0; i < ranks.Length; i++) { ranks[i] = new Rank(this.links[i], score); } return ranks; } }
public struct Rank { public UInt64 name; public double rank;
public Rank(UInt64 n, double r) { name = n; rank = r; } }
public static IQueryable<Rank> PRStep(IQueryable<Page> pages, IQueryable<Rank> ranks) { // join pages with ranks, and disperse updates var updates = from page in pages join rank in ranks on page.name equals rank.name select page.Disperse(rank);
// re-accumulate. return from list in updates from rank in list group rank.rank by rank.name into g select new Rank(g.Key, g.Sum());}
![Page 60: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/60.jpg)
One Iteration PageRank
J
S
G
C
D
M
G
R
J
S
G
C
D
M
G
R
J
S
G
C
D
Join pages and ranks
Disperse page’s rank
Group rank by page
Accumulate ranks, partially
Hash distribute
Merge the data
Group rank by page
Accumulate ranks
M
G
R
…
…
Dynamic aggregation
![Page 61: Dryad and DryadLINQ Aditya Akella CS 838: Lecture 6](https://reader035.vdocuments.us/reader035/viewer/2022062802/56649e885503460f94b8c5cc/html5/thumbnails/61.jpg)
Multi-Iteration PageRankpages ranks
Iteration 1
Iteration 2
Iteration 3
Memory FIFO