DANDELION: A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS. Jon Currey, Microsoft Research. Joint work with Chris Rossbach, Yuan Yu, JP Martin, Dennis Fetterly.


Page 1: Dandelion: A Unified Programming Model for GPU Clusters (on-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion...)

DANDELION: A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS

Jon Currey, Microsoft Research

Joint work with Chris Rossbach, Yuan Yu, JP Martin, Dennis Fetterly

Page 2: Motivation: Programmability for Heterogeneous Distributed Systems

• Data volumes increasing
• Cluster costs decreasing
• Architectural diversity prevalent
  • CPU-GPU server: 5X Gflops/$, 4X Gflops/kwatt v. CPUs
• Programming challenges
  • Heterogeneity: programming models, arch. expertise
  • Distributed resources: data movement, scheduling
  • Concurrency: synchronization, consistency

Dandelion GTC 2014 S4221 2

Page 3: Dandelion Goal (holy grail)

• Single programming interface for clusters
  • CPUs
  • GPUs
  • FPGAs
  • You name it…
• Programmer writes sequential code
• Runtime
  • Parallelizes computation
  • Partitions data
  • Runs on all available resources
  • Maps computation to best architecture

Page 4: Dandelion goal (a less holy and attractive vessel: often just as effective, mileage may vary)

• Offload data-parallel code fragments
• Small cluster of multi-core + GPU
• Starting point: LINQ queries

Our 10-node GPU cluster:
-- 24,960 GPU cores
-- 240 CPU HW threads (12 cores x 2 ctxts/node)
-- 2560 GB RAM (256 GB/node)

Page 5: (Very) High Level View

[Diagram: User Program & Partitioned data files (input) → Dandelion: compile to a mix of CPU and GPU code, run on the cluster → Partitioned data files (output)]

Page 6: Dandelion Architecture

[Diagram labels]
• Client: User Program → Dandelion Compiler → Data-flow graphs (cluster, machine, GPU) and Worker Vertex Code (CPU and GPU)
• Cluster: Dandelion Vertex; Cluster Runtime, Machine Runtime, GPU Runtime; inputs: Worker Vertex Code, Data-flow graphs

Page 7: Wait… why so many different “dataflow” components?

Page 8: The composition problem

What happens if I want the following? Matrix D = A x B x C

Matrix gemm(Matrix A, Matrix B) {
    copyToGPU(A);
    copyToGPU(B);
    invokeGPU();
    Matrix C = new Matrix();
    copyFromGPU(C);
    return C;
}

Page 9: Composed matrix multiplication

Matrix gemm(Matrix A, Matrix B) {
    copyToGPU(A);
    copyToGPU(B);
    invokeGPU();
    Matrix C = new Matrix();
    copyFromGPU(C);
    return C;
}

Matrix AxBxC(Matrix A, Matrix B, Matrix C) {
    Matrix AxB = gemm(A, B);
    Matrix AxBxC = gemm(AxB, C);
    return AxBxC;
}

Page 10: Composed matrix multiplication

(Same code as page 9, with one annotation added.)

Annotation: AxB copied from GPU memory…

Page 11: Composed matrix multiplication

(Same code as page 9; the annotation continues from page 10.)

Annotation: …only to be copied right back!

Page 12: What if I have >1 GPU?

What happens if I want the following? Matrix D = A x B x C

Matrix gemm(GPU dev, Matrix A, Matrix B) {
    copyToGPU(dev, A);
    copyToGPU(dev, B);
    invokeGPU(dev);
    Matrix C = new Matrix();
    copyFromGPU(dev, C);
    return C;
}

Page 13: Composition with >1 GPU

Matrix gemm(GPU dev, Matrix A, Matrix B) {
    copyToGPU(dev, A);
    copyToGPU(dev, B);
    invokeGPU(dev);
    Matrix C = new Matrix();
    copyFromGPU(dev, C);
    return C;
}

Matrix AxBxC(Matrix A, Matrix B, Matrix C) {
    Matrix AxB = gemm(???, A, B);
    Matrix AxBxC = gemm(???, AxB, C);
    return AxBxC;
}

Page 14: Composition with >1 GPU

(gemm(GPU dev, Matrix A, Matrix B) as on page 12.)

Matrix AxBxC(GPU dev, Matrix A, Matrix B, Matrix C) {
    Matrix AxB = gemm(dev, A, B);
    Matrix AxBxC = gemm(dev, AxB, C);
    return AxBxC;
}

Rats… now I can only use 1 GPU. How to partition computation?

Page 15: Composition with >1 GPU

(gemm(GPU dev, Matrix A, Matrix B) as on page 12.)

Matrix AxBxC(GPU devA, GPU devB, Matrix A, Matrix B, Matrix C) {
    Matrix AxB = gemm(devA, A, B);
    Matrix AxBxC = gemm(devB, AxB, C);
    return AxBxC;
}

Rats… this will never scale to many GPUs. Plus, how do I choose which GPUs to use?

Why don’t we have this problem with CPUs?

Device-centric APIs are the wrong abstraction for GPU compute.

Page 16: Dataflow: program == graph

• nodes → computation
• edges → communication
• Expresses parallelism explicitly
• Minimal specification of data movement: runtime does it.
• Asynchrony is a runtime concern (not a programmer concern)
• No specification of compute→device mapping: like threads!

[Diagram: Matrix A and Matrix B feed a gemm node; its result and Matrix C feed a second gemm node]

Programmer provides algorithms and graph structure; the runtime does the rest:
• Data movement (with asynchrony), multi-GPU scheduling
• Works for distributed compute too!
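To make the "program == graph" idea concrete, here is a minimal Python sketch of D = A x B x C written as a dataflow graph. Everything in it (`Node`, `run`, the round-robin device choice, the device names) is a hypothetical illustration and not Dandelion's API; the point is only that the programmer wires up nodes and edges while a runtime decides placement and keeps intermediate results to itself.

```python
# Minimal dataflow sketch: the program is a graph, a tiny "runtime" evaluates it.
# Node, run and the device names are hypothetical, not Dandelion APIs.

def gemm(a, b):
    """Plain-Python matrix multiply on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class Node:
    """A graph node: a computation plus the nodes (edges) that feed it."""

    def __init__(self, name, fn=None, inputs=(), value=None):
        self.name = name
        self.fn = fn
        self.inputs = tuple(inputs)
        self.value = value


def run(node, devices, placement=None, cache=None):
    """Evaluate `node`; the runtime picks a device for each computation node."""
    placement = {} if placement is None else placement
    cache = {} if cache is None else cache
    if node.name in cache:
        return cache[node.name], placement
    if node.fn is None:
        result = node.value  # source node: just holds data
    else:
        args = [run(i, devices, placement, cache)[0] for i in node.inputs]
        # Placement is a runtime decision (round-robin here), not a parameter
        # the programmer threads through every call.
        placement[node.name] = devices[len(placement) % len(devices)]
        result = node.fn(*args)
    cache[node.name] = result
    return result, placement


A = Node("A", value=[[1, 2], [3, 4]])
B = Node("B", value=[[0, 1], [1, 0]])
C = Node("C", value=[[2, 0], [0, 2]])
AxB = Node("AxB", gemm, [A, B])
AxBxC = Node("AxBxC", gemm, [AxB, C])

result, placement = run(AxBxC, devices=["gpu0", "gpu1"])
```

Note that `AxBxC` never mentions a device: adding a third entry to `devices` changes placement without touching the graph, which is the property the device-centric versions on pages 13 to 15 lack.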

Page 17: Dandelion Architecture (2)

[Diagram labels]
• User program (LINQ Query) → Dandelion Compiler
• cluster graph (Cluster Runtime): cluster of machines, master and slaves (S), vertices (V); TCP, caches, files
• machine graph (Machine Runtime): tasks A, B, C, D; CPU Task, GPU Task
• GPU graph (GPU Runtime); primitive library: relational algebra
• Legend: M = Master, S = Slave; markers for CPU and GPU
• This talk (mostly):

Page 18: What’s a LINQ query?

• Language Integrated Query
  • Relational operators on collections

var res = collection
    .Where(x => x.isRed())
    .GroupBy(x => x)
    .Select(x => f(x));

• Why focus on LINQ?
  • Expresses many important workloads easily
    • K-Means, PageRank (MR), Sparse Matrix SVD, …
  • Powerful: lambdas embed C#/.NET
  • Declarative/data-parallel
  • Natural fit for dataflow
  • Lambdas in C++11 and Java 8
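For readers who do not use C#, the Where/GroupBy/Select chain above can be approximated in Python. This is only an analogy under assumed data: the helper functions, the `(color, value)` tuples, and `count_group` (standing in for the slide's unspecified `f`) are all made up for illustration.

```python
from collections import defaultdict


def where(items, predicate):
    """Keep the elements for which predicate(x) is true (LINQ Where)."""
    return (x for x in items if predicate(x))


def group_by(items, key_fn):
    """Return (key, group) pairs in first-seen key order (LINQ GroupBy)."""
    groups = defaultdict(list)
    for x in items:
        groups[key_fn(x)].append(x)
    return list(groups.items())


def select(items, fn):
    """Apply fn to every element (LINQ Select)."""
    return [fn(x) for x in items]


# Hypothetical data: (color, value) pairs.
collection = [("red", 1), ("blue", 2), ("red", 1), ("red", 3)]
is_red = lambda item: item[0] == "red"
count_group = lambda kv: (kv[0], len(kv[1]))  # stands in for the slide's f

res = select(group_by(where(collection, is_red), lambda x: x), count_group)
```

Each operator takes a collection and a lambda and returns a collection, which is why a query reads naturally as a chain of dataflow stages.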

Page 19: Running Example: K-Means

• Partition n points into k clusters
• Pick k initial centers

while (not done) {
    1. Each point → nearest center
    2. Each new center = mean(points → old center)
}

Page 20: Running Example: K-Means

(K-means loop as on page 19.)

centers = points
    .GroupBy(point => NearestCenter(point, centers))
    .Select(g => g.Aggregate((x, y) => x+y)/g.Count());

• Step 1: Group points by nearest cluster center (GroupBy): GPU implementation non-obvious
• Step 2: Each new cluster center = average of points in a group (Select): simple mapping to GPU
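One K-means iteration, written as the same two steps (group by nearest center, then average each group), might look like this in Python. It is a sketch under assumptions: points are plain tuples, squared Euclidean distance is used (it selects the same nearest center as the norm), and `kmeans_step` is a hypothetical name.

```python
def nearest_center(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, center in enumerate(centers):
        dist = sum((p - c) ** 2 for p, c in zip(point, center))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def kmeans_step(points, centers):
    """Step 1: group points by nearest center. Step 2: average each group."""
    groups = {}
    for point in points:
        groups.setdefault(nearest_center(point, centers), []).append(point)
    # A center that attracts no points produces no group, as in the query form.
    return [
        tuple(sum(coords) / len(group) for coords in zip(*group))
        for _, group in sorted(groups.items())
    ]


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centers = [(0, 0), (10, 10)]
new_centers = kmeans_step(points, centers)
```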

Page 21: GroupBy

• Group a collection by key
• Lambda function maps elements → key

var res = ints.GroupBy(x => x);

ints: 10 30 20 10 20 30 10
res:  10 10 10 | 20 20 | 30 30

foreach(T elem in ints)
{
    key = KeyLambda(elem);
    group = GetGroup(key);
    group.Add(elem);
}

foreach(T elem in PF(ints))
{
    key = KeyLambda(elem);
    group = GetGroup(key);
    group.Add(elem);
}

Page 22: Background: GPU Architecture

[Diagram: a kernel is split into thread blocks, each containing threads such as Thread(0, 0) … Thread(31, 0) and Thread(0, 1) … Thread(31, 1); the blocks are distributed over a device with 4 SMs (SM 0, SM 1, SM 2, SM 3)]

SM = Streaming Multiprocessor

• Wide SIMD (vector) machine: SMs
  • code == kernels, 1000s of threads
  • explicit subdivision: blocks
• model: all threads run in parallel
  • HW maps subsets (warps) to SMs
• warps: concurrent, divergent CF serialized
  • schedule non-deterministic
  • locks problematic, despite atomic ops
• exposed u-arch features/warts
  • e.g. software-managed caches
  • 1st order performance impact

foreach(T elem in PF(ints))
{
    key = KeyLambda(elem);
    group = GetGroup(key);
    group.Add(elem);
}

Page 23: GPU GroupBy

• Process each input element in parallel
  • grouping ~ shuffling
  • input item → output offset s.t. groups are contiguous
  • output offset = group offset + item number
  • … but how to get the group offset, item number?

ints: 10 30 20 10 20 30 10
res:  10 10 10 | 20 20 | 30 30

Needed to compute the offsets:
• Number of groups and input → group mapping
• Number of elements in each group
• Start index of each group in the output sequence

Page 24: GPU GroupBy: Multiple Stages

Input: 10 30 20 10 20 30 10

1. Assign group IDs: hash table lookup gives the group ID (GPU lock-free hash table)
2. Compute group sizes: uses atomic increment
3. Compute start indices: prefix sum of group sizes
4. Write outputs: write to output location, uses atomic increment

Key:                10  20  30
Group ID:            0   1   2
Group Size:          3   2   2
Group Start Index:   0   3   5

Output: 10 10 10 | 20 20 | 30 30

Page 25: GPU GroupBy: Multiple Stages

(Same stages and example as page 24: Assign group IDs, Compute group sizes, Compute start indices, Write Outputs.)

• User types/functions not needed at every step
• The dataflow is abstract → generic primitives

Page 26: Composed Generic Primitives

GPU GroupBy: Multiple Stages (stages and example as on page 24), with each stage mapped to a generic primitive:

GroupBy<K,T,keyfn, eqfn>
1. Assign group IDs → buildHT<K,T,keyfn, eqfn>
2. Compute group sizes → groupsizes
3. Compute start indices → prefixsum
4. Write outputs → shuffle<K,T,keyfn>

• Compile time: how to cross-compile/marshal these?
• How to build a LINQ→GPU compiler: repeat this process for all LINQ operators
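The four-stage GroupBy built from buildHT, groupsizes, prefixsum and shuffle can be sketched sequentially in Python. This is a single-threaded illustration, not the GPU implementation: a `dict` stands in for the lock-free hash table (so key equality is Python's `==` rather than a user `eqfn`), plain counters stand in for atomic increments, and group IDs are assumed to be handed out in first-seen order, so the group order differs from the slide's table.

```python
def build_ht(items, key_fn):
    """Stage 1: assign a dense group ID to every item via a hash table."""
    table, ids = {}, []
    for item in items:
        # setdefault evaluates len(table) before inserting, so IDs are 0, 1, 2, ...
        ids.append(table.setdefault(key_fn(item), len(table)))
    return table, ids


def group_sizes(ids, n_groups):
    """Stage 2: count items per group (an atomic increment on the GPU)."""
    sizes = [0] * n_groups
    for group_id in ids:
        sizes[group_id] += 1
    return sizes


def prefix_sum(sizes):
    """Stage 3: exclusive prefix sum gives each group's start index."""
    starts, running = [], 0
    for size in sizes:
        starts.append(running)
        running += size
    return starts


def shuffle(items, ids, starts):
    """Stage 4: output offset = group offset + item number within the group."""
    cursor = list(starts)  # next free slot per group (atomic increment on the GPU)
    out = [None] * len(items)
    for item, group_id in zip(items, ids):
        out[cursor[group_id]] = item
        cursor[group_id] += 1
    return out


def group_by(items, key_fn):
    """Compose the four primitives; only stage 1 needs the user's key function."""
    table, ids = build_ht(items, key_fn)
    sizes = group_sizes(ids, len(table))
    starts = prefix_sum(sizes)
    out = shuffle(items, ids, starts)
    return [out[start:start + size] for start, size in zip(starts, sizes)]


ints = [10, 30, 20, 10, 20, 30, 10]
table, ids = build_ht(ints, lambda x: x)
sizes = group_sizes(ids, len(table))
starts = prefix_sum(sizes)
groups = group_by(ints, lambda x: x)
```

Only `build_ht` touches the user-supplied key function; the remaining stages work on integer IDs, which is what lets them be generic primitives.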

Page 27: Compiling C# to GPU code

int NearestCenter(Vector point, IEnumerable<Vector> centers) {
    int minIndex = 0, curIndex = 0;
    double minValue = Double.MaxValue;
    foreach (Vector center in centers) {
        double curValue = (center - point).Norm2();
        minIndex = (minValue > curValue) ? curIndex : minIndex;
        minValue = (minValue > curValue) ? curValue : minValue;
        curIndex++;
    }
    return minIndex;
}

centers = points
    .GroupBy(pnt => NearestCenter(pnt, centers))
    .Select(g => g.Aggregate((x, y) => x + y) / g.Count());

Marshalling for user types:
1. Decide GPU-side layout
2. Generate serialization code

Also cross-compile all referenced functions.

Page 28: Compiling C# to GPU code

• Translation performed at .NET byte-code (‘CIL’) level
  • Map C# types to CUDA structs
  • Translate C# methods into CUDA kernel functions
  • Generate C# code for CPU-GPU serialization/transfer
• Main constraint: dynamic memory allocation
  • Convert to stack allocation if object size can be inferred
  • Fail parallelization, fallback to host otherwise
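The marshalling step (fix a device-side layout, then generate code that serializes into it) can be illustrated with Python's standard `struct` module. This is a sketch under assumptions, not the generated C# code: the layout is taken to be a contiguous array of little-endian 32-bit floats, and `marshal_vectors` / `unmarshal_vectors` are hypothetical names.

```python
import struct


def marshal_vectors(vectors, dim):
    """Pack equally sized vectors into one contiguous float32 buffer."""
    for vector in vectors:
        if len(vector) != dim:
            # Mirrors the constraint above: without a fixed, inferable size
            # there is no static layout, so the caller must stay on the host.
            raise ValueError("vector size does not match the fixed layout")
    flat = [x for vector in vectors for x in vector]
    return struct.pack("<%df" % len(flat), *flat)


def unmarshal_vectors(buffer, dim):
    """Inverse of marshal_vectors: rebuild the list of vectors."""
    count = len(buffer) // 4  # 4 bytes per float32
    flat = struct.unpack("<%df" % count, buffer)
    return [list(flat[i:i + dim]) for i in range(0, count, dim)]


vectors = [[1.5, -2.0], [0.25, 8.0]]
buffer = marshal_vectors(vectors, dim=2)
round_trip = unmarshal_vectors(buffer, dim=2)
```

The example values are exactly representable in 32-bit floats, so the round trip is lossless; arbitrary doubles would be rounded by this layout.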

Page 29: Generated CUDA Kernel Code

(C# source: NearestCenter, as on page 27.)

__device__ __host__ int NearestCenter_Kernel(KernelStruct_0 point, KernelStruct_0 *centers, int centers_n) {
    KernelStruct_0 local_6;
    int local_0 = 0;
    double local_1 = 1.79769313486232E+308;
    int local_2 = 0;
    int centers_n_idx = -1;
    goto IL_0041;
    {
    IL_0018:
        KernelStruct_0 local_3 = centers[centers_n_idx];
        local_6 = op_Subtraction_Kernel(local_3, point);
        double local_4 = ((double)(Norm2_Kernel(local_6)));
        if (((local_1) > (local_4))) {
            local_1 = local_4;
            local_0 = local_2;
        }
        local_2 = ((local_2) + (1));
    IL_0041:
        if (((++centers_n_idx) < centers_n)) {
            goto IL_0018;
        }
        goto IL_0058;
    }
IL_0058:
    return local_0;
}

struct KernelStruct_0 {
    float arr[N];
    __device__ int GetLength() { return N; }
};

Page 30: Leveraging lazy evaluation

void KMeans(IQueryable<Vector> points,
            IQueryable<Vector> centers) {
    var newCenters =
        points.GroupBy(point => NearestCenter(point, centers))
              .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
    ... // other stuff
    foreach (Vector center in newCenters) {
        do_something(center);
    }
}

• newCenters is an expression tree: GroupBy → Select
• Dandelion invoked:
  1. load binary, find IL
  2. generate C#, CUDA
  3. compile *.dll, *.ptx
  4. build dataflow graphs
  5. deploy bin, graphs
  …
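The lazy-evaluation hook can be mimicked in a few lines of Python: building the query only records operators, and work happens when the result is enumerated. `LazyQuery` and its methods are hypothetical; a real provider would compile and deploy the recorded plan at that point rather than interpret it in-process.

```python
class LazyQuery:
    """Records operators as a plan; nothing runs until iteration."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)

    def where(self, predicate):
        return LazyQuery(self.source, self.ops + (("where", predicate),))

    def select(self, fn):
        return LazyQuery(self.source, self.ops + (("select", fn),))

    def plan(self):
        """The recorded operator chain, a stand-in for an expression tree."""
        return [name for name, _ in self.ops]

    def __iter__(self):
        # Enumeration is the point where a provider would be invoked.
        items = iter(self.source)
        for name, fn in self.ops:
            items = filter(fn, items) if name == "where" else map(fn, items)
        return items


calls = []


def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


query = LazyQuery([1, 2, 3, 4]).where(lambda x: x % 2 == 0).select(square)
calls_before_enumeration = list(calls)  # still empty: nothing has executed
results = list(query)                   # enumeration triggers execution
```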

Page 31: K-Means Dataflow Graphs

[Diagram labels: 10 x vector-partition; centers; Tee; 10 x GroupBy; ten G1 vertices; ten G2 vertices; merge; new_centers. Each GroupBy vertex expands into a machine graph/GPU graph.]

Page 32: Evaluation

• Programmability
• Performance: single-machine & cluster
• Benchmarks: kmeans, pagerank, terasort, skyserver
  • Black-Scholes, ID3 dec. trees, BM25F (local-only)
• Platform: 10-machine cluster
  • NVIDIA Tesla K20m, 5GB GDDR5
  • 2 Xeon E5 2.3GHz, 24 hw threads
  • 256 GB RAM; L1: 32K I + 32K D, 256K L2, 15M L3
  • Windows Server 2008 R2 64-bit
  • Mellanox ConnectX-3 10 Gigabit Ethernet

Page 33: K-Means in C#

class KMeans {
    int NearestCenter(Vector point, IEnumerable<Vector> centers) {
        int minIndex = 0, curIndex = 0;
        double minValue = Double.MaxValue;
        foreach (Vector center in centers) {
            double curValue = (center - point).Norm2();
            minIndex = (minValue > curValue) ? curIndex : minIndex;
            minValue = (minValue > curValue) ? curValue : minValue;
            curIndex++;
        }
        return minIndex;
    }

    IQueryable<Vector> Steps(int nSteps, IQueryable<Vector> points, IQueryable<Vector> centers) {
        for (int i = 0; i < nSteps; i++)
            centers = points
                .GroupBy(point => NearestCenter(point, centers))
                .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
        return centers;
    }

    IQueryable<Vector> KMeans() {
        IQueryable<Vector> points = new Vector[N];
        IQueryable<Vector> centers = new Vector[K];
        return Steps(s, points, centers);
    }
}

Page 34: K-Means in Dandelion

(Same class as page 33; NearestCenter and Steps are unchanged. Only the two declarations in KMeans() differ.)

    IQueryable<Vector> KMeans() {
        IQueryable<Vector> points = new Vector[N].AsDandelion();
        IQueryable<Vector> centers = new Vector[K].AsDandelion();
        return Steps(s, points, centers);
    }

Page 35: K-Means Shootout

[Chart: speedup over sequential C++ (log scale, 0.1 to 1000) and SLOC (0 to 1000) per implementation; series: speedup, SLOC]

• Speedup: log-scale, higher is better
• SLOC: lower is better
• Other input sizes similar

Chart annotations:
• low SLOC but slow; 24 threads only 7x
• fast, complex; expertise required
• 20X SLOC reduction; ~17X speedup v. seq.; 2..7X slower v. hand opt.

Platform (single machine):
• NVIDIA Tesla K20m, 5GB GDDR5
• 2 Xeon E5 2.3GHz, 24 hw threads
• 256 GB RAM; L1: 32K I + 32K D, 256K L2, 15M L3
• Windows Server 2008 R2 64-bit

Page 36: Single-machine performance

[Chart: speedup over sequential LINQ/CPU (0 to 20); series: LINQ-seq, Multi-thread CPU, GPU]

• Higher is better
• Other input sizes: same trends

Chart annotations:
• 15-20X v. seq, 2x v. 24 cpus; high compute:data
• pyrrhic victory: ~2X v. seq, 1X v. 24 cpus; low arithmetic intensity

Platform:
• NVIDIA Tesla K20m, 5GB GDDR5
• 2 Xeon E5 2.3GHz, 24 hw threads
• 256 GB RAM; L1: 32K I + 32K D, 256K L2, 15M L3
• Windows Server 2008 R2 64-bit

Page 37: Cluster performance

[Chart: speedup over 1 thread/node x 10 nodes (log scale, 1 to 100) for kmeans, pagerank, skyserver, terasort; series: Multi-thread CPU, GPU]

• Speedup is log-scale, higher is better
• Larger inputs for cluster:

Chart annotations:
• 66X v. 1 cpu/node; 4X v. 24 cpus/node; data streamable
• intermediate data > GPU mem; GPU runtime thrashing
• dist. overheads narrow GPU v. CPU gap

Platform (10 machines):
• NVIDIA Tesla K20m, 5GB GDDR5
• 2 Xeon E5 2.3GHz, 24 hw threads
• 256 GB RAM; L1: 32K I + 32K D, 256K L2, 15M L3
• Windows Server 2008 R2 64-bit
• Mellanox ConnectX-3 10 Gigabit Ethernet

Page 38: Related Work

• LINQits: Dandelion compiler with FPGA backend [ISCA ’13]
• GPU programming models/cross-compilation
  • Delite [Chafi, Brown ’11], Liszt [DeVito ’11], Halide [Ragan-Kelley ’13], Legion [Bauer ’12], OptiML [Sujeeth ’11]
  • Accelerator [ASPLOS ’06], C++ AMP, CUDA, OpenCL
  • StreamIt CUDA [CGO ’09, LCTES ’09], Flextream [Hormati ’09], Lime [Auerbach ’10]
  • Copperhead [Catanzaro ’11], JCUDA [Yan ’09], Rootbeer [Pratt-Szeliga ’12], PyCUDA [Kloeckner ’12]
  • Jacket, MATLAB CUDA compiler [Prasad ’11]
• GPU scheduling/GPU engines
  • TimeGraph [Kato ’11], Maestro [Spafford ’10], Pegasus [Gupta ’11], StarPU [Augonnet], Merge [Linderman ’08]
• Graph-based programming models
  • Synthesis [Massalin ’89], Monsoon/Id [Arvind], Dryad [Isard ’07]
  • StreamIt [Thies ’02], DirectShow, TCP Offload [Currid ’04]
  • PTask [Rossbach ’11], PipesFS [de Bruijn ’08], FFPF [Bos ’04], Ruler [Hruby ’07]
• Relational algebra on GPUs [He ’08, He ’09, Govindaraju ’05], Thrust
• More… please see paper

Page 39: Conclusion

• Dandelion
  • High-level abstractions for heterogeneous systems
  • Improved programmability
  • Current results promising, incomplete
• Future work:
  • Query planning, scheduling, applications
  • Support more accelerators/architectures
  • Move beyond LINQ
• Dataflow: an important key
  • Enables composition of multiple runtimes

Thank you! Questions?