HASKELL ARRAYS, ACCELERATED USING GPUS
Manuel M. T. Chakravarty, University of New South Wales
Joint work with Gabriele Keller and Sean Lee
Monday, 7 September 2009

Upload: don-stewart
Posted on 12-Nov-2014

DESCRIPTION

Manuel Chakravarty's original PDF: http://bit.ly/mhZYn

Software needs to expose increasing amounts of explicit parallelism to fully utilise modern processors. The current extreme is Graphics Processing Units (GPUs), which require thousands of data-parallel threads to reach maximum performance. Purely functional programming, and especially collective array operations, are a good match for data-parallel programming. In this talk, I briefly review two approaches to data-parallel programming in Haskell and present some first benchmarks. The two approaches differ in that the first implements the well-understood model of flat data parallelism, whereas the second implements the much more ambitious model of nested data parallelism. We target GPUs with the flat parallel model and multicore CPUs with the nested parallel model.

TRANSCRIPT

Page 1: Haskell Arrays Accelerated with GPUs

HASKELL ARRAYS, ACCELERATED USING GPUS

Manuel M. T. Chakravarty
University of New South Wales

JOINT WORK WITH

Gabriele Keller
Sean Lee


Page 2: Haskell Arrays Accelerated with GPUs

General Purpose GPU Programming (GPGPU)


Page 3: Haskell Arrays Accelerated with GPUs

MODERN GPUS ARE FREELY PROGRAMMABLE
But no function pointers & limited recursion


Page 5: Haskell Arrays Accelerated with GPUs

Very Different Programming Model (compared to multicore CPUs)


Page 6: Haskell Arrays Accelerated with GPUs

MASSIVELY PARALLEL PROGRAMS NEEDED
Tens of thousands of data-parallel threads

Quadcore Xeon CPU

Tesla T10 GPU

hundreds of threads/core


Page 7: Haskell Arrays Accelerated with GPUs

Programming GPUs is hard! Why bother?


Page 8: Haskell Arrays Accelerated with GPUs

Reduce power consumption!

✴ GPUs achieve 20x better performance/Watt (judging by peak performance)
✴ Speedups between 20x and 150x have been observed in real applications


Page 9: Haskell Arrays Accelerated with GPUs

Prototype of code generator targeting GPUs
Computation only, without CPU ⇄ GPU transfer

[Benchmark chart: Sparse Matrix Vector Multiplication. Time (milliseconds, log scale from 0.1 to 1000) against number of non-zero elements (0.1 to 1 million). Series: Plain Haskell, CPU only (AMD Sempron); Plain Haskell, CPU only (Intel Xeon); Haskell EDSL on a GPU (GeForce 8800GTS); Haskell EDSL on a GPU (Tesla S1070 x1).]


Page 10: Haskell Arrays Accelerated with GPUs

Challenges

Code must be massively data-parallel

Control structures are limited

‣ No function pointers

‣ Very limited recursion

Software-managed cache, memory-access patterns, etc.

Portability...


Page 11: Haskell Arrays Accelerated with GPUs

OTHER COMPUTE ACCELERATOR ARCHITECTURES
Goal: portable data parallelism

Tesla T10 GPU



Page 19: Haskell Arrays Accelerated with GPUs

Data.Array.Accelerate
Collective operations on multi-dimensional regular arrays

Embedded DSL

‣ Restricted control flow

‣ First-order GPU code

Generative approach based on combinator templates

Multiple backends

✓ massive data parallelism

✓ limited control structures

✓ hand-tuned access patterns

✓ portability


Page 24: Haskell Arrays Accelerated with GPUs

import Data.Array.Accelerate

Dot product

dotp :: Vector Float -> Vector Float -> Acc (Scalar Float)
dotp xs ys = let xs' = use xs
                 ys' = use ys
             in
             fold (+) 0 (zipWith (*) xs' ys')

Annotations from the slide:

‣ Vector Float: a Haskell array
‣ Acc (Scalar Float): an EDSL array, i.e. a description of an array computation
‣ use: lifts Haskell arrays into the EDSL; may trigger a host ➙ device transfer
‣ fold and zipWith: EDSL array computations
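For intuition, the computation that the EDSL version merely describes can be written directly over plain Haskell lists. The sketch below is ours for illustration (the name dotpRef is made up and not part of the Accelerate API):

```haskell
-- Plain-Haskell reference semantics of the EDSL dot product above.
-- 'use' has no counterpart here: lists already live on the host.
dotpRef :: Num a => [a] -> [a] -> a
dotpRef xs ys = foldl (+) 0 (zipWith (*) xs ys)

-- e.g. dotpRef [1, 2, 3] [4, 5, 6] = 32
```

The EDSL version computes the same value, but as a data-parallel GPU computation rather than a sequential fold.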



Page 27: Haskell Arrays Accelerated with GPUs

import Data.Array.Accelerate

Sparse-matrix vector multiplication

type SparseVector a = Vector (Int, a)
type SparseMatrix a = (Segments, SparseVector a)

smvm :: Acc (SparseMatrix Float) -> Acc (Vector Float) -> Acc (Vector Float)
smvm (segd, smat) vec
  = let (inds, vals) = unzip smat
        vecVals      = backpermute (shape inds) (\i -> inds!i) vec
        products     = zipWith (*) vecVals vals
    in
    foldSeg (+) 0 products segd

Sparse vector representation: [0, 0, 6.0, 0, 7.0] ≈ [(2, 6.0), (4, 7.0)]

Sparse matrix representation: [[10, 20], [], [30]] ≈ ([2, 0, 1], [10, 20, 30])
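The segmented fold that drives smvm can be given list-level reference semantics. The sketch below is illustrative plain Haskell under the flattened representation above; the names foldSegRef and smvmRef are ours, not Accelerate API:

```haskell
-- Reference semantics of foldSeg: fold each segment of 'xs' separately,
-- where 'segd' gives the length of each segment.
foldSegRef :: (a -> a -> a) -> a -> [a] -> [Int] -> [a]
foldSegRef _ _ _  []     = []
foldSegRef f z xs (n:ns) = let (seg, rest) = splitAt n xs
                           in foldl f z seg : foldSegRef f z rest ns

-- Sparse matrix-vector multiply over the flattened representation:
-- a segment descriptor plus (column index, value) pairs.
smvmRef :: Num a => ([Int], [(Int, a)]) -> [a] -> [a]
smvmRef (segd, smat) vec =
  let (inds, vals) = unzip smat
      vecVals      = map (vec !!) inds         -- the backpermute step
      products     = zipWith (*) vecVals vals
  in foldSegRef (+) 0 products segd

-- e.g. smvmRef ([2, 0, 1], [(0, 10), (1, 20), (2, 30)]) [1, 2, 3]
--        = [50, 0, 90]   (one dot product per matrix row)
```

Each segment corresponds to one matrix row, so rows of different lengths, including empty ones, are handled uniformly by one flat data-parallel pass.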


Page 28: Haskell Arrays Accelerated with GPUs

Architecture of Data.Array.Accelerate


Page 35: Haskell Arrays Accelerated with GPUs

map (\x -> x + 1) arr

Reify & HOAS -> de Bruijn

Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr

Recover sharing (CSE or Observe)

Optimisation (Fusion)

__global__ void kernel (float *arr, int n)
{ ...

Code generation

nvcc

package plugins
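The "Reify & HOAS -> de Bruijn" step above can be illustrated on a toy scalar language. This is a hand-rolled miniature with made-up names (Term, reify, runMap), not the actual Accelerate internals, which handle arbitrarily nested binders:

```haskell
-- A toy first-order term language with a single de Bruijn-style
-- variable (one binder suffices for unary 'map'-style functions).
data Term = ZeroIdx          -- the variable bound by the enclosing lambda
          | Const Int
          | Add Term Term
  deriving Show

-- Reify a HOAS function into a first-order term by applying it
-- to an occurrence of its own bound variable.
reify :: (Term -> Term) -> Term
reify f = f ZeroIdx

-- Interpret a term given the value of the bound variable.
eval :: Int -> Term -> Int
eval x ZeroIdx   = x
eval _ (Const n) = n
eval x (Add a b) = eval x a + eval x b

-- Interpret a reified 'map' body over a list.
runMap :: Term -> [Int] -> [Int]
runMap body = map (\x -> eval x body)

-- e.g. reify (\x -> Add x (Const 1)) = Add ZeroIdx (Const 1),
--      the analogue of Lam (Add `PrimApp` (ZeroIdx, Const 1)) above.
```

In the real pipeline, the resulting first-order AST is what sharing recovery, fusion, and CUDA code generation operate on.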


Page 36: Haskell Arrays Accelerated with GPUs

The API of Data.Array.Accelerate (the main bits)


Page 39: Haskell Arrays Accelerated with GPUs

Array types

data Array dim e — regular, multi-dimensional arrays

type DIM0 = ()
type DIM1 = Int
type DIM2 = (Int, Int)
⟨and so on⟩

type Scalar e = Array DIM0 e
type Vector e = Array DIM1 e

EDSL forms

data Exp e — scalar expression form
data Acc a — array expression form

Classes

class Elem e           — scalar and tuple types
class Elem ix => Ix ix — unit and integer tuples


Page 40: Haskell Arrays Accelerated with GPUs

Scalar operations

instance Num (Exp e)      — overloaded arithmetic
instance Integral (Exp e)
⟨and so on⟩

(==*), (/=*), (<*), (<=*), — comparisons
(>*), (>=*), min, max
(&&*), (||*), not          — logical operators

(?) :: Elem t — conditional expression
    => Exp Bool -> (Exp t, Exp t) -> Exp t

(!) :: (Ix dim, Elem e) — scalar indexing
    => Acc (Array dim e) -> Exp dim -> Exp e

shape :: Ix dim — yield array shape
      => Acc (Array dim e) -> Exp dim


Page 41: Haskell Arrays Accelerated with GPUs

Collective array operations — creation

— use an array from Haskell land
use :: (Ix dim, Elem e) => Array dim e -> Acc (Array dim e)

— create a singleton
unit :: Elem e => Exp e -> Acc (Scalar e)

— multi-dimensional replication
replicate :: (SliceIx slix, Elem e)
          => Exp slix
          -> Acc (Array (Slice slix) e)
          -> Acc (Array (SliceDim slix) e)

— Example: replicate (All, 10, All) twoDimArr


Page 42: Haskell Arrays Accelerated with GPUs

Collective array operations — slicing

— slice extraction
slice :: (SliceIx slix, Elem e)
      => Acc (Array (SliceDim slix) e)
      -> Exp slix
      -> Acc (Array (Slice slix) e)

— Example: slice (5, All, 7) threeDimArr


Page 43: Haskell Arrays Accelerated with GPUs

Collective array operations — mapping

map :: (Ix dim, Elem a, Elem b)
    => (Exp a -> Exp b)
    -> Acc (Array dim a)
    -> Acc (Array dim b)

zipWith :: (Ix dim, Elem a, Elem b, Elem c)
        => (Exp a -> Exp b -> Exp c)
        -> Acc (Array dim a)
        -> Acc (Array dim b)
        -> Acc (Array dim c)


Page 44: Haskell Arrays Accelerated with GPUs

Collective array operations — reductions

fold :: (Ix dim, Elem a)
     => (Exp a -> Exp a -> Exp a) — associative
     -> Exp a
     -> Acc (Array dim a)
     -> Acc (Scalar a)

scan :: Elem a
     => (Exp a -> Exp a -> Exp a) — associative
     -> Exp a
     -> Acc (Vector a)
     -> (Acc (Vector a), Acc (Scalar a))
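The (Vector, Scalar) result type of scan suggests a prefix vector plus the overall total. As a reference, here is a plain-Haskell list version; we assume an exclusive left scan (the EDSL operation may differ in detail), and the name scanRef is ours:

```haskell
-- Exclusive left scan over a list, also returning the total:
-- the i-th output is the fold of the first i inputs.
scanRef :: (a -> a -> a) -> a -> [a] -> ([a], a)
scanRef f z xs = (init prefixes, last prefixes)
  where prefixes = scanl f z xs   -- scanl yields n+1 prefixes; the last is the total

-- e.g. scanRef (+) 0 [1, 2, 3] = ([0, 1, 3], 6)
```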


Page 45: Haskell Arrays Accelerated with GPUs

Collective array operations — permutations

permute :: (Ix dim, Ix dim', Elem a)
        => (Exp a -> Exp a -> Exp a)
        -> Acc (Array dim' a)
        -> (Exp dim -> Exp dim')
        -> Acc (Array dim a)
        -> Acc (Array dim' a)

backpermute :: (Ix dim, Ix dim', Elem a)
            => Exp dim'
            -> (Exp dim' -> Exp dim)
            -> Acc (Array dim a)
            -> Acc (Array dim' a)
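backpermute is a gather: for each index of the result, it reads from a source index computed by the permutation function (as in the smvm example earlier). A one-dimensional, list-based reference sketch (backpermuteRef is our made-up name):

```haskell
-- Reference semantics of backpermute on lists: build a result of size
-- 'n' by reading, for each target index i, the source element at
-- index 'perm i'. Every output element is independent, hence data-parallel.
backpermuteRef :: Int -> (Int -> Int) -> [a] -> [a]
backpermuteRef n perm xs = [ xs !! perm i | i <- [0 .. n - 1] ]

-- e.g. reversal as a gather:
--      backpermuteRef 3 (\i -> 2 - i) "abc" = "cba"
```

permute is the dual scatter operation; it needs the combination function argument because several source elements may land on the same target index.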


Page 46: Haskell Arrays Accelerated with GPUs

Regular arrays in package dph-seq
@ 3.06 GHz Core 2 Duo (with GHC 6.11)

[Benchmark chart: Dense Matrix-Matrix Multiplication. Time (milliseconds, log scale from 1 to 1000000) against size of square NxN matrix (N = 128, 256, 512, 1024). Series: Unboxed Haskell arrays; Delayed arrays; Plain C; Mac OS vecLib.]


Page 47: Haskell Arrays Accelerated with GPUs

Conclusion

EDSL for processing multi-dimensional arrays

Collective array operations (highly data parallel)

Support for multiple backends

Status:

‣ Very early version on Hackage (only interpreter)

http://hackage.haskell.org/package/accelerate

‣ Currently porting GPU backend over

‣ Looking for backend contributors
