
An embedded language for data-parallel programming

Master of Science Thesis in Computer Science

By Joel Svensson

Department of Computer Science and Engineering

CHALMERS UNIVERSITY OF TECHNOLOGY

GÖTEBORGS UNIVERSITY

Göteborg, Sweden

Obsidian: an embedded language for data-parallel programming

- Data-parallel programming
- General-purpose computations on the GPU (GPGPU)
- Lava

NVIDIA 8800 GPU

Project Outline

- An embedded language for data-parallel programming
- Lava programming style using combinators
- Generate C code for NVIDIA GPU

Data-parallel programming

- Single sequential program
- Executed by a number of processing elements
- Operating on different data

    for j := 1 to log(n) do
        for all k in parallel do
            if ((k+1) mod 2^j) = 0 then
                x[k] := x[k-2^(j-1)] + x[k]
            fi
        od
    od
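The loop above can be simulated on ordinary lists to see what it computes. The following is a minimal sketch, not code from the thesis: the names `step` and `sweep` are made up for illustration, indices are taken to be 0-based, `log` is read as the base-2 logarithm, and the array length is assumed to be a power of two.

```haskell
-- One "for all k in parallel" step. Every element is computed from the old
-- array xs, so building a fresh list models the simultaneous update.
step :: Int -> [Int] -> [Int]
step j xs = [ update k x | (k, x) <- zip [0 ..] xs ]
  where
    stride = 2 ^ (j - 1)
    update k x
      | (k + 1) `mod` (2 ^ j) == 0 = xs !! (k - stride) + x
      | otherwise                  = x

-- The outer sequential loop: j = 1 .. log2 n.
sweep :: [Int] -> [Int]
sweep xs = foldl (flip step) xs [1 .. log2 (length xs)]
  where
    -- Integer base-2 logarithm, exact for powers of two.
    log2 :: Int -> Int
    log2 n = length (takeWhile (< n) (iterate (* 2) 1))
```

Under these assumptions, `sweep [1,2,3,4,5,6,7,8]` gives `[1,3,3,10,5,11,7,36]`: after log(n) steps the last element holds the sum of the whole array.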

GPGPU

- GPUs are relatively cheap
- High performance (hundreds of GFLOPS)
- Applications:
  - Physics simulation
  - Bioinformatics
  - Sorting

http://www.gpgpu.org/

[Figure: GPU vs CPU GFLOPS chart]

NVIDIA 8800 GPUs

- A set of SIMD multiprocessors
- 8 SIMD processing elements per multiprocessor
- Up to 16 multiprocessors in one GPU
- Giving 128 processing elements in total

http://www.nvidia.com/

NVIDIA 8800 GPUs

NVIDIA Compute Unified Device Architecture (CUDA)

- C compiler and libraries for the GPU
- GPU as a highly parallel coprocessor
- For use with NVIDIA's 8800 series GPUs

http://www.nvidia.com/cuda

CUDA Programming model

- High number of threads
  - Divided into blocks
- Thread block
  - Up to 512 threads
  - Divided into warps
  - Executed on one multiprocessor

CUDA Synchronisation

- CUDA supplies a synchronisation primitive, __syncthreads()
  - Barrier synchronisation
  - Across all the threads of a block
- Coordinate communication

Obsidian

- Embedded in Haskell
- Presents a high-level programmer's interface
- Parallel computations described using combinators
- CUDA C code is generated

Obsidian

- Describes computations on arrays:
  - Length homogeneous
- Sorting algorithms
  - Integer values
- Limitations:
  - Currently limited to iterative sorting algorithms

Obsidian Programming Basics

- Sequential composition of programs: ->-
- Parallel composition of programs: parl
- Index operations:
  - rev
  - riffle
  - unriffle
- Array operations:
  - halve
  - conc
- Apply or Map: fun

Obsidian Programming

- Array operations
  - halve
  - conc
  - oeSplit
  - shuffle
- Index operations
  - rev
  - riffle
  - unriffle

riffle = halve ->- shuffle

unriffle

unriffle = oeSplit ->- conc
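The two definitions above can be tried out on plain Haskell lists. This is only a sketch: the slides do not spell out what `halve`, `shuffle`, `oeSplit` and `conc` do, so the versions below are assumptions based on the usual Lava meaning of riffle (a perfect shuffle of the two halves). Ordinary function composition stands in for Obsidian's `->-`, and arrays are assumed to have even length.

```haskell
-- Split an array into its first and second half.
halve :: [a] -> ([a], [a])
halve xs = splitAt (length xs `div` 2) xs

-- Interleave two arrays element by element.
shuffle :: ([a], [a]) -> [a]
shuffle (xs, ys) = concat (zipWith (\x y -> [x, y]) xs ys)

-- Split an array into its even-indexed and odd-indexed elements.
oeSplit :: [a] -> ([a], [a])
oeSplit (x : y : rest) = let (es, os) = oeSplit rest in (x : es, y : os)
oeSplit xs             = (xs, [])

-- Concatenate two arrays.
conc :: ([a], [a]) -> [a]
conc (xs, ys) = xs ++ ys

-- riffle = halve ->- shuffle
riffle :: [a] -> [a]
riffle = shuffle . halve

-- unriffle = oeSplit ->- conc
unriffle :: [a] -> [a]
unriffle = conc . oeSplit
```

With these definitions, `riffle [1..8]` is `[1,5,2,6,3,7,4,8]` and `unriffle [1..8]` is `[1,3,5,7,2,4,6,8]`, and `unriffle` undoes `riffle` on even-length lists.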


- Apply or Map: fun
- Sequential composition of programs: ->-
- Parallel composition of programs: parl

Obsidian Programming: an example

    rev_incr :: Arr (Exp Int) -> W (Arr (Exp Int))
    rev_incr = rev ->- fun (+1) ->- sync

    *Obsidian> execute rev_incr [1,2,3]
    [4,3,2]
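To make the combinator style concrete without an Obsidian installation, the example can be mimicked on plain lists. The sketch below is an assumption-laden stand-in, not Obsidian's implementation: the real `->-` composes programs in the `W` monad over `Arr (Exp Int)`, whereas here it is just left-to-right function composition, and `sync` is modelled as the identity because a pure list has no threads to synchronise. The name `revIncr` is chosen to follow Haskell naming conventions.

```haskell
infixl 1 ->-

-- Left-to-right composition: run f, then g.
(->-) :: (a -> b) -> (b -> c) -> a -> c
f ->- g = g . f

-- Apply a function to every element.
fun :: (a -> b) -> [a] -> [b]
fun = map

-- Reverse the array.
rev :: [a] -> [a]
rev = reverse

-- Stand-in for the barrier; it does nothing in this pure model.
sync :: [a] -> [a]
sync = id

revIncr :: [Int] -> [Int]
revIncr = rev ->- fun (+ 1) ->- sync
```

Evaluating `revIncr [1,2,3]` gives `[4,3,2]`, the same result as the `execute` session above.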

Obsidian Synchronisation

- Synchronisation primitive: sync
  - All array elements are updated after a sync
  - Only applicable at top level
- Inherits behaviour from CUDA's __syncthreads()

Generating C Code

- Generate CUDA C code for NVIDIA GPU
  - Executed as one block of threads
- Pros
  - Communication and synchronisation possible
- Cons
  - Upper limit of 512 threads per block
  - Does not use entire GPU


- Each thread is in charge of calculating one array element
  - Limits array size to 512 elements
  - Leads to some redundancy
- Swap operation performed by two threads in cooperation


    __global__ static void reverse(int *values, int n)
    {
        extern __shared__ int shared[];
        const int tid = threadIdx.x;
        int tmp;

        shared[tid] = values[tid];
        __syncthreads();
        tmp = shared[((n - 1) - tid)];
        __syncthreads();
        shared[tid] = tmp;
        __syncthreads();

        values[tid] = shared[tid];
    }

    reverse = rev ->- sync
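Why the kernel separates the read of `shared[(n - 1) - tid]` from the write to `shared[tid]` with a barrier can be illustrated with a small list simulation. This is a sketch under stated assumptions, not generated code: `reverseWithBarrier` and `reverseNoBarrier` are hypothetical names, and the no-barrier version shows just one possible interleaving (threads running to completion one after another in `tid` order).

```haskell
-- With the barrier: every thread reads its source element from the array as
-- it was before any thread wrote, so all reads see the old values.
reverseWithBarrier :: [a] -> [a]
reverseWithBarrier shared = [ shared !! (n - 1 - tid) | tid <- [0 .. n - 1] ]
  where
    n = length shared

-- Without the barrier, one possible interleaving: each thread reads and then
-- immediately writes into the same array that later threads will read.
reverseNoBarrier :: [a] -> [a]
reverseNoBarrier shared0 = foldl stepThread shared0 [0 .. n - 1]
  where
    n = length shared0
    stepThread shared tid =
      let tmp = shared !! (n - 1 - tid)
      in take tid shared ++ [tmp] ++ drop (tid + 1) shared
```

For `[1,2,3,4]` the barrier version yields `[4,3,2,1]`, while this particular unsynchronised interleaving yields `[4,3,3,4]`, because the later threads read elements that have already been overwritten.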


    __global__ static void example(int *values, int n)
    {
        extern __shared__ int shared[];
        const int tid = threadIdx.x;
        int tmp;

        // Copy the input array into shared memory.
        shared[tid] = values[tid];
        __syncthreads();

        // Compute this thread's element from elements of the shared array.
        tmp = f(shared[i1],...,shared[in]);
        __syncthreads();
        shared[tid] = tmp;
        __syncthreads();

        // Write the result back to global memory.
        values[tid] = shared[tid];
    }

Implementing a sorter

A two-sorter sorts a pair of values:

    cmpSwap op (a,b) = ifThenElse (op a b) (a,b) (b,a)

Sort each pair of elements in an array:

    sort2 = (pair ->- fun (cmpSwap (<*)) ->- unpair ->- sync)

    *Obsidian> execute sort2 [2,3,5,1,6,7]
    [2,3,1,5,6,7]
    *Obsidian> execute sort2 [2,1,2,1,2,1]
    [1,2,1,2,1,2]
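The pairwise sorter can be reproduced on plain lists to check the two sessions above. This is a sketch only: `pair` and `unpair` are written here as guessed from their names and from the printed results, Haskell's `<` replaces Obsidian's `<*`, an ordinary `if` replaces `ifThenElse`, and the array is assumed to have even length (a trailing unpaired element would be dropped by this `pair`).

```haskell
-- Order a pair according to op.
cmpSwap :: (a -> a -> Bool) -> (a, a) -> (a, a)
cmpSwap op (a, b) = if op a b then (a, b) else (b, a)

-- Group neighbouring elements into pairs.
pair :: [a] -> [(a, a)]
pair (x : y : rest) = (x, y) : pair rest
pair _              = []

-- Flatten an array of pairs back into an array of elements.
unpair :: [(a, a)] -> [a]
unpair = concatMap (\(a, b) -> [a, b])

-- sort2 = pair ->- fun (cmpSwap (<*)) ->- unpair ->- sync
sort2 :: Ord a => [a] -> [a]
sort2 = unpair . map (cmpSwap (<)) . pair
```

`sort2 [2,3,5,1,6,7]` evaluates to `[2,3,1,5,6,7]` and `sort2 [2,1,2,1,2,1]` to `[1,2,1,2,1,2]`, matching the slide.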


A more efficient pairwise sort:

    sortEvens = evens (cmpSwap (<*)) ->- sync

    *Obsidian> execute sortEvens [2,3,5,1,6,7]
    [2,3,1,5,6,7]
    *Obsidian> execute sortEvens [2,1,2,1,2,1]
    [1,2,1,2,1,2]


evens


A close relative of evens is odds:

    sortOdds = odds (cmpSwap (<*)) ->- sync

    *Obsidian> execute sortOdds [5,3,2,1,4,6]
    [5,2,3,1,4,6]
    *Obsidian> execute sortOdds [1,2,1,2,1,2]
    [1,1,2,1,2,2]
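The behaviour of `evens` and `odds` can be read off the sessions above: `evens` applies a pair function to the pairs starting at index 0, 2, 4, ..., and `odds` to the pairs starting at index 1, 3, 5, ..., leaving unpaired end elements alone. The list model below is a sketch consistent with those results, not Obsidian's implementation; `<` stands in for `<*` and `sync` is omitted.

```haskell
-- Order a pair according to op.
cmpSwap :: (a -> a -> Bool) -> (a, a) -> (a, a)
cmpSwap op (a, b) = if op a b then (a, b) else (b, a)

-- Apply f to the pairs at positions (0,1), (2,3), ...
evens :: ((a, a) -> (a, a)) -> [a] -> [a]
evens f (x : y : rest) = let (a, b) = f (x, y) in a : b : evens f rest
evens _ xs             = xs

-- Apply f to the pairs at positions (1,2), (3,4), ...
odds :: ((a, a) -> (a, a)) -> [a] -> [a]
odds f (x : rest) = x : evens f rest
odds _ []         = []

sortEvens :: Ord a => [a] -> [a]
sortEvens = evens (cmpSwap (<))

sortOdds :: Ord a => [a] -> [a]
sortOdds = odds (cmpSwap (<))
```

In this model `sortOdds [5,3,2,1,4,6]` gives `[5,2,3,1,4,6]` and `sortEvens [2,3,5,1,6,7]` gives `[2,3,1,5,6,7]`, as in the slides.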


odds

Odd Even Transposition Sort

Sorter implemented using odds and evens:

    sortOETCore = sortEvens ->- sortOdds

    sortOET arr =
      let n = len arr
      in (repE (idiv (n+1) 2) sortOETCore) arr
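A list-level sketch of the same sorter shows why `(n+1)/2` repetitions are enough: each repetition runs one even phase and one odd phase, and odd-even transposition sort needs at most n phases for n elements. The code below is an illustration under assumptions, with `iterate` and `!!` standing in for `repE`, `<` for `<*`, and list versions of `evens` and `odds` guessed from the earlier sessions.

```haskell
cmpSwap :: (a -> a -> Bool) -> (a, a) -> (a, a)
cmpSwap op (a, b) = if op a b then (a, b) else (b, a)

-- Apply f to the pairs at positions (0,1), (2,3), ...
evens :: ((a, a) -> (a, a)) -> [a] -> [a]
evens f (x : y : rest) = let (a, b) = f (x, y) in a : b : evens f rest
evens _ xs             = xs

-- Apply f to the pairs at positions (1,2), (3,4), ...
odds :: ((a, a) -> (a, a)) -> [a] -> [a]
odds f (x : rest) = x : evens f rest
odds _ []         = []

-- One even phase followed by one odd phase.
sortOETCore :: Ord a => [a] -> [a]
sortOETCore = odds (cmpSwap (<)) . evens (cmpSwap (<))

-- Repeat the core (n+1) `div` 2 times, n being the array length.
sortOET :: Ord a => [a] -> [a]
sortOET xs = iterate sortOETCore xs !! ((length xs + 1) `div` 2)
```

For example, `sortOET [5,3,2,1,4,6]` evaluates to `[1,2,3,4,5,6]` after three repetitions of the core.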

Odd Even Transposition Sort

VSort

- Another iterative sorter
  - log^2(n) depth
- Built around a shuffle exchange network:

    shex f n = rep n (riffle ->- evens f ->- sync)

VSort

Merger implemented using shex:

    bmergeIt n = shex (cmpSwap (<*)) n

    *Obsidian> execute (shex (cmpSwap (<*)) 3) [2,4,6,8,7,5,3,1]
    [1,2,3,4,5,6,7,8]
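The shuffle exchange merger can be traced on plain lists. The sketch below assumes that `riffle` is a perfect shuffle of the two halves and that `evens` acts on the pairs at positions (0,1), (2,3), ...; neither definition is given in the slides, but together they reproduce the session above. `iterate` and `!!` stand in for `rep`, `<` for `<*`, and the input is assumed to be a bitonic sequence of length 2^n.

```haskell
cmpSwap :: (a -> a -> Bool) -> (a, a) -> (a, a)
cmpSwap op (a, b) = if op a b then (a, b) else (b, a)

-- Perfect shuffle: interleave the first half with the second half.
riffle :: [a] -> [a]
riffle xs = concat (zipWith (\x y -> [x, y]) front back)
  where
    (front, back) = splitAt (length xs `div` 2) xs

-- Apply f to the pairs at positions (0,1), (2,3), ...
evens :: ((a, a) -> (a, a)) -> [a] -> [a]
evens f (x : y : rest) = let (a, b) = f (x, y) in a : b : evens f rest
evens _ xs             = xs

-- shex f n = rep n (riffle ->- evens f ->- sync)
shex :: ((a, a) -> (a, a)) -> Int -> [a] -> [a]
shex f n xs = iterate (evens f . riffle) xs !! n

bmergeIt :: Ord a => Int -> [a] -> [a]
bmergeIt = shex (cmpSwap (<))
```

Tracing `bmergeIt 3 [2,4,6,8,7,5,3,1]` gives `[2,7,4,5,3,6,1,8]`, then `[2,3,6,7,1,4,5,8]`, then `[1,2,3,4,5,6,7,8]`, one list per riffle-and-compare step.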

VSort

Sorter implemented using bmergeIt:

    vmergeIt n = tblLook tautab ->- sync ->- bmergeIt n

    vsortIt n = rep n (vmergeIt n)

Comparison of sorters

- Six different sorters
  - Bitonic sort on CPU
  - Odd Even Transposition sort
  - Three versions of VSort
  - CUDA Bitonic sort on GPU
- Data and Hardware
  - 288 Mb of random data
  - CPU: 2.4GHz Intel Core 2
  - GPU: 1.2GHz NVIDIA 8800 GTS (shader clock)

Comparison of sorters

Related work

- Pan
  - Embedded in Haskell
  - Image synthesis
  - Generates C code
- Vertigo
  - Embedded in Haskell
  - Describes shaders
  - Generates GPU programs
- PyGPU
  - Embedded in Python
  - Uses Python's introspective abilities
  - Graphics applications
- NESL
  - Functional language
  - Nested data-parallelism
  - Compiles into VCode
- Data Parallel Haskell
  - Nested data-parallelism in Haskell

Future work

- Solve the recursion dilemma
  - Enable the description of recursive sorters
  - Bitonic Sort
- Make use of entire GPU
- Optimise the generated code
- More generality
  - Not just sorters
- Other target platforms

Future work

- More generality
  - Arr a -> Arr b (not just Arr Int -> Arr Int)
  - Matrices
  - Pairs of arrays to arrays
  - Arrays of pairs to arrays
  - Throw away length homogeneity demand
