An embedded language for data-parallel programming
svenssonjoel.github.io/slides/masterpres.pdf


An embedded language for data-parallel programming

Master of Science Thesis in Computer Science

By Joel Svensson

Department of Computer Science and Engineering

CHALMERS UNIVERSITY OF TECHNOLOGY

GÖTEBORGS UNIVERSITY

Göteborg, Sweden

Obsidian: an embedded language for data-parallel programming

Data-parallel programming
General-Purpose computations on the GPU (GPGPU)
Lava

NVIDIA 8800 GPU

Project Outline

An embedded language for data-parallel programming
Lava programming style using combinators
Generate C code for NVIDIA GPU

Data-parallel programming

Single sequential program
Executed by a number of processing elements
Operating on different data

for j := 1 to log(n) do
  for all k in parallel do
    if ((k+1) mod 2^j) = 0 then
      x[k] := x[k-2^(j-1)] + x[k]
    fi
  od
od
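
The loop above can be read as the reduction (up-sweep) phase of a parallel prefix sum: in step j, every position k with (k+1) divisible by 2^j adds in the partial sum lying 2^(j-1) positions to its left. Below is a minimal Haskell sketch of that reading, assuming zero-based indices; the names `log2`, `upSweepStep` and `upSweep` are illustrative and do not come from the slides. Each step builds a new list from the old one, which mimics all k updating at the same time.

```haskell
-- Number of doubling steps needed to cover n elements
-- (equals log2 n when n is a power of two).
log2 :: Int -> Int
log2 n = length (takeWhile (< n) (iterate (* 2) 1))

-- One "for all k in parallel" step: every element of the result is computed
-- from the old array, so no update observes another update of the same step.
upSweepStep :: [Int] -> Int -> [Int]
upSweepStep xs j =
  [ if (k + 1) `mod` (2 ^ j) == 0 then xs !! (k - 2 ^ (j - 1)) + x else x
  | (k, x) <- zip [0 ..] xs
  ]

-- The outer sequential loop: j = 1 .. log(n).
upSweep :: [Int] -> [Int]
upSweep xs = foldl upSweepStep xs [1 .. log2 (length xs)]
```

For a power-of-two length, the last element of `upSweep xs` holds the sum of all elements of `xs`.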

GPGPU

GPUs are relatively cheap
High performance (Hundreds of GFLOPS)

Applications:
  Physics simulation
  Bioinformatics
  Sorting

www.gpgpu.org

GPU vs CPU GFLOPS Chart

NVIDIA 8800 GPUs

A set of SIMD multiprocessors
8 SIMD processing elements per multiprocessor
Up to 16 multiprocessors in one GPU
Giving 128 processing elements total

www.nvidia.com

NVIDIA 8800 GPUs

NVIDIA Compute Unified Device Architecture

C compiler and libraries for the GPU
GPU as a highly parallel co-processor
For use with NVIDIA's 8800 series GPUs

www.nvidia.com/cuda

CUDA Programming model

High number of threads
  Divided into Blocks

Thread block
  Up to 512 threads
  Divided into Warps
  Executed on one multiprocessor

CUDA Synchronisation

CUDA supplies a synchronisation primitive, __syncthreads()
  Barrier synchronisation
  Across all the threads of a block

Coordinate communication

Obsidian

Embedded in Haskell
Presents a high-level programmer's interface
Parallel computations described using combinators
CUDA C code is generated

Obsidian

Describes computations on arrays:
  Length homogeneous

Sorting algorithms
  Integer values

Limitations:
  Currently limited to iterative sorting algorithms

Obsidian Programming

Basics
  Sequential composition of programs: ->-
  Parallel composition of programs: parl
  Index operations:
    rev
    riffle
    unriffle
  Array operations:
    halve
    conc
  Apply or Map: fun

Obsidian Programming

Array Operations
  halve
  conc
  oeSplit
  shuffle

  

Obsidian Programming

Index Operations
  rev
  riffle
  unriffle

riffle = halve ->- shuffle

unriffle

unriffle = oeSplit ->- conc 
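
The two definitions, riffle = halve ->- shuffle and unriffle = oeSplit ->- conc, can be tried out on ordinary Haskell lists. The sketch below is a list model only, not Obsidian's array type or its W monad. It assumes the Lava-style meanings of the four array operations (halve splits in the middle, conc appends, oeSplit separates even and odd positions, shuffle interleaves), and it writes `f ->- g` as plain function composition `g . f`.

```haskell
-- Split an array into its first and second half.
halve :: [a] -> ([a], [a])
halve xs = splitAt (length xs `div` 2) xs

-- Concatenate a pair of arrays.
conc :: ([a], [a]) -> [a]
conc (xs, ys) = xs ++ ys

-- Split into the elements at even indices and those at odd indices.
oeSplit :: [a] -> ([a], [a])
oeSplit (x : y : rest) = let (es, os) = oeSplit rest in (x : es, y : os)
oeSplit xs = (xs, [])

-- Interleave two arrays element by element.
shuffle :: ([a], [a]) -> [a]
shuffle (x : xs, y : ys) = x : y : shuffle (xs, ys)
shuffle (xs, ys) = xs ++ ys

-- riffle = halve ->- shuffle
riffle :: [a] -> [a]
riffle = shuffle . halve

-- unriffle = oeSplit ->- conc
unriffle :: [a] -> [a]
unriffle = conc . oeSplit
```

In this model `riffle [1,2,3,4,5,6,7,8]` gives `[1,5,2,6,3,7,4,8]`, and `unriffle` undoes `riffle` on arrays of even length.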

Obsidian Programming

Apply or Map: fun

Sequential composition of programs: ->-
Parallel composition of programs: parl

Obsidian Programming: an example

rev_incr :: Arr (Exp Int) -> W (Arr (Exp Int))
rev_incr = rev ->- fun (+1) ->- sync

*Obsidian> execute rev_incr [1,2,3]
[4,3,2]
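
To see how the combinators fit together, here is a small stand-alone model that reproduces the session above. It is a sketch under assumptions, not Obsidian's implementation: `W` is replaced by the `Identity` monad (the real W presumably carries code-generation state), `Arr` is a plain list, and elements are concrete `Int` values rather than `Exp Int` expressions. `parl` is assumed to have its Lava-style meaning, running one program on each half of the array.

```haskell
import Control.Monad ((>=>))
import Data.Functor.Identity (Identity, runIdentity)

-- Stand-ins for Obsidian's types (assumptions of this model).
type W = Identity
type Arr a = [a]

-- Sequential composition: run f, then feed its result to g.
infixr 1 ->-
(->-) :: (a -> W b) -> (b -> W c) -> a -> W c
f ->- g = f >=> g

-- Index operation: reverse the array.
rev :: Arr a -> W (Arr a)
rev = pure . reverse

-- Apply a function to every element.
fun :: (a -> b) -> Arr a -> W (Arr b)
fun f = pure . map f

-- In this model a sync has no observable effect.
sync :: Arr a -> W (Arr a)
sync = pure

-- Assumed meaning: f on the first half, g on the second half.
parl :: (Arr a -> W (Arr b)) -> (Arr a -> W (Arr b)) -> Arr a -> W (Arr b)
parl f g xs = (++) <$> f ys <*> g zs
  where
    (ys, zs) = splitAt (length xs `div` 2) xs

execute :: (Arr a -> W (Arr b)) -> [a] -> [b]
execute prog = runIdentity . prog

rev_incr :: Arr Int -> W (Arr Int)
rev_incr = rev ->- fun (+ 1) ->- sync
```

Loaded into GHCi, `execute rev_incr [1,2,3]` evaluates to `[4,3,2]`, matching the slide.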

Obsidian Synchronisation

Synchronisation primitive: sync
  All array elements are updated after a sync
  Only applicable at top-level

Inherits behavior from CUDA's __syncthreads()

Generating C Code

Generate CUDA C Code for NVIDIA GPU
  Executed as one block of threads

Pros
  Communication and synchronisation possible

Cons
  Upper limit of 512 threads per block
  Does not use entire GPU

Generating C Code

Each thread is in charge of calculating one array element
  Limits array size to 512 elements
  Leads to some redundancy

Swap operation performed by two threads in cooperation

Generating C Code

__global__ static void reverse(int *values, int n)
{
  extern __shared__ int shared[];
  const int tid = threadIdx.x;
  int tmp;
  shared[tid] = values[tid];
  __syncthreads();
  tmp = shared[((n - 1) - tid)];
  __syncthreads();
  shared[tid] = tmp;
  __syncthreads();

  values[tid] = shared[tid];
}

reverse = rev ->- sync

Generating C Code

__global__ static void example(int *values, int n)
{
  extern __shared__ int shared[];
  const int tid = threadIdx.x;
  int tmp;
  shared[tid] = values[tid];
  __syncthreads();
  tmp = f(shared[i1],...,shared[in]);
  __syncthreads();
  shared[tid] = tmp;
  __syncthreads();

  values[tid] = shared[tid];
}
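
The kernel follows a fixed pattern: copy the input into shared memory, let each thread compute its element into a temporary, store the temporary, and copy the result back, with a barrier between the steps. The sketch below shows one way such text could be produced from Haskell. `kernelText` and `reverseKernel` are hypothetical names made up for this illustration, and the code is not Obsidian's actual generator; it only renders the pattern seen on the slides.

```haskell
-- Hypothetical helper: render the single-block kernel pattern shown above.
-- 'name' is the kernel name, 'expr' the C expression one thread evaluates.
kernelText :: String -> String -> String
kernelText name expr = unlines
  [ "__global__ static void " ++ name ++ "(int *values, int n)"
  , "{"
  , "  extern __shared__ int shared[];"
  , "  const int tid = threadIdx.x;"
  , "  int tmp;"
  , "  shared[tid] = values[tid];   /* load input into shared memory */"
  , "  __syncthreads();"
  , "  tmp = " ++ expr ++ ";"
  , "  __syncthreads();"
  , "  shared[tid] = tmp;"
  , "  __syncthreads();"
  , "  values[tid] = shared[tid];   /* write the result back */"
  , "}"
  ]

-- The text of the 'reverse' kernel from the earlier slide.
reverseKernel :: String
reverseKernel = kernelText "reverse" "shared[((n - 1) - tid)]"
```

`putStr reverseKernel` prints the kernel text; only the line assigning `tmp` differs from one generated program to the next in this simplified picture.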

Generating C Code

The same example kernel, with the computation step highlighted:

tmp = f(shared[i1],...,shared[in]);

Implementing a sorter

A two-sorter sorts a pair of values:
cmpSwap op (a,b) = ifThenElse (op a b) (a,b) (b,a)

Sort each pair of elements in an array:
sort2 = (pair ->- fun (cmpSwap (<*)) ->- unpair ->- sync)

*Obsidian> execute sort2 [2,3,5,1,6,7]
[2,3,1,5,6,7]
*Obsidian> execute sort2 [2,1,2,1,2,1]
[1,2,1,2,1,2]
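
The pairwise sorter can be modelled on plain lists to check the two sessions above. This is a sketch under assumptions: ordinary `Bool` and `(<)` stand in for Obsidian's expression-level `ifThenElse` and `(<*)`, function composition stands in for `->-`, and `pair` is assumed to group neighbouring elements (a trailing unpaired element is simply dropped in this model).

```haskell
-- Order a pair so that 'op' holds between the first and second component.
cmpSwap :: (a -> a -> Bool) -> (a, a) -> (a, a)
cmpSwap op (a, b) = if op a b then (a, b) else (b, a)

-- Group neighbouring elements into pairs.
pair :: [a] -> [(a, a)]
pair (x : y : rest) = (x, y) : pair rest
pair _ = []

-- Flatten the pairs back into an array.
unpair :: [(a, a)] -> [a]
unpair = concatMap (\(a, b) -> [a, b])

-- sort2 = pair ->- fun (cmpSwap (<*)) ->- unpair ->- sync
sort2 :: Ord a => [a] -> [a]
sort2 = unpair . map (cmpSwap (<)) . pair
```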

Implementing a sorter

A more efficient pairwise sort:
sortEvens = evens (cmpSwap (<*)) ->- sync

*Obsidian> execute sortEvens [2,3,5,1,6,7]
[2,3,1,5,6,7]
*Obsidian> execute sortEvens [2,1,2,1,2,1]
[1,2,1,2,1,2]

Implementing a sorter

evens

Implementing a sorter

A close relative of evens is odds:
sortOdds = odds (cmpSwap (<*)) ->- sync

*Obsidian> execute sortOdds [5,3,2,1,4,6]
[5,2,3,1,4,6]
*Obsidian> execute sortOdds [1,2,1,2,1,2]
[1,1,2,1,2,2]

Implementing a sorter

odds 

Odd Even Transposition Sort

Sorter implemented using odds and evens:
sortOETCore = sortEvens ->- sortOdds

sortOET arr =
  let n = len arr
  in (repE (idiv (n+1) 2) sortOETCore) arr
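
A list model of the whole sorter makes the iteration count easier to see: each round runs an evens phase and an odds phase, and (n+1)/2 rounds give at least n phases, which is enough for odd-even transposition sort on n elements. The sketch below is an illustration only. It uses plain lists and `(<)` in place of Obsidian arrays and `(<*)`, and `iterate` in place of `repE`; the behaviour of `evens` and `odds` is inferred from the sessions on the earlier slides.

```haskell
-- Order a pair so that 'op' holds between the first and second component.
cmpSwap :: (a -> a -> Bool) -> (a, a) -> (a, a)
cmpSwap op (a, b) = if op a b then (a, b) else (b, a)

-- Apply f to the pairs at positions (0,1), (2,3), ...
evens :: ((a, a) -> (a, a)) -> [a] -> [a]
evens f (x : y : rest) = let (a, b) = f (x, y) in a : b : evens f rest
evens _ xs = xs

-- Apply f to the pairs at positions (1,2), (3,4), ...; the first element
-- and any trailing unpaired element are left untouched.
odds :: ((a, a) -> (a, a)) -> [a] -> [a]
odds f (x : rest) = x : evens f rest
odds _ [] = []

-- One round: sortEvens ->- sortOdds.
sortOETCore :: Ord a => [a] -> [a]
sortOETCore = odds (cmpSwap (<)) . evens (cmpSwap (<))

-- Repeat the round (n+1) `div` 2 times, as in the slide's repE call.
sortOET :: Ord a => [a] -> [a]
sortOET arr = iterate sortOETCore arr !! ((length arr + 1) `div` 2)
```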

Odd Even Transposition Sort

VSort

Another iterative sorter
  log^2(n) depth

Built around a shuffle exchange network:
shex f n = rep n (riffle ->- evens f ->- sync)

VSort

Merger implemented using shex:
bmergeIt n = shex (cmpSwap (<*)) n

*Obsidian> execute (shex (cmpSwap (<*)) 3) [2,4,6,8,7,5,3,1]
[1,2,3,4,5,6,7,8]
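
The session above can be replayed with a list model of the shuffle exchange network. The input [2,4,6,8,7,5,3,1] is bitonic (rising, then falling), and three rounds, log2 of the 8 elements, of "riffle, then compare-exchange neighbouring pairs" merge it into sorted order. The sketch assumes that `riffle` is the perfect shuffle and that `evens` acts on pairs (0,1), (2,3), and so on; `iterate` replaces `rep`, and `(<)` replaces `(<*)`.

```haskell
-- Order a pair so that 'op' holds between the first and second component.
cmpSwap :: (a -> a -> Bool) -> (a, a) -> (a, a)
cmpSwap op (a, b) = if op a b then (a, b) else (b, a)

-- Perfect shuffle: interleave the first half with the second half.
riffle :: [a] -> [a]
riffle xs = interleave ys zs
  where
    (ys, zs) = splitAt (length xs `div` 2) xs
    interleave (p : ps) (q : qs) = p : q : interleave ps qs
    interleave ps qs = ps ++ qs

-- Apply f to the pairs at positions (0,1), (2,3), ...
evens :: ((a, a) -> (a, a)) -> [a] -> [a]
evens f (x : y : rest) = let (a, b) = f (x, y) in a : b : evens f rest
evens _ xs = xs

-- n rounds of: riffle, then f on every even-positioned pair.
shex :: ((a, a) -> (a, a)) -> Int -> [a] -> [a]
shex f n xs = iterate (evens f . riffle) xs !! n

-- Merger for bitonic input of length 2^n.
bmergeIt :: Ord a => Int -> [a] -> [a]
bmergeIt = shex (cmpSwap (<))
```

The intermediate arrays for the slide's input are [2,7,4,5,3,6,1,8] after round one and [2,3,6,7,1,4,5,8] after round two.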

VSort

Sorter implemented using bmergeIt:
vmergeIt n = tblLook tautab ->- sync ->- bmergeIt n

vsortIt n = rep n (vmergeIt n)

Comparison of sorters

Six different sorters
  Bitonic sort on CPU
  Odd Even Transposition sort
  Three versions of VSort
  CUDA Bitonic sort on GPU

Data and Hardware
  288 Mb of random data
  CPU: 2.4GHz Intel Core 2
  GPU: 1.2GHz NVIDIA 8800 GTS (shader clock)

Comparison of sorters

Related work

Pan
  Embedded in Haskell
  Image synthesis
  Generates C code

Vertigo
  Embedded in Haskell
  Describes Shaders
  Generates GPU programs

Related work

PyGPU
  Embedded in Python
  Uses Python's introspective abilities
  Graphics applications

Related work

NESL
  Functional language
  Nested data-parallelism
  Compiles into VCode

Data Parallel Haskell
  Nested data-parallelism in Haskell

Future work

Solve the recursion dilemma
  Enable the description of recursive sorters
    Bitonic Sort

Make use of entire GPU
Optimise the generated code
More generality
  Not just sorters

Other target platforms

Future work

More generality
  Arr a -> Arr b (not just Arr Int -> Arr Int)
  Matrices
  Pairs of arrays to arrays
  Arrays of pairs to arrays
  Throw away length homogeneity demand