using open64 for high performance computing on a gpu

35
Using Open64 for High Performance Computing on a GPU by Mike Murphy, Gautam Chakrabarti, and Xiangyun Kong

Upload: stacia

Post on 16-Jan-2016

58 views

Category:

Documents


0 download

DESCRIPTION

Using Open64 for High Performance Computing on a GPU. by Mike Murphy, Gautam Chakrabarti, and Xiangyun Kong. Using Open64 for High Performance Computing on a GPU. Background and Overview Functionality Work Performance Work Concluding Thoughts. Why Use a GPU?. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Using Open64 for High Performance Computing on a GPU

Using Open64 for High Performance Computing

on a GPU

by Mike Murphy, Gautam Chakrabarti, and Xiangyun Kong

Page 2: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Using Open64 for High Performance Computing on a GPU

• Background and Overview

• Functionality Work

• Performance Work

• Concluding Thoughts

Page 3: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Why Use a GPU?

• Sequential processors have hit a wall

• GPU is efficient parallel processor

• lots of big ALUs

• multithreading can hide latency

• context switching is basically free

• all threads run same sequential program

• SIMT (Single Instruction Multiple Thread)

Page 4: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

GPU for Compute

• CUDA (Compute Unified Device Architecture)

• augment C/C++ with minimal abstraction

• divide into sequential host and parallel device code

• let programmers focus on parallel algorithms

Page 5: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

CUDA Example

// Compute vector sum C = A+B

// Each thread performs one pair-wise addition

__global__

void vecAdd(float* A, float* B, float* C, int n)

{

int i = threadIdx.x + blockDim.x * blockIdx.x;

if(i<n) C[i] = A[i] + B[i];

}

int main()

{

// Run N/256 blocks of 256 threads each

vecAdd<<< N/256, 256>>>(d_A, d_B, d_C, n);

}

Host Code

Page 6: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

CUDA Successes

•Nebulae computer uses NVIDIA GPU• #4 on Green500 list (MFLOPS/Watt)

• #2 on Top500 list

• DARPA “exascale supercomputer” grant just announced

•Thousands of applications. e.g.• AMBER (scientific)

• Numerix (financial)

• Adobe Premiere Pro (video)

• Physx (game collision physics)

Page 7: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

CUDA Accelerating Computation

146X 36X 19X 17X 100X

Interactive visualization of volumetric white matter connectivity

Ionic placement for molecular dynamics simulation on GPU

Transcoding HD video stream to H.264

Simulation in Matlab using .mex file CUDA

function

Astrophysics N-body simulation

149X 47X 20X 24X 30X

Financial simulation of LIBOR model with

swaptions

GLAME@lab: An M-script API for linear

Algebra operations on GPU

Ultrasound medical imaging for cancer

diagnostics

Highly optimized object oriented molecular

dynamics

Cmatch exact string matching to find similar

proteins and gene sequences

Page 8: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Why Open64?

• Previously, GPU’s were hard to program

• Used optimizing assembler on short shaders

• Did scheduling, register allocation and peephole opts

• For CUDA want ability to code in C/C++

• Needed high-level optimizing compiler

• Open64 was open-source and good optimizer

Page 9: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Where Open64?

CUDA code

Host compiler cudafe

Open64

ptx

ptxas

elf

executable

device codehost code

Page 10: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

What Open64?

preprocessed input

gfec

inliner

be

ptx

• no Fortran

• no IPA

• no LNO

• minimal CG

ptxas does register allocation, scheduling and peephole optimizations

Page 11: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

How Open64?

• Functional enhancements

• Performance enhancements

Page 12: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Windows Host

• Hosted on 32 and 64bit linux, mac, and windows

• Windows build uses MINGW, so need cygwin to build, but can run without cygwin.

• Can also build with visual studio from cygwin.

• No dso’s or dll’s: combine be, wopt, cg and target into one executable.

Page 13: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

PTX Target

• Unlimited virtual registers of different sizes

• Explicit memory spaces (e.g. ld.global)

• Strongly typed instructions

• No stack

• Abstracted call syntax

• Vector memory accesses

Page 14: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Handling Virtual Registers

• PTX has unlimited virtual registers of different sizes

• by default, targ_info and cg use static arrays of registers.

• compile problems when 100,000 registers.

• most info is same so use sparse arrays, hash maps, or just no array (recalculate).

Page 15: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

PTX Target – no stack

• Try to store all local variables in registers

• even for –g (limited local memory space)

• keep small structs and unions in registers

• enhancements in VHO and CGEXP

• use local memory if cannot put in reg (e.g. address taken)

• Abstracted call syntax

• use param space in PTX

• ptxas will utilize param registers and stack

Page 16: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

How Open64?

• Functional enhancements

• Performance enhancements

Page 17: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Vectorizing Memory Accesses

• Vector memory accesses save memory latency.

• We optimize on scalars then in CG we coalesce loads and stores into vectors.

ld.f32 f1, [arr+4];

S1;

ld.f32 f2, [arr+0];

can be vectorized to:

ld.v2.f32 {f2,f1}, [arr+0];

S1;

if arr is 8-byte aligned and S1 does not use f2

Page 18: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Rematerialization

• Rematerialize across basic blocks to reduce register pressure.

• Some instructions like shared memory loads can be folded in final object code.

• Use dominator info to find last reaching def

• find defs in BB_dom, the def that dominates others is last reaching def.

• can rematerialize def if no intervening def, alias, or barrier.

Page 19: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

32->16bit optimization

• C rules promote arithmetic to int (32bits)

• But use fewer registers if pack into 16bit

• Some 16bit instructions are faster (e.g. multiply)

• Pass to analyze 16bit load/store/converts

• Propagate info forwards and backwards

• Change to 16bit if 16bits are enough

Page 20: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Hierarchy of Memory Spaces

Per-thread local memory

Per-block shared memory

Per-device global memory

Generic memory overlays other spaces

Thread

per-threadlocal memory

Block

per-blocksharedmemory

Kernel

. . .

per-deviceglobal

memory. . .

Page 21: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Handling Memory Spaces

• Infer the address space of all memory accesses• Use “generic” or “unified” addressing if a

memory access cannot be resolved statically• Generic address access has more latency than

specific memory access• Specific memory accesses good for

performance• Pointer class analysis

Page 22: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Pointer Class Analysis

• An example

__shared__ int sharedvar;

__device__ void devicefunction(void) { int *lvar = &sharedvar;

= *lvar; // whether to generate generic ld or ld.shared

*lvar = ; // whether to generate generic st or st.shared}

Page 23: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Pointer Class Analysis (continued)

• Another example

__shared__ int sharedvar;

__device__ void devicefunction(int * input) { int *lvar = &sharedvar;

= *lvar + *input; // generate generic ld or ld.shared for lvar?

*lvar = ; // generate generic st or st.shared for lvar?}

• What address space does “input” point to?• Inlining may help disambiguate “input”

Page 24: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Pointer Class-based Alias Analysis

• Memory accesses to different address spaces do not overlap

• Address space information used to help resolve aliases

• Pointer class information maintained for each memory access

• Alias Manager takes pointer class into consideration

Page 25: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Why Pointer Class Analysis ?

• Generic addressing more expensive than specific addressing

• Improve Open64’s alias analysis• Specific memory accesses help memory

disambiguation in ptxas• Some optimizations applicable only to certain

memory spaces (e.g. LDU) In summary pointer class analysis benefits

application performance.

Page 26: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Variance Analysis

• CUDA’s execution model:

• - At a given time, all participated threads run the same kernel in-parallel

• - Each thread has its own registers, thread ID, local memory

Page 27: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Variance Analysis (2)

• Though all participated threads execute the same instructions, but different thread may get different results because of thread ID differences

• Variance Analysis is to find out instructions which may produce thread-dependent (variant) results

Page 28: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Why Variance Analysis ?

• CUDA’s execution model:

• if (cond) then S1 else S2

• step 1: all-threads execute cond

• step 2: threads-with-true-cond execute S1

• step 3: threads-with-false-cond execute S2

Page 29: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Why Variance Analysis (2)?

• If the cond is not variant, only one of the branches is executed.

• If the cond is variant, both branches under the condition will be executed ( sequentially ).

- avoid placing code into multiple branches under a variant condition

Page 30: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Why Variance Analysis (3) ?

An example, … if (x > 0) S1 else S2 endif S3 if (x > 0) S4 else S5 endif

Assuming x is not changed in any of the statements, the following transformation may be desired, … if (x > 0) S1 S3 S4 else S2 S3 S5 endif

But if x is variant, therefore x > 0 may be variant, the above transformation may increase run-time, since S3 could be executed twice.

Page 31: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Why Variance Analysis (3) ?

• Variance Analysis also help explore certain architecture specific properties,

- Issue only one Load request if a LOAD loads from a non-Variant address, otherwise, need to issue load request for each thread.

Page 32: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Variance Analysis Algorithm

• 1) Build Forward Data-Flow

• 2) Collect an initial set of Variant Values

• 3) Chasing Forward Data-Flow on Variant Values to Find affected variant Values

• 4) Mark Variant Conditions, and collect new Variant Values, and back to 3)

Page 33: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Concluding Thoughts

• Open64 has been successfully used for functionality and performance in CUDA.

Page 34: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Concluding Thoughts

• What obstacles have we faced in using Open64?

• Open64 was originally designed for superscalar CPUs

• GPU presents different issues with registers and memory model.

• Register pressure is critical issue; some optimizations increase register pressure.

• GPL license limits some uses of compiler in embedded space.

Page 35: Using Open64 for High Performance Computing on a GPU

@2010 NVIDIA Corporation

Questions?