NVIDIA’s Experience with Open64

Mike Murphy

NVIDIA

Outline

Why Open64
How we use Open64
What we did to Open64
Future work in Open64

Compiling CUDA for GPUs

Diagram: C/C++ CUDA Application -> CPU Code + GPU Code -> executable

Why Open64

We had a low-level code generator for graphics codes, but for CUDA needed high-level optimization for C/C++ codes.

own: take too long
gcc: good long-term support
open64: best performance (kudos to PathScale)

NVCC processing of GPU code

cudafe

C code for GPU

nvopencc (Open64)

object code

Changes: Rehosting Open64

Our compiler has to run on 32 & 64bit Linux, 32 & 64bit Windows, and Mac OS.

Main Open64 source tree is only for Linux.

This is an area where sharing our changes can help grow the user base by making it easier to port Open64.

For Windows, we build using Cygwin’s MinGW.

Changes: Memory and registers

We don’t have a stack or fast memory
Therefore want to keep data in registers
Inline everything and optimize as much as possible
Try to keep small structs in registers by expanding struct copies into field copies (versus taking address and generating loop to do byte copy)
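To make the struct-copy point concrete, here is a minimal sketch in plain C of the two forms a struct copy can take. The type `Vec3` and the function names are hypothetical, chosen for illustration only; they are not taken from the Open64 or nvopencc sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical small struct that would ideally live in three registers. */
typedef struct { float x; float y; float z; } Vec3;

/* Byte-copy form: both structs have their address taken, so they must
   live in memory, and the copy becomes a loop over bytes. */
void copy_bytes(Vec3 *dst, const Vec3 *src) {
    assert(dst != NULL && src != NULL);
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    for (size_t i = 0; i < sizeof(Vec3); ++i) {
        d[i] = s[i];
    }
}

/* Field-copy form: each field is an independent scalar move, so no
   address is taken and every field can stay in a register. */
Vec3 copy_fields(Vec3 src) {
    Vec3 dst;
    dst.x = src.x;
    dst.y = src.y;
    dst.z = src.z;
    return dst;
}
```

Both functions produce the same result; the difference is only in what the later compiler phases see. The second form gives the register allocator three scalars to work with instead of a block of memory.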

Changes: Vector loads and stores

Coalesce adjacent loads and stores for performance
Do this in CG:

Iterate through ops, trying to add to vectors
Check for intervening kills
Change alignment and use dummy regs for padding if helps to create wider vector (e.g. may use 4-word vector for 3-word struct).
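The steps above can be sketched roughly as follows. This is a toy model in C, not the CG pass itself: the `Op` record, the conservative rule that any store is a kill, and the assumption that vectors come in widths of 1, 2, or 4 words are all illustrative choices.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VEC_WORDS 4  /* assumed widest vector access, in words */

typedef enum { OP_LOAD, OP_STORE, OP_OTHER } OpKind;

/* Hypothetical op record: 'base' names the address register, 'offset'
   is a word offset from it, 'group' receives the vector id (-1 = none). */
typedef struct { OpKind kind; int base; int offset; int group; } Op;

/* Walk the ops once, adding each load to the open vector when it reads
   the next word from the same base and no store has intervened.
   Returns the number of vector groups formed. */
int coalesce_loads(Op *ops, int n) {
    assert(ops != NULL || n == 0);
    int groups = 0;
    int open = -1;   /* index of the last load in the open group */
    int width = 0;   /* words collected in the open group so far */
    for (int i = 0; i < n; ++i) {
        ops[i].group = -1;
        if (ops[i].kind == OP_STORE) { open = -1; continue; } /* kill */
        if (ops[i].kind != OP_LOAD) { continue; }
        if (open >= 0 && width < MAX_VEC_WORDS &&
            ops[i].base == ops[open].base &&
            ops[i].offset == ops[open].offset + 1) {
            ops[i].group = ops[open].group;  /* extend the open vector */
            width++;
        } else {
            ops[i].group = groups++;         /* start a new vector */
            width = 1;
        }
        open = i;
    }
    return groups;
}

/* Pad a group to an assumed supported width: 3 words become 4, with
   the extra lane going to a dummy register. */
int padded_width(int words) {
    assert(words >= 1 && words <= MAX_VEC_WORDS);
    return words == 3 ? 4 : words;
}
```

A real pass would use alias information instead of treating every store as a kill, and would also check alignment before widening. The sketch only shows the shape of the scan.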

Changes: 16bit optimization

Cheaper to use 16bit registers and operations
But C converts shorts to int.
So add pass in CG that converts back to 16bit:

Mark 16bit loads, stores, and converts
Propagate 16bit-ness forwards and backwards
Unmark 16bit-ness if cannot be 16bit
Change remaining registers and instructions to be 16bit.
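A small C sketch can show why the "unmark" step is needed. It assumes a 32-bit `int`, and the function names are hypothetical. The first function can be narrowed safely because only the low 16 bits of the sum are ever used; the second cannot, because the shift reads bit 16 of the promoted sum.

```c
#include <assert.h>
#include <stdint.h>

/* C's integer promotions: arithmetic on shorts is done in int. */
static_assert(sizeof((short)1 + (short)1) == sizeof(int),
              "short operands are promoted to int");

/* Narrowable: the int sum is truncated straight back to 16 bits, so a
   16-bit add gives the same answer. */
uint16_t sum16(uint16_t a, uint16_t b) {
    return (uint16_t)(a + b);
}

/* Not narrowable: the shift consumes the carry out of bit 15, which
   only exists because the add was done at int width. */
uint16_t avg16(uint16_t a, uint16_t b) {
    return (uint16_t)((a + b) >> 1);
}

/* What avg16 would compute if its add were wrongly narrowed to 16 bits. */
uint16_t avg16_wrongly_narrowed(uint16_t a, uint16_t b) {
    return (uint16_t)((uint16_t)(a + b) >> 1);
}
```

With `a = 0xFFFF` and `b = 1`, `avg16` yields `0x8000` while the wrongly narrowed version yields `0`, so a pass that propagates 16bit-ness has to back off on the add feeding that shift.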

Future work

1 person -> 4 people working with Open64
New application TBA
Merging changes into trunk
  Thanks to Sun Chan and Shin!
Investigating register pressure in WOPT
  Want better control of register pressure during optimization
Investigating using other features (LNO, IPA, etc.)

Questions?

http://www.nvidia.com/CUDA

mmurphy@nvidia.com
