realizing high-performance pipelines using piko

73
REALIZING HIGH-PERFORMANCE PIPELINES USING PIKO Kerry Seitz, Anjul Patney, Stanley Tzeng, John D. Owens

Upload: others

Post on 23-Mar-2022

4 views

Category:

Documents


0 download

TRANSCRIPT

REALIZING HIGH-PERFORMANCE PIPELINES

USING PIKO

Kerry Seitz, Anjul Patney, Stanley Tzeng, John D. Owens

OUTLINE

Motivation

Related Work

Abstraction

Language

Compiler

Results

2

MOTIVATION

3

Geometry

Materials

Illumination

Camera

Models

Efficiency

FLEXIBILITY

Evolving application scenarios

— Deferred shading on GPUs

— Ray tracing in Reyes

Diversifying architectures

— Discrete GPUs

— Heterogeneous CPU-GPU architectures

— Mobile GPUs

Limited programmability of shaders

6

Piko

CPU CPU +

GRAPHICS PIPELINES

Functional description for graphics algorithms

Intuitive and flexible

Does not directly translate to implementations

Vertex Shade

Rasterize

Fragment

Shade

Composite

8

IMPLEMENTATIONS OF GRAPHICS PIPELINES

VS VS VS VS

Rast Rast

FS FS FS FS

Comp Comp Comp Comp

FS FS FS FS

VS VS Vertex Shade

Rasterize

Fragment

Shade

Composite

9

IMPLEMENTATIONS OF GRAPHICS PIPELINES

Parallelism

Varying work granularity

Coherence

Data locality

VS VS VS VS

Rast Rast

FS FS FS FS

Comp Comp Comp Comp

FS FS FS FS

VS VS

10

PIKO – BASIC IDEA

Augment traditional pipelines to expose these characteristics

Within each stage, we group computation using spatial tiles

We propose three programmable phases per stage

— AssignBin

— Schedule

— Process

11

RELATED WORK

12

GRAMPS

• A flexible pipeline abstraction

• Similarities – Pipelines as graphs

– Expression for intra-stage parallelism

– Use of “queue sets”

• Differences – Specialization of stages

– Scheduling granularity [Sugerman et al. 2009] 13

HALIDE

• Programmable scheduling for image processing

• Similarities – Decouple schedule from algorithm

– Flexible image-space work granularity

• Differences – Image processing applications (pixels)

– Central Scheduler

– Shorter pipelines

[Ragan-Kelley et al. 2012] 14

SOFTWARE PIPELINES ON GPUS

OptiX CudaRaster Line Sampled DOF

RenderAnts VoxelPipe Image-Space Photon Mapping FreePipe

15

SOFTWARE PIPELINES ON GPUS

• Similarities

– Multi-kernel approach

– Use of spatial-bins or tiles

• Differences

– Focus on single pipeline

16

SOFTWARE PIPELINES ON GPUS

Strawman

Multikernel

Strawman

Persistent threads

Strawman

Piko Optimized

Off-chip Memory Core

s

Pipeline Progress

A

A

A

A

B

B

B

B

C

C

C

C

C

C

Off-chip Memory

Core

s

Pipeline Progress

A

A

A

A

B

B

B

B

C

C

C

C

C

C

Off-chip Memory

Core

s

Pipeline Progress

A

A

A

A

B

B

B

B

C

C

C

C

C

C

17

ABSTRACTION

18

Why

— Is everywhere

— Is intuitive

— Is helpful

locality

parallelism

SPATIAL TILING

Forms a link between algorithm and implementation

19

SPATIAL TILING FOR PARALLELISM

Core 1

Core 2

Core 3

Core 4

20

SPATIAL TILING FOR LOCALITY

Core 1

Core 2

Core 3

Core 4

21

EXPRESSING PIPELINES USING TILES

Add programmable tiling to a pipeline

What does the programmer tell us?

— How to group data into bins?

— When and where to schedule each bin?

— What to compute for each bin?

“AssignBin”

“Schedule”

“Process”

22

Input

Primitives

Input Scene

A

A

A

Populated

Bins

AssignBin A Schedule S Process P

S

S

S

S

Execution

Cores

P

P

Final

Output

Stages

sVS sRast sFS sZTest sComp

VS Rast FS ZTest Comp

Phases A S P A S P A S P A S P A S P

A S P A S P A S P A S P A S P

sVS sRast sFS sZTest sComp

VS Rast FS ZTest Comp

A S P AssignBin Schedule Process

Kernels A A A A A A A A S S S S S S S S S S P P P P P P P P P P

sVS sRast sFS sZTest sComp VS Rast FS ZTest Comp

LANGUAGE

25

PIKO LANGUAGE

Basically C++

26

PIKO LANGUAGE

Basically C++

Piko API

— Stage class

— Pipe class

— Math library (using NVIDIA’s libdevice library)

27

PIKO LANGUAGE

Basically C++

Piko API

— Stage class

— Pipe class

— Math library (using NVIDIA’s libdevice library)

Added Directives for optimization information

— Policy – express common patterns

— Hints – can theoretically be derived from code

28

DIRECTIVES - POLICIES AssignBin

— OneToAll

— OneToManyRange

— OneToOneIdentity

29

DIRECTIVES - POLICIES AssignBin

— OneToAll

— OneToManyRange

— OneToOneIdentity

Schedule

— LoadBalance (dynamic)

— DirectMap (static)

— Serialize

— All

TileSplitSize

— EndStage(X)

— EndBin

30

DIRECTIVES - POLICIES AssignBin

— OneToAll

— OneToManyRange

— OneToOneIdentity

Process

— None

Schedule

— LoadBalance (dynamic)

— DirectMap (static)

— Serialize

— All

TileSplitSize

— EndStage(X)

— EndBin

31

DIRECTIVES - HINTS AssignBin

— MaxOutBins

32

DIRECTIVES - HINTS AssignBin

— MaxOutBins

Schedule

— None

33

DIRECTIVES - HINTS AssignBin

— MaxOutBins

Process

— MaxOutPrims

Schedule

— None

34

EXAMPLE STAGE class FragmentShaderStage : public Stage<8, 8, 64, piko_fragment, piko_fragment> { }; 35

EXAMPLE STAGE class FragmentShaderStage : public Stage<8, 8, 64, piko_fragment, piko_fragment> { public: void assignBin(piko_fragment f) { int binID = getBinFromPosition(f.screenPos); this->assignToBin(f, binID); } }; 36

EXAMPLE STAGE class FragmentShaderStage : public Stage<8, 8, 64, piko_fragment, piko_fragment> { public: void assignBin(piko_fragment f) { int binID = getBinFromPosition(f.screenPos); this->assignToBin(f, binID); } void schedule(int binID) { specifySchedule(LOAD_BALANCE); } }; 37

EXAMPLE STAGE class FragmentShaderStage : public Stage<8, 8, 64, piko_fragment, piko_fragment> { public: void assignBin(piko_fragment f) { int binID = getBinFromPosition(f.screenPos); this->assignToBin(f, binID); } void schedule(int binID) { specifySchedule(LOAD_BALANCE); } void process(piko_fragment f) { cvec3f material = gencvec3f(0.80f, 0.75f, 0.65f); cvec3f lightvec = normalize(gencvec3f(1,1,1)); f.color = material * dot(f.normal, lightvec); this->emit(f); } }; 38

EXAMPLE PIPE class RasterPipe : public PikoPipe { };

39

EXAMPLE PIPE class RasterPipe : public PikoPipe { VertexShaderStage vertexShader; RasterStage raster; FragmentShaderStage fragmentShader; };

40

EXAMPLE PIPE class RasterPipe : public PikoPipe { VertexShaderStage vertexShader; RasterStage raster; FragmentShaderStage fragmentShader; public: RasterPipe() { PikoPipe::pikoConnect(vertexShader, raster); PikoPipe::pikoConnect(raster, fragmentShader); } };

41

COMPILER

42

THREE PHASES

Analysis

— Clang

43

THREE PHASES

Analysis

— Clang

Optimization

— LLVM

— Kernel Planner

44

THREE PHASES

Analysis

— Clang

Optimization

— LLVM

— Kernel Planner

Code Generation

— NVPTX backend

— C++ runner code

45

ANALYSIS

Parse Clang AST twice

— Gather stage information

— Gather pipe information

46

ANALYSIS

Parse Clang AST twice

— Gather stage information

— Gather pipe information

Produce Pipeline Skeleton

— Stage policies and hints

— Connections between stages

47

OPTIMIZATION – KERNEL PLANNER

Resolves ordering dependencies

— Explicit barriers

— Cycles

48

OPTIMIZATION – KERNEL PLANNER

Resolves ordering dependencies

— Explicit barriers

— Cycles

Map phases to kernels

— Simplest – each phase is a separate kernel

— Optimized – merge phases together where appropriate

E.g. process + assignBin

49

INTER-STAGE OPTIMIZATIONS

Kernel Fusion

Scheduler Elimination

Static Dependency Resolution

50

A::Schedule

A::Process

B::AssignBin

B::Schedule

B::Process

C::AssignBin

KERNEL FUSION Goal

— Exploit producer-consumer locality

Opportunity

— Same work granularity (tile size)

— No synchronization

A::Schedule

A::Process

B::AssignBin

B::Schedule

B::Process

C::AssignBin

51

B::Schedule

B::Process

C::AssignBin

SCHEDULER ELIMINATION Goal

— Eliminate software scheduling overheads

Opportunity

— Most pipelines simply require a load-balanced schedule

A::Schedule

A::Process

B::AssignBin

B::Schedule

B::Process

C::AssignBin

A::Schedule

A::Process

B::AssignBin

52

STATIC DEPENDENCY RESOLUTION

A::Schedule

A::Process

B::AssignBin

B::Schedule

B::Process

C::AssignBin

A::Schedule

A::Process

B::AssignBin

B::Schedule

B::Process

C::AssignBin

B::Schedule::WaitPolicy

EndStage(A)

B::Schedule::WaitPolicy

EndBin(A)

Global

Sync

Local

Sync

53

CODE GENERATION

C++ Source for kernel runner

— Use Kernel Plan

— CUDA Driver API

58

CODE GENERATION

C++ Source for kernel runner

— Use Kernel Plan

— CUDA Driver API

NVPTX LLVM Backend for device code

— Link with libdevice

— Inline PTX

59

LIBDEVICE

Provide own header for math functions

E.g.

extern “C” double __nv_fmin(double x, double y); namespace piko { double fmin(double x, double y) { return __nv_fmin(x,y); } }

60

LIBDEVICE

Provide own header for math functions

E.g.

extern “C” double __nv_fmin(double x, double y); namespace piko { double fmin(double x, double y) { return __nv_fmin(x,y); } }

Gets resolved when the

libdevice LLVM module is linked

61

INLINE PTX EXAMPLES

Atomic Add

int myAtomicAdd(int* v1, int v2) { int res; asm(“atom.add.s32 %0, [%1], %2;” : “=r”(res) : “1”(v1), “r”(v2)); return res; }

62

INLINE PTX EXAMPLES

Atomic Add

Ballot

int myAtomicAdd(int* v1, int v2) { int res; asm(“atom.add.s32 %0, [%1], %2;” : “=r”(res) : “1”(v1), “r”(v2)); return res; }

int myBallot(int pred) { int res; asm __volatile__ (“{ \n\t“ “.reg .pred \t%%p1; \n\t” “setp.ne.u32 \t%%p1, %1, 0; \n\t” “vote.ballot.b32 \t%0, %%p1; \n\t” “}” : “=r”(res) : “r”(pred)); return res; } 63

RESULTS

64

EXPERIMENTAL SETUP

A feed-forward rasterization pipeline

We studied its efficiency, flexibility, and portability

65

RAST-STRAWMAN (RSM)

Vertex Shade

Geometry Shade

Rasterize

Fragment Shade

Depth Test

Composite

Vertex Shade

Geometry Shade

Rasterize

Fragment Shade

Depth Test

Composite

K1

K2

K3

K4

K5

K6

66

RAST-FREEPIPE (RFP)

Vertex Shade

Geometry Shade

Rasterize

Fragment Shade

Depth Test

Composite

Vertex Shade

Geometry Shade

Rasterize

Fragment Shade

Depth Test

Composite

K1

67

RAST-LOCALITY (RLC)

Vertex Shade

Geometry Shade

Rasterize

Fragment Shade

Depth Test

Composite

Vertex Shade

Geometry Shade

Rasterize

Fragment Shade

Depth Test

Composite

K1

K2

K3

68

RAST-LOADBALANCE (RLB)

Vertex Shade

Geometry Shade

Rasterize

Fragment Shade

Depth Test

Composite

Vertex Shade

Geometry Shade

Rasterize

Fragment Shade

Depth Test

Composite

K1

K2

K3

K4

K5

69

RESULTS - FRAMEWORK

We tested Piko against two pipeline implementations:

Strawman – 1 Kernel per pipeline stage.

Freepipe – 1 Kernel for the whole pipleline.

We configure Piko to produce two types of pipelines:

Loadbalance – Pikoc does not merge pipeline stages.

Locality – Pikoc merges pipeline stages as much as possible.

70

RESULTS – ALTERNATIVE HARDWARE

Implemented piko-generated pipelines on heterogeneous architectures that have CPU and GPU on the same chip.

We tested Piko on Intel’s Ivy Bridge.

Since Pikoc does not generate OpenCL backend code, we used Pikoc’s pipeline skeleton and hand-coded the pipelines.

too much

text

71

IS IT FLEXIBLE?

0

50

100

150

200

250

300

350

400

RSM RFP RLC RLB

Off

-chip

access

es

(MB)

Locality

Loads

Stores

0

200

400

600

800

1000

1200

1400

1600

RSM RFP RLC RLB

Fra

gm

ents

Shaded /

Core

Load Balance

Mean

Max

Min

Ability to explore multiple design choices for one pipe

72

IS IT EFFICIENT?

Yes

— rast-loadbalance is 4x faster than rast-strawman

— Combination of tiling and opportunistic kernel fusion

No

— RLB is 3-4x slower than state of the art (cudaraster)

— Aggressive optimizations not pursued

— Indirect optimizations unavailable to Piko

73

IS IT PORTABLE?

74

0 1 2 3 0 1 2 3 0 1 2 3

0 1 2 3 0 1 2 3 0 1 2 3

0 1 2 3 0 1 2 3 0 1 2 3

0 1 2 3 0 1 2 3 0 1 2 3

0 1 2 3 0 1 2 3 0 1 2 3

0 1 2 3 4 5 6 0 1 2 3 4

5 6 0 1 2 3 4 5 6 0 1 2

3 4 5 6 0 1 2 3 4 5 6 0

1 2 3 4 5 6 0 1 2 3 4 5

0 1 2 3 4 5 6 0 1 2 3 4

5 6 0 1 2 3 4 5 6 0 1 2

3 4 5 6 0 1 2 3 4 5 6 0

1 2 3 4 5 6 0 1 2 3 4 5

0 1 2 3 4 5 6 0 1 2 3

4 5 6 0 1 2 3 4 5 6 0

2 3 4 5 6 0 1 2 3 4 5

0 1 2 3 4 5 6 0 1 2 3 4

5 6 0 1 2 3 4 5 6 0 1 2

1

6

6

GPU Cores

CPU Cores

GPU Cores

Discrete GPU Hybrid CPU-GPU

IS IT PORTABLE?

0

0.2

0.4

0.6

0.8

1

1.2

RSM RLC RLB

Norm

alized t

ime /

fra

me NVIDIA GPU

Intel Ivy BridgeLocality works

better for CPU +

GPU

Load balance

works for both

architectures

75

SUMMARY

This talk introduced Piko: User writes pipeline stages split into three components:

assignBin, schedule, and process

Pikoc will fuse pipeline stages together into an optimal set of CUDA kernels

Pikoc runs with a LLVM backend so that we can translate to other architectures as well (in the future).

too much

text

79

FUTURE WORK

Additional Features into Piko:

— Printf (now in Piko!)

— Support for more GPU instructions (i.e. funnel shift, shuffle, etc.)

Additional Pipeline Implementations:

— Raytracing, hybrid pipelines

Additional Target Architectures:

— Potential Tegra K1 target

— HSAIL LLVM IR Integration

80

THANK YOU

81