Beyond CUDA/GPUs and Future Graphics Architectures

Karu Sankaralingam, Department of Computer Science, University of Wisconsin-Madison

Adapted from “Toward A Multicore Architecture for Real-time Raytracing,” MICRO-41, 2008, by Venkatraman Govindaraju, Peter Djeu, Karthikeyan Sankaralingam, Mary Vernon, and William R. Mark.

Post on 19-Jan-2016

Page 1: Beyond CUDA/GPUs and Future Graphics Architectures

Department of Computer Science

Beyond CUDA/GPUs and Future Graphics Architectures

Karu Sankaralingam
University of Wisconsin-Madison

Adapted from “Toward A Multicore Architecture for Real-time Raytracing,” MICRO-41, 2008, by Venkatraman Govindaraju, Peter Djeu, Karthikeyan Sankaralingam, Mary Vernon, and William R. Mark.

Page 2: Beyond CUDA/GPUs and Future Graphics Architectures


Real-time Graphics Rendering

Today

Page 3: Beyond CUDA/GPUs and Future Graphics Architectures

Real-time Graphics Rendering

Today

Future

Page 4: Beyond CUDA/GPUs and Future Graphics Architectures

Real-time Graphics Rendering

• What are the problems?
• How can we get there?

Page 5: Beyond CUDA/GPUs and Future Graphics Architectures

What is wrong with this picture?

Page 6: Beyond CUDA/GPUs and Future Graphics Architectures

GPU/CUDA

Z-buffer

Page 7: Beyond CUDA/GPUs and Future Graphics Architectures

“Ptolemaic” Graphics Universe

[Figure labels: Z-buffer, Arch, Application]

• Architecture and application are all optimized for the Z-buffer
• Difficult to render images with realistic effects
  – self-reflection, soft shadows, ambient occlusion
• Problems:
  – Scene constraints, artist and programmer productivity

Page 8: Beyond CUDA/GPUs and Future Graphics Architectures

Current Graphics Architectures

Courtesy: ACM Queue

Page 9: Beyond CUDA/GPUs and Future Graphics Architectures

How did we get here?

• Hardware rasterizers and perspective-correct texture mapping (RIVA 128)
• Single-pass multitexture (TNT / TNT2)
• Register combiners: a generalization of multitexture (GeForce 256)
• Per-pixel shading (GeForce 2 GTS)
• Programmable hardware pixel shading
• Programmable vertex shading
• CUDA

Page 10: Beyond CUDA/GPUs and Future Graphics Architectures

“Copernican” Graphics Universe

[Figure labels: Algorithm (Ray-tracing), Arch, Application]

• Architecture and application revolve around the algorithm
• More general-purpose algorithm
• Easier to provide realistic effects
• Architecture can support other applications

Page 11: Beyond CUDA/GPUs and Future Graphics Architectures

Future Graphics Architectures

Courtesy: ACM Queue

Page 12: Beyond CUDA/GPUs and Future Graphics Architectures


Executive Summary: Copernicus System

Co-designed application, architecture and analysis framework

Path from specialized graphics architecture to more general purpose architecture.

A detailed characterization and analysis framework

Real-time frame rates possible for high quality dynamic scenes

Page 13: Beyond CUDA/GPUs and Future Graphics Architectures

Outline

• Motivation
• Copernicus system
  – Graphics Algorithm: Razor
  – Architecture
  – Evaluation and Results
• Summary

Page 14: Beyond CUDA/GPUs and Future Graphics Architectures

Ray-tracing

[Figure labels: Full scene, Cube, Cylinder]

• Simulates the behavior of light rays through a 3D scene
• Rays from eye to scene (primary rays)
• Rays from hit point to light (secondary rays)
• Acceleration structure (e.g., BSP tree) for efficiency
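The primary and secondary rays listed above can be sketched in a few lines of Python. This is a minimal illustration, not Razor's code: the sphere scene, the function names `hit_sphere` and `trace`, and the point light are all assumptions made for the example, and the brute-force loop over objects stands in for the work an acceleration structure would avoid.

```python
import math

def hit_sphere(origin, direction, center, radius):
    """Distance along a unit-length ray to the nearest sphere hit, or None.

    Only the near root is considered, so rays that start on or inside
    the sphere report no hit (enough for this sketch).
    """
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > 1e-6 else None

def trace(eye, direction, spheres, light):
    """Shoot one primary ray, then one secondary (shadow) ray to the light."""
    nearest = None
    for center, radius in spheres:          # brute force: no acceleration structure
        t = hit_sphere(eye, direction, center, radius)
        if t is not None and (nearest is None or t < nearest):
            nearest = t
    if nearest is None:
        return "miss"

    # Secondary ray: from the hit point toward the (hypothetical) point light.
    hit = [e + nearest * d for e, d in zip(eye, direction)]
    to_light = [l - h for l, h in zip(light, hit)]
    dist = math.sqrt(sum(v * v for v in to_light))
    shadow_dir = [v / dist for v in to_light]
    for center, radius in spheres:
        t = hit_sphere(hit, shadow_dir, center, radius)
        if t is not None and t < dist:
            return "shadowed"
    return "lit"

# Hypothetical scene: one visible sphere, one optional occluder, one light.
target = ((0.0, 0.0, -5.0), 1.0)
occluder = ((0.0, 2.5, -2.0), 0.5)
light = (0.0, 5.0, 0.0)
print(trace((0.0, 0.0, 0.0), (0.0, 0.0, -1.0), [target], light))
print(trace((0.0, 0.0, 0.0), (0.0, 0.0, -1.0), [target, occluder], light))
```

Every ray here tests every object; an acceleration structure such as a BSP or KD tree exists to cut that loop down to the few objects near the ray.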

Page 15: Beyond CUDA/GPUs and Future Graphics Architectures

Disadvantages of Raytracing

• The acceleration structure must be rebuilt every frame for dynamic scenes.
• Traversing the acceleration structure causes irregular data accesses.
• Higher-resolution secondary ray tracing computation.

Page 16: Beyond CUDA/GPUs and Future Graphics Architectures

Razor: A Dynamic Multiresolution Raytracer

[Figure labels: Cube, Cylinder; Thread 1, Thread 2]

• Packet ray-tracer: traces a beam of rays instead of a single ray
  – Opportunity for data-level parallelism
• Each thread lazily builds its own acceleration structure (KD-tree)
  – Builds only the portion of the structure it needs
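The idea of building only the needed portion of a KD-tree can be illustrated with a small sketch. It is not Razor's data structure: the class `LazyKDNode`, the 2D point primitives, and the box query (a stand-in for the region a ray packet touches) are assumptions chosen to keep the example short.

```python
class LazyKDNode:
    """KD-tree node over 2D points that splits only when a query visits it."""

    def __init__(self, points, depth=0):
        self.points = points
        self.depth = depth
        self.split = None       # (axis, value), set on first expansion
        self.children = None    # stays None until a query needs this subtree

    def _expand(self):
        axis = self.depth % 2
        pts = sorted(self.points, key=lambda p: p[axis])
        mid = len(pts) // 2
        self.split = (axis, pts[mid][axis])
        self.children = (LazyKDNode(pts[:mid], self.depth + 1),
                         LazyKDNode(pts[mid:], self.depth + 1))

    def query(self, lo, hi, leaf_size=2):
        """Return the points inside the box [lo, hi], expanding nodes on demand."""
        if len(self.points) <= leaf_size:
            return [p for p in self.points
                    if all(lo[a] <= p[a] <= hi[a] for a in range(2))]
        if self.children is None:
            self._expand()                  # lazy build happens here
        axis, value = self.split
        found = []
        if lo[axis] <= value:               # box may overlap the left half
            found += self.children[0].query(lo, hi, leaf_size)
        if hi[axis] >= value:               # box may overlap the right half
            found += self.children[1].query(lo, hi, leaf_size)
        return found

    def built_nodes(self):
        """Count the nodes allocated so far."""
        if self.children is None:
            return 1
        return 1 + sum(child.built_nodes() for child in self.children)

# Hypothetical scene: an 8x8 grid of points, queried in one corner only.
tree = LazyKDNode([(x, y) for x in range(8) for y in range(8)])
corner = tree.query((0, 0), (1, 1))
print(len(corner), "points found;", tree.built_nodes(), "nodes built")
```

A query confined to one corner leaves most of the tree unbuilt, which is the saving a lazily built per-thread structure aims for when each thread's rays touch a limited part of the scene.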

Page 17: Beyond CUDA/GPUs and Future Graphics Architectures

Razor: A Dynamic Multiresolution Raytracer

• Multi-level resolution to reduce secondary-ray computation
• Replicates the KD-tree to reduce synchronization across threads
  – Hypothesis: duplication across threads will be limited
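One way to make the duplication hypothesis concrete is to give each thread a private record of the tree nodes it builds and compare the total against the number of distinct nodes. The sketch below does that with Python threads; the names `run_threads` and `duplication_factor` and the node-id ranges are hypothetical, and integer ids stand in for real KD-tree nodes.

```python
import threading

def run_threads(work_per_thread):
    """Each thread records the node ids its rays touch in a private set.

    The sets are never shared while the threads run, so no locks are needed.
    """
    private = [set() for _ in work_per_thread]

    def worker(i):
        for node_id in work_per_thread[i]:
            private[i].add(node_id)     # thread-private state only

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(work_per_thread))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return private

def duplication_factor(private):
    """Total nodes built across threads divided by distinct nodes built."""
    total = sum(len(s) for s in private)
    unique = len(set().union(*private))
    return total / unique if unique else 1.0

# Hypothetical workload: two threads whose rays overlap on 10 of 110 nodes.
built = run_threads([range(0, 60), range(50, 110)])
print(round(duplication_factor(built), 3))
```

A factor close to 1.0 means replication costs little extra memory or build work; the trade is that no thread ever waits on another to read or extend the structure.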

Page 18: Beyond CUDA/GPUs and Future Graphics Architectures

Razor Implementation

• Linux/x86
  – Implemented Razor on Intel Clovertown
  – Parallelized using pthreads
• Optimized with SSE instructions
• Sustains 1 FPS on this prototype system
• Helps develop algorithms
• Designed with future hardware in mind

Page 19: Beyond CUDA/GPUs and Future Graphics Architectures

Razor’s Memory Usage

[Figure: memory footprint vs. number of threads]

Page 20: Beyond CUDA/GPUs and Future Graphics Architectures

Parallel Scalability

[Figure: speedup (axis 1 to 6) vs. number of threads (1 to 8) for the scenes Courtyard, Fairyforest, Forest, Juarez, and Saloon]

Page 21: Beyond CUDA/GPUs and Future Graphics Architectures

Outline

• Motivation
• Copernicus system
  – Graphics Algorithm: Razor
  – Architecture
  – Evaluation and Results
• Summary

Page 22: Beyond CUDA/GPUs and Future Graphics Architectures

Architecture: Core

• In-order core
• Private L1 data and instruction caches
• Supports SIMD instructions
• SMT threads to hide memory latency
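Why SMT threads help an in-order core can be seen with a textbook latency-hiding estimate. The formula below is a common first-order model, not the analytical model from the talk: it assumes each thread computes for `compute` cycles, then stalls for `latency` cycles on a miss, and that the core switches threads at no cost. The cycle counts are made up for illustration.

```python
def core_utilization(threads, compute, latency):
    """Fraction of cycles an in-order core stays busy with SMT.

    First-order assumption: each thread alternates `compute` busy cycles
    with a `latency`-cycle memory stall, and other threads fill the stall.
    """
    return min(1.0, threads * compute / (compute + latency))

# Hypothetical numbers: 20 cycles of work per 60-cycle miss.
for n in (1, 2, 4):
    print(n, "threads ->", core_utilization(n, compute=20, latency=60))
```

Under these assumed numbers the core saturates at four threads, and adding more would not raise utilization further.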

Page 23: Beyond CUDA/GPUs and Future Graphics Architectures

Architecture: Tile

• Shared L2 cache
• Shared accelerator for specialized instructions

Page 24: Beyond CUDA/GPUs and Future Graphics Architectures


Architecture: Chip

Page 25: Beyond CUDA/GPUs and Future Graphics Architectures

Architecture: Razor Mapping

Assigned to Tile

Assigned to Core

Page 26: Beyond CUDA/GPUs and Future Graphics Architectures

Outline

• Motivation
• Copernicus system
  – Graphics Algorithm: Razor
  – Architecture
  – Evaluation and Results
• Summary

Page 27: Beyond CUDA/GPUs and Future Graphics Architectures

Benchmark Scenes

[Figure: Courtyard, Fairyforest, Forest, Juarez, Saloon]

Page 28: Beyond CUDA/GPUs and Future Graphics Architectures

Evaluation Methodology

• Simulation with Multifacet/GEMS
  – Simulate SSE instructions
  – Simulate a full tile
  – Validated with prototype data
    • Pin-based and PAPI-based performance counters
  – Randomly selected regions of scenes
• Full chip
  – Simulating the full chip is too slow
  – Build a customized analytic model

Page 29: Beyond CUDA/GPUs and Future Graphics Architectures

Analytical Model

• Core level
  – Pipeline stalls
  – Multiple threads
• Tile level
  – L2 contention
• Chip level
  – Main memory contention
• Compared with our simulation results
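The chip-level part of such a model can be approximated by a simple bottleneck bound: throughput is the smaller of what the tiles can compute and what the memory system can feed. The sketch below is far coarser than the model in the talk, which also accounts for pipeline stalls and L2 contention. The function name and every number in it are hypothetical.

```python
def chip_rays_per_second(tiles, rays_per_tile, dimms,
                         bytes_per_dimm_per_second, bytes_per_ray):
    """Bottleneck estimate: compute-bound rate capped by memory bandwidth."""
    compute_bound = tiles * rays_per_tile
    memory_bound = dimms * bytes_per_dimm_per_second / bytes_per_ray
    return min(compute_bound, memory_bound)

# Hypothetical parameters, chosen only to show the two regimes.
PER_TILE = 4e6      # rays/s one tile could sustain with ideal memory
DIMM_BW = 6.4e9     # bytes/s per DIMM
RAY_COST = 800      # bytes of main-memory traffic per ray

for dimms in (1, 4):
    rate = chip_rays_per_second(16, PER_TILE, dimms, DIMM_BW, RAY_COST)
    print(dimms, "DIMM(s):", rate / 1e6, "million rays/s")
```

With the assumed numbers, one DIMM caps a 16-tile chip well below its compute rate while four DIMMs leave memory with headroom, which is the kind of question a chip-level contention model is meant to answer.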

Page 30: Beyond CUDA/GPUs and Future Graphics Architectures

Single Core Performance (Single Issue)

[Figure: IPC (axis 0 to 2) for Courtyard, Fairyforest, Forest, Juarez, and Saloon with no SMT, 2 SMT, and 4 SMT]

Page 31: Beyond CUDA/GPUs and Future Graphics Architectures

Single Core Performance (Dual Issue)

[Figure: IPC (axis 0 to 2) for Courtyard, Fairyforest, Forest, Juarez, and Saloon with no SMT, 2 SMT, and 4 SMT]

Page 32: Beyond CUDA/GPUs and Future Graphics Architectures

Single Tile Performance

[Figure: IPC (axis 0 to 8) for Courtyard, Fairyforest, Forest, Juarez, and Saloon with no SMT, 2 SMT, and 4 SMT]

Page 33: Beyond CUDA/GPUs and Future Graphics Architectures

Full Chip Performance

[Figure: million rays per second (axis 0 to 80) vs. number of tiles (0 to 16) for Ideal, 1 DIMM, 2 DIMMs, 3 DIMMs, and 4 DIMMs]

Page 34: Beyond CUDA/GPUs and Future Graphics Architectures


So, Are we there yet?

Page 35: Beyond CUDA/GPUs and Future Graphics Architectures

Results

• Goal: 100 million rays per second
• Achieved: 50 million rays per second
  – With 16 tiles and 4 DIMMs
• Insights:
  – 4 SMT single issue is ideal for this workload
  – Good parallel scalability
  – Razor’s physically-motivated optimizations work
• Potential for further architectural optimizations
  – Shared accelerator
  – Wide SIMD bundles
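To put the two throughput figures in per-frame terms, a quick calculation converts rays per second into rays per pixel. The 50 and 100 million rays/s values come from the slide; the 1024×768 resolution and 30 frames per second are assumptions picked for illustration and do not appear in the talk.

```python
def rays_per_pixel(rays_per_second, fps, width, height):
    """Average ray budget per pixel per frame at a given throughput."""
    return rays_per_second / fps / (width * height)

# Assumed display target (not from the talk): 1024x768 at 30 frames/s.
for label, rate in (("achieved", 50e6), ("goal", 100e6)):
    print(label, round(rays_per_pixel(rate, 30, 1024, 768), 2), "rays/pixel")
```

Under these assumptions the achieved rate leaves roughly two rays per pixel per frame, so one primary ray plus secondary rays for shadows and reflections would already consume the budget.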

Page 36: Beyond CUDA/GPUs and Future Graphics Architectures

Outline

• Motivation
• Copernicus system
  – Graphics Algorithm: Razor
  – Architecture
  – Evaluation and Results
• Summary

Page 37: Beyond CUDA/GPUs and Future Graphics Architectures

Summary

• A transformation path to ray-tracing
  – Ptolemaic universe to Copernican graphics universe
• Unique architecture design point
  – Favors data redundancy and re-computation over synchronization
• Evaluation methodology interesting in its own right
  – Prototype, simulation, and analytical framework to design and evaluate future systems
• Future work
  – Instruction specialization and shared accelerator design
  – Tradeoffs with SIMD width and area
  – Memory system

Page 38: Beyond CUDA/GPUs and Future Graphics Architectures


Other Questions?