
Page 1

Lecture 5: HW1 Discussion, Intro to GPUs

G63.2011.002/G22.2945.001 · October 5, 2010


Page 2

Outline

Discuss HW1

Intro to GPU Computing



Page 4

Dense Matrix Multiply: Blocking vs Scalar

We provided a blocked example matrix-multiplication code. Why is blocked matmul faster than un-blocked?

Key: Computational Intensity

Definition: flops per floating-point number (FPN) moved up the memory hierarchy.

Large intensity: good for deep memory hierarchies.

Page 5

Computational Intensity for Scalar Matmul

Floating point operations: 2N³

Assume: size(L1) ≪ N² FPNs

Cache misses (FPN-sized, neglecting cache lines, etc.):
N² (read each row of A once)
+ N³ (read each column of B, N times)
+ 2N² (read/write C)
= N³ + 3N² FPN-size cache misses

Computational intensity: 2N³ / (N³ + 3N²) ≈ 2
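For reference, an un-blocked (scalar) triple loop of the kind analyzed above might look like the sketch below; the function name and the row-major layout are my illustrative assumptions, not the exact HW1 code.

/* Naive N x N matmul, C += A*B, row-major.
 * 2*N^3 flops. For fixed i, the row A[i,:] stays in cache across the
 * j loop (~N^2 misses in total for A), but B is walked column-wise,
 * touching a new cache line on every k step, contributing ~N^3 misses
 * when the columns do not stay resident in cache. */
void matmul_naive(int N, const double *A, const double *B, double *C)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
        {
            double sum = C[i*N + j];
            for (int k = 0; k < N; ++k)
                sum += A[i*N + k] * B[k*N + j];
            C[i*N + j] = sum;
        }
}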

Page 6

Computational Intensity for Blocked Matmul

Floating point operations: still 2N³

b: block size, n = ⌈N/b⌉

Cache misses (FPN-sized):
b²n³ (n³ block reads of A, b² FPNs each)
+ b²n³ (same for B)
+ 2N² (read/write C)
= 2b²n³ + 2N² FPN-size cache misses

Rewrite: b²n³ ≈ b² · N³/b³ = N³/b

Computational intensity:
2N³ / (2N³/b + 2N²) ≈ 2N³ / (2N³/b) = b

→ incentive to choose b ≫ 2

The power of assumptions: can we choose b = N?
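A minimal sketch of the blocked variant (illustrative only; the loop structure, row-major layout, and the assumption that b divides N are mine, and the HW1 code differs in detail):

/* Blocked N x N matmul, C += A*B, row-major, assuming N % b == 0.
 * Each b x b block of A and B is reused b times while it sits in cache,
 * raising the computational intensity from about 2 to about b. */
void matmul_blocked(int N, int b, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < N; ii += b)
        for (int jj = 0; jj < N; jj += b)
            for (int kk = 0; kk < N; kk += b)
                /* C[ii..ii+b, jj..jj+b] += A[ii.., kk..] * B[kk.., jj..] */
                for (int i = ii; i < ii + b; ++i)
                    for (int k = kk; k < kk + b; ++k)
                    {
                        double a_ik = A[i*N + k];
                        for (int j = jj; j < jj + b; ++j)
                            C[i*N + j] += a_ik * B[k*N + j];
                    }
}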


Page 8

Hatching a Plan

Consider each level of the memory hierarchy.

How do we exploit...

• ...L2: ignore; we're nearly L2-local at most sizes.

• ...L1: 32 KiB = 4096 FPNs. Key: memory layout.

• ...registers: 16 FP registers. Key: loop/operation ordering.

Page 9

Optimizing for L1: Memory Layout

Memory layout of A: column-major.

Only one entry of each cache line is used per fetch.

Better to store A in row-major order.

Input is row-major. If memory is available (not swap!), storing a transposed copy of A can be a good idea. (The copy takes O(N²) time.)


Page 11

Optimizing for L1: Reuse Pattern, Block Size

Question: blocking is a good idea. What is the optimal b_L1?

Follow-up question: how much needs to fit in L1?

One block of each of A, B, C.
All of A, plus one column of B and C.

32 KiB: 8·b_L1² + 2·8·b_L1 bytes → b_L1 ≤ 60
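As a quick arithmetic check of that bound (my own, assuming 8-byte FPNs and the 32 KiB L1 from above): 8·60² + 2·8·60 = 28800 + 960 = 29760 bytes, which fits in 32768 bytes with room to spare, while substantially larger blocks no longer do.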


Page 15

L1 Block Copy

Further concerns:

• Cache line boundaries

• SIMD

• Cache set conflicts

All solved by the small-block copy optimization.

Copy all of A? Better: copy b_L1-sized blocks of A, B, and C, operate on those, then copy the output back.

Page 16

L1 Block Copy: The Plan

Basic plan:

For each i:
  For each j:
    Load block C[i,j]
    For each k:
      Load block A[i,k]
      Load block B[k,j]
      ⌈b_L1/b_r⌉³ register kernels: C += A·B
    Store block C[i,j]

(can be improved: many A, B loads)

Aside: Also neatly deals with fringes.
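A rough C sketch of that plan (a sketch only, under my assumptions: contiguous b_L1 × b_L1 scratch blocks and hypothetical helpers copy_block_in / copy_block_out / register_kernel that are not part of the HW1 code; fringe handling would live inside the copy helpers):

/* Hypothetical helpers, declared elsewhere. */
void copy_block_in(int N, int b, const double *M, int r, int c, double *blk);
void copy_block_out(int N, int b, const double *blk, double *M, int r, int c);
void register_kernel(int b, const double *A, const double *B, double *C);

/* One L1-blocked pass of C += A*B using the small-block copy optimization.
 * Ablk, Bblk, Cblk are contiguous bL1 x bL1 scratch arrays, so the register
 * kernel sees unit-stride, aligned, conflict-free data. */
void matmul_l1_copy(int N, int bL1,
                    const double *A, const double *B, double *C,
                    double *Ablk, double *Bblk, double *Cblk)
{
    for (int i = 0; i < N; i += bL1)
        for (int j = 0; j < N; j += bL1)
        {
            copy_block_in(N, bL1, C, i, j, Cblk);       /* load block C[i,j] */
            for (int k = 0; k < N; k += bL1)
            {
                copy_block_in(N, bL1, A, i, k, Ablk);   /* load block A[i,k] */
                copy_block_in(N, bL1, B, k, j, Bblk);   /* load block B[k,j] */
                register_kernel(bL1, Ablk, Bblk, Cblk); /* Cblk += Ablk*Bblk */
            }
            copy_block_out(N, bL1, Cblk, C, i, j);      /* store block C[i,j] */
        }
}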

So: how does this solve the problems above? Can you define “alignment”?


Page 18

Alignment

A memory address a is n-byte aligned when n is a power of two and a is a multiple of n bytes. (See also the IBM developerWorks article.)

#include <stdlib.h>

/* dynamic allocation */
double *__attribute__((aligned(64))) var;
int error = posix_memalign((void **) &var, 64, array_size);
if (error)
    abort();

/* static allocation */
double __attribute__((aligned(64))) ary2[500];

Examples: Cache-line-aligned, SIMD-aligned.

Code generation in the non-aligned case?



Page 20

Register Kernel

Choose block size b_r = 2^k, with b_L1 mod b_r = 0.

for (int j = 0; j < b_r; ++j)
    for (int k = 0; k < b_r; ++k)
        for (int i = 0; i < b_r; ++i)
            C[i + j*b_l1] += A[i + k*b_l1] * B[k + j*b_l1];

For each A·b matvec: perform b_r scalar·vector updates.

• Vectorizable

• Pipeline-friendly (minimal data dependencies)

• Access to A, C is unit-stride

• Access to B is inner-loop invariant

• Unrolling, software pipelining: left to the compiler
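To illustrate the “vectorizable” and “B is inner-loop invariant” points, here is one possible SSE2 rendering of that kernel (my sketch, not the HW1 solution; it assumes b_r and b_l1 are even and that the copied blocks are 16-byte aligned):

#include <emmintrin.h>

void register_kernel_sse(int b_r, int b_l1,
                         const double *A, const double *B, double *C)
{
    for (int j = 0; j < b_r; ++j)
        for (int k = 0; k < b_r; ++k)
        {
            /* B[k + j*b_l1] is invariant in the i loop: load it once. */
            __m128d b = _mm_set1_pd(B[k + j*b_l1]);
            for (int i = 0; i < b_r; i += 2)
            {
                /* unit-stride, aligned loads of two C and two A entries */
                __m128d c = _mm_load_pd(&C[i + j*b_l1]);
                __m128d a = _mm_load_pd(&A[i + k*b_l1]);
                c = _mm_add_pd(c, _mm_mul_pd(a, b));
                _mm_store_pd(&C[i + j*b_l1], c);
            }
        }
}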

Page 21

Psychoanalyzing the Compiler

Flags for Intel (icc):
-O3 -fno-alias -funroll-loops
-std=c99 -D_XOPEN_SOURCE=500
-opt-streaming-stores auto -static
-fast -xHost

Flags for GCC:
-O3 -funroll-loops -march=native
-std=c99 -D_XOPEN_SOURCE=500
-ftree-vectorizer-verbose=2
-ffast-math

GCC 4.3 is sometimes better than GCC 4.4.

Self-study material:

• Compiler references: Intel, GNU

• C99 restrict keyword, aliasing
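A tiny illustration of why aliasing matters and what restrict promises (my example, not from the slides):

/* Without restrict the compiler must assume out may alias a or b, so it
 * cannot keep values in registers or vectorize freely; with restrict the
 * programmer promises the three arrays do not overlap. */
void vec_add(int n, const double *restrict a, const double *restrict b,
             double *restrict out)
{
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}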

Page 22

Profiling

OProfile: a sampling profiler. Uses performance counters. Linux-only, needs root.

Many event types countable:

CPU_CLK_UNHALTED: clock cycles when not halted
L2_RQSTS: number of L2 cache requests
LLC_MISSES: L2 cache demand requests from this core that missed the L2
FLOPS: number of FP computational micro-ops executed
IDLE_DURING_DIV: cycles the divider is busy and all other execution units are idle
L1D_ALL_REF: all references to the L1 data cache
L1D_PEND_MISS: total number of outstanding L1 data cache misses at any cycle
IFU_MEM_STALL: cycles the instruction fetch pipe is stalled
INST_RETIRED: number of instructions retired
UOPS_RETIRED: number of µops retired
MACHINE_NUKES_SMC: number of pipeline flushing events
RAT_STALLS: partial register stall cycles
BR_INST_DECODED: number of branch instructions decoded

[Annotated disassembly of an inner loop: per-instruction sample counts and percentages for the FLOPS and L1D_PEND_MISS events, alongside the movsd / mulsd / addsd instruction sequence.]


Page 27

Solution Performance

[Plot: MFlops/s (0 to 9000) vs. matrix dimension N (0 to 800) for the basic, tuned, and BLAS versions.]

git clone

ssh://[email protected]:2234/hw1-solution.git

(Private, works if you signed up for an account.)

Great, but: most BLAS implementations lose out to simple triple loops for special-case matrices.

Want to see the code of a “real” BLAS? GotoBLAS2


Page 29

Key Messages of HW1

In HPC:

• Very simple things quickly become rather complex.

• Need: ideas, careful analysis.

• Flexibility ↔ performance

• Run-time code generation can be useful.

This class helps by introducing

• known tricks

• helpful tools.

Matmul is a “microcosm” of single-processor optimization.

Do not worry if you did not figure out the tricks here on your own.


Page 31

Questions?


Page 32

Outline

Discuss HW1

Intro to GPU Computing


Page 33

GPUs: System Context

[Figure: mainboard with processor, memory, and expansion slots: PCI Express (x4, x16, x1, x16) and regular PCI. The GPU goes into a PCIe slot. PCIe v2 x16 bandwidth: ~6 GB/s.]


Page 38

GPU Computing?

• Design target for CPUs:
  • Make a single thread very fast
  • Take control away from the programmer

• GPU computing takes a different approach:
  • Throughput matters; single threads do not
  • Give explicit control to the programmer

Page 39

“CPU-style” Cores

[Figure, from SIGGRAPH 2009 “Beyond Programmable Shading” (http://s09.idav.ucdavis.edu/): a CPU-style core: Fetch/Decode, ALU (Execute), and execution context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a big data cache.]

Credit: Kayvon Fatahalian (Stanford)


Page 40

Slimming down

[Figure: the slimmed-down core keeps only Fetch/Decode, ALU (Execute), and the execution context.]

Idea #1: Remove components that help a single instruction stream run fast.

Credit: Kayvon Fatahalian (Stanford)


Page 41

More Space: Double the Number of Cores

Two cores (two fragments in parallel): each core has its own Fetch/Decode, ALU (Execute), and execution context, and each runs the same fragment shader on its own fragment:

<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)

Credit: Kayvon Fatahalian (Stanford)


Page 42

...again

Four cores (four fragments in parallel), each with its own Fetch/Decode, ALU (Execute), and execution context.

Credit: Kayvon Fatahalian (Stanford)


Page 43

...and again

Sixteen cores (sixteen fragments in parallel): 16 cores = 16 simultaneous instruction streams.

Credit: Kayvon Fatahalian (Stanford)

→ 16 independent instruction streams

Reality: the instruction streams are not actually very different/independent.



Page 47

Saving Yet More Space

[Figure: add ALUs: one Fetch/Decode unit now drives ALU 1 through ALU 8 (SIMD processing), with eight per-fragment contexts plus shared context data, instead of one ALU and one execution context.]

Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs.

→ SIMD

Credit: Kayvon Fatahalian (Stanford)



Page 49

Gratuitous Amounts of Parallelism!

[Figure: 128 fragments in parallel: 16 cores × 8 ALUs = 128 ALUs, 16 simultaneous instruction streams.]

Credit: Kayvon Fatahalian (Stanford)

Example: 128 instruction streams in parallel, as 16 independent groups of 8 synchronized streams.

Great if everybody in a group does the same thing.

But what if not? What leads to divergent instruction streams?



Page 54

Branches

But what about branches?

[Figure: time (in clocks) on ALU 1 through ALU 8 of one core, running

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>

with per-ALU condition mask T T T F F F F F: each side of the branch is issued to all eight ALUs, and the ALUs whose mask does not match sit idle.]

Not all ALUs do useful work! Worst case: 1/8 performance.

Credit: Kayvon Fatahalian (Stanford)
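Connecting this to the device-side language introduced at the end of the lecture: in an OpenCL-style kernel (a hypothetical example of mine, mirroring the shader above), a branch on per-work-item data makes work-items in the same SIMD group take different paths, and the hardware issues both sides with the inactive lanes masked off:

/* Hypothetical OpenCL C kernel: if the sign of x[i] differs within one
 * SIMD group, both branch bodies are executed, each time with the
 * non-participating work-items masked off. */
__kernel void reflectance(__global const float *x, __global float *refl,
                          float exponent, float Ks, float Ka)
{
    int i = get_global_id(0);
    if (x[i] > 0)
    {
        float y = pow(x[i], exponent) * Ks;
        refl[i] = y + Ka;
    }
    else
        refl[i] = Ka;
}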



Page 56

Remaining Problem: Slow Memory

Problem: memory still has very high latency... but we've removed most of the hardware that helps us deal with that.

We’ve removed

• caches

• branch prediction

• out-of-order execution

So what now?

Idea #3: Even more parallelism + some extra memory = a solution!


Page 60

Hiding Memory Latency

[Figure sequence, “Hiding shader stalls” (time in clocks): the core holds execution contexts for four groups of fragments (Frag 1-8, 9-16, 17-24, 25-32). When the running group stalls on memory, the core switches to another runnable group and returns once the data has arrived: the run time of any one group increases, but throughput across the many groups is maximized.]

Credit: Kayvon Fatahalian (Stanford)



Page 66

GPU Architecture Summary

Core Ideas:

1. Many slimmed-down cores → lots of parallelism

2. More ALUs, fewer control units

3. Avoid memory stalls by interleaving execution of SIMD groups

Credit: Kayvon Fatahalian (Stanford)


Page 67

GPU-CPU Bird’s Eye Comparison

Floorplan: VIA Isaiah (2008). 65 nm, 4 SP ops at a time, 1 MiB L2.

Floorplan: AMD RV770 (2008). 55 nm, 800 SP ops at a time.

Page 68

Nvidia GTX200

[Figure: the GTX200 as roughly 30 identical cores, each with one Fetch/Decode unit, 8 SP ALUs plus 1 DP ALU, 32 kiB of private context (registers), and 16 kiB of shared context; off-chip memory at 150 GB/s.]

Page 69

GPU Architecture (e.g. Nvidia GT200)

• 1 GPU = 30 SIMD cores

• 1 SIMD core: 32×32 program counters, HW scheduler + 1 instruction decoder (at 1/4 clock) + 8 SP ALUs + 1 DP ALU + 16 KiB shared memory + 32 KiB registers

• Device ↔ RAM: 140 GB/s

• Device ↔ Host: 6 GB/s

• User manages memory hierarchy


Page 70

What is OpenCL?

“OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors.” [OpenCL 1.1 spec]

• Device-neutral (Nvidia GPU, AMD GPU, Intel/AMD CPU)

• Vendor-neutral

• Comes with RTCG (run-time code generation)

Defines:

• Host-side programming interface (library)

• Device-side programming language (!)

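To make those two halves concrete, here is a minimal host-plus-kernel sketch of my own (OpenCL 1.1 C API, assuming one GPU device is present; error checking is mostly omitted, and this is not part of the lecture materials):

/* Minimal OpenCL host program: build a kernel from source at run time
 * (the "RTCG" part), run it on 256 floats, and read the result back. */
#include <stdio.h>
#include <CL/cl.h>

static const char *src =
    "__kernel void twice(__global float *a) {"
    "  int i = get_global_id(0);"
    "  a[i] = 2*a[i];"
    "}";

int main(void)
{
    float data[256];
    for (int i = 0; i < 256; ++i) data[i] = i;

    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(data), data, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL); /* device-side compile */
    cl_kernel kernel = clCreateKernel(prog, "twice", &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    size_t gsize = 256;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

    printf("data[3] = %g\n", data[3]);  /* expect 6 */
    return 0;
}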

Page 71

Questions?


Page 72

Image Credits

• Blocks: sxc.hu/Avolore
• Flag: sxc.hu/Ambrozjo

• Mainboard: Wikimedia Commons

• PCI Express slots: Wikimedia Commons

• Fighting chips: flickr.com/oskay
• Isaiah die shot: VIA Technologies
• RV770 die shot: AMD Corp.
• Nvidia Tesla Architecture: Nvidia Corp.