GPU Programming: A taxonomy of hardware parallelism
Christian Lessig, 2017

Page 1

© Christian Lessig, 2017

GPU Programming: A taxonomy of hardware parallelism
Christian Lessig

Page 2


“[Serial] algorithms have improved faster than clock over the last 15 years. [Parallel] computers are unlikely to be able to take advantage of these advances because they require new programs and new algorithms.”

Gordon Bell (1992). G. Bell, “Massively parallel computers: why not parallel computers for the masses?,” in The Fourth Symposium on the Frontiers of Massively Parallel Computation, 1992, pp. 292–297.

Parallel programming

Page 3


Even worse:

You have to know your architecture!

Page 4


Instruction level parallelism


As a second problem, increasing the clock frequency on-chip requires either an increase of off-chip memory bandwidth to provide data fast enough to not stall the workload running through the processor or an increase of the amount of caching in the system.

If we are unable to continue increasing the frequency with the goal of obtaining higher performance, we require other solutions. The heart of any of these solutions is to increase the number of operations performed in a given clock cycle.

2.2.2 SUPERSCALAR EXECUTION

Superscalar and, by extension, out-of-order execution is one solution that has been included on CPUs for a long time; it has been included on x86 designs since the beginning of the Pentium era. In these designs, the CPU maintains dependence information between instructions in the instruction stream and schedules work onto unused functional units when possible. An example of this is shown in Figure 2.1.

FIGURE 2.1

Out-of-order execution of an instruction stream of simple assembly-like instructions. Note that in this syntax, the destination register is listed first. For example, add a,b,c is a = b+c.

Out-of-order execution: the processor extracts parallelism from the instruction stream.
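Using the destination-first syntax described for Figure 2.1, a small hypothetical instruction stream (not the one in the figure) illustrates what the hardware scheduler can exploit:

```
add a, b, c   ; a = b + c
add d, e, f   ; independent of the first add: may issue in the same cycle
mul g, a, d   ; reads a and d, so it must wait for both adds to complete
```

The first two instructions touch disjoint registers, so the processor can dispatch them to two functional units at once; only the dependent multiply is serialized.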

D. R. Kaeli, P. Mistry, D. Schaa, and D. P. Zhang, Heterogeneous Computing with OpenCL 2.0, Ch. 2.


Page 6


Instruction level parallelism

Very long instruction word (VLIW) processors: the compiler extracts parallelism from the program code.


FIGURE 2.2

VLIW execution based on the out-of-order diagram in Figure 2.1.

In the example in Figure 2.2, we see that the instruction schedule has gaps: the first two VLIW packets are missing a third entry, and the third VLIW packet is missing its first and second entries. Obviously, the example is very simple, with few instructions to pack, but it is a common problem with VLIW architectures that efficiency can be lost owing to the compiler’s inability to fully fill packets. This may be due to limitations in the compiler or may be due simply to an inherent lack of parallelism in the instruction stream. In the latter case, the situation will be no worse than for out-of-order execution but will be more efficient as the scheduling hardware is reduced in complexity. The former case would end up as a trade-off between efficiency losses from unfilled execution slots and gains from reduced hardware control overhead. In addition, there is an extra cost in compiler development to take into account when performing a cost-benefit analysis for VLIW execution over hardware-scheduled superscalar execution.

VLIW designs commonly appear in digital signal processor chips. High-end consumer devices currently include the Intel Itanium line of CPUs (known as explicitly parallel instruction computing, EPIC) and AMD’s HD6000 series GPUs.
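The packet gaps described for Figure 2.2 can be sketched with placeholder instructions (op1..op5 are hypothetical; a nop marks a slot the compiler could not fill):

```
{ op1 | op2 | nop }   ; packet 1: third slot unfilled
{ op3 | op4 | nop }   ; packet 2: third slot unfilled
{ nop | nop | op5 }   ; packet 3: only the third slot is filled
```

Every nop is a functional unit idling for a cycle, which is exactly the efficiency loss the trade-off above weighs against the simpler scheduling hardware.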

D. R. Kaeli, P. Mistry, D. Schaa, and D. P. Zhang, Heterogeneous Computing with OpenCL 2.0, Ch. 2.

Page 7


Flynn’s classification

[Figure: Flynn’s taxonomy, organized by instruction stream × data]

https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
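The two axes of the figure span four classes. As a quick reference, a sketch using the standard Flynn terminology (SISD/SIMD/MISD/MIMD, which is general knowledge rather than text from the slide; the helper function is illustrative):

```python
# Flynn's taxonomy: architectures classified by how many instruction
# streams and how many data streams they execute concurrently.
FLYNN = {
    ("single",   "single"):   "SISD",  # classic scalar uniprocessor
    ("single",   "multiple"): "SIMD",  # vector units, GPU lanes
    ("multiple", "single"):   "MISD",  # rare; e.g. redundant pipelines
    ("multiple", "multiple"): "MIMD",  # multi-core CPUs, clusters
}

def classify(instruction_streams, data_streams):
    """Map stream counts (each >= 1) to a Flynn class."""
    key = ("single" if instruction_streams == 1 else "multiple",
           "single" if data_streams == 1 else "multiple")
    return FLYNN[key]
```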


Page 12


Flynn’s classification

[Figure: existing parallel architectures placed in Flynn’s taxonomy]

https://en.wikipedia.org/wiki/Flynn%27s_taxonomy

Page 13


Distributed memory

◦ Scales to an arbitrary number of processors

◦ Communication via message passing

› e.g. MPI

› High latency and limited bandwidth

◦ Used in supercomputers

https://upload.wikimedia.org/wikipedia/commons/4/40/Beowulf.png
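The message-passing model can be sketched without an MPI runtime: two OS processes with separate address spaces that exchange data only as explicit messages, here over Python’s multiprocessing.Pipe (the function names are illustrative, not MPI’s API):

```python
from multiprocessing import Pipe, Process

def worker(conn):
    # The worker runs in its own address space: it sees only what
    # arrives as a message, never the parent's variables directly.
    values = conn.recv()
    conn.send(sum(values))
    conn.close()

def distributed_sum(values):
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send(values)     # explicit send ...
    result = parent_conn.recv()  # ... and explicit receive
    p.join()
    return result
```

Every byte crosses a channel between the two processes, which is why latency and bandwidth of the interconnect dominate performance in this model.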

Page 14


Shared memory

◦ Limited to less than ≈100 processors

◦ Communication via “shared” memory

› Low latency and high bandwidth

› Access to shared resources needs to be synchronized

https://en.wikipedia.org/wiki/Multi-core_processor
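The last bullet, synchronized access to shared resources, can be sketched with threads incrementing one shared counter under a lock (names here are illustrative):

```python
import threading

def add_many(lock, state, n):
    # Each increment is a read-modify-write on shared memory; it must
    # hold the lock, or concurrent updates from other threads get lost.
    for _ in range(n):
        with lock:
            state["count"] += 1

def run_shared_counter(n_threads=4, n_iters=10_000):
    lock = threading.Lock()
    state = {"count": 0}  # memory visible to every thread
    threads = [threading.Thread(target=add_many, args=(lock, state, n_iters))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]
```

All threads read and write the same memory directly, so communication is cheap, but correctness depends entirely on the synchronization.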


Page 17


Parallel Architectures

Distributed memory:
− high latency, low bandwidth
+ large number of processors

Shared memory:
+ low latency, high bandwidth
− limited number of processors

https://upload.wikimedia.org/wikipedia/commons/4/40/Beowulf.png
https://en.wikipedia.org/wiki/Multi-core_processor

Page 18


Taxonomy

[Figure: architectures arranged by memory (shared vs. distributed) and instruction stream (single vs. multiple), with implicit vs. explicit parallelism; entries: single-core, multi-core, gpu, cluster]


Page 23


Further reading

◦ J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, fourth edition. Morgan Kaufmann, 2007.

◦ http://cva.stanford.edu/classes/cs99s/

◦ http://research.ac.upc.edu/HPCseminar/SEM9900/Pollack1.pdf

◦ http://groups.csail.mit.edu/cag/raw/documents/Waingold-Computer-1997.pdf

◦ http://cacm.acm.org/magazines/2009/5/24648-spending-moores-dividend/fulltext