
GPU Programming: A taxonomy of hardware parallelism
Christian Lessig, 2017

Parallel programming

"[Serial] algorithms have improved faster than clock over the last 15 years. [Parallel] computers are unlikely to be able to take advantage of these advances because they require new programs and new algorithms."

Gordon Bell (1992)

G. Bell, "Massively parallel computers: why not parallel computers for the masses?," in The Fourth Symposium on the Frontiers of Massively Parallel Computation, 1992, pp. 292–297.


Even worse: you have to know your architecture!

Instruction level parallelism

As a second problem, increasing the clock frequency on-chip requires either an increase of off-chip memory bandwidth to provide data fast enough to not stall the workload running through the processor or an increase of the amount of caching in the system.

If we are unable to continue increasing the frequency with the goal of obtaining higher performance, we require other solutions. The heart of any of these solutions is to increase the number of operations performed in a given clock cycle.

2.2.2 SUPERSCALAR EXECUTION
Superscalar and, by extension, out-of-order execution is one solution that has been included on CPUs for a long time; it has been included on x86 designs since the beginning of the Pentium era. In these designs, the CPU maintains dependence information between instructions in the instruction stream and schedules work onto unused functional units when possible. An example of this is shown in Figure 2.1.

[Figure 2.1: Out-of-order execution of an instruction stream of simple assembly-like instructions. Note that in this syntax, the destination register is listed first. For example, add a,b,c is a = b+c.]

Out-of-order execution: the processor extracts parallelism from the instruction stream.

D. R. Kaeli, P. Mistry, D. Schaa, and D. P. Zhang, Heterogeneous Computing with OpenCL 2.0, Ch. 2.
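To make the dependence tracking concrete, here is a minimal C sketch (an illustration, not taken from the book) of the kind of stream Figure 2.1 describes; the comments give the corresponding destination-first assembly.

```c
/* Hypothetical instruction stream in C; comments show the
 * destination-first assembly notation used in Figure 2.1. */
float ilp_example(float a, float b, float c, float d) {
    /* Two independent additions: a superscalar core can issue
     * both in the same cycle on separate functional units. */
    float e = a + b;  /* add e,a,b */
    float f = c + d;  /* add f,c,d (independent of the line above) */

    /* The multiply depends on both results, so the out-of-order
     * hardware must hold it back until e and f are available. */
    float g = e * f;  /* mul g,e,f */
    return g;
}
```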


Instruction level parallelism

Very long instruction word processors: the compiler extracts parallelism from the program code.

[Figure 2.2: VLIW execution based on the out-of-order diagram in Figure 2.1.]

In the example in Figure 2.2, we see that the instruction schedule has gaps: the first two VLIW packets are missing a third entry, and the third VLIW packet is missing its first and second entries. Obviously, the example is very simple, with few instructions to pack, but it is a common problem with VLIW architectures that efficiency can be lost owing to the compiler's inability to fully fill packets. This may be due to limitations in the compiler or may be due simply to an inherent lack of parallelism in the instruction stream. In the latter case, the situation will be no worse than for out-of-order execution but will be more efficient as the scheduling hardware is reduced in complexity. The former case would end up as a trade-off between efficiency losses from unfilled execution slots and gains from reduced hardware control overhead. In addition, there is an extra cost in compiler development to take into account when performing a cost-benefit analysis for VLIW execution over hardware-scheduled superscalar execution.

VLIW designs commonly appear in digital signal processor chips. High-end consumer devices currently include the Intel Itanium line of CPUs (known as explicitly parallel instruction computing, EPIC) and AMD's HD6000 series GPUs.

D. R. Kaeli, P. Mistry, D. Schaa, and D. P. Zhang, Heterogeneous Computing with OpenCL 2.0, Ch. 2.
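For comparison, here is a hypothetical schedule (again an illustration, not from the book) of the same stream on a 3-slot VLIW machine; the compiler, not the hardware, places the operations, and the unfilled slots are exactly the gaps discussed above.

```c
/* Same computation as before; comments sketch how a compiler
 * might pack it into 3-slot VLIW packets at build time. */
float vliw_example(float a, float b, float c, float d) {
    float e = a + b;  /* packet 1: { add e,a,b | add f,c,d | nop } */
    float f = c + d;  /*           third slot stays empty          */
    float g = e * f;  /* packet 2: { mul g,e,f | nop | nop }       */
    return g;         /* both adds must complete before packet 2   */
}
```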

Flynn's classification

[Diagram: architectures classified by the number of concurrent instruction streams and data streams:

                          single data    multiple data
  single instruction      SISD           SIMD
  multiple instruction    MISD           MIMD

Existing parallel architectures are marked within the diagram.]

https://en.wikipedia.org/wiki/Flynn%27s_taxonomy

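A short C sketch of the SISD/SIMD distinction (an added example; the deck itself only shows the diagram): the same vector addition written as one-element-at-a-time scalar code and as x86 SSE intrinsics, where one instruction operates on four floats at once.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* SISD: one instruction stream, one data element per operation. */
void add_sisd(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* SIMD: one instruction stream, four data elements per operation.
 * Assumes n is a multiple of 4 to keep the sketch short. */
void add_simd(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
}
```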


Distributed memory

◦ Scales to arbitrary number of processors

◦ Communication via message passing

› e.g. MPI (see the sketch after this list)

› High latency and limited bandwidth

◦ Used in supercomputers

https://upload.wikimedia.org/wikipedia/commons/4/40/Beowulf.png
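A minimal MPI sketch of message passing between two processes (an added example, not part of the slides); every exchange is an explicit send/receive pair that crosses the interconnect, which is where the latency and bandwidth costs come from.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int payload = 42;
        /* Explicit send: the data is copied over the interconnect. */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}
```

Built with mpicc and launched with, e.g., mpirun -np 2 ./a.out.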


Shared memory

◦ Limited to less than ≈100 processors

◦ Communication via “shared” memory

› Low latency and high bandwidth

› Access to shared resources needs to be synchronized (see the sketch after this list)

https://en.wikipedia.org/wiki/Multi-core_processor
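The shared-memory counterpart, sketched with POSIX threads (again an added example): all threads access the same counter directly, so communication is cheap, but the access must be synchronized with a mutex.

```c
#include <pthread.h>
#include <stdio.h>

/* Shared resource: all threads read and write the same counter. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; ++i) {
        /* Access to the shared counter must be synchronized. */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; ++i) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; ++i) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* 400000 with the mutex */
    return 0;
}
```

Without the mutex the increments race and the final count is undefined; compile with -pthread.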

Parallel Architectures

  Distributed memory                Shared memory
  − high latency, low bandwidth     + low latency, high bandwidth
  + large number of processors      − limited number of processors

https://upload.wikimedia.org/wikipedia/commons/4/40/Beowulf.png
https://en.wikipedia.org/wiki/Multi-core_processor

Taxonomy

[Diagram: parallel hardware classified along two axes, memory (shared vs. distributed) and instruction stream (single vs. multiple, with the parallelism extracted implicitly or explicitly); single-core, multi-core, cluster, and GPU are placed within this space.]
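One way to read the implicit/explicit axis (an added gloss): in plain serial code any concurrency is extracted implicitly by the hardware and compiler, as in the superscalar and VLIW examples above, while on a multi-core machine the programmer states the parallelism explicitly, e.g. with OpenMP.

```c
/* Explicit parallelism: the programmer marks the loop as parallel
 * and the runtime distributes iterations across cores.
 * Compile with -fopenmp. */
void scale_explicit(float *x, int n, float s) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        x[i] *= s;
}

/* Implicit parallelism: the same loop written serially; any
 * concurrency (ILP, vectorization) is found by the hardware
 * and compiler without the programmer's involvement. */
void scale_implicit(float *x, int n, float s) {
    for (int i = 0; i < n; ++i)
        x[i] *= s;
}
```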


Further reading

◦ J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach, fourth edition. Morgan Kaufmann, 2007.

◦ http://cva.stanford.edu/classes/cs99s/

◦ http://research.ac.upc.edu/HPCseminar/SEM9900/Pollack1.pdf

◦ http://groups.csail.mit.edu/cag/raw/documents/Waingold-Computer-1997.pdf

◦ http://cacm.acm.org/magazines/2009/5/24648-spending-moores-dividend/fulltext
