© Christian Lessig, 2017
GPU Programming: A taxonomy of hardware parallelism
Christian Lessig
“[Serial] algorithms have improved faster than clock over the last 15 years. [Parallel] computers are unlikely to be able to take advantage of these advances because they require new programs and new algorithms.”

Gordon Bell (1992)
G. Bell, “Massively parallel computers: why not parallel computers for the masses?,” in The Fourth Symposium on the Frontiers of Massively Parallel Computation, 1992, pp. 292–297.
Parallel programming
Even worse: you have to know your architecture!
Instruction level parallelism
As a second problem, increasing the clock frequency on-chip requires either an increase of off-chip memory bandwidth to provide data fast enough to not stall the workload running through the processor or an increase of the amount of caching in the system.

If we are unable to continue increasing the frequency with the goal of obtaining higher performance, we require other solutions. The heart of any of these solutions is to increase the number of operations performed in a given clock cycle.
2.2.2 SUPERSCALAR EXECUTION

Superscalar and, by extension, out-of-order execution is one solution that has been included on CPUs for a long time; it has been included on x86 designs since the beginning of the Pentium era. In these designs, the CPU maintains dependence information between instructions in the instruction stream and schedules work onto unused functional units when possible. An example of this is shown in Figure 2.1.
FIGURE 2.1
Out-of-order execution of an instruction stream of simple assembly-like instructions. Note that in this syntax, the destination register is listed first. For example, add a,b,c is a = b + c.
Out-of-order execution: processor extracts parallelism from instruction stream
D. R. Kaeli, P. Mistry, D. Schaa, and D. P. Zhang, Heterogeneous computing with OpenCL 2.0, Ch. 2.
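The idea that the processor extracts parallelism from a serial instruction stream can be sketched in a few lines. The mini instruction format below uses the dest-first syntax from Figure 2.1; the program and the single-cycle latency are assumptions for illustration, and only read-after-write dependences are tracked (register renaming is assumed to remove WAR/WAW hazards):

```python
def schedule(instrs):
    """Assign each instruction the earliest cycle in which all of its
    source registers are ready. ("add", "a", "b", "c") means a = b + c."""
    ready = {}            # register -> cycle in which its value is available
    cycles = []
    for op, dst, *srcs in instrs:
        cycle = max([ready.get(s, 0) for s in srcs], default=0)
        cycles.append(cycle)
        ready[dst] = cycle + 1   # single-cycle latency assumed
    return cycles

prog = [
    ("add", "a", "b", "c"),  # a = b + c
    ("mul", "d", "e", "f"),  # independent of the add: same cycle
    ("sub", "g", "a", "d"),  # needs a and d: must wait one cycle
]
print(schedule(prog))  # -> [0, 0, 1]
```

Instructions mapped to the same cycle are exactly those a superscalar core could issue to different functional units simultaneously.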
Instruction level parallelism
Very long instruction word processors: compiler extracts parallelism from program code
FIGURE 2.2
VLIW execution based on the out-of-order diagram in Figure 2.1.
In the example in Figure 2.2, we see that the instruction schedule has gaps: the first two VLIW packets are missing a third entry, and the third VLIW packet is missing its first and second entries. Obviously, the example is very simple, with few instructions to pack, but it is a common problem with VLIW architectures that efficiency can be lost owing to the compiler’s inability to fully fill packets. This may be due to limitations in the compiler or may be due simply to an inherent lack of parallelism in the instruction stream. In the latter case, the situation will be no worse than for out-of-order execution but will be more efficient as the scheduling hardware is reduced in complexity. The former case would end up as a trade-off between efficiency losses from unfilled execution slots and gains from reduced hardware control overhead. In addition, there is an extra cost in compiler development to take into account when performing a cost-benefit analysis for VLIW execution over hardware-scheduled superscalar execution.

VLIW designs commonly appear in digital signal processor chips. High-end consumer devices currently include the Intel Itanium line of CPUs (known as explicitly parallel instruction computing, EPIC) and AMD’s HD6000 series GPUs.
D. R. Kaeli, P. Mistry, D. Schaa, and D. P. Zhang, Heterogeneous computing with OpenCL 2.0, Ch. 2.
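The packing problem can be made concrete with a sketch of a compiler-side pass that groups independent instructions into fixed-width packets and pads the gaps with no-ops. The three-slot width and the mini ISA (dest-first, as in Figure 2.1) are assumptions for illustration; only read-after-write dependences are checked, with WAR/WAW hazards assumed away:

```python
def pack(instrs, width=3):
    """Greedily fill VLIW packets of `width` slots with independent
    instructions; unfilled slots become ("nop",), i.e. the gaps
    discussed for Figure 2.2."""
    packets, pending = [], [tuple(i) for i in instrs]
    while pending:
        packet, deferred, busy = [], [], set()
        for op, dst, *srcs in pending:
            if len(packet) < width and not (set(srcs) & busy):
                packet.append((op, dst, *srcs))    # issue in this packet
            else:
                deferred.append((op, dst, *srcs))  # try again next packet
            busy.add(dst)  # later instructions may not read dst this packet
        packets.append(packet + [("nop",)] * (width - len(packet)))
        pending = deferred
    return packets

prog = [
    ("add", "a", "b", "c"),
    ("mul", "d", "e", "f"),
    ("sub", "g", "a", "d"),  # depends on both results above
]
for p in pack(prog):
    print(p)  # two packets: [add, mul, nop] and [sub, nop, nop]
```

The no-op slots are the static equivalent of the execution-unit idle cycles that a superscalar core would suffer at run time; here the waste is baked into the binary by the compiler.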
https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
Flynn’s classification
[figure: Flynn’s taxonomy, classified along two axes: instruction stream and data]
Flynn’s classification: existing parallel architectures
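Flynn’s classes can be made concrete with a toy interpreter for the class GPUs and vector units fall into, SIMD: a single instruction stream whose every operation updates all data lanes at once. The four-lane width and the mini ISA below are invented for illustration (SISD would process one element per instruction; MIMD would run several independent streams):

```python
def run_simd(program, regs):
    """Execute one instruction stream (SI), where each operation is
    applied to every lane of a vector register (MD)."""
    for op, dst, a, b in program:          # a single stream of instructions
        va, vb = regs[a], regs[b]
        if op == "add":
            regs[dst] = [x + y for x, y in zip(va, vb)]  # all lanes at once
        elif op == "mul":
            regs[dst] = [x * y for x, y in zip(va, vb)]
        else:
            raise ValueError(f"unknown op {op!r}")
    return regs

regs = {"v0": [1, 2, 3, 4], "v1": [10, 20, 30, 40]}
run_simd([("add", "v2", "v0", "v1"),    # v2 = v0 + v1, lane-wise
          ("mul", "v3", "v2", "v2")],   # v3 = v2 * v2, lane-wise
         regs)
print(regs["v3"])  # -> [121, 484, 1089, 1936]
```

One decoded instruction does four lanes’ worth of work, which is why SIMD hardware spends so little silicon on control relative to arithmetic.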
Distributed memory
◦ Scales to arbitrary number of processors
◦ Communication via message passing
› e.g. MPI
› High latency and limited bandwidth
◦ Used in supercomputers
https://upload.wikimedia.org/wikipedia/commons/4/40/Beowulf.png
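The message-passing style can be sketched without MPI itself, using Python’s multiprocessing module: the two processes share no memory, so both the input data and the result must travel as explicit messages (the worker and its summing task are invented for illustration):

```python
from multiprocessing import Pipe, Process

def worker(conn):
    data = conn.recv()        # explicit receive: the worker has no other
    conn.send(sum(data))      # way to see the parent's data
    conn.close()

if __name__ == "__main__":
    parent, child = Pipe()
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send([1, 2, 3, 4])   # explicit send across address spaces
    print(parent.recv())        # -> 10
    p.join()
```

Every transfer crosses a process boundary, and on a cluster a network link as well, which is where the high latency and limited bandwidth of the distributed model come from.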
Shared memory
◦ Limited to less than ≈100 processors
◦ Communication via “shared” memory
› Low latency and high bandwidth
› Access to shared resources needs to be synchronized
https://en.wikipedia.org/wiki/Multi-core_processor
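In the shared-memory model, communication is implicit: threads simply read and write the same variables. The last point above is the catch, since unsynchronized read-modify-write sequences corrupt shared state. A minimal sketch with Python threads (the counter workload is invented for illustration):

```python
import threading

counter = 0
lock = threading.Lock()

def work(n):
    global counter
    for _ in range(n):
        with lock:           # without the lock, concurrent read-modify-write
            counter += 1     # sequences can lose increments

threads = [threading.Thread(target=work, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # -> 40000
```

The lock serializes the critical section; with it removed, the final count can fall short of 40000 because two threads may read the same old value before either writes back.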
Parallel Architectures

◦ Distributed memory
› − high latency, low bandwidth
› + large number of processors

◦ Shared memory
› + low latency, high bandwidth
› − limited number of processors

https://upload.wikimedia.org/wikipedia/commons/4/40/Beowulf.png
https://en.wikipedia.org/wiki/Multi-core_processor
Taxonomy

[figure: hardware taxonomy over two axes, memory (shared / distributed) and instruction stream (single / multiple); entries: single-core, multi-core, GPU, cluster; parallelism ranging from implicit to explicit]
Further reading
◦ J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, fourth edition. Morgan Kaufmann, 2007.
◦ http://cva.stanford.edu/classes/cs99s/
◦ http://research.ac.upc.edu/HPCseminar/SEM9900/Pollack1.pdf
◦ http://groups.csail.mit.edu/cag/raw/documents/Waingold-Computer-1997.pdf
◦ http://cacm.acm.org/magazines/2009/5/24648-spending-moores-dividend/fulltext