TRANSCRIPT
Computer Architecture: Single CPU
What Does a CPU Look Like?
What Does it Mean?
What is in a Core?
Von Neumann Architecture
• Instruction decode: determine operation and operands
• Get operands from memory
• Perform operation
• Write results back
• Continue with next instruction

Undivided memory that stores both program and data ('stored program') + processing unit that executes the instructions, operating on the data
Contemporary Architecture
• Multiple operations simultaneously "in flight"
• Operands can be in memory, cache, register
• Results may need to be coordinated with other processing elements
• Operations can be performed speculatively
Scientific Computing
• Some algorithms are "CPU-bound" – the speed of the processor is the most important factor
• Some algorithms are "memory-bound" – bus speed, cache size become important

"Memory-bound" becomes ever more prominent... A simple "GHz" comparison does not tell the whole story!
Modern Floating Point Units
• Traditionally: one instruction at a time
• Modern CPUs: multiple floating point units, for instance 1 Mul + 1 Add, or 1 FMA ("fused multiply-add")
• Peak performance is several ops/clock cycle (currently up to 4); usually very hard to obtain
• Other operations not as optimized: a division requires 10 to 20 clock cycles
• Instruction decode: the processor inspects the instruction to determine the operation and the operands.
• Memory fetch: if necessary, data is brought from memory into a register.
• Execution: the operation is executed, reading data from registers and writing it back to a register.
• Write-back: for store operations, the register contents are written back to memory.
Complicating this story, contemporary CPUs operate on several instructions simultaneously, which are said to be 'in flight', meaning that they are in various stages of completion. This is the basic idea of the superscalar CPU architecture, and is also referred to as Instruction Level Parallelism (ILP). Thus, while each instruction can take several clock cycles to complete, a processor can complete one instruction per cycle in favourable circumstances; in some cases more than one instruction can be finished per cycle.
The main statistic that is quoted about CPUs is their Gigahertz rating, implying that the speed of the processor is the main determining factor of a computer's performance. While speed obviously correlates with performance, the story is more complicated. Some algorithms are cpu-bound, and the speed of the processor is indeed the most important factor; other algorithms are memory-bound, and aspects such as bus speed and cache size, to be discussed later, become important.
In scientific computing, this second category is in fact quite prominent, so in this chapter we will devote plenty of attention to the process that moves data from memory to the processor, and we will devote relatively little attention to the actual processor.
1.2 Modern floating point units
Many modern processors are capable of doing multiple operations simultaneously, and this holds in particular for the arithmetic part. For instance, often there are separate addition and multiplication units; if the compiler can find addition and multiplication operations that are independent, it can schedule them so as to be executed simultaneously, thereby doubling the performance of the processor. In some cases, a processor will have multiple addition or multiplication units.
Another way to increase performance is to have a 'fused multiply-add' unit, which can execute the instruction x ← ax + b in the same amount of time as a separate addition or multiplication. Together with pipelining (see below), this means that a processor has an asymptotic speed of several floating point operations per clock cycle.
Processor                         floating point units   max operations per cycle
Intel Pentium4                    2 add or 2 mul         2
Intel Woodcrest, AMD Barcelona    2 add + 2 mul          4
IBM POWER4, POWER5, POWER6        2 FMA                  4
IBM BG/L, BG/P                    1 SIMD FMA             4
SPARC IV                          1 add + 1 mul          2
Itanium2                          2 FMA                  4
Table 1.1: Floating point capabilities of several current processor architectures
Victor Eijkhout 9
Pipelining
• A single instruction takes several clock cycles to complete
• Subdivide an instruction:
  – Instruction decode
  – Operand exponent align
  – Actual operation
  – Normalize
• Pipeline: separate piece of hardware for each subdivision
• Compare to assembly line
Pipelining
• Decoding the instruction, including finding the locations of the operands
• Copying the operands into registers ('data fetch')
• Aligning the exponents
• Executing the addition of the mantissas
• Normalizing the result
• Storing the result
Example: addition of 2 fp numbers has the following subdivision ("components", "stages", "segments")
1. Sequential Computing
Incidentally, there are few algorithms in which division operations are a limiting factor. Correspondingly, the division operation is not nearly as much optimized in a modern CPU as the additions and multiplications are. Division operations can take 10 or 20 clock cycles, while a CPU can have multiple addition and/or multiplication units that (asymptotically) can produce a result per cycle.
1.2.1 Pipelining
The floating point add and multiply units of a processor are pipelined, which has the effect that a stream of independent operations can be performed at an asymptotic speed of one result per clock cycle.
The idea behind a pipeline is as follows. Assume that an operation consists of multiple simpler operations, and that for each suboperation there is separate hardware in the processor. For instance, an addition instruction can have the following components:
• Decoding the instruction, including finding the locations of the operands.
• Copying the operands into registers ('data fetch').
• Aligning the exponents; the addition .35×10^-1 + .6×10^-2 becomes .35×10^-1 + .06×10^-1.
• Executing the addition of the mantissas, in this case giving .41.
• Normalizing the result, in this example to .41×10^-1. (Normalization in this example does not do anything. Check for yourself that in .3×10^0 + .8×10^0 and .35×10^-3 + (−.34)×10^-3 there is a non-trivial adjustment.)
• Storing the result.
These parts are often called the ‘stages’ or ‘segments’ of the pipeline.
If every component is designed to finish in 1 clock cycle, the whole instruction takes 6 cycles. However, if each has its own hardware, we can execute two operations in less than 12 cycles:
• Execute the decode stage for the first operation;
• Do the data fetch for the first operation, and at the same time the decode for the second.
• Execute the third stage for the first operation and the second stage of the second operation simultaneously.
• Et cetera.
You see that the first operation still takes 6 clock cycles, but the second one is finished a mere 1 cycle later. This idea can be extended to more than two operations: the first operation still takes the same amount of time as before, but after that one more result will be produced each cycle. Formally, executing n operations on an s-segment pipeline takes s + n − 1 cycles, as opposed to ns in the classical case.

Exercise 1.1. Let us compare the speed of a classical floating point unit, and a pipelined one. If the pipeline has s stages, what is the asymptotic speedup? That is, with T0(n) the time for n operations on a classical CPU, and Ts(n) the time for n operations on an s-segment pipeline, what is lim_{n→∞} (T0(n)/Ts(n))? Next you can wonder how long it takes to get close to the asymptotic behaviour. Define Ss(n) as the speedup achieved on n operations. The quantity n_{1/2} is defined as the value of n such that Ss(n) is half the asymptotic speedup. Give an expression for n_{1/2}.
Since a vector processor works on a number of instructions simultaneously, these instructions have to be independent. The operation ∀i : a_i ← b_i + c_i has independent additions; the operation ∀i : a_{i+1}
10 Introduction to High Performance Scientific Computing
Pipelining
Every component designed to finish in 1 clock cycle: the whole instruction takes 6 cycles. If each has its own hardware, one can execute two operations in less than 12 cycles:
• Execute the decode stage for the first operation;
• Do the data fetch for the first operation, and at the same time the decode for the second.
• Execute the third stage for the first operation and the second stage of the second operation simultaneously.
• ...
Pipelining
Analysis:
• First addition takes 6 clock cycles
• Second addition finishes a mere 1 cycle later

This idea can be extended to more than two operations: the first operation still takes the same amount of time as before, but after that one more result will be produced each cycle.

Executing n operations on an s-segment pipeline takes (s + n − 1) cycles, as opposed to (ns) in the classical case.

This requires independent operations... One solution: multiple pipes
Pipelining
With pipelining, peak CPU performance =
(clock speed) × (number of independent floating point units)

The measure of floating point performance is 'floating point operations per second', abbreviated "flops".

'gigaflops' = multiples of 10^9 flops
(You may wonder why we are mentioning some fairly old computers here: true pipeline supercomputers hardly exist anymore. In the US, the Cray X1 was the last of that line, and in Japan only NEC still makes them. However, the functional units of a CPU these days are pipelined, so the notion is still important.)

Exercise 1.4. The operation
for (i) {
  x[i+1] = a[i]*x[i] + b[i];
}
can not be handled by a pipeline because there is a dependency between the input of one iteration of the operation and the output of the previous. However, you can transform the loop into one that is mathematically equivalent, and potentially more efficient to compute. Derive an expression that computes x[i+2] from x[i] without involving x[i+1]. This is known as recursive doubling. Assume you have plenty of temporary storage. You can now perform the calculation by
• doing some preliminary calculations;
• computing x[i], x[i+2], x[i+4], ..., and from these,
• computing the missing terms x[i+1], x[i+3], ....
Analyze the efficiency of this scheme by giving formulas for T0(n) and Ts(n). Can you think of an argument why the preliminary calculations may be of lesser importance in some circumstances?
1.2.2 Peak performance
Thanks to pipelining, for modern CPUs there is a simple relation between the clock speed and the peak performance. Since each floating point unit can produce one result per cycle asymptotically, the peak performance is the clock speed times the number of independent floating point units. The measure of floating point performance is 'floating point operations per second', abbreviated flops. Considering the speed of computers these days, you will mostly hear floating point performance being expressed in 'gigaflops': multiples of 10^9 flops.
1.2.3 Pipelining beyond arithmetic: instruction-level parallelism
In fact, nowadays, the whole CPU is pipelined. Not only floating point operations, but any sort of instruction will be put in the instruction pipeline as soon as possible. Note that this pipeline is no longer limited to identical instructions: the notion of pipeline is now generalized to any stream of partially executed instructions that are simultaneously "in flight".
This concept is also known as Instruction Level Parallelism (ILP), and it is facilitated by various mechanisms:
• multiple-issue: instructions that are independent can be started at the same time;
• pipelining: already mentioned, arithmetic units can deal with multiple operations in various stages of completion;
• branch prediction and speculative execution: a compiler can 'guess' whether a conditional instruction will evaluate to true, and execute those instructions accordingly;
Pipelining Beyond Arithmetic
The whole CPU is pipelined, leading to "Instruction Level Parallelism" (ILP)

Facilitated by:
• multiple issue (independent instructions can be started at the same time)
• branch prediction and speculative execution
• out-of-order execution
• Memory is too slow to keep up with the processor
  – 100-1000 cycles latency before data arrives
  – Data stream maybe 1/4 fp number/cycle; processor wants 2 or 3
  – "Memory wall"
• At considerable cost it's possible to build faster memory
• Cache is a small amount of fast memory
Memory Hierarchies
• Memory is divided into different levels:
  – Registers
  – Caches
  – Main memory
• Memory is accessed through the hierarchy
  – registers where possible
  – ... then the caches
  – ... then main memory
Memory Hierarchies
1.3. Memory Hierarchies
Figure 1.3: Memory hierarchy of an AMD Opteron, characterized by speed and size.
Data needed in some operation gets copied into the various caches on its way to the processor. If, some instructions later, a data item is needed again, it is first searched for in the L1 cache; if it is not found there, it is searched for in the L2 cache; if it is not found there, it is loaded from main memory. Finding data in cache is called a cache hit, and not finding it a cache miss.
Figure 1.3 illustrates the basic facts of caches, in this case for the AMD Opteron chip: the closer caches are to the floating point units, the faster, but also the smaller they are. Some points about this figure.
• Loading data from registers is so fast that it does not constitute a limitation on algorithm execution speed. On the other hand, there are few registers. The Opteron [5] has 16 general purpose registers, 8 media and floating point registers, and 16 SIMD registers.
• The L1 cache is small, but sustains a bandwidth of 32 bytes, that is 4 double precision numbers, per cycle. This is enough to load two operands each for two operations, but note that the Opteron can actually perform 4 operations per cycle. Thus, to achieve peak speed, certain operands need to stay in register. The latency from L1 cache is around 3 cycles.
• The bandwidth from L2 and L3 cache is not documented and hard to measure due to cache policies (see below). Latencies are around 15 cycles for L2 and 50 for L3.
• Main memory access has a latency of more than 100 cycles, and a bandwidth of 4.5 bytes per cycle, which is about 1/7th of the L1 bandwidth. However, this bandwidth is shared by the 4 cores of the Opteron chip, so effectively the bandwidth is a quarter of this number. In a machine like Ranger, which has 4 chips per node, some bandwidth is spent on maintaining cache coherence (see section 1.4), reducing the bandwidth for each chip again by half.
On level 1, there are separate caches for instructions and data; the L2 and L3 cache contain both data and instructions.
You see that the larger caches are increasingly unable to supply data to the processors fast enough. For this
[5] Specifically the server chip used in the Ranger supercomputer; desktop versions may have different specifications.
AMD Opteron
• The two most important terms related to performance for memory subsystems and for networks:
• Latency
  – How long does it take to retrieve a word of memory?
  – Units are generally nanoseconds (milliseconds for network latency) or clock periods (CP)
  – Sometimes addresses are predictable: compiler will schedule the fetch. Predictable code is good!
• Bandwidth
  – What data rate can be sustained once the message is started?
  – Units are B/sec (MB/sec, GB/sec, etc.)
Latency/and/Bandwidth
• The time that a message takes from start to finish combines latency and bandwidth:
  T(n) = α + βn
• α: latency
• β: inverse of bandwidth (the time per byte)
1.3.2 Latency and Bandwidth
Above, we mentioned in very general terms that accessing data in registers is almost instantaneous, whereas loading data from memory into the registers, a necessary step before any operation, incurs a substantial delay. We will now make this story slightly more precise.
There are two important concepts to describe the movement of data: latency and bandwidth. The assumption here is that requesting an item of data incurs an initial delay; if this item was the first in a stream of data, usually a consecutive range of memory addresses, the remainder of the stream will arrive with no further delay at a regular amount per time period.

Latency is the delay between the processor issuing a request for a memory item, and the item actually arriving. We can distinguish between various latencies, such as the transfer from memory to cache, cache to register, or summarize them all into the latency between memory and processor. Latency is measured in (nano)seconds, or clock periods. If a processor executes instructions in the order they are found in the assembly code, then execution will often stall while data is being fetched from memory; this is also called memory stall. For this reason, a low latency is very important. In practice, many processors have 'out-of-order execution' of instructions, allowing them to perform other operations while waiting for the requested data. Programmers can take this into account, and code in a way that achieves latency hiding. Graphics Processing Units (GPUs) (see section 2.9) can switch very quickly between threads in order to achieve latency hiding.

Bandwidth is the rate at which data arrives at its destination, after the initial latency is overcome. Bandwidth is measured in bytes (kilobytes, megabytes, gigabytes) per second or per clock cycle. The bandwidth between two memory levels is usually the product of the cycle speed of the channel (the bus speed) and the bus width: the number of bits that can be sent simultaneously in every cycle of the bus clock.
The concepts of latency and bandwidth are often combined in a formula for the time that a message takes from start to finish:
T(n) = α + βn
where α is the latency and β is the inverse of the bandwidth: the time per byte.
Typically, the further away from the processor one gets, the longer the latency is, and the lower the bandwidth. These two factors make it important to program in such a way that, if at all possible, the processor uses data from cache or register, rather than from main memory. To illustrate that this is a serious matter, consider a vector addition
for (i)
  a[i] = b[i]+c[i]
Each iteration performs one floating point operation, which modern CPUs can do in one clock cycle by using pipelines. However, each iteration needs two numbers loaded and one written, for a total of 24 bytes [4] of memory traffic. Typical memory bandwidth figures (see for instance figure 1.3) are nowhere near 24
[4] Actually, a[i] is loaded before it can be written, so there are 4 memory accesses, with a total of 32 bytes, per iteration.
Latency and Bandwidth
1.3. Memory Hierarchies
1.3.2 Latency and Bandwidth
Above, we mentioned in very general terms that accessing data in registers is almost instantaneous, whereasloading data from memory into the registers, a necessary step before any operation, incurs a substantialdelay. We will now make this story slightly more precise.
There are two important concepts to describe the movement of data: latency and bandwidth . The assump-tion here is that requesting an item of data incurs an initial delay; if this item was the first in a stream ofdata, usually a consecutive range of memory addresses, the remainder of the stream will arrive with nofurther delay at a regular amount per time period.Latency is the delay between the processor issuing a request for a memory item, and the item actually
arriving. We can distinguish between various latencies, such as the transfer from memory tocache, cache to register, or summarize them all into the latency between memory and processor.Latency is measured in (nano) seconds, or clock periods.If a processor executes instructions in the order they are found in the assembly code, then execu-tion will often stall while data is being fetched from memory; this is also called memory stall .For this reason, a low latency is very important. In practice, many processors have ‘out-of-orderexecution’ of instructions, allowing them to perform other operations while waiting for the re-quested data. Programmers can take this into account, and code in a way that achieves latencyhiding . Graphics Processing Units (GPUs) (see section 2.9) can switch very quickly betweenthreads in order to achieve latency hiding.
Bandwidth is the rate at which data arrives at its destination, after the initial latency is overcome. Band-width is measured in bytes (kilobyes, megabytes, gigabyes) per second or per clock cycle. Thebandwidth between two memory levels is usually the product of the cycle speed of the channel(the bus speed ) and the bus width : the number of bits that can be sent simultaneously in everycycle of the bus clock.
The concepts of latency and bandwidth are often combined in a formula for the time that a message takesfrom start to finish:
T (n) = ↵ + �n
where ↵ is the latency and � is the inverse of the bandwidth: the time per byte.
Typically, the further away from the processor one gets, the longer the latency is, and the lower the band-width. These two factors make it important to program in such a way that, if at all possible, the processoruses data from cache or register, rather than from main memory. To illustrate that this is a serious matter,consider a vector addition
for (i)a[i] = b[i]+c[i]
Each iteration performs one floating point operation, which modern CPUs can do in one clock cycle byusing pipelines. However, each iteration needs two numbers loaded and one written, for a total of 24 bytes4
of memory traffic. Typical memory bandwidth figures (see for instance figure 1.3) are nowhere near 24
4. Actually, a[i] is loaded before it can be written, so there are 4 memory access, with a total of 32 bytes, per iteration.
Victor Eijkhout 15
1.3. Memory Hierarchies
1.3.2 Latency and Bandwidth
Above, we mentioned in very general terms that accessing data in registers is almost instantaneous, whereasloading data from memory into the registers, a necessary step before any operation, incurs a substantialdelay. We will now make this story slightly more precise.
There are two important concepts to describe the movement of data: latency and bandwidth. The assumption here is that requesting an item of data incurs an initial delay; if this item was the first in a stream of data, usually a consecutive range of memory addresses, the remainder of the stream will arrive with no further delay at a regular amount per time period.
Latency is the delay between the processor issuing a request for a memory item, and the item actually arriving. We can distinguish between various latencies, such as the transfer from memory to cache, or cache to register, or summarize them all into the latency between memory and processor. Latency is measured in (nano)seconds, or clock periods. If a processor executes instructions in the order they are found in the assembly code, then execution will often stall while data is being fetched from memory; this is also called memory stall. For this reason, a low latency is very important. In practice, many processors have 'out-of-order execution' of instructions, allowing them to perform other operations while waiting for the requested data. Programmers can take this into account, and code in a way that achieves latency hiding. Graphics Processing Units (GPUs) (see section 2.9) can switch very quickly between threads in order to achieve latency hiding.
Bandwidth is the rate at which data arrives at its destination, after the initial latency is overcome. Bandwidth is measured in bytes (kilobytes, megabytes, gigabytes) per second or per clock cycle. The bandwidth between two memory levels is usually the product of the cycle speed of the channel (the bus speed) and the bus width: the number of bits that can be sent simultaneously in every cycle of the bus clock.
The concepts of latency and bandwidth are often combined in a formula for the time that a message takes from start to finish:

T(n) = α + βn

where α is the latency and β is the inverse of the bandwidth: the time per byte.
Typically, the further away from the processor one gets, the longer the latency is, and the lower the bandwidth. These two factors make it important to program in such a way that, if at all possible, the processor uses data from cache or register, rather than from main memory. To illustrate that this is a serious matter, consider a vector addition
for (i=0; i<n; i++)
  a[i] = b[i]+c[i];
Each iteration performs one floating point operation, which modern CPUs can do in one clock cycle by using pipelines. However, each iteration needs two numbers loaded and one written, for a total of 24 bytes⁴ of memory traffic. Typical memory bandwidth figures (see for instance figure 1.3) are nowhere near 24
4. Actually, a[i] is loaded before it can be written, so there are 4 memory accesses, with a total of 32 bytes, per iteration.
Victor Eijkhout 15
Implications of Latency and Bandwidth: Little's Law
• Memory loads can depend on each other: loading the result of a previous operation
• Two such loads have to be separated by at least the memory latency
• In order not to waste bandwidth, at least latency-many items have to be under way at all times, and they have to be independent
• Multiply by bandwidth: Little's law: Concurrency = Bandwidth x Latency
42
• Finding parallelism is sometimes called 'latency hiding': load data early to hide latency
• GPUs do latency hiding by spawning many threads
• Requires fast context switch
43
PS: Latency Hiding and GPUs
Registers
• Highest bandwidth, lowest latency memory that a modern processor can access; built into the CPU
• Often a scarce resource and not random access
• Processor instructions operate on registers directly
– have assembly language names like: eax, ebx, ecx, etc.
– sample instruction: addl %eax, %edx
• Separate instructions and registers for floating-point operations
44
• Between the CPU registers and main memory
• L1 Cache: data cache closest to registers
• L2 Cache: secondary data cache, stores both data and instructions
– Data from L2 has to go through L1 to registers
– L2 is 10 to 100 times larger than L1
– Some systems have an L3 cache, ~10x larger than L2
• Cache line
45
Data Caches
• The smallest unit of data transferred between main memory and the caches (or between levels of cache; every cache has its own line size)
• N sequentially-stored, multi-byte words (usually N=8 or 16).
• If you request one word on a cache line, you get the whole line
– Make sure to use the other items, you've paid for them in bandwidth
– Sequential access good, 'strided' access ok, random access bad
46
Cache Line
Main Memory
• Cheapest form of RAM
• Also the slowest
– lowest bandwidth
– highest latency
• Unfortunately most of our data lives out here
47
Cache and Register Access
• Access is transparent to the programmer
– data is in a register or in cache or in memory
– loaded from the highest level where it's found
– processor/cache controller/MMU hides cache access from the programmer
• ...but you can influence it:
– Access x (that puts it in L1), access 100k of data, access x again: it will probably be gone from cache
– If you use an element twice, don't wait too long
– If you loop over data, try to take chunks of less than cache size
– In C: declare a register variable, only a suggestion
48
Register Use
• y[i] can be kept in register
• Declaration is only a suggestion to the compiler
• Compiler can usually figure this out itself
for (i=0; i<m; i++) {
  for (j=0; j<n; j++) {
    y[i] = y[i]+a[i][j]*x[j];
  }
}
register double s;
for (i=0; i<m; i++) {
  s = 0.;
  for (j=0; j<n; j++) {
    s = s+a[i][j]*x[j];
  }
  y[i] = s;
}
49
• Cache hit
– location referenced is found in the cache
• Cache miss
– location referenced is not found in cache
– triggers access to the next higher cache or memory
• Cache thrashing
– Two data elements can be mapped to the same cache line: loading the second 'evicts' the first
– Now what if this code is in a loop? 'Thrashing': really bad for performance
Hits, Misses, Thrashing
50
Cache Mapping
• Because each memory level is smaller than the next-closer level, data must be mapped
• Types of mapping
– Direct
– Set associative
– Fully associative
51
A block from main memory can go in exactly one place in the cache. This is called direct mapped because there is a direct mapping from any block address in memory to a single location in the cache. Typically a modulo calculation (e.g. keep 16 last bits of memory address).
[figure: blocks of main memory mapping to locations in the cache]
52
Direct Mapped Cache
• Example: cache size 64K, needs 16 bits to address
• a[0] and b[0] mapped to the same cache location
• Cache line is 4 words
• Thrashing:
– b[0]..b[3] loaded to cache, to register
– a[0]..a[3] loaded, gets new value, kicks b[0]..b[3] out of cache
– b[1] requested, so b[0]..b[3] loaded again
– a[1] requested, loaded, kicks b[0..3] out again
double a[8192],b[8192];
for (i=0; i<n; i++) {
  a[i] = b[i];
}
53
The Problem with Direct Mapping
A block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache. Requires a lookup table.
[figure: blocks of main memory mapping to arbitrary locations in the cache]
54
Fully Associative Caches
• Ideal situation
• Any memory location can be associated with any cache line
• Cost prohibitive
55
In an n-way set associative cache a block from main memory can go into n (n at least 2) locations in the cache.
[figure: 2-way set-associative cache and main memory]
56
Set Associative Caches
• Direct-mapped caches are 1-way set-associative caches
• For a k-way set-associative cache, each memory region can be associated with k cache lines
• Fully associative is k-way with k the number of cache lines
57
Translation Look-aside Buffer (TLB)
• Translates between the logical address space that each program has and actual memory addresses
• Memory organized in 'small pages', a few Kbyte in size
• Memory requests go through the TLB, normally very fast
• Pages that are not tracked through the TLB can be found through the 'page table': much slower
• -> Jumping between more pages than the TLB can track has a performance penalty
• This illustrates the need for spatial locality
58
Prefetch
• Hardware tries to detect if you load regularly spaced data:
– 'prefetch stream'
– This can sometimes be programmed in software, often only in in-line assembly
59
Data reuse
• Performance is limited by data transfer rate
• High performance if data items are used multiple times
• Examples:
– vector addition x_i = x_i + y_i: 1 op, 3 mem accesses
– inner product s = s + x_i*y_i: 2 ops, 2 mem accesses (s in register; also no writes)
60
Data reuse: matrix-matrix product
• Matrix-matrix product: 2n³ ops, 2n² data
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    s = 0;
    for (k=0; k<n; k++) {
      s = s+a[i][k]*b[k][j];
    }
    c[i][j] = s;
  }
}
Is there any data reuse in this algorithm?
61
Data reuse: matrix-matrix product
• Matrix-matrix product: 2n³ ops, 2n² data
– Data reuse is O(n): every data item is used O(n) times
• If it can be programmed right, this can overcome the bandwidth/CPU speed gap
• Again only theoretically: naive implementations are inefficient... so do not code this yourself: use BLAS (MKL, Atlas, etc.)
• (This is the important kernel in the Linpack benchmark: cf. Top500)
62
63
1.6. Programming strategies for high performance
Figure 1.15: Performance of naive and optimized implementations of the Discrete Fourier Transform
Figure 1.16: Performance of naive and optimized implementations of the matrix-matrix product
• Compilers are not able to extract anywhere close to optimal performance.⁹
• There are autotuning projects for automatic generation of implementations that are tuned to the architecture. This approach can be moderately to very successful. Some of the best known of these projects are Atlas [133] for Blas kernels, and Spiral [112] for transforms.
1.6.10 Cache aware programming
Unlike registers and main memory, both of which can be addressed in (assembly) code, use of caches is implicit. There is no way a programmer can load data explicitly to a certain cache, even in assembly language.
9. Presenting a compiler with the reference implementation may still lead to high performance, since some compilers are trained to recognize this operation. They will then forego translation and simply replace it by an optimized variant.
Reuse analysis: matrix-vector product
y[i] invariant but not reused: arrays get written back to memory, so 2 accesses just for y[i]
for (i=0; i<m; i++) {
  for (j=0; j<n; j++) {
    y[i] = y[i]+a[i][j]*x[j];
  }
}
for (i=0; i<m; i++) {
  s = 0.;
  for (j=0; j<n; j++) {
    s = s+a[i][j]*x[j];
  }
  y[i] = s;
}
s stays in register
64
Reuse analysis (1): matrix-vector product
Reuse of x[j], but the gain is outweighed by multiple load/store of y[i]
for (j=0; j<n; j++) {
  for (i=0; i<m; i++) {
    y[i] = y[i]+a[i][j]*x[j];
  }
}
for (j=0; j<n; j++) {
  t = x[j];
  for (i=0; i<m; i++) {
    y[i] = y[i]+a[i][j]*t;
  }
}
Different behaviour depending on whether the matrix is stored by rows or by columns
65
Reuse analysis (2): matrix-vector product
Loop tiling:
• x is loaded m/2 times, not m
• Register usage for y as before
• Loop overhead halved
• Pipelined operations exposed
• Prefetch streaming
for (i=0; i<m; i+=2) {
  s1 = 0.; s2 = 0.;
  for (j=0; j<n; j++) {
    s1 = s1+a[i][j]*x[j];
    s2 = s2+a[i+1][j]*x[j];
  }
  y[i] = s1; y[i+1] = s2;
}
for (i=0; i<m; i+=4) {
  s1 = 0.; s2 = 0.; s3 = 0.; s4 = 0.;
  for (j=0; j<n; j++) {
    s1 = s1+a[i][j]*x[j];
    s2 = s2+a[i+1][j]*x[j];
    s3 = s3+a[i+2][j]*x[j];
    s4 = s4+a[i+3][j]*x[j];
  }
  y[i] = s1; y[i+1] = s2; y[i+2] = s3; y[i+3] = s4;
}
Matrix stored by columns: now a full cache line of A is used
66
Reuse analysis (3): matrix-vector product
Further optimization: use pointer arithmetic instead of indexing
a1 = &(a[0][0]); a2 = a1+n;
for (i=0,ip=0; i<m/2; i++) {
  s1 = 0.; s2 = 0.;
  xp = &(x[0]);
  for (j=0; j<n; j++) {
    s1 = s1 + *(a1++) * *xp;
    s2 = s2 + *(a2++) * *(xp++);
  }
  y[ip++] = s1; y[ip++] = s2;
  a1 += n; a2 += n;
}
67
Locality
• Programming for high performance is based on spatial and temporal locality
• Temporal locality:
– Group references to one item close together
• Spatial locality:
– Group references to nearby memory items together
68
Temporal Locality
• Use an item, use it again before it is flushed from register or cache:
– Use item
– Use small number of other data
– Use item again
69
Temporal locality: example
Original loop: long time between uses of x. Rearrangement: x is reused
70
for (loop=0; loop<10; loop++) {
  for (i=0; i<N; i++) {
    ... = ... x[i] ...
  }
}
for (i=0; i<N; i++) {
  for (loop=0; loop<10; loop++) {
    ... = ... x[i] ...
  }
}
Spatial Locality
• Use items close together
• Cache lines: if the cache line is already loaded, other elements are 'for free'
• TLB: don't jump more than 512 words too many times
71
Illustration: Cache Size
for (i=0; i<NRUNS; i++)
  for (j=0; j<size; j++)
    array[j] = 2.3*array[j]+1.2;
• If the data fits in L1 cache, the transfer is very fast
• If there is more data, transfer speed from L2 dominates
72
Illustration: Cache Size
for (i=0; i<NRUNS; i++) {
  blockstart = 0;
  for (b=0; b<size/l1size; b++) {
    for (j=0; j<l1size; j++)
      array[blockstart+j] = 2.3*array[blockstart+j]+1.2;
    blockstart += l1size;
  }
}
• Data can sometimes be arranged to fit in cache:
• Cache blocking
73
Illustration: Cache Line Utilization
for (i=0,n=0; i<L1WORDS; i++,n+=stride)
  array[n] = 2.3*array[n]+1.2;
• Same amount of data, but increasing stride
• Increasing stride: more cache lines loaded, slower execution
74
Power Consumption
• Scale all geometrical features by s (s < 1):
– dynamic power consumption P is scaled to s²P
– circuit delay T is scaled to sT
– operating frequency F is changed to F/s
– Energy consumption is scaled by s³, and this gives us the space to put more components on a chip
• However, miniaturization of features is coming to a standstill due to laws of physics
• Increasing frequency would raise heat production
• -> 'Power wall'
79
Power Consumption
80
1.7. Power consumption
The net result is that the dynamic power consumption P is scaled to s²P, circuit delay T is scaled to sT, and operating frequency F is changed to F/s. Correspondingly, the energy consumption is scaled by s³, and this gives us the space to put more components on a chip.
At the time of this writing (circa 2010), miniaturization of components has almost come to a standstill, because further lowering of the voltage would give prohibitive leakage. Conversely, the frequency can not be scaled up since this would raise the heat production of the chip too far. Figure 1.17 gives a dramatic
Figure 1.17: Projected heat dissipation of a CPU if trends had continued – this graph courtesy Pat Gelsinger
illustration of the heat that a chip would give off, if single-processor trends had continued.
One conclusion is that computer design is running into a power wall, where the sophistication of a single core can not be increased any further (so we can for instance no longer increase ILP and pipeline depth) and the only way to increase performance is to increase the amount of explicitly visible parallelism. This development has led to the current generation of multicore processors; see section 1.4. It is also the reason GPUs with their simplified processor design and hence lower energy consumption are attractive; the same holds for Field-Programmable Gate Arrays (FPGAs).
The total power consumption of a parallel computer is determined by the consumption per processor and the number of processors in the full machine. At present, this is commonly several Megawatts. By the above reasoning, the increase in power needed from increasing the number of processors can no longer be offset by more power-effective processors, so power is becoming the overriding consideration as parallel computers move from the petascale (attained in 2008 by the IBM Roadrunner) to a projected exascale.
Multicore Architectures
• 'Power wall' (clock frequency cannot be increased)
• Limits of Instruction Level Parallelism (ILP)
– compiler limitations
– limited amount of intrinsically available parallelism
– branch prediction
• Solution: divide chip into multiple processing 'cores':
– 2 cores at lower frequency can have same throughput as 1 core at higher frequency (breaks power wall)
– discovered ILP replaced by explicit task parallelism, managed by programmer
81
Multicore Architectures
82
1. Sequential Computing
Figure 1.2: Cache hierarchy in a single-core and dual-core chip
With a cache, the assembly code stays the same, but the actual behaviour of the memory system now becomes:
• load x from memory into cache, and from cache into register; operate on it;
• do the intervening instructions;
• request x from memory, but since it is still in the cache, load it from the cache into register; operate on it.
Since loading from cache is faster than loading from main memory, the computation will now be faster. Caches are fairly small, so values can not be kept there indefinitely. We will see the implications of this in the following discussion.
There is an important difference between cache memory and registers: while data is moved into register by explicit assembly instructions, the move from main memory to cache is entirely done by hardware. Thus cache use and reuse is outside of direct programmer control. Later, especially in sections 1.5.2 and 1.6, you will see how it is possible to influence cache use indirectly.
1.3.4.2 Cache levels, speed and size
The caches are called 'level 1' and 'level 2' (or, for short, L1 and L2) cache; some processors can have an L3 cache. The L1 and L2 caches are part of the die, the processor chip, although for the L2 cache that is a recent development; the L3 cache is off-chip. The L1 cache is small, typically around 16Kbyte. Level 2 (and, when present, level 3) cache is more plentiful, up to several megabytes, but it is also slower. Unlike main memory, which is expandable, caches are fixed in size. If a version of a processor chip exists with a larger cache, it is usually considerably more expensive. In multicore chips, the cores typically have some private cache, while there is also shared cache on the processor chip.
18 Introduction to High Performance Scientific Computing
Single core / Dual core
Multi-core chips
• What is a processor? Instead, talk of 'socket' and 'core'
• Cores have separate L1, shared L2 cache
– Hybrid shared/distributed model
• Cache coherency problem: conflicting access to duplicated cache lines
83
Need to study parallel architecture and programming... Starting next week!
Computer Architecture: Parallel Computers
84
The basic idea
• Spread operations over many processors
• If n operations take time t on 1 processor,
• Does this become t/p on p processors (p<=n)?
for (i=0; i<n; i++)
  a[i] = b[i]+c[i];
a = b+c
Idealized version: every process has one array element
85
86
The basic idea
• Spread operations over many processors
• If n operations take time t on 1 processor,
• Does this become t/p on p processors (p<=n)?
for (i=0; i<n; i++)
  a[i] = b[i]+c[i];
a = b+c
for (i=my_low; i<my_high; i++)
  a[i] = b[i]+c[i];
Idealized version: every process has one array element
Slightly less ideal: each processor has part of the array
87
The basic idea (cont'd)
• Spread operations over many processors
• If n operations take time t on 1 processor,
• Does it always become t/p on p processors (p<=n)?
s = sum( a[i], i=0,n-1 )
88
89
The basic idea (cont'd)
• Spread operations over many processors
• If n operations take time t on 1 processor,
• Does it always become t/p on p processors (p<=n)?
s = sum( a[i], i=0,n-1 )
Conclusion: n operations can be done with n/2 processors, in total time log₂ n
Theoretical question: can addition be done faster?
Practical question: can we even do this?
90
/* recursive halving; assumes n is a power of two */
for (s=2; s<=n; s*=2)
  for (i=0; i<n; i+=s)
    a[i] += a[i+s/2];
91
2. Parallel Computing
Figure 2.2: Parallelization of a vector addition
First let us look systematically at communication. We can take the second half of figure 2.2 and turn it into a tree graph (see Appendix A.5) by defining the inputs as leaf nodes, all partial sums as interior nodes, and the root as the total sum. There is an edge from one node to another if the first is input to the (partial) sum in the other. This is illustrated in figure 2.3. In this figure nodes are horizontally aligned with other computations that can be performed simultaneously; each level is sometimes called a superstep in the computation. Nodes are vertically aligned if they are computed on the same processors, and an arrow corresponds to a communication if it goes from one processor to another. The vertical alignment in
Figure 2.3: Communication structure of a parallel vector addition
figure 2.3 is not the only one possible. If nodes are shuffled within a superstep or horizontal level, a different communication pattern arises.
Exercise 2.1. Consider placing the nodes within a superstep on random processors. Show that, if no two nodes wind up on the same processor, at most twice the number of communications is performed from the case in figure 2.3.
Exercise 2.2. Can you draw the graph of a computation that leaves the sum result on each processor? There is a solution that takes twice the number of supersteps, and there is
Some theory
• ...before we get into the hardware
• Optimally, p processes give T_p = T_1/p
• Speedup S_p = T_1/T_p, is p at best
• Superlinear speedup not possible in theory, sometimes happens in practice.
• Perfect speedup in 'embarrassingly parallel' applications
• Less than optimal: overhead, sequential parts, dependencies
92
Some more theory
• ...before we get into the hardware
• Optimally, p processes give T_p = T_1/p
• Speedup S_p = T_1/T_p, is p at best
• Efficiency E_p = S_p/p
• Scalability: efficiency bounded below
93
Scaling
• Increasing the number of processors for a given problem makes sense up to a point: p > n/2 in the addition example has no use
• Strong scaling: problem constant, number of processors increasing
• More realistic: scaling up problem and processors simultaneously, for instance to keep data per processor constant: Weak scaling
• Weak scaling not always possible: problem size depends on measurements or other external factors.
94
Amdahl's Law
• Some parts of a code are not parallelizable
• => they ultimately become a bottleneck
• For instance, if 5% is sequential, you can not get a speedup over 20, no matter p.
• Formally, if F_s is the sequential fraction and F_p the parallelizable fraction (F_p+F_s=1):
– T_p = (sequential) + (parallelized) = T_1*F_s + T_1*F_p/p
• Amdahl's law: T_p = T_1(F_s+F_p/p)
– T_p approaches T_1*F_s as p increases; speedup S_p <= 1/F_s
95
Theoretical characterization of architectures
96
Parallel Computer Architectures
• Parallel computing means using multiple processors, possibly comprising multiple computers
• Flynn's (1966) taxonomy is a first way to classify parallel computers into one of four types:
– (SISD) Single instruction, single data
• Your (old, single core) desktop
– (SIMD) Single instruction, multiple data
• Thinking Machines CM-2, Cray 1, and other vector machines (there's some controversy here)
• Parts of modern GPUs
– (MISD) Multiple instruction, single data
• Special purpose machines
• No commercial, general purpose machines
– (MIMD) Multiple instruction, multiple data
• Nearly all of today's parallel machines, including your laptop
97
SIMD
• Based on regularity of computation: all processors often doing the same operation: data parallel
• Big advantage: processors do not need a separate ALU
• => lots of small processors packed together
• Ex: Goodyear MPP: 64k processors in 1983
• Use masks to let processors differentiate
98
SIMD then and now
• There used to be computers that were entirely SIMD (usually an attached processor to a front end)
• SIMD these days:
– SSE instructions in regular CPUs
– GPUs are SIMD units (sort of)
99
Kinda SIMD: Vector Machines
• Based on a single processor with:
– Segmented (pipelined) functional units
– Needs a sequence of the same operation
• Dominated early parallel market
– overtaken in the 90s by clusters, et al.
• Making a comeback (sort of)
– clusters/constellations of vector machines:
• Earth Simulator (NEC SX6) and Cray X1/X1E
– Arithmetic units in CPUs are pipelined.
100
Remember the pipeline
• Assembly line model (body on frame, attach wheels, doors, handles on doors)
• Floating point addition: exponent align, add mantissas, exponent normalize
• Separate hardware for each stage: pipeline processor
101
102
1.2. Modern floating point units
Figure 1.1: Schematic depiction of a pipelined operation
The operation ∀i: a_{i+1} = a_i·b_i + c_i feeds the result of one iteration (a_i) to the input of the next (a_{i+1} = ...), so the operations are not independent.
A pipelined processor can speed up operations by a factor of 4, 5, 6 with respect to earlier CPUs. Such numbers were typical in the 1980s when the first successful vector computers came on the market. These days, CPUs can have 20-stage pipelines. Does that mean they are incredibly fast? This question is a bit complicated. Chip designers continue to increase the clock rate, and the pipeline segments can no longer finish their work in one cycle, so they are further split up. Sometimes there are even segments in which nothing happens: that time is needed to make sure data can travel to a different part of the chip in time.
The amount of improvement you can get from a pipelined CPU is limited, so in a quest for ever higher performance several variations on the pipeline design have been tried. For instance, the Cyber 205 had separate addition and multiplication pipelines, and it was possible to feed one pipe into the next without data going back to memory first. Operations like ∀i: a_i ← b_i + c·d_i were called 'linked triads' (because of the number of paths to memory, one input operand had to be scalar).
Exercise 1.2. Analyse the speedup and n_{1/2} of linked triads.
Another way to increase performance is to have multiple identical pipes. This design was perfected by the NEC SX series. With, for instance, 4 pipes, the operation ∀i: a_i ← b_i + c_i would be split modulo 4, so that the first pipe operated on indices i = 4·j, the second on i = 4·j + 1, et cetera.
Exercise 1.3. Analyze the speedup and n_{1/2} of a processor with multiple pipelines that operate in parallel. That is, suppose that there are p independent pipelines, executing the same instruction, that can each handle a stream of operands.
MIMD
• Multiple Instruction, Multiple Data
• Most general model: each processor works on its own data with its own data stream: task parallel
• Example: one processor produces data, the next processor consumes/analyzes data
103
MIMD
• In practice SPMD: Single Program Multiple Data:
– all processors execute the same code
– Just not the same instruction at the same time
– Different control flow possible too
– Different amounts of data: load unbalance
104
Granularity
• You saw data parallel and task parallel
• Medium grain parallelism: carve up large job into tasks of data parallel work
• (Example: array summing, each processor has a subarray)
• Good match to hybrid architectures:
– task -> node
– data parallel -> SIMD engine
105
GPU: the miracle architecture (?)
• Lots of hype about incredible speedup / high performance for low cost. What's behind it?
• Origin of GPUs: that 'G'
• Graphics processing: identical (fairly simple) operations on lots of pixels
• Doesn't matter when any individual pixel gets processed, as long as they all get done in the end
• (Otoh, CPU: heterogeneous instructions, need to be done ASAP.)
• => GPU is a SIMD engine
• ...and scientific computing is often very data-parallel
106
GPU programming:
• KernelProc<<< m,n >>>( args )
• Explicit SIMD programming
• There is more: threads (see later)
107
Characterization by Memory Structure
108
Parallel Computer Architectures
• Top500 list now dominated by MPPs and clusters
• The MIMD model 'won'.
• SIMD exists only on smaller scale
• A much more useful classification is by memory model
– shared memory
– distributed memory
109
Two memory models
• Shared memory: all processors share the same address space
– OpenMP: directives-based programming
– PGAS languages (UPC, Titanium, X10)
• Distributed memory: every processor has its own address space
– MPI: Message Passing Interface
110
Shared and Distributed Memory
Shared memory: single address space. All processors have access to a pool of shared memory. (e.g., a single cluster node (2-way, 4-way, ...)) Methods of memory access: bus, distributed switch, crossbar.
Distributed memory: each processor has its own local memory. Must do message passing to exchange data between processors. (examples: Linux clusters, Cray XT3) Methods of memory access: single switch or switch hierarchy with fat tree, etc. topology.
[figures: a shared-memory machine, processors on a bus or crossbar sharing one memory; a distributed-memory machine, each processor with its own memory, connected by a network]
111
Shared Memory: UMA and NUMA
Uniform Memory Access (UMA): each processor has uniform access time to memory; also known as symmetric multiprocessors (SMPs) (example: Sun E25000 at TACC)
Non-Uniform Memory Access (NUMA): time for memory access depends on location of data; also known as distributed shared memory machines. Local access is faster than non-local access. Easier to scale than SMPs (e.g.: SGI Origin 2000)
112
Interconnects
113
Topology of interconnects
• What is the actual 'shape' of the interconnect? Are the nodes connected by a 2D mesh? A ring? Something more elaborate?
• => some graph theory
114
Completely Connected and Star Networks
• Completely connected: each processor has a direct communication link to every other processor
• Star connected network: the middle processor is the central processor; every other processor is connected to it.
115
Arrays and Rings
• Linear Array:
• Ring:
• Mesh Network (e.g. 2D-array)
116
Torus
2-d Torus (2-d version of the ring)
117
Hypercubes
• Hypercube network: a multidimensional mesh of processors with exactly two processors in each dimension. A d-dimensional hypercube consists of p = 2^d processors
• Shown below are 0, 1, 2, and 3D hypercubes
0-D 1-D 2-D 3-D hypercubes
118
Inductive definition
119
Pros and cons of hypercubes
• Pro: processors are close together: never more than log(p) links apart
• Lots of bandwidth
• Little chance of contention
• Con: the number of wires out of a processor depends on p: complicated design
• Values of p other than 2^d not possible.
120
Mapping applications to hypercubes
• Is there a natural mapping from 1, 2, 3D to a hypercube?
• Naive node numbering does not work:
• Nodes 0 and 1 have distance 1, but
• 3 and 4 have distance 3
• (so do 7 and 0)
121
Mapping applications to hypercubes
• Is there a natural mapping from 1, 2, 3D to a hypercube?
• => Gray codes
• Recursive definition: number one subcube, then the other subcube in mirroring order.
[figure: Gray-code node numbering of the 2D hypercube (1 0 / 2 3) and the 3D hypercube (1 0 / 2 3 / 6 7 / 5 4)]
Subsequent processors (in the linear ordering) are all one link apart
Recursive definition:
0 | 1

0 0 | 1 1
0 1 | 1 0

0 0 0 0 | 1 1 1 1
0 0 1 1 | 1 1 0 0
0 1 1 0 | 0 1 1 0
122
Buses, Hubs and Crossbars
Hub/Bus: every processor shares the communication links
Crossbar switches: every processor connects to the switch, which routes communications to their destinations
123
Butterfly exchange network
• Built out of simple switching elements
• Multi-stage; #stages grows with #procs
• Multiple non-colliding paths possible
• Uniform memory access
124
Fat Trees
• Multiple switches
• Each level has the same number of links in as out
• Increasing number of links at each level
• Gives full bandwidth between the links
• Added latency the higher you go
125
Fat Trees
• In practice emulated by a switching network
126
Interconnect graph theory
• Degree
– How many links to other processors does each node have?
– More is better, but also expensive and hard to engineer
• Diameter
– maximum distance between any two processors in the network.
– The distance between two processors is defined as the shortest path, in terms of links, between them.
– 1 for a completely connected network, 2 for a star network, p/2 for a ring (for p even processors)
• Connectivity
– measure of the multiplicity of paths between any two processors (# arcs that must be removed to break the connection).
– high connectivity is desired since it lowers contention for communication resources.
– 1 for linear array, 1 for star, 2 for ring, 2 for mesh, 4 for torus
– technically 1 for traditional fat trees, but there is redundancy in the switch infrastructure
127
Practical issues in interconnects
• Latency: how long does it take to start sending a 'message'? Units are generally microseconds or milliseconds.
• Bandwidth: what data rate can be sustained once the message is started? Units are Mbytes/sec or Gbytes/sec.
– Both point-to-point and aggregate bandwidth are of interest
• Multiple wires: multiple latencies, same bandwidth
• Sometimes shortcuts possible: 'wormhole routing'
128
Measures of bandwidth
• Aggregate bandwidth: total data rate if every processor is sending: total capacity of the wires. This can be very high and quite unrealistic.
• Imagine a linear array with processor i sending to p/2+i: 'contention'
• Bisection bandwidth: bandwidth across the minimum number of wires that would split the machine in two.
129
130
Interconnects
• Bisection width
– Minimum # of communication links that have to be removed to partition the network into two equal halves. Bisection width is
– 2 for ring, sq. root(p) for mesh with p (even) processors, p/2 for hypercube, (p*p)/4 for completely connected (p even).
• Channel width
– # of physical wires in each communication link
• Channel rate
– peak rate at which a single physical wire link can deliver bits
• Channel BW
– peak rate at which data can be communicated between the ends of a communication link
– = (channel width) * (channel rate)
• Bisection BW
– minimum volume of communication found between any 2 halves of the network with an equal # of procs
– = (bisection width) * (channel BW)
131
Bandwidth and Latency
132
                            IB-DDR   10 Gigabit   1 Gigabit
Ping-Pong bandwidth, MB/s     1466         1000       112.5
Exchange bandwidth, MB/s      2659         2073       157.6
Latency, us                   2.01         8.23       46.52