TRANSCRIPT
Computer Architecture: Single CPU
What Does a CPU Look Like?
What Does it Mean?
What is in a Core?
Von Neumann Architecture
• Instruction decode: determine operation and operands
• Get operands from memory
• Perform operation
• Write results back
• Continue with next instruction

Undivided memory that stores both program and data ('stored program') + processing unit that executes the instructions, operating on the data
Contemporary Architecture
• Multiple operations simultaneously "in flight"
• Operands can be in memory, cache, register
• Results may need to be coordinated with other processing elements
• Operations can be performed speculatively
Scientific Computing
• Some algorithms are "CPU-bound" – the speed of the processor is the most important factor
• Some algorithms are "memory-bound" – bus speed, cache size become important

"Memory-bound" becomes ever more prominent... A simple "GHz" comparison does not tell the whole story!
Modern Floating Point Units
• Traditionally: one instruction at a time
• Modern CPUs: multiple floating point units, for instance 1 Mul + 1 Add, or 1 FMA ("fused multiply-add")
• Peak performance is several ops/clock cycle (currently up to 4); usually very hard to obtain
• Other operations not as optimized: a division requires 10 to 20 clock cycles
• Instruction decode: the processor inspects the instruction to determine the operation and the operands.
• Memory fetch: if necessary, data is brought from memory into a register.
• Execution: the operation is executed, reading data from registers and writing it back to a register.
• Write-back: for store operations, the register contents are written back to memory.
Complicating this story, contemporary CPUs operate on several instructions simultaneously, which are said to be 'in flight', meaning that they are in various stages of completion. This is the basic idea of the superscalar CPU architecture, and is also referred to as Instruction Level Parallelism (ILP). Thus, while each instruction can take several clock cycles to complete, a processor can complete one instruction per cycle in favourable circumstances; in some cases more than one instruction can be finished per cycle.
The main statistic that is quoted about CPUs is their Gigahertz rating, implying that the speed of the processor is the main determining factor of a computer's performance. While speed obviously correlates with performance, the story is more complicated. Some algorithms are cpu-bound, and the speed of the processor is indeed the most important factor; other algorithms are memory-bound, and aspects such as bus speed and cache size, to be discussed later, become important.
In scientific computing, this second category is in fact quite prominent, so in this chapter we will devote plenty of attention to the process that moves data from memory to the processor, and we will devote relatively little attention to the actual processor.
1.2 Modern floating point units
Many modern processors are capable of doing multiple operations simultaneously, and this holds in particular for the arithmetic part. For instance, often there are separate addition and multiplication units; if the compiler can find addition and multiplication operations that are independent, it can schedule them so as to be executed simultaneously, thereby doubling the performance of the processor. In some cases, a processor will have multiple addition or multiplication units.
Another way to increase performance is to have a 'fused multiply-add' unit, which can execute the instruction x ← ax + b in the same amount of time as a separate addition or multiplication. Together with pipelining (see below), this means that a processor has an asymptotic speed of several floating point operations per clock cycle.
Processor                         floating point units   max operations per cycle
Intel Pentium4                    2 add or 2 mul         2
Intel Woodcrest, AMD Barcelona    2 add + 2 mul          4
IBM POWER4, POWER5, POWER6        2 FMA                  4
IBM BG/L, BG/P                    1 SIMD FMA             4
SPARC IV                          1 add + 1 mul          2
Itanium2                          2 FMA                  4
Table 1.1: Floating point capabilities of several current processor architectures
Victor Eijkhout 9
Pipelining
• A single instruction takes several clock cycles to complete
• Subdivide an instruction:
  – Instruction decode
  – Operand exponent align
  – Actual operation
  – Normalize
• Pipeline: separate piece of hardware for each subdivision
• Compare to assembly line
Pipelining
• Decoding the instruction, including finding the locations of the operands
• Copying the operands into registers ('data fetch')
• Aligning the exponents
• Executing the addition of the mantissas
• Normalizing the result
• Storing the result
Example: addition of 2 fp numbers has the following subdivision ("components", "stages", "segments")
1. Sequential Computing
Incidentally, there are few algorithms in which division operations are a limiting factor. Correspondingly, the division operation is not nearly as much optimized in a modern CPU as the additions and multiplications are. Division operations can take 10 or 20 clock cycles, while a CPU can have multiple addition and/or multiplication units that (asymptotically) can produce a result per cycle.
1.2.1 Pipelining
The floating point add and multiply units of a processor are pipelined, which has the effect that a stream of independent operations can be performed at an asymptotic speed of one result per clock cycle.
The idea behind a pipeline is as follows. Assume that an operation consists of multiple simpler operations, and that for each suboperation there is separate hardware in the processor. For instance, an addition instruction can have the following components:
• Decoding the instruction, including finding the locations of the operands.
• Copying the operands into registers ('data fetch').
• Aligning the exponents; the addition .35×10^-1 + .6×10^-2 becomes .35×10^-1 + .06×10^-1.
• Executing the addition of the mantissas, in this case giving .41.
• Normalizing the result, in this example to .41×10^-1. (Normalization in this example does not do anything. Check for yourself that in .3×10^0 + .8×10^0 and .35×10^-3 + (−.34)×10^-3 there is a non-trivial adjustment.)
• Storing the result.
These parts are often called the ‘stages’ or ‘segments’ of the pipeline.
If every component is designed to finish in 1 clock cycle, the whole instruction takes 6 cycles. However, if each has its own hardware, we can execute two operations in less than 12 cycles:
• Execute the decode stage for the first operation;
• Do the data fetch for the first operation, and at the same time the decode for the second.
• Execute the third stage for the first operation and the second stage of the second operation simultaneously.
• Et cetera.
You see that the first operation still takes 6 clock cycles, but the second one is finished a mere 1 cycle later. This idea can be extended to more than two operations: the first operation still takes the same amount of time as before, but after that one more result will be produced each cycle. Formally, executing n operations on an s-segment pipeline takes s + n − 1 cycles, as opposed to ns in the classical case.

Exercise 1.1. Let us compare the speed of a classical floating point unit, and a pipelined one. If the pipeline has s stages, what is the asymptotic speedup? That is, with T0(n) the time for n operations on a classical CPU, and Ts(n) the time for n operations on an s-segment pipeline, what is lim_{n→∞} (T0(n)/Ts(n))? Next you can wonder how long it takes to get close to the asymptotic behaviour. Define Ss(n) as the speedup achieved on n operations. The quantity n_{1/2} is defined as the value of n such that Ss(n) is half the asymptotic speedup. Give an expression for n_{1/2}.
Since a vector processor works on a number of instructions simultaneously, these instructions have to be independent. The operation ∀i : a_i ← b_i + c_i has independent additions; the operation ∀i : a_{i+1}
10 Introduction to High Performance Scientific Computing
Pipelining
Every component designed to finish in 1 clock cycle: the whole instruction takes 6 cycles. If each has its own hardware, one can execute two operations in less than 12 cycles:
• Execute the decode stage for the first operation;
• Do the data fetch for the first operation, and at the same time the decode for the second.
• Execute the third stage for the first operation and the second stage of the second operation simultaneously.
• ...
Pipelining
Analysis:
• First addition takes 6 clock cycles
• Second addition finishes a mere 1 cycle later

This idea can be extended to more than two operations: the first operation still takes the same amount of time as before, but after that one more result will be produced each cycle.

Executing n operations on an s-segment pipeline takes (s + n − 1) cycles, as opposed to (ns) in the classical case.

This requires independent operations... One solution: multiple pipes
Pipelining
With pipelining, peak CPU performance =
(clock speed) × (number of independent floating point units)

The measure of floating point performance is 'floating point operations per second', abbreviated "flops".

'gigaflops' = multiples of 10^9 flops
(You may wonder why we are mentioning some fairly old computers here: true pipeline supercomputers hardly exist anymore. In the US, the Cray X1 was the last of that line, and in Japan only NEC still makes them. However, the functional units of a CPU these days are pipelined, so the notion is still important.)

Exercise 1.4. The operation
for (i) {
  x[i+1] = a[i]*x[i] + b[i];
}
can not be handled by a pipeline because there is a dependency between the input of one iteration of the operation and the output of the previous. However, you can transform the loop into one that is mathematically equivalent, and potentially more efficient to compute. Derive an expression that computes x[i+2] from x[i] without involving x[i+1]. This is known as recursive doubling. Assume you have plenty of temporary storage. You can now perform the calculation by
• doing some preliminary calculations;
• computing x[i], x[i+2], x[i+4], ..., and from these,
• computing the missing terms x[i+1], x[i+3], ....
Analyze the efficiency of this scheme by giving formulas for T0(n) and Ts(n). Can you think of an argument why the preliminary calculations may be of lesser importance in some circumstances?
1.2.2 Peak performance
Thanks to pipelining, for modern CPUs there is a simple relation between the clock speed and the peak performance. Since each floating point unit can produce one result per cycle asymptotically, the peak performance is the clock speed times the number of independent floating point units. The measure of floating point performance is 'floating point operations per second', abbreviated flops. Considering the speed of computers these days, you will mostly hear floating point performance being expressed in 'gigaflops': multiples of 10^9 flops.
1.2.3 Pipelining beyond arithmetic: instruction-level parallelism
In fact, nowadays, the whole CPU is pipelined. Not only floating point operations, but any sort of instruction will be put in the instruction pipeline as soon as possible. Note that this pipeline is no longer limited to identical instructions: the notion of pipeline is now generalized to any stream of partially executed instructions that are simultaneously "in flight".
This concept is also known as Instruction Level Parallelism (ILP), and it is facilitated by various mechanisms:
• multiple-issue: instructions that are independent can be started at the same time;
• pipelining: already mentioned, arithmetic units can deal with multiple operations in various stages of completion;
• branch prediction and speculative execution: a compiler can 'guess' whether a conditional instruction will evaluate to true, and execute those instructions accordingly;
Pipelining Beyond Arithmetic
The whole CPU is pipelined, leading to "Instruction Level Parallelism" (ILP)

Facilitated by:
• multiple issue (independent instructions can be started at the same time)
• branch prediction and speculative execution
• out-of-order execution
• Memory is too slow to keep up with the processor
  – 100-1000 cycles latency before data arrives
  – Data stream maybe 1/4 fp number/cycle; processor wants 2 or 3
  – "Memory wall"
• At considerable cost it's possible to build faster memory
• Cache is a small amount of fast memory
Memory Hierarchies
• Memory is divided into different levels:
  – Registers
  – Caches
  – Main memory
• Memory is accessed through the hierarchy
  – registers where possible
  – ... then the caches
  – ... then main memory
Memory Hierarchies
1.3. Memory Hierarchies
Figure 1.3: Memory hierarchy of an AMD Opteron, characterized by speed and size.
Data needed in some operation gets copied into the various caches on its way to the processor. If, some instructions later, a data item is needed again, it is first searched for in the L1 cache; if it is not found there, it is searched for in the L2 cache; if it is not found there, it is loaded from main memory. Finding data in cache is called a cache hit, and not finding it a cache miss.
Figure 1.3 illustrates the basic facts of caches, in this case for the AMD Opteron chip: the closer caches are to the floating point units, the faster, but also the smaller they are. Some points about this figure.
• Loading data from registers is so fast that it does not constitute a limitation on algorithm execution speed. On the other hand, there are few registers. The Opteron [5] has 16 general purpose registers, 8 media and floating point registers, and 16 SIMD registers.
• The L1 cache is small, but sustains a bandwidth of 32 bytes, that is 4 double precision numbers, per cycle. This is enough to load two operands each for two operations, but note that the Opteron can actually perform 4 operations per cycle. Thus, to achieve peak speed, certain operands need to stay in register. The latency from L1 cache is around 3 cycles.
• The bandwidth from L2 and L3 cache is not documented and hard to measure due to cache policies (see below). Latencies are around 15 cycles for L2 and 50 for L3.
• Main memory access has a latency of more than 100 cycles, and a bandwidth of 4.5 bytes per cycle, which is about 1/7th of the L1 bandwidth. However, this bandwidth is shared by the 4 cores of the Opteron chip, so effectively the bandwidth is a quarter of this number. In a machine like Ranger, which has 4 chips per node, some bandwidth is spent on maintaining cache coherence (see section 1.4), reducing the bandwidth for each chip again by half.
On level 1, there are separate caches for instructions and data; the L2 and L3 cache contain both data and instructions.
You see that the larger caches are increasingly unable to supply data to the processors fast enough. For this
[5] Specifically the server chip used in the Ranger supercomputer; desktop versions may have different specifications.
AMD Opteron
• The two most important terms related to performance for memory subsystems and for networks:
• Latency
  – How long does it take to retrieve a word of memory?
  – Units are generally nanoseconds (milliseconds for network latency) or clock periods (CP)
  – Sometimes addresses are predictable: compiler will schedule the fetch. Predictable code is good!
• Bandwidth
  – What data rate can be sustained once the message is started?
  – Units are B/sec (MB/sec, GB/sec, etc.)
Latency/and/Bandwidth
• The time that a message takes from start to finish combines latency and bandwidth:
  T(n) = α + βn
• α: latency
• β: inverse of bandwidth (the time per byte)
1.3.2 Latency and Bandwidth
Above, we mentioned in very general terms that accessing data in registers is almost instantaneous, whereas loading data from memory into the registers, a necessary step before any operation, incurs a substantial delay. We will now make this story slightly more precise.
There are two important concepts to describe the movement of data: latency and bandwidth. The assumption here is that requesting an item of data incurs an initial delay; if this item was the first in a stream of data, usually a consecutive range of memory addresses, the remainder of the stream will arrive with no further delay at a regular amount per time period.

Latency is the delay between the processor issuing a request for a memory item, and the item actually arriving. We can distinguish between various latencies, such as the transfer from memory to cache, cache to register, or summarize them all into the latency between memory and processor. Latency is measured in (nano)seconds, or clock periods. If a processor executes instructions in the order they are found in the assembly code, then execution will often stall while data is being fetched from memory; this is also called memory stall. For this reason, a low latency is very important. In practice, many processors have 'out-of-order execution' of instructions, allowing them to perform other operations while waiting for the requested data. Programmers can take this into account, and code in a way that achieves latency hiding. Graphics Processing Units (GPUs) (see section 2.9) can switch very quickly between threads in order to achieve latency hiding.

Bandwidth is the rate at which data arrives at its destination, after the initial latency is overcome. Bandwidth is measured in bytes (kilobytes, megabytes, gigabytes) per second or per clock cycle. The bandwidth between two memory levels is usually the product of the cycle speed of the channel (the bus speed) and the bus width: the number of bits that can be sent simultaneously in every cycle of the bus clock.
The concepts of latency and bandwidth are often combined in a formula for the time that a message takes from start to finish:
T(n) = α + βn
where α is the latency and β is the inverse of the bandwidth: the time per byte.
Typically, the further away from the processor one gets, the longer the latency is, and the lower the bandwidth. These two factors make it important to program in such a way that, if at all possible, the processor uses data from cache or register, rather than from main memory. To illustrate that this is a serious matter, consider a vector addition
for (i)
  a[i] = b[i]+c[i]
Each iteration performs one floating point operation, which modern CPUs can do in one clock cycle by using pipelines. However, each iteration needs two numbers loaded and one written, for a total of 24 bytes [4] of memory traffic. Typical memory bandwidth figures (see for instance figure 1.3) are nowhere near 24
[4] Actually, a[i] is loaded before it can be written, so there are 4 memory accesses, with a total of 32 bytes, per iteration.
Latency and Bandwidth
1.3. Memory Hierarchies
1.3.2 Latency and Bandwidth
Above, we mentioned in very general terms that accessing data in registers is almost instantaneous, whereasloading data from memory into the registers, a necessary step before any operation, incurs a substantialdelay. We will now make this story slightly more precise.
There are two important concepts to describe the movement of data: latency and bandwidth . The assump-tion here is that requesting an item of data incurs an initial delay; if this item was the first in a stream ofdata, usually a consecutive range of memory addresses, the remainder of the stream will arrive with nofurther delay at a regular amount per time period.Latency is the delay between the processor issuing a request for a memory item, and the item actually
arriving. We can distinguish between various latencies, such as the transfer from memory tocache, cache to register, or summarize them all into the latency between memory and processor.Latency is measured in (nano) seconds, or clock periods.If a processor executes instructions in the order they are found in the assembly code, then execu-tion will often stall while data is being fetched from memory; this is also called memory stall .For this reason, a low latency is very important. In practice, many processors have ‘out-of-orderexecution’ of instructions, allowing them to perform other operations while waiting for the re-quested data. Programmers can take this into account, and code in a way that achieves latencyhiding . Graphics Processing Units (GPUs) (see section 2.9) can switch very quickly betweenthreads in order to achieve latency hiding.
Bandwidth is the rate at which data arrives at its destination, after the initial latency is overcome. Band-width is measured in bytes (kilobyes, megabytes, gigabyes) per second or per clock cycle. Thebandwidth between two memory levels is usually the product of the cycle speed of the channel(the bus speed ) and the bus width : the number of bits that can be sent simultaneously in everycycle of the bus clock.
The concepts of latency and bandwidth are often combined in a formula for the time that a message takesfrom start to finish:
T (n) = ↵ + �n
where ↵ is the latency and � is the inverse of the bandwidth: the time per byte.
Typically, the further away from the processor one gets, the longer the latency is, and the lower the band-width. These two factors make it important to program in such a way that, if at all possible, the processoruses data from cache or register, rather than from main memory. To illustrate that this is a serious matter,consider a vector addition
for (i)a[i] = b[i]+c[i]
Each iteration performs one floating point operation, which modern CPUs can do in one clock cycle byusing pipelines. However, each iteration needs two numbers loaded and one written, for a total of 24 bytes4
of memory traffic. Typical memory bandwidth figures (see for instance figure 1.3) are nowhere near 24
4. Actually, a[i] is loaded before it can be written, so there are 4 memory access, with a total of 32 bytes, per iteration.
Victor Eijkhout 15
1.3. Memory Hierarchies
1.3.2 Latency and Bandwidth
Above, we mentioned in very general terms that accessing data in registers is almost instantaneous, whereasloading data from memory into the registers, a necessary step before any operation, incurs a substantialdelay. We will now make this story slightly more precise.
There are two important concepts to describe the movement of data: latency and bandwidth. The assumption here is that requesting an item of data incurs an initial delay; if this item was the first in a stream of data, usually a consecutive range of memory addresses, the remainder of the stream will arrive with no further delay at a regular amount per time period.
Latency is the delay between the processor issuing a request for a memory item, and the item actually arriving. We can distinguish between various latencies, such as the transfer from memory to cache, or cache to register, or summarize them all into the latency between memory and processor. Latency is measured in (nano)seconds, or clock periods. If a processor executes instructions in the order they are found in the assembly code, then execution will often stall while data is being fetched from memory; this is also called memory stall. For this reason, a low latency is very important. In practice, many processors have 'out-of-order execution' of instructions, allowing them to perform other operations while waiting for the requested data. Programmers can take this into account, and code in a way that achieves latency hiding. Graphics Processing Units (GPUs) (see section 2.9) can switch very quickly between threads in order to achieve latency hiding.
Bandwidth is the rate at which data arrives at its destination, after the initial latency is overcome. Bandwidth is measured in bytes (kilobytes, megabytes, gigabytes) per second or per clock cycle. The bandwidth between two memory levels is usually the product of the cycle speed of the channel (the bus speed) and the bus width: the number of bits that can be sent simultaneously in every cycle of the bus clock.
The concepts of latency and bandwidth are often combined in a formula for the time that a message takes from start to finish:

T(n) = α + βn

where α is the latency and β is the inverse of the bandwidth: the time per byte.
Typically, the further away from the processor one gets, the longer the latency is, and the lower the bandwidth. These two factors make it important to program in such a way that, if at all possible, the processor uses data from cache or register, rather than from main memory. To illustrate that this is a serious matter, consider a vector addition
for (i=0; i<n; i++)
  a[i] = b[i]+c[i];
Each iteration performs one floating point operation, which modern CPUs can do in one clock cycle by using pipelines. However, each iteration needs two numbers loaded and one written, for a total of 24 bytes⁴ of memory traffic. Typical memory bandwidth figures (see for instance figure 1.3) are nowhere near 24
4. Actually, a[i] is loaded before it can be written, so there are 4 memory accesses, with a total of 32 bytes, per iteration.
Victor Eijkhout 15
Implications of Latency and Bandwidth: Little's Law
• Memory loads can depend on each other: loading the result of a previous operation
• Two such loads have to be separated by at least the memory latency
• In order not to waste bandwidth, at least latency-many items have to be under way at all times, and they have to be independent
• Multiply by bandwidth: Little's law: Concurrency = Bandwidth x Latency
42
• Finding parallelism is sometimes called 'latency hiding': load data early to hide latency
• GPUs do latency hiding by spawning many threads
• Requires fast context switch
43
PS: Latency Hiding and GPUs
Registers
• Highest bandwidth, lowest latency memory that a modern processor can access; built into the CPU
• Often a scarce resource and not random access
• Processor instructions operate on registers directly
– have assembly language names like: eax, ebx, ecx, etc.
– sample instruction: addl %eax, %edx
• Separate instructions and registers for floating-point operations
44
• Between the CPU registers and main memory
• L1 Cache: data cache closest to registers
• L2 Cache: secondary data cache, stores both data and instructions
– Data from L2 has to go through L1 to registers
– L2 is 10 to 100 times larger than L1
– Some systems have an L3 cache, ~10x larger than L2
• Cache line
45
Data Caches
• The smallest unit of data transferred between main memory and the caches (or between levels of cache; every cache has its own line size)
• N sequentially-stored, multi-byte words (usually N=8 or 16).
• If you request one word on a cache line, you get the whole line
– Make sure to use the other items, you've paid for them in bandwidth
– Sequential access good, 'strided' access ok, random access bad
46
Cache Line
Main Memory
• Cheapest form of RAM
• Also the slowest
– lowest bandwidth
– highest latency
• Unfortunately most of our data lives out here
47
Cache and Register Access
• Access is transparent to the programmer
– data is in a register or in cache or in memory
– loaded from the highest level where it's found
– processor/cache controller/MMU hides cache access from the programmer
• ...but you can influence it:
– Access x (that puts it in L1), access 100k of data, access x again: it will probably be gone from cache
– If you use an element twice, don't wait too long
– If you loop over data, try to take chunks of less than cache size
– In C: declare a register variable, only a suggestion
48
Register Use
• y[i] can be kept in register
• Declaration is only a suggestion to the compiler
• Compiler can usually figure this out itself
for (i=0; i<m; i++) {
  for (j=0; j<n; j++) {
    y[i] = y[i]+a[i][j]*x[j];
  }
}
register double s;
for (i=0; i<m; i++) {
  s = 0.;
  for (j=0; j<n; j++) {
    s = s+a[i][j]*x[j];
  }
  y[i] = s;
}
49
• Cache hit
– location referenced is found in the cache
• Cache miss
– location referenced is not found in cache
– triggers access to the next higher cache or memory
• Cache thrashing
– Two data elements can be mapped to the same cache line: loading the second 'evicts' the first
– Now what if this code is in a loop? 'Thrashing': really bad for performance
Hits, Misses, Thrashing
50
Cache Mapping
• Because each memory level is smaller than the next-closer level, data must be mapped
• Types of mapping
– Direct
– Set associative
– Fully associative
51
A block from main memory can go in exactly one place in the cache. This is called direct mapped because there is a direct mapping from any block address in memory to a single location in the cache. Typically a modulo calculation (e.g. keep 16 last bits of memory address).
[figure: blocks of main memory mapping to locations in the cache]
52
Direct Mapped Cache
• Example: cache size 64K, needs 16 bits to address
• a[0] and b[0] mapped to the same cache location
• Cache line is 4 words
• Thrashing:
– b[0]..b[3] loaded to cache, to register
– a[0]..a[3] loaded, gets new value, kicks b[0]..b[3] out of cache
– b[1] requested, so b[0]..b[3] loaded again
– a[1] requested, loaded, kicks b[0..3] out again
double a[8192],b[8192];
for (i=0; i<n; i++) {
  a[i] = b[i];
}
53
The Problem with Direct Mapping
A block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache. Requires a lookup table.
[figure: blocks of main memory mapping to arbitrary locations in the cache]
54
Fully Associative Caches
• Ideal situation
• Any memory location can be associated with any cache line
• Cost prohibitive
55
In an n-way set associative cache a block from main memory can go into n (n at least 2) locations in the cache.
[figure: 2-way set-associative cache and main memory]
56
Set Associative Caches
• Direct-mapped caches are 1-way set-associative caches
• For a k-way set-associative cache, each memory region can be associated with k cache lines
• Fully associative is k-way with k the number of cache lines
57
Translation Look-aside Buffer (TLB)
• Translates between the logical address space that each program has and actual memory addresses
• Memory organized in 'small pages', a few Kbyte in size
• Memory requests go through the TLB, normally very fast
• Pages that are not tracked through the TLB can be found through the 'page table': much slower
• -> Jumping between more pages than the TLB can track has a performance penalty
• This illustrates the need for spatial locality
58
Prefetch
• Hardware tries to detect if you load regularly spaced data:
– 'prefetch stream'
– This can sometimes be programmed in software, often only in in-line assembly
59
Data reuse
• Performance is limited by data transfer rate
• High performance if data items are used multiple times
• Examples:
– vector addition x_i = x_i + y_i: 1 op, 3 mem accesses
– inner product s = s + x_i*y_i: 2 ops, 2 mem accesses (s in register; also no writes)
60
Data reuse: matrix-matrix product
• Matrix-matrix product: 2n³ ops, 2n² data
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    s = 0;
    for (k=0; k<n; k++) {
      s = s+a[i][k]*b[k][j];
    }
    c[i][j] = s;
  }
}
Is there any data reuse in this algorithm?
61
Data reuse: matrix-matrix product
• Matrix-matrix product: 2n³ ops, 2n² data
– Data reuse is O(n): every data item is used O(n) times
• If it can be programmed right, this can overcome the bandwidth/CPU speed gap
• Again only theoretically: naive implementations are inefficient... so do not code this yourself: use BLAS (MKL, Atlas, etc.)
• (This is the important kernel in the Linpack benchmark: cf. Top500)
62
63
1.6. Programming strategies for high performance
Figure 1.15: Performance of naive and optimized implementations of the Discrete Fourier Transform
Figure 1.16: Performance of naive and optimized implementations of the matrix-matrix product
• Compilers are not able to extract anywhere close to optimal performance.⁹
• There are autotuning projects for automatic generation of implementations that are tuned to the architecture. This approach can be moderately to very successful. Some of the best known of these projects are Atlas [133] for Blas kernels, and Spiral [112] for transforms.
1.6.10 Cache aware programming
Unlike registers and main memory, both of which can be addressed in (assembly) code, use of caches is implicit. There is no way a programmer can load data explicitly to a certain cache, even in assembly language.
9. Presenting a compiler with the reference implementation may still lead to high performance, since some compilers are trained to recognize this operation. They will then forego translation and simply replace it by an optimized variant.
Reuse analysis: matrix-vector product
y[i] invariant but not reused: arrays get written back to memory, so 2 accesses just for y[i]
for (i=0; i<m; i++) {
  for (j=0; j<n; j++) {
    y[i] = y[i]+a[i][j]*x[j];
  }
}
for (i=0; i<m; i++) {
  s = 0.;
  for (j=0; j<n; j++) {
    s = s+a[i][j]*x[j];
  }
  y[i] = s;
}
s stays in register
64
Reuse analysis (1): matrix-vector product
Reuse of x[j], but the gain is outweighed by multiple load/store of y[i]
for (j=0; j<n; j++) {
  for (i=0; i<m; i++) {
    y[i] = y[i]+a[i][j]*x[j];
  }
}
for (j=0; j<n; j++) {
  t = x[j];
  for (i=0; i<m; i++) {
    y[i] = y[i]+a[i][j]*t;
  }
}
Different behaviour depending on whether the matrix is stored by rows or by columns
65
Reuse analysis (2): matrix-vector product
Loop tiling:
• x is loaded m/2 times, not m
• Register usage for y as before
• Loop overhead halved
• Pipelined operations exposed
• Prefetch streaming
for (i=0; i<m; i+=2) {
  s1 = 0.; s2 = 0.;
  for (j=0; j<n; j++) {
    s1 = s1+a[i][j]*x[j];
    s2 = s2+a[i+1][j]*x[j];
  }
  y[i] = s1; y[i+1] = s2;
}
for (i=0; i<m; i+=4) {
  s1 = 0.; s2 = 0.; s3 = 0.; s4 = 0.;
  for (j=0; j<n; j++) {
    s1 = s1+a[i][j]*x[j];
    s2 = s2+a[i+1][j]*x[j];
    s3 = s3+a[i+2][j]*x[j];
    s4 = s4+a[i+3][j]*x[j];
  }
  y[i] = s1; y[i+1] = s2; y[i+2] = s3; y[i+3] = s4;
}
Matrix stored by columns: now a full cache line of A is used
66
Reuse analysis (3): matrix-vector product
Further optimization: use pointer arithmetic instead of indexing
a1 = &(a[0][0]); a2 = a1+n;
for (i=0,ip=0; i<m/2; i++) {
  s1 = 0.; s2 = 0.;
  xp = &(x[0]);
  for (j=0; j<n; j++) {
    s1 = s1 + *(a1++) * *xp;
    s2 = s2 + *(a2++) * *(xp++);
  }
  y[ip++] = s1; y[ip++] = s2;
  a1 += n; a2 += n;
}
67
Locality
• Programming for high performance is based on spatial and temporal locality
• Temporal locality:
– Group references to one item close together
• Spatial locality:
– Group references to nearby memory items together
68
Temporal Locality
• Use an item, use it again before it is flushed from register or cache:
– Use item
– Use small number of other data
– Use item again
69
Temporal locality: example
Original loop: long time between uses of x. Rearrangement: x is reused
70
for (loop=0; loop<10; loop++) {
  for (i=0; i<N; i++) {
    ... = ... x[i] ...
  }
}
for (i=0; i<N; i++) {
  for (loop=0; loop<10; loop++) {
    ... = ... x[i] ...
  }
}
Spatial Locality
• Use items close together
• Cache lines: if the cache line is already loaded, other elements are 'for free'
• TLB: don't jump more than 512 words too many times
71
Illustration: Cache Size
for (i=0; i<NRUNS; i++)
  for (j=0; j<size; j++)
    array[j] = 2.3*array[j]+1.2;
• If the data fits in L1 cache, the transfer is very fast
• If there is more data, transfer speed from L2 dominates
72
Illustration: Cache Size
for (i=0; i<NRUNS; i++) {
  blockstart = 0;
  for (b=0; b<size/l1size; b++) {
    for (j=0; j<l1size; j++)
      array[blockstart+j] = 2.3*array[blockstart+j]+1.2;
    blockstart += l1size;
  }
}
• Data can sometimes be arranged to fit in cache:
• Cache blocking
73
Illustration: Cache Line Utilization
for (i=0,n=0; i<L1WORDS; i++,n+=stride)
  array[n] = 2.3*array[n]+1.2;
• Same amount of data, but increasing stride
• Increasing stride: more cache lines loaded, slower execution
74
Power Consumption
• Scale all geometrical features by s (s < 1):
– dynamic power consumption P is scaled to s²P
– circuit delay T is scaled to sT
– operating frequency F is changed to F/s
– Energy consumption is scaled by s³, and this gives us the space to put more components on a chip
• However, miniaturization of features is coming to a standstill due to laws of physics
• Increasing frequency would raise heat production
• -> 'Power wall'
79
Power Consumption
80
1.7. Power consumption
The net result is that the dynamic power consumption P is scaled to s²P, circuit delay T is scaled to sT, and operating frequency F is changed to F/s. Correspondingly, the energy consumption is scaled by s³, and this gives us the space to put more components on a chip.
At the time of this writing (circa 2010), miniaturization of components has almost come to a standstill, because further lowering of the voltage would give prohibitive leakage. Conversely, the frequency can not be scaled up since this would raise the heat production of the chip too far. Figure 1.17 gives a dramatic
Figure 1.17: Projected heat dissipation of a CPU if trends had continued – this graph courtesy Pat Gelsinger
illustration of the heat that a chip would give off, if single-processor trends had continued.
One conclusion is that computer design is running into a power wall, where the sophistication of a single core can not be increased any further (so we can for instance no longer increase ILP and pipeline depth) and the only way to increase performance is to increase the amount of explicitly visible parallelism. This development has led to the current generation of multicore processors; see section 1.4. It is also the reason GPUs with their simplified processor design and hence lower energy consumption are attractive; the same holds for Field-Programmable Gate Arrays (FPGAs).
The total power consumption of a parallel computer is determined by the consumption per processor and the number of processors in the full machine. At present, this is commonly several Megawatts. By the above reasoning, the increase in power needed from increasing the number of processors can no longer be offset by more power-effective processors, so power is becoming the overriding consideration as parallel computers move from the petascale (attained in 2008 by the IBM Roadrunner) to a projected exascale.
Multicore Architectures
• 'Power wall' (clock frequency cannot be increased)
• Limits of Instruction Level Parallelism (ILP)
– compiler limitations
– limited amount of intrinsically available parallelism
– branch prediction
• Solution: divide chip into multiple processing 'cores':
– 2 cores at lower frequency can have same throughput as 1 core at higher frequency (breaks power wall)
– discovered ILP replaced by explicit task parallelism, managed by programmer
81
Multicore Architectures
82
1. Sequential Computing
Figure 1.2: Cache hierarchy in a single-core and dual-core chip
With a cache, the assembly code stays the same, but the actual behaviour of the memory system now becomes:
• load x from memory into cache, and from cache into register; operate on it;
• do the intervening instructions;
• request x from memory, but since it is still in the cache, load it from the cache into register; operate on it.
Since loading from cache is faster than loading from main memory, the computation will now be faster. Caches are fairly small, so values can not be kept there indefinitely. We will see the implications of this in the following discussion.
There is an important difference between cache memory and registers: while data is moved into register by explicit assembly instructions, the move from main memory to cache is entirely done by hardware. Thus cache use and reuse is outside of direct programmer control. Later, especially in sections 1.5.2 and 1.6, you will see how it is possible to influence cache use indirectly.
1.3.4.2 Cache levels, speed and size
The caches are called 'level 1' and 'level 2' (or, for short, L1 and L2) cache; some processors can have an L3 cache. The L1 and L2 caches are part of the die, the processor chip, although for the L2 cache that is a recent development; the L3 cache is off-chip. The L1 cache is small, typically around 16Kbyte. Level 2 (and, when present, level 3) cache is more plentiful, up to several megabytes, but it is also slower. Unlike main memory, which is expandable, caches are fixed in size. If a version of a processor chip exists with a larger cache, it is usually considerably more expensive. In multicore chips, the cores typically have some private cache, while there is also shared cache on the processor chip.
18 Introduction to High Performance Scientific Computing
Single core / Dual core
Multi-core chips
• What is a processor? Instead, talk of 'socket' and 'core'
• Cores have separate L1, shared L2 cache
– Hybrid shared/distributed model
• Cache coherency problem: conflicting access to duplicated cache lines
83
Need to study parallel architecture and programming... Starting next week!
Computer Architecture: Parallel Computers
84
The basic idea
• Spread operations over many processors
• If n operations take time t on 1 processor,
• Does this become t/p on p processors (p<=n)?
for (i=0; i<n; i++)
  a[i] = b[i]+c[i];
a = b+c
Idealized version: every process has one array element
85
86
The basic idea
• Spread operations over many processors
• If n operations take time t on 1 processor,
• Does this become t/p on p processors (p<=n)?
for (i=0; i<n; i++)
  a[i] = b[i]+c[i];
a = b+c
for (i=my_low; i<my_high; i++)
  a[i] = b[i]+c[i];
Idealized version: every process has one array element
Slightly less ideal: each processor has part of the array
87
The basic idea (cont'd)
• Spread operations over many processors
• If n operations take time t on 1 processor,
• Does it always become t/p on p processors (p<=n)?
s = sum( a[i], i=0,n-1 )
88
89
The basic idea (cont'd)
• Spread operations over many processors
• If n operations take time t on 1 processor,
• Does it always become t/p on p processors (p<=n)?
s = sum( a[i], i=0,n-1 )
Conclusion: n operations can be done with n/2 processors, in total time log₂ n
Theoretical question: can addition be done faster?
Practical question: can we even do this?
90
/* recursive halving; assumes n is a power of two */
for (s=2; s<=n; s*=2)
  for (i=0; i<n; i+=s)
    a[i] += a[i+s/2];
91
2. Parallel Computing
Figure 2.2: Parallelization of a vector addition
First let us look systematically at communication. We can take the second half of figure 2.2 and turn it into a tree graph (see Appendix A.5) by defining the inputs as leaf nodes, all partial sums as interior nodes, and the root as the total sum. There is an edge from one node to another if the first is input to the (partial) sum in the other. This is illustrated in figure 2.3. In this figure nodes are horizontally aligned with other computations that can be performed simultaneously; each level is sometimes called a superstep in the computation. Nodes are vertically aligned if they are computed on the same processors, and an arrow corresponds to a communication if it goes from one processor to another. The vertical alignment in
Figure 2.3: Communication structure of a parallel vector addition
figure 2.3 is not the only one possible. If nodes are shuffled within a superstep or horizontal level, a different communication pattern arises.
Exercise 2.1. Consider placing the nodes within a superstep on random processors. Show that, if no two nodes wind up on the same processor, at most twice the number of communications is performed from the case in figure 2.3.
Exercise 2.2. Can you draw the graph of a computation that leaves the sum result on each processor? There is a solution that takes twice the number of supersteps, and there is
Some theory
• ...before we get into the hardware
• Optimally, p processes give T_p = T_1/p
• Speedup S_p = T_1/T_p, is p at best
• Superlinear speedup not possible in theory, sometimes happens in practice.
• Perfect speedup in 'embarrassingly parallel' applications
• Less than optimal: overhead, sequential parts, dependencies
92
Some more theory
• ...before we get into the hardware
• Optimally, p processes give T_p = T_1/p
• Speedup S_p = T_1/T_p, is p at best
• Efficiency E_p = S_p/p
• Scalability: efficiency bounded below
93
Scaling
• Increasing the number of processors for a given problem makes sense up to a point: p > n/2 in the addition example has no use
• Strong scaling: problem constant, number of processors increasing
• More realistic: scaling up problem and processors simultaneously, for instance to keep data per processor constant: Weak scaling
• Weak scaling not always possible: problem size depends on measurements or other external factors.
94
Amdahl's Law
• Some parts of a code are not parallelizable
• => they ultimately become a bottleneck
• For instance, if 5% is sequential, you can not get a speedup over 20, no matter p.
• Formally, if F_s is the sequential fraction and F_p the parallelizable fraction (F_p+F_s=1):
– T_p = (sequential) + (parallelized) = T_1*F_s + T_1*F_p/p
• Amdahl's law: T_p = T_1(F_s+F_p/p)
– T_p approaches T_1*F_s as p increases; speedup S_p <= 1/F_s
95
Theoretical characterization of architectures
96
Parallel Computer Architectures
• Parallel computing means using multiple processors, possibly comprising multiple computers
• Flynn's (1966) taxonomy is a first way to classify parallel computers into one of four types:
– (SISD) Single instruction, single data
• Your (old, single core) desktop
– (SIMD) Single instruction, multiple data
• Thinking Machines CM-2, Cray 1, and other vector machines (there's some controversy here)
• Parts of modern GPUs
– (MISD) Multiple instruction, single data
• Special purpose machines
• No commercial, general purpose machines
– (MIMD) Multiple instruction, multiple data
• Nearly all of today's parallel machines, including your laptop
97
SIMD
• Based on regularity of computation: all processors often doing the same operation: data parallel
• Big advantage: processors do not need a separate ALU
• => lots of small processors packed together
• Ex: Goodyear MPP: 64k processors in 1983
• Use masks to let processors differentiate
98
SIMD then and now
• There used to be computers that were entirely SIMD (usually an attached processor to a front end)
• SIMD these days:
– SSE instructions in regular CPUs
– GPUs are SIMD units (sort of)
99
Kinda SIMD: Vector Machines
• Based on a single processor with:
– Segmented (pipelined) functional units
– Needs a sequence of the same operation
• Dominated early parallel market
– overtaken in the 90s by clusters, et al.
• Making a comeback (sort of)
– clusters/constellations of vector machines:
• Earth Simulator (NEC SX6) and Cray X1/X1E
– Arithmetic units in CPUs are pipelined.
100
Remember the pipeline
• Assembly line model (body on frame, attach wheels, doors, handles on doors)
• Floating point addition: exponent align, add mantissas, exponent normalize
• Separate hardware for each stage: pipeline processor
101
102
1.2. Modern floating point units
Figure 1.1: Schematic depiction of a pipelined operation
The operation ∀i: a_{i+1} = a_i·b_i + c_i feeds the result of one iteration (a_i) to the input of the next (a_{i+1} = ...), so the operations are not independent.
A pipelined processor can speed up operations by a factor of 4, 5, 6 with respect to earlier CPUs. Such numbers were typical in the 1980s when the first successful vector computers came on the market. These days, CPUs can have 20-stage pipelines. Does that mean they are incredibly fast? This question is a bit complicated. Chip designers continue to increase the clock rate, and the pipeline segments can no longer finish their work in one cycle, so they are further split up. Sometimes there are even segments in which nothing happens: that time is needed to make sure data can travel to a different part of the chip in time.
The amount of improvement you can get from a pipelined CPU is limited, so in a quest for ever higher performance several variations on the pipeline design have been tried. For instance, the Cyber 205 had separate addition and multiplication pipelines, and it was possible to feed one pipe into the next without data going back to memory first. Operations like ∀i: a_i ← b_i + c·d_i were called 'linked triads' (because of the number of paths to memory, one input operand had to be scalar).
Exercise 1.2. Analyse the speedup and n_{1/2} of linked triads.
Another way to increase performance is to have multiple identical pipes. This design was perfected by the NEC SX series. With, for instance, 4 pipes, the operation ∀i: a_i ← b_i + c_i would be split modulo 4, so that the first pipe operated on indices i = 4·j, the second on i = 4·j + 1, et cetera.
Exercise 1.3. Analyze the speedup and n_{1/2} of a processor with multiple pipelines that operate in parallel. That is, suppose that there are p independent pipelines, executing the same instruction, that can each handle a stream of operands.
MIMD
• Multiple Instruction, Multiple Data
• Most general model: each processor works on its own data with its own data stream: task parallel
• Example: one processor produces data, the next processor consumes/analyzes data
103
MIMD
• In practice SPMD: Single Program Multiple Data:
– all processors execute the same code
– Just not the same instruction at the same time
– Different control flow possible too
– Different amounts of data: load unbalance
104
Granularity
• You saw data parallel and task parallel
• Medium grain parallelism: carve up large job into tasks of data parallel work
• (Example: array summing, each processor has a subarray)
• Good match to hybrid architectures:
– task -> node
– data parallel -> SIMD engine
105
GPU: the miracle architecture (?)
• Lots of hype about incredible speedup / high performance for low cost. What's behind it?
• Origin of GPUs: that 'G'
• Graphics processing: identical (fairly simple) operations on lots of pixels
• Doesn't matter when any individual pixel gets processed, as long as they all get done in the end
• (Otoh, CPU: heterogeneous instructions, need to be done ASAP.)
• => GPU is a SIMD engine
• ...and scientific computing is often very data-parallel
106
GPU programming:
• KernelProc<<< m,n >>>( args )
• Explicit SIMD programming
• There is more: threads (see later)
107
Characterization by Memory Structure
108
Parallel Computer Architectures
• Top500 list now dominated by MPPs and clusters
• The MIMD model 'won'.
• SIMD exists only on smaller scale
• A much more useful classification is by memory model
– shared memory
– distributed memory
109
Two memory models
• Shared memory: all processors share the same address space
– OpenMP: directives-based programming
– PGAS languages (UPC, Titanium, X10)
• Distributed memory: every processor has its own address space
– MPI: Message Passing Interface
110
Shared and Distributed Memory
Shared memory: single address space. All processors have access to a pool of shared memory. (e.g., a single cluster node (2-way, 4-way, ...)) Methods of memory access: bus, distributed switch, crossbar.
Distributed memory: each processor has its own local memory. Must do message passing to exchange data between processors. (examples: Linux clusters, Cray XT3) Methods of memory access: single switch or switch hierarchy with fat tree, etc. topology.
[figures: a shared-memory machine, processors on a bus or crossbar sharing one memory; a distributed-memory machine, each processor with its own memory, connected by a network]
111
Shared Memory: UMA and NUMA
Uniform Memory Access (UMA): each processor has uniform access time to memory; also known as symmetric multiprocessors (SMPs) (example: Sun E25000 at TACC)
Non-Uniform Memory Access (NUMA): time for memory access depends on location of data; also known as distributed shared memory machines. Local access is faster than non-local access. Easier to scale than SMPs (e.g.: SGI Origin 2000)
112
Interconnects
113
Topology of interconnects
• What is the actual 'shape' of the interconnect? Are the nodes connected by a 2D mesh? A ring? Something more elaborate?
• => some graph theory
114
Completely Connected and Star Networks
• Completely connected: each processor has a direct communication link to every other processor
• Star connected network: the middle processor is the central processor; every other processor is connected to it.
115
Arrays and Rings
• Linear Array:
• Ring:
• Mesh Network (e.g. 2D-array)
116
Torus
2-d Torus (2-d version of the ring)
117
Hypercubes
• Hypercube network: a multidimensional mesh of processors with exactly two processors in each dimension. A d-dimensional hypercube consists of p = 2^d processors
• Shown below are 0, 1, 2, and 3D hypercubes
0-D 1-D 2-D 3-D hypercubes
118
Inductive definition
119
Pros and cons of hypercubes
• Pro: processors are close together: never more than log(p) links apart
• Lots of bandwidth
• Little chance of contention
• Con: the number of wires out of a processor depends on p: complicated design
• Values of p other than 2^d not possible.
120
Mapping applications to hypercubes
• Is there a natural mapping from 1, 2, 3D to a hypercube?
• Naive node numbering does not work:
• Nodes 0 and 1 have distance 1, but
• 3 and 4 have distance 3
• (so do 7 and 0)
121
Mapping applications to hypercubes
• Is there a natural mapping from 1, 2, 3D to a hypercube?
• => Gray codes
• Recursive definition: number one subcube, then the other subcube in mirroring order.
[figure: Gray-code node numbering of the 2D hypercube (1 0 / 2 3) and the 3D hypercube (1 0 / 2 3 / 6 7 / 5 4)]
Subsequent processors (in the linear ordering) are all one link apart
Recursive definition:
0 | 1

0 0 | 1 1
0 1 | 1 0

0 0 0 0 | 1 1 1 1
0 0 1 1 | 1 1 0 0
0 1 1 0 | 0 1 1 0
122
Buses, Hubs and Crossbars
Hub/Bus: every processor shares the communication links
Crossbar switches: every processor connects to the switch, which routes communications to their destinations
123
Butterfly exchange network
• Built out of simple switching elements
• Multi-stage; #stages grows with #procs
• Multiple non-colliding paths possible
• Uniform memory access
124
Fat Trees
• Multiple switches
• Each level has the same number of links in as out
• Increasing number of links at each level
• Gives full bandwidth between the links
• Added latency the higher you go
125
Fat Trees
• In practice emulated by a switching network
126
Interconnect graph theory
• Degree
– How many links to other processors does each node have?
– More is better, but also expensive and hard to engineer
• Diameter
– maximum distance between any two processors in the network.
– The distance between two processors is defined as the shortest path, in terms of links, between them.
– 1 for a completely connected network, 2 for a star network, p/2 for a ring (for p even processors)
• Connectivity
– measure of the multiplicity of paths between any two processors (# arcs that must be removed to break the connection).
– high connectivity is desired since it lowers contention for communication resources.
– 1 for linear array, 1 for star, 2 for ring, 2 for mesh, 4 for torus
– technically 1 for traditional fat trees, but there is redundancy in the switch infrastructure
127
Practical issues in interconnects
• Latency: how long does it take to start sending a 'message'? Units are generally microseconds or milliseconds.
• Bandwidth: what data rate can be sustained once the message is started? Units are Mbytes/sec or Gbytes/sec.
– Both point-to-point and aggregate bandwidth are of interest
• Multiple wires: multiple latencies, same bandwidth
• Sometimes shortcuts possible: 'wormhole routing'
128
Measures of bandwidth
• Aggregate bandwidth: total data rate if every processor is sending: total capacity of the wires. This can be very high and quite unrealistic.
• Imagine a linear array with processor i sending to p/2+i: 'contention'
• Bisection bandwidth: bandwidth across the minimum number of wires that would split the machine in two.
129
130
Interconnects
• Bisection width
– Minimum # of communication links that have to be removed to partition the network into two equal halves. Bisection width is
– 2 for ring, sq. root(p) for mesh with p (even) processors, p/2 for hypercube, (p*p)/4 for completely connected (p even).
• Channel width
– # of physical wires in each communication link
• Channel rate
– peak rate at which a single physical wire link can deliver bits
• Channel BW
– peak rate at which data can be communicated between the ends of a communication link
– = (channel width) * (channel rate)
• Bisection BW
– minimum volume of communication found between any 2 halves of the network with an equal # of procs
– = (bisection width) * (channel BW)
131
Bandwidth and Latency
132
                            IB-DDR   10 Gigabit   1 Gigabit
Ping-Pong bandwidth, MB/s     1466         1000       112.5
Exchange bandwidth, MB/s      2659         2073       157.6
Latency, us                   2.01         8.23       46.52