
Page 1

Master Program (Laurea Magistrale) in Computer Science and Networking

High Performance Computing Systems and Enabling Platforms

Marco Vanneschi

Precourse in Computer Architecture Fundamentals

2. Firmware Level

Page 2

Firmware level

• At this level, the system is viewed as a collection of cooperating modules called PROCESSING UNITS (simply: units).

• Each Processing Unit is

– autonomous

• has its own control (self-control capability): it is an active computational entity

– described by a sequential program, called microprogram.

• Cooperation is realized through COMMUNICATIONS

– Communication channels implemented by physical links and a firmware protocol.

• Parallelism among units.


[Figure: units U1, U2, …, Uj, …, Un operating in parallel; each unit Uj consists of a Control Part j and an Operating Part j.]

Page 3

Examples


[Figure: a uniprocessor architecture. The CPU chip contains the Processor, the Memory Management Unit, the Cache Memory and an Interrupt Arbiter; outside the chip, the Main Memory and the I/O units I/O1 … I/Ok.]

In turn, the CPU is a collection of communicating units (on the same chip).

In turn, the Cache Memory can be implemented as a collection of parallel units: instruction and data cache; primary, secondary, tertiary… cache.

In turn, the Main Memory can be implemented as a collection of parallel units: modular memory, interleaved memory, buffer memory, or…

In turn, the Processor can be implemented as a (large) collection of parallel, pipelined units.

In turn, an advanced I/O unit can be a parallel architecture (vector / SIMD coprocessor / GPU / …), or a powerful multiprocessor-based network interface, or….

Page 4

Firmware interpretation of assembly language


[Figure: the hierarchy of levels: Applications, Processes, Assembler, Firmware, Hardware. The firmware level supports uniprocessors (Instruction Level Parallelism), shared memory multiprocessors (SMP, NUMA, …), and distributed memory systems (Cluster, MPP, …).]

• The true architectural level.
• Microcoded interpretation of assembler instructions.
• This level is fundamental (it cannot be eliminated).
• Physical resources: combinatorial and sequential circuit components, physical links.


The firmware interpretation of any assembly language instruction has a decentralized organization, even for simple uniprocessor systems.

The interpreter functionalities are partitioned among the various processing units (belonging to the CPU, memory and I/O subsystems), each unit having its own Control Part and its own microprogram.

Cooperation of units occurs through explicit message passing.

Page 5

Processing Unit model: Control + Operation


PC and PO are automata, implemented by synchronous sequential networks (sequential circuits).

Clock cycle: time interval t for executing a (any) microinstruction. Synchronous model of computation.

[Figure: block diagram of a Processing Unit. The Control Part (PC) and the Operating Part (PO) both receive the clock signal; PO sends the condition variables {x} to PC, and PC sends the control variables g = {a} ∪ {b} to PO. External inputs and external outputs are connected to PO.]

Page 6

Example of a simple unit design


Specification: for each stream element IN, OUT = number_of_occurrences (IN, A)

Pseudo-code of interpreter:
while (true) do
    B = IN; C = 0;
    for (J = 0; J < N; J++)
        if (A[J] == B) then C = C + 1;
    OUT = C

[Figure: unit U with a 32-bit input stream IN and a 32-bit output stream OUT; the integer array A[N] is implemented as a register array (RAM/ROM) inside U.]
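As a software analogy only (the actual firmware implementation is the microprogram developed on the following pages), here is a minimal C sketch of this interpreter loop; read_IN and write_OUT are hypothetical stream primitives standing in for the IN/OUT interfaces, not part of the course material:

#include <stdint.h>

#define N 16                      /* array length: illustrative value */

/* Hypothetical stream primitives modeling the IN/OUT interfaces. */
extern uint32_t read_IN(void);
extern void write_OUT(uint32_t value);

static uint32_t A[N];             /* the integer array A[N] of unit U */

void unit_U(void) {
    for (;;) {                    /* while (true) do */
        uint32_t B = read_IN();   /* B = IN  */
        uint32_t C = 0;           /* C = 0   */
        for (int J = 0; J < N; J++)
            if (A[J] == B)        /* if (A[J] == B)  */
                C = C + 1;        /* then C = C + 1  */
        write_OUT(C);             /* OUT = C */
    }
}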

Page 7

Approach to processing unit design

• PO structure is composed of
  – Registers: to implement variables
      – independent registers,
      – register array: ROM/RAM memory
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others


Page 8

Synchronized registers


Behavior: when clock do R_output = R_input

[Figure: register R with input R_input and output R_output = R_present_state; the clock (pulse signal) triggers the writing.]

In PO, the clock signal is AND-ed with a control variable (b), in order to enable register writing explicitly from PC (write-enable signal Register_writing).

Register output and state remain STABLE during a clock cycle.

Page 9


Example of PO structure

[Figure: the PO data path of unit U. 32-bit registers B, C, J, OUT with write-enable signals bB, bC, bJ, bOUT from PC, and a 1-bit register Z with enable bZ; the register array A[N] (RAM/ROM, N words, 32 bits/word), addressed by Jm and producing the output outA; multiplexers KC, KJ, KALU1 driven by control variables aC, aJ, aAlu1, aAlu2 from PC; the arithmetic-logic combinatorial circuits ALU1 (operations –, +1) and ALU2 (operation +1). Condition variables to PC: x0 = J0, x1 = Z.]

Page 10

Approach to processing unit design

• PO structure is composed of
  – Registers: to implement variables
      – independent registers,
      – register array: ROM/RAM memory
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations.
  – Example: A + B → C
  – i.e., the contents of registers A and B are added, and the result is written into register C.


Page 11


Example of micro-operation

[Figure: the PO structure of Page 9, highlighting the data path for the execution of zero (A[J] – B) → Z: the control variables select outA and B as ALU1 inputs and the subtraction operation, and enable only the writing of Z.]

Executed in one clock cycle: all the interested combinatorial circuits become stable during the clock cycle.

Page 12

Approach to processing unit design

• PO structure is composed of
  – Registers: to implement variables
      – independent registers,
      – register array: ROM/RAM memory
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations.
  – Example: A + B → C
  – i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations.
  – Example: A + B → C, D – E → D


Page 13


Example of parallel micro-operation

[Figure: the PO structure, highlighting the data path for the execution of J + 1 → J, C + 1 → C: ALU1 and ALU2 operate simultaneously, and the write enables of both J and C are asserted.]

These register-transfer operations are logically independent, and they can actually be executed in parallel if proper PO resources are provided (e.g., two ALUs). Both operations are executed during the same clock cycle.

Page 14

Approach to processing unit design

• PO structure is composed of
  – Registers: to implement variables
      – independent registers,
      – register array: ROM/RAM memory
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations. Example: A + B → C
  – i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations. Example: A + B → C, D – E → F
• Conditional microinstructions: if … then … else, switch case …
• PC: at each clock cycle it interacts with PO in order to control the execution of the current microinstruction.


Page 15

Example of microinstruction

i. if (Asign = 0)
       then A + B → A, J + 1 → J, goto j
       else A – B → B, goto k
j. …
k. …


Here i is the label of the current microinstruction; (Asign = 0) is a logical condition on one or more condition variables (Asign denotes the most significant bit of register A); each branch contains a (parallel) micro-operation followed by the label of the next microinstruction.

The Control Part (PC) is an automaton, whose state graph corresponds to the structure of the microprogram:

[Figure: state graph with nodes i, j, k. Arc from i to j: Asign = 0, execute A + B → A, J + 1 → J in PO, and goto j. Arc from i to k: Asign = 1, execute A – B → B in PO, and goto k.]
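To illustrate this correspondence between microprogram and automaton, a small C sketch of one PC step (the enum and function names are mine, purely illustrative):

/* One PC transition per clock cycle: from the current state and the
   condition variable Asign, select the micro-operation ordered to PO
   and the next state. */
typedef enum { STATE_I, STATE_J, STATE_K } pc_state;

pc_state pc_step(pc_state s, int Asign) {
    switch (s) {
    case STATE_I:
        if (Asign == 0) {
            /* order PO: A + B -> A, J + 1 -> J */
            return STATE_J;              /* goto j */
        } else {
            /* order PO: A - B -> B */
            return STATE_K;              /* goto k */
        }
    default:
        /* microinstructions j and k would be handled here */
        return s;
    }
}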

Page 16

Approach to processing unit design

• PO structure is composed of
  – Registers: to implement variables
      – independent registers,
      – register array: ROM/RAM memory
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations. Example: A + B → C
  – i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations. Example: A + B → C, D – E → F
• Conditional microinstructions: if … then … else, switch case …
• PC: at each clock cycle it interacts with PO in order to control the execution of the current microinstruction.
• Each microinstruction is executed in a clock cycle.


Page 17

Clock cycle: informal meaning


[Figure: timing diagram over one clock cycle t. PC generates the proper values of the control variables; PO executes input_C = A + B; at the clock pulse (pulse width: stabilization of register contents) the transfer input_C → C takes place in register C. For A + B → C, D + 1 → D executed in parallel in the same clock cycle, PO computes input_C = A + B and input_D = D + 1, then C = input_C and D = input_D at the pulse.]

Clock frequency f = 1/t; e.g. t = 0.25 ns, f = 4 GHz.

Clock cycle t = maximum stabilization delay for the execution of a (any) microinstruction, i.e. for the stabilization of the inputs of all the PC and PO registers.

Page 18

Clock cycle: formally


[Figure: PC and PO exchange the condition variables {x} and the control variables {g}; the relevant delays are TwPC, TsPC, TwPO, TsPO and the pulse width d.]

Clock cycle: t = TwPO + max (TwPC + TsPO, TsPC) + d

where wA is the output function of automaton A, sA is the state transition function of automaton A, and TF is the stabilization delay of the combinatorial network implementing function F.
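As a purely illustrative numeric instance of the formula (the delay figures below are assumptions, not values from the course): with TwPO = 2 ns, TwPC = 1 ns, TsPO = 3 ns, TsPC = 2 ns and d = 0.5 ns, the clock cycle is t = 2 + max(1 + 3, 2) + 0.5 = 6.5 ns.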

Page 19

Example of a simple unit design


Specification: for each stream element IN, OUT = number_of_occurrences (IN, A)

Pseudo-code of interpreter:
while (true) do
    B = IN; C = 0;
    for (J = 0; J < N; J++)
        if (A[J] == B) then C = C + 1;
    OUT = C

Microprogram (= interpreter implementation):

0. IN → B, 0 → C, 0 → J, goto 1
1. if (Jmost = 0) then zero (A[Jmodule] – B) → Z, goto 2
   else C → OUT, goto 0
2. if (Z = 0) then J + 1 → J, goto 1
   else J + 1 → J, C + 1 → C, goto 1

// to be completed with synchronized communications //

[Figure: unit U with 32-bit streams IN and OUT and the integer array A[N] implemented as a register array (RAM/ROM); Control Part state graph (automaton) with states 0, 1, 2: state 0 is traversed 1 time per stream element, states 1 and 2 N times each (Jmost = 0), plus the final traversal of state 1 (Jmost = 1) 1 time.]

Service time of U: Tcalc = k t = (2 + 2N) t ≈ 2N t

(k counts the clock cycles per stream element: microinstruction 0 once, microinstructions 1 and 2 N times each, plus the final microinstruction 1 once, i.e. k = 2 + 2N.)

Page 20


PO structure

[Figure: the PO schematic of unit U, as on Page 9: 32-bit registers B, C, J, OUT and 1-bit register Z with write enables from PC; register array A[N] (RAM/ROM, N words, 32 bits/word) with output outA; multiplexers; ALU1 (–, +1) and ALU2 (+1); condition variables x0 = J0 and x1 = Z to PC.]

Page 21


PO structure

[Figure: the PO schematic, highlighting the data path for the execution of J + 1 → J, C + 1 → C.]

Page 22


PO structure

[Figure: the PO schematic, highlighting the data path for the execution of zero (A[J] – B) → Z.]

Page 23

Service time and other performance parameters

• Computations on streams
• Stream: (possibly infinite) sequence of values of the same type, also called tasks
• Service time: average time interval between the beginnings of the processing of two consecutive tasks, i.e. between two consecutive services
• Latency: average duration of the processing of a single task (from input acquisition to result generation): calculation time
  – Service time is equal to latency for sequential computations only (a single module, or unit)
• Processing Bandwidth = 1/service_time: average number of processed tasks per second (e.g. images per second, operations per second, instructions per second, bits transmitted per second, …); also called throughput for generic applications, or performance for CPUs (instructions or floating-point operations per second)
• Completion time, for a finite-length stream of m tasks: time interval needed to fully process all the tasks. For large m and parallel computations, the following approximate relation often holds:

completion_time ≈ m × service_time

The equality is rigorous for sequential computations.
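A tiny numeric sketch of these definitions in C (the figures are invented for illustration):

#include <stdio.h>

int main(void) {
    double service_time = 2e-6;                /* average seconds per task */
    double m = 1e6;                            /* finite stream length */

    double bandwidth = 1.0 / service_time;     /* tasks per second */
    double completion = m * service_time;      /* approximation for large m */

    printf("bandwidth  = %.3g tasks/s\n", bandwidth);
    printf("completion = %.3g s (~ m * service_time)\n", completion);
    return 0;
}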

Page 24

Communications at the firmware level


Dedicated links (one-to-one, unidirectional).

Shared link: bus (many-to-many, bidirectional). It can support one-to-one, one-to-many, and many-to-one communications.

In general, for computer structures: parallel links (e.g. one or more words per link).

Valid for:
• uniprocessor computers,
• shared memory multiprocessors,
• some multicomputers with distributed memory (MPP).

Not valid for:
• distributed architectures with standard communication networks,
• some kinds of simple I/O.

Page 25

Basic firmware structure for dedicated links


[Figure: Usource connected to Udestination through a dedicated LINK, with an output interface on the source side and an input interface on the destination side.]

Usource::
while (true) do
    < produce message MSG >;
    < send MSG to Udestination >

Udestination::
while (true) do
    < receive MSG from Usource >;
    < process MSG >

Page 26

Ready – acknowledgement protocol


[Figure: U1's interface register OUT connected to U2's interface register IN by the MSG link (n bits), plus the RDY line (1 bit) from U1 to U2 and the ACK line (1 bit) from U2 to U1.]

< send MSG > ::
    wait ACK;
    write MSG into OUT, signal RDY

< receive MSG > ::
    wait RDY;
    utilize IN, signal ACK, …

Asynchronous communication: asynchrony degree = 1
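A minimal software simulation of this handshake, assuming two POSIX threads stand in for Usource and Udestination (the flag and function names are mine; the real firmware mechanism uses 1-bit interface indicators, not threads):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int RDY = 0, ACK = 1;   /* ACK = 1: interface initially free */
static int OUT_reg;                   /* models the OUT/IN interface register */

static void *usource(void *arg) {
    (void)arg;
    for (int msg = 0; msg < 5; msg++) {
        while (!atomic_load(&ACK)) ;  /* wait ACK */
        OUT_reg = msg;                /* write MSG into OUT */
        atomic_store(&ACK, 0);
        atomic_store(&RDY, 1);        /* signal RDY */
    }
    return NULL;
}

static void *udestination(void *arg) {
    (void)arg;
    for (int i = 0; i < 5; i++) {
        while (!atomic_load(&RDY)) ;       /* wait RDY */
        printf("received %d\n", OUT_reg);  /* utilize IN */
        atomic_store(&RDY, 0);
        atomic_store(&ACK, 1);             /* signal ACK */
    }
    return NULL;
}

int main(void) {
    pthread_t s, d;
    pthread_create(&s, NULL, usource, NULL);
    pthread_create(&d, NULL, udestination, NULL);
    pthread_join(s, NULL);
    pthread_join(d, NULL);
    return 0;
}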

Page 27

Implementation


[Figure: U1 and U2 connected by the MSG link (n bits). U1 holds the interface register OUT and the 1-bit indicators RDYOUT, ACKOUT (write enables bRDYOUT, bACKOUT); U2 holds the interface register IN and the 1-bit indicators RDYIN, ACKIN (write enables bRDYIN, bACKIN); the RDY and ACK lines are 1 bit each.]

< send MSG > ::
    wait until ACKOUT;
    MSG → OUT, set RDYOUT, reset ACKOUT

< receive MSG > ::
    wait until RDYIN;
    f(IN, …) → …, set ACKIN, reset RDYIN

Level-transition interface: it detects a level transition on the input line, provided that the transition occurs when S = Y.

[Figure: the RDY line and element Y feed an exor whose output is the indicator S = RDYIN; Y is a modulo-2 counter driven by the clock.]

set X: complements the content of X (a level transition is produced onto the output link);
reset X: forces the output of X to zero (by forcing element Y to become equal to S).

Page 28

Example of a simple unit design with communication


Specification: for each stream element IN, OUT = number_of_occurrences (IN, A)

Pseudo-code of interpreter:
while (true) do
    receive (IN, B); C = 0;
    for (J = 0; J < N; J++)
        if (A[J] == B) then C = C + 1;
    send (OUT, C)

[Figure: unit U with 32-bit links IN and OUT; the input interface carries RDYIN and ACKIN, the output interface RDYOUT and ACKOUT; the array A[N] is internal.]

Microprogram (= interpreter implementation):

0. if (RDYIN = 0) then nop, goto 0
   else reset RDYIN, set ACKIN, IN → B, 0 → C, 0 → J, goto 1
1. case (Jmost ACKOUT = 0 –) do zero (A[Jmodule] – B) → Z, goto 2
   case (Jmost ACKOUT = 1 0) do nop, goto 1
   case (Jmost ACKOUT = 1 1) do reset ACKOUT, set RDYOUT, C → OUT, goto 0
2. if (Z = 0) then J + 1 → J, goto 1
   else J + 1 → J, C + 1 → C, goto 1

Page 29

Impact of communication protocol on performance parameters


[Figure: Control Part state graph (automaton) with states 0, 1, 2. State 0 has a self-loop for RDYIN = 0 and moves to state 1 when RDYIN = 1; state 1 moves to state 2 when Jmost = 0, has a self-loop when Jmost = 1 and ACKOUT = 0, and returns to state 0 when Jmost = 1 and ACKOUT = 1. States 1 and 2 are traversed N times per stream element, the others 1 time each.]

Synchronization implies wait conditions (wait RDYIN, wait ACKOUT), corresponding to arcs with the same source and destination node (stable internal states) in the state graph.

• In the evaluation of the internal calculation time of U, these wait conditions are not to be taken into account.
• That is, we are interested in evaluating the processing capability of U independently of the current situation of the partner units: as if the source and destination units were always ready to communicate.
• On the other hand, the communication protocol has “no cost”, i.e. no overhead.
• Thus, the calculation time of U is again Tcalc = 2Nt.
• However, communication might have an impact on the service time.

Page 30

Communication: performance parameters

• Because of the communication impact, we distinguish between:
  – Ideal service time = internal calculation time
  – Effective service time, or simply service time
• Overhead of the RDY-ACK protocol at the firmware level, for dedicated links: null for the calculation time (ideal service time)!
  – all the operations required by the RDY-ACK protocol are executed in a single clock cycle, in parallel with the operations of the true computation.
• Communication latency: time needed to complete a communication, i.e. up to the ACK reception.
• Service time: considering the effect of communication, it depends on the calculation time and on the communication latency:
  – once a send operation has been executed, the sender can execute a new send operation only after the communication latency interval (not before).


Page 31

Communication latency and impact on service time


[Figure: sender and receiver timelines. The sender alternates clock cycles for calculation only with a clock cycle for calculation and communication; each RDY or ACK traversal of the link takes the transmission latency Ttr, plus a clock cycle t at each end.]

Transmission latency (link only): Ttr

Communication latency: Lcom = 2 (Ttr + t)

Tcom = communication time NOT overlapped with (i.e. not masked by) the internal calculation time Tcalc.

Service time: T = Tcalc + Tcom; if Tcalc ≥ Lcom, then Tcom = 0.

Page 32

“Coarse grain” and “fine grain” computations

• Two possible situations: Tcalc ≥ Lcom or Tcalc < Lcom


(Coarse grain) Tcalc ≥ Lcom: Tcom = 0, service time T = Tcalc.

(Fine grain) Tcalc < Lcom: Tcom > 0, service time T = Lcom.

In both cases:

T = Tcalc + Tcom = max (Tcalc, Lcom)
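This cost model in one C function (the names are mine; the Ttr = 9t figure anticipates the example on the next page):

#include <stdio.h>

/* Effective service time: T = max(Tcalc, Lcom), with Lcom = 2 * (Ttr + t). */
static double service_time(double Tcalc, double Ttr, double t) {
    double Lcom = 2.0 * (Ttr + t);
    return (Tcalc >= Lcom) ? Tcalc : Lcom;   /* coarse vs fine grain */
}

int main(void) {
    /* Times measured in clock cycles t; Tcalc = 2 N t as in the example. */
    for (int N = 2; N <= 20; N += 6)
        printf("N = %2d: T = %4.1f t\n", N, service_time(2.0 * N, 9.0, 1.0));
    return 0;
}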

Page 33

In the example

Tcalc = 2Nt

Let Ttr = 9t. Then Lcom = 2 (t + Ttr) = 20t.

For N ≥ 10: Tcalc ≥ Lcom. In this case Tcom = 0 and the service time is T = Tcalc = 2Nt, i.e. a “coarse grain” calculation (wrt communications). However, for relatively low N, provided that N ≥ 10, it is a “fine grain” computation (e.g. a few tens of clock cycles) still able to fully mask communications.

For N < 10: Tcalc < Lcom, hence Tcom > 0 and the service time is T = Lcom = 20t, i.e. a “too fine” grain calculation.


Page 34

Communication latency and impact on service time


Recap: communication latency Lcom = 2 (Ttr + t); service time T = Tcalc + Tcom; if Tcalc ≥ Lcom, then Tcom = 0, efficiency ≈ 1, ideal processing bandwidth (throughput).

• The impact of Ttr on the service time T is significant
  – e.g. memory access time = response time of the “external” memory to processor requests (“external” = outside the CPU chip)
• Inside the same chip: Ttr ≈ 0,
  – e.g. between processor, MMU and cache
  – except for on-chip interconnection structures with “long wires” (e.g. buses, rings, meshes, …)
• With Ttr ≈ 0, very “fine grain” computations are possible without communication penalty on the service time: Lcom = 2t; if Tcalc ≥ 2t, then T = Tcalc
• Important application in Instruction Level Parallelism processors and in Interconnection Network switch units.

Page 35

Request-reply communication

Example: a simple cooperation between a processor P and a memory unit M (read or write request from P to M, reply from M to P).


[Figure: request-reply timeline between P and M: a clock cycle t in P, the transmission latency Ttr on the link, the memory service time tM in M, then Ttr and t again on the way back.]

P-M request-reply latency: ta = 2 (t + Ttr) + tM

This is the memory access time as “seen” by the processor.
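As a purely illustrative instance (the figures are assumptions, consistent in order of magnitude with the next page): with Ttr = 10 t and tM = 50 t, the access latency is ta = 2 (t + 10 t) + 50 t = 72 t, i.e. one memory access costs tens of processor clock cycles.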

Page 36

Integrated hardware technologies

• Integration on silicon semiconductor CHIPS

• Chip area vs clock cycle:
  – Each logical HW component takes up a given area on silicon (currently: features of the order of tens of nanometers).
  – Increasing the number of components per chip, i.e. increasing the chip area, is positive provided that components are intensively “packed”.
  – Relatively long links on chip are the main source of inefficiency.
  – Some macro-components are more suitable for intensive packing (notably: memories); others are very critical (e.g. large combinatorial networks realized by AND-OR gates).
  – Pin count problem: increasing the number of interfaces increases the chip perimeter linearly, thus the chip area increases quadratically. Many interfaces (e.g. over 10^2 to 10^3 pins) are motivated only if the area increase is properly exploited by a quadratic increase in intensively packed HW components.

• Technology of CPUs vs technology of memories and links:
  – CPU clock cycle: t ≈ 10^0 ns (frequency: a few GHz)
  – Static Main Memory clock cycle: tM ≈ 10^1 to 10^2 t
  – Dynamic Main Memory clock cycle: tM ≈ 10^2 to 10^4 t
  – Inter-chip link transmission latency: Ttr ≈ 10^0 to 10^2 t
  – Thus, ta is one or more orders of magnitude greater than the CPU clock cycle.


Page 37

The processor-memory gap (Von Neumann bottleneck)

• For every (known) technology, the ratio ta/t is increasing at every technological improvement– the frequency increase of CPU clock is greater than the frequency increase of

memory clock and of links (inverse of unit latency of links).

• This is the central problem of computer architecture(yesterday, today, tomorrow)– SOLUTIONS ARE BASED ON MEMORY HIERACHIES AND ON PARALLELISM

• “Moore law” (the CPU frequency doubles every 1.5 years) hasbeen valid until 2004 (we’ll see why). Memory and linkstechnologies has been “stopped” too.– Multi-core evolution / revolution.

– The Moore law is now applied to the number of cores (CPUs) in multi-coreand many-core chips.

– What about programming ? One main motivation of SPA and SPM courses.


Page 38

Interconnection structures based on dedicated links: first examples


Trees: interconnection of a single input stream and of a single output stream to a collection of n independent units.

[Figure: a tree of E nodes distributing the input stream to units U1 … U4, and a tree of C nodes collecting the output stream.]

Communication Latency = O(lg n)

Rings / Tori: Communication Latency = O(n)

[Figure: units U1, U2, U3, U4 connected in a ring.]

Page 39


Interconnection structures based on dedicated links: first examples

Meshes (possibly toroidal):

[Figure: a 3×3 mesh of units U.]

Communication Latency = O(n^1/2)
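A small C sketch comparing the three orders of magnitude (hop counts only, constants omitted, values purely illustrative):

#include <math.h>
#include <stdio.h>

int main(void) {
    int sizes[] = {16, 64, 256, 1024};   /* number of units n */
    for (int i = 0; i < 4; i++) {
        int n = sizes[i];
        printf("n = %4d:  tree O(lg n) ~ %4.0f   ring O(n) ~ %4d   mesh O(n^1/2) ~ %4.0f\n",
               n, log2((double)n), n, sqrt((double)n));
    }
    return 0;
}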

Page 40

Interconnection structures based on dedicated links

• In general, all the most interesting interconnection structures for
  – shared memory multiprocessors
  – distributed memory multicomputers, clusters, MPP
  are “limited degree” networks
  – based on dedicated links
  – with a limited number of links per unit: PIN COUNT problem
• Examples (Course Part 2):
  – n-dimension cubes
  – n-dimension butterflies
  – n-dimension trees / fat trees
  – rings and multi-rings
• Currently the “best” commercial networks belong to these classes:
  – Infiniband
  – Myrinet
  – …
• Trend: on-chip optical interconnection networks


Page 41

(Traditional, old-style) Bus


[Figure: sender behaviour and receiver behaviour on the bus.]

Shared, bidirectional link. At most one message at a time can be transmitted.

Arbitration mechanisms must be provided.

Serious bandwidth problems (contention) arise in bus-based parallel structures.

Page 42

Arbiter mechanisms (for bus writing)


[Figure: Independent Request arbiter scheme.]

Page 43

Arbiters

• Daisy Chaining

• Polling (similar)


Page 44

Decentralized arbiters

• Token ring

• Non-deterministic with collision detection


Page 45

Firmware: summary


[Figure: the hierarchy of levels: Applications, Processes, Assembler, Firmware, Hardware. The firmware level supports uniprocessors (Instruction Level Parallelism), shared memory multiprocessors (SMP, NUMA, …), and distributed memory systems (Cluster, MPP, …).]

• The true architectural level.
• Microcoded interpretation of assembler instructions.
• This level is fundamental (it cannot be eliminated).
• Physical resources: combinatorial and sequential circuit components, physical links.

• Decentralized control of the firmware interpreter
• Synchronous computation model
• Clock cycle
• Parallelism inside microinstructions
• Clear cost model for calculation and for communication
• Efficient communications can be implemented on dedicated links
• Overlapping of communication and internal calculation