Multiprocessors
Advanced Computer Architecture, UNIT 4
Flynn's classification
Vector computers
Pipelining in Vector computers
Cray
Multiprocessor interconnection
General purpose Multiprocessor
Data Flow Computers
The Big Picture: Where are We Now?
The major issue is this:
We’ve taken copies of the contents of main memory and put them in caches closer to the processors. But what happens to those copies if someone else wants to use the main memory data?
How do we keep all copies of the data in sync with each other?
The Multiprocessor Picture
Example: Pentium System Organization
[Figure: processor/memory bus bridged to a PCI bus, which connects to the I/O buses]
CS 284a, 7 October 97 Copyright (c) 1997-98, John Thornley 4
Why Buy a Multiprocessor?
• Multiple users.
• Multiple applications.
• Multitasking within an application.
• Responsiveness and/or throughput.
Multiprocessor Architectures
• Message-Passing Architectures
  – Separate address space for each processor.
  – Processors communicate via message passing.
• Shared-Memory Architectures
  – Single address space shared by all processors.
  – Processors communicate by memory reads/writes.
  – SMP or NUMA.
  – Cache coherence is an important issue.
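As a rough sketch of the two communication styles, using Python threads as a stand-in for processors (the queue and the shared list are illustrative models, not a real interconnect):

```python
import threading
import queue

# Message passing: the ONLY way to move data between the two
# "processors" is an explicit send (put) and receive (get).
def producer(q):
    q.put(42)                    # send

def consumer(q, out):
    out.append(q.get())          # receive (blocks until a message arrives)

q = queue.Queue()
result = []
t1 = threading.Thread(target=producer, args=(q,))
t2 = threading.Thread(target=consumer, args=(q, result))
t1.start(); t2.start()
t1.join(); t2.join()

# Shared memory: both "processors" read and write the same location
# directly; communication is implicit in the memory accesses.
shared = [0]
def writer():
    shared[0] = 42               # an ordinary write, visible to all threads

w = threading.Thread(target=writer)
w.start(); w.join()
```

The message-passing version works even if the two threads live in separate address spaces (replace the queue with a network socket); the shared-memory version depends on both seeing the same storage.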
Message-Passing Architecture
[Figure: message-passing architecture — several nodes, each a processor with its own cache and private memory, connected by an interconnection network]
Shared-Memory Architecture
[Figure: shared-memory architecture — processors 1..N, each with a cache, connected through an interconnection network to memories 1..M in a single shared address space]
Shared-Memory Architecture: SMP and NUMA
• SMP = Symmetric Multiprocessor
  – All memory is equally close to all processors.
  – Typical interconnection network is a shared bus.
  – Easier to program, but doesn't scale to many processors.
• NUMA = Non-Uniform Memory Access
  – Each memory is closer to some processors than others.
  – a.k.a. "Distributed Shared Memory".
  – Typical interconnection is a grid or hypercube.
  – Harder to program, but scales to more processors.
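A toy latency model makes the SMP/NUMA distinction concrete. The 40/100-cycle numbers are illustrative only (they echo the large-scale MP figures later in this unit):

```python
# Toy NUMA model: each processor has a "home" memory node; a local
# access is cheap, a remote one pays an interconnect penalty.
LOCAL_CYCLES = 40    # illustrative, not from any specific machine
REMOTE_CYCLES = 100

def numa_cost(proc_node, mem_node):
    """Access cost when memory is distributed across nodes."""
    return LOCAL_CYCLES if proc_node == mem_node else REMOTE_CYCLES

def uma_cost(proc_node, mem_node):
    """In an SMP (UMA) machine, every access costs the same."""
    return LOCAL_CYCLES
```

This is why data placement matters on NUMA but not on SMP: on NUMA, putting a processor's working set in its home memory avoids the remote penalty entirely.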
Shared Memory Multiprocessor
[Figure: four processors, each with registers and caches, connected through a chipset to a centralized memory and to disk & other I/O]
• Memory: centralized, with Uniform Memory Access time ("UMA"), bus interconnect, and shared I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro
Shared Memory Multiprocessor
• Several processors share one address space
  – conceptually a shared memory
  – often implemented just like a multicomputer: the address space is distributed over private memories
• Communication is implicit
  – read and write accesses to shared memory locations
• Synchronization is via shared memory locations
  – spin waiting for non-zero
  – barriers

[Figure: conceptual model — processors P connected by a network/bus to a shared memory M]
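The two synchronization idioms above can be sketched with Python threads (a model only: real hardware spins on a cached memory location, not a Python list):

```python
import threading

# Spin waiting: the consumer loops until a shared flag becomes non-zero,
# which tells it the producer has published its data.
flag = [0]
data = [None]

def producer():
    data[0] = 99     # write the data first...
    flag[0] = 1      # ...then set the flag to release the consumer

def consumer(out):
    while flag[0] == 0:   # spin until non-zero
        pass
    out.append(data[0])

out = []
c = threading.Thread(target=consumer, args=(out,))
p = threading.Thread(target=producer)
c.start(); p.start()
c.join(); p.join()

# Barrier: no thread proceeds past wait() until all four have arrived.
barrier = threading.Barrier(4)
arrived = []

def worker(i):
    arrived.append(i)
    barrier.wait()   # block here until all 4 workers reach this point

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

On real shared-memory hardware the flag idiom only works because cache coherence propagates the producer's write to the consumer's cached copy — which is exactly the problem the rest of this section addresses.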
Message Passing Multicomputers
• Computers (nodes) connected by a network
  – Fast network interface
    • Send, receive, barrier
  – Nodes are no different from a regular PC or workstation
• Cluster of conventional workstations or PCs with a fast network: "cluster computing"
  – Berkeley NOW
  – IBM SP2

[Figure: nodes, each a processor P with its own private memory M, connected by a network]
Large-Scale MP Designs
[Figure: large-scale MP design — nodes on a low-latency, high-reliability interconnect; memory access latencies of 40 and 100 cycles shown]
• Memory: distributed, with non-uniform memory access time ("NUMA") and a scalable interconnect (distributed memory)
Shared Memory Architectures
In this section we will examine the issues around:
• Sharing one memory space among several processors.
• Maintaining coherence among several copies of a data item.
The Problem of Cache Coherency
[Figure: CPU with cache (copies A′, B′) connected to memory (A, B) and an I/O device]
a) Cache and memory coherent: A′ = A = 100, B′ = B = 200.
b) Cache and memory incoherent: the CPU has written A′ = 550, but memory still holds A = 100, so A′ ≠ A; an I/O output of A gives the stale value 100.
c) Cache and memory incoherent: I/O has input 440 to B in memory, but the cache still holds B′ = 200, so B′ ≠ B.
Some Simple Definitions
Mechanism: Write Back
• How it works: write modified data from cache to memory only when necessary.
• Performance: good, because it doesn't tie up memory bandwidth.
• Coherency issues: can have problems with various copies containing different values.

Mechanism: Write Through
• How it works: write modified data from cache to memory immediately.
• Performance: not so good – uses a lot of memory bandwidth.
• Coherency issues: modified values are always written to memory; data always matches.
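A minimal single-line cache model (one address, no replacement — a sketch, not a real cache) shows why write-back can leave memory stale until a flush, while write-through cannot:

```python
# Toy model of one cache line under the two write policies.
class CacheLine:
    def __init__(self, memory, addr, write_back):
        self.memory = memory
        self.addr = addr
        self.value = memory[addr]   # line filled from memory
        self.write_back = write_back
        self.dirty = False

    def write(self, value):
        self.value = value
        if self.write_back:
            self.dirty = True                # memory is now stale
        else:
            self.memory[self.addr] = value   # write-through: update memory now

    def flush(self):
        # Write-back's "only when necessary": e.g. on eviction.
        if self.dirty:
            self.memory[self.addr] = self.value
            self.dirty = False

mem_wt = {0x100: 1}
wt = CacheLine(mem_wt, 0x100, write_back=False)
wt.write(5)              # memory sees 5 immediately

mem_wb = {0x100: 1}
wb = CacheLine(mem_wb, 0x100, write_back=True)
wb.write(5)
stale = mem_wb[0x100]    # still the old value 1: the incoherency window
wb.flush()               # memory catches up only now
```

The window between `wb.write(5)` and `wb.flush()` is exactly scenario (b) in the cache coherency figure: an I/O device reading memory in that window sees the stale value.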
What Does Coherency Mean?
• Informally:
  – "Any read must return the most recent write"
  – Too strict and too difficult to implement
• Better:
  – "Any write must eventually be seen by a read"
  – All writes are seen in proper order ("serialization")
• Two rules to ensure this:
  – If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart
  – Writes to a single location are serialized: seen in one order
    • The latest write will be seen
    • Otherwise reads could see writes in illogical order (an older value after a newer value)
Vector Computers
• Vector Processing Overview
• Vector Metrics, Terms
• Greater Efficiency than Superscalar Processors
• Examples
  – CRAY-1 (1976, 1979): 1st vector-register supercomputer
  – Multimedia extensions to high-performance PC processors
  – Modern multi-vector-processor supercomputer – NEC ESS
• Design Features of Vector Supercomputers
• Conclusions
Vector Programming Model
[Figure: vector programming model]
• Scalar registers r0–r15; vector registers v0–v15, each holding elements [0], [1], …, [VLRMAX-1]
• Vector Length Register (VLR)
• Vector arithmetic instructions, e.g. ADDV v3, v1, v2: elementwise v3[i] = v1[i] + v2[i] for i = 0 … VLR-1
• Vector load/store instructions, e.g. LV v1, r1, r2: load from memory starting at base address r1 with stride r2 into vector register v1
Vector Code Example
# C code
for (i=0; i<64; i++)
    C[i] = A[i] + B[i];

# Scalar Code
      LI R4, 64
loop: L.D F0, 0(R1)
      L.D F2, 0(R2)
      ADD.D F4, F2, F0
      S.D F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ R4, loop

# Vector Code
      LI VLR, 64
      LV V1, R1
      LV V2, R2
      ADDV.D V3, V1, V2
      SV V3, R3
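The vector code above can be modeled in Python, with vector registers as lists and LV/ADDV/SV as single operations over VLR elements. The function names mirror the assembly mnemonics; the memory layout (A, B, C at word offsets 0, 64, 128) is invented for illustration:

```python
# Model: memory is a flat list of words; a vector register is a list
# of up to VLRMAX elements; each "instruction" is one Python call.
VLRMAX = 64

def LV(memory, base, n):
    """Vector load: n contiguous (unit-stride) elements from base."""
    return memory[base:base + n]

def ADDV(v1, v2):
    """One vector instruction = n independent elementwise adds."""
    return [a + b for a, b in zip(v1, v2)]

def SV(memory, base, v):
    """Vector store back to memory."""
    memory[base:base + len(v)] = v

mem = list(range(200))          # mem[i] = i, for a checkable result
A_base, B_base, C_base = 0, 64, 128

v1 = LV(mem, A_base, VLRMAX)    # LV V1, R1
v2 = LV(mem, B_base, VLRMAX)    # LV V2, R2
v3 = ADDV(v1, v2)               # ADDV.D V3, V1, V2
SV(mem, C_base, v3)             # SV V3, R3
```

Five "instructions" replace the scalar loop's roughly 640 (64 iterations × 10 instructions), which is the instruction-fetch saving the next slides quantify.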
Vector Arithmetic Execution
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
[Figure: V3 ← V1 * V2 on a six-stage multiply pipeline — successive element pairs from V1 and V2 enter the pipeline each cycle, and results stream into V3]
Vector Instruction Set Advantages
• Compact
  – one short instruction encodes N operations => N× FLOP bandwidth
• Expressive: tells hardware that these N operations
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in the same pattern as previous instructions
  – access a contiguous block of memory (unit-stride load/store), OR access memory in a known pattern (strided load/store)
• Scalable
  – can run the same object code on more parallel pipelines or lanes
Properties of Vector Processors
• Each result is independent of previous results
  => long pipeline, compiler ensures no dependencies
  => high clock rate
• Vector instructions access memory with a known pattern
  => highly interleaved memory
  => memory latency amortized over 64-plus elements
  => no (data) caches required! (but use instruction cache)
• Reduces branches and branch problems in pipelines
• A single vector instruction implies lots of work (≈ a whole loop)
  => fewer instruction fetches
Supercomputers
Definitions of a supercomputer:
• Fastest machine in the world at a given task
• A device to turn a compute-bound problem into an I/O-bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray

The CDC 6600 (Cray, 1964) is regarded as the first supercomputer.
Supercomputer Applications
Typical application areas:
• Military research (nuclear weapons, cryptography)
• Scientific research
• Weather forecasting
• Oil exploration
• Industrial design (car crash simulation)

All involve huge computations on large data sets.

In the 70s and 80s, "supercomputer" meant "vector machine".
Vector Supercomputers
Epitomized by the Cray-1, 1976:

Scalar Unit + Vector Extensions
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory
Cray-1 (1976)
[Figure: Cray-1 block diagram]
• Single-port memory: 16 banks of 64-bit words + 8-bit SECDED; 80 MW/sec data load/store; 320 MW/sec instruction-buffer refill
• 4 instruction buffers (64-bit × 16), with NIP, LIP, and CIP instruction registers
• 8 address registers (A0–A7) backed by 64 B registers; 8 scalar registers (S0–S7) backed by 64 T registers
• 8 vector registers (V0–V7) of 64 elements each, with vector length and vector mask registers
• Functional units: FP Add, FP Mul, FP Recip; Int Add, Int Logic, Int Shift, Pop Cnt; Addr Add, Addr Mul
• Memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz)
Vector Memory System
[Figure: an address generator computes base + i × stride from the base and stride registers; the resulting addresses fan out across 16 interleaved memory banks (0–F) into a vector register]
• Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency
• Bank busy time: cycles between accesses to the same bank
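A sketch of bank conflicts under this model, assuming one access is issued per cycle and a bank stays busy for a fixed number of cycles after each access (a simplification of the real Cray-1 memory pipeline):

```python
# Interleaved memory: address -> bank = addr mod NBANKS. A strided
# access stream stalls when it revisits a bank that is still busy.
NBANKS = 16
BANK_BUSY = 4   # cycles a bank stays busy after an access (Cray-1 value)

def bank(addr):
    return addr % NBANKS

def stall_count(base, stride, n):
    """Count stalls for n accesses issued one per cycle."""
    ready = [0] * NBANKS   # cycle at which each bank becomes free
    stalls = 0
    cycle = 0
    for i in range(n):
        b = bank(base + i * stride)
        if cycle < ready[b]:
            stalls += 1
            cycle = ready[b]           # wait for the busy bank
        ready[b] = cycle + BANK_BUSY
        cycle += 1
    return stalls
```

Unit stride visits all 16 banks round-robin and never stalls; a stride of 16 hits the same bank every time, so every access after the first waits out the bank busy time.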
Vector Instruction Execution
ADDV C, A, B

[Figure: execution using one pipelined functional unit — element pairs A[i], B[i] enter the pipeline in order while earlier results C[0], C[1], C[2] emerge; A[3]..A[6] + B[3]..B[6] are in flight]

[Figure: execution using four pipelined functional units (lanes) — lane j handles elements j, j+4, j+8, …; for example, lane 0 produces C[0], C[4], C[8] while A[12], A[16], A[20], A[24] are in flight]
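The element-to-lane mapping shown above can be expressed directly (this models only the interleaving, not the pipeline timing):

```python
# With L lanes, element i of a vector instruction is handled by lane
# i mod L, so lane j computes elements j, j+L, j+2L, ...
def lane_assignment(n_elements, n_lanes):
    lanes = [[] for _ in range(n_lanes)]
    for i in range(n_elements):
        lanes[i % n_lanes].append(i)
    return lanes

lanes = lane_assignment(12, 4)
# lanes[0] holds elements 0, 4, 8; lanes[1] holds 1, 5, 9; and so on.
```

Because the elements are independent, the same object code runs unchanged whether the machine has 1, 2, or 4 lanes — the scalability advantage listed earlier.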
History of Microprocessors
1950s: IBM instituted a research program
1964: Release of System/360
Mid-1970s: Improved measurement tools demonstrated on CISC
1979: 32-bit RISC microprocessor (801) developed, led by Joel Birnbaum
1984: MIPS developed at Stanford, as well as projects done at Berkeley
1988: RISC processors had taken over the high end of the workstation market
Early 1990s: IBM's POWER (Performance Optimization With Enhanced RISC) architecture introduced with the RISC System/6000; AIM (Apple, IBM, Motorola) alliance formed, resulting in PowerPC
What is CISC….?
A complex instruction set computer (CISC, pronounced like "sisk") is a microprocessor instruction set architecture (ISA) in which each instruction can execute several low-level operations, such as a load from memory, an arithmetic operation, and a memory store, all in a single instruction.
The philosophy behind it is that hardware is always faster than software; therefore one should build a powerful instruction set which provides programmers with assembly instructions that do a lot with short programs.

So the primary goal of CISC is to complete a task in as few lines of assembly as possible.

Most common microprocessor designs, such as the Intel 80x86 and Motorola 68K series, followed the CISC philosophy.
• Memory in those days was expensive: bigger program -> more storage -> more money
  – Hence the need to reduce the number of instructions per program
• The number of instructions is reduced by packing multiple operations into a single instruction
• Multiple operations lead to many different kinds of instructions that access memory
  – In turn making instruction length variable and fetch-decode-execute time unpredictable
  – Making the design more complex: thus the hardware handles the complexity
CISC philosophy
Use microcode
• Used a simplified microcode instruction set to control the data path logic. This type of implementation is known as a microprogrammed implementation.

Build rich instruction sets
• A consequence of using a microprogrammed design is that designers could build more functionality into each instruction.

Build high-level instruction sets
• The logical next step was to build instruction sets which map directly from high-level languages.
Characteristics of a CISC design
• Register-to-register, register-to-memory, and memory-to-register commands.
• Multiple addressing modes.
• Variable-length instructions, where the length often varies according to the addressing mode.
• Instructions which require multiple clock cycles to execute.
Addressing Modes
• Immediate
• Direct
• Indirect
• Register
• Register Indirect
• Displacement (Indexed)
• Stack
Immediate Addressing
• Operand is part of the instruction
• Operand = address field
• e.g. ADD 5
  – Add 5 to the contents of the accumulator
  – 5 is the operand
• No memory reference to fetch data
• Fast
• Limited range

[Figure: instruction = opcode | operand]
Direct Addressing
• Address field contains the address of the operand
• Effective address (EA) = address field (A)
• e.g. ADD A
  – Add the contents of cell A to the accumulator
  – Look in memory at address A for the operand
• Single memory reference to access data
• No additional calculation to work out the effective address
• Limited address space

Direct Addressing Diagram
[Figure: instruction = opcode | address A; A indexes into memory to fetch the operand]
Indirect Addressing
• The memory cell pointed to by the address field contains the address of (a pointer to) the operand
• EA = (A)
  – Look in A, find address (A), and look there for the operand
• e.g. ADD (A)
  – Add the contents of the cell pointed to by the contents of A to the accumulator
• Large address space: 2^n, where n = word length
• May be nested, multilevel, cascaded, e.g. EA = (((A)))
• Multiple memory accesses to find the operand, hence slower

Indirect Addressing Diagram
[Figure: instruction = opcode | address A; memory at A holds a pointer to the operand, which a second memory access fetches]
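The three modes above can be contrasted in a few lines, using a small word-addressed memory (the dictionary contents are invented for illustration):

```python
# Toy word-addressed memory: location 200 points to 100, which holds 7.
mem = {100: 7, 200: 100, 300: 200}

def immediate(field):
    """Operand is the field itself: no memory reference."""
    return field

def direct(field):
    """EA = A: one memory reference fetches the operand."""
    return mem[field]

def indirect(field):
    """EA = (A): first reference fetches a pointer, second the operand."""
    return mem[mem[field]]
```

The memory-reference count (0, 1, 2) is exactly what makes immediate the fastest mode and indirect the slowest, as the slides note.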
CISC Disadvantages
Designers soon realised that the CISC philosophy had its own problems, including:

• Earlier generations of a processor family were generally contained as a subset in every new version, so the instruction set and chip hardware become more complex with each generation.
• So that as many instructions as possible could be stored in memory with the least wasted space, individual instructions could be of almost any length. This means that different instructions take different amounts of clock time to execute, slowing down the overall performance of the machine.
• Many specialized instructions aren't used frequently enough to justify their existence: approximately 20% of the available instructions are used in a typical program.
• CISC instructions typically set the condition codes as a side effect of the instruction. Not only does setting the condition codes take time, but programmers have to remember to examine the condition-code bits before a subsequent instruction changes them.
Examples - CISC
Examples of CISC processors:
• VAX
• PDP-11
• Motorola 68000 family
• Intel x86/Pentium CPUs