Origin System Architecture

Hardware and Software Environment

Scalar Architecture

Reduced Instruction Set (RISC) Architecture:
• load/store instructions refer to memory
• functional units operate on items in the register file
• memory hierarchy in the Scalar Architecture
  – Most recently used items are captured in the cache
  – Access to cache is much faster than access to memory

[Diagram: processor (register file, functional unit: mult, add) connected to cache and memory. Cache access: ~2 GB/s, ~10 cycles. Memory access: ~500 MB/s, ~100 cycles.]

Vector Architecture

• Vectors will be loaded (loadv instruction) from memory
• The performance is determined by memory bandwidth
• Optimization takes vector length (64 words) into account

Vector Operation:

    DO i=1,n
      DO k=1,n
        C(i,1:n)=C(i,1:n) + A(i,k)*B(k,1:n)
      ENDDO
    ENDDO

    loadf f2,(r3)     load scalar A(i,k)
    loadv v3,(r3)     load vector B(k,1:n)
    mpyvs v3,v3,v2    calculate A(i,k)*B(k,1:n)
    addvv v4,v4,v3    update C(i,1:n)

+ Accumulate C(1,1:n) in a vector register

[Diagram: C = A x B, with row i of C updated from element A(i,k) and row k of B; processor with vector registers and functional unit (mult, add) connected to memory.]
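The row update above can be sketched in Python to show how a row is cut into register-sized strips. This is only an illustration under stated assumptions: the function name `matmul_row_vector` and the list standing in for a vector register are hypothetical, and real vector hardware executes each strip as one instruction, not as a Python loop.

```python
VECTOR_LENGTH = 64  # vector register length in words, as stated on the slide


def matmul_row_vector(a, b, n, vl=VECTOR_LENGTH):
    """Compute C = A*B for n x n lists of lists, one row strip at a time.

    Each strip of C(i,1:n) is accumulated in 'acc', which plays the role
    of the vector register, and is stored back only once.
    """
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for start in range(0, n, vl):        # strip-mine the row
            stop = min(start + vl, n)
            acc = c[i][start:stop]           # accumulator "register"
            for k in range(n):
                scalar = a[i][k]             # loadf: scalar A(i,k)
                strip = b[k][start:stop]     # loadv: strip of B(k,1:n)
                # mpyvs followed by addvv on the whole strip
                acc = [x + scalar * y for x, y in zip(acc, strip)]
            c[i][start:stop] = acc           # write the finished strip
    return c


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul_row_vector(a, b, 2))
```

Passing a small `vl` (for example `vl=1`) exercises the strip loop on tiny matrices; the result does not depend on the strip length.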

Multiprocessor Architecture

• All memory (and I/O) is shared by all processors
• Read/write conflicts between processors on the same memory location are resolved by the cache coherency unit
• The cache coherency unit will intervene if two or more processors attempt to update the same cache line
• Programming model is an extension of the single processor programming model

[Diagram: two processors, each with register file, functional unit (mult, add), cache and cache coherency unit, connected to one shared memory.]

Multicomputer Architecture

• All memory and I/O paths are independent
• Data movement across the interconnect is “slow”
• Programming model is based on message passing
  – Processors explicitly engage in communication by sending and receiving data

[Diagram: two processors, each with its own functional unit (mult, add), cache and main memory.]

Origin 2000 Node Board

Basic Building Block

• 2 x R12000 processors
• 64 MB to 4 GB main memory

Hub Bandwidth Peaks
• 780 MB/s [625] --- CPUs
• 780 MB/s [683] --- memory
• 1.56 GB/s [1.25] -- XIO link
• 1.56 GB/s [1.25] -- CrayLink

[Diagram: node board with two R1*K processors and their caches, the Hub, main memory with directory (extra directory for >32P), and the XIO and CrayLink ports.]

O2000 Node Board

HUB Crossbar ASIC:
• Single chip integrates all 4 interfaces:
  – Processor Interface; two R1x000 processors multiplex on the same bus
  – Memory Interface, integrating the memory controller and (Directory) Cache Coherency
  – Interface to the CrayLink Interconnect to other nodes in the system
  – Interface to the I/O devices with XIO-to-PCI bridges
• Memory access characteristics:
  – Read bandwidth, single processor: 460 MB/s sustained
  – Average access latency: 315 ns to restart processor pipeline

[Diagram: two R1x000 processors, each with a 1-4-8 MB L2 cache, attached to the HUB (processor, memory, I/O and link interfaces). Labels:
 Input/Output on every node: 2x800 MB/s
 CrayLink duplex connection (2x23 @ 400 MHz, 2x800 MB/s) to other nodes
 Main memory up to 4 GB/node, SDRAM (144 @ 50 MHz = 800 MB/s), with directory SDRAM
 HUB ASIC: 950K gates, 100 MHz, 64 bit, BTE, 64 counters / (4 KB) page]

Origin 2000 Switch Technology

[Diagram: node board (two processors with caches, Hub, main memory with directory, extra directory for >32P). The Hub connects to an XBOW with 6 ports to XIO and to a router leading to other node boards. Nodes (N) and routers (R) form a ccNUMA hypercube.]

O2000 Scalability Principle

Distributed switch does scale:
– Network of crossbars allows for full remote bandwidth
– The switch components are distributed and modular

[Diagram: two node boards, each with two R1x000 processors (1-4-8 MB L2 cache each), a HUB with processor, memory, I/O and link interfaces, main memory and directory SDRAM, joined through the crossbar router network.]

Origin 2000 Module

System Building Block

Module Features:
• Up to 8 R12000 CPUs (1-4 Nodes)
• Up to 16 GB physical memory
• Up to 12 XIO slots
• 2 XBOW Switches
• 2 Router Switches
• 64 bit internal PCI Bus (optional)
• Up to 2.5 [3.1] GB/sec system bandwidth
• Up to 5.0 [6.2] GB/sec I/O bandwidth

Origin 2000 Module: SGI 2100 / 2200

Deskside System
• 2-8 CPUs
• 16 GB Memory
• 12 XIO slots

[Diagram: four nodes (N) connected to two routers (R).]

Origin 2000 Single Rack: SGI 2400

Single Rack System
• 2-16 CPUs
• 32 GB Memory
• 24 XIO slots

[Diagram: eight nodes (N) connected to four routers (R).]

Origin 2000 Multi-Rack

Multi-Rack System
• 17-32 CPUs
• 64 GB Memory
• 48 XIO slots
• 32-processor hypercube building block

[Diagram: sixteen nodes (N) connected to eight routers (R) in a hypercube.]

Origin 2000 Large Systems: SGI 2800

Large Multi-Rack Systems
• up to 512 CPUs
• up to 1 TB Memory
• 384+ XIO slots

Modular Architecture

Scalable Node Product Concept

Address diverse customer requirements
• Independent scaling of CPU, I/O, and storage… tailor ratios to suit application
• Large dynamic range of product configurations
• RAS via component isolation

Independent evolution and upgrade of system components

Maximize leverage of engineering and technology development efforts

[Diagram: processor subsystems, I/O subsystems and interconnect subsystems joined by interface and form factor standards.]

Origin 3000 Hardware Modules (BRICKS)

• C-brick: CPU Module
• D-brick: Disk Storage
• R-brick: Router Interconnect
• X-brick: XIO Expansion
• P-brick: PCI Expansion
• I-brick: Base I/O Module
• G-brick: Graphics Expansion

Origin 3000 MIPS Node

• Four R1*000 processors, each with its own L2 cache, attached to the Bedrock ASIC with memory/directory
• Two independent SysAD interfaces, each 2x O2K bandwidth: 200 MHz, 1600 MB/sec each
• NUMALink3 network port, 2x O2K bandwidth: 800 MHz, 1600 MB/sec bi-directional
• XIO+ port, 1.5x O2K bandwidth: 600 MHz, 1200 MB/sec bi-directional
• Memory interface, 4x O2K bandwidth: 200 MHz, 3200 MB/sec
  – 60% O2K latency: 180 ns local
  – 8 GB/node (max), DDR SDRAM
• 128 nodes / 512 CPUs per system (max)

Origin 3000 CPU Brick (C-brick)

• 3U high x 28” deep
• Four MIPS or IA64 CPUs
• 1 - 4 DIMM pairs: 256MB, 512MB, 1024MB (premium)
• 48V DC power input
• N+1 redundant, hot-plug cooling
• Independent power on/off
• Each CPU module can support one I/O brick

Origin 3000 BEDROCK Chip

SGI Origin 3000 Bandwidth: Theoretical vs. Measured (MB/s)

[Diagram: two nodes, each with four CPUs, a Hub and memory. Link labels recovered from the figure: 1600, 2x1600, 3200, 900, 1150, 2100, 2x1250.]

STREAMS Copy Benchmark

Megabytes/sec by number of CPUs:

    Number of CPUs                    1        2        4        8
    Origin 2000 R12KS 400 MHz     380.0    381.0    820.0   1538.0
    Origin 3000 R12KS 400 MHz     623.0    777.0   1406.0   2855.0
    Origin 3000 R14K 500 MHz      685.0    778.0   1401.0   2823.0
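To read the STREAMS copy figures as scaling numbers, the short sketch below computes the speedup over one CPU and the bandwidth per CPU. The dictionary and function names are assumptions chosen for this illustration; the values are the ones from the benchmark slide.

```python
CPUS = [1, 2, 4, 8]

# Copy bandwidth in MB/s for 1, 2, 4 and 8 CPUs, taken from the slide.
STREAM_COPY_MB_S = {
    "Origin 2000 R12KS 400 MHz": [380.0, 381.0, 820.0, 1538.0],
    "Origin 3000 R12KS 400 MHz": [623.0, 777.0, 1406.0, 2855.0],
    "Origin 3000 R14K 500 MHz": [685.0, 778.0, 1401.0, 2823.0],
}


def speedup(series):
    """Bandwidth relative to the single-CPU value."""
    return [value / series[0] for value in series]


def per_cpu(series):
    """Bandwidth available to each CPU when all of them stream at once."""
    return [value / cpus for value, cpus in zip(series, CPUS)]


if __name__ == "__main__":
    for name, series in STREAM_COPY_MB_S.items():
        print(name)
        print("  speedup:", [round(s, 2) for s in speedup(series)])
        print("  MB/s per CPU:", [round(b, 1) for b in per_cpu(series)])
```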

Origin 3000 Router Brick (r/R-brick)

• 2U high x 25” deep
• Replaces system mid-plane
• Multiple implementations
  – r-Brick… 6-port (up to 32 CPUs)
  – R-Brick… 8-port (up to 128 CPUs)
  – metarouter… (128 to 512 CPUs)
• 48V DC power input
• N+1 redundant, hot-plug cooling
• Independent power on/off
• Latency 50% of Origin 2000
  – 45 ns

NUMAlink™ 3 Router
• 8 NUMAlink™ 3 network ports, each port 3.2 GB/s (2x O2K bandwidth)
• 45 ns roundtrip latency (50% of O2K router latency)

SGI Origin 3000 Measured Bandwidth

[Diagram: router with measured bandwidth labels 5000 MB/s, 2500 and 2500.]

SGI NUMA 3 Scalable Architecture (16p - 1 hop)

[Diagram: four Bedrock ASICs, each with four R1*000 processors, connected to one 8-port router, which also links to other routers.]

Origin 3000 I/O Bricks

X-brick: XIO Expansion
• Highest performance I/O expansion
• Supports HIPPI, GSN, VME, HDTV
• 4 XIO slots per brick

P-brick: PCI Expansion
• 12 industry-standard, 64-bit, 66 MHz slots
• Supports almost all system peripherals
• All slots are hot-swap

I-brick: Base I/O Module
• Base system I/O:
  – system disk
  – CD-ROM
  – 5 PCI slots
• No need to duplicate starting I/O infrastructure

New I/O bricks (e.g., PCI-X) can be attached via same XIO+ port

Types of Computer Architecture
characterised by memory access

MIMD
• Multiprocessors: single address space, shared memory
  – UMA (central memory)
    · PVP (SGI/Cray T90)
    · SMP (Intel SHV, SUN E10000, DEC 8400, SGI Power Challenge, IBM R60, etc.)
  – NUMA (distributed memory)
    · COMA (KSR-1, DDM)
    · CC-NUMA (SGI Origin2000, Origin3000, Cray T3E, HP Exemplar, Sequent NUMA-Q, Data General)
    · NCC-NUMA (Cray T3D, IBM SP3)
• Multicomputers: multiple address spaces
  – NORMA (no-remote memory access)
    · Cluster (IBM SP2, DEC TruCluster, Microsoft Wolfpack, “Beowulf”, etc.): loosely coupled, multiple OS
    · “MPP” (Intel TFLOPS, TM-5): tightly coupled & single OS

Abbreviations:

    MIMD       Multiple Instruction, Multiple Data
    UMA        Uniform Memory Access
    NUMA       Non-Uniform Memory Access
    NORMA      No-Remote Memory Access
    MPP        Massively Parallel Processor
    PVP        Parallel Vector Processor
    SMP        Symmetric Multi-Processor
    COMA       Cache Only Memory Architecture
    CC-NUMA    Cache-Coherent NUMA
    NCC-NUMA   Non-Cache Coherent NUMA

Origin DSM-ccNUMA Architecture

Distributed Shared Memory

[Diagram: two nodes, each with four processors (with caches), a Bedrock, main memory with directory and an XIO+ port, joined through NUMALink3 and R-Bricks.]

Distributed Shared Memory Architecture (DSM)

• Local memory and independent path to memory as with the Multicomputer Architecture
• Memory of all nodes is organized as one logical “shared memory”
• Non-uniform memory access (NUMA):
  – “Local memory” access is faster than “remote memory” access
• Programming model is (almost) the same as for the Shared Memory Architecture
  – data distribution is available for optimization
• Scalability properties similar to the Multicomputer Architecture

[Diagram: two processors, each with functional unit (mult, add), cache, cache coherency unit and main memory, joined by an interconnect.]

Origin DSM-ccNUMA Architecture

Directory-Based Scalable Cache Coherence

[Diagram: two nodes, each with four processors (with caches), a Bedrock, main memory with directory and an XIO+ port, joined through NUMALink3 and R-Bricks.]

Origin Cache Coherency

• Memory page is divided into data blocks of 32 words or 128 Bytes each (L2 cache line size)
• Each data request transfers one data block (128 Bytes)
• Each data block has associated presence and state information
• If a node (HUB) requests a data block, the corresponding presence bit is set and the state of that cache line is recorded
• HUB runs the Cache Coherency protocol, updating the state of the data block and notifying nodes for which the presence bit is set

Each L2 cache line contains 4 data blocks of 8 words or 32 Bytes each (L1 data cache line size)

[Diagram: a page made up of data blocks (cache lines) of 128 Bytes (32 words); for each block the directory holds presence (64 bits) and state (8 bits).]

Directory states:
• Unowned: no copies
• Shared: read-only copies
• Exclusive: one read-write copy
• Busy: state in transition
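The presence bits and states described above can be illustrated with a toy directory entry. This is a heavily simplified sketch, not the Origin protocol: the class and method names are hypothetical, the transient Busy state is declared but never entered, and requests are assumed to complete atomically.

```python
from enum import Enum


class State(Enum):
    UNOWNED = "unowned"      # no copies
    SHARED = "shared"        # read-only copies
    EXCLUSIVE = "exclusive"  # one read-write copy
    BUSY = "busy"            # state in transition (not modeled here)


class DirectoryEntry:
    """Presence mask (64 bits) and state for one 128-byte data block."""

    def __init__(self):
        self.presence = 0
        self.state = State.UNOWNED

    def holders(self):
        """Nodes whose presence bit is set."""
        return [n for n in range(64) if (self.presence >> n) & 1]

    def read(self, node):
        """Node asks for a read-only copy; returns the nodes to notify."""
        if self.state is State.EXCLUSIVE and self.holders() == [node]:
            return []                       # owner already has the block
        notify = []
        if self.state is State.EXCLUSIVE:
            notify = self.holders()         # owner must give up write access
        self.presence |= 1 << node
        self.state = State.SHARED
        return notify

    def write(self, node):
        """Node asks for the read-write copy; returns the nodes to notify."""
        notify = [n for n in self.holders() if n != node]
        self.presence = 1 << node           # only the writer keeps a copy
        self.state = State.EXCLUSIVE
        return notify


if __name__ == "__main__":
    entry = DirectoryEntry()
    entry.read(3)
    entry.read(5)
    print(entry.write(5), entry.state, entry.holders())
```

The point of the sketch is that the directory only has to contact nodes whose presence bit is set, which is what lets the scheme scale without broadcasting to every node.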

CC-NUMA Architecture: Programming

• All data is shared
• Additional optimization to place data close to the processor that would do most of the computations on that data
• Automatic (compiler) optimizations for single processor and parallel performance
• The data access (data exchange) is implicit in the algorithm
• Except for the additional data placement directives, the source is the same as for single processor programming (SMP principle)

[Diagram: C = A x B with indices i, j, k; the matrices are split into column blocks owned by Proc 1, Proc 2 and Proc 3.]

    C every processor holds a column of each matrix:
    C$distribute A(*,block),B(*,block),C(*,block)
    C$omp parallel do
          DO i=1,n
            DO j=1,n
              DO k=1,n
                C(i,j)=C(i,j) + A(i,k)*B(k,j)
              ENDDO
            ENDDO
          ENDDO
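To make the (*,block) column distribution concrete, the sketch below computes which processor would own each column. It assumes the common rule of contiguous blocks of ceil(n/p) columns; the exact rule applied by the compiler directive is not stated in the slides, and the function names are hypothetical.

```python
def block_size(n, nprocs):
    """Columns per processor, rounded up (assumed distribution rule)."""
    return -(-n // nprocs)


def block_owner(j, n, nprocs):
    """Processor (0-based) owning column j (1-based, as in the Fortran)."""
    return (j - 1) // block_size(n, nprocs)


def owned_columns(proc, n, nprocs):
    """1-based column indices held in the local memory of 'proc'."""
    size = block_size(n, nprocs)
    first = proc * size + 1
    last = min(n, (proc + 1) * size)
    return list(range(first, last + 1))


if __name__ == "__main__":
    n, nprocs = 10, 3
    for proc in range(nprocs):
        print("Proc", proc + 1, "holds columns", owned_columns(proc, n, nprocs))
```

With such a mapping, work on column j is cheapest on the processor returned by `block_owner`, because A(*,j), B(*,j) and C(*,j) are then in its local memory.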

Problems of CC-NUMA Architecture

SMP programming style + data placement techniques (directives)

• SMP programming “cliff”: remote memory latency jump ~3-5, requires correct data placement
• 64-128 processor O2000: ta(remote)/ta(local) ~3-5 -> correct data placement
• Based on 1 GB/s SCI link; latency/hop ~ 500 ns

DSM-ccNUMA Memory

• Shared-memory Systems (SMP): easy to program, hard to scale
• Massively Parallel Systems (MPP): easy to scale, hard to program
• Distributed Shared Memory Systems (ccNUMA): easy to program, easy to scale

SGI 3200 (2-8p)

Router-less configurations in deskside form factor

[Diagram: short rack (17U config. space) layouts for the minimum (2p) and maximum (8p) systems, built from C-Bricks, I-Bricks, optional P-, I- or X-Bricks and power bays. System topology: two C-Bricks (four processors and a Bedrock each, with XIO+ ports) connected directly through their network ports.]

SGI 3400 (4-32p)

[Diagram: full-size rack (39U config. space) layouts for the minimum (4p) and maximum (32p) systems, built from C-Bricks, r-Bricks, I-Bricks, optional P-, I- or X-Bricks and power bays. System topology: eight C-Bricks (four processors, a Bedrock and an XIO+ port each) connected through two r-Brick 6-port routers.]

SGI 3800 (16-128p)

[Diagram: rack layouts for the minimum (16p) and maximum (128p) systems, built from C-Bricks, R-Bricks (8-port routers), I-Bricks, optional P-, I- or X-Bricks and power bays. 128P system topology: four racks (Rack 1 to Rack 4), each with two R-Bricks and eight C-Bricks.]

SGI 3800 System: 128 processors

[Diagram: eight groups of 16 processors.]

SGI 3800 (32-512p)

512p power estimates:
• MIPS = 77 kW
• Itanium™ = 150 kW
• McKinley = 231 kW

No I/O or storage included in power estimates.

Premium memory required

[Diagram: one quadrant of a 512p system: racks of C-Bricks, R-Bricks, an I-Brick, optional P-, I- or X-Bricks and power bays.]

Router-to-Router Connections for 256 Processor Systems

512 Processor Systems

R1xK Family of Processors

• Supports the 64-bit MIPS IV ISA
• 4-way superscalar
• Five separate execution units
• 2 floating point results / cycle
• 4-way deep speculative execution of branches
• Out-of-order execution (48 instruction window)
• Register re-naming
• Two-way set associative non-blocking caches
  – Up to 4 outstanding memory read requests
  – Prefetching of data
  – 1MB to 8MB secondary data cache
• Four user-accessible event counters

MIPS R1x000 is an out-of-order, dynamic-scheduling superscalar processor with non-blocking caches

Origin 3000 MIPS Processor Roadmap

Timeline 1999 to 2003, from Origin 2000 to O3K-MIPS:
• R10000: 250 MHz, 500 MFlops
• R12000: 300 MHz, 600 MFlops
• R12000A: 400 MHz, 800 MFlops
• R14000(A): 500+ MHz, 1000+ MFlops
• R16000: xxx MHz, xxx GFlops
• R18000: xxx MHz, xxx GFlops

Cache labels on the chart: 8 MB @ 200 MHz; 4 MB @ 250 MHz; 8 MB @ 266 MHz; 8 MB DDR SRAM @ 250+ MHz

R14000 Cache Interfaces

Memory Hierarchy

Speed of access (1/clock) against device capacity (size):

    Device             Capacity        Access time
    Registers          64 reg
    L1 cache           32 KB           ~2-3 cy
    L2 cache           8 MB            ~10 cy
    Memory             ~1 - 100s GB    ~100 - 300 cy (NUMA)
    Disk                               ~4000 cy

[Chart: remote latency (ns) against system size, 2p to 512p. Origin3000 data labels: 175, 175, 235, 285, 335, 335, 435, 485, 585. Origin2000 data labels: 343, 554, 759, 759, 836, 1067, 1169.]

Effects of Memory Hierarchy

[Figure: curves for L2 cache sizes of 1 MB, 2 MB and 4 MB; annotations mark the 32 KB L1 cache and the 4 MB L2 cache.]

Instruction Latencies (R12K)

    Integer units                            latency   repeat rate
    • ALU 1
      – add, sub, logic ops, shift, br       1         1
    • ALU 2
      – add, sub, logic ops                  1         1
      – signed multiply (32/64 bit)          6/10      6/10
      – (unsigned multiply: +1 cycle)
      – divide (32/64 bit)                   35/67     35/67
    • Address Unit
      – load integer                         2         1
      – load floating point                  3         1
      – store                                -         1
      – Atomic LL,ADD,SC sequence            6         6

    Floating point units                     latency   repeat rate
    • FPU 1
      – add, sub, compare, convert           2         1
    • FPU 2
      – multiply                             2         1
      – multiply-add (madd)                  4         1
    • FPU 3
      – divide, reciprocal (32/64 bit)       12/19     14/21
      – sqrt (32/64 bit)                     18/33     20/35
      – rsqrt (32/64 bit)                    30/52     34/56

A repeat rate of 1 means that, once the pipeline is full, the processor can complete 1 operation per cycle.

Thus the peak rates:
• Int operations: 2 int operations/cycle
• FP operations: 2 fp operations/cycle

For the R14000@500MHz:
• 4*500 MHz = 2000 MIPS
• 2*500 MHz = 1000 Mflop/s

The compiler has this table built in. The goal of compiler scheduling is to find instructions that can be executed in parallel to fill all slots: ILP - Instruction Level Parallelism
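As a small worked example of how latency and repeat rate combine, the sketch below uses the usual pipeline model: the first result arrives after `latency` cycles and each further independent operation completes `repeat` cycles later. The model and the function names are assumptions made for illustration; they are not taken from the compiler.

```python
def pipelined_cycles(n_ops, latency, repeat):
    """Cycles to finish n_ops independent operations on one unit."""
    if n_ops <= 0:
        return 0
    return latency + (n_ops - 1) * repeat


def peak_rate(clock_mhz, per_cycle):
    """Peak rate in millions per second for a given issue width."""
    return clock_mhz * per_cycle


if __name__ == "__main__":
    # 100 independent madds: latency 4, repeat rate 1
    print(pipelined_cycles(100, latency=4, repeat=1))
    # R14000 at 500 MHz: 4 instructions/cycle and 2 fp operations/cycle
    print(peak_rate(500, 4), "MIPS,", peak_rate(500, 2), "Mflop/s")
```

For long runs of independent operations the cost per operation approaches the repeat rate, which is why the peak rates depend only on the repeat rate and the clock.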

Instruction Latencies: DAXPY Example

– There are 2 loads (x,y) and 1 store (y) = 3 mem ops.
– There are 2 fp operations (+,*), which can be done with 1 madd
• 3 mem ops require 3 cycles minimum (processor can do 1 mem op/cycle)
• theoretically, in 3 cycles the processor can do 6 fp operations
• only 2 fp operations are available in the code, so the max processor speed is 2fp/6fp = 1/3 of peak on this code; i.e., for the R12000@300MHz processor, 600/3 = 200 Mflop/s.

    DO I=1,n
      Y(I) = Y(I) + A*X(I)
    ENDDO

Loop parallelism, per single loop iteration:
• 2 loads, 1 store
• 1 multiply-add (madd)
• 2 address increments
• 1 loop-end test
• 1 branch

Processor parallelism, per processor cycle:
• 1 load or store
• 1 ALU1 instruction
• 1 ALU2 instruction
• 1 FP add
• 1 FP multiply
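The bound derived above can be reproduced with a few lines of exact arithmetic. The function name and its parameters are hypothetical; the defaults (1 memory operation and 2 floating point operations per cycle) are the figures given in the slides.

```python
from fractions import Fraction


def memory_bound_fraction(mem_ops, flops, mem_ops_per_cycle=1, flops_per_cycle=2):
    """Best fraction of peak a loop can reach when memory ops limit it."""
    min_cycles = Fraction(mem_ops, mem_ops_per_cycle)
    return Fraction(flops) / (min_cycles * flops_per_cycle)


if __name__ == "__main__":
    # DAXPY: 2 loads + 1 store, 2 fp operations per iteration
    fraction = memory_bound_fraction(mem_ops=3, flops=2)
    peak_mflops = 300 * 2  # R12000 at 300 MHz, 2 fp operations per cycle
    print(fraction, float(fraction * peak_mflops), "Mflop/s")
```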

DAXPY Example: Schedules

Simple schedule:

    DO I=1,n
      Y(I) = Y(I) + A*X(I)
    ENDDO

    cycle  instructions
    0      ld x     x++
    1      ld y
    2
    3      madd
    4
    5
    6
    7      st y     br     y++

    x load delay 3 cycles, madd delay 4 cycles

    2fp/(8cycles*2fp/cy) = 1/8 peak
    R12000@300MHz ~ 75 Mflop/s

Unrolled by 2:

    DO I=1,n-1,2
      Y(I+0) = Y(I+0) + A*X(I+0)
      Y(I+1) = Y(I+1) + A*X(I+1)
    ENDDO

    cycle  instructions
    0      ld x0
    1      ld x1
    2      ld y0    x+=4
    3      ld y1    madd0
    4      madd1
    5
    6
    7      st y0
    8      st y1    y+=4   br

    x load delay 3 cycles, madd delay 4 cycles

    4fp/(9cycles*2fp/cy) = 2/9 peak
    R12000@300MHz ~ 133 Mflop/s
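The peak fractions quoted for the schedules follow directly from the schedule length. The helper below is a minimal sketch with a hypothetical name; it assumes the 2 fp operations per cycle peak used throughout the slides.

```python
from fractions import Fraction

FLOPS_PER_CYCLE = 2  # peak fp operations per cycle, as stated in the slides


def schedule_rate(flops, cycles, clock_mhz):
    """Return (fraction of peak, Mflop/s) for a loop schedule."""
    fraction = Fraction(flops, cycles * FLOPS_PER_CYCLE)
    mflops = float(fraction * clock_mhz * FLOPS_PER_CYCLE)
    return fraction, mflops


if __name__ == "__main__":
    print("simple  :", schedule_rate(flops=2, cycles=8, clock_mhz=300))
    print("unrolled:", schedule_rate(flops=4, cycles=9, clock_mhz=300))
```

Unrolling helps because the 9-cycle schedule carries twice the floating point work of the 8-cycle one.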

DAXPY Example: Software Pipelining

• Software pipelining is the way to fill all processor slots by mixing iterations
• The number of replications gives how many iterations are mixed
• The number of replications depends on the distance (in cycles) between the load and the calculation
• DAXPY 6 cy schedule with 4 fp ops: 4fp/(6cy*2fp/cy) = 1/3 peak

    #<swp> replication 0                    #cy
    ld x0    ldc1    $f0,0($1)              #[0]
    ld x1    ldc1    $f1,-8($1)             #[1]
    st y2    sdc1    $f3,-8($3)             #[2]
    st y3    sdc1    $f5,0($3)              #[3]
    y+=2     addiu   $3,$2,16               #[3]
             madd.d  $f5,$f2,$f0,$f4        #[4]
    ld y0    ldc1    $f0,-8($2)             #[4]
             madd.d  $f3,$f0,$f1,$f4        #[5]
    x+=2     addiu   $1,$1,16               #[5]
             beq     $2,$4,.BB21.daxpy      #[5]
    ld y3    ldc1    $f2,0($3)              #[5]

    #<swp> replication 1                    #cy
    ld x3    ldc1    $f1,0($1)              #[0]
    ld x2    ldc1    $f0,-8($1)             #[1]
    st y1    sdc1    $f3,-8($2)             #[2]
    st y0    sdc1    $f5,0($2)              #[3]
    y+=2     addiu   $2,$3,16               #[3]
             madd.d  $f5,$f2,$f1,$f4        #[4]
    ld y3    ldc1    $f1,-8($3)             #[4]
             madd.d  $f3,$f1,$f0,$f4        #[5]
    x+=2     addiu   $1,$1,16               #[5]
    ld y0    ldc1    $f2,0($2)              #[5]

DAXPY SWP: Compiler Messages

    f77 -mips4 -O3 -LNO:prefetch=0 -S daxpy.f

• With the -S switch the compiler will produce the file daxpy.s with assembler instructions and comments about software pipelining schedules

    #<swps> Pipelined loop line 6 steady state
    #<swps>    50 estimated iterations before pipelining
    #<swps>     2 unrolling before pipelining
    #<swps>     6 cycles per 2 iterations
    #<swps>     4 flops        ( 33% of peak) (madds count 2fp)
    #<swps>     2 flops        ( 16% of peak) (madds count 1fp)
    #<swps>     2 madds        ( 33% of peak)
    #<swps>     6 mem refs     (100% of peak)
    #<swps>     3 integer ops  ( 25% of peak)
    #<swps>    11 instructions ( 45% of peak)
    #<swps>     2 short trip threshold
    #<swps>     7 ireg registers used.
    #<swps>     6 fgr registers used.

• The schedule reaches the maximum of 1/3 of peak processor performance, as expected
• Note: it is necessary to switch off prefetch to attain the maximum schedule

Multiple Outstanding Mem Refs

• Processor can support 4 outstanding memory requests

Timing linked list references: while(x) x=x->p;

    # outstanding refs    time per pointer fetch
    1                     230 ns (480 ns)
    2                     160 ns (250 ns)
    4                     110 ns (240 ns)

[Diagram: timelines comparing a “sequential” cache miss (execution, wait for data, execution, wait for data) with a “parallel” cache miss (independent instructions execute while waiting for data).]

Origin 3000 Memory Latency

                   Origin 2000    O3K
    Local          320 ns         180 ns
    NI to NI       165 ns         50 ns
    Per router     105 ns         45 ns

Remote latency:
• Origin 2000: 485 ns + #hops*105 ns
• O3K: 230 ns + #hops*45 ns

32 CPU O3K Max Latency: 315 ns
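The two remote latency formulas can be compared for a given number of router hops with a short sketch. The table and function names are assumptions for illustration; the constants are the ones quoted in the formulas (485 ns + #hops*105 ns and 230 ns + #hops*45 ns).

```python
# (base latency in ns, added latency per router hop in ns)
LATENCY_NS = {
    "Origin 2000": (485, 105),
    "Origin 3000": (230, 45),
}


def remote_latency_ns(system, hops):
    """Remote memory latency for a given number of router hops."""
    base, per_hop = LATENCY_NS[system]
    return base + hops * per_hop


if __name__ == "__main__":
    for hops in range(1, 5):
        o2k = remote_latency_ns("Origin 2000", hops)
        o3k = remote_latency_ns("Origin 3000", hops)
        print(hops, "hops:", o2k, "ns vs", o3k, "ns")
```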

Remote Memory Latency

SGI™ 3000 Family vs. SGI™ 2000 Series

[Chart: worst case round trip remote latency (ns, 0 to 1400) against node size (CPUs, 2p to 1024p), with one curve for Origin2000 and one for the SN (hypercube) Origin 3000 series.]

R1x000 Event Counters

The R1x000 processor family allows extensive performance monitoring with counters that can be triggered by 32 events:
• R10000 has 2 event counters
• R12000 has 4 event counters

The counters are incremented when an event selected by the user happens in the processor (e.g. a cache miss).

The first counter can be triggered by events 0-15; the second counter is incremented in response to events 16-31.

R12000 has 2 additional counters that allow monitoring of conditional events (i.e. events based on previous events).

User access to the counters is through a software library or shell level tools provided by the IRIX OS.

Origin Address Space

• Physically the memory is distributed and is not contiguous.
• Node id is assigned at boot time
• Logically memory is a shared single contiguous address space; the virtual address space is 44 bits (16 TB)
• The program (compiler) uses the virtual address space.
• Translation from the virtual to the physical address space is done by the CPU.

Physical address (40 bits, 1 TB max):

    bits 39-32    Node id (8 bits)
    bits 31-0     Node offset (32 bits, 4 GB)

Max for a single node: 4 GB memory

TLB = Translation Look-aside Buffer

Page size is configurable as 16 KB (default), 64 KB, 256 KB, 1 MB, 4 MB, 16 MB

[Diagram: physical address space by node id (0, 1, 2, 3, 4, ...), each node owning a 4 GB range (0, 4 GB, 8 GB, 12 GB, ...) that contains memory present and empty slots; virtual pages (Page 0, Page 1, Page 2, ... Page n) are mapped through the TLB to physical pages.]
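The physical address layout (8-bit node id in bits 39-32, 32-bit node offset in bits 31-0) can be decoded with simple shifts and masks. The function names below are hypothetical and the sketch only illustrates the bit layout given above.

```python
NODE_ID_BITS = 8
NODE_OFFSET_BITS = 32
PHYSICAL_BITS = NODE_ID_BITS + NODE_OFFSET_BITS  # 40 bits, 1 TB


def split_physical(address):
    """Split a 40-bit physical address into (node id, node offset)."""
    if not 0 <= address < (1 << PHYSICAL_BITS):
        raise ValueError("address does not fit in 40 bits")
    node_id = address >> NODE_OFFSET_BITS
    offset = address & ((1 << NODE_OFFSET_BITS) - 1)
    return node_id, offset


def join_physical(node_id, offset):
    """Build a physical address from a node id and an offset in that node."""
    if not 0 <= node_id < (1 << NODE_ID_BITS):
        raise ValueError("node id does not fit in 8 bits")
    if not 0 <= offset < (1 << NODE_OFFSET_BITS):
        raise ValueError("offset does not fit in 32 bits")
    return (node_id << NODE_OFFSET_BITS) | offset


if __name__ == "__main__":
    address = join_physical(node_id=3, offset=0x10)
    print(hex(address), split_physical(address))
```

Because every node owns a fixed 4 GB slice of the address range, a node with less memory leaves the rest of its slice empty, which is why the physical memory is not contiguous.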

Process Scheduling

IRIX is a Symmetric Multiprocessing Operating System
• Processes and processors are independent
• Parallel programs are executed as jobs with multiple processes
• The scheduler will allocate processes to processors

Priority range from 0 to 255:

    0         weightless (batch)
    1-40      time share (interactive) (TS)
    90-239    system (daemons and interrupts)
    1-255     real time processes (FIFO & RR)

[Diagram: priority scale marked at 10, 40, 128 and 255, with timeshare, system and real time bands.]

System Monitoring Commands

    uptime(1)           returns information about system usage and user load
    w(1)                who is on the system and what are they doing?
    sysmon              system log viewer
    ps(1)               a "snapshot" of the process table
    top, gr_top         process table dynamic display
    osview              system usage statistics
    sar                 system activity reporter
    gr_osview           system usage statistics in graphical form
    gmemusage           graphical memory usage monitor
    sysconf             system limits, options, and parameters
    ecstats -C          R10K Counter Monitor
    ja                  job accounting statistics
    oview               Performance Co-Pilot (bundled with IRIX)
    pmchart             Performance Co-Pilot (licensed software)
    nstats, linkstat    CrayLink connection statistics (man refcnt(5))
    bufview             system buffer statistics
    par                 process activity report
    numa_view, dlook    provides process memory placement information
    limit [-h]          displays system soft [hard] limits
    hinv                hardware inventory
    topology            system interconnect description

Summary: Origin Properties

• Single machine image
  – it behaves like a fat workstation
    • same compilers
    • time sharing
  – all your old code will run
  – OS schedules all the hardware resources on the machine
• Processor scalability: 2-512 CPUs
• I/O scalability: 2-300 GB/s
• All memory and I/O devices are directly addressable
  – no limitation on the size of a single program; it can use all the available memory
  – no limitation on the location of the data; all disks can be used in a single file system
• 64 bit operating system and file system
  – HPC features: Checkpoint/Restart, DMF, NQE/LSF, TMF, Miser, job limits, cpusets, enhanced accounting
• Machine stability
