TM

Origin 2000 ccNUMA Architecture

Joe Goyette
Systems Engineer
[email protected]

February 2001

Presentation Overview

1. ccNUMA Basics
2. SGI’s ccNUMA Implementation (O2K)
3. Supporting OS Technology
4. SGI’s NextGen ccNUMA (O3K) (brief)
5. Q&A

ccNUMA

cc: cache coherent
NUMA: Non-Uniform Memory Access

•Memory is physically distributed throughout the system
•Memory and peripherals are globally addressable
•Local memory accesses are faster than remote accesses (Non-Uniform Memory Access = NUMA)
•Local accesses on different nodes do not interfere with each other

Typical SMP Model

[Diagram: several processors, each with a snoopy cache, share the main memory banks and I/O devices over a single central bus.]

Typical MPP Model

[Diagram: independent nodes, each with its own processor, main memory, I/O, and operating system, joined by an interconnect network (e.g. GSN, 100BaseT, Myrinet).]

Scalable Cache Coherent Memory

Shared-memory Systems (SMP): easy to program, hard to scale

Massively Parallel Systems (MPP): easy to scale, hard to program

Scalable Shared Memory Systems (ccNUMA): easy to program, easy to scale

Origin ccNUMA vs other Architectures

> Single Address Space

> Modular Design

> All aspects scale as system grows

> Low-latency, high bandwidth global memory

[Comparison table columns: Origin, Conventional SMP, Other NUMA, Clusters/MPP]

Origin ccNUMA Advantage

[Diagram: interconnection bisections compared across four designs: fixed bus SMP; other NUMA; clusters, MPP; and Origin 2000 ccNUMA. N = node, R = router.]

IDC: NUMA is the future

Architecture type    1997 share   Change       1996 share
Bus-based SMP        41.0%        -13.7 pts.   54.7%
NUMA SMP             20.8%        +16.9 pts.    3.9%
Message Passing      15.3%         -1.1 pts.   16.4%
Switch-based SMP     12.1%         -0.9 pts.   13.0%
Uni-processor         5.5%         -3.9 pts.    9.4%
NUMA (uni-node)       5.3%         +3.8 pts.    1.5%

“Buses are the preferred approach for SMP implementations because of their relatively low cost. However, scalability is limited by the performance of the bus.”

“NUMA SMP ... appears to be the preferred memory architecture for next-generation systems.”

- IDC, September 1998

Source: High Performance Technical Computing Market: Review and Forecast, 1997-2002, International Data Corporation, September 1998

SGI’s First Commercial ccNUMA Implementation

Origin 2000 Architecture

History of Multiprocessing at SGI

[Chart: CPU count (2 to 256) by year, 1993 to 2000. Labels: Challenge, 2-36 CPUs; Origin 2000, 2-32 CPUs; Origin 2000 ccNUMA introduced.]

Origin 2000 Logical Diagram
32 CPU Hypercube (3D)

[Diagram: eight routers (R) at the corners of a cube, each with two node boards (N) attached, for 16 node boards in total.]
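One way to see why a 3D hypercube keeps paths short is to count the router-to-router links between any two of its eight routers. The sketch below is illustrative only: it assumes the routers are numbered 0 to 7 so that directly cabled routers differ in exactly one bit. That numbering and the function names are assumptions made for this example, not details of the Origin 2000 hardware.

```python
def router_links(src: int, dst: int) -> int:
    """Links crossed between two hypercube routers (the Hamming distance).

    Assumes an illustrative numbering in which neighbouring routers
    differ in exactly one bit of their ID.
    """
    return bin(src ^ dst).count("1")


def max_links(dimensions: int) -> int:
    """Longest shortest path, in links, in a hypercube of the given dimension."""
    routers = range(2 ** dimensions)
    return max(router_links(a, b) for a in routers for b in routers)


if __name__ == "__main__":
    print(router_links(0b000, 0b111))  # opposite corners of the cube
    print(max_links(3))                # worst case for eight routers
```

Under this numbering the worst case is three links, one per dimension of the cube, so doubling the router count adds only one link to the longest path.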

Origin 2000 Node Board

Basic Building Block

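[Diagram: node board with two processors, each with its own cache, connected through the Hub to main memory and its directory. An additional directory is used in systems larger than 32 processors.]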

MIPS R12000 CPU

•64-bit RISC design, 0.25-micron CMOS process
•Single-chip four-way superscalar RISC dataflow architecture
•5 fully-pipelined execution units
•Supports speculative and out-of-order execution
•8MB L2 cache Origin 2000, 4MB Origin 200
•32KB 2-way set-associative instruction and data caches
•2,048-entry branch prediction table
•48-entry active list
•32-entry two-way set-associative Branch Target Address Cache (BTAC)
•Doubled L2 way prediction table for improved L2 hit rate
•Improved branch prediction by using global history mechanism
•Improved performance monitoring support
•Maintains code and instruction set compatibility with R10000

Memory Hierarchy

1. local CPU registers
2. local CPU cache: 5 ns
3. local memory: 318 ns
4. remote memory: 554 ns
5. remote caches
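The two memory latencies above give a quick way to estimate what data placement is worth. The sketch below is a simplified model, not a measurement: it assumes every access misses the cache and goes either to local memory (318 ns) or to remote memory (554 ns), and it ignores remote caches and contention. The function name is made up for this example.

```python
LOCAL_NS = 318.0   # local memory latency quoted on the slide
REMOTE_NS = 554.0  # remote memory latency quoted on the slide


def average_latency_ns(local_fraction: float) -> float:
    """Weighted average memory latency for a given share of local accesses."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * LOCAL_NS + (1.0 - local_fraction) * REMOTE_NS


if __name__ == "__main__":
    for fraction in (1.0, 0.9, 0.5, 0.0):
        print(f"{fraction:.0%} local: {average_latency_ns(fraction):.1f} ns")
```

With half of the accesses local, the model gives 436 ns, which is the motivation for the placement policies covered later in the deck.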

Directory Based Cache Coherency

Cache coherency: the system hardware guarantees that every cached copy remains a true reflection of the memory data, without software intervention.

Directory Bits consist of two parts:

a. 8-bit integer representing node that has exclusive ownership of data

b. Bit map that represents which nodes have copies of data in cache.

Cache Example

1. data read into cache for thread on CPU 0

2. threads on CPUs 1 and 2 read data into cache

3. thread on CPU 2 updates data in cache

(cacheline is set exclusive)

4. Eventually cache line gets invalidated
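The steps above can be sketched as a toy directory entry that holds the two parts described earlier: an exclusive owner and a bit map of sharers. This is a minimal model for illustration, not the Hub's actual protocol. It treats each CPU as its own node, and the class and method names are invented for this example.

```python
class DirectoryEntry:
    """Toy directory state for one cache line: exclusive owner plus sharer bit map."""

    def __init__(self):
        self.owner = None   # node holding the line exclusively, if any
        self.sharers = 0    # bit i set means node i has a cached copy

    def read(self, node: int) -> None:
        # A read ends any exclusive ownership; the old owner keeps a shared copy.
        self.owner = None
        self.sharers |= 1 << node

    def write(self, node: int) -> list:
        """Give `node` exclusive ownership and return the nodes to invalidate."""
        invalidated = [n for n in range(self.sharers.bit_length())
                       if (self.sharers >> n) & 1 and n != node]
        self.sharers = 1 << node
        self.owner = node
        return invalidated


if __name__ == "__main__":
    line = DirectoryEntry()
    line.read(0)              # step 1: CPU 0 reads the data
    line.read(1)              # step 2: CPUs 1 and 2 read the data
    line.read(2)
    stale = line.write(2)     # step 3: CPU 2 updates, line becomes exclusive
    print(stale, line.owner, bin(line.sharers))
```

The list returned by `write` names the copies that step 4 invalidates: here the copies held by CPUs 0 and 1.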

Router and Interconnect Fabric

•6-way non-blocking crossbar (9.3 Gbytes/sec)
•Link Level Protocol (LLP) uses CRC error checking
•1.56 Gbyte/sec (peak full-duplex) per port
•Packet delivery prioritization (credits, aging)
•Uses internal routing table and supports wormhole routing
•Internal buffers (SSR/SSD) down-convert 390MHz external signaling to core frequency
•Three ports connect to external 100-conductor NumaLink cables

[Diagram: Global Switch Interconnect, with routers (R) linking node boards (N).]

Origin 2000 Module

Basic Building Block

[Diagram: a module holds four node boards (each with two processors and caches, a Hub, main memory, and directory, plus an additional directory for systems larger than 32 processors) and two router boards, joined through the midplane to two XBOW units.]

Modules become Systems

Deskside (Module): 2-8 CPUs
Rack (2 Modules): 16 CPUs
Multi-rack (4 Modules): 32 CPUs
Etc...: ..128 CPUs

Origin 2000 Grows to Single Rack

Single Rack System
•2-16 CPUs
•32GB Memory
•24 XIO I/O slots

[Diagram: four routers (R) connecting eight node boards (N).]

Origin 2000 Grows to Multi-Rack

Multi-Rack System
•17-32 CPUs
•64GB Memory
•48 XIO I/O slots
•32-processor hypercube building block

[Diagram: eight routers (R) connecting sixteen node boards (N).]

Origin 2000 Grows to Large Systems

Large Multi-Rack Systems
•2-256 CPUs
•512GB Memory
•384 I/O slots

[Diagram: multiple racks combined into one larger system.]

Bisection Bandwidth as System Grows

CPUs   Bisection BW     Bisection BW    Router hops   Router hops   Latency   Latency
       (total, GB/s)    /CPU (GB/s)     (max)         (avg)         (max)     (avg)
  2    na               na              na            na             343 ns   343 ns
  8    1.56             0.195           1             0.75           759 ns   623 ns
 16    3.12             0.195           2             1.63           759 ns   691 ns
 32    6.24             0.195           3             2.19           836 ns   674 ns
 64    12.5             0.195           5             2.97          1067 ns   851 ns
128    25.0             0.195           6             3.98          1169 ns   959 ns
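The constant per-CPU column is the point of this table: total bisection bandwidth grows in step with the CPU count. A small check, using only the (CPUs, total bandwidth) pairs from the table and a helper name invented for this example, makes the arithmetic explicit.

```python
# (CPUs, total bisection bandwidth) pairs copied from the table, in the table's units.
ROWS = [(8, 1.56), (16, 3.12), (32, 6.24), (64, 12.5), (128, 25.0)]


def per_cpu(total: float, cpus: int) -> float:
    """Bisection bandwidth available per CPU."""
    return total / cpus


if __name__ == "__main__":
    for cpus, total in ROWS:
        print(f"{cpus:4d} CPUs: {per_cpu(total, cpus):.3f} per CPU")
```

Every row comes out at about 0.195 per CPU, whereas on a fixed bus the same division would shrink as CPUs are added.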

Memory Latency as System Grows

CPUs   Bisection BW     Bisection BW    Router hops   Router hops   Latency   Latency
       (total, GB/s)    /CPU (GB/s)     (max)         (avg)         (max)     (avg)
  8    1.56             0.195           1             0.75           759 ns   623 ns
 16    3.12             0.195           2             1.63           759 ns   691 ns
 32    6.24             0.195           3             2.19           836 ns   674 ns
 64    12.5             0.195           5             2.97          1067 ns   851 ns
128    25.0             0.195           6             3.98          1169 ns   959 ns

Origin 2000 Bandwidth Scales

[Chart: STREAM Triad results. X axis: CPUs (0 to 70). Y axis: STREAM Triad bandwidth (0 to 25000). Series: SGI Origin2000/300MHz, SGI Origin2000/250MHz, Sun UE10000, Compaq/DEC 8400, HP/Convex V2500, HP/Convex SPP.]

Performance on HPC job mix

[Chart: SPECfp_rate95 results. X axis: 8 to 128. Y axis: 0 to 35000. Series: SGI/195MHz, SGI/250MHz, SGI/300MHz, IBM/120MHz, IBM/160MHz, DEC, Sun E, Sun UE, HP/V, HP/SPP, HP/X.]

Enabling Technologies

IRIX: NUMA Aware OS and System Utilities

Default Memory Placement

Memory allocated on “first-touch” basis
- on node where process that defines page is running
- or as close as possible (minimize latency)
- developers should initialize work areas in newly created threads

IRIX scheduler maintains process affinity
- re-schedules jobs on processor where they ran last
- or on other CPU in the same node
- or as close as possible (minimize latency)
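The advice to initialize work areas in the threads that will use them follows directly from first-touch placement. The simulation below is a minimal sketch: it models pages and nodes as plain integers, ignores the "as close as possible" fallback and any later migration, and uses names invented for this example.

```python
def first_touch(touches):
    """Map each page to the node that touched it first.

    `touches` is an iterable of (page, node) pairs in time order.
    """
    home = {}
    for page, node in touches:
        home.setdefault(page, node)   # only the first touch decides placement
    return home


PAGES, NODES = 8, 4

# Serial initialization: one thread on node 0 touches every page first.
serial_init = [(p, 0) for p in range(PAGES)]
# Parallel initialization: each node's thread touches its own block first.
parallel_init = [(p, p * NODES // PAGES) for p in range(PAGES)]
# The compute phase, where each node works on its own block of pages.
work = [(p, p * NODES // PAGES) for p in range(PAGES)]

if __name__ == "__main__":
    print(first_touch(serial_init + work))    # every page lands on node 0
    print(first_touch(parallel_init + work))  # pages spread across nodes 0 to 3
```

With serial initialization, three of the four nodes make only remote accesses during the compute phase. With parallel initialization, every access in the compute phase is local.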

Alternatives to “first-touch” policy

Round Robin Allocation
- Data is distributed at run-time among all nodes used for execution
- setenv _DSM_ROUND_ROBIN_
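Round-robin allocation can be pictured as dealing pages out to the participating nodes in turn. The sketch below assumes a simple modulo rule over page numbers. The real granularity and ordering are decided by IRIX, and the function name is invented for this example.

```python
def round_robin_home(page: int, nodes: list) -> int:
    """Node that holds `page` when pages are dealt out to `nodes` in turn."""
    return nodes[page % len(nodes)]


if __name__ == "__main__":
    nodes_in_use = [0, 1, 2, 3]
    placement = [round_robin_home(p, nodes_in_use) for p in range(8)]
    print(placement)
```

No node gets all of its data locally under this rule, but no single node's memory becomes a hot spot either, which helps when the access pattern is irregular or unknown.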

Dynamic Page Migration

•IRIX can keep track of run-time memory access patterns and dynamically copy pages to a new node.
•Expensive operation. Requires: daemon, TLB invalidations, and the memory copy itself.
•setenv _DSM_MIGRATION ON
•setenv _DSM_MIGRATION_LEVEL 90
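The idea of migrating a page toward the node that uses it most can be sketched with per-node access counters. This is an illustrative model only: the `threshold` parameter is hypothetical and is not claimed to be what `_DSM_MIGRATION_LEVEL` controls, and the class name is invented for this example.

```python
from collections import Counter


class Page:
    """Toy page that moves to a remote node once that node clearly dominates."""

    def __init__(self, home: int):
        self.home = home
        self.accesses = Counter()   # node -> accesses since the last migration

    def access(self, node: int, threshold: int) -> bool:
        """Record one access; return True if the page migrated as a result."""
        self.accesses[node] += 1
        lead = self.accesses[node] - self.accesses[self.home]
        if node != self.home and lead >= threshold:
            self.home = node        # stands in for TLB invalidations plus the copy
            self.accesses.clear()
            return True
        return False


if __name__ == "__main__":
    page = Page(home=0)
    moved = [page.access(node, threshold=3) for node in (0, 1, 1, 1, 1)]
    print(moved, page.home)
```

A higher threshold makes migration rarer, which matters because each move pays for the invalidations and the copy listed above.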

Explicit Placement: source directives

      integer i, j, n, niters
      parameter (n = 8*1024*1024, niters = 1000)

c-----Note that the distribute directive is used after the arrays
c-----are declared.

      real a(n), b(n), q
c$distribute a(block), b(block)

c-----initialization
      do i = 1, n
         a(i) = 1.0 - 0.5*i
         b(i) = -10.0 + 0.01*(i*i)
      enddo

c-----real work
      do it = 1, niters
         q = 0.01*it
c$doacross local(i), shared(a,b,q), affinity (i) = data(a(i))
         do i = 1, n
            a(i) = a(i) + q*b(i)
         enddo
      enddo
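A block distribution splits the array into contiguous chunks, one per participating part, and the affinity clause then runs iteration i where a(i) lives. The sketch below shows one common way to compute the owner of an element. The ceiling-division rule for the block size is an assumption for illustration, not a statement of how the MIPSpro runtime picks block boundaries, and the function name is invented.

```python
def block_owner(i: int, n: int, parts: int) -> int:
    """0-based part holding element i (1-based, as in Fortran) of an n-element array."""
    block = -(-n // parts)   # ceiling division: elements per block
    return (i - 1) // block


if __name__ == "__main__":
    n, parts = 16, 4
    print([block_owner(i, n, parts) for i in range(1, n + 1)])
```

For 16 elements over 4 parts, elements 1 to 4 go to part 0, 5 to 8 to part 1, and so on, so each part's loop iterations touch only its own contiguous block.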

Explicit Placement: dprof / dplace

•Used for applications that don’t use libmp (i.e. explicit sproc, fork, pthreads, MPI, etc.)
•dprof: profiles memory access pattern
•dplace can:
  – Change the page size used
  – Enable page migration
  – Specify the topology used by the threads of a parallel program
  – Indicate resource affinities
  – Assign memory ranges to particular nodes

SGI 3rd Generation ccNUMA Implementation

Origin 3000 Family

Compute Module vs. Bricks

8P12 Compute Module (Origin 2000)

System “Bricks” (Origin 3000): C-Brick, R-Brick, P-Brick

Feeds and Speeds

                          Origin 2000                     Origin 3000
Node Density              2 CPUs/node                     4 CPUs/node
Memory Density            4 GBytes/node                   8 GBytes/node
Ports/Router              6 ports                         6 or 8 ports
Interconnect Bandwidth    1.6 GBytes/sec (full duplex)    3.2 GBytes/sec (full duplex)
CPU Technology            MIPS                            MIPS or IA64

Taking Advantage of Multiple CPUs

Parallel Programming Models Available on Origin Family

Many Different Models and Tools To Choose From

•Automatic Parallelization Option: compiler flags

•Compiler Source Directives: OpenMP, c$doacross, etc

•explicit multi-threading: pthreads, sproc

•Message Passing APIs: MPI, PVM

Computing Value of π: Simple Serial

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals:'
      read *, n
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
      pi = w * sum
      print *, 'computed pi =', pi
      stop
      end
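The program integrates 4/(1 + x*x) over [0, 1] with the midpoint rule, and the value of that integral is π. As a cross-check of the arithmetic, here is the same loop as a short Python sketch. The function name mirrors the Fortran program, and Python is used only for illustration.

```python
import math


def compute_pi(n: int) -> float:
    """Midpoint-rule estimate of the integral of 4/(1+x^2) over [0, 1]."""
    w = 1.0 / n                       # interval size
    total = 0.0
    for i in range(1, n + 1):
        x = w * (i - 0.5)             # midpoint of interval i
        total += 4.0 / (1.0 + x * x)
    return w * total


if __name__ == "__main__":
    for n in (10, 100, 1000):
        estimate = compute_pi(n)
        print(n, estimate, abs(estimate - math.pi))
```

Each iteration is independent of the others apart from the running sum, which is what makes this loop an easy target for the parallel models that follow.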

Automatic Parallelization Option

•Add-on option for SGI MIPSpro compilers
•Compiler searches for loops that it can parallelize

f77 -apo compute_pi.f77
setenv MP_SET_NUMTHREADS 4
./a.out

OpenMP Source Directives

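      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals:'
      read *, n
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(X), SHARED(W), REDUCTION(+:sum)
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
!$OMP END PARALLEL DO
      pi = w * sum
      print *, 'computed pi =', pi
      stop
      end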
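The REDUCTION(+:sum) clause gives each thread its own private partial sum and adds the partial sums together when the loop ends. The sketch below imitates that behaviour with Python threads over contiguous chunks of the index range. It illustrates the semantics only: CPython threads do not speed up this loop, and the function names and chunking rule are assumptions for this example, not what an OpenMP runtime does internally.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(first: int, last: int, w: float) -> float:
    """Private accumulator over intervals first..last (inclusive)."""
    s = 0.0
    for i in range(first, last + 1):
        x = w * (i - 0.5)             # x is private to this call
        s += 4.0 / (1.0 + x * x)
    return s


def compute_pi_threads(n: int, threads: int = 4) -> float:
    w = 1.0 / n                       # shared and read-only
    # Contiguous chunks that together cover 1..n exactly once.
    bounds = [(t * n // threads + 1, (t + 1) * n // threads)
              for t in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        parts = pool.map(lambda b: partial_sum(b[0], b[1], w), bounds)
        return w * sum(parts)         # the reduction step


if __name__ == "__main__":
    print(compute_pi_threads(100000))
```

Because no thread ever writes to another thread's accumulator, the loop needs no locking. The only shared update is the final sum.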

Message Passing Interface (MPI)

      program compute_pi
      include 'mpif.h'
      integer n, i, myid, numprocs, rc, ierr
      double precision w, x, sum, pi, mypi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
      if (myid .eq. 0) then
         print *, 'Enter number of intervals:'
         read *, n
      endif
      call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = myid+1, n, numprocs
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
      mypi = w * sum
c     collect all the partial sums
      call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
     $     MPI_COMM_WORLD,ierr)
c     node 0 prints the answer
      if (myid .eq. 0) then
         print *, 'computed pi =', pi
      endif
      call MPI_FINALIZE(rc)
      stop
      end
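The loop `do i = myid+1, n, numprocs` deals the intervals out cyclically: rank 0 takes intervals 1, 1+numprocs, and so on, while rank 1 takes 2, 2+numprocs, and so on. The sketch below simulates the ranks in a single Python process to show that split. It does not use MPI at all. The function names are invented for this example, and summing the per-rank values stands in for MPI_REDUCE with MPI_SUM.

```python
def rank_intervals(myid: int, numprocs: int, n: int) -> range:
    """Indices visited by one rank, as in: do i = myid+1, n, numprocs."""
    return range(myid + 1, n + 1, numprocs)


def rank_partial_pi(myid: int, numprocs: int, n: int) -> float:
    """The value one rank would hold in mypi before the reduction."""
    w = 1.0 / n
    return w * sum(4.0 / (1.0 + (w * (i - 0.5)) ** 2)
                   for i in rank_intervals(myid, numprocs, n))


if __name__ == "__main__":
    n, numprocs = 1000, 4
    print(list(rank_intervals(1, numprocs, 10)))   # rank 1 of 4, 10 intervals
    # Summing the per-rank values stands in for MPI_REDUCE with MPI_SUM.
    print(sum(rank_partial_pi(r, numprocs, n) for r in range(numprocs)))
```

Every interval is visited by exactly one rank, so the reduced value matches the serial result up to floating-point rounding.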
