Origin 2000 ccNUMA Architecture
Joe Goyette, Systems Engineer
goyette@sgi.com




February 2001

Presentation Overview

1. ccNUMA Basics
2. SGI’s ccNUMA Implementation (O2K)
3. Supporting OS Technology
4. SGI’s NextGen ccNUMA (O3K) (brief)
5. Q&A


ccNUMA

cc: cache coherent
NUMA: Non-Uniform Memory Access

• Memory is physically distributed throughout the system
• Memory and peripherals are globally addressable
• Local memory accesses are faster than remote accesses (Non-Uniform Memory Access = NUMA)
• Local accesses on different nodes do not interfere with each other


Typical SMP Model

[Figure: several processors, each with a snoopy cache, share main memory and I/O over a central bus]


Typical MPP Model

[Figure: three independent nodes, each with its own processor, main memory, I/O, and operating system, joined by an interconnect network (e.g. GSN, 100BaseT, Myrinet)]


Scalable Cache Coherent Memory

Shared-memory Systems (SMP): easy to program, hard to scale
Massively Parallel Systems (MPP): easy to scale, hard to program
Scalable Shared Memory Systems (ccNUMA): easy to program, easy to scale


Origin ccNUMA vs other Architectures

> Single Address Space

> Modular Design

> All aspects scale as system grows

> Low-latency, high bandwidth global memory

[Table columns: Origin, Conventional SMP, Other NUMA, Clusters/MPP]


Origin ccNUMA Advantage

[Figure: interconnection bisections compared for fixed-bus SMP, other NUMA, clusters/MPP, and Origin 2000 ccNUMA; N = node, R = router]


IDC: NUMA is the future

Architecture type    1997 share   Change        1996 share
Bus-based SMP        41.0%        -13.7 pts.    54.7%
NUMA SMP             20.8%        +16.9 pts.    3.9%
Message Passing      15.3%        -1.1 pts.     16.4%
Switch-based SMP     12.1%        -0.9 pts.     13.0%
Uni-processor        5.5%         -3.9 pts.     9.4%
NUMA (uni-node)      5.3%         +3.8 pts.     1.5%

“Buses are the preferred approach for SMP implementations because of their relatively low cost. However, scalability is limited by the performance of the bus.”

“NUMA SMP ... appears to be the preferred memory architecture for next-generation systems.”

- IDC, September 1998

Source: High Performance Technical Computing Market: Review and Forecast, 1997-2002, International Data Corporation, September 1998


SGI’s First Commercial ccNUMA Implementation

Origin 2000 Architecture


History of Multiprocessing at SGI

[Chart: CPU count (2 to 256) by year (1993 to 2000), annotated "Origin 2000 ccNUMA introduced". Systems shown in order: Challenge, 2-36 CPUs; Origin 2000, 2-32 CPUs; Origin 2000, 2-64 CPUs; Origin 2000, 2-256 CPUs; Origin 3000, 2-1024 CPUs]


Origin 2000 Logical Diagram: 32 CPU Hypercube (3D)

[Figure: eight routers (R) at the corners of a 3D hypercube, connecting sixteen node boards (N)]
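As a rough illustration of the hypercube topology, the sketch below numbers the eight routers 0 to 7 and treats two routers as directly linked when their IDs differ in exactly one bit. The numbering scheme and the function names are assumptions made for this example, not something the slide specifies. Under that assumption the number of links crossed between two routers is the Hamming distance of their IDs.

```python
from itertools import product


def router_hops(src: int, dst: int) -> int:
    """Links crossed between two hypercube routers (Hamming distance of IDs)."""
    return bin(src ^ dst).count("1")


def hop_stats(dimensions: int):
    """Return (worst case, mean) link count over all ordered router pairs."""
    routers = range(2 ** dimensions)
    hops = [router_hops(a, b) for a, b in product(routers, repeat=2)]
    return max(hops), sum(hops) / len(hops)


worst, mean = hop_stats(3)  # 3D hypercube: 8 routers
print(f"worst case {worst} links, mean {mean} links")
```

This counts router-to-router links only and includes each router paired with itself, so it is not expected to match hop figures that are counted differently elsewhere in the deck.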


Origin 2000 Node Board

Basic Building Block

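[Figure: node board with two processors, each with its own cache, attached to a Hub; the Hub connects to main memory and its directory, plus an additional directory for systems larger than 32 processors (>32P)]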


MIPS R12000 CPU

• 64-bit RISC design, 0.25-micron CMOS process
• Single-chip four-way superscalar RISC dataflow architecture
• 5 fully-pipelined execution units
• Supports speculative and out-of-order execution
• 8MB L2 cache Origin 2000, 4MB Origin 200
• 32KB 2-way set-associative instruction and data caches
• 2,048-entry branch prediction table
• 48-entry active list
• 32-entry two-way set-associative Branch Target Address Cache (BTAC)
• Doubled L2 way prediction table for improved L2 hit rate
• Improved branch prediction by using global history mechanism
• Improved performance monitoring support
• Maintains code and instruction set compatibility with R10000


Memory Hierarchy

1. local CPU registers
2. local CPU cache: 5 ns
3. local memory: 318 ns
4. remote memory: 554 ns
5. remote caches
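To see why data placement matters with these latencies, the sketch below computes an average access time from the three figures quoted on the slide. The cache hit rate and the local/remote split are hypothetical inputs chosen for illustration, and the model ignores registers, remote caches, and contention.

```python
# Latencies quoted on the slide, in nanoseconds.
LOCAL_CACHE_NS = 5.0
LOCAL_MEMORY_NS = 318.0
REMOTE_MEMORY_NS = 554.0


def average_latency_ns(cache_hit_rate: float, local_fraction: float) -> float:
    """Weighted average access time.

    cache_hit_rate: fraction of accesses served from the local cache.
    local_fraction: fraction of cache misses served from local memory;
                    the remaining misses go to remote memory.
    """
    miss_rate = 1.0 - cache_hit_rate
    memory_ns = (local_fraction * LOCAL_MEMORY_NS
                 + (1.0 - local_fraction) * REMOTE_MEMORY_NS)
    return cache_hit_rate * LOCAL_CACHE_NS + miss_rate * memory_ns


# Hypothetical workload: 90% cache hits, misses either all local or all remote.
print(average_latency_ns(0.9, 1.0))
print(average_latency_ns(0.9, 0.0))
```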


Directory Based Cache Coherency

Cache coherency: the system hardware guarantees that every cached copy remains a true reflection of the memory data, without software intervention.

Directory Bits consist of two parts:

a. 8-bit integer representing node that has exclusive ownership of data

b. Bit map that represents which nodes have copies of data in cache.
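The two-part directory entry described above can be modelled roughly as follows. This is a minimal sketch: the class and method names are hypothetical, and the real hardware layout and state encoding are not given on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirectoryEntry:
    """Toy model of one directory entry (names are illustrative only)."""
    owner: Optional[int] = None  # part (a): node with exclusive ownership
    sharers: int = 0             # part (b): bit i set => node i has a copy

    def add_sharer(self, node: int) -> None:
        """Record that a node now holds a cached copy."""
        self.sharers |= 1 << node

    def set_exclusive(self, node: int) -> None:
        """Give one node exclusive ownership; the owner must fit in 8 bits."""
        if not 0 <= node < 256:
            raise ValueError("owner must fit in an 8-bit integer")
        self.owner = node
        self.sharers = 1 << node

    def sharer_nodes(self) -> list:
        """Decode the bit map into a sorted list of node numbers."""
        return [n for n in range(self.sharers.bit_length())
                if (self.sharers >> n) & 1]


entry = DirectoryEntry()
entry.add_sharer(0)
entry.add_sharer(5)
print(entry.sharer_nodes())
```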


Cache Example

1. data read into cache for thread on CPU 0

2. threads on CPUs 1 and 2 read data into cache

3. thread on CPU 2 updates data in cache

(cacheline is set exclusive)

4. Eventually cache line gets invalidated
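The four steps above can be traced with a small simulation. It assumes an invalidation-based protocol in which the copies held by the other CPUs are invalidated when CPU 2 takes the line exclusive; the slide only says invalidation happens "eventually", so the timing here is an assumption, and the function name is hypothetical.

```python
def run_cache_example():
    """Replay the slide's four steps and return the final directory state."""
    sharers = set()  # CPUs that hold a valid copy of the cache line

    # Step 1: a thread on CPU 0 reads the data into its cache.
    sharers.add(0)

    # Step 2: threads on CPUs 1 and 2 read the same data.
    sharers.update({1, 2})

    # Step 3: the thread on CPU 2 updates the data; the line is set exclusive.
    writer = 2
    owner = writer

    # Step 4 (assumed timing): every other cached copy is invalidated.
    invalidated = sorted(sharers - {writer})
    sharers = {writer}

    return owner, sorted(sharers), invalidated


owner, sharers, invalidated = run_cache_example()
print("owner:", owner, "sharers:", sharers, "invalidated:", invalidated)
```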


Router and Interconnect Fabric

• 6-way non-blocking crossbar (9.3 Gbytes/sec)
• Link Level Protocol (LLP) uses CRC error checking
• 1.56 Gbyte/sec (peak full-duplex) per port
• Packet delivery prioritization (credits, aging)
• Uses internal routing table and supports wormhole routing
• Internal buffers (SSR/SSD) down-convert 390MHz external signaling to core frequency
• Three ports connect to external 100 conductor NumaLink cables

[Figure: Global Switch Interconnect, with routers (R) linking node boards (N)]
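The quoted crossbar figure appears to be roughly the per-port peak multiplied by the six ports. The short check below makes that arithmetic explicit; reading the 9.3 Gbytes/sec number this way is an assumption, since the slide does not show how it was derived.

```python
PORTS = 6                       # 6-way crossbar
PER_PORT_GBYTES_PER_SEC = 1.56  # peak full-duplex bandwidth per port


def crossbar_peak(ports: int = PORTS,
                  per_port: float = PER_PORT_GBYTES_PER_SEC) -> float:
    """Aggregate peak bandwidth if every port runs at its peak rate."""
    return ports * per_port


print(f"{crossbar_peak():.2f} Gbytes/sec")
```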


Origin 2000 Module

Basic Building Block

[Figure: module containing four node boards (each with two processors and caches, a Hub, main memory, a directory, and a >32P directory), two router boards, and two XBOW crossbars, all joined by a midplane]


Modules become Systems

Deskside (Module): 2-8 CPUs
Rack (2 Modules): 16 CPUs
Multi-rack (4 Modules): 32 CPUs
Etc...: ..128 CPUs


Origin 2000 Grows to Single Rack

Single Rack System
• 2-16 CPUs
• 32GB Memory
• 24 XIO I/O slots

[Figure: four routers (R) connecting eight node boards (N)]


Origin 2000 Grows to Multi-Rack

Multi-Rack System
• 17-32 CPUs
• 64GB Memory
• 48 XIO I/O slots
• 32-processor hypercube building block

[Figure: eight routers (R) connecting sixteen node boards (N)]


Origin 2000 Grows to Large Systems

Large Multi-Rack Systems
• 2-256 CPUs
• 512GB Memory
• 384 I/O slots

[Figure: multiple racks combined into one larger system]


Bisection Bandwidth as System Grows

CPUs  Bisection BW, total  Bisection BW/CPU  Router hops  Router hops  Latency  Latency
      (GBytes/sec)         (GBytes/sec)      (max)        (avg)        (max)    (avg)
2     na                   na                na           na           343 ns   343 ns
4     na                   na                na           na           554 ns   441 ns
8     1.56                 0.195             1            0.75         759 ns   623 ns
16    3.12                 0.195             2            1.63         759 ns   691 ns
32    6.24                 0.195             3            2.19         836 ns   674 ns
64    12.5                 0.195             5            2.97         1067 ns  851 ns
128   25.0                 0.195             6            3.98         1169 ns  959 ns
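The constant per-CPU column follows directly from dividing total bisection bandwidth by the CPU count. The sketch below repeats that division for the rows that report a bandwidth figure; the variable and function names are illustrative.

```python
# (CPUs, total bisection bandwidth) pairs copied from the table above.
BISECTION_TOTAL = [(8, 1.56), (16, 3.12), (32, 6.24), (64, 12.5), (128, 25.0)]


def per_cpu_bandwidth(rows):
    """Return {CPUs: total bandwidth / CPUs} for each table row."""
    return {cpus: total / cpus for cpus, total in rows}


for cpus, share in per_cpu_bandwidth(BISECTION_TOTAL).items():
    print(f"{cpus:4d} CPUs: {share:.3f} per CPU")
```

Because the total doubles each time the CPU count doubles, the share available to each CPU stays essentially flat as the system grows.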


Memory Latency as System Grows

CPUs  Bisection BW, total  Bisection BW/CPU  Router hops  Router hops  Latency  Latency
      (GBytes/sec)         (GBytes/sec)      (max)        (avg)        (max)    (avg)
2     na                   na                na           na           343 ns   343 ns
4     na                   na                na           na           554 ns   441 ns
8     1.56                 0.195             1            0.75         759 ns   623 ns
16    3.12                 0.195             2            1.63         759 ns   691 ns
32    6.24                 0.195             3            2.19         836 ns   674 ns
64    12.5                 0.195             5            2.97         1067 ns  851 ns
128   25.0                 0.195             6            3.98         1169 ns  959 ns
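One way to read the latency columns is to compare how fast average latency grows against how fast the machine grows. The sketch below does that with the average-latency values from the table; the helper names are hypothetical.

```python
# CPUs -> average memory latency in ns, copied from the table above.
AVG_LATENCY_NS = {2: 343, 4: 441, 8: 623, 16: 691, 32: 674, 64: 851, 128: 959}


def latency_growth(small: int, large: int) -> float:
    """Ratio of average latency at the larger size to the smaller size."""
    return AVG_LATENCY_NS[large] / AVG_LATENCY_NS[small]


def size_growth(small: int, large: int) -> float:
    """Ratio of CPU counts between the two configurations."""
    return large / small


print(f"CPUs grow {size_growth(2, 128):.0f}x, "
      f"average latency grows {latency_growth(2, 128):.2f}x")
```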


Origin 2000 Bandwidth Scales

[Chart: STREAM Triad results. STREAM Triad bandwidth (axis 0 to 25000) versus CPUs (axis 0 to 70) for SGI Origin2000/300MHz, SGI Origin2000/250MHz, Sun UE10000, Compaq/DEC 8400, HP/Convex V2500, and HP/Convex SPP]


Performance on HPC job mix

[Chart: SPECfp_rate95 results (axis 0 to 35000) versus CPUs (axis 8 to 128) for SGI/195MHz, SGI/250MHz, SGI/300MHz, IBM/120MHz, IBM/160MHz, DEC, Sun E, Sun UE, HP/V, HP/SPP, and HP/X]


Enabling Technologies

IRIX: NUMA Aware OS and System Utilities


Default Memory Placement

Memory allocated on “first-touch” basis
- on node where process that defines page is running
- or as close as possible (minimize latency)
- developers should initialize work areas in newly created threads

IRIX scheduler maintains process affinity
- re-schedules jobs on processor where they ran last
- or on other CPU in the same node
- or as close as possible (minimize latency)
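The advice to initialize work areas inside the threads that use them follows from the first-touch rule. The toy model below places each page on the node that touches it first and compares serial initialization with parallel initialization. It is a simplified sketch: it ignores the "as close as possible" fallback, and the page and node numbering is invented for the example.

```python
def first_touch_placement(touches):
    """Map each page to the node that touched it first.

    touches: iterable of (page, node) pairs in time order.
    """
    home = {}
    for page, node in touches:
        home.setdefault(page, node)  # later touches do not move the page
    return home


PAGES = range(8)

# Serial initialization on node 0, then four nodes each use two pages.
serial_init = [(p, 0) for p in PAGES] + [(p, p // 2) for p in PAGES]

# Parallel initialization: each node touches its own two pages first.
parallel_init = [(p, p // 2) for p in PAGES]

print(first_touch_placement(serial_init))    # every page lands on node 0
print(first_touch_placement(parallel_init))  # pages spread over nodes 0-3
```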


Alternatives to “first-touch” policy

Round Robin Allocation
- Data is distributed at run-time among all nodes used for execution
- setenv _DSM_ROUND_ROBIN_
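As a rough picture of round-robin allocation, the sketch below deals pages out to the participating nodes in turn. The page-by-page granularity and the ordering are assumptions made for illustration; the slide does not describe how IRIX actually cycles through the nodes.

```python
def round_robin_placement(num_pages: int, nodes):
    """Assign page k to nodes[k mod len(nodes)] (illustrative policy only)."""
    return {page: nodes[page % len(nodes)] for page in range(num_pages)}


print(round_robin_placement(6, [0, 1, 2, 3]))
```

Spreading pages this way trades some locality for an even load on every node's memory, which can help when no single thread dominates access to the data.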


Dynamic Page Migration

• IRIX can keep track of run-time memory access patterns and dynamically copy pages to a new node.
• Expensive operation. Requires: daemon, TLB invalidations, and the memory copy itself.
• setenv _DSM_MIGRATION ON
• setenv _DSM_MIGRATION_LEVEL 90


Explicit Placement: source directives

      integer i, j, n, niters
      parameter (n = 8*1024*1024, niters = 1000)

c-----Note that the distribute directive is used after the arrays
c-----are declared.

      real a(n), b(n), q
c$distribute a(block), b(block)

c-----initialization
      do i = 1, n
         a(i) = 1.0 - 0.5*i
         b(i) = -10.0 + 0.01*(i*i)
      enddo

c-----real work
      do it = 1, niters
         q = 0.01*it
c$doacross local(i), shared(a,b,q), affinity (i) = data(a(i))
         do i = 1, n
            a(i) = a(i) + q*b(i)
         enddo
      enddo
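To make the effect of a block distribution concrete, the sketch below maps each 1-based array index to the node that would hold it when the array is cut into equal contiguous blocks. The ceiling-based block size is an assumption; the slide does not state how the compiler rounds when the array length is not a multiple of the node count.

```python
import math


def block_owner(index: int, n: int, num_nodes: int) -> int:
    """Node holding element `index` (1-based) of an n-element array.

    Assumes contiguous blocks of ceil(n / num_nodes) elements each.
    """
    if not 1 <= index <= n:
        raise IndexError("index out of range")
    block = math.ceil(n / num_nodes)
    return (index - 1) // block


# Eight elements over four nodes: two consecutive elements per node.
print([block_owner(i, 8, 4) for i in range(1, 9)])
```

With an affinity clause that ties iteration i to the data a(i), each iteration would then run on the node that owns its element, so the loop body touches local memory.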


Explicit Placement: dprof / dplace

• Used for applications that don’t use libmp (e.g. explicit sproc, fork, pthreads, MPI, etc.)
• dprof: profiles memory access patterns
• dplace can:
  – Change the page size used
  – Enable page migration
  – Specify the topology used by the threads of a parallel program
  – Indicate resource affinities
  – Assign memory ranges to particular nodes


SGI 3rd Generation ccNUMA Implementation

Origin 3000 Family


Compute Module vs. Bricks

8P12 Compute Module (Origin 2000)

System “Bricks” (Origin 3000): C-Brick, R-Brick, P-Brick


Feeds and Speeds

                        Origin 2000                   Origin 3000
Node Density            2 CPUs/node                   4 CPUs/node
Memory Density          4 GBytes/node                 8 GBytes/node
Ports/Router            6 ports                       6 or 8 ports
Interconnect Bandwidth  1.6 GBytes/sec (full duplex)  3.2 GBytes/sec (full duplex)
CPU Technology          MIPS                          MIPS or IA64


Taking Advantage of Multiple CPUs

Parallel Programming Models Available on Origin Family


Many Different Models and Tools To Choose From

•Automatic Parallelization Option: compiler flags

•Compiler Source Directives: OpenMP, c$doacross, etc

•explicit multi-threading: pthreads, sproc

•Message Passing APIs: MPI, PVM


Computing Value of π: Simple Serial

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals:'
      read *, n
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
      pi = w * sum
      print *, 'computed pi =', pi
      stop
      end
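For readers who want to try the calculation without a Fortran compiler, here is the same midpoint-rule integration of 4/(1+x²) over [0, 1] as a Python sketch. It is a transliteration for illustration, with the interval count passed as an argument instead of read from the terminal.

```python
import math


def compute_pi(n: int) -> float:
    """Approximate pi by the midpoint rule with n intervals on [0, 1]."""
    w = 1.0 / n                    # interval size
    total = 0.0
    for i in range(1, n + 1):
        x = w * (i - 0.5)          # midpoint of interval i
        total += 4.0 / (1.0 + x * x)
    return w * total


approx = compute_pi(1000)
print("computed pi =", approx, "error =", abs(approx - math.pi))
```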


Automatic Parallelization Option

• Add-on option for SGI MIPSpro compilers
• Compiler searches for loops that it can parallelize

f77 -apo compute_pi.f77
setenv MP_SET_NUM_THREADS 4
./a.out


OpenMP Source Directives

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals:'
      read *, n
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(X), SHARED(W), REDUCTION(+:sum)
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
!$OMP END PARALLEL DO
      pi = w * sum
      print *, 'computed pi =', pi
      stop
      end
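The REDUCTION(+:sum) clause gives each thread a private partial sum and adds the partial sums together at the end of the loop. The Python sketch below imitates that pattern with a thread pool. It illustrates the reduction semantics only: the chunking scheme and function names are assumptions, and Python threads will not deliver the speedup that OpenMP threads do.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def partial_sum(n: int, start: int, stop: int) -> float:
    """Private sum of f(x) over intervals start..stop-1 (1-based)."""
    w = 1.0 / n
    local = 0.0
    for i in range(start, stop):
        x = w * (i - 0.5)
        local += 4.0 / (1.0 + x * x)
    return local


def parallel_pi(n: int, workers: int = 4) -> float:
    """Split the loop into contiguous chunks, then reduce the partial sums."""
    chunk = math.ceil(n / workers)
    bounds = [(s, min(s + chunk, n + 1)) for s in range(1, n + 1, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda b: partial_sum(n, *b), bounds))
    return sum(partials) / n       # same as w * sum in the Fortran version


print("computed pi =", parallel_pi(1000, 4))
```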


Message Passing Interface (MPI)

      program compute_pi
      include 'mpif.h'
      integer n, i, myid, numprocs, rc, ierr
c     mypi is declared explicitly; implicit typing would make it integer
      double precision w, x, sum, pi, mypi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
      if (myid .eq. 0) then
         print *, 'Enter number of intervals:'
         read *, n
      endif
      call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = myid+1, n, numprocs
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
      mypi = w * sum
c     collect all the partial sums
      call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
     $                MPI_COMM_WORLD,ierr)
c     node 0 prints the answer
      if (myid .eq. 0) then
         print *, 'computed pi =', pi
      endif
      call MPI_FINALIZE(rc)
      stop
      end
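The MPI version hands out intervals cyclically: rank myid takes intervals myid+1, myid+1+numprocs, and so on, and the partial results are summed on rank 0. The sketch below simulates that decomposition in a single Python process so it can be run without an MPI installation. No MPI library is used; the loop over ranks stands in for the separate processes and the final sum stands in for MPI_REDUCE.

```python
import math


def rank_partial_pi(n: int, myid: int, numprocs: int) -> float:
    """Partial result one rank would compute over its cyclic share."""
    w = 1.0 / n
    local = 0.0
    for i in range(myid + 1, n + 1, numprocs):  # do i = myid+1, n, numprocs
        x = w * (i - 0.5)
        local += 4.0 / (1.0 + x * x)
    return w * local                             # mypi = w * sum


def simulated_reduce_pi(n: int, numprocs: int) -> float:
    """Stand-in for MPI_REDUCE with MPI_SUM: add every rank's partial."""
    return sum(rank_partial_pi(n, rank, numprocs) for rank in range(numprocs))


approx = simulated_reduce_pi(1000, 4)
print("computed pi =", approx, "error =", abs(approx - math.pi))
```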
