Architecture of Parallel Computers
CSC / ECE 506
Summer 2006
Scalable Multiprocessors
Lecture 10
6/19/2006
Dr. Steve Hunter
CSC / ECE 506 – Arch of Parallel Computers
What is a Multiprocessor?
• A collection of communicating processors
  – Goals: balance load, reduce inherent communication and extra work
• A multi-cache, multi-memory system
  – Role of these components essential regardless of programming model
  – Programming model and communication abstraction affect specific performance tradeoffs
[Figure: processors with private caches, each attached to a node controller, connected through an interconnect]
Scalable Multiprocessors
• Study of machines that scale from 100s to 1000s of processors
• Scalability has implications at all levels of system design, and all aspects must scale
• Areas emphasized in the text:
  – Memory bandwidth must scale with the number of processors
  – The communication network must provide scalable bandwidth at reasonable latency
  – Protocols used for transferring data and synchronization techniques must scale
• A scalable system attempts to avoid inherent design limits on the extent to which resources can be added to the system. For example:
  – How does the bandwidth/throughput of the system change when adding processors?
  – How does the latency or time per operation increase?
  – How does the cost of the system increase?
  – How are the systems packaged?
Scalable Multiprocessors
• Basic metrics affecting the scalability of a computer system from an application perspective are (Hwang 93):
  – Machine size: the number of processors
  – Clock rate: determines the basic machine cycle
  – Problem size: amount of computational workload, or the number of data points
  – CPU time: the actual CPU time in seconds
  – I/O demand: the input/output demand in moving the program, data, and results
  – Memory capacity: the amount of main memory used in a program execution
  – Communication overhead: the amount of time spent on interprocessor communication, synchronization, remote access, etc.
  – Computer cost: the total cost of hardware and software resources required to execute a program
  – Programming overhead: the development overhead associated with an application program
• Power (watts) and cooling are also becoming inhibitors to scalability
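Several of these metrics pull against each other as machine size grows. A toy model makes the tradeoff concrete; the cost formula and all numbers below are illustrative assumptions, not data from the lecture.

```python
# Toy scaling model: speedup on a fixed-size problem when each added
# processor also adds a fixed amount of communication/synchronization
# overhead. All numbers here are illustrative assumptions.

def speedup(n_procs: int, compute_s: float, overhead_s_per_proc: float) -> float:
    """Serial work is divided evenly, but overhead grows with machine size."""
    parallel_time = compute_s / n_procs + overhead_s_per_proc * n_procs
    return compute_s / parallel_time

# With 100 s of compute and 5 ms of overhead per processor, speedup rises,
# peaks, and then falls as communication overhead dominates.
for n in (1, 10, 100, 1000):
    print(f"{n:5d} processors -> speedup {speedup(n, 100.0, 0.005):.1f}")
```

The point of the sketch is only that "communication overhead" in the list above is a first-class scalability metric: past some machine size it, not the compute, determines the runtime.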
Scalable Multiprocessors
• Some other recent trends:
  – Multi-core processors on a single socket
  – Reduced focus on increasing the processor clock rate
  – System-on-Chip (SoC) designs combining processor cores, integrated interconnect, cache, high-performance I/O, etc.
  – Geographically distributed applications utilizing Grid and HPC technologies
  – Standardization of high-performance interconnects (e.g., InfiniBand, Ethernet) and a focus by the Ethernet community on reducing latency
  – For example, Force10's recently announced 10Gb Ethernet switch:
    » The S2410 data center switch has set industry benchmarks for 10 Gigabit price and latency
    » Designed for high-performance clusters, 10 Gigabit Ethernet connectivity to the server, and Ethernet-based storage solutions, the S2410 supports 24 line-rate 10 Gigabit Ethernet ports with an ultra-low switching latency of 300 nanoseconds at an industry-leading price point
    » The S2410 eliminates the need to integrate InfiniBand or proprietary technologies into the data center and opens the high-performance storage market to 10 Gigabit Ethernet technology. Standardizing on 10 Gigabit Ethernet in the data center core, edge, and storage radically simplifies management and reduces total network cost
Bandwidth Scalability
• What fundamentally limits bandwidth?
  – The number of wires and the clock rate
• Must have many independent wires or a high clock rate
• Connectivity through a bus or through switches
[Figure: processor–memory nodes attached via switches (S); typical switches include buses, multiplexers, and crossbars]
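The wires-times-clock limit can be sketched numerically; the link widths and clock rates below are assumptions chosen for illustration, not figures from the slide.

```python
# Peak link bandwidth is bounded by (independent data wires) x (clock rate).
# Widths and clock rates below are illustrative assumptions.

def peak_bandwidth_gbs(data_wires: int, clock_ghz: float) -> float:
    """One bit per wire per cycle, converted from Gbit/s to GB/s."""
    return data_wires * clock_ghz / 8.0

# Doubling either the number of wires or the clock doubles the bandwidth,
# which is why scalable designs favor many independent switched links
# over one shared bus.
assert peak_bandwidth_gbs(16, 1.0) == 2.0
assert peak_bandwidth_gbs(32, 1.0) == peak_bandwidth_gbs(16, 2.0) == 4.0
```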
Some Memory Models
[Figure: three memory models]
• Shared cache: processors P1..Pn share a first-level cache ($) through a switch to an interleaved main memory
• Centralized memory (dance hall, UMA): each processor has its own cache, with all memory modules on the far side of the interconnection network
• Distributed memory (NUMA): each processor has its own cache and a local memory module, all connected by the interconnection network
Generic Distributed Memory Organization
[Figure: each node contains a processor (P), cache ($), memory (M), and a communication assist (CA), attached via a switch to a scalable network]
• Network bandwidth requirements?
  – For independent processes?
  – For communicating processes?
• Latency?
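The bandwidth and latency questions above are commonly summarized with a linear cost model: a fixed software overhead, a per-hop routing delay, and a size-dependent transfer time. The model shape is standard; the parameter values below are assumptions.

```python
# Linear communication cost model: overhead + routing delay + transfer time.
# Parameter values are illustrative assumptions.

def message_time_us(n_bytes: int, overhead_us: float, hops: int,
                    per_hop_us: float, bandwidth_mb_s: float) -> float:
    """1 MB/s moves 1 byte/us, so n_bytes / bandwidth_mb_s is microseconds."""
    return overhead_us + hops * per_hop_us + n_bytes / bandwidth_mb_s

# Independent processes exchange little data, so overhead and hop delay
# dominate; communicating processes moving large messages are
# bandwidth-bound.
short = message_time_us(64, 2.0, 4, 0.1, 1000.0)        # ~2.5 us
bulk = message_time_us(1_000_000, 2.0, 4, 0.1, 1000.0)  # ~1 ms
```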
Some Examples
AMD Opteron Processor Technology
AMD Opteron Architecture
• AMD Opteron™ Processor Key Architectural Features
  – Single-Core and Dual-Core AMD Opteron processors
  – Direct Connect Architecture
  – Integrated DDR DRAM Memory Controller
  – HyperTransport™ Technology
  – Low-Power
AMD Opteron Architecture
• Direct Connect Architecture
  – Addresses and helps reduce the real challenges and bottlenecks of system architectures
  – Memory is directly connected to the CPU, optimizing memory performance
  – I/O is directly connected to the CPU for more balanced throughput and I/O
  – CPUs are connected directly to CPUs, allowing for more linear symmetrical multiprocessing
• Integrated DDR DRAM Memory Controller
  – Changes the way the processor accesses main memory, resulting in increased bandwidth, reduced memory latencies, and increased processor performance
  – Available memory bandwidth scales with the number of processors
  – 128-bit wide integrated DDR DRAM memory controller capable of supporting up to eight (8) registered DDR DIMMs per processor
  – Available memory bandwidth up to 6.4 GB/s (with PC3200) per processor
• HyperTransport™ Technology
  – Provides a scalable bandwidth interconnect between processors, I/O subsystems, and other chipsets
  – Support for up to three (3) coherent HyperTransport links, providing up to 24.0 GB/s peak bandwidth per processor
  – Up to 8.0 GB/s bandwidth per link, providing sufficient bandwidth for supporting new interconnects including PCI-X, DDR, InfiniBand, and 10G Ethernet
  – Offers low power consumption (1.2 volts) to help reduce a system's thermal budget
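The slide's aggregate figure follows directly from the per-link number:

```python
# The quoted 24.0 GB/s peak per processor is simply three coherent
# HyperTransport links at 8.0 GB/s each.
coherent_links = 3
per_link_gb_s = 8.0
aggregate_gb_s = coherent_links * per_link_gb_s
assert aggregate_gb_s == 24.0
```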
AMD Processor Architecture
• Low-Power Processors
  – The AMD Opteron processor HE offers industry-leading performance per watt, making it an ideal solution for rack-dense 1U servers or blades in datacenter environments as well as cooler, quieter workstation designs.
  – The AMD Opteron processor EE provides the maximum I/O bandwidth currently available in a single-CPU controller, making it a good fit for embedded controllers in markets such as NAS and SAN.
• Other features of the AMD Opteron processor include:
  – 64-bit wide key data and address paths that incorporate a 48-bit virtual address space and a 40-bit physical address space
  – ECC (Error Correcting Code) protection for L1 cache data, L2 cache data and tags, and DRAM, with hardware scrubbing of all ECC-protected arrays
  – 90nm SOI (Silicon on Insulator) process technology for lower thermal output levels and improved frequency scaling
  – Support for all instructions necessary to be fully compatible with SSE2 technology
  – Two (2) additional pipeline stages (compared to AMD's seventh-generation architecture) for increased performance and frequency scalability
  – Higher IPC (Instructions per Clock) achieved through additional key features, such as larger TLBs (Translation Look-aside Buffers), flush filters, and an enhanced branch prediction algorithm
AMD vs Intel
• Performance
  – SPECint® rate2000: the Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8GHz processor by 28 percent
  – SPECfp® rate2000: the Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8GHz processor by 76 percent
  – SPECjbb®2005: the Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8GHz by 13 percent
• Processor Power (Watts)
  – Dual-Core AMD Opteron™ processors, at 95 watts, consume far less than the competition's dual-core x86 server processors, which according to their published data have a thermal design power of 135 watts and a max power draw of 150 watts
  – Can result in 200 percent better performance-per-watt than the competition
  – Even greater performance-per-watt can be achieved with lower-power (55 watt) processors
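A quick sanity check on the power claim, using the slide's wattages and a hypothetical common benchmark score (the score value is mine, not from the slide):

```python
# Performance-per-watt comparison using the slide's TDP figures and a
# hypothetical equal benchmark score for both parts.
score = 100.0          # hypothetical common throughput score
opteron_watts = 95.0   # from the slide
competitor_watts = 135.0

ratio = (score / opteron_watts) / (score / competitor_watts)
# The wattage gap alone yields ~42% better perf/watt; the slide's larger
# "200 percent" claim also folds in AMD's measured performance advantage.
assert abs(ratio - 135.0 / 95.0) < 1e-12
```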
IBM POWER Processor Technology
IBM POWER4+ Processor Architecture
IBM POWER4+ Processor Architecture
• Two processor cores on one chip as shown• Clock frequency of the POWER4+ is 1.5--1.9 GHz • The L2 cache modules are connected to the processors by the Core Interface
Unit (CIU) switch, a 2×3 crossbar with a bandwidth of 40 B/cycle per port. • This enables to ship 32 B to either the L1 instruction cache or the data cache of
each of the processors and to store 8 B values at the same time.• Also, for each processor there is a Non-cacheable Unit that interfaces with the
Fabric Controller and that takes care of non-cacheable operations. • The Fabric Controller is responsible for the communication with three other
chips that are embedded in the same Multi Chip Module (MCM), to L3 cache, and to other MCMs.
• The bandwidths at 1.7 GHz are 13.6, 9.0, and 6.8 GB/s, respectively. The chip further still contains a variety of devices: the L3 cache directory and the L3 and Memory Controller that should bring down the off-chip latency considerably
• The GX Controller is responsible for the traffic on the GX bus which transports data to/from the system and in practice is used for I/O. The maximum size of the L3 cache is 32 MB
IBM POWER5 Processor Architecture
IBM POWER5 Processor Architecture
• Like the POWER4(+), the POWER5 has two processor cores on a chip
• Clock frequency of the POWER5 is 1.9 GHz
• Because of the higher density on the chip (the POWER5 is built in 130 nm technology instead of the 180 nm used for the POWER4+), more devices could be placed on the chip and they could also be enlarged
• The L2 caches of two neighboring chips are connected, and the L3 caches are directly connected to the L2 caches
• Both are larger than their respective counterparts on the POWER4: 1.875 MB against 1.5 MB for the L2 cache, and 36 MB against 32 MB for the L3 cache
• In addition, the latency of the L3 cache has improved from about 120 cycles to 80 cycles. The associativity of the caches has also improved: from 2-way to 4-way for the L1 cache, from 8-way to 10-way for the L2 cache, and from 8-way to 12-way for the L3 cache
• A big difference is also the improved bandwidth from memory to the chip: it has increased from 4 GB/s for the POWER4+ to approximately 16 GB/s for the POWER5
Intel (Future) Processor Technology
DP Server Architecture
CONSTANTLY ANALYZING THE REQUIREMENTS, THE TECHNOLOGIES, AND THE TRADEOFFS
*Graphics not representative of actual die photo or relative size
[Figure: Bensley platform with the Blackford chipset and buffered (AMB) memory channels: 17 GB/s of memory bandwidth, up to 64 GB of memory, and FSB scaling from 800 MHz to 1067 MHz to 1333 MHz]
• Point-to-point interconnect
• Consistent local and remote memory latencies
• Central coherency resolution
• Sustained and balanced throughput
• Easy capacity expansion
• Large shared caches
• Platform performance: it's all about bandwidth and latency
Energy Efficient Performance – High End
DATACENTER "ENERGY LABEL"
• ASC Purple (Source: LLNL): 6 MWatt, 100 TFlops goal, 12K+ CPUs (POWER5), $230M
  – 17,066 Flops/Watt, 467 Flops/Dollar
• NASA Columbia (Source: NASA): 2 MWatt, 60 TFlops goal, 10,240 CPUs (Itanium II), $50M
  – 30,720 Flops/Watt, 1,288 Flops/Dollar
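The slide's efficiency figures reproduce exactly if its "Flops/Watt" is read as GFlop/s per megawatt; the peak-rate assumptions below (Columbia at 10,240 × 6 GFlop/s Itanium 2 CPUs, Purple at roughly 102.4 TFlop/s) are mine, not stated on the slide.

```python
# Reproducing the slide's efficiency figures under the assumption that
# "Flops/Watt" means GFlop/s per MW. Peak rates below are my assumptions.
columbia_gflops = 10_240 * 6     # 10,240 Itanium 2 CPUs x 6 GFlop/s each
columbia_mw = 2
assert columbia_gflops // columbia_mw == 30_720   # matches the slide

purple_gflops = 102_400          # ~102.4 TFlop/s peak assumed
purple_mw = 6
assert purple_gflops // purple_mw == 17_066       # matches the slide
```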
Core™ Microarchitecture Advances With Quad Core
*Graphics not representative of actual die photo or relative size; Source: Intel®
[Figure: DP performance per watt, compared with SPECint_rate at the platform level, rising from 1X toward 4X between H1 '05 and H1 '07]
• Desktop quad core: Kentsfield
• Server parts along the curve: Irwindale, Paxville DP, Dempsey MV, Woodcrest, and the quad-core Clovertown (H1 '07)
Energy Efficient Performance
Woodcrest for Servers
• POWER: 35% lower
• PERFORMANCE: 80% higher
• …relative to Intel® Xeon® 2.8GHz 2x2MB
• Source: Intel, based on estimated SPECint*_rate_base2000 and thermal design power
Multi-Core Energy-Efficient Performance
[Figure: power and performance versus relative single-core frequency and Vcc]
• Max frequency (baseline): 1.00x power, 1.00x performance
• Over-clocked (+20%): 1.73x power, 1.13x performance
• Dual-core (-20% frequency): 1.02x power, 1.73x performance
Intel Multi-Core Trajectory
• Dual-core in 2006, quad-core in 2007
Blade Architectures - General
[Figure: multiple blade servers connected by a common interconnect]
• Blades interconnected by common fabrics
  – InfiniBand, Ethernet, and Fibre Channel are most common
  – Redundant interconnect available for failover
  – Links from the interconnect provide external connectivity
• Each blade contains multiple processors, memory, and network interfaces
  – Some options may exist, such as for memory, network connectivity, etc.
• Power, cooling, and management overhead are optimized within the chassis
  – Multiple chassis can be connected together for a greater number of nodes
IBM BladeCenter H Architecture
[Figure: chassis with Blades 1–14, Management Modules 1 and 2, Switch Modules 1 and 2, High-Speed Switches 1–4, and I/O Bridges (I/O Bridge 3/SM3, I/O Bridge 4/SM4)]
• I/O Bridge
  – e.g., Ethernet, Fibre Channel, Passthru
  – Dual 4x (16 wire) wiring internally to each HSSM
• High-speed Switch
  – Ethernet or InfiniBand
  – 4x (16 wire) blade links
  – 4x (16 wire) bridge links
  – 1x (4 wire) Mgmt links
  – Uplinks: up to 12x links for IB and at least four 10Gb links for Ethernet
[Figure: multiple BladeCenter H chassis, each with Blades 1–14, joined by an external interconnect]
• External high performance interconnect(s) for multiple chassis
• Independent scaling of blades and I/O
• Scales for large clusters
• Architecture used for Barcelona Supercomputer Center (MareNostrum #8)
IBM BladeCenter H Architecture
Cray (Octigabay) Blade Architecture
• MPI offloaded in hardware: throughput 2900 MB/s and latency 1.6 µs
• Processor and communication interface is HyperTransport
• Dedicated link and communication chip per processor
• FPGA accelerator available for additional offload
[Figure: blade with two Opterons, each with dedicated DDR 333 memory (5.4 GB/s) and a Rapid Array Communications Processor (RAP) attached via HyperTransport (6.4 GB/s); 8 GB/s per external link. The RAP includes MPI hardware offload capabilities; an FPGA accelerator is available for application offload]
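The quoted RAP figures (1.6 µs latency, 2900 MB/s throughput) fit a standard latency-plus-transfer model, which also gives the message size at which the link reaches half its peak rate; the model is a common approximation, not from the slide.

```python
# MPI message time from the quoted RAP figures: 1.6 us latency plus
# size / 2900 MB/s transfer time (1 MB/s == 1 byte/us).
LATENCY_US = 1.6
BANDWIDTH_MB_S = 2900.0

def mpi_time_us(n_bytes: float) -> float:
    return LATENCY_US + n_bytes / BANDWIDTH_MB_S

# Half-power point n_1/2: the message size where transfer time equals the
# fixed latency, i.e. the link delivers half its peak rate.
n_half = LATENCY_US * BANDWIDTH_MB_S
assert abs(n_half - 4640.0) < 1e-6                      # bytes
assert abs(mpi_time_us(n_half) - 2 * LATENCY_US) < 1e-9
```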
Cray Blade Architecture
Chassis Board Options
[Figure: chassis board with control system and power, a monitoring and control ASIC, switches (second switch card optional), SATA via 8111 HT/SATA bridges, and four PCI-Express slots via two 8131 HT/PCI bridges, each attached to one blade]
Blade Characteristics
• Two 2.2 GHz Opteron processors
  – Dedicated memory per processor
• Two Rapid Array Communication Processors
  – One dedicated link each
  – One redundant link each
• Application accelerator FPGA
• Local hard drive
Shelf Characteristics
• One or two IB 4x switches
• Twelve or twenty-four external links
• Additional I/O:
  – Three high-speed I/O links
  – Four PCI-X bus slots
  – 100Mb Ethernet for management
• Active management system
Cray Blade Architecture
• Six blades per 3U shelf
• Twelve 4x IB external links for the primary switch
• An additional twelve links are available with the optional redundant switch
[Figure: shelf with one or two 24x24 IB 4x Rapid Array Interconnect switches; each blade has two Opterons with DDR 333 memory (5.4 GB/s each), HyperTransport (6.4 GB/s), RAPs with MPI offload capabilities (8 GB/s per link), an accelerator, an active management system, 100 Mb Ethernet, and high-speed I/O / PCI-X]
Cray Blade Architecture
• With up to 24 external links per OctigaBay 12K shelf, a variety of configurations can be achieved depending on the application
• OctigaBay suggests interconnecting shelves by meshes, tori, fat trees, and fully connected topologies for systems that fit in one rack. Fat-tree configurations require extra switches, which OctigaBay terms "spine switches."
• Mellanox InfiniBand technology is used for the interconnect
• Up to 25 shelves can be directly connected, yielding a 300-Opteron system
[Figure: shelves connected by an interconnect]
IBM BlueGene/L Architecture – Compute Card
• The BlueGene/L is the first in a new generation of systems made by IBM for massively parallel computing.
• The individual speed of the processor has been traded in favor of very dense packaging and a low power consumption per processor. The basic processor in the system is a modified PowerPC 440 at 700 MHz.
• Two of these processors reside on a chip, together with 4 MB of shared L3 cache and a 2 KB L2 cache for each of the processors. The processors have two load ports and one store port from/to the L2 caches, at 8 bytes/cycle. This is half of the bandwidth required by the two floating-point units (FPUs) and as such quite high.
• The CPUs have 32 KB of instruction cache and 32 KB of data cache on board. In favorable circumstances a CPU can deliver a peak speed of 2.8 Gflop/s, because the two FPUs can perform fused multiply-add operations. Note that the L2 cache is smaller than the L1 cache, which is quite unusual but allows it to be fast.
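The quoted 2.8 Gflop/s peak follows directly from the clock rate and the FPU configuration:

```python
# Peak per-CPU rate: 700 MHz x two FPUs x two flops per fused multiply-add.
clock_ghz = 0.7
fpus = 2
flops_per_fma = 2
peak_gflops = clock_ghz * fpus * flops_per_fma
assert abs(peak_gflops - 2.8) < 1e-12
```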
IBM BlueGene/L Architecture
IBM BlueGene/L Overview
• BlueGene/L boasts a peak speed of over 360 teraOPS, a total memory of 32 tebibytes, total power of 1.5 megawatts, and machine floor space of 2,500 square feet. The full system has 65,536 dual-processor compute nodes. Multiple communications networks enable extreme application scaling:
• Nodes are configured as a 32 x 32 x 64 3D torus; each node is connected in six different directions for nearest-neighbor communications
• A global reduction tree supports fast global operations such as global max/sum in a few microseconds over 65,536 nodes
• Multiple global barrier and interrupt networks allow fast synchronization of tasks across the entire machine within a few microseconds
• 1,024 gigabit-per-second links to a global parallel file system to support fast input/output to disk
• The BlueGene/L possesses no fewer than five networks, two of which are of interest for inter-processor communication: a 3-D torus network and a tree network.
• The torus network is used for most general communication patterns.
• The tree network is used for frequently occurring collective communication patterns such as broadcasting, reduction operations, etc.
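Nearest-neighbor communication on the 32 × 32 × 64 torus comes down to coordinate arithmetic with wraparound; the helper below is an illustrative sketch, not BlueGene/L's actual routing code.

```python
# Nearest neighbors on BlueGene/L's 32 x 32 x 64 3-D torus: six neighbors
# per node, with coordinates wrapping at the edges. Illustrative sketch only.

DIMS = (32, 32, 64)
assert DIMS[0] * DIMS[1] * DIMS[2] == 65_536   # the full machine's node count

def torus_neighbors(x: int, y: int, z: int):
    """Return the six nodes at +/-1 along each axis, modulo the torus size."""
    neighbors = []
    for axis in range(3):
        for delta in (1, -1):
            coord = [x, y, z]
            coord[axis] = (coord[axis] + delta) % DIMS[axis]
            neighbors.append(tuple(coord))
    return neighbors

# A corner node's links wrap around to the far faces of the machine.
assert (31, 0, 0) in torus_neighbors(0, 0, 0)
assert (0, 0, 63) in torus_neighbors(0, 0, 0)
assert len(torus_neighbors(0, 0, 0)) == 6
```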
IBM’s X3 Architecture
IBM System x
X3 Chipset – Scalable Intel MP Server
[Figure: four EM64T processors on front-side buses to the X3 memory/scalability controller, eight memory interfaces to RAM DIMMs, I/O bridges with PCI-X 2.0 at 266 MHz, and scalability ports to other Xeon MP processors]
X3 Chipset – Low Latency
[Figure: same X3 chipset structure; access latency of 108 ns]
[Figure: same X3 chipset structure; access latency of 222 ns]
X3 Chipset – High Bandwidth
[Figure: same X3 chipset structure, annotated with bandwidths of 21.3 GB/s, 15 GB/s, 10.6 GB/s, and 6.4 GB/s on the memory interfaces, front-side buses, and scalability ports]
IBM System x
X3 Chipset – Snoop Filter
[Figure: two four-way systems compared on an internal cache miss]
• Other chipsets:
  – The cache of EACH processor must be snooped
  – This creates traffic along the FSB
• X3:
  – The cache of EACH processor is mirrored on the Hurricane chip, so no snoop traffic reaches the FSB
  – This relieves traffic on the FSB
  – Faster access to main memory
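The snoop-filter idea can be sketched as a set of mirrored cache tags held by the node controller; the class and method names below are hypothetical, not IBM's implementation.

```python
# Hypothetical sketch of a snoop filter: the node controller mirrors which
# lines local processors have cached, so a miss only generates FSB snoop
# traffic when some local cache could actually hold the line.

class SnoopFilter:
    def __init__(self) -> None:
        self.mirrored_tags = set()   # lines currently cached by local CPUs

    def on_cache_fill(self, line: int) -> None:
        self.mirrored_tags.add(line)

    def on_eviction(self, line: int) -> None:
        self.mirrored_tags.discard(line)

    def needs_snoop(self, line: int) -> bool:
        """Without a filter every miss snoops every cache; with one, lines
        absent from the mirror go straight to memory with no FSB traffic."""
        return line in self.mirrored_tags

sf = SnoopFilter()
sf.on_cache_fill(0x1000)
assert sf.needs_snoop(0x1000)        # cached locally: snoop is required
assert not sf.needs_snoop(0x2000)    # not cached: skip the FSB entirely
```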
Multi-node Scalability – Putting It Together
[Figure: four Hurricane-based nodes; a requester's cache miss maps to main memory on one node, the node owning the latest copy of the cache line supplies the data, and the other nodes respond null]
• The snoop filter and remote directory work together in multi-node configurations
• A local processor cache miss is broadcast to all memory controllers
• Only the node owning the latest copy of the data responds
• This maximizes system bus bandwidth
X3 Chipset – Scalability Ports
[Figure: four 4-way nodes joined by cabled scalability ports into a 16-way single-OS-image MP system]
• X3 scales to 32-way; dual-core capable, for 64 cores
The End