TRANSCRIPT
Extensible Message Layers for Resource-Rich Cluster Computers
Craig Ulmer
Center for Experimental Research in Computer Systems
A Doctoral Thesis
Outline
Background: Evolution of cluster computers
Thesis: Design of extensible message layers
GRIM: General-purpose Reliable In-order Messages (core communication functions)
Extensions: Integrating peripheral devices, streaming computations, multicast, sockets API
Host-to-host performance
Concluding remarks
Background
An Evolution of Cluster Computers
Cluster Computers
Cost-effective alternative to supercomputers: a number of commodity workstations connected by specialized network hardware and software
Result: Large pool of host processors
[Diagram: four hosts, each with a CPU, memory, I/O bus, and network interface, connected by a System Area Network]
Industry Trends
Increasingly independent, intelligent peripheral devices
Migration of computing power and bandwidth requirements to peripherals
Network and multimedia applications' reliance on peripheral devices
High-bandwidth, low-latency system area networks
[Diagram: host with CPU, storage, media capture, and Ethernet peripherals attached alongside the SAN NI]
Resource-Rich Cluster Computers
Inclusion of diverse peripheral devices: Ethernet server cards, multimedia capture devices, embedded storage, computational accelerators
Processing takes place in host CPUs and peripherals
[Diagram: cluster of hosts connected by a System Area Network, with SAN NIs attaching video capture, FPGA, storage, and Ethernet devices as well as host CPUs]
Problem: Utilizing distributed cluster resources
How is efficient intra-cluster communication provided? What abstractions are desirable, and how are they implemented?
How can applications make use of resources?
[Diagram: pool of cluster resources (CPUs, video capture, FPGAs, RAID storage, Ethernet) with open questions about how they interconnect]
Thesis Definition
Supporting Resource-Rich Cluster Computers
The Challenge
Message layers are an enabling technology for clusters
Current message layers: optimized for transmissions between host CPUs; peripheral devices are only available in the context of the local host
What is needed: support for efficient communication between host CPUs and peripheral devices, and the ability to harness peripheral devices as a pool of resources
Message Layer Design Characteristics
Reliable data transfers: end-to-end flow control; simplify the workload for communication endpoints
Virtualization of the network interface: multiple endpoints share network access
Flexible programming abstractions: efficient data transfer mechanisms; allow users to customize interactions with resources
Message Layer Requirements
Extensibility is a key feature
Hardware: easily add new peripheral devices
Software: support higher-level programming abstractions
[Diagram: application extensions and peripheral extensions built on top of the message layer core]
Related Work
InfiniBand: emerging industry standard for clusters and data centers; a new I/O infrastructure
Existing message layers: Myricom's GM; OPIUM for SCSI interactions
Adaptive computing: clusters with FPGA cards (Tower of Power)
GRIM: An Implementation
A message layer for resource-rich clusters
General-purpose Reliable In-order Message Layer (GRIM)
Message layer for resource-rich clusters: Myrinet SAN backbone; interconnects host CPUs and peripheral devices; gives resources direct access to the network hardware
Communication core: migrate functionality into the NI; easy to provide extensions
Core Software Architecture
Per-hop flow control: endpoint-NI transfers and NI-NI transfers
Logical channels: multiple endpoints share a virtualized NI
Programming interfaces: active messages and remote memory
[Diagram: endpoint and NI software components: remote memory API, registered memory, memory management, active message API, handler management, and active message execution, connected by endpoint-NI and NI-NI transfers over the network]
Per-hop Flow Control
End-to-end flow control is necessary for reliable delivery: it prevents buffer overflows along the communication path
Endpoint-managed schemes are impractical for peripheral devices
Per-hop flow control scheme: transfer data as soon as the next stage can accept it; an optimistic approach (sketched below)
[Diagram: send and reply paths between sending endpoint, sending NI, receiving NI, and receiving endpoint, with DATA/ACK exchanges at each hop across the PCI buses and the SAN]
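A minimal sketch of how a per-hop, credit-style check might look on the sending side of one hop. The names (hop_t, hop_send, hop_ack) are hypothetical illustrations, not the actual GRIM interface.

#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t  *buf;         /* staging buffer for the next hop        */
    uint32_t  capacity;    /* buffer space available at the next hop */
    uint32_t  in_flight;   /* bytes sent but not yet acknowledged    */
} hop_t;

/* Optimistic per-hop transfer: forward as soon as the next stage
 * can accept the data, instead of waiting on an end-to-end window. */
static int hop_send(hop_t *hop, const void *data, uint32_t len)
{
    if (hop->in_flight + len > hop->capacity)
        return -1;                    /* next stage full: retry later */
    memcpy(hop->buf + hop->in_flight, data, len);
    hop->in_flight += len;            /* decremented when the ACK arrives */
    return 0;
}

/* Called when the per-hop ACK covering `len` bytes comes back. */
static void hop_ack(hop_t *hop, uint32_t len)
{
    hop->in_flight -= len;
}

Each hop (endpoint to NI, NI to NI, NI to endpoint) applies the same check, so no single endpoint has to track buffer space along the entire path.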
Logical Channels
Multiple endpoints in a host share the NI; employ multiple logical channels in the NI
Each endpoint owns one or more logical channels; a logical channel provides a virtual interface to the network
[Diagram: endpoints 1 through n, each mapped to a logical channel inside the network interface; a scheduler multiplexes the channels onto the network]
Programming Interfaces: Active Messages
A message specifies a function to be executed at the receiver; similar to remote procedure calls, but lightweight; used to invoke operations at remote resources
Useful for constructing device-specific APIs. Example: interactions with a remote storage controller (sketched below)
[Diagram: a host CPU sends am_fetch_file() across the SAN to a storage controller, which replies with am_return_file_data()]
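A minimal sketch of the active-message style of dispatch described above. The handler table and am_register/am_dispatch names are hypothetical, not the GRIM API.

#include <stdint.h>

#define MAX_HANDLERS 256
typedef void (*am_handler_t)(void *payload, uint32_t len);
static am_handler_t handler_table[MAX_HANDLERS];

/* The receiving endpoint registers a handler under a small integer id. */
static void am_register(uint16_t id, am_handler_t fn)
{
    handler_table[id] = fn;
}

/* On message arrival the endpoint dispatches to the named handler,
 * much like a lightweight remote procedure call. */
static void am_dispatch(uint16_t id, void *payload, uint32_t len)
{
    if (handler_table[id])
        handler_table[id](payload, len);
}

A device-specific API is then just a set of handler ids: for instance, a storage controller endpoint could register a fetch-file handler whose body issues an am_return_file_data() reply, as in the diagram above.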
Programming Interfaces: Remote Memory
Transfer blocks of data from one host to another; the receiving NI executes the transfer directly
Read and write operations; the NI interacts with a kernel driver to translate virtual addresses; optional notification mechanisms (sketched below)
[Diagram: two hosts' CPUs and memories connected by their NIs over the SAN; data moves memory to memory]
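A sketch of what a remote memory write request might carry, under the description above. The structure layout and the grim_rm_write() name are hypothetical placeholders.

#include <stdint.h>

typedef struct {
    uint16_t dest_node;     /* destination host/endpoint                 */
    uint64_t remote_vaddr;  /* virtual address at the destination; the
                               receiving NI asks the kernel driver to
                               translate it to a physical address        */
    uint32_t length;        /* bytes to transfer                         */
    uint32_t notify;        /* optional completion-notification flag     */
} rm_write_req_t;

/* Post a write: the sender hands the request and data to its NI; the
 * receiving NI deposits the data into host memory directly, without
 * interrupting the destination CPU. */
int grim_rm_write(const rm_write_req_t *req, const void *data);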
Integrating Peripheral Devices
Hardware Extensibility
Peripheral Device Overview
[Diagram: host CPUs and a peripheral device sharing access to the NI]
In GRIM peripherals are endpoints
Intelligent peripherals: operate autonomously; on-card message queues; process incoming active messages and eject outgoing active messages
Legacy peripherals: managed by a host application or via remote memory operations
Legacy Peripheral Device
Cyclone Systems I2O Server Adaptor Card
A networked host on a PCI card
Integration with GRIM: interacts directly with the NI; the host-level endpoint software was ported to the card
Utilized as a LAN-SAN bridge
[Diagram: Cyclone card architecture: i960 Rx processor, DRAM, ROM, DMA engines, primary and secondary PCI interfaces, and a daughter card with dual 10/100 Ethernet and SCSI ports on a local bus]
Video Capture and Display Cards
Video capture card: specialized DMA engine; host-level library based on Video4Linux; controlled with active messages
Video display card: the host locates the frame buffer and manipulates it through remote memory writes
Incorporates legacy peripheral devices via host-level management or remote memory operations
[Diagram: video capture card (A/D converter and DMA engine) writing frames into host memory; video display card frame buffer (AGP, D/A converter) written remotely]
Celoxica RC-1000 FPGA Card
FPGAs provide acceleration; load them with application-specific circuits
Celoxica RC-1000 FPGA card: Xilinx Virtex-1000 FPGA, 8 MB SRAM
Hardware implementation: the endpoint is built as state machines; AM handlers are circuits
[Diagram: RC-1000 card with four SRAM banks, PCI interface, and the FPGA's control and switching logic]
FPGA Endpoint Organization
[Diagram: FPGA endpoint organization: input and output message queues and application data in card memory, exposed to user circuits 1 through n through communication library, memory, and user-circuit APIs on the FPGA circuit canvas]
Example FPGA Configuration
Cryptography configuration: DES, RC6, MD5, and an ALU
20 MHz clock; newer FPGAs are much faster
Expansion: Sharing the FPGA
The FPGA has limited space for hardware circuits; the host reconfigures the FPGA on demand when an FPGA "function fault" occurs (sketched below)
[Diagram: a message requesting Circuit F triggers a function fault; the host CPU saves circuit state to SRAM 0, swaps the loaded configuration out for Configuration C (about 150 ms), and resumes; analogous to a page fault]
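A sketch of how the host side of such a function fault might be handled. The circuit and configuration ids, and the save/load helpers, are hypothetical stand-ins for a card-management library, not the thesis's actual implementation.

#include <stdbool.h>

typedef int circuit_id_t;
typedef int config_id_t;

static config_id_t current_config = -1;

/* Hypothetical helpers provided by a card-management library. */
extern config_id_t config_containing(circuit_id_t c);
extern void save_circuit_state_to_sram(void);
extern void load_bitstream(config_id_t cfg);       /* ~150 ms on the RC-1000 */
extern void restore_circuit_state_from_sram(void);

/* Ensure the circuit named in an incoming message is resident. */
static void ensure_circuit_loaded(circuit_id_t wanted)
{
    config_id_t cfg = config_containing(wanted);
    if (cfg == current_config)
        return;                        /* circuit already loaded          */
    save_circuit_state_to_sram();      /* spill state, like a page fault  */
    load_bitstream(cfg);               /* reconfigure the FPGA            */
    restore_circuit_state_from_sram();
    current_config = cfg;
}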
Expansion: Sharing On-Card Memory
Limited card memory for storing application data; construct a virtual memory system for on-card memory, with host memory as the swap space (sketched below)
[Diagram: user-defined circuits on the FPGA access user pages mapped onto SRAM page frames; pages are swapped between card SRAM and host memory by the host CPU]
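A sketch of the on-card virtual memory idea: user circuits reference "user pages" that are mapped onto a small number of SRAM page frames, with host memory acting as swap space. The frame count, replacement policy, and copy helpers are hypothetical simplifications.

#include <stdint.h>

#define NUM_FRAMES 4                    /* e.g., one frame per SRAM bank */

typedef struct {
    int resident_page;                  /* user page in this frame, -1 if none */
} frame_t;

static frame_t frames[NUM_FRAMES] = { {-1}, {-1}, {-1}, {-1} };

/* Hypothetical transfers between card SRAM and host swap memory. */
extern void copy_frame_to_host(int frame);
extern void copy_page_from_host(int page, int frame);

/* Return the frame holding `page`, swapping it in from host memory if
 * necessary (a trivial "evict frame 0" policy, for illustration only). */
static int touch_page(int page)
{
    for (int f = 0; f < NUM_FRAMES; f++)
        if (frames[f].resident_page == page)
            return f;                           /* hit                    */
    int victim = 0;                             /* trivial replacement    */
    if (frames[victim].resident_page >= 0)
        copy_frame_to_host(victim);             /* write back to host swap */
    copy_page_from_host(page, victim);
    frames[victim].resident_page = page;
    return victim;
}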
Extension: Streaming Computations
Software extensibility [1/3]
Streaming Computation Overview
A programming method for distributed resources: establish a pipeline for streaming operations. Example: multimedia processing
[Diagram: streaming pipeline over the system area network: a video capture endpoint feeds a chain of media-processor endpoints (e.g., Celoxica RC-1000 FPGA cards)]
Streaming Fundamentals
Computation: how is a computation performed? Active message approach
Forwarding: where are results transmitted? Programmable forwarding directory (sketched below)
[Diagram: an incoming message (destination: FPGA; forwarding entry X; AM: perform FFT) is processed by one of the FPGA's computational circuits (circuit 1: FFT ... circuit N: encrypt); the forwarding directory turns the result into an outgoing message (destination: host; forwarding entry X; AM: receive FFT)]
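A sketch of what a programmable forwarding directory entry could look like. The field names, table size, and lookup function are hypothetical illustrations of the idea, not the GRIM data structures.

#include <stdint.h>

typedef struct {
    uint16_t next_node;      /* destination of the result (e.g., host or FPGA) */
    uint16_t next_handler;   /* AM to invoke there (e.g., "receive FFT")       */
    uint16_t next_entry;     /* forwarding entry to use at the next stage      */
} fwd_entry_t;

#define FWD_TABLE_SIZE 64
static fwd_entry_t fwd_table[FWD_TABLE_SIZE];

/* After a circuit finishes (e.g., an FFT), the endpoint looks up the
 * forwarding entry carried in the incoming message and emits the result
 * as a new active message addressed by that entry. */
static fwd_entry_t lookup_forward(uint16_t entry)
{
    return fwd_table[entry % FWD_TABLE_SIZE];
}

Because each stage only consults its own entry, an application can rewire a pipeline by rewriting directory entries rather than changing the circuits themselves.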
Performance: FPGA Computations
Clocks per stage: Acquire SRAM: 8; Detect New Message: 4; Fetch Header: 7; Fetch Payload: 1024; Computation: 1024; Store Results: 1024; Store Header: 16; Lookup Forwarding: 5; Update Queues: 3; Release SRAM: 1
Clock speed: 20 MHz. Operation latency: 55 μs for a 4 KB message (73 MB/s)
Extension: Multicast
Software extensibility [2/3]
GRIM Multicast Extensions
Distribute the same message to multiple receivers: tree-based distribution; replicate the message at the NI; messages are recycled back into the network
Extensions to the NI's core communication operations: recycled messages travel in a separate logical channel; per-hop flow control provides reliable delivery (sketched below)
[Diagram: multicast tree with A at the root distributing to B and C, and on to D and E; each NI delivers the message to its local endpoint and recycles copies back into the network for its children]
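A sketch of the NI-level replication step described above: the NI delivers the message locally, then recycles one copy per child in its branch of the distribution tree. The structures and primitives are hypothetical stand-ins for the NI firmware's operations.

#include <stdint.h>

#define MAX_CHILDREN 4

typedef struct {
    int      num_children;
    uint16_t children[MAX_CHILDREN];   /* next nodes in the multicast tree */
} mcast_entry_t;

/* Hypothetical NI primitives: deliver locally and enqueue on the
 * recycle (multicast) logical channel, which still obeys per-hop
 * flow control for reliability. */
extern void deliver_to_local_endpoint(const void *msg, uint32_t len);
extern void recycle_to_node(uint16_t node, const void *msg, uint32_t len);

static void ni_multicast(const mcast_entry_t *e, const void *msg, uint32_t len)
{
    deliver_to_local_endpoint(msg, len);
    for (int i = 0; i < e->num_children; i++)
        recycle_to_node(e->children[i], msg, len);
}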
Multicast Performance
[Plot: time (μs) vs. multicast message size (bytes) for multicast RTT, unicast RTT, multicast injection overhead, and unicast injection overhead; 8 hosts, LANai 4 NIs, P4-1.7 GHz hosts]
Multicast Observations
Beneficial: reduces sending overhead
Performance loss for large messages, dependent on the NI's memory-copy bandwidth
On-card memory-copy benchmark: LANai 4: 19 MB/s; LANai 9: 66 MB/s
Extension: Sockets API
Software Extensibility [3/3]
Extension: Sockets Emulation
Berkeley sockets is a widely used communication standard, employed in numerous distributed applications
GRIM provides sockets API emulation: functions for intercepting socket calls, plus AM handler functions for buffering connection data (sketched below)
[Diagram: the sender intercepts write() and generates an "append to socket X" active message carrying the data; the receiver's AM handler appends it to socket X's buffer, and an intercepted read() extracts the data]
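A sketch of how a socket write() could be intercepted and reissued as an "append to socket" active message, in the spirit of the diagram above. The handler id, the grim_am_send() call, and real_write() (standing in for the original libc call, e.g., obtained via dlsym in an LD_PRELOAD-style wrapper) are all hypothetical.

#include <stddef.h>
#include <sys/types.h>

#define AM_APPEND_SOCKET 42                  /* hypothetical handler id      */

extern int fd_is_grim_socket(int fd);        /* hypothetical connection lookup */
extern ssize_t grim_am_send(int fd, int handler,
                            const void *buf, size_t len);
extern ssize_t real_write(int fd, const void *buf, size_t len);

/* Wrapper substituted for write(): GRIM-backed sockets are turned into
 * active messages; everything else falls through to the real call. */
ssize_t write(int fd, const void *buf, size_t count)
{
    if (fd_is_grim_socket(fd))
        return grim_am_send(fd, AM_APPEND_SOCKET, buf, count);
    return real_write(fd, buf, count);
}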
Sockets Emulation Performance
[Plot: bandwidth (MBytes/s) vs. transfer size (bytes) for GRIM sockets over LANai 4 compared with 100 Mb/s Ethernet; P4-1.7 GHz hosts]
Host-to-Host Performance
Transferring data between two host-level endpoints
Host-to-Host Communication Performance
Host-to-host transfers are the standard benchmark; benchmarks use remote memory writes; Myrinet LANai 4 and LANai 9 NI cards
Measured: injection performance and the overall communication path
[Diagram: source and destination hosts; a transfer crosses (1) source memory to NI, (2) NI to NI over the SAN, and (3) NI to destination memory, using active messages or remote memory operations]
Host-NI: Data Injections
Host-NI transfers are challenging: the host lacks a DMA engine
Multiple transfer methods: programmed I/O and DMA
Automatically select the method. Result: the Tunable PCI Injection Library (TPIL), sketched below
[Diagram: host CPU with cache, memory controller, and main memory connected over the PCI bus to a peripheral with device memory and a PCI DMA engine]
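A sketch of the idea behind a tunable injection library: use programmed I/O for small transfers and DMA for large ones, with the crossover point tuned per host/NI pair. The threshold value and the copy/DMA helpers are hypothetical placeholders, not the TPIL interface.

#include <stddef.h>
#include <stdint.h>

extern void pio_copy_to_ni(volatile void *ni_mem, const void *src, size_t len);
extern void dma_to_ni(uint64_t ni_phys_addr, const void *src, size_t len);

static size_t pio_dma_crossover = 1024;     /* bytes; found by benchmarking */

static void tpil_inject(volatile void *ni_mem, uint64_t ni_phys,
                        const void *src, size_t len)
{
    if (len < pio_dma_crossover)
        pio_copy_to_ni(ni_mem, src, len);   /* CPU stores over the PCI bus */
    else
        dma_to_ni(ni_phys, src, len);       /* DMA engine pulls the data   */
}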
TPIL Performance: LANai 9 NI with Pentium III-550 MHz Host
[Plot: injection bandwidth (MBytes/s) vs. injection size (bytes) for TPIL]
Overall Performance: Store-and-Forward
Approach: a single message, no overlap; three transmission stages; expect roughly one third of an individual stage's bandwidth (see the estimate below)
[Diagram and plot: store-and-forward timeline of Message 1 crossing the sending host-NI (PCI: 132 MB/s), NI-NI (Myrinet: 160 MB/s), and receiving NI-host (PCI: 132 MB/s) stages; overall bandwidth (MBytes/s) vs. message size (bytes); P3-550 MHz hosts]
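A back-of-the-envelope estimate, using the per-stage bandwidths from the figure (132 MB/s PCI, 160 MB/s Myrinet), illustrates the "roughly one third" claim. For a message of length L sent store-and-forward through the three stages,

B_{\mathrm{s\&f}} = \frac{L}{L/B_{\mathrm{PCI}} + L/B_{\mathrm{SAN}} + L/B_{\mathrm{PCI}}} = \left(\frac{1}{132} + \frac{1}{160} + \frac{1}{132}\right)^{-1}\ \mathrm{MB/s} \approx 47\ \mathrm{MB/s},

which is about one third of the bandwidth of any individual stage.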
Enhancement: Message Pipelining
Allow overlap with multiple in-flight messages; GRIM uses AM and RM fragmentation/reassembly; performance depends on the fragment size (a fragmentation sketch follows below)
[Diagram and plot: pipelined timeline with Messages 1-3 overlapping across the sending host-NI, NI-NI, and receiving NI-host stages; overall bandwidth (MBytes/s) vs. message size (bytes); LANai 9, P3-550 MHz hosts]
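A sketch of the fragmentation loop that makes pipelining possible: a large message is split into fixed-size fragments so the host-NI, NI-NI, and NI-host stages can overlap on different fragments. The fragment size and the send primitive are hypothetical tuning knobs, not GRIM's actual values.

#include <stdint.h>
#include <stddef.h>

#define FRAG_SIZE 4096                     /* tuning knob: fragment size */

extern void send_fragment(uint32_t msg_id, uint32_t offset,
                          const void *data, size_t len, int last);

static void send_pipelined(uint32_t msg_id, const uint8_t *data, size_t len)
{
    size_t off = 0;
    while (off < len) {
        size_t chunk = (len - off > FRAG_SIZE) ? FRAG_SIZE : (len - off);
        int last = (off + chunk == len);
        send_fragment(msg_id, (uint32_t)off, data + off, chunk, last);
        off += chunk;                      /* receiver reassembles by offset */
    }
}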
Enhancement: Cut-through Transfers
Forward data as soon as it begins to arrive; cut-through at both the sending and receiving NIs
[Diagram and plot: cut-through timeline with Messages 1 and 2 forwarded through the sending host-NI, NI-NI, and receiving NI-host stages as data arrives; overall bandwidth (MBytes/s) vs. message size (bytes); LANai 9, P3-550 MHz hosts]
Overall Host-to-Host Performance
Host        NI       Latency (μs)  Bandwidth (MB/s)
P4-1.7 GHz  LANai 9  8             146
P4-1.7 GHz  LANai 4  14.5          108
P3-550 MHz  LANai 9  9.5           116
P3-550 MHz  LANai 4  14            96
[Plot: bandwidth (MBytes/s) vs. message size (bytes)]
Comparison to Existing Message Layers
[Charts: latency (μs) and bandwidth (MB/s) compared against existing message layers]
Host-to-Host Performance Summary
GRIM is fitted with performance enhancements that take place automatically (self-configuring)
GRIM provides competitive performance: 1.168 Gb/s bandwidth and 8 μs latency
Provides increased functionality
Concluding Remarks
Key Contributions
Framework for communication in resource-rich clusters: reliable delivery mechanisms, a virtualized network interface, and flexible programming interfaces; performance comparable to state-of-the-art message layers
Extensible for peripheral devices: suitable for intelligent and legacy peripherals; methods for managing card resources
Extensible for higher-level programming abstractions: endpoint-level (streaming computations and sockets emulation) and NI-level (multicast)
Future Directions
Continued work with GRIM: video card vendors are opening cards to developers; Myrinet-connected embedded devices
Adaptation to other network substrates: Gigabit Ethernet is appealing because of cost but requires modification of the transmission protocols; InfiniBand technology is promising
Active system area networks: FPGA chips are beginning to feature gigabit transceivers; use FPGA chips as networked processing devices
Related Publications
A Tunable Communications Library for Data Injection, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2002.
Active SANs: Hardware Support for Integrating Computation and Communication, C. Ulmer, C. Wood, and S. Yalamanchili, Proceedings of the Workshop on Novel Uses of System Area Networks at HPCA, 2002.
A Messaging Layer for Heterogeneous Endpoints in Resource Rich Clusters, C. Ulmer and S. Yalamanchili, Proceedings of the First Myrinet User Group Conference, 2000.
An Extensible Message Layer for High-Performance Clusters, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2000.
Papers and Software Available at http://www.ee.gatech.edu/~grimace/research