TRANSCRIPT
Extensible Message Layers for Resource-Rich Cluster Computers
Craig Ulmer
Center for Experimental Research in Computer Systems
A Doctoral Thesis
Outline
Background: Evolution of cluster computers
Thesis: Design of extensible message layers
GRIM: General-purpose Reliable In-order Messages (core communication functions)
Extensions: Integrating peripheral devices, streaming computations, multicast, sockets API
Host-to-host performance
Concluding remarks
Background
An Evolution of Cluster Computers
Cluster Computers
Cost-effective alternative to supercomputers: a number of commodity workstations connected by specialized network hardware and software
Result: Large pool of host processors
[Diagram: four hosts, each with a CPU, memory, I/O bus, and network interface, connected by a System Area Network]
Industry Trends
Increasingly independent, intelligent peripheral devices
Migration of computing power and bandwidth requirements to peripherals
Network and multimedia applications' reliance on peripheral devices
High-bandwidth, low-latency system area networks
[Diagram: host with CPU, storage, media capture, and Ethernet peripherals attached alongside the SAN NI]
Resource-Rich Cluster Computers
Inclusion of diverse peripheral devices: Ethernet server cards, multimedia capture devices, embedded storage, computational accelerators
Processing takes place in host CPUs and peripherals
[Diagram: cluster of hosts connected by a System Area Network, with SAN NIs attaching video capture, FPGA, storage, and Ethernet devices as well as host CPUs]
Problem: Utilizing distributed cluster resources
How is efficient intra-cluster communication provided? What abstractions are desirable, and how are they implemented?
How can applications make use of resources?
[Diagram: pool of cluster resources (CPUs, video capture, FPGAs, RAID storage, Ethernet) with open questions about how they interconnect]
Thesis Definition
Supporting Resource-Rich Cluster Computers
The Challenge
Message layers are an enabling technology for clusters
Current message layers: optimized for transmissions between host CPUs; peripheral devices are only available in the context of the local host
What is needed: support for efficient communication between host CPUs and peripheral devices, and the ability to harness peripheral devices as a pool of resources
Message Layer Design Characteristics
Reliable data transfers: end-to-end flow control; simplify the workload for communication endpoints
Virtualization of the network interface: multiple endpoints share network access
Flexible programming abstractions: efficient data transfer mechanisms; allow users to customize interactions with resources
Message Layer Requirements
Extensibility is a key feature
Hardware: easily add new peripheral devices
Software: support higher-level programming abstractions
[Diagram: application extensions and peripheral extensions built on top of the message layer core]
Related Work
InfiniBand: emerging industry standard for clusters and data centers; a new I/O infrastructure
Existing message layers: Myricom's GM; OPIUM for SCSI interactions
Adaptive computing: clusters with FPGA cards (Tower of Power)
GRIM: An Implementation
A message layer for resource-rich clusters
General-purpose Reliable In-order Message Layer (GRIM)
Message layer for resource-rich clusters: Myrinet SAN backbone; interconnects host CPUs and peripheral devices; gives resources direct access to the network hardware
Communication core: migrate functionality into the NI; easy to provide extensions
Core Software Architecture
Per-hop flow control: endpoint-NI transfers and NI-NI transfers
Logical channels: multiple endpoints share a virtualized NI
Programming interfaces: active messages and remote memory
[Diagram: endpoint and NI software components: remote memory API, registered memory, memory management, active message API, handler management, and active message execution, connected by endpoint-NI and NI-NI transfers over the network]
Per-hop Flow Control
End-to-end flow control is necessary for reliable delivery: it prevents buffer overflows along the communication path
Endpoint-managed schemes are impractical for peripheral devices
Per-hop flow control scheme: transfer data as soon as the next stage can accept it; an optimistic approach (sketched below)
[Diagram: send and reply paths between sending endpoint, sending NI, receiving NI, and receiving endpoint, with DATA/ACK exchanges at each hop across the PCI buses and the SAN]
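A minimal sketch of how a per-hop, credit-style check might look on the sending side of one hop. The names (hop_t, hop_send, hop_ack) are hypothetical illustrations, not the actual GRIM interface.

#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t  *buf;         /* staging buffer for the next hop        */
    uint32_t  capacity;    /* buffer space available at the next hop */
    uint32_t  in_flight;   /* bytes sent but not yet acknowledged    */
} hop_t;

/* Optimistic per-hop transfer: forward as soon as the next stage
 * can accept the data, instead of waiting on an end-to-end window. */
static int hop_send(hop_t *hop, const void *data, uint32_t len)
{
    if (hop->in_flight + len > hop->capacity)
        return -1;                    /* next stage full: retry later */
    memcpy(hop->buf + hop->in_flight, data, len);
    hop->in_flight += len;            /* decremented when the ACK arrives */
    return 0;
}

/* Called when the per-hop ACK covering `len` bytes comes back. */
static void hop_ack(hop_t *hop, uint32_t len)
{
    hop->in_flight -= len;
}

Each hop (endpoint to NI, NI to NI, NI to endpoint) applies the same check, so no single endpoint has to track buffer space along the entire path.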
Logical Channels
Multiple endpoints in a host share the NI; employ multiple logical channels in the NI
Each endpoint owns one or more logical channels; a logical channel provides a virtual interface to the network
[Diagram: endpoints 1 through n, each mapped to a logical channel inside the network interface; a scheduler multiplexes the channels onto the network]
Programming Interfaces: Active Messages
A message specifies a function to be executed at the receiver; similar to remote procedure calls, but lightweight; used to invoke operations at remote resources
Useful for constructing device-specific APIs. Example: interactions with a remote storage controller (sketched below)
[Diagram: a host CPU sends am_fetch_file() across the SAN to a storage controller, which replies with am_return_file_data()]
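A minimal sketch of the active-message style of dispatch described above. The handler table and am_register/am_dispatch names are hypothetical, not the GRIM API.

#include <stdint.h>

#define MAX_HANDLERS 256
typedef void (*am_handler_t)(void *payload, uint32_t len);
static am_handler_t handler_table[MAX_HANDLERS];

/* The receiving endpoint registers a handler under a small integer id. */
static void am_register(uint16_t id, am_handler_t fn)
{
    handler_table[id] = fn;
}

/* On message arrival the endpoint dispatches to the named handler,
 * much like a lightweight remote procedure call. */
static void am_dispatch(uint16_t id, void *payload, uint32_t len)
{
    if (handler_table[id])
        handler_table[id](payload, len);
}

A device-specific API is then just a set of handler ids: for instance, a storage controller endpoint could register a fetch-file handler whose body issues an am_return_file_data() reply, as in the diagram above.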
Programming Interfaces: Remote Memory
Transfer blocks of data from one host to another; the receiving NI executes the transfer directly
Read and write operations; the NI interacts with a kernel driver to translate virtual addresses; optional notification mechanisms (sketched below)
[Diagram: two hosts' CPUs and memories connected by their NIs over the SAN; data moves memory to memory]
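A sketch of what a remote memory write request might carry, under the description above. The structure layout and the grim_rm_write() name are hypothetical placeholders.

#include <stdint.h>

typedef struct {
    uint16_t dest_node;     /* destination host/endpoint                 */
    uint64_t remote_vaddr;  /* virtual address at the destination; the
                               receiving NI asks the kernel driver to
                               translate it to a physical address        */
    uint32_t length;        /* bytes to transfer                         */
    uint32_t notify;        /* optional completion-notification flag     */
} rm_write_req_t;

/* Post a write: the sender hands the request and data to its NI; the
 * receiving NI deposits the data into host memory directly, without
 * interrupting the destination CPU. */
int grim_rm_write(const rm_write_req_t *req, const void *data);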
Integrating Peripheral Devices
Hardware Extensibility
Peripheral Device Overview
[Diagram: host CPUs and a peripheral device sharing access to the NI]
In GRIM peripherals are endpoints
Intelligent peripherals: operate autonomously; on-card message queues; process incoming active messages and eject outgoing active messages
Legacy peripherals: managed by a host application or via remote memory operations
Legacy Peripheral Device
Cyclone Systems I2O Server Adaptor Card
A networked host on a PCI card
Integration with GRIM: interacts directly with the NI; the host-level endpoint software was ported to the card
Utilized as a LAN-SAN bridge
[Diagram: Cyclone card architecture: i960 Rx processor, DRAM, ROM, DMA engines, primary and secondary PCI interfaces, and a daughter card with dual 10/100 Ethernet and SCSI ports on a local bus]
Video Capture and Display Cards
Video capture card: specialized DMA engine; host-level library based on Video4Linux; controlled with active messages
Video display card: the host locates the frame buffer and manipulates it through remote memory writes
Incorporates legacy peripheral devices via host-level management or remote memory operations
[Diagram: video capture card (A/D converter and DMA engine) writing frames into host memory; video display card frame buffer (AGP, D/A converter) written remotely]
Celoxica RC-1000 FPGA Card
FPGAs provide acceleration; load them with application-specific circuits
Celoxica RC-1000 FPGA card: Xilinx Virtex-1000 FPGA, 8 MB SRAM
Hardware implementation: the endpoint is built as state machines; AM handlers are circuits
[Diagram: RC-1000 card with four SRAM banks, PCI interface, and the FPGA's control and switching logic]
FPGA Endpoint Organization
[Diagram: FPGA endpoint organization: input and output message queues and application data in card memory, exposed to user circuits 1 through n through communication library, memory, and user-circuit APIs on the FPGA circuit canvas]
Example FPGA Configuration
Cryptography configuration: DES, RC6, MD5, and an ALU
20 MHz clock; newer FPGAs are much faster
Expansion: Sharing the FPGA
The FPGA has limited space for hardware circuits; the host reconfigures the FPGA on demand when an FPGA "function fault" occurs (sketched below)
[Diagram: a message requesting Circuit F triggers a function fault; the host CPU saves circuit state to SRAM 0, swaps the loaded configuration out for Configuration C (about 150 ms), and resumes; analogous to a page fault]
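A sketch of how the host side of such a function fault might be handled. The circuit and configuration ids, and the save/load helpers, are hypothetical stand-ins for a card-management library, not the thesis's actual implementation.

#include <stdbool.h>

typedef int circuit_id_t;
typedef int config_id_t;

static config_id_t current_config = -1;

/* Hypothetical helpers provided by a card-management library. */
extern config_id_t config_containing(circuit_id_t c);
extern void save_circuit_state_to_sram(void);
extern void load_bitstream(config_id_t cfg);       /* ~150 ms on the RC-1000 */
extern void restore_circuit_state_from_sram(void);

/* Ensure the circuit named in an incoming message is resident. */
static void ensure_circuit_loaded(circuit_id_t wanted)
{
    config_id_t cfg = config_containing(wanted);
    if (cfg == current_config)
        return;                        /* circuit already loaded          */
    save_circuit_state_to_sram();      /* spill state, like a page fault  */
    load_bitstream(cfg);               /* reconfigure the FPGA            */
    restore_circuit_state_from_sram();
    current_config = cfg;
}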
Expansion: Sharing On-Card Memory
Limited card memory for storing application data; construct a virtual memory system for on-card memory, with host memory as the swap space (sketched below)
[Diagram: user-defined circuits on the FPGA access user pages mapped onto SRAM page frames; pages are swapped between card SRAM and host memory by the host CPU]
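A sketch of the on-card virtual memory idea: user circuits reference "user pages" that are mapped onto a small number of SRAM page frames, with host memory acting as swap space. The frame count, replacement policy, and copy helpers are hypothetical simplifications.

#include <stdint.h>

#define NUM_FRAMES 4                    /* e.g., one frame per SRAM bank */

typedef struct {
    int resident_page;                  /* user page in this frame, -1 if none */
} frame_t;

static frame_t frames[NUM_FRAMES] = { {-1}, {-1}, {-1}, {-1} };

/* Hypothetical transfers between card SRAM and host swap memory. */
extern void copy_frame_to_host(int frame);
extern void copy_page_from_host(int page, int frame);

/* Return the frame holding `page`, swapping it in from host memory if
 * necessary (a trivial "evict frame 0" policy, for illustration only). */
static int touch_page(int page)
{
    for (int f = 0; f < NUM_FRAMES; f++)
        if (frames[f].resident_page == page)
            return f;                           /* hit                    */
    int victim = 0;                             /* trivial replacement    */
    if (frames[victim].resident_page >= 0)
        copy_frame_to_host(victim);             /* write back to host swap */
    copy_page_from_host(page, victim);
    frames[victim].resident_page = page;
    return victim;
}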
Extension: Streaming Computations
Software extensibility [1/3]
Streaming Computation Overview
A programming method for distributed resources: establish a pipeline for streaming operations. Example: multimedia processing
[Diagram: streaming pipeline over the system area network: a video capture endpoint feeds a chain of media-processor endpoints (e.g., Celoxica RC-1000 FPGA cards)]
Streaming Fundamentals
Computation: how is a computation performed? Active message approach
Forwarding: where are results transmitted? Programmable forwarding directory (sketched below)
[Diagram: an incoming message (destination: FPGA; forwarding entry X; AM: perform FFT) is processed by one of the FPGA's computational circuits (circuit 1: FFT ... circuit N: encrypt); the forwarding directory turns the result into an outgoing message (destination: host; forwarding entry X; AM: receive FFT)]
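A sketch of what a programmable forwarding directory entry could look like. The field names, table size, and lookup function are hypothetical illustrations of the idea, not the GRIM data structures.

#include <stdint.h>

typedef struct {
    uint16_t next_node;      /* destination of the result (e.g., host or FPGA) */
    uint16_t next_handler;   /* AM to invoke there (e.g., "receive FFT")       */
    uint16_t next_entry;     /* forwarding entry to use at the next stage      */
} fwd_entry_t;

#define FWD_TABLE_SIZE 64
static fwd_entry_t fwd_table[FWD_TABLE_SIZE];

/* After a circuit finishes (e.g., an FFT), the endpoint looks up the
 * forwarding entry carried in the incoming message and emits the result
 * as a new active message addressed by that entry. */
static fwd_entry_t lookup_forward(uint16_t entry)
{
    return fwd_table[entry % FWD_TABLE_SIZE];
}

Because each stage only consults its own entry, an application can rewire a pipeline by rewriting directory entries rather than changing the circuits themselves.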
Performance: FPGA Computations
Clocks per stage: Acquire SRAM: 8; Detect New Message: 4; Fetch Header: 7; Fetch Payload: 1024; Computation: 1024; Store Results: 1024; Store Header: 16; Lookup Forwarding: 5; Update Queues: 3; Release SRAM: 1
Clock speed: 20 MHz. Operation latency: 55 μs for a 4 KB message (73 MB/s)
Extension: Multicast
Software extensibility [2/3]
GRIM Multicast Extensions
Distribute the same message to multiple receivers: tree-based distribution; replicate the message at the NI; messages are recycled back into the network
Extensions to the NI's core communication operations: recycled messages travel in a separate logical channel; per-hop flow control provides reliable delivery (sketched below)
[Diagram: multicast tree with A at the root distributing to B and C, and on to D and E; each NI delivers the message to its local endpoint and recycles copies back into the network for its children]
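A sketch of the NI-level replication step described above: the NI delivers the message locally, then recycles one copy per child in its branch of the distribution tree. The structures and primitives are hypothetical stand-ins for the NI firmware's operations.

#include <stdint.h>

#define MAX_CHILDREN 4

typedef struct {
    int      num_children;
    uint16_t children[MAX_CHILDREN];   /* next nodes in the multicast tree */
} mcast_entry_t;

/* Hypothetical NI primitives: deliver locally and enqueue on the
 * recycle (multicast) logical channel, which still obeys per-hop
 * flow control for reliability. */
extern void deliver_to_local_endpoint(const void *msg, uint32_t len);
extern void recycle_to_node(uint16_t node, const void *msg, uint32_t len);

static void ni_multicast(const mcast_entry_t *e, const void *msg, uint32_t len)
{
    deliver_to_local_endpoint(msg, len);
    for (int i = 0; i < e->num_children; i++)
        recycle_to_node(e->children[i], msg, len);
}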
Multicast Performance
[Plot: time (μs) vs. multicast message size (bytes) for multicast RTT, unicast RTT, multicast injection overhead, and unicast injection overhead; 8 hosts, LANai 4 NIs, P4-1.7 GHz hosts]
Multicast Observations
Beneficial: reduces sending overhead
Performance loss for large messages, dependent on the NI's memory-copy bandwidth
On-card memory-copy benchmark: LANai 4: 19 MB/s; LANai 9: 66 MB/s
Extension: Sockets API
Software Extensibility [3/3]
Extension: Sockets Emulation
Berkeley sockets is a widely used communication standard, employed in numerous distributed applications
GRIM provides sockets API emulation: functions for intercepting socket calls, plus AM handler functions for buffering connection data (sketched below)
[Diagram: the sender intercepts write() and generates an "append to socket X" active message carrying the data; the receiver's AM handler appends it to socket X's buffer, and an intercepted read() extracts the data]
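A sketch of how a socket write() could be intercepted and reissued as an "append to socket" active message, in the spirit of the diagram above. The handler id, the grim_am_send() call, and real_write() (standing in for the original libc call, e.g., obtained via dlsym in an LD_PRELOAD-style wrapper) are all hypothetical.

#include <stddef.h>
#include <sys/types.h>

#define AM_APPEND_SOCKET 42                  /* hypothetical handler id      */

extern int fd_is_grim_socket(int fd);        /* hypothetical connection lookup */
extern ssize_t grim_am_send(int fd, int handler,
                            const void *buf, size_t len);
extern ssize_t real_write(int fd, const void *buf, size_t len);

/* Wrapper substituted for write(): GRIM-backed sockets are turned into
 * active messages; everything else falls through to the real call. */
ssize_t write(int fd, const void *buf, size_t count)
{
    if (fd_is_grim_socket(fd))
        return grim_am_send(fd, AM_APPEND_SOCKET, buf, count);
    return real_write(fd, buf, count);
}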
Sockets Emulation Performance
[Plot: bandwidth (MBytes/s) vs. transfer size (bytes) for GRIM sockets over LANai 4 compared with 100 Mb/s Ethernet; P4-1.7 GHz hosts]
Host-to-Host Performance
Transferring data between two host-level endpoints
Host-to-Host Communication Performance
Host-to-host transfers are the standard benchmark; benchmarks use remote memory writes; Myrinet LANai 4 and LANai 9 NI cards
Measured: injection performance and the overall communication path
[Diagram: source and destination hosts; a transfer crosses (1) source memory to NI, (2) NI to NI over the SAN, and (3) NI to destination memory, using active messages or remote memory operations]
Host-NI: Data Injections
Host-NI transfers are challenging: the host lacks a DMA engine
Multiple transfer methods: programmed I/O and DMA
Automatically select the method. Result: the Tunable PCI Injection Library (TPIL), sketched below
[Diagram: host CPU with cache, memory controller, and main memory connected over the PCI bus to a peripheral with device memory and a PCI DMA engine]
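A sketch of the idea behind a tunable injection library: use programmed I/O for small transfers and DMA for large ones, with the crossover point tuned per host/NI pair. The threshold value and the copy/DMA helpers are hypothetical placeholders, not the TPIL interface.

#include <stddef.h>
#include <stdint.h>

extern void pio_copy_to_ni(volatile void *ni_mem, const void *src, size_t len);
extern void dma_to_ni(uint64_t ni_phys_addr, const void *src, size_t len);

static size_t pio_dma_crossover = 1024;     /* bytes; found by benchmarking */

static void tpil_inject(volatile void *ni_mem, uint64_t ni_phys,
                        const void *src, size_t len)
{
    if (len < pio_dma_crossover)
        pio_copy_to_ni(ni_mem, src, len);   /* CPU stores over the PCI bus */
    else
        dma_to_ni(ni_phys, src, len);       /* DMA engine pulls the data   */
}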
TPIL Performance: LANai 9 NI with Pentium III-550 MHz Host
[Plot: injection bandwidth (MBytes/s) vs. injection size (bytes) for TPIL]
Overall Performance: Store-and-Forward
Approach: a single message, no overlap; three transmission stages; expect roughly one third of an individual stage's bandwidth (see the estimate below)
[Diagram and plot: store-and-forward timeline of Message 1 crossing the sending host-NI (PCI: 132 MB/s), NI-NI (Myrinet: 160 MB/s), and receiving NI-host (PCI: 132 MB/s) stages; overall bandwidth (MBytes/s) vs. message size (bytes); P3-550 MHz hosts]
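A back-of-the-envelope estimate, using the per-stage bandwidths from the figure (132 MB/s PCI, 160 MB/s Myrinet), illustrates the "roughly one third" claim. For a message of length L sent store-and-forward through the three stages,

B_{\mathrm{s\&f}} = \frac{L}{L/B_{\mathrm{PCI}} + L/B_{\mathrm{SAN}} + L/B_{\mathrm{PCI}}} = \left(\frac{1}{132} + \frac{1}{160} + \frac{1}{132}\right)^{-1}\ \mathrm{MB/s} \approx 47\ \mathrm{MB/s},

which is about one third of the bandwidth of any individual stage.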
Enhancement: Message Pipelining
Allow overlap with multiple in-flight messages; GRIM uses AM and RM fragmentation/reassembly; performance depends on the fragment size (a fragmentation sketch follows below)
[Diagram and plot: pipelined timeline with Messages 1-3 overlapping across the sending host-NI, NI-NI, and receiving NI-host stages; overall bandwidth (MBytes/s) vs. message size (bytes); LANai 9, P3-550 MHz hosts]
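A sketch of the fragmentation loop that makes pipelining possible: a large message is split into fixed-size fragments so the host-NI, NI-NI, and NI-host stages can overlap on different fragments. The fragment size and the send primitive are hypothetical tuning knobs, not GRIM's actual values.

#include <stdint.h>
#include <stddef.h>

#define FRAG_SIZE 4096                     /* tuning knob: fragment size */

extern void send_fragment(uint32_t msg_id, uint32_t offset,
                          const void *data, size_t len, int last);

static void send_pipelined(uint32_t msg_id, const uint8_t *data, size_t len)
{
    size_t off = 0;
    while (off < len) {
        size_t chunk = (len - off > FRAG_SIZE) ? FRAG_SIZE : (len - off);
        int last = (off + chunk == len);
        send_fragment(msg_id, (uint32_t)off, data + off, chunk, last);
        off += chunk;                      /* receiver reassembles by offset */
    }
}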
Enhancement: Cut-through Transfers
Forward data as soon as it begins to arrive; cut-through at both the sending and receiving NIs
[Diagram and plot: cut-through timeline with Messages 1 and 2 forwarded through the sending host-NI, NI-NI, and receiving NI-host stages as data arrives; overall bandwidth (MBytes/s) vs. message size (bytes); LANai 9, P3-550 MHz hosts]
Overall Host-to-Host Performance
Host        NI       Latency (μs)  Bandwidth (MB/s)
P4-1.7 GHz  LANai 9  8             146
P4-1.7 GHz  LANai 4  14.5          108
P3-550 MHz  LANai 9  9.5           116
P3-550 MHz  LANai 4  14            96
[Plot: bandwidth (MBytes/s) vs. message size (bytes)]
Comparison to Existing Message Layers
[Charts: latency (μs) and bandwidth (MB/s) compared against existing message layers]
Host-to-Host Performance Summary
GRIM is fitted with performance enhancements that take place automatically (self-configuring)
GRIM provides competitive performance: 1.168 Gb/s bandwidth and 8 μs latency
Provides increased functionality
Concluding Remarks
Key Contributions
Framework for communication in resource-rich clusters: reliable delivery mechanisms, a virtualized network interface, and flexible programming interfaces; performance comparable to state-of-the-art message layers
Extensible for peripheral devices: suitable for intelligent and legacy peripherals; methods for managing card resources
Extensible for higher-level programming abstractions: endpoint-level (streaming computations and sockets emulation) and NI-level (multicast)
Future Directions
Continued work with GRIM: video card vendors are opening cards to developers; Myrinet-connected embedded devices
Adaptation to other network substrates: Gigabit Ethernet is appealing because of cost but requires modification of the transmission protocols; InfiniBand technology is promising
Active system area networks: FPGA chips are beginning to feature gigabit transceivers; use FPGA chips as networked processing devices
Related Publications
A Tunable Communications Library for Data Injection, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2002.
Active SANs: Hardware Support for Integrating Computation and Communication, C. Ulmer, C. Wood, and S. Yalamanchili, Proceedings of the Workshop on Novel Uses of System Area Networks at HPCA, 2002.
A Messaging Layer for Heterogeneous Endpoints in Resource Rich Clusters, C. Ulmer and S. Yalamanchili, Proceedings of the First Myrinet User Group Conference, 2000.
An Extensible Message Layer for High-Performance Clusters, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2000.
Papers and Software Available at http://www.ee.gatech.edu/~grimace/research