
Overview of the Distributed Symmetric Multiprocessing Software Architecture

By Peter Robinson, Technical Marketing Manager
Symmetric Computing
Venture Development Center
University of Massachusetts - Boston
Boston, MA 02125



Introduction

Distributed Symmetric Multiprocessing, or DSMP, is a kernel extension that enhances the legacy Linux operating system so it can support a scalable, shared-memory architecture over a 40Gb InfiniBand-attached cluster. DSMP is comprised of two unique software components: the host operating system (OS), which runs on the head node, and a lightweight micro-kernel OS, which runs on all of the other servers that make up the cluster. The host OS consists of a Linux image plus a new DSMP kernel, creating a new derivative work as noted in Figure 1. The micro-kernel is a non-Linux image that extends the function of the host OS over the entire cluster. These two OS images (host and micro-kernel) are designed to run on commodity Symmetric Multiprocessing (SMP) servers based on the AMD64 processor.

The AMD64 architecture was selected over competing platforms for a number of reasons, the primary one being price-performance. Back in 2005, when DSMP was conceived, the AMD Opteron™ processor was the only x86 solution that supported a high-density, 4P direct-connect architecture in a 1U form factor. As of 4Q09, AMD continues to provide the best value for 4P 1U servers, and it continues to offer the only commercially viable 4P solution on the market today.

A look at supercomputing today

Supercomputing can be divided into two camps: proprietary shared-memory systems and commodity Message Passing Interface (MPI) clusters. Shared-memory systems are based on commodity processors, such as the PowerPC, the Itanium, or the ever-popular x86, and commodity memory (DRAM DIMMs). At the core of most shared-memory systems is a proprietary fabric. This fabric physically extends the host processor's coherency scheme over multiple nodes, providing low-latency inter-node communication while maintaining system-wide coherency. These ultra-expensive, hardened shared-memory supercomputers are designed to accommodate concurrent enterprise or transactional processing applications. These applications (VMware, Oracle, dBASE, SAP, etc.) can utilize anywhere from one to 512+ processor cores and terabytes of shared memory. Most of them are optimized for the host OS and the micro-architecture of the host processor, but not for the macro-architecture of the target system. Shared-memory systems are also a great deal easier to develop applications for. In fact, it is rarely necessary to modify code-sets or data-sets to run on a shared-memory system; most SMP software simply plugs and plays, which is why shared-memory supercomputers are in such high demand.


[Figure 1 – Host DSMP software architecture: System Call Interface (SCI), Virtual File System (VFS), Process Management (PM), Memory Management (MM), Network Stack, Device Drivers (DD), ARCH, and the DSMP Gasket Interface]


Conversely, MPI clusters are comprised entirely of commodity servers connected via Ethernet, InfiniBand, or similar communication fabrics. However, these commodity networks introduce tremendous latency compared to the proprietary fabrics in OEM shared-memory supercomputers. Additionally, cluster computing challenges application providers to comply with the strict rules of MPI and to work within the memory limitations of the SMP nodes that make up the cluster. Despite the computational and porting overhead, the cost benefits of commodity-based computing solutions make MPI clusters a staple of university and small-business research labs.

Although MPI is the platform of choice for universities and research labs, data-sets in bioinformatics, oil & gas, atmospheric modeling, and similar fields are becoming too large for single-node Symmetric Multi-Processing (SMP) systems, and they are impractical for an MPI cluster because of the problems that arise when data-sets are partitioned across nodes. The alternative is to purchase time on a National Lab shared-memory supercomputer (such as the ORNL peta-scale Cray XT4/XT5 Jaguar). The problem with the Jaguar option is cost, time and overkill. In short, the reliability, availability and serviceability (RAS) of enterprise computing is quite different from what a researcher wants. For example, researchers and academics:

- Don't need a hardened, enterprise-class, 9-9s-reliable platform;
- Do not run multiple applications concurrently, so there is no need for virtualization; their applications are single-process, multiple-thread;
- Have an aversion to spending the time, dollars and staff-hours needed to apply for access to these National Lab machines;
- Do not want to wait weeks on end in a queue to run their application;
- Are willing to optimize their applications for the target hardware to get the most out of the run;
- Ultimately want unencumbered 24/7 access to an affordable shared-memory machine – just like their MPI cluster.

Enter Symmetric Computing

The design team at Symmetric Computing came out of the research community. As such, they were keenly aware of the problems researchers face today and will face in the future. This awareness drove the development of DSMP and the decision to base it on commodity hardware. Our intent is nothing short of having DSMP do for shared-memory supercomputing what the Beowulf project (MPI) did for cluster computing.


How DSMP works

As stated in the introduction, DSMP is software that transforms an InfiniBand-connected cluster of homogeneous 1U/4P commodity servers into a shared-memory supercomputer. Although there are two unique kernels (a host kernel and a micro-kernel), for this discussion we will ignore the difference between them because, from the programmer's perspective, there is only one OS image and one kernel. The DSMP kernel provides seven enhancements that transform a cluster into a distributed symmetric multiprocessing platform. They are:

1. The shared-memory system;
2. An optimized InfiniBand driver which supports a shared-memory architecture;
3. An application-driven, memory-page coherency scheme;
4. An enhanced multi-threading service, based on the POSIX thread standard;
5. A distributed MuTeX;
6. A memory-based distributed disk-queue; and
7. A distributed disk array.

The shared-memory system: The centerpiece of DSMP is its shared-memory architecture. For our example we will assume a three-node 4P system with 64GB of physical memory per node. The three nodes are networked via 40Gb InfiniBand and there is no switch. This, in fact, is our value Treo™ Departmental Supercomputer product offering, shown at the right.


[Photo: Treo™ Departmental Supercomputer]


Figure 2 presents a macro view of the DSMP memory architecture. What becomes quite obvious from this graphic is the use of two memory segments: local memory and global memory. Both coexist in the SMP physical memory and are evenly distributed over the four AMD64 processors in each of the three servers. However, the memory management unit (MMU) on the AMD Opteron™ processor sees only the local memory (noted in blue). Local memory is statically allocated by the kernel; for our Treo™ example we will assume 1GB of local memory for every AMD64 core within the server. Hence, there are 16GB of local memory per server, or 48GB of local memory allocated from the 192GB of available system-wide memory. The remaining 144GB is global memory, which is concurrently viewable and accessible by all 48 processor cores within the Treo™ Departmental Supercomputer.
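The local/global split follows directly from the node count, the per-node memory, and the 1GB-per-core local allocation. A minimal sketch of the arithmetic (the constants below are simply the Treo™ figures quoted above):

    #include <stdio.h>

    int main(void)
    {
        /* Treo(TM) example configuration from the text above */
        const int nodes          = 3;    /* 1U/4P servers             */
        const int cores_per_node = 16;   /* 4 quad-core AMD64 sockets */
        const int gb_per_node    = 64;   /* physical DRAM per server  */
        const int local_gb_core  = 1;    /* local memory per core     */

        int total_gb  = nodes * gb_per_node;                    /* 192GB */
        int local_gb  = nodes * cores_per_node * local_gb_core; /*  48GB */
        int global_gb = total_gb - local_gb;                    /* 144GB */

        printf("total %dGB, local %dGB, global %dGB\n",
               total_gb, local_gb, global_gb);
        return 0;
    }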

All memory (local and global) is partitioned into 4,096-byte pages, i.e., 64 AMD64 cache lines. When there is a cache-line miss from local memory (a page fault), the kernel identifies a least-recently-used (LRU) local memory page and swaps in the missing memory page from global memory. That happens, across the InfiniBand fabric, in just under 5 microseconds, and even faster if the page resides on the same physical node.
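Conceptually, the fault path looks like the sketch below. This is not the DSMP kernel code; every identifier here is a hypothetical placeholder standing in for the steps described above.

    #define DSMP_PAGE_SIZE 4096  /* 64 AMD64 cache lines */

    /* Conceptual sketch only; none of these are real DSMP or Linux symbols. */
    struct dsmp_page;
    extern struct dsmp_page *find_lru_local_page(void);
    extern int  page_is_dirty(struct dsmp_page *p);
    extern void rdma_writeback_page(struct dsmp_page *p);
    extern void rdma_fetch_page(struct dsmp_page *p, unsigned long global_addr);
    extern unsigned long global_addr_of(unsigned long vaddr);
    extern void remap_local(unsigned long vaddr, struct dsmp_page *p);

    void dsmp_handle_page_fault(unsigned long faulting_vaddr)
    {
        /* 1. Choose a least-recently-used page frame in local memory. */
        struct dsmp_page *victim = find_lru_local_page();

        /* 2. If the victim is dirty, write it back to its home location
         *    in global memory first. */
        if (page_is_dirty(victim))
            rdma_writeback_page(victim);

        /* 3. Pull the missing 4KB page from global memory over InfiniBand
         *    (~5 us remote, faster when the page lives on this node). */
        rdma_fetch_page(victim, global_addr_of(faulting_vaddr));

        /* 4. Map the refilled frame at the faulting address and resume. */
        remap_local(faulting_vaddr, victim);
    }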

The Optimized InfiniBand Drivers: The entire success of DSMP revolves around the existence of a low-latency, commercially available network fabric. It wasn't that long ago, with the exit of Intel from InfiniBand, that industry experts were forecasting its demise. Today, InfiniBand is the fabric of choice for most High Performance Computing (HPC) clusters due to its low latency and high bandwidth.

To squeeze every last nanosecond of performance out of the fabric, the designer of DSMP bypassed the Linux InfiniBand protocol stack and wrote his own low-level driver. In addition, he developed a set of drivers that leverage the native RDMA capabilities of the InfiniBand host channel adapter (HCA). This allows the HCA to service and move memory-page requests without processor intervention.


[Figure 2 – DSMP memory architecture: three 4P servers (SMP 0, SMP 1, SMP n), each with processors P0-P3 and 64GB of physical memory partitioned into 16GB of local memory (4GB per processor) and 48GB of global memory (12GB per processor), linked by InfiniBand TX/RX paths]


By eliminating the overhead of message construction and deconstruction, RDMA reduces system-wide latency.
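As an illustration of the kind of zero-copy transfer involved, the sketch below posts a single RDMA read for one 4KB page using the standard libibverbs API. This is not the DSMP driver itself; it assumes an already-connected RC queue pair, a locally registered buffer, and a remote address/rkey exchanged out of band.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Pull one 4KB page from a remote node's memory with an RDMA read;
     * the HCA moves the data with no CPU copy or message handling. */
    int fetch_remote_page(struct ibv_qp *qp,
                          struct ibv_mr *local_mr, void *local_buf,
                          uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,
            .length = 4096,            /* one DSMP memory page */
            .lkey   = local_mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = (uintptr_t)local_buf;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.opcode              = IBV_WR_RDMA_READ;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad_wr);  /* completion polled elsewhere */
    }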

An application-driven, memory-page coherency scheme: As stated in the introduction, proprietary supercomputers maintain memory consistency and/or coherency via hardware extensions of the host processor. DSMP takes a different approach, maintaining two separate levels of coherency within the system. First, there is cache-line coherency within the local SMP server. Coherency at this level is maintained by the MMU and the SMP logic native to the AMD64 processor, i.e., Cache-coherent HyperTransport™ technology. Global memory-page coherency and consistency, however, are controlled and maintained by the programmer. This approach may seem counter-intuitive at first. However, the target market segment for DSMP is technical computing, not the enterprise, and it was assumed that the end user is familiar with the algorithm and with how to optimize it for the target platform (in the same way code was optimized for a Beowulf cluster). The high skill level of the end users, combined with the need to use only commodity hardware, drove the system-level design decisions that keep a DSMP cluster both affordable and fast. To reach these goals, new and enhanced Linux primitives were developed. Hence, with some simple, intuitive programming rules, augmented with the new primitives, porting an application to a DSMP platform (while maintaining coherency) is simple and manageable. Those rules are as follows:

- Be sensitive to the fact that memory pages are swapped into and out of local memory from global memory in 4K pages and that it takes about 5 microseconds to complete the swap.
- Be careful not to overlap or allocate multiple data-sets within the same memory page. To help prevent this, a new Alloc( ) primitive is provided to assure alignment (see the sketch following this list).
- Because of the way local and global memory are partitioned within physical memory, care should be taken to distribute processes/threads and their associated data evenly over the four processors. In short, try not to pile processes/threads onto one processor/memory unit; rather, distribute them evenly over the system. POSIX thread primitives are provided to support the distribution of these threads.
- If a data-set is "modified-shared" and accessed by multiple processes/threads on an adjacent server, then it will be necessary to use a set of new Linux primitives, i.e., Sync( ), Lock( ) and Release( ), to maintain coherency.
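To make the second rule concrete, here is a minimal sketch of page-aligned allocation. The DSMP Alloc( ) signature is not documented in this paper, so the prototype below is an assumption; posix_memalign( ) is mentioned as the standard way to get the same 4K alignment on stock Linux.

    #include <stddef.h>

    #define DSMP_PAGE 4096  /* DSMP swaps memory in 4,096-byte pages */

    /* Assumed prototype for the DSMP allocation primitive named in the
     * text; the real signature may differ. */
    extern void *Alloc(size_t bytes);

    /* Give each data-set its own page-aligned region so that the end of
     * one structure never shares a page with another data-set. */
    double *make_vector(size_t n)
    {
        return (double *)Alloc(n * sizeof(double));
        /* On a stock Linux system the same alignment can be had with
         * posix_memalign(&p, DSMP_PAGE, n * sizeof(double)). */
    }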

Multi-Threading: The "gold standard" for parallelizing Linux C/C++ source code is the POSIX thread library, or Pthreads. POSIX is an acronym for Portable Operating System Interface. The latest version, POSIX.1 (IEEE Std 1003.1, 2004 Edition), was developed by the Austin Common Standards Revision Group (CSRG). To ensure that Pthreads would work with DSMP, each of the two dozen or so POSIX routines was either tested with and/or modified for DSMP and the Treo™ platform.
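For readers less familiar with Pthreads, the fragment below is an ordinary POSIX-threads program of the kind DSMP is designed to run unchanged; nothing in it is DSMP-specific.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* Worker: each thread processes its own slice of a problem. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("thread %ld running\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);

        for (long i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        return 0;
    }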

The common method for parallelizing a process is via the Fork( ) primitive. Within DSMP there is a flag associated with Fork( ). This flag determines whether the forked thread is to stay local (with the current process on the primary server) or run on one of the remote servers. This allows the programmer to specify how many threads of a given process can be serviced by the head node. Simple analysis will show just how many threads can run concurrently before performance flattens out due to the memory-wall effect or other conditions. Once this value is understood, the remote flag can be used to evenly distribute threads over all the servers in the DSMP system. By default, each successive instance of Fork( ) causes the new thread to be associated with the next server in the DSMP system, in round-robin fashion. Hence, a remote Fork( ) of three threads on Treo™ would place the current process on each of the three servers, with one thread per server. The kernel manages the consistency of the process to ensure it executes with the same environment and associated state variables.
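A sketch of how such a flag might look in code follows. The Fork( ) prototype and the DSMP_LOCAL/DSMP_REMOTE flag names are assumptions for illustration; the paper names the primitive and the remote flag, but not their exact signatures.

    /* Hypothetical illustration of the DSMP Fork( ) remote flag described
     * above. Fork( ), DSMP_LOCAL and DSMP_REMOTE are assumed names. */
    #define DSMP_LOCAL  0   /* keep the thread on the head node            */
    #define DSMP_REMOTE 1   /* place the thread round-robin across servers */

    extern int Fork(void *(*start)(void *), void *arg, int placement);
    extern void *solver_slice(void *arg);

    void launch(void)
    {
        /* Keep a few threads on the head node... */
        for (long i = 0; i < 4; i++)
            Fork(solver_slice, (void *)i, DSMP_LOCAL);

        /* ...and spread three more over the remote servers; by default each
         * successive remote Fork( ) lands on the next server in turn. */
        for (long i = 4; i < 7; i++)
            Fork(solver_slice, (void *)i, DSMP_REMOTE);
    }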

Coherency at the memory-page level is the responsibility of the programmer. Much of this is common sense: if a memory page is accessed by multiple threads and updated (modified-exclusive), then it is necessary to hold off pending threads until the current thread has updated the page in question. To facilitate this, three DSMP Linux primitives are provided: Sync( ), Lock( ) and Release( ).


Sync( ): as the name implies, synchronizes one local private memory page with its source global-memory page.

Lock( ): prevents any other process thread from accessing, and subsequently modifying, the memory page. Lock( ) also invalidates all other copies of the locked memory page within the system. If a process thread on an adjacent server accesses a locked memory page, its execution is suspended until the page is released.

Release( ): as the name implies, releases a previously locked memory page.

Lastly, to ensure that data structures do not overlap, a new DSMP Alloc( ) primitive is provided to force alignment of a given data structure on a 4K boundary. This primitive assures that the end of one data structure does not fall inside an adjacent data structure.
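Put together, a typical update of a shared, modified page might look like the sketch below. The exact prototypes of Lock( ), Release( ) and Sync( ) are not given in this paper, so the argument lists here are assumptions; the pattern itself - lock, modify, release, then synchronize readers - is what the text describes.

    /* Hypothetical prototypes for the DSMP primitives named in the text;
     * the real signatures may differ. */
    extern void Lock(void *page_addr);
    extern void Release(void *page_addr);
    extern void Sync(void *page_addr);

    /* One writer updates a shared ("modified-shared") page. */
    void update_shared(double *shared_page, int n, double value)
    {
        Lock(shared_page);            /* invalidate other copies, block peers */
        for (int i = 0; i < n; i++)
            shared_page[i] += value;  /* mutate the page while exclusive      */
        Release(shared_page);         /* let suspended threads proceed        */
    }

    /* A reader on another node refreshes its private copy before use. */
    double read_first(double *shared_page)
    {
        Sync(shared_page);            /* pull the page from global memory */
        return shared_page[0];
    }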

Distributed MuTeX: Wikipedia describes a MuTeX, or mutual exclusion, as a set of algorithms used in concurrent programming to avoid the simultaneous use of a common resource, such as a global variable or a critical section. A distributed MuTeX is nothing more than a DSMP kernel enhancement which ensures that a MuTeX functions as expected within the DSMP system. From a programmer's point of view, there are no changes or modifications to MuTeX – it just works.
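In other words, ordinary Pthreads mutex code such as the fragment below is expected to run unmodified on DSMP; there is nothing DSMP-specific in it.

    #include <pthread.h>

    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter;  /* shared resource protected by the mutex */

    void add_to_counter(long delta)
    {
        pthread_mutex_lock(&counter_lock);   /* enter the critical section */
        counter += delta;
        pthread_mutex_unlock(&counter_lock); /* leave the critical section */
    }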

Memory-based distributed disk-queue: A new DSMP primitive, D_file( ), provides a high-bandwidth, low-latency elastic queue for data that is intended to be written to a low-bandwidth interface, such as a hard disk drive (HDD) or the network. This distributed input/output queue is a memory (DRAM) based storage buffer that effectively eliminates the bottlenecks which occur when multiple threads compete for a low-bandwidth device such as an HDD. Once the current process retires, the contents of the queue are sent to the target I/O device and the queue is released.
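As a rough illustration only: the D_file( ) interface is not specified in this paper, so the fopen( )-style prototypes below are assumptions. The point is that threads write into the DRAM-backed queue at memory speed and the drain to the slow device happens after the process retires.

    #include <stddef.h>

    /* Assumed interface for the DSMP D_file( ) queue named in the text;
     * the real API may look quite different. */
    typedef struct dsmp_queue DSMP_QUEUE;
    extern DSMP_QUEUE *D_file(const char *path);
    extern size_t      D_write(DSMP_QUEUE *q, const void *buf, size_t len);

    /* Each worker thread streams its results into the DRAM-backed elastic
     * queue at memory speed; the queue drains to the HDD only after the
     * process retires, so threads never contend for the disk directly. */
    void emit_block(DSMP_QUEUE *q, const double *block, size_t n)
    {
        D_write(q, block, n * sizeof(double));
    }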

A distributed disk array: A distributed disk array is implemented by the kernel through enhancements made to the Linux striped volume manager. These enhancements extend the Linux volume manager over the entire network, presenting a single consolidated drive to the OS. On Treo™, the distributed disk array is made up of six 1TB drives – two per server – forming a single 6TB storage device.

DSMP Performance

Performance of a supercomputer is a function of two metrics:

1) Processor performance (computational throughput); and
2) Global memory read/write performance, which can be further divided into:
   a. Stream performance – continuous R/W memory bandwidth; and
   b. Random read/write performance – memory R/W latency.

The extraordinary thing about DSMP™ is the fact that it is based on commodity components. That is important because DSMP performance scales with the performance of the commodity components from which it is built. As an example, random read/write latency for Treo™ went down 40% with the availability of 40Gb InfiniBand. Furthermore, this move from a 20Gb to a 40Gb fabric caused no appreciable increase in the cost of a Treo™ system (and no changes to the DSMP software were needed). Within the same timeframe, AMD64 processor density went from quad-core to six-core, again without any appreciable increase in the cost of the total system. Therefore, over time, the performance gap between DSMP™ shared-memory supercomputers and proprietary shared-memory systems will close.

Today, proprietary shared-memory system providers quote intra-node bandwidth numbers on the order of 2.5GB/sec and random access times on the order of 1µsec. That is a difference of roughly 4:1 in bandwidth and 5:1 in R/W latency over DSMP™. At first glance this much disparity might appear to be a disadvantage, but that is not necessarily the case, for three reasons. First, DSMP random R/W latency is based on the time it takes to move 4,096 bytes, versus the 64 or 128 bytes moved in under 1µsec by SGI and others; that is a 64:1 or 32:1 difference in cache-line or page size. In addition, the processors used in these proprietary systems might have enhanced floating-point capabilities, but they might also run slower, in some cases much slower, than a 2.8GHz quad-core AMD Opteron™ processor. So performance is not tied entirely to memory latency or processor performance; it is a function of many system variables, as well as of the algorithm and the way the data is structured.

A second and more important reason why DSMP performance is not a problem is access, that is, having open and unencumbered 24/7 access to a shared-memory system. As an example, let's assume it takes 24 hours to run a job on the ORNL Jaguar supercomputer with an allocation of 48 processors and 150GB of shared memory. However, it takes months to submit the proposal and gain approval, and then there is an additional wait in the queue of around 14 days to access the system, which is typical for this type of engagement. If we assume the DSMP™ shared-memory supercomputer delivers 1/5 the performance of the one at Oak Ridge (due to memory latency, bandwidth and related factors), then it would take five times longer to get the same results – 120 hours versus 24. However, once the two-week queue time is taken into account, the results are available 10 days sooner, and in the same time frame the job could have been run three times over.
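The time-to-solution arithmetic behind that comparison is simple enough to check directly (proposal time is left out of both sides):

    #include <stdio.h>

    int main(void)
    {
        /* Figures from the example above. */
        const double jaguar_queue_days = 14.0;
        const double jaguar_run_days   = 24.0 / 24.0;           /* 1 day      */
        const double treo_run_days     = 5.0 * jaguar_run_days; /* 1/5 speed  */

        double jaguar_total = jaguar_queue_days + jaguar_run_days; /* 15 days */
        double saved        = jaguar_total - treo_run_days;        /* 10 days */

        printf("Jaguar: %.0f days, Treo: %.0f days, saved: %.0f days\n",
               jaguar_total, treo_run_days, saved);
        printf("runs possible on Treo in that window: %.0f\n",
               jaguar_total / treo_run_days);                      /* 3 runs  */
        return 0;
    }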

The third and final reason is value. Today, an entry-level Treo™ departmental supercomputer costs only $49,950, configured with 144GB of shared memory, 48 2.8GHz AMD64 processor cores and 6TB of disk storage (university pricing). A comparable shared-memory platform from an OEM would approach $1,000,000, not including maintenance and licensing fees; that is 1/20 of the price at 1/5 the performance. With the introduction of the Treo™ departmental supercomputer, universities and researchers have a new option based on the same market forces that drove the emergence of the MPI cluster: commodity hardware, value and availability. Today, Symmetric Computing offers four unique configurations of Treo™, from 48 to 72 AMD64 cores and from 144GB to 336GB of shared memory (see the table below).


Treo™ P/N    | Quad-core 2.8GHz | Six-core 2.6GHz | 4GB PC5300 DIMMs | 8GB PC5300 DIMMs | Total Shared Memory
SCA161604-3  | 269 Giga-flops   | -               | 192 GB           | -                | 144GB
SCA241604-3  | -                | 374 Giga-flops  | 192 GB           | -                | 120GB
SCA241608-3  | -                | 374 Giga-flops  | -                | 384 GB           | 312GB
SCA161608-3  | 269 Giga-flops   | -               | -                | 384 GB           | 336GB

Looking forward to 1Q10, the Symmetric Computing engineering staff will introduce a 10-node blade center delivering 1.2 Tera-flops of peak throughput with 640GB or 1.28TB of system memory. In addition, we are working with our partners to deliver turn-key platforms tuned for application-specific missions, such as next-generation sequencing, HMMER, BLAST, etc.

Conclusion

Symmetric Computing's overall goal is to make supercomputing accessible and affordable to a broad range of end users. We believe that DSMP is to shared-memory computing what Beowulf/MPI was to distributed-memory computing. We are focused on delivering affordable, commodity-based technical computing solutions that serve an entirely new market with the Departmental Supercomputer. Our initial focus is to provide open applications, optimized to run under DSMP and on Treo™, that accelerate scientific developments in biosciences and bioinformatics. We continue to expand our scope of applications and remain committed to delivering Supercomputing to the Masses.

About Symmetric Computing

Symmetric Computing is a Boston-based software company with offices at the Venture Development Center on the campus of the University of Massachusetts – Boston. We design software to accelerate the use and application of shared-memory computing systems for bioinformatics, oil & gas, post-production editing, financial analysis and related fields. Symmetric Computing is dedicated to delivering standards-based, customer-focused technical computing solutions for users ranging from universities to enterprises. For more information, visit www.symmetriccomputing.com.
