Yasmin Shared memory programming Enno Rehling Universität Paderborn


Page 1: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Yasmin: Shared memory programming

Enno Rehling, Universität Paderborn

Page 2: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Overview

Shared memory programming on SCI

The Yasmin library

Common pitfalls and caveats

Conclusion and Outlook

Page 3: Yasmin Shared memory programming Enno Rehling Universität Paderborn

SCI shared memory

SCI can do more than message passing. The shared memory programming model is more intuitive. It's more efficient, too: SCI offers hardware support for distributed shared memory programming.

Remote memory and local memory have very different properties.

Page 4: Yasmin Shared memory programming Enno Rehling Universität Paderborn

SCI shared memory

Memory from remote nodes is mapped into the PCI address space

All processes can handle local and remote mapped memory in the same way

[Diagram: the virtual address spaces of Process A and Process B are mapped via the PCI address space to physical memory on the remote node]
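Once the mapping is set up, a plain store reaches the other node. A minimal sketch (poke and both pointers are hypothetical; the mapping is assumed to come from sci_create_distr_seg, shown on the "Distributed segments" slide):

    /* 'local' points into this node's memory, 'remote' into another
       node's mapped segment; both take ordinary stores. */
    void poke(int *local, int *remote) {
        *local  = 42;  /* normal store */
        *remote = 42;  /* same syntax, but the store travels over SCI */
    }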

Page 5: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Performance Data*

Bandwidth: writing at 65 MB/sec, reading at 1.7 MB/sec

Latency: 4.7 µsec (MPI, zero-byte ping-pong), 2.7 µsec (Yasmin, 4-byte ping-pong)

* Benchmarks were done on outdated hardware.
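A minimal sketch of how such a 4-byte ping-pong can be measured: each process spins on a flag in its own part of a shared segment and writes the partner's flag remotely. Names and layout are hypothetical, and write posting (sci_flush) is ignored for brevity; this is not the actual benchmark code.

    void pingpong(volatile int *my_flag, volatile int *partner_flag,
                  int rank, int rounds) {
        for (int i = 1; i <= rounds; ++i) {
            if (rank == 0) {
                *partner_flag = i;       /* remote store to the partner's flag */
                while (*my_flag != i) ;  /* spin locally: remote reads are slow */
            } else {
                while (*my_flag != i) ;
                *partner_flag = i;
            }
        }
    }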

Page 6: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Yasmin: Shared memory API

Developed at Paderborn

API layer on top of the SCI driver

Runs on Linux but not Solaris

Reliable, fast and extensively tested

Page 7: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Distributed segments

User creates and exports distributed segments

    #include <stdlib.h>

    /* Create a distributed segment with one block of block_size bytes per
       process; every process in the group sees it at the same base address. */
    void* foo(size_t block_size, sci_group_p group) {
        int procs = sci_get_groupsize(group);
        sci_distr_seg_p segment;
        void *base = NULL;
        size_t *sizes = malloc(procs * sizeof(size_t));

        for (int i = 0; i != procs; ++i)
            sizes[i] = block_size;
        sci_create_distr_seg(group, &base, sizes, &segment);
        free(sizes);
        return base;
    }

[Diagram: the returned base pointer refers to the same distributed segment in every process of the group]

Page 8: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Consistent view

Same view of shared segments for each process

    #include <string.h>

    /* Every process writes its message into the corresponding slot of every
       other process's block; afterwards, each block holds one message per rank. */
    void* bar(sci_group_p group, size_t size) {
        int rank = sci_get_rank(group);
        int procs = sci_get_groupsize(group);
        char *base = foo(size * procs, group);
        char *msg = base + size * (rank * procs + rank);  /* own slot in own block */

        do_some_work(msg, size);  /* application-specific */
        for (int i = 0; i != procs; ++i)
            if (i != rank)
                memcpy(base + size * (i * procs + rank), msg, size);
        sci_barrier(group);
        return base + size * rank * procs;
    }
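A hypothetical usage sketch (use_messages and process_message are not part of the library): after bar() returns, the caller's own block holds one message from every process, indexed by sender rank.

    void use_messages(sci_group_p group, size_t size) {
        char *msgs = bar(group, size);
        int procs = sci_get_groupsize(group);
        for (int i = 0; i != procs; ++i)
            process_message(msgs + i * size, size);  /* hypothetical consumer */
    }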

Page 9: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Allocating memory

Allocation in both local and remote segments

    #include <stdio.h>
    #include <string.h>

    /* Each process allocates space on another node's heap, copies a string
       there, and prints the message its predecessor left for it. */
    void foobar(char *s, sci_group_p group, char **msg[]) {
        int rank = sci_get_rank(group);
        int procs = sci_get_groupsize(group);
        int error = sci_heap_create(4096);  /* error handling omitted on the slide */
        int next = (rank + 1) % procs, prev = (rank + procs - 1) % procs;

        char *dst = (char*)sci_heap_malloc(rank + 1, strlen(s) + 1);
        *msg[next] = strcpy(dst, s);
        sci_barrier(group);
        printf("message from %d: %s\n", prev, *msg[rank]);
        sci_heap_free(*msg[rank]);
    }

The same example can synchronize by polling the shared pointer instead of using a barrier; only the line after the strcpy changes:

        *msg[next] = strcpy(dst, s);
        while (*msg[rank] == NULL) ;  /* spin until the predecessor's message arrives */
        printf("message from %d: %s\n", prev, *msg[rank]);
        sci_heap_free(*msg[rank]);

Page 10: Yasmin Shared memory programming Enno Rehling Universität Paderborn

What else?

Synchronization primitives: barriers, mutexes, reader/writer locks, condition variables (a mutex sketch follows below)

Group operations: create static subgroups; all functions work on subgroups
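A minimal sketch of guarding a counter in a shared segment with a mutex. The names sci_mutex_p, sci_mutex_lock and sci_mutex_unlock are hypothetical, following the library's sci_ prefix; the slides do not show the actual signatures.

    void add_to_counter(int *counter, sci_mutex_p mutex, int n) {
        sci_mutex_lock(mutex);    /* hypothetical: acquire the cluster-wide lock */
        *counter += n;            /* counter lives in a shared segment */
        sci_mutex_unlock(mutex);  /* hypothetical: release it */
    }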

Page 11: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Startup mechanism

Hostfile contains the list of hostnames to run the program on and defines the number of processes per node on SMPs

Processes are created using rsh or ssh

Output is returned to the shell or into a file, for easy debugging
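A hypothetical hostfile sketch; the slide does not show the exact syntax, so the format assumed here is one hostname per line with an optional process count for SMP nodes:

    node01 2
    node02 2
    node03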

Page 12: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Debugging SCI Applications

Page 13: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Profiling

Diploma thesis at Paderborn: dynamic program analysis

Gathers information about the access patterns of SCI programs and helps identify performance bottlenecks

Page 14: Yasmin Shared memory programming Enno Rehling Universität Paderborn

SCI pitfalls and caveats

Read access to remote memory is slow. Use sci_memcpy(), not memcpy(): MMX instructions boost performance by a factor of 2 (a sketch follows below).

Sometimes, access to remote memory can fail: hardware problems lead to errors, and driver functions can lead to errors.
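A minimal sketch of pulling a block from remote memory, assuming sci_memcpy() takes the same (dst, src, n) arguments as memcpy(), which the slide does not spell out:

    #include <stddef.h>

    /* Copy n bytes from a remote mapping into a local buffer. */
    void fetch(void *local_buf, void *remote, size_t n) {
        /* memcpy(local_buf, remote, n) would issue slow word-sized remote reads */
        sci_memcpy(local_buf, remote, n);  /* block transfer using MMX */
    }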

Page 15: Yasmin Shared memory programming Enno Rehling Universität Paderborn

SCI pitfalls and caveats

Awkward consistency model

    #include <stdio.h>

    void awkward(volatile int *base) {
        *base = 7;
        sci_flush();                     /* the 7 is now visible remotely */
        *base = 2;                       /* may still sit in a posted write buffer */
        fprintf(stdout, "%d\n", *base);  /* undefined: may print 2 or 7 */
    }
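A sketch of a defined counterpart, assuming (as the slide's use of sci_flush() suggests) that flushing drains the posted writes before the read:

    void defined(volatile int *base) {
        *base = 7;
        sci_flush();
        *base = 2;
        sci_flush();                     /* drain the write buffer first */
        fprintf(stdout, "%d\n", *base);  /* prints 2 */
    }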

Programming without explicit knowledge of memory layout is never efficient

Page 16: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Conclusions

Yasmin greatly simplifies use of shared segments

Raw performance plus easier development

Transparent shared memory programming on SCI is not yet achieved

Page 17: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Outlook

Integration with CCS

Yasmin goes open source

Inclusion in SuSE cluster CD