Yasmin Shared memory programming Enno Rehling Universität Paderborn


Page 1: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Yasmin: Shared memory programming

Enno Rehling, Universität Paderborn

Page 2: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Overview

Shared memory programming on SCI

The Yasmin library

Common pitfalls and caveats

Conclusion and Outlook

Page 3: Yasmin Shared memory programming Enno Rehling Universität Paderborn

SCI shared memory

SCI can do more than message passing. The shared memory programming model is more intuitive. It's more efficient, too: SCI offers hardware support for distributed shared memory programming.

Remote memory and local memory have very different properties.

Page 4: Yasmin Shared memory programming Enno Rehling Universität Paderborn

SCI shared memory

Memory from remote nodes is mapped into the PCI address space

All processes can handle local and remote mapped memory in the same way

[Diagram: the virtual address spaces of Process A and Process B are mapped via the PCI address space to physical memory on the remote node]
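Once the mapping is set up, a plain store reaches the other node. A minimal sketch (poke and both pointers are hypothetical; the mapping is assumed to come from sci_create_distr_seg, shown on the "Distributed segments" slide):

    /* 'local' points into this node's memory, 'remote' into another
       node's mapped segment; both take ordinary stores. */
    void poke(int *local, int *remote) {
        *local  = 42;  /* normal store */
        *remote = 42;  /* same syntax, but the store travels over SCI */
    }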

Page 5: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Performance Data*

Bandwidth: writing at 65 MB/sec, reading at 1.7 MB/sec

Latency: 4.7 µsec (MPI, zero-byte ping-pong), 2.7 µsec (Yasmin, 4-byte ping-pong)

* Benchmarks were done on outdated hardware.
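A minimal sketch of how such a 4-byte ping-pong can be measured: each process spins on a flag in its own part of a shared segment and writes the partner's flag remotely. Names and layout are hypothetical, and write posting (sci_flush) is ignored for brevity; this is not the actual benchmark code.

    void pingpong(volatile int *my_flag, volatile int *partner_flag,
                  int rank, int rounds) {
        for (int i = 1; i <= rounds; ++i) {
            if (rank == 0) {
                *partner_flag = i;       /* remote store to the partner's flag */
                while (*my_flag != i) ;  /* spin locally: remote reads are slow */
            } else {
                while (*my_flag != i) ;
                *partner_flag = i;
            }
        }
    }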

Page 6: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Yasmin: Shared memory API

Developed at Paderborn

API layer on top of the SCI driver

Runs on Linux but not Solaris

Reliable, fast and extensively tested

Page 7: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Distributed segments

User creates and exports distributed segments

    #include <stdlib.h>

    /* Create a distributed segment with one block of block_size bytes per
       process; every process in the group sees it at the same base address. */
    void* foo(size_t block_size, sci_group_p group) {
        int procs = sci_get_groupsize(group);
        sci_distr_seg_p segment;
        void *base = NULL;
        size_t *sizes = malloc(procs * sizeof(size_t));

        for (int i = 0; i != procs; ++i)
            sizes[i] = block_size;
        sci_create_distr_seg(group, &base, sizes, &segment);
        free(sizes);
        return base;
    }

[Diagram: the returned base pointer refers to the same distributed segment in every process of the group]

Page 8: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Consistent view

Same view of shared segments for each process

    #include <string.h>

    /* Every process writes its message into the corresponding slot of every
       other process's block; afterwards, each block holds one message per rank. */
    void* bar(sci_group_p group, size_t size) {
        int rank = sci_get_rank(group);
        int procs = sci_get_groupsize(group);
        char *base = foo(size * procs, group);
        char *msg = base + size * (rank * procs + rank);  /* own slot in own block */

        do_some_work(msg, size);  /* application-specific */
        for (int i = 0; i != procs; ++i)
            if (i != rank)
                memcpy(base + size * (i * procs + rank), msg, size);
        sci_barrier(group);
        return base + size * rank * procs;
    }
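A hypothetical usage sketch (use_messages and process_message are not part of the library): after bar() returns, the caller's own block holds one message from every process, indexed by sender rank.

    void use_messages(sci_group_p group, size_t size) {
        char *msgs = bar(group, size);
        int procs = sci_get_groupsize(group);
        for (int i = 0; i != procs; ++i)
            process_message(msgs + i * size, size);  /* hypothetical consumer */
    }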

Page 9: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Allocating memory

Allocation in both local and remote segments

    #include <stdio.h>
    #include <string.h>

    /* Each process allocates space on another node's heap, copies a string
       there, and prints the message its predecessor left for it. */
    void foobar(char *s, sci_group_p group, char **msg[]) {
        int rank = sci_get_rank(group);
        int procs = sci_get_groupsize(group);
        int error = sci_heap_create(4096);  /* error handling omitted on the slide */
        int next = (rank + 1) % procs, prev = (rank + procs - 1) % procs;

        char *dst = (char*)sci_heap_malloc(rank + 1, strlen(s) + 1);
        *msg[next] = strcpy(dst, s);
        sci_barrier(group);
        printf("message from %d: %s\n", prev, *msg[rank]);
        sci_heap_free(*msg[rank]);
    }

The same example can synchronize by polling the shared pointer instead of using a barrier; only the line after the strcpy changes:

        *msg[next] = strcpy(dst, s);
        while (*msg[rank] == NULL) ;  /* spin until the predecessor's message arrives */
        printf("message from %d: %s\n", prev, *msg[rank]);
        sci_heap_free(*msg[rank]);

Page 10: Yasmin Shared memory programming Enno Rehling Universität Paderborn

What else?

Synchronization primitives: barriers, mutexes, reader/writer locks, condition variables (a mutex sketch follows below)

Group operations: create static subgroups; all functions work on subgroups
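A minimal sketch of guarding a counter in a shared segment with a mutex. The names sci_mutex_p, sci_mutex_lock and sci_mutex_unlock are hypothetical, following the library's sci_ prefix; the slides do not show the actual signatures.

    void add_to_counter(int *counter, sci_mutex_p mutex, int n) {
        sci_mutex_lock(mutex);    /* hypothetical: acquire the cluster-wide lock */
        *counter += n;            /* counter lives in a shared segment */
        sci_mutex_unlock(mutex);  /* hypothetical: release it */
    }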

Page 11: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Startup mechanism

Hostfile contains the list of hostnames to run the program on and defines the number of processes per node on SMPs

Processes are created using rsh or ssh

Output is returned to the shell or into a file, for easy debugging
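A hypothetical hostfile sketch; the slide does not show the exact syntax, so the format assumed here is one hostname per line with an optional process count for SMP nodes:

    node01 2
    node02 2
    node03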

Page 12: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Debugging SCI Applications

Page 13: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Profiling

Diploma thesis at Paderborn: dynamic program analysis

Gathers information about the access patterns of SCI programs and helps identify performance bottlenecks

Page 14: Yasmin Shared memory programming Enno Rehling Universität Paderborn

SCI pitfalls and caveats

Read access to remote memory is slow. Use sci_memcpy(), not memcpy(): MMX instructions boost performance by a factor of 2 (a sketch follows below).

Sometimes, access to remote memory can fail: hardware problems lead to errors, and driver functions can lead to errors.
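A minimal sketch of pulling a block from remote memory, assuming sci_memcpy() takes the same (dst, src, n) arguments as memcpy(), which the slide does not spell out:

    #include <stddef.h>

    /* Copy n bytes from a remote mapping into a local buffer. */
    void fetch(void *local_buf, void *remote, size_t n) {
        /* memcpy(local_buf, remote, n) would issue slow word-sized remote reads */
        sci_memcpy(local_buf, remote, n);  /* block transfer using MMX */
    }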

Page 15: Yasmin Shared memory programming Enno Rehling Universität Paderborn

SCI pitfalls and caveats

Awkward consistency model

    #include <stdio.h>

    void awkward(volatile int *base) {
        *base = 7;
        sci_flush();                     /* the 7 is now visible remotely */
        *base = 2;                       /* may still sit in a posted write buffer */
        fprintf(stdout, "%d\n", *base);  /* undefined: may print 2 or 7 */
    }
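A sketch of a defined counterpart, assuming (as the slide's use of sci_flush() suggests) that flushing drains the posted writes before the read:

    void defined(volatile int *base) {
        *base = 7;
        sci_flush();
        *base = 2;
        sci_flush();                     /* drain the write buffer first */
        fprintf(stdout, "%d\n", *base);  /* prints 2 */
    }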

Programming without explicit knowledge of memory layout is never efficient

Page 16: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Conclusions

Yasmin greatly simplifies use of shared segments

Raw performance plus easier development

Transparent shared memory programming on SCI is not yet achieved

Page 17: Yasmin Shared memory programming Enno Rehling Universität Paderborn

Outlook

Integration with CCS

Yasmin goes open source

Inclusion in SuSE cluster CD