Yasmin: Shared memory programming
Enno Rehling, Universität Paderborn
Overview
Shared memory programming on SCI
The Yasmin library
Common pitfalls and caveats
Conclusion and Outlook
SCI shared memory
SCI can do more than message passing.
The shared memory programming model is more intuitive.
It’s more efficient, too: SCI offers hardware support for distributed shared memory programming.
Remote memory and local memory have very different properties.
SCI shared memory
Memory from remote nodes is mapped into the PCI address space
All processes can handle local and remote mapped memory in the same way
[Figure: Process A and Process B, each with its physical memory and PCI address space]
Performance Data*
Bandwidth: writing at 65 MB/sec, reading at 1.7 MB/sec
Latency: 4.7 µsec (MPI, zero-byte ping-pong), 2.7 µsec (Yasmin, 4-byte ping-pong)
* Benchmarks were done on outdated hardware.
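To get a feel for what the read/write asymmetry means in practice, the bandwidth figures above can be turned into transfer times. This is only a minimal sketch: the helper names are made up for illustration and are not part of Yasmin, and 1 MB is assumed to mean 1,000,000 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Yasmin): time in seconds to move
 * `bytes` bytes at `mb_per_sec`, assuming 1 MB = 1,000,000 bytes. */
static double transfer_seconds(size_t bytes, double mb_per_sec)
{
    assert(mb_per_sec > 0.0);
    return (double)bytes / (mb_per_sec * 1e6);
}

/* Bandwidth figures quoted on the slide. */
static const double WRITE_MB_PER_SEC = 65.0;
static const double READ_MB_PER_SEC  = 1.7;

/* How many times longer a remote read takes than a remote write
 * of the same size. */
static double read_write_ratio(size_t bytes)
{
    return transfer_seconds(bytes, READ_MB_PER_SEC) /
           transfer_seconds(bytes, WRITE_MB_PER_SEC);
}
```

With these figures the ratio is roughly 38 for any block size, which suggests pushing data to the consumer with remote writes rather than pulling it with remote reads.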
Yasmin: Shared memory API
Developed at Paderborn
API layer on top of the SCI driver
Runs on Linux but not Solaris
Reliable, fast and extensively tested
Distributed segments
User creates and exports distributed segments
/* needs <stdlib.h> for malloc() and free() */
void* foo(size_t block_size, sci_group_p group) {
    int procs = sci_get_groupsize(group);
    sci_distr_seg_p segment;
    void *base = NULL;
    /* one size entry per process in the group */
    size_t *sizes = malloc(procs * sizeof(size_t));
    if (sizes == NULL) return NULL;
    for (int i = 0; i != procs; ++i) sizes[i] = block_size;
    sci_create_distr_seg(group, &base, sizes, &segment);
    free(sizes);
    return base;
}
[Figure: the base pointer as seen by each process]
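The size array passed to sci_create_distr_seg() can be read as a layout description. The sketch below computes where each process's block would start, under the assumption (suggested by the pointer arithmetic on the next slide, but not confirmed by it) that blocks are laid out back to back in rank order. The function name `block_offsets` is hypothetical and not part of Yasmin.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for illustration: the per-process blocks of a distributed
 * segment are laid out back to back, in rank order, starting at base.
 * Fills offsets[i] with the byte offset of rank i's block and returns
 * the total size of the segment. */
static size_t block_offsets(const size_t *sizes, int procs, size_t *offsets)
{
    size_t total = 0;
    assert(sizes != NULL && offsets != NULL && procs >= 0);
    for (int i = 0; i < procs; ++i) {
        offsets[i] = total;   /* block i starts where block i-1 ended */
        total += sizes[i];
    }
    return total;
}
```

With equal block sizes, as in foo(), the offset of rank i's block is simply i times the block size.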
Consistent view
Same view of shared segments for each process
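void* bar(sci_group_p group, size_t size) {
    int rank = sci_get_rank(group);
    int procs = sci_get_groupsize(group);
    /* every process contributes a block of procs slots of size bytes */
    char *base = foo(size*procs, group);
    /* own slot (index rank) inside own block */
    char *msg = base+size*(rank*procs+rank);
    do_some_work(msg, size);
    /* write the message into slot rank of every other block */
    for (int i=0;i!=procs;++i)
        if (i!=rank)
            memcpy(base+size*(i*procs+rank), msg, size);
    sci_barrier(group);
    /* own block now holds one message from every process */
    return base + size*rank*procs;
}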
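The index arithmetic in bar() can be checked without any SCI hardware. The sketch below is a single-process stand-in in which ordinary memory plays the role of the distributed segment; `slot_offset` and `exchange_all` are hypothetical names introduced for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte offset of slot `slot` inside block `block`, when each of the
 * `procs` blocks holds `procs` slots of `size` bytes. */
static size_t slot_offset(size_t size, int procs, int block, int slot)
{
    return size * ((size_t)block * (size_t)procs + (size_t)slot);
}

/* Single-process stand-in for the exchange: for every rank, copy the
 * message stored in its own block (slot `rank`) into slot `rank` of
 * every other block. */
static void exchange_all(char *base, size_t size, int procs)
{
    assert(base != NULL);
    for (int rank = 0; rank < procs; ++rank) {
        const char *msg = base + slot_offset(size, procs, rank, rank);
        for (int i = 0; i < procs; ++i)
            if (i != rank)
                memcpy(base + slot_offset(size, procs, i, rank), msg, size);
    }
}
```

After the exchange, slot s of every block holds the message produced by rank s, so each process finds all messages in its own (local) block and never has to read remote memory.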
Allocating memory
Allocation in both local and remote segments
void foobar(char * s, sci_group_p group, char** msg[]) {
    int rank = sci_get_rank(group);
    int procs = sci_get_groupsize(group);
    int error = sci_heap_create(4096);
    int next = (rank+1)%procs, prev = (rank+procs-1)%procs;
    /* allocate room for the string in a shared heap */
    char *dst = (char*)sci_heap_malloc(rank+1, strlen(s)+1);
    /* publish the pointer in the slot owned by the next process */
    *msg[next] = strcpy(dst, s);
    /* wait until every process has published its message */
    sci_barrier(group);
    printf("message from %d: %s\n", prev, *msg[rank]);
    sci_heap_free(*msg[rank]);
}
The same function, polling for the message instead of waiting at a barrier:
void foobar(char * s, sci_group_p group, char** msg[]) {
    int rank = sci_get_rank(group);
    int procs = sci_get_groupsize(group);
    int error = sci_heap_create(4096);
    int next = (rank+1)%procs, prev = (rank+procs-1)%procs;
    char *dst = (char*)sci_heap_malloc(rank+1, strlen(s)+1);
    *msg[next] = strcpy(dst, s);
    /* spin until the previous process has published its pointer;
       the volatile access forces a fresh read on every iteration */
    while (*(char * volatile *)msg[rank] == NULL)
        ;
    printf("message from %d: %s\n", prev, *msg[rank]);
    sci_heap_free(*msg[rank]);
}
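The ring pattern used by foobar() (write into the slot of the next process, read what the previous process left in your own slot) can be sketched in a single process. Here an ordinary array of pointers stands in for the shared pointer slots; `ring_next`, `ring_prev` and `ring_deliver` are hypothetical names, not Yasmin functions.

```c
#include <assert.h>
#include <stddef.h>

/* Ring neighbours, using the same arithmetic as the slide. */
static int ring_next(int rank, int procs) { return (rank + 1) % procs; }
static int ring_prev(int rank, int procs) { return (rank + procs - 1) % procs; }

/* Single-process stand-in: mailbox[i] plays the role of the pointer
 * slot owned by rank i. Every rank stores a pointer to its message in
 * the mailbox of its successor. */
static void ring_deliver(const char *const *outgoing,
                         const char **mailbox, int procs)
{
    assert(outgoing != NULL && mailbox != NULL && procs > 0);
    for (int rank = 0; rank < procs; ++rank)
        mailbox[ring_next(rank, procs)] = outgoing[rank];
}
```

After delivery, mailbox[r] holds the message of ring_prev(r, procs), which matches the "message from prev" output in the slide code.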
What else?
Synchronization primitives: synchronization barriers, mutexes, reader/writer locks, condition variables
Group operations: create static subgroups; all functions work on subgroups
Startup mechanism
Hostfile: contains the list of hostnames to run the program on; defines the number of processes per node on SMP nodes
Processes are created using rsh or ssh: output is returned to the shell or written to a file; easy debugging
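The slides do not show the hostfile syntax. Purely as an illustration, the sketch below assumes one line per node of the form "hostname [process count]" and parses a single such line; both the format and the function `parse_host_line` are assumptions, not Yasmin's actual format or API.

```c
#include <assert.h>
#include <stdio.h>

#define HOST_MAX 64

/* Hypothetical hostfile line format, assumed for illustration only:
 *     <hostname> [<processes on this node>]
 * Returns 1 on success, 0 for blank lines or a non-positive count.
 * When the count is missing, one process per node is assumed. */
static int parse_host_line(const char *line, char host[HOST_MAX], int *procs)
{
    int n = 1;
    assert(line != NULL && host != NULL && procs != NULL);
    int fields = sscanf(line, "%63s %d", host, &n);
    if (fields < 1 || n < 1)
        return 0;
    *procs = n;
    return 1;
}
```

A launcher built on this would read the file line by line and start the given number of processes on each host via rsh or ssh.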
Debugging SCI Applications
Profiling
Diploma thesis at Paderborn
Dynamic program analysis
Gathers information about access patterns of SCI programs; helps identify performance bottlenecks
SCI pitfalls and caveats
Read access to remote memory is slow.
Use sci_memcpy(), not memcpy(). MMX instructions boost performance by a factor of 2.
Sometimes, access to remote memory can fail: hardware problems can lead to errors, and so can driver functions.
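One defensive pattern for failing remote accesses is a bounded retry loop. The sketch below assumes failures are transient and are reported as a nonzero return code; the slides state neither, so treat both as assumptions. The callback type `transfer_fn`, the wrapper `transfer_with_retry` and the demo callback `flaky_copy` are hypothetical and do not reflect the signature of sci_memcpy() or any driver call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical transfer callback: returns 0 on success, nonzero on a
 * transient failure. Not a Yasmin or SCI driver signature. */
typedef int (*transfer_fn)(void *dst, const void *src, size_t n, void *ctx);

/* Retry a transfer up to max_attempts times.
 * Returns the number of attempts used, or -1 if all of them failed. */
static int transfer_with_retry(transfer_fn fn, void *dst, const void *src,
                               size_t n, void *ctx, int max_attempts)
{
    assert(fn != NULL);
    for (int attempt = 1; attempt <= max_attempts; ++attempt)
        if (fn(dst, src, n, ctx) == 0)
            return attempt;
    return -1;
}

/* Demo callback: fails while *(int *)ctx > 0, then copies with memcpy. */
static int flaky_copy(void *dst, const void *src, size_t n, void *ctx)
{
    int *failures_left = ctx;
    if (*failures_left > 0) {
        --*failures_left;
        return 1;
    }
    memcpy(dst, src, n);
    return 0;
}
```

Capping the number of attempts keeps a persistent hardware fault from turning into an endless loop; the caller still has to handle the -1 case.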
SCI pitfalls and caveats
Awkward consistency model
void awkward(volatile int *base) {
    *base = 7;
    sci_flush();
    *base = 2;                        // not flushed yet
    fprintf(stdout, "%d\n", *base);   // undefined: the write of 2 may still be buffered
}
Programming without explicit knowledge of memory layout is never efficient
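One way to sidestep the undefined read-back shown above is to never read a value you have just written to shared memory, and to keep a local shadow copy instead. The sketch below assumes the calling process is the only writer of the location; the type `shadowed_int` and both functions are hypothetical and not part of Yasmin.

```c
#include <assert.h>
#include <stddef.h>

/* A value that lives in (possibly remote) shared memory, paired with a
 * local shadow copy. Assumes this process is the only writer. */
struct shadowed_int {
    volatile int *remote;  /* location in the shared segment */
    int shadow;            /* last value written by this process */
};

static void shadowed_write(struct shadowed_int *s, int value)
{
    assert(s != NULL && s->remote != NULL);
    s->shadow = value;     /* remember locally ... */
    *s->remote = value;    /* ... and write through to shared memory */
}

/* Reads never touch the shared location, so they do not depend on
 * whether earlier writes have been flushed. */
static int shadowed_read(const struct shadowed_int *s)
{
    assert(s != NULL);
    return s->shadow;
}
```

This also avoids the slow remote read path entirely, at the cost of one extra int per shared value.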
Conclusions
Yasmin greatly simplifies use of shared segments
Raw performance plus easier development
Transparent shared memory programming on SCI is not yet achieved
Outlook
Integration with CCS
Yasmin goes open source
Inclusion in SuSE cluster CD