
Page 1: ERLANGEN REGIONAL COMPUTING CENTER

FEPA: Project status and further steps

J. Eitzinger, T. Röhl, W. Hesse, A. Jeutter, E. Focht
15.12.2015

Page 2: ERLANGEN REGIONAL COMPUTING CENTER

Motivation

§ Cluster administrators employ monitoring to:
  § Detect errors or faulty operation
  § Observe total system utilization
§ Application developers use (mostly GUI) tools to do performance profiling

Primary Target: Provide a monitoring infrastructure that allows for continuous, system-wide application performance and energy profiling based on hardware performance counter measurements.

A flexible framework for energy and performance analysis of highly parallel applications in the computing center (FEPA).

Page 3: ERLANGEN REGIONAL COMPUTING CENTER

Objectives

§ Allow detection of applications with pathological performance behavior
§ Help to identify applications with large optimization potential
§ Give users feedback about application performance
§ Ease access to hardware performance counter data

Page 4: ERLANGEN REGIONAL COMPUTING CENTER

STATUS

Page 5: ERLANGEN REGIONAL COMPUTING CENTER

RRZE (Thomas Röhl)

§ Support for new architectures: Intel Silvermont, Intel Broadwell and Broadwell-EP, Intel Skylake
§ Improved overflow detection (including RAPL)
§ Improved documentation with many new examples (Cilk+, C++11 threads)
§ More performance groups and validated metrics for many architectures
§ Improvements in likwid-bench and likwid-mpirun
§ New access layer to support platform-independent code (x86, Power, ARM)

Page 6: ERLANGEN REGIONAL COMPUTING CENTER

NEC (Andreas Jeutter)

AggMon

§ Componentized
§ Fully distributed
§ Separate processes: truly parallel
§ Implemented in Python
§ Connected through ZeroMQ

[Architecture diagram: nodes are organized into groups; each group has a collector feeding an aggregator and a store (NoSQL DB, with sharding and replication); per-job aggregators are instantiated at job start (triggering aggregation) and killed when the job stops, driven by the resource scheduler through a program tagger; a controller instantiates the per-group components.]

Page 7: ERLANGEN REGIONAL COMPUTING CENTER

AggMon: Collector

[Pipeline diagram: a modified gmond pushes metrics (ZMQ PUSH) into the collector's queue (ZMQ PULL); a tagger matches and publishes them downstream (ZMQ PUSH); an RPC interface handles add tag / remove tag and subscribe / unsubscribe requests. Throughput is on the order of 50k msg/s into the queue and 10k msg/s published.]

Messages: JSON-serialized dicts/maps
Tagger: adds a key-value pair to a message, based on a match condition
Subscribe: based on a match condition (key-value, key-value regex)
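The collector stage described above maps naturally onto a short program. The following is a minimal sketch in Python with pyzmq, following the pipeline on the slide (ZMQ PULL queue in, tagger, ZMQ PUSH out, JSON-serialized dict messages); the function names, socket addresses, and the regex-based match condition are illustrative assumptions, not AggMon's actual code.

```python
# Minimal sketch of a collector stage (hypothetical, not AggMon's actual
# code): a ZMQ PULL socket receives JSON-serialized dict messages, a
# tagger adds a key-value pair when its match condition holds, and a
# ZMQ PUSH socket forwards the tagged message downstream.
import json
import re

import zmq

def make_tagger(match_key, match_regex, tag_key, tag_value):
    """Build a tagger: if msg[match_key] matches the regex (a key-value
    regex match condition), add tag_key: tag_value to the message."""
    pattern = re.compile(match_regex)

    def tagger(msg):
        value = msg.get(match_key)
        if value is not None and pattern.search(str(value)):
            msg[tag_key] = tag_value
        return msg

    return tagger

def run_collector(pull_addr, push_addr, taggers):
    ctx = zmq.Context()
    pull = ctx.socket(zmq.PULL)  # fan-in, e.g. from modified gmond senders
    pull.bind(pull_addr)
    push = ctx.socket(zmq.PUSH)  # fan-out to the next pipeline stage
    push.connect(push_addr)
    while True:
        msg = json.loads(pull.recv())  # messages are JSON dicts/maps
        for tag in taggers:
            msg = tag(msg)
        push.send_string(json.dumps(msg))

if __name__ == "__main__":
    # Illustrative: tag messages from rack1 hosts with their group.
    run_collector("tcp://*:5555", "tcp://localhost:5556",
                  [make_tagger("host", r"^rack1-", "group", "rack1")])
```

PUSH/PULL sockets give broker-less fan-in from many senders and fan-out to downstream workers, which fits the fully distributed, separate-process design listed on the previous slide.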

Page 8: ERLANGEN REGIONAL COMPUTING CENTER

AggMon: Data Store

§ TokuMX: MongoDB compatible
§ Collections can be sharded
  § Documents are spread across different mongod instances
  § Entry point: any mongos instance
§ Replication (for example master-slave) is possible

[Diagram: one group master (mongod) per rack (rack1, rack2, rack3, ...); any mongos instance, backed by a configsvr, serves as entry point; the shard key is the group, e.g. { group: rack1, ... }. Ingest on the order of 10k msg/s.]
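Because TokuMX is MongoDB compatible, the sharding scheme above can be expressed with standard MongoDB admin commands. Below is a minimal sketch with pymongo; the database, collection, and host names are illustrative assumptions, not taken from AggMon.

```python
# Hypothetical sketch: shard a metrics collection by group through a
# mongos router, using standard MongoDB admin commands via pymongo
# (TokuMX speaks the MongoDB protocol, so the same commands apply).
# Database, collection, and host names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # any mongos instance

# Enable sharding for the database, then shard the collection on the
# group field so each group's documents land on that group's shard.
client.admin.command("enableSharding", "metrics")
client.admin.command("shardCollection", "metrics.samples",
                     key={"group": 1})  # shard key: { group: "rack1", ... }

# Every stored document carries its group, steering it to the matching
# group master (e.g. the rack1 shard).
client.metrics.samples.insert_one({
    "group": "rack1",
    "host": "rack1-node07",
    "metric": "flops",
    "value": 1.2e9,
})
```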

Page 9: ERLANGEN REGIONAL COMPUTING CENTER

LRZ (Wolfram Hesse, Carla Guillen)

Successful completion of C. Guillen's PhD:
§ Validation of the performance patterns used
§ Statistical evaluation of the performance patterns
§ Documentation of the PerSyst monitoring system

Knowledge-based Performance Monitoring for Large Scale HPC Architectures; dissertation, C. Guillen Carias, 2015; http://mediatum.ub.tum.de?id=1237547

Page 10: ERLANGEN REGIONAL COMPUTING CENTER

LRZ: PerSyst Status

§ PerSyst monitoring is in production on SuperMUC Phase I + II
§ Definition and implementation of the performance patterns for Phase 1 (Westmere-EX, SandyBridge-EP) and Phase 2 (Haswell-EP)
§ Used and verified by:
  § The LRZ application support group and IBM staff
    › Notifying users when obvious bottlenecks are present, together with suggestions for optimizations
    › Screening applications for extreme scaling and benchmarks
  § SuperMUC users
    › Positive feedback regarding usefulness
§ Implementation of the PerSyst web frontend at RRZE

Page 11: ERLANGEN REGIONAL COMPUTING CENTER

ONGOING WORK

Integrate the complete stack at RRZE
Validate performance patterns from profiling data

Page 12: ERLANGEN REGIONAL COMPUTING CENTER

Current Questions

§ How to deal with the established monitoring infrastructure (Ganglia)?
  § Easy: use existing monitoring infrastructures
  § Target: replace existing software with the FEPA stack
§ Concerns about the large overhead of continuous HPM profiling
  § Overhead could be lower with a better interface to HPM (ISA, OS)
  § Missing knowledge about overheads in general
§ Picking the right building blocks (see the collector sketch after this list):
  § Backend daemon: diamond (https://github.com/python-diamond/Diamond)
  § Communication protocol: ZeroMQ (http://zeromq.org)
  § Storage: TokuMX (NoSQL)
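For the backend daemon, Diamond collectors are plain Python classes, which keeps the barrier to a FEPA-specific backend low. The sketch below shows the shape of a custom collector that could publish HPM-derived metrics; subclassing Collector and calling publish() is Diamond's documented extension point, while the metric name and the counter-reading helper are hypothetical placeholders.

```python
# Hypothetical sketch of a custom Diamond collector for HPM-derived
# metrics. Subclassing diamond.collector.Collector and implementing
# collect() is Diamond's documented extension point; the metric name
# and the counter-reading helper below are illustrative placeholders.
import diamond.collector

class HPMCollector(diamond.collector.Collector):

    def collect(self):
        # Publish one HPM-derived value per collection interval; Diamond
        # routes it to the configured handler (e.g. a ZeroMQ publisher).
        self.publish('hpm.flops', self.read_flops())

    def read_flops(self):
        # Placeholder: a real collector would query the HPM access layer
        # (e.g. parse likwid-perfctr output) here.
        return 0.0
```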

Page 13: ERLANGEN REGIONAL COMPUTING CENTER

Integration of FEPA components

§ Target system: 80-node Nehalem cluster in normal production use

Objectives:
§ Sort out issues between components
§ Validate and benchmark the solution:
  § diamond
  § mongoDB/TokuMX
  § Liferay-framework-based PerSyst frontend
§ Experiment on application profiling data:
  § Required granularity for phase detection
  § Performance pattern validation on a set of known codes

Page 14: ERLANGEN REGIONAL COMPUTING CENTER

Conclusion and Outlook

§ Layers are ready to be integrated into the complete stack
§ Convergence on the choice of external building blocks
§ LRZ PerSyst system in production use

Next:
§ Continue integrating the stack to make FEPA ready to be distributed at the associated HPC centers
§ Validate FEPA on a set of known benchmarks (Mantevo, NPB, SPEC)

Page 15: ERLANGEN REGIONAL COMPUTING CENTER

Thank You.

Leibniz-Rechenzentrum
NEC Deutschland GmbH
Regionales Rechenzentrum Erlangen