TRANSCRIPT
NUMA, Resource Management, Monitoring on IBM Power: Experiences at HPI (Core2Cloud on Power)
Prof. Dr. Andreas Polze
Linux on Power Anwendertag, IBM, Böblingen
05.10.2016
Agenda
1. Resource Management from Core to Cloud – a motivation (SSICLOPS)
2. How to approach IBM Power Systems – the IBM Power Block Course at HPI
3. NUMA observations – how big a problem we are dealing with
4. Approaches from a Software Engineering standpoint
■ C++ library for memory allocation
■ Monitoring tools (black box / white box)
■ Experiences with porting applications
5. Summary – a call to action
Motivation: SSICLOPS European Project (Scalable and Secure Infrastructures for Cloud Operations)
■ Management of federated private cloud infrastructures
■ Network communication improvements (latency and bandwidth)
■ Workload scheduling across datacenters
■ Security- and privacy-aware storage and processing
SSICLOPS: Research Areas
OSM Group
Core2Cloud on Power
Chart 4
SSICLOPS: Partners
■ HPI: In-memory databases
■ HIP: High-energy physics analysis
■ Telekom: Content delivery and caching
■ Orange: Network function virtualization
SSICLOPS: Use Cases
SSICLOPS: Big picture
SSICLOPS: Experience so far
SSICLOPS: Next steps
SSICLOPS: Vision
Core to Cloud
Core to Cloud: Overview
inter cloud
intra cloud
intra system
inter node
intra node
core
Experiments and related work
■ Understand resource management on every level
■ Understand main technologies on every level
■ Collect tools for every level
Research
■ Resource contention
□ Avoid the worst case!
■ Monitoring on higher levels should include lower levels to make good control decisions
Core to Cloud: The Idea
How to approach IBM Power Systems
■ LPARs, virtualization always on
■ Endianness
(but Linux is Linux – really?)
■ Missing libraries
■ Resource management
■ Example: data.c
int i = 0xffeeddcc;
Power is different...
mac (35)-% od -tx1 data.o
0000000 cf fa ed fe 07 00 00 01 03 00 00 00 01 00 00 00
0000020 04 00 00 00 60 01 00 00 00 20 00 00 00 00 00 00
0000040 19 00 00 00 e8 00 00 00 00 00 00 00 00 00 00 00
0000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000100 04 00 00 00 00 00 00 00 80 01 00 00 00 00 00 00
0000120 04 00 00 00 00 00 00 00 07 00 00 00 07 00 00 00
0000140 02 00 00 00 00 00 00 00 5f 5f 74 65 78 74 00 00
0000160 00 00 00 00 00 00 00 00 5f 5f 54 45 58 54 00 00
0000200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000220 00 00 00 00 00 00 00 00 80 01 00 00 00 00 00 00
0000240 00 00 00 00 00 00 00 00 00 00 00 80 00 00 00 00
0000260 00 00 00 00 00 00 00 00 5f 5f 64 61 74 61 00 00
0000300 00 00 00 00 00 00 00 00 5f 5f 44 41 54 41 00 00
0000320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000340 04 00 00 00 00 00 00 00 80 01 00 00 02 00 00 00
0000360 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000400 00 00 00 00 00 00 00 00 24 00 00 00 10 00 00 00
0000420 00 0a 0a 00 00 00 00 00 02 00 00 00 18 00 00 00
0000440 84 01 00 00 01 00 00 00 94 01 00 00 04 00 00 00
0000460 0b 00 00 00 50 00 00 00 00 00 00 00 00 00 00 00
0000500 00 00 00 00 01 00 00 00 01 00 00 00 00 00 00 00
0000520 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000600 cc dd ee ff 01 00 00 00 0f 02 00 00 00 00 00 00
0000620 00 00 00 00 00 5f 69 00
0000630
aix01 $ od -tx1 data.o
0000000 01 df 00 02 57 d9 9e b1 00 00 00 76 00 00 00 0a
0000020 00 00 00 00 2e 74 65 78 74 00 00 00 00 00 00 00
0000040 00 00 00 00 00 00 00 00 00 00 00 64 00 00 00 00
0000060 00 00 00 00 00 00 00 00 00 00 00 20 2e 64 61 74
0000100 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 08
0000120 00 00 00 64 00 00 00 6c 00 00 00 00 00 01 00 00
0000140 00 00 00 40 ff ee dd cc 00 00 00 00 00 00 00 04
0000160 00 00 00 02 1f 00 2e 66 69 6c 65 00 00 00 00 00
0000200 00 00 ff fe 00 03 67 01 64 61 74 61 2e 63 00 00
0000220 00 00 00 00 00 00 00 00 00 00 2e 74 65 78 74 00
0000240 00 00 00 00 00 00 00 01 00 00 6b 01 00 00 00 00
0000260 00 00 00 00 00 00 11 00 00 00 00 00 00 00 2e 64
0000300 61 74 61 00 00 00 00 00 00 00 00 02 00 00 6b 01
0000320 00 00 00 08 00 00 00 00 00 00 21 05 00 00 00 00
0000340 00 00 69 00 00 00 00 00 00 00 00 00 00 00 00 02
0000360 00 00 02 01 00 00 00 04 00 00 00 00 00 00 02 05
0000400 00 00 00 00 00 00 54 4f 43 00 00 00 00 00 00 00
0000420 00 08 00 02 00 00 6b 01 00 00 00 00 00 00 00 00
0000440 00 00 11 0f 00 00 00 00 00 00
0000452
■ Big and little endian on POWER8
□ AIX: big endian
□ Linux: both versions
(mac dump above: little endian, 0xffeeddcc stored as cc dd ee ff; AIX dump: big endian, stored as ff ee dd cc)
Assigning Resources to a Linux LPAR
Dynamically changing available CPU resources
IBM Power Systems Block Course @ HPI
http://www.dcl.hpi.uni-potsdam.de/events/POWER2016/
NUMA Observations (how big a problem we are dealing with)
Core to Cloud: NUMA systems
Core to Cloud: NUMA systems

[Diagram: four NUMA nodes, each with CPUs attached to local memory via a memory controller; the nodes are linked by an interconnect]
Core to Cloud: Resource contention
Core to Cloud: Monitoring and Control

Tools across the levels (inter cloud, intra cloud, intra system, inter node, intra node, core):
■ numatop, libnuma
■ MemAxes
■ Intel VTune, perf
■ Intel PCM
■ hwloc, Linux sysfs
Core to Cloud: Rackscale NUMA systems
Core to Cloud: Rackscale NUMA
Core to Cloud: Topology
Experiment: Memory access measurements

[Charts: Bandwidth, Latency, and SLIT values for Local, Remote 1, Remote 2, Remote 3, and Other Blade memory accesses]
Experiment: SIFT on rackscale NUMA

Pipeline: Input image → Blur 1 → Blur 2 → DoG
1. Create octaves of differently scaled copies
2. Apply different Gaussian blurs
3. Compute DoG within each octave
4. Filter extrema
5. Detect gradients, normalize orientation
Output: feature descriptors (gradient histograms) + orientation + blur factor + interpolated x,y coordinates
Experiment: SIFT on rackscale NUMA

[Chart: execution time over 15–240 cores for the Naive, NUMA-aware, and MPI implementations]
• Resource management starts with LPAR configuration
• Need to understand HMC (Hardware Management Console)
• Dedicated vs. shared CPUs
• LPAR may get assigned "remote" memory only
• Elaborate automatic resource management with AIX
• Resource management in Linux less mature
• But: all the regular Linux tools at your disposal
Intel vs. Power?
Approaches from a Software Engineering standpoint
Tools (1)
Assessing hardware performance counters
(Christoph Sterz, M.Sc. thesis)
Available at:
https://github.com/chsterz/performance-tools.git
EvSel (Event Selector)
□ Presents and explains all PMU events to the user
□ Reveals many hidden events (AVX, backend stalls, MMU events)
□ Compares selectable PMU events for different program configurations
□ Correlates PMU event change to configuration change (parameter sweep)
MemHist
□ Leverages Precise Event-Based Sampling for memory load latency measurements
□ Allows for application characterization
□ Event count or cycle cost selectable
□ Can sample in a selectable range and granularity
Firefox Browser Benchmark
Phasenprüfer (phase tester)
□ Tries to distinguish program ramp-up phases from actual load (different behavior)
□ Uses linear and quadratic fitting (least squares) to separate program phases
□ Can separate PMU measurements in both phases
Google Chrome Startup
Programming libraries (2)
A C++ programming model for NUMA-aware applications – libnuma
(Wieland Hagen, M.Sc. thesis)
Problem statement and project goals
• C++ has no concept of locality
• Flat address space: new calls malloc(); malloc() in libc calls mmap()
Idea:
• Provide low-level mechanisms for data placement and migration
• NUMA-aware asynchronous task spawning
• Implement NUMA-aware data structures using the low-level mechanisms
• Encapsulate data distribution, parallelism, locality
• Configurable data distribution (see HPF)
Foo *foo = new Foo();

vector<string> vec;
vec.push_back("hello");

#include <external_c_library.h>
ancient_c_style_struct *s = create_ancient_type();
Logical Memory Regions
• Occupy a set of contiguous pages
• Reside on a specified node
• May be migrated to a different node
• Memory allocation algorithm operates within these pages

numa::Node targetNode;
MemSource *ms = new MemSource(1 << 20, targetNode);
ms->migrate(newNode);
Control Memory Allocations

Foo *allocWithinMemSource(MemSource *ms) {
    MemGuard guard(ms);
    return new Foo();
}

void addElement(vector<string> &vec, MemSource *ms) {
    MemGuard guard(ms);
    vec.push_back("new element");
}
Example: NUMA-aware hash table
• h = hash(key)
• use the last N bits of h to identify the bucket
• list of <key,value> pairs inside each bucket
• split into 2^M bins, each bin on a dedicated node with a dedicated MemSource
hash: 010100101001010101110101010
numa::HashTable<Key,Value,6> hash;
hash[key] = value;
numa::Node node = hash.nodeOf(key);
hash.constructAt(key, args...);
hash.insertAt(key, [] () {
return Value();
});
hash.for_each( [] (Key key, Value value) {
// do something
});
Evaluation – Word counting
Porting Guides (3)
The Power Software Ecosystem – and what is missing if you come from the Intel world
(Sven Köhler, M.Sc. thesis)
Observation: Intel (and AMD) support a huge ecosystem of
• software
• development tools
• performance measurement and profiling tools

This work: Examine existing FOSS (Free and Open Source Software) and GCC's toolchain, evaluate whether they make use of the new features of POWER8 (ISA v2.07B), and how these features can be incorporated into the existing code bases.

The objective: Formulate a guide assisting with the port of existing software to POWER8, especially highlighting pitfalls that are not apparent from the study of IBM's Redbooks alone. Apply the guide to several smaller examples and show the performance impact; also apply it to the reference implementation of the Opus audio codec.
Problem statement
Findings using the GCC toolchain
1. Generated assembly is generally less optimized than on other platforms
2. Intrinsics for advanced POWER instructions are sometimes problematic
• limited integration with the AST and thus with reordering/optimization/error messaging
• e.g. some POWER instructions take immediate values → handed over to the programmer
• Example (SHA256 rotation intrinsic, GCC 5.4):
__builtin_crypto_vshasigmaw(input, 0, mask == s0 ? 0 : 0xf);
> sha256.cpp:22:67: error: argument 3 must be in the range 0..15
> sha256.cpp:22: confused by earlier errors, bailing out
• Could be resolved by generating two instructions
• Example 2:
__builtin_crypto_vshasigmaw(input, 0, 0xff);
> sha256.cpp:22:1: internal compiler error: in extract_insn, at recog.c:2343
> sha256.cpp:22:1: internal compiler error: Abort trap: 6
3. AltiVec support is rather limited compared to XLC/Clang (lacks multiplication/division of ints, endianness-independent permutation, …)
4. Although claimed in the Redbook, the software transactional memory implementation does not fully exploit hardware support
Working on the Opus codec: in addition to throughput, optimization improves codec quality by reducing normalization steps in the FPU units using dedicated POWER8 instructions.

[Charts: rel. error in reference implementation vs. rel. error in optimized implementation]

Opus test vector quality metric (part of the codec package):

$ ./tests/run_vectors.sh base ~/opus_testvectors 48000
All tests have passed successfully
Average mono quality is 97.4333 %
Average stereo quality is 99.75 %

$ ./tests/run_vectors.sh p8 ~/opus_testvectors 48000
All tests have passed successfully
Average mono quality is 97.8833 %
Average stereo quality is 99.775 %
(standardized by the Internet Engineering Task Force (IETF) as RFC 6716)
■ Power is mostly unknown territory to today's students
→ need to invest in education
→ IBM Power Systems Blockcourse @ HPI
■ Resource Management is key
→ need to understand LPARs, HMC, …
→ Linux perf/libPAPI/SLIT tools reflect some settings…
■ Impact of dynamic LPARs on application performance needs study
■ Need to understand the platform:
→ Endianness
→ AltiVec intrinsics
■ Gaining hands-on expertise: HPI Future SOC Lab
Summary