TRANSCRIPT
NUMA, Resource Management, Monitoring on IBM Power: Experiences at HPI (Core2Cloud on Power)
Prof. Dr. Andreas Polze
Linux on Power Anwendertag, IBM, Böblingen
05.10.2016
Agenda
1. Resource Management from Core to Cloud – a motivation (SSICLOPS)
2. How to approach IBM Power Systems – the IBM Power Block Course at HPI
3. NUMA observations – how big a problem we are dealing with
4. Approaches from a Software Engineering standpoint
■ C++ library for memory allocation
■ Monitoring tools (black box / white box)
■ Experiences with porting applications
5. Summary – a call to action
Motivation: SSICLOPS European Project (Scalable and Secure Infrastructures for Cloud Operations)
■ Management of federated private cloud infrastructures
■ Network communication improvements (latency and bandwidth)
■ Workload scheduling across datacenters
■ Security- and privacy-aware storage and processing
SSICLOPS: Research Areas
OSM Group
Core2Cloud on Power
Chart 4
SSICLOPS: Partners
■ HPI: In-memory databases
■ HIP: High-energy physics analysis
■ Telekom: Content delivery and caching
■ Orange: Network function virtualization
SSICLOPS: Use Cases
SSICLOPS: Big picture
SSICLOPS: Experience so far
SSICLOPS: Next steps
SSICLOPS: Vision
Core to Cloud
Core to Cloud: Overview
inter cloud
intra cloud
intra system
inter node
intra node
core
Experiments and related work
■ Understand resource management on every level
■ Understand main technologies on every level
■ Collect tools for every level
Research
■ Resource contention
□ Avoid the worst case!
■ Monitoring on higher levels should include lower levels to make good control decisions
Core to Cloud: The Idea
How to approach IBM Power Systems
■ LPARs, virtualization always on
■ Endianness
(but Linux is Linux – really?)
■ Missing libraries
■ Resource management
■ Example: data.c
int i = 0xffeeddcc;
Power is different...
mac (35)-% od -tx1 data.o
0000000 cf fa ed fe 07 00 00 01 03 00 00 00 01 00 00 00
0000020 04 00 00 00 60 01 00 00 00 20 00 00 00 00 00 00
0000040 19 00 00 00 e8 00 00 00 00 00 00 00 00 00 00 00
0000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000100 04 00 00 00 00 00 00 00 80 01 00 00 00 00 00 00
0000120 04 00 00 00 00 00 00 00 07 00 00 00 07 00 00 00
0000140 02 00 00 00 00 00 00 00 5f 5f 74 65 78 74 00 00
0000160 00 00 00 00 00 00 00 00 5f 5f 54 45 58 54 00 00
0000200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000220 00 00 00 00 00 00 00 00 80 01 00 00 00 00 00 00
0000240 00 00 00 00 00 00 00 00 00 00 00 80 00 00 00 00
0000260 00 00 00 00 00 00 00 00 5f 5f 64 61 74 61 00 00
0000300 00 00 00 00 00 00 00 00 5f 5f 44 41 54 41 00 00
0000320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000340 04 00 00 00 00 00 00 00 80 01 00 00 02 00 00 00
0000360 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000400 00 00 00 00 00 00 00 00 24 00 00 00 10 00 00 00
0000420 00 0a 0a 00 00 00 00 00 02 00 00 00 18 00 00 00
0000440 84 01 00 00 01 00 00 00 94 01 00 00 04 00 00 00
0000460 0b 00 00 00 50 00 00 00 00 00 00 00 00 00 00 00
0000500 00 00 00 00 01 00 00 00 01 00 00 00 00 00 00 00
0000520 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000600 cc dd ee ff 01 00 00 00 0f 02 00 00 00 00 00 00
0000620 00 00 00 00 00 5f 69 00
0000630
aix01 $ od -tx1 data.o
0000000 01 df 00 02 57 d9 9e b1 00 00 00 76 00 00 00 0a
0000020 00 00 00 00 2e 74 65 78 74 00 00 00 00 00 00 00
0000040 00 00 00 00 00 00 00 00 00 00 00 64 00 00 00 00
0000060 00 00 00 00 00 00 00 00 00 00 00 20 2e 64 61 74
0000100 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 08
0000120 00 00 00 64 00 00 00 6c 00 00 00 00 00 01 00 00
0000140 00 00 00 40 ff ee dd cc 00 00 00 00 00 00 00 04
0000160 00 00 00 02 1f 00 2e 66 69 6c 65 00 00 00 00 00
0000200 00 00 ff fe 00 03 67 01 64 61 74 61 2e 63 00 00
0000220 00 00 00 00 00 00 00 00 00 00 2e 74 65 78 74 00
0000240 00 00 00 00 00 00 00 01 00 00 6b 01 00 00 00 00
0000260 00 00 00 00 00 00 11 00 00 00 00 00 00 00 2e 64
0000300 61 74 61 00 00 00 00 00 00 00 00 02 00 00 6b 01
0000320 00 00 00 08 00 00 00 00 00 00 21 05 00 00 00 00
0000340 00 00 69 00 00 00 00 00 00 00 00 00 00 00 00 02
0000360 00 00 02 01 00 00 00 04 00 00 00 00 00 00 02 05
0000400 00 00 00 00 00 00 54 4f 43 00 00 00 00 00 00 00
0000420 00 08 00 02 00 00 6b 01 00 00 00 00 00 00 00 00
0000440 00 00 11 0f 00 00 00 00 00 00
0000452
■ Big and little endian on POWER8
□ AIX: big endian
□ Linux: both versions
(mac dump above: little endian, 0xffeeddcc stored as cc dd ee ff; AIX dump: big endian, stored as ff ee dd cc)
Assigning Resources to a Linux LPAR
Dynamically changing available CPU resources
IBM Power Systems Block Course @ HPI
http://www.dcl.hpi.uni-potsdam.de/events/POWER2016/
NUMA Observations (how big a problem we are dealing with)
Core to Cloud: NUMA systems
Core to Cloud: NUMA systems

[Diagram: four NUMA nodes, each with CPUs attached to local memory via a memory controller; the nodes are linked by an interconnect]
Core to Cloud: Resource contention
Core to Cloud: Monitoring and Control

Tools across the levels (inter cloud, intra cloud, intra system, inter node, intra node, core):
■ numatop, libnuma
■ MemAxes
■ Intel VTune, perf
■ Intel PCM
■ hwloc, Linux sysfs
Core to Cloud: Rackscale NUMA systems
Core to Cloud: Rackscale NUMA
Core to Cloud: Topology
Experiment: Memory access measurements

[Charts: Bandwidth, Latency, and SLIT values for Local, Remote 1, Remote 2, Remote 3, and Other Blade memory accesses]
Experiment: SIFT on rackscale NUMA

Pipeline: Input image → Blur 1 → Blur 2 → DoG
1. Create octaves of differently scaled copies
2. Apply different Gaussian blurs
3. Compute DoG within each octave
4. Filter extrema
5. Detect gradients, normalize orientation
Output: feature descriptors (gradient histograms) + orientation + blur factor + interpolated x,y coordinates
Experiment: SIFT on rackscale NUMA

[Chart: execution time over 15–240 cores for the Naive, NUMA-aware, and MPI implementations]
• Resource management starts with LPAR configuration
• Need to understand HMC (Hardware Management Console)
• Dedicated vs. shared CPUs
• LPAR may get assigned "remote" memory only
• Elaborate automatic resource management with AIX
• Resource management in Linux less mature
• But: all the regular Linux tools at your disposal
Intel vs. Power?
Approaches from a Software Engineering standpoint
Tools (1)
Assessing hardware performance counters
(Christoph Sterz, M.Sc. thesis)
Available at:
https://github.com/chsterz/performance-tools.git
EvSel (Event Selector)
□ Presents and explains all PMU events to the user
□ Reveals many hidden events (AVX, backend stalls, MMU events)
□ Compares selectable PMU events for different program configurations
□ Correlates PMU event change to configuration change (parameter sweep)
MemHist
□ Leverages Precise Event-Based Sampling for memory load latency measurements
□ Allows for application characterization
□ Event count or cycle cost selectable
□ Can sample in a selectable range and granularity
Firefox Browser Benchmark
Phasenprüfer (phase tester)
□ Tries to distinguish program ramp-up phases from actual load (different behavior)
□ Uses linear and quadratic fitting (least squares) to separate program phases
□ Can separate PMU measurements in both phases
Google Chrome Startup
Programming libraries (2)
A C++ programming model for NUMA-aware applications – libnuma
(Wieland Hagen, M.Sc. thesis)
Problem statement and project goals
• C++ has no concept of locality
• Flat address space: new calls malloc(); malloc() in libc calls mmap()
Idea:
• Provide low-level mechanisms for data placement and migration
• NUMA-aware asynchronous task spawning
• Implement NUMA-aware data structures using the low-level mechanisms
• Encapsulate data distribution, parallelism, locality
• Configurable data distribution (see HPF)
Foo *foo = new Foo();

vector<string> vec;
vec.push_back("hello");

#include <external_c_library.h>
ancient_c_style_struct *s = create_ancient_type();
Logical Memory Regions
• Occupy a set of contiguous pages
• Reside on a specified node
• May be migrated to a different node
• Memory allocation algorithm operates within these pages

numa::Node targetNode;
MemSource *ms = new MemSource(1 << 20, targetNode);
ms->migrate(newNode);
Control Memory Allocations

Foo *allocWithinMemSource(MemSource *ms) {
    MemGuard guard(ms);
    return new Foo();
}

void addElement(vector<string> &vec, MemSource *ms) {
    MemGuard guard(ms);
    vec.push_back("new element");
}
Example: NUMA-aware hash table
• h = hash(key)
• use the last N bits of h to identify the bucket
• list of <key,value> pairs inside each bucket
• split into 2^M bins, each bin on a dedicated node with a dedicated MemSource
hash: 010100101001010101110101010
numa::HashTable<Key,Value,6> hash;
hash[key] = value;
numa::Node node = hash.nodeOf(key);
hash.constructAt(key, args...);
hash.insertAt(key, [] () {
return Value();
});
hash.for_each( [] (Key key, Value value) {
// do something
});
Evaluation – Word counting
Porting Guides (3)
The Power Software Ecosystem – and what is missing if you come from the Intel world
(Sven Köhler, M.Sc. thesis)
Observation: Intel (and AMD) support a huge ecosystem of
• software
• development tools
• performance measurement and profiling tools

This work: Examine existing FOSS (Free and Open Source Software) and GCC's toolchain, evaluate whether they make use of the new features of POWER8 (ISA v2.07B), and how these features can be incorporated into the existing code bases.

The objective: Formulate a guide assisting with the port of existing software to POWER8, especially highlighting pitfalls that are not apparent from the study of IBM's Redbooks alone. Apply the guide to several smaller examples and show the performance impact; also apply it to the reference implementation of the Opus audio codec.
Problem statement
Findings using the GCC toolchain
1. Generated assembly is generally less optimized than on other platforms
2. Intrinsics for advanced POWER instructions are sometimes problematic
• limited integration with the AST and thus with reordering/optimization/error messaging
• e.g. some POWER instructions take immediate values → handed over to the programmer
• Example (SHA256 rotation intrinsic, GCC 5.4):
__builtin_crypto_vshasigmaw(input, 0, mask == s0 ? 0 : 0xf);
> sha256.cpp:22:67: error: argument 3 must be in the range 0..15
> sha256.cpp:22: confused by earlier errors, bailing out
• Could be resolved by generating two instructions
• Example 2:
__builtin_crypto_vshasigmaw(input, 0, 0xff);
> sha256.cpp:22:1: internal compiler error: in extract_insn, at recog.c:2343
> sha256.cpp:22:1: internal compiler error: Abort trap: 6
3. AltiVec support is rather limited compared to XLC/Clang (lacks multiplication/division of ints, endianness-independent permutation, …)
4. Although claimed in the Redbook, the software transactional memory implementation does not fully exploit hardware support
Working on the Opus codec: in addition to throughput, optimization improves codec quality by reducing normalization steps in the FPU units using dedicated POWER8 instructions.

[Charts: rel. error in reference implementation vs. rel. error in optimized implementation]

Opus test vector quality metric (part of the codec package):

$ ./tests/run_vectors.sh base ~/opus_testvectors 48000
All tests have passed successfully
Average mono quality is 97.4333 %
Average stereo quality is 99.75 %

$ ./tests/run_vectors.sh p8 ~/opus_testvectors 48000
All tests have passed successfully
Average mono quality is 97.8833 %
Average stereo quality is 99.775 %
(standardized by the Internet Engineering Task Force (IETF) as RFC 6716)
■ Power is mostly unknown territory to today's students
→ need to invest in education
→ IBM Power Systems Blockcourse @ HPI
■ Resource Management is key
→ need to understand LPARs, HMC, …
→ Linux perf/libPAPI/SLIT tools reflect some settings…
■ Impact of dynamic LPARs on application performance needs study
■ Need to understand the platform:
→ Endianness
→ AltiVec intrinsics
■ Gaining hands-on expertise: HPI Future SOC Lab
Summary