Hardware-aware thread scheduling: the case of asymmetric multicore processors


Upload: achille-peternier

Post on 25-May-2015


DESCRIPTION

Talk given at ICPADS 2012 in Singapore.

TRANSCRIPT

Page 1: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Hardware-aware thread scheduling: the case of asymmetric multicore processors

Achille Peternier*, Danilo Ansaloni, Daniele Bonetta, Cesare Pautasso and Walter Binder

* [email protected]
http://sosoa.inf.unisi.ch

Page 2

CONTEXT AND OVERALL IDEA

Introduction

Page 3

Context

• Modern CPUs increase their computational power by adding cores

• Hardware (HW) architectures are becoming increasingly complex
  – Shared caches
  – Non-Uniform Memory Access (NUMA)
  – Single Instruction Multiple Data (SIMD) registers
  – Simultaneous MultiThreading (SMT) units

Page 4

Context

• The Operating System (OS) kernel and scheduler try to automatically optimize application performance according to the available resources
  – Based on the underlying HW
  – Using a limited set of performance indicators (CPU time, memory usage, etc.)

Page 5

“Today it is impossible to estimate performance: you have to measure it. Programming has become an empirical science.”

Performance Anxiety: Performance analysis in the new millennium
Joshua Bloch, Google Inc.

Page 6

Contributions

1) Automated workload analysis technique relying on a specific set of performance metrics that are currently not used by common OS schedulers

2) Hardware-aware optimized scheduler that makes decisions based on hardware resource usage and the output of the workload analysis

- to improve processing-unit occupancy on SMT/asymmetric processors

Page 7

The big picture

[Figure; labels: Monitoring daemon, OS threads and processes, Workload characterization, FPU, INT]

Page 8

The big picture

[Figure; labels: Workload characterization, Hardware-aware scheduler, FPU, INT]

Page 9

AMD BULLDOZER PROCESSOR

Target architecture

Page 10

AMD Bulldozer

• AMD Bulldozer architecture
  – Each CPU is implemented as a series of modules (a.k.a. “cores”), each with two cores (a.k.a. “processing units” or “SMT units”)
  – Each SMT unit really has its own Arithmetic-Logic Units (ALUs)
  – A module is more similar to:
    • A dual core when doing integer ops
    • A single core with SMT=2 when doing floating point ops
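The module layout described on this slide can be sketched in a few lines of Python. This is only an illustration: it assumes the two cores of a module carry adjacent IDs (0 and 1 in module 0, 2 and 3 in module 1, and so on), which is a numbering convention chosen here, not something the slides state. The helper names are hypothetical; a real tool would query the topology, for example through hwloc.

```python
from collections import defaultdict

# Per the slide: one module holds two integer cores that share floating point hardware.
CORES_PER_MODULE = 2


def module_of(core_id: int) -> int:
    """Return the module index of a core (assumes sibling cores have adjacent IDs)."""
    return core_id // CORES_PER_MODULE


def shares_fpu(core_a: int, core_b: int) -> bool:
    """Two distinct cores compete for floating point resources if they sit in the same module."""
    return core_a != core_b and module_of(core_a) == module_of(core_b)


def modules(num_cores: int) -> dict:
    """Group core IDs by module, e.g. 8 cores -> 4 modules of 2 cores each."""
    layout = defaultdict(list)
    for core in range(num_cores):
        layout[module_of(core)].append(core)
    return dict(layout)
```

Under this numbering, two floating point intensive threads placed on cores 0 and 1 would contend for the same module, while the same threads on cores 0 and 2 would not.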

Page 11

AMD Bulldozer

Page 12

AMD Bulldozer

[Figure, marked “X”]

Page 13

AMD Bulldozer

[Figure, marked “ok”]

Page 14

WORKLOAD CHARACTERIZATION

Page 15

Workload characterization

• Used to identify and sort the processes and threads that are floating-point intensive
  – Among the X most active threads (where X = the number of cores available)

• Based on a real-time monitoring system using Hardware Performance Counters (HPCs)
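A minimal sketch of this selection step, under stated assumptions: "most running" is interpreted here as "consumed the most CPU cycles in the last sampling interval", and each thread is represented by a hypothetical `(tid, cpu_cycles, fpu_usage)` tuple where `fpu_usage` is a fraction between 0 and 1. Neither the tuple layout nor the function name comes from the slides.

```python
def rank_by_fpu_usage(threads, num_cores):
    """Keep the num_cores busiest threads, then order them from most to least FPU-intensive.

    threads: list of (tid, cpu_cycles, fpu_usage) tuples (hypothetical layout).
    """
    # Step 1: the X most active threads, where X is the number of available cores.
    busiest = sorted(threads, key=lambda t: t[1], reverse=True)[:num_cores]
    # Step 2: sort the survivors by floating point intensity.
    return sorted(busiest, key=lambda t: t[2], reverse=True)
```

A thread that is very FPU-intensive but barely runs is dropped in the first step, so it cannot displace threads that actually occupy the cores.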

Page 16

…about HPCs…

• Registers embedded in processors that keep track of hardware-related events such as cache misses, number of CPU cycles, branch mispredictions, etc.
• Very low overhead (about 1%)
• Extremely accurate
• Limited resources: only a few of them can be used at the same time
  – This has so far limited their adoption on a large scale
• HW-specific

Page 17

Workload characterization

• HPCs used:
  – PERF_COUNT_HW_CPU_CYCLES: measures the total number of CPU cycles consumed by a thread during its execution time
  – CYCLES_FPU_EMPTY: keeps track of the number of CPU cycles during which the floating point units are not being used by a thread during its execution time
  – L2_CACHE_MISSES: counts the number of L2 cache misses generated by a thread during its execution time
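One way the three counters could be combined into per-interval metrics is sketched below. The formulas are assumptions made for illustration: FPU usage is taken as the fraction of cycles in which the FPU was not empty, and L2 pressure is expressed as misses per thousand cycles. The slides do not give the exact definitions, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSnapshot:
    """Cumulative counter values read for one thread (field names are illustrative)."""
    cpu_cycles: int        # PERF_COUNT_HW_CPU_CYCLES
    fpu_empty_cycles: int  # CYCLES_FPU_EMPTY
    l2_cache_misses: int   # L2_CACHE_MISSES


def interval_metrics(prev: CounterSnapshot, curr: CounterSnapshot) -> dict:
    """Derive metrics for the interval between two cumulative snapshots."""
    cycles = curr.cpu_cycles - prev.cpu_cycles
    if cycles <= 0:
        # The thread did not run during the interval; report neutral values.
        return {"fpu_usage": 0.0, "l2_misses_per_kcycle": 0.0}
    fpu_empty = curr.fpu_empty_cycles - prev.fpu_empty_cycles
    l2_misses = curr.l2_cache_misses - prev.l2_cache_misses
    return {
        # Assumed definition: share of cycles in which the FPU was in use.
        "fpu_usage": 1.0 - fpu_empty / cycles,
        # Assumed normalization: L2 misses per 1000 CPU cycles.
        "l2_misses_per_kcycle": 1000.0 * l2_misses / cycles,
    }
```

Working on differences between snapshots rather than on raw totals keeps the metric tied to what the thread did in the last interval, which matters when its behavior changes over time.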

Page 18

MONITORING AND SCHEDULING INFRASTRUCTURE DESIGN

Page 19

BulldOver design

• Bulldozer Overseer -> BulldOver
• Client-server architecture

Page 20

BulldOver design

• Server
  – Daemon
  – Scans the underlying architecture
  – Time-based HPC monitoring (once per second)
    • We target scientific workloads; short-lived threads are not well suited
  – Applies scheduling policies
  – libHpcOverseer, hwloc, libpfm
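The time-based monitoring loop of the server can be pictured roughly as follows. This is a sketch, not the BulldOver implementation: `read_counters` and `apply_policy` are hypothetical callbacks standing in for the HPC reading and the scheduling policy, and counters are simplified to one cumulative integer per thread ID.

```python
import time


def monitor(read_counters, apply_policy, interval_s=1.0, iterations=None, sleep=time.sleep):
    """Sample cumulative per-thread counters at a fixed interval and hand deltas to a policy.

    read_counters: callable returning {tid: cumulative_value} (hypothetical).
    apply_policy:  callable receiving {tid: delta} for the last interval (hypothetical).
    iterations:    number of samples to take; None means run until interrupted.
    """
    previous = read_counters()
    done = 0
    while iterations is None or done < iterations:
        sleep(interval_s)
        current = read_counters()
        # Threads seen for the first time have no baseline yet and are skipped once;
        # threads that terminated simply disappear from the next reading.
        deltas = {tid: current[tid] - previous[tid] for tid in current if tid in previous}
        apply_policy(deltas)
        previous = current
        done += 1
```

The skipped first sample for new threads illustrates why very short-lived threads are a poor fit for this scheme: a thread that lives for less than two sampling periods may never produce a usable delta.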

Page 21

BulldOver design

• Client
  – Command-line tool
    • prompt> bulldover java myprogram
  – Traces the creation/termination of threads/processes
  – Shares information with the server through shared memory
  – libmonitor, boost

Page 22

BulldOver design

[Figure; label: User space]

Page 23

EVALUATION

Page 24

Testing environment

• Dell PowerEdge M915
  – 4x AMD 6282SE 2.6 GHz CPUs (16 cores/8 modules each)
    • Limited to 1 CPU with 8 cores/4 modules
  – Test limited to a single NUMA node
    • Avoids latencies and other well-known NUMA-related effects
  – Turbo mode and frequency scaling disabled

Page 25

Benchmark suites

• SPEC CPU 2006
  – Perfect match for evaluating integer vs. floating point behaviors
• SciMark 2.0
  – Java-based
  – Noisy environment (additional threads for garbage collection, JIT, etc.)
  – Mainly FPU-oriented, with different levels of stress
  – Modified multi-threaded version running several random benchmarks over a thread pool

Page 26

Workload characterization: SPEC CPU 2006

[Figure; series: Empty FPU Cycles, Total CPU Cycles]

Page 27

Workload characterization: SciMark 2.0

[Figure; series: Empty FPU Cycles, Total CPU Cycles]

Page 28

FPU usage and caches

[Figure; panels: FPU usage, L2 cache miss ratio]

Page 29

Results for SPEC CPU 2006

[Figure; series: Inefficient baseline, Improved scheduling, Default OS scheduling]

Running 4x Int and 4x FPU benchmarks on a single NUMA node (4 modules/8 cores)

Page 30

Discussion

• BulldOver avoids the worst-case scenario
  – The default OS scheduler is not aware of the workload characterization
• Benefits come both from improved cache usage AND from better FPU/integer unit occupancy
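The slides do not spell out the scheduling policy itself. The sketch below is therefore a guess at one simple policy consistent with the stated goal of better FPU/integer unit occupancy: spread the most FPU-intensive threads over different modules, then fill the second core of each module with the least FPU-intensive threads. The function name, the input format, and the core numbering (cores 2m and 2m+1 belong to module m) are all assumptions.

```python
def place_threads(ranked_tids, num_modules):
    """Map threads to cores so that FPU-heavy threads do not share a module.

    ranked_tids: thread IDs ordered from most to least FPU-intensive (assumed input).
    Returns {tid: core_id}, with cores 2*m and 2*m + 1 forming module m (assumed numbering).
    """
    capacity = 2 * num_modules
    if len(ranked_tids) > capacity:
        raise ValueError("more threads than cores; this sketch handles at most one per core")
    heavy = ranked_tids[:num_modules]        # one FPU-heavy thread per module
    light = ranked_tids[num_modules:][::-1]  # lightest thread joins the heaviest
    placement = {}
    for module, tid in enumerate(heavy):
        placement[tid] = 2 * module
    for module, tid in enumerate(light):
        placement[tid] = 2 * module + 1
    return placement
```

With four FPU benchmarks and four integer benchmarks on four modules, this pairs one of each kind per module, which is the opposite of the worst case where two FPU-intensive threads share a module. On Linux, a placement like this could then be enforced with a call such as `os.sched_setaffinity`; whether BulldOver does so is not stated in the slides.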

Page 31

Results for SciMark 2.0

[Figure; series: Default OS scheduling, Improved scheduling]

Running 8x benchmarks that change randomly over time on a single NUMA node (4 modules/8 cores)

Page 32

Discussion

• All the threads are FPU-intensive
  – But at different levels
• Still a reasonable speedup “for free”
• Dynamic adaptation, since the FPU usage intensity varies over time
  – BulldOver reacts accordingly

Page 33

Conclusions

- We show how thread scheduling that is not aware of the shared HW resources available on the AMD Bulldozer processor can incur a significant performance penalty
- We presented a monitoring system that is able to characterize the most active threads according to their FPU/integer usage
- Thanks to the real-time analysis, improved scheduling can be applied and performance improved
- Our system is minimally intrusive:
  - Low overhead (below 2%)
  - No kernel patching required
  - No code instrumentation
  - Works on any application

Page 34

Conclusions

• Currently tuned for a specific HW architecture
• Good for scientific workloads
  – A sampling interval is required (1 sec in our case; it could be less but can’t be 0…)
• Based on a very simple scheduling policy
  – More sophisticated policies could be used

Page 35

Thanks!

Achille Peternier
[email protected]
http://sosoa.inf.unisi.ch

Page 36

“Pow7Over”

• Work in progress on IBM Power7 processors
  – 1 CPU, 8 cores, up to 4 SMT units per core
  – Completely different…
    • …operating system: RHEL 6.3
    • …architecture: PowerPC
    • …HPCs: IBM-specific ones (more than 500 available…)
    • …compiler: autotools 6.0
• Similar approach
• Slightly less significant speedup
  – But this is a full SMT
  – Similar overall behavior both for the PUs and L2 caches