Contention-aware scheduling (a different approach)


LCN: Design and Implementation of a Contention-Aware Scheduler

Raptis Dimos – Dimitrios

8th SFHMMY Conference 2015, April 4, 2015

National Technical University of Athens
School of Electrical and Computer Engineering

Outline
• Motivation
• Background
• Similar Research
• Scheduler Overview
• Classification Scheme
• Prediction Model
• Scheduling Algorithm
• Comparison with Similar Research
• Conclusion
• Future Work




Motivation
• Memory Wall
• SMPs & CMPs
• Cache Coherency Protocols
• Multithreaded Programming
• Parallel Processing




Motivation
Leveraging Parallelism
• Legacy PC applications (not benefiting)
• Applications benefiting from multithreaded environments
• “Embarrassingly parallel” applications (GPU etc.)
• Cache Coherence Problems




Motivation
Problems
• Memory contention problem
• Cache coherence problem
• No existing infrastructure to detect and restrict contention for system resources
Approaches
• What if it was not the programmer's responsibility to “allocate” resources?
• What if the operating system was responsible for judging applications' parallelism?
• Contention-Aware Scheduling




Background
Contention-Aware Scheduling
• HPC Monitoring
• Classification (based on locality and degree of contention)
• Scheduling Algorithm
+ Our approach contains an additional component: a prediction model




Similar Research
Various Approaches
• Simple heuristic approaches (LLC misses & memory bandwidth)
• Stack distance profiling approaches
• Dynamic scheduling approaches using supervised learning (linear regression, fuzzy-rule models, K-nearest neighbour)
Differences
• Not covering the whole memory hierarchy
• Using additional hardware not currently available in OS
• Targeting the same problem from a different view
• Pre-defined allocated resources in applications




Scheduler Overview
Scheduler Main Components
• Classification Scheme
  - 4 categories of applications based on memory hierarchy
• Prediction Model
  - prediction of contention in varying resource allocations
• Scheduling Algorithm
  - scheduling a workload of applications based on
    the classification scheme (co-scheduling combinations) and
    the prediction model (for ideal resource management)




Classification Scheme
4 main categories of applications: L, LC, C, N




Classification Scheme
Co-scheduling interference
• N - * : no interference
• L - L : contention on the same resource, bandwidth “divided”
• L - C : contention in different resources
  - severe performance degradation in C
  - no impact in L
• L - LC : performance degradation for both
  - LC faces bigger degradation than L
• LC - LC : contention in 2 resources (memory link and LLC)
  - both have degradation, but at low levels
• LC - C : moderate contention, mainly in C
• C - C : most difficult to predict, based on data access patterns (MESI protocol)




Classification Scheme
Co-scheduling interference
• Analysis from a workload of 16 applications
• 4 applications belonging to each class
• Co-scheduling of all possible combinations
• Average slowdown calculated for each combination
Table: Average slowdown in co-execution
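The aggregation described on this slide can be sketched as follows. This is only an illustration, not the author's tooling: the function names, the input layout (a dict of per-pair slowdowns) and the definition of slowdown as a per-application ratio are assumptions, and no measured values from the study are reproduced here. With 16 applications, co-scheduling every distinct pair gives 120 runs if self-pairs are excluded.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(apps):
    """Every distinct pair of applications (self-pairs excluded, an assumption)."""
    return list(combinations(apps, 2))


def average_slowdown_by_class_pair(app_class, pair_slowdown):
    """Average the measured slowdowns per (victim class, co-runner class).

    app_class:     application name -> class label ("L", "LC", "C" or "N")
    pair_slowdown: (app_a, app_b) -> (slowdown of app_a, slowdown of app_b)
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, sample count]
    for (a, b), (slow_a, slow_b) in pair_slowdown.items():
        for victim, other, slowdown in ((a, b, slow_a), (b, a, slow_b)):
            key = (app_class[victim], app_class[other])
            sums[key][0] += slowdown
            sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Each key of the result corresponds to one cell of the lost table: the average slowdown suffered by an application of the first class when it runs next to an application of the second class.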




Classification Scheme Classification tree




Prediction Model
Linear Regression Model
• Target: prediction of scaling
  - input: HPC values monitored for a 1-core allocation
  - capability to predict scaling for any possible allocation
  - use of a threshold value for defining optimal scaling
• Use the suitable counters for each class
  - Class L: memory link (bandwidth)
  - Class LC: LLC reuse (MESI protocol)
  - Class C: L2 and LLC reuse (MESI protocol)
  - Class N: private part of memory hierarchy




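Prediction Model

L class

$$R_p = \frac{\mathrm{Mem}_1 \cdot p}{\text{Maximum Memory Bandwidth}}$$

$$p_{optimum} = \max\{p\}, \quad R_p < 1.15$$

LC class

$$\mathrm{Completion}(LC)_p = \begin{cases}
0.01799 \cdot f_{LC} + 0.50119 & (p = 2 \text{ cores}) \\
0.02516 \cdot f_{LC} + 0.34286 & (p = 3 \text{ cores}) \\
0.02846 \cdot f_{LC} + 0.26028 & (p = 4 \text{ cores}) \\
0.03199 \cdot f_{LC} + 0.21584 & (p = 5 \text{ cores}) \\
0.03404 \cdot f_{LC} + 0.18296 & (p = 6 \text{ cores}) \\
0.03621 \cdot f_{LC} + 0.16410 & (p = 7 \text{ cores}) \\
0.03751 \cdot f_{LC} + 0.13969 & (p = 8 \text{ cores})
\end{cases}$$

$$\mathrm{Ideal\_Completion}_p = \frac{1}{p}, \qquad f_{LC} = \frac{\text{L2 RFO Requests}}{\text{L3 reuse} \cdot 10^5}$$

$$R_p = \frac{\mathrm{Ideal\_Completion}_p}{\mathrm{Completion}(LC)_p} \cdot 100$$

$$p_{optimum} = \max\{p\}, \quad R_p > 70$$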




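Prediction Model

C class

$$\mathrm{Completion}(C)_p = \begin{cases}
0.3447 \cdot f_C + 0.4947 & (p = 2 \text{ cores}) \\
0.46974 \cdot f_C + 0.34415 & (p = 3 \text{ cores}) \\
0.5155 \cdot f_C + 0.2478 & (p = 4 \text{ cores}) \\
0.63609 \cdot f_C + 0.22492 & (p = 5 \text{ cores}) \\
0.61403 \cdot f_C + 0.18127 & (p = 6 \text{ cores}) \\
0.65915 \cdot f_C + 0.15864 & (p = 7 \text{ cores}) \\
0.6095 \cdot f_C + 0.1263 & (p = 8 \text{ cores})
\end{cases}$$

$$\mathrm{Ideal\_Completion}_p = \frac{1}{p}, \qquad f_C = \frac{\text{L2 Shared} \cdot 10^4}{\text{Inst. Retired}}$$

N class

$$\mathrm{Completion}(N)_p = \mathrm{Ideal\_Completion}_p$$

$$p_{optimum} = \max\{p\}$$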




Prediction Model Example

L class

$\mathrm{Mem}_1 = 4$ GB/sec, $\mathrm{Mem}_{max} = 13.5$ GB/sec

$R_1 = 4/13.5 = 0.29$
$R_2 = (4 \cdot 2)/13.5 = 0.59$
$R_3 = (4 \cdot 3)/13.5 = 0.88$
$R_4 = (4 \cdot 4)/13.5 = 1.185$
$R_5 = (4 \cdot 5)/13.5 = 1.48$
$R_6 = (4 \cdot 6)/13.5 = 1.77$
$R_7 = (4 \cdot 7)/13.5 = 2.07$
$R_8 = (4 \cdot 8)/13.5 = 2.37$

$p_{optimum} = 3$ cores

LC class

$\mathrm{RFO}_1 = 319106$ per second, L3 reuse $= 1.51$

$f_{LC} = 319106/(1.51 \cdot 10^5) \approx 2.10$

$\mathrm{Completion}(LC)_2 = 0.01799 \cdot 2.10 + 0.50119 = 0.53 \rightarrow R_2 = 0.5/0.53 \cdot 100 = 92.7$
$\mathrm{Completion}(LC)_3 = 0.02516 \cdot 2.10 + 0.34286 = 0.39 \rightarrow R_3 = 0.33/0.39 \cdot 100 = 84.2$
$\mathrm{Completion}(LC)_4 = 0.02846 \cdot 2.10 + 0.26028 = 0.32 \rightarrow R_4 = 0.25/0.32 \cdot 100 = 78.0$
$\mathrm{Completion}(LC)_5 = 0.03199 \cdot 2.10 + 0.21584 = 0.28 \rightarrow R_5 = 0.2/0.28 \cdot 100 = 70.6$
$\mathrm{Completion}(LC)_6 = 0.03404 \cdot 2.10 + 0.18296 = 0.25 \rightarrow R_6 = 0.166/0.25 \cdot 100 = 65.4$
$\mathrm{Completion}(LC)_7 = 0.03621 \cdot 2.10 + 0.16410 = 0.24 \rightarrow R_7 = 0.142/0.24 \cdot 100 = 59.0$
$\mathrm{Completion}(LC)_8 = 0.03751 \cdot 2.10 + 0.13969 = 0.21 \rightarrow R_8 = 0.125/0.21 \cdot 100 = 57.1$

$p_{optimum} = 5$ cores
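The two worked examples can be reproduced with a short script. This is a minimal sketch: the coefficients and thresholds are copied from the slides, while the function names, the `max_cores=8` default and the fallback to 1 core when no allocation passes the threshold are assumptions made here for illustration.

```python
# Per-core (slope, intercept) pairs for the LC class, copied from the slides.
LC_COEFFS = {
    2: (0.01799, 0.50119),
    3: (0.02516, 0.34286),
    4: (0.02846, 0.26028),
    5: (0.03199, 0.21584),
    6: (0.03404, 0.18296),
    7: (0.03621, 0.16410),
    8: (0.03751, 0.13969),
}


def optimum_cores_l(mem1, mem_max, max_cores=8, threshold=1.15):
    """L class: largest p whose bandwidth ratio R_p = mem1 * p / mem_max stays below the threshold."""
    fitting = [p for p in range(1, max_cores + 1) if mem1 * p / mem_max < threshold]
    return max(fitting, default=1)  # fallback to 1 core is an assumption


def f_lc(l2_rfo_requests, l3_reuse):
    """LC class feature: L2 RFO requests divided by (L3 reuse * 10^5)."""
    return l2_rfo_requests / (l3_reuse * 1e5)


def optimum_cores_lc(f, threshold=70.0):
    """LC class: largest p whose efficiency R_p = (1/p) / Completion_p * 100 exceeds the threshold."""
    fitting = []
    for p, (slope, intercept) in LC_COEFFS.items():
        completion = slope * f + intercept
        r_p = (1.0 / p) / completion * 100.0
        if r_p > threshold:
            fitting.append(p)
    return max(fitting, default=1)  # fallback to 1 core is an assumption


print(optimum_cores_l(4.0, 13.5))             # L class example
print(optimum_cores_lc(f_lc(319106, 1.51)))   # LC class example
```

With the inputs of the example, the script selects 3 cores for the L application and 5 cores for the LC application, matching the slide.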




Prediction Model

Evaluation – Verification

Relative Errors in Predictions of C class




Prediction Model

Evaluation – Verification

Relative Errors in Predictions of LC class




Prediction Model

LC - C prediction model improvement
• Integration of 7 relationships into a single one
• Coefficients follow a logarithmic trendline
• Results after analysis

$$\mathrm{Completion}(LC)_p = [0.0139536 \cdot \log(p) + 0.0090562] \cdot f_{LC} + [-0.252533 \cdot \log(p) + 0.6407058]$$

$$\mathrm{Completion}(C)_p = [0.2151318 \cdot \log(p) + 0.2239032] \cdot f_C + [-0.25468 \cdot \log(p) + 0.6397947]$$

$$\mathrm{Ideal\_Completion}_p = \frac{1}{p}$$
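The unified relationships can be evaluated directly for any core count. The sketch below assumes that log(p) means the natural logarithm; the slide does not say so, but with that reading the formulas approximately reproduce the per-core coefficients (for p = 2 the LC slope comes to about 0.0187 against the tabulated 0.01799). The function names are illustrative.

```python
import math


def completion_lc(p, f_lc):
    """Unified LC-class model: both coefficients are logarithmic in the core count p."""
    slope = 0.0139536 * math.log(p) + 0.0090562
    intercept = -0.252533 * math.log(p) + 0.6407058
    return slope * f_lc + intercept


def completion_c(p, f_c):
    """Unified C-class model, same structure with the C-class constants."""
    slope = 0.2151318 * math.log(p) + 0.2239032
    intercept = -0.25468 * math.log(p) + 0.6397947
    return slope * f_c + intercept


def efficiency(p, completion):
    """R_p = Ideal_Completion_p / Completion_p * 100, with Ideal_Completion_p = 1/p."""
    return (1.0 / p) / completion * 100.0


# Compare the unified LC model with the per-core formula for p = 2 and f_LC = 2.10.
unified = completion_lc(2, 2.10)
per_core = 0.01799 * 2.10 + 0.50119
print(unified, per_core)
```

A single formula per class avoids storing one coefficient pair per core count, at the cost of the deviation between the fitted trendline and the original coefficients.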






Prediction Model

Evaluation – Verification of Refinement

Deviation in C coefficients Deviation in LC coefficients


Prediction Model

Experimentation Platform
• Cores: 8
• L1 D/I: 32 KB, 8-way
• L2: 256 KB, 8-way
• L3: 16 MB, 16-way
• Cache line: 64 bytes
• Memory: 64 GB DDR3, 1.3 GHz
• OS: Debian 6.06

*(Prediction Model also tested on Nehalem architecture)




Scheduling Algorithm
• Executed after the first 2 steps are finished for each application
  - Step 1 has classified each application
  - Step 2 has predicted the optimum number of cores that should be allocated by the scheduler to each application
• The algorithm tries to co-schedule the applications in pairs so that
  - the sum of cores does not exceed the package cores
  - contention is avoided as much as possible (using conclusions from the Classification step)
• The approach can be extended for co-execution of more than 2 applications
• N applications are allocated half the cores and scheduled twice (their profile implies that they are not affected by this)




Scheduling Algorithm
Lists of applications separated by class: L, LC, C, N

    // N applications cause no interference, so they can absorb any partner
    while (N not empty) {
        x = current N application;
        y = popMatchFromTheEnd(C, L, LC, N);
        coschedule(x, y);
    }
    while (LC not empty) {
        x = current LC application;
        y = popMatchFromTheEnd(C, LC, L);
        coschedule(x, y);
    }
    while (L not empty) {
        x = current L application;
        y = popMatchFromTheEnd(L);
        coschedule(x, y);
    }
    while (C not empty) {
        x = current C application;
        y = popMatchFromTheEnd(C);
        coschedule(x, y);
    }
    scheduleRemainingApplications();
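The pairing loop above can be turned into a small runnable sketch. Several details are assumptions, since the slides do not define them: the `App` record, the matching rule inside `pop_match_from_the_end` (take the last candidate whose cores fit next to `x` on the package, trying the classes in the listed order), and returning unmatched applications instead of scheduling them. Halving the cores of N applications is not modelled.

```python
from collections import namedtuple

# cores = optimum core count predicted for the application (step 2)
App = namedtuple("App", ["name", "cls", "cores"])

PACKAGE_CORES = 8  # matches the 8-core experimentation platform


def pop_match_from_the_end(x, candidate_lists, package_cores=PACKAGE_CORES):
    """Scan the candidate lists in order, each from its end, and pop the first
    application that fits beside x on the package (assumed criterion)."""
    for lst in candidate_lists:
        for i in range(len(lst) - 1, -1, -1):
            if x.cores + lst[i].cores <= package_cores:
                return lst.pop(i)
    return None


def pair_workload(apps):
    """Return (pairs, remaining) following the class order of the pseudocode."""
    by_cls = {c: [a for a in apps if a.cls == c] for c in ("L", "LC", "C", "N")}
    # Partner classes per class, copied from the slide's pseudocode.
    prefs = [
        ("N", ("C", "L", "LC", "N")),
        ("LC", ("C", "LC", "L")),
        ("L", ("L",)),
        ("C", ("C",)),
    ]
    pairs, remaining = [], []
    for cls, partner_classes in prefs:
        while by_cls[cls]:
            x = by_cls[cls].pop(0)  # x leaves its list, so it cannot match itself
            y = pop_match_from_the_end(x, [by_cls[p] for p in partner_classes])
            if y is None:
                remaining.append(x)  # left for scheduleRemainingApplications()
            else:
                pairs.append((x, y))
    return pairs, remaining
```

For a hypothetical workload such as `[App("n1", "N", 4), App("c1", "C", 4), App("lc1", "LC", 5), App("l1", "L", 3), App("l2", "L", 3), App("c2", "C", 2)]`, the N application is paired with a C application first, which keeps C applications away from the L and C partners that degrade them most.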




Comparison with Similar Research
The other state-of-the-art schedulers
• Sorting by heuristic
• Distributing load
• Combining application from the top with application from the bottom
LLC-MRB: LLC misses
LBB: memory bandwidth




Comparison with Similar Research

Experiments – Comparison Process
• Linux CFS, LCN, LLC-MRB, LLB to be compared
• Workload of 17 applications (equally shared among classes)
• Whole workload executed for 1 hour
• Time quanta of 1 second defined in all schedulers
• When an application finishes, it is respawned to re-execute
• Comparison between schedulers with 2 criteria
  - Throughput: total number of executions of all applications, number of improved applications
  - Fairness: standard deviation between the gain of each application

*gain compared to Gang scheduler
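The two criteria can be computed from per-application execution counts. This is a sketch under assumptions: the slides do not define gain precisely, so it is taken here as the relative change in execution count against the Gang scheduler baseline, and the population standard deviation is used. Function and key names are illustrative.

```python
from statistics import pstdev


def compare_to_baseline(runs, baseline_runs):
    """runs / baseline_runs: application name -> number of completed executions
    under the evaluated scheduler and under the baseline (Gang) scheduler."""
    # Assumed definition of gain: relative change in execution count.
    gains = {app: (runs[app] - baseline_runs[app]) / baseline_runs[app] for app in runs}
    return {
        "throughput": sum(runs.values()),                       # total executions
        "improved": sum(1 for g in gains.values() if g > 0),    # applications that gained
        "fairness_stddev": pstdev(list(gains.values())),        # lower means more even gains
    }
```

A low standard deviation only says that gains are evenly spread; a scheduler that improves no application at all would also score well on it, which is one way the fairness figure can be misread.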





Most Improved Applications

• Linux: 5
• LLC-Balance: 7
• MEM-Balance: 5
• LCN: 8





Criteria:
- Throughput: LCN
- Fairness: LLC-Balance *

* fairness can be misinterpreted





Major Drawbacks of other schedulers
Linux Scheduler (CFS)
• Cannot locate contention
• Does not identify threads of the same application → parallelism benefits lost
MEM-Balance Scheduler
• Uses an over-generic heuristic
• Does not take into account all parts of the memory hierarchy
LLC-Balance Scheduler
• Cannot differentiate between class N and C applications, since they both exhibit low LLC misses
• Results in co-scheduling L with C applications → contention




Conclusion
Proposed a contention-aware scheduler that
• Does not require additional hardware or OS adjustments
• Is simple and easily integrated as a component in a modern OS
• Consists of 3 parts
Compared to other state-of-the-art schedulers and the CFS, it
• Presents the best throughput
• Presents equal fairness to CFS (and lower than the other contention-aware schedulers)
Can be integrated into real-life scheduling with 2 approaches:
• Applications are executed for 2-3 quanta when inserted in the queue
• Start scheduling and monitoring simultaneously (dynamic adaptation)




Future Work
Major Improvements
• Improvements in the prediction model
  - Stepwise regression models to add more variables and decrease error
  - Caution: limitation in the number of monitored counters
  - Other methods, such as machine learning
• Investigation of added overhead
• Extension of the approach to NUMA architectures
  - Implemented and tested for 1 package only
  - Extensible to multiple packages, using thread migrations
    • Initially try to allocate threads of the same application in the same package
    • Thread migrations executed when a class change is observed, along with memory migrations




THE END

Thank you !!!

Any Questions ??

