Contention-aware scheduling (a different approach)
TRANSCRIPT
LCN: Design and Implementation of a Contention-Aware Scheduler
Raptis Dimos – Dimitrios
8th SFHMMY Conference 2015, April 4, 2015
National Technical University of Athens
School of Electrical and Computer Engineering
Outline
Motivation
Background
Similar Research
Scheduler Overview
Classification Scheme
Prediction Model
Scheduling Algorithm
Comparison with Similar Research
Conclusion
Future Work
Motivation
Memory Wall
SMPs & CMPs
Cache Coherency Protocols
Multithreaded Programming
Parallel Processing
Motivation
Leveraging Parallelism
Legacy PC applications (not benefiting)
Applications benefiting from multithreaded environments
"Embarrassingly parallel" applications (GPU etc.)
Cache Coherence Problems
Motivation
Problems
Memory Contention Problem
Cache Coherence Problem
No existing infrastructure to detect and restrict contention for system resources
Approaches
What if it were not the programmer's responsibility to "allocate" resources?
What if the operating system were responsible for judging applications' parallelism?
Contention-Aware Scheduling
Background
Contention-Aware Scheduling
HPC Monitoring
Classification (based on locality and degree of contention)
Scheduling Algorithm
+ Our approach contains an additional component: a prediction model
Similar Research
Various Approaches
• Simple heuristic approaches (LLC misses & memory bandwidth)
• Stack distance profiling approaches
• Dynamic scheduling approaches using supervised learning (linear regression, fuzzy-rule models, K-nearest neighbour)
Differences
• Not covering the whole memory hierarchy
• Using additional hardware not currently available to the OS
• Targeting the same problem from a different view
• Pre-defined allocated resources in applications
Scheduler Overview
Scheduler Main Components
Classification Scheme
• 4 categories of applications based on memory hierarchy
Prediction Model
• prediction of contention under varying resource allocations
Scheduling Algorithm
• scheduling a workload of applications based on the classification scheme (co-scheduling combinations) and the prediction model (for ideal resource management)
Classification Scheme
4 main categories of applications: L, LC, C, N
Classification Scheme
Co-scheduling interference
N - * : no interference
L - L : contention on the same resource, bandwidth "divided"
L - C : contention in different resources; severe performance degradation in C, no impact in L
L - LC : performance degradation for both; LC faces bigger degradation than L
LC - LC : contention in 2 resources (memory link and LLC); both have degradation, but at low levels
LC - C : mediocre contention, mainly in C
C - C : most difficult to predict, based on data access patterns (MESI protocol)
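The pairwise rules above can be held in a small lookup table. The following is a minimal sketch: the dictionary layout and the helper name `interference` are illustrative assumptions, and the strings only paraphrase the slide.

```python
# Qualitative co-scheduling interference, transcribed from the slide above.
# The dictionary layout and the helper name are illustrative assumptions.
INTERFERENCE = {
    frozenset(["L"]): "contention on the same resource, bandwidth divided",
    frozenset(["L", "C"]): "severe degradation in C, no impact in L",
    frozenset(["L", "LC"]): "degradation for both, larger for LC",
    frozenset(["LC"]): "contention in memory link and LLC, low degradation for both",
    frozenset(["LC", "C"]): "mediocre contention, mainly in C",
    frozenset(["C"]): "hard to predict, depends on data access patterns",
}


def interference(a: str, b: str) -> str:
    """Return the expected interference when classes a and b share a package."""
    if "N" in (a, b):
        # Class N does not interfere with any other class.
        return "no interference"
    # frozenset makes the lookup order-independent; (L, L) collapses to {L}.
    return INTERFERENCE[frozenset([a, b])]


print(interference("C", "L"))
```

Keying on a `frozenset` keeps the table symmetric, so `interference("L", "C")` and `interference("C", "L")` give the same entry.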
Classification Scheme
Co-scheduling interference
Analysis from a workload of 16 applications
• 4 applications belonging to each class
• Co-scheduling of all possible combinations
• Average slowdown calculated for each combination
Table: Average slowdown in co-execution
Classification Scheme
Classification tree
Prediction Model
Linear Regression Model
Target: prediction of scaling
• given HPC values monitored for a 1-core allocation
• capability to predict scaling for any possible allocation
• use of a threshold value for defining optimal scaling
Use the suitable counters for each class
• Class L: memory link (bandwidth)
• Class LC: LLC reuse (MESI protocol)
• Class C: L2 and LLC reuse (MESI protocol)
• Class N: private part of memory hierarchy
Prediction Model
L class
$R_p = (Mem_1 \cdot p) / (\text{Maximum Memory Bandwidth})$
$p_{optimum} = \max\{\,p : R_p < 1.15\,\}$
LC class
$Completion(LC)_2 = 0.01799 \cdot f_{LC} + 0.50119$ (p = 2 cores)
$Completion(LC)_3 = 0.02516 \cdot f_{LC} + 0.34286$ (p = 3 cores)
$Completion(LC)_4 = 0.02846 \cdot f_{LC} + 0.26028$ (p = 4 cores)
$Completion(LC)_5 = 0.03199 \cdot f_{LC} + 0.21584$ (p = 5 cores)
$Completion(LC)_6 = 0.03404 \cdot f_{LC} + 0.18296$ (p = 6 cores)
$Completion(LC)_7 = 0.03621 \cdot f_{LC} + 0.16410$ (p = 7 cores)
$Completion(LC)_8 = 0.03751 \cdot f_{LC} + 0.13969$ (p = 8 cores)
$Ideal\_Completion_p = 1/p$, $f_{LC} = \text{L2 RFO Requests} / (\text{L3 reuse} \cdot 10^5)$
$R_p = (Ideal\_Completion_p / Completion_p) \cdot 100$
$p_{optimum} = \max\{\,p : R_p > 70\,\}$
Prediction Model
C class
$Completion(C)_2 = 0.3447 \cdot f_C + 0.4947$ (p = 2 cores)
$Completion(C)_3 = 0.46974 \cdot f_C + 0.34415$ (p = 3 cores)
$Completion(C)_4 = 0.5155 \cdot f_C + 0.2478$ (p = 4 cores)
$Completion(C)_5 = 0.63609 \cdot f_C + 0.22492$ (p = 5 cores)
$Completion(C)_6 = 0.61403 \cdot f_C + 0.18127$ (p = 6 cores)
$Completion(C)_7 = 0.65915 \cdot f_C + 0.15864$ (p = 7 cores)
$Completion(C)_8 = 0.6095 \cdot f_C + 0.1263$ (p = 8 cores)
$Ideal\_Completion_p = 1/p$, $f_C = (\text{L2 Shared} \cdot 10^4) / \text{Inst.Retired}$
$R_p = (Ideal\_Completion_p / Completion_p) \cdot 100$
$p_{optimum} = \max\{\,p : R_p > 70\,\}$
N class
$Completion(N)_p = Ideal\_Completion_p$
$p_{optimum} = \max\{p\}$
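The per-class rules can be put into a few lines of Python. This is only a sketch of the formulas on the slides, not the scheduler's actual code: the function names, the 8-core limit, and the fallback to 1 core when no allocation passes the threshold are assumptions. The sample inputs are the ones used in the deck's own worked example.

```python
# (slope, intercept) per core count p, copied from the LC and C class slides.
LC_COEFFS = {
    2: (0.01799, 0.50119), 3: (0.02516, 0.34286), 4: (0.02846, 0.26028),
    5: (0.03199, 0.21584), 6: (0.03404, 0.18296), 7: (0.03621, 0.16410),
    8: (0.03751, 0.13969),
}
C_COEFFS = {
    2: (0.3447, 0.4947), 3: (0.46974, 0.34415), 4: (0.5155, 0.2478),
    5: (0.63609, 0.22492), 6: (0.61403, 0.18127), 7: (0.65915, 0.15864),
    8: (0.6095, 0.1263),
}


def optimum_cores_l(mem1, max_bandwidth, max_cores=8, limit=1.15):
    """Class L: largest p whose bandwidth ratio R_p stays below the limit."""
    ok = [p for p in range(1, max_cores + 1)
          if mem1 * p / max_bandwidth < limit]
    return max(ok) if ok else 1  # fallback to 1 core is an assumption


def optimum_cores_regression(coeffs, f, threshold=70.0):
    """Classes LC and C: largest p with R_p = 100 * (1/p) / Completion_p > threshold."""
    ok = [p for p, (slope, intercept) in coeffs.items()
          if 100.0 * (1.0 / p) / (slope * f + intercept) > threshold]
    return max(ok) if ok else 1  # fallback to 1 core is an assumption


def optimum_cores_n(max_cores=8):
    """Class N: completion equals the ideal one, so every core can be used."""
    return max_cores


# Inputs from the worked example in the deck.
f_lc = 319106 / (1.51 * 1e5)  # f_LC = L2 RFO requests / (L3 reuse * 10^5)
print(optimum_cores_l(4.0, 13.5))                 # class L example
print(optimum_cores_regression(LC_COEFFS, f_lc))  # class LC example
```

With these inputs the two calls print 3 and 5, matching the optimum core counts stated in the deck's example.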
Prediction Model
Example
L class
$Mem_1 = 4$ GB/sec, $Mem_{max} = 13.5$ GB/sec
$R_1 = 4/13.5 = 0.29$
$R_2 = (4 \cdot 2)/13.5 = 0.59$
$R_3 = (4 \cdot 3)/13.5 = 0.88$
$R_4 = (4 \cdot 4)/13.5 = 1.185$
$R_5 = (4 \cdot 5)/13.5 = 1.48$
$R_6 = (4 \cdot 6)/13.5 = 1.77$
$R_7 = (4 \cdot 7)/13.5 = 2.07$
$R_8 = (4 \cdot 8)/13.5 = 2.37$
$p_{optimum} = 3$ cores
LC class
$RFO_1 = 319106$ per second, L3 reuse $= 1.51$
$f_{LC} = 319106/(1.51 \cdot 10^5) = 2.10$
$Completion(LC)_2 = 0.01799 \cdot 2.10 + 0.50119 = 0.53$ → $R_2 = 0.5/0.53 \cdot 100 = 92.7$
$Completion(LC)_3 = 0.02516 \cdot 2.10 + 0.34286 = 0.39$ → $R_3 = 0.33/0.39 \cdot 100 = 84.2$
$Completion(LC)_4 = 0.02846 \cdot 2.10 + 0.26028 = 0.32$ → $R_4 = 0.25/0.32 \cdot 100 = 78.0$
$Completion(LC)_5 = 0.03199 \cdot 2.10 + 0.21584 = 0.28$ → $R_5 = 0.2/0.28 \cdot 100 = 70.6$
$Completion(LC)_6 = 0.03404 \cdot 2.10 + 0.18296 = 0.25$ → $R_6 = 0.166/0.25 \cdot 100 = 65.4$
$Completion(LC)_7 = 0.03621 \cdot 2.10 + 0.16410 = 0.24$ → $R_7 = 0.142/0.24 \cdot 100 = 59.0$
$Completion(LC)_8 = 0.03751 \cdot 2.10 + 0.13969 = 0.21$ → $R_8 = 0.125/0.21 \cdot 100 = 57.1$
$p_{optimum} = 5$ cores
Prediction Model
Evaluation – Verification
Relative Errors in Predictions of C class
Prediction Model
Evaluation – Verification
Relative Errors in Predictions of LC class
Prediction Model
LC - C prediction model improvement
• Integration of 7 relationships to a single one
• Coefficients follow logarithmic trendline
Results after analysis
$Completion(LC)_p = [0.0139536 \cdot \log(p) + 0.0090562] \cdot f_{LC} + [-0.252533 \cdot \log(p) + 0.6407058]$
$Completion(C)_p = [0.2151318 \cdot \log(p) + 0.2239032] \cdot f_C + [-0.25468 \cdot \log(p) + 0.6397947]$
$Ideal\_Completion_p = 1/p$
$R_p = (Ideal\_Completion_p / Completion_p) \cdot 100$
$p_{optimum} = \max\{\,p : R_p > 70\,\}$
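A minimal Python sketch of the unified model follows. It assumes that log denotes the natural logarithm, which the slides do not state; that reading roughly reproduces the per-core coefficients (for p = 2 the LC slope comes out near 0.0187, against 0.01799 in the per-core table). The function names and the restriction to p = 2..8, the range covered by the 7 original relationships, are also assumptions.

```python
import math


def completion_lc(p, f_lc):
    """Unified LC model; log is taken as the natural logarithm (assumption)."""
    slope = 0.0139536 * math.log(p) + 0.0090562
    intercept = -0.252533 * math.log(p) + 0.6407058
    return slope * f_lc + intercept


def completion_c(p, f_c):
    """Unified C model; same natural-log assumption as above."""
    slope = 0.2151318 * math.log(p) + 0.2239032
    intercept = -0.25468 * math.log(p) + 0.6397947
    return slope * f_c + intercept


def optimum_cores(completion, f, max_cores=8, threshold=70.0):
    """Largest p in 2..max_cores with R_p = 100 * (1/p) / Completion_p > threshold."""
    ok = [p for p in range(2, max_cores + 1)
          if 100.0 * (1.0 / p) / completion(p, f) > threshold]
    return max(ok) if ok else 1  # fallback to 1 core is an assumption


f_lc = 319106 / (1.51 * 1e5)
for p in range(2, 9):
    r_p = 100.0 * (1.0 / p) / completion_lc(p, f_lc)
    print(p, round(r_p, 1))
print(optimum_cores(completion_lc, f_lc))
```

Because the fitted trendline deviates slightly from the per-core coefficients, the unified model can select a different core count than the per-core table when R_p is close to the threshold.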
Prediction Model
Evaluation – Verification of Refinement
Deviation in C coefficients Deviation in LC coefficients
Prediction Model
Experimentation Platform
cores: 8
L1D, L1I: 32KB, 8-way
L2: 256KB, 8-way
L3: 16 MB, 16-way
64-byte line
Mem: 64GB DDR3 1.3GHz
Debian 6.06
*(Prediction Model also tested on Nehalem architecture)
Scheduling Algorithm
Executed after the first 2 steps have finished for each application
• Step 1 has classified each application
• Step 2 has predicted the optimum number of cores that should be allocated by the scheduler to each application
The algorithm tries to co-schedule the applications in pairs so that
• the sum of cores does not exceed the package cores
• contention is avoided as much as possible (using conclusions from the Classification step)
The approach can be extended to co-execution of more than 2 applications
N applications are allocated half the cores and scheduled twice (their profile implies that they are not affected by this)
Scheduling Algorithm
Lists of applications separated by class: L, LC, C, N
while (N not empty) {
    x = current N application;
    y = popMatchFromTheEnd(C, L, LC, N);
    coschedule(x, y);
}
while (LC not empty) {
    x = current LC application;
    y = popMatchFromTheEnd(C, LC, L);
    coschedule(x, y);
}
while (L not empty) {
    x = current L application;
    y = popMatchFromTheEnd(L);
    coschedule(x, y);
}
while (C not empty) {
    x = current C application;
    y = popMatchFromTheEnd(C);
    coschedule(x, y);
}
scheduleRemainingApplications();
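The pseudocode leaves several details open, so the following Python version is a sketch under stated assumptions rather than the LCN implementation. It assumes that the "current" application is taken from the front of its list, that `popMatchFromTheEnd` searches the given lists in order and pops the last application whose cores fit next to `x` within the package, and that unmatched applications are handed to the remaining-applications step. The `App` type and the sample workload are hypothetical.

```python
from dataclasses import dataclass

PACKAGE_CORES = 8  # cores per package on the experimentation platform


@dataclass
class App:
    name: str
    cls: str    # "L", "LC", "C" or "N"
    cores: int  # optimum core count from the prediction step


def pop_match_from_the_end(x, *candidate_lists):
    """Search lists in order; pop the last app that fits next to x (assumed rule)."""
    for lst in candidate_lists:
        for i in range(len(lst) - 1, -1, -1):
            if x.cores + lst[i].cores <= PACKAGE_CORES:
                return lst.pop(i)
    return None


def schedule(apps):
    by_cls = {c: [a for a in apps if a.cls == c] for c in ("L", "LC", "C", "N")}
    L, LC, C, N = (by_cls[c] for c in ("L", "LC", "C", "N"))
    pairs, remaining = [], []
    # Same class order and partner preferences as the pseudocode above.
    for current, partners in ((N, (C, L, LC, N)), (LC, (C, LC, L)),
                              (L, (L,)), (C, (C,))):
        while current:
            x = current.pop(0)  # taking x out first avoids pairing it with itself
            y = pop_match_from_the_end(x, *partners)
            if y is None:
                remaining.append(x)  # left for scheduleRemainingApplications()
            else:
                pairs.append((x, y))
    return pairs, remaining


# Hypothetical workload; N applications get half the package cores.
workload = [App("n1", "N", 4), App("c1", "C", 4), App("l1", "L", 3),
            App("lc1", "LC", 5), App("l2", "L", 3)]
pairs, remaining = schedule(workload)
print([(x.name, y.name) for x, y in pairs], [a.name for a in remaining])
```

Removing `x` from its list before searching for a partner also guarantees that each loop terminates, which the pseudocode leaves implicit.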
Comparison with Similar Research
The other state-of-the-art schedulers
• Sorting by heuristic
• Distributing load
• Combining an application from the top with an application from the bottom
LLC-MRB: LLC misses
LBB: memory bandwidth
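The sort-and-pair idea behind these schedulers can be sketched in Python as follows. This illustrates the general technique described on the slide, not the published LLC-MRB or LBB code; the function name, the tuple layout, and the metric values are made up for the example.

```python
def pair_by_heuristic(apps, metric):
    """Sort by a contention metric, then pair the heaviest app with the lightest."""
    ordered = sorted(apps, key=metric, reverse=True)
    pairs = []
    while len(ordered) >= 2:
        # Top of the list (highest metric) goes with the bottom (lowest metric).
        pairs.append((ordered.pop(0), ordered.pop()))
    return pairs, ordered  # ordered holds the unpaired middle app, if any


# Hypothetical (name, LLC misses) samples.
apps = [("a", 900), ("b", 100), ("c", 500), ("d", 300)]
pairs, leftover = pair_by_heuristic(apps, metric=lambda app: app[1])
print(pairs, leftover)
```

Swapping the metric function (LLC misses or memory bandwidth) changes the scheduler variant while the pairing step stays the same.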
Comparison with Similar Research
Experiments – Comparison Process
Linux CFS, LCN, LLC-MRB, LBB to be compared
Workload of 17 applications (equally shared among classes)
Whole workload executed for 1 hour
Time quanta of 1 second defined in all schedulers
When an application finishes, it is respawned to re-execute
Comparison between schedulers with 2 criteria
Throughput
• Total number of executions of all applications
• Number of improved applications
Fairness
• Standard deviation between the gains of the applications
*gain compared to Gang scheduler
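The two criteria can be computed from per-application execution counts. The sketch below makes several assumptions the slide does not specify: gain is taken as the ratio of executions under the tested scheduler to executions under the Gang scheduler, an application counts as improved when its gain exceeds 1, and the population standard deviation is used. All names and numbers are hypothetical.

```python
from statistics import pstdev


def throughput(executions):
    """Total number of completed executions across all applications."""
    return sum(executions.values())


def gains(executions, baseline):
    """Per-application gain relative to the Gang baseline (assumed ratio form)."""
    return {app: executions[app] / baseline[app] for app in executions}


def improved(executions, baseline):
    """Number of applications whose gain exceeds 1 (assumed definition)."""
    return sum(1 for g in gains(executions, baseline).values() if g > 1.0)


def fairness(executions, baseline):
    """Standard deviation of the gains; lower means more even treatment."""
    return pstdev(list(gains(executions, baseline).values()))


# Hypothetical execution counts over the measurement window.
gang = {"app1": 10, "app2": 10}
tested = {"app1": 12, "app2": 10}
print(throughput(tested), improved(tested, gang), fairness(tested, gang))
```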
Comparison with Similar Research
Most Improved Applications
Linux: 5
LLC-Balance: 7
MEM-Balance: 5
LCN: 8
Comparison with Similar Research
Criteria:
- Throughput: LCN
- Fairness: LLC-Balance *
* fairness can be misinterpreted
Comparison with Similar Research
Major Drawbacks of other schedulers
Linux Scheduler CFS
• Cannot locate contention
• Does not identify threads of the same application, so parallelism benefits are lost
MEM-Balance Scheduler
• Uses an over-generic heuristic
• Does not take into account all parts of the memory hierarchy
LLC-Balance Scheduler
• Cannot differentiate between class N and C applications, since they both exhibit low LLC misses
• Results in co-scheduling L with C applications → contention
Conclusion
Proposed a contention-aware scheduler that
• Does not require additional OS hardware adjustments
• Is simple and easy to integrate as a component in a modern OS
• Consists of 3 parts
Compared to other state-of-the-art schedulers and the CFS
• Presents the best throughput
• Presents equal fairness to CFS (and lower than the other contention-aware schedulers)
Can be integrated into real-life scheduling with 2 approaches:
• Applications are executed for 2-3 quanta when inserted in the queue
• Start scheduling and monitoring simultaneously (dynamic adaptation)
Future Work
Major Improvements
Improvements in the prediction model
Stepwise regression models to add more variables
Decrease error
Caution: limitation in number of monitored counters
Other methods, such as machine learning
Investigation of added overhead
Extension of approach to NUMA architectures
Implemented and tested for 1 package only
Extensible to multiple packages, using thread migrations
• Initially try to allocate threads of the same application in the same package
• Thread migrations executed when class change is observed along with memory migrations
THE END
Thank you !!!
Any Questions ??