Age Based Scheduling for Asymmetric Multiprocessors
Nagesh B Lakshminarayana, Jaekyu Lee & Hyesoon Kim
Outline
• Background and Motivation
• Age Based Scheduling
• Evaluation
• Conclusion
Asymmetric (Chip) Multiprocessors
• Heterogeneous Architectures where all cores have same ISA but different performance
[Figure: Heterogeneous Architecture (one PEA, four PEB)]
Asymmetric (Chip) Multiprocessors
• Potential for better performance than SMPs occupying same area and consuming same power
[Figure: Symmetric Chip Multiprocessor (SMP/CMP) with Core0 to Core3; Asymmetric Chip Multiprocessor (AMP/ACMP) with Core0 and Core1 to Core3]
AMPs present new challenges
• Thread Scheduling is one among them
Scheduling in Multiprocessor OSes
• Thread Assignment
– assign to least loaded core
• Load Balancing
– make load on all cores uniform
• Idle Balancing
– move threads from busy cores to idle core
Scheduling in Multiprocessor OSes
• Assume that all cores are identical
• Results in bad performance and application instability
[Figure: Parsec benchmarks on a (real) AMP using the Linux Scheduler]
– all-fast: 16 cores at 2 GHz
– half-half: 8 cores at 2 GHz, 8 cores at 1 GHz
– all-slow: 16 cores at 1 GHz
Problem with current Scheduling
Not taking advantage of fast core
Outline
• Background and Motivation
• Age Based Scheduling (ABS)
• Evaluation
• Conclusion
Motivation for Age Based Scheduling
• Many compute-intensive multithreaded applications follow the fork-join model
• Milestones (barriers) in thread execution
[Figure: Application Model. The main thread forks threads that pass through a series of barriers before the join.]
Symmetry of Applications
• Threads created together are symmetric
– Based on instruction count
– Degree of Symmetry = Std Dev / Average
[Figure: Degree of Symmetry of Parsec Benchmarks]
(Symmetric benchmarks are benchmarks with degree of symmetry <= 0.1)
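The degree-of-symmetry metric above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, the instruction counts are made up, and the slides do not say whether a population or sample standard deviation is used, so the population form is assumed here.

```python
from statistics import mean, pstdev


def degree_of_symmetry(instruction_counts):
    """Return Std Dev / Average of per-thread instruction counts.

    Assumption: population standard deviation (the slides do not specify).
    """
    avg = mean(instruction_counts)
    if avg == 0:
        raise ValueError("average instruction count must be non-zero")
    return pstdev(instruction_counts) / avg


def is_symmetric(instruction_counts, threshold=0.1):
    """Apply the slide's cutoff: symmetric if degree of symmetry <= 0.1."""
    return degree_of_symmetry(instruction_counts) <= threshold


# Hypothetical instruction counts for threads created together.
balanced = [1_000_000, 1_020_000, 980_000, 1_000_000]
skewed = [1_000_000, 3_000_000]

print(is_symmetric(balanced))  # True
print(is_symmetric(skewed))    # False
```

A lower value means the threads do more nearly equal work, which is what makes relative execution durations easy to predict.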
Insight
exe_dur(T1) = exe_dur(T2) = exe_dur(T3) = exe_dur(T4)
• Difficult to predict absolute execution duration, so predict relative execution duration
[Figure: Threads T1 to T4 between two barriers; execution duration = ?]
Putting together
• Applications follow fork-join model with milestones in between
• Many applications are symmetric
• Easy to predict relative execution duration to next milestone
Age Based Scheduling
What is Age?
Age is the progress made by a thread towards its next milestone
Age Calculation
• Threads created together have the same age
• As a thread executes, it ages
• Reset age when milestone crossed
tA – age of thread A
tB – age of thread B
[Figure: Timeline with creation, execution, milestone (barrier), and milestone (termination). Age labels for thread A: tA = 0, tA = 30, tA = X, tA = 0. Age labels for thread B: tB = 0, tB = 50, tB = 0.]
X – Unknown, assumed to be a large value
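The aging rules above can be sketched as a small bookkeeping class. This is an illustrative assumption, not the authors' implementation: the class, method, and function names are hypothetical, and "progress" is an abstract unit standing in for whatever the scheduler actually measures.

```python
class ThreadAge:
    """Tracks a thread's progress (age) toward its next milestone."""

    def __init__(self, name, age=0):
        self.name = name
        self.age = age

    def advance(self, progress):
        """The thread ages as it executes."""
        if progress < 0:
            raise ValueError("progress must be non-negative")
        self.age += progress

    def cross_milestone(self):
        """Crossing a milestone (e.g. a barrier) resets the age."""
        self.age = 0


def spawn_together(names):
    """Threads created together start with the same age (zero)."""
    return {name: ThreadAge(name) for name in names}


threads = spawn_together(["A", "B"])
threads["A"].advance(30)        # tA = 30
threads["B"].advance(50)        # tB = 50
threads["B"].cross_milestone()  # tB = 0 once B crosses its milestone
print(threads["A"].age, threads["B"].age)  # 30 0
```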
Age Based Scheduling Algorithm
To make a Scheduling decision:
• Calculate remaining execution duration to next milestone based on age
• Assign threads with longer remaining execution durations to fast core
– Longest Job to Fast Core First (LJFCF)
Application of LJFCF
• Apply whenever
– Thread is created
– A core becomes idle
– Reassignment timer expires (for load balancing)
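A minimal sketch of the LJFCF rule, assuming the remaining execution durations are already known. The function name and return shape are hypothetical; the rem_exe values are the ones shown on the later "Working of the Algorithm" slide. The slides do not describe tie-breaking or how threads are spread across slow cores, so this sketch only separates fast-core candidates from the rest.

```python
def ljfcf(rem_exe, n_fast_cores):
    """Longest Job to Fast Core First.

    rem_exe maps thread name -> predicted remaining execution duration.
    Returns (threads chosen for fast cores, threads left for slow cores).
    """
    if n_fast_cores < 0:
        raise ValueError("n_fast_cores must be non-negative")
    # Rank threads by remaining execution duration, longest first.
    ranked = sorted(rem_exe, key=rem_exe.get, reverse=True)
    return ranked[:n_fast_cores], ranked[n_fast_cores:]


rem_exe = {"A": 50, "B": 70, "C": 90, "D": 30}
fast, slow = ljfcf(rem_exe, n_fast_cores=1)
print(fast)  # ['C']
print(slow)  # ['B', 'A', 'D']
```

In a scheduler, a function like this would be called on each of the three triggers listed above: thread creation, a core becoming idle, and reassignment timer expiry.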
Working of the Algorithm
[Figure: Timeline for thread T1 with creation, execution, milestone (barrier), and milestone (termination). tA = 0 at creation, tA = 30 during execution, age at barrier = X.]
rem_exe = (X – 30)
Remaining Execution Duration (I)
• Track progress of threads
• Using Prediction [AGE]
– Predict all threads have same inter-milestone distance
tA – age of thread A
tB – age of thread B
[Figure: Timeline with creation, execution, milestone (barrier), and milestone (termination). Age labels for thread A: tA = 0, tA = X, tA = 0, tA = X. Age labels for thread B: tB = 0, tB = X.]
Remaining Execution Duration (II)
• Using Profiling [AGE(PROF)]
– Threads have different inter-milestone distances calculated based on a metric obtained by profiling
tA – age of thread A
tB – age of thread B
[Figure: Timeline with creation, execution, milestone (barrier), and milestone (termination). Age labels for thread A: tA = 0, tA = X, tA = 0, tA = X. Age labels for thread B: tB = 0, tB = rX.]
r is from profiler
Only one r value for each thread
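The two estimators can be sketched with one function, following the earlier rem_exe = (X – 30) example. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the value of X is made up, and clamping the result at zero is an assumption the slides do not state.

```python
def remaining_execution(age, inter_milestone_distance, r=1.0):
    """Estimate remaining execution duration to the next milestone.

    AGE:        r = 1.0, every thread is predicted to have distance X.
    AGE(PROF):  r comes from profiling, so the distance is r * X.
    Assumption: the estimate is clamped at zero once age exceeds r * X.
    """
    return max(r * inter_milestone_distance - age, 0.0)


X = 100  # hypothetical inter-milestone distance

print(remaining_execution(30, X))         # 70.0  (AGE: X - 30)
print(remaining_execution(30, X, r=1.5))  # 120.0 (AGE(PROF): rX - 30)
print(remaining_execution(130, X))        # 0.0   (clamped)
```

With a single r per thread, the profiled variant only rescales the predicted distance; the age tracking itself is unchanged.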
Working of the Algorithm
[Figure: One fast core and three slow cores with threads A, B, C, and D. rem_exeA = 50, rem_exeB = 70, rem_exeC = 90, rem_exeD = 30. A second panel shows threads A and C with rem_exeC = 90 and rem_exeA = 50.]
Benefit of Age Based Scheduling
• Asymmetry aware
• Utilizes all cores
• Gives all threads opportunities to run on fast cores
Implementation
• OS
– Track progress using Performance Counters
– Disable counter on Interrupts
• Compiler (AGE(PROF))
– Passing profiled information (one value for each thread)
Outline
• Background and Motivation
• Age Based Scheduling
• Evaluation
• Conclusion
Evaluation
• Simulation based experiments
– Trace + execution hybrid simulator
– Locks and barriers are modeled
– Context switch and migration overhead simulated
– 10 ms time slice for each thread
• Machine configuration
– 1 fast, 7 slow, 8:1 speed ratio (others are in the paper)
• Benchmarks
– Symmetric: Parsec (simmedium input)
– Asymmetric: Splash-2, OMPSCR, SuperLU
Comparisons with Other Policies
Policy            Description
Linux             Linux O(1) Scheduler
RR                Threads are assigned to fast cores in a Round Robin fashion
SCALEDLD [Li’07]  Fast Core First assignment, asymmetry aware load balancing (baseline)
FCA-AGE           Fast Core First assignment with Age based periodic reassignment
AGE               Age based assignment and reassignment using prediction
AGE(PROF)         Age based assignment and reassignment using profiling
AGE(ORACLE)       Age based assignment and reassignment using oracle
LJFCF vs Other Policies (I)
• Parsec
[Figure: % Reduction in Execution Time for RR, FCA-AGE, AGE, AGE(PROF), and AGE(ORACLE); y-axis from -200 to 100. Baseline: SCALEDLD]
Policy        Avg % reduction over SCALEDLD
RR            -36.64
FCA-AGE       9.8
AGE           10.4
AGE(PROF)     13.2
AGE(ORACLE)   15.4
LJFCF vs Other Policies (II)
• Asymmetric Benchmarks
[Figure: % Reduction in Execution Time for FCA-AGE, AGE, AGE(PROF), and AGE(ORACLE); y-axis from -10 to 40. Baseline: SCALEDLD]
Policy        Avg % reduction over SCALEDLD
FCA-AGE       8.2
AGE           7.7
AGE(PROF)     9.4
AGE(ORACLE)   13.1
Idle Cycles
[Figure: Idle cycles (0% to 100%), split into Slow Cores and Fast Core, for blackscholes, bodytrack, fluidanimate, and swaptions under Linux, SCALEDLD, and AGE]
• Linux Scheduler – Most of the idle cycles contributed by fast core
• SCALEDLD – keeps same thread(s) on fast core
• AGE – assigns different threads to fast core
Different AMP Configurations
• Need for asymmetry aware scheduling increases as cores become more asymmetric
• AGE based policies show more improvement over SCALEDLD as asymmetry increases
[Figure: Normalized execution time (0 to 2.5) for Linux, SCALEDLD, AGE, and AGE(PROF) on 2/1-Parsec, 4/1-Parsec, 6/1-Parsec, and 8/1-Parsec]
X/1 : Ratio of speeds of Fast and Slow cores is X:1
Outline
• Background and Motivation
• Age Based Scheduling
• Evaluation
• Conclusion
Conclusion
• Age based scheduling (ABS) for Asymmetric Multiprocessors
– ABS assumes threads created at the same time are symmetric
– ABS assigns threads to cores based on their predicted remaining execution durations
– Predictions are made based on Age of threads
• Improvement of 10.4% (Pred) and 13.2% (Prof) for Parsec and 7.6% (Pred) and 9.4% (Prof) for Asymmetric benchmarks over Li’s mechanism
THANK YOU