Age Based Scheduling for Asymmetric Multiprocessors

Nagesh B Lakshminarayana, Jaekyu Lee & Hyesoon Kim

Outline

• Background and Motivation
• Age Based Scheduling
• Evaluation
• Conclusion


Asymmetric (Chip) Multiprocessors

• Heterogeneous Architectures where all cores have same ISA but different performance

[Figure: Heterogeneous Architecture with one PEA and four PEBs]

Asymmetric (Chip) Multiprocessors

• Potential for better performance than SMPs occupying same area and consuming same power

[Figure: Symmetric Chip Multiprocessor (SMP/CMP) with Core0 to Core3, next to an Asymmetric Chip Multiprocessor (AMP/ACMP) with Core0 to Core3]

AMPs present new challenges

• Thread scheduling is one of them


Scheduling in Multiprocessor OSes

• Thread Assignment
  – assign to least loaded core
• Load Balancing
  – make load on all cores uniform
• Idle Balancing
  – move threads from busy cores to idle core
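
As a rough illustration of thread assignment and idle balancing, the sketch below models each core as a run queue and uses queue length as the load metric. The function names and the load metric are assumptions made for this example; this is not the Linux scheduler's actual code.

```python
from typing import Dict, List

RunQueues = Dict[int, List[str]]  # core id -> names of runnable threads


def assign_thread(queues: RunQueues, thread: str) -> int:
    """Thread assignment: place a new thread on the least loaded core."""
    # Ties are broken in favor of the lowest core id.
    core = min(queues, key=lambda c: (len(queues[c]), c))
    queues[core].append(thread)
    return core


def idle_balance(queues: RunQueues) -> None:
    """Idle balancing: each idle core pulls one thread from the busiest core."""
    for idle in sorted(c for c, q in queues.items() if not q):
        busiest = max(queues, key=lambda c: len(queues[c]))
        if len(queues[busiest]) > 1:  # leave the busy core with work of its own
            queues[idle].append(queues[busiest].pop())


queues: RunQueues = {0: ["t0", "t1", "t2"], 1: ["t3"], 2: []}
first_core = assign_thread(queues, "t4")   # core 2 is empty, so it is chosen
second_core = assign_thread(queues, "t5")  # cores 1 and 2 tie; lowest id wins
```

Neither function looks at how fast a core is, which is the assumption the next slide calls out.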


Scheduling in Multiprocessor OSes

• Assume that all cores are identical
• Results in bad performance and application instability

[Figure: Parsec benchmarks on a (real) AMP using the Linux Scheduler]

all-fast: 16 cores at 2GHz
half-half: 8 cores at 2GHz, 8 cores at 1GHz
all-slow: 16 cores at 1GHz

Problem with current Scheduling

Not taking advantage of fast core


Outline

• Background and Motivation
• Age Based Scheduling (ABS)
• Evaluation
• Conclusion

Motivation for Age Based Scheduling

• Many compute-intensive multithreaded applications follow fork-join model
• Milestones (barriers) in thread execution

[Figure: Application Model. The main thread forks worker threads, which cross a series of barriers before the join.]
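
The fork-join model with barriers as milestones can be sketched with Python's standard `threading` module. The worker count, the phase count and the `milestones` bookkeeping are assumptions chosen for illustration; they do not come from the slides.

```python
import threading

NUM_WORKERS = 4
NUM_PHASES = 3

barrier = threading.Barrier(NUM_WORKERS)  # one milestone shared by all workers
milestones = [0] * NUM_WORKERS            # milestones crossed by each worker


def worker(idx: int) -> None:
    for _ in range(NUM_PHASES):
        # ... this phase's share of the computation would run here ...
        barrier.wait()        # milestone: wait for every sibling thread
        milestones[idx] += 1  # each worker only writes its own slot


# fork: the main thread creates the workers together
workers = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in workers:
    t.start()

# join: the main thread waits for all workers to terminate
for t in workers:
    t.join()
```

Because every worker must reach `barrier.wait()` before any of them continues, the slowest thread in each phase determines when the whole group moves on.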


Symmetry of Applications

• Threads created together are symmetric
  – Based on instruction count
  – Degree of Symmetry = Std Dev / Average

[Figure: Degree of Symmetry of Parsec Benchmarks]

(Symmetric benchmarks are benchmarks with degree of symmetry <= 0.1)
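
A minimal sketch of the Degree of Symmetry metric, computed over per-thread instruction counts. The slide does not say whether the population or the sample standard deviation is meant; the population form (`pstdev`) is assumed here, and the instruction counts are made-up values rather than benchmark measurements.

```python
from statistics import mean, pstdev
from typing import Sequence

SYMMETRY_THRESHOLD = 0.1  # threshold quoted on the slide


def degree_of_symmetry(instruction_counts: Sequence[int]) -> float:
    """Std Dev / Average of the per-thread instruction counts."""
    return pstdev(instruction_counts) / mean(instruction_counts)


def is_symmetric(instruction_counts: Sequence[int]) -> bool:
    return degree_of_symmetry(instruction_counts) <= SYMMETRY_THRESHOLD


# Hypothetical instruction counts, not measurements from the benchmarks.
balanced = [1000, 1010, 990, 1000]
skewed = [1000, 200, 1500, 400]
```

With these inputs, `balanced` falls well under the 0.1 threshold and `skewed` falls well above it.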

Insight

exe_dur(T1) = exe_dur(T2) = exe_dur(T3) = exe_dur(T4)

• Difficult to predict absolute execution duration, so predict relative execution duration

[Figure: threads T1 to T4 running between two barriers; execution duration = ?]

Putting together

• Applications follow fork-join model with milestones in between
• Many applications are symmetric
• Easy to predict relative execution duration to next milestone

Age Based Scheduling

What is Age?

Age is the progress made by a thread towards its next milestone


Age Calculation

• Threads created together have the same age
• As a thread executes, it ages
• Reset age when milestone crossed

tA – age of thread A
tB – age of thread B

[Figure: timeline from creation through execution to a milestone (barrier) and a milestone (termination); labels tA = 0, tA = 30, tA = X, tA = 0 for thread A and tB = 0, tB = 50, tB = 0 for thread B]

X – Unknown, assumed to be a large value
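
The three age rules can be sketched as a small bookkeeping class. The class name, its methods and the unit of progress are assumptions for illustration; the slides only state the rules themselves.

```python
from dataclasses import dataclass


@dataclass
class ThreadAge:
    name: str
    age: int = 0  # progress since the last milestone, in arbitrary units

    def execute(self, progress: int) -> None:
        """The thread ages as it executes."""
        self.age += progress

    def cross_milestone(self) -> None:
        """Crossing a barrier (or terminating) resets the age."""
        self.age = 0


# Threads created together start with the same age.
a, b = ThreadAge("A"), ThreadAge("B")
a.execute(30)  # tA = 30
b.execute(50)  # tB = 50
ages_before = (a.age, b.age)
a.cross_milestone()
b.cross_milestone()
ages_after = (a.age, b.age)
```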


Age Based Scheduling Algorithm

To make a Scheduling decision:
• Calculate remaining execution duration to next milestone based on age
• Assign threads with longer remaining execution durations to fast core
  – Longest Job to Fast Core First (LJFCF)

Application of LJFCF

• Apply whenever
  – Thread is created
  – A core becomes idle
  – Reassignment timer expires (for load balancing)
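
A minimal sketch of the LJFCF rule, assuming the remaining execution durations have already been computed and that cores are listed fastest first. The function name and the core names are hypothetical; the remaining-duration values are the ones used in the deck's own "Working of the Algorithm" example.

```python
from typing import Dict, List


def ljfcf_assign(rem_exe: Dict[str, float], cores: List[str]) -> Dict[str, str]:
    """Map core name -> thread name, longest remaining job to the fastest core.

    `cores` must be ordered fastest first. Threads beyond len(cores) stay
    unassigned in this sketch.
    """
    longest_first = sorted(rem_exe, key=lambda t: rem_exe[t], reverse=True)
    return dict(zip(cores, longest_first))


# Remaining execution durations from the deck's example.
rem_exe = {"A": 50, "B": 70, "C": 90, "D": 30}
cores = ["fast", "slow0", "slow1", "slow2"]  # hypothetical core names
assignment = ljfcf_assign(rem_exe, cores)
```

Here thread C, with the longest remaining duration, ends up on the fast core. The order among the slow cores is an artifact of the sort; since they run at the same speed, a scheduler would presumably leave threads where they are to avoid needless migrations. Such a function would be called at each of the three trigger points listed above.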


Working of the Algorithm

[Figure: timeline of thread T1 from creation through execution to a milestone (barrier) and a milestone (termination), with tA = 0 at creation and tA = 30 during execution]

Age at barrier = X
rem_exe = (X – 30)

Remaining Execution Duration (I)

• Track progress of threads
• Using Prediction [AGE]
  – Predict all threads have same inter-milestone distance

[Figure: timeline from creation through execution to a milestone (barrier) and a milestone (termination); labels tA = 0, tA = X, tA = 0, tA = X for thread A and tB = 0, tB = X for thread B]

Remaining Execution Duration (II)

• Using Profiling [AGE(PROF)]
  – threads have different inter-milestone distances calculated based on a metric obtained by profiling

[Figure: timeline from creation through execution to a milestone (barrier) and a milestone (termination); labels tA = 0, tA = X, tA = 0, tA = X for thread A and tB = 0, tB = rX for thread B]

r is from profiler
Only one r value for each thread
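
Both variants reduce to the same calculation: the predicted inter-milestone distance minus the current age. The sketch below treats AGE as the special case r = 1. The function name, the value chosen for X and the clamp at zero are assumptions for illustration, not details taken from the slides.

```python
def remaining_execution(age: float, x: float, r: float = 1.0) -> float:
    """Predicted remaining execution duration to the next milestone.

    x: assumed inter-milestone distance (the slides' X)
    r: per-thread scaling factor from the profiler; r = 1.0 reproduces AGE,
       where every thread is predicted to have the same distance
    """
    # Assumption: a thread that has run past its prediction has nothing left.
    return max(r * x - age, 0.0)


X = 100.0  # hypothetical value for X

age_pred = remaining_execution(age=30, x=X)          # AGE: X - 30
prof_pred = remaining_execution(age=30, x=X, r=1.5)  # AGE(PROF): 1.5 * X - 30
```

Since only one r value exists per thread, the same factor is reused for every inter-milestone interval of that thread.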

Working of the Algorithm

[Figure: four cores (fast, slow, slow, slow) running threads A, B, C and D with rem_exeA = 50, rem_exeB = 70, rem_exeC = 90 and rem_exeD = 30; threads A and C are singled out, with rem_exeC = 90 and rem_exeA = 50]

Benefit of Age Based Scheduling

• Asymmetry aware
• Utilizes all cores
• Gives all threads opportunities to run on fast cores

Implementation

• OS
  – Track progress using Performance Counters
  – Disable counter on Interrupts
• Compiler (AGE(PROF))
  – Passing profiled information (one value for each thread)


Evaluation

• Simulation based experiments
  – Trace + execution hybrid simulator
  – Locks and barriers are modeled
  – Context switch and migration overhead simulated
  – 10 ms time slice for each thread
• Machine configuration
  – 1 fast, 7 slow, 8:1 speed ratio (others are in the paper)
• Benchmarks
  – Symmetric: Parsec (simmedium input)
  – Asymmetric: Splash-2, OMPSCR, SuperLU

Comparisons with Other Policies

Policy             Description
Linux              Linux O(1) Scheduler
RR                 Threads are assigned to fast cores in a Round Robin fashion
SCALEDLD [Li’07]   Fast Core First assignment, asymmetry aware load balancing (baseline)
FCA-AGE            Fast Core First assignment with Age based periodic reassignment
AGE                Age based assignment and reassignment using prediction
AGE(PROF)          Age based assignment and reassignment using profiling
AGE(ORACLE)        Age based assignment and reassignment using oracle

LJFCF vs Other Policies (I)

• Parsec

[Figure: % Reduction in Execution Time for RR, FCA-AGE, AGE, AGE(PROF) and AGE(ORACLE)]

Policy        Avg % reduction over SCALEDLD
RR            -36.64
FCA-AGE       9.8
AGE           10.4
AGE(PROF)     13.2
AGE(ORACLE)   15.4

Baseline: SCALEDLD

LJFCF vs Other Policies (II)

• Asymmetric Benchmarks

[Figure: % Reduction in Execution Time for FCA-AGE, AGE, AGE(PROF) and AGE(ORACLE)]

Policy        Avg % reduction over SCALEDLD
FCA-AGE       8.2
AGE           7.7
AGE(PROF)     9.4
AGE(ORACLE)   13.1

Baseline: SCALEDLD

Idle Cycles

[Figure: Idle cycles (0% to 100%) split between Slow Cores and Fast Core for blackscholes, bodytrack, fluidanimate and swaptions under Linux, SCALEDLD and AGE]

• Linux Scheduler – Most of the idle cycles contributed by fast core
• SCALEDLD – keeps same thread(s) on fast core
• AGE – assigns different threads to fast core

Different AMP Configurations

• Need for asymmetry aware scheduling increases as cores become more asymmetric

• AGE based policies show more improvement over SCALEDLD as asymmetry increases

[Figure: Normalized execution time of Linux, SCALEDLD, AGE and AGE(PROF) for 2/1-Parsec, 4/1-Parsec, 6/1-Parsec and 8/1-Parsec]

X/1 : Ratio of speeds of Fast and Slow cores is X:1


Conclusion

• Age based scheduling (ABS) for Asymmetric Multiprocessors
  – ABS assumes threads created at the same time are symmetric
  – ABS assigns threads to cores based on their predicted remaining execution durations
  – Predictions are made based on Age of threads
• Improvement of 10.4% (Pred) and 13.2% (Prof) for Parsec and 7.6% (Pred) and 9.4% (Prof) for Asymmetric benchmarks over Li’s mechanism

THANK YOU
