multi-core scheduling optimizations for soft real-time

Multi-core scheduling optimizationsfor soft real-time applications

a cooperation aware approach

co-authors:

Liria Matsumoto Sato

Patrick Bellasi

William Fornaciari

Wolfgang Betz

Lucas De Marchi

sponsors:

Agenda

IntroductionMotivationObjectives

AnalysisOptimization descriptionExperimental resultsConclusions & future works

Introduction – motivation


SMP + RTMultiple processing units inside a processorDeterminism

Parallel Programming ParadigmsData Level Parallelism (DLP)

Competitive tasks

Task Level Parallelism (TLP)Cooperative tasks

Introduction – motivationDLP TLP

T1

T0

T2

T3

T0

T1

T2

Characterization:SynchronizationCommunication


Linux RT scheduler (mainline)Run as soon as possible (based on prio)⇔ Use as many CPUs as possibleOk for DLP!

But, what about TLP?Anyway, why do we care about TLP?

Objectives

Study the behavior of RT Linux scheduler for cooperative tasks

Optimize the RT schedulerSmooth integration into mainline kernel

Don't throw away everything

Analysis – benchmark

Simulation of a scenario where SW replaces HW

Multimedia-likeMixed workload:

DLP + TLPChallenge: map N

tasks to M cores optimally

Analysis – metrics

ThroughputSample mean

time

DeterminismSample variance

Analysis – metrics

Influencing factorsPreemption-disabledIRQ-disabledLocality of wake-up

Intel i7 920:4 cores2.8 GHz

Analysis – locality of wake-upMigrations

wave0: 112010101010101010101030111301010101010131010321010033 wave1: 033333333333333333333313202133333333333303333202332121 wave2: 022022222222222222222222233302222222222222222213322330 wave3: 111101010101010101010101011110101010101010101010101002 mixer0: 023033333333333333333333333202333333333333333333322233 mixer1: 103302201010101010101010101010330101010101010101010101 mixer2: 230201010101010101010101012020101010101010101010101021monitor: 111122222222222222222222231112222222222222222233322303

a) migration patterns (wave0, mixer1 and mixer2)

Analysis – locality of wake-upMigrations

wave0: 302121331102303323303311032320011021111111120101010101 wave1: 223303013333212010121032121012333210320230333333333333 wave2: 033332232302003111201211330021322233323330220222222222 wave3: 110211120211101030031121003030120101101111111010101010 mixer0: 322210301110320231310303021213120013220320230333333333 mixer1: 001003321202220131103203212130320331201031033022010101 mixer2: 230101020201202020121020021010130101201202302010101010monitor: 113022133333131312032133303222223233123111111222222222

b) occasional migrations

Analysis – locality of wake-up

Cache-miss rate measurements

1 CPU 2 CPUs 4 CPUs Increase (1 - 4)

Load 7.58% 8.99% 9.44% +24.5%

Store 9.29% 9.78% 11.62% +25.1%

Analysis – conclusion

Why do we care about TLP?Common parallelization technique

What about TLP?Current state of Linux scheduler is not as

good as we want

Solution – benchmarkAbstraction:

One application level

sends data to another

Serialization (task cooperation)

Parallelization (task com

petition)

w0

w1

w2

w3

m0

m1

m2




Reality:

shared buffers +

synchronization



petition)

w0

w1

w2

w3

m0

m1

m2

buffer




Reality:

shared buffers +

synchronization

Dependencies:

Define dependencies

among tasks in the

opposite way of

data flow



petition)

w0

w1

w2

w3

m0

m1

m2

If data do not go to tasks, then the tasks go to where data were produced

Make tasks run on same CPU of their

dependencies

Solution – ideaDependency followers

Measurement tools

Ftracesched_switch tool

(Carsten Emde)*gtkwaveperfadhoc scripts

* http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html

Solution – task-affinityDependency followers

Task-affinity:

selection of the CPU in which a task (e.g. m0) executes takes into consideration the CPUs in which its dependencies (e.g. w0 and w1) ran last time.

w0

w1

w2

w3

m0

m1

m2

Solution – task-affinityimplementation

2 lists inside each task_struct:taskaffinity_listfollowme_list

2 system calls to add/delete affinities:sched_add_taskaffinitysched_del_taskaffinity

1 core 2 cores 4 cores

5.00%

6.00%

7.00%

8.00%

9.00%

10.00%

11.00%

12.00%

9.29%

9.78%

11.62%

10.17%

8.68%

10.35%

mainline task-aff inity

% o

f sto

re c

ach

e-m

iss

1 core 2 cores 4 cores

5.00%

5.50%

6.00%

6.50%

7.00%

7.50%

8.00%

8.50%

9.00%

9.50%

10.00%

7.58%

8.99%

9.44%

7.92% 7.88%

8.58%

mainline task-af f inity

% o

f lo

ad

ca

che

-mis

s

Experimental resultsCache-miss rates

More data/instruction hit

the L1 cache

less cache lines are invalidated

Measurements without and with task-affinity

Experimental resultsWhat exactly to evaluate?

w0

w1

w2

w3

m0

m1

m2

Cache-miss rate is not exactly what we want to optimize

Optimization objectives:Lower the time to

produce a single sample

Increase determinism on production of several samples

w ave0 w ave1 w ave2 w ave3 mixer0 mixer1 mixer2

8

10

12

14

16

18

20

8.81

9.479.21 9.19

16.34

15.41

17.55

9.15

9.649.47

9.84

17.39

16.69

17.82

task-aff inity mainline

ave

rag

e e

xecu

tion

tim

e (

µs

)Experimental resultsAverage execution time of each task

w0

w1

w2

w3

m0

m1

m2

w ave0 w ave1 w ave2 w ave3 mixer0 mixer1 mixer2

0

0.5

1

1.5

2

2.5

0.22

0.31 0.290.36

0.61

0.380.35

0.570.51 0.48

0.83

1.45

2.19

0.47

task-aff inity mainline

vari

an

ce o

f th

e e

xecu

tion

tim

e (

µs

)²Experimental resultsVariance of execution time of each task

w0

w1

w2

w3

m0

m1

m2

Experimental resultsProduction time of each single sample

mainline task-affinity

Results obtained for 150,000 samples

Experimental resultsProduction time of each single sample

Empiric repartition functionReal-time metric (normal distribution):

average + 2 * standard deviation

02

46

810

1214

1618

2022

2426

2830

3234

3638

4042

4446

4850

5254

5658

6062

6466

6870

7274

7678

8082

8486

8890

9294

9698

100102

104106

108110

112114

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

90.00%

100.00%

time (us)

% o

f sa

mp

les

41.2 µs

01

23

45

67

8910

1112

1314

1516

1718

1920

2122

2324

2526

2728

2930

3132

3334

3536

3738

3940

4142

4344

4546

4748

4950

51

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

90.00%

100.00%

time (us)

% o

f sa

mp

les

38.96 µs

Max: 114Min: 29Avg: 37.8Variance: 3.22

Max: 51Min: 35Avg: 38.0Variance: 0.21

mainline

task-affinity

Experimental resultssummary

~15x

Conclusion & future works

Average execution time is almost the sameDeterminism for real-time applications is

improved

Future works:Better focus on temporal localityImprove task-affinity configurationTest on other architecturesClean up the repository

Conclusion & future works

Still a Work In ProgressGit repository:

git://git.politreco.com/linux-lcs.git

Contact:[email protected]@gmail.com

mailto:[email protected]

mailto:[email protected]

Solution – Linux schedulerDependency followers

Linux scheduler: Change the decision process of the CPU in which a task executes when it is woken up

Mainscheduler

Periodicscheduler

CoreScheduler

CPUi

Contextswitch

RT IdleFair

RR FIFO Normal

Batch

Idle

Selecttask

Schedulerclasses

Schedulerpolicies

multi-core scheduling optimizations for soft real-time

Documents