implementation study - courses (please see the ubc course...

133
1 Real-Time Scalability On the Scalability of Real-Time Scheduling Algorithms on Multicore Platforms: A Case Study Sathish Gopalakrishnan The University of British Columbia (based on work by others at the University of North Carolina)

Upload: others

Post on 19-Aug-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

1 Real-Time Scalability

On the Scalability of Real-Time Scheduling Algorithms on

Multicore Platforms: A Case Study

Sathish Gopalakrishnan The University of British Columbia

(based on work by others at the University of North Carolina)

Page 2: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

2 Real-Time Scalability

Focus of this Talk

 Multicore platforms are predicted to get much larger in the future. » 10s or 100s of cores per chip, multiple

hardware threads per core.

 Research Question: How will different real-time scheduling algorithms scale?

» Scalability is defined w.r.t. schedulability (more on this later).

Page 3: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

3 Real-Time Scalability

Outline

 Background. » Real-time workload assumed. » Scheduling algorithms evaluated. » Some properties of these algorithms.

 Research questions addressed.  Experimental results.  Observations/speculation.  Future work.

Page 4: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

4 Real-Time Scalability

Real-Time Workload Assumed in this Talk

  Set τ of periodic tasks scheduled on M cores:

0 10 20 30

T = (2,5)

5 15 25

U = (9,15)

2 5

One Core Here

Page 5: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

5 Real-Time Scalability

Real-Time Workload Assumed in this Talk

  Set τ of periodic tasks scheduled on M cores: »  Task T = (T.e,T.p) releases a job with exec. cost T.e every

T.p time units. –  T’s utilization (or weight) is U(T) = T.e/T.p. –  Total utilization is U(τ) = ΣT T.e/T.p.

0 10 20 30

T = (2,5)

5 15 25

U = (9,15)

2 5

One Core Here

Page 6: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

6 Real-Time Scalability

Real-Time Workload Assumed in this Talk

  Set τ of periodic tasks scheduled on M cores: »  Task T = (T.e,T.p) releases a job with exec. cost T.e every

T.p time units. –  T’s utilization (or weight) is U(T) = T.e/T.p. –  Total utilization is U(τ) = ΣT T.e/T.p.

0 10 20 30

T = (2,5)

5 15 25

U = (9,15)

2 5

One Core Here

Page 7: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

7 Real-Time Scalability

Real-Time Workload Assumed in this Talk

  Set τ of periodic tasks scheduled on M cores: »  Task T = (T.e,T.p) releases a job with exec. cost T.e every

T.p time units. –  T’s utilization (or weight) is U(T) = T.e/T.p. –  Total utilization is U(τ) = ΣT T.e/T.p.

»  Each job of T has a deadline at the next job release of T.

0 10 20 30

T = (2,5)

5 15 25

U = (9,15)

2 5

One Core Here

Page 8: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

8 Real-Time Scalability

Real-Time Workload Assumed in this Talk

  Set τ of periodic tasks scheduled on M cores: »  Task T = (T.e,T.p) releases a job with exec. cost T.e every

T.p time units. –  T’s utilization (or weight) is U(T) = T.e/T.p. –  Total utilization is U(τ) = ΣT T.e/T.p.

»  Each job of T has a deadline at the next job release of T.

0 10 20 30

T = (2,5)

5 15 25

U = (9,15)

2 5

One Core Here

Page 9: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

9 Real-Time Scalability

Real-Time Workload Assumed in this Talk

  Set τ of periodic tasks scheduled on M cores: »  Task T = (T.e,T.p) releases a job with exec. cost T.e every

T.p time units. –  T’s utilization (or weight) is U(T) = T.e/T.p. –  Total utilization is U(τ) = ΣT T.e/T.p.

»  Each job of T has a deadline at the next job release of T.

0 10 20 30

T = (2,5)

5 15 25

U = (9,15)

2 5

One Core Here

This is an earliest-deadline-first schedule. Much of our work pertains to EDF scheduling…

Page 10: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

10 Real-Time Scalability

Scheduling vs. Schedulability

 W.r.t. scheduling, we actually care about two kinds of algorithms: » Scheduling algorithm (of course).

– Example: Earliest-deadline-first (EDF): Jobs with earlier deadlines have higher priority.

» Schedulability test.

Test for EDF

τ yes no

no timing requirement will be violated if τ is scheduled with EDF

a timing requirement will (or may) be violated …

Page 11: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

11 Real-Time Scalability

Multiprocessor Real-Time Scheduling

Two Approaches:

Steps: 1.  Assign tasks to processors (bin

packing). 2.  Schedule tasks on each

processor using a uniprocessor algorithm.

Partitioning Global Scheduling

Important Differences: •  One task queue. •  Tasks may migrate among

the processors.

Page 12: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

12 Real-Time Scalability

Scheduling Algorithms Considered

 Partitioned EDF: PEDF.  Preemptive & Non-preemptive Global

EDF: GEDF & NP-GEDF.  Clustered EDF: CEDF.

» Partition onto clusters of cores, globally schedule within each cluster

L2

From other 8 cores…

L1

C C C C

L1

C C C C clusters

Page 13: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

13 Real-Time Scalability

Scheduling Algorithms (Continued)

 PD2, a global Pfair algorithm. » Schedule jobs one quantum at a time at a

“uniform” rate. – May preempt and migrate jobs frequently.

 Staggered PD2: S-PD2. » Same as PD2 but quanta are “staggered” to

avoid excessive bus contention.

Page 14: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

14 Real-Time Scalability

PD2 Example

  Under partitioning & most global algorithms, overall utilization must be capped to avoid deadline misses. »  Due to connections to bin-packing.

  Exception: Global “Pfair” algorithms do not require caps. »  Such algorithms schedule jobs one quantum at a time.

– May therefore preempt and migrate jobs frequently. –  Perhaps less of a concern on a multicore platform.

  Under most global algorithms, if utilization is not capped, deadline tardiness is bounded. »  Sufficient for soft real-time systems.

3 tasks with parameters (2,3) on two processors…

0 10 20 30

T = (2,3)

5 15 25

U = (2,3)

V = (2,3)

On Processor 1 On Processor 2

Page 15: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

15 Real-Time Scalability

Schedulability

 HRT: No deadline is missed.  SRT: Deadline tardiness is bounded.  For some scheduling algorithms,

utilization loss is inherent when checking schedulability. » That is, schedulability cannot be

guaranteed for all task systems with total utilization at most M.

Page 16: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

16 Real-Time Scalability

Example: PEDF

  Under partitioning & most global algorithms, overall utilization must be capped to avoid deadline misses. »  Due to connections to bin-packing.

  Exception: Global “Pfair” algorithms do not require caps. »  Such algorithms schedule jobs one quantum at a time.

– May therefore preempt and migrate jobs frequently. –  Perhaps less of a concern on a multicore platform.

  Under most global algorithms, if utilization is not capped, deadline tardiness is bounded. »  Sufficient for soft real-time systems.

Example: Partitioning three tasks with parameters (2,3) on two processors will overload one processor.

In terms of bin-packing…

Processor 1 Processor 2

Task 1

Task 2 Task 3

0

1

Page 17: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

17 Real-Time Scalability

Schedulability Summary

HRT SRT

PEDF util. loss util. loss (same as HRT) GEDF util. loss no loss NP-GEDF util. loss no loss CEDF util. loss util. loss (not as bad as PEDF)

PD2 no loss no loss S-PD2 slight loss no loss

(must shrink periods by one quantum)

Page 18: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

18 Real-Time Scalability

GEDF SRT Example

  Under partitioning & most global algorithms, overall utilization must be capped to avoid deadline misses. »  Due to connections to bin-packing.

  Exception: Global “Pfair” algorithms do not require caps. »  Such algorithms schedule jobs one quantum at a time.

– May therefore preempt and migrate jobs frequently. –  Perhaps less of a concern on a multicore platform.

  Under most global algorithms, if utilization is not capped, deadline tardiness is bounded. »  Sufficient for soft real-time systems.

Earlier example with GEDF…

0 10 20 30

T = (2,3)

5 15 25

U = (2,3)

V = (2,3)

Tardiness is at most one quantum.

Page 19: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

19 Real-Time Scalability

Outline

 Background. » Real-time workload assumed. » Scheduling algorithms evaluated. » Some properties of these algorithms.

 Research questions addressed.  Experimental results.  Observations/speculation.  Future work.

Page 20: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

20 Real-Time Scalability

Research Questions

  In theory, PD2 is always preferable. »  It is optimal (no utilization loss).

 What about in practice? » That is, what happens if system overheads

are taken into account?  Do migrations really matter on a

multicore platform with a shared cache?  As multicore platforms get larger, will

global algorithms scale?

Focus of this Talk: An Experimental comparison of these scheduling algorithms on the basis of schedulability.

Page 21: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

21 Real-Time Scalability

Test System

  HW platform: Sun Niagara (UltraSPARC T1).

– OS has 32 “logical CPUs” to manage. –  Far larger than any system considered before in RT literature. – Note: CEDF “cluster” = 4 HW threads on a core.

Core 1 Core 8

L1 L1

L2

… •  1.2 GHz “RISC-like” cores.

•  Relatively simple, e.g., no instr. reordering or branch prediction.

•  Caches somewhat small compared to Intel.

4 HW threads per core

16K (8K) L1 instr. (data) cache per core Shared 3MB L2

Page 22: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

22 Real-Time Scalability

Test System (Cont’d)

 Operating System: LITMUSRT: LInux Testbed for MUltiprocessor Scheduling in Real-Time systems. » Developed at UNC. » Extends Linux by allowing different schedulers to

be linked as “plug-in” components. » Several (real-time) synchronization protocols are

also supported. » Code is available at http://www.cs.unc.edu/

~anderson/litmus-rt/.

Page 23: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

23 Real-Time Scalability

Methodology

 Ran several hundred (synthetic) task sets on the test system.

 Collected 70 GB of raw overhead samples.  Distilled expressions for average (for SRT)

and worst-case (for HRT) overheads.  Conducted schedulability experiments

involving 8.5 million randomly-generated task sets with overheads considered.

Note: This step is offline. It does not involve the Niagara.

Page 24: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

24 Real-Time Scalability

Kinds of Overheads

  Tick scheduling overhead. »  Incurred when the kernel is invoked at the beginning of

each quantum (timer “tick”). A quantum is 1ms.   Release overhead.

»  Incurred when the kernel is invoked to handle a job release.

  Scheduling overhead. »  Incurred when the scheduler (in the kernel) is invoked.

  Context-switching overhead. »  Non-cache-related costs associated with a context switch.

  Preemption/migration overhead. »  Costs incurred upon a preemption/migration due to a loss

of cache affinity.

These overheads can be accounted for in schedulability tests by inflating

job execution costs.

(Doing this correctly is a little tricky.)

Page 25: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

25 Real-Time Scalability

Kernel Overheads

Alg Scheduling Overhead (in µs) PD2 32.7 S-PD2 43.1 GEDF/NP-GEDF 55.2+.26N (N = no. of tasks)

 Most overheads were small (2-15µs) except worst-case overheads impacted by global queues. » Most notable: Worst-case scheduling overheads

for PD2, S-PD2, and GEDF/NP-GEDF:

Page 26: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

26 Real-Time Scalability

Preemption/Migration Overheads

 Obtained by measuring synthetic tasks, each with a 64K working set & 75/25 read/write ratio. »  Interesting trends: PD2 is terrible, staggering really

helps, preempt. cost ≈ mig. cost per algorithm, but algorithms that migrate have higher costs.

Worst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster Mig Inter-Cluster Mig PD2 681.1 649.4 654.2 681.1 S-PD2 104.1 103.4 103.4 104.1 GEDF 375.4 375.4 326.8 321.1 CEDF 171.6 171.6 167.3 --- PEDF 139.1 139.1 --- ---

Page 27: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

27 Real-Time Scalability

Schedulability Results

  Generated random tasks using 6 distributions and checked schedulability using “state-of-the-art” tests (with overheads considered). »  8.5 million task sets in total.

  Distributions: »  Utilizations uniform over

–  [0.001,01] (light), –  [0.1,0.4] (medium), and –  [0.5,09] (heavy).

»  Bimodal with utilizations distributed over either [0.001,05) or [0.5,09] with probabilities of –  8/9 and 1/9 (light), –  6/9 and 3/9 (medium), and –  4/9 and 5/9 (heavy).

Page 28: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

28 Real-Time Scalability

Schedulability Results

  Generated random tasks using 6 distributions and checked schedulability using “state-of-the-art” tests (with overheads considered). »  8.5 million task sets in total.

  Distributions: »  Utilizations uniform over

–  [0.001,01] (light), –  [0.1,0.4] (medium), and –  [0.5,09] (heavy).

»  Bimodal with utilizations distributed over either [0.001,05) or [0.5,09] with probabilities of –  8/9 and 1/9 (light), –  6/9 and 3/9 (medium), and –  4/9 and 5/9 (heavy).

will only show graphs for these

Page 29: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

29 Real-Time Scalability

HRT Summary

  PEDF usually wins. »  Exception: Lots of heavy tasks (makes bin-packing

hard).   S-PD2 usually does well.

»  Staggering has an impact.

  PD2 and GEDF are quite poor. »  PD2 is negatively impacted by high preemption and

migration costs due to aligned quanta. »  GEDF suffers from high scheduling costs (due to

the global queue).

Page 30: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

30 Real-Time Scalability

HRT, Bimodal Light

PEDF peforms pretty well if most task utilizations are low.

S-PD2 performs pretty well too.

Page 31: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

31 Real-Time Scalability

HRT, Bimodal Light

Page 32: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

32 Real-Time Scalability

HRT, Bimodal Medium

In this and the next slide, as the fraction of heavy tasks grows, the gap between S-PD2 and PEDF narrows.

Page 33: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

33 Real-Time Scalability

HRT, Bimodal Medium

Page 34: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

34 Real-Time Scalability

HRT, Bimodal Heavy

Page 35: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

35 Real-Time Scalability

SRT Summary

 PEDF is not as effective as before, but still OK in light-mostly cases.

 CEDF performs the best in most cases.  S-PD2 still performs generally well.  GEDF is still negatively impacted by

higher scheduling costs. » Note: SRT schedulability for GEDF entails

no utilization loss. » NP-GEDF and GEDF are about the same.

 Note: The scale is different from before.

Page 36: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

36 Real-Time Scalability

SRT, Bimodal Light

PEDF and CEDF perform well if tasks are mostly light.

Note: S-PD2 never performs really badly in any experiment.

Page 37: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

37 Real-Time Scalability

SRT, Bimodal Light

Page 38: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

38 Real-Time Scalability

SRT, Bimodal Medium

This and the next slide show that as the frequency of heavy tasks increases, PEDF degrades. CEDF isn’t affected by this increase much.

Page 39: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

39 Real-Time Scalability

SRT, Bimodal Medium

Page 40: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

40 Real-Time Scalability

SRT, Bimodal Heavy

Page 41: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

41 Real-Time Scalability

Outline

 Background. » Real-time workload assumed. » Scheduling algorithms evaluated. » Some properties of these algorithms.

 Research questions addressed.  Experimental results.  Observations/speculation.  Future work.

Page 42: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

42 Real-Time Scalability

Observations/Speculation

  Global algorithms are really sensitive to how shared queues are implemented. »  Saw 100X performance improvement by switching

from linked lists to binomial heaps. »  Still working on this… »  Speculation: Can reduce GEDF costs to close to

PEDF costs for systems with ≤ 32 cores.   Per algorithm, preempt. cost ≈ mig. cost.

»  Due to having a shared cache. »  One catch: Migrations increase both costs.

  Quantum staggering is very effective.

Page 43: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

43 Real-Time Scalability

Observations/Speculation (Cont’d)

 No one “best” algorithm.   Intel has claimed they will produce an 80-

core general-purpose chip. If they do… »  the cores will have to be simple ⇒ high

execution costs ⇒ high utilizations ⇒ PEDF will suffer;

»  “pure” global algorithms will not scale; » some instantiation of CEDF (or maybe CS-

PD2) will hit the “sweet spot”.

Page 44: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

44 Real-Time Scalability

Future Work

 Thoroughly study “how to implement shared queues”.

 Repeat this study on Intel and embedded machines.

 Examine mixed HRT/SRT workloads.  Factor in synchronization and dynamic

behavior. »  In past work, PEDF was seen to be more

negatively impacted by these things.

Page 45: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

45 Real-Time Scalability

Thanks!

 Questions?

Page 46: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

46 Real-Time Scalability

SRT Tardiness, Uniform Medium

Page 47: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

47 Real-Time Scalability

Measuring Overheads

  Done using a UNC-produced tracer called Feather-Trace. »  http://www.cs.unc.edu/~bbb/feathertrace/

  Highest 1% of values were tossed. »  Eliminates “outliers” due to non-deterministic

behavior in Linux, warm-up effects, etc.   Used worst-case (average-case) values for

HRT (SRT) schedulability.   Used linear regression analysis to produce

linear (in the task count) overhead expressions.

Page 48: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

48 Real-Time Scalability

Obtaining Kernel Overheads

 Ran 90 (synthetic) task sets per scheduling algorithm for 30 sec.

  In total, over 600 million individual overheads were recorded (45 GB of data).

Page 49: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

49 Real-Time Scalability

Kernel Overheads (in µs) (N = no. of tasks)

Worst-Case Alg Tick Schedule Context SW Release PD2 11.2 +.3N 32.7 3.1+.01N --- S-PD2 4.8+.3N 43.1 3.2+.003N --- GEDF 3+.003N 55.2+.26N 29.2 45+.3N CEDF 3.2 14.8+.01N 6.1 30.3 PEF 2.7+.002N 8.6+.01N 14.9+.04N 4.7+.009N

Average Alg Tick Schedule Context SW Release PD2 4.3+.03N 4.7 2.6+.001N --- S-PD2 2.1+.02N 4.2 2.5+.001N --- GEDF 2.1+.002N 11.8+.06N 7.6 5.8+.1N CEDF 2.8 6.1+.01N 3.2 16.5 PEDF 2.1+.002N 2.7+.008N 4.7+.005N 4+.005N

Page 50: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

50 Real-Time Scalability

Kernel Overheads (in µs) (N = no. of tasks)

Worst-Case Alg Tick Schedule Context SW Release PD2 11.2 +.3N 32.7 3.1+.01N --- S-PD2 4.8+.3N 43.1 3.2+.003N --- GEDF 3+.003N 55.2+.26N 29.2 45+.3N CEDF 3.2 14.8+.01N 6.1 30.3 PEF 2.7+.002N 8.6+.01N 14.9+.04N 4.7+.009N

Average Alg Tick Schedule Context SW Release PD2 4.3+.03N 4.7 2.6+.001N --- S-PD2 2.1+.02N 4.2 2.5+.001N --- GEDF 2.1+.002N 11.8+.06N 7.6 5.8+.1N CEDF 2.8 6.1+.01N 3.2 16.5 PEDF 2.1+.002N 2.7+.008N 4.7+.005N 4+.005N

Page 51: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

51 Real-Time Scalability

Obtaining Preemption/Migration Overheads

  Ran 90 (synthetic) task sets per scheduling algorithm for 60 sec.

  Each task has a 64K working set (WS) that it accesses repeatedly with a 75/25 read/write ratio.

  Recorded time to access WS after preemption/migration minus “cache-warm access”.

  In total, over 105 million individual preemption/migration overheads were recorded (15 GB of data).

Page 52: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

52 Real-Time Scalability

Preemption/Migration Overheads (in µs) (N = no. of tasks)

Worst-Case Alg Overall Preemption Intra-Cluster Mig Inter-Cluster Mig PD2 681.1 649.4 654.2 681.1 S-PD2 104.1 103.4 103.4 104.1 GEDF 375.4 375.4 326.8 321.1 CEDF 171.6 171.6 167.3 --- PEDF 139.1 139.1 --- ---

Average Alg Overall Preemption Intra-Cluster Mig Inter-Cluster Mig PD2 172 131.4 141.8 187.6 S-PD2 89.3 86.2 87.8 90.2 GEDF 73 95.1 73.5 72.6 CEDF 67 78.5 64.8 --- PEDF 72.3 72.3 --- ---

Page 53: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

53 Real-Time Scalability

Preemption/Migration Overheads (in µs) (N = no. of tasks)

Worst-Case Alg Overall Preemption Intra-Cluster Mig Inter-Cluster Mig PD2 681.1 649.4 654.2 681.1 S-PD2 104.1 103.4 103.4 104.1 GEDF 375.4 375.4 326.8 321.1 CEDF 171.6 171.6 167.3 --- PEDF 139.1 139.1 --- ---

Average Alg Overall Preemption Intra-Cluster Mig Inter-Cluster Mig PD2 172 131.4 141.8 187.6 S-PD2 89.3 86.2 87.8 90.2 GEDF 73 95.1 73.5 72.6 CEDF 67 78.5 64.8 --- PEDF 72.3 72.3 --- ---

Page 54: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

54 Real-Time Scalability

HRT, Uniform Light

This is the easiest case for partitioning, so PEDF wins.

S-PD2 does pretty well too.

Page 55: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

55 Real-Time Scalability

HRT, Uniform Light

Page 56: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

56 Real-Time Scalability

HRT, Uniform Medium

Similar to before.

Utilizations aren’t high enough to start causing problems for partitioning.

Page 57: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

57 Real-Time Scalability

HRT, Uniform Medium

Page 58: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

58 Real-Time Scalability

HRT, Uniform Heavy

Utilizations are high enough to cause problems for partitioning.

S-PD2 wins now.

Page 59: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

59 Real-Time Scalability

HRT, Uniform Heavy

Page 60: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

60 Real-Time Scalability

SRT, Uniform Light

PEDF wins, S-PD2 performs pretty well.

Page 61: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

61 Real-Time Scalability

SRT, Uniform Light

Page 62: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

62 Real-Time Scalability

SRT, Uniform Medium

CEDF really benefits from using a “no utilization loss” schedulability test within each cluster.

Page 63: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

63 Real-Time Scalability

SRT, Uniform Medium

Page 64: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

64 Real-Time Scalability

SRT, Uniform Heavy

GEDF and NP-GEDF actually win in this case.

CEDF and S-PD2 perform pretty well.

PEDF loses.

Page 65: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

65 Real-Time Scalability

SRT, Uniform Heavy

Page 66: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementationof Global Real-Time Schedulers

Sathish GopalakrishnanThe University of British Columbia

Work supported by IBM, SUN, and Intel Corps., NSF grants CNS 0834270, CNS 0834132, and CNS 0615197, and ARO grant W911NF-06-1-0425.

Simon Fraser UniversityApril 15, 2010

Tuesday, April 5, 2011

Page 67: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

UNC’s Implementation Studies (I)

2

Calandrino et al. (2006)➡ Are commonly-studied RT schedulers implementable?➡ In Linux on common hardware platforms?

Calandrino et al. (2006), LITMUSRT: A testbed for empirically comparing real-time multiprocessor schedulers. In: Proceedings of the 27th IEEE Real-Time Systems Symposium, pages 111–123.Brandenburg et al. (2008), On the scalability of real-time scheduling algorithms on multicore platforms: A case study. In: Proceedings of the 29th IEEE Real-Time Systems Symposium, pages 157–169.

Tuesday, April 5, 2011

Page 68: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Main M

emory

L2 Cache

L2 Cache

Proc. 1

Proc. 2

Proc. 4

Proc. 3

L2 Cache

L2 Cache

UNC’s Implementation Studies (I)

3

Calandrino et al. (2006), LITMUSRT: A testbed for empirically comparing real-time multiprocessor schedulers. In: Proceedings of the 27th IEEE Real-Time Systems Symposium, pages 111–123.Brandenburg et al. (2008), On the scalability of real-time scheduling algorithms on multicore platforms: A case study. In: Proceedings of the 29th IEEE Real-Time Systems Symposium, pages 157–169.

Intel 4x 2.7 GHz Xeon SMP(few, fast processors; private caches)

Calandrino et al. (2006)➡ Are commonly-studied RT schedulers implementable?➡ In Linux on common hardware platforms?

Tuesday, April 5, 2011

Page 69: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Main M

emory

L2 Cache

L2 Cache

Proc. 1

Proc. 2

Proc. 4

Proc. 3

L2 Cache

L2 Cache

UNC’s Implementation Studies (I)

4

Calandrino et al. (2006), LITMUSRT: A testbed for empirically comparing real-time multiprocessor schedulers. In: Proceedings of the 27th IEEE Real-Time Systems Symposium, pages 111–123.Brandenburg et al. (2008), On the scalability of real-time scheduling algorithms on multicore platforms: A case study. In: Proceedings of the 29th IEEE Real-Time Systems Symposium, pages 157–169.

G-EDF

P-EDF

S-PD2

G-NP-EDF

PD2

partitioned EDF

2 x global EDF

2 x PFAIR

Calandrino et al. (2006)➡ Are commonly-studied RT schedulers implementable?➡ In Linux on common hardware platforms?

Tuesday, April 5, 2011

Page 70: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Main M

emory

L2 Cache

L2 Cache

Proc. 1

Proc. 2

Proc. 4

Proc. 3

L2 Cache

L2 Cache

UNC’s Implementation Studies (I)

5

Calandrino et al. (2006), LITMUSRT: A testbed for empirically comparing real-time multiprocessor schedulers. In: Proceedings of the 27th IEEE Real-Time Systems Symposium, pages 111–123.Brandenburg et al. (2008), On the scalability of real-time scheduling algorithms on multicore platforms: A case study. In: Proceedings of the 29th IEEE Real-Time Systems Symposium, pages 157–169.

G-EDF

P-EDF

S-PD2

G-NP-EDF

PD2

Calandrino et al. (2006)➡ Are commonly-studied RT schedulers implementable?➡ In Linux on common hardware platforms?

“for each tested scheme, scenarios exist in which it is a viable choice”

Tuesday, April 5, 2011

Page 71: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

UNC’s Implementation Studies (II)

6

Brandenburg et al. (2008)➡ What if there are many slow processors?

Calandrino et al. (2006), LITMUSRT: A testbed for empirically comparing real-time multiprocessor schedulers. In: Proceedings of the 27th IEEE Real-Time Systems Symposium, pages 111–123.Brandenburg et al. (2008), On the scalability of real-time scheduling algorithms on multicore platforms: A case study. In: Proceedings of the 29th IEEE Real-Time Systems Symposium, pages 157–169.

G-EDF

P-EDF

S-PD2

G-NP-EDF

PD2

Main M

emory

L2 Cache

L2 Cache

Proc. 1

Proc. 2

Proc. 4

Proc. 3

L2 Cache

L2 Cache

Tuesday, April 5, 2011

Page 72: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Main

Mem

ory

L2 Cache

L1 Cache

Core 1

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 2

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 3

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 4

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 6

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 5

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 7

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 8

Thread

1

Thread

2

Thread

3

Thread

4

UNC’s Implementation Studies (II)

7

Brandenburg et al. (2008)➡ What if there are many slow processors?➡ Explored scalability of RT schedulers on a Sun Niagara.

Calandrino et al. (2006), LITMUSRT: A testbed for empirically comparing real-time multiprocessor schedulers. In: Proceedings of the 27th IEEE Real-Time Systems Symposium, pages 111–123.Brandenburg et al. (2008), On the scalability of real-time scheduling algorithms on multicore platforms: A case study. In: Proceedings of the 29th IEEE Real-Time Systems Symposium, pages 157–169.

G-EDF

P-EDF

S-PD2

G-NP-EDF

PD2

Tuesday, April 5, 2011

Page 73: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Main

Mem

ory

L2 Cache

L1 Cache

Core 1

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 2

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 3

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 4

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 6

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 5

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 7

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 8

Thread

1

Thread

2

Thread

3

Thread

4

UNC’s Implementation Studies (II)

8

Brandenburg et al. (2008)➡ What if there are many slow processors?➡ Explored scalability of RT schedulers on a Sun Niagara.

Calandrino et al. (2006), LITMUSRT: A testbed for empirically comparing real-time multiprocessor schedulers. In: Proceedings of the 27th IEEE Real-Time Systems Symposium, pages 111–123.Brandenburg et al. (2008), On the scalability of real-time scheduling algorithms on multicore platforms: A case study. In: Proceedings of the 29th IEEE Real-Time Systems Symposium, pages 157–169.

G-EDF

P-EDF

S-PD2

G-NP-EDF

PD2

G-EDF: high overheads, low schedulability.

Tuesday, April 5, 2011

Page 74: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Today’s discussion

9

How to implement global schedulers?

Calandrino et al. (2006), LITMUSRT: A testbed for empirically comparing real-time multiprocessor schedulers. In: Proceedings of the 27th IEEE Real-Time Systems Symposium, pages 111–123.Brandenburg et al. (2008), On the scalability of real-time scheduling algorithms on multicore platforms: A case study. In: Proceedings of the 29th IEEE Real-Time Systems Symposium, pages 157–169.

Main

Mem

ory

L2 Cache

L1 Cache

Core 1

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 2

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 3

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 4

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 6

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 5

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 7

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 8

Thread

1

Thread

2

Thread

3

Thread

4

G-EDF

P-EDF

S-PD2

G-NP-EDF

PD2

Tuesday, April 5, 2011

Page 75: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Main

Mem

ory

L2 Cache

L1 Cache

Core 1

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 2

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 3

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 4

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 6

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 5

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 7

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 8

Thread

1

Thread

2

Thread

3

Thread

4

Today’s discussion

10

How to implement global schedulers?➡ Explore how implementation tradeoffs affect schedulability.

Calandrino et al. (2006), LITMUSRT: A testbed for empirically comparing real-time multiprocessor schedulers. In: Proceedings of the 27th IEEE Real-Time Systems Symposium, pages 111–123.Brandenburg et al. (2008), On the scalability of real-time scheduling algorithms on multicore platforms: A case study. In: Proceedings of the 29th IEEE Real-Time Systems Symposium, pages 157–169.

Instead of considering

one implementation

of severaldifferent scheduling

algorithms…

Tuesday, April 5, 2011

Page 76: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Main

Mem

ory

L2 Cache

L1 Cache

Core 1

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 2

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 3

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 4

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 6

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 5

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 7

Thread

1

Thread

2

Thread

3

Thread

4

L1 Cache

Core 8

Thread

1

Thread

2

Thread

3

Thread

4

Today’s discussion

11

How to implement global schedulers?➡ Explore how implementation tradeoffs affect schedulability.➡ Case study: nine G-EDF variants on a Sun Niagara.

Calandrino et al. (2006), LITMUSRT: A testbed for empirically comparing real-time multiprocessor schedulers. In: Proceedings of the 27th IEEE Real-Time Systems Symposium, pages 111–123.Brandenburg et al. (2008), On the scalability of real-time scheduling algorithms on multicore platforms: A case study. In: Proceedings of the 29th IEEE Real-Time Systems Symposium, pages 157–169.

G-EDF G-EDF

G-EDF G-EDF

G-EDF

G-EDF G-EDF

G-EDF

G-EDF

Tuesday, April 5, 2011

Page 77: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Design Choices

12Tuesday, April 5, 2011

Page 78: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Design Choices

13

➡ When to schedule.➡ Quantum alignment.➡ How to handle interrupts.➡ How to queue pending jobs.➡ How to manage future releases.➡ How to avoid unnecessary preemptions.

Tuesday, April 5, 2011

Page 79: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Scheduler Invocation

14Tuesday, April 5, 2011

Page 80: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Scheduler Invocation

15

Event-Driven➡ on job release➡ on job completion➡ preemptions occur

immediately

P1

P2T x

1T y2

T z3 T y

2

5 10 150

release completion

Tuesday, April 5, 2011

Page 81: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

release completion

Scheduler Invocation

16

Event-Driven➡ on job release➡ on job completion➡ preemptions occur

immediately

Quantum-Driven➡ on every timer tick➡ easier to implement➡ on release a job is just

enqueued; scheduler is invoked at next tick

P1

P2T x

1T y2

T z3 T y

2

5 10 150

P1

P2T x

1T y2

T z3 T y

2

delay partially-used quantum

5 10 150

Tuesday, April 5, 2011

Page 82: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Quantum Alignment

17

Aligned➡ Tick synchronized

across processors.➡ Contention at

quantum boundary!

P1

P2

5 10 150release completion

Tuesday, April 5, 2011

Page 83: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Quantum Alignment

18

Aligned➡ Tick synchronized

across processors.➡ Contention at

quantum boundary!

Staggered➡ Ticks spread out

across quantum.➡ Reduced bus and

lock contention.➡ Additional latency.

P1

P2

5 10 150

P1

P2

5 10 150

release completion

Tuesday, April 5, 2011

Page 84: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Quantum Alignment

19

Aligned➡ Tick synchronized

across processors.➡ Contention at

quantum boundary!

Staggered➡ Ticks spread out

across quantum.➡ Reduced bus and

lock contention.➡ Additional latency.

P1

P2T x

1T y2

T z3 T y

2

delay partially-used quantum

5 10 150

P1

P2

5 10 150

release completion

Tuesday, April 5, 2011

Page 85: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Quantum Alignment

20

Aligned➡ Tick synchronized

across processors.➡ Contention at

quantum boundary!

Staggered➡ Ticks spread out

across quantum.➡ Reduced bus and

lock contention.➡ Additional latency.

P1

P2T x

1T y2

T z3 T y

2

delay partially-used quantum

5 10 150

P1

P2T x

1T y2

T z3 T y

2

staggering delays

5 10 150

release completion

Tuesday, April 5, 2011

Page 86: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Quantum Alignment

21

Aligned➡ Tick synchronized

across processors.➡ Contention at

quantum boundary!

Staggered➡ Ticks spread out

across quantum.➡ Reduced bus and

lock contention.➡ Additional latency.

P1

P2T x

1T y2

T z3 T y

2

staggering delays

5 10 150

P1

P2T x

1T y2

T z3 T y

2

delay partially-used quantum

5 10 150release completion

Tuesday, April 5, 2011

Page 87: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Interrupt Handling

22Tuesday, April 5, 2011

Page 88: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Interrupt Handling

23

Global interrupt handling.➡ Job releases triggered by interrupts.➡ Interrupts may fire on any processor.➡ Jobs may execute on any processor.➡ Thus, in the worst case, a job may be

delayed by each interrupt.

Mai

n M

emor

y

L2 Cache

L1 Cache

Core 1

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 2

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 3

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 4

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 6

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 5

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 7

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 8

Thread1

Thread2

Thread3

Thread4

Tuesday, April 5, 2011

Page 89: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Interrupt Handling

24

Global interrupt handling.➡ Job releases triggered by interrupts.➡ Interrupts may fire on any processor.➡ Jobs may execute on any processor.➡ Thus, in the worst case, a job may be

delayed by each interrupt.

Mai

n M

emor

y

L2 Cache

L1 Cache

Core 1

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 2

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 3

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 4

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 6

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 5

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 7

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 8

Thread1

Thread2

Thread3

Thread4

Mai

n M

emor

y

L2 Cache

L1 Cache

Core 1

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 2

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 3

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 4

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 6

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 5

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 7

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 8

Thread1

Thread2

Thread3

Thread4

Dedicated interrupt handling.➡ Only one processor services interrupts.➡ Jobs may execute on other processors.➡ Jobs are not delayed by release interrupts.➡ Well-known technique; used in the Spring

kernel (Stankovic and Ramamritham, 1991).➡ How does it affect schedulability?

J.A. Stankovic and K. Ramamritham (1991), The Spring kernel: A new paradigm for real-time systems. IEEE Software, 8(3):62–72.

Tuesday, April 5, 2011

Page 90: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Ready Queue

25

Main Memory

L2 Cache

L2 Cache

Core 1 Core 2 Core 4Core 3

Q1

T a1

T b2

T c3

T d4

Tuesday, April 5, 2011

Page 91: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Main Memory

L2 Cache

L2 Cache

Core 1 Core 2 Core 4Core 3

Q1

T a1

T b2

T c3

T d4

Ready Queue

26

Globally-shared priority queue.➡ Problem: hyper-period boundaries.➡ Problem: lock contention.➡ Problem: bus contention.

Tuesday, April 5, 2011

Page 92: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Main Memory

L2 Cache

L2 Cache

Core 1 Core 2 Core 4Core 3

Q1

T a1

T b2

T c3

T d4

Ready Queue

27

Globally-shared priority queue.➡ Problem: hyper-period boundaries.➡ Problem: lock contention.➡ Problem: bus contention.

Requirements.➡ Mergeable priority queue: release n

jobs in O(log n) time.➡ Parallel enqueue / dequeue operations.➡ Mostly cache-local data structures.

Tuesday, April 5, 2011

Page 93: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Ready Queue

28

Globally-shared priority queue.➡ Problem: hyper-period boundaries. ➡ Problem: lock contention.➡ Problem: bus contention.

P1 P2

P32

Coarse-Grained Heap Hierarchical Heaps Fine-Grained Heap

In this study, we consider three queue implementations.

Tuesday, April 5, 2011

Page 94: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Ready Queue: Coarse-Grained Heap

29

Binomial heap + single lock.➡ Lock used to synchronize all G-EDF state.➡ Mergeable queue.➡ No parallel updates.➡ No cache-local updates.➡ Low locking overhead

(only single lock acquisition).

Tuesday, April 5, 2011

Page 95: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Ready Queue: Hierarchical Heaps

30

P1 P2

P32

Per-processor queues + master queue.➡ Each queue protected by a lock.➡ Master queue holds min element of each per-

processor queue.➡ Global, sequential dequeue operations.➡ Mostly-local enqueue operations.

Tuesday, April 5, 2011

Page 96: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Ready Queue: Hierarchical Heaps

31

P1 P2

P32

Per-processor queues + master queue.➡ Each queue protected by a lock.➡ Master queue holds min element of each per-

processor queue.➡ Global, sequential dequeue operations.➡ Mostly-local enqueue operations.

Locking.➡ Dequeue: top-down.➡ Enqueue: bottom-up.➡ Enqueue may have to

drop lock, retry.➡ Additional complexity

wrt. dequeue (see paper).➡ Bottom line: expensive.

Tuesday, April 5, 2011

Page 97: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Ready Queue: Fine-Grained Heap

32

Parallel binary heap.➡ One lock per heap node.➡ Proposed by Hunt et al. (1996).➡ Not mergeable.➡ Parallel enqueue / dequeue.➡ No cache-local data.

Hunt et al. (1996), An efficient algorithm for concurrent priority queue heaps. Information Processing Letters, 60(3):151–157.

Tuesday, April 5, 2011

Page 98: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Ready Queue: Fine-Grained Heap

33

Parallel binary heap.➡ One lock per heap node.➡ Proposed by Hunt et al. (1996).➡ Not mergeable.➡ Parallel enqueue / dequeue.➡ No cache-local data.

Locking.➡ Many lock acquisitions.➡ Atomic peek+dequeue

operation needed to check for preemptions.

Hunt et al. (1996), An efficient algorithm for concurrent priority queue heaps. Information Processing Letters, 60(3):151–157.

Tuesday, April 5, 2011

Page 99: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Additional Components

34

Release queue.➡ Support mergeable queues.➡ Support dedicated interrupt handling.

Job-to-processor mapping.➡ Quickly determine whether preemption is required.➡ Avoid unnecessary preemptions.➡ Used to linearize concurrent scheduling decisions.

Tuesday, April 5, 2011

Page 100: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Implementation in LITMUSRT

35Tuesday, April 5, 2011

Page 101: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

36

Linux Testbed for Multiprocessor Scheduling in Real-Time systems

Tuesday, April 5, 2011

Page 102: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

37

Linux Testbed for Multiprocessor Scheduling in Real-Time systems

UNC’s Linux patch.➡ Used in several previous studies.➡ On-going development.➡ Currently, based off of Linux 2.6.24.

Tuesday, April 5, 2011

Page 103: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

38

Linux Testbed for Multiprocessor Scheduling in Real-Time systems

UNC’s Linux patch.➡ Used in several previous studies.➡ On-going development.➡ Currently, based off of Linux 2.6.24.

Scheduler Plugin API.➡ scheduler_tick()➡ schedule()➡ release_jobs()

Tuesday, April 5, 2011

Page 104: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Considered G-EDF Variants

39

Name Ready Q Scheduling InterruptsCEm coarse-grained event-driven global

CQm coarse-grained quantum (aligned) global

FEm fine-grained event-driven global

HEm hierarchical event-driven global

S-CQm coarse-grained quantum (staggered) global

CE1 coarse-grained event-driven dedicated

FE1 fine-grained event-driven dedicated

CQ1 coarse-grained quantum (aligned) dedicated

S-CQ1 coarse-grained quantum (staggered) dedicated

Tuesday, April 5, 2011

Page 105: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Considered G-EDF Variants

40

Name Ready Q Scheduling InterruptsCEm coarse-grained event-driven global

CQm coarse-grained quantum (aligned) global

S-CQm coarse-grained quantum (staggered) global

HEm hierarchical event-driven global

FEm fine-grained event-driven global

CE1 coarse-grained event-driven dedicated

FE1 fine-grained event-driven dedicated

CQ1 coarse-grained quantum (aligned) dedicated

S-CQ1 coarse-grained quantum (staggered) dedicated

Tuesday, April 5, 2011

Page 106: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Considered G-EDF Variants

41

Name Ready Q Scheduling InterruptsCEm coarse-grained event-driven global

CQm coarse-grained quantum (aligned) global

S-CQm coarse-grained quantum (staggered) global

HEm hierarchical event-driven global

FEm fine-grained event-driven global

CE1 coarse-grained event-driven dedicated

FE1 fine-grained event-driven dedicated

CQ1 coarse-grained quantum (aligned) dedicated

S-CQ1 coarse-grained quantum (staggered) dedicated

Baseline from(Brandenburg et al., 2008)

Tuesday, April 5, 2011

Page 107: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Considered G-EDF Variants

42

Name Ready Q Scheduling InterruptsCEm coarse-grained event-driven global

CQm coarse-grained quantum (aligned) global

S-CQm coarse-grained quantum (staggered) global

HEm hierarchical event-driven global

FEm fine-grained event-driven global

CE1 coarse-grained event-driven dedicated

FE1 fine-grained event-driven dedicated

CQ1 coarse-grained quantum (aligned) dedicated

S-CQ1 coarse-grained quantum (staggered) dedicated

No fine-grained heaps + quantum-driven scheduling.(Parallel updates not beneficial due to quantum barrier.)

Tuesday, April 5, 2011

Page 108: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Considered G-EDF Variants

43

Name Ready Q Scheduling InterruptsCEm coarse-grained event-driven global

CQm coarse-grained quantum (aligned) global

S-CQm coarse-grained quantum (staggered) global

HEm hierarchical event-driven global

FEm fine-grained event-driven global

CE1 coarse-grained event-driven dedicated

CQ1 coarse-grained quantum (aligned) dedicated

S-CQ1 coarse-grained quantum (staggered) dedicated

FE1 fine-grained event-driven dedicated

Tuesday, April 5, 2011

Page 109: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Considered G-EDF Variants

44

Name Ready Q Scheduling InterruptsCEm coarse-grained event-driven global

CQm coarse-grained quantum (aligned) global

S-CQm coarse-grained quantum (staggered) global

HEm hierarchical event-driven global

FEm fine-grained event-driven global

CE1 coarse-grained event-driven dedicated

CQ1 coarse-grained quantum (aligned) dedicated

S-CQ1 coarse-grained quantum (staggered) dedicated

FE1 fine-grained event-driven dedicated

No hierarchical heaps + dedicated interrupt handling.(Hierarchical heaps not beneficial if only one proc. enqueues.)

Tuesday, April 5, 2011

Page 110: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Schedulability Study

45Tuesday, April 5, 2011

Page 111: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Objective

46

Compare the discussed implementationsin terms of the ratio of randomly-generated task sets

that can be shown to be schedulableunder consideration of system overheads.

Tuesday, April 5, 2011

Page 112: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Scheduling Overheads

47Tuesday, April 5, 2011

Page 113: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Scheduling Overheads

48

Release overhead.➡ The cost of a one-shot timer interrupt.

Scheduling overhead.➡ Selecting the next job to run.

Context switch overhead.➡ Changing address space.

P1

P2T x

1T y2

T z3 T y

2

context switchrelease schedule

T z3csr r

cscsr

cs

cs

cs

5 10 150

release completion

Tuesday, April 5, 2011

Page 114: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Scheduling Overheads

49

Release overhead.➡ The cost of a one-shot timer interrupt.

Scheduling overhead.➡ Selecting the next job to run.

Tick overhead.➡ Cost of a periodic timer interrupt.➡ Beginning of a new quantum.

Context switch overhead.➡ Changing address space.

Preemption and migration overhead.➡ Loss of cache affinity.➡ Known from (Brandenburg et al., 2008).

P1

P2T x

1T y2

T z3 T y

2

context switchrelease schedule

T z3csr r

cscsr

cs

cs

cs

5 10 150

release completion

Tuesday, April 5, 2011

Page 115: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

IPI Latency

50

P1

P2 T x1T y

2

T z3 T y

2

IPI latency

5 10 150

Inter-processor interrupts (IPIs).➡ Interrupt may be processed by a processor different from the one

that will schedule a newly-arrived job.➡ Requires notification of remote processor.➡ Event-based scheduling incurs added latency.

release completion

Tuesday, April 5, 2011

Page 116: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Test Platform

51

LITMUSRT

➡ UNC’s Linux-based Real-Time Testbed

Sun UltraSPARC T1 “Niagara”➡ 8 cores, 4 HW threads per core = 32 logical processors.➡ 3 MB shared L2 cache

on

— SUN UltraSPARC T1 “Niagara”

Tuesday, April 5, 2011

Page 117: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Test Platform

52

LITMUSRT

➡ UNC’s Linux-based Real-Time Testbed

Sun UltraSPARC T1 “Niagara”➡ 8 cores, 4 HW threads per core = 32 logical processors.➡ 3 MB shared L2 cache

on

— SUN UltraSPARC T1 “Niagara”

0

100

200

300

400

500

600

700

PFAIR S-PFAIR G-EDF C-EDF P-EDF

Overheads➡ Traced overheads under each of the plugins.➡ Collected more than 640,000,000 samples (total).➡ Computed worst-case and average-case overheads.➡ Over 20 graphs; see online version.

Outliers➡ Removed top 1% of samples to discard outliers.

Tuesday, April 5, 2011

Page 118: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Example: Tick Overhead

53

0

50

100

150

200

250

300

50 100 150 200 250 300 350 400 450

over

head

(us)

number of tasks

worst-case tick overhead

CEm tick overhead (worst-case)CE1 tick overhead (worst-case)FEm tick overhead (worst-case)FE1 tick overhead (worst-case)

CQm tick overhead (worst-case)CQ1 tick overhead (worst-case)HEm tick overhead (worst-case)“Higher is worse.”

number of tasks

mic

rose

cond

s

Tuesday, April 5, 2011

Page 119: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Example: Tick Overhead

54

0

50

100

150

200

250

300

50 100 150 200 250 300 350 400 450

over

head

(us)

number of tasks

worst-case tick overhead

CEm tick overhead (worst-case)CE1 tick overhead (worst-case)FEm tick overhead (worst-case)FE1 tick overhead (worst-case)

CQm tick overhead (worst-case)CQ1 tick overhead (worst-case)HEm tick overhead (worst-case)

Event-Driven

Quantum-Driven

Tuesday, April 5, 2011

Page 120: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

0

50

100

150

200

250

300

350

400

50 100 150 200 250 300 350 400 450

over

head

(us)

number of tasks

worst-case release overhead

CEm release overhead (worst-case)CE1 release overhead (worst-case)FEm release overhead (worst-case)FE1 release overhead (worst-case)

CQm release overhead (worst-case)CQ1 release overhead (worst-case)HEm release overhead (worst-case)

Example: Release Overhead

55

Quantum-Driven

Event-Driven

Tuesday, April 5, 2011

Page 121: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Study Setup

56

Methodology.➡ Randomly generate task set.➡ Apply overheads (for each G-EDF implementation).➡ Test whether task set can be claimed schedulable (for

each G-EDF implementation). 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

ratio

of s

ched

ulab

le ta

sk s

ets

[har

d]

task set utilization cap (prior to inflation)

utilization uniformly in [0.1, 0.4]; period uniformly in [10, 100]

G-EDF CEm CE1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

ratio

of s

ched

ulab

le ta

sk s

ets

[har

d]

task set utilization cap (prior to inflation)

utilization uniformly in [0.001, 0.1]; period uniformly in [10, 100]

G-EDF CQ1 S-CQ1

Tuesday, April 5, 2011

Page 122: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Study Setup

57

Methodology.➡ Randomly generate task set.➡ Apply overheads (for each G-EDF implementation).➡ Test whether task set can be claimed schedulable (for

each G-EDF implementation). 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

ratio

of s

ched

ulab

le ta

sk s

ets

[har

d]

task set utilization cap (prior to inflation)

utilization uniformly in [0.1, 0.4]; period uniformly in [10, 100]

G-EDF CEm CE1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

ratio

of s

ched

ulab

le ta

sk s

ets

[har

d]

task set utilization cap (prior to inflation)

utilization uniformly in [0.001, 0.1]; period uniformly in [10, 100]

G-EDF CQ1 S-CQ1

Schedulability.➡ Hard real-time: worst-case overheads, no tardiness.➡ Soft real-time: average-case overheads, bounded

tardiness. 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

ratio

of s

ched

ulab

le ta

sk s

ets

[har

d]

task set utilization cap (prior to inflation)

utilization uniformly in [0.5, 0.9]; period uniformly in [10, 100]

G-EDF CE1 FE1

Tuesday, April 5, 2011

Page 123: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Study Setup

58

Methodology.➡ Randomly generate task set.➡ Apply overheads (for each G-EDF implementation).➡ Test whether task set can be claimed schedulable (for

each G-EDF implementation). 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

ratio

of s

ched

ulab

le ta

sk s

ets

[har

d]

task set utilization cap (prior to inflation)

utilization uniformly in [0.1, 0.4]; period uniformly in [10, 100]

G-EDF CEm CE1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

ratio

of s

ched

ulab

le ta

sk s

ets

[har

d]

task set utilization cap (prior to inflation)

utilization uniformly in [0.001, 0.1]; period uniformly in [10, 100]

G-EDF CQ1 S-CQ1

Task set generation.➡ Six utilization distributions (uniform and bimodal).➡ Three period distributions (uniform).➡ Over 300 graphs; see online version.

Schedulability.➡ Hard real-time: worst-case overheads, no tardiness.➡ Soft real-time: average-case overheads, bounded

tardiness. 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

ratio

of s

ched

ulab

le ta

sk s

ets

[har

d]

task set utilization cap (prior to inflation)

utilization uniformly in [0.5, 0.9]; period uniformly in [10, 100]

G-EDF CE1 FE1

Tuesday, April 5, 2011

Page 124: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Results

59

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

ratio

of s

ched

ulab

le ta

sk s

ets

[har

d]

task set utilization cap (prior to inflation)

utilization uniformly in [0.1, 0.4]; period uniformly in [10, 100]

G-EDF CEm CE1

increasing utilizationsche

dula

ble

task

set

s

“Higher is better.”

Tuesday, April 5, 2011

Page 125: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Interrupt Handling

60

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

ratio

of s

ched

ulab

le ta

sk s

ets

[har

d]

task set utilization cap (prior to inflation)

utilization uniformly in [0.1, 0.4]; period uniformly in [10, 100]

G-EDF CEm CE1

Dedicated

Zero Overh.Global

Dedicated interrupt handlingwas generally preferable (or no worse).

Tuesday, April 5, 2011

Page 126: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Quantum Staggering

61

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

ratio

of s

ched

ulab

le ta

sk s

ets

[har

d]

task set utilization cap (prior to inflation)

utilization uniformly in [0.001, 0.1]; period uniformly in [10, 100]

G-EDF CQ1 S-CQ1

Staggered

Zero OverheadsAligned

Staggered quantawere generally preferable (or no worse).

Tuesday, April 5, 2011

Page 127: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Quantum- vs. Event-Driven

62

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

ratio

of s

ched

ulab

le ta

sk s

ets

[har

d]

task set utilization cap (prior to inflation)

utilization uniformly in [0.1, 0.4]; period uniformly in [10, 100]

G-EDF FE1 S-CQ1

QuantumEvent

Event-driven scheduling was preferable in most cases.

Zero Overh.

Tuesday, April 5, 2011

Page 128: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Choice of Ready Queue (1)

63

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

ratio

of s

ched

ulab

le ta

sk s

ets

[har

d]

task set utilization cap (prior to inflation)

utilization uniformly in [0.1, 0.4]; period uniformly in [10, 100]

G-EDF CEm HEm

HierarchicalCoarse-Grained

Zero Overh.

The coarse-grained ready queueperformed better than the hierarchical queue.

Tuesday, April 5, 2011

Page 129: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Choice of Ready Queue (II)

64

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

ratio

of s

ched

ulab

le ta

sk s

ets

[har

d]

task set utilization cap (prior to inflation)

utilization uniformly in [0.5, 0.9]; period uniformly in [10, 100]

G-EDF CE1 FE1

Zero O.Coarse-Grained

Fine-Grained

The fine-grained ready queueperformed marginally better than the coarse-grained queue if used together with dedicated interrupt handling.

Tuesday, April 5, 2011

Page 130: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Conclusion

65Tuesday, April 5, 2011

Page 131: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Summary of Results

66

Implementation choices can impact schedulability as much as

scheduling-theoretic tradeoffs.

Unless task counts are very highor periods very short,

G-EDF can scale to 32 processors.

Tuesday, April 5, 2011

Page 132: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Recommendation

67

Mai

n M

emor

y

L2 Cache

L1 Cache

Core 1

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 2

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 3

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 4

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 6

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 5

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 7

Thread1

Thread2

Thread3

Thread4

L1 Cache

Core 8

Thread1

Thread2

Thread3

Thread4

Best results obtained with combination of:

fine-grained heapevent-driven scheduling

dedicated interrupt handling

Tuesday, April 5, 2011

Page 133: Implementation Study - Courses (please see the UBC course ...courses.ece.ubc.ca/494/files/MP-ImplementationStudy.pdfWorst-Case Overheads (in µs) Alg Overall Preemption Intra-Cluster

On the Implementation of Global Real-Time Schedulers

Future Work

68

Platform.➡ Repeat study on embedded hardware platform.

Implementation.➡ Simplify locking requirements.➡ Parallel mergeable heaps?

Analysis.➡ Less pessimistic hard real-time G-EDF schedulability tests.➡ Less pessimistic interrupt accounting.

Tuesday, April 5, 2011