
January 24, 2012

Adaptive Partition Scheduling

Part 1: Why we did it

Cool stuff from QNX

A. Danko



Evolution of schedulers

Why?

Timeline
priority pre-emptive
Timeslicing
Time-varying priority
Really clever time-varying
Fair Share scheduling
Adaptive configuration

Yes, but:
System locks up
Backhoes and Mother's Day
Untuneable for more than 1 application.
US Military Satcom
Hard to manage share interactions.
Not invented until now.

SCHED_FIFO
SCHED_RR
SCHED_SPORADIC




Evolution: Lessons learned

Numerical priorities are chosen by applications, but system scheduling behavior must be designed globally

Degradation and overload: Priorities are not constants. Importance of work depends on circumstances.
> Modes: normal operation, restart, emergency maintenance

Scheduling strategy needs to be based on unit of work, but what we have is communicating threads.

Must measure real-time behavior.
> 0.1% accuracy

Want to specify shares as global percentages
> Applications don't get to pick their importance or shares. System engineers do.

Need to throttle CPU usage without losing real-time latencies.



What is Partitioning?

General Answer
Separation of work
To isolate:
> CPU usage
> memory usage
> system resource usage
> failures

QNX Answer
POSIX compatible design which can be applied to existing systems with little or no recoding
A global hard real-time scheduler with overload protection and CPU guarantees
> Separation of work based on working for common purpose
Runtime typed memory and kernel object guarantees and limits
> With full inheritance and accounting for all children
Persistent storage (file system) guarantees and limits
Process model for fault isolation
Dynamic configuration

Design

Adaptive Partition Scheduling




Principles

Scheduler must not trigger an overload
> Overhead may not increase with # of threads

Real-time during underload
> Same behavior as today

Real-time during overload
> At least for interrupt handling

Must also be a fair-share scheduler
> global scheduler algorithm
> globally configured

Must mesh with current QNX architecture
> Preemptive priority, individual thread scheduling
> Heavy use of message passing
> Easy to drop onto existing applications
> Can't be a bag on the side

Simple enough for customers to use
> Engineerable
> Reconfigure on the fly

[Figure: throughput versus offered load]

Insert picture of Juggling Watermelons here



Counting time

What does 14% CPU mean?
> CPU usage is calculated over a sliding window.

Accuracy:
> Counting ticks is not enough. Micro-billing is used to track actual CPU utilization even when threads don't use their whole timeslice.
> micro- and nanosecond resolution
> Threads are billed based on real usage, not statistics

windowsize is configurable as an argument to the kernel at boot
> Trade off maximum READY-state latency against accuracy of CPU budgeting

100 ms window -> 1% accuracy or better.
> Internal arithmetic accurate to 0.5% or better

Partition usage
> ns of CPU time executed during the last sliding window, expressed as a percentage

Partition budget
> Guaranteed percentage of CPU time, balanced over the sliding window
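
The usage and budget definitions above can be sketched in a few lines. This is a minimal illustration, not the QNX implementation: the `PartitionMeter` class, its method names, and the interval list are hypothetical, and it assumes runs are billed in time order with nanosecond timestamps.

```python
from collections import deque


class PartitionMeter:
    """Hypothetical helper: CPU time billed to one partition over a sliding window."""

    def __init__(self, budget_percent, window_ns=100_000_000):
        self.budget_percent = budget_percent
        self.window_ns = window_ns          # 100 ms window by default
        self.runs = deque()                 # (start_ns, end_ns) of actual execution

    def bill(self, start_ns, end_ns):
        """Micro-billing: record exactly the time used, even a partial timeslice."""
        self.runs.append((start_ns, end_ns))

    def usage_ns(self, now_ns):
        """Nanoseconds executed inside [now - window, now]."""
        window_start = now_ns - self.window_ns
        # Forget runs that ended before the window began.
        while self.runs and self.runs[0][1] <= window_start:
            self.runs.popleft()
        # Clip each remaining run to the window edges before summing.
        return sum(
            min(end, now_ns) - max(start, window_start)
            for start, end in self.runs
            if start < now_ns
        )

    def usage_percent(self, now_ns):
        return 100.0 * self.usage_ns(now_ns) / self.window_ns

    def has_budget(self, now_ns):
        # Integer comparison, so no rounding at the budget boundary.
        return self.usage_ns(now_ns) * 100 < self.budget_percent * self.window_ns
```

With a 14% budget, a partition that ran for 10 ms and then 4 ms inside the last 100 ms has used exactly its share; a few milliseconds later the oldest run starts to slide out of the window and budget becomes available again.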

[Figure: sliding window running from T = -100 ms to T = now]


Who's got time: Partition Inheritance

[Figure: a File System process alongside Adaptive Partition 1 (Multi-media) and a second partition (Java application), each with CPU budget available; threads are labeled with numbers.]

Resource manager threads work on behalf of sender

Priority and adaptive partition are inherited on receive
> Execution time in server billed to client's partition

This allows proper accounting for shared resources
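
A small sketch may make the inheritance rule concrete. The `Thread` type, the `receive` and `run` helpers, and the partition names are hypothetical, chosen only to show a server thread taking on the sender's priority and partition so that its work is billed to the client.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    priority: int
    partition: str


def receive(server: Thread, client: Thread) -> None:
    """On message receive, the server thread inherits the sender's priority and partition."""
    server.priority = client.priority
    server.partition = client.partition


def run(thread: Thread, ns: int, billed_ns: dict) -> None:
    """Bill executed time to whichever partition the thread currently belongs to."""
    billed_ns[thread.partition] = billed_ns.get(thread.partition, 0) + ns


# Illustrative values only.
billed = {}
fs_worker = Thread("fs_worker", priority=10, partition="System")
java_app = Thread("java_app", priority=9, partition="Java")

receive(fs_worker, java_app)        # file system now works on behalf of the Java app
run(fs_worker, 2_000_000, billed)   # 2 ms of file system work
```

After this runs, the 2 ms appears under the Java partition rather than under the file system's own partition, which is the accounting property the slide describes.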

[Figure: messages arriving at the receive threads, with CPU budget available.]



Real time: Behavior under normal load


[Figure: partitions (Multi-media) and (Java application), both with CPU budget available; threads shown as Blocked, Running, or Ready.]

Hard real-time scheduler under normal load

Running thread selected as highest priority READY thread

No delay on scheduling if adaptive partition has budget



Out of time: Behavior under overload


[Figure: partitions (Multi-media) and (Java application); one has CPU budget available, the other has exceeded its CPU budget; threads shown as Blocked, Running, or Ready.]

Highest priority READY thread in Partition with budget runs

No delay on scheduling if adaptive partition has budget



Free Time: Behavior with unused CPU


[Figure: partitions (Multi-media) and (Java application), both with CPU budget exceeded; threads shown as Blocked or Running.]

If no partitions with remaining budget have READY threads, the highest priority READY thread is selected to run from other partitions

This allows free time to be given based upon priority
> Free time is still accounted and may have to be paid back (for example, if partition 3 becomes ready within 1 averaging window)
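
The three cases (normal load, overload, free time) can be sketched as one selection rule. This is an illustrative simplification with hypothetical names (`Partition`, `select`); it ignores critical threads and equal-priority ties, and the priorities are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    name: str
    has_budget: bool
    ready_priorities: list = field(default_factory=list)  # priorities of READY threads


def select(partitions):
    """Return (partition name, priority) of the thread to run, or None if nothing is READY."""
    ready = [p for p in partitions if p.ready_priorities]
    if not ready:
        return None
    # Partitions with budget come first; if none of them is READY,
    # free time goes to the highest priority READY thread anywhere.
    in_budget = [p for p in ready if p.has_budget]
    candidates = in_budget or ready
    best = max(candidates, key=lambda p: max(p.ready_priorities))
    return best.name, max(best.ready_priorities)
```

In this sketch free time is handed out by priority alone. The time used would still be billed to the running partition, which is why it may have to be paid back.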


[Figure: a further partition with CPU budget available.]



Borrowed Time: Critical Threads

[Figure: partitions (Multi-media) and (Air Bag Control); one has CPU budget available, the other has exceeded its CPU budget; threads shown as Blocked, Running, or Ready.]

Critical threads still run (based on priority) even if partition has no budget

Critical threads provide deterministic scheduling even in overload

Critical threads are given critical budget and can go into short-term debt
> Critical time is accounted and has to be repaid
> Exceeding critical budget is considered an error and causes notification/action
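
The debt rule for critical threads can be sketched as simple bookkeeping. The `CriticalAccount` class, its field names, and the `bankrupt` flag are hypothetical; what the real system does when the critical budget is exceeded is not specified in these slides beyond "notification/action".

```python
from dataclasses import dataclass


@dataclass
class CriticalAccount:
    critical_budget_ns: int   # how much debt critical threads may run up
    debt_ns: int = 0          # critical time used while the partition was out of budget
    bankrupt: bool = False    # set once the critical budget is exceeded

    def run_critical(self, ns: int, partition_has_budget: bool) -> None:
        """Bill a critical thread's run; time used while out of budget becomes debt."""
        if partition_has_budget:
            return  # the ordinary budget covers it, no debt incurred
        self.debt_ns += ns
        if self.debt_ns > self.critical_budget_ns:
            self.bankrupt = True  # error case: notify or take action here

    def repay(self, ns: int) -> None:
        """Repay debt out of budget that becomes available later."""
        self.debt_ns = max(0, self.debt_ns - ns)
```

The point of the sketch is that a critical thread always runs, while the account records how far into debt the partition has gone and flags the overrun.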

[Figure label: Critical Thread]



Equal time.

How to choose between partitions of equal priority
> Unimportant?
> Many threads run at default priority, therefore equal priority

Possible algorithms:
> - round robin
> - favor partition with most free time
> - favor longest waiter

Requirement:
> Minimize latencies during underload
> WBN: divide free time by % CPU share.

Solution: Interleave partitions by ratio of partition shares

We found a clever way to do that, so it's in the patent.
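
The patented method is not described in these slides. As a stand-in, the sketch below uses smooth weighted round-robin, a generic and widely known technique, to show what "interleave by ratio of shares" means in practice. The function name and the share values are illustrative assumptions.

```python
def interleave(shares, steps):
    """Return a run order in which each partition appears in proportion to its share.

    Smooth weighted round-robin: a generic technique, NOT the patented QNX method.
    """
    total = sum(shares.values())
    credit = {name: 0 for name in shares}
    order = []
    for _ in range(steps):
        # Every partition earns credit equal to its share...
        for name, share in shares.items():
            credit[name] += share
        # ...and the one with the most credit runs and pays the full total.
        chosen = max(credit, key=credit.get)  # ties go to the first-listed partition
        credit[chosen] -= total
        order.append(chosen)
    return order


print(interleave({"A": 50, "B": 30, "C": 20}, 10))
# prints ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'B', 'A']
```

Over ten turns A runs five times, B three, and C two, and the turns are spread out rather than batched, which keeps waiting times short for the smaller partitions.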



How it does it

[Figure: block diagram with labels uKernel, libmod_aps.a, process creation, messaging, per-partition ready queue, scheduler, clock interrupt handler, ready(), block(), select_thread().]

for all partitions p, define the merit
    m(p) = (bud(p) || crit(p), prio(p), run_t/wsize/bud(p))
then schedule ps, where
    rdy(ps) and m(ps) < m(pi) for all i != s
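
One way to read the merit function is as a three-part comparison: partitions with budget (or a critical thread) first, then higher priority, then the lower fraction of budget used. The sketch below follows that reading. The field names are hypothetical, the ordering direction is an assumption about what the slide's `<` means, and the usage comparison is cross-multiplied so that no division is needed (the common window size cancels).

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass
class Partition:
    name: str
    ready: bool          # rdy(p): has at least one READY thread
    has_budget: bool     # assumed meaning of bud(p) in the first merit term
    critical: bool       # crit(p): a critical thread may run on critical budget
    prio: int            # prio(p): priority of its highest READY thread
    run_ns: int          # run_t: ns executed in the current window
    budget_percent: int  # bud(p) as a percentage share


def compare_merit(a, b):
    """Negative if a has the better merit, positive if b does, 0 if tied."""
    a_ok = a.has_budget or a.critical
    b_ok = b.has_budget or b.critical
    if a_ok != b_ok:
        return -1 if a_ok else 1
    if a.prio != b.prio:
        return -1 if a.prio > b.prio else 1
    # Compare run_a/bud_a with run_b/bud_b by cross-multiplying: integers only.
    lhs = a.run_ns * b.budget_percent
    rhs = b.run_ns * a.budget_percent
    return (lhs > rhs) - (lhs < rhs)


def select_partition(partitions):
    """Pick the READY partition with the best merit, or None if none is READY."""
    ready = [p for p in partitions if p.ready]
    return min(ready, key=cmp_to_key(compare_merit)) if ready else None
```

Because the first term that differs decides the comparison, later terms often need not be computed at all, so the cost grows with the number of partitions rather than the number of threads.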




Overhead: Fancy, but is it fast?

Scheduling overhead increases with:
> - number of partitions
> - number of messages/sec
> - number of clock interrupts/sec, i.e. ClockPeriod()
> * does not increase with number of threads *

Free or almost free operations:
> Inheriting partition as part of message receive
> Joining a thread to a partition
> Dynamically changing budgets

Computational requirements
> 32-bit multiply, 64-bit add
> *no floating point* *no divides* *no address space swapping* *short-circuit calculation of merit function* *no inter-CPU messaging on SMP* *history-less algorithm*

Overhead typically 1% of total CPU



Any Queries????
