
Post on 02-Aug-2020


Taming Non-blocking Caches to Improve Isolation in Multicore Real-time Systems

(RTAS 2016)

Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi
University of Kansas

Multicore Processors in Real-Time Systems

● Real-time systems need increased performance as they become more intelligent
○ Computer Vision
○ Collision Avoidance
● Real-time systems still need high levels of predictability in order to be effective and safe

Time Predictability in Multicore Processors

● Multicore processors are less predictable than single-core processors because of shared resources
○ Last Level Cache (LLC)
○ Bus Interface
● Out-of-order cores using non-blocking caches also share Miss Status Holding Registers (MSHRs)

[Figure (label: LLC). Image source: http://www.cse.wustl.edu/~jain/cse567-11/ftp/multcore/]

Page Coloring for Cache Partitioning

● Cache partitioning is used to prevent cores from interfering with other cores' shared cache space
● Partitioning is done through page coloring
○ Implemented in either hardware or software
○ Allocates non-overlapping partitions of the LLC to cores
● Prevents unpredictable cache-line evictions caused by other cores
● This does NOT fully isolate cores with non-blocking caches
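To make the page-coloring idea concrete, the sketch below computes the color of a physical page and splits the colors evenly among cores. It assumes a physically indexed LLC; the cache geometry and the names `page_color` and `colors_for_core` are illustrative assumptions, not values from the slides.

```python
# Assumed (hypothetical) geometry: 2 MiB, 16-way LLC with 4 KiB pages.
PAGE_SIZE = 4096
LLC_SIZE = 2 * 1024 * 1024
LLC_WAYS = 16

# One cache way spans WAY_SIZE bytes; pages that fall at the same offset
# within a way compete for the same cache sets and share a "color".
WAY_SIZE = LLC_SIZE // LLC_WAYS
NUM_COLORS = WAY_SIZE // PAGE_SIZE


def page_color(phys_addr: int) -> int:
    """Color of the physical page containing phys_addr."""
    return (phys_addr // PAGE_SIZE) % NUM_COLORS


def colors_for_core(core_id: int, num_cores: int) -> range:
    """Give each core an equal, non-overlapping slice of the colors."""
    per_core = NUM_COLORS // num_cores
    return range(core_id * per_core, (core_id + 1) * per_core)


if __name__ == "__main__":
    print("colors:", NUM_COLORS)
    print("core 1 of 4 owns colors:", list(colors_for_core(1, 4)))
```

An OS that only hands a core pages whose color lies in that core's slice keeps the cores' cache sets disjoint, which is the partitioning the slide describes.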

Non-blocking Cache

● Memory Level Parallelism (MLP)
○ Ability to handle multiple memory operations concurrently
● Continue to serve cache hits even as cache misses are waiting to be served
● Miss Status Holding Registers (MSHRs)
○ Cache miss: allocate an MSHR entry to "pend" the memory operation until it can be fulfilled
○ Data received: clear the entry from the MSHR
● The number of MSHRs available determines the Memory Level Parallelism of the cache
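The allocate-on-miss, clear-on-data behavior can be sketched as a small bookkeeping class. This is a simplified model under assumptions: the class name `MSHRFile` and its methods are hypothetical, and merging a repeated miss into an existing entry is a common design that the slides do not discuss.

```python
class MSHRFile:
    """Toy model of a set of Miss Status Holding Registers."""

    def __init__(self, num_entries: int):
        self.num_entries = num_entries
        self.pending = set()  # block addresses with an outstanding miss

    def is_full(self) -> bool:
        return len(self.pending) >= self.num_entries

    def on_miss(self, block_addr: int) -> bool:
        """Track a miss. Returns False if no MSHR is free (cache must stall)."""
        if block_addr in self.pending:
            return True  # assumed: a repeated miss reuses the existing entry
        if self.is_full():
            return False
        self.pending.add(block_addr)
        return True

    def on_data_received(self, block_addr: int) -> None:
        """Data came back from memory: clear the entry."""
        self.pending.discard(block_addr)


if __name__ == "__main__":
    mshrs = MSHRFile(num_entries=2)
    print(mshrs.on_miss(0x100), mshrs.on_miss(0x200), mshrs.on_miss(0x300))
    mshrs.on_data_received(0x100)
    print(mshrs.on_miss(0x300))
```

With two entries, at most two misses are outstanding at once, which is the sense in which the MSHR count bounds the cache's MLP.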

MSHR Contention

● MSHRs of the shared LLC are also shared by the cores
● If all MSHRs are full:
○ The cache becomes blocked
○ Memory operations (including cache hits) will be blocked until free MSHRs become available
● Cache partitioning does not prevent MSHR contention

[Figure: Core, L1 Cache Miss, MSHR Request; All MSHRs Occupied; LLC Inaccessible Until MSHR is Available]
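The blocking effect can be illustrated with a toy cycle-by-cycle model. It is only a sketch under stated assumptions: the subject issues one LLC hit per cycle, a hypothetical greedy co-runner refills every free shared MSHR with a new miss each cycle, and each miss holds its MSHR for a fixed `miss_latency`. Real arbitration is more complex.

```python
def cycles_for_subject_hits(num_hits: int, num_mshrs: int,
                            miss_latency: int, corunner_active: bool) -> int:
    """Cycles the subject needs to complete num_hits LLC hits."""
    release_times = []  # cycle at which each occupied MSHR is freed
    served = 0
    cycle = 0
    while served < num_hits:
        # Data returned from memory: free the MSHRs whose misses completed.
        release_times = [t for t in release_times if t > cycle]
        # The subject's hit is served only if the cache is not blocked.
        if len(release_times) < num_mshrs:
            served += 1
        # The greedy co-runner then takes every free MSHR with new misses.
        if corunner_active:
            while len(release_times) < num_mshrs:
                release_times.append(cycle + miss_latency)
        cycle += 1
    return cycle


if __name__ == "__main__":
    alone = cycles_for_subject_hits(4, num_mshrs=4, miss_latency=10,
                                    corunner_active=False)
    shared = cycles_for_subject_hits(4, num_mshrs=4, miss_latency=10,
                                     corunner_active=True)
    print("alone:", alone, "cycles; with co-runner:", shared, "cycles")
```

In this model the subject never misses in the LLC, yet it slows down because the shared MSHRs are full. Partitioning the cache space does not change that outcome.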

MSHR Contention as a Source of Interference

● Performance of the subject is measured alone and with co-runners
● Unwanted cache-line evictions are prevented by page coloring
● If page coloring is sufficient for isolation, co-runners should not affect the performance of the subject

[Figure: Core 1 (Subject), Core 2 (Co-runner), Core 3 (Co-runner), Core 4 (Co-runner); Partitioned LLC; DRAM]

Testing Platforms

Results

● LLC: All memory accesses are LLC hits
● DRAM: All memory accesses are LLC misses
● Latency: Has data dependencies that cause it to generate only one outstanding request at a time
● BwRead: Has no data dependencies, so it can generate multiple outstanding requests at a time
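The difference between the two benchmark styles can be approximated with an idealized timing model. This is a back-of-the-envelope sketch, not the benchmarks themselves: the function names are hypothetical, every miss is assumed to take the same `miss_latency`, and bandwidth limits are ignored.

```python
import math


def latency_style_time(num_misses: int, miss_latency: int) -> int:
    """Dependent accesses: each miss must finish before the next can start."""
    return num_misses * miss_latency


def bwread_style_time(num_misses: int, miss_latency: int, mshrs: int) -> int:
    """Independent accesses: up to `mshrs` misses overlap in each batch."""
    return math.ceil(num_misses / mshrs) * miss_latency


if __name__ == "__main__":
    print("Latency-style:", latency_style_time(12, 100))
    print("BwRead-style with 4 MSHRs:", bwread_style_time(12, 100, 4))
```

Under this model a Latency-style workload gains nothing from extra MSHRs, while a BwRead-style workload speeds up roughly in proportion to the MSHRs it can occupy. That is why the latter is the more aggressive co-runner.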

Results

● The number of Global MSHRs relative to Local MSHRs significantly impacts the amount of contention between the cores
● MSHR setting: (Local MSHRs / Global MSHRs)

Proposition

● Dynamically controlling the MSHRs will improve isolation of the cores
● Add "Target Count" and "Valid Count" registers to the local cache MSHRs
● This allows the OS to control each core's MLP independently
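One plausible reading of the two registers is sketched below. The semantics are an assumption, since the slides only name the registers: `valid_count` tracks the MSHR entries currently in use, `target_count` is an OS-written cap, and a new miss is refused once the cap is reached. The class name `CoreMSHRs` and the clamp to at least one entry are illustrative choices.

```python
class CoreMSHRs:
    """Toy model of a core's local MSHRs with a software-controlled cap."""

    def __init__(self, num_entries: int):
        self.num_entries = num_entries
        self.target_count = num_entries  # default: no throttling
        self.valid_count = 0             # entries currently in use

    def set_target(self, value: int) -> None:
        """Written by the OS; clamped to [1, num_entries] (assumed policy)."""
        self.target_count = max(1, min(value, self.num_entries))

    def try_allocate(self) -> bool:
        """Called on a miss. False means the core must wait for a free slot."""
        if self.valid_count >= self.target_count:
            return False
        self.valid_count += 1
        return True

    def release(self) -> None:
        """Called when the data for an outstanding miss arrives."""
        if self.valid_count > 0:
            self.valid_count -= 1


if __name__ == "__main__":
    core = CoreMSHRs(num_entries=6)
    core.set_target(2)  # limit this core's MLP to 2
    print([core.try_allocate() for _ in range(3)])
```

Lowering `target_count` on the cores that run memory-intensive best-effort tasks limits how many shared MSHRs those cores can occupy at once.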

Implementation

● Implemented in the gem5 cycle-accurate simulator
● OS scheduler logic (flowchart), which runs upon a context switch in a core:
○ If the next task is a real-time task, configure the TargetCount register of the core to reserve appropriate MSHR slots for the task
○ Any remaining (unreserved) MSHR slots are distributed across the cores to be utilized for best-effort processes
○ If no currently running tasks require MSHR reservations, the TargetCount of each core is reset to the maximum
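The context-switch logic described above might look roughly like the following. This is a sketch under assumptions, not the authors' implementation: the function name, the even split of leftover slots, and the example sizes (6 local MSHRs per core, 12 global) are all hypothetical.

```python
def on_context_switch(core, next_task_reservation, reservations,
                      local_max, global_mshrs):
    """Return (new reservations, TargetCount per core) after a switch.

    reservations[c] is the number of MSHRs reserved by the task running on
    core c; 0 means a best-effort task.
    """
    reservations = list(reservations)
    # A reservation cannot exceed the core's local MSHR count.
    reservations[core] = min(next_task_reservation, local_max)

    # No running task needs a reservation: reset every core to the maximum.
    if not any(reservations):
        return reservations, [local_max] * len(reservations)

    # Otherwise, split the unreserved slots evenly among best-effort cores.
    best_effort = [c for c, r in enumerate(reservations) if r == 0]
    remaining = max(0, global_mshrs - sum(reservations))
    share = remaining // len(best_effort) if best_effort else 0
    # Note: a share of 0 would stall best-effort cores entirely; a real
    # policy has to decide how to handle that case.
    targets = [r if r > 0 else min(share, local_max) for r in reservations]
    return reservations, targets


if __name__ == "__main__":
    res = [0, 0, 0, 0]
    res, targets = on_context_switch(0, 6, res, local_max=6, global_mshrs=12)
    print("real-time task on core 0:", targets)
    res, targets = on_context_switch(0, 0, res, local_max=6, global_mshrs=12)
    print("back to best-effort:", targets)
```

Keeping the sum of all targets at or below the global MSHR count is the design choice that prevents the shared MSHRs from filling up.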

Evaluation

● BwWrite(DRAM) is run on each core as a "best-effort" task
● Periodic EEMBC benchmarks with computation times of ~8ms are used for the "real-time" tasks, with periods:
○ Core1: 20ms
○ Core2: 30ms
○ Core3: 40ms
○ Core4: 60ms
● Real-time tasks see an improvement of up to 20% due to the reduction in MSHR contention
● Best-effort tasks suffer a 3% throughput reduction

Questions?

● What are some real-time systems that could benefit from this architecture?
● Why don't multicore processors currently allow control over MSHR allocation?
● What is a remaining source of contention?
● Why is the implementation of page coloring in the system a prerequisite for performing these experiments accurately?
