Taming Non-blocking Caches to Improve
Isolation in Multicore Real-time Systems
(RTAS 2016)
Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi
University of Kansas
Multicore Processors in Real-Time Systems
● Real-time systems need increased performance as they become more intelligent
○ Computer Vision
○ Collision Avoidance
● Real-time systems still need high levels of predictability in order to be effective and safe
Time Predictability in Multicore Processors
● Multicore processors are less predictable than single-core processors because of shared resources
○ Last Level Cache (LLC)
○ Bus Interface
● Out-of-order cores using non-blocking caches also share Miss Status Holding Registers (MSHRs)
[Figure: multicore processor with a shared LLC. Source: http://www.cse.wustl.edu/~jain/cse567-11/ftp/multcore/]
Page Coloring for Cache Partitioning
● Cache partitioning is used to prevent cores from interfering with other cores' shared cache space
● Partitioning is done through page coloring
○ Implemented in either hardware or software
○ Allocates non-overlapping partitions of LLC to cores
● Prevents unpredictable cache-line evictions caused by other cores
● This does NOT fully isolate cores with non-blocking caches
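As a rough illustration of how page coloring assigns non-overlapping LLC partitions, the sketch below derives a page's color from its physical address and hands each core a disjoint set of colors. The cache geometry (2 MiB, 16-way, 4 KiB pages), the even split of colors, and the function names are assumptions made for this example, not values from the slides. It also assumes the LLC set index is taken directly from physical address bits, with no hashing.

```python
PAGE_SIZE = 4096               # assumed 4 KiB pages
CACHE_SIZE = 2 * 1024 * 1024   # assumed 2 MiB shared LLC
WAYS = 16                      # assumed 16-way set associativity

# One way of the cache covers WAY_SIZE bytes of consecutive physical
# addresses; the pages inside that window each map to different sets.
WAY_SIZE = CACHE_SIZE // WAYS
NUM_COLORS = WAY_SIZE // PAGE_SIZE


def page_color(phys_addr: int) -> int:
    """Color of the page containing phys_addr (low bits of the page frame number)."""
    return (phys_addr // PAGE_SIZE) % NUM_COLORS


def colors_for_core(core: int, num_cores: int) -> range:
    """Disjoint, equally sized color range for each core (illustrative policy)."""
    per_core = NUM_COLORS // num_cores
    return range(core * per_core, (core + 1) * per_core)
```

An allocator built on this idea would only give core `c` physical pages whose `page_color` falls in `colors_for_core(c, num_cores)`, so lines from different cores never compete for the same LLC sets.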
Non-blocking Cache
● Memory Level Parallelism (MLP)
○ Ability to handle multiple memory operations concurrently
● Continue to serve cache hits even as cache misses are waiting to be served
● Miss Status Holding Registers (MSHRs)
○ Cache Miss - allocate MSHR entry to “pend” the memory operation until it can be fulfilled
○ Data Received - clear entry from MSHR
● The number of MSHRs available determines the Memory Level Parallelism of the cache
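To make the link between MSHR count and MLP concrete, here is a minimal timing sketch. It assumes every miss is independent, takes a fixed latency, and can be issued the moment an MSHR is free; the function name and the cycle counts are hypothetical and chosen only for illustration.

```python
import heapq


def time_to_serve_misses(num_misses: int, num_mshrs: int, miss_latency: int) -> int:
    """Cycles until all misses complete when at most num_mshrs are outstanding."""
    if num_mshrs < 1:
        raise ValueError("need at least one MSHR")
    in_flight = []  # min-heap of completion times, one entry per occupied MSHR
    now = 0
    for _ in range(num_misses):
        if len(in_flight) == num_mshrs:
            # All MSHRs occupied: stall until the earliest miss returns
            # and its entry is cleared.
            now = max(now, heapq.heappop(in_flight))
        # Allocate an MSHR entry for this miss.
        heapq.heappush(in_flight, now + miss_latency)
    return max(in_flight) if in_flight else 0
```

Under these assumptions, eight misses at 100 cycles each take 800 cycles with one MSHR, 200 cycles with four, and 100 cycles with eight, which is the sense in which the MSHR count bounds the parallelism the cache can exploit.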
MSHR Contention
● MSHRs of the shared LLC are also shared by the cores
● If all MSHRs are full:
○ The cache becomes blocked
○ Memory operations (including cache hits) will be blocked until free MSHRs become available
● Cache partitioning does not prevent MSHR contention
[Figure: a core's L1 cache miss issues an MSHR request; when all MSHRs are occupied, the LLC is inaccessible until an MSHR is available]
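The blocking behavior described above can be sketched with a toy model of a shared cache. The class and method names are hypothetical, and the model tracks only MSHR occupancy, not cache contents or timing.

```python
class SharedCache:
    """Toy shared LLC that tracks only which MSHR entries are occupied."""

    def __init__(self, num_mshrs: int):
        self.num_mshrs = num_mshrs
        self.pending = {}  # line address -> core whose miss is outstanding

    def access(self, core: int, addr: int, is_hit: bool) -> str:
        """Return 'blocked', 'hit', or 'miss' for one memory operation."""
        if len(self.pending) >= self.num_mshrs:
            # Every MSHR is occupied: the cache is blocked for all cores,
            # including operations that would otherwise hit.
            return "blocked"
        if is_hit:
            return "hit"
        self.pending[addr] = core  # allocate an MSHR entry for the miss
        return "miss"

    def data_received(self, addr: int) -> None:
        """Clear the MSHR entry once the data for addr arrives."""
        self.pending.pop(addr, None)
```

With three MSHRs, misses from three co-runners fill every entry, and a subsequent hit from the subject core returns `"blocked"` even though its data sits in its own partition. Once `data_received` frees an entry, the same access returns `"hit"`.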
MSHR Contention as a Source of Interference
● Performance of the subject is measured independently and with co-runners
● Unwanted cache-line evictions are prevented by page coloring
● If page coloring is sufficient for isolation, co-runners should not affect the performance of the subject
[Figure: Core 1 (Subject) and Cores 2-4 (Co-runners) sharing a partitioned LLC backed by DRAM]
Testing Platforms
Results
● LLC: All memory accesses are LLC hits
● DRAM: All memory accesses are LLC misses
● Latency: Has data dependencies that cause it to generate only one outstanding request at a time
● BwRead: Has no data dependencies, so it can generate multiple outstanding requests at a time
Results
● Number of Global MSHRs relative to Local MSHRs significantly impacts the amount of contention between the cores
● MSHR setting: (Local MSHRs / Global MSHRs)
Proposition
● Dynamically controlling the MSHRs will improve isolation of the cores
● Add “Target Count” and “Valid Count” registers to the local cache MSHRs
● This allows the OS to control each core’s MLP independently
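One way to picture the two registers is the sketch below. It assumes that Valid Count tracks how many local MSHR entries are currently occupied and that a new miss cannot allocate an entry once Valid Count reaches Target Count; the slides do not spell out these semantics, and the class and method names are hypothetical.

```python
class LocalMshrFile:
    """Toy per-core MSHR file with OS-programmable occupancy limit."""

    def __init__(self, num_entries: int):
        self.num_entries = num_entries
        self.target_count = num_entries  # written by the OS; starts at the maximum
        self.valid_count = 0             # maintained by hardware (assumed semantics)

    def set_target_count(self, value: int) -> None:
        """OS-side write of the Target Count register."""
        if not 0 <= value <= self.num_entries:
            raise ValueError("TargetCount out of range")
        self.target_count = value

    def try_allocate(self) -> bool:
        """Allocate an entry for a miss; fail once the target is reached."""
        if self.valid_count >= self.target_count:
            return False  # the core must stall this miss
        self.valid_count += 1
        return True

    def release(self) -> None:
        """Free an entry when the data for a pending miss arrives."""
        if self.valid_count == 0:
            raise RuntimeError("no valid MSHR entry to release")
        self.valid_count -= 1
```

Lowering `target_count` on one core caps how many outstanding misses that core can push toward the shared LLC, which is how the OS would throttle a core's MLP without touching the others.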
Implementation
● Utilized the gem5 cycle-accurate simulator
● MSHR reservation occurs upon context switch in a core
○ If the next task is a real-time task, configure the TargetCount register of the core to reserve appropriate MSHR slots for the task
○ Any remaining (unreserved) MSHR slots are distributed across the cores to be utilized for best-effort processes
○ If no currently running tasks require MSHR reservations, the TargetCount of each core is reset to the maximum
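The context-switch policy can be sketched as a function that recomputes every core's TargetCount from the set of tasks now running. This is only one possible reading of the steps described in this slide: the `Task` type, its `mshr_reservation` field, the even split of unreserved slots among best-effort cores, and the example numbers are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    mshr_reservation: int = 0  # > 0 marks a real-time task (hypothetical field)


def update_target_counts(running: List[Optional[Task]],
                         global_mshrs: int, local_max: int) -> List[int]:
    """Return the TargetCount for each core, given one running task per core."""
    # Clamp each reservation to what a single core's MSHR file can hold.
    wants = [min(t.mshr_reservation, local_max) if t else 0 for t in running]
    reserved = sum(wants)
    if reserved == 0:
        # No running task needs a reservation: reset every core to the maximum.
        return [local_max] * len(running)
    if reserved > global_mshrs:
        raise ValueError("reservations exceed the shared MSHRs")
    best_effort = [i for i, w in enumerate(wants) if w == 0]
    targets = list(wants)
    if best_effort:
        # Split the unreserved slots evenly among best-effort cores (assumed policy).
        share = (global_mshrs - reserved) // len(best_effort)
        for i in best_effort:
            targets[i] = min(share, local_max)
    return targets
```

For example, with 12 shared MSHRs, 6 local MSHRs per core, and one real-time task reserving 4 slots on core 0, the three remaining cores would each get a TargetCount of 2 under this split.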
Evaluation
● BwWrite(DRAM) is run on each core as a “best-effort” task
● Periodic EEMBC benchmarks with computation times of ~8ms are used as the “real-time” tasks, with the following periods:
○ Core1: 20ms
○ Core2: 30ms
○ Core3: 40ms
○ Core4: 60ms
● Real-time tasks see an improvement of up to 20% due to the reduction in MSHR contention
● Best-effort tasks suffer a 3% throughput reduction
Questions?
● What are some real-time systems that could benefit from this architecture?
● Why don’t multicore processors currently allow control over MSHR allocation?
● What is a remaining source of contention?
● Why is the implementation of page coloring in the system a prerequisite for performing these experiments accurately?