software implications of high-performance memory systems
Post on 02-Jun-2022
6 Views
Preview:
TRANSCRIPT
1
Software implications of high-performance memory systems
Leif LindholmARM Ltd.
Embedded Linux Conference Europe 2010Copyright 2010, ARM Ltd.
2
Alternative titles
Barriers – the what, the how, and the by all that is holy why?
What you’re going to wish you didn’t know about modern computer systems if you don’t already
Of course it couldn’t do that! …could it?
3
OverviewThis presentation aims to explain some of what goes on underneath your feet when developing software for modern computer systems.
The good news is that if you are an application developer, you normally don’t need to be aware of this. Congratulations!
The bad news is that if you are developing or debugging kernel code, drivers, system libraries, execution environments, JIT compilers, …, you do. Sorry.
4
It all used to be so simpleSingle core
In-orderSingle-issueNo speculationNo caches?
Only slave peripheralsNo DMA
Simple operating systemsBare metal?
Few, if any, user-accessible expansion ports
Processor
DRAM
Flash
Modem
LCD
Audio
GPIO
5
The world today
core
core core
core
L2 cache
PCIe
USB
SATA
DMC
Multi-layerreorderingbus matrix
DDR3 DDR3
Southbridge
graphicsaccelerator
DSP
Ethernet
graphicscontroller
6
The world todayMulti-core processorsSpeculative multi-issue out-of-order coresMultiple levels of caches
Some with hardware coherency management
Multi-layered bus interconnectsMemory access merging (reads and writes)Many agents/bus-masters in systemEnd-user accessible expansion bussesHighly optimizing compilers
Most of this available today in devices amusingly still referred to as phones, as well as set-top-boxes, TVs, …
7
So what does all that stuff mean in practise?
8
In the good old days…Things happened in the way specified by the programThings happened the number of times specified in the program (no more, no less)Only one thing happened at once
This is now referred to as “the sequential execution model”For software to work at all, this model must still appear to be in place within the scope of a single process executing on a single core
But throw in some SMP and the world changes…And this will not necessarily be true for an external observer comparing the bus traffic to the program code
9
Multi-issue (superscalar)More than one instruction can be issued per clock cycle, where not prevented by data-dependencies
Offers new and exciting ways for compilers to improve code performance by shuffling instructions around
1 add r0, r0, #12 mul r2, r2, r33 load r1, [r0]4 mov r4, r25 sub r1, r2, r56 store r1, [r0]7 return
1 add mul2 load mov3 sub *stall*4 store return
Executing on a dual-issue core
10
SpeculationThe core executes things before it is determined if they are actually meant to execute
Pretending that nothing happened if it turns out the speculation was not the actual case
The core fetches code or data it determines might be used soon into cache ahead of time (prefetching)
add r0, r0, #1cmp r0, #42bne skipload r1, [r2]b proceed
skip: load r1, [r3]proceed: store r1, [r4]
11
Out-of-order executionWhen core detects an unresolved data-dependency preventing it from issuing an instruction, it just issues the next instruction instead of stalling waiting for the result to come back
Continues executing until there are nonon-dependent operations available
1 add r0, r0, #12 mul r2, r2, r33 store r2, [r0]4 load r4, [r1]5 sub r1, r4, r26 return 1 add r0, r0, #1
2 mul r2, r2, r34 load r4, [r1]3 store r2, [r0]5 sub r1, r4, r26 return
1 add r0, r0, #12 mul r2, r2, r3
*stall*3 store r2, [r0]4 load r4, [r1]
*stall*5 sub r1, r4, r26 return
In-order
Out-of-order
12
Coherency-managed SMPLines can migrate between (data) caches at any timeWrite buffers can affect externally visible ordering of memory accesses (between cores as well as in the outside system).
send_ipi: (core0)load r3, #IPI_IDstore r2, [r1] @ set payloadstore r3, [r0] @ send IPI
recv_ipi: (core1)load r1, [r0]cmp r1, #VALUE @ should contain what
@ was in r2 on core0
13
External mastersTypical use of a DMA controller:
You write a bunch of data into a shared buffer, and clean your caches after completion if using cached memoryThen you signal the DMA controller to start transferringThings will work a whole lot better if the DMA controller sees these operations in this order
Using a DSP to do video decode into a shared buffer?
14
And let’s not forget the compilers
Ignoring all of the magic I’ve mentioned underneath the hood, what would you expect somefunc() to return?
42, yes, that’s possible.So is 0.
int flag = BUSY;int data = 0;
int somefunc(void){
while (flag != DONE)continue;
return data;}
void otherfunc(void){
data = 42;flag = DONE;
}
15
All in allReading architecture specifications these days, you frequently come across interesting terms and phrases like:
…is observed to……must appear to…
The comfy world of sequential execution is no more. One must now think of whether the effect of an instruction can be detected rather than if it has “executed”
If you dual-issue a NOP with an ADD … does it take any time to execute?
Where correct operation requires something to appear in the same order to multiple agents, this must be explicitly ordered
16
So how come anything actually works?
17
How come anything works?Because within each core, the sequential execution model must still (appear to) hold true
Dependent/overlapping accesses cannot be reordered*
Because of barriers
And because library functions that require barriers for correct operation already use them where necessary
void somefunc(void){
unsigned char *cptr = iptr;*iptr = 0x12345678;cptr[1] = 0xff;
}
18
Barrier and Fence instructionsBarriers make it possible to write software that actually worksInstructions that explicitly order memory accesses
Prevent reordering of any memory accesses past the barrierPrevent reordering of specific memory accesses past the barrierEnsure synchronization between data and instruction sideEnsure synchronization between instruction stream and memory accesses
Architecture BarriersAlpha IMB, MB, WMBARMv7 DMB, DSB, ISBIA64 MFPPC SYNC, LWSYNC, EIEIOx86/AMD64 LFENCE, MFENCE, SFENCE
19
Compiler barriersAn optimizing compiler is free to reorder non-volatile memory accesses in any way it sees fit in order to improve performance.
And rememberDocumentation/volatile-considered-harmful.txt
while (*hold == 1);return *ret;
load r0, [ret] 1: load r1, [hold]
cmp r1, #1beq 1bb LR
This can be prevented by introducing a compiler scheduling barrier:barrier() defined in include/linux/compiler-gcc.h
#define barrier() __asm__ __volatile__("": : :"memory")
20
Linux generic memory barriersLinux defines a set of generic memory barrier macros, common both to SMP and uniprocessor systems
Since the DEC Alpha had the weakest memory model of all platforms in the kernel, this became the template for the architecture-independent model within Linux
“If it works on the Alpha, it’ll work anywhere”
Guaranteed to be at minimum a compiler barrierBut where architecturally required, it will output the necessary barrier instruction
Macro Functionalitymb() No memory accesses can overtake.rmb() No reads can overtake.wmb() No writes can overtake.
21
Linux SMP memory barriersLinux also defines a set of barriers that ensure correct operation in SMP systems – in practise where hardware coherency management is in place
Only guaranteed for cached memory, system bus effects ignored
NOT a superset of generic barriers – usually weaker
Turned into compiler barriers when CONFIG_SMP is not enabled
Macro Functionalitysmp_mb() No memory accesses can overtake.smp_rmb() No reads can overtake.smp_wmb() No writes can overtake.
22
Read dependency barriers
Macro Functionalityread_barrier_depends() Ensures values from previous reads are usable.smp_read_barrier_depends() Ensures values from previous reads are usable.
*The DEC Alpha processor amazingly permitted reordering dependent loads
The read_barrier_depends() macros were introduced to deal with this – these turn into NULL statements on all other architectures (not even compiler barriers)
23
mmiowb()mmiowb() forces global ordering of memory mapped I/O accesses
Macro Functionalitymmiowb() Synchronize I/O globally.
24
outer_sync()When barriers only reach the external bus interface of the processor, the interconnect can still reorder bufferable memory accesses
Cortex-A9 does not have an integrated Level 2 cache – most implementations supplemented with external PL310 controller.ARM-specific outer_sync() macro included in mb() when DMA memory treated as bufferable
arch/arm/include/asm/outercache.h
CPU
store (DMA region)mb()store (initiate transfer)
store (initiate transfer)store (DMA region)
L2 cache
CPU
store (DMA region)mb() – with outer_sync()store (initiate transfer)
store (DMA region)store (initiate transfer)
L2 cache
mb() without outer_sync() mb() with outer_sync()
25
Shameless marketingCortex-A15 has an integrated L2 cache, but also implements external bus interfaces following the new AMBA4 AXI specification
These AMBA4 AXI interfaces also support ACE (AMBA Coherency Extensions)
ACE includes support for having the interconnect propagating barriersBarriers can be specified with a limit. Backwards-compatible in that unimplemented barrier variants will execute as System-wide barriers.
Non-shareable (NSH)Inner-shareable (ISH)Outer-shareable (OSH)System-wide (SY)
26
I/O accessorsA long thread (“USB mass storage and ARM cache coherency”) spanned several kernel lists earlier this year
Uncovered that actually quite a few drivers do not really use barriers everywhere they should beThe pragmatic solution was to add barriers to ARM I/O accessors
read{b,w,l}()
write{b,w,l}()
ioread{8,16,32}()
iowrite{8,16,32}()
27
Synchronization primitivesspin_{lock,unlock}() contain smp_mb()
This ensures ordering between acquiring the lock and accessing the protected resource, and between modifying the resource and releasing the lock
atomic_{inc,dec,add,sub}() make no such promises
28
In summarySo, barriers are great – I should put mb() everywhere just to make sure?
Well, no … barriers are sometimes required to make software work as expected, but they do come at a cost
An smp_rb() might have no visible impact even on an SMP system, whereas an mb() can force an outer_sync() as well as forcing a drain of the write buffer.Always use the weakest barrier possible – even if there is no noticeabe difference on your current platform between using smp_rmb() or mb(), that is not necessarily the case for other platforms. Some of which you might be using in your next project.
29
ReferencesDocumentation/memory-barriers.txt
Paul E. McKenney - Memory Barriers: a Hardware View for Software Hackers http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf
Evolution of “Memory Ordering in Modern Microprocessors” LJ articles
Kourosh Gharachorloo - Memory consistency models for shared-memory multiprocessors
Barrier Litmus Test and Cookbok - infocenter.arm.com
top related