code-generation for the arm m-profile vector extension€¦ · us llvm developer conference 2019...
TRANSCRIPT
US LLVM Developer conference 2019
Code-Generation for the Arm M-profile Vector Extension
Sjoerd Meijer
Sam Parker
David Green
2 © 2019 Arm Limited
Major MVE Architecture Features
• Low-overhead branches, a.k.a. hardware-loops• Improved performance for the loop test, increment, branch
• Vectorisation• 128-bit vector size• 8 vector registers• 1 predicate register
• Predication:• Tail-loop predication• Lane-predication, a.k.a. VPT predication (like IF-THEN blocks)
3 © 2019 Arm Limited
The Challenge is....
All these features can be combined!
"Tail-predicated and vectorised hardware-loop"
1. What is it?
2. Why is it useful?
3. How should the transformation pipeline look like to achieve this?
4 © 2019 Arm Limited
Predication
1. Tail-loop predication• Removes the need for generating a scalar remainder loop.
2. Lane / VPT Predication
VPT BlockScalar code example
Vector pseudo code
for (i= 0; i < VL; i++)
if (A[i] < B[i])
A[i] = B[i] + C[i]
else
A[i] = B[i] + D[i]
mask = A < B
T: A = B + C
F: A = B + D
VPTE.s32 lt, q0, q1
VADDT.i32 q0, q1, q2
VADDE.i32 q0, q1, q3
5 © 2019 Arm Limited
Tail-Predication: A Code Example
• Loop is vectorisable• Assume restrict, or runtime checks
• Unknown trip count N:• Vector body• Scalar remainder
void foo(int *A, int *B, int *C, int N)
{
for (int i = 0; i < N; i++)
C[i] = A[i] + B[i];
}
6 © 2019 Arm Limited
Tail-Predication: What Is It?
i = 4
i = 0
i = 8
// Assume N = 10, VF = 4
// 1. vector body
for (int i = 0; i <= 8; i+=4)
C[i:4] = A[i:4] + B[i:4];
// 2. scalar remainder: the tail
for (i = 0; i < 2; i++)
C[i] = A[i] + B[i];
0 1 2 3
4 5 6 7
8 9 10 11
// 3. tail-folded vector loop
for (int i = 0; i < 12; i+=4)
M = get_mask(i:4, N);
M: C[i:4] = A[i:4] + B[i:4];
Active lanes
Inactive lanes
4 lanes:
vector loop iterations:
Tail-folding
7 © 2019 Arm Limited
Tail-Predicated Vector Hardware-Loop: Why is it useful?
Tail-predicated vector hardware-loop:
• 4 instructions for the loop-body:
• 2 instructions for controlling the loop(s)
Benefits:
1. Code-size: important for microcontrollers.
2. Performance: data sets relatively small, overhead of the tail can be significant.
// r0 = A, r1 = B
// r2 = C, r3 = N
wlstp.32 lr, r3, END
.LBB0_4:
vldrw.u32 q0, [r0, #16]!
vldrw.u32 q1, [r1, #16]!
vadd.i32 q0, q1, q0
vstrw.32 q0, [r2, #16]!
letp lr, .LBB0_4
END:
8 © 2019 Arm Limited
if (N == 0) goto END;
if (N < 4)
goto REMAINDER_LOOP;
else
goto VEC_LOOP;
VEC_LOOP:
for (..)
C[i:4] = A[i:4] + B[i:4];
REMAINDER_LOOP:
for (..)
C[i] = A[i] + B[i];
END:
return;
Typical Vectorisation Plan:
Vectorisation and Predication:
• How do we create 1 vector loop that combines the vector and scalar code?
Hardware-loops:
• Do we create hwloops before, during, or after vectorisation?
• How is a hwloop represented in IR, and
• How is it lowered?
9 © 2019 Arm Limited
Optimisation Pipeline & Contributions
Loopvectoriser
IRhardwareLoop pass
ARMbackend
1 2
4
3
Tail-folded
vector loop
Scalar Loop
Tail-folded vector hwloop
IRTail Pred Pass
Tail-predicate vector hwloop
ARM Backend
3
10 © 2019 Arm Limited
wlstp.32 lr, r3, END
.LBB0_4:
vldrw.u32 q0, [r0, #16]!
vldrw.u32 q1, [r1, #16]!
vadd.i32 q0, q1, q0
vstrw.32 q0, [r2, #16]!
letp lr, .LBB0_4
END:
Optimisation Pipeline - Backend
Low-overhead loop&
Tail-predication (todo)
4
Revert if necessary:
hwloops,
Tail-predication
ISEL & Pseudo exp.
Loop
Loop
Transformation Details
12 © 2019 Arm Limited
1. Vectorisation: Tail Folding
Pseudo code
vector.preheader:
Vimax = (N, N, N, N)
vector.body:
Vinit = (i, i, i, i)
Vinc = (0, 1, 2, 3)
Vi = Vinit + Vinc
ActiveLaneMask = Vi < Vimax
i += 4
Enabled in multiple ways
• #pragma
clang loop vectorize_predicate(enable)
• -prefer-predicate-over-epilog=true
• Optimise for size.
• Querying a TTI hook (in progress).
13 © 2019 Arm Limited
2. Hardware Loops
• A generic LLVM IR LoopPass which inserts intrinsics into loops, designating it as a 'hardware loop'.
• Transforms natural loops with a known trip count.
• Works on both vector and scalar loops.
• TargetTransformInfo hook to check support and profitability of the conversion.• Also conveys target details back to the pass which enables configurability.
• Based upon the PPCCTRLoops pass.
• Now used by both PPC and ARM backends.
14 © 2019 Arm Limited
2. Hardware Loop: Intrinsics
• void llvm.set.loop.iterations(anyint)
• Set the counter.
• i1 llvm.test.set.loop.iterations(anyint)
• Set and test that the given operand is not zero.
• i1 llvm.loop.decrement(anyint)
• Signal whether to continue looping.
• anyint llvm.loop.decrement.reg(anyint, anyint)
• A fancy sub for the iteration count.
15 © 2019 Arm Limited
2. Hardware Loops: Examplepreheader:
call void @llvm.set.loop.iterations(i32 %num.iterations)
br label %body
body:
%count = phi i32 [ %num.iterations, %preheader ], [ %remaining, %body ]
...
%remaining = call i32 @llvm.decrement.reg(i32 %count, i32 1)
%not.finished = icmp ne i32 %remaining, 0
br i1 %not.finished, label %body, label %exit
16 © 2019 Arm Limited
2. Hardware Loops: Guard entryentry:
%non.zero = call i1 @llvm.test.set.loop.iterations(i32 %num.iterations)
br i1 %non.zero, label %preheader, label %exit
preheader:
br label %body
body:
%count = phi i32 [ %num.iterations, %preheader ], [ %remaining, %body ]
...
%remaining = call i32 @llvm.decrement.reg(i32 %count, i32 1)
%not.finished = icmp ne i32 %remaining, 0
br i1 %not.finished, label %body, label %exit
unrestricted
17 © 2019 Arm Limited
3. MVE Tail Predication
• An IR LoopPass which inserts Arm specific intrinsics to perform vector predication.
• Find hardware loop intrinsics.
• Find masked load/store intrinsics.
• Masked load/store intrinsics take a lane predicate.
• Replace lane masking vector icmp sequence with an intrinsic.
18 © 2019 Arm Limited
3. VCTP 'Vector Create Tail Predicate'
• Vectorizer will generate a specific pattern for masking out inactive lanes.
• The loop invariant 'broadcast.splat.limit' gives us our total number of elements that we need to process.
• Given the number of elements to process and the effective vector width, generate a mask for the inactive lanes.
%index = phi i32
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert,
<4 x i32> undef,
<4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%pred = icmp ule <4 x i32> %induction, %broadcast.splat.limit
%load = tail call <4 x i32> @llvm.masked.load..<4 x i1> %pred...
19 © 2019 Arm Limited
3. Replacing icmp sequence with vctppreheader:
call void @llvm.set.loop.iterations(i32 %num.iterations)
br label %vector.body
vector.body:
%count = phi i32 [ %num_iterations, %preheader ], [ %count.next, %vector.body ]
%elems = phi i32 [ %total.elems, %preheader ], [ %elems.remaining, %vector.body ]
%tail.pred = call <4 x i1> @llvm.arm.vctp32(i32 %elems)
%elems.remaining = sub i32 %elems, 4
...
%count.next = call i32 @llvm.decrement.reg(i32 %count, i32 1)
%not.finished = icmp ne i32 %count.next, 0
br i1 %not.finished, label %vector.body, label %exit
20 © 2019 Arm Limited
4. Instruction Selection
llvm.set.loop.iterations
llvm.test.set.loop.iterations
br
llvm.arm.vctp
llvm.loop.decrement.reg
icmp ne/eq/ugt, etc...
br
DoLoopStart (Pseudo)
WhileLoopStart (Pseudo)
1-1 mapping with real instruction
Separate the decrement from the
control flow:
- LoopDec (Pseudo)
- LoopEnd (Pseudo)
Need to check branch targets and conditions when combining nodes!
21 © 2019 Arm Limited
4. Pre-emission Finalisation
• Right at the end of the pipeline.
• Need to handle scalar and vector loops (predicated or not).
• Either generate special loop instructions or revert to sub, compare and branch.• DoLoopStart = DLS, DSLTP• WhileLoopStart = WLS, WLSTP• LoopDec, LoopEnd = LE, LETP
• Expand pseudo instructions.• Their sizes are large enough to compensate for extra instructions that maybe inserted.
• Convert explicit predication to implicit predication (which we haven't done yet...)
22 © 2019 Arm Limited
4. When do we revert to a 'normal' loop?
• If LoopDec is used by another instruction other than LoopEnd.• Loop counter is held in the link register (LR), which is available to the register allocator.• The 'LE' instruction will be performing the decrement and the branch, so the real value of LoopDec
does not exist before the terminator!
• If we don’t discover all the necessary pseudo instructions.
• Branch targets need to be in range:• LoopStart instructions can only branch forwards.• LoopEnd can only branch backwards.
• No calls within the loop.• Though this is not an architectural requirement.
23 © 2019 Arm Limited
4. When won't we convert to implicit tail predication?
• Finalising after VPT (predicated vector) blocks have been generated.
• We can't use implicit predication if we don't know how to remove the existing VPT blocks.
• If we don't understand the predicate dataflow:• VPT blocks can represent nested conditional statements.• We have ONE predicate register which can be written, spilled, reloaded, etc...• Need dataflow analysis.
• If we don’t know how to tail predicate an instruction.• This is a temporary lack of semantic understanding on our part.
24 © 2019 Arm Limited
Where we are, where we want to be
wls lr,
.body:
vctp.32 ; Generate Predicate
sub
vpstt ; VPT block
vldrwt.u32 ; Predicated load
vldrwt.u32 ; Predicated load
vadd.i32
vpst ; VPT block
vstrwt.32 ; Predicated store
le lr, .body
Explicit: LR holds an iteration count while vctpoperates on another register holding an element count.
wlstp.32 lr, ; Generate Predicate
.body:
vldrw.u32 ; Predicated load
vldrw.u32 ; Predicated load
vadd.i32 ; Predicated add
vstrw.32 ; Predicated store
letp lr, .body ; Generate Predicate
Implicit: LR holds the element count and predication is performed via the loop start/end instructions via a system register.
Smaller, faster, better!
25 © 2019 Arm Limited
Conclusion
• Created a multi-stage optimisation pipeline to enable tail predicated vector loops.
• Code generation for MVE is in good shape and still improving.
• Development has been relatively quick because we were able to reuse existing code.
• Interested in talking to others who rely on analysis and transforms on post-RA machine code.
• Thank you for all your feedback and reviews!
Thank YouDankeMerci谢谢
ありがとうGracias
Kiitos감사합니다
धन्यवादشكًراתודה
© 2019 Arm Limited