Autovectorization in LLVM
Changwoo Min ([email protected])
2010/06/23
Project Goal
Design and implement a prototype-level autovectorizer in LLVM
Understand LLVM through hands-on work
Implement a simple analysis phase in LLVM
Implement a simple transform phase in LLVM
Vector Support in LLVM
Supports vector types and their operations at the IR level
Lowers vector types to MMX/SSE instructions on the IA-32 architecture
• vector stride = 1
a[i] = b[i] + c[i]
• vector type, vector operation
%pb = getelementptr [32 x i32]* @b, i32 0, i32 %i
%vb = bitcast i32* %pb to <8 x i32>*
%pc = getelementptr [32 x i32]* @c, i32 0, i32 %i
%vc = bitcast i32* %pc to <8 x i32>*
%vb_i = load <8 x i32>* %vb, align 32
%vc_i = load <8 x i32>* %vc, align 32
%va_i = add nsw <8 x i32> %vb_i, %vc_i
%pa = getelementptr [32 x i32]* @a, i32 0, i32 %i
%va = bitcast i32* %pa to <8 x i32>*
store <8 x i32> %va_i, <8 x i32>* %va, align 32
• SSE code generation
movaps b(,%eax,4), %xmm0
paddd c(,%eax,4), %xmm0
movaps %xmm0, a(,%eax,4)
Vectorization: what is it?
Original loop:
int a[259], b[259], c[259];
for (i = 0; i < 259; ++i) {
  a[i] = b[i+1] + c[i];
}
Vectorized loop:
int a[259], b[259], c[259];
for (i = 0; i < 256; i += 8) {
  a[i:i+8] = b[i+1:i+9] + c[i:i+8];
}
if (i != 259) {
  for (; i < 259; ++i) {
    a[i] = b[i+1] + c[i];
  }
}
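The transformation above can be checked in plain C. The following is a minimal sketch, not the code produced by the vectorizer: the inner "lane" loop stands in for one 8-wide vector operation, and the function names are hypothetical. It also sizes b with one extra element so that b[i+1] stays in bounds, which is an assumption that differs from the declaration on the slide.

```c
#include <assert.h>
#include <string.h>

#define N  259   /* trip count used on the slide */
#define VF 8     /* vectorization factor (8 x i32) */

/* Scalar reference loop: a[i] = b[i+1] + c[i].
 * Assumption: b holds N + 1 elements so that b[i+1] stays in bounds. */
void add_scalar(int *a, const int *b, const int *c) {
    for (int i = 0; i < N; ++i)
        a[i] = b[i + 1] + c[i];
}

/* Strip-mined form: a main loop that handles VF elements per iteration,
 * followed by an epilogue loop for the N % VF leftover elements. */
void add_strip_mined(int *a, const int *b, const int *c) {
    const int vec_end = N - N % VF;   /* 256 for N = 259 and VF = 8 */
    int i = 0;
    for (; i < vec_end; i += VF) {
        /* Stands in for one <8 x i32> load, add, and store. */
        for (int lane = 0; lane < VF; ++lane)
            a[i + lane] = b[i + 1 + lane] + c[i + lane];
    }
    assert(i == vec_end);             /* main loop stops at the vector bound */
    for (; i < N; ++i)                /* epilogue: the remaining 3 elements */
        a[i] = b[i + 1] + c[i];
}

/* Returns 1 when both versions produce identical output. */
int strip_mined_matches_scalar(void) {
    int a1[N], a2[N], b[N + 1], c[N];
    for (int i = 0; i < N + 1; ++i) b[i] = 3 * i + 1;
    for (int i = 0; i < N; ++i)     c[i] = 7 - i;
    add_scalar(a1, b, c);
    add_strip_mined(a2, b, c);
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

Calling strip_mined_matches_scalar() compares the two versions on the same input; the main loop covers indices 0 to 255 and the epilogue covers 256 to 258.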
Vectorization, big idea
Find a loop
• Use existing LLVM infrastructure
Is it vectorizable?
• Is it a countable loop?
• Are there any unvectorizable instructions?
• Is there a loop-independent dependence?
• Is there a loop-carried dependence?
If so (Yes), vectorize it
• Change array type to vector type
• Type casting
• Alignment
• Handle the remainder, if any
Find a loop
Implement the “LoopVectorizer” pass as one of the transform passes
• Inherit the LoopPass class, which is invoked for every loop.
• PassManager, the parent of the LoopPass manager, handles integration with other LLVM passes.
Ask PassManager to hand over loops in a more canonical form than a natural loop
• LoopSimplify form: entry block, exit block, latch block, single backedge
• Countable loop that increments by one
Pass hierarchy: PassManager -> LoopPass -> LoopVectorize
Is it vectorizable? (1/3)
Loop type test
• Innermost loop
• Countable loop
  for(i=0;i<100;++i) OK
  for(;*p!=NULL;++p) NOK
• Long enough to vectorize
  for(i=0;i<3;++i) NOK
  The iteration count should be larger than the vectorization factor.
Are there any unvectorizable IR instructions?
• Function call: NOK
• Stack allocation: NOK
• Operation on a scalar value other than the loop induction variable: NOK
• The stride of every pointer/array access should be one.
  a[i] OK, a[2*i] NOK
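The checks on this slide can be read as a single predicate over a loop summary. The sketch below is illustrative only: struct LoopSummary and passes_basic_checks are hypothetical names, and the fields are assumed to be filled in by earlier analysis rather than taken from any LLVM API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of one loop, assumed to be filled in by earlier
 * analysis. None of these names come from LLVM. */
struct LoopSummary {
    bool is_innermost;
    bool is_countable;     /* trip count known before entering the loop */
    long trip_count;       /* meaningful only when is_countable is true */
    bool has_call;         /* function call in the body */
    bool has_stack_alloc;  /* stack allocation in the body */
    bool has_scalar_op;    /* scalar operation other than on the induction variable */
    int  num_accesses;
    const long *strides;   /* index stride of each array access, in elements */
};

/* Returns true when the loop passes the loop-type and instruction checks
 * for vectorization factor vf. */
bool passes_basic_checks(const struct LoopSummary *l, long vf) {
    assert(vf > 1);
    if (!l->is_innermost || !l->is_countable)
        return false;
    if (l->trip_count <= vf)            /* must be longer than the factor */
        return false;
    if (l->has_call || l->has_stack_alloc || l->has_scalar_op)
        return false;
    for (int k = 0; k < l->num_accesses; ++k)
        if (l->strides[k] != 1)         /* a[i] is fine, a[2*i] is not */
            return false;
    return true;
}
```

With a trip count of 259, unit strides, and vf = 8 the predicate accepts the loop; a trip count of 3 or a stride of 2 rejects it.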
Is it vectorizable? (2/3)
Collect the array/pointer variables used on the LHS and RHS
a[i] = b[i+1] + c[i];
LHS = {a[i]}, RHS = {b[i+1], c[i]}
Is it vectorizable? (3/3)
Data dependence testing between LHS and RHS
Dependence testing
• The strides of W and R are one.
• We only check whether W and R collide WITHIN the vectorization factor, by subtracting their constant offsets.
• For W[i+LC] and R[i+RC]: if |LC-RC| < vectorization factor, there is a collision and the loop is not vectorizable.
foreach member W in LHS
  foreach member R in LHS ∪ RHS
    if R is an alias of W
      if there is a data dependence between W and R
        “It is not vectorizable.”
“OK, it is vectorizable.”
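The nested loop above can be sketched in C as follows. This is an illustration under stated assumptions, not the prototype's code: struct Access, collides, and is_vectorizable are hypothetical names, aliasing is reduced to comparing a base identifier, and a distance of zero is treated as safe. That last point is an assumption; without it, every W would collide with itself in LHS ∪ RHS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical record for one unit-stride access of the form base[i + offset]. */
struct Access {
    int  base_id;  /* accesses with equal ids are assumed to alias */
    long offset;   /* LC for a write, RC for a read */
};

/* True when w and r alias and their offsets differ by less than vf.
 * Assumption: distance 0 (the same element in the same iteration) is safe. */
static bool collides(const struct Access *w, const struct Access *r, long vf) {
    if (w->base_id != r->base_id)
        return false;
    long dist = labs(w->offset - r->offset);
    return dist > 0 && dist < vf;
}

/* Checks every write W in lhs against every access R in lhs and rhs. */
bool is_vectorizable(const struct Access *lhs, int n_lhs,
                     const struct Access *rhs, int n_rhs, long vf) {
    assert(vf > 1);
    for (int i = 0; i < n_lhs; ++i) {
        for (int j = 0; j < n_lhs; ++j)
            if (collides(&lhs[i], &lhs[j], vf))
                return false;
        for (int j = 0; j < n_rhs; ++j)
            if (collides(&lhs[i], &rhs[j], vf))
                return false;
    }
    return true;
}
```

For a[i] = b[i+1] + c[i] no pair aliases, so the loop is accepted. For a[i+1] = a[i] + c[i] the write and read of a are one element apart, which is less than a factor of 8, so the loop is rejected.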
If so, vectorize it (1/5)
Idea
Original loop:
int a[259], b[259], c[259];
for (i = 0; i < 259; ++i) {            /* loop body */
  a[i] = b[i+1] + c[i];
}
Transformed loop:
int a[259], b[259], c[259];
for (i = 0; i < 256; i += 8) {         /* vectorized loop body */
  a[i:i+8] = b[i+1:i+9] + c[i:i+8];
}
if (i != 259) {                        /* epilogue preheader: check if there are remainders */
  for (; i < 259; ++i) {               /* epilogue loop for the remainder */
    a[i] = b[i+1] + c[i];
  }
}
If so, vectorize it (2/5)
Vectorize Loop Body
1. Insert a bitcast instruction after every getelementptr instruction
2. Replace uses of the getelementptr with the bitcast
   If the user is a load or store instruction, set its alignment constraint.
3. Construct the set of instructions that require type casting from array/pointer type to vector type
   This is the maximal use set of the getelementptr instructions.
   Cast the types of the instructions in this set to vector type.
4. Change the increment of the induction variable to the vectorization factor
5. Change the destination of the loop exit to the epilogue preheader
Calculate alignment
• The base address is assumed to be 32-byte aligned.
• Only check whether the offset from the induction variable breaks that alignment.
  a[0]: 32-byte aligned
  a[i]: 32-byte aligned
  a[i+1]: 4-byte aligned
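The alignment rule can be written as a small function. This is a sketch under the slide's assumptions (vector-aligned base address, induction variable always a multiple of the vectorization factor); access_alignment is a hypothetical name, and falling back to the element size for every nonzero misalignment is a conservative choice made here, not something the slide specifies.

```c
#include <assert.h>

/* Alignment in bytes to attach to a vector load or store of base[i + offset].
 * Assumes the array base is aligned to vector_bytes and that i is always a
 * multiple of the vectorization factor, so only the constant offset can
 * break alignment. Any misaligned offset falls back to the element size,
 * which is a conservative simplification. */
unsigned access_alignment(long offset, unsigned elem_bytes, unsigned vector_bytes) {
    assert(elem_bytes > 0 && vector_bytes % elem_bytes == 0);
    long byte_offset = offset * (long)elem_bytes;
    return (byte_offset % (long)vector_bytes == 0) ? vector_bytes : elem_bytes;
}
```

For 4-byte elements and 32-byte vectors, access_alignment(0, 4, 32) gives 32 for a[i] and access_alignment(1, 4, 32) gives 4 for a[i+1], matching the examples on the slide.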
If so, vectorize it (3/5)
Vectorized Loop Body
bb1: ; preds = %bb1, %bb.nph
  %i.03 = phi i32 [ 0, %bb.nph ], [ %6, %bb1 ] ; <i32> [#uses=5]
  %scevgep = getelementptr [259 x i32]* @a, i32 0, i32 %i.03 ; <i32*> [#uses=1]
  %0 = bitcast i32* %scevgep to <8 x i32>* ; <<8 x i32>*> [#uses=1]
  %scevgep4 = getelementptr [259 x i32]* @c, i32 0, i32 %i.03 ; <i32*> [#uses=1]
  %1 = bitcast i32* %scevgep4 to <8 x i32>* ; <<8 x i32>*> [#uses=1]
  %tmp = add i32 %i.03, 1 ; <i32> [#uses=1]
  %scevgep5 = getelementptr [259 x i32]* @b, i32 0, i32 %tmp ; <i32*> [#uses=1]
  %2 = bitcast i32* %scevgep5 to <8 x i32>* ; <<8 x i32>*> [#uses=1]
  %3 = load <8 x i32>* %2, align 4 ; <<8 x i32>> [#uses=1]
  %4 = load <8 x i32>* %1, align 32 ; <<8 x i32>> [#uses=1]
  %5 = add nsw <8 x i32> %4, %3 ; <<8 x i32>> [#uses=1]
  store <8 x i32> %5, <8 x i32>* %0, align 32
  %6 = add i32 %i.03, 8 ; <i32> [#uses=2]
  %exitcond = icmp eq i32 %6, 256 ; <i1> [#uses=1]
  br i1 %exitcond, label %bb1.preheader, label %bb
If so, vectorize it (4/5)
Generate the epilogue preheader
If there are remainders, jump to the epilogue loop.
bb1.preheader: ; preds = %bb1
  %7 = shl i32 %i.03, 0 ; <i32> [#uses=2]
  %8 = icmp eq i32 %7, 259 ; <i1> [#uses=1]
  br i1 %8, label %return, label %bb1.epilogue
If so, vectorize it (5/5)
Generate the epilogue loop for the remainder
1. Clone the original loop body
2. Update all uses to refer to the cloned instructions
3. Update the phi of the induction variable
4. Update the branch target
bb1.epilogue: ; preds = %bb1.epilogue, %bb1.preheader
  %9 = phi i32 [ %7, %bb1.preheader ], [ %17, %bb1.epilogue ] ; <i32> [#uses=4]
  %10 = getelementptr [259 x i32]* @a, i32 0, i32 %9 ; <i32*> [#uses=1]
  %11 = getelementptr [259 x i32]* @c, i32 0, i32 %9 ; <i32*> [#uses=1]
  %12 = add i32 %9, 1 ; <i32> [#uses=1]
  %13 = getelementptr [259 x i32]* @b, i32 0, i32 %12 ; <i32*> [#uses=1]
  %14 = load i32* %13, align 4 ; <i32> [#uses=1]
  %15 = load i32* %11, align 4 ; <i32> [#uses=1]
  %16 = add nsw i32 %15, %14 ; <i32> [#uses=1]
  store i32 %16, i32* %10, align 4
  %17 = add i32 %9, 1 ; <i32> [#uses=2]
  %18 = icmp eq i32 %17, 259 ; <i1> [#uses=1]
  br i1 %18, label %return, label %bb1.epilogue
Generated Code
Vectorized Loop Body
.LBB1_1: # %bb1
  # =>This Inner Loop Header: Depth=1
  movups b+1088(,%eax,4), %xmm0
  paddd c+1084(,%eax,4), %xmm0
  movups b+1072(,%eax,4), %xmm1
  paddd c+1068(,%eax,4), %xmm1
  movaps %xmm1, a+1068(,%eax,4)
  movaps %xmm0, a+1084(,%eax,4)
  addl $8, %eax
  cmpl $-11, %eax
  jne .LBB1_1
Epilogue Preheader
# BB#2: # %bb1.preheader
  testl %eax, %eax
  je .LBB1_5
# BB#3: # %bb1.preheader.bb1.epilogue_crit_edge
  movl $-44, %eax
  .align 16, 0x90
Epilogue Loop
.LBB1_4: # %bb1.epilogue
  # =>This Inner Loop Header: Depth=1
  movl c+1036(%eax), %ecx
  addl b+1040(%eax), %ecx
  movl %ecx, a+1036(%eax)
  addl $4, %eax
  jne .LBB1_4
.LBB1_5: # %return
  ret
Experiment Environment
CPU: Intel i5 2.67GHz
OS: Ubuntu 10.04
LLVM: LLVM 2.7 (released on 04/27/2010), LLVM-GCC Front End 4.2
GCC: GCC 4.4.3 (Ubuntu 10.04 Canonical Version)
Performance Comparison: aligned access
a[i] = b[i] + c[i];
[Bar chart: normalized execution time (axis from 0 to 1.6) for char, short, int, and float element types, comparing GCC Vect, GCC No-Vect, LLVM No-Vect, LLVM Vect (VF=4), and LLVM Vect (VF=8)]
VF = Vectorization Factor
Performance Comparison: unaligned access
For integer type
[Bar chart: normalized execution time (axis from 0 to 1.6) for the loops a[i]=b[i]+c[i], a[i]=b[i+1]+c[i], a[i]=b[i+1]+c[i+1], and a[i+1]=b[i+1]+c[i+1], comparing GCC Vect, GCC No-Vect, LLVM No-Vect, LLVM Vect (VF=4), and LLVM Vect (VF=8)]
VF = Vectorization Factor
Conclusion
Implemented a prototype-level LLVM vectorizer with
• Data dependence analysis
• Loop transformation and vectorization
• Alignment testing
• Type conversion
Used a variety of LLVM infrastructure
• Pass Manager, Loop Pass Manager, LoopSimplify form, alias analysis, IndVars, SCEV, etc.
Its performance is quite promising
• In most cases, it outperforms GCC's tree vectorizer.
However, the following are required to extend its coverage
• Dependence testing needs to be extended to support multi-dimensional arrays
  W[i][j][k+LC] R[i][j][k+RC]
• More sophisticated alignment calculation is required
  It may need to collaborate with code generation.
  Is there an efficient way to calculate alignment in a multi-dimensional array?
  a[i][j][k]
• Do we need to support loops whose body has more than one basic block?