Autovectorization in LLVM

Changwoo Min ([email protected])

2010/06/23

Project Goal

Design and implement a prototype-level autovectorizer in LLVM

Understand LLVM and gain hands-on experience with it

Implement a simple analysis phase in LLVM

Implement a simple transform phase in LLVM


Vector Support in LLVM

Vector types and their operations are supported at the IR level

Vector types are lowered to MMX/SSE instructions on the IA-32 architecture

a[i] = b[i] + c[i]

• vector stride = 1

  %pb = getelementptr [32 x i32]* @b, i32 0, i32 %i
  %vb = bitcast i32* %pb to <8 x i32>*
  %pc = getelementptr [32 x i32]* @c, i32 0, i32 %i
  %vc = bitcast i32* %pc to <8 x i32>*
  %vb_i = load <8 x i32>* %vb, align 32
  %vc_i = load <8 x i32>* %vc, align 32
  %va_i = add nsw <8 x i32> %vb_i, %vc_i
  %pa = getelementptr [32 x i32]* @a, i32 0, i32 %i
  %va = bitcast i32* %pa to <8 x i32>*
  store <8 x i32> %va_i, <8 x i32>* %va, align 32

• vector type, vector operation

  movaps b(,%eax,4), %xmm0
  paddd c(,%eax,4), %xmm0
  movaps %xmm0, a(,%eax,4)

• SSE code generation
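For readers more at home in C than in IR, the same idea can be sketched with the `vector_size` attribute, a GCC/Clang extension. This is only an illustration under that assumption: the type name `v8si` and the helper `vadd8` are hypothetical and are not part of the project.

```c
#include <string.h>

/* Eight 32-bit ints packed into one 32-byte value, the C-level analogue of
   the IR type <8 x i32>. vector_size is a GCC/Clang extension. */
typedef int v8si __attribute__((vector_size(32)));

/* Hypothetical helper: a[0..7] = b[0..7] + c[0..7] as one vector add. */
void vadd8(int *a, const int *b, const int *c)
{
    v8si vb, vc, va;
    memcpy(&vb, b, sizeof vb);   /* vector load, no alignment assumed */
    memcpy(&vc, c, sizeof vc);   /* vector load, no alignment assumed */
    va = vb + vc;                /* element-wise add, like "add <8 x i32>" */
    memcpy(a, &va, sizeof va);   /* vector store */
}
```

The `memcpy` calls avoid any alignment assumption; the IR above instead states `align 32` on its loads and stores.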


Vectorization: what is it?

Scalar loop:

int a[259], b[259], c[259];
for (i = 0; i < 259; ++i) {
  a[i] = b[i+1] + c[i];
}

Vectorized loop with remainder (array-slice pseudocode):

int a[259], b[259], c[259];
for (i = 0; i + 8 <= 259; i += 8) {
  a[i:i+8] = b[i+1:i+9] + c[i:i+8];
}
if (i != 259) {
  for (; i < 259; ++i) {
    a[i] = b[i+1] + c[i];
  }
}
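The slice notation above is pseudocode. A minimal runnable sketch of the same transformation in plain C follows, with an inner lane loop standing in for each vector operation. The function names and the `VF` macro are hypothetical, and `b` is assumed to hold at least `n + 1` elements so that `b[i+1]` stays in bounds.

```c
#define VF 8  /* vectorization factor: 8 ints per 256-bit vector */

/* Scalar reference: a[i] = b[i+1] + c[i] for i in [0, n). */
void add_scalar(int *a, const int *b, const int *c, int n)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i + 1] + c[i];
}

/* Strip-mined version. Returns the index at which the epilogue started. */
int add_strip_mined(int *a, const int *b, const int *c, int n)
{
    int i = 0;
    /* Main loop: only runs while a full group of VF elements remains. */
    for (; i + VF <= n; i += VF)
        for (int lane = 0; lane < VF; ++lane)  /* stands in for one vector op */
            a[i + lane] = b[i + lane + 1] + c[i + lane];
    int epilogue_start = i;
    /* Epilogue: the leftover n % VF elements, one at a time. */
    for (; i < n; ++i)
        a[i] = b[i + 1] + c[i];
    return epilogue_start;
}
```

With n = 259 and VF = 8, the main loop covers indices 0 to 255 and the epilogue handles the last three elements.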


Vectorization, big idea

Find a loop
• Use existing LLVM infrastructure

Is it vectorizable?
• Is it a countable loop?
• Are there any unvectorizable instructions?
• Is there a loop-independent dependence?
• Is there a loop-carried dependence?

If so (Yes), vectorize it
• Change array type to vector type
• Type casting
• Alignment
• Handle the remainder, if any


Find a loop

Implement a “LoopVectorizer” pass as one of the transform passes
• Inherit from the LoopPass class, which is invoked for every loop.
• PassManager, the parent of the LoopPass manager, deals with integrating other LLVM passes.

Ask PassManager to hand over a loop in a more canonical form than a natural loop
• LoopSimplify form: entry block, exit block, latch block, single backedge
• Countable loop that increments by one

PassManager → LoopPass → LoopVectorize


Is it vectorizable? (1/3)

Loop type test
• Inner-most loop
• Countable loop
  for(i=0;i<100;++i) OK
  for(;*p!=NULL;++p) NOK
• Long enough to vectorize
  for(i=0;i<3;++i) NOK
  The iteration count should exceed the vectorization factor.

Are there any unvectorizable IR instructions?
• Function call: NOK
• Stack allocation: NOK
• Operation on a scalar value other than the loop induction variable: NOK
• The stride of pointer/array accesses should be one.
  a[i] OK, a[2*i] NOK
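The checks on this slide can be summarized as a single predicate. The sketch below is only an illustration: `struct loop_summary` and `passes_shape_tests` are hypothetical names, and a real pass would derive these facts from LLVM analyses rather than from a hand-filled struct.

```c
#include <stdbool.h>

/* Hypothetical summary of one loop, filled in by earlier analysis. */
struct loop_summary {
    bool innermost;        /* contains no nested loop */
    bool countable;        /* trip count is known before entering the loop */
    long trip_count;       /* meaningful only if countable */
    bool has_call;         /* body contains a function call */
    bool has_stack_alloc;  /* body contains a stack allocation */
    bool has_scalar_op;    /* operates on a scalar other than the induction variable */
    long stride;           /* stride of array/pointer accesses, in elements */
};

/* True when the loop passes every test listed on the slide. */
bool passes_shape_tests(const struct loop_summary *l, long vf)
{
    if (!l->innermost || !l->countable)
        return false;                     /* loop type test */
    if (l->trip_count <= vf)
        return false;                     /* not long enough to vectorize */
    if (l->has_call || l->has_stack_alloc || l->has_scalar_op)
        return false;                     /* unvectorizable instruction */
    return l->stride == 1;                /* a[i] OK, a[2*i] NOK */
}
```

A loop such as `for(i=0;i<3;++i)` fails the trip-count test for VF = 4, and an `a[2*i]` access fails the stride test.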


Is it vectorizable? (2/3)

Collect array/pointer variables used in LHS and RHS

a[i] = b[i+1] + c[i];

LHS = {a[i]}, RHS = {b[i+1], c[i]}


Is it vectorizable? (3/3)

Data dependence testing between LHS and RHS

foreach member W in LHS
  foreach member R in LHS ∪ RHS
    if R is an alias of W
      if there is a data dependence between W and R
        “It is not vectorizable.”
“OK, it is vectorizable.”

Dependence testing
• The strides of W and R are one.
• We only check whether W and R collide WITHIN the vectorization factor, by subtracting their base coefficients.
  W[i+LC] R[i+RC]
  If LC ≠ RC and |LC-RC| < vectorization factor, there will be a collision: not vectorizable.
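The loop nest and the distance test can be sketched in C as below. Everything here is illustrative: `struct access`, `collides`, and `is_vectorizable` are hypothetical names, and equal `base` ids stand in for "R is an alias of W", which the real pass would ask of alias analysis. The sketch also assumes that a distance of zero (W and R touch the same element in the same iteration) is harmless, since otherwise every W would conflict with itself.

```c
#include <stdbool.h>
#include <stdlib.h>

/* One unit-stride access base[i + offset]. */
struct access {
    int base;     /* hypothetical array id; equal ids mean "aliases" */
    long offset;  /* the constant LC or RC */
};

/* True when W and R touch the same array at a nonzero distance below vf. */
static bool collides(struct access w, struct access r, long vf)
{
    long d = labs(w.offset - r.offset);
    return w.base == r.base && d != 0 && d < vf;
}

/* Check every W in LHS against every R in LHS and RHS. */
bool is_vectorizable(const struct access *lhs, int n_lhs,
                     const struct access *rhs, int n_rhs, long vf)
{
    for (int i = 0; i < n_lhs; ++i) {
        for (int j = 0; j < n_lhs; ++j)      /* R drawn from LHS */
            if (collides(lhs[i], lhs[j], vf))
                return false;
        for (int j = 0; j < n_rhs; ++j)      /* R drawn from RHS */
            if (collides(lhs[i], rhs[j], vf))
                return false;
    }
    return true;
}
```

For `a[i] = b[i+1] + c[i]` no read aliases the write, so the loop is accepted. For `a[i] = a[i-1] + c[i]` the distance is 1, which is below VF = 8, so the loop is rejected.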


If so, vectorize it (1/5)

Idea

Loop Body → Vectorized Loop Body → Epilogue Preheader (check if there are remainders) → Epilogue Loop (for remainder)


If so, vectorize it (2/5)

Vectorize Loop Body

1. Insert a bitcast instruction after every getelementptr instruction
2. Replace uses of the getelementptr with the bitcast
   • If the use is a load or store instruction, set its alignment constraint.
3. Construct the set of instructions that require type casting from array/pointer type to vector type
   • Maximal use set of getelementptr
   • Type-cast the instructions in the type casting set to vector type
4. Change the increment of the induction variable to the vectorization factor
5. Change the destination of the loop exit to the epilogue preheader

Calculate alignment
• It assumes the base address is 32-byte aligned.
• Only check whether the induction variable expression breaks that alignment.
  a[0]: 32-byte aligned
  a[i]: 32-byte aligned
  a[i+1]: 4-byte aligned
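The alignment rule can be sketched as a small function. This is a hedged illustration rather than the pass's actual logic: `access_alignment` and its parameters are hypothetical. It returns the assumed base alignment when neither the vector step nor the constant offset disturbs it, and falls back to the element size otherwise.

```c
/* Alignment in bytes to put on a vector load/store of base[i + offset],
   assuming base is base_align-aligned and i advances by vf elements. */
unsigned access_alignment(long offset, unsigned elem_size,
                          unsigned vf, unsigned base_align)
{
    /* The induction variable keeps the alignment only if one vector step
       covers a whole multiple of base_align bytes. */
    if ((vf * elem_size) % base_align != 0)
        return elem_size;
    /* The constant offset keeps it only if it is a multiple as well. */
    long byte_offset = offset * (long)elem_size;
    if (byte_offset % (long)base_align != 0)
        return elem_size;
    return base_align;
}
```

With 4-byte ints, VF = 8, and a 32-byte aligned base, `a[i]` gets 32 and `a[i+1]` gets 4, matching the slide.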


If so, vectorize it (3/5)


bb1: ; preds = %bb1, %bb.nph
  %i.03 = phi i32 [ 0, %bb.nph ], [ %6, %bb1 ] ; <i32> [#uses=5]
  %scevgep = getelementptr [259 x i32]* @a, i32 0, i32 %i.03 ; <i32*> [#uses=1]
  %0 = bitcast i32* %scevgep to <8 x i32>* ; <<8 x i32>*> [#uses=1]
  %scevgep4 = getelementptr [259 x i32]* @c, i32 0, i32 %i.03 ; <i32*> [#uses=1]
  %1 = bitcast i32* %scevgep4 to <8 x i32>* ; <<8 x i32>*> [#uses=1]
  %tmp = add i32 %i.03, 1 ; <i32> [#uses=1]
  %scevgep5 = getelementptr [259 x i32]* @b, i32 0, i32 %tmp ; <i32*> [#uses=1]
  %2 = bitcast i32* %scevgep5 to <8 x i32>* ; <<8 x i32>*> [#uses=1]
  %3 = load <8 x i32>* %2, align 4 ; <<8 x i32>> [#uses=1]
  %4 = load <8 x i32>* %1, align 32 ; <<8 x i32>> [#uses=1]
  %5 = add nsw <8 x i32> %4, %3 ; <<8 x i32>> [#uses=1]
  store <8 x i32> %5, <8 x i32>* %0, align 32
  %6 = add i32 %i.03, 8 ; <i32> [#uses=2]
  %exitcond = icmp eq i32 %6, 256 ; <i1> [#uses=1]
  br i1 %exitcond, label %bb1.preheader, label %bb



Generate epilogue preheader

If there are remainders, jump to epilogue loop.

bb1.preheader: ; preds = %bb1
  %7 = shl i32 %i.03, 0 ; <i32> [#uses=2]
  %8 = icmp eq i32 %7, 259 ; <i1> [#uses=1]
  br i1 %8, label %return, label %bb1.epilogue
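The constants 256 and 259 in the IR follow from the trip count and the vectorization factor. A minimal sketch of that arithmetic, using the hypothetical helper names `vector_loop_end` and `needs_epilogue`:

```c
#include <stdbool.h>

/* Index where the vectorized loop stops: the largest multiple of vf that
   does not exceed n. Assumes n >= 0 and vf > 0. */
long vector_loop_end(long n, long vf)
{
    return n - n % vf;
}

/* The check made in the epilogue preheader: are any elements left over? */
bool needs_epilogue(long n, long vf)
{
    return vector_loop_end(n, vf) != n;
}
```

For n = 259 and VF = 8 the vectorized loop stops at 256, so three elements are left for the epilogue loop.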



Generate epilogue loop for remainder

1. Clone the original loop body
2. Update all the uses to refer to the cloned values
3. Update the phi of the induction variable
4. Update the branch target

bb1.epilogue: ; preds = %bb1.epilogue, %bb1.preheader
  %9 = phi i32 [ %7, %bb1.preheader ], [ %17, %bb1.epilogue ] ; <i32> [#uses=4]
  %10 = getelementptr [259 x i32]* @a, i32 0, i32 %9 ; <i32*> [#uses=1]
  %11 = getelementptr [259 x i32]* @c, i32 0, i32 %9 ; <i32*> [#uses=1]
  %12 = add i32 %9, 1 ; <i32> [#uses=1]
  %13 = getelementptr [259 x i32]* @b, i32 0, i32 %12 ; <i32*> [#uses=1]
  %14 = load i32* %13, align 4 ; <i32> [#uses=1]
  %15 = load i32* %11, align 4 ; <i32> [#uses=1]
  %16 = add nsw i32 %15, %14 ; <i32> [#uses=1]
  store i32 %16, i32* %10, align 4
  %17 = add i32 %9, 1 ; <i32> [#uses=2]
  %18 = icmp eq i32 %17, 259 ; <i1> [#uses=1]
  br i1 %18, label %return, label %bb1.epilogue


Generated Code

.LBB1_1: # %bb1
  # =>This Inner Loop Header: Depth=1
  movups b+1088(,%eax,4), %xmm0
  paddd c+1084(,%eax,4), %xmm0
  movups b+1072(,%eax,4), %xmm1
  paddd c+1068(,%eax,4), %xmm1
  movaps %xmm1, a+1068(,%eax,4)
  movaps %xmm0, a+1084(,%eax,4)
  addl $8, %eax
  cmpl $-11, %eax
  jne .LBB1_1
# Epilogue Preheader
# BB#2: # %bb1.preheader
  testl %eax, %eax
  je .LBB1_5
# BB#3: # %bb1.preheader.bb1.epilogue_crit_edge
  movl $-44, %eax
  .align 16, 0x90
# Epilogue Loop
.LBB1_4: # %bb1.epilogue
  # =>This Inner Loop Header: Depth=1
  movl c+1036(%eax), %ecx
  addl b+1040(%eax), %ecx
  movl %ecx, a+1036(%eax)
  addl $4, %eax
  jne .LBB1_4
.LBB1_5: # %return
  ret


Experiment Environment

• CPU: Intel i5 2.67GHz
• OS: Ubuntu 10.04
• LLVM: LLVM 2.7 (released 04/27/2010), LLVM-GCC front end 4.2
• GCC: GCC 4.4.3 (Ubuntu 10.04 canonical version)


Performance Comparison: aligned access

a[i] = b[i] + c[i];

[Figure: bar chart of normalized execution time (y-axis, 0 to 1.6) for element types char, short, int, and float. Series: GCC Vect, GCC No-Vect, LLVM No-Vect, LLVM Vect (VF=4), LLVM Vect (VF=8). VF = Vectorization Factor.]


Performance Comparison: unaligned access

For integer type

[Figure: bar chart of normalized execution time (y-axis, 0 to 1.6) comparing GCC Vect, GCC No-Vect, LLVM No-Vect, LLVM Vect (VF=4), and LLVM Vect (VF=8) across four access patterns: a[i]=b[i]+c[i], a[i]=b[i+1]+c[i], a[i]=b[i+1]+c[i+1], and a[i+1]=b[i+1]+c[i+1]. VF = Vectorization Factor.]


Conclusion

Implemented a prototype-level LLVM vectorizer with
• Data dependence analysis
• Loop transformation and vectorization
• Alignment testing
• Type conversion

Used a variety of LLVM infrastructure
• PassManager, LoopPass manager, LoopSimplify form, alias analysis, IndVars, SCEV, etc.

Its performance is quite promising
• In most cases, it is better than GCC's tree vectorizer.

However, the following are required to extend its coverage
• Dependence testing needs to be extended to support multi-dimensional arrays
  W[i][j][k+LC] R[i][j][k+RC]
• More sophisticated alignment calculation is required
  It may need to collaborate with code generation.
  Is there an efficient way to calculate alignment in a multi-dimensional array?
  a[i][j][k]
• Do we need to support loops whose body has more than one basic block?
