
Optimizing Data Permutations for SIMD Devices

Gang Ren, Peng Wu¹, David Padua

University of Illinois at Urbana-Champaign
¹ IBM T.J. Watson Research Center


PLDI 06

SIMD Is Everywhere

[Figure: SIMD architecture, showing memory, a register file, and an ALU that operates on four lanes at once]


SIMD Compilation

Explore Data Parallelism
• Vectorization
• Instruction Packing
• If Conversion
……

    float a[16], b[16], c[16];
    for(i=0; i<16; i++)
      c[i] = a[i] + b[i];

becomes

    c[0:15] = a[0:15] + b[0:15];

Generating Efficient SIMD Code
• Data Permutation Optimization
• Idiom Recognition
• Execution Mapping
• Type Promotion Elimination
……

    ...
    vr1 = vec_load(a);
    vr2 = vec_load(b);
    vr3 = vec_add(vr1, vr2);
    ...


Strict SIMD Architecture (1)

[Figure: memory holds a0 a1 a2 a3 | a4 a5 a6 a7 | ……; a single aligned vector load brings a0 a1 a2 a3 into a register]

... = ...a[0:3:1]...;

Most SIMD devices only support memory accesses on contiguous and aligned memory sections

vr1 = vec_load(a);


Strict SIMD Architecture (2)

Additional permutation instructions are needed for non-contiguous and/or misaligned memory references

... = ...a[0:6:2]...;

[Figure: two aligned loads bring a0 a1 a2 a3 and a4 a5 a6 a7 into registers; a vperm with pattern <0,2,4,6> selects a0 a2 a4 a6 from the pair]

Strict SIMD devices: All data reorganization must be accomplished with permutation instructions.

vr1 = vec_load(a); vr2 = vec_load(a+4); vr4 = vperm(vr1, vr2, <0,2,4,6>);
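
The three-instruction sequence above can be modelled in a few lines of Python. This is only a sketch: `vec_load` and `vperm` here are hypothetical stand-ins that imitate the slide's pseudo-instructions on Python lists, assuming 4-element registers and a `vperm` that indexes into the concatenation of its two inputs.

```python
WIDTH = 4  # assumed register width (elements per vector register)

def vec_load(memory, offset):
    """Model an aligned, contiguous vector load of WIDTH elements."""
    assert offset % WIDTH == 0, "strict SIMD: loads must be aligned"
    return memory[offset:offset + WIDTH]

def vperm(vr1, vr2, pattern):
    """Model a two-input permute: pattern indexes into vr1 followed by vr2."""
    combined = list(vr1) + list(vr2)
    return [combined[i] for i in pattern]

# The stride-2 reference a[0:6:2] from the slide.
a = [f"a{i}" for i in range(8)]
vr1 = vec_load(a, 0)                   # a0 a1 a2 a3
vr2 = vec_load(a, 4)                   # a4 a5 a6 a7
vr4 = vperm(vr1, vr2, [0, 2, 4, 6])    # a0 a2 a4 a6
print(vr4)
```

The point of the model is that the strided access costs two loads plus one permutation, which is the overhead the rest of the talk tries to reduce.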

Overview of the Optimization Framework

    float a[16],b[16],c[16];
    c[0:15] = a[0:15] + b[0:15];
    c[0:15] = a[0:31:2] + b[0:15];

    ...
    vr1 = vec_load(a);
    vr2 = vec_load(a+4);
    vr3 = vperm(vr1,vr2,…);
    vr4 = vec_load(b);
    ...

Normalization

Optimization

Code Generation

Example: An 8-point FFT Program

0. Input program

    t0[0:6:2] = x[0:3] + x[4:7];
    t0[1:7:2] = x[0:3] - x[4:7];
    t1[0:7] = T8[0:7] * t0[0:7];
    for (i = 0; i < 2; i++) {
      t2[0:2:2] = t1[i:i+2:2] + t1[i+4:i+6:2];
      t2[1:3:2] = t1[i:i+2:2] - t1[i+4:i+6:2];
      t3[0:3] = T4[0:3] * t2[0:3];
      y[i+0:i+2:2] = t3[0:1] + t3[2:3];
      y[i+4:i+6:2] = t3[0:1] - t3[2:3];
    }

1. After normalization

    v1[0:3] = x[0:3] + x[4:7];
    v1[4:7] = x[0:3] - x[4:7];
    t0[0:7] = Permute(v1[0:7], P1);
    t1[0:7] = T8[0:7] * t0[0:7];
    v2[0:7] = Permute(t1[0:7], P2);
    u1[0:7] = Permute(v2[0:7], P3);
    u2[0:3] = u1[0:3] + u1[4:7];
    u2[4:7] = u1[0:3] - u1[4:7];
    v3[0:7] = Permute(u2[0:7], P4);
    t2[0:7] = Permute(v3[0:7], P5);
    t3[0:7] = T4_2[0:7] * t2[0:7];
    u3[0:7] = Permute(t3[0:7], P6);
    u4[0:3] = u3[0:3] + u3[4:7];
    u4[4:7] = u3[0:3] - u3[4:7];
    v4[0:7] = Permute(u4[0:7], P7);
    y[0:7] = Permute(v4[0:7], P8);

2. After optimization

    v1[0:3] = x[0:3] + x[4:7];
    v1[4:7] = x[0:3] - x[4:7];
    t1[0:7] = T8[0:7] * v1[0:7];
    u1[0:7] = Permute(t1[0:7], Q1);
    u2[0:3] = u1[0:3] + u1[4:7];
    u2[4:7] = u1[0:3] - u1[4:7];
    t3[0:7] = T4_2[0:7] * u2[0:7];
    u3[0:7] = Permute(t3[0:7], Q2);
    y[0:3] = u3[0:3] + u3[4:7];
    y[4:7] = u3[0:3] - u3[4:7];

3. Generating native permutation instructions from Permute operations





Normalization

Optimization

Code Generation

Normalization: use generic Permute to represent:
• Non-unit strides
• Misalignment
• Other reorganizations


Data Permutations on Vectors

Permute(Xn, Pn): Xn is a vector and Pn is a permutation matrix

\[
b[0:3:1]^T =
\begin{pmatrix}
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
a[0:3:1]^T
\]

b[0:3] = Permute(a[0:3], <2,1,0,3>)

    a[0:3]:  a0 a1 a2 a3
    b[0:3]:  a2 a1 a0 a3

Use Permute to represent all data reorganizations explicitly

Two stride-2 accesses at right-hand side:

    ... = a[0:6:2] + a[1:7:2];

become one Permute followed by contiguous accesses:

    t[0:7] = Permute(a[0:7], <0,2,4,6,1,3,5,7>);
    ... = t[0:3] + t[4:7];
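
A small Python model may help connect the index-vector notation `<...>` with the permutation-matrix definition. It is a sketch under one assumption that the slide's example supports: `Permute(x, <p0,p1,...>)` places `x[p_i]` at position `i`. The helper names `permute`, `pattern_to_matrix` and `mat_vec` are illustrative only.

```python
def permute(x, pattern):
    """Permute(x, pattern): output element i is x[pattern[i]]."""
    return [x[p] for p in pattern]

def pattern_to_matrix(pattern):
    """Build the 0/1 permutation matrix P with P[i][pattern[i]] = 1."""
    n = len(pattern)
    return [[1 if col == pattern[row] else 0 for col in range(n)]
            for row in range(n)]

def mat_vec(matrix, x):
    """Plain matrix-vector product, used to check b^T = P a^T."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

# The slide's 4-element example: b[0:3] = Permute(a[0:3], <2,1,0,3>).
a = [10, 11, 12, 13]
b = permute(a, [2, 1, 0, 3])
P = pattern_to_matrix([2, 1, 0, 3])

# Normalizing two stride-2 accesses into one Permute plus contiguous halves.
a8 = list(range(8))
t = permute(a8, [0, 2, 4, 6, 1, 3, 5, 7])
strided_sum = [a8[i] + a8[i + 1] for i in range(0, 8, 2)]   # a[0:6:2] + a[1:7:2]
packed_sum = [x + y for x, y in zip(t[:4], t[4:])]          # t[0:3] + t[4:7]
```

Both sums agree, which is what lets normalization replace strided references by a single explicit Permute.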





Normalization

Optimization

Code Generation

Optimization: minimize Permute ops in a basic block
- Based on two rules of Permute
- An NP-complete problem
- Propagation-based algorithm

Two Important Rules on Permutations

Composition Rule

\[
B = P_1 A,\quad C = P_2 B \;\Rightarrow\; C = P A,\quad P = P_2 P_1
\]

    Permute(Permute(a[0:3:1], <1, 0, 3, 2>), <2, 1, 0, 3>)
      = Permute(a[0:3:1], <3, 0, 1, 2>)

    a0 a1 a2 a3  →  a1 a0 a3 a2  →  a3 a0 a1 a2

Distributive Rule

\[
(P A) \circ (P B) = P\,(A \circ B)
\]

where \(\circ\) is an element-wise operation (+ in the example):

    Permute(a[0:3:1], <1, 0, 3, 2>) + Permute(b[0:3:1], <1, 0, 3, 2>)
      = Permute(a[0:3:1] + b[0:3:1], <1, 0, 3, 2>)

    Both sides yield:  a1+b1 a0+b0 a3+b3 a2+b2
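
Both rules can be checked mechanically on the slide's own patterns. The sketch below assumes the gather convention (output `i` is `x[pattern[i]]`); `permute` and `compose` are illustrative helper names, not part of any library.

```python
def permute(x, pattern):
    """Permute(x, pattern): output element i is x[pattern[i]]."""
    return [x[p] for p in pattern]

def compose(first, second):
    """Single pattern equivalent to applying `first`, then `second`."""
    return [first[s] for s in second]

# Composition rule, using the patterns from the slide.
a = ["a0", "a1", "a2", "a3"]
p1 = [1, 0, 3, 2]
p2 = [2, 1, 0, 3]
merged = compose(p1, p2)
two_steps = permute(permute(a, p1), p2)
one_step = permute(a, merged)

# Distributive rule over element-wise addition.
x = [1, 2, 3, 4]
y = [10, 20, 30, 40]
permute_then_add = [u + v for u, v in zip(permute(x, p1), permute(y, p1))]
add_then_permute = permute([u + v for u, v in zip(x, y)], p1)
```

The composition rule removes one Permute outright; the distributive rule trades two Permutes for one by moving the permutation past the operation.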

Propagation-Based Optimization Algorithm

Overview: Propagating permutation to permutation
– Step 1: Pick up an unvisited permutation statement
– Step 2: Propagate the permutation from the definition to the uses
– Step 3: If a use is a permutation, go to (a); otherwise go to (b)
   a. Merge it with the propagated permutation pattern. Go to Step 1
   b. Propagate the permutation from right-hand side to left-hand side. Go to Step 2

Initial statements:

    1. v1[0:3] = x[0:3] + x[4:7];
    2. v1[4:7] = x[0:3] - x[4:7];
    3. t0[0:7] = Permute(v1[0:7], P1);
    4. t1[0:7] = T8[0:7] * t0[0:7];
    5. v2[0:7] = Permute(t1[0:7], P2);
    6. u1[0:7] = Permute(v2[0:7], P3);
    7. u2[0:3] = u1[0:3] + u1[4:7];
    8. u2[4:7] = u1[0:3] - u1[4:7];
    9. v3[0:7] = Permute(u2[0:7], P4);
    10. t2[0:7] = Permute(v3[0:7], P5);
    11. t3[0:7] = T4_2[0:7] * t2[0:7];
    12. u3[0:7] = Permute(t3[0:7], P6);
    13. u4[0:3] = u3[0:3] + u3[4:7];
    14. u4[4:7] = u3[0:3] - u3[4:7];
    15. v4[0:7] = Permute(u4[0:7], P7);
    16. y[0:7] = Permute(v4[0:7], P8);

The following frames show only the statements that change; statements not listed are the same as in the previous frame.

Next frame (statements 4 and 5 change):

    4. t1[0:7] = T8’[0:7] * v1[0:7];
    5. v2[0:7] = Permute(t1[0:7], P2’);

Next frame (statement 6 changes):

    6. u1[0:7] = Permute(t1[0:7], P3’);

Last frame shown (statements 11 to 14 change):

    11. t3[0:7] = T4_2’[0:7] * u2[0:7];
    12. u3[0:7] = Permute(t3[0:7], P6’);
    13. y[0:3] = u3[0:3] + u3[4:7];
    14. y[4:7] = u3[0:3] - u3[4:7];
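
Step 3(a), merging a Permute into the Permute that uses it, is the simplest part of the algorithm to illustrate. The sketch below is not the authors' implementation: it handles only Permute-to-Permute chains in a single-assignment basic block, does not implement Step 3(b), and uses a made-up statement encoding and 4-element patterns chosen purely for illustration.

```python
def compose(first, second):
    """Single pattern equivalent to applying `first`, then `second`."""
    return [first[s] for s in second]

def merge_permute_chains(block):
    """Rewrite each Permute whose source is itself a Permute result.

    Statements are ("permute", dst, src, pattern) or ("other", dst, uses).
    Assumes every variable is assigned once in the block.
    """
    defs = {}   # dst -> (src, pattern) of Permute statements seen so far
    out = []
    for stmt in block:
        if stmt[0] == "permute":
            _, dst, src, pattern = stmt
            if src in defs:
                base_src, base_pattern = defs[src]
                src, pattern = base_src, compose(base_pattern, pattern)
            defs[dst] = (src, pattern)
            out.append(("permute", dst, src, pattern))
        else:
            out.append(stmt)
    return out

def drop_dead_permutes(block, live_out):
    """Remove Permute statements whose result is never used afterwards."""
    used = set(live_out)
    kept = []
    for stmt in reversed(block):
        if stmt[0] == "permute":
            _, dst, src, _ = stmt
            if dst not in used:
                continue
            used.add(src)
        else:
            used.update(stmt[2])
        kept.append(stmt)
    kept.reverse()
    return kept

# Shaped like statements 9 to 11 of the example, with hypothetical patterns.
block = [
    ("permute", "v3", "u2", [1, 0, 3, 2]),
    ("permute", "t2", "v3", [2, 1, 0, 3]),
    ("other", "t3", ("T4_2", "t2")),
]
optimized = drop_dead_permutes(merge_permute_chains(block), live_out={"t3"})
```

After the pass, `t2` is computed directly from `u2` with one merged pattern and the intermediate `v3` disappears.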

Propagating Permutations to Partial Uses

    b[0:7] = Permute(a[0:7], <3,2,1,0,7,6,5,4>);
    c[0:3] = b[0:3] + b[4:7];

b[0:3] and b[4:7] are two partial uses of b[0:7].

    b[0:3] = Permute(a[0:3], <3,2,1,0>);
    b[4:7] = Permute(a[4:7], <3,2,1,0>);
    c[0:3] = b[0:3] + b[4:7];

[Figure: permutation matrix P partitioned into blocks Q and R]

Not all permutations can be partitioned and propagated to partial uses

    b[0:7] = Permute(a[0:7], <0,4,1,5,2,6,3,7>);
    c[0:3] = b[0:3] + b[4:7];

Improvements over partial use boundary
- Permutation decomposition
    Register-wise decomposition
    Shuffle instruction decomposition
- Permutation reshaping
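
Whether a permutation can be partitioned as in the first example comes down to a simple test on its pattern. The function below is a minimal sketch, assuming the restricted case shown on the slide where each output block reads only from the source block at the same position; `split_by_register` is a hypothetical name.

```python
def split_by_register(pattern, width):
    """Split a pattern into per-block local patterns, or return None.

    Succeeds only when every block of `width` outputs reads exclusively
    from the source block at the same position.
    """
    parts = []
    for start in range(0, len(pattern), width):
        block = pattern[start:start + width]
        if any(not (start <= p < start + width) for p in block):
            return None   # some element crosses the block boundary
        parts.append([p - start for p in block])
    return parts

partitionable = split_by_register([3, 2, 1, 0, 7, 6, 5, 4], 4)
not_partitionable = split_by_register([0, 4, 1, 5, 2, 6, 3, 7], 4)
```

The first pattern splits into two independent `<3,2,1,0>` permutations, matching the slide; the interleaving pattern does not, which is the case the decomposition and reshaping improvements are meant to address.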

Optimization: Permutation Reshaping

For permutations used in commutative operations

Before reshaping:

    b[0:7] = Permute(a[0:7], <0,5,2,7,4,1,6,3>);
    c[0:3] = b[0:3] + b[4:7];

    a[0:7]:  a0 a1 a2 a3 a4 a5 a6 a7
    b[0:3]:  a0 a5 a2 a7
    b[4:7]:  a4 a1 a6 a3
    c[0:3]:  a0+a4 a5+a1 a2+a6 a7+a3

After reshaping:

    b[0:7] = Permute(a[0:7], <0,1,2,3,4,5,6,7>);
    c[0:3] = b[0:3] + b[4:7];

    a[0:7]:  a0 a1 a2 a3 a4 a5 a6 a7
    b[0:3]:  a0 a1 a2 a3
    b[4:7]:  a4 a5 a6 a7
    c[0:3]:  a0+a4 a1+a5 a2+a6 a3+a7
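
The reshaping example can be replayed numerically. In the sketch below, `reshape_for_commutative` uses a simple heuristic of my own (put the smaller source index of each operand pair in the first half); the slides do not say how the reshaped pattern is chosen, so treat it as an assumption that happens to reproduce the slide's result.

```python
def permute(x, pattern):
    """Permute(x, pattern): output element i is x[pattern[i]]."""
    return [x[p] for p in pattern]

def add_halves(b):
    """c = b[first half] + b[second half], element-wise."""
    half = len(b) // 2
    return [u + v for u, v in zip(b[:half], b[half:])]

def reshape_for_commutative(pattern):
    """Swap paired entries so the smaller source index lands in the first half.

    Valid only when the two halves feed a commutative element-wise operation.
    """
    half = len(pattern) // 2
    pairs = list(zip(pattern[:half], pattern[half:]))
    return [min(p, q) for p, q in pairs] + [max(p, q) for p, q in pairs]

a = [3, 1, 4, 1, 5, 9, 2, 6]
original = [0, 5, 2, 7, 4, 1, 6, 3]
reshaped = reshape_for_commutative(original)

c_before = add_halves(permute(a, original))
c_after = add_halves(permute(a, reshaped))
```

Here the reshaped pattern is the identity, so the Permute can be dropped entirely while `c` is unchanged.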





Normalization

Optimization

Code Generation

Code Generation:
- “Strip-mine” Permute to vperm inst.
- Map vperm to native permutation inst.

Generating Permutation Instructions (1)

    a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15>);

    b[0:15] (four registers):  0 1 2 3 | 4 5 6 7 | 8 9 10 11 | 12 13 14 15
    a[0:15] (four registers):  0 4 8 12 | 1 5 9 13 | 2 6 10 14 | 3 7 11 15

[Figure: each target register is built by a chain of three vperm instructions. For the first target register the patterns are <0,4,*,*>, <0,1,4,*> and <0,1,2,4>, giving 0 4 * * → 0 4 8 * → 0 4 8 12. The other target registers are built the same way: 1 5 * * → 1 5 9 *, 2 6 * * → 2 6 10 *, 3 7 * * → 3 7 11 *, each followed by a final vperm.]

Generating Permutation Instructions (2)

    a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15>);

    b[0:15] (four registers):  0 1 2 3 | 4 5 6 7 | 8 9 10 11 | 12 13 14 15
    a[0:15] (four registers):  0 4 8 12 | 1 5 9 13 | 2 6 10 14 | 3 7 11 15

[Figure: vperm <0,4,*,*> yields 0 4 * * and 8 12 * *. Filling the empty slots turns the pattern into <0,4,1,5>, which yields 0 4 1 5 and 8 12 9 13 (and likewise 2 6 3 7 and 10 14 11 15). A second level of vperm with patterns <0,1,4,5> and <2,3,6,7> combines these pairs into the target registers.]

Two Steps:
• Maximize empty slots when generating vperm instructions;
• Fill empty slots with data elements that go to the same target;
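
The slot-filling scheme can be traced on the 16-element example. The sketch reuses a list-based model of a two-input `vperm` (an assumption, as before). The patterns `<0,4,1,5>`, `<0,1,4,5>` and `<2,3,6,7>` come from the slide; `<2,6,3,7>` is my inference for the pair of intermediate registers the slide shows as 2 6 3 7 and 10 14 11 15.

```python
def vperm(vr1, vr2, pattern):
    """Model a two-input permute: pattern indexes into vr1 followed by vr2."""
    combined = list(vr1) + list(vr2)
    return [combined[i] for i in pattern]

# Source registers: b[0:15] split into four 4-element registers.
b = list(range(16))
b0, b1, b2, b3 = (b[i:i + 4] for i in range(0, 16, 4))

# First level: every slot carries an element that is needed later.
t0 = vperm(b0, b1, [0, 4, 1, 5])   # 0 4 1 5
t1 = vperm(b2, b3, [0, 4, 1, 5])   # 8 12 9 13
t2 = vperm(b0, b1, [2, 6, 3, 7])   # 2 6 3 7    (inferred pattern)
t3 = vperm(b2, b3, [2, 6, 3, 7])   # 10 14 11 15 (inferred pattern)

# Second level: combine pairs into the target registers.
a0 = vperm(t0, t1, [0, 1, 4, 5])   # 0 4 8 12
a1 = vperm(t0, t1, [2, 3, 6, 7])   # 1 5 9 13
a2 = vperm(t2, t3, [0, 1, 4, 5])   # 2 6 10 14
a3 = vperm(t2, t3, [2, 3, 6, 7])   # 3 7 11 15

result = a0 + a1 + a2 + a3
target = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
```

In this model the whole Permute takes eight `vperm` operations, compared with three per target register when each register is assembled one element source at a time.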


Experiment Setups

Two SIMD devices: VMX (AltiVec) & SSE2

                        VMX (AltiVec)         SSE2
    Processor           1.8 GHz PowerPC G5    2.0 GHz Pentium 4
    Main Memory         2048 MB               1024 MB
    Operating System    Mac OS v10.3          Linux v2.4
    Compiler            xlc v6.0              icc v9.0
    Compiler Options    -O3 -qaltivec         -fast (-O3)

Tested applications
– Group I: Applications with relatively simple permutation patterns
   • C-Saxpy: Complex version of saxpy ( y = alpha*x + y )
   • R-Color, C-Dot, R-FIR, …
– Group II: Applications with complicated permutation patterns
   • FFT: Fast Fourier transform programs generated by the SPIRAL system
   • WHT: Walsh-Hadamard transform routines generated by the SPIRAL system
   • Bitonic sorting: One of the fastest sorting networks
– Group III: Reorganization-only applications
   • Matrix transpose
   • Bit-reversal reordering

Static Evaluation: # of Permutation Inst.

                         VMX                      SSE2
    Program     Size     Base   Opt   Reduced     Base   Opt   Reduced
    fft.4       16       96     24    75.0%       96     24    75.0%
    fft.5       32       208    48    76.9%       208    48    76.9%
    fft.6       64       352    96    72.7%       352    96    72.7%
    wht.4       16       48     12    75.0%       48     12    75.0%
    wht.5       32       96     24    75.0%       96     24    75.0%
    wht.6       64       192    48    75.0%       192    48    75.0%
    bitonic.4   16       52     34    34.6%       56     34    39.3%
    bitonic.5   32       136    92    32.3%       144    92    36.1%
    bitonic.6   64       336    232   31.0%       352    232   34.1%

Run-time Performance of FFT & Bitonic Sorting

[Figure, four panels:
 – 32-point FFT on VMX/AltiVec: # of operations per second (100M) for Scalar Fast, SIMD Base, SIMD Opt
 – 32-point FFT on SSE2: # of operations per second (100M) for Scalar O3, SIMD Base, SIMD Opt
 – Bitonic Sorting on VMX/AltiVec: # of stages per second (100M) vs. size (8 to 256) for Scalar GCC, SIMD Base, SIMD Opt
 – Bitonic Sorting on SSE2: # of stages per second (100M) vs. size (8 to 256) for Scalar O3, SIMD Base, SIMD Opt]

Overall Speedups

[Figure, two panels: Overall Speedups of All Applications (Misaligned Data) and Overall Speedups of All Applications (Aligned Data). Each panel plots speedups for VMX Base, VMX Opt, SSE2 Base and SSE2 Opt over fft.4, fft.5, fft.6, wht.4, wht.5, wht.6, bitonic.5 (Group II), transpose, bit-reverse (Group III), c-saxpy, c-dot, r-fir, r-color (Group I).]


Related Work

Optimizing permutation instructions introduced by misalignment
– A. Eichenberger, P. Wu, K. O'Brien, Vectorization for SIMD architectures with alignment constraints, PLDI ’04
– P. Wu, A. Eichenberger, A. Wang, Efficient SIMD Code Generation for Runtime Alignment and Length Conversion, CGO ’05

Efficient permutation instruction generation
– A. Kudriavtsev, P. Kogge, Generation of permutations for SIMD processors, LCTES ’05
– M. Narayanan, K. Yelick, Generating permutation instructions from a high-level description, MSP ’04
– D. Nuzman, I. Rosen, A. Zaks, Auto-vectorization of interleaved data for SIMD, PLDI ’06

Similar idea, different applications
– A. Solar-Lezama, R. Rabbah, R. Bodik, K. Ebcioglu, Programming by sketching for bit-streaming programs, PLDI ’05
– S. Chatterjee, J. Gilbert, R. Schreiber, S. Teng, Automatic array alignment in data-parallel programs, POPL ’93
– G. Hwang, J. K. Lee, D. Ju, An array operation synthesis scheme to optimize FORTRAN 90 programs, PPOPP ’95


Conclusion

Reducing the overhead introduced by permutation instructions is a performance-critical problem for SIMD compilation

A unified framework is proposed to optimize data permutations
– Putting all forms of data permutations into a unified representation
– Propagating permutations across statements and merging them together
– Generating efficient permutation instructions natively supported by devices

Experiments were conducted on different applications
– Up to 77% of permutation instructions are eliminated
– Average performance improves by 48% on VMX and 68% on SSE2
– Near-peak overall speedups are achieved on some applications

Thank You!
June 2006
