Post on 19-Dec-2015
Optimizing Data Permutations for SIMD Devices
Gang Ren, Peng Wu¹, David Padua
University of Illinois at Urbana-Champaign
¹ IBM T.J. Watson Research Center
PLDI 06
SIMD Compilation

Explore Data Parallelism
• Vectorization
• Instruction Packing
• If Conversion
……

float a[16], b[16], c[16];
for (i = 0; i < 16; i++) c[i] = a[i] + b[i];

float a[16], b[16], c[16];
c[0:15] = a[0:15] + b[0:15];

Generating Efficient SIMD Code
• Data Permutation Optimization
• Idiom Recognition
• Execution Mapping
• Type Promotion Elimination
……

...
vr1 = vec_load(a);
vr2 = vec_load(b);
vr3 = vec_add(vr1, vr2);
...
Strict SIMD Architecture (1)

Most SIMD devices only support memory accesses on contiguous and aligned memory sections.

... = ...a[0:3:1]...;
vr1 = vec_load(a);

[Figure: Memory (a0 a1 a2 a3 | a4 a5 a6 a7 | ……), Register File, and ALU (+ + + +). The aligned section a0 a1 a2 a3 is loaded from memory into one vector register.]
Strict SIMD Architecture (2)

Additional permutation instructions are needed for non-contiguous and/or misaligned memory references.

... = ...a[0:6:2]...;
vr1 = vec_load(a);
vr2 = vec_load(a+4);
vr4 = vperm(vr1, vr2, <0,2,4,6>);

[Figure: Memory, Register File, and ALU (+ + + +). The sections a0 a1 a2 a3 and a4 a5 a6 a7 are loaded into two registers; vperm with pattern <0,2,4,6> selects a0 a2 a4 a6 from the pair. The odd elements a1 a3 a5 a7 are not used.]

Strict SIMD devices: All data reorganization must be accomplished with permutation instructions.
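The load-and-permute sequence above can be emulated in a few lines of Python. This is only a sketch of the idea: the register width of four elements is taken from the figure, and `vec_load` and `vperm` here are plain Python stand-ins for the slide's pseudo-instructions, not real intrinsics.

```python
VLEN = 4  # elements per vector register (assumed, as in the figure)

def vec_load(mem, offset=0):
    """Emulate an aligned, contiguous load of VLEN elements."""
    assert offset % VLEN == 0, "strict SIMD loads must be aligned"
    return mem[offset:offset + VLEN]

def vperm(x, y, pattern):
    """Select elements from the concatenation of two registers."""
    both = x + y
    return [both[i] for i in pattern]

a = [f"a{i}" for i in range(8)]
vr1 = vec_load(a)       # a0 a1 a2 a3
vr2 = vec_load(a, 4)    # a4 a5 a6 a7
vr4 = vperm(vr1, vr2, [0, 2, 4, 6])
print(vr4)  # ['a0', 'a2', 'a4', 'a6'], i.e. a[0:6:2]
```

The stride-2 section never touches memory directly; it is assembled in registers from two aligned loads.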
Overview of the Optimization Framework

Input:
float a[32], b[16], c[16];
c[0:15] = a[0:31:2] + b[0:15];

Stages: Normalization, Optimization, Code Generation

Output:
...
vr1 = vec_load(a);
vr2 = vec_load(a+4);
vr3 = vperm(vr1, vr2, …);
vr4 = vec_load(b);
...
Example: An 8-point FFT Program

0. Original program

t0[0:6:2] = x[0:3] + x[4:7];
t0[1:7:2] = x[0:3] - x[4:7];
t1[0:7] = T8[0:7] * t0[0:7];
for (i = 0; i < 2; i++) {
  t2[0:2:2] = t1[i:i+2:2] + t1[i+4:i+6:2];
  t2[1:3:2] = t1[i:i+2:2] - t1[i+4:i+6:2];
  t3[0:3] = T4[0:3] * t2[0:3];
  y[i+0:i+2:2] = t3[0:1] + t3[2:3];
  y[i+4:i+6:2] = t3[0:1] - t3[2:3];
}

1. After normalization

v1[0:3] = x[0:3] + x[4:7];
v1[4:7] = x[0:3] - x[4:7];
t0[0:7] = Permute(v1[0:7], P1);
t1[0:7] = T8[0:7] * t0[0:7];
v2[0:7] = Permute(t1[0:7], P2);
u1[0:7] = Permute(v2[0:7], P3);
u2[0:3] = u1[0:3] + u1[4:7];
u2[4:7] = u1[0:3] - u1[4:7];
v3[0:7] = Permute(u2[0:7], P4);
t2[0:7] = Permute(v3[0:7], P5);
t3[0:7] = T4_2[0:7] * t2[0:7];
u3[0:7] = Permute(t3[0:7], P6);
u4[0:3] = u3[0:3] + u3[4:7];
u4[4:7] = u3[0:3] - u3[4:7];
v4[0:7] = Permute(u4[0:7], P7);
y[0:7] = Permute(v4[0:7], P8);

2. After optimization

v1[0:3] = x[0:3] + x[4:7];
v1[4:7] = x[0:3] - x[4:7];
t1[0:7] = T8[0:7] * v1[0:7];
u1[0:7] = Permute(t1[0:7], Q1);
u2[0:3] = u1[0:3] + u1[4:7];
u2[4:7] = u1[0:3] - u1[4:7];
t3[0:7] = T4_2[0:7] * u2[0:7];
u3[0:7] = Permute(t3[0:7], Q2);
y[0:3] = u3[0:3] + u3[4:7];
y[4:7] = u3[0:3] - u3[4:7];

3. Generating native permutation instructions from Permute operations
Overview of the Optimization Framework: Normalization

Use generic Permute to represent:
• Non-unit strides
• Misalignment
• Other reorganizations
Data Permutations on Vectors

Permute(Xn, Pn): Xn is a vector and Pn is a permutation matrix

b[0:3] = Permute(a[0:3], <2,1,0,3>)

$$ b[0:3:1]^T = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \times a[0:3:1]^T $$

index:   0  1  2  3
a[0:3]:  a0 a1 a2 a3
b[0:3]:  a2 a1 a0 a3

Use Permute to represent all data reorganizations explicitly

Two stride-2 accesses at right-hand side:
... = a[0:6:2] + a[1:7:2];

become one Permute followed by contiguous accesses:
t[0:7] = Permute(a[0:7], <0,2,4,6,1,3,5,7>);
... = t[0:3] + t[4:7];
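A small Python sketch may help pin down the semantics used on this slide: element i of the result is the source element at index p[i]. The helper names `permute` and `permutation_matrix` are illustrative and do not come from the slides.

```python
def permute(x, p):
    """Permute(x, p): element i of the result is x[p[i]]."""
    assert sorted(p) == list(range(len(x))), "p must be a permutation"
    return [x[j] for j in p]

def permutation_matrix(p):
    """Row i has a single 1 in column p[i]."""
    n = len(p)
    return [[1 if j == p[i] else 0 for j in range(n)] for i in range(n)]

a = ["a0", "a1", "a2", "a3"]
b = permute(a, [2, 1, 0, 3])
print(b)  # ['a2', 'a1', 'a0', 'a3']

# Two stride-2 sections expressed as one Permute plus contiguous halves.
a8 = list(range(8))
t = permute(a8, [0, 2, 4, 6, 1, 3, 5, 7])
evens, odds = t[0:4], t[4:8]
```

With this convention, `evens` equals the Python slice `a8[0:8:2]` and `odds` equals `a8[1:8:2]`, which mirrors the a[0:6:2] and a[1:7:2] sections in the slide's inclusive-bound notation.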
Overview of the Optimization Framework: Optimization

Minimize Permute ops in a basic block
- Based on two rules of Permute
- An NP-complete problem
- Propagation-based algorithm
Two Important Rules on Permutations

Composition Rule

$$ B = P_1 \times A,\quad C = P_2 \times B \;\Rightarrow\; C = (P_2 \times P_1) \times A $$

Permute(Permute(a[0:3:1], <1, 0, 3, 2>), <2, 1, 0, 3>)
= Permute(a[0:3:1], <3, 0, 1, 2>)

a0 a1 a2 a3  ->  a1 a0 a3 a2  ->  a3 a0 a1 a2

Distributive Rule

$$ P \times (A + B) = P \times A + P \times B $$

Permute(a[0:3:1], <1, 0, 3, 2>) + Permute(b[0:3:1], <1, 0, 3, 2>)
= Permute(a[0:3:1] + b[0:3:1], <1, 0, 3, 2>)

Left-hand side:   a1 a0 a3 a2  +  b1 b0 b3 b2  =  a1+b1 a0+b0 a3+b3 a2+b2
Right-hand side:  a0+b0 a1+b1 a2+b2 a3+b3  ->  a1+b1 a0+b0 a3+b3 a2+b2
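Both rules can be checked numerically on the slide's own patterns. The sketch below assumes the convention that element i of Permute(x, p) is x[p[i]]; `compose` is a helper name introduced here for illustration.

```python
def permute(x, p):
    """Element i of the result is x[p[i]]."""
    return [x[j] for j in p]

def compose(inner, outer):
    """Pattern r such that Permute(Permute(x, inner), outer) == Permute(x, r)."""
    return [inner[j] for j in outer]

a = [10, 20, 30, 40]
b = [1, 2, 3, 4]
p, q = [1, 0, 3, 2], [2, 1, 0, 3]

# Composition rule: two nested permutations collapse into one.
merged = compose(p, q)
print(merged)  # [3, 0, 1, 2]
nested = permute(permute(a, p), q)
single = permute(a, merged)

# Distributive rule: permuting both operands equals permuting the sum.
lhs = [u + v for u, v in zip(permute(a, p), permute(b, p))]
rhs = permute([u + v for u, v in zip(a, b)], p)
```

The composition rule is what lets adjacent Permute statements be merged, and the distributive rule is what lets a Permute move across an element-wise operation.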
Propagation-Based Optimization Algorithm

Overview: Propagating permutation to permutation
– Step 1: Pick an unvisited permutation statement.
– Step 2: Propagate the permutation from the definition to the uses.
– Step 3: If a use is a permutation, go to (a); otherwise go to (b).
  a. Merge it with the propagated permutation pattern. Go to Step 1.
  b. Propagate the permutation from the right-hand side to the left-hand side. Go to Step 2.

Applied to the normalized 8-point FFT program (the 16 statements with permutations P1 to P8):

Propagate P1 across the multiplication by T8 (Step 3b). T8 becomes T8’ and P2 becomes P2’:
t1[0:7] = T8’[0:7] * v1[0:7];
v2[0:7] = Permute(t1[0:7], P2’);

Merge P2’ into P3 (Step 3a). P3 becomes P3’:
u1[0:7] = Permute(t1[0:7], P3’);

Continue in the same way with the remaining permutations. T4_2 becomes T4_2’ and P6 becomes P6’:
t3[0:7] = T4_2’[0:7] * u2[0:7];
u3[0:7] = Permute(t3[0:7], P6’);
y[0:3] = u3[0:3] + u3[4:7];
y[4:7] = u3[0:3] - u3[4:7];

Result, with only two permutations (Q1 and Q2) left:
v1[0:3] = x[0:3] + x[4:7];
v1[4:7] = x[0:3] - x[4:7];
t1[0:7] = T8[0:7] * v1[0:7];
u1[0:7] = Permute(t1[0:7], Q1);
u2[0:3] = u1[0:3] + u1[4:7];
u2[4:7] = u1[0:3] - u1[4:7];
t3[0:7] = T4_2[0:7] * u2[0:7];
u3[0:7] = Permute(t3[0:7], Q2);
y[0:3] = u3[0:3] + u3[4:7];
y[4:7] = u3[0:3] - u3[4:7];
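The first propagation step can be sketched in Python. The slides do not give the contents of P1, P2, or T8, so the values below are made-up placeholders, and the way T8’ and P2’ are derived here (inverse-permuting the constant table, then composing the patterns) is one plausible reading of the step rather than the authors' stated procedure.

```python
def permute(x, p):
    """Element i of the result is x[p[i]]."""
    return [x[j] for j in p]

def inverse(p):
    inv = [0] * len(p)
    for i, j in enumerate(p):
        inv[j] = i
    return inv

def compose(inner, outer):
    """Permute(Permute(x, inner), outer) == Permute(x, compose(inner, outer))."""
    return [inner[j] for j in outer]

def mul(x, y):
    return [u * v for u, v in zip(x, y)]

P1 = [0, 4, 1, 5, 2, 6, 3, 7]   # hypothetical pattern
P2 = [7, 6, 5, 4, 3, 2, 1, 0]   # hypothetical pattern
T8 = [1, 2, 3, 4, 5, 6, 7, 8]   # placeholder constants
v1 = [10, 20, 30, 40, 50, 60, 70, 80]

# Before: permute, multiply, permute again.
v2_before = permute(mul(T8, permute(v1, P1)), P2)

# After: P1 is pushed past the multiply by permuting the constant table,
# then merged with P2 into a single pattern.
T8p = permute(T8, inverse(P1))
P2p = compose(P1, P2)
v2_after = permute(mul(T8p, v1), P2p)
```

Both versions produce the same `v2`, but the second needs one run-time Permute instead of two, because the permutation of the constant table can be done ahead of time.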
Propagating Permutations to Partial Uses

b[0:7] = Permute(a[0:7], <3,2,1,0,7,6,5,4>);
c[0:3] = b[0:3] + b[4:7];

b[0:3] and b[4:7] are two partial uses of b[0:7].

This permutation can be partitioned and propagated to the partial uses:

b[0:3] = Permute(a[0:3], <3,2,1,0>);
b[4:7] = Permute(a[4:7], <3,2,1,0>);
c[0:3] = b[0:3] + b[4:7];

Not all permutations can be partitioned and propagated to partial uses:

b[0:7] = Permute(a[0:7], <0,4,1,5,2,6,3,7>);
c[0:3] = b[0:3] + b[4:7];

[Figure: permutation matrices of the two examples, and a decomposition of a permutation into factors labeled P, Q, and R.]

Improvements over partial use boundary
- Permutation decomposition
  Register-wise decomposition
  Shuffle instruction decomposition
- Permutation reshaping
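A simplified test for when a permutation can be split along partial-use boundaries is sketched below. It only handles the case shown on the slide, where every block of the result draws from the same block of the source; the function name and the block-size parameter are assumptions made for illustration.

```python
def split_permutation(p, block):
    """Return one local pattern per block if every block of p maps onto
    itself, otherwise None. Simplification: blocks that map onto a
    different source block are treated as not splittable."""
    parts = []
    for start in range(0, len(p), block):
        chunk = p[start:start + block]
        if any(not (start <= j < start + block) for j in chunk):
            return None
        parts.append([j - start for j in chunk])
    return parts

print(split_permutation([3, 2, 1, 0, 7, 6, 5, 4], 4))  # [[3, 2, 1, 0], [3, 2, 1, 0]]
print(split_permutation([0, 4, 1, 5, 2, 6, 3, 7], 4))  # None
```

The second pattern interleaves the two halves, so neither b[0:3] nor b[4:7] can be produced from a single half of a; that case is what decomposition and reshaping aim to improve.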
Optimization: Permutation Reshaping

For permutations used in commutative operations

Before reshaping:
b[0:7] = Permute(a[0:7], <0,5,2,7,4,1,6,3>);
c[0:3] = b[0:3] + b[4:7];

b[0:3]:  a0     a5     a2     a7
b[4:7]:  a4     a1     a6     a3
c[0:3]:  a0+a4  a5+a1  a2+a6  a7+a3

After reshaping:
b[0:7] = Permute(a[0:7], <0,1,2,3,4,5,6,7>);
c[0:3] = b[0:3] + b[4:7];

b[0:3]:  a0     a1     a2     a3
b[4:7]:  a4     a5     a6     a7
c[0:3]:  a0+a4  a1+a5  a2+a6  a3+a7
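The reshaping example relies on the fact that swapping the two operands of a commutative operation in any lane leaves the result unchanged. The sketch below uses a hypothetical heuristic (put the smaller source index in the first half of each lane); the slides do not state which heuristic the authors actually use.

```python
def permute(x, p):
    """Element i of the result is x[p[i]]."""
    return [x[j] for j in p]

def reshape_for_commutative(p):
    """Hypothetical heuristic: for each lane i, order the pair
    (p[i], p[i + h]) so the smaller index lands in the first half."""
    h = len(p) // 2
    lo = [min(p[i], p[i + h]) for i in range(h)]
    hi = [max(p[i], p[i + h]) for i in range(h)]
    return lo + hi

def add_halves(b):
    h = len(b) // 2
    return [b[i] + b[i + h] for i in range(h)]

p = [0, 5, 2, 7, 4, 1, 6, 3]
q = reshape_for_commutative(p)
print(q)  # [0, 1, 2, 3, 4, 5, 6, 7]

a = [1, 2, 4, 8, 16, 32, 64, 128]
before = add_halves(permute(a, p))
after = add_halves(permute(a, q))
```

Here the reshaped pattern is the identity, so the Permute disappears entirely while `before` and `after` hold the same sums.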
Overview of the Optimization Framework: Code Generation

- “Strip-mine” Permute to vperm inst.
- Map vperm to native permutation inst.
Generating Permutation Instructions (1)

a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15>);

b[0:15] in four registers:  (0 1 2 3)  (4 5 6 7)  (8 9 10 11)  (12 13 14 15)
a[0:15] in four registers:  (0 4 8 12)  (1 5 9 13)  (2 6 10 14)  (3 7 11 15)

[Figure: each target register is built by a chain of three vperm instructions. For the first register:
vperm <0,4,*,*>  ->  0 4 * *
vperm <0,1,4,*>  ->  0 4 8 *
vperm <0,1,2,4>  ->  0 4 8 12
The other registers pass through 1 5 * * and 1 5 9 *, 2 6 * * and 2 6 10 *, and 3 7 * * and 3 7 11 * in the same way.]
Generating Permutation Instructions (2)

a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15>);

b[0:15] in four registers:  (0 1 2 3)  (4 5 6 7)  (8 9 10 11)  (12 13 14 15)
a[0:15] in four registers:  (0 4 8 12)  (1 5 9 13)  (2 6 10 14)  (3 7 11 15)

Two Steps:
• Maximize empty slots when generating vperm instructions;
• Fill empty slots with data elements that go to the same target;

[Figure: the patterns <0,4,*,*> first produce 0 4 * * and 8 12 * *. Filling the empty slots turns them into <0,4,1,5>, producing 0 4 1 5 and 8 12 9 13. vperm <0,1,4,5> and vperm <2,3,6,7> then combine these two registers into 0 4 8 12 and 1 5 9 13. The registers 2 6 3 7 and 10 14 11 15 are produced and combined in the same way.]
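The two code-generation strategies can be compared by emulating vperm on four-element registers and counting instructions. This is a sketch under assumptions: `vperm` is a Python stand-in with `None` playing the role of the `*` slots, and the patterns for targets other than the first (for example <2,6,3,7>) are inferred from the figure's intermediate values rather than copied from the slides.

```python
def vperm(x, y, pattern, stats):
    """Select from the concatenation of x and y; None marks an empty slot."""
    stats["vperm"] += 1
    both = x + y
    return [None if i is None else both[i] for i in pattern]

def transpose_naive(r):
    """Build each target register with a chain of three vperms."""
    stats = {"vperm": 0}
    out = []
    for k in range(4):
        t = vperm(r[0], r[1], [k, 4 + k, None, None], stats)
        t = vperm(t, r[2], [0, 1, 4 + k, None], stats)
        t = vperm(t, r[3], [0, 1, 2, 4 + k], stats)
        out.append(t)
    return out, stats["vperm"]

def transpose_packed(r):
    """Fill the empty slots so each intermediate serves two targets."""
    stats = {"vperm": 0}
    out = []
    for k in (0, 2):
        pat = [k, 4 + k, k + 1, 5 + k]        # <0,4,1,5> when k == 0
        lo = vperm(r[0], r[1], pat, stats)    # e.g. 0 4 1 5
        hi = vperm(r[2], r[3], pat, stats)    # e.g. 8 12 9 13
        out.append(vperm(lo, hi, [0, 1, 4, 5], stats))
        out.append(vperm(lo, hi, [2, 3, 6, 7], stats))
    return out, stats["vperm"]

b = [list(range(i, i + 4)) for i in range(0, 16, 4)]
naive, n_naive = transpose_naive(b)
packed, n_packed = transpose_packed(b)
print(n_naive, n_packed)  # 12 8
```

Both functions produce the same four target registers; in this emulation the packed variant needs 8 vperm operations where the chained variant needs 12.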
Experiment Setups

Two SIMD devices: VMX (AltiVec) & SSE2

Tested applications
– Group I: Applications with relatively simple permutation patterns
  • C-Saxpy: Complex version of saxpy ( y = alpha*x + y )
  • R-Color, C-Dot, R-FIR, …
– Group II: Applications with complicated permutation patterns
  • FFT: Fast Fourier transform programs generated by the SPIRAL system
  • WHT: Walsh-Hadamard transform routines generated by the SPIRAL system
  • Bitonic sorting: One of the fastest sorting networks
– Group III: Reorganization-only applications
  • Matrix transpose
  • Bit-reversal reordering

                   VMX                   SSE2
Processor          1.8 GHz PowerPC G5    2.0 GHz Pentium 4
Main Memory        2048 MB               1024 MB
Operating System   Mac OS v10.3          Linux v2.4
Compiler           xlc v6.0              icc v9.0
Compiler Options   -O3 -qaltivec         -fast (-O3)
Static Evaluation: # of Permutation Inst.

                   VMX                      SSE2
Program    Size    Base   Opt   Reduced     Base   Opt   Reduced
fft.4      16      96     24    75.0%       96     24    75.0%
fft.5      32      208    48    76.9%       208    48    76.9%
fft.6      64      352    96    72.7%       352    96    72.7%
wht.4      16      48     12    75.0%       48     12    75.0%
wht.5      32      96     24    75.0%       96     24    75.0%
wht.6      64      192    48    75.0%       192    48    75.0%
bitonic.4  16      52     34    34.6%       56     34    39.3%
bitonic.5  32      136    92    32.3%       144    92    36.1%
bitonic.6  64      336    232   31.0%       352    232   34.1%
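The Reduced column appears to be the share of baseline permutation instructions that the optimization removes, (Base − Opt) / Base. A short sketch of that arithmetic, with the helper name `reduced` chosen here for illustration:

```python
def reduced(base, opt):
    """Percentage of baseline permutation instructions eliminated."""
    return round(100.0 * (base - opt) / base, 1)

print(reduced(96, 24))    # 75.0  (fft.4)
print(reduced(352, 96))   # 72.7  (fft.6)
print(reduced(56, 34))    # 39.3  (bitonic.4 on SSE2)
```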
Run-time Performance of FFT & Bitonic Sorting

[Figure: 32-point FFT on VMX/AltiVec. Y-axis: # of operations per second (100M), 0 to 12. Bars: Scalar Fast, SIMD Base, SIMD Opt.]

[Figure: 32-point FFT on SSE2. Y-axis: # of operations per second (100M), 0 to 10. Bars: Scalar O3, SIMD Base, SIMD Opt.]

[Figure: Bitonic Sorting on VMX/AltiVec. X-axis: size (8, 16, 32, 64, 128, 256). Y-axis: # of stages per second (100M), 0 to 16. Series: Scalar GCC, SIMD Base, SIMD Opt.]

[Figure: Bitonic Sorting on SSE2. X-axis: size (8, 16, 32, 64, 128, 256). Y-axis: # of stages per second (100M), 0 to 8. Series: Scalar O3, SIMD Base, SIMD Opt.]
Overall Speedups

[Figure: Overall Speedups of All Applications (Misaligned Data). X-axis: fft.4, fft.5, fft.6, wht.4, wht.5, wht.6, bitonic.5, transpose, bit-reverse, c-saxpy, c-dot, r-fir, r-color, with group markers 2, 3, 1. Y-axis: speedups, 0 to 4. Series: VMX Base, VMX Opt, SSE2 Base, SSE2 Opt.]

[Figure: Overall Speedups of All Applications (Aligned Data). Same axes, group markers, and series as the misaligned-data chart.]
Related Work

Optimizing permutation instructions introduced by misalignment
– A. Eichenberger, P. Wu, K. O'Brien, Vectorization for SIMD architectures with alignment constraints, PLDI ’04
– P. Wu, A. Eichenberger, A. Wang, Efficient SIMD Code Generation for Runtime Alignment and Length Conversion, CGO ’05

Efficient permutation instruction generation
– A. Kudriavtsev, P. Kogge, Generation of permutations for SIMD processors, LCTES ’05
– M. Narayanan, K. Yelick, Generating permutation instructions from a high-level description, MSP ’04
– D. Nuzman, I. Rosen, A. Zaks, Auto-vectorization of interleaved data for SIMD, PLDI ’06

Similar idea, different applications
– A. Solar-Lezama, R. Rabbah, R. Bodik, K. Ebcioglu, Programming by sketching for bit-streaming programs, PLDI ’05
– S. Chatterjee, J. Gilbert, R. Schreiber, S. Teng, Automatic array alignment in data-parallel programs, POPL ’93
– G. Hwang, J. K. Lee, D. Ju, An array operation synthesis scheme to optimize FORTRAN 90 programs, PPOPP ’95
Conclusion

Reducing the overhead introduced by permutation instructions is a performance-critical problem for SIMD compilation.

A unified framework is proposed to optimize data permutations
– Putting all forms of data permutations into a unified representation
– Propagating permutations across statements and merging them together
– Generating efficient permutation instructions natively supported by devices

Experiments were conducted on different applications
– Up to 77% of permutation instructions are eliminated
– Average performance improves by 48% on VMX and 68% on SSE2
– Near-peak overall speedups are achieved on some applications