EECS 583 – Class 22
Research Topic 4: Automatic SIMDization - Superword Level Parallelism
University of Michigan
December 10, 2012
Announcements
Last class today!
» No more reading

Dec 12-18 – Project presentations
» Each group sign up for 30-minute slot
» See me after class if you have not signed up

Course evaluations reminder
» Please fill one out, it will only take 5 minutes
» I do read them
» Improve the experience for future 583 students
Notes on Project Demos
Demo format
» Each group gets 30 minutes
  Strict deadlines enforced because many groups are back to back – don't be late!
  Figure out your room number ahead of time (see schedule on my door)
» Plan for 20 mins of presentation (no more!), 10 mins questions
  Some slides are helpful; try to have all group members say something
  Talk about what you did (basic idea, previous work), how you did it (approach + implementation), and results
  Demo or real code examples are good

Report
» 5 pg double spaced including figures – what you did + why, implementation, and results
» Due either when you do your demo or Dec 18 at 6pm
SIMD Processors: Larrabee (now called Knights Corner) Block Diagram
Vector Unit Block Diagram
Processor Core Block Diagram
Larrabee vs Conventional GPUs
Each Larrabee core is a complete Intel processor
» Context switching & pre-emptive multi-tasking
» Virtual memory and page swapping, even in texture logic
» Fully coherent caches at all levels of the hierarchy
Efficient inter-block communication
» Ring bus for full inter-processor communication
» Low latency, high bandwidth L1 and L2 caches
» Fast synchronization between cores and caches
Larrabee: the programmability of IA with the parallelism of graphics processors
Exploiting Superword Level Parallelism with Multimedia Instruction Sets
Multimedia Extensions
• Additions to all major ISAs
• SIMD operations

| Instruction Set | Architecture | SIMD Width | Floating Point |
|---|---|---|---|
| AltiVec | PowerPC | 128 | yes |
| MMX/SSE | Intel | 64/128 | yes |
| 3DNow! | AMD | 64 | yes |
| VIS | Sun | 64 | no |
| MAX2 | HP | 64 | no |
| MVI | Alpha | 64 | no |
| MDMX | MIPS V | 64 | yes |
Using Multimedia Extensions
• Library calls and inline assembly
 – Difficult to program
 – Not portable
• Different extensions to the same ISA
 – MMX and SSE
 – SSE vs. 3DNow!
• Need automatic compilation
Vector Compilation
• Pros:
 – Successful for vector computers
 – Large body of research
• Cons:
 – Involved transformations
 – Targets loop nests
Superword Level Parallelism (SLP)
• Small amount of parallelism
 – Typically 2- to 8-way
• Exists within basic blocks
• Uncovered with a simple analysis
• Independent isomorphic operations
 – New paradigm
1. Independent ALU Ops
R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835

[R]   [R]   [XR]   [1.08327]
[G] = [G] + [XG] * [1.89234]
[B]   [B]   [XB]   [1.29835]
2. Adjacent Memory References
R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]

[R]   [R]
[G] = [G] + X[i:i+2]
[B]   [B]
3. Vectorizable Loops

for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0]
3. Vectorizable Loops

for (i=0; i<100; i+=4)
  A[i+0] = A[i+0] + B[i+0]
  A[i+1] = A[i+1] + B[i+1]
  A[i+2] = A[i+2] + B[i+2]
  A[i+3] = A[i+3] + B[i+3]

for (i=0; i<100; i+=4)
  A[i:i+3] = A[i:i+3] + B[i:i+3]
4. Partially Vectorizable Loops
for (i=0; i<16; i+=1)
  L = A[i+0] - B[i+0]
  D = D + abs(L)
4. Partially Vectorizable Loops
for (i=0; i<16; i+=2)
  L = A[i+0] - B[i+0]
  D = D + abs(L)
  L = A[i+1] - B[i+1]
  D = D + abs(L)

for (i=0; i<16; i+=2)
  [L0]
  [L1] = A[i:i+1] - B[i:i+1]
  D = D + abs(L0)
  D = D + abs(L1)
Exploiting SLP with SIMD Execution
• Benefit:
 – Multiple ALU ops → One SIMD op
 – Multiple ld/st ops → One wide mem op
• Cost:
 – Packing and unpacking
 – Reshuffling within a register
Packing/Unpacking Costs
C = A + 2
D = B + 3

[C]   [A]   [2]
[D] = [B] + [3]
Packing/Unpacking Costs
• Packing source operands
A = f()
B = g()
C = A + 2
D = B + 3

[C]   [A]   [2]
[D] = [B] + [3]
Packing/Unpacking Costs
• Packing source operands
• Unpacking destination operands

A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7

[C]   [A]   [2]
[D] = [B] + [3]
Optimizing Program Performance
• To achieve the best speedup:
 – Maximize parallelization
 – Minimize packing/unpacking
• Many packing possibilities
 – Worst case: n ops → n! configurations
 – Different cost/benefit for each choice
Observation 1: Packing Costs can be Amortized

• Use packed result operands

A = B + C
D = E + F

G = A - H
I = D - J
Observation 1: Packing Costs can be Amortized

• Use packed result operands
• Share packed source operands

Use packed result operands:
A = B + C
D = E + F
G = A - H
I = D - J

Share packed source operands:
A = B + C
D = E + F
G = B + H
I = E + J
Observation 2: Adjacent Memory is Key

• Large potential performance gains
 – Eliminate ld/st instructions
 – Reduce memory bandwidth
• Few packing possibilities
 – Only one ordering exploits pre-packing
SLP Extraction Algorithm
• Identify adjacent memory references
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B
SLP Extraction Algorithm
• Identify adjacent memory references
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A]
[B] = X[i:i+1]
SLP Extraction Algorithm
• Follow def-use chains
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A]
[B] = X[i:i+1]
SLP Extraction Algorithm
• Follow def-use chains
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A]
[B] = X[i:i+1]

[H]   [C]   [A]
[J] = [D] - [B]
SLP Extraction Algorithm
• Follow use-def chains
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A]
[B] = X[i:i+1]

[H]   [C]   [A]
[J] = [D] - [B]
SLP Extraction Algorithm
• Follow use-def chains
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A]
[B] = X[i:i+1]

[C]   [E]   [3]
[D] = [F] * [5]

[H]   [C]   [A]
[J] = [D] - [B]
SLP Availability
[Bar chart: % dynamic SUIF instructions eliminated (0–100) for swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, fpppp, FIR, IIR, VMM, MMM, and YUV, with 128-bit and 1024-bit datapaths]
SLP vs. Vector Parallelism
[Bar chart: fraction of parallelism (0–100%) captured as SLP vs. vector parallelism for swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, and fpppp]
Conclusions
• Multimedia architectures abundant
 – Need automatic compilation
• SLP is the right paradigm
 – 20% non-vectorizable in SPEC95fp
• SLP extraction successful
 – Simple, local analysis
 – Provides speedups from 1.24 – 6.70
• Found SLP in general-purpose codes