SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures

DESCRIPTION
SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures. Yongjun Park¹, Sangwon Seo², Hyunchul Park³, Hyoun Kyu Cho¹, and Scott Mahlke¹. March 6, 2012. ¹University of Michigan, Ann Arbor; ²Qualcomm Incorporated, San Diego, CA.

TRANSCRIPT
University of Michigan, Electrical Engineering and Computer Science
SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures
Yongjun Park¹, Sangwon Seo², Hyunchul Park³, Hyoun Kyu Cho¹, and Scott Mahlke¹
March 6, 2012
¹University of Michigan, Ann Arbor  ²Qualcomm Incorporated, San Diego, CA
³Programming Systems Lab, Intel Labs, Santa Clara, CA
Convergence of Functionalities
- Convergence of functionalities demands a flexible solution
- Applications have different characteristics

[Figure: anatomy of an iPhone — 4G wireless, navigation, audio/video, 3D → a flexible accelerator!]
SIMD: Attractive Alternative to ASICs

[Chart: relative area (cost) vs. issue width (0–128) for VLIW vs. SIMD FUs, annotated 5.6x and 2x — SIMD area scales far more gently]

- Suitable for running wireless and multimedia applications for future embedded systems
- Advantages: high throughput, low fetch-decode overhead, easy to scale
- Disadvantages: hard to realize high resource utilization, high SIMDization overhead
- Example SIMD architectures: IBM Cell, ARM NEON, Intel MIC, etc.
- Example SIMD machine: 100 MOps/mW
Under-utilization on wide SIMD
- Multimedia applications have various natural SIMD widths
  - SIMD width characterization of innermost loops (Intel compiler rule)
  - Varies inside and across applications
- How to use idle SIMD resources?

[Chart: execution-time distribution at different SIMD widths for AAC, 3D, and H.264 — resource utilization on a 16-way SIMD ranges from full to heavily under-utilized]
Traditional Solutions for Under-utilization
- Dynamic power gating
  - Selectively cut off unused SIMD lanes
  - Effective dynamic & leakage power savings
  - Transition time & power overhead
  - High area overhead
- Thread-level parallelism
  - Execute multiple threads having separate data
  - Different instruction flow
  - Input-dependent control flow
  - High memory pressure
Objective of This Work
- Beyond loop-level SIMD
  - Put idle SIMD lanes to work
  - Find more SIMD opportunities inside vectorized basic blocks when loop-level SIMD parallelism is insufficient
- Possible SIMD instructions inside a vectorized basic block
  - Perform the same work
  - Same data flow
  - More than 50% of total instructions have some packing opportunity
- Challenges
  - High data-movement overhead between lanes
  - Hard to find the best instruction-packing combination
Partial SIMD Opportunity
1: for (it = 0; it < 4; it++) {
2:   i = a[it] + b[it];
3:   j = c[it] + d[it];
4:   k = e[it] + f[it];
5:   l = g[it] + h[it];
6:   m = i + j;
7:   n = k + l;
8:   result[it] = m + n;
9: }

[Figure: the unrolled operations laid out on SIMD resources — (1) loop-level SIMDization across iterations, then (2) partial SIMDization packs the remaining independent adds into idle lanes]
Subgraph-Level Parallelism (SGLP)
- Data-level parallelism between "identical subgraphs"
  - SIMDizable operators
  - Isomorphic dataflow
  - No dependencies on each other
- Advantages
  - Minimized overhead: no inter-lane data movement inside a subgraph
  - High instruction-packing gain: multiple instructions inside a subgraph increase the packing gain

[Figure: cost/gain trade-off of packing whole subgraphs vs. packing individual instructions]
Example: Program Order (Degree 2)

[Figure: FFT kernel dataflow (LD0/LD1, operations 0–11, ST0–ST3) packed across SIMD lanes 0 and 1 in program order — every packed pair straddles the lanes, so an inter-lane move is required at each step]

Gain: 1 = 9 (SIMD) − 8 (overhead)
Example: SGLP (Degree 2)

[Figure: the same FFT kernel packed by SGLP — the operations of each identical subgraph stay within their own lane, leaving only two inter-lane moves]

Gain: 7 = 9 (SIMD) − 2 (overhead)
Compilation Overview
[Figure: compilation flow — the application plus hardware information pass through loop unrolling & vectorization and dataflow generation, producing a loop-level vectorized basic-block dataflow graph; SGLP compilation then performs (1) subgraph identification → identical subgraphs, (2) SIMD lane assignment → lane-assigned subgraphs, and (3) code generation]
1. Subgraph Identification
- Heuristic discovery
  - Grow subgraphs from seed nodes and find identical subgraphs
- Additional conditions over traditional subgraph search
  - Corresponding operators are identical
  - Operand types must match: register/constant
  - No inter-subgraph dependencies

[Figure: dataflow graph over inputs a–h producing "result", with two candidate subgraphs (1, 2) containing + and * operators and a constant operand (256)]
2. SIMD Lane Assignment
- Select subgraphs to be packed and assign them to SIMD lane groups
  - Pack the maximum number of instructions with minimum overhead
  - Safe parallel execution without dependence violations
  - Criteria: gain, affinity, partial order

1. Subgraph gain: assign lanes in decreasing order of gain (Gain: A > B > C > D)
2. Affinity cost: accounts for data movement between different subgraphs, using producer/consumer and common producer/consumer relations. The affinity value counts how many related operations exist between subgraphs; assign the lane with the highest affinity (e.g., B0 is closer to A0 than to A1).
3. Partial-order check: the partial order of identical subgraphs inside the SIMD lanes must be the same (conflict when the partial order of C0 ≠ C1).

[Figure: subgraphs A0/A1, B0/B1, C0/C1, and D scheduled over lanes 0–3 and 4–7 across time, showing a conflict resolved by the partial-order check]
Experimental Setup
- 144 loops from industry-level optimized media applications
  - AAC decoder (MPEG-4 audio decoding, low-complexity profile)
  - H.264 decoder (MPEG-4 video decoding, baseline profile, QCIF)
  - 3D (3D graphics rendering)
- Target architecture: wide vector machines
  - SIMD width: 16–64
  - SODA-style wide vector instruction set
  - Single-cycle data-shuffle instruction (≈ vperm (VMX), vec_perm (AltiVec))
- IMPACT frontend compiler + cycle-accurate simulator
- Compared to two other solutions
  - SLP: superword-level parallelism (basic-block-level SIMDization) [Larsen, PLDI '00]
  - ILP: instruction-level parallelism on same-width VLIW machines
- SGLP applied at degrees 2–4
Static Performance
[Charts: relative performance (1x–3x) vs. packing degree (1–12), with and without overhead, for SLP, SGLP, and ILP across AAC, 3D, H.264, and the average]

- SGLP retains a similar trend to ILP once overhead is considered
- Max 1.66x at degree 4 (SLP: 1.27x)
- See the paper for representative kernels (FFT, DCT, HalfPel, ...)
Dynamic Performance on SIMD
[Chart: relative performance (1x–3x) for 16/32/64 SIMD lanes across AAC, 3D, H.264, and the average — SLP with overhead vs. SGLP with overhead]

- The available degree of SGLP (up to 4) is exploited only when a loop's natural SIMD width is insufficient
- Max 1.76x speedup (SLP: 1.29x)
SGLP (SIMD) vs. ILP (VLIW) Execution

                    SGLP @ 32-wide SIMD   ILP @ 4-way 8-wide VLIW   gain
  power (mW)        54.40                 93.17                     -31.61%
  cycles (million)  13.07                 10.77                     +21.36%
  energy (mJ)       3.55                  5.02                      -29.14%

[Figure: the 32-wide SIMD datapath built from 8-wide SIMD clusters with shared control vs. the VLIW datapath — about 30% more energy efficient!]

- 200 MHz (IBM 65 nm technology)
Conclusion
- SIMD is an energy-efficient solution for mobile systems.
- SIMD programming of multimedia applications is an interesting challenge due to the varying degrees of SIMD parallelism.
- Subgraph-level parallelism successfully provides supplemental SIMD parallelism by converting ILP into DLP inside vectorized basic blocks.
- SGLP outperforms traditional loop-level SIMDization by up to 76% on a 64-way SIMD architecture.
Questions?
For more information: http://cccp.eecs.umich.edu
Example 2: High-level View
[Figure: kernel 0 (SIMD width 8), kernel 1 (SIMD width 4), and kernel 2 (SIMD width 8) mapped over lanes 0–3 and 4–7 through time — kernel 1's subgraphs B, A0/A1, C0/C1, and D are spread across both lane groups by SGLP so its 4-wide work fills the 8-wide machine]

Gain = (A1 + C1) (SIMD) − ((A1→B) + (C1→D)) (overhead)
Static Performance (Kernels and Applications)

[Charts: relative performance (1x–4x) vs. packing degree (1–30), with and without overhead, for SLP, SGLP, and ILP across the kernels FFT, MDCT, MatMul4x4, MatMul3x3, HalfPel, and QuarterPel, and the applications AAC, 3D, and H.264 (plus the average)]

- Performance results depend on kernel characteristics (e.g., MatMul4x4 vs. MatMul3x3)
- SGLP retains a similar trend to ILP once overhead is considered
- Max 1.66x at degree 4 (SLP: 1.27x)