SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures

DESCRIPTION
SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures. Yongjun Park¹, Sangwon Seo², Hyunchul Park³, Hyoun Kyu Cho¹, and Scott Mahlke¹. March 6, 2012. ¹University of Michigan, Ann Arbor; ²Qualcomm Incorporated, San Diego, CA.

TRANSCRIPT
University of Michigan, Electrical Engineering and Computer Science
SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures
Yongjun Park¹, Sangwon Seo², Hyunchul Park³, Hyoun Kyu Cho¹, and Scott Mahlke¹
March 6, 2012
¹University of Michigan, Ann Arbor  ²Qualcomm Incorporated, San Diego, CA
³Programming Systems Lab, Intel Labs, Santa Clara, CA
Convergence of Functionalities
- Convergence of functionalities demands a flexible solution
- Applications have different characteristics

[Figure: anatomy of an iPhone — 4G wireless, navigation, audio/video, 3D → a flexible accelerator!]
SIMD: Attractive Alternative to ASICs

[Chart: relative area (cost) vs. issue width (0–128) for VLIW vs. SIMD FUs, annotated 5.6x and 2x — SIMD area scales far more gently]

- Suitable for running wireless and multimedia applications for future embedded systems
- Advantages: high throughput, low fetch-decode overhead, easy to scale
- Disadvantages: hard to realize high resource utilization, high SIMDization overhead
- Example SIMD architectures: IBM Cell, ARM NEON, Intel MIC, etc.
- Example SIMD machine: 100 MOps/mW
Under-utilization on wide SIMD
- Multimedia applications have various natural SIMD widths
  - SIMD width characterization of innermost loops (Intel compiler rule)
  - Varies inside and across applications
- How to use idle SIMD resources?

[Chart: execution-time distribution at different SIMD widths for AAC, 3D, and H.264 — resource utilization on a 16-way SIMD ranges from full to heavily under-utilized]
Traditional Solutions for Under-utilization
- Dynamic power gating
  - Selectively cut off unused SIMD lanes
  - Effective dynamic & leakage power savings
  - Transition time & power overhead
  - High area overhead
- Thread-level parallelism
  - Execute multiple threads having separate data
  - Different instruction flow
  - Input-dependent control flow
  - High memory pressure
Objective of This Work
- Beyond loop-level SIMD
  - Put idle SIMD lanes to work
  - Find more SIMD opportunities inside vectorized basic blocks when loop-level SIMD parallelism is insufficient
- Possible SIMD instructions inside a vectorized basic block
  - Perform the same work
  - Same data flow
  - More than 50% of total instructions have some packing opportunity
- Challenges
  - High data-movement overhead between lanes
  - Hard to find the best instruction-packing combination
Partial SIMD Opportunity
1: for (it = 0; it < 4; it++) {
2:   i = a[it] + b[it];
3:   j = c[it] + d[it];
4:   k = e[it] + f[it];
5:   l = g[it] + h[it];
6:   m = i + j;
7:   n = k + l;
8:   result[it] = m + n;
9: }

[Figure: the unrolled operations laid out on SIMD resources — (1) loop-level SIMDization across iterations, then (2) partial SIMDization packs the remaining independent adds into idle lanes]
Subgraph-Level Parallelism (SGLP)
- Data-level parallelism between "identical subgraphs"
  - SIMDizable operators
  - Isomorphic dataflow
  - No dependencies on each other
- Advantages
  - Minimized overhead: no inter-lane data movement inside a subgraph
  - High instruction-packing gain: multiple instructions inside a subgraph increase the packing gain

[Figure: cost/gain trade-off of packing whole subgraphs vs. packing individual instructions]
Example: Program Order (Degree 2)

[Figure: FFT kernel dataflow (LD0/LD1, operations 0–11, ST0–ST3) packed across SIMD lanes 0 and 1 in program order — every packed pair straddles the lanes, so an inter-lane move is required at each step]

Gain: 1 = 9 (SIMD) − 8 (overhead)
Example: SGLP (Degree 2)

[Figure: the same FFT kernel packed by SGLP — the operations of each identical subgraph stay within their own lane, leaving only two inter-lane moves]

Gain: 7 = 9 (SIMD) − 2 (overhead)
Compilation Overview
[Figure: compilation flow — the application plus hardware information pass through loop unrolling & vectorization and dataflow generation, producing a loop-level vectorized basic-block dataflow graph; SGLP compilation then performs (1) subgraph identification → identical subgraphs, (2) SIMD lane assignment → lane-assigned subgraphs, and (3) code generation]
1. Subgraph Identification
- Heuristic discovery
  - Grow subgraphs from seed nodes and find identical subgraphs
- Additional conditions over traditional subgraph search
  - Corresponding operators are identical
  - Operand types must match: register/constant
  - No inter-subgraph dependencies

[Figure: dataflow graph over inputs a–h producing "result", with two candidate subgraphs (1, 2) containing + and * operators and a constant operand (256)]
2. SIMD Lane Assignment
- Select subgraphs to be packed and assign them to SIMD lane groups
  - Pack the maximum number of instructions with minimum overhead
  - Safe parallel execution without dependence violations
  - Criteria: gain, affinity, partial order

1. Subgraph gain: assign lanes in decreasing order of gain (Gain: A > B > C > D)
2. Affinity cost: accounts for data movement between different subgraphs, using producer/consumer and common producer/consumer relations. The affinity value counts how many related operations exist between subgraphs; assign the lane with the highest affinity (e.g., B0 is closer to A0 than to A1).
3. Partial-order check: the partial order of identical subgraphs inside the SIMD lanes must be the same (conflict when the partial order of C0 ≠ C1).

[Figure: subgraphs A0/A1, B0/B1, C0/C1, and D scheduled over lanes 0–3 and 4–7 across time, showing a conflict resolved by the partial-order check]
Experimental Setup
- 144 loops from industry-level optimized media applications
  - AAC decoder (MPEG-4 audio decoding, low-complexity profile)
  - H.264 decoder (MPEG-4 video decoding, baseline profile, QCIF)
  - 3D (3D graphics rendering)
- Target architecture: wide vector machines
  - SIMD width: 16–64
  - SODA-style wide vector instruction set
  - Single-cycle data-shuffle instruction (≈ vperm (VMX), vec_perm (AltiVec))
- IMPACT frontend compiler + cycle-accurate simulator
- Compared to two other solutions
  - SLP: superword-level parallelism (basic-block-level SIMDization) [Larsen, PLDI '00]
  - ILP: instruction-level parallelism on same-width VLIW machines
- SGLP applied at degrees 2–4
Static Performance
[Charts: relative performance (1x–3x) vs. packing degree (1–12), with and without overhead, for SLP, SGLP, and ILP across AAC, 3D, H.264, and the average]

- SGLP retains a similar trend to ILP once overhead is considered
- Max 1.66x at degree 4 (SLP: 1.27x)
- See the paper for representative kernels (FFT, DCT, HalfPel, ...)
Dynamic Performance on SIMD
[Chart: relative performance (1x–3x) for 16/32/64 SIMD lanes across AAC, 3D, H.264, and the average — SLP with overhead vs. SGLP with overhead]

- The available degree of SGLP (up to 4) is exploited only when a loop's natural SIMD width is insufficient
- Max 1.76x speedup (SLP: 1.29x)
SGLP (SIMD) vs. ILP (VLIW) Execution

                    SGLP @ 32-wide SIMD   ILP @ 4-way 8-wide VLIW   gain
  power (mW)        54.40                 93.17                     -31.61%
  cycles (million)  13.07                 10.77                     +21.36%
  energy (mJ)       3.55                  5.02                      -29.14%

[Figure: the 32-wide SIMD datapath built from 8-wide SIMD clusters with shared control vs. the VLIW datapath — about 30% more energy efficient!]

- 200 MHz (IBM 65 nm technology)
Conclusion
- SIMD is an energy-efficient solution for mobile systems.
- SIMD programming of multimedia applications is an interesting challenge due to the varying degrees of SIMD parallelism.
- Subgraph-level parallelism successfully provides supplemental SIMD parallelism by converting ILP into DLP inside vectorized basic blocks.
- SGLP outperforms traditional loop-level SIMDization by up to 76% on a 64-way SIMD architecture.
Questions?
For more information: http://cccp.eecs.umich.edu
Example 2: High-level View
[Figure: kernel 0 (SIMD width 8), kernel 1 (SIMD width 4), and kernel 2 (SIMD width 8) mapped over lanes 0–3 and 4–7 through time — kernel 1's subgraphs B, A0/A1, C0/C1, and D are spread across both lane groups by SGLP so its 4-wide work fills the 8-wide machine]

Gain = (A1 + C1) (SIMD) − ((A1→B) + (C1→D)) (overhead)
Static Performance (Kernels and Applications)

[Charts: relative performance (1x–4x) vs. packing degree (1–30), with and without overhead, for SLP, SGLP, and ILP across the kernels FFT, MDCT, MatMul4x4, MatMul3x3, HalfPel, and QuarterPel, and the applications AAC, 3D, and H.264 (plus the average)]

- Performance results depend on kernel characteristics (e.g., MatMul4x4 vs. MatMul3x3)
- SGLP retains a similar trend to ILP once overhead is considered
- Max 1.66x at degree 4 (SLP: 1.27x)