Architectural Enhancements for Efficient Operand Transport in Multimedia Systems ECE7102 Class Presentation Date: 2006. 4. 13 Hongkyu Kim [email protected]

Page 1: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

ECE7102 Class Presentation

Date: 2006. 4. 13

Hongkyu Kim [email protected]

Page 2: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

2/40

Overview

• Introduction

• Characterization and modeling of operand usage and transport

• Dynamic execution technique exploiting regular operand transport patterns in multimedia
– Instruction cluster mapping on the inter-ALU network for general-purpose domain
– Dynamic SIMDization for application-specific domain

• Summary

Page 3: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

3/40

Interconnect Complexity

[Figure: functional units (FU) and storage elements connected through an interconnect, drawn at a small scale (2 FUs, 2 storage blocks) and at a larger scale (8 FUs, 6 storage blocks)]

• Exponential increase of chip capacity → More devices

• Exponential decrease of feature size → Interconnect limitation

J.D. Meindl, Interconnect Opportunities for Gigascale Integration, IEEE MICRO, vol. 23, no. 4, pp.28-35, May/June 2003.

Page 4: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

4/40

Interconnect Bottleneck

ITRS 2002 Documents, http://public.itrs.net/Files/2002Update/Home.pdf.

[Figure: relative delay (log scale, 0.1 to 100) versus process technology node (250, 180, 130, 90, 65, 45, 42 nm); curve annotations: α, 1/α², 1/α²]

• Disparity between wire delay and gate delay

Page 5: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

5/40

Problem Statement

• High-performance interconnect
– Interconnect organizations
– Interconnect technologies

• Why are architectural responses limited?
– Compatibility with old ISAs
  • Sequentially-specified operations
  • Restricted register file-based operand namespace
– ILP mechanisms
  • Operand bypass network, register renaming, and instruction scheduling
  • Poorly scaling broadcast buses

Page 6: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

6/40

Research Objective and Approach

• Objective: Reduce latency of operand transport for multimedia
– Development of dynamic execution techniques
– Development of low-cost operand bypass networks

• Approach summary

[Diagram: research flow across three stages (Background work, General approach, Application-specific approach), with these blocks:]

– Analysis of operands
  • Examine operand usage properties
  • Explore the impact of architectural techniques on the operand transport
– Dynamic execution technique
  • Instruction clustering
  • Recognition of regular operand transport patterns
  • Efficient execution unit
– Cluster mapping on inter-ALU network
  • Basic instruction clustering
  • Raw cluster mapping
  • Local operand mapping on dedicated inter-ALU path
– Optimizing operand transport for multimedia systems
  • Regular pattern recognition
  • Cluster reorganization
  • Function remapping
  • Dynamic SIMDization
– Technology model-based evaluation on target platforms
  • GENESYS
  • SimpleScalar

Page 7: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

7/40

Overview

• Introduction

• Characterization and modeling of operand usage and transport

• Dynamic execution technique exploiting regular operand transport patterns in multimedia
– Instruction cluster mapping on the inter-ALU network for general-purpose domain
– Dynamic SIMDization for application-specific domain

• Summary

Page 8: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

8/40

Motivation and Approach

• Motivation
– Shift of microarchitectural design focus: Operand computation → Operand communication
– Recognizing and understanding operand usage and transport properties → Efficiently controlling operand traffic

• Approach summary
– Operand usage characteristics
  • How often operands are used → Examine temporal property
  • Where operands are used → Examine spatial property
– Operand transport properties
  • What accounts for the majority of communication needs

→ Explore the impact of architectural techniques on the operand transport

Page 9: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

9/40

Operand Usage Analysis

• General terms
– Operands: values in registers, memory locations, or memory addresses
– Operand transport: buffering and delivery of operands to FUs

• Operands’ temporal characteristics
– Which instructions consume operands after they are produced
– Metrics: Degree of use, Age, Lifetime

• Operands’ spatial characteristics
– From/to which FU operands are moved in the execution model
– Metrics: Degree of functionality, Transport pattern
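The temporal metrics can be made concrete with a small trace walk. The slides do not define the metrics, so this sketch assumes the following definitions: degree of use is the number of reads of a produced value, age is the distance in instructions from production to the latest read, and lifetime is the distance from production to the instruction that overwrites the register. The trace format (`(dest, srcs)` tuples) and the name `operand_usage` are hypothetical.

```python
def operand_usage(trace):
    """Return one record per produced register value.

    trace: list of (dest, srcs) tuples in dynamic program order, where dest is
    the written register (or None) and srcs is a tuple of read registers.
    The metric definitions below are assumptions, not taken from the slides.
    """
    live = {}      # register name -> index in `records` of its current value
    records = []
    for i, (dest, srcs) in enumerate(trace):
        # Sources are read before the destination is written.
        for reg in srcs:
            if reg in live:
                rec = records[live[reg]]
                rec["degree_of_use"] += 1
                rec["age"] = i - rec["born"]       # distance to the latest read
        if dest is not None:
            if dest in live:
                old = records[live[dest]]
                old["lifetime"] = i - old["born"]  # distance to the overwrite
            live[dest] = len(records)
            records.append({"reg": dest, "born": i,
                            "degree_of_use": 0, "age": 0, "lifetime": None})
    return records


# Hypothetical four-instruction trace: r1 is read twice, then overwritten.
trace = [("r1", ()), ("r2", ("r1",)), ("r3", ("r1", "r2")), ("r1", ("r3",))]
usage = operand_usage(trace)
```

Values still live at the end of the trace keep a lifetime of `None`, since the overwriting instruction has not been seen.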

Page 10: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

10/40

Operand Transport Analysis

• Operand transport model

[Figure: a global storage connected through a bypass network to ten functional units (FUs), each with its own local storage; labeled transport paths: transrd_global, transrd_bypass, transwr_global]

Page 11: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

11/40

Preliminary Results

• Operand usage properties (MediaBench average)

[Figure: stacked horizontal bars (0% to 100%) for degree of use, age, lifetime, and degree of functionality. Legend bins as extracted: "0, 1, 2, 3, >3"; "1, 2, 3~5, >5"; "1, 2, 3~5, 6~10, >10"; "0, 1 (same), 1 (different), >1"]

H. Kim, D. Wills, and L. Wills, “Empirical analysis of operand usage and transport in multimedia applications,” Proc. of the International Workshop on System-on-Chip for Real-Time Applications, pp. 168-171, July 2004.

Page 12: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

12/40

Preliminary Results (cntd.)

• Operand transport pattern (MediaBench average)

integer → integer: 43.0%
integer → branch: 14.9%
integer → ld/st: 13.6%
ld/st → integer: 13.8%
ld/st → ld/st: 6.6%
Others: 8.1%
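A breakdown like this can be gathered by tallying, for every register read, the functional-unit class of the producing instruction against the class of the consuming one. The sketch below is only an illustration of that bookkeeping: the trace format, the class names, and the function `transport_pattern` are assumptions, and it is not the measurement setup used for the slides.

```python
from collections import Counter


def transport_pattern(trace):
    """Share of producer-class -> consumer-class operand transfers.

    trace: list of (fu_class, dest, srcs) tuples in dynamic program order.
    Reads of registers with no producer in the trace are ignored.
    """
    producer = {}        # register -> FU class of the instruction that last wrote it
    counts = Counter()
    for fu_class, dest, srcs in trace:
        for reg in srcs:
            if reg in producer:
                counts[(producer[reg], fu_class)] += 1
        if dest is not None:
            producer[dest] = fu_class
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


# Hypothetical trace: two loads feed an add, a shift, and a store.
trace = [
    ("ld/st", "r2", ("r4",)),
    ("ld/st", "r3", ("r5",)),
    ("integer", "r2", ("r2", "r3")),
    ("integer", "r2", ("r2",)),
    ("ld/st", None, ("r2", "r7")),
]
shares = transport_pattern(trace)
```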

Page 13: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

13/40

Preliminary Results (cntd.)

• Effective architectural techniques on operand transport
– Storage hierarchy: local buffering
– Dedicated transport network
– Lifetime detection: compile-time/run-time
– Smart instruction steering

Page 14: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

14/40

Overview

• Introduction

• Characterization and modeling of operand usage and transport

• Dynamic execution technique exploiting regular operand transport patterns in multimedia
– Instruction cluster mapping on the inter-ALU network for general-purpose domain
– Dynamic SIMDization for application-specific domain

• Summary

Page 15: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

15/40

Motivation and Approach

• Motivation: multimedia applications
– Operand movement is highly regular
– Most operands are short-lived, transient operands
→ Develop dynamic execution technique exploiting regular operand distribution patterns and local properties

• Approach summary
– Instruction clustering: dynamic instruction grouping
– Recognition of regular operand transport pattern
– Efficient execution unit: reduce transport latency

Page 16: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

16/40

Related Work

• Solutions for multimedia processing
– Multimedia-specific ISA extensions
  • Exploit data-level parallelism at subword level
  • General-purpose domain: Intel’s MMX and SSE, AMD’s 3DNow!, Sun’s VIS, IBM’s AltiVec
  • Application-specific signal processing domain: Analog Devices’ TigerSHARC, TriMedia
– Vectorization and retargeting
  • Manual assembly coding
  • Hand-optimization: in-lined assembly code, library routines
  • Automatic vectorization: compiler/retargeting technology

Page 17: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

17/40

Related Work (cntd.)

• Solutions for reducing operand transport complexity
– Communication-aware execution
  • Network-connected tile architecture: RAW, GPA
  • Transport triggered architecture: MOVE
– Resource partitioning: Clustered architectures
  • Heterogeneous: decoupled architecture
  • Commercial: DEC Alpha 21264
  • Academia: Multicluster, Palacharla’s, PEWs, ILDP, CTCP
– Dynamic optimizations
  • Fill unit: reform instructions in H/W, and cache them
  • Small-scale dependence collapsing: combine dependences among multiple instructions → macro instruction

Page 18: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

18/40

Related Research Landscape

[Diagram: the proposed work, "Dynamic execution technique exploiting regular operand transport patterns in multimedia", placed among four related areas]

Related areas:
– Communication-aware execution: efficient operand transport
– Resource partitioning: clever instruction steering
– Dynamic optimizations: instruction grouping, small-scale dependence collapsing
– Multimedia processing: independent computation

Annotations on the links to the proposed work:
– Regular pattern of dependent instructions
– Steering burden off the critical path
– Binary compatibility, run-time optimization
– Larger, more general instruction grouping

Page 19: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

19/40

Research Methodology

[Flowchart: Application code (C source) → gcc cross-compiler → PISA binary → Instruction trace → Instruction stream → Cluster formation logic → Cluster storage (cache). On the execution platform, a "Matched?" check routes the instruction stream: N → Instruction queue → Normal execution unit; Y → Cluster queue → Cluster execution unit]

Page 20: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

20/40

Dynamic Instruction Clustering

• Instruction Cluster
– A connected subgraph of instructions joined by local operands
– Dataflow graph → Dependence edge classification → Instruction grouping

• Dependence edge types
– External: produced/consumed by previous/next blocks
– Non-clusterable: operands from/to memory
– Local: produced and consumed within the same block
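The edge classification and grouping steps can be sketched on a single block as follows. This is an illustration under stated assumptions, not the hardware cluster formation logic: an edge counts as non-clusterable whenever its producer or consumer is a memory instruction, external edges are detected only on the input side (registers with no producer in the block), and the names `MEMORY_OPS` and `classify_and_cluster` are hypothetical. Grouping uses a plain union-find over local edges.

```python
MEMORY_OPS = {"lbu", "lw", "sb"}  # assumption: opcodes treated as memory accesses


def classify_and_cluster(block):
    """block: list of (opcode, dest, srcs) tuples in program order."""
    last_writer = {}
    edges = []  # (producer index or None, consumer index, register, kind)
    for i, (op, dest, srcs) in enumerate(block):
        for reg in srcs:
            p = last_writer.get(reg)
            if p is None:
                kind = "external"            # produced by a previous block
            elif block[p][0] in MEMORY_OPS or op in MEMORY_OPS:
                kind = "non-clusterable"     # operand from/to memory
            else:
                kind = "local"
            edges.append((p, i, reg, kind))
        if dest is not None:
            last_writer[dest] = i

    # Union-find: instructions joined by local edges form one cluster.
    parent = list(range(len(block)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for p, c, _, kind in edges:
        if kind == "local":
            parent[find(p)] = find(c)

    groups = {}
    for i in range(len(block)):
        groups.setdefault(find(i), []).append(i)
    clusters = [g for g in groups.values() if len(g) > 1]
    return edges, clusters


# Shortened, renumbered fragment modeled on a MIPS-like color conversion loop.
block = [
    ("lbu", "r4", ("r9",)),
    ("sll", "r4", ("r4",)),
    ("addu", "r4", ("r4", "r10")),
    ("lw", "r2", ("r4",)),
    ("lw", "r3", ("r5",)),
    ("addu", "r2", ("r2", "r3")),
    ("sra", "r2", ("r2",)),
    ("sb", None, ("r2", "r7")),
]
edges, clusters = classify_and_cluster(block)
```

In this fragment the shift/add pair that computes the table address and the add/shift pair that combines the loaded values each form a two-instruction cluster; the loads and the store stay outside.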

Page 21: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

21/40

Instruction Clustering Example

• Color conversion block in JPEG encoder

0: lbu r4, 0(r9)
1: lbu r5, 1(r9)
2: lbu r6, 2(r9)
3: sll r4, r4, 0x2
4: addu r4, r4, r10
5: sll r5, r5, 0x2
6: addu r5, r5, r10
7: lw r2, 0(r4)
8: lw r3, 1024(r5)
9: sll r6, r6, r10
10: addu r6, r6, r10
11: addu r2, r2, r3
12: lw r3, 2048(r6)
13: addu r7, r25, r8
14: addu r2, r2, r3
15: sra r2, r2, 0x10
16: sb r2, 0(r7)
17: lw r2, 3072(r4)
18: lw r3, 4096(r5)
19: addu r2, r2, r3
20: lw r3, 5120(r6)
21: addu r7, r15, r8
22: addu r2, r2, r3
23: sra r2, r2, 0x10
24: sb r2, 0(r7)
25: lw r2, 5120(r4)
26: lw r3, 6144(r5)
27: addiu r9, r9, 3
28: addu r2, r2, r3
29: lw r3, 7168(r6)
30: addu r7, r12, r8
31: addiu r8, r8, 1
32: addu r2, r2, r3
33: sra r2, r2, 0x10
34: sb r2, 0(r7)
35: sltu r2, r8, r16
36: bne r2, r0, 0x412188

[Figure: dataflow graph of instructions 0 to 36, shown in three steps: the raw graph, the graph with edges marked as External, Local, or Non-clusterable, and the same graph with the resulting instruction clusters outlined]

Page 22: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

22/40

Overview

• Introduction

• Characterization and modeling of operand usage and transport

• Dynamic execution technique exploiting regular operand transport patterns in multimedia
– Instruction cluster mapping on the inter-ALU network for general-purpose domain
– Dynamic SIMDization for application-specific domain

• Summary

Page 23: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

23/40

Implementation Example - I

• Raw cluster execution on inter-ALU network
– Focus on intermediate, short-lived operands
  • Local operands: inter-ALU dedicated bypass network
  • Others: traditional global bypass network
– Organization
  • Instruction cluster formation
  • Cluster queue and scheduling
  • Cluster execution: inter-ALU network

H. Kim, D. Wills, and L. Wills, “Reducing operand communication overhead using instruction clustering for multimedia applications,” Proc. of 7th International Symposium on Multimedia, December 2005.

Page 24: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

24/40

Cluster Queue and Scheduling

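[Figure: a conventional instruction queue holds one instruction per entry (I0 to I3 between head and tail). The cluster queue holds one cluster per entry along its width (C0, C1, C2 between head and tail), with each cluster's instructions stacked along its depth (C0: I0 to I2, C1: I0 to I3, C2: I0 to I1) and an issue pointer per entry]

• Organization of cluster queue
– Single entry per cluster (2D)
– Ready flags for local operands are always set
– Issue pointer for each entry, in-order issue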
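A software model of this organization might look like the sketch below. It is a simplification under stated assumptions: each entry issues at most one instruction per cycle, only external source operands are checked for readiness (local operands are treated as always ready, as the slide states), and finished clusters are released from the head only. The class and method names are hypothetical.

```python
from collections import deque


class ClusterQueue:
    """Toy model: one queue entry per cluster, with a per-entry issue pointer."""

    def __init__(self, width):
        self.width = width       # maximum number of clusters held at once
        self.entries = deque()   # each entry: [cluster_id, instructions, issue_pointer]

    def enqueue(self, cluster_id, instructions):
        """instructions: list of (name, external_source_registers) in program order."""
        if len(self.entries) >= self.width:
            return False
        self.entries.append([cluster_id, list(instructions), 0])
        return True

    def issue(self, external_ready):
        """Issue at most one instruction per entry, in order within each entry."""
        issued = []
        for entry in self.entries:
            cluster_id, instrs, ptr = entry
            if ptr < len(instrs):
                name, external_srcs = instrs[ptr]
                # Local operands need no check; only external ones can stall.
                if all(external_ready.get(reg, False) for reg in external_srcs):
                    issued.append((cluster_id, name))
                    entry[2] = ptr + 1
        # Release fully issued clusters from the head of the queue.
        while self.entries and self.entries[0][2] == len(self.entries[0][1]):
            self.entries.popleft()
        return issued


queue = ClusterQueue(width=8)
queue.enqueue("C0", [("I0", ["r1"]), ("I1", []), ("I2", [])])
queue.enqueue("C1", [("I0", ["r2"]), ("I1", [])])
cycle1 = queue.issue({"r1": True, "r2": False})   # C1 stalls on r2
cycle2 = queue.issue({"r1": True, "r2": True})
cycle3 = queue.issue({"r1": True, "r2": True})
```

Because issue within an entry is strictly in order, a single pointer per entry replaces the per-instruction wakeup logic a conventional queue would need for those instructions.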

Page 25: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

25/40

Cluster Execution Unit

• Cluster mapping on inter-ALU network
– Local operands: dedicated bypass network
– Others: traditional global bypass network

[Figure: an instruction cluster of seven instructions (I0 to I6), arranged by instruction depth, is mapped onto a 4x4 network ALU (rows 0 to 3, columns 0 to 3): row 0 holds I0, I1, I6; row 1 holds I2, I4; row 2 holds I3; row 3 holds I5]
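One simple way to derive such a placement is to put each instruction in the row equal to its dependence depth and fill columns left to right. The sketch below assumes exactly that policy, which the slides do not spell out, and assumes that every local edge goes from a lower to a higher instruction index (program order). The function name and the example edges are hypothetical and are not the cluster drawn in the figure.

```python
def map_cluster(num_instrs, local_edges, rows=4, cols=4):
    """Place instructions on a rows x cols ALU grid, one row per dependence depth.

    local_edges: (producer, consumer) index pairs with producer < consumer.
    Returns {instruction: (row, col)}, or None if the cluster does not fit.
    """
    depth = [0] * num_instrs
    # Visiting edges by consumer index guarantees a producer's depth is final
    # before it is used, because all edges into it have a smaller consumer index.
    for producer, consumer in sorted(local_edges, key=lambda e: e[1]):
        depth[consumer] = max(depth[consumer], depth[producer] + 1)

    next_col = [0] * rows
    placement = {}
    for instr in range(num_instrs):
        row = depth[instr]
        if row >= rows or next_col[row] >= cols:
            return None  # too deep or too wide for this network ALU
        placement[instr] = (row, next_col[row])
        next_col[row] += 1
    return placement


# Hypothetical six-instruction cluster.
edges = [(0, 2), (1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
placement = map_cluster(6, edges)
too_deep = map_cluster(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
```

A cluster that is too deep or too wide for the grid is rejected here; a real design would presumably split it or fall back to the normal execution unit.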

Page 26: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

26/40

Experimental Setup

• Simulation Environment
– SimpleScalar sim-outorder simulator
– MediaBench application programs

• Processor Configurations

Parameter | 8-way | 16-way
Queues | 24 instruction queue, 8 cluster queue, 16 load/store queue | 48 instruction queue, 16 cluster queue, 32 load/store queue
FU resources | 4 integer ALUs, 1 (4x4) network ALU, 2 integer MULs, 2 floating ALUs, 1 floating MUL, 2 memory ports | 8 integer ALUs, 2 (4x4) network ALUs, 2 integer MULs, 2 floating ALUs, 1 floating MUL, 2 memory ports
Operand bypass (latency) | Local (0), pass-through (1), Global (1) | Local (0), pass-through (1), Global (max 3)

Page 27: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

27/40

Experimental Result

• Dynamic instruction coverage

[Figure: bar chart of clustered instructions / total committed instructions (0% to 80%) for cjpeg, djpeg, epic, epicun, g721decode, g721encode, mpeg2decode, mpeg2encode, rawcaudio, rawdaudio, and the average; one bar per size: 32, 64, 128, 256, 512, and 1K entries]

Page 28: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

28/40

Experimental Result (cntd.)

• Operand transport types

[Figure: stacked bar chart of average dependence edges per instruction (0.0 to 1.4), split into global, pass-through, and local transport, for cjpeg, djpeg, epic, epicun, g721decode, g721encode, mpeg2decode, mpeg2encode, rawcaudio, rawdaudio, and the average, in the 8-way and 16-way configurations. Annotated shares on the average bars: 29.5%, 11.0%, 59.5% and 31.5%, 10.6%, 57.8%]

Page 29: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

29/40

Experimental Result (cntd.)

• IPC speedup

[Figure: bar chart of IPC speedup (0% to 80%) for cjpeg, djpeg, epic, epicun, g721decode, g721encode, mpeg2decode, mpeg2encode, rawcaudio, rawdaudio, and the average, in the 8-way and 16-way configurations]

Page 30: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

30/40

Summary

• Summary of approach
– Dynamically group dependent instructions into clusters
– Store regular operand transport patterns
– Execute them on inter-ALU network where intermediate values are propagated among ALUs without using global buses

• Summary of results (MediaBench average)
– Dynamic instruction coverage: 57.3% @ 256-entry cluster cache
– Shortest transport rate: 30% (8-way), 32% (16-way)
– IPC speedup: 16.2% (8-way), 35.2% (16-way)

Page 31: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

31/40

Overview

• Introduction

• Characterization and modeling of operand usage and transport

• Dynamic execution technique exploiting regular operand transport patterns in multimedia
– Instruction cluster mapping on the inter-ALU network for general-purpose domain
– Dynamic SIMDization for application-specific domain

• Summary

Page 32: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

32/40

Implementation Example - II

• Data parallel execution using dynamic SIMDization
– Observation (image processing applications)
  • Operand movement within a loop iteration is highly regular
  • Small number of inner loops covers most of execution time
– Focus on regular operand transport pattern between iterations of innermost loop
  • Stride prediction: break loop-carried dependences → data-parallel execution
  • Operand lifetime detection → operand traffic control
– Organization
  • Instruction cluster formation
  • SIMD instruction queue and scheduling
  • SIMD PE array
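Stride prediction on a loop-carried register can be illustrated with a few lines. The sketch assumes a very simple predictor that is not taken from the slides: it requires at least three observed values with a constant difference, then extrapolates one value per SIMD lane. The function name and parameters are hypothetical.

```python
def predict_with_stride(history, lanes):
    """Predict the next `lanes` values of a loop-carried register.

    history: values observed at the start of successive loop iterations.
    Returns None when no constant stride has been established.
    """
    if len(history) < 3:
        return None
    strides = [b - a for a, b in zip(history, history[1:])]
    if any(s != strides[0] for s in strides):
        return None  # irregular update: keep executing sequentially
    return [history[-1] + strides[0] * k for k in range(1, lanes + 1)]


# A pointer advanced by 3 bytes per iteration, spread over four PEs.
lane_values = predict_with_stride([100, 103, 106], lanes=4)
irregular = predict_with_stride([0, 1, 3], lanes=4)
```

With predicted values in hand, each lane can start its iteration without waiting for the previous one to update the register, which is what breaks the loop-carried dependence.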

Page 33: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

33/40

Dynamic Instruction Clustering

• External dependence edge types
– External-input: serving only as input
– External-output: serving only as output
– External-updated: serving as both input and output

• Parallel and non-parallel region detection
– p-cluster: producing no external-updated output and not having unpredicted external-updated input
– np-cluster
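The two definitions translate directly into set operations. In this sketch, `live_in` and `live_out` are assumed to be the registers that cross the loop-body boundary as inputs and outputs; those names, both function names, and the per-cluster register sets in the example are hypothetical. The loop-level register sets match the ones listed for the IMGLIB convolution example.

```python
def classify_external(live_in, live_out):
    """Split boundary-crossing registers into the three external edge types."""
    external_input = live_in - live_out      # serving only as input
    external_output = live_out - live_in     # serving only as output
    external_updated = live_in & live_out    # serving as both input and output
    return external_input, external_output, external_updated


def is_p_cluster(cluster_inputs, cluster_outputs, external_updated, predicted):
    """A cluster is parallel if it produces no external-updated value and
    every external-updated value it reads has been predicted."""
    produces_updated = bool(cluster_outputs & external_updated)
    unpredicted_input = bool((cluster_inputs & external_updated) - predicted)
    return not produces_updated and not unpredicted_input


live_in = {"r8", "r9", "r10", "r11", "r13", "r15"}
live_out = {"r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9"}
ext_in, ext_out, ext_upd = classify_external(live_in, live_out)

# Hypothetical clusters: one reads a predicted r8, the other updates r9.
reads_predicted = is_p_cluster({"r8", "r15"}, {"r2"}, ext_upd, predicted={"r8", "r9"})
writes_updated = is_p_cluster({"r9"}, {"r9"}, ext_upd, predicted={"r8", "r9"})
```

The second cluster is an np-cluster because it produces r9, an external-updated register that the next iteration reads, so it cannot be replicated freely across lanes.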

Page 34: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

34/40

Instruction Clustering Example

• Image convolution code in TI’s IMGLIB

[Figure: dataflow graph of instructions 0 to 21 grouped into instruction clusters IC0 to IC5, with registers labeled on the external edges]

external-input = {r10, r11, r13, r15}
external-output = {r2, r3, r4, r5, r6, r7}
external-updated = {r8, r9}

p-clusters = {IC0, IC1, IC2, IC3}
np-clusters = {IC4, IC5}

Page 35: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

35/40

SIMD Execution Unit

• Cluster scheduling on SIMD PE array

[Figure: clusters scheduled on four processing elements (PE0 to PE3) over time t: (a) p-cluster scheduling, (b) np-cluster scheduling]

Page 36: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

36/40

SIMD Execution Unit (cntd.)

• Operand transport model

[Figure: operand transport between the scalar resources of a conventional ILP processor and the SIMD PEs; edge types shown: external-input, external-output, local, external-updated]

Page 37: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

37/40

Summary of Approach

• Dynamic parallelization
– Detect regular operand transport pattern on external-updated operands
– Compute stride → predict external-updated values

• Optimizing operand transport
– Identify the lifetime of operands
– Remove needless communication → localize transport

• Execute the clusters on 1-D mesh SIMD PE array

Page 38: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

38/40

Overview

• Introduction

• Characterization and modeling of operand usage and transport

• Dynamic execution technique exploiting regular operand transport patterns in multimedia
– Instruction cluster mapping on the inter-ALU network for general-purpose domain
– Dynamic SIMDization for application-specific domain

• Summary

Page 39: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

39/40

Summary

• Characterization and modeling of operand
– Examine the operand usage properties
– Explore the impact of architectural techniques on the operand transport

• Development of a dynamic execution technique
– Instruction clustering
– Recognition of regular operand transport pattern
– Efficient execution unit

Page 40: Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

40/40

Thank you. Any questions?