
Min-wuk Lee 1

MS Thesis

A fixed-point 3D graphics library with energy-efficient cache architecture for mobile multimedia system

Min-wuk Lee

2004.12.14

Semiconductor System Laboratory
Department of Electrical Engineering and Computer Science
Korea Advanced Institute of Science and Technology (KAIST)

Outline
- Introduction
- Motivation
- MobileGL: Mobile 3D graphics library
- Energy-efficient CPU cache
- Energy-efficient texture cache
- Conclusion

Introduction (1/2)

Performance-energy co-optimization for mobile 3D graphics
- Software system: high-speed graphics library (MobileGL)
- Hardware system: energy-efficient cache architecture

[Figure: Mobile 3D graphics system in an embedded mobile system. Components: CPU, 3D rendering engine, memory. Input: model; output: pixel. Pipeline blocks shown: model interface, transformation, lighting, perspective projection, screen clipping, triangle setup, Gouraud shading, texture mapping, depth compare, alpha blending. The blocks are divided between the software system (MobileGL) and the hardware system.]

Goals
- Code optimized for speed, drawing out the best hardware performance
- Low energy consumption with good quality and high performance

Introduction (2/2)

Target system
- Low-cost target
  - High-speed graphics library
  - Energy-efficient CPU cache system
- High quality target
  - High-speed, good-quality graphics library
  - Energy-efficient CPU cache and texture cache system

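[Figure: The two target systems. Low-cost target: application processor (graphics library, CPU cache) and memory. High quality target: application processor (graphics library, CPU cache) together with a 3D graphics SoC (system bus controller, rendering engine (R.E.), texture cache, frame cache, depth cache) and memory.]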

Previous work

For the low-cost target
- PC and workstation platform graphics libraries
  - GL is too large and relies on an FPU and a special graphics engine
- Embedded platform graphics library with fixed-point arithmetic: Yoshida's work [1]
  - Limited operations (no texturing)
  - No research on the memory bandwidth bottleneck
- Previous work in our group
  - Analysis with memory only, without a cache system

For the high quality target
- Texture cache on the PC platform: Hakura's work [2]
  - Analysis based on miss rate
  - Did not consider energy, execution time, or system limitations

[1] K. Yoshida, IEEE Transactions on Consumer Electronics, 1998.
[2] Ziyad S. Hakura, ISCA, 1997.

Motivation

Graphics library of this work
- Extended operations: lighting, texturing, alpha blending, face culling, etc.
- Optimization of memory transactions
  - 3D graphics characteristics
  - Analysis with a cache system

Texture cache of this work
- Energy-efficient texture cache for an embedded system
  - With negligible performance degradation

MobileGL (1/6)

Mobile 3D graphics library
- Fixed-point arithmetic: 32-bit integer
- Optimized memory transactions: to reduce instruction and data traffic
- Selective pipeline: to reduce branches
- 45KB total code size

[Figure: MobileGL block diagram. The application selects one of two pipelines from model to pixel. Lighting enabled: view transformation, lighting, perspective projection, clipping, perspective division, screen mapping, cull face, then a rendering stage for lighting and texturing or lighting only (includes r,g,b,a calculation). Lighting disabled: view transformation, clipping, perspective division, screen mapping, cull face, then a texture-only rendering stage (excludes r,g,b,a calculation).]
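As an illustration of the fixed-point arithmetic on 32-bit integers, the following is a minimal sketch and not the MobileGL source. The Q16.16 split (`FRAC_BITS = 16`) and all function names are assumptions; the slides only state that 32-bit integers are used.

```python
# Sketch of Q16.16 fixed-point arithmetic. The 16.16 split is an assumption;
# the slides only say "fixed-point arithmetic, 32-bit integer".
FRAC_BITS = 16
ONE = 1 << FRAC_BITS  # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a float to fixed point (round to nearest)."""
    return int(round(x * ONE))


def to_float(a: int) -> float:
    """Convert a fixed-point value back to a float."""
    return a / ONE


def fx_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values.

    The intermediate product needs up to 64 bits on real hardware;
    Python integers are unbounded, so no overflow handling is shown.
    """
    return (a * b) >> FRAC_BITS


def fx_div(a: int, b: int) -> int:
    """Divide two fixed-point values (floor division)."""
    return (a << FRAC_BITS) // b


def transform(m, v):
    """Multiply a 4x4 fixed-point matrix by a 4-component fixed-point vector."""
    return [sum(fx_mul(m[r][c], v[c]) for c in range(4)) for r in range(4)]
```

For example, `to_float(fx_mul(to_fixed(1.5), to_fixed(2.0)))` evaluates to `3.0`, and multiplying a vector by the fixed-point identity matrix returns the vector unchanged.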

MobileGL (2/6)

Disabling option for perspective correction of texture address
- Justified by the small screen size
- Trade-off between correctness and speedup

[Figure: Execution time per 1K polygons (triangle setup, horizontal setup, texturing) with perspective correction on and off, StrongARM at 200 MHz. Turning the correction off gives a 30% reduction.]

[Figure: Pipeline comparison. Conventional: u,v/w is carried from clipping/perspective division through screen mapping, cull face, triangle setup, horizontal setup, and pixel interpolation. This work: plain u,v is carried through the same stages.]
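The trade-off between correctness and speedup can be seen in a small sketch that interpolates one texture coordinate along a span. This is an illustration in floating point for readability (the library itself uses fixed point); the function names and the parameter `t` (position along the span, 0 to 1) are assumptions.

```python
def affine_u(u0: float, u1: float, t: float) -> float:
    """Perspective correction off: interpolate u directly in screen space."""
    return u0 + (u1 - u0) * t


def perspective_u(u0: float, w0: float, u1: float, w1: float, t: float) -> float:
    """Perspective correction on: interpolate u/w and 1/w, then divide per pixel."""
    u_over_w = u0 / w0 + (u1 / w1 - u0 / w0) * t
    one_over_w = 1.0 / w0 + (1.0 / w1 - 1.0 / w0) * t
    return u_over_w / one_over_w
```

When both endpoints have the same w, the two functions agree. The per-pixel division is what the disabling option removes, at the cost of texture distortion that grows with the difference in w across a polygon.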

MobileGL (3/6)

Division reduction in interpolation
- Use a shift instead of a reciprocal
- The denominator value is 1, 2, or 4 with high probability

[Figure: Triangle with vertices 1st, 2nd, 3rd (Top, Mid, Bot) and edges Line1, Line2, Line3; a horizontal span Line4 runs from start to end along Direction_x, and spans advance along Direction_y.]

[Figure: Execution time (ms) per 1000 polygons, split into triangle setup, horizontal setup, and texturing, on StrongARM @ 200 MHz (bar label: ROD). Using a shift when the denominator is 1, 2, or 4 gives a 27% reduction.]
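A minimal sketch of the idea, assuming the interpolation step is computed as `delta / steps` with an integer step count; the function name and the fallback to a plain integer division are illustrative choices, not the library's code.

```python
# Shift amounts for the denominators the slide calls out as most frequent.
_SHIFT_FOR = {1: 0, 2: 1, 4: 2}


def slope(delta: int, steps: int) -> int:
    """Return the per-step increment delta / steps (floor), steps > 0.

    delta is a fixed-point value and steps an integer pixel count.
    For steps of 1, 2, or 4 an arithmetic shift replaces the division.
    """
    shift = _SHIFT_FOR.get(steps)
    if shift is not None:
        return delta >> shift  # fast path, no division
    return delta // steps      # general path (hypothetical fallback)
```

In Python both `>>` and `//` round toward negative infinity, so the fast path and the general path give identical results for negative deltas as well.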

MobileGL (4/6)

Z comparison in advance
- Avoids unnecessary shading and texturing [3]

Selective precision of matrix multiplication
- 67% improvement at the geometry stage

[Figure: Pixel pipeline order. Standard OpenGL pipeline: texturing, depth test, blending. Z comparison in advance: depth test, texturing, blending. For two fragments #1 and #2 at the same pixel, the standard order performs an unnecessary operation for #2.]

[Figure: Speed improvement (%, axis 0 to 30) versus Z_fail / Z_access.]

[Figure: Selective precision multiplication with operands A and B. A MULL B produces a 64-bit result (the result must be extended to 64 bits); A MUL B is used with a 32-bit operand and a 4-bit operand.]

[3] Ramchan Woo, ISSCC 2003.
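The reordering can be sketched as follows. This is an illustration only: blending is omitted, a smaller z is assumed to be nearer, and the fragment tuple layout and the `texture_fn` callback are hypothetical.

```python
def shade_pixels(fragments, depth_buffer, texture_fn):
    """Apply the depth test before texturing.

    fragments: iterable of (x, y, z, u, v) tuples.
    depth_buffer: dict mapping (x, y) to the nearest z seen so far.
    texture_fn: callable (u, v) -> color, the expensive step to avoid.
    Returns (color_buffer, number_of_texture_lookups).
    """
    color = {}
    texture_lookups = 0
    for x, y, z, u, v in fragments:
        # Z comparison in advance: a failing fragment is dropped
        # before any shading or texturing work is done.
        if z >= depth_buffer.get((x, y), float("inf")):
            continue
        depth_buffer[(x, y)] = z
        color[(x, y)] = texture_fn(u, v)
        texture_lookups += 1
    return color, texture_lookups
```

With three fragments on the same pixel at depths 0.5, 0.8, and 0.2, only two texture lookups happen, because the second fragment fails the depth test first. The benefit therefore scales with the Z_fail / Z_access ratio of the scene.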

MobileGL (5/6)

Library performance
- 67K polygons/sec in a texture-only application, as a result of several optimization steps

[Figure: Polygons per millisecond on ARM7 @ 80 MHz, ARM9 @ 200 MHz, and StrongARM @ 200 MHz for the original C code, the optimized code, and previous work [1]. Annotations: 67K polygons/sec; 6.7 times performance improvement.]

MobileGL (6/6)

Implementation result

Energy-efficient CPU cache (1/4)

Simulation environment
- From the ARM SDK: (1) and the memory transactions
- From the cache model, using the memory transactions: (2), (3)

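[Figure: Simulation environment. Application programs and the 3D graphics library run on the target hardware platform (ARM processor, memory). The ARM SDK produces the CPU execution time and a memory transaction file; CACHE_MODEL turns the memory transaction file into Memory_access_time; the two are added to give the total execution time.]

$$T_{total\_exe} = T_{CPU\_exe} + T_{memory\_access\_time}$$

- T_CPU_exe is accumulated over K = 1 ... instruction_counts, in CPU cycles
- T_memory_access_time is accumulated over the cache accesses counted by the cache model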

Energy-efficient CPU cache (2/4)

Cache model: execution time

[Figure: Processor core connected to the data cache and the instruction cache (hit_time), which connect to memory (memory_access_time).]

- hit_time = clock_period_core
- memory_access_time: a non-sequential access of (t_RCD + CAS_latency) memory clock periods, plus a burst access counted in memory clock periods
- miss_time: expressed in terms of hit_time (read) and memory_access_time (write)
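The flow from memory transaction file to execution time can be sketched with a small trace-driven cache model. This is not the thesis' CACHE_MODEL: the LRU replacement policy, the function names, and the way hit and miss times are added to the CPU time are all assumptions made for illustration.

```python
from collections import OrderedDict


def simulate_cache(addresses, size_bytes, line_bytes, ways):
    """Run an address trace through an LRU set-associative cache.

    Returns (hits, misses). ways=1 models a direct-mapped cache.
    """
    num_sets = size_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in addresses:
        line = addr // line_bytes
        cache_set = sets[line % num_sets]
        tag = line // num_sets
        if tag in cache_set:
            cache_set.move_to_end(tag)         # mark as most recently used
            hits += 1
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[tag] = True
    return hits, misses


def total_execution_time(cpu_time, hits, misses, hit_time, miss_time):
    """Assumed model: CPU time plus per-access hit and miss times."""
    return cpu_time + hits * hit_time + misses * miss_time
```

For the trace `[0, 4, 32, 0, 64, 0]` with a 64-byte cache and 32-byte lines, the direct-mapped configuration gives 2 hits and 4 misses, while the 2-way configuration gives 3 hits and 3 misses, since the conflict between lines 0 and 2 disappears.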

Energy-efficient CPU cache (3/4)

Energy modeling
- Model based on tools and research documentation
- Cache hit energy: from CACTI 3.0 [4]
- Cache miss energy: from "Power and Energy Characterization of the Itsy Pocket Computer" by Compaq Western Research Laboratory
  - 4.70 nJ / bus_clock [5], [6]

[4] CACTI 3.0: An integrated cache timing, power, and area model, Compaq Western Research Laboratory.
[5] Power and energy characterization of the Itsy pocket computer.
[6] A simulation framework for energy-consumption analysis of OS-driven embedded applications, TCAS 2003.
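A sketch of how these two energy sources might be combined, assuming hit energy is charged per hit and miss energy per bus clock spent on a miss. Only the 4.70 nJ figure comes from the slide; `hit_energy_nj` (which would come from CACTI) and `bus_clocks_per_miss` are hypothetical parameters.

```python
MISS_ENERGY_PER_BUS_CLOCK_NJ = 4.70  # value quoted on the slide


def cache_energy_nj(hits, misses, hit_energy_nj, bus_clocks_per_miss):
    """Total energy in nJ under the assumed per-hit / per-miss accounting."""
    hit_energy = hits * hit_energy_nj
    miss_energy = misses * bus_clocks_per_miss * MISS_ENERGY_PER_BUS_CLOCK_NJ
    return hit_energy + miss_energy
```

Because a miss costs several bus clocks at 4.70 nJ each, a configuration with slightly more misses but a cheaper hit can still come out ahead, which is the kind of trade-off the associativity comparison explores.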

Energy-efficient CPU cache (4/4)

Simulation results
- A 2-way cache gives 13% energy saving with 1% performance degradation compared with a conventional 4-way cache

[Figure: Miss rate (%), normalized execution time, and normalized energy consumption versus cache size (2KB, 4KB, 8KB, 16KB) for a direct-mapped data cache with 8 entries per line (32B line size).]

[Figure: Miss rate (%), normalized execution time, and normalized energy consumption versus associativity (DM, 2WAY, 4WAY, 8WAY) for a 16KB data cache with 32B line size. Annotations: 1% performance degradation, 13% energy saving.]

Energy-efficient texture cache (1/12)

Texture mapping (Introduction)
- Map from a 3D surface to the 2D texel domain (image)
- Texture coordinate
- Look up color in image
- Lookup method
  - Nearest texel
  - Interpolation of surrounding texels
  - MIPMAP: image pyramid

[Figure: Mapping F(x,y,z) = (s,t) from a surface in x, y, z to a 2D texture in s, t.]

[Figure: Image pyramid along the d axis, starting at Level 0.]

Energy-efficient texture cache (2/12)

Texture filtering methods (Introduction)
1. Point sampling
2. Bilinear filtering
3. Bilinear MIPMAP
4. Trilinear MIPMAP

[Figure: The four filtering methods, mapping screen-space pixels (1st, 2nd, 3rd) into texture space. Point sampling and bilinear filtering read LOD 0 only. Bilinear MIPMAP applies bilinear interpolation within one level of the pyramid (LOD 0 to LOD 3). Trilinear MIPMAP, for a fractional level such as LOD = 1.XX, applies linear interpolation between the bilinear results of two adjacent levels.]
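The filtering methods can be sketched on a single-channel texture stored as a list of rows. This is a floating-point illustration with assumed conventions (normalized coordinates s, t in [0, 1], texel centers at half-integer positions, clamping at the borders); none of the function names come from MobileGL.

```python
import math


def point_sample(tex, s, t):
    """Point sampling: return the nearest texel (1 texel read)."""
    h, w = len(tex), len(tex[0])
    x = min(int(s * w), w - 1)
    y = min(int(t * h), h - 1)
    return tex[y][x]


def bilinear_sample(tex, s, t):
    """Bilinear filtering: blend the 4 surrounding texels (4 texel reads)."""
    h, w = len(tex), len(tex[0])
    fx, fy = s * w - 0.5, t * h - 0.5
    x0, y0 = math.floor(fx), math.floor(fy)
    ax, ay = fx - x0, fy - y0

    def texel(x, y):
        # Clamp coordinates to the texture border.
        return tex[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    top = texel(x0, y0) * (1 - ax) + texel(x0 + 1, y0) * ax
    bottom = texel(x0, y0 + 1) * (1 - ax) + texel(x0 + 1, y0 + 1) * ax
    return top * (1 - ay) + bottom * ay


def trilinear_sample(pyramid, s, t, lod):
    """Trilinear MIPMAP: blend bilinear results of two levels (8 texel reads).

    pyramid[0] is LOD 0; lod is assumed to be non-negative.
    """
    lo = min(int(lod), len(pyramid) - 1)
    hi = min(lo + 1, len(pyramid) - 1)
    frac = lod - int(lod)
    return (bilinear_sample(pyramid[lo], s, t) * (1 - frac)
            + bilinear_sample(pyramid[hi], s, t) * frac)
```

Bilinear MIPMAP corresponds to calling `bilinear_sample` on a single selected level. The texel counts in the docstrings (1, 4, 8) indicate why memory traffic, and with it energy, rises from point sampling to trilinear MIPMAP.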

Energy-efficient texture cache (3/12)

Obstacle of texture mapping
- Requires extremely high bandwidth

Texture cache
- Reduces the off-chip memory access bottleneck
- Image conversion (texture map representation): reduces conflict misses
- Address conversion unit (a few logical operations and two additions)

[Figure: Texture cache system. The 3D rendering engine accesses the texture cache through address conversion; the external memory holds the texture after image conversion.]

Energy-efficient texture cache (4/12)

Simulation models

Model     Polygons   LOD[0:1]   LOD[1:2]
Tiny      6833       80%        10%
Stealth   542        67%        15%
Alien     854        48%        44%

LOD distribution measured with trilinear MIPMAP.

Energy-efficient texture cache (5/12)

Proposed texture map representation
- Reduces conflict misses at bank change
- Miss rate reduction, energy saving (17.4%), execution time reduction (15.2%)

[Figure: Blocked representation compared with Recursive Sub Block (RSB) representation.]

Energy-efficient texture cache (6/12)

Address conversion unit for RSB 2X2
- Uses the one-to-one correspondence between addresses to find the conversion rule
- Hardware implementation: only thirteen 2:1 muxes for trilinear MIPMAP

[Figure: Address conversion unit of this work (RSB 2X2). For each texture size (256 X 256, 128 X 128, 64 X 64, 32 X 32), the bits old0 to old11 of the core request address are rewired to the bits new0 to new11 of the converted address.]

Energy-efficient texture cache (7/12)

Texture cache model using bank interleaving
- Morton order representation: previous work
- The proposed RSB 2X2 is also free from bank conflict

[Figure: Texture cache organization per filtering method. Point sampling: 1 bank (address A0, data D0). Bilinear filtering and bilinear MIPMAP: 4 banks (addresses A0 to A3, data D0 to D3). Trilinear MIPMAP: two 4-bank texture caches, one for even LOD and one for odd LOD (addresses A0 to A7, data D0 to D7).]

Energy-efficient texture cache (8/12)

Performance and energy comparison between filtering methods
- Energy consumption and execution time: point sampling < bilinear filtering < bilinear MIPMAP < trilinear MIPMAP
- Trade-off point: image quality (aliasing criterion)

[Figure: Normalized energy for P.S., B.F., B.M., and T.M. with direct-mapped (D.M.), 2-way, and 4-way caches; 2KB cache, 16 entries/line, Tiny model.]

Energy-efficient texture cache (9/12)

Image quality analysis
- Textile model: LOD[0:1] 44%, LOD[1:2] 40% in MIPMAP
- DCT analysis: low-frequency terms in the top-left

[Figure: Two rows of images for point sampling, bilinear filtering, bilinear MIPMAP, and trilinear MIPMAP.]

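Energy-efficient texture cache (10/12)

Image quality metric in terms of aliasing criterion
- IndexQ, IndexE
- To find relative values, normalize from 0 to 1

$$\mathrm{image\_quality}_{0.5} = \frac{\sum_{f_x=0}^{\pi/2}\sum_{f_y=0}^{\pi/2}\mathrm{amplitude}}{\sum_{f_x=0}^{\pi}\sum_{f_y=0}^{\pi}\mathrm{amplitude}}$$

$$\mathrm{image\_quality}_{0.75} = \frac{\sum_{f_x=0}^{3\pi/4}\sum_{f_y=0}^{3\pi/4}\mathrm{amplitude}}{\sum_{f_x=0}^{\pi}\sum_{f_y=0}^{\pi}\mathrm{amplitude}}$$

$$\mathrm{Index}_Q = \frac{Q_{cur} - Q_{min}}{Q_{max} - Q_{min}}$$

$$\mathrm{Index}_E = \frac{E_{max} - E_{cur}}{E_{max} - E_{min}}$$

[Figure: DCT frequency plane with axes fx and fy from 0 to PI, showing the low-frequency region summed in the numerator and the full region summed in the denominator.]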
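The metric can be sketched on a square grid of DCT amplitudes whose rows and columns span the frequencies 0 to pi. This is an illustration under assumptions: the grid layout, the use of absolute values, the truncation of the cutoff index, and all function names are not taken from the thesis.

```python
def low_frequency_ratio(amplitude, cutoff):
    """Share of total amplitude below cutoff * pi on both frequency axes.

    amplitude: 2D list of DCT coefficients, low frequencies at top-left.
    cutoff: 0.5 for image_quality_0.5, 0.75 for image_quality_0.75.
    """
    rows = int(len(amplitude) * cutoff)
    cols = int(len(amplitude[0]) * cutoff)
    low = sum(abs(amplitude[i][j]) for i in range(rows) for j in range(cols))
    total = sum(abs(a) for row in amplitude for a in row)
    return low / total


def index_q(q_cur, q_min, q_max):
    """Normalized quality: 1 for the best quality among the candidates."""
    return (q_cur - q_min) / (q_max - q_min)


def index_e(e_cur, e_min, e_max):
    """Normalized energy: 1 for the lowest energy among the candidates."""
    return (e_max - e_cur) / (e_max - e_min)
```

Both indices grow toward 1 for the preferable candidate, so their sum can be compared directly across filtering methods.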

Energy-efficient texture cache (11/12)

Index = IndexQ + IndexE
- Almost the same quality between B.M. and T.M. in QVGA
- Large energy difference between B.M. and T.M.
- Poor image quality in P.S.
- Bilinear MIPMAP gets the largest score

[Figure: IndexQ, IndexE, and IndexQ + IndexE for P.S., B.F., B.M., and T.M., with 8 and 16 entries per line (8E, 16E) and quality cutoffs Q_0.5 and Q_0.75; 2-way set associative, 2KB texture cache.]

Energy-efficient texture cache (12/12)

Simulation results

[Figure: Energy comparison while changing cache size. Normalized energy for 1K, 2K, 4K, and 8K caches with 8 and 16 entries per line (8E, 16E) on the Tiny, Stealth, and Alien models; 2-way, using bilinear MIPMAP.]

- 4KB texture cache with 16B line size (2B per texel): energy-efficient, low cost, high quality

Conclusion

For performance-energy co-optimization in mobile 3D graphics: MobileGL and cache architecture

MobileGL: mobile 3D graphics library
- 67K polygons/sec
- 66.1% performance improvement on average

Energy-efficient CPU cache
- 2-way set associative cache to save energy

Energy-efficient texture cache
- Proposed texture map representation
- Bilinear MIPMAP shows a good quality-to-energy ratio
- 16B line size with a 4KB cache is the optimal point

Supplemental Materials

Graphics pipeline

Geometry stage

[Figure: Geometry stage. View transform from the camera position and camera direction (x, z axes), projection of the view frustum to the unit cube, clipping, 1/w, and screen mapping, followed by the rendering stage.]

Rendering stage
- Triangle setup: for Line1, Line2, Line3, using the 1st, 2nd, and 3rd vertices
- Horizontal setup: for Line_x, using start and end
- Pixel interpolation: shading and texturing of each pixel

[Figure: Triangle with vertices 1st, 2nd, 3rd (Top, Mid, Bot), edges Line1, Line2, Line3, and a horizontal span Line_x from start to end; Direction_x and Direction_y.]

Energy portion of cache
- ARM920T and M*CORE: caches consume 50% of total processor system power (Segars 01, Lee et al. 99)

[Figure: Power breakdown; annotation: >50%.]

Blocked representation

Conventional texture map representation (16X16 blocked)
- Block: square region in which texels are ordered consecutively
- Conflict misses at block change: A path: 2, B path: 16, C path: 16

Assumptions
2. Cache size = block size
3. 16 entries / 1 line

[Figure: Texel numbering in a conventional 16X16 block texture map. Texels are numbered in row-major order inside the block (first row 0 to 15, second row 16 to 31, and so on down to 240 to 255). Paths A, B, and C cross block boundaries.]

Proposed texture map representation

Recursive Sub Block texture map representation (RSB 4X4)
- Conflict misses at block change: A path: 4, B path: 4, C path: 4

Assumptions
2. Cache size = block size
3. 16 entries / 1 line

[Figure: Texel numbering with the recursive sub-block 4X4 method. First row: 0 1 2 3 16 17 18 19 64 65 66 67 80 81 82 83. Second row: 4 5 6 7 20 21 22 23 68 69 70 71 84 85 86 87. The first column continues 8, 12, 32, 36, 40, 44, 128, 132, 136, 140, 160, 164, 168, 172. Paths A, B, and C cross block boundaries.]
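The numberings shown in the blocked and recursive sub-block diagrams can be reproduced with a short sketch. It assumes what the diagrams suggest: texels are row-major inside a sub-block, and sub-blocks are ordered by interleaving the bits of their column and row (Z-order). The function names are hypothetical, and the sketch covers a single 16X16 region only, not the hardware address conversion unit.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x (even positions) and y (odd positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def blocked_index(x: int, y: int, block: int = 16) -> int:
    """Conventional blocked layout: row-major order inside one block."""
    return (y % block) * block + (x % block)


def rsb_index(x: int, y: int, sub: int = 4) -> int:
    """Recursive sub-block layout with sub x sub base blocks.

    Texels are row-major inside a base block; base blocks are placed
    in Z-order (assumption read off the slide diagrams).
    """
    inner = (y % sub) * sub + (x % sub)
    return interleave_bits(x // sub, y // sub) * sub * sub + inner
```

For instance, `rsb_index(4, 0, sub=4)` gives 16 and `rsb_index(0, 4, sub=4)` gives 32, matching the RSB 4X4 diagram, while `rsb_index(0, 15, sub=2)` gives 170 as in the RSB 2X2 diagram. Moving one texel down in the blocked layout jumps by 16 addresses, whereas in the RSB layouts neighboring texels stay within a small address range, which is the property the conflict-miss counts for paths A, B, and C reflect.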

Simulation between representation methods

Simulation results between texture representations
- Bilinear filtering, 2-way, 1KB texture cache
- 27% performance improvement on average
- A low miss rate does not mean high performance
- 8 entries/line or 16 entries/line shows good performance

[Figure: Normalized performance for 4E, 8E, 16E, and 32E lines with the 4X4, 8X8, and 16X16 blocked representations and RSB 2X2, on the Tiny, Stealth, and Alien models.]

[Figure: Miss rate for the same configurations.]

Simulation between representation methods

At bilinear filtering
- A low miss rate does not mean low energy consumption
- RSB achieves 17.4% energy saving compared to the best of the conventional methods

At point sampling
- 25% performance improvement on average

[Figure: Normalized performance and normalized energy for 4E, 8E, 16E, and 32E lines with the 4X4, 8X8, and 16X16 blocked representations and RSB 2X2, on the Tiny, Stealth, and Alien models. Panel captions: bilinear filtering, 1KB cache, 2-way; point sampling, 1KB cache, 2-way. Annotation: 17.4% energy saving.]

Morton order

Multi-ported cache
- To access more than one texel in the same cycle
- Interleaving the cache lines across multiple banks
  - Morton order
  - RSB 4X4: not free from bank conflict
  - RSB 2X2
- Trilinear filtering: not free from bank conflict
  - Cache for even, odd LOD

[Figure: Texel numbering and the LSB 2 bits of each texel address for Morton order (bank conflict free) and for the 4X4 block map and one further layout (both not free from bank conflict). Data outputs D0 to D3 and D4 to D7.]
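The bank-conflict property can be checked with a small sketch. It assumes, as the "LSB 2bits" labels in the figure suggest, that the bank is selected by the two least significant bits of the texel address, and that a bilinear access reads a 2X2 texel footprint in one cycle. The layout functions and the 16X16 test region are illustrative choices.

```python
def morton_index(x: int, y: int, bits: int = 8) -> int:
    """Morton (Z-order) address: x bits in even, y bits in odd positions."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def block4x4_index(x: int, y: int, width: int = 16) -> int:
    """4X4 block map: row-major texels inside row-major 4X4 blocks."""
    block_id = (y // 4) * (width // 4) + (x // 4)
    return block_id * 16 + (y % 4) * 4 + (x % 4)


def footprint_banks(index_fn, x: int, y: int):
    """Banks (address LSB 2 bits) touched by the 2X2 footprint at (x, y)."""
    texels = [(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)]
    return [index_fn(tx, ty) & 0b11 for tx, ty in texels]


def is_conflict_free(index_fn, size: int = 16) -> bool:
    """True if every 2X2 footprint in the region hits 4 distinct banks."""
    return all(
        len(set(footprint_banks(index_fn, x, y))) == 4
        for y in range(size - 1)
        for x in range(size - 1)
    )
```

Under these assumptions `is_conflict_free(morton_index)` is true, because the two low address bits are the parities of x and y, while `is_conflict_free(block4x4_index)` is false, because vertically adjacent texels share the same low bits.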

Proposed texture map representation

RSB 2X2 map representation
- Bank conflict free
- Conflict misses at bank change: A: 4, B: 4, C: 4

[Figure: RSB 4X4 compared with RSB 2X2. Texel numbering with the recursive sub-block 2X2 method. First row: 0 1 4 5 16 17 20 21 64 65 68 69 80 81 84 85. Second row: 2 3 6 7 18 19 22 23 66 67 70 71 82 83 86 87. The first column continues 8, 10, 32, 34, 40, 42, 128, 130, 136, 138, 160, 162, 168, 170. Paths A, B, and C cross block boundaries.]
