Min-wuk Lee 1
MS Thesis
A fixed-point 3D graphics library with energy-efficient cache architecture for mobile multimedia system
Min-wuk Lee
2004.12.14
Semiconductor System Laboratory
Department of Electrical Engineering and Computer Science
Korea Advanced Institute of Science and Technology [KAIST]
Outline
  Introduction
  Motivation
  MobileGL: Mobile 3D graphics library
  Energy-efficient CPU cache
  Energy-efficient texture cache
  Conclusion
Introduction (1/2)
Performance-energy co-optimization for mobile 3D graphics
  Software system: high-speed graphics library (MobileGL)
  Hardware system: energy-efficient cache architecture
[Figure: Mobile 3D graphics system (CPU, 3D rendering engine, memory) inside an embedded mobile system, divided into a software system (MobileGL) and a hardware system. Block labels from input to output: Model, Model Interface, Transformation, Lighting, Perspective Projection, Screen Clipping, Triangle Setup, Gouraud Shading, Texture Mapping, Alpha Blending, Depth Compare, Pixel.]
  Optimized code for speed
  Draw out the best H/W performance
  Low energy consumption
  Good quality, high performance
Introduction (2/2)
Target system
  Low-cost target
    High-speed graphics library
    Energy-efficient CPU cache system
  High-quality target
    High-speed, good-quality graphics library
    Energy-efficient CPU cache and texture cache system
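[Figure: The two target systems. Low-cost target: application processor running the graphics library, CPU cache, memory. High-quality target: application processor running the graphics library, CPU cache, and a 3D graphics SoC with rendering engine (R.E.), texture cache, frame cache, depth cache and system bus controller, plus memory.]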
Previous work
For the low-cost target
  Graphics libraries for PC and workstation platforms
    Too large: GL supported by an FPU and a special graphics engine
  Graphics library for embedded platforms: fixed-point arithmetic
    Yoshida's work [1]
      Limited operations (no texturing)
      No research on the memory bandwidth bottleneck
    Previous work in our group
      Analysis with memory only, without a cache system
For the high-quality target
  Texture cache on the PC platform
    Hakura's work [2]
      Analysis based on miss rate
      Did not consider energy, execution time or system limitations
[1] K. Yoshida, IEEE Transactions on Consumer Electronics, 1998.
[2] Ziyad S. Hakura, ISCA, 1997.
Motivation
Graphics library of this work
  Extended operations
    Lighting, texturing, alpha blending, face culling, etc.
  Optimization of memory transactions
    3D graphics characteristics
    Analysis with a cache system
Texture cache of this work
  Energy-efficient texture cache for an embedded system
    With negligible performance degradation
MobileGL (1/6)
Mobile 3D graphics library
  Fixed-point arithmetic
    32-bit integer
  Optimized memory transactions
    To reduce instruction and data traffic
  Selective pipeline by application
    To reduce branches
  45 KB total code size
[Figure: MobileGL block diagram, from model to pixel. Lighting enabled (lighting and texturing, or lighting only): view transformation, lighting, perspective projection, clipping, perspective division, screen mapping, cull face, rendering stage; includes r,g,b,a calculation. Lighting disabled (texture only): view transformation, perspective projection, clipping, perspective division, screen mapping, cull face, rendering stage; excludes r,g,b,a calculation.]
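A minimal sketch of the kind of 32-bit fixed-point arithmetic the slide refers to. The Q16.16 split (`FRAC_BITS = 16`) and all function names are assumptions for illustration; the slides state only that MobileGL uses 32-bit integers.

```python
FRAC_BITS = 16          # assumed Q16.16 format; not stated in the slides
ONE = 1 << FRAC_BITS    # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a float to fixed point, rounding to nearest."""
    return int(round(x * ONE))


def to_float(f: int) -> float:
    """Convert a fixed-point value back to a float."""
    return f / ONE


def fx_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values.

    The raw product carries 2 * FRAC_BITS fractional bits, so it is
    shifted right once to return to the working format.
    """
    return (a * b) >> FRAC_BITS


def fx_div(a: int, b: int) -> int:
    """Divide two fixed-point values (floor division)."""
    return (a << FRAC_BITS) // b
```

On a 32-bit CPU the intermediate product in `fx_mul` needs 64 bits, which is where the choice between a short and a long multiply instruction becomes relevant.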
MobileGL (2/6)
Option to disable perspective correction of the texture address
  Acceptable because of the small screen size
  Trade-off between correctness and speedup
[Figure: Rendered images with perspective correction on and off. Bar chart of execution time per 1K polygons on StrongARM at 200 MHz, split into triangle setup, horizontal setup and texturing, annotated with a 30% reduction. Pipeline comparison (view transformation, perspective projection, clipping/perspective division, screen mapping, cull face, triangle setup, horizontal setup, pixel interpolation): the conventional pipeline carries u,v/w, this work carries u,v.]
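To make the trade-off concrete, the sketch below contrasts plain (affine) interpolation of a texture coordinate with perspective-correct interpolation along one span. It uses floats for readability and the function names are hypothetical; it is not taken from MobileGL.

```python
def affine_u(u0: float, u1: float, t: float) -> float:
    """Interpolate u directly in screen space (correction off)."""
    return u0 + (u1 - u0) * t


def perspective_u(u0: float, w0: float, u1: float, w1: float, t: float) -> float:
    """Interpolate u/w and 1/w linearly, then divide per pixel (correction on)."""
    u_over_w = u0 / w0 + (u1 / w1 - u0 / w0) * t
    one_over_w = 1.0 / w0 + (1.0 / w1 - 1.0 / w0) * t
    return u_over_w / one_over_w
```

When both endpoints have the same w the two functions agree, so the error from switching correction off grows only with the depth range a polygon spans. The corrected path costs one extra division per pixel.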
MobileGL (3/6)
Division reduction in interpolation
  Use a shift instead of a reciprocal
  The denominator is 1, 2 or 4 with high probability
  When the denominator is 1, 2 or 4, a shift is used
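[Figure: Triangle rasterization diagram with vertices 1st, 2nd, 3rd (Top, Mid, Bot), edges Line1 to Line4, start and end points, Direction_x and Direction_y. Bar chart of execution time (ms) per 1000 polygons on StrongARM @ 200 MHz, split into triangle setup, horizontal setup and texturing, annotated with a 27% reduction for ROD.]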
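The shift-for-division idea can be sketched as follows. The function name and the fallback to an ordinary division are assumptions; the slides state only that denominators of 1, 2 and 4 are handled with shifts.

```python
def div_small(num: int, den: int) -> int:
    """Divide num by a positive integer den, using shifts for 1, 2 and 4."""
    if den == 1:
        return num
    if den == 2:
        return num >> 1     # arithmetic shift, same as floor division by 2
    if den == 4:
        return num >> 2     # same as floor division by 4
    return num // den       # general (slow) path


def edge_slope(dx_fixed: int, dy: int) -> int:
    """Per-scanline x increment of a triangle edge, dy in whole scanlines."""
    return div_small(dx_fixed, dy)
```

On a core without a hardware divider the general path is a software routine, so skipping it for short edges and spans saves the most on small triangles.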
MobileGL (4/6)
Z comparison in advance
  To avoid unnecessary shading and texturing [3]
Selective precision of matrix multiplication
  67% improvement at the geometry stage
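[Figure: Standard OpenGL pipeline (texturing, depth test, blending) compared with Z comparison in advance (depth test, texturing, blending); for fragments #1 and #2, texturing is an unnecessary operation for #2. Chart: speed improvement (0 to 30%) against Z_fail / Z_access. Multiplication diagrams: A MULL B and A MUL B with 32-bit and 4-bit operands and a 64-bit result, with the note "Should extend to 64bit for result".]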
[3] : Ramchan Woo, ISSCC 2003
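A small sketch of why moving the depth test ahead of texturing saves work. The fragment format, the "less than" depth function and the omission of blending are all assumptions made to keep the example short.

```python
def render(fragments, width, early_z):
    """Rasterize (x, z, texel) fragments into a 1-D frame.

    Returns the final colors and the number of texture fetches performed.
    """
    depth = [float("inf")] * width
    color = [None] * width
    texture_fetches = 0
    for x, z, texel in fragments:
        if early_z:
            if z >= depth[x]:
                continue              # rejected before any texturing
            texture_fetches += 1
            c = texel
        else:
            texture_fetches += 1      # texturing happens unconditionally
            c = texel
            if z >= depth[x]:
                continue              # work above was wasted
        depth[x] = z
        color[x] = c
    return color, texture_fetches
```

Both orderings produce the same image; only the number of texture fetches differs, and the saving grows with the share of fragments that fail the depth test.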
MobileGL (5/6)
Library performance
  67K polygons/sec for a texture-only application, owing to several optimization steps
[Figure: Bar chart of polygons per millisecond on ARM7 @ 80 MHz, ARM9 @ 200 MHz and StrongARM @ 200 MHz, comparing original C code, optimized code and previous work [1]. Annotations: 67K polygons/sec; 6.7 times performance improvement.]
MobileGL (6/6)
Implementation result
Energy-efficient CPU cache (1/4)
Simulation environment
  From ARM SDK: term 1, memory transactions
  From the cache model, using the memory transactions: terms 2 and 3
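[Figure: Simulation flow. Application programs and the 3D graphics library run on the target hardware platform (ARM processor, memory) under the ARM SDK, giving the CPU execution time and a memory transaction file. CACHE_MODEL processes the memory transaction file to give Memory_access_time. Both feed the total execution time.]
$$T_{\text{total\_exe}} = T_{\text{CPU\_exe}} + \underbrace{\text{memory\_access\_time}}_{3}$$
$$T_{\text{CPU\_exe}} = \underbrace{\sum_{K=1}^{\text{instruction\_counts}} T_{\text{instruction\_exe}}}_{1} + \underbrace{\sum_{K=1}^{\text{cache\_hit\_counts}} \text{CPU\_cycle}}_{2}$$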
Energy-efficient CPU cache (2/4)
Cache model: execution time
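[Figure: Processor core with data cache and instruction cache (hit_time), connected to memory (memory_access_time).]
$$\text{hit\_time} = \text{clock\_period}_{\text{core}}$$
$$\text{memory\_access\_time} = \begin{cases} t_{RCD} + \text{CAS\_latency} + \text{clock\_period}_{\text{mem}} & \text{(non\_sequential\_access)} \\ \text{clock\_period}_{\text{mem}} & \text{(burst\_access)} \end{cases}$$
$$\text{miss\_time} = \begin{cases} \text{memory\_access\_time} + \text{hit\_time} & \text{(read)} \\ \text{memory\_access\_time} & \text{(write)} \end{cases}$$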
Energy-efficient CPU cache (3/4)
Energy modeling
  Model based on tools and research documentation
  Cache hit energy: from CACTI 3.0 [4]
  Cache miss energy: from "Power and Energy Characterization of the Itsy Pocket Computer" by Compaq Western Research Laboratory
    – 4.70 nJ / bus_clock [5], [6]
[4] CACTI 3.0: An integrated cache timing, power, and area model, Compaq Western Research Laboratory
[5] Power and energy characterization of the Itsy pocket computer
[6] A simulation framework for energy-consumption analysis of OS-driven embedded applications, TCAS 2003
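A rough sketch of how such an energy model can be evaluated from hit and miss counts. Only the 4.70 nJ per bus clock figure comes from the slide; the per-hit energy (which CACTI would supply), the number of bus clocks per miss and the function name are placeholder assumptions.

```python
MISS_ENERGY_PER_BUS_CLOCK_NJ = 4.70   # from the slide


def cache_energy_nj(hits: int, misses: int,
                    hit_energy_nj: float,
                    bus_clocks_per_miss: int) -> float:
    """Total cache energy in nJ: hit energy plus off-chip miss energy.

    hit_energy_nj and bus_clocks_per_miss are inputs the caller must
    supply; they are not given in the slides.
    """
    hit_part = hits * hit_energy_nj
    miss_part = misses * bus_clocks_per_miss * MISS_ENERGY_PER_BUS_CLOCK_NJ
    return hit_part + miss_part
```

Because a miss is charged per bus clock, a larger line size raises the cost of each miss even when it lowers the miss rate, which is one way a lower miss rate can still lead to higher energy.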
Energy-efficient CPU cache (4/4)
Simulation results
  A 2-way cache gives 13% energy saving with 1% performance degradation compared with a conventional 4-way cache
[Figure: Six plots. Direct-mapped data cache, 8 entries/line (32 B line size), cache size 2KB, 4KB, 8KB, 16KB: miss rate (%), normalized execution time, normalized energy consumption. 16KB data cache, 32 B line size, associativity DM, 2-way, 4-way, 8-way: miss rate (%), normalized execution time, normalized energy consumption. Annotations: 1% performance degradation, 13% energy saving.]
Energy-efficient texture cache (1/12)
Texture mapping (introduction)
  Map from a 3D surface to the 2D texel domain (image)
  Texture coordinate
  Look up the color in the image
  Lookup methods
    Nearest texel
    Interpolation of surrounding texels
    MIPMAP
      – Image pyramid
[Figure: Mapping F(x,y,z) = (s,t) from a 3D surface (x, y, z) to a 2D texture (s, t); image pyramid with Level 0 at the base along the d axis.]
Energy-efficient texture cache (2/12)
Texture filtering methods (introduction)
  Point sampling, bilinear filtering, bilinear MIPMAP, trilinear MIPMAP
[Figure: The four filtering methods, drawn between texture space and screen space. 1. Point sampling. 2. Bilinear filtering: bilinear interpolation at LOD 0. 3. Bilinear MIPMAP: bilinear interpolation within one level of the pyramid (LOD 0 to LOD 3). 4. Trilinear MIPMAP: bilinear interpolation in two levels followed by linear interpolation between them, for example at LOD = 1.XX.]
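The filtering methods listed on this slide can be sketched as below. Textures are plain lists of rows, texel centers are assumed to sit at integer coordinates, edges are clamped, and the function names are hypothetical.

```python
import math


def texel(img, x, y):
    """Fetch one texel with clamp-to-edge addressing."""
    h, w = len(img), len(img[0])
    return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]


def point_sample(img, s, t):
    """Nearest texel: 1 fetch."""
    return texel(img, math.floor(s + 0.5), math.floor(t + 0.5))


def bilinear(img, s, t):
    """Weighted average of the 2x2 neighborhood: 4 fetches."""
    x0, y0 = math.floor(s), math.floor(t)
    fx, fy = s - x0, t - y0
    top = texel(img, x0, y0) * (1 - fx) + texel(img, x0 + 1, y0) * fx
    bot = texel(img, x0, y0 + 1) * (1 - fx) + texel(img, x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bot * fy


def bilinear_mipmap(pyramid, s, t, lod):
    """Bilinear filtering in the nearest pyramid level: 4 fetches."""
    level = min(math.floor(lod + 0.5), len(pyramid) - 1)
    return bilinear(pyramid[level], s / 2 ** level, t / 2 ** level)


def trilinear(pyramid, s, t, lod):
    """Bilinear in two adjacent levels, then a linear blend: 8 fetches."""
    lo = min(math.floor(lod), len(pyramid) - 1)
    hi = min(lo + 1, len(pyramid) - 1)
    f = lod - math.floor(lod)
    a = bilinear(pyramid[lo], s / 2 ** lo, t / 2 ** lo)
    b = bilinear(pyramid[hi], s / 2 ** hi, t / 2 ** hi)
    return a * (1 - f) + b * f
```

The fetch counts (1, 4, 4 and 8 texels per pixel) are what drive the bandwidth demand of each method.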
Energy-efficient texture cache (3/12)
Obstacle of texture mapping
  Requires extremely high bandwidth
Texture cache
  Reduces the off-chip memory access bottleneck
  Image conversion (texture map representation): reduces conflict misses
  Address conversion unit (a few logical operations and two additions)
[Figure: Texture cache system. External memory, image conversion, texture cache, address conversion, 3D rendering engine.]
Energy-efficient texture cache (4/12)
Simulation models
  Tiny: 6833 polygons; LOD[0:1] 80%, LOD[1:2] 10%
  Stealth: 542 polygons; LOD[0:1] 67%, LOD[1:2] 15%
  Alien: 854 polygons; LOD[0:1] 48%, LOD[1:2] 44%
  LOD distribution measured with trilinear MIPMAP
Energy-efficient texture cache (5/12)
Proposed texture map representation
  Reduces conflict misses at bank change
  Miss rate reduction, energy saving (17.4%), execution time reduction (15.2%)
[Figure: Blocked representation compared with Recursive Sub Block representation.]
Energy-efficient texture cache (6/12)
Address conversion unit for RSB2X2
  Uses the one-to-one correspondence between addresses to find the conversion rule
  Hardware implementation: only thirteen 2:1 muxes for trilinear MIPMAP
[Figure: Address conversion unit of this work (RSB 2X2). Four wiring diagrams, for 256 X 256, 128 X 128, 64 X 64 and 32 X 32 textures, each mapping the core request address bits old0 to old11 to the converted address bits new0 to new11.]
Energy-efficient texture cache (7/12)
Texture cache model using bank interleaving
  Morton order representation: previous work
  The proposed RSB2X2 is also free from bank conflict
[Figure: Texture cache configurations. Point sampling: texture cache with 1 bank (address A0, data D0). Bilinear filtering and bilinear MIPMAP: texture cache with 4 banks (addresses A0 to A3, data D0 to D3). Trilinear MIPMAP: texture caches for even and odd LOD, 4 banks each (addresses A0 to A7, data D0 to D7).]
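A quick check of the bank-conflict claim for a 2x2 bilinear footprint. It assumes the bank is selected by the two least significant bits of the texel address, and that the low address bits are one bit of x and one of y for RSB2X2 but the two low bits of x when 4x4 sub-blocks are stored in raster order. The function names are hypothetical.

```python
def bank_rsb2x2(x: int, y: int) -> int:
    """Bank under a 2x2 sub-block layout: low bit of x and low bit of y."""
    return ((y & 1) << 1) | (x & 1)


def bank_rsb4x4(x: int, y: int) -> int:
    """Bank under a 4x4 raster sub-block layout: the two low bits of x."""
    return x & 3


def footprint_banks(bank_fn, x: int, y: int) -> set:
    """Banks touched by the four texels of a bilinear fetch at (x, y)."""
    return {bank_fn(x + dx, y + dy) for dy in (0, 1) for dx in (0, 1)}
```

Under these assumptions every 2x2 footprint hits four distinct banks with the 2x2 layout, so the four texels can be read in one cycle. With the 4x4 layout vertically adjacent texels share a bank, so only two banks are used and the accesses collide.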
Energy-efficient texture cache (8/12)
Performance and energy comparison between filtering methods
  Energy consumption and execution time: point sampling < bilinear filtering < bilinear MIPMAP < trilinear MIPMAP
  Trade-off point: image quality (aliasing criterion)
[Figure: Normalized energy for P.S., B.F., B.M. and T.M. with direct-mapped, 2-way and 4-way caches; 2KB cache, 16 entries/line, Tiny model.]
Energy-efficient texture cache (9/12)
Image quality analysis
  Textile model
    LOD[0:1]: 44%, LOD[1:2]: 40% in MIPMAP
  DCT analysis
    Low-frequency terms in the top-left
[Figure: Two rows of images, each labeled point sampling, bilinear filtering, bilinear mipmap, trilinear mipmap.]
Energy-efficient texture cache (10/12)
Image quality metric in terms of the aliasing criterion
  IndexQ, IndexE
    To find relative values, normalize from 0 to 1
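$$\text{image\_quality}_{0.5} = \frac{\sum_{f_x=0}^{\pi/2} \sum_{f_y=0}^{\pi/2} \text{amplitude}}{\sum_{f_x=0}^{\pi} \sum_{f_y=0}^{\pi} \text{amplitude}}$$
$$\text{image\_quality}_{0.75} = \frac{\sum_{f_x=0}^{3\pi/4} \sum_{f_y=0}^{3\pi/4} \text{amplitude}}{\sum_{f_x=0}^{\pi} \sum_{f_y=0}^{\pi} \text{amplitude}}$$
$$\text{Index}_Q = \frac{Q_{cur} - Q_{min}}{Q_{max} - Q_{min}} \qquad \text{Index}_E = \frac{E_{max} - E_{cur}}{E_{max} - E_{min}}$$
[Figure: Frequency plane with axes fx and fy from 0 to PI, showing the summation regions.]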
Energy-efficient texture cache (11/12)
Index = IndexQ + IndexE
  Almost the same quality for B.M. and T.M. at QVGA
  Large energy difference between B.M. and T.M.
  Poor image quality with P.S.
  Bilinear MIPMAP gets the largest score
[Figure: Three bar charts over P.S., B.F., B.M. and T.M. for a 2-way set associative, 2KB texture cache: IndexQ, IndexE and IndexQ+IndexE. Legend entries: 8E, 16E, Q_0.5, Q_0.75; the combined chart shows 8E (Q_0.5), 16E (Q_0.5), 8E (Q_0.75) and 16E (Q_0.75).]
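The combined score can be computed as in this sketch. The quality and energy numbers below are invented purely to exercise the arithmetic and are not results from the thesis; the function and variable names are likewise hypothetical.

```python
def index_q(q_cur: float, q_min: float, q_max: float) -> float:
    """Quality index: 0 for the worst quality, 1 for the best."""
    return (q_cur - q_min) / (q_max - q_min)


def index_e(e_cur: float, e_min: float, e_max: float) -> float:
    """Energy index: 0 for the highest energy, 1 for the lowest."""
    return (e_max - e_cur) / (e_max - e_min)


def combined_index(quality: dict, energy: dict) -> dict:
    """Index = IndexQ + IndexE for every filtering method."""
    q_min, q_max = min(quality.values()), max(quality.values())
    e_min, e_max = min(energy.values()), max(energy.values())
    return {m: index_q(quality[m], q_min, q_max) + index_e(energy[m], e_min, e_max)
            for m in quality}


# Made-up inputs for illustration only.
quality = {"P.S.": 0.5, "B.F.": 0.75, "B.M.": 0.9375, "T.M.": 1.0}
energy = {"P.S.": 1.0, "B.F.": 2.0, "B.M.": 2.5, "T.M.": 5.0}
scores = combined_index(quality, energy)
best = max(scores, key=scores.get)
```

Since both indices are normalized to the range 0 to 1, quality and energy carry equal weight in the sum; a method scores well only if it is near the top in one without being near the bottom in the other.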
Energy-efficient texture cache (12/12)
Simulation results
[Figure: Energy comparison while changing cache size (1K, 2K, 4K, 8K) for Tiny, Stealth and Alien with 8 and 16 entries per line; normalized energy, 2-way, bilinear MIPMAP.]
  4KB texture cache with 16 B line size (2 B per texel): energy-efficient, low cost, high quality
Conclusion
For performance-energy co-optimization in mobile 3D graphics
  MobileGL / cache architecture
MobileGL: mobile 3D graphics library
  67K polygons/sec
  66.1% performance improvement on average
Energy-efficient CPU cache
  2-way set associative cache to save energy
Energy-efficient texture cache
  Proposed texture map representation
  Bilinear MIPMAP shows a good quality-to-energy ratio
  16 B line size with a 4KB cache is the optimal point
Supplemental Materials
Graphics pipeline
Geometry stage
  View transform, projection, clipping, 1/w, screen mapping
Rendering stage
  Triangle setup: for Line1, Line2 and Line3, using the 1st, 2nd and 3rd vertices
  Horizontal setup: for Line_x, using start and end
  Pixel interpolation: shading and texturing of each pixel
[Figure: Geometry stage shown in the x-z plane (camera position, camera direction, view frustum, unit cube). Rendering stage shown as a triangle with vertices 1st, 2nd, 3rd (Top, Mid, Bot), edges Line1 to Line3, a scanline Line_x from start to end, Direction_x and Direction_y.]
Energy portion of cache
  ARM920T and M*CORE: caches consume 50% of total processor system power (Segars 01, Lee et al. 99)
[Figure: Power breakdown chart annotated ">50%".]
Blocked representation
Conventional texture map representation (16X16 blocked)
  Conflict misses at block change
    A path: 2
    B path: 16
    C path: 16
Assumptions
  1. Bilinear filtering
  2. Cache size = block size
  3. 16 entries per line
[Figure: Conventional block texture map. A 16X16 block with texels numbered in raster order (first row 0 to 15, second row 16 to 31, first column 0, 16, 32, ... 240, last texel 255), with paths A, B and C crossing into neighboring blocks.]
  Block: a square region in which texels are ordered consecutively
Proposed texture map representation
Recursive Sub Block texture map representation (RSB4X4)
  Conflict misses at block change
    A path: 4
    B path: 4
    C path: 4
Assumptions
  1. Bilinear filtering
  2. Cache size = block size
  3. 16 entries per line
[Figure: Texel map under the recursive sub-block 4X4 method, with paths A, B and C crossing into neighboring blocks.
  First row: 0 1 2 3 16 17 18 19 64 65 66 67 80 81 82 83
  Second row: 4 5 6 7 20 21 22 23 68 69 70 71 84 85 86 87
  First column: 0 4 8 12 32 36 40 44 128 132 136 140 160 164 168 172
  Last texel: 255]
Simulation between representation methods
Simulation results between texture representations
  Bilinear filtering, 2-way, 1KB texture cache
  27% performance improvement on average
  A low miss rate does not mean high performance
  8 entries/line or 16 entries/line shows good performance
[Figure: Normalized performance and miss rate for Tiny, Stealth and Alien at 4, 8, 16 and 32 entries per line, comparing 4X4, 8X8 and 16X16 blocked representations with RSB2X2.]
Simulation between representation methods
With bilinear filtering
  A low miss rate does not mean low energy consumption
  RSB achieves 17.4% energy saving compared with the best of the conventional methods
With point sampling
  25% performance improvement on average
[Figure: Two charts for Tiny, Stealth and Alien at 4, 8, 16 and 32 entries per line, comparing 4X4, 8X8 and 16X16 blocked representations with RSB2X2. Normalized energy: bilinear filtering, 1KB cache, 2-way, annotated with 17.4% energy saving. Normalized performance: point sampling, 1KB cache, 2-way.]
Morton order
Multi-ported cache
  To access more than 1 texel in the same cycle
Interleaving the cache lines across multiple banks
  Morton order
  RSB4X4: not free from bank conflict
  RSB2X2
Trilinear filtering: not free from bank conflict
  Cache for even and odd LOD
[Figure: Texel numbering for Morton order and for a 4X4 block map, each shown with the 2 LSBs of the texel addresses and data banks D0 to D7. Panels are annotated "Bank conflict free" or "Not free from bank conflict".]
Proposed texture map representation
RSB2X2 map representation
  Bank conflict free
[Figure: RSB 4X4 compared with RSB 2X2. Texel map under the recursive sub-block 2X2 method, with paths A, B and C and their conflict misses at bank change.
  First row: 0 1 4 5 16 17 20 21 64 65 68 69 80 81 84 85
  Second row: 2 3 6 7 18 19 22 23 66 67 70 71 82 83 86 87
  First column: 0 2 8 10 32 34 40 42 128 130 136 138 160 162 168 170
  Last texel: 255]
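The numbering in the 16X16 RSB 4X4 and RSB 2X2 maps shown in these supplemental slides can be reproduced by the sketch below: texels are stored in raster order inside a small sub-block, and the sub-blocks themselves are ordered by interleaving the remaining coordinate bits. This is an observation about the printed maps, not the thesis's address conversion hardware, and the function names are hypothetical.

```python
def interleave(x: int, y: int, bits: int) -> int:
    """Z-order index: bit i of x goes to bit 2i, bit i of y to bit 2i+1."""
    out = 0
    for i in range(bits):
        out |= ((x >> i) & 1) << (2 * i)
        out |= ((y >> i) & 1) << (2 * i + 1)
    return out


def rsb_index(x: int, y: int, sub_bits: int, coord_bits: int = 4) -> int:
    """Texel index in a square map of side 2**coord_bits.

    sub_bits = 1 gives 2x2 sub-blocks, sub_bits = 2 gives 4x4 sub-blocks.
    """
    mask = (1 << sub_bits) - 1
    inner = ((y & mask) << sub_bits) | (x & mask)          # raster order inside a sub-block
    outer = interleave(x >> sub_bits, y >> sub_bits, coord_bits - sub_bits)
    return (outer << (2 * sub_bits)) | inner


row0_2x2 = [rsb_index(x, 0, 1) for x in range(16)]
row0_4x4 = [rsb_index(x, 0, 2) for x in range(16)]
col0_4x4 = [rsb_index(0, y, 2) for y in range(16)]
```

With 2x2 sub-blocks the result is a plain bit interleave of x and y, so the address conversion amounts to rerouting address bits.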