performance and power analysis of globally asynchronous locally synchronous multiprocessor systems...
TRANSCRIPT
Performance and Power Analysis of Globally Asynchronous Locally
Synchronous Multiprocessor Systems
Zhiyi Yu, Bevan M. Baas
VLSI Computation Lab,
ECE department, UC Davis
Outline
• Motivation• Experimental platform: a GALS array processor• Performance analysis of GALS multiprocessors• Power analysis of GALS multiprocessors
Why GALS Chip Multiprocessor
• Why GALS clocking style– The challenge of globally synchronous systems– The challenge of totally asynchronous systems– GALS is a good compromise
• Why chip multiprocessor– The challenge of increasing clock frequency– High performance and high energy efficiency of
multiprocessor system
clk1
clk2module
1sync.circuit
module2
clk1 clk2
Totalsync. delay
te
ts
GALS Effects
• Performance penalty due to additional synchronization delay
• High energy efficiency due to independent clock/voltage scaling
A simple GALS system Sync. delay = clk edge alignment (te)
+ sync. circuit (ts)
Synchronous vs. GALS comparisons
Uniprocessor
Array processor
SynchronousGALS
(multiple clock domains)
ALU
MULT
MEM
This work
Previous work
Outline
• Motivation
• Experimental platform: a GALS array processor
• Performance analysis of GALS multiprocessor• Power analysis of GALS multiprocessor
Our Implementation of a GALS Array and Synchronous Array
• Contains multiple identical simple processors• Local oscillator and dual-clock FIFOs are key
components for GALS style
IMEM
ALUMAC
Control
DMEM
FIFO0
FIFO1
IMEM
ALUMAC
Control
DMEM
osc.
dualclk-FIFO0
dualclk-FIFO1
enhanced dual-clock FIFO
3x3 arrayprocessor
Single processor in asynchronous array
Single processor in aGALS array
config config
Micrograph of the 6 x 6 GALS Array
Technology: 0.18 µm CMOS
Max speed: 475 MHz
Area (1 Proc): 0.66 mm²
Fully functional
SingleProcessor
5.65 mm
5.68 mm
[Yu, ISSCC06]
Outline
• Motivation• Experimental platform: a GALS array processor
• Performance analysis of GALS multiprocessor
• Power analysis of GALS multiprocessor
Performance Penalty of GALS Uni-processor
• Extra pipeline hazards result in ~10% throughput penalty compared with synchronous uni-processor
[Lyer, ISCA02; Semeraro, HPCA02;Talpes, ISLPED03]
Branch penalty of 3 cycles in a synchronous uni-processor
Branch penalty of 3 cycles and 4 SYNC delays in a GALS uni-processor
IF ID EXE MEM WB
IF ID EXE
Branch inst
Branch inst IF ID EXE MEM WBSYNC SYNC SYNC SYNC
IF
clk4 clk5clk3clk2clk1
MEM WBbranch penalty
branch penalty
Application Performance of GALS and Synchronous Array
• GALS array has only ~1% performance penalty• Simulation conditions: 32-word FIFO, same clock
frequency, 2 synchronization registers for GALS
8-pt DCT
8×8 DCT
Zig-zag
merg sort
bub. sort
ma-trix
64 FFT
JPEG 802.
11a/g
Sync. array 41 498 168 254 444 817.5 11439 1439 87857
GALS array 41 505 168 254 444 819 11710 1443 88989
GALS perf. reduction(%
)
0 1.4 0 0 0 0.1 2.3 0.3 1.3
Clock cycles for several applications
The Source of Performance Penalty of GALS Array Processor
• For all systems, communication delay affects system throughput only when it generates a loop
• For GALS array processor, communication loop is the FIFO stall loop– Performance simulation results show that the chance
of FIFO stall loop is low for many DSP applications
• FIFO depth affects FIFO stalls and hence a reasonable FIFO size is required
Importance of the Communication Loop Delay
• One way communication does not affect system throughput
• Communication loop degrades throughput– In uni-processor, it is caused by pipeline hazards – GALS system has longer communication loop delay
unit1(T1)
unit2(T3)
comm.(T2)
One way communication:The throughput is 1/Max (T1, T2, T3)
unit1(T1)
unit2(T3)
comm.(T2)
Communication loop:The throughput is 1/(T1 + T2 + T3 + T4)
comm.(T4)
Examples of Stall and Stall Loops in a GALS Array Processor
Proc.1
Proc.2
Data producer proc. 1 too slow causesfrequent FIFO empty stalls
Proc.1
Proc.2
Data consumer proc. 2 too slow causesfrequent FIFO full stalls
Proc.1
Proc.2
Data producer proc. 1 and data consumerproc. 2 both too slow at different times
cause FIFO empty and full stalls
Proc.1
Example of multiple-link loop betweentwo processors
Proc.2
FIFO
FIFO
emptystall
fullstall
empty stall
full stall
emptystall
fullstall
full stall
empty stall
FIFO
FIFO
FIFO
Performance of Synchronous and GALS Array with Different FIFO Sizes
Synchronous Array
GALS vs. Synchronous Array
GALS Array
8 DCT zig-zag bsort 64 FFT 802.11 8x8 DCT msort matrix JPEG
Rel
ativ
e P
erfo
rman
ce
Per
form
ance
R
atio
Rel
ativ
e P
erfo
rman
ce
Outline
• Motivation• Experimental platform: a GALS array processor• Performance analysis of GALS multiprocessor
• Power analysis of GALS multiprocessor
Clock/Voltage Scaling in GALS Uni-processor
• All GALS designs– Independently control clock frequencies to save power– Reduced clock frequency allows voltage reduction
• Around 25% energy savings with more than 10% performance penalty
[Lyer, ISCA02; Semeraro, HPCA02;Talpes, ISLPED03]
IF ID EXE MEM WBsynccirc.
synccirc.
synccirc.
synccirc.
clk1
task1program with
few MEMinstructions
task2
program withfew WB
instructions
clk4 can bereduced
clk5 can bereduced
clk2 clk3 clk4 clk5
Clock/Voltage Scaling in GALS Array Processor
• Similar basic idea as uni-processor– Use low clock frequency for processors with light
computation load– Benefit from unbalanced processor computation loads
• Both static and dynamic clock scaling methods– We study only static scaling here
• The optimal processor clock frequency is determined by its – Computational load– Position
• Can achieve power savings without performance reduction!
Unbalanced Processor Computation Loads in Nine Applications
8-pt DCT 8x8 DCT zig-zag
merge-sort bubble-sort matrix
64 FFT JPEG 802.11a/g
Throughput Changes with Statically Configured Clock for 8x8 DCT
0.6
0.65
0.7
0.75
0.8
0.85
0.9
0.95
1
0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1
scale 1st processor
scale 2nd processor
scale 3rd processor
scale 4th processor
optimal clockscaling points
Relative clock frequency
Rel
ativ
e T
hro
ug
hp
ut
Compute load(clk cycles)
408 204 408 204
Relationship of Processors in an 8x8 DCT
• Proc. 2 and 4 have identical computational load• Different position results in a different FIFO stall
style, which causes different clock scaling behavior
Trans.
FIFO
1-DCT
FIFO
1-DCT
FIFO
Trans.
FIFO
FIFO empty stall of 2nd proc.
FIFO full stall of 2nd proc.
FIFO empty stall of 4th proc.
Power of GALS Array with Static Clock/Voltage Scaling
Rel
ativ
e P
ow
er
8 DCT zig-zag JPEG msort matrix 8x8 DCT 64 FFT 802.11 bsort
• 40% power savings without a performance penalty
• From simulated clock frequency and referenced clk/voltage/pow relationship
Summary
• Compared to a synchronous array processor, the proposed GALS array processor has: – < 1% throughput reduction– ~40% energy savings
• These results compare well with reported GALS uni-processors:– ~10% throughput reduction– ~25% energy savings
• Source of throughput reduction in GALS system– Extra cost for communication loops– Extra cost for FIFO stall loops in GALS array processors
• Energy benefit of GALS clock/voltage scaling– Unbalanced processor computation loads
Acknowledgments
• Funding– Intel Corporation– UC Micro– NSF Grant No. 0430090– UCD Faculty Research Grant
• Special Thanks– E. Work, T. Mohsenin, other VCL processor co-
designers, R. Krishnamurthy, M. Anders, S. Mathew