![Page 1: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/1.jpg)
Performance and Power Analysis of Globally Asynchronous Locally
Synchronous Multiprocessor Systems
Zhiyi Yu, Bevan M. Baas
VLSI Computation Lab,
ECE department, UC Davis
![Page 2: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/2.jpg)
Outline
• Motivation• Experimental platform: a GALS array processor• Performance analysis of GALS multiprocessors• Power analysis of GALS multiprocessors
![Page 3: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/3.jpg)
Why GALS Chip Multiprocessor
• Why GALS clocking style– The challenge of globally synchronous systems– The challenge of totally asynchronous systems– GALS is a good compromise
• Why chip multiprocessor– The challenge of increasing clock frequency– High performance and high energy efficiency of
multiprocessor system
![Page 4: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/4.jpg)
clk1
clk2module
1sync.circuit
module2
clk1 clk2
Totalsync. delay
te
ts
GALS Effects
• Performance penalty due to additional synchronization delay
• High energy efficiency due to independent clock/voltage scaling
A simple GALS system Sync. delay = clk edge alignment (te)
+ sync. circuit (ts)
![Page 5: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/5.jpg)
Synchronous vs. GALS comparisons
Uniprocessor
Array processor
SynchronousGALS
(multiple clock domains)
ALU
MULT
MEM
This work
Previous work
![Page 6: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/6.jpg)
Outline
• Motivation
• Experimental platform: a GALS array processor
• Performance analysis of GALS multiprocessor• Power analysis of GALS multiprocessor
![Page 7: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/7.jpg)
Our Implementation of a GALS Array and Synchronous Array
• Contains multiple identical simple processors• Local oscillator and dual-clock FIFOs are key
components for GALS style
IMEM
ALUMAC
Control
DMEM
FIFO0
FIFO1
IMEM
ALUMAC
Control
DMEM
osc.
dualclk-FIFO0
dualclk-FIFO1
enhanced dual-clock FIFO
3x3 arrayprocessor
Single processor in asynchronous array
Single processor in aGALS array
config config
![Page 8: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/8.jpg)
Micrograph of the 6 x 6 GALS Array
Technology: 0.18 µm CMOS
Max speed: 475 MHz
Area (1 Proc): 0.66 mm²
Fully functional
SingleProcessor
5.65 mm
5.68 mm
[Yu, ISSCC06]
![Page 9: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/9.jpg)
Outline
• Motivation• Experimental platform: a GALS array processor
• Performance analysis of GALS multiprocessor
• Power analysis of GALS multiprocessor
![Page 10: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/10.jpg)
Performance Penalty of GALS Uni-processor
• Extra pipeline hazards result in ~10% throughput penalty compared with synchronous uni-processor
[Lyer, ISCA02; Semeraro, HPCA02;Talpes, ISLPED03]
Branch penalty of 3 cycles in a synchronous uni-processor
Branch penalty of 3 cycles and 4 SYNC delays in a GALS uni-processor
IF ID EXE MEM WB
IF ID EXE
Branch inst
Branch inst IF ID EXE MEM WBSYNC SYNC SYNC SYNC
IF
clk4 clk5clk3clk2clk1
MEM WBbranch penalty
branch penalty
![Page 11: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/11.jpg)
Application Performance of GALS and Synchronous Array
• GALS array has only ~1% performance penalty• Simulation conditions: 32-word FIFO, same clock
frequency, 2 synchronization registers for GALS
8-pt DCT
8×8 DCT
Zig-zag
merg sort
bub. sort
ma-trix
64 FFT
JPEG 802.
11a/g
Sync. array 41 498 168 254 444 817.5 11439 1439 87857
GALS array 41 505 168 254 444 819 11710 1443 88989
GALS perf. reduction(%
)
0 1.4 0 0 0 0.1 2.3 0.3 1.3
Clock cycles for several applications
![Page 12: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/12.jpg)
The Source of Performance Penalty of GALS Array Processor
• For all systems, communication delay affects system throughput only when it generates a loop
• For GALS array processor, communication loop is the FIFO stall loop– Performance simulation results show that the chance
of FIFO stall loop is low for many DSP applications
• FIFO depth affects FIFO stalls and hence a reasonable FIFO size is required
![Page 13: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/13.jpg)
Importance of the Communication Loop Delay
• One way communication does not affect system throughput
• Communication loop degrades throughput– In uni-processor, it is caused by pipeline hazards – GALS system has longer communication loop delay
unit1(T1)
unit2(T3)
comm.(T2)
One way communication:The throughput is 1/Max (T1, T2, T3)
unit1(T1)
unit2(T3)
comm.(T2)
Communication loop:The throughput is 1/(T1 + T2 + T3 + T4)
comm.(T4)
![Page 14: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/14.jpg)
Examples of Stall and Stall Loops in a GALS Array Processor
Proc.1
Proc.2
Data producer proc. 1 too slow causesfrequent FIFO empty stalls
Proc.1
Proc.2
Data consumer proc. 2 too slow causesfrequent FIFO full stalls
Proc.1
Proc.2
Data producer proc. 1 and data consumerproc. 2 both too slow at different times
cause FIFO empty and full stalls
Proc.1
Example of multiple-link loop betweentwo processors
Proc.2
FIFO
FIFO
emptystall
fullstall
empty stall
full stall
emptystall
fullstall
full stall
empty stall
FIFO
FIFO
FIFO
![Page 15: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/15.jpg)
Performance of Synchronous and GALS Array with Different FIFO Sizes
Synchronous Array
GALS vs. Synchronous Array
GALS Array
8 DCT zig-zag bsort 64 FFT 802.11 8x8 DCT msort matrix JPEG
Rel
ativ
e P
erfo
rman
ce
Per
form
ance
R
atio
Rel
ativ
e P
erfo
rman
ce
![Page 16: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/16.jpg)
Outline
• Motivation• Experimental platform: a GALS array processor• Performance analysis of GALS multiprocessor
• Power analysis of GALS multiprocessor
![Page 17: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/17.jpg)
Clock/Voltage Scaling in GALS Uni-processor
• All GALS designs– Independently control clock frequencies to save power– Reduced clock frequency allows voltage reduction
• Around 25% energy savings with more than 10% performance penalty
[Lyer, ISCA02; Semeraro, HPCA02;Talpes, ISLPED03]
IF ID EXE MEM WBsynccirc.
synccirc.
synccirc.
synccirc.
clk1
task1program with
few MEMinstructions
task2
program withfew WB
instructions
clk4 can bereduced
clk5 can bereduced
clk2 clk3 clk4 clk5
![Page 18: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/18.jpg)
Clock/Voltage Scaling in GALS Array Processor
• Similar basic idea as uni-processor– Use low clock frequency for processors with light
computation load– Benefit from unbalanced processor computation loads
• Both static and dynamic clock scaling methods– We study only static scaling here
• The optimal processor clock frequency is determined by its – Computational load– Position
• Can achieve power savings without performance reduction!
![Page 19: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/19.jpg)
Unbalanced Processor Computation Loads in Nine Applications
8-pt DCT 8x8 DCT zig-zag
merge-sort bubble-sort matrix
64 FFT JPEG 802.11a/g
![Page 20: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/20.jpg)
Throughput Changes with Statically Configured Clock for 8x8 DCT
0.6
0.65
0.7
0.75
0.8
0.85
0.9
0.95
1
0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1
scale 1st processor
scale 2nd processor
scale 3rd processor
scale 4th processor
optimal clockscaling points
Relative clock frequency
Rel
ativ
e T
hro
ug
hp
ut
Compute load(clk cycles)
408 204 408 204
![Page 21: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/21.jpg)
Relationship of Processors in an 8x8 DCT
• Proc. 2 and 4 have identical computational load• Different position results in a different FIFO stall
style, which causes different clock scaling behavior
Trans.
FIFO
1-DCT
FIFO
1-DCT
FIFO
Trans.
FIFO
FIFO empty stall of 2nd proc.
FIFO full stall of 2nd proc.
FIFO empty stall of 4th proc.
![Page 22: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/22.jpg)
Power of GALS Array with Static Clock/Voltage Scaling
Rel
ativ
e P
ow
er
8 DCT zig-zag JPEG msort matrix 8x8 DCT 64 FFT 802.11 bsort
• 40% power savings without a performance penalty
• From simulated clock frequency and referenced clk/voltage/pow relationship
![Page 23: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/23.jpg)
Summary
• Compared to a synchronous array processor, the proposed GALS array processor has: – < 1% throughput reduction– ~40% energy savings
• These results compare well with reported GALS uni-processors:– ~10% throughput reduction– ~25% energy savings
• Source of throughput reduction in GALS system– Extra cost for communication loops– Extra cost for FIFO stall loops in GALS array processors
• Energy benefit of GALS clock/voltage scaling– Unbalanced processor computation loads
![Page 24: Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,](https://reader035.vdocuments.us/reader035/viewer/2022062409/56649f305503460f94c4b8a9/html5/thumbnails/24.jpg)
Acknowledgments
• Funding– Intel Corporation– UC Micro– NSF Grant No. 0430090– UCD Faculty Research Grant
• Special Thanks– E. Work, T. Mohsenin, other VCL processor co-
designers, R. Krishnamurthy, M. Anders, S. Mathew