1Kurt Keutzer
Lecture 10b: Implementing DSP Functionality:
Alternatives
Prepared by: Professor Kurt Keutzer
Computer Science 252, Spring 2000
With contributions from:
Prof. Heinrich Meyr, University of Aachen
Philip Chong, David Chinnery, Rhett Davis, Paul Husted,
Niraj Shah, Chris Taylor, Scott Weber, Ning Zhang
2Kurt Keutzer
System Implementation Choices
[Figure: system functionality can be mapped onto an off-the-shelf µP/DSP, an embedded-core µP/DSP (DSP core with program ROM, coefficient ROM, and control), an application-specific µP (ASIP core with program ROM, coefficient ROM, and control), or an ASIC.]
3Kurt Keutzer
Making a Successful Comparison - 1
• Find an interesting application kernel
  • Viterbi decoding for speech processing (not a full modem!)
• Find realistic constraints native to the application
  • n=2, K=7, QPSK, 100 kb/s, BER = 10^-4
• Find architectures/implementations that are promising for the application
  • TI TMS320C54, Tensilica Xtensa
  • What are the relevant features of this architecture that support this application?
• Fix application constraints across all implementations (above)
• Fix key parameters for implementation comparison
  • performance (constraint)
  • area
  • power
4Kurt Keutzer
Making a Successful Comparison - 2
• Identify how key parameters will be measured
  • performance: instruction set simulator, eval board
  • area: data sheets, gate estimates
  • power: eval board, TI application note
• Implement your application kernel
  • Examine different algorithms
  • Start with code downloaded from the web: multimedia benchmarks etc.
  • Build your software development/evaluation environment: http://www.ti.com/sc/docs/tools/dsp/6ccsfreetool.htm
5Kurt Keutzer
Making a Successful Comparison - 3
• Implement your application kernel (cont)
  • Phase 0: Research
    • Find application notes, research reports for your own or comparable architectures
  • Phase 1: Estimation
    • Develop a quick estimate based on initial code
    • Integrate research findings
    • Do a quick back-of-envelope reality check
  • Phase 2: Real implementation/Tuning
    • Tailor algorithm, implementation to architecture
    • Do your very best!
    • Have a contest with your partner
  • Phase 3: Evaluation
    • Apply evaluation tools to key parameters
    • Evaluate and compare results, then return to Phase 2
• If your life depended on choosing the right part, what would you do?
6Kurt Keutzer
Making a Successful Comparison - 4
• Final evaluation and comparison: compare all implementations
• To evaluate for a product, everything is fair game
• To evaluate principally the architectures, need to consider:
  • fab differences: TSMC vs. IBM (10-20% faster)
  • process differences: .35 micron vs. .25 (50% faster)
  • power supply differences: 3.0V vs. 1.5V
  • ASIC vs. custom implementations (2x faster)
• Now evaluate: if I were the architect of this processor/implementor of this system on a chip, what would I do differently?
  • cache sizes
  • register availability
  • additional instructions
  • on-chip memory
7Kurt Keutzer
Making a Successful Comparison - 5
• Just for fun …
• In addition to primary constraints (speed, cost, power), final real-world considerations:
  • business relationships (joint partnership with Lucent)
  • time-to-market issues
    • time to configure?
    • software development environment
    • library/application software support
    • application engineering support
8Kurt Keutzer
Viterbi Algorithm
Prof. Heinrich Meyr
University of Aachen
9Kurt Keutzer
Viterbi Decoders in digital communication systems
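[Figure: transmission chain. Signal Source → Source Coder → Convolutional or Trellis Coder & Mapper → Modulator → Channel → Demodulator → Viterbi Decoder → Source Decoder → Signal Sink. Information bits enter the coder, channel symbols c_k leave the mapper, received symbols y_k enter the Viterbi decoder, and decoded bits leave it.]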
10Kurt Keutzer
Convolutional Coder and Trellis diagram
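[Figure: three panels labeled CONVOLUTIONAL CODER, CHANNEL, and VITERBI DECODER. Coder: information bits u_k pass through two z^-1 delay elements (u_k-1, u_k-2); modulo-2 additions form the code symbols b_k, which a BPSK mapper turns into channel symbols c_k. Channel: additive white noise n_k gives the received symbols y_k. Decoder: a four-state trellis (states s_0,k ... s_3,k) runs from a known start state at time 0 to a known end state at time T; the decisions are stored in the survivor memory, which outputs the decoded bits.]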
11Kurt Keutzer
ACS recursion for M = 2
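[Equation, reconstructed from the figure residue. The Greek symbols did not survive extraction; \gamma (state metric) and \lambda (branch metric) are assumed names.]
\gamma_{i,k} = \max\{ \gamma_{Z(0,i),k-1} + \lambda^{(0,i)}_k ,\ \gamma_{Z(1,i),k-1} + \lambda^{(1,i)}_k \}
[Figure: the two paths entering state i at time k come from the predecessor states Z(0,i) and Z(1,i). The larger sum defines the survivor path, the other the competing path. In the example shown the decision is d_{i,k} = 1.]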
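The add-compare-select step on this slide can be sketched in a few lines. This is an illustrative sketch only: the function name, the argument layout, and the tie-breaking rule are assumptions, not taken from the lecture.

```python
def acs_step(prev_metrics, branch_metrics, predecessors):
    """One add-compare-select update for M = 2 branches per state.

    prev_metrics[j]      : state metric of state j at time k-1
    branch_metrics[i][m] : metric of branch m entering state i at time k
    predecessors[i]      : (Z(0,i), Z(1,i)), the two predecessors of state i
    Returns (new_metrics, decisions) with decisions[i] in {0, 1}.
    """
    new_metrics, decisions = [], []
    for i, (z0, z1) in enumerate(predecessors):
        cand0 = prev_metrics[z0] + branch_metrics[i][0]   # add
        cand1 = prev_metrics[z1] + branch_metrics[i][1]
        d = 1 if cand1 > cand0 else 0                     # compare (ties pick branch 0)
        new_metrics.append(cand1 if d else cand0)         # select the larger sum
        decisions.append(d)
    return new_metrics, decisions

# Tiny 2-state example with made-up numbers.
preds = [(0, 1), (0, 1)]
metrics, dec = acs_step([5, 3], [[1, 4], [2, 0]], preds)
```

The decision bits are what the survivor memory later uses; the metrics feed the next time step.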
12Kurt Keutzer
Viterbi Decoder block diagram
[Figure: channel symbols y_k → TMU → branch metrics → ACSU → decision bits → SMU → decoded bits u_k. The state metrics of the ACSU are fed back through a latch.]
13Kurt Keutzer
Characteristic of a 2-bit step-at-zero quantizer
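[Figure: quantizer characteristic. Output levels Q = -2, -1, 0, 1 versus normalized input level (ticks at -2, -1, 1, 2), with saturation at both ends. A second axis labeled "Interpretation" carries the ticks -2, -1, 1, 2.]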
14Kurt Keutzer
Architecture
15Kurt Keutzer
Node parallel ACS architecture
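[Figure: node-parallel ACS architecture. The TMU supplies the branch metrics to N parallel ACS units (0 ... N-1). The state metrics for states 0 ... N-1 at time k are held in a register and routed back to the ACS inputs through a shuffle-exchange network. The decisions dec(i,k) go to the SMU.]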
16Kurt Keutzer
Alternative Implementations
[Figure: butterfly-based implementations, contrasting dedicated ACS units with shared ACS units backed by memory blocks (M).]
17Kurt Keutzer
Butterfly trellis structure and resource sharing for the K = 3, rate 1/2 code
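[Figure: butterfly trellis between the states 0 ... 3 at time k and at time k+1, next to a resource-shared implementation in which ACS units read the old state metrics from, and write the new state metrics to, a path metric memory through multiplexers (MUX).]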
18Kurt Keutzer
Survivor Memory Unit
19Kurt Keutzer
REA hardware architecture
[Figure: REA survivor memory hardware for a four-state trellis. A grid of processing elements (PE) is steered by the decisions d_0,k ... d_3,k for the states s_0,k ... s_3,k. Each stage passes on the survivor estimates û[0] ... û[3], from index k-D+1 to index k-D.]
20Kurt Keutzer
[Figure: acquisition of the final survivor, followed by decoding. Decoded sequence: 0 0 ... 0 1 0. The survivor û[0] is shown at the indices k, k-D, and k-(D+M-1).]
21Kurt Keutzer
Viterbi Project Constraints
• uncoded word length = 1
• coded word length (n) = 2; this means that it is rate 1/2
• constraint length (K, aka L) = 7; this means that the number of states in the trellis is 2^(K-1), or 64 states
• branch metric calculation is QPSK
• soft decision wordlength (q) = 6
• chain-backing depth (D) = 96
• generator polynomials: p0 = 171, p1 = 133 (octal); this means that p0 = 1111001, p1 = 1011011
• data rate 100 kb/s
• goal: bit error rate (BER) = 10^-4
• signal to noise ratio (SNR)
• degradation 0.05dB
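As a quick check of these constraints, the encoder side can be sketched as follows. The bit ordering (newest input bit in the most significant position of the K-bit window) and the function names are assumptions made for illustration; the project code is not shown in the slides.

```python
K = 7                        # constraint length from the slide
G = (0o171, 0o133)           # generator polynomials p0, p1 (octal)
NUM_STATES = 1 << (K - 1)    # 2^(K-1) trellis states

def parity(x):
    """Modulo-2 sum of the bits of x."""
    return bin(x).count("1") & 1

def encode(bits):
    """Rate-1/2 convolutional encoder: one (c0, c1) pair per input bit."""
    state = 0                # K-1 previous input bits, known start state 0
    out = []
    for u in bits:
        window = (u << (K - 1)) | state     # newest bit is the MSB (assumed)
        out.append(tuple(parity(window & g) for g in G))
        state = window >> 1                 # shift the register by one bit
    return out

binary_p0 = format(G[0], "b")   # '1111001'
binary_p1 = format(G[1], "b")   # '1011011'
```

Each input bit yields n = 2 code bits, which is the rate 1/2 stated above.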
22Kurt Keutzer
Viterbi Decoder Implementation on an ARM
EE 290S Final Project
May 4, 1999
Phillip Chong
23Kurt Keutzer
ARM Overview
32-bit RISC microprocessor
Five stage pipeline
Features fast ALU operations (barrel shifter)
Scalar integer unit, no FPU
24Kurt Keutzer
Algorithm Tweaking
Performing the metric computation through table lookup (load = 1 delay slot) is faster than using ALU (multiplication = up to 3 delay slots)
Parity computation (Viterbi code) can also be done through table lookup
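The lookup idea for the parity computation can be illustrated like this. The 256-entry table is a hypothetical layout chosen for the sketch; the slides do not describe how the project's lookup tables were organized.

```python
# Precompute the parity of every 8-bit value once (hypothetical table layout).
PARITY = [bin(v).count("1") & 1 for v in range(256)]

def code_bit_lookup(window, poly):
    """Code bit for a 7-bit encoder window: one AND plus one table load."""
    return PARITY[window & poly & 0xFF]

def code_bit_alu(window, poly):
    """Same result using shifts and XORs only, for comparison."""
    x = window & poly
    x ^= x >> 4
    x ^= x >> 2
    x ^= x >> 1
    return x & 1
```

The ALU version needs several dependent operations per code bit, while the lookup needs a single load, which is the trade-off the slide refers to.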
25Kurt Keutzer
Reducing Memory Footprint
Cache misses can be very costly due to pipeline stalls
We are willing to give up some algorithmic efficiency to eliminate cache misses
To minimize the memory footprint, we pack 32 bits of traceback into a single word; we can easily unpack this data using the barrel shifter (a 1-cycle operation)
For 128 level traceback, memory requirements are 512 bytes (metrics table) + 1024 bytes (traceback) + 768 bytes (parity lookup tables) = 2304 bytes
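The packing scheme and the traceback part of the memory budget can be sketched as below. The word layout (state s stored in bit s mod 32 of word s div 32) is an assumption; only the 32-bits-per-word packing and the byte counts come from the slide.

```python
STATES = 64       # 2^(K-1) for K = 7
DEPTH = 128       # traceback levels quoted on the slide
WORD_BITS = 32

def pack_decisions(decisions):
    """Pack one trellis stage of per-state decision bits into 32-bit words."""
    words = [0] * (len(decisions) // WORD_BITS)
    for state, d in enumerate(decisions):
        words[state // WORD_BITS] |= (d & 1) << (state % WORD_BITS)
    return words

def unpack_decision(words, state):
    """Recover one decision bit with a shift and a mask."""
    return (words[state // WORD_BITS] >> (state % WORD_BITS)) & 1

traceback_bytes = DEPTH * STATES // 8        # one bit per state per level
total_bytes = 512 + traceback_bytes + 768    # metrics + traceback + parity tables
```

One bit per state per level gives 128 x 64 bits, which is the 1024 bytes quoted for traceback.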
26Kurt Keutzer
Simulation Results
Simulated decoding of 4096 bits on a 125 MHz 3.3V model
Execution requires 11.72M ARM instruction cycles, giving 44 kb/s data rate
Power consumption was estimated at 52.47 mW
Scaling simulation results up to 275 MHz 2.0V ARM (fastest commercially available) gives 96 kb/s at 42.40 mW
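The figures on this slide can be reproduced with a short calculation. The assumption that power scales as f * V^2 is ours; the slide does not state the scaling rule, but it matches the quoted numbers.

```python
bits = 4096
cycles = 11.72e6
f_sim, v_sim, p_sim = 125e6, 3.3, 52.47   # simulated model: Hz, V, mW
f_max, v_max = 275e6, 2.0                 # fastest commercial part: Hz, V

rate_sim = bits / (cycles / f_sim)        # bits per second at 125 MHz
rate_max = rate_sim * f_max / f_sim       # throughput scales with the clock
# Assumption: dynamic power scales with frequency and with voltage squared.
p_max = p_sim * (f_max / f_sim) * (v_max / v_sim) ** 2

print(f"{rate_sim / 1e3:.0f} kb/s, {rate_max / 1e3:.0f} kb/s, {p_max:.2f} mW")
```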
27Kurt Keutzer
Summary
Clock speed: 275 MHz
Execution Performance: 96kb/s
Power Dissipation: 42.40 mW (5.68 mW/mm^2)
Area: 7.47 mm^2 in 0.25 µm
Design Effort: 4 days
Portability very high: code is ANSI C; architecture-dependent tweaks may need reworking
28Kurt Keutzer
Conclusion/Thanks
One-bit quantization gives opportunities for performance improvements, at a huge cost in QOR
Viterbi algorithm would benefit greatly from having hardware parallelism (vector ops) available
Many thanks to Marlene Wan for providing power estimation
29Kurt Keutzer
Viterbi Decoder Implementation on a TI C54x
EE 290S Final Project
May 4, 1999
Paul Husted
30Kurt Keutzer
Introduction
Implemented Viterbi Decoder on a TI TMS320VC5402 DSP
Examine:
• Performance (bits/sec)
• Power (mW/bit)
• Cost ($/unit, area)
• Design effort (engineer-months)
31Kurt Keutzer
Viterbi Decoder Specifications
Implementation Specifications:
• Constraint Length (K, aka L) = 7
• Branch Metric Calculation is QPSK
• Soft Decision Wordlength (q) = 6
• Chain-backing Depth (D) = 96
• Gen. Polynomials: p0 = 171, p1 = 133 (octal)
• Data Rate 100 kb/s
• Goal: Bit Error Rate (BER) = 10^-4
32Kurt Keutzer
C54x Capabilities
Capabilities of all C54x DSP Cores:
• Three 16-bit data buses, one 16-bit program bus
• 40-bit ACC with 40-bit barrel shifter
• Two independent accumulators
• A single-cycle non-pipelined MAC
• Single-instruction repeat and block-repeat
• Six-channel DMA controller
• Arithmetic instructions with parallel store and parallel load
33Kurt Keutzer
Helpful Instructions for the Viterbi Decoder
The C54x Has Specialized Instruction Set
• Dual Add/Subtract in 1 Cycle
• Compare, Select, and Store Unit (CSSU)
  • Compare Branch Metrics
  • Store Larger Value, Store Decision Bit
  • Increment Address Registers in Circular Buffer
  • 1 Cycle
• Allows Butterfly (2 States) in 5 cycles
34Kurt Keutzer
Butterfly Implementation
[Figure: butterfly. Old(2*j) and Old(2*j+1) are combined by DADST followed by CMPS to give New(j), and by DSADT followed by CMPS to give New(j+2^(K-2)). T Register = Local Distance.]
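The butterfly indexing can be sketched in software as follows. The sign pattern (+t/-t for one new state, -t/+t for the other) is the usual antipodal-metric convention and is an assumption here, as are the function and variable names.

```python
K = 7
HALF = 1 << (K - 2)     # 2^(K-2) = 32 butterflies cover all 64 states

def butterfly(old, new, dec, j, t):
    """Update New(j) and New(j + 2^(K-2)) from Old(2j) and Old(2j+1).

    t is the local distance for this butterfly (the T register on the slide).
    """
    a, b = old[2 * j], old[2 * j + 1]
    up0, up1 = a + t, b - t            # dual add/subtract: candidates for New(j)
    lo0, lo1 = a - t, b + t            # dual subtract/add: candidates for New(j + HALF)
    dec[j] = int(up1 > up0)            # compare, select, store decision bit
    new[j] = max(up0, up1)
    dec[j + HALF] = int(lo1 > lo0)
    new[j + HALF] = max(lo0, lo1)

old = list(range(64))
new, dec = [0] * 64, [0] * 64
butterfly(old, new, dec, 0, 5)

# At 5 cycles per butterfly, one decoded bit costs 32 * 5 cycles of ACS work.
acs_cycles_per_bit = HALF * 5
```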
35Kurt Keutzer
TI TMS320VC5402 DSP
Specific Chip Characteristics:
• Operates at 100 MIPS
• Core Voltage of 1.8V
• I/O Pins Operate at 3.3V
• 16K Word x 16 Bits of Dual-Access RAM
• 4K Word x 16 Bits of ROM
• Internal DMA
• Created in 0.18 Micron Technology
36Kurt Keutzer
Dataflow
Data I/O
• Input Values Assumed to be Placed at Specified Memory Location by Internal DMA
• Output Values Assumed to be Removed from Another Memory Location by Internal DMA
• Alternatively, Data Could be Placed in this Memory Location After Other On-Chip Receiver Processing
37Kurt Keutzer
Implementation Analysis
• Viterbi Decoder Code Created in Assembly
• Linked to Processor Specific Memory Map
• Simulated on Cycle-Accurate Simulator
  • Used Correct Memory Model for VC5402
38Kurt Keutzer
Implementation Results
                        Estimated              Actual
Code Size               500 Instructions       1032 (16-bit) Words
Data Size               1280 (16-bit) Words    1280 (16-bit) Words
MIPS (100 Kbps)         18.425                 21.53125
Max. Speed (100 MIPS)   582 Kbps               464.7 Kbps
39Kurt Keutzer
Power Calculation
• Compared with TI Figures:
  • TI uses 1/2 MACs, 1/2 NOPs for Power Figure
  • .25 Micron Estimate is .45 mA/MIPS
• Fully Static Design can be Clocked at Any Rate
• Viterbi Code Uses 1.08 Times More Current than TI Estimate
• At 22 MIPS, 19.25 mW are Consumed in the Core
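The core power figure follows from the numbers on this slide and the 1.8V core voltage given earlier. A minimal sketch, with variable names of our choosing:

```python
ma_per_mips = 0.45      # TI estimate quoted on the slide (mA/MIPS)
current_factor = 1.08   # Viterbi code draws 1.08x TI's MAC/NOP mix
mips = 22               # load at 100 kb/s, as used on the slide
core_voltage = 1.8      # V, from the VC5402 characteristics

core_current_ma = ma_per_mips * mips * current_factor
core_power_mw = core_current_ma * core_voltage

print(f"{core_current_ma:.2f} mA, {core_power_mw:.2f} mW")
```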
40Kurt Keutzer
Area Estimate
• TI Will Not Release Die Sizes
• .25 Micron Chips Fit Inside 3.2 mm x 3.2 mm Area on a 144 pin BGA
• Maximum Die Size is thus 10.24 mm^2
41Kurt Keutzer
Development Cost
• Engineering Time Estimate: 3 days
  • Assumes Engineer Has Experience with Assembly Language and TI Tools
• Tool Cost: $13262.45
  • Includes Emulator, Simulator, Compiler, Assembler, Linker, Debugger
• Cost of Chip: $8.52
42Kurt Keutzer
Conclusion
Optimized Instructions Make Algorithm Efficient
Static Design Allows Clock Rate to be Set As Needed to Reduce Power
Flexibility Exists to Perform Other Processing of Data
Very Little Development Time/Cost
43Kurt Keutzer
ACS TIE Extension with State (ACS)
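[Figure: ACS TIE instruction with state. The 32-bit registers Rs and Rt each carry packed path metrics (pm). Adders combine them with the four packed 8-bit branch metrics bm3 ... bm0, the sums are compared (checking the msb of the difference), and Rr receives the selected path metrics together with the decision bits, under control of the instruction.]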
44Kurt Keutzer
Tensilica Viterbi Implementation
Niraj Shah
Scott Weber
290A Final Presentation
45Kurt Keutzer
Tensilica Flow
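[Figure: Tensilica flow. The uArch Designer configuration and the TIE description feed the Tensilica Processor Generator, which generates the xt-gcc compiler and the xt-run simulator. Application .c files are compiled by xt-gcc into .o files and run with xt-run.]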
46Kurt Keutzer
Xtensa Architecture
[Figure: the Xtensa Core connected to the TIE unit through Rs, Rt, Rr, and I.]
• TIE extensions: single cycle, state free, no new exceptions, no stalls, typeless data
• Rs, Rt, Rr are 32-bit registers
• I is the instruction controlling the TIE unit
• Xtensa Core is a 32-bit configurable RISC processor
47Kurt Keutzer
Viterbi Architecture
[Figure: blocks labeled Init, ADC I/O Device, ACS, TraceBack, and RAM, with the annotation "Measured Performance Here".]
48Kurt Keutzer
TIE SetupBMreg (ACS)
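[Figure: SetupBMreg TIE instruction. The low 8 bits of Rs (I) and Rt (Q) pass through add/subtract units and an offset of 0x7F to form four 8-bit branch metrics bm3 ... bm0, packed into one 32-bit word, under control of the instruction.]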
49Kurt Keutzer
ACS TIE Extension (ACS)
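[Figure: ACS TIE instruction. Rs and Rt supply packed path metrics (pm), which are added to the packed 8-bit branch metrics bm3 ... bm0. The sums are compared (checking the msb of the difference), and Rr receives the selected path metric and the decision bits, with the remaining bits set to 0. Instruction variants: ACS03 || ACS12 || ACS30 || ACS21.]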
50Kurt Keutzer
ACS TIE Extension with State (ACS)
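[Figure: ACS TIE instruction with state. The 32-bit registers Rs and Rt each carry packed path metrics (pm). Adders combine them with the four packed 8-bit branch metrics bm3 ... bm0, the sums are compared (checking the msb of the difference), and Rr receives the selected path metrics together with the decision bits, under control of the instruction.]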
51Kurt Keutzer
TIE Zmask (TraceBack)
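[Figure: Zmask TIE instruction for traceback. Rs and Rt are combined using the masks 0x7F and 0x3F, a left shift by 1, an AND, and an OR to produce Rr, under control of the instruction.]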
52Kurt Keutzer
Designs
All designs had a BER of 0.000095 after 10 million iterations
• Design 1: 100 MHz, 48 mW, 1K DCache, 1K ICache, TIE
• Design 1+: 222 MHz, 144 mW, 1K DCache, 1K ICache, TIE
• Design 2-: 100 MHz, 69 mW, 16K DCache, 16K ICache, TIE
• Design 2: 222 MHz, 191 mW, 16K DCache, 16K ICache, TIE
• Design 3: 222 MHz, 191 mW, 16K DCache, 16K ICache, TIE with state
53Kurt Keutzer
Performance
[Chart: throughput in Kb/s, axis 0 to 1200]
             Cache    Perfect Cache
Design 1      118       409
Design 1+     263       909
Design 2-     357       409
Design 2      793       909
Design 3      966      1142
54Kurt Keutzer
Energy Dissipation
[Chart: energy dissipation in uJ/bit, axis 0 to 0.6]
             Cache    Perfect Cache
Design 1      0.4       0.12
Design 1+     0.54      0.16
Design 2-     0.19      0.17
Design 2      0.24      0.21
Design 3      0.2       0.17
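The energy-per-bit values are consistent with dividing each design's power by its throughput. The sketch below uses the power figures from the Designs slide and the throughput values from the Performance chart; the pairing of chart values with designs is read off the extracted text and is an assumption.

```python
designs = {
    # name: (power in mW, Kb/s with real cache, Kb/s with perfect cache)
    "Design 1":  (48, 118, 409),
    "Design 1+": (144, 263, 909),
    "Design 2-": (69, 357, 409),
    "Design 2":  (191, 793, 909),
    "Design 3":  (191, 966, 1142),
}

def energy_uj_per_bit(power_mw, kbps):
    # mW / (Kb/s) = (1e-3 J/s) / (1e3 bit/s) = 1e-6 J/bit
    return power_mw / kbps

energy = {name: (energy_uj_per_bit(p, c), energy_uj_per_bit(p, pc))
          for name, (p, c, pc) in designs.items()}

for name, (cache, perfect) in energy.items():
    print(f"{name:10s} {cache:.2f} {perfect:.2f}")
```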
55Kurt Keutzer
n(s*J)/Bit
[Chart: n(s*J)/bit, axis 0 to 3.5]
             Cache    Perfect Cache
Design 1      3.39      0.293
Design 1+     2.05      0.176
Design 2-     0.532     0.416
Design 2      0.315     0.231
Design 3      0.207     0.148
56Kurt Keutzer
Die Area
[Chart: die area in mm^2, axis 0 to 7]
             Cache    Perfect Cache
Design 1      2.1       2.1
Design 1+     2.37      2.37
Design 2-     6.14      6.14
Design 2      6.7       6.7
Design 3      6.7       6.7
57Kurt Keutzer
Conclusions
TIE extensions, cache configuration, and improved code efficiency resulted in an order of magnitude improvement from our original
For power and performance, the effect of cache size is greater than the effect of a higher clock frequency
Use voltage scaling to reduce the power
If streaming data, then scale frequency
Adding state will result in the ability to increase performance
Having the ability to remove core instructions will decrease decode complexity and should lower power and area
58Kurt Keutzer
Soft Core Viterbi Decoder
EECS 290A Project
Dave Chinnery, Rhett Davis, Chris Taylor, Ning Zhang
59Kurt Keutzer
High Level Architecture
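[Figure: high-level architecture block diagram. The block names did not survive extraction; each block is annotated with its share of gates, area, and power.]
% Gates   % Area   % Power
23%       36%      30%
0%        48%      15%
38%       8%       22%
18%       4%       16%
9%        2%       8%
4%        1%       5%
2%        1%       4%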
60Kurt Keutzer
Branch & Path Metric Generation
[Figure: branch and path metric generation datapath; the only labels recovered are eight "UL" markers.]
Branch Metrics Computation apparently implemented with a CORDIC block (contains 840 MUXes, 58 adders & flip-flops, 32 15-bit buses)
Branch Metrics Hard-wired to each ACS unit
Path Metrics Stored in ACS units
Each ACS unit handles 16 states
Hard-wired Path Metric Interconnect
61Kurt Keutzer
ACS Architecture
Each ACS unit stores 32 path metrics
Only two SRAM’s are active at a time
Across all four ACS units, each path metric is stored twice
SRAM accounts for 88% of the area and 27% of the power for each ACS unit
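[Figure: ACS unit datapath. 8x9 SRAMs hold the path metrics (PMU, PML); together with the branch metric inputs (BMU, BML) they feed the Add, Compare, and Select stages, followed by a pipeline register and a MUX.]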
62Kurt Keutzer
Traceback Architecture
State-Machine blocks are just large sum-of-products combinational networks (351 gates each)
Each memory unit contains a 16x64 SRAM and logic (192 MUXes, 128 flip-flops)
[Figure: traceback unit. Decision bits enter the traceback memory units (MUX, SRAM, and pipeline register, with the signals Traceback and Next_ramin and a 192-bit path) and leave as OutDecisionBits, controlled by a finite state machine. Traceback memory unit: 22% area, 20% power. Finite state machine: 11% area, 13% power.]
63Kurt Keutzer
Design Flow
Synthesis & Module Generation
• Design Compiler Synthesis script (from Mentor/Inventra)
• SRAM Generator (from Norman Walker)
Pre-Layout Verification & Analysis
• VHDL gate-level sims (timing verification, switching activity annotation)
• PowerMill Simulations (SRAM, core)
• Design Compiler, Power Compiler (Static timing, power analysis)
Floor Planning, Place & Route
• Floor Planning (Preview)
• Place & Route (Silicon Ensemble)
Post-Layout Verification & Analysis
• Interconnect Parasitic Extraction (“report simcap”)
• PowerMill simulations, PathMill static analysis
• Design Compiler, Power Compiler (Static timing, power analysis with back-annotated interconnect parasitics)
64Kurt Keutzer
Synthesis and SRAM Generation
• Synthesis with Synopsys Design Compiler
  • Constraint: 66 kHz clock (effectively infinite)
  • Bottom-up synthesis of 62 VHDL entities
• Low-Power SRAM generator (from Pleiades)
  • Very large sense-amps, control logic
  • Optimized for power, speed at low supply voltages
  • Word length limited to a power of 2
65Kurt Keutzer
Simulation Models
Behavioral C
• Parameterized, bit-true, and fast
• Used for system level design and BER simulations
Behavioral VHDL
• Parameterized, bit-true, and cycle-true
• Used for structural simulations and test bench reference
RTL VHDL
• Synthesizable, crafted for specific parameters and implementation structure
• Used for synthesis quality
66Kurt Keutzer
BER Simulation Results
67Kurt Keutzer
SRAM
• Simulation Tools: TimeMill & PowerMill
• Parameters
  • 66 MHz clock
  • Voltage 2.5V
  • Randomly Generated Test Vectors
• Results
  • Power Analysis
  • Timing Analysis
68Kurt Keutzer
SRAM: Power Numbers
SRAM used for ACS Unit: 8 words by 9 data bits
Operations       Avg. (µA)   Avg. (mW)   Avg. (pJ)
Read Activity    663.73      1.659       24.885
Write Activity   563.21      1.408       21.120
Read/Write       612.29      1.530       22.950
Parasitic Extraction
Operations       Avg. (µA)   Avg. (mW)   Avg. (pJ)
Read Activity    949.89      2.3747      35.6205
Write Activity   772.830     1.9320      28.980
Read/Write       851.42      2.1285      31.9275
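The three columns of the ACS SRAM tables are related by the 2.5V supply and the clock period. The sketch below assumes a 15 ns cycle (roughly 1/66 MHz); that value is inferred from the ratio of the pJ and mW columns and is not stated on the slide.

```python
VDD = 2.5          # V, from the SRAM simulation parameters
CYCLE_NS = 15.0    # assumed cycle time, about 1 / 66 MHz

def sram_power_mw(avg_current_ua):
    """Average power: uA * V gives uW, divided by 1000 for mW."""
    return avg_current_ua * VDD / 1000.0

def sram_energy_pj(avg_current_ua):
    """Energy per cycle: mW * ns gives pJ."""
    return sram_power_mw(avg_current_ua) * CYCLE_NS

read_mw = sram_power_mw(663.73)
read_pj = sram_energy_pj(663.73)
extracted_read_mw = sram_power_mw(949.89)
```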
69Kurt Keutzer
SRAM: Power Numbers
SRAM used for Traceback Unit: 16 words by 64 data bits
Operations       Avg. (µA)   Avg. (mW)   Avg. (pJ)
Read Activity    2170.7      5.4267      81.4005
Write Activity   1893.4      4.7335      71.0025
Read/Write       2086.9      5.2172      78.2580
Parasitic Extraction?
70Kurt Keutzer
SRAM: Timing Numbers
Delays
Setup Time; Hold Time: time needed for data address to become stable
                 Setup (ns)   Hold (ns)   Data Resolution (ns)
ACS SRAM         ~1           ~2          ~1.8
Traceback SRAM   ~1           ~2          ~5
71Kurt Keutzer
Place and Route
Floor planning of the Viterbi SRAM macro cells and standard cells was done in Preview, and Silicon Ensemble was used for routing.
Total SRAM macro cell area was 1.58 mm^2 (1.08 mm^2 with 9x8 SRAMs).
Area of the 16 9x8 bit SRAM macro cells: 0.052 mm^2 each, 62% larger than required, as 16x8 bit SRAMs were used (SRAM generator output had been verified for powers of 2).
Area of the 3 16x64 bit SRAM macro cells: 0.25 mm^2 each.
Area of the standard cells: 1.02 mm^2 (0.35 mm^2 from DEF file).
Final chip area was 4.0 mm^2 (original estimate 2.5 mm^2).
Parasitics for timing simulation were extracted from the final routed nets in Silicon Ensemble.
72Kurt Keutzer
Wiring Statistics
Six metal layers, layers 5 and 6 used for power and ground respectively
Ground and power spaced alternately 100 um apart horizontally and vertically.
There were about 6200 nets and 46,114 vias.
Total wire lengths:
metal layer 1: 3,293 um
metal layer 2: 458,440 um
metal layer 3: 510,517 um
metal layer 4: 218,023 um
metal layer 5: 96,882 um signal, and 38,400 um power
metal layer 6: 8,660 um signal, and 37,500 um ground
wire length: 685 mm horizontal, 611 mm vertical, total 1296 mm
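As a sanity check, the per-layer signal wire lengths add up to the quoted total. The sketch assumes that the 1296 mm total counts signal wiring only and excludes the power and ground runs on layers 5 and 6.

```python
# Signal wire length per metal layer, in um (from the slide).
signal_um = {1: 3_293, 2: 458_440, 3: 510_517, 4: 218_023, 5: 96_882, 6: 8_660}

total_signal_mm = sum(signal_um.values()) / 1000.0
horizontal_plus_vertical_mm = 685 + 611

print(f"{total_signal_mm:.1f} mm vs {horizontal_plus_vertical_mm} mm")
```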
73Kurt Keutzer
Final Placement and Routing
Significant routing congestion at 16 by 64 bit SRAM outputs, due to Silicon Ensemble grid size of 1 um (observe white and light blue wires).
Minimum of 6 unroutable nets observed, even at 12 mm^2 chip area.
Final size was 1.25 mm x 3.2 mm, 4 mm^2, with 9 unroutable nets.
Violation reports in Silicon Ensemble did not identify which nets were unroutable, other than problems with ground and power connections.
74Kurt Keutzer
Static Timing Checks
                    Delay Before      Delay After       Max Clock         Max Symbol
                    Annotation (ns)   Annotation (ns)   Frequency (MHz)   Rate (Msps)
Critical Path       8.7               17                60                3.8
Longest SRAM Path   8.5               14                -                 -
All timing checks performed with Design Compiler’s report_timing command
Parasitic capacitances back-annotated with the set_load command
No RC parasitics annotated
No SRAM model was used for timing checks
Critical Path was from ACS control logic, through a PM output MUX select signal (in an ACS unit), through the following ACS unit.
Checks performed at 2.5V
75Kurt Keutzer
Static Power Checks
Power                 Before Annotation   After SAIF Annotation   After Parasitic Annotation
Cell Internal (mW)    28                  20                      20
Net Switching (mW)    15                  6.3                     8.7
Total Dynamic (mW)    43                  26                      29
Cell Leakage (nW)     750                 810                     810
All power checks performed with Design Compiler’s report_power command
Switching activity was measured for every output port (transition counts over 16,000-cycle simulation)
Back-annotation performed with SAIF files
No SRAM model was used for power checks (added in manually)
Checks performed at 2.5V w/ 60 MHz clock
76Kurt Keutzer
Delay and Energy Scaling
77Kurt Keutzer
Performance Results
For fixed throughput requirement 100 ksps:
                            Supply        Clock Rate   Symbol Rate   Power
                            Voltage (V)   (MHz)        (Msps)        (mW)
Optimized for Performance   2.5           1.6          0.1           1.59
Optimized for Energy        0.8           1.6          0.1           0.16
Optimized for EDP           1.25          1.6          0.1           0.40

                            Supply        Clock Rate   Symbol Rate   Energy Delay    Power
                            Voltage (V)   (MHz)        (Msps)        Product (fJs)   (mW)
Optimized for Performance   2.5           60           3.75          4.24            59.6
Optimized for Energy        0.8           7.46         0.47          3.49            0.76
Optimized for EDP           1.25          25.12        1.57          2.53            6.24
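The entries in both tables can be related by a few simple rules. The sketch below assumes 16 clock cycles per symbol (60 MHz / 3.75 Msps), power that scales linearly with clock rate at a fixed supply voltage, and an energy-delay product equal to energy per symbol times time per symbol. These rules are inferred from the numbers, not stated on the slide.

```python
CYCLES_PER_SYMBOL = 16   # assumption: 60 MHz / 3.75 Msps

# supply voltage (V): (max clock in MHz, power in mW at that clock)
max_points = {2.5: (60.0, 59.6), 0.8: (7.46, 0.76), 1.25: (25.12, 6.24)}

def symbol_rate_msps(clock_mhz):
    return clock_mhz / CYCLES_PER_SYMBOL

def edp_fjs(power_mw, rate_msps):
    """Energy per symbol (J) times time per symbol (s), in fJ*s."""
    rate = rate_msps * 1e6
    return (power_mw * 1e-3 / rate) / rate * 1e15

def power_at_clock(voltage, clock_mhz):
    """Power at a lower clock, same voltage, assuming linear scaling."""
    f_max, p_max = max_points[voltage]
    return p_max * clock_mhz / f_max

for v, (f, p) in max_points.items():
    r = symbol_rate_msps(f)
    print(f"{v} V: {r:.2f} Msps, EDP {edp_fjs(p, r):.2f} fJs, "
          f"{power_at_clock(v, 1.6):.2f} mW at 1.6 MHz")
```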
78Kurt Keutzer
Summary NORMALIZED (100kbs)
Implementation   Performance   Power   Norm    Gates   Area     Gates/     Power (uW)/   Effort
                 (kb/s)        (mW)                    (mm^2)   Area       Gate          (days)
CP 1             100.00        40.68   294.4   50000   2.10     23809.52   0.814         6
ARM              100.00        36.86   266.8   50000   7.47     6695.68    0.737         4
CP 2             100.00        2.47    17.9    47098   6.69     7040.06    0.052         6
CP 3             100.00        2.02    14.7    26480   6.69     3958.15    0.076         6
DSP              100.00        1.97    14.3    47098   10.24    4599.41    0.042         3
ASIC             100.00        0.14    1.0     35100   4.00     8775.00    0.004         30
79Kurt Keutzer
Summary MAX PERFORMANCE
Implementation   Performance   Power    Norm   Gates   Area     Gates/     Power (uW)/   Effort
                 (kb/s)        (mW)                    (mm^2)   Area       Gate          (days)
Reference        100.00        N/A      N/A    N/A     N/A      N/A        N/A           N/A
ARM              116.48        42.94    0.8    50000   7.47     6695.68    0.86          4
CP 1             118.00        48.00    0.9    50000   2.10     23809.52   0.96          6
DSP              464.70        89.46    1.8    47098   10.24    4599.41    1.90          3
CP 2             793.00        191.00   3.8    47098   6.69     7040.06    4.06          6
CP 3             966.00        191.00   3.8    26480   6.69     3958.15    7.21          6
ASIC             3750.00       50.60    1.0    35100   4.00     8775.00    1.44          30
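The derived columns of the maximum-performance summary can be recomputed from power, area, and gate count. The row values below are parsed from the flattened table text, so the column assignment is an assumption; the "Norm" column appears to be power relative to the ASIC.

```python
rows = {
    # implementation: (power in mW, area in mm^2, gates) at maximum performance
    "ARM":  (42.94, 7.47, 50000),
    "CP 1": (48.00, 2.10, 50000),
    "DSP":  (89.46, 10.24, 47098),
    "CP 2": (191.00, 6.69, 47098),
    "CP 3": (191.00, 6.69, 26480),
    "ASIC": (50.60, 4.00, 35100),
}

def derived(power_mw, area_mm2, gates):
    """Power per gate (uW), gate density (gates per mm^2), power relative to ASIC."""
    return {
        "uW_per_gate": power_mw * 1000.0 / gates,
        "gates_per_mm2": gates / area_mm2,
        "norm": power_mw / rows["ASIC"][0],
    }

table = {name: derived(*vals) for name, vals in rows.items()}
```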