Low Power System Level Design Methodologies
Young-Chul Kim, Chonnam National Univ.
Dept. of ECE, IT SoC Lab. http://soc.chonnam.ac.kr/~yckim
IT SoC Lab.
Contents
• Introduction to System Level Design
• Hardware and Software Co-design
• Re-configurable Processors
• Other Low Power System Level Designs
Introduction to SOC
• SOC will bridge the gap between software and its implementation in novel, energy-efficient silicon architectures.
• In SOC design, chips are assembled at the level of reusable IP blocks and IP interfaces rather than at the gate level.
• SOC specs come from ICT system engineers rather than from RTL descriptions.
Common Fabric for IP Blocks
• Soft IP blocks are portable, but not as predictable as hard IP.
• Hard IP blocks are very predictable, since a specific physical implementation can be characterized, but they are hard to port because they are often tied to a specific process.
• A common fabric is required for both portability and predictability.
• Wide availability: Cell Based Array, a metal-programmable architecture that provides the performance of a standard cell and is optimized for synthesis.
Four main applications
Set-top box: Mobile multimedia system, base station for the home local-area network.
Digital PCTV: concurrent use of TV, 3D graphics, and Internet services
Set-top box LAN service: Wireless home-networks, multi-user wireless LAN
Navigation system: steer and control traffic and/or goods-transportation
Types of System-on-a-Chip Designs
Silicon in 2010
Die Area: 2.5 x 2.5 cm
Voltage: 0.6 V
Technology: 0.07 µm

Memory           Density (Gbits/cm2)   Access Time (ns)
DRAM             8.5                   10
DRAM (Logic)     2.5                   10
SRAM (Cache)     0.3                   1.5

Logic            Density (Mgates/cm2)  Max. Ave. Power (W/cm2)  Clock Rate (GHz)
Custom           25                    54                       3
Std. Cell        10                    27                       1.5
Gate Array       5                     18                       1
Single-Mask GA   2.5                   12.5                     0.7
FPGA             0.4                   4.5                      0.25
Why Lower Power
• Portable systems: long battery life, light weight, small form factor
• IC priority list: power dissipation, cost, performance
• Technology direction: reduced voltage/power designs based on mature high-performance IC technology; high integration to minimize size, cost, power, and speed
Microprocessor Power Dissipation
[Figure: power (W, 0 to 50) versus year (1980 to 2000). Processors plotted: i286, i386 DX 16, i486 DX 25, i486 DX 50, i486 DX2 66, i486 DX4 100, P5 66, P6 166, P II 300, P III 500, P-PC601 50, P-PC604 133, P-PC750 400, Alpha 21064 200, Alpha 21164, Alpha 21264]
Levels for Low Power Design

Level           Techniques
System          Hardware-software partitioning, Power down
Algorithm       Complexity, Concurrency, Locality, Regularity, Data representation
Architecture    Parallelism, Pipelining, Signal correlations, Instruction set selection, Data rep.
Circuit/Logic   Sizing, Logic Style, Logic Design
Technology      Threshold Reduction, Scaling, Advanced packaging, SOI

Possible Power Savings at Different Design Levels

Level of Abstraction   Expected Saving
Algorithm              10 - 100 times
Architecture           10 - 90%
Logic Level            20 - 40%
Layout Level           10 - 30%
Device Level           10 - 30%
Power-hungry Applications
Signal Compression: HDTV Standard, ADPCM, Vector Quantization, H.263, 2-D motion estimation, MPEG-2 storage management
Digital Communications: Shaping Filters, Equalizers, Viterbi decoders, Reed-Solomon decoders
New Computing Platforms
• SOC power efficiency of more than 10 GOPs/W
• Higher on-chip system integration: COTS: 100 W, SOAC: 10 W (inter-chip capacitive loads, I/O buffers)
• Speed & performance: shorter interconnections, fewer drivers, faster devices, more efficient processing architectures
• Mixed-signal systems
• Reuse of IP blocks
• Multiprocessor, configurable computing
• Domain-specific, combined memory-logic

$$P = k\,C\,F\,V^{2}$$
Physical gap
• Timing closure problem: layout-driven logic and RT-level synthesis
• Energy efficiency requires locality of computation and storage: a good match for stream-based data processing of speech, images, and multimedia-system packets.
• Next-generation SOC designers must bridge the architectural gap between system specification and energy-efficient IP-based architectures, while CAE vendors and IP providers will bridge the physical gap.
Low Power Design Flow I
[Figure: flowchart with the boxes System-Level Specification; Function Partitioning and HW/SW Allocation; System-Level Power Analysis; Behavioral Description; Software Functions; Processor Selection; Power-driven Behavioral Transformation; Behavioral-Level Power Analysis; Power Conscious Behavioral Description; High-Level Synthesis and Optimization; RT-Level Power Analysis; Software Optimization; Software-Level Power Analysis; To RT-Level Design]
Low Power Design Flow II
[Figure: flowchart with the boxes RT-level Description; RTL Mapping; RTL Library; RTL Macrocells (Data-path, Controller, Processor, Control and Steering Logic, Memory); Logic Synthesis and Optimization; Standard Cell Library; Gate-Level Power Analysis; Gate-level Description; High-Level Synthesis and Optimization; Switch-Level Power Analysis; Switch-level Description]
Three Factors Affecting Energy
– Reducing waste by hardware simplification: redundant h/w extraction, locality of reference, demand-driven / data-driven computation, application-specific processing, preservation of data correlations, distributed processing
– All-in-one approach (SOC): I/O pin and buffer reduction
– Voltage-reducible hardware:
  2-D pipelining (systolic arrays)
  SIMD: parallel processing, useful for data with parallel structure
  VLIW: a flexible approach
Example 1: Filter: Eliminating Redundant Computations
Example 2: IBM's PowerPC Low-Power Architecture
• Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution
  - The 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU)
  - The FPU is pipelined, so a multiply-add instruction can be issued every clock cycle
  - Low-power 3.3-volt design
• Use small complex instructions with shorter instruction length: IBM's PowerPC 603e is RISC
• Superscalar: CPI < 1; the 603e issues as many as three instructions per cycle
• Low power management: the 603e provides four software-controllable power-saving modes
• Copper processor with SOI
• IBM's Blue Logic ASIC: the new design reduces power by a factor of 10
Power-Down Techniques
◆ Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work
Voltage vs Delay
• Use variable voltage scaling or scheduling for real-time processing.
• Use architecture optimization to compensate for slower operation, e.g., parallel processing to increase concurrency and pipelining to shorten the critical path.
Low Voltage Main Memories
Why Copper Processor?
• Motivation: aluminum resists the flow of electricity as wires are made thinner and narrower.
• Performance: 40% speed-up
• Cost: 30% less expensive
• Power: less power from batteries
• Chip size: 60% smaller than an aluminum chip
Silicon-on-Insulator
How does SOI reduce capacitance?
• Junction capacitance is eliminated because an insulator (similar to glass) is placed between the impurities and the silicon substrate
• High performance, low power, low soft error
SOC Co-Design Challenges
• Current systems are complex and heterogeneous; they contain many different types of components.
• Half of the chip can be filled with 200 low-power, RISC-like processors (ASIPs) interconnected by field-programmable buses and embedded in 20 Mbytes of distributed DRAM and flash memory. The other half: ASIC.
• Computational power will not result from multi-GHz clocking but from parallelism, with clocks below 200 MHz. This will greatly simplify the design for correct timing, testability, and signal integrity.
Configurability
• One million gates of reconfigurable logic, one million gates of hardwired logic
• 50 GIPS for programmable components or 500 GIPS for dedicated hardware
• Reduce design risks, for which NRE costs will become dominant
• 1 V supply, with power in the watt range
Bridging the architectural gap
• Product reliability: design at a level far above the RT level, with reuse factors in excess of 100
• Trade-off: 100 MOPs/watt (microprocessor) versus 100 GOPs/watt (hardwired)
• Reconfigurable computing with a large number of computing nodes and a very restricted instruction set (Pleiades)
Implementing Digital Systems
H/W and S/W Co-design
Hardware/Software Co-Design Flow
[Figure: flowchart with the boxes Analysis of Constraints & Requirements; System Specification; Hardware & Software Partitioning; Hardware Description; Software Description; Interface Synthesis; Hardware Synthesis & Configuration; Software Generation & Parameterization; Configuration Modules; Hardware Components; HW/SW Interface; Software Modules; HW/SW Integration & Cosimulation; Integration System; System Evaluation; Design Verification]
Three Co-Design Approaches
N.S. Voros et al., "Hardware-software co-design of embedded systems using multiple formalisms for application development", IFIP International Conference FORTE/PSTV'98, Nov. '98
• ASIP co-design: starts with an application, builds a specific programmable processor, and translates the application into software code. H/w and s/w partitioning includes the instruction set design.
• H/w s/w synchronous system co-design: a s/w processor acts as the master controller, with a set of h/w accelerators as co-processors. Tools: Vulcan, Codes, Tosca, Cosyma
• H/w s/w for distributed systems: mapping of a set of communicating processes onto a set of interconnected processors. Behavioral decomposition, process allocation, and communication transformation. Tools: Coware (powerful), Siera (reuse), Ptolemy (DSP)
Mixing H/W and S/W
• Argument: mixed hardware/software systems represent the best of both worlds: high performance, flexibility, design reuse, etc.
• Counterpoint: from a design standpoint, it is the worst of both worlds
  - Simulation: problems of verification and test become harder
  - Interface: too many tools, too many interactions, too much heterogeneity
  - Hardware/software partitioning is "AI-complete"!
Partitioning
• Performance requirements: some functions are easier to implement in hardware
  - Blocks that are used repeatedly
  - Blocks with a parallel structure
• Modifiability: blocks implemented in software are easy to modify
• Implementation cost: blocks implemented in hardware can be shared
• Scheduling
  - Schedule the blocks partitioned into HW and SW so that they meet the given constraints
  - SW operations must be scheduled sequentially
  - SW and HW can be scheduled concurrently as long as there are no data or control dependencies
Low power partitioning approach
• Different HW resources are invoked according to the instruction executed at a specific point in time.
• During the execution of an add operation, the ALU and registers are used, but the multiplier is idle.
• Non-active resources still consume energy, since the corresponding circuits continue to switch.
• Calculate the wasted energy.
• Add application-specific cores with partial running.
• Whenever one core is performing, all the other cores are shut down.
Partitioning Process
- Derive a graph G (operations and connections)
- Decompose G into a set of clusters (cluster: a set of operations)
- Calculate bus-traffic energy
- Pre-select clusters with constraints
- Set the number of resources
- List scheduling
- Test the utilization rate (ASIC or µP); the utilization rate of the µP is supported by a SW estimation tool
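The utilization test in the last step can be illustrated with a small sketch. It assumes a definition that the slide does not spell out: the utilization rate of a resource is the fraction of scheduled cycles in which that resource is active. The schedule, the resource names, and the `utilization_rates` helper are all hypothetical.

```python
from collections import Counter


def utilization_rates(schedule, resources):
    """Return {resource: active_cycles / total_cycles}.

    `schedule` holds one entry per clock cycle; each entry is the set of
    resources that are active in that cycle (assumed representation).
    """
    total_cycles = len(schedule)
    active = Counter()
    for cycle in schedule:
        for resource in cycle:
            active[resource] += 1
    return {r: active[r] / total_cycles for r in resources}


# Hypothetical 4-cycle schedule: the multiplier is idle in three of four cycles.
schedule = [{"alu", "reg"}, {"alu", "reg"}, {"mul", "reg"}, {"alu"}]
rates = utilization_rates(schedule, ["alu", "mul", "reg"])
print(rates)
```

A cluster whose resources show low utilization on the general-purpose core is a candidate for a dedicated core that can be shut down when idle.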
Design Flow
[Figure: flowchart with the boxes Application; Divide application into clusters; List schedule; Compute utilization rate (ASIC); Select cluster; Compute utilization rate (µP); Core Energy Estimation; HW Synthesis; Evaluate]
- Max 94% energy saving, and in most cases even reduced execution time
- 16k cell overhead
Interface
• Need for an interface block
  - Transfers data between hardware and software blocks
  - The overhead between HW and SW blocks can be reduced only if an efficient interface block is built
• Interface methods
  - Shared memory
  - FIFO
  - Handshaking protocol
Logical Bus Architecture
• System bus signals
  - address, data, control signals
  - address space consists of the memory space & I/O space
  - memory space: memory of the SW component
  - I/O space: ports within SW & registers in other HW
• Port signals: specialized signals capable of directly interfacing between SW & HW components
• Interrupt signals: raised when SW & HW components have completed an operation, or when an error condition is detected
Co-Simulation
• Need for co-simulation
  - Simulating the HW part and the SW part together makes it possible to predict the results of the assembled system
  - Predicting system performance allows the system to be redesigned to meet the given spec before synthesis
  - Predicts the characteristics of each sub-block for HW/SW partitioning
• Co-simulation tools: Ptolemy, COSSAP, POLIS
Partitioning Example: CDMA Searcher - vada Lab. SKKU
[Figure: searcher block diagram: PN-Code Generator; synchronous accumulation stage (real, imaginary); energy computation stage (real, imaginary); compare/select stage; asynchronous accumulation stage; compare/select stage. Partitioning search space: PN-Code Generation; Synchronous Accumulator (SW), Synchronous Accumulator 1 (HW), Synchronous Accumulator 2 (HW); Energy Estimate (SW), Energy Estimate (HW); Comparator (SW), Comparator with precomputation (HW); Asynchronous Accumulator (SW), Asynchronous Accumulator (HW); evaluated by Cost (Speed, Area, Power) toward the GOAL]
Approach - vada Lab. SKKU
[Figure: searcher data path with a Control Signal Generator: synchronous accumulation stage, energy computation stage, asynchronous accumulation stage, selection of the max value, and comparison with thresholds θ1 and θ2, ending in "Search Done" or "Search_Slew"]

$$O_I = RX_I \cdot TX_I + RX_Q \cdot TX_Q$$
$$O_Q = RX_I \cdot TX_Q + RX_Q \cdot (-TX_I)$$
$$Y_I = \sum O_I, \qquad Y_Q = \sum O_Q$$
$$Z = \max\left(Y_I^{2},\ Y_Q^{2}\right), \qquad \text{accumulated as } \sum Z$$

- Software oriented design
- Dark block: Hardware
- Interface: Control signal gen.
- Partitioned in terms of speed cost
- Change from SW to HW
  1. Implementation speed
  2. Parallel architecture
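The correlation equations on this slide can be sketched in a few lines. This is an illustrative sketch only: the function name `searcher_energy`, the block length `coherent_len`, and the ±1 test sequences are assumptions, and the threshold comparisons with θ1 and θ2 are left out because the slide does not give their values.

```python
def searcher_energy(rx_i, rx_q, tx_i, tx_q, coherent_len):
    """Accumulate Z = max(Y_I^2, Y_Q^2) over blocks of `coherent_len` chips."""
    total = 0
    for start in range(0, len(rx_i) - coherent_len + 1, coherent_len):
        y_i = y_q = 0
        for n in range(start, start + coherent_len):
            # Synchronous accumulation of O_I and O_Q
            y_i += rx_i[n] * tx_i[n] + rx_q[n] * tx_q[n]
            y_q += rx_i[n] * tx_q[n] + rx_q[n] * (-tx_i[n])
        # Energy computation, then asynchronous accumulation of Z
        total += max(y_i * y_i, y_q * y_q)
    return total


# Hypothetical local PN sequences (chips are +1 or -1).
tx_i = [1, -1, 1, 1, -1, 1, -1, -1]
tx_q = [1, 1, -1, 1, -1, -1, 1, 1]

aligned = searcher_energy(tx_i, tx_q, tx_i, tx_q, coherent_len=4)
# Received sequence delayed by one chip relative to the local code.
shifted = searcher_energy(tx_i[-1:] + tx_i[:-1], tx_q[-1:] + tx_q[:-1],
                          tx_i, tx_q, coherent_len=4)
print(aligned, shifted)
```

The accumulated energy peaks when the received and local codes are aligned, which is what the comparison stages test for.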
Result - vada Lab. SKKU

Configuration                  cycle   ratio   Area (gates)
Full SW                        266     -       -
Full HW                        -       -       9008
Synchronous accumulator (1)    138     48.1    + 872
Computing energy (2)           265     4.4     + 3096
(1) & (2)                      137     48.5    + 3968
(2) & Comparator (3)           265     4.4     + 3155
(1) & (3)                      138     48.1    + 931
Low Power CDMA Searcher Project at SKKU
• Project title: Low-power design of an IS-95-based DS/CDMA system using co-design techniques
• Development period: 1999.3.1 - 2000.2.28 (12 months)
• Purpose and method: RTL-level low-power design and implementation of the searcher engine of the MSM (Mobile Station Modem) chip used in CDMA handsets. Operating frequency: 12.5 MHz
• Low-power design using data-flow-graph rescheduling, pre-computation, strength reduction, and a synchronous accumulator; area and power reduced by up to 67.68% and 41.35%, respectively
• H/W and S/W co-design techniques applied
• San Kim and Jun-Dong Cho, "Low Power CDMA Searcher", CAD and VLSI Workshop, May 1999.
• Inki Hwang, San Kim and Jun-Dong Cho, "CDMA Searcher Co-Design", ASIC Workshop, Sep. 1999.
Application-Specific Instruction Processor
• Processor architecture tailored not just for an application domain (e.g., DSP, microcontrollers), but for specific sets of applications (e.g., audio, engine control)
• ASIP characteristics
  - Greater design cost (processor + compiler)
  - Higher performance and lower power than commercial cores, more flexibility than ASIC
ASIP Design
• Given a set of applications, determine the micro-architecture of the ASIP (i.e., configuration of functional units in datapaths, instruction set).
• To accurately evaluate the performance of the processor on a given application, one needs to compile the application program onto the processor datapath and simulate the object code.
• The micro-architecture of the processor is a design parameter!
ASIP Design Flow
Compiler Optimizations
• Machine-independent optimizations
  - Parallelizing transformations, common sub-expression elimination, constant propagation, strength reduction, loop-invariant code motion
• Machine-dependent optimizations
  - Loop unrolling and software pipelining
  - Static allocation (non-recursive procedure calls)
  - Storage layout (arrays, scalars)
  - Optimization of mode setting instructions
  - Instruction selection, scheduling, and register allocation
Cross-Disciplinary nature
• Software for low power: loop transformation leads to much higher temporal and spatial locality of data.
• Code size becomes an important objective
• Software will eventually become a part of the chip
• Behavior-platform-compiler codesign: codesigned with C++ or JAVA, describing their h/w and s/w implementation.
• Multidisciplinary system thinking is required for future designs (e.g., Eindhoven Embedded Systems Institute http://www.eesi.tue.nl/english)
VLSI Signal Processing Design Methodology
• pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering
• bit-serial, bit-parallel and digit-serial architectures, carry save architecture
• redundant and residue systems
• Viterbi decoder, motion compensation, 2D-filtering, and data transmission systems
Low Power DSP
DO-LOOP dominant:
• VSELP vocoder: 83.4%
• 2D 8x8 DCT: 98.3%
• LPC computation: 98.0%
DO-LOOP power minimization ==> DSP power minimization
VSELP: Vector Sum Excited Linear Prediction
LPC: Linear Prediction Coding
Loop unrolling
The technique of loop unrolling replicates the body of a loop some number of times (unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases the instruction parallelism and improves register, data cache or TLB locality.

Original loop:
    for i = 2 to N - 1
        A(i) = A(i) + A(i-1) * A(i+1)

Unrolled by a factor of two:
    for i = 2 to N - 2 step 2
        A(i) = A(i) + A(i-1) * A(i+1)
        A(i+1) = A(i+1) + A(i) * A(i+2)

Loop overhead is cut in half because two iterations are performed in each iteration. If array elements are assigned to registers, register locality is improved because A(i) and A(i+1) are used twice in the loop body. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.
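As a quick check of the transformation, the sketch below runs the original loop and the loop unrolled by two on the same data and compares the results. The helper names `rolled` and `unrolled` are hypothetical, arrays are treated as 1-based by leaving index 0 unused, and the clean-up statement for an odd iteration count is an addition that the slide's pseudocode omits.

```python
def rolled(a):
    """Original loop: i = 2 .. N-1 (1-based; a[0] is unused padding)."""
    a = list(a)
    n = len(a) - 1
    for i in range(2, n):
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a


def unrolled(a):
    """Same computation with unrolling factor u = 2."""
    a = list(a)
    n = len(a) - 1
    for i in range(2, n - 1, 2):
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]
    # Clean-up iteration when the original trip count (N - 2) is odd.
    if n > 2 and (n - 2) % 2 == 1:
        a[n - 1] = a[n - 1] + a[n - 2] * a[n]
    return a


data = [0, 3, 1, 4, 1, 5, 9, 2]  # N = 7, so one clean-up iteration is needed
print(rolled(data) == unrolled(data))
```

The unrolled version executes half as many loop-control steps for the same arithmetic, which is the overhead reduction described above.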
Loop Unrolling (IIR filter example)
• Loop unrolling: localize the data to reduce the activity of the inputs of the functional units; two output samples are computed in parallel based on two input samples.

$$Y_{n-1} = X_{n-1} + A\,Y_{n-2}$$
$$Y_{n} = X_{n} + A\,Y_{n-1} = X_{n} + A\,(X_{n-1} + A\,Y_{n-2})$$

• Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation:

$$Y_{n-1} = X_{n-1} + A\,Y_{n-2}$$
$$Y_{n} = X_{n} + A\,X_{n-1} + A^{2}\,Y_{n-2}$$

• The transformation yields a critical path of 3, thus the voltage can be dropped.
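The equivalence of the first-order recursion and its unrolled form can be checked numerically. The sketch below is illustrative: the function names, the coefficient value, and the input samples are assumptions, and the unrolled version expects an even number of samples.

```python
def iir_direct(x, a, y_init=0.0):
    """Y[n] = X[n] + A * Y[n-1], one output per iteration."""
    y, prev = [], y_init
    for xn in x:
        prev = xn + a * prev
        y.append(prev)
    return y


def iir_unrolled(x, a, y_init=0.0):
    """Two outputs per iteration, both computed from Y[n-2]."""
    a2 = a * a  # constant propagation: A^2 is computed once
    y, y_nm2 = [], y_init
    for n in range(1, len(x), 2):
        y_nm1 = x[n - 1] + a * y_nm2
        y_n = x[n] + a * x[n - 1] + a2 * y_nm2  # no dependence on y_nm1
        y.extend([y_nm1, y_n])
        y_nm2 = y_n
    return y


x = [1.0, 2.0, -3.0, 4.0, 0.5, -1.5]
direct = iir_direct(x, a=0.5)
unrolled_out = iir_unrolled(x, a=0.5)
print(direct)
print(unrolled_out)
```

Because `y_n` no longer waits for `y_nm1`, the two outputs can be evaluated in parallel, which is where the slack for lowering the supply voltage comes from.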
Loop Unrolling for Low Power
Loop Unrolling for Low Power
Loop Unrolling for Low Power
Effective Resource Utilization
[Figure: data-flow graph with adders, multipliers, and delay elements (D), shown before and after retiming, together with the cycle-by-cycle assignment of operations 1 to 8 to the multipliers and the adder for each version]
Can reduce interconnect capacitance.
Domain Specific Processor: Flexibility vs. Energy-Efficiency
• Trade-off between efficiency and flexibility: programmable designs incur significant performance and power penalties compared to ASICs.
• For parallel signal processing algorithms, significant power savings can be achieved by executing the dominant computational kernels of a given class of applications with common features on dedicated, optimized processing elements with minimum energy overhead.
• Programmability requires generalized computation, storage, and communication systems, which can be used to implement different kinds of algorithms.
• Domain-specific processors trade off some of the flexibility of a general-purpose programmable device to achieve higher levels of energy efficiency, while maintaining the flexibility to handle a variety of algorithms.
Hybrid Architecture Template (Pleiades)
Arthur Abnous and Jan Rabaey
Pleiades does much better on the energy scale than the TI DSPs, because DSPs are general-purpose and their instruction execution involves a great deal of overhead. Pleiades can create dedicated hardware structures tuned to the task at hand and executes operations with a small energy overhead.
Application Domains: Ultra-Low-Power Domain-Specific Multimedia Processors
• CELP-based speech coding: LPC analysis and synthesis, codebook search, lag computation
• DCT-based video compression and decompression: DCT and inverse DCT, motion estimation and compensation, Huffman coding and decoding
• Baseband processing for digital radios: demodulation, channel equalization, timing recovery, error correction
The Re-configurable Terminal
Satellite Processors
Elements of Energy- Efficiency
Multi-Processor Implementation
Communication Network
Distributed Data- Driven Control
Execution of a hardware module is triggered by the arrival of tokens. When there are no tokens to be processed at a given module, no switching activity occurs in that module.
Implementation of Handshaking
Design Methodology
Low Power Circuit Techniques
• Reduced swing interconnect (communication network, memories, programmable logic modules)
• On-chip dc-dc conversion + multiple supply voltages
• Locally synchronous, globally asynchronous
• Automatic power-down
• Optimized libraries (0.6 µm CMOS + Cadence/Synopsys design flow)
Switching Activity Reduction
(a) Average activity in a multiplier as a function of the constant value
(b) Parallel and serial implementations of an adder tree
VSELP Synthesis Filter Mapped onto Satellite Processors
Mappings of VSELP Kernel
• The most energy-efficient CELP-based speech algorithm
  - dissipates 36 mW (Vdd = 1.8 V, 0.5 µm CMOS)
  - requires 23.4 MOPS
• Proposed VSELP speech coder
  - 0.6 µm CMOS
  - dissipates under 5 mW
IIR Mapping
IIR Comparison
FFT Mapping
FFT Comparison
Reconfiguration for Power Saving in Real-Time Motion Estimation, S.R. Park, UMASS
Motion Estimation
Block Matching Algorithm
Configurable H/W Paradigms
Programmable Logic Modules
Why Hardware for Motion Estimation?
• Most computationally demanding part of video encoding
• Example: CCIR 601 format
  - 720 by 576 pixels
  - 16 by 16 macroblock (n = 16)
  - 32 by 32 search area (p = 8)
  - 25 Hz frame rate (f_frame = 25)
  - 9 giga-operations/sec are needed for the full-search block matching algorithm
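The 9 GOPS figure can be reproduced with a short calculation. The sketch assumes three operations per pixel comparison (subtract, absolute value, accumulate) and (2p + 1)^2 candidate positions per macroblock; neither assumption is stated on the slide, and the function name is hypothetical.

```python
def full_search_ops_per_second(width, height, p, frame_rate, ops_per_pixel=3):
    """Operation rate of full-search block matching.

    Each macroblock of n*n pixels is compared at (2p + 1)^2 candidate
    positions, and a frame holds width*height / n^2 macroblocks, so the
    macroblock size n cancels out of the total.
    """
    candidates = (2 * p + 1) ** 2
    return width * height * candidates * ops_per_pixel * frame_rate


ops = full_search_ops_per_second(720, 576, p=8, frame_rate=25)
print(f"{ops / 1e9:.2f} GOPS")  # prints "8.99 GOPS"
```

Shrinking the search range p reduces the operation count quadratically, which is the lever the reconfigurable architecture in the following slides uses.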
Why Reconfiguration in Motion Estimation?
Adjusting the search area at frame-rate according to the changing characteristics of video sequences
Reducing Power Consumption by avoiding unnecessary computation
Motion Vector Distributions
Architecture for Motion Estimation
From P. Pirsch et al., VLSI Architectures for Video Compression, Proc. of IEEE, 1995
Re-configurable Architecture for ME
Power Estimation in Reconfigurable Architecture
Power vs Search area
Resource Reuse in FPGAs
Motion Estimation - Conventional
Motion Estimation - Data Reuse
[Equations: power of the conventional scheme, P_a, and of the data-reuse scheme, P_b, expressed in terms of P_add and P_abs]
Therefore, the power reduction factor is 11%.
DIGLOG multiplier

$$C_{mult}(n) = 253\,n^{2}, \qquad C_{add}(n) = 214\,n$$
$$A = 2^{j} + A_R, \qquad B = 2^{k} + B_R$$
$$A \cdot B = (2^{j} + A_R)(2^{k} + B_R) = 2^{j} B + 2^{k} A_R + A_R B_R$$

where n = word length in bits

                       1st Iter   2nd Iter   3rd Iter
Worst-case error       -25%       -6%        -1.6%
Prob. of Error < 1%    10%        70%        99.8%

With an 8 by 8 multiplier, the exact result can be obtained in at most seven iteration steps (worst case).
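The iteration can be sketched in software using shifts and adds only. This is an illustrative model under stated assumptions, not the circuit from the slide: each pass adds 2^j·B + 2^k·A_R and defers the residual product A_R·B_R to the next pass, and the function name `diglog_multiply` is hypothetical. How the slide counts its seven steps is not specified; in this formulation eight passes cover every pair of 8-bit operands.

```python
def diglog_multiply(a, b, iterations):
    """Approximate a * b for non-negative integers with shifts and adds."""
    result = 0
    for _ in range(iterations):
        if a == 0 or b == 0:
            break  # residual product is zero, the result is exact
        j = a.bit_length() - 1          # position of the leading one of A
        k = b.bit_length() - 1          # position of the leading one of B
        a_r = a - (1 << j)              # A = 2^j + A_R
        b_r = b - (1 << k)              # B = 2^k + B_R
        result += (b << j) + (a_r << k)  # 2^j * B + 2^k * A_R
        a, b = a_r, b_r                  # next pass handles A_R * B_R
    return result


for steps in (1, 2, 3):
    approx = diglog_multiply(200, 100, steps)
    print(steps, approx, f"{(approx - 20000) / 20000:+.2%}")
```

Since A_R < A/2 and B_R < B/2, the relative error after m passes is bounded by 4^(-m), which matches the -25%, -6%, and -1.6% worst-case column of the table.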
Voltage Scaling
• Merely changing a processor's clock frequency is not an effective technique for reducing energy consumption. Reducing the clock frequency reduces the power consumed by the processor; however, it does not reduce the energy required to perform a given task.
• Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work.
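A short worked derivation makes the argument concrete. It assumes the first-order CMOS dynamic power model P = kCfV² (the relation that appears on the "New Computing Platforms" slide); the symbols N for the number of clock cycles in the task, T for its run time, and E for its energy are introduced here for illustration.

```latex
\begin{aligned}
P &= k\,C\,f\,V^{2}, \qquad T = \frac{N}{f},\\
E &= P\,T = k\,C\,N\,V^{2}.
\end{aligned}
```

Under this model E does not depend on f: halving the clock halves P but doubles T. Only the V² term changes the energy per task, so lowering the supply from 5.0 V to 2.5 V cuts E by a factor of four, provided the lower clock rate still meets the deadline.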
OS: Voltage Scaling
Different Voltage Schedules
[Figure: three panels plotting energy consumption (proportional to Vdd^2) against time (0 to 25 sec), with the timing constraint marked]
(A) 1000 Mcycles at 50 MHz, Vdd = 5.0 V: 40 J
(B) 750 Mcycles at 50 MHz, Vdd = 5.0 V, then 250 Mcycles at 25 MHz, Vdd = 2.5 V: 32.5 J
(C) 1000 Mcycles at 40 MHz, Vdd = 4.0 V: 25 J
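The energy figures for the three schedules can be approximated with a small model. It assumes that energy is proportional to cycles × Vdd² and calibrates the constant `k` so that schedule (A) gives 40 J; the segment layout (Mcycles, MHz, volts) and the helper names are hypothetical. Under this model schedule (C) comes out at about 25.6 J, close to the 25 J shown on the slide.

```python
def schedule_energy(segments, k):
    """Energy in joules for a list of (mcycles, mhz, vdd) segments."""
    return k * sum(mcycles * vdd ** 2 for mcycles, _mhz, vdd in segments)


def schedule_time(segments):
    """Run time in seconds: Mcycles divided by MHz for each segment."""
    return sum(mcycles / mhz for mcycles, mhz, _vdd in segments)


schedules = {
    "A": [(1000, 50, 5.0)],
    "B": [(750, 50, 5.0), (250, 25, 2.5)],
    "C": [(1000, 40, 4.0)],
}

# Calibrate k so that schedule (A) matches the 40 J given on the slide.
k = 40.0 / (1000 * 5.0 ** 2)

for name, segments in schedules.items():
    energy = schedule_energy(segments, k)
    seconds = schedule_time(segments)
    print(f"({name}) {seconds:.0f} s, {energy:.1f} J")
```

Schedules (B) and (C) execute the same 1000 Mcycles in 25 seconds, but running the whole task at one intermediate voltage costs less energy than running part of it fast and part of it slow.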
OS: Voltage Scheduling
Scale Supply Voltage with fCLK
Adaptive Power Supply Voltages
Data Driven Signal Processing
• The basic idea of averaging: two samples are buffered and their workloads are averaged.
• The averaged workload is then used as the effective workload to drive the power supply.
• Using a ping-pong buffering scheme, data samples I_{n+2}, I_{n+3} are being buffered while I_n, I_{n+1} are being processed.
Example of Buffering
RTL: Multiple Supply Voltages Scheduling
Filter Example
Viterbi decoder project
▶ Project title: Study of a low-power decoding algorithm for convolutional encoders
▶ Development period: 1999.02.22 - 11.30 (about 9 months)
▶ Purpose and method: research and development of original techniques for reducing the power of the channel coding unit included in IMT-2000
▶ Main CODEC specifications:
  - Code rate: R = 1/2, 1/3, 1/4, k = 9
  - Decoding method: trace-back Viterbi decoder using soft decision
Viterbi decoder project
▶ Published papers
1. Asia Pacific Conference on ASIC'99
In this paper, we have presented the use of the consensus term and clocking control signal in the ACSU for the low power Viterbi decoder. A 20% reduction in area and a 30% reduction in power consumption are obtained based on the low power ACSU architecture [1]. Applying our proposed glitch reduction techniques to [1], the additional power consumption is reduced by 7% at a cost of a 3% increase in area.
2. International Conference on VLSI and CAD'99
In this paper, we propose a new low-power algorithm for the trace-back unit of the systolic array Viterbi decoder [2]. Reusing the already-generated trace-back routes reduces the number of trace-back operations, and results in increasing the area of the spurious switching activity region. Therefore, the switching activity during the trace-back operation was further reduced by using gated clocks. Our results showed on average a 40% reduction in power with the same latency, but a 23% increase in area compared with the trace-back unit in [2]. We used Design Compiler of SYNOPSYS and measured power consumption using DesignPower of SYNOPSYS.
Viterbi decoder project
▶ References
1. C. Y. Tsui, R. S. K. Cheng and C. Ling, "Using Transformation to Reduce Power Consumption of IS-95 CDMA Receiver", International Symposium on Low Power Electronics and Design, 1999.
2. T. K. Truong, A. M. T. Shih, I. S. Reed, E. H. Satorius, "A VLSI Design for a Trace-back Viterbi Decoder", IEEE Trans. Communication, vol. 40, no. 3, Mar. 1992.
References
[1] A. Abnous and J. Rabaey, "Ultra-Low-Power Domain-Specific Multimedia Processors", Proceedings of the IEEE VLSI Signal Processing Workshop, San Francisco, Oct. 1996.
[2] Digital Semiconductor, Digital Semiconductor SA-110 Microprocessor Technical Reference Manual, Digital Equipment Corporation, 1996.
[3] TMS320C5x General-Purpose Application User's Guides, Literature Number SPRU164, TI, 1997.
[4] T. Anderson, The TMS320C2xx Sum-of-Products Methodology, Technical Application Report SPRA068, TI, 1996.
[5] M. Tsai, IIR Filter Design on the TMS320C54x DSP, Technical Application Report SPRA079, TI, 1996.
[6] Ftp://ftp.ti.com/pub/tms320bbs/c5xxfiles/54xffts.exe, 'C54x Software Support Files, TI.
[7] C. Turner, Calculation of TMS320LS54x Power Dissipation, Technical Application Report SPRA164, TI, 1997.
[8] C. Turner, Calculation of TMS320LS54x Power Dissipation, Technical Application Report SPRA088, TI, 1996.
[9] E. Kusse, Personal communication, 1996.
[10] J. Rabaey et al., "Fast Prototyping of Data Path Intensive Architecture", IEEE Design & Test Magazine, Vol. 8, No. 2, pp. 40-51, 1991.
[11] J. Montanaro et al., "A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor", IEEE Journal of Solid-State Circuits, Vol. 31, No. 11, pp. 1703-1714, Nov. 1996.
[12] A. Fischman and P. Rowland, Designing Low-Power Applications with TMS320LC54x, Technical Application Report SPRA281, TI, 1997.
[13] Daniel D. Gajski, Nikil D. Dutt, Allen C-H Wu, Steve Y-L Lin, "High-level synthesis, Introduction to chip and system design," Kluwer Academic Publishers, 1992.
[14] Duncan A. Buell, Jeffrey M. Arnold, Walter J. Kleinfelder, "Splash2, FPGAs in Custom Computing Machine," IEEE Computer Society Press, Los Alamitos, California.
[15] Jonathan Babb, Russell Tessier, Mathew Dahl, Silvina Zimi Hanono, David M. Hoki, and Anant Agarwal, "Logic emulation with virtual wires," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, vol. 16, No. 6, June 1997.
[16] M. Vasilko, Djamel Ait-Boudaoud, "Architectural synthesis techniques for dynamically Reconfigurable logic," Field Programmable Logic: Smart Applications, New Paradigms and Compilers, Proceedings of 6th Int. Workshop on Field Programmable Logic and Applications, FPL 96, Darmstadt, Germany, Sept. 23-25 1996.
References
[17] Patrick Lysaght, Gordon McGregor and Jonathan Stockwood, "Configuration Controller Synthesis for Dynamically Reconfigurable Systems," IEE Colloquium on Hardware Software Cosynthesis for Reconfigurable Systems, 1996.
[18] M. Vasilko, Djamel Ait-Boudaoud, "Scheduling for dynamically Reconfigurable FPGAs," Proceedings of International Workshop on Logic and Architecture Synthesis, pp. 328-336, IFIP TC10 WG10.5, Dec. 18-19 1995.
[19] Doug Smith, Dinesh Bhatia, "RACE: Reconfigurable and Adaptive Computing Environment," Field Programmable Logic: Smart Applications, New Paradigms and Compilers, Proceedings of 6th Int. Workshop on Field Programmable Logic and Applications, FPL 96, Darmstadt, Germany, Sept. 23-25 1996. See http://www.ececs.uc.edu/ ~ dal.
[20] Xilinx Netlist Format (XNF) Specification, Version 6.1, June 1, 1995.
[21] Xilinx XABEL reference manual.
SOC CAD Companies
• Avant! www.avanticorp.com
• Cadence www.cadence.com
• Duet Tech www.duettech.com
• Escalade www.escalade.com
• Logic visions www.logicvision.com
• Mentor Graphics www.mentor.com
• Palmchip www.palmchip.com
• Sonic www.sonicsinc.com
• Summit Design www.summit-design.com
• Synopsys www.synopsys.com
• Topdown design solutions www.topdown.com
• Xynetix Design Systems www.xynetix.com
• Zuken-Redac www.redac.co.uk