1
Low Power System on Chip Design
2
System Level Power Optimization
• Algorithm selection / algorithm transformation
• Identification of hot spots
• Low Power data encoding
• Quality of Service vs. Power
• Low Power Memory mapping
• Resource Sharing / Allocation
3
Levels for Low Power Design
Level of Abstraction   Expected Saving   Representative Techniques
System                 10 - 100 times    Hardware-software partitioning, Instruction set selection, Power down
Algorithm              10 - 90%          Complexity, Concurrency, Locality, Regularity, Data representation
Architecture           20 - 40%          Parallelism, Pipelining, Signal correlations, Data representation
Circuit/Logic          10 - 30%          Sizing, Logic style, Logic design
Technology (Device)    10 - 30%          Threshold reduction, Scaling, Advanced packaging, SOI
4
Elements Required to Implement a High Performance System
[Diagram: a high performance system rests on three pillars — High Speed, High Density, and Low Power per Gate — supported by reduced-swing logic, low voltage, low VT, advanced technology, deep-submicron technology, channel engineering, and low capacitance.]
5
System Level Power Optimization
• Algorithm selection / algorithm transformation
• Identification of hot spots
• Low Power data encoding
• Quality of Service vs. Power
• Low Power Memory mapping
• Resource Sharing / Allocation
6
Considerations on Power Consumption
• Components of power consumption in a digital circuit:

  P = a · f · C · VDD² + I_leak · VDD + Q_short_circuit · f · VDD

  a : switching activity
  f : frequency
  C : capacitance
  VDD : supply voltage
  I_leak : leakage current
  Q_short_circuit : short-circuit charge
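The decomposition above can be sketched numerically. The operating point below (activity factor, frequency, capacitance, leakage, short-circuit charge) is an illustrative assumption, not a value from the slides.

```python
def total_power(a, f, C, VDD, I_leak, Q_sc):
    """Return (switching, leakage, short-circuit) power in watts,
    following P = a*f*C*VDD^2 + I_leak*VDD + Q_sc*f*VDD."""
    p_switch = a * f * C * VDD ** 2        # dynamic (switching) term
    p_leak = I_leak * VDD                  # static leakage term
    p_sc = Q_sc * f * VDD                  # short-circuit term
    return p_switch, p_leak, p_sc

# Illustrative operating point: 100 MHz, 1 nF switched capacitance, 2.5 V
p_sw, p_lk, p_sc = total_power(a=0.2, f=100e6, C=1e-9, VDD=2.5,
                               I_leak=1e-3, Q_sc=1e-12)
```

Because the switching term scales with VDD², halving the supply voltage cuts it by 4x — which is why so many later slides target the supply voltage first.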
7
Vdd, power, and current trend
[Figure: ITRS projection for 1998-2014 — supply voltage (left axis, 0-2.5 V) falls from about 2.5 V toward 0.5 V, while power per chip (0-200 W) and VDD current (0-500 A) rise over the same years.]
International Technology Roadmap for Semiconductors 1998 update
8
Three Factors Affecting Energy
– Reducing waste by hardware simplification: redundant hardware extraction, locality of reference, demand-driven / data-driven computation, application-specific processing, preservation of data correlations, distributed processing
– All-in-one approach (SoC): I/O pin and buffer reduction
– Voltage-reducible hardware:
  – 2-D pipelining (systolic arrays)
  – SIMD parallel processing: useful for data with parallel structure
  – VLIW: a flexible approach
9
Design Methods to Reduce Power Consumption
• Adjusting the supply voltage
– Use a high voltage only where high speed is required on the chip.
– Put unused blocks into sleep mode to reduce power consumption.
• Lowering the operating frequency
– Use parallel processing to achieve the same throughput at a lower clock frequency; the resulting area increase is unavoidable.
– Avoid large clock buffers.
– Use a phase-locked loop (PLL) to raise the frequency only where it is needed.
10
Design Methods to Reduce Power Consumption
• Reducing parasitic capacitance
– Use short wires on critical nodes.
– Avoid fan-out greater than three.
– Reduce wire width when a low supply voltage is used.
– Use the smallest transistors possible.
• Reducing switching activity
– Reduce the number of bits.
– Prefer static circuits over dynamic circuits.
– Reduce the total number of transistors.
– Make the most active node an internal node.
11
Design Methods to Reduce Power Consumption
• Reducing switching activity
– Design the logic so that the sum over all nodes of frequency times capacitance is minimized, i.e., so that the switching activity is statistically minimal:

  min Σ_{i=1}^{n} f_i · C_i ,   f_i : mean switching frequency of node i,   C_i : capacitance of node i

– When building a logic tree, place inputs with higher activity farther from VDD or ground.
– Implement high-activity cells as dynamic logic and low-activity cells as static logic.
– Turn off the clock of flip-flops whose data do not change.
– Make it possible to disable the clock of cells that are not always in use.
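The minimization objective above can be evaluated directly. The two candidate netlists below are hypothetical; the point is only that the design whose high-frequency node drives the small capacitance wins.

```python
def switched_capacitance(nodes):
    """Sum of f_i * C_i over all nodes; nodes = list of (f_i [Hz], C_i [F])."""
    return sum(f_i * c_i for f_i, c_i in nodes)

# Same two nodes, two wirings: put the hot (50 MHz) node on the small cap...
design_a = [(50e6, 20e-15), (10e6, 100e-15)]
# ...or on the large cap.
design_b = [(50e6, 100e-15), (10e6, 20e-15)]

best = min((design_a, design_b), key=switched_capacitance)
```

design_a switches 2e-6 F/s against 5.2e-6 F/s for design_b, so the statistical-minimization rule picks design_a.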
12
Web browsing is slow with 802.11 PSM
[Cartoon]
Dad: “Son! Haven’t I told you to turn on power-saving mode? Batteries don’t grow on trees, you know!”
Son: “But dad! Performance SUCKS when I turn on power-saving mode!”
Dad: “So what! When I was your age, I walked 2 miles through the snow to fetch my Web pages!”
• Users complain about performance degradation
13
IBM’s PowerPC Low Power Architecture
• Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution
– The 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU)
– The FPU is pipelined, so a multiply-add instruction can be issued every clock cycle
– Low power 3.3-volt design
• Use a small, simple instruction set with shorter instructions
– IBM’s PowerPC 603e is RISC
• Superscalar: CPI < 1
– The 603e issues as many as three instructions per cycle
• Low power management
– The 603e provides four software-controllable power-saving modes
• Copper process with SOI
• IBM’s Blue Logic ASIC: the new design reduces power by a factor of 10
14
Power-Down Techniques
◆ Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work
15
Voltage vs. Delay
• Use variable voltage scaling or scheduling for real-time processing.
• Use architecture optimization to compensate for the slower operation, e.g., parallel processing and pipelining to increase concurrency and shorten the critical path.
16
Why Copper Processor?
• Motivation: aluminum resists the flow of electricity as wires are made thinner and narrower.
• Performance: 40% speed-up
• Cost: 30% less expensive
• Power: less power drawn from batteries
• Chip size: 60% smaller than an aluminum chip
17
Silicon-on-Insulator
• How does SOI reduce capacitance?
Junction capacitance is eliminated because an insulator (similar to glass) is placed between the impurity regions and the silicon substrate → high performance, low power, low soft-error rate.
18
Clock Network Power Management
• The clock network consumes about 50% of total power.
• FIR (massively pipelined circuit): video processing (edge detection), voice processing (data transmission such as xDSL).
• Telephony: the line is idle about 50% of the time (a 70%/30% split), since both parties do not speak at the same time.
• With every clock cycle, data are loaded into the working register banks even if there are no data changes.
19
Partitioning
• Performance requirements
– Some functions are easier to implement in hardware
– Blocks that are used repeatedly
– Blocks organized in parallel
• Modifiability
– Blocks implemented in software are easy to modify
• Implementation cost
– Blocks implemented in hardware can be shared
• Scheduling
– Schedule the blocks partitioned into HW and SW so that the given constraints are met
– SW operations must be scheduled sequentially
– SW and HW can be scheduled concurrently as long as there are no data or control dependencies
20
Low Power Partitioning Approach
• Different HW resources are invoked according to the instruction executed at a specific point in time.
• During the execution of an add operation, the ALU and registers are used, but the multiplier is idle.
• Non-active resources still consume energy because their circuits continue to switch.
• Calculate the wasted energy.
• Add application-specific cores and run them selectively: whenever one core is running, all the other cores are shut down.
21
Design Flow
[Flow diagram: Application → divide application into clusters → select cluster → list schedule → compute utilization rate (ASIC) / compute utilization rate (µP) → core energy estimation → HW synthesis → evaluate]
– Max 94% energy saving, and in most cases even reduced execution time
– 16k-cell overhead
22
Integrated H/W and S/W Low Power Design Optimization
[Flow diagram: algorithm selection → clustering → cluster selection → cluster scheduling → HW energy-efficiency calculation / SW energy-efficiency calculation → H/W synthesis and energy estimation / S/W core energy estimation → system-level energy estimation → HW/SW integration]
– Max 94% energy saving, and in most cases even reduced execution time
– 16k-cell overhead
23
IS-95 CDMA Searcher: Integrated H/W and S/W Design
[Design-space exploration graph: PN-code generation → synchronous accumulator (SW, or HW variants 1 and 2) → comparator (SW, or HW with precomputation) → asynchronous accumulator (SW or HW) → GOAL. Each candidate implementation is annotated with its cost (speed, area, power) and an SW or HW energy estimate; the selected path reaches GOAL at minimum energy.]
In-Ki Hwang (황인기), Sungkyunkwan University
24
Low Power DSP
• DO-loop dominant:
– VSELP vocoder: 83.4%
– 2-D 8x8 DCT: 98.3%
– LPC computation: 98.0%
• DO-loop power minimization ==> DSP power minimization
VSELP: Vector Sum Excited Linear Prediction
LPC: Linear Prediction Coding
25
Loop Unrolling
• The technique of loop unrolling replicates the body of a loop some number of times (the unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases instruction parallelism, and improves register, data-cache, or TLB locality.
In the example below, loop overhead is cut in half because two iterations are performed per pass. If array elements are assigned to registers, register locality improves because A(i) and A(i+1) are each used twice in the loop body. Instruction parallelism increases because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.
  for i = 2 to N-1
      A(i) = A(i) + A(i-1) * A(i+1)

  for i = 2 to N-2 step 2
      A(i)   = A(i)   + A(i-1) * A(i+1)
      A(i+1) = A(i+1) + A(i)   * A(i+2)
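The transformation can be checked in executable form. The Python below mirrors the pseudocode for this slide (switched to 0-based indexing) and is only a sketch of the u = 2 case.

```python
def rolled(a):
    """Reference loop: A[i] = A[i] + A[i-1]*A[i+1] over interior elements."""
    a = list(a)
    for i in range(1, len(a) - 1):
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a

def unrolled(a):
    """Same computation with unrolling factor u = 2."""
    a = list(a)
    n = len(a)
    i = 1
    while i + 1 < n - 1:                    # two iterations per pass
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]
        i += 2
    while i < n - 1:                        # leftover iteration if count is odd
        a[i] = a[i] + a[i - 1] * a[i + 1]
        i += 1
    return a
```

Both versions perform the identical operations in the identical order, so the results match exactly; only the per-iteration branch and index-update overhead differs.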
26
Loop Unrolling (IIR filter example)
Loop unrolling localizes the data, reducing the activity at the inputs of the functional units: two output samples are computed in parallel from two input samples.
Neither the switched capacitance nor the voltage is altered by unrolling itself. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation, the transformation yields a critical path of 3, so the voltage can be dropped.
Original recurrence:

  Y_n = X_n + A · Y_{n-1}

Unrolled once:

  Y_{n-1} = X_{n-1} + A · Y_{n-2}
  Y_n = X_n + A · X_{n-1} + A² · Y_{n-2}
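The unrolled recurrence can be checked against the original filter. This sketch (function names ours) produces two outputs per loop iteration, with A and A² available as precomputed constants after constant propagation.

```python
def iir(x, A, y0=0.0):
    """Reference first-order IIR: Y[n] = X[n] + A*Y[n-1]."""
    y, prev = [], y0
    for xn in x:
        prev = xn + A * prev
        y.append(prev)
    return y

def iir_unrolled(x, A, y0=0.0):
    """Unrolled-by-2 form: two outputs per iteration from Y[n-2]."""
    assert len(x) % 2 == 0, "sketch assumes an even number of samples"
    y, prev = [], y0                        # prev holds Y[n-2]
    A2 = A * A                              # constant, precomputed
    for n in range(1, len(x), 2):
        y_odd = x[n - 1] + A * prev               # Y[n-1]
        y_even = x[n] + A * x[n - 1] + A2 * prev  # Y[n]
        y += [y_odd, y_even]
        prev = y_even
    return y
```

The two loop bodies are independent enough to run concurrently, which is what lets the critical path shrink and the voltage drop.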
30
Designing a Parallel FIR
To obtain a parallel processing structure, the SISO (single-input single-output) system must be converted into a MIMO (multiple-input multiple-output) system:

  y(3k)   = a·x(3k)   + b·x(3k-1) + c·x(3k-2)
  y(3k+1) = a·x(3k+1) + b·x(3k)   + c·x(3k-1)
  y(3k+2) = a·x(3k+2) + b·x(3k+1) + c·x(3k)

Parallel processing systems are also referred to as block processing systems.
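The three MIMO equations above can be sketched as one block-processing step (function name and argument layout are ours): each call consumes three new inputs plus the two previous samples and emits three outputs.

```python
def fir_block3(x_block, x_prev2, coeffs):
    """3-parallel 3-tap FIR block.
    x_block = (x[3k], x[3k+1], x[3k+2]); x_prev2 = (x[3k-2], x[3k-1])."""
    a, b, c = coeffs
    x3k, x3k1, x3k2 = x_block
    xm2, xm1 = x_prev2
    y0 = a * x3k + b * xm1 + c * xm2        # y(3k)
    y1 = a * x3k1 + b * x3k + c * xm1       # y(3k+1)
    y2 = a * x3k2 + b * x3k1 + c * x3k      # y(3k+2)
    return y0, y1, y2
```

All three outputs depend only on block inputs and the two carried samples, so they can be computed concurrently — the block clock runs at one third of the sample rate.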
31
Parallel Processing (2)
Parallel processing architecture for a 3-tap FIR filter (with block size 3)
32
Parallel Processing (3)<Combined fine-grain pipelining and parallel processing for 3-tap FIR filter>
36
Why Hardware for Motion Estimation?
• The most computationally demanding part of video encoding
• Example: CCIR 601 format
– 720 by 576 pixels
– 16 by 16 macroblock (n = 16)
– 32 by 32 search area (p = 8)
– 25 Hz frame rate (f_frame = 25)
• About 9 giga-operations/sec are needed for the full-search block-matching algorithm.
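The 9 GOPS figure can be reproduced with a quick operation count. The assumptions that each candidate pixel costs 3 operations (subtract, absolute value, accumulate) and that there are (2p+1)² candidate motion vectors are ours.

```python
width, height, f_frame = 720, 576, 25      # CCIR 601 at 25 Hz
n, p = 16, 8                               # macroblock size, search range

candidates = (2 * p + 1) ** 2              # candidate motion vectors per block
ops_per_pixel = 3                          # subtract, abs, accumulate (assumed)

# Every pixel of every frame is matched against every candidate vector:
ops_per_sec = width * height * f_frame * candidates * ops_per_pixel
```

This evaluates to about 8.99e9 operations per second, matching the slide's ~9 GOPS.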
37
Why Reconfiguration in Motion Estimation?
• Adjusting the search area at frame rate according to the changing characteristics of the video sequence
• Reducing power consumption by avoiding unnecessary computation
[Figure: motion vector distributions]
38
Architecture for Motion Estimation
From P. Pirsch et al, VLSI Architectures for Video Compression, Proc. Of IEEE, 1995
44
Motion Estimation - Data Reuse
[Figure: processing-element array in which neighboring PEs reuse partial results — add and absolute-difference outputs are shared rather than recomputed.]
Therefore, the power reduction factor is 11%.
45
Vector Quantization
• A lossy compression technique that exploits the correlation between neighboring samples and quantizes samples together.
46
Complexity of VQ Encoding
The distortion metric between an input vector X and a codebook vector C_i is computed as follows:

  D_i = Σ_{j=0}^{15} (X_j - C_{i,j})²

Three VQ encoding algorithms will be evaluated: full search, tree search, and differential-codebook tree search.
47
Full Search
• Brute-force VQ: the distortion between the input vector and every entry in the codebook is computed, and the code index that corresponds to the minimum distortion is determined and sent to the decoder.
• Each distortion computation involves 16 8-bit memory accesses (to fetch the entries of the codeword), 16 subtractions, 16 multiplications, and 15 additions. In addition, the minimum of 256 distortion values must be determined, which involves 255 comparison operations.
48
Tree-structured Vector Quantization
If, for example, at level 1 the input vector is closer to the left entry, then the right portion of the tree below level 2 is never compared and an index bit 0 is transmitted.
Only 2 x log2(256) = 16 distortion calculations and 8 comparisons are required.
49
Algorithmic Optimization
• Minimizing the number of operations
– Example: video data stream using the vector quantization (VQ) algorithm
– Distortion metric:

  D_i = Σ_{j=0}^{15} (X_j - C_{i,j})²

– Full-search VQ
  • exhaustive full search
  • distortion calculations: 256
  • value comparisons: 255
– Tree-structured VQ
  • binary tree search
  • some performance degradation
  • distortion calculations: 16 (2 x log2 256)
  • value comparisons: 8
[Figure: binary search tree over the 256-entry codebook, levels 1 through 8, with 0/1 branch labels at each node.]
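Both search strategies can be sketched end to end. The random codebook and the centroid-built internal nodes below are illustrative (a real TSVQ codebook is trained, and tree search is only near-optimal, hence the slide's "some performance degradation").

```python
import random

def distortion(x, c):
    """D = sum_j (x_j - c_j)^2 over the 16 components."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def full_search(x, codebook):
    """256 distortion calculations, 255 comparisons."""
    return min(range(len(codebook)), key=lambda i: distortion(x, codebook[i]))

def tree_search(x, tree):
    """2 distortions per level * 8 levels = 16 calculations, 8 comparisons.
    tree is a complete binary tree stored in an array; positions 255..510
    hold the 256 leaf codewords."""
    node = 0
    for _ in range(8):
        left, right = 2 * node + 1, 2 * node + 2
        node = left if distortion(x, tree[left]) <= distortion(x, tree[right]) else right
    return node - 255                      # leaf position -> codeword index

# Illustrative codebook: random leaves, internal nodes = child centroids.
random.seed(0)
codebook = [[random.random() for _ in range(16)] for _ in range(256)]
tree = [None] * 511
for i, cw in enumerate(codebook):
    tree[255 + i] = cw
for node in range(254, -1, -1):
    tree[node] = [(l + r) / 2 for l, r in zip(tree[2 * node + 1], tree[2 * node + 2])]
```

The tree search touches only 16 codevectors instead of 256, which is where the operation counts in the comparison table come from.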
50
Differential Codebook Tree-structured Vector Quantization
• The distortion difference between the left and right nodes needs to be computed. This equation can be manipulated to reduce the number of operations.
51
Algorithmic Optimization
– Differential-codebook tree-structured VQ
• Modify the equation to optimize the operations:

  D_left - D_right = Σ_{j=0}^{15} (X_j - C_{left,j})² - Σ_{j=0}^{15} (X_j - C_{right,j})²
                   = Σ_{j=0}^{15} (C_{left,j}² - C_{right,j}²) - 2 Σ_{j=0}^{15} X_j (C_{left,j} - C_{right,j})

  algorithm                  # of mem. access   # of mul.   # of add.   # of sub.
  full search                4096               4096        3840        4096
  tree search                256                256         240         264
  differential tree search   136                128         128         0
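The algebraic rewrite above can be verified numerically: expanding both squared sums cancels the X_j² contributions, leaving a codebook-only term (precomputable offline) plus a single inner product with X. The random test vectors are illustrative.

```python
import random

random.seed(1)
x = [random.uniform(-1, 1) for _ in range(16)]
c_left = [random.uniform(-1, 1) for _ in range(16)]
c_right = [random.uniform(-1, 1) for _ in range(16)]

# Direct form: two full distortion computations, then a subtraction.
direct = sum((xj - cl) ** 2 for xj, cl in zip(x, c_left)) \
       - sum((xj - cr) ** 2 for xj, cr in zip(x, c_right))

# Rewritten form: the first sum depends only on the codebook,
# so it can be stored with the tree node at build time.
precomputed = sum(cl * cl - cr * cr for cl, cr in zip(c_left, c_right))
rewritten = precomputed - 2 * sum(xj * (cl - cr)
                                  for xj, cl, cr in zip(x, c_left, c_right))
```

With the codebook term precomputed, the online work per node drops to 16 multiplications and additions with no subtractions, consistent with the differential row of the table.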
52
Multiplication and Accumulation: MAC
• The major operation in DSP
• A multiplication costs more than about 5 ALU operations (MUL > 5 * ALU)
[Figure: MAC datapath — operands X and Y feed the multiplier (CSA + CPA); the result is accumulated (ACC) into the product register (PR), alongside the ALU.]
[Modified Booth encoding] One of 0, X, -X, 2X, -2X is selected based on each 2 bits of Y.
53
Operand Swapping (1/2)
• Weight = how many additions are needed?
• By Booth encoding, Y = 0011110000... recodes to a form with only two nonzero digits (0X000X0...), so weight = 2.

  Operands (A / B)          A*B (mW)   B*A (mW)   Saving
  7FFF AAAA / 0001 AAAA     22.0       10.0       54%
  7FFF 6666 / 0001 AAAA     31.6       10.0       68%
  7FFF AAAA / 0001 0001     28.8       12.2       58%

• Swapping the operands so that the low-weight (even if high-switching) operand is Booth-encoded reduces the measured current.
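Operand weight can be computed with a small radix-4 (modified) Booth recoder. The digit table below is the standard one; the helper names are ours.

```python
# Radix-4 Booth digit for each overlapping bit triple (y[2i+1], y[2i], y[2i-1]).
BOOTH = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
         0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_digits(y, bits=16):
    """Radix-4 Booth digits of a two's-complement value (LSB digit first)."""
    y_ext = (y & ((1 << bits) - 1)) << 1   # append the implicit 0 bit
    return [BOOTH[(y_ext >> (2 * i)) & 0b111] for i in range(bits // 2)]

def booth_weight(y, bits=16):
    """Number of nonzero partial products the multiplier must add."""
    return sum(1 for d in booth_digits(y, bits) if d != 0)
```

booth_weight(0x0001) is 1 while booth_weight(0xAAAA) is 8, so feeding 0x0001 to the Booth-encoded port minimizes the additions — exactly the swap the current table above rewards.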
54
DIGLOG Multiplier
With A = 2^j + A_R and B = 2^k + B_R (j, k = positions of the leading ones):

  A · B = 2^j · B + 2^k · A_R + A_R · B_R

Dropping the A_R · B_R term gives the approximate product; the same step can be applied iteratively to the residual.

                       1st Iter   2nd Iter   3rd Iter
  Worst-case error     -25%       -6%        -1.6%
  Prob. of error <1%   10%        70%        99.8%

With an 8 by 8 multiplier, the exact result can be obtained in at most seven iteration steps (worst case).
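A single DIGLOG iteration is easy to sketch for positive integers (the function name is ours): take the leading-one positions j and k, and drop the A_R·B_R cross term.

```python
def diglog_step(a, b):
    """One DIGLOG iteration for integers a, b >= 1.
    Returns (approximate product, dropped residual a_r * b_r)."""
    j, k = a.bit_length() - 1, b.bit_length() - 1
    a_r, b_r = a - (1 << j), b - (1 << k)   # A = 2^j + A_R, B = 2^k + B_R
    return (b << j) + (a_r << k), a_r * b_r

approx, residual = diglog_step(12, 10)      # 12 = 2^3 + 4, 10 = 2^3 + 2
```

Since approx + residual is exact, iterating the step on the residual converges to the true product; the worst first-iteration case (both residues near their maximum) loses just under 25%, matching the table.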
55
Voltage Scaling
• Merely changing the processor clock frequency is not an effective technique for reducing energy consumption. Reducing the clock frequency reduces the power consumed by a processor, but it does not reduce the energy required to perform a given task.
• Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work.
56
Different Voltage Schedules
(Energy consumption ∝ Vdd²; timing constraint = 25 s)

  (A) 1000 Mcycles at 50 MHz, Vdd = 5.0 V                             → 40 J
  (B) 750 Mcycles at 50 MHz (5.0 V) + 250 Mcycles at 25 MHz (2.5 V)   → 32.5 J
  (C) 1000 Mcycles at 40 MHz, Vdd = 4.0 V                             → 25 J
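The three energies can be reproduced from E = k · cycles · Vdd², calibrating k so that schedule (A) dissipates 40 J. The constant k is our assumption; the slide gives only the resulting energies, and it rounds schedule (C) down to 25 J.

```python
# Calibrate k from schedule (A): 1000 Mcycles at 5.0 V -> 40 J.
K = 40.0 / (1000e6 * 5.0 ** 2)             # J per (cycle * V^2), assumed

def energy(segments):
    """Total energy for a schedule; segments = list of (Mcycles, Vdd)."""
    return sum(K * mc * 1e6 * vdd ** 2 for mc, vdd in segments)

e_a = energy([(1000, 5.0)])                # run fast, then idle
e_b = energy([(750, 5.0), (250, 2.5)])     # drop voltage once ahead of schedule
e_c = energy([(1000, 4.0)])                # run just fast enough for the deadline
```

Running just fast enough at the lowest voltage that still meets the 25 s deadline (schedule C, about 25.6 J here) beats both racing to idle (A) and the two-step schedule (B).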
57
Data-Driven Signal Processing
The basic idea of averaging: two samples are buffered and their workloads are averaged. The averaged workload is then used as the effective workload to drive the power supply.
Using a ping-pong buffering scheme, data samples I_{n+2} and I_{n+3} are buffered while I_n and I_{n+1} are being processed.
59
A Hardware / Software Partitioning Technique with Hierarchical Design Space Exploration
Houria Oudghiri, Bozena Kaminska, and Janusz Rajski, Mentor Graphics Corp.
• A set of DSP examples is considered for co-design in order to accelerate their performance on a target architecture consisting of a standard DSP processor running concurrently with a custom SIMD (Single Instruction Multiple Data) processor.
60
Proposed Methodology
Input: list of blocks and time constraints. Output: two subsets where blocks are assigned.
Step 1: construct the complete weighted dependency graph G
Step 2: assign all blocks to software; compute the complete system execution time
Step 3: while (time constraints not satisfied) do
  Step 3.i: select the node with the maximum execution time (i)
  Step 3.ii: assign i to hardware; update the system execution time
  Step 3.iii: while (time constraints not satisfied) do
    Step 3.iii.1: select the maximum-weighted edge connecting i to the most time-consuming node (j)
    Step 3.iii.2: assign j to hardware; update the dependency graph G and the system execution time
  end
end
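The greedy move-to-hardware loop of Step 3 can be sketched as follows. The data structures and timing numbers are illustrative, and this simplified version uses a crude additive time model and omits the edge-based refinement (Step 3.iii) and the dependency-graph update.

```python
def partition(sw_time, hw_time, constraint):
    """Greedy HW/SW partitioning sketch.
    sw_time / hw_time: dict of node -> execution time in each mapping.
    Returns (set of nodes moved to hardware, resulting system time)."""
    hw = set()

    def total():                            # crude additive system time
        return sum(hw_time[n] if n in hw else sw_time[n] for n in sw_time)

    while total() > constraint:
        # Step 3.i/3.ii: move the most time-consuming software node to HW.
        cand = max((n for n in sw_time if n not in hw), key=sw_time.get)
        hw.add(cand)
    return hw, total()

# Illustrative block timings (ms) in software and hardware:
sw = {"fft": 12.0, "filter": 8.0, "ctrl": 2.0}
hw_t = {"fft": 2.0, "filter": 1.5, "ctrl": 1.8}
hw, t = partition(sw, hw_t, constraint=10.0)
```

Starting from an all-software mapping (22 ms here), the loop moves "fft" and then "filter" to hardware, stopping as soon as the constraint is met, which mirrors Steps 2 and 3 above.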
61
Co-design Target Architecture
The Texas Instruments DSP processor TMS320C40 is used as the master processor, and the custom SIMD processor PULSE (Parallel Ultra Large Scale Engine, 4 processors in parallel) as the slave processor.
62
The Hierarchical Model of the FFT Transform
[Figure: eight-level hierarchical decomposition of the FFT. Level 1: FFT. Level 2: Initialize, Bit Reversal, Danielson control, Output. Lower levels decompose each of these into initialization, loop-body, and increment blocks — e.g., Initialize Variables, Initialize Data; Bit_init, Bit_loop1, Bit_loop2, Bit_incr, Bit_shift, Bit_cond, Bit_acc, Bit_test, Bit_swap1, Bit_swap2; Dan_init, Dan_loop, Dan_loop1, Dan_loop2, Dan_real, Dan_imag; Out_init, Out_write, Out_incr; Index_init, Read_data, Index_incr, Data_test; Loop1_init, Loop1_body, Loop1_incr; Loop2_init, Loop2_body, Loop2_incr; Update Variables.]
63
Block Assignment at Different Hierarchical Levels
(time constraint = 25 ms)

  level   Nb. of blocks   C40   PULSE   PULSE time (ms)   C40 time (ms)   Total (ms)
  1       4               2     2       18.14             4.8             22.94
  2       10              6     4       18.8              2.96            21.76
  3       17              11    6       15.56             9               24.56
  4       22              18    6       14.68             10.24           24.92
  5       24              17    7       14.56             10.4            24.94
  6       24              22    2       6.82              17.72           24.54
  7       25              22    3       7                 17.92           24.92
  8       27              18    9       5.88              18.64           24.52
66
SystemC supports:
– Mentor Graphics - Seamless® C-Bridge™
– Verisity - SpecMan™ Elite
– Forte Design Systems - ESC Library
– Emulation & Verification Engineering - Zebu
– Axys Design - MaxSim™
– CoWare - N2C, updated for SystemC 2.0
– Cadence - SPW 4.8 / SystemC v2.0 IF
– Synopsys - CoCentric System Studio
• Plus the Kluwer book “System Design Using SystemC”, 2002
69
Specification and Modeling
• Executable specification - Verilog, VHDL, C, C++, Java.
• Common models: synchronous dataflow (SDF), sequential programs (Prog.), communicating sequential processes (CSP), object-oriented programming (OOP), FSMs, hierarchical/concurrent FSMs (HCFSM).
• Depending on the application domain and specification semantics, these are based on different models of computation.
70
Hardware Synthesis
• Many RTL, logic-level, and physical-level commercial CAD tools.
• Some emerging high-level synthesis tools: Behavioral Compiler (Synopsys), Monet (Mentor Graphics), and RapidPath (DASYS).
• Many open problems: memory optimization, parallel heterogeneous hardware architectures, programmable hardware synthesis and optimization, communication optimization.
71
Software Synthesis
• The use of real-time operating systems (RTOSs)
• The use of DSPs and micro-controllers - code generation issues
• Special-processor compilation is in many cases still far less efficient than manual code generation!
• Retargeting issues - C code developed for the TI TMS320C6x is not optimized for running on the Philips TriMedia processor.
72
Interface Synthesis
• Interfaces between:
  - hardware-hardware
  - hardware-software
  - software-software
• Timing and protocols
• Recently, the first commercial tools appeared: the CoWare system (hw-sw protocols) and the Synopsys Protocol Compiler (hw interface synthesis tool)
73
Co-design Sites
• Bibliography of Hardware/Software Codesign: http://www-ti.informatik.uni-tuebingen.de/~buchen/
• Ralf Niemann's Codesign Links and Literature: http://ls12-www.informatik.uni-dortmund.de/~niemann/codesign/codesign_links.html
• URLs to Hardware/Software Co-Design Research: http://www.ece.cmu.edu/~thomas/hsURL.html
• RASSP Architecture Guide: http://www.sanders.com/hpc/ArchGuide/TOC.html
• EDA, Electronic Design Automation: http://www.eda.org
• COMET (Case Western Reserve University): http://bear.ces.cwru.edu/research/hard_soft.html
• COSMOS (Tima - Cmp, France): http://tima-cmp.imag.fr/Homepages/cosmos/research.html
• COSYMA (Braunschweig): http://www.ida.ing.tu-bs.de/projects/cosyma/
• Handel-C (Oxford): http://oldwww.comlab.ox.ac.uk/oucl/hwcomp.html
• Lycos (Technical University of Lyngby, Denmark): http://www.it.dtu.dk/~lycos/
• MOVE (Technical University Delft): http://cardit.et.tudelft.nl/MOVE/
• Polis (University of Berkeley): http://www-cad.eecs.berkeley.edu/Respep/Research/hsc/abstract.html
• ProCos (UK Research): http://www.comlab.ox.ac.uk/archive/procos/codesign.html
• Ptolemy (University of Berkeley): http://ptolemy.eecs.berkeley.edu/
• SPAM (Princeton): http://www.ee.princeton.edu/~spam/
• TRADES (University of Twente, INF/CAES): http://wwwspa.cs.utwente.nl/aid/aid.html
• SystemC: http://www.systemc.org
74
SOC CAD Companies
• Cadence: www.cadence.com
• Duet Tech: www.duettech.com
• Escalade: www.escalade.com
• Logic Vision: www.logicvision.com
• Mentor Graphics: www.mentor.com
• Palmchip: www.palmchip.com
• Sonics: www.sonicsinc.com
• Summit Design: www.summit-design.com
• Synopsys: www.synopsys.com
• Topdown Design Solutions: www.topdown.com
• Xynetix Design Systems: www.xynetix.com
• Zuken-Redac: www.redac.co.uk