1 low power system on chip design. 2 system level power optimization algorithm selection / algorithm...

1 Low Power System on Chip Design

Upload: angela-hancock

Post on 18-Jan-2018



Low Power System on Chip Design


System Level Power Optimization

• Algorithm selection / algorithm transformation
• Identification of hot spots
• Low Power data encoding
• Quality of Service vs. Power
• Low Power Memory mapping
• Resource Sharing / Allocation


Levels for Low Power Design

Level of Abstraction    Expected Saving
Algorithm               10 - 100 times
Architecture            10 - 90%
Logic Level             20 - 40%
Layout Level            10 - 30%
Device Level            10 - 30%

Level            Techniques
System           Hardware-software partitioning, Power down
Algorithm        Complexity, Concurrency, Locality, Regularity, Data representation
Architecture     Parallelism, Pipelining, Signal correlations, Instruction set selection, Data rep.
Circuit/Logic    Sizing, Logic Style, Logic Design
Technology       Threshold Reduction, Scaling, Advanced packaging, SOI


Factors in Implementing a High Performance System

[Diagram: High Performance System, linked to the following factors]
• Reduced Swing Logic
• Low Voltage
• Low VT
• Advanced Technology
• High Speed
• Deep Submicron Technology
• Channel Engineering
• High Density
• Low Power per Gate
• Low Capacitance


System Level Power Optimization

• Algorithm selection / algorithm transformation
• Identification of hot spots
• Low Power data encoding
• Quality of Service vs. Power
• Low Power Memory mapping
• Resource Sharing / Allocation


A Look at Power Consumption
• Components of power consumption in digital circuits

$P = \alpha f C V_{DD}^2 + I_{leak} V_{DD} + Q_{short\ circuit} f V_{DD}$

α: Switching Activity
f: Frequency
C: Capacitance
V_DD: Supply Voltage
I_leak: Leakage Current
Q_short circuit: Short Circuit Charge
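The three components of digital power consumption (switching, leakage, and short-circuit) can be evaluated numerically. The following is a minimal sketch, assuming the usual form P = α f C V_DD² + I_leak V_DD + Q_sc f V_DD; the function name and the operating point are hypothetical and chosen only for illustration.

```python
def total_power(alpha, f_hz, c_farad, vdd, i_leak, q_sc):
    """Return (switching, leakage, short_circuit, total) power in watts."""
    switching = alpha * f_hz * c_farad * vdd ** 2   # alpha * f * C * VDD^2
    leakage = i_leak * vdd                          # I_leak * VDD
    short_circuit = q_sc * f_hz * vdd               # Q_sc * f * VDD
    return switching, leakage, short_circuit, switching + leakage + short_circuit


# Hypothetical operating point (not taken from the slides).
sw, lk, sc, total = total_power(alpha=0.2, f_hz=100e6, c_farad=1e-9, vdd=2.5,
                                i_leak=1e-3, q_sc=1e-12)

# Same circuit at half the supply voltage: the switching term drops by 4x
# because it depends on VDD squared.
sw_half, _, _, _ = total_power(0.2, 100e6, 1e-9, 1.25, 1e-3, 1e-12)
```

The quadratic dependence of the switching term on the supply voltage is the reason most of the techniques in this deck aim at lowering V_DD.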


Vdd, power, and current trend

[Figure: supply voltage (0.0 to 2.5 V), power per chip [W], and VDD current [A] plotted against year, 1998 to 2014]

International Technology Roadmap for Semiconductors 1998 update


Three Factors affecting Energy
– Reducing waste by hardware simplification: redundant h/w extraction, locality of reference, demand-driven / data-driven computation, application-specific processing, preservation of data correlations, distributed processing
– All-in-one approach (SOC): I/O pin and buffer reduction
– Voltage-reducible hardware
  – 2-D pipelining (systolic arrays)
  – SIMD: parallel processing, useful for data with parallel structure
  – VLIW: flexible approach


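Design Methods for Reducing Power Consumption
• Adjusting the supply voltage
– Use a high voltage only where high speed is needed within the IC.
– Put unused blocks into sleep mode to reduce power consumption.
• Lowering the operating frequency
– Use parallel processing to obtain the same throughput at a lower operating frequency. The resulting increase in area is unavoidable.
– Avoid large clock buffers.
– Use a Phase Locked Loop (PLL) to raise the frequency only where it is needed.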


Design Methods for Reducing Power Consumption
• Reducing parasitic capacitance
– Use short wires on critical nodes.
– Avoid fan-out of three or more.
– Reduce the wire width when using a low voltage.
– Use transistors that are as small as possible.
• Reducing switching activity
– Reduce the number of bits.
– Use static circuits rather than dynamic circuits.
– Reduce the total number of transistors.
– Make the most active nodes internal nodes.


Design Methods for Reducing Power Consumption
• Reducing switching activity
– Design the logic so that the sum over all nodes of frequency times capacitance is minimized; that is, so that switching activity is statistically minimized.
– When arranging a logic tree, place input signals farther from VDD or ground the higher their activity.
– Design high-activity cells as dynamic logic and low-activity cells as static logic.
– Turn off the clock of flip-flops whose data does not change.
– Make it possible to disable the clock of cells that are not always in use.

$\min \sum_{i=1}^{n} f_i C_i$

f_i: mean switching frequency of node i
C_i: capacitance of node i
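The objective of minimizing the sum of f_i C_i over all nodes can be used to compare candidate implementations of the same logic. The sketch below is illustrative only: the two designs and their per-node values are hypothetical, and a real flow would take them from simulation or probabilistic activity estimation.

```python
def switched_capacitance(nodes):
    """Sum of f_i * C_i over (mean switching frequency [Hz], capacitance [F]) pairs."""
    return sum(f * c for f, c in nodes)


# Two hypothetical implementations of the same function. Design A keeps the
# most active node (50 MHz) on the smallest capacitance.
design_a = [(50e6, 20e-15), (10e6, 80e-15), (5e6, 40e-15)]
design_b = [(50e6, 60e-15), (10e6, 40e-15), (5e6, 40e-15)]

# Pick the design with the smaller sum of f_i * C_i.
best = min((design_a, design_b), key=switched_capacitance)
```

Here design A wins because its highest-activity node drives the least capacitance, which is the same reasoning as keeping the most active nodes internal.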


Web browsing is slow with 802.11 PSM

"Son! Haven't I told you to turn on power-saving mode? Batteries don't grow on trees, you know!"
"But dad! Performance SUCKS when I turn on power-saving mode!"
"So what! When I was your age, I walked 2 miles through the snow to fetch my Web pages!"

• Users complain about performance degradation


IBM’s PowerPC Lower Power Architecture

• Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution
– 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU)
– FPU is pipelined so a multiply-add instruction can be issued every clock cycle
– Low power 3.3-volt design
• Use small complex instruction with smaller instruction length
– IBM's PowerPC 603e is RISC
• Superscalar: CPI < 1
– 603e issues as many as three instructions per cycle
• Low Power Management
– 603e provides four software controllable power-saving modes.
• Copper Processor with SOI
• IBM's Blue Logic ASIC: new design reduces power by a factor of 10


Power-Down Techniques

◆ Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work


Voltage vs Delay

• Use variable voltage scaling or scheduling for real-time processing
• Use architecture optimization to compensate for slower operation, e.g., parallel processing and pipelining to increase concurrency and shorten the critical path.


Why Copper Processor?
• Motivation: Aluminum resists the flow of electricity as wires are made thinner and narrower.
• Performance: 40% speed-up
• Cost: 30% less expensive
• Power: Less power from batteries
• Chip Size: 60% smaller than an aluminum chip


Silicon-on-Insulator
• How Does SOI Reduce Capacitance?

Junction capacitance is eliminated: an insulator (similar to glass) is placed between the impurities and the silicon substrate. The result is high performance, low power, and a low soft-error rate.


Clock Network Power Management

• 50% of the total power
• FIR (massively pipelined circuit):
– video processing: edge detection
– voice processing (data transmission like xDSL)
– telephony: 50% (70%/30%) idle, because the two parties do not talk at the same time
• With every clock cycle, data are loaded into the working register banks, even if there are no data changes.


Partitioning
• Performance Requirements
– Some functions are easier to implement in hardware
– Blocks that are used repeatedly
– Blocks with a parallel structure
• Modifiability
– Blocks implemented in software are easy to modify
• Implementation Cost
– Blocks implemented in hardware can be shared
• Scheduling
– Schedule the blocks partitioned into HW and SW so that they meet the given constraints
– SW operations must be scheduled sequentially
– SW and HW can be scheduled concurrently as long as there are no data or control dependencies


Low power partitioning approach

• Different HW resources are invoked according to the instruction executed at a specific point in time
• During the execution of the add operation, the ALU and registers are used, but the multiplier is idle.
• Non-active resources still consume energy since the corresponding circuits continue to switch
• Calculate the wasted energy
• Add application-specific cores and run them selectively: whenever one core is performing, all the other cores are shut down


Design Flow

- Max 94% energy saving, and in most cases even reduced execution time
- 16k cell overhead

[Flow diagram boxes: Application; Divide application in clusters; List schedule; Compute utilization rate (ASIC); Select cluster; Compute utilization rate (uP); Core Energy Estimation; HW Synthesis; Evaluate]


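Integrated H/W and S/W Low-Power Design Optimization

[Flow diagram boxes: Algorithm selection; Clustering; Cluster selection; Cluster scheduling; HW energy efficiency calculation; SW energy efficiency calculation; H/W synthesis and energy estimation; S/W core energy estimation; HW/SW integration; System-level energy estimation]

- Max 94% energy saving, and in most cases even reduced execution time
- 16k cell overhead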


IS-95 CDMA Searcher: Integrated H/W and S/W Design

황인기, Sungkyunkwan University

[Design-space diagram boxes: PN-Code Generation; Synchronous Accumulator (SW); Synchronous Accumulator 1 (HW); Synchronous Accumulator 2 (HW); Asynchronous Accumulator (SW); Asynchronous Accumulator (HW); Comparator (SW); Comparator with precomputation (HW); Energy Estimate (SW); Energy Estimate (HW); Cost (Speed, Area, Power); GOAL!]


Low Power DSP
• DO-LOOP Dominant
– VSELP Vocoder: 83.4%
– 2D 8x8 DCT: 98.3%
– LPC computation: 98.0%

DO-LOOP Power Minimization ==> DSP Power Minimization

VSELP: Vector Sum Excited Linear Prediction
LPC: Linear Prediction Coding


Loop unrolling
• The technique of loop unrolling replicates the body of a loop some number of times (unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases the instruction parallelism and improves register, data cache or TLB locality.

Loop overhead is cut in half because two iterations are performed in each iteration. If array elements are assigned to registers, register locality is improved because A(i) and A(i +1) are used twice in the loop body. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.

Original loop:

for i = 2 to N-1
    A(i) = A(i) + A(i-1) * A(i+1)

Unrolled by a factor of 2:

for i = 2 to N-2 step 2
    A(i) = A(i) + A(i-1) * A(i+1)
    A(i+1) = A(i+1) + A(i) * A(i+2)
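To check that unrolling by two preserves the result, the original and unrolled loops can be run side by side. This is a minimal sketch, assuming the loop body is A(i) = A(i) + A(i-1) * A(i+1); the function names are hypothetical, and the cleanup step for an odd trip count is an addition that the slide does not show.

```python
def rolled(a):
    """Original loop: for i = 2 to N-1, A(i) = A(i) + A(i-1) * A(i+1)."""
    a = list(a)
    n = len(a) - 1              # a[0] is unused padding so indices are 1-based
    for i in range(2, n):       # i = 2 .. N-1
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a


def unrolled_by_2(a):
    """Same computation with the body replicated twice (unrolling factor 2)."""
    a = list(a)
    n = len(a) - 1
    i = 2
    while i <= n - 2:           # i = 2 .. N-2 step 2
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]
        i += 2
    if i == n - 1:              # leftover iteration when the trip count is odd
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a
```

The unrolled version tests and updates the loop variable half as often, which is where the reduced loop overhead comes from.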


Loop Unrolling (IIR filter example)

loop unrolling : localize the data to reduce the activity of the inputs of the functional units or two output samples are computed in parallel based on two input samples.

After unrolling:

$Y_{n-1} = X_{n-1} + A Y_{n-2}$
$Y_n = X_n + A Y_{n-1} = X_n + A (X_{n-1} + A Y_{n-2})$

Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation,

$Y_{n-1} = X_{n-1} + A Y_{n-2}$
$Y_n = X_n + A X_{n-1} + A^2 Y_{n-2}$

The transformation yields a critical path of 3, thus the voltage can be dropped.
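The unrolled recurrences can be checked against the original first-order filter Y_n = X_n + A Y_{n-1}. The sketch below is an illustration under that assumption; the function names and the zero initial state are hypothetical choices, and the unrolled version assumes an even number of samples.

```python
def iir_direct(x, a, y_init=0.0):
    """First-order IIR: Y[n] = X[n] + A * Y[n-1]."""
    y, prev = [], y_init
    for xn in x:
        prev = xn + a * prev
        y.append(prev)
    return y


def iir_unrolled(x, a, y_init=0.0):
    """Two outputs per iteration, after distributivity and constant propagation."""
    assert len(x) % 2 == 0, "sketch assumes an even number of samples"
    a2 = a * a                                  # A^2 is a constant, computed once
    y, y_nm2 = [], y_init                       # y_nm2 holds Y[n-2]
    for k in range(0, len(x), 2):
        x_nm1, x_n = x[k], x[k + 1]
        y_nm1 = x_nm1 + a * y_nm2               # Y[n-1] = X[n-1] + A*Y[n-2]
        y_n = x_n + a * x_nm1 + a2 * y_nm2      # Y[n] = X[n] + A*X[n-1] + A^2*Y[n-2]
        y.extend((y_nm1, y_n))
        y_nm2 = y_n
    return y
```

In the unrolled form both outputs depend only on Y[n-2], so they can be computed in parallel; that slack is what allows the supply voltage to be lowered at the same throughput.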


Loop Unrolling for Low Power


Loop Unrolling for Low Power


Loop Unrolling for Low Power


Designing a Parallel FIR

To obtain a parallel processing structure, the SISO (single-input single-output) system must be converted into a MIMO (multiple-input multiple-output) system.

y(3k) = a x(3k) + b x(3k-1) + c x(3k-2)
y(3k+1) = a x(3k+1) + b x(3k) + c x(3k-1)
y(3k+2) = a x(3k+2) + b x(3k+1) + c x(3k)

Parallel processing systems are also referred to as block processing systems.
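The three block equations can be compared with the serial 3-tap filter y(n) = a x(n) + b x(n-1) + c x(n-2). The following is a minimal sketch; the function names are hypothetical, samples before n = 0 are assumed to be zero, and the input length is assumed to be a multiple of the block size.

```python
def fir_serial(x, a, b, c):
    """y(n) = a*x(n) + b*x(n-1) + c*x(n-2), with x(n) = 0 for n < 0."""
    def at(n):
        return x[n] if n >= 0 else 0
    return [a * at(n) + b * at(n - 1) + c * at(n - 2) for n in range(len(x))]


def fir_block3(x, a, b, c):
    """Block size 3: each iteration k consumes 3 inputs and produces 3 outputs."""
    assert len(x) % 3 == 0, "sketch assumes whole blocks of 3 samples"

    def at(n):
        return x[n] if n >= 0 else 0

    y = []
    for k in range(len(x) // 3):
        # In hardware these three expressions are evaluated by parallel units.
        y.append(a * at(3 * k) + b * at(3 * k - 1) + c * at(3 * k - 2))
        y.append(a * at(3 * k + 1) + b * at(3 * k) + c * at(3 * k - 1))
        y.append(a * at(3 * k + 2) + b * at(3 * k + 1) + c * at(3 * k))
    return y
```

Because three outputs are produced per iteration, each unit can run at one third of the sample rate for the same throughput.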


Parallel Processing (2)

Parallel processing architecture for a 3-tap FIR filter (with block size 3)


Parallel Processing (3)

Combined fine-grain pipelining and parallel processing for a 3-tap FIR filter


Motion Estimation


Block Matching Algorithm


Configurable H/W Paradigms


Why Hardware for Motion Estimation?

• Most computationally demanding part of video encoding
• Example: CCIR 601 format
– 720 by 576 pixels
– 16 by 16 macro block (n = 16)
– 32 by 32 search area (p = 8)
– 25 Hz frame rate (f_frame = 25)
– 9 Giga Operations/Sec are needed for the Full Search Block Matching Algorithm.
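The 9 Giga operations/sec figure can be reproduced with a short calculation. The sketch below rests on two assumptions that the slide does not state: a displacement range of ±p gives (2p+1)² candidate positions per macro block, and each pixel comparison costs three operations (subtract, absolute value, accumulate). The function name is hypothetical.

```python
def full_search_ops_per_second(width, height, n, p, frame_rate, ops_per_pixel=3):
    """Operation rate of full-search block matching under the stated assumptions."""
    blocks_per_frame = (width // n) * (height // n)   # macro blocks per frame
    candidates = (2 * p + 1) ** 2                     # candidate positions per block
    return blocks_per_frame * candidates * n * n * ops_per_pixel * frame_rate


# CCIR 601 parameters from the slide: 720 x 576, n = 16, p = 8, 25 frames/sec.
ops = full_search_ops_per_second(720, 576, n=16, p=8, frame_rate=25)
```

Under these assumptions the result is about 8.99 x 10^9 operations per second, consistent with the 9 Giga operations/sec quoted above.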


Why Reconfiguration in Motion Estimation?

• Adjusting the search area at frame-rate according to the changing characteristics of video sequences

• Reducing Power Consumption by avoiding unnecessary computation

Motion Vector Distributions


Architecture for Motion Estimation

From P. Pirsch et al, VLSI Architectures for Video Compression, Proc. Of IEEE, 1995


Re-configurable Architecture for ME


Power Estimation in Reconfigurable Architecture


Power vs Search area


Resource Reuse in FPGAs


Motion Estimation - Conventional


Motion Estimation - Data Reuse

[Equations: power of datapaths (a) and (b), P_a and P_b, expressed in terms of P_add and P_abs]

Therefore, the power reduction factor is 11%.


Vector Quantization
• Lossy compression technique which exploits the correlation that exists between neighboring samples and quantizes samples together


Complexity of VQ Encoding

The distortion metric between an input vector X and a codebook vector C_i is computed as follows:

$D_i = \sum_{j=0}^{15} (X_j - C_{ij})^2$

Three VQ encoding algorithms will be evaluated: full search, tree search and differential codebook tree-search.


Full Search
• Brute-force VQ: the distortion between the input vector and every entry in the codebook is computed, and the code index that corresponds to the minimum distortion is determined and sent over to the decoder.
• For each distortion computation, there are 16 8-bit memory accesses (to fetch the entries in the codeword), 16 subtractions, 16 multiplications, and 15 additions. In addition, the minimum of 256 distortion values, which involves 255 comparison operations, must be determined.
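Full-search encoding amounts to a minimum search over all codebook entries. The following is a minimal sketch, assuming a squared-error distortion (which matches the 16 subtractions, 16 multiplications, and 15 additions per entry); the function names are hypothetical and the codebook is whatever list of vectors the caller supplies.

```python
def distortion(x, c):
    """Squared-error distortion: sum over j of (X_j - C_j)^2."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def full_search(x, codebook):
    """Return the index of the codebook entry with the minimum distortion."""
    best_index, best_d = 0, distortion(x, codebook[0])
    for i in range(1, len(codebook)):
        d = distortion(x, codebook[i])
        if d < best_d:                  # one comparison per remaining entry
            best_index, best_d = i, d
    return best_index
```

With 256 entries the loop evaluates 256 distortions and performs 255 comparisons, which is the cost the tree-structured variants try to avoid.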


Tree-structured Vector Quantization

If, for example, at level 1 the input vector is closer to the left entry, then the right portion of the tree is never compared below level 2 and an index bit 0 is transmitted.

Here only 2 x log2 256 = 16 distortion calculations and 8 comparisons are needed.
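The operation counts of full search and binary tree search can be tabulated from the codebook size and vector dimension. This is a sketch under the assumption of a 256-entry codebook of 16-element vectors and a squared-error distortion; the function names and dictionary keys are hypothetical.

```python
def full_search_counts(codebook_size=256, dim=16):
    """Operations for one input vector with exhaustive search."""
    return {
        "distortions": codebook_size,
        "mem": codebook_size * dim,
        "sub": codebook_size * dim,
        "mul": codebook_size * dim,
        "add": codebook_size * (dim - 1),
        "compare": codebook_size - 1,
    }


def tree_search_counts(codebook_size=256, dim=16):
    """Operations for one input vector with a binary tree search."""
    levels = codebook_size.bit_length() - 1   # log2 for a power-of-two codebook
    distortions = 2 * levels                  # left and right node at each level
    return {
        "distortions": distortions,
        "mem": distortions * dim,
        "sub": distortions * dim,
        "mul": distortions * dim,
        "add": distortions * (dim - 1),
        "compare": levels,
    }
```

For the default parameters the tree search needs 16 distortion calculations instead of 256, a 16x reduction in memory accesses and multiplications at the cost of some coding performance.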


Algorithmic Optimization
• Minimizing the number of operations
– example
  • video data stream using the vector quantization (VQ) algorithm
  • distortion metric: $D_i = \sum_{j=0}^{15} (X_j - C_{ij})^2$
– Full search VQ
  • exhaustive full-search
  • distortion calculation: 256
  • value comparison: 255
– Tree-structured VQ
  • binary tree-search
  • some performance degradation
  • distortion calculation: 16 (2 x log2 256)
  • value comparison: 8

[Figure: binary search tree with levels 1, 2, 3, ..., 8 and branches labeled 0 and 1]


Differential Codebook Tree-structure Vector Quantization

• The distortion difference between the left and right node needs to be computed. This equation can be manipulated to reduce the number of operations.


Algorithmic Optimization
– Differential codebook tree-structure VQ
  • modify equation for optimizing operations

$D_{left-right} = \sum_{j=0}^{15} (X_j - C_{left,j})^2 - \sum_{j=0}^{15} (X_j - C_{right,j})^2$
$= \sum_{j=0}^{15} (C_{left,j}^2 - C_{right,j}^2) + 2 \sum_{j=0}^{15} X_j (C_{right,j} - C_{left,j})$

algorithm                   # of mem. access   # of mul.   # of add.   # of sub.
full search                 4096               4096        3840        4096
tree search                 256                256         240         264
differential tree search    136                128         128         0
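The saving of the differential codebook comes from precomputing everything that does not depend on the input vector. The sketch below illustrates this under the assumption that each tree node stores one constant, the sum of (C_left,j² - C_right,j²), and a 16-element vector 2(C_right,j - C_left,j); the function names are hypothetical.

```python
def difference_direct(x, c_left, c_right):
    """D_left - D_right computed from two full squared-error distortions."""
    d_left = sum((xj - cl) ** 2 for xj, cl in zip(x, c_left))
    d_right = sum((xj - cr) ** 2 for xj, cr in zip(x, c_right))
    return d_left - d_right


def precompute_node(c_left, c_right):
    """Offline step: one constant plus a differential codebook vector per node."""
    constant = sum(cl * cl - cr * cr for cl, cr in zip(c_left, c_right))
    diff = [2 * (cr - cl) for cl, cr in zip(c_left, c_right)]
    return constant, diff


def difference_fast(x, constant, diff):
    """Run-time step: 16 multiplications and 16 additions, no subtractions."""
    acc = constant
    for xj, dj in zip(x, diff):
        acc += xj * dj
    return acc
```

Per tree level this reads 17 stored values and does 16 multiplications and 16 additions; over 8 levels that gives the 136, 128, and 128 listed for the differential tree search. A negative result means the left node is closer.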


Multiplication and Accumulation: MAC

• Major operation in DSP
• MUL > (5 * ALU)
• Modified Booth Encoding: one of 0, X, -X, 2X, -2X is selected based on each 2 bits of Y

[Block diagram labels: inputs X and Y; ALU; MULT; ACC; PR; CSA; CPA]


Operand Swapping (1/2)

• Weight = how many additions are needed?

By Booth Encoding: 0011110000X000X0 (Y, Weight = 2)

Operands (A, B): 7FFF AAAA 0001 AAAA 7FFF 6666 0001 AAAA 7FFF AAAA 0001 0001

Current (mW)
A*B     B*A     Saving
22.0    10.0    54%
31.6    10.0    68%
28.8    12.2    58%

[Figure labels: Low Weight, High Switching]
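The weight of an operand can be computed by recoding it with radix-4 modified Booth encoding and counting the nonzero digits. The following is a sketch under stated assumptions: the standard radix-4 recoding with digits in {-2, -1, 0, 1, 2}, a 16-bit two's complement word by default, and a swapping rule that puts the lower-weight operand on the Booth-encoded input. The function names are hypothetical.

```python
def booth_radix4_digits(y, bits=16):
    """Radix-4 modified Booth digits (least significant first) of a two's complement word."""
    assert bits % 2 == 0
    y &= (1 << bits) - 1

    def bit(i):
        return (y >> i) & 1 if i >= 0 else 0

    # Digit for bit pair (i+1, i) with overlap bit i-1: y[i-1] + y[i] - 2*y[i+1].
    return [bit(i - 1) + bit(i) - 2 * bit(i + 1) for i in range(0, bits, 2)]


def booth_weight(y, bits=16):
    """Number of nonzero Booth digits, i.e. partial products that must be added."""
    return sum(1 for d in booth_radix4_digits(y, bits) if d != 0)


def order_operands(a, b, bits=16):
    """Return (multiplicand, multiplier) with the lower-weight operand as multiplier."""
    return (a, b) if booth_weight(b, bits) <= booth_weight(a, bits) else (b, a)
```

For example, a run of ones such as 00111100 recodes to only two nonzero digits, while the alternating pattern AAAA produces a nonzero digit in every position, so swapping operands can change the number of additions substantially.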


DIGLOG multiplier

$C_{mult}(n) = 253 n^2, \quad C_{add}(n) = 214 n$, where n = word length in bits

$A = 2^j + A_R, \quad B = 2^k + B_R$
$A \cdot B = (2^j + A_R)(2^k + B_R) = 2^j B + 2^k A_R + A_R B_R$

                     1st Iter   2nd Iter   3rd Iter
Worst-case error     -25%       -6%        -1.6%
Prob. of Error<1%    10%        70%        99.8%

With an 8 by 8 multiplier, the exact result can be obtained at a maximum of seven iteration steps (worst case)
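The iterative scheme can be sketched with shifts and additions only, treating the residual product A_R B_R as the next pair of operands. This is an illustrative sketch assuming A = 2^j + A_R and B = 2^k + B_R with j and k the positions of the leading ones; the function name is hypothetical, and the sketch counts the first approximation as iteration 1, which may differ from how the slide counts steps.

```python
def diglog_multiply(a, b, iterations):
    """Approximate a*b for non-negative integers using DIGLOG iterations."""
    result = 0
    for _ in range(iterations):
        if a == 0 or b == 0:
            break                           # residual product is zero: result is exact
        j = a.bit_length() - 1              # position of the leading one of A
        k = b.bit_length() - 1              # position of the leading one of B
        a_r = a - (1 << j)                  # A = 2^j + A_R
        b_r = b - (1 << k)                  # B = 2^k + B_R
        result += (b << j) + (a_r << k)     # 2^j * B + 2^k * A_R
        a, b = a_r, b_r                     # neglected term A_R * B_R is refined next
    return result
```

Each iteration always underestimates the product, and the neglected term shrinks by at least a factor of four per iteration, which matches the -25%, -6%, and -1.6% worst-case errors in the table.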


Voltage Scaling
• Merely changing a processor clock frequency is not an effective technique for reducing energy consumption. Reducing the clock frequency will reduce the power consumed by a processor; however, it does not reduce the energy required to perform a given task.

• Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work.


Different Voltage Schedules

[Figure: energy consumption (proportional to Vdd^2) versus time (0 to 25 sec) for three voltage schedules, with the timing constraint marked]

(A) 1000 Mcycles at 50 MHz, Vdd = 5.0 V: 40 J
(B) 750 Mcycles at 50 MHz, Vdd = 5.0 V, then 250 Mcycles at 25 MHz, Vdd = 2.5 V: 32.5 J
(C) 1000 Mcycles at 40 MHz, Vdd = 4.0 V: 25 J
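The energies of the three schedules follow from energy being proportional to Vdd² times the number of cycles. The sketch below assumes an effective switched capacitance of 1.6 nF per cycle, a value chosen here only so that schedule (A) comes out at 40 J; it is not given on the slide. Function and variable names are hypothetical.

```python
C_EFF = 1.6e-9   # farads per cycle; assumed so that schedule (A) yields 40 J


def schedule_energy(segments, c_eff=C_EFF):
    """Energy in joules for a list of (cycles, frequency_hz, vdd) segments."""
    return sum(c_eff * vdd ** 2 * cycles for cycles, _freq, vdd in segments)


def schedule_time(segments):
    """Execution time in seconds for a list of (cycles, frequency_hz, vdd) segments."""
    return sum(cycles / freq for cycles, freq, _vdd in segments)


# The three schedules, each executing 1000 Mcycles in total.
A = [(1000e6, 50e6, 5.0)]
B = [(750e6, 50e6, 5.0), (250e6, 25e6, 2.5)]
C = [(1000e6, 40e6, 4.0)]
```

With this assumption the schedules evaluate to 40 J in 20 s, 32.5 J in 25 s, and about 25.6 J in 25 s, so running the whole task at one reduced voltage costs the least energy while still finishing within 25 s.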


Data Driven Signal Processing

The basic idea of averaging: two samples are buffered and their workloads are averaged.

The averaged workload is then used as the effective workload to drive the power supply.

Using a ping-pong buffering scheme, data samples I_{n+2}, I_{n+3} are being buffered while I_n, I_{n+1} are being processed.


RTL: Multiple Supply Voltages Scheduling

Filter Example


A hardware / software partitioning technique with hierarchical design space exploration

Houria Oudghiri, Bozena Kaminska, and Janusz Rajski, Mentor Graphics Corp.

• A set of DSP examples are considered for co-design on a specific architecture in order to accelerate their performance on a target architecture including a standard DSP processor running concurrently with a custom SIMD (Single Instruction Multiple Data) processor


Proposed methodology

Input: list of blocks and time constraints
Output: two subsets to which the blocks are assigned

Step 1: Construct the complete weighted dependency graph G
Step 2: Assign all blocks to software; compute the complete system execution time
Step 3: while (time constraints not satisfied) do
    Step 3_i: Select the node with the maximum execution time (i)
    Step 3_ii: Assign i to hardware; update the system execution time
    Step 3_iii: while (time constraints not satisfied) do
        Step 3_iii_1: Select the maximum weighted edge connected to i with the most time-consuming node (j)
        Step 3_iii_2: Assign j to hardware; update the dependency graph G; update the system execution time
    end do
end do
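The greedy loop of the methodology can be sketched as follows. This is not the authors' implementation: the cost model (sequential execution time plus a penalty for every dependency edge crossing the HW/SW boundary), the data layout, and all names are assumptions made for illustration, and the inner loop is stopped when node i has no software neighbours left.

```python
def partition(blocks, edges, deadline):
    """Greedy HW/SW partitioning sketch.

    blocks: {name: (sw_time, hw_time)}; edges: {(u, v): weight}.
    Returns (software_set, hardware_set).
    """
    hw = set()

    def system_time():
        # Assumed cost model: run times add up, and each edge that crosses
        # the HW/SW boundary adds its weight as communication time.
        run = sum(t_hw if b in hw else t_sw for b, (t_sw, t_hw) in blocks.items())
        comm = sum(w for (u, v), w in edges.items() if (u in hw) != (v in hw))
        return run + comm

    def sw_neighbours(i):
        out = []
        for (u, v), w in edges.items():
            if i in (u, v):
                j = v if u == i else u
                if j not in hw:
                    out.append((w, blocks[j][0], j))   # (edge weight, sw time, name)
        return out

    while system_time() > deadline and len(hw) < len(blocks):
        # Steps 3_i and 3_ii: move the slowest software block to hardware.
        i = max((b for b in blocks if b not in hw), key=lambda b: blocks[b][0])
        hw.add(i)
        # Step 3_iii: follow the heaviest edges from i to its software neighbours.
        while system_time() > deadline:
            candidates = sw_neighbours(i)
            if not candidates:
                break
            hw.add(max(candidates)[2])
    return set(blocks) - hw, hw
```

Pulling in the neighbour on the heaviest edge removes that edge from the HW/SW boundary, which is why the second move in a typical run reduces communication as well as execution time.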


co-design target architecture

The Texas Instruments DSP processor TMS320C40 is used as the master processor and the custom SIMD processor PULSE (Parallel Ultra Large Scale Engine, 4 processors in parallel) as the slave processor


The hierarchical model of the FFT transform

[Figure: eight-level hierarchical decomposition of the FFT, Level 1 to Level 8. The FFT is split into Initialize, Bit Reversal, Danielson control, and Output; lower levels refine these into blocks such as Initialize Variable, Initialize Data, Bit_init, Bit_loop1, Bit_incr, Dan_init, Dan_loop, Out_init, Out_write, Out_incr, Bit_shift, Bit_loop2, Bit_cond, Bit_acc, Index_init, Read_data, Index_incr, Data_test, Bit_test, Bit_swap1, Bit_swap2, Loop1_init, Loop1_body, Loop1_incr, Loop2_init, Loop2_body, Loop2_incr, Dan_real, and Dan_imag]


Block assignment at different hierarchical levels

Time constraint = 25 ms

        Nb. of    Blocks on        Time (ms)
Level   Blocks    C40    PULSE     PULSE    C40      Total
1       4         2      2         18.14    4.8      22.94
2       10        6      4         18.8     2.96     21.76
3       17        11     6         15.56    9        24.56
4       22        18     6         14.68    10.24    24.92
5       24        17     7         14.56    10.4     24.94
6       24        22     2         6.82     17.72    24.54
7       25        22     3         7        17.92    24.92
8       27        18     9         5.88     18.64    24.52


Function-Architecture Co-Design (Cadence)


SystemC support:
– Mentor Graphics - Seamless® C-Bridge™
– Verisity - SpecMan™ Elite
– Forte Design Systems - ESC Library
– Emulation & Verification Engineering - Zebu
– Axys Design - MaxSim™
– CoWare - N2C updated for SystemC 2.0
– Cadence - SPW 4.8 / SystemC v2.0 IF
– Synopsys - CoCentric System Studio

• Plus Kluwer book - "System Design Using SystemC", 2002


OCAPI-xl design flow


Application Structure


Specification and modeling
• Executable specification - Verilog, VHDL, C, C++, Java.
• Common models: synchronous dataflow (SDF), sequential programs (Prog.), communicating sequential processes (CSP), object-oriented programming (OOP), FSMs, hierarchical/concurrent FSM (HCFSM).
• Depending on the application domain and specification semantics, they are based on different models of computation.


Hardware Synthesis
• Many RTL, logic level, physical level commercial CAD tools.
• Some emerging high-level synthesis tools: Behavioral Compiler (Synopsys), Monet (Mentor Graphics), and RapidPath (DASYS).
• Many open problems: memory optimization, parallel heterogeneous hardware architectures, programmable hardware synthesis and optimization, communication optimization.


Software synthesis
• The use of real-time operating systems (RTOSs)
• The use of DSPs and micro-controllers – code generation issues
• Special processor compilation in many cases is still far less efficient than manual code generation!
• Retargeting issues - C code developed for TI TMS320C6x is not optimized for running on the Philips TriMedia processor.


Interface synthesis
• Interface between:
  - hardware-hardware
  - hardware-software
  - software-software
• Timing and protocols
• Recently, the first commercial tools appeared: the CoWare system (hw-sw protocols) and the Synopsys Protocol Compiler (hw interface synthesis tool)


Co-design Sites
• Bibliography of Hardware/Software Codesign: http://www-ti.informatik.uni-tuebingen.de/~buchen/
• Ralf Niemann's Codesign Links and Literature: http://ls12-www.informatik.uni-dortmund.de/~niemann/codesign/codesign_links.html
• URLs to Hardware/Software Co-Design Research: http://www.ece.cmu.edu/~thomas/hsURL.html
• RASSP Architecture Guide: http://www.sanders.com/hpc/ArchGuide/TOC.html
• EDA, Electronic Design Automation: http://www.eda.org
• COMET (Case Western Reserve University): http://bear.ces.cwru.edu/research/hard_soft.html
• COSMOS (Tima - Cmp, France): http://tima-cmp.imag.fr/Homepages/cosmos/research.html
• COSYMA (Braunschweig): http://www.ida.ing.tu-bs.de/projects/cosyma/
• Handel-C (Oxford): http://oldwww.comlab.ox.ac.uk/oucl/hwcomp.html
• Lycos (Technical University of Lyngby, Denmark): http://www.it.dtu.dk/~lycos/
• MOVE (Technical University Delft): http://cardit.et.tudelft.nl/MOVE/
• Polis (University of Berkeley): http://www-cad.eecs.berkeley.edu/Respep/Research/hsc/abstract.html
• ProCos (UK Research): http://www.comlab.ox.ac.uk/archive/procos/codesign.html
• Ptolemy (University of Berkeley): http://ptolemy.eecs.berkeley.edu/
• SPAM (Princeton): http://www.ee.princeton.edu/~spam/
• TRADES (University of Twente, INF/CAES): http://wwwspa.cs.utwente.nl/aid/aid.html

Specification languages
• SystemC: http://www.systemc.org


SOC CAD Companies
• Cadence www.cadence.com
• Duet Tech www.duettech.com
• Escalade www.escalade.com
• LogicVision www.logicvision.com
• Mentor Graphics www.mentor.com
• Palmchip www.palmchip.com
• Sonics www.sonicsinc.com
• Summit Design www.summit-design.com
• Synopsys www.synopsys.com
• Topdown design solutions www.topdown.com
• Xynetix Design Systems www.xynetix.com
• Zuken-Redac www.redac.co.uk