
Low Power System Level Design Methodologies

Young-Chul Kim, Chonnam National Univ.

Dept. of ECE, IT SoC Lab. http://soc.chonnam.ac.kr/~yckim


Contents

• Introduction to System Level Design
• Hardware and Software Co-design
• Re-configurable Processors
• Other Low Power System Level Designs

Introduction to SOC

• SOC will bridge the gap between software and its implementation in novel, energy-efficient silicon architectures.
• In SOC design, chips are assembled at the level of reusable IP blocks and IP interfaces rather than at the gate level.
• SOC specs come from ICT system engineers rather than from RTL descriptions.

Common Fabric for IP Blocks

• Soft IP blocks are portable, but not as predictable as hard IP.
• Hard IP blocks are very predictable, since a specific physical implementation can be characterized, but they are hard to port because they are often tied to a specific process.
• A common fabric is required for both portability and predictability.
• Wide availability: Cell Based Array, a metal-programmable architecture that provides the performance of a standard cell and is optimized for synthesis.

Four main applications

Set-top box: Mobile multimedia system, base station for the home local-area network.

Digital PCTV: concurrent use of TV, 3D graphics, and Internet services

Set-top box LAN service: Wireless home-networks, multi-user wireless LAN

Navigation system: steer and control traffic and/or goods-transportation

Types of System-on-a-Chip Designs

Silicon in 2010
Die area: 2.5 x 2.5 cm; Voltage: 0.6 V; Technology: 0.07 µm

Memory          Density (Gbits/cm2)   Access time (ns)
DRAM            8.5                   10
DRAM (Logic)    2.5                   10
SRAM (Cache)    0.3                   1.5

Logic           Density (Mgates/cm2)  Max. ave. power (W/cm2)  Clock rate (GHz)
Custom          25                    54                       3
Std. Cell       10                    27                       1.5
Gate Array      5                     18                       1
Single-Mask GA  2.5                   12.5                     0.7
FPGA            0.4                   4.5                      0.25

Why Lower Power

• Portable systems: long battery life, light weight, small form factor
• IC priority list: power dissipation, cost, performance
• Technology direction: reduced voltage/power designs based on mature high-performance IC technology; high integration to minimize size, cost, power, and speed

Microprocessor Power Dissipation

[Figure: power (W, axis from 5 to 50) versus year (1980 to 2000) for the i286, i386 DX 16, i486 DX 25, i486 DX 50, i486 DX2 66, i486 DX4 100, P5 66, P6 166, P II 300, P III 500, P-PC601 50, P-PC604 133, P-PC750 400, Alpha 21064 200, Alpha 21164, and Alpha 21264.]

Levels for Low Power Design

Level           Techniques
System          Hardware-software partitioning, Power down
Algorithm       Complexity, Concurrency, Locality, Regularity, Data representation
Architecture    Parallelism, Pipelining, Signal correlations, Instruction set selection, Data rep.
Circuit/Logic   Sizing, Logic Style, Logic Design
Technology      Threshold Reduction, Scaling, Advanced packaging, SOI

Possible Power Savings at Different Design Levels

Level of Abstraction   Expected Saving
Algorithm              10 - 100 times
Architecture           10 - 90%
Logic Level            20 - 40%
Layout Level           10 - 30%
Device Level           10 - 30%

Power-hungry Applications

Signal Compression: HDTV Standard, ADPCM, Vector Quantization, H.263, 2-D motion estimation, MPEG-2 storage management

Digital Communications: Shaping Filters, Equalizers, Viterbi decoders, Reed-Solomon decoders


New Computing Platforms

• SOC power efficiency of more than 10 GOPS/W
• Higher on-chip system integration: COTS: 100 W, SOC: 10 W (inter-chip capacitive loads, I/O buffers)
• Speed and performance: shorter interconnections, fewer drivers, faster devices, more efficient processing architectures
• Mixed-signal systems
• Reuse of IP blocks
• Multiprocessor, configurable computing
• Domain-specific, combined memory-logic

$P = kCFV^2$
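As a rough illustration of the relation $P = kCFV^2$, the sketch below compares dynamic power at two supply voltages. The function name and the numeric values for activity factor, capacitance, and frequency are illustrative assumptions, not figures from these slides.

```python
def dynamic_power(k, c_farads, f_hz, v_volts):
    """Dynamic switching power P = k * C * F * V^2, in watts."""
    return k * c_farads * f_hz * v_volts ** 2

# Hypothetical values: activity factor 0.2, 1 nF switched capacitance, 100 MHz.
p_high = dynamic_power(0.2, 1e-9, 100e6, 3.3)
p_low = dynamic_power(0.2, 1e-9, 100e6, 1.0)
ratio = p_high / p_low  # depends only on (3.3 / 1.0) ** 2
print(f"{p_high:.4f} W at 3.3 V, {p_low:.4f} W at 1.0 V, ratio {ratio:.2f}")
```

Because voltage enters quadratically, lowering the supply is the strongest single lever in this model; the other slides deal with recovering the speed that is lost by doing so.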


Physical gap

• Timing closure problem: layout-driven logic and RT-level synthesis
• Energy efficiency requires locality of computation and storage: a match for stream-based data processing of speech, images, and multimedia-system packets.
• Next-generation SOC designers must bridge the architectural gap between system specification and energy-efficient IP-based architectures, while CAE vendors and IP providers will bridge the physical gap.

Low Power Design Flow I

[Figure: flow from system-level specification to RT-level design. Blocks: System-Level Specification; Function Partitioning and HW/SW Allocation; System-Level Power Analysis; Behavioral Description; Power-driven Behavioral Transformation; Behavioral-Level Power Analysis; Power Conscious Behavioral Description; High-Level Synthesis and Optimization; RT-Level Power Analysis; Software Functions; Processor Selection; Software Optimization; Software-Level Power Analysis; To RT-Level Design.]

Low Power Design Flow II

[Figure: flow from RT-level description to switch-level description. Blocks: RT-level Description; RTL mapping; RTL Library; Data-path; Controller; Processor; Control and Steering Logic; Memory; RTL Macrocells; Logic Synthesis and Optimization; Gate-Level Power Analysis; Gate-level Description; Standard cell Library; High-Level Synthesis and Optimization; Switch-Level Power Analysis; Switch-level Description.]

Three Factors Affecting Energy

- Reducing waste by hardware simplification: redundant H/W extraction, locality of reference, demand-driven / data-driven computation, application-specific processing, preservation of data correlations, distributed processing
- All-in-one approach (SOC): I/O pin and buffer reduction
- Voltage-reducible hardware
  - 2-D pipelining (systolic arrays)
  - SIMD: parallel processing, useful for data with parallel structure
  - VLIW: flexible approach

Example 1: Filter: Eliminating Redundant Computations


Example 2: IBM’s PowerPC Lower Power Architecture

• Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution
  - The 603e executes instructions in five parallel execution units (IU, FPU, BPU, LSU, SRU)
  - The FPU is pipelined so a multiply-add instruction can be issued every clock cycle
  - Low-power 3.3-volt design
• Use small, less complex instructions with shorter instruction length
  - IBM’s PowerPC 603e is RISC
• Superscalar: CPI < 1
  - The 603e issues as many as three instructions per cycle
• Low power management
  - The 603e provides four software-controllable power-saving modes
• Copper processor with SOI
• IBM’s Blue Logic ASIC: new design reduces power by a factor of 10

Power-Down Techniques

◆ Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work


Voltage vs Delay

• Use variable voltage scaling or scheduling for real-time processing
• Use architecture optimization to compensate for slower operation, e.g., parallel processing and pipelining to increase concurrency and shorten the critical path.
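The trade-off between supply voltage and delay, and the way parallelism compensates for it, can be sketched numerically. The first-order delay model `V / (V - Vt)^2`, the threshold voltage of 0.8 V, and the 5 V reference are assumptions chosen for illustration; they are not taken from the slide's figure.

```python
def gate_delay(v, vt=0.8):
    """Relative CMOS gate delay under a first-order model: delay ~ V / (V - Vt)^2."""
    return v / (v - vt) ** 2

def voltage_for_slowdown(slowdown, v_ref=5.0, vt=0.8):
    """Bisection for the supply whose delay is `slowdown` times the delay at v_ref."""
    target = slowdown * gate_delay(v_ref, vt)
    lo, hi = vt + 1e-6, v_ref          # delay falls monotonically on (vt, v_ref]
    for _ in range(100):
        mid = (lo + hi) / 2
        if gate_delay(mid, vt) > target:
            lo = mid                   # still too slow: raise the voltage
        else:
            hi = mid
    return hi

# Two parallel units, each allowed to run at half the original rate.
v_par = voltage_for_slowdown(2.0)
# Twice the capacitance, half the frequency, lower voltage (overheads ignored).
power_ratio = (2 * 0.5) * (v_par / 5.0) ** 2
print(round(v_par, 2), round(power_ratio, 2))
```

Under these assumptions the duplicated datapath keeps the same throughput at a lower supply, so its power is a fraction of the original even though the switched capacitance doubles. Real designs also pay for routing and multiplexing overhead, which this sketch leaves out.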

Low Voltage Main Memories

Why Copper Processor?

• Motivation: aluminum increasingly resists the flow of electricity as wires are made thinner and narrower.
• Performance: 40% speed-up
• Cost: 30% less expensive
• Power: less power from batteries
• Chip size: 60% smaller than an aluminum chip

Silicon-on-Insulator

How does SOI reduce capacitance?

• Junction capacitance is eliminated by placing an insulating layer (similar to glass) between the impurities and the silicon substrate
• High performance, low power, low soft-error rate

SOC Co-Design Challenges

• Current systems are complex and heterogeneous, containing many different types of components
• Half of the chip can be filled with 200 low-power, RISC-like processors (ASIPs) interconnected by field-programmable buses and embedded in 20 Mbytes of distributed DRAM and flash memory; the other half is ASIC
• Computational power will result not from multi-GHz clocking but from parallelism, with clocks below 200 MHz. This will greatly simplify the design for correct timing, testability, and signal integrity.

Configurability

• One million gates of reconfigurable logic, one million gates of hardwired logic
• 50 GIPS for programmable components or 500 GIPS for dedicated hardware
• Reduce design risks, for which NRE costs will become dominant
• 1 V supply, with power in the watt range

Bridging the architectural gap

• Product reliability: design at a level far above the RT level, with reuse factors in excess of 100
• Trade-off: 100 MOPS/W (microprocessor) versus 100 GOPS/W (hardwired)
• Reconfigurable computing with a large number of computing nodes and a very restricted instruction set (Pleiades)

Implementing Digital Systems

H/W and S/W Co-design

Hardware/Software Co-Design Flow

[Figure: co-design flow. Blocks: Analysis of Constraints & Requirements; System Specification; Hardware & Software Partitioning; Hardware Description; Software Description; Interface Synthesis; Hardware Synthesis & Configuration; Software Generation & Parameterization; Configuration Modules; Hardware Components; HW/SW Interface; Software Modules; HW/SW Integration & Cosimulation; Integration System; System Evaluation; Design Verification.]

Three Co-Design Approaches

Source: N.S. Voros et al., “Hardware-software co-design of embedded systems using multiple formalisms for application development”, IFIP International Conference FORTE/PSTV’98, Nov. ’98

• ASIP co-design: starts with an application, builds a specific programmable processor, and translates the application into software code. H/W and S/W partitioning includes the instruction set design.
• H/W-S/W synchronous system co-design: a S/W processor as the master controller and a set of H/W accelerators as co-processors. Examples: Vulcan, Codes, Tosca, Cosyma
• H/W-S/W co-design for distributed systems: mapping of a set of communicating processes onto a set of interconnected processors, using behavioral decomposition, process allocation, and communication transformation. Examples: Coware (powerful), Siera (reuse), Ptolemy (DSP)

Mixing H/W and S/W

• Argument: mixed hardware/software systems represent the best of both worlds: high performance, flexibility, design reuse, etc.
• Counterpoint: from a design standpoint, it is the worst of both worlds
  - Simulation: problems of verification and test become harder
  - Interface: too many tools, too many interactions, too much heterogeneity
• Hardware/software partitioning is “AI-complete”!

Partitioning

• Performance requirements
  - Some functions are easier to implement in hardware: blocks that are used repeatedly, and blocks with a parallel structure
• Modifiability
  - Blocks implemented in software are easy to modify
• Implementation cost
  - Blocks implemented in hardware can be shared
• Scheduling
  - Schedule the blocks partitioned into HW and SW so that they meet the given constraints
  - SW operations must be scheduled sequentially
  - SW and HW can be scheduled concurrently as long as there are no data or control dependencies

Low power partitioning approach

• Different HW resources are invoked according to the instruction executed at a specific point in time
• During the execution of an add operation, the ALU and register are used, but the multiplier is idle
• Non-active resources still consume energy, since the corresponding circuits continue to switch
• Calculate the wasted energy
• Add application-specific cores and run only part of the system at a time: whenever one core is performing, all the other cores are shut down
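The "calculate the wasted energy" step can be sketched as a small accounting exercise over an instruction trace. The resource names follow the slide's add example; every energy number and the `gate_idle` switch are hypothetical values chosen only to show the bookkeeping.

```python
# Hypothetical per-cycle energy (nJ) of each resource: (active, idle but still clocked).
RESOURCES = {"alu": (1.0, 0.3), "register": (0.4, 0.1), "multiplier": (3.0, 0.9)}
# Resources that each instruction actually uses.
USES = {"add": {"alu", "register"}, "mul": {"multiplier", "register"}}

def energy(trace, gate_idle):
    """Return (useful, wasted) energy for a list of instruction names.
    With gate_idle=True, unused resources are shut down and waste nothing."""
    useful = wasted = 0.0
    for op in trace:
        for name, (active, idle) in RESOURCES.items():
            if name in USES[op]:
                useful += active
            elif not gate_idle:
                wasted += idle
    return useful, wasted

trace = ["add"] * 9 + ["mul"]
print(energy(trace, gate_idle=False))
print(energy(trace, gate_idle=True))
```

In an add-dominated trace like this one, most of the waste comes from the idle multiplier, which is the case the partitioning approach targets by shutting down cores that are not performing.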


Partitioning Process

- Derive a graph G of operations and connections
- Decompose G into a set of clusters (a cluster is a set of operations)
- Calculate bus-traffic energy
- Pre-select clusters with constraints
- Set the number of resources
- List scheduling
- Test the utilization rate (ASIC or µP); the utilization rate of the µP is provided by a SW estimation tool

Design Flow

[Figure: Application; Divide application into clusters; Select cluster; List schedule; Compute utilization rate (ASIC); Compute utilization rate (µP); Core Energy Estimation; HW Synthesis; Evaluate.]

- Max 94% energy saving, and in most cases even reduced execution time
- 16k cell overhead

Interface

• Need for an interface block
  - Transfers data between hardware and software blocks
  - Only an efficiently designed interface block can reduce the overhead between HW and SW blocks
• Interface methods
  - Shared memory
  - FIFO
  - Handshaking protocol

Logical Bus Architecture

• System bus signals
  - Address, data, and control signals
  - The address space consists of the memory space and the I/O space
  - Memory space: memory of the SW component
  - I/O space: ports within SW and registers in other HW
• Port signals
  - Specialized signals capable of directly interfacing between SW and HW components
• Interrupt signals
  - Raised when SW and HW components have completed an operation, or when an error condition is detected

Co-Simulation

• Need for co-simulation
  - Simulating the HW part and the SW part together makes it possible to predict the results of the assembled system
  - Predicting system performance allows the system to be redesigned to meet the given spec before synthesis
  - Predicts the characteristics of each sub-block for HW/SW partitioning
• Co-simulation tools: Ptolemy, COSSAP, POLIS

Partitioning Example: CDMA Searcher - VADA Lab., SKKU

[Figure: searcher block diagram and its HW/SW partitioning alternatives. Blocks: PN-code generator; synchronous accumulation stage (real, imaginary); energy computation stage (real, imaginary); compare/select stage; asynchronous accumulation stage; compare/select stage. Candidate implementations: PN-code generation; synchronous accumulator (SW); synchronous accumulator 1 (HW); synchronous accumulator 2 (HW); energy estimate (SW, HW); comparator (SW); comparator with precomputation (HW); asynchronous accumulator (SW, HW). Alternatives are evaluated by cost (speed, area, power) until the goal is reached.]

Approach - VADA Lab., SKKU

[Figure: searcher data flow with a control signal generator. The synchronous accumulation stage forms $Y_I$ and $Y_Q$; the energy computation stage selects the maximum of $Y_I^2$ and $Y_Q^2$, which is compared with $\theta_1$; the asynchronous accumulation stage sums $Z$, which is compared with $\theta_2$. The outcomes of the comparisons lead either to Search Done or to Search_Slew.]

$O_I = (RX_I \cdot TX_I) + (RX_Q \cdot TX_Q)$
$O_Q = (RX_I \cdot TX_Q) + (RX_Q \cdot (-TX_I))$
$Y_I = \sum O_I$, $Y_Q = \sum O_Q$
$Z = \max(Y_I^2, Y_Q^2)$
$\sum Z$

- Software-oriented design
- Dark block: hardware
- Interface: control signal generator
- Partitioned in terms of speed cost
- Change from SW to HW: 1. Implementation speed 2. Parallel architecture
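The searcher arithmetic above can be sketched in software, which is roughly what the "Full SW" starting point of the partitioning would compute. This is a simplified sketch: it implements the synchronous accumulation, the energy step as written on the slide, and only the first threshold; the function names, the sample format, and the threshold value are assumptions, and the asynchronous accumulation with the second threshold is left out.

```python
def correlate(rx, tx):
    """Synchronous accumulation of (I, Q) sample pairs against the PN sequence."""
    y_i = y_q = 0
    for (rx_i, rx_q), (tx_i, tx_q) in zip(rx, tx):
        y_i += rx_i * tx_i + rx_q * tx_q        # O_I
        y_q += rx_i * tx_q + rx_q * (-tx_i)     # O_Q
    return y_i, y_q

def energy(y_i, y_q):
    """Energy step as written on the slide: Z = max(Y_I^2, Y_Q^2)."""
    return max(y_i ** 2, y_q ** 2)

def search(rx, pn, window, theta1):
    """Return the first offset whose energy exceeds theta1, else None (keep slewing)."""
    for offset in range(len(rx) - window + 1):
        z = energy(*correlate(rx[offset:offset + window], pn[:window]))
        if z > theta1:
            return offset
    return None

pn = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
rx = [(0, 0), (0, 0)] + pn + [(0, 0)]      # PN sequence delayed by two samples
print(search(rx, pn, window=4, theta1=50))
```

The inner multiply-accumulate loop is the part that runs for every candidate offset, which is consistent with the synchronous accumulator being the block whose move to hardware gives the largest cycle reduction in the results.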


Result - VADA Lab., SKKU

Configuration                  Cycle   Ratio   Area (gates)
Full SW                        266     -       -
Full HW                        -       -       9008
Synchronous accumulator (1)    138     48.1    + 872
Computing energy (2)           265     4.4     + 3096
(1) & (2)                      137     48.5    + 3968
(2) & Comparator (3)           265     4.4     + 3155
(1) & (3)                      138     48.1    + 931

Low Power CDMA Searcher Project at SKKU

• Project title: Low-power design of an IS-95-based DS/CDMA system using co-design techniques
• Development period: 1999.3.1 - 2000.2.28 (12 months)
• Objective and method: RTL-level low-power design and implementation of the searcher engine of the MSM (Mobile Station Modem) chip used in CDMA handsets. Operating frequency: 12.5 MHz
• Low-power design based on a data flow graph, using rescheduling, pre-computation, strength reduction, and a synchronous accumulator; area and power were reduced by up to 67.68% and 41.35%, respectively
• H/W and S/W co-design techniques applied
• San Kim and Jun-Dong Cho, “Low Power CDMA Searcher”, CAD and VLSI Workshop, May 1999.
• Inki Hwang, San Kim and Jun-Dong Cho, “CDMA Searcher Co-Design”, ASIC Workshop, Sep. 1999.

Application-Specific Instruction Processor

• Processor architecture tailored not just for an application domain (e.g., DSP, microcontrollers), but for specific sets of applications (e.g., audio, engine control)
• ASIP characteristics
  - Greater design cost (processor + compiler)
  - Higher performance and lower power than commercial cores, more flexibility than ASIC

ASIP Design

• Given a set of applications, determine the micro-architecture of the ASIP (i.e., configuration of functional units in datapaths, instruction set)
• To accurately evaluate the performance of the processor on a given application, the application program must be compiled onto the processor datapath and the object code simulated.
• The micro-architecture of the processor is a design parameter!

ASIP Design Flow

Compiler Optimizations

• Machine-independent optimizations
  - Parallelizing transformations, common sub-expression elimination, constant propagation, strength reduction, loop-invariant code motion
• Machine-dependent optimizations
  - Loop unrolling and software pipelining
  - Static allocation (non-recursive procedure calls)
  - Storage layout (arrays, scalars)
  - Optimization of mode-setting instructions
  - Instruction selection, scheduling, and register allocation

Cross-Disciplinary nature

• Software for low power: loop transformations lead to much higher temporal and spatial locality of data.
• Code size becomes an important objective
• Software will eventually become a part of the chip
• Behavior-platform-compiler codesign: systems are codesigned in C++ or Java, describing both their H/W and S/W implementation.
• Multidisciplinary system thinking is required for future designs (e.g., Eindhoven Embedded Systems Institute http://www.eesi.tue.nl/english)

VLSI Signal Processing Design Methodology

• Pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering
• Bit-serial, bit-parallel, and digit-serial architectures, carry-save architecture
• Redundant and residue systems
• Viterbi decoder, motion compensation, 2D filtering, and data transmission systems

Low Power DSP

• DO-LOOP dominant
  - VSELP vocoder: 83.4%
  - 2D 8x8 DCT: 98.3%
  - LPC computation: 98.0%
• DO-LOOP power minimization ==> DSP power minimization

VSELP: Vector Sum Excited Linear Prediction
LPC: Linear Prediction Coding

Loop unrolling

The technique of loop unrolling replicates the body of a loop some number of times (the unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases instruction parallelism, and improves register, data cache, or TLB locality.

In the unrolled loop, loop overhead is cut in half because two iterations are performed in each iteration. If array elements are assigned to registers, register locality is improved because A(i) and A(i+1) are used twice in the loop body. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.

Original loop:

    for i = 2 to N - 1
        A(i) = A(i) + A(i-1) * A(i+1)

Unrolled loop:

    for i = 2 to N - 2 step 2
        A(i) = A(i) + A(i-1) * A(i+1)
        A(i+1) = A(i+1) + A(i) * A(i+2)
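A runnable sketch of the same transformation, for checking that the unrolled loop computes what the original does. The function names are illustrative, index 0 of the list is left unused so that the indices match the 1-based pseudocode, and the final cleanup step for an odd trip count is an addition that the slide's loop does not show.

```python
def original(a):
    """for i = 2 to N-1: A(i) = A(i) + A(i-1) * A(i+1), with a[1..N] in use."""
    a = list(a)
    n = len(a) - 1
    for i in range(2, n):
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a

def unrolled(a):
    """Same computation with unrolling factor 2."""
    a = list(a)
    n = len(a) - 1
    i = 2
    while i <= n - 2:                   # i = 2 .. N-2 step 2
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]
        i += 2
    if i == n - 1:                      # leftover iteration when the trip count is odd
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a

data = [0, 1, 2, 3, 4, 5, 6]            # a[0] is padding
print(original(data) == unrolled(data))
```

The loop-control work (compare, increment, branch) is done half as often in `unrolled`, which is the overhead reduction the slide describes.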


Loop Unrolling (IIR filter example)

Loop unrolling localizes the data to reduce the activity of the inputs of the functional units; alternatively, two output samples are computed in parallel based on two input samples:

$Y_{n-1} = X_{n-1} + A Y_{n-2}$
$Y_n = X_n + A Y_{n-1} = X_n + A (X_{n-1} + A Y_{n-2})$

Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation:

$Y_{n-1} = X_{n-1} + A Y_{n-2}$
$Y_n = X_n + A X_{n-1} + A^2 Y_{n-2}$

The transformation yields a critical path of 3, thus the voltage can be dropped.
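The unrolled first-order IIR recurrence can be checked against the direct form with a short sketch. The function names and the test signal are illustrative assumptions; the filter itself is the recurrence $Y_n = X_n + A Y_{n-1}$ with a zero initial state.

```python
def iir_direct(x, a, y0=0.0):
    """Y[n] = X[n] + a * Y[n-1], one sample per iteration."""
    y, prev = [], y0
    for xn in x:
        prev = xn + a * prev
        y.append(prev)
    return y

def iir_unrolled(x, a, y0=0.0):
    """Two samples per iteration; both outputs depend only on Y[n-2]."""
    assert len(x) % 2 == 0, "sketch assumes an even number of samples"
    a2 = a * a                          # constant propagation: A^2 is precomputed
    y, prev2 = [], y0
    for n in range(1, len(x), 2):
        y_prev = x[n - 1] + a * prev2               # Y[n-1]
        y_n = x[n] + a * x[n - 1] + a2 * prev2      # Y[n]
        y += [y_prev, y_n]
        prev2 = y_n
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(iir_direct(x, 0.5))
print(iir_unrolled(x, 0.5))
```

In `iir_unrolled` the two outputs no longer depend on each other, so they can be computed side by side. That independence is what shortens the critical path per pair of samples and leaves slack for a lower supply voltage.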

Loop Unrolling for Low Power

Effective Resource Utilization

[Figure: a data-flow graph of adders and delay elements (D), shown before and after retiming, with tables of the operations assigned to the adders and multipliers in each cycle for both versions.]

Retiming can reduce interconnect capacitance.

Domain Specific Processor: Flexibility vs. Energy-Efficiency

• There is a trade-off between efficiency and flexibility: programmable designs incur significant performance and power penalties compared to ASICs.
• For parallel signal-processing algorithms, significant power savings can be achieved by executing the dominant computational kernels of a given class of applications with common features on dedicated, optimized processing elements with minimum energy overhead.
• Programmability requires generalized computation, storage, and communication systems, which can be used to implement different kinds of algorithms.
• Domain-specific processors give up some of the flexibility of a general-purpose programmable device to achieve higher levels of energy efficiency, while maintaining the flexibility to handle a variety of algorithms.

Hybrid Architecture Template (Pleiades)

Arthur Abnous and Jan Rabaey

Pleiades does much better on the energy scale than the TI DSPs, because DSPs are general-purpose and their instruction execution involves a great deal of overhead. Pleiades can create dedicated hardware structures tuned to the task at hand and executes operations with a small energy overhead.

Application Domains: Ultra-Low-Power Domain-Specific Multimedia Processors

• CELP-based speech coding: LPC analysis and synthesis, codebook search, lag computation
• DCT-based video compression and decompression: DCT and inverse DCT, motion estimation and compensation, Huffman coding and decoding
• Baseband processing for digital radios: demodulation, channel equalization, timing recovery, error correction

The Re-configurable Terminal

Satellite Processors

Elements of Energy-Efficiency

Multi-Processor Implementation

Communication Network

Distributed Data-Driven Control

Execution of a hardware module is triggered by the arrival of tokens. When there are no tokens to be processed at a given module, no switching activity occurs in that module.
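Token-triggered execution can be mimicked with a small software model. This is only a behavioral sketch: the `Module` class, its `firings` counter (used here as a stand-in for switching activity), and the two-stage pipeline are all hypothetical and are not part of the Pleiades design.

```python
from collections import deque

class Module:
    """A processing element that fires only when an input token is waiting."""
    def __init__(self, func):
        self.func = func
        self.inbox = deque()
        self.firings = 0        # stand-in for switching activity
        self.sink = None        # downstream Module, or None for the last stage

    def step(self, results):
        if not self.inbox:      # no token: the module stays idle
            return
        value = self.func(self.inbox.popleft())
        self.firings += 1
        if self.sink is None:
            results.append(value)
        else:
            self.sink.inbox.append(value)

def run(modules, tokens, cycles):
    results = []
    modules[0].inbox.extend(tokens)
    for _ in range(cycles):
        for m in reversed(modules):   # downstream first: a token moves one stage per cycle
            m.step(results)
    return results

scale = Module(lambda v: 2 * v)
offset = Module(lambda v: v + 1)
scale.sink = offset
out = run([scale, offset], [1, 2, 3], cycles=10)
print(out, scale.firings, offset.firings)
```

Over ten cycles the two modules have twenty opportunities to run but fire only six times in total; the remaining slots model modules that see no token and therefore produce no switching activity.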

Implementation of Handshaking

Design Methodology

Low Power Circuit Techniques

• Reduced-swing interconnect (communication network, memories, programmable logic modules)
• On-chip dc-dc conversion + multiple supply voltages
• Locally synchronous, globally asynchronous
• Automatic power-down
• Optimized libraries (0.6 µm CMOS + Cadence/Synopsys design flow)

Switching Activity Reduction

(a) Average activity in a multiplier as a function of the constant value

(b) Parallel and serial implementations of an adder tree

VSELP Synthesis Filter Mapped onto Satellite Processors

Mappings of VSELP Kernel

• The most energy-efficient CELP-based speech algorithm: dissipates 36 mW (Vdd = 1.8 V, 0.5 µm CMOS), requires 23.4 MOPS
• Proposed VSELP speech coder: 0.6 µm CMOS, dissipates under 5 mW

IIR Mapping

IIR Comparison

FFT Mapping

FFT Comparison

Reconfiguration for Power Saving in Real-Time Motion Estimation, S.R. Park, UMASS

Motion Estimation

Block Matching Algorithm

Configurable H/W Paradigms

Programmable Logic Modules

Why Hardware for Motion Estimation?

Most Computationally demanding part of Video Encoding

Example: CCIR 601 format

• 720 by 576 pixels
• 16 by 16 macro block (n = 16)
• 32 by 32 search area (p = 8)
• 25 Hz frame rate ($f_{frame} = 25$)
• 9 Giga operations/sec are needed for the full-search block matching algorithm
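The figure of about 9 giga-operations per second can be reproduced with a short calculation. The sketch assumes three operations per pixel comparison (subtract, absolute value, accumulate) and $(2p+1)^2$ candidate displacements per macro block; both are assumptions chosen because they match the stated total, and the function name is illustrative.

```python
def full_search_ops_per_second(width, height, n, p, frame_rate, ops_per_pixel=3):
    """Operations per second for full-search block matching.
    ops_per_pixel=3 assumes subtract, absolute value, and accumulate."""
    blocks = (width // n) * (height // n)   # macro blocks per frame
    candidates = (2 * p + 1) ** 2           # displacements in [-p, p] on each axis
    return blocks * candidates * n * n * ops_per_pixel * frame_rate

ops = full_search_ops_per_second(720, 576, n=16, p=8, frame_rate=25)
print(f"{ops / 1e9:.2f} GOPS")
```

The candidate count grows with the square of the search range `p`, which is why adapting the search area to the video content is an effective way to cut computation.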


Why Reconfiguration in Motion Estimation?

Adjusting the search area at frame-rate according to the changing characteristics of video sequences

Reducing Power Consumption by avoiding unnecessary computation

Motion Vector Distributions

Architecture for Motion Estimation

From P. Pirsch et al., VLSI Architectures for Video Compression, Proc. of IEEE, 1995

Re-configurable Architecture for ME

Power Estimation in Reconfigurable Architecture

Power vs Search area

Resource Reuse in FPGAs

Motion Estimation - Conventional

Motion Estimation - Data Reuse

[Equations comparing the power of implementations (a) and (b) in terms of $P_{add}$ and $P_{abs}$.]

Therefore, the power reduction factor is 11%.

DIGLOG multiplier

$C_{mult}(n) = 253 n^2$, $C_{add}(n) = 214 n$, where $n$ is the word length in bits

$A = 2^j + A_R$, $B = 2^k + B_R$
$A \cdot B = (2^j + A_R)(2^k + B_R) = 2^j 2^k + 2^j B_R + 2^k A_R + A_R B_R$

                     1st Iter   2nd Iter   3rd Iter
Worst-case error     -25%       -6%        -1.6%
Prob. of Error<1%    10%        70%        99.8%

With an 8 by 8 multiplier, the exact result can be obtained in at most seven iteration steps (worst case).
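The iteration can be sketched directly from the decomposition $A = 2^j + A_R$, $B = 2^k + B_R$: each pass adds the three shift-based terms and carries the neglected product $A_R B_R$ into the next pass. The function name and the `iterations` parameter are illustrative; this is a software model of the arithmetic, not a description of the hardware.

```python
def diglog_multiply(a, b, iterations):
    """Approximate a * b (non-negative ints) using only shifts and adds.
    Each pass adds 2^(j+k) + 2^j * B_R + 2^k * A_R, then continues on A_R * B_R."""
    total = 0
    for _ in range(iterations):
        if a == 0 or b == 0:
            break                         # residual product is zero: result is exact
        j, k = a.bit_length() - 1, b.bit_length() - 1
        a_r, b_r = a - (1 << j), b - (1 << k)
        total += (1 << (j + k)) + (b_r << j) + (a_r << k)
        a, b = a_r, b_r                   # neglected term A_R * B_R for the next pass
    return total

for passes in (1, 2, 3):
    approx = diglog_multiply(200, 100, passes)
    print(passes, approx, f"{(approx - 200 * 100) / (200 * 100):+.1%}")
```

Every pass removes the leading one bit of both operands, so the error shrinks quickly and the result is always an underestimate until the residual product reaches zero. Note that this sketch counts the initial approximation as the first pass, which may differ from how the slide counts iteration steps.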


Voltage Scaling

Merely changing a processor's clock frequency is not an effective technique for reducing energy consumption. Reducing the clock frequency reduces the power consumed by a processor, but it does not reduce the energy required to perform a given task.

Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work.

OS: Voltage Scaling

Different Voltage Schedules

[Figure: energy consumption (proportional to $V_{dd}^2$) versus time (0 to 25 s) for three schedules under a 25 s timing constraint.]

(A) 1000 Mcycles at 50 MHz and 5.0 V: 40 J
(B) 750 Mcycles at 50 MHz and 5.0 V, then 250 Mcycles at 25 MHz and 2.5 V: 32.5 J
(C) 1000 Mcycles at 40 MHz and 4.0 V: 25 J
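The three schedules can be recomputed with a simple model in which energy is proportional to the number of cycles times $V_{dd}^2$. The proportionality constant below is calibrated so that schedule (A) costs 40 J; that calibration and the function name are assumptions made for this sketch.

```python
def schedule_cost(segments, joules_per_mcycle_v2=40 / (1000 * 5.0 ** 2)):
    """segments: list of (mcycles, mhz, volts). Returns (seconds, joules).
    The energy constant is calibrated so that schedule (A) costs 40 J."""
    seconds = sum(mc / mhz for mc, mhz, _ in segments)
    joules = sum(mc * v ** 2 for mc, _, v in segments) * joules_per_mcycle_v2
    return seconds, joules

A = [(1000, 50, 5.0)]
B = [(750, 50, 5.0), (250, 25, 2.5)]
C = [(1000, 40, 4.0)]
for name, sched in (("A", A), ("B", B), ("C", C)):
    t, e = schedule_cost(sched)
    print(name, round(t, 1), "s", round(e, 1), "J")
```

Schedule (A) finishes early and idles, while (B) and (C) both use the full 25 s. Running the whole task at one moderately reduced voltage, as in (C), costs less than running most of it at full voltage and only the tail at a low voltage, as in (B). The model gives a value slightly above the 25 J shown for (C), so the figure's number appears to be rounded.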

OS: Voltage Scheduling

Scale Supply Voltage with fCLK

Adaptive Power Supply Voltages

Data Driven Signal Processing

The basic idea of averaging: two samples are buffered and their workloads are averaged. The averaged workload is then used as the effective workload to drive the power supply.

Using a ping-pong buffering scheme, data samples $I_{n+2}$, $I_{n+3}$ are being buffered while $I_n$, $I_{n+1}$ are being processed.
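Why averaging helps can be seen with a toy energy model. The sketch assumes the supply voltage is set proportional to the workload of each slot (a linear voltage-to-speed model) and that energy per unit of work scales with $V^2$; both are simplifying assumptions, and the workload numbers are invented.

```python
def energy(workloads, w_max=1.0):
    """Relative energy when each slot runs at a voltage proportional to its
    workload (assumed linear model): E ~ sum(w * (w / w_max) ** 2)."""
    return sum(w * (w / w_max) ** 2 for w in workloads)

def averaged(workloads):
    """Buffer samples in pairs and replace each pair by its mean workload."""
    out = []
    for i in range(0, len(workloads) - 1, 2):
        mean = (workloads[i] + workloads[i + 1]) / 2
        out += [mean, mean]
    if len(workloads) % 2:
        out.append(workloads[-1])       # an unpaired last sample is left as is
    return out

w = [0.2, 1.0, 0.4, 0.8]
print(energy(w), energy(averaged(w)))
```

The total work is unchanged, but because energy grows faster than linearly with workload in this model, smoothing out the peaks lowers the total. The price is the extra buffering and the latency of one pair of samples.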

Example of Buffering

RTL: Multiple Supply Voltages Scheduling, Filter Example

Viterbi decoder project

▶ Project title: Research on a low-power decoding algorithm for convolutional encoders
▶ Development period: 1999.02.22 - 11.30 (about 9 months)
▶ Objective and method: Research and development of original technology for reducing the power of the channel coding unit included in IMT-2000
▶ CODEC main specifications:
  - Code rate: R = 1/2, 1/3, 1/4, k = 9
  - Decoding method: Trace-back Viterbi decoder using soft decision

Viterbi decoder project

▶ Published papers

1. Asia Pacific Conference on ASIC ’99
In this paper, we have presented the use of the consensus term and clocking control signal in the ACSU for the low power Viterbi decoder. A 20% reduction in area and a 30% reduction in power consumption are obtained based on the low power ACSU architecture [1]. Applying our proposed glitch reduction techniques to [1], the power consumption is reduced by an additional 7% at a cost of a 3% increase in area.

2. International Conference on VLSI and CAD ’99
In this paper, we propose a new low power algorithm for the trace-back unit of the systolic array Viterbi decoder [2]. Reusing the already-generated trace-back routes reduces the number of trace-back operations and results in a larger region of spurious switching activity. Therefore, the switching activity during trace-back operation was further reduced using gated clocks. Our results showed on average a 40% reduction in power with the same latency, but a 23% increase in area, against the trace-back unit in [2]. We used Synopsys Design Compiler and measured power consumption using Synopsys DesignPower.

Viterbi decoder project

▶ References

1. B C. Y. Tsui, R.S. K. Cheng and C. Ling, “Using Transformation to Reduce Power Consumption of IS-95 CDMA Receiver”, International Symposium on Low Power Electronics and Design, 1999

2. T. K. Truong, A. M. T. Shih, I. S. Reed, E. H.Satorius, “A VLSI Design for a Trace-back Viterbi Decoder”, IEEE Trans. Communication, vol. 40, no. 3, Mar. 1992.


References

[1] A. Abnous and J. Rabaey, “Ultra-Low-Power Domain-Specific Multimedia Processors”, Proceedings of the IEEE VLSI Signal Processing Workshop, San Francisco, Oct. 1996.
[2] Digital Semiconductor, Digital Semiconductor SA-110 Microprocessor Technical Reference Manual, Digital Equipment Corporation, 1996.
[3] TMS320C5x General-Purpose Applications User’s Guide, Literature Number SPRU164, TI, 1997.
[4] T. Anderson, The TMS320C2xx Sum-of-Products Methodology, Technical Application Report SPRA068, TI, 1996.
[5] M. Tsai, IIR Filter Design on the TMS320C54x DSP, Technical Application Report SPRA079, TI, 1996.
[6] Ftp://ftp.ti.com/pub/tms320bbs/c5xxfiles/54xffts.exe, ’C54x Software Support Files, TI.
[7] C. Turner, Calculation of TMS320LS54x Power Dissipation, Technical Application Report SPRA164, TI, 1997.
[8] C. Turner, Calculation of TMS320LS54x Power Dissipation, Technical Application Report SPRA088, TI, 1996.
[9] E. Kusse, Personal communication, 1996.
[10] J. Rabaey et al., “Fast Prototyping of Data Path Intensive Architecture”, IEEE Design & Test Magazine, Vol. 8, No. 2, pp. 40-51, 1991.
[11] J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor”, IEEE Journal of Solid-State Circuits, Vol. 31, No. 11, pp. 1703-1714, Nov. 1996.
[12] A. Fischman and P. Rowland, Designing Low-Power Applications with TMS320LC54x, Technical Application Report SPRA281, TI, 1997.
[13] Daniel D. Gajski, Nikil D. Dutt, Allen C-H Wu, Steve Y-L Lin, “High-level synthesis, Introduction to chip and system design”, Kluwer Academic Publishers, 1992.
[14] Duncan A. Buell, Jeffrey M. Arnold, Walter J. Kleinfelder, “Splash2, FPGAs in Custom Computing Machine”, IEEE Computer Society Press, Los Alamitos, California.
[15] Jonathan Babb, Russell Tessier, Mathew Dahl, Silvina Zimi Hanono, David M. Hoki, and Anant Agarwal, “Logic emulation with virtual wires”, IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, Vol. 16, No. 6, June 1997.
[16] M. Vasilko, Djamel Ait-Boudaoud, “Architectural synthesis techniques for dynamically Reconfigurable logic”, Field Programmable Logic: Smart Applications, New Paradigms and Compilers, Proceedings of 6th Int. Workshop on Field Programmable Logic and Applications, FPL 96, Darmstadt, Germany, Sept. 23-25 1996.
[17] Patrick Lysaght, Gordon McGregor and Jonathan Stockwood, “Configuration Controller Synthesis for Dynamically Reconfigurable Systems”, IEE Colloquium on Hardware Software Cosynthesis for Reconfigurable Systems, 1996.
[18] M. Vasilko, Djamel Ait-Boudaoud, “Scheduling for dynamically Reconfigurable FPGAs”, Proceedings of International Workshop on Logic and Architecture Synthesis, pp. 328-336, IFIP TC10 WG10.5, Dec. 18-19 1995.
[19] Doug Smith, Dinesh Bhatia, “RACE: Reconfigurable and Adaptive Computing Environment”, Field Programmable Logic: Smart Applications, New Paradigms and Compilers, Proceedings of 6th Int. Workshop on Field Programmable Logic and Applications, FPL 96, Darmstadt, Germany, Sept. 23-25 1996. See http://www.ececs.uc.edu/~dal.
[20] Xilinx Netlist Format (XNF) Specification, Version 6.1, June 1, 1995.
[21] Xilinx XABEL reference manual.

SOC CAD Companies

• Avant! www.avanticorp.com
• Cadence www.cadence.com
• Duet Tech www.duettech.com
• Escalade www.escalade.com
• Logic visions www.logicvision.com
• Mentor Graphics www.mentor.com
• Palmchip www.palmchip.com
• Sonic www.sonicsinc.com
• Summit Design www.summit-design.com
• Synopsys www.synopsys.com
• Topdown design solutions www.topdown.com
• Xynetix Design Systems www.xynetix.com
• Zuken-Redac www.redac.co.uk
