Promising low power reusable solutions: Application Specific Instruction-set Processors


Post on 15-May-2020


TRANSCRIPT

Page 1

Promising low power reusable solutions: Application Specific Instruction-set Processors

Myung Hoon Sunwoo
Multimedia Comm. SoC Lab., Ajou University, Korea

Ajou Univ. SOC Lab., Multimedia Communications, 1 / 75

Page 2

Outline

• What is ASIP? and Why ASIP?
• SPOCS (Signal Processors for OFDM Communications)
    SPOCS Architecture for FFT and Bit Manipulation
    Performance Comparisons and Implementations
• DASIP (Digital Audio Specific Instruction set Processor)
    Proposed Instructions and Coprocessor
    Proposed Inverse Quantization Algorithm
• VSIP (Video Specific Instruction set Processor)
    Proposed Instructions and Coprocessors
    Performance Comparisons
• Trends of recent ASIPs
    Applications of Low power ASIPs
    ASIP design technologies
• Conclusions


Page 4

What is ASIP?

DSP
• Advantages: Programmability, Flexibility
• Disadvantages: Low Performance / High Power Consumption

ASIC
• Advantages: Optimization, Low Power, High Performance
• Disadvantages: High Development Cost, Low Flexibility, Long Time to Market

ASIP = Advantages of ASIC + Advantages of DSP

Multi-Standard Multimedia & Communications: WLAN, 4G Wireless Communication, DVB, DAB, H.264/AVC, DMB

Page 5

What is ASIP?

Changes of System Design Environment
• Short Time to Market
• Frequent Spec. Changes
• 27% CAGR (Compound Annual Growth Rate) of DSP Market

[Chart: DSP market size in $B (axis 0 to 18) by year, 2002 to 2009]

Source: Forward Concepts, February 2005

Page 6

Why ASIP?

Computational Efficiency and Flexibility

[Figure: flexibility versus performance for General Purpose Processors (StrongARM110, 0.4 MIPS/mW), Digital Signal Processors (TMS320C54x, 3 MIPS/mW), Application Specific Instruction set Processors, Application Specific ICs, and Physically Optimized ICs. Source: T. Noll, RWTH Aachen]

• Determine the Best Choice between Flexibility vs. Performance
• High Performance and Flexibility System: Application Specific Instruction set Processors

Page 7

What Resources in SOC

• Digital signal processors, Microprocessors, ASIPs
• Various Memories
• Peripheral, Interface
• Programmable Cores
• A/D, D/A, Analog
• RTOS
• Middle Ware
• Application SW

[Figure: SoC software and hardware layers.
 Hardware-independent Software: Applications, User defined Interface, Libraries, Middleware
 Hardware-Dependent Software: Operating Systems (Kernel), Device Drivers
 Hardware: Analog, CPU Core, DSP, ROM, MPEG, Cache, DRAM, Logic, etc.]

Page 8

SOC Challenges

Reuse Technology
• Timing Driven Design Methodology
• Block Based Design Methodology
• Platform Based Design Methodology

[Figure: example chip floorplans for the methodologies, built from blocks such as reusable μP core, ROM, SRAM, ATM, Data Cache, MPEG, RAM, Serial I/F, Soft I/F IP, Logic, and Customer Defined Logic]

Cited from "Surviving the SOC Revolution," Chang et al., Kluwer Academic Publishers

Page 9

Microprocessors vs. Digital Signal Processors

History of Microprocessors: Converging

Page 10

Microprocessors vs. Digital Signal Processors

History of DSPs: Diverging

Hundreds of DSPs (In-house)

Page 11

Design flow of ASIP

• Target Application Selection
• Application Profiling
• H/W, S/W Partitioning
• Design Special Instructions and Architecture
• Design Hardware Accelerators
• Verification and Performance Comparison
• Chip Fabrication

                                      SPOCS                DASIP                      VSIP
Target application                    WLAN                 MPEG-2/4 AAC               H.264/AVC
Special instructions / accelerators   FFT, Bit operation   IMDCT, Huffman decoding    ME/MC, VLC

Verification: FPGA board, LISA simulator, C/Matlab program

Page 12

Design flow using LISATek tools

[Figure: LISATek design loop. A LISA 2.0 Description is given to the LISATek Processor Designer, which generates the Software Tools (C-Compiler, Assembler, Linker, Simulator). The Application is run on these tools for Debugging & Profiling, the results are analyzed, and the Architecture is adjusted while the design goals are not met. When the design goals are met, RTL Generation builds the RTL Implementation (Verilog, VHDL, SystemC) and the ConvergenSC SystemC Models.]

Page 13

Software tool development

[Screenshots: LISATek Development Environment, showing the Assembler / Linker (assembly code and disassembly views) and the Simulator (register, memory, and pipeline views)]

Page 14

HW/SW verification environment

• Compares the results of the FPGA board, the C / Matlab program, and the LISA simulator
• Reduces the ASIP development time

Ex) Verification of IMDCT of DASIP: the C simulator, FPGA, and LISA simulator results match.


Page 16

Signal Processors for OFDM Communication Systems (SPOCS)

FFT calculation problem of General DSP
• Do/Loop instruction => additional cycles needed
• Inefficient Butterfly calculation (Fixed MAC structure)

SPOCS
• FFT #N (Instruction): input data address decision
• FAGU (FFT AGU): address generation (automatically) using an address offset, which reduces address generation time

[Block diagram: PCU (Program Control Unit), Program Memory, AGU (Address Generation Unit) with FAGU (FFT AGU), Data Memory, DPU (Data Processing Unit), BMU (Bit Manipulation Unit)]

DSP            FFT calculation cycle (N: FFT point)
Carmel DSP     (N+10)log2N + 5N/4 - 4
TMS320C62X     (4N/2)log2N + 7log2N + N/4 + 9
SPOCS          (2N/2)log2N + 9

• SPOCS: application specific signal processor for OFDM communication systems [Jour. of signal proc., 2008].
• Design of new DSP instructions and their hardware architecture for high-speed FFT [Jour. of VLSI signal proc., 2003].
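The cycle-count formulas on this slide can be evaluated directly. The following is a minimal Python sketch; the function names are illustrative and do not come from the slides, and N is assumed to be a power of two of at least 4 so that every term is an integer.

```python
def log2_int(n: int) -> int:
    """Return log2(n) for a power-of-two FFT size n >= 4."""
    if n < 4 or n & (n - 1):
        raise ValueError("FFT size must be a power of two >= 4")
    return n.bit_length() - 1


def carmel_cycles(n: int) -> int:
    # (N + 10) * log2(N) + 5N/4 - 4
    return (n + 10) * log2_int(n) + 5 * n // 4 - 4


def c62x_cycles(n: int) -> int:
    # (4N/2) * log2(N) + 7 * log2(N) + N/4 + 9
    return (4 * n // 2) * log2_int(n) + 7 * log2_int(n) + n // 4 + 9


def spocs_cycles(n: int) -> int:
    # (2N/2) * log2(N) + 9
    return (2 * n // 2) * log2_int(n) + 9


if __name__ == "__main__":
    for n in (64, 1024, 2048):
        print(n, carmel_cycles(n), c62x_cycles(n), spocs_cycles(n))
```

For N = 64 the three formulas evaluate to 520, 835, and 393 cycles respectively.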

Page 17

SPOCS architecture

[Figure: proposed DPU architecture (2MAC/1ALU) with two multipliers (Mul, P1 and P2), Adder1, Adder2, Adder3, accumulators Acc1, Acc2, Acc3, and Switching Logic. Butterfly calculation flow: Cycle 1 (SBUTTERFLY), Cycle 2 (ABUTTERFLY).]

SPOCS FFT Calculation

DPU Architecture
• Adds Switching Logic to the fixed MAC of existing DSPs
• Supports MUL-MUL-SUB(ADD) and ADD-SUB operations per cycle

FFT Instruction
• Existing DSP: uses many instructions (DO, ADD, SUB, Load, Store, MAC, etc.)
• SPOCS: FFT, SBUTTERFLY, ABUTTERFLY

Support Various Instructions
• 51 instructions including the new instructions

Page 18

SPOCS bit manipulation operations

Motivation
• Various communication systems have been developed, such as xDSL, WLAN, DMB, IMT2000, etc.
• These systems have similar bit manipulation functions.

[Figure: baseband processing chain. Transmit side: Baseband Data, Scrambling, Convolutional Encoding/Puncturing, Interleaving, Modulation, Channel. Receive side: Sync/Demodulation, Deinterleaving, Viterbi Decoding, Descrambling, Baseband Data.]

Page 19

Basic bit manipulation operations

Scrambling
• The N-th output is decided by XOR operations of the input bit and the N-th shifted data according to the generator polynomial
• Generator Polynomial = X^7 + X^4 + 1
• Shift, XOR operations
[Figure: shift register R0 to R6 with Input and Output]

Convolutional Encoding
• Outputs are derived by XOR operations of bits in the shift register, as decided by the encoder structure
• Generator Polynomial = X^7 + X^4 + 1
• Shift, XOR operations
[Figure: shift register R0 to R5 with Input, Output A, and Output B]

Bit Stream Multiplexing
• Combines two bit streams in alternating order
[Figure: streams A4 A3 A2 A1 A0 and B4 B3 B2 B1 B0 are merged; for 8-bit streams the result is B7 A7 B6 A6 B5 A5 B4 A4 B3 A3 B2 A2 B1 A1 B0 A0]
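The shift-and-XOR scrambling and the bit stream multiplexing described on this slide can be sketched in software as follows. This is a minimal Python illustration, not the SPOCS hardware: the function names, the all-ones default seed, and the choice to model the scrambler as a frame-synchronous (additive) one are assumptions made for the example. The feedback taps follow the generator polynomial X^7 + X^4 + 1.

```python
def scrambling_sequence(length: int, seed: int = 0b1111111) -> list[int]:
    """Generate the X^7 + X^4 + 1 scrambling sequence.

    Bit i of `state` holds register stage x(i+1); the feedback is x4 XOR x7.
    """
    state = seed & 0x7F
    sequence = []
    for _ in range(length):
        feedback = ((state >> 3) ^ (state >> 6)) & 1  # x4 XOR x7
        sequence.append(feedback)
        state = ((state << 1) | feedback) & 0x7F      # shift, insert feedback at x1
    return sequence


def scramble(bits: list[int], seed: int = 0b1111111) -> list[int]:
    """XOR the data with the scrambling sequence (the same call descrambles)."""
    return [b ^ s for b, s in zip(bits, scrambling_sequence(len(bits), seed))]


def multiplex(a: int, b: int, width: int = 8) -> int:
    """Interleave two bit streams so that the result reads ... B1 A1 B0 A0."""
    result = 0
    for i in range(width):
        result |= ((a >> i) & 1) << (2 * i)       # A_i goes to the even position
        result |= ((b >> i) & 1) << (2 * i + 1)   # B_i goes to the odd position
    return result


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0]
    print(scramble(data), scramble(scramble(data)) == data)
    print(hex(multiplex(0xFF, 0x00)))
```

A general-purpose processor needs one shift and one XOR per bit in the loop above, which is the cost the slide's dedicated bit manipulation hardware is meant to remove.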

Page 20

Basic bit manipulation operations

Puncturing
• Deletes some of the encoded bits according to patterns
• Bit Insert and Extract Operations
[Figure: Input A and Input B are combined into the punctured Output]

Interleaving
• Shuffles the input bits
• Bit Insert and Extract Operations
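Convolutional encoding followed by puncturing can be sketched as below. This is a hedged Python illustration only: the generator taps (133 and 171 octal, constraint length 7) and the rate-3/4 keep pattern are the values commonly cited for IEEE 802.11a and are assumptions here, since the slides do not list them; the function names are also illustrative.

```python
# Delays (in input steps) that feed each output; assumed 802.11a generators.
G0_TAPS = (0, 2, 3, 5, 6)  # 133 octal -> output A
G1_TAPS = (0, 1, 2, 3, 6)  # 171 octal -> output B


def conv_encode(bits):
    """Rate-1/2 encoder: each output is the XOR of selected shift-register bits."""
    history = 0  # bit k holds the input from k steps ago
    out_a, out_b = [], []
    for bit in bits:
        history = ((history << 1) | (bit & 1)) & 0x7F
        out_a.append(sum((history >> t) & 1 for t in G0_TAPS) & 1)
        out_b.append(sum((history >> t) & 1 for t in G1_TAPS) & 1)
    return out_a, out_b


def puncture(out_a, out_b, keep_a=(1, 1, 0), keep_b=(1, 0, 1)):
    """Delete encoded bits according to a repeating keep pattern."""
    period = len(keep_a)
    sent = []
    for i, (a, b) in enumerate(zip(out_a, out_b)):
        if keep_a[i % period]:
            sent.append(a)
        if keep_b[i % period]:
            sent.append(b)
    return sent


if __name__ == "__main__":
    a, b = conv_encode([1, 0, 1, 1, 0, 0])
    print(a, b, puncture(a, b))
```

With the default pattern, every three input bits produce six coded bits of which four are kept, which gives an overall rate of 3/4.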

Page 21

SPOCS bit manipulation Instructions

Puncturing, Interleaving
• Existing DSP: Shift Left and Shift Right on the input data, then an OR operation for data generation
• SPOCS: Bit Extract through a Programmable Switch; the Bit Load Register loads the extracted bit for data generation; 1 cycle operation

Scrambling, Convolution
• Existing DSP: Shifter for the shift, ALU for the XOR operation
• SPOCS: BMU, in which a maximum of 9 data can be shifted and XORed; 1 cycle operation

Page 22

FFT performance of SPOCS

Key Features
• Proposed Instructions for FFT Calculation: FFT, ABUTTERFLY, SBUTTERFLY
• FAGU: automatically generates data addresses (Very Fast FFT Operation)
• Reduces Program Memory Accesses (only three instructions) => Very Low Power
• Meets Various Communication Standards

Standard         FFT point   Time limit (µs)   SPOCS time (µs)
WLAN (54Mbps)    64          4                 1.4
DAB              512         62                16.5
DAB              2048        256               80.5
DVB-T            2048        231               80.5
VDSL             4096        250               174.5

• Implementation of application-specific DSP for OFDM systems [IEEE ISCAS 2004].
• FFT operating apparatus of programmable processors and operation method thereof [US/European patents].
• Digital signal processor architecture with bit manipulation accelerator for communication systems [EURASIP JASP, 2005].
• Bit manipulation operation circuit and method in programmable processor [US patents].

Page 23

OFDM performance of SPOCS

Performance                 Carmel DSP     TMS320C62X     SPOCS
DSP Structure               VLIW           VLIW           Application Specific DSP
Hardware Size               VLIW (N.A.)    VLIW (N.A.)    107,000 Gates + 12 Kbyte Memory
DPU Structure               2MAC/2ALU      2MUL/6ALU      2MAC/1ALU
Cycles/Butterfly            2              4              2
Calculation Time (FFT)
  64-point                  520            835            393
  256-point                 2,452          4,225          2,057
  1024-point                11,616         20,815         10,249
  2048-point                25,194         45,654         22,537

                                               StarCore SC140   TMS320C62X   SPOCS
Convolution (IS-95) (K=9, R=1/2, 192 bits)     463              N.A.         152
Block Interleaving (802.11a) (16 * 6 bits)     414              N.A.         91
Scrambling (802.11a) (12Mbit/s)                N.A.             39 x 10^6    20 x 10^6
Convolution (802.11a) (12Mbit/s)               N.A.             77 x 10^6    12 x 10^6

Operation: 4 Shift / 4 Logical Operation; BMU

Page 24

SPOCS implementation

SPOCS Core Design
• SEC 0.18um Synthesis (Synopsys)
• Gate: 107,000
• Program Memory: 4 Kbyte, Data Memory: 8 Kbyte
• Frequency: 290MHz

FPGA Implementation
• iPROVE Xilinx xc2v6000
• Emulates IEEE 802.11a WLAN

Summary
• Special Instruction Set for FFT Operation and BMU Instructions
• Can meet OFDM Communication standards

Page 25

SPOCS implementation

Macro Libraries for IEEE 802.11a

Scrambling (Descrambling)

    DO #end, @R3
    SCB GR7, #0x0c
    MOV2 @R1, ACC0 | @R4, ACC1
    PUNC ACC1, GR2L
    MOV2 @R2, ACC0 | @R5, ACC1
    PUNC ACC1, GR3L
    end:

Convolution Encoding

    DO #ENDDO, @R4
    MOVEC R5, GR7
    MOVE @R0+, GR2
    CONV GR0, GR2, GR3, GR4
    CONV GR1, GR2, GR3, GR5
    MOVE @R1+, ACC0

Interleaving (Deinterleaving)

    DO #loop1, @R5
    DO #label1, @R1
    MOVE @R0+, GR0
    label1: PUNC ACC1, GR0L, GR6
    ROL GR6, ACC0
    MOVEC GR4, R0

Mapping (Demapping): start of 64 QAM mapping

    MOVI #0x0000,R3     * Q-channel input
    MOVI #0x0050,R4     * I-channel input
    MOVI #0x0090,R1     * to loop
    MOVI #0x0030,GR7    * #48 looping
    MOVE GR7,@R1
    DO #endof1,@R1
    MOVE @R3,GR0        * to change value two's complement
    MOVI #0x0003,ACC0   * make ACC0 011 to get last 2bits of GR0
    AND GR0,ACC0        * get last 2bits of GR0
    MOVEC ACC0,GR1      * store the value of ACC0
    MOVI #0x0002,ACC0   * make ACC0 010 to compare with GR1

IFFT (FFT)

    MOVI PSW 0x4000     -- PSW setting scale down
    MOVI M0 0x000A      -- Xmem base = 10
    MOVI R7 0x000A      -- Ymem base = 10
    IFFT #256
    SBUTTERFLY
    ABUTTERFLY

Page 26

SPOCS implementation

HW/SW Verification Environment using FPGA, Matlab, and LISA simulator


Page 29

Digital Audio Specific Instruction set Processor (DASIP)

Audio Applications: DOLBY (AC3), DTS 96/24, MPEG AAC, MP3PRO, OGG, WMA

ASIP for Audio Applications
• High Speed IMDCT
• High Speed Parallel Execution of Huffman Decoding
• High Performance AAC
• Application Specific Instruction Set for Audio Algorithm

Page 30

Digital Audio Specific Instruction set Processor (DASIP)

• Register files including 32 registers
• Program control unit, data processing unit, address generation unit
• Huffman accelerator for MPEG-2/4 AAC
• 2 ROM tables and 2 Data Memories

[Block diagram: Program Control Unit, Program Memory, Data Processing Unit, Address Generation Unit, Register files, Huffman accelerator, two ROM tables, two Data Memories]

• Design of a high-quality audio-specific DSP core [Best Paper Award in IEEE SIPS 2005].
• Computing circuits and method for running an MPEG-2 AAC or MPEG-4 AAC audio decoding algorithm on programmable processors [US and Korea patents].

Page 31

Complexity of the MPEG-2 AAC decoding

High computational loads
• Filterbank: IMDCT (Inverse Modified DCT)
• Huffman decoding: Compare & Program controls

[Pie chart: share of the decoding load for Filterbank, Huffman Decoding, Inv-Quant & scale, and Etc.; percentage labels in the chart: 4.1%, 33%, 16%, 48%]

Page 32

Fast IMDCT Algorithm

• The fast algorithm efficiently reduces the computational load of the overall system by a factor of about 10
• Uses an N/4-point complex IFFT

[Flow: $X(k)$ is rearranged into the complex sequence $X\!\left(\tfrac{N}{2} - 2k - 1\right) + j \cdot X(2k)$, multiplied by the twiddle factor $e^{j\frac{2\pi}{N}\left(k + \frac{1}{8}\right)}$, passed through the N/4-point complex IFFT, and multiplied by $e^{j\frac{2\pi}{N}\left(n + \frac{1}{8}\right)}$ to give $x(n)$]

Page 33

Proposed instructions for IMDCT

IMDCT flow: X(k) -> Pre-processing (LDPRE, ST2) -> N/4 IFFT -> Post-processing (LD4, ST2) -> Data de-interleaving (LD4, ST2) -> x(n)

LDPRE instruction
• 4 data transfers (load)
• IAMU
• Supports parallel loads

LD4 instruction
• 4 data transfers (load)
• High data bandwidth
• Supports parallel loads

ST2 instruction
• 2 data transfers (store)
• High data bandwidth
• Supports parallel stores

Page 34

Huffman decoder

Specific Instructions for Huffman decoding

HFMD GR0, GR1, Acc0, GR[n]
• GR0: index (9 bit) of [Acc0]
• GR1: code length (5 bit) of [Acc0]

<Special Feature>
▪ Gate Count: 3800 gates
▪ Index value directly loaded to Register

[Figure: Huffman decoder datapath. The bitstream parser feeds the Accumulator; with the Huffman book select taken from a General Reg., the HFMD Huffman decoder writes the index and the code length to two General Regs.]

<Performance Comparisons of Huffman Decoding>

Processor       Computation Cycle
TMS320C62x      N.A. (Very large)
Korean DSP      5 cycles
ASIC            2.5 cycles
Ajou ASIP       2 cycles

Page 35

Proposed inverse quantization algorithm

$X^{4/3} = \left(\frac{X}{8} \times 8\right)^{4/3} = \left(\frac{X}{8}\right)^{4/3} \times 16$

Stages (rem( ) is the Remainder Function, [ ] is the Gauss Function)

(1) X from 1 to 256: $X^{4/3} = \mathrm{LUT}(X)$

(2) X from 257 to 2047: interpolation between $\mathrm{LUT}([X/8])$ and $\mathrm{LUT}([X/8] + 1)$ weighted by $\mathrm{rem}(X/8)$, scaled by $2^{4}$, with a correction term in $(401 - [X/8])$

(3) X from 2048 to 8191, if $\mathrm{rem}(X/64) \le 32$: interpolation between $\mathrm{LUT}([X/64])$ and $\mathrm{LUT}([X/64] + 1)$ weighted by $\mathrm{rem}(X/64)$, scaled by $2^{8}$, with a correction term in $(218 - [X/64])$

(4) X from 2048 to 8191, if $\mathrm{rem}(X/64) > 32$: as in (3), weighted by $(\mathrm{rem}(X/64) - 64)$

[The full equations for stages (2) to (4) are not legible in this transcript]

Features
1. Requires a 256-entry LUT
2. Consists of 4 stages
3. No computation is required at the first stage
4. All multiplications and divisions can be achieved by shift operations only
5. The positive and negative errors have almost the same distribution (this can reduce error accumulation)
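The LUT-plus-shift idea behind the stages on this slide can be illustrated with plain linear interpolation. The sketch below is an assumption-laden Python example: it implements only the direct interpolation between adjacent LUT entries (using the identities X^(4/3) = (X/8)^(4/3) x 2^4 and X^(4/3) = (X/64)^(4/3) x 2^8) and does not include the proposed algorithm's correction terms; the function names are illustrative.

```python
# LUT[k] = k^(4/3) for k = 1..256 (index 0 is unused padding).
LUT = [0.0] + [k ** (4.0 / 3.0) for k in range(1, 257)]


def inverse_quant_direct(x: int) -> float:
    """Approximate x^(4/3) for 1 <= x <= 8191 with a 256-entry LUT."""
    if not 1 <= x <= 8191:
        raise ValueError("x must be in 1..8191")
    if x <= 256:
        return LUT[x]                      # stage 1: direct lookup, no computation
    shift, scale = (3, 16.0) if x < 2048 else (6, 256.0)
    q = x >> shift                         # Gauss (floor) function [x / 2^shift]
    r = x & ((1 << shift) - 1)             # remainder function rem(x / 2^shift)
    base = LUT[q]
    step = LUT[q + 1] - base
    return (base + step * r / (1 << shift)) * scale


def max_abs_error(lo: int, hi: int) -> float:
    """Largest absolute deviation from the exact value over lo..hi."""
    return max(abs(inverse_quant_direct(x) - x ** (4.0 / 3.0))
               for x in range(lo, hi + 1))


if __name__ == "__main__":
    print(inverse_quant_direct(512))       # exact value is 4096
    print(max_abs_error(257, 2047), max_abs_error(2048, 8191))
```

Because x^(4/3) is convex, plain interpolation always overestimates between table points, so its error is one-sided; this is the kind of bias that feature 5 on the slide says the proposed method balances out.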

Page 36

Proposed architecture

EXTB instruction
• Computes the rem(X/N) and the gauss [X/N] functions in one cycle
• Can reduce computational loads

The syntax and the operation of the EXTB instruction

Syntax        EXTB ACC0, GR0, #N
Description   ACC <- rem(GR0 / 2^N) when N < 0
              ACC <- [GR0 / 2^N] when N > 0

Implementation Results (Instruction count)

Processor                                ARM   TI 54X   DASIP
Direct linear interpolation algorithm    29    27       21
Tsai algorithm                           61    57       47
Proposed algorithm                       49    46       38

T. H. Tsai and C. C. Yen, "A High Quality Re-quantization/Quantization Method for MP3 and MPEG-4 AAC Audio Coding," in Proc. IEEE Int. Symp. on Circuits and Syst., 2002, pp. 851-854
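The two results selected by the sign of N can be modeled with a shift and a mask. This Python sketch is only a behavioral illustration: the function name `extb` mirrors the mnemonic, and interpreting a negative N as "remainder modulo 2^|N|" is an assumption, since the slide does not spell out how the magnitude is used.

```python
def extb(gr0: int, n: int) -> int:
    """Behavioral model: floor for N > 0, remainder for N < 0 (assumed 2^|N|)."""
    if gr0 < 0:
        raise ValueError("model covers non-negative operands only")
    if n > 0:
        return gr0 >> n                  # Gauss function [gr0 / 2^n]
    if n < 0:
        return gr0 & ((1 << -n) - 1)     # remainder rem(gr0 / 2^|n|)
    raise ValueError("n must be non-zero")


if __name__ == "__main__":
    x = 100
    print(extb(x, 3), extb(x, -3))       # quotient and remainder of 100 / 8
```

Both results come from the same operand with no division, which is why a single-cycle instruction of this kind fits the shift-only inverse quantization stages.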

Page 37

Proposed inverse quantization algorithm

[Figure: error graph of the proposed IQ method, proposed method vs. direct method]

Error                      Direct Method (256)   Korean 256 (2001)   Taiwan 256 (2003)   Taiwan 128 (2003)   Proposed Algorithm (256)
Max. error (257-2048)      0.08728               0.04365             0.02538             0.03669             0.048115
Max. error (2049-8191)     1.39655               0.69832             0.35389             0.58217             0.323076
Average error              0.41979               -0.20990            0.03161             0.16233             0.0079631

Novel non-linear inverse quantization algorithm and its architecture for digital audio codecs [IEEE ISCAS 2007].


Page 39

Video Specific Instruction set Processor (VSIP)

Video Applications: JPEG 2000, MPEG 2/4, H.264/AVC

ASIP for Video Applications
• Special Features for Huffman
• ME/MC Coprocessor: Parameterized, Highly Parallel Architecture
• Optimized DALU for Integer DCT, Loop Filter
• Application Specific Instruction Set for Video Algorithm

Page 40

Video Specific Instruction set Processor (VSIP)

[Pie chart: H.264 Decoding (%): MC, In-Loop filter, VLC, Color converter, Inv. Transform/Q, Intra Prediction, Decode MV, Other]

[Pie chart: H.264 Encoding (%): Motion Estimation, Intra Prediction, In-loop filter, CAVLC/UVLC, Transform/Q]

[Block diagram: DSP Core (PCU, DPU, AGU, Program memory, Data memory) with Specific Instructions, and an ME/MC Coprocessor]

ASIP Instructions and their hardware architecture for H.264/AVC [Journal of Semiconductor Technology and Science, 2005.12]

Page 41

H.264 computation characteristic

Deblocking filtering

    p'0 = (p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3
    p'1 = (p2 + p1 + p0 + q0 + 2) >> 2
    p'2 = (2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3

    p'0 = (2*p1 + p0 + q1 + 2) >> 2
    p'1 = p1
    p'2 = p2

Intra prediction

– a is predicted by (A + 2B + C + I + 2J + K + 4) >> 3
– b, e are predicted by (B + 2C + D + J + 2K + L + 4) >> 3
– c, f, i are predicted by (C + 2D + E + K + 2L + M + 4) >> 3
– d, g, j, m are predicted by (D + 2E + F + L + 2M + N + 4) >> 3
– h, k, n are predicted by (E + 2F + G + M + 2N + O + 4) >> 3
– l, o are predicted by (F + 2G + H + N + 2O + P + 4) >> 3
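The two sets of deblocking equations on this slide are sums of small integer multiples followed by a rounding shift, which is what makes them a good fit for a packed add instruction. The following Python sketch evaluates them for one line of pixels; the function name and the `strong` flag that selects between the two sets are hypothetical, since the slide does not state the selection condition.

```python
def filter_p_side(p, q, strong: bool):
    """Return (p'0, p'1, p'2) for pixels p = (p0, p1, p2, p3), q = (q0, q1)."""
    p0, p1, p2, p3 = p
    q0, q1 = q
    if strong:
        # First set of equations: all three p-side pixels are modified.
        new_p0 = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3
        new_p1 = (p2 + p1 + p0 + q0 + 2) >> 2
        new_p2 = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3
        return new_p0, new_p1, new_p2
    # Second set of equations: only p0 is modified.
    return (2 * p1 + p0 + q1 + 2) >> 2, p1, p2


if __name__ == "__main__":
    # A sharp edge between a bright p side and a dark q side.
    print(filter_p_side((100, 100, 100, 100), (20, 20), strong=True))
    print(filter_p_side((100, 100, 100, 100), (20, 20), strong=False))
```

The added constants 4 and 2 are the rounding offsets for the shifts by 3 and 2, so a flat region passes through unchanged.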

Page 42

Proposed instruction

Packed Instruction

[Figure: registers holding four 8-bit fields, comparing the existing packed instruction with the packed instruction required for H.264]

Page 43

Integer transform

Integer transform matrix

$$Y = (C X C^{T}) \otimes E = \left( \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix} X \begin{bmatrix} 1 & 2 & 1 & 1 \\ 1 & 1 & -1 & -2 \\ 1 & -1 & -1 & 2 \\ 1 & -2 & 1 & -1 \end{bmatrix} \right) \otimes \begin{bmatrix} a^{2} & ab/2 & a^{2} & ab/2 \\ ab/2 & b^{2}/4 & ab/2 & b^{2}/4 \\ a^{2} & ab/2 & a^{2} & ab/2 \\ ab/2 & b^{2}/4 & ab/2 & b^{2}/4 \end{bmatrix}$$

$$a = \tfrac{1}{2}, \quad b \cong \sqrt{\tfrac{2}{5}}, \quad d \cong \tfrac{1}{2}$$

Operation flow of 4x4 integer transform

[Figure: butterfly flow graphs between x(0)..x(3) and X(0)..X(3) for the 1D Forward Transform (multipliers of 2) and the 1D Inverse Transform (multipliers of 1/2)]

Page 44: Piil bl ltiPromising low power reusable solutions: Apppp ... · Digital Signal Processors History of Microprocessors Converging ... Design Special Instructions andArchitecture Design
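
To make the role of the two factors in Y = (C X C^T) ⊗ E concrete, the following Python sketch computes the integer core C X C^T and then applies E element by element. It is a minimal sketch under stated assumptions: the helper names are hypothetical, and the values a = 1/2 and b = sqrt(2/5) are the constants given with the matrix above.

```python
import math

# Core transform matrix C from the slide.
C = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matmul(a, b):
    """Plain 2D list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def core_transform(x):
    """Integer-only part: C X C^T (adds, subtracts and shifts in hardware)."""
    return matmul(matmul(C, x), transpose(C))


def scaling_matrix(a=0.5, b=math.sqrt(2 / 5)):
    """Element-wise scaling matrix E built from a and b."""
    even = [a * a, a * b / 2, a * a, a * b / 2]
    odd = [a * b / 2, b * b / 4, a * b / 2, b * b / 4]
    return [even, odd, even[:], odd[:]]


def forward_transform(x):
    """(C X C^T) scaled element by element with E."""
    w = core_transform(x)
    e = scaling_matrix()
    return [[w[i][j] * e[i][j] for j in range(4)] for i in range(4)]
```

The split matters for an ASIP: only `core_transform` needs dedicated datapath support, since the scaling by E can be folded into a later multiplication stage.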

Proposed instructions

fTRAN, iTRAN
- Forward / backward transform
- 4 x 1 1D transform in 1 cycle
- 2 input operands, 1 output operand
- Two modes, one for 16x16 blocks and one for others

Operation
    ADD R0(0), R0(3), tmp0
    ADD R0(1), R0(2), tmp1
    SUB R0(1), R0(2), tmp2
    SUB R0(0), R0(3), tmp3
    ADD tmp0, tmp1, R4(0)
    ADD tmp2, tmp3<<1, R4(1)
    SUB tmp0, tmp1, R4(2)
    SUB tmp3, tmp2<<1, R4(3)

Assembly
    R4 = fTRAN (R0, mode)
    - mode 1 : 16x16
    - mode 2 : Others
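
The eight add/subtract operations that fTRAN replaces can be modelled in a few lines. The Python sketch below is a software model only, not the hardware description: the name `ftran` is borrowed from the slide, `ftran_reference` is a hypothetical checker, the odd outputs are taken from the matrix rows (2, 1, -1, -2) and (1, -2, 2, -1), and the mode operand is left out because the slide does not define what each mode changes.

```python
def ftran(r0):
    """Software model of one 4x1 forward transform (butterfly form)."""
    x0, x1, x2, x3 = r0
    tmp0 = x0 + x3
    tmp1 = x1 + x2
    tmp2 = x1 - x2
    tmp3 = x0 - x3
    return [
        tmp0 + tmp1,           # row (1,  1,  1,  1)
        tmp2 + (tmp3 << 1),    # row (2,  1, -1, -2)
        tmp0 - tmp1,           # row (1, -1, -1,  1)
        tmp3 - (tmp2 << 1),    # row (1, -2,  2, -1)
    ]


def ftran_reference(r0):
    """Direct matrix-vector product, used only to check the butterfly."""
    rows = [(1, 1, 1, 1), (2, 1, -1, -2), (1, -1, -1, 1), (1, -2, 2, -1)]
    return [sum(c * x for c, x in zip(row, r0)) for row in rows]
```

The butterfly needs four first-stage and four second-stage operations, so a single-cycle instruction stands in for eight ALU instructions per 4-sample vector.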

Performance comparisons

Deblocking filtering performance

[Figure: breakdown of deblocking filtering. Edge Filtering (66 %), Others (34 %)]

64x:
    LDW AX0, p
    LDW AX1, q
    LDW r1 #h’4
    LDW r2 #h’1
    LDW r3 #h’1222
    DOTPU4 r2, p
    DOTPU4 r3, q
    ADD2 acc0, acc1
    ADD2 acc0, r1
    SHFL acc0 3
    PACK acc0
    STDW acc0

Proposed instruction:
    r0=M(a0)
    r1=M(a1)
    r3=#h’4
    r4=hadd(r0:0011.0001)
    r5=hadd(r1:0111.0011)
    acc0=r4+r5
    acc0=(acc0+r3)>>3
    M(a3)=acc0

- 64x: 15 instructions; proposed instruction: 9 instructions (reduced 40 %)
- Improves deblocking filtering performance by 20~25 %

Integer transform performance

                TMS320c55x   TMS320c55x   TMS320c64x   Proposed
                SW           HW           SW           ASIP
Required MIPS   12.8         2.8          1.0          1.2

Novel Instructions and Their Hardware Architecture for Video Signal Processing [IEEE ISCAS 2005]
ASIP Approach for Implementation of H.264/AVC [Journal of Signal Processing Systems, Jan. 2008]

VSIP implementation

Compare FPGA board, C / Matlab, Lisa simulator

Forward Integer Transform

    loop #16 lp
    R0=M(AR0,2)        - - - copy pixels to register
    R1=M(AR0,2)
    R2=M(AR0,2)
    R3=M(AR0,2)
    loop #2 ftran      - - - loop
    RF1=trans(RF0)     - - - transpose 4 x 4 matrix
    R0=ftran(R4,1)     - - - 1D integer transform
    R1=ftran(R5,1)
    R2=ftran(R6,1)
    R3=ftran(R7,1)
    ftran:             - - - ftran loop end
    nop
    nop
    M(AR1,2)=R0        - - - store pixels to memory
    M(AR1,2)=R1
    M(AR1,2)=R2
    M(AR1,2)=R3
    lp:

[Figure: chip photographs. < VSIP ME Chip > < VSIP MC Chip >]
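
The inner loop of the listing above (transpose, then four 1D transforms, executed twice) yields the full 2D core transform C X C^T. The Python sketch below models that flow. It is an illustration under assumptions, not the VSIP code: the mapping of RF0 to R0..R3 and RF1 to R4..R7 is inferred from the listing, the mode operand is ignored, and memory addressing is left out.

```python
def ftran(row):
    """4x1 forward transform, same butterfly as the fTRAN instruction model."""
    x0, x1, x2, x3 = row
    s03, s12 = x0 + x3, x1 + x2
    d03, d12 = x0 - x3, x1 - x2
    return [s03 + s12, d12 + (d03 << 1), s03 - s12, d03 - (d12 << 1)]


def transpose(block):
    """Model of RF1 = trans(RF0) for a 4x4 block."""
    return [list(col) for col in zip(*block)]


def forward_integer_transform(block):
    """Two passes of 'transpose, then ftran on each row'."""
    rf0 = [list(r) for r in block]          # pixels copied to registers
    for _ in range(2):                      # loop #2 ftran
        rf1 = transpose(rf0)                # RF1 = trans(RF0)
        rf0 = [ftran(r) for r in rf1]       # R0..R3 = ftran(R4..R7)
    return rf0
```

One pass maps a block M to M^T C^T, so two passes give (X^T C^T)^T C^T = C X C^T; the transpose instruction is what lets the same row-wise instruction serve both the column and the row stage.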

Further Research

ASIP for motion estimation
- Aims to support various Motion Estimation (ME) algorithms
- Tries to find a good balance between flexibility and performance
- Funded by Samsung Electronics

Research topics

[Figure: research topics. Development of ME ASIP; Scalability of ME ASIP; Optimal processor model; Reconfigurable architecture; Interconnection between core and H/W accelerator; Program templates for ME algorithms]


ASIP for Communications
MSC8156 Processor - Freescale Semiconductor

Features
- Provides flexibility, integration and cost efficiency for next-generation wireless communication standards (3G-LTE, WiMAX, eHSPA, TDD-LTE, etc.)
- Supports the requirements of next-generation base stations
  - High-speed processing and decreased latency
  - Supports high data rates with the up-to-date OFDMA (Orthogonal Frequency Division Multiple Access) standard

[Figure: MSC8156 block diagram. SC3850 DSP cores, each with 32 KB L1 I-Cache, 32 KB L1 D-Cache and 512 KB L2 Cache/M2 Memory; CLASS; MAPLE-B with Dual RISC Processors, DFT/IDFT, Turbo/Viterbi, FFT/IFFT and CRC]

ASIPs for Multimedia (Video)

SSD1933 Multimedia Processor - Solomon Systech

Features
- Dual core architecture with ARM926EJ-S and AV-DSP
- High quality multimedia for mobile multimedia devices, navigation systems and mobile internet devices

[Figure: SSD1933 block diagram. CPU Subsystem (ARM926, D-Cache, I-Cache); Multimedia Subsystem (AV-DSP, L1-Cache, 3D-DMA Engine, SRAM, PRISM); Multimedia Acceleration; 2D Graphic; Pre and Post Processing; System control; Standard I/O; Connectivity; Human Interface; Memory Interface; Multimedia Interface]

ASIPs for Multimedia (Audio)

ZSP800 processor – VeriSilicon

Features
- Supports the Z.Turbo accelerator: users can add instructions and accelerators
- High-definition audio DSP that incorporates innovative features to provide the right balance between silicon cost and processing

ASIPs for Multimedia (Audio)

Z.Turbo accelerator of ZSP processor

Features
- User-definable, user-configurable
- Enables users to add their own accelerator or co-processor
- Accelerates special functions without burdening the main DSP core
- More power and cost efficient than just cranking up MHz or just adding more execution units
- Customers can differentiate using their own designs on top of the ZSP architecture

ASIP for FEC
FEC ASIP - IMEC

Features
- The world’s first decoding of Turbo code and LDPC in one processor
- A multiprocessor with several SIMD architectures delivers high performance and energy efficiency
- Handles scrambling of LDPC and interleaving of turbo code with an rAGU (reconfigurable Address Generation Unit)

[Figure: FEC ASIP block diagram. Input/output interface with input FIFO and output FIFO; background memory banks, each with AGU1 and AGU2; shufflers; rotation engine; three cores, each with aligned scratchpad, N-way SIMD pipeline, VRF, SRF, LIFO memory, control unit and program; control interface]

ASIPs for MPSOC system
Aachen Univ. - T.G. Noll team

- Reconfigurable ASIP architecture using eFPGA (embedded FPGA)
  - More application-specific architecture than a typical FPGA
  - Small-area, low-power architecture that optimizes arithmetic operations
- Performance can be updated using a programming language such as an HDL
- With configurable blocks, performance comes close to an ASIC at low cost and in little time

[Figure: ASIP core with instruction memory, configuration memory, eFPGA, control unit and register]

ASIPs for MPSOC system
Aachen Univ. - H. Meyr team

- Reconfigurable ASIP architecture using CGRA (Coarse Grained Reconfigurable Architecture)
- CGRA
  - Includes arithmetic and logical operations or specific processing elements inside the core
  - Instead of an FPGA, the CGRA implements the system using an architecture inside the core
  - The reconfigurable block is also an application-specific block
- Although a CGRA is less flexible than an FPGA, an application-specific CGRA allows fast development at low cost

[Figure: CGRA – PE architecture (configurable processing element with adder, shifter and register)]

ASIPs for MPSOC system

- An ASIP should be specialized for a specific application
- To optimize an MPSOC system
  - Support an interface for communication among the ASIPs inside the system
  - Guarantee compatibility among compilers
- A low power architecture is needed for mobile devices

ASIP design technologies

Architecture Description Language (ADL) based design
- Maximizes flexibility and efficiency, but requires significant design effort
- LISATek (CoWare), IP Designer (Target), ASIP Meister (ASIP Solutions, Inc.)

Configurable Processor Cores
- Use a pre-designed and pre-verified core
- Efficiency via custom instruction set extensions
- Xtensa (Tensilica), CorExtend (MIPS), Configurable cores ARC600, ARC700 (ARC)

ADL based ASIP design
LISATek Processor Designer – CoWare

- Language for Instruction-set Architectures (LISA) is a powerful, representative instruction-set description language
- Generates a complete set of SW development tools, including an optimizing C compiler and a fast instruction-set simulator

ADL based ASIP design
IP Designer – Target Compiler Technologies

- Retargetable tool suite for ASIP design
- The ASIP architecture is defined in the nML language (a hierarchical and highly structured architecture description language)

ADL based ASIP design
ASIP Meister – ASIP Solutions, Inc.

- Generates dedicated processor hardware descriptions and software development tools automatically, based on target specifications
- Operations of instructions can be defined easily using the Micro Operation description language provided by ASIP Meister

Configurable ASIPs
Xtensa LX3 - Tensilica

Architecture
- 16-bit or 32-bit multiplier, single 16-bit MAC
- Supports multiprocessor systems
- Adopts multi-issue VLIW using the FLIX (Flexible Length Instruction eXtensions) architecture
- Selectable 5-stage or 7-stage pipeline
- Configurable over a wide range of pre-verified options

Configurable ASIPs
Xtensa LX3 - Tensilica

XPRES compiler – features
- Analyzes C/C++ source code and a run-time application profile to automatically suggest configuration settings and new instructions
- Provides a useful starting point for further optimization by the designer

XPRES compiler – design flow
- Input: application code or functional specification in the full C/C++ language
- XPRES Compiler: analyzes thousands of possible processor configurations in minutes
- Select the optimal configuration; optimally tune TIE or combine with manually generated or automatically generated TIE
- Xtensa Processor Generator builds a complete optimized hardware block and tool-chain in minutes

[Figure: XPRES design flow. C/C++ source code; XPRES Compiler; Processor Configuration; TIE: Designer-Defined Instructions; Xtensa Processor Generator; outputs Hardware (RTL), System Models, Complete Software Tools]

Configurable ASIPs
CorExtend - MIPS

Features
- Allows SoC designers to add proprietary instructions and tightly coupled hardware
- As many instructions as an expert designer needs can be added
- MIPS32® 4KE, M4K, 4KSd Pro, MIPS32® 24K Pro, 24KE, MIPS32® 34K Pro, MIPS32® 74K, MIPS32® 1004K

Configurable ASIPs
Configurable cores ARC600, ARC700 - ARC

Features
- Enable designers to add the features they need and remove the features they do not need for their individual application
- Offer the flexibility to add instructions, registers, flags and condition codes, creating a processor that is highly tuned for a specific application

Evolution of ASIPs
Future of ASIPs

[Figure: future directions around the ASIP. Higher performance; Reconfigurable; More specific application; Low power consumption; High flexibility]


Conclusions

Proposed three ASIPs, for OFDM systems, audio and video, that combine the high performance of an ASIC with the flexibility of a DSP
- Smaller hardware size than existing DSPs
- Support various standards

ASIP Core for OFDM communication systems
- Special instructions and hardware architectures for FFT and bit manipulation
- Supports various OFDM and DMT modem systems

ASIP Core for Audio
- Special instructions for audio coding
- Accelerator for Huffman decoding
- Supports various high quality audio codecs

ASIP Core for Video applications
- Special instructions for video coding
- Two coprocessors for ME/MC and VLC
- Supports various video codecs

Implemented ASIPs

[Figure: chip photographs. < SPOCS > < VSIP ME > < VSIP MC > < DASIP >]

ASIP Specification

ASIP    Library       Operation Frequency   Gate Counts   Memory Size   Remarks
SPOCS   Sec 0.18 µm   280 MHz               107,000       12 Kbyte      -
VSIP    HSI 0.25 µm   160 MHz               141,260       24 Kbyte      ME/MC hardware accelerator
DASIP   Sec 0.18 µm   200 MHz               120,283       24 Kbyte      -

Implemented chips (1/2)

- DSP for wireless communication: 40 MHz
- MDSP (1st version): 30 MHz
- MDSP (2nd version): 60 MHz
- MDSP (3rd version): 60 MHz

Multimedia DSP + Fixed Point DSP
- Mobile multimedia communication
- DCT (176 x 144): 168.64 fr/s
- BMA (352 x 240): 14 fr/s

16-bit fixed point DSP
- Instructions are compatible with Motorola DSP56100

Implemented chips (2/2)

PRML ReadDOCSIS 2.0 WLAN modem chip LMDSChannel FilterCable modem IEEE 802.11 DOCSIS

RS+Viterbi FEC decoder, Parallel image processor, FFT processor, S-DCME RS decoder

DVB-S2 System Chip Design

ETRI – Ajou university SoC Lab

DVB-S2 Receiver System Description

Standard of Satellite Digital Video Broadcasting

Characteristics
- Channel-adaptive transmitter algorithm using ACM (Adaptive Coding and Modulation) and VCM (Variable Coding and Modulation)
- Three important signal processing blocks
  - Synchronizer: timing and frequency synchronization and demodulation
  - FEC: error detection and correction
  - Mode de-adaptation: packet header decoding

[Figure: receiver block diagram. Blocks: ADC, DVB-S2 synchronizer (Ajou univ.), FEC (LDPC+BCH), Mode De-adaptation; signals: MODCOD, Video signal]

ADC: Analog Digital Converter
MODCOD: The code of the modulation method and code rate
BCH: Bose-Chaudhuri-Hocquenghem multiple error correction binary block code

DVB-S2 Synchronizer Description

- STR: uses the Gardner algorithm
- Frame Sync: adopts the GDPDI correlation scheme
- Freq Sync: coarse, fine and phase estimation
- SNR Estimator: uses the SNV algorithm
- Reed-Muller decoder: MODCOD decoding
- Demapper: QPSK, 8PSK, 16APSK, 32APSK demodulation

[Figure: synchronizer block diagram. Blocks: ADC, STR, AGC, Frame Sync., Freq. Sync., SNR Estimator, Descrambler, Demapper, Reed-Muller Decoder; signals: frame done, SNR, MODCOD]

STR: Symbol timing recovery
AGC: Automatic Gain Controller
GDPDI: differential generalized post-detection integration
SNV: Squared Signal-to-Noise Variance
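
Symbol timing recovery in the synchronizer is based on the Gardner algorithm. As a hedged illustration of what a Gardner timing error detector computes, the Python sketch below assumes two samples per symbol and the sign convention e = Re{conj(y_mid) * (y_k - y_(k-1))}; some references use the opposite sign. The function names are hypothetical, and this is not the authors' hardware design.

```python
def gardner_error(prev_strobe, mid, strobe):
    """Gardner timing error from two symbol strobes and the sample between them."""
    diff = complex(strobe) - complex(prev_strobe)
    return (complex(mid).conjugate() * diff).real


def gardner_errors(samples):
    """Timing errors for a stream sampled at 2 samples per symbol.

    Even indices are treated as symbol strobes, odd indices as midpoints.
    """
    return [
        gardner_error(samples[n - 2], samples[n - 1], samples[n])
        for n in range(2, len(samples), 2)
    ]
```

With correct timing and alternating symbols, the midpoint sample sits on the zero crossing and the error is zero; a midpoint that is off the zero crossing yields a non-zero error whose sign tells the loop which way to adjust the sampling instant.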

DVB-S2 Test Environments

Test Environments
- Uses VCM (Variable Coding and Modulation)
  - QPSK: code rate 1/2
  - 8PSK: code rate 2/3
- SNR: 6 dB
- Sample rate: 9 Msymbol/s
- Carrier frequency: 21 GHz

DVB-S2 Test Movie


Papers and Patents list

Papers
[1] Low power complexity-reduced ME and interpolation algorithms for H.264/AVC, Jour. of signal proc., 2009
[2] SPOCS : Application specific signal processor for OFDM communication systems, Jour. of signal proc., 2008
[3] ASIP Approach for implementation of H.264/AVC, Jour. of signal proc., 2008
[4] Novel intra prediction algorithm using residual prediction for low power multimedia codecs, ISIC2009
[5] Efficient integer motion estimation algorithm using sub-sampling, IEEE ISOCC2009
[6] Novel residual prediction scheme for hybrid video coding, IEEE ICIP2009
[7] Novel frame selection methods for multi-reference motion estimation, International Conference on Digital Signal Processing 2009
[8] Efficient frame selection schemes for multi-reference and variable block Size Motion Estimation, IEEE ICME2008
[9] Novel fractional pixel motion estimation algorithm using motion prediction and fast search pattern, IEEE ICME2008

Papers and Patents list

Papers
[10] Improved frame selector for Multi-Reference motion estimation, IEEE ISOCC2008
[11] Fast multiple reference frame selection method For H.264/AVC, IEEE WSPS2008
[12] Fast full search motion estimation algorithm using MNPDS, IEEE ICEIC2008
[13] Power efficient integrated motion compensator for MPEG and H.264/AVC, IEEE SLPHSC2008
[14] Three low power ASIP processor designs for communications, video, and audio applications, DTIS 2007
[15] An ASIP approach for H.264/AVC implementation having novel coprocessors, SIPS 2007
[16] Low power ASIC architecture optimization based on target application profiling, IEEE SCS2007
[17] Novel non-linear inverse quantization algorithm and its architecture for digital audio codecs, ISCAS 2007
[18] VSIP : Implementation of video specific instruction-set processor, IEEE APCCAS 2006

Papers and Patents list

Papers
[19] VSIP: Video specific instruction set processor for H.264/AVC, IEEE SIPS 2006
[20] ASIP approach for implementation of H.264/AVC, IEEE ASP-DAC2006
[21] Efficient memory reuse and sub-pixel interpolation algorithms for ME/MC of H.264/AVC, IEEE SIPS 2006
[22] Efficient motion estimation accelerator for H.264/AVC, A-SSCC 2006
[23] ASIP instructions and their hardware architecture for H.264/AVC, ISOCC 2005
[24] Implementation of application-Specific DSP for OFDM Systems, IEEE international Symposium on Antennas and Propagation 2005
[25] Application-specific DSP architecture for H.264/AVC, ITC-CSCC 2005
[26] Reconfigurable coprocessor for communication systems, ITC-CSCC 2005
[27] Design of a high-quality audio-specific DSP core, Best Paper Award in IEEE SIPS 2005
[28] Novel instructions and their hardware architecture for video signal processing, IEEE ISCAS 2005
[29] Implementation of application-specific signal processor for high-speed OFDM Systems, COOL Chips Ⅷ

Papers and Patents list

Papers
[30] Implementation of a wireless multimedia DSP chip for mobile applications, Jour. of VLSI signal proc., 2005
[31] Digital signal processor architecture with bit manipulation accelerator for communication systems, EURASIP JASP, 2005
[32] Implementation of a wireless multimedia DSP chip for mobile applications, Jour. of VLSI signal proc., 2005
[33] ASIP Instructions and their hardware architecture for H.264/AVC, Journal of Semiconductor Technology and Science, 2005
[34] Audio-Specific Signal Processor (ASSP) for High-Quality Audio Codec, A-SSCC 2005
[35] Implementation of application-specific signal processor for high-speed communication systems, ISPACS 2004
[36] Design of reconfigurable coprocessor for communication Systems, SIPS 2004
[37] Implementation of application-specific DSP for OFDM systems, ISCAS2004
[38] Design of new DSP instructions and their hardware architecture for high-speed FFT, Jour. of VLSI signal proc., 2003

Papers and Patents list

Patents
[1] Computing Circuits and Method for Running an MPEG-2 AAC or MPEG-4 AAC Audio Decoding Algorithm on Programmable Processors, US patents
[2] Frequency error estimator and frequency error estimating method thereof, US patents
[3] Modulation apparatus using mixed-radix fast Fourier transform, US patents
[4] Bit manipulation operation circuit and method in programmable processor, US patents
[5] Apparatus and method for computing an FFT in a programmable processor, European patents
[6] FFT operating apparatus of programmable processors and operation method thereof, US/European patents
[7] Modulation apparatus using mixed-radix fast Fourier transform, Japan patents
[8] Computing circuits and method for running an MPEG-2 AAC or MPEG-4 AAC audio decoding algorithm on programmable processors, Korea patents

Papers and Patents list

Patents
[9] Reducing decoding complexity method and devices for low density parity check, Korea patents
[10] Frequency error estimator and frequency error estimating method thereof, Korea patents
[11] Frame synchronization circuit in DVB-S2, Korea patents
[12] Reference frame selection method for multi-reference motion estimation of high performance multimedia codec, Korea patents
[13] S-DCME algorithm processing Methods and Circuits for Reed-Solomon decoder, Korea patents

Thank you!