
Entropy Coding on a Programmable Processor Array for Multimedia SoC

Roberto R. Osorio and Javier D. Bruguera

University of Santiago de Compostela. SPAIN

Dept. Electronic and Computer Engineering

e-mail: (roberto,bruguera)@dec.usc.es

USC 2007. Roberto R. Osorio – ASAP 2007

Outline

Entropy coding
  Relevance
  Complexity

Options for implementation
  Application-specific accelerators
  Reconfigurable instruction-set extensions
  Programmable processors

ASIPs
  Our proposal as a processors array
  Implementation view

Implementation details

Results and conclusions


Entropy coding

Lossless data compression
  More probable symbols (events) → short codewords
  Less probable symbols → long codewords

It is a critical task in implementing multimedia standards

It is more than just Huffman or arithmetic coding
  • Zig-zag, run-length, binarization, context selection, ...
  Focusing just on pure entropy coding yields poor acceleration

In JPEG-2000 it represents more than 50% of computations

In other standards it is just 5-10%, however...
  • 10% can be a lot in video encoding
  • It does not benefit from SIMD or MIMD due to:
      Data dependencies
      Bit-level operations
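As an illustration of "more probable symbols → short codewords", here is a minimal Python sketch that derives Huffman codeword lengths from symbol counts. The function name `huffman_code_lengths` and the example counts are made up for this sketch; they do not come from the talk.

```python
import heapq

def huffman_code_lengths(freqs):
    """Return {symbol: codeword length} for a Huffman code built from freqs."""
    # Heap entries: (total weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; every symbol in them
        # moves one level deeper, i.e. gets one more codeword bit.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical symbol counts, chosen only for illustration.
lengths = huffman_code_lengths({"a": 8, "b": 4, "c": 2, "d": 1, "e": 1})
print(lengths)
```

The most frequent symbol ends up with the shortest codeword and the rarest ones with the longest. The merge loop is inherently sequential, which hints at why this family of algorithms gains little from SIMD.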


Options for implementation

Application-specific hardware
  Highest performance
    • High throughput, low latency and low power consumption
    • Optimized integration reduces latency and cost
  Painful design process
    • Skilled engineers needed
    • Complex implementation. Errors may show up after tape-out
    • No flexibility: one design → one or two applications

Reconfigurable instruction sets or accelerators
  High flexibility: one application → one design
    • Errors can be corrected at (almost) any time
  Still, many times slower, bigger and more power hungry than an ASIC
  Painful design process
    • Skilled engineers
    • Benefits of accelerating small kernels limited by Amdahl's law
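The Amdahl's law limit can be made concrete with a short Python sketch. The fractions 5%, 10% and 50% are the entropy-coding shares quoted on the "Entropy coding" slide; the 10x kernel speed-up and the function name `amdahl_speedup` are assumptions made only for this example.

```python
def amdahl_speedup(fraction, kernel_speedup):
    """Overall speed-up when `fraction` of the run time is made `kernel_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

KERNEL_SPEEDUP = 10.0  # assumed accelerator gain, not a figure from the talk

for fraction in (0.05, 0.10, 0.50):
    overall = amdahl_speedup(fraction, KERNEL_SPEEDUP)
    upper_bound = 1.0 / (1.0 - fraction)  # limit for an infinitely fast kernel
    print(f"fraction={fraction:.2f}  overall={overall:.3f}  bound={upper_bound:.3f}")
```

With a 10% kernel, even an infinitely fast accelerator cannot push the whole application beyond about 1.11x.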


Options for implementation (2)

Programmable processors
  Limited performance, high power consumption
  Several choices
    • Scalar processors → poor performance
        You get what you pay for
    • Superscalar → high power consumption
        Diminishing returns
    • VLIW → something in between
        Preferred choice for implementing multimedia systems
        Performance suffers due to data dependencies
  Best flexibility
    • One design → any application
    • Changes can be applied in the field


Entropy coding on programmable processors

Example application
  Context-adaptive Binary Arithmetic Coder (CABAC) in H.264
    • Data binarization
    • Context selection and updating
    • Binary arithmetic coding
    • Bit-stream formation
  The number of operations in high-quality encoding scenarios is overwhelming!

  50-200 M symbols/s × ~50 ops/symbol → 2.5-10 Gops on a RISC or VLIW
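The workload estimate on this slide can be checked with a few lines of Python. The sketch reads the slide's figures as 50 to 200 million symbols per second at roughly 50 operations per symbol; the helper name `required_gops` is hypothetical.

```python
OPS_PER_SYMBOL = 50  # approximate figure from the slide

def required_gops(symbols_per_second, ops_per_symbol=OPS_PER_SYMBOL):
    """Operations per second, in Gops, needed to keep up with the symbol rate."""
    return symbols_per_second * ops_per_symbol / 1e9

for rate in (50e6, 200e6):
    print(f"{rate / 1e6:.0f} M symbols/s -> {required_gops(rate):.1f} Gops")
```

Under that reading, the two ends of the symbol-rate range give the 2.5 and 10 Gops quoted on the slide.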


Hardware-software co-design

Need for efficient implementations
  Processing speed
  Power consumption

MPEG-4 encoder, VGA resolution @ 30 fps: 4.1 GIPS

[Figure: exploration of HW/SW partitions, ranging from HW (0 RISC) to SW (21 RISC); labels: low cost, greater flexibility]

Design point 1
  • SW: 5 RISC, 4 threads
  • Coproc: Clip, Div, Abs, Sgn (88% utilization)
  • HW (80% performance): DCT, SAD, BDIFF, BADD, BQ, BIQ

Design point 2
  • HW (65% performance): DCT, SAD
  • SW: 15 RISC, 16 threads (75% utilization)
  • Coproc: Clip, Div, Abs, Sgn

Source: Pierre Paulin, ST Microelectronics, Euromicro DSD 2004


Motivation for a new platform

Devices

Formats

JPEG

GIF

PNG

TIFF

JPEG 2000

MPEG-1

MPEG-2

MPEG-4 SP

H.264

WMV

QuickTime

PDF …

Algorithms

Huffman

Q-Coder

QM-Coder

MQ-Coder

CABAC

Rice

Golomb

Exp-Golomb

Lempel-Ziv

Run-length …

Applications

Image visualization

Video playing

Music

Sound recording

Still digital cameras

Video cameras

Digital TV

Time shifting

Multiple tuners

Continuous recording

…


Motivation for a new platform

Increasing complexity

[Figure: growth in thousands of lines and in engineers × month, from the 1990's through 2002 and 3G to 2010. Source: TI 2002]

Support multiple standards, services and applications
  + Complexity grows quadratically with the size of the problem
  + Implementation for heterogeneous platforms


ASIP

Application-Specific Instruction-set Processor
  Tailored to a given range of applications
  Best performance and lower cost for a programmable processor
  Still retains high flexibility

Design process
  From scratch
  From a base processor
    • Profiling
    • Adding new instructions / removing unused ones
    • Adding / removing functional units
    • Tailoring instruction format and signal widths
  Other alternatives
    • Tensilica


Our ASIP implementation

Array of low-cost processors
  8-bit processors
  2-stage pipeline: fetch/decode and execute
  2 instructions per cycle in a VLIW fashion
  Each processor has its own data and code memories

Communication through queues
  A linear structure has been found to be sufficient so far

Global memory accessed through a shared bus

[Figure: linear array of four processors (P), each with its own local memory (mem)]


Architecture

[Figure: processor block diagram. Blocks: program local memory, fetch & decoding / flow control, pipeline registers, registers bank, data local memory. Datapaths are labelled as 8 bits wide.]


Instruction set

8-bit instructions
  add and sub, with and without carry
  and, or, exor
  left and right shift and rotation (only 1 bit each time)
  conditional (zero, carry) and unconditional branch
  memory load and store
  data and code prefetch
  queue input and output

16-bit instructions: the carry bit passes to the next ALU

We do not implement
  call and return; instead:
    • put an address in the queue for the next processor
    • jump to an address in the queue
  stack management
  interrupts
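One plausible reading of "carry bit passes to the next ALU" is that a 16-bit operation is split into two 8-bit operations chained through the carry. The Python sketch below models that reading for addition only. The function names `add8` and `add16` are hypothetical, and the actual hardware behaviour is not specified in the slides.

```python
def add8(a, b, carry_in=0):
    """8-bit add with carry: returns (8-bit result, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + carry_in
    return total & 0xFF, total >> 8

def add16(a, b):
    """16-bit add built from two chained 8-bit adds (operands in 0..0xFFFF)."""
    lo, carry = add8(a & 0xFF, b & 0xFF)         # first ALU: low bytes
    hi, carry_out = add8(a >> 8, b >> 8, carry)  # next ALU: high bytes plus carry
    return (hi << 8) | lo, carry_out

print(add16(0x12FF, 0x0001))  # low byte overflows, carry ripples into the high byte
```

The same chaining would apply to sub with borrow. The sketch only shows why a carry path between the two ALUs is enough to widen the datapath without widening each ALU.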


Programming model

Start up
  First processor reads starting address from the queue
  Initialization subroutine puts an address for the next processor
    • After a few cycles, all processors are up

Processing
  Each processor executes a part of the code and communicates with other processors using the queues
    • Processors read the queues at specific points in their code
  Empty/full queues make processors stall
    • The same applies for data or code not present in the local memory

Switching to another subroutine
  When the work is done, processors read a new address from the queue
    • Some processors always execute the same piece of code
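The processing model above (a linear array, bounded queues, stalls on empty or full queues) can be sketched as a small cycle-by-cycle simulation in Python. Everything specific here is an assumption made for illustration: the queue depth, the function name `run_linear_array`, and the three toy stage functions. Start-up and the address-passing mechanism are not modelled.

```python
from collections import deque

QUEUE_DEPTH = 2  # hypothetical depth; the slides do not give one

def run_linear_array(stages, inputs, queue_depth=QUEUE_DEPTH):
    """Cycle-by-cycle model of a linear array; returns (outputs, stall counts)."""
    n = len(stages)
    # queues[i] feeds processor i; queues[n] collects the final results.
    queues = [deque() for _ in range(n + 1)]
    pending = deque(inputs)
    stalls = [0] * n
    while pending or any(queues[i] for i in range(n)):
        # Feed the first queue when it has room.
        if pending and len(queues[0]) < queue_depth:
            queues[0].append(pending.popleft())
        # Visit processors from last to first so a value advances
        # through at most one processor per cycle.
        for i in reversed(range(n)):
            src, dst = queues[i], queues[i + 1]
            out_full = i + 1 < n and len(dst) >= queue_depth
            if not src or out_full:
                stalls[i] += 1  # empty input or full output: stall
            else:
                dst.append(stages[i](src.popleft()))
    return list(queues[n]), stalls

# Three toy stages standing in for real code fragments.
outputs, stalls = run_linear_array(
    [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3], range(5))
print(outputs, stalls)
```

The stall counters show the pipeline filling and draining: downstream processors wait at the start, upstream ones at the end. In a real partition the stage costs differ, so balancing the code across processors decides how often the queues run empty or full.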


Distributing the code

[Figure: the original code is a LOOP that calls, and returns from, four routines in turn: data binarization, context modelling, encoding iteration and output. The ideal structure is four separate loops, for(…){ ….. }, one per routine.]


Case study

CABAC encoding in H.264
  Follows a pipelined structure
  Irregular algorithms
    • Not well suited for software pipelining
  Zig-zag coefficient ordering: LUT-based indirections
  Binarization: data dependencies
  Context managing: table accessing and updating
  Binary arithmetic coding: bit-level operations and data dependencies

JPEG encoding
  Zig-zag coefficient ordering: LUT-based indirections
  Token formation: data dependencies
  Huffman encoding: bit manipulation
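Two of the steps named above can be illustrated in Python: zig-zag ordering through a look-up table, and a simplified run-length token formation. The function names are hypothetical, the token format is reduced to (zero run, value) pairs, and end-of-block handling is left out. This is a sketch of the general techniques, not the code that ran on the array.

```python
def zigzag_lut(n=8):
    """Return the row-major indices visited by an n x n zig-zag scan."""
    order = []
    for s in range(2 * n - 1):  # s = row + col selects one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        # Even diagonals run bottom-left to top-right, odd ones the other way.
        for r in (reversed(rows) if s % 2 == 0 else rows):
            order.append(r * n + (s - r))
    return order

def run_length_tokens(coeffs):
    """Simplified token formation: (zeros skipped, nonzero value) pairs."""
    tokens, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1  # each token depends on the zeros counted so far
        else:
            tokens.append((run, c))
            run = 0
    return tokens  # trailing zeros are dropped (no end-of-block token here)

ZIGZAG = zigzag_lut()
block = list(range(64))                # stand-in for 64 quantized coefficients
scanned = [block[i] for i in ZIGZAG]   # LUT-based indirection
print(scanned[:10])
print(run_length_tokens([5, 0, 0, 3, 0, 1, 0, 0]))
```

The scan is a pure table look-up with an irregular access pattern, while the token loop carries state from one coefficient to the next. These are the two properties the slide lists as obstacles for conventional processors.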


Results

Comparing with a TI TMS320C6711 VLIW DSP
  5 of our processors were used in both cases
  CABAC
    • 10 macroblocks from the 3rd frame of Foreman QCIF, encoded as a P-frame with quantizer 28
  JPEG
    • 10 macroblocks from the Lena image with quality level 75

          VLIW DSP    Processors array    Speed-up
  CABAC     500620               48974        10.2
  JPEG      112150               39512         2.8
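The speed-up column follows directly from the other two columns, as this short Python check shows. The dictionary layout and variable names are chosen only for this sketch.

```python
# (VLIW DSP, processors array) figures as listed in the results table.
results = {"CABAC": (500620, 48974), "JPEG": (112150, 39512)}

speedups = {name: dsp / array for name, (dsp, array) in results.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.1f}x")
```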


Other algorithms

We expect other encoding algorithms to perform similarly to the proposed ones:
  CAVLC in H.264
  Huffman in MPEG-2 and MPEG-4
  EBCOT in JPEG-2000, ...

Decoding presents serious data dependencies
  We have studied CABAC decoding
  We have been working on reducing the impact of data dependencies
  At this moment we do not have:
    • A whole implementation
    • An efficient implementation on another platform to compare with


Other algorithms

[Figure: stage pipelines for six codecs: CABAC encoding (H.264), JPEG encoder, JPEG 2000 encoder, CABAC decoding (H.264), JPEG decoder, JPEG 2000 decoder. Stage labels: zig-zag / quantization, run-length / coefficients processing, Huffman encoding, bit-stream formation, EBCOT 1.1, EBCOT 1.2, context modeling, encoding iteration, significance map / significant coefficients, bit-stream parsing, arithmetic decoding, coefficients reconstruction, Huffman decoding, zig-zag / de-quantization.]


Data dependencies in the decoder

[Figure: decoder stages (data reconstruction, context modelling, decoding iteration) shown next to encoder stages (data binarization, context modeling, encoding iteration, output), each drawn as a LOOP over placeholder code.]


Work around

Original structure

  data_reconstruction(…){
    …
    do{
      …
      context_modeling(…)
      …
      use_value
      …
    }
    …
  }

  context_modeling(…){
    …
    decoding_iteration(…)
    …
    use_value
    …
  }

  decoding_iteration(…){
    …
  }

INLINING

  data_reconstruction(…){
    …
    do{
      …
      // context_modeling
      …
      use_value
      …
    }
    …
  }

  decoding_iteration(…){
    …
  }

CODE REDISTRIBUTION

  data_reconstruction(…){
    …
    do{
      …
      use_value
    }
    …
  }

  decoding_iteration(…){
    …
  }


Applications

  bzr 100
  input $2
  output $4
  xor $0 $0
  add $0 1
  output $2
  fetch $4
  and $4 7
  add $1 4
  sl0 $4
  add $4 $5

  begin
    -- registers clocking
    SYNC: process (clk, reset)
    begin
      if (clk'event and clk = '1') then
        if (reset = '1') then
          codigoOutReg <= "0000";
          numSeqOutReg <= "000";
          calcSreg <= "0000000000000000";
          calcCreg <= "0000000000000000";
          shiftOutReg <= "000";

+ ASIC
  FPGA
  coarse grain
  media processor


Implementation issues

[Figure: large array of processor (P) and local memory (mem) tiles]

Yield

Utilization

Power
  • Reduce voltage
  • Reduce clock frequency


An ASIP-based media-processor

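[Figure: three configurations, each with two I/O blocks, two memories (Mem), a processor (P) and a 2 × 4 array of tiles]

Configuration 1 (all ASIP)
  ASIP   ASIP   ASIP   ASIP
  ASIP   ASIP   ASIP   ASIP

Configuration 2 (with DCT, ME, coder and filter blocks)
  ASIP    DCT      ME   ME
  Coder   Filter   ME   ME

Configuration 3 (with iDCT, MC, decoder and filter blocks)
  ASIP      iDCT   ASIP     ASIP
  Decoder   MC     Filter   ASIP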


Implementation results

Area and speed figures for the proposed processor using AMS 0.35 µm libraries

  Area (NAND gates)                   5673
  Clock speed (MHz)                    180
  Registers cost (bits)                232
  Latency (cycles)                       2
  Maximum throughput (instr/cycle)       2
  Code memory (KB)                       1
  Data memory (KB)                       1


Comparison

                      Processors array    TI C6711
  Registers (bits)    1672                > 2048
  ALU (bits)          80                  256
  Local memory (KB)   10                  8+64
  Speed (MHz)         180                 150
  Technology          AMS 0.35 µm         TI 0.13 µm

Approximate comparison of the hardware cost of a 5-element processors array and a TI C6711 VLIW DSP


Conclusions

Entropy coding is a complex task in multimedia applications that often needs hardware acceleration

The implementation cost and lack of flexibility of hardware accelerators demand programmable solutions with comparable performance

ASIPs are an intermediate solution between hardware accelerators and general-purpose processors

In this work an ASIP is proposed for entropy encoding

This ASIP is not based on optimized new instructions but on achieving high parallelism in computations and data flow

Results demonstrate that this is a valid approach for the applications we have studied

We intend to extend the results to other applications
