Entropy Coding on a Programmable Processor Array for Multimedia SoC
Roberto R. Osorio and Javier D. Bruguera
University of Santiago de Compostela. SPAIN
Dept. Electronic and Computer Engineering
e-mail: (roberto,bruguera)@dec.usc.es
USC 2007 2Roberto R. Osorio – ASAP 2007
Outline
Entropy coding
• Relevance
• Complexity
Options for implementation
• Application-specific accelerators
• Reconfigurable instruction-set extensions
• Programmable processors
ASIPs
• Our proposal as a processor array
• Implementation view
Implementation details
Results and conclusions
Entropy coding
Lossless data compression
• More probable symbols (events) → short codewords
• Less probable symbols → long codewords
It is a critical task in implementing multimedia standards
It is more than just Huffman or arithmetic coding
• Zig-zag, run-length, binarization, context selection, ...
• Focusing only on pure entropy coding yields poor acceleration
In JPEG 2000 it represents more than 50% of the computations
In other standards it is just 5-10%, however...
• 10% can be a lot in video encoding
• It does not benefit from SIMD or MIMD due to data dependencies and bit-level operations
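The rule "more probable symbols get shorter codewords" can be illustrated with a Huffman code. The following is a minimal sketch, not taken from the slides: the function name `huffman_code_lengths` and the four-symbol source are hypothetical, chosen only to show how codeword lengths follow the probabilities.

```python
import heapq
from itertools import count


def huffman_code_lengths(probabilities):
    """Return {symbol: codeword length in bits} for a Huffman code."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(p, next(tie), {sym: 0}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single-symbol source still needs one bit per symbol.
        return {sym: 1 for sym in probabilities}
    while len(heap) > 1:
        # Merge the two least probable subtrees; every symbol in them
        # moves one level deeper, i.e. gains one bit.
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]


# Hypothetical source: probabilities are illustrative only.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(probs)
average = sum(probs[s] * lengths[s] for s in probs)
print(lengths, average)
```

With these probabilities the most likely symbol receives a 1-bit codeword and the two least likely symbols receive 3-bit codewords, so the average length falls below the 2 bits a fixed-length code would need.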
Options for implementation
Application-specific hardware
Highest performance
• High throughput, low latency and low power consumption
• Optimized integration reduces latency and cost
Painful design process
• Skilled engineers needed
• Complex implementation. Errors may show up after tape-out
• No flexibility: one design → one or two applications
Reconfigurable instruction sets or accelerators
High flexibility: one application → one design
• Errors can be corrected at (almost) any time
Still, many times slower, bigger and more power-hungry than an ASIC
Painful design process
• Skilled engineers
• Benefits of accelerating small kernels limited by Amdahl's law
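The Amdahl's law limit mentioned above can be made concrete with a short calculation. This is a sketch under stated assumptions: the kernel speed-up of 10 is a hypothetical value, and the fractions 5%, 10% and 50% are borrowed from the shares of computation quoted earlier for entropy coding.

```python
def amdahl_speedup(fraction, kernel_speedup):
    """Overall speed-up when `fraction` of the run time is accelerated
    by a factor of `kernel_speedup` and the rest is left unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Hypothetical accelerator that makes the kernel 10 times faster.
for fraction in (0.05, 0.10, 0.50):
    achieved = amdahl_speedup(fraction, 10)
    upper_bound = 1.0 / (1.0 - fraction)  # limit for an infinitely fast kernel
    print(f"fraction={fraction:.2f} achieved={achieved:.2f} bound={upper_bound:.2f}")
```

When the accelerated kernel covers only 10% of the run time, even an infinitely fast accelerator cannot push the overall speed-up beyond about 1.11.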
Options for implementation (2)
Programmable processors
Limited performance, high power consumption
Several choices
• Scalar processors → poor performance. You get what you pay for
• Superscalar → high power consumption. Diminishing returns
• VLIW → something in between. Preferred choice for implementing multimedia systems. Performance suffers due to data dependencies
Best flexibility
• One design → any application
• Changes can be applied in the field
Entropy coding on programmable processors
Example application
Context-adaptive Binary Arithmetic Coder (CABAC) in H.264
• Data binarization
• Context selection and updating
• Binary arithmetic coding
• Bit-stream formation
The number of operations in high-quality encoding scenarios is overwhelming!
50 to 200 M symbols/s × ~50 ops/symbol (RISC or VLIW) → 2.5 to 10 Gops!
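The first CABAC step, data binarization, turns each syntax element into a string of binary symbols before arithmetic coding. H.264 defines several binarization schemes, among them unary and Exp-Golomb-based ones. The sketch below shows plain unary and order-0 Exp-Golomb codes only; it is an illustration with hypothetical function names, not the exact per-element binarization prescribed by the standard.

```python
def unary(value):
    """Unary binarization: `value` ones followed by a terminating zero."""
    return "1" * value + "0"


def exp_golomb0(value):
    """Order-0 Exp-Golomb code: (value + 1) in binary, preceded by as many
    zeros as there are bits after the leading one."""
    binary = bin(value + 1)[2:]
    return "0" * (len(binary) - 1) + binary


for v in range(5):
    print(v, unary(v), exp_golomb0(v))
```

Each output bit depends on the value being coded, and every bit later passes through context selection and the arithmetic coder in order, which is one reason this kind of workload resists SIMD-style parallelism.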
Hardware-software co-design
Need for efficient implementations
• Processing speed
• Power consumption
MPEG-4 encoder, VGA resolution @ 30 fps: 4.1 GIPS
[Figure: exploration of HW/SW partitions, ranging from all HW (0 RISC) to all SW (21 RISC); SW brings low cost and greater flexibility]
• SW: 5 RISC, 4 threads; Coproc: Clip, Div, Abs, Sgn (88% utilization); HW (80% performance): DCT, SAD, BDIFF, BADD, BQ, BIQ
• HW (65% performance): DCT, SAD; SW: 15 RISC, 16 threads (75% utilization); Coproc: Clip, Div, Abs, Sgn
Source: Pierre Paulin, ST Microelectronics, Euromicro DSD 2004
Motivation for a new platform
Devices
Formats
JPEG
GIF
PNG
TIFF
JPEG 2000
MPEG-1
MPEG-2
MPEG-4 SP
H.264
WMV
QuickTime
PDF …
Algorithms
Huffman
Q-Coder
QM-Coder
MQ-Coder
CABAC
Rice
Golomb
Exp-Golomb
Lempel-Ziv
Run-length …
Applications
Image visualization
Video playing
Music
Sound recording
Still digital cameras
Video cameras
Digital TV
Time shifting
Multiple tuners
Continuous recording
…
Motivation for a new platform (2)
Increasing complexity
[Figure: growth of software size (thousands of lines) and design effort (engineers × month) from the 1990s through 2002 and 3G to 2010. Source: TI 2002]
Support for multiple standards, services and applications
+ Complexity grows quadratically with the size of the problem
+ Implementation for heterogeneous platforms
ASIP
Application-Specific Instruction-set Processor
Tailored to a given range of applications
Best performance and lowest cost for a programmable processor
Still retains high flexibility
Design process
From scratch
From a base processor
• Profiling
• Adding new instructions / removing unused ones
• Adding / removing functional units
• Tailoring instruction format and signal widths
Other alternatives
• Tensilica
Our ASIP implementation
Array of low-cost processors
• 8-bit processors
• 2-stage pipeline: fetch/decode and execute
• 2 instructions per cycle in a VLIW fashion
• Each processor has its own data and code memories
Communication through queues
• A linear structure has been found to be sufficient so far
Global memory accessed through a shared bus
[Figure: linear array of four processors (P), each with its own local memory (mem)]
Architecture
[Figure: processor block diagram with program local memory, fetch & decoding, flow control, pipeline registers, register bank and data local memory; signal widths are marked 8]
Instruction set
8-bit instructions
• add and sub, with and without carry
• and, or, exor
• left and right shift and rotation (only 1 bit each time)
• conditional (zero, carry) and unconditional branch
• memory load and store
• data and code prefetch
• queue input and output
16-bit instructions: the carry bit passes to the next ALU
We do not implement
call and return; instead:
• put an address in the queue for the next processor
• jump to an address in the queue
stack management
interrupts
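The slide states only that 16-bit instructions pass the carry bit to the next ALU. One plausible reading, sketched below as an assumption rather than as the authors' datapath, is that a 16-bit addition is split into two 8-bit additions, with the carry out of the low byte feeding the high byte. The names `add8` and `add16` are hypothetical.

```python
MASK8 = 0xFF


def add8(a, b, carry_in=0):
    """8-bit add with carry: returns (8-bit result, carry out)."""
    total = (a & MASK8) + (b & MASK8) + carry_in
    return total & MASK8, total >> 8


def add16(x, y):
    """16-bit add built from two chained 8-bit adds."""
    low, carry = add8(x & MASK8, y & MASK8)    # first ALU: low bytes
    high, carry = add8(x >> 8, y >> 8, carry)  # next ALU: high bytes + carry
    return (high << 8) | low, carry


print(hex(add16(0x12FF, 0x0001)[0]))
```

Chaining the carry this way lets narrow ALUs handle wider operands without widening the register file or the data paths.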
Programming model
Start up
First processor reads starting address from the queue
Initialization subroutine puts an address for the next processor
• After a few cycles, all processors are up
Processing
Each processor executes a part of the code and communicates with other processors using the queues
• Processors read the queues at specific points in their code
Empty/full queues make processors stall
• The same applies to data or code not present in the local memory
Switching to another subroutine
When the work is done, processors read a new address from the queue
• Some processors always execute the same piece of code
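The stall behaviour described above can be explored with a small cycle-by-cycle model of processors linked by bounded queues. This is a toy sketch, not the authors' simulator: `run_pipeline`, the queue capacity of 2 and the one-item-per-cycle timing are all assumptions made for illustration.

```python
from collections import deque


def run_pipeline(stages, inputs, queue_capacity=2, max_cycles=10_000):
    """Model a linear array of processors connected by bounded FIFO queues.

    stages: one function per processor, applied to each item it receives.
    Returns (outputs, cycles used, stall count per processor).
    """
    n = len(stages)
    queues = [deque() for _ in range(n + 1)]  # queues[i] feeds processor i
    pending = deque(inputs)
    total = len(pending)
    stalls = [0] * n
    cycles = 0
    while len(queues[n]) < total and cycles < max_cycles:
        cycles += 1
        # Visit processors from last to first so that an item advances
        # by at most one processor per cycle.
        for i in reversed(range(n)):
            source, sink = queues[i], queues[i + 1]
            # The final queue stands in for an unbounded output sink.
            sink_full = i + 1 < n and len(sink) >= queue_capacity
            if not source or sink_full:
                stalls[i] += 1  # empty input or full output: stall
                continue
            sink.append(stages[i](source.popleft()))
        # The outside world refills the first queue when there is room.
        if pending and len(queues[0]) < queue_capacity:
            queues[0].append(pending.popleft())
    return list(queues[n]), cycles, stalls


stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
outputs, cycles, stalls = run_pipeline(stages, [1, 2, 3, 4])
print(outputs, cycles, stalls)
```

In this model the processors stall while the pipeline fills and drains; a stage that needed several cycles per item would additionally back up the queues in front of it.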
Distributing the code
[Figure: the encoder LOOP (data binarization, context modelling, encoding iteration, output) distributed over the processors, with calls and returns between stages. Ideal structure: each processor runs its own for(…){ … } loop.]
Case study
CABAC encoding in H.264
Follows a pipelined structure
Irregular algorithms
• Not well suited for software pipelining
Zig-zag coefficient ordering: LUT-based indirections
Binarization: data dependencies
Context management: table accessing and updating
Binary arithmetic coding: bit-level operations and data dependencies
JPEG encoding
Zig-zag coefficient ordering: LUT-based indirections
Token formation: data dependencies
Huffman encoding: bit manipulation
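The first two JPEG stages listed above can be sketched in a few lines. The zig-zag scan is a pure look-up-table indirection, while run-length token formation carries a running zero count from one coefficient to the next, which is the data dependency the slide refers to. This is a simplified illustration with hypothetical names: real JPEG treats the DC coefficient separately and emits end-of-block and zero-run-length tokens.

```python
N = 8

# Zig-zag look-up table for an 8x8 block stored in row-major order:
# walk the anti-diagonals, alternating direction on each one.
ZIGZAG = sorted(
    range(N * N),
    key=lambda i: (i // N + i % N, i // N if (i // N + i % N) % 2 else i % N),
)


def zigzag_scan(block):
    """Reorder a 64-entry block with one table look-up per coefficient."""
    return [block[index] for index in ZIGZAG]


def run_length(coefficients):
    """Form (zero_run, value) tokens; trailing zeros are dropped."""
    pairs, run = [], 0
    for value in coefficients:
        if value == 0:
            run += 1  # state carried to the next coefficient
        else:
            pairs.append((run, value))
            run = 0
    return pairs


# Hypothetical quantized block with four non-zero coefficients.
block = [0] * 64
block[0], block[1], block[8], block[9] = 12, 5, -3, 2
print(run_length(zigzag_scan(block)))
```

The scan itself parallelizes trivially, but the token formation loop cannot start on coefficient k before it knows the run length accumulated up to coefficient k-1.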
Results
Comparison with a TI TMS320C6711 VLIW DSP
5 of our processors were used in both cases
CABAC
• 10 macroblocks from the 3rd frame of Foreman QCIF, encoded as a P-frame with quantizer 28
JPEG
• 10 macroblocks from the Lena image with quality level 75

         VLIW DSP   Processors array   Speed-up
CABAC    500620     48974              10.2
JPEG     112150     39512              2.8
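The speed-up column follows directly from the two count columns. The short check below recomputes it; the slide does not name the unit of the counts, so they are treated here simply as a cost measure where lower is better, and the dictionary keys are hypothetical labels.

```python
# Counts copied from the results table (unit not stated on the slide).
results = {
    "CABAC": {"vliw_dsp": 500620, "processor_array": 48974},
    "JPEG": {"vliw_dsp": 112150, "processor_array": 39512},
}


def speed_up(vliw_dsp, processor_array):
    """Ratio of the DSP count to the processor-array count."""
    return vliw_dsp / processor_array


for name, counts in results.items():
    print(f"{name}: {speed_up(**counts):.1f}x")
```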
Other algorithms
We expect other encoding algorithms to perform similarly to the proposed ones:
• CAVLC in H.264
• Huffman in MPEG-2 and MPEG-4
• EBCOT in JPEG 2000, ...
Decoding presents serious data dependencies
We have studied CABAC decoding
We have been working on reducing the impact of data dependencies
At this moment we do not have:
• A whole implementation
• An efficient implementation on another platform to compare with
Other algorithms (2)
[Figure: mapping of six codecs onto the processor array: CABAC encoding (H.264), JPEG encoder, JPEG 2000 encoder, CABAC decoding (H.264), JPEG decoder and JPEG 2000 decoder. The stages shown include zig-zag with quantization or de-quantization, run-length and coefficient processing, significance map and significant coefficients, context modeling, EBCOT 1.1 and EBCOT 1.2, encoding iteration, Huffman encoding and decoding, arithmetic decoding, coefficient reconstruction, and bit-stream formation or parsing.]
Data dependencies in the decoder
[Figure: decoder stages (data reconstruction, context modelling, decoding iteration) shown next to the encoder stages (data binarization, context modeling, encoding iteration, output), each inside a LOOP. The code boxes hold filler text; its legible part reads: "fully programmable processor able to implement any encoding or decoding algorithm with high efficiency; able to switch to another algorithm in a short time; with a performance in between a programmable processor and a hardware accelerator".]
Work around
data_reconstruction(…){
…
do{
…
context_modeling(…)
…
use_value
…
}
…
}
context_modeling(…){
…
…
decoding_iteration(…)
…
use_value
…
}
decoding_iteration(…){
…
…
…
…
}
INLINING
data_reconstruction(…){
…
do{
…
// context_modeling
…
…
decoding_iteration(…)
…
…
use_value
…
}
…
}
decoding_iteration(…){
…
…
…
…
}
CODE REDISTRIBUTION
data_reconstruction(…){
…
do{
decoding_iteration(…)
…
…
…
…
…
…
use_value
}
…
}
decoding_iteration(…){
…
…
…
…
}
Applications
[Figure: excerpts of processor assembly code and of VHDL, with the target labels ASIC, FPGA, coarse grain and media processor]
Assembly excerpt:
bzr 100
input $2
output $4
xor $0 $0
add $0 1
output $2
fetch $4
and $4 7
add $1 4
sl0 $4
add $4 $5
VHDL excerpt:
begin
  -- registers clocking
  SYNC: process (clk, reset)
  begin
    if(clk'event and clk = '1') then
      if(reset = '1') then
        codigoOutReg <= "0000";
        numSeqOutReg <= "000";
        calcSreg <= "0000000000000000";
        calcCreg <= "0000000000000000";
        shiftOutReg <= "000";
Implementation issues
[Figure: large arrays of processors (P), each with its own local memory (mem)]
Yield
Utilization
Power
• Reduce voltage
• Reduce clock frequency
An ASIP-based media-processor
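[Figure: media processor with I/O blocks, memories (Mem), a processor (P) and a 2 × 4 array of slots, shown in three configurations]
• ASIP, ASIP, ASIP, ASIP / ASIP, ASIP, ASIP, ASIP
• ASIP, DCT, ME, ME / Encod., Filter, ME, ME
• ASIP, iDCT, ASIP, ASIP / Decod., MC, Filter, ASIP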
Implementation results
Area and speed figures for the proposed processor using AMS 0.35 µm libraries
Area (NAND gates) 5673
Clock speed (MHz) 180
Registers cost (bits) 232
Latency (cycles) 2
Maximum throughput (instr/cycle) 2
Code memory (KB) 1
Data memory (KB) 1
Comparison
Approximate comparison of the hardware cost of a 5-element processor array and a TI C6711 VLIW DSP

                    Processors array   TI C6711
Registers (bits)    1672               > 2048
ALU (bits)          80                 256
Local memory (KB)   10                 8+64
Speed (MHz)         180                150
Technology          AMS 0.35 µm        TI 0.13 µm
Conclusions
Entropy coding is a complex task in multimedia applications that often needs hardware acceleration
The implementation cost and lack of flexibility of dedicated hardware demand programmable solutions with comparable performance
ASIPs are an intermediate solution between hardware accelerators and general-purpose processors
In this work an ASIP is proposed for entropy encoding
This ASIP is not based on optimized new instructions but on achieving high parallelism in computations and data flow
Results demonstrate that this is a valid approach for the applications we have studied
We intend to extend the results to other applications