An exploration of the hardware implementation of Quadtree compression



Michiel D'Haene, Hendrik Eeckhaut and Mark Christiaens
Ghent University, ELIS/PARIS, Sint-Pietersnieuwstraat 41, 9000 Ghent

An exploration of the hardware implementation of Quadtree compression

ProRISC 2004

Parallel Information Systems

[Figure: encoder pipeline. Image sequence in → motion estimation (motion vectors; reference + error frames) → wavelet transform → entropy encoder → compressed bitstream out. The entropy encoder is the Quadtree algorithm, mapped onto an FPGA with DDR-SDRAM.]

Situation
Last stage of a scalable wavelet-based video encoder:

⇒ scalability in frame rate by motion estimation
⇒ scalability in resolution by wavelet transform
⇒ scalability in quality by Quadtree compression

[Figure: hardware block diagram. Labels: Quadtree division (non-significance pass, significance pass, refinement pass); image module (start threshold, layer info, preload); lists (NS-lists, RF-lists, preload); stack module (lists buffer, stack memory); start, busy; internal memory; DDR memory (image frame, lists); layer buffer; context-based model selection (quadrant buffer; H, V, Sign, D; 3x26 models; preload, copy, information); arithmetic encoder (take symbols, coded output, output buffer).]

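[Figure: the Quadtree algorithm as three stages. Division → context-based model selection (model selection based on collected statistical information) → arithmetic encoder → Out. The figure also shows example probability values (0.7, 0.4, 0.6, 0.55, 0.5, 0.3, 0.45) whose layout is not recoverable.]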

Principle
A layered approach allows scalability in quality for the decoder:

- start from the most significant layer of the image
- detect the significant pixels in the current layer
- code them efficiently in a progressive way
- non-significant pixels are irrelevant in this layer
- repeat this process for each bit layer
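The layered principle above can be sketched in a few lines. This is a minimal illustration, not the poster's implementation: the function name, the example coefficients and the power-of-two thresholds with a `>=` comparison are all assumptions.

```python
def significance_by_layer(coefficients, num_layers):
    """List, per bit layer, the indices that become significant in that layer.

    Assumption: layer n uses threshold 2**n and a coefficient is significant
    when abs(value) >= threshold (the poster itself writes "> threshold").
    """
    significant = set()
    layers = []
    # Start from the most significant layer and work downwards.
    for layer in range(num_layers - 1, -1, -1):
        threshold = 1 << layer
        newly = [i for i, c in enumerate(coefficients)
                 if i not in significant and abs(c) >= threshold]
        significant.update(newly)
        layers.append((threshold, newly))
    return layers


layers = significance_by_layer([5, -1, 12, 0, 3], num_layers=4)
for threshold, newly in layers:
    print(threshold, newly)
```

A decoder that stops after any layer still has the most important pixels, which is what gives the scalability in quality.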

Algorithm
Consists of 3 stages:

- limited Quadtree division
- context-based model selection
- arithmetic coding

For example, in the highest bit-layer, the significant pixels are: [figure not reproduced]

Limited Quadtree division
For each layer:

- locate the pixels with a value > the current threshold
- this is done by recursive division of the image: while pixels in the area are significant, subdivide
- to reduce the amount of superfluous bits (the decoder has to be able to reconstruct this Quadtree), division stops when a certain area size is reached
- code all pixels in this area. Non-significant pixels are predicted to have a high probability of becoming significant soon (high-valued pixels are mostly surrounded by high-valued pixels)
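The limited division described above can be sketched recursively. This is an illustrative sketch under stated assumptions, not the poster's implementation: the function name, the output format (`"tree"` and `"pixels"` tuples), the `>=` comparison and the square power-of-two image are all hypothetical choices.

```python
def quadtree_divide(image, x, y, size, threshold, min_size, out):
    """Emit one tree bit per visited area, and pixel bits for minimal areas."""
    block = [image[r][x:x + size] for r in range(y, y + size)]
    significant = any(abs(v) >= threshold for row in block for v in row)
    out.append(("tree", int(significant)))  # lets the decoder rebuild the tree
    if not significant:
        return
    if size <= min_size:
        # Division is limited: code every pixel of the minimal area.
        bits = [int(abs(v) >= threshold) for row in block for v in row]
        out.append(("pixels", x, y, bits))
        return
    half = size // 2
    for dy in (0, half):
        for dx in (0, half):
            quadtree_divide(image, x + dx, y + dy, half, threshold, min_size, out)


image = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
symbols = []
quadtree_divide(image, 0, 0, 4, threshold=8, min_size=2, out=symbols)
print(symbols)
```

The three non-significant quadrants cost one tree bit each, while the non-significant pixels inside the significant 2x2 area are coded explicitly because they are likely to become significant in a later layer.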

Improvements
- 2D memory structure changed to a cache-based 1D structure
- intelligent, stack-based, recursive division, saves >75% computational power
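A stack-based division can be written without recursive calls by keeping the pending areas on an explicit stack, which is closer to what a hardware stack module would do. The sketch below is an assumption-laden illustration (hypothetical function name, `>=` comparison, square power-of-two image) and does not reproduce the optimizations behind the >75% saving.

```python
def significant_leaves(image, threshold, min_size):
    """Return the minimal areas (x, y, size) that contain a significant pixel."""
    stack = [(0, 0, len(image))]  # explicit stack replaces the call stack
    leaves = []
    while stack:
        x, y, s = stack.pop()
        has_significant = any(abs(image[r][c]) >= threshold
                              for r in range(y, y + s)
                              for c in range(x, x + s))
        if not has_significant:
            continue
        if s <= min_size:
            leaves.append((x, y, s))
            continue
        h = s // 2
        # Push in reverse so areas pop in the order top-left, top-right,
        # bottom-left, bottom-right.
        for dy, dx in ((h, h), (h, 0), (0, h), (0, 0)):
            stack.append((x + dx, y + dy, h))
    return leaves


image = [[0] * 8 for _ in range(8)]
image[1][1] = 9
image[6][5] = 10
leaves = significant_leaves(image, threshold=8, min_size=2)
print(leaves)
```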

Results
➔ 1,469 LE (5.7%) and 5 multipliers
➔ QCIF frame: 1,001,457 clock cycles; CIF frame: 4,187,626 clock cycles
➔ at 100 MHz: 99 QCIF frames/sec, 24 CIF frames/sec
➔ 9x speedup compared to software at only 100 MHz!
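The frame rates and the speedup follow directly from the cycle counts. The short check below uses only numbers stated on the poster (the software frame rates are those of the Pentium IV measurement); the function and variable names are illustrative.

```python
CLOCK_HZ = 100_000_000                            # 100 MHz FPGA clock
CYCLES_PER_FRAME = {"QCIF": 1_001_457, "CIF": 4_187_626}
SOFTWARE_FPS = {"QCIF": 11, "CIF": 2.7}           # Pentium IV 2.4 GHz figures


def frames_per_second(cycles_per_frame, clock_hz=CLOCK_HZ):
    """Frame rate when one frame takes the given number of clock cycles."""
    return clock_hz / cycles_per_frame


for fmt, cycles in CYCLES_PER_FRAME.items():
    fps = frames_per_second(cycles)
    speedup = fps / SOFTWARE_FPS[fmt]
    print(f"{fmt}: {fps:.2f} frames/sec, {speedup:.2f}x faster than software")
```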

[Figure: panel titles "Overview" and "Hardware". Two state diagrams with the same states: wait for new symbol, load model, coding with multiplication and division, investigate interval, improve interval, perform output of coded symbols, model adaptation.]

Context-based model selection
- divides the data from the Quadtree division into context models based on their probability
- to do this, it keeps a history of the values of the surrounding pixels
- very memory-consuming
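As a rough illustration of choosing a model from the surrounding pixels, the sketch below counts already-significant horizontal, vertical and diagonal neighbours and maps the counts to a model index. The poster's diagram names H, V, Sign and D inputs and 3x26 models but does not give the mapping, so both functions here, and the 9-model index, are hypothetical.

```python
def neighbour_context(sig, row, col):
    """Return (H, V, D): counts of significant horizontal, vertical and
    diagonal neighbours of (row, col) in a 0/1 significance map."""
    def at(r, c):
        inside = 0 <= r < len(sig) and 0 <= c < len(sig[0])
        return sig[r][c] if inside else 0  # outside the image counts as 0

    h = at(row, col - 1) + at(row, col + 1)
    v = at(row - 1, col) + at(row + 1, col)
    d = (at(row - 1, col - 1) + at(row - 1, col + 1)
         + at(row + 1, col - 1) + at(row + 1, col + 1))
    return h, v, d


def select_model(h, v, d):
    """Hypothetical mapping of neighbour counts to one of 9 model indices."""
    return min(h + v, 2) * 3 + min(d, 2)


sig = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
]
context = neighbour_context(sig, 1, 1)
print(context, select_model(*context))
```

Keeping the significance history of neighbouring pixels available for every lookup is what makes this stage memory-hungry.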

Arithmetic encoder
- last stage of the encoder
- compresses data based on their probability
- adaptive encoding
- most computationally intensive part
- very sequential code, hard to parallelize
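The sequential nature of adaptive arithmetic coding is visible in a small sketch: each symbol narrows the interval using the current model, and the model is updated before the next symbol can be coded. The code below is a textbook-style binary coder using exact fractions, written for illustration only; the class and function names are assumptions, and a hardware design would use fixed-width integers with renormalisation instead.

```python
from fractions import Fraction


class AdaptiveBitModel:
    """Adaptive probability estimate for one binary context."""

    def __init__(self):
        self.counts = [1, 1]  # start with equal counts for 0 and 1

    def p_zero(self):
        return Fraction(self.counts[0], sum(self.counts))

    def update(self, bit):
        self.counts[bit] += 1  # model adaptation


def encode(bits):
    """Return (low, width) of the final interval for the bit sequence."""
    low, width = Fraction(0), Fraction(1)
    model = AdaptiveBitModel()
    for bit in bits:
        split = width * model.p_zero()  # multiplication and division
        if bit == 0:
            width = split
        else:
            low, width = low + split, width - split
        model.update(bit)  # the next symbol depends on this update
    return low, width


def decode(value, count):
    """Recover `count` bits from any value inside the final interval."""
    low, width = Fraction(0), Fraction(1)
    model = AdaptiveBitModel()
    bits = []
    for _ in range(count):
        split = width * model.p_zero()
        bit = 0 if value < low + split else 1
        if bit == 0:
            width = split
        else:
            low, width = low + split, width - split
        bits.append(bit)
        model.update(bit)
    return bits


message = [0, 0, 1, 0, 0, 0, 1, 0]
low, width = encode(message)
print(decode(low + width / 2, len(message)) == message)
```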

Conclusion: some shortcomings
- prediction of non-significant pixels close to the significant pixels is aligned on and restricted to the Quadtree's minimal area
- overhead for coding the tree is still 7% of the total bitstream
- the need for a stack complicates the hardware
- very hard communication between Quadtree and model selection, caused by different processing speeds
- coding the tree (which contains no data) consumes 40% of the processing time

Improvements
Optimization of the memory structure:

- less memory
- fewer memory accesses
- fewer computations

This results in an adequately fast implementation compared to the Quadtree.
Uses 2 M-RAM blocks.

Goal
Because the software implementation of the Quadtree algorithm on a Pentium IV 2.4 GHz:

- QCIF: 11 frames/sec
- CIF: 2.7 frames/sec

is much too slow for real applications ⇒ acceleration in hardware on an FPGA.

Platform: Altera Stratix EP1S25F1020 on a PCI board

- 25,660 LE (logic elements)
- 80 multipliers (9 bit)
- 224 M512 RAM (576 bits)
- 138 M4K RAM (4,608 bits)
- 2 M-RAM (589,824 bits)
- 512 MiB external DDR-SDRAM

http://www.elis.UGent.be/~mdhaene


Improvements
- simplification by specification
- complex pipelined architecture

Results
➔ 958 LE (3.7%) and 2 multipliers
➔ 6.35 clock cycles/symbol
➔ fewer clock cycles are possible with a modified algorithm
➔ at 100 MHz: 16 million symbols/sec ~ 76 QCIF frames/sec ~ 19 CIF frames/sec
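The symbol rate follows from the clock frequency and the cycles per symbol. In the check below, the symbols-per-frame value is only an estimate derived from the poster's rounded numbers and is not stated on the poster; the names are illustrative.

```python
def symbols_per_second(clock_hz, cycles_per_symbol):
    """Throughput when each coded symbol costs a fixed number of cycles."""
    return clock_hz / cycles_per_symbol


rate = symbols_per_second(100e6, 6.35)
print(f"{rate / 1e6:.2f} million symbols/sec")

# Derived estimate only: symbols per QCIF frame implied by ~76 frames/sec.
implied_symbols_per_qcif_frame = rate / 76
print(f"about {implied_symbols_per_qcif_frame:,.0f} symbols per QCIF frame")
```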

Memory usage
- M-RAM: 2/2
- M4K RAM: 36/138
- M512 RAM: 28/224

Solution
Therefore, a new algorithm has been proposed: the island algorithm

- builds islands around significant pixels instead of fixed areas
- smart approach involves no overhead for the spread out of an island
- overhead possibly < 1%
- more adequate prediction
- higher encoding speed possible
- less memory usage

[Figure: refinement list and non-significant list, each with a start pointer.]

http://www.elis.ugent.be/resume