an exploration of the hardware implementation of quadtree

Michiel D'Haene, Hendrik Eeckhaut and Mark Christiaens
Ghent University, ELIS/PARIS, Sint-Pietersnieuwstraat 41, 9000 Ghent

An exploration of the hardware implementation of Quadtree compression

ProRISC 2004

Parallel Information Systems

[Figure: encoder overview. Image sequence in → motion estimation (motion vectors; reference + error frames) → wavelet transform → entropy encoder → compressed bitstream out. Further labels: Quadtree algorithm, FPGA, DDR SDRAM.]

Situation
Last stage of a scalable wavelet-based video encoder:

⇒ scalability in frame rate by motion estimation
⇒ scalability in resolution by wavelet transform
⇒ scalability in quality by Quadtree compression

[Figure: hardware block diagram. Labels: Quadtree division (non-significance pass, significance pass, refinement pass); image module (start, threshold, layer info, preload); lists (NS-lists, RF-lists, preload); stack module and stack memory; lists buffer; start/busy; internal memory; DDR memory (image frame, lists); layer buffer; context-based model selection; quadrant buffer (H, V, Sign, D); 3x26 models (preload, copy); arithmetic encoder; take symbols; output buffer; coded output.]

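[Figure: the three stages. Quadtree algorithm → context-based model selection (model selection based on collected statistical information) → arithmetic encoder → out. The arithmetic encoder panel shows interval subdivisions labelled with probabilities such as 0.7/0.3, 0.6/0.4 and 0.5.]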

Principle
A layered approach allows scalability in quality for the decoder:

- start from the most significant layer of the image
- detect the significant pixels in the current layer
- code them efficiently in a progressive way
- non-significant pixels are irrelevant in this layer
- repeat this process for each bit layer
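The layered principle can be illustrated with a minimal Python sketch. It is not the poster's implementation: the function name `significance_layers`, the flat pixel list, and the choice of threshold 2^layer with an "at or above" comparison are all assumptions made for illustration.

```python
def significance_layers(pixels, num_layers):
    """Walk the bit layers from most to least significant.

    Assumption (not from the poster): a pixel becomes significant in the
    first layer whose threshold 2**layer its magnitude reaches.
    Yields (layer, threshold, newly_significant, refinement_bits).
    """
    significant = set()
    for layer in range(num_layers - 1, -1, -1):
        threshold = 1 << layer
        # Pixels that cross the threshold for the first time in this layer.
        newly = [i for i, v in enumerate(pixels)
                 if i not in significant and abs(v) >= threshold]
        # Pixels found in earlier layers only contribute one refinement bit.
        refine = {i: (abs(pixels[i]) >> layer) & 1 for i in sorted(significant)}
        significant.update(newly)
        yield layer, threshold, newly, refine


for layer, threshold, newly, refine in significance_layers([5, -12, 0, 3], 4):
    print(layer, threshold, newly, refine)
```

A decoder that stops after any layer still has a coarse version of every pixel found so far, which is where the scalability in quality comes from.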

Algorithm
Consists of 3 stages:

- limited Quadtree division
- context-based model selection
- arithmetic coding

For example, in the highest bit-layer, the significant pixels are:

Limited Quadtree division
For each layer:

- locate the pixels with value > current threshold
- this is done by recursive division of the image: while pixels in the area are significant, subdivide
- to reduce the amount of superfluous bits (the decoder has to be able to reconstruct this Quadtree), division stops when a certain area size is reached
- code all pixels in this area. Non-significant pixels in it are predicted to have a high probability of becoming significant soon (high-valued pixels are mostly surrounded by high-valued pixels)
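The limited division described above can be sketched recursively in Python. This is an illustrative sketch only: the function name `quadtree_divide`, the tuple-based output, the "at or above threshold" test, and the restriction to square images with a power-of-two side are assumptions, not details from the poster.

```python
def quadtree_divide(image, threshold, min_size, x=0, y=0, size=None, out=None):
    """Limited Quadtree division of a square, power-of-two sized image.

    Emits ("zero", x, y, size) for areas without significant pixels,
    ("split", x, y, size) for areas that are subdivided, and
    ("leaf", x, y, size, pixels) once the minimal area size is reached.
    """
    if size is None:
        size = len(image)
    if out is None:
        out = []
    rows, cols = range(y, y + size), range(x, x + size)
    if not any(abs(image[r][c]) >= threshold for r in rows for c in cols):
        out.append(("zero", x, y, size))
    elif size <= min_size:
        # Division stops here: all pixels of the area are coded,
        # including the ones that are not significant yet.
        out.append(("leaf", x, y, size, [image[r][c] for r in rows for c in cols]))
    else:
        half = size // 2
        out.append(("split", x, y, size))
        for dy in (0, half):
            for dx in (0, half):
                quadtree_divide(image, threshold, min_size, x + dx, y + dy, half, out)
    return out


image = [[0, 0, 1, 9],
         [0, 0, 2, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
for node in quadtree_divide(image, threshold=8, min_size=2):
    print(node)
```

In this example only the top-right 2x2 area contains a significant pixel, so the tree costs one split decision and three empty quadrants before any pixel data is coded.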

Improvements
- 2D memory structure converted to a cache-based 1D structure
- intelligent, stack-based, recursive division, saves >75% of the computational power
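One way to picture a stack-based division is to replace the recursion by an explicit stack of pending areas, which maps more naturally onto a stack memory in hardware. The sketch below shows only that transformation; it does not reproduce the poster's optimizations or its >75% saving, and the name `quadtree_divide_stack` and the output tuples are hypothetical.

```python
def quadtree_divide_stack(image, threshold, min_size):
    """Quadtree division of a square, power-of-two image with an explicit stack."""
    out = []
    stack = [(0, 0, len(image))]  # pending areas as (x, y, size)
    while stack:
        x, y, size = stack.pop()
        rows, cols = range(y, y + size), range(x, x + size)
        if not any(abs(image[r][c]) >= threshold for r in rows for c in cols):
            out.append(("zero", x, y, size))
        elif size <= min_size:
            out.append(("leaf", x, y, size, [image[r][c] for r in rows for c in cols]))
        else:
            half = size // 2
            out.append(("split", x, y, size))
            # Push in reverse so quadrants are popped top-left, top-right,
            # bottom-left, bottom-right.
            for dy, dx in ((half, half), (half, 0), (0, half), (0, 0)):
                stack.append((x + dx, y + dy, half))
    return out


image = [[0, 0, 1, 9],
         [0, 0, 2, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
print(quadtree_divide_stack(image, threshold=8, min_size=2))
```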

Results
➔ 1,469 LE (5.7%) and 5 multipliers
➔ QCIF frame: 1,001,457 clock cycles; CIF frame: 4,187,626 clock cycles
➔ at 100 MHz: 99 QCIF frames/sec, 24 CIF frames/sec
➔ 9x speedup compared to software, at only 100 MHz!
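The frame rates and the speedup follow directly from the cycle counts. The short Python check below uses only numbers stated on the poster (cycle counts, the 100 MHz clock, and the software frame rates of 11 and 2.7 frames/sec); the variable and function names are illustrative.

```python
CLOCK_HZ = 100_000_000                      # 100 MHz FPGA clock
CYCLES_PER_FRAME = {"QCIF": 1_001_457, "CIF": 4_187_626}
SOFTWARE_FPS = {"QCIF": 11.0, "CIF": 2.7}   # Pentium IV 2.4 GHz figures


def frames_per_second(cycles_per_frame, clock_hz=CLOCK_HZ):
    """Frames per second when one frame costs the given number of clock cycles."""
    return clock_hz / cycles_per_frame


for fmt, cycles in CYCLES_PER_FRAME.items():
    fps = frames_per_second(cycles)
    print(f"{fmt}: {fps:.1f} frames/sec, speedup {fps / SOFTWARE_FPS[fmt]:.1f}x")
```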

[Panel titles: Overview; Hardware]

[Figure: arithmetic encoder state diagram, drawn twice with the same states in a different order: wait for new symbol, load model, coding with multiplication and division, investigate interval, improve interval, perform output of coded symbols, model adaptation.]

Context-based model selection
- divides the data from the Quadtree division into context models based on their probability
- to do this, it keeps a history of the values of the surrounding pixels
- very memory-consuming
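As a rough illustration of selecting a model from the surrounding pixels, the Python sketch below counts already-significant horizontal, vertical and diagonal neighbours and maps the counts to a context index. The mapping, the number of contexts (45 here) and the function names are hypothetical; the poster does not specify how its models are indexed.

```python
def neighbour_counts(sig, r, c):
    """Count significant horizontal, vertical and diagonal neighbours of (r, c).

    `sig` is a 2D list of 0/1 significance flags; neighbours outside the
    image count as non-significant.
    """
    def s(rr, cc):
        inside = 0 <= rr < len(sig) and 0 <= cc < len(sig[0])
        return 1 if inside and sig[rr][cc] else 0

    h = s(r, c - 1) + s(r, c + 1)
    v = s(r - 1, c) + s(r + 1, c)
    d = s(r - 1, c - 1) + s(r - 1, c + 1) + s(r + 1, c - 1) + s(r + 1, c + 1)
    return h, v, d


def select_context(sig, r, c):
    """Hypothetical mapping of (h, v, d) to one of 3 * 3 * 5 = 45 context indices."""
    h, v, d = neighbour_counts(sig, r, c)
    return (h * 3 + v) * 5 + d


sig = [[1, 0, 0],
       [0, 0, 1],
       [0, 1, 0]]
print(neighbour_counts(sig, 1, 1), select_context(sig, 1, 1))
```

Each context index would then own its own adaptive probability model, so symbols with similar neighbourhoods share statistics. Keeping the significance flags of all neighbours available is what makes this stage memory-hungry.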

Arithmetic encoder
- last stage of the encoder
- compresses data based on their probability
- adaptive encoding
- most computationally intensive part
- very sequential code, hard to parallelize
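The interval narrowing behind adaptive arithmetic coding can be sketched in a few lines of Python. This is a textbook-style illustration with exact fractions and a single binary model, not the poster's hardware design, which would use fixed-width integers and renormalization; the class and function names are assumptions.

```python
from fractions import Fraction


class AdaptiveBitModel:
    """Adaptive probability estimate for a binary symbol (counts start at 1)."""

    def __init__(self):
        self.counts = [1, 1]

    def p_zero(self):
        return Fraction(self.counts[0], self.counts[0] + self.counts[1])

    def update(self, bit):
        self.counts[bit] += 1


def encode(bits):
    """Return (low, width) of the final interval for the bit sequence."""
    low, width = Fraction(0), Fraction(1)
    model = AdaptiveBitModel()
    for bit in bits:
        split = width * model.p_zero()   # part of the interval assigned to 0
        if bit == 0:
            width = split
        else:
            low += split
            width -= split
        model.update(bit)                # each step depends on the previous one
    return low, width


def decode(value, n):
    """Recover n bits from any value inside the final interval."""
    low, width = Fraction(0), Fraction(1)
    model = AdaptiveBitModel()
    bits = []
    for _ in range(n):
        split = width * model.p_zero()
        if value < low + split:
            bit, width = 0, split
        else:
            bit, low, width = 1, low + split, width - split
        bits.append(bit)
        model.update(bit)
    return bits


bits = [0, 0, 1, 0, 0, 0, 1, 0]
low, width = encode(bits)
print(width, decode(low + width / 2, len(bits)) == bits)
```

Every symbol needs the interval and the model state left by the previous symbol, which illustrates why this stage is sequential and hard to parallelize.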

Conclusion: some shortcomings
- prediction of non-significant pixels close to the significant pixels is aligned on and restricted to the Quadtree's minimal area
- overhead for coding the tree is still 7% of the total bitstream
- the need for a stack complicates the hardware
- communication between Quadtree division and model selection is very hard because of their different processing speeds
- coding the tree (which contains no data) consumes 40% of the processing time

Improvements
Optimization of the memory structure:

- less memory
- fewer memory accesses
- fewer computations

This results in an adequately fast implementation compared to the Quadtree.
Uses 2 M-RAM blocks.

Goal
The software implementation of the Quadtree algorithm on a Pentium IV 2.4 GHz reaches:

- QCIF: 11 frames/sec
- CIF: 2.7 frames/sec

This is much too slow for real applications, hence acceleration in hardware on an FPGA.

Platform: Altera Stratix EP1S25F1020 on a PCI board

- 25,660 LE (logic elements)
- 80 multipliers (9 bit)
- 224 M512 RAM (576 bits)
- 138 M4K RAM (4,608 bits)
- 2 M-RAM (589,824 bits)
- 512 MiB external DDR-SDRAM

http://www.elis.UGent.be/~mdhaene


Improvements
- simplification by specification
- complex pipelined architecture

Results
➔ 958 LE (3.7%) and 2 multipliers
➔ 6.35 clock cycles/symbol
➔ fewer clock cycles are possible with a modified algorithm
➔ at 100 MHz: 16 million symbols/sec ~ 76 QCIF frames/sec ~ 19 CIF frames/sec

Memory usage
- M-RAM: 2/2
- M4K RAM: 36/138
- M512 RAM: 28/224

Solution

Therefore, a new algorithm has been proposed: the island algorithm

- builds islands around significant pixels instead of fixed areas
- smart approach involves no overhead for the spreading of an island
- overhead possibly < 1%
- more adequate prediction
- higher encoding speed possible
- less memory usage

[Figure: flow diagram with a refinement list and a non-significant list, each with a start marker.]

http://www.elis.ugent.be/resume
