Michiel D'Haene, Hendrik Eeckhaut and Mark Christiaens
Ghent University, ELIS/PARIS, Sint-Pietersnieuwstraat 41, 9000 Ghent
An exploration of the hardware implementation of Quadtree compression
ProRISC 2004
Parallel Information Systems
[Figure: encoder pipeline. Image sequence in → motion estimation (motion vectors, reference + error frames) → wavelet transform → entropy encoder → compressed bitstream out. Further labels: Quadtree algorithm, FPGA, DDR SDRAM.]
Situation
Last stage of a scalable wavelet-based video encoder:
⇒ scalability in frame rate by motion estimation
⇒ scalability in resolution by wavelet transform
⇒ scalability in quality by Quadtree compression
[Figure: hardware block diagram. Blocks: Quadtree division (non-significance pass, significance pass, refinement pass), image module, stack module with stack memory, lists buffer (NS-lists, RF-lists), layer buffer, internal memory, DDR memory (image frame, lists), quadrant buffer, context-based model selection (H, V, Sign, D; 3x26 models), arithmetic encoder, output buffer. Signals: start, threshold, layer info, preload, busy, copy, take symbols, coded output.]
[Figure: the three stages (Quadtree algorithm, context-based model selection based on collected statistical information, arithmetic encoder), annotated with example probability values between 0.3 and 0.7.]
Principle
A layered approach allows scalability in quality for the decoder:
- start from the most significant layer of the image
- detect the significant pixels in the current layer
- code them efficiently in a progressive way
- non-significant pixels are irrelevant in this layer
- repeat this process for each bit layer
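The layer-by-layer principle can be sketched in a few lines of Python. This is only an illustration, not the poster's implementation: the function name `significance_layers`, the power-of-two thresholds and the use of `>=` for the significance test are assumptions.

```python
def significance_layers(pixels, num_layers):
    """Yield (threshold, newly_significant_indices), most significant layer first.

    Assumption: the threshold of bit layer n is 2**n and a pixel is
    significant once its magnitude reaches that threshold.
    """
    already_significant = set()
    for layer in range(num_layers - 1, -1, -1):
        threshold = 1 << layer
        newly = [
            index
            for index, value in enumerate(pixels)
            if index not in already_significant and abs(value) >= threshold
        ]
        already_significant.update(newly)
        yield threshold, newly


if __name__ == "__main__":
    for threshold, newly in significance_layers([0, 5, -9, 2, 1], num_layers=4):
        print(threshold, newly)
```

A decoder that stops after the first few layers only knows the largest pixels, which is what makes the quality scalable.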
Algorithm
Consists of 3 stages:
- limited Quadtree division
- context-based model selection
- arithmetic coding
For example, in the highest bit-layer, the significant pixels are:
Limited Quadtree division
For each layer:
- locate pixels with value > current threshold
- done by recursive division of the image: while pixels in the area are significant, subdivide
- to reduce the amount of superfluous bits (the decoder has to be able to reconstruct this Quadtree), division stops when a certain area size is reached
- code all pixels in this area. Non-significant pixels in it are predicted to have a high probability of becoming significant soon (high-valued pixels are mostly surrounded by high-valued pixels)
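The limited division for one layer could look roughly like the sketch below. It assumes a square image whose side is a power of two; the symbol format (`"node"` and `"pixel"` tuples), the quadrant order, the `min_size` parameter and the `>=` significance test are illustrative choices that the poster does not specify.

```python
def quadtree_pass(image, threshold, min_size=2):
    """Return the symbols of one limited Quadtree pass over a square image."""
    symbols = []

    def is_significant(x, y):
        # Assumption: significance means magnitude >= threshold.
        return abs(image[y][x]) >= threshold

    def visit(x, y, size):
        area = [(x + dx, y + dy) for dy in range(size) for dx in range(size)]
        has_significant = any(is_significant(px, py) for px, py in area)
        # One tree symbol per visited area, so the decoder can rebuild the tree.
        symbols.append(("node", int(has_significant)))
        if not has_significant:
            return
        if size <= min_size:
            # Minimal area reached: code every pixel in it, significant or not.
            for px, py in area:
                symbols.append(("pixel", int(is_significant(px, py))))
            return
        half = size // 2
        for qx, qy in ((x, y), (x + half, y), (x, y + half), (x + half, y + half)):
            visit(qx, qy, half)

    visit(0, 0, len(image))
    return symbols
```

For a 4x4 image with a single significant pixel, only one of the four 2x2 quadrants is expanded; the other three cost a single tree symbol each.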
Improvements
- 2D memory structure changed to a cache-based 1D structure
- intelligent, stack-based, recursive division; saves >75% computational power
Results
➔ 1,469 LE (5.7%) and 5 multipliers
➔ QCIF frame: 1,001,457 clock cycles; CIF frame: 4,187,626 clock cycles
➔ at 100 MHz: 99 QCIF frames/sec, 24 CIF frames/sec
➔ 9x speedup compared to software at only 100 MHz!
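The frame rates and the speedup follow directly from the cycle counts. The short check below uses only figures stated on the poster (the software frame rates come from the Goal section); the names are illustrative.

```python
CLOCK_HZ = 100_000_000  # 100 MHz FPGA clock
CYCLES_PER_FRAME = {"QCIF": 1_001_457, "CIF": 4_187_626}
SOFTWARE_FPS = {"QCIF": 11, "CIF": 2.7}  # Pentium IV 2.4 GHz, from the Goal section


def frames_per_second(cycles_per_frame, clock_hz=CLOCK_HZ):
    """Frames per second when one frame costs the given number of clock cycles."""
    return clock_hz / cycles_per_frame


if __name__ == "__main__":
    for fmt, cycles in CYCLES_PER_FRAME.items():
        fps = frames_per_second(cycles)
        print(f"{fmt}: {fps:.1f} frames/sec, speedup {fps / SOFTWARE_FPS[fmt]:.1f}x")
```

The QCIF rate comes out just under 100 frames/sec and the CIF rate just under 24, and both speedups round to 9x.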
Overview
Hardware
[Figure: arithmetic encoder state diagram, shown twice. States: wait for new symbol, load model, coding with multiplication and division, investigate interval, improve interval, perform output of coded symbols, model adaptation.]
Context-based model selection
- divides data from the Quadtree division into context models based on their probability
- to do this, it keeps a history of the values of the surrounding pixels
- very memory-consuming
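As a rough illustration of the idea, the sketch below picks a model from the significance of the neighbouring pixels and adapts a pair of counters per model. The neighbourhood, the context key `(horizontal, vertical, diagonal)` and the class `ContextModels` are assumptions made for this example; the poster does not describe the actual contexts beyond naming H, V, Sign and D.

```python
from collections import defaultdict


class ContextModels:
    """Hypothetical adaptive binary models, one per neighbourhood context."""

    def __init__(self):
        # [count of 0s, count of 1s] per context, both starting at 1.
        self.counts = defaultdict(lambda: [1, 1])

    @staticmethod
    def context(sig, x, y):
        """Context of pixel (x, y) from the 0/1 significance flags in sig."""
        height, width = len(sig), len(sig[0])

        def at(px, py):
            # Pixels outside the image count as non-significant.
            return sig[py][px] if 0 <= px < width and 0 <= py < height else 0

        horizontal = at(x - 1, y) + at(x + 1, y)
        vertical = at(x, y - 1) + at(x, y + 1)
        diagonal = (at(x - 1, y - 1) + at(x + 1, y - 1)
                    + at(x - 1, y + 1) + at(x + 1, y + 1))
        return (horizontal, vertical, diagonal)

    def probability_of_one(self, ctx):
        zeros, ones = self.counts[ctx]
        return ones / (zeros + ones)

    def update(self, ctx, bit):
        self.counts[ctx][bit] += 1
```

Keeping the significance history of the whole neighbourhood available is what makes this stage memory-hungry.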
Arithmetic encoder
- last stage of the encoder
- compresses data based on their probability
- adaptive encoding
- most computationally intensive part
- very sequential code, hard to parallelize
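The core of adaptive arithmetic coding is interval narrowing, sketched below with exact fractions for readability. This is a textbook-style illustration under stated assumptions, not the poster's hardware design: a practical encoder works with fixed-width integers and outputs bits as the interval shrinks, and the counter-based model here is hypothetical.

```python
from fractions import Fraction


def narrow_interval(low, high, bit, p_zero):
    """Keep the part of [low, high) that belongs to the coded bit."""
    split = low + (high - low) * p_zero  # the multiplication step
    return (low, split) if bit == 0 else (split, high)


def encode_bits(bits):
    """Return the final interval after coding bits with an adaptive model."""
    low, high = Fraction(0), Fraction(1)
    zeros, ones = 1, 1  # adaptive counts, both starting at 1 (assumption)
    for bit in bits:
        p_zero = Fraction(zeros, zeros + ones)  # the division step
        low, high = narrow_interval(low, high, bit, p_zero)
        # Model adaptation: the symbol just coded becomes more likely.
        if bit == 0:
            zeros += 1
        else:
            ones += 1
    return low, high
```

Each step depends on the interval and the model left behind by the previous symbol, which is why the code is so sequential.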
Conclusion: some shortcomings
- prediction of non-significant pixels close to the significant pixels is aligned on and restricted to the Quadtree's minimal area
- overhead for coding the tree is still 7% of the total bitstream
- need for a stack complicates the hardware
- communication between Quadtree and model selection is very hard because of their different processing speeds
- coding the tree (which contains no data) consumes 40% of the processing time
Improvements
- optimization of the memory structure:
  - less memory
  - fewer memory accesses
  - fewer computations
- results in an adequately fast implementation compared to the Quadtree
- uses 2 M-RAM blocks
Goal
The software implementation of the Quadtree algorithm on a Pentium IV 2.4 GHz is much too slow for real applications:
- QCIF: 11 frames/sec
- CIF: 2.7 frames/sec
⇒ acceleration in hardware on an FPGA
Platform: Altera Stratix EP1S25F1020 on a PCI board
- 25,660 LE (logic elements)
- 80 multipliers (9 bit)
- 224 M512 RAM (576 bits)
- 138 M4K RAM (4,608 bits)
- 2 M-RAM (589,824 bits)
- 512 MiB external DDR-SDRAM
http://www.elis.UGent.be/~mdhaene
Improvements
- simplification by specification
- complex pipelined architecture
Results
➔ 958 LE (3.7%) and 2 multipliers
➔ 6.35 clock cycles/symbol
➔ fewer clock cycles are possible with a modified algorithm
➔ at 100 MHz: 16 million symbols/sec ~ 76 QCIF frames/sec ~ 19 CIF frames/sec
Memory usage
- M-RAM: 2/2
- M4K RAM: 36/138
- M512 RAM: 28/224
Solution
Therefore, a new algorithm has been proposed: the island algorithm
- builds islands around significant pixels instead of fixed areas
- smart approach involves no overhead for the spreading out of an island
- overhead possibly < 1%
- more adequate prediction
- higher encoding speed possible
- less memory usage
[Figure labels: start, refinement list, nonsignificant list]
http://www.elis.ugent.be/resume