evaluation of data-parallel h.264 decoding approaches for ... · the traditional profiling methods....

Multimed Tools ApplDOI 10.1007/s11042-010-0501-7

Evaluation of data-parallel H.264 decoding approachesfor strongly resource-restricted architectures

Florian H. Seitner · Michael Bleyer ·Margrit Gelautz · Ralf M. Beuschel

© Springer Science+Business Media, LLC 2010

Abstract Decoding of an H.264 video stream is a computationally demanding mul-timedia application which poses serious challenges on current processor architec-tures. For processors with strongly limited computational resources, a natural wayto tackle this problem is the use of multi-core systems. The contribution of thispaper lies in a systematic overview and performance evaluation of parallel videodecoding approaches. We focus on decoder splittings for strongly resource-restrictedenvironments inherent to mobile devices. For the evaluation, we introduce a high-level methodology which can estimate the runtime behaviour of multi-core decodingarchitectures. We use this methodology to investigate six methods for accomplishingdata-parallel splitting of an H.264 decoder. These methods are compared againsteach other in terms of runtime complexity, core usage, inter-communication and bustransfers. We present benchmark results using different numbers of processor cores.Our results shall aid in finding the splitting strategy that is best-suited for the targetedhardware-architecture.

Keywords Video decoding · H.264/AVC · Multimedia · Multi-core ·Embedded architectures

F. H. Seitner (B) · M. Bleyer · M. Gelautz · R. M. BeuschelInstitute for Software Technology and Interactive Systems,Vienna University of Technology,Favoritenstrasse 9-11/188/2, 1040 Vienna, Austriae-mail: [email protected]

M. Bleyere-mail: [email protected]

M. Gelautze-mail: [email protected]

R. M. Beuschele-mail: [email protected]

Multimed Tools Appl

1 Introduction

The H.264 video standard [9] is currently used in a wide range of video-related areassuch as video content distribution and television broadcasting. Compared to preced-ing standards such as MPEG-2 and MPEG-4 SP/ASP, improved coding efficiencycould be reached by introducing more advanced pixel processing algorithms (e.g.quarter-pixel motion estimation) as well as by the use of more sophisticated algo-rithms for predicting syntax elements from neighbouring macroblocks (e.g. context-adaptive VLC). These new coding tools result in significantly increased CPU andmemory loads required for decoding a video stream. In environments of limitedprocessing power such as embedded systems, the high computational demands posea challenge for practical H.264 implementations [6]. Multi-core systems provide anelegant and power-efficient solution to overcome these performance limitations.They enable running complex multimedia applications such as the one describedin [8] on embedded architectures.

However, the efficient distribution of the H.264 video algorithm among multipleprocessing units is a non-trivial task. For using the available processing resourcesefficiently, an equally balanced distribution of the decoder onto the hardware unitsmust be found. The system designer has to consider data dependency issues aswell as inter-communication and synchronization between the processing units.Furthermore, the resource limitations in an embedded environment such as lowcomputational power, small-sized on-chip memories and low bus bandwidth requirean efficient software design. A parallel decoder approach for mobile systems mustbe able to work under these resource restrictions.

The contributions of this paper are twofold. Our first and major contribution is anevaluation of different parallel decoding approaches. We evaluate these approachesin terms of runtime complexity, efficient core usage, data transfers and buffer sizes.This information is of prime importance when designing a resource-efficient multi-core decoder. Our results can be exploited for choosing that parallel approach whichperforms best under certain hardware restrictions. We provide a detailed overview ofsix different data-parallel splitting methods for H.264 and compare these approachesagainst each other. We investigate the impact of raising the number of cores on theresource requirements by processing different test sequences.

To accomplish our benchmark, we introduce a high-level simulation methodologyfor parallel decoder systems. This methodology represents the second contributionof this paper. The methodology, referred to as Partition Assessment Methodology,allows estimating the complex runtime behaviour of a parallel decoding architecture.The Partition Assessment Methodology models critical multi-core issues such asdata-dependencies between cores, buffer limitations and concurrency. We overcomethe problems of traditional estimation techniques that are typically not suitable formulti-core environments and not flexible enough for exploring a large range ofdifferent decoder configurations.

In the following, we go into more detail on both contributions and discuss themin the context of prior work. First, we describe methods for splitting the H.264decoder onto multiple processors. These methods form the foundation for thesplitting approaches investigated in this paper. We then focus on runtime estimationtechniques that we consider as being related to the proposed Partition AssessmentMethodology.

Multimed Tools Appl

1.1 Methods for parallel decoding

Various approaches for parallelizing the H.264 decoding process have been intro-duced in the literature. Van der Tol et al. [25] investigate methods for mappingthe H.264 decoding process onto multiple cores. Functional splitting of a videodecoder and the use of macroblock pipelining at thread-level are demonstratedin [3, 4, 10, 21]. Zhao and Liang [28] exploit frame parallelism in the Wavefronttechnique. A hierarchical approach working at group-of-picture and frame levels isdemonstrated in [20]. Meenderinck et al. [13] investigate the scalability of H.264 for adata-parallel decoding approach operating on the macroblock-level and on multipleframes in parallel. Moreover, these authors introduce an efficient technique forH.264 frame-level parallelization, namely the 3D-Wave strategy. All of those papersshare a focus on parallelization in terms of algorithmic scalability. Upper limits onthe number of processors and frames processed in parallel are given. However, thememory restrictions of embedded environments make these approaches hardly us-able for mobile and embedded architectures. More resource-efficient H.264 splittingapproaches are introduced in [23, 24, 26]. The focus of these papers is on efficientdecoder implementations for embedded architectures.

This paper focuses on evaluating the behaviour of different parallel decodingapproaches in terms of runtime complexity, efficient core usage, data transfers andbuffer sizes. This information is of prime importance when designing a resource-efficient multi-core decoder. The information can be exploited for choosing theparallel approach which performs best under certain hardware restrictions.

However, benchmarking a large variety of different decoder partitionings runningon multiple cores poses two problems. Firstly, architectures with four and more coresand appropriate benchmarking tools are not yet commonly available. Secondly, im-plementing the different partitioning methods and adapting them to run on a specificarchitecture is a highly labour-intensive task and limits the number of approachesthat can typically be evaluated. The estimation techniques described in the nextsection provide a way to gain insights into a video decoder’s runtime behaviourwithout necessarily requiring a completely functional system implementation.

1.2 Runtime estimation techniques

Analysing an algorithm’s definition (e.g., its source code) can provide coarse esti-mation on the computational complexity of an algorithm [12, 17]. However, thesetechniques do, in general, not consider the impact of the input data on the program’sexecution behaviour (e.g., input-dependent recursions and branches). For analysingvideo decoding systems where the runtime behaviour is typically strongly dependenton the input stream, these techniques are not suitable.

Profiling methods [1, 5, 7, 15, 18, 19, 27] overcome this problem. These methodseither simulate or extend the program in a way that the program’s executionbehaviour can be observed during runtime. This allows measuring the complexity ofdecoding functions in the context of a specific input stream. The profiling proceduretypically outputs measurements of consumed time or cycle counts of the investigatedprogram running on a specific hardware.

However, profilings are typically available only for single-core architectures. Forvideo decoding systems with strong variations in the runtime complexity, making

Multimed Tools Appl

predictions about the parallel decoding system’s behaviour based on a single-coreprofiling is non-trivial. Algorithmic dependencies, differences in workload and re-source limitations result in varying runtime constraints between the processing units.These constraints must be considered.

The simulator introduced in this work overcomes these shortcomings by extendingthe traditional profiling methods. Our simulator is capable of predicting the runtimebehaviour of a virtual multi-core decoder based on information gained from tradi-tional single-core profilers. In contrast to hardware-based development techniquessuch as HW/SW-Codesign methods [2, 11], the labour- and time-intensive task ofimplementing a fully functional hardware as well as the decoder software is avoided.This results in a high flexibility when exploring different hardware configurationsand decoder partitionings. In this work, we use this method to compare differentcompeting decoder partitionings and architecture configurations against each otherto find the best-suited one.

1.3 Paper structure

The remainder of the paper is organized as follows. Section 2 focuses on thefundamental differences between functional and data-parallel splitting approaches.Additionally, it describes the H.264 decoding process and the algorithmic depen-dencies that counteract the parallelization. In Section 3, we first analyse a single-coreprofiling of an H.264 decoder. We then derive our high-level simulation methodologyfor parallel decoders from this analysis. Section 4 gives a detailed overview of the sixevaluated parallelization approaches. We provide a structured analysis and describeour motivation for including them in our survey. Results are provided in Section 5.Final conclusions and an outlook on future work are given in Section 6.

2 Parallel H.264 decoding

This section first explains the two options for partitioning a video decoder. It thenanalyses the influence of the algorithmic structure of an H.264 video decoder on thepartitioning process.

2.1 Functional and data-parallel splittings

The memory restrictions of most embedded video coding architectures require theuse of small data packages and typically result in a macroblock-based decodingsystem. In functional-partitioned decoding systems, the decoding tasks such asparsing, motion compensation or deblocking are assigned to individual processingcores, typically one task per each processing unit. Macroblocks are processed byone processor after the other. The multiple processors allow starting the nextmacroblock’s decoding tasks before computation of the current macroblock hasfinished.

This splitting method has the advantage that each processing unit can be optimisedfor a certain task (e.g. by adding task-specific hardware extensions) and minimal-sized instruction caches. In contrast to data-dependent parallelization, also strongly

Multimed Tools Appl

sequential tasks can be accelerated by this strategy. The disadvantages are anunequal workload balancing and high transfer rates for inter-communication.

As opposed to functional splitting methods, data-parallel systems do not distributethe functions, but the macroblocks among multiple processing units. For efficientparallelization, the macroblocks’ core assignment algorithm has to address thefollowing issues:

– The data-dependencies between different cores must be minimized and data-locality must be exploited (i.e. supporting of caching strategies).

– The macroblock distribution onto the processing cores must achieve an equalworkload balancing.

– Generic macroblock core assignment for different frame sizes must be possible.

Scalability in parallel systems requires minimal data-dependency between thecores. A compromise between small memory size and data-dependencies can bereached by grouping the macroblocks as described in [25]. For supporting cachingstrategies at a more global scale, the groups of macroblocks assigned to one coremust be aligned closely together in each frame. By introducing a centralized andconstant macroblock assignment for each frame, this global caching issue can beaddressed efficiently. Additionally, a constant macroblock assignment allows theparsed macroblocks to be written directly to the input FIFOs of the correspondingreconstructing cores.

In this work, we focus solely on approaches based on data-parallelization. For ahigh core count, this parallelization paradigm offers better scalability and a morebalanced workload than functional splitting approaches [25].

2.2 The H.264 decoder

Figure 1 illustrates the principle of the H.264 decoding process. The decoder can bedivided into two fundamentally different parts, the parser and the reconstructor.

Stream ParsingEntropy Decoding

IDCT

SpatialPrediction

MotionCompensation

Deblocking+

ReferenceFrames

DecodedFrame

EncodedBitstream

(investigated in our study)(not investigated in our study)

- Parser- Functional Splitting

- Reconstructor- Data-Parallel Processing

Fig. 1 The H.264 decoding process. Macroblocks run through the decoder pipeline, one after theother

Multimed Tools Appl

The parser includes context calculations for the context-adaptive part of the en-tropy decoding process, the low-level entropy decoding functions (binary-arithmeticor variable length coding) and the reconstruction of the macroblock’s syntax ele-ments such as motion vectors or residual information. This decoder part is highlysequential with short feedback loops at bit- and syntax element-levels. Furthermore,the conditional calculations of the entropy encoder’s context information requireshort branch executions of the executing processor and efficient data access. Thismakes data-parallel processing hardly usable in these decoding stages and functionalsplitting is preferable. Our work therefore focuses on the reconstructor module ofthe H.264 decoder, which provides a wide range of possibilities for exploiting data-parallelization. In our simulations, we assume that stalls cannot be caused by theparser. In other words, an idealized parser always has those macroblocks availablethat are currently needed in the reconstructor module.

In the reconstruction part, the H.264 decoder uses the parsed syntax elements togenerate spatial (=intra) or temporal (=inter) predictions for each macroblock. Theresidual coefficients are inverse transformed (IDCT) and added to the macroblock’sprediction. Finally, an adaptive deblocking filter is applied to the outer and inneredges of the 16 4 × 4 pixel sub-blocks of a macroblock in order to remove blockingartefacts.

2.3 Macroblock dependencies

Data-parallel splitting of the decoder’s reconstruction module is challenging due todependencies between spatially as well as temporally neighbouring macroblocks.These dependencies originate from three sources illustrated in Fig. 2 and aredescribed as follows.

Firstly, in progressive frame coding, the intra-prediction uses unfiltered pixelinformation from up to four spatially neighbouring macroblocks to predict thecurrent macroblock. These dependencies are depicted in Fig. 2a. In general, it isa good option to gather the current macroblock and its reference macroblocks onthe same CPU to avoid expensive inter-processor communication for resolving thisdependency. For an efficient parallelization, this dependency must be addressedcarefully.

Secondly, the deblocking filter imposes additional spatial dependencies. Forfiltering the outer edges of the current macroblock, up to four pixel rows/columnsfrom the upper and left neighbouring macroblocks are used as filter input. These

CurrentMB

CurrentMB

CurrentMB

Frame n-1

Frame n

(a) (b) (c)

Fig. 2 Macroblock dependencies in H.264 decoding. Arrows mean that the macroblock at the originof the arrow needs to be processed before decoding the current macroblock. a Intra-predictiondependency. b Deblocking filter dependency. c Inter-prediction dependency

Multimed Tools Appl

macroblock dependencies are visualized in Fig. 2b. An efficient parallelizationmethod will focus on avoiding that these dependencies have to be resolved acrossindividual processors.

The third and final macroblock dependency arises from the inter-prediction.The inter-prediction reads reference data from macroblocks of previously decodedframes. Obviously, it is required that processing of these reference macroblockshas already been completed before they can be used for inter-prediction of thecurrent macroblock. This results in the temporal dependency depicted in Fig. 2c.In fact, the current macroblock can depend on a rather large number of referencemacroblocks. H.264 allows splitting of the current macroblock into small sub-blocks,for each of which a separate motion vector is computed. In P-slices, each inter-codedmacroblock can contain up to 16 motion vectors and point to one reference frame.For bi-directionally predicted macroblocks in B-slices, a maximum of 32 motionvectors and two reference frames is possible.

3 A method for simulating video decoding architectures

In this section, we describe the simulation methodology used for this evaluation.First, we analyse a single-core decoder profile and describe the profiling informationthat we use as input for the high-level simulation of a parallel decoder. Second, weprovide a detailed description of the methodology and the internals of the high-levelsimulation approach.

3.1 Analysis of a single-core decoder

Macroblock prof iling Traditional dynamic profiling provides runtime informationfor each decoding function. In our implementation, such profiles are derived from asingle-core instruction-set (IS) simulator for the multimedia processor CHILI [22].The profiler of the CHILI simulator provides the time of each function call and eachfunction return that has occurred during the execution of the decoder (Fig. 3). Notethat in this work we have used the CHILI profiler for obtaining this information.Nevertheless, this is no restriction of our approach, since profilers for other platformsare, in general, also able to provide this information.

The idea behind recording the time of function calls and returns is to determinethe computation time that an individual macroblock spends in the decodings stepsdepicted in Fig. 1. Figure 3 visualizes this idea of mapping each function call to thedecoding task of an individual macroblock. For example, we want to determine thecomputation time that is required for the intra-prediction of the third macroblock inthe stream. We know that a function IntraPrediction implements the intra-predictionfor this macroblock. Let us assume that in the decoding of a macroblock thisfunction is called exactly once. The processing times for the third invocation ofIntraPrediction will consequently represent the time that Macroblock 3 spends forthe intra-prediction.

However, it should be noticed that this mapping is considerably more complexin practice, since functions are typically called arbitrary times for the decoding ofa single macroblock, which depends on the macroblock’s coding mode. We have

Multimed Tools Appl

Macroblock 3

Parse

Intra

IDCT

Deblock

Inputstream

Profiler for IS simulator

Macroblock 2

A Cycles

B Cycles

C Cycles

D Cycles

Mapped profiling Macroblockprofiling

Call DecodeMacroblock....

Return DecodeMacroblock

Call DecodeMacroblockCall ParseMacroblock...Return ParseMacroblockCall IntraPrediction....Return IntraPredictionCall IDCT...Return IDCTCall Deblock....Return Deblock

Return DecodeMacroblock

Call DecodeMacroblock....

Return DecodeMacroblock....

Macroblock 1

....

Fig. 3 A dynamic profiler delivers the exact execution times for function calls and returns. We mapthis information to the major decoding functions of each macroblock and calculate the number ofcycles spent in each decoding block. We refer to this macroblock-assigned profiling information asmacroblock prof iling

therefore implemented an interpreter that performs the mapping of raw profilingdata to meaningful function units.

Decoder software The profiles are based on a commercial H.264/AVC main profiledecoder for embedded architectures. This decoder has been optimised in termsof memory usage and support of DMA transfers. In addition, the regular pixel-based processing functions of the decoder (e.g., interpolation, prediction) have beenassembly optimised to make use of the SIMD processor commands. For mobilemultimedia applications, using SIMD instructions typically improves the run-timeperformance in a power-efficient way [16].

Test sequences We use four HD test sequences that are shown in Table 1. At aPeak Signal-to-Noise Ratio of approximately 40 db, the four test sequences havebitrates between 12.6 and 50.8 MBit/s. The bitrate is an indicator for the amountof texture and motion occurring in a sequence. The Bus sequence with large, low-textured regions and moderate motion has the smallest bitrate of all test sequences(12.6 MBit/s). Shields and Stockholm are both moderately textured. The horizontalmovement in the Shields sequence as well as the zooming operation of the camerain the Stockholm sequence result in a slightly higher motion activity than in the Bussequence. The higher texturedness and motion activity for the Shields and Stockholmsequences lead to bitrates of 18.8 and 25.6 MBit/s, respectively. The backgroundin the Parkrun sequence contains strong texture patterns and a strong horizontalmotion with various temporal occlusions. This leads to the highest bitrate in our testset (50.8 MBit/s).

Multimed Tools Appl

Table 1 Four test sequences at 720p resolution

Sequences

Bus Shields Stockholm Parkrun

12.6 MBit/s 18.8 MBit/s 25.6 MBit/s 50.8 MBit/s1,280 × 720 pixels 1,280 × 720 pixels 1,280 × 720 pixels 1,280 × 720 pixelsModerately textured Strongly textured Moderately textured Strongly texturedModerate motion Moderate motion Continuous global Strong motion

motion

Rows 3 and 4 show the resulting bitrates (at 25 frames per second) and the frame resolution in pixels

Dynamic runtime behaviour We have computed macroblock-based runtime mea-surements for the test sequences of Table 1. Figure 4 visualizes the macroblocks’decoding complexity for all four sequences. Here, the important point lies in thedynamic behaviour of macroblocks. It can be clearly seen that cycle counts are verydifferent among individual macroblocks. As is also shown in the figure, this observa-tion can still be made when considering the classes of I-, B- and P-macroblocks alone.The pixel information that each macroblock encodes seems to have large influenceon its runtime behaviour.

Variations in the execution times of the decoding functions will result in a highlydynamic system when developing parallel decoders. Runtime prediction of a paralleldecoding system is not straight-forward in the light of this dynamic behaviour.Moreover, multi-core decoder simulation is complicated by dependencies that existbetween individual macroblocks. The methodology described in the following sec-tion considers these variations and dependencies when predicting the runtime of amulti-core decoding system.

(Bus) (Shields) (Stockholm) (Parkrun)

0 50 100 1500

0.5

1

1.5

2

2.5

3

3.5x 10

4

Complexity [KCycles/MB]

Bin

cou

nt

I-MBsP-MBsB-MBs

0 50 100 1500

0.5

1

1.5

2

2.5

3

3.5x 10

4


I-MBsP-MBsB-MBs

0 50 100 1500

0.5

1

1.5

2

2.5

3

3.5x 10

4


I-MBsP-MBsB-MBs

0 50 100 1500

0.5

1

1.5

2

2.5

3

3.5x 10

4


I-MBsP-MBsB-MBs

Fig. 4 Dynamic variations in the execution times of individual macroblocks in the H.264 decodingprocess. Histogram bins plot the number of macroblocks having similar runtimes. The colours indi-cate the contributions of macroblocks from I-, P- and B-slices to the overall bin counts. Histogramsare shown for 50 frames of the four sequences from Table 1. These sequences are IPB coded with theGroup of Pictures (GOP) size being 12. It is observed that the runtimes of macroblocks considerablyvary within a sequence. This observation is also made when considering I-, P- and B-macroblocksseparately

Multimed Tools Appl

3.2 Methodology

For a flexible description of the hardware as well as the decoding algorithm, we pro-pose a high-level modelling approach. This approach is implemented by a simulatorthat we refer to as the Partition Assessment Simulator (PAS). Figure 5 illustrates thebasic idea behind the PAS. To use the PAS, one has to give the following inputs.First, a description of a specific video decoding algorithm is required. Second, theuser provides an abstract description of the hardware as well as a partitioning ofdecoder functions onto this hardware. Third, the user presents macroblock profilingresults and coding information (e.g. coding type and motion vectors) of a specificstream to the PAS. The task of the PAS is then to simulate the runtime behaviour ofthe described distributed decoder running on this virtual architecture.

Modelling the hardware and the decoder software results in high flexibility whendoing design space explorations. This includes the evaluation of unknown hardwareconfigurations as well as decoder software partitionings. For example, let us assumethat we want to test a new multi-core hardware system. In this case, the systemdesigner just needs to provide an abstract description of the new hardware as wellas an updated mapping of decoder tasks to the given processing cores. These smallmodifications are already sufficient to estimate the runtime behaviour of the decodersoftware on the new hardware.

Figure 6 illustrates how the PAS works in more detail. The decoding processfor three macroblocks is shown in six simulation steps. Core 1 performs the firsttwo decoding tasks (i.e. parsing and entropy decoding) and writes the results toFIFO 1. In this example, the buffer size of FIFO 1 is limited to one macroblock.Core 2 reads the results from FIFO 1 and applies the remaining decoding tasks.Internally, the PAS maintains a list for each processing unit and FIFO buffer. Itstores the states and the filling levels of each processing unit and FIFO, respectively.The simulator sequentially processes one macroblock decoding task after the other.It uses the single-core macroblock profiles to determine the time required to finish

PARTITIONASSESSMENT

SIMULATOR (PAS)

Runtime behaviour- Complexity- Memory requirements- Transfer volume...

Video Stream

Video Algorithm

Simulated Hardware

Runtime profilingsCoding information

AlgorithmicStructure

Abstract HardwareDescription

Fig. 5 The Partition Assessment Simulator (PAS). Our simulator combines information about avideo codec (e.g., H.264), a virtual multi-core architecture and a video input stream in order toestimate the runtime behaviour of this specific video decoding architecture. Approximations of thedecoder’s performance (e.g., complexity, memory accesses, etc.) are accomplished without the needto fully implement the hardware or the decoder software

Multimed Tools Appl

Fig. 6 Visualization of theinternal simulation process inthe PAS. The f igure shows theexecution states and bufferlevels of two cores and oneFIFO, respectively. Threemacroblocks are processed ontwo individual cores. Afterpartially decoding eachmacroblock, Core 1 writes theresults to FIFO 1. Forsimplicity, the FIFO’smaximum size is set to onemacroblock. Core 2 reads themacroblocks from this bufferand computes the remainingdecoding tasks. A detailedexplanation is given in the text

2

2

1

1

Core 1

Core 2

FIFO 1

1

(a)

1Core 1

Core 2

FIFO 1(b)

1

1Core 1

Core 2

FIFO 1(c)

2

1

1Core 1

Core 2

FIFO 1(d)

21

1Core 1

Core 2

FIFO 1(e)

3

2

2

1

1Core 1

Core 2

FIFO 1(f)

3

3

1

2

2

32

2 3

S

S E

E

E

E

E

E

E

E

E

E

E

E

S

S

S

S

S

S

S

S

S

4

time

time

time

time

time

time

time

time

time

time

time

time

time

time

time

time

time

time

a specific decoding task. The data-dependencies of a macroblock depend on themacroblock’s coding mode. This coding mode is derived from the macroblocks’coding information.

For each decoding task, the duration and the dependencies on other tasks andmacroblocks are determined by the macroblock profiling and coding information.Firstly, the PAS uses the algorithm mapping to determine the processing unitexecuting a decoding task. Secondly, it evaluates when this processing unit can start

Multimed Tools Appl

to execute a task. In Fig. 6a, this is indicated by the white marker (S). The core’sexecution counter is increased by the task’s duration. The black marker (E) indicatesthe end of a task. At this point, the core has finished the task execution and writtenits results to the output buffer FIFO 1. At this stage, the PAS does not know whenthe data is removed by another task and marks the state of FIFO 1 as occupied.

In Fig. 6b, Core 2 reads the macroblock data from the FIFO and frees theoccupied memory in FIFO 1. During the decoding of a macroblock, functionaldependencies between the decoding tasks and data dependencies between theindividual macroblocks exist. We have presented macroblock dependencies for theH.264 codec in Section 2.3. These dependencies are determined by the algorithmdefinition and change according to a macroblock’s specific coding (e.g. intra-prediction poses dependencies on the spatially neighbouring macroblocks). Thesedependencies obviously change the output of our simulator. We use the macroblocks’coding information for adapting the dependencies of a macroblock according to itscoding. In Fig. 6b, a read stall due to data-dependencies is illustrated. Core 2 cannotstart its decoding operations simultaneously with Core 1, but has to wait until therequired data becomes available in the input buffer. The PAS uses the algorithmdescription for detecting this read stall. The start of the task execution is delayedautomatically.

In Fig. 6c, the second macroblock is executed by Core 1 and written to FIFO 1.After Core 2 has finished its decoding operations for Macroblock 1, it reads Mac-roblock 2 and executes it (Fig. 6d).

Until now we have only considered the impact of computational complexity onour multi-core decoding system. Additionally, the PAS provides flexible interfaces todescribe buffer size constraints. For each macroblock task, limitations in the buffercommunication (e.g., write stalls due to insufficient buffer sizes) can be considered.The PAS maintains a set of user-defined rules on how the runtime behaviour of thedecoder is influenced if buffer scarcity occurs. For example, the parsing task requiresa certain amount of free memory for writing its results into an output buffer. Weextract this implementation-dependent information from the macroblocks’ codinginformation. We specify how much memory is required and how the decoder’sbehaviour is influenced by accessing these resources (e.g., “What happens in thecase of insufficient buffer?” or “How does the memory access latency influence thedecoder’s runtime?”). The PAS provides a language for fast specification of bufferand memory resources.

Figure 6e visualizes a case where insufficient buffer is available and a writestall occurs. Since FIFO 1 has a maximum size of 1, Core 1 cannot write theresults of decoding Macroblock 3 immediately. It has to wait until Core 2 has readMacroblock 2 and freed the occupied memory in FIFO 1. A write stall occurs in thiscase. In the following, Macroblock 2 is finished and Core 2 reads Macroblock 3 fromFIFO 1 and decodes it. After Macroblock 3 could be written to FIFO 1 (Fig. 6f),Core 1 continues with the next macroblock.

4 Evaluated methods

The approaches considered in this evaluation represent the main groups of data-parallel approaches addressed in the literature. All of them work under the re-

Multimed Tools Appl

5678

1 1 1 1 1 1 1 12 2 2 2 2 2 2 21 1 1 1 1 1 1 12 2 2 2 2 2 2 212

1 1 1 1 1 1 12 2 2 2 2 2 2

1 1 1 1 1 1 1 12 2 2 2 2 2 2 2

1 1 1 1 1 1 1 12 2 2 2 2 2 2 23 3 3 3 3 3 3 34 4 4 4 4 4 4 41 1 1 1 1 1 1 12 2 2 2 2 2 2 23 3 3 3 3 3 3 34 4 4 4 4 4 4 4

1 1 1 1 1 1 1 12 2 2 2 2 2 2 23 3 3 3 3 3 3 34 4 4 4 4 4 4 4

5 5 5 5 5 5 56 6 6 6 6 6 67 7 7 7 7 7 78 8 8 8 8 8 8

(2 Cores) (4 Cores) (8 Cores)

Fig. 7 The Single-row splitting approach. The assignment of processors to macroblocks is shown

strictions of typical embedded architectures. These restrictions include low systemmemory and buffer sizes on the hardware side as well as thread-less and application-dedicated runtime environments on the OS side.

Our evaluation study compares the performance of six different approaches foraccomplishing data-parallel splitting of the decoder’s reconstructor module. Theseare (1) a Single-row approach, (2) a Multi-column approach, (3, 4) a blocking and anon-blocking version of a Slice-parallel method, (5) a dynamic version of the non-blocking Slice-parallel approach as well as (6) a Diagonal technique. Details on thebenchmarked approaches are provided in the following.

4.1 Single-row approach

The first splitting strategy investigated in our study distributes horizontal lines ofmacroblocks among different processors as shown in Fig. 7. More formally, let Nbe the number of processors. Processor i ∈ {0, · · · , N − 1} is then responsible fordecoding the yth row of macroblocks if y mod N = i.

To illustrate the Single-row (SR) approach, we give an example with two proces-sors on an image divided into 8 × 8 macroblocks in Fig. 8. In this example as well asin the following ones of Figs. 11, 13, 14 and 16, we assume that it takes a constantvalue of 1 unit of time to process each macroblock. It is, however, important tonotice that this is a coarse oversimplification. In real video streams, there are largevariations in the processing times of individual macroblocks, which makes it difficult

12121212

1 1 1 1 12

1 12

1 1 1 1 1 12 2 2 2 2

1 12

1 1 1 1 1 12 2 2 2 2

1 12 2

1 1 1 1 1 1 12 2 2 2 2 2 21 1 1 1 1 1 12 2 2 2 2 2 21 1 1 1 1 1 12 2 2 2 2 2 21 1 1 1 1 1 12 2 2 2 2 2 2

(t = 2) (t = 3) (t = 8) (t = 10) (t = 34)

Fig. 8 Example of the Single-row splitting approach used with two cores. Processed macroblocksare shown at different instances of time t. It takes a constant value of 1 unit of time to process amacroblock

Multimed Tools Appl

to evaluate the goodness of a parallelization approach. In Fig. 8, only processor1 is able to decode macroblocks at time t = 2, since all other macroblocks areblocked as a consequence of the dependencies illustrated in Fig. 2. After the firsttwo macroblocks of row 1 have been computed, the second core can start processingthe first macroblock of the second row, since the dependencies for this macroblockare now resolved (t = 3). The next interesting event occurs at t = 8, when Core 1has finished the computation of the first row. Macroblocks of the second row havealready been computed and therefore Core 1 can start decoding macroblocks of row3 that are dependent on their upper neighbours. At time t = 10 we obtain the samesituation as at t = 2 where the first core unlocks the second one. Finally, the wholeframe has been decoded at t = 34.

The advantage of the Single-row approach lies in its simplicity. It is very easy tosplit the frame among the individual processors. There is only a small start delay afterwhich all processors can effectively work. The potential downside of this approachis that there are many dependencies that need to be resolved across processorassignment borders. This has played no role in our example where we have assumedconstant processing time for each macroblock. However, this problem will be no-ticeable for real videos streams that contain macroblocks of considerably differentruntime characteristics. In fact, each macroblock processed by Core i depends on itsupper neighbours that are processed by a different Core i − 1. If processor i − 1 failsto deliver these macroblocks at the right time, this will immediately produce stalls atCore i. This behaviour is shown in Fig. 9a. On the other hand, this strong coupling ofCPUs can potentially lead to low buffer requirements.

4.2 Multi-column approach

The second splitting method of our study divides the frame into equally-sizedcolumns as shown in Fig. 10. Let w denote the width of a multi-column that istypically derived by dividing the number of macroblocks in a row by the numberof processors. Processor i is then responsible for decoding a macroblock of the xthcolumn if iw ≤ x < (i + 1)w. A similar method to partition the image has recentlybeen proposed for the H.264 encoder in [24].

Figure 11 simulates the Multi-column (MC) approach using two processors. Core 1thereby starts processing the first row of macroblocks until it hits the border to

Time

Core 1

Core 2

Time

Core 1

Core 2

(a) (b)

Fig. 9 The number of inter-processor dependencies is crucial for the overall performance of themulti-core system. Rectangles represent macroblocks. A macroblock’s width indicates the requiredprocessing time. Arrows between two macroblocks mean that processing of the macroblock whichthe arrow points to can only start after the other macroblock has been decoded. a A large numberof inter-processor dependencies slows down the system. b Due to the low amount of inter-processordependencies, different running times of individual macroblocks become averaged out. This shouldimprove the overall performance

Multimed Tools Appl

3 4 5 6 7 81 23 4 5 6 7 81 23 4 5 6 7 81 23 4 5 6 7 81 23 4 5 6 7 81 23 4 5 6 7 81 23 4 5 6 7 81 23 4 5 6 7 81 2

11111111

11111111

22222222

22222222

33333333

33333333

44444444

44444444

11111111

11111111

11111111

11111111

22222222

22222222

22222222

22222222


Fig. 10 The multi-column splitting approach

the macroblocks assigned to processor 2 (t = 4). Since the dependency for the left-most macroblock of Core 2 is now resolved, processor 2 can finish decoding its firstmacroblock at t = 5. We obtain a similar situation at t = 8. The dependencies of theleftmost macroblock of the second row have been resolved, and Core 2 can thereforecontinue its work. Decoding of the frame is finally completed at t = 36.

The basic idea behind using the Multi-column approach is to obtain a loosercoupling of processor dependencies. In fact, the processor assignment borders aresignificantly reduced in comparison to the Single-row approach. One processor hasto wait for the results of another one only at the boundary of its multi-column.Within the multi-column, macroblock dependencies can be resolved on the sameprocessor. This should lead to reduced inter-processor dependencies and couldtherefore improve the overall runtime behaviour of the multi-core system as isdepicted in Fig. 9b.

4.3 Slice-parallel approach

As a third splitting method, we investigate a 90-degree rotated version of theMulti-column approach that divides the frame into even-sized rows. This methodis depicted in Fig. 12. Formally spoken, let h denote the height of a multi-row. Amacroblock of the yth row is then assigned to Core i if ih ≤ y < (i + 1)h.

The runtime behaviour of the Slice-parallel (SP) approach is illustrated in Fig. 13.Here, Core 2 has to wait for a relatively long time (t = 26) until the dependencies for

11111111

11111111

11111111

11111111

22222222

22222222

22222222

22222222

1 1 1 1 1 1 1 11

2 1 1 1 11

2 2 2 21 1 1

(t = 4) (t = 5) (t = 8) t = 36)(

Fig. 11 Example of the multi-column splitting approach

Multimed Tools Appl

1 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 12 2 2 2 2 2 2 22 2 2 2 2 2 2 22 2 2 2 2 2 2 22 2 2 2 2 2 2 2

1 1 1 1 1 1 1 11 1 1 1 1 1 1 12 2 2 2 2 2 2 22 2 2 2 2 2 2 23 3 3 3 3 3 3 33 3 3 3 3 3 3 34 4 4 4 4 4 4 44 4 4 4 4 4 4 4

5678

1 1 1 1 1 1 1 12 2 2 2 2 2 2 23 3 3 3 3 3 3 34 4 4 4 4 4 4 4

5 5 5 5 5 5 56 6 6 6 6 6 67 7 7 7 7 7 78 8 8 8 8 8 8


Fig. 12 The slice-parallel splitting approach

its first assigned macroblock are resolved. While the first processor can complete itswork on the current frame at t = 32, it still takes 26 units of time until the second corefinishes processing the remainder of the frame at t = 58. In the following, we refer tothis approach as the blocking Slice-parallel technique.

The incorporation of the Slice-parallel approach into our benchmark is alsomotivated by recent work [14] that has presented a non-blocking encoder versionof this method. The authors encode their video streams so that slice borders coincidewith horizontal lines in the frames. Since neither dependencies introduced by intra-prediction nor dependencies introduced by the deblocking filter occur across sliceborders, the multi-rows can be processed completely independent from each other.

Obviously, this non-blocking SP approach (NBSP) requires having full controlover the encoder, which will not be the case for many applications. This downsideshould be considered when interpreting the results for this method in our benchmark.For completeness, we also give an example of this approach in Fig. 14. Here, bothcores can start processing their assigned macroblocks immediately (t = 1) and finishthe decoding of the complete frame at t = 32.

4.4 Rotating slice-parallel approach

Content-dependent differences in the decoding complexity of the slices will resultin an unequal workload of each core [23]. For example, static and low textured

1 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1

1 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1

1 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 12 2 2 2 2 2 2 22 2 2 2 2 2 2 22 2 2 2 2 2 2 22 2 2 2 2 2 2 2

1 1 1 1 1 12 2 2 2 2 2

(t = 26) (t = 32) t = 58)(

Fig. 13 Example of the Slice-parallel approach in the blocking version

Multimed Tools Appl

Fig. 14 Example of theSlice-parallel approach in thenon-blocking version

1

2

1 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 12 2 2 2 2 2 2 22 2 2 2 2 2 2 22 2 2 2 2 2 2 22 2 2 2 2 2 2 2

(t = 1) (t = 32)

frame regions will use coarse macroblock partitions and few residual information.This requires less processing power than for moving and strongly textured regions.If we assign a slice with low complexity constantly to the same core, a drift betweenthis core and the cores working on more complex slices will occur on the long termin the non-blocking NBSP approach. This will result in low system usage and highbuffer size requirements for compensating the differences in the workload.

For investigating the impact of the sequence content on the NBSP approach’sruntime behaviour, we introduce a modified version of this approach (NBSPrapproach). It supports a rotating core assignment depending on the frame number.For example, in a two-core splitting, the first processor would process the upperimage part in the first frame, the bottom part in the second frame, then again theupper part and so on. In contrast to the NBSP approach with static assignment, onaverage, this rotating approach should reach a more balanced system workload.

4.5 Diagonal approach

As a final approach, we include the splitting method illustrated in Fig. 15 intoour evaluation. This processor assignment is obtained by dividing the first line ofmacroblocks into equally-sized columns. The assignments for the subsequent linesare then derived by left-shifting the macroblock assignments of the line above. Thisprocedure leads to diagonal patterns.

11112222

2222

22

2 22

2 22

22 22 22 2 2

22

111 1

11

3

3 333

333

33

33

3

333

1 1 1 11 1 1 1

1 1 1 11 1 1 1

1 1 11 1

1

2 2 2222

11 1

1 11 1

1

1

11 1

111

1

22

22

22 2

2 22 2

222

22

4 44 4

4 44 4

4 44 4

4 44 4

3 4 5 6 7 81 23 4 5 6 7 82

3 4 5 6 7 84 5 6 7 85 6 7 86 7 87 88 3 4 5 6 71 2

3 4 5 61 23 4 51 2

3 41 231 2

1 21


Fig. 15 The diagonal splitting approach

Multimed Tools Appl

1 1 1 1 1 1 1 1

11 11 1 1

2 2 2 22 2

1 1 1 1

11 11 1 1

2 2 2 22 2 2 2

2 2 2 21 1 1 11

22 2 2 2

2 2 2 21 1 11 1 1

11 1

2

2 2 2 21 1 1 11

22 2 2 2

2 2 2 21 1 11 1 1

11 1

2

(t = 4) (t = 10) (t = 12) (t = 13) (t = 16)

2 2 2 21 1 1 1

1 1 1 11 1 1 1

1 1 1 11 1 1 1

1 1 11 1

1

2 2 2222

12222

222

22

22 2 2 2

2 2 2 22 2 2 2

1 11 1 11 1 1

2 2 2 21 1 1 1

1 1 11 1

1

12 2 2 2

2 2 2 22 2 2 2

2 2 2 21 11 1 11 1 1

2 2 2 21 1 1 1

1 1 11 1

1

22 2 2 2

2 2 2 22 2 2 21 1 11 1 1

12 2

1 1

2 2 2 21 1 1 1

1 11

2 2 2 22 2 2 2

2 2 2 21 1 11 1 1

11 1

2 2 2 21 1 1 1

1 11

2 22 2 2 2

2 2 2 21 1 11 1 1

11 1

(t = 18) (t = 20) (t = 23) (t = 24) (t = 43)

Fig. 16 Example of the diagonal approach

Figure 16 gives an example of the diagonal (DG) approach using two processors.Here, the second core stalls until its dependencies become resolved by Core 1 att = 4. The first core completes computation of its first image partition at t = 10.Unfortunately, it cannot directly start processing the second partition, but has towait for Core 2 to resolve dependencies until t = 12. The following images (t ={13, 16, 18, 20, 23, 24}) show situations where the first core has to wait for mac-roblocks decoded by processor 2. For legibility, we do not show subsequent stateswhere one processor blocks the other one, but directly proceed to the final resultderived at t = 43.

Inclusion of the Diagonal approach into our evaluation is motivated by themacroblock partitioning approaches proposed in [25]. The Diagonal splitting methodis regarded as an approach that “respects” the dependency patterns spanned by theintra-prediction and the deblocking filter. (Dependencies are shown in Fig. 2). We il-lustrate the idea behind the Diagonal splitting method in Fig. 17. The figure comparesthe inter-processor dependencies introduced by Diagonal and Multi-column splittingtechniques. The Diagonal method thereby only shows dependencies on macroblocksfrom its left neighbouring processor, which is in contrast to the Multi-column method

1 1 2 2

1 1

1

2 22

2 2

3

3

3

3

3

1 1 1 2 2 2

1 1

1

2 22

2 22

3

3

3

3

33

2

2

3

1

(a) (b)

Fig. 17 Processor dependencies in the diagonal and multi-column approaches. a In the Diagonalmethod, dependencies for Core 2 originate solely from macroblocks assigned to Core 1. Core 2therefore never has to wait for Core 3. b In the Multi-column approach, macroblocks assigned toCore 2 are also dependent on processor 3, as indicated by the dashed arrow

Multimed Tools Appl

(AVERAGE) (Bus) (Shields) (Stockholm) (Parkrun)

0

10

20

30

40

Fra

mes

per

sec

ond

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

0

10

20

30

40NBSPrMCNBSPSRDGSP

0

10

20

30

40NBSPrMCNBSPSRDGSP

0

10

20

30

40NBSPrMCNBSPSRDGSP

0

10

20

30

40NBSPrMCNBSPSRDGSP

Fig. 18 Frames per second achieved for a system running at 800 MHz. The achieved frame ratefor each parallelization approach on average is shown. We considered 1, 2, 4 and 8 cores. Theparallelization methods are: the Single-row method (SR), the Multi-column method (MC), the non-blocking versions of the Slice-parallel method (NBSP/NBSPr), the Diagonal method (DG) and theblocking version of the Slice-parallel method (SP). Additionally, the f igure provides detailed runtimeresults for each of the four 720p sequences

that contains dependencies on macroblocks of both neighbouring cores. Thesereduced inter-processor dependencies could lead to an improved runtime behaviourof the multi-core system.

5 Results

5.1 Run-time complexity

Two major indicators for the efficiency of a multi-core decoding system are thedecoder’s runtime and the number of data-dependency stalls occurring during thedecoding process. A low runtime indicates a high system decoding performance,while the number of stalls provides an estimate on how efficiently the system’scomputational resources are used.

Figure 18 provides runtime estimates for the decoder performance in framesper second for a system running at 800 MHz. Additionally, Fig. 19 shows thepercentage of stall cycles occurring during the overall decoding process for the sixinvestigated approaches. We provide the average core usage of the decoders usingdifferent approaches and core counts in Fig. 20. Additionally, Table 2 providesdetailed performance numbers and shows the average performance results over all


0

20

40

60

80

Sta

lls [%

]

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

0

20

40

60

80

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

0

20

40

60

80

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

0

20

40

60

80

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

0

20

40

60

80

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

Fig. 19 Stall cycles caused by the data-dependencies between the cores. We plot the percentageof stall cycles in relation to the overall cycles spent for decoding a sequence. The f igure shows theaverage number of stalls over all HD sequences and for the individual HD sequences considering 1,2, 4 and 8 cores

Multimed Tools Appl


50

100

150

200

Usa

ge [%

]

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

50

100

150

200

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

50

100

150

200

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

50

100

150

200

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

50

100

150

200

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

Fig. 20 Average core usage for four test sequences. For the parallelization approaches, the system’saverage usage is shown for 1, 2, 4 and 8 cores

HD sequences considering 1, 2, 4 and 8 cores. It ranks the approaches according totheir runtime performance and memory transfer rates.

In Fig. 18, it is observed that the SR method and MC method as well as the NBSPapproach perform considerably better than the DG and the blocking SP approaches.Despite the usage of depending partitions, the MC approaches even outperform theNBSP approach without data-dependencies between the cores.

We found that the constant core assignment to frame regions and the runtimedifferences between these regions result in an unbalanced workload distributionfor the NBSP approach. According to our results, approaches working on a finergranularity level and alternating the cores over the frame more frequently (e.g. theSR approach) adapt better to local complexity peaks.

For comparing the scalability of the parallel approaches for different resolutions,we have included five common CIF sequences in our evaluation. These are theBarcelona, the Bus, the Flowergarden, the Foreman and the Mobile sequenceswhich have been encoded using the same encoding parameters as for the four HDsequences. Figure 22 shows the average performance results in terms of runtime,stalls, core usage and memory transfer rates for these sequences. For CIF resolution,the NBSP and the SR show similar runtime performance. With increasing numberof cores the NBSP approach has the advantage of less dependencies between thesplitting partitions which results in a better runtime performance.

The impact of the rotating order in the NBSPr approach can be observed for allcore counts and sequences. However, architectures with high core counts tend tobenefit stronger from this rotating order than systems with few cores. A good rankingof the NBSPr approach for the CIF and the HD sequences indicates that rotating theslices increases the runtime performance independent of the resolution.

Evaluating only the encoder-independent approaches shows a similar scalabilityfor the SR and the MC approaches for HD decoding. For most of these sequencesand core counts, the MC approach outperforms the SR approach. For the SRapproach, the different complexity between adjacent lines accumulates and results ina faster increase of the stall cycles for increasing core counts. For decoding sequencesat CIF resolution, the SR approach seems to be more suitable. The simulations showthat the performance for the MC approach with 8 cores decreases more stronglycompared to the systems with fewer cores. At CIF resolution with a frame width of22 macroblocks, no partitioning into equally sized vertical partitions is possible formore than two cores. Additionally, the column width decreases with each additionalcore and results in a more frequent inter-communication between the cores.

Multimed Tools Appl

Tab

le2

Res

ults

ofou

rev

alua

tion

Alg

orit

hmA

vg.

Bus

Shie

lds

Stoc

khol

mP

arkr

unra

nk↓

FP

SS

UT

FP

SS

UT

FP

SS

UT

FP

SS

UT

1C

ore

NoS

plit

–5.

20.

010

0.0

39.8

4.5

0.0

100.

042

.04.

50.

010

0.0

41.4

4.1

0.0

100.

043

.12

Cor

esN

BSP

r1.

610

.03

3.6 3

96.3

338

.81

8.8 1

1.0 1

98.9

141

.11

9.0 2

0.9 2

99.1

240

.41

8.2 1

0.3 1

99.6

142

.11

MC

2.4

9.9 4

4.7 4

95.2

440

.03

8.8 2

1.2 2

98.7

242

.33

9.0 1

0.7 1

99.3

141

.63

8.1 2

0.9 2

99.1

243

.43

NB

SP2.

710

.12

2.1 2

97.8

238

.81

8.8 3

1.6 3

98.4

341

.11

8.8 4

3.1 4

96.9

440

.41

7.9 4

3.6 4

96.4

442

.11

SR3.

610

.21

1.7 1

98.2

158

.66

8.7 4

2.3 4

97.7

461

.06

8.9 3

1.8 3

98.2

360

.86

8.0 3

2.1 3

97.9

362

.16

DG

5.0

9.6 5

7.6 5

92.4

540

.75

8.5 5

4.5 5

95.4

542

.95

8.7 5

4.0 5

95.9

542

.35

7.9 5

3.9 5

96.0

544

.05

SP5.

56.

2 640

.46

59.5

640

.14

5.5 6

38.9

661

.06

42.4

45.

5 639

.26

60.8

641

.84

5.0 6

38.9

661

.16

43.5

4

4C

ores

NB

SPr

1.0

20.1

12.

7 197

.21

37.1

117

.81

0.5 1

99.5

139

.21

17.7

11.

9 198

.01

38.4

116

.31

0.9 1

99.1

140

.21

NB

SP2.

519

.82

4.2 2

95.8

237

.11

16.8

37.

3 394

.23

39.2

117

.53

3.3 3

96.6

338

.41

15.3

47.

0 492

.94

40.2

1

MC

2.6

19.2

47.

1 492

.84

40.5

317

.52

4.4 2

97.7

242

.83

17.7

12.

5 297

.52

42.1

316

.02

2.7 2

97.3

243

.83

SR4.

119

.63

5.4 3

94.6

358

.66

16.7

49.

5 493

.64

61.0

617

.14

5.6 4

94.4

460

.86

15.4

36.

1 393

.93

62.1

6

DG

5.0

17.6

515

.25

84.8

541

.65

15.9

520

.55

89.1

543

.95

16.3

59.

7 590

.35

43.2

514

.85

10.0

589

.95

44.9

5

SP5.

57.

1 665

.66

34.3

640

.94

6.3 6

79.6

635

.06

43.2

46.

4 664

.86

35.2

642

.54

5.7 6

65.3

634

.76

44.3

4

8C

ores

NB

SPr

1.0

39.8

13.

9 196

.11

33.5

134

.91

2.3 1

97.7

135

.51

35.2

12.

6 197

.31

34.8

132

.21

1.9 1

98.0

136

.41

MC

2.4

37.3

39.

7 390

.23

41.4

334

.22

4.4 2

95.6

243

.73

34.5

24.

7 295

.32

43.1

331

.42

4.5 2

95.5

244

.83

NB

SP3.

136

.54

11.9

488

.14

33.5

133

.13

7.3 3

92.7

335

.51

32.0

411

.64

88.4

434

.81

30.0

48.

8 491

.24

36.4

1

SR3.

838

.42

7.3 2

92.7

258

.66

32.3

49.

5 490

.54

61.0

633

.33

8.0 3

91.9

360

.86

30.2

38.

0 392

.03

62.1

6

DG

5.0

31.3

524

.35

75.7

543

.45

28.4

520

.55

79.5

545

.75

29.2

519

.35

80.7

545

.15

26.4

519

.75

80.3

546

.75

SP5.

58.

3 679

.96

20.1

642

.44

7.3 6

79.6

620

.46

44.7

47.

4 679

.56

10.5

644

.14

6.6 6

79.9

620

.16

45.8

4

We

eval

uate

four

perf

orm

ance

met

rics

for

each

test

sequ

ence

and

deco

der

split

ting

appr

oach

.The

sear

eth

enu

mbe

rof

fram

esde

code

din

1s

(FP

S),t

hepe

rcen

tage

ofst

allc

ycle

sin

rela

tion

toth

eov

eral

lcyc

leco

unt

(S),

the

aver

age

core

usag

e(U

)gi

ven

inpe

rcen

tan

dth

eda

tatr

ansf

er(T

)m

easu

red

inM

B/s

.The

tabl

e’s

valu

esar

evi

sual

ized

inF

igs.

18–2

1.T

hesu

bscr

ipts

deno

tera

nks

com

pute

dby

com

pari

son

agai

nst

the

com

peti

ngsp

litti

ngap

proa

ches

.Ran

k1

mea

nsth

atth

eap

proa

chis

the

best

-per

form

ing

one

acco

rdin

gto

the

perf

orm

ance

met

ric.

The

colu

mn

Avg

.ran

kis

com

pute

das

the

aver

age

ofal

lsub

scri

pts

wit

hin

aro

w.T

ode

rive

anov

eral

lpe

rfor

man

cera

nkin

gof

split

ting

appr

oach

es,w

eso

rtth

eta

ble

acco

rdin

gto

Avg

.ran

k

Multimed Tools Appl


20

40

60

80

100

Mem

ory

tran

sfer

s [M

B/s

]

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

20

40

60

80

100

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

20

40

60

80

100

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

20

40

60

80

100

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

20

40

60

80

100

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

Fig. 21 Data transfer volume for intra and inter reference data and deblocking information, if onemacroblock line can be cached in local core memory

The diagonal (DG) approach shows relatively poor performance, especially forhigh core counts and small resolutions. For each additional core, the DG approachcreates two additional partitions with dependencies on other cores’ partitions.Adding a new core to the system raises the number of dependencies faster than forthe other approaches and results in a higher number of stalls. For CIF resolutionand high core counts a stronger decrease in performance than for HD resolution isobserved.

For the SP approach, the coarse horizontal splitting and the dependencies betweeneach partition and its upper partitions result in high stall counts and the worst runtimeof all approaches. With increasing resolutions, each core has to wait longer until theprevious partition is finished. Hence, the efficiency of the SP approach decreaseswith increasing resolution.

5.2 Inter-communication

Memory transfers to and from the external DRAM and between the cores’ localmemories are expensive in terms of power consumption and transfer times. In avideo decoding system, these transfers are strongly influenced by the core inter-communication and the loading of reference data and deblocking pixels. Dependingon the size of the cores’ local memories and how efficiently the parallelizationapproach exploits locality, macroblock information can be kept on a core and reusedfor the decoding of future macroblocks. In our work, we assume that exactly one

(Average frame rate) (Average stalls) (Average usage) (Average transfer)

0

200

400

600

800

1000

1200

Fra

mes

per

sec

ond

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

0

10

20

30

40

50

Sta

lls [%

]

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

50

100

150

200

Usa

ge [%

]

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

0

2

4

6

8

10

Mem

ory

tran

sfer

s [M

B/s

]

1 Cor

e

2 Cor

es

4 Cor

es

8 Cor

es

NBSPrMCNBSPSRDGSP

(a) (b) (c) (d)

Fig. 22 Average frame rate, stall cycles and usage for a system running at 800 MHz and decodingsequences with CIF resolution. The sequences included in these calculations are Barcelona, Bus,Foreman, Flowergarden and Mobile. We calculated (a) the average framerate, (b) the percentage ofstall cycles, (c) the average usage and (d) the memory transfer rate for 1, 2, 4 and 8 cores

Multimed Tools Appl

macroblock line of intra-prediction and deblocking data can be kept in the corememory of each core. The resulting transfer rates for HD and CIF resolution videosare shown in Figs. 21 and 22d, respectively.

A strong difference between the approaches can be observed for the HD as wellas for the CIF sequences. The SR approach only exploits data locality between hor-izontally neighbouring macroblocks. The fine horizontal splitting into macroblocklines results in the highest transfer rate of all approaches and represents the worstcase for horizontal splitting approaches in terms of memory transfers. Note that fortwo or more cores the transfer rate of the SR approach is constant. The alternatingprocessing order of the macroblock lines requires the transfer of vertical referencedata and deblocking information between the cores.

For the MC approach, a smaller amount of data has to be exchanged between thecores and efficient data reuse is possible. This results in the lowest transfer rate of allencoder-independent approaches and even outperforms the blocking SP approachat HD as well as CIF resolution. Increasing the core counts only results in a slighttransfer increase for the MC and the SP approaches. For the NBSP and the NBSPrapproaches, the data communication even decreases for high core counts, since thenumber of independent frame partitions increases. However, this results in blockingartefacts at slice borders and a less efficient prediction.

The transfer rates for the DG approach are slightly worse than for the MC and theSP approaches for all core counts. These results can be observed for HD as well asCIF resolutions. The two additional partitions for each additional core result in newdata dependencies to neighbouring partitions and a stronger increase of the datatransfers when raising the number of cores.

6 Conclusions

In this study, we have evaluated six data-parallel approaches for decoding H.264video streams in environments with strong resource restrictions. We compare theapproaches against each other and investigate the impact of raising the core counton the decoder’s runtime, core usage and transfer rates. For accomplishing thisevaluation, we have developed a high-level simulator. It is capable of estimating theruntime behaviour of a virtual multi-core decoder based on single-core profilings.

The runtime of each parallelization approach is strongly influenced by the framepartitions’ sizes and shapes. Large and dependency-minimizing partitions causeless inter-communication between the cores and a lower runtime. Additionally, astatic assignment of each core to constant frame regions makes an approach highlysensitive to local complexity peaks and can result in an unequal workload balancing.Assigning each core to multiple and spatially separated frame regions (e.g. the SRapproach) or rotating the assignment order between frames can compensate thesecomplexity variations more effectively. A rotating assignment has the advantage thatit can reduce dependencies by maintaining large partitions and process diverse frameregions on each core.

For approaches working on larger partitions, it is generally easier to exploit local-ity. This reduces transfer rates, since more spatially close neighbouring macroblocksare processed on the same core and efficient data reuse is possible. Reusing thedata from horizontal neighbouring macroblocks is easily possible by keeping a single

Multimed Tools Appl

macroblock’s prediction and deblocking information in the local cache. Verticalcaching requires that the partitions’ width is smaller than the number of locallystorable macroblocks. For approaches based on horizontal splitting, this results inthe highest requirements on local memories.

In future work, we are planning to investigate the efficiency of parallel codingapproaches for different H.264 coding tools and GOP structures. Other videostandards such as VC-1 and AVS shall be included in further evaluations.

Acknowledgements This work has been supported by the Austrian Federal Ministry of Transport,Innovation, and Technology under the FIT-IT project VENDOR (Project nr. 812429). MichaelBleyer would like to acknowledge the Austrian Science Fund (FWF) for financial support underproject P19797.

References

1. Ball T, Larus JR (1994) Optimally profiling and tracing programs. ACM Trans Program LangSyst 16(4):1319–1360

2. Cesario WO, Lyonnard D, Nicolescu G, Paviot Y, Yoo S, Jerraya AA, Gauthier L, Diaz-NavaM (2002) Multiprocessor SoC platforms: a component-based design approach. IEEE Des TestComput 19(6):52–63

3. Chen TW, Huang YW, Chen TC, Chen YH, Tsai CY, Chen LG (2005) Architecture design ofH.264/AVC decoder with hybrid task pipelining for high definition videos. In: Proc. of the IEEEint. symposium on circuits and systems, pp 2931–2934

4. Chen YK, Tian X, Ge S, Girkar M (2004) Towards efficient multi-level threading of H.264encoder on Intel hyper-threading architectures. In: Proc. of the 18th int. parallel and distributedprocessing symposium, vol 1, pp 63–72

5. Cmelik B, Keppel D (1994) Shade: a fast instruction-set simulator for execution profiling. In:Proc. of the ACM SIGMETRICS conference on measurement and modeling of computer sys-tems, pp 128–137

6. Faichney J, Gonzalez R (2001) Video coding for mobile handheld conferencing. J MultimedTools Appl 13(2):165–176

7. Graham SL, Kessler PB, McKusick MK (1982) gprof: a call graph execution profiler. In: Proc. ofthe SIGPLAN symposium on compiler construction, pp 120–126

8. Gulliver SR, Ghinea G, Patel M, Serif T (2007) A context-aware tour guide: user implications.Mobile Inform Syst 3(2):71–88

9. ITU-T, ISO/IEC (2005) Advanced video coding for generic audiovisual services (ITU Rec. H.264| ISO/IEC 14496-10). ITU-T and ISO/IEC

10. Jeon J, Kim H, Boo G, Song J, Lee E, Park H (2000) Real-time MPEG-2 video codec systemusing multiple digital signal processors. J Multimed Tools Appl 11(2):197–214

11. Knudsen PV, Madsen J (1996) Pace: A dynamic programming algorithm for hardware/softwarepartitioning. In: Proc. of the int. workshop on hardware–software co-design, pp 85–92

12. Malik S, Martonosi M, Li YTS (1997) Static timing analysis of embedded software. In: Proc. ofthe 34th ACM/IEEE design automation conference, pp 147–152

13. Meenderinck C, Azevedo A, Juurlink B, Alvarez M, Ramirez A (2008) Parallel scalability ofvideo decoders. J Signal Process Syst 56:173–194

14. Moriyoshi T, Miura S (2008) Real-time H.264 encoder with deblocking filter parallelization. In:IEEE int. conference on consumer electronics, pp 63–64

15. Nachtergaele L, Catthoor F, Kapoor B, Janssens S, Moolenaar D (1996) Low power storageexploration for H.263 video decoder. In: Proc. of the IX workshop on VLSI signal processing,pp 115–124. doi:10.1109/VLSISP.1996.558310

16. Paver N, Khan M, Aldrich B (2006) Optimizing mobile multimedia using SIMD techniques.J Multimed Tools Appl 28(2):221–238

17. Puschner PP, Koza C (1989) Calculating the maximum execution time of real-time programs.J Real-Time Syst 1(2):159–176

18. Ravasi M, Mattavelli M (2003) High-level algorithmic complexity evaluation for system design.J Systems Archit 48(13–15):403–427

http://dx.doi.org/10.1109/VLSISP.1996.558310

Multimed Tools Appl

19. Ravasi M, Mattavelli M (2005) High abstraction level complexity analysis and memory archi-tecture simulations for multimedia algorithms. IEEE Trans Circuits Syst Video Technol 15(5):673–684

20. Rodriguez A, Gonzalez A, Malumbres M (2006) Hierarchical parallelization of an H.264/AVCvideo encoder. In: Proc. of the int. symposium on parallel computing in electrical engineering,pp 363–368

21. Schöffmann K, Fauster M, Lampl O, Böszörményi L (2007) An evaluation of parallelizationconcepts for baseline-profile compliant H.264/AVC decoders. In: Proc. of the Euro-Par 2007,pp 782–791

22. Seitner F, Meser J, Schedelberger G, Wasserbauer A, Bleyer M, Gelautz M, Schutti M, SchreierR, Vaclavik P, Krottendorfer G, Truhlar G, Bauernfeind T, Beham P (2008) Design methodologyfor the SVENm multimedia engine. In: Proc. of the Austrochip 2008, poster presentation

23. Seitner FH, Schreier RM, Bleyer M, Gelautz M (2008) A high-level simulator for the H.264/AVCdecoding process in multi-core systems. In: Proc. of the SPIE, multimedia on mobile devices,vol 6821, pp 5–16

24. Sun S, Wang D, Chen S (2007) A highly efficient parallel algorithm for H.264 encoder based onmacro-block region partition. In: Proc. of the 3rd int. conference on high performance computingand communications, pp 577–585

25. van der Tol EB, Jaspers EG, Gelderblom RH (2003) Mapping of H.264 decoding on a multi-processor architecture. In: Proc. of the SPIE, vol 5022, pp 707–718

26. Wang SH, Peng WH, He Y, Lin GY, Lin CY, Chang SC, Wang CN, Chiang P (2003) A platform-based MPEG-4 advanced video coding (AVC) decoder with block-level pipelining. In: Proc. ofthe 2003 joint conference of the 4th int. conference on information, communications and signalprocessing and the 4th Pacific rim conference on multimedia, vol 1, pp 51–55

27. Witchel E, Rosenblum M (1996) Embra: fast and flexible machine simulation. In: Proc. ofthe ACM SIGMETRICS int. conference on measurement and modeling of computer systems,pp 68–79

28. Zhao Z, Liang P (2006) A highly efficient parallel algorithm for H.264 video encoder. In: Proc.of the 31st IEEE int. conference on acoustics, speech, and signal processing, vol 5, pp 489–492

Florian H. Seitner received his diploma degree (M.S.) in Computer Science from the ViennaUniversity of Technology in 2006. He is currently employed as a project assistant at the ViennaUniversity of Technology where he works on his Ph.D. degree. His scientific interests include videocoding, parallel software design and video analysis.

Multimed Tools Appl

Michael Bleyer received his M.S. and Ph.D. degrees in Computer Science from the ViennaUniversity of Technology in 2002 and 2006, respectively. He is currently employed as a post-docresearcher at the Vienna University of Technology. His research interests include stereo matching,optical flow computation, combinatorial optimization and video coding.

Margrit Gelautz received her PhD degree in Computer Science from Graz University of Technology,Austria, in 1997. She worked on optical and radar image processing for remote sensing applicationsduring a postdoctoral stay at Stanford University. In her current position as an associate professor atVienna University of Technology, her research focuses on image and video analysis for multimediaapplications. She is Women in Engineering (WIE) officer of IEEE Austria Section.

Multimed Tools Appl

Ralf M. Beuschel was born in Friedrichhafen, Germany in 1974. He received the diploma degree(M.Eng.) in electrical engineering (systems engineering) from the University of Ulm in 2000 witha best thesis award donated by the VDE/VDI. From 2000 to 2006 he was a research fellow at theMicroelectronics Department at the University of Ulm doing research in the field of low-delayvideo coding algorithms and real-time DSP implementations. Afterwards he was with the Institute ofSoftware Technology and Interactive Systems of the TU Vienna for two years. In early 2009 he hiredat DSP-Weuffen in Amtzell, Germany in the position of a Senior System Designer for DSP videoapplications. He is member of IEEE and the German TV and Cinema Technology Society (FKTG).

evaluation of data-parallel h.264 decoding approaches for ... · the traditional profiling methods....

Documents