
Saarland University
Telecommunications Lab
Intel Visual Computing Institute

Optimized implementation of an MVC decoder

Master's Thesis in Computer and Communication Technology

Author: Jochen Britz

Matriculation number: 2513447

Course of studies: Computer- und Kommunikationstechnik

Supervisor: Prof. Dr.-Ing. Thorsten Herfet

Advisor: M.Sc. Goran Petrovic

Reviewers: Prof. Dr.-Ing. Thorsten Herfet

Prof. Dr.-Ing. Philipp Slusallek

April 30, 2013


Eidesstattliche Erklärung

Ich erkläre hiermit an Eides statt, dass ich die vorliegende Arbeit selbstständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet habe.

Statement in Lieu of an Oath

I hereby confirm that I have written this thesis on my own and that I have not used any other media or materials than the ones referred to in this thesis.

Saarbrücken, den 30. April 2013 (Jochen Britz)

Einverständniserklärung

Ich bin damit einverstanden, dass meine (bestandene) Arbeit in beiden Versionen in die Bibliothek der Informatik aufgenommen und damit veröffentlicht wird.

Declaration of Consent

I agree to make both versions of my thesis (with a passing grade) accessible to the public by having them added to the library of the Computer Science Department.

Saarbrücken, den 30. April 2013 (Jochen Britz)


Acknowledgments

First and foremost, I would like to thank Prof. Dr.-Ing. Thorsten Herfet for giving me the opportunity to write this master's thesis at his chair, the Telecommunications Lab. I would like to express special thanks to M.Sc. Goran Petrovic, who helped me to get in touch with H.264 and FFmpeg, and for his professional advice while writing the thesis. In particular, I would like to thank the team of the Telecommunications Lab, who readily provided information at any time. Furthermore, I would like to thank all the people who supported me in writing this thesis, especially Tobias Lange, Oliver Dejon, my girlfriend Jennifer Schnell, and my family.


Abstract

3D video is becoming more popular in various applications. For a 3D video experience, at least two different views of the same scene are necessary. Despite serious interest in consumer industries and open-source communities, the main challenge in designing a real-time 3D communication system is that, today, several mandatory components required for such a system are generally not available. In particular, to the best of our knowledge, no MVC-compatible software decoder currently achieves real-time performance. In our work, we address the challenge of implementing an open-source decoder for multi-view video (MVV) representations.

We focus on the design and implementation of a real-time decoder based on the FFmpeg framework that is compliant with H.264/AVC Annex H (referred to as MVC). As such, we address and implement the components missing from the H.264 implementation of FFmpeg according to MVC. Namely, we first extend the parsing routines to be able to handle MVC-compliant bitstreams by implementing support for new NAL unit types and parameter sets. In addition, we implement new buffers and enhanced structs to store MVC-dependent data, such as the structs for SPS and PPS and buffers for Subset SPS and inter-view reference lists. Second, we extend the decoding routines by extending the DPB and modifying the reference picture handling. For that reason, we modify the existing code and implement additional functions, as required by MVC. Additionally, we investigate multi-threading capabilities and optimize the implementation to be able to decode selected views only. Finally, we add configuration options by extending the command line interface with additional parameters. We test our implementation on a commodity desktop computer and achieve decoding times of 18 ms per frame on average (for a single frame in all 8 views). This means we achieve real-time performance for sequences with up to 50 frames per second.

In addition to the implementation, we perform experiments on different MVV sequences to optimize the coding of multi-view sequences in terms of quality-complexity trade-offs; we do this by varying the prediction schemes and quantization parameters as well as the scenes. Our finding is that coding performance depends on scene characteristics and prediction schemes. Finally, we analyze the impact of quantization on virtual view rendering and its relation to prediction schemes. We find that quantization has a clear impact on virtual view rendering, while the choice of prediction scheme does not.


Contents

1 Introduction
  1.1 3D and MVV communication systems
  1.2 Problem statement and thesis scope
  1.3 Related work
  1.4 Contributions
  1.5 Thesis outline

2 Background
  2.1 H.264/AVC
    2.1.1 Hybrid coding
    2.1.2 Block-based motion estimation and compensation
    2.1.3 Hierarchical B-pictures
    2.1.4 H.264 bitstream structure and syntax
  2.2 Multi-View Coding (MVC)
    2.2.1 H.264 - MVC Extension
    2.2.2 Stereo interleaving
    2.2.3 Inter-view prediction
    2.2.4 Optimized reference-block selection
    2.2.5 Camera arrangements and complexity of inter-view prediction

3 Multi-view FFmpeg decoding
  3.1 The H.264 implementation in FFmpeg
    3.1.1 Codec implementation in FFmpeg
    3.1.2 Interaction between the codec implementation and the FFmpeg framework
  3.2 Extending FFmpeg for multi-view decoding
    3.2.1 Challenges
    3.2.2 Modification of ff_h264_decoder and corresponding structs
    3.2.3 Interaction between the parser and decode_frame function
    3.2.4 Splitting the bitstream
    3.2.5 Support for new NAL unit types
    3.2.6 Extending parameter sets
    3.2.7 Mapping slice types in FFmpeg
    3.2.8 Modifying reference picture lists
    3.2.9 Allocation of Decoded Picture Buffer (DPB)
    3.2.10 Optimization and parallelization of the decoder
    3.2.11 Configuring decoding with command line parameters

4 Experimental analysis and optimization
  4.1 Quality metric
  4.2 Coding optimization
    4.2.1 Prediction scheme dependencies
    4.2.2 Scene dependencies
    4.2.3 Relation of prediction scheme and scene dependencies
  4.3 Impact on virtual view rendering
    4.3.1 Quantization impact on virtual view rendering
    4.3.2 Visual comparison of the rendering results
    4.3.3 Impact of prediction schemes on virtual view rendering
  4.4 Decoding speed with JMVC

5 Conclusion
  5.1 Open issues
  5.2 Future work

A Code snippets
  A.1 Structs
    A.1.1 The struct AVCodec
    A.1.2 The struct AVCodecParser
    A.1.3 The struct AVCodecContext
    A.1.4 The struct H264Context
  A.2 Functions
    A.2.1 The find_frame_end function

Bibliography

List of Figures

1.1.1 A 3D/MVV video communication system.
1.1.2 Scene image and its associated depth map.
2.1.1 Hybrid coding principle.
2.1.2 MB-level motion prediction in hybrid coding.
2.1.3 Residual image and motion vectors.
2.1.4 Alignment of hierarchical B-pictures.
2.1.5 H.264 bitstream structure.
2.2.1 MVC Extension of H.264 – supported tools and profiles.
2.2.2 Spatial interleaving for stereoscopic video.
2.2.3 Different prediction schemes for inter-view prediction.
2.2.4 MVC Decoder.
2.2.5 Spatial and temporal neighboring pictures of picture P.
2.2.6 Prediction schemes depending on different positions of the base view for a 2D camera array.
3.1.1 The structs AVCodec and AVCodecParser register the H.264 decoder and parser in FFmpeg.
3.1.2 Decoder workflow without threading.
3.1.3 Decoder workflow with threading.
3.2.1 Extended NAL unit header.
3.2.2 Dependencies between different H264Contexts.
3.2.3 Workflow of the find_frame_end function.
3.2.4 Parsing routine for MVC-specific NAL unit header fields.
3.2.5 Declaration of the command line parameters in FFmpeg.
4.1.1 Quality metric for spatial 3D-video quality comparisons.
4.2.1 Prediction schemes of the anchor views.
4.2.2 Comparison of different prediction schemes applied to the anchor pictures.
4.2.3 Comparison of different sequences with the same prediction scheme.
4.2.4 PSNR dependencies between sequences and prediction schemes.
4.2.5 Bit rate dependencies between sequences and prediction schemes.
4.3.1 Quality metric for virtual view quality comparisons.
4.3.2 PSNR of virtual views in comparison to decoded views.
4.3.3 Visual comparison of quantization results, “Tsukuba” camera 1.
4.3.4 Detailed visual comparison of quantization results, “Tsukuba” camera 1.
4.3.5 Comparison of different prediction schemes for virtual view rendering.
4.4.1 Comparison of decoding times for MVV sequences with JMVC reference decoder.


List of Tables

3.1 NAL unit types overview and their relation to MVC and FFmpeg.
3.2 Slice type mapping between H.264 and FF_H264.

Chapter 1

Introduction

3D video is becoming more popular in various applications. For a 3D video experience, at least two different views of the same scene are necessary. In the well-known stereoscopic systems, as used in cinemas and TV broadcasts, two views are sufficient to render the depth effect with the help of special-purpose glasses. Multi-view video (MVV) systems extend stereoscopic systems with the capability to present multiple different views of the scene, each corresponding to a different observer's viewing angle.

The application domains for 3D and multi-view video include entertainment, communications, medicine, security, visualization and education. In each of these domains, 3D and multi-view video bring specific advantages compared to conventional 2D video. In general, these advantages include a sense of immersion and realism of the presentation. For example, live streams of sport events can be enhanced with viewpoint selection, where each observer-selected viewpoint can be displayed in either 2D or 3D. In addition, future video-conferencing systems can be enhanced with viewpoint adaptation, where each conference participant receives a customized viewpoint of the others. Likewise, video-surveillance applications benefit from the availability of multiple views for scene analysis and understanding.

1.1 3D and MVV communication systems

Common to all these application domains is a need to efficiently represent, encode and transmit a multi-view scene representation to the receiver. The associated challenges are currently the subject of research in the field of 3D-video communication system design. Figure 1.1.1 shows a simplified view of a 3D-video communication system. Its main components are summarized as follows.

Acquisition of multiple views. MVV contains more than one view of the same scene, each corresponding to a different point of view of the scene (depicted as main and optional views in Figure 1.1.1 and commonly referred to as “textures”). To capture MVV, one has to capture the scene from different points of view at the same time.

Figure 1.1.1: A 3D/MVV video communication system [20].

For stereoscopic video, this can be done either with two standalone cameras or with special 3D camcorders equipped with two separate sensors. For MVV, capturing must be performed using a camera array, consisting of several cameras in some spatial arrangement. This arrangement is typically application-dependent.

Multi-view image rendering. Multiple viewpoints can either be images directly captured by one of the cameras or synthetically generated. Image rendering in the context of MVV refers to a set of algorithms that render synthetic views of the scene (“virtual views”) by blending a number of originally captured views. To ensure a seamless transition and a similar image quality across all views, virtual views are automatically constructed from a number of selected original camera views. These views are typically rendered at locations between and around the original viewpoints. The rendering algorithms either synthesize virtual views directly, or assume some form of scene description (e.g., a geometric model) in order to generate those views more efficiently. Such an approach is presented in [42]. The proposed algorithm is capable of virtual view rendering without the need for previously provided depth information; the depth is estimated while rendering. In [12], a real-time virtual view rendering implementation based on this algorithm is presented. We used this implementation in our evaluation, in Section 4.3.

Depth map. One commonly used geometric model is a depth map. Depth maps are grayscale images that code the distance from the camera to the objects in the scene, as illustrated in Figure 1.1.2. Typically, the dark regions code the scene objects that are further away from the camera than the brighter ones. The advantage of using depth maps as a geometric model in MVV rendering is that less computation is needed while synthesizing views and fewer neighboring views are typically needed to achieve the same rendering quality. The depth maps can either be estimated from the originally captured images or recorded directly using special sensors.


Figure 1.1.2: Scene image and its associated depth map.

In case of estimation, depth images are generated using well-known depth-from-stereo techniques from computer vision. Recently, special sensors that capture the depth information directly have become commercially available. An example of such a depth sensor is Microsoft's Kinect.
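The thesis does not specify how depth values are quantized to gray levels; a convention commonly used for MVV test material (an assumption here, not taken from this thesis, with $Z_{near}$ and $Z_{far}$ denoting the nearest and farthest scene depths) maps inverse depth linearly to 8 bits, so that near objects appear bright and far objects dark:

$$v(x, y) = \left\lfloor 255 \cdot \frac{1/Z(x, y) - 1/Z_{far}}{1/Z_{near} - 1/Z_{far}} + 0.5 \right\rfloor$$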

Encoding. The resulting MVV representation, consisting of multiple views and optional depth maps, needs to be encoded before the transmission. As the amount of data is significantly larger compared to conventional 2D video, the main challenge is the design of efficient encoding algorithms. In general, the state-of-the-art video coding standards MPEG-2, MPEG-4 and H.264/MPEG-4 AVC – although not specifically designed for the compression of MVV representations – are readily applicable. By treating each view as a conventional 2D video, a standard coding algorithm can be applied to each camera stream independently. The same holds for encoding the depth maps, which are represented as grayscale images. The newly developed Multi-view Video Coding (MVC) standard specifically targets efficient compression of MVV. It is largely based on the existing H.264/MPEG-4 AVC standard and extends it with encoding tools optimized for inter-view compression. As the MVC coding algorithm is jointly applied to a number of camera streams, in many scenarios it results in a more efficient compression than an independent H.264/MPEG-4 AVC coding of views. However, the complexity of the MVC encoder and decoder is typically higher than in the H.264/AVC standard.

To achieve real-time end-to-end performance, 3D communication systems need to be designed with an appropriate choice, implementation and integration of the individual system components. Despite the recent growth of commercial video-streaming services, real-world deployments of 3D streaming services are limited today. MVV communication services are deemed a future technology that has seen no deployment up to this date.


1.2 Problem statement and thesis scope

The challenge in designing a real-time 3D communication system is that several components needed to build such a system are generally not available. First, despite the research efforts in the area of image-based rendering, we currently lack real-time algorithms for high-quality view synthesis. Second, common streaming libraries do not support the simultaneous transmission of multiple streams (most of them are currently limited to video and audio only). Finally, common software frameworks lack support for the joint encoding and decoding of multiple video sequences.

Our work addresses the challenge of achieving real-time decoding of MVV representations. We assume that suitable components exist for real-time acquisition, rendering of views and depth maps, and encoding. With this scope limitation, our main challenge is that common software frameworks for video coding lack support for MVC and multi-stream decoding in general. In addition, to the best of our knowledge, no MVC-compatible software decoder currently achieves real-time performance, despite the significant interest in research and open-source communities.

We address this challenge as follows. We propose to implement a software MVV decoder on top of the software libraries available through the FFmpeg project, which implement a standard-compatible H.264/AVC decoder. In this thesis, we focus on the following lines of work.

First, we propose to implement the inter-view coding tools defined for MVC in the existing FFmpeg decoder implementation.

Second, we extend the parsing routines in the FFmpeg decoder to be able to handle bitstreams generated by a standards-compliant MVC encoder.

Third, we explore MVV coding configurations that achieve optimized trade-offs between the decoding delay and compression efficiency.

1.3 Related work

Stereoscopic and multi-view video coding has long been the focus of standardization efforts. The extensions of the MPEG-2 standard to support stereoscopic video coding were analyzed in [23]. The proposed techniques extended the block-based motion compensation in MPEG-2 to include disparity compensation. More advanced mesh-based coding tools were also considered during MPEG-2 standardization [38]. Overall, the coding gains achieved with these techniques were limited.

During the recent MVC standardization, it became clear that the efficiency and flexibility of H.264 can be exploited for multi-view video coding. This resulted in the inclusion of coding tools and prediction structures based on hierarchical B-frames from H.264 into the MVC standard, as described in later sections of this report. Several additional multi-view coding techniques were


under active research and development; however, these techniques did not find their way into the MVC standard. The main reason is that these techniques require changes to the macro-block syntax. At the time of the MVC specification, it was desired for backward compatibility that the macro-block layer stay unchanged [36]. These techniques are as follows.

Illumination compensation. Since the color characteristics of different cameras in an array may vary and the illumination also changes from different viewing angles, the compression efficiency may be reduced. Illumination compensation is a technique to compensate for such differences, as described in [13]. In the proposed method, the difference between the mean values of the current and the referenced block is computed and included as an offset at the motion compensation stage. This difference has to be transmitted and signaled to the decoder. Hence, this technique introduces a change at the macro block layer.

Motion skip mode. Motion skip mode is the extension of the direct prediction mode (specified in H.264/AVC) to the spatial domain. In the direct mode, the transmission of motion vectors is skipped and they are estimated from the motion vectors of the neighboring blocks. In this proposal, the direct prediction mode is extended to multiple views. This scheme requires that a global disparity vector for each neighboring view be transmitted to compensate for motion between views while estimating motion vectors. This has to be done at the macro block layer, since the motion vectors are needed there.

View-synthesis prediction. View-synthesis prediction is a generic term for different coding methods and is currently under active development. The general idea is that a view can be synthesized from neighboring views. If we apply view synthesis at the compression stage, instead of coding a view, we can simply skip it, assuming that it can be synthesized from the neighboring views at the decoder. The proposal in [15] assumes that the depth information is available at the macro block layer, such that it can be used for view synthesis. In general, the translational motion model, as implicitly assumed in all block-based compression standards, is inappropriate to model the disparity between views, which typically includes global rotational motion. Thus, it was shown in [15] that by including view synthesis in the spatial motion compensation, compression efficiency can be improved. However, as the specification of geometric models to use in MVV is out of the scope of MVC standardization, this proposal was not included in the standard.

Depth-compression techniques. An active area of MVV research is the compression of geometric models such as depth maps. Currently, there is no standardized method for the efficient compression of depth maps [36]. It was shown that the transmission of depth maps consumes about the same bandwidth as the transmission of views. In general, it is hard to find a good balance between the distortion of depth data and the associated rate (i.e., bandwidth consumption) [36].

Most recently, multi-view coding has progressed alongside the ongoing standardization of High-Efficiency Video Coding (HEVC). Several researchers focus on applying and adapting the coding tools developed in HEVC for multi-view video coding [29, 1].

In parallel with standardization, several techniques were proposed as a result of research on the coding of multiple views and associated scene-geometry models. A number of non-standardized multi-view video coding tools were made available by Nokia Research [6]. Morvan [20] shows that the subjective quality of synthesized virtual views can be improved by using coding tools that are better suited to the signal properties of depth maps than conventional block-based methods. The platelet-based depth coding proposed in [20] avoids edge artifacts in synthesized views that typically occur with the block-based coding used in H.264 and MVC. Following this work, researchers have attempted to improve block-based MVC coding to better handle the frame regions containing depth-discontinuity edges. The general approach is to identify the blocks of a depth map that contain the edges and encode them at a higher quality (i.e., using a larger bit budget) [9]. In addition, a number of approaches focus on improving the MVC performance in streaming applications. Early on in the MVC standardization, it became clear that prediction structures that introduce a large number of inter-view dependencies may have a negative effect on streaming efficiency [11]. Correspondingly, recent research focuses on optimizing the trade-off between the coding efficiency for storage and streaming [2].

As stated in earlier sections of this report, to the best of our knowledge, a real-time implementation of a multi-view codec is generally unavailable. Notable exceptions are the MVC decoders implemented in hardware (Blu-ray-compatible devices) and a software implementation of MVC decoding by Peter Wimmer [41]. However, these decoders are currently limited to supporting stereoscopic videos, i.e., only two views. Adding support for MVC decoding in FFmpeg was proposed in an open-source project supported by Google Summer of Code 2011 [5]. However, the project status is unclear and no results have been reported up to now. Finally, motivated by the need to reduce the encoder complexity of MVC, there are a number of recent proposals that trade off quality for encoding speed [8, 32].

1.4 Contributions

In this thesis, we focus on the design and implementation of a real-time H.264-compliant multi-view video decoder based on the FFmpeg framework. Thereby, we used the H.264 implementation in FFmpeg and modified it to be able to decode MVC-compliant bitstreams. Our specific contributions are as follows.

First, we extended the parsing routines to be able to handle MVC-compliant bitstreams. In addition, we implemented new buffers and enhanced the appropriate FFmpeg structs to be able to store MVC-dependent data. These include the structs for SPS and PPS and buffers for Subset SPS and inter-view reference lists.

Second, we extended the FFmpeg decoding routines by modifying the existing code and implementing additional functions, as required by MVC. Additionally, we investigated multi-threading capabilities, optimized the implementation to decode selected views only and added configuration options through command line parameters.

Third, we performed experiments on multi-view sequences with different prediction schemes and different quantization parameters to explore the coding of multi-view sequences in terms of quality-complexity trade-offs. We showed that coding performance depends on the prediction scheme as well as on scene characteristics. Furthermore, we found that there is no dependence between prediction schemes and scene characteristics. Finally, we analyzed the impact of quantization on virtual view rendering and its relation to prediction schemes. We found that quantization has a serious impact on virtual view rendering, while the choice of prediction scheme does not.

1.5 Thesis outline

The remainder of this thesis is structured as follows. In Chapter 2, we give an overview of H.264/AVC and the MVC extension of H.264. We survey the coding tools of the H.264/AVC standard and describe the principle of hybrid coding and the main building blocks of the standard. Next, we describe the main components of the MVC standard codec and contrast them with H.264/AVC. In particular, we detail on the inter-view coding tools, the bitstream structure and the decoder resource requirements in terms of memory. Chapter 3 first introduces the FFmpeg framework and in particular its implementation of H.264. We continue with the components missing from the H.264 implementation of FFmpeg according to MVC. Each component and our corresponding contributions to the implementation are discussed in detail, including the design motivation and the implementation details. Chapter 4 presents the experiments on multi-view sequences with different prediction schemes and different quantization parameters, conducted in order to explore the coding of multi-view sequences in terms of quality-complexity trade-offs. Additionally, we analyze the impact of quantization on virtual view rendering and its relation to prediction schemes. Finally, we conclude the thesis in Chapter 5 with our findings on H.264, the FFmpeg framework and MVV coding in general, and a discussion of opportunities for future work.


Chapter 2

Background

MVC was standardized in 2008 as Annex H of the widely used standard H.264/AVC. As such, it is an extension of the H.264 standard; it inherits a number of efficient video-coding tools from H.264 and to a large extent follows the H.264 syntax. In this chapter, we first give background on the basic principle of H.264/AVC and its main building blocks. We follow this with background on multi-view video and detail on the MVC extension of H.264.

2.1 H.264/AVC

H.264/MPEG-4 AVC is the most commonly used standard for efficient video coding. It is used, e.g., for Blu-ray Discs, HD DVDs, DVB, Internet streaming and much more [40]. In this section, we give a short overview of the coding tools in H.264/AVC. We start by reviewing the basic principle of hybrid coding, as used in H.264 and several prior standards. We follow this with a description of the main building blocks of a standard H.264/AVC codec [14].

2.1.1 Hybrid coding

The compression efficiency of today's MPEG video codecs is due to the fact that they implement the principle of hybrid coding, which combines motion-compensated prediction and transform coding. The encoder of H.264 implements hybrid coding as shown in Figure 2.1.1. It implements this at the granularity of macro blocks. Macro blocks are rectangular parts of a picture with a size of 16x16 pixels. These blocks are used to perform motion estimation based on 'past' and 'future' pictures (frames) in the video sequence. After the motion estimation, the predicted block is subtracted from the current block to get the residual macro block. This residual block is transformed to the frequency domain, quantized and then encoded via entropy coding. To avoid the drift of quantization errors in the decoder while decoding a macro block, the encoder internally implements a prediction loop identical to that of the decoder.


Figure 2.1.1: Hybrid coding principle [26].

Figure 2.1.2: MB-level motion prediction in hybrid coding [26].

In particular, the reference pictures used for motion compensation are encoded and decoded, instead of using the original pictures (which are unavailable to the decoder), as shown in the lower part of Figure 2.1.1. The ordering of encoded pictures does not necessarily reflect the ordering of the input raw frames. There are different picture orders allowed in the standard, namely the display order and the decoding order. The display order reflects the order in which the pictures are presented to the viewer. The decoding order, on the other hand, represents the order in which the pictures have to be decoded. For example, a group of pictures displayed as I B B P is transmitted as I P B B, since both B pictures reference the later P picture. The transmission takes place in the decoding order, which saves buffer space on the receiver side, since all references are received before they are required.

2.1.2 Block-based motion estimation and compensation

The size of the macro blocks used for motion compensation is flexible in H.264. A macro block (MB) of 16x16 pixels can be subdivided, so that 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4 pixel blocks are possible [31]. The advantage of subdivision is visible in regions with high motion, where a higher accuracy of motion compensation leads to higher compression efficiency [7].


Figure 2.1.3: Residual image and motion vectors [26].

To additionally enhance the motion compensation, H.264 allows quarter-pel accuracy of motion vectors. This means that the reference MB can be an interpolated block. Thereby, the reference MB is the MB that is subtracted from the current macro block to generate the residual. For example, the reference MB can be a synthetically generated macro block whose position is on a grid with 4 times the pixel resolution of the original picture.

Based on whether the macro blocks of a picture are encoded independently or predicted from reference pictures, three different picture types are defined:

• I (intra coded) pictures, where all macro blocks are independently (intra) coded and can only reference the current picture,

• P (predicted) pictures, which can reference up to one other picture, and

• B (bidirectional predicted) pictures, which can reference up to two other pictures.

The reference picture has to be one of the previous pictures in the decoding order, but according to the display order, they can be 'past' and/or 'future' pictures in the sequence. In contrast to earlier standards, these restrictions are not applied directly at the picture level, but at the macro block level. Furthermore, all macro blocks of a picture have to be of the same type. This only has an impact on P and B pictures, because in H.264 these picture types can reference multiple pictures, since each macro block of the picture can have different references, as shown in Figure 2.1.2. The I pictures can be decoded without any references, so they are needed at the start of the stream and at each random access point. For P and B pictures, we have the advantage that only the residuals and the motion vectors have to be transmitted, as shown in Figure 2.1.3 [26]. However, the disadvantages are an increase in required memory and decoding delay [30], since the previously decoded pictures need to be kept in the buffer. In contrast, the I pictures introduce no delay as they are encoded independently. Thus, the mix of I, P and B pictures is mainly restricted by the size of the Decoded Picture Buffer (DPB) in the decoder, the desired decoding delay and the desired random access time.
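To make the residual step concrete, the following minimal C sketch forms the residual for one 16x16 macro block by subtracting the motion-compensated prediction from the current block; the function name, flat output layout and stride handling are illustrative assumptions, not FFmpeg code.

#include <stdint.h>

#define MB_SIZE 16

/* cur and pred point to 16x16 blocks with the given line stride;
 * residual receives the signed difference that is subsequently
 * transformed, quantized and entropy-coded. */
static void residual_mb(const uint8_t *cur, const uint8_t *pred,
                        int stride, int16_t residual[MB_SIZE * MB_SIZE])
{
    for (int y = 0; y < MB_SIZE; y++)
        for (int x = 0; x < MB_SIZE; x++)
            residual[y * MB_SIZE + x] =
                (int16_t)cur[y * stride + x] - (int16_t)pred[y * stride + x];
}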


Figure 2.1.4: Alignment of hierarchical B-pictures.

2.1.3 Hierarchical B-pictures

One big improvement in H.264 compared to earlier standards is the introduction of hierarchical B pictures. This coding tool introduces a hierarchy for B pictures, which allows B pictures to reference other B pictures that are above them in the hierarchy, as shown in Figure 2.1.4. The major gain is accomplished by applying a prediction hierarchy and spreading the quantization parameters $QP_k$ over the hierarchy levels $k$, so that the pictures lower in the hierarchy have a higher quantization. One simple way to do so is

$$QP_k = QP_{k-1} + \begin{cases} 4 & \text{if } k = 1 \\ 1 & \text{otherwise,} \end{cases}$$

which gives quite good results for a wide range of sequences [30]. According to [25], this could also be done in a more sophisticated way. The reason why higher quantization for pictures lower in the hierarchy is possible without affecting perceptual quality is that the lower a picture is in the hierarchy, the less it can be referenced, such that its impact on other pictures is lower. Since pictures of the same hierarchy level should not be next to each other according to the display order, they do not have a big effect on the viewer.

2.1.4 H.264 bitstream structure and syntax

For transmission, the encoded bitstream is packed into Network Abstraction Layer Units (NALUs), as shown in Figure 2.1.5. These units contain either signaling information or the actual compressed video data. As examples of signaling information, sequence parameter sets (SPS), picture parameter sets (PPS) and supplemental enhancement information (SEI) messages only contain information; furthermore, SPS and PPS are necessary for decoding. On the other hand, there are Video Coding Layer (VCL) NALUs, namely slices and IDR slices. VCL NALUs contain compressed video data with macro blocks and additional information in the headers. Macro blocks, in terms of transmission syntax, also contain information fields, e.g., for the prediction type, motion vectors and quantization parameters.


Figure 2.1.5: H.264 bitstream structure [26].

Overall, there are 18 different NAL unit types used by the standard, and a further 14 are free for future extensions. The IDR slice (instantaneous decoding refresh) is an important slice type that signals that the decoding buffer can be emptied. All pictures following an IDR slice are prohibited from referencing pictures preceding the IDR slice, according to both display and decoding order.
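For illustration, the fixed one-byte NAL unit header that follows each start code can be unpacked as in the sketch below (field layout as defined by H.264: one forbidden_zero_bit, two bits nal_ref_idc, five bits nal_unit_type; the struct and function names are our own, not FFmpeg's):

struct nal_header {
    int ref_idc;    /* 0 means the NAL unit is not used as a reference */
    int unit_type;  /* e.g. 1 = non-IDR slice, 5 = IDR, 6 = SEI, 7 = SPS, 8 = PPS */
};

static int parse_nal_header(unsigned char b, struct nal_header *h)
{
    if (b & 0x80)                 /* forbidden_zero_bit must be 0 */
        return -1;
    h->ref_idc   = (b >> 5) & 0x3;
    h->unit_type = b & 0x1f;      /* 5-bit type, so 32 values in total */
    return 0;
}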


Figure 2.2.1: MVC Extension of H.264 – supported tools and profiles [37].

2.2 Multi-View Coding (MVC)

As multi-view video may contain a large number of views of the same scene, it requires a lot of storage space and also bandwidth if transmitted. Therefore, there is a requirement for a coding scheme that efficiently compresses these views and specifies a format for their transmission. Since the multiple views are typically similar to each other, the coding scheme should exploit this property. Finally, since the consumer industry should not have to rebuild its entire infrastructure, the coding has to be backward compatible with existing video standards. Correspondingly, MPEG has defined the following basic requirements for such a coding scheme:

• good compression techniques for storage and transmission,

• backward compatibility with existing coding standards, such that (1) only the base view is available for non-multi-view devices and (2) the base view is coded in an existing standard that these devices can decode,

• the requirement for backward compatibility should be met without sacrificing compression efficiency,

• compression should exploit the similarity of views.

It was realized that the easiest way to meet these requirements is to extend an existing standard. In this case, the most logical choice would be a standard that is efficient and commonly used. Since H.264/MPEG-4 AVC was the most efficient and widely used standard at the time of MVC standardization, MVC was standardized in 2008 as Annex H of the H.264/AVC standard.


2.2.1 H.264 - MVC Extension

MVC was developed as an extension of the H.264 standard. As such, it inherits a number of efficient video-coding tools from H.264 and to a large extent follows the H.264 syntax. Further, MVC extends H.264 with encoding tools specifically designed to achieve additional compression efficiency for multi-view video coding, most notably inter-view prediction tools. Finally, MVC extends the syntax of H.264 in order to support the signaling of these new tools in a compressed bitstream.

Figure 2.2.1 shows an overview of the encoding tools in H.264 and its MVC Extension. Similarly to most previous MPEG standards, MVC uses the concept of profiles, where each profile specifies a subset of coding tools and memory requirements. The purpose of profiles is to guide standard-compliant hardware implementations of the decoder. The MVC extension specifies two profiles:

• Stereo High

• Multiview High.

The set of coding tools shared by these profiles, as well as the profile-specific tools (marked in blue), are shown in Figure 2.2.1. The Stereo High profile is for transmission of stereoscopic videos with only two views, whereas Multiview High is for multi-view video with two or more views.

In the sequel, we detail on the coding tools standardized in MVC. We start by presenting a simple method for stereoscopic video transmission that uses MVC syntax only for signaling, without employing the MVC inter-view coding tools. Next, we provide an overview of the inter-view prediction tools in MVC.

2.2.2 Stereo interleaving

One simple method for the encoding and transmission of stereoscopic videos that uses MVC syntax for signaling is stereo interleaving. As shown in Figure 2.2.2, the two views are spatially multiplexed to form one new picture. The display device has to know about this multiplexing in order to create a correct representation. The drawbacks of this method are that only stereo views are supported and that only half of the resolution remains for each view. Although there is perfect backward compatibility in terms of transmission, the method lacks backward compatibility in terms of displaying. Alternatively, one can use temporal multiplexing; in this case, the frame rate is halved and the backward compatibility issue remains. The MVC extension of H.264 adds better ways of signaling that a stereo-interleaved stream is available and which mode it uses. This is done through a new type of SEI message, which is a special NAL unit type.

Figure 2.2.2: Spatial interleaving for stereoscopic video [37].
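A minimal sketch of the side-by-side variant for one luma plane is shown below: each view is horizontally decimated to half width and the halves are packed into one frame. Dropping every second column stands in for a proper downsampling filter, and all names are illustrative assumptions.

#include <stdint.h>

/* left/right: full-resolution luma planes of the two views;
 * out: one packed side-by-side frame of the same dimensions. */
static void pack_side_by_side(const uint8_t *left, const uint8_t *right,
                              uint8_t *out, int width, int height, int stride)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width / 2; x++) {
            out[y * stride + x]             = left[y * stride + 2 * x];
            out[y * stride + width / 2 + x] = right[y * stride + 2 * x];
        }
    }
}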

2.2.3 Inter-view prediction

Since the similarity of neighboring views can be assumed in practice, the motion compensation described in Section 2.1.2 can be used in the temporal as well as in the spatial direction. This means that the motion compensation is done between different views. MVC implements this scheme as follows (the MVC decoder is illustrated in Figure 2.2.4). One of the views is selected as the base view; it is encoded independently from the other views and has no spatial references. The base view is introduced for backward-compatibility purposes: non-MVC devices can receive and decode the base view exactly like an H.264/AVC video stream. As shown in Figure 2.2.3, different schemes of arranging the I, P and B pictures in both the spatial and the temporal domain are applicable. The schemes are only restricted by buffer size, delay and random access time. To be able to distinguish between different pictures, four major types of pictures are introduced:

IDR pictures are intra-coded pictures without spatial or temporal references. Successors of IDR pictures are prohibited from referencing any predecessor of the IDR picture. Successors and predecessors are defined according to both the display and the decoding order.

Anchor pictures are similar to IDR pictures, but they allow spatial references. Since any kind of temporal reference is still prohibited, their reference pictures have to be either IDR or anchor pictures.

Key pictures are either IDR or anchor pictures. They are called key pictures because they are the access points for decoding the stream.

Predicted pictures include all other types of pictures.

Figure 2.2.3: Different prediction schemes for inter-view prediction [10].

2.2.4 Optimized reference-block selection

The joint use of temporal and spatial motion compensation in MVC leads to a significant increase in the number of possible reference macro blocks. To find the best reference for a macro block, the common approach is to create and minimize a cost function. This cost function needs to consider the distortion between the macro block and its reference and should also consider the required bit rate for coding the motion vector. According to [39, 18, 17], this can be done by using the Lagrangian cost function

$$J = D + \lambda \cdot R,$$

where $R(B_i, m)$ is the corresponding rate of the motion vector $m = (m_x, m_y, m_t)$, $\lambda$ is the Lagrangian multiplier for the rate, and

$$D(B_i, m) = \sum_{(x,y) \in A_i} \left[ P(x, y, t) - I(x - m_x,\, y - m_y,\, t - m_t) \right]^2$$

is the distortion between the current picture $P$ and the decoded reference picture $I$ in the area $A_i$ of macro block $B_i$. The Lagrangian multiplier $\lambda$ controls the weighting between distortion and rate. According to [18, 17], a typical value for $\lambda$ is 29.5. By solving

$$m_i = \operatorname*{arg\,min}_{m \in M} \left\{ D(B_i, m) + \lambda \cdot R(B_i, m) \right\},$$

the optimal motion vector $m_i$ in the search range $M$ can be determined. This is done for all first- and second-order neighboring pictures used as reference pictures in the temporal and spatial domain, as shown in Figure 2.2.5. The reference with the lowest cost $J$ is then chosen.

As described in [3], this can also be done for multiple reference pictures. In this case, only $D(B_i, m)$ has to be changed so that the difference is calculated between the current picture $P$ and a weighted sum of reference pictures. Reference pictures for multi-reference motion compensation are called long-term pictures and are handled differently in the reference picture list. We note that optimized quantizer selection for hierarchical B pictures, as presented in Section 2.1.3, can lead to problems when it is used for inter-view prediction. In case the view encoded at a lower level of the B-frame hierarchy is the one displayed for a longer time, the viewer can eventually recognize the coarser quality compared to other views higher in the hierarchy [18].

Figure 2.2.4: MVC Decoder [24].


Figure 2.2.5: Spatial and temporal neighboring pictures of picture P (1D camera array) [17].

Figure 2.2.6: Prediction schemes depending on different positions of the base view for a 2D camera array (only spatial neighbors are illustrated) [17].

2.2.5 Camera arrangements and complexity of inter-view prediction

In general, the number of neighbors is not restricted to 9, as Figure 2.2.5 may suggest. Namely, in Figure 2.2.5 the cameras are placed on a line (1D camera arrangement) such that each view has spatial neighbors on the left and on the right. However, as shown in Figure 2.2.6, for more complex camera arrangements (e.g., 2D camera arrays), the number of neighbors increases significantly (growing in the spatial as well as in the temporal dimension). For example, in the case that views are placed on a rectangular grid in the spatial domain (2D camera array), there are up to 26 possible neighbors in the spatial and temporal domain. This has important consequences for the complexity of inter-view coding. First, the number of neighbors that can be used as references may increase significantly. Second, as can be seen in Figure 2.2.6, it is possible to decode either four, three or only two different views while only the base view remains in the buffer. Therefore, the selection of the base view is important, since it directly influences the accessibility of other views in terms of random access time and the required buffer size.


Chapter 3

Multi-view FFmpeg decoding

As stated earlier in this thesis, to the best of our knowledge, there is no open-source decoder available that supports MVC decoding in real-time. Our starting point is FFmpeg, the most popular open-source project that supports H.264.

We select FFmpeg because it supports real-time H.264 decoding, and has an efficient implementation and a stable code base, as evidenced by the inclusion of FFmpeg-based codecs in a number of widely used video players (e.g., vlc and MPlayer). The FFmpeg framework contains a standards-compliant implementation of the H.264 decoder and routines to extract the bitstream parameters required for decoding from a variety of file formats. In general, new bitstream formats can be decoded by extending the existing decoding algorithms and bitstream-parsing routines. In this chapter, we detail on the FFmpeg decoding framework and the extensions we implemented to support MVC decoding.

This chapter is structured as follows. In Section 3.1.1, we describe how a codec is implemented and registered in the FFmpeg framework. Subsequently, in Section 3.1.2, we illustrate the interaction between the FFmpeg framework and the codec implementation. Further, we detail the differences between decoder implementations with and without multi-threading capabilities. Based on that knowledge, we describe our implementation for real-time MVC decoding in the open-source framework FFmpeg in Section 3.2.

3.1 The H.264 implementation in FFmpeg

3.1.1 Codec implementation in FFmpeg

FFmpeg uses a struct AVCodec with function pointers and settings to register an implementation of a codec in the framework. This struct tells the FFmpeg framework how to interact with codec-specific encoding or decoding routines. It also specifies the capabilities of the codec implementation. The interaction between the FFmpeg framework and the decoding routines is detailed in Section 3.1.2. In addition to the decoding routines, an implementation of a parser can be accessed via the struct AVCodecParser.


AVCodec ff_h264_decoder = {
    .type                  = AVMEDIA_TYPE_VIDEO,
    .id                    = CODEC_ID_H264,
    .priv_data_size        = sizeof(H264Context),
    .init                  = ff_h264_decode_init,
    .close                 = h264_decode_end,
    .decode                = decode_frame,
    .capabilities          = CODEC_CAP_DR1 | CODEC_CAP_DELAY
                             | CODEC_CAP_SLICE_THREADS,
    .flush                 = flush_dpb,
    .init_thread_copy      = decode_init_thread_copy,
    .update_thread_context = decode_update_thread_context,
    .priv_class            = &h264_class,
    ...
};

AVCodecParser ff_h264_parser = {
    .codec_ids      = { AV_CODEC_ID_H264 },
    .priv_data_size = sizeof(H264Context),
    .parser_init    = init,
    .parser_parse   = h264_parse,
    .parser_close   = close,
    .split          = h264_split,
    ...
};

Figure 3.1.1: The structs AVCodec and AVCodecParser register the H.264 decoder and parser in FFmpeg.

The main task of the parser is to split the bitstream into packets that correspond to single frames. The benefit is that the decoding routines can be called per frame, which is especially important for multi-threading. In addition, the parser can be used for gathering extra information about the stream. For a better understanding of what instances of the AVCodec and AVCodecParser structs look like, Figure 3.1.1 shows both structs for the H.264 implementation in FFmpeg. The declarations of the structs AVCodec and AVCodecParser can be found in Appendix A.1. Their extensions to handle MVC bitstreams are described in Section 3.2.

3.1.2 Interaction between the codec implementation and the FFmpeg framework

As mentioned above, FFmpeg uses a codec implementation that interacts with the FFmpeg framework through two different structs, AVCodec and AVCodecParser. All functions of AVCodec have a common argument, which is the struct AVCodecContext. This can be seen in the declaration of AVCodecContext in Appendix A.1.1.


AVCodecContext stores general, codec-independent information about the video that has to be decoded. The most important field in that context for our needs is:

void* priv_data;

since it contains all the codec-specific information. We detail this field in Section 3.2.2. The function pointers are called in a specific order by the outer framework. Since we could not find appropriate documentation for this, we give a quick overview of how the interaction is implemented.

The decoding process is split into two cycles. In the first cycle, a single-threaded analysis is applied to the bitstream to extract information about the stream. Afterwards, in the second cycle, the decoding starts with full usage of multi-threading, if allowed by the decoder. In both cycles, the operations and their order are quite similar.

AVCodec.init is called to initialize the decoder.

AVCodecParser.parser_parse is called multiple times to locate frames in the bitstream and fill buffers.

AVCodec.decode_frame is called multiple times, until all previously located frames are decoded, after which the parser starts over again.

AVCodec.flush is called to flush the remaining frames in the buffer of the decoder.

AVCodec.end is called to close the decoder, to terminate all corresponding processes correctly and to clean up used memory.

The loop of parsing and decoding is repeated until a stopping criterion is fulfilled. In the first cycle, it stops when the analysis is finished or an analysis duration limit is reached. In the second cycle, the stopping condition depends on the application: in the case of FFplay, it finishes when the playback is stopped by the user; in the case of transcoding via FFmpeg, when the end of the stream is reached. In the first cycle, the split function is additionally called before the buffers are flushed, to split previously determined extra data from the stream. This extra data is handed over to the first call of the decoder in the second cycle. The decoder can directly make use of this extra data without further analyzing the bitstream. The parser is initialized at the beginning of the first cycle and closed at the end of the second cycle. If the decoder does not support multi-threading, there are no other steps in the second cycle, as shown in Figure 3.1.2. If it supports multi-threading, an init_thread_copy function is called for each thread after the decoder is initialized, to initialize a thread-dependent context. In addition, update_thread_context is called per thread before a call to the decode_frame function; it synchronizes the processed results between the threads. Subsequently, each thread also calls the decode_frame function. Finally, each thread is stopped via the end method, as shown in Figure 3.1.3. A simplified sketch of this interaction is given below.
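To make this call order concrete, the following minimal sketch shows how an application drives the decoder through FFmpeg's public API, which in turn invokes the AVCodec and AVCodecParser function pointers described above. The driver function itself and its simplified control flow (no error handling) are our own illustration, not FFmpeg code.

#include <libavcodec/avcodec.h>

/* Minimal sketch: an application-side loop that triggers the callback
 * sequence described above. Error handling is omitted for brevity. */
static void decode_stream(const uint8_t *data, int size)
{
    AVCodec *codec;
    AVCodecContext *avctx;
    AVCodecParserContext *pc;
    AVFrame *frame;
    int got_frame;

    avcodec_register_all();
    codec = avcodec_find_decoder(AV_CODEC_ID_H264);
    avctx = avcodec_alloc_context3(codec);
    avcodec_open2(avctx, codec, NULL);        /* -> AVCodec.init            */
    pc    = av_parser_init(AV_CODEC_ID_H264); /* -> AVCodecParser init      */
    frame = avcodec_alloc_frame();

    while (size > 0) {
        uint8_t *out;
        int out_size;
        /* -> AVCodecParser.parser_parse: locate the next frame packet */
        int used = av_parser_parse2(pc, avctx, &out, &out_size, data, size,
                                    AV_NOPTS_VALUE, AV_NOPTS_VALUE, 0);
        data += used;
        size -= used;
        if (out_size > 0) {
            AVPacket pkt;
            av_init_packet(&pkt);
            pkt.data = out;
            pkt.size = out_size;
            /* -> AVCodec.decode for one previously located frame */
            avcodec_decode_video2(avctx, frame, &got_frame, &pkt);
        }
    }
    av_parser_close(pc);   /* -> AVCodecParser.parser_close          */
    avcodec_close(avctx);  /* -> AVCodec.flush / AVCodec.end (close) */
    av_free(frame);
}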


[Figure: workflow without threading. FFmpeg/FFplay demuxes the file into a stream; during the analysis and decoding/transcoding cycles, the AVCodecParser functions (parser_init, parser_parse, split, parser_close) and the AVCodec functions (init, decode_frame, flush, end) are called, and the output is displayed.]

Figure 3.1.2: Decoder workflow without threading.

3.2 Extending FFmpeg for multi-view decoding

The H.264 codec implementation of the FFmpeg framework supports decoding of H.264 streams based on the release of the standard from 2003. In the following, we denote this implementation as FF_H264. At the time of this writing, FF_H264 does not support MVC decoding. Our contributions to FFmpeg in terms of implementing the required extensions for MVC decoding are as follows.

Modification of ff_h264_decoder and corresponding structs

• ff_h264_decoder is an instance of the AVCodec struct which registers the implementation of the codec, namely H.264 AVC, in the FFmpeg framework. To be able to handle data effectively, the framework preallocates a private data field for the codec. The size of that


[Figure: workflow as in Figure 3.1.2, extended for threading: after init, an init_thread_copy is executed per thread, and update_thread_context is called per thread before each decode_frame call; each thread is stopped via end.]

Figure 3.1.3: Decoder workflow with threading.

field is specified by the codec in the previously mentioned AVCodec struct. FF_H264 uses the struct H264Context as private data. To handle multiple views efficiently, we need one such struct per view. Consequently, we changed it to H264Context[number of views].

• Not all information required for decoding is transmitted per view (e.g., parameter sets). To overcome this limitation, we store this information in the first instance of the H264Context struct, which always corresponds to the base view. In addition, we implemented methods to access this information from all views.

• Furthermore, the struct H264Context (shown in Appendix A.1.4) and additional related structs are modified to be able to store MVC-relevant information.


[Figure: bit layout of the NAL unit header. The first byte carries forbidden_zero_bit, nal_ref_idc and nal_unit_type. Only if nal_unit_type is equal to 14 or 20, three extension bytes follow, carrying svc_extension_flag, non_idr_flag, priority_id, view_id, temporal_id, anchor_pic_flag, inter_view_flag and reserved_one_bit, before the RBSP begins.]

Figure 3.2.1: Extended NAL unit header.

Support for new NAL unit types

• The routine to parse the NAL unit headers is changed to support MVC. As shown in Figure 3.2.1, the NAL unit header contains more fields in the case of MVC than in standard H.264.

• We added parsing routines for the missing NAL unit types, including Prefix NAL unit, Coded slice extension, Sequence parameter set extension and Subset sequence parameter set:

– The Prefix NAL unit contains multi-view dependent information for the base view, since the base view itself contains no MVC-specific information (due to the backward-compatibility requirement).

– The Coded slice extension NAL unit type handles the video data of views other than the base view.

– The Sequence parameter set extension NAL unit type is introduced to extend the sequence parameter sets with extra information.

– The Subset sequence parameter set NAL unit type is introduced to have an extended sequence parameter set, which offers extra information necessary for the decoding of MVC (e.g., view alignment and view order).

• We modified the order in which parsed NAL units are applied, in order to support the newly introduced Prefix NAL unit.

Slice types

• Slice types, or picture types, as described in Section 2.1.2, are used to create inter-view prediction schemes. In general, there are three different slice types: I, B and P. However, in H.264, there are up to 16 different slice types, as shown in Table 3.2. Since these types can be reduced to I, B or P in most cases, FF_H264 introduces its own slice type numbering and a mapping scheme to the original types. We extended this mapping to support the slice types EI, EB and EP and took them into account when implementing new parsing routines.


Parameter sets

• Sequence parameter sets contain information that is necessary for decoding. The Subset sequence parameter set NAL unit introduces a completely new class of SPS. They can have the same identifiers as normal SPS but are not allowed to override them. To ensure that they do not interfere with each other, we introduced a new buffer only for Subset SPS.

• Unlike normal SPS, Subset SPS are not activated by default; as a result, they do not automatically replace the SPS that is actually used. We implemented methods for efficient activation.

• As mentioned earlier, since sequence parameter sets and picture parameter sets are not transmitted per view, we store them in a common place. Algorithms are implemented to manage them correctly and ensure their accessibility from all views.

• Picture parameter sets also lack fields necessary for MVC decoding. The parsing routines and storage structs are modified to correct this.

Reference picture lists

• For managing the reference pictures, H.264 uses two different lists. The first list is used for references of both P and B pictures, and the second list only for the references of B pictures. Since we now have multiple views, these lists are restructured to support picture (temporal) and view (spatial) indices for different prediction structures.

• A new reordering algorithm for these lists is implemented in order to support inter-view references.

• The creation of initial reference lists, which FF_H264 does via default lists, is extended to add inter-view references.

• The reference-picture marking process is modified to handle each view separately, while keeping track of inter-view references.

• To handle inter-view references similarly to short- or long-term references, a new inter-view list is added.

Decoded picture buffer (DPB)

• The decoded picture buffer stores the decoded pictures, linked by the reference lists. The new MVC profiles require a buffer space up to 10 times bigger than the H.264 DPB, where the actual factor depends on the number of views. Therefore, the memory allocation has to be changed and enabled for dynamic allocation. It is important to note that FF_H264 does not implement the DPB directly, but uses the Video Acceleration API library (libva) for hardware-accelerated decoding or MpegEncContext for software decoding. Furthermore, in software


decoding, the frame data uses an extra buffer called the frame buffer. Since we concentrate on software decoding, all corresponding buffers are extended to support MVC.

• Since today's graphics chips are often capable of directly decoding standard H.264 AVC (without MVC), the use of these hardware decoders would require a complex and not necessarily standard-compliant MVC extension. In addition, hardware acceleration is by nature hardware-dependent. For these reasons, we decided to prevent the H.264 MVC decoder from using hardware acceleration. When the chips become capable of MVC decoding, it would be very attractive to turn the hardware acceleration back on.

• In addition, the indexing of the DPB is modified to include view identifiers in addition to picture identifiers.

Optimization and parallelization of the decoder

• We optimized the decoder so that it only decodes the views that are necessary to decode the given output views.

• Parallelization is done by the FFmpeg framework automatically, if the decoder specifies the multi-threading capability in the struct AVCodec. FF_H264 uses two different types of multi-threading: frame-based threading, which parallelizes the decoding of different frames, and slice-based threading, which parallelizes the decoding of a single frame by calculating multiple rows at the same time. Since the FFmpeg framework has no knowledge about the dependencies between frames or rows, the corresponding waiting conditions are extended for MVC support.

Command line parameters

• To be able to change a view without the need to recompile each time, we specified additional AVOptions. An AVOption gives us the ability to associate a field in a special struct (detailed later) with a command line parameter. We added two parameters, target_view_index and decode_all_views, to be able to turn optimized decoding on and off and to specify the output views.

3.2.1 Challenges

The FFmpeg code base has grown very large over time and is completely written in C. The H.264 implementation in FFmpeg is based on implementations of earlier codecs and reuses a lot of their code. For example, FF_H264 uses structs and functions that are even used by the H.261 codec. The structs are also shared with a lot of other codecs, e.g., H.261, MPEG-1, MPEG-2, H.263, RealVideo (RV10 and RV20), Motion JPEG and WMV. This is certainly an effective way of writing new codecs. However, in terms of extensions, this design


poses serious challenges. Since a lot of the used structs and functions are also used by other codecs, they cannot be changed easily, or not at all. In the sequel, we describe specific challenges we faced when implementing MVC decoding and our solutions.

3.2.2 Modification of ff_h264_decoder and corresponding structs

Each AVCodecContext has a field priv_data, as mentioned in Section 3.1.2. It is a field with preallocated memory for fast and efficient handling of the internal data structures of the codec implementation. Therefore, its size has to be specified in the codec struct, as the value of the field priv_data_size. As shown in Figure 3.1.1, the original FF_H264 uses the struct H264Context as private data. The struct H264Context contains all the codec-specific information necessary to decode the current frame, as shown in Appendix A.1.4. Since each view can be seen as a single video stream in terms of displaying, and most of the information in H264Context differs between views, each view needs its own H264Context. We decided to use one H264Context per view and to store the common information only in the H264Context of the base view. The base view always has a view order index of 0. As shown in Figure 3.2.2, this affects the DPB, the buffered parameter sets and the inter-view reference lists. Namely, we have to ensure that these are equal for all views. Therefore, in all views other than the base view, they are removed and linked to the base view. The reference lists, explained in Section 3.2.8, contain only pointers and can differ per view, so they are managed per view.

According to the multiple H264Contexts, the value of the field priv_data_size is modified as follows.

.priv_data_size = sizeof(H264Context) * MAX_VIEW_COUNT,

Here, MAX_VIEW_COUNT is a constant defined by a macro. In our current implementation, this constant is set to 8, since the test sequences we use have up to 8 views. The standard in principle allows for a maximum of 1024 views. In the Stereo High profile, the maximum is two, and in the Multiview High profile, the maximum is not specified directly. However, the field view_id in the NAL unit header is restricted to the range from 0 to 1023 (Figure 3.2.1).
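With this layout, a per-view context can be obtained directly from the preallocated private data. The following helper is hypothetical and only illustrates the indexing; it is not taken verbatim from our patch.

#define MAX_VIEW_COUNT 8  /* our test sequences have up to 8 views */

/* Hypothetical helper: fetch the H264Context of a given view order index
 * from the preallocated priv_data array (index 0 is the base view). */
static H264Context *get_view_context(AVCodecContext *avctx, int voidx)
{
    H264Context *views = avctx->priv_data;
    return &views[voidx];
}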

In terms of extending the codec priv_data, each method which uses this data field has to be adapted. As this is primarily the case when the codec is called from the outside framework, all functions linked in the struct are concerned.

Modifying the value of the field priv_data_size in the described way allows us to achieve backward compatibility with the architecture of the FFmpeg framework. We justify this as follows. Since the codecs of FFmpeg can be used as a standalone library without the use of FFmpeg itself, we have to take into account backward compatibility in terms of programming/implementation. This means that if a program uses the old header files, the new implementation of the decoding should still be functional. In our implementation, since a program can access the pointer to the first H264Context, which always corresponds


[Figure: an MVC picture consists of one H264Context per view (view order indices 0, 1, 2, ..., x). Only the H264Context of the base view (view order index 0) owns the decoded picture buffer, the sequence, subset sequence and picture parameter sets, and the inter-view references; the reference lists with short-term and long-term references exist per view, and the contexts of all other views link to the shared structures of the base view.]

Figure 3.2.2: Dependencies between different H264Contexts.

to the base view, the interaction between our implementation and programs that use old header files is backward compatible. Importantly, this way of achieving backward compatibility is the same as specified in the H.264 MVC extension. Namely, the H.264 standard specifies that even non-MVC-capable decoders should be able to decode the base view out of the bitstream. Furthermore, all structs that need to be changed for MVC-specific information have to be modified in a backward-compatible manner. To this end, we added fields at the end of these structs and do not modify the existing fields, so that casting them to their previous versions still leads to valid memory entries. The idea is illustrated by the sketch below.
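Schematically (with an illustrative selection of field names, not the actual struct contents), the extension looks as follows:

/* Illustrative only: MVC fields are appended after all original members,
 * so a cast to the pre-MVC layout still finds every old field at its
 * previous offset. */
typedef struct H264Context {
    MpegEncContext s;   /* original first field, unchanged              */
    /* ... all remaining original fields, unchanged and in order ...    */

    /* appended MVC extension (illustrative selection of fields):       */
    int view_id;
    int anchor_pic_flag;
    int inter_view_flag;
} H264Context;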

3.2.3 Interaction between the parser and the decode_frame function

Since AVCodecContext and AVFrame lack support for multi-view video, they both lack a field for a view id or some other kind of view-related signaling. Decoding temporal and spatial frames by separate calls of the decode_frame function is not possible, since the frames of different views share the same temporal id, and the distinction between different views is not possible after decoding. To support multiple views this way, the decode_frame method would require that a view is specified prior to decoding. In our implementation, the decode_frame method instead handles all views of a frame at once and outputs only the required view. In addition, since the output of the decode_frame method is a void pointer with a corresponding size attribute, we can output several views in a specified order at once. Fortunately, this is possible because the MVC bitstream does not allow for arbitrary slice order: all views of a time instance have to follow each other in correct decoding order, starting with the base view. As a result, we split the bitstream in the parser into packets that contain all slices corresponding to


one temporal id, and inside the decode_frame function we split the bitstream further into the different views, as described in Section 3.2.4.

3.2.4 Splitting the bitstream

The splitting of the bitstream at the end of each frame/slice is done in the parser by the function find_frame_end, as shown in Appendix A.2.1. The method itself is a state machine which operates on the bitstream and gets its initial state out of the struct AVCodecParserContext, which is the parser's equivalent of the struct AVCodecContext.

Due to a lack of documentation for this method, we analyzed it and created the flowchart in Figure 3.2.3. In principle, the method searches for the start code of a NAL unit and extracts the NAL unit type. If the NAL unit type is 1, 2 or 5, which means the unit contains "video data", the frame start is found and the algorithm starts to search for the first_mb_in_slice value. The first_mb_in_slice value has to be increasing inside a frame and normally starts with 0. If the frame is split over multiple slices, this value indicates whether a new frame starts: if the current value is less than the previous one, a new frame is found. Otherwise, if the NAL unit type is 6, 7, 8 or 9 (which means the NAL unit contains "extra data") and a frame start was detected beforehand, the frame end is found. In all other cases, the algorithm has to go on searching for the start of the frame. If a frame end is found, the bits of the new NAL unit that have already been read are subtracted from the current stream pointer. Afterwards, the bitstream pointer points to the end of the frame and is returned.

This analysis shows that no modifications are required to adapt this method such that all slices corresponding to one temporal id are grouped together. Next, to be able to find the end of an MVC frame, we have to add NAL unit type 20 to "video data" and the NAL unit types 13, 14 and 15 to "extra data". In addition, in the case of NAL unit type 14 or 20, three extra bytes have to be subtracted from the bitstream pointer, since their header has some extra fields. A sketch of these adaptations follows below.
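The adaptation can be pictured as follows; variable names are illustrative, with nal_type extracted as buf[i] & 0x1F as in the original routine.

/* Sketch of the MVC adaptation of find_frame_end (names illustrative). */
int is_video_data = nal_type == 1 || nal_type == 2 ||
                    nal_type == 5 || nal_type == 20;    /* 20: slice ext. */
int is_extra_data = (nal_type >= 6  && nal_type <= 9) ||
                    (nal_type >= 13 && nal_type <= 15); /* 13, 14, 15     */

/* NAL unit types 14 and 20 carry the 3-byte MVC header extension, so
 * three extra bytes must be subtracted when rewinding to the frame end. */
if (nal_type == 14 || nal_type == 20)
    rewind_bytes += 3;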

3.2.5 Support for new NAL unit types

There are several NAL unit types missing in FF_H264. In this section, we describe how we implemented their parsing and decoding. In Table 3.1, we give an overview of the NAL unit types, their corresponding names in FF_H264 and their state in terms of the changes required for MVC.

• The Prefix NAL unit with NAL unit type 14 adds multi-view dependent header data to the NAL units with type 1 or 5. These NAL units carry frames of the base view and, due to the backward-compatibility requirements, they lack multi-view dependent data. The Prefix NAL unit comes prior to the ones it applies to, but it is also capable of overriding some of their fields. Since the parsing routines override information on reception, the order in which the data of the Prefix NAL unit is applied has to be


[Figure: state diagram of the find_frame_end state machine; its states, transitions and actions are listed below.]

States:
I:   Start (corresponds to state 7)
II:  Find leading zero bit (corresponds to state 2)
III: Find start code (corresponds to states 1, 0)
IV:  Filter NALU type (corresponds to states 3, 4, 5)
V:   Find frame start (corresponds to states 6, 8, 9, 10, 11, 12, 13)
VI:  Find frame end (resets to state 7)

Transitions:
A: val == 0
B: val == 1
C: val > 1
D: val != {1,2,5,6,7,8,9}
E: val == {1,2,5}
F: val == {6,7,8,9} && frame_start_found
G: val > last_mb
H: val <= last_mb && frame_start_found

Actions:
States I-III: val = buf[i]; i++
State IV:     val = last 5 bits of buf[i] (buf[i] & 0x1F); i++
State V:      tmp_buf = buf[i] .. buf[i+4] (4 bytes); i += 4;
              frame_start_found = 1; last_mb = val; val = ue_golomb(tmp_buf)
State VI:     val = consumed bytes since last time in state II; return i - val

Figure 3.2.3: Workflow of the find_frame_end function.

changed. To this end, we store the received Prefix NAL unit and apply it during the decoding of the succeeding NAL unit (or after the decoding of the NAL unit header). After being applied, the Prefix NAL unit is removed instantly, since Prefix NAL units are applicable only once and only to their direct successors, if those are NAL units of type 1 or 5.

• The Coded slice extension NAL unit with NAL unit type 20 contains the slice data for all non-base views. We extended the decoding routines for slices of this type in a straightforward manner. Since the MVC extension does not modify the lower levels of H.264 decoding, such as the macroblock layer, the decoding routine stays mostly the same for this slice type, apart from some additional parsing and different reference handling, which is described in Section 3.2.8.

• Both the Prefix NAL unit and the Coded slice extension carry a 3-byte block with MVC-specific fields after the first byte of the NAL unit header, as shown in Figure 3.2.1. Therefore, the parsing routines for the NAL unit headers in the parser and the decoder are modified to parse these extra fields. The parsing routine for the extra fields is shown in Figure 3.2.4.

• The Subset sequence parameter set NAL unit type introduces a new SPS type. Therefore, it requires new parsing routines and a new buffer type. In addition, algorithms for their activation and deactivation are necessary. The required changes are explained in detail in Section 3.2.6.


void ff_h264_mvc_decode_nal_header(H264Context *h, const uint8_t *src)
{
    h->non_idr_flag    = (src[0] & 0x40) >> 6;
    h->priority_id     =  src[0] & 0x3F;
    h->view_id         = (src[1] << 2) + ((src[2] & 0xC0) >> 6);
    h->temporal_id     = (src[2] & 0x38) >> 3;
    h->anchor_pic_flag = (src[2] & 0x4)  >> 2;
    h->inter_view_flag = (src[2] & 0x2)  >> 1;
    ...
}

Figure 3.2.4: Parsing routine for MVC-specific NAL unit header fields.

• The Sequence parameter set extension NAL unit differs from the Subset sequence parameter set, since it adds some extra information to existing SPS. Although it turned out that they are not essential for MVC decoding, we implemented the routines for parsing and storing the parsed values in the given SPS.

3.2.6 Extending parameter sets

Parameter sets are NAL units with a set of parameters associated either with a sequence of pictures or with a single picture. These sets contain parameters like picture order, coding modes, profiles, offsets and other flags relevant for decoding. Therefore, they are essential for decoding. H.264 specifies three different types of parameter sets: the Picture parameter set (PPS), the Sequence parameter set (SPS) and the newly introduced Subset sequence parameter set (Subset SPS).

A. PPS and SPS extension

The Picture parameter set and Sequence parameter set specifications are not changed by the MVC extension, but their implementations lack some fields which are relevant for MVC. Therefore, we changed the parsing routines and storage structs accordingly. Additionally, the sequence parameter sets are extended through the Sequence parameter set extension NAL unit, which adds some extra parameters to existing SPS. We added routines for parsing the extra parameters and storing them in the appropriate SPS.

B. Subset sequence parameter set

The Subset sequence parameter set does not only add extra information to an SPS, but introduces a completely new type of SPS. The bitstream of a Subset SPS carries a complete SPS at the beginning and adds extra information at the end. Because of this, they can be seen as an extension of SPS, and they have the


NALU type  Content of NAL unit and RBSP syntax structure   FFmpeg name           State

 0         Unspecified
 1         Coded slice of a non-IDR picture                NAL_SLICE
           slice_layer_without_partitioning_rbsp()
 2         Coded slice data partition A                    NAL_DPA               D
           slice_data_partition_a_layer_rbsp()
 3         Coded slice data partition B                    NAL_DPB               D
           slice_data_partition_b_layer_rbsp()
 4         Coded slice data partition C                    NAL_DPC               D
           slice_data_partition_c_layer_rbsp()
 5         Coded slice of an IDR picture                   NAL_IDR_SLICE
           slice_layer_without_partitioning_rbsp()
 6         Supplemental enhancement information            NAL_SEI
           sei_rbsp()
 7         Sequence parameter set                          NAL_SPS               B
           seq_parameter_set_rbsp()
 8         Picture parameter set                           NAL_PPS               B
           pic_parameter_set_rbsp()
 9         Access unit delimiter                           NAL_AUD               A
           access_unit_delimiter_rbsp()
10         End of sequence                                 NAL_END_SEQUENCE      A
           end_of_seq_rbsp()
11         End of stream                                   NAL_END_STREAM        A
           end_of_stream_rbsp()
12         Filler data                                     NAL_FILLER_DATA       A
           filler_data_rbsp()
13         Sequence parameter set extension                NAL_SPS_EXT           A,B
           seq_parameter_set_extension_rbsp()
14         Prefix NAL unit                                 NAL_PREFIX            C
           prefix_nal_unit_rbsp()
15         Subset sequence parameter set                   NAL_SUB_SPS           C
           subset_seq_parameter_set_rbsp()
16..18     Reserved
19         Coded slice of an auxiliary coded picture       NAL_AUXILIARY_SLICE
           without partitioning
           slice_layer_without_partitioning_rbsp()
20         Coded slice extension                           NAL_EXT_SLICE
           slice_layer_extension_rbsp()
21..23     Reserved
24..31     Unspecified

State: Ignored in FFmpeg = A, Changes necessary = B, New in MVC = C, Ignored in MVC = D

Table 3.1: NAL unit types overview and their relation to MVC and FFmpeg.


same kind of identifier in the bitstream. In fact, they can even have the same id as a normal SPS, but are not allowed to override their information. To ensure that they do not interfere with each other, we treat them as two different types of parameter sets and introduce a new buffer for Subset SPS. Unlike normal SPS, Subset SPS are not activated at reception time by default. In other words, they do not replace the currently used SPS on reception. They can be activated and deactivated on request. Deactivation takes place through the activation of either a normal SPS or another Subset SPS. The parsing routines for Subset SPS are implemented, and the parsing routines for SPS are changed so that they can be reused at the beginning of the Subset SPS parsing routines. In addition, we implemented methods for efficient buffer management and SPS activation; a sketch of the separate buffers and the activation is given below.
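The separation can be pictured as follows. The field names are assumptions made for this sketch (the existing sps_buffers follows FFmpeg conventions; subset_sps_buffers is our addition), and the actual bookkeeping in our code differs in detail.

/* Fields in H264Context (sketch): two parallel buffers, so Subset SPS
 * never override normal SPS with the same id. */
SPS *sps_buffers[MAX_SPS_COUNT];        /* normal SPS, active on reception */
SPS *subset_sps_buffers[MAX_SPS_COUNT]; /* Subset SPS, activated on demand */

/* Activate a Subset SPS on request; it is deactivated implicitly when a
 * normal SPS or another Subset SPS is activated later. */
static int activate_subset_sps(H264Context *h, unsigned int sps_id)
{
    if (sps_id >= MAX_SPS_COUNT || !h->subset_sps_buffers[sps_id])
        return -1;
    h->sps = *h->subset_sps_buffers[sps_id];  /* becomes the active SPS */
    return 0;
}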

C. Multi-view access to parameter sets

All parameter sets are transmitted only once and not per view, as mentioned earlier. Therefore, we store them in one common place. Since the base view is essential for most MVC coding schemes and in terms of backward compatibility, we store the parameter sets in the H264Context of the base view. Note that the base view does not necessarily correspond to view 0, but it always has a view order index equal to 0. In addition, we implemented methods to manage the buffers correctly and to ensure their accessibility from all views. A different possibility, which would have been easier to implement, is the synchronization of all parameter sets between all views. Since this requires a lot more memory and copying, it would be less efficient, so we decided to take the more efficient, although more complex, way.

3.2.7 Mapping slice types in FFmpeg

Slice types, which are equivalent to picture types, describe how pictures are encoded. As described in Section 2.1.2, there are I, B and P pictures. In H.264, there are in principle up to 16 slice types, as shown in Table 3.2: 10 types in the normal H.264 AVC standard and another 6 in the SVC extension of H.264, which is described in Annex G of the standard. The reason why there are so many different slice types is mainly their usage as bitstream parsing flags. In terms of decoding, they can be reduced to I, B, P, SP and SI. In most parsing routines, they can even be mapped to I, B and P. To handle these mostly similar types in an easier way, instead of having many control statements, FF_H264 introduces its own slice type numbering and a mapping to the original values. Instead of one variable, there are three variables for the slice type, namely slice_type, slice_type_fixed and slice_type_nos, to keep all the information. Their mapping to each other and to the slice types in the standard is shown in Table 3.2. The SVC slice types are added to FF_H264 and their mapping is implemented (a sketch follows after Table 3.2). Furthermore, the newly implemented parsing routines and the extended parsing routines make use of the internally used slice types.


         H.264 standard              H.264 implementation in FFmpeg
AVC  SVC  slice type value | %5 | remapped | &3 (&011) | slice_type | slice_type_nos | slice_type_fixed
P    EP          0            0   2 (010)    2 (10)      P            P                0
B    EB          1            1   3 (011)    3 (11)      B            B                0
I    EI          2            2   1 (001)    1 (01)      I            I                0
SP               3            3   6 (110)    2 (10)      SP           P                0
SI               4            4   5 (101)    1 (01)      SI           I                0
P    EP          5            0   2 (010)    2 (10)      P            P                1
B    EB          6            1   3 (011)    3 (11)      B            B                1
I    EI          7            2   1 (001)    1 (01)      I            I                1
SP               8            3   6 (110)    2 (10)      SP           P                1
SI               9            4   5 (101)    1 (01)      SI           I                1

Table 3.2: Slice type mapping between H.264 and FF_H264.
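In code, the mapping reduces to a modulo-like remapping and a small lookup table, roughly along the lines of the FF_H264 slice header parser (cf. Table 3.2); the exact code in our implementation differs in detail.

/* Sketch of the slice type mapping. */
static const int golomb_to_pict_type[5] = {
    AV_PICTURE_TYPE_P, AV_PICTURE_TYPE_B, AV_PICTURE_TYPE_I,
    AV_PICTURE_TYPE_SP, AV_PICTURE_TYPE_SI
};

static void map_slice_type(H264Context *h, int slice_type /* 0..9 */)
{
    if (slice_type > 4) {        /* 5..9: type is fixed for the sequence */
        slice_type -= 5;
        h->slice_type_fixed = 1;
    } else {
        h->slice_type_fixed = 0;
    }
    h->slice_type     = golomb_to_pict_type[slice_type]; /* I/B/P/SP/SI */
    h->slice_type_nos = h->slice_type & 3;               /* SP->P, SI->I */
}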

3.2.8 Modifying reference picture lists

For managing the reference pictures, H.264 uses two different lists. The first list is used for references of both P and B pictures, and the second list only for the references of B pictures. Since we now have multiple views, these lists have to be restructured to support temporal and spatial indices of the pictures. Pictures that are included in a NAL unit with an applying Prefix NAL unit, or pictures with NAL unit type 20, are referred to as MVC pictures in the following.

A. Reference types

In standard H.264, pictures can be marked as long-term or short-term references. Short-term references are removed automatically when they are not in use anymore, whereas long-term references are only removed when this is signaled in the bitstream. In FF_H264, each of these reference types has its own list, and the reordering process accesses them to find the required references. The MVC extension introduces a new reference type, namely the inter-view reference. Inter-view references are neither long-term nor short-term references. The previously described NAL unit types 14 and 20 have the header flag inter_view_flag, which signals whether a picture is used as an inter-view reference or not.

The inter-view references are identified by the view id. For each view id, there is only one inter-view reference at a time, so a reference can be deleted when it is replaced by a new one. To manage these references in a way similar to long- and short-term references, we introduce a new list for inter-view references.


B. Reordering of the lists

The two reference lists are reordered before the decoding of each picture, so that the correct pictures are, first, contained in the list and, second, located at the right place. In the decoding process, only the indices of the pictures in the reference lists are known. In the initial reference lists, the most probably used pictures are located at the beginning. Since lower indices lead to fewer bits and the indices are needed in every macroblock, the effort of reordering the lists pays off.

If the picture that has to be decoded is an MVC picture, the reordering has to be modified to take the inter-view references into account. We implemented the new reordering algorithm as specified in the standard.

C. Default lists

As specified by H.264, the reference lists have to be in a defined initial state before any reordering can be done. In MVC, these initial lists are filled with the inter-view references, and the order in which they are added is specified in the currently active Subset SPS. These initial states of the lists are realized in FF_H264 with the use of default reference lists. We modify the creation of these lists so that the inter-view references are appended in the appropriate order, as given by the currently active Subset SPS.

D. Reference marking

As mentioned before, the deletion of long-term references has to be signaled in the stream. This deletion of reference pictures is done by marking them as "unused for reference". In addition to deleting, it is also possible to mark short-term references as long-term references and vice versa. This signaling is called "decoded reference picture marking", and it is done after each reordering process. For MVC, we modified this reference marking process to handle each view separately, while keeping track of common references. Namely, in FF_H264, the reference marking is done via reference masks (bit masks) contained in the struct of the picture rather than in additional reference lists. Since the pictures of all views are kept in the same DPB, and only pointers to them are handled in the reference lists, marking a picture in one view can have an effect on another view. Therefore, we check whether pictures are used by another view before unreferencing them completely. This check is done via an extension of the reference mask logic; a sketch follows below. In addition, the reference marking process is extended to manage the inter-view reference list.
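The cross-view check can be sketched as follows. The field name reference_mask and the per-view bit assignment are assumptions made for this illustration; the actual mask layout in our code differs.

/* Sketch (assumed names): only mark a picture as "unused for reference"
 * when no other view still claims it. Each view owns one bit in the mask. */
static void unreference_pic_for_view(Picture *pic, int voidx)
{
    pic->reference_mask &= ~(1 << voidx);  /* drop this view's claim */
    if (pic->reference_mask == 0)
        pic->reference = 0;                /* now unused for reference */
}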

E. Multi-view access to inter-view references

Similar to the parameter sets, the reference pictures, and especially the inter-view reference pictures, have to be available to all views. Therefore, the inter-view list and the picture buffer (described in Section 3.2.9) are managed in the H264Context of the base view.


3.2.9 Allocation of the Decoded Picture Buffer (DPB)

In this section, we first discuss the size requirements of the Decoded Picture Buffer (DPB) in terms of the MVC extension. After that, we explain where the DPB is placed in FFmpeg and which other buffers are related to it. Then, we describe how the DPB can be allocated to its new size and how the size of related buffers has to be changed. In doing this, we make sure that there is only one common buffer for all views and that this buffer is accessible from all of them.

A. Calculation of the DPB size in MVC and FF_H264

The Decoded Picture Buffer stores all decoded pictures that are either not yet displayed or still needed as a reference. Since MVC has to handle more decoded pictures, the DPB has to be bigger than in standard H.264. The size of the DPB depends on the profile and the level. For MVC (in the Stereo High or Multiview High profile), the size is calculated by the following equation.

MaxDpbFramesMVC = min( mvcScaleFactor · MaxDpbMbs / (PicWidthInMbs · FrameHeightInMbs),
                       max(1, floor(log2(NumViews))) · 16 )                          (3.1)

In this formula, mvcScaleFactor is always equal to 2, MaxDpbMbs is level-dependent and specified in Table A-1 of the H.264 standard, PicWidthInMbs and FrameHeightInMbs are specified in the SPS, and NumViews is specified in the Subset SPS. Since the equation for H.264 without MVC is given by:

MaxDpbFrames = min( MaxDpbMbs / (PicWidthInMbs · FrameHeightInMbs), 16 )

and the first term is in most cases the dominant one (as can be seen in Table A-7 of the standard document), MaxDpbFrames is 16. Therefore, FF_H264 uses the constant MAX_PICTURE_COUNT for that purpose. However, this constant in FF_H264 is set to 32. By assuming that in Equation 3.1 the first term is also dominant in most cases, and by replacing the 16 with the constant MAX_PICTURE_COUNT, the above formula simplifies as follows.

MaxDpbFramesMVC = MAX_PICTURE_COUNT · max(1, floor(log2(NumViews)))

Our experience with FF_H264 suggests that the code is not memory-efficient in terms of deleting unused pictures. We believe that this is probably the reason why the DPB size is set to the high value of 32. Because of this, our computed buffer size is slightly undersized. We observed that in most cases, the number of pictures that cannot be placed in the buffer is between 2 and 8. However, to be sure to have enough space, we follow the lead of the FFmpeg developers and simply double the size. So, in our implementation we end up with the formula:

max_picture_countMVC = 2 · MAX_PICTURE_COUNT · max(1, floor(log2(NumViews)))
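For example, for our 8-view test sequences this yields max(1, floor(log2(8))) = 3, and thus max_picture_countMVC = 2 · 32 · 3 = 192 pictures.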


B. DPB and related buffers in FF_H264

There are two different approaches for how FF_H264 decodes H.264 video. In the first approach, it uses libva, a library for hardware acceleration. Since libva is capable of decoding standard H.264 (without MVC), FF_H264 acts just as a wrapper between the FFmpeg framework and libva. Therefore, the "implementation" of the DPB is merely a list in a header file with a fixed size. Since, to the best of our knowledge, current graphics hardware is not capable of H.264 MVC decoding, we have to consider the second decoding approach in FF_H264. In the second approach, the decoding is done completely in software. Thus, we are able to modify it to our needs. Similar to the first approach, the DPB is not directly part of FF_H264, but part of the file mpegvideo.h, which is commonly used by many codecs of the MPEG family (H.261, MPEG-1, MPEG-2, H.263, ...). In this header file, the DPB is contained in the struct MpegEncContext as the field Picture* picture. The DPB implementation is thus a list of structs. Since it is declared in pointer notation and not in array notation, it has to be allocated manually after the allocation of the struct. Therefore, its size can be changed at runtime. Furthermore, the frame data in the struct Picture is also a pointer. This pointer is assigned in the initialization process of the picture, and it is taken from another buffer, called the frame buffer. This frame buffer is given by the FFmpeg framework at runtime, and the assignment differs between the single- and multi-threading implementations. In the end, however, both assignments end up in avctx->internal->buffer. The avctx is a part of the previously described AVCodecContext; the field internal is of type AVCodecInternal, and its field buffer is a list of structs of type InternalBuffer. The size of that buffer is specified as a constant in the file utils.c. Correspondingly, we increase its size to support MVC. Since this constant affects many low-level functions in the FFmpeg code, we avoid dynamic allocation and simply change the definition of the constant size:

#define INTERNAL_BUFFER_SIZE ((32+1)*MAX_VIEW_COUNT)

For multi-threading, a second buffer is used to control the release of frames, so that frames are only released when all threads referencing them have finished. This is regulated by the field released_buffers located in the struct PerThreadContext, which is defined in the file pthreads.c. This buffer can be increased in the same way as the internal buffer, since its size is also defined as a constant in the file pthreads.c.

#define MAX_BUFFERS ((32+1)*MAX_VIEW_COUNT)

C. (Re-) allocation of the DPB

To avoid changing the file mpegvideo.h and its corresponding source files (for backward-compatibility reasons), we first tried the intuitive solution of reallocating the DPB to its required size after the MpegEncContext-specific initialization function has executed. In FFmpeg, memory allocation is performed using the allocation functions av_alloc() and av_realloc(). However, we noticed that the


av_realloc() function does not work as expected for our purposes. Namely, we obtained segmentation faults after applying av_realloc() to the DPB and accessing fields that are located directly behind the DPB in memory. It appears that the function overrides succeeding memory locations, for reasons that we were unable to explain.

Instead, we implemented another approach. We allocate the DPB to its correct size by dynamically changing the initial size value, which is the constant MAX_PICTURE_COUNT, as described in Section A. We did this by introducing a new field picture_count and changing the allocation to take the maximum of picture_count and MAX_PICTURE_COUNT, as sketched below. We make sure that picture_count is set to zero in the initialization process of MpegEncContext, so that every other codec using this struct and the corresponding functions works as before. After that, we set the value of our new field picture_count to the result of the previously mentioned formula (Section A.). We perform this assignment before the allocation of the buffer.
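A minimal sketch of the changed allocation (its exact placement in the MpegEncContext initialization path is assumed here): picture_count is 0 for every codec that does not set it, so the original size still applies to them.

/* Sketch: allocate the DPB with the larger of the MVC-computed size and
 * the original constant. */
s->picture = av_mallocz(FFMAX(s->picture_count, MAX_PICTURE_COUNT) *
                        sizeof(Picture));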

D. Multi-view access to the DPB

As described before, we use one H264Context per view. Since the DPB is contained in the MpegEncContext, which is the first field in the H264Context, we have to ensure that there is still only one common DPB for all views. Since the initialization is part of some shared subroutines (which have no knowledge about views and different H264Contexts), all views have the same initialization. After that, all DPBs not corresponding to the base view are removed and pointed to the DPB of the base view instead, as sketched below. Since this initialization takes place once at the beginning of the decoding, it does not lead to performance problems.
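The sharing step can be pictured as follows; views and view_count are assumed names for the per-view context array and its length.

/* Sketch (assumed names): after the common initialization, share the base
 * view's DPB with all other views instead of keeping per-view copies. */
int i;
for (i = 1; i < view_count; i++) {
    av_freep(&views[i].s.picture);            /* drop the redundant buffer */
    views[i].s.picture = views[0].s.picture;  /* alias the base view's DPB */
}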

3.2.10 Optimization and parallelization of the decoder

A. Decoding selected views

We implemented the decoding process for multiple views such that only the selected views and their reference views are decoded. This is useful in case we do not need all views and specify only one or more target views, as described in Section 3.2.11. In our implementation, after we have decoded all target views, we simply skip all remaining views. This is possible since, in decoding order, the views can only depend on previous views and the bitstream does not allow for arbitrary order. The decoding order is also related to the view order index, which allows us to stop the decoding once we have reached the highest voidx in the target list. This optimization can be turned on and off by the command line parameter decode_all_views. By default, it is turned on in our implementation.

B. Threading support

The FFmpeg framework does not expose any programming interface to specify the number of threads to use. The struct AVCodec allows a codec to specify different


types of threading. FF_H264 uses slice-based threading and frame-based threading.

Slice-based threading is started by the decoder, and it is used to parallelize the decoding of the slice data. For this purpose, the decoder creates a copy of the current H264Context with all the data necessary to decode the slice. To handle the thread context and fill it with the necessary information, some of the functions (e.g., the decode slice header function) support two H264Contexts, namely the main and the thread context. For this reason, we also implemented support for multiple contexts. Slice-based threading seems to work fine in our implementation.

With frame-based threading, the decode_frame function is executed in parallel by the FFmpeg framework, as shown in Figure 3.1.3. It also parallelizes on the macroblock layer, so that different rows are decoded simultaneously. For intra-coded pictures, the threads eventually have to wait for required blocks, and for inter-coded pictures, for other frames.

For frame-based threading, the thread-dependent functions in the AVCodec struct have to be extended for MVC support. Since the FFmpeg framework has no knowledge about the dependencies between rows or frames, this is controlled by the decoder through the use of the functions ff_thread_await_progress and ff_thread_report_progress. Using these functions, we can report the currently processed row and the current field (which can only have two values, 0 and 1). To support inter-view references in addition, we implemented two new functions, ff_thread_await_picture and ff_thread_report_picture. These functions handle waiting for different views. In addition, we extended the progress buffer in pthreads.c, so that the processes can also depend on views:

int progress[MAX_BUFFERS][2 + MAX_VIEW_COUNT];

As a result, the view can be addressed as view_id+2. We note that the frame threading in our implementation still does not work correctly and often leads to accessing uninitialized memory.
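A sketch of how the new hooks are used follows; the function names are our own additions named in the text, but the exact signatures are assumed here by analogy with ff_thread_report_progress.

/* In the thread that decodes view voidx, once its picture is complete: */
ff_thread_report_picture(&pic->f, voidx);

/* In a thread that needs the picture of view ref_voidx as an inter-view
 * reference, before accessing it: */
ff_thread_await_picture(&ref_pic->f, ref_voidx);

/* Both map the view to slot voidx + 2 of the extended progress array,
 * since slots 0 and 1 remain reserved for the two fields. */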

3.2.11 Configuring decoding with command line parameters

FFmpeg has the ability to specify options for an implementation of a codec. These options are accessible from the command line. In the previously described struct AVCodec, the AVClass is specified in the following field.

.priv_class = &h264_class,

The declaration of the complete struct is shown in Figure 3.1.1. The struct AVClass specifies an array of AVOption elements.

We added two options and associated them with fields in the H264Context. This is possible since the struct H264Context is special in terms of FFmpeg options handling: its first field is of type AVClass and contains the class that specifies the corresponding options. We added the following parameters to have better control of the extended functionality of FF_H264.


static const AVOption h264_options[] = {
    { "decode_all_views",
      "decode all views of the MVC stream or only the necessary ... and skip the others",
      offsetof(H264Context, decode_all_views),
      FF_OPT_TYPE_INT, { .i64 = 0 }, 0, 1,
      AV_OPT_FLAG_VIDEO_PARAM | AV_OPT_FLAG_DECODING_PARAM,
      "h264_mvc" },
    ...
};

Figure 3.2.5: Declaration of the command line parameters in FFmpeg.

target_view_index This parameter specifies the output views of the decoding process and their order. The decode_frame function can either output a single frame corresponding to one view or a list of frames corresponding to a list of views. This parameter has a string value, which has to be either a single number or an array of the form "{1, 3, 2, ...}". Each number has to be in the range from 0 to MAX_VIEW_COUNT, and the size of the array has to be in the range from 1 to MAX_VIEW_COUNT. Each number corresponds to the index of a view (voidx) and not to the view_id. The default value is 0, so that the output corresponds to the base view. Note that the specification of more than one view can cause segmentation faults, since the memory for the output of the decode_frame function has to be preallocated by the outer framework.

decode_all_views With this parameter, the decoding of selected views, as described in Section 3.2.10, can be turned on or off. This means that either all views or only the selected views are decoded. By default, it is turned on, so only the selected views are decoded. The parameter has an integer value (either 0 or 1).

If a given parameter is not in the correct range or cannot be parsed, it is set to its default value. For the parameter target_view_index, we implemented a new parsing routine for the array, since we have to take into account that AVOption only supports one of the following types: integers, doubles, strings and binary. In Figure 3.2.5, the declaration of decode_all_views is shown as an example.
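A hypothetical sketch of such an array parsing routine follows; it only illustrates the idea, and the real routine in our patch differs in detail.

#include <stdlib.h>

/* Hypothetical sketch: parse "3" or "{1, 3, 2}" into view order indices.
 * Returns the number of parsed indices, or -1 on a range error. */
static int parse_target_views(const char *str, int *out, int max_views)
{
    int n = 0;
    while (*str && n < max_views) {
        if (*str >= '0' && *str <= '9') {
            char *end;
            long v = strtol(str, &end, 10);
            str = end;
            if (v < 0 || v >= max_views)
                return -1;       /* out of range: caller falls back to 0 */
            out[n++] = (int)v;
        } else {
            str++;               /* skip '{', '}', ',' and spaces */
        }
    }
    return n;
}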


Chapter 4

Experimental analysis and optimization

In this chapter, we explore optimizations for the coding of multi-view sequences in the form of quality-complexity trade-offs. In particular, we analyze these dependencies for different types of sequences (e.g., sequences with fast or slow motion) and prediction structures (or camera arrangements), as illustrated in Section 4.2. We expect that the coding quality of MVV coding depends significantly on, first, the prediction schemes used for inter-view prediction and, second, the characteristics of the captured scene.

Further, as we assume that the coding is part of an MVV communication system, there is no need to display the decoded views directly to the user. Instead, the displayed view can be a virtual view, calculated from the transmitted views. We illustrate the effects of the dependencies mentioned above on virtual views in Section 4.3.

In our experiments, we use a standard-conforming H.264/MVC coder, namely the MVC reference software JMVC 8.5. Encoding is done using typical settings for MVC parameters, such as variable block size, a search range of ±96 and CABAC enabled. We do not use rate control, since rate control is not provided by JMVC. Furthermore, the hierarchical-B temporal prediction structure with a GOP size of 12 and quantization parameters (QP) of 22, 32 and 42 are applied. We decided to use JMVC as a decoder for our experiments for the following reasons: first, FFmpeg attains a lower quality of the decoded video than the reference implementation, and second, due to errors in our implementation, visual artifacts occasionally appear in decoded views other than the base view.

4.1 Quality metric

In our experiments, we use the Peak Signal-to-Noise Ratio (PSNR) as a metric to express the spatial video quality. This metric is commonly used in rate-distortion-theoretic optimizations of image and video compression algorithms [35]. The


[Figure: views i and i+1 pass through quantization and encoding/decoding; the MSE/PSNR is computed between the pictures I(m,n) and O(m,n).]

Figure 4.1.1: Quality metric for spatial 3D-video quality comparisons.

PSNR is measured in dB and is defined as:

PSNR = 10 · log10( 255^2 / D )

In the above, the distortion D of a video picture (frame) is measured using the Mean Square Error (MSE) and set in relation to the maximum achievable value. The MSE objectively quantifies the distortion or quality reduction in a compressed video picture and is defined as follows.

D = ( Σ_{m=1}^{M} Σ_{n=1}^{N} (P(m,n) − O(m,n))^2 ) / (M · N)                        (4.1)

Thereby, the term P(m,n) describes a pixel of the decompressed picture, O refers to the raw original picture, and the resolution of the corresponding pictures is M × N. In our scenarios, the distortion D is introduced in a video picture by applying a particular quantization step q during encoding. The MSE captures the distortion D resulting from the quantization error and allows comparing the performance of different algorithms or prediction schemes, as shown in Figure 4.1.1 and in Figure 4.3.1. In H.264, the quantization step q is included in the transform calculation and is effected by a multiplication with a quantization matrix. This matrix is obtained from the quantization parameter QP, and the underlying calculation scheme is specified in the standard; a detailed explanation is given in [27]. Larger values of QP result in a higher distortion and a smaller bit rate.
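For reference, the metric amounts to the following self-contained helper for 8-bit luma pictures; this is our own illustration of Equation 4.1, not part of the decoder.

#include <math.h>
#include <stdint.h>

/* PSNR of an 8-bit picture, following Equation 4.1. */
double psnr(const uint8_t *orig, const uint8_t *dec, int width, int height)
{
    double mse = 0.0;
    int i;
    for (i = 0; i < width * height; i++) {
        double d = (double)orig[i] - (double)dec[i];
        mse += d * d;                 /* accumulate squared pixel error */
    }
    mse /= (double)(width * height);  /* Equation 4.1: mean over M * N  */
    return 10.0 * log10(255.0 * 255.0 / mse);  /* in dB */
}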


         S0  S1  S2  S3  S4  S5  S6  S7
Anchor:  I   I   I   I   I   I   I   I
IPP:     I   P   P   P   P   P   P   P
IBP:     I   B1  P   B1  P   B1  P   P
IBBP:    I   B2  B1  B2  P   B2  B1  B2

Figure 4.2.1: Prediction schemes of the anchor views.

4.2 Coding optimization

We conducted the experiments for three MVV test datasets:

• “Ballet” [43] with a resolution of 1024x768 px at a frame rate of 15 fps.

• “Ballroom”, “Exit” and “Vassar” [19] with a resolution of 640x480 px at a frame rate of 25 fps.

• “Kendo” and “Ballons” [4] with a resolution of 1024x768 px at a frame rate of 30 fps.

The first four sequences consist of eight camera views arranged in a horizontal array and are recommended by the Joint Video Team (JVT) for testing new coding algorithms [34]. The last two sequences consist of seven cameras arranged in a horizontal array, whereby the first and the last camera have a lower resolution and are only used to perform depth estimation. Therefore, we only use cameras 2–6.

In the first experiment, the first frame of the sequence “Ballroom” is encoded and decoded with different prediction schemes, as described in Figure 4.2.1. We plot PSNR-Y values over bit rates averaged over all views. In the second experiment, the first frames of the three sequences “Ballroom”, “Exit” and “Ballet” are encoded with the prediction scheme IBP (Fig. 4.2.1) and the PSNR-Y values are plotted over bit rates averaged over all views. In the third experiment, the first 25 frames of all sequences mentioned above are encoded with different prediction schemes and different quantization parameters. The PSNR-Y values and the bit rates are plotted over the prediction schemes to visualize dependencies.


[Figure: rate-distortion curves, PSNR [dB] over bit rate [kb/s], for the prediction schemes IBBP, IBP, IPP and Anchor.]

Figure 4.2.2: Comparison of different prediction schemes applied to the anchor pictures.

4.2.1 Prediction scheme dependencies

In this experiment, the first frame of the sequence “Ballroom” is encoded and decoded with different prediction schemes, as described in Figure 4.2.1. In Figure 4.2.2, the PSNR-Y values are plotted over bit rates averaged over all views. Each graph for the corresponding prediction scheme consists of three data points generated with three different quantization parameters (QP 22, QP 32, QP 42). Thus, we effectively compare different prediction structures applied to the anchor pictures only, and measure the coding gains against the baseline of coding all anchor pictures as I-pictures (referred to as Anchor in Figure 4.2.1). The results illustrate and quantify the gain achieved by using inter-view prediction.

4.2.2 Scene dependencies

In this experiment, the first frames of the three sequences “Ballroom”, “Exit” and “Ballet” are encoded with the same prediction scheme, namely IBP, as shown in Figure 4.2.1. Similar to our previous experiment, in Figure 4.2.3 the PSNR-Y values are plotted over bit rates averaged over all views. Each graph consists of three data points generated with three different quantization parameters (QP 22, QP 32, QP 42). The results indicate that there is a significant scene dependence in MVC coding efficiency. In particular, test sequences characterized by a large camera baseline are more difficult to encode, which leads to a larger total bit rate (when encoding them at the same quality).


[Figure: rate-distortion curves, PSNR [dB] over bit rate [kb/s], for the sequences ballroom, exit and ballet.]

Figure 4.2.3: Comparison of different sequences with the same prediction scheme.

4.2.3 Relation of prediction scheme and scene dependencies

In this experiment, we used the first 25 frames of the sequences “Ballroom”, “Exit”, “Vassar” and “Ballet” with 8 cameras each, and “Kendo” and “Ballons” with 5 cameras. These sequences are encoded with different prediction schemes (Section 4.2) and different quantization parameters (QP 22, QP 32 and QP 42). We averaged the PSNR values and the bit rates over all views. In Figure 4.2.4, the PSNR-Y values are plotted over the prediction schemes, while in Figure 4.2.5, the bit rates are plotted over the prediction schemes. Each graph consists of data points generated with different schemes for a fixed quantization parameter (QP 22, QP 32, QP 42). The results indicate that there are probably no dependencies between prediction schemes and scene characteristics in terms of image quality. In terms of bit rate, all sequences behave similarly. Thus, there seems to be no clearly visible dependency other than on the resolution, since “Ballroom”, “Exit” and “Vassar” have a resolution of 640x480 px while “Ballet”, “Kendo” and “Ballons” have a resolution of 1024x768 px.


[Figure: three panels (a), (b) and (c) plotting PSNR [dB] over the prediction schemes Anchor, IPP, IBP and IBBP for the sequences ballet, ballroom, exit, vassar, ballons and kendo.]

Figure 4.2.4: PSNR dependencies between sequences and prediction schemes, for a fixed quantization parameter of: (a) QP 22, (b) QP 32 and (c) QP 42.

[Figure: three panels (a), (b) and (c) plotting bit rate [kb/s] over the prediction schemes Anchor, IPP, IBP and IBBP for the sequences ballroom, ballons, ballet, exit, kendo and vassar.]

Figure 4.2.5: Bit rate dependencies between sequences and prediction schemes, for a fixed quantization parameter of: (a) QP 22, (b) QP 32 and (c) QP 42.


[Figure: block diagram in which a virtual view is rendered from views i and i+1 once before and once after encoding/decoding with quantization; the two rendered pictures I(m,n) and O(m,n) are compared via MSE/PSNR.]

Figure 4.3.1: Quality metric for virtual view quality comparisons.

4.3 Impact on virtual view rendering

We expect that the multi-view decoding is part of an MVV communication system. Thus, we can assume that the views displayed to the user can be virtual views, e.g., to allow for a wider range of viewing angles. Since only the decoded views are available on the receiver side, the rendering of virtual views has to be performed on these decoded views. To test the influence of different quantizations and prediction schemes on virtual view rendering, we used the software of [12], which is able to perform real-time depth estimation and view rendering. We rendered virtual views before and after the coding, and compared them to each other, as shown in Figure 4.3.1.
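To illustrate this evaluation pipeline in code, the following sketch (our own toy illustration; the actual depth estimation and rendering is performed by the software of [12]) renders a virtual view once from the original views and once from the decoded views and compares the two renderings with the PSNR routine sketched in Section 4.1. The render() function here is a purely hypothetical stand-in, a per-pixel average of the neighboring views:

#include <stdint.h>
#include <stdio.h>

#define W 384  /* "Head and Lamp" resolution */
#define H 288

double psnr(const uint8_t *P, const uint8_t *O, int M, int N); /* Section 4.1 sketch */

/* Hypothetical stand-in for the view renderer: a per-pixel average of the
 * two neighboring views. The real system of [12] estimates depth and warps. */
static void render(const uint8_t *left, const uint8_t *right, uint8_t *virt)
{
    for (int i = 0; i < W * H; i++)
        virt[i] = (uint8_t)(((int)left[i] + (int)right[i] + 1) / 2);
}

/* Render the virtual view once from the original views and once from the
 * decoded views, then compare the two renderings (cf. Figure 4.3.1). */
void evaluate_virtual_view(const uint8_t *orig_l, const uint8_t *orig_r,
                           const uint8_t *dec_l,  const uint8_t *dec_r)
{
    static uint8_t ref[W * H], test[W * H];
    render(orig_l, orig_r, ref);   /* reference: rendered from originals */
    render(dec_l,  dec_r,  test);  /* test: rendered from decoded views */
    printf("virtual-view PSNR: %.2f dB\n", psnr(test, ref, H, W));
}

Note that the reference picture in this comparison is itself a rendered view, so rendering errors common to both pipelines cancel out and the metric isolates the effect of coding.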

We conducted the experiments on the stereo test dataset “Head and Lamp” provided by the University of Tsukuba [21], with a resolution of 384x288 px. The sequence consists of five camera views arranged in a horizontal array and is one of the most well-known stereo datasets [16].


[Figure: PSNR [dB] over the quantization parameter (22, 32, 42) for the original views and the virtual view.]

Figure 4.3.2: PSNR of virtual views in comparison to decoded views.

4.3.1 Quantization impact on virtual view rendering

In this experiment, we used the first two views of the sequence “Head and Lamp” with the prediction scheme “Anchor”, as shown in Figure 4.2.1. We use the “Anchor” scheme in order to avoid influences of prediction schemes. We encoded the views with three different quantization parameters (QP 22, QP 32, QP 42) and rendered a virtual view exactly in between these two views before and after coding, as shown in Figure 4.3.1.

In Figure 4.3.2, the PSNR-Y values are plotted over the quantization parameters. Thereby, one graph corresponds to the virtual view and the other graph to the average of the two views that are used to render the virtual view. The result indicates that the quantization has a significant impact on rendering quality. Even with the lowest quantization (QP 22), the quality is poor. We explain this by the image blurring that is introduced by quantization: the blurred images have less sharp edges, and therefore the depth estimation is less accurate.


Figure 4.3.3: Visual comparison of quantization results, “Tsukuba” camera 1: (a) Ground truth (original view); (b) Quantized with QP = 22; (c) Quantized with QP = 32; (d) Quantized with QP = 42.

4.3.2 Visual comparison of the rendering results

Since PSNR and human judgement of image quality may not correlate well [18, 33], we also perform a visual comparison between different rendering results. In Figure 4.3.3, different rendered pictures are shown as follows: Figure 4.3.3(a) corresponds to the virtual view rendered from the original pictures, while Figures (b), (c) and (d) correspond to renderings from the pictures coded with a quantization of QP 22, QP 32 and QP 42, respectively. Since our quality metric measures differences per pixel, the effects described below explain the low PSNR values obtained in the last experiment in Figure 4.3.2. We can see a difference between Figure 4.3.3(a) and the other figures. In particular, Figures (b), (c) and (d) have a higher contrast and the background appears darker. Nevertheless, the rendering (b) still has a good quality and the rendering (c) an acceptable quality. Although visual artifacts at object edges are clearly visible in all rendered images, they are only visually disturbing in the rendering (d).

Figure 4.3.4 shows a zoom-in on the image region containing the lamp, where the quality around edges (the separation of foreground and background) can be analyzed.


Figure 4.3.4: Detailed visual comparison of quantization results, “Tsukuba” camera 1: (a) Ground truth (original view); (b) Quantized with QP = 22; (c) Quantized with QP = 32; (d) Quantized with QP = 42.

The image arrangement is the same as in Figure 4.3.3, so that (a) is rendered from the original pictures and (b), (c) and (d) from quantized pictures. At this zoom level we can clearly observe artifacts at the edge of the lamp, due to the fact that foreground and background pixels intermix more strongly as the quantization is increased.


[Figure: PSNR [dB] over the quantization parameter (22, 32, 42) for the prediction schemes IBP, IPP and Anchor.]

Figure 4.3.5: Comparison of different prediction schemes for virtual view rendering.

4.3.3 Impact of prediction schemes on virtual view rendering

Since we stated that the video quality depends on the prediction schemes, we repeated the above experiment with different prediction schemes as described in Figure 4.2.1. Given that only 5 views are available, we used only the first three schemes “Anchor”, “IPP” and “IBP”. In this experiment, we used all five views of the sequence “Head and Lamp” with these three prediction schemes. We encoded the views with three different quantization parameters (QP 22, QP 32, QP 42) and rendered a virtual view exactly in between the first two views before and after coding, as shown in Figure 4.3.1.

In Figure 4.3.5, the PSNR-Y values are plotted over the quantization parameters. Each graph for the corresponding prediction scheme consists of three data points generated with three different quantization parameters (QP 22, QP 32, QP 42). Given that the rendering quality changes little across quantization parameters, as shown in Section 4.3.1, it is not surprising that different prediction schemes give similar results: in terms of quality, they only introduce further quantization (as mandated by the coding of hierarchical B-pictures).


[Figure: decoding time relative to real time [%] for the sequences ballroom, exit and ballet.]

Figure 4.4.1: Comparison of decoding times for MVV sequences with the JMVC reference decoder.

4.4 Decoding speed with JMVC

In this experiment, the first 25 frames of the three sequences “Ballroom”, “Exit” and “Ballet” are encoded with the same prediction scheme, namely IBP, as shown in Figure 4.2.1. These encoded sequences are decoded with the decoder provided in the JMVC reference software (version 8.5). In Figure 4.4.1, the decoding time is measured and plotted as a percentage of the duration of the sequence, so that 100% would mean real-time performance. As we can see, the reference decoder for MVC does not achieve real-time performance even for sequences with small resolutions, and the performance is resolution dependent. In contrast, with our implementation the decoding takes 18 ms on average (for all 8 views). Since the decoded sequence consists of 8 views at a frame rate of 25 fps, this means that our implementation achieves real-time performance, and on average it is able to decode frames faster than real time.
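For reference, the real-time criterion underlying Figure 4.4.1 can be made explicit: at a frame rate of f fps, the per-frame time budget is 1/f, and a decoder is real-time capable if its average per-frame decoding time stays below this budget. With the numbers above:

t_{\text{budget}} = \frac{1}{25\,\text{fps}} = 40\,\text{ms}, \qquad \frac{t_{\text{decode}}}{t_{\text{budget}}} = \frac{18\,\text{ms}}{40\,\text{ms}} = 45\,\%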


Chapter 5

Conclusion

In our work, we focus on the design and implementation of a real-time H.264 MVC-compliant decoder based on the FFmpeg framework. We extended the H.264 decoder implementation in FFmpeg (FF_H264) with a multi-view decoding capability and achieved decoding times of up to 40 ms per frame (for all 8 views) on a commodity desktop computer. On average, the decoding of 8 views takes 18 ms. Since we decoded a sequence with 8 views at a frame rate of 25 fps, this means that our implementation achieves real-time performance and, on average, is able to decode frames faster than real time. In detail, we cover the components of the H.264 implementation of FFmpeg that were missing with respect to MVC as follows.

First, we extended the parsing routines to be able to handle MVC-compliant bitstreams. To achieve this, we changed the routine for NAL unit header parsing and implemented parsing routines for the new NAL unit types: Prefix NAL unit, Coded slice extension, Sequence parameter set extension and Subset sequence parameter set (a sketch of the corresponding header layout is given below). For this, we used the internal slice type mapping, which we also extended. In addition, we modified the order in which the NAL units are applied, to support the Prefix NAL unit. Furthermore, we treated the Subset SPS as a new class of SPS. Finally, to be able to store MVC-relevant information, we modified H264Context and additional related structs.
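The following sketch illustrates the header layout these routines deal with: the three bytes that follow the one-byte NAL unit header of Prefix NAL units (type 14) and Coded slice extensions (type 20) carry the MVC extension fields specified in Annex H of the standard. The struct and function names are our own illustration and do not correspond to the FFmpeg identifiers:

#include <stdint.h>

/* Fields of the NAL unit header MVC extension (H.264 Annex H). */
typedef struct MVCHeader {
    uint8_t  non_idr_flag;    /* u(1) */
    uint8_t  priority_id;     /* u(6) */
    uint16_t view_id;         /* u(10) */
    uint8_t  temporal_id;     /* u(3) */
    uint8_t  anchor_pic_flag; /* u(1) */
    uint8_t  inter_view_flag; /* u(1) */
} MVCHeader;

/* Parse the 3 bytes following the NAL unit header. The first bit is
 * svc_extension_flag and must be 0 for MVC; returns 0 on success. */
int parse_mvc_header(const uint8_t *p, MVCHeader *h)
{
    uint32_t b = ((uint32_t)p[0] << 16) | (p[1] << 8) | p[2]; /* 24 bits */
    if (b & 0x800000)              /* svc_extension_flag set: not MVC */
        return -1;
    h->non_idr_flag    = (b >> 22) & 0x01;
    h->priority_id     = (b >> 16) & 0x3F;
    h->view_id         = (b >>  6) & 0x3FF;
    h->temporal_id     = (b >>  3) & 0x07;
    h->anchor_pic_flag = (b >>  2) & 0x01;
    h->inter_view_flag = (b >>  1) & 0x01;
    /* the remaining bit is reserved_one_bit */
    return 0;
}

The view_id and the two flags parsed here correspond to the fields we added to H264Context (see Appendix A.1.4).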

Second, we extended the decoding routines. Specifically, we extended the DPB by adding view identifiers in addition to the picture identifiers and by increasing the sizes of the field Picture *picture in MpegEncContext, the field buffer in AVCodecInternal and the field released_buffers in PerThreadContext. Furthermore, we implemented a new reordering algorithm for the reference lists to support inter-view references. For this, we added a new inter-view list, to handle inter-view references similarly to short-term or long-term references. Additionally, we modified the creation of the default reference lists by appending the inter-view references at the end, as specified by the currently active Subset SPS (a sketch is given below).


In addition, we modified the reference-picture marking process to handle each view separately, while keeping track of inter-view references. Finally, we changed the size of the private data of ff_h264_decoder to use multiple H264Contexts and to manage the common content only in the base view.
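A minimal sketch of the idea behind the modified default-list construction, with simplified types and names of our own rather than the actual FFmpeg structures, could look as follows:

#define MAX_REFS 48

typedef struct Pic {
    int poc;     /* picture order count */
    int view_id; /* view the picture belongs to */
} Pic;

/* Build a default reference list: the temporal references come first, as in
 * plain H.264; the inter-view references announced by the active Subset SPS
 * are appended at the end. Returns the resulting list length. */
int build_default_list(Pic *list,
                       Pic *const *temporal,   int n_temporal,
                       Pic *const *inter_view, int n_inter_view)
{
    int n = 0;
    for (int i = 0; i < n_temporal && n < MAX_REFS; i++)
        list[n++] = *temporal[i];
    for (int i = 0; i < n_inter_view && n < MAX_REFS; i++)
        list[n++] = *inter_view[i]; /* inter-view references go last */
    return n;
}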

Third, we investigated multi-threading. To this end, we extended our parsing routines to accept two H264Contexts, the main and the thread context, in order to support slice-based threading. To support frame-based threading, we modified the waiting conditions of referencing macroblocks and added waiting conditions for referenced views. In our current implementation, slice-based threading seems to function correctly, while frame-based threading still leads to problems. In addition, we specified two additional AVOptions, namely target_view_index and decode_all_views, to extend the command line interface with additional configuration parameters (a sketch is given below).
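A sketch of how these two options could be declared with FFmpeg's AVOption mechanism is given below; the option names are the ones introduced above, while the backing field decode_all_views and the exact table layout are assumptions of this illustration:

#include <limits.h>
#include <stddef.h>
#include "libavutil/opt.h"

#define OFFSET(x) offsetof(H264Context, x)
#define VD (AV_OPT_FLAG_VIDEO_PARAM | AV_OPT_FLAG_DECODING_PARAM)

/* Sketch: option table for the MVC decoder extension. */
static const AVOption h264_mvc_options[] = {
    { "target_view_index", "index of the view to be output",
      OFFSET(target_voidx), AV_OPT_TYPE_INT, { .i64 = 0 }, 0, INT_MAX, VD },
    { "decode_all_views", "decode all views instead of only the target view",
      OFFSET(decode_all_views), AV_OPT_TYPE_INT, { .i64 = 1 }, 0, 1, VD }, /* assumed field */
    { NULL },
};

Such a table is typically exposed through an AVClass installed as the decoder's priv_class, which makes the options reachable from the command line.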

In addition to the implementation, we have performed experiments on quality-complexity trade-offs. To this end, we conducted experiments with a standard-conforming H.264/MVC codec (JMVC 8.5) using typical settings for MVC encoding. We varied the quantization parameters and prediction schemes and applied them to a wide range of MVV sequences. Due to the fact that FFmpeg attains a lower quality of the decoded video than the reference implementation, and due to errors in our implementation that cause visual artifacts to occasionally appear in the multi-view pictures, we decided to use JMVC as the decoder for our experiments. Our findings are as follows. We showed that there is a gain in compression efficiency by using inter-view prediction compared to simulcast. Furthermore, we found a significant scene dependence in MVC coding efficiency, as a consequence of which sequences characterized by a large camera baseline are more difficult to encode. We further showed that while there seem to be no dependencies between prediction schemes and scene characteristics, there are strong dependencies between prediction schemes and resolutions. We found that inter-view prediction operates more efficiently on pictures with high resolutions. Finally, we performed an analysis of the impact of quantization on the rendering of virtual views. Our experiments showed that quantization has a significant impact on rendering quality, which we explain by the image blurring that quantization introduces. In contrast, prediction schemes have less impact on rendering quality.

5.1 Open issues

Our current implementation has the following limitations.

• Decoding of the first inter-view picture fails, since there seems to be a corrupt CABAC symbol (Context Adaptive Binary Arithmetic Coding is described in detail in [28]). Our analysis of CABAC decoding in FFmpeg suggests that this implementation is standard-compliant. Thus, we have so far not been able to determine the cause of this bug. As a consequence of the bug, some artifacts are visible in parts of the decoded video frames.


• The bitstream produced by the JMVC 8.5 encoder seems to be corrupt: referencing beyond anchor pictures in either direction is not allowed by the standard, but nevertheless appears in the bitstream. To remedy this, our current implementation does not remove the reference pictures when it encounters an anchor view in the bitstream.

• Slice-based threading works fine in our implementation, but frame-based threading leads to a segmentation fault. In addition, our current implementation does not support threading using SEI messages. We believe that incorporating this technique would lead to a yet faster implementation. We also believe that a correct implementation of frame-based threading would outperform real-time decoding in the given setup.

5.2 Future work

Our current implementation could be extended in the following directions.

• Porting our changes to the new version of FFmpeg, which is still in development and at the moment in an unstable state [22]. Since work is ongoing on a more standalone version, which uses fewer shared functions and structs, the integration of MVC could become much cleaner.

• Changing the entire FFmpeg framework (not only the decoder implementation) to support multiple views.

• Changing FFplay to support the playout of multiple views.


Appendix A

Code snippets

A.1 Structs

A.1.1 The struct AVCodec

typedef struct AVCodec {
    const char *long_name;
    enum AVMediaType type;
    enum CodecID id;
    int capabilities;
    const AVRational *supported_framerates;
    const enum PixelFormat *pix_fmts;
    const int *supported_samplerates;
    const enum AVSampleFormat *sample_fmts;
    const uint64_t *channel_layouts;
    uint8_t max_lowres;
    const AVClass *priv_class;
    const AVProfile *profiles;
    int priv_data_size;
    struct AVCodec *next;
    int (*init_thread_copy)(AVCodecContext *);
    int (*update_thread_context)(AVCodecContext *dst, const AVCodecContext *src);
    void (*init_static_data)(struct AVCodec *codec);
    int (*init)(AVCodecContext *);
    int (*encode)(AVCodecContext *, uint8_t *buf, int buf_size, void *data);
    int (*encode2)(AVCodecContext *avctx, AVPacket *avpkt,
                   const AVFrame *frame, int *got_packet_ptr);
    int (*decode)(AVCodecContext *, void *outdata, int *outdata_size,
                  AVPacket *avpkt);
    int (*close)(AVCodecContext *);
    void (*flush)(AVCodecContext *);
} AVCodec;


A.1.2 The struct AVCodecParser

typedef struct AVCodecParser {
    int codec_ids[5]; /* several codec IDs are permitted */
    int priv_data_size;
    int (*parser_init)(AVCodecParserContext *s);
    int (*parser_parse)(AVCodecParserContext *s, AVCodecContext *avctx,
                        const uint8_t **poutbuf, int *poutbuf_size,
                        const uint8_t *buf, int buf_size);
    void (*parser_close)(AVCodecParserContext *s);
    int (*split)(AVCodecContext *avctx, const uint8_t *buf, int buf_size);
    struct AVCodecParser *next;
} AVCodecParser;

A.1.3 The struct AVCodecContext

This is an excerpt showing the fields of interest of the struct AVCodecContext.

typedef struct AVCodecContext {
    const AVClass *av_class;
    int log_level_offset;
    enum AVMediaType codec_type; /* see AVMEDIA_TYPE_xxx */
    struct AVCodec *codec;
    char codec_name[32];
    enum CodecID codec_id; /* see CODEC_ID_xxx */
    unsigned int codec_tag;
    unsigned int stream_codec_tag;
    void *priv_data;
    struct AVCodecInternal *internal;
    uint8_t *extradata;
    int extradata_size;
    AVRational time_base;
    int ticks_per_frame;
    int delay;
    int width, height;
    int (*get_buffer)(struct AVCodecContext *c, AVFrame *pic);
    void (*release_buffer)(struct AVCodecContext *c, AVFrame *pic);
    int (*execute)(struct AVCodecContext *c,
                   int (*func)(struct AVCodecContext *c2, void *arg),
                   void *arg2, int *ret, int count, int size);
    void *thread_opaque;
    int profile;
    int level;
    ...
} AVCodecContext;


A.1.4 The struct H264Context

This is an excerpt showing the fields of interest of the struct H264Context.

typedef struct H264Context {
    MpegEncContext s; ///< AVClass is first field
    H264DSPContext h264dsp;
    SPS sps; ///< current SPS
    PPS pps;
    int slice_num;
    int slice_type;
    int slice_type_nos;
    int slice_type_fixed;
    unsigned int ref_count[2];
    unsigned int list_count;
    uint8_t *list_counts;
    Picture ref_list[2][48];
    int ref2frm[MAX_SLICES][2][64];
    SPS *sps_buffers[MAX_SPS_COUNT];
    PPS *pps_buffers[MAX_PPS_COUNT];
    int frame_num;
    int prev_frame_num;
    int curr_pic_num;
    int max_pic_num;
    Picture *short_ref[32];
    Picture *long_ref[32];
    Picture default_ref_list[2][32];
    struct H264Context *thread_context[MAX_THREADS];
    ...

    // H264Context extension
    // EDIT for MVC support @author: Jochen Britz
    // reference list
    Picture *inter_ref_list[MAX_VIEW_COUNT];
    int inter_ref_count;
    uint8_t default_list_done;
    // sub_sps_buffer
    SPS *sub_sps_buffers[MAX_SPS_COUNT];
    // NAL MVC extension
    int view_id;
    uint8_t anchor_pic_flag;
    uint8_t inter_view_flag;
    // for internal handling of MVC (not in the standard)
    int target_voidx_array[MAX_VIEW_COUNT]; /* type missing in the excerpt; int assumed */
    uint8_t is_mvc;
    int base_view_id;
    int target_voidx;
    struct H264Context *mvc_context[MAX_VIEW_COUNT];
    uint8_t prefix_nal_present;
    ...
    // END EDIT
} H264Context;


A.2 Functions

A.2.1 The find_frame_end function

The find_frame_end function is able to find the end of frames (slices) in an H.264-compliant bitstream. In the case of an MVC-compliant bitstream, it is only able to find the end of the last view belonging to one frame, such that all views of an MVC frame are found at once. The following function is the modified version of the find_frame_end function that is also able to find the ends of the individual MVC view components.

int ff_h264_find_mvc_frame_end(H264Context *h, const uint8_t *buf, int buf_size)
{
    int i, j, extra_bytes;
    uint32_t state;
    ParseContext *pc = &(h->s.parse_context);
    int next_avc = h->is_avc ? 0 : buf_size;

    state = pc->state;
    extra_bytes = 0;
    if (state > 13)
        state = 7;

    if (h->is_avc && !h->nal_length_size)
        av_log(h->s.avctx, AV_LOG_ERROR, "AVC-parser: nal length size invalid\n");

    for (i = 0; i < buf_size; i++) {
        if (i >= next_avc) {
            int nalsize = 0;
            i = next_avc;
            for (j = 0; j < h->nal_length_size; j++)
                nalsize = (nalsize << 8) | buf[i++];
            if (nalsize <= 0 || nalsize > buf_size - i) {
                av_log(h->s.avctx, AV_LOG_ERROR,
                       "AVC-parser: nal size %d remaining %d\n",
                       nalsize, buf_size - i);
                return buf_size;
            }
            next_avc = i + nalsize;
            state = 5;
        }

        if (state == 7) {
#if HAVE_FAST_UNALIGNED
            /* we check i < buf_size instead of i+3/7 because it is simpler
             * and there should be FF_INPUT_BUFFER_PADDING_SIZE bytes at the
             * end */
#if HAVE_FAST_64BIT
            while (i < next_avc &&
                   !((~*(const uint64_t *)(buf + i) &
                      (*(const uint64_t *)(buf + i) - 0x0101010101010101ULL)) &
                     0x8080808080808080ULL))
                i += 8;
#else
            while (i < next_avc &&
                   !((~*(const uint32_t *)(buf + i) &
                      (*(const uint32_t *)(buf + i) - 0x01010101U)) &
                     0x80808080U))
                i += 4;
#endif
#endif
            for (; i < next_avc; i++) {
                if (!buf[i]) {
                    state = 2;
                    break;
                }
            }
        } else if (state <= 2) {
            if (buf[i] == 1)
                state ^= 5;   // 2->7, 1->4, 0->5
            else if (buf[i])
                state = 7;
            else
                state >>= 1;  // 2->1, 1->0, 0->0
        } else if (state <= 5) {
            int v = buf[i] & 0x1F;
            if (v == 6 || v == 7 || v == 8 || v == 9 ||
                v == 13 || v == 14 || v == 15) {
                if (v == 14) {
                    /* Prefix NAL unit: skip the 3-byte MVC extension header */
                    i += 3;
                    extra_bytes = 3;
                }
                if (pc->frame_start_found) {
                    i++;
                    goto found;
                }
            } else if (v == 1 || v == 2 || v == 5 || v == 20) {
                if (v == 20) {
                    /* Coded slice extension: skip the 3-byte MVC extension header */
                    i += 3;
                    extra_bytes = 3;
                }
                state += 8;
                continue;
            }
            state = 7;
        } else {
            h->parse_history[h->parse_history_count++] = buf[i];
            if (h->parse_history_count > 3) {
                unsigned int mb, last_mb = h->parse_last_mb;
                GetBitContext gb;

                init_get_bits(&gb, h->parse_history, 8 * h->parse_history_count);
                h->parse_history_count = 0;
                mb = get_ue_golomb_long(&gb);
                last_mb = h->parse_last_mb;
                h->parse_last_mb = mb;
                if (pc->frame_start_found) {
                    if (mb <= last_mb)
                        goto found;
                } else
                    pc->frame_start_found = 1;
                state = 7;
            }
        }
    }

    pc->state = state;
    if (h->is_avc)
        return next_avc;
    return END_NOT_FOUND;

found:
    pc->state = 7;
    pc->frame_start_found = 0;
    if (h->is_avc)
        return next_avc;
    return i - (state & 5) - 3 * (state > 7) - extra_bytes;
}


Bibliography

[1] Sebastian Bosse, Heiko Schwarz, Tobias Hinz, and Thomas Wiegand. Encoder control for renderable regions in high efficiency 3D video plus depth coding. In Picture Coding Symposium (PCS 2012), pages 129–132, Krakow, Poland, May 2012. IEEE.

[2] G. Cheung, A. Ortega, and N.M. Cheung. Interactive streaming of stored multiview video using redundant frame structures. Image Processing, IEEE Transactions on, 20(3):744–761, 2011.

[3] M. Flierl, T. Wiegand, and B. Girod. A locally optimal design algorithm for block-based multi-hypothesis motion-compensated prediction. In Data Compression Conference, 1998. DCC'98. Proceedings, pages 239–248. IEEE, 1998.

[4] Norishige Fukushima. Multi view sequence, April 2013. URL: http://nma.web.nitech.ac.jp/fukushima/multiview/multiview.html [cited April 14, 2013].

[5] Palanivel Guruvareddiar and Kieran Khunya. VideoLan SoC 2011 / stereo high profile MVC encoding, October 8, 2012. URL: http://wiki.videolan.org/SoC_2011/Stereo_high_profile_mvc_encoding [cited October 8, 2012].

[6] Antti Hallapuro, Miska Hannuksela, Jani Lainema, Marja Salmimaa, Kemal Ugur, Kai Willner, Ye-Kui Wang, and Ying Chen. 3D video & the MVC standard. Technical report, Nokia Research Center, 2009.

[7] Edson M Hung, Ricardo L De Queiroz, and Debargha Mukherjee. On macroblock partition for motion compensation. In Image Processing, 2006 IEEE International Conference on, pages 1697–1700. IEEE, 2006.

[8] S. Khattak, R. Hamzaoui, S. Ahmad, and P. Frossard. Low-complexity multiview video coding. In Picture Coding Symposium (PCS), 2012, pages 97–100. IEEE, 2012.

[9] W.S. Kim. 3-D video coding system with enhanced rendered view quality. PhD thesis, University of Southern California, 2011.


[10] E. Kurutepe, A. Aksay, C. Bilen, C.G. Gurler, T. Sikora, G.B. Akar, and A.M. Tekalp. A standards-based, flexible, end-to-end multi-view video streaming architecture. In Packet Video 2007, pages 302–307. IEEE, 2007.

[11] E. Kurutepe, M.R. Civanlar, and A.M. Tekalp. Client-driven selective streaming of multiview video for interactive 3DTV. Circuits and Systems for Video Technology, IEEE Transactions on, 17(11):1558–1565, 2007.

[12] Tobias Lange. Virtual view-rendering for 3D-video communication systems. Master's thesis, Saarland University, April 2013.

[13] Y.L. Lee, JH Hur, YK Lee, KH Han, S. Cho, N. Hur, J. Kim, JH Kim, PL Lai, A. Ortega, et al. CE11: Illumination compensation. Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, Document JVT-U052r2, pages 20–27, 2006.

[14] A. Luthra and P.N. Topiwala. Overview of the H.264/AVC video coding standard. In Proceedings of SPIE, volume 5203, page 417, July 2003.

[15] E. Martinian, A. Behrens, J. Xin, and A. Vetro. View synthesis for multiview video compression. In Picture Coding Symposium, volume 37, pages 38–39, 2006.

[16] Sarah Martull, Martin Peris, and Kazuhiro Fukui. Realistic CG stereo image dataset with ground truth disparity maps. PRMU 2012, 111(430):117–118, 2012.

[17] P. Merkle, K. Muller, A. Smolic, and T. Wiegand. Efficient compression of multi-view video exploiting inter-view dependencies based on H.264/MPEG4-AVC. In Multimedia and Expo, 2006 IEEE International Conference on, pages 1717–1720. IEEE, 2006.

[18] P. Merkle, A. Smolic, K. Muller, and T. Wiegand. Efficient prediction structures for multiview video coding. Circuits and Systems for Video Technology, IEEE Transactions on, 17(11):1461–1473, 2007.

[19] MERL. MVC test sequences, April 2013. URL: ftp://ftp.merl.com/pub/avetro/mvc-testseq/orig-yuv/ [cited April 22, 2013].

[20] Y. Morvan. Acquisition, Compression and Rendering of Depth and Texture for Multi-View Video. PhD thesis, Eindhoven University of Technology, 2009.

[21] Yuichi Nakamura, Tomohiko Matsuura, Kiyohide Satoh, and Yuichi Ohta. Occlusion detectable stereo-occlusion patterns in camera matrix. In Computer Vision and Pattern Recognition, 1996. Proceedings CVPR'96, 1996 IEEE Computer Society Conference on, pages 371–378. IEEE, 1996.


[22] Michael Niedermayer. FFmpeg, April 2013. URL: https://github.com/FFmpeg/FFmpeg [cited April 22, 2013].

[23] J.R. Ohm. Stereo/multiview video encoding using the MPEG family of standards. Invited Paper, Electronic Imaging, 99:30, 1999.

[24] P.B. Pandit, P. Yin, and Y. Su. Methods and apparatus for multi-view information conveyed in high level syntax. United States Patent, US 8,238,439, August 7, 2012.

[25] K. Ramchandran, A. Ortega, and M. Vetterli. Bit allocation for dependent quantization with applications to multiresolution and MPEG video coders. Image Processing, IEEE Transactions on, 3(5):533–545, 1994.

[26] I. Richardson. The H.264 advanced video compression standard. John Wiley & Sons Inc, 2010.

[27] Iain Richardson. An overview of H.264 Advanced Video Coding. Technical report, Vcodex Ltd, 2007.

[28] Iain Richardson. H.264/AVC Context Adaptive Binary Arithmetic Coding (CABAC). Technical report, Vcodex Ltd, 2011.

[29] H. Schwarz, C. Bartnik, S. Bosse, H. Brust, T. Hinz, H. Lakshman, D. Marpe, P. Merkle, K. Muller, H. Rhee, et al. 3D video coding using advanced prediction, depth modeling, and encoder control methods. In Picture Coding Symposium (PCS), 2012, pages 1–4. IEEE, 2012.

[30] H. Schwarz, D. Marpe, and T. Wiegand. Analysis of hierarchical B pictures and MCTF. In Multimedia and Expo, 2006 IEEE International Conference on, pages 1929–1932. IEEE, 2006.

[31] Heiko Schwarz, Detlev Marpe, and Thomas Wiegand. Overview of the scalable video coding extension of the H.264/AVC standard. Circuits and Systems for Video Technology, IEEE Transactions on, 17(9):1103–1120, 2007.

[32] M. Shafique, B. Zatt, S. Bampi, and J. Henkel. Power-aware complexity-scalable multiview video coding for mobile devices. In Picture Coding Symposium (PCS), 2010, pages 350–353. IEEE, 2010.

[33] Hamid Rahim Sheikh and Alan C Bovik. Image information and visual quality. Image Processing, IEEE Transactions on, 15(2):430–444, 2006.

[34] Y. Su, A. Vetro, and A. Smolic. Common test conditions for multiview video coding. JVT-T207, Klagenfurt, Austria, July 2006.

[35] Mihaela Van der Schaar and Philip A Chou. Multimedia over IP and wireless networks: compression, networking, and systems. Academic Press, 2007.


[36] A. Vetro. Representation and coding formats for stereo and multiview video. Intelligent Multimedia Communication: Techniques and Applications, pages 51–73, 2010.

[37] A. Vetro, T. Wiegand, and G.J. Sullivan. Overview of the stereo and multiview video coding extensions of the H.264/MPEG-4 AVC standard. Proceedings of the IEEE, 99(4):626–642, 2011.

[38] R.S. Wang and Y. Wang. Multiview video sequence analysis, compression, and virtual viewpoint synthesis. Circuits and Systems for Video Technology, IEEE Transactions on, 10(3):397–410, 2000.

[39] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G.J. Sullivan. Rate-constrained coder control and comparison of video coding standards. Circuits and Systems for Video Technology, IEEE Transactions on, 13(7):688–703, 2003.

[40] T. Wiegand and G.J. Sullivan. The H.264/AVC video coding standard [standards in a nutshell]. Signal Processing Magazine, IEEE, 24(2):148–153, 2007.

[41] Peter Wimmer. Stereoscopic player, October 8, 2012. URL: http://www.3dtv.at/Index_de.aspx [cited October 8, 2012].

[42] C. Zhang and T. Chen. A self-reconfigurable camera array. In Eurographics Symposium on Rendering, volume 4, page 6, 2004.

[43] C Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski. High-quality video view interpolation using a layered representation. In ACM Transactions on Graphics (TOG), volume 23, pages 600–608. ACM, 2004.
