Signal Processing 86 (2006) 3196–3211

Robust wireless video multicast based on a distributed source coding approach

M. Tagliasacchi (a), A. Majumdar (b), K. Ramchandran (b), S. Tubaro (a)

(a) Dipartimento di Elettronica e Informazione, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano, Italy
(b) EECS Department, University of California-Berkeley, Cory Hall, Berkeley, CA 94720, USA

Received 15 June 2005; received in revised form 1 December 2005; accepted 27 January 2006

Available online 3 May 2006

Abstract

In this paper, we present a scheme for robust scalable video multicast based on distributed source coding principles. Unlike prediction-based coders, like MPEG-x and H.26x, the proposed framework is designed specifically for lossy wireless channels and directly addresses the problem of drift due to packet losses. The proposed solution is based on the recently proposed PRISM (power-efficient robust syndrome-based multimedia coding) video coding framework [R. Puri, K. Ramchandran, PRISM: a new robust video coding architecture based on distributed compression principles, in: Allerton Conference on Communication, Control and Computing, Urbana-Champaign, IL, October 2002] and addresses SNR, spatial and temporal scalability. Experimental results show that substantial gains are possible for video multicast over lossy channels as compared to standard codecs, without a dramatic increase in encoder design complexity as the number of streams increases.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Video coding; Robust delivery; Scalability; Multicast over wireless networks

Parts of this work were presented in [1,2]. Corresponding author: M. Tagliasacchi, tel.: +39 2399 7373; fax: +39 2399 3413. E-mail addresses: [email protected] (M. Tagliasacchi), [email protected] (A. Majumdar), [email protected] (K. Ramchandran), [email protected] (S. Tubaro).

1. Introduction

Motivated by emerging multicast and broadcast applications for video-over-wireless, this paper addresses the robust scalable video multicast problem. Examples of such applications include broadcasting TV channels to cellphones, users sharing video content with others via their PDAs/cellphones, etc. Naturally, in a broadcast setting, each receiving device has its own constraints in terms of display resolution and battery life. Fig. 1 depicts this scenario, where each device receives a video stream corresponding to the desired spatial resolution, frame rate and quality. In order to target this class of applications, we need a video coding framework capable of addressing several competing requirements:

• Robustness to channel losses: The wireless medium is typically unreliable. For this reason we need to cope with medium to high probabilities of packet/frame losses.

[Fig. 1: Each device subscribes to a video stream fitting its characteristics in terms of spatio-temporal resolution and quality. On the right is shown the group of pictures (GOP) structure adopted in this paper. First, the base layer (I, B and P frames) is encoded. Then, a spatial enhancement layer (IP, BP and PP frames) is built on top of the base layer. Lastly, a temporal enhancement layer is added (TP1 and TP2). Solid arrows represent the motion vectors estimated at the base layer, which are also used as coarse motion information at the enhancement layer. The other arrows point to the frame used as reference to build the side information at the decoder.]


• Scalability in all dimensions: i.e. spatio-temporal and SNR (also referred to as rate or quality) scalability. In a multicast environment, the receiving devices are heterogeneous, resulting in the need for a flexible bit-stream that can adapt to the characteristics of the receiver. As recommended by the MPEG Ad Hoc group on scalable video coding, at least two levels of spatial and temporal scalability are desirable, along with SNR medium granularity scalability (MGS) [3].

• Lack of "state explosion" at the encoder: Scalability should not come at too steep a price in encoder complexity. This means that the encoder should not be forced to keep individual state, i.e. keep track of the different reconstructed sequences that can be generated at the several heterogeneous decoders, as is typical in a closed-loop DPCM framework such as MPEG.

• High coding efficiency: While achieving the other requirements, any video coding framework should be reasonably competitive with state-of-the-art non-scalable predictive coders, i.e. H.264/AVC [4].

State-of-the-art closed-loop video coders such as H.264/AVC are able to provide very high coding efficiency by adaptively exploiting a very accurate motion model on a block-by-block basis. Each block is coded with respect to a single deterministic predictor that is obtained by searching over a range of candidates from current and previously encoded frames. Furthermore, to avoid the well-known drift issue, the encoder needs to be in sync with the decoder. Although the coding efficiency of this scheme is very good as far as unicast streaming over a noiseless channel is concerned, it fails to meet the aforementioned requirements for video multicast over wireless:

• Being tied to a single predictor, closed-loop coders are inherently fragile in the face of channel loss. If the deterministic predictor used at the encoder is not available at the decoder, e.g. because of packet losses, drift occurs as encoder and decoder work on different data, and the errors propagate until the next intra-frame refresh is inserted.

• It is challenging to keep encoder and decoder synchronized while achieving scalability. Two solutions provided by the standards, MPEG4-FGS [5] and H.263+ [6], fail to fulfill the requirements stated above. MPEG4-FGS adopts a single-loop scheme, favoring a simple encoder design at the price of a coding efficiency loss. On the other hand, H.263+ uses a multiple-loop structure that takes into account the presence of different predictors at the heterogeneous decoders; consequently, H.263+ bit-streams suffer less of a coding efficiency hit over the non-scalable case. However, the multiple-loop structure adds complexity and limits the number of possible rates at which the stream can be decoded.

One approach to overcoming these limitations and combating both channel loss and scalability issues at once is to adopt a statistical rather than a deterministic mindset. This motivates the proposed scalable solution based on PRISM [7,8] (Power-efficient, Robust, hIgh compression, Syndrome-based Multimedia coding), a video coding framework built on distributed source coding principles. The PRISM codec is inherently robust to losses in the bit-stream and significantly outperforms standard video coders, such as H.263+, for transmission over packet loss channels [9]. Although the theoretical foundations of distributed source coding date back to the Slepian and Wolf [19] (for lossless compression) and Wyner and Ziv [10] (for lossy compression) theorems (see Section 2), PRISM represents a concrete instantiation of these concepts for video coding. In a distributed setting, when encoding two correlated variables X and Y, it is possible to perform separate encoding but joint decoding, provided that the encoder has access to their joint statistics. In this regard, the key aspect is that PRISM does not use the exact realization of the best motion-compensated predictor Y while encoding block X, but only the correlation statistics. If the correlation noise between any candidate predictor at the decoder and the current block is within the noise margin estimated at the encoder, the current block can be decoded. Informally speaking, PRISM sends specific bit-planes of the current block X, unlike predictive coders, which send information about the difference between the block and its predictor, i.e. X − Y. Consequently, in the PRISM framework, every time a block can be decoded it has an effect similar to that of intra-refresh (irrespective of any errors that may have occurred in the past). For predictive coders, on the other hand, once an error occurs the only way to recover is through an intra-refresh. Section 3 briefly reviews the main concepts of the PRISM framework.

Besides PRISM, other video coders based on distributed source coding techniques and exhibiting error resilience properties have also been proposed [11,12]. In [12] the input frames are divided into non-overlapping blocks, DCT transformed and quantized as in intra-frame coding. The Wyner–Ziv encoder sends parity bits of the source. The decoder receives these parity bits and uses them, together with the previously decoded frames as side information, to decode the current frame. A feedback channel is needed to inform the encoder when no more parity bits are needed. While PRISM performs decoding of each block independently, allowing for motion search at the decoder, in [12] the side information is built by pre-warping the reference frame according to coarse motion information. This motion model is obtained from a lower resolution, heavily quantized representation of the current frame as well as from intra-coded high frequency DCT coefficients.

Scalable video coding has been thoroughly investigated over the last few years. In order to overcome the aforementioned limitations that plague MPEG4-FGS and H.263+, the MPEG Ad Hoc group on scalable video coding has undertaken the study of the most promising technologies capable of addressing the scalability requirements while minimally compromising the coding efficiency vis-a-vis state-of-the-art non-scalable H.264/AVC codecs. The coding architecture that has been chosen to become the new standard is heavily built upon the syntax and tools of H.264/AVC and adopts a multi-layered approach [13,14], where each layer improves either the quality or the spatio-temporal resolution of the decoded sequence. The coding scheme we propose in this paper is partially inspired by this architecture, as it works in a multilayer fashion.

Recently, scalable video coders based on distributed source coding have been proposed in [15–17]. The algorithm of [15] is similar in philosophy to MPEG4-FGS and aims to provide a progressive bit-stream that can be decoded at any rate (within a certain range). In [16] the coding mode is adaptively switched between FGS and Wyner–Ziv on a block-by-block basis in order to take full advantage of the temporal correlation existing at the enhancement layer resolution. In [17] an SNR scalable extension of H.264/AVC is proposed where distributed coding is used to prevent "state explosion" at the encoder. With respect to these coding schemes, the proposed solution targets not only SNR but also spatial and temporal scalability. Moreover, building on the PRISM framework, we provide enhanced robustness.

As mentioned above, the proposed scalable video coding solution is built on the PRISM framework and is designed specifically to provide good performance in the face of channel losses. While the PRISM framework allows for a flexible distribution of complexity between encoder and decoder, in this paper we focus on the case where most of the motion compensation task is performed at the decoder and only part of the motion search is done at the encoder. This choice is motivated by the recent results of [18], wherein it was shown (under certain modeling assumptions) that the rate rebate obtained by doing extensive motion search at the encoder decreases as channel noise increases.

It is valid to question the utility of shifting complexity from the encoder to the decoder (or sharing it arbitrarily) when, in a codec solution, it is the sum of these complexities that is relevant. To address this, consider the following network configuration for the PRISM codec (see Fig. 2), introduced in [7]. Here, the uplink direction consists of a transmit station employing the motion-free low-complexity PRISM encoder interfaced to a PRISM decoder in the base station. The base station has a "trans-coding proxy" that efficiently tailors the decoded PRISM bit-stream for a high-complexity motion-based PRISM encoder, which is interfaced to a low-complexity motion-based PRISM decoder on the downlink. Alternatively, it could convert the decoded bit-stream into a standard bit-stream (e.g. that output by a standard MPEG encoder). The downlink then consists of a receiving station with a standard low-complexity video decoder. Under this architecture, the entire computational burden has been absorbed into the network device. Both end devices, which are battery constrained, run power-efficient encoding and decoding algorithms.

[Fig. 2: System-level diagram for a network scenario with low-complexity encoding and decoding devices.]

The paper is organized as follows. We start by summarizing the basic ideas behind Wyner–Ziv coding in Section 2 and the PRISM framework in Section 3. Section 4 describes the proposed architecture in detail, explaining how spatial, temporal and SNR scalability are achieved. Section 5 contains the results of the simulations carried out with the proposed coding architecture, emphasizing the robustness features.

2. Background on Wyner–Ziv

Consider the problem of communicating two correlated random variables X and Y taking values from a discrete finite alphabet. Separate entropy coding allows the communication of these variables at rates $R_X \ge H(X)$ and $R_Y \ge H(Y)$, where H(X) and H(Y) are the entropies of the two sources. It is obviously possible to do better by performing joint encoding, taking advantage of the fact that X and Y are correlated. For this case



information theory dictates that the achievable rate region for encoding the sources X and Y is

$$R_X + R_Y \ge H(X, Y), \quad R_X \ge H(X \mid Y), \quad R_Y \ge H(Y \mid X). \tag{1}$$

In a distributed source coding setting, the variables X and Y are separately encoded but jointly decoded. The Slepian–Wolf theorem [19] states that it is possible to attain the same achievable region, with a probability of erroneously decoding X and Y that goes to zero with increasing block length.
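As a quick numeric illustration (ours, with a made-up joint distribution, not taken from the paper), the following snippet evaluates the entropies that define region (1) for a pair of correlated binary sources:

```python
# A minimal numeric illustration of the Slepian-Wolf region (1): for a toy
# joint pmf p(x, y), compute the entropies that bound the achievable rates.
# The joint distribution below is a hypothetical example chosen for clarity.
import math

# p[x][y]: X and Y are binary and agree 90% of the time.
p = [[0.45, 0.05],
     [0.05, 0.45]]

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

px = [sum(row) for row in p]                               # marginal of X
py = [sum(p[x][y] for x in range(2)) for y in range(2)]    # marginal of Y
Hxy = H([p[x][y] for x in range(2) for y in range(2)])     # joint entropy H(X,Y)
Hx, Hy = H(px), H(py)
Hx_given_y = Hxy - Hy                      # chain rule: H(X|Y) = H(X,Y) - H(Y)

print(f"H(X) = {Hx:.3f}, H(X|Y) = {Hx_given_y:.3f}")
# Separate entropy coding needs H(X) + H(Y) = 2.0 bits total, while the
# Slepian-Wolf corner point needs only H(Y) + H(X|Y) = H(X,Y) ~ 1.469 bits.
print(f"H(X)+H(Y) = {Hx + Hy:.3f} vs H(X,Y) = {Hxy:.3f}")
```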

These results were extended to the lossy case by Wyner and Ziv [10] a few years later (for the case when Y is known perfectly at the decoder). Again, X and Y are two correlated random variables. The problem here is to decode X to its quantized reconstruction $\hat{X}$ given a constraint on the distortion measure $E[d(X, \hat{X})]$ when the side information Y is available only at the decoder. Let us denote by $R_{X|Y}(D)$ the rate-distortion function for the case when Y is also available at the encoder, and by $R^{WZ}_{X|Y}(D)$ that for the case when only the decoder has access to Y. The Wyner–Ziv theorem states that, in general, $R^{WZ}_{X|Y}(D) \ge R_{X|Y}(D)$, but $R^{WZ}_{X|Y}(D) = R_{X|Y}(D)$ for Gaussian memoryless sources with MSE as the distortion measure. In [20] it was proved that for $X = Y + N$, only the innovation N needs to be Gaussian for this result to hold.

For the problem of source coding with side information, the encoder needs to encode the source within a distortion constraint, while the decoder needs to be able to decode the encoded codeword subject to the correlation noise N (between the source and the side information). While the results proven by Wyner and Ziv are non-constructive and asymptotic in nature, a number of constructive methods have since been proposed, wherein the source codebook is partitioned into cosets of a channel code that is matched to the correlation noise N. The number of partitions or cosets depends on the statistics of N. The encoder communicates the coset index to the decoder. The decoder then decodes to the codeword in the coset that is jointly typical with the side information. Specifically, for the problem at hand, we use the concepts detailed in [21] and partition the source codebook into cosets of a multilevel code (as detailed in our earlier work [9] and briefly summarized in Section 3).
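A toy sketch of this coset construction (ours, not the paper's code): a scalar uniform quantizer whose bin indices are partitioned into cosets by a modulo rule. The encoder sends only the coset index (the "syndrome"); the decoder snaps to the bin in that coset nearest the side information. The step size and coset count are arbitrary choices for illustration.

```python
# Hedged sketch of syndrome coding with a scalar quantizer: the codebook
# (quantizer bins) is partitioned into cosets by bin index mod N_COSETS.
# Decoding succeeds when |X - Y| is small relative to the coset spacing,
# mirroring the requirement that the channel code be matched to the noise N.

STEP = 4.0      # quantizer step size (hypothetical)
N_COSETS = 4    # spacing between same-coset bins = STEP * N_COSETS = 16

def encode(x):
    """Quantize x and return only the coset index (the syndrome)."""
    q = round(x / STEP)          # full quantization index
    return q % N_COSETS          # send log2(N_COSETS) bits instead of the index

def decode(syndrome, y):
    """Pick the reconstruction in the given coset closest to side info y."""
    q_near = round(y / STEP)     # bin index suggested by the side information
    best = min((q_near + d for d in range(-N_COSETS, N_COSETS + 1)
                if (q_near + d) % N_COSETS == syndrome),
               key=lambda q: abs(q * STEP - y))
    return best * STEP

x, y = 17.3, 15.0                # correlated source and side information
print(decode(encode(x), y))      # 16.0: correct bin, since |x - y| < 16/2
```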

3. Background on PRISM

The PRISM video coder is based on a modified source coding with side information paradigm, in which there is inherent uncertainty in the state of nature characterizing the side information (a sort of "universal" Wyner–Ziv framework; see [22] for details). For the PRISM video coder, the video frame to be encoded is first divided into non-overlapping spatial blocks of size 8 × 8. The source X is then the current block to be encoded, while the side information Y is the best (motion-compensated) predictor for X in the previous frame(s), where it is assumed that X = Y + N. The encoder quantizes X and then performs syndrome encoding on the resulting quantized codeword; i.e. the encoder finds a channel code that is matched to the noise N and uses that channel code to partition the source codebook into cosets of the channel code. Intuitively, this means that we need to allocate a number of cosets (therefore, a number of bits) proportional to the noise variance. This noise can be modeled as the sum of three contributions: "correlation noise," due to the changing state of nature of the video sequence (illumination changes, camera noise, occlusions); quantization noise, since the side information available at the decoder is usually quantized; and channel noise, due to packet losses that might corrupt the side information. The encoder transmits the syndrome (indicating the coset for X) as well as a CRC (cyclic redundancy check) calculated on the quantization indices. In contrast to traditional hybrid video coding, it is the task of the decoder to perform the motion search: it searches over the space of candidate predictors, one by one, seeking a block from the coset labeled by the syndrome. When the decoded block matches the CRC, decoding is declared to be successful. In essence, the decoder tries successive versions of the side information Y until it finds one that permits successful decoding. Thus, the computational burden of motion estimation is shifted from the encoder to the decoder, so that the encoder is of the same order of complexity as frame-by-frame intra-coding.

3.1. Coding strategy

Encoder: The video frame to be encoded is divided into non-overlapping spatial blocks. (We choose blocks of size 8 × 8 in our implementations.)


We now list the main aspects of the encoding process.

1. Classification: Real video sources exhibit a spatio-temporal correlation noise structure whose statistics are highly spatially varying. Within the same frame, spatial blocks that are part of the scene background are highly correlated with their temporal predictor blocks ("small" N). On the other hand, blocks that are part of a scene change or occlusion have little correlation with the previous frame ("large" N). This motivates modeling the video source as a composite or mixture source, where the different components of the mixture correspond to sources with different correlation (innovation) noise. In our current implementation, we use 16 classes corresponding to different degrees of correlation, varying from maximum to zero correlation. These classes range from the SKIP mode at one extreme, where the correlation noise is so small that the block is not encoded at all, to the INTRA mode at the other, corresponding to correlation noise so high (poor correlation) that intra-coding is appropriate. The appropriate class for the block to be encoded is determined by thresholding the scalar mean squared error between the block and the co-located block in the previous frame. The thresholds T_p and T_{p+1} delimiting the pth class were chosen using offline training. The corresponding block correlation noise vector N^p is considered in the DCT domain, where it is modeled as a set of independent Laplacian random variables {N^p_1, N^p_2, N^p_3, ...}. The choice of this model was based on its success as reported previously in the literature [23] and on our experiments on statistical modeling of residue coefficients in the transform domain. These classes correspond to different quantization/syndrome channel code choices. The 4-bit classification/mode label for a block, based on the thresholding of its mean squared error with the co-located block in the previous frame, is included as part of the header information for use by the decoder.

[Fig. 3: Multilevel coset coding: partitioning the integer lattice into three levels. X is the source, U is the (quantized) codeword and Y is the side information. The number of levels in the partition tree depends on the effective noise between U and Y given X.]

2. Decorrelating transform: We apply a DCT to the source block. The transformed coefficients X are then arranged in a one-dimensional order by doing a zig-zag scan on the two-dimensional block.

3. Quantization: The scanned transform coefficients are scalar quantized with the target quantization step size. The step size is chosen based on the desired reconstruction quality.

4. Syndrome coding: The quantized codeword sequence is then syndrome encoded.

• Multilevel coset codes: Consider the DCT coefficient X as the source and an m-level partition of a lattice (see Fig. 3). At each level i, a subcodebook is completely determined by a bit B_i for that level and the i − 1 bits from previous levels, B_k, 1 ≤ k ≤ i − 1. Encoding may then proceed by first quantizing X to the closest point in the lattice at level 0, and then determining the path through the partition tree to the subcodebook at level m that contains the codepoint representing X. The path specifies the source bits B_i, 1 ≤ i ≤ m, that are to be transmitted to the decoder. The number of levels in the partition tree can be varied based on the estimated variance of the effective noise between X and Y, as shown in Fig. 4, where each coefficient X_j is assigned a different number of levels m_j. The value of m_j also depends on the class the block belongs to, as determined in the classification step.

[Fig. 4: Syndrome-based encoding of a DCT-transformed block. Left: original DCT coefficients. Middle: based on the correlation noise estimate, only the least significant bits of each DCT coefficient are sent. Right: a further rate rebate can be obtained by syndrome encoding the most significant bitplanes.]

• Syndrome generation: The output of the previous stage can be sent uncoded or can be further processed in order to reduce the rate. The most significant bits of each DCT coefficient can be grouped together to form a binary channel codeword of size n and then passed through the parity-check matrix of an (n, k) linear error correction code, producing a syndrome of n − k bits. The encoding rate is then (n − k)/n. The same procedure can be applied to lower bitplanes by changing the rate of the error correction codes accordingly. Note that low-rate error correction codes, which usually have stronger error correction capabilities, result in higher encoding rates. Thus, lower levels require higher encoding rates: they have higher uncoded probabilities of error, stemming from their lower correlation with the side information, and therefore demand stronger error correction codes. In practice, the choice of rates (channel codes) for each level should be made jointly so as to minimize the end-to-end expected distortion. Since the expected distortion depends on the probability of error, a set of error correction codes should be chosen to achieve a desired probability of error. This can be done by modeling the test channel as being characterized by the correlation noise N discussed earlier. The probability of error can then be calculated either analytically or empirically based on the overall noise statistics.

(A simplified sketch of the overall encoding flow follows this list.)
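In this sketch (ours, not the paper's implementation), the thresholds, the per-class bit allocation table and the quantization step are hypothetical stand-ins for the offline-trained values; syndrome coding of the upper bitplanes and the CRC are omitted.

```python
# Hedged sketch of PRISM encoding for one 8x8 block: classify by MSE against
# the co-located block, DCT + quantize, then keep only the LSB planes
# dictated by the class. THRESHOLDS and BITS_PER_COEFF are made up, not the
# paper's trained values.
import numpy as np
from scipy.fft import dctn

THRESHOLDS = [2.0, 8.0, 32.0, 128.0]    # hypothetical class thresholds on MSE
# Class 0 plays the role of SKIP (nothing sent); higher classes (more
# correlation noise) send more LSB planes, fewer for high-frequency coeffs.
BITS_PER_COEFF = {c: ([0] * 64 if c == 0 else
                      [max(0, c + 2 - j // 8) for j in range(64)])
                  for c in range(len(THRESHOLDS) + 1)}

def zigzag(block):
    """Flatten an 8x8 block in JPEG zig-zag order (by anti-diagonals)."""
    order = sorted(((i + j, i if (i + j) % 2 else -i, i, j)
                    for i in range(8) for j in range(8)))
    return np.array([block[i, j] for (_, _, i, j) in order])

def encode_block(cur, colocated, step=8.0):
    """Classify the block, transform and quantize it, extract LSB planes."""
    mse = float(np.mean((cur - colocated) ** 2))       # classification metric
    cls = sum(mse > t for t in THRESHOLDS)             # class via thresholds
    coeffs = zigzag(dctn(cur - 128.0, norm='ortho'))   # decorrelating transform
    q = np.round(coeffs / step).astype(int)            # scalar quantization
    # Send only the m_j least significant bits of each quantized coefficient;
    # the upper bits are recovered at the decoder from the side information.
    lsb = [int(qi) & ((1 << m) - 1) for qi, m in zip(q, BITS_PER_COEFF[cls])]
    return cls, lsb

rng = np.random.default_rng(0)
cur = rng.integers(0, 256, (8, 8)).astype(float)
cls, lsb = encode_block(cur, cur + rng.normal(0, 3, (8, 8)))
print(cls, lsb[:8])
```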

Decoder: For each block, the decoder searches candidate blocks taken from the reference frame to be used as side information. Usually, candidate blocks are visited in spiral order starting from the co-located block in the reference frame. For each of them, the decoded codeword is obtained by performing multistage decoding, initiated by decoding the lowest level and then passing the decoded bit to the next level. Each decoded bit is passed to successive levels until all bits are decoded and an associated codeword is obtained. At each level, a syndrome is received from the encoder. This syndrome is used to choose a coset of the linear error correction code associated with that level, and soft-decision decoding [24,25] is then performed on the side information to find the closest codeword within the specified coset. Thus, for each candidate predictor a reconstructed version of the current block is obtained. To determine whether this reconstruction is correct, a CRC is computed from the reconstructed quantized coefficients and compared with the CRC sent by the encoder. If the CRC matches, decoding is declared successful. In our simulations we have never found the CRC to match when the decoded codeword is actually wrong. We emphasize that this method grants high robustness in the face of channel loss: when the best motion-compensated candidate predictor is not available, decoding might still succeed using other candidate predictors taken from the same reference frame as well as from past frames.
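A matching sketch of the decoder loop, under the same assumptions as the encoder sketch above: coset decoding is reduced to a per-coefficient nearest-index rule, coefficients are taken in raster rather than zig-zag order for brevity, and zlib's CRC-32 stands in for the codec's CRC.

```python
# Hedged sketch of PRISM decoding for one 8x8 block: visit candidate
# predictors in order of distance from the co-located position, coset-decode
# against each, and accept only when the CRC of the recovered indices matches.
import zlib
import numpy as np
from scipy.fft import dctn

def coset_decode(syndromes, bits, pred_q):
    """Minimal stand-in for multistage coset decoding: per coefficient, pick
    the quantization index closest to the predictor's whose m LSBs match."""
    out = []
    for s, m, p in zip(syndromes, bits, pred_q):
        if m == 0:
            out.append(int(p))                 # nothing sent: trust predictor
            continue
        base = (int(p) >> m) << m              # predictor index, LSBs cleared
        cands = (base - (1 << m) + s, base + s, base + (1 << m) + s)
        out.append(min(cands, key=lambda q: abs(q - int(p))))
    return out

def decode_block(syndromes, bits, crc, ref, y0, x0, step=8.0, radius=8):
    # Spiral-like order: candidates sorted by distance from co-located block.
    offsets = sorted(((dy, dx) for dy in range(-radius, radius + 1)
                               for dx in range(-radius, radius + 1)),
                     key=lambda o: abs(o[0]) + abs(o[1]))
    for dy, dx in offsets:
        y, x = y0 + dy, x0 + dx
        if not (0 <= y <= ref.shape[0] - 8 and 0 <= x <= ref.shape[1] - 8):
            continue                           # candidate outside the frame
        cand = ref[y:y + 8, x:x + 8]
        # Raster order used here for brevity (the encoder sketch used zig-zag).
        pred_q = np.round(dctn(cand - 128.0, norm='ortho').ravel() / step)
        q = coset_decode(syndromes, bits, pred_q)
        if zlib.crc32(np.asarray(q, np.int32).tobytes()) == crc:
            return q                           # CRC match: success
    return None                                # failure: conceal the block
```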

4. Proposed video multicast solution

Building on the PRISM framework, we propose a coding scheme that provides spatial and temporal scalability based on the principles of distributed video coding. This scalable flavor of PRISM is designed specifically to offer good performance in the face of channel losses.

The proposed architecture is inspired by what has been chosen to become the future scalable video coding standard [14], an extension of H.264/AVC [4]. First, a multilayer representation of the sequence is built by spatially downsampling the original frames. Fig. 1 gives an example where only two layers are shown. Although the extension to multiple layers is conceptually straightforward, in this paper we refer to a two-layer scheme, where the base layer has half the resolution of the enhancement layer. First, the base layer is encoded using any coding algorithm. Backward compatibility can be assured at the base layer if a standard codec is used, e.g. H.264/AVC [4], H.263+ [6] or MPEG-2 [26]. In this work we have adopted an IBPBP group of pictures (GOP) structure so that the first temporal scalability layer is supported. For example, if the full spatio-temporal resolution sequence is CIF@30 fps (CIF resolution: 352 × 288; QCIF: 176 × 144), then by decoding the base layer only we obtain a sequence at QCIF@15 fps or at QCIF@7.5 fps (by skipping the B frames). As mentioned in Section 1, in this work we focus on the case when the encoder does only part of the motion estimation task and most of the motion search is performed at the decoder. In fact, the encoder performs motion estimation only at the base layer resolution, at a fraction of the cost of a full resolution motion search. This is motivated by the fact that in this paper we are mostly concerned with robustness to channel loss. To this end, it was recently shown that the importance of estimating an accurate motion model at the encoder decreases as the channel noise increases [18].

The base layer quality can be improved from rate R1 to rate R1 + ΔR1 with an SNR enhancement layer, encoded as explained in Section 4.2, so that different users can subscribe to the stream they are interested in according to their network bandwidth constraints. Like H.263+, we want to be able to exploit the temporal correlation at the SNR enhancement layer in order to minimize the coding efficiency loss of MPEG4-FGS. At the same time, we do not want to keep multiple loops at the encoder tracking different decoder states. Using PRISM, we encode the enhancement layer based on the statistical correlation between the original sequence and the side information, which can be generated from the SNR enhancement layer of previously decoded frames as well as from the base layer of the current frame.

The spatial enhancement layer is encoded on top of the higher quality base layer with the proposed distributed source coding approach, detailed in Section 4.3. The frames labeled IP, BP and PP form the spatial enhancement layer (achieving CIF@15 fps); these frames can leverage the base layer as a spatial predictor as well as previously decoded frames as temporal predictors.

Subsequently, the temporal enhancement layer is added (frames TP1, TP2) in order to obtain the full resolution version of the sequence, CIF@30 fps. In both cases, the motion information available in the base layer is exploited to build the temporal predictor, as will be detailed in Section 4.4. The main issue here is to tune the estimation of the statistical correlation based on the temporal distance between the frame to be encoded and its reference.

A further SNR scalability layer can be added in order to improve the quality at full spatial resolution, increasing the target bitrate to R2 + ΔR2. Therefore, in our current implementation we are able to decode the sequence at two target bitrates for each spatial and temporal resolution.

The proposed scalable solution inherits the robustness features of PRISM when video is streamed over a noisy channel. Experimental results (see Section 5) demonstrate that it outperforms state-of-the-art predictive video codecs at medium to high packet loss rates, even when forward error correction (FEC) codes are used to prevent errors. Furthermore, the layered organization of the bit-stream makes the proposed solution amenable to unequal error protection (UEP), which can further improve its robustness.

4.1. Information theoretic setup

With reference to Fig. 5, we explain the encoding/decoding of an enhancement layer on top of a base layer. We take an information theoretic perspective here, postponing the description of the actual coding algorithm to the next sections.

Decoder 1 has a rate constraint of R, while decoder 2 has a rate constraint of R + ΔR. Y_b and Y_g are the predictor blocks (from previously decoded frame(s)) available to decoders 1 and 2, respectively; they form the side information for the respective decoders. Since decoder 2 receives data at a higher rate, it will have a better predictor (and hence better side information) than decoder 1. In the case of SNR scalability, Y_b is generated from previously decoded frames at rate R, while Y_g is generated from previously decoded frames at rate R + ΔR as well as from the same frame decoded at rate R. The same scenario holds for spatial scalability, where the rate increment ΔR between the base and the enhancement layer is used to increase the spatial resolution instead of improving the reconstruction quality. X_b and X_g are the reconstructions of the source X by decoders 1 and 2, respectively; X_g is a better reconstruction than X_b.

[Fig. 5: SNR scalability: decoder 1 subscribes to the base layer stream at rate R, while decoder 2 subscribes to both streams, at rates R and ΔR. Y_b and Y_g are the side information available at the two decoders; Y_g is better side information than Y_b.]

Heegard and Berger [27] provided the optimal rate-distortion region for this problem for the case ΔR = 0. Steinberg and Merhav [28] have recently extended the result of [27] to cover the case of non-zero ΔR, where X − Y_g − Y_b forms a Markov chain. The Markov chain implies that the lower rate user's side information is a degraded version of the better user's. The entire optimal rate-distortion region for this problem is provided in [28]. In the interest of simplicity, we restrict ourselves to one important operating point in this region: the case where the entire rate R can be utilized by decoder 1.

The solution for this case calls for the generation of two source codebooks C1 and C2. The rate of codebook C1 is R, while that of C2 is ΔR. The source X is quantized using C1 and C2 to generate the codewords U and W, respectively. Conceptually, the decoding process is as follows. The codeword U is first decoded by both decoders; X_b is the reconstruction by decoder 1, and let X'_g be the reconstruction by decoder 2. X'_g is a better reconstruction of X than X_b due to greater estimation gains (because of the presence of the better side information at decoder 2). Note that this estimation gain comes from multiple independent looks at the source data [10]. Now, the codeword W is decoded using X'_g as the side information. Note that it would be sub-optimal to assume that the reconstruction by decoder 2 is also X_b; we obtain a rate rebate by using the better reconstruction X'_g.

Multiple users: The extension to more than two users is relatively straightforward. For example, let there be a third client in the system with a rate constraint of R + ΔR + ΔR′. Then we encode the R and ΔR bit-streams just as in the two-client case, while the new ΔR′ bit-stream is coded keeping in mind the better reconstruction that the third client has after it has decoded the R and ΔR bit-streams. This allows us to target medium granularity scalability (MGS).

Unlike the H.263+ encoder, our encoder needs to maintain only a relatively small amount of "state" information, relating to the statistical correlation between the current frame and the different predictors at the decoders. While the details depend on the exact implementation (e.g. a single scalar quantity representing the estimated correlation noise might suffice), the key difference is that in the predictive coding framework deterministic copies of each predictor frame need to be kept in the encoder state. This allows our algorithm to scale with the number of users.

4.2. SNR scalability

Fig. 1 shows that two SNR scalability layers are made available both at the base layer and at the spatial enhancement layer resolution.

The encoding process of the SNR scalability layer follows the algorithmic steps of the PRISM codec described in Section 3. Each block of size 8 × 8 is encoded independently, with the previously decoded blocks at the decoder serving as the side information.

As in Section 4.1, let us again consider the case when the entire rate R can be utilized by decoder 1 (see Fig. 5). As in the single client case, an estimate of the correlation noise between the block to be encoded and the side information is needed. For this purpose we use the frame-difference-based classification algorithm described in Section 3.1. Since the entire rate R can be utilized by decoder 1, the design of the first codebook C1 (using the notation of Section 4.1) is identical to the single client setup described in Section 3. The second codebook C2 essentially consists of extra bit-planes that further refine the reconstruction at decoder 2. Since the side information Y_g (present at decoder 2) is better than Y_b (present at decoder 1), these bit-planes can be further compressed using channel codes to achieve rate savings.

At the decoder, the side information can be obtained either from the decoded base layer at rate R and/or from the previously decoded frames at rate R + ΔR. The decoding process for the first codeword (U) is identical to that described for the one-client case in Section 3.1. Each client independently performs motion search to generate side information that can be used to correctly decode the codeword. Upon decoding U, decoder 1 reconstructs the source X as X_b and decoder 2 reconstructs X as X'_g. Decoder 2 then needs to decode the second codeword (W). At this step, X'_g serves as the side information to the decoder, and the decoding process is identical to regular Wyner–Ziv decoding. (Since there is no further motion search at this step, no CRCs are required to verify decoding of W.)

[Fig. 6: When encoding block X, the encoder has access to its spatial predictor Y_S and the coarse temporal predictor Y_TCM, obtained by scaling the base layer motion vector. At the decoder, the best motion-compensated predictor Y_TFM is also available as side information. The figure does not show the spatio-temporal predictors available at the encoder (Y_STCM) and at the decoder (Y_STFM), computed as a simple average of the spatial and temporal predictors.]

4.3. Spatial scalability

In the proposed solution, the spatial enhancement layer is encoded on top of the higher quality base layer. As shown in Fig. 1, when it comes to encoding frames IP, PP and BP, both the base layer and the previously decoded frames at the enhancement layer can serve as side information. Moreover, since the enhancement layer encoder is not allowed to perform any motion search, the correlation noise between the current block and the unknown best predictor, which will be revealed only during decoding, needs to be computed in a computationally efficient manner. To this end, in the original (non-scalable) version of PRISM [7], each block X is classified according to the mean squared error computed using as predictor the block in the reference frame at the same spatial location, i.e. the zero-motion temporal predictor Y_TZM. An offline training scheme provides an estimate of the correlation noise for each DCT coefficient based on the measured MSE and the best motion-compensated predictor that will be available at the decoder, Y_TFM (the subscript FM stands for full motion). Unfortunately, this method is likely to fail when there is significant, yet simple, motion such as camera panning. The proposed solution takes advantage of the existence of the base layer in two different respects: as a spatial predictor for those blocks that cannot be properly temporally predicted, e.g. because of occlusions, and by using the motion vectors available at the coarser resolution to provide a better estimate of the correlation noise. The encoding process for the three types of frames is as follows:

• Frame IP: A spatial predictor is computed by interpolating the quantized I frame of the base layer. The prediction residual is quantized and entropy coded as in H.263+.

• Frame PP: Spatial, temporal and spatio-temporal predictors are built using only the coarse motion vectors of the base layer (see Fig. 6). Then, the best predictor is chosen according to an MSE metric and the correlation noise is estimated based on the statistics collected offline. The block is then quantized and encoded as described in Section 3, sending only the least significant bits of the DCT coefficients, as the most significant ones will be recovered using the side information.


• Frame BP: The encoding algorithm is similar to that of the PP frame, except that the temporal predictor can use the forward and/or backward motion vectors (bi-directional prediction). The prediction mode as well as the motion vectors are the same as those used in the base layer.

At the decoder, the algorithm tests different predictors until it finds one that is good enough to correctly decode the block. If the CRC of the decoded block matches the one computed at the encoder side, a decoding success is declared. The decoder is allowed to use any predictor (spatial, temporal or spatio-temporal) for the purpose of decoding the encoded block.

As mentioned above, the correlation between the block to be encoded and the (unknown) best predictor (which will be found at the decoder after a full motion search) needs to be estimated. This is the task of the classifier. At the encoder, only the motion information available at the base layer (the "coarse" motion vector) is used to provide an estimate of the correlation. Three different prediction types are allowed: spatial Y_S, temporal Y_TCM and spatio-temporal Y_STCM = (Y_S + Y_TCM)/2 (see Fig. 6); the best among these choices is computed based on the "coarse" motion vector. The classifier works on training sequences. Using the data collected offline, the classifier estimates the correlation statistics between the block to be encoded and the best motion-compensated predictor available at the decoder (either spatial Y_S, temporal Y_TFM or spatio-temporal Y_STFM = (Y_S + Y_TFM)/2) for each type of predictor that can be selected at the encoder (Y_S, Y_TCM and Y_STCM). See Fig. 7 for more details.
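A compact sketch of this encoder-side selection and class lookup (ours; the per-predictor threshold tables below are hypothetical stand-ins for the offline-trained statistics):

```python
# Hedged sketch of enhancement-layer classification: compare the spatial,
# coarse-temporal and spatio-temporal predictors by MSE, then map the winning
# (predictor type, MSE) pair to a class via offline-trained thresholds.
import numpy as np

THRESHOLDS = {                    # hypothetical per-predictor class thresholds
    'S':    [4.0, 16.0, 64.0],
    'TCM':  [2.0, 8.0, 32.0],
    'STCM': [3.0, 12.0, 48.0],
}

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def classify(x, y_s, y_tcm):
    """Pick the best coarse predictor for block x and return its class."""
    y_stcm = 0.5 * (y_s + y_tcm)            # spatio-temporal predictor
    errors = {'S': mse(x, y_s), 'TCM': mse(x, y_tcm), 'STCM': mse(x, y_stcm)}
    kind = min(errors, key=errors.get)      # best predictor type by MSE
    mse_c = errors[kind]                    # "coarse" MSE used to classify
    cls = sum(mse_c > t for t in THRESHOLDS[kind])
    return kind, cls
```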

Although only two levels are considered in the current implementation, the proposed scheme supports any number of levels of spatial scalability. In fact, the same concepts can be extended to a multilayer architecture, where each layer can use the upper layer as a spatial predictor. Furthermore, the ratio between the resolutions of two succeeding layers is not constrained to be 2:1. All that is needed is an interpolating algorithm able to build a spatial predictor of the appropriate size starting from the base layer.

4.4. Temporal scalability

Fig. 1 shows that by encoding frames TP1 and TP2 it is possible to obtain the full spatio-temporal resolution. The encoding of the temporal enhancement layer is more involved, since we can rely only partially on the information available at the base layer. Specifically, we have neither a spatial predictor available nor a motion field that completely covers the frame to be encoded. For these reasons we allow only temporal prediction. The motion field is inferred from that available at the base layer. In our current implementation, the estimation of the coarse motion field for TP1 frames proceeds as follows. First, the motion field of frame BP is extracted from the base layer by simply scaling the motion vectors to match the spatial resolution of the enhancement layer. Then, the motion field of frame TP1 is estimated by "interpolating" the motion trajectories from BP to IP (or PP). Fig. 8 gives a pictorial representation of the algorithm and shows the different scenarios that can occur (a sketch of the overlap rule follows this list):

• Block a is completely covered by the projection of block A and no other block overlaps with it. We apply to block a the scaled version of MV_A, i.e. MV_a = MV_A/2.

• Block b is only partially covered by the projection of block B. As before, we apply to block b the scaled version of MV_B, i.e. MV_b = MV_B/2.

• Block c is covered by the projections of blocks B and C. The motion vector of the block that covers the most is selected, so MV_c = MV_C/2.

• Block e is covered by the projections of blocks E, D and F. As before, the block with the widest coverage, i.e. E, is selected and its scaled motion vector is assigned to block e.

• Block g is not covered by any block. In this case we can either use the zero motion vector or assign a vector estimated from its causal neighbors, i.e. blocks d and e in this case.
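The overlap rule can be sketched as follows (our illustration, assuming 8 × 8 enhancement-layer blocks and integer-pixel motion vectors):

```python
# Hedged sketch of coarse motion-field inference for a TP1 block: project the
# BP blocks halfway along their (scaled) motion trajectories, measure how much
# each projection overlaps the TP1 block, and inherit the MV of max overlap.
B = 8  # block size at the enhancement layer (assumed)

def overlap(ax, ay, bx, by):
    """Overlap area between two BxB blocks at integer positions."""
    w = max(0, min(ax + B, bx + B) - max(ax, bx))
    h = max(0, min(ay + B, by + B) - max(ay, by))
    return w * h

def inherit_mv(block_xy, bp_blocks):
    """bp_blocks: list of ((x, y), (mvx, mvy)) for frame BP, already scaled to
    the enhancement-layer resolution. Returns the MV assigned to the TP1 block."""
    best_area, best_mv = 0, None
    for (x, y), (mvx, mvy) in bp_blocks:
        # Project the BP block halfway along its trajectory toward IP (or PP).
        area = overlap(block_xy[0], block_xy[1], x + mvx // 2, y + mvy // 2)
        if area > best_area:
            best_area, best_mv = area, (mvx // 2, mvy // 2)
    return best_mv or (0, 0)   # uncovered block: fall back to the zero MV
```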

Although more sophisticated methods could be used for this operation, the overall coding algorithm is not very sensitive to the accuracy of the coarse motion field. In fact, the coarse motion vector is used to determine MSE_c. Based on the value of MSE_c, the block is assigned to one of the classes, thereby driving the coset bit allocation. Similar values of MSE_c thus lead to the same decision in the classification process.

We should point out that the backward motion vector from BP to IP (or PP) might not be available in the base layer. This can happen in two circumstances: the block is intra-coded, or the block is inter-coded but only the forward motion vector is available. In the former case, the same policy adopted for blocks not covered by any projection is applied. In the latter case, the backward motion vector is obtained by simply inverting the forward motion vector.

[Fig. 7: Working offline, the classifier computes a mapping between the residue computed using the "coarse" predictor and the residue computed using the best predictor obtained from a full motion search. Each block is represented by a circle in one of nine scatter plots, according to the prediction type computed using the coarse motion vector available at the encoder (determining the row) and the best motion vector (determining the column). In each scatter plot, the x-axis is the MSE computed using the coarse motion vector (MSE_c), while the y-axis is the MSE computed with the best motion vector (MSE_b). MSE_b is an aggregate measure of the correlation noise at the block level and is not directly used in the actual encoding algorithm. In fact MSE_c determines the class a block belongs to (see Section 3). For each class, the MSE of each DCT coefficient is estimated offline and used to drive the coset bit allocation.]

The estimation of the motion field of frame TP2 follows the same algorithm as that for frame TP1. In this case we can leverage either the backward motion field from PP to IP (or PP) or the forward motion field from BP to PP.

We note that separate statistics of the correlation noise are collected for each type of frame. This is because the distance between the current frame and its temporal reference is different for each type of frame. Hence the accuracy of the estimated coarse motion field varies with the frame type (typically the motion fields estimated for frames of type BP and PP are more precise than those for frames of type TP1 and TP2).

5. Experimental results

In this section we present results that showcase the promise of our approach. In Section 5.1 we present results for SNR scalability, followed by results for spatial and temporal scalability in Section 5.2. In all experiments the GOP size is equal to 32 frames. Therefore, one intra-coded frame is inserted every 16 frames at 15 fps or every 32 frames at 30 fps.

5.1. SNR scalability tests

We present results for the two-client/two-rate case using the "uplink" PRISM framework (i.e. one in which motion compensation is performed at the decoder) and compare it to the SNR scalable version of the H.263+ video coder (free version obtained from the University of British Columbia), protected with FEC and with block-based intra-refresh.

[Fig. 8: Estimation of the motion field of frame TP1 from the motion field of frame BP. Block e is covered by the projections of blocks E, D and F. The block with the maximum overlap, i.e. E, is selected, so MV_e = MV_E/2. Similarly, MV_a = MV_A/2, MV_b = MV_B/2 and MV_c = MV_C/2.]

[Fig. 9: Performance comparison (for multicast) of scalable PRISM, H.263+ protected with FECs (Reed–Solomon codes, 20% of the total rate used for parity bits) and H.263+ protected with block-based intra-refresh (15% of the blocks forced to be intra-coded) for the Football and Stefan sequences. Panels: Football (QCIF, 15 fps, 327 kbps), Football (QCIF, 15 fps, 327+197 kbps), Stefan (QCIF, 15 fps, 720 kbps), Stefan (QCIF, 15 fps, 720+103 kbps); average Y-PSNR (dB) vs. packet drop rate (%). For the FEC case, protection was given only to the base layer.]

[Fig. 10: Performance comparison of the proposed scalable solution, H.263+ protected with FECs (Reed–Solomon codes, 20% of the total rate used for parity bits) and H.263+ protected with block-based intra-refresh (15% of the blocks forced to be intra-coded) for the Football and Stefan sequences (CIF, 15 fps, 1800 kbps); average Y-PSNR (dB) vs. packet drop rate (%).]

For the case of scalable H.263+ protected with FEC, we use Reed–Solomon (RS) codes with 20% of the total rate allocated to parity bits. (Evolving standards for video broadcast over cellular networks, such as 3GPP, typically allocate about 20% of extra rate for FECs and/or other error correction mechanisms.) No unequal error protection scheme is applied in our simulations, and it is assumed that the same packet loss rate affects both the base layer and the enhancement layer. For the case of block-based intra-refresh, approximately 15% of the blocks are forced to be intra-coded. In this experiment we used H.263+ as a benchmark instead of the state-of-the-art H.264/AVC codec because the former has built-in support for SNR scalability. For our tests, we restrict ourselves to the case when the entire rate R can be utilized by the lower rate client (decoder 1 in Fig. 5).

For the case of SNR scalable PRISM, the baseline version of PRISM described in Section 3 is used at the base layer, whereas the algorithm described in Section 4.2 is employed at the enhancement layer.

We tested our scheme using a wireless channel simulator (courtesy of Qualcomm, Inc.). This simulator adds packet errors to multimedia data streams transmitted over wireless networks conforming to the CDMA2000 1X standard [29]; the packet error rates are determined by computing the carrier-to-interference ratio of the cellular system. For each SNR layer, a frame is divided into horizontal slices (4 or 16 slices at QCIF/CIF resolution, respectively) and each slice is sent as a packet. We assume that either a packet is received or it is completely lost. In the latter case we use a simple error concealment technique, pasting in the co-located blocks taken from the reference frame.
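A minimal sketch of this packetization and concealment step (ours; slice counts and frame shapes are assumptions):

```python
# Hedged sketch of the slice packetization and simple concealment used in the
# tests: each frame is split into horizontal slices, one per packet; a lost
# slice is replaced with the co-located rows of the reference frame.
import numpy as np

def packetize(frame, n_slices):
    """Split a frame into horizontal slices, one per packet."""
    return np.array_split(frame, n_slices, axis=0)

def conceal(slices, lost, ref_frame, n_slices):
    """Rebuild a frame, pasting co-located reference data for lost slices."""
    ref_slices = np.array_split(ref_frame, n_slices, axis=0)
    out = [ref_slices[i] if i in lost else s for i, s in enumerate(slices)]
    return np.vstack(out)
```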

Fig. 9 shows the performance comparison for the Football (QCIF, 15 fps) and Stefan (QCIF, 15 fps) sequences. As can be seen, the scalable PRISM codec is superior to scalable H.263+ as well as scalable H.263+ protected with FECs by a very wide margin (5–8 dB). Although assigning 20% of the rate to FECs may seem to overprotect the video stream, given that the largest packet drop rate equals 10%, this is not the case under two important testing conditions: (a) FECs are computed across one frame at a time, in order to avoid delay; (b) packet loss patterns observed in the tested network configuration are not random, as large bursts of errors occur in practice. This explains why the performance of H.263+ protected with FEC drops even at low packet loss rates.

5.2. Spatial and temporal scalability tests

For the tests on spatial and temporal scalability, the base layer was coded using PRISM and the spatial and temporal enhancement layers were encoded as described in Sections 4.3 and 4.4, respectively. The proposed system was compared at full spatial resolution against the H.263+ video codec under two testing conditions: (a) protected with FECs, with 20% of the total rate used for parity bits (RS codes were used); (b) protected with intra-refresh blocks, with approximately 15% of the blocks forced to be intra-coded. As in Section 5.1, we tested these schemes using the wireless channel simulator conforming to the CDMA2000 1X standard. We assumed that packet losses hit the base and the enhancement layer with the same probability.

Fig. 11. Performance comparison (average Y-PSNR versus packet drop rate) of the proposed scalable solution, H.263+ protected with FECs (Reed–Solomon (RS) codes, 20% of the total rate used for parity bits) and H.263+ protected with block-based intra-refresh (approximately 15% of the blocks forced to be intra-coded) for the Football sequence (CIF, 30 fps, 3500 kbps).

Figs. 10 and 11 show the performance comparison for the Stefan sequence at 15 fps and the Football sequence at 15 and 30 fps. The scalable PRISM implementation clearly outperforms H.263+ in both configurations (protected with FECs and with intra-refresh) by a wide margin: up to 6 and 4 dB, respectively, at high packet loss rates for Football. Fig. 12 shows the reconstruction of a particular frame (the middle frame of the GOP) of the Stefan sequence by the proposed scalable PRISM coder and by H.263+. As can be seen from Fig. 12, the visual quality provided by the scalable PRISM coder is clearly superior to that provided by H.263+. Moreover, as can be seen from Figs. 10 and 12, the scalable PRISM coder provides a good quality reconstruction even when part of the base layer is lost. This is in marked contrast to standard (prediction-based) scalable video coders, where loss of the base layer often severely degrades the video quality.

Fig. 12. Comparison of Frame 8 of the Stefan sequence (15 fps, 1800 kbps) reconstructed by the proposed solution and H.263+ at a channel error rate equal to 8%. (a) Proposed codec: base layer only (QCIF). (b) Proposed codec: base layer and enhancement layer (CIF). (c) H.263+ (CIF).

6. Conclusions

We proposed a fully scalable coding scheme based on distributed source coding, targeting wireless video multicast applications. Experimental results showcase the robustness of the proposed approach, showing significant objective and subjective gains with respect to predictive coders such as H.263+. Currently, we are working on making the codec operate efficiently at lower encoding rates and on running extensive tests over different types of channels to further validate our approach.

References

[1] A. Majumdar, K. Ramchandran, Video multicast over lossy channels based on distributed source coding, in: Proceedings of the International Conference on Image Processing, Singapore, October 2004.
[2] M. Tagliasacchi, A. Majumdar, K. Ramchandran, A distributed-source-coding based robust spatio-temporal scalable video codec, in: Picture Coding Symposium, San Francisco, CA, December 2004.
[3] Requirements and applications for scalable video coding v.5, ISO/IEC JTC1/SC29/WG11 MPEG Document N6505, July 2004.
[4] ITU-T, Information Technology—Coding of Audio-visual Objects—Part 10: Advanced Video Coding, May 2003, ISO/IEC International Standard 14496-10:2003.
[5] MPEG-4 video, proposed draft amendment (PDAM), ISO/IEC FGS v. 4.0 14496-2, March 2000.
[6] ITU-T, Video coding for low bitrate communication, January 1998, ITU-T Recommendation H.263, Version 2.
[7] R. Puri, K. Ramchandran, PRISM: a new robust video coding architecture based on distributed compression principles, in: Allerton Conference on Communication, Control and Computing, Urbana-Champaign, IL, October 2002.
[8] R. Puri, K. Ramchandran, PRISM: a video coding architecture based on distributed compression principles, Technical Report no. UCB/ERL M03/6, ERL, UC Berkeley, March 2003.
[9] A. Majumdar, J. Chou, K. Ramchandran, Robust distributed video compression based on multilevel coset codes, in: Proceedings of the 37th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, November 2003.
[10] A.D. Wyner, J. Ziv, The rate distortion function for source coding with side information at the decoder, IEEE Trans. Inform. Theory 22 (January 1976) 1–10.
[11] A. Sehgal, N. Ahuja, Robust predictive coding and the Wyner–Ziv problem, in: Proceedings of the IEEE Data Compression Conference, Snowbird, UT, October 2003.
[12] B. Girod, A. Aaron, S. Rane, D.R. Monedero, Distributed video coding, Proc. IEEE 93 (January 2005) 71–83.
[13] H. Schwarz, D. Marpe, T. Wiegand, SNR-scalable extension of H.264/AVC, in: Proceedings of the International Conference on Image Processing, Singapore, October 2004.
[14] Scalable video model version 3.0, ISO/IEC JTC1/WG11 Doc. N6716, November 2004.
[15] Q. Xu, Z. Xiong, Layered Wyner–Ziv video coding, in: Visual Communications and Image Processing, Proceedings of SPIE, San Jose, CA, January 2004.
[16] H. Wang, A. Ortega, WZS: Wyner–Ziv scalable predictive video coding, in: Picture Coding Symposium, San Francisco, CA, December 2004.
[17] A. Sehgal, A. Jagmohan, N. Ahuja, Scalable video coding using Wyner–Ziv codes, in: Picture Coding Symposium, San Francisco, CA, December 2004.
[18] A. Majumdar, R. Puri, P. Ishwar, K. Ramchandran, Complexity/performance trade-offs for robust distributed video coding, Genova, Italy, 2004.
[19] J.D. Slepian, J.K. Wolf, Noiseless coding of correlated information sources, IEEE Trans. Inform. Theory 19 (July 1973) 471–480.
[20] S.S. Pradhan, J. Chou, K. Ramchandran, Duality between source coding and channel coding and its extension to the side information case, IEEE Trans. Inform. Theory 49 (May 2003).
[21] S.S. Pradhan, K. Ramchandran, Distributed source coding using syndromes (DISCUS): design and construction, IEEE Trans. Inform. Theory, March 2003.
[22] P. Ishwar, V.M. Prabhakaran, K. Ramchandran, Towards a theory for video coding using distributed compression principles, in: Proceedings of the International Conference on Image Processing, Barcelona, Spain, September 2003.
[23] A. Aaron, R. Zhang, B. Girod, Wyner–Ziv coding of motion video, in: Proceedings of the 36th Asilomar Conference on Signals, Systems, and Computers, vol. 1, Pacific Grove, CA, October 2002, pp. 240–244.
[24] J.K. Wolf, Efficient maximum likelihood decoding of linear block codes using a trellis, IEEE Trans. Inform. Theory 24 (1) (January 1978) 76–80.
[25] M.P.C. Fossorier, S. Lin, Soft decision decoding of linear block codes based on order statistics, IEEE Trans. Inform. Theory 41 (5) (September 1995) 1379–1396.
[26] ISO/IEC 13818-2, Information technology—generic coding of moving pictures and associated audio information: video, 1995, MPEG-2 video coding standard.
[27] C. Heegard, T. Berger, Rate distortion when side information may be absent, IEEE Trans. Inform. Theory 31 (November 1985) 727–734.
[28] Y. Steinberg, N. Merhav, On successive refinement of the Wyner–Ziv problem, IEEE Trans. Inform. Theory 50 (8) (August 2004) 1636–1654.
[29] TIA/EIA, Interim standard for CDMA2000 spread spectrum systems, May 2002.