
Diplomarbeit

Video Quality Evaluation

ausgeführt zum Zwecke der Erlangung des akademischen Grades eines

Diplom-Ingenieurs unter Leitung von

Betreuer: Michal Ries

Professor: Markus Rupp

E389

Institut für Nachrichtentechnik und Hochfrequenztechnik

eingereicht an der Technischen Universität Wien

Fakultät für Elektrotechnik und Informationstechnik

von

Antitza Dantcheva

Matrikelnummer: 9926639
Margaretenstrasse 30/62, A-1040 Wien

Wien, im April 2007


Contents

Abstract
Zusammenfassung

1 INTRODUCTION TO VIDEO QUALITY
1.1 Objective Measurement

2 VIDEO CODING AND COMPRESSION
2.1 Colour Coding
2.2 Compression Methods
2.3 Standards
2.3.1 H.263
2.4 Compression Artefacts

3 UMTS
3.1 Services
3.2 Video streaming
3.2.1 Packet Switched Streaming Service
3.2.2 Multimedia Broadcast / Multicast Service
3.2.3 IP Multimedia Subsystem

4 CONTENT CLASSIFICATION
4.1 Scene Changes
4.1.1 Sum of Absolute Differences
4.1.2 Fixed Threshold
4.1.3 Dynamic Threshold
4.2 Spatial and Temporal Information
4.2.1 SI-TI Research
4.3 Sequence Motion Characteristics
4.3.1 Motion Feature Extraction
4.4 Detecting Pans and Zooms
4.4.1 Results
4.5 Colour Research

5 TEST AND METRIC DESIGN
5.1 Choice of objective parameters set
5.2 Metric design

6 CONCLUSIONS

Bibliography


Abstract

The flourishing telecommunication industry has brought several innovations in the past years. The major and most widespread of all is the mobile phone - with over 1.5 billion devices sold worldwide. It has become an indispensable, taken-for-granted personal gadget - the dream of every producer. That is why mobile network providers are more than willing to fulfill, but also to awaken, every demand thinkable in this field. TV, music, gaming, gambling, navigation - the boom seems to be unstoppable.

The aim of this master thesis is to address video quality in mobile telephony. Video transmission requires large amounts of bandwidth and storage space. End-user costs in the mobile telephone system, on the other hand, are proportional to the transmitted data; thus bandwidth and transmission power are cost factors to be minimized. The key to wireless video and multimedia applications is compression. To still provide a high QoS (Quality of Service), the compression should be applied non-linearly - depending on the viewer's sensitivity to particular characteristics of the video. To study this sensitivity, a subjective test was carried out. Users had to evaluate videos of different content, coded at varying bit and frame rates. The gathered MOS (Mean Opinion Score) was modelled in terms of objective video parameters. This led to a metric able to forecast subjective perception depending on the content of a video sequence.

Consequently, an objective content classification was essential. To this end, the tested videos first had to be split into their constituent parts - the cuts. Several approaches were then studied to achieve a satisfying and reliable content classification. SI-TI (Spatial Information - Temporal Information), SAD (Sum of Absolute Differences), colour and, mainly, motion vectors were analyzed and finally used to assure the demanded performance. Thereby the whole transmission chain is considered: an input video is cut and classified, coded depending on the content class and sent to the end user.

This work can be applied to monitoring the optimal settings for the processing and transmission of video on demand, but also of streamed video where a certain computation delay is allowed.


Zusammenfassung

Die florierende Nachrichtentechnik-Industrie brachte viele Neuerungen in den vergangenen Jahren. Die wohl grösste und weitverbreitetste ist das Mobiltelefon mit über 1500 Millionen verkauften Geräten weltweit. Das Handy ist zu einem persönlichen, selbstverständlichen und nicht mehr wegdenkbaren Begleiter geworden. Deswegen sind Mobilnetzbetreiber mehr als bemüht, jeden denkbaren Wunsch auf dem Feld zu erfüllen, aber auch zu erwecken. TV, Musik, Spiel, Glücksspiel, Navigation - der Boom scheint unaufhaltbar.

Das Ziel dieser Diplomarbeit ist es, sich mit Videoqualität in der Mobiltelephonie zu beschäftigen. Videoübertragung erfordert enorme Mengen an Bandbreite und Speicherplatz. Die Kosten für den Verbraucher wiederum sind proportional zu der übertragenen Datenmenge, das heisst zu der Bandbreite und der übertragenen Leistung - Faktoren, die es zu minimieren gilt. Der Schlüssel dazu heißt Komprimierung. Um dennoch hohe QoS (Quality of Service) zu gewährleisten, sollte die Komprimierung nicht linear erfolgen - also abhängig von der Empfindlichkeit des Betrachters auf die jeweilige Eigenschaft des Videos. Um diese Empfindlichkeit zu erforschen, wurde ein subjektiver Test durchgeführt. Benutzer mussten Videos verschiedenen Inhalts beurteilen, kodiert mit verschiedenen Bit- und Frame-Rates. Das dem Test entnommene MOS (Mean Opinion Score) wurde mittels objektiver Parameter nachgebildet. Das Ergebnis war eine Metrik, die die subjektive Empfänglichkeit der Benutzer prognostiziert, abhängig von dem Inhalt des Videos.

Folglich war eine Inhaltsklassifizierung notwendig. Dafür wurde zunächst das getestete Video in dessen ursprüngliche Teile - Cuts - geschnitten. Verschiedene Ansätze zum Ermöglichen einer zufriedenstellenden und verlässlichen Inhaltsklassifizierung wurden verfolgt. SI-TI (spatial information - temporal information), SAD (sum of absolute differences), Farb- und hauptsächlich Motion-Vectors-Analyse beziehungsweise -Anwendung gewährleisteten die geforderte Leistungsfähigkeit. Damit ist die gesamte Übertragungskette berücksichtigt: am Eingang wird ein Video geschnitten und klassifiziert, je nach Klasse kodiert und zu dem Kunden gesendet.

Anwendung kann die vorliegende Arbeit bei der Aufbereitung und Übertragung von Videos on Demand oder aber auch bei Streaming-Videos finden.


List of Abbreviations

2G Second Generation, GSM
3G Third Generation, UMTS
3GPP 3rd Generation Partnership Project
ADSL Asymmetric Digital Subscriber Line
CDMA Code Division Multiple Access
CPCFC Custom Picture Clock Frequency
CPFMT Custom Picture Format
CS Circuit Switched
DCT Discrete Cosine Transformation
DPCM Differential Pulse-Code Modulation
EDGE Enhanced Data Rates for GSM Evolution
Fig. Figure
GOB Group of Blocks
GSM Global System for Mobile Communications
HDTV High Definition Television
IP Internet Protocol
ITU International Telecommunication Union
MB Macroblock
MBMS Multimedia Broadcast / Multicast Service
MOS Mean Opinion Score
MPEG Moving Picture Experts Group
MV Motion Vector
NTSC National Television Systems Committee
PAL Phase Alternating Line
PCA Principal Component Analysis
PCS Personal Communications Service
PS Packet Switched
PSS Packet Switched Streaming Service
QCIF Quarter Common Intermediate Format (144x176)
QoS Quality of Service
QPSK Quadrature Phase Shift Keying
RGB Red Green Blue
RTP Real Time Protocol
SAD Sum of Absolute Differences
SI Spatial Information
TDMA Time Division Multiple Access
TE Terminal Equipment
TI Temporal Information
UMTS Universal Mobile Telecommunications System


VoIP Voice over IP
WCDMA Wideband Code Division Multiple Access
YUV Colourspace in terms of luminance (Y) and chrominance (U and V)


List of Figures

1.1 Development Process for Video Quality Assessment Algorithm [Arthur A. We 00]

2.1 Temporal and spatial redundancy
2.2 Luminance sensitivity of the human eye
2.3 Temporal and spatial redundancy
2.4 Discrete Cosine Transformation
2.5 I-, B- and P-Frames
2.6 Coding Block Diagram
2.7 H.263 Video Encoder Block Diagram
2.8 Blocking effect
2.9 Blurring
2.10 a) Colour bleeding b) Slanted lines

4.1 Abrupt shot boundary (cut)
4.2 Gradual shot boundary
4.3 SAD of a trailer sequence
4.4 Fixed threshold at 3 times the mean(SAD) of the trailer sequence
4.5 Detected cuts at a varying mean of the SAD function; true cuts = 2; fixed threshold = 2.35-4.9 times the mean(SAD)
4.6 Detected cuts at a varying mean of the SAD function; true cuts = 6; fixed threshold = 6 times the mean(SAD)
4.7 Fixed threshold adjusting within football sequences
4.8 Fixed threshold adjusting in different content class sequences
4.9 Dynamic threshold suggested by [Anastasios D 05]
4.10 Upgraded dynamic threshold
4.11 Sobel filter
4.12 SI-TI values of shots of a video trailer (blue). The cross marks the centroid of the SI-TI values. The circle is the weighted centroid (the longer the shot, the closer the centroid is moved to the respective SI-TI value)
4.13 SI-TI Centroid Diagram (each point belongs to a whole sequence)
4.14 SI-TI Diagram: the SI-TI value cloud is still not separable, but points belonging to the same group correspond slightly better to each other
4.15 SI-TI diagram of different content centroids with varying weight coefficients: magenta: cartoons; blue: comedy action; cyan: comedy talkshow; red: nature; green: motorsports; yellow: football; black: trailer
4.16 Further SI-TI diagrams of different content centroids with varying weight coefficients: magenta: cartoons; blue: comedy action; cyan: comedy talkshow; red: nature; green: motorsports; yellow: football; black: trailer
4.17 SI-TI diagram of all sequences
4.18 SI-TI diagram of 4 content classes (comedy action, comedy talkshow, cartoons, nature)
4.19 H.263 Picture Structure at QCIF resolution
4.20 Motion Vector
4.21 QCIF split in 8x8 macroblocks
4.22 Search region in the previous frame for the current macroblock of the current frame
4.23 Motion vectors in a News sequence
4.24 a) Large search region resulting in big and erroneous motion vectors b) Smaller search region [Ralph Braspe 04]
4.25 Mean size of MVs in a 200-frame football and panorama sequence
4.26 Erroneous motion vectors in untextured areas
4.27 Visualization of DCT
4.28 MV of a Panorama sequence
4.29 Motion vectors of a News sequence
4.30 Pan and Zoom within a SAD graph
4.31 Pan and Zoom detection method graph [Garcia 05]
4.32 News sequences tested with the Pan and Zoom method
4.33 Panorama tested with the Pan and Zoom detection method
4.34 Football sequences tested with the Pan and Zoom detection method
4.35 Colourmaps
4.36 Colourmap Colourcube, different gradings
4.37 64-bin Colourmap for a Football video sequence
4.38 Colourmaps of Football video sequences
4.39 Colourmaps of non-Football sequences
4.40 Cartoon Colourmaps

5.1 First video content class "News"
5.2 Second video content class "Videocall"
5.3 Third video content class "Panorama"
5.4 Fourth video content class "Football"
5.5 Fifth video content class "Traffic"
5.6 Visualization of PCA results

6.1 Content Classifier within the Transmitting Chain


Chapter 1

INTRODUCTION TO VIDEO QUALITY

Video quality is a characteristic of video passed through a video processing system. Since the time when the first video sequence was recorded, many video processing systems have been designed. Different systems may influence a video sequence differently, so video quality evaluation is an essential factor to be taken care of.

Video quality measurement is an important requirement for designers and im-plementers of video communication systems. Quality measurement is especiallyimportant for compressed video, since most video compression algorithms are”lossy”, i.e. compression is achieved at the expense of video quality. Video qual-ity measurement includes comparison of the performance of alternative videocompression algorithms and equipment, quantifying the bit rate against distor-tion (’rate-distortion’) behavior of a video compression algorithm and determin-ing levels of acceptable video quality for a particular application (e.g. broadcastvideo, videoconferencing or streamed video). Video quality measurement meth-ods can be categorized as objective (based on automatic measurement of para-meters of the video sequence) or subjective (based on evaluation by one or morehuman observers).

Subjective measurements can provide a realistic guide to the video quality perceived by a user. However, obtaining an accurate outcome from subjective tests typically involves many repetitions of a time-consuming test with a large number of test subjects, making them expensive and difficult to carry out. Objective measurements are automatic and readily repeatable, but the complex nature of human visual perception makes it difficult to accurately model the response of a human observer. The widely-used peak signal-to-noise ratio (PSNR) as in [Schulzrinne 98] does not correlate well with subjective responses to visual quality. More sophisticated objective quality models have been proposed [J. 80], [H. 96], but as yet there is no clear replacement for subjective measurements.
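PSNR is nonetheless attractive because it is trivial to compute from a decoded frame and its reference. As a minimal sketch (not taken from the thesis; NumPy-based, function and variable names are illustrative), the PSNR of one 8-bit luminance frame could be computed as:

```python
import numpy as np

def psnr(reference: np.ndarray, degraded: np.ndarray, peak: float = 255.0) -> float:
    """PSNR in dB between two equally sized 8-bit frames (e.g. luminance planes)."""
    ref = reference.astype(np.float64)
    deg = degraded.astype(np.float64)
    mse = np.mean((ref - deg) ** 2)          # mean squared error over all pixels
    if mse == 0.0:
        return float("inf")                  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)
```

Averaging such per-frame values over a sequence gives the figure usually reported; its weak correlation with MOS is exactly what motivates the content-dependent metric developed later in this work.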

Conventional subjective quality measurement methods involve presenting a test subject with one or more video clips. The subject is asked to make a decision about the subjective quality of the test clips, e.g. by recording the perceived degradation of the clip compared with a reference video clip. A typical test requirement is to determine the optimum choice from a set of alternative versions of a video clip, e.g. versions of the same clip encoded with different codecs, alternative strengths of a post-processing filter and alternative trade-offs between frame rate and image quality for a given bit rate. In this type of scenario, each of the alternative versions of a video clip must be viewed and graded separately by the test subjects, so that the time taken to carry out a complete test increases linearly with N, the number of alternatives to be tested. In some cases (e.g. choosing a preferred trade-off between frame rate and image quality), there are a large number of possible outcomes and the test designer is faced with the choice between (a) running a very large number of tests in order to obtain a fine-grained result or (b) limiting the number of tests at the expense of discretizing the result [Richardson 04].

In UMTS, mobile video streaming is an important service that has to provide a required level of customer satisfaction, given in the case of video by the perceived video stream quality. It is therefore important to choose the compression parameters as well as the network settings so that they optimize the end-user quality. Hence, an objective measure of video quality is sought that is simple enough to be calculated in real time. Distinguishing different content characters is an issue here, because subjective quality depends strongly on the content type. The goal of the present research is to estimate the video quality of mobile video streaming at the user level (perceptual quality of service) and to find the most suitable codec settings for the most frequent content types.

Mobile video streaming is characterized by the low resolutions that can be used, mostly 144x176 QCIF for mobile phones or 288x352 CIF for data cards and palmtops. In UMTS, bearers with 64-384 kbit/s are used for multimedia (audio and video) streaming, as the capacity of 2 Mbit/s has to be shared by all the users in a single cell. Mobile terminals also have limited complexity and power, so the decoding of higher-rate videos becomes quite a challenging task.

In the last years, several objective metrics for perceptual video quality estimation have been proposed. The proposed metrics can be subdivided into two main groups: video metrics based on a human vision model, and metrics based only on objective video parameters. The complexity of these methods is quite high and significant computational power is necessary to calculate them. For smaller resolutions, it is possible to find simpler estimates achieving the same performance. This is the aim of this work. Moreover, measures that do not need the original (non-compressed) sequence for the estimation of quality are of interest, because this reduces the complexity and at the same time broadens the possibilities of deploying the quality prediction [Michal Ries 06].


1.1 Objective Measurement

Different classes of objective quality measurements, requiring different knowledge of the source material, can be envisaged:

• measurements performed on the difference between reference and coded pictures,

• measurements obtained by computing some parameters on the coded picture and comparing them with the same parameters computed on the reference picture and considered as a signature,

• measurements performed only on the coded picture.

The objective measurements according to the first class require the computation of the differences between the coded pictures and the reference ones. Consequently, the uncoded pictures (references) are required at the input of the measurement device. The differences can be used to compute comparative distortion measures, which have a low correlation with the perceived impairment but are easy to extract. Alternatively, two different methods can be applied where the sensitivity of the human visual system is taken into account. In the first one, the differences between coded and reference pictures are analyzed according to a model of the human perception system. The accuracy of the results is closely related to the approximation of the model and, consequently, studies and investigations of the human perceptual system are important in order to define an accurate perception model. The second method requires prior results from subjective assessments. Prediction models with linear coefficients can be developed based on such results.

In the second class, quality indications can be obtained by comparing parameters computed separately on the coded pictures and the reference pictures. The parameters represent signal features and impairments which are of interest in a digital network. These parameters can be distributed in the network at low bit rates to be used when the entire reference signal is not available.

The third class of objective quality measurements could be obtained by analyzing the parameters included in the MPEG-2 bit stream. In this way no knowledge of the reference material is required. The MPEG-2 video bitstream includes coding parameters such as quantizer scales, coding modes, motion vectors, etc. The quantizer scale and the coding mode are available for each macro block [S. Olsson 97].

For the quantization of the transform coefficients, H.263 uses scalar quantization. The quantized transform coefficients are converted into coding symbols and all syntax elements of a macro block, including the coding symbols, are conveyed by entropy coding methods. A macro block can always be coded in INTRA-modes, various efficient INTER-modes, or the SKIP-mode.


The selection of the coding mode or the selection of the QP for each macro block within one frame influences the compression efficiency of the video codec. For the local selection, objective measures such as the mean square error are usually applied. However, the selection of global parameters such as bit rates, frame rates, etc., should not be solved exclusively by these objective evaluation criteria. Therefore, subjective tests have been performed to assess, evaluate, and optimize these global parameters for low-resolution low-rate video sequences.

Figure 1.1: Development Process for Video Quality Assessment Algorithm [Arthur A. We 00]


Chapter 2

VIDEO CODING AND COMPRESSION

Visual data in general and video in particular require large amounts of bandwidth and storage space. End-user costs in the mobile telephone system, on the other hand, are proportional to the transmitted data; thus bandwidth and transmission power are a bottleneck. The key to wireless video and multimedia applications is compression.

Uncompressed video at TV resolution has typical data rates of a few hundred Mb/s, for example; for HDTV this goes up into the Gb/s range. Evidently, effective compression methods are vital to facilitate handling such data rates.
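For orientation, a back-of-the-envelope calculation confirms these orders of magnitude (the parameters are illustrative assumptions, not figures from the thesis: ITU-R BT.601 resolution of 720x576 at 25 frames/s and 1920x1080 at 50 frames/s, both with 4:2:2 chroma sub-sampling, i.e. 16 bits per pixel on average):

\[
720 \times 576 \times 25\,\tfrac{\mathrm{frames}}{\mathrm{s}} \times 16\,\tfrac{\mathrm{bit}}{\mathrm{pixel}} \approx 166\,\mathrm{Mbit/s}, \qquad
1920 \times 1080 \times 50\,\tfrac{\mathrm{frames}}{\mathrm{s}} \times 16\,\tfrac{\mathrm{bit}}{\mathrm{pixel}} \approx 1.66\,\mathrm{Gbit/s}.
\]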

Lossless compression is the reduction of redundancy in data. Generic lossless compression algorithms, which assure the perfect reconstruction of the initial data, could be used for images and video. However, these algorithms only achieve a data reduction of about 2:1 on average, which is not enough. When compressing video, two special types of redundancy can be exploited:

• Spatial and temporal redundancy: Typically, pixel values are correlated with their neighbors, both within the same frame and across frames.

• Psycho-visual ”redundancy”: The human visual system is not equally sensitive to all patterns. Therefore, the compression algorithm can discard information that is not visible to the observer. This is referred to as lossy compression.

2.1 Colour Coding

Many compression schemes and video standards, such as PAL, NTSC or MPEG, are already based on human vision in the way that colour information is processed.



Figure 2.1: Temporal and spatial redundancy

In particular, they take into account the nonlinear perception of lightness, the organization of colour channels, and the low chromatic acuity of the human visual system. Further, the human visual system has an approximately logarithmic response to intensity.

Figure 2.2: Luminance sensitivity of the human eye

The theory of opponent colour states that the human visual system decorrelates its input into white-black, red-green and blue-yellow difference signals, which are processed in separate visual channels. Furthermore, chromatic visual acuity is significantly lower than achromatic acuity. In order to take advantage of this behavior, the colour primaries red, green and blue are rarely used for coding directly. Instead, colour difference (chroma) signals similar to the ones just mentioned are computed. In component video, for example, the resulting colour space is referred to as YUV or YCbCr, where Y encodes luminance, U or Cb the difference between the blue primary and luminance, and V or Cr the difference between the red primary and luminance.

The following matrix can be used to derive Y, U and V from R, G and B:

\[
\begin{pmatrix} Y \\ U \\ V \end{pmatrix}
=
\begin{pmatrix}
0.299 & 0.587 & 0.114 \\
-0.147 & -0.289 & 0.436 \\
0.615 & -0.515 & -0.100
\end{pmatrix}
\begin{pmatrix} R \\ G \\ B \end{pmatrix}
\tag{2.1}
\]

Here, R, G and B are assumed to range from 0 to 1, with 0 representing the minimum intensity and 1 the maximum. Y is in the range 0 to 1, U is in the range -0.436 to 0.436 and V is in the range -0.615 to 0.615.
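As a minimal code sketch of equation (2.1) (illustrative only, not part of the thesis; NumPy-based, names are assumptions), an RGB image with components in [0, 1] can be converted to YUV as follows:

```python
import numpy as np

# Rows of the matrix in equation (2.1): Y, U and V as weighted sums of R, G and B.
RGB_TO_YUV = np.array([[ 0.299,  0.587,  0.114],
                       [-0.147, -0.289,  0.436],
                       [ 0.615, -0.515, -0.100]])

def rgb_to_yuv(rgb: np.ndarray) -> np.ndarray:
    """rgb: array of shape (height, width, 3) with values in [0, 1]."""
    return rgb @ RGB_TO_YUV.T   # per pixel: (Y, U, V) = M * (R, G, B)
```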

The low chromatic acuity permits a significant data reduction of the colour difference signals. In digital video, this is achieved by chroma sub-sampling. The notation commonly used is as follows:

• 4:4:4 denotes no chroma sub-sampling.

• 4:2:2 denotes chroma sub-sampling by a factor of 2 horizontally; this sampling format is used in the standard for studio-quality component digital video as defined by ITU-R Rec. BT.601-5 (1995), for example.

• 4:2:0 denotes chroma sub-sampling by a factor of 2 both horizontally and vertically; it is probably the closest approximation of human visual colour acuity achievable by chroma sub-sampling alone. This sampling format is the most common in JPEG or MPEG, e.g. for distribution-quality video.

• 4:1:1 denotes chroma sub-sampling by a factor of 4 horizontally.

Figure 2.3: Temporal and spatial redundancy

In digital image processing, chroma sub-sampling is the use of lower resolution for the colour (chroma) information in an image than for the brightness (intensity or luma) information.

Because the human eye is less sensitive to colour than to intensity, the chroma components of an image do not need to be as well defined as the luma components; thus many video systems sample the colour difference channels at a lower definition (i.e., sample frequency) than the brightness. This reduces the overall bandwidth of the video signal without much apparent loss of picture quality. The missing values will be interpolated or repeated from the preceding sample for that channel.

The sub-sampling in a video system is usually expressed as a three-part ratio. The three terms of the ratio are: the number of brightness (”luminance”, ”luma” or Y) samples, followed by the number of samples of the two colour (”chroma”) components: U/Cb then V/Cr, for each complete sample area. For quality comparison, only the ratio between those values is important, so 4:4:4 could easily be called 1:1:1; however, traditionally the value for brightness is always 4, with the rest of the values scaled accordingly.
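For illustration, 4:2:0 sub-sampling of one chroma plane can be sketched as averaging each 2x2 neighbourhood (an assumption made for this example; real systems may use different filters and sample positions):

```python
import numpy as np

def subsample_420(chroma: np.ndarray) -> np.ndarray:
    """Reduce a chroma plane (U or V) by a factor of 2 horizontally and
    vertically by averaging 2x2 blocks, as in 4:2:0 sampling."""
    h, w = chroma.shape
    cropped = chroma[:h - h % 2, :w - w % 2].astype(float)  # drop odd edge row/column if any
    blocks = cropped.reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))
```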


2.2 Compression Methods

As mentioned at the beginning of this section, digital video is amenable to special compression methods. They can be roughly classified into model-based methods, e.g. fractal compression, and waveform-based methods, e.g. DCT or wavelet compression. Most of today's video codecs and standards belong to the latter category and comprise the following stages:

• Transformation: To facilitate exploiting psycho-visual redundancies, the pictures are transformed to a domain where different frequency ranges with varying sensitivities of the human visual system can be separated. This can be achieved by the discrete cosine transform (DCT) or the wavelet transform, for example. This step is reversible, i.e. no information is lost.

Two-dimensional transformation:

\[
F[u,v] = \frac{4\,C(u)\,C(v)}{n^2} \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} f(j,k)\,
\cos\!\left(\frac{(2j+1)u\pi}{2n}\right)
\cos\!\left(\frac{(2k+1)v\pi}{2n}\right)
\tag{2.2}
\]

where:

\[
C(\omega) =
\begin{cases}
\frac{1}{\sqrt{2}} & \omega = 0 \\[4pt]
1 & \omega = 1, 2, \ldots, (n-1)
\end{cases}
\tag{2.3}
\]

The component at ω = 0 has the greatest energy. The energies of the further components decrease with increasing order, which is distinctive for the transformation.

The concrete realization proceeds as follows: Each frame is subdivided into slices, which are a collection of consecutive macro blocks. Each macro block in turn contains four blocks of 8x8 pixels each. The DCT is computed on these blocks, while motion estimation is performed on macro blocks.

The resulting DCT coefficients are quantized and coded with variable-length codes.

• Quantization: After the transformation, the numerical precision of the transform coefficients is reduced in order to decrease the number of bits in the stream. The degree of quantization applied to each coefficient is usually determined by the visibility of the resulting distortion to a human observer; high-frequency coefficients can be more coarsely quantized than low-frequency coefficients, for example. Quantization is the stage that is responsible for the ”lossy” part of compression.


Figure 2.4: Discrete Cosine Transformation

• Coding: After the data has been quantized into a finite set of values, it can be encoded losslessly by exploiting the redundancy between the quantized coefficients in the bit stream. Entropy coding, which relies on the fact that certain symbols occur much more frequently than others, is often used for this process. Two of the most popular entropy coding schemes are Huffman coding and arithmetic coding.
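These stages can be illustrated for a single 8x8 block. The sketch below (NumPy-based and purely illustrative: it implements the transform of equation (2.2) directly and uses one uniform quantizer step instead of the standardized quantization tables of a real codec):

```python
import numpy as np

def dct2(block: np.ndarray) -> np.ndarray:
    """2-D DCT of an n x n block, following equation (2.2)."""
    n = block.shape[0]
    idx = np.arange(n)
    # basis[u, j] = C(u) * cos((2j + 1) * u * pi / (2n)), with C(0) = 1/sqrt(2)
    basis = np.cos((2 * idx[None, :] + 1) * idx[:, None] * np.pi / (2 * n))
    scale = np.ones(n)
    scale[0] = 1 / np.sqrt(2)
    basis = scale[:, None] * basis
    return (4.0 / n ** 2) * basis @ block @ basis.T

def quantize(coeffs: np.ndarray, step: float = 16.0) -> np.ndarray:
    """Uniform scalar quantization - the lossy stage."""
    return np.round(coeffs / step).astype(int)

block = np.random.randint(0, 256, (8, 8)).astype(float)  # one 8x8 luminance block
levels = quantize(dct2(block))                            # integers passed on to entropy coding
```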

A key aspect of digital video compression is exploiting the similarity between successive frames in a sequence instead of coding each picture separately. While this temporal redundancy could be taken care of by a spatial-temporal transformation, a hybrid spatial- and transform-domain approach is often adopted instead for reasons of implementation efficiency. A simple method for temporal compression is frame differencing, where only the pixel-wise differences between successive frames are coded. Higher compression can be achieved using motion estimation, a technique for describing a frame based on the content of nearby frames with the help of motion vectors. By compensating for the movements of objects in this manner, the differences between frames can be further reduced.
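Motion estimation itself is commonly done by block matching. A minimal full-search sketch with a sum-of-absolute-differences (SAD) criterion (illustrative only; block size, search range and the SAD criterion are assumptions here, not parameters of a specific codec):

```python
import numpy as np

def best_motion_vector(prev: np.ndarray, curr: np.ndarray, y: int, x: int,
                       block: int = 16, search: int = 7):
    """Displacement (dy, dx) into `prev` that best matches the block of `curr`
    whose top-left corner is (y, x), judged by the SAD criterion."""
    target = curr[y:y + block, x:x + block].astype(int)
    best, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + block > prev.shape[0] or xx + block > prev.shape[1]:
                continue                                   # candidate lies outside the frame
            candidate = prev[yy:yy + block, xx:xx + block].astype(int)
            sad = int(np.abs(target - candidate).sum())    # sum of absolute differences
            if best_sad is None or sad < best_sad:
                best, best_sad = (dy, dx), sad
    return best, best_sad
```

The residual after motion compensation (the current block minus its best match) is then transformed and quantized like any other block.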

2.3 Standards

The Moving Picture Experts Group (MPEG) is a working group of ISO/IEC in charge of developing international standards for the compression, decompression, processing, and coded representation of moving pictures, audio, and their combination. MPEG comprises some of the most popular and widespread standards for video coding. The group was established in January 1988, and since then it has produced:

• MPEG-1, a standard for storage and retrieval of moving pictures and audio, which was approved in 1992. MPEG-1 defines a block-based hybrid DCT/DPCM coding scheme with prediction and motion compensation. It also provides functionality for random access in digital storage media.


• MPEG-2 is typically used to encode audio and video for broadcast signals, including direct broadcast satellite and cable TV. MPEG-2, with some modifications, is also the coding format used by standard commercial DVD movies. MPEG-2 includes a systems part (Part 1) that defines Transport Streams, which are designed to carry digital video and audio over somewhat unreliable media, and are used in broadcast applications. The video part (Part 2) of MPEG-2 is similar to MPEG-1, but also provides support for interlaced video (the format used by broadcast TV systems). MPEG-2 video is not optimized for low bit rates (less than 1 Mbit/s), but outperforms MPEG-1 at 3 Mbit/s and above. All standards-conforming MPEG-2 Video decoders are fully capable of playing back MPEG-1 Video streams.

• MPEG-4 absorbs many of the features of MPEG-1 and MPEG-2 and other related standards, adding new features such as (extended) VRML support for 3D rendering, object-oriented composite files (including audio, video and VRML (Virtual Reality Modeling Language) objects), support for externally-specified Digital Rights Management and various types of interactivity. Most of the features included in MPEG-4 are left to individual developers to decide whether to implement them. This means that there are probably no complete implementations of the entire MPEG-4 set of standards. To deal with this, the standard includes the concept of ”profiles” and ”levels”, allowing a specific set of capabilities to be defined in a manner appropriate for a subset of applications.

The MPEG standard specifically defines three types of pictures: Intra Frames (I-Frames), Predicted Frames (P-Frames) and Bidirectional Frames (B-Frames).

These three types of frames are combined to form a group of frames.

Figure 2.5: I-, B- and P-Frames


Intra Frames

Intra frames, or I-Frames, are coded using only information present in the picture itself, and provide potential random access points into the compressed video data. They use only transform coding and provide moderate compression. Typically they use about two bits per coded pixel.

Predicted Frames

Predicted Frames, or P-Frames, are coded with respect to the nearest previous I- or P-Frame. This technique is called forward prediction and is illustrated in the figure above. Like I-Frames, P-Frames can also serve as a prediction reference for B-pictures and future P-pictures. Moreover, P-Frames use motion compensation to provide more compression than is possible with I-pictures.

Bidirectional Frames

Bidirectional Frames, or B-Frames, use both a past and a future frame as a reference. This technique is called bidirectional prediction. B-Frames provide the most compression since they use the past and future frame as a reference; the computation time is the largest.

The MPEG-2 system specification defines a multiplexed structure for combining audio and video data as well as timing information for transmission over a communication channel. It is based on two levels of packetization. First, the compressed bitstreams or elementary streams (audio or video) are packetized. Subsequently, the packetized elementary streams are multiplexed together to create the transport stream, which can carry multiple audio and video programs. It consists of fixed-size packets of 188 bytes each; their headers contain synchronization and timing information. Finally, the transport stream is encapsulated in real-time protocol (RTP) packets for transmission.
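As a small illustration of this fixed-size packetization (a sketch only; everything beyond the 0x47 sync byte and the 13-bit PID field is ignored here), a transport stream can be walked packet by packet:

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47

def iter_ts_packets(stream: bytes):
    """Yield (pid, packet) for each 188-byte MPEG-2 transport stream packet."""
    for offset in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = stream[offset:offset + TS_PACKET_SIZE]
        if packet[0] != SYNC_BYTE:
            raise ValueError(f"lost sync at offset {offset}")
        pid = ((packet[1] & 0x1F) << 8) | packet[2]   # 13-bit packet identifier
        yield pid, packet
```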

Figure 2.6: Coding Block Diagram


2.3.1 H.263

In H.263 there are different optional modes of operation. The different profile and level parameters describe the capabilities of the coder. Several preferred mode combinations for operation are defined and structured into ”profiles” of support. Moreover, some groupings of maximum performance parameters are defined as levels of support for these profiles.

3GPP technical specifications define PSS (Packet-Switched Streaming Service) protocols and codecs. The mandatory video codec is H.263 profile 0 level 10. The codec H.263 profile 3 level 10 is optional [3GPP 02].

Profile 0 is the Baseline Profile, with no optional modes of operation - custom picture formats (CPFMT) and custom picture clock frequencies (CPCFC) are supported but no motion vectors are provided [H.263 01].

Encoders and decoders supporting custom picture formats and/or custom picture clock frequencies are recommended to follow these rules:

1. A decoder for any profile and level defined herein that supports a maximum picture format shall support all standard picture formats smaller than or equal in both height and width to those of the maximum supported picture format. For example, a decoder supporting a custom picture format of 720x288 shall support CIF, QCIF and sub-QCIF picture decoding.

2. A decoder for any profile and level defined herein that supports custom picture formats shall support all standard or custom picture formats having both height and width smaller than or equal to those of the maximum supported picture format.

3. A decoder for any profile and level defined herein that supports a minimum picture interval with the standard picture clock frequency of (30 000)/1001 units per second shall support the same or smaller minimum picture interval for all supported picture formats having both height and width smaller than or equal to those of the maximum picture format at which the minimum picture interval is specified.

4. A decoder for any profile and level defined herein that supports a minimum picture interval and supports custom picture clock frequencies shall support the use of any picture clock frequency with the same or larger picture interval for all supported picture formats having both height and width smaller than or equal to those of the maximum picture format at which the minimum picture interval is specified.


Figure 2.7: H.263 Video Encoder Block Diagram

The Version 2 Interactive and Streaming Wireless Profile, designated as Profile 3, is composed of the baseline design plus the following modes [H.263 01]:

Advanced INTRA Coding: Use of this mode improves the coding efficiency for INTRA macro blocks (whether within INTRA pictures or predictively coded pictures). The additional computational requirements of this mode are minimal at both the encoder and decoder (as low as a maximum of 8 additions/subtractions per 8 x 8 block in the decoding process plus the use of a different but very similar VLC table in order to obtain a significant improvement in coding efficiency).

Deblocking Filter: Because of the significant subjective quality improvement that may be realized with a deblocking filter, these filters are widely in use as a method of post-processing in video communication terminals. As with the Advanced Prediction mode, this mode also includes the four-motion-vector-per-macro-block feature and picture boundary extrapolation for motion compensation, both of which can further improve coding efficiency. The computational requirements of the deblocking filter are several hundred operations per coded macro block, but memory accesses and computational dependencies are uncomplicated. This last point is what makes the Deblocking Filter preferable to Advanced Prediction for some implementations. Also, the benefits of Advanced Prediction are not as substantial when the Deblocking Filter is used as well. Thus, the Deblocking Filter is included in this basic package of support.


Slice Structured Mode: The Slice Structured mode provides resynchronization points within the video bitstream for recovery from erroneous or lost data.

Modified Quantization: This mode includes an extended DCT coefficient range, modified DQUANT syntax, and a modified step size for chrominance. The first two features allow for more flexibility at the encoder and may actually decrease the encoder's computational load (by eliminating the need to re-encode macro blocks when coefficient level saturation occurs). The third feature noticeably improves chrominance fidelity, typically with little added bit-rate cost. At the decoder, the only significant added computational burden is the ability to parse some new bitstream symbols.

Both profile 0 and profile 3 are provided in PSS services as level 10 [3GPP 02].

For each profile a group of parameters is defined - the level of support for a maximum performance in a profile. Seven levels of performance capability are defined (from level 10 to 70). Level 10 is the minimum level of performance capability. It supports QCIF and sub-QCIF resolutions and is capable of bit rates up to 64000 bits per second with a picture decoding rate of up to (15 000)/1001 pictures per second. Upper levels support greater resolution and bit rate [H.263 01].

All further observations were made with QCIF and H.263 coding for a real UMTS scenario.

2.4 Compression Artefacts

The compression algorithms used in various video coding standards are quite similar. Most of them rely on motion compensation and block-based DCT with subsequent quantization of the coefficients. In such coding schemes, compression distortions are caused by only one operation, namely the quantization of the transform coefficients. Although other factors affect the visual quality of the stream, such as motion prediction or decoding buffer size, they do not introduce any distortion per se, but affect the encoding process indirectly.

A variety of artefacts can be distinguished in a compressed video sequence:

The blocking effect or blockiness refers to a block pattern in the compressed sequence (fig. 2.8). It is due to the independent quantization of individual blocks (usually of 8x8 pixels in size) in block-based DCT coding schemes, leading to discontinuities at the boundaries of adjacent blocks. The blocking effect is often the most prominent visual distortion in a compressed sequence due to the regularity and extent of the pattern. Recent codecs such as H.264 employ a deblocking filter to reduce the visibility of the artefact.


Figure 2.8: Blocking effect

Blur manifests itself as a loss of spatial detail and a reduction of edge sharpness. It is due to the suppression of the high-frequency coefficients by coarse quantization.

Figure 2.9: Blurring

Colour bleeding is the smearing of colours between areas of strongly differing chrominance, see fig. 2.10 a). It results from the suppression of high-frequency coefficients of the chroma components. Due to chroma sub-sampling, colour bleeding extends over an entire macro block.

Figure 2.10: a) Colour bleeding b) Slanted lines

The DCT basis image effect is prominent when a single DCT coefficient is dominant in a block. At coarse quantization levels, this results in an emphasis of the dominant basis image and the reduction of all other basis images.


Slanted lines often exhibit the staircase effect, see fig. 2.10 b). This is due to the fact that DCT basis images are best suited to the representation of horizontal and vertical lines, whereas lines with other orientations require higher-frequency DCT coefficients for accurate reconstruction. The typically strong quantization of these coefficients causes slanted lines to appear jagged.

Ringing is fundamentally associated with Gibbs' phenomenon and is thus most evident along high-contrast edges in otherwise smooth areas. It is a direct result of quantization leading to high-frequency irregularities in the reconstruction. Ringing occurs with both luminance and chroma components.

False edges are a consequence of the transfer of block-boundary discontinuities (due to the blocking effect) from reference frames into the predicted frame by motion compensation.

Jagged motion can be due to poor performance of the motion estimation. Block-based motion estimation works best when the movement of all pixels in a macro block is identical. When the residual error of motion prediction is large, it is coarsely quantized.

Motion estimation is often conducted with the luminance component only, yet the same motion vector is used for the chroma components. This can result in chrominance mismatch for a macro block.

Mosquito noise is a temporal artefact seen mainly in smoothly textured regions as luminance/chrominance fluctuations around high-contrast edges or moving objects. It is a consequence of the coding differences for the same area of a scene in consecutive frames of a sequence.

Flickering appears when a scene has high texture content. Texture blocks are compressed with varying quantization factors over time, which results in a visible flickering effect.

Aliasing can be noticed when the content of the scene is above the Nyquist rate, either spatially or temporally.

While some of these effects are unique to block-based coding schemes, many of them are observed with other compression algorithms as well. In wavelet-based compression, for example, the transform is applied to the entire image; therefore none of the block-related artefacts occur. Instead, blur and ringing are the most prominent distortions.

The relevant artefacts for streaming over UMTS are blockiness, blurriness and jerkiness. The forthcoming analysis informs about the tolerance of the user towards the different artefacts.
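Blockiness, being the dominant artefact at these bit rates, can already be indicated by very simple no-reference measures. One illustrative example (this particular formulation is a sketch and not the measure used later in this thesis) compares the average luminance jump across the 8-pixel block grid with the jump inside blocks:

```python
import numpy as np

def blockiness(luma: np.ndarray, block: int = 8) -> float:
    """Ratio of mean absolute differences across block boundaries to those
    inside blocks; values clearly above 1 suggest visible blocking."""
    diff = np.abs(np.diff(luma.astype(float), axis=1))  # horizontal neighbour differences
    cols = np.arange(diff.shape[1])
    on_boundary = (cols % block) == (block - 1)         # differences straddling a block edge
    return diff[:, on_boundary].mean() / (diff[:, ~on_boundary].mean() + 1e-9)
```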


Essential to mention in this context is error resilience (combatting e.g. packet delays or losses). As mobile devices are still constrained by their processing power and storage limits, receiver algorithms can only be advanced to a certain point.


Chapter 3

UMTS

”Universal Mobile Telecommunications System” stands for the mobile network standard of the third generation (3G, analogue cellular being the first generation, digital PCS the second), offering conspicuously higher capacity, data speeds and new service capabilities. 3G/UMTS supports a greater number of voice and data customers - especially in urban centers - offering higher data rates, being specified as an integrated solution for mobile voice and data with wide area coverage. The International Telecommunication Union (ITU) identified the following radio spectrum bands for the 3G IMT-2000 mobile services in Europe:

• 1.900 MHz - 1.920 MHz (TDD)

• 1.920 MHz - 1.980 MHz (FDD-Uplink)

• 2.010 MHz - 2.025 MHz (TDD)

• 2.110 MHz - 2.170 MHz (FDD-Downlink)

with a 5 MHz channel carrier width. The uplink and downlink data rates are symmetric - an optimal assumption for applications like real-time video telephony (in comparison to ADSL, where there is an asymmetry between uplink and downlink throughput rates). The mainly used access technique is WCDMA (Wideband Code Division Multiple Access). The maximum transmission power of a mobile station is 0.125 W - 0.25 W (GSM had 1 W - 2 W). HSDPA (High Speed Downlink Packet Access) allows even higher data rates in downlink - up to 14.4 Mbit/s (theoretically). UMTS can use a variety of present and future wireless network technologies, including GSM, CDMA, TDMA, WCDMA, CDMA2000 and EDGE.

3.1 Services

3G combines high-speed mobile access with Internet Protocol (IP) based services. The Bearer Services are flexible - they can be altered at session or connection establishment and during the ongoing session or connection. Both connection-oriented and connectionless services are offered for Point-to-Point and Point-to-Multipoint communication.

UMTS network services have different QoS classes for four types of traffic:

• Conversational class (real time; voice, video telephony, video gaming)

• Streaming class (real time; multimedia, video on demand, webcast)

• Interactive class (best effort; web browsing, network gaming, database access)

• Background class (best effort; email, SMS downloading)

Network services are considered End-to-End, that is, from one Terminal Equipment (TE) to another TE. An End-to-End service may have a certain QoS which is provided for the user of a network service. It is the user who decides whether he is satisfied with the provided QoS.

To realize a certain network QoS, a Bearer Service with clearly defined characteristics and functionality has to be set up from the source to the destination of a service.

A Bearer Service includes all aspects to enable the provision of a contracted QoS. These aspects are, among others, the control signalling, user plane transport and QoS management functionality.

3.2 Video streaming

Streaming is the consuming (reading, hearing, viewing) of media while it is being delivered (e.g. watching and listening to synchronized video and audio on a mobile phone while they are being transmitted to the TE over a data network). Streaming media storage size is calculated from the streaming bandwidth and the length of the media with the following formula (single user and file):

\[
\mathrm{size\,(MB)} = \mathrm{length\,(s)} \cdot \mathrm{bitrate\,(kbit/s)} \cdot
\frac{1000\,\mathrm{bit}}{1\,\mathrm{kbit}} \cdot
\frac{1\,\mathrm{byte}}{8\,\mathrm{bit}} \cdot
\frac{1\,\mathrm{MB}}{1\,048\,576\,\mathrm{byte}}
\tag{3.1}
\]

One hour of video encoded at 300 kbit/s will be:

\[
\frac{3600\,\mathrm{s} \times 300\,\mathrm{kbit/s} \times 1000}{8 \times 1\,048\,576} = 128.7\,\mathrm{MB}
\tag{3.2}
\]


of storage if the file is stored on a server for on-demand streaming. If this stream is viewed by 1,000 people, one would need 300 kbit/s · 1,000 = 300,000 kbit/s = 300 Mbit/s of bandwidth. This is equivalent to roughly 126 GB per hour (1,000 × 128.7 MB).
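The same bookkeeping as in equations (3.1) and (3.2), written as a small helper (illustrative only; it simply restates the formulas above with 1 MB = 1 048 576 bytes):

```python
def stream_storage_mb(length_s: float, bitrate_kbit_s: float) -> float:
    """Storage of one streamed file in MB, following equation (3.1)."""
    return length_s * bitrate_kbit_s * 1000 / 8 / 1048576

def aggregate_bandwidth_mbit_s(bitrate_kbit_s: float, viewers: int) -> float:
    """Server bandwidth in Mbit/s for a number of simultaneous unicast viewers."""
    return bitrate_kbit_s * viewers / 1000

print(stream_storage_mb(3600, 300))           # ~128.7 MB for one hour at 300 kbit/s
print(aggregate_bandwidth_mbit_s(300, 1000))  # 300.0 Mbit/s for 1,000 viewers
```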

UMTS enables IP-based video transmission and reception at any place and time at reasonable and sufficient data rates. It is a major application in mobile systems, growing and improving with the technical progress. New applications are made possible by video-capable and aligned mobile displays. Mobile network providers are willing to satisfy each of the growing user demands as the telecommunication industry is flourishing more than ever. Not only is the number of sold mobile phones reaching astronomical values, the devices have become a general personal item - unimaginable to be missing in modern life. Makers overtrump each other, offering bigger screens, 3-D graphics chips and lush digital sound and video. The users are taking advantage of the attractions, accepting and embracing each innovation. In the field of video streaming within UMTS this means the following two offers:

• Streaming Video on Demand: Stored videos are sent on demand. Contents of interest are news, music videos, weather information, past soap opera episodes, past sitcom episodes, sports recordings, ...

• Streaming Live TV: choosing a TV channel and switching it on, just likea standard tube.

3.2.1 Packet Switched Streaming Service

The packet-switched streaming service (PSS) of UMTS enables mobile streaming applications, where the protocol and terminal complexity is lower than for conversational services, which in contrast to a streaming terminal require media input devices, media encoders and more complex protocols. GSM in comparison delivered a 9.6 kbit/s circuit-switched channel for data transfer. As demands increased, the system improved, boosting first the speed and then changing the strategy completely. The capacity and performance were heightened by new modulation methods. WCDMA (Wideband Code Division Multiple Access) uses QPSK (Quadrature Phase Shift Keying), its variant dual QPSK and 16QAM (Quadrature Amplitude Modulation). Those new features were specified in the fifth Release of 3GPP. Still, UMTS includes a CS capability, as backward compatibility has to be assured. This feature is not necessarily a thankless effort, but an enrichment with which UMTS exhibits three operating modes:

PS/CS operation mode: PS and CS services can be provided simultaneously via the PS and CS domain

PS operation mode: the PS domain provides PS services, not preventing classical CS services from being carried out as PS (e.g. VoIP - Voice over IP)


CS operation mode: CS services are offered over the CS domain. Nevertheless, packet-like services can also be provided over the CS domain. Especially high QoS requirements can be met, since a convenient constant line with a constant bit rate is offered

3.2.2 Multimedia Broadcast / Multicast Service

3GPP introduced the multimedia broadcast/multicast service (MBMS) feature in the sixth release (July 2005), allowing the same information to be addressed to many users in one cell using the same radio resources. MBMS is a unidirectional point-to-multipoint service: one source supplies various users with the same transmitted data. This allows network resources to be shared, improving the efficiency. MBMS is provided over a broadcast or multicast service area, covering the whole network or a small spatial area [3GPP 05].

• Multicast Mode: unidirectional point-to-multipoint transmission of multimedia data from a single source point to a multicast group in a multicast service area. A multicast group containing subscribed users is the recipient.

• Broadcast Mode: unidirectional point-to-multipoint transmission of multimedia data from a single source entity to all users in a broadcast service area. No subscription is required, so every user is a recipient.

3.2.3 IP Multimedia Subsystem

IMS (IP Multimedia Subsystem), introduced in the fifth 3GPP release and improved in the sixth, provides a flexible architecture to offer person-to-person services using a wide range of integrated media: voice, text, picture and video. IP-based multimedia allows users to communicate via voice, video or text simultaneously over multiple access technologies via a single client on a mobile device. Independence from the access technologies is claimed, realized by a separation of access, transport and control. Applications are:

• Push-to-Talk over Cellular (PoC): a two-way, although half-duplex, form of communication connecting two or more users, analogous to the classic ”walkie-talkie”

• Conferencing: audio and/or video communications over a stream

• Presence: service for supporting the management of presence information (status, optional communication address or other optional attributes) between watchers and presentities


• Messaging: service for sending and receiving messages or creating and participating in messaging conferences with multiple users.

• other interactive applications like improved Voice and Video over IP


Chapter 4

CONTENT CLASSIFICATION

4.1 Scene Changes

Splitting a video into its basic temporal units - shots - is considered the initial step in the process of video content analysis. A shot is a series of video frames taken by one camera (e.g. zooming in or out on an object, panning along a landscape). Since each shot of a sequence can have a different content character, scene change detection represents a pre-stage of content classification. Shots can be compared to words, with each frame being a letter and scenes being sentences. To distinguish the class of the words, they first have to be identified correctly.

Two consecutive shots are separated by a shot boundary, which can be abrupt or gradual. While an abrupt shot boundary (= cut) is generated by simply attaching one shot to another without modifying them, a gradual shot boundary is the result of applying an editing effect to merge two shots.


Figure 4.1: Abrupt shot boundary (cut)

From now on only cuts are considered, because gradual shot boundaries (dissolves, fades or wipes) are hardly detectable.

The development of shot-boundary detection algorithms has the longest and richest history in the area of video content analysis. Longest, because this area was actually initiated by the attempts to automate the detection of cuts in video, and richest, because a vast majority of all works published in this area so far address in one way or another the problem of shot-boundary detection. This is not surprising, since the detection of shot boundaries provides the base for


Figure 4.2: Gradual shot boundary

nearly all high-level video content analysis approaches and is, therefore, one of the major prerequisites for successfully revealing the video content structure (but also profitable for video restoration or for colouring black-and-white movies) [Hanjalic 04].

The basic idea when detecting cuts in video is that the frames surrounding a cut are, in general, much more different with respect to their visual content than frames taken from within a shot. A helpful fact is the high frame rate (in our case 25 frames/second), which means a high correlation between consecutive frames within a shot. A disturbance of this order indicates a cut (which manifests itself as an immense visual difference between consecutive frames).

A variety of techniques have been proposed for shot boundary detection in digital video. Pairwise comparison checks each pixel in one frame against the corresponding pixel in the next frame [Zhang H.J. 93]. The likelihood ratio approach compares blocks of pixel regions [Zhang H.J. 93]. Net comparison breaks the frame into base windows [John Chung-M 04]. The colour histogram method compares the intensity or colour histograms between adjacent frames [Bouthemy 04]. Model based comparison uses the video production system as a template [Hampapur 94]. Edge detection segmentation looks for entering and exiting edge pixels [Ramin Zabih 95].

4.1.1 Sum of Absolute Differences

The success of shot-boundary detection largely depends on the selection of features used to represent the visual content flow along a video. The feature needs to have a clear, distinctive temporal behavior before and after a cut. In the present


case the intensity variance was used as the feature. The Sum of Absolute Differences (SAD) is computed as follows:

SAD(n) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} |F_n(i, j) - F_{n-1}(i, j)|    (4.1)

where F_n is the n-th frame of size N×M and i and j denote the pixel coordinates. The SAD values are stored and used further on to evaluate the best compression scheme.
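As an illustration, the SAD time series of equation 4.1 can be computed with a few lines of Python/NumPy; the frames are assumed here to be 2-D luminance arrays (e.g. 144x176 for QCIF), and the function names are illustrative only.

import numpy as np

def sad(frame_prev, frame_cur):
    # Sum of absolute luminance differences between two frames (eq. 4.1)
    return float(np.abs(frame_cur.astype(np.int32) - frame_prev.astype(np.int32)).sum())

def sad_series(frames):
    # SAD(n) for n = 1 .. len(frames)-1
    return np.array([sad(frames[n - 1], frames[n]) for n in range(1, len(frames))])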


Figure 4.3: SAD of a trailer-sequence

The feature is rather simple, but still highly sensitive to camera and object motion.


4.1.2 Fixed Threshold

To decide whether a shot boundary has occurred, it is necessary to set a threshold, or several thresholds, for the similarity between adjacent frames. SAD values above this threshold are considered as cuts, while values below this threshold are ignored. To accurately segment broadcast video, it is necessary to balance the following two apparently conflicting points:

• The need to prevent detection of false shot boundaries, by setting a sufficiently high threshold level so as to insulate the detector from noise.

• The need to detect subtle shot transitions such as dissolves, by making the detector sensitive enough to recognize gradual change.

In the particular sequence shown, it can be seen that the ”noise level” is quite low. This makes it easy to detect a real shot boundary using a fixed threshold, shown in figure 4.4 by the horizontal line at 3 times the mean SAD of the video.
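Such a fixed-threshold detector is only a few lines on top of the SAD series; a minimal sketch is given below (the factor of 3 follows figure 4.4, everything else, including the function name, is an illustrative assumption).

import numpy as np

def detect_cuts_fixed(sad_values, factor=3.0):
    # Frames whose SAD exceeds `factor` times the mean SAD are reported as cuts
    sad_values = np.asarray(sad_values, dtype=float)
    threshold = factor * sad_values.mean()
    return np.flatnonzero(sad_values > threshold)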


Figure 4.4: Fixed threshold at 3 times the mean(SAD) of the trailer sequence

Thus, this result shows ideal conditions for correctly identifying shot cuts (relying only on the magnitude of the discontinuities). Unfortunately, this case is only an exception within the huge variety of existing videos.

The main reason why automatic shot boundary detection is difficult is the fact that any kind of shot transition can easily be confused with camera and object motion, which occurs in video anyway. A shot with lots of object motion throughout the frame, such as a sports or an action shot, can cause the false recognition


of a shot boundary. To further complicate matters, a camera can perform a variety of movements such as panning, tilting, booming, tracking, zooming in or out, or a combination of these. All this can produce so much visual ”noise” that simplistic automatic detection techniques fail [Paul Browne 99].

Even if the impreciseness of the method is accepted (= not detecting all cuts + detecting false cuts), this approach is unusable when considering different videos (from diverse categories), as the experiments below show.

Tests were performed in which cuts in videos from different categories were searched for with the fixed-threshold method, looking for a universal fixed threshold (expressed as a multiple of the mean SAD of the sequence). The true number of cuts was determined by a human observer. The number of detected cuts as a function of the threshold factor was then compared, see figures 4.5 and 4.6.

Figure 4.5: Detected cuts in a Nature Film sequence for a fixed threshold varying as a multiple of the mean SAD; true cuts = 2; the true number is reached for thresholds of 2.35-4.9 times the mean SAD

While in some videos the true cut count was reached at 2.3 times the mean SAD value, others did not reach it until 6 times their mean value.

The ”fixed threshold”, expressed as a multiple of mean(SAD), varied between different content groups (see figure 4.8) as well as within a content group (see figure 4.7). This trial showed that a general fixed threshold cannot be defined, not even with the constraint ”per content class”. Furthermore, the number of cuts per time unit proved to be a weak parameter for content classification.

A new parameter had to be taken into account. At this point the variance of the SAD was considered. It was expected that videos with a large variance would behave differently (having a bigger amplitude, they need a larger threshold) than the ones with a smaller variance. Unfortunately, no particular correlation could be found.

Based on the results obtained, particularly with reference to the experimental outcomes shown, a fixed threshold was not adequate to deal with the variety of


Figure 4.6: Detected cuts in a Comedy Action sequence for a fixed threshold varying as a multiple of the mean SAD; true cuts = 6; the true number is reached at 6 times the mean SAD

Figure 4.7: Fixed threshold adjusting within football sequences (detected cuts vs. threshold in multiples of the mean SAD): (a) Football 1, 10 true cuts; (b) Football 2, 18 true cuts; (c) Football 3, 10 true cuts; (d) Football 4, 15 true cuts

different video content types, not even within the testing set of mobile videos. Therefore this trace was no longer pursued, as even upgrades were not promising.


Figure 4.8: Fixed threshold adjusting in different content class sequences (detected cuts vs. threshold in multiples of the mean SAD): (a) Motorsports, 10 true cuts; (b) Cinema Trailer, 24 true cuts; (c) Comedy Action, 6 true cuts; (d) Comedy Talkshow, 10 true cuts


4.1.3 Dynamic Threshold

As a starting point, [Anastasios D 05] was used:

For the proposed threshold function the local statistical properties of the sequence are used. X_i is a random variable modelling the SAD value of frame i. N is the length of a sliding window. If the current frame number is n, then the window lies in [n − (N + 1), n − 1]. The empirical mean value m_n and the standard deviation σ_n of X_i are computed in the defined window as follows:

SAD(n) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} |F_n(i, j) - F_{n-1}(i, j)|    (4.2)

m_n = \frac{1}{N} \sum_{i=n-N-1}^{n-1} X_i    (4.3)

σ_n = \sqrt{\frac{1}{N-1} \sum_{i=n-N-1}^{n-1} (X_i - m_n)^2}    (4.4)

These quantities are used together with X_{n-1} to define the threshold T(n) as follows:

T(n) = a · X_{n-1} + b · m_n + c · σ_n    (4.5)

where n denotes the current frame and a, b, c are constants. Alternatively, the equation can be written as T(n) = a(X_{n-1} − m_n) + (b + a) m_n + c σ_n, which is more illustrative for the following discussion about the appropriate choice of the constants. a, b and c are very important because they determine the character of the threshold function. If b takes high values, the threshold becomes more rigid, keeping high values without approaching the X_i sequence. This avoids wrong change detections, as in the case of intense motion scenes; on the other hand, the detector can miss some low-valued, difficult scene changes. A low value of b allows us to detect regions with high activity, which can provide valuable information to other applications (e.g. bit-rate control, semantics, error concealment etc.). As σ is the standard deviation, high values of c prevent intense motion from being detected as a scene change. On the contrary, a must have a small value because, as already discussed, the properties of X_i are not always welcome in scene change detection. The whole procedure of choosing a, b, c is a trade-off and should be performed with respect to the intended application.

Quite satisfying results were achieved immediately. Still, some changes were performed to make the dynamic-threshold metric more applicable to the present video sequences. The modifications were the following:

• the X_{n−1} term was eliminated (with no particular change in the result)



Figure 4.9: Dynamic threshold suggested by [Anastasios D 05]

• the constants were modified (lower weighting)

• SAD smoothing was performed (especially for frame-rate-transformed videos)

• the previously uncomputed period of the first 20 frames was calculated with respect to the following 20 frames

• the sliding window used for the remaining sequence was centered at the currently considered frame i, taking into account the past ten frames and the upcoming ten frames:

m'_n = \frac{1}{N+1} \sum_{i=n-N/2}^{n+N/2} X_i    (4.6)

σ'_n = \sqrt{\frac{1}{N} \sum_{i=n-N/2}^{n+N/2} (X_i - m_n)^2}    (4.7)

• 10 frames were set as the minimal distance between two consecutive cuts

• a minimal amplitude of 1 times the mean SAD was fixed for detectable cuts

• the exponential decay was removed, as the minimum of 10 frames between two cuts made it redundant

The requirement of a low-complexity, fast and dependable method for shot boundary detection was thus met by the modified dynamic threshold technique.
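A compact Python/NumPy sketch of the modified detector is given below. It follows the bullet points above, but the weights b and c are placeholders, since the thesis only states that the original constants were lowered; everything not listed above is an assumption of this sketch.

import numpy as np

def detect_cuts_dynamic(sad_values, b=1.0, c=2.0, half_win=10, min_gap=10):
    # Dynamic-threshold cut detection on a SAD series (sketch).
    # The window of eq. 4.6/4.7 is centred on the current frame, the
    # X_{n-1} term of eq. 4.5 is dropped, cuts must lie at least `min_gap`
    # frames apart and exceed 1x the global mean SAD.
    sad_values = np.asarray(sad_values, dtype=float)
    global_mean = sad_values.mean()
    cuts, last_cut = [], -min_gap
    for n in range(len(sad_values)):
        window = sad_values[max(0, n - half_win):n + half_win + 1]
        m = window.mean()
        s = window.std(ddof=1) if len(window) > 1 else 0.0
        if (sad_values[n] > b * m + c * s
                and sad_values[n] > global_mean
                and n - last_cut >= min_gap):
            cuts.append(n)
            last_cut = n
    return cuts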



Figure 4.10: Upgraded dynamic threshold


4.2 Spatial and Temporal Information

The cropped video sequences are the starting point for the further analysis. Every video part is a consistent unit in itself; considering its particular properties, they are expected to be very similar.

Thus, when looking at a couple of videos with related characters, their attributes should also be comparable. To qualify those features mathematically, the following objective parameters are observed:

Spatial Information

The Spatial Information content, SI, is based on the Sobel filter. Each video frame (luminance plane) at time n (F_n) is first filtered with the Sobel filter (Sobel(·)). The standard deviation over the pixels (std_space) in each Sobel-filtered frame is then computed. This operation is repeated for each frame in the video sequence and results in a time series of spatial information values. The maximum value in the time series (max_time) is chosen to represent the spatial information content of the scene. This process can be represented in equation form as equation 4.8.

SI = max_time{std_space[Sobel(F_n)]}    (4.8)

Temporal Information

The Temporal Information content, TI, is based upon the motion difference feature, M_n(i, j), which is the difference between the pixel values (of the luminance plane) at the same location in space but at successive times or frames. M_n(i, j) as a function of time (n) is defined as

M_n(i, j) = F_n(i, j) − F_{n-1}(i, j)    (4.9)

where F_n(i, j) is the pixel at the i-th row and j-th column of the n-th frame in time. The measure of temporal information content, TI, is computed as the maximum over time (max_time) of the standard deviation over space (std_space) of M_n(i, j) over all i and j.

TI = max_time{std_space[M_n(i, j)]}    (4.10)

More motion in adjacent frames will result in higher values of TI.

That is the reason why only values between cuts are taken into account. Otherwise a completely falsified value would be the result (because of the extremely high values caused by the cut).


Video shots from one scene have (with a higher probability) the same background, so they might have higher visual similarities compared to shots from different scenes. Moreover, shots in the same scene may also be organized in a temporal sequence to convey scenario information, where:

• F_n(i, j): the pixel at position (i, j) in the n-th frame

• std_space: standard deviation over space

• Sobel Filter:

y(i, j) = \sqrt{G_x^2(i, j) + G_y^2(i, j)}    (4.11)

G_c(i, j) = \sum_{k=i-1}^{i+1} \sum_{l=j-1}^{j+1} S_c(k, l) F(k, l)    (4.12)

S_x = [ -1 0 1 ; -2 0 2 ; -1 0 1 ],    S_y = [ -1 -2 -1 ; 0 0 0 ; 1 2 1 ]

Figure 4.11: Sobel filter kernels S_x and S_y
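A minimal Python/NumPy sketch of both measures is given below; it assumes grey-scale frames as 2-D arrays and uses a plain ”valid” 3x3 filtering loop so that no external filtering library is needed. All function names are illustrative.

import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

def _filter2d(img, kernel):
    # Plain 'valid' 3x3 filtering (cross-correlation, as in eq. 4.12)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * img[di:di + h - 2, dj:dj + w - 2]
    return out

def spatial_information(frames):
    # SI = max over time of the std over space of the Sobel-filtered frame (eq. 4.8)
    values = []
    for f in frames:
        gx = _filter2d(f.astype(float), SOBEL_X)
        gy = _filter2d(f.astype(float), SOBEL_Y)
        values.append(np.sqrt(gx ** 2 + gy ** 2).std())
    return max(values)

def temporal_information(frames):
    # TI = max over time of the std over space of the frame difference (eq. 4.9/4.10)
    return max(
        (frames[n].astype(float) - frames[n - 1].astype(float)).std()
        for n in range(1, len(frames))
    )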

4.2.1 SI-TI Research

The distribution of points in an SI-TI diagram (figure 4.12) qualifies the diversity of the scene character within a video. The smaller the SI-TI cloud, the more similar are the shots. The bigger the SI-TI cloud, the harder is the specification of the video character, especially through the SI-TI centroid: only one point represents the whole video sequence, averaging all SI-TI values and thereby mixing all features. Even within the same video there may be several kinds of content, whereby the centroid takes a mean value that does not represent the video character at all.

Still, the next approach was the inspection of all SI-TI centroids (fig. 4.13). Due to the huge amount of SI-TI values per category this was considered a concise method. The aim was to find a correlation among semantically related shots and to group them without having to deal with the surrounding SI-TI values.

The SI-TI centroids of all videos and all categories are shown in fig. 4.13:



Figure 4.12: SI-TI values of the shots of a video trailer (blue). The cross marks the centroid of the SI-TI values. The circle is the weighted centroid (the longer a shot, the more the centroid is pulled towards that shot's SI-TI value).
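The two centroids of figure 4.12 amount to a plain and a shot-length-weighted mean of the per-shot (SI, TI) points; a small sketch (names illustrative) could look as follows.

import numpy as np

def siti_centroid(si, ti, shot_lengths=None):
    # Centroid of the (SI, TI) points of a sequence's shots.
    # If shot_lengths (in frames) is given, longer shots pull the centroid
    # more strongly towards their (SI, TI) value, as in figure 4.12.
    si, ti = np.asarray(si, float), np.asarray(ti, float)
    w = np.ones_like(si) if shot_lengths is None else np.asarray(shot_lengths, float)
    w = w / w.sum()
    return float((w * si).sum()), float((w * ti).sum())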


Figure 4.13: SI-TI Centroid Diagram (each point belongs to a whole sequence)

Considering the weighted mean values (the centroid values were weighted non-linearly in the hope of distinguishing them better), the graph changes into fig. 4.14:



Figure 4.14: SI-TI diagram of the weighted centroids: the SI-TI value cloud is still not separable, but points belonging to the same group correspond slightly better to each other.

Some experiments were tried out. The weights of the centroids were computed by exponential functions, and the constants were changed to move along the function (fig. 4.15, 4.16):

Unfortunately, the exponential weights did not lead to any further grouping or insight. For the ongoing study the weighted centroids were therefore used.

More descriptors had to be added to facilitate the clustering.

• Colour research was considered. As the green colour would be expected to occur most frequently in the football group, this approach was evident.

• For the music group an additional audio feature seemed plausible; this appeared especially promising since the video content of music videos is very diverse.

• When viewing cinema trailers, the frequent cuts stand out (roughly every 30 frames).

In [Winkler 05] three parameters were proposed that agree with the above observations:

• motion activity

• cut density

• sound energy


Figure 4.15: SI-TI diagrams of the content centroids with varying exponential weight coefficients: (a) exp(x), (b) 1-exp(x), (c) 1-exp(-x), (d) exp(10x); magenta: cartoons; blue: comedy action; cyan: comedy talkshow; red: nature; green: motorsports; yellow: football; black: trailer

Still, the concluding step was to observe all SI-TI values of all videos. When considering only the centroids, videos with a high internal diversity were assigned to a completely different group than their individual values would suggest.

The single shots, represented by the crosses in the graphs, were then regarded on their own, searching for similarities in regions close to each other. The graph was split, trying to separate at least some of the classes (fig. 4.18), but still a cloud remained containing the mixed and undistinguishable sequences.


Figure 4.16: Further SI-TI diagrams of the content centroids with varying exponential weight coefficients: (a) exp(50x), (b) exp(100x), (c) exp(125x), (d) exp(150x); magenta: cartoons; blue: comedy action; cyan: comedy talkshow; red: nature; green: motorsports; yellow: football; black: trailer


Figure 4.17: SI-TI diagram of all sequences



Figure 4.18: SI-TI diagram of 4 content classes (comedy action, comedy talkshow, cartoons, nature)


4.3 Sequence Motion Characteristics

4.3.1 Motion Feature Extraction

H.263 uses block-based coding schemes. In these schemes, the pictures are subdivided into smaller units called blocks, computed one by one, by both the decoder and the encoder. These blocks are processed in scan order, row-wise from the upper left edge to the lower right edge.

A block is defined as a set of 8x8 pixels. As the chrominance components are downsampled, each chrominance block corresponds to four Y blocks. The collection of these six blocks is called a macroblock (MB), which constitutes a unit during the coding process. A number of macroblocks are grouped together into a unit called a group of blocks (GOB). H.263 allows a GOB to contain one or more rows of MBs.

Figure 4.19: H.263 Picture Structure at QCIF resolution

Motion

A video with a sufficiently high frame rate has a major similarity between successive frames. When coding, this fact is used - the difference between frames is coded rather than the frames themselves. An estimate for the frame being coded is obtained from the previous frame, and the difference between the prediction and the current frame is sent.


Most video sequences contain moving objects, and this motion is one of the reasons for the difference between successive frames. If there was no motion of objects between frames, frames would be very similar. The basic idea behind motion compensation is to estimate this motion of objects and to use this information to build a prediction for successive frames. H.263 supports block based motion estimation and compensation. Motion compensation is done at the macroblock level, except in the advanced modes, when it is done at the block level. The process involves looking at every MB in the current frame and trying to find a match for it in the previous frame. Each MB in the current frame is compared to 16x16 areas in a specified search space in the previous frame, and the best matching area is selected. This area is then offset by the displacement between its position and the position of the current MB. This forms a prediction of the current MB. In most cases, it is not possible to find an exact match for the current MB. The prediction area is usually similar to the MB, and the difference or residue between the two is small. Similarly, predictions for all the MBs in the current frame are obtained and the prediction frame is constructed. The residue frame is computed and is expected to have a much smaller energy than the original frame. This residue frame is then coded using the transform coding procedure.

Figure 4.20: Motion Vector

The two grey blocks in fig. 4.20 are equal - the left one represents the best match found for the current block in the previous frame. A major part of the encoding process involves finding these best matches or, equivalently, the offsets between the position of the best match and the position of the current block. This process of searching for the best match is called motion estimation. The offsets between the position of the best match and the position of the current MB are called motion vectors. H.263 allows these motion vectors to have non-integral values (matches between pixels can be found), made possible through bilinear interpolation [Ming-Ting Su 00].

Content Classification via Motion Vectors

The motion vector idea used in the H.263 codec was followed here with a new purpose: having motion vectors corresponds to having a description of the moving objects or of the movement of the camera.


The assumption was made that a camera movement is equal to the movement of all macroblocks in the same direction with the same magnitude (motion vectors have the same lengths and angles).

Less movement (small or zero motion vectors) can be ascribed to a very static sequence, like news.

To test these statements, the following implementation was done: a video sequence is read frame-wise. The luminance of each frame is calculated, and the frame is separated into 8x8-pixel macroblocks.

Figure 4.21: QCIF frame (176 pixels = 22 x 8 wide, 144 pixels = 18 x 8 high) split into 8x8-pixel macroblocks

For each macroblock's luminance a sum of absolute differences is built, and a similar value is searched for in a region around it in the previous frame. Border macroblocks have smaller search regions; the area obtained for a corner macroblock contains only four blocks.

Figure 4.22: Search region in the previous frame for the current macroblock of the current frame

The comparison of two macroblocks happens pixel-wise. The current frame's macroblock position gets the highest search priority in the previous frame.

The result is a position, a length and an angle of a motion vector. All motion vectors of a frame, displayed at their respective positions, can look as in fig. 4.23:

Choosing a bigger search region can lead to higher precision, but can also produce very erroneous vectors (fig. 4.24).
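The described block matching can be sketched in Python/NumPy as below. The block size of 8x8 pixels follows the text, while the size of the search window and all names are illustrative assumptions, not values from the thesis.

import numpy as np

def motion_vectors(prev, cur, block=8, search=8):
    # Exhaustive block matching on the luminance plane: for every block x block
    # macroblock of `cur`, find the minimum-SAD block in `prev` within a
    # +/- `search` pixel window; border blocks get a clipped search region.
    h, w = cur.shape
    prev = prev.astype(np.int32)
    cur = cur.astype(np.int32)
    rows, cols = h // block, w // block
    vectors = np.zeros((rows, cols, 2), dtype=int)
    for r in range(rows):
        for c in range(cols):
            y, x = r * block, c * block
            target = cur[y:y + block, x:x + block]
            best, best_sad = (0, 0), None
            for dy in range(max(-y, -search), min(h - block - y, search) + 1):
                for dx in range(max(-x, -search), min(w - block - x, search) + 1):
                    cand = prev[y + dy:y + dy + block, x + dx:x + dx + block]
                    s = np.abs(target - cand).sum()
                    if best_sad is None or s < best_sad:
                        best_sad, best = s, (dy, dx)
            vectors[r, c] = best
    return vectors

The length and angle of each vector can then be obtained, for example, with np.hypot and np.arctan2, which is what the pan and zoom analysis in section 4.4 builds on.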


Figure 4.23: Motion vectors in a News sequence: (a) frame no. 1, (b) frame no. 4, (c) motion vectors between frames 1 and 4

Figure 4.24: a) Large search region resulting in big and erroneous motion vectors; b) smaller search region [Ralph Braspe 04]

To obtain the velocity of the movement, the length of the motion vectors has to be considered.

A steady distribution corresponds to a uniform movement, suggesting a panorama-like sequence. An uneven distribution is evidence of irregular movements within a sequence.


Figure 4.25: Mean size of MVs in a 200-frame football and panorama sequence

Further conclusions can be derived from the angles of the motion vectors.

When viewing motion vector graphs, a fundamental weakness soon appears: the vectors of untextured areas differ essentially from those in the rest of the frame. They are disordered and uncorrelated (fig. 4.26).

Figure 4.26: Erroneous Motion Vectors in untextured Areas

The origin of the problem can be traced back to the very nature of prediction-correction coding. The main purpose of coding is to allow a reasonable rendering quality at high compression rates. The motion estimation can afford to be wrong as long as the errors needed to correct it are small.

In the case of low-textured, uniform areas, the correlation methods used to estimate motion in the first place do not work; however, this does not constitute a problem, since it is likely that this will be easily corrected with just a few bits


of extra data. This trade-off has allowed the encoder to be implemented, but clearly, from the more refined image-analysis point of view, the average motion vectors in a coder stream are of appalling quality. Given our purposes, it is better if these low-textured macroblocks are not included in the fitting. There are several possibilities for doing so.

The more efficient one would be to analyze the AC components of the DCT coefficients, thus staying in the compressed domain. The one implemented in [Pulu 97] is a rougher one: the uncompressed images are available in parallel, a convolution mask is simply run over each block and it is counted how many pixels have a gradient above a given threshold. Macroblocks that have too few edgy pixels are deemed low-textured and are therefore excluded from further consideration. This method is fairly robust and effectively acts as a binary mask applied to the motion field.

Another correction method, proposed in [Sung-Hee Lee 03], uses pattern-like image analysis to correct the erroneous motion vectors caused by the failures of block-matching-based motion estimation. To detect blocks that may yield erroneous motion vectors in a pattern-like image, both the SAD distribution and the motion vector fields are analyzed. Erroneous motion vectors detected by the proposed method are defined as the motion vectors of static areas or of moving areas for which the discrimination level in the SAD (sum of absolute differences) distribution is too low to allow a good estimation of the motion. The motion vectors detected as erroneous are then set to the zero vector even if motion estimation is performed. If the erroneous motion vectors are highly correlated with the motion vectors of neighboring blocks, the proposed method corrects these motion vectors using the motion vectors of the neighboring blocks [Sung-Hee Lee 03].

In this work the analysis of the AC components of the DCT coefficients was chosen and applied.

One frame of a video sequence is read and its grey-scale image is calculated. A DCT is applied to it. As shown in figure 4.27, the DCT can be applied on 4- and on 8-pixel blocks. As the motion vector computation is done with 8-pixel blocks, the DCT should correspond to that. Only the first coefficients of the DCT are used from then on (visibly an intentional blurring is applied; in fact it is a coarse rounding of the former nuances). This deterioration is supposed to gather all intermediate ”values” into one, so that the resulting motion vectors in untextured areas are correct, or at least not as erroneous and disturbing as before. In this way the primary white noise (visibly evident in the sky, but also in plain same-coloured areas) can be avoided by consolidation into one shade bin. The quality loss is justifiable, as contours are still conspicuous and the remaining motion vectors are still reliable (fig. 4.28).
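One plausible reading of this preprocessing is sketched below: the 8x8 block DCT is computed, and blocks whose AC energy is low are flattened to their DC level (i.e. only the leading coefficient is kept there), so that block matching no longer produces random vectors in untextured areas. The AC-energy threshold and the names are illustrative assumptions, not values from the thesis.

import numpy as np

def dct2_block(block):
    # Orthonormal 2-D DCT-II of a square block, built from the 1-D DCT basis
    n = block.shape[0]
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    basis[0] *= 1 / np.sqrt(2)
    basis *= np.sqrt(2 / n)
    return basis @ block @ basis.T

def flatten_untextured(lum, block=8, ac_threshold=50.0):
    # Replace low-texture blocks (small AC energy) by their mean grey level
    out = lum.astype(float).copy()
    h, w = lum.shape
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            coeffs = dct2_block(out[y:y + block, x:x + block])
            ac_energy = np.abs(coeffs).sum() - abs(coeffs[0, 0])
            if ac_energy < ac_threshold:
                out[y:y + block, x:x + block] = out[y:y + block, x:x + block].mean()
    return out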

Assuming that some video sequences exhibit few changes between frames, like news, a trial was carried out that considered every second and every third frame instead of every frame (fig. 4.29). This is also valid for high frame rate videos of all kinds, where changes


Figure 4.27: Visualization of the DCT: (a) original frame from a panorama video, (b) RGB-to-grey transformed frame, (c) 4x4-pixel DCT, (d) 8x8-pixel DCT

between consecutive frames are minimal, as in fig. 4.29.

Motion vectors (MV) offer a solid base for analyzing and researching videos.


Figure 4.28: Motion vectors of a panorama sequence: (a) original frame, (b) MV without DCT, (c) MV with DCT


Figure 4.29: Motion vectors of a News sequence: (a) frame no. 1, (b) frame no. 3, (c) MV without DCT, (d) MV with DCT, (e) considering every second frame, (f) considering every third frame


4.4 Detecting Pans and Zooms

From a film maker's point of view, basic video contents have certain requirements leading to a variety of camera operations. It was therefore concluded that different camera shot techniques occurring frequently in a sequence are representative of the content character of this sequence (e.g. pans for panorama). Camera motion effects such as zooming, panning/tilting, dollying and tracking/booming are used to capture the visual features of the scene more efficiently during a long shot. For instance, when a camera pans over a scene, the entire video sequence belongs to one shot but the content of the scene can change substantially, thus suggesting the use of more than one key frame. Also, when the camera zooms, the images at the beginning and end of the zoom may be considered as representative of the entire shot.

Definitions:

• Zooming is the act of changing the scale of a display, generally without changing the amount of screen space it occupies. A zoom in enables a more detailed view of the recorded feature or scenery, whereas a zoom out brings a more general sight of it.

• Panning is a technique with which the camera is used to slowly scan a subject horizontally.

• Dollying is similar to zooming, in that the camera moves in toward or away from a subject. However, a dolly shot does not use the zoom control at all. In feature film shoots, a dolly is a platform that holds the camera and rides on rails similar to railroad tracks. When filming, one or more people push the dolly, resulting in a smooth shot. When zooming, the camcorder's lens simulates the appearance of moving closer to the subject; when dollying, the camera is physically moved closer. Comparing both effects, the main difference is that zooming does not produce a perspective shift, whereas dollying does. This is especially noticeable by looking at the background of the image.

• Tilting is similar to panning - the camera is used to slowly scan a subject vertically.

The introduced cut-detection algorithm (4.1.3) detects scene changes reliably, but is incapable of distinguishing camera motion effects, as the difference captured by the SAD computation is not high enough and thus not distinctive enough.

In [Adriana Dumi 04] an algorithm for detecting pans and zooms based on motion vectors is described. The algorithm is organized as follows:

1. Division of a frame into macroblocks (regions containing a fixed number of pixels, as before 8x8).



Figure 4.30: Pan and zoom within a SAD graph

2. Computation of the motion vectors for each macroblock in polar coordinates (r, Θ; r is the modulus and Θ the angle).

3. Grouping of the MVs into regions. Each region contains MVs with similar orientations.

4. Identification of pans and zooms based on the values from the previous step (with the thresholds presented there).

The first two steps had already been done and were available (as explained in the previous section), so this method was slightly adapted for the application here.

The two largest angle regions were added up and displayed over the frames of three video contents.
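A sketch of steps 3 and 4, built on top of the motion vectors from section 4.3.1, could look as follows; the number of orientation bins and the decision thresholds are illustrative assumptions, not values from [Adriana Dumi 04].

import numpy as np

def pan_zoom_feature(vectors, bins=8):
    # Per-frame camera-motion feature: share of the two largest orientation
    # regions among all non-zero motion vectors (cf. figures 4.32-4.34,
    # where an absolute count of the two main directions is plotted).
    dy = vectors[..., 0].ravel().astype(float)
    dx = vectors[..., 1].ravel().astype(float)
    moving = (dx != 0) | (dy != 0)
    if not moving.any():
        return 0.0
    angles = np.arctan2(dy[moving], dx[moving])
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    return float(np.sort(hist)[-2:].sum()) / moving.sum()

def classify_frame(feature, pan_level=0.8, zoom_level=0.4):
    # Illustrative decision only: a strongly bounded direction suggests a pan,
    # a widely spread direction distribution suggests a zoom.
    if feature >= pan_level:
        return "pan"
    if feature <= zoom_level:
        return "zoom"
    return "unclear"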


Figure 4.31: Pan and zoom detection method graph [Garcia 05]


As already mentioned, panorama lends itself as a test sequence. Football is considered to contain lots of zooming, which also makes it good for testing. News was additionally chosen out of the need for a reference to which the well-defined test sequences can be compared: it has very static content, still containing small movements with unanticipated directions (expected to be not too bounded).

4.4.1 Results

Figure 4.32 shows news sequences of the same length. The occurring angles are very similar and no big variances are observed; neither zooms nor pans can be detected. So far the results are very reasonable and easy to reconstruct. The surprising point is that the main angle directions are quite bounded - much more than expected. This causes difficulties for exactly the pursued goal (to differentiate news and panorama).

Figure 4.32: News sequences tested with the pan and zoom method (four sequences of equal length; per frame the sum of the two main motion-vector directions is plotted)

Panorama sequences of the same length are shown in figure 4.33. The very steady diagrams are explained by the mostly unidirectional, uniform camera movement. Here a high percentage of one main direction is expected and clearly visible. Unfortunately, the bounded directions are very similar to those of news; even the boundary values correspond to those of the news sequences. The effect can be explained by the fact that only the angles are observed (and not the lengths of the vectors).

Football sequences of the same length can be observed in figure 4.34. Unlike in the news and panorama sequences, discontinuities are perceived here. That is


Figure 4.33: Panorama sequences tested with the pan and zoom detection method (four sequences; per frame the sum of the two main motion-vector directions is plotted)

a sign of the proposed zooming. In those videos (like football) zooms and sudden pans are performed.

Figure 4.34: Football sequences tested with the pan and zoom detection method (four sequences; per frame the sum of the two main motion-vector directions is plotted)

The shown method works well for detecting variations of motion directions or for detecting zooms. Unfortunately, that is not enough to distinguish whole video content classes. News and panorama cannot be differentiated with this approach. Football seems to be unique in these test sequences, but as it is doubtless not


the only class containing zooms and motion distinctions, the shown approach does not offer an explicit assignment. It can be used as an additional classifying factor, but not as the sole main one.

An analytic continuation would be to analyze not only the direction bounding, but also the specific direction. This might then work for horizontal panoramas, but even in the provided test videos pans were not necessarily horizontal (e.g. mountain view: mixed horizontal and vertical camera movement).


4.5 Colour Research

In the search for reliable and unique objective video parameters, not only camera techniques and sequence parts were considered, but also elementary issues - like the contained colours. The consideration was that in different types of content, colours occur with a different density and magnitude. Football, for example, is expected to contain a lot of varying green colours. Cartoons intuitively have discrete, saturated colours. That is why the next approach considered colourmaps of the contemplated video sequences. The videos have a colour depth of 24 bit (8 bit per colour). In MATLAB a colourmap is a three-column matrix whose length is equal to the number of colours it defines. Each row of the matrix defines a particular colour by specifying three values in the range 0 to 1. These values define the RGB components (i.e., the intensities of the red, green, and blue video components). There are several colourmap types: hsv, hot, cool, summer and gray. Figure 4.35 shows two types - hsv (hue-saturation-value) and jet.

(a) hsv-colourmap (b) jet-colourmap

Figure 4.35: Colourmaps

• hsv varies the hue component of the hue-saturation-value colour model. The colours begin with red, pass through yellow, green, cyan, blue, magenta, and return to red. The colourmap is particularly appropriate for displaying periodic functions.

• jet ranges from blue to red, and passes through the colours cyan, yellow, and orange. It is a variation of the hsv colourmap. The jet colourmap is associated with an astrophysical fluid jet simulation from the National Center for Supercomputer Applications.

The type which proved to be reasonable for the further research was colourcube:

• colourcube contains as many regularly spaced colours in RGB colourspace as possible, while attempting to provide more steps of gray, pure red, pure green, and pure blue.


Further, it was obvious that the gigantic number of colours had to be reduced. The first way was simply to reduce the existing colour values in the matrices by coarsely rounding them. A more elegant way was then found in discretizing the colourmap. When doing this, the following results can be observed.

Figure 4.36: Colourcube colourmap at different gradings: (a) 7 bins, (b) 8 bins, (c) 10 bins, (d) 12 bins, (e) 16 bins, (f) 20 bins

Below seven bins, MATLAB maps the image to seven grayscale grades. Notice that the same image is analyzed in all six cases. With an increasing number of colour bins, the distribution changes successively. Finally, a colourcube colourmap with 64 colours was chosen: a case simplified enough for easier classification, but with a still recognizable image.

Figure 4.37: 64-bin Colourmap for a Football video sequence
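A minimal sketch of how such a 64-bin colour histogram (as in Figure 4.37) can be obtained in MATLAB is given below; the frame file name is hypothetical, and the Image Processing Toolbox function rgb2ind is used for the quantization:

% Quantize a 24-bit RGB frame to the 64-colour colourcube map and count
% how often each of the 64 colours occurs.
rgb = imread('frame001.png');                        % hypothetical extracted video frame
map = colorcube(64);                                 % 64-by-3 colourmap, values in [0,1]
idx = rgb2ind(rgb, map, 'nodither');                 % map each pixel to its nearest map entry (0..63)
counts = accumarray(double(idx(:)) + 1, 1, [64 1]);  % colour histogram over the 64 bins
bar(counts); xlabel('colourmap entry'); ylabel('number of pixels');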

As the main reason for carrying out the colour research was the expectation of distinguishing football videos from the rest, the interest lay mainly in analyzing those.



Figure 4.38: Colourmaps of Football video sequences

The interpretation proceeds by calculating the sum of all green shades. The main colour obtained is then clearly and unambiguously green.
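The following sketch shows one possible way to compute the share of green pixels from such a quantized frame; the green test on the colourmap entries (G dominating R and B by a margin of 0.1) and the frame file name are illustrative assumptions, not values from the thesis:

% Fraction of pixels mapped to "green" colourcube entries for one frame.
rgb = imread('frame001.png');                          % hypothetical frame
map = colorcube(64);
idx = double(rgb2ind(rgb, map, 'nodither')) + 1;       % 1-based colourmap indices
isGreen = map(:,2) > map(:,1) + 0.1 & map(:,2) > map(:,3) + 0.1;  % assumed "green" test
greenFraction = mean(isGreen(idx(:)));                 % share of pixels mapped to green entries
isGreenFrame = greenFraction >= 0.7;                   % 70% criterion used by the classifier (Chapter 6)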

For comparison, Figure 4.39 shows two non-football video sequences:

(a) Film trailer (b) Music clip

Figure 4.39: Colourmaps of non-Football sequences

Of course there will always be exceptions. Videos in which green is the dominant colour are imaginable, but coupled with the motion vector results the classification is quite reasonable and well justified. The cartoon colour results could not provide any distinctive features to aid the required recognition.

Figure 4.40: Cartoon Colourmaps


Chapter 5

TEST AND METRIC DESIGN

Five video sequences (”Akiyo” - News, ”Foreman” - Videocall, ”Panorama”, ”Football” and ”Traffic”) in QCIF were processed at different frame and bit rates. Four frame rates (5 frames/s, 7.5 frames/s, 10 frames/s and 15 frames/s) and three bit rates (18 kbit/s, 44 kbit/s and 80 kbit/s) yielded, with the 5 different contents, over 50 different sequences. All sequences were encoded with H.263 profile 3 and level 10. 38 paid test persons of different age, sex, education and experience evaluated the videos on a five-grade scale (1-bad, 2-poor, 3-fair, 4-good, 5-excellent), providing the MOS (mean opinion score). The displaying device was a mobile phone at a distance of 20-30 cm, simulating a real-life situation. After three training sequences the test sequences were presented in arbitrary order. Three runs of each test were taken. The obtained ratings are gathered and shown in the figures below.

Figure 5.1: First video content class ”News”: MOS (Quality and Usability bars, in percent) over the bit rate [kbit/s] / frame rate [frames/s] combinations from 18/5 to 80/15.

The usability bar depicts the willingness to pay. After grading a sequence, the subjects were asked to decide whether they would pay for the seen video at the actual compression. The value measures the user's receptivity.

”Akiyo” - News The sequence is filmed by a static camera. Only small and slight changes can be observed - the movement of mouth and eyes.




Figure 5.2: Second video content class ”Videocall”: MOS (Quality and Usability bars) over the tested bit rate [kbit/s] / frame rate [frames/s] combinations.

”Foreman” - Videocall Unprofessional camera movement, first very static, then a fast pan. The filmed person is moving and gesticulating until the pan to a scaffolding.

”Panorama” Uniform, smooth and slow movement of an automatically operated camera. The scene itself is static.

”Football” Mostly wide-angle scenes with (horizontal) pans and zooms. Typical is the green background.

”Traffic” A static camera films the moving cars (small objects).

Figure 5.3: Third video content class ”Panorama”: MOS (Quality and Usability bars) over the tested bit rate [kbit/s] / frame rate [frames/s] combinations.

5.1 Choice of objective parameters set

By the term ”objective parameters”, both the compression parameters and the content characteristics are understood. Eight objective video parameters of varying computational complexity are considered.



Figure 5.4: Fourth video content class ”Football”: MOS (Quality and Usability bars) over the tested bit rate [kbit/s] / frame rate [frames/s] combinations.

Figure 5.5: Fifth video content class ”Traffic”: MOS (Quality and Usability bars) over the tested bit rate [kbit/s] / frame rate [frames/s] combinations.

The parameters si_gain and si_loss are taken into account as recommended in [ANS]. They measure the gain and the loss in the amount of spatial activity, respectively. If the codec operates with edge sharpening or enhancement, a gain in spatial activity is obtained, i.e. an improvement of the video quality of the image. On the other hand, a blurring effect in an image leads to a loss in spatial activity. The other two parameters, hv_gain and hv_loss, measure changes in the orientation of the spatial activity. In particular, hv_loss reveals whether horizontal and vertical edges suffer from more blurring than diagonal edges. The parameter hv_gain reveals whether erroneous horizontal and vertical edges are introduced in the form of blocking or tiling distortions. These parameters are calculated over space-time (S-T) regions of the original and degraded frames. An S-T region is described by its number of pixels horizontally and vertically and by its duration in time. In our case one S-T region corresponds to 8×8 pixels over five frames. The complexity of calculating these parameters is rather high, and knowledge of the original sequence is required.

A further parameter considered is a reference-free measure of overall spatial information [H.263 01], denoted fSI13 since the images are preprocessed with the 13 × 13 Sobel filter masks.



It is calculated as the standard deviation over an S-T region of R(i, j, t) samples, i and j being the coordinates within the picture displayed at time t. The result is clipped at the perceptibility threshold P [Michal Ries 06]:

fSI13 = {std_space[R(i, j, t)]} |_P ,   i, j, t ∈ {S-T region}     (5.1)

This feature is sensitive to changes in the overall amount of spatial activity within a given S-T region. For instance, localized blurring produces a reduction in the amount of spatial activity, whereas noise produces an increase. SI and TI are used as well. Finally, the codec compression settings frame rate (FR) and bit rate (BR) are investigated; they require no computational effort for estimation, as they are known at both sender and receiver.
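A rough sketch of how fSI13 can be computed is given below. It follows Eq. (5.1) but, for brevity, uses the standard 3×3 Sobel masks instead of the 13×13 masks of [H.263 01], takes the standard deviation over all samples of each 8×8×5 S-T region, and assumes a perceptibility threshold of P = 8; these simplifications are assumptions, not the exact procedure of the standard:

function f = fsi13_sketch(Y)
% Y: H x W x T array of luminance samples (double).
% Returns one clipped spatial-information value per 8x8x5 S-T region.
P  = 8;                                   % assumed perceptibility threshold
sx = [-1 0 1; -2 0 2; -1 0 1];            % horizontal Sobel mask
sy = sx';                                 % vertical Sobel mask
[H, W, T] = size(Y);
f = [];
for t0 = 1:5:T-4                          % 5-frame temporal extent of an S-T region
    for i0 = 1:8:H-7                      % 8x8 spatial extent
        for j0 = 1:8:W-7
            blk = Y(i0:i0+7, j0:j0+7, t0:t0+4);
            g = zeros(size(blk));
            for k = 1:5                   % Sobel gradient magnitude per frame
                gx = conv2(blk(:,:,k), sx, 'same');
                gy = conv2(blk(:,:,k), sy, 'same');
                g(:,:,k) = sqrt(gx.^2 + gy.^2);
            end
            f(end+1) = max(std(g(:)), P); %#ok<AGROW> values below P are replaced by P
        end
    end
end
end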

To reduce the dimensionality of the data set while retaining as much information as possible, Principal Component Analysis (PCA) was used [W.J.Krzanows 88]. The PCA was carried out to determine the relationship between the MOS and the objective video parameters and to identify the objective parameters with the lowest mutual correlation, allowing a compact description of the data set to be proposed. PCA was performed for all content classes separately. In the present case the first two components proved sufficient for an adequate modeling of the variance of the data (see Table 5.1), because the total variability of the first two components was at least 81% for all content classes. Variability describes the percentage of the data set's variance that is covered by the variance of a particular component. In Figures 5.6(a)-5.6(e) the horizontal axis represents principal component 1 (PC1) and the vertical axis represents principal component 2 (PC2). Each of the ten parameters is represented in these figures by a vector. The direction and length of the vector indicate how each parameter contributes to the two principal components in the graph.

According to this analysis, each parameter in the analyzed data set can be assessed concerning its contribution to the overall distribution of the data set. This is achieved by correlating the direction of the maximum spread of each variable with the direction of each principal component axis (eigenvector). A high correlation between the first principal component (PC1) and an investigated parameter indicates that the variable is associated with the direction of the maximum amount of variation in the data set. A strong correlation between a parameter and the second principal component (PC2) indicates that the variable is responsible for the next largest variation in the data, perpendicular to PC1.

Conversely, if a parameter (vector) does not correspond to any PC axis and its length is small compared to the PC axis dimensions, this usually suggests that the variable has little or no influence on the distribution of the data set. PCA therefore suggests which parameters in our data set are important and which ones may be of little consequence. The parameters with the largest impact were taken into account for the metric design, as described in the following section.
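A sketch of this PCA step in MATLAB (Statistics Toolbox; in older releases princomp plays the role of pca) could look as follows; the parameter matrix X is a random placeholder for the measured values of one content class:

% One row per tested bit-rate/frame-rate condition, one column per parameter.
labels = {'FR','BR','SI','TI','fSI13','si_gain','si_loss','hv_gain','hv_loss','MOS'};
X = randn(12, 10);                         % placeholder for the measured data
[coeff, score, latent] = pca(zscore(X));   % PCA on the standardized parameters
explained = 100 * latent / sum(latent);    % variability per component, cf. Table 5.1
biplot(coeff(:,1:2), 'Scores', score(:,1:2), 'VarLabels', labels);  % cf. Figure 5.6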



Figure 5.6: Visualization of the PCA results for the (a) ”News”, (b) ”Video call”, (c) ”Football”, (d) ”Panorama” and (e) ”Traffic” sequences. Each biplot shows the ten parameters (FR, BR, SI, TI, fSI13, si_gain, si_loss, hv_gain, hv_loss, MOS) as vectors in the plane spanned by principal components 1 and 2.

Sequence    Variability of PC1 [%]    Variability of PC2 [%]
Akiyo               84                        14
Foreman             67                        28
Football            55                        37
Panorama            81                        10
Traffic             61                        20

Table 5.1: The total variability of the first two components for all content classes.

5.2 Metric design

According to the PCA results, the objective parameters most suitable for the metric design are relevant for all content classes.



Coeff.    Akiyo       Foreman     Football    Panorama    Traffic
A         -0,0302     0,0411      0,0241      0,0226      0,0322
B         0,1382      -0,0371     -0,0117     -0,0688     -0,02701
C         0,9252      -0,0288     0,1237      0,1429      0,14415
K         -78,4283    3,5970      -6,1850     -12,6834    -11,1982
r         0,95        0,96        0,98        0,93        0,90

Table 5.2: Coefficients of the linear metric model and correlation of the average MOS with the estimation so obtained, for all content classes

Surprisingly, the objective parameters with higher complexity (si_gain, si_loss, hv_gain, hv_loss) did not correlate very well with PC1 and PC2 for all sequences. The PCA results (see Figures 5.6(a)-5.6(e)) and the high complexity of these parameters show that they are not appropriate for the metric design in this scenario. The most suitable parameters are FR, fSI13, SI, TI and BR.

These findings determine the further steps in the metric design. The proposed metric has different coefficient values for the different content classes, because the spatial and temporal characteristics of the sequences differ significantly. The metric takes three objective parameters into account, selected according to their correlation with the principal components and their complexity. We choose two parameters which correlate best with PC1 and one which correlates with PC2 over all sequences, because the variability of PC1 is approximately two times higher than the variability of PC2 (see Table 5.1). Finally, the metric is based on the codec parameters FR and BR and the sequence character parameter fSI13. A uniform mathematical model for all content classes has been chosen due to its simplicity and rather good fit to the measured data. The simple linear model was extended with four mixed terms reflecting the most important mutual combinations of the selected objective parameters.

MOS = K + A · BR + B · FR + C · fSI13 + D · BR · FR + E · BR · fSI13 + F · FR · fSI13 + G · BR · FR · fSI13     (5.2)

The target MOS is obtained by averaging the three runs of all rating subjects for all tested sequences. The best correlation results were obtained for the first three sequences: news, video call and sports. News contains neither high spatial information nor high temporal information; thus even for low BR the results were already satisfying. Sports, on the other hand, is very sensitive in the spatial and temporal domain. The remaining sequences are also critical, as they contain either nonuniform movement or important spatial information. Therefore panorama and traffic were rated on the lower part of the scale. The obtained MOS metric provides a reliable prediction for subjective video quality estimation, depending on the different contents (which confirms the conclusion from [O. Nemethova 04]).



Coefficients    News        Video Call    Sport       Panorama    Traffic
A               2,1618      1,4769        -1,1826     -3,4122     2,2139
B               -3,2939     -4,9962       -13,7067    -33,2325    11,1780
C               0,4491      -0,5986       -1,7714     -1,2473     2,0426
D               -0,1338     -0,936        0,1623      0,7717      -0,0859
E               -0,0234     -0,0158       0,0219      0,0325      -0,0244
F               0,0407      0,0592        0,2479      0,3134      -0,1256
G               0,0014      0,0010        -0,0029     -0,0073     0,0010
K               -39,0282    50,6525       98,3703     134,3703    -181,0251
r               0,989       0,997         0,996       0,999       0,974

Table 5.3: Coefficients of the improved metric model and correlation of the average MOS with the estimation so obtained, for all content classes

The metric is reference-free, as the original sequence is not needed for computing the model. Furthermore, the computational effort is kept low.
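To make the use of the metric concrete, the following sketch evaluates Eq. (5.2) with the ”News” coefficients from Table 5.3 (decimal commas written as points); the caller supplies BR in kbit/s, FR in frames/s and an fSI13 value measured for the sequence:

function mos = mos_metric_news(BR, FR, fSI13)
% MOS estimate of Eq. (5.2) for the "News" content class (coefficients of Table 5.3).
A = 2.1618;   B = -3.2939;  C = 0.4491;   D = -0.1338;
E = -0.0234;  F = 0.0407;   G = 0.0014;   K = -39.0282;
mos = K + A*BR + B*FR + C*fSI13 + D*BR*FR + E*BR*fSI13 ...
        + F*FR*fSI13 + G*BR*FR*fSI13;
end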


Chapter 6

CONCLUSIONS

This last chapter summarizes the acquired research achievements. An original and uncompressed video sequence is the starting point of the chain. This requirement is significant and of great importance, since in any other case (e.g. compression leading to blocking, smaller frame and bit rates and thus to completely different motion vectors) the further computation would still be correct in itself, but definitely not comparable to the thresholds set for the content recognition classes.

The following classes are to be identified and classified:

• wide-angle camera tracking a small, rapidly moving object on a green background (”football”).

• global motion sequence taken with a panning camera (”panorama”).

• small moving region of interest (face) on a static background (”news”).

• artificial motion without global motion (”cartoon”)

• a global group containing all remaining sequences.

The procedure consists of four steps:

1. The first step is scene change recognition. The sequence is analyzed and the identified cuts (more precisely, the frames at which they appear) are handed to the actual classifier.

2. The content classifier computes the motion vectors and the percentage of ”green frames” (frames in which green amounts to at least 70% of the whole frame). From the motion vectors a variety of computations is initiated: the angles and their bounding, the span and the variation are considered. The following categories are explicitly computed, listed and used for the threshold finding (a short sketch of these computations is given after the list):




All blocks considered:

• Percentage of mean static blocks through the frames

• Mean size of all MVs

• Intraframe standard deviation of MV size

• Interframe standard deviation of MV size

Only non-static blocks considered:

• Mean size of non-static MVs

• Intraframe standard deviation of non-static MV size

• Interframe standard deviation of non-static MV size

Directivity:

• Percentage of MVs in the main direction

• Percentage of MVs in the first and second most powerful directions

• Percentage of non-static MVs in the main direction

• Percentage of non-static MVs in the first and second most powerful directions

• Percentage of MVs in horizontal direction (0◦ and 180◦)

Colour:

• Percentage of green blocks
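A sketch of how the motion-vector features listed above can be computed for a single frame is given below; the motion-vector arrays are random placeholders (in the thesis they come from the classifier's motion estimation, see step 2), and the 45-degree direction bins as well as the restriction to non-static vectors for the directivity features are simplifying assumptions:

% Placeholder block motion vectors of one QCIF frame (9 x 11 grid of 16x16 blocks).
mvx = round(2*randn(9,11));  mvy = round(2*randn(9,11));

mag      = sqrt(mvx.^2 + mvy.^2);               % MV sizes
isStatic = (mag == 0);                          % zero vectors count as static
pStatic  = mean(isStatic(:));                   % percentage of static blocks
meanAll  = mean(mag(:));                        % mean size of all MVs
stdAll   = std(mag(:));                         % intraframe std of MV size
meanNS   = mean(mag(~isStatic));                % mean size of non-static MVs
stdNS    = std(mag(~isStatic));                 % intraframe std of non-static MV size

angNS  = mod(round(atan2d(mvy(~isStatic), mvx(~isStatic)) / 45) * 45, 360);  % 45-degree bins
counts = accumarray(angNS(:)/45 + 1, 1, [8 1]);          % histogram over the 8 directions
pMain  = max(counts) / numel(angNS);                     % share of non-static MVs in the main direction
pHoriz = mean(angNS(:) == 0 | angNS(:) == 180);          % share in the horizontal directions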

Figure 6.1: Content Classifier within the Transmitting Chain (blocks: original, uncompressed video; cuts detector; content classification with the classes cartoon, football, news, panorama and everything else; corresponding bit rate / frame rate; BR, FR calibration; quality evaluation; wireless connection)

3. After the analysis, the obtained values are compared with previously stored reference values, evaluated in the development phase using the following process:

• Training set

• Validation set

• Test set



4. The output of the content classifier is the mapping into these categories:

• Corresponding to the content character, a certain video quality, i.e. a bit rate and frame rate set, is chosen. The values are computed according to the MOS metric. The compressed video is then delivered to the users, who are hopefully satisfied with the received videos and their quality.

As video quality is a content-dependent attribute, this work primarily investigates different content classification approaches. The conclusion of this research is a solid content classifier comprising different means of video analysis. Subsequently, a motion-based video quality metric for mobile video streaming services is proposed. The metric allows continuous and reference-free quality measurements on both transmitter and receiver sides. A mutual relation between content character, motion features and subjective video quality is revealed. This solution allows a good estimation of perceptual video quality, affording a wide range of applications.


Acknowledgements

Special thanks to my supervisor DI Michal Ries and my professor Dr. Markus Rupp for the initial ideas, their invaluable support, time and patience. I am grateful for the chance to perform my master thesis at the Institute of Communications and Radio-Frequency Engineering.

I would like to take this opportunity and express my gratitude to my parents. Thank you so much for believing in me. I appreciate your love, guidance, encouragement, generosity and technical genes so much. You are my inspiration.

Further I want to thank my friends. I am so incredibly fortunate to know each one of you.
Stefan Fikar: I have grown so much both working with, and knowing you. Thank you for always listening, caring and sharing the same vision. I really appreciate your support.
Thomas Linder: Thank you for sharing with me the benefit of your brilliance and foresight and also for helping me out in any situation.
Jutta Unterscheider: Your strength, brightness and positivity are unique. Thanks for sharing them with me for such a long time.
Adela Marinescu: We had some hard times and some good times. I am so grateful we survived them all together and still proudly stand. Thank you for being there.
Barbara Neuherz: You have been a great colleague and friend. Thanks for your kindness and understanding.



Bibliography

[3GPP 02] 3GPP, “3rd Generation Partnership Project; Technical Specification Group Services and System Aspects (Release 4); Transparent end-to-end Packet-switched Streaming Service (PSS); Protocols and codecs”, December 2002.

[3GPP 05] 3GPP, “The Evolution of UMTS/HSDPA 3GPP Release 6 and Beyond”, July 2005.

[Adriana Dumi 04] Adriana Dumitras, Barry G. Haskell, “A look-ahead Method for Pan and Zoom Detection in Video Sequences using Block-based Motion Vectors in Polar Coordinates”, May 2004.

[Anastasios D 05] Anastasios Dimou, Olivia Nemethova and Markus Rupp, “Scene Change Detection for H.264 using Dynamic Threshold Techniques”, July 2005.

[ANS] “T1.801.03, American National Standard for Telecommunications - Digital Transport of one-way Video Signals. Parameters for objective Performance Assessment”.

[Arthur A. We 00] Arthur A. Webster, Coleen T. Jones, Margaret H. Pinson, Stephen D. Voran, Stephen Wolf, “An objective video quality assessment system based on human perception”, 2000.

[Bouthemy 04] Bouthemy, P., Garcia, C., Ronfard, R., Tziritas, G., Scene Segmentation and Image Feature Extraction for Video Indexing and Retrieval, Springer Berlin / Heidelberg, 2004.

[Garcia 05] Beatriz Lopez Garcia, “Segmentation of sport video sequences”, Master's thesis, Technical University of Vienna, 2005.

[H. 96] Schulzrinne, H., “RTP: A Transport Protocol for Real-Time Applications”, 1996.




[H.263 01] ITU-T Recommendation H.263, “Annex X: Profiles and levels definition”, April 2001.

[Hampapur 94] Hampapur, A., Jain, R. and T. Weymouth, “Digital Video Segmentation”, Second ACM International Conference on Multimedia, San Francisco, California, United States, 1994.

[Hanjalic 04] Alan Hanjalic, A Content-Based Analysis of Digital Video,Springer US, 2004.

[J. 80] Postel J., “User Datagram Protocol”, 1980.

[John Chung-M 04] John Chung-Mong Lee, Qing Li and Wei Xiong, A Video Information Management System, Springer Netherlands, 2004.

[Michal Ries 06] Michal Ries, Olivia Nemethova, Markus Rupp, “Reference-Free Video Quality Metric for Mobile Streaming Applications”, June 2006.

[Ming-Ting Su 00] Ming-Ting Sun, Amy R. Reibman, Compressed Video over Networks, CRC, 2000.

[O. Nemethova 04] O. Nemethova, M. Ries, E. Siffel, M. Rupp, “Quality Assessment for H.264 coded low-rate and low-resolution Video Sequences”, 2004.

[Paul Browne 99] Paul Browne, Alan F. Smeaton, Noel Murphy, Noel O'Connor, Sean Marlow, Catherine Berrut, “Evaluating and Combining Digital Video Shot Boundary Detection Algorithms”, 1999.

[Pulu 97] Maurizio Pulu, “On Using Raw MPEG Motion Vectors To Determine Global Camera Motion”, August 1997.

[Ralph Braspe 04] Ralph Braspenning, Gerard de Haan, “True-Motion Estimation using Feature Correspondences”, 2004.

[Ramin Zabih 95] Ramin Zabih, Justin Miller, Kevin Mai, “A Feature-Based Algorithm for Detecting and Classifying Scene Breaks”, Third ACM International Conference on Multimedia, San Francisco, California, United States, 1995.

[Richardson 04] I.E.G. Richardson and C.S. Kannangara, “Fast subjective video quality measurements with user feedback”, June 2004.



[S. Olsson 97] S. Olsson, M. Stroppiana, J. Baina, “Objective methods for assessment of video quality: state of the art”, IEEE Transactions on Broadcasting, Vol. 43, no. 4, pp. 487-495, December 1997.

[Schulzrinne 98] Schulzrinne, H., Rao, A. and Lanphier, R., “Real Time Streaming Protocol (RTSP)”, 1998.

[Sung-Hee Lee 03] Sung-Hee Lee, Ohjae Kwon, Young-Wook Sohn, Jeong-Woo Kang, Rae-Hong Park, “Motion Vector Correction Based on the Pattern-like Image Analysis”, ICCE 2003, IEEE International Conference on Consumer Electronics, June 2003.

[Winkler 05] Stefan Winkler, Digital Video Quality: Vision Models and Metrics, Wiley and Sons, 2005.

[W.J.Krzanows 88] W.J. Krzanowski, Principles of Multivariate Analysis: A User's Perspective, Clarendon Press, Oxford, 1988.

[Zhang H.J. 93] Zhang, H.J., Kankanhalli, A. and Smoliar, S.W., Automatic Partitioning of Full-Motion Video, Springer-Verlag New York, Inc., 1993.