Image Processing Technique for Video Content Extraction


1. INTRODUCTION

This chapter introduces the basics of video and image processing. An image or video is stored in a computer only as a set of pixels with RGB values; the computer knows nothing about the meaning of these values. The content of an image is quite clear to a person, but not to a computer. For example, it is easy for you to recognize yourself in an image or video, even in a crowd, yet this is extremely difficult for a computer. Preprocessing helps the computer understand the content of an image or video. What is this so-called content? Here, content means features of the image, video, or their objects, such as color, texture, resolution, and motion. An object can be viewed as a meaningful component of an image or video picture; a moving car, a flying bird, and a person are all objects. There are many techniques for image and video processing. This chapter starts with an introduction to general image processing techniques and then covers video processing techniques. Image processing is introduced first because image processing techniques can also be applied to video if we treat each picture of a video as a still image.

Fig: Example of color.


2. BACKGROUND

A few years ago, the problems of representation and retrieval of visual media were confined to specialized image databases (geographical, medical, pilot experiments in computerized slide libraries), to the professional applications of the audiovisual industries (production, broadcasting and archives), and to computerized training or education. The present development of multimedia technology and information highways has put content processing of visual media at the core of key application domains: digital and interactive video, large distributed digital libraries, and multimedia publishing. Though the most important investments have been targeted at the information infrastructure (networks, servers, coding and compression, delivery models, multimedia systems architecture), a growing number of researchers have realized that content processing will be a key asset in putting together successful applications. The need for content processing techniques has been made evident from a variety of angles, ranging from achieving better quality in compression, allowing user choice of programs in video-on-demand, and achieving better productivity in video production, to providing access to large still image databases and integrating still images and video in multimedia publishing and cooperative work.

Fig: Example of texture.


The content of an image includes resolution, color, intensity, and texture. Image resolution is simply the size of the image in terms of display pixels. Color is represented in the computer using the RGB color model: for each pixel on the screen there are three bytes (the R, G, and B color components), and each component is in the range 0 to 255. Intensity is the gray-level information of a pixel, represented by one byte, so the intensity value is also in the range 0 to 255. Texture characterizes local variations of image color or intensity. Although texture-based methods have been widely used in computer vision and graphics, there is no single commonly accepted definition of texture; each texture analysis method defines texture according to its own model. We consider texture as a descriptor of local color or intensity variation: image regions detected as having similar texture show a similar pattern of local variation of color or intensity.
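As a small illustration of these pixel-level representations, here is a minimal sketch (assuming NumPy and an 8-bit RGB array) that converts RGB values into a gray-level intensity image using the common BT.601 luminance weights; a plain average of the three components would also match the description above.

import numpy as np

def rgb_to_intensity(rgb):
    """Convert an HxWx3 uint8 RGB image to an HxW uint8 intensity image.

    Uses the common ITU-R BT.601 luminance weights; other weightings
    (e.g. a plain average) would also fit the description in the text.
    """
    rgb = rgb.astype(np.float32)
    intensity = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return np.clip(intensity, 0, 255).astype(np.uint8)

# Example: a 2x2 image with red, green, blue and white pixels.
img = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)
print(rgb_to_intensity(img))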

3. COMMON IMAGE PROCESSING TECHNIQUES

3.1 Dithering:

Dithering is a process of using a pattern of solid dots to simulate shades of gray. Different shapes and patterns of dots have been employed in this process, but the effect is the same: when viewed from a great enough distance that the dots are not discernible, the pattern appears as a solid shade of gray.
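A minimal sketch of one such dot pattern, assuming NumPy and an 8-bit grayscale input: ordered dithering with a 2x2 Bayer threshold matrix. The specific pattern is an illustrative choice; the text does not prescribe one.

import numpy as np

# 2x2 Bayer matrix; thresholds are spread evenly over the 0-255 range.
BAYER_2X2 = np.array([[0, 2],
                      [3, 1]], dtype=np.float32)

def ordered_dither(gray):
    """Binarize an HxW uint8 grayscale image with a 2x2 ordered-dither pattern."""
    h, w = gray.shape
    # Tile the threshold matrix over the whole image.
    thresholds = (BAYER_2X2 + 0.5) / 4.0 * 255.0
    tiled = np.tile(thresholds, (h // 2 + 1, w // 2 + 1))[:h, :w]
    return (gray.astype(np.float32) > tiled).astype(np.uint8) * 255

gradient = np.tile(np.linspace(0, 255, 16, dtype=np.uint8), (4, 1))
print(ordered_dither(gradient))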

3.2 Erosion:

Erosion is the process of eliminating all the boundary points from an object, leaving the object smaller in area by one pixel all around its perimeter. If the object narrows to less than three pixels thick at any point, it will become disconnected (split into two objects) at that point. Erosion is useful for removing from a segmented image objects that are too small to be of interest. Shrinking is a special kind of erosion in which single-pixel objects are left intact; this is useful when the total object count must be preserved. Thinning is another special kind of erosion, implemented as a two-step process: the first step marks all candidate pixels for removal, and the second step removes those candidates that can be removed without destroying object connectivity.


3.3 Dilation:

Dilation is the process of incorporating into the object all the background pixels that touch it, leaving it larger in area by that amount. If two objects are separated by less than three pixels at any point, they will become connected (merged into one object) at that point. Dilation is useful for filling small holes in segmented objects. Thickening is a special kind of dilation, implemented as a two-step process: the first step marks all the candidate pixels for addition, and the second step adds those candidates that can be added without merging objects.

3.4 Opening:

The process of erosion followed by dilation is called opening. It has the effect of eliminating small and thin objects, breaking objects at thin points, and generally smoothing the boundaries of larger objects without significantly changing their area.

3.5 Closing:

The process of dilation followed by erosion is called closing. It has the effect of filling small and thin holes in objects, connecting nearby objects, and generally smoothing the boundaries of objects without significantly changing their area.
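The four operations above can be sketched in a few lines. This example assumes NumPy and SciPy and uses a small synthetic binary image, so the library choice and test image are illustrative only.

import numpy as np
from scipy import ndimage

# A small binary image: a 4x4 square with a one-pixel hole and a stray pixel.
img = np.zeros((8, 8), dtype=bool)
img[2:6, 2:6] = True
img[3, 3] = False          # small hole inside the object
img[0, 7] = True           # isolated single-pixel object

eroded  = ndimage.binary_erosion(img)   # peel one pixel off the boundary
dilated = ndimage.binary_dilation(img)  # add one pixel around the boundary
opened  = ndimage.binary_opening(img)   # erosion then dilation: drops the stray pixel
closed  = ndimage.binary_closing(img)   # dilation then erosion: fills the small hole

for name, result in [("eroded", eroded), ("dilated", dilated),
                     ("opened", opened), ("closed", closed)]:
    print(name, int(result.sum()), "foreground pixels")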

3.6 Filtering:

Image filtering can be used for noise reduction, image sharpening, and image smoothing: by applying a low-pass or high-pass filter to the image, the image can be smoothed or sharpened, respectively. A low-pass filter reduces the amplitude of high-frequency components. Simple low-pass filters apply local averaging: the gray level at each pixel is replaced with the average of the gray levels in a square or rectangular neighborhood. The Gaussian low-pass filter can also be applied via the Fourier transform of the image. A high-pass filter increases the amplitude of high-frequency components.
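A minimal sketch of local-averaging smoothing and a simple high-frequency boost (unsharp masking), assuming NumPy and SciPy; the kernel sizes and the unsharp-masking formulation are illustrative choices consistent with, but not taken from, the text.

import numpy as np
from scipy import ndimage

def smooth(gray, size=3):
    """Low-pass: replace each pixel with the mean of its size x size neighborhood."""
    return ndimage.uniform_filter(gray.astype(np.float32), size=size)

def sharpen(gray, amount=1.0):
    """High-pass boost (unsharp masking): add back the detail removed by smoothing."""
    gray = gray.astype(np.float32)
    low = ndimage.gaussian_filter(gray, sigma=1.0)
    high = gray - low                      # high-frequency components
    return np.clip(gray + amount * high, 0, 255).astype(np.uint8)

noisy = np.random.default_rng(0).normal(128, 20, (16, 16)).astype(np.uint8)
print(smooth(noisy).std() < noisy.std())   # smoothing reduces local variation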


3.7 Segmentation:

Segmentation divides an image into regions: sets of pixels that are adjacent or touching and that share some common features, such as color, intensity, or texture. When a human observer views a scene, the visual system automatically segments the scene; the process is so fast and efficient that one sees not a complex scene but a collection of objects. A computer, however, must laboriously isolate the objects in an image by breaking the image into sets of pixels, each of which is the image of one object.

Image segmentation can be approached in three ways. The first is the region approach, in which each pixel is assigned to a particular object or region. In the boundary approach, only the boundaries that exist between the regions are located. The third is the edge approach, in which edge pixels are identified and then linked together to form the required boundaries.

3.8 Object Recognition:

The most difficult part of image processing is object recognition. Although many image segmentation algorithms can segment an image into regions with some continuous feature, it is still very difficult to recognize objects from these regions. There are several reasons for this. First, image segmentation is an ill-posed task, and there is always some degree of uncertainty in the segmentation result. Second, an object may contain several regions, and how to connect different regions is another problem. At present, no algorithm can segment general images into objects automatically with high accuracy. When there is some prior knowledge about the foreground objects or the background scene, the accuracy of object recognition can be quite good. Usually the image is first segmented into regions according to the pattern of color or texture, and the separate regions are then grouped to form objects. The grouping process is important for the success of object recognition. Fully automatic grouping is possible only when prior knowledge about the foreground objects or background scene exists; in other cases, human interaction may be required to achieve good object recognition accuracy.


4. BASICS OF VIDEO PROCESSING

4.1 Content of Digital Video

Generally speaking, there is much similarity between digital video and images. Each picture of a video can be treated as a still image, and all the techniques applicable to images can also be applied to video pictures. However, there are differences. The most significant one is that video has temporal information and uses motion estimation for compression. Video is a meaningful group of pictures that tells a story. Video pictures can be grouped into shots: a video shot is a set of pictures taken continuously between two camera breaks. Within each shot there can be one or more key pictures. A key picture is a representative of the content of a video shot; for a long video shot there may be multiple key pictures. Usually video processing segments the video into separate shots, selects key pictures from these shots, and then generates features of these key pictures. The features (color, texture, objects) of the key pictures are what is searched in a video query.

Video processing includes shot detection, key picture selection, feature generation, and object extraction.

4.1.1 Shot Detection:

Shot detection is the process of detecting camera shots. A camera shot consists of one or more pictures taken between two camera breaks. The general approach to shot detection has been to define a difference metric: if the difference between two pictures is above a threshold, a shot boundary is declared between them. One proposed algorithm uses binary search to locate shot boundaries, which makes it very fast while still achieving good detection performance.
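A minimal sketch of the difference-metric idea, assuming NumPy and grayscale frames. The histogram distance and the threshold value are illustrative assumptions; this is not the binary-search algorithm mentioned above.

import numpy as np

def hist_diff(frame_a, frame_b, bins=64):
    """Normalized histogram difference between two grayscale frames (0 = identical)."""
    ha, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
    return 0.5 * np.abs(ha / ha.sum() - hb / hb.sum()).sum()

def detect_shots(frames, threshold=0.3):
    """Return indices i where a shot boundary is declared between frame i-1 and i."""
    return [i for i in range(1, len(frames))
            if hist_diff(frames[i - 1], frames[i]) > threshold]

rng = np.random.default_rng(0)
dark = [rng.integers(0, 80, (32, 32)) for _ in range(5)]       # first shot
bright = [rng.integers(150, 255, (32, 32)) for _ in range(5)]  # second shot
print(detect_shots(dark + bright))  # expected: [5]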


4.1.2 Key Picture Selection:

After shot detection, each shot is represented by at least one key picture. The choice of key picture can be as simple as picking a particular picture in the shot: the first, the last, or the middle one. However, in situations such as a long shot, no single picture can represent the content of the entire shot. QBIC (Query by Image Content) uses a synthesized key picture created by seamlessly mosaicking all the pictures in a given shot using the computed motion transformation of the dominant background; this picture is an authentic depiction of all the background captured in the whole shot. In a CBIR (content-based image retrieval) system, key picture selection is a simple process that usually chooses the first and last pictures of a shot as key pictures.

4.1.3 Feature Generation:

After key picture selection, features of the key pictures such as color, texture, and intensity are stored as indexes of the video shot. Users can perform a traditional search using keyword queries, or a content-based query by specifying a color, intensity, or texture pattern. Only the generated features are searched, so retrieval can run in real time.

4.1.4 Object Extraction:

During the process of shot detection and key picture selection, the objects in the video are also extracted, using image segmentation techniques or motion information. Segmentation-based techniques rely mainly on image segmentation, with objects recognized and tracked by segmentation projection. Motion-based techniques make use of motion vectors to distinguish objects from the background and keep track of their motion. Object extraction is a very difficult problem, and the new standard under development addresses how to obtain objects in the video and encode them separately into different layers. Ideally this process would not be manual, but it is also unrealistic to expect it to be fully automatic.


5. CURRENT RESEARCH ABOUT VIDEO INDEXING AND RETRIEVAL

Video indexing and retrieval is a very active research area. In the field of digital video, computer-assisted content-based indexing is a critical technology and currently a bottleneck in the productive use of video resources. Only an indexed video can effectively support retrieval and distribution in video editing, production, video-on-demand, and multimedia information systems. To achieve this, we need algorithms and systems that provide the ability to store and retrieve video in a way that allows flexible and efficient search based on content. In this chapter, we discuss some important aspects of the state-of-the-art progress in video indexing and retrieval. It is organized as follows:

Video Parsing

Video Indexing and Retrieval

Object Recognition and Motion Tracking

5.1 Video Parsing

The first step of video processing is video parsing, the process of segmenting a video stream into generic shots. These shots are the elementary index units in a video database, just like words in a text database. Each of these shots is then represented by one or more key pictures, and only these key pictures are stored in the video database. Video parsing involves several tasks, including shot detection and key picture selection.

5.1.1 Shot Detection in video parsing:

The first step of video parsing is shot detection. Shot detection algorithms usually fall into two classes: (1) those based on global representations, such as color/intensity histograms, without any local information, and (2) those based on measuring local differences, such as intensity change. The former are relatively insensitive to motion but can miss shots when scenes look quite different yet have similar distributions; the latter are sensitive to object and camera motion. Some systems combine the advantages of the two classes by using a mixed method; QBIC is one of these systems.

5.1.2 Key Picture Selection in video parsing:

The next step after shot detection is key picture selection. Each shot has at least one key picture, which best represents the visual content of the shot. The number of key pictures for each shot can be constant or adapted to the shot content. The first picture is selected as a key picture, and subsequent pictures are compared against this candidate. A two-threshold technique, similar to the twin-comparison algorithm described later, is applied to identify a picture significantly different from the candidate. This new picture is considered another key picture, and the subsequent pictures are compared against this new candidate. Users can control the density of key pictures by adjusting the two threshold values.
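A minimal sketch of this selection loop, reusing the illustrative histogram distance from the shot-detection sketch above; the single threshold here is a placeholder the user would tune to control key-picture density (the second, higher threshold for cuts is omitted for brevity).

import numpy as np

def hist_diff(frame_a, frame_b, bins=64):
    """Normalized histogram difference between two grayscale frames (0 = identical)."""
    ha, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
    return 0.5 * np.abs(ha / ha.sum() - hb / hb.sum()).sum()

def select_key_pictures(shot_frames, low=0.2):
    """Within one shot, keep the first frame and every frame that differs enough
    from the current key-picture candidate (the 'low' threshold controls density)."""
    keys = [0]
    for i in range(1, len(shot_frames)):
        if hist_diff(shot_frames[keys[-1]], shot_frames[i]) > low:
            keys.append(i)
    return keys

rng = np.random.default_rng(1)
frames = [rng.integers(0, 60, (32, 32)) for _ in range(4)] + \
         [rng.integers(60, 140, (32, 32)) for _ in range(4)]
print(select_key_pictures(frames))  # expected: [0, 4]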

5.1.3 Feature Generation in video parsing:

As in Section 4.1.3, after key picture selection the features of the key pictures, such as color, texture, and intensity, are stored as indexes of the video shot. Users can perform a traditional search using keyword queries, or a content-based query by specifying a color, intensity, or texture pattern; only the generated features are searched, so retrieval can run in real time.


5.2 Video Indexing and Retrieval

After each object in a video shot has been segmented and tracked, its features such as color, texture, and motion can be obtained and stored in a feature database. The resulting database is a simple set of (feature, value) pairs, and the actual query is performed on this feature database. For each feature, there is a function that calculates the distance between the query object and the tracked objects in the video database. The total distance is a weighted sum of these distances; if the total distance is below a certain threshold, the object is returned as a possible match.
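A minimal sketch of this weighted-distance matching over a toy feature database; the feature names, distance function, weights, and threshold are all illustrative assumptions.

import numpy as np

# Toy feature database: object id -> per-feature vectors (all values are made up).
database = {
    "car_shot12":  {"color": np.array([0.7, 0.2, 0.1]), "motion": np.array([5.0, 0.0])},
    "bird_shot03": {"color": np.array([0.1, 0.1, 0.8]), "motion": np.array([2.0, 3.0])},
}
weights = {"color": 0.6, "motion": 0.4}

def feature_distance(a, b):
    """Per-feature distance; Euclidean is just one reasonable choice."""
    return float(np.linalg.norm(a - b))

def query(query_features, threshold=2.0):
    """Return objects whose weighted total distance to the query is below threshold."""
    matches = []
    for obj_id, feats in database.items():
        total = sum(weights[name] * feature_distance(query_features[name], feats[name])
                    for name in weights)
        if total < threshold:
            matches.append((obj_id, round(total, 3)))
    return sorted(matches, key=lambda m: m[1])

# The car-like object ranks first for a car-like query.
print(query({"color": np.array([0.6, 0.3, 0.1]), "motion": np.array([4.0, 0.5])}))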

There are also a number of image retrieval systems, such as the Yahoo Image Surfer Category List, Web Seer, Web Seek, VisualSeek, UCB's query-all-images, Lycos, and MIT's Photobook. Some of them are mainly based on keyword searching: the images are first assigned one or more keywords manually and categorized into different groups, such as photos, arts, people, animals, and plants. Users can then browse through the separate categories that interest them.

5.2.1 Examples of some image processing systems:

Examples include the "Yahoo Image Surfer Category List (YISCL)" and "Lycos". The YISCL system also provides a visual search function based on color distribution matching. UCB's query-all-images system presents several interesting ideas such as "blob world" and "body plan". A blob is a region. While blob world does not operate entirely in the "thing" domain, it recognizes the nature of images as combinations of objects, and querying and learning in blob world are more meaningful than they are with simple "stuff" representations. The Expectation-Maximization (EM) algorithm is used to perform automatic segmentation based on image features; after segmentation, each region is shown as an elliptic blob. Body plan is an algorithm for image segmentation: a body plan is a sophisticated model of the way, for example, a horse is put together, and as a result the program is capable of recognizing horses in different aspects. MIT's Photobook allows users to perform texture modeling, face recognition, shape matching, brain matching, and interactive segmentation and annotation. Web Seek allows users to draw a query that depicts the spatial relations between objects.


5.3 Object Recognition and Motion Tracking:

Object recognition and motion tracking is an important topic. In video, object recognition is more reliable than in still images because more information is available, the most valuable being motion vectors. The motion vectors of a moving object have intrinsic patterns that conform to a motion model. Several papers describe object recognition using an affine motion model, which allows the transformation that occurred between images to be determined.

6. BASIC REQUIREMENTS

6.1 Video Data Modeling

 

In a conventional database management system (DBMS), access to data is based on distinct attributes of well-defined data developed for a specific application. For unstructured data such as audio, video, or graphics, similar attributes can be defined. A means for extracting the information contained in the unstructured data is required. Next, this information must be appropriately modeled in order to support both user queries for content and data models for storage.

Fig: First Stage in Video Data Adaptation: Data Modeling

From a structural perspective, a motion picture can be modeled as data consisting of a finite length of synchronized audio and still images. This model is a simple instance of the more general models for heterogeneous multimedia data objects. Davenport described the fundamental film component as the shot: a contiguously recorded audio/image sequence. To this basic component, attributes such as content, perspective, and context can be assigned and later used to formulate specific queries on a collection of shots. Such a model is appropriate for providing multiple views on the final data schema and has been suggested by Lippman and Bender. Smith and Davenport use a technique called stratification for aggregating collections of shots by contextual descriptions called strata. These strata provide access to frames over a temporal span rather than to individual frames or shot endpoints. This technique can then be used primarily for editing and creating movies from source shots. It also provides quick query access and a view of desired blocks of video. Because of the linearity of the medium we cannot otherwise get a coherent description of an item, but as a result of the stratification method the related information is lumped together. The linear integrity of the raw footage is erased, resulting in contextual information which relates the shot with its environment. Rowe has developed a video-on-demand system for video data browsing. In this system the data are modeled based on a survey of what users would query for, and three types of indices were identified to satisfy the user queries. The first is a textual bibliographic index, which includes information about the video and the individuals involved in its making. The second is a textual structural index of the hierarchy of the movie, i.e., segments, scenes, and shots. The third is a content index, which includes keyword indices for the audio track, object indices for significant objects, and key images in the video which represent important events.

The above model does not utilize the semantics associated with video data. Different video data types have different semantics associated with them. We must take advantage of this fact and model video data based on the semantics associated with each data type.

6.2 Video indexing

Video annotation or indexing is the process of attaching content-based labels to video. Video indexing is the process of extracting from the video data the temporal location of a feature and its value.


6.2.1 Need for video indexing:

Indexing video data is essential for providing content-based access. Indexing has typically been viewed either from a manual annotation perspective or from an image sequence processing perspective. The indexing effort is directly proportional to the granularity of video access; as applications demand finer-grained access to video, automation of the indexing process becomes essential. Given the current state of the art in computer vision, pattern recognition, and image processing, reliable and efficient automation is possible for low-level video indices, like cuts and image motion properties.

Existing work on content-based video access and video indexing can be grouped into three main categories.

6.2.1.1 High-level indexing:

The work by Davis is an excellent instance of high-level indexing. This approach uses a set of predefined index terms for annotating video. The index terms are organized based on high-level ontological categories such as action, time, and space. High-level indexing techniques are primarily designed from the perspective of manual indexing or annotation. This approach is suitable for dealing with small quantities of new video and for accessing previously annotated databases.

6.2.1.2 Low-level indexing:

These techniques provide access to video based on properties like color and texture, and can be classified under the label of low-level indexing. The driving force behind this group of techniques is to extract data features from the video data, organize the features based on some distance metric, and use similarity-based matching to retrieve the video. Their primary limitation is the lack of semantics attached to the features.

6.2.1.3 Domain-specific indexing:

These techniques use the high-level structure of video to constrain low-level video feature extraction and processing. They are effective in their intended domain of application; their primary limitation is their narrow range of applicability.

Figure 1 shows a conceptual architecture for content-based video indexing and search.

Figure 1: Parsing, representation and organization of video content

6.3 Video Data Management

Here we want to know how to extract content from segmented video shots and then index it effectively so that users can retrieve and browse large video collections. Management of sequential video streams includes three steps: parsing, content extraction and indexing, and retrieval and browsing.

Video parsing is the process of detecting scene changes, or the boundaries between camera shots, in a video stream; the video stream is segmented into generic clips. These clips are the elemental index units in a video database, just like words in a text database. Each of these clips is then represented visually by its key frames. To reduce the storage requirements, only these key frames are stored in the database. There are two types of transitions: abrupt transitions (camera breaks) and gradual transitions, e.g., fade-in, fade-out, dissolve, and wipe.

Indexing tags video clips when the system inserts them into the database. The tag includes information based on a knowledge model that guides the classification according to the semantic primitives of the images. Indexing is thus driven by the image itself and by any semantic descriptors provided by the model. Two types of indices, text-based and image-based, are needed. The text-based index is typed in by a human operator based on the key frames, using a content logger. The image-based index is automatically constructed from the image features extracted from the key frames.

In retrieval and browsing, users can access the database through queries based on text and/or visual examples, or browse it through interaction with displays of meaningful icons. Users can also browse the results of a retrieval query. It is important that both retrieval and browsing appeal to the user's visual intuition. With a visual query, users want to find video shots that look similar to a given example. With a concept query, users want to find video shots by the presence of specific objects or events. A visual query can be realized by directly comparing low-level visual features like color, texture, shape, and temporal variance of video shots or of their representative frames (i.e., key frames). The concept query, on the other hand, depends on object detection, tracking, and recognition. Since fully automatic object extraction is still impossible, some degree of user interaction is necessary in this process. Manual indexing labour can be greatly reduced with the help of video analysis techniques.

7. IMAGE PROCESSING TECHNIQUES FOR VIDEO CONTENT EXTRACTION

The increase in the diversity and availability of electronic information has led to additional processing requirements in order to retrieve relevant and useful data: the accessibility problem. This problem is even more relevant for audiovisual information, where huge amounts of data have to be searched, indexed, and processed. Most of the solutions for this type of problem point towards a common need: to extract the relevant information features for a given content domain, a process which involves two difficult tasks, deciding what is relevant and extracting it. In fact, while content extraction techniques are reasonably developed for text, video data is still essentially opaque. Despite its obvious advantages as a communication medium, the lack of suitable processing and communication platforms has delayed its introduction in a generalized way. This situation is changing, and new video-based applications are being developed.

7.1 Toolkit overview

videoCEL is basically a library for video content extraction. Its components extract relevant features of video data and can be reused by different applications. The object model includes components for video data modelling and tools for processing and extracting video content, but currently the video processing is restricted to images.

At the data modelling level, the most significant concepts are the following:

Images, which represent the frame data: a numerical matrix whose values can be colors, color map entries, etc.;

ColorMaps, which map entries into a color space, allowing an additional indexation level;

ImageDisplayConverters and ImageIOHandlers, which convert images to and from the specific formats of the platforms.

The object model of videoCEL is a subset of a more complete model, which also includes concepts such as shots, shot sequences, and views. These concepts are modelled in a distinct toolkit that provides functionality for indexing, browsing, and playing annotated video segments.

A shot object is a discrete sequence of images with a set of temporal attributes, such as frame rate and duration, and represents a video segment. A shot sequence object groups several shots using some semantic criterion. Views are used to visualize and browse shots and shot sequences.

7.2 Temporal segmentation tools

One of the most important tasks in video analysis is to specify a set of units in which the video temporal sequence may be organized. The different video transitions are important for video content identification and for the definition of the semantics of the video language, making their detection one of the primary goals to be achieved. The basic assumption of the transition detection procedures is that video segments are spatially and temporally continuous, and thus the boundary images must undergo significant content changes, which depend on the transition type and can be measured. The original problem is therefore reduced to the search for suitable difference quantification metrics, whose maxima identify, with high probability, the temporal locations of transitions.

7.3 Cut detection

The process of detecting cuts is quite simple, mainly because the changes in content are very visible and always occur instantaneously between consecutive frames. The implemented algorithm simply uses one of the quantification metrics, and a cut is declared when the differences are above a certain threshold; its success is thus greatly dependent on the suitability of the metric. The results obtained by applying this procedure with some of our metrics are presented next. The thresholds were selected empirically, trying to maximize the success of the detection (minimizing simultaneously the false and missed detections). The captured video segment belongs to an outdoor news report, so its transitions are not very "artistic" (mainly cuts). There are several well-known strategies that usually improve this detection. For instance, the use of adaptive thresholds increases the flexibility of the threshold, allowing the algorithm to adapt to diverse video content. An approach that was used with some success in previous work, to reduce some of the shortcomings of the specific behaviour of individual metrics, was simply to produce a weighted average of the differences obtained with two or more metrics. Pre-processing the images with noise filters or lower-resolution operators is also quite usual, offering a means of reducing image noise and also the processing complexity. The distinctive treatment of image regions, in order to eliminate some of the more extreme values, remarkably increases the detection accuracy, especially when there are only a few objects moving in the captured scene.

7.4 Gradual transition detection

Gradual transitions, such as fades, dissolves, and wipes, cause more gradual changes which evolve over several images. Although the resulting differences are less distinct from the average values, and can be similar to those caused by camera operations, there are several successful procedures, which were adapted and are currently supported by the toolkit.


7.4.1 Twin-Comparison algorithm:

This algorithm was developed after verifying that, although the first and last frames of a transition are quite different, consecutive images remain very similar. Thus, as in cut detection, this procedure uses one of the difference metrics, but, instead of one, it has two thresholds: a higher one for cuts and a lower one for gradual transitions. While this algorithm only detects gradual transitions and distinguishes them from cuts, there are other approaches which also classify fades, dissolves, and wipes.
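A minimal sketch of the two-threshold idea behind the twin-comparison approach, again with an illustrative histogram distance; the threshold values and the accumulation rule are simplified assumptions rather than the published algorithm.

import numpy as np

def hist_diff(a, b, bins=64):
    """Normalized histogram difference between two grayscale frames (0 = identical)."""
    ha, _ = np.histogram(a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(b, bins=bins, range=(0, 256))
    return 0.5 * np.abs(ha / ha.sum() - hb / hb.sum()).sum()

def twin_comparison(frames, t_cut=0.5, t_grad=0.15):
    """Return (cuts, gradual) transition end indices.

    A consecutive difference above t_cut is a cut.  A difference above t_grad
    starts a candidate gradual transition; it is confirmed once the accumulated
    difference from the start frame exceeds t_cut.
    """
    cuts, gradual = [], []
    start = None
    for i in range(1, len(frames)):
        d = hist_diff(frames[i - 1], frames[i])
        if d > t_cut:
            cuts.append(i)
            start = None
        elif d > t_grad:
            if start is None:
                start = i - 1
            if hist_diff(frames[start], frames[i]) > t_cut:
                gradual.append(i)
                start = None
        else:
            start = None
    return cuts, gradual

Applied to a frame sequence, consecutive differences above t_cut are reported as cuts, while runs of moderate differences whose accumulated change from the start frame exceeds t_cut are reported as gradual transitions.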

7.4.2 Edge-Comparison algorithm:

This algorithm analyses both edge-change fractions, exiting and entering. Distinct gradual transitions generate characteristic variations of these values. For instance, a fade-in always generates an increase in the entering edge fraction; conversely, a fade-out causes an increase in the exiting edge fraction; a dissolve has the same effect as a fade-out followed by a fade-in.

7.5 Camera operation detection

As distinct transitions give different meanings to adjacent video segments, the possible camera operations are also relevant for content identification. For example, that information can be used to build salient stills and to select key frames or segments for video representation. All the methods which detect and classify camera operations start from the following observation: each operation generates characteristic global changes in the captured objects and background. For example, when a pan happens, they move horizontally in the direction opposite to the camera motion; the behaviour of tilts is similar but along the vertical axis; zooms generate convergent or divergent motion.

7.5.1 X-ray based method:

This approach basically produces fingerprints of the global motion flow. After extracting the edges, each image is reduced to its horizontal and vertical projections, a column and a row, that roughly represent the horizontal and vertical global motions; these are usually referred to as the x-ray images.
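A minimal sketch of the x-ray idea, assuming a binary edge map as input: each image is reduced to a row profile and a column profile, and the shift that best aligns two column profiles gives a rough estimate of global horizontal motion (a pan). The alignment search is an illustrative choice.

import numpy as np

def xray_projections(edges):
    """Reduce a binary HxW edge image to a column profile and a row profile."""
    vertical = edges.sum(axis=1)    # one value per row: profile along the vertical axis
    horizontal = edges.sum(axis=0)  # one value per column: profile along the horizontal axis
    return horizontal, vertical

def horizontal_shift(prev_profile, curr_profile, max_shift=8):
    """Estimate global horizontal motion as the shift that best aligns two column profiles."""
    best, best_err = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        err = np.sum((np.roll(prev_profile, s) - curr_profile) ** 2)
        if err < best_err:
            best, best_err = s, err
    return best

edges = np.zeros((16, 16), dtype=np.uint8)
edges[:, 5] = 1                              # a single vertical edge
shifted = np.roll(edges, 3, axis=1)          # same scene panned by 3 pixels
h0, _ = xray_projections(edges)
h1, _ = xray_projections(shifted)
print(horizontal_shift(h0, h1))              # expected: 3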


7.6 Lighting conditions characterization

Light effects are usually mentioned in the grammar of cinema language, as their contribution is essential for the overall meaning of the video content. The lighting conditions can be easily extracted by observing the distribution of the light intensity histogram: statistics such as its mode and mean are valuable in characterizing its distribution type and spread. These features also allow the lighting variations to be quantified, once the similarity of the images is determined.

7.7 Scene segmentation

Scene segmentation refers to the decomposition of the image into its main components: objects, background, captions, etc. It is a first step towards the identification and classification of the scene's main features, and their tracking throughout the sequence. The simplest implemented segmentation method is amplitude thresholding, which is quite successful when the different regions have distinct amplitudes; it is a particularly useful procedure for binarizing captions. Other methods are described below.
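A minimal sketch of amplitude thresholding on a grayscale image; the fixed threshold is an illustrative choice (in practice it could be derived from the histogram).

import numpy as np

def amplitude_threshold(gray, threshold=128):
    """Split a grayscale image into two regions by gray-level amplitude."""
    return (gray >= threshold).astype(np.uint8)  # 1 = bright region, 0 = dark region

# Bright synthetic caption text on a dark background.
img = np.full((8, 32), 30, dtype=np.uint8)
img[3:5, 4:28] = 220
mask = amplitude_threshold(img)
print(mask.sum(), "caption pixels")   # expected: 48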

7.7.1 Region-based segmentation:

Region-based segmentation procedures find the various regions of an image which have similar features. One such algorithm is the split-and-merge algorithm, which first divides the image into atomic homogeneous regions and then merges the similar adjacent regions until they are sufficiently different. Two distinct metrics are needed: one for measuring the homogeneity of the initial regions (the variance, or any other difference measure), and another for quantifying the similarity of adjacent regions (the average, median, mode, etc.).

7.7.2 Motion-based segmentation:

The main idea in motion-based segmentation techniques is to identify image regions with similar motion behaviours. These properties are determined by analysing the temporal evolution of the pixels; this analysis is carried out on a frequency image produced over the whole image sequence. When the more constant pixels are selected, for example, the final image is the background, with the motion removed. Once the background is extracted, the same principle can be used to extract and track motion or objects.
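A minimal sketch of this idea using a stack of grayscale frames: the per-pixel temporal median keeps the "more constant" values, approximating the background, and differencing against it separates moving regions. The median is a simple stand-in for the frequency analysis described above.

import numpy as np

def background_from_sequence(frames):
    """Estimate the static background as the per-pixel temporal median of the sequence."""
    return np.median(np.stack(frames), axis=0)

def moving_mask(frame, background, threshold=30):
    """Pixels that differ strongly from the background are treated as moving objects."""
    return (np.abs(frame.astype(np.int32) - background) > threshold).astype(np.uint8)

# Static background with a small bright object crossing the scene.
frames = []
for t in range(10):
    f = np.full((16, 16), 50, dtype=np.uint8)
    f[7, t:t + 3] = 200          # 3-pixel object moving left to right
    frames.append(f)

bg = background_from_sequence(frames)
print(int(moving_mask(frames[0], bg).sum()))   # expected: 3 (the object pixels)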

7.7.3 Scene and object detection:

The process of detecting scenes or scene regions (objects) is, in a certain way, the opposite of transition detection: we want to find image regions whose differences are below a certain threshold. Consequently, this procedure also uses difference quantification metrics. These functions can be computed for the entire image, or a hierarchical calculation at growing resolutions can be performed to accelerate the process. Another tested algorithm, also hierarchical, is based on the Hausdorff distance; it retrieves all the possible transformations (translation, rotation, etc.) between the edges of two images. Another way of extracting objects is by representing their contours. The toolkit uses a polygonal line approach to represent contours as a set of connected segments; the end of a segment is detected when the ratio between the current segment's polygonal area and its length exceeds a certain threshold.

7.7.4 Caption extraction:

Based on an existing caption extraction method, a new and more effective procedure was implemented. As captions are usually artificially added to images, the first step of this procedure is to extract high-contrast regions. This task is performed by segmenting the edge image, whose contours have previously been dilated by a certain radius. These regions are then subjected to caption-characteristic size constraints, based on the properties of the x-rays (projections of edge images); only the horizontal cluster remains. The resulting image is segmented and two different images are produced: one with a black background for lighter text, and another with a white background for darker text. The process is complete after binarizing both images and applying further dimensional region constraints.


7.8 Edge detection

Two distinct procedures for edge detection were implemented: (1) gradient magnitude thresholding, where the image gradient vectors are obtained using the Sobel operator; and (2) the Canny filter, considered the optimal detector, which analyses the representativeness of the gradient magnitude maxima and thus produces thinner contours. As the differential operators amplify high-frequency zones, it is common practice to pre-process the images using noise filters, a functionality also supported by the toolkit in the form of several smoothing operators: the median filter, the average filter, and a Gaussian filter.
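A minimal sketch of the first procedure (gradient-magnitude thresholding with the Sobel operator), assuming SciPy; the threshold value is an illustrative assumption.

import numpy as np
from scipy import ndimage

def sobel_edges(gray, threshold=100):
    """Binary edge map: gradient magnitude from the Sobel operator, then a threshold."""
    gray = gray.astype(np.float32)
    gx = ndimage.sobel(gray, axis=1)   # horizontal derivative
    gy = ndimage.sobel(gray, axis=0)   # vertical derivative
    magnitude = np.hypot(gx, gy)
    return (magnitude > threshold).astype(np.uint8)

# A dark square on a bright background: edges appear along the square's border.
img = np.full((16, 16), 200, dtype=np.uint8)
img[4:12, 4:12] = 40
print(int(sobel_edges(img).sum()), "edge pixels")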


8. APPLICATIONS

8.1 videoCEL applications:

Video browser:

This application is used to visualize video streams. The browser can load a stream and split it into its shot segments using cut detection algorithms. Each shot is then represented in the browser's main window by an icon that is a reduced form of its first frame, and the shots can be played using several view objects.

Weather Digest:

The Weather Digest application generates HTML documents from TV weather forecasts. The temporal sequence of maps presented on TV is mapped to a sequence of images in the HTML page. This application illustrates the importance of information models.

News analysis:

This is a set of applications developed to be used by social scientists in the content analysis of TV news. The analysis consisted of filling in forms with items such as news item durations, subjects, etc., which we attempt to automate. The system generates HTML pages with the images, and CSV (comma-separated values) tables suitable for use in spreadsheets such as Excel. Additionally, these HTML pages can also be used for news browsing, and there is also a Java-based tool for accessing this information.


8.2 COBRA Model

In order to explore video content and provide a framework for the automatic extraction of semantic content from raw video data, we propose the Content Based Retrieval (COBRA) video data model. The model is independent of feature/semantic extractors, providing flexibility by allowing different video processing and pattern recognition techniques to be used for that purpose. The feature grammar is exploited to describe the low-level persistent meta-data; the grammar also describes the dependencies between the extractors.

At the same time, the model is in line with the latest developments in MPEG-7, distinguishing four distinct layers within video content: the raw data, feature, object, and event layers. The object and event layers are concept layers consisting of entities characterized by prominent spatial and temporal dimensions, respectively.

To provide automatic extraction of concepts (objects and events) from visual features (which are extracted using existing video/image processing techniques), the COBRA video model is extended with object and event grammars. These grammars are aimed at formalizing the descriptions of these high-level concepts, as well as facilitating their extraction based on features and spatial-temporal reasoning.

This rule-based approach results in an automatic mapping from features to high-level concepts. However, we still have the problem of creating object and event rules manually. This might be very difficult, especially in the case of certain object rules that require extensive user familiarity with features and object extraction techniques.

As the model also provides a framework for stochastic modeling of events, we have chosen to exploit the learning capability of Hidden Markov Models (HMMs) to recognize events in video data automatically.


Figure 1 - Video sequences

Figure 2 - Video shots

Figure 3 - Principal color detection

Figure 4 - Detected player

SELECT vi.frame_seq
FROM player pl, video vi
WHERE Event: vi.frame.e = ((Appr_net_BSL:
        ((e1:Player_near_the_net, e2:Backhand_slice), (), (), (),
         (before(e2,e1,n), n < 75)),
        e1.o1.name = e2.o1.name = "Sampras");

Query 1 - Video query

The WHERE clause of the query shown above constrains player profiles to only those documents that contain videos with the event 'Approaching the net with a backhand slice stroke'. This new event description, defined inside the query, demonstrates how complex events can be defined dynamically based on previously extracted events and spatial-temporal relations. The first of the two predefined events is the Player_near_the_net event, which is defined using spatial-temporal rules, whereas the second one, the Backhand_slice event, is defined using the HMM approach. The temporal relation requires e1 to start at least 75 frames before event e2. The event descriptions are evaluated by the query processor, which rewrites the event from its conceptual definition into a standard object algebra extended by the COBRA video model and spatial-temporal operations. Therefore, a user is able to explore video content by specifying very detailed and complex queries that include a combination of features, objects, and events, as well as spatial-temporal relations among them.

CONCLUSION

Visual information has always been an important source of knowledge. With the advances in computing and communication technology, this information, in the form of digital images and digital video, is now also widely available through the computer. To be able to cope with the explosion of visual information, an organization of the material which allows fast search and retrieval is required. This calls for systems which can in some way provide content-based handling of visual information. In this seminar I have tried to present the basic image processing techniques, the status of content-based access to image and video databases, and some applications related to video content extraction.

An image extraction system is necessary for users that have large collections of images, such as a digital library. During the last few years, some content-based techniques for image retrieval have become commercially available. These systems offer retrieval by color, texture, or shape, and smart combinations of these help users find the image they are looking for. A video retrieval system is useful for video archiving, video editing, production, etc.
