Video Analysis Techniques that Enhance Video Retrieval

Broadcast Technology No.60, Spring 2015 © NHK STRL

Search engines have made it possible for anyone to obtain information readily from the Internet. On the other hand, broadcasters are still looking for simple ways to search large quantities of video footage. Improvements in computer capabilities have made it possible to quickly find similar images amidst large quantities of video footage, and computers can recognize and search video content by themselves, albeit in a limited manner. Broadcasters are accordingly migrating their video to computer files and preparing environments that will have enhanced video search capabilities. This paper describes research on video analysis techniques that not only enhance searches of programs using text data such as titles and summaries but also enable searches of shots within the video content. It also describes a powerful video search system for broadcasters that has been introduced on a trial basis at NHK.

1. Introduction

The increasing number of Web pages has led to the rapid development of Internet search techniques that target large-scale text data. These techniques have in turn fed the phenomenal expansion of Web content and its usefulness. Techniques for searching images and audio have recently appeared, and powerful systems for searching Web content are becoming indispensable. However, the video services on the Internet search only in units of video clips (short videos lasting from several tens of seconds to several minutes). We cannot find any service with a search function that answers a query by returning the time intervals within a video that show the desired subject.

Broadcasters use huge volumes of video footage every day and have recently started handling these resources as computer files. One advantage of file-based handling is that it makes it possible to skip through a recording during replay, thereby enabling efficient editing and browsing. However, it is impossible to find video as easily as text information, even after the video has been converted into file data. As with text data, quickly finding the desired images amidst a large quantity of video footage requires information that describes the content of each video (metadata), such as when (the temporal position) something appears in the footage and what (the subject content) it is. However, it is difficult to create enough metadata for large quantities of video footage manually. If detailed metadata could be assigned to such video data automatically, it would become possible to easily search for and retrieve video footage.

In this paper, we describe various video analysis techniques for automating the creation of metadata with the aim of enhancing video retrieval. In addition, we describe the trials of an enhanced video search system at NHK.

2. Video analysis techniques that enhance searching

2.1 Issues with video searching

The program archives of NHK are managed and searched in program units. Text information, such as the title, performers’ names, and program summary, which was attached when each program was stored in the database, is used in such searches. The stored program video footage is recycled as raw video for the production of new programs and Web content. Program producers seeking to recycle video footage would like to have a simple means of searching video in units smaller than programs. They need an easy-to-use system that automatically creates metadata describing video segments and uses that metadata for searches (Figure 1).

Figure 1: Division into video segments and assignment of video content information (metadata)

In order to find video using the current search system, users must search through the titles of programs that are likely to include what they are looking for, and they must spend time replaying those programs to find it within the footage. It is not feasible to manually assign detailed metadata to a huge quantity of archived video (the NHK archives hold over 700,000 items). Although subtitles have been added to many programs, they usually do not describe the video content, so subtitles are of limited value in searches.

Raw video footage has no text information attached to it; it is managed by means such as tape numbers and simple notes about the video content. The length of raw footage ranges from ten to almost one thousand times the running time of the finished program. It would be a great waste of labor and money to attach metadata to all the raw video footage, especially since most of it is never used in broadcasts. Furthermore, vast quantities of video are captured in a short time when there is a disaster or other major event; a file-based system should be able to assign metadata to such footage quickly and efficiently.

As shown above, broadcasters must spend considerable effort finding the video footage they want from among large quantities of footage in a variety of situations, and they are demanding enhancements to the current search functions. In the following sections, we describe video analysis techniques for easily searching video footage, such as temporal demarcation, techniques for generating searchable content descriptions (metadata) from video footage, and similar-image searches based on properties such as the composition of objects in images, which are difficult to express in words.

2.2 Structure of video footage

Video differs from still images as a medium in that it has a time axis, and hence it is not easy to list its contents. In long video footage such as a TV program, it takes time to discover the temporal position of the information one is interested in. For that reason, a video search function for obtaining footage in a short time should first divide the video into easy-to-handle time lengths and then determine what those segments show.

A program video can be represented at several levels of granularity, i.e., program, scene, and shot (or cut), as determined by semantic and temporal divisions (Figure 2). A program can thus be viewed as a unit of video content consisting of a number of scenes; it is the result of editing the raw footage. A scene, or a situation, refers to a semantic video segment having the same depicted time or location, and it consists of one or more (usually several) shots. A shot is a segment of video that was filmed without a break. The boundary between one shot and another is called a cut point (excluding boundaries that have special effects such as a wipe*1, dissolve*2, or fade*3); a shot may also be called a cut. Frames are the still images, at a rate of, say, about 30 per second, that make up the video.

Figure 2: Structure of program video footage (a program divides into scenes, shots, and frames along the time axis)

*1 A scene transition technique where the screen “wipes” away, starting from one side of the image to the other, as a way to lead in to the next piece of video footage.

*2 A scene transition technique where the screen gradually switches to the next piece of video footage.

*3 A scene transition technique where the screen gradually changes from a color such as black to a state in which the video footage can be seen (or vice versa).

Unedited raw video footage, on the other hand, has none of the demarcations corresponding to programs or scenes that are given meaning by editing; the medium (such as a tape) used in the filming is the top-level unit. Depending on the content being filmed, the entire raw footage of one tape could consist of a single shot.

(1) Division into shots

A video analysis method that detects the boundary points between shots can be used to retrieve individual shots. In modern video recording devices where video is recorded as a file, these shot delimiters coincide with the data delimiters of file units, so there is no need to detect them again; with VTR footage, however, the shot boundary information is often missing.

A basic method of shot boundary detection is to evaluate the continuity of the video data from one frame to the next and take a point where the continuity breaks to be a cut point. Histogram comparison is often used to evaluate continuity. Histogram comparison methods use differences in color histograms*4, such as RGB (red-green-blue) or HSV*5 histograms, and the processing often does not cover the entire screen at once but instead operates within each of the 16 or so blocks that the screen is divided into1). Histogram comparison cannot detect switches between shots that have a high degree of similarity and can make detection errors on fast-moving shots. For those reasons, there has been research into techniques for handling a number of feature quantities in a multi-dimensional feature space2) and for detecting compositional changes in the texture of the images3).
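To make the idea concrete, here is a minimal sketch of block-wise histogram comparison, assuming OpenCV is available; the grid size, bin count, and threshold are illustrative choices, not those of the cited methods.

    # Sketch of shot-boundary detection by block-wise color histogram
    # comparison (illustrative only; thresholds are arbitrary).
    import cv2
    import numpy as np

    def block_histograms(frame, grid=4, bins=8):
        """Compute a color histogram for each cell of a grid x grid division."""
        h, w = frame.shape[:2]
        hists = []
        for i in range(grid):
            for j in range(grid):
                block = frame[i * h // grid:(i + 1) * h // grid,
                              j * w // grid:(j + 1) * w // grid]
                hist = cv2.calcHist([block], [0, 1, 2], None,
                                    [bins] * 3, [0, 256] * 3)
                hists.append(cv2.normalize(hist, hist).flatten())
        return np.array(hists)

    def detect_cuts(path, threshold=0.5):
        """Report frame indices where most blocks change abruptly at once."""
        cap = cv2.VideoCapture(path)
        cuts, prev, index = [], None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            cur = block_histograms(frame)
            if prev is not None:
                # Per-block histogram distance; a cut changes most blocks together.
                dists = np.linalg.norm(cur - prev, axis=1)
                if np.median(dists) > threshold:
                    cuts.append(index)
            prev, index = cur, index + 1
        cap.release()
        return cuts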

The international TRECVid*6 workshops, which ran a shot boundary detection task from 2001 to 2007, were a driving force in the development and evaluation of detection methods4). The methods resulting from TRECVid can accurately detect shot boundaries (cut switchovers).

We have developed a shot boundary detection method5) that emphasizes processing speed in order to enhance practicality. This method uses the sum of absolute differences of RGB values between frames, which has a low processing cost, to discover candidate shot boundaries. It then uses the block-matching difference*7, which has a high processing cost but is very accurate, to measure the difference between adjacent frames, and it takes any candidate that exceeds a certain threshold to be a shot boundary. It is the fastest boundary detection method developed so far. We have increased its accuracy on broadcast video footage by adding functions such as one that calculates between-frame differences not only between adjacent frames but also between frames a few positions before and after each boundary candidate, treating any change below a certain threshold as not being a shot boundary; this prevents the erroneous detection of shot boundaries caused by the flashes that often occur in raw news footage.
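The two-stage structure, including the flash check just described, can be sketched as follows; the thresholds are assumed values for illustration, and the actual method of Reference 5 differs in detail.

    # Two-stage shot-boundary sketch: a cheap frame-difference test proposes
    # candidates, and a costlier block comparison confirms them.
    import cv2
    import numpy as np

    def frame_difference(a, b):
        """Cheap first stage: mean absolute RGB difference over the frame."""
        return float(np.mean(cv2.absdiff(a, b)))

    def block_difference(a, b, grid=8):
        """Costlier second stage: fraction of blocks whose mean changed a lot."""
        h, w = a.shape[:2]
        changed = 0
        for i in range(grid):
            for j in range(grid):
                ys = slice(i * h // grid, (i + 1) * h // grid)
                xs = slice(j * w // grid, (j + 1) * w // grid)
                if abs(float(a[ys, xs].mean()) - float(b[ys, xs].mean())) > 10:
                    changed += 1
        return changed / (grid * grid)

    def detect(frames, cheap_thresh=15.0, block_thresh=0.6):
        cuts = []
        for k in range(1, len(frames)):
            if frame_difference(frames[k - 1], frames[k]) < cheap_thresh:
                continue  # most frame pairs are rejected here at low cost
            # Flash rejection (as described above): if frames a few positions
            # apart are still similar, the change was transient, not a boundary.
            if (k + 2 < len(frames)
                    and frame_difference(frames[k - 1], frames[k + 2]) < cheap_thresh):
                continue
            if block_difference(frames[k - 1], frames[k]) > block_thresh:
                cuts.append(k)
        return cuts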

However, these methods are less than perfect at detecting gradual switchovers such as wipes, dissolves, and fades, which are often used as dramatic techniques. In addition, the challenge remains of reducing erroneous detections in cases such as large subjects crossing the screen or rapidly blinking light sources.

*4 A frequency distribution chart with color plotted on the horizontal axis and frequency plotted on the vertical axis.

*5 A color space consisting of the three components of hue, saturation, and value.

*6 Text REtrieval Conference Video Retrieval Evaluation: a competitive workshop on information retrieval, sponsored by the US National Institute of Standards and Technology (NIST).

*7 A technique of representing differences between frames, wherein images are divided into small areas (blocks) and it is checked whether the differences between corresponding blocks of consecutive frames exceed a certain value.

(2) Division of shot contents (thumbnail extraction)

It is sometimes useful to divide up shots that run a long time. In addition, shot boundaries are only time data; they do not describe the content of the shot. To represent the content, an image (thumbnail) can be extracted from the shot. A frame from the opening of the shot is often used as the thumbnail, even though the main subject does not necessarily appear in the opening. In fact, there are cases where many subjects appear in sequence. In such cases, a number of thumbnail images have to be extracted from one shot to represent its content, for instance frames at the opening, middle, and end of the shot, or frames chosen according to changes in the content rather than at fixed times. Using multiple thumbnail images to search through the video also reduces the chance of missing the subject while requiring far less processing than searching all of the frames.

The method we developed accumulates the frame-difference information computed for determining shot boundaries and outputs a thumbnail image whenever the accumulated value exceeds a certain threshold. The number of thumbnails thus depends on the amount of change in the content: many thumbnails are output when there are large changes in the video footage, but only a single thumbnail is output when there is little change.
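A minimal sketch of this accumulate-and-threshold idea, assuming OpenCV for decoding and an arbitrary threshold:

    # Sketch of content-adaptive thumbnail extraction: accumulate inter-frame
    # differences and emit a thumbnail whenever the accumulated change exceeds
    # a threshold (the threshold is illustrative).
    import cv2
    import numpy as np

    def extract_thumbnails(path, threshold=2000.0):
        cap = cv2.VideoCapture(path)
        thumbnails, accumulated, prev = [], 0.0, None
        ok, frame = cap.read()
        while ok:
            if prev is None:
                thumbnails.append(frame)      # always keep an opening frame
            else:
                accumulated += float(np.mean(cv2.absdiff(prev, frame)))
                if accumulated > threshold:
                    thumbnails.append(frame)  # large cumulative change: new thumbnail
                    accumulated = 0.0
            prev = frame
            ok, frame = cap.read()
        cap.release()
        return thumbnails  # many for dynamic shots, a single one for static shots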

(3) Division into scenes

In a program video that is the result of editing, individual shots typically last from five to 30 seconds, and a shot can feel too short when verifying search results or when looking for video footage that can be recycled for the purpose at hand. In addition, it is often easier to handle video that is summarized in terms of scenes, by placing semantic links between shots having the same depicted location or time. However, scene boundaries are semantic boundaries, and they may differ depending on the content, the person doing the viewing, the purpose of recycling, and so on. For this reason, automatic scene division is a difficult challenge.

Current scene division techniques use comparatively superficial information, such as changes in color within the video or changes in the audio signal, to judge the continuity of content, and they often regard such discontinuities as scene boundaries7). We have researched scene division techniques based on the continuity of color information8)9). Another method is to analyze the subtitles synchronized with broadcast video footage as a document and compute the degree of cohesiveness of single words (a measure of how often the same word appears)10). Although this is useful for some purposes, it often cannot be used on broadcast programs, because words indicating the subjects are unlikely to appear throughout an entire program. To divide video into scenes that humans would recognize as such, we need to research scene division techniques that integrate components such as sound and text with the video footage, not merely applying signal processing to a single stream of media information but also reflecting the subject and its meanings.
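As a toy illustration of grouping by color continuity (in the spirit of References 8) and 9), not their actual algorithms), adjacent shots can be merged into one scene while their representative frames remain similar in color:

    # Sketch of scene division by color continuity: adjacent shots whose
    # representative-frame histograms overlap strongly are merged.
    import cv2
    import numpy as np

    def color_signature(frame, bins=16):
        hist = cv2.calcHist([frame], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
        # L1-normalize so that histogram intersection lies in [0, 1].
        return cv2.normalize(hist, hist, 1.0, 0, cv2.NORM_L1).flatten()

    def group_into_scenes(shot_thumbnails, similarity=0.5):
        """shot_thumbnails: one representative frame per shot, in time order."""
        sigs = [color_signature(f) for f in shot_thumbnails]
        scenes, current = [], [0]
        for k in range(1, len(sigs)):
            # Histogram intersection: high value means similar color content.
            overlap = float(np.minimum(sigs[k - 1], sigs[k]).sum())
            if overlap >= similarity:
                current.append(k)       # color continuity holds: same scene
            else:
                scenes.append(current)  # continuity broke: start a new scene
                current = [k]
        scenes.append(current)
        return scenes  # each scene is a list of shot indices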

2.3 Techniques of generating content description information (metadata)

The most common video search queries (search keywords) are typically the name of a subject, often a noun such as “Mount Fuji”, “sea”, or “bird”. Search efficiency would be greatly improved if such subject names could be assigned automatically to video segments that have been divided into shot units as described above.

The metadata describing content varies greatly, depending not only on the content but also on the search situation and the user. Someone searching for a news item might query something like “press conference by the Prime Minister on XX day of YY month” or “Metropolitan Police Department building”. On the other hand, someone recycling video for nature programs might enter search keywords such as “Okinawa dolphin”, whereas someone compiling a sports report might enter “Ichiro home run”. However, describing video footage in enough detail to produce relevant results for such queries is difficult even for humans, and the range that can be described by video analysis alone is rather limited. The current technology can only handle keywords consisting of general nouns such as “person” or “mountain”; it is confined to indicating what a generic object is or, in very limited circumstances, what is occurring.

(1) Object recognition

Object recognition refers to an automated function in which a computer recognizes an object that is the subject of a video and outputs text (typically, the name of the subject). Object recognition has been taken up as a challenge by researchers since the advent of computers, but computers still cannot recognize objects as easily as humans do.

In the 1990s, programs were developed that can determine the presence or absence of a subject by representing local features, such as the corners and edges of a subject within an image, as vectors and training a computer to recognize these vectors as representing the desired subject. As shown in Figure 3, image features are expressed as a large number of feature quantities (feature quantity vectors) that are digitizations of the colors and patterns of the image. This set of feature quantity vectors is processed by the computer in order to find the conceptual boundary that separates a set of positive examples, which include the subject, from a set of negative examples, which do not include it. The presence or absence of the subject in question is determined by which of the sets the unknown image’s feature vectors belong to. The same method works on different subjects simply by changing the training data that gives the correct answer.

Figure 3: General object recognition using the bag-of-visual-words technique (gradient histograms at feature points of the training data are clustered into visual words; each image becomes a vector of visual-word appearance frequencies, from which an identifier learns the discriminant criterion separating, e.g., watches from non-watches)

The Scale-Invariant Feature Transform (SIFT) method11) and the Speeded Up Robust Features (SURF) method12) are examples of feature quantity extraction methods. Techniques such as Bag-of-Visual-Words (BoVW) that were subsequently developed can quickly check a large number of feature quantities and handle local feature quantities independently of their position. The BoVW technique calculates a gradient histogram*8 of luminance from the area surrounding each feature point, such as an edge or corner, and takes the clusters of such points in a multi-dimensional space (similar features form a cluster) to be visual words*9. It then calculates feature vectors based on the appearance frequencies of those visual words, as shown in Figure 3, and uses a Support Vector Machine (SVM)*10 or the like to perform the identification.
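A compact sketch of this pipeline is shown below. It is illustrative only: it substitutes ORB descriptors (a freely available alternative to SIFT and SURF) and uses scikit-learn’s k-means and SVM, and the vocabulary size is an arbitrary choice.

    # Sketch of the bag-of-visual-words pipeline of Figure 3: local descriptors
    # are clustered into "visual words", each image becomes a word-frequency
    # histogram, and an SVM separates positive from negative examples.
    import cv2
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    orb = cv2.ORB_create()

    def local_descriptors(image):
        _, desc = orb.detectAndCompute(image, None)
        return desc if desc is not None else np.zeros((0, 32), np.uint8)

    def build_vocabulary(images, n_words=256):
        all_desc = np.vstack([local_descriptors(im) for im in images])
        return KMeans(n_clusters=n_words, n_init=4).fit(all_desc.astype(np.float32))

    def bovw_histogram(image, vocabulary):
        desc = local_descriptors(image).astype(np.float32)
        words = vocabulary.predict(desc)              # nearest visual word per point
        hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)            # appearance frequencies

    def train_detector(pos_images, neg_images, vocabulary):
        X = [bovw_histogram(im, vocabulary) for im in pos_images + neg_images]
        y = [1] * len(pos_images) + [0] * len(neg_images)
        return SVC(kernel="rbf").fit(X, y)            # the "identifier" of Figure 3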

In 2012, a method called deep learning, which uses a neural network*11 with many layers, was applied to general object recognition. Besides greatly improving recognition accuracy13), it can automatically learn effective feature quantities. Recent advances in the capabilities of computers and the availability of large quantities of video footage and images on the Internet have made it possible to develop appropriate training methods for building neural networks with good performance. However, training a neural network still takes time, a large quantity of data is required, and the parameter adjustment is complicated, so it will likely be some time before deep learning is practical.
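For illustration, here is a minimal many-layer convolutional network in PyTorch; it is vastly smaller than the network of Reference 13, but it shows the key point that the convolutional stages learn the features that BoVW engineers by hand.

    # Minimal convolutional network sketch (PyTorch), illustrating "many
    # layers": stacked convolution stages learn the feature extraction.
    import torch
    import torch.nn as nn

    class TinyConvNet(nn.Module):
        def __init__(self, n_classes=2):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.classifier = nn.Linear(64 * 8 * 8, n_classes)  # for 64x64 input

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    # One optimisation step on a dummy batch of 64x64 RGB images:
    model = TinyConvNet()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    images = torch.randn(8, 3, 64, 64)   # dummy training images
    labels = torch.randint(0, 2, (8,))   # dummy subject / non-subject labels
    loss = nn.CrossEntropyLoss()(model(images), labels)
    loss.backward()
    optimizer.step()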

We are developing a technique that gives good recognition accuracy despite using less training data; it is especially useful in broadcasting, where the available training time is short14). This technique addresses two issues with the BoVW technique: the spatial position information within the video frame is not reflected in the feature vectors, and the training data is expensive to make. It divides each frame image into areas of various sizes and then composes feature vectors reflecting spatial information about the subject by calculating the image features in each area. In the image feature calculation, subject features can be captured more accurately by considering global feature vectors that take account of larger areas, in addition to the local feature vectors obtained by the BoVW technique. Furthermore, by selecting effective feature quantities, it achieves highly accurate recognition even with a comparatively small quantity of training data. We have also devised a semi-supervised training*12 method that can efficiently yet accurately create training data by assigning labels to only part of the data15). Furthermore, we use search functions like those used on the Internet when collecting images for training16).
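The area-wise feature composition can be sketched as a spatial pyramid; the exact layout here is an assumption (Reference 14 gives the details), and bovw_histogram() is reused from the earlier sketch.

    # Sketch of spatial-pyramid feature composition: BoVW histograms are
    # computed for the whole frame and for sub-regions of several sizes, then
    # concatenated so the feature vector reflects where subjects appear.
    import numpy as np

    def spatial_pyramid_feature(image, vocabulary, levels=(1, 2, 4)):
        h, w = image.shape[:2]
        parts = []
        for grid in levels:                  # 1x1 (global), 2x2, and 4x4 regions
            for i in range(grid):
                for j in range(grid):
                    region = image[i * h // grid:(i + 1) * h // grid,
                                   j * w // grid:(j + 1) * w // grid]
                    parts.append(bovw_histogram(region, vocabulary))
        return np.concatenate(parts)         # global and local evidence combined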

(2) Face detection

The faces of people in TV programs often convey important meanings. Finding situations within news videos, such as interviews and press conferences, in which faces are shown would also be useful for recycling video footage. In comparison with other objects, the variations among human faces viewed from the front are comparatively small, which makes faces especially amenable to detection and recognition.

Face detection research17) has a long history. Early work depicted a face as an outline that a computer then matched with standard face patterns18). A more recent method19) uses AdaBoost*13, a statistical machine learning method*14. It extracts rectangular feature quantities represented by a group of black-and-white filters and configures strong classifiers*15 by selecting, in order of importance, weak classifiers*16 for each feature quantity. Face detection systems such as this are used for improving the photographic quality of commercial digital cameras, for authentication on smartphones and PCs, and for checking persons’ identities at airports20). However, this approach works reliably only when the subject directly faces the camera under controlled size and brightness conditions. It has difficulty detecting faces in video with varied lighting conditions and persons facing directions other than directly at the camera, and it has trouble accurately identifying individuals in the crowd scenes that appear in many TV programs.
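A pretrained cascade implementing the detector of Reference 19 ships with OpenCV; a minimal usage sketch follows (the image file name is a placeholder).

    # Face detection with a pretrained Viola-Jones cascade shipped with OpenCV
    # (the method of Reference 19). Works best on near-frontal faces.
    import cv2

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def detect_faces(frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Returns (x, y, w, h) rectangles; tune scaleFactor/minNeighbors as needed.
        return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    frame = cv2.imread("frame.jpg")  # placeholder input image
    for (x, y, w, h) in detect_faces(frame):
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)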

We are conducting research into face detection, tracking, and identification techniques21) based on the method of Reference 19). However, faces in video footage to be used in broadcasts, particularly in raw news material, may be partially obscured by masks, eyeglasses, and hats, and they may be oriented at extreme angles. Detection in such cases produces many false negatives. In order to reduce their number, we modified an object recognition method so that it could be trained on close-up images of human faces, and we combined it with another face detection method that is used in some applications22).

*8 A frequency distribution diagram with the gradient direction plotted along the horizontal axis and gradient strength along the vertical axis.

*9 A method of handling local feature quantities of an image as single words, by applying a natural language processing method called Bag-of-Words, which handles a document as a set of words.

*10 A supervised machine learning method for pattern identification (a training method for data that has been assigned labels) which creates linear classifiers that divide data into two classes.

*11 A mathematical model of brain functions that can be used to simulate the brain on a computer.

*12 A machine learning method that generates classifiers from incomplete training data (a mixture of labeled and unlabeled data).

*13 A machine learning algorithm that adjusts for the mistakes made by the previous classifier when it creates the next one.

*14 A machine learning method based on statistical methods.

*15 A classifier that has high identification accuracy as a result of combining weak classifiers.

*16 A classifier that does not have high identification accuracy, but is more accurate than random identification.

(3) Recognition of text information

If text appearing in video footage could be recognized readily, extremely effective metadata in the form of unique place names, personal names, and product names could be attached to broadcast video footage.

Printed text recognition (Optical Character Recognition, or OCR) is a mature technology, and characters can be recognized with a high degree of accuracy if documents are printed in black and white with a clear font. Sufficient accuracy can also be obtained for multi-level typesetting and vertical and horizontal writing.

However, the same cannot be said for characters appearing in images (photographs) and video23). There are various reasons for this: the wide variety of typefaces, character colors, and background colors prevents characters from being neatly separated from the background, and variations in the size, orientation, and slope of characters have large effects. In addition, normal scenes contain numerous edge features that resemble those of characters, such as buildings and window frames, making it difficult to detect the character areas themselves. Finally, languages that use Chinese characters, such as Japanese, are extremely difficult targets for text recognition, because they have many different characters and feature both vertical and horizontal writing.

The recent spread of smartphones and the advent of eyeglass-type smart devices have led to an increasing demand for character information recognition techniques, and more and more researchers are being attracted by the challenges set by competitive workshops24) for developing techniques of recognizing character data within video footage.

(4) Detection of events

People can easily recognize events (incidents), such as goals in a soccer game, by observing short video segments. If a computer could detect such events as easily, it would open up a variety of new uses for video.

A lot of research has gone into automatically identifying the movements of people and detecting suspicious behavior in video footage25). Events in sports programs, such as home runs, doubles, and strikeouts in baseball, can now be extracted by virtue of the fact that they occur in comparatively similar shots26)27). It is also possible for computers to recognize motions such as jumps and shots at the hoop in basketball28), and free kicks and kick-offs by soccer players from video capturing a view of the entire pitch29).

However, unlike events captured under specific conditions such as those of sports, the identification of complex and diverse events in everyday situations is still a difficult challenge. There is a recognition method that limits events according to the type of subject30), and attempts are being made to recognize complex events by combining and correlating simpler events31). As evidence of the growing interest in such studies, the TRECVid Multimedia Event Detection task for extracting complicated events from video footage began in 201032).

Unlike still images such as photographs, video footage contains movement information that can be used to describe events in greater detail. The detection and recognition of complex events will continue to be an important research topic for the foreseeable future.

2.4 Similar image search

There are times when a user wants to search for video, say, of a landscape similar to an image at hand, or has an image in mind and wants to focus on details such as its composition, color, or pattern that cannot easily be described in words. Here, similar-image search technology, which finds images similar to an example image, is a way to enable such non-verbal information to be used as a query. A similar image search digitizes the colors and patterns within a given image and compares the mathematical distances between these numerical values and the stored digitized data of other images; similar images lie at small distances. Previous training of the system is unnecessary, and large numbers of images can be processed comparatively quickly by digitizing and storing features such as colors and patterns beforehand. A number of such systems are in use or nearing deployment, and they have functions33) similar to those of Google image search and other image search web sites34).
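A minimal sketch of such a training-free search, assuming OpenCV and an illustrative color-histogram feature:

    # Sketch of similar-image search without training: each image is reduced
    # to a color-histogram feature, and the query is ranked against stored
    # features by distance.
    import cv2
    import numpy as np

    def feature(image, bins=8):
        hist = cv2.calcHist([image], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
        return cv2.normalize(hist, hist, 1.0, 0, cv2.NORM_L1).flatten()

    def build_index(images):
        return np.array([feature(im) for im in images])  # digitized beforehand

    def search(query_image, index, top_k=10):
        q = feature(query_image)
        dists = np.linalg.norm(index - q, axis=1)  # mathematical distance
        return np.argsort(dists)[:top_k]           # most similar stored images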

We are researching similar image search techniques that calculate the degree of similarity to an example image. For broadcast production purposes, we are interested more in similarities of overall structure than in exactly matching images. Our method compares the colors and textures of rather large blocks in a 4x4 grid, enabling searches with an emphasis on the overall composition. To address the problem of the degree of similarity deteriorating when the main subject straddles the boundaries of a number of blocks, as shown in Figure 4, we have also implemented a gradual similar image search that moves the blocks within a certain range35). In experiments on this technique, over 70% of the images among the top-ten search results by degree of similarity were judged to be similar, which we consider sufficient accuracy for narrowing down video footage. This technique can also take a hand-drawn sketch as input, comparing its color and texture with those of images stored in the database36). This similar image search could be further developed to cover whole scenes by comparing the feature quantities of representative still images of a number of shots.

Figure 4: Processing for when a subject in a similar image search straddles a number of block boundaries (the subject area is specified by a saliency map, the central block is displaced into the subject area, and block image features are compared while tracking the subject, making it possible to discover images whose layout is slightly different)
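The block comparison with a displacement tolerance can be sketched as below. This is a simplification: the query grid is simply evaluated at a few pixel shifts, whereas the method of Reference 35 guides the displacement with a saliency map.

    # Sketch of composition-oriented matching: per-block color features on a
    # 4x4 grid, evaluated at small grid displacements so that a subject
    # straddling block boundaries (Figure 4) is not penalised.
    import numpy as np

    def grid_features(image, grid=4):
        h, w = image.shape[:2]
        cells = [image[i * h // grid:(i + 1) * h // grid,
                       j * w // grid:(j + 1) * w // grid].mean(axis=(0, 1))
                 for i in range(grid) for j in range(grid)]
        return np.array(cells)               # mean color per block

    def block_distance(query, candidate, max_shift=8):
        """Best block-feature distance over small displacements of the query."""
        h, w = query.shape[:2]
        best = np.inf
        for dy in (-max_shift, 0, max_shift):
            for dx in (-max_shift, 0, max_shift):
                shifted = query[max(dy, 0):h + min(dy, 0),
                                max(dx, 0):w + min(dx, 0)]
                d = np.linalg.norm(grid_features(shifted) - grid_features(candidate))
                best = min(best, d)          # keep the most favourable alignment
        return best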

3. Search system that applies video analysis techniques

Video footage management in the past mainly involved searching through libraries of VTR tapes and small quantities of text data. NHK is moving away from this method and is converting video into file data, and we have started experimental use of video search systems. Below, we describe two of those systems.

3.1 Earthquake disaster metadata supplementation system

During the Great East Japan Earthquake that occurred in 2011, large quantities of raw video were captured for news coverage. This video footage is not just valuable as raw broadcast material; it is also very important for preventing and reducing the impact of future disasters, and there is an urgent need to organize and archive it. However, the work of organizing it is enormous, and the construction of a video database has been a challenge.

We have prototyped a metadata supplementation system that detects shot boundaries in raw video footage and reduces the work needed to assign metadata about the subjects22)37). The system assigns metadata automatically by combining techniques such as video analysis and sound identification. With the aim of assigning accurate metadata efficiently, it automatically demarcates footage into shot units and assigns subject information at high speed; users then manually correct recognition errors and add semantic information. Its video search function uses the assigned metadata. The system is currently in trial operation at the Fukushima broadcasting station, where it has processed more than ten thousand VTR tapes in approximately three months, and materials found with its video search functions have been used in program production. It works on a notebook PC, and we are studying how it can be used for other purposes. An overview of the processing is shown in Figure 5, and the system in use at the Fukushima office is shown in Figure 6.

Figure 5: Overview of processing of the earthquake disaster metadata supplementation system (footage ingested from an LTO*1 drive undergoes shot division, thumbnail creation, and proxy video generation; face detection, object recognition, and audio recognition techniques extract subjects such as face close-ups, aerial views, interviews, people’s voices, fire and flames, water, rubble, helicopters, ambulances, and nuclear plants; users verify, correct, and add metadata with check tools before it is registered, together with caption data and filming notes, in the search database via an FTP*2 tool)

*1 Linear Tape-Open: a standard for magnetic tape for computers.

*2 File Transfer Protocol.

Figure 6: The earthquake disaster metadata supplementation system in use at the Fukushima office
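As a purely hypothetical illustration of how such a pipeline could assemble per-shot metadata records for the search database, with invented field names and a detector interface simplified from Figure 5:

    # Illustrative (hypothetical) glue for a Figure 5-style pipeline: each
    # shot is run through recognizers and the results become a record that
    # a person later verifies and corrects.
    from dataclasses import dataclass, field

    @dataclass
    class ShotRecord:
        tape_id: str
        start_frame: int
        end_frame: int
        subjects: list = field(default_factory=list)  # e.g. ["rubble", "helicopter"]
        faces: int = 0                                # number of detected faces
        notes: str = ""                               # manual corrections/additions

    def annotate_shot(tape_id, start, end, frames, detectors):
        record = ShotRecord(tape_id, start, end)
        for name, detector in detectors.items():      # e.g. {"rubble": fn, ...}
            if any(detector(f) for f in frames):
                record.subjects.append(name)
        return record                                 # stored in the search database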

3.2 Validation of sophisticated archives search

Targeting NHK’s archive of past programs, we have prototyped a search system that incorporates object recognition for extracting subject information and similar image searches for video footage. We plan to start verification experiments of the search functions in January 2015. This system provides the following search functions on the search screen shown in Figure 7:

- Search using subject metadata assigned to each shot by object recognition
- Search for shots that are similar to a designated shot or an uploaded image
- Listing of program contents in shot units or scene units
- Listing of shots in a program that have superimpositions (characters or diagrams overlaid on the screen)
- Listing of shots in a program that show faces
- Search results for shots that are similar to the currently selected shot

We will evaluate the search functions at broadcast production sites, as well as the function that automatically assigns metadata to video archives.

Figure 7: Example of search screens of the NHK archives search system

4. Conclusions

This paper described trends in video analysis techniques and the circumstances under which we have developed a video footage search system that enables users to readily obtain the video footage they want. The increasing capabilities of computers have made it possible for broadcasters to handle large quantities of video footage by turning it into files, and we have finally arrived at the era of automated video search. However, there are still many challenges to overcome before we truly have a search system that can easily find and retrieve video footage, and various research projects are currently under way (see Reference 38 for details).

The video analyses that we have described for enhancing video searches work at the current technological level, depending on the usage environment and the method of systemization. Even though it is not possible to automate the entire process, we can incorporate these techniques into support systems to reduce the work of broadcasters and of workers in other fields that process video materials.

(Hideki Sumiyoshi)

References

1) A. Nagasaka and Y. Tanaka: “Automatic Video Indexing and Full-Video Search for Object Appearances,” Proc. IFIP TC 2/WG 2.6 Second Working Conference on Visual Database Systems II, pp. 113-127 (1991)

2) Iwamoto and Yamada: “A Cut Detection Method for a Video Sequence based on Multi-Dimensional Feature Space Analysis,” FIT, I-026 (2005)

3) Mochizuki, Tadenuma, and Yagi: “Cut Point Detection based on Variations of Fractal Features,” The Institute of Electronics, Information and Communication Engineers (IEICE) General Conference, D11-134, p. 134 (2005)

4) A. Smeaton, P. Over, and A. Doherty: “Video Shot Boundary Detection: Seven Years of TRECVid Activity,” Computer Vision and Image Understanding, Volume 114, Issue 4, pp. 411-418 (2010)

5) Kawai, Sumiyoshi, Yagi: “Fast Detection Method for Shot Boundary Including Gradual Transition Using Sequential Feature Calculation,” The Institute of Electronics, Information and Communication Engineers (IEICE) Transaction on Information and Systems D, Vol. J91-D, No. 10, pp. 2529-2539 (2008)

6) Kawai, Sumiyoshi, Fujii, and Yagi: “Method of Correcting Video Fluctuations due to Flash Using Frame Interpolation,” The Institute of Image Information and Television Engineers Journal, Vol. 66, No. 11, pp. J444-J452 (2012)

7) Sou, Ogawa, and Haseyama: “Study into Increasing Accuracy of Scene Divisions by MCMC Method, Focusing on Video Structure,” IEICE technical report, CAS Circuitry and Systems, 110 (86), pp. 115-120 (2010)

8) Fukuda, Mochizuki, Sano, and Fujii: “Scene Sequence Generation of Program Video Based on Integrative Color List,” Proceedings of the ITE Annual Convention, 23-8 (2012)

9) Mochizuki and Sano: “Video Scene Sequence Generation by Shot Integration based on Image Piece List,” FIT, No. 3, H-004, pp. 101-102 (2013)

10) M. Hearst: “Multi-Paragraph Segmentation of Expository Text,” Proc. 32nd Annual Meeting of the Association for Computational Linguistics (1994)

11) David G. Lowe: “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, 60 (2), pp. 91-110 (2004)


12) H. Bay, T. Tuytelaars, L. Gool: “SURF: Speeded Up Robust Features,” Proc. of European Conference on Computer Vision, pp. 404-415 (2006)

13) A. Krizhevsky, I. Sutskever, G. Hinton: “ImageNet Classification with Deep Convolutional Neural Networks,” Advances in Neural Information Processing Systems 25 (2012)

14) Kawai and Fujii: “Semantic Concept Detection based on Spatial Pyramid Matching and Semi-supervised training,” ITE Trans. Media Technology and Applications, Vol. 1, No. 2, pp. 190-198 (2013)

15) Kawai and Fujii: “A Video Retrieval System based on an Iterative Learning Considering Closed Caption and Image Feature,” Proceedings of the ITE Annual Convention, 23-7 (2012)

16) Kawai, Mochizuki, Sumiyoshi, and Sano: “NHK STRL at TRECVID 2013: Semantic Indexing,” TREC Video Retrieval Evaluation (TRECVID 2013 Workshop) (2013)

17) Iwai, Lao, Yamaguchi, and Hirayama: “A Survey on Face Detection and Face Recognition,” Information Processing Society of Japan, CVIM Research Materials, CVIM-149 (37) (2005)

18) Sakai, Nagao, Fujibayashi, and Kidode: “Line Extraction and Pattern Detection in a Photograph,” Information Processing, Vol. 10, No. 3, pp. 132-142 (1969)

19) P. Viola and M. Jones: “Robust Real-time Face Detection,” International Journal of Computer Vision (IJCV) 57(2), pp. 137-154 (2004)

20) “Concerning the Implementation of Demonstration Experiments on New Automated Gate,” http://www.moj.go.jp/nyuukokukanri/kouhou/nyuukokukanri04_00023.html

21) S. Clippingdale and M. Fujii: “Video Face Tracking and Recognition with Skin Region Extraction and Deformable Template Matching,” International Journal of Multimedia Data Engineering and Management (IJMDEM) Vol. 3, No. 1, pp. 36-48 (2012)

22) Sumiyoshi, Kawai, Mochizuki, Sano, and Fujii: “Metadata Supplementation System for Earthquake Disaster Archives,” Proceedings of the ITE Annual Convention, 6-1 (2012)

23) L. Neumann and J. Matas: “A Method for Text Localization and Recognition in Real-world Images,” Asian Conference on Computer Vision, ACCV’2010, pp. 2067-2078 (2010)

24) D. Karatzas, et al.: “ICDAR 2013 Robust Reading Competition,” ICDAR 2013 (2013)

25) J. Fiscus, et al.: “TRECVID 2009 Video Surveillance Event Detection Track,” 2009 TREC Video Retrieval Evaluation Notebook Papers and Slides (http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.9.org.html)

26) Mochizuki, Fujii, Yagi, and Shinoda: “Automatic Event Classification of Baseball Broadcast Video, Using Patterning of Scenes Focusing on Next Shot in Baseball and Discrete Hidden Markov Model,” ITE Journal, Vol. 61, No. 8, pp. 1139-1149 (2007)

27) Yazaki, Misu, Nakata, Motoi, Kobayashi, Matsumoto, and Yagi: “Increasing the Accuracy of Sports Event Detection Using Bayes Hidden Markov Model,” IEICE Technical Report, Human Information Processing, HIP 109 (471), pp. 401-406 (2010)

28) M. Takahashi, M. Naemura, M. Fujii, and James J. Little: “Recognition of Action in Broadcast Basketball Video on the Basis of Global and Local Pairwise Representation,” Proc. IEEE International Symposium on Multimedia (ISM 2013), pp. 147-154 (2013)

29) Misu, Takahashi, Tadenuma, and Yagi: “Real-Time Event Detection based on Formation Analysis of Soccer Video,” FIT, LI003 (2005)

30) M. Mazloom, E. Gavves, K. E. A. van de Sande, and C. Snoek: “Searching Informative Concept Banks for Video Event Detection,” Proc. 3rd ACM Conf. International Conference on Multimedia Retrieval (ICMR2013), pp. 255-262 (2013)

31) Z. Ma, Y. Yang, Z. Xu, S. Yan, Nicu Sebe and A. G. Hauptmann: “Complex Event Detection via Multi-Source Video Attributes,” IEEE Conf. Computer Vision and Pattern Recognition (CVPR2013), pp. 2627-2633 (2013)

32) J. Fiscus, G. Sanders, D. Joy and P. Over: “2013 TRECVID Workshop Multimedia Event Detection and Recounting Tasks,” http://www-nlpir.nist.gov/projects/tvpubs/tv13.slides/tv13.med.mer.final.slides.pdf (2013)

33) Google image search: “Similar Images Graduates from Google Labs,” http://googleblog.blogspot.com/2009/10/similar-images-graduates-from-google.html

34) amanaimages: http://amanaimages.com/

35) T. Mochizuki, H. Sumiyoshi, M. Sano and M. Fujii: “Visual-based Image Retrieval by Block Reallocation Considering Object Region,” Asian Conference on Pattern Recognition (ACPR2013), PS2-03, pp. 371-375 (2013)

36) Mochizuki, Sumiyoshi, and Fujii: “Faster Image Retrieval by Query Image Drawing using Structural Template,” FIT, No. 3, H-040, pp. 219-220 (2010)

37) Sumiyoshi, Kawai, Mochizuki, Clippingdale, and Sano: “Metadata Supplementation System for Earthquake Disaster Archives,” Proceedings of the ITE Annual Convention, 14-2 (2013)

38) M. Haseyama, T. Ogawa and N. Yagi: “A Review of Video Retrieval Based on Image and Video Semantic Understanding,” ITE Trans. on Media Technology and Applications, Vol. 1, No. 1, pp. 2-9 (2013)