associating video frames with texthdw/acm-sigir-2003.pdftext and video frames, a query based on text...

Associating video frames with text

Pınar Duygulu and Howard D. WactlarInformedia Project

School of Computer ScienceCarnegie Mellon University

Pittsburgh, PA, USA

[email protected], [email protected]

ABSTRACT

In this study, integration of visual and textual data is pro-posed to solve the correspondence problem between videoframes and associated text in order to annotate video frameswith more reliable labels and descriptions. Visual featuresextracted from video frames are linked to text that is ob-tained from the audio transcripts using joint statistics. Theresults show that using this approach it is possible to havebetter annotations for video frames that can be later usedto improve the performance of text based queries. The pro-posed approach will be integrated into the Informedia Digi-tal Video Library Project at Carnegie Mellon University.

1. INTRODUCTION

Video is a rich source of data where visual and textual in-formation occur together. A system that combines thesetwo types of data is more powerful than a system address-ing only one of them, since text and visual features maybe ambiguous if treated in isolation. In order to make bet-ter use of the available data, visual features (such as color,texture, shape, and motion features) and semantics derivedfrom text can be integrated. The text provides high-level se-mantics for a video sequence that cannot be obtained usingcurrent computer vision techniques, and visual propertiessupply rich information that can be difficult to express in atextual format.

When text and visual properties are integrated, several ap-plications become possible including better search and brows-ing. There are many collections where images/videos areannotated with some descriptive text (e.g., Corel data set,some museum collections, news photographs on the web withcaptions, etc.). Manual annotation of these collections issubjective and requires a huge amount of effort [14]. Someapproaches are proposed to make this process easier andautomatic [2, 4, 5, 12, 13].

Figure 1: An example query result from the Infor-media system [1], where the query word is president.A story is segmented into shots where each shot isrepresented with a keyframe. Transcripts that areextracted from the audio narrative are aligned withthe shots. The arrows represent the shots where thequery word occurs in the transcript aligned withthe shot. There are correspondence problems be-tween the frames and text. The word president ismentioned when the anchor and/or reporter weretalking; but the president appears when his name isnot mentioned. If a single frame is required as theresult of the query, the system will produce the an-chorperson/reporter frames which are not desirablein many cases. It is better to produce a frame wherethe president appears.

In these manually annotated image collections, although itis known that the annotation words are associated with theimage, the correspondence between the words and the imageregions are unknown. Some methods are proposed to findthe correspondences by modeling the joint statistics of wordsand image regions [2, 10, 11, 15, 17].

Similar correspondence problems occur in video data. Thereare sets of video frames and transcripts extracted from theaudio speech narrative, but the semantic correspondencesbetween them are not fixed because they may not be co-occurring in time. If there is no direct association between

caeText Box26th Annual International ACM SIGIR Conference, Toronto, Canada, July 28-August 1, 2003.

text and video frames, a query based on text may produceincorrect visual results.

For example, in most news videos (see Figure 1) the an-chorperson talks about an event, place or person, but theimages relating to the event, place, or person appear laterin the video. Therefore, a query based only on text relatedto a person, place, or event, and showing the frames at thematching narrative, will yield incorrect frames of the an-chorperson as the result.

The goal of this study is to determine the correspondencesbetween the video frames and associated text in order toannotate the video frames with more reliable labels and de-scriptions. This enables a textual query to return more ac-curate semantically corresponding images, and enables animage-based query or response to provide more meaningfuldescriptors.

In Section 2, our approach to solve the correspondence prob-lem between image regions and text will be explained. Thestrategy to apply a similar approach on video data will bedescribed in Section 3. In Sections 4 and 5, the procedureand the experimental results will be presented for two differ-ent data sets : TREC 2001 and Chinese cultural data setsrespectively. Section 6 will discuss the results and presentfuture directions.

2. MULTIMEDIA TRANSLATION

The problem of finding the correspondences can be consid-ered as the translation of visual features to words, similar tothe translation of text from one language to another. In thatsense, there is an analogy between learning a lexicon for ma-chine translation and learning a correspondence model forassociating words with image regions.

Learning a lexicon from data is a standard problem in ma-chine translation literature [7, 16]. Typically, lexicons arelearned from a type of data set known as an aligned corpora.Assuming an unknown one-to-one correspondence betweenwords, coming up with a joint probability distribution link-ing words in two languages is a missing data problem [7]and can be dealt by application of the EM (ExpectationMaximization) algorithm [9].

Data sets consisting of annotated images are similar to alignedcorpora. There is a set of images, each consisting of a num-ber of regions and a set of associated words. Each imageis segmented into regions and from each region a set of fea-tures, including color, texture, shape, position and size, areextracted. In order to exploit the analogy with machinetranslation, we vector-quantize the set of features represent-ing an image region using k-means. Each region then gets asingle label (blob token).

The problem is then to construct a probability table thatlinks the blob tokens with word tokens. The probability ta-ble is initialized to the co-occurrences of blobs and words.The final translation probability table is constructed usingthe EM algorithm which iterates between the two steps:

(i) use an estimate of the probability table to predict corre-spondences;(ii) then use the correspondences to refine the estimate ofthe probability table.

Once learned, the probability table is used to predict wordscorresponding to particular image regions (region naming),or words associated with whole images (auto-annotation)(see [10, 11] for the details).

3. CORRESPONDENCES ON VIDEO

The Informedia Digital Video Library Project at CarnegieMellon University [1] combines speech, image and naturallanguage understanding to automatically transcribe, seg-ment and index video for intelligent search and image re-trieval. The current library consists of terabytes of data(broadcast news captured over the last years, documen-taries produced for public television and government agen-cies, classroom lectures, and other video genres) with auto-matically extracted metadata and indices.

Broadcast news is very challenging data, but due to its na-ture it is mostly based on people and requires a focus onperson detection/recognition. In this study, we use subsetsof Chinese culture and TREC 2001 data sets which are rel-atively simpler.

The data consists of video frames and the associated tran-script extracted from the audio. The frames and transcriptsare associated on the shot-basis. Each shot is representedwith a single keyframe. Transcripts are aligned with theshots by determining when each word was spoken throughan automatic alignment process using Sphinx-III speech rec-ognizer.

Each keyframe is segmented into regions, and features areextracted from each region. In this study, fixed sized gridsare used as the regions because of their simplicity to applyon a large volume of data. A feature vector of size 46 isformed to represent each region. Position is represented bythe coordinates (x, y) of the region’s center of gravity inrelation to the image dimensions. Color is redundantly rep-resented using the mean and variance of the HSV and RGBcolor spaces. Texture is represented by using the mean andvariance of 16 filter responses. We used four difference ofGaussian filters with different sigmas and twelve orientedfilters, aligned in 30 degree increments.

The vocabulary consists of only nouns which are extractedby applying Brill’s tagger [6] to the transcript. Due to theerrorful transcripts obtained from the speech recognition,many noisy words remain in the vocabulary.

4. TREC 2001 DATA

The TREC 2001 data is chosen, because it is a truthed andextensively analyzed data set. 2232 images are used fromthis data set. All the nouns (1938 words) are taken as thevocabulary words.

plane(2) space(5),telescope(10) space(1),world(6)

space(6),astronaut(7) plane(2) robot(5)

Figure 2: Example annotation results for TREC2001 data. The images are annotated with the top10 words with the highest prediction probabilitygiven the regions of the image. For each image someof the annotation words and their ranks are shown.

Different from a still image with some annotated keywords,in video, text is not usually associated with a single frame.The words for the surrounding frames can also be associatedwith the current frame. For the experiments on TREC data,the text for the surrounding frames is also considered by set-ting the window size arbitrarily to five (i.e., for each frame,text associated with the five preceding and five followingframes are taken together with the text for that frame).

Each image is divided into 7 x 7 blocks, resulting in 49 re-gions for each image. Then, features are extracted from eachof these blocks and feature space is vector quantized usingk-means where k is set to 500. An initial translation prob-ability table is constructed by taking the co-occurrences of500 blob tokens and 1938 word tokens. The final translationprobability table is obtained by applying EM. In order to an-notate the images, the word posterior probabilities for theimage blobs are summed into a single word posterior, andthe highest probability words are taken as the annotationwords.

In Figure 2, some of the annotation words predicted in thetop ten words with the highest probability for an image areshown. As can be seen, the predicted words are the cor-rect words that describe what is in the image. Figure 3shows the query results for Statue of Liberty using thecurrent Informedia system. In some cases, neither statuenor liberty matches with the frames where the Statue ofLiberty appears, and in some of the frames where the Statuedoesnt appear, the text mentions the name. With the pro-posed approach, all the frames with the Statue of Libertyare annotated with statue and liberty in the top threewords as shown in Figure 4. Also, for the frames that donot include the statue, neither statue nor liberty is pre-dicted. Therefore, when a query is performed on statue ofliberty, with the proposed approach the results will giveonly the correct matches where the Statue appears.

Figure 3: Query results for statue of liberty usingthe current Informedia system. Each shot is repre-sented with a single keyframe and the transcript ex-tracted from the audio is aligned with the shots. Thearrows indicate the shots where the words statue orliberty occur in the corresponding audio/transcript.For example, the first and the second purple arrowsindicate the word statue for the third and fourthimages on the first row respectively, the first red ar-row indicates the word liberty for the first imageon the second row, and so on. Although, the Statueof Liberty appears in the second and fourth imageson the second row, none of the words are matched.However, the third image on the first row, and thefirst image on the second row is matched, althoughthey are not the images of Statue of Liberty. Withthe proposed approach, the results are corrected bypredicting both statue and liberty in the top threewords for all the images where the Statue appears,and by not predicting any of the words for the otherswhere the statue doesnt appear.

statue(1) liberty(2) statue(1) liberty(3)

statue(1) liberty(2) statue(1) liberty(3)

Figure 4: For the shots that Statue of Liberty ap-pears, rank of the statue and liberty words in thepredictions.

5. CHINESE CULTURE DATA

The other data set used for the experiments consists of Chi-nese cultural documentaries [8, 18]. Figure 5 and Figure 6show some examples from this data set. Similar correspon-dence problems occur in this data set. For example, in thestory about the Great Wall, some frames show the views ofthe Wall, while the others show the reporter or the otherrelated people/objects; therefore the word wall may notmatch with the view of the Wall.

For the experiments, the movies that includes interviewswith people are removed due to their dependence on face de-tection/recognition, and the remaining 3785 shots are used.Each shot is represented with a single keyframe which is di-vided into 5 x 5 blocks. Then, features are extracted fromeach of these blocks and feature space is vector quantizedusing k-means where k is set to 1000. Therefore there are1000 blob tokens.

In order to construct the vocabulary, nouns extracted fromBrill’s tagger are used. Figure 7 shows that the frequency ofthe words lies in a large range. The number of words in thevocabulary is 2597, but only about 20% of the words are ina meaningful range. We eliminate the words that occur lessthan 5 times, or more than 250 times. This pruning processreduces the vocabulary to 626 words. Shots that are notassociated with any word after this pruning process are alsoeliminated, resulting in 2785 remaining shots.

Figure 5: Results of a query for great wall on Chi-nese culture data set using the current Informediasystem. The arrows indicate the shots where thequery words occur in the corresponding audio.

For the experiments on the TREC data, the text for thesurrounding shots is also taken by setting a fixed windowsize. In the Chinese data set, the window size is varied.First, we give the results using only the text aligned with asingle shot, then we compare the results for different windowsizes.

Figure 8 shows some frames that predict (a)panda, (b)wall,(c)emperor in the first three words with the highest proba-bility. As can be seen, the right words are predicted for theframes that are related with the word. The word panda is

Figure 6: Results of a query for panda on Chineseculture data set using the current Informedia sys-tem. The arrows indicate the shots where the queryword occurs in the corresponding audio.

0 500 1000 1500 2000 2500 30000

200

400

600

800

1000

1200

1400

words

wor

d fr

eque

ncie

s

Figure 7: Word frequencies for the Chinese culturedata set. Originally there are 2597 words in thevocabulary. However, almost 80% of the vocabularyoccurs less then five times. We eliminate the wordsthat occur less than 5 times or more than 250 timesto have a new vocabulary. The number of wordsremained in the pruned vocabulary is 626.

correctly predicted for different views of panda. Similarly,the word wall is correctly predicted when the Great Wallappears in the image, although the images are not very sim-ilar. The images for emperor are in a larger range, since it ishard to associate this word with a specific object or scene.

In order to evaluate the results on a larger scale 189 imagesare visually investigated for the word panda. The results areshown in Table 1. 127 images are associated with the wordpanda (using the aligned transcript), but only in 42 of theseimages a panda appears. The system predicts panda as thefirst word with the highest probability for 121 images, andin 51 of them a panda appears in the image. In 25 of theseimages the word panda is predicted for the images where apanda appears, even though the word was not associatedwith the frame originally. The results show that we missed11 images where the panda word was associated and a pandaappears in the image, but cannot be predicted as the firstword. For the images where the system cannot predict pandaas the first word, it was usually the second or the third word.

(a)

(b)

(c)

Figure 8: Sample images that predict (a)panda,(b)wall, and (c)emperor in the first three words withthe highest probability.

The predicted words can be used for two different purposes:to have better annotations for the frames on the training setand to annotate the frames on the test set that do not haveassociated text. As explained before, we eliminated 1000images since the words associated with those images werethe less frequent words. These images are used as the testset. For panda, we randomly choose two sets of frames wherethe story is about a panda. Among these, the frames thatare in the test set are analyzed. We look at the images wherepanda is predicted as the first word, and check whether thepanda appears in the image. In the first set 11 images over25 images, and in the second set 2 images over 25 images,predict panda as the first word when the panda appears inthe image.

The scene shown in Figure 6 is further used to analyze theresults for predicting the word panda both for the trainingand test sets. Figure 9 shows the rank of the word pandaas the predicted word for the corresponding frames. Red(dark gray) numbers correspond to the images that are inthe test set (the frames that are initially eliminated), andgreen (light gray) numbers correspond to the images thatare in the training set. The ranking in the training set isalso important, since most of the correct associations aremissing and our goal is finding more reliable annotations forthe frames. As the figure shows, on most of the trainingimages if a panda appears in the image, we predict the word

Table 1: Correspondence results for panda. 189frames are inspected. Associated means that theword is in the transcript that is aligned with theframe. Predicted means that the system predictpanda for the frame as the first word. Correctmeans a panda appears, and incorrect means a pandadoesn’t appear in the frame.

number of associations 127number of correct associations 42number of incorrect associations 85number of predictions 121number of correct predictions 51number of incorrect predictions 70associated but not predicted 64association is correct but not predicted 11association is incorrect not predicted 53not associated but predicted 61not associated but correctly predicted 25not associated and incorrectly predicted 36associated and predicted and correct 26associated and predicted but incorrect 31

panda in the first four words with the highest probability.The predictions for the test images are also satisfactory:some of the images where a panda appears are annotatedwith the word panda, and for the images that don’t showany panda the annotation rank is very low. However, bothfor the training and test images, we see a problem when thewoman appears: since the woman frames highly co-occurwith the word panda, the system learns that the womanframes are also associated with panda and we incorrectlypredict panda with a high score.

Figure 9: The predictions for the panda scene. Thenumbers show the rank of the word panda amongthe predicted words. Red (dark gray) numbers cor-respond to the images that are not in the trainingset, green (light gray) numbers correspond to theimages in the training set.

We investigate the effect of window size, by choosing eitheronly the words from a single shot, or also from the surround-ing shots (window size is set to 1,2 or 3). The average num-ber of words associated with a single shot is 3.7275 whenonly the words that are associated with a single shot areused; 10.1670 when window size is set to 1; 16.4470 whenwindow size is 2; and 22.6725 when window size is 3. Themaximum number of words are 25, 47, 63 and 70 respec-tively.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

wallgreatsochinese

pandaoneemperor

ha

china

flag

hsi

cranenot

there

lan

merchant

mercury

round

square

first

otherwho

very

deer

chcourtyard

year

can

people

sea.

stretche

glorynative

chamber

detail

eternal

letter

bronze

she

yangtze

captive

state

what

qin

tomb

saying

tzu

blue

ding

birthcity

month

manyking

forbiddendoidea

lot

time

wartrade

feet

new

dynastyprogram

travel

elixir

harmony

jade

middle

just

now

army

boundary

evil

thing

imperial

upbamboo

today

northern

stone

shi

power

meter

arrow

roof

roman

watchtower

ancestor

hair

throne

woman

system

keeper

range

dream

animal

area

heremay

writing

why

tangland

mountain

kind

symbol

region

fire

word

individual

giant

ming

eastern

20th

river

forestgenghi

magicbarrierfaith

fighting

text

line

archeologistsuccessorresource

act

burning

carriage

weapon

command

prince

handwalk

century

respect

hawk

pitideal

fitwolong

generation

white

waytreasure

cotta

yellow

piecepalace

workbuild

form

north

life

book

masgod

fragmentinvasion

servant

shape

qing

lengthdecade

feeling

organizationtrip

world

li

dragonspirit

babyearthmilitary

mongol

mind

conflict

monument

tourist

pu

open

beginning

chariot

sword

crown

objectlymodel

legend

ground

mean

gold

shownsleepanswer

number

li−li

member

home

houralmostsoldierface

found

little

part

silk

sense

femaleperiod

zhaotian

nomadimmortality

something

run

villager

culture

tower

thought

hill

lo

court

lady

figure

longschaller

researcher

central

thousandplacesonright

national

far

remain

nightofficer

burial

incrediblesectionheaven

ruler

wonder

size

radio

let

dr.

story

attempt

empire

warrior

history

natural

statue

shang

reign

eunuch

empreswater

standard

excavationkept

official

sort

effort

man

barbarian

control

pagoda

b.c.

pasexampleten

warring

research

wildlife

hundred

million

name

body

country

age

ritual

top

construction

beliefmother

look

traditional

eyelivingdiscovery

secret

rule

question

death

sea

huang

gate

center

probablyunited

site

day

showinside

force

wild

concubinered

humanhabitatoutside

westwesternhan

specybuildingancient

recall

prec

isio

n

Figure 10: Recall vs precision graph : single frame.Recall is defined as the number of correct predictionsover the number of times that the word occurs inthe data, and precision is defined as the number ofcorrect predictions over all predictions.

Figures 10, 11, 12 and 13 show the recall and precision val-ues as a function of words when the associated words arechosen from a single frame, or from a window size 1, 2, and3 respectively. Recall is defined as the number of correctpredictions over the number of times that the word occursin the data, and precision is defined as the number of cor-rect predictions over all predictions. Correct predictions arefound by comparing the first n words that are predicted withthe highest probability (where n is the number of actualwords associated with the frame), with the actual ones. Itis hard to understand the figures clearly, but it is easy to seethat the system predicts more words as the window size in-creases. Also, the figures indicate that for most of the wordsthe recall and/or precision values increase as the number ofwords associated with a frame increases. Table 2 shows therecall and precision values for some selected words.

Table 2: Comparison of recall and precision valuesfor some selected words as a function of window size.

panda wall emperorrec - prec rec - prec rec - prec

single frame 0.718 - 0.204 0.849 - 0.228 0.689 - 0.224wsize = 1 0.818 - 0.243 0.918 - 0.292 0.874 - 0.286wsize = 2 0.886 - 0.256 0.952 - 0.315 0.945 - 0.326wsize = 3 0.915 - 0.260 0.972 - 0.319 0.969 - 0.363

Table 3: Prediction measures when different windowsizes are used to obtain the words associating witha shot.

prediction measuresingle frame 0.1851wsize = 1 0.2469wsize = 2 0.2783wsize = 3 0.2975

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

wallgreatchineseoneemperor

panda

sochinahanottherefirstwhoyear

peoplewhat

verych

zhao

tombcandeerotherscript

round

many

mongol

cranejade

sheanimal

dynastyqin

king

justding

forbiddenkind

feetming

hsi

chamber

up

ideamerchanthair

lady

city

northern

here

statearmynow

rule

imperial

stretche

porcelain

ground

huo

system

bamboo

tang

powercentury

tzu

ideal

giant

remain

waysort

silk

eastern

pig

cotta

dragon

wild

stone

do

alligator

sea.

pagoda

keeperlandlot

bronze

empire

symbolli

place

buildinghandtime

flag

communist

crownmountain

burial

heaven

material

evil

thousandtoday

faith

empres

courtyard

concubine

shi

captive

program

elixir

spirit

square

almost

officer

baby

roman

war

qing

period

marqui

world

habitat

new

edge

native

gravebirth

number

may

ancestor

treasure

individual

man

book

harmonygeneral

partspecyriver

sealook

yellow

han

shang

wildlife

title

object

long

figure

life

mind

lored

country

next

barbarianthrone

build

warrior

mercury

somethingeveryone

dr.

western

earth

sense

site

saying

yungkey

sword

yisleep

spot

schaller

wolong

historyscholargold

genghimagic

tower

tradition

weapon

military

a.d.

model

woman

shapedestruction

death

policy

man.

natural

philosophy

section

west

conflict

palace

b.c.

mile

li−li

writingtrade

archeological

official

nomad

travel

rebel

corner

hu

areahorse

george

workconstructiongate

little

lan

line

lieburning

eternal

heightmother

generation

iron

eye

gift

road

hundred

eunuch

fall

art

north

radio

princemeng

control

law

ten

running

piece

trip

expedition

forest

leave

fenghill

form

human

center

white

righteffort

afterlife

pay

father

heart

god

curse

chi

everything

servant

journey

leopard

range

ancientfound

belief

setwonder

national

archeologist

discovery

huangtouristmanchu

female

waterbody

brick

invasion

relic

practice

heavenly

row

pas

young

position

problem

reason

attempt

immortality

thought

yang

tour

kingdomday

route

tian

beginning

existence

frontlocation

million

ruler

breeding

name

ritual

enemy

monument

sacred

buddha

thing

yangtze

group

secret

far

myth

culture

pride

fire

mystery

carriage

cloud

development

warring

united

huangdinightcourt

age

attack

yin

agricultural

supreme

command

soldier

capital

purpose

master

inside

museum

facetraditional

looking

living

men

glimpsefighting

taoist

zhou

soil

successor

blackhistorical

surrounding

legend

item

letter

characterpit

standkleiman

restoration

reality

barrier

text

territory

glory

step

civil

chime

blade

collar

pound

laborer

space

probably

statue

heavy

favorite

pottery

walk

middle

mouth

decade

fit

feeling

population

bear

force

family

ratherleftsaw

desert

artisan

blue

example

visit

dead

hopedoing

organizationmao

peasant

moderncommon

change

story

amount

accountpu

top

energy

local

understand

sovereign

flow

open

play

meetwine

silver

stop

zoo20th

mean

male

member

hour

future

reserve

scaleruntakingshown

leadercivilization

word

matter

shareanswer

food

alive

big

search

incredible

region

sizeletscientist

troop

picture

standard

culturalrecord

reign

kept

excavation

boundaryresearchfield

head

project

high

why

question

government

researchershow

son

outside

recall

prec

isio

n

Figure 11: Recall vs precision : window size = 1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

onewallchineseemperorgreatsochina

panda

hanot

therefirstwhotombpeopleveryyearcanwhat

deer

othermanychzhaoshe

mongol

justupjadedynasty

chamber

hsiround

qin

ming

state

tzu

powerarmy

animal

porcelain

cityimperialnow

here

king

ideacentury

forbiddensymbol

pagoda

kindway

system

feet

native

rulealligatorwarrior

ding

do

dragon

cotta

time

spirit

scriptnorthern

empres

han

ideal

silk

treasure

bamboo

flag

lady

heaven

western

bronze

almost

mountainland

thousand

lettercountry

new

crane

pig

tang

merchant

shi

giantconcubine

wildlifeyang

empiremayphilosophy

elixir

stretche

manstone

war

building

barbarianroman

writingtoday

evilschaller

tower

crownancient

lot

history

remainplace

captive

areaworldwork

ancestor

partkeepertrade

look

longsite

li

yin

sort

officer

yellow

b.c.

wild

military

specy

qing

square

build

death

burial

birthbabynext

huo

period

supreme

eastern

conflict

thing

yung

mouth

soldierlifeprince

tenwolong

construction

george

night

red

mind

gold

wonder

ground

lan

sun

hair

line

sea.

book

harmony

habitat

throne

official

name

natural

something

communist

thought

carriage

individual

li−lifound

figure

story

human

enemyscholar

confucian

lo

edge

north

fire

breeding

palace

program

beliefhope

piece

dr.west

men

brick

a.d.

shape

yangtze

object

cloud

model

sea

20th

face

dayright

shang

house

mystery

ritual

stand

kingdomriversense

material

picture

buddhahu

little

rulerafterlife

manchu

group

hundred

set

faith

forcetravelmonumentsealrather

route

problem

confucianism

god

relic

journey

record

mile

leader

culture

bodywarring

archeological

womanorganizationrebel

eunuch

corner

answer

soninside

earthgeneral

huangdi

mercury

silver

generation

road

number

hand

marqui

destruction

archeologist

tian

probably

territorydetail

leave

art

gatereign

heavenly

row

techniquesection

burning

highimmortality

field

living

civil

grave

walk

waterfutureattempt

family

feng

meng

everything

sleep

agricultural

nature

sacred

chairman

far

legend

excavationnothing

servant

mao

center

white

person

word

huang

weapon

zhou

flow

command

wine

manmade

stop

radio

secret

understandcontrol

million

female

sign

courtyard

glimpse

demon

taoistmuseum

question

discovery

man.

someone

chime

pit

looking

researcher

young

taking

food

lie

historical

form

sharestandard

effort

moon

surrounding

item

rich

leopard

conservationnational

structure

reason

height

reality

heart

fighting

curse

chi

glory

respectcharacter

mothereat

expedition

gift

nomad

statueeverybody

pavilion

mean

middle

month

father

artisan

hill

troop

law

saying

destroybeginningplay

sword

meet

maletrip

forest

common

project

di

age

feeling

invasion

why

women

yi

tryold

tourist

tradition

eye

rainforest

marco

beijing

title

traditional

government

boundary

kept

villager

pere

zootrack

bigpopulation

roof

hawkhead

development

location

showfallexample

outside

genghi

civilization

search

size

collar

horse

capital

arrow

black

light

range

incredible

region

soil

becoming

culturalbarrierkhansuccessor

pas

purpose

ruling

favorite

fit

director

ironrestoration

boy

basic

mount

fightlaborer

research

wealthtemplecycleeternaldozen

united

republic

decadethreat

spot

modernleft

key

running

position

peasant

westerner

pay

sovereign

blue

top

desert

visit

doing

amount

account

practice

else

weight

bone

society

political

home

reserveopenexistence

shown

scale

main

farmer

lif

front

tour

matterstanding

nation

myth

runscientist

guard

second

central

alive

court

let

colormaster

foot

reach

spacepottery

saw

change

recall

prec

isio

n


0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

one

wall

chineseemperorsogreat

chinanotha

panda

therefirstyearwho

peopleverytombwhatother

canmany

just

upding

forbidden

charmy

zhaoshe

hsi

deer

dynastyjadeanimalqin

ming

citynow

mongolround

imperial

porcelain

power

century

state

cottasystemhere

tzubamboo

kind

time

king

ideawarrior

lidoway

pagoda

symbol

feet

building

chamber

rulemayhan

war

northern

alligator

letter

empressilkscripttreasurebronzethousand

ancestor

today

new

carriage

specy

heaven

worldshi

princemountainempiremilitaryshangdragon

barbarian

tang

lo

western

landgoldphilosophy

history

remain

look

stonewriting

crane

almost

ancient

giantcommunist

huo

concubinefire

weapon

row

man

soldierlifeworkpartfigure

build

yung

lady

areaschaller

b.c.

longlittle

countrythronenextperiod

merchant

yangyellowevil

crown

roman

spirit

face

place

elixir

construction

flag

ideal

tower

keeper

wild

ground

birtheastern

wildlife

body

house

burial

lot

baby

stretche

sea

site

roofsort

piece

confucian

red

mind

trade

archeologistlan

hope

found

linewonder

palace

squarenatural

day

qing

officer

pig

west

tian

gatethingnumber

confucianism

relic

conflict

nomad

name

supreme

problemnight

habitat20th

death

shape

belief

techniqueobject

thoughtearthzhou

captive

officialnative

hundred

storycourtyard

marquiwarring

rightafterlife

westerner

leaderdetail

bird

sense

effortscholar

triproute

sunriver

group

hu

pas

picturewolong

destruction

hair

woman

something

discovery

stopfaithartisan

territoryrulerprobably

why

a.d.

journey

force

troop

attack

boundary

fall

edge

model

form

tourist

high

ritual

reasongeneration

individual

leave

seal

inside

stand

north

book

statue

taoist

fengmen

leopard

li−li

general

hand

human

manchu

sword

standard

female

breeding

mercury

cloud

glimpsehuang

program

huangdi

sacred

mile

everythingmarcoyin

immortality

curse

living

road

george

fieldmotherculturelegend

genghi

di

mysteryarcheologicalset

control

chiten

mouth

man.

government

modern

far

favorite

nothingenemy

buddha

rebel

water

outside

kingdom

meng

agricultural

temple

chimeradio

white

heavenly

food

forest

civil

walk

million

materialyangtzecenter

god

harmony

pavilion

grave

section

insemination

eye

future

mean

mao

family

head

silver

organizationunderstand

secret

reignnature

eunuchtitle

running

moon

lie

attempt

son

person

top

nationalhistorical

population

women

researcherword

beginning

brick

buddhist

try

rather

commonsawmanmade

someone

lengthsea.

surrounding

item

respect

eat

iron

master

record

travel

cultural

fighting

burning

charactersleep

opentaking

tour

myth

main

question

father

corner

dr.

barrier

wineexample

peasant

demon

key

tradition

project

space

horse

rangehilldesert

light

trap

feeling

showzoo

elsegreenkhansuccessor

stepincredible

shown

doing

unitedexcavation

age

gift

front

nation

minister

realityheart

monument

share

big

answer

mount

rich

ling−ling

size

restoration

spotsearch

energy

expedition

beijing

run

civilizationrainforest

alive

resource

everybody

meet

eternal

saying

account

servant

threatthirdlaw

variety

destroy

color

capitallooking

claim

play

leftyoung

old

pay

sign

invasion

flow

pere

better

reach

home

weight

political

structure

kept

meter

existence

hour

sturgeon

hawkchairman

view

watchtowerheight

development

start

magic

everyone

begun

let

revolution

artguard

blade

month

location

matingarrowpridecommand

court

pound

becoming

second

museum

region

soil

fight

policyworker

purpose

pit

attention

fate

boy

heavy

foot

plan

coming

holding

bear

child

ruling

fitglory

basic

bloodpottery

path

chariot

hall

fear

wealth

decademiddledozenposition

sovereign

pu

lydead

changeyi

male

amountblue

visit

peakreserve

battle

villager

one.

standing

traditionalfarmer

lifscale

matter

unique

hard

design

achievement

research

recall

prec

isio

n


The results are also compared using the prediction mea-sure, which is calculated by averaging the number of correctpredictions over the number of words in an image for all theimages. As Table 3 shows, prediction measure is better asthe window size increases.

6. DISCUSSION AND FUTURE WORK

In this study, integration of visual and textual data is pro-posed to solve the correspondence problem between videoframes and associated text. The preliminary results showthat using this approach it is possible to have better anno-tations for the video frames that can be used to improve theperformance of the text based queries.

In the current system relatively simpler and smaller datasets are used. Our goal is to apply a similar approach tobroadcast news which is a harder dataset since there areterabytes of video and it requires focusing on people.

Currently the system uses simple features extracted from thestill images. A better set of features that also include somedetectors (e.g., face detector) and some motion informationis likely to improve the performance. In the experiments,fixed size grids are used. Using a segmenter, better resultscan be obtained [3]. Also, in video the objects are mostly theones that moves. Therefore, it is better to use the temporalinformation to segment the moving objects.

Using a large number of data may allow us to perform sta-tistical analysis. It is an interesting problem to come with adistance metric for choosing the text that best describes theframe. In this study, individual words are taken as the vo-cabulary words. Taking the noun phrases or the compoundwords (e.g. “statue of liberty”) may improve the perfor-mance. Also, some lexical analysis needs to be done to un-derstand which type of words more precisely correspond tovisual features and should be taken as the vocabulary words.

7. REFERENCES[1] Informedia Project. http://informedia.cs.cmu.edu.

[2] K. Barnard, P. Duygulu, N. de Freitas, D. A. Forsyth,D. Blei, and M. Jordan. Matching words and pictures.Journal of Machine Learning Research, 3:1107–1135,2003.

[3] K. Barnard, P. Duygulu, R. Guru, P. Gabbur, andD. A. Forsyth. The effects of segmentation and featurechoice in a translation model of object recognition. InIEEE Conf. on Computer Vision and PatternRecognition, 2003.

[4] K. Barnard and D. A. Forsyth. Exploiting imagesemantics for picture libraries. In The FirstACM/IEEE-CS Joint Conference on Digital Libraries,page 469, 2001.

[5] K. Barnard and D. A. Forsyth. Learning the semanticsof words and pictures. In Int. Conf. on ComputerVision, pages 408–15, 2001.

[6] E. Brill. A simple rule-based part of speech tagger. InIn Proceedings of the Third Conference on AppliedNatural Language Processing, 1992.

[7] P. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. L.Mercer. The mathematics of statistical machinetranslation: Parameter estimation. ComputationalLinguistics, 19(2):263–311, 1993.

[8] C. Chen. First Emperor of China, Voyager CD-ROM.http://www.voyagerco.com/cdrom/, 1994.

[9] A. P. Dempster, N. M. Laird, and D. Rubin.Maximum likelihood from incomplete data via the emalgorithm. Journal of the Royal Statistical Society,Series B, 1(39):1–38, 1977.

[10] P. Duygulu, K. Barnard, N. Freitas, and D. A.Forsyth. Object recognition as machine translation:learning a lexicon for a fixed image vocabulary. InSeventh European Conference on Computer Vision(ECCV), volume 4, pages 97–112, 2002.

[11] P. Duygulu-Sahin. Translating Images to words: Anovel approach for object recognition. PhD thesis,Middle East Technical University, Turkey, 2003.

[12] C. Jelmini and S. Marchand-Maillet. Deva: anextensible ontology-based annotation model for visualdocument collections. In Proceedings of SPIEPhotonics West, Electronic Imaging 2002, InternetImaging IV, Santa Clara, CA, USA, 2003.

[13] J. Li and J. Z. Wang. Automatic linguistic indexing ofpictures by a statistical modeling approach. IEEETrans. on Pattern Analysis and Machine Intelligence,25(10):14, 2003.

[14] M. Markkula and E. Sormunen. End-user searchingchallenges indexing practices in the digital newspaperphoto archive. Information retrieval, 1:259–285, 2000.

[15] O. Maron. Learning from Ambiguity. PhD thesis,MIT, 1998.

[16] I. D. Melamed. Empirical Methods for ExploitingParallel Texts. MIT Press, 2001.

[17] Y. Mori, H. Takahashi, and R. Oka. Image-to-wordtransformation based on dividing and vectorquantizing images with words. In First InternationalWorkshop on Multimedia Intelligent Storage andRetrieval Management, 1999.

[18] H. Wactlar and C. Chen. Enhanced perspectives forhistorical and cultural documentaries using informediatechnologies. Joint Conference on Digital Libraries(JCDL’02), Portland, Oregon, July 13-17, 1994.

associating video frames with texthdw/acm-sigir-2003.pdftext and video frames, a query based on text...

Documents