associating video frames with texthdw/acm-sigir-2003.pdftext and video frames, a query based on text...

7
Associating video frames with text Pınar Duygulu and Howard D. Wactlar Informedia Project School of Computer Science Carnegie Mellon University Pittsburgh, PA, USA [email protected], [email protected] ABSTRACT In this study, integration of visual and textual data is pro- posed to solve the correspondence problem between video frames and associated text in order to annotate video frames with more reliable labels and descriptions. Visual features extracted from video frames are linked to text that is ob- tained from the audio transcripts using joint statistics. The results show that using this approach it is possible to have better annotations for video frames that can be later used to improve the performance of text based queries. The pro- posed approach will be integrated into the Informedia Digi- tal Video Library Project at Carnegie Mellon University. 1. INTRODUCTION Video is a rich source of data where visual and textual in- formation occur together. A system that combines these two types of data is more powerful than a system address- ing only one of them, since text and visual features may be ambiguous if treated in isolation. In order to make bet- ter use of the available data, visual features (such as color, texture, shape, and motion features) and semantics derived from text can be integrated. The text provides high-level se- mantics for a video sequence that cannot be obtained using current computer vision techniques, and visual properties supply rich information that can be difficult to express in a textual format. When text and visual properties are integrated, several ap- plications become possible including better search and brows- ing. There are many collections where images/videos are annotated with some descriptive text (e.g., Corel data set, some museum collections, news photographs on the web with captions, etc.). Manual annotation of these collections is subjective and requires a huge amount of effort [14]. Some approaches are proposed to make this process easier and automatic [2, 4, 5, 12, 13]. Figure 1: An example query result from the Infor- media system [1], where the query word is president. A story is segmented into shots where each shot is represented with a keyframe. Transcripts that are extracted from the audio narrative are aligned with the shots. The arrows represent the shots where the query word occurs in the transcript aligned with the shot. There are correspondence problems be- tween the frames and text. The word president is mentioned when the anchor and/or reporter were talking; but the president appears when his name is not mentioned. If a single frame is required as the result of the query, the system will produce the an- chorperson/reporter frames which are not desirable in many cases. It is better to produce a frame where the president appears. In these manually annotated image collections, although it is known that the annotation words are associated with the image, the correspondence between the words and the image regions are unknown. Some methods are proposed to find the correspondences by modeling the joint statistics of words and image regions [2, 10, 11, 15, 17]. Similar correspondence problems occur in video data. There are sets of video frames and transcripts extracted from the audio speech narrative, but the semantic correspondences between them are not fixed because they may not be co- occurring in time. If there is no direct association between

Upload: others

Post on 31-Jan-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

  • Associating video frames with text

    Pınar Duygulu and Howard D. WactlarInformedia Project

    School of Computer ScienceCarnegie Mellon University

    Pittsburgh, PA, USA

    [email protected], [email protected]

    ABSTRACT

    In this study, integration of visual and textual data is pro-posed to solve the correspondence problem between videoframes and associated text in order to annotate video frameswith more reliable labels and descriptions. Visual featuresextracted from video frames are linked to text that is ob-tained from the audio transcripts using joint statistics. Theresults show that using this approach it is possible to havebetter annotations for video frames that can be later usedto improve the performance of text based queries. The pro-posed approach will be integrated into the Informedia Digi-tal Video Library Project at Carnegie Mellon University.

    1. INTRODUCTION

    Video is a rich source of data where visual and textual in-formation occur together. A system that combines thesetwo types of data is more powerful than a system address-ing only one of them, since text and visual features maybe ambiguous if treated in isolation. In order to make bet-ter use of the available data, visual features (such as color,texture, shape, and motion features) and semantics derivedfrom text can be integrated. The text provides high-level se-mantics for a video sequence that cannot be obtained usingcurrent computer vision techniques, and visual propertiessupply rich information that can be difficult to express in atextual format.

    When text and visual properties are integrated, several ap-plications become possible including better search and brows-ing. There are many collections where images/videos areannotated with some descriptive text (e.g., Corel data set,some museum collections, news photographs on the web withcaptions, etc.). Manual annotation of these collections issubjective and requires a huge amount of effort [14]. Someapproaches are proposed to make this process easier andautomatic [2, 4, 5, 12, 13].

    Figure 1: An example query result from the Infor-media system [1], where the query word is president.A story is segmented into shots where each shot isrepresented with a keyframe. Transcripts that areextracted from the audio narrative are aligned withthe shots. The arrows represent the shots where thequery word occurs in the transcript aligned withthe shot. There are correspondence problems be-tween the frames and text. The word president ismentioned when the anchor and/or reporter weretalking; but the president appears when his name isnot mentioned. If a single frame is required as theresult of the query, the system will produce the an-chorperson/reporter frames which are not desirablein many cases. It is better to produce a frame wherethe president appears.

    In these manually annotated image collections, although itis known that the annotation words are associated with theimage, the correspondence between the words and the imageregions are unknown. Some methods are proposed to findthe correspondences by modeling the joint statistics of wordsand image regions [2, 10, 11, 15, 17].

    Similar correspondence problems occur in video data. Thereare sets of video frames and transcripts extracted from theaudio speech narrative, but the semantic correspondencesbetween them are not fixed because they may not be co-occurring in time. If there is no direct association between

    caeText Box26th Annual International ACM SIGIR Conference, Toronto, Canada, July 28-August 1, 2003.

  • text and video frames, a query based on text may produceincorrect visual results.

    For example, in most news videos (see Figure 1) the an-chorperson talks about an event, place or person, but theimages relating to the event, place, or person appear laterin the video. Therefore, a query based only on text relatedto a person, place, or event, and showing the frames at thematching narrative, will yield incorrect frames of the an-chorperson as the result.

    The goal of this study is to determine the correspondencesbetween the video frames and associated text in order toannotate the video frames with more reliable labels and de-scriptions. This enables a textual query to return more ac-curate semantically corresponding images, and enables animage-based query or response to provide more meaningfuldescriptors.

    In Section 2, our approach to solve the correspondence prob-lem between image regions and text will be explained. Thestrategy to apply a similar approach on video data will bedescribed in Section 3. In Sections 4 and 5, the procedureand the experimental results will be presented for two differ-ent data sets : TREC 2001 and Chinese cultural data setsrespectively. Section 6 will discuss the results and presentfuture directions.

    2. MULTIMEDIA TRANSLATION

    The problem of finding the correspondences can be consid-ered as the translation of visual features to words, similar tothe translation of text from one language to another. In thatsense, there is an analogy between learning a lexicon for ma-chine translation and learning a correspondence model forassociating words with image regions.

    Learning a lexicon from data is a standard problem in ma-chine translation literature [7, 16]. Typically, lexicons arelearned from a type of data set known as an aligned corpora.Assuming an unknown one-to-one correspondence betweenwords, coming up with a joint probability distribution link-ing words in two languages is a missing data problem [7]and can be dealt by application of the EM (ExpectationMaximization) algorithm [9].

    Data sets consisting of annotated images are similar to alignedcorpora. There is a set of images, each consisting of a num-ber of regions and a set of associated words. Each imageis segmented into regions and from each region a set of fea-tures, including color, texture, shape, position and size, areextracted. In order to exploit the analogy with machinetranslation, we vector-quantize the set of features represent-ing an image region using k-means. Each region then gets asingle label (blob token).

    The problem is then to construct a probability table thatlinks the blob tokens with word tokens. The probability ta-ble is initialized to the co-occurrences of blobs and words.The final translation probability table is constructed usingthe EM algorithm which iterates between the two steps:

    (i) use an estimate of the probability table to predict corre-spondences;(ii) then use the correspondences to refine the estimate ofthe probability table.

    Once learned, the probability table is used to predict wordscorresponding to particular image regions (region naming),or words associated with whole images (auto-annotation)(see [10, 11] for the details).

    3. CORRESPONDENCES ON VIDEO

    The Informedia Digital Video Library Project at CarnegieMellon University [1] combines speech, image and naturallanguage understanding to automatically transcribe, seg-ment and index video for intelligent search and image re-trieval. The current library consists of terabytes of data(broadcast news captured over the last years, documen-taries produced for public television and government agen-cies, classroom lectures, and other video genres) with auto-matically extracted metadata and indices.

    Broadcast news is very challenging data, but due to its na-ture it is mostly based on people and requires a focus onperson detection/recognition. In this study, we use subsetsof Chinese culture and TREC 2001 data sets which are rel-atively simpler.

    The data consists of video frames and the associated tran-script extracted from the audio. The frames and transcriptsare associated on the shot-basis. Each shot is representedwith a single keyframe. Transcripts are aligned with theshots by determining when each word was spoken throughan automatic alignment process using Sphinx-III speech rec-ognizer.

    Each keyframe is segmented into regions, and features areextracted from each region. In this study, fixed sized gridsare used as the regions because of their simplicity to applyon a large volume of data. A feature vector of size 46 isformed to represent each region. Position is represented bythe coordinates (x, y) of the region’s center of gravity inrelation to the image dimensions. Color is redundantly rep-resented using the mean and variance of the HSV and RGBcolor spaces. Texture is represented by using the mean andvariance of 16 filter responses. We used four difference ofGaussian filters with different sigmas and twelve orientedfilters, aligned in 30 degree increments.

    The vocabulary consists of only nouns which are extractedby applying Brill’s tagger [6] to the transcript. Due to theerrorful transcripts obtained from the speech recognition,many noisy words remain in the vocabulary.

    4. TREC 2001 DATA

    The TREC 2001 data is chosen, because it is a truthed andextensively analyzed data set. 2232 images are used fromthis data set. All the nouns (1938 words) are taken as thevocabulary words.

  • plane(2) space(5),telescope(10) space(1),world(6)

    space(6),astronaut(7) plane(2) robot(5)

    Figure 2: Example annotation results for TREC2001 data. The images are annotated with the top10 words with the highest prediction probabilitygiven the regions of the image. For each image someof the annotation words and their ranks are shown.

    Different from a still image with some annotated keywords,in video, text is not usually associated with a single frame.The words for the surrounding frames can also be associatedwith the current frame. For the experiments on TREC data,the text for the surrounding frames is also considered by set-ting the window size arbitrarily to five (i.e., for each frame,text associated with the five preceding and five followingframes are taken together with the text for that frame).

    Each image is divided into 7 x 7 blocks, resulting in 49 re-gions for each image. Then, features are extracted from eachof these blocks and feature space is vector quantized usingk-means where k is set to 500. An initial translation prob-ability table is constructed by taking the co-occurrences of500 blob tokens and 1938 word tokens. The final translationprobability table is obtained by applying EM. In order to an-notate the images, the word posterior probabilities for theimage blobs are summed into a single word posterior, andthe highest probability words are taken as the annotationwords.

    In Figure 2, some of the annotation words predicted in thetop ten words with the highest probability for an image areshown. As can be seen, the predicted words are the cor-rect words that describe what is in the image. Figure 3shows the query results for Statue of Liberty using thecurrent Informedia system. In some cases, neither statuenor liberty matches with the frames where the Statue ofLiberty appears, and in some of the frames where the Statuedoesnt appear, the text mentions the name. With the pro-posed approach, all the frames with the Statue of Libertyare annotated with statue and liberty in the top threewords as shown in Figure 4. Also, for the frames that donot include the statue, neither statue nor liberty is pre-dicted. Therefore, when a query is performed on statue ofliberty, with the proposed approach the results will giveonly the correct matches where the Statue appears.

    Figure 3: Query results for statue of liberty usingthe current Informedia system. Each shot is repre-sented with a single keyframe and the transcript ex-tracted from the audio is aligned with the shots. Thearrows indicate the shots where the words statue orliberty occur in the corresponding audio/transcript.For example, the first and the second purple arrowsindicate the word statue for the third and fourthimages on the first row respectively, the first red ar-row indicates the word liberty for the first imageon the second row, and so on. Although, the Statueof Liberty appears in the second and fourth imageson the second row, none of the words are matched.However, the third image on the first row, and thefirst image on the second row is matched, althoughthey are not the images of Statue of Liberty. Withthe proposed approach, the results are corrected bypredicting both statue and liberty in the top threewords for all the images where the Statue appears,and by not predicting any of the words for the otherswhere the statue doesnt appear.

    statue(1) liberty(2) statue(1) liberty(3)

    statue(1) liberty(2) statue(1) liberty(3)

    Figure 4: For the shots that Statue of Liberty ap-pears, rank of the statue and liberty words in thepredictions.

  • 5. CHINESE CULTURE DATA

    The other data set used for the experiments consists of Chi-nese cultural documentaries [8, 18]. Figure 5 and Figure 6show some examples from this data set. Similar correspon-dence problems occur in this data set. For example, in thestory about the Great Wall, some frames show the views ofthe Wall, while the others show the reporter or the otherrelated people/objects; therefore the word wall may notmatch with the view of the Wall.

    For the experiments, the movies that includes interviewswith people are removed due to their dependence on face de-tection/recognition, and the remaining 3785 shots are used.Each shot is represented with a single keyframe which is di-vided into 5 x 5 blocks. Then, features are extracted fromeach of these blocks and feature space is vector quantizedusing k-means where k is set to 1000. Therefore there are1000 blob tokens.

    In order to construct the vocabulary, nouns extracted fromBrill’s tagger are used. Figure 7 shows that the frequency ofthe words lies in a large range. The number of words in thevocabulary is 2597, but only about 20% of the words are ina meaningful range. We eliminate the words that occur lessthan 5 times, or more than 250 times. This pruning processreduces the vocabulary to 626 words. Shots that are notassociated with any word after this pruning process are alsoeliminated, resulting in 2785 remaining shots.

    Figure 5: Results of a query for great wall on Chi-nese culture data set using the current Informediasystem. The arrows indicate the shots where thequery words occur in the corresponding audio.

    For the experiments on the TREC data, the text for thesurrounding shots is also taken by setting a fixed windowsize. In the Chinese data set, the window size is varied.First, we give the results using only the text aligned with asingle shot, then we compare the results for different windowsizes.

    Figure 8 shows some frames that predict (a)panda, (b)wall,(c)emperor in the first three words with the highest proba-bility. As can be seen, the right words are predicted for theframes that are related with the word. The word panda is

    Figure 6: Results of a query for panda on Chineseculture data set using the current Informedia sys-tem. The arrows indicate the shots where the queryword occurs in the corresponding audio.

    0 500 1000 1500 2000 2500 30000

    200

    400

    600

    800

    1000

    1200

    1400

    words

    wor

    d fr

    eque

    ncie

    s

    Figure 7: Word frequencies for the Chinese culturedata set. Originally there are 2597 words in thevocabulary. However, almost 80% of the vocabularyoccurs less then five times. We eliminate the wordsthat occur less than 5 times or more than 250 timesto have a new vocabulary. The number of wordsremained in the pruned vocabulary is 626.

    correctly predicted for different views of panda. Similarly,the word wall is correctly predicted when the Great Wallappears in the image, although the images are not very sim-ilar. The images for emperor are in a larger range, since it ishard to associate this word with a specific object or scene.

    In order to evaluate the results on a larger scale 189 imagesare visually investigated for the word panda. The results areshown in Table 1. 127 images are associated with the wordpanda (using the aligned transcript), but only in 42 of theseimages a panda appears. The system predicts panda as thefirst word with the highest probability for 121 images, andin 51 of them a panda appears in the image. In 25 of theseimages the word panda is predicted for the images where apanda appears, even though the word was not associatedwith the frame originally. The results show that we missed11 images where the panda word was associated and a pandaappears in the image, but cannot be predicted as the firstword. For the images where the system cannot predict pandaas the first word, it was usually the second or the third word.

  • (a)

    (b)

    (c)

    Figure 8: Sample images that predict (a)panda,(b)wall, and (c)emperor in the first three words withthe highest probability.

    The predicted words can be used for two different purposes:to have better annotations for the frames on the training setand to annotate the frames on the test set that do not haveassociated text. As explained before, we eliminated 1000images since the words associated with those images werethe less frequent words. These images are used as the testset. For panda, we randomly choose two sets of frames wherethe story is about a panda. Among these, the frames thatare in the test set are analyzed. We look at the images wherepanda is predicted as the first word, and check whether thepanda appears in the image. In the first set 11 images over25 images, and in the second set 2 images over 25 images,predict panda as the first word when the panda appears inthe image.

    The scene shown in Figure 6 is further used to analyze theresults for predicting the word panda both for the trainingand test sets. Figure 9 shows the rank of the word pandaas the predicted word for the corresponding frames. Red(dark gray) numbers correspond to the images that are inthe test set (the frames that are initially eliminated), andgreen (light gray) numbers correspond to the images thatare in the training set. The ranking in the training set isalso important, since most of the correct associations aremissing and our goal is finding more reliable annotations forthe frames. As the figure shows, on most of the trainingimages if a panda appears in the image, we predict the word

    Table 1: Correspondence results for panda. 189frames are inspected. Associated means that theword is in the transcript that is aligned with theframe. Predicted means that the system predictpanda for the frame as the first word. Correctmeans a panda appears, and incorrect means a pandadoesn’t appear in the frame.

    number of associations 127number of correct associations 42number of incorrect associations 85number of predictions 121number of correct predictions 51number of incorrect predictions 70associated but not predicted 64association is correct but not predicted 11association is incorrect not predicted 53not associated but predicted 61not associated but correctly predicted 25not associated and incorrectly predicted 36associated and predicted and correct 26associated and predicted but incorrect 31

    panda in the first four words with the highest probability.The predictions for the test images are also satisfactory:some of the images where a panda appears are annotatedwith the word panda, and for the images that don’t showany panda the annotation rank is very low. However, bothfor the training and test images, we see a problem when thewoman appears: since the woman frames highly co-occurwith the word panda, the system learns that the womanframes are also associated with panda and we incorrectlypredict panda with a high score.

    Figure 9: The predictions for the panda scene. Thenumbers show the rank of the word panda amongthe predicted words. Red (dark gray) numbers cor-respond to the images that are not in the trainingset, green (light gray) numbers correspond to theimages in the training set.

    We investigate the effect of window size, by choosing eitheronly the words from a single shot, or also from the surround-ing shots (window size is set to 1,2 or 3). The average num-ber of words associated with a single shot is 3.7275 whenonly the words that are associated with a single shot areused; 10.1670 when window size is set to 1; 16.4470 whenwindow size is 2; and 22.6725 when window size is 3. Themaximum number of words are 25, 47, 63 and 70 respec-tively.

  • 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    wallgreatsochinese

    pandaoneemperor

    ha

    china

    flag

    hsi

    cranenot

    there

    lan

    merchant

    mercury

    round

    square

    first

    otherwho

    very

    deer

    chcourtyard

    year

    can

    people

    sea.

    stretche

    glorynative

    chamber

    detail

    eternal

    letter

    bronze

    she

    yangtze

    captive

    state

    what

    qin

    tomb

    saying

    tzu

    blue

    ding

    birthcity

    month

    manyking

    forbiddendoidea

    lot

    time

    wartrade

    feet

    new

    dynastyprogram

    travel

    elixir

    harmony

    jade

    middle

    just

    now

    army

    boundary

    evil

    thing

    imperial

    upbamboo

    today

    northern

    stone

    shi

    power

    meter

    arrow

    roof

    roman

    watchtower

    ancestor

    hair

    throne

    woman

    system

    keeper

    range

    dream

    animal

    area

    heremay

    writing

    why

    tangland

    mountain

    kind

    symbol

    region

    fire

    word

    individual

    giant

    ming

    eastern

    20th

    river

    forestgenghi

    magicbarrierfaith

    fighting

    text

    line

    archeologistsuccessorresource

    act

    burning

    carriage

    weapon

    command

    prince

    handwalk

    century

    respect

    hawk

    pitideal

    fitwolong

    generation

    white

    waytreasure

    cotta

    yellow

    piecepalace

    workbuild

    form

    north

    life

    book

    masgod

    fragmentinvasion

    servant

    shape

    qing

    lengthdecade

    feeling

    organizationtrip

    world

    li

    dragonspirit

    babyearthmilitary

    mongol

    mind

    conflict

    monument

    tourist

    pu

    open

    beginning

    chariot

    sword

    crown

    objectlymodel

    legend

    ground

    mean

    gold

    shownsleepanswer

    number

    li−li

    member

    home

    houralmostsoldierface

    found

    little

    part

    silk

    sense

    femaleperiod

    zhaotian

    nomadimmortality

    something

    run

    villager

    culture

    tower

    thought

    hill

    lo

    court

    lady

    figure

    longschaller

    researcher

    central

    thousandplacesonright

    national

    far

    remain

    nightofficer

    burial

    incrediblesectionheaven

    ruler

    wonder

    size

    radio

    let

    dr.

    story

    attempt

    empire

    warrior

    history

    natural

    statue

    shang

    reign

    eunuch

    empreswater

    standard

    excavationkept

    official

    sort

    effort

    man

    barbarian

    control

    pagoda

    b.c.

    pasexampleten

    warring

    research

    wildlife

    hundred

    million

    name

    body

    country

    age

    ritual

    top

    construction

    beliefmother

    look

    traditional

    eyelivingdiscovery

    secret

    rule

    question

    death

    sea

    huang

    gate

    center

    probablyunited

    site

    day

    showinside

    force

    wild

    concubinered

    humanhabitatoutside

    westwesternhan

    specybuildingancient

    recall

    prec

    isio

    n

    Figure 10: Recall vs precision graph : single frame.Recall is defined as the number of correct predictionsover the number of times that the word occurs inthe data, and precision is defined as the number ofcorrect predictions over all predictions.

    Figures 10, 11, 12 and 13 show the recall and precision val-ues as a function of words when the associated words arechosen from a single frame, or from a window size 1, 2, and3 respectively. Recall is defined as the number of correctpredictions over the number of times that the word occursin the data, and precision is defined as the number of cor-rect predictions over all predictions. Correct predictions arefound by comparing the first n words that are predicted withthe highest probability (where n is the number of actualwords associated with the frame), with the actual ones. Itis hard to understand the figures clearly, but it is easy to seethat the system predicts more words as the window size in-creases. Also, the figures indicate that for most of the wordsthe recall and/or precision values increase as the number ofwords associated with a frame increases. Table 2 shows therecall and precision values for some selected words.

    Table 2: Comparison of recall and precision valuesfor some selected words as a function of window size.

    panda wall emperorrec - prec rec - prec rec - prec

    single frame 0.718 - 0.204 0.849 - 0.228 0.689 - 0.224wsize = 1 0.818 - 0.243 0.918 - 0.292 0.874 - 0.286wsize = 2 0.886 - 0.256 0.952 - 0.315 0.945 - 0.326wsize = 3 0.915 - 0.260 0.972 - 0.319 0.969 - 0.363

    Table 3: Prediction measures when different windowsizes are used to obtain the words associating witha shot.

    prediction measuresingle frame 0.1851wsize = 1 0.2469wsize = 2 0.2783wsize = 3 0.2975

    0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    wallgreatchineseoneemperor

    panda

    sochinahanottherefirstwhoyear

    peoplewhat

    verych

    zhao

    tombcandeerotherscript

    round

    many

    mongol

    cranejade

    sheanimal

    dynastyqin

    king

    justding

    forbiddenkind

    feetming

    hsi

    chamber

    up

    ideamerchanthair

    lady

    city

    northern

    here

    statearmynow

    rule

    imperial

    stretche

    porcelain

    ground

    huo

    system

    bamboo

    tang

    powercentury

    tzu

    ideal

    giant

    remain

    waysort

    silk

    eastern

    pig

    cotta

    dragon

    wild

    stone

    do

    alligator

    sea.

    pagoda

    keeperlandlot

    bronze

    empire

    symbolli

    place

    buildinghandtime

    flag

    communist

    crownmountain

    burial

    heaven

    material

    evil

    thousandtoday

    faith

    empres

    courtyard

    concubine

    shi

    captive

    program

    elixir

    spirit

    square

    almost

    officer

    baby

    roman

    war

    qing

    period

    marqui

    world

    habitat

    new

    edge

    native

    gravebirth

    number

    may

    ancestor

    treasure

    individual

    man

    book

    harmonygeneral

    partspecyriver

    sealook

    yellow

    han

    shang

    wildlife

    title

    object

    long

    figure

    life

    mind

    lored

    country

    next

    barbarianthrone

    build

    warrior

    mercury

    somethingeveryone

    dr.

    western

    earth

    sense

    site

    saying

    yungkey

    sword

    yisleep

    spot

    schaller

    wolong

    historyscholargold

    genghimagic

    tower

    tradition

    weapon

    military

    a.d.

    model

    woman

    shapedestruction

    death

    policy

    man.

    natural

    philosophy

    section

    west

    conflict

    palace

    b.c.

    mile

    li−li

    writingtrade

    archeological

    official

    nomad

    travel

    rebel

    corner

    hu

    areahorse

    george

    workconstructiongate

    little

    lan

    line

    lieburning

    eternal

    heightmother

    generation

    iron

    eye

    gift

    road

    hundred

    eunuch

    fall

    art

    north

    radio

    princemeng

    control

    law

    ten

    running

    piece

    trip

    expedition

    forest

    leave

    fenghill

    form

    human

    center

    white

    righteffort

    afterlife

    pay

    father

    heart

    god

    curse

    chi

    everything

    servant

    journey

    leopard

    range

    ancientfound

    belief

    setwonder

    national

    archeologist

    discovery

    huangtouristmanchu

    female

    waterbody

    brick

    invasion

    relic

    practice

    heavenly

    row

    pas

    young

    position

    problem

    reason

    attempt

    immortality

    thought

    yang

    tour

    kingdomday

    route

    tian

    beginning

    existence

    frontlocation

    million

    ruler

    breeding

    name

    ritual

    enemy

    monument

    sacred

    buddha

    thing

    yangtze

    group

    secret

    far

    myth

    culture

    pride

    fire

    mystery

    carriage

    cloud

    development

    warring

    united

    huangdinightcourt

    age

    attack

    yin

    agricultural

    supreme

    command

    soldier

    capital

    purpose

    master

    inside

    museum

    facetraditional

    looking

    living

    men

    glimpsefighting

    taoist

    zhou

    soil

    successor

    blackhistorical

    surrounding

    legend

    item

    letter

    characterpit

    standkleiman

    restoration

    reality

    barrier

    text

    territory

    glory

    step

    civil

    chime

    blade

    collar

    pound

    laborer

    space

    probably

    statue

    heavy

    favorite

    pottery

    walk

    middle

    mouth

    decade

    fit

    feeling

    population

    bear

    force

    family

    ratherleftsaw

    desert

    artisan

    blue

    example

    visit

    dead

    hopedoing

    organizationmao

    peasant

    moderncommon

    change

    story

    amount

    accountpu

    top

    energy

    local

    understand

    sovereign

    flow

    open

    play

    meetwine

    silver

    stop

    zoo20th

    mean

    male

    member

    hour

    future

    reserve

    scaleruntakingshown

    leadercivilization

    word

    matter

    shareanswer

    food

    alive

    big

    search

    incredible

    region

    sizeletscientist

    troop

    picture

    standard

    culturalrecord

    reign

    kept

    excavation

    boundaryresearchfield

    head

    project

    high

    why

    question

    government

    researchershow

    son

    outside

    recall

    prec

    isio

    n

    Figure 11: Recall vs precision : window size = 1

    0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    onewallchineseemperorgreatsochina

    panda

    hanot

    therefirstwhotombpeopleveryyearcanwhat

    deer

    othermanychzhaoshe

    mongol

    justupjadedynasty

    chamber

    hsiround

    qin

    ming

    state

    tzu

    powerarmy

    animal

    porcelain

    cityimperialnow

    here

    king

    ideacentury

    forbiddensymbol

    pagoda

    kindway

    system

    feet

    native

    rulealligatorwarrior

    ding

    do

    dragon

    cotta

    time

    spirit

    scriptnorthern

    empres

    han

    ideal

    silk

    treasure

    bamboo

    flag

    lady

    heaven

    western

    bronze

    almost

    mountainland

    thousand

    lettercountry

    new

    crane

    pig

    tang

    merchant

    shi

    giantconcubine

    wildlifeyang

    empiremayphilosophy

    elixir

    stretche

    manstone

    war

    building

    barbarianroman

    writingtoday

    evilschaller

    tower

    crownancient

    lot

    history

    remainplace

    captive

    areaworldwork

    ancestor

    partkeepertrade

    look

    longsite

    li

    yin

    sort

    officer

    yellow

    b.c.

    wild

    military

    specy

    qing

    square

    build

    death

    burial

    birthbabynext

    huo

    period

    supreme

    eastern

    conflict

    thing

    yung

    mouth

    soldierlifeprince

    tenwolong

    construction

    george

    night

    red

    mind

    gold

    wonder

    ground

    lan

    sun

    hair

    line

    sea.

    book

    harmony

    habitat

    throne

    official

    name

    natural

    something

    communist

    thought

    carriage

    individual

    li−lifound

    figure

    story

    human

    enemyscholar

    confucian

    lo

    edge

    north

    fire

    breeding

    palace

    program

    beliefhope

    piece

    dr.west

    men

    brick

    a.d.

    shape

    yangtze

    object

    cloud

    model

    sea

    20th

    face

    dayright

    shang

    house

    mystery

    ritual

    stand

    kingdomriversense

    material

    picture

    buddhahu

    little

    rulerafterlife

    manchu

    group

    hundred

    set

    faith

    forcetravelmonumentsealrather

    route

    problem

    confucianism

    god

    relic

    journey

    record

    mile

    leader

    culture

    bodywarring

    archeological

    womanorganizationrebel

    eunuch

    corner

    answer

    soninside

    earthgeneral

    huangdi

    mercury

    silver

    generation

    road

    number

    hand

    marqui

    destruction

    archeologist

    tian

    probably

    territorydetail

    leave

    art

    gatereign

    heavenly

    row

    techniquesection

    burning

    highimmortality

    field

    living

    civil

    grave

    walk

    waterfutureattempt

    family

    feng

    meng

    everything

    sleep

    agricultural

    nature

    sacred

    chairman

    far

    legend

    excavationnothing

    servant

    mao

    center

    white

    person

    word

    huang

    weapon

    zhou

    flow

    command

    wine

    manmade

    stop

    radio

    secret

    understandcontrol

    million

    female

    sign

    courtyard

    glimpse

    demon

    taoistmuseum

    question

    discovery

    man.

    someone

    chime

    pit

    looking

    researcher

    young

    taking

    food

    lie

    historical

    form

    sharestandard

    effort

    moon

    surrounding

    item

    rich

    leopard

    conservationnational

    structure

    reason

    height

    reality

    heart

    fighting

    curse

    chi

    glory

    respectcharacter

    mothereat

    expedition

    gift

    nomad

    statueeverybody

    pavilion

    mean

    middle

    month

    father

    artisan

    hill

    troop

    law

    saying

    destroybeginningplay

    sword

    meet

    maletrip

    forest

    common

    project

    di

    age

    feeling

    invasion

    why

    women

    yi

    tryold

    tourist

    tradition

    eye

    rainforest

    marco

    beijing

    title

    traditional

    government

    boundary

    kept

    villager

    pere

    zootrack

    bigpopulation

    roof

    hawkhead

    development

    location

    showfallexample

    outside

    genghi

    civilization

    search

    size

    collar

    horse

    capital

    arrow

    black

    light

    range

    incredible

    region

    soil

    becoming

    culturalbarrierkhansuccessor

    pas

    purpose

    ruling

    favorite

    fit

    director

    ironrestoration

    boy

    basic

    mount

    fightlaborer

    research

    wealthtemplecycleeternaldozen

    united

    republic

    decadethreat

    spot

    modernleft

    key

    running

    position

    peasant

    westerner

    pay

    sovereign

    blue

    top

    desert

    visit

    doing

    amount

    account

    practice

    else

    weight

    bone

    society

    political

    home

    reserveopenexistence

    shown

    scale

    main

    farmer

    lif

    front

    tour

    matterstanding

    nation

    myth

    runscientist

    guard

    second

    central

    alive

    court

    let

    colormaster

    foot

    reach

    spacepottery

    saw

    change

    recall

    prec

    isio

    n

    Figure 12: Recall vs precision : window size = 2

    0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    one

    wall

    chineseemperorsogreat

    chinanotha

    panda

    therefirstyearwho

    peopleverytombwhatother

    canmany

    just

    upding

    forbidden

    charmy

    zhaoshe

    hsi

    deer

    dynastyjadeanimalqin

    ming

    citynow

    mongolround

    imperial

    porcelain

    power

    century

    state

    cottasystemhere

    tzubamboo

    kind

    time

    king

    ideawarrior

    lidoway

    pagoda

    symbol

    feet

    building

    chamber

    rulemayhan

    war

    northern

    alligator

    letter

    empressilkscripttreasurebronzethousand

    ancestor

    today

    new

    carriage

    specy

    heaven

    worldshi

    princemountainempiremilitaryshangdragon

    barbarian

    tang

    lo

    western

    landgoldphilosophy

    history

    remain

    look

    stonewriting

    crane

    almost

    ancient

    giantcommunist

    huo

    concubinefire

    weapon

    row

    man

    soldierlifeworkpartfigure

    build

    yung

    lady

    areaschaller

    b.c.

    longlittle

    countrythronenextperiod

    merchant

    yangyellowevil

    crown

    roman

    spirit

    face

    place

    elixir

    construction

    flag

    ideal

    tower

    keeper

    wild

    ground

    birtheastern

    wildlife

    body

    house

    burial

    lot

    baby

    stretche

    sea

    site

    roofsort

    piece

    confucian

    red

    mind

    trade

    archeologistlan

    hope

    found

    linewonder

    palace

    squarenatural

    day

    qing

    officer

    pig

    west

    tian

    gatethingnumber

    confucianism

    relic

    conflict

    nomad

    name

    supreme

    problemnight

    habitat20th

    death

    shape

    belief

    techniqueobject

    thoughtearthzhou

    captive

    officialnative

    hundred

    storycourtyard

    marquiwarring

    rightafterlife

    westerner

    leaderdetail

    bird

    sense

    effortscholar

    triproute

    sunriver

    group

    hu

    pas

    picturewolong

    destruction

    hair

    woman

    something

    discovery

    stopfaithartisan

    territoryrulerprobably

    why

    a.d.

    journey

    force

    troop

    attack

    boundary

    fall

    edge

    model

    form

    tourist

    high

    ritual

    reasongeneration

    individual

    leave

    seal

    inside

    stand

    north

    book

    statue

    taoist

    fengmen

    leopard

    li−li

    general

    hand

    human

    manchu

    sword

    standard

    female

    breeding

    mercury

    cloud

    glimpsehuang

    program

    huangdi

    sacred

    mile

    everythingmarcoyin

    immortality

    curse

    living

    road

    george

    fieldmotherculturelegend

    genghi

    di

    mysteryarcheologicalset

    control

    chiten

    mouth

    man.

    government

    modern

    far

    favorite

    nothingenemy

    buddha

    rebel

    water

    outside

    kingdom

    meng

    agricultural

    temple

    chimeradio

    white

    heavenly

    food

    forest

    civil

    walk

    million

    materialyangtzecenter

    god

    harmony

    pavilion

    grave

    section

    insemination

    eye

    future

    mean

    mao

    family

    head

    silver

    organizationunderstand

    secret

    reignnature

    eunuchtitle

    running

    moon

    lie

    attempt

    son

    person

    top

    nationalhistorical

    population

    women

    researcherword

    beginning

    brick

    buddhist

    try

    rather

    commonsawmanmade

    someone

    lengthsea.

    surrounding

    item

    respect

    eat

    iron

    master

    record

    travel

    cultural

    fighting

    burning

    charactersleep

    opentaking

    tour

    myth

    main

    question

    father

    corner

    dr.

    barrier

    wineexample

    peasant

    demon

    key

    tradition

    project

    space

    horse

    rangehilldesert

    light

    trap

    feeling

    showzoo

    elsegreenkhansuccessor

    stepincredible

    shown

    doing

    unitedexcavation

    age

    gift

    front

    nation

    minister

    realityheart

    monument

    share

    big

    answer

    mount

    rich

    ling−ling

    size

    restoration

    spotsearch

    energy

    expedition

    beijing

    run

    civilizationrainforest

    alive

    resource

    everybody

    meet

    eternal

    saying

    account

    servant

    threatthirdlaw

    variety

    destroy

    color

    capitallooking

    claim

    play

    leftyoung

    old

    pay

    sign

    invasion

    flow

    pere

    better

    reach

    home

    weight

    political

    structure

    kept

    meter

    existence

    hour

    sturgeon

    hawkchairman

    view

    watchtowerheight

    development

    start

    magic

    everyone

    begun

    let

    revolution

    artguard

    blade

    month

    location

    matingarrowpridecommand

    court

    pound

    becoming

    second

    museum

    region

    soil

    fight

    policyworker

    purpose

    pit

    attention

    fate

    boy

    heavy

    foot

    plan

    coming

    holding

    bear

    child

    ruling

    fitglory

    basic

    bloodpottery

    path

    chariot

    hall

    fear

    wealth

    decademiddledozenposition

    sovereign

    pu

    lydead

    changeyi

    male

    amountblue

    visit

    peakreserve

    battle

    villager

    one.

    standing

    traditionalfarmer

    lifscale

    matter

    unique

    hard

    design

    achievement

    research

    recall

    prec

    isio

    n

    Figure 13: Recall vs precision : window size = 3

  • The results are also compared using the prediction mea-sure, which is calculated by averaging the number of correctpredictions over the number of words in an image for all theimages. As Table 3 shows, prediction measure is better asthe window size increases.

    6. DISCUSSION AND FUTURE WORK

    In this study, integration of visual and textual data is pro-posed to solve the correspondence problem between videoframes and associated text. The preliminary results showthat using this approach it is possible to have better anno-tations for the video frames that can be used to improve theperformance of the text based queries.

    In the current system relatively simpler and smaller datasets are used. Our goal is to apply a similar approach tobroadcast news which is a harder dataset since there areterabytes of video and it requires focusing on people.

    Currently the system uses simple features extracted from thestill images. A better set of features that also include somedetectors (e.g., face detector) and some motion informationis likely to improve the performance. In the experiments,fixed size grids are used. Using a segmenter, better resultscan be obtained [3]. Also, in video the objects are mostly theones that moves. Therefore, it is better to use the temporalinformation to segment the moving objects.

    Using a large number of data may allow us to perform sta-tistical analysis. It is an interesting problem to come with adistance metric for choosing the text that best describes theframe. In this study, individual words are taken as the vo-cabulary words. Taking the noun phrases or the compoundwords (e.g. “statue of liberty”) may improve the perfor-mance. Also, some lexical analysis needs to be done to un-derstand which type of words more precisely correspond tovisual features and should be taken as the vocabulary words.

    7. REFERENCES[1] Informedia Project. http://informedia.cs.cmu.edu.

    [2] K. Barnard, P. Duygulu, N. de Freitas, D. A. Forsyth,D. Blei, and M. Jordan. Matching words and pictures.Journal of Machine Learning Research, 3:1107–1135,2003.

    [3] K. Barnard, P. Duygulu, R. Guru, P. Gabbur, andD. A. Forsyth. The effects of segmentation and featurechoice in a translation model of object recognition. InIEEE Conf. on Computer Vision and PatternRecognition, 2003.

    [4] K. Barnard and D. A. Forsyth. Exploiting imagesemantics for picture libraries. In The FirstACM/IEEE-CS Joint Conference on Digital Libraries,page 469, 2001.

    [5] K. Barnard and D. A. Forsyth. Learning the semanticsof words and pictures. In Int. Conf. on ComputerVision, pages 408–15, 2001.

    [6] E. Brill. A simple rule-based part of speech tagger. InIn Proceedings of the Third Conference on AppliedNatural Language Processing, 1992.

    [7] P. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. L.Mercer. The mathematics of statistical machinetranslation: Parameter estimation. ComputationalLinguistics, 19(2):263–311, 1993.

    [8] C. Chen. First Emperor of China, Voyager CD-ROM.http://www.voyagerco.com/cdrom/, 1994.

    [9] A. P. Dempster, N. M. Laird, and D. Rubin.Maximum likelihood from incomplete data via the emalgorithm. Journal of the Royal Statistical Society,Series B, 1(39):1–38, 1977.

    [10] P. Duygulu, K. Barnard, N. Freitas, and D. A.Forsyth. Object recognition as machine translation:learning a lexicon for a fixed image vocabulary. InSeventh European Conference on Computer Vision(ECCV), volume 4, pages 97–112, 2002.

    [11] P. Duygulu-Sahin. Translating Images to words: Anovel approach for object recognition. PhD thesis,Middle East Technical University, Turkey, 2003.

    [12] C. Jelmini and S. Marchand-Maillet. Deva: anextensible ontology-based annotation model for visualdocument collections. In Proceedings of SPIEPhotonics West, Electronic Imaging 2002, InternetImaging IV, Santa Clara, CA, USA, 2003.

    [13] J. Li and J. Z. Wang. Automatic linguistic indexing ofpictures by a statistical modeling approach. IEEETrans. on Pattern Analysis and Machine Intelligence,25(10):14, 2003.

    [14] M. Markkula and E. Sormunen. End-user searchingchallenges indexing practices in the digital newspaperphoto archive. Information retrieval, 1:259–285, 2000.

    [15] O. Maron. Learning from Ambiguity. PhD thesis,MIT, 1998.

    [16] I. D. Melamed. Empirical Methods for ExploitingParallel Texts. MIT Press, 2001.

    [17] Y. Mori, H. Takahashi, and R. Oka. Image-to-wordtransformation based on dividing and vectorquantizing images with words. In First InternationalWorkshop on Multimedia Intelligent Storage andRetrieval Management, 1999.

    [18] H. Wactlar and C. Chen. Enhanced perspectives forhistorical and cultural documentaries using informediatechnologies. Joint Conference on Digital Libraries(JCDL’02), Portland, Oregon, July 13-17, 1994.