
Supplementary Material: Referring to Objects in Videos using Spatio-Temporal Identifying Descriptions

Peratham Wiriyathammabhum♠♦, Abhinav Shrivastava♠♦, Vlad I. Morariu♦, Larry S. Davis♠♦

University of Maryland: Department of Computer Science♠, UMIACS♦
[email protected], [email protected], [email protected], [email protected]

Figure 1: The number of words in a sentence in STV-IDL is normally distributed with an average of 22.65 words.

A Dataset Statistics

The length of referring expressions. We first analyze the length of referring expressions. We split each referring expression into words using the Natural Language Toolkit (NLTK) (Bird et al., 2009). Figure 1 shows the distribution of the number of words in each referring expression. STV-IDL contains referring expressions of varying lengths, and the average length (22.65 words) is longer than in most existing video referring expression datasets (Gao et al., 2017; Yamaguchi et al., 2017; Hendricks et al., 2017; Krishna et al., 2017) because we enforce the sentence syntax and encourage conjunctions.
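
For concreteness, the sketch below shows how such word counts could be computed with NLTK; the `expressions` list is a hypothetical stand-in for the collected referring expressions, not the actual dataset loader.

```python
# Sketch: word-count statistics of referring expressions with NLTK.
# `expressions` is a hypothetical sample, not the actual STV-IDL annotations.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models

expressions = [
    "the man in the red shirt moves to the right while the other player steps back",
]

lengths = [len(word_tokenize(sent)) for sent in expressions]
print(f"average length: {sum(lengths) / len(lengths):.2f} words")
```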

The referring expressions provided by the annotators as speakers are subjective. However, as long as we correctly understand what they refer to and the process leads to mutual understanding and successful communication, the grounding process is valid. We are aware that giving an example sentence may limit the variation of the referring expressions.

Figure 2: The number of target objects in STV-IDL is at least 2, with an average of 2.85 objects per video.

However, we want to make sure that our constraint is satisfied by every sentence, and we also collect only a few referring expressions per referred object in a temporal interval. Besides, the collected sentences usually come from different annotators because of the randomization in our web interface, so there is still some constraint-satisfying variation within our few sentences per referred object. Compared to the ReferIt game (Kazemzadeh et al., 2014), the annotators are the speakers and we are the listeners. However, we do not collect the data in a gamification setting, in which referring expressions tend to be short and concise, similar to verbal utterances. The reason is that it is not clear how a speaker would compare and utter words to contrast the referent from other distractors in the temporal dimension. It is also not clear how the listener would comprehend the referring expression in this setting for untrimmed video, except by relying on the time stamp to fast-forward to the event itself. Our data collection pipeline is more similar to the Google-Refexp dataset (Mao et al., 2016), which has two separate steps: collecting the referring expression and then verification.

Table 1: Composition of the STV-IDL dataset, including the number of videos and sentences for each collection.

Collection            Number of Videos   Number of Sentences
sepak takraw                 23                  818
birds                        24                  426
dogs                         18                  410
elephants                    11                  301
panda                        16                  662
tennis double                 9                 1160
tennis single                15                  668
tabletennis double            9                  336
tabletennis single           10                  376
badminton double             12                  816
badminton single             18                  408
beach volley                 14                  812
fencing                      20                  372
STV-IDL                     199                 7569

This setting leads to grammatical sentences, which are usually longer than verbal utterances.

The number of objects in each video. Figure 2 shows that our STV-IDL has multiple objects in each video, with an average of 2.85 objects from the same class. From the annotation files, we observe that each video is annotated with 38.04 referring expressions on average.

Word occurrences. To verify the presence of temporal words, Figure 8 shows that words like 'while', 'then', 'moves', 'steps', and 'turns' appear in the top 50 most frequent words. We further filter out stopwords in Figure 9. We can see that the top words describe a person (man or player), appearance (wearing, shirt, or red), actions (moves or steps), or spatial relations (front, right, left, or back). Temporal words like 'then' are in the NLTK stopword list.
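
A minimal sketch of this frequency analysis, assuming a flat token list `all_words` (a hypothetical placeholder for the tokenized corpus):

```python
# Sketch: top-50 word counts with and without NLTK English stopwords.
# `all_words` is a hypothetical placeholder for all tokens in the corpus.
import nltk
from collections import Counter
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

all_words = ["the", "man", "in", "red", "shirt", "moves", "then", "turns", "left"]

top_with_stopwords = Counter(all_words).most_common(50)
stop = set(stopwords.words("english"))  # note: 'then' and 'while' are on this list
top_without_stopwords = Counter(
    w for w in all_words if w.lower() not in stop
).most_common(50)
```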

Part-of-speech occurrences. To verify the sentence syntax, Figure 10 shows the proportions of Penn part-of-speech tags (Taylor et al., 2003) obtained using the NLTK POS tagger. We observe that STV-IDL has a high proportion of prepositions (10.9% IN), adjectives (9.7% JJ), conjunctions (3.4% CC), and adverbs (3.6% RB), with only 28.1% nouns (NN).
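
These proportions could be reproduced along the following lines (a sketch with a hypothetical `expressions` sample; the tagger and tag set are the standard NLTK ones):

```python
# Sketch: Penn POS-tag proportions with the default NLTK tagger.
# `expressions` is a hypothetical sample of referring expressions.
import nltk
from collections import Counter

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

expressions = [
    "the man in the red shirt moves to the right while the other player steps back",
]

tags = [tag for sent in expressions
        for _, tag in nltk.pos_tag(nltk.word_tokenize(sent))]
counts, total = Counter(tags), len(tags)
for tag in ("IN", "JJ", "CC", "RB", "NN"):
    print(f"{tag}: {100.0 * counts[tag] / total:.1f}%")
```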

B Annotation Interfaces

We show good referring expression examples [Link] to the Amazon Mechanical Turk workers. We show a video clip with a green bounding box surrounding the target object. We then ask the annotators to write a sentence that contains at least 3 phrases: a noun phrase, a verb phrase, and a preposition or adverb phrase. A noun phrase should describe color or appearance. A verb phrase should describe actions. A preposition or adverb phrase should describe relations of the target object to other objects in the scene. We also encourage the annotators to write many phrases using conjunctions.

In the second stage, we take the annotated referring expressions to the verification interface. We manually verify that the sentence refers to the target object. If the sentence is valid, we then correct minor grammatical errors and misspellings. We paid workers $0.05 to $0.10 for each valid referring expression.

C Implementation Details

We implement our modular attention network in PyTorch. The temporal interval proposal is implemented in TensorFlow. Optical flow images are estimated using the TV-L1 algorithm in OpenCV. For Faster-RCNN, we use a batch size of 8 and train on 4 GPUs for 20 epochs. The motion Faster-RCNN uses (128, 128, 128) as the mean data value. For the modular attention network, we use the ADAM optimizer with an initial learning rate of 1e-3, and we train for 30 epochs, at which point we observed that the learning process saturates. We use a 1-layer bidirectional LSTM for every LSTM in the model. The hidden size of the LSTM in the language attention network is 512; for the moving location module it is 50; and for the relationship motion module it is 20. All other settings are the same as in the original implementation (Yu et al., 2018). We train the model using a combination of the ranking loss, the subject attribute loss, and the subject motion attribute loss. Training takes three days for one-stream models and a week for two-stream models on an NVIDIA P6000 GPU with 48GB of memory. We split the STV-IDL dataset into train, validation, and test sets.
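
The sketch below restates these hyperparameters as PyTorch objects. The hidden sizes (512/50/20), 1-layer bidirectional LSTMs, ADAM with learning rate 1e-3, and 30 epochs follow the text; the LSTM input sizes and the placeholder `model` container are illustrative assumptions.

```python
# Sketch of the optimizer and LSTM configuration described above.
# Input sizes and the `model` container are illustrative assumptions.
import torch
import torch.nn as nn

lang_attn_lstm  = nn.LSTM(input_size=512, hidden_size=512, num_layers=1, bidirectional=True)
moving_loc_lstm = nn.LSTM(input_size=30,  hidden_size=50,  num_layers=1, bidirectional=True)
rel_motion_lstm = nn.LSTM(input_size=30,  hidden_size=20,  num_layers=1, bidirectional=True)

model = nn.ModuleList([lang_attn_lstm, moving_loc_lstm, rel_motion_lstm])  # placeholder container
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
num_epochs = 30  # training is reported to saturate around 30 epochs
```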

D Module Definitions

The language attention module learns to attendto each word for each visual module individually.

(a) Subject module RGB (below) and flow (top).

(b) Relationship module RGB (below) and flow (top).

(c) Location module (coordinate + difference features).

Figure 3: Aggregations of output word attention weights for each module on the STV-IDL test set.

Figure 4: Aggregations on all verbs for each module (from left to right: Relationship flow/RGB, Subject flow/RGB, Location).

Figure 5: Aggregations on all verbs for each module (from left to right: Relationship RGB, Subject RGB, and Location for RGB MAttNet). The model grounds more verbs to Subject RGB than fused1.

First, each word in the input expression is encoded as a one-hot vector. Then, a bidirectional LSTM encodes the whole expression, and the hidden states from both directions at each time step are concatenated to create the word representation for each word. Next, a learned weight vector scores each word representation, and the normalized scores are the attention weights for the words. The phrase embedding is the weighted average of the word vectors using these attention weights over every word in the expression.
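
A minimal PyTorch sketch of this module, under the assumption that the per-module word attention comes from a learned linear layer followed by a softmax; dimensions and layer names are illustrative, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAttention(nn.Module):
    """Sketch of the language attention module (dimensions are illustrative)."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, num_modules=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # lookup of the one-hot word vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1,
                            bidirectional=True, batch_first=True)
        # one scorer per visual module, applied to each concatenated word representation
        self.word_scorers = nn.ModuleList(
            [nn.Linear(2 * hidden_dim, 1) for _ in range(num_modules)])

    def forward(self, word_ids):                           # word_ids: (batch, seq_len)
        emb = self.embed(word_ids)                         # (batch, seq_len, embed_dim)
        hidden, _ = self.lstm(emb)                         # both directions concatenated
        phrase_embs = []
        for scorer in self.word_scorers:
            attn = F.softmax(scorer(hidden).squeeze(-1), dim=1)               # (batch, seq_len)
            phrase_embs.append(torch.bmm(attn.unsqueeze(1), emb).squeeze(1))  # weighted average
        return phrase_embs                                 # one phrase embedding per module
```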

The subject module has two branches: attribute prediction and phrase-guided attention pooling. In attribute prediction, the input expression is parsed into R1-R7 attributes (Kazemzadeh et al., 2014) using the Stanford dependency parser (Manning et al., 2014). The attributes describe color, size, location, and other observed attribute types. Both 'pool5' and 'spatial fc7' are concatenated and followed by a 1×1 convolution. The attribute feature blob is average-pooled and used for prediction by a fully connected layer. This branch is trained with a cross-entropy loss whenever the system can parse any attribute from the input expression. In phrase-guided attention pooling, the attribute feature blob is concatenated with 'spatial fc7' and the phrase embedding, followed by another 1×1 convolution, to create the subject feature blob. The subject feature blob contains grids that correspond to spatial locations in the input image. Then, the spatial attention is computed by forwarding the grid features through tanh and softmax layers, respectively. The attention-weighted subject feature is computed as a weighted sum of the concatenation of the attribute blob and 'spatial fc7', using the spatial attention.
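
A minimal sketch of the phrase-guided attention pooling step; feature dimensions and tensor shapes are assumptions, and the 'pool5'/'spatial fc7' blobs would come from Faster-RCNN.

```python
import torch
import torch.nn as nn

class PhraseGuidedAttentionPooling(nn.Module):
    """Sketch of the subject module's attention pooling (dimensions are illustrative)."""
    def __init__(self, attr_dim=512, fc7_dim=2048, phrase_dim=1024, subj_dim=512):
        super().__init__()
        # 1x1 convolution over the concatenated attribute blob, spatial fc7, and tiled phrase embedding
        self.fuse = nn.Conv2d(attr_dim + fc7_dim + phrase_dim, subj_dim, kernel_size=1)
        self.score = nn.Conv2d(subj_dim, 1, kernel_size=1)   # per-grid attention logit

    def forward(self, attr_blob, spatial_fc7, phrase_emb):
        # attr_blob: (B, attr_dim, H, W), spatial_fc7: (B, fc7_dim, H, W), phrase_emb: (B, phrase_dim)
        B, _, H, W = spatial_fc7.shape
        phrase = phrase_emb[:, :, None, None].expand(-1, -1, H, W)
        subj_blob = self.fuse(torch.cat([attr_blob, spatial_fc7, phrase], dim=1))
        attn = torch.softmax(torch.tanh(self.score(subj_blob)).flatten(2), dim=-1)  # (B, 1, H*W)
        visual = torch.cat([attr_blob, spatial_fc7], dim=1).flatten(2)              # (B, C, H*W)
        # attention-weighted sum over spatial grid positions
        return (visual * attn).sum(dim=-1)                                          # (B, C)
```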

The subject module score is computed by matching the attention-weighted subject feature with the phrase embedding. The matching function consists of two MLPs with ReLU activations that project both features into a joint embedding space, two L2 normalization layers, and a cosine similarity that gives the module score.
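
A minimal sketch of this matching function; the dimensions are assumptions, while the two-MLP, L2-normalization, cosine-similarity structure follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingFunction(nn.Module):
    """Sketch of the module-score matching function (illustrative dimensions)."""
    def __init__(self, visual_dim=2560, phrase_dim=1024, joint_dim=512):
        super().__init__()
        self.visual_mlp = nn.Sequential(nn.Linear(visual_dim, joint_dim), nn.ReLU(),
                                        nn.Linear(joint_dim, joint_dim))
        self.phrase_mlp = nn.Sequential(nn.Linear(phrase_dim, joint_dim), nn.ReLU(),
                                        nn.Linear(joint_dim, joint_dim))

    def forward(self, visual_feat, phrase_emb):
        v = F.normalize(self.visual_mlp(visual_feat), p=2, dim=-1)  # L2-normalize both embeddings
        p = F.normalize(self.phrase_mlp(phrase_emb), p=2, dim=-1)
        return (v * p).sum(dim=-1)                                  # cosine similarity = module score
```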

For the location module, the normalized bounding box coordinates, $l_i = [\frac{x_{min}}{W}, \frac{y_{min}}{H}, \frac{x_{max}}{W}, \frac{y_{max}}{H}, \frac{Area_{region}}{Area_{image}}]$, where $W$ and $H$ are the width and height of the image, are computed for every object from the same class. Then, the location difference feature of the target object with respect to up to five context objects from the same class, $\delta_{ij} = [\frac{\Delta x_{min}}{W}, \frac{\Delta y_{min}}{H}, \frac{\Delta x_{max}}{W}, \frac{\Delta y_{max}}{H}, \frac{\Delta Area_{region}}{Area_{image}}]$, is concatenated with the normalized bounding box coordinates as $[l_i; \delta_{ij}]$ and fed to a fully connected layer, which produces the final location feature.

The module score comes from matching the final location feature and the phrase embedding using the same matching function as in the subject module.
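
A minimal sketch of these location features, assuming boxes given as (xmin, ymin, xmax, ymax) in pixels and at most five same-class context objects; padding for fewer contexts is omitted, and the output size of the fully connected layer is an assumption.

```python
import torch
import torch.nn as nn

def location_feature(box, W, H):
    """l_i = [xmin/W, ymin/H, xmax/W, ymax/H, Area_region/Area_image]."""
    xmin, ymin, xmax, ymax = box
    area = (xmax - xmin) * (ymax - ymin)
    return torch.tensor([xmin / W, ymin / H, xmax / W, ymax / H, area / (W * H)])

def location_differences(target_box, context_boxes, W, H):
    """delta_ij between the target and up to five same-class context objects."""
    li = location_feature(target_box, W, H)
    return torch.stack([location_feature(c, W, H) - li for c in context_boxes[:5]])

# final location feature: a fully connected layer over [l_i; delta_ij]
# (30-d input = 5 + 5*5; the 512-d output size is an assumption)
location_fc = nn.Linear(30, 512)
```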

For the relationship module, the 'spatial fc7' features of the top five context objects are average-pooled to become the context 'fc7' features. Each context 'fc7' feature is concatenated with the location difference feature $[\delta_{ij}]$ to form the final relationship feature. The module score is the maximum over the matching scores between the final relationship feature of each context object and the phrase embedding, using the same matching function as in the subject and location modules. This is like putting all context objects into a bag, as in the weakly supervised Multiple Instance Learning (MIL) setting, and taking the best-matching context as the final score.
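
A minimal sketch of this max-over-contexts scoring, reusing a matching function like the one sketched for the subject module; tensor shapes are assumptions.

```python
import torch

def relationship_score(context_fc7, deltas, phrase_emb, match_fn):
    """context_fc7: (K, fc7_dim) average-pooled fc7 per context object,
    deltas: (K, 5) location-difference features, phrase_emb: (phrase_dim,)."""
    rel_feats = torch.cat([context_fc7, deltas], dim=-1)       # final relationship features
    phrase = phrase_emb.unsqueeze(0).expand(rel_feats.size(0), -1)
    scores = match_fn(rel_feats, phrase)                       # one score per context object
    return scores.max()                                        # MIL: best-matching context wins
```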

For the motion stream, we train the motion Faster-RCNN on optical flow images of STV-IDL. We start from an MS COCO pretrained model and replace the three-channel RGB input with a stack of flow-x, flow-y, and flow magnitude from the flow image. Then, we duplicate the subject and relationship modules into the subject motion and relationship motion modules, which take the 'pool5' and 'spatial fc7' features extracted from the motion Faster-RCNN.
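
A minimal sketch of how this three-channel flow input could be built from TV-L1 optical flow in OpenCV; the OpenCV calls are real (the optflow module ships with opencv-contrib-python), while the surrounding glue is illustrative.

```python
import cv2
import numpy as np

def flow_input(prev_gray, next_gray):
    """Build the (flow-x, flow-y, magnitude) image that replaces the RGB input
    of the motion Faster-RCNN, using OpenCV's TV-L1 dense optical flow."""
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()   # requires opencv-contrib-python
    flow = tvl1.calc(prev_gray, next_gray, None)      # (H, W, 2): flow-x, flow-y
    mag = np.linalg.norm(flow, axis=-1, keepdims=True)
    return np.concatenate([flow, mag], axis=-1)       # (H, W, 3)
```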

Previous work (Simonyan and Zisserman, 2014) has shown that stacking multiple optical flow images can help recognition. So, we train another variant of the two-stream modular attention network using five stacked optical flow frames. In this setting, we train the stacked motion Faster-RCNN by stacking flow images $F_{idx}$ with frame index $idx \in [t-2, t+2]$. The input becomes a 15-channel stacked optical flow image. In addition, we add the moving location module to further model the movement of the location by stacking location features, so that we have a sequence $[l_i; \delta_{ij}]_{idx}$ with frame index $idx \in [t-2, t+2]$. Then, we place an LSTM on top of the sequence and forward the concatenation of all hidden states to a fully connected layer that outputs the final location features. We also build such a location sequence and place an LSTM on top of it for the location part of the relationship motion module in this stacked optical flow setting. To fuse all modules, the module scores are combined by a weighted average whose weights come from the language attention module.
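
A minimal sketch of the moving location module: an LSTM over the location features of frames t-2..t+2, followed by a fully connected layer over the concatenated hidden states. The 50-unit hidden size and 1-layer bidirectional LSTM follow the implementation details above; the input and output sizes are assumptions.

```python
import torch
import torch.nn as nn

class MovingLocationModule(nn.Module):
    """Sketch: LSTM over [l_i; delta_ij] features of frames t-2..t+2."""
    def __init__(self, loc_dim=30, hidden=50, out_dim=512, seq_len=5):
        super().__init__()
        self.lstm = nn.LSTM(loc_dim, hidden, num_layers=1,
                            bidirectional=True, batch_first=True)
        # concatenate the hidden states of all time steps, then a fully connected layer
        self.fc = nn.Linear(seq_len * 2 * hidden, out_dim)

    def forward(self, loc_seq):                 # loc_seq: (batch, 5, loc_dim)
        hidden, _ = self.lstm(loc_seq)          # (batch, 5, 2*hidden)
        return self.fc(hidden.flatten(1))       # final location feature: (batch, out_dim)
```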

E Experiments

The detailed experimental results are in Table 2. There are 3.37 objects per instance on average in the test set, so randomly selecting one tubelet yields an accuracy of only 29.68% (roughly 1/3.37), which is more than 11% lower than the baseline. The aggregated statistics of the word attention are shown in Figures 3 and 4.

References

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. TALL: Temporal activity localization via language query. In International Conference on Computer Vision (ICCV).

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In International Conference on Computer Vision (ICCV).

Justin Johnson. simple-amt: A microframework for working with Amazon's Mechanical Turk. https://github.com/jcjohnson/simple-amt.

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. 2014. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798.

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In International Conference on Computer Vision (ICCV).

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20.

Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576.

Ann Taylor, Mitchell Marcus, and Beatrice Santorini. 2003. The Penn Treebank: An overview. In Treebanks, pages 5–22. Springer.

Table 2: Identifying Description Localization: mAP for each collection (values are in percent). The fused1 MAttNet is the proposed two-stream method and the fused5 MAttNet is the stacked version of the proposed two-stream method.

Collection           RGB MAttNet   flow MAttNet   flow5 MAttNet   fused1 MAttNet   fused5 MAttNet
sepak takraw             33.70         14.67          30.43            27.17            17.93
birds                    48.53         45.59          51.47            57.35            64.71
dogs                     55.77         61.54          61.54            51.92            53.85
elephants                49.35         38.96          45.45            63.64            51.95
panda                    52.34         53.27          56.07            42.06            51.40
tennis double            18.38         27.21          23.53            20.59            26.47
tennis single            79.81         55.77          64.42            71.15            62.50
tabletennis double       25.83         32.50          26.67            28.33            39.17
tabletennis single       33.33         51.67          54.17            62.50            45.83
badminton double         36.57         32.87          33.33            43.98            35.65
badminton single         64.47         75.00          81.58            85.53            81.58
beach volley             34.41         25.81          28.49            31.72            39.25
fencing                  55.70         58.23          48.10            51.90            48.10
STV-IDL                  41.51         39.02          41.90            44.66            42.82

Masataka Yamaguchi, Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. 2017. Spatio-temporal person retrieval via natural language queries. In International Conference on Computer Vision (ICCV).

Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. 2018. MAttNet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Figure 6: The referring expression annotation interface based on a modified simple-amt (Johnson).

Figure 7: The referring expression verification interface allows the verifier to make corrections by editing the referring expression. Most grammatical errors and misspellings are corrected using this interface.

Figure 8: Top 50 words in STV-IDL with stopwords.

Figure 9: Top 50 words in STV-IDL without stopwords.

Figure 10: The percentages of each Part-of-speech in STV-IDL.

Figure 11: Sample data from birds, fencing, sepak takraw and panda collections.