
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 2141–2152, Hong Kong, China, November 3–7, 2019. ©2019 Association for Computational Linguistics


TIGEr: Text-to-Image Grounding for Image Caption Evaluation

Ming Jiang1, Qiuyuan Huang3, Lei Zhang3, Xin Wang2

Pengchuan Zhang3, Zhe Gan3, Jana Diesner1, Jianfeng Gao3

1University of Illinois at Urbana-Champaign, 2University of California, Santa Barbara, 3Microsoft Research, Redmond

{mjiang17,jdiesner}@illinois.edu, [email protected], {qihua,leizhang,penzhan,zhe.gan,jfgao}@microsoft.com

Abstract

This paper presents a new metric called TIGEr for the automatic evaluation of image captioning systems. Popular metrics, such as BLEU and CIDEr, are based solely on text matching between reference captions and machine-generated captions, potentially leading to biased evaluations because references may not fully cover the image content and natural language is inherently ambiguous. Building upon a machine-learned text-image grounding model, TIGEr evaluates caption quality not only based on how well a caption represents the image content, but also on how well machine-generated captions match human-generated captions. Our empirical tests show that TIGEr has a higher consistency with human judgments than existing alternative metrics. We also comprehensively assess the metric's effectiveness in caption evaluation by measuring the correlation between human judgments and metric scores.

1 Introduction

Image captioning is a research topic at the nexus of natural language processing and computer vision, and has a wide range of practical applications (Hodosh et al., 2013; Fang et al., 2015), such as image retrieval and human-machine interaction. Given an image as input, the task of image captioning is to generate a text that describes the image content. Overall, prior studies of this task have focused on two aspects: 1) building large benchmark datasets (Lin et al., 2014; Hodosh et al., 2013), and 2) developing effective caption generation algorithms (Karpathy and Li, 2015; Bernardi et al., 2016; Gan et al., 2017). Remarkable contributions have been made, but assessing the quality of the generated captions is still an insufficiently addressed issue. Since numerous captions with varying quality can be produced by machines, it is important to propose an automatic evaluation metric that is highly consistent with human judgments and easy to implement and interpret.

Figure 1: An example of the caption evaluation challenge. Given two candidate captions: (a) correctly describes the image content, but most words in this caption are not contained in the references, while (b) overlaps with the references ("is holding a bottle of beer") but involves wrong information ("a woman with straight hair"). Prior metrics based only on text-level comparisons fail to assess the quality of the candidate captions, while our metric (TIGEr) can detect this inconsistency by matching a caption with the image content. To explain the TIGEr result, we show the grounding weights of two illustrative image regions with each candidate. The text information matched with each region is highlighted in color.

Prior evaluation metrics typically considered several aspects, including content relevance, information correctness, grammaticality, and whether expressions are human-like (Hodosh et al., 2013; Bernardi et al., 2016). Rule-based metrics inspired by linguistic features such as n-gram overlap are commonly used to evaluate machine-generated captions (Chen et al., 2018; Karpathy and Li, 2015; Huang et al., 2019). However, these metrics primarily evaluate a candidate caption against references without taking image content into account, and the possible information loss caused by references may introduce biases into the evaluation process. Moreover, the ambiguity inherent to natural language presents a challenge for rule-based metrics that rely on text overlap.

As shown in Figure 1, although Caption (a) describes the image properly, most words in this sentence differ from the references. In contrast, Caption (b) contains wrong information ("a woman with straight hair"), but overlaps with the human-written references ("is holding a bottle of beer"). Consequently, all existing metrics improperly assign a higher score to Caption (b).

In order to address the aforementioned challenges, we propose a novel Text-to-Image Grounding based metric for image caption Evaluation (TIGEr), which considers both image content and human-generated references for evaluation. First, TIGEr grounds the content of texts (both reference and generated captions) in a set of image regions by using a pre-trained image-text grounding model. Based on the grounding outputs, TIGEr calculates a score by comparing both the relevance ranking and the distribution of grounding weights among image regions between the references and the generated caption. Instead of evaluating image captions by exact n-gram matching, TIGEr compares the candidate caption with the references by mapping them into vectors in a common semantic space. As a result, TIGEr is able to detect paraphrases based on the semantic meaning of caption sentences. A systematic comparison of TIGEr and other commonly used metrics shows that TIGEr improves evaluation performance across multiple datasets and demonstrates a higher consistency with human judgments. For example, Figure 1 presents a case where only TIGEr is able to assign a higher score to the caption candidate that is more consistent with human judgments. Our main contributions include:

• We propose a novel automatic evaluation metric, TIGEr (code released at https://github.com/SeleenaJM/CapEval), to assess the quality of image captions by grounding text captions in image content.

• By performing an empirical study, we show that TIGEr outperforms other commonly used metrics and demonstrates a higher and more stable consistency with human judgments.

• We conduct an in-depth analysis of the assessment of metric effectiveness, which deepens our understanding of the characteristics of different metrics.

2 Related Work

Caption Evaluation The goal of image caption evaluation is to measure the quality of a generated caption given an image and human-written reference captions (Bernardi et al., 2016).

In general, prior solutions to this task can be divided into three groups. First, human evaluation is typically conducted by employing human annotators to assess captions (e.g., via Amazon's Mechanical Turk) (Bernardi et al., 2016; Aditya et al., 2015a; Wang et al., 2018b).

Second, automatic rule-based evaluation measures assess the similarity between references and generated captions. Many metrics in this group were extended from other related tasks (Papineni et al., 2002; Banerjee and Lavie, 2005; Lin, 2004). BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) were initially developed to evaluate machine translation outcomes based on n-gram precision and recall. ROUGE (Lin, 2004) was originally used in text summarization, and measures the overlap of n-grams using recall. Recently, two metrics were specifically built for visual captioning: 1) CIDEr (Vedantam et al., 2015) measures n-gram similarity based on TF-IDF; and 2) SPICE (Anderson et al., 2016) quantifies graph similarity based on the scene graphs built from captions. Overall, these metrics focus on text-level comparisons, assuming that the information contained in human-written references can well represent the image content. Differing from prior work, we argue in this study that references may not fully cover the image content, because both references and captions are incomplete and selective translations of image content made by human judges or automated systems. The ground truth can only be fully revealed by taking the images themselves into account for evaluation.

Finally, machine-learned metrics use a trained model to predict the likelihood of a testing caption being a human-generated description (Cui et al., 2018; Dai et al., 2017). Prior studies have applied learning-based evaluation to related text generation tasks, e.g., machine translation (Corston-Oliver et al., 2001; Lowe et al., 2017; Kulesza and Shieber, 2004). Most recently, Cui et al. (2018) trained a hybrid neural network model for caption evaluation based on image and text features. This work mainly focuses on the generation of adversarial data used for model training. Despite improving the consistency with human decisions, this approach may involve high computational cost and lead to overfitting (Gao et al., 2019). Besides, the interpretability of the evaluation is limited due to the black-box nature of end-to-end learning models. An even more serious problem of machine-learned metrics is the so-called "gaming of the metric": if a generation system is optimized directly for a learnable metric, then the system's performance is likely to be over-estimated. See a detailed discussion in Gao et al. (2019).

Though caption evaluation is similar to traditional text-to-text generation evaluation in terms of testing targets and evaluation principles, this task additionally needs to synthesize image and reference content as ground-truth information for further evaluation, which is hard to achieve with a rule-based method that encodes a strong human prior. In this study, we propose a new metric that combines the strengths of learning-based and rule-based approaches by: 1) utilizing a pre-trained grounding model to combine the image and text information; and 2) scoring captions with principled rules defined over the grounding outputs, which makes the evaluation interpretable. (We find in our experiments that we can simply use the pre-trained model as is to evaluate systems developed on different datasets, and the pre-trained model is very efficient to run; thus, the cost of computing TIGEr is not much higher than that of computing other traditional metrics.)

Image-Text Matching The task of image-text matching can be defined as the measurement of semantic similarity between visual data and text data. The main strategy of solutions designed for this task is to map image data and text data into a common semantic vector space (Fang et al., 2015; Wang et al., 2018a; Lee et al., 2018). Prior studies usually consider this issue from two perspectives. From the perspective of data encoding, prior work either regards an image/text as a whole or applies bottom-up attention to encode image regions and words/phrases (Karpathy et al., 2014; Kiros et al., 2014). From the perspective of algorithm development, prior work has often applied a deep learning framework to build a joint embedding space for images and texts; some of these studies train a model with a ranking loss (Fang et al., 2015; Lee et al., 2018; Frome et al., 2013) and some use a classification loss (e.g., a softmax function) (Jabri et al., 2016; Fukui et al., 2016). In this work, we take advantage of a state-of-the-art model for image-text matching (Lee et al., 2018) and propose an automatic evaluation metric for image captioning based on the matching results. Our goal is to capture comprehensive information from the input data while also providing an explainable method to assess the quality of an image description.

Figure 2: TIGEr framework.

3 The TIGEr Metric

The overall framework of TIGEr is shown in Figure 2. With the assumption that a good machine-generated caption C should describe an image V as a human would, we compare C against a set of reference captions produced by humans, R = {R_1, ..., R_k}, in two stages.

The first stage is text-image grounding, where for each caption-image pair we compute a vector of grounding scores, one for each region of the image, indicating how likely the caption is grounded in that region. The grounding vector can also be interpreted as estimating how much attention a human judge or the to-be-evaluated image captioning system pays to each image region when generating the caption. A good system is expected to distribute its attention among the different regions of an image similarly to human judges. For example, if a human judge wrote a caption solely based on a particular region of an image (e.g., a human face or a bottle, as shown in Figure 1) while ignoring the rest of the image, then a good machine-generated caption would be expected to describe only the objects in the same region.

The second stage is grounding vector comparison, where we compare the grounding vector between an image V and the machine-generated caption C, denoted by s(V,C), and that between V and the reference captions R, denoted by s(V,R). The more similar these two vectors are, the higher the quality of C. The similarity is measured using two metric systems. The first measures how similarly the image regions are ordered (by their grounding scores) in the two vectors. The second measures how similarly the attention (indicated by grounding scores) is distributed among the different regions of the image in the two vectors. The TIGEr score is the average of the resulting two similarity scores.

Figure 3: Overview of the TIGEr calculation. For each image-caption pair, the pre-trained SCAN model generates a similarity vector in which each dimension denotes the grounding relevance between an image region and the caption sentence in that region's context. Given two similarity vectors denoting the grounding outcomes of the reference versus the candidate captions, we measure: 1) region rank disagreement, and 2) the similarity of the two grounding weight distributions.

In the rest of this section, we describe the two stages in detail.

3.1 Text-Image Grounding

To compute the grounding scores of an image-caption pair, we need to map the image and its paired caption into vector representations in the same vector space. We achieve this by using a Stacked Cross Attention Neural Network (SCAN) (Lee et al., 2018), which is pre-trained on the 2014 MS-COCO train/val dataset with default settings (https://github.com/kuanghuei/SCAN). The training data contains 121,287 images, each of which is annotated with five text descriptions. The architecture of SCAN is shown in Figure 3.

We first encode caption C or R, which is a sequence of m words, into a sequence of d-dimensional (d = 300) word embedding vectors, {w_1, ..., w_m}, using a bi-directional recurrent neural network (RNN) with gated recurrent units (GRUs) (Bahdanau et al., 2015). We then map image V into a sequence of feature vectors in two steps. First, we extract from V a set of n = 36 region-level 2048-dimensional feature vectors using a pre-trained bottom-up attention model (Anderson et al., 2018). Then, we use a linear projection to convert these vectors into a sequence of d-dimensional image region vectors {v_1, ..., v_n}.

We compute how much caption C, represented as {w_1, ..., w_m}, is grounded in each image region v_i, i = 1, ..., n, as follows. First, we compute an attention feature vector for each v_i as

a_i = \sum_{j=1}^{m} \alpha_{ij} w_j    (1)

\alpha_{ij} = \frac{\exp(\lambda \, \mathrm{sim}(v_i, w_j))}{\sum_{k=1}^{m} \exp(\lambda \, \mathrm{sim}(v_i, w_k))}    (2)

where λ is a smoothing factor, and sim(v, w) is a normalized similarity function defined as

\mathrm{sim}(v_i, w_j) = \frac{\max(0, \mathrm{score}(v_i, w_j))}{\sqrt{\sum_{k=1}^{n} \max(0, \mathrm{score}(v_k, w_j))^2}}    (3)

\mathrm{score}(v_i, w_j) = \frac{v_i \cdot w_j}{\| v_i \| \, \| w_j \|}    (4)

Then, we get the grounding vector of C and V as

s(V, C) = \{s_1, ..., s_n\}    (5)

s_i := s(v_i, C) = \frac{v_i \cdot a_i}{\| v_i \| \, \| a_i \|}    (6)

where the absolute value of s_i indicates how much caption C is grounded in the i-th image region or, in other words, to what degree C is generated based on the i-th image region.

Given an image V, in order to evaluate the quality of a machine-generated caption C based on a set of reference captions R = {R_1, ..., R_k}, we generate two grounding vectors: s(V,C) as in Equation 5, and s(V,R), which is the mean grounding vector over all references in R:

s(V, R) = \mathrm{avg}([s(V, R_1), ..., s(V, R_k)])    (7)
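To make the grounding stage concrete, the following is a minimal NumPy sketch of Equations 1-7. It assumes the region vectors and word embeddings have already been projected into the shared d-dimensional space by SCAN; the function names, the default smoothing factor, and the small epsilon guard are illustrative choices, not part of the released implementation.

```python
import numpy as np

def grounding_vector(V, W, lam=9.0):
    """Grounding scores s(V, C) for one image-caption pair (Eqs. 1-6).

    V   : (n, d) array of projected image region vectors.
    W   : (m, d) array of word embeddings for one caption.
    lam : smoothing factor lambda in Eq. 2 (default is an assumption).
    """
    # Cosine similarity between every region and every word (Eq. 4).
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    score = Vn @ Wn.T                                   # shape (n, m)

    # Clip negatives and normalize over regions (Eq. 3).
    clipped = np.maximum(score, 0.0)
    sim = clipped / np.sqrt((clipped ** 2).sum(axis=0, keepdims=True) + 1e-12)

    # Attention over words for each region (Eq. 2), then attended vectors a_i (Eq. 1).
    alpha = np.exp(lam * sim)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)    # (n, m)
    A = alpha @ W                                       # (n, d)

    # Region-level grounding score = cosine(v_i, a_i) (Eqs. 5-6).
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    return (Vn * An).sum(axis=1)                        # (n,)

def reference_grounding(V, ref_embeddings, lam=9.0):
    """Mean grounding vector over all reference captions (Eq. 7)."""
    return np.mean([grounding_vector(V, W, lam) for W in ref_embeddings], axis=0)
```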


3.2 Grounding Vector Comparison

The quality of C is measured by comparing s(V,C) and s(V,R) using two metric systems, Region Rank Similarity (RRS) and Weight Distribution Similarity (WDS).

3.2.1 Region Rank Similarity (RRS)

RRS is based on Discounted Cumulative Gain (DCG) (Jarvelin and Kekalainen, 2002), which is widely used to measure the document ranking quality of web search engines. Using a graded relevance scale of documents in a ranked list returned by a search engine, DCG measures the usefulness, or gain, of a document based on its position in the ranked list.

In the image captioning task, we view a caption as a query, an image as a collection of documents, one for each image region, and the grounding scores based on the reference captions as human-labeled relevance scores. Note that s(V,C) consists of a set of similarity scores {s_1, s_2, ..., s_n} for all image regions. If we view s(v, C) as a relevance score between caption (query) C and image region (document) v, we can sort these image regions by their scores to form a ranked list, similar to the common procedure in information retrieval of documents from data collections via search engines.

Then, the quality of the ranked list, or equivalently the quality of C, can be measured via DCG, which is the sum of the graded relevance values of all image regions in V, discounted based on their positions in the ranked list derived from s(V,C):

DCG_{s(V,C)} = \sum_{k=1}^{n} \frac{rel_k}{\log_2(k + 1)}    (8)

where rel_k is the human-labeled graded relevance value of the image region at position k in the ranked list.

Similarly, we can generate an ideal ranked list from s(V,R), where all regions are ordered based on their human-labeled graded relevance values. We compute the Ideal DCG (IDCG) as

IDCG_{s(V,R)} = \sum_{k=1}^{n} \frac{rel_k}{\log_2(k + 1)}    (9)

Finally, RRS between s(V,C) and s(V,R) is defined as the Normalized DCG (NDCG):

RRS(V, C, R) := NDCG(V, C, R) = \frac{DCG_{s(V,C)}}{IDCG_{s(V,R)}}    (10)

The assumption made in using RRS is that C is generated based mainly on a few highly important image regions, rather than the whole image. A high-quality C should be generated based on a similar or the same set of highly important image regions as the ones based on which human judges write reference captions. One limitation of RRS is that it is not suitable for measuring the quality of captions that are generated based on many equally important image regions. In these cases, we assume that humans produce captions by distributing their attention more or less evenly across all image regions rather than focusing on a small number of highly important regions. The remedy to this limitation is the metric we describe next.
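As an illustration of how RRS could be computed from the two grounding vectors, here is a hedged NumPy sketch of Equations 8-10. It uses the reference grounding scores directly as graded relevance values; any binning of continuous scores into discrete relevance grades is not specified in this section, so that simplification is an assumption.

```python
import numpy as np

def region_rank_similarity(s_ref, s_cand):
    """RRS: NDCG of the candidate's region ranking against reference relevance (Eqs. 8-10).

    s_ref  : (n,) grounding scores from the references, used as graded relevance.
    s_cand : (n,) grounding scores from the candidate caption, used to rank regions.
    """
    n = len(s_ref)
    discounts = 1.0 / np.log2(np.arange(2, n + 2))   # 1 / log2(k + 1) for k = 1..n

    # DCG: reference relevance read off in the order induced by the candidate (Eq. 8).
    cand_order = np.argsort(-s_cand)
    dcg = np.sum(s_ref[cand_order] * discounts)

    # IDCG: reference relevance in its own ideal (descending) order (Eq. 9).
    idcg = np.sum(np.sort(s_ref)[::-1] * discounts)
    return dcg / idcg                                 # Eq. 10
```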

3.2.2 Weight Distribution Similarity (WDS)

WDS measures how similarly a system and human judges distribute their attention among the different regions of an image when generating captions. Let P and Q be the attention distributions derived from the grounding scores in s(V,R) and s(V,C), respectively. We measure the distance between the two distributions via the KL divergence (Kullback and Leibler, 1951):

D_{KL}(P \| Q) = \sum_{k=1}^{n} P(k) \log \frac{P(k)}{Q(k)}    (11)

where P(k) = softmax(s(v_k, R)) and Q(k) = softmax(s(v_k, C)).

In addition, we also find it useful to capture the difference in caption-image relevance between (V,C) and (V,R). The relevance value of a caption-image pair can be approximated by the norm of its grounding vector. We use D_{rel} to denote this value difference term, defined as

D_{rel}((V,R) \| (V,C)) = \log \frac{\| s(V,R) \|}{\| s(V,C) \|}    (12)

Finally, letting D(R \| C) = D_{KL}(P \| Q) + D_{rel}((V,R) \| (V,C)), we get the WDS between the two grounding vectors using a sigmoid function:

WDS(V, C, R) := 1 - \frac{\exp(\tau D(R \| C))}{\exp(\tau D(R \| C)) + 1}    (13)

where τ is a to-be-tuned temperature.
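A corresponding sketch of WDS (Equations 11-13) follows. The temperature default is only a placeholder, since the tuned value of τ is not given in this section.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def weight_distribution_similarity(s_ref, s_cand, tau=1.0):
    """WDS from KL divergence plus a relevance-magnitude term (Eqs. 11-13)."""
    P, Q = softmax(s_ref), softmax(s_cand)                            # attention distributions
    d_kl = np.sum(P * np.log(P / Q))                                  # Eq. 11
    d_rel = np.log(np.linalg.norm(s_ref) / np.linalg.norm(s_cand))    # Eq. 12
    d = d_kl + d_rel
    return 1.0 - np.exp(tau * d) / (np.exp(tau * d) + 1.0)            # Eq. 13
```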


3.2.3 TIGEr Score

The TIGEr score is defined as the average of RRS and WDS (we selected the arithmetic mean after empirically observing the value variance between RRS and WDS in the [0, 1] interval):

TIGEr(V, C, R) := \frac{RRS(V, C, R) + WDS(V, C, R)}{2}    (14)

The score is a real value in [0, 1]. A higher TIGEr score indicates a better caption, as it matches better with the human-generated captions for the same image.
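Putting the pieces together, a TIGEr score for a single image could then be assembled as below. This reuses the hypothetical helpers sketched above rather than the released CapEval code.

```python
def tiger_score(V, W_cand, W_refs, lam=9.0, tau=1.0):
    """TIGEr for one image: the average of RRS and WDS (Eq. 14).

    V      : (n, d) projected region vectors for the image.
    W_cand : (m, d) word embeddings of the candidate caption.
    W_refs : list of (m_i, d) word-embedding matrices, one per reference.
    """
    s_cand = grounding_vector(V, W_cand, lam)
    s_ref = reference_grounding(V, W_refs, lam)
    rrs = region_rank_similarity(s_ref, s_cand)
    wds = weight_distribution_similarity(s_ref, s_cand, tau)
    return 0.5 * (rrs + wds)
```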

4 Experiments

4.1 Dataset

Composite Dataset The multi-source dataset (Aditya et al., 2015a) we used contains testing captions for 2007 MS-COCO images, 997 Flickr 8K pictures, and 991 Flickr 30K images. In total, there are 11,985 candidate captions graded by annotators on description relevance from 1 (not relevant) to 5 (very relevant). Each image has three candidate captions, where one is extracted from the human-written references and the other two are generated by recently proposed captioning models (Karpathy and Fei-Fei, 2015; Aditya et al., 2015b).

Flickr 8K The Flickr 8K dataset was collected by Hodosh et al. (2013) and contains 8,092 images. Each picture is associated with five reference captions written by humans. The dataset also includes 5,822 testing captions for 1,000 images. Unlike the aforementioned datasets, where testing captions are generated directly from an image, the candidates in Flickr 8K were selected by an image retrieval system from a reference caption pool. For grading, native speakers were hired to give a score from 1 (not related to the image content) to 4 (very related). Each caption was graded by three human evaluators, and the inter-coder agreement was around 0.73. Because 158 candidates are actual references of their target images, we excluded them from further analysis.

Pascal-50S The PASCAL-50S dataset (Vedantam et al., 2015) contains 1000 images from the UIUC PASCAL Sentence dataset, of which 950 images are associated with 50 human-written captions per image as references, and the remaining images have 120 references each. This dataset also contains 4000 candidate caption pairs with human evaluations, where each annotator was asked to select the candidate in a pair that is closer to the given reference description. Candidate pairs are grouped into four types: 1) human-human correct (HC) contains two human-written captions for the target image; 2) human-human incorrect (HI) contains two captions written by humans but describing different images; 3) human-machine (HM) contains a human-written and a machine-generated caption; and 4) machine-machine (MM) contains two machine-generated captions describing the same image.

4.2 Compared Metrics

Given our emphasis on metric interpretability and efficiency, we selected for comparison six rule-based metrics that have been widely used to evaluate image captions: BLEU-1, BLEU-4, ROUGE-L, METEOR, CIDEr, and SPICE. We use the MS COCO evaluation tool (https://github.com/tylin/coco-caption) to implement all metrics. Before testing, the input texts were lowercased and tokenized using the ptbtokenizer.py script from the same tool package.

4.3 Evaluation

Our examination of metric performance mainly focuses on the caption-level correlation with human judgments. Following prior studies (Anderson et al., 2016; Elliott and Keller, 2014), we use Kendall's tau (τ) and Spearman's rho (ρ) rank correlations to evaluate pairwise scores between metrics and human decisions on the Composite and Flickr 8K datasets.
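For reference, these caption-level correlations can be computed with scipy.stats; this is a generic sketch of the evaluation protocol, not the authors' evaluation script.

```python
from scipy.stats import kendalltau, spearmanr

def caption_level_correlation(metric_scores, human_scores):
    """Caption-level agreement between a metric and human judgments."""
    tau, tau_p = kendalltau(metric_scores, human_scores)
    rho, rho_p = spearmanr(metric_scores, human_scores)
    return {"kendall_tau": tau, "spearman_rho": rho,
            "tau_p": tau_p, "rho_p": rho_p}
```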

                 Composite         Flickr 8K
                 τ       ρ         τ       ρ
BLEU-1           0.280   0.353     0.323   0.404
BLEU-4           0.205   0.352     0.138   0.387
ROUGE-L          0.307   0.383     0.323   0.404
METEOR           0.379   0.469     0.418   0.519
CIDEr            0.378   0.472     0.439   0.542
SPICE            0.419   0.514     0.449   0.596
RRS (ours)       0.388   0.479     0.418   0.521
WDS (ours)       0.433   0.526     0.464   0.572
TIGEr (ours)     0.454   0.553     0.493   0.606

Table 1: Caption-level correlation between metrics and human grading scores on the Composite and Flickr 8K datasets, using Kendall's tau and Spearman's rho. All p-values < 0.01.

Figure 4: Average metric score per human score group.

Since the human annotation of the Pascal-50S data is a pairwise ranking instead of a scoring, we kept consistency with the prior work (Vedantam et al., 2015; Anderson et al., 2016) that evaluates metrics by accuracy. Different from Anderson et al. (2016), who considered equally-scored pairs as correct cases, our definition of a correct case is that the metric should assign a higher score to the candidate preferred by human annotators.
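A minimal sketch of this pairwise accuracy criterion, under the assumption that ties are counted as incorrect:

```python
def pairwise_accuracy(scores_a, scores_b, human_prefers_a):
    """Fraction of pairs where the metric agrees with the human choice.

    A pair counts as correct only if the metric strictly prefers the
    human-preferred candidate (equally-scored pairs get no credit).
    """
    correct = 0
    for sa, sb, prefers_a in zip(scores_a, scores_b, human_prefers_a):
        if sa != sb and (sa > sb) == prefers_a:
            correct += 1
    return correct / len(scores_a)
```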

4.4 Result on Composite & Flickr 8K

Metric Performance Table 1 displays the correlation between metrics and human judgments in terms of τ and ρ. On both correlation coefficients, we achieve a noticeable improvement in the assessment of caption quality on Composite and Flickr 8K compared to the previous best results produced with existing metrics (Anderson et al., 2016; Elliott and Keller, 2014). Regarding the isolated impact of the two similarity measures, we observe that WDS contributes more to caption evaluation than RRS. This finding indicates that the micro-level comparison of the grounding weight distributions aligns more closely with human judgments than the macro-level comparison of image region ranks by grounding scores.

TIGEr Score Analysis To understand the evaluation results of TIGEr, we further analyzed our metric scores grouped by the human scores of caption quality. As shown in Figure 4, our metric scores increase with the human scores in both datasets. Overall, the growth rate of RRS is similar to that of WDS in each dataset. Interestingly, the metric scores increase more slowly in the high-scored caption groups than in the low-scored groups (Flickr 8K), which suggests that the difference in caption relevance between adjacent high-scored groups (e.g., 3 vs. 4) is smaller than between nearby low-scored groups. A more detailed error analysis and a qualitative analysis are provided in the Appendix.

                 HC      HI      HM      MM      All
BLEU-1           51.20   95.70   91.20   58.20   74.08
BLEU-4           53.00   92.40   86.70   59.40   72.88
ROUGE-L          51.50   94.50   92.50   57.70   74.05
METEOR           56.70   97.60   94.20   63.40   77.98
CIDEr            53.00   98.00   91.50   64.50   76.75
SPICE            52.60   93.90   83.60   48.10   69.55
TIGEr (ours)     56.00   99.80   92.80   74.20   80.70

Table 2: Accuracy of metrics at matching human judgments on PASCAL-50S with 5 reference captions. The highest accuracy per pair type is shown in bold font. Column titles are explained in Section 4.1.

4.5 Result on Pascal-50S

Fixed Reference Number Table 2 reports the evaluation accuracy of the metrics on PASCAL-50S with five reference captions per image. Our metric achieves higher accuracy in most pair groups, except for HM and HC. Over all instances, TIGEr improves the agreement with human judgment by 2.72 percentage points compared to the best prior metric (METEOR). Among the four candidate groups, identifying irrelevant human-written captions in HI is relatively easy for all metrics, and TIGEr achieves the highest accuracy (99.80%). In contrast, judging the quality of two correct human-annotated captions in HC is difficult, with a lower accuracy per metric compared to the other testing groups. For this pair group, TIGEr (56.00%) shows a performance comparable to the best alternative metric (METEOR: 56.70%). More importantly, TIGEr reaches a noteworthy improvement in judging machine-generated caption pairs (MM), increasing accuracy by about 10 percentage points over the best prior metric (CIDEr: 64.50%).

Changing Reference Number In order to explore the impact of the reference captions on metric performance, we varied the number of references from 1 to 50. As shown in Figure 5, TIGEr outperforms prior metrics on the All and MM candidate pairs by: 1) achieving higher accuracy, especially for small numbers of references; and 2) obtaining more stable results across varied reference set sizes. Our findings suggest that TIGEr has a low dependency on references. Compared with prior work (Vedantam et al., 2015), the slight differences in results might be caused by the random choice of reference subsets.


Figure 5: Pairwise comparison accuracy (y-axis) of metrics at matching human judgments with 1-50 reference captions (x-axis). TIGEr (solid line in light purple) is the best-performing metric in both the all-pair group and the machine-machine pair group. Increasing the number of reference captions clearly benefits all existing metrics except TIGEr, which is less dependent on additional references and can therefore greatly reduce human annotation costs.

Figure 6: Visualization of text-to-image grounding.

4.6 Text Component Sensitivity

We also explore the metrics' capability to identify specific text component errors (i.e., object, action, and property). We randomly sampled a small set of images from Composite and Pascal-50S. Given an image, we pick a reference caption and generate two candidates by replacing a text component. For example, we replaced the action "walk" in the reference "People walk in a city square" with a similar action "stroll" and a different action "are running" to form a candidate pair. We then calculated the pairwise ranking accuracy of each metric for each component. As Figure 7 shows, TIGEr is sensitive to object-level changes but comparatively weak at detecting action differences. This implies that text-to-image grounding is more difficult at the action level than at the object level. Similarly, SPICE also has a lower capability to identify action inconsistencies, which shows the limitation of scene graphs in capturing this feature. In contrast, the n-gram-based metrics are better at identifying movement changes. To further study the sensitivity of TIGEr to property-level (i.e., object attribute) differences, we manually grouped the compared pairs based on their associated attribute category, such as color. Our results show that states (e.g., sad, happy), counting (e.g., one, two), and color (e.g., red, green) are the top attribute groups in the sampled data, where TIGEr can better identify differences in counting descriptions (accuracy 84.21%) than in expressions of state (69.70%) and color (69.23%).

Figure 7: Metric accuracy at three text component levels.

4.7 Grounding Analysis

To analyze the text-to-image grounding behind TIGEr in more depth, we visualize a set of matched image-caption pairs. Figure 6 provides two illustrative examples. Interestingly, both captions for each image are correct, but they focus on different image parts. For instance, in the left image, caption 1 mentions "chairs", which matches region a, while caption 2 relates to region b by addressing "garage". According to our observations, an image region typically has a higher grounding weight with the corresponding caption than the other, unrelated region. Our observations suggest that, since captions are typically short, they may not cover the information of all regions of an image, and hence taking image content into account for caption evaluation is necessary.

5 Conclusion and Future Work

We have presented a novel evaluation metric called TIGEr for image captioning, which utilizes text-to-image grounding results from a pre-trained model. Unlike traditional metrics that are solely based on text matching between reference captions and machine-generated captions, TIGEr also takes into account the matching between image content and captions, as well as the similarity between captions generated by human judges and automated systems. The presented experiments on three benchmark datasets have shown that TIGEr outperforms existing popular metrics and has a higher and more stable correlation with human judgments. In addition, TIGEr is a fine-grained metric in that it identifies description errors at the object level.

Though TIGEr was built upon SCAN, our metric is not tied to this grounding model. Improvements in grounding models that focus on the latent correspondence between object-level image regions and descriptions will allow TIGEr to be further improved in the future. Since the pre-training data mainly contains photo-based images, TIGEr primarily focuses on caption evaluation in this image domain. For other domains, we can retrain the SCAN model using ground-truth image-caption pairs, a process similar to applying a learning-based metric. In future work, we plan to extend the metric to other visual-text generation tasks such as storytelling.

Acknowledgments

We appreciate the anonymous reviewers for their constructive comments and insightful suggestions. This work was mostly done while Ming Jiang was interning at Microsoft Research.

References

Somak Aditya, Yezhou Yang, Chitta Baral, Cornelia Fermuller, and Yiannis Aloimonos. 2015a. From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv preprint arXiv:1511.03292.

Somak Aditya, Yezhou Yang, Chitta Baral, Cornelia Fermuller, and Yiannis Aloimonos. 2015b. From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv preprint arXiv:1511.03292.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In ECCV.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and VQA. In CVPR.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop.

Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. Journal of Artificial Intelligence Research.

Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, and Cho-Jui Hsieh. 2018. Attacking visual language grounding with adversarial examples: A case study on neural image captioning. In ACL.

Simon Corston-Oliver, Michael Gamon, and Chris Brockett. 2001. A machine learning approach to the automatic evaluation of machine translation. In ACL.

Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. 2018. Learning to evaluate image captioning. In CVPR.

Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. 2017. Towards diverse and natural image descriptions via a conditional GAN. In ICCV.

Desmond Elliott and Frank Keller. 2014. Comparing automatic evaluation measures for image description. In ACL.

Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C Platt, et al. 2015. From captions to visual concepts and back. In CVPR.

Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. DeViSE: A deep visual-semantic embedding model. In NeurIPS.

Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP.

Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. 2017. Semantic compositional networks for visual captioning. In CVPR.

Jianfeng Gao, Michel Galley, and Lihong Li. 2019. Neural approaches to conversational AI. Foundations and Trends in Information Retrieval.

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research.

Qiuyuan Huang, Zhe Gan, Asli Celikyilmaz, Dapeng Wu, Jianfeng Wang, and Xiaodong He. 2019. Hierarchically structured reinforcement learning for topically coherent visual story generation. In AAAI.

Allan Jabri, Armand Joulin, and Laurens van der Maaten. 2016. Revisiting visual question answering baselines. In ECCV.

Kalervo Jarvelin and Jaana Kekalainen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.

Andrej Karpathy, Armand Joulin, and Li F Fei-Fei. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In NeurIPS.

Andrej Karpathy and Fei-Fei Li. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.

Alex Kulesza and Stuart M Shieber. 2004. A learning approach to improving sentence-level MT evaluation. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation.

Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics.

Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. In ECCV.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV.

Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In ACL.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL.

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In CVPR.

Liwei Wang, Yin Li, Jing Huang, and Svetlana Lazebnik. 2018a. Learning two-branch neural networks for image-text matching tasks. PAMI.

Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. 2018b. No metrics are perfect: Adversarial reward learning for visual storytelling. In ACL.


A Appendices

A.1 Error Analysis

To better understand the inconsistent score ranks assigned by TIGEr and by human evaluation to candidates in the Composite and Flickr 8K datasets, we further conducted an error analysis. Since the TIGEr score is continuous in [0, 1] while human scoring is done on an n-point scale, we map our metric score onto the n-point scale by sorting instances according to their TIGEr results and assigning a score group to each instance based on the distribution of human scores. Figure 8 shows the distribution of score group differences between TIGEr and human evaluation for both datasets. We observe that true positives (i.e., cases where the TIGEr score group equals the human score for a given caption) are more frequent than any error group. On the other hand, inconsistent cases with small differences in assigned scores (e.g., TIGEr assigns 3 while the human assigns 4) occur more frequently than cases with notable score differences, especially for test cases from the Flickr 8K dataset. Our finding suggests that captions with subtle quality differences are more difficult to discriminate than instances with significant quality gaps.

A.2 Qualitative Analysis

To illustrate some characteristics of TIGEr in more depth, we show human and TIGEr scores for a set of caption examples in Table 3. In order to make the results of the two metrics comparable, both the score group and the actual grading value of TIGEr are displayed. Note that we provide an equal number of examples for each type of comparison result (i.e., the human score is higher, the TIGEr score is higher, or both scores are equal) in Table 3.

We find that TIGEr is able to measure caption quality by considering the semantic information of the image content. For example, the human-written references in case (e) primarily provide an overall description of all people shown in the image, while the candidate caption specifically describes two walking women in the image. Despite this difference at the text level, TIGEr assigns a high score to this sentence, just like the human evaluators. Case (b) also supports this finding to some extent. Unlike the references, which specifically describe a baseball player in the game, the candidate caption in this example provides a general description that matches the image content (e.g., "baseball game", "at night"). This observation may help to explain why TIGEr gave this sentence a high score.

Besides, we find two challenges for caption evaluation based on TIGEr. First, though our metric improves evaluation performance by considering semantic information from both images and references, objects with close semantic meanings, such as "laptop" and "computer" in example (a), are hard to differentiate. Second, human interpretation inspired by an image is hard for an automatic evaluation metric to judge. For example, in case (d), "makes a wish as he blows the little pinwheels into the air" is a reasonable inference by the human annotator that cannot be explicitly observed in the image.


Figure 8: Distribution of group score differences between TIGEr and human evaluation

Table 3: Examples of scores given by TIGEr. In the TIGEr column, we display both the score group and the actual grading value. The score group is assigned to be comparable with human scores, where we map our metric score onto the same n-point scale as the human evaluation. The detailed mapping process is described in Section A.1.