
A Comparison Study of Human-Evaluated Automated Highlighting Systems

Sasha Spala¹, Franck Dernoncourt², Walter Chang², Carl Dockhorn¹

¹Adobe Systems, ²Adobe Research
345 Park Ave, San Jose, CA 95110-2704
{sspala,dernonco,wachang,cdockhor}@adobe.com

Abstract

Automatic text highlighting aims to identify key portions that are most important to a reader. In this paper, we explore the use of existing extractive summarization models for automatically generating highlights; automatic highlight generation has not previously been addressed from this perspective. Evaluation studies typically rely on automated evaluation metrics as they are cheap to compute and scale well. However, these metrics are not designed to assess automated highlighting. We therefore focus on human evaluations in this work. Our comparison of multiple summarization models used for automated highlighting, accompanied by human evaluation, provides an approximate upper bound of the quality of future highlighting models.

1 Introduction

Automatic text highlighting aims to identify key portions that are most important to a first-time reader. Our motivation for pursuing highlighting analysis stems from an apparent lack of current computational research and machine learning applications dedicated to this type of reading aid. For clarity, we define document highlights as a brightly-colored overlay placed on top of a span of text in order to attract readers' attention to the content of that text. We do not prescribe any one specific color for those highlights, but our practice, as outlined in the methodology, makes use of yellow, green, and red highlights.

Highlighting, when used correctly, can be an important part of the reading process and, in many cases, can aid a reader in information retention (Fowler and Barker, 1974). Because of its similarity to extractive summarization tasks, where highlights reflect the sentences retrieved for an extractive summary, repurposing these summarization methods is a natural starting point. We must recognize, however, that this is not a summarization task, as the evaluation of a highlighting model must be crucially different: summarization tasks produce summaries that are evaluated as relevant but separate from the document, whereas the evaluation of highlight models must be done within the direct context of the document. Though the approach may be similar, the evaluation is fundamentally different and likely influences the structure of the model applied.

To confirm that this is both useful and possible, however, we must first answer two questions: 1) Can humans agree on which sentences in a document should be highlighted? Knowing whether human readers distinguish a specific set of sentences from a document will inform to what extent this task differs from extractive summarization tasks, as it would identify how human highlights and baseline summarization model "highlights" differ, if at all. 2) How do current summarization models, trained for their original application, perform when used out-of-the-box as highlight generators?

To answer these two research questions, we created a novel method of collecting data from human annotators via Amazon Mechanical Turk or any other crowdsourcing platform. This paper presents our experimental methodology as well as the results we obtained from the annotation tasks. Our results reflect promising developments in the application of summarization models and the possibility of a gold dataset.

2 Related Work

Highlighting is one of the most common methods of annotation (Baron, 2009), making it a popular content annotation choice for many industry leaders as well. For our purposes, we are interested in the effect of passive highlights, or highlights that already appear in the text. Passive highlighting has been shown in several studies (Fowler and Barker, 1974; Lorch Jr., 1989; Lorch Jr et al., 1995) to be a useful tool for information retention and comprehension.

Rath et al. (1961) asked human annotators to retrieve the "most representative" sentences in a document, but failed to find significant human agreement for both human-retrieved and machine-retrieved sentences. Daume III and Marcu (2004) found that when instructed to choose the "most important" sentences from a passage, humans still fell short of significant agreement. Though Daume III and Marcu had low expectations for human agreement in the summarization domain, we believe that the effect of inline content, such as highlights, changes the parameters of this task enough to warrant further inspection. In an effort to find a narrower task definition without influencing annotators' concept of "highlighting importance", we centered our task on the expected concrete and finite results of in-line highlights (e.g., increased reading speed, enhanced comprehension).

The connection between highlights and extractive summarization has yet to be established in the computational linguistics community, but we believe the two have a significant relationship. The two tasks, as mentioned above, require the same fundamental process: retrieving important sentences from a document. Certainly, there exists a large body of research on extractive summarization, utilizing many different approaches (see Nenkova and McKeown (2012) and Yogan et al. (2016) for details on extractive summarization techniques), but these works focus on summarization, and thus evaluation is done out of the direct context of the document. That is, summaries are scored based on their completeness, coherence, and importance as a standalone paragraph rather than in line with the text. Because of this shortage of computational research, we must define a proper metric for evaluating machine-generated highlights. While ROUGE scores (Lin, 2004) continue to be the standard metric for evaluating the coherence and correctness of extractive and abstractive summaries, they are difficult to apply to a highlighting task. Instead, we are interested in human reactions to and interest in these highlights.

3 Human Agreement Task

Before developing a capable highlight generation model, we must determine whether it is possible to find a ground truth for highlights in a given document. We designed an Amazon Mechanical Turk experiment to test whether, given an appropriate stimulus question, humans could agree on which sentences in a document should be highlighted.

3.1 Single-Document Interface Design

Our annotation method is based on a novel framework for gathering highlight data from human subjects. Participants, as discussed above, were asked to highlight a specific number of sentences that would make document comprehension easier and faster for another reader. A counter in the left column updated the number of highlights remaining as annotators worked through each document. Clicking anywhere within the boundaries of a sentence would highlight the entire sentence in yellow. Annotators were allowed to highlight and un-highlight as often as needed, but were not able to revisit the same document after moving on.

The agreement task was designed to elicit highlighting data without confounds. Highlighting is simple and only requires a single left click, and there are no color variations. Text size for both the left panel and the document title and content is consistent for the duration of the task.

3.2 Methodology

We asked 40 human annotators to highlight sentences that would make understanding the "main point" of the same document "easier and faster for a new reader". All annotators were required to have a US high school diploma or equivalent to ensure fluency in English. Annotators were instructed to assume their new reader would have no prior knowledge of the document's content. Each annotator was given 5 randomly ordered documents from a selection of 10 documents from the 2001 Document Understanding Conference (DUC) data (DUC, 2001) and was asked to highlight a number of sentences approximating 20 percent of the total number of sentences in each document.

Figure 1: Interface for the human agreement task. Annotators may highlight or unhighlight any sentence by clicking on it. The interface verifies that the annotator has made the required number of highlights.

Before beginning the experiment, all annotators received a brief tutorial on the task controls and completed a brief demographic survey. Annotators were allowed as long as needed to complete the task but were instructed to remain engaged with the experiment until it was completed.

Though this experiment's task remained relatively general, identifying the multiple different highlighting "tasks" is important here, as some human annotators may highlight different content based on their interpretation of which highlights may help comprehension. Previous psychology research (Lindner et al., 1996; Fowler and Barker, 1974; Lorch Jr., 1989) leads us to believe there are many highlighting tasks, including content organization and novel concept retention, both of which would require narrower experiment parameters and annotator pools than our human agreement task.

3.3 Agreement Task Results

Using Krippendorff alpha scores (Krippendorff, 2011), we measured the agreement across the 40 annotators for each document. For visualization and scoring purposes, each document is represented as a binary mapping of annotator highlights (see Fig. 2). Because of the skew of non-highlighted to highlighted sentences, and the binary annotation method in the task, Krippendorff values most closely represent actual human agreement. All alpha scores returned α > 0, and two high-performing documents scored α > 0.2, indicating "fair" agreement. The average score was α = 0.1596. Though we used 40 annotators per document, Krippendorff values reached convergence after 20 annotators (see Fig. 3).

Figure 2: Binary mapping of annotator highlights in a single document ("LIE TESTS FOR POLICE APPLICANTS AFFIRMED"), where each row represents an annotator and each column represents a sentence. Red cells are highlighted sentences; blue cells are unhighlighted sentences.
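For concreteness, the sketch below shows one way to compute this score directly from the binary annotator-by-sentence matrix of Fig. 2. It is a minimal implementation of Krippendorff's alpha for complete nominal binary data applied to a synthetic matrix, not the exact script used in our experiments, and the matrix dimensions are placeholders.

```python
import numpy as np

def krippendorff_alpha_binary(ratings: np.ndarray) -> float:
    """Krippendorff's alpha for a complete binary (nominal) annotation matrix.

    `ratings` has one row per annotator and one column per sentence,
    with 1 = highlighted and 0 = not highlighted (cf. Fig. 2).
    """
    n_annotators, _ = ratings.shape
    ones_per_sentence = ratings.sum(axis=0)              # count of 1s per sentence
    zeros_per_sentence = n_annotators - ones_per_sentence

    # Observed disagreement: 0/1 coincidences within each sentence (unit).
    observed = np.sum(ones_per_sentence * zeros_per_sentence / (n_annotators - 1))

    # Expected disagreement from the marginal totals of the whole matrix.
    n_total = ratings.size
    n_ones = ones_per_sentence.sum()
    n_zeros = n_total - n_ones
    expected = n_ones * n_zeros / (n_total - 1)

    return 1.0 - observed / expected

# Synthetic 40-annotator x 25-sentence example; computing alpha on growing
# annotator subsets mirrors the convergence analysis of Fig. 3.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=(40, 25))
for k in (10, 20, 30, 40):
    print(k, round(krippendorff_alpha_binary(ratings[:k]), 3))
```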

Though these values may seem relatively low, especially compared to those of a standard annotation task where trained annotators mark the corpus, they indicate a trend towards possible agreement for highlighting tasks, in part due to the multiple-task dilemma. Given the number of previously explored highlighting tasks and the relative generality of our own highlighting task, these agreement scores are likely diluted by multiple interpretations of the experiment definition. We believe that with more specific highlighting tasks (e.g., highlighting novel information, highlighting for information retrieval), these scores would represent higher levels of agreement.

Figure 3: Convergence of Krippendorff alpha scores reflecting the agreement between annotators as a function of the number of annotators.

The requirement that annotators highlight the same number of sentences may also affect results. Lorch Jr. (1989) suggests, in agreement with the von Restorff effect (von Restorff, 1933), which states that highlights stand out against a less crowded background, that fewer highlights will more positively affect retention. It is possible that the exact number of highlights is an unnecessary constraint; however, it is important to note that this exercise examines annotators' sentence retrieval, whereas the previous research evaluates the effect of passive highlights on readers.

3.4 Highlighting Intentions

We recognize that the low Krippendorff scores do not necessarily indicate strong highlighting agreement, but the idea that multiple highlighting intentions may dilute agreement metrics pushed us to explore methods of quantifying those potential intentions. In a separate but similar experiment, we asked 40 new Mechanical Turk annotators to complete the same agreement task with additional questions regarding highlighting intent. After each document, annotators were asked to consider the highlights they had just created and detail in a short response why they chose those sentences and what made them different from other sentences. Many annotators, in their own words, repeated the task instructions, though others provided in-depth details about individual sentences chosen.

Figure 4: Initial paraphrased categories of user highlighting intentions for each document view.

To quantify these responses, we summarized each response by its main subject, as represented in Fig. 4. While the vast majority of responses were characterized as "important" or pertaining to a "main topic", as expected from the experiment definition, many annotators also noted that their highlights summarized the content, contained details, or were especially relevant to the overall topic. Some annotators also argued that their highlights contained interesting facts or narratives that related to the main point of the article. These data support our hypothesis that annotators may highlight articles for many different reasons, which may impact agreement levels unless specific intentions are provided. Even then, humans may not be very good at identifying the exact sentences that should be highlighted. However, that does not necessarily mean humans will not agree on highlighted sentences when those highlights are already presented to them inline with the text.

4 Comparison Task

The original human annotation task showed that choosing highlights is subjective to some extent. In this section, we investigate whether annotators express clear preferences when presented with documents in which a few sentences have been highlighted by summarization models. To test this, we developed another interface that allows annotators to compare the output of multiple highlighting models and provide feedback on the quality of each highlight.

Figure 5: Comparison task tutorial with color-coded highlights and a rating questionnaire at the bottom.

Figure 6: Voting options in the comparison task. Participants are presented with two versions of the same text with different highlights. Participants must upvote (green) or downvote (red) each highlight, then give a global grade between 1 and 10 for each of the two versions.

4.1 Multi-Document Comparison Design

The comparison task is similar in style to the previous annotation task; though annotators were presented with two side-by-side versions of the same document in order to compare two highlighting systems, aesthetic details such as font, sizing, and the distribution of the content and instruction panel remain the same (see Fig. 5).

To handle annotation of positive and negative votes on individual highlights, we introduced "thumbs up" and "thumbs down" buttons, displayed after left-clicking anywhere within the boundaries of a highlighted sentence (see Fig. 6). Annotators were asked to vote on every highlight displayed on the page, in addition to rating both versions of highlights on a one-to-ten scale before moving on. In an effort to avoid "lazy" annotation and elicit user preference, annotators were required to give the two versions different ratings, as raters tend to be more consistent in pairwise comparisons than when scoring directly (Agarwala, 2018).

4.2 Methodology

Using two batches, each consisting of 140 Mechanical Turk users with the same English fluency requirement as the human agreement task, annotators were instructed to "upvote" and "downvote" individual highlights that they believed would help identify the main point(s) of the document. Annotators were shown two different highlighted versions, generated from a set of 5 models: the summarization models Recollect (Modani et al., 2015; Modani et al., 2016) and Sedona (Elhoseiny et al., 2016), the SMMRY summarizer (smmry.com), and two models derived from the data collected in the previous human agreement task, representing the most common and least common human-selected highlights. Models were completely randomized and anonymized, both in location (e.g., left or right side of the content frame) and in pairing.
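As an illustration of this pairing protocol, the sketch below draws a random, anonymized pair of highlight versions per document and randomizes which system appears in the left or right frame. The model-name strings and the sampling scheme are assumptions about one reasonable implementation of the design, not the assignment code we actually used.

```python
import random

# Placeholder identifiers for the five highlight sources described above.
MODELS = ["Recollect", "Sedona", "SMMRY", "crowd-most-common", "crowd-least-common"]

def draw_comparison(doc_id: str, rng: random.Random) -> dict:
    """Pick two distinct models for one document and randomize their screen side."""
    left, right = rng.sample(MODELS, 2)   # random, anonymized pair
    # Annotators only ever see "Version A" / "Version B", never the model names.
    return {"doc": doc_id, "left": left, "right": right,
            "labels": {"left": "Version A", "right": "Version B"}}

rng = random.Random(42)
assignments = [draw_comparison(f"duc-2001-{i}", rng) for i in range(10)]
print(assignments[0])
```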

Each batch of annotators worked on 5 documents from the same subset of 10 documents from the 2001 DUC data. Annotators completed a brief demographic survey before the task and had as long as needed to complete the task itself. As in the human agreement task, a shortened version of the task definition remained in the left panel for the duration of the task. All annotators interacted with a controls tutorial before beginning the experiment. Annotators who had worked on the same document sets in the human annotation task were ineligible for this task.

In addition to the demographic survey, annotators were asked to answer a document type ranking question at the end of the task, after considering the "best" versions of highlights they had interacted with during the task. These data were used to gauge the models' impact on user preferences for highlighted documents.

4.3 Comparison Results

4.3.1 Human Evaluation Results

Results from this task reflect a clear preference towards human-generated highlights (see Fig. 7).

Figure 7: Boxplots representing the user vote distributions for each model.

Figure 8: Percentage of "wins" by model, where a win is defined as the model with the higher vote score in each comparison.

As expected, votes for the least common highlights were consistently lower than for any of the other models (for all pairwise comparisons, −25 < t < −14 with p < 0.01; Kruskal-Wallis H test, H = 515.6, p < 0.01). Human-generated highlights performed significantly better than all other models (for all pairwise comparisons, 6 < t < 25, p < 0.01), indicating that there must be information captured by humans that our standard summarization models do not identify. This is a clear signal not only that there exists agreement among readers that "good" highlights exist, but also that highlighting may require more research as an independent area of study.
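These statistics can be reproduced from the raw 1–10 vote lists with standard SciPy calls. The sketch below uses synthetic vote samples with placeholder ranges; it illustrates the tests reported above rather than reproducing our original analysis script.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder vote samples on the 1-10 scale, one array per model.
votes = {
    "bad":          rng.integers(1, 6,  200),
    "Recollect":    rng.integers(3, 9,  200),
    "Sedona":       rng.integers(3, 9,  200),
    "SMMRY":        rng.integers(3, 9,  200),
    "crowdsourced": rng.integers(5, 11, 200),
}

# Omnibus test: do the five vote distributions differ?
h, p = stats.kruskal(*votes.values())
print(f"Kruskal-Wallis: H={h:.1f}, p={p:.3g}")

# Pairwise comparisons of the crowdsourced model against every other model.
for name, sample in votes.items():
    if name == "crowdsourced":
        continue
    t, p = stats.ttest_ind(votes["crowdsourced"], sample, equal_var=False)
    print(f"crowdsourced vs {name}: t={t:.1f}, p={p:.3g}")
```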

Results from the document type ranking question reflected clear appreciation of the highlights shown, despite annotators having potentially seen "bad" highlights. For academic papers and news articles especially, annotators seemed to show an increased desire to see highlights after the task (see Fig. 9 and Fig. 10). We believe this is consistent with actual annotator belief rather than a confound of the task structure, as results for fiction documents showed consistent disinterest across both before-task and after-task responses.

Figure 9: Before- and after-task document type ranking results for academic papers.

Figure 10: Before- and after-task document type ranking results for news articles.

In Fig. 8, we present the percentage of times each model wins, normalized by the number of views that particular model received. In Fig. 11, we expand the granularity of this data to show the distribution of differences between the winning and losing model for each comparison. Both visualizations support the conclusions made above: the crowdsourced model consistently outperforms all competing models on every human evaluation-based metric we apply.

Figure 11: Distribution of win/loss deltas for each win by model.
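The two summary statistics behind Fig. 8 and Fig. 11 are straightforward to derive from the pairwise ratings. The sketch below assumes a list of (model, rating) pairs per comparison; the data layout and example values are hypothetical, intended only to show the bookkeeping.

```python
from collections import Counter

# Each comparison: two (model, 1-10 rating) entries; ratings are forced to differ.
comparisons = [
    [("crowdsourced", 8), ("Sedona", 5)],
    [("Recollect", 6), ("SMMRY", 7)],
    [("crowdsourced", 9), ("bad", 2)],
]

wins, views = Counter(), Counter()
deltas = {}                      # winning model -> list of win/loss margins
for (m1, r1), (m2, r2) in comparisons:
    views.update([m1, m2])
    winner, margin = (m1, r1 - r2) if r1 > r2 else (m2, r2 - r1)
    wins[winner] += 1
    deltas.setdefault(winner, []).append(margin)

win_rate = {m: wins[m] / views[m] for m in views}   # normalized by exposure (cf. Fig. 8)
print(win_rate)
print(deltas)                                        # margins behind Fig. 11
```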

The relatively compact nature of the win/loss distributions in Fig. 11 offers interesting insight into the annotator results. Ideally, we would see large differences between two models, where a user perceives a large difference between model A and model B. However, the figure indicates that in most cases, users rank the two models only 1 to 3 points apart from each other.

Interestingly, the percentage of overlap with human highlights for each model does not reflect the model ordering seen in the above evaluations. When compared to the crowdsourced model, Recollect overlaps 27%, SMMRY 30%, and Sedona 35% of the time. This may be a reflection of the small sample size, or it could reveal new insights about "good" highlights: perhaps some sentences are more salient to a human evaluator, such that some sentences should be weighted as more important than others by an evaluation metric. We would need a larger dataset and further exploration of this topic to confirm.
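The overlap figures above amount to a set intersection over selected sentence indices. A minimal sketch follows; the sentence indices are made up, and computing overlap as the fraction of a model's highlights that the crowd also chose is our assumption about the definition.

```python
def overlap(model_sentences: set[int], human_sentences: set[int]) -> float:
    """Fraction of a model's highlighted sentences also chosen by the crowd."""
    return len(model_sentences & human_sentences) / len(model_sentences)

# Hypothetical sentence-index sets for one document.
crowd = {0, 3, 7, 12}
print(overlap({0, 4, 9, 12}, crowd))   # -> 0.5
```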

4.3.2 Automatic Evaluation Results

Human evaluations such as the one we have presented in the previous sections do not scale well. This section explores an automated metric for evaluating the quality of highlighted sentences. We consider the results of our human ratings alongside ROUGE metrics applied to this dataset. Since our documents come directly from the 2001 DUC dataset, each document has a matching 50-word, human-generated, standardized summary, which we use as a gold version for ROUGE comparisons.

Figure 12: Distribution of ROUGE-1 F1-scores with stopword removal, using each document's respective DUC 50-word summary.

It is important to note that our system highlight sets may differ in number of sentences from the DUC gold files: DUC summaries are a standardized 50 words, whereas our models produce n = 0.2L sentences, where L is the total number of sentences in the document. Because of this mismatch between summary lengths, recall would be a biased metric for our evaluation. Instead, we report the distribution of precision scores according to the ROUGE metric.
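To make this evaluation concrete, the sketch below computes n-gram ROUGE precision between a highlight set and a 50-word reference summary after stopword removal, alongside the n = 0.2L sentence budget. It is a simplified re-implementation for illustration: the stopword list is a toy subset, rounding the budget up is our assumption, and the example strings are invented rather than drawn from the DUC data.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "for", "on", "is", "are"}

def tokens(text: str) -> list[str]:
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def rouge_n_precision(highlight_text: str, reference_summary: str, n: int = 1) -> float:
    """Fraction of candidate n-grams (after stopword removal) found in the reference."""
    def ngrams(words):
        return Counter(zip(*(words[i:] for i in range(n))))
    cand, ref = ngrams(tokens(highlight_text)), ngrams(tokens(reference_summary))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / max(sum(cand.values()), 1)

def highlight_budget(total_sentences: int) -> int:
    """Number of sentences each system highlights: n = 0.2 * L, rounded up here."""
    return math.ceil(0.2 * total_sentences)

print(highlight_budget(23))   # -> 5
print(rouge_n_precision("Police applicants must pass lie tests.",
                        "Court affirms lie tests for police applicants.", n=1))
```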

ROUGE-1 results (see Fig. 12) are promising, reflecting a model ranking pattern similar to that of our human voting evaluation. Again, the uncommon highlights, our "bad" model, show a significantly lower average ROUGE score than all other models, and the "good" crowdsourced model shows a significantly higher average ROUGE score in comparison to all other models (for all pairwise comparisons, −7 < t < −5, p < 0.01; see Fig. 12).

If higher ROUGE scores are indeed indicative of better user approval, then we should see this relative pattern occur across n-gram sizes. Unfortunately, the pattern deteriorates when using ROUGE-2: though our crowdsourced model still performs significantly better than all others with p < 0.01, the gap between the average F1-score of the uncommon highlights and those of the other models compresses, and no longer passes a paired t-test against either the Sedona or Recollect model at the same confidence level (t = 2.7, p = 0.02 and t = 2.3, p = 0.04, respectively; see Fig. 13). This is a problem considering how different the "bad" model has been from the other models in the previous evaluations.

Figure 13: Distribution of ROUGE-2 F1-scores with stopword removal, using each document's respective DUC 50-word summary.
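The significance comparisons in this section are paired tests over per-document ROUGE scores; a minimal SciPy sketch with placeholder score arrays (not our measured values) is shown below.

```python
import numpy as np
from scipy import stats

# Placeholder per-document ROUGE-2 F1 scores for two models on the same 10 documents.
crowdsourced = np.array([0.41, 0.38, 0.52, 0.47, 0.35, 0.44, 0.50, 0.39, 0.46, 0.43])
sedona       = np.array([0.33, 0.36, 0.40, 0.42, 0.31, 0.37, 0.45, 0.30, 0.41, 0.38])

t, p = stats.ttest_rel(crowdsourced, sedona)   # paired t-test across documents
print(f"t = {t:.2f}, p = {p:.3f}")
```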

5 Conclusion

In this paper, we have explored the use of existing extractive summarization models for automatically generating highlights. The results of our experiments indicate that while the agreement on the choice of highlights to select is low, the positive results of the comparison task suggest that humans find machine-generated highlights useful, which indicates that highlight generation is a valuable field of study.

Human evaluation is a necessary step in creating an automated metric to evaluate the quality of highlighting sets. While ROUGE scores show potential as a valuable, inexpensive evaluation metric, it is clear that a reliable method requires more than n-gram overlap and will call for further investigation into what really differentiates human-produced highlights from our baseline models.


References

Aseem Agarwala. 2018. Automatic photography with Google Clips. https://ai.googleblog.com/2018/05/automatic-photography-with-google-clips.html.

Dennis Baron. 2009. A Better Pencil: Readers, Writers, and the Digital Revolution. Oxford University Press.

Hal Daume III and Daniel Marcu. 2004. Generic sentence fusion is an ill-defined summarization task.

DUC. 2001. Document Understanding Conference.

Mohamed Elhoseiny, Scott Cohen, Walter Chang, Brian Price, and Ahmed Elgammal. 2016. Automatic annotation of structured facts in images. In Proceedings of the 5th Workshop on Vision and Language.

Robert L. Fowler and Anne S. Barker. 1974. Effectiveness of highlighting for retention of text material. Journal of Applied Psychology, 59(3):358–364.

Klaus Krippendorff. 2011. Computing Krippendorff's alpha-reliability.

Chin-Yew Lin. 2004. ROUGE: a package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, July.

Reinhard W. Lindner et al. 1996. Highlighting text as a study strategy: Beyond attentional focusing. In Annual Meeting of the American Educational Research Association.

R. F. Lorch Jr, E. Pugzles Lorch, and M. A. Klusewitz. 1995. Effects of typographical cues on reading and recall of text. Contemporary Educational Psychology, 20(1):51–64.

R. F. Lorch Jr. 1989. Text-signaling devices and their effects on reading and memory processes. Educational Psychology Review, 1(3):209–234.

Natwar Modani et al. 2015. Creating diverse product review summaries: a graph approach. International Conference on Web Information Systems Engineering.

Natwar Modani, Balaji Vasan Srinivasan, and Harsh Jhamtani. 2016. Generating multiple diverse summaries. International Conference on Web Information Systems Engineering.

A. Nenkova and K. McKeown. 2012. A survey of text summarization techniques. In Mining Text Data, pages 43–76. Springer, Cham.

G. J. Rath, A. Resnick, and T. R. Savage. 1961. The formation of abstracts by the selection of sentences. Part I. Sentence selection by men and machines. Journal of the Association for Information Science and Technology, 12(2):139–141.

H. von Restorff. 1933. Über die Wirkung von Bereichsbildungen im Spurenfeld. Psychologische Forschung, 18(1):299–342.

J. K. Yogan, O. S. Goh, B. Halizah, H. C. Ngo, and C. Puspalata. 2016. A review on automatic text summarization approaches. Journal of Computer Science, 12(4):178–190.
