Visualizing Translation Variation: Othello: A Survey of Text Visualization and Tools



by Zhao Geng, Robert S. Laramee, Tom Cheesman, Andy Rothwell, David M. Berry, Alison Ehrmann (Swansea University, 2011)



Volume 0 (1981), Number 0 pp. 1–7 COMPUTER GRAPHICS forum

Visualizing Translation Variation of Othello: A Survey of Text Visualization and Analysis Tools

Zhao Geng 1, Robert S. Laramee 1, Tom Cheesman 2, Andy Rothwell 2, David M. Berry 3, Alison Ehrmann 2

1 Visual Computing Group, Computer Science Department, Swansea University, UK, cszg,[email protected]

2 College of Arts and Humanities, Swansea University, UK, T.Cheesman,[email protected], [email protected]

3 Political and Cultural Studies, Swansea University, UK, [email protected]

Abstract

As a global icon, Shakespeare has had his plays translated into dozens of languages over roughly 300 years. Many plays have also been re-translated repeatedly into the same language; for example, there are more than 40 translations of Othello into German. Every translation is a different interpretation of the play. This large body of translations reflects changing culture and expresses the individual thought of their authors. The translations connect different regions and offer a retrospective view of their histories. Researchers in Modern Languages are currently collecting a large number of translations of William Shakespeare’s play Othello. Since roughly 2005, we have witnessed a rapid increase in the number of off-the-shelf text visualization tools that can benefit this study. Here we set out to use existing text visualization techniques and tools to gain a better understanding of the various translations of Shakespeare’s work. In particular, we would like to learn which content varies greatly between translations and which content remains stable. We would also like to form hypotheses about the implications behind these variations.

Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: —Line and curve generation

1. Introduction

The goal of this project is to visualize the various translations of Shakespeare’s work, Othello. The initial task is to identify and extract the non-semantic features from the original text of a document corpus. The non-semantic features are the number of words, tokens and patterns in the concordance. Text pre-processing facilitates the construction of the text concordance, term relations, document relevance and other properties of interest. Various visualizations can then be applied to the extracted information. In this document, we present the results of our survey of state-of-the-art techniques and free, off-the-shelf tools for text analysis and visualization.

2. Text Preprocessing

The software WordSmith [Wor96] generates various text attributes, such as word frequency, parts of speech and other statistical information. The outcome of the analysis includes extensive statistical data about word frequencies in the texts (absolute values, values compared with other texts, and values compared with external corpora) and keyword lists (words which occur unusually frequently in comparison with some kind of reference corpus). A screen shot of the software is shown in Figure 1.

The software Concordance [Wat09] is designed for people who need in-depth language or text analysis. It offers a free trial. Concordance [Wat09] can generate indexes and word lists, count word frequencies, compare different usages of a word, analyse keywords, find phrases and publish the analysis results on the web. A screen shot of the software is shown in Figure 2.
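The two statistics described in this section, word frequencies and keyword lists relative to a reference corpus, can be sketched in a few lines of Python. This is only an illustrative sketch and not how WordSmith or Concordance work internally; the function names, the add-one smoothing and the `min_ratio` threshold are all assumptions introduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of Unicode letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def word_frequencies(text):
    """Absolute frequency of every word in the text."""
    return Counter(tokenize(text))


def keywords(text, reference, min_ratio=2.0):
    """Words whose relative frequency in `text` is at least `min_ratio`
    times their (smoothed) relative frequency in `reference`.
    The ratio test is a hypothetical stand-in for a real keyness statistic."""
    target = word_frequencies(text)
    ref = word_frequencies(reference)
    n_target = sum(target.values())
    n_ref = sum(ref.values())
    found = []
    for word, count in target.items():
        rel_target = count / n_target
        # Add-one smoothing so words absent from the reference do not divide by zero.
        rel_ref = (ref[word] + 1) / (n_ref + len(ref) + 1)
        if rel_target / rel_ref >= min_ratio:
            found.append(word)
    return sorted(found)
```

Calling `keywords(translation_text, reference_text)` with one translation and a larger German reference text would return the words that are unusually frequent in that translation under these assumptions.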

© 2011 The Author(s). Journal compilation © 2011 The Eurographics Association and Blackwell Publishing Ltd. Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.


D. Fellner & S. Behnke / Visualizing Translation Variation of Othello : A Survey of Text Visualization and Analysis Tools

Figure 1: The interface of WordSmith [Wor96].

Figure 2: The interface of Concordance [Wat09].

3. State-of-the-art Text Visualization

In this section, we investigate state-of-the-art text visualization from two perspectives: research prototypes for text visualization and free, off-the-shelf visualization tools. We refer the reader to [Hom11], [RVA04] and [iB10] for lists of the available free visualization software.

3.1. Free, Off-the-shelf Text Visualization Tools

In this section, we investigate text visualization tools that are free to the public. Our work can help modern language experts find the visualizations that most benefit the analysis of their collected Shakespeare translations. An overview of the free, off-the-shelf tools for text visualization is shown in Figure 11. We test these freely available tools on 23 German translations of Othello’s speech to the senate from Shakespeare’s play Othello.

A TextArc [Pal02] is a visual representation of an entire text on a single page. It is an advanced combination of an index, a concordance, and a summary of the text. Animation helps the user keep track of the changing relationships between different words, phrases and sentences. In TextArc, the entire text is depicted as an ellipse. Each line is drawn on the outside of the ellipse, which preserves the typographic structure of the text. Each word is drawn inside the ellipse. A word with high frequency is displayed in a brighter color and a larger size. If a word is used more than once, it appears at the center of all of its mentions. TextArc accepts only texts from the TextArc library. Figure 3 shows the visualization of Shakespeare’s play Othello generated by TextArc.

Figure 3: The TextArc [Pal02] visualization of Shakespeare’s Othello in English. The entire text is depicted as an ellipse. Each line is drawn on the outside of the ellipse. Each word is drawn inside the ellipse.
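The placement rule described for TextArc (lines around an ellipse, each word at the center of its mentions) can be sketched as follows. This is an assumed reading of the layout, not TextArc's actual code: the word position is taken to be the plain average of the line positions where the word occurs, and the ellipse radii `a` and `b` and the starting angle are arbitrary choices made here.

```python
import math
from collections import defaultdict


def ellipse_point(i, n, a=1.0, b=0.6):
    """Position of line i of n on an ellipse with semi-axes a and b (assumed values)."""
    angle = 2 * math.pi * i / n
    return (a * math.cos(angle), b * math.sin(angle))


def word_positions(lines):
    """Place each word at the mean of the ellipse points of the lines that mention it."""
    n = len(lines)
    mentions = defaultdict(list)
    for i, line in enumerate(lines):
        for word in line.lower().split():
            mentions[word].append(ellipse_point(i, n))
    positions = {}
    for word, points in mentions.items():
        x = sum(p[0] for p in points) / len(points)
        y = sum(p[1] for p in points) / len(points)
        positions[word] = (x, y)
    return positions
```

Under this reading, a word used only once sits next to its line on the rim, while a word used throughout the text is pulled toward the middle of the ellipse.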

NameVoyager [Wat05], a web-based visualization of historical trends in baby naming, has proven remarkably popular. The method used to visualize the data is straightforward: given a set of name popularity time series, a set of stacked graphs is produced. However, this tool does not accept user-customized data sets.

Tagline Generator [Meh06] is a simple PHP codebase that lets the user generate chronological tag clouds from simple text data sources without manually tagging the data entries. Once the user has populated the data source and configured the generator, it creates a list of all the unique words that have been used and counts how many times each word is used. Next, it identifies the different variations of each word and combines them under the most common variation using the Porter Stemming Algorithm. The size of a word indicates its frequency in the document. The brightness indicates the year of the document: newer documents are brighter. Tagline Generator accepts an XML file deployed on the web. Figure 4 shows the Tagline visualization of 23 German translations of Shakespeare’s play, Othello.
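The counting and merging steps described above can be sketched in Python. Note the assumptions: `crude_stem` is a toy suffix stripper used here only as a stand-in and is not the Porter Stemming Algorithm, and the function names are hypothetical rather than taken from the Tagline Generator codebase.

```python
from collections import Counter, defaultdict


def crude_stem(word):
    """Toy suffix stripper; a stand-in for the Porter stemmer, NOT an implementation of it."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def merged_counts(words):
    """Count words, then fold variants sharing a stem under their most common surface form."""
    surface = Counter(w.lower() for w in words)
    groups = defaultdict(Counter)
    for form, count in surface.items():
        groups[crude_stem(form)][form] = count
    merged = {}
    for forms in groups.values():
        label = forms.most_common(1)[0][0]  # most frequent variation becomes the tag
        merged[label] = sum(forms.values())
    return merged
```

The merged count of each tag could then drive its font size, with the document year mapped to brightness as the text describes.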

Figure 4: Visualizations of two German translations of Othello using Tagline Generator [Meh06]. By moving the scroll bar, the user can see the visualization of each individual document. We tested this tool on 23 German translations of Othello.

ManyEyes [VWvH∗07] is a free website where anyone can upload, visualize, and discuss data. It is an experiment created by the Visual Communication Lab. Input data is supplied by copying and pasting any form of free text. The site provides a number of text visualizations, such as Tag Clouds, Phrase Net and Word Tree. Again, we apply our Othello data, which contains 23 German translations of the play, to the visualizations in this tool. The standard Tag Cloud [BGN08] is a popular text visualization for depicting term frequencies. Tags are usually single words, normally listed alphabetically, and the importance of each tag is shown with font size or color, as shown in Figure 5. Word Tree [WB08] is a graphical version of the traditional keyword-in-context method that enables rapid querying and exploration of bodies of text, as shown in Figure 6. It is a visual search tool for unstructured text, such as a book, article, speech or poem. The user chooses a word or phrase and is shown all the different contexts in which it appears. The contexts are arranged in a tree-like branching structure to reveal recurrent themes and phrases. The size of a word represents its frequency. Phrase Nets [vHWV09] illustrate the relationships between different words used in a text. They use a simple form of pattern matching to provide multiple views of the concepts contained in a book, speech, or poem. For example, given the connecting word "and", two words are linked in the network if they appear together in a phrase of the form "X and Y", as shown in Figure 7.
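The "X and Y" pattern matching behind a phrase net can be illustrated with a minimal Python sketch. It is an assumption-laden simplification and not the ManyEyes implementation: the function name is hypothetical, punctuation is ignored, and the connecting word defaults to "and" but can be set to "und" for German text.

```python
import re
from collections import Counter


def phrase_net_edges(text, connector="and"):
    """Count directed word pairs (X, Y) that occur as 'X <connector> Y'."""
    tokens = re.findall(r"\w+", text.lower())
    edges = Counter()
    # Slide a three-token window over the text and keep windows centred on the connector.
    for left, middle, right in zip(tokens, tokens[1:], tokens[2:]):
        if middle == connector:
            edges[(left, right)] += 1
    return edges
```

The resulting edge counts could set the link thickness in the network, and the summed counts per word could set the word size.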

TagCrowd [Ste08] is a web application for visualizing word frequencies in any user-supplied text by creating a tag cloud or text cloud [BGN08]. The advantage of TagCrowd is that users can define the common words themselves, and these common words are then automatically removed from the original text. Figure 8 shows the tag cloud visualization of our Othello data sets with the common German words removed.
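A tag cloud with a user-defined stop list can be sketched as below. This is not TagCrowd's code; the function name, the linear font-size scaling and the point-size range are assumptions chosen for illustration.

```python
import re
from collections import Counter


def tag_cloud(text, stop_words, top_n=50, min_pt=10, max_pt=40):
    """Return (word, count, font size) triples in alphabetical order,
    after removing the user-supplied stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    stop = {w.lower() for w in stop_words}
    counts = Counter(t for t in tokens if t not in stop)
    top = counts.most_common(top_n)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = (highest - lowest) or 1  # avoid dividing by zero when all counts are equal
    return sorted(
        (word, count, round(min_pt + (count - lowest) * (max_pt - min_pt) / span))
        for word, count in top
    )
```

For German texts the stop list would typically contain articles, pronouns and conjunctions such as "und", "der" and "die".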

Figure 6: The Word Tree [WB08] of our Othello data using ManyEyes [VWvH∗07]. When we enter the word "liebte", all sentence continuations following this word are shown. The size of a word represents its frequency.

Wordle [Jon09] is a tool for generating "word clouds" from text that the user provides. Wordles are more artistically arranged (and often vibrantly colored) versions of a text. They tend to be less directly insightful as information graphics, but often give a more personal feel to a document. The clouds give greater prominence to words that appear more frequently in the source text. Users can tweak their clouds with different fonts, layouts, and color schemes. Figure 9 shows the Wordle visualization of our Othello data sets with the common German words removed.

TokenX [Zil11], created by Brian Pytlik Zillig, is a powerful text analysis, visualization, and play tool that has been customized for use on the Walt Whitman Archive. The text base for the Archive customization currently includes the six American editions of Leaves of Grass published in Whitman’s lifetime and the deathbed edition of 1891-1892. TokenX currently supports the following features: text highlighting based on patterns in words, keyword in context, replacing words with blocks, word concordances sorted alphabetically or by frequency, word usage statistics, word substitution, user-selected replacement of words with images, and creative exploration. TokenX accepts the same input format as Tagline Generator [Meh06]: an XML file deployed on the web. Figure 10 shows two visualizations of our Othello data generated by TokenX.


Figure 5: Tag Clouds [BGN08] of our Othello data set using ManyEyes [VWvH∗07]. The left image depicts the tag cloud for every single word, whereas the right image shows the tag cloud of pairs of words starting with the letter "b". ManyEyes does not provide text preprocessing options for the Tag Cloud, such as removing common words.

Figure 10: The left image shows the tag cloud generated by TokenX. The right image shows the text with the word "Liebte" replaced by a heart shape.

Figure 7: The Phrase Net [vHWV09] of our Othello data using ManyEyes [VWvH∗07]. It links any two words that are joined by white space in the Othello text. The size of a word depicts its frequency.

3.2. Research Prototypes for Text Visualization

Since 2005, we have observed a rapid increase in the number of text visualization prototypes being developed. As a result, various visual representations for text streams and documents have been proposed to effectively present and explore text features. Using the text preprocessing tools introduced in Section 2, we can collect a wide range of text attributes, such as word relationships, word frequency and sentence segmentation. In this section, we list some interesting and novel text visualizations which are able to present some of the extracted text attributes. The prototypes are listed in chronological order.

Figure 8: The TagCrowd [Ste08] visualization of our Othello data set. The common German words, or stop lists, are manually defined and removed from the original text.

Figure 9: The Wordle visualization [Jon09] of our Othello data sets. The common German words are removed.

The ThemeRiver [HHWN02] visualization depicts thematic variations over time within a large collection of documents, as shown in Figure 12. The thematic changes are shown in the context of a time line and corresponding external events. The focus on temporal thematic change within a context framework allows a user to discern patterns that suggest relationships or trends.

Figure 12: ThemeRiver [HHWN02] depicts thematic changes over time in a collection of patents from one company.
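The stacked, river-like layout can be approximated with a short Python sketch that computes, for each theme and time step, the lower and upper boundary of its band. The assumptions are that the stack is centred on the horizontal axis and that all themes share the same non-empty list of time steps; curve smoothing and the event annotations of the real ThemeRiver are left out, and the function name is hypothetical.

```python
def river_layers(series):
    """series maps theme -> list of strengths per time step.
    Returns theme -> list of (lower, upper) band boundaries per time step."""
    themes = list(series)
    n_steps = len(next(iter(series.values())))
    layers = {theme: [] for theme in themes}
    for step in range(n_steps):
        total = sum(series[theme][step] for theme in themes)
        y = -total / 2  # centre the whole stack around the axis (assumed baseline)
        for theme in themes:
            height = series[theme][step]
            layers[theme].append((y, y + height))
            y += height
    return layers
```

For translation data, the "time steps" could be the publication years of the translations and the strengths the frequency of a chosen word or theme in each year.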

A Document Contrast Diagram [Cla08] is a visual summary of the content of two text documents that illustrates shared words, words that are unique to one document or the other, word frequency, the relative size of the two documents, the distribution of emotional tone within the documents, related words based on co-occurrence, and the most common word in each document segment. It uses the familiar bubble technique and colour to contrast topic usage in two bodies of text. Figure 13 shows the Document Contrast Diagram for the 2007 and 2008 US State of the Union (SOTU) Addresses.

Figure 13: In this Document Contrast Diagram [Cla08], the column of squares toward the left hand side represents the segments of text from the left document. The topmost square is the first part of the document. The right hand side is arranged in the same way. The larger of the two documents has 50 segments (squares) and the smaller document proportionally fewer.
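The core comparison behind such a diagram (shared words, words unique to each document, and relative document size) reduces to set operations on word counts. The sketch below covers only that part, with hypothetical function names; emotional tone, co-occurrence and segmentation are not modelled.

```python
import re
from collections import Counter


def word_counts(text):
    """Frequency of each lowercased word in the text."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def contrast(doc_a, doc_b):
    """Split the vocabulary of two documents into shared and unique words."""
    a, b = word_counts(doc_a), word_counts(doc_b)
    shared = {w: (a[w], b[w]) for w in a.keys() & b.keys()}
    only_a = {w: a[w] for w in a.keys() - b.keys()}
    only_b = {w: b[w] for w in b.keys() - a.keys()}
    size_ratio = sum(a.values()) / max(sum(b.values()), 1)
    return shared, only_a, only_b, size_ratio
```

Applied to two translations of the same speech, `only_a` and `only_b` would list the vocabulary that distinguishes one translator's choices from the other's.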

Parallel Tag Clouds [CBW09] combine parallel coordinates and tag clouds to provide a rich overview of a document collection. As shown in Figure 14, each vertical axis represents a category. For example, the categories can be different versions of the Othello translation. The words in each category are summarized in the form of a tag cloud along the vertical axis. When the user clicks on a word, its occurrences on the other vertical axes are connected. Several filters can be defined to reduce the amount of text displayed in each category. This frees screen space and improves the clarity of the visualization.
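The data behind such a view can be sketched as one word column per translation plus a lookup that finds a selected word in every column. This is a simplified illustration with hypothetical function names, not the authors' implementation: each column keeps only its `top_n` most frequent words, sorted alphabetically, and no statistical filtering is applied.

```python
import re
from collections import Counter


def parallel_columns(docs, top_n=3):
    """docs maps a category name (e.g. a translation) to its text.
    Returns category -> alphabetically sorted list of its top_n words."""
    columns = {}
    for name, text in docs.items():
        counts = Counter(re.findall(r"\w+", text.lower()))
        columns[name] = sorted(word for word, _ in counts.most_common(top_n))
    return columns


def links_for(word, columns):
    """Return (category, row index) for every column that shows the word."""
    return [(name, column.index(word)) for name, column in columns.items() if word in column]
```

Connecting the positions returned by `links_for` with line segments reproduces the effect of clicking on a word.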

DocuBurst [CCP09] uses a radial, space-filling layout to depict document content by visualizing structured text. Structured text here refers to the IS-A relationship. For example, a robin or redbreast is a bird, a bird is an animal, an animal is an organism or living thing, and a living thing is an entity. Such structured text forms a tree hierarchy, with entity as the root and robin or redbreast as a leaf. As shown in Figure 15, the root node of a DocuBurst visualization is drawn as a circle. Every other node is assigned to a sector of an annulus. The angular width of each sector is mapped to the number of leaves or children.
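The mapping from an IS-A tree to annulus sectors can be sketched as a recursive layout. The sketch assumes that angular width is proportional to the number of leaves below a node (the text also allows the number of children), represents the tree as nested dictionaries, and uses hypothetical function names; the "fish" and "trout" nodes in the usage note are invented for illustration.

```python
def leaf_count(children):
    """Number of leaves under a node given its dict of children (a leaf counts as 1)."""
    if not children:
        return 1
    return sum(leaf_count(sub) for sub in children.values())


def layout(children, start=0.0, extent=360.0, depth=1):
    """Return (name, ring depth, start angle, angular width) for every node below the root.
    Each child receives a share of its parent's extent proportional to its leaf count."""
    total = leaf_count(children)
    sectors = []
    angle = start
    for name, sub in children.items():
        width = extent * leaf_count(sub) / total
        sectors.append((name, depth, angle, width))
        sectors.extend(layout(sub, angle, width, depth + 1))
        angle += width
    return sectors
```

For a root "animal" with children `{"bird": {"robin": {}, "redbreast": {}}, "fish": {"trout": {}}}`, the bird sector spans two thirds of the first ring because it has two of the three leaves.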

SparkClouds [LRKC10] integrates sparklines into a tag cloud to convey trends across multiple tag clouds, as shown in Figure 16. The sparklines present the trend of each tag over time. A controlled study comparing SparkClouds with two traditional trend visualizations (multiple line graphs and stacked bar charts) as well as Parallel Tag Clouds showed that SparkClouds is more effective at showing trends over time.
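The data preparation for such a view can be sketched as follows: pick the top words at the last time point, then collect each word's frequency across all time points and render it as a small text sparkline. The function names and the block-character rendering are assumptions made for this illustration and are not taken from the SparkClouds prototype.

```python
import re
from collections import Counter

BARS = "▁▂▃▄▅▆▇█"


def spark_series(snapshots, top_n=25):
    """snapshots is a chronological list of texts.
    Returns word -> list of counts per snapshot for the top_n words of the last snapshot."""
    counters = [Counter(re.findall(r"\w+", text.lower())) for text in snapshots]
    top = [word for word, _ in counters[-1].most_common(top_n)]
    return {word: [counter[word] for counter in counters] for word in top}


def sparkline(values):
    """Render counts as block characters scaled to the largest value."""
    highest = max(values) or 1
    return "".join(BARS[(value * (len(BARS) - 1)) // highest] for value in values)
```

With translations ordered by publication date as the snapshots, each tag in the cloud could be followed by its sparkline to show how its use changes from translation to translation.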

ManiWordle [KLKS10] provides flexible control so that the user can directly manipulate the original Wordle to change the layout, colour and other properties, as shown in Figure 17.


Figure 15: DocuBurst [CCP09] showing a fully expanded tree structure of the IS-A relationship.

Figure 16: SparkClouds [LRKC10] showing the top 25 words from the US Presidential Speeches for the last time point in a series.

References

[BGN08] BATEMAN S., GUTWIN C., NACENTA M.: Seeing Things in the Clouds: The Effect of Visual Features on Tag Cloud Selections. In HT ’08: Proceedings of the nineteenth ACM conference on Hypertext and hypermedia (New York, NY, USA, 2008), ACM, pp. 193–202.

[CBW09] COLLINS C., VIÉGAS F. B., WATTENBERG M.: Parallel Tag Clouds to Explore and Analyze Faceted Text Corpora. In IEEE Symposium on Visual Analytics Science and Technology (2009), Computer Society, pp. 91–98.

[CCP09] COLLINS C., CARPENDALE M. S. T., PENN G.: DocuBurst: Visualizing Document Content using Language Structure. Computer Graphics Forum 28, 3 (2009), 1039–1046.

[Cla08] CLARK J.: Document contrast diagrams, 2008. http://neoformix.com/2008/DocumentContrastDiagrams.html, Last Access Date: 2011-2-18.

[HHWN02] HAVRE S., HETZLER E., WHITNEY P., NOWELL L.: ThemeRiver: Visualizing Thematic Changes in Large Document Collections. IEEE Transactions on Visualization and Computer Graphics 8, 1 (2002), 9–20.

[Hom11] HOME K.: Visualization Software, Feb 2011. http://www.kdnuggets.com/software/visualization.html, Last Access Date: 2011-2-18.

[iB10] ŠILIĆ A., BAŠIĆ B. D.: Visualization of Text Streams: A Survey. Knowledge-Based and Intelligent Information and Engineering Systems 6277, 6 (2010), 31–43.

[Jon09] FEINBERG J.: Wordle: Beautiful Word Clouds, 2009. http://www.wordle.net/, Last Access Date: 2011-2-18.

[KLKS10] KOH K., LEE B., KIM B. H., SEO J.: ManiWordle: Providing Flexible Control over Wordle. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010), 1190–1197.

[LRKC10] LEE B., RICHE N. H., KARLSON A. K., CARPENDALE M. S. T.: SparkClouds: Visualizing Trends in Tag Clouds. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010), 1182–1189.

[Meh06] MEHTA C.: Tagline Generator - Timeline-based Tag Clouds, 2006. http://chir.ag/projects/tagline/, Last Access Date: 2011-2-18.

[Pal02] PALEY W. B.: TextArc: An Alternative Way to View Text, 2002. http://www.textarc.org/, Last Access Date: 2011-2-18.

[RVA04] RAJMAN M., VESELY M., ANDREWS P.: State of the Art, Evaluation and Recommendations Regarding Document Processing and Visualization Techniques, 2004. http://arxiv.org/abs/cs/0412114, Last Access Date: 2011-2-18.

[Ste08] STEINBOCK D.: TagCrowd: Joining the Crowd Together, 2008. http://tagcrowd.com/, Last Access Date: 2011-2-18.

[vHWV09] VAN HAM F., WATTENBERG M., VIÉGAS F. B.: Mapping Text with Phrase Nets. IEEE Transactions on Visualization and Computer Graphics 15, 6 (2009), 1169–1176.

[VWvH∗07] VIÉGAS F. B., WATTENBERG M., VAN HAM F., KRISS J., MCKEON M.: ManyEyes: A Site for Visualization at Internet Scale. IEEE Transactions on Visualization and Computer Graphics 13, 6 (2007), 1121–1128.

[Wat05] WATTENBERG M.: Baby Names, Visualization, and Social Data Analysis. In Proceedings of 2005 IEEE Symposium on Information Visualization (INFOVIS) (2005), pp. 1–6.

[Wat09] WATT R. J. C.: Concordance 3.3, July 2009. http://www.concordancesoftware.co.uk/, Last Access Date: 2011-2-18.

[WB08] WATTENBERG M., VIÉGAS F. B.: The Word Tree, an Interactive Visual Concordance. IEEE Transactions on Visualization and Computer Graphics 14, 6 (2008), 1221–1228.

[Wor96] WORDSMITH.ORG: WordSmith Tools, 1996. http://www.lexically.net/wordsmith/index.html, Last Access Date: 2011-3-16.

[Zil11] ZILLIG B. P.: TokenX: a text visualization, analysis, and play tool, 2011. http://segonku.unl.edu/cocoon/tokenxcather/index.html?file=../xml/base.xml, Last Access Date: 2011-2-18.


Figure 11: From left to right, top to bottom: TokenX [Zil11], TagCrowd [Ste08], TextArc [Pal02], NameVoyager [Wat05], Tagline Generator [Meh06], ManyEyes [VWvH∗07] and Wordle [Jon09].


Figure 14: A parallel tag cloud [CBW09] revealing the differences in drug prevalence amongst the circuits.

Figure 17: The final layouts produced by a user with ManiWordle [KLKS10] (left) and the original Wordle visualization. The text is a Wikipedia entry on Yu-Na Kim.
