
Visualizing Stylistic Variation

Jussi Karlgren and Troy Straszheim
Courant Institute of the Mathematical Sciences
New York University
karlgren,[email protected]

Abstract

Texts vary not only by topic, but by style; indeed, often the variation between texts about the same thing can be just as noticeable as the variation between texts about different things. Some facets of this variation are quite easy to detect, and quite predictable when applied to categorization of texts by genre, functional style, or - tentatively - quality.

Making use of such variation in a retrieval context is quite straightforward in principle; our work consists of an implementation of a visualization tool for document databases.

The issues addressed include 1) choice of stylistic items to investigate, 2) composition of dimensions of variation, and 3) judicious naming of dimensions for presentation. We use principal components analysis to combine our quite large number of stylistic items into the two most significant dimensions of variation and plot the document space under consideration onto a plane. This space can be used as a first or last filter in an information retrieval task.

The composition of the most significant dimensions is naturally corpus dependent, as is the naming of them: our work is tested on Internet and TREC data.

1 Stylistic Variation

Texts vary not only by topic, but by style; indeed, often the variation between texts about the same thing can be just as noticeable as the variation between texts about different things. Human readers process a multitude of stylistic markers, each of which taken separately is almost meaningless, to categorize texts in functional styles or genres, or to assess their position along some continuum of stylistic variation. Some markers of this type are quite easy to identify and compute. We are most interested in examining the stylistic variation based on the specific genres or functional styles (Vachek, 1975) that can be found in electronically published documents, as opposed to very subjective or situation-specific measures such as individual style or even writing quality.

Methods such as ours have been used previously, with some success, for authorship determination in cases where documents have unknown or disputed authors, and, with a lesser degree of success, for readability measurement for educational and mass-market reading materials. Conceivably similar methods could be used for quality determination: determining which of two texts about the same subject in the same genre is the better text in some or any sense.

2 Text in Uniform Guise

Digital information technology has been vectored towards the production of information, and the publishing threshold for information has been lowered dramatically over the past few hundred years. By contrast, comparatively little work has been put into tools for the consumer. Indeed, many of the markers, such as paper quality, typesetting, and even spelling, that readers have previously been able to use to distinguish the New York Times from home-produced handouts have been neutralized through the advent of inexpensive proofreading tools and the World Wide Web. On the Internet the publishing threshold is very low, and the usefulness of the abundance is offset by the less than perspicuous variation in quality, provenance, and author intentions.

3 Aim of these experiments

This paper will describe some experiments made as groundwork for building a tool which will display a set of texts as points on a plane, scattered according to stylistic criteria. We will not go into the experiments in every detail, but we will attempt to describe how we motivate the more important design choices we make. Our hypotheses are that there are important stylistic cues in electronically published texts; that these cues can be used for categorizing or sorting documents in an interactive information retrieval scenario; and that the stylistic variation can most handily be explained in terms of genres.

1060-3425/97 $10.00 (c) 1997 IEEE

Proceedings of The Thirtieth Annual Hawaii International Conference on System Sciences ISBN 0-8186-7862-3/97 $17.00 1997 IEEE

Variable name   Statistic                                      Typical range
WORDS           Text length in words                           31-9228
TT              Type-token ratio                               0.13-0.89
CPW             Average word length in characters              4.59-9.95
WPS             Average sentence length in words               2.45-63.1
P1              Proportion first person pronouns of words      0-105
P2              Proportion second person pronouns of words     0-20
P3              Proportion third person pronouns of words      0-60
IT              Proportion "it" of words                       0-44
NT              Proportion contractions: I'll, you're, etc.    0-33

Table 1: Stylistic items under consideration

4 Stylistic Items

We want to weigh together information from a large number of stylistic items or style markers - parameters where typically each taken by itself will be inconsequential. Combining parameters by weighting them together is a common problem in many branches of science, and there is a battery of algorithms to do so automatically. It is debatable whether simple linear score combinations of textual measurements capture the rather complex underlying interdependencies we aim to measure, but to investigate the power of the variables tested, we elected to take a cautious approach, and to make a minimal number of assumptions about the data.

In general, the items under consideration reflect variation of various kinds: lexical - where texts about the same subject can treat it with technical or lay vocabulary; syntactic - complex syntax may reflect more complex ideas or reasoning about a given subject (Menshikov, 1974; Losee, 1996); textual - texts can be in-depth treatments of a topic or overviews over several topics. An item, naturally, may, and most often will, relate to variation of several kinds simultaneously: "therefore" signifies a certain lexical choice as compared to "thus" but also a certain textual progression as compared to "and"; "tortious interference" not only has a different flavor than "bad influence" but may suggest a different genre.

The standard techniques used in these experiments - principal components analysis, factorial analysis, and discriminant analysis - make assumptions about the distributions of the variables under consideration. Specifically, if nothing is specified, the algorithms assume a variable is normally distributed. These assumptions are unfounded for linguistic data such as the stylistic items in our experiments, and could give misleading results if the variables diverge significantly from the normal distribution. There exist no standard methods for examining multivariate distributions without making assumptions about the variable distributions; in this case we have tested each of the items individually using the Mann-Whitney U test in other experiments, and found them useful and reliable (Karlgren, 1996); we still have no method for treating their variation.
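As a rough illustration of the distribution-free test mentioned above, the sketch below computes the Mann-Whitney U statistic for one stylistic item measured in two groups of texts. The function name and the sample values are hypothetical; only the statistic is computed, not a significance level, and this is not the code used in the cited experiments.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two samples, using average ranks for ties."""
    # Pool the observations, remembering which sample each came from.
    pooled = sorted([(x, 0) for x in sample_a] + [(x, 1) for x in sample_b])
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold ranks i+1..j; tied values share the average.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n_a, n_b = len(sample_a), len(sample_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical average sentence lengths for two hand-labelled groups of texts.
group_one = [21.5, 18.2, 25.0, 19.9]
group_two = [6.1, 9.4, 12.0, 7.7]
print(mann_whitney_u(group_one, group_two))
```

A U value near zero for one of the groups means its values are almost entirely below those of the other group, which is the kind of separation that makes a single item worth keeping.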

5 Text materials

New York University participates in the Text Retrieval Conference (TREC) information retrieval evaluation project jointly with General Electric, Rutgers University, and Lockheed Martin. We have experimented with web retrieval using some TREC tasks; while the TREC tasks are designed to be used on the TREC database, which consists mainly of journalistic material, they are well known in the information retrieval community. We ran a set of typical TREC queries on the Altavista search engine and retrieved the top 60 returned pages. These vary considerably in style. We will use the query "What is the economic impact of recycling tires?" as an example in the following discussion. This is a very small text corpus for this sort of experiment, and the results should be understood to illustrate the techniques used, rather than provide any information about texts on the Internet.

Each text in the test material is processed to obtain, among others, the statistics for the items listed in table 1. The items are suggested by classic readability studies (Chall, 1948; Klare, 1963), our previous experiments (Karlgren and Cutting, 1994; Karlgren, 1996), or by previous work on the computational study of textual variation (Biber, 1988, 1989).
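The kind of processing described here can be sketched in a few lines. The following is a minimal illustration only, not the tool used in the experiments: the tokenizer, the sentence splitter, and the pronoun lists are simplifying assumptions, and the pronoun, "it", and contraction proportions are expressed per thousand words, which is a guess suggested by the ranges in table 1.

```python
import re

# Hypothetical pronoun lists; the paper does not specify its word lists.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}
THIRD_PERSON = {"he", "him", "his", "himself", "she", "her", "hers",
                "herself", "they", "them", "their", "theirs", "themselves"}


def stylistic_items(text):
    """Compute rough versions of the table 1 statistics for one text."""
    # Naive tokenization: runs of letters, optionally with one apostrophe part.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not words:
        raise ValueError("text contains no words")
    # Naive sentence split on ., ! and ?; empty fragments are discarded.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lower = [w.lower() for w in words]
    n = len(words)

    def per_thousand(count):
        # Assumed scale for the proportion items (not stated in the paper).
        return 1000.0 * count / n

    return {
        "WORDS": n,
        "TT": len(set(lower)) / n,                 # type-token ratio
        "CPW": sum(len(w) for w in words) / n,     # characters per word
        "WPS": n / len(sentences),                 # words per sentence
        "P1": per_thousand(sum(w in FIRST_PERSON for w in lower)),
        "P2": per_thousand(sum(w in SECOND_PERSON for w in lower)),
        "P3": per_thousand(sum(w in THIRD_PERSON for w in lower)),
        "IT": per_thousand(lower.count("it")),
        # Crude: any word with an apostrophe counts, possessives included.
        "NT": per_thousand(sum("'" in w for w in lower)),
    }


items = stylistic_items("I think it works. You don't know it!")
print(items)
```

Each text then becomes one row of numbers, and the rows for a whole result set form the matrix that the later analysis works on.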

6 Using the Stylistic Items

So, how do we combine the variation of these items to distinguish functional styles or genres from each other? We may simply pick a couple of parameters from the table, and plot them against each other. A useful strategy might be to pick a couple of parameters with a seemingly high spread, and see what the graph looks like. We find some examples which seem to disperse the material quite well, as in figure 1, and some which let the texts stick together into a corner of the graph to a much higher extent, as in figure 2.

TREC: http://potomac.ncsl.nist.gov/TREC/
Altavista: http://www.altavista.digital.com

Variable     PRIN1      PRIN2      PRIN3      PRIN4      PRIN5
WORDS         0.392610   0.188773  -0.320255   0.144107  -0.212966
TT           -0.335365   0.035366   0.447046  -0.009935   0.539090
CPW          -0.123467   0.531823   0.364009   0.697326  -0.248076
WPS          -0.074589   0.627062   0.189015  -0.687406  -0.299164
P1            0.402501   0.205379   0.096396  -0.003460   0.420170
P2            0.268218  -0.386261   0.427624   0.003216  -0.507121
P3            0.447032   0.128708  -0.004284  -0.020343   0.169437
IT            0.437037   0.178483   0.025705   0.053218   0.224331
NT            0.296300  -0.217421   0.580106  -0.130678   0.015447
Proportion    0.500556   0.142114   0.117911   0.087368   0.076375

Table 2: First principal components

Now, we know that each of these factors is of little consequence taken alone: even when they may have quite high descriptive power, using them for diagnostics is a risky proposition. Random variation, and more distressingly, nonrandom intentional variation may obscure or obfuscate the variation we are interested in. Thus using a combination of factors may be a better idea. As mentioned above, there are standard methods for extracting linear combinations of several variables that covary over a set of objects of study; using principal components analysis we find the relative variable weightings displayed in table 2. The principal components are linear combinations of the various variables under study; the weights indicate the relative importance of the variables - the variables are normalized first, so that their scale of variation will be similar. The proportion row indicates how much of the total variation each component covers: in our case, the first component covers half of the total variation, and the second 14 per cent.
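To make the procedure concrete, here is a minimal sketch of principal components analysis on normalized variables, written with the standard library only. It is an assumption-laden illustration rather than the software used for table 2: the eigenvectors of the correlation matrix are found by power iteration with deflation (a simple stand-in for a full eigen-decomposition), every column is assumed to have non-zero variance, and the toy data are invented.

```python
import math


def standardize(rows):
    """Scale every column to zero mean and unit (sample) variance."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(m)]
    return [[(r[j] - means[j]) / sds[j] for j in range(m)] for r in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized rows."""
    n, m = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(m)]
            for a in range(m)]


def power_iteration(matrix, iterations=500):
    """Return (eigenvalue, unit eigenvector) of the dominant eigenpair."""
    m = len(matrix)
    v = [float(i + 1) for i in range(m)]      # arbitrary non-zero start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(m))
                     for i in range(m))
    return eigenvalue, v


def principal_components(rows, k=2):
    """Return ([(weights, proportion)], scores) for the first k components."""
    z = standardize(rows)
    c = correlation_matrix(z)
    m = len(c)
    components = []
    for _ in range(k):
        eigenvalue, v = power_iteration(c)
        # With normalized variables the total variation equals m.
        components.append((v, eigenvalue / m))
        # Deflate: remove the component just found from the matrix.
        c = [[c[i][j] - eigenvalue * v[i] * v[j] for j in range(m)]
             for i in range(m)]
    # Each text's coordinates: its standardized row projected on the weights.
    scores = [[sum(r[j] * v[j] for j in range(m)) for v, _ in components]
              for r in z]
    return components, scores


# Invented toy data: columns 1 and 2 covary perfectly, column 3 does not.
toy = [[1, 2, 1], [2, 4, -1], [3, 6, 0], [4, 8, -1], [5, 10, 1]]
components, scores = principal_components(toy)
for weights, proportion in components:
    print([round(w, 3) for w in weights], round(proportion, 3))
```

The weights play the role of a column of table 2, the proportions correspond to its last row, and the scores give each text a pair of coordinates on a plane.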

Plotting the texts along the first two principal components, we get the graph in figure 3. The problem with this otherwise interesting plot is that it may not be immediately useful for information retrieval. The dimensions are not readily translatable to plain English descriptors. This is where genres come in handy.
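One way to see why the dimensions resist naming is to list the variables that weigh most heavily in each component. The sketch below copies the PRIN1 and PRIN2 columns from table 2; ranking by absolute weight is our illustrative heuristic here, and the helper name is hypothetical.

```python
VARIABLES = ["WORDS", "TT", "CPW", "WPS", "P1", "P2", "P3", "IT", "NT"]

# Weights copied from the PRIN1 and PRIN2 columns of table 2.
PRIN1 = [0.392610, -0.335365, -0.123467, -0.074589, 0.402501,
         0.268218, 0.447032, 0.437037, 0.296300]
PRIN2 = [0.188773, 0.035366, 0.531823, 0.627062, 0.205379,
         -0.386261, 0.128708, 0.178483, -0.217421]


def top_loadings(weights, names=VARIABLES, count=3):
    """Return the `count` variables with the largest absolute weights."""
    ranked = sorted(zip(names, weights),
                    key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:count]


print(top_loadings(PRIN1))
print(top_loadings(PRIN2))
```

On these figures the first component is led by the pronoun items (P3, IT, P1), with text length close behind, and the second by sentence and word length set against second person pronouns; neither mixture corresponds to an obvious everyday label.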

7 Genres and Stylistic Items

There are no objectively defined genres for our type of material; what genres we want to make use of will depend on the domain of discourse, the data we have recourse to, and what stylistic items we have chosen. Above all, they will depend on reader preferences or our perception of the readers' information needs.

For the purposes of this experiment we make a rough hand-categorization of the texts. We find database listings, error messages, technical texts, journalistic texts, commercial texts, legal texts, announcements, forms, and various other textual and non-textual material. Since the material is small, we divide the material into four quite broad categories: proper text (white triangles, 23), database listings and lists of links (black triangles, 15), governmental announcements (black circles, 11), and error messages (black squares, 2).

Conceivably we could use automatic methods such as clustering techniques to do the same. We would then end up with the same problem as for factorial analysis or principal components analysis: we would have descriptively interesting categories which would be difficult to explain to the reader. We will here go with sloppily manually defined categories.

Some documents were duplicates, and thus the total is 51 rather than 60.

The graphs displayed in the above sections show us that the genres we defined emerge quite nicely in figure 1, whereas the pattern is much less clear in the other two figures.

8 Conclusions

To get explanatory power, a genre analysis of the exemplified kind must be designed to make use of informative dimensions of textual variation. The algorithmically best choices may be too dependent on the variables chosen to give a useful and explanatorily powerful display of the textual material at hand.

References

Douglas Biber. 1988. Variation across speech and writing. Cambridge University Press.

Douglas Biber. 1989. A typology of English texts. Linguistics, 27:3-43.

Jussi Karlgren and Douglass Cutting. 1994. Recognizing Text Genres with Simple Metrics Using Discriminant Analysis. Proceedings of 15th International Conference on Computational Linguistics (COLING), Kyoto. (In the Computation and Language E-Print Archive: cmp-lg/9410008).

Jussi Karlgren. 1996. Stylistic Variation in an Information Retrieval Experiment. In Proceedings of The Second International Conference on New Methods in Language Processing - NeMLaP 2, Bilkent, September 1996. Ankara: Bilkent University.

George R. Klare. 1963. The Measurement of Readability. Iowa Univ. Press.

Robert M. Losee. Forthcoming. Text Windows and Phrases Differing by Discipline, Location in Document, and Syntactic Structure. Information Processing and Management. (In the Computation and Language E-Print Archive: cmp-lg/9602003).

I. I. Menshikov. 1974. K voprosu o zhanrovo-stilevoy obuslovlennosti sintaksicheskoy struktury frazy. (On genre-dependent stylistic variation of the syntactic structure in the clause.) In Voprosy statisticheskoy stilistiki. Golovin et al. (eds.) 1974. Kiev: Naukova dumka; Akademia Nauk Ukrainskoy SSR.

Josef Vachek. 1975. Some remarks on functional dialects of standard languages. In Style and Text - Studies presented to Nils Erik Enkvist. Hakan Ringbom (ed.) Stockholm: Skriptor and Turku: Abo Akademi.



Figure 1: Plot of average word length versus type-token ratio



Figure 2: Plot of first person pronoun content versus average sentence length



Figure 3: Plot of first two principal components
