
REPRESENTING TEXT ON THE COMPUTER: LESSONS FOR AND FROM PHILOSOPHY

ALLEN RENEAR, BROWN UNIVERSITY, U.S.A.

Practitioners and researchers in the area of computer text processing have reached the conclusion that documents, at least qua intellectual objects, are best represented as syntactically complex structures of 'content objects'. 1 Some familiar examples of content objects in this sense are: chapters, sections and subsections, titles and headings, prose paragraphs, sentences, prose extracts, verse extracts, lists, list items, tables, figures, equations, bibliographical citations and footnotes. In software and 'textbase' development the superiority of the content object approach over other possible models for representing text can be shown easily: it is by far the simplest and most functional way to create, modify and format texts, and it is required to effectively support document management, text retrieval, browsing, various kinds of analysis and many other sorts of special processing. Finally, texts represented according to this model are much more easily shared among different software applications and computing systems. Not surprisingly, this view of text, which turns out to be theoretically illuminating as well as practical, is in part a generalization of portions of the methodologies, theories and practices of the many humanistic disciplines in which texts, books and documents figure importantly. One may perhaps say, somewhat more ambitiously, that the reason this model of text is so functional and effective is that it reflects what text really is.

As well as the general and familiar content objects mentioned above, special disciplines and genres have developed specialized content objects. An art exhibition catalogue, for instance, might consist of catalogue entries which are made up of the sub-components artist, provenance, medium and size; a dramatic script might contain acts, scenes, stage directions, speaker designations and dialogue lines; poetry typically is composed of various kinds of metrical units. The reader can no doubt supply many other examples from the texts that are typical of his or her special discipline. Although far from the most

1 This essay is to a large extent based on collaborative work with James H. Coombs, Steven J. DeRose, David Durand and Elli Mylonas, although none of them may be held responsible for any specific claim they may wish to deny. It also owes much to conversations, spanning several years, with Lou Burnard, Claus Huitfeldt, Michael Sperberg-McQueen and the members of the Brown University Computing in the Humanities Users' Group.

222 BULLETIN JOHN RYLANDS LIBRARY

structured and specialized texts one finds, philosophical texts also frequently contain discipline-specific content objects. Some examples are: arguments, axioms, theorems, definitions and counterexamples. Many of these philosophical content objects are themselves syntactically constrained structures of other content objects. A definition, for instance, typically consists of a definiendum and a definiens, exactly one of each; the definiens in turn consists of one or more clauses - and so on. In some texts, such as Spinoza's Ethics or articles by contemporary foundationalist epistemologists, there is a rich and typographically prominent set of special objects being used to structure and present their arguments. In other cases these objects seem implicit in the running prose and their identification as textual entities very much a matter of 'subjective' interpretation and even substantive philosophical controversy.

Recently there has been an explosion of interest in this approach to text and researchers from many disciplines and a wide variety of projects and initiatives are working on both practical and theoretical problems. This essay presents an introduction to the general theory of texts as structures of content objects. Revealing the special interests of its author it both makes liberal use of some philosophical notions in presenting this approach and also makes special reference to applications to philosophical texts. 2 More generally it presents a view of what texts are and how they work - and suggests how new forms of interaction with texts will develop under the influence of emerging trends in information technology.

REDISCOVERING THE 'PARTS OF A BOOK'

Content-oriented text processing

In the 1960s and 1970s a number of computer scientists and software designers came independently to the conclusion that the best way to design text processing systems was to base them on the view that there are certain features of texts - such as titles, chapters, paragraphs, lists and the like - which are fundamental and salient and that all processing of texts (editing, formatting, analysis, etc.) should be implemented indirectly through these features. 3 These features we call

2 It should be noted that in what follows we generally take the perspective of the scholar whose interests are primarily philosophical and not historical, literary or bibliographical. So, for instance, our scholar's principal concern is with understanding, evaluating and communicating an opinion on, say, Descartes's views on the self, and not with the details of their historical development or transmission in manuscripts and printed editions. This focus is somewhat at variance with the dominant culture of contemporary text encoding, which tends to emphasize the literary, linguistic and historical, if not bibliographical, disciplines.

3 For early discussions of the application of this approach to text processing see Reid (1980) and Goldfarb (1981). For a recent and more general discussion, which is drawn on extensively in the present essay, see Coombs, Renear and DeRose (1987). References to specific applications of the approach to literary documents, scholarly text processing and the development of literary 'textbases' are given later in this essay.


'content objects' and this approach to text processing 'content-oriented text processing'. Although we use the expression 'objects' in what follows, other nouns which have been used similarly include 'elements', 'components' and 'parts'. 4 We say content objects to stress the connection with the intellectual content of the text or document, in contrast to its material form; others have also used the adjectives 'logical', 'literary' and 'editorial' to contrast the content-related parts of a book to its physical or material form. 5

Initially the advantage of this approach was seen in simplifying the creation and formatting of the document. In content-oriented text processing the author or editor explicitly identifies the major editorial components, or content objects (titles, sections, paragraphs, citations, axioms, theorems, footnotes and so on), but does not directly indicate what formatting is to be applied to these objects. Exactly how this identification is done does not matter. One could use a variety of standard computer input techniques. For instance, one could type mnemonic names which are distinguished from regular text by 'delimiters':

<title>What There Is
<leading quotation>Everything is water - Thales
<chapter title>Things
<section title>Material Things
<paragraph>When we look around us . . .
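How software might recognize such delimited markup can be sketched briefly. The following Python fragment is illustrative only: the tag names and the '< . . . >' delimiter convention are taken from the example above, but the function itself is hypothetical, not part of any of the systems discussed here.

```python
import re

# Split a marked-up string into (kind, value) tokens, where kind is
# either 'tag' (a content-object identifier) or 'text' (ordinary prose).
def tokenize(source):
    tokens = []
    for piece in re.split(r'(<[^>]+>)', source):
        if not piece:
            continue
        if piece.startswith('<') and piece.endswith('>'):
            tokens.append(('tag', piece[1:-1]))
        else:
            tokens.append(('text', piece))
    return tokens

sample = "<title>What There Is<paragraph>When we look around us . . ."
for kind, value in tokenize(sample):
    print(kind, repr(value))
```

Once the tags are separated from the prose in this way, any later processing phase - formatting, retrieval, analysis - can act on the declared objects rather than on raw characters.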

Including this sort of descriptive markup along with text is in fact the traditional way of doing content-oriented text processing, as it began in the late 1970s and early 1980s, and continues to be practised by the large community of writers using such text processing software as Scribe, TeX/LaTeX, troff/ms, and Script/GML. However, this technique obviously requires a lot of extra typing and, unprocessed as it usually was in these systems, creates a visual display which is hard to read and understand quickly.

4 The phrase 'parts of a book' is of course well established in bibliography and style manuals, but its use conflates, interestingly, the objects that we are distinguishing as 'logical' and 'physical'. McKerrow's 'parts of a printed book' are decidedly parts of the 'material' book, which is his subject matter: when he discusses terminology for gatherings or quires, which are exemplary material 'parts of a book', he hastily rejects 'section' as 'having the disadvantage that . . . [it] is commonly applied to divisions of the book's literary content' (1927, 25; italics in the original). Similarly Esdaile's chapter 'Parts of a book' expands the notion of material part to include the preface, introduction and other arguably editorial parts (1967, 32-9). The Chicago manual of style also begins with a major division titled 'Parts of a book', but, as is not surprising in a manual for editors and not bibliographers, includes not only discussions of editorial objects such as chapters and the like, but even the special components of letters, diaries and poetry (1982, 3-34).

5 While adjectives like 'editorial', 'logical', 'literary' and the like are frequently used synonymously, they may also be used to mark fine distinctions. The present essay is using 'content' to suggest the broadest possible connection with the sense or meaning of the document as opposed to its material form. We continue to use the noun 'object' despite the currency of 'element' because the latter word has become, in the nomenclature of text processing standards development, the technical term for objects of the sort we are describing - and we are exploring an intuitive 'pre-theoretical' notion, not expounding a stipulated technical term. The terminology for these distinctions is unstable not only for the usual reasons, but also because the distinctions to be marked are elusive and controversial, requiring a considerably subtler account of text and communication than has yet been developed.

Alternatively one could also use a key sequence, striking, say, the key marked 'CTRL' and then the key marked 'T' and then typing 'What There Is'. The identifying codes would then be inserted automatically into the file by the software. Or one could also use 'pull down' menus, light pens or any of the more exotic devices for communicating with a computer. We could even imagine this someday being done by speaking:

Chapter title. Material things. New paragraph. When we look around us . . .

Here, as with traditional dictation, paralinguistic features such as intonation and gesture might be used to differentiate object identifications from content text.

Again, how the author expresses the declaration of an object is not significant, only that the objects are declared somehow and this identification is recorded in the computer file, providing semantic and pragmatic information which may be used for subsequent processing. The procedure or the mechanism for identifying content objects is independent of whether or not the text processing is content-oriented. What the author or editor does not do is indicate, specifically, how the content objects will be formatted or otherwise processed. The author simply declares what the object is (a title), not how it should look (centred, larger font size, etc.). The formatting which occurs is a consequence of the author's identification, occurring (in the 'batch' oriented systems described here) at a later phase of processing.

In the earliest content-oriented systems, the identity of the content object was indicated to the author viewing the screen simply by the presence of the markup tag, distinguished from plain text by its delimiters. But like the manner in which the author declares the identity of text objects, the manner of reflecting text object identity back to the author, editor or other reader is also not a necessary characteristic of content orientation. The identity of objects may be represented by the application software in various ways other than with delimited markup tags, including traditional formatting (leading, font changes, etc.), colours, icons and so on.

In content-oriented text processing the formatting of the different editorial objects is determined by automated reference to a set of formatting rules specified for each kind of object that occurs in a text. For instance, associated with the object axiom might be a set of formatting rules which specify enumerating the object, placing extra space before and after the object, reducing type size and indenting from the left and right margins and so on. In 'batch' formatting, which characterizes traditional text processing software such as Scribe, troff, Script and TeX, formatting occurs during a specific formatting


phase, usually immediately prior to printing. In interactive text processing systems at least some formatting rules take effect immediately as the author enters the content-identified text of the document. That is, as soon as a content object is identified by the author or editor the formatting rules for that object are immediately applied to the text by the software and their effects are apparent on the video display screen. This can allow traditional typographical devices (indentation, extra leading, font changes, etc.) to be used to indicate the identity of the content objects to the composing author or editor.

For instance, in the example of a quotation (extract) the author might choose quotation from a menu and then begin typing the quotation - which would immediately begin to appear with the appropriate formatting, extra leading, indentation, etc., creating on the screen, as the author or transcriber types, the characteristic appearance of a block quotation or extract. Interactive formatting creates the sense of so-called 'WYSIWYG' (What You See Is What You Get) processing which is familiar to most users of micro-computer text processing systems. 6 Like the details of the mechanics of object identification this too - whether a system formats interactively or not - is completely independent of whether or not the software is content-oriented. 7

Figure 1 shows a text source file with visible identification of the text elements. Notice that this file contains two sorts of things: text and, in delimiters ('< . . . >'), the markup tags that identify the editorial objects. The use of various kinds of markup to identify text objects is discussed further below. Independently of the marked up text, and often in a separate file, rules that map formatting procedures to markup tags are stored, as shown in Figure 2. When formatting occurs, the formatter processing the text file refers to the rule file for the specific procedures that are to be performed on each text object. The formatted result, produced from files like those depicted in

6 The expression 'WYSIWYG' (for 'What You See Is What You Get') is commonly used to describe the sort of systems where interactive formatting takes place during the creation and editing process. However, it is misleading, taken literally, as the interactive nature of this environment is logically quite independent of whether or not the image created on the editing screen accurately resembles the image which will be rendered on some particular printer or other intended output device. To appreciate this independence note that many contemporary documents will never be printed, existing only in electronic form, but they are still described as WYSIWYG as long as they are typically processed with interactive on-screen formatting.

This clarification was necessary as it is not uncommon to present WYSIWYG and content-oriented text processing as opposites. In fact, (1) the characterization 'WYSIWYG' is ambiguous, sometimes meaning interactive formatting and sometimes the ability to accurately preview intended formatting, and (2) in both senses WYSIWYG text processing and content-oriented text processing are logically independent characteristics of a text processing system: content-oriented text processing can be interactive, or not, and have the ability to faithfully preview printing, or not.

Theoretically principled categorization of text processing software is a difficult matter and most casual terminology and taxonomy is quite confused and misleading. Even careful efforts probably still fall short of an adequate comprehensive taxonomy. See Furuta et al. (1982) and Coombs, Renear and DeRose (1987).


<BOOK><TITLE>Russell's Objection to Kant</TITLE>
<AUTHOR>Ellen Castleman</AUTHOR>
<CHAPTER><CH_TITLE>Russell on Kant's Synthetic A Priori</CH_TITLE>
<BLOCK_QUOTATION>Apart from minor grounds on which Kant's philosophy may be criticized, there is one main objection which seems fatal to any attempt to deal with the problem of a priori knowledge by his method. . . . Our nature is as much a fact of the existing world as anything, and there can be no certainty that it will remain constant. It might happen, if Kant is right, that to-morrow our nature would so change as to make two and two become five. Bertrand Russell, <El>The Problems of Philosophy</El></BLOCK_QUOTATION>

<P>Objections to Kant's notion of the synthetic a priori might be broadly divided into two sorts: empiricist and rationalist. The empiricist objection is that the categories of analytic and synthetic are exactly coextensive with the categories of a priori and a posteriori, respectively: there are no synthetic a priori propositions. <FN>David Hume and A. J. Ayer are prominent proponents of the empiricist view of the a priori.</FN></P>

<P>The rationalist criticism, on the other hand, is directed at the logical or epistemological status afforded synthetic a priori propositions by Kant's <Q>Critical Philosophy</Q>. Kant's account is seen by these critics as too subjective or contingent. [...] </P></CHAPTER></BOOK>

FIGURE 1: TAGGED SOURCE TEXT FILE

Figures 1 and 2, is shown in Figure 3. Note that presentational markup (formatting and graphic devices) has been systematically generated by processing routines which mapped descriptive markup to formatting procedures (or procedural markup). Although formatting remains the most common use of markup processing rules, other kinds of processing can also be specified with similar rules, as discussed in the next section.
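The mapping from descriptive markup to formatting procedures can be pictured as a simple lookup. The following Python sketch is a toy model of the rule-file mechanism, not the actual software of any system named in this essay; the rule values echo Figure 2, and the document content is abbreviated for illustration.

```python
# A miniature 'rule file': each content object is mapped to the
# formatting procedures applied to every instance of that object.
rules = {
    'title':           'center, helvetica 24',
    'paragraph':       'skip 8, times 12, justified',
    'block_quotation': 'skip 10, times 10, indent -5 -5',
}

# A miniature 'source file': descriptive markup only, no formatting.
document = [
    ('title', "Russell's Objection to Kant"),
    ('paragraph', 'Objections to Kant . . .'),
]

def format_document(document, rules):
    """Map each content object to the procedures its rule specifies."""
    return [f"[{rules[kind]}] {text}" for kind, text in document]

for line in format_document(document, rules):
    print(line)
```

Because the rules live apart from the text, changing a single rule reformats every instance of the corresponding object without touching the document itself.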

<title> = [center, helvetica 24]
<author> = [skip 28, center, times 24, italic]
<chap_title> = [skip 36, quad left, times 14, bold]
<paragraph> = [skip 8, times 12, justified]
<block_quotation> = [skip 10, times 10, indent -5 -5]

FIGURE 2: SIMPLE RULE FILE FOR FORMATTING


Russell's Objection to Kant

Ellen Newcastle

Russell on Kant's Synthetic A Priori

Apart from minor grounds on which Kant's philosophy may be criticized, there is one main objection which seems fatal to any attempt to deal with the problem of a priori knowledge by his method. . . . Our nature is as much a fact of the existing world as anything, and there can be no certainty that it will remain constant. It might happen, if Kant is right, that to-morrow our nature would so change as to make two and two become five. Bertrand Russell, The Problems of Philosophy.

Objections to Kant's notion of the synthetic a priori might be broadly divided into two sorts: empiricist and rationalist. The empiricist objection is that the categories of analytic and synthetic are exactly coextensive with the categories of a priori and a posteriori, respectively: there are no synthetic a priori propositions. 1

The rationalist criticism, on the other hand, is directed at the logical or epistemological status afforded synthetic a priori propositions by Kant's "Critical Philosophy". Kant's account is seen by these critics as too subjective or contingent. [...]

1 David Hume and A. J. Ayer are prominent proponents of the empiricist view of the a priori.

FIGURE 3: RESULTING FORMATTED OUTPUT

Figures 4 and 5 show systems that combine interactive formatting with content orientation. These examples show that content-oriented text processing can be implemented without having the user enter any special codes (pull-down menus being used) and with automatic interactive formatting. The user has the option of seeing or not seeing the markup identifiers. Unfortunately Word (Fig. 5), like current versions of other popular word processors, does not yet allow the user to 'nest' text objects to reflect hierarchical structure: the text is instead represented as a sequence of objects. Figure 6 was discovered in a colleague's office - its significance is left to the reader as an exercise. It is another piece of evidence that there is really nothing particularly new or distinctive about the content object approach to text.

Advantages of content-oriented text processing for authoring and editing

Although our introductory discussion has focused on the advantages for creation and formatting, there turn out to be many other advantages to text processing systems which operate by the systematic identification of content objects. It will be helpful to summarize some


[Screenshot: the tagged Russell article being edited in SoftQuad's Author/Editor, with menus (File, Edit, Search, View, Markup, Entities, Special) and delimited markup tags displayed alongside the interactively formatted text.]

FIGURE 4: TEXT PROCESSING IN SOFTQUAD'S AUTHOR/EDITOR

[Screenshot: the same Russell article in Microsoft Word for Windows, with named styles such as 'Block Quotation', 'Paragraph' and 'Footnote' applied to the text as a sequence of formatted objects.]

FIGURE 5: TEXT PROCESSING IN MICROSOFT WORD FOR WINDOWS


NAME ________ DATE ________

Friendly Letter

In a friendly letter you share your ideas and experiences with the person you are writing to. The Heading tells your address and the date. The Greeting is the way you say "hello." The Body tells your message. The Closing is the way you say "good-by." The Signature is your name.

Directions: Write a letter to a friend on the lines below. Use your address. Put the five parts in the lines where they belong. The parts are labeled for you.

[Worksheet: ruled lines with labelled blanks for (Heading), (Greeting), (Body), (Closing) and (Signature), completed in a child's handwriting. From Level 12, Usage (Holt, Rinehart and Winston, Publishers).]

FIGURE 6: A FRIENDLY LETTER

of them. The advantages are divided below into three major categories: authoring and editing, production, and subsequent use of textual data. These categories follow the typical lifecycle of a document, not only through publication, but beyond. 8

8 This categorization of advantages is adapted from DeRose, Durand, Mylonas and Renear (1990).


(1) Composition is simplified. Formatting considerations make no claims on the attention of the author or editor during content object-driven text processing: rather than remembering both (i) the required style conventions relevant to the text being produced and (ii) the formatting commands used by the software in order to format the text according to those conventions, the author instead simply identifies each text element, perhaps by choosing a name from a menu, and the appropriate formatting takes place automatically - immediately if there is some interactive formatting, and later if there is only batch formatting. The author can thus dispense with two sorts of reference works which, while otherwise necessary for scholarly text processing, are really irrelevant to scholarly writing: (i) style manuals and (ii) software manuals. 9

Using the jargon of software engineering one can say that content objects let the author deal with the document at a 'level of abstraction' appropriate to the authorial role - identifying a text object as a quotation, paragraph or verse line is an authorial task, making decisions to embolden or centre a title is the task of a compositor or designer.

(2) Writing tools are supported. Content objects support 'structure oriented editors' and other composition tools. A structure oriented editor is text processing software that 'knows' about the syntax of the special editorial objects which are found in the author's document and so can intelligently support the author in entering, editing and manipulating these objects. It will allow the author to perform operations like 'moves' and 'deletes' in terms of these natural meaningful parts (words, sentences, paragraphs, footnotes, sections, etc.) rather than relying on the mediation of accidents: line displacements, numbers or arbitrarily marked regions. Rather than having to observe that the sentence to be moved runs from here to there or that a quotation starts on line 30 and ends on line 40, the author can directly instruct the software to delete this sentence, or move this quotation. Moreover an extensible editor could also be programmed to deal intelligently with discipline-specific objects such as definitions and theorems. In composing a logical proof, for instance, the software would naturally move the author through the different components - lines, justifications, conclusion - prompting for the required subcomponents and rejecting syntactically invalid arrangements (a line with

9 The best-selling book by a university press in the United States is A manual for writers of term papers, theses and dissertations (Turabian, 13th edn, 1982). The importance of such style manuals will be diminished by the growing popularity of the content object approach. Indeed, many universities are already maintaining software 'stylesheets' for popular word processing programs in order to ensure that theses and dissertations conform to institutional guidelines. Degree candidates using these stylesheets never have to actually learn or directly apply their university's style guidelines - they only specify the major editorial objects and the appropriate formatting occurs automatically.


no justification, for instance). Again, this kind of support allows the author to address the document in terms appropriate to the authorial role, that is, in terms of content and structure.
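The kind of syntactic check such an editor might perform on proof lines can be sketched in a few lines of Python. The object model here is invented for illustration; no actual structure editor is being described.

```python
# A proof line, as a content object, must have both a formula and a
# justification; a structure-oriented editor can reject an invalid
# arrangement at entry time, rather than after the fact.
def add_proof_line(proof, formula, justification=None):
    if not formula:
        raise ValueError('a proof line requires a formula')
    if justification is None:
        raise ValueError('a proof line requires a justification')
    proof.append({'formula': formula, 'justification': justification})
    return proof

proof = []
add_proof_line(proof, 'P -> Q', 'premise')
add_proof_line(proof, 'P', 'premise')
add_proof_line(proof, 'Q', 'modus ponens 1, 2')
try:
    add_proof_line(proof, 'R')        # no justification: rejected
except ValueError as error:
    print('rejected:', error)
```

The check is purely syntactic: the editor enforces the structure of a proof (every line has a justification), while the logical soundness of the argument remains the author's concern.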

Editing software can be sensitive to document structure because it has a precise description of the content hierarchy. It can provide an outline with all required objects in place, warn the author about any contextually required or prohibited objects, and automatically renumber or otherwise co-ordinate document components. 10

(3) Alternative document views and links are facilitated. Outliners take advantage of the identification of the hierarchy of the major editorial content objects in a document. Recognizing chapter titles, sections, subsections, paragraphs and perhaps even sentences they give the author the ability to display an outline of the document at any desired degree of detail and support 'zooming', a level by level concealing or revealing of successive levels of detail. Outliners may also provide editing utilities which can move hierarchical components of the document, as displayed in an outline view, and have the whole structure adjust accordingly. For instance, one might directly move a section, complete with all its nested subsections and text, to another chapter. Or one might promote a section from a second-level section to a first-level section and have all of the nested sections adjust their depth accordingly, included fourth-level sections becoming third-level sections and included third-level sections becoming second-level sections.
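The promotion operation just described can be sketched as a simple recursion over the section hierarchy. The data structure below is a hypothetical illustration, not the representation used by any particular outliner, and it assumes the section being promoted is not already at the top level.

```python
# A section with nested subsections; 'level' records its outline depth.
def promote(section):
    """Promote a section one level, adjusting all nested sections too."""
    section['level'] -= 1
    for sub in section['subsections']:
        promote(sub)
    return section

sec = {'title': 'Material Things', 'level': 2, 'subsections': [
    {'title': 'Bodies', 'level': 3, 'subsections': [
        {'title': 'Extension', 'level': 4, 'subsections': []}]}]}

promote(sec)
print(sec['level'],
      sec['subsections'][0]['level'],
      sec['subsections'][0]['subsections'][0]['level'])
```

Because the hierarchy is explicit, one operation on the parent adjusts every nested component consistently; nothing needs to be renumbered by hand.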

An even more sophisticated selective display of portions of documents can be effected using discipline-specific content objects in the document. For instance, one could specify a view that contained only the verse quotations or only the dialogue lines spoken by a particular character in a script. An author preparing a philosophical monograph might wish to view only the axiomatic elements (definitions, axioms and theorems) and not the intervening text. Such elements might be rearranged to help the author or reader explore a certain line of reasoning. For instance, the axiomatic elements might be sorted by kind - definitions, axioms and theorems - regardless of their narrative order in the monograph. In this way the specialized views enabled by content-oriented text processing help overcome in the authoring environment the limitations imposed by the presentational requirements of the intended narrative format or rhetorical form.
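Such a selective view amounts to filtering and sorting the document by object kind. The Python sketch below is purely illustrative; the document contents paraphrase the Spinozistic examples mentioned earlier and are not quotations.

```python
# A document as a flat sequence of (kind, text) content objects.
document = [
    ('paragraph',  'We begin with some preliminaries . . .'),
    ('definition', 'By substance I understand what is in itself.'),
    ('axiom',      'Everything that is, is either in itself or in another.'),
    ('paragraph',  'With these in hand we may proceed.'),
    ('theorem',    'A substance is prior in nature to its affections.'),
]

# View only the axiomatic elements, sorted by kind rather than by
# their narrative order in the monograph.
AXIOMATIC = ('definition', 'axiom', 'theorem')
axiomatic = [(k, t) for k, t in document if k in AXIOMATIC]
for kind, text in sorted(axiomatic, key=lambda pair: AXIOMATIC.index(pair[0])):
    print(kind.upper() + ':', text)
```

The intervening prose paragraphs simply drop out of the view; the underlying document is unchanged, and the narrative order can be restored at any time.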

As well as alternative views, rapid navigation or 'hypertext links' are also facilitated by the identification of content objects. If annotations, footnotes, bibliographical cross-references and the like are labelled as such, then software applications can provide direct connections from texts into bibliographic databases, personal notes and so on.

10 Some of these procedures require not only a precise description of the actual hierarchy in the current document, but also a description, or grammar, of all the acceptable combinations of content objects. Such grammars are also a natural extension of the content object approach, as will be discussed later.

232 BULLETIN JOHN RYLANDS LIBRARY

(4) Collaboration tools are supported. Mechanisms for version control, collaborative editing, annotation and so on are made much easier when content objects are specifically identified. For instance, collaborating authors need to be able quickly and reliably to refer to places in their shared text regardless of the current pagination, formatting or added annotations. Content object identification allows references such as 'Chapter 6, Section 2, paragraph 7, sentence 9' to be recognized and processed by software applications. Or annotations can be identified as to kind, author and disposition and attached to the relevant content object.
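Structural references of this kind are easy to resolve mechanically once content objects are identified. The following is a minimal illustrative sketch; the data structure and function names are invented for the example and drawn from no particular system:

```python
# A hypothetical sketch: a document as a tree of content objects, with a
# structural reference resolved by kind and 1-based position - stable
# across any repagination or reformatting of the text.

def make_node(kind, children=None, text=""):
    return {"kind": kind, "children": children or [], "text": text}

def resolve(node, path):
    """Follow a path like [('chapter', 2), ('section', 2)] (1-based indices)."""
    for kind, index in path:
        matches = [c for c in node["children"] if c["kind"] == kind]
        node = matches[index - 1]
    return node

doc = make_node("book", [
    make_node("chapter", [
        make_node("section", [make_node("paragraph", text="First.")]),
    ]),
    make_node("chapter", [
        make_node("section", [make_node("paragraph", text="Second.")]),
        make_node("section", [make_node("paragraph", text="Target.")]),
    ]),
])

# 'Chapter 2, Section 2, paragraph 1' names the same text element no
# matter how the document is formatted or paginated.
target = resolve(doc, [("chapter", 2), ("section", 2), ("paragraph", 1)])
print(target["text"])
```

A reference of this form survives annotation and reformatting precisely because it is addressed to content objects, not to pages or lines.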

Advantages of content-oriented text processing for publishing

(1) Formatting can be generically specified and modified. As we have seen, formatting can be easily and globally adjusted at any time without altering the text file itself at all. All that is necessary is to change the relevant formatting 'rule' for a text element and the formatting of every instance of that text element will be modified. More particularly, one can choose generic 'stylesheets' which will format or reformat documents according to predefined styles, say that of a journal or the house style of an office. Or one stylesheet can be used to control the interactive formatting - creating a visually effective and personalized editing environment - while another, such as the required journal stylesheet, is used for final printing. For the author the switch from one stylesheet to another is instantaneous.

Consistency is ensured - if the rule for subsection headings specifies that they should be centred, 11 point Bembo, 8 extra points of leading, then without fail every instance of subsection heading is centred, 11 point Bembo, 8 extra points of leading.
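The mechanism can be pictured as a table of formatting rules keyed by element kind: switching tables reformats every instance at once, and no heading can deviate from its rule. A hypothetical sketch (the stylesheet names and rule strings are invented for illustration):

```python
# Sketch: formatting kept outside the text, in stylesheets keyed by
# element kind. Changing stylesheets reformats every instance without
# touching the text itself.
journal_style = {"subsection-heading": "centred, 11pt Bembo, +8pt leading"}
draft_style   = {"subsection-heading": "flush left, 14pt, bold"}

# The 'text' is a sequence of (element kind, content) pairs.
elements = [
    ("subsection-heading", "Advantages of content orientation"),
    ("paragraph", "As we have seen, formatting can be adjusted..."),
]

def render(elements, stylesheet):
    # Every element of a given kind gets exactly the rule for that kind.
    return [(stylesheet.get(kind, "plain"), text) for kind, text in elements]

print(render(elements, journal_style)[0][0])
print(render(elements, draft_style)[0][0])
```

Because the rule is applied at rendering time, consistency is a by-product of the representation rather than a matter of authorial discipline.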

Similarly, the professional typographer can modify formatting without having to worry about determining the types and functions of text elements in the original. Consider a philosophical or mathematical text which contains many theorems and corollaries, marked as such. If the typographer specifies a layout in which the two classes of objects are typographically distinct, this can be automatically implemented even if both had the same appearance in the author's personal stylesheet. The same is possible for objects which begin with distinct formatting and end up looking the same. On the other hand, if the distinction was not made initially, then neither transformation can be readily accomplished - someone must re-examine the text and decide for each element whether it is a theorem or a corollary, a process which is prone to error.

More generally, the appearance of a document can be adjusted to suit the precise occasion and circumstances of its rendition.


(2) Apparatus construction can be automated. The creation of apparatus such as indexes and appendices can be assisted. For instance, in a philosophical monograph copies of special elements such as equations, axioms and theorems may be automatically collected in an appendix, and lists of proofs or diagrams constructed. In an art catalogue special indexes by artist, medium or period may be easily constructed.

(3) Output device support is enhanced. Extensive support for output devices (e.g. printers, typesetters, video display terminals) can be maintained outside the text file and referenced via text object identifications: the result is that the text files themselves are output device independent while their processing is output device sensitive. This means that the actual rendering of the document will be automatically adjusted, by the software, to suit the resources of the output device. If the device lacks opening and closing quotation marks, then neutral marks will be substituted. If French or German quotation marks are required then they will be used if available, appropriate substitutions being made when they are not.
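The quotation-mark substitution just described can be sketched as a lookup from abstract character references into per-device profiles. Everything here (the profile names and the token scheme) is illustrative, not drawn from any particular system:

```python
# Sketch: the file stores abstract tokens such as 'open-quote'; a device
# profile, held outside the text, supplies the best glyph the device has.
device_profiles = {
    "typesetter": {"open-quote": "\u201c", "close-quote": "\u201d"},  # curly marks
    "teletype":   {"open-quote": '"', "close-quote": '"'},            # neutral marks
}

def render_quotes(tokens, device):
    """Render a token sequence for a given output device."""
    profile = device_profiles[device]
    return "".join(profile.get(t, t) for t in tokens)

tokens = ["open-quote", "Hello", "close-quote"]
print(render_quotes(tokens, "typesetter"))  # curly quotation marks
print(render_quotes(tokens, "teletype"))    # neutral substitutes
```

The text file never changes; only the profile consulted at rendering time does, which is exactly the device independence claimed above.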

One case of output device support deserves special mention. The content object approach facilitates typesetting directly from an 'electronic manuscript'. (The Association of American Publishers recommends descriptive markup to authors hoping to generate typeset copy directly from their files.) 11

(4) Portability is maximized. Finally, files which identify content objects are much easier to transfer to other text processing systems or computers. They are relatively system- and application-independent, because they minimize information which is dependent on any particular software application, operating system, computer or printer. In fact the advantage of content orientation is often put in terms of this independence. Documents prepared in a content-oriented way are said to be hardware-, system-, software-, output device- and design-independent.

Advantages for archiving, retrieval and analysis - text as a database

(1) Data integrity is protected. Because many of the functions mentioned above allow text to undergo processing - such as reformatting, typesetting or being ported across text processing systems - without actually editing the source text files themselves (rules files being generally maintained externally to the source text file), many sources of data corruption can be avoided: a compositor can implement drastic changes in page design, for instance, without even having access to the actual texts being typeset.

(2) Information retrieval is supported. The content object model treats documents and related files as a database of text elements which can be systematically manipulated. This can facilitate not only personal information retrieval functions, such as the generation of alternative views, but also a variety of finding aids and navigation and data retrieval functions.

11 See An author's primer (1983) and Chicago, Guide (1987).

For instance, full-text searches in textbases can specify structural conditions on patterns searched for and text to be retrieved. This means that a user could look for all chapters whose titles contained both the words 'love' and 'death'; or a philologist might wish to look in a dictionary for all definitions of words containing a certain prefix and derived from a particular language. (The University of Waterloo OED Project is preparing an electronic version of the Oxford English Dictionary using the content object approach.) But if one has not explicitly identified titles, metrical lines, paragraphs and so on, then searches based on such units cannot be mechanically performed. Of course the familiar 'proximity' search is still a possibility. But if you ask for '"love" within 3 words of "death"' with no further restrictions you can get a positive response when, for instance, 'love' is the last word of one chapter and 'death' is the first word of the following chapter - which would not be helpful.
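A structure-aware search of this sort can be pictured as a query over identified elements rather than over a flat character stream. A minimal hypothetical sketch, with invented chapter data:

```python
# Sketch: because titles are identified as content objects, a search can
# be restricted to them - something no flat proximity search over the
# character stream can guarantee.
chapters = [
    {"title": "Of Love and Death", "body": "..."},
    # 'death' appears in this body, but not in the title, so a
    # title-restricted search should not return it.
    {"title": "Of Love", "body": "... and at last, death ..."},
]

hits = [c for c in chapters
        if "love" in c["title"].lower() and "death" in c["title"].lower()]

print(len(hits))
print(hits[0]["title"])
```

The proximity search described above would also match spurious cases, such as 'love' ending one chapter and 'death' opening the next; the structural query cannot.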

(3) More extensive analysis is possible. Very similarly, much automated analysis (stylometrics, content analysis, statistical studies, etc.) is not possible unless features such as sentences, paragraphs, stanzas, dialogue lines, stage directions and so on have been clearly identified in a manner which is reliably tractable to machine processing.

(4) Special processing is facilitated. Many texts include objects such as formulas in special notations, metrical information, foreign languages, graphical or other non-textual data and the like. If these are identified by kind then they can be processed appropriately by special software if desired. For instance, equations could be solved, formulas manipulated, verified or evaluated, sentences parsed, prosodic lines scanned, graphics displayed and so forth.

MARKUP THEORY

In content-oriented text processing content objects are generally identified with 'descriptive markup'. A clear example of the use of descriptive markup in a computer file is shown in Figure 1.

According to one taxonomy of markup, texts consist of content (e.g. the alphabetic characters of the text) and markup, of which there are six kinds: 12

12 This taxonomy is due to James Coombs and is an extension and generalization of Charles Goldfarb's earlier division of text processing markup into procedural and descriptive. Coombs characterizes punctuational and presentational markup as typically scribal, and descriptive and procedural markup as typically electronic. He goes on to analyse markup and markup processing along a number of other dimensions. See Coombs, Renear and DeRose (1987).

Descriptive - e.g. '<paragraph>'. Descriptive markup identifies what a text element is.

Procedural - e.g. '.sk 3 a;.in 5'. Procedural markup specifies processing behaviour, usually formatting, although other sorts of processing may also be specified.

Presentational - e.g. 'A treatise of human nature'. Graphic devices such as leading, font changes, word spaces, etc.

Punctuational - e.g. '?!-;,.'. Punctuational markup is part of the writing system, but not a substantive part of the text itself.

Referential - e.g. '&emdash', '&cat_picture'. Referential markup is used to denote special characters, classes of characters, graphics, etc. It can be 'resolved' in various ways depending on circumstances and processing options.

Metamarkup - e.g. '.sr emdash = '. Metamarkup is used in markup systems to define or otherwise explicate other markup tags.

The preceding section of this essay argued the superiority of descriptive markup over procedural and presentational markup for computer-based text processing. Presentational markup, however, is generally superior for visual display - being on the whole more effective in efficiently conveying the logical structure of the text to a human reader.

Some readers familiar with interactive (WYSIWYG) text processing might be tempted to say that there is no markup at all in their computer files, at least apart from punctuational and presentational markup. In fact, there is frequently a great deal of procedural markup in these files - it is just concealed from view. The result is that all of the relative disadvantages of procedural markup, and none of the advantages of descriptive markup (and content-oriented text processing), are present. Probably the closest one could get to a text with 'no markup' would be ancient scriptio continua.

But this confusion indicates some of the complexities involved in the analysis of markup and text processing. For instance, exactly what sort of markup is being recorded in a computer file is not necessarily apparent to the user simply from reflection on the procedures for creating and entering text. A user wishing to enter a title might press a key sequence such as 'shift-t', meaning start a title. He or she will then type the title and perhaps see it set in a display font and centred on the screen. But there are several possibilities here. If codes signifying 'this is a title' (e.g. '<title>') are being stored in the computer file then indeed descriptive markup is being recorded in the file, processed and rendered interactively, using a formatting rule, with presentational devices. But if instead the sequence 'shift-t' simply looks up the formatting rule and then puts in all the necessary procedural codes (change font, centre, etc.) and only those codes, then only procedural markup is being stored or recorded in the computer file. And if instead of doing either of these things the software simply adds into the source file the word spaces necessary to centre the line, then, in a sense, presentational markup is being stored. But in each of these three cases the qualitative experience of the composing author is identical, as is the visual effect on the computer screen or a printed page. Only a direct inspection of the stored computer file or clues from subsequent processing would reveal which has occurred. 13

Although these imperfect distinctions assist our understanding of text processing, and perhaps even textual communication in general, a mature theory of markup is not available. There are many subtle problems in distinguishing the various dimensions along which a typology of markup might be constructed, and work in this area has just begun. Understanding the new technology of reading and writing will require considerable analytic work along these lines. But what is most important for our present purposes is the observation that it is descriptive markup that implements the content-object approach to text processing.

SGML - the Standard Generalized Markup Language

There is now an international standard for defining descriptive markup systems; it is ISO 8879: Information processing - text and office systems - Standard Generalized Markup Language (SGML). 14 SGML has been endorsed by many European and North American governmental agencies, professional organizations and scholarly societies, and has been achieving steady acceptance by software developers as well. 15 SGML specifies a machine-readable format for defining markup tags which identify text 'elements' (content objects), and for specifying what elements are allowed in a document and what the syntax of each element is - that is, what combinations of elements are allowed or required. SGML is not itself a set of markup tags, but rather it is a meta-grammar for defining tag sets. Following the instructions in the standard one can prepare an SGML 'Document Type Definition' (DTD) for each type of document one is dealing with (book, letter, memorandum, etc.), spelling out what elements may occur in each type of document, what tags will be used to represent them and what the syntax or grammar of each element is. A text that is tagged according to these specifications is a 'document instance'.

13 For example, if changing 'stylesheets' (formatting rules) will automatically adjust the formatting for the title, then descriptive markup was recorded, otherwise it was not. If it is determined that descriptive markup was not recorded and if it is observed that the title remains centred even if a new word is added, then the procedural markup for centring was recorded. But if the title is no longer centred immediately after the addition of a word, without further explicit application of centring instructions, then presentational markup - word spaces - was recorded. Coombs has attempted to get at what is going on here by distinguishing between the markup that is elicited and the markup that is stored. But much more work remains to be done in developing these distinctions.

14 ISO (1986); see also The SGML handbook (Goldfarb, 1990). Goldfarb's book contains the complete text of the standard and a great deal of valuable supplementary material; it is highly recommended to those who intend to take a close look at SGML.

15 Many government agencies and the militaries of several countries stipulate that contracts for documentation and publishing must specify software and procedures which comply with SGML. A number of scholarly professional societies and library organizations have also formally endorsed SGML.

For instance, the DTD for a memorandum might define tags for the elements sender, recipient, copies, subject, date and so on. A monograph DTD might define tags for title, abstract, chapter, section and the like. Both DTDs would define a tag for paragraph. The syntax for Memorandum might specify that every document that is a conforming memorandum consist of a header and a body, in that order. The header must contain exactly one sender, exactly one subject, exactly one date, at least one recipient (optionally more than one) and any number of copies elements, including none at all - in that order. Prose paragraphs may not be allowed in the header at all, but the body of the Memorandum must contain at least one. The syntax of paragraph may specify that paragraph may not itself contain a paragraph within it. The technique for specifying these syntactical constraints is similar to the production rule meta-grammar invented by Noam Chomsky to describe natural languages. In fact the description of a document in SGML has many similarities to the description of a sentence in structuralist grammars.

Software which can read SGML Document Type Definitions and SGML document instances can provide all sorts of assistance and intelligent tools. For instance, documents can be validated to make sure that their structure fits the grammar expected for that category or genre, that no mistakes in input have been made. Validation of this sort can be interactive and combined with intelligent prompting: if your software knows that you are typing a definition and also knows what elements are required or optional in a definition, it can prevent you from leaving out a required item or prompt you to choose from a list of elements which are allowed at each point.
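The memorandum syntax described above can be sketched as a small grammar over element kinds, which is the idea behind a DTD's content models. This is an illustrative Python sketch, not SGML itself; the element names follow the memorandum example, and the encoding of the grammar is invented for the example:

```python
import re

# A hypothetical sketch of DTD-style content models: each element kind
# maps to a pattern over the kinds of its children. The '+' (one or
# more) and '*' (zero or more) of a content model are rendered here as
# ordinary regular-expression repetition.
grammar = {
    "memo":   "header body",                                            # header, then body
    "header": "sender subject date recipient(?: recipient)*(?: copies)*",
    "body":   "paragraph(?: paragraph)*",                               # at least one paragraph
}

def valid(kind, child_kinds):
    """Check that a sequence of child element kinds fits the content model."""
    pattern = grammar[kind].replace(" ", r"\s+")
    return re.fullmatch(pattern, " ".join(child_kinds)) is not None

print(valid("header", ["sender", "subject", "date", "recipient"]))  # True
print(valid("header", ["sender", "date", "recipient"]))             # False: subject missing
print(valid("body", []))                                            # False: needs a paragraph
```

A validating editor built on such a grammar can do exactly what the text describes: reject a header with no subject, or prompt for the elements allowed at the current point.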

TEI - the Text Encoding Initiative

Although the advantages of using descriptive markup to identify text objects for general text processing, typesetting and publishing are well understood, the natural extension of this approach to scholarly research and the creation and analysis of textbases has, until recently, received much less attention. But literary scholars need to perform many procedures which require systematic reference to content objects. They will want to ask, for instance, whether two words ever occur in the same prosodic line, or in the same stanza, or in the same poem. Scholars want as much information as possible included in a textbase, but they need this information represented in a format which will support mechanical processing, so that they can exploit the power of computer technology. And these files should be usable on a wide range of operating systems and by a wide range of software application programs - the problems with non-standard or proprietary formats are familiar to most computing humanists.

These are just the sort of requirements which descriptive markup tagging systems are designed to meet. Consequently literary scholars have also recently been arguing for the use of descriptive markup in general and for the use of SGML-conforming markup in particular. 16 They have begun to develop SGML-conforming tag sets and guidelines for tag set development that are specifically intended to be suitable for literary texts. They are also finding, although this is more controversial, that the grammar rules of Document Type Definitions (DTDs) are a useful way to represent and communicate theories of genres and document structure.

The principal vehicle for the development and standardization of tag sets for the humanities is the Text Encoding Initiative. The TEI, founded in 1987, is an international effort to specify a common interchange format for machine-readable texts. 17 The TEI is co-ordinating over one hundred scholars from many different disciplines in developing SGML tag sets for various document types and disciplinary methodologies. As well as standardizing tag sets and DTDs, the TEI Guidelines provide extensive guidance for using these tag sets and guidance for scholars developing their own tag sets. The TEI has also considerably raised the level of understanding and interest among humanities scholars in the content-oriented approach to structuring text, and recently has provided a stimulating interdisciplinary context for considerable discussion of a number of theoretical questions about the nature of text and representation.

RUDIMENTS OF A GENERAL THEORY OF TEXT STRUCTURE

Text is an Ordered Hierarchy of Content Objects - the OHCO model

Content-oriented text processing represents text as a structure (usually a hierarchy) of content objects. It is tempting to explain the practical advantages of this approach by saying that it is effective because that is what text really is. Once one has begun to ask questions about 'what text is', it is easy to find more evidence that seems to support this account.

16 See for instance Barnard, Fraser and Logan (1988) and Smith (1987); cf. also Fraser's M.Sc. thesis (1986).

17 The TEI is sponsored by the Association for Literary and Linguistic Computing, the Association for Computing in the Humanities and the Association for Computational Linguistics; it is funded by the European Economic Commission, the National Endowment for the Humanities, the Canadian Social Science Council and the Mellon Foundation and it has an advisory board with representatives from sixteen scholarly professional societies. The first version of the Draft TEI guidelines was issued in July 1990 (TEI, 1990); the final version is scheduled to be released in the spring of 1993.

On the current view it is a serious and consequential error to think of a text as simply a sequence of lexical entities such as words or sentences, even when the level of abstraction relevant to one's interests and methodology is indifferent to the palaeographical, typographical or bibliographical characteristics of a document. No reader, however naive and unscholarly, encounters a text simply as a sequence of words and sentences. Instead texts present themselves to readers as complex structures of content objects - objects such as titles, sections, dedications, paragraphs, footnotes, prose extracts, verse extracts, equations and so on. Some of these objects are common to many kinds of texts, others are particular to genre or discipline. And while some function merely to make reading easier or faster, others play crucial semantic roles in representing the information content of the text.

Recognizing these objects is essential for an accurate understanding of their ingredient sentences and phrases. For instance, consider how the rhetorical and semantic significance of a sentence varies depending on whether it is used as the title of the monograph, as the title of a section, in a plain paragraph or in a prose extract. Not only is the relative importance and rhetorical impact of the statement at issue here, but even its semantics is affected by its inclusion in a particular editorial object: statements in a prose extract are not affirmed by the author, but may be presented only as illustrations of another scholar's mistakes; similarly statements in a verse extract may be included only to exhibit some metrical or stylistic feature. These differences are even more acute for discipline-specific objects: it makes a big semantic difference whether text is in a definition, an axiom or a theorem.

On this view the ontological question 'What is text?' is given an equally ontological answer: text is an Ordered Hierarchy of Content Objects (OHCO). 18 A book for instance is a sequence of chapters, each chapter a sequence of major sections, and each major section a sequence of subsections. This structure is hierarchical because objects like subsections are inside of or 'dominated by' objects like major sections. The structure is ordered because the objects at the same level of the hierarchy exhibit a narrative sequence - the first section of a chapter preceding the second, for instance. And we call these objects content objects because they organize text into natural units based on meaning and communicative intentions.

18 We took this high ontological approach in a recent article and posed the question 'What is text, really?' Our answer there was, again, 'an ordered hierarchy of content objects'. See DeRose, Durand, Mylonas and Renear (1990). As discussed below we have modified this answer in response to persuasive criticisms. Obviously this line of investigation quickly connects with other traditions and disciplines asking 'what text is', such as textual criticism and scholarly editing, literary theory and the philosophy of art. We think it is valuable that the present approach to investigating and theorizing about this rather abstract question, which has instigated considerable controversy among literary theorists, has its roots directly in the very practical problems of creating computing software and text databases.

Of course documents are not nested objects 'all the way down', like the turtles in the story. Eventually we come to sections which do not themselves contain other sections. Within these we continue to find complex hierarchical structures of content objects. Almost every book has paragraphs, footnotes, reference citations and prose extracts, and specialized literature will have specialized objects such as prosodic lines, proofs and equations. Although where this stops may seem an excessively scholastic question - and indeed we will not inquire into it any further here - it is in fact a very practical question for the design of content-oriented text processing.

The content-oriented approach to text is based on the premise that, in a sense, this is what text really is, an ordered hierarchy of content objects - or, at least, this is what it is qua intellectual object. And it is the text qua intellectual object that, strictly speaking, is written and read, and that conveys meaning.

However, words like 'book', 'document', 'text' and so on do not univocally designate natural kinds. They play many disparate, if systematically related, theoretical roles in our thinking and reasoning about books and texts. There is the book of the physical bibliographer and the book of the literary theorist. The philosopher finds an invalid argument in Descartes's Third Meditation, the literary historian finds an interlinear addition. It would certainly seem odd to say that there is one thing that is or has components from such disparate ontological categories as are exemplified by pages, columns and lines on the one hand; sections, subsections, axioms and definitions on another; and fallacies, allusions and recantations on a third. While efforts like the Text Encoding Initiative attempt to accommodate all the approaches to text that are relevant to humanistic studies, including the bibliographical, codicological and the like, the approach described in the present essay gives primacy to text as an intellectual, and not a material or physical, object.

How are these content objects indicated to the reader? In printed works the recognition of content objects is effected by a variety of presentational devices, chiefly typographical features and page layout. Titles, for instance, may be in larger type than the main text, centred and set off by extra leading. Prose extracts are frequently in smaller type, indented from the margins and with extra leading. Verse extracts are generally set similarly to prose extracts, but may be easily distinguished from them by the characteristic heavy 'right rag' which is created by setting each prosodic line as an unjustified typographical line.

There is an elaborate vocabulary of presentational devices used to indicate logical structure and convey information which is essential to the understanding of the text. These devices are not simply of importance to the scholar interested in the history of typography or the details of literary history; they are absolutely crucial to the reader who wishes only to read and understand the document. And this is because these devices effect the recognition of the text's component content objects.

To get a sense of just how important and effective these devices are, notice from Figures 7 and 8 how much we can infer about a text in a language with which we are unfamiliar. First, we use visual cues to classify text as to genre - recognizing poetry, letters, play scripts or scholarly monographs by their characteristic appearance. Next we can pick out the editorial text objects specific to each kind of document, again by visual appearance, placement on the page and other graphic devices. Although these page images are created using nonsensical Greek, we easily classify them as poem and academic paper and then pick out titles, stanzas, prosodic lines, author, institutional affiliation, abstract, section title, prose paragraph, extract, etc.

[Figure 7: a page image of a mock poem set in nonsensical Greek, in which title, stanzas and prosodic lines are recognizable from layout alone; not reproducible in this transcription.]

FIGURE 7: EFFECT OF TEXT FORMAT A

Alternative Models of Text

The OHCO view of text can be contrasted with other models which can be generalized from the software, practices or methodologies which embody them.


[Figure 8: a page image of a mock academic paper set in nonsensical Greek, in which title, subtitle, author, institutional affiliation, section heading and prose paragraphs are recognizable from layout alone; not reproducible in this transcription.]

FIGURE 8: EFFECT OF TEXT FORMAT B

(1) Bitmaps - A 'bitmap' is essentially a picture of a document, points being located mathematically to represent the dark and light places at a certain resolution. It is a rendering of the document and, properly displayed, 'looks like' the document. However, it is profoundly intractable to computer processing. Text must be added by 'painting' or otherwise specifying the location of the dots, and because there are no mechanically identifiable linguistic characters, one cannot locate words or letters. Obviously one cannot easily change the design of the pages either.

(2) Characters and formatting commands - This model represents texts as characters and codes to represent the formatting which is supposed to take effect in various areas of the page. Most text processing files represent text this way, although in contemporary interactive text processing these codes are mostly hidden from the user. It is a computer typographer's view of text.


(3) Glyphs and white space - This represents the text as a sequence of shapes; it is, perhaps, a page designer's view of text.

(4) Character transcripts - This represents a text as simply a sequence of the orthographic characters used in the words. In practice there are usually codes for paragraph breaks and line breaks as well, although these are not usually displayed as graphic characters but are used to generate line breaks of various kinds. Compared to a printed page the character transcript is very impoverished, leaving out an enormous amount of information, much of it essential to comprehending a text and much more that is important to processing it efficiently or reliably. In light of the impoverished nature of the character transcript it is odd that one not infrequently hears it referred to as 'the text itself'.

(5) Layout hierarchies - These represent the text as a structure of pages, headers and footers, columns, typographical lines and the like; it is a book designer's or layout artist's view of text.
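Such a hierarchy might be sketched as nested structures (the field names and content are invented for illustration):

```python
# A layout hierarchy: pages contain columns, columns contain
# typographical lines. Note that the units are graphical, not editorial:
# a typographical line may end mid-sentence, a page mid-paragraph.
layout = {
    "page": 243,
    "columns": [
        {"lines": ["REPRESENTING TEXT 243",
                   "(3) Glyphs and white space - This rep-",
                   "resents the text as a sequence of"]},
    ],
}

line_count = sum(len(col["lines"]) for col in layout["columns"])
```

The model answers questions about pages and lines readily, but a sentence or paragraph exists in it only as an accidental scattering across those units.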

The reader can easily imagine the practical drawbacks of these models of text by considering the advantages of content orientation which were enumerated in the previous section.

Arguments for the OHCO View

The early arguments for text being a hierarchy of content objects were advanced largely to promote a particular kind of computer encoding and to discourage the competing alternatives. The partisans of content-oriented text processing claimed that the alternative representational practices relied upon a false model of text and that many disadvantages and inadequacies resulted from that flawed conception. The deprecated alternatives seemed to treat texts as characters punctuated by formatting codes, as bitmaps or as hierarchies of graphical objects. The inadequacies which resulted were the loss of the practical advantages listed earlier.

Explicit positive arguments that text is a structure of content objects can be made at several levels: theoretical, empirical and practical. The overview which follows is intended only to suggest the sort of arguments which are or might be made and not to actually present full and compelling versions of these arguments.

Several theoretical arguments may be made. The first is a classic argument from variation, used to distinguish essential from accidental properties, or, in a more contemporary philosophical idiom, to establish 'identity conditions'. We observe that if a presentational feature changes, such as leading or typeface, the text remains essentially the same, but if the structure of content objects changes - say the number of chapters or sentences varies - then it becomes a different text. You and I can both read the 'same book' even though mine is in 10 point type and yours in 12 point type. But if yours has fewer or different paragraphs than mine does, then that seems decisively to suggest that we are not reading the 'same book'.

244 BULLETIN JOHN RYLANDS LIBRARY

Another theoretical argument is based on an apparent asymmetry of information content or generative power: a descriptively marked-up text can produce, in a mechanical fashion, models of the other sorts: bitmaps, hierarchies of graphical objects, character transcripts, etc. However, none of these representations of a particular text can be automatically transformed into the corresponding OHCO structure - the content object model seems to have more information.
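The downhill direction of this asymmetry is easy to exhibit: a mechanical walk over a content hierarchy (sketched here as nested tuples, not any actual encoding scheme) yields a character transcript, while no comparably mechanical procedure recovers the hierarchy from the flat result:

```python
# An OHCO text sketched as nested tuples: (object_type, children),
# with character data at the leaves.
doc = ("section", [
    ("title", ["On Text"]),
    ("paragraph", ["Text is an ordered hierarchy ",
                   "of content objects."]),
    ("paragraph", ["A second paragraph."]),
])

def to_transcript(node):
    """Mechanically flatten the content hierarchy into a bare character
    transcript: leaves concatenate, structural objects break lines."""
    if isinstance(node, str):
        return node
    _, children = node
    sep = "" if all(isinstance(c, str) for c in children) else "\n"
    return sep.join(to_transcript(c) for c in children)
```

The reverse step, recovering `doc` from the output string, would require interpretation of the content, not mere computation; that is the informational asymmetry the argument turns on.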

A final theoretical argument holds that whether or not the notion is explicitly present in the writer's mind, writing assumes a grasp of logical structure. According to this argument two things are going on during the composition process: the author is not only choosing words but choosing as well the logical editorial structures which will contain them. This is what composition is. The author cannot avoid doing either of these things when composing - and need not do anything further. Anything less is at best an anomalous form of text generation, such as free association. And anything more belongs to other allied crafts, such as design, typing, copy-editing and typesetting. A similar argument might be made for reading.

Empirical arguments begin with the observation that the concepts which are prominent in our descriptions, theories, hypotheses, generalizations and conjectures about text are concepts of content objects. Our theories and conjectures about literature, for instance, make extensive use of terms for chapters, titles, sections, paragraphs, sentences, footnotes, stanzas, lines, acts, scenes, speeches and so on. On the assumption that the nominal phrases of theoretical assertions denote the entities asserted by that theory, and which play fundamental explanatory roles in that theory, we can conclude that such things, content objects, are the stuff of which literature is made. 19

Another class of empirical arguments, which I will call pragmatic, is based on efficiency, and originates with the designers of text processing software. Creation and maintenance of a descriptively marked-up electronic document are more efficient, as are its display, formatting and rendering. Analysis is also similarly facilitated on such texts. These phenomena of comparative efficiency are best explained, according to the argument, by a theory that assigns a salient role to logical features.

Complications for the OHCO View

We have already alluded to one direction in which the original characterization of text as an ordered hierarchy of content objects is being modified. It is being recognized that there is not a univocal sense of 'book', 'text' or 'document', and that our notion of a text is determined in part by the disciplinary or analytic perspective we assume, the methodological communities we are members of, and our purposes and interests. The book designer's text might be a hierarchy of pages, columns and typographical lines - so that we may say that as a design object a book is a hierarchy of presentational objects. An editor's or author's text is a hierarchy of chapters, sections and the like. Physical bibliographers, typographers, corpus linguists, literary critics, historians and others similarly may have their own notions of text and their own way of dividing a text into salient parts.

19 The classic argument that ontological commitments are found in the denoting phrases of theoretical statements is in Willard van Orman Quine (1953).

Originally it seemed to text processing researchers that there was perhaps a single major division of perspectives on text into logical (or literary, editorial, content) and physical (material, layout, graphic). Every text had one natural logical hierarchy and a variety of possible physical hierarchies. The diversity of logical text objects was a function of genre or category of text, and not of analytical approach - legal contracts had one set of objects, scientific monographs another. Perhaps there were issues related to the specificity of the decomposition into objects, but the objects would all be in some sense of a kind and would form strict hierarchical structures: i.e. within each approach, logical or physical, objects always 'nested' and never 'overlapped', although objects from different approaches frequently overlapped. For example, sentences, paragraphs and sections do not overlap with each other and typographical lines, columns and pages do not overlap with each other, but sentences, paragraphs and sections do overlap with typographical lines, columns and pages. 20
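With objects modelled as character-offset spans, the distinction between nesting and overlapping invoked above can be stated exactly (a sketch; the offsets are invented):

```python
def overlaps(a, b):
    """True when one span (start, end) begins strictly inside the other
    and ends strictly after it ends: proper overlap, not nesting."""
    (a0, a1), (b0, b1) = a, b
    return a0 < b0 < a1 < b1 or b0 < a0 < b1 < a1

sentence  = (0, 40)    # a sentence of the editorial hierarchy
paragraph = (0, 100)   # the paragraph containing it: nesting, not overlap
line      = (25, 60)   # a typographical line crossing the sentence's end
```

Here `overlaps(sentence, paragraph)` is false (objects within one perspective nest), while `overlaps(sentence, line)` is true (objects from the logical and layout perspectives cut across one another).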

Consistent with this view, commercial research projects on text processing designed systems which maintained exactly two hierarchies, the logical hierarchy of editorial (content) objects and a hierarchy of intended layout objects (pages, columns, typographical lines). Where SGML incorporated features designed to co-ordinate multiple hierarchies, these clearly were motivated by the expectation that there would frequently be two hierarchies to co-ordinate but that the logical hierarchy of content objects would never itself be beset, internally, by overlapping objects. The theoretical expression of this view was the explicit grand ontological claim that text is a hierarchy of content objects.

This no longer seems entirely accurate as a general theory of text, or adequate as a software design constraint. It is true that the editorial structure is broadly hierarchical, but its objects overlap not only with layout objects, but, for instance, with thematic and prosodic structures. Consider a verse drama: dialogue speeches, metrical lines and sentences may all overlap, and yet all of these have equal claim to be 'logical' as opposed to 'material'. 21 And when thematic structures or linguistic collocations are considered, overlaps become the rule rather than the exception.

20 Roughly, two objects overlap when one that begins 'inside' of another ends after the other ends.

The recognition that many different analytic or methodological perspectives (and hence hierarchies) might exist for a given text raises many questions. For instance, what exactly are these perspectives? How many are there? How are they related? Do they always determine hierarchies or do perspectives themselves have overlapping objects? How should this emerging terminology and theory be related to traditional theories of textual criticism and to contemporary theories of literary meaning? These questions have yet to be answered by the construction of a unified picture. 22

This essay has described how practitioners and researchers in the area of computer text processing have come to promote a particular approach to text processing, and has argued that this approach is indeed a natural development which enables the most effective use of computer technology in text processing and textbase development. It has also described how some of us have come to find in this approach, and its success, the rudiments of a theory of what text 'really is'. This theory, although of tremendous value to understanding text processing systems, seems also to be arguably a general theory of text and linguistic communication which is independent of any particular technology of text production and transmission. Exploring this theory, and competing theories, has proven to be both an intellectually engaging activity for scholars from many disciplines and of considerable practical value in the design of text processing software and text encoding standards. However, we are still at the very beginning of a principled understanding of text, documents, markup systems and computer text processing.

21 Barnard, Hayter, Karababa, Logan and McFadden (1988, 265-76). Their classic example is of dramatic speeches and metrical lines overlapping when one character begins a metrical line and another finishes it and then continues speaking.

22 The topics in this section are discussed in greater detail in Renear, Durand and Mylonas, 'Refining our notion of what text really is' (forthcoming).


ABOUT THE AUTHOR

ALLEN RENEAR is on the staff of Computing and Information Services at Brown University, where he is responsible for information technology planning and support for computing in humanities research. He studied philosophy at Bowdoin College and Brown University and represents the American Philosophical Association on the Advisory Board of the Text Encoding Initiative.

CONTACT ADDRESSES

Post: Dr Allen Renear, Senior Academic Planning Analyst, Box 1855, Computing and Information Services, Brown University, Providence, RI 02912, U.S.A.
Phone: 401-863-7312
E-mail: ALLEN@BROWNVM.BITNET
Fax: 401-863-7329

REFERENCES

An author's primer to word processing (Association of American Publishers, 1983).

D.T. Barnard, C.A. Fraser and G.M. Logan, 'Generalized markup for literary texts', Literary and Linguistic Computing, 3 (1988), 26-31.

David Barnard, Ron Hayter, Maria Karababa, George Logan and John McFadden, 'SGML-based markup for literary texts: two problems and some solutions', Computers and the Humanities, 22 (1988), 265-76.

The Chicago manual of style (Chicago: University of Chicago Press, 13th edn, 1982).

Chicago (University of Chicago), Guide to the preparation of electronic manuscripts for authors and publishers (Chicago: University of Chicago Press, 1987).

James H. Coombs, Allen H. Renear and Steven J. DeRose, 'Markup systems and the future of scholarly text processing', Communications of the Association for Computing Machinery (ACM), 30 (1987), 933-47.

Steven J. DeRose, David Durand, Elli Mylonas and Allen H. Renear, 'What is text, really?', Journal of Computing in Higher Education, 1.2 (Winter 1990), 3-26.

Esdaile's manual of bibliography, ed. Roy Stokes (London: Allen & Unwin, 1931; 4th revised edn, 1967).

C.A. Fraser, 'An encoding standard for literary documents', M.Sc. thesis, Department of Computing and Information Science, Queen's University, 1986.

Richard Furuta, Jeffrey Scofield and Alan Shaw, 'Document formatting systems: survey, concepts, and issues', Computing Surveys, 14 (1982), 418-72.

Charles Goldfarb, 'A generalized approach to document markup', Proceedings of the ACM SIGPLAN-SIGOA Symposium on Text Manipulation (New York: Association for Computing Machinery (ACM), 1981), 68-73.

Charles Goldfarb, The SGML handbook (Oxford: O.U.P., 1990).

Claus Huitfeldt and Viggo Rossvær, The Norwegian Wittgenstein Project Report 1988 (The Norwegian Centre for the Humanities, 1988).

ISO, Information processing - text and office systems - Standard Generalized Markup Language (SGML), ISO 8879-1986 (International Organization for Standardization (ISO), 1986).

Ronald B. McKerrow, An introduction to bibliography for literary students (Oxford, 1927).

Willard van Orman Quine, 'On what there is', From a logical point of view (Cambridge, Mass.: Harvard U.P., 1953).

Brian Reid, 'A high-level approach to computer document formatting', Proceedings of the 7th Annual ACM Symposium on Programming Languages (New York: ACM, 1980), 24-31.

Allen Renear, David Durand and Elli Mylonas, 'Refining our notion of what text really is', Research in humanities computing (Oxford: O.U.P., forthcoming).

Joan Smith, 'The Standard Generalized Markup Language (SGML) for humanities publishing', Literary and Linguistic Computing, 2 (1987), 171-5.

C.M. Sperberg-McQueen, 'Text in the electronic age: textual study and text encoding, with examples from medieval texts', Literary and Linguistic Computing, 6 (1991), 34-46.

TEI, Guidelines for the encoding and interchange of machine-readable texts, ed. C.M. Sperberg-McQueen and Lou Burnard (Chicago and Oxford, 1990).

Kate L. Turabian, A manual for writers of term papers, theses, and dissertations (Chicago: University of Chicago Press, 13th edn, 1982).