Comput. Lang. Vol. 9, No. 2, pp. 89-96, 1984. 0096-0551/84 $3.00 + 0.00. Printed in Great Britain. All rights reserved. Copyright © 1984 Pergamon Press Ltd.

ANATOMY OF A TEXT ANALYSIS PACKAGE

ALAN REED
Centre for Computing and Computer Science, Elms Road, University of Birmingham, P.O. Box 363, Birmingham B15 2TT, U.K.

(Received 2 February 1983)

Abstract--A model of a text processing package, based on elementary set theory, is described. The model is used to describe procedures for computing indexes, concordances, and collocations. Aspects of the implementation of the model are given.

Literary text processing, Package, Set theory, Inverted file

INTRODUCTION

Many humanities students wish to examine machine readable text in well known ways, without wanting to develop special programming expertise. They can now use pre-written computer programs ('packages'), each of which can perform a number of fixed tasks. These traditional literary data processing tasks involve producing word lists, concordances, word frequency counts, and numerical measures of style. Packages such as CLOC [1, 2], COCOA [3, 4], EYEBALL [5], JEUDEMO [6], OCP [7], STYLE [8], etc. require that the student supply only the text to be examined and a few simple commands to tell the package what to do. Those who wish to examine texts in original ways will need to use a computer language that is good at handling natural language text, such as LISP [9], PROLOG [10, 11], SCAN [12], SNOBOL [13], etc.

Students who use packages or develop their own programs will need a clear idea of what constitutes a 'word' or a 'phrase', especially if statistical analyses are undertaken. Are, for example, "The" and "the" to be counted as one word or two? This paper is intended to stimulate clear thought on such matters by presenting an approach to literary data processing based on elementary set theory. After describing a text as a sequence of tokens, we introduce the concept of a 'mapping' from tokens to words. This allows us to define precisely the terms 'vocabulary', 'frequency of occurrence', 'word index', 'citation', 'concordance', and 'collocations'. We incorporate these definitions into a scheme which we call an 'abstract package' and show how this has been implemented as the computer program package known as CLOC.

FORMALISM

A machine readable text T is an ordered sequence of bytes taken from some machine alphabet (say US-ASCII). We will assume that there has been a consistent scheme for transliterating the natural language alphabet into one or more bytes in the machine alphabet. We will further assume that there is an embedded notation to describe the page, section, author, or work. These extra-textual notations we will ignore, treating the pure text as a sequence of 'symbols', where each symbol is realised by one or more bytes in some machine alphabet. Symbols can be placed next to each other to create arbitrary sequences. In general, if we start with a set of symbols S we can form a set S* which consists of all possible ordered sequences of symbols taken from S. The text T, a sequence of r symbols, will be a member of this set, which we write as T ∈ S*, where T = s_1 s_2 ... s_r and s_i ∈ S.

A natural language text could be a piece of writing, a musical work, a computer program, etc. Whatever it is, we will be interested in regularly occurring features in the text. These may be text words, musical notes or phrases, identifiers, etc. We shall use the term 'token' to denote this unit. Given a text as a sequence of symbols and a process which will extract from it the next token, we could re-describe the text as a sequence of p tokens. This we write as T = t_1 t_2 ... t_p. We shall use the notation T_i to denote the i-th member of the sequence.

Mapping

The sentence 'The cat sat on the mat' could be said to contain 6 tokens. Note however that the word "the" occurs in two forms, 'The' and 'the'. We want to use 'words' to define vocabulary, word indexes, concordances, and collocations, so we need a method of relating the tokens in the text to the words that they represent. This we do by supplying a mapping M from tokens to words. The simplest (and trivial) mapping is to allow a token to represent a word. Thus the above sentence would contain 6 words, each occurring exactly once. A more useful mapping would unify upper and lower case letters, in which case the above sentence would contain 5 words, with "the" occurring twice. Our mapping could also take account of lemmatization by introducing an abstract head-word which represents a list of particular tokens. For example, we could define "SIT" to represent the list of tokens 'sit, sits, sat, sitting'. Thus the token 'sat' will be counted with the word "sat" and also with the (lemmatized) word "SIT". If this method were used, the above sentence would now contain 6 words, namely "the", "cat", "sat", "on", "mat" and "SIT", with occurrence frequencies of 2, 1, 1, 1, 1 and 1 respectively. This example shows that a given token can be mapped to more than one word, and different tokens may map to the same word. Thus in general the mapping is many-to-many.
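The many-to-many mapping can be made concrete with a short sketch. The following Python fragment is an editorial illustration, not part of the paper or of CLOC: it represents M as a function returning the set of words a token counts under, using the case-folding and "SIT" lemma of the example above.

    LEMMAS = {"sit": "SIT", "sits": "SIT", "sat": "SIT", "sitting": "SIT"}

    def M(token):
        """Map a token to the set of words it represents (many-to-many)."""
        word = token.lower()            # unify upper and lower case
        words = {word}
        if word in LEMMAS:              # a token may also count under a head-word
            words.add(LEMMAS[word])
        return words

    tokens = "The cat sat on the mat".split()
    print([sorted(M(t)) for t in tokens])
    # [['the'], ['cat'], ['SIT', 'sat'], ['on'], ['the'], ['mat']]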

Literary data processing is concerned with locating particular words in a text, their frequency of occurrence, and the context in which they occur. Now that the relationship between tokens and words has been established through the mapping M, we can describe what we mean by the terms 'vocabulary', 'index', 'concordance', and 'collocational analysis'.

Vocabulary

We are interested in the set of distinct words in the text and how often they occur. We call the set of distinct words the vocabulary of the text, which we write as V(T). It is clear that:

V(T) = { M(T_i) | i ∈ Z and 1 ≤ i ≤ p }    (1)

where Z is the set of all integers (negative, zero and positive). We write the frequency of occurrence of a particular word w in a text T as F(w, T), where:

F(w, T) = order of { i | i ∈ Z and 1 ≤ i ≤ p and M(T_i) = w }    (2)

It is useful to collect together a word and its frequency of occurrence, so we construct a set VF(T) which combines them:

VF(T) = { (w, F(w, T)) | w ∈ V(T) }    (3)
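As an illustration only (assuming the set-valued representation of M sketched earlier), equations (1)-(3) translate almost directly into Python:

    from collections import Counter

    def M(token):
        lemmas = {"sit": "SIT", "sits": "SIT", "sat": "SIT", "sitting": "SIT"}
        w = token.lower()
        return {w} | ({lemmas[w]} if w in lemmas else set())

    T = "The cat sat on the mat".split()

    V = set().union(*(M(t) for t in T))          # V(T), equation (1)
    F = Counter(w for t in T for w in M(t))      # F(w, T), equation (2)
    VF = {(w, F[w]) for w in V}                  # VF(T), equation (3)
    print(sorted(VF))
    # [('SIT', 1), ('cat', 1), ('mat', 1), ('on', 1), ('sat', 1), ('the', 2)]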

Index

An index of a given word w in a text T, written index(w, T), is the set of all locations where it occurs in the text T. This is of course the set from which the frequency of occurrence came:

index(w, T) = { i | i ∈ Z and 1 ≤ i ≤ p and M(T_i) = w }    (4)
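Equation (4) is a direct set comprehension. A minimal sketch, assuming 1-based positions as in the paper and a set-valued M:

    def index(w, T, M):
        """Positions (1-based) at which token T[i] maps to word w."""
        return {i for i in range(1, len(T) + 1) if w in M(T[i - 1])}

    M = lambda t: {t.lower()}
    T = "The cat sat on the mat".split()
    print(index("the", T, M))   # {1, 5}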

Concordance

A concordance is a development of the index in that it shows the context in which the given word occurs. The context we will call a 'citation' and define it to be a sequence of tokens which is a sub-sequence of the original text but which contains the given word, thus:

With T = t_1 t_2 ... t_{i-l} t_{i-l+1} ... t_i ... t_{i+m-1} t_{i+m} ... t_p

where

l ≥ 0 and m ≥ 0 and 1 ≤ i-l ≤ i ≤ i+m ≤ p

we define

citation(l, m, i, T) = t_{i-l} ... t_i ... t_{i+m}    (5)
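In Python terms (a sketch, not the paper's code), a citation is just a slice of the token sequence; the assertion restates the side conditions of equation (5):

    def citation(l, m, i, T):
        """Tokens from l before to m after the (1-based) position i."""
        assert l >= 0 and m >= 0 and 1 <= i - l <= i <= i + m <= len(T)
        return T[i - l - 1 : i + m]   # Python slices are 0-based, end-exclusive

    T = "The cat sat on the mat".split()
    print(citation(1, 2, 3, T))       # ['cat', 'sat', 'on', 'the']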


For given values of l and m, a 'concordance' is the set of all citations which share a common central word w, thus:

concordance(l, m, w, T) = { citation(l, m, i, T) | i ∈ Z and 1 ≤ i ≤ p and M(T_i) = w }    (6)
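A sketch of equation (6), reusing citation() above. Note that citation(l, m, i, T) is only defined when the whole window fits inside the text, so occurrences too close to either end are skipped:

    def citation(l, m, i, T):
        return T[i - l - 1 : i + m]

    def concordance(l, m, w, T, M):
        return [citation(l, m, i, T)
                for i in range(1 + l, len(T) - m + 1)   # keep 1 <= i-l, i+m <= p
                if w in M(T[i - 1])]

    M = lambda t: {t.lower()}
    T = "The cat sat on the mat".split()
    print(concordance(1, 1, "the", T, M))
    # [['on', 'the', 'mat']]   ('The' at position 1 lacks a left context of width 1)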

Collocations

The "collocates' of a word w are those words which occur in the citations of the concordance of w. They are studied because in natural language text words are regularly and frequently grouped together, e.g. in English the words "'time" and "'day" are often associated in the phrase "the time of day". Such collocational behaviour has been studied by Sinclair et al. [14]. It is clear that from a concordance one can compute the vocabulary of the collocates of the given word and their frequency of occurrence, one can also locate their index positions, and produce collocate concordances (or 'collocations' to use the appropriate term).

The collocate vocabulary, for a given word w in a region defined by l tokens before w and m tokens after w, is the set of words surrounding w in the citations of concordance(l, m, w, T). We write CV(l, m, w, T) for the collocate vocabulary, defined thus:

CV(l, m, w, T) = { M(T_k) | i ∈ Z and 1 ≤ i ≤ p and M(T_i) = w and s ∈ Z and -l ≤ s ≤ +m and s ≠ 0 and k = i + s and 1 ≤ k ≤ p }    (7)

The condition s ≠ 0 is needed to ensure that the collocate vocabulary contains only the words in the context of the central word, and not the central word itself.

In a similar way we can calculate the frequency of occurrence of a collocate, CF(c, l, m, w, T). Note that this frequency is different from the frequency of occurrence of that word in the complete text T.

CF(c, l, m, w, T) = order of { k | i ∈ Z and 1 ≤ i ≤ p and M(T_i) = w and s ∈ Z and -l ≤ s ≤ +m and s ≠ 0 and k = i + s and 1 ≤ k ≤ p and M(T_k) = c }    (8)

Once again it is useful to define a set which consists of the set of collocates together with their frequency of occurrence; this we write as CVF(l, m, w, T):

CVF(l, m, w, T) = { (c, CF(c, l, m, w, T)) | c ∈ CV(l, m, w, T) }    (9)
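Equations (7)-(9) share one scan over the text, so a single sketch can compute all three together (again assuming a set-valued M and 1-based positions):

    from collections import Counter

    def collocates(l, m, w, T, M):
        p = len(T)
        CF = Counter(
            word
            for i in range(1, p + 1) if w in M(T[i - 1])   # occurrences of w
            for s in range(-l, m + 1) if s != 0            # offsets, s != 0
            if 1 <= i + s <= p                             # stay inside T
            for word in M(T[i + s - 1])
        )
        CV = set(CF)                                       # equation (7)
        CVF = {(c, CF[c]) for c in CV}                     # equations (8)-(9)
        return CV, CVF

    M = lambda t: {t.lower()}
    T = "The cat sat on the mat".split()
    print(sorted(collocates(1, 1, "the", T, M)[1]))
    # [('cat', 1), ('mat', 1), ('on', 1)]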

An index to the locations of all collocates of a given word can easily be defined as follows:

collocate index(c, l, m, w, T) = { k | i ∈ Z and 1 ≤ i ≤ p and M(T_i) = w and s ∈ Z and -l ≤ s ≤ +m and s ≠ 0 and k = i + s and 1 ≤ k ≤ p and M(T_k) = c }    (10)

There are two ways to get the collocate concordance. One can collect all those citations whose central token matches the word whose collocations are being determined, or one can produce the set of citations which have the collocate as their central token. In either case the amount of context could be different from the region about which the collocational analysis is being performed. We define the citation width as q tokens before the central token and r tokens after it.

The first method of getting a collocate concordance is:

collocate concordance 1(q, r, c, l, m, w, T) = { citation(q, r, i, T) | i ∈ Z and 1 ≤ i ≤ p and M(T_i) = w and s ∈ Z and -l ≤ s ≤ +m and s ≠ 0 and k = i + s and 1 ≤ k ≤ p and M(T_k) = c }    (11)


The second method produces:

collocate concordance 2(q, r, c, l, m, w, T) = { citation(q, r, k, T) | i ∈ Z and 1 ≤ i ≤ p and M(T_i) = w and s ∈ Z and -l ≤ s ≤ +m and s ≠ 0 and k = i + s and 1 ≤ k ≤ p and M(T_k) = c }    (12)
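The difference between equations (11) and (12) is only which position the citation is centred on: the node word (i) or the collocate (k). A sketch, assuming the citation() and case-folding M helpers from the earlier fragments:

    def citation(q, r, i, T):
        return T[i - q - 1 : i + r]

    def collocate_positions(c, l, m, w, T, M):
        """Yield (i, k) pairs: i an occurrence of w, k a nearby occurrence of c."""
        p = len(T)
        for i in range(1, p + 1):
            if w not in M(T[i - 1]):
                continue
            for s in range(-l, m + 1):
                k = i + s
                if s != 0 and 1 <= k <= p and c in M(T[k - 1]):
                    yield i, k

    def collocate_concordance_1(q, r, c, l, m, w, T, M):
        return [citation(q, r, i, T) for i, _ in collocate_positions(c, l, m, w, T, M)]

    def collocate_concordance_2(q, r, c, l, m, w, T, M):
        return [citation(q, r, k, T) for _, k in collocate_positions(c, l, m, w, T, M)]

    M = lambda t: {t.lower()}
    T = "The cat sat on the mat".split()
    print(collocate_concordance_1(0, 1, "mat", 1, 1, "the", T, M))  # [['the', 'mat']]
    print(collocate_concordance_2(1, 0, "mat", 1, 1, "the", T, M))  # [['the', 'mat']]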

ABSTRACT PACKAGE

This section shows how we can combine the previous ideas into a computer program or 'package' that can be used in literary and linguistic computing.

We must give our package the complete set of symbols that it will encounter in any text provided, define a process for extracting tokens from a text, and define a mapping from tokens to words. It is convenient, but not necessary, to supply some simple orderings (i.e. ≤) between words so we can display the vocabulary sorted in various ways.

We begin by computing the vocabulary of the given text and the frequency of occurrence of each word, and hence VF(T). Then we instruct the package to choose a particular subset (say VF'(T)) of VF(T) for subsequent use. After this set has been chosen, all subsequent analyses are performed by a set of functions which are applied to the original text T and the chosen subset VF'(T), and are thus of the general form G(T, VF'(T)). Specific instances of G are 'wordlist', 'index', 'concordance' and 'collocations'. The method can be seen from the following diagram:

compute VF(T)
choose VF'(T) ⊂ VF(T)
repeat:
    select some function G
    apply G(T, VF'(T))

Note: a dummy function 'finish' terminates the loop.

The set VF'(T) is a subset of the set VF(T) which, from equation (3), is composed of a set of ordered pairs (w, f) where w is some word and f is its frequency of occurrence in the text T.

To produce a 'wordlist' it is sufficient to:

display VF'(T) in some suitable order.

To derive an 'index' we proceed as follows:

for all (w, f) such that (w, f) ∈ VF'(T)
    compute index(w, T)
    display it in some suitable way

In a similar manner we can produce a 'concordance'.

given l, m

for all (w, f) such that (w, f) ∈ VF'(T)
    compute concordance(l, m, w, T)
    display it in some suitable way


To discover the 'collocations' in a text we use:

given q, r, l, m

for all (w, f) such that (w, f) ∈ VF'(T)
    compute CV(l, m, w, T)
    for all c such that c ∈ CV(l, m, w, T)
        compute CF(c, l, m, w, T)
    compute CVF(l, m, w, T)
    choose CVF'(l, m, w, T) ⊂ CVF(l, m, w, T)

    collocate listing:
        display CVF'(l, m, w, T) in a suitable way

    collocate index:
        for all (c, f) such that (c, f) ∈ CVF'(l, m, w, T)
            compute collocate index(c, l, m, w, T)
            display it in a suitable way

    collocate concordance:
        for all (c, f) such that (c, f) ∈ CVF'(l, m, w, T)
            compute either collocate concordance 1(q, r, c, l, m, w, T)
                    or collocate concordance 2(q, r, c, l, m, w, T)
            display them in a suitable way.
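The control flow of the abstract package is itself easy to sketch. The following toy driver is illustrative only; the task names and the particular subset rule are invented for the example. It computes VF(T), fixes a subset VF'(T), then applies chosen functions G until the dummy 'finish' task is met:

    from collections import Counter

    def run_package(T, M, tasks):
        F = Counter(w for t in T for w in M(t))
        VF = set(F.items())                        # compute VF(T)
        VFp = {(w, f) for w, f in VF if f > 1}     # choose a subset VF'(T)
        for G in tasks:                            # select some function G
            if G == "finish":                      # dummy task ends the loop
                break
            if G == "wordlist":                    # apply G(T, VF'(T))
                for w, f in sorted(VFp):
                    print(f"{w:10} {f}")

    M = lambda t: {t.lower()}
    run_package("The cat sat on the mat".split(), M, ["wordlist", "finish"])
    # the        2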

IMPLEMENTATION

In practice we take the text data and compute a vocabulary table which represents VF [equation (3)]. The text is read by a finite state recogniser and an inverted file is created which forms a representation of the text. This file has entries which point to entries in the vocabulary table and also pointers to the subsequent occurrence of the same word. The mapping used by CLOC converts tokens to words in several user-chosen ways, e.g. changing all alphabetic letters to lower case and removing certain characters. The vocabulary table VF contains, for each entry, the following information.

(a) The sequence of symbols denoting this entry
(b) The frequency of occurrence of the entry
(c) A two-state indicator denoting 'token' or 'word'
(d) ditto denoting 'separator' or 'not separator'
(e) ditto denoting 'required' or 'not required' as a word
(f) ditto denoting 'required' or 'not required' as a collocate
(g) The location in this table of the word corresponding to a token entry
(h) The position of the first occurrence of this entry in the inverted text file.

Choosing a subset of the words in the vocabulary (VF') is done by setting the appropriate indicator in the table. The required collocates are also chosen by setting the relevant indicator. Finding CV(l, m, w, T) [equation (7)] involves scanning the inverted text file, looking at the context from l words before to m words after the word w, and keeping a list of the collocates found together with their occurrence frequencies. The fourth indicator in the vocabulary table is now used to reject unwanted collocates from the list. By this means the set CVF'(l, m, w, T) is constructed. To produce printed citations it is a simple matter to use the inverted text file and vocabulary table to reconstruct lines of text.
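A data-structure sketch may make the inverted file clearer. The field names below are illustrative (they are not taken from the CLOC sources); only fields (a), (b), (e) and (h) of the table are modelled, together with the next-occurrence chain through the text file:

    from dataclasses import dataclass

    @dataclass
    class VocabEntry:
        symbols: str             # (a) the symbol sequence for this entry
        frequency: int = 0       # (b) occurrence frequency
        required: bool = False   # (e) chosen for the working subset VF'
        first_pos: int = -1      # (h) first occurrence in the inverted text file

    def build_inverted_file(tokens):
        table, text, last_pos = {}, [], {}
        for pos, tok in enumerate(tokens):
            w = tok.lower()                         # the token-to-word mapping
            if w not in table:
                table[w] = VocabEntry(symbols=w, first_pos=pos)
            table[w].frequency += 1
            text.append({"word": w, "next": -1})    # -1: no later occurrence yet
            if w in last_pos:
                text[last_pos[w]]["next"] = pos     # chain equal words together
            last_pos[w] = pos
        return table, text

    table, text = build_inverted_file("The cat sat on the mat".split())
    print(table["the"].first_pos, text[0]["next"])  # 0 4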

REALISATION IN CLOC

The control statements in the CLOC package come in three groups, each reflecting the function of the underlying abstract package. The groups are: token specification, choice of vocabulary, and tasks.

Token specification

The procedure for extracting tokens from a text is a finite state recogniser. The ASCII character set is divided into three mutually exclusive categories named letters, separators, and ignorables. A token is a contiguous sequence of letters. An ignorable is treated as totally absent, and a sequence of separators delimits one token from the next. By default all non-printing ASCII characters are ignorable, with end-of-line being treated differently. In practice the user need only specify the list of letter characters. A default mapping is supplied which unifies upper and lower case letters. The order in which the letters are supplied defines their collating order, and hence a simple ordering relation ≤ between words is implicitly defined. The following example shows how one can specify letters as the conventional alphabet.

ITEMISE USING) cloc
*LETTERS) abcdefghijklmnopqrstuvwxyz
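In Python terms, the recogniser amounts to a three-way character classification. A sketch, where the contents of the IGNORABLE set are an assumption and end-of-line handling is omitted:

    LETTERS = set("abcdefghijklmnopqrstuvwxyz")
    IGNORABLE = {"\x00", "\x7f"}             # e.g. some non-printing characters

    def tokenize(text):
        tokens, current = [], []
        for ch in text.lower():              # default case-unifying mapping
            if ch in IGNORABLE:
                continue                     # ignorables are 'totally absent'
            if ch in LETTERS:
                current.append(ch)           # letters extend the current token
            elif current:                    # any other character separates
                tokens.append("".join(current))
                current = []
        if current:
            tokens.append("".join(current))
        return tokens

    print(tokenize("The cat, sat; on the mat."))
    # ['the', 'cat', 'sat', 'on', 'the', 'mat']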

Choice of vocabulary

At this point CLOC has computed the complete vocabulary and the frequency of occurrence of every word in the text. The user is now able to select any subset of this set for later use. The user is given three ways to define sets of words of interest. These we call 'set descriptions'. They are:

(1) an explicit list of words, e.g.

    *LIST OF WORDS) this that who what

(2) a list of 'patterns' used to define endings, beginnings or skeletal form, e.g.

    *PATTERN) th* *ing be*ing

(3) a set of words from the same frequency distribution, e.g.

    *FREQUENCY) 1 OR 2 OR >200 OR (20 TO 40)

Each of these could be supplied an arbitrary number of times. A sequence of 'set description' control statements implies set theoretic union between them. An explicit set theoretic union operator INCLUDING is available, together with a set theoretic difference operator EXCLUDING. The entire vocabulary can be chosen using the statement EVERYWORD. Hence the user has several ways in which to specify a subset of the original vocabulary. A working set of words is first defined by means of the EVERYWORD or SELECTWORDS control statements. An EXCLUDING statement may then be used to remove some words from this working set. The INCLUDING statement can then be used if too many have been excluded. The EXCLUDING or INCLUDING statements can be used as often as required, e.g.

(1) The entire vocabulary

    EVERYWORD

(2) The entire vocabulary excluding some 'set descriptions'

    EVERYWORD
    EXCLUDING
    set description sequence

(3) Some particular 'set description sequence'

    SELECTWORDS
    set description sequence

(4) as above but with exclusion

    SELECTWORDS
    set description sequence
    EXCLUDING
    some other set description sequence

(5) as above but with inclusion of some other set

    SELECTWORDS
    set description sequence
    EXCLUDING
    some other set description sequence
    INCLUDING
    a different set description sequence
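The effect of these statements can be modelled as ordinary set algebra. In the Python below, only the CLOC statement names come from the paper; the modelling is an editorial illustration: EVERYWORD starts from the full vocabulary, EXCLUDING is set difference, and INCLUDING is union.

    vocabulary = {"the", "cat", "sat", "on", "mat", "this", "that"}

    working = set(vocabulary)        # EVERYWORD
    working -= {"the", "on"}         # EXCLUDING a set description sequence
    working |= {"the"}               # INCLUDING (re-admit some words)
    print(sorted(working))
    # ['cat', 'mat', 'sat', 'that', 'the', 'this']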

Tasks

Now that the user has chosen the subset of the vocabulary that the CLOC package is to work on, he must next tell CLOC what he wants done. The package contains control statements to produce 'word lists', 'indexes', 'concordances' and to discover 'collocations'. These we illustrate by means of the following examples:

(1) word listing--prints the vocabulary subset in some order

WORDLIST) AFREQ

The parameter to the WORDLIST task allows the user to display his chosen vocabulary in ascending/descending alphabetic or frequency order. Reverse alphabetic (rhyming) order, and first (or last) occurrence order are also available.

(2) word index--prints each word together with a list of its locations in the text.

INDEX) ALPHA

The parameter causes a preliminary sort of the vocabulary into ascending alphabetic order.

(3) concordance--an index which includes context.

CONCORDANCE) REVALPHA, CITE 10 BY 10

We can specify how much context to print.

(4) co-occurrence--a given phrase or series is found. A series is a phrase which can have a fixed or variable number of tokens between its constituent words.

CO-OCCURRENCE

*PHRASE) function of the

*SERIES) the UPTO2 package

(5) collocations--word association analysis

(a) COLLOCATIONS) ALPHA, CITE 6 BY 6

*SPAN) 4 BY 4
*FREQUENCY) > 4


The collocates in the region 4 before and 4 after the chosen vocabulary are examined. Their citations are printed provided that their collocate frequency exceeds four. The printed citations have in this example six words of context on either side of the word of interest.

(b) COLLOCATIONS) ALPHA, CONDENSED

*SPAN) 4 BY 4
*FREQUENCY) > 4
REJECTING
*LIST OF WORDS) the a but

This example causes a simple tabular presentation to be printed which contains just the words of interest and their collocates. Before printing, the collocates "the", "a" and "but" are rejected.

SUMMARY

Several packages exist which process natural language text. Traditional tasks involve producing word lists, indexes, concordances, word frequency counts and numerical measures of style. A model of one such package is described and its design is examined. The operations carried out by the package are presented in terms of elementary set theory. A brief description of an implementation is given together with a few examples.

REFERENCES

1. Reed A., CLOC: A Collocation Package. Ass. Literary Linguistic Comput. Bull. 5, 168-173 (1977).
2. Reed A. and Schonfelder J. L., CLOC: a general-purpose concordance and collocations generator. Advances in Literary and Linguistic Research (Edited by Ager D. E., Knowles F. E. and Smith J.), pp. 59-72. AMLC, Birmingham (1979).
3. Berry-Rogghe G. L. M. and Crawford T. D., The COCOA Manual. University College, Cardiff and the ATLAS Computer Laboratory, Chilton Didcot (1973).
4. Russell D. B., COCOA: A Word Count and Concordance Generator. ATLAS Computer Laboratory, Chilton Didcot (1967).
5. Ross D. and Rasche R. N., EYEBALL: a computer program for description of style. Computers and the Humanities 6, 213-221 (1972).
6. Bratley P., Lusignan S. and Ouellette F., JEUDEMO: a text handling system. International Conference on Computers in the Humanities, Edinburgh (Edited by Mitchell J. L.), pp. 234-249 (1974).
7. Hockey S. and Marriott I., OCP: Oxford Concordance Program Users' Manual. Oxford University Computing Service (1980).
8. O'Sullivan D. F., STYL: Stylistic Analysis Program. Computer Laboratory, Queen's University of Belfast (1972).
9. Winston P. H. and Horn B. K. P., LISP. Addison-Wesley, London (1981).
10. Clocksin W. F. and Mellish C. S., Programming in Prolog. Springer, Berlin (1981).
11. Warren D. H. D., Pereira L. M. and Pereira F., PROLOG: the language and its implementation compared with LISP. Proceedings of the ACM SIGART-SIGPLAN Symposium on AI and Programming Languages (1977).
12. Brown P. J., SCAN: a simple conversational programming language for text analysis. Computers and the Humanities 6, 223-227 (1972).
13. Griswold R. E., Poage J. F. and Polonsky I. P., The SNOBOL4 Programming Language, 2nd edn. Prentice-Hall, Englewood Cliffs, New Jersey (1971).
14. Sinclair J. McH., Jones S. and Daley R., English Lexical Studies: Report to OSTI on Project C/LP/08. Department of English Language, University of Birmingham (1970).

About the Author--ALAN REED received a B.A. degree in Chemistry from the University of York, U.K. in 1969 and an M.Sc. degree in Computer Science from the University of Leeds in 1970. From 1970 to 1972 he undertook research in algebraic manipulation in the Computer Studies Department at the University of Lancaster. At present he is a Senior Computer Officer in the Centre for Computing and Computer Science at the University of Birmingham, where he provides advisory and consultancy support to users of the University's mainframe computers. He particularly encourages members of the Arts Faculty to use computers and supplies specialist advice on text processing. He is one of the founder members of the Association for Literary and Linguistic Computing.