machine literature searching iii. making indexes amenable to machine searching



92 AMERICAN DOCUMENTATION

and noted down for each bibliographical item. In this way a logical sequence of major headings emerged under which to present the bibliography. Subsidiary headings were then fitted into the major sequence as a very comprehensive cross-reference guide. On this basis an analytical subject index could be compiled without difficulties.

Nowadays each issue of the Bulletin is devoted to one particular topic only such as, for instance, tropical housing, urban land policies, etc. Before the studies on these subjects are undertaken a checklist is being worked out as a guide to the documentary research preceding the compilation of the Bulletin. In this way an outline frame of reference for subsequent major subject analysis is established, which is being enlarged by fitting in subsidiary aspects as the work progresses. The subject headings thus established enable the bibliography to be organized logically as an expression of the research which has gone into the study. At the same time we have to take care of the great number of countries and geographical regions which each Bulletin covers as a rule. We, therefore, compile what we call a “geographical selector” with each bibliography. This selector, in tabulated form, is based on the restricted number of major subject headings and an alphabetical listing of countries by major regions. Its presentation enables a two-way search to be made for the geographical location of a particular subject, either from the point of view of the subject, or from the point of view of the geographical area.

Having at long last reached the end of this paper on building classification and subject analysis - a topic which certainly does not lend itself to an exactly stimulating presentation - I feel tempted to quote Sir Winston Churchill who said, when he received an award for his memoirs:

“Writing a book is an adventure. To begin with it is a joy and an amusement. Then it becomes a mistress. And then it becomes a master. Then it becomes a tyrant, and in the last phase, when you are reconciled to your servitude, you kill the monster.”

And so it is with building classification . . .

MACHINE LITERATURE SEARCHING III. MAKING INDEXES AMENABLE TO MACHINE SEARCHING

ALLEN KENT, M. M. BERRY, AND J. W. PERRY*

INTRODUCTION

Previous papers in this series discussed basic problems encountered in analyzing information for subsequent retrieval from storage and for correlation. Conventional classification and indexing methods were reviewed. With respect to capabilities and limitations, automatic equipment for searching and correlating large files of information offers attractive possibilities. The realization of these possibilities requires that information be analyzed appropriately, so that automatically performed routine operations may achieve optimum effectiveness. Such analysis can be based on modification of previously developed indexing procedures.

The new indexing system must be designed so that the key operations, namely, analysis of information, indexing, and operation of the machines, do not make excessive demands on human time and effort. Another requirement is that the machines themselves function rapidly and effectively; otherwise, the advantages of using automatic equipment may be lost.

*Battelle Memorial Institute, Columbus, Ohio.

DIRECT ENCODING OF INDEX ENTRIES

Once information has been analyzed in terms of features pertinent to a user’s eventual interest, the index entries must be interpreted in a form such that operations can be conducted by machine. This means, practically, that the entries set up by the indexer must be entered in some medium (e.g., punched cards, tapes, etc.) with which the machine can operate and can conduct its identifying and selecting operations. This interpretation of index entries in a form amenable to machine searching is usually spoken of as encoding. This essential operation can be performed very simply by assigning certain patterns, e.g., of holes in punched cards, to represent an individual letter or numeral, and then simply spelling out words, phrases, formulas, and the like.

This simplest form of direct encoding involves nothing more than the use of appropriately selected patterns of holes, or such, as a means for recording letters and numerals. In this approach, the recording patterns spell out the index entries in exactly the same way that embossed patterns of dots are used to spell out words in Braille. This simple form of encoding can be accomplished automatically by machines already in existence. Depressing a key corresponding to a given letter or numeral causes the corresponding pattern to be punched in a card or to be otherwise appropriately recorded. In other words, an operation very similar to typing suffices to render the index entries immediately searchable by machine. Once the index entries have been recorded as appropriate patterns, searching operations can be directed to any word or combination of words. Although this approach is simple, it suffers from many disadvantages. One of these, which is closely related to problems encountered in using conventional subject indexes that do not utilize machines, is the necessity for the operator to set the machine so that the searching and selecting is based on the right words. Synonyms form one obvious source of difficulty in this connection. Since the machine operates by pattern matching, a search directed to a pattern representing “mercury” does not produce a positive response when the pattern for “quicksilver” is encountered. Near-synonyms and terminology having overlapping areas of meaning result in similar and even more vexing limitations on the usefulness of this simple approach. Of course, homonyms would also cause difficulty; thus, “lead” (to guide) would not be distinguishable from “lead” (heavy material).
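The synonym and homonym limitations can be illustrated with a minimal sketch. The five-position hole pattern below is invented for illustration and is not the code of any actual punched-card machine; the card contents and the names `letter_pattern`, `encode`, and `search` are likewise assumptions.

```python
import string

def letter_pattern(ch):
    """Return a made-up 5-position hole pattern (1 = hole) for one letter."""
    index = string.ascii_lowercase.index(ch)
    return tuple(((index + 1) >> bit) & 1 for bit in range(5))

def encode(word):
    """Spell a word out letter by letter as a sequence of hole patterns."""
    return tuple(letter_pattern(ch) for ch in word.lower())

def search(cards, word):
    """Select cards holding an entry whose pattern matches the query exactly."""
    target = encode(word)
    return [card_id for card_id, entries in cards.items()
            if target in (encode(entry) for entry in entries)]

# Hypothetical cards; each holds the index entries for one document.
cards = {
    "doc1": ["mercury", "thermometer"],
    "doc2": ["quicksilver", "mining"],
    "doc3": ["lead", "pipes"],       # the metal
    "doc4": ["lead", "expedition"],  # to guide
}

# "quicksilver" has a different pattern, so doc2 is not selected.
print(search(cards, "mercury"))
# Both senses of "lead" are spelled identically, so both cards are selected.
print(search(cards, "lead"))
```

Because matching is purely on the spelled-out pattern, the machine has no way to relate “mercury” to “quicksilver” or to separate the two senses of “lead”; any such knowledge has to be supplied by the operator when setting up the search.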

This form of direct encoding is also limited in its usefulness by another factor. Consider, for example, the situation that would result if the patterns were used to spell out the names of individual disease organisms, e.g., pneumococcus, and then a search is required for all documents concerning disease organisms invading mammals in the Northern Hemisphere. Before the machine could be set to conduct this search, a list of all organisms that might be involved would have to be compiled. Furthermore, a list of all mammals would have to be prepared. Also, all of the countries, cities, etc., of the Northern Hemisphere would have to be drawn together. Only then could a thorough search be initiated. This would be extremely time-consuming and uneconomical from the point of view of setting up and conducting machine operations.

However, in order to provide an elementary demonstration to illustrate basic features and operations, IBM used this very simple form of coding in demonstrating their new index-scanning machine.¹ The IBM demonstration consisted of inspecting pictures in a gallery and noting the important features of each picture. Thus, a winter scene might involve “snow”, “trees”, “barn”, “cows”, “dog”, “fence”, and “man.” These words were then punched in cards by using appropriate patterns of holes for successive letters. Search by machine could then be made to select cards corresponding to pictures showing either “barn” or “dog”, or to pictures showing both “barn” and “dog”, or other combinations of words as desired.
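The two kinds of selection in the picture demonstration, “either word” and “both words”, can be sketched as set operations. This is only an illustration of the logic, not a description of how the IBM machine worked internally; the winter scene's words come from the text, while the other two cards and the function names are hypothetical.

```python
# One card per picture, holding the words noted for that picture.
pictures = {
    "winter scene": {"snow", "trees", "barn", "cows", "dog", "fence", "man"},
    "harbor": {"boats", "sea", "dog"},   # hypothetical card
    "farmyard": {"barn", "hens"},        # hypothetical card
}

def select_any(cards, words):
    """Select cards showing at least one of the requested words (logical OR)."""
    return [title for title, noted in cards.items() if noted & set(words)]

def select_all(cards, words):
    """Select cards showing every requested word (logical AND)."""
    return [title for title, noted in cards.items() if set(words) <= noted]

print(select_any(pictures, ["barn", "dog"]))  # all three cards qualify
print(select_all(pictures, ["barn", "dog"]))  # only the winter scene qualifies
```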

Direct coding of index words, although advantageous with regard to simplicity of input into the search medium, is an extremely naive approach to the machine retrieval of information and is quite limited in effectiveness. A search defined by the specific words used for indexing is direct and easy. If, however, instead of searching for “barn” and “dog”, the range of interest is broadened to “building” and “animal”, no simple direct search is possible. To locate all items involving “building” requires that the machine search for “barn”, “house”, “church”, “hotel”, etc., while the generic term “animal” would require search for “dog”, “cat”, “mouse”, “rat”, “horse”, “cow”, etc. Setting up such searches on machines would require extra effort, and the actual machine searching operation would usually make the more sophisticated type of search nonfeasible from the economic and time points of view. In other words, so much time, both of human operators and of machines, would be conserved if terminology of a generic nature, e.g., “building”, “animal”, were made available for defining the scope of searches and for conducting them that investment of much initial expense and effort would be advantageous in order to insure, in advance, the efficiency of human and machine searching operations. The limitations of direct word coding become more and more evident with the decrease in number of words per document that are used as index entries and are recorded in the search medium.

¹Luhn, H. P., “The IBM Electronic Information Searching System,” Presented at the Symposium on Machine Techniques for Information-Selection, Massachusetts Institute of Technology, Cambridge, Massachusetts (June 10-11, 1952).
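The extra effort of a generic search under direct word coding can be sketched as follows. The lists of specific terms are the ones given in the text (a real list would be far longer); the cards and the names `generic_search` and `terms_to_set_up` are assumptions made for illustration.

```python
# Lists an operator would have to compile by hand before the search.
SPECIFIC_TERMS = {
    "building": ["barn", "house", "church", "hotel"],
    "animal": ["dog", "cat", "mouse", "rat", "horse", "cow"],
}

def generic_search(cards, generic_terms):
    """Select cards matching at least one listed specific term per generic term."""
    selected = []
    for card_id, words in cards.items():
        if all(any(term in words for term in SPECIFIC_TERMS[generic])
               for generic in generic_terms):
            selected.append(card_id)
    return selected

def terms_to_set_up(generic_terms):
    """Count the specific words the machine must be set to look for."""
    return sum(len(SPECIFIC_TERMS[generic]) for generic in generic_terms)

# Hypothetical cards with directly coded index words.
cards = {
    "p1": {"barn", "dog"},
    "p2": {"church", "snow"},
    "p3": {"stable", "horse"},  # "stable" is absent from the operator's list
}

print(generic_search(cards, ["building", "animal"]))  # p3 is missed
print(terms_to_set_up(["building", "animal"]))        # ten words for two concepts
```

Even with these short lists, two generic concepts become ten specific search words, and any building or animal the operator forgot to list is silently missed.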

GENERIC ENCODING OF INDEX ENTRIES

It is highly advantageous, therefore, that generic terminology be available for defining and conducting searches without imposing any extra burden on the indexer. It has been demonstrated experimentally, both in the machine index-searching experiments at the United States Patent Office and elsewhere, that the coding system can be designed so that the encoding step itself automatically makes generic terminology available for searching. This is accomplished by arranging index terminology so that codes for individual index entries will contain code elements which represent more generic concepts and which will be individually searchable. (This will be discussed in greater detail in a later paper of this series.)
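A minimal sketch of this idea follows. The actual code structure is deferred to a later paper in the series, so the code dictionary, the element names, and the functions `encode_entries` and `search` below are all hypothetical; the sketch only shows how generic elements embedded in each code become searchable without extra work by the indexer.

```python
# Hypothetical code dictionary, compiled once: each index term is assigned
# a code whose leading elements stand for more generic concepts.
CODE_DICTIONARY = {
    "dog": ("ANIMAL", "MAMMAL", "DOG"),
    "cow": ("ANIMAL", "MAMMAL", "COW"),
    "barn": ("BUILDING", "BARN"),
    "church": ("BUILDING", "CHURCH"),
    "snow": ("SNOW",),
}

def encode_entries(terms):
    """Encoding step: the indexer supplies specific terms only."""
    return [CODE_DICTIONARY[term] for term in terms]

def search(cards, *elements):
    """Select cards whose codes, taken together, contain every requested element."""
    return [card_id for card_id, codes in cards.items()
            if all(any(element in code for code in codes)
                   for element in elements)]

cards = {
    "p1": encode_entries(["barn", "dog"]),
    "p2": encode_entries(["church", "snow"]),
    "p3": encode_entries(["cow"]),
}

print(search(cards, "BUILDING", "ANIMAL"))  # generic search, no term lists needed
print(search(cards, "MAMMAL"))              # intermediate level of generality
print(search(cards, "DOG"))                 # specific search still works
```

The cost of relating specific to generic terms is paid once, when the dictionary is compiled, rather than each time a search is set up.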

In order for the code relationships built into a system to require minimum modification with the passage of time, the generic elements built into codes should be based on permanent relationships. Furthermore, it is undesirable to base these on relationships which, though generally true, have important exceptions, nevertheless. On this basis, it would be better to relate the specific term “dog” to the generic terms “animals”, “vertebrate”, or “mammal” than to similar generic terms indicating the useful nature of dogs or their status as domestic animals. All dogs are animals, but all dogs are not necessarily domesticated, nor are all dogs necessarily useful animals. It may prove advantageous, for important short-term purposes, to violate the suggested rule that generic code elements shall be established on the basis of permanent relationships. During the course of a war, for example, it might be well to hold a coalition of enemy countries together by a code for machine searching purposes, even though such a coalition would doubtless be terminated at the conclusion of hostilities.

Another consideration in constructing codes is the fact that relatively little is to be gained by including in them relationships that probably would be expressed by the other index entries which designate the subject matter of a document. (This will be discussed in detail in a later paper of this series.)

Another advantage of generic encoding is its effectiveness in providing links between accounts of minor incidents and more comprehensive reports. Specific terminology will be required for indexing minor incidents, while generic terms will be more appropriate for comprehensive reports. Codes based on linking the two types of terminology together facilitate searches directed to reports with different ranges of content.

These codes are also helpful in defining terminology. Many words in the English language have more than one meaning. Thus, the word “plant” may refer to an industrial establishment or to a botanical organism. By including such words in appropriate general codes, the chances of confusion are reduced. (This, too, will be discussed in detail in a later paper.)
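How a generic code element can separate the two senses of “plant” may be sketched as below. The codes, element names, and document titles are invented for illustration, since the paper defers the actual code design to a later installment.

```python
# Hypothetical codes: the generic element tells the two senses apart.
CODES = {
    ("plant", "industrial establishment"): ("ESTABLISHMENT", "PLANT"),
    ("plant", "botanical organism"): ("ORGANISM", "PLANT"),
}

# Hypothetical cards, each holding the codes assigned to one document.
cards = {
    "report on a steel plant": [CODES[("plant", "industrial establishment")]],
    "survey of alpine plants": [CODES[("plant", "botanical organism")]],
}

def search(cards, *elements):
    """Select cards having a single code that contains every requested element."""
    return [card_id for card_id, codes in cards.items()
            if any(all(element in code for element in elements)
                   for code in codes)]

print(search(cards, "PLANT"))              # ambiguous: both documents
print(search(cards, "ORGANISM", "PLANT"))  # botanical sense only
```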