conferenze e seminariold.iss.it/binary/publ/cont/pag125_137partei_iiianno1970.pdf · sanit4 (lllio)...

13
CONFERENZE E SEMINARI Present and future trends of information retrieval (*) HE RBE RT COBLANS Formerly Reuarch Department. Al!Ociation of Special L ibrarie• and Informatio n Bur eaw c (Aslib), ùm don, Grtal Britain My s ubjoct is the c ommunication of scientific knowl cdge, a problem which da tes back to the s ixtee nth century, the beginnings of mode ro science in whi ch ltaly played so signific ant a part. What differe nti ates «modero» science from its an tece de nt e in medi eval times is assentiall y th at it re pre- sent s a consensus of acccpted knowl e dge ('). To be accepted it must be widely available, published a nd docum e nt ed so that it can be critically f; valuatcd. Thus « Because Le onardo did not live to publish bis di scoveries, hi s influence on the history of scientific thought was considerably less than the ev idence of hi s manu script s suggests it could have b een. He was drop- ping wcight s off towers a hundrcd years before Galileo is supposed to h ave perfo rm ed the s ame exper iment and long before the doctTine of gcocentr ism was abandoned he conclud cd that the sun did not move )) ( 2 ). Similarly Mendel's fa mou s laws of hcredity which, although puhlished in 1866 in an obscure Moravian periodica), did not become part of the scien- tific conscnsus until th ey were re-discovered around 1900 by de Vries and other s. lf t his concept of science as a consensus is accepted, research a nd its docume ntati on mu st be seen ra ther like mirror images. When scientific resear ch was lar gely hobby )) of gentl emen scie ntist s, from the sevent eenth ce ntury unt il th e first part of the ninetee nth ce ntury, the problema of com- munication, what we now call doc umentation , we re not serious as the scale was limited. Increasingly with the growth of applied science a nd technology at the beginning of this cent ury, e ffective docume ntation h as become one of th e most im portant prre qui sites of succcssful research and development . .Around 1950 when tbc full impact of the scientific and t ecbnical lit e- ratur e of W orld W ar II was being felt, an American, Calvi n Mooers ( 3 ) in- v ented a new term, « info rm a tion retriev al». The phrase is amb iguous and ( 0 ) Con ferenza t enuta presso il Consiglio Nazionale delle Ricerche, ll oma, il 13 no- vembr e 1969. .1 nn. llt. Suver. Sa nità (1970) l, 12!\-137.

Upload: others

Post on 13-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CONFERENZE E SEMINARIold.iss.it/binary/publ/cont/Pag125_137ParteI_IIIAnno1970.pdf · Sanit4 (llliO) 8, 125-lS'l. 128 CONFERENZE t: St:J\11 NA1tl We,

CONFERENZE E SEMINARI

Present and future trends of information retrieval (*)

H E RBE RT COBLANS

Formerly Reuarch Department. Al!Ociation of Special L ibrarie• and Information Bureawc (Aslib), ùmdon , Grtal Britain

My subjoct is the communication of scientific knowlcdge, a problem which dates back to the sixteenth century, the beginnings of modero science in which ltaly played so significant a part. What differentiates «modero» science from its anteced ente in m edieval times is assentially that it repre­sents a consensus of acccpted knowledge (') . To b e accepted it must b e widely available, published and documented so that it can be crit ically f;valuatcd. Thus « Because L eonardo did not live to publish bis discoveries, his influence on the history of scientific thought was considerably less than the evidence of his manuscripts suggests it could have been . He was drop­ping wcights off towers a hundrcd years before Galileo is supposed to have performed the same experiment and long before the doctTine of gcocentrism was abandoned he concludcd that the sun did not move)) (2) .

Simila rly Mendel's famous laws of hcredity which, although puhlished in 1866 in an obscure Moravian periodica), did not become part of th e scien­tific conscnsus until they were re-discovered around 1900 by de Vries and others. lf this concept of science as a consensus is accepted, research and its documentation must be seen ra ther like mirror images. When scientific research was largely a« hobby )) of gentlemen scientist s, from the seventeenth century u nt il the first part of the nineteenth century, the problema of com­munication , what we now call documentation, were not serious as the scale was limited. Incr easingly with the growth of applied science and technology at the b eginning of this cent ury, effective documentation h as become one of the most important pre·requisites of succcssful r esearch and development .

.Around 1950 wh en tbc full impact of the scientific and tecbnical lite­rature of W orld W ar II was being felt, an American, Calvi n Mooers (3) in­vented a new term, « information retrieval». Th e phrase is ambiguous and

(0

) Conferenza tenuta presso il Consiglio Nazionale delle Ricerche, lloma, il 13 no­vembre 1969.

.1 nn. llt. Suver. Sanità (1970) l , 12!\-137.

Page 2: CONFERENZE E SEMINARIold.iss.it/binary/publ/cont/Pag125_137ParteI_IIIAnno1970.pdf · Sanit4 (llliO) 8, 125-lS'l. 128 CONFERENZE t: St:J\11 NA1tl We,

126 CO~f'EBENZf. E SEMINARI

misleading but it bas stuck, in fact it has gone into thc language and is evf'n uscd iu other languagcs. lt bas becomt' ver~· clos<'ly associated with th1• so-called « informatiou explosion » ancl with a hopcd-for revolution in docu­mentation in volved in thc' use of computers. Botb tbest' implications haY e bcen exaggerated and au, in fact, partly erron<•ou~ . Firstly, tbere is no cxplosion in any rneaningful senS(' of that roucb abused word - that is if onc is not indulging in rather crude journalism. Secondly, thc computer is not necessarily thc wh ole answt' r to the expoucntial growth of tbc Iitcra­tur<· of science an d t ecbnology.

Exponential growth is onc of tbc laws of organic nature, but it is also a statistica! la w of community lifc. This phenomenon has bcen wcll explained and illu stra ted by dc Solla Price (4 ) in bis hook « Little science, big science». H c shows that « If any sufficieutly largt· scgment of sci eu cc is measurcd ... the norrnal mode of gro~1:b is exponential ... multiplying by some fìxed arnount in equal periods of time, ... an empirica) law (whicb) holds trul' with high accuracy over long periods of time». Some typical doubling times in years are given bclow.

Doubling times in years 50 yrs. Population

nurobcr of universities 20 Numher of cb emical elements known

accuracy of instruments 15 Number of scicntific periodicals

numbcr of scientitìc abstracts published 10 Literature of non-Euclidean geometry

literaturc of expcrimental psychology number of telephones in the U.S.A.

Admittedly there has been a gr cat incrcasc in publication, but wc must keep our Jlerspective. Tht position varies with subj cct and what we are measuring. Thus a few years ago K. Mellanby, Director of the Monk' s Wood Expcrimental Station (5) estimateci that in Britain « the numher of pages u sed to communicate tbc ncw r esults of originai research has only doubled h etween 1936 and 1966» - 30 years in contrast with tbc generai donbling rate of 15 ycars. In some fields likc classicaJ taxonomy the ratf' of growtb is negative. On the other band tbc growtb of the researcb li­t erature of rnathcmatics is lincar - fairly steady at around 500 papers per annum.

Tbis question of size and growth can also b e approached from the point of view of quality . A reccnt study (41) on British periodicals on science aud t echnology uscs the frequency of citation of papcrs in periodicals as a measure of tbe« goodness» of the periodical. Using the magnetic tape storc of Science

A nn. Ist. Supcr. Scm i tò (IDiO) 6, 12;;-137.

r

j

l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l

Page 3: CONFERENZE E SEMINARIold.iss.it/binary/publ/cont/Pag125_137ParteI_IIIAnno1970.pdf · Sanit4 (llliO) 8, 125-lS'l. 128 CONFERENZE t: St:J\11 NA1tl We,

COBUNS 127

Citation Index it was found that there wcre 68,800 citations in the perio· dicals of 1965 to a total of 29,000 papera wbich appeared in 1842 British scientific periodicals in 1963 and 1964. However, 95 % of these citations referred to only 165 periodicals. In other words less than lO % of all perio· dicals (the best tenth) would give 95 % coverage.

Another way of looking at this problem is from the point of view of coverage. A good example is that of « Chemical abstracts».. To control the world literature of chemistry Chemical Abstracts Service (7) scans some 12,000 periodicals, but 90 % of tbe abstracts that are published are based on articles in only 2000 periodica} titles. T o satisfy 90 % of user requirements is comparatively easy and inexpensive; it is the extra l O % which requires a disproportionate increase in costs and effort.

Basically the size of the literature is a function of the amount of money spent, as major product of research and development is new information as recorded in documenta. There is thus a direct correlation between re­search expenditure and publication. A xecent Euratom investigation showed that in almost all countries involved in atomic energy, for every million dollars spent on research and development on the average 18 documents are produced. This is neitber « explosive» nor unreasonable. Tbe problem is how to extract, to retrieve the costly information embedd ed in tbese documenta.

A glance at the statistica of national expeditures will help to give some quantitative basis to tbc emotive generalisations that are commonly bandied around about the information problem.

l 9 6 2

U.S.A. ......

U.K .........

l Netherlnnds . . l

France ...... W. Germany .

Belgium .....

U.S.S.R. . ....

Japan .. . ... .

CERO i_n rnillioru

s 18,000

.E 634

Fl 860

p.. "" ~

~a o DM 4420

BF 6625

GNP

3. 1

2.2

1.8

1.5

1.3

l. O

2. 6(?)

l. 3(?)

GERD = Gross expenditure in research & development

GNP = Gross National Product (total mo-uetary volue of national production). ·

l . l

l o) For the financial year 1965 the percentages for U.S.A. & U.K. were 3.5 and 2.8 re-spectively.

b) The figure for the U.K. in 1938 was 0.3% . Thus there has been a 9-fold increase over 27 years.

--------------------- _j

A nn. lat. SupeT. Sanit4 (llliO) 8, 125-lS'l.

Page 4: CONFERENZE E SEMINARIold.iss.it/binary/publ/cont/Pag125_137ParteI_IIIAnno1970.pdf · Sanit4 (llliO) 8, 125-lS'l. 128 CONFERENZE t: St:J\11 NA1tl We,

128 CONFEREN ZE t: St:J\11 NA1tl

W e, <'Specially the scientists, are ver y concerned that more mon ey should h<' mad.- availabl t> for r escarch and devclopment, and therefore wt' must acccpt the implica tions of this. M.ore documents will result and their handling and control from the point of view of their suhjcct content- in a word, documentation - must be acceptcd as an important part of tht• whole cost of scientific research.

I have already referred to the lack of precision in the term informa ­tion retrieval. What is involved in IR and what is the prescnt state of tiH" art? IR includes two diA"erent hut related concepts, data-providing system s (fact retrieval) and re ft>rence-providing systems (doeument retrieval). ldeally the user wants information. data, but in pract ice hc (espeeially if he is a scientist) will more commonl~· be giveu a set of documents whieh might g ive him somt' of the facts or idcas that will help him. Of eourse it depend ~

on the nature of tht· originai qu estion. For example, a ge nerai question on the eomposition of a specifìe alloy, like lnv ar, or on «tbc numher of suicides in a given country in a stated year», is in a sense trivial, sinee a Jlroper referenee service in a library ean usually give an imm edialt• answer. O n the oth(•r han d a stud y of « the vacuum deposition of thin films » or« th1· epidemiology of suicide» would not bave a simpl<' precise answ~:r, in tht· present stat<' of our knowlcdge, and at b est a numbu of possibly r elevant documenti' would b e· thf' commonest r esponse. To provid c such a set of documt•nts raises qul!lìtions of vcr y rea) complexity involving indexing, classifìcation, semantics and linguisties.

The raw materia! which fiows into a doeum<'ntation sen·ict• has man) formi' (book, periodica!, r eport, tradc literaturc t•Lc·.). But in practice tbt· main efforl is devoted to the sei1·ntific paper in a periodica! as a typical hihliographical unit. Traditionally tht· paper, as presented, haR clemeut~

for idcntilication and shoulcl provide some elu rs as to its subjeet contt:nt. The following identifyiug fcatures can usually h 1· d istinguisb Nl:

l. Authors (p t• rsonal or corporate) 2. Title 3. Periodica! issu1· 4. Date ;.>, Piace of work of authors 6 . Figure & t abk caption j;

i. ContentR of tables & diagrams

8. Referenees cited

9. T ypographically diffen ' ntiat ed words in t t·x t

10. \'(fords, phrases, scntcnees providing elues to subjcct content

11. Abstract

12. Name of eonference (if applicable)

Ami . /~t. S1111rr. Sanità (!IliO) 6, 12~- 13; .

Page 5: CONFERENZE E SEMINARIold.iss.it/binary/publ/cont/Pag125_137ParteI_IIIAnno1970.pdf · Sanit4 (llliO) 8, 125-lS'l. 128 CONFERENZE t: St:J\11 NA1tl We,

COBLANS 129

The elements l to 5 and 12 are unequivocal t ags of identifìcation, though l and 5 can introduce issues of matching and ordering because of variant spelling and the lack of standardisation in any fact expressed in natural. language - what in ordinary library practice makes cataloguing a profes· sional job needing skill and experience. The elem ents 2, 6 and 7, 9 to ll can be used eitber as guides or in tbemselves for assigning indexing terms or selecting keywords or descriptors in the whole process of subj ect spe· cification. E lements l and 5 also provide the entries for citation indexing (8•9).

Sharp (10) describes a citation index as a« listing of documenta, usually ar· ranged b y autbor, with a listing against each entry of oth er documents which bave cited tbc document represented by tbc entry».

However it is subject indexing that is the centrai problem of IR. H er tt; there are a number of t ecbniques based on wbat are called indexing language devices. These fall into two classes; tbose that are « derivative», that is taking only words used in tbe document (elements 2, lO and 11), and those obtained by« assignment» from a« controlled v ocabulary» (an alpbabetical list of subj ect headings, keywords from a tbesaurus, or class headings from a bierarchical classification scheme). Traditionally hierarchical classification has been the common metbod for subject control. For r easonably effective r etrieval it pre-supposes indexers witb a professional training in classification and a good generai knowledge of science and technology, as well as a working knowledge of tbe common European languages. In tbe hope of reducing personnel coste, co-ordinate indexing based on tbesauri has b ecome in­creasingly popular since t be fiftiea, especially in industriai libraries. lt was assumed tbat working with keywords, whicb are recognisably part of natura} language, indexing would need less highly trai n ed staff. W e know now that, for large collectiona of documenta, tbese hopes are not really j ustified.

Ahove ali, tbc searcbing operations involved in co-ordinating k eywords, so as to matcb document content witb question, can be done with speed by computers. But however attractive computer metbods may ·b e for the storage and matcbing of terms assigned to documenta and to tbe questiona posed to tbc system, tbe number of relevant documenta retrieved dependa essentially on the quality of t bc originai intellectual effort of tbc human index ers. In practice it does not matter much wbetber you are using clas· sification or co-ordinate indcxing, tbe fìnal results is on the average much the same and not very good. For undcr tbc best conditions that can be achieved in tbc generai r etrieval situation in a large collcction of documenta covering many subject disciplines, tberee out of four documenta provided in answer to a question turn out to b e of little or no interest to tbe inquirer.

It is tbus not surprising that derivative indexing based only on tbc words in tbe document is attractive once juggling (searching, matcbing,

.~nn. lat. Super. Sanità (1070) 8, 12$-137.

9

Page 6: CONFERENZE E SEMINARIold.iss.it/binary/publ/cont/Pag125_137ParteI_IIIAnno1970.pdf · Sanit4 (llliO) 8, 125-lS'l. 128 CONFERENZE t: St:J\11 NA1tl We,

ordering, printing-ont) witb words can be donl' !:>O quickl) and cfticit·ntl~

with a computer. Already at tbe end of the fifties H. P. Lubn (11), tbt• pionet·r vf compu lerised documeutation. dn·dopt·d a fairl) simple computt·r program to rotate tht· words in the t itl e of a paper t o produce bis so-called Kt•y 'Word In Context (K " ' I C) I n d ex. H ere tbc infor mative words in tht· ti t l t· h eco me indcxing Lerml:>. This typt.• of index is n·ry suitable for computer operalion and bas b ecome popular sinc<' thc American Chemical Socit'ty startetl pu· blisbing itF « Chemical ti tleeo» in tbis way in 1961.

Wbile th e KWIC i ndex is fairly usefu] for subj ects lih cberuistry in wbicb t he vocabulary is ra tber p recise and unambiguous, such au artilicial r cstriction of indrxing t erm s, t o only thel:>t' words \\ b icb happen to occur iu t ht• tith· of a paper, is only a crude first approximation to satisfaclor) IH. Clearly if tbc whole text of a document could be madc macbine-readabll' and stored in tbc computer , opcrations on aU tbc words would gh·c a b ettr r indexing dcscription of tbt' documeut. Actually ovr r more tban a do:t.f'll y t'ars, billions of dollars bave bccn spcnt in tbe U.S.A. on rcsearcb in « automatie indcxing». Tbe objccti,·e Ìl:ì complete a u tomation. Tbere should be no intcllectual intervcntion at a11y point in tbt· wbole cyclc from whcu tbc document has been en tered into the computer store until tbl' answer to a qucstion bas been printed out on tbe lin e pr inter .

T bc t ccbniques used in automatic indexing are mainly statistica]. t h t· frequency count of certain word s and combinations of words in tht· wboh· tex't, supplementcd by tbc semantic and grammatica! rclationships bt·tween t hcse words. l n 1965 tbt• U.S. National Bureau of Standards publishcd a

su rvt'y rcport (1 2) whicb noted a few expcriments with t•ncouraging results. In Europe som e research in tbi s fìeld has been carried out at tbc Ct'nln· europécn de Traitement de l' l nformation scientifique (CETIS) of the E ura­tom research t•stablishment at Ispra (13). Nont' tbt· less as one American autbor ity has put i t , « Each t ime one of tbesc indexing syst ems has becu takcn out of tbc laboratory and suhj ect ed to the r ea] world tbc r esults ha' e bet·n u niformly bad».

During t he last few ycar s fr ee tex't searching on the words of an ab­stract, or preferably t he wbolc text of tbc doc umcnt, is b eing used opera· t ionally by som e large systems. Retrirval of docurnent s is bascd on thr conjunction in tbe t exts of suitable words wbicb ar e programmed from the q uest ion of the inquirer. lt is claimed tbat free text searching by words is no less efficicnt than t be usc of authorit y list s or classification schedules. H owevcr most of tbc evaluation bas been clone in a reas like cbem istrr or aerodynamics wbert' t erminology i s fairl) precise.

If t hcsc ar e t bc t hings t bat cannot b e dont• weU so far by computt·r metbodt~ Jet u s look at the r eal acbievcments and their implications fo: the future of scicntific documcntation. At prcsent tbey lie mainly in scienti·

Ann. 1st . SUJJer. San itu ( 1970) 6. 12.:-137.

l l l l l l l l l l l l l l l l l l

1 l

t

o ,;t

al

ar

l l l l l l l l l l l l l l l l l l l l l l l l

Page 7: CONFERENZE E SEMINARIold.iss.it/binary/publ/cont/Pag125_137ParteI_IIIAnno1970.pdf · Sanit4 (llliO) 8, 125-lS'l. 128 CONFERENZE t: St:J\11 NA1tl We,

·a m or d me I O U

u-

I O

i al

r r , ... lll

Id 'Il

('

Il

a ; ,

COBLANS 131

fie publication. « Computer-aided printing has rccently made possihle an important integration of two functions in communication - primary pub­lication and secondary retrieval. For cxample, in a study for the American Institute of Physics, Buckland (14) has shown how the machine-recording of the t ext of scientifìc papers during the publication of periodicals can be organizcd to produce the various indexcs, abstracts and so on as a simul­taneous by-product at a purely nominai cost. Thus each article after editing is t ypcd on a tape-typewriter with ali the retrieval elemcnts (author, title, author ahstract and so on) t agged with non-printing codes. Ali printing contro! instructions (t ype face, ty pe size, leading, justification) are added a t this input stage. This coded text on paper tape is stored in the computer and by suitable processing provides output in the ::>ame printed format as that in which the periodica! has always appeared, as well as the rcgular cumulations of indexes and abstracts » ( 11') . A typical example of the appli­cation of these techniques is « P sychological abstracts », which has been assembled and printed in this way since January 1966 (16).

The most rcccnt technical advanccs, mainly in the hardware, are t he result of pressure from newspapcrs for whom sp eed, large scale input and multiple editions in various large cities are important considcrat ions . Once the problcms of automatic j ustifìcation of the print line and long ùis tance, even across the Atlantic, transmission with on·-line t erminals, were mastercd the rate of progress of computer-aided typesetting in commerciai publications has grown rapidly. Thus a survey, covcring the U.S.A. and 26 countri t>s, published in Octoher 1968 shows that the number of computers used for this purpose has reachcd 821, representing a 54. % growth over the previous ycar.

Although n cwspapers are not concerned with a wide range of characters, pcriodicals in scicnce, especially tbe physical sciences, need large character scts. This uscd to be a limiting factor, but during the sixties the crudities of thc computer line printer bave been complct ely overcome. This has been shown in the most rccent achiev<'mcnt, the six publications wbich consti­tute the I Nformation Service in Physics, Electrotechnology, Computers and Control (11\'SPEC) of tbc Institution of Elcctrical Engineers in London (17).

They h ccame entirely computcr-produccù from January 1969. T he printing unit, the Lumitype 713 F ilmsetter, is driven by a computer-creatcd magnctic t ape and can handle over 700 different charactcrs and symbols at spccds of 1000 characters per minute. In this way not only are « Physics ab­stracts », « Electrical and electronic abs tracts» an d« Computer and control abstracts » as well as the corresponding Current Papers produced in high lJllality print, but a wide range of i ndexes for each issue {author, report number, conference proceet.lings, suhject etc.) as well as their cumulations are automatically- printed from the computer-st ored references.

.!nn. l at. Supcr. anità (1970) 6, 1~5-1:17.

l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l

Page 8: CONFERENZE E SEMINARIold.iss.it/binary/publ/cont/Pag125_137ParteI_IIIAnno1970.pdf · Sanit4 (llliO) 8, 125-lS'l. 128 CONFERENZE t: St:J\11 NA1tl We,

132 CON f'EHENZI:. t; SEM I NAIO

Thus it has bt'cn establish cd on an operational basi>< tbat periodicalf< and printPd li!:> tS (indcxes, a bs tract s, bibliogra phies) can be printcò economicall~ up to the b est standards if the scale of operation is large cnough . Th e storc· on magnetic tapes can b e uscd for organising and « re-packaging» text:-; a nel printing them out in diffen •nt pres1·ntations, thus adding considerably to the flex ibility and u~cfuluess of the stor NI record ~' for tbe dissemination of informatiun.

As bas already been mt•ntioned, mechaniscd IR does not show up the computer a t its bt'~ l, but that is not so mueh tbc fa ulL of tbc computt·r. a s tbt• inhercnt difliculties of IH, whctbcr m ccb anised or m anual. Within certa in limiti' th c E uratom N uclear Documentation System (ENDS), tlw pinn ccr and the largest sysh~m in Europc, ha~>' shown wba t can be dune in a some" ha t spccial situation ( 18 · 19) . I t s tarted witb two advantagcs: a s ubjert. applicd nuclear sciencc, in wbicJ1 the documentation startcd in 1946 and t be cxisl<' ncc of N uclf'ar Scicn cc A hstracl t- w hich fundamen tally cover:o; tlw wbolc fi eld with highly comp<•t ent abf;tracts. F urthermore the rate of grow t h of nuclear litcrature (by tbc end of 1968 i t h ad rea eh ed 800,000 rcferenccs) was such that computer contro! could pay o1L Tbc two main concurrent jobs werc the recruit ment and trainin~ of a team of highly qu al­ified suhj ect and indexing specialist :. and the construction of a thesaurus suitable for the interlockir1g s ubject di~ciplines. The Euratom Thesaurus (20) devclopcd over many years is in some ways a model for t his typ<· of compi­lation. lt consists of some 4 700 t erms, including only about 1200 genera i purposc k eywords, 1760 inorganic compounds with tbc remainder for nuclid<'>' ami alloys. T bc mcthod for searching in this co-ordina t e indexing syst l'm can best be illustrated by following tbrougb an actual run on a r eal question.

Wba t papers exist on the joint occurence of rhenium and osmium anywhere in tbc solar system ? (The abundance of Rh 187 and 0 !:' 187 is used for age cstimation as Rh 187 has a half Jife of six ty thou­sand million years producing Os 187 through b eta decay)

KEYWORDS

Al Rh 187 A2 Rh isotopcs A3 Rh

Sli:AROll S'l.'RA'l'EG Y

BI o~ 187 B2 Os isotopes B3 0 :;

Cl age estimation C2 abunda nc!'

51 (tight) Al AND Bl AND (Cl OR C2) This formulation will « pulJ out» only those papcrs wbicb ha,•e b cen indexed by 2 compulsor y and 2 optional kt•yword;;.

Anu. 181. Supcr. Snnitc), (1070) 6, 12.;- wr .

l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l

Page 9: CONFERENZE E SEMINARIold.iss.it/binary/publ/cont/Pag125_137ParteI_IIIAnno1970.pdf · Sanit4 (llliO) 8, 125-lS'l. 128 CONFERENZE t: St:J\11 NA1tl We,

orf' mel

t o io n

l hc t· r. p in

~h·~

l iti

('"

IU

n.

t '

COBLAJ'IS 133

S2 (medium) (Al OR A2 OR A3) AND (Bl OR B2 OR B3) AND (Cl OR C2) This less rigid formulation rlemands tlhe co-existence of 3 keywords, but the fìrst two bave 3 options and the third one has 2 options.

S3 (loose) (Al OR A2 OR A3) AND (Bl OR B2 OR B3 OR Cl OR C2) This formulation asks for the simultaneous presence of 2 keywords; one with 3 options and the other with 5 options.

These 3 searcbes, wbicb represent essentially a matcbing of tbese combinations and permutations in ali tbc documenta (in tbis case tbc store contained 100,000 references), finally provided a print-out by tbc computer of tbc abstract numbers in Nuclear Science Ahstracts .

Sl

S2

gave 2 referenccs NSA 08 06872 (08 = volume 8)

NSA 16 10751 (16 = volume 16)

gave 6 references (2 T 4) ~SA 01 1192 NSA 11 1917

08 6872

08 6873

11 7242

16 10751 S3 gave 62 references (2 T 4 + 56)

Tbc search analyst now goes to the collection of numbcred printed abstracts and examines eacb abstract. For Sl the two rcferences are found to be« hits», i. e. entirely relevant. For S2 there is an additiooal bit (NSA 08 6873) and one possible bit (NSA 01 1192), tbe abstract not b eing detailcd !'nough for certainty. The otber tbree are irrelevant. For 53 in tbe same way they establish two further bits and 7 possibles.

Photocopies of all the abs tracts of the .hits and the possible hits are sent to tbc questioner.

The above shows that the wbole organisation needed (computer, ab­stracts and highly trained s taff) is coosiderable and is only justified if a Jarge and varied .clientèle, in this case ba!lically the technical community of the 6 countries, can be served. In otber words, mechanised information re trieval of tbis ty pe is stili a very expensive operation and is probably beyond the range of the average documentation service in industry or in a research institute. Certainly for tbc main disciplines like bio-medicine, chemistry, physics, etc., one national centre for each , in countries of the size of the U .K., is likely to be viable under present conditions, and it may need to have some government subsidy at that.

Summary. - T he origin of tbe term « information retrieval» is rcccot and arose somewhat erroneously, as tbc computer answer to the impact after World War II of tbc exponential growtb of the literature of science. T his misnamed cooccpt is not as serious as tbc popular press has painted it and must be seen in perspective. Exponential growth (mtùtiplying by

.i~tn. lat. Supcr. Sanit<\ (11170) 6, 125-137.

Page 10: CONFERENZE E SEMINARIold.iss.it/binary/publ/cont/Pag125_137ParteI_IIIAnno1970.pdf · Sanit4 (llliO) 8, 125-lS'l. 128 CONFERENZE t: St:J\11 NA1tl We,

some fixed amount in equal pcriods of time) is one of th t• statistica! la\l i' of community life. Thus th1· accuracy of m eas uring instruments douhl<'i­on th <' average every 20 years, th e numher of scientific ahs trac ts publisht·d annualy doubh·s en•ry 15 ycars . A maj or product of r csearch anò dc·velop· mcnt is new information as r ecordNl in documents and th(•r efore tht•rl' Ì!-i a di rcct correlation b etwccn resca rch expenditure and publication.

1R involves 2 distinct and overlapping processc» - data·pro\·idin g systems (fact r c trieval) an d rcference· providing syst ems ( docum•mt retril'· val). Thc lattcr is the more difficult as the provision of a set of doc u m ent!­a nswering a specific question is a complex task involviug indcxing, classifica· tion. semantics and linguistics.

The examination of a typical artide in, say « Natur<'», shows thc elcmt·nt;; for identification an d some of tbc clut•s to its s ubjct't content. I ndcxing tcchniques art' stili vcry crud e and incfiici ent. D eviccs for the indexing contro! of documents vary from tbc free word indexing of th e K WIC (Key· word in context ) through controll<"d vocabularies (tbcsauri in co-ordinal(' indcxing and standard list s of subj ect hcadings) to bicrarcbical and facctted classifi cations, and more recently citation indexing. However in a largt· collection of documents covcring many subjects retricval, und cr tb c best conòitions, is sucb tbat 3 out of4 documcnts, provided in answer l o a qut'st· ion, turo out on the avrragr to bt• of no interest to the inquirer.

ince 1954 computer tecbniqurs bave becn applied lo achitwt' « auto· matie indexing» whereby tbcrc is no intellectual intervention (human indrx· ing) al an y point in the whole cyclr from the document entry into the system to the fina! printed-oul answer to tht' r equest er for specific information . In spite of sonw hopeful r esults in piJot projects no practical syst ems of generai validity bave em erged. The difficulties of automatic indexing art• r clatcd in principle to those of lbe machinc translation of natura! languag:t•;;. So far tht' main acbievcmenl~ are in computer-aided printing and the uti· lisation of tbe tagged and stored text for r<'lrieval centrally or in a d ecen· tralist•d manncr by the provision of duplicate magn ctic tape~;.

Tbc only computer syst ems for large scale indexing and IR actuall y operating are based on humanly assigned index t crrns, r. g. ENDS (the E uratom Nuclear Documentation Syst cm ), MEDLAR S (for bio-medicine) an d INSPE C (I nformation Servict· in Pbysics, Elcctrotechnolog), Computer:­and Control). A t ypical computer searcb to answer an actual question (wbat papcrs exist on thr joint occurrence of rhenium a nd osmium anywht"rr in the solar systcm ?) is dcscribcd and discussed.

Mechanised information retricval is stili a complex a nd exprnsive operation and therefore evcn for disciplincs like bio·m('di cine, chemistry and physics one na lional centre for middle-sized countrie::., lik<· tht' U .K. , is all tbat is juRtificd.

A·rm. l at. SuJ>er . Sanittl ( 11)70) 6, t2s-J3ò .

l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l

Page 11: CONFERENZE E SEMINARIold.iss.it/binary/publ/cont/Pag125_137ParteI_IIIAnno1970.pdf · Sanit4 (llliO) 8, 125-lS'l. 128 CONFERENZE t: St:J\11 NA1tl We,

COBLA.NS · 135

Riassunto (Tendenze attuali e future nel campo del ricupero dell' infor· mazione). - Il termine« ricupero dell'informazione>> alquanto impreciso ed improprio, è stato creato di recente quando, a causa della crescita esponen­ziale della letteratura scientifica verificatasi dopo la Seconda Guerra Mon­diale, sorse il problema della cosiddetta « esplosione dell'informazione» e si ritenne di poter risolv ere tale problema con l'impiego dei calcolatori. Il concetto di «esplosione dell'informazione» non va preso così sul serio come la stampa divulgativa l'ha dipinto, ma deve essere visto in prospet­tiva. La crescita esponenziale (il moltiplicarsi per un fattore fisso in eguali periodi di tempo) è una delle leggi statistiche della vita di una comunità. In tal modo l'accuratezza degli strumenti di misura raddoppia in media ogni 20 anni, mentre il numero dei riassunti scientifici pubblicati annualmente si raddoppia ogni 15 anni. Uno dei principali prodotti della ricerca e dello sviluppo è la nuova informazione registrata nei documenti; vi è quindi una correlazione diretta tra la spesa per la ricerca e la pubblicazione.

Il ricupero dell'informazione comprende due processi distinti e sovrap­posti: i sistemi che forniscono dati (ricup•·ro di fatti) e i sist emi che forni­scono riferimenti (ricupero di documenti). Qust'ultimo ricupero è il più difficile poichè la preparazione di un documento che possa rispondere ad una richiesta sp ecifica è un lavoro complesso che comprende l'indicizzazione, la classifica, la semantica e la linguis tica. L'esame di un articolo tipico, per es. in Nature, mostra gli elementi per l'identificazione ed alcuni rife­rimenti agli argomenti in esso trattati. Le tecniche della indicizzazione sono molto rudimentali ed insufficienti. Gli strumenti che permettono di control­lare i documenti mediante indicizzazione si es tendono dalla indicizzazione basata su parole indipendenti KWIC (Keyword in context) attraverso voca­bolari controllati (thesauri nell'indicizzazione coordinata, ed infine sogget­tari standard) fino a classificazioni gerarchiche e a faccette e più recentemente ad indicizzazione per citazioni.

Tuttavia in una vasta collezione di documenti estesa a molti soggetti, il ricupero nelle migliori condizioni è tale che 3 su 4 dei documenti forniti in risposta ad una richiesta di informazione si rivelano in media di nessun interesse per il richiedente. A partire dall954 sono state applicate le tecni­che dei calcolatori per r ealizzare un' indicizzazione automatica, mediante la quale si esclude l'intervento intellettuale (un'indicizzazione umana) in tutti i punti dell'intero ciclo, dall'immissione del documento originale nel sistema sino aJla stampa della risposta fornita all'utente che ha chiesto un'informa­zione specifica. Nonostante qualche risultato positivo ottenuto nei progetti pilota, non è emerso nessun sistema pratico di generale validità. Le diffi­coltà di una indicizzazione automatica sono connesse in linea di principio a quella della traduzione mecca1ùca di lingue naturali. Finora i maggiori successi si sono verificati nella stampa controllata dal calcolatore e nella

.4 nn. l at. Stti)Cr. Sanità (1970} S, 12~137.

l

l

l

l

l

l

l

l

l

l

l

l

l

l

l

l

l

l

l

l

l

l l

Page 12: CONFERENZE E SEMINARIold.iss.it/binary/publ/cont/Pag125_137ParteI_IIIAnno1970.pdf · Sanit4 (llliO) 8, 125-lS'l. 128 CONFERENZE t: St:J\11 NA1tl We,

1:16 CONt"ERt;Nzli l> SI>M INARI

utilizzazione di un testo codificato e registrato p er il ricupero in sede ccn· tralc o decentralizzata attraverso la fornitura di duplicati dci nastri ma· gnetici. Gli unici sis temi attualmente in opt> ra p<·r la indicizzaziom· su vasta scala e per il ricupero dell'informazione a mezzo di calcolatori sono basati su termini per l'indicizzazione assegnati daJJ' uomo ad es.: la ENDS (Euratom Nuclea r Documentation Sy~Stcm) , il MEDLAHS (Medicai Litcrature Analysis R etrieval System) e l'll\SPEC (lnformation Service in P hysics, Electrotcchnology, Computers and Control). P er illustran· le possibilità offerte dal sistema ENDS, viene descritta e discussa una ricerca tipo compiuta per rispondere alla domanda: quali articoli esistono s ulla presenza contemporatwa del renio e delrosmio in q ualsiasi pun to dd sistema solare·?

Il ricupero meccanico dell 'informazione è tuttora un 'operazion e com ­plessa e costosa e quindi persino per disciplin~· quali la biomedicina, la chi­mica, la fisica un unico centro nazionale per un paese di media grandezza quale il Regno Unito i> il massimo ch e si possa giustificare.

REFERENCES

(1

) ZtMA N, J. l'ttblic knowledp.e: liti' soci al dime11sion of scie11ce. Com!Jridge Universit.y Pres>. 1968.

( 2 ) The J~eonardo l.ohoralory. Nature, 222, 611! (1969).

(3) .MooERS, C. ' . Comments on tbe paper hy Bar-Hillcl. Am. Doc., 8, 114-ll6, (1957).

(') DE Sor.LA PRICE, D. J. Liule scien cc, big science. Columbio University Press. r ew York. 1963.

( 6) MELLANBY, K. A damp squib. Neu· Scientist, 33, 626-627 (1967).

(') MARTYN, J, & A. GILCIIRIST. An evaluation of British scie11ti.fic journab. London. Aslih, 1968. (Occasiono! publication no.1).

(1) CA S today. Facts and figures about Chemicftl Ab.~tracu Service. American Chemicul So­ciety, Columbus. Ohio, 1967. 60th onniversary ed.

( 8) GARFJELD, E. « Science citation i11dex» - a new dimension in indexing. Science. 144. 649-654 (1964 ).

(8

) DEW.EZE, A. Automated establishment an d utilisotiou of bibliograplùcal citation iudexes. Une3cO Bui/ . Libraries, 18, 211-223, (1964).

( 10) Asun. Ilfmdbook of speciallibrnrumship and information work. W. Ashworth, Ed. l.oudon 1967. 3d ed., p. 216.

( 11 ) LunN, H. P. Keyword-iu-context index for technit:al literature (K"'IC iudex). Am. Doc., 11, 288-295 (1960).

e~> NATIONAL BUREAU ot• S-rAN'DAIIDS. Automatic indexing: a state of- tlte-ort report . by Mor~ Elizabeth Stevcub. Woshiugton, 1965, (NBS Monograph 91).

('3) LUSTIG, G. 1st die outomatische lndexierung !Jereits unwcud!Jar'( l\'acltr. Dok., 20, 190-193 (1969).

A nn. I st. Super. Sanitilo (1U70) 6, 12f•- 137.

l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l

Page 13: CONFERENZE E SEMINARIold.iss.it/binary/publ/cont/Pag125_137ParteI_IIIAnno1970.pdf · Sanit4 (llliO) 8, 125-lS'l. 128 CONFERENZE t: St:J\11 NA1tl We,

COBUNS 131

( 14) BuCKLAND, L. F. Machine recording of textual i,;formation. during the publication of scien­tific j~tornau. , lnforonics, ~faynard, Massachosetts, 1965.

( 16) ConLANS, H . The mechanisation of documentation - a Lentative balance sbeet. fn : Ciba Fo~tndation Symposi~tm on Comm~tnication in Science. Churchill, London, 1967. p. 78-83,

(11

) SIECMANN, P. J, & B. C. GRIFPITH, The changing role of Psychological Abstract s in scientific communication. Am. Psychologìst, 21, 1037-1043 (1966).

(17

) MARTIN, .M. D. & J. R. SmTH. INSPEC: a new concept in information services. Electron. Power, 15, 66-69 (1969).

(' 8) EuRATOM. Guide pour l'r~tilisalion. du système aulomatisé de documentation n~tcUaire.

Centre d'Jnformation et de Docnmentation, Bruxelles, 1968.

( 19) ROLLINC, L. & J, PIETTE. lnteraction of economica and automation in a lurge-size re· trievul syst ern. In: Mechani:ed information storage, retrieval and dissemination. Pro· ceedings of the .FJD-TFIP Conference, Rome, June 14-17, 1967. Amsterdam, North·· HoiJand Publishing Co., 1968, 367-390.

(2°) EURATOM. ThesflurwJ ... used in Euratom's nucleflr dowmenlntion systems. 2nd ed.~ Bruxelles, CI O. Pari I lndexing terms, 1966. Part fl Terminology charts, 1967.

A nn. lat. Supcr. Sanit4 (19i0) 6, 125-137.