computer graphics for processing chemical substance information



Computers Chem. Vol. 13, No. 2, pp. 129-139, 1989. 0097-8485/89 $3.00 + 0.00. Printed in Great Britain. All rights reserved. Copyright © 1989 Pergamon Press plc

COMPUTER GRAPHICS FOR PROCESSING CHEMICAL SUBSTANCE INFORMATION

REGINALD C. HAINES Chemical Abstracts Service, P.O. Box 3012, Columbus, OH 43210, U.S.A.

(Received 4 April 1988)

Abstract--Chemical Abstracts Service (CAS) has developed complex computer-based systems for the registration and retrieval of chemical substance information. This paper focuses on the data structures and computer graphics software used to represent and process graphical representations of chemical substance information. At CAS the representation of graphical data is standardized and is referred to as the CAS Graphical Data Structure (GDS). The GDS is used for the input and display of chemical structures in the CAS-developed Messenger® search and retrieval software. Messenger software is used to operate the online information retrieval service called STN International®, the scientific and technical information network. STN International provides scientists and information specialists with online access to many different scientific and technical databases. In particular, STN provides access to the CAS ONLINE® Registry File, which currently contains over 8.8 million records of unique chemical substances. This paper surveys the current structure graphics components of the Messenger software. These components provide online structure input and structure display capabilities for the CAS ONLINE Registry File. This paper also introduces the capabilities of a new Personal Computer (PC)-based software package called STN Express™. The software includes an offline structure input capability specifically developed for structure query formulation by scientists and information specialists. Possible future extensions to the CAS structure graphics software are also addressed.

INTRODUCTION

Chemical Abstracts Service (CAS) has developed complex computer-based systems for the unique identification (i.e. registration) and retrieval of chemical substance information. CAS is a division of the American Chemical Society (ACS) located in Columbus, Ohio. One of the missions of CAS is to provide access to the world’s primary chemical and scientific literature. In fulfilling this mission, CAS has taken on several different roles. These roles include those of publisher, database supplier, online search vendor and software supplier. The publications and databases produced by CAS contain chemical substance information, including chemical structure diagrams. The data structures and computer graphics software used at CAS to represent and process chemical substance information are the focus of this paper.

This paper answers the questions: “how does CAS provide online access to chemical substance information, and how is computer graphics used to support this access?” The principal processes involved in retrieving information from chemical substance files are presented, including structure input, structure modeling, structure search and display. Emphasis is given to the application of computer graphics and the data structures employed. The principal data structures used to represent chemical structures at CAS are the Graphical Data Structure (GDS) and the connection table (CT). The GDS is used for structure input and display, and the CT is used for structure searching. The two computer graphics application packages described in this paper are “Messenger”®, which is a CAS-developed search and retrieval system, and “STN Express”™, which is a PC-based “front-end” to Messenger. A survey of many of the current CAS and ACS computer graphics applications was presented at a recent symposium sponsored by the ACS (Sanderson & Dayton, 1986).

In 1980 CAS began to use the initial prototype of the Messenger search and retrieval software. Messenger software provides searchers with a means of retrieving information from scientific and technical databases. These databases can contain different kinds of information, including chemical structure information, bibliographic text information, full text of journals, and numeric data. The Messenger software, which is licensed by CAS, is used to operate the online information retrieval service called STN International®, the scientific and technical information network. STN International offers direct access from North America, Europe and Japan to over 52 scientific and technical databases covering the fields of biology, chemistry, chemical engineering, mathematics, and physics. The largest chemical structure database accessible via STN is the CAS ONLINE® Registry File. Currently, it contains over 8.8 million records of unique chemical substances. The Registry File is maintained by the CAS Chemical Registry System, which first began operations at CAS in 1964. To improve the effectiveness of this system, the software has been upgraded twice: once in 1968 and again in 1974 (Dittmar et al., 1976). The source of chemical substances for the Registry File is over 12,000 chemistry journals, in addition to books, conference proceedings, and patent documents worldwide. The Registry File is now growing at a rate of about 13,000 substances per week or almost 700,000 substances per year.

Messenger software provides searchers with the capability to search the Registry File (or any other chemical substance file) using substructure or complete structure queries, molecular formulae, and chemical names. To enter structure queries the searcher has a choice between “text structure input” or “graphic structure input” commands. The former allows the searcher to enter and edit structures using keyboard commands. The latter provides similar structure editing capabilities using menus and a mouse or joystick. In addition, the structure input software also interfaces with a 2-D (two-dimensional) structure modeling program. Messenger software also provides graphic structure display capabilities for displaying the results of structure searches.

Using the Messenger software as a foundation, CAS is continuing to extend its capabilities and to make STN International even easier to use. One of the goals of CAS is to develop tools for scientists and information specialists that can help them find and manipulate chemical substance information. In 1987 CAS developed a Personal Computer (PC)-based “front-end” software package called STN Express which includes an offline structure input capability. The searcher can use STN Express to enter structure queries offline for later uploading to the Messenger software. The STN Express software is now undergoing Beta testing, with a planned release date of second quarter, 1988.

CAS is also exploring other areas that involve the processing of chemical substance information using computer graphics. One area of growing interest and importance to chemical researchers is the use of 3-D (three-dimensional) structure molecular modeling packages. The latter part of this paper discusses the CAS research in this area.

SYSTEM OVERVIEW

Figure 1 shows an overview of the Messenger and STN Express architecture. The Messenger software operates at a site (a computer center at a particular location) referred to as a Messenger installation. Messenger software runs under the MVS/XA operating system on IBM or IBM-compatible mainframes. Messenger software is divided into “front-end” and “back-end” components. The front-end is known as the Messenger “User Job”. It contains all the software needed to carry on the user interaction, including structure input. The back-end is known as a “software node”. It contains the collection of functions for performing searches and retrievals against a Messenger Search and Retrieval Database (SRD), the search and display files that constitute a database.

Fig. 1. Overview of Messenger and STN Express architecture.


Searchers at remote terminals or personal computers operating with STN Express can access the Messenger software using one of the public VANs (Value Added Networks), leased line, or by direct-dial. The searcher’s terminal or PC is connected directly via the network to an instance of a User Job. To access a specific SRD, the User Job sends network transactions to the software node where the requested SRD is mounted. Figure 1 illustrates User Jobs with access to SRDs at two software nodes. In general, a User Job may run at one Messenger installation and be connected via the network to a software node at the same or at another Messenger installation. Multiple software nodes located at different sites can be linked together to form a network so that a searcher can easily search databases mounted on any software node without having to first logoff and then logon again. This capability is fully utilized in the STN International online information service, which now operates using three production software nodes connecting service centers in: Columbus, Ohio; Karlsruhe, F.R.G.; and Tokyo, Japan.

The new PC-based software package called STN Express, available in second quarter, 1988, includes a new offline structure input capability specifically developed in cooperation with Hampden Data Services Ltd for structure query formulation by scientists and information specialists. STN Express requires an IBM PC, PC-XT, PC-AT or PS-2, or AT&T 6300, or Olivetti M24, or Compaq III (or 100% compatibles), 640K RAM, one 360K floppy disk drive, a hard disk with 4 Mbyte of space available and operates under MS-DOS or PC-DOS Version 3.0 or higher. A modem, graphics board and mouse are also required. A printer is optional. (Most dot matrix and laser printers are supported.) As shown in Fig. 1, STN Express searchers also connect to a Messenger User Job when uploading queries to Messenger and conducting search sessions, but searchers can build both structure queries and text queries offline.

DATA STRUCTURES

The principal data structures used to represent chemical structures at CAS are the Graphical Data Structure (GDS) and the connection table (CT). These two fundamental data representations are complementary and are described in more detail next. Structure input and structure display within Messenger use the GDS as their primary data structure, whereas structure searching is based on the CT.

At CAS the representation of graphical data is standardized and is referred to as the CAS Graphical Data Structure (GDS). The GDS was first described 14 yr ago (Farmer & Schehr, 1974) as a hierarchical data structure built of four basic building blocks:

(1) Node blocks provide hierarchy. A node is represented graphically in this paper by the symbol: <node-type>.

(2) Branch blocks connect nodes to nodes or nodes to leaves. Each branch contains the 2-D coordinates, intensity, scale, and other parameters affecting the subtree below it. A branch is represented graphically by an arrow: →.

(3) Leaf blocks contain all the characters and lines which describe the picture. A leaf is represented graphically in this paper by the symbol: [leaf-value].

(4) Data blocks contain the non-displayable, application-dependent information, such as atom and bond types and atom-by-atom connectivity information. Data blocks are not shown graphically.

Figure 2 shows an example of the GDS and the corresponding CT for the chemical structure commonly known as caffeine. The order in which the branches are attached to a given node is not significant. The corresponding graphic input form of this structure is shown in Fig. 3.

All of the blocks in the diagram are connected by circularly linked lists called rings. Each atom and bond leaf has a data block and these data blocks are also connected by rings. These latter rings provide the information structure that indicates which atoms are bonded to each other chemically.
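The four block types and the rings that link them can be pictured with a small Python sketch. This is only an illustration of the idea, not the CAS implementation: the class and field names, the `link_ring` and `walk_ring` helpers, and the choice to put an atom, a bond and a methyl shortcut on one ring are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class DataBlock:
    """Non-displayable, application-dependent data (hypothetical layout)."""
    kind: str    # "atom" or "bond"
    value: str   # e.g. "N" or "single"
    # Ring pointer: the next data block on the same circularly linked list.
    next: Optional["DataBlock"] = field(default=None, repr=False, compare=False)


@dataclass
class Leaf:
    """Displayable characters or lines, written [leaf-value] in the paper."""
    value: str
    data: Optional[DataBlock] = None


@dataclass
class Node:
    """Provides hierarchy, written <node-type> in the paper."""
    node_type: str
    branches: List["Branch"] = field(default_factory=list)


@dataclass
class Branch:
    """Connects a node to a child node or leaf and carries 2-D coordinates."""
    child: Union[Node, Leaf]
    x: float = 0.0
    y: float = 0.0


def link_ring(blocks):
    """Join data blocks into a circularly linked list (a 'ring')."""
    for current, following in zip(blocks, blocks[1:] + blocks[:1]):
        current.next = following


def walk_ring(start):
    """Visit every block on an already linked ring once, beginning at start."""
    seen, block = [start], start.next
    while block is not start:
        seen.append(block)
        block = block.next
    return seen


# A tiny fragment: an N atom, a single bond, and a methyl (Me) shortcut.
n_data = DataBlock("atom", "N")
bond_data = DataBlock("bond", "single")
me_data = DataBlock("atom", "Me")
link_ring([n_data, bond_data, me_data])

root = Node("SST", [
    Branch(Leaf("N", n_data), x=0.0, y=0.0),
    Branch(Leaf("-", bond_data), x=0.5, y=0.0),
    Branch(Leaf("Me", me_data), x=1.0, y=0.0),
])

# Starting from the bond, the ring leads to both atoms it joins.
print([block.value for block in walk_ring(bond_data)])
```

Walking the ring from a bond's data block reaches the atoms on the same ring without scanning the display tree, which is the kind of shortcut the pointer-rich manipulative form appears to be designed for.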

The GDS has several different formats, each optimized for a particular use:

(i) Manipulative GDS is used by the online structure input software and the programs that build images for the Registry File structures. Scanning this GDS is simplified by the multitude of pointers it contains.

(ii) Storage Form GDS is used for saving GDS on storage devices. All pointers used in the manipulative form are removed. This form can be converted back to manipulative format and uses 35% of the space required by the manipulative form.

(iii) Compact Storage Form GDS is used for saving GDS that does not require any updates, such as the Registry File structure images. This form uses 25% of the space required by the storage form.
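Taken together, the two percentages imply that the compact storage form needs roughly 0.35 × 0.25 = 8.75% of the space of the manipulative form. The short Python sketch below works this through; the function names and the 10,000-byte structure are made-up values for illustration only.

```python
def storage_form_size(manipulative_size):
    """Storage form: 35% of the manipulative form (integer bytes)."""
    return manipulative_size * 35 // 100


def compact_form_size(manipulative_size):
    """Compact storage form: 25% of the storage form (integer bytes)."""
    return storage_form_size(manipulative_size) * 25 // 100


example = 10_000  # hypothetical size of one manipulative GDS, in bytes
print(storage_form_size(example))   # 3500
print(compact_form_size(example))   # 875
```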

The CT shown in Fig. 2 illustrates the basic content of all connection tables. Each row corresponds to an atom or node in the structure. Each atom or node is assigned a number. The first column contains the node number and the atom or node symbol. Subsequent columns list the connecting node numbers and bond symbols. These bond symbols have three-letter codes. The first letter indicates a ring or chain bond (R = ring and C = chain). The second and third letters indicate the bond type (S = single, D = double, T = triple, E = exact, and other types). Each node may optionally have associated with it node attributes such as charge, mass, valence, etc.
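A connection table of this kind maps naturally onto a dictionary. The Python sketch below uses rows as read from the caffeine connection table in Fig. 2; the dictionary layout and the helper functions are assumptions for illustration, and `decode_bond` follows one reading of the three-letter codes (ring or chain, bond value, then E for exact).

```python
from collections import Counter

# node number -> (symbol, [(connected node, bond code), ...])
CAFFEINE_CT = {
    1: ("O", [(8, "CDE")]),
    2: ("C", [(9, "CSE")]),
    3: ("C", [(10, "CSE")]),
    4: ("C", [(12, "CSE")]),
    5: ("O", [(14, "CDE")]),
    6: ("C", [(8, "RSE"), (9, "RSE"), (7, "RDE")]),
    7: ("C", [(10, "RSE"), (11, "RSE"), (6, "RDE")]),
    8: ("C", [(1, "CDE"), (12, "RSE"), (6, "RSE")]),
    9: ("N", [(2, "CSE"), (13, "RSE"), (6, "RSE")]),
    10: ("N", [(3, "CSE"), (14, "RSE"), (7, "RSE")]),
    11: ("N", [(13, "RDE"), (7, "RSE")]),
    12: ("N", [(4, "CSE"), (14, "RSE"), (8, "RSE")]),
    13: ("C", [(11, "RDE"), (9, "RSE")]),
    14: ("C", [(5, "CDE"), (12, "RSE"), (10, "RSE")]),
}

TOPOLOGY = {"R": "ring", "C": "chain"}
VALUE = {"S": "single", "D": "double", "T": "triple"}


def decode_bond(code):
    """Split a three-letter bond code into (topology, value, is_exact)."""
    return TOPOLOGY[code[0]], VALUE[code[1]], code[2] == "E"


def is_symmetric(ct):
    """Every bond must be listed from both of its atoms with the same code."""
    return all(
        (node, code) in ct[neighbour][1]
        for node, (_, bonds) in ct.items()
        for neighbour, code in bonds
    )


def element_counts(ct):
    """Count the atoms in the table by element symbol (hydrogens are implicit)."""
    return Counter(symbol for symbol, _ in ct.values())


print(decode_bond("RDE"))
print(is_symmetric(CAFFEINE_CT))
print(dict(element_counts(CAFFEINE_CT)))
```

The element counts (eight C, four N, two O) agree with the non-hydrogen part of the molecular formula C8 H10 N4 O2 reported for caffeine later in the paper.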

MESSENGER STRUCTURE INPUT AND STRUCTURE MODELING

Figure 4 shows the components of the Messenger architecture that support 2-D structure input and 2-D structure modeling; namely, the STRUCTURE command and the ALGORITHMIC STRUCTURE DISPLAY (ASD) software. The STRUCTURE command is a software component integrated into the Messenger User Job that allows the searcher to build and modify 2-D structure queries. The command supports both text structure input and graphic structure input. Text structure input allows the searcher to enter structures on any type of terminal using only keyboard commands. Graphic structure input allows structures to be entered using menus and a mouse or joystick. It requires a terminal or microcomputer that uses or can emulate Tektronix PLOT10 graphics. The principal advantages of graphic structure input are that it allows larger, more complex structures to be built, and it provides a better quality structure image. The ASD software is a separate component (a separate IBM job) of the Messenger software node; it supports 2-D structure modeling. This software is invoked by the STRUCTURE command when the searcher specifies that a particular structure on the Registry File is to be used as an initial 2-D model. The final output of a STRUCTURE command consists of a Structure Query GDS and a corresponding Structure Query Connection Table. These data structures are saved on a Messenger Query File for later use by the SEARCH command.

Graphical Data Structure (GDS):

<FRG> = FRAGMENT NODE
<SST> = SUBSTRUCTURE NODE
<CST> = CYCLIC STRUCTURE NODE

Connection Table:

          ***** CONNECTIONS *****
NOD SYM   NOD/BON   NOD/BON   NOD/BON
  1  O     8 CDE
  2  C     9 CSE
  3  C    10 CSE
  4  C    12 CSE
  5  O    14 CDE
  6  C     8 RSE     9 RSE     7 RDE
  7  C    10 RSE    11 RSE     6 RDE
  8  C     1 CDE    12 RSE     6 RSE
  9  N     2 CSE    13 RSE     6 RSE
 10  N     3 CSE    14 RSE     7 RSE
 11  N    13 RDE     7 RSE
 12  N     4 CSE    14 RSE     8 RSE
 13  C    11 RDE     9 RSE
 14  C     5 CDE    12 RSE    10 RSE

Fig. 2. Chemical substance data structures: GDS and connection table.

Fig. 3. Example of graphic structure input using Messenger STRUCTURE menus.

Fig. 4. 2-D structure input and 2-D structure modeling using Messenger.

In order to illustrate how the STRUCTURE command is used to enter a structure query, a description of its menus and their uses is necessary. Figure 5 illustrates the two menus available for use with graphic structure input, the MAIN menu and the SHORTCUTS menu. The MAIN menu is subdivided into five blocks of menu items: Graph, Node, Bond, Attributes, and Supplemental. Selection of items from the Graph block allows chains of atoms and small rings to be input and connected, building up the basic structure skeleton. The menu items from the Node and Bond blocks allow different atoms and bonds to be specified. Nodes in a structure query can assume specific element values, generic values (X = Halogen, A = Any element but H, Q = Any element but C or H, M = Any Metal), or they can represent variables (Gk). Variable nodes are entered with a separate keyboard command. Bonds can have both a bond value (S = Single, D = Double, T = Triple, etc.) and a bond type specified (R = Ring, C = Chain, RC = Ring or Chain).
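The effect of generic node values on matching can be sketched in a few lines of Python. The sketch takes X to mean any halogen, A any element except hydrogen, Q any element except carbon or hydrogen, and M any metal; the `node_matches` function is hypothetical, and the metal set is deliberately short and illustrative rather than a complete table.

```python
HALOGENS = {"F", "Cl", "Br", "I", "At"}
# Illustrative subset only; a real system would use a full list of metals.
METALS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn", "Pt"}


def node_matches(query_symbol, atom_symbol):
    """Return True if an atom in a candidate structure satisfies a query node."""
    if query_symbol == "X":
        return atom_symbol in HALOGENS
    if query_symbol == "A":
        return atom_symbol != "H"
    if query_symbol == "Q":
        return atom_symbol not in {"C", "H"}
    if query_symbol == "M":
        return atom_symbol in METALS
    # Anything else is treated as a specific element symbol.
    return query_symbol == atom_symbol


print(node_matches("X", "Cl"))   # True
print(node_matches("Q", "C"))    # False
```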

Selection of items from the Attributes block allows node attributes (such as charge, mass, valence) and ring attributes (isolated or imbedded) to be specified. The Supplemental block contains miscellaneous control menu items that allow the changing of menus (like SHORTCUTS) and perform other functions. Selecting the RECALL menu item allows the structure builder to retrieve previously built structures, structure templates from the Fragment Dictionary file or a Registry No. model. The latter capability is discussed in more detail below. If the SHORTCUTS item is selected on the MAIN menu, the SHORTCUTS menu shown in Fig. 5 replaces the MAIN menu. Selection of items from the SHORTCUTS menu allows the nodes in a structure query to assume the value of one of the commonly occurring chemical shortcuts on this menu. Additional shortcuts can also be input by use of the NODE menu item on the MAIN menu. Figure 3 shows the final result of building a structure query for caffeine using these techniques. In this example the shortcut for methyl, Me, was used. For additional information on Messenger graphic structure input, consult the manual, “Using CAS ONLINE.”

MAIN MENU

Block          Menu items                                  Contents
Supplemental   END  SHORTCUTS
Graph          C1 C2 C3 C4 C5 C6                           chains
               R4 R5 R6 R7 R8  GRAPH                       rings
Node           C N O Si P S F Cl Br                        element symbols
               X M Gk A  NODE                              generic and variable nodes
Bond           S SE N D DE T ? UNSPEC  BOND                bond values
               R C RC                                      bond types
Attributes     CHARGE MASS HCOUNT NSPEC VALENCE RSPEC      atom and ring attributes

SHORTCUTS MENU

END RECALL REFRESH
MAIN CH CH2 Me Et n-Pr i-Pr n-Bu i-Bu s-Bu t-Bu o-C6H4 Ph m-C6H4 p-C6H4 C(O)CH3 NH NH2 CN NO2 OH SH COOH COSH CSSH SO2 SO3H PO3H2 CF3 CCl3 CBr3 CI3

Fig. 5. Messenger STRUCTURE command menus.

As mentioned above, the RECALL menu item allows the structure builder to specify a CAS Registry Number℠ as a model. The Registry No. is input from the terminal to the STRUCTURE command along path (a) in Fig. 4. The command then sends a structure modeling transaction along path (b) to the Messenger software node, where the connection tables for the Registry No. and its ring systems are retrieved (c) from the CAS ONLINE Registry File. Connection table data is forwarded to the ASD software, which then creates a 2-D GDS representation for the Registry No., using a CAS-developed algorithm. It is a complex algorithm designed to produce publication quality 2-D structure diagrams (described in detail previously by Dittmar et al., 1977). An important feature of this algorithm is that the 2-D coordinates it uses for the basic ring shapes are retrieved from a prebuilt file, called the Ring Image File, containing a GDS representation for each ring system. The file of ring shapes is maintained at CAS by another in-house structure input system, the Online Structure Input System (OLSIS) (Blake et al., 1977), which is the software antecedent of the current Messenger STRUCTURE command. When the ASD algorithm completes, it returns the 2-D coordinates of each node or atom along path (d) to the STRUCTURE command, where the final GDS structure is then assembled. The completed structure image is then sent back to the terminal along path (e) and displayed to the user. Figure 6 shows how the structure for caffeine can be built by specifying the CAS Registry No. for caffeine (58-08-2) at the prompt shown; it is an alternative way of invoking the 2-D modeling feature.

=> STRUCTURE
ENTER NAME OF STRUCTURE TO BE RECALLED (NONE):58-08-2
ENTER (DIS), GRA, NOD, BON OR ?:

Fig. 6. Example of 2-D structure modeling using Registry No. for caffeine.


Fig. 7. Registry File search and display via Messenger.

MESSENGER SEARCH AND DISPLAY

Figure 7 shows the components of the Messenger architecture that support structure search and 2-D structure display, namely, the User Job SEARCH and DISPLAY commands and the corresponding software node Search and Display “back-end” processes. The SEARCH Command is a Messenger User Job component that allows the searcher to submit searches for online execution against a Messenger SRD. The command supports both text and structure searches. Structure searches can be partial or exact match. Partial or substructure searches require that the graph, atoms, and bonds of a structure query be contained within the graph, atoms, and bonds of each answer. Exact searches require an exact match between the graph, atoms, and bonds of the structure query and each answer. The final output of a search is an answer set, which is maintained by the search back-end.

The DISPLAY Command is a Messenger User Job component that allows the searcher to display answers from the answer file created by a search, or by specifying individual filekeys, like the Registry No., in the CAS ONLINE Registry File. The command supports the display of both text data and graphics (GDS) data. Display of graphics data requires a terminal that uses or can emulate Tektronix PLOT10 graphics.

After the searcher has built a structure query using the STRUCTURE command, he or she uses the SEARCH command to specify the structure name and other parameters that control the search. These parameters are entered from the terminal to the SEARCH command along path (a) in Fig. 7. The command then retrieves the structure query connection table from the Messenger Query File and sends a search transaction along path (b) to the Messenger software node, where the search query is compiled and submitted to the back-end search processors. Messenger software supports searching on large mainframe machines and on multiple microprocessors, which are used to perform parallel searching on large files. The design and evolution of the techniques used for searching the CAS Registry File are discussed in a recent paper (Zeidner, 1987). The search back-end performs an initial screening step and then compares the structure query connection table to candidate connection tables retrieved (c) from the Registry File. This final search step uses an iterative search algorithm, comparing the query record to candidate answer records on an atom by atom, bond by bond basis. Search answers are stored on an answer file. Final search statistics are returned to the SEARCH command along path (d). These statistics are then passed back to the terminal along path (e), where they are displayed to the searcher. The top of Fig. 8 shows an example of an exact structure search for caffeine. “L1” is the name assigned to the structure query in Fig. 3. “L2” is the name assigned to the answer set by the SEARCH command.
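The two search steps described above, a screening step followed by an atom by atom, bond by bond comparison, can be sketched with a small backtracking matcher in Python. This is a generic textbook technique shown under simplifying assumptions, not the CAS search algorithm: the screen here is a plain element-count check, bonds carry only a value ("S" or "D"), and all function names and the sample fragment are invented for the example.

```python
from collections import Counter


def passes_screen(query_atoms, cand_atoms):
    """Cheap screen: the candidate must hold at least the query's element counts."""
    missing = Counter(query_atoms.values()) - Counter(cand_atoms.values())
    return not missing


def find_match(query_atoms, query_bonds, cand_atoms, cand_bonds, exact=False):
    """Return a query-node -> candidate-node mapping, or None if there is none.

    atoms: {node number: element symbol}
    bonds: {frozenset({node, node}): bond value}
    """
    if not passes_screen(query_atoms, cand_atoms):
        return None
    if exact and (len(query_atoms) != len(cand_atoms)
                  or len(query_bonds) != len(cand_bonds)):
        return None
    order = list(query_atoms)

    def consistent(q, c, mapping):
        if query_atoms[q] != cand_atoms[c]:
            return False
        # Every query bond to an already mapped node must exist in the candidate.
        for q2, c2 in mapping.items():
            wanted = query_bonds.get(frozenset((q, q2)))
            if wanted is not None and cand_bonds.get(frozenset((c, c2))) != wanted:
                return False
        return True

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        q = order[len(mapping)]
        for c in cand_atoms:
            if c not in mapping.values() and consistent(q, c, mapping):
                mapping[q] = c
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]   # backtrack and try the next candidate atom
        return None

    return extend({})


# Query: a carbonyl carbon bonded to a nitrogen (O=C-N).
query_atoms = {1: "C", 2: "O", 3: "N"}
query_bonds = {frozenset((1, 2)): "D", frozenset((1, 3)): "S"}

# Candidate: a four-atom piece of caffeine (nodes 8, 1, 12 and 4 of Fig. 2).
cand_atoms = {8: "C", 1: "O", 12: "N", 4: "C"}
cand_bonds = {frozenset((8, 1)): "D", frozenset((8, 12)): "S",
              frozenset((12, 4)): "S"}

print(find_match(query_atoms, query_bonds, cand_atoms, cand_bonds))
print(find_match(query_atoms, query_bonds, cand_atoms, cand_bonds, exact=True))
```

The substructure search succeeds because the query is contained in the candidate, while the exact search fails because the candidate has an extra atom and bond. Because each candidate record is compared independently, this kind of work divides naturally across parallel processors.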

After the search has completed, the searcher can display search results using the DISPLAY command to specify the answer set name and answer numbers. These parameters are entered from the terminal to the DISPLAY command along path (a) in Fig. 7. The command then passes a display transaction along path (b) to the Messenger software node, where the text and graphic information for each answer are retrieved (c) from the Registry File. The format of the structure image data is compact storage form GDS. The images were prebuilt by a batch version of the ASD software. The display data is then returned to the DISPLAY command along path (d), where it is formatted for final display and returned to the searcher’s terminal along path (e). Figure 8 shows a display of the first answer. The field codes shown are defined as follows: RN = Registry No., IN = Index Name, SY = Synonym, DR = Deleted RN and MF = Molform. The L2 posting (or statistics) line shows there were 20 answers. Other search answers contained different isotopes of hydrogen or carbon atoms but have the same basic connection table and the same elements.

=> SEA L1
. . .
EXACT FULL
SEARCH INITIATED 9:54:18
FULL FILE
SEARCH COMPLETE
L2 20 SEA EXA FUL L1

=> DIS L2

ANSWER 1
RN 58-08-2
IN 1H-Purine-2,6-dione, 3,7-dihydro-1,3,7-trimethyl- (9CI)
SY Guaranine
SY Methyltheobromine
SY NO-DOZ
SY Thein
SY Theine
SY 1,3,7-Trimethyl-2,6-dioxopurine
SY 1,3,7-Trimethylxanthine
SY Caffeine (8CI)
SY Caffeine
SY Caffein
SY Cafipel
SY Alert-Pep
SY Koffein
SY Cafeina
SY Mateina
SY Refresh'n
SY Stim
DR 71701-02-5, 95789-13-2
MF C8 H10 N4 O2
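The field-coded answer display lends itself to simple post-processing once it has been captured as text. The following is a minimal sketch, assuming each captured line starts with a two-letter field code followed by a space and the value; that line layout and the names `FIELD_NAMES` and `parse_record` are illustrative assumptions, not part of Messenger.

```python
# Field codes as defined in the text.
FIELD_NAMES = {
    "RN": "Registry No.",
    "IN": "Index Name",
    "SY": "Synonym",
    "DR": "Deleted RN",
    "MF": "Molform",
}


def parse_record(lines):
    """Group the values of a captured answer display by field code."""
    record = {}
    for line in lines:
        code, _, value = line.strip().partition(" ")
        if code not in FIELD_NAMES:
            continue  # skip prompts, blank lines and anything unrecognized
        value = value.strip()
        if code == "DR":
            # Deleted Registry Numbers are listed on one line, comma separated.
            values = [v.strip() for v in value.split(",")]
        else:
            values = [value]
        record.setdefault(code, []).extend(values)
    return record


captured = [
    "ANSWER 1",
    "RN 58-08-2",
    "IN 1H-Purine-2,6-dione, 3,7-dihydro-1,3,7-trimethyl- (9CI)",
    "SY Guaranine",
    "SY Caffeine",
    "DR 71701-02-5, 95789-13-2",
    "MF C8 H10 N4 O2",
]
record = parse_record(captured)
for code, values in record.items():
    print(FIELD_NAMES[code], values)
```

Only the DR field is split on commas, because index names such as the one for caffeine contain commas of their own.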

Fig. 8. Example of Messenger search and display commands. The GDS display image for caffeine is shown.

STN EXPRESS: NEW PC-BASED SOFTWARE

Figure 9 shows an overview of the major components of the new PC-based software package called STN Express. STN Express provides a streamlined front-end user interface for the Messenger software that is used to operate STN International. It contains a variety of features designed to simplify all the steps involved in performing online searches. Specifically, STN Express includes the following capabilities: offline structure input, offline search strategy formulation, upload of structures and “type-ahead” queries to Messenger, predefined search strategies, guided search, scrolling text and graphics, color or inverse video highlighting, capturing and printing of text and graphics, automatic logon to a Value Added Network, and help.

[Block diagram with boxes labeled “STN Express Structure Input”, “PSIDOM Connection Table File”, “STN Express Upload CTs to Messenger”, “STN Express Autologon”, “STN Express Search and Display”, “Network” and “Messenger”.]

Fig. 9. STN Express as a Messenger front-end.

The STN Express offline structure input capability has features equivalent to all of those currently available in the Messenger STRUCTURE command (except for Registry No. modeling, but extensive structure templates are available). Many structure input features are unique to STN Express. Freehand structure drawing is available using a mouse. Extensive use of “pop-up” and “pull-down” menus allows rapid and more complete display of parameter values, such as a list of all atom symbols. STN Express automatically converts single and double bonds in rings to normalized bonds where appropriate, such as in a benzene ring system. It also recognizes “tautomeric” situations and converts the bonds to normalized ones if single or double bonds are input. This bond conversion can yield better search results when searching the Registry File. Stereospecific bonds can also be specified to create better quality structures for use in publications. Text comments can also be added to a structure without affecting the way it is searched. A window can be placed over a portion of a structure and then all atoms or bonds in the window can be changed at once, or everything in the window can be deleted or moved. The size of a structure can be increased or decreased using the “expand” and “contract” features. A structure can also be rotated about a horizontal or vertical axis.
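The benefit of converting ring bonds to normalized bonds can be seen in a small sketch: two different ways of drawing the alternating bonds of a benzene ring collapse to the same representation, so a query no longer depends on which drawing the searcher chose. The code below handles only the simplest case (an even-membered ring with strictly alternating single and double bonds) and uses hypothetical names and data layout; the actual STN Express and CAS normalization rules, including the tautomer handling, are not given in the paper.

```python
NORMALIZED = "N"  # hypothetical marker for a normalized bond


def normalize_ring(ring, bonds):
    """Return a copy of `bonds` with the ring's bonds normalized when they alternate.

    ring  -- atom indices in order around the ring
    bonds -- dict mapping frozenset({i, j}) to 1 (single), 2 (double) or NORMALIZED
    """
    n = len(ring)
    ring_bonds = [frozenset((ring[k], ring[(k + 1) % n])) for k in range(n)]
    orders = [bonds[b] for b in ring_bonds]
    alternating = (
        n % 2 == 0
        and set(orders) == {1, 2}
        and all(orders[k] != orders[(k + 1) % n] for k in range(n))
    )
    result = dict(bonds)
    if alternating:
        for b in ring_bonds:
            result[b] = NORMALIZED
    return result


def ring_with_orders(orders):
    """Build the bond dict for a ring 0-1-2-...-0 with the given bond orders."""
    n = len(orders)
    return {frozenset((k, (k + 1) % n)): orders[k] for k in range(n)}


ring = [0, 1, 2, 3, 4, 5]
kekule_a = ring_with_orders([2, 1, 2, 1, 2, 1])     # one way to draw benzene
kekule_b = ring_with_orders([1, 2, 1, 2, 1, 2])     # the other way
cyclohexene = ring_with_orders([2, 1, 1, 1, 1, 1])  # not alternating

print(normalize_ring(ring, kekule_a) == normalize_ring(ring, kekule_b))
print(normalize_ring(ring, cyclohexene) == cyclohexene)
```

A ring that does not alternate, such as cyclohexene, is returned unchanged, so its single and double bonds remain searchable as drawn.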

Structures built by STN Express can be saved and recalled locally at the PC. The STN Express structure input software was developed by Hampden Data Services Ltd and is a modified version of the PSIDOM (Professional Structure Image Database on Microcomputers) package described in a recent paper by Town (1986). After a structure query has been built by STN Express, a connection table called the PSIDOM connection table is stored at the PC. The format of this data structure is proprietary and is owned by Hampden Data Services Ltd.

In order to perform a structure search using a structure query, an STN Express searcher must first log on to Messenger using the automatic logon capability. Then, when the searcher presses a function key, the PSIDOM connection table is converted to a Messenger structure query connection table and is uploaded to the Messenger User Job, where it is saved on the Messenger Query File. A corresponding compact GDS image is also uploaded and saved on this file. Uploading is done using “kermit”, a file transfer protocol that includes error checking; it takes approx. 30 s. An uploaded structure can then be searched by invoking the Messenger SEARCH command. Figure 10 shows the final result of building a structure query for caffeine using STN Express. For additional information about STN Express, consult the user manual, “STN Express, An Action Guide”.

Fig. 10. Example of graphic structure input using STN Express.
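The error checking that Kermit applies to an upload can be illustrated with its simplest block check. The sketch below assumes the single-character (type 1) check described in the Kermit protocol documentation, in which the character values of a packet's fields are summed, folded to six bits and made printable; the paper does not say which check type STN Express used, and the function names here are illustrative.

```python
def tochar(value):
    """Kermit's printable encoding: shift a small number into the ASCII printable range."""
    return chr(value + 32)


def block_check_1(fields):
    """Single-character block check over the packet fields (given as bytes)."""
    s = sum(fields)
    # Fold the top two bits of the low byte into the low six bits, then keep six bits.
    return tochar((s + ((s & 0xC0) >> 6)) & 0x3F)


def verify(fields, received_check):
    """Receiver side: recompute the check and compare it with the one received."""
    return block_check_1(fields) == received_check


payload = b"ABC"
check = block_check_1(payload)
print(check, verify(payload, check), verify(b"ABD", check))
```

When the recomputed check differs from the received one, the receiver rejects the packet and the sender retransmits it, which is how a corrupted connection table would be caught before it reached the Messenger Query File.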

FUTURE DEVELOPMENT

The introduction of STN Express as a front-end to Messenger, and, more specifically, as an access “window” to the CAS ONLINE Registry File, opens up additional areas for CAS to explore in offering PC-based tools for scientists and information specialists to manipulate chemical substance information. One area in which CAS is performing research is 3-D structure modeling.

Chemical researchers are interested in calculating 3-D atomic coordinates for molecular structures and using these coordinates to construct and manipulate 3-D molecular models. PC-based 3-D modeling packages (such as Alchemy™, from Tripos Associates) are now becoming available, but there is little standardization among packages. Desirable features contained in such packages include: 3-D structure input, calculation of molecular conformations, display of space-filling molecular images, modification of bond lengths and angles, measurement of molecular distances, molecular “docking”, interactive 3-D rotation, and many others. One significant disadvantage of 3-D structure input, however, is that building structures from scratch is a time-consuming process, especially for very large structures. CAS is studying such packages and exploring various options for interfacing its existing software and possibly its 8.8 million substance Registry File with such packages. This work is still in the research stage.

CAS is committed to steadily enhancing the Messenger software and the CAS Registry System, so as to broaden the search and retrieval capabilities of Messenger and to increase the chemical and scientific value of the Registry File. For the more distant future, CAS is interested in exploring the possibility of implementing an online search and retrieval capability that would allow 3-D structure queries to be input and searched against a new version of the Registry File containing 3-D atomic coordinates. A multi-level search algorithm is envisioned, with the final search step being a geometric match of the 3-D query structure against the 3-D file structure. Although it would not be easy, development of such a system would be a natural extension to the other substance search capabilities presented in this paper.
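Since the paper describes 3-D searching only as a future possibility, the following is purely an illustration of what a multi-level geometric comparison might look like, not a description of any CAS design. It assumes atoms are given as an element symbol with Cartesian coordinates, first compares element counts, and then compares the sorted lists of interatomic distances within a tolerance; the function names and the tolerance value are hypothetical.

```python
import math
from itertools import combinations


def distance_profile(atoms):
    """Sorted list of (element pair, distance) over all atom pairs.

    atoms -- list of (element symbol, (x, y, z)) tuples
    """
    profile = []
    for (e1, p1), (e2, p2) in combinations(atoms, 2):
        profile.append((tuple(sorted((e1, e2))), math.dist(p1, p2)))
    return sorted(profile)


def geometric_match(query, candidate, tol=0.05):
    """Level 1: same elements. Level 2: same interatomic distances within `tol`."""
    if sorted(e for e, _ in query) != sorted(e for e, _ in candidate):
        return False
    pairs = zip(distance_profile(query), distance_profile(candidate))
    return all(qe == ce and abs(qd - cd) <= tol for (qe, qd), (ce, cd) in pairs)


water = [("O", (0.0, 0.0, 0.0)), ("H", (0.96, 0.0, 0.0)), ("H", (-0.24, 0.93, 0.0))]
# The same molecule rotated 90 degrees about the z axis and translated by (1, 2, 3).
moved = [("O", (1.0, 2.0, 3.0)), ("H", (1.0, 2.96, 3.0)), ("H", (0.07, 1.76, 3.0))]
# One O-H bond stretched to 1.5 units.
stretched = [("O", (0.0, 0.0, 0.0)), ("H", (1.5, 0.0, 0.0)), ("H", (-0.24, 0.93, 0.0))]

print(geometric_match(water, moved), geometric_match(water, stretched))
```

Comparing distances rather than raw coordinates makes the test independent of how each structure is positioned and oriented, but it is only a necessary condition: it cannot, for example, tell a structure from its mirror image, so a real system would need a stricter final step.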

CONCLUSION

CAS has had many years of experience in engineering complex computer-based systems for accessing chemical substance information, especially in the area of computer graphics. With the development of the Messenger search and retrieval software and the PC-based STN Express “front-end”, CAS continues to fulfill its goal of developing software tools for assisting scientists and information specialists.

Acknowledgements--The development of the CAS Chemical Registry System was substantially supported by the National Science Foundation (Contract 0556). The development of the Messenger software was partially funded by the United States Patent and Trademark Office (Contract No. 50.SAPT-4-00319). CAS gratefully acknowledges this support. The STN Express structure input software was developed in conjunction with Hampden Data Services Ltd, England. CAS also gratefully acknowledges this effort. The author also thanks William Fisanick of the CAS Research Department for sharing and discussing the ideas presented in the “Future Development” section.


REFERENCES

Blake J. E. et al. (1977) J. Chem. Inf. 17, 223.

Dittmar P. G. et al. (1976) J. Chem. Inf. 16, 111.

Dittmar P. G. et al. (1977) J. Chem. Inf. 17, 186.

Farmer N. A. & Schehr J. C. (1974) Proc. ACM 2, 563.

Sanderson J. M. & Dayton D. L. (1986) Graphics for Chemical Structures, ACS Symposium Series 341 (Edited by Warr W. A.), pp. 128-142. American Chemical Society.

STN Express, An Action Guide. American Chemical Society, Columbus, OH 43210.

Town W. G. (1986) Graphics for Chemical Structures, ACS Symposium Series 341 (Edited by Warr W. A.), pp. 9-17. American Chemical Society.

Using CAS ONLINE, The Registry File Vol. IIB, Building Structures Using a Menu. Chemical Abstracts Service, Columbus, OH 43210.

Zeidner C. R. (1987) Proc. Chemical Structures: the International Language of Chemistry.