Chemical File Formats for storing chemical data

Molecular File Formats

Upload: abhik-seal

Post on 23-Jun-2015


Page 1: Chemical File Formats for storing chemical data

Molecular File Formats


Types of File Formats

Elsevier MDL supports a number of file formats for the representation and communication of chemical information.

Name        Description

molfiles    Each molfile describes a single molecular structure, which can contain disjoint fragments such as salts.

SDfiles     Structure-data files, which contain structures and data for any number of molecules. SDfiles are the primary format for large-scale data transfer between MDL databases.

RGfiles     An RGfile describes a single molecular query with Rgroups. Each RGfile is a combination of Ctabs defining the root molecule and each member of each Rgroup in the query.

rxnfiles    Reaction files. Each rxnfile contains the structural information for the reactants and products of a single reaction.

RDfiles     Reaction-data files. The RDfile is a more general format that can include reactions as well as molecules.


File Formats

http://c4.cabrillo.edu/404/ctfile.pdf


Connection Table [Ctab]

A connection table (Ctab) contains information describing the structural relationships and properties of a collection of atoms. The connection table is fundamental to all of the MDL file formats.

Counts line:
  9  9  0  0  0  0  0  0  0  0999 V2000

Atom block:
   -1.0200    1.5300    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5100    2.4100    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.5000    2.3900    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0000    3.2700    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.0300    3.2700    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.5000    4.1500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5000    4.1500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0100    3.2800    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.0300    3.2800    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0

Bond block:
  1  2  1  0
  2  8  1  0
  2  3  2  3
  3  4  1  0
  4  5  2  0
  4  6  1  0
  6  7  2  3
  7  8  1  0
  8  9  2  0


Ctab Features

Part of Ctab        Description

Counts Line         Important specifications here relate to the number of atoms, bonds, and atom lists, the chiral flag setting, and the Ctab version.

Atom Block          Specifies the atomic symbol and any mass difference, charge, stereochemistry, and associated hydrogens for each atom.

Bond Block          Specifies the two atoms connected by the bond, the bond type, and any bond stereochemistry and topology (chain or ring properties) for each bond.

Properties Block    Provides for future expandability of Ctab features, while maintaining compatibility with earlier Ctab configurations.


1. Counts Line

aaabbblllfffcccsssxxxrrrpppiiimmmvvvvvv

where
• aaa = number of atoms (current max 255)* [Generic]
• bbb = number of bonds (current max 255)* [Generic]
• lll = number of atom lists (max 30)* [Query]
• fff = (obsolete)
• ccc = chiral flag: 0 = not chiral, 1 = chiral [Generic]
• sss = number of stext entries [MDL ISIS/Desktop]
• xxx, rrr, ppp, iii = (obsolete)
• mmm = number of lines of additional properties, including the M END line. No longer supported; the default is set to 999. [Generic]
• vvvvvv = Ctab version: V2000 or V3000

Each field before the version is three characters wide, so a blank field still occupies its columns.

Example showing six atoms, five bonds, the CHIRAL flag on, and three lines in the properties block (the obsolete fields are left blank):
  6  5  0  0  1  0              3 V2000

Example showing 9 atoms, 9 bonds, and the CHIRAL flag off:
  9  9  0  0  0  0  0  0  0  0999 V2000
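Because the counts line is fixed-width, it can be read by slicing columns rather than splitting on whitespace. The following is a minimal Python sketch of that idea, not a complete Ctab reader. The names CountsLine, _int_field, and parse_counts_line are made up for this example, and treating a blank field as 0 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CountsLine:
    n_atoms: int
    n_bonds: int
    n_atom_lists: int
    chiral: bool
    n_stext: int
    n_props: int
    version: str


def _int_field(line: str, start: int, width: int = 3) -> int:
    """Read a fixed-width integer field; a blank field is treated as 0."""
    text = line[start:start + width].strip()
    return int(text) if text else 0


def parse_counts_line(line: str) -> CountsLine:
    """Parse a Ctab counts line by column position (three characters per field)."""
    return CountsLine(
        n_atoms=_int_field(line, 0),        # aaa
        n_bonds=_int_field(line, 3),        # bbb
        n_atom_lists=_int_field(line, 6),   # lll
        chiral=_int_field(line, 12) == 1,   # ccc
        n_stext=_int_field(line, 15),       # sss
        n_props=_int_field(line, 30),       # mmm
        version=line[33:39].strip(),        # vvvvvv
    )


# Rebuild the 9-atom example with exact column widths, then parse it.
sample = "".join(f"{v:3d}" for v in (9, 9, 0, 0, 0, 0, 0, 0, 0, 0, 999)) + " V2000"
counts = parse_counts_line(sample)
print(counts)
```

Slicing by column also copes with the six-atom example, where the obsolete fields are blank and whitespace splitting would shift the remaining values.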


2. Atom Block

The Atom Block is made up of atom lines, one line per atom, with the following format:

xxxxx.xxxxyyyyy.yyyyzzzzz.zzzz aaaddcccssshhhbbbvvvHHHrrriiimmmnnneee

Field    Meaning               Values

x y z    Atom coordinates

aaa      Atom symbol           Entry in periodic table, or L for atom list, A, Q, * for unspecified atom, LP for lone pair, or R# for Rgroup label

dd       Mass difference       -3, -2, -1, 0, 1, 2, 3, 4 (0 if value beyond these limits)

ccc      Charge                0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1, 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3

sss      Atom stereo parity    0 = not stereo, 1 = odd, 2 = even, 3 = either or unmarked stereo center

hhh      Hydrogen count + 1    1 = H0, 2 = H1, 3 = H2, 4 = H3, 5 = H4

bbb      Stereo care box       0 = ignore stereo configuration of this double bond atom, 1 = stereo configuration of double bond atom must match

vvv      Valence               0 = no marking (default), (1 to 14) = (1 to 14), 15 = zero valence

HHH      H0 designator         0 = not specified, 1 = no H atoms allowed
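The charge and hydrogen-count fields are stored as codes rather than literal values, which is easy to get wrong when reading atom lines by hand. The sketch below decodes the leading fields of an atom line by column position. It is illustrative only: parse_atom_line and make_atom_line are hypothetical helpers, the nitrogen cation is a made-up atom, and treating hhh = 0 as "not specified" is an assumption.

```python
# Charge codes from the table above; code 4 (doublet radical) is reported separately.
CHARGE_BY_CODE = {0: 0, 1: 3, 2: 2, 3: 1, 5: -1, 6: -2, 7: -3}


def _int_field(line, start, width):
    """Read a fixed-width integer field; a blank field is treated as 0."""
    text = line[start:start + width].strip()
    return int(text) if text else 0


def parse_atom_line(line):
    """Decode the leading fields of a V2000 atom line by column position."""
    charge_code = _int_field(line, 36, 3)
    h_code = _int_field(line, 42, 3)
    return {
        "xyz": (float(line[0:10]), float(line[10:20]), float(line[20:30])),
        "symbol": line[31:34].strip(),
        "mass_difference": _int_field(line, 34, 2),
        "charge": CHARGE_BY_CODE.get(charge_code, 0),
        "doublet_radical": charge_code == 4,
        "stereo_parity": _int_field(line, 39, 3),
        # hhh stores the hydrogen count plus one; 0 is treated here as "not specified".
        "h_count": h_code - 1 if h_code > 0 else None,
    }


def make_atom_line(x, y, z, symbol, charge_code=0):
    """Build a sample atom line with exact column widths (helper for this example only)."""
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3}{0:2d}{charge_code:3d}" + "  0" * 10


oxygen = parse_atom_line(make_atom_line(2.03, 3.27, 0.0, "O"))
cation = parse_atom_line(make_atom_line(0.0, 0.0, 0.0, "N", charge_code=3))
print(oxygen["symbol"], oxygen["xyz"], oxygen["charge"])
print(cation["symbol"], cation["charge"])
```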


3. Bond Block

The Bond Block is made up of bond lines, one line per bond, with the following format:

111222tttsssxxxrrrccc

Field    Meaning               Values

111      First atom number     1 to number of atoms

222      Second atom number    1 to number of atoms

ttt      Bond type             1 = Single, 2 = Double, 3 = Triple, 4 = Aromatic, 5 = Single or Double, 6 = Single or Aromatic, 7 = Double or Aromatic, 8 = Any

sss      Bond stereo           Single bonds: 0 = not stereo, 1 = Up, 4 = Either, 6 = Down. Double bonds: 0 = use x-, y-, z-coords from atom block to determine cis or trans, 3 = cis or trans (either) double bond

rrr      Bond topology         0 = Either, 1 = Ring, 2 = Chain
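Bond lines follow the same fixed-width scheme as the rest of the Ctab, with seven three-character fields. As a rough sketch, the code below parses the nine bonds of the Ctab example shown earlier in this deck and counts the double bonds. The helper names are invented for this example, and a blank or missing field is assumed to mean 0.

```python
BOND_TYPES = {
    1: "Single", 2: "Double", 3: "Triple", 4: "Aromatic",
    5: "Single or Double", 6: "Single or Aromatic",
    7: "Double or Aromatic", 8: "Any",
}


def parse_bond_line(line):
    """Decode a V2000 bond line; every field is three characters wide."""
    def field(index):
        text = line[3 * index:3 * index + 3].strip()
        return int(text) if text else 0

    return {
        "atoms": (field(0), field(1)),                 # 111, 222
        "type": BOND_TYPES.get(field(2), "Unknown"),   # ttt
        "stereo": field(3),                            # sss
        "topology": field(5),                          # rrr (field 4 is xxx)
    }


# The nine bonds of the Ctab example: (first atom, second atom, type, stereo).
example_bonds = [
    (1, 2, 1, 0), (2, 8, 1, 0), (2, 3, 2, 3), (3, 4, 1, 0), (4, 5, 2, 0),
    (4, 6, 1, 0), (6, 7, 2, 3), (7, 8, 1, 0), (8, 9, 2, 0),
]
bond_lines = ["".join(f"{v:3d}" for v in bond) for bond in example_bonds]
bonds = [parse_bond_line(line) for line in bond_lines]
double_bonds = [b["atoms"] for b in bonds if b["type"] == "Double"]
print(double_bonds)
```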


Mol File

A molfile consists of a header block and a connection table. The slide shows a molfile for alanine alongside its structure, with the following annotations:

• Header block: identifies the molfile: molecule name, user's name, program, date, and other miscellaneous information and comments.
• Properties block, charge line: atom 4 has charge +1 and atom 6 has charge -1.
• Properties block, isotope line: 1 entry for an isotope; atom 3 has mass 13.
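The charge and isotope annotations correspond to "M  CHG" and "M  ISO" lines in the properties block, each holding an entry count followed by atom-number/value pairs. The sketch below reads such lines. The sample lines are reconstructed from the annotations above rather than copied from the slide, and splitting on whitespace is a simplifying assumption (the format itself is column-based).

```python
def parse_property_line(line):
    """Return (kind, {atom_number: value}) for M CHG / M ISO lines, else None."""
    tokens = line.split()
    if len(tokens) < 3 or tokens[0] != "M" or tokens[1] not in ("CHG", "ISO"):
        return None
    count = int(tokens[2])                      # number of atom/value pairs
    values = [int(t) for t in tokens[3:3 + 2 * count]]
    return tokens[1], dict(zip(values[0::2], values[1::2]))


# Reconstructed from the annotations: charges on atoms 4 and 6, mass 13 on atom 3.
properties_block = [
    "M  CHG  2   4   1   6  -1",
    "M  ISO  1   3  13",
    "M  END",
]
parsed = [parse_property_line(line) for line in properties_block]
print(parsed)
```

Charges given this way are literal values (+1, -1), unlike the coded ccc field of the atom block.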


Representation of Stereochemistry

What is stereochemistry? http://www.chemhelper.com/enantiomers.html


Representation of Stereochemistry: Atom Block


Representation of Stereochemistry: Bond Block

A bond stereo value of 1 marks a stereo bond as "up".


RGfiles

In RGfiles, lines beginning with $ define the overall structure of the Rgroup query; the molfile header block is embedded in the Rgroup header block. In addition to the primary connection table (Ctab block) for the root structure, a Ctab block defines each member (*m) within each Rgroup (*r).


Example of RGfile


SDfile

An SDfile (structure-data file) contains the structural information and associated data items for one or more compounds.

In the format diagram:
*l is repeated for each line of data
*d is repeated for each data item
*c is repeated for each compound
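The nesting described above (lines within data items within compounds) can be illustrated with a small reader. This sketch relies on layout details that the slide text does not spell out: each compound ends with a "$$$$" line, the molfile part ends with "M  END", and each data item starts with a "> <FIELD>" header and ends with a blank line. The sample records, including their empty Ctabs and the NAME field, are hypothetical.

```python
import re


def read_sdfile(text):
    """Split SDfile text into records of {'molblock': str, 'data': {field: [lines]}}."""
    records = []
    mol_lines, data, field, in_data = [], {}, None, False
    for line in text.splitlines():
        if line.strip() == "$$$$":              # end of one compound (*c)
            records.append({"molblock": "\n".join(mol_lines), "data": data})
            mol_lines, data, field, in_data = [], {}, None, False
        elif not in_data:                       # still inside the molfile part
            mol_lines.append(line)
            if line.startswith("M  END"):
                in_data = True
        elif line.startswith(">"):              # data header starts one item (*d)
            match = re.search(r"<([^>]+)>", line)
            field = match.group(1) if match else None
            if field is not None:
                data[field] = []
        elif field is not None:
            if line.strip():                    # one line of data (*l)
                data[field].append(line)
            else:                               # blank line closes the item
                field = None
    return records


empty_ctab = ["", "  example", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END"]
sample = "\n".join(
    empty_ctab + ["> <NAME>", "example-1", "", "$$$$"]
    + empty_ctab + ["> <NAME>", "example-2", "", "$$$$"]
)
records = read_sdfile(sample)
print([record["data"] for record in records])
```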


Example of SDfile


RXNfile

Rxnfiles contain structural data for the reactants and products of a reaction.

In the format diagram:
*r is repeated for each reactant
*p is repeated for each product
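As a rough illustration of the reactant/product repetition, the sketch below splits a rxnfile into its molfiles. It assumes a layout that the slide text does not state: a "$RXN" line, three further header lines, a counts line holding the number of reactants and products in two three-character fields, and a "$MOL" line before each molfile. The sample content, with empty Ctabs, is hypothetical.

```python
def split_rxnfile(text):
    """Return (reactant_molfiles, product_molfiles) as lists of line lists."""
    lines = text.splitlines()
    if len(lines) < 5 or not lines[0].startswith("$RXN"):
        raise ValueError("not a rxnfile: missing $RXN header")
    n_reactants = int(lines[4][0:3])
    n_products = int(lines[4][3:6])
    molfiles, current = [], None
    for line in lines[5:]:
        if line.startswith("$MOL"):     # delimiter before each molfile
            current = []
            molfiles.append(current)
        elif current is not None:
            current.append(line)
    if len(molfiles) != n_reactants + n_products:
        raise ValueError("molfile count does not match the counts line")
    return molfiles[:n_reactants], molfiles[n_reactants:]


empty_molfile = ["", "  example", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END"]
sample_rxn = "\n".join(
    ["$RXN", "", "  example", "", "  1  1"]
    + ["$MOL"] + empty_molfile
    + ["$MOL"] + empty_molfile
)
reactants, products = split_rxnfile(sample_rxn)
print(len(reactants), len(products))
```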


RXNfile example


RDfiles

• An RDfile (reaction-data file) consists of a set of editable “records”. Each record defines a molecule or reaction, and its associated data.

• The [RDfile Header] must occur at the beginning of the physical file and identifies the file as an RDfile. A version stamp of 1 is given for future expansion of the format.

• $DATM: Date/time stamp (M/D/Y, followed by the time). This line is treated as a comment and ignored when the file is read.

In the format diagram:
*d is repeated for each data item
*r is repeated for each reaction or molecule
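The header check and the record/data-item repetition can be sketched as follows. The keywords used here ($RDFILE, $RFMT, $MFMT, $DTYPE, $DATUM) come from the RDfile format itself rather than from the slide text, the sample content is hypothetical, and the sketch keeps only the first line of each datum and skips any embedded structures.

```python
def read_rdfile(text):
    """Return (version, records) from RDfile text, keeping only data items."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("$RDFILE"):
        raise ValueError("not an RDfile: missing $RDFILE header")
    version = lines[0].split()[1]
    records, dtype = [], None
    for line in lines[1:]:
        if line.startswith("$DATM"):
            continue                            # date/time stamp: comment only
        if line.startswith(("$RFMT", "$MFMT")):
            kind = "reaction" if line.startswith("$RFMT") else "molecule"
            records.append({"kind": kind, "data": {}})   # one record (*r)
            dtype = None
        elif line.startswith("$DTYPE") and records:
            dtype = line[len("$DTYPE"):].strip()         # start of one item (*d)
        elif line.startswith("$DATUM") and records and dtype is not None:
            records[-1]["data"][dtype] = line[len("$DATUM"):].strip()
    return version, records


sample_rd = "\n".join([
    "$RDFILE 1",
    "$DATM    06/23/15 10:00",
    "$MFMT $MIREG 101",
    "$DTYPE NAME",
    "$DATUM example molecule",
    "$RFMT $RIREG 7",
    "$DTYPE YIELD",
    "$DATUM 85",
])
version, rd_records = read_rdfile(sample_rd)
print(version, rd_records)
```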


RDfile example


Mol2 files from TRIPOS

The mol2 format originates from Tripos. It contains atom coordinates, bonds, and substructure information, and it supports partial charges and isotopes.

In the example:

• Lines 1, 2, 3, 5, and 6 are comments. They contain the molecule name and information about when the molecule was created and last modified.

• Lines 8, 15, 28, and 41 are Record Type Indicators (RTIs). An RTI indicates the type of data that follows in a .mol2 file.

• Lines 9-12, 16-27, 29-40, and 42 are data records.


Parts of mol2 file

@<TRIPOS>MOLECULE
The first data line is the name of the molecule. The second data line contains the number of atoms, bonds, substructures, features, and sets associated with the molecule. The third data line is the molecule type. The fourth data line gives the type of charges associated with the molecule. The fifth data line contains the internal SYBYL status bits associated with the molecule. The last data line contains any comment associated with the molecule.

@<TRIPOS>ATOM
atom_id atom_name x y z atom_type [subst_id [subst_name [charge [status_bit]]]]
Example: 1 CA -0.149 0.299 0.000 C.3 1 ALA1 0.000 BACKBONE|DICT|DIRECT
In this example the atom has ID number 1. It is named CA and is located at (-0.149, 0.299, 0.000). Its atom type is C.3. It belongs to the substructure with ID 1, which is named ALA1. The charge associated with the atom is 0.000, and the SYBYL status bits associated with the atom are BACKBONE, DICT, and DIRECT.

@<TRIPOS>BOND
bond_id origin_atom_id target_atom_id bond_type [status_bits]
Example: 1 1 2 ar
The example bond has ID number 1 and connects atoms 1 and 2. It is an aromatic bond.

@<TRIPOS>SUBSTRUCTURE
subst_id subst_name root_atom [subst_type [dict_type [chain [sub_type [inter_bonds [status [comment]]]]]]]
Example: 1 BENZENE1 1 PERM 0 **** ****** 0 ROOT
The substructure has ID 1 and is named BENZENE1. Its root atom has ID 1. It is of type PERM and is associated with dictionary type 0. The SYBYL status bits indicate that it is the ROOT substructure.
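Since every mol2 section starts with an @<TRIPOS> record type indicator, a reader can first group data lines by RTI and then decode each section. The sketch below does this for the ATOM and BOND examples above. It makes several simplifying assumptions: fields are split on whitespace, blank lines and lines starting with "#" are skipped, and a repeated section overwrites the earlier one. The function names and the MOLECULE lines in the sample are invented for this example.

```python
def read_mol2_sections(text):
    """Group the data lines of a mol2 file by their record type indicator."""
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith("@<TRIPOS>"):
            current = line[len("@<TRIPOS>"):].strip()
            sections[current] = []
        elif current is not None and line.strip() and not line.startswith("#"):
            sections[current].append(line)
    return sections


def parse_mol2_atom(line):
    """Decode an ATOM data line; fields after atom_type are optional."""
    fields = line.split()
    atom = {
        "id": int(fields[0]),
        "name": fields[1],
        "xyz": tuple(float(v) for v in fields[2:5]),
        "type": fields[5],
    }
    optional = ("subst_id", "subst_name", "charge", "status_bits")
    atom.update(zip(optional, fields[6:]))
    if "charge" in atom:
        atom["charge"] = float(atom["charge"])
    if "status_bits" in atom:
        atom["status_bits"] = atom["status_bits"].split("|")
    return atom


sample_mol2 = "\n".join([
    "# hypothetical fragment for illustration",
    "@<TRIPOS>MOLECULE",
    "example",
    "@<TRIPOS>ATOM",
    "1 CA -0.149 0.299 0.000 C.3 1 ALA1 0.000 BACKBONE|DICT|DIRECT",
    "@<TRIPOS>BOND",
    "1 1 2 ar",
])
sections = read_mol2_sections(sample_mol2)
atom = parse_mol2_atom(sections["ATOM"][0])
print(atom)
```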


References

• http://www.tripos.com/data/support/mol2.pdf

• http://accelrys.com/products/informatics/cheminformatics/ctfile-formats/no-fee.php

• Arthur Dalby et al. Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited. J. Chem. Inf. Comput. Sci. 1992, 32, 244-255.

• http://www.chem.ucla.edu/harding/tutorials/stereochem/rsez.pdf

• http://www.chem.ucla.edu/harding/notes/notes_14C_stereo03.pdf