Chemical file formats for storing chemical data
Molecular File Formats
Types of File Formats
Elsevier MDL supports a number of file formats for the representation and communication of chemical information.
Name Description
molfiles Each molfile describes a single molecular structure, which can contain disjoint fragments such as salts.
SDfiles Structure-data files, which contain structures and data for any number of molecules. SDfiles are the primary format for large-scale data transfer between MDL databases.
RGfiles An RGfile describes a single molecular query with Rgroups. Each RGfile is a combination of Ctabs defining the root molecule and each member of each Rgroup in the query.
rxnfiles Reaction files. Each rxnfile contains the structural information for the reactants and products of a single reaction.
RDfiles Reaction-data files. The RDfile is a more general format that can include reactions as well as molecules.
File Formats
http://c4.cabrillo.edu/404/ctfile.pdf
Connection Table [Ctab]
A connection table (Ctab) contains information describing the structural relationships and properties of a collection of atoms. The connection table is fundamental to all of the MDL file formats.
Counts line:
  9  9  0  0  0  0  0  0  0  0999 V2000
Atom block:
   -1.0200    1.5300    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5100    2.4100    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.5000    2.3900    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0000    3.2700    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.0300    3.2700    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.5000    4.1500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5000    4.1500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0100    3.2800    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.0300    3.2800    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
Bond block:
  1  2  1  0
  2  8  1  0
  2  3  2  3
  3  4  1  0
  4  5  2  0
  4  6  1  0
  6  7  2  3
  7  8  1  0
  8  9  2  0
Ctab Features
Parts of Ctab Description
Counts Line Important specifications here relate to the number of atoms, bonds, and atom lists, the chiral flag setting, and the Ctab version.
Atom Block Specifies the atomic symbol and any mass difference, charge, stereochemistry, and associated hydrogens for each atom.
Bond Block Specifies the two atoms connected by the bond, the bond type, and any bond stereochemistry and topology (chain or ring properties) for each bond.
Properties Block Provides for future expandability of Ctab features, while maintaining compatibility with earlier Ctab configurations.
1. Counts Line
aaabbblllfffcccsssxxxrrrpppiiimmmvvvvvv
where
• aaa = number of atoms (current max 255) [Generic]
• bbb = number of bonds (current max 255) [Generic]
• lll = number of atom lists (max 30) [Query]
• fff = (obsolete)
• ccc = chiral flag: 0 = not chiral, 1 = chiral [Generic]
• sss = number of stext entries [MDL ISIS/Desktop]
• xxx, rrr, ppp, iii = (obsolete)
• mmm = number of lines of additional properties, including the M END line. No longer supported; the default is set to 999. [Generic]
• vvvvvv = Ctab version: V2000 or V3000 [Generic]
Example: the following counts line shows six atoms, five bonds, the CHIRAL flag on, and three lines in the properties block:
  6  5  0  0  1  0              3 V2000
The following counts line shows 9 atoms, 9 bonds, and the CHIRAL flag off:
  9  9  0  0  0  0  0  0  0  0999 V2000
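The fixed-width layout of the counts line can be read by slicing columns. The following is a minimal Python sketch, not a reference implementation. It assumes the full V2000 layout from the CTfile specification, in which four obsolete 3-character fields sit between sss and mmm, so mmm starts at column offset 30. The function name parse_counts_line is made up for this illustration.

```python
def parse_counts_line(line: str) -> dict:
    """Read the fixed-width fields of a V2000 counts line (illustrative helper)."""

    def field(start: int) -> int:
        # Each numeric field is 3 characters wide; a blank field is read as 0.
        text = line[start:start + 3].strip()
        return int(text) if text else 0

    return {
        "atoms": field(0),            # aaa
        "bonds": field(3),            # bbb
        "atom_lists": field(6),       # lll
        "chiral": field(12) == 1,     # ccc
        "stext_entries": field(15),   # sss
        "property_lines": field(30),  # mmm (assumed offset, see lead-in)
        "version": line[33:39].strip(),
    }


# Both examples are built from 3-character pieces so the columns stay explicit.
ring_example = "  9  9" + "  0" * 8 + "999" + " V2000"
chiral_example = "  6  5  0  0  1  0" + " " * 12 + "  3" + " V2000"

for example in (ring_example, chiral_example):
    print(parse_counts_line(example))
```

Slicing by column rather than splitting on whitespace matters here because adjacent fields can touch, as in "0999", and obsolete fields may be left blank.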
2. Atom Block
The Atom Block is made up of atom lines, one line per atom, with the following format:
xxxxx.xxxxyyyyy.yyyyzzzzz.zzzz aaaddcccssshhhbbbvvvHHHrrriiimmmnnneee
Field Meaning Values
XYZ Atom coordinates
aaa Atom symbol Entry in periodic table, or L for atom list; A, Q, or * for unspecified atom; LP for lone pair; R# for Rgroup label
dd Mass difference -3, -2, -1, 0, 1, 2, 3, 4 (0 if value beyond these limits)
ccc Charge 0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1, 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3
sss atom stereo parity 0 = not stereo, 1 = odd, 2 = even, 3 = either or unmarked stereo center.
hhh hydrogen count + 1 1 = H0, 2 = H1, 3 = H2, 4 = H3, 5 = H4
bbb stereo care box 0 = ignore stereo configuration of this double bond atom, 1 = stereo configuration of double bond atom must match
vvv Valence 0 = no marking (default), (1 to 14) = (1 to 14), 15 = zero valence
HHH H0 designator 0 = not specified, 1 = no H atoms allowed
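As an illustration of the atom line layout and the charge codes in the table above, here is a hedged Python sketch. It assumes three 10-character coordinate fields followed by one space and a 3-character symbol, as in the CTfile specification. The names CHARGE_CODES and parse_atom_line are invented for this example, and the sample line is hand-written.

```python
# Charge field code -> formal charge, following the table above.
# Code 4 marks a doublet radical and carries no charge.
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 4: 0, 5: -1, 6: -2, 7: -3}


def parse_atom_line(line: str) -> dict:
    """Decode the leading fields of a V2000 atom line (illustrative helper)."""

    def number(start: int, width: int) -> int:
        # Blank or missing trailing fields are read as 0.
        text = line[start:start + width].strip()
        return int(text) if text else 0

    charge_code = number(36, 3)
    return {
        "x": float(line[0:10]),
        "y": float(line[10:20]),
        "z": float(line[20:30]),
        "symbol": line[31:34].strip(),        # aaa, after one separating space
        "mass_difference": number(34, 2),     # dd
        "charge": CHARGE_CODES.get(charge_code, 0),
        "doublet_radical": charge_code == 4,
        "stereo_parity": number(39, 3),       # sss
        "hydrogen_count_plus_1": number(42, 3),  # hhh
    }


# Hand-written sample: a nitrogen atom whose charge code 3 means +1.
atom = parse_atom_line("   -1.8622   -0.3695    0.0000 N   0  3  0  0  0  0")
print(atom["symbol"], atom["charge"], atom["x"], atom["y"])
```

The lookup table keeps the unusual encoding (1 = +3 down to 7 = -3) in one place instead of scattering arithmetic through the parser.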
3. Bond Block
The Bond Block is made up of bond lines, one line per bond, with the following format:
111222tttsssxxxrrrccc
Field Meaning Values
111 First atom number 1 - number of atoms
222 Second atom number 1 - number of atoms
ttt Bond type 1 = Single, 2 = Double, 3 = Triple, 4 = Aromatic, 5 = Single or Double, 6 = Single or Aromatic, 7 = Double or Aromatic, 8 = Any
sss Bond stereo Single bonds: 0 = not stereo, 1 = Up, 4 = Either, 6 = Down; Double bonds: 0 = use x-, y-, z-coords from atom block to determine cis or trans, 3 = cis or trans (either) double bond
rrr Bond topology 0 = Either, 1 = Ring, 2 = Chain
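The bond line can be decoded the same way, as seven 3-character fields. This is a small Python sketch under that assumption. The dictionaries mirror the value tables above, and parse_bond_line is a hypothetical helper name.

```python
BOND_TYPES = {
    1: "single", 2: "double", 3: "triple", 4: "aromatic",
    5: "single or double", 6: "single or aromatic",
    7: "double or aromatic", 8: "any",
}
TOPOLOGY = {0: "either", 1: "ring", 2: "chain"}


def parse_bond_line(line: str) -> dict:
    """Decode a V2000 bond line: 111 222 ttt sss xxx rrr ccc (illustrative helper)."""
    values = []
    for start in range(0, 21, 3):
        # Short lines are allowed; missing trailing fields are read as 0.
        text = line[start:start + 3].strip()
        values.append(int(text) if text else 0)
    first, second, bond_type, stereo, _xxx, topology, _ccc = values
    return {
        "first_atom": first,
        "second_atom": second,
        "type": BOND_TYPES.get(bond_type, "unknown"),
        "stereo": stereo,
        "topology": TOPOLOGY.get(topology, "unknown"),
    }


# Two bond lines in the style of the connection table example.
for bond_line in ("  1  2  1  0", "  2  3  2  3"):
    print(parse_bond_line(bond_line))
```

For the second line, the stereo value 3 on a double bond corresponds to "cis or trans (either)" in the table above.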
Molfile
A molfile consists of a header block and a connection table. The slide shows a molfile for alanine next to its structure, with the following annotations:
Header block: identifies the molfile: molecule name, user's name, program, date, and other miscellaneous information and comments.
Properties block, charge entries: atom 4: charge +1; atom 6: charge -1.
Properties block, isotope entries: 1 entry for an isotope; atom 3: mass = 13.
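The charge and isotope annotations correspond to M CHG and M ISO lines in the V2000 properties block. The Python sketch below shows one way to read them. It assumes the usual layout of an entry count followed by atom-number/value pairs, the function name parse_property_lines is invented here, and the sample lines are written by hand to match the annotations above.

```python
def parse_property_lines(lines):
    """Collect M  CHG and M  ISO entries until M  END (illustrative helper)."""
    charges, isotopes = {}, {}
    for line in lines:
        tokens = line.split()
        if tokens[:2] == ["M", "END"]:
            break
        if len(tokens) < 3 or tokens[0] != "M" or tokens[1] not in ("CHG", "ISO"):
            continue  # other property types are skipped in this sketch
        count = int(tokens[2])
        pairs = tokens[3:3 + 2 * count]
        target = charges if tokens[1] == "CHG" else isotopes
        # Entries alternate: atom number, value, atom number, value, ...
        for atom_number, value in zip(pairs[0::2], pairs[1::2]):
            target[int(atom_number)] = int(value)
    return charges, isotopes


# Hand-written sample lines matching the annotations above.
sample_properties = [
    "M  CHG  2   4   1   6  -1",
    "M  ISO  1   3  13",
    "M  END",
]
charges, isotopes = parse_property_lines(sample_properties)
print(charges)
print(isotopes)
```

Whitespace splitting is safe for these two line types because each entry field is wide enough to keep a separating space in front of the value.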
Representation of Stereochemistry
What is stereochemistry? See http://www.chemhelper.com/enantiomers.html
Representation of Stereochemistry : Atom Block
Representation of Stereochemistry : Bond Block
1 = stereo bond up
RGfiles
In RGfiles, lines beginning with $ define the overall structure of the Rgroup query; the molfile header block is embedded in the Rgroup header block. In addition to the primary connection table (Ctab block) for the root structure, a Ctab block defines each member (*m) within each Rgroup (*r).
Example of RGfile
SDfile
An SDfile (structure-data file) contains the structural information and associated data items for one or more compounds.
*l is repeated for each line of data
*d is repeated for each data item
*c is repeated for each compound
Example of SDfile
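The repeating structure of an SDfile (compounds, data items, data lines) can be sketched in Python as follows. This is an illustration only, not the example from the slide. It assumes that each compound ends with a $$$$ line, that each data header starts with ">" and carries the field name in angle brackets, and that a blank line ends a data item. The sample file and its NAME and MW fields are made up.

```python
import re


def parse_record(lines):
    """Split one SDfile record into its molfile and its data items."""
    end = None
    for index, line in enumerate(lines):
        if line.startswith("M  END"):
            end = index
            break
    if end is None:
        raise ValueError("record has no M  END line")
    data, name = {}, None
    for line in lines[end + 1:]:
        if line.startswith(">"):
            match = re.search(r"<([^>]+)>", line)
            name = match.group(1) if match else None
            if name is not None:
                data[name] = []
        elif not line.strip():
            name = None  # a blank line closes the current data item
        elif name is not None:
            data[name].append(line)
    return {
        "molfile": "\n".join(lines[:end + 1]),
        "data": {key: "\n".join(value) for key, value in data.items()},
    }


def parse_sdfile(text):
    """Return one parsed record per compound; $$$$ ends each compound."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(parse_record(current))
            current = []
        else:
            current.append(line)
    return records


def tiny_record(name, symbol, weight):
    # Hypothetical one-atom molfile plus two invented data fields.
    return [
        name, "  sketch", "",
        "  1  0  0  0  0  0  0  0  0  0999 V2000",
        "    0.0000    0.0000    0.0000 " + symbol + "   0  0",
        "M  END",
        "> <NAME>", name, "",
        "> <MW>", weight, "",
        "$$$$",
    ]


SAMPLE_SDFILE = "\n".join(tiny_record("methane", "C", "16.04") + tiny_record("water", "O", "18.02"))
records = parse_sdfile(SAMPLE_SDFILE)
for record in records:
    print(record["data"])
```

In practice a cheminformatics toolkit would normally do this parsing; the sketch only shows how the *l, *d, and *c repetition maps onto the file.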
RXNfile
Rxnfiles contain structural data for the reactants and products of a reaction.
where:
*r is repeated for each reactant
*p is repeated for each product
RXNfile example
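A rough Python sketch of how a V2000 rxnfile can be split into reactant and product molfiles follows. It is not the example from the slide. It assumes a $RXN line, three further header lines, a counts line holding the reactant and product counts in two 3-character fields, and a $MOL line before each embedded molfile. The sample reaction and the function name parse_rxnfile are invented.

```python
def parse_rxnfile(text):
    """Split a V2000 rxnfile into reactant and product molfiles (sketch)."""
    lines = text.splitlines()
    if len(lines) < 5 or lines[0].strip() != "$RXN":
        raise ValueError("not a V2000 rxnfile")
    # Line 5 holds the counts: rrr (reactants) then ppp (products).
    n_reactants = int(lines[4][0:3])
    n_products = int(lines[4][3:6])
    molfiles = []
    for line in lines[5:]:
        if line.startswith("$MOL"):
            molfiles.append([])       # start a new embedded molfile
        elif molfiles:
            molfiles[-1].append(line)
    if len(molfiles) != n_reactants + n_products:
        raise ValueError("molfile count does not match the counts line")
    blocks = ["\n".join(block) for block in molfiles]
    return blocks[:n_reactants], blocks[n_reactants:]


def tiny_molfile(name, symbol):
    # Hypothetical one-atom molfile used only to exercise the parser.
    return [
        "$MOL", name, "  sketch", "",
        "  1  0  0  0  0  0  0  0  0  0999 V2000",
        "    0.0000    0.0000    0.0000 " + symbol + "   0  0",
        "M  END",
    ]


SAMPLE_RXNFILE = "\n".join(
    ["$RXN", "sample reaction", "  sketch", "", "  1  1"]
    + tiny_molfile("reactant", "C")
    + tiny_molfile("product", "O")
)
reactants, products = parse_rxnfile(SAMPLE_RXNFILE)
print(len(reactants), len(products))
```

Reactants always come first in the file, so the counts line alone is enough to tell the two groups apart.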
RDfiles
• An RDfile (reaction-data file) consists of a set of editable "records". Each record defines a molecule or reaction, and its associated data.
• The [RDfile Header] must occur at the beginning of the physical file and identifies the file as an RDfile. A version stamp of 1 is given for future expansion of the format.
• $DATM: Date/time (M/D/Y, H:m) stamp. This line is treated as a comment and ignored when the file is read.
*d is repeated for each data item
*r is repeated for each reaction or molecule
RDfile example
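The record structure of an RDfile can be sketched in Python as below. This is a simplified illustration, not the example from the slide. It assumes that records begin with a $RFMT (reaction) or $MFMT (molecule) line and that data items appear as $DTYPE/$DATUM pairs. Multi-line data values are ignored, and the sample file and its field names are made up.

```python
def parse_rdfile(text):
    """Split an RDfile into records with their data items (simplified sketch)."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("$RDFILE"):
        raise ValueError("missing $RDFILE header")
    records, dtype = [], None
    for line in lines[1:]:
        if line.startswith("$DATM"):
            continue  # date/time stamp, treated as a comment
        if line.startswith(("$RFMT", "$MFMT")):
            kind = "reaction" if line.startswith("$RFMT") else "molecule"
            records.append({"kind": kind, "structure": [], "data": {}})
            dtype = None
        elif not records:
            continue
        elif line.startswith("$DTYPE"):
            dtype = line[len("$DTYPE"):].strip()
        elif line.startswith("$DATUM") and dtype is not None:
            records[-1]["data"][dtype] = line[len("$DATUM"):].strip()
        elif dtype is None:
            # Lines before the first $DTYPE belong to the structure block.
            records[-1]["structure"].append(line)
    return records


# Hypothetical file with one molecule record and two invented data fields.
SAMPLE_RDFILE = "\n".join([
    "$RDFILE 1",
    "$DATM    10/17/91 10:41",
    "$MFMT",
    "methane", "  sketch", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0",
    "M  END",
    "$DTYPE NAME",
    "$DATUM methane",
    "$DTYPE SOURCE",
    "$DATUM illustrative sample",
])
rd_records = parse_rdfile(SAMPLE_RDFILE)
print(rd_records[0]["kind"], rd_records[0]["data"])
```

The structure block is kept as raw lines so that a molfile or rxnfile parser can be applied to it afterwards.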
Mol2 files from TRIPOS
The mol2 format originated at Tripos. It contains atom coordinates, bonds, and substructure information, and supports partial charges and isotopes.
• Lines 1, 2, 3, 5, and 6 of the example are comments. They contain the molecule name and information about when the molecule was created and last modified.
• Lines 8, 15, 28, and 41 of the example are Record Type Indicators (RTIs). An RTI indicates the type of data that follows in a .mol2 file.
• Lines 9-12, 16-27, 29-40, and 42 are all data records.
Parts of mol2 file
@<TRIPOS>MOLECULE
The first data line is the name of the molecule. The second data line contains the number of atoms, bonds, substructures, features, and sets associated with the molecule. The third data line is the molecule type. The fourth data line gives the type of charges associated with the molecule. The fifth data line contains the internal SYBYL status bits associated with the molecule. The last data line contains any comment associated with the molecule.
@<TRIPOS>ATOM
atom_id atom_name x y z atom_type [subst_id [subst_name [charge [status_bit]]]]
Example: 1 CA -0.149 0.299 0.000 C.3 1 ALA1 0.000 BACKBONE|DICT|DIRECT
In the example above, the atom has ID number 1. It is named CA and is located at (-0.149, 0.299, 0.000). Its atom type is C.3. It belongs to the substructure with ID 1, which is named ALA1. The charge associated with the atom is 0.000, and the SYBYL status bits associated with the atom are BACKBONE, DICT, and DIRECT.
@<TRIPOS>BOND
bond_id origin_atom_id target_atom_id bond_type [status_bits]
Example: 1 1 2 ar
The example bond has ID number 1 and connects atoms 1 and 2. It is an aromatic bond.
@<TRIPOS>SUBSTRUCTURE
subst_id subst_name root_atom [subst_type [dict_type [chain [sub_type [inter_bonds [status [comment]]]]]]]
Example: 1 BENZENE1 PERM 0 **** ****** 0 ROOT
The substructure has ID 1 and the name BENZENE1. It is of type PERM and is associated with dictionary type 0. The SYBYL status bits indicate that it is the ROOT substructure.
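The record layout described above lends itself to a simple section-based reader. The Python sketch below groups data lines under their Record Type Indicator and decodes ATOM and BOND records by splitting on whitespace. It assumes a single molecule per file and comment lines starting with "#". The function names and the short sample (including the second atom and the two-line MOLECULE record) are invented for illustration.

```python
def split_mol2_sections(text):
    """Group data lines under their Record Type Indicator (single molecule)."""
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith("@<TRIPOS>"):
            current = line[len("@<TRIPOS>"):].strip()
            sections[current] = []
        elif current is not None and line.strip() and not line.startswith("#"):
            sections[current].append(line)
    return sections


def parse_atom(line):
    """Decode an ATOM record; the bracketed trailing fields are optional."""
    f = line.split()
    return {
        "id": int(f[0]),
        "name": f[1],
        "xyz": tuple(float(value) for value in f[2:5]),
        "type": f[5],
        "subst_id": int(f[6]) if len(f) > 6 else None,
        "subst_name": f[7] if len(f) > 7 else None,
        "charge": float(f[8]) if len(f) > 8 else None,
        "status_bits": f[9].split("|") if len(f) > 9 else [],
    }


def parse_bond(line):
    """Decode the four required fields of a BOND record."""
    f = line.split()
    return {"id": int(f[0]), "origin": int(f[1]), "target": int(f[2]), "type": f[3]}


# Hand-written sample; the first ATOM line and the BOND line reuse the
# examples from the text, everything else is made up.
SAMPLE_MOL2 = "\n".join([
    "# illustrative sample",
    "@<TRIPOS>MOLECULE",
    "sample",
    "2 1 1",
    "@<TRIPOS>ATOM",
    "1 CA -0.149 0.299 0.000 C.3 1 ALA1 0.000 BACKBONE|DICT|DIRECT",
    "2 CB 1.000 0.500 0.000 C.3",
    "@<TRIPOS>BOND",
    "1 1 2 ar",
])
sections = split_mol2_sections(SAMPLE_MOL2)
atoms = [parse_atom(line) for line in sections["ATOM"]]
bonds = [parse_bond(line) for line in sections["BOND"]]
print(atoms[0]["name"], atoms[0]["type"], atoms[0]["status_bits"])
print(bonds[0])
```

Unlike the MDL formats, mol2 records are free-format, so splitting on whitespace is appropriate here.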
References
• http://www.tripos.com/data/support/mol2.pdf
• http://accelrys.com/products/informatics/cheminformatics/ctfile-formats/no-fee.php
• Arthur Dalby et al. Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited. J. Chem. Inf. Comput. Sci. 1992, 32, 244-255.
• http://www.chem.ucla.edu/harding/tutorials/stereochem/rsez.pdf
• http://www.chem.ucla.edu/harding/notes/notes_14C_stereo03.pdf