Chapter: Open Source Tools For Read Across And Category - Ambit

Post on 12-Feb-2022


AMBIT

AMBIT is open source software for chemoinformatics data management, developed with industry funding through a CEFIC LRI project. The AMBIT2 software consists of a database and functional modules that support a variety of queries and data mining of the information stored in the database, and is distributed under xxx licence. AMBIT XT is a user-friendly application with a graphical user interface, based on the AMBIT2 modules and distributed under the LGPL licence. AMBIT XT provides a set of functionalities to facilitate the evaluation and registration of chemicals for REACH. It introduces the concept of workflows, which guide users step by step towards a particular goal, and provides workflows for analogue identification and PBT assessment. The software is a standalone application, with an option to install the database on a server.

Modules

AMBIT is organised into several modules with well-defined dependencies, listed in Table 1.

Database

The AMBIT database is a relational database consisting of several repositories for compounds, properties, QSAR models, users and references, as well as several tables containing pre-processed information that speeds up substructure and similarity queries. The current implementation is based on MySQL. Database functionality is provided by the ambit2-db module. The database tables are listed in Table 2.

Chemical compounds

Chemical compounds are stored in the table chemicals and each is assigned a unique number. If connectivity is available, a unique SMILES, an InChI and a molecular formula are generated and stored. The database supports multiple 3D structures per compound, either coming from different inventories or generated by external programs and imported into the database. The chemical structures are stored in the table structure as compressed text; the supported formats are SDF, MOL and CML. The choice of a text format makes the database transparent and easy to use from external software. Support for multiple formats is motivated by the need to keep the data in its original format. If the original format is not one of the formats above, it is converted to MOL. Support of internal formats will be extended in future releases.
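As an illustration of storing structures as compressed text, the following minimal sketch keeps a MOL block in a structure table and reads it back. It rests on several assumptions that are not taken from AMBIT itself: SQLite stands in for MySQL, zlib stands in for whatever compression AMBIT uses, and all column names are hypothetical.

```python
import sqlite3
import zlib

# Hypothetical, heavily simplified schema; only the table names come from the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chemicals (
    idchemical INTEGER PRIMARY KEY,
    smiles TEXT, inchi TEXT, formula TEXT
);
CREATE TABLE structure (
    idstructure INTEGER PRIMARY KEY,
    idchemical INTEGER REFERENCES chemicals(idchemical),
    format TEXT CHECK (format IN ('SDF', 'MOL', 'CML')),
    structure BLOB
);
""")


def add_structure(conn, idchemical, text, fmt):
    """Compress the structure text and store it in its original format."""
    blob = zlib.compress(text.encode("utf-8"))
    cur = conn.execute(
        "INSERT INTO structure (idchemical, format, structure) VALUES (?, ?, ?)",
        (idchemical, fmt, blob),
    )
    return cur.lastrowid


def get_structure(conn, idstructure):
    """Return (format, text) for a stored structure."""
    fmt, blob = conn.execute(
        "SELECT format, structure FROM structure WHERE idstructure = ?",
        (idstructure,),
    ).fetchone()
    return fmt, zlib.decompress(blob).decode("utf-8")
```

Because the stored value is ordinary compressed text, any external tool that can decompress it can read the structure without going through AMBIT classes.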


Data provenance

The database provides means to identify the origin of the data, i.e. the specific inventory a compound originated from. An inventory is identified by its name and reference (table src_dataset). Each compound may belong to multiple inventories (table struc_dataset), which allows users to select the compounds of interest for specific regulatory purposes. Moreover, the data provenance indicator can distinguish between different conformations, for example when one conformation of a compound comes from one inventory and a different conformation comes from another.

Updates of the chemical structures are recorded and earlier versions are stored in the history table. When structures are imported from a file, they are stored in their original format in the structure table. If a structure is subsequently updated as a result of a specific calculation (e.g. 3D conversion) or another structure import step (e.g. an updated version of the original file), the new version is stored and becomes the current one, while the previous version is moved to the history table.
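The update rule described above can be sketched in a few lines. This is an in-memory illustration only: the class name, the integer version counter and the tuple layout are assumptions, not the actual layout of the structure and history tables.

```python
from dataclasses import dataclass, field


@dataclass
class StructureStore:
    """Toy model of the structure table plus its history table."""
    current: dict = field(default_factory=dict)  # idstructure -> (version, text)
    history: list = field(default_factory=list)  # (idstructure, version, text)

    def save(self, idstructure, text):
        """Store a new version; move the previous one, if any, to history."""
        if idstructure in self.current:
            old_version, old_text = self.current[idstructure]
            self.history.append((idstructure, old_version, old_text))
            version = old_version + 1
        else:
            version = 1
        self.current[idstructure] = (version, text)
        return version
```

Saving a 2D structure and then its 3D conversion under the same identifier leaves the 3D version current and the 2D version retrievable from the history list.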

Quality Assurance

Discrepancies between the structures available in chemical databases present a challenge for AMBIT as a data integration platform. In order to raise awareness of possibly incorrect structures imported from external sources, AMBIT allows quality labels to be assigned to each chemical, as follows:

Manual verification by expert(s). Any user can assign quality labels and explain the reason for the assignment (table quality_structure). The reasons can include discrepancies between registry numbers, names and structure, expert knowledge, manual comparison with external sources, etc.

o 'OK' – The structure is correct.
o 'ProbablyOK' – Most probably the structure is correct, but some issues still need to be verified.
o 'Unknown' – Not possible to assign a definite label.
o 'ProbablyERROR' – Most probably there is an error.
o 'ERROR' – The structure is definitely wrong.

Automatic verification, by comparing the structures available under the same chemical compound entry, e.g. imported from different sources (table quality_chemicals).

o 'Consensus' – All structures under the same chemical compound entry are the same.
o 'Majority' – The majority of structures under the same chemical compound entry are the same, but a small number of structures differ from the majority.
o 'Ambiguous' – There is no majority of equal structures under the same chemical compound entry (e.g. structures come from three different sources and all three structures are different).
o 'Unconfirmed' – The structure comes from a single source and no comparison is possible.
o 'Unknown' – No information about the structure (e.g. no connectivity).
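The automatic labels listed above can be read as a simple voting rule. The sketch below is one possible reading, not AMBIT's implementation: it assumes structures are compared through a canonical string (e.g. unique SMILES or InChI), that entries without connectivity are passed as None and ignored, and that "majority" means strictly more than half.

```python
from collections import Counter


def consensus_label(structures):
    """Label one compound entry from the canonical strings of its structures.

    `structures` holds one canonical identifier per source; None means the
    source supplied no connectivity.
    """
    known = [s for s in structures if s]
    if not known:
        return "Unknown"
    if len(known) == 1:
        return "Unconfirmed"
    most_common_count = Counter(known).most_common(1)[0][1]
    if most_common_count == len(known):
        return "Consensus"
    if most_common_count > len(known) / 2:  # assumed threshold: strict majority
        return "Majority"
    return "Ambiguous"
```

With three sources that all disagree the function returns 'Ambiguous', which matches the example given for that label.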

Potential examples of QA in AMBIT

Automatic comparison of chemical structures from different sources may reveal discrepancies between them, as two examples illustrate. The first is the chemical with CAS 55-55-0, for which the structure provided in the set is incomplete.

In the second example, the structure with CAS 39236-46-9 provided in one of the datasets is erroneous: the ethyl substitution is on the wrong nitrogen in a ring.

Identifiers, Descriptors and Properties


The database schema is designed to provide unified storage for an arbitrary number of text properties (e.g. registry numbers or names) and numerical properties (e.g. descriptors, experimental data). The properties are not predefined but stored in the database on demand, i.e. the AMBIT database is ready to incorporate any number of chemical compounds, identifiers, descriptors and experimental data.

A property (table properties) is identified by a name and a reference, so that properties with coinciding names but originating from different sources can be distinguished (e.g. LogP calculated internally by different methods and LogP imported from an external file). Every newly added property or descriptor is added to the properties table, with information about its name, units, alias and reference. The reference for a property imported from a file is the name of the file itself, while the reference for a descriptor contains the name of the software used for the calculation. The alias usually contains a copy of the name, except when the property is recognised as a specific type of registry number or a chemical name. In this case, the alias is assigned a fixed value (e.g. CasRN or Names).

Fields with the same meaning but different names can be assigned the same alias in order to facilitate queries (e.g. the species field can carry the same alias across all endpoints, so that a single search for species covers them all).
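A small sketch may help to show how the (name, reference) pair and the alias work together. The class and method names are hypothetical, as are the example file and software names; only the rules (identity by name plus reference, alias defaulting to the name, shared aliases for lookup) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    name: str
    reference: str
    units: str
    alias: str


class PropertyRegistry:
    """Toy stand-in for the properties table."""

    def __init__(self):
        self._props = {}  # (name, reference) -> Property

    def add(self, name, reference, units="", alias=None):
        """Register a property on demand; the alias defaults to the name."""
        key = (name, reference)
        if key not in self._props:
            self._props[key] = Property(name, reference, units, alias or name)
        return self._props[key]

    def by_alias(self, alias):
        """Return all properties sharing an alias, whatever their names."""
        return [p for p in self._props.values() if p.alias == alias]


registry = PropertyRegistry()
registry.add("LogP", "hypothetical descriptor software")  # calculated value
registry.add("LogP", "imported_file.sdf")                 # imported value
registry.add("CAS", "imported_file.sdf", alias="CasRN")
registry.add("CAS_NO", "another_file.sdf", alias="CasRN")
```

Here the two LogP entries stay distinct because their references differ, while a search on the alias CasRN finds both registry number fields.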

The flat list of properties provides flexible storage, but presenting a long list of properties and descriptors in the user interface might be confusing. Templates (tables template and template_def) allow properties to be organised in groups (Table 3).

Templates themselves can be organised hierarchically with the help of the table dictionary. The database is distributed with a set of default templates, including the top level templates Endpoints, Identifiers, Datasets and Descriptors, and a number of endpoints according to the ECHA endpoints classification [1]. The convenience view ontology combines the templates with their hierarchical organisation. An excerpt of this view is shown in Table 4.
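The hierarchical organisation of templates can be pictured as a set of is_a relations that are walked upwards or downwards. The sketch below uses a plain dictionary with a handful of relations taken from the default templates; the function names and the dictionary representation are assumptions and do not reflect how the dictionary table or the ontology view are queried in AMBIT.

```python
# child template -> parent template (is_a), a small excerpt
PARENT = {
    "Ecotoxic effects": "Endpoint",
    "Environmental fate parameters": "Endpoint",
    "Toxicity to birds": "Ecotoxic effects",
    "Direct photolysis": "Environmental fate parameters",
    "CAS number": "Identifier",
}


def ancestors(template, parent=PARENT):
    """Follow is_a links upwards, nearest parent first."""
    chain, seen = [], set()
    while template in parent and template not in seen:
        seen.add(template)  # guards against accidental cycles
        template = parent[template]
        chain.append(template)
    return chain


def descendants(root, parent=PARENT):
    """All templates that have `root` somewhere above them."""
    return sorted(t for t in parent if root in ancestors(t, parent))
```

Calling descendants("Endpoint") collects every endpoint template at any depth, which is the kind of grouping a user interface navigator needs.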

The user interface navigator allows properties and templates to be viewed and organised conveniently.

By default, properties imported from a file with chemical compounds belong to the dataset of origin, but they can be moved to any user-selected group.

Quality labels can also be assigned to any property value stored in the database (table quality_labels):

o 'OK' – The value assigned to the property is correct.
o 'ProbablyOK' – Most probably the value is correct, but some issues still need to be verified.
o 'Unknown' – Not possible to assign a definite label.
o 'ProbablyERROR' – Most probably there is an error.
o 'ERROR' – The value assigned to this property is definitely wrong.

Queries

The results of the searches a user performs are stored in the query and query_results tables. Besides providing the ability to record user actions, this enables query results to be browsed at a later moment and queries to be combined with arbitrary logic.
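Since each stored query amounts to a set of structure identifiers, combining queries with arbitrary logic reduces to set algebra. The following sketch shows the idea; the query names, identifiers and the tuple-based expression format are invented for illustration and are not AMBIT's query representation.

```python
# Hypothetical stored results: query name -> set of structure ids
stored = {
    "similar": {1, 2, 3, 5},
    "nitro": {2, 3, 8},
    "label_ok": {1, 2, 8, 9},
}


def evaluate(expr, stored):
    """Evaluate a nested (operator, left, right) expression over stored results."""
    if isinstance(expr, str):
        return set(stored[expr])
    op, left, right = expr
    a, b = evaluate(left, stored), evaluate(right, stored)
    if op == "and":
        return a & b
    if op == "or":
        return a | b
    if op == "and not":
        return a - b
    raise ValueError(f"unknown operator: {op}")


# Structures that are similar AND contain a nitro group, but are NOT labelled OK
hits = evaluate(("and not", ("and", "similar", "nitro"), "label_ok"), stored)
```

Because only stored identifiers are touched, such combinations need no further structure processing.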

Search methods

The available search methods are exact structure, fixed substructure, similarity and SMARTS.

The core substructure search functionality (graph isomorphism) is provided either by the CDK cheminformatics library or by a faster algorithm implemented in AMBIT. Substructure search is a computationally intensive (NP-hard¹) problem, which means that the cost of the algorithm increases rapidly with the size of the molecule. To speed up substructure searching in large datasets, pre-calculated fingerprints are used to identify structures that potentially contain the substructure. The AMBIT database and software combine this technique with fast relational database queries, which results in very fast substructure searching in large datasets. In addition, fingerprints are a standard tool for representing chemical structures in order to assess structural similarity by calculating the Tanimoto coefficient between two fingerprints.
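The pre-screening step can be sketched as a bitwise test: a structure can contain the query only if every bit set in the query fingerprint is also set in its own fingerprint. In this sketch fingerprints are plain Python integers, the database is a dictionary, and `exact_match` is a placeholder for the expensive graph isomorphism test; all of these are assumptions for illustration.

```python
def may_contain(target_fp: int, query_fp: int) -> bool:
    """Screen: all query bits must also be present in the target."""
    return target_fp & query_fp == query_fp


def substructure_search(query_fp, database, exact_match):
    """Run the costly exact test only on structures that pass the screen.

    database: id -> (fingerprint, molecule)
    exact_match: callable(molecule) -> bool, e.g. a subgraph isomorphism test
    """
    hits = []
    for ident, (fp, mol) in database.items():
        if may_contain(fp, query_fp) and exact_match(mol):
            hits.append(ident)
    return hits
```

The screen can yield false positives (a passing structure may still fail the exact test) but no false negatives, which is why the exact test is still required for the candidates.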

Similarity

Fingerprint generation is based on the fingerprint implementation in the open source cheminformatics library The Chemistry Development Kit (CDK) and follows the ideas of the Daylight fingerprint theory [2]: (1) for a given molecule, all possible paths up to a predefined length (default 7) are generated; (2) each path is submitted to a hash function, which uses it as a seed for a pseudo-random number generator; (3) the hash function outputs a set of bits; and (4) the set of bits thus produced is added (with a logical OR) to the fingerprint. AMBIT uses 1024-bit fingerprints by default.

The Tanimoto coefficient is calculated as

$$T(A,B) = \frac{N_{A \cap B}}{N_A + N_B - N_{A \cap B}}$$

where $N_A$ is the number of bits "on" in fingerprint A, $N_B$ is the number of bits "on" in fingerprint B and $N_{A \cap B}$ is the number of bits "on" in both fingerprints. Since the Tanimoto coefficient is a pairwise measure, and here the objective is to assess the similarity to a set of molecules, we generate a consensus fingerprint, which is again a 1024-bit fingerprint where each bit is set.
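The four fingerprinting steps and the Tanimoto coefficient can be sketched without any cheminformatics library. This is not the CDK implementation: the molecule is a bare list of atom symbols with a neighbour table, bond orders are ignored, each path sets exactly one bit, and CRC32 is an arbitrary choice of hash. All of these are simplifying assumptions.

```python
import random
import zlib


def enumerate_paths(atoms, neighbours, max_bonds=7):
    """Step 1: all simple paths of up to max_bonds bonds, as direction-free strings."""
    paths = set()

    def walk(path):
        forward = "-".join(atoms[i] for i in path)
        backward = "-".join(atoms[i] for i in reversed(path))
        paths.add(min(forward, backward))
        if len(path) - 1 < max_bonds:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return paths


def fingerprint(atoms, neighbours, size=1024, max_bonds=7):
    """Steps 2-4: hash each path, seed a generator, OR the resulting bit in."""
    fp = 0
    for path in enumerate_paths(atoms, neighbours, max_bonds):
        seed = zlib.crc32(path.encode("ascii"))
        bit = random.Random(seed).randrange(size)
        fp |= 1 << bit
    return fp


def tanimoto(fp_a, fp_b):
    """N_AB / (N_A + N_B - N_AB) on bit counts; 0.0 when both are empty."""
    n_a = bin(fp_a).count("1")
    n_b = bin(fp_b).count("1")
    n_ab = bin(fp_a & fp_b).count("1")
    union = n_a + n_b - n_ab
    return n_ab / union if union else 0.0


# Ethanol (C-C-O) and ethane (C-C), heavy atoms only
ethanol = fingerprint(["C", "C", "O"], {0: [1], 1: [0, 2], 2: [1]})
ethane = fingerprint(["C", "C"], {0: [1], 1: [0]})
```

Every path of ethane also occurs in ethanol, so all ethane bits are set in the ethanol fingerprint; this is the property that makes the same fingerprints usable for substructure pre-screening.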

¹ NP: non-deterministic polynomial-time


Atom environments (AE) can be regarded as fragments [3, 4] surrounding each atom in a molecule, up to a predefined level. The calculation procedure is as follows. First, the atom types to be included in the generation of AEs are selected. We use 34 atom types, listed in Table 5, which are very similar to the Sybyl atom types recommended by Bender et al. [4]. The choice is based on the atom type parameterization available in the CDK library. Next, a vector of length (34 * L + 1) is constructed for each atom, where L is the maximum level for generating atom environments (L = 3 by default). Third, for each atom, the neighbours at levels 1, 2 and 3 are identified and the corresponding counts are stored in the vector. An example of a string representation of the result for a single atom (C.sp2) is:

C.sp2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1

Note that if a molecule contains several C.sp2 atoms with the same neighbours up to the third level, they will have the same string representation. We will refer to this representation as a "fragment". AEs can be compared by the Tanimoto coefficient (see above), where $N_A$ is the number of fragments in molecule A, $N_B$ is the number of fragments in molecule B and $N_{A \cap B}$ is the number of fragments common to the two molecules. Here, instead of defining a consensus fingerprint, we take the average Tanimoto coefficient over the nearest neighbours: for each molecule, the similarity measure is the average Tanimoto coefficient between the molecule and its 5 nearest molecules.
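The fragment-based similarity can be sketched as set arithmetic followed by a nearest-neighbour average. The sketch treats each molecule as a set of distinct fragment strings and takes "nearest" to mean "highest Tanimoto coefficient"; both are interpretations of the description above, and the function names are hypothetical.

```python
def fragment_tanimoto(frags_a, frags_b):
    """Tanimoto coefficient on sets of atom environment fragments."""
    a, b = set(frags_a), set(frags_b)
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def knn_similarity(query_frags, reference_set, k=5):
    """Average Tanimoto coefficient between a molecule and its k nearest molecules."""
    scores = sorted(
        (fragment_tanimoto(query_frags, ref) for ref in reference_set),
        reverse=True,
    )
    nearest = scores[:k]
    return sum(nearest) / len(nearest) if nearest else 0.0
```

If the reference set holds fewer than k molecules, the average is simply taken over all of them.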

Substructure search

The implementation of substructure searching allows only fixed structures (i.e. no wild cards for atoms and bonds). The structure is drawn using a structure diagram editor (JChemPaint [5]) and submitted as a hydrogen-depleted structure, which might present difficulties in distinguishing certain types of functional groups (e.g. aldehydes vs. carbonyls, amines vs. nitro groups).

AMBIT also allows the database to be queried with the SMILES Arbitrary Target Specification (SMARTS) language [6], accelerated with pre-processed information about structural features stored in the database. The SMARTS specification was originally developed and is maintained by Daylight Inc. [7], but it is supported by many more commercial and open source software suites [8, 9, 10]. A list of predefined functional groups and their SMARTS definitions is available, but formulating more complex queries requires knowledge of SMARTS. The SMARTS line notation allows extremely precise and transparent substructural specification and atom typing. SMARTS expressions for atoms and bonds can be combined by logical operators to form more complex queries. Recursive SMARTS allow detailed specification of an atom's environment. For example, the more reactive (with respect to electrophilic aromatic substitution) ortho and para carbon atoms of phenol can be defined as [$(c1c([OH])cccc1),$(c1ccc([OH])cc1)]. Atoms in an environment where the atom is connected to an aliphatic oxygen and where the atom is connected to two sequential aliphatic carbons can be defined as [$(*O);$(*CC)].


Various queries on properties, inventories and quality labels, and combinations of them, are available. A query can be restricted to search for compounds within a specified dataset or within the results of another query.

3D structure generation

The AMBIT module ambit2-smi23d integrates the open source 3D coordinate generator smi23d, which generates an initial 3D structure from the connectivity matrix [11]. The initial structure is further optimised by OpenMopac 7.1 [12], which is embedded in the ambit2-mopac module.

Molecular descriptors

AMBIT provides facilities to calculate and store descriptors for all chemical structures in the database, as well as to specify search criteria based on descriptor values. The CDK library based descriptors [13] are shown in Table 6. Descriptors implemented in the AMBIT descriptors module are listed in Table 7.

Workflow engine

A workflow engine is a software application that manages and executes modelled business processes. In general, the models can be edited by non-programmers using workflow editors. A workflow model might be as simple as a series of sequential steps, but it can also be complex, including many conditions and loops. The Workflow Management Coalition provides standards for defining workflows in an XML-based format [14].

Following the recognised importance of workflow support in AMBIT, a number of existing open source workflow engines were evaluated for their suitability to be embedded in the AMBIT XT application. The final decision to embed micro-workflow is based on a trade-off between simplicity and available functionality [15]. AMBIT XT is entirely based on the micro-workflow engine, which provides an extensible platform for workflow-based wizards and facilitates the recording of user actions.

Workflow for analogue identification

The workflow consists of the following steps:

1. Definition of the starting structure or set of structures. The structure(s) can be defined as:
o Identifiers (e.g. CAS, EINECS number, name).
o A structure, represented as SMILES, MOL or SDF, drawn manually in the structure diagram editor available in AMBIT, or drawn using externally installed ISIS/Draw software, copied to the system clipboard and then pasted into the AMBIT user interface.
2. Basic analogue search, consisting of a similarity search (by default, hashed fingerprints compared by the Tanimoto coefficient).
3. The results are displayed in the Structure browser. The user can decide to restrict the forthcoming queries to the set of selected structures.
4. Substructure search by a user-defined fragment.
5. The results are displayed in the Structure browser. The user can decide to restrict the forthcoming queries to the set of selected structures.
6. Further filtering of the results by additional compound profiling based on experimental and calculated data (LogKow, Dmax, other 2D and 3D descriptors chosen by the user).
7. The results are displayed in the Structure browser. The user can decide to restrict the forthcoming queries to the set of selected structures.
8. The selected structures are grouped into typical chemical classes or by clustering, allowing the user to inspect small groups of analogues and reach the final decision for the query compound(s).
9. The system proposes to calculate the final value by average, min-max, or Euclidean distance to user-selected properties.
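The last step of the workflow above can be illustrated with a short calculation. The text does not specify how the Euclidean distance is used, so this sketch makes an explicit assumption: the distance is computed in the space of user-selected properties and the endpoint value of the closest analogue is reported alongside the average and the min-max range. Property names and values are invented.

```python
import math
from statistics import mean


def read_across(target, analogues, endpoint, keys):
    """Summarise analogue endpoint values for a target compound.

    target: dict of user-selected property values for the query compound
    analogues: list of dicts holding the same properties plus the endpoint
    keys: names of the properties used for the Euclidean distance
    """
    values = [a[endpoint] for a in analogues]

    def distance(analogue):
        return math.sqrt(sum((analogue[k] - target[k]) ** 2 for k in keys))

    nearest = min(analogues, key=distance)
    return {
        "average": mean(values),
        "min": min(values),
        "max": max(values),
        "nearest": nearest[endpoint],  # assumed use of the Euclidean distance
    }


result = read_across(
    {"LogKow": 2.0},
    [
        {"LogKow": 1.0, "endpoint": 10.0},
        {"LogKow": 2.5, "endpoint": 30.0},
        {"LogKow": 4.0, "endpoint": 50.0},
    ],
    endpoint="endpoint",
    keys=["LogKow"],
)
```

Reporting all three summaries side by side leaves the final choice with the expert, in line with the workflow's step-by-step, user-driven design.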

Workflow for REACH PBT and vPvB (Persistence, Bioaccumulation and Toxicity) assessment

REACH requires a PBT & vPvB assessment for every substance that is to be registered and is not exempted, if the tonnage exceeds 10 tons/year. The REACH PBT & vPvB assessment workflow allows a straightforward, user-friendly and quick assessment if the necessary information is available. An important goal is to rapidly identify those REACH substances that are not PBT or vPvB. In addition, substances identified as potential PBT or vPvB can immediately be investigated in a higher tier assessment to find out what is necessary as a next step. Such higher tier assessments are very often time consuming and costly, and missing the strict registration deadlines because of an ongoing PBT assessment has to be avoided. As the assessment is done transparently and always in the same way, it allows a standardised PBT & vPvB assessment throughout a company, independent of the personal judgement of an assessor. Printing the result sheet, e.g. as a PDF file, allows proper documentation of the PBT & vPvB assessment.

Only organic substances can be assessed. This workflow should not be applied to inorganic or organometallic substances, polymers and mixtures. The PBT assessment is visually organised in five pages: definition of the substance, persistence check, bioaccumulation check, toxicity check and presentation of the final results.

Population of AMBIT database with data

The following datasets are imported and distributed with the AMBIT database:

1. EINECS list
2. Bioconcentration factor dataset [16]
3. ECETOC Aquatic Toxicity data [17]
4. Local Lymph Node Assay (LLNA) data [18]
5. ECETOC Skin irritation data [19]

The data are imported using the standard data import functionality. The EINECS list is publicly available at the ECB site and consists of 100204 chemicals [20]. Extensive verification of the EINECS structures was carried out in order to improve their reliability, based on comparison with structures that have matching registry numbers and are available from public sources. Quality labels have been assigned, as explained above. The bioconcentration factor dataset is distributed without structural information, and chemical compounds are identified only by CAS numbers and chemical names. Structures have been retrieved from publicly available sources and imported into the database. Datasets 3-5 consist of a relatively small number of compounds and presumably contain high quality structures, manually checked by experts before being made publicly available.

WWW - REST services

Web services that allow AMBIT functionality to be used from web applications are under development.

[Figure: similarity example]


Acknowledgements: The AMBIT software was developed within the framework of CEFIC LRI project EEM-9 "Building blocks for a future (Q)SAR decision support system: databases, applicability domain, similarity assessment and structure conversions" and extended under a subsequent CEFIC LRI contract for developing AmbitXT.


Table 1. Modules (module – description)

AmbitXT – GUI application
AmbitXT plugin: Database search and Analogue identification – AmbitXT plugin allowing various database queries and analogue identification
AmbitXT plugin: Category building – AmbitXT plugin for analogue identification
AmbitXT plugin: Database tools – AmbitXT plugin for database import and management
AmbitXT plugin: Database administration – AmbitXT plugin for database administration activities
AmbitXT plugin: REACH PBT assessment – AmbitXT plugin implementing a workflow for REACH compliant Persistence, Bioaccumulation and Toxicity (PBT) assessment
ambit2-base – Base classes, without cheminformatics functionality
ambit2-core – Core classes, with cheminformatics functionality
ambit2-hashcode – Hashcodes
ambit2-smarts – SMARTS parser
ambit2-db – Database functionality
ambit2-smi23d – Wrapper for Smi23d executables, http://www.chembiogrid.org/cheminfo/smi23d/
ambit2-mopac – Wrapper for OpenMopac
ambit2-ui – User interface
ambit2-dbui – Database user interface
ambit2-workflow – Workflow module
ambit2-namestructure – Chemical name to structure converter, based on the OPSIN package, http://sourceforge.net/projects/oscar3-chem/files/
ambit2-model – Similarity calculation, feature selection and QSAR model development
ambit2-taglibs – JSP tags
Pubchem utilities – Pubchem access utilities
Ambit2 REST web services – Allow the AMBIT database to be queried through REST style web services


Table 2. Tables in AMBIT2 database (table – description)

Chemical structures
chemicals – Chemical compounds
structure – Chemical structures, conformers
history – Previous versions of chemical structures

Inventories
src_dataset – Datasets
struc_dataset – Lookup table for structures belonging to a dataset

Identifiers, Descriptors, Properties
catalog_references – References
properties – Property definition (name, reference, units)
property_values – Numerical property values or links to string values
property_string – String values
property_tuples – Tuples of properties
tuples – Tuples per dataset
template – Templates
template_def – Template definition (which properties belong to a template)
dictionary – Templates hierarchy

Queries
query – Queries
query_results – Structures per query
sessions – Sessions

Users support
user_roles – Roles assigned to users
roles – User roles
users – Users

Quality assessment support
quality_chemicals, quality_labels, quality_pair, quality_structure – Quality labels of structures and properties

Pre-processed data for substructure, similarity and SMARTS queries
fp1024, fp1024_struc – Pre-processed fingerprints for pre-screening and similarity search
sk1024 – Pre-processed fragments for accelerating SMARTS searches
atom_distance, atom_structure – Pre-processed data for atom environments similarity

Schema version
version – Database version


Table 3. Templates (each row: template, relationship, parent template)

Endpoints, Identifiers, Dataset, Descriptors: top level templates
Other is_a Endpoint
Ecotoxic effects is_a Endpoint
Toxicokinetics is_a Endpoint
Environmental fate parameters is_a Endpoint
Human health effects is_a Endpoint
Physicochemical effects is_a Endpoint
Short-term toxicity to algae (inhibition of the exponential growth rate) is_a Ecotoxic effects
Toxicity to birds is_a Ecotoxic effects
Direct photolysis is_a Environmental fate parameters
Oxidation is_a Environmental fate parameters
BAF fish is_a Bioaccumulation
BAF other organisms is_a Bioaccumulation
BCF fish is_a Bioconcentration
BCF other organisms is_a Bioconcentration
CAS number is_a Identifier
RSCBook_Skinsens_dataset.sdf is_a Dataset
org.openscience.cdk.qsar.descriptors.molecular.HBondAcceptorCountDescriptor is_a Descriptor
org.openscience.cdk.qsar.descriptors.molecular.HBondDonorCountDescriptor is_a Descriptor
Verhaar scheme is_a Descriptor

Table 4. An excerpt of the ontology view (each row: template, relationship, parent template)

Endpoints, Identifiers, Dataset, Descriptors: top level templates
Other is_a Endpoint
Ecotoxic effects is_a Endpoint
Toxicokinetics is_a Endpoint
Environmental fate parameters is_a Endpoint
Human health effects is_a Endpoint
Physicochemical effects is_a Endpoint
Short-term toxicity to algae (inhibition of the exponential growth rate) is_a Ecotoxic effects
Toxicity to birds is_a Ecotoxic effects
Direct photolysis is_a Environmental fate parameters
Oxidation is_a Environmental fate parameters
BAF fish is_a Bioaccumulation
BAF other organisms is_a Bioaccumulation
BCF fish is_a Bioconcentration
BCF other organisms is_a Bioconcentration
CAS number is_a Identifier
RSCBook_Skinsens_dataset.sdf is_a Dataset
org.openscience.cdk.qsar.descriptors.molecular.HBondAcceptorCountDescriptor is_a Descriptor
org.openscience.cdk.qsar.descriptors.molecular.HBondDonorCountDescriptor is_a Descriptor
Verhaar scheme is_a Descriptor

Table 5. Atom types used to generate atom environments

H C.default N.sp2 P3 F I
Hplus Cplus.sp2 Nplus P4 F- I-
Hminus Cminus.sp2 Nplus.sp3 S2 Cl Misc
C.sp3 Caromatic.sp2 O.sp2 S2- Cl-
C.sp2 Cminus Oplus S4 Br
C.sp N Ominus S Br-


Table 6. The CDK library based descriptors

ALogP and Molar Refractivity
Atomic Polarizabilities
Amino Acids Count
Aromatic Atoms Count
Aromatic Bonds Count
Atom count
BCUT
Bond Polarizabilities
Bond Count
Charged Partial Surface Area
Gravitational Index
Hydrogen Bond Acceptors
Hydrogen Bond Donors
Kier and Hall kappa molecular shape indices
Largest Chain
Largest Pi System
Largest Aliphatic Chain
Moments of Inertia
Petitjean Number
Petitjean Shape Indices
Rotatable Bonds Count
Lipinski's Rule of Five
Topological Polar Surface Area
Vertex adjacency information magnitude
WHIM
Wiener Numbers
XLogP
Zagreb Index


Table 7. AMBIT descriptors

Common functional groups
pKa [22]
Molecule Size (3D)
Molecular weight

Electronic descriptors, calculated by OpenMopac:
EHOMO
ELUMO
TOTAL ENERGY
FINAL HEAT OF FORMATION
IONIZATION POTENTIAL
ELECTRONIC ENERGY
CORE-CORE REPULSION
MOLECULAR WEIGHT

ToxTree classification schemes [21]:
Cramer rules
Extended Cramer rules
Verhaar scheme
Eye irritation rules
Skin irritation rules
Benigni/Bossa rulebase for mutagenicity and carcinogenicity
Structural rules for Michael acceptors
Structure Alerts for the in vivo micronucleus assay in rodents


References

1. http://guidance.echa.europa.eu/docs/guidance_document/information_requirements_r6_en.pdf?%20vers=20_08_08, accessed on 13 June 2009.
2. http://www.daylight.com/dayhtml/doc/theory/theory.finger.html
3. L. Xing, R.C. Glen. J. Chem. Inf. Comput. Sci. 42, 2002, p 796.
4. A. Bender, H.Y. Mussa, R.C. Glen, S. Reiling. J. Chem. Inf. Comput. Sci. 44, 2004, p 170.
5. http://sourceforge.net/apps/mediawiki/cdk/index.php?title=JChemPaint, accessed on 13 June 2009.
6. Daylight SMARTS theory. http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html, accessed on 8 June 2008.
7. Daylight Inc. http://www.daylight.com/, accessed on 8 June 2008.
8. Open Babel SMARTS implementation. http://openbabel.sourceforge.net/wiki/SMARTS, accessed on 8 June 2008.
9. The Chemistry Development Kit SMARTS implementation. http://cdk.sourceforge.net/, accessed on 8 June 2008.
10. JOELIB. http://www-ra.informatik.uni-tuebingen.de/software/joelib/, accessed on 8 June 2008.
11. smi23d - 3D Coordinate Generation. http://www.chembiogrid.org/cheminfo/smi23d, accessed on 8 June 2008.
12. OpenMopac 7.1. http://www.openmopac.net/, accessed on 8 June 2008.
13. A subset of the descriptors listed at http://qsar.sourceforge.net/dicts/qsar-descriptors/index.xhtml, accessed on 13 July 2009.
14. http://www.wfmc.org/standards/docs.htm, accessed on 8 June 2008.
15. http://sourceforge.net/projects/micro-workflow/, accessed on 14 June 2009.
16. Bioconcentration factor (BCF) Gold Standard Database. http://www.euras.be/eng/project.asp?ProjectId=92, accessed on 8 June 2008.
17. ECETOC Aquatic Toxicity (EAT) Database. Supplement to ECETOC Aquatic Hazard Assessment II. Technical Report No. 91, European Centre for Ecotoxicology and Toxicology of Chemicals, Brussels, 2003.
18. G.F. Gerberick, C.A. Ryan, P.S. Kern, H. Schlatter, R.J. Dearman, I. Kimber, G. Patlewicz, D.A. Basketter. Compilation of historical local lymph node assay data for the evaluation of skin sensitization alternatives. Dermatitis 16(4), 2005, pp 157-202.
19. ECETOC Technical Report No. 66, Skin irritation and corrosion Reference Chemicals data base, 1995.
20. http://ecb.jrc.it/qsar/information-sources/, accessed on 8 June 2008.
21. http://toxtree.sourceforge.net, accessed on 13 July 2009.
22. A.C. Lee, J. Yu, G.M. Crippen. pKa Prediction of Monoprotic Small Molecules the SMARTS Way. J. Chem. Inf. Model. 48(10), 2008, pp 2042–2053.