A Substructure Server
Saravanan Anandathiyagar
Project Background Paper, March 2002
Supervisor: Simon Colton
Abstract
Much of the reason for the high cost of medicines is rooted in
the length and complexity of the development and approval
process. At every possible stage of development, it is possible
that a potential drug (leader) will fail to gain approval on the
basis that it produces erratic results or harmful side effects.
Predictive toxicology aims to reduce the money and time spent
by identifying as early on in the drug development process as
possible leaders that are likely to fail. Numerous machine
learning techniques exist to identify such leaders. Here we
present a possible solution based on the Find a maximally
specific hypothesis (Find-S) algorithm. This algorithm, given a
set of positive and negative examples of data, finds
substructures that are statistically true of the majority of
positive compounds, and statistically not true of the negative
compounds.
A discussion of the algorithm and its motivation is presented
here.
Contents
Abstract
1. Introduction
   1.1. Motivation
   1.2. Summary of Report
2. Previous Research
   2.1. Structure-Activity Relationships
   2.2. Attribute-based representations
   2.3. Relational-based representations
   2.4. Inductive logic programming
3. The Find-S Technique
   3.1. Motivation
   3.2. General-to-specific ordering of hypotheses
   3.3. The Find-S algorithm
   3.4. Algorithm evaluation methods
   3.5. Issues with the Find-S technique
   3.6. Existing Prolog implementation
4. Implementation Considerations
   4.1. Representing structures
   4.2. Improvement of current implementation
   4.3. Extensions
5. References
1. Introduction
1.1. Motivation
Each year, drug companies release new and improved drugs, claiming that
they produce better results with fewer side effects. However, the cost of
such advances in the drug industry is not small. Developing a drug from the
theoretical stage to its appearance on pharmacy shelves normally takes in the
region of 10 to 15 years, at an average cost of over £500 million [1].
This outlay by the drug company must be covered by the consumer for the
company to remain in profit, and evidence of this can be seen, for example,
in the regular rise of NHS prescription charges.
Much of the reason for the high cost of medicines is rooted in the length
and complexity of the development and approval process. At every possible
stage of development, it is possible that a potential drug (leader) will fail to
gain approval on the basis that it produces erratic results or harmful side
effects. Even after promising lab tests, further experiments on animal
specimens often return ideas to the drawing board. It is estimated that for
every one drug that reaches clinical (human) trial stage, another 1000 have
failed earlier testing.
Despite this, it is important to note that medicines still reduce overall
medical care costs by reducing even more expensive hospitalisation,
surgery or other treatments. Drugs are the primary way of controlling the
outcomes of chronic illness. Therefore, the development of new drugs is
important for both patient care and for the positive long-term financial
implications.
It is clear that reducing the number of drug leaders developed at an early
stage will have a significant effect in limiting development costs.
Determining at an early stage that a leader is unsuitable for further testing
saves the investment that may otherwise have been spent on this drug, only
for the same conclusion to be reached. For this reason, the field of
predictive toxicology was born. It is an effort on the part of biotechnology
companies to predict in advance whether or not a drug will be toxic, using
various techniques learnt from the fields of statistics, artificial intelligence
(AI), and machine learning.
Negative effects of a drug can range from relatively minor problems such as
headaches and stomach upsets, to potentially life-threatening organ
damage. While many accepted drugs do produce some side effects for some
patients, the value of the treatment is always said to outweigh the side
effects. However, there are certain characteristics of chemical compounds
that will limit their effectiveness as a drug. Predictive toxicology aims to
find this drug toxicity while still in the planning stages. Ruling out a leader
at this early stage saves it being synthesised and tested, and allows
resources to be focused on more promising areas of research.
Machine learning programs in a variety of different guises have been used
to try and discover the reasons why certain chemicals are toxic and others
are not. Essentially, they learn a concept that is true of the toxic drugs and
false for other non-toxic drugs. These derived concepts are usually small
(around five or six atoms) sub-structures of the larger drug molecule, where
some of the atoms are fixed elements and others may vary.
The task in hand is to effectively and efficiently identify such sub-structures
using the Find Maximally Specific Hypothesis (FIND-S) machine learning
algorithm. An implementation of the algorithm has been written in PROLOG
by S Colton; our work here is based on extending this implementation and
producing a web-based server application.
A molecule is said to be positive if it contains the sub-structure in question.
Conversely, it is said to be negative if it does not. The application will
return interesting substructures given positive and negative molecules,
whereby the substructure is true of statistically significantly more positives
than negatives.
1.2. Summary of Report
This report is an overview of the research undertaken, with an outline of
how implementation of a Substructure Server may proceed. Section 2
summarises the machine learning techniques used in the field of predictive
toxicology, and introduces the concepts of attribute-based and
relational-based structure-activity relationships.
Section 3 is a comprehensive overview of the Find-S algorithm, with an
emphasis on how it may perform in a predictive toxicology situation. A
fictional example is presented and analysed which demonstrates the key
methodologies of the technique. Evaluation techniques applicable to both
the algorithm itself and to the results it produces are outlined, as well as
various considerations that should be addressed on implementation. S
Colton’s existing Prolog implementation of the algorithm is also discussed.
Section 4 highlights some implementation considerations, suggesting a
possible course of action towards building a substructure server available
for public use.
2. Previous Research
As was mentioned above, machine learning algorithms to find relevant sub-
structures have been applied in the field of predictive toxicology. It is important
to understand the approaches that have been taken in previous work, using them
as a basis for further study.
The key features of the background study undertaken are summarised in this
section.
2.1. Structure-Activity Relationships
A structure-activity relationship (SAR) models the relationship between
activities and physicochemical properties of a set of compounds [2]. The
goal of our work is essentially to form SARs from the given input molecules.
These resultant SARs represent the substructures most likely to contribute to
toxicity, as calculated by our algorithm.
A SAR is derived from two components:
 - the learning algorithm employed during derivation, and
 - the choice of representation used to describe the chemical structure of the
   compounds being considered.
The learning algorithm used will rule out possible choices of representation,
as the latter has to be rich enough to support the algorithm’s procedure.
SARs can store different information about compounds, and typically such
information (attributes) could consist of any of the following chemical
properties [5]:
Partial atomic charges
Surface area
Volume
H_bond donors/acceptors
ClogP
CMR
pKa, pKb
Hansch parameters (π, σ)
F
Molecular grids
Polarisability
The exact nature or meaning of each attribute type need not be discussed
here. It is however important to note that there are any number of ways of
representing a compound, using any combination of the attributes given
above (and more).
2.2. Attribute-based representations
A large variety of learning techniques are in use that derive SARs of
different forms. The majority of these are based on examining the types of
attributes listed above. A short summary of a few of these techniques is
presented here.
2.2.1. Linear and Partial least-squares regression
Linear regression was the first learning algorithm employed in
predictive toxicology, as detailed by Hansch et al. [3]. “Training” the
system involves providing suitable training examples, which are
simply saved to memory without being interpreted or compared in any
way. It is on this stored information (as explicitly provided by the
user) that regression aims to approximate its target function.
In the context of predictive toxicology, this would involve supplying
examples of positive compounds as training data. The procedure then
run on a new compound would invoke a set of similar compounds
being retrieved from the stored values, and use this to classify the
new compound. The analysis of the compounds is based on chemical
attributes as specified by the algorithm; Hansch used global chemical
properties of the molecule (LogP and σ).
Least-squares regression is another learning technique involving the
relationship between chemical attributes. Visually it essentially entails
forming a ‘line of best fit’ for a set of training data plotted against two
variables y and x, where x and y are two chemical attributes. For any
new compound encountered, a plot is made of the same two
attributes; if the point produced lies within a fixed bound of the line of
best fit, then the new compound can be deemed positive. The system
can be extended to include multiple independent variables, and to give
each variable a different weight – a measure of how important each
attribute is relative to the others.
It is important to note that both these techniques make no attempt to
interpret the training data as it is fed to them; all the processing of
determining suitability criteria for new compounds happens only once
the new compound has been encountered.
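The line-of-best-fit procedure described above can be sketched as follows. This is a minimal illustration, not code from the report: the training points, the two attributes, and the bound are hypothetical stand-ins.

```python
# Ordinary least-squares fit of y = m*x + c over (x, y) training points,
# then classification of a new compound by distance from the fitted line.

def fit_line(points):
    """Return slope m and intercept c of the least-squares line."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    c = (sy - m * sx) / n
    return m, c

def classify(point, m, c, bound=1.0):
    """Deem a compound positive if it lies within `bound` of the line."""
    x, y = point
    return abs(y - (m * x + c)) <= bound

# Hypothetical training data plotted against two attributes x and y.
training = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
m, c = fit_line(training)
print(classify((5.0, 10.1), m, c))  # near the line -> True (positive)
print(classify((5.0, 20.0), m, c))  # far from the line -> False (negative)
```

The fixed bound plays the role of the acceptance criterion; a weighted multi-variable extension would replace the single x with a vector of attributes.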
2.2.2. Decision trees
Decision trees classify the training data by considering each
<attribute, value> pair (tuple) for a given compound [4]. Each node in
the tree specifies a test of a particular attribute, and each branch
descending from that node corresponds to a possible value for that
attribute. A compound is classified as positive or negative at the leaf
nodes of the graph.
New compounds are classified by comparing their attribute values to
ones stored from the training data. An implementation of this
algorithm needs to address the critical issue of which attribute(s) to
perform the test on. This decision could crucially alter the
classification schema, and is a problem inherent in trying to separate
objects into discrete sets when their behaviour or identity is given by
a number of attributes. It is possible that two attribute values could
contradict each other under a particular classification scheme, and it
then becomes necessary to impose some ordering or priority system over
the attributes.
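The node-and-branch scheme just described can be sketched as a nested structure. The attribute names and values below are hypothetical placeholders, not real compound data.

```python
# A minimal attribute-test decision tree: each internal node tests one
# attribute, each branch is a possible value, each leaf is a classification.
tree = {
    "attribute": "ClogP",
    "branches": {
        "high": {"attribute": "H_bond_donors",
                 "branches": {"many": "positive", "few": "negative"}},
        "low": "negative",
    },
}

def classify(node, compound):
    """Walk the tree, following the branch matching each tested attribute."""
    if isinstance(node, str):        # leaf node: classification reached
        return node
    value = compound[node["attribute"]]
    return classify(node["branches"][value], compound)

print(classify(tree, {"ClogP": "high", "H_bond_donors": "many"}))  # positive
print(classify(tree, {"ClogP": "low"}))                            # negative
```

Which attribute is tested at the root is exactly the "which attribute(s) to perform the test on" decision discussed above.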
2.2.3. Neural networks
Artificial Neural Networks (ANNs) provide a general and practical
method for learning functions from examples [4], and have
widespread use in AI applications. Predictive toxicology lends itself to
the use of ANNs because of how compound attributes can be treated
as <attribute, value> tuples, in a manner similar to that discussed in
section 2.2.2 above. A compound can be represented by a list of such
tuples covering the full range of attributes.
The simplest form of ANN system is based on perceptrons, which take
the list of tuples and calculate a ‘score’ for the compound. This
score is calculated from a combination of the input tuples, and a
weight associated with each attribute. The algorithm can learn from
the training data by considering the attributes of positive compounds,
and can then classify unknown compounds as positive or negative,
depending on the score calculated being higher than a defined
threshold.
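The perceptron scoring just described can be sketched as follows; the attribute names, weights, and threshold are illustrative placeholders, not values from any real SAR.

```python
# Perceptron-style scoring of a compound given as <attribute, value> tuples.

def score(compound, weights):
    """Weighted sum over the compound's <attribute, value> tuples."""
    return sum(weights[attr] * value for attr, value in compound)

def classify(compound, weights, threshold=1.0):
    """Positive when the score exceeds the defined threshold."""
    return score(compound, weights) > threshold

weights = {"surface_area": 0.5, "volume": -0.2, "polarisability": 0.8}
compound = [("surface_area", 3.0), ("volume", 2.0), ("polarisability", 0.5)]
print(score(compound, weights))      # weighted sum, about 1.5
print(classify(compound, weights))   # True: score exceeds the threshold
```

Learning amounts to adjusting the weights from the training data; a backpropagation network replaces this single weighted sum with layered, non-linear combinations.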
Practical ANN systems usually implement the more advanced
backpropagation algorithm, which learns the weights for a network of
neural nodes over multiple layers. However, the principle is the same as
that used in the perceptron algorithm, with the compound score being
calculated in a non-linear manner taking into account more variables.
2.3. Relational-based representations
The techniques mentioned above for deriving SARs all share one key
concept: they are all based on attributes of the object (in our case, the
chemical compound being examined). These attributes can be considered to
be global properties of these molecules; for example, the molecular grid
attribute maps points in space, which are global properties of the
coordinate system used. The tuple of attributes used to represent the
properties of the molecule is not an ideal format; it is difficult
to efficiently map atoms and their bonds onto a linear list.
A more general way to describe objects is to use relations. In a relational
description the basic elements are substructures and their associations [2].
This allows the spatial representation of the atoms within the molecule to
be represented more accurately, directly and efficiently.
2.4. Inductive logic programming
Fully relational descriptions were first used in SARs with the inductive logic
programming (ILP) learning technique, as shown in [6]. ILP algorithms are
designed to learn from training examples encoded as logical relations. ILP
has been shown to significantly outperform the feature (attribute) based
induction methods described above [7].
ILP for SARs can be based on knowledge of atoms and their bond
connectives within a molecule. Using this scheme has a number of benefits:
 - it is simple, powerful, and can be applied generally to any SAR;
 - it is particularly well suited to forming SARs dependent on the
   relationship between the atoms in space (shape);
 - chemists can easily understand and interpret the resultant SARs, as
   they are familiar with relating chemical properties to groups of atoms.
The formal difference between the descriptive properties of attribute and
relational SARs corresponds to the difference between propositional and
first-order logic [2]. ILP involves learning a set of “if-then” rules for a
training set, which can then be applied to unseen examples. Sets of first-
order Horn clauses can be constructed to represent the given data rules,
and these can be interpreted in the logic programming language PROLOG.
ILP differs from the attribute based techniques in two key areas. ILP can
learn first-order rules that contain variables, whereas the earlier algorithms
can only accept finite ground terms for attribute values. Further, ILP
sequentially examines the data set, learning one rule at a time to
incrementally grow the final set of rules.
We stated above that relational SARs can be described by first-order
predicate logic. The PROGOL algorithm was developed [8] to allow the
bottom-up induction of Horn clauses, and is implemented in PROLOG.
PROGOL uses inverse entailment to generalise a set of positive examples
(active compounds) with respect to some background knowledge – atom
and bond structure data, given in the form of Prolog facts. PROGOL will
construct a set of “if-then” rules which explain the positive (and negative)
examples given.
In the case of predictive toxicology, these rules generally specify a sub-
molecular structure of around five or six atoms. These structures are those
that have been calculated to contribute to toxicity, based on their presence
in the set of positive training examples, and their non-presence in the set of
negative training examples.
These sub-structures can then be matched with components of unseen
compounds in an attempt to predict toxicity.
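Matching a learned sub-structure against an unseen compound can be sketched as below. This is a hedged illustration only: the bond-list encoding, the atom identifiers, and the element names are assumptions, not the report's (or PROGOL's) actual representation.

```python
# Match a three-atom substructure pattern (with '?' as a variable position)
# against every three-atom path in a molecule given as a bond list.

def chains(bonds):
    """Enumerate every three-atom path a-b-c over an undirected bond list."""
    adj = {}
    for a, b in bonds:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    for b in adj:
        for a in adj[b]:
            for c in adj[b]:
                if a != c:
                    yield (a, b, c)

def matches(pattern, triple, elements):
    """'?' matches any element; a named element must agree exactly."""
    return all(p == "?" or elements[atom] == p
               for p, atom in zip(pattern, triple))

# Hypothetical molecule: atom ids mapped to elements, plus its bonds.
elements = {1: "N", 2: "C", 3: "O", 4: "C"}
bonds = [(1, 2), (2, 3), (3, 4)]

pattern = ("N", "?", "O")   # a learned rule with one variable position
found = [t for t in chains(bonds) if matches(pattern, t, elements)]
print(found)  # [(1, 2, 3)]
```

A compound with at least one matching chain would be predicted toxic under the learned rule.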
3. The Find-S Technique
3.1. Motivation
As mentioned previously, the focus of this research topic is to use the Find-
S algorithm, as described below, to identify the sub-structures discussed at
the end of section 2.4. Within the scope of predictive toxicology, it may
appear that Find-S and ILP do the same thing; however, this is not the
case. The Find-S technique differs from ILP in the motivation
behind the process. ILP looks for concepts that are true for positive
examples and false for negative examples, and produces a sub-molecular
structure as a result. The Find-S procedure, on the other hand, is given a
template (by the user) to guide its search, and the program looks for all
possibilities of the general shape in the positive inputs.
3.2. General-to-specific ordering of hypotheses
Any given problem has a predefined space of potential hypotheses [4],
which we shall denote H. Consider a target concept T, whose truth value (1
or 0) depends upon the values of three attributes, a1, a2, and a3. Each
attribute a1, a2, or a3 can take a range of discrete values, some combinations
of which will make T true, others will make T false. We denote the value x of
an attribute an as v(an) = x.
We can let each hypothesis consist of a conjunction of constraints on the
attributes, i.e. take the list of attribute values for that particular instance of
the problem. This list of attributes (of length three in this case) can be held
in a vector. For each attribute an, the value v(an) will take one of the
following forms:
? – indicating that any value is acceptable for this attribute
∅ – indicating that no value is acceptable for this attribute
a single required value for the attribute, e.g. for an attribute ‘day of
week’, acceptable values would be ‘Monday’, ‘Tuesday’ etc.
With this notation, the most general hypothesis for T is
<?, ?, ?>
which states that any assignment to any of the three attributes will result in
the hypothesis being satisfied. Conversely, the most specific hypothesis
for T is
<∅, ∅, ∅>
which states that no assignment to any of the variables will ever satisfy the
hypothesis.
All hypotheses within H can be represented in this way, with the majority
falling somewhere between the two above extremes of generality. Indeed,
hypotheses can be ordered on their generality, from most general to most
specific instances. For example, consider the following two possible
hypotheses for T:
h1 = <x, ?, y>
h2 = <?, ?, y>
Considering the two sets of instances that are classified positive by the two
hypotheses, we can say that any instance classified positive by h1 will also
be classified positive by h2, as h2 imposes fewer constraints. We say that h2
is more general than h1.
Formally, for two hypotheses hj and hk, we can define hj to be more general
than or equal to hk (written hj ≥g hk) if and only if
(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]
Further, we can define hj to be (strictly) more general than hk (written hj >g
hk) if and only if
(hj ≥g hk) ∧ (hk ≱g hj)
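The ≥g ordering can be sketched for attribute-vector hypotheses in the ?/∅/value notation; the attribute values ‘x’ and ‘y’ follow the example above, and the constraint-by-constraint test is a simplifying assumption that suffices for conjunctive hypotheses.

```python
# The more-general-than-or-equal-to (>=g) relation over hypothesis vectors.

EMPTY = "∅"   # the 'no value acceptable' constraint

def at_least_as_general(cj, ck):
    """Is one constraint of hj at least as permissive as hk's constraint?"""
    return cj == "?" or cj == ck or ck == EMPTY

def more_general_or_equal(hj, hk):
    """hj >=g hk: every instance hk classifies positive, hj does too."""
    return all(at_least_as_general(cj, ck) for cj, ck in zip(hj, hk))

h1 = ("x", "?", "y")
h2 = ("?", "?", "y")
print(more_general_or_equal(h2, h1))  # True: h2 imposes fewer constraints
print(more_general_or_equal(h1, h2))  # False: h1 is strictly less general
```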
3.3. The Find-S algorithm
The Find-S technique orders hypotheses according to their generality as
explained in the previous section. The algorithm then starts with the most
specific hypothesis h possible within H. For each positive example it
encounters in the training set, it generalises h (if needed) so that h now
correctly classifies the encountered example as positive. After considering
all positive training examples, the resultant h is output. This is the most
specific hypothesis in H consistent with the examined positive examples.
The algorithm can be more formally defined as follows [4]:
1. Initialise h to the most specific hypothesis in H.
2. For each positive training instance x
For each v(ai) in h
If v(ai) is satisfied by x
Then do nothing
Else replace ai in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h
The procedure is run with a different starting positive each time until all
positives have been analysed. There is a question over how to measure how
specific a particular hypothesis is. This is dependent on the representation
scheme, but in first-order logic, for example, a more specific hypothesis will
have more ground terms (fewer variables) in the logic sentence describing
it than a less specific hypothesis.
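The three numbered steps above can be sketched for attribute-vector hypotheses. This is a minimal illustration under the assumption that the next more general constraint is ‘?’; the example values are hypothetical.

```python
# Find-S over attribute-vector hypotheses in the ?/∅/value notation.

EMPTY = "∅"

def find_s(positives):
    """Return the most specific hypothesis consistent with all positives."""
    h = [EMPTY] * len(positives[0])          # 1. most specific hypothesis
    for x in positives:                      # 2. each positive instance
        for i, constraint in enumerate(h):
            if constraint == EMPTY:
                h[i] = x[i]                  # first positive fixes the value
            elif constraint != x[i]:
                h[i] = "?"                   # generalise a mismatching slot
    return tuple(h)                          # 3. output hypothesis h

positives = [("a", "b", "c"), ("a", "d", "c")]
print(find_s(positives))  # ('a', '?', 'c')
```

Note that the negatives never appear: the procedure generalises only to cover positives, which is exactly the property discussed under noisy data in section 3.5.3.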
3.3.1. A simple example
An example to illustrate how the algorithm could be used in predictive
toxicology is presented below. It has been adapted from [9], and is
fabricated in that the derived structure is not a real indicator of
toxicity. The example simply illustrates the algorithm's process.
Training Data
Consider the training set of seven drugs, four of which are known
positives, and the remaining three known negatives. Diagrams of
these molecules are given below, with molecules P1, P2, P3 and P4
representing positive examples, and N1, N2 and N3 representing
negative ones. The atom labels are used in place of possible real
elements (e.g. N, C, H etc.) to enforce the notion that the example is
purely fabricated.
At this stage, the chemist (user) must suggest a possible template on
which to base the search for toxicity-inducing substructures. It is
thought that a substructure of the form
ATOM – ATOM – ATOM
(with – representing a bond) contributes to toxicity. It is now the
task of the algorithm to find sub-molecules matching the structure
given above which exist in as many positives, and as few negatives,
as possible.
The Algorithm Procedure
To solve the problem, we use the Find-S method with the aim of
producing solutions of the form
<A, B, C>
[Figure 1: Training set for Find-S example – diagrams of positive molecules P1, P2, P3, P4 and negative molecules N1, N2, N3]
where A, B and C are taken from the set of chemical symbols present
in the molecules. However, we also need to look for general solutions
where an atom in a particular position is not fixed. We therefore
append ? to this set of symbols.
We start off with the most specific hypothesis possible. Any final
concept learned will have to be true of at least one positive example.
We use this to produce our first set of triples: the two substructures
that exist in P1 and match the template specified.
We now check whether each of these substructures is true in the next
molecule (P2). If they are not, then we generalise the substructure
such that it becomes true in P2. This generalisation is done by
introducing as few variables as possible. In doing this, we find the
least general generalisations, which then guarantees that our final
answers are as specific as possible. This expanded set of
substructures is then tested on P3, and following the same procedure,
on P4.
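The generalisation step used on each new molecule – introducing as few variables as possible – can be sketched as follows. The lower-case letters are hypothetical stand-ins for the example's placeholder atom labels.

```python
# Least-general generalisation of two three-atom substructures:
# keep agreeing atoms fixed, introduce '?' only where they differ.

def lgg(triple_a, triple_b):
    """Generalise only in the positions where the two triples disagree."""
    return tuple(a if a == b else "?" for a, b in zip(triple_a, triple_b))

print(lgg(("a", "b", "c"), ("a", "d", "c")))  # ('a', '?', 'c')
print(lgg(("a", "b", "c"), ("a", "b", "c")))  # ('a', 'b', 'c') – unchanged
```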
A trace of the intermediate results produced is shown here:

[Trace table: the set of candidate substructures held after analysing each molecule P1, P2, P3 and P4 in turn, with previously derived substructures shown on a greyed-out background.]

Note that no new substructures are produced on analysis of P4 – all the
substructures produced after analysis of P3 match exactly components of
P4 without the need for generalisation.
Evaluation of Results
So the algorithm has now returned nine possible hypotheses for
substructures that determine toxicity. These can now be scored, based
on
How many positive molecules contain the substructure derived
How many negatives do not contain the substructure derived
A calculation of scores is given below, counting the correctly classified
positives (P1–P4) and correctly classified negatives (N1–N3) for each
hypothesis; a fixed atom position is written as · and a variable position
as ?:

Hypothesis       Accuracy
1  <·, ·, ·>     43%
2  <·, ·, ·>     57%
3  <·, ·, ?>     57%
4  <·, ?, ·>     86%
5  <?, ·, ·>     57%
6  <?, ·, ?>     57%
7  <·, ?, ?>     43%
8  <·, ?, ?>     57%
9  <?, ?, ·>     57%
It can be seen that the most accurate hypothesis derived is number
four, in which the first and third atoms are fixed and the middle atom
is a variable. This is statistically the most frequent substructure (of
the form ATOM – ATOM – ATOM) that occurs in the positives, but
not in the negatives. This structure can then be used to predict the
toxicity of unseen compounds; other compounds containing a match
for hypothesis four are statistically likely to be toxic.
For a complete implementation of the algorithm, the procedure should
be repeated, but this time with P2 as the initial positive, and
generalising on the others. The same should be applied for P3 and P4
as initial positives.
3.4. Algorithm evaluation methods
On obtaining a ‘result’ from the Find-S algorithm, i.e. a hypothesis (or set of
hypotheses) representing a sub-molecule thought most likely to contribute
to toxicity, it is desirable to have some certainty that the result obtained is
indeed accurate. We want the promising results obtained with the training
set to be extended to unseen examples. There is no way to guarantee the
accuracy of a hypothesis, however there are accepted methods and
measures through which a user can become more confident in the results
obtained.
In our example above, the ‘best’ hypothesis had a (predicted) accuracy of
86%, calculated by considering the number of correctly classified positives
and negatives, over the total number of compounds analysed. However, this
figure is based purely on the examples that the hypothesis has already seen;
it is not a strong indicator of accuracy for unseen examples.
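The accuracy figure just described can be sketched as follows; the `contains()` test and the toy molecule encoding are hypothetical stand-ins for the real substructure match.

```python
# Accuracy = (correctly classified positives + correctly classified
# negatives) / total number of compounds analysed.

def accuracy(hypothesis, positives, negatives, contains):
    correct = sum(1 for m in positives if contains(m, hypothesis))
    correct += sum(1 for m in negatives if not contains(m, hypothesis))
    return correct / (len(positives) + len(negatives))

# Toy stand-in: a 'molecule' is a set of triples, containment is membership.
def contains(molecule, hypothesis):
    return hypothesis in molecule

h = ("A", "?", "B")
positives = [{h}, {h}, {h}, {h}]    # the hypothesis holds in all 4 positives
negatives = [set(), set(), {h}]     # and in 1 of the 3 negatives
print(round(accuracy(h, positives, negatives, contains), 2))  # 0.86
```

With four covered positives and two excluded negatives out of seven compounds, this reproduces the 6/7 ≈ 86% figure of hypothesis four above.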
3.4.1. Cross validation
One possible way of addressing this situation is to reserve some
examples from the training set, and then subsequently use these
reserved examples as tests on the derived hypothesis. The results of
the hypothesis applied to the reserved examples can then be
compared to their actual categorisation, which is known as they were
provided as part of the training set. This cross validation is a standard
machine learning technique, and the splitting of initial example data
into a training set and test set can give the user more confidence that
the derived hypothesis will be accurate and of use. Clearly, it can have
the opposite effect, with a user finding out that the derived hypothesis
in fact performs poorly on genuinely unseen examples.
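The holdout scheme above can be sketched as a simple split of the example data; the 30% test fraction and the fixed seed are illustrative choices, not values from the report.

```python
# Reserve a fraction of the labelled examples as a test set before any
# hypothesis is derived from the remainder.

import random

def holdout_split(examples, test_fraction=0.3, seed=0):
    """Shuffle the examples, then reserve a fraction of them for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

examples = list(range(10))            # placeholder for labelled compounds
train, test = holdout_split(examples)
print(len(train), len(test))          # 7 3
```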
3.4.2. K-fold cross validation
It is often of importance and interest that the performance of the
learning algorithm itself is measured, and not just a specific
hypothesis. A technique to achieve this is k-fold cross validation [4].
This involves partitioning the data into k disjoint subsets, each of
equal size. There are then k training and testing rounds, with each
subset successively acting as a test set, and the other k-1 sets as
training sets. The average accuracy rate can then be calculated from
each independent test run. This technique is typically used when the
number of data objects is in the region of a few hundred, and the size
of each subset is at least thirty. This ensures that the tests provide
reasonable results, as having too few test examples would result in
skewed accuracy figures.
As each round is performed independently, there is no guarantee that
the hypothesis generated on one training round will be the same as
the hypothesis generated on another. It is for this reason that the
overall accuracy figures generated are representative of the algorithm
as a whole, not just one particular result.
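The k-fold procedure above can be sketched as follows; `learn` and `evaluate` are hypothetical stand-ins for Find-S and the accuracy measure, and the toy run uses dummy data.

```python
# k-fold cross validation: partition into k disjoint, near-equal folds;
# each fold acts once as the test set while the rest form the training set.

def k_fold_accuracy(examples, k, learn, evaluate):
    """Average accuracy over k independent training/testing rounds."""
    folds = [examples[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(evaluate(learn(train), test))
    return sum(scores) / k

# Toy run: 'learning' returns the training mean, evaluation is a dummy score.
learn = lambda train: sum(train) / len(train)
evaluate = lambda model, test: 1.0  # placeholder accuracy per round
print(k_fold_accuracy(list(range(12)), 3, learn, evaluate))  # 1.0
```

Because each round is independent, the averaged figure characterises the algorithm rather than any single derived hypothesis, as the text notes.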
3.5. Issues with the Find-S technique
As with all machine learning techniques, Find-S has some factors to
encourage its use, and others that make it less favourable. Some of these
considerations are discussed here.
3.5.1. Guarantee of finding most specific hypothesis
As the name of the algorithm suggests, the process is guaranteed to
find the most specific hypothesis consistent with the positive training
examples, within the hypothesis space. This is because of the
decisions made to select the least general generalisations when
analysing compounds.
This property can be viewed as being both advantageous and
disadvantageous. It is sometimes useful for users to know as much
information about the substructure as possible, and this may enable
them to better understand the chemical reason for the molecule’s
toxicity. However, in the case of an example deriving multiple
hypotheses consistent with the tracing data, the algorithm would still
return the most specific, even thought the others have the same
statistical accuracy.
Further, it is possible that the process derives several maximally
specific consistent hypotheses [4]. To account for this possible case,
we need to extend the algorithm to allow backtracking at choice
points for generalisation. This would find target concepts along a
different branch to that first explored.
3.5.2. Overfitting
Overfitting is often thought of as the problem of an algorithm
memorising answers rather than deducing concepts and rules from the
data, and is inherent in many machine learning techniques. A
particular hypothesis is said to overfit the training examples when
some other hypothesis that fits the training examples less well,
actually performs better over the whole set of instances (i.e. including
non-training set instances).
Overfitting can occur when the number of training examples used is
too small and does not provide an illustrative sample of the true target
function. It can also occur when there are errors in the example data,
known as noise. Noise has a particularly detrimental effect on the
Find-S algorithm, as explained below.
3.5.3. Noisy data
Any non-trivial set of data taken from the real world is subject to a
degree of error in its representation. Mistakes can be made in
analysing the data and categorising examples, and in translating
information from one form to another; repeated data may also be
inconsistent with itself. In machine learning terms, such errors in the data are termed
noise.
While certain algorithms are fairly robust to noise in data, the Find-S
technique is inherently not so. This is because the algorithm
effectively ignores all negative examples in the training examples.
Generalisations are made to include as many positive examples as
possible, but no attempt is made to exclude negatives. This in itself is
not a problem; if the data contains no errors, then the current
hypothesis can never require a revision in response to a negative
example [4]. However, the introduction of noise into the data changes
this situation. It may no longer be the case that the negative examples
can simply be ignored. Find-S makes no effort to accommodate these
possible inconsistencies in the data.
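The point that negatives are never consulted can be seen directly in a minimal sketch of the technique. Here, as an illustrative assumption, compounds are flattened to feature sets and the least general generalisation is set intersection; the real hypothesis space is first-order substructures, but the shape of the algorithm is the same.

```python
def find_s(positives, negatives):
    """Find-S over a feature-set hypothesis space: start with the first
    positive and generalise minimally to cover each further positive.
    `negatives` is never consulted: with noise-free data the hypothesis
    never needs revising against a negative, but a single mislabelled
    example silently corrupts the result."""
    hypothesis = set(positives[0])
    for example in positives[1:]:
        hypothesis &= set(example)   # least generalisation covering both
    return hypothesis

# Toy compounds, each described by a set of hypothetical fragments.
positives = [
    {"benzene_ring", "nitro_group", "carbonyl"},
    {"benzene_ring", "nitro_group", "hydroxyl"},
]
negatives = [
    {"benzene_ring", "hydroxyl"},   # present, but ignored by the algorithm
]

print(sorted(find_s(positives, negatives)))   # ['benzene_ring', 'nitro_group']
```

Note that the returned substructure overlaps a negative compound (it contains `benzene_ring`); nothing in the algorithm prevents this, which is exactly the fragility under noise described above.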
3.5.4. Parallelisability
The Find-S algorithm lends itself well to a parallel distributed
implementation, which would speed up computation. A parallel
implementation could involve individual processors being allocated
different initial positives; recall above that the algorithm is only
complete when hypotheses have been derived using each possible
start positive. The derivation of any particular hypothesis from an
initial positive can be run independently, and hence can be run in
parallel with other derivations.
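One derivation per initial positive can be distributed across a worker pool, as sketched below. The feature-set representation and intersection generalisation are the same illustrative assumptions as before; a thread pool is used here for simplicity, though a process pool would avoid Python's GIL for a real CPU-bound speed-up. (With plain intersection every starting point happens to coincide; in the real algorithm the generalisation choices are order-dependent, which is why every start must be tried.)

```python
from concurrent.futures import ThreadPoolExecutor

def derive_from(start, positives):
    """Derive one hypothesis beginning from the chosen initial positive.
    Each derivation depends only on its own starting point, so the full
    set of derivations can run concurrently."""
    ordered = [positives[start]] + positives[:start] + positives[start + 1:]
    hypothesis = set(ordered[0])
    for example in ordered[1:]:
        hypothesis &= set(example)   # least generalisation, as in Find-S
    return frozenset(hypothesis)

def parallel_find_s(positives, workers=4):
    """One task per possible initial positive; collect the distinct
    hypotheses derived across all starting points."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(derive_from, i, positives)
                   for i in range(len(positives))]
        return {f.result() for f in futures}

positives = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
print(parallel_find_s(positives))
```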
3.6. Existing Prolog implementation
S. Colton has implemented an initial version of the Find-S algorithm in
PROLOG. This relatively compact program (approximately 300 lines of code)
identifies substructures from a sample data set as used by King et al [2].
The program is guided by substructure templates, of which a few have been
hard coded. It has recreated some of the results produced by the ILP
method and PROGOL on the sample data set considered. The program can
take parameters to specify the minimum number of ground terms that must
appear in a resultant hypothesis (i.e. limiting the number of variables), and also
the minimum number of molecules for which a hypothesis should return TRUE
for a positive, and the maximum for which it can return FALSE for a negative.
An important point for discussion here is the representation of the
background and structural data. The molecules are represented as a
series of facts in a PROLOG database. The representation is identical
to that suggested in the section on inductive logic programming, and
involves storing information about atoms and the bonds between them. The
data stored for even a single molecule is extensive; however these PROLOG
facts can be generated automatically as mentioned in section 4.1.
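The fact base can be pictured as below. The predicate shapes (atm/5 and bond/4) follow the atom-and-bond representation of King et al. [2], but the specific values here are invented for illustration; in Python, each PROLOG fact becomes a tuple.

```python
# Hypothetical mirror of the PROLOG fact base:
#   atm(Compound, AtomId, Element, AtomType, Charge)
#   bond(Compound, Atom1, Atom2, BondType)
atoms = [
    ("d1", "d1_1", "c", 22, -0.117),   # carbon atom in compound d1
    ("d1", "d1_2", "c", 22, -0.117),
    ("d1", "d1_3", "n", 38,  0.812),   # nitrogen atom
]
bonds = [
    ("d1", "d1_1", "d1_2", 7),          # aromatic bond
    ("d1", "d1_2", "d1_3", 1),          # single bond
]

def neighbours(compound, atom_id):
    """All atoms bonded to atom_id in compound (bonds are symmetric),
    i.e. the query a PROLOG program would answer by unifying bond/4."""
    out = []
    for c, a1, a2, _ in bonds:
        if c == compound and atom_id in (a1, a2):
            out.append(a2 if a1 == atom_id else a1)
    return out

print(neighbours("d1", "d1_2"))   # ['d1_1', 'd1_3']
```

Even this three-atom fragment needs five facts; a full molecule runs to dozens, which is why automatic generation of the facts matters.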
4. Implementation Considerations
The Find-S algorithm has been discussed at length as it represents the core
component of a system to identify substructures. However, the initial remit was
to create a substructure server, whereby users would be able to identify
potentially interesting substructures from their positive and negative examples.
As such, other considerations need to be examined, and these are summarised
here.
4.1. Representing structures
There exists a conflict between the natural user representation of chemical
structures, and those that are useful to the implemented algorithm. In a
sense, the users’ view of structures must be parsed into the computer view
(first order logic) at some stage, either by the user manually, or by the
implemented software as pre-processing to the Find-S algorithm. It is
clearly more desirable from the users’ position that this conversion is done
in an automated fashion. The feasibility of this is briefly discussed here.
Chemists are often concerned with modelling compounds, and the industry
standard modelling software is QUANTA [9]. King et al. in [2] used QUANTA
editing tools to automatically map a visual representation of a molecule into
first order logic. After some suitable pre-processing, this mapped
representation could be read by their PROGOL program as a series of facts.
Another molecular simulation program, CHARMM [10], stores information
about the molecule being simulated as data files. These data files use
standard naming and referencing techniques, as described by The Protein
Data Bank [11]. The structure of these flat text files is conducive to
translation into other formats, given the development of suitable schemas.
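A sketch of such a translation step is given below. The field layout is an assumption loosely modelled on PDB-style ATOM records, not the real CHARMM or PDB format; the point is only that line-oriented flat text maps straightforwardly onto facts once a schema is fixed.

```python
def parse_structure_lines(lines):
    """Translate a simplified flat-text structure file into
    (atom_id, element, x, y, z) facts, skipping non-atom records.
    The whitespace-separated field order is a hypothetical schema."""
    facts = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "ATOM":
            serial, element = fields[1], fields[2]
            x, y, z = map(float, fields[3:6])
            facts.append((serial, element.lower(), x, y, z))
    return facts

sample = [
    "ATOM 1 C 0.000 1.396 0.000",
    "ATOM 2 N 1.209 0.698 0.000",
    "REMARK generated example",
]
print(parse_structure_lines(sample))
```

From facts of this form, the bond facts required by the algorithm could be derived in a further pre-processing pass (for instance from inter-atomic distances), keeping the whole conversion automated from the user's point of view.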
4.2. Improvement of current implementation
S. Colton's current implementation of the Find-S algorithm can serve as a
basis for further work. The algorithm could be recoded in a modern object
oriented language, which would facilitate parallelising and packaging the
algorithm as a web-based application.
One key improvement that could be made is with the introduction of new
search templates. These templates guide the algorithm, restricting its
search to sub-molecules matching the specified template. Currently only a
small number of templates are implemented; it is desirable that more be
available to the user.
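The role a template plays can be sketched as a filter over candidate sub-molecules. This is an assumed, simplified notion of template (a set of required feature kinds); the real templates constrain atoms and bond topology, but the effect on the search is the same: only matching candidates are ever considered.

```python
def matches_template(substructure, template):
    """A search template restricts the algorithm to candidate
    sub-molecules of a given shape; here, a template is simply the set
    of feature kinds a candidate must contain."""
    return template <= {kind for kind, _ in substructure}

# Hypothetical template: only consider candidates containing a ring
# with an attached functional group.
ring_plus_group = {"ring", "group"}

candidates = [
    [("ring", "benzene"), ("group", "nitro")],
    [("chain", "alkyl"), ("group", "hydroxyl")],
]
kept = [c for c in candidates if matches_template(c, ring_plus_group)]
print(kept)
```

Adding a new template under this scheme is just a matter of supplying a new constraint, which suggests how a larger library of templates could be exposed to the user.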
4.3. Extensions
As advanced work in this area, further extensions to those suggested above
are possible. Implementing the algorithm in parallel is one such possible
extension. This would speed up the potentially highly complex and time
consuming derivations of hypotheses.
There is also scope for the generated hypotheses to be represented in
different formats. While an answer returned in first order logic may be
strictly accurate, it is unlikely to be of much use to a user with little or no
knowledge of computational logic techniques. Molecular visualisation
software such as RASMOL and the later PROTEIN EXPLORER [12] exists that
can take as input data in a similar format to that produced by QUANTA or
CHARMM. It would be desirable for a user to view the resultant hypotheses,
with the sub-molecule derived by the algorithm presented visually.
5. References
[1] Ellis, L., Aetna InteliHealth Drug Resource Centre, From Laboratory To
Pharmacy: How Drugs Are Developed, 2002.
http://www.intelihealth.com/IH/ihtIH/WSIHW000/8124/31116/346361.html?
d=dmtContent
[2] King, R. D., Muggleton, S. H., Srinivasan, A. & Sternberg, M. J. E.,
Structure-activity relationships derived by machine learning: The use of
atoms and their bond connectivities to predict mutagenicity by inductive logic
programming (1996), Proceedings of the National Academy of Sciences
(USA) 93, 438-442
[3] Hansch, C., Maloney, P. P., Fujita, T. & Muir, R. M., Correlation of Biological
Activity of Phenoxyacetic Acids with Hammett Substituent Constants and
Partition Coefficients (1962). Nature (London) 194, 178-180
[4] Mitchell, T. M., Machine Learning, International Edition, 1997, McGraw-Hill
[5] Glen, B., Molecular Modelling and Molecular Informatics, University of
Cambridge – Centre for Molecular Informatics,
www-ucc.ch.cam.ac.uk/colloquia/rcg-lectures/A4
[6] Muggleton, S., Inductive Logic Programming (1991), New Generation
Computing 8, 295-318
[7] Srinivasan, A., Muggleton, S. H., Sternberg, M. J. E., King, R. D., Theories
for mutagenicity: a study in first-order and feature-based induction (1996),
Artificial Intelligence 85(1,2), 277-299
[8] Muggleton, S., Inverse Entailment and Progol (1995), New Generation
Computing 13, 245-286
[9] Quanta software, http://www.accelrys.com/quanta/, Accelrys Inc.
[10] Chemistry at HARvard Macromolecular Mechanics (CHARMM),
http://www.ch.embnet.org/MD_tutorial/pages/CHARMM.Part1.html
[11] The Protein Data Bank, http://www.rcsb.org/pdb
[12] RasMol Home Page, http://www.umass.edu/microbio/rasmol/