Title: Holland Poster
Chemical Hyperstructure Generation with a Genetic
Algorithm
Nathan Brown (1), Richard Lewis (2), Peter Willett (1) and
David Wilton (1)
Introduction
The tools and techniques of Graph Theory (GT) have
been applied to an extensive and diverse range of
real-world applications, allowing the abstracted
graphs to be both manipulated and analysed using
existing and novel computational algorithms. In
this work, we describe a method of generating
chemical hyperstructures by the sequential
overlapping of compounds from a molecular library,
thus constructing a single graph representation of
the entire library. However, the task of
discovering the optimal overlap between two graphs
is believed to be in the nondeterministic
polynomial-time complete (NP-complete) class of
problems, and it is therefore computationally
expensive to solve using traditional methods.
Based on a previous study in this department [1],
we implement a Genetic Algorithm (GA) to evolve an
optimal or near-optimal overlap for each molecule.
It is further shown that a GA may be applied to
evolving parameter sets for the hyperstructure
generation program itself.
Mapping using a Genetic Algorithm
Figure 2: HYPERSTRUCTURE GENERATION ALGORITHM

    hyp = first(mols)
    foreach mol in mols
        mapping = compare(mol, hyp)
        hyp = append(hyp, mol, mapping)

This pseudo-code algorithm illustrates the
simplicity of hyperstructure generation, but only
shows the mapping stage as a black-box function.
mols is the molecular library, with mol being each
molecule in the library in turn; hyp is the
hyperstructure.
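
The loop in Figure 2 can be sketched in Python as follows. This is a minimal illustration, not the program described on this poster. The graph representation (a list of atom labels plus a set of undirected edges), the helper `bond`, and in particular the body of `compare` are assumptions made for the example: `compare` here greedily pairs atoms with same-label nodes instead of running a GA. The sketch also assumes that the first molecule only seeds the hyperstructure and is not mapped onto itself.

```python
from typing import FrozenSet, List, Set, Tuple

Edge = FrozenSet[int]
Graph = Tuple[List[str], Set[Edge]]  # (node labels, undirected edges)


def bond(a: int, b: int) -> Edge:
    """Undirected edge between two node indices (illustrative helper)."""
    return frozenset((a, b))


def compare(mol: Graph, hyp: Graph) -> List[int]:
    """Placeholder for the black-box mapping stage (NOT the GA).

    Each atom is paired with the first unused hyperstructure node that has
    the same label; -1 means the atom needs a new node.
    """
    used: Set[int] = set()
    mapping: List[int] = []
    for label in mol[0]:
        target = -1
        for node, hyp_label in enumerate(hyp[0]):
            if hyp_label == label and node not in used:
                target = node
                break
        if target >= 0:
            used.add(target)
        mapping.append(target)
    return mapping


def append(hyp: Graph, mol: Graph, mapping: List[int]) -> Graph:
    """Overlay mol on hyp, adding only the nodes and edges that are missing."""
    labels, edges = list(hyp[0]), set(hyp[1])
    resolved: List[int] = []
    for atom, node in enumerate(mapping):
        if node < 0:  # unmapped atom: create a new hyperstructure node
            labels.append(mol[0][atom])
            node = len(labels) - 1
        resolved.append(node)
    for a, b in (tuple(e) for e in mol[1]):
        edges.add(bond(resolved[a], resolved[b]))  # a set ignores duplicates
    return labels, edges


def generate(mols: List[Graph]) -> Graph:
    hyp: Graph = (list(mols[0][0]), set(mols[0][1]))  # hyp = first(mols)
    for mol in mols[1:]:
        mapping = compare(mol, hyp)
        hyp = append(hyp, mol, mapping)
    return hyp


# Heavy-atom skeletons only; hydrogens are omitted for brevity.
ethanol: Graph = (["C", "C", "O"], {bond(0, 1), bond(1, 2)})
propane: Graph = (["C", "C", "C"], {bond(0, 1), bond(1, 2)})
hyp = generate([ethanol, propane])
print(len(hyp[0]), "nodes,", len(hyp[1]), "edges")
```

In this toy run the two three-atom molecules (six nodes and four edges in total) share two carbons and one bond, so the resulting hyperstructure has four nodes and three edges.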
Population Size and Base Operations Parameters
(evolved value, original value in brackets)
  popX      47 (10)
  popY       9 (10)
  opsX       8 (2)
  opsY      57 (10)
  opsScale  22 (500)

Both these parameter sets were separately tested on
the hyperstructure generation program, using a
data-set of 200 molecules selected using the MAXMIN
diversity algorithm; the results are given in
Tables 2 and 3.

Genetic Algorithms (GAs)
The Genetic Algorithm (GA) is an optimisation
technique that takes its inspiration from the
evolutionary processes that are believed to exist
in nature. The canonical GA operates by iteratively
perturbing a population of binary-encoded candidate
solutions, or chromosomes, by first selecting
individuals to mate according to their ability to
solve a specified problem, referred to as fitness.
This sampling stage is then followed by the
application of analogues of genetic operators, such
as recombination and mutation, to the sampled
individuals to evolve a new population of
chromosomes that tends to exhibit the genetic
characteristics deemed more fit.

Hyperstructure Generation GA
An integer-encoded chromosome lends itself to this
particular mapping problem, with the chromosome
length and the size of the allele-set being
determined by the number of atoms in the molecule
and the number of hyperstructure nodes,
respectively (Figure 4).

FEATURES OF THE GA
integer-encoded chromosomes
  - chromosome length = count of atoms in molecule
  - allele-set = count of nodes in hyperstructure
steady-state population replacement
  - one genetic operation equates to one generation
four genetic operators defined
  - mutation and one-point, uniform and node-based
    crossovers
  - node-based crossover aims to reduce the
    fragmentation of fit building blocks
population and generation sized at runtime
  - m defined in terms of x, y and max
  - g defined in terms of x, y and max^2
  - where max in each case is the number of C atoms
    in the molecule
fitness defined as the number of edges preserved in
the mapping
optional niche-restriction (chromosomes and
node-sets)
either atom label or type mapping is permitted
  - label = atom's elemental type
  - type = atom's Sybyl atom type
either random or greedy population initialisation
fitness scaling: raw, window-scaled and power law
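
Several of the features listed above can be illustrated with a short Python sketch: an integer-encoded chromosome (one gene per atom, each allele a hyperstructure node), fitness as the number of edges preserved, mutation, one-point crossover and steady-state replacement. Everything beyond those named features is an assumption made for the example: the function names, the size-2 tournament selection, the equal chance of choosing either operator, the replace-the-worst rule, and the fixed `pop_size` and `operations` values (the poster sizes these at runtime). For brevity the sketch does not stop two atoms from mapping to the same node, which a real mapping would have to forbid or repair.

```python
import random
from typing import FrozenSet, List, Set

Edge = FrozenSet[int]


def fitness(chrom: List[int], mol_edges: Set[Edge], hyp_edges: Set[Edge]) -> int:
    """Number of molecule edges whose mapped endpoints are joined in hyp."""
    preserved = 0
    for a, b in (tuple(e) for e in mol_edges):
        if frozenset((chrom[a], chrom[b])) in hyp_edges:
            preserved += 1
    return preserved


def mutate(chrom: List[int], n_nodes: int, rng: random.Random) -> List[int]:
    """Point the atom at one random position to a random hyperstructure node."""
    child = list(chrom)
    child[rng.randrange(len(child))] = rng.randrange(n_nodes)
    return child


def one_point_crossover(p1: List[int], p2: List[int], rng: random.Random) -> List[int]:
    """Head of one parent joined to the tail of the other."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]


def evolve(mol_edges: Set[Edge], n_atoms: int, hyp_edges: Set[Edge], n_nodes: int,
           pop_size: int = 20, operations: int = 200, seed: int = 0) -> List[int]:
    rng = random.Random(seed)

    def score(chrom: List[int]) -> int:
        return fitness(chrom, mol_edges, hyp_edges)

    # Random initialisation: one allele (node index) per atom.
    pop = [[rng.randrange(n_nodes) for _ in range(n_atoms)] for _ in range(pop_size)]
    for _ in range(operations):  # steady state: one operation per generation
        p1 = max(rng.sample(pop, 2), key=score)  # assumed tournament selection
        p2 = max(rng.sample(pop, 2), key=score)
        if rng.random() < 0.5:
            child = mutate(p1, n_nodes, rng)
        else:
            child = one_point_crossover(p1, p2, rng)
        worst = min(range(pop_size), key=lambda i: score(pop[i]))
        if score(child) >= score(pop[worst]):
            pop[worst] = child  # only the single worst individual is replaced
    return max(pop, key=score)


# Three-atom chain mapped onto a four-node chain (labels ignored here).
mol_edges = {frozenset((0, 1)), frozenset((1, 2))}
hyp_edges = {frozenset((0, 1)), frozenset((1, 2)), frozenset((2, 3))}
best = evolve(mol_edges, 3, hyp_edges, 4)
print(best, fitness(best, mol_edges, hyp_edges))
```

Because only the worst individual is ever replaced, and only by a child that is at least as fit, the best fitness in the population never decreases from one operation to the next.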

A subgraph can be defined either as a node- or
edge-induced subgraph (Figure 1). A node-induced
subgraph is determined by taking a subset of the
nodes of a particular graph along with all their
interconnecting edges, whereas an edge-induced
subgraph is realised by taking a subset of edges
along with their respective nodes. The problem of
discovering the MCS of two graphs is akin to the
location of a maximal node-induced subgraph that is
common to both graphs concerned. The MOS can
similarly be defined as the location of a maximal
edge-induced subgraph that is common to both input
graphs.

Chemical Hyperstructures
The chemical hyperstructure is a single-structure
representation of a molecular library generated by
the sequential overlapping of each of the library's
compounds onto the current hyperstructure
(Figure 2), retaining each molecule's atom and bond
information as part of the hyperstructure, thereby
permitting subsequent analysis.
Originally, Vladutz and Gould [2] proposed the
hyperstructure representation as a method for
improving the retrieval capabilities of chemical
databases. Once a mapping is located, it is
computationally trivial to append the current
molecule to the hyperstructure. Although it is
possible to generate a mapping without regard to
its optimality, by using random but valid atom-node
mappings, this makes no effort to reduce the number
of edges that will be appended to the
hyperstructure, yielding a very complex
hyperstructure with much structural redundancy.
However, the discovery of an optimal mapping, which
minimises structural redundancy, is equivalent to
locating the disconnected set of maximal
edge-induced subgraphs common to both graphs
(Figure 3), which is believed to be an NP-complete
problem. It is therefore computationally prudent to
employ some form of heuristic technique in order to
discover an optimal or near-optimal mapping in a
more realistic timeframe.
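
To give a feel for why a heuristic is attractive, the sketch below finds the best mapping for a tiny example by exhaustive search and counts how many candidate mappings such a search must visit. It is an illustration only: the function names are hypothetical, atom labels are ignored, every atom is assumed to map to a distinct existing node (so the hyperstructure must have at least as many nodes as the molecule has atoms), and brute force is not the method used on this poster.

```python
from itertools import permutations
from math import perm
from typing import FrozenSet, List, Sequence, Set, Tuple

Edge = FrozenSet[int]


def preserved(mapping: Sequence[int], mol_edges: List[Tuple[int, int]],
              hyp_edges: Set[Edge]) -> int:
    """Count molecule edges that land on an existing hyperstructure edge."""
    return sum(frozenset((mapping[a], mapping[b])) in hyp_edges
               for a, b in mol_edges)


def exhaustive_best(n_atoms: int, mol_edges: List[Tuple[int, int]],
                    n_nodes: int, hyp_edges: Set[Edge]) -> Tuple[int, ...]:
    """Try every injective atom-to-node mapping and keep the best one."""
    return max(permutations(range(n_nodes), n_atoms),
               key=lambda m: preserved(m, mol_edges, hyp_edges))


# Three-atom chain mapped onto a four-node chain.
mol_edges = [(0, 1), (1, 2)]
hyp_edges = {frozenset(p) for p in [(0, 1), (1, 2), (2, 3)]}

best = exhaustive_best(3, mol_edges, 4, hyp_edges)
print(best, preserved(best, mol_edges, hyp_edges))

# Number of injective mappings of k atoms onto n nodes: n! / (n - k)!
print(perm(4, 3))    # the toy case above
print(perm(30, 20))  # a 20-atom molecule and a 30-node hyperstructure
```

The toy case has only 24 candidate mappings, but a 20-atom molecule and a 30-node hyperstructure already give more than 10^25 candidates, so enumerating them is impractical at realistic sizes.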

Parameter Optimisation using a GA
GAs typically require their parameters to be
fine-tuned in order to produce better results
and/or shorter runtimes. These parameters include,
amongst other runtime constants and variables, the
genetic operator probabilities, the population size
and the number of generations for the GA to run.
In the
case of the hyperstructure generation GA, these
parameters are determined by parameter sets of
genetic operator probabilities, along with
variables in formulae that determine the
population size and base genetic operations. The
fine-tuning of parameters is itself a
combinatorial problem, in which there are a large
number of permutations, from which it is
difficult to determine near-optimal values for
each of the parameters. Generally, the parameters
for GAs are manually tweaked by the programmer in
order to discover optimal sets; however, this can
be time-consuming. Therefore, a program was
written to evolve the parameter sets for the
hyperstructure generation GA in an attempt to
locate near-optimal mappings while also
endeavouring to reduce the CPU cycles of the
process. This program evolves the parameter sets
by means of another GA, in the style of the
canonical GA. The parameter-set evolution GA is
able to evolve either the genetic operator
probabilities or the population size and base
operations for the run of the program. In order
to accurately determine the fitness of a
particular candidate solution, it is necessary to
discover how the parameter set defined by the
chromosome affects the operation of the
hyperstructure generation program. A significant
problem with ascertaining an accurate fitness for
a particular chromosome is that the
hyperstructure generation program is itself
non-deterministic and therefore prone to a degree
of error between runs. In an attempt to reduce
this error, the program is executed a total of 3
times with 2 data-sets, each of 5 compounds. This
equates to the GA being performed a total of 24
times to evaluate each individual.

Running both versions of this GA to evolve the
separate parameter sets resulted in the following
parameter sets for the operator weights and for the
population and generation parameters (the original
parameters are given in brackets).

Genetic Operator Weights
  mutation              35 (20)
  node-based crossover   8 (15)
  uniform crossover      5 (5)
  one-point crossover   11 (0)
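
The repeated-evaluation scheme described above can be sketched as follows. The sketch rests on assumptions. One reading of the figure of 24 that is consistent with the pseudo-code in Figure 2 is that the first compound of each data-set seeds the hyperstructure, leaving 4 mappings per data-set, so that 2 data-sets x 4 mappings x 3 runs = 24; this reading is not stated on the poster. The names `ga_runs`, `evaluate` and `stub_program` are hypothetical, the stub's score is a placeholder that carries no meaning, and taking the mean over runs is an assumed way of combining results.

```python
from statistics import mean
from typing import Callable, Dict, List

Params = Dict[str, float]
Dataset = List[str]


def ga_runs(n_datasets: int, compounds_per_set: int, repeats: int) -> int:
    """Mapping-GA invocations per individual, ASSUMING the first compound
    of each data-set seeds the hyperstructure and needs no mapping."""
    return n_datasets * (compounds_per_set - 1) * repeats


def evaluate(params: Params, datasets: List[Dataset], repeats: int,
             run_program: Callable[[Params, Dataset], float]) -> float:
    """Average the score of a non-deterministic program over repeated runs."""
    scores = [run_program(params, ds) for ds in datasets for _ in range(repeats)]
    return mean(scores)


calls: List[Dataset] = []


def stub_program(params: Params, dataset: Dataset) -> float:
    """Stand-in for the hyperstructure generation program (placeholder score)."""
    calls.append(dataset)
    return float(len(dataset))


datasets = [["m1", "m2", "m3", "m4", "m5"], ["m6", "m7", "m8", "m9", "m10"]]
score = evaluate({"mutation": 35.0}, datasets, 3, stub_program)
print(ga_runs(2, 5, 3), len(calls), score)
```

Averaging over several runs smooths out run-to-run variation, at the cost of multiplying the time needed to evaluate each candidate parameter set.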

Graph Theory
A graph is a pair G = (V, E) of sets, vertices (or
nodes, points) and edges (or arcs, lines), where
each element of E is a 2-element subset of V. The
usual way of representing these constructs is to
draw a point for each vertex and a line connecting
the two vertices of each edge in E, but the method
of visualisation is unimportant so long as the
connectivity information is preserved. By
representing chemical compounds as graphs, it is
possible to process them using existing
graph-theoretic techniques. A significant drawback,
however, is that many graph-theoretic problems are
known or believed to be NP-complete. There exist a
number of graph matching problems, such as Graph
Isomorphism (GI), Subgraph Isomorphism (SGI),
Maximum Common Subgraph (MCS) and Maximal Overlap
Set (MOS).
The problems of MCS and MOS, although similar,
are distinguished by a crucial difference in how
a subgraph is defined.
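
The distinction between the two kinds of subgraph can be made concrete with a small Python sketch. A node-induced subgraph keeps every edge that joins the chosen nodes, whereas an edge-induced subgraph keeps only the chosen edges together with their endpoints. The function names and the set-of-frozensets edge representation (which mirrors the definition of E as 2-element subsets of V) are choices made for this example.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Edge = FrozenSet[int]


def node_induced(edges: Set[Edge], nodes: Iterable[int]) -> Tuple[Set[int], Set[Edge]]:
    """Chosen nodes plus ALL edges of the graph that join two of them."""
    kept = set(nodes)
    return kept, {e for e in edges if e <= kept}


def edge_induced(edges: Set[Edge], chosen: Iterable[Edge]) -> Tuple[Set[int], Set[Edge]]:
    """Chosen edges plus only the nodes those edges touch."""
    kept = set(chosen) & edges
    return {n for e in kept for n in e}, kept


# A triangle: three nodes, three edges.
triangle = {frozenset(p) for p in [(0, 1), (1, 2), (0, 2)]}

n_nodes, n_edges = node_induced(triangle, [0, 1, 2])
e_nodes, e_edges = edge_induced(triangle, [frozenset((0, 1)), frozenset((1, 2))])

print(len(n_nodes), len(n_edges))
print(len(e_nodes), len(e_edges))
```

Both subgraphs span the same three nodes, but the node-induced one must include the closing edge between nodes 0 and 2, while the edge-induced one may leave it out.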

Conclusions
It has been shown that the hyperstructure
generation program is able to generate
hyperstructures with limited structural redundancy
in a short runtime using a GA as the overlapping
function. This consistently provides high
compression of the original molecular library in
the number of both nodes and edges. It has also
been shown that the GA paradigm is suitable for
evolving the input parameters for the program,
producing significant reductions in runtime.
However, further work is required in this area to
ensure that the runtime does not over-influence the
evaluation of the chromosomes.

References
1. Brown, R. D., Jones, G. and Willett, P. (1994).
   Matching two-dimensional chemical graphs using
   genetic algorithms. J. Chem. Inf. Comput. Sci.
   34, 63-70.
2. Vladutz, G. and Gould, S. R. (1988). Joint
   compound/reaction storage and possibilities of
   hyperstructure-based solution. In Warr, W. A.
   (ed.). Chemical structures: the international
   language of chemistry. Springer-Verlag, Berlin.

(1) Krebs Institute for Biomolecular Research,
    Department of Information Studies, University of
    Sheffield, Sheffield, S10 2TN, UK.
(2) Eli Lilly and Company, Erl Wood Manor,
    Windlesham, Surrey, GU20 6PH, UK.
This work is funded by a CASE award from the EPSRC
and Eli Lilly and Co.