Title: Canonicalized systematic nomenclature in chemoinformatics
And some new canonicalization tools from OpenEye
Jeremy J. Yang
Morgan demo and study
New canonicalizing molfiles
OpenEye canonicalization tools
Introduction
- Canonicalizing a connection table is not new; it
was discussed by Morgan [1] and others. But
generating canonical forms of current standard
formats is not widely done, for historical and
practical reasons, despite the available
benefits. Those benefits are increasingly within
reach now that longer strings are easily handled
by current computers. OEChem provides sufficient
control to accomplish this task.
- Proposed algorithm:
  - Remove non-structural data
  - Suppress hydrogens
  - Canonical atom order
  - Canonical bond order
  - Canonical Kekule bonding based on the (selected)
    aromaticity model
- However, the advantages of more terse canonical
line notations remain.
- Results: Using the test program canmol.py, the
1990 NCI Diversity set was converted to canonical
SDF files exactly equal to the SDF files converted
via SMILES (demo.eyesopen.com/cgi-bin/canmol). The
same was done with the MOL2 format. This test
validates the ability of OEChem to canonicalize
molfiles as strings.
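The proposed algorithm can be illustrated with a minimal sketch in plain Python. It does not use OEChem and is not the canmol.py program: the function name `canonical_molblock`, the toy atom/bond input layout, and the output layout are all assumptions made for illustration. The canonical atom order is found by exhaustive search, which is only practical for very small molecules, and the Kekule step is omitted (bond orders are assumed to be given in Kekule form already).

```python
from itertools import permutations


def canonical_molblock(atoms, bonds, data=None):
    """Return a canonical text block for a small molecule (illustrative only).

    atoms: element symbols; bonds: (i, j, order) with 0-based atom indices;
    data: non-structural fields (hypothetical, e.g. a name), deliberately unused.
    """
    # Steps 1 and 2: drop non-structural data, suppress explicit hydrogens.
    heavy = [i for i, element in enumerate(atoms) if element != "H"]
    index = {old: new for new, old in enumerate(heavy)}
    elements = [atoms[i] for i in heavy]
    edges = [(index[a], index[b], order)
             for a, b, order in bonds if a in index and b in index]

    # Steps 3 and 4: try every atom numbering, sort the bonds of each,
    # and keep the smallest (atoms, bonds) description.
    best = None
    for perm in permutations(range(len(elements))):  # perm[old] = new
        relabeled = [None] * len(elements)
        for old, new in enumerate(perm):
            relabeled[new] = elements[old]
        sorted_bonds = sorted(
            (min(perm[a], perm[b]), max(perm[a], perm[b]), order)
            for a, b, order in edges)
        candidate = (relabeled, sorted_bonds)
        if best is None or candidate < best:
            best = candidate

    # Emit a counts line, atom lines, then bond lines (1-based indices).
    relabeled, sorted_bonds = best
    lines = [f"{len(relabeled)} {len(sorted_bonds)}"]
    lines += relabeled
    lines += [f"{a + 1} {b + 1} {order}" for a, b, order in sorted_bonds]
    return "\n".join(lines)


# Ethanol entered twice: different numbering, one explicit hydrogen, extra data.
first = canonical_molblock(["C", "C", "O", "H"],
                           [(0, 1, 1), (1, 2, 1), (2, 3, 1)],
                           data={"name": "ethanol"})
second = canonical_molblock(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
print(first == second)
```

Because both inputs reduce to the same string, equality of molecules becomes equality of strings. A production toolkit would replace the exhaustive search with a Morgan-like ranking.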
- The OpenEye chemoinformatics toolkit OEChem [12]
employs an optimized Morgan-like canonical
algorithm to generate canonical SMILES. In
addition, the API provides a rich set of tools
which can facilitate generation of canonical
representations of many types, for many chemical
and informational models, and for many standard
file formats.
  - OEChem: OECanonicalOrderAtoms()
  - OEChem: OECanonicalOrderBonds()
  - OEChem: aromaticity models (OE, Daylight, Tripos,
    MDL, MMFF)
  - OEChem: many file formats and flavors, low-level
    writers
  - QuacPac [13]: tautomers application and toolkit
Canonicalization in chemoinformatics facilitates
rigorous, unambiguous expression and handling of
chemical data and knowledge. However, just as
chemistry encompasses multiple levels of
abstraction and modelling, no single
canonicalization method is sufficient to solve
all problems. This study reviews some existing
canonicalization methodology and describes new
methods implemented by the chemoinformatics
library OEChem and other OpenEye tools.
Fig 1 Morgan demo. Extended connectivity values
and atom orders. Uses OEChem and Ogham. NCI
Diversity set processed with no errors.
Definition of canonicalization
Fig 2 Morgan slow due to symmetry.
A canonicalization algorithm must determine a
single representation among many possible
representations for an individual in its domain.
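As a small illustration of this definition, consider a domain much simpler than general molecules: a single ring written as a string of atom symbols, which can start at any atom and run in either direction. The sketch below (the function name `canonical_ring` is hypothetical) picks the lexicographically smallest of all such spellings as the one canonical representation.

```python
def canonical_ring(symbols):
    """Pick one spelling of a ring among all rotations and both directions."""
    n = len(symbols)
    candidates = []
    for sequence in (symbols, symbols[::-1]):      # both traversal directions
        for start in range(n):                     # every starting atom
            candidates.append(sequence[start:] + sequence[:start])
    return min(candidates)                         # the single chosen representative


# Three spellings of the same five-membered ring give one canonical spelling.
print(canonical_ring("CCNCO"), canonical_ring("OCCNC"), canonical_ring("NCCOC"))
```

The choice of "smallest string" is arbitrary; any rule works as long as it selects the same representative for every equivalent input.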
Conclusion
Rigorous and effective chemoinformatics systems
require concepts and methods for canonicalization
at multiple levels of chemical abstraction and
organization. The current state of the art
presents many theoretical and practical
challenges. OpenEye tools can help.
Benefits of canonicalization
Fig 3 Morgan fails
- testing equality of molecules
- database search speed
- rigorous informatics and thinking
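The first two benefits can be sketched as follows, assuming each record already carries a canonical string produced by some canonicalizer. The record identifiers and the helper name `group_duplicates` are hypothetical. Equality of molecules reduces to string equality, and lookup becomes a hash-table access rather than a graph comparison.

```python
from collections import defaultdict


def group_duplicates(records):
    """Group record ids by canonical string; keep only strings seen more than once."""
    groups = defaultdict(list)
    for record_id, canonical in records:
        groups[canonical].append(record_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical registry entries: (id, canonical string).
records = [("A1", "CCO"), ("A2", "c1ccccc1"), ("A3", "CCO")]
duplicates = group_duplicates(records)
print(duplicates)
```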
New canonical tautomers
Aha! -- Chemo-taxonomy is a stranded hierarchy
Tautomers have the same formula (they are
structural isomers), but may differ in proton and
electron location, and in formal bond order.
Special cases: keto/enol, zwitterion, ring-chain.
In the Delany/Sayle algorithm [8,13], hydrogen
donors and acceptors are perceived, along with the
number of free hydrogens. Donor and acceptor atoms
are ordered canonically. At this stage all
tautomerically equivalent inputs are represented
identically. Hydrogen locations are then
exhaustively enumerated. A simple ruleset for
enumeration order can designate the first to be
the canonical tautomer. Through additional rules,
the likelihood can be increased that the canonical
tautomer is a low-energy form.
Applications: registration (exact search),
substructure searching, property prediction,
similarity/clustering, protein-ligand analysis.
Failure to perceive tautomerism leads to different
results for different valence models which really
represent the same chemical entity.
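The enumeration idea can be sketched in a heavily simplified form. This is not the Delany/Sayle implementation: the sketch assumes the donor/acceptor sites have already been perceived and canonically numbered, that each site holds at most one mobile hydrogen, and it ignores bond-order reassignment and valence checks. All function names are hypothetical.

```python
from itertools import combinations


def enumerate_hydrogen_placements(n_sites, n_hydrogens):
    """All ways to place the mobile hydrogens on canonically numbered sites."""
    return list(combinations(range(n_sites), n_hydrogens))


def canonical_placement(n_sites, n_hydrogens, preference=None):
    """First placement in enumeration order, optionally re-ranked by a rule."""
    placements = enumerate_hydrogen_placements(n_sites, n_hydrogens)
    if preference is not None:
        placements.sort(key=preference)  # stable: ties keep enumeration order
    return placements[0]


def canonicalize(input_placement, n_sites, preference=None):
    """Only the site count and hydrogen count of the input are used, so all
    tautomerically equivalent inputs yield the same answer."""
    return canonical_placement(n_sites, len(input_placement), preference)


# One mobile hydrogen, three sites: inputs with H on site 0 or site 2 agree.
print(canonicalize((0,), 3), canonicalize((2,), 3))

# An additional rule (assumed here: prefer a hydrogen on site 2).
prefer_site_2 = lambda placement: 0 if 2 in placement else 1
print(canonicalize((0,), 3, preference=prefer_site_2))
```

The `preference` argument stands in for the "additional rules" that bias the canonical tautomer toward a low-energy form; the rule shown is arbitrary.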
N! (graph isomorphism is hard): Morgan to the
rescue
- subatomic → atoms → molecules
- normal weight atoms → isotopes
- Kekule molecule model → aromatic molecule models
- non-stereo molecule → stereoisomers
- single molecule → combinatorial libraries
- single molecule → queries
- small molecule → macromolecule, cofactors,
  ligands
- single molecule → Markush structures
- single molecule → tautomer set
- single molecule → pKa states
- single molecule → reactions
- 2D → 3D
- There is a hierarchical relationship among some
of these expansions, while others are independent.
For example, a combinatorial library may involve
stereoisomeric or non-stereo individuals. For
every combination of molecular representations,
canonicalization could be advantageous for the
reasons described. Hence the task of
canonicalization is a multi-faceted one.
References
[1] Morgan, H. L., "Generation of a unique machine
description for chemical structures - A technique
developed at Chemical Abstracts Service", J.
Chem. Doc. 1965, 5, 107.
[2] Stereochemically unique naming algorithm, W.
Todd Wipke, Thomas M. Dyott, J. Am. Chem. Soc.
1974 96(15) 4834-4842.
[3] Canonical Numbering and Constitutional
Symmetry, Clemens Jochum and Johann Gasteiger, J.
Chem. Inf. Comput. Sci. 1977 17(2) 113-117.
[4] Computer Perception of Topological Symmetry,
Craig A. Shelley, Morton E. Munk, J. Chem. Inf.
Comput. Sci. 1977 17(2) 110-113.
[5] An Approach to the Assignment of Canonical
Connection Tables and Topological Symmetry
Perception, Craig A. Shelley, Morton E. Munk, J.
Chem. Inf. Comput. Sci. 1979 19(4) 247-250.
[6] David Weininger, Arthur Weininger and Joseph
L. Weininger, "SMILES. 2. Algorithm for Generation
of Unique SMILES Notation", Journal of Chemical
Information and Computer Sciences (JCICS), Vol.
29, No. 2, pp. 97-101, 1989.
[7] A beginner's guide to responsible parenting or
knowing your roots,
www.daylight.com/meetings/emug98/Bradshaw/,
EuroMUG '98, Cambridge, UK, Oct 1998.
[8] Canonicalization and Enumeration of Tautomers,
Jack Delany and Roger Sayle,
www.daylight.com/meetings/emug99/Delany/taut_html/sld001.htm,
EuroMUG '99, Cambridge, UK, Oct 1999.
[9] Hooked on Protonics, Roger Sayle and Geoff
Skillman,
www.eyesopen.com/about/events/presentations/acs02/sld001.htm,
224th ACS National Meeting, Boston, Aug 2002.
[10] Introduction to Chemical Info Systems, John
Bradshaw,
www.daylight.com/meetings/emug02/Bradshaw/Training/,
EuroMUG '02, 24th-26th September 2002, Cambridge,
UK.
[11] That INChI Feeling,
www.reactivereports.com/40/40_3.html, Reactive
Reports, Sep 2004 (issue 40).
[12] OEChem, OpenEye Scientific Software, 2002.
[13] QuacPac, OpenEye Scientific Software, 2004.
The Morgan algorithm [1] is the basis of most
subsequent chemical canonicalization work, and
deserves careful study. In 1965 Harry L. Morgan
published the algorithm already implemented at
CAS for its compound registry system. This work,
based on generic graph theory, comprises a
theoretical solution to the problem of molecular
canonicalization, and material validation of its
efficacy.
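The core of the Morgan approach, the extended connectivity (EC) values, can be sketched in a few lines of plain Python. This is a textbook-style simplification, not the CAS or OEChem implementation: it starts from atom degrees, repeatedly replaces each value with the sum of its neighbors' values, and stops when the number of distinct values no longer increases. The function name and input layout are assumptions.

```python
def morgan_extended_connectivity(n_atoms, bonds):
    """Return Morgan-style extended connectivity values for each atom."""
    neighbors = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    ec = [len(nbrs) for nbrs in neighbors]          # start from atom degrees
    n_classes = len(set(ec))
    while True:
        new_ec = [sum(ec[j] for j in nbrs) for nbrs in neighbors]
        new_classes = len(set(new_ec))
        if new_classes <= n_classes:                # no further discrimination
            return ec
        ec, n_classes = new_ec, new_classes


# n-Pentane heavy atoms as a chain 0-1-2-3-4.
pentane_ec = morgan_extended_connectivity(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
print(pentane_ec)
```

For the pentane chain the two end atoms share one value and the two inner atoms share another. Such ties reflect symmetry, and a full canonical numbering still needs tie-breaking rules that this sketch does not include.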
More Morgan, and more
- The Morgan algorithm was a huge step forward, but
the basic algorithm has some shortcomings, in
performance and comprehensiveness, which have
been corrected by subsequent investigators. The
resulting methods have been implemented and
widely used in large scale database systems.
- Some key contributions:
  - Morgan, 1965 → note to Harry: You da man! → CAS
  - Wipke & Dyott, 1974 → stereo-enhanced Morgan →
    MDL
  - Jochum & Gasteiger, 1977 → Morgan refinement →
    CACTVS
  - Shelley & Munk, 1977 → Morgan refinement
  - Weininger, 1988 → CANSMI canonical line notation
    → Daylight
  - Bradshaw, 1998 → parent compounds → GSK, Daylight
  - Delany & Sayle, 1999 → tautomers → OpenEye
  - INChI, 2004 → global canonical line notation
Fig 4 example tautomers listed separately in
ACD98. The latter is the OE-canonical form.
Results: The Maybridge 2003 database was analyzed
by the OE program tautomers [13]. Of 71367
molecules, 97 have tautomers (47 pairs and one
triplet). Additionally, 2381 were found to be
non-unique molecules.
Dealing with reality: practical problems
- Existing formats may often be:
  - ambiguous: poorly defined spec or poor
    compliance
  - un-rigorous: both syntax and semantics are
    important
  - non-comprehensive: only organic, covalent, size
    limits
- Stereoisomer canonicalization remains difficult
  - "relative stereo-centers"
- Differing valence assumptions and conventions
  - implicit-valence and Hcount formats prone to
    mishandling
- Information content and model differences in
  existing formats
  - cannot robustly convert if info must be inferred
    (e.g. bonds)
- Disagreement over correct chemistry
  - e.g., valences, aromaticity
- Local versus global canonicalization
  - Benefits of canonicalization are available
    locally or globally. But global canonicalization
    requires cooperation.
  - Locality definition (time, place, software
    versions)
This study: canonical molecular descriptions, not
descriptors
Fig 5 tautomer triplet from Maybridge 2003
The study of graph theory and canonicalization
applied to chemistry is extensive and diverse.
Canonical descriptors which do not fully
represent the model can be of great utility in
statistical analyses but are not the focus of
this nomenclature study.
New canonical pKa states
The canonicalization of alternative pKa states is
accomplished for many classes of molecules by the
OpenEye program pkatyper [13]. This problem
resembles tautomer canonicalization in many
respects, and is an area of active research at
OpenEye.
3600 Cerrillos Road Suite 1107 Santa Fe, New
Mexico 87507
505.473.7385 info_at_eyesopen.com www.eyesopen.com