Canonicalized systematic nomenclature in
chemoinformatics
And some new canonicalization tools from OpenEye
Jeremy J. Yang
Morgan demo and study
Introduction
Canonicalization in chemoinformatics facilitates
rigorous, unambiguous expression and handling of
chemical data and knowledge. However, just as
chemistry encompasses multiple levels of
abstraction and modelling, no single
canonicalization method is sufficient to solve
all problems. This study reviews some existing
canonicalization methodology and describes new
methods implemented by the chemoinformatics library
OEChem and other OpenEye tools.
New canonicalizing molfiles
  • Canonicalizing a connection table is not new; it
    was discussed by Morgan [1] and others. But
    generating canonical forms of current standard
    formats is not widely done, for historical and
    practical reasons, despite the available
    benefits. Those benefits are increasingly
    attainable now that longer strings are more
    easily handled by current computers. OEChem
    provides sufficient control to accomplish this
    task. Proposed algorithm:
  • Remove non-structural data
  • Suppress hydrogens
  • Canonical atom order
  • Canonical bond order
  • Canonical Kekulé bonding based on the (selected)
    aromaticity model
  • However, the advantages of more terse canonical
    line notations remain.
  • RESULTS: Using the test program canmol.py, the
    1990-molecule NCI Diversity set was converted to
    canonical SDF files that were exactly equal to
    the SDF files converted via SMILES
    (demo.eyesopen.com/cgi-bin/canmol). The same was
    done with the MOL2 format. This test validates
    the ability of OEChem to canonicalize molfiles
    as strings.
OpenEye canonicalization tools
  • The OpenEye chemoinformatics toolkit OEChem [12]
    employs an optimized Morgan-like canonical
    algorithm to generate canonical SMILES. In
    addition, the API provides a rich set of tools
    which can facilitate generation of canonical
    representations of many types, for many chemical
    and informational models, and for many standard
    file formats.
  • OEChem: OECanonicalOrderAtoms()
  • OEChem: OECanonicalOrderBonds()
  • OEChem: aromaticity models (OE, Daylight, Tripos,
    MDL, MMFF)
  • OEChem: many file formats and flavors, low-level
    writers
  • QuacPac [13]: tautomers application and toolkit
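The proposed algorithm listed earlier (suppress hydrogens, canonical atom order, canonical bond order) can be illustrated with a minimal pure-Python sketch. It is not the canmol.py program and does not call OEChem. The function names, the ad hoc text output, and the brute-force search over all atom orderings are assumptions made for illustration, and the Kekulé step is omitted. The brute-force search is only practical for a handful of heavy atoms; a real toolkit would use a Morgan-like ranking instead.

```python
from itertools import permutations

def suppress_hydrogens(atoms, bonds):
    """Drop explicit H atoms and record each heavy atom's attached-H count."""
    heavy = [i for i, element in enumerate(atoms) if element != "H"]
    new_index = {old: new for new, old in enumerate(heavy)}
    hcount = [0] * len(heavy)
    kept_bonds = []
    for i, j, order in bonds:
        if i in new_index and j in new_index:
            kept_bonds.append((new_index[i], new_index[j], order))
        elif i in new_index:
            hcount[new_index[i]] += 1
        elif j in new_index:
            hcount[new_index[j]] += 1
    labels = [(atoms[old], hcount[new]) for new, old in enumerate(heavy)]
    return labels, kept_bonds

def canonical_record(atoms, bonds):
    """Return one text record that is the same for every input atom order."""
    labels, heavy_bonds = suppress_hydrogens(atoms, bonds)
    n = len(labels)
    best = None
    # Assumption: try every renumbering and keep the smallest description.
    for perm in permutations(range(n)):  # perm[old] = new position
        new_labels = [None] * n
        for old, new in enumerate(perm):
            new_labels[new] = labels[old]
        new_bonds = sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), order)
            for i, j, order in heavy_bonds
        )
        candidate = (tuple(new_labels), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    atom_block, bond_block = best
    lines = [f"{element} H{h}" for element, h in atom_block]
    lines += [f"{i + 1} {j + 1} {order}" for i, j, order in bond_block]
    return "\n".join(lines)

# Ethanol with explicit hydrogens, written in two different atom orders.
ethanol_a = (["C", "C", "O", "H", "H", "H", "H", "H", "H"],
             [(0, 1, 1), (1, 2, 1), (0, 3, 1), (0, 4, 1), (0, 5, 1),
              (1, 6, 1), (1, 7, 1), (2, 8, 1)])
ethanol_b = (["H", "O", "C", "H", "H", "C", "H", "H", "H"],
             [(0, 1, 1), (1, 2, 1), (2, 3, 1), (2, 4, 1), (2, 5, 1),
              (5, 6, 1), (5, 7, 1), (5, 8, 1)])
print(canonical_record(*ethanol_a) == canonical_record(*ethanol_b))
```

Because both inputs reduce to the same string, equality of molecules becomes equality of strings, which is the property the molfile test relies on.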
Fig 1: Morgan demo. Extended connectivity values
and atom orders. Uses OEChem and Ogham. The NCI
Diversity set was processed with no errors.
Fig 2: Morgan is slow due to symmetry.
Definition of canonicalization
A canonicalization algorithm must determine a
single representation among many possible
representations for an individual in its domain.
Conclusion
Rigorous and effective chemoinformatics systems
require concepts and methods for canonicalization
at multiple levels of chemical abstraction and
organization. The current state of the art
presents many theoretical and practical
challenges. OpenEye tools can help.
Fig 3: Morgan fails.
Benefits of canonicalization
  • testing equality of molecules
  • database search speed
  • rigorous informatics and thinking
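As a small illustration of the first two benefits, the sketch below indexes records by a canonical string so that equality testing and exact search become dictionary lookups. The record IDs, the strings, and the function names are hypothetical; the strings stand in for the output of whatever canonicalizer is in use.

```python
from collections import defaultdict

def build_index(records):
    """Map each canonical string to the list of record IDs that share it."""
    index = defaultdict(list)
    for record_id, canonical in records:
        index[canonical].append(record_id)
    return index

def duplicates(index):
    """Return the groups of IDs whose canonical strings are identical."""
    return sorted(ids for ids in index.values() if len(ids) > 1)

# Hypothetical records: (ID, canonical string) pairs.
records = [("m1", "CCO"), ("m2", "c1ccccc1"), ("m3", "CCO")]
index = build_index(records)
print(index.get("CCO", []))  # exact-match search is a single hash lookup
print(duplicates(index))     # records that describe the same molecule
```

Without a canonical form, the same search would need a graph comparison against every stored structure.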

New canonical tautomers
Tautomers have the same formula (they are
structural isomers) but may differ in proton and
electron location and in formal bond order.
Special cases: keto/enol, zwitterion, ring-chain.
In the Delany/Sayle algorithm [8,13], hydrogen
donors and acceptors are perceived, and the
number of free hydrogens is counted. Donor and
acceptor atoms are ordered canonically. At this
stage all tautomerically equivalent inputs are
represented identically. Hydrogen locations are
then exhaustively enumerated. A simple ruleset
for enumeration order can designate the first to
be the canonical tautomer. Through additional
rules, the likelihood can be increased that the
canonical tautomer is a low-energy form.
Applications: registration (exact search),
substructure searching, property prediction,
similarity/clustering, protein-ligand analysis.
Failure to perceive tautomerism leads to
different results for different valence models
which really represent the same chemical entity.
N! (graph isomorphism is hard): Morgan to the
rescue
Aha! -- Chemo-taxonomy is a stranded hierarchy
  • subatomic → atoms → molecules
  • normal weight atoms → isotopes
  • Kekulé molecule model → aromatic molecule models
  • non-stereo molecule → stereoisomers
  • single molecule → combinatorial libraries
  • single molecule → queries
  • small molecule → macromolecule, cofactors,
    ligands
  • single molecule → Markush structures
  • single molecule → tautomer set
  • single molecule → pKa states
  • single molecule → reactions
  • 2D → 3D
There is a hierarchical relationship among some
of these expansions, while others are independent.
For example, a combinatorial library may involve
stereoisomeric or non-stereo individuals. For
every combination of molecular representations,
canonicalization could be advantageous for the
reasons described. Hence the task of
canonicalization is a multi-faceted one.
References
  • [1] H. L. Morgan, "Generation of a unique machine
    description for chemical structures - A technique
    developed at Chemical Abstracts Service", J.
    Chem. Doc. 1965, 5, 107.
  • [2] W. Todd Wipke and Thomas M. Dyott,
    "Stereochemically unique naming algorithm", J. Am.
    Chem. Soc. 1974, 96(15), 4834-4842.
  • [3] Clemens Jochum and Johann Gasteiger, "Canonical
    Numbering and Constitutional Symmetry", J. Chem.
    Inf. Comput. Sci. 1977, 17(2), 113-117.
  • [4] Craig A. Shelley and Morton E. Munk, "Computer
    Perception of Topological Symmetry", J. Chem. Inf.
    Comput. Sci. 1977, 17(2), 110-113.
  • [5] Craig A. Shelley and Morton E. Munk, "An Approach
    to the Assignment of Canonical Connection Tables
    and Topological Symmetry Perception", J. Chem.
    Inf. Comput. Sci. 1979, 19(4), 247-250.
  • [6] David Weininger, Arthur Weininger and Joseph L.
    Weininger, "SMILES. 2. Algorithm for Generation of
    Unique SMILES Notation", J. Chem. Inf. Comput.
    Sci. 1989, 29(2), 97-101.
  • [7] "A beginner's guide to responsible parenting or
    knowing your roots",
    www.daylight.com/meetings/emug98/Bradshaw/,
    EuroMUG '98, Cambridge, UK, Oct 1998.
  • [8] Jack Delany and Roger Sayle, "Canonicalization
    and Enumeration of Tautomers",
    www.daylight.com/meetings/emug99/Delany/taut_html/sld001.htm,
    EuroMUG '99, Cambridge, UK, Oct 1999.
  • [9] Roger Sayle and Geoff Skillman, "Hooked on
    Protonics",
    www.eyesopen.com/about/events/presentations/acs02/sld001.htm,
    224th ACS National Meeting, Boston, Aug 2002.
  • [10] John Bradshaw, "Introduction to Chemical Info
    Systems",
    www.daylight.com/meetings/emug02/Bradshaw/Training/,
    EuroMUG '02, 24th-26th September 2002, Cambridge,
    UK.
  • [11] "That INChI Feeling",
    www.reactivereports.com/40/40_3.html, Reactive
    Reports, Sep 2004 (issue 40).
  • [12] OEChem, OpenEye Scientific Software, 2002.
  • [13] QuacPac, OpenEye Scientific Software, 2004.

The Morgan algorithm [1] is the basis of most
subsequent chemical canonicalization work and
deserves careful study. In 1965 Harry L. Morgan
published the algorithm already implemented at
CAS for its compound registry system. This work,
based on generic graph theory, comprises a
theoretical solution to the problem of molecular
canonicalization and a material validation of its
efficacy.
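The extended connectivity values shown in the Morgan demo can be computed with a few lines of Python. This is a minimal sketch of the commonly described form of the procedure: start from each atom's degree, repeatedly replace each value with the sum of its neighbours' values, and stop when the number of distinct values no longer increases. The function name and the adjacency-list input are assumptions, and the subsequent atom numbering and tie-breaking steps are not shown.

```python
def morgan_extended_connectivity(neighbors):
    """Return extended connectivity (EC) values for a hydrogen-suppressed graph.

    neighbors[i] is the list of atom indices bonded to atom i.
    """
    ec = [len(nbrs) for nbrs in neighbors]  # initial value: atom degree
    n_classes = len(set(ec))
    while True:
        new_ec = [sum(ec[j] for j in nbrs) for nbrs in neighbors]
        new_classes = len(set(new_ec))
        if new_classes <= n_classes:
            # No further discrimination: keep the previous values.
            return ec
        ec, n_classes = new_ec, new_classes

# n-Pentane heavy-atom skeleton: C0-C1-C2-C3-C4
pentane = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(morgan_extended_connectivity(pentane))
```

For the pentane chain the two terminal atoms share one value, the two atoms next to them share another, and the central atom has its own, which matches the symmetry of the skeleton. Atoms that end up with equal values still have to be ordered by additional rules.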
More Morgan, and more
  • The Morgan algorithm was a huge step forward, but
    the basic algorithm has some shortcomings in
    performance and comprehensiveness, which have
    been addressed by subsequent investigators. The
    resulting methods have been implemented and
    widely used in large-scale database systems.
    Some key contributions:
  • Morgan, 1965 → (note to Harry: You da man!) →
    CAS
  • Wipke & Dyott, 1974 → stereo-enhanced Morgan →
    MDL
  • Jochum & Gasteiger, 1977 → Morgan refinement →
    CACTVS
  • Shelley & Munk, 1977 → Morgan refinement
  • Weininger, 1989 → CANSMI canonical line notation
    → Daylight
  • Bradshaw, 1998 → parent compounds → GSK, Daylight
  • Delany & Sayle, 1999 → tautomers → OpenEye
  • INChI, 2004 → global canonical line notation

Fig 4: Example tautomers listed separately in
ACD98. The latter is the OE-canonical form.
Results: The Maybridge 2003 database was analyzed
by the OE program tautomers [13]. Of 71367
molecules, 97 have tautomers (47 pairs and one
triplet). Additionally, 2381 were found to be
non-unique molecules.
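The canonical tautomer selection described under "New canonical tautomers" (perceive the donor and acceptor sites, count the free hydrogens, order the sites canonically, enumerate hydrogen placements, take the first) can be sketched as a toy model. This is not the tautomers program or the QuacPac API. The function names, the site labels, the restriction to at most one mobile hydrogen per site, and the is_valid hook (a placeholder for the bond-order check a real implementation needs) are all assumptions. Sorting the labels stands in for a canonical atom ordering, so which form comes first here says nothing about energy.

```python
from itertools import combinations

def perceive(h_on_site):
    """Reduce one input tautomer to its tautomer-independent description.

    h_on_site maps a site label to 0 or 1 mobile hydrogens on that site.
    """
    sites = sorted(h_on_site)       # stand-in for canonical site ordering
    n_hydrogens = sum(h_on_site.values())
    return sites, n_hydrogens

def canonical_tautomer(h_on_site, is_valid=lambda placement: True):
    """Return the first valid hydrogen placement in enumeration order."""
    sites, n_hydrogens = perceive(h_on_site)
    for placement in combinations(sites, n_hydrogens):
        if is_valid(placement):
            return placement
    return None

# Two hypothetical inputs for the same system: the mobile H sits on N1 or O2.
form_a = {"N1": 1, "O2": 0}
form_b = {"N1": 0, "O2": 1}
print(canonical_tautomer(form_a) == canonical_tautomer(form_b))
```

Both inputs collapse to the same site list and hydrogen count before enumeration starts, which is why they yield the same canonical placement. Extra rules would be expressed through the enumeration order or the is_valid hook.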
Dealing with reality: practical problems
  • Existing formats may often be:
      • ambiguous: poorly defined spec or poor
        compliance
      • un-rigorous: both syntax and semantics are
        important
      • non-comprehensive: only organic, covalent, size
        limits
  • Stereoisomer canonicalization remains difficult
      • "relative stereo-centers"
  • Differing valence assumptions and conventions
      • implicit-valence and Hcount formats are prone
        to mishandling
  • Information content and model differences in
    existing formats
      • cannot robustly convert if info must be
        inferred (e.g. bonds)
  • Disagreement over correct chemistry
      • e.g., valences, aromaticity
  • Local versus global canonicalization
      • Benefits of canonicalization are available
        locally or globally, but global
        canonicalization requires cooperation.
      • Locality definition (time, place, software
        versions)

Fig 5: Tautomer triplet from Maybridge 2003.
This study: canonical molecular descriptions, not
descriptors
The study of graph theory and canonicalization
applied to chemistry is extensive and diverse.
Canonical descriptors which do not fully
represent the model can be of great utility in
statistical analyses but are not the focus of
this nomenclature study.
New canonical pKa states
The canonicalization of alternative pKa states is
accomplished for many classes of molecules by the
OpenEye program pkatyper [13]. This problem
resembles tautomer canonicalization in many
respects, and is an area of active research at
OpenEye.
3600 Cerrillos Road, Suite 1107, Santa Fe, New
Mexico 87507
505.473.7385 info_at_eyesopen.com www.eyesopen.com