1
Talking to Chemists: When louder isn't enough
  • Roger Sayle
  • OpenEye Scientific Software,
  • Santa Fe, New Mexico

2
The Language of Chemistry
  • Computer programs describe molecules to each
    other using a bewildering variety of connection
    table file formats or line notations.
  • Chemists, however, communicate amongst themselves
    using words and pictures.
  • The graphical representation of molecules and the
    sophisticated symbolic names assigned to
    compounds form the standard method of defining a
    compound in the scientific literature.

3
The Four Disciplines
  • Compound Naming
  • Name Interpretation
  • Rendering/Display
  • Coordinate Assignment

4
A picture is worth a 1000 names
  • Although the field of chemical nomenclature has
    been (is being) standardized by IUPAC, there is
    still no such thing as a single unique or
    preferred name for a compound.
  • Preferred names, systematic names, acceptable
    names, trivial names, disapproved names.
  • IUPAC 79, IUPAC 93, CAS, CAS Index, Beilstein.
  • Multiple supported naming methods: natural
    products, radicofunctional, replacement,
    conjunctive, multiplicative and inorganic
    nomenclature.
  • SSSR and CIP ambiguity.
  • MDL/Beilstein AutoNom and ACD Labs ACD/Name.

5
Simple Examples
  • Pentanoic acid, Valeric acid, Butane-1-carboxylic
    acid, 1-Oxo-1-pentanol, 1-Carboxybutane.
  • N-Methyl aniline, N-Methyl phenylamine,
    Methyl-phenyl-amine.

6
More Complex Examples
  • Dibenzothiophene 5-oxide, 9-Thiafluoren-9-one,
    9λ⁴-Thiafluoren-9-one, 9-Oxo-9-thiafluorene.
  • Isopropenylbenzene, Prop-1-en-2-ylbenzene (IUPAC),
    (1-Methylvinyl)benzene, (1-Methylethenyl)benzene
    (CAS), α-Methylstyrene.

7
English: British vs. American
  • Sulfur vs. Sulphur.
  • Also sulfanyl, sulfonamide, sulfuric, etc.
  • Aluminium vs. Aluminum.
  • Caesium vs. Cesium.

8
The Language Barrier
  • English: 4-Chlorophenoxyacetic acid.
  • German: 4-Chlor-phenoxy-essigsäure.
  • French: Acide 4-chloro-phénoxyacétique.
  • Italian: Acido 4-chloro-fenossiacetico.
  • Spanish: Acido 4-clorofenoxiacético.
  • Dutch: 4-Chloor-fenoxy-azijnzuur.
  • Greek: 4-Χλωροφαινοξυοξεικό οξύ.

9
Chemistry: The Universal Language?
  • bIQ: water
  • bIQSIp: hydrogen

Formerly, biq and biqsip in the klinzhai script
system.
Ref: The Klingon Language Institute
(http://www.kli.org)

10
IUPAC Structure Naming 101
For each connected component
    Handle exceptions/inorganics by table lookup
    Handle single heavy atom specially
    Perform functional group atom typing
    Identify senior characteristic group (suffix)
    If not parent functional group
        Identify parent chain/ring system
        Lookup ring system name and numbering
        Systematically name/number rings/chains
    Special case parent/suffix combinations
    Recursively name each substitution

11
Characteristic Group Priority
  • Charged Groups (anions, cations)
  • Acids (carboxylic, sulfonic, sulfinic, phosphorus
    acids)
  • Acid derivatives (anhydrides)
  • Acid halides
  • Amides (sulfonamides)
  • Nitriles (cyanides, isocyanides)
  • Aldehydes (thioaldehydes)
  • Ketones (thioketones, selenoketones,
    telluroketones)
  • Alcohols (phenols, thioalcohols, selenols,
    tellurols)
  • Amines (phosphoranes, phosphanes, boranes,
    silanes)
  • Imines

Of course, IUPAC and Chemical Abstracts disagree
on these!
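
The priority list above can be read as a simple ranking: the senior characteristic group is the highest-ranked class present in the molecule, and it supplies the suffix. A minimal sketch, assuming the classes have already been detected by atom typing; the class labels and the `senior_group` helper are hypothetical names for illustration, not the OpenEye implementation, and the order simply copies this slide rather than any particular IUPAC or CAS rule set.

```python
# Characteristic group classes in the order given on the slide
# (highest priority first). Labels are illustrative only.
PRIORITY = [
    "charged group",
    "acid",
    "acid derivative",
    "acid halide",
    "amide",
    "nitrile",
    "aldehyde",
    "ketone",
    "alcohol",
    "amine",
    "imine",
]
RANK = {name: i for i, name in enumerate(PRIORITY)}


def senior_group(groups):
    """Return the highest-priority class among those found, or None."""
    known = [g for g in groups if g in RANK]
    if not known:
        return None
    return min(known, key=RANK.__getitem__)


# A molecule carrying an alcohol, a ketone and an amine is named as a ketone;
# the other groups would be cited as prefixes.
print(senior_group(["alcohol", "ketone", "amine"]))
```

Keeping the ranking in a table makes it easy to swap in a different ordering where IUPAC and Chemical Abstracts disagree.
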
12
Worked Examples 1
  • 2-acetyloxybenzoic acid
  • 4-dimethylamino-1,5-dimethyl-2-phenyl-1,2-dihydropyrazol-3-one

13
Worked Examples 2
  • N-[2-[[5-(dimethylaminomethyl)-2-furyl]methylsulfanyl]ethyl]-N'-methyl-2-nitro-ethene-1,1-diamine
    hydrochloride
  • 2-[4-(2-hydroxy-3-isopropylamino-propoxy)phenyl]acetamide

14
Functional Group/Library Naming
  • Diazonio: [N+]#N
  • Ammoniumylideneamino: N=[NH2+]
  • 2-R1-6-R2carbonyl-benzoic acid:
    [R1]c1cccc(c1C(=O)O)C(=O)[R2]

15
Naming Performance
  • OpenEye's structure naming code currently assigns
    200853 names (that don't contain BLAH) to the
    250251 compounds in the NCI00 database (80.26%).
  • This compares well to the 45228 names (including
    identifiers) assigned/annotated by NCI (18.07%).
  • This includes a number of inorganic compounds
    covering the entire periodic table.
  • Most drug-like molecules, including aspirin,
    acyclovir, alprenolol, aminopyrine, atenolol,
    caffeine and ranitidine are assigned names
    identical to those in the Merck index.
  • On a 1.5 GHz Pentium, naming NCI00 takes 5m13 (800
    mol/s).
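
The figures on this slide can be rechecked with a few lines of arithmetic. The sketch below only restates the slide's own numbers; `percent` and `rate` are hypothetical helper names.

```python
def percent(part, whole):
    """Share of `whole` covered by `part`, as a percentage to 2 decimals."""
    return round(100.0 * part / whole, 2)


def rate(count, minutes, seconds):
    """Throughput in molecules per second for a run of minutes:seconds."""
    return count / (60 * minutes + seconds)


named = percent(200853, 250251)        # names assigned by the naming code
annotated = percent(45228, 250251)     # names/identifiers supplied by NCI
throughput = rate(250251, 5, 13)       # 5m13 for the whole database

print(named, annotated, round(throughput))
```
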

16
Restrictions and Limitations?
"Never get in a pissing fight with a skunk." -
Ancient proverb
ACD Labs' ACD/Name and Beilstein/MDL's AutoNom
are currently superior products for compound
naming. However, OpenEye's tools v1.0 are
probably the best naming tools that someone has
yet to pay for!

17
AutoNom Failures
18
Compound Name Parsing 101
  • Do exactly the same but in reverse?
  • The process is similar to compiler technology.
  • First break up the compound name into a sequence
    of lexemes or tokens (lexical analysis).
  • Process the stream of tokens checking rules of
    grammar (parsing).
  • Check substitution/replacement locants and bond
    orders (semantic analysis/type checking).
  • Construct functional groups, linkers and ring
    systems, either as you go or from an abstract
    syntax tree.

19
Chemical Naming Lexemes
  • A
  • Acenaphthen
  • Acenaphthylen
  • Acet
  • Acetamido
  • Acetone
  • Acetonyl
  • Acetophenone
  • Acetoxy
  • Acetylene
  • Acid
  • Acridarsin
  • Acridin
  • Acridophosphin
  • Valpro
  • Vanadium
  • Vinyl
  • Water
  • Xenon
  • Xylene
  • Xylol
  • Yl
  • Yn
  • Ytterbium
  • Yttrium
  • Z
  • Zinc
  • Zirconium

Current software recognizes over 672 distinct
tokens.
20
Difficult Cases
  • Chloroform vs. Chloroformic acid.
  • Pentanitrobenzene vs. Pentanylbenzene.
  • Chloronioborane vs. Chloroniobium.
  • Hexanethiol (Hex-an-e-thiol)
  • Not hex-an-eth-??? nor hex-a-???
  • Iododecane vs. Iodoniododecane.
  • Phonetically, Fluorine vs. Fluorene.
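
The hexanethiol case shows why a purely greedy longest-match lexer is not enough: taking "eth" after "hex-an" leaves an unparseable remainder, so the lexer has to back up and take "e" instead. A minimal sketch of both strategies, assuming a tiny hypothetical lexicon of six tokens (the real token table is far larger) and hypothetical function names `greedy` and `tokenize`:

```python
# Toy lexicon for illustration only; not the real token table.
LEXICON = {"hex", "a", "an", "e", "eth", "thiol"}


def greedy(name, lexicon=LEXICON):
    """Longest-match lexing with no backtracking; '???' marks a dead end."""
    name = name.lower()
    tokens, pos = [], 0
    while pos < len(name):
        for end in range(len(name), pos, -1):
            if name[pos:end] in lexicon:
                tokens.append(name[pos:end])
                pos = end
                break
        else:
            return tokens + ["???"]
    return tokens


def tokenize(name, lexicon=LEXICON):
    """Backtracking lexer: prefer long tokens, retreat when the rest fails."""
    name = name.lower()
    dead = set()  # positions from which no complete segmentation exists

    def walk(pos):
        if pos == len(name):
            return []
        if pos in dead:
            return None
        for end in range(len(name), pos, -1):
            if name[pos:end] in lexicon:
                rest = walk(end)
                if rest is not None:
                    return [name[pos:end]] + rest
        dead.add(pos)
        return None

    return walk(0)


print("-".join(greedy("Hexanethiol")))    # hex-an-eth-???
print("-".join(tokenize("Hexanethiol")))  # hex-an-e-thiol
```

Recording the dead positions keeps the backtracking from re-exploring the same failing suffix, which matters once the lexicon holds hundreds of overlapping tokens.
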

21
How's your grammar?
  • Fortunately, most lexemes fall into common
    classes that may be parsed interchangeably, much
    like the categorization of words into parts of
    speech: nouns, verbs, adjectives, etc.
  • Chloro, Allyl, Amyl, Amidino, Mesyl
  • Diazo, oxo, thioxo, vinylidene
  • Azo, carbonyl, oxy, peroxy
  • Acet, butyr, form, cinnam, nicotin, salicyl,
    stear, valer
  • Alcohol, amine, azide, cyanate, ketone, selenol,
    sulfone
  • Acetone, borane, cyanide, hydrazine, isourea
  • Azulen, cuban, indol, morpholin, oxazol, purin,
    pyrrol...

22
Name Parsing Performance
  • 1135 compounds out of the 1831 names in the xlogp
    training set (61.99%).
  • 382 compounds out of 404 names in Peter Kollman's
    desolvation energy test set (94.55%).
  • 38 compounds out of 67 names found in a
    pharmaceutical patent benchmark provided by
    Astra-Zeneca (56.72%).
  • 104125 compounds out of the 200483 names
    assigned to NCI00 (51.93% without BLAH, 41.61% of
    the dataset).
  • A 1.5 GHz Pentium can parse 250251 names
    (including BLAHs) from NCI00 in 18s (13600
    mol/s).

23
Rendering: Chemistry's Handwriting
  • Display of atomic symbol, isotopic mass, formal
    charge.
  • Display and positioning of implicit hydrogen
    count.
  • Display of superatoms, aliases and atom labels.
  • Suppress display of explicit carbon hydrogens.

24
Bond Order
  • Single, double, triple and quadruple bonds.
  • Asymmetric or central placement of double bonds.
  • Arrow bonds for supervalent/charge separated
    groups.

25
Representations of Aromaticity
Alternate representations of aromatic cycles.
26
Color and Titles
27
Structure Diagram Generation/Depiction
For each connected component
    Divide molecule into ring systems and chains
    For each ring system
        Lookup group graph in dictionary (RTD)
        Lookup graph in ring template dictionary
        Layout fused/spiro systems algorithmically
    Assign local co-ordinates to cis/trans bonds.
    Assign local co-ordinates to unvisited atoms.
    Resolve clashes via heuristic search.
    Convert internal polar co-ordinates to Cartesian.

28
Ring Template Dictionary
The current default ring template dictionary
contains 1027 templates (including 94 group graph
specializations).
29
Orientations from the RTD
In addition to the layout of a ring system,
group graph templates provide orientation
information for heterocycles.
30
Internal Polar Coordinate System
  • Internally, the graph layout code uses an
    internal co-ordinate representation, much like
    a z-matrix.
  • Each atom has a reference bond, and all other
    incident bonds are specified as clockwise angles.
  • Each bond records its length and, cleverly, a flip
    bit to invert the co-ordinate frame between ends.
  • Conversion from Cartesian requires determining
    lengths and angles.
  • Conversion to Cartesian by depth-first traversal.
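
The conversion to Cartesian can be sketched as a depth-first walk that carries a heading and a rotation sense from atom to atom. This is only an illustration under stated assumptions: the molecule is an acyclic tree, each child bond stores a clockwise angle measured from the bond back to the parent, the root's missing reference bond is taken to point along the negative x axis, and a set flip bit is interpreted as reversing the clockwise sense beyond that bond. The data layout and the name `to_cartesian` are hypothetical.

```python
import math


def to_cartesian(children, root=0):
    """Convert internal (angle, length, flip) records to x/y positions.

    children[a] is a list of (b, angle_deg, length, flip) tuples, one per
    bond leading away from atom a. Assumed layout, for illustration only.
    """
    coords = {root: (0.0, 0.0)}
    # Each stack entry: atom, direction of its reference bond (radians),
    # and the current sense (+1 clockwise, -1 after an odd number of flips).
    stack = [(root, math.pi, 1)]
    while stack:
        atom, back, sense = stack.pop()
        x, y = coords[atom]
        for nbr, angle, length, flip in children.get(atom, []):
            # With y pointing up, a clockwise turn decreases the angle.
            heading = back - sense * math.radians(angle)
            coords[nbr] = (x + length * math.cos(heading),
                           y + length * math.sin(heading))
            # The child's reference bond points back along this bond.
            stack.append((nbr, heading + math.pi, -sense if flip else sense))
    return coords


# Four-atom chain: every bond asks for 120 degrees, and the flip bits make
# the turns alternate, giving an extended zigzag.
chain = {
    0: [(1, 180.0, 1.0, False)],
    1: [(2, 120.0, 1.0, True)],
    2: [(3, 120.0, 1.0, True)],
}
coords = to_cartesian(chain)
for atom in sorted(coords):
    print(atom, coords[atom])
```

Under this interpretation, toggling a single flip bit mirrors everything beyond that bond without touching any stored angle.
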

31
Local Graph Layout Heuristics
  • By default, all bonds have unit length.
  • Exocyclic bonds equally subdivide the sector of
    maximum Circular Free Sweep (CFS).
  • Atoms of degree four or higher distribute their
    neighbors evenly (degree 4: 90º, degree 5: 72º,
    degree 6: 60º).
  • sp-hybridized atoms (triple bonded and allenic)
    and atoms of degree two between atoms of degree
    four (or more) are linear, 180º.
  • Remaining atoms prefer angles of 120º.
  • Chains are constructed extended by alternating
    120º/-120º.
  • Degree four atoms prefer para terminal
    (symmetric) atoms.
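
Two of these heuristics are easy to express directly: the preferred angle for an atom, and the placement of new bonds in the largest free sector around an atom. The sketch below is an illustration only; angles are in degrees, directions are measured counter-clockwise from the x axis, and the function names are hypothetical.

```python
def preferred_angle(degree, linear=False):
    """Preferred angle between neighbouring bonds, following the slide."""
    if degree >= 4:
        return 360.0 / degree   # 90, 72, 60 for degree 4, 5, 6
    if linear:
        return 180.0            # sp atoms and similar linear cases
    return 120.0


def largest_free_sector(occupied):
    """Return (start, width) of the widest gap between occupied directions."""
    if not occupied:
        return 0.0, 360.0
    angles = sorted(a % 360.0 for a in occupied)
    best_start, best_width = angles[0], 0.0
    for i, start in enumerate(angles):
        nxt = angles[(i + 1) % len(angles)]
        width = (nxt - start) % 360.0 or 360.0  # a lone bond leaves 360
        if width > best_width:
            best_start, best_width = start, width
    return best_start, best_width


def place_exocyclic(occupied, count):
    """Directions for `count` new bonds, equally subdividing the free sector."""
    start, width = largest_free_sector(occupied)
    step = width / (count + 1)
    return [(start + step * (i + 1)) % 360.0 for i in range(count)]


# A ring atom whose two ring bonds point at 0 and 120 degrees has a
# 240-degree free sector on the outside of the ring.
print(place_exocyclic([0.0, 120.0], 1))
print(place_exocyclic([0.0, 120.0], 2))
```
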

32
Clash Resolution Protocols
  • A clash is considered to be any pair of atoms
    within 0.866 units (i.e. d² < 0.75).
  • 211017 out of 250251 compounds in NCI00 have no
    clashes by construction (84.32%).
  • Clashes are resolved by an iterative process.
  • 1. All non-ring, non-terminal bonds are
    flipped/inverted; if the number of clashes
    decreases the flip is retained, and if the
    number of clashes reaches zero, we're done.
  • 2. If there were any changes in the previous
    pass, repeat.
  • 3. Flip all pairs of the above bonds.

33
Clash Resolution Protocols 2
  • 4. Try placing exo-bonds from ring fusion atoms
    at the bisection of each sector, not just the
    largest free sector.
  • 5. For non-ring atoms of degree four, try all
    possible permutations of the neighbors.
  • 6. For ring atoms of degree four, with two exo
    bonds, try swapping their order relative to the
    ring.
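
Steps 1 and 2 of the protocol amount to a greedy search over single bond flips. A minimal sketch, assuming 2D coordinates stored per atom and a precomputed list of flippable bonds, each with the set of atoms on its far side; a flip is modelled here as mirroring those atoms across the bond axis. The data layout and function names are hypothetical, and steps 3 to 6 are not covered.

```python
from itertools import combinations

CLASH_D2 = 0.75  # squared distance threshold, i.e. about 0.866 units


def count_clashes(coords):
    """Number of atom pairs closer than the clash threshold."""
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(coords.values(), 2)
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 < CLASH_D2
    )


def reflect(p, a, b):
    """Mirror point p across the line through points a and b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / (dx * dx + dy * dy)
    fx, fy = a[0] + t * dx, a[1] + t * dy  # foot of the perpendicular
    return (2 * fx - p[0], 2 * fy - p[1])


def flip_bond(coords, a, b, side):
    """Return new coordinates with the atoms in `side` mirrored across a-b."""
    flipped = dict(coords)
    for atom in side:
        flipped[atom] = reflect(coords[atom], coords[a], coords[b])
    return flipped


def resolve_clashes(coords, flippable):
    """Keep any single flip that lowers the clash count; repeat while improving."""
    best = count_clashes(coords)
    changed = True
    while best > 0 and changed:
        changed = False
        for a, b, side in flippable:
            trial = flip_bond(coords, a, b, side)
            n = count_clashes(trial)
            if n < best:
                coords, best, changed = trial, n, True
                if best == 0:
                    break
    return coords, best


# Five-atom chain curled so that atom 4 lands on top of atom 0.
start = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0), 3: (0.0, 1.0), 4: (0.0, 0.0)}
bonds = [(1, 2, {3, 4}), (2, 3, {4})]  # non-ring, non-terminal bonds
fixed, remaining = resolve_clashes(start, bonds)
print(count_clashes(start), remaining)
```

Comparing squared distances against 0.75 avoids a square root per atom pair, which is why the threshold is quoted in both forms.
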
34
Some Difficult Depictions
35
Depiction Performance
  • 223135 out of 250251 structures in NCI00 (89.16%)
    contain no clashes.
  • Many of the 27116 failures are caused by inorganic
    and organometallic co-ordination chemistry.
  • Some of these failures are even caused by ring
    templates!
  • On a 1.5 GHz Pentium, co-ordinates can be assigned
    to the 250251 molecules in NCI00 in 3m05
    (including 24s file I/O), i.e. 1350 mol/s.
  • This is easily fast enough to handle the
    "depict as you type" functionality supported by
    recent versions of the OEChem toolkits.

36
Acknowledgements
  • Rest of OpenEye's OEChem Team (Matt Stahl, Geoff
    Skillman).
  • Bob Tolbert (for flview, peruse, tad and mvi).
  • Jeremy Yang (for demo.eyesopen.com).
  • HP/Compaq, IBM and SGI for providing development
    hardware and compilers.
  • Daylight, Wolf-Dietrich Ihlenfeldt and Harold
    Helson for previous work on depiction.
  • ACD Labs and Beilstein (MDL) for previous work on
    naming and interpretation.