Title: Talking to Chemists: When louder isn't enough
1Talking to Chemists: When louder isn't enough
- Roger Sayle
- OpenEye Scientific Software,
- Santa Fe, New Mexico
2The Language of Chemistry
- Computer programs describe molecules to each
other using a bewildering variety of connection
table file formats or line notations.
- Chemists, however, communicate amongst themselves
using words and pictures.
- The graphical representation of molecules and the
sophisticated symbolic names assigned to
compounds form the standard method of defining a
compound in the scientific literature.
3The Four Disciplines
- Compound Naming
- Name Interpretation
- Rendering/Display
- Coordinate Assignment
4A picture is worth a 1000 names
- Although the field of chemical nomenclature has
been (is being) standardized by IUPAC, there is
still no such thing as a single unique or
preferred name for a compound.
- Preferred names, systematic names, acceptable
names, trivial names, disapproved names.
- IUPAC 79, IUPAC 93, CAS, CAS Index, Beilstein.
- Multiple supported naming methods: natural
products, radicofunctional, replacement,
conjunctive, multiplicative and inorganic
nomenclature.
- SSSR and CIP ambiguity.
- MDL/Beilstein AutoNom and ACD Labs ACD/Name.
5Simple Examples
- Pentanoic acid; Valeric acid; Butane-1-carboxylic acid; 1-Oxo-1-pentanol; 1-Carboxybutane
- N-Methyl aniline; N-Methyl phenylamine; Methyl-phenyl-amine
6More Complex Examples
- Dibenzothiophene 5-oxide; 9-Thiafluoren-9-one; 9λ⁴-Thiafluoren-9-one; 9-Oxo-9-thiafluorene
- Isopropenylbenzene; Prop-1-en-2-ylbenzene (IUPAC); (1-Methylvinyl)benzene; (1-Methylethenyl)benzene (CAS); α-Methylstyrene
7English: British vs. American
- Sulfur vs. Sulphur.
- Also sulfanyl, sulfonamide, sulfuric etc...
- Aluminium vs. Aluminum.
- Caesium vs. Cesium.
8The Language Barrier
- English: 4-Chlorophenoxyacetic acid.
- German: 4-Chlor-phenoxy-essigsäure.
- French: Acide 4-chloro-phénoxyacétique.
- Italian: Acido 4-cloro-fenossiacetico.
- Spanish: Acido 4-clorofenoxiacético.
- Dutch: 4-Chloor-fenoxy-azijnzuur.
- Greek: 4-Χλωροφαινοξυοξεικό οξύ.
9Chemistry: The Universal Language?
- BIQ (bIQ): water
- BIQSIP (bIQSIp): hydrogen
Formerly, biq and biqsip in the klinzhai script
system.
Ref: The Klingon Language Institute
(http://www.kli.org)
10IUPAC Structure Naming 101
For each connected component
  Handle exceptions/inorganics by table lookup
  Handle single heavy atom specially
  Perform functional group atom typing
  Identify senior characteristic group (suffix)
  If not parent functional group
    Identify parent chain/ring system
    Lookup ring system name and numbering
    Systematically name/number rings/chains
  Special case parent/suffix combinations
  Recursively name each substitution
11Characteristic Group Priority
- Charged Groups (anions, cations)
- Acids (carboxylic, sulfonic, sulfinic, phosphorus
acids)
- Acid derivatives (anhydrides)
- Acid halides
- Amides (sulfonamides)
- Nitriles (cyanides, isocyanides)
- Aldehydes (thioaldehydes)
- Ketones (thioketones, selenoketones,
telluroketones)
- Alcohols (phenols, thioalcohols, selenols,
tellurols)
- Amines (phosphoranes, phosphanes, boranes,
silanes)
- Imines
Of course, IUPAC and Chemical Abstracts disagree
on these!
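The priority list above is what decides which group becomes the suffix. A minimal Python sketch, assuming the functional groups of a molecule have already been typed into the class labels used below. The labels, `PRIORITY` and `senior_group` are hypothetical names for illustration, not OpenEye's code, and the order simply follows this slide rather than any particular IUPAC or CAS edition.

```python
# Characteristic group classes, most senior first (order taken from the slide).
PRIORITY = [
    "charged", "acid", "acid derivative", "acid halide", "amide",
    "nitrile", "aldehyde", "ketone", "alcohol", "amine", "imine",
]
RANK = {name: i for i, name in enumerate(PRIORITY)}


def senior_group(groups):
    """Return the most senior class present, or None if none is ranked."""
    ranked = [g for g in groups if g in RANK]
    return min(ranked, key=RANK.__getitem__) if ranked else None


print(senior_group(["alcohol", "ketone", "amine"]))  # ketone
```

Groups that lose out (here the alcohol and the amine) would then be named as prefixes.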
12Worked Examples 1
- 2-acetyloxybenzoic acid
- 4-dimethylamino-1,5-dimethyl-2-phenyl-1,2-dihydropyrazol-3-one
13Worked Examples 2
- N-[2-[[5-(dimethylaminomethyl)-2-furyl]methylsulfanyl]ethyl]-N'-methyl-2-nitro-ethene-1,1-diamine hydrochloride
- 2-[4-(2-hydroxy-3-isopropylamino-propoxy)phenyl]acetamide
14Functional Group/Library Naming
- Diazonio: [N+]#N
- Ammoniumylideneamino: N=[NH2+]
- 2-[R1]-6-[R2]carbonyl-benzoic acid: [R1]c1cccc(c1C(=O)O)C(=O)[R2]
15Naming Performance
- OpenEye's structure naming code currently assigns
200853 names (that don't contain BLAH) to the
250251 compounds in the NCI00 database (80.26%).
- This compares well to the 45228 names (including
identifiers) assigned/annotated by NCI (18.07%).
- This includes a number of inorganic compounds
covering the entire periodic table.
- Most drug-like molecules, including aspirin,
acyclovir, alprenolol, aminopyrine, atenolol,
caffeine and ranitidine, are assigned names
identical to those in the Merck Index.
- On a 1.5GHz Pentium, naming NCI00 takes 5m13s (800
mol/s).
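The coverage and throughput figures on this slide follow directly from the raw counts. A short Python check, using only the numbers quoted above (the variable names are illustrative):

```python
TOTAL = 250251          # compounds in NCI00
named = 200853          # names assigned by the naming code
nci_named = 45228       # names/identifiers supplied by NCI
seconds = 5 * 60 + 13   # 5m13 wall-clock time


def percent(part, whole):
    """Percentage of `whole` covered by `part`, to two decimal places."""
    return round(100.0 * part / whole, 2)


print(percent(named, TOTAL))      # 80.26
print(percent(nci_named, TOTAL))  # 18.07
print(round(TOTAL / seconds))     # 800
```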
16Restrictions and Limitations?
"Never get in a pissing fight with a skunk" -
Ancient proverb
ACD Labs' ACD/Name and Beilstein/MDL's AutoNom
are currently superior products for compound
naming. However, OpenEye's tools v1.0 are
probably the best naming tools that someone has
yet to pay for!
17AutoNom Failures
18Compound Name Parsing 101
- Do exactly the same but in reverse?
- Process similar to compiler technology.
- First break up the compound name into a sequence
of lexemes or tokens (lexical analysis).
- Process the stream of tokens checking rules of
grammar (parsing).
- Check substitution/replacement locants and bond
orders (semantic analysis/type checking).
- Construct functional groups, linkers and ring
systems, either as you go or from an abstract
syntax tree.
19Chemical Naming Lexemes
- A
- Acenaphthen
- Acenaphthylen
- Acet
- Acetamido
- Acetone
- Acetonyl
- Acetophenone
- Acetoxy
- Acetylene
- Acid
- Acridarsin
- Acridin
- Acridophosphin
- Valpro
- Vanadium
- Vinyl
- Water
- Xenon
- Xylene
- Xylol
- Yl
- Yn
- Ytterbium
- Yttrium
- Z
- Zinc
- Zirconium
Current software recognizes over 672 distinct
tokens.
20Difficult Cases
- Chloroform vs. Chloroformic acid.
- Pentanitrobenzene vs. Pentanylbenzene.
- Chloronioborane vs. Chloroniobium.
- Hexanethiol (Hex-an-e-thiol)
- Not hex-an-eth-??? nor hex-a-???
- Iododecane vs. Iodoniododecane.
- Phonetically, Fluorine vs. Fluorene.
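The hexanethiol case shows why a purely greedy lexer is not enough: a token that matches locally ("eth", "a") can leave a remainder that no lexeme covers. A minimal Python sketch of lexical analysis with backtracking, assuming a tiny hand-picked lexicon; `segmentations` and `LEXICON` are hypothetical names and this is not the tokenizer described in the talk.

```python
def segmentations(name, lexicon):
    """Yield every way of splitting `name` into tokens from `lexicon`."""
    if not name:
        yield []
        return
    # Try longer tokens first, but keep going so dead ends can backtrack.
    for end in range(len(name), 0, -1):
        head = name[:end]
        if head in lexicon:
            for rest in segmentations(name[end:], lexicon):
                yield [head] + rest


LEXICON = {"hex", "an", "a", "e", "eth", "thiol", "yl", "ol"}
print(list(segmentations("hexanethiol", LEXICON)))
# [['hex', 'an', 'e', 'thiol']]
```

The candidates hex-an-eth and hex-a are both tried and abandoned because "iol" and "nethiol" cannot be tokenized, which leaves hex-an-e-thiol as the only reading.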
21How's your grammar?
- Fortunately, most lexemes fall into common
classes that may be parsed interchangeably, much
like the categorization of words into parts of
speech: nouns, verbs, adjectives, etc.
- Chloro, Allyl, Amyl, Amidino, Mesyl
- Diazo, oxo, thioxo, vinylidene
- Azo, carbonyl, oxy, peroxy
- Acet, butyr, form, cinnam, nicotin, salicyl,
stear, valer
- Alcohol, amine, azide, cyanate, ketone, selenol,
sulfone
- Acetone, borane, cyanide, hydrazine, isourea
- Azulen, cuban, indol, morpholin, oxazol, purin,
pyrrol...
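Grouping lexemes into classes lets the grammar be written over a handful of categories instead of hundreds of individual tokens. A minimal Python sketch of that lookup. The slide lists the groups without labels, so every class name below is a guess for illustration only, and `classify` is a hypothetical helper.

```python
# Class labels are assumptions; the member lists are taken from the slide.
CLASSES = {
    "substituent": ["chloro", "allyl", "amyl", "amidino", "mesyl"],
    "double_bonded_substituent": ["diazo", "oxo", "thioxo", "vinylidene"],
    "linker": ["azo", "carbonyl", "oxy", "peroxy"],
    "acid_stem": ["acet", "butyr", "form", "cinnam", "nicotin",
                  "salicyl", "stear", "valer"],
    "functional_class": ["alcohol", "amine", "azide", "cyanate",
                         "ketone", "selenol", "sulfone"],
    "parent_compound": ["acetone", "borane", "cyanide", "hydrazine",
                        "isourea"],
    "ring_stem": ["azulen", "cuban", "indol", "morpholin", "oxazol",
                  "purin", "pyrrol"],
}
LEXEME_CLASS = {lex: cls for cls, lexemes in CLASSES.items() for lex in lexemes}


def classify(tokens):
    """Map each token to its class so grammar rules can ignore spelling."""
    return [LEXEME_CLASS.get(t.lower(), "unknown") for t in tokens]


print(classify(["Chloro", "Acetone"]))  # ['substituent', 'parent_compound']
```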
22Name Parsing Performance
- 1135 compounds out of the 1831 names in the xlogp
training set (61.99%).
- 382 compounds out of 404 names in Peter Kollman's
desolvation energy test set (94.55%).
- 38 compounds out of 67 names found in a
pharmaceutical patent benchmark provided by
Astra-Zeneca (56.72%).
- 104125 compounds out of the 200483 names
assigned to NCI00 (51.93% without BLAH, 41.61% of
the dataset).
- A 1.5GHz Pentium can parse 250251 names
(including BLAHs) from NCI00 in 18s (13600
mol/s).
23Rendering: Chemistry's Handwriting
- Display of atomic symbol, isotopic mass, formal
charge.
- Display and positioning of implicit hydrogen
count.
- Display of superatoms, aliases and atom labels.
- Suppress display of explicit carbon hydrogens.
24Bond Order
- Single, double, triple and quadruple bonds.
- Asymmetric or central placement of double bonds.
- Arrow bonds for supervalent/charge separated
groups.
25Representations of Aromaticity
Alternate representations of aromatic cycles.
26Color and Titles
27Structure Diagram Generation/Depiction
For each connected component
  Divide molecule into ring systems and chains
  For each ring system
    Lookup group graph in dictionary (RTD)
    Lookup graph in ring template dictionary
    Layout fused/spiro systems algorithmically
  Assign local co-ordinates to cis/trans bonds.
  Assign local co-ordinates to unvisited atoms.
  Resolve clashes via heuristic search.
  Convert internal polar co-ordinates to Cartesian.
28Ring Template Dictionary
The current default ring template dictionary
contains 1027 templates (including 94 group graph
specializations).
29Orientations from the RTD
In addition to the layout of a ring system,
group graph templates provide orientation
information for heterocycles.
30Internal Polar Coordinate System
- Internally, the graph layout code uses an
internal co-ordinate representation, much like
a z-matrix.
- Each atom has a reference bond, and all other
incident bonds are specified as clockwise angles.
- Each bond records its length and, cleverly, a flip
bit to invert the co-ordinate frame between ends.
- Conversion from Cartesian requires determining
lengths and angles.
- Conversion to Cartesian by depth-first traversal.
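The conversion to Cartesian can be sketched as a depth-first walk that carries the current position, the direction of arrival and the handedness of the frame. This Python sketch is an illustration under stated assumptions, not the toolkit's implementation: the tree encoding (`child, length, clockwise angle in degrees, flip`), the choice of the bond back to the parent as reference bond, and the name `to_cartesian` are all hypothetical.

```python
import math


def to_cartesian(tree, root, heading=0.0):
    """Depth-first conversion of (length, clockwise angle, flip) bonds to x,y."""
    coords = {root: (0.0, 0.0)}
    stack = [(root, heading, 1)]  # atom, direction of arrival, handedness
    while stack:
        atom, arrive, hand = stack.pop()
        x, y = coords[atom]
        back = arrive + math.pi  # the reference bond points back to the parent
        for child, length, angle_deg, flip in tree.get(atom, []):
            # Clockwise is a negative rotation when the y axis points up;
            # a mirrored frame (hand == -1) turns the other way.
            direction = back - hand * math.radians(angle_deg)
            coords[child] = (x + length * math.cos(direction),
                             y + length * math.sin(direction))
            # The flip bit inverts the frame at the far end of the bond.
            stack.append((child, direction, -hand if flip else hand))
    return coords


chain = {
    "A": [("B", 1.0, 180.0, False)],
    "B": [("C", 1.0, 120.0, True)],   # flipped bond: next turn is mirrored
    "C": [("D", 1.0, 120.0, False)],
}
xy = to_cartesian(chain, "A")
```

With the flip bit set on the B-C bond, the same 120 degree angle at C turns the opposite way and the four atoms form an extended zigzag; clearing the bit curls the chain toward a hexagon instead. Flipping one bit therefore mirrors everything downstream of that bond without touching any stored angle.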
31Local Graph Layout Heuristics
- By default, all bonds have unit length.
- Exocyclic bonds equally subdivide the sector of
maximum Circular Free Sweep (CFS).
- Atoms of degree four or higher distribute their
neighbors evenly (degree 4: 90°, degree 5: 72°,
degree 6: 60°).
- sp-hybridized atoms (triple bonded and allenic)
and atoms of degree two between atoms of degree
four (or more) are linear, 180°.
- Remaining atoms prefer angles of 120°.
- Chains are constructed extended by alternating
120°/-120°.
- Degree four atoms prefer para terminal
(symmetric) atoms.
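Two of these heuristics are easy to state in code: the preferred angle as a function of degree, and the extended zigzag produced by alternating turns. A minimal Python sketch with hypothetical function names; the chain is built directly in Cartesian co-ordinates for brevity, which is a simplification of the internal representation described earlier.

```python
import math


def neighbor_angle(degree, linear=False):
    """Preferred angle in degrees between adjacent bonds at an atom."""
    if linear:
        return 180.0
    if degree >= 4:
        return 360.0 / degree  # 90, 72, 60 for degree 4, 5, 6
    return 120.0


def zigzag_chain(n_atoms, bond_length=1.0):
    """Coordinates of an extended chain with 120 degree bond angles."""
    coords = [(0.0, 0.0)]
    for i in range(n_atoms - 1):
        # Alternating the heading between +30 and -30 degrees gives a
        # 120 degree angle at every interior atom.
        heading = math.radians(30.0 if i % 2 == 0 else -30.0)
        x, y = coords[-1]
        coords.append((x + bond_length * math.cos(heading),
                       y + bond_length * math.sin(heading)))
    return coords


print(neighbor_angle(5))  # 72.0
```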
32Clash Resolution Protocols
- A clash is considered to be any pair of atoms
within 0.866 units (i.e. d² < 0.75).
- 211017 out of 250251 compounds in NCI00 have no
clashes by construction (84.32%).
- Clashes are resolved by an iterative process.
- 1. All non-ring, non-terminal bonds are
flipped/inverted; if the number of clashes
decreases the flip is retained, and if the number
of clashes reaches zero, we're done.
- 2. If there were any changes in the previous
pass, repeat.
- 3. Flip all pairs of the above bonds.
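Steps 1 and 2 amount to a greedy search over single-bond flips, scored by the clash count. A minimal Python sketch, assuming a caller-supplied `layout` function that maps a set of flipped bonds to a list of (x, y) co-ordinates. `layout`, `toy_layout` and `greedy_flip_pass` are hypothetical stand-ins for the real layout code; only the d² < 0.75 threshold comes from the slide.

```python
from itertools import combinations

CLASH_D2 = 0.75  # squared distance threshold, i.e. 0.866 bond lengths


def count_clashes(coords):
    """Number of atom pairs closer than the clash threshold."""
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(coords, 2)
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 < CLASH_D2
    )


def greedy_flip_pass(bonds, layout, flipped):
    """Try flipping each bond once; keep a flip only if clashes decrease."""
    best = count_clashes(layout(flipped))
    changed = False
    for bond in bonds:
        if best == 0:
            break  # nothing left to fix
        trial = flipped ^ {bond}
        n = count_clashes(layout(trial))
        if n < best:
            flipped, best, changed = trial, n, True
    return flipped, best, changed


def toy_layout(flipped):
    # Toy example: atom 2 lands almost on atom 0 unless bond "b" is flipped.
    third = (0.0, 1.0) if "b" in flipped else (0.1, 0.0)
    return [(0.0, 0.0), (1.0, 0.0), third]


print(greedy_flip_pass(["a", "b"], toy_layout, frozenset()))
# (frozenset({'b'}), 0, True)
```

A caller would repeat the pass while `changed` is true and clashes remain, and only then fall back to the more expensive pairwise flips of step 3.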
33Clash Resolution Protocols 2
- 4. Try placing exo-bonds from ring fusion atoms
at the bisection of each sector, not just the
largest free sector.
- 5. For non-ring atoms of degree four, try all
possible permutations of the neighbors.
- 6. For ring atoms of degree four, with two exo
bonds, try swapping their order relative to the
ring.
34Some Difficult Depictions
35Depiction Performance
- 223135 out of 250251 structures in NCI00 (89.16%)
contain no clashes.
- Many of the 27116 failures are caused by inorganic
and organometallic co-ordination chemistry.
- Some of these failures are even caused by ring
templates!
- On a 1.5GHz Pentium, co-ordinates can be assigned
to the 250251 molecules in NCI00 in 3m05s
(including 24s file I/O), i.e. 1350 mol/s.
- This is easily fast enough to handle the "depict as
you type" functionality supported by recent
versions of the OEChem toolkits.
36Acknowledgements
- Rest of OpenEye's OEChem Team (Matt Stahl, Geoff
Skillman).
- Bob Tolbert (for flview, peruse, tad and mvi).
- Jeremy Yang (for demo.eyesopen.com).
- HP/Compaq, IBM and SGI for providing development
hardware and compilers.
- Daylight, Wolf-Dietrich Ihlenfeldt and Harold
Helson for previous work on depiction.
- ACD Labs and Beilstein (MDL) for previous work on
naming and interpretation.