Title: CADD Overview and CACTVS License
1NCI/CADD Chemical Identifier Resolver
Indexing and Analysis of Available Chemistry Space
Markus Sitzmann1, Wolf-Dietrich Ihlenfeldt2, and Marc C. Nicklaus1
1 Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS
2 Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
2Small Molecule Databases
- since the early 2000s, the number of databases publishing small molecules has grown enormously, e.g. PubChem, ChemSpider, ChEMBL, DrugBank: what is the overlap, and how many small molecules are there currently?
- ambiguities in the representation of small molecules (e.g. tautomerism, salts, ionic resonance forms)
- growing number of chemical structure identifiers (InChI/InChIKey, PubChem SID/CID, ChemSpider ID, ChEBI ID, ...)
3Chemical Identifier Resolver
SYBYL Line Notation
SMILES
CAS Registry Number
chemical names
GIF image
SD File
ChemNavigator SID
chemical structure
CML
FDA UNII
NCI/CADD Identifiers
NSC number
MRV
InChI/InChIKey
PubChem SID/CID
ChemSpider ID
ChEBI ID
Chemical Formula
PDB Ligand ID
4NCI/CADD Web Resources
Chemical Identifier Resolver
Works as a resolver for different chemical structure identifiers. Allows one to convert a given structure identifier into another representation or structure identifier.
first beta release: July 2009
current release (beta 4): April 2011
http://cactus.nci.nih.gov/chemical/structure
5NCI/CADD Web Resources
Chemical Identifier Resolver
- it is usable by a simple URL API:
http://cactus.nci.nih.gov/chemical/structure/identifier/representation
- XML format:
http://cactus.nci.nih.gov/chemical/structure/identifier/representation/xml
- example:
http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas
response: 204255-11-8 (MIME type text/plain)
- if a request is not resolvable: HTTP 404 status message
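The URL scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the resolver's own client code: the function names `resolver_url` and `fetch` are hypothetical, and percent-encoding the identifier is an assumption made so that characters such as `#` in SMILES survive in a URL path.

```python
from typing import Optional
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str, xml: bool = False) -> str:
    """Build <base>/<identifier>/<representation>[/xml] as described on the slide."""
    parts = [BASE_URL, quote(identifier, safe=""), quote(representation, safe="")]
    if xml:
        parts.append("xml")
    return "/".join(parts)


def fetch(identifier: str, representation: str) -> Optional[str]:
    """Return the plain-text response, or None if the request is not resolvable (HTTP 404)."""
    try:
        with urlopen(resolver_url(identifier, representation)) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as error:
        if error.code == 404:
            return None
        raise


# Building the URL needs no network access; fetch() is only defined here, not called.
print(resolver_url("Tamiflu", "cas"))
```

Calling `fetch("Tamiflu", "cas")` would issue the example request from the slide; whether the service is reachable and still answers in this form is not something this sketch can guarantee.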
6NCI/CADD Public Web Resources
Chemical Identifier Resolver
identifier (input): chemical names, IUPAC names (by OPSIN), CAS numbers, SMILES strings, IUPAC InChI/InChIKeys, NCI/CADD Identifiers, CACTVS HASHISY, NSC number, PubChem SID, ChemSpider ID, ChemNavigator SID, ZINC, FDA UNII
representation (output): /smiles, /names, /iupac_name, /cas, /inchi, /stdinchi, /inchikey, /stdinchikey, /ficts, /ficus, /uuuuu, /image, /file, /sdf, /mw, /monoisotopic_mass, /formula, /twirl, /3d, /urls, /chemspider_id, /pubchem_sid, /chemnavigator_sid
resolver: http://cactus.nci.nih.gov/chemical/structure/identifier/representation
7NCI/CADD Web Resources
Chemical Identifier Resolver
http request: identifier, representation
detection of the identifier type
- identifier is a full structure representation (e.g. SMILES, InChI): the structure is obtained directly with CACTVS
- identifier is a hashed structure representation (e.g. InChIKey), trivial name etc. (e.g. CAS number, chemical name): the structure is obtained by database lookup in the NCI/CADD Chemical Structure Database (CSDB)
calculation of the requested structure representation (e.g. InChI, GIF image)
http response: representation, MIME type
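The first step of this workflow, detecting the identifier type, might look roughly like the sketch below. It is an illustration under stated assumptions, not the resolver's actual logic: the function names and the returned labels are hypothetical, only three identifier types are recognized, and everything else (names, SMILES, registry IDs) falls through to "unknown". The CAS check-digit rule and the 14-10-1 letter layout of a standard InChIKey are standard properties of those identifiers.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, separated by hyphens.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(text: str) -> bool:
    """Validate a CAS number: weighted digit sum (rightmost weight 1) mod 10 equals the check digit."""
    match = CAS_RE.match(text)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def classify_identifier(text: str) -> str:
    """Return a hypothetical type label used only in this sketch."""
    s = text.strip()
    if s.startswith("InChI="):
        return "inchi"      # full structure representation: can be parsed directly
    if INCHIKEY_RE.match(s):
        return "inchikey"   # hashed representation: needs a database lookup
    if cas_check_digit_ok(s):
        return "cas"        # registry number: needs a database lookup
    return "unknown"        # e.g. chemical name or SMILES; not handled in this sketch


for example in ("204255-11-8", "HNDVDQJCIGZPNO-UHFFFAOYSA-N", "Tamiflu"):
    print(example, "->", classify_identifier(example))
```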
9Chemical Identifier Resolver
Resolving Chemical Names
http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles/xml?resolver=name_by_chemspider,name_by_opsin,name_by_cir
<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>
10Chemical Identifier Resolver
Chemical Structure Database (CSDB)
- ChemNavigator iResearch Library: compilation of commercially available screening compounds from 330 international chemistry suppliers
- PubChem database: including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider
- Commercial sources / others: Asinex, Comgenex, eMolecules, ChEMBL, ...
[Pie chart: ChemNav. iResearch Lib. 56%, PubChem 38%, others 6%]
currently:
- 150 chemical structure databases
- 120 million structure records
- 81.6 million unique structures by NCI/CADD FICuS Identifier
- 84 million unique structures by Std. InChIKey
11- NCI/CADD Structure Identifiers
FICTS, FICuS, uuuuu
12Unique Representation of Chemical Structures
NCI/CADD Structure Identifiers
- based on hashcodes calculated by the chemoinformatics toolkit CACTVS, e.g. 9850FD9F9E2B4E25
- CACTVS hashcodes
  - represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned)
  - have high sensitivity to structural features of a compound
  - change if connectivity changes
14Unique Representation of Chemical Structures
NCI/CADD Structure Identifiers
Molfile, SDF, SMILES, ChemDraw cdx, PDB
original structure record -> structure normalization -> parent structure -> hashcode calculation (E_HASHISY) -> NCI/CADD Identifier
SDF, SMILES, database
parent structure types: FICTS, FICuS, uuuuu
- calculation of a set of parent structures with different sensitivity to chemical features
- representation of chemical structures on different levels
15Unique Representation of Chemical Structures
NCI/CADD Structure Identifiers
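Sensitivity of each identifier to structural features (Fragments, Isotopes, Charges, Tautomers, Stereo):
- FICTS: sensitive to fragments, isotopes, charges, tautomers, and stereo
- FICuS: sensitive to fragments, isotopes, charges, and stereo; not sensitive to tautomers
- uuuuu: not sensitive to fragments, isotopes, charges, tautomers, or stereo
Examples:
4A122D094098B50D-FICTS-01-1D
0E26B623DF7FAD30-FICuS-01-70
9850FD9F9E2B4E25-uuuuu-01-27
Format: <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>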
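The identifier layout can be taken apart with a short parser. This is a sketch under assumptions: the regular expression presumes a 16-digit uppercase hexadecimal hashcode, a two-digit version, and a two-character hexadecimal checksum, which matches the three examples on the slide but is not a published grammar. The checksum is not verified because its algorithm is not given here, and the helper names are hypothetical.

```python
import re
from typing import List, NamedTuple

FEATURES = ["Fragments", "Isotopes", "Charges", "Tautomers", "Stereo"]

# Assumed layout, inferred from the slide examples only.
IDENTIFIER_RE = re.compile(r"^([0-9A-F]{16})-(FICTS|FICuS|uuuuu)-(\d{2})-([0-9A-F]{2})$")


class NciCaddIdentifier(NamedTuple):
    hashcode: str   # CACTVS E_HASHISY hashcode, 16 hex digits (64-bit unsigned)
    tag: str        # FICTS, FICuS, or uuuuu
    version: str
    checksum: str   # not verified in this sketch


def parse_identifier(text: str) -> NciCaddIdentifier:
    match = IDENTIFIER_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an NCI/CADD identifier: {text!r}")
    return NciCaddIdentifier(*match.groups())


def sensitive_features(tag: str) -> List[str]:
    """A letter in the tag marks a feature the identifier is sensitive to; 'u' marks one it ignores."""
    return [feature for feature, letter in zip(FEATURES, tag) if letter != "u"]


for example in (
    "4A122D094098B50D-FICTS-01-1D",
    "0E26B623DF7FAD30-FICuS-01-70",
    "9850FD9F9E2B4E25-uuuuu-01-27",
):
    parsed = parse_identifier(example)
    print(parsed.tag, int(parsed.hashcode, 16), sensitive_features(parsed.tag))
```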
16[Figure: histidine and related forms (stereoisomers, tautomer, charged form, salt, isotope, errors)]
17[Figure: FICTS identifiers of histidine and related forms (stereoisomers, tautomer, charged form, salt, isotope, errors)]
E92E4BA2869F3611-FICTS
8A7AD1EB498CC76A-FICTS
6C16DE2351F9FF50-FICTS
E5F83F10C5DB080A-FICTS
A3DAE0788050DDE4-FICTS
9850FD9F9E2B4E25-FICTS
B2FDA68AEDA06DB9-FICTS
E5F83F10C5DB080A-FICTS
9850FD9F9E2B4E25-FICTS
18[Figure: FICuS identifiers of histidine and related forms (stereoisomers, tautomer, charged form, salt, isotope, errors)]
E92E4BA2869F3611-FICuS
8A7AD1EB498CC76A-FICuS
9850FD9F9E2B4E25-FICuS
E5F83F10C5DB080A-FICuS
A3DAE0788050DDE4-FICuS
9850FD9F9E2B4E25-FICuS
B2FDA68AEDA06DB9-FICuS
E5F83F10C5DB080A-FICuS
9850FD9F9E2B4E25-FICuS
19[Figure: uuuuu identifiers of histidine and related forms (stereoisomers, tautomer, charged form, salt, isotope, errors)]
9850FD9F9E2B4E25-uuuuu
9850FD9F9E2B4E25-uuuuu
9850FD9F9E2B4E25-uuuuu
9850FD9F9E2B4E25-uuuuu
9850FD9F9E2B4E25-uuuuu
9850FD9F9E2B4E25-uuuuu
9850FD9F9E2B4E25-uuuuu
9850FD9F9E2B4E25-uuuuu
9850FD9F9E2B4E25-FICuS
20[Figure: Std. InChIKeys of histidine and related forms (stereoisomers, tautomer, charged form, salt, isotope, errors)]
HNDVDQJCIGZPNO-RXMQYKEDSA-N
HNDVDQJCIGZPNO-YFKPBYRVSA-N
HNDVDQJCIGZPNO-UHFFFAOYSA-N
UHPNKBYGGMJTIM-UHFFFAOYSA-M
HNDVDQJCIGZPNO-UHFFFAOYSA-N
HNDVDQJCIGZPNO-UHFFFAOYSA-N
UHPNKBYGGMJTIM-UHFFFAOYSA-M
HNDVDQJCIGZPNO-CDYZYAPPSA-N
HNDVDQJCIGZPNO-UHFFFAOYSA-N
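One way to read the four histidine slides side by side is to count how many distinct values each identifier scheme assigns to the same nine depicted forms. The sketch below does only that; the values are copied from the slides in the order they appear in the transcript (hashcode part only for the NCI/CADD identifiers), and no attempt is made to say which value belongs to which drawn structure, since that mapping is not recoverable from the transcript.

```python
# Values copied from the slides, in transcript order; tags stripped for the hash-based schemes.
IDENTIFIERS = {
    "FICTS": [
        "E92E4BA2869F3611", "8A7AD1EB498CC76A", "6C16DE2351F9FF50",
        "E5F83F10C5DB080A", "A3DAE0788050DDE4", "9850FD9F9E2B4E25",
        "B2FDA68AEDA06DB9", "E5F83F10C5DB080A", "9850FD9F9E2B4E25",
    ],
    "FICuS": [
        "E92E4BA2869F3611", "8A7AD1EB498CC76A", "9850FD9F9E2B4E25",
        "E5F83F10C5DB080A", "A3DAE0788050DDE4", "9850FD9F9E2B4E25",
        "B2FDA68AEDA06DB9", "E5F83F10C5DB080A", "9850FD9F9E2B4E25",
    ],
    "uuuuu": ["9850FD9F9E2B4E25"] * 9,
    "Std. InChIKey": [
        "HNDVDQJCIGZPNO-RXMQYKEDSA-N", "HNDVDQJCIGZPNO-YFKPBYRVSA-N",
        "HNDVDQJCIGZPNO-UHFFFAOYSA-N", "UHPNKBYGGMJTIM-UHFFFAOYSA-M",
        "HNDVDQJCIGZPNO-UHFFFAOYSA-N", "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
        "UHPNKBYGGMJTIM-UHFFFAOYSA-M", "HNDVDQJCIGZPNO-CDYZYAPPSA-N",
        "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
    ],
}


def distinct_counts(table: dict) -> dict:
    """Number of distinct identifier values per scheme."""
    return {scheme: len(set(values)) for scheme, values in table.items()}


counts = distinct_counts(IDENTIFIERS)
for scheme, n in counts.items():
    print(f"{scheme}: {n} distinct values for 9 depicted forms")
```

The fewer distinct values a scheme produces, the more of the depicted forms it treats as the same compound, which is the sense in which the identifiers represent structures "on different levels".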
25NCI/CADD Chemical Structure Database
Structure Normalization
119.8 million original structure records in CSDB
-> 83.1 million FICTS parent structures
-> 81.6 million FICuS parent structures (tautomer-invariant)
-> 76.2 million uuuuu parent structures
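The effect of each normalization level is easier to compare as a share of the original records. The following sketch only redoes the arithmetic on the four counts quoted on the slide; the function name is hypothetical.

```python
# Counts in millions, as quoted on the slide.
COUNTS = {
    "original records": 119.8,
    "FICTS": 83.1,
    "FICuS": 81.6,
    "uuuuu": 76.2,
}


def share_of_original(counts: dict) -> dict:
    """Percentage of original records that remain as unique parent structures at each level."""
    total = counts["original records"]
    return {
        level: round(100.0 * value / total, 1)
        for level, value in counts.items()
        if level != "original records"
    }


shares = share_of_original(COUNTS)
for level, percent in shares.items():
    print(f"{level}: {percent}% of original records")

# Parent structures merged when tautomers are no longer distinguished (FICTS -> FICuS), in millions.
merged_by_tautomer_invariance = round(COUNTS["FICTS"] - COUNTS["FICuS"], 1)
print("merged by tautomer invariance:", merged_by_tautomer_invariance, "million")
```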
26Tautomer Analysis
How much chemical space is just generated by
drawing tautomers?
27NCI/CADD Chemical Structure Database
Tautomer Analysis
- CACTVS: generation of all formal tautomers for a given organic compound (prototropic tautomerism)
- rule set of 21 transforms encoded as (CACTVS-extended) SMIRKS
- rule set is systematically applied to the original structure (and all tautomers that have been generated in previous steps)
- tautomer generation is limited to 1000 SMIRKS transform operations/structure
- all tautomers are ranked by a scoring function
- the highest ranked tautomer is defined as the canonical tautomer
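The enumeration strategy in these bullets (apply every rule to the start structure and to everything generated so far, stop after a fixed budget of transform operations, then rank) can be sketched generically. This is not CACTVS code: structures are stand-in strings, the two toy transforms and the score table are invented for illustration, and the real rule set consists of 21 SMIRKS transforms applied to molecular graphs.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set, Tuple

Transform = Callable[[str], Iterable[str]]


def enumerate_tautomers(
    start: str, transforms: List[Transform], max_ops: int = 1000
) -> Tuple[Set[str], bool]:
    """Breadth-first closure of `start` under `transforms`.

    Returns the set of forms found and a flag telling whether the enumeration
    finished before the operation budget ran out. The flag is conservative:
    it is False whenever the budget stops the loop with work still pending.
    """
    seen = {start}
    queue = deque([start])
    ops = 0
    while queue:
        current = queue.popleft()
        for transform in transforms:
            if ops >= max_ops:
                return seen, False
            ops += 1  # one transform application counts as one operation
            for product in transform(current):
                if product not in seen:
                    seen.add(product)
                    queue.append(product)
    return seen, True


def canonical_tautomer(forms: Set[str], score: Callable[[str], float]) -> str:
    """Highest-scoring form; ties are broken alphabetically for determinism."""
    return max(sorted(forms), key=score)


# Toy stand-ins (hypothetical): one keto form and two enol forms.
def keto_to_enol(form: str) -> List[str]:
    return ["enol-a", "enol-b"] if form == "keto" else []


def enol_to_keto(form: str) -> List[str]:
    return ["keto"] if form.startswith("enol") else []


TOY_SCORES: Dict[str, float] = {"keto": 2.0, "enol-a": 1.0, "enol-b": 0.0}

forms, exhaustive = enumerate_tautomers("enol-a", [keto_to_enol, enol_to_keto])
print(sorted(forms), "exhaustive:", exhaustive)
print("canonical:", canonical_tautomer(forms, TOY_SCORES.__getitem__))
```

Returning the `exhaustive` flag alongside the result makes it possible to report, as a later slide does, for what fraction of structures the enumeration was cut short by the operation limit.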
28NCI/CADD Chemical Structure Database
Tautomer Analysis
- 21 SMIRKS transform rules
rule 1: 1.3 (thio)keto/(thio)enol
rule 2: 1.5 (thio)keto/(thio)enol
rule 3: simple (aliphatic) imine
rule 4: special imine
rule 5: 1.3 aromatic heteroatom H shift
rule 6: 1.3 heteroatom H shift
rule 7: 1.5 (aromatic) heteroatom H shift (1)
rule 8: 1.5 aromatic heteroatom H shift (2)
rule 9: 1.7 (aromatic) heteroatom H shift
rule 10: 1.9 (aromatic) heteroatom H shift
rule 11: 1.11 (aromatic) heteroatom H shift
rule 12: furanones
rule 13: ketene/ynol exchange
rule 14: ionic nitro/aci-nitro
rule 15: pentavalent nitro/aci-nitro
rule 16: oxime/nitroso
rule 17: oxime/nitroso via phenol
rule 18: cyanic/iso-cyanic acids
rule 19: formamidinesulfinic acids
rule 20: isocyanides
rule 21: phosphonic acids
29NCI/CADD Chemical Structure Database
Tautomer Analysis
starting from the set of FICuS parent structures, we systematically generated all tautomers based on the set of 21 SMIRKS rules available in CACTVS
- 70.6 million FICuS parent structures (2009 DB version)
- generated 680 million tautomers
- for 1.7% of the FICuS parent structures the enumeration was not exhaustive
31NCI/CADD Chemical Structure Database
Tautomer Analysis
[Histogram: tautomeric overlap within each individual database release (%) on the x-axis (0.0 to 2.0); number of database releases (frequency) on the y-axis (0 to 90)]
database groups labeled on the histogram:
- Ambinter, BIND, BindingDB, ChemNavigator, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat
- Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox, Specs
- NCI/DTP, PASS Training Set, SGC-Ox
- ChemDB, ZINC
- ChEBI, ChemSpider
average: 0.3% of original structure records
32NCI/CADD Chemical Structure Database
Tautomer Analysis
[Histogram: occurrence of tautomerism-critical molecules within each individual database release (%) on the x-axis (0.5 to 24.5); number of database releases (frequency) on the y-axis (0 to 30)]
average: 9.5% of FICuS parent structures
(percentage of FICuS parent structures in each database release occurring somewhere in CSDB with a conflict)
33Example for a Tautomer Conflict
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
- HPMBP is used in liquid membranes (selective removal of metal ions)
- selectivity and efficiency depend on the tautomeric form of HPMBP, which itself depends on the solvent and the concentration of HPMBP
He, D.; Li, Z.; Ma, M.; Huang, J.; Yang, Y. Study of extraction characteristics of HPMBP. 1. Tautomer and extraction characteristics. J. Chem. Eng. Data 2009, 54(10), 2944-2947
34Example for a Tautomer Conflict
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
CACTVS generates 7 tautomers
canonical tautomer by CACTVS
5 tautomers have potential stereo center on atoms
or bonds
35Example for a Tautomer Conflict
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
3 tautomers have CAS Registry Numbers assigned
- 4551-69-3: 859 references
- 33064-14-1: 49 references (no stereo)
- 127117-31-1: 3 references (Z)
36Example for a Tautomer Conflict
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
occurrences in databases indexed in CSDB
[Figure: tautomers of HPMBP, some with R/S or E/Z stereo, each annotated with the number of databases containing that form]
- 6 databases
- 16 databases (no stereo), 3 databases (R), 2 databases (S)
- 12 databases
- 1 database (no stereo)
37Example for a Tautomer Conflict
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
occurrences in databases
- 6 databases: Ambinter, ChemDB, ChemSpider, DiscoveryGate, ChemNavigator, Thomson Pharma
- 16 databases (no stereo): ACD 3D, ACX, Ambinter, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM ChemIDplus, Sigma-Aldrich, Thomson Pharma
- 3 databases (R): ChemSpider, ECOTOX, ZINC
- 2 databases (S): ChemSpider, ZINC
- 12 databases: ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB, ChemSpider, ChemNavigator, MLSMR, NIAID, Scripps Screening Center, Thomson Pharma, ZINC
- 1 database (no stereo): ChemDB
38Scaffold Analysis
39NCI/CADD Chemical Structure Database
Scaffold Analysis
example
- molecular scaffold tree (level 1, level 2): Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47, 47-58
- simple scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
- archetype scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
40NCI/CADD Chemical Structure Database
Scaffold Analysis
CSDB
uuuuu compound set
76.2 million
41NCI/CADD Chemical Structure Database
Scaffold Analysis
CSDB uuuuu compound set: 76.2 million
- molecular scaffold tree (level 1, level 2): 8.1 million scaffolds
- simple scaffold: 6.8 million scaffolds
- archetype scaffold: 0.8 million scaffolds
42NCI/CADD Chemical Structure Database
Scaffold Analysis
CSDB uuuuu compound set: 76.2 million
molecular scaffold tree (level 1, level 2): 8.1 million scaffolds
[Chart: number of unique scaffolds per hierarchy level; x-axis: Hierarchy Level (1 to 10); y-axes: Number of Unique Scaffolds (in millions, 0 to 8.0) and Number of unique structures (in millions, 0 to 80.0)]
43Atom Neighborhoods
44Multilevel Neighborhoods of Atoms (MNA)
NCI/CADD Chemical Structure Database
MNA level 1: HC, HO, CHCC, CHCN, CCCC, CCOO, NCC, OHC, OC
MNA level 2:
C(C(CC-H)C(CC-C)-H(C))
C(C(CC-H)C(CN-H)-H(C))
C(C(CC-H)C(CN-H)-C(C-O-O))
C(C(CC-H)N(CC)-H(C))
C(C(CC-C)N(CC)-H(C))
N(C(CN-H)C(CN-H))
-H(C(CC-H))
-H(C(CN-H))
-H(-O(-H-C))
-C(C(CC-C)-O(-H-C)-O(-C))
-O(-H(-O)-C(C-O-O))
-O(-C(C-O-O))
Filimonov D., Poroikov V., Borodina Yu., Gloriozova T. J. Chem. Inf. Comput. Sci., 1999, 39 (4), 666-670.
48Multilevel Neighborhoods of Atoms (MNA)
NCI/CADD Chemical Structure Database
Unique MNAs
- level 1: 13,426 unique MNAs, 1.3 billion relationships, 17 MNAs per uuuuu parent structure
- level 2: 918,516 unique MNAs, 2.3 billion relationships, 30 MNAs per uuuuu parent structure
surprising: 424,784 MNAs (level 2) are exclusive to a set of 1.3 million structures in ChemSpider
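The per-structure averages quoted here are consistent with dividing the number of MNA-to-structure relationships by the size of the uuuuu compound set. The check below assumes that the denominator is the 76.2 million uuuuu parent structures reported earlier in the talk; that pairing is an inference, not something the slide states.

```python
UUUUU_PARENT_STRUCTURES = 76.2e6  # assumed denominator, taken from the scaffold slides

RELATIONSHIPS = {
    "level 1": 1.3e9,
    "level 2": 2.3e9,
}


def mnas_per_structure(relationships: float, structures: float) -> float:
    """Average number of MNA descriptors attached to one parent structure."""
    return relationships / structures


averages = {
    level: mnas_per_structure(count, UUUUU_PARENT_STRUCTURES)
    for level, count in RELATIONSHIPS.items()
}
for level, value in averages.items():
    print(f"{level}: about {value:.1f} MNAs per uuuuu parent structure")
```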
49NCI/CADD Web Resources
Chemical Structure Web Services (http)
components shown in the diagram:
- Chemical Identifier Resolver
- NCI/CADD web service
- NCI/CADD web service
- CACTVS
- other software packages, e.g. OPSIN
- external (web) services
- NCI/CADD Chemical Structure Database (CSDB)
50NCI/CADD Web Resources
Chemical Identifier Resolver
http://www.akosgmbh.eu/globalsearch/index.htm
gChem
Virtual Molecular Model Kit
http://chemagic.com/web_molecules/script_page_large.aspx
CACTVS
IUPHAR DATABASE http://www.iuphar-db.org
http://www.xemistry.com
51Work in progress
Chemical Structure Lookup Service II
53Acknowledgments
Thanks to all database providers!
University of Cambridge: Daniel Lowe, Peter Murray-Rust
CADD Group, CBL, NCI: Igor Filippov
ChemNavigator: Scott Hutton, Tad Hurst
ChemSpider: Antony Williams, Valery Tkachenko
Noel O'Boyle (University College Cork, Ireland)
Richard Apodaca (Metamolecular)
Hans-Juergen Himmler
Our web site: http://cactus.nci.nih.gov
54NCI/CADD Web Resources
Chemical Identifier Resolver
http://cactus.nci.nih.gov/chemical/structure
http://cactus.nci.nih.gov/blog
55Acknowledgments - Software
CACTVS
Python Web Framework
Peter Ertl
Python SQL library
Javascript library