CADD Overview and CACTVS License - PowerPoint PPT Presentation

Title: CADD Overview and CACTVS License
Author: Megan Peach
Description: Modified/extended by MCN
Last modified by: CN2
Created Date: 1/24/2005 9:06:17 PM

Slides: 57
Provided by: MeganP150
1
NCI/CADD Chemical Identifier Resolver: Indexing
and Analysis of Available Chemistry Space
Markus Sitzmann1, Wolf-Dietrich Ihlenfeldt2,
and Marc C. Nicklaus1
1 Computer-Aided Drug Design Group, Chemical Biology Laboratory,
NCI-Frederick, NIH, DHHS
2 Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
2
Small Molecule Databases
  • since the early 2000s, the number of databases
    publishing small molecules has grown enormously
    (e.g. PubChem, ChemSpider, ChEMBL, DrugBank): what
    is the overlap, and how many small molecules are
    there currently?
  • ambiguities in the representation of small
    molecules (e.g. tautomerism, salts, ionic
    resonance forms)
  • growing number of chemical structure identifiers
    (InChI/InChIKey, PubChem SID/CID, ChemSpider ID,
    ChEBI ID, ...)

3
Chemical Identifier Resolver
SYBYL Line Notation
SMILES
CAS Registry Number
chemical names
GIF image
SD File
ChemNavigator SID
chemical structure
CML
FDA UNII
NCI/CADD Identifiers
NSC number
MRV
InChI/InChIKey
PubChem SID/CID
ChemSpider ID
ChEBI ID
Chemical Formula
PDB Ligand ID
4
NCI/CADD Web Resources
Chemical Identifier Resolver
Works as a resolver for different chemical
structure identifiers. Allows one to convert a
given structure identifier into another
representation or structure identifier.
first beta release: July 2009
current release (beta 4): April 2011
http://cactus.nci.nih.gov/chemical/structure
5
NCI/CADD Web Resources
Chemical Identifier Resolver
  • it is usable by a simple URL API

http://cactus.nci.nih.gov/chemical/structure/identifier/representation
XML format: http://cactus.nci.nih.gov/chemical/structure/identifier/representation/xml
example: http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas
returns: 204255-11-8
MIME type: text/plain
  • if a request is not resolvable, an HTTP 404 status
    message is returned
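
The URL scheme above can be exercised with a few lines of Python. This is a minimal sketch, not the resolver's own client: the function names `build_resolver_url` and `resolve` are hypothetical, the base URL is the one given on the slide, and treating an HTTP 404 as "not resolvable" follows the bullet above.

```python
from typing import Optional
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

# Base URL as given on the slide.
BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_resolver_url(identifier: str, representation: str, xml: bool = False) -> str:
    """Build <base>/<identifier>/<representation>[/xml] (hypothetical helper)."""
    # Percent-encode the identifier so SMILES characters such as '[' or '='
    # cannot break the URL path.
    path = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    return f"{path}/xml" if xml else path


def resolve(identifier: str, representation: str, timeout: float = 10.0) -> Optional[str]:
    """Return the text/plain response body, or None if the request is not resolvable."""
    try:
        with urlopen(build_resolver_url(identifier, representation), timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as err:
        if err.code == 404:  # the slide states unresolvable requests yield HTTP 404
            return None
        raise
```

For the slide's example, `build_resolver_url("Tamiflu", "cas")` yields the URL shown above; calling `resolve("Tamiflu", "cas")` needs network access to the service.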

6
NCI/CADD Public Web Resources
Chemical Identifier Resolver
resolver: http://cactus.nci.nih.gov/chemical/structure/identifier/representation
identifier: chemical names, IUPAC names (by OPSIN), CAS
numbers, SMILES strings, IUPAC InChI/InChIKeys,
NCI/CADD Identifiers, CACTVS HASHISY, NSC number,
PubChem SID, ChemSpider ID, ChemNavigator SID, ZINC, FDA UNII
representation: /smiles, /names, /iupac_name, /cas, /inchi,
/stdinchi, /inchikey, /stdinchikey, /ficts, /ficus,
/uuuuu, /image, /file, /sdf, /mw,
/monoisotopic_mass, /formula, /twirl,
/3d, /urls, /chemspider_id, /pubchem_sid,
/chemnavigator_sid
7
NCI/CADD Web Resources
Chemical Identifier Resolver
http request (identifier, representation)
detection of the identifier type
  • identifier is a full structure representation
    (e.g. SMILES, InChI): processed by CACTVS
  • identifier is a hashed structure representation
    (e.g. InChIKey), trivial name etc.
    (e.g. CAS number, chemical name): database lookup in the
    NCI/CADD Chemical Structure Database (CSDB)
structure
calculation of the requested structure representation
(e.g. InChI, GIF image)
http response (representation, MIME type)
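
The "detection of the identifier type" step can be illustrated with a small classifier. This is only a sketch under stated assumptions: the regular expressions and the function name `classify_identifier` are illustrative and not taken from the resolver, SMILES detection is left out because it needs a real parser, and anything unrecognized is treated as a chemical name.

```python
import re
from typing import Tuple

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(text: str) -> bool:
    """Check the CAS layout and its check digit."""
    match = CAS_RE.match(text)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... from right to left; the sum modulo 10
    # must equal the check digit.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def classify_identifier(text: str) -> Tuple[str, str]:
    """Return (identifier type, route), where route is 'calculate' or 'lookup'."""
    text = text.strip()
    if text.startswith("InChI="):
        return ("InChI", "calculate")    # full structure representation
    if INCHIKEY_RE.match(text):
        return ("InChIKey", "lookup")    # hashed structure representation
    if is_valid_cas(text):
        return ("CAS number", "lookup")
    return ("chemical name", "lookup")   # fallback assumption
```

Full structure representations go to the "calculate" route (CACTVS in the diagram), everything else to a database lookup.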
9
Chemical Identifier Resolver
Resolving Chemical Names
http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles/xml?resolver=name_by_chemspider,name_by_opsin,name_by_cir
<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>
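
A response of this shape can be read with Python's standard XML parser. The sketch below assumes the element and attribute names shown on the slide (`request`, `data`, `item`, `resolver`); the sample document is a hand-reconstructed stand-in for a live response, and `parse_resolver_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Stand-in for a live response, reconstructed from the slide.
SAMPLE_RESPONSE = """\
<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>
"""


def parse_resolver_xml(xml_text: str) -> Dict[str, List[str]]:
    """Map each resolver name to the list of item values it returned."""
    root = ET.fromstring(xml_text)
    results: Dict[str, List[str]] = {}
    for data in root.findall("data"):
        resolver = data.get("resolver", "")
        results[resolver] = [item.text or "" for item in data.findall("item")]
    return results


results = parse_resolver_xml(SAMPLE_RESPONSE)
# True when all three name resolvers agree on a single structure.
all_agree = len({tuple(values) for values in results.values()}) == 1
```

Comparing the per-resolver results in this way shows whether the different name resolvers agree on the structure.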
10
Chemical Identifier Resolver
Chemical Structure Database (CSDB)
  • ChemNavigator iResearch Library: compilation of
    commercially available screening compounds from
    330 international chemistry suppliers
  • PubChem database, including Open NCI database,
    EPA DSSTox databases, NIAID HIV databases, NIST
    Webbook, NLM ChemIDplus, ChemSpider
  • Commercial sources / others: Asinex, Comgenex,
    eMolecules, ChEMBL, ...

PubChem: 38%
ChemNavigator iResearch Library: 56%
others: 6%
currently:
150 chemical structure databases
120 million structure records
81.6 million unique structures by NCI/CADD FICuS Identifier
84 million unique structures by Std. InChIKey
11
  • NCI/CADD Structure Identifiers

FICTS, FICuS, uuuuu
12
Unique Representation of Chemical Structures
NCI/CADD Structure Identifiers
  • based on hashcodes calculated by the
    chemoinformatics toolkit CACTVS

9850FD9F9E2B4E25
  • CACTVS hashcodes
  • represent a chemical structure uniquely
    as a 16-digit hexadecimal number (64-bit unsigned)
  • high sensitivity to structural features of a
    compound
  • change if connectivity changes

13
Unique Representation of Chemical Structures
NCI/CADD Structure Identifiers
input: Molfile, SDF, SMILES, ChemDraw cdx, PDB
original structure record
structure normalization
parent structure
hashcode calculation (E_HASHISY)
NCI/CADD Identifier
output: SDF, SMILES, database
14
Unique Representation of Chemical Structures
NCI/CADD Structure Identifiers
input: Molfile, SDF, SMILES, ChemDraw cdx, PDB
original structure record
structure normalization
parent structure (FICTS, FICuS, uuuuu)
hashcode calculation (E_HASHISY)
NCI/CADD Identifier
output: SDF, SMILES, database
  • calculation of a set of parent structures with
    different sensitivity to chemical features
  • representation of chemical structures on
    different levels

15
Unique Representation of Chemical Structures
NCI/CADD Structure Identifiers
sensitive (+) / not sensitive (-)
        Fragments  Isotopes  Charges  Stereo  Tautomers
FICTS       +         +         +       +         +
FICuS       +         +         +       +         -
uuuuu       -         -         -       -         -
examples:
4A122D094098B50D-FICTS-01-1D
0E26B623DF7FAD30-FICuS-01-70
9850FD9F9E2B4E25-uuuuu-01-27
format:
<CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>
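
The identifier layout can be split into its four fields with a regular expression. This is a sketch based only on the three example identifiers shown on this slide: the two-digit version and two-hex-digit checksum widths are assumptions read off those examples, the checksum is not verified (its algorithm is not given here), and `parse_identifier` is a hypothetical name.

```python
import re
from typing import NamedTuple


class StructureIdentifier(NamedTuple):
    hashcode: int   # 64-bit unsigned CACTVS hashcode (E_HASHISY)
    tag: str        # FICTS, FICuS or uuuuu
    version: str
    checksum: str


# Field widths are assumptions taken from the examples on the slide.
IDENTIFIER_RE = re.compile(r"^([0-9A-F]{16})-(FICTS|FICuS|uuuuu)-(\d{2})-([0-9A-F]{2})$")


def parse_identifier(text: str) -> StructureIdentifier:
    """Split '<hashcode>-<tag>-<version>-<checksum>' into its fields."""
    match = IDENTIFIER_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an NCI/CADD structure identifier: {text!r}")
    hex_hash, tag, version, checksum = match.groups()
    return StructureIdentifier(int(hex_hash, 16), tag, version, checksum)


parsed = parse_identifier("9850FD9F9E2B4E25-uuuuu-01-27")
```

Formatting the integer back with `format(parsed.hashcode, "016X")` recovers the 16-digit hexadecimal form.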
16
[Figure: histidine and its variants: stereoisomers, tautomer, charged form, salt (Na), isotope, errors]
17
FICTS
[Figure: histidine variants (stereoisomers, tautomer, charged form, salt, isotope, errors), each labeled with its FICTS identifier; identifiers in the order they appear on the slide]
E92E4BA2869F3611-FICTS
8A7AD1EB498CC76A-FICTS
6C16DE2351F9FF50-FICTS
E5F83F10C5DB080A-FICTS
A3DAE0788050DDE4-FICTS
9850FD9F9E2B4E25-FICTS
B2FDA68AEDA06DB9-FICTS
E5F83F10C5DB080A-FICTS
9850FD9F9E2B4E25-FICTS
18
FICuS
[Figure: histidine variants (stereoisomers, tautomer, charged form, salt, isotope, errors), each labeled with its FICuS identifier; identifiers in the order they appear on the slide]
E92E4BA2869F3611-FICuS
8A7AD1EB498CC76A-FICuS
9850FD9F9E2B4E25-FICuS
E5F83F10C5DB080A-FICuS
A3DAE0788050DDE4-FICuS
9850FD9F9E2B4E25-FICuS
B2FDA68AEDA06DB9-FICuS
E5F83F10C5DB080A-FICuS
9850FD9F9E2B4E25-FICuS
19
uuuuu
[Figure: histidine variants (stereoisomers, tautomer, charged form, salt, isotope, errors), each labeled with its uuuuu identifier; all nine variants share the same identifier]
9850FD9F9E2B4E25-uuuuu
20
Std. InChIKey
[Figure: histidine variants (stereoisomers, tautomer, charged form, salt, isotope, errors), each labeled with its Std. InChIKey; keys in the order they appear on the slide]
HNDVDQJCIGZPNO-RXMQYKEDSA-N
HNDVDQJCIGZPNO-YFKPBYRVSA-N
HNDVDQJCIGZPNO-UHFFFAOYSA-N
UHPNKBYGGMJTIM-UHFFFAOYSA-M
HNDVDQJCIGZPNO-UHFFFAOYSA-N
HNDVDQJCIGZPNO-UHFFFAOYSA-N
UHPNKBYGGMJTIM-UHFFFAOYSA-M
HNDVDQJCIGZPNO-CDYZYAPPSA-N
HNDVDQJCIGZPNO-UHFFFAOYSA-N
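
The keys on this slide can be grouped by their first block, which in a standard InChIKey is the 14-letter hash of the connectivity layers. The sketch below uses only the keys listed above; `group_by_skeleton` is a hypothetical helper, and which variant each key belongs to is not asserted.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# The nine keys as listed on the slide (duplicates included).
SLIDE_KEYS = [
    "HNDVDQJCIGZPNO-RXMQYKEDSA-N",
    "HNDVDQJCIGZPNO-YFKPBYRVSA-N",
    "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
    "UHPNKBYGGMJTIM-UHFFFAOYSA-M",
    "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
    "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
    "UHPNKBYGGMJTIM-UHFFFAOYSA-M",
    "HNDVDQJCIGZPNO-CDYZYAPPSA-N",
    "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
]


def group_by_skeleton(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Group the distinct InChIKeys by their first (connectivity) block."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key in sorted(set(keys)):
        groups[key.split("-")[0]].append(key)
    return dict(groups)


groups = group_by_skeleton(SLIDE_KEYS)
distinct_keys = len(set(SLIDE_KEYS))
```

The nine variants produce five distinct Std. InChIKeys in two connectivity groups, whereas the uuuuu identifier on the previous slide is the same for all nine.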
21
NCI/CADD Chemical Structure Database
Structure Normalization
[Figure: stack of original structure records]
119.8 million original structure records in CSDB
22
NCI/CADD Chemical Structure Database
Structure Normalization
[Figure: original structure records grouped under FICTS parent structures]
83.1 million FICTS parent structures
119.8 million original structure records in CSDB
23
NCI/CADD Chemical Structure Database
Structure Normalization
[Figure: original structure records grouped under FICTS and FICuS parent structures]
81.6 million FICuS parent structures
83.1 million FICTS parent structures
119.8 million original structure records in CSDB
24
NCI/CADD Chemical Structure Database
Structure Normalization
[Figure: original structure records grouped under FICTS, FICuS and uuuuu parent structures]
76.2 million uuuuu parent structures
81.6 million FICuS parent structures
83.1 million FICTS parent structures
119.8 million original structure records in CSDB
25
NCI/CADD Chemical Structure Database
Structure Normalization
[Figure: original structure records grouped under FICTS, FICuS and uuuuu parent structures; one part of the figure is marked "tautomer-invariant"]
76.2 million uuuuu parent structures
81.6 million FICuS parent structures
83.1 million FICTS parent structures
119.8 million original structure records in CSDB
26
Tautomer Analysis
How much chemical space is just generated by
drawing tautomers?
27
NCI/CADD Chemical Structure Database
Tautomer Analysis
  • CACTVS generation of all formal tautomers for a
    given organic compound (prototropic tautomerism)
  • rule set of 21 transforms encoded as
    (CACTVS-extended) SMIRKS
  • rule set is systematically applied to the
    original structure(and all tautomers that have
    been generated in previous steps)
  • tautomer generation is limited to 1000 SMIRKS
    transform operations/structure
  • all tautomers are ranked by a scoring function
  • the highest ranked tautomer is defined as
    the canonical tautomer
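
The enumeration procedure described in these bullets can be sketched as a breadth-first closure with an operation budget. This is not CACTVS code: structures are plain strings, the two toy transforms and the scoring table are hypothetical stand-ins for the 21 SMIRKS rules and the real scoring function, and counting one "operation" per transform application is an assumption.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set, Tuple

Transform = Callable[[str], Iterable[str]]


def enumerate_tautomers(
    start: str, transforms: List[Transform], max_operations: int = 1000
) -> Tuple[Set[str], bool]:
    """Apply every transform to the start structure and to all generated
    tautomers. Returns (tautomers, exhaustive); exhaustive is False when
    the operation budget ran out first."""
    seen: Set[str] = {start}
    queue = deque([start])
    operations = 0
    while queue:
        current = queue.popleft()
        for transform in transforms:
            if operations >= max_operations:
                return seen, False
            operations += 1
            for product in transform(current):
                if product not in seen:
                    seen.add(product)
                    queue.append(product)
    return seen, True


def canonical_tautomer(tautomers: Set[str], score: Callable[[str], float]) -> str:
    """Pick the highest-ranked tautomer; sorting first makes ties deterministic."""
    return max(sorted(tautomers), key=score)


# Toy stand-ins (hypothetical): two labels that interconvert.
def keto_to_enol(s: str) -> List[str]:
    return ["enol"] if s == "keto" else []


def enol_to_keto(s: str) -> List[str]:
    return ["keto"] if s == "enol" else []


TOY_SCORES: Dict[str, float] = {"keto": 2.0, "enol": 1.0}
tautomers, exhaustive = enumerate_tautomers("keto", [keto_to_enol, enol_to_keto])
best = canonical_tautomer(tautomers, TOY_SCORES.__getitem__)
```

Returning the `exhaustive` flag alongside the set makes it possible to record the cases where the budget cut the enumeration short.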

28
NCI/CADD Chemical Structure Database
Tautomer Analysis
  • 21 SMIRKS transform rules

rule 1: 1.3 (thio)keto/(thio)enol
rule 2: 1.5 (thio)keto/(thio)enol
rule 3: simple (aliphatic) imine
rule 4: special imine
rule 5: 1.3 aromatic heteroatom H shift
rule 6: 1.3 heteroatom H shift
rule 7: 1.5 (aromatic) heteroatom H shift (1)
rule 8: 1.5 aromatic heteroatom H shift (2)
rule 9: 1.7 (aromatic) heteroatom H shift
rule 10: 1.9 (aromatic) heteroatom H shift
rule 11: 1.11 (aromatic) heteroatom H shift
rule 12: furanones
rule 13: keten/ynol exchange
rule 14: ionic nitro/aci-nitro
rule 15: pentavalent nitro/aci-nitro
rule 16: oxim/nitroso
rule 17: oxim/nitroso via phenol
rule 18: cyanic/iso-cyanic acids
rule 19: formamidinesulfinic acids
rule 20: isocyanides
rule 21: phosphonic acids
29
NCI/CADD Chemical Structure Database
Tautomer Analysis
starting from the set of FICuS parent structures,
we systematically generated all tautomers based
on the 21 SMIRKS rule set available in CACTVS
70.6 million FICuS parent structures (2009 DB version)
generated 680 million tautomers
for 1.7% of the FICuS parent structures the
enumeration was not exhaustive
30
NCI/CADD Chemical Structure Database
Tautomer Analysis
[Histogram: tautomeric overlap within each individual database release (%), x-axis 0.0 to 2.0; frequency (number of database releases), y-axis 0 to 90]
average: 0.3% of original structure records
31
NCI/CADD Chemical Structure Database
Tautomer Analysis
[Histogram: tautomeric overlap within each individual database release (%), x-axis 0.0 to 2.0; frequency (number of database releases), y-axis 0 to 90]
databases annotated on the histogram (groups as shown on the slide):
  • Ambinter, BIND, BindingDB, ChemNavigator, KEGG, NCI
    Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB,
    Thomson Pharma, Wombat
  • Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia
    University Molecular Screening Center, EPA DSSTox, Specs
  • NCI/DTP, PASS Training Set, SGC-Ox
  • ChemDB, ZINC
  • ChEBI, ChemSpider
average: 0.3% of original structure records
32
NCI/CADD Chemical Structure Database
Tautomer Analysis
[Histogram: occurrence of tautomerism-critical molecules within each individual database release (%), x-axis 0.5 to 24.5; frequency (number of database releases), y-axis 0 to 30]
average: 9.5% of FICuS parent structures
(percentage of FICuS parent structures in each
database release occurring somewhere in CSDB with
a conflict)
33
Example for a Tautomer Conflict
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
  • HPMBP is used in liquid membranes (selective
    removal of metal ions)
  • selectivity and efficiency depend on the
    tautomeric form of HPMBP, which itself depends on
    solvent and concentration of HPMBP

He, D.; Li, Z.; Ma, M.; Huang, J.; Yang, Y. Study of
extraction characteristics of HPMBP. 1. Tautomer
and extraction characteristics. J. Chem. Eng.
Data 2009, 54(10), 2944-2947
34
Example for a Tautomer Conflict
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
CACTVS generates 7 tautomers
canonical tautomer by CACTVS
5 tautomers have potential stereo center on atoms
or bonds
35
Example for a Tautomer Conflict
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
3 tautomers have CAS Registry Numbers assigned
4551-69-1: 859 references
33064-14-1: 49 references (no stereo)
127117-31-1: 3 references (Z)
36
Example for a Tautomer Conflict
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
occurrences in databases indexed in CSDB
[Figure: HPMBP tautomer structures, some with R/S or E/Z stereo variants, each labeled with the number of databases containing it]
6 databases
16 databases (no stereo), 3 databases (R), 2 databases (S)
12 databases
1 database (no stereo)
37
Example for a Tautomer Conflict
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
occurrences in databases
6 databases: Ambinter, ChemDB, ChemSpider, DiscoveryGate,
ChemNavigator, Thomson Pharma
16 databases (no stereo): ACD 3D, ACX, Ambinter, BioByte QSAR,
ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA
GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM
ChemIDplus, Sigma-Aldrich, Thomson Pharma
3 databases (R): ChemSpider, ECOTOX, ZINC
2 databases (S): ChemSpider, ZINC
12 databases: ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB,
ChemSpider, ChemNavigator, MLSMR, NIAID, Scripps Screening
Center, Thomson Pharma, ZINC
1 database (no stereo): ChemDB
38
Scaffold Analysis
39
NCI/CADD Chemical Structure Database
Scaffold Analysis
[Figure: example structure with its derived scaffolds]
molecular scaffold tree (level 1, level 2)
Schuffenhauer et al. J. Chem. Inf. Model. 2007,
47, 47-58
simple scaffold
Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
archetype scaffold
Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
40
NCI/CADD Chemical Structure Database
Scaffold Analysis
CSDB
uuuuu compound set
76.2 million
41
NCI/CADD Chemical Structure Database
Scaffold Analysis
CSDB uuuuu compound set: 76.2 million
molecular scaffold tree (level 1, level 2): 8.1 million scaffolds
simple scaffold: 6.8 million scaffolds
archetype scaffold: 0.8 million scaffolds
42
NCI/CADD Chemical Structure Database
Scaffold Analysis
[Chart: number of unique scaffolds per hierarchy level of the molecular scaffold tree; x-axis: Hierarchy Level (0 to 10); y-axes: Number of Unique Scaffolds (in millions, 0 to 8.0) and Number of Unique Structures (in millions, 0 to 80.0)]
CSDB uuuuu compound set: 76.2 million
molecular scaffold tree (level 1, level 2): 8.1 million scaffolds
43
Atom Neighborhoods
44
Multilevel Neighborhoods of Atoms (MNA)
NCI/CADD Chemical Structure Database
MNA level 1:
HC, HO, CHCC, CHCN, CCCC, CCOO, NCC, OHC, OC
MNA level 2:
C(C(CC-H)C(CC-C)-H(C))
C(C(CC-H)C(CN-H)-H(C))
C(C(CC-H)C(CN-H)-C(C-O-O))
C(C(CC-H)N(CC)-H(C))
C(C(CC-C)N(CC)-H(C))
N(C(CN-H)C(CN-H))
-H(C(CC-H))
-H(C(CN-H))
-H(-O(-H-C))
-C(C(CC-C)-O(-H-C)-O(-C))
-O(-H(-O)-C(C-O-O))
-O(-C(C-O-O))
Filimonov D., Poroikov V., Borodina Yu.,
Gloriozova T. J. Chem. Inf. Comput. Sci. 1999,
39 (4), 666-670.
45
Multilevel Neighborhoods of Atoms (MNA)
NCI/CADD Chemical Structure Database
46
Multilevel Neighborhoods of Atoms (MNA)
NCI/CADD Chemical Structure Database
Unique MNAs
level 1: 13,426
level 2: 918,516
47
Multilevel Neighborhoods of Atoms (MNA)
NCI/CADD Chemical Structure Database
Unique MNAs
level 1: 13,426 (1.3 billion relationships, 17 MNAs per uuuuu parent structure)
level 2: 918,516 (2.3 billion relationships, 30 MNAs per uuuuu parent structure)
48
Multilevel Neighborhoods of Atoms (MNA)
NCI/CADD Chemical Structure Database
Unique MNAs
level 1: 13,426 (1.3 billion relationships, 17 MNAs per uuuuu parent structure)
level 2: 918,516 (2.3 billion relationships, 30 MNAs per uuuuu parent structure)
surprising: 424,784 MNAs (level 2) are exclusive
to a set of 1.3 million structures in ChemSpider
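
The per-structure averages can be cross-checked against the rounded totals. This is a back-of-the-envelope sketch that assumes the averages were taken over the 76.2 million uuuuu parent structures reported for CSDB; the inputs are the rounded figures from the slides, so only the rounded results are meaningful.

```python
# Rounded figures from the slides.
UUUUU_PARENT_STRUCTURES = 76.2e6
RELATIONSHIPS = {"level 1": 1.3e9, "level 2": 2.3e9}


def mnas_per_structure(relationships: float, structures: float) -> float:
    """Average number of structure-to-MNA relationships per parent structure."""
    return relationships / structures


averages = {
    level: mnas_per_structure(count, UUUUU_PARENT_STRUCTURES)
    for level, count in RELATIONSHIPS.items()
}
for level, value in averages.items():
    print(f"{level}: about {value:.0f} MNAs per uuuuu parent structure")
```

Under that assumption the rounded results agree with the 17 and 30 MNAs per structure quoted on the slide.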
49
NCI/CADD Web Resources
Chemical Structure Web Services
external (web) services
Chemical Identifier Resolver
NCI/CADD web service
NCI/CADD web service
http
Chemical Structure Web Services
other software packages
CACTVS
e.g. OPSIN
NCI/CADD Chemical Structure Database (CSDB)
50
NCI/CADD Web Resources
Chemical Identifier Resolver
http://www.akosgmbh.eu/globalsearch/index.htm
gChem
Virtual Molecular Model Kit
http://chemagic.com/web_molecules/script_page_large.aspx
CACTVS
IUPHAR DATABASE http://www.iuphar-db.org
http://www.xemistry.com
51
Work in progress
Chemical Structure Lookup Service II
53
Acknowledgments
Thanks to all database providers!
University of Cambridge: Daniel Lowe, Peter Murray-Rust
CADD Group, CBL, NCI: Igor Filippov
ChemNavigator: Scott Hutton, Tad Hurst
ChemSpider: Antony Williams, Valery Tkachenko
Noel O'Boyle (University College Cork, Ireland)
Richard Apodaca (Metamolecular)
Hans-Juergen Himmler
Our web site:
http://cactus.nci.nih.gov
54
NCI/CADD Web Resources
Chemical Identifier Resolver
http://cactus.nci.nih.gov/chemical/structure
http://cactus.nci.nih.gov/blog
55
Acknowledgments - Software
CACTVS
Python Web Framework
Peter Ertl
Python SQL library
Javascript library