Title: I571/ChE531 2006: 2D Chemical Database Applications
1 I571/ChE531 2006: 2D Chemical Database Applications
- David Wild
- Assistant Professor, Life Science Informatics
- Indiana University School of Informatics
- djwild_at_indiana.edu
2 What we'll cover today
- Basic 2D chemical structure database theory
- Types of searching
- Substructure querying with SMARTS
- Using fingerprints for similarity searching
- Demonstration of substructure and similarity searching
- Molinspiration
- PubChem
- Available 2D database systems
- ISIS/Base, Oracle cartridges
- Demonstration of cartridge searching
- DayCart
3 Increasing size of chemical information
- Chemical Abstracts Service database (CAS) now holds information on 29 million compounds
- PubChem contains 12 million compounds
- Pharmaceutical companies typically have over 1 million proprietary compounds
- Other companies are producing large numbers of compounds through combinatorial chemistry
- Typically several hundred thousand compounds would be screened for a particular project
4 Combinatorial Chemistry
- By combining molecular building blocks, we can create very large numbers of different molecules very quickly.
- Usually involves a scaffold molecule, and sets of compounds which can be reacted with the scaffold to place different structures on attachment points.
- The set of molecules that is created (or which can be theoretically created) by a combinatorial experiment is called a combinatorial library.
- Feeds the capacity of High Throughput Screening
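To make the "very large numbers" point concrete, here is a minimal Python sketch of how a virtual combinatorial library grows as the product of the building-block sets. The scaffold and building-block names are hypothetical placeholders, not real reagents, and no chemistry is performed; the code only enumerates combinations.

```python
from itertools import product

# Hypothetical placeholder names; a real library would use actual reagents.
scaffold = "scaffold"
r1_blocks = ["methyl", "ethyl", "phenyl"]   # candidates for attachment point 1
r2_blocks = ["amine", "hydroxyl"]           # candidates for attachment point 2


def enumerate_library(scaffold, *substituent_sets):
    """Return every combination of one building block per attachment point."""
    return [(scaffold,) + combo for combo in product(*substituent_sets)]


library = enumerate_library(scaffold, r1_blocks, r2_blocks)
print(len(library))  # 3 * 2 = 6 virtual products
```

With three attachment points and 100 building blocks each, the same enumeration would give 100 * 100 * 100 = 1,000,000 virtual products.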
5 2D structure databases
- Special kinds of searching for 2D structure databases
- Specifying substructure queries with SMARTS
- Using fingerprints for similarity searching
- Commercially available databases
6 Types of 2D structure searching
- Structure search
- Is this structure in the database?
- Substructure search
- Find me all of the structures that contain this substructure
- Similarity search
- Find me all of the structures that are similar to this one
7 Structure search
- Looking for a particular structure in a database
- Searching proprietary databases or commercial databases
- e.g. is this structure in the database?
- Mathematically, the connection table can be considered a graph, and this is a graph isomorphism problem (solved)
8 Substructure search
- Looking for all structures that contain one or more particular structural fragments
- e.g. which structures contain a nitro group?
- Mathematically, this is a subgraph isomorphism problem (Ullmann algorithm)
- Requires a way of representing query fragment(s)
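As an illustration of the subgraph matching idea, the sketch below uses plain backtracking over a connection table. It is not the Ullmann algorithm itself (which adds matrix-based pruning), and it makes simplifying assumptions: atoms are compared by element symbol only, and bond orders, charges and hydrogens are ignored. The helper names `make_graph` and `has_substructure` are made up for this example.

```python
def make_graph(atoms, bonds):
    """Build a connection table: element labels plus an adjacency map."""
    adj = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return atoms, adj


def has_substructure(query, target):
    """Return True if every query atom and bond can be mapped into the target."""
    q_atoms, q_adj = query
    t_atoms, t_adj = target

    def extend(mapping):
        q = len(mapping)                      # index of the next query atom to place
        if q == len(q_atoms):
            return True                       # all query atoms placed
        for t in range(len(t_atoms)):
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue                      # target atom already used or wrong element
            # Every already-mapped neighbour of q must be bonded to t in the target.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]                # backtrack
        return False

    return extend({})


# Query: nitro group (N bonded to two O), bond orders ignored.
nitro = make_graph(["N", "O", "O"], [(0, 1), (0, 2)])
nitromethane = make_graph(["C", "N", "O", "O"], [(0, 1), (1, 2), (1, 3)])
ethanol = make_graph(["C", "C", "O"], [(0, 1), (1, 2)])

print(has_substructure(nitro, nitromethane))  # True
print(has_substructure(nitro, ethanol))       # False
```

Production systems typically screen out most candidates with fingerprints first, so that this expensive atom-by-atom step runs on only a small fraction of the database.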
9 Similarity search
- Looking for all the structures in a database that are highly similar to a given structure
- e.g. show me structures with a similarity greater than 0.7 to this molecule
- Requires a way of measuring similarity
- Solved using fingerprint representations and similarity coefficients
10 www.molinspiration.com/cgi-bin/search
11 Substructure search results
12 Similarity search results
13 Specifying a substructure query with SMARTS
- SMARTS is a superset of SMILES that is extended to allow partial structures (substructures) and optional parts of molecules to be represented
- Simple example
- *C(=O)O
- where the * represents an attachment point (i.e. any number of any atoms)
- More information
- http://www.daylight.com/meetings/summerschool01/course/basics/smarts.html
- http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
14 SMARTS special characters (examples)
15 SMARTS examples
16 Try out a SMARTS search
- DepictMatch
- http://www.daylight.com/cgi-bin/contrib/depictmatch.cgi
- Enter a set of SMILES and a SMARTS, and any part of each SMILES structure that matches the SMARTS is highlighted
- As an example, we'll use the sample dataset described on the following two slides, and use C(=O)O (carboxyl group) and [R]C(=O)O (carboxyl attached to a ring) as our SMARTS
17 Sample dataset
Acetaminophen
Alprenolol
Amphetamine
Captopril
Chlorpromazine
Diclofenac
Gabapentin
Salicylate
18 Sample Dataset SMILES file
- CC(=O)Nc1ccc(O)cc1 Acetaminophen
- CC(C)NCC(O)COc1ccccc1CC=C Alprenolol
- CC(N)Cc1ccccc1 Amphetamine
- CC(CS)C(=O)N1CCCC1C(=O)O Captopril
- CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13 Chlorpromazine
- OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl Diclofenac
- NCC1(CC(=O)O)CCCCC1 Gabapentin
- COC(=O)c1ccccc1O Salicylate
19 Measuring similarity between molecules
- Similar Property Principle: molecules with similar structure are likely to have similar biological activity
- Generally the Tanimoto Coefficient or Euclidean Distance between fingerprints is used
20 Fingerprint Similarity: Tanimoto
- Also known as Jaccard Coefficient
- 1s in common / 1s set in either fingerprint
- 0s are treated as not significant
- Similarity is between 0 (dissimilar) and 1 (same)
- Good cutoff for likely biologically similar molecules is 0.7 or 0.8
- Tanimoto similarity = c / (a + b - c), where c = 1s in common, a = 1s in fingerprint A, b = 1s in fingerprint B
Example
A = 101101011, B = 011101101
c = 4, a = 6, b = 6
Tanimoto similarity = 4 / (6 + 6 - 4) = 0.5
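The worked example can be checked with a few lines of Python. This is a minimal sketch that assumes fingerprints are given as equal-length strings of "0" and "1"; real systems use packed bit vectors, and the function name is chosen only for illustration.

```python
def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto (Jaccard) coefficient of two equal-length bit strings."""
    a = fp_a.count("1")                                         # 1s in fingerprint A
    b = fp_b.count("1")                                         # 1s in fingerprint B
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == y == "1")     # 1s in common
    if a + b - c == 0:
        return 0.0   # assumption: two all-zero fingerprints are scored as 0
    return c / (a + b - c)


print(tanimoto("101101011", "011101101"))  # 0.5
```

Positions where both fingerprints hold a 0 never enter the calculation, which is what "0s are treated as not significant" means in practice.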
21 Fingerprint similarity: Euclidean
- Pythagorean distance
- For binary dimensions, equivalent to the square root of the Hamming distance (i.e. square root of the number of bits that are different)
- 0s are treated as significant
- Smaller values mean more similar
- Example
- 101101011
- 011101101
- 4 bits differ (positions 1, 2, 7 and 8)
- Euclidean distance = sqrt(4) = 2.0
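The same pair of fingerprints can be run through a short Python sketch of the Euclidean distance. As before, it assumes fingerprints are equal-length strings of "0" and "1", and the function names are illustrative only.

```python
import math


def hamming(fp_a: str, fp_b: str) -> int:
    """Number of bit positions at which the two fingerprints differ."""
    return sum(1 for x, y in zip(fp_a, fp_b) if x != y)


def euclidean(fp_a: str, fp_b: str) -> float:
    """Euclidean distance between binary fingerprints: sqrt of the Hamming distance."""
    return math.sqrt(hamming(fp_a, fp_b))


print(hamming("101101011", "011101101"))    # 4
print(euclidean("101101011", "011101101"))  # 2.0
```

Unlike Tanimoto, a position that is 0 in one fingerprint and 1 in the other counts in the same way whichever fingerprint holds the 0, so the absence of a feature carries as much weight as its presence.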
22 References on Similarity Searching
- Chemical Similarity Searching, P. Willett, J.M. Barnard, G.M. Downs, Journal of Chemical Information and Computer Sciences, 1998, 38, 983-996 (available as a PDF on Oncourse)
- And check out the JCIM ASAP!
23 Basic facilities required by Pharmaceutical companies
- Structure Registration
- Being able to store information about new compounds that are created in-house or brought in from outside
- Structure and Data Searching
- Being able to search for particular compounds or groups of compounds, based on structure, biological data, test results, etc., and export this information
24 Structure Registration
- Submission of new compounds made by a chemist to a proprietary database
- Chemist will generally draw in structure, specifying stereochemistry, etc.
- Other information provided: chemist's name, notebook numbers, reaction protocol, safety information, amounts made, barcode, etc.
- Will usually be inspected by a registrar, and assigned a unique identifier (e.g. PF123456 for Pfizer, A12345 for Abbott, etc.)
- Batch registration (e.g. for combinatorial libraries) will also be provided
- May be interfaced with a LIMS system
25 Structure Registration Systems
- new compounds need to be added regularly
- used to be done by chemical information specialists
- now frequently done directly by bench chemists
- registration system must
- check consistency of input data
- e.g. compare molecular formula with structure
- check that compound is really new
- different ways of handling tautomers, salts, stereoisomers etc.
- assign registry number
- add supplementary data (melting point etc.)
- make data immediately available for search
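The "check that compound is really new" and "assign registry number" steps can be sketched as a small in-memory registry. This is a toy illustration under strong assumptions: the input string is taken to be an already-canonical structure key, so tautomers, salts and stereoisomers are not normalised, and the `Registry` class, the "X" prefix and the six-digit numbering are all invented for the example.

```python
class Registry:
    """Toy compound registry keyed by an (assumed canonical) structure string."""

    def __init__(self, prefix="X"):
        self.prefix = prefix
        self._by_structure = {}   # structure key -> stored record
        self._next = 1            # next registry number to hand out

    def register(self, canonical_smiles, **data):
        """Return (registry_id, is_new); duplicates get their existing id back."""
        if canonical_smiles in self._by_structure:
            return self._by_structure[canonical_smiles]["id"], False
        reg_id = f"{self.prefix}{self._next:06d}"
        self._next += 1
        # Supplementary data (e.g. melting point) is stored with the record.
        self._by_structure[canonical_smiles] = {"id": reg_id, **data}
        return reg_id, True


registry = Registry()
first = registry.register("CC(N)Cc1ccccc1", name="Amphetamine")
again = registry.register("CC(N)Cc1ccccc1")
print(first)  # ('X000001', True)
print(again)  # ('X000001', False)
```

The hard part in a real system is producing the key: two drawings of the same compound must canonicalise to the same string before the duplicate check means anything.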
26 Structure Searching System
- Structure, substructure and similarity searching based on 2D structure, or other text/numeric fields
- Exporting of lists and tables of structures with associated data (biological activities,
27 MDL ISIS
- Currently used by most pharmas as chemical informatics backbone
- Provides server software (back-end) for maintaining databases of chemical structures and other data
- Provides client software (front-end) called ISIS/Base for registration, searching, etc.
- ISIS/Base and ISIS/Draw can be installed on chemists' desktop machines
28 MDL ISIS
- Can maintain multiple views of the data, e.g. for specific projects
- Interface rather clumsy, and technology showing its age
- Maintaining two separate databases is complex
- Oracle with chemistry cartridges slowly replacing back end
- More information: www.mdl.com
29 MDL ISIS
[Diagram: three ISIS/Base clients; ISIS/Host; ISIS database for chemical structures; Oracle database for other data]
30 MDL ISIS/Base
31 Web / Oracle systems
[Diagram: three web browsers; web applications server; Oracle database with chemistry cartridge; SQL]
32 Web / Oracle Systems
- Advantages
- Single database for structures and data
- No software to install on client machines (except maybe plug-ins like Chime)
- Not dependent on (expensive) contract with MDL
- Highly customizable
- Disadvantages
- Requires extensive web-based interface software to be written, for registration, searching, etc.
- Company will have to maintain system internally
- Requires current ISIS system to be abandoned
33 Other systems available
- Most of the chemical informatics companies do not provide a complete package, but have parts (e.g. database systems). Some exceptions are:
- IDBS ActivityBase
- http://www.id-bs.com/products/abase/
- Accelrys DS Accord Enterprise Informatics
- http://www.accelrys.com/aei/index.html
34 Chemical Searching using Oracle
- Oracle 8 and higher have the ability to add functionality using cartridges
- These cartridges can add new data types and new functions to SQL
- The last few years have seen the release of several chemistry cartridges for Oracle
- Informix has a similar DataBlade but this has not been as exploited by the chemical community
35 Chemical Searching using Oracle / SQL
- Oracle 8 and higher have the ability to add functionality using cartridges
- These cartridges can add new data types and new functions to SQL
- Other databases like PostgreSQL can also use cartridges
- The last few years have seen the release of several chemistry cartridges for Oracle and other databases
- Informix has a similar DataBlade but this has not been as exploited by the chemical community
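The idea of adding new functions to SQL can be tried without Oracle. The sketch below is only a loose analogy: it uses SQLite's user-defined functions through Python's standard `sqlite3` module, not an Oracle cartridge, and the table, compound names and fingerprints are invented toy data stored as bit strings.

```python
import sqlite3


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length bit strings."""
    a = fp_a.count("1")
    b = fp_b.count("1")
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == y == "1")
    return c / (a + b - c) if (a + b - c) else 0.0


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call tanimoto(x, y).
conn.create_function("tanimoto", 2, tanimoto)

conn.execute("CREATE TABLE test (name TEXT, fp TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [("cpd_A", "101101011"), ("cpd_B", "011101101"), ("cpd_C", "101101010")],
)

rows = conn.execute(
    "SELECT name FROM test WHERE tanimoto(fp, ?) > 0.6 ORDER BY name",
    ("101101011",),
).fetchall()
print(rows)  # [('cpd_A',), ('cpd_C',)]
```

A chemistry cartridge goes further than this: it also supplies index types, so that a similarity or substructure condition does not have to evaluate the function on every row.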
36 Chemistry Cartridges
- Daylight DayCart
- http://www.daylight.com/products/daycart.html
- Tripos Auspyx
- http://www.tripos.com/sciTech/inSilicoDisc/chemInfo/auspyx.html
- Accelrys Accord for Oracle
- http://www.accelrys.com/accord/oracle.html
- MDL Direct
- http://www.mdl.com/products/framework/rel_chemistry_server/index.jsp
- IDBS ActivityBase
- http://www.id-bs.com/products/abase/
- gNova CHORD
- http://www.gnova.com
- JChem Cartridge
- http://www.jchem.com
37 Example - DayCart
- Store SMILES as string (VARCHAR2) in Oracle database
- Cartridge provides extra functions and extensions to functions for searching based on chemical structures
- Structure search implemented by EXACT function
- Substructure search implemented by MATCHES function
- Similarity search implemented by TANIMOTO and EUCLID functions
38 Sample dataset
Acetaminophen
Alprenolol
Amphetamine
Captopril
Chlorpromazine
Diclofenac
Gabapentin
Salicylate
39 Oracle table Test for sample dataset
Smiles                            Name            LogP
------                            ----            ----
CC(=O)Nc1ccc(O)cc1                Acetaminophen   0.27
CC(C)NCC(O)COc1ccccc1CC=C         Alprenolol      2.81
CC(N)Cc1ccccc1                    Amphetamine     1.76
CC(CS)C(=O)N1CCCC1C(=O)O          Captopril       0.84
CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13  Chlorpromazine  5.20
OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl    Diclofenac      4.02
NCC1(CC(=O)O)CCCCC1               Gabapentin      -1.37
COC(=O)c1ccccc1O                  Salicylate      2.60
40 DayCart structure search using SQL
select * from Test where
  exact(Smiles, 'CC(N)Cc1ccccc1') = 1

Smiles                            Name            LogP
------                            ----            ----
CC(N)Cc1ccccc1                    Amphetamine     1.76
41 DayCart substructure search
select * from Test where
  matches(Smiles, 'C(=O)O') = 1

Smiles                            Name            LogP
------                            ----            ----
CC(CS)C(=O)N1CCCC1C(=O)O          Captopril       0.84
OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl    Diclofenac      4.02
NCC1(CC(=O)O)CCCCC1               Gabapentin      -1.37
COC(=O)c1ccccc1O                  Salicylate      2.60
42 Substructure search for carboxylic acid
Acetaminophen
Alprenolol
Amphetamine
Captopril
Chlorpromazine
Diclofenac
Gabapentin
Salicylate
43 DayCart substructure / value search
select * from Test where
  (matches(Smiles, 'C(=O)O') = 1)
  AND (LogP > 1.0)

Smiles                            Name            LogP
------                            ----            ----
OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl    Diclofenac      4.02
COC(=O)c1ccccc1O                  Salicylate      2.60
44 DayCart similarity search
Query: Aspirin
select * from TEST where
  tanimoto(SMILES, 'CC(=O)Oc1ccccc1C(=O)O') > 0.6

SMILES                            NAME            LOGP
------                            ----            ----
COC(=O)c1ccccc1O                  Salicylate      2.60
CC(=O)Nc1ccc(O)cc1                Acetaminophen   0.27
CC(N)Cc1ccccc1                    Amphetamine     1.76
45 Similarity search for carboxylic acid
Acetaminophen
Alprenolol
Amphetamine
Captopril
Chlorpromazine
Diclofenac
Gabapentin
Salicylate
46 More examples of DayCart
- http://www.daylight.com/meetings/summerschool02/course/admin/daycart_hints.html
47 Follow-up
- Try a substructure and similarity search yourself at www.molinspiration.com/cgi-bin/search. Note any structures you find particularly surprising on the similarity search. If possible, capture a screenshot of the results (CTRL+PRNT SCRN on PCs to copy). Email notes and/or screenshots to djwild_at_indiana.edu (5% of grade)