I571ChE531 2006 2D Chemical Database Applications - PowerPoint PPT Presentation

1
I571/ChE531 2006: 2D Chemical Database Applications
  • David Wild
  • Assistant Professor, Life Science Informatics
  • Indiana University School of Informatics
  • djwild_at_indiana.edu

2
What we'll cover today
  • Basic 2D chemical structure database theory
  • Types of searching
  • Substructure querying with SMARTS
  • Using fingerprints for similarity searching
  • Demonstration of substructure and similarity
    searching
  • Molinspiration
  • PubChem
  • Available 2D database systems
  • ISIS/Base, Oracle cartridges
  • Demonstration of Cartridge searching
  • Daycart

3
Increasing size of chemical information
  • Chemical Abstracts Service database (CAS) now
    holds information on 29 million compounds
  • PubChem contains 12 million compounds
  • Pharmaceutical companies typically have over 1
    million proprietary compounds
  • Other companies are producing large numbers of
    compounds through combinatorial chemistry
  • Typically several hundred thousand compounds
    would be screened for a particular project

4
Combinatorial Chemistry
  • By combining molecular building blocks, we can
    create very large numbers of different molecules
    very quickly.
  • Usually involves a scaffold molecule, and sets
    of compounds which can be reacted with the
    scaffold to place different structures on
    attachment points.
  • The set of molecules that is created (or which
    can be theoretically created) by a combinatorial
    experiment is called a combinatorial library.
  • Feeds the capacity of High Throughput Screening
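
To give a feel for how quickly a library grows, the sketch below enumerates a toy combinatorial library by substituting building blocks into a scaffold. The scaffold template, the placeholder names R1 and R2, and the building-block lists are all made up for illustration; a real enumeration would use a cheminformatics toolkit rather than plain string substitution.

```python
from itertools import product

# Hypothetical scaffold: a benzene ring with two attachment points,
# written as a SMILES template with named placeholders.
SCAFFOLD = "c1cc({R1})ccc1{R2}"

# Hypothetical building blocks (SMILES fragments) for each attachment point.
R1_GROUPS = ["C", "CC", "O", "N"]
R2_GROUPS = ["Cl", "F", "C(=O)O"]


def enumerate_library(scaffold, r1_groups, r2_groups):
    """Return every scaffold/building-block combination as a SMILES string."""
    return [
        scaffold.format(R1=r1, R2=r2)
        for r1, r2 in product(r1_groups, r2_groups)
    ]


library = enumerate_library(SCAFFOLD, R1_GROUPS, R2_GROUPS)
print(len(library))  # 4 x 3 building blocks give 12 products
print(library[0])
```

Adding a third attachment point with ten building blocks would already give 120 products, which is why the theoretical size of a library can far exceed what is actually made.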

5
2D structure databases
  • Special kinds of searching for 2D structure
    databases
  • Specifying substructure queries with SMARTS
  • Using fingerprints for similarity searching
  • Commercially available databases

6
Types of 2D structure searching
  • Structure search
  • Is this structure in the database?
  • Substructure search
  • Find me all of the structures that contain this
    substructure
  • Similarity search
  • Find me all of the structures that are similar
    to this one

7
Structure search
  • Looking for a particular structure in a database
  • Searching proprietary databases or commercial
    databases
  • e.g. is this structure in the database?
  • Mathematically, the connection table can be
    considered a graph, and this is a graph
    isomorphism problem (solved)

8
Substructure search
  • Looking for all structures that contain one or
    more particular structural fragments
  • e.g. which structures contain a nitro group?
  • Mathematically, this is a subgraph isomorphism
    problem (Ullmann algorithm)
  • Requires a way of representing query fragment(s)
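
A minimal sketch of substructure matching as a subgraph search is shown below. It is a plain backtracking search, not the Ullmann algorithm itself (which adds a refinement step to prune candidates), and it makes simplifying assumptions: atoms are just element symbols, hydrogens are omitted, and bond orders are ignored. The function and variable names are illustrative only.

```python
def bond_set(pairs):
    """Turn (i, j) index pairs into an orientation-free set of bonds."""
    return {frozenset(p) for p in pairs}


def find_substructure(query_atoms, query_bonds, mol_atoms, mol_bonds):
    """Return one mapping (query index -> molecule index) as a list, or None."""

    def extend(mapping):
        q = len(mapping)  # index of the next query atom to place
        if q == len(query_atoms):
            return list(mapping)
        for m in range(len(mol_atoms)):
            if m in mapping or mol_atoms[m] != query_atoms[q]:
                continue
            # Every query bond from q to an already placed atom
            # must also exist between the mapped molecule atoms.
            ok = all(
                frozenset((m, mapping[p])) in mol_bonds
                for p in range(q)
                if frozenset((q, p)) in query_bonds
            )
            if ok:
                result = extend(mapping + [m])
                if result is not None:
                    return result
        return None

    return extend([])


# Query fragment: nitro group, an N bonded to two O atoms.
nitro_atoms = ["N", "O", "O"]
nitro_bonds = bond_set([(0, 1), (0, 2)])

# Toy molecules: nitromethane and methylamine.
nitromethane = (["C", "N", "O", "O"], bond_set([(0, 1), (1, 2), (1, 3)]))
methylamine = (["C", "N"], bond_set([(0, 1)]))

print(find_substructure(nitro_atoms, nitro_bonds, *nitromethane))  # [1, 2, 3]
print(find_substructure(nitro_atoms, nitro_bonds, *methylamine))   # None
```

The worst-case cost of this search grows very quickly with molecule size, which is one reason real systems screen candidates with fingerprints before attempting an atom-by-atom match.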

9
Similarity search
  • Looking for all the structures in a database that
    are highly similar to a given structure
  • e.g. show me structures with a similarity greater
    than 0.7 to this molecule
  • Requires a way of measuring similarity
  • Solved using fingerprint representations and
    similarity coefficients

10
www.molinspiration.com/cgi-bin/search
11
Substructure search results
12
Similarity search results
13
Specifying a substructure query with SMARTS
  • SMARTS is a superset of SMILES that is extended
    to allow partial structures (substructures) and
    optional parts of molecules to be represented
  • Simple example:
  • *C(=O)O
  • where the * represents an attachment point (i.e.
    any number of any atoms)
  • More information:
  • http://www.daylight.com/meetings/summerschool01/course/basics/smarts.html
  • http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html

14
SMARTS special characters (examples)
15
SMARTS examples
16
Try out a SMARTS search
  • DepictMatch:
  • http://www.daylight.com/cgi-bin/contrib/depictmatch.cgi
  • Enter a set of SMILES and a SMARTS, and any part
    of each SMILES structure that matches the SMARTS
    is highlighted
  • As an example, we'll use the sample dataset
    described on the following two slides, with
    C(=O)O (carboxyl group) and [R]C(=O)O (carboxyl
    attached to a ring) as our SMARTS

17
Sample dataset
Acetaminophen
Alprenolol
Amphetamine
Captopril
Chlorpromazine
Diclofenac
Gabapentin
Salicylate
18
Sample Dataset SMILES file
  • CC(=O)Nc1ccc(O)cc1 Acetaminophen
  • CC(C)NCC(O)COc1ccccc1CC=C Alprenolol
  • CC(N)Cc1ccccc1 Amphetamine
  • CC(CS)C(=O)N1CCCC1C(=O)O Captopril
  • CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13 Chlorpromazine
  • OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl Diclofenac
  • NCC1(CC(=O)O)CCCCC1 Gabapentin
  • COC(=O)c1ccccc1O Salicylate

19
Measuring similarity between molecules
  • Similar Property Principle: molecules with
    similar structure are likely to have similar
    biological activity
  • Generally the Tanimoto Coefficient or Euclidean
    Distance between fingerprints is used

20
Fingerprint similarity: Tanimoto
  • Also known as the Jaccard coefficient
  • 1s in common / 1s set in either fingerprint
  • 0s are treated as not significant
  • Similarity is between 0 (dissimilar) and 1 (same)
  • Good cutoff for likely biologically similar
    molecules is 0.7 or 0.8

c = 1s in common, a = 1s in fingerprint A, b = 1s in fingerprint B
Tanimoto similarity = c / (a + b - c)
Example
A = 101101011, B = 011101101
c = 4, a = 6, b = 6
Tanimoto similarity = 4 / (6 + 6 - 4) = 0.5
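
The calculation can be sketched in a few lines of Python. Fingerprints are assumed here to be equal-length strings of 0s and 1s, which is a convenience for illustration rather than how a database stores them, and the value returned for two all-zero fingerprints is a convention chosen for this sketch.

```python
def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto (Jaccard) coefficient of two bit-string fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = fp_a.count("1")                      # 1s in fingerprint A
    b = fp_b.count("1")                      # 1s in fingerprint B
    c = sum(1 for x, y in zip(fp_a, fp_b)    # 1s in common
            if x == "1" and y == "1")
    if a + b - c == 0:
        return 0.0  # assumption: two empty fingerprints score 0
    return c / (a + b - c)


print(tanimoto("101101011", "011101101"))  # 0.5, matching the slide example
```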
21
Fingerprint similarity: Euclidean
  • Pythagorean distance
  • For binary dimensions, equivalent to the square
    root of the Hamming distance (i.e. square root of
    the number of bits that are different)
  • 0s are treated as significant
  • Smaller values mean more similar
  • Example:
  • A          101101011
  • B          011101101
  • Different? xx    xx
  • Euclidean distance = sqrt(4) = 2.0
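
As a small check of the arithmetic, the sketch below computes the Hamming distance between two bit strings and takes its square root. The bit-string representation and function names are assumptions made for illustration.

```python
import math


def hamming(fp_a: str, fp_b: str) -> int:
    """Number of bit positions at which two fingerprints differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(1 for x, y in zip(fp_a, fp_b) if x != y)


def euclidean(fp_a: str, fp_b: str) -> float:
    """Euclidean distance between binary fingerprints."""
    # For 0/1 values each differing bit contributes (1 - 0)^2 = 1,
    # so the sum of squares equals the Hamming distance.
    return math.sqrt(hamming(fp_a, fp_b))


print(hamming("101101011", "011101101"))    # 4
print(euclidean("101101011", "011101101"))  # 2.0
```

Unlike the Tanimoto coefficient, this measure is a distance, so identical fingerprints score 0 and positions where both bits are 0 count as agreement.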

22
References on Similarity Searching
  • Chemical Similarity Searching, P. Willett, J.M.
    Barnard, G.M. Downs, Journal of Chemical
    Information and Computer Sciences, 1998, 38,
    983-996 (available as a PDF on Oncourse)
  • And check out the JCIM ASAP!

23
Basic facilities required by Pharmaceutical
companies
  • Structure Registration
  • Being able to store information about new
    compounds that are created in-house or brought in
    from outside
  • Structure and Data Searching
  • Being able to search for particular compounds or
    groups of compounds, based on structure,
    biological data, test results, etc., and export
    this information

24
Structure Registration
  • Submission of new compounds made by a chemist to
    a proprietary database
  • Chemist will generally draw in the structure,
    specifying stereochemistry, etc.
  • Other information provided: chemist's name,
    notebook numbers, reaction protocol, safety
    information, amounts made, barcode, etc.
  • Will usually be inspected by a registrator, and
    assigned a unique identifier (e.g. PF123456 for
    Pfizer, A12345 for Abbott, etc)
  • Batch registration (e.g. for combinatorial
    libraries) will also be provided
  • May be interfaced with a LIMS system

25
Structure Registration Systems
  • new compounds need to be added regularly
  • used to be done by chemical information
    specialists
  • now frequently done directly by bench chemists
  • registration system must
  • check consistency of input data
  • e.g. compare molecular formula with structure
  • check that compound is really new
  • different ways of handling tautomers, salts,
    stereoisomers etc.
  • assign registry number
  • add supplementary data (melting point etc.)
  • make data immediately available for search
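
The novelty check and registry-number assignment can be sketched as below. This is a toy model under a strong assumption: the caller supplies a structure string that is already canonical, so identical compounds always arrive as identical strings. Canonicalization and the handling of tautomers, salts and stereoisomers are exactly the hard parts and are not attempted here. The class name, ID prefix and fields are hypothetical.

```python
class Registry:
    """Toy compound registry keyed on an already-canonical structure string."""

    def __init__(self, prefix="ID"):
        self.prefix = prefix
        self._by_structure = {}
        self._next = 1

    def register(self, canonical_smiles, **data):
        """Return (registry_number, is_new) for the submitted structure."""
        existing = self._by_structure.get(canonical_smiles)
        if existing is not None:
            # Compound is not new: hand back the number it already has.
            return existing["id"], False
        reg_id = f"{self.prefix}{self._next:06d}"
        self._next += 1
        # Store supplementary data alongside the new registry number.
        self._by_structure[canonical_smiles] = {"id": reg_id, **data}
        return reg_id, True


registry = Registry()
print(registry.register("CC(N)Cc1ccccc1", chemist="A. Chemist"))
print(registry.register("CC(N)Cc1ccccc1"))  # same structure, same number
```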

26
Structure Searching System
  • Structure, substructure and similarity searching
    based on 2D structure, or other text/numeric
    fields
  • Exporting of lists and tables of structures with
    associated data (biological activities,

27
MDL ISIS
  • Currently used by most pharmas as chemical
    informatics backbone
  • Provides server software (back-end) for
    maintaining databases of chemical structures and
    other data
  • Provides client software (front-end) called
    ISIS/Base for registration, searching, etc
  • ISIS/Base and ISIS/Draw can be installed on
    chemists' desktop machines

28
MDL ISIS
  • Can maintain multiple views of the data, e.g.
    for specific projects
  • Interface rather clumsy, and technology showing
    its age
  • Maintaining two separate databases is complex
  • Oracle with chemistry cartridges slowly replacing
    back end
  • More information: www.mdl.com

29
MDL ISIS
[Diagram: several ISIS/Base clients connect to ISIS/Host, which uses an ISIS database for chemical structures and an Oracle database for other data]
30
MDL ISIS/Base
31
Web / Oracle systems
[Diagram: several web browsers connect to a web applications server, which talks via SQL to an Oracle database with a chemistry cartridge]
32
Web / Oracle Systems
  • Advantages
  • Single database for structures and data
  • No software to install on client machines (except
    maybe plug-ins like Chime)
  • Not dependent on (expensive) contract with MDL
  • Highly customizable
  • Disadvantages
  • Requires extensive web-based interface software
    to be written, for registration, searching, etc
  • Company will have to maintain system internally
  • Requires current ISIS system to be abandoned

33
Other systems available
  • Most of the chemical informatics companies do not
    provide a complete package, but have parts (e.g.
    database systems). Some exceptions are
  • IDBS ActivityBase
  • http://www.id-bs.com/products/abase/
  • Accelrys DS Accord Enterprise Informatics
  • http://www.accelrys.com/aei/index.html

34
Chemical Searching using Oracle
  • Oracle 8 and higher have the ability to add
    functionality using cartridges
  • These cartridges can add new data types and new
    functions to SQL
  • The last few years have seen the release of
    several chemistry cartridges for Oracle
  • Informix has a similar DataBlade, but this has
    not been as widely exploited by the chemical community

35
Chemical Searching using Oracle / SQL
  • Oracle 8 and higher have the ability to add
    functionality using cartridges
  • These cartridges can add new data types and new
    functions to SQL
  • Other databases like PostgreSQL can also use
    cartridges
  • The last few years have seen the release of
    several chemistry cartridges for Oracle and other
    databases
  • Informix has a similar DataBlade, but this has
    not been as widely exploited by the chemical community

36
Chemistry Cartridges
  • Daylight DayCart
  • http://www.daylight.com/products/daycart.html
  • Tripos Auspyx
  • http://www.tripos.com/sciTech/inSilicoDisc/chemInfo/auspyx.html
  • Accelrys Accord for Oracle
  • http://www.accelrys.com/accord/oracle.html
  • MDL Direct
  • http://www.mdl.com/products/framework/rel_chemistry_server/index.jsp
  • IDBS ActivityBase
  • http://www.id-bs.com/products/abase/
  • gNova CHORD
  • http://www.gnova.com
  • JChem Cartridge
  • http://www.jchem.com

37
Example - DayCart
  • Store SMILES as string (VARCHAR2) in Oracle
    database
  • Cartridge provides extra functions and extensions
    to functions for searching based on chemical
    structures
  • Structure search implemented by EXACT function
  • Substructure search implemented by MATCHES
    function
  • Similarity search implemented by TANIMOTO and
    EUCLID functions
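
The idea of adding chemistry-aware functions to SQL can be imitated on a small scale with Python's built-in sqlite3 module, which lets user code register new SQL functions. This is not DayCart or Oracle: the function name fp_tanimoto, the table layout, the molecule names and the 9-bit fingerprints are all invented for illustration, and a real cartridge works from the stored structures rather than from hand-written bit strings.

```python
import sqlite3


def fp_tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length bit-string fingerprints."""
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == "1" and y == "1")
    union = fp_a.count("1") + fp_b.count("1") - c
    return c / union if union else 0.0


conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL.
conn.create_function("fp_tanimoto", 2, fp_tanimoto)

conn.execute("CREATE TABLE Test (Name TEXT, Fingerprint TEXT)")
conn.executemany(
    "INSERT INTO Test VALUES (?, ?)",
    [  # made-up fingerprints, not derived from real molecules
        ("mol_a", "101101011"),
        ("mol_b", "011101101"),
        ("mol_c", "101101001"),
    ],
)

query_fp = "101101011"
rows = conn.execute(
    "SELECT Name, fp_tanimoto(Fingerprint, ?) AS sim FROM Test "
    "WHERE fp_tanimoto(Fingerprint, ?) > 0.6 ORDER BY sim DESC",
    (query_fp, query_fp),
).fetchall()
print(rows)
```

With these toy fingerprints only mol_a and mol_c pass the 0.6 cutoff; mol_b scores 0.5 and is filtered out by the WHERE clause, just as a structure condition can be combined with ordinary column conditions in one query.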

38
Sample dataset
Acetaminophen
Alprenolol
Amphetamine
Captopril
Chlorpromazine
Diclofenac
Gabapentin
Salicylate
39
Oracle table Test for sample dataset
    Smiles                            Name            LogP
    ------                            ----            ----
    CC(=O)Nc1ccc(O)cc1                Acetaminophen    0.27
    CC(C)NCC(O)COc1ccccc1CC=C         Alprenolol       2.81
    CC(N)Cc1ccccc1                    Amphetamine      1.76
    CC(CS)C(=O)N1CCCC1C(=O)O          Captopril        0.84
    CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13  Chlorpromazine   5.20
    OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl    Diclofenac       4.02
    NCC1(CC(=O)O)CCCCC1               Gabapentin      -1.37
    COC(=O)c1ccccc1O                  Salicylate       2.60

40
DayCart structure search using SQL
  • select * from Test where
  • exact(Smiles, 'CC(N)Cc1ccccc1') = 1

    Smiles                            Name            LogP
    ------                            ----            ----
    CC(N)Cc1ccccc1                    Amphetamine      1.76

41
DayCart substructure search
  • select * from Test where
  • matches(Smiles, 'C(=O)O') = 1

    Smiles                            Name            LogP
    ------                            ----            ----
    CC(CS)C(=O)N1CCCC1C(=O)O          Captopril        0.84
    OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl    Diclofenac       4.02
    NCC1(CC(=O)O)CCCCC1               Gabapentin      -1.37
    COC(=O)c1ccccc1O                  Salicylate       2.60

42
Substructure search for carboxylic acid
Acetaminophen
Alprenolol
Amphetamine
Captopril
Chlorpromazine
Diclofenac
Gabapentin
Salicylate
43
DayCart substructure / value search
  • select * from Test where
  • (matches(Smiles, 'C(=O)O') = 1)
  • AND (LogP > 1.0)

    Smiles                            Name            LogP
    ------                            ----            ----
    OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl    Diclofenac       4.02
    COC(=O)c1ccccc1O                  Salicylate       2.60

44
DayCart similarity search
Aspirin
  • select * from TEST where
  • tanimoto(SMILES, 'CC(=O)Oc1ccccc1C(=O)O') > 0.6

    SMILES                            NAME            LOGP
    ------                            ----            ----
    COC(=O)c1ccccc1O                  Salicylate       2.60
    CC(=O)Nc1ccc(O)cc1                Acetaminophen    0.27
    CC(N)Cc1ccccc1                    Amphetamine      1.76

45
Similarity search for carboxylic acid
?
?
Acetaminophen
Alprenolol
Amphetamine
Captopril
?
Chlorpromazine
Diclofenac
Gabapentin
Salicylate
46
More examples of DayCart
  • http://www.daylight.com/meetings/summerschool02/course/admin/daycart_hints.html

47
Follow-up
  • Try a substructure and similarity search
    yourself at www.molinspiration.com/cgi-bin/search.
    Note any structures you find particularly
    surprising on the similarity search. If possible,
    capture a screenshot of the results (CTRL+PRNT
    SCRN on PCs to copy). Email notes and/or
    screenshots to djwild_at_indiana.edu (5% of grade)