Chemical Informatics Data Services - PowerPoint PPT Presentation

About This Presentation
Title:

Chemical Informatics Data Services

Description:

Precursor for more sophisticated docking ... Dock arbitrary molecules to proteins on the fly. Dock PubChem and store the results ... – PowerPoint PPT presentation

Slides: 25
Provided by: Marlon91
Transcript and Presenter's Notes


1
Chemical Informatics Data Services
  • Building cyber-infrastructure to support research
    in small molecule chemistry.

2
On-Demand Data, Not On-Demand Supercomputing
  • Cyber-infrastructure (CI) is often discussed in
    terms of hardware, middleware, networking, and so
    forth.
  • But it's the data.
  • Can we make data products for the community using
    HPC?
  • See for example the Earth Systems Grid
  • Model: use computational CI to produce public
    community data and associated Web services
    (Data-Centric CI).

3
Data-Deluged Small Molecular Chemistry
  • We are at the beginning of a new era in publicly
    available information on drug-like molecules.
  • "Small" means molecular weight < 500.
  • The NIH PubChem database provides data on over
    8,000,000 molecules.
  • Free, open, community driven
  • Both web browser and web service interfaces.
  • The NIH is funding High-Throughput Screening
    Centers to analyze > 300,000 drug-like molecules.
  • Data is deposited in PubChem.
  • It is free, open to all researchers.
  • NIH PubMed provides access and advanced search on
    1,000s of medical and related journals.
  • Now this is cyber-infrastructure

4
Chemical Informatics and Cyber-Infrastructure
  • The NIH also sees chemical informatics as key to
    enabling scientific discoveries with PubChem,
    PubMed, and other services.
  • Indiana University is funded by NIH's ECCR
    program to research these issues.
  • Geoffrey Fox, PI
  • Our general program combines expertise in
  • Chemical informatics and computational chemistry.
  • Grids and high performance computing
  • Web Services and workflows.
  • Build cyber-infrastructure to enable our own
    research and research of others.

5
Web Services and Workflows
  • Web services provide simple developer
    interfaces/message types to complicated remote
    data and computing resources.
  • Workflows connect services together.
  • Models scientific usage scenarios

Taverna workflow connects remote services.
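
As a rough illustration of the idea that a workflow connects services together, the sketch below chains three local Python functions that stand in for remote services. The function names, the tiny vocabulary, and the input text are hypothetical placeholders; this is not the project's service interface and not Taverna's API.

```python
from functools import reduce
from typing import Callable, Iterable, List

# Each "service" is modeled as a function from a list of strings to a list
# of strings. These are hypothetical stand-ins, not real service calls.
Service = Callable[[List[str]], List[str]]


def extract_names(texts: List[str]) -> List[str]:
    """Stand-in for a text-mining service: split text into candidate tokens."""
    return [token for text in texts for token in text.split()]


def keep_known(tokens: List[str]) -> List[str]:
    """Stand-in for a lookup service: keep tokens found in a tiny local table."""
    known = {"ethanol", "benzene"}  # placeholder vocabulary
    return [t for t in tokens if t.lower() in known]


def deduplicate(items: List[str]) -> List[str]:
    """Stand-in for a post-processing service: drop repeats, keep order."""
    return list(dict.fromkeys(items))


def run_workflow(steps: Iterable[Service], data: List[str]) -> List[str]:
    """Feed the output of each step into the next one."""
    return reduce(lambda acc, step: step(acc), steps, data)


result = run_workflow(
    [extract_names, keep_known, deduplicate],
    ["ethanol and benzene", "more ethanol"],
)
```

Because every step shares one input and output type, steps can be reordered or swapped without touching the others. That is the property a workflow tool relies on when it wires services together.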
6
Web Services
7
Chemical Informatics on Big Red
[Diagram: databases and calculations run on Big Red]
  • Databases: NIH PubChem Database, PubMed Database, MOAD Database,
    IU's Varuna Database
  • Steps: OSCAR Text Analysis, Initial 3D Structure Calculation,
    Molecular Mechanics Calculations, Quantum Mechanics Calculations,
    Toxicity Filtering, Cluster Grouping, Docking, POV-Ray Parallel
    Rendering
  • Product databases are wrapped with Web service interfaces and are
    suitable for inclusion in Taverna workflows.
8
PubMed and Supercomputing
  • PubMed annually indexes 100,000s of medical
    journal abstracts from > 4,800 journals.
  • How can we mine this literature to find
    interesting molecules?
  • OSCAR3 is a chemistry-specific text mining tool
    developed by Peter Murray-Rust's group at
    Cambridge University.
  • It can identify chemical names in text and
    extract SMILES (that is, string representations
    of chemical structures).
  • SMILES are the starting point for many interesting
    workflows.
  • See www.chembiogrid.org/wiki for several examples.

9
OSCAR3 Mining on Big Red
  • We have used Big Red to mine the entire set of
    PubMed abstracts for chemical structures.
  • 555,007 PubMed abstracts from 2005-2006 (part)
    were run initially.
  • Currently working on the entire catalog.
  • OSCAR3
  • Extracts chemical information from text and
    produces an XML document highlighting the
    chemical information
  • SMILES extraction
  • Extracting SMILES elements from OSCAR's XML
    output files
  • Use this to drive docking and molecular modeling
    applications.
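
The SMILES extraction step can be sketched as follows. This is a minimal example, assuming that the XML marks chemical names with an `ne` element carrying a `SMILES` attribute; that element name, the attribute name, and the sample document are assumptions made for illustration and should be checked against real OSCAR3 output.

```python
import xml.etree.ElementTree as ET
from typing import List

# Invented sample document. The <ne> tag and SMILES attribute are assumed,
# not taken from the slides.
SAMPLE_XML = """<PAPER>
  <P>We dissolved <ne type="CM" SMILES="CCO">ethanol</ne> in
  <ne type="CM" SMILES="O">water</ne> and added <ne type="CM">reagent X</ne>.
  </P>
</PAPER>"""


def extract_smiles(xml_text: str, tag: str = "ne", attr: str = "SMILES") -> List[str]:
    """Collect the SMILES attribute of every tagged element that has one."""
    root = ET.fromstring(xml_text)
    return [el.attrib[attr] for el in root.iter(tag) if el.attrib.get(attr)]


smiles = extract_smiles(SAMPLE_XML)
```

Entities without a resolved structure (the third one in the sample) are skipped, so downstream docking or modeling steps only receive strings they can parse.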

10
Results from Big Red Demo
Final HTML pages
11
Big Red and PubChem
  • We are using Big Red to calculate
  • 3D structures of 8,000,000 molecules in PubChem
  • Molecular Modeling, precursor to quantum
    calculations
  • Precise structures and molecular orbitals using
    GAMESS
  • Precursor to specific property calculations.
  • Preliminary protein docking of all 1,000,000
    drug-like molecules
  • Using MOAD protein database from Heather Carlson
    (UMich) (over 10,000 curated proteins).
  • Precursor for more sophisticated docking
  • Our goal is to provide data service
    cyber-infrastructure
  • These are 3 databases wrapped as 3 web services
    and available through 3 web interfaces.
  • These enable our own research
  • See Mookie Baik's presentation at the bandwidth
    challenge
  • These may be incorporated into workflows
  • But they are being made public as well.

13
Status
  • OSCAR3 is available as a Web Service
  • 3D Structure Web service is available.
  • 85,000 structures now
  • 8,000,000 soon
  • We omit inorganic molecules, molecules with
    transition metals, etc. (about 1.0-1.5% of
    PubChem)
  • These are linked to local copy of PubChem.
  • Web service and user interfaces available.
  • Protein docking results
  • 1,000,000 drug-like molecules in PubChem have
    been docked.
  • OpenEye's FILTER finds 983,734
  • Heat shock protein (1YC4), Kinases (1R1P, ...)
  • Implicated in various cancers, researched by IU
    chemists
  • Varuna QM pedigree, structures, and orbitals.
  • Postgres database implemented (port of MS Access)
    with preliminary interface
  • 2,000 QM structure calculations underway.
  • User and service interfaces in development
  • We roughly estimate 1,000,000 molecules = 1 Big
    Red year, so we will need to move to TeraGrid.
  • We collect user interfaces to these services in a
    portal.

14
Acknowledgements and More Information
  • For more information, see
  • www.chembiogrid.org/wiki
  • The work described here was done by the following people:
  • Kevin Gilbert: SMILES to 3D conversions and
    Molecular Modeling calculations
  • Rajarshi Guha: Protein docking and 3D structure
    web services
  • Jake Kim: OSCAR3 web services and PubMed mining
  • Mookie Baik: QM calculations
  • Pulan Yu: Varuna web services

15
Additional Slides
17
CICC Project Information
  • Chemical Informatics and Cyberinfrastructure
    Collaboratory is an NIH and MS-funded research
    project to combine the CIs.
  • Project web site and more information
  • www.chembiogrid.org
  • www.chembiogrid.org/wiki
  • Team members include
  • Computer Science: Geoffrey Fox (PI), Dennis
    Gannon, Beth Plale, Marlon Pierce, Yuqing
    (Melanie) Wu, Malika Mahoui, Jake Kim
  • Chemical Informatics and Chemistry: Gary Wiggins,
    Mu-Hyun (Mookie) Baik, David Wild, Rajarshi Guha,
    Kevin Gilbert
  • I have stolen slides and content from these fine
    people.

18
PubDock: Docking PubChem
  • PubChem contains 8M molecules
  • Some are drug-like, some are not
  • Given a protein we'd like to know what molecules
    will bind to it
  • Useful in drug discovery
  • Can also be used in non-drug related scenarios
  • molecular probes
  • imaging

19
PubDock: Docking PubChem
  • We could
  • Dock arbitrary molecules to proteins on the fly
  • Dock PubChem and store the results
  • Docking PubChem is useful since it allows us to
    store and compare results
  • We currently select proteins of interest to us
  • Heat shock protein (1YC4)
  • Kinases (1R1P, ...)
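
To make the "dock once, store, and compare" option concrete, here is a small sketch using an in-memory SQLite table. The schema, the compound identifiers, and the score values are invented placeholders, and the sketch assumes that lower scores mean better binding; none of this describes the project's actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE docking_result (
           compound_id TEXT NOT NULL,
           target      TEXT NOT NULL,
           scoring_fn  TEXT NOT NULL,
           score       REAL NOT NULL,
           PRIMARY KEY (compound_id, target, scoring_fn)
       )"""
)

# Placeholder rows: IDs and scores are invented for illustration only.
rows = [
    ("cmpd-A", "1YC4", "PLP", -42.0),
    ("cmpd-B", "1YC4", "PLP", -55.5),
    ("cmpd-C", "1YC4", "PLP", -30.1),
]
conn.executemany("INSERT INTO docking_result VALUES (?, ?, ?, ?)", rows)


def top_hits(target: str, scoring_fn: str, limit: int = 2):
    """Return the best-scoring compounds, assuming lower scores are better."""
    cur = conn.execute(
        "SELECT compound_id, score FROM docking_result "
        "WHERE target = ? AND scoring_fn = ? ORDER BY score ASC LIMIT ?",
        (target, scoring_fn, limit),
    )
    return cur.fetchall()


best = top_hits("1YC4", "PLP")
```

Once results are stored, ranking compounds for a target is a query rather than a new docking run, which is the practical advantage over docking on the fly.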

20
Some Numbers
  • PubChem has 8M compounds
  • We only considered compounds that had between 10
    and 150 heavy atoms
  • 7,949,658 compounds
  • Many of these compounds are extremely flexible
  • Conformer generation takes a long time
  • To speed up the process we filtered for drug-like
    molecules

21
What Are Drug-Like Molecules?
  • Used OpenEye's FILTER program, using default
    settings
  • 150 < MW < 440
  • 10 < Heavy atom count < 25
  • -5.0 < XlogP < 4.0
  • Maximum of 3 rings
  • ....
  • Final set of 983,734 compounds
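
The listed thresholds can be expressed as a simple predicate. The sketch below is not OpenEye's FILTER: it applies only the four rules named on this slide, leaves out the rules the slide elides, and assumes the descriptor values have already been computed by some other tool. The `Descriptors` type and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties for one molecule (computed elsewhere)."""
    mol_weight: float
    heavy_atoms: int
    xlogp: float
    rings: int


def is_drug_like(d: Descriptors) -> bool:
    """Apply only the four thresholds listed on the slide.

    Strict inequalities mirror the slide; whether the real program treats
    the bounds as inclusive is not stated, so this is an assumption.
    """
    return (
        150 < d.mol_weight < 440
        and 10 < d.heavy_atoms < 25
        and -5.0 < d.xlogp < 4.0
        and d.rings <= 3
    )


# Hypothetical descriptor values chosen only to exercise the rules.
inside = Descriptors(mol_weight=300.0, heavy_atoms=20, xlogp=2.0, rings=2)
too_heavy = Descriptors(mol_weight=500.0, heavy_atoms=20, xlogp=2.0, rings=2)
too_many_rings = Descriptors(mol_weight=300.0, heavy_atoms=20, xlogp=2.0, rings=4)
```

Applied to a stream of descriptor records, a predicate like this is what reduces the full compound set to a much smaller drug-like subset before the expensive conformer generation and docking steps.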

22
Docking
  • Docking was performed using OpenEye's FRED
  • We considered 4 scoring functions
  • Chemgauss3
  • OEChemScore
  • Shapegauss
  • PLP
  • The receptor site is specified by hand
  • Very fast: 4 hours for 10,000 molecules
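
The quoted rate gives a quick estimate of the total docking cost for the filtered set. The calculation below assumes the rate of 4 hours per 10,000 molecules holds for a single worker and a single target, and that the work splits evenly across workers; the worker count of 100 is a hypothetical value, not a figure from the slides.

```python
HOURS_PER_BATCH = 4.0          # from the slide
BATCH_SIZE = 10_000            # from the slide
DRUG_LIKE_COMPOUNDS = 983_734  # final filtered set reported in the slides


def docking_hours(n_molecules: int, workers: int = 1) -> float:
    """Estimated wall-clock hours, assuming perfectly even scaling."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return n_molecules / BATCH_SIZE * HOURS_PER_BATCH / workers


serial_hours = docking_hours(DRUG_LIKE_COMPOUNDS)
parallel_hours = docking_hours(DRUG_LIKE_COMPOUNDS, workers=100)  # hypothetical
```

Under these assumptions one target takes about 393 hours on a single worker, roughly 16 days, and about 4 hours when spread over 100 workers. Each further protein target multiplies the cost, which is why the work is run on a large machine.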

23
Current Status
  • All the drug-like molecules have been docked to
    1YC4
  • Other targets are being processed
  • Database is being populated
  • Currently has 20,000 compounds
  • Database is accessible via
  • the web ( http://rguha.ath.cx/rguha/dock )
  • web services
