Title: Chemical Informatics Data Services
1Chemical Informatics Data Services
- Building cyber-infrastructure to support research
in small molecule chemistry.
2On-Demand Data, Not On-Demand Supercomputing
- Cyber-infrastructure (CI) is often discussed in terms of hardware, middleware, networking, and so forth.
- But it's the data.
- Can we make data products for the community using HPC?
- See, for example, the Earth Systems Grid.
- Model: use computational CI to make public community data and associated Web services (data-centric CI).
3Data-Deluged Small Molecular Chemistry
- We are at the beginning of a new era in publicly available information on drug-like molecules.
- "Small" means molecular weight < 500.
- The NIH PubChem database provides data on over 8,000,000 molecules.
- Free, open, community driven.
- Both web browser and web service interfaces.
- The NIH is funding High-Throughput Screening Centers to analyze > 300,000 drug-like molecules.
- Data is deposited in PubChem.
- It is free and open to all researchers.
- NIH PubMed provides access and advanced search on 1,000s of medical and related journals.
- Now this is cyber-infrastructure.
4Chemical Informatics and Cyber-Infrastructure
- The NIH also sees chemical informatics as key to enabling scientific discoveries with PubChem, PubMed, and other services.
- Indiana University is funded by NIH's ECCR program to research these issues.
- Geoffrey Fox, PI
- Our general program combines expertise in
- Chemical informatics and computational chemistry.
- Grids and high-performance computing.
- Web Services and workflows.
- Build cyber-infrastructure to enable our own research and the research of others.
5Web Services and Workflows
- Web services provide simple developer interfaces/message types to complicated remote data and computing resources.
- Workflows connect services together.
- Model scientific usage scenarios.
Taverna workflow connects remote services.
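As a rough illustration of how a workflow connects services together, the sketch below chains three stand-in functions in Python. The function names, the tiny lookup tables, and the `run_workflow` helper are all hypothetical placeholders. They are not the project's services and not Taverna's API; in a real workflow each step would be a remote Web service call.

```python
from functools import reduce
from typing import Any, Callable, List

# Each function stands in for a remote Web service (hypothetical names and data).

def extract_names(text: str) -> List[str]:
    """Toy text-mining step: keep tokens found in a small known-name list."""
    known = {"benzene", "ethanol", "aspirin"}
    tokens = [w.strip(".,;").lower() for w in text.split()]
    return [t for t in tokens if t in known]

def names_to_smiles(names: List[str]) -> List[str]:
    """Toy name-to-structure step backed by a fixed lookup table."""
    lookup = {
        "benzene": "c1ccccc1",
        "ethanol": "CCO",
        "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    }
    return [lookup[n] for n in names]

def unique_sorted(smiles: List[str]) -> List[str]:
    """Toy post-processing step: drop duplicates and sort."""
    return sorted(set(smiles))

def run_workflow(steps: List[Callable[[Any], Any]], data: Any) -> Any:
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda acc, step: step(acc), steps, data)

if __name__ == "__main__":
    pipeline = [extract_names, names_to_smiles, unique_sorted]
    print(run_workflow(pipeline, "Ethanol and benzene; ethanol again."))
```

The point of the design is that each step only needs to agree with its neighbors on the message type (here, plain strings and lists of strings), which is what lets independently hosted services be recombined.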
6Web Services
7Chemical Informatics on Big Red
[Diagram labels: MOAD Database; PubMed Database; OSCAR Text Analysis; Toxicity Filtering; Cluster Grouping; Docking; PubChem Database; Initial 3D Structure Calculation; NIH PubChem Database; Molecular Mechanics Calculations; Quantum Mechanics Calculations; IU's Varuna Database; POV-Ray Parallel Rendering]
Product databases are wrapped with Web service interfaces and are suitable for inclusion in Taverna workflows.
8PubMed and Supercomputing
- PubMed annually indexes 100,000s of medical journal abstracts from > 4,800 journals.
- How can we mine this literature to find interesting molecules?
- OSCAR3 is a chemistry-specific text mining tool developed by Peter Murray-Rust's group at Cambridge University.
- It can identify chemical names in text and extract SMILES (that is, string representations of chemical structures).
- SMILES are the starting point for many interesting workflows.
- See www.chembiogrid.org/wiki for several examples.
9OSCAR3 Mining on Big Red
- We have used Big Red to mine the entire set of PubMed abstracts for chemical structures.
- 555,007 PubMed abstracts from 2005-2006 (part) were run initially.
- Currently working on the entire catalog.
- OSCAR3
- Extracts chemical information from text and produces an XML document highlighting the chemical information.
- SMILES extraction
- Extracting SMILES elements from OSCAR's XML output files.
- Use this to drive docking and molecular modeling applications.
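The SMILES extraction step can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the element name `ne`, the attribute name `SMILES`, and the sample document are assumptions about what the annotated XML might look like, so both are exposed as parameters to be adjusted to the real OSCAR3 output.

```python
import xml.etree.ElementTree as ET
from typing import List

def extract_smiles(xml_text: str, tag: str = "ne", attr: str = "SMILES") -> List[str]:
    """Return the distinct SMILES strings found in annotated XML, in document order.

    `tag` and `attr` are assumed names for the annotation element and its
    SMILES attribute; change them to match the actual output format.
    """
    root = ET.fromstring(xml_text)
    seen = set()
    result = []
    for elem in root.iter(tag):
        smiles = elem.get(attr)
        # Skip annotations without a resolved structure and repeated mentions.
        if smiles and smiles not in seen:
            seen.add(smiles)
            result.append(smiles)
    return result

# Hypothetical annotated abstract, for illustration only.
SAMPLE = """<PAPER><P>We dissolved <ne type="CM" SMILES="CCO">ethanol</ne> in
<ne type="CM" SMILES="O">water</ne> and added more
<ne type="CM" SMILES="CCO">ethanol</ne>, then an
<ne type="CM">unknown compound</ne>.</P></PAPER>"""

if __name__ == "__main__":
    for smiles in extract_smiles(SAMPLE):
        print(smiles)
```

The resulting list is the kind of input that downstream docking or modeling steps would consume, one structure per entry.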
10Results from Big Red Demo
Final HTML pages
11Big Red and PubChem
- We are using Big Red to calculate
- 3D structures of 8,000,000 molecules in PubChem
- Molecular modeling, precursor to quantum calculations.
- Precise structures and molecular orbitals using GAMESS
- Precursor to specific property calculations.
- Preliminary protein docking of all 1,000,000 drug-like molecules
- Using the MOAD protein database from Heather Carlson (UMich) (over 10,000 curated proteins).
- Precursor for more sophisticated docking.
- Our goal is to provide data service cyber-infrastructure.
- These are 3 databases wrapped as 3 web services and available through 3 web interfaces.
- These enable our own research.
- See Mookie Baik's presentation at the bandwidth challenge.
- These may be incorporated into workflows.
- But they are being made public as well.
13Status
- OSCAR3 is available as a Web service.
- 3D Structure Web service is available.
- 85,000 structures now
- 8,000,000 soon
- We omit inorganic molecules, molecules with transition metals, etc. (about 1.0-1.5% of PubChem).
- These are linked to a local copy of PubChem.
- Web service and user interfaces available.
- Protein docking results
- 1,000,000 drug-like molecules in PubChem have been docked.
- OpenEye's FILTER finds 983,734.
- Heat shock protein (1YC4), kinases (1R1P, ...)
- Implicated in various cancers, researched by IU chemists.
- Varuna: QM pedigree, structures, and orbitals.
- Postgres database implemented (port of MS Access) with preliminary interface.
- 2,000 QM structure calculations underway.
- User and service interfaces in development.
- We roughly estimate 1,000,000 molecules = 1 Big Red year, so we will need to move to TeraGrid.
- We collect user interfaces to these services in a portal.
14Acknowledgements and More Information
- For more information, see
- www.chembiogrid.org/wiki
- The work described here was done by the following people:
- Kevin Gilbert: SMILES to 3D conversions and molecular modeling calculations
- Rajarshi Guha: protein docking and 3D structure web services
- Jake Kim: OSCAR3 web services and PubMed mining
- Mookie Baik: QM calculations
- Pulan Yu: Varuna web services
15Additional Slides
17CICC Project Information
- The Chemical Informatics and Cyberinfrastructure Collaboratory is an NIH- and MS-funded research project to combine the CIs.
- Project web site and more information
- www.chembiogrid.org
- www.chembiogrid.org/wiki
- Team members include
- Computer Science: Geoffrey Fox (PI), Dennis Gannon, Beth Plale, Marlon Pierce, Yuqing (Melanie) Wu, Malika Mahoui, Jake Kim
- Chemical Informatics and Chemistry: Gary Wiggins, Mu-Hyun (Mookie) Baik, David Wild, Rajarshi Guha, Kevin Gilbert
- I have stolen slides and content from these fine people.
18PubDock Docking PubChem
- PubChem contains 8M molecules
- Some are drug like, some are not
- Given a protein, we'd like to know what molecules will bind to it
- Useful in drug discovery
- Can also be used in non-drug related scenarios
- molecular probes
- imaging
19PubDock Docking PubChem
- We could
- Dock arbitrary molecules to proteins on the fly
- Dock PubChem and store the results
- Docking PubChem is useful since it allows us to store and compare results
- We currently select proteins of interest to us
- Heat shock protein (1YC4)
- Kinases (1R1P, ...)
20Some Numbers
- PubChem has 8M compounds
- We only considered compounds that had between 10 and 150 heavy atoms
- 7,949,658 compounds
- Many of these compounds are extremely flexible
- Conformer generation takes a long time
- To speed up the process we filtered for drug-like molecules
21What Are Drug Like Molecules?
- Used OpenEye's filter program, using default settings
- 150 < MW < 440
- 10 < heavy atom count < 25
- -5.0 < XlogP < 4.0
- Maximum of 3 rings
- ....
- Final set of 983,734 compounds
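The listed criteria can be written as a simple predicate. The sketch below is an illustration only: it is not OpenEye's filter program, it covers just the four criteria named on the slide (the elided ones are left out), it treats the bounds as strict inequalities, and it assumes the descriptors have already been computed elsewhere. The `Descriptors` class and `is_drug_like` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    """Precomputed properties of one molecule (hypothetical container)."""
    mol_weight: float   # molecular weight
    heavy_atoms: int    # heavy (non-hydrogen) atom count
    xlogp: float        # XlogP estimate
    ring_count: int     # number of rings

def is_drug_like(d: Descriptors) -> bool:
    """Apply only the four criteria listed on the slide, with strict bounds."""
    return (
        150 < d.mol_weight < 440
        and 10 < d.heavy_atoms < 25
        and -5.0 < d.xlogp < 4.0
        and d.ring_count <= 3
    )

if __name__ == "__main__":
    candidates = {
        "passes": Descriptors(300.0, 20, 2.0, 2),
        "too heavy": Descriptors(500.0, 20, 2.0, 2),
        "too many rings": Descriptors(300.0, 20, 2.0, 4),
    }
    for name, desc in candidates.items():
        print(name, is_drug_like(desc))
```

Because the real filter applies further rules, a set produced by this predicate alone would not be expected to reproduce the 983,734 figure.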
22Docking
- Docking was performed using OpenEye's FRED
- We considered 4 scoring functions
- Chemgauss3
- OEChemScore
- Shapegauss
- PLP
- The receptor site is specified by hand
- Very fast: 4 hours for 10,000 molecules
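To put the quoted rate in context, the short calculation below scales it to the 983,734 drug-like compounds. It assumes the rate of 4 hours per 10,000 molecules holds for a single process, that work divides evenly across processes, and that the figure covers one target; none of these is stated on the slide, and the `docking_hours` helper is a hypothetical name.

```python
HOURS_PER_BATCH = 4.0      # quoted: 4 hours ...
BATCH_SIZE = 10_000        # ... for 10,000 molecules
N_COMPOUNDS = 983_734      # drug-like set

def docking_hours(n_molecules: int, workers: int = 1) -> float:
    """Estimated wall-clock hours, assuming perfectly linear scaling (an assumption)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return n_molecules / BATCH_SIZE * HOURS_PER_BATCH / workers

if __name__ == "__main__":
    serial = docking_hours(N_COMPOUNDS)
    print(f"1 process: {serial:.1f} hours ({serial / 24:.1f} days)")
    print(f"64 processes: {docking_hours(N_COMPOUNDS, workers=64):.1f} hours")
```

Under these assumptions a single process needs roughly 393 hours (about 16 days) per target, which is why spreading the work over many processors matters.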
23Current Status
- All the drug-like molecules have been docked to 1YC4
- Other targets are being processed
- Database is being populated
- Currently has 20,000 compounds
- Database is accessible via
- the web ( http://rguha.ath.cx/rguha/dock )
- web services