Title: Chemical Informatics and Cyberinfrastructure Building Blocks
1. Chemical Informatics and Cyber-infrastructure Building Blocks
- Chemical Informatics Resources 
 - Deluge of experimental data 
  - More than 100,000 compounds screened by 10 publicly funded high-throughput screening centers using various assay techniques (molecular to cellular) 
  - Molecular Libraries Screening Center Network 
 - Chemical databases maintained by various groups 
  - NIH PubChem, NIH DTP 
 - Chemical informatics and computational chemistry 
  - Data clustering, data mining, descriptor calculations, toxicity prediction, docking, molecular modeling, and quantum chemistry 
 - Visualization tools 
 - Web resources: journal articles, etc. 
- A Chemical Informatics Grid will need to integrate these into a common, loosely coupled, open, distributed computing environment. 
2. Our Solution Stack
- Domain-specific Web Services 
 - VOTables, CDK services 
- Grid services and Cyber-infrastructure for computationally intensive applications 
 - Clustering, quantum chemistry 
- Workflow and service management 
 - We work with Taverna 
 - Many other solutions exist: Kepler, BPEL engines, etc. 
- Portlets and other user interfaces 
 - Rich desktop apps 
 - Ubiquitous clients
[Figure: solution stack, top to bottom: Portals and Other User Interfaces; Workflow and Service Management; Web and Grid Services]
Each level is a subject for research and development, as is their integration.
3. Wrapping Science Applications as Services
- Science Grid services typically must wrap legacy applications written in C or Fortran. 
- You must handle such problems as 
 - Specifying several input and output files 
  - These may need to be staged in 
 - Launching executables and monitoring their progress 
 - Specifying environment variables 
 - Often these applications also come with shell scripts that do miscellaneous tasks. 
- How do you convert this to WSDL? 
 - Or (equivalently) how do you automatically generate the XML job description for WS-GRAM? 
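The wrapping problems listed on this slide (staging input files, setting environment variables, launching an executable, and monitoring it) can be sketched with the Python standard library alone. This is a minimal local sketch, not the WSDL or WS-GRAM solution the slide asks about; the function name `run_legacy_app` and its parameters are hypothetical and do not come from the presentation.

```python
import os
import shutil
import subprocess
import tempfile
import time


def run_legacy_app(command, input_files, env_vars, poll_seconds=0.1):
    """Stage inputs into a scratch directory, launch the command, and wait for it.

    Hypothetical helper for illustration only; returns (exit_code, working_dir).
    """
    workdir = tempfile.mkdtemp(prefix="legacy_job_")

    # Stage in: copy every input file into the directory the job will run in.
    for path in input_files:
        shutil.copy(path, workdir)

    # Environment: start from the current environment and add job-specific variables.
    env = dict(os.environ)
    env.update(env_vars)

    # Launch, sending stdout and stderr to files in the working directory.
    with open(os.path.join(workdir, "stdout.txt"), "w") as out, \
         open(os.path.join(workdir, "stderr.txt"), "w") as err:
        process = subprocess.Popen(command, cwd=workdir, env=env,
                                   stdout=out, stderr=err)
        # Monitor: poll until the process exits.
        while process.poll() is None:
            time.sleep(poll_seconds)

    return process.returncode, workdir
```

A service layer would call something like `run_legacy_app(["./app", "in.dat"], ["/data/in.dat"], {"APP_HOME": "/opt/app"})` (all paths hypothetical) and then return or stage out the files left in the working directory.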
4. Flow Chart: SMILES to Cluster Partitions with the BCI Web Service
[Figure: flow chart of the SMILES-to-DKM pipeline. Data items: SMILES string, new SMILES string, dictionary (default), fingerprint (.scn), cluster hierarchy (.dkm), best level, extracted cluster hierarchy (.clu). Programs: Makebits, DivKmeans, Optclus, RNNclus. Step labels: generating fingerprints, clustering fingerprints, generating the best levels, extracting individual cluster partitions, one-column process, merge process.]
5. BCI Clustering Service Methods 
6. Submitting Applications with Condor
- We are working to use Condor-G as a simple bridge to the NSF's TeraGrid for job submission. 
- Condor has a Web Service interface (called BirdBath) that we are using to construct Java portlets. 
- We are investigating how to construct Condor ClassAds using GPIR. 
 - Required for Condor matchmaking 
 - But no facility for this is built in to the TeraGrid. 
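To make the Condor-G bridge concrete, the sketch below renders the kind of submit description that Condor-G accepts for a Globus-managed resource. It only builds the text; it does not call BirdBath or GPIR. The function name is hypothetical, and the exact `grid_resource` string depends on the site and on the Condor version, so the value in the usage note is an assumption.

```python
def condor_g_submit_description(executable, arguments, grid_resource, job_name):
    """Render a Condor-G submit description as text (hypothetical helper)."""
    lines = [
        "universe = grid",                    # grid universe: route the job through Condor-G
        f"grid_resource = {grid_resource}",   # which remote gatekeeper and scheduler to use
        f"executable = {executable}",
        f"arguments = {arguments}",
        f"output = {job_name}.out",           # remote stdout is returned to this file
        f"error = {job_name}.err",
        f"log = {job_name}.log",              # Condor's own event log for the job
        "queue",
    ]
    return "\n".join(lines) + "\n"
```

For example, `condor_g_submit_description("divkmeans.sh", "input.scn", "gt2 gatekeeper.example.org/jobmanager-pbs", "cluster1")` yields text that could be saved to a file and passed to `condor_submit`; the script name and host name here are placeholders, not real TeraGrid endpoints.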
7. Condor-G and Globus
[Figure: job submission diagram with a "Condor Only" panel. Labels: (Portal) Client, Condor Master, Condor, Condor-G, TeraGrid Globus, LSF, PBS.]
8. VOTables: Handling Tabular Data
- Developed by the Virtual Observatory community for encoding astronomy data. 
- The VOTable format is an XML representation of tabular data (in our case, data coming from BCI, NIH DTP databases, and so on). 
- VOTable-compatible tools have already been built 
 - We just inherit them. 
- The SAVOT and JAVOT Java parser APIs for VOTable allow us to easily build VOTable-based applications 
 - Web Services 
 - Spreadsheets 
 - Plotting applications 
  - VOPlot and TopCat are two examples. 
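As a rough illustration of what a VOTable-based application does with the format, the sketch below reads the FIELD definitions and TABLEDATA rows from a small VOTable using Python's standard XML parser. It stands in for the SAVOT and JAVOT parsers and does not use their APIs. The sample document is made up for illustration and omits the XML namespace that real VOTables normally declare.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; real VOTables usually carry an xmlns declaration.
SAMPLE_VOTABLE = """<VOTABLE version="1.1">
  <RESOURCE>
    <TABLE name="compounds">
      <FIELD name="SMILES" datatype="char" arraysize="*"/>
      <FIELD name="Mass" datatype="double"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>CCO</TD><TD>46.07</TD></TR>
          <TR><TD>c1ccccc1</TD><TD>78.11</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""


def read_votable(xml_text):
    """Return the first table's rows as dictionaries keyed by FIELD name."""
    root = ET.fromstring(xml_text)
    table = root.find("./RESOURCE/TABLE")
    # Column names come from the FIELD elements, in document order.
    names = [field.get("name") for field in table.findall("FIELD")]
    rows = []
    for tr in table.findall("./DATA/TABLEDATA/TR"):
        cells = [td.text for td in tr.findall("TD")]
        rows.append(dict(zip(names, cells)))
    return rows
```

Calling `read_votable(SAMPLE_VOTABLE)` gives one dictionary per compound, with every cell left as a string; a spreadsheet or plotting client would convert values according to each FIELD's `datatype`.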
 
9. mrtd1.txt: SMILES representation of chemical compounds along with their properties 
10. votable.xml: XML representation of the mrtd1.txt file 
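The conversion from a text table to a VOTable can be sketched as follows. The layout of mrtd1.txt is not shown in the slides, so this assumes a whitespace-separated file whose first line holds the column names; the function name is hypothetical, and every column is declared as a string for simplicity.

```python
import xml.etree.ElementTree as ET


def text_to_votable(text, table_name="compounds"):
    """Convert whitespace-separated rows (header line first) into VOTable XML.

    Assumed input layout; the real mrtd1.txt format may differ.
    """
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]

    votable = ET.Element("VOTABLE", version="1.1")
    resource = ET.SubElement(votable, "RESOURCE")
    table = ET.SubElement(resource, "TABLE", name=table_name)

    # One FIELD per column; all columns are treated as variable-length strings.
    for column in header:
        ET.SubElement(table, "FIELD", name=column, datatype="char", arraysize="*")

    # One TR per input row, one TD per cell.
    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for row in rows:
        tr = ET.SubElement(tabledata, "TR")
        for cell in row:
            ET.SubElement(tr, "TD").text = cell

    return ET.tostring(votable, encoding="unicode")
```

For instance, `text_to_votable("SMILES Mass PSA\nCCO 46.07 20.23\n")` produces a single-row table with SMILES, Mass, and PSA columns, the latter two being the quantities plotted in the VOPlot example.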
11. VOPlot application using the generated votable.xml file: graph plotted with Mass on the X-axis and PSA on the Y-axis 
12. More Services: WWMM Services 
13. CDK-Based Services 
14. ToxTree Service
- The Threshold of Toxicological Concern (TTC) establishes a level of exposure for all chemicals below which there would be no appreciable risk to human health. 
- ToxTree implements the Cramer Decision Tree approach to estimate TTC. 
- We have converted this into a service. 
 - Uses SMILES as input. 
 - Note that the GUI must be separated from the library for it to be a service. 
http://ecb.jrc.it/QSAR/home.php?CONTENU/QSAR/qsar_tools/qsar_tools_toxtree.php 
15. OSCAR3 Service
- Oscar3 is a tool for shallow, chemistry-specific natural language parsing of chemical documents (e.g., journal articles). 
- It identifies (or attempts to identify) 
 - Chemical names: singular nouns, plurals, verbs, etc., as well as formulae and acronyms. 
 - Chemical data: spectra, melting/boiling points, yields, etc. in experimental sections. 
 - Other entities: things like N(5)-C(3) and so on. 
- Results are exported as an XML file. 
- There is a larger effort, SciBorg, in this area 
 - http://www.cl.cam.ac.uk/aac10/escience/sciborg.html 
- It also has potentially very interesting workflows 
http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3