Title: Chemoinformatics
1. Chemoinformatics
- David Wild, djwild_at_indiana.edu
- Bioinformatics Retreat, Feb 2nd, 2007
2. Current state of chemoinformatics research
- What works and what doesn't
  - Fingerprints, clustering and diversity
  - QSAR (predictive and descriptive methods), virtual screening (3D similarity, pharmacophores, docking)
  - Visualization, organization and navigation of chemical datasets
- Current buzz areas in chemoinformatics
- How can we use our internal strengths to do something new, important and impressive?
3. What works and what doesn't
- 2D structure and similarity searching well established
  - Lots of papers comparing fingerprints for similarity
  - Some recent evidence that Scitegic ECFPs are better for recall of actives
- Clustering well established, but definite room for improvement
  - Traditional methods: Ward's, K-means, Jarvis-Patrick
  - Recently, single-pass similarity cutoff methods used for very fast organization: >0.85 for similar activity, >0.55 for QSAR
  - Data mining methods (ROCK, Chameleon, CURE, etc.) unexplored
- Diversity: hot -> cold -> smart
- QSAR: poor relation of academic work to industry usefulness
  - Lots of papers of the form "this method works best on this dataset"
  - Random forests appear to work rather well in practice
  - Interpretability vs. predictive ability
- Predictive methods for LogP, pKa, solubility, etc. work reasonably
- Virtual screening is virtually useless unless tied in with the HTS screening process. However, it is useful for exploring around leads.
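
As a rough illustration of the single-pass similarity cutoff idea mentioned above, here is a minimal sketch in Python. It assumes fingerprints are represented as sets of on-bit indices and uses a leader-style scheme in which each compound is compared only with the first member of each existing cluster; the function names and toy fingerprints are hypothetical and not taken from any particular package.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def single_pass_cluster(fps: List[Set[int]], cutoff: float = 0.85) -> List[List[int]]:
    """Leader-style single-pass clustering.

    Each fingerprint joins the first cluster whose leader (its first member)
    is more than `cutoff` similar; otherwise it starts a new cluster.
    Returns clusters as lists of indices into `fps`.
    """
    clusters: List[List[int]] = []
    for i, fp in enumerate(fps):
        for members in clusters:
            if tanimoto(fps[members[0]], fp) > cutoff:
                members.append(i)
                break
        else:
            # No existing leader was similar enough
            clusters.append([i])
    return clusters


# Toy fingerprints (hypothetical bit sets, not derived from real molecules)
fingerprints = [
    set(range(0, 20)),     # becomes the leader of the first cluster
    set(range(0, 19)),     # Tanimoto 19/20 = 0.95 to the first
    set(range(100, 120)),  # no shared bits, starts a second cluster
    set(range(100, 118)),  # Tanimoto 18/20 = 0.90 to the third
]
print(single_pass_cluster(fingerprints, cutoff=0.85))  # [[0, 1], [2, 3]]
```

The result depends on input order, which is the usual price of a single pass: the method trades the stability of Ward's or K-means for speed.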
4. What works and what doesn't
- Mostly, 3D methods haven't worked out yet
  - Similarity and QSAR: almost every paper finds 2D better for recall and precision, but 3D methods give interesting ideas. Useful for lead hopping
  - Pharmacophore searching not widely used
  - Docking: very useful for visual inspection; poor correlation of scoring functions with binding
- Visualization, organization and navigation of datasets
  - Still not clear how to work with datasets of more than a few hundred compounds
  - Dot plots and spreadsheet-based methods work minimally
  - Need for UI design and research
5. The current buzz in chemoinformatics
- Decorporatization and commoditization of data and software
  - MLSCN, PubChem, open source, small companies
  - Crisis for the software companies, nice for academia
  - Pharma companies in the brown stuff without a paddle
- Integration with other "-ics"
  - Data mining chemical/genomic information
  - Linking compounds -> proteins -> pathways, etc. (e.g. KEGG)
  - Fuzzy boundaries, integration with science and informatics
  - Microsoft 2020 vision for science
- Integration of text and structure searching
- Semantic web, services and mashups will probably have a BIG impact: exporting best of breed; what happens to the rest?
6. Suggested collaboration areas
- Chem/bio/complex systems mashups using web services in each of the areas: nice, confined projects for students once you have the infrastructure
- Chem and complex systems can work together on integrating text and structure-based searching, indexing and crawling (e.g. networks of web services and databases), and intelligent agents
- Data mining of chemogenomic information
- Integration of advanced chemoinformatics methods with systems biology and pathway mapping tools
- Performing research to establish best practices for areas of chemoinformatics
- Tackling algorithmic problems for which there is currently no good solution: docking and scoring
7. Cyberinfrastructure
- Geoffrey Fox
- Computer Science, Informatics and Physics
8. Cyberinfrastructure
- Supports distributed science: data, people, computers
- Exploits Internet technology (Web 2.0), adding (via Grid technology) management, security, supercomputers, etc.
- It has two aspects: parallel, with low latency (microseconds) between nodes, and distributed, with highish latency (milliseconds) between nodes
  - Parallel is needed to get high performance on individual 3D simulations, data analysis, etc.; the problem must be decomposed
  - Distributed aspect integrates already distinct components
- Cyberinfrastructure is in general a distributed collection of parallel systems
- Cyberinfrastructure is made of services (usually Web services) that are just programs or data sources packaged for distributed access
9. TeraGrid: Integrating NSF Cyberinfrastructure
TeraGrid is a facility that integrates
computational, information, and analysis
resources at the San Diego Supercomputer Center,
the Texas Advanced Computing Center, the
University of Chicago / Argonne National
Laboratory, the National Center for
Supercomputing Applications, Purdue University,
Indiana University, Oak Ridge National
Laboratory, the Pittsburgh Supercomputing Center,
and the National Center for Atmospheric
Research. Today: 100 teraflops; tomorrow: a
petaflop. Indiana: 20 teraflops today.
10. Cyberinfrastructure at IU
- Interpreted broadly (Web presences), there are many activities at IU
- Interpreted narrowly as the programmable web or using Grid technologies, there are large projects in atmospheric, earthquake and ice-sheet sciences, network systems, particle physics, crystallography and cheminformatics
- IU has an international reputation in both parallel and distributed Cyberinfrastructure, including education, research and resources
- IU has the #31 supercomputer in the world and is part of two major national activities: TeraGrid and Open Science Grid
- There are several well-known Bioinformatics Grids, such as BIRN (mainly images) and caBIG (cancer databases) from NIH, and MyGrid from the UK (EBI)
- Could be opportunities to link Biology and Informatics/CS in Cyberinfrastructure projects
11. Cyberinfrastructure motivated by Web 2.0
- Capture the power of interactive Web/Grid sites, enabling people to create, collaborate and build on each other's work
12. Web services, workflows, portals and ontologies
- Web Services allow us to quickly develop and deploy new tools, interfaces that cross disciplines and are broadly accessible
  - Can use simple HTTP and ignore Web Service complications
- Workflows (called mashups in Web 2.0) allow us to string together collections of web services to do computation that is tailored to the science (as a one-off or for re-use)
  - Develop core capabilities as services and use them in many different ways, as in the 770 Google Maps mashups
- APIs/languages/data structures/ontologies (WSDL, AJAX, JSON at the low level) allow us to describe workflows and services in discoverable, standard ways, such that reasoning tools can piece them together to match queries
- Portals enable composable, reusable user interfaces
- Distributed posting of services and easily available composition tools enable everybody to contribute
  - Interesting implications for broader participation
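
To make the "string together collections of web services" idea concrete, here is a minimal sketch of a workflow as function composition. The three "services" are hypothetical local stand-ins (plain Python functions with a toy lookup table and a deliberately crude descriptor); in a real deployment each would wrap an HTTP request to a remote service.

```python
from functools import reduce
from typing import Any, Callable, Dict

Service = Callable[[Any], Any]


def compose(*steps: Service) -> Service:
    """Chain services so the output of each becomes the input of the next."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


# Hypothetical stand-ins for remote services; a real workflow would
# issue an HTTP request inside each function instead.
def lookup_synonym(name: str) -> str:
    """Toy synonym service: compound name -> SMILES."""
    table: Dict[str, str] = {"ethanol": "CCO", "water": "O"}
    return table[name.lower()]


def count_heavy_atoms(smiles: str) -> int:
    """Crude toy descriptor: counts letters, so it is only valid for
    simple SMILES made of single-letter element symbols."""
    return sum(1 for ch in smiles if ch.isalpha())


def format_report(n_atoms: int) -> str:
    """Toy presentation step."""
    return f"heavy atoms: {n_atoms}"


workflow = compose(lookup_synonym, count_heavy_atoms, format_report)
print(workflow("Ethanol"))  # heavy atoms: 3
```

Because each step only agrees on its input and output, any one of them can be swapped for a different service without touching the others, which is the property that makes one-off and reusable workflows cheap to build.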
13. Model and Data Sharing
- Cyberinfrastructure requires agreed sharing standards (data structures, APIs, protocols, ontologies, languages), as it is intrinsically internationally distributed
- There are agreed data structures for taking Sequence -> Protein -> Folding -> Interaction transparently, e.g. BLAST
- Nothing at the level where genomics and proteomics are important: cells and tissues
  - Partial answers: CellML, FieldML, SBML, which do not link to relevant standards outside Biology
- Need to connect models at these levels. Need standard ontologies/data structures for cell behaviors to allow connections and validation
- Need to connect models like SBW (Systems Biology Workbench)/BioSpice -> cell-level models (Compucell) -> tissue-level models (Physiome)
- Model builders at these scales are not CS-sophisticated. Models are NOT interoperable and don't use useful general ideas
- Glazier is organizing activity in this area with H. Sauro (U. Washington), W. Li (UCSD-SDSC), Hunter (U. Auckland) and NIH
- Link to Open Grid Forum standard setting and community activities
14. http://www.chembiogrid.org
- Database enabled quantum chemistry computations
- Services to link PubChem, supercomputers, results of high throughput screening centers
- Education: IU has unique Cheminformatics degrees
- Portals
15. Chemical Informatics web service infrastructure
- Database Services
  - Local NIH DTP Human Tumor Cell Line set
  - Local PubChem mirror
  - Derived properties database
  - Pub3D, PubDock
  - Synonym service
  - VARUNA quantum chemistry database
- Statistics (based on R)
  - Regression, Neural Nets, Random Forest
  - LDA
  - K-means clustering
  - Plotting
  - T-test and distribution sampling
- Computation Services
  - OpenEye FRED, OMEGA, FILTER
  - Cambridge OSCAR3
  - BCI fingerprint generation, Ward's, Divisive K-means clustering
  - Tox Tree
  - Similarity fingerprint calculations (CDK)
  - Descriptor calculation (CDK)
  - 2D structure diagrams (CDK)
  - 2D -> 3D, file format conversions
16. Workflows - Taverna (taverna.sourceforge.net)
18. PubDock - Chimera-based interface
19. Kemo - A ChatBot for PubChem
- Uses ALICE chatbot, www.alicebot.org
- AIML used to define knowledge base, e.g. reaction to common phrases like FIND ME, WHAT IS THE LOGP OF, etc.
- Can iteratively improve knowledge base
- Accesses PubChem through web service interface
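
The phrase-matching idea behind an AIML knowledge base can be sketched in a few lines of Python. This is not AIML itself and not Kemo's actual rule set: the rules, the normalization step and the placeholder responses are assumptions for illustration, and a real bot would call a PubChem web service where the sketch only returns a string.

```python
import re
from typing import List, Optional, Tuple

# AIML-style patterns: upper-case phrases where "*" matches the rest of the input.
# The responses are placeholders; a real bot would call a PubChem web service.
RULES: List[Tuple[str, str]] = [
    ("WHAT IS THE LOGP OF *", "Looking up logP for {0}"),
    ("FIND ME *", "Searching PubChem for {0}"),
]


def normalize(text: str) -> str:
    """Upper-case the input and drop punctuation before matching."""
    return re.sub(r"[^A-Z0-9 ]", "", text.upper()).strip()


def respond(text: str) -> Optional[str]:
    """Return the response for the first matching rule, or None if nothing matches."""
    sentence = normalize(text)
    for pattern, template in RULES:
        # Turn the wildcard pattern into an anchored regular expression
        regex = "^" + re.escape(pattern).replace(r"\*", "(.+)") + "$"
        match = re.match(regex, sentence)
        if match:
            return template.format(match.group(1).lower())
    return None


print(respond("What is the logP of aspirin?"))  # Looking up logP for aspirin
print(respond("find me caffeine"))              # Searching PubChem for caffeine
```

Improving the knowledge base iteratively then amounts to appending rules for phrases that currently fall through to `None`.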
20. Workflow in Xbaya - a meteorology tool!
http://www.extreme.indiana.edu/xgws/xbaya/
21. Indexing the world's chemical information AND computational functionality
- Crawl and index web pages, journal articles, etc. for
  - Structures (InChIs, SMILES)
  - Images (converted using Clide or ChemReader)
  - Names (converted using OSCAR3 or similar package)
  - Other information (IR spectra, reactions, etc.)
- Technology still immature, but improving quickly
- Problem with access to journal articles: we will assume open access in the future!
- Expose computational functionality as web services, contextualize in an OWL-S ontology (semantics), and publish in a UDDI
- Now we know what information we have, and what we can do with it
- Develop bots and intelligent agents to automatically do useful things
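
As a small illustration of the structure-indexing step, the sketch below pulls InChI strings out of page text and builds an inverted index from structure to the pages that mention it. The regular expression is a simplified assumption (it finds the `InChI=1` or `InChI=1S` prefix followed by non-space characters and does not validate the identifier), and the example URLs are placeholders rather than real crawled pages.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Simplified pattern: "InChI=1" or "InChI=1S", then "/" and non-space characters.
# An assumption for illustration; it does not validate the identifier.
INCHI_RE = re.compile(r"InChI=1S?/\S+")


def build_index(pages: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each InChI found in the page text to the set of URLs mentioning it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for url, text in pages:
        for inchi in INCHI_RE.findall(text):
            # Drop sentence punctuation that may trail the identifier
            index[inchi.rstrip(".,;")].add(url)
    return dict(index)


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
WATER = "InChI=1S/H2O/h1H2"

# Placeholder pages standing in for crawled documents
pages = [
    ("http://example.org/a", f"Ethanol is {ETHANOL}."),
    ("http://example.org/b", f"Solvent: {ETHANOL} and water {WATER}"),
]

index = build_index(pages)
print(sorted(index[ETHANOL]))  # ['http://example.org/a', 'http://example.org/b']
```

SMILES are much harder to spot reliably in free text because they have no fixed prefix, which is one reason InChI is attractive as an indexing key.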