Chemoinformatics - PowerPoint PPT Presentation

1
Chemoinformatics

David Wild, djwild_at_indiana.edu
Bioinformatics Retreat, Feb 2nd, 2007

2
Current state of chemoinformatics research

What works and what doesn't
Fingerprints, clustering and diversity
QSAR - predictive and descriptive methods,
virtual screening
3D similarity, pharmacophores, docking
Visualization, organization and navigation of
chemical datasets
Current buzz areas in chemoinformatics
How can we use our internal strengths to do
something new, important and impressive?

3
What works and what doesn't

2D structure and similarity searching are well
established
Lots of papers comparing fingerprints for
similarity
Some recent evidence that Scitegic ECFPs are
better for recall of actives
Clustering is well established, but there is
definite room for improvement
Traditional methods: Ward's, K-means,
Jarvis-Patrick
Recently, single-pass similarity cutoff methods
have been used for very fast organization: >0.85
for similar activity, >0.55 for QSAR
Data mining methods (ROCK, Chameleon, Cure, etc.)
are unexplored
Diversity: hot -> cold -> smart
QSAR: poor relation of academic work to industry
usefulness
Lots of papers of the form "this method works
best on this dataset"
Random forests appear to work rather well in
practice
Interpretability vs. predictive ability
Predictive methods for LogP, pKa, solubility,
etc. work reasonably well
Virtual screening is virtually useless unless
tied in with the HTS process. However, it is
useful for exploring around leads.
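
The single-pass similarity cutoff idea above can be sketched as follows. This is a minimal illustration, not the method used in any particular package: it assumes fingerprints are represented as sets of "on" bit indices and that similarity is the Tanimoto coefficient (the slide does not name the measure). The names `tanimoto` and `single_pass_clusters` and the toy fingerprints are made up for the example.

```python
from typing import FrozenSet, List, Sequence

Fingerprint = FrozenSet[int]  # indices of the bits that are set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def single_pass_clusters(fps: Sequence[Fingerprint],
                         cutoff: float = 0.85) -> List[List[int]]:
    """Assign each fingerprint to the first cluster whose leader is at
    least `cutoff` similar; otherwise it starts a new cluster.
    The data is visited once, so the result depends on input order."""
    leaders: List[int] = []          # index of the first member of each cluster
    clusters: List[List[int]] = []   # member indices per cluster
    for i, fp in enumerate(fps):
        for c, leader in enumerate(leaders):
            if tanimoto(fp, fps[leader]) >= cutoff:
                clusters[c].append(i)
                break
        else:
            leaders.append(i)
            clusters.append([i])
    return clusters


# Toy fingerprints (assumed data, not real molecules).
fps = [
    frozenset(range(0, 20)),   # A
    frozenset(range(0, 19)),   # B: 19 of 20 bits shared with A
    frozenset(range(10, 30)),  # C: only 10 of 30 bits shared with A
    frozenset(range(10, 29)),  # D: 19 of 20 bits shared with C
]
clusters = single_pass_clusters(fps, cutoff=0.85)
print(clusters)
```

Lowering `cutoff` (for example toward the 0.55 quoted for QSAR) merges more compounds into each cluster, which is the trade-off the two thresholds on the slide point at.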

4
What works and what doesn't

Mostly, 3D methods haven't worked out yet
Similarity and QSAR: in almost every paper 2D is
better for recall and precision, but 3D methods
give interesting ideas. Useful for lead hopping
Pharmacophore searching is not widely used
Docking: very useful for visual inspection, but
poor correlation of scoring functions with
binding
Visualization, organization and navigation of
datasets
Still not clear how to work with datasets of more
than a few hundred compounds
Dot plots and spreadsheet-based methods work
minimally
Need for UI design and research

5
The current buzz in chemoinformatics

Decorporatization and commoditization of data and
software
MLSCN, PubChem, open source, small companies
Crisis for the software companies, nice for
academia
Pharma companies in the brown stuff without a
paddle
Integration with other -ics
Data mining chemical/genomic information
Linking compounds -> proteins -> pathways, etc.
(e.g. KEGG)
Fuzzy boundaries, integration with science and
informatics
Microsoft 2020 vision for science
Integration of text and structure searching
Semantic web, services and mashups will probably
have a BIG impact: exporting best of breed, but
what happens to the rest?

6
Suggested collaboration areas

Chem/bio/complex systems mashups using web
services in each of the areas: nice, confined
projects for students once you have the
infrastructure
Chem and complex can work together on integrating
text and structure-based searching, indexing and
crawling (e.g. networks of web services and
databases), and intelligent agents
Data mining of chemogenomic information
Integration of advanced chemoinformatics methods
with systems biology and pathway mapping tools
Performing research to establish best practices
for areas of chemoinformatics
Tackling algorithmic problems for which there is
currently no good solution - docking and scoring

7
Cyberinfrastructure

Geoffrey Fox
Computer Science, Informatics and Physics

8
Cyberinfrastructure

Supports distributed science: data, people,
computers
Exploits Internet technology (Web 2.0), adding
(via Grid technology) management, security,
supercomputers, etc.
It has two aspects: parallel, with low latency
(microseconds) between nodes, and distributed,
with highish latency (milliseconds) between nodes
Parallel is needed to get high performance on
individual 3D simulations, data analysis, etc.;
the problem must be decomposed
The distributed aspect integrates already
distinct components
Cyberinfrastructure is in general a distributed
collection of parallel systems
Cyberinfrastructure is made of services (usually
Web services) that are just programs or data
sources packaged for distributed access
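
The "must decompose problem" point can be illustrated with a small sketch. It assumes a toy workload (a sum of squares) and uses a thread pool purely to show the split, compute, combine pattern; the function names are hypothetical, and in CPython threads will not speed up CPU-bound work, so this shows only the structure, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split(items: Sequence[int], parts: int) -> List[Sequence[int]]:
    """Decompose the input into at most `parts` contiguous chunks."""
    size = max(1, -(-len(items) // parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def partial_sum_of_squares(chunk: Sequence[int]) -> int:
    """Work done independently on one chunk."""
    return sum(x * x for x in chunk)


def decomposed_sum_of_squares(items: Sequence[int], workers: int = 4) -> int:
    """Split the problem, process the chunks concurrently, combine results."""
    chunks = split(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


total = decomposed_sum_of_squares(list(range(1000)))
print(total)
```

The combine step is the only point where the workers exchange results, which is why the latency between nodes matters so much more for tightly coupled parallel work than for loosely coupled distributed services.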

9
TeraGrid: Integrating NSF Cyberinfrastructure

TeraGrid is a facility that integrates
computational, information, and analysis
resources at the San Diego Supercomputer Center,
the Texas Advanced Computing Center, the
University of Chicago / Argonne National
Laboratory, the National Center for
Supercomputing Applications, Purdue University,
Indiana University, Oak Ridge National
Laboratory, the Pittsburgh Supercomputing Center,
and the National Center for Atmospheric
Research.
Today: 100 teraflops; tomorrow: a petaflop.
Indiana: 20 teraflops today.
10
Cyberinfrastructure at IU

Interpreted broadly (Web presences), there are
many activities at IU
Interpreted narrowly as the programmable web or
using Grid technologies, there are large
projects in atmospheric, earthquake and ice-sheet
sciences, network systems, particle physics,
crystallography and cheminformatics
IU has an international reputation in both
parallel and distributed Cyberinfrastructure,
including education, research and resources
IU has the #31 supercomputer in the world and is
part of two major national activities: TeraGrid
and Open Science Grid
There are several well-known bioinformatics Grids
such as BIRN (mainly images) and caBIG (cancer
databases) from NIH and MyGrid from the UK (EBI)
There could be opportunities to link Biology and
Informatics/CS in Cyberinfrastructure projects

11
Cyberinfrastructure motivated by Web 2.0

Capture the power of interactive Web/Grid sites,
enabling people to create, collaborate and build
on each other's work

12
Web services, workflows, portals and ontologies

Web Services allow us to quickly develop and
deploy new tools and interfaces that cross
disciplines and are broadly accessible
Can use simple HTTP and ignore Web Service
complications
Workflows (called mashups in Web 2.0) allow us to
string together collections of web services to do
computation that is tailored to the science (as a
one-off or for re-use).
Develop core capabilities as services and use
them in many different ways, as in the 770 Google
Maps mashups
APIs/Languages/Data structures/Ontologies (WSDL,
AJAX, JSON at the low level) allow us to describe
workflows and services in discoverable, standard
ways, such that reasoning tools can piece them
together to match queries
Portals enable composable, reusable user
interfaces
Distributed posting of services and easily
available composition tools enable everybody to
contribute
Interesting implications for broader
participation
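
A minimal sketch of the "simple HTTP" service plus workflow idea follows. Everything here is an assumption for illustration: the `/mw` endpoint, the `WeightService` handler and the `light_compounds` workflow are hypothetical and are not services described in the talk. The service runs locally on a free port so the example is self-contained; the atomic weights cover only H, C, N and O.

```python
import json
import re
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, quote, urlparse
from urllib.request import urlopen

# Standard atomic weights for a few elements only (extend as needed).
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def formula_weight(formula: str) -> float:
    """Molecular weight of a simple formula such as 'C2H6O' (no brackets)."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return round(total, 3)


class WeightService(BaseHTTPRequestHandler):
    """Hypothetical service: GET /mw?formula=C2H6O returns a JSON record."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        formula = query.get("formula", [""])[0]
        try:
            status = 200
            payload = {"formula": formula, "mw": formula_weight(formula)}
        except KeyError:
            status, payload = 400, {"error": "unsupported element"}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def light_compounds(base_url, formulas, max_mw):
    """Workflow: call the service for each formula, then filter the results."""
    kept = []
    for formula in formulas:
        with urlopen(f"{base_url}/mw?formula={quote(formula)}") as response:
            record = json.load(response)
        if record["mw"] <= max_mw:
            kept.append(record["formula"])
    return kept


def run_demo():
    """Start the service on a free local port and run the workflow against it."""
    server = HTTPServer(("127.0.0.1", 0), WeightService)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        base_url = f"http://127.0.0.1:{server.server_address[1]}"
        return light_compounds(base_url, ["CH4", "C2H6O", "C6H12O6"], 50.0)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

Because the workflow only needs a base URL and a JSON contract, the same `light_compounds` function could in principle be pointed at any service that honours that contract, which is the composability argument the slide makes.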

13
Model and Data Sharing

Cyberinfrastructure requires agreed sharing
standards (data structures, APIs, protocols,
ontologies, languages) as it is intrinsically
internationally distributed
There are agreed data structures for taking
Sequence -> Protein -> Folding -> Interaction
transparently, e.g. BLAST
Nothing at the level where genomics and
proteomics are important: cells and tissues.
Partial answers: CellML, FieldML, SBML, which do
not link to relevant standards outside Biology
Need to connect models at these levels. Need
standard ontologies/data structures for cell
behaviors to allow connections and validation
Need to connect models like SBW (Systems Biology
Workbench)/BioSpice -> cell-level models
(Compucell) -> tissue-level models (Physiome)
Model builders at these scales are not
CS-sophisticated. Models are NOT interoperable
and don't use useful general ideas
Glazier is organizing activity in this area with
H. Sauro (U. Washington), W. Li (UCSD-SDSC),
Hunter (U. Auckland) and NIH
Link to Open Grid Forum standard setting and
community activities

14
http://www.chembiogrid.org

Database-enabled quantum chemistry computations
Services to link PubChem, supercomputers, and
results of high-throughput screening centers
Education: IU has unique Cheminformatics degrees
Portals

15
Chemical Informatics web service infrastructure

Database Services
Local NIH DTP Human Tumor Cell Line set
Local PubChem mirror
Derived properties database
Pub3D, PubDock
Synonym service
VARUNA quantum chemistry database
Statistics (based on R)
Regression, Neural Nets, Random Forest
LDA
K-means clustering
Plotting
T-test and distribution sampling

Computation Services
OpenEye FRED, OMEGA, FILTER
Cambridge OSCAR3
BCI fingerprint generation, Ward's, Divisive
K-means clustering
Tox Tree
Similarity fingerprint calculations (CDK)
Descriptor calculation (CDK)
2D structure diagrams (CDK)
2D -> 3D File format conversions

16
Workflows - Taverna (taverna.sourceforge.net)
17
(No Transcript)
18
PubDock - Chimera-based interface
19
Kemo - A ChatBot for PubChem

Uses the ALICE chatbot (www.alicebot.org)
AIML is used to define the knowledge base, e.g.
reactions to common phrases like FIND ME, WHAT IS
THE LOGP OF, etc.
Can iteratively improve the knowledge base
Accesses PubChem through a web service interface
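
The phrase-matching idea can be sketched in plain Python. This is not AIML and not Kemo's actual knowledge base: it only mimics the AIML convention of upper-case patterns with a `*` wildcard. The rules, the `respond` function and the placeholder replies are hypothetical, and the PubChem call is stubbed out rather than performed.

```python
import re
from typing import Callable, List, Pattern, Tuple


def normalize(text: str) -> str:
    """Upper-case and drop punctuation, roughly as AIML normalizes input."""
    return " ".join(re.sub(r"[^A-Za-z0-9 ]+", " ", text).upper().split())


def compile_pattern(pattern: str) -> Pattern[str]:
    """Turn 'FIND ME *' into a regex where '*' captures one or more words."""
    parts = ["(.+)" if word == "*" else re.escape(word)
             for word in pattern.split()]
    return re.compile("^" + " ".join(parts) + "$")


def lookup_property(name: str, prop: str) -> str:
    """Stub standing in for a PubChem web service call (not implemented)."""
    return f"[would query PubChem for the {prop} of {name}]"


# Hypothetical knowledge base: pattern -> handler for the wildcard text.
RULES: List[Tuple[Pattern[str], Callable[[str], str]]] = [
    (compile_pattern("WHAT IS THE LOGP OF *"),
     lambda name: lookup_property(name, "logP")),
    (compile_pattern("FIND ME *"),
     lambda name: f"[would search PubChem for {name}]"),
]


def respond(text: str) -> str:
    """Return the reply of the first rule whose pattern matches the input."""
    cleaned = normalize(text)
    for pattern, handler in RULES:
        match = pattern.match(cleaned)
        if match:
            return handler(match.group(1).lower())
    return "Sorry, I do not understand."


print(respond("What is the logP of aspirin?"))
print(respond("Find me caffeine"))
```

Improving the knowledge base iteratively then amounts to appending rules to `RULES`. Note that this normalization discards hyphens and other punctuation, so systematic names such as 2-propanol would need gentler handling in a real system.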

20
Workflow in Xbaya - a meteorology tool!
http://www.extreme.indiana.edu/xgws/xbaya/
21
Indexing the world's chemical information AND
computational functionality

Crawl and index web pages, journal articles, etc.
for
Structures (InChIs, SMILES)
Images (converted using Clide or ChemReader)
Names (converted using OSCAR3 or similar package)
Other information (IR spectra, reactions, etc)
Technology still immature, but improving quickly
Problem with access to journal articles: we will
assume open access in the future!
Expose computational functionality as web
services, contextualize in an OWL-S ontology
(semantics), and publish in a UDDI registry
Now we know what information we have, and what we
can do with it
Develop bots and intelligent agents to
automatically do useful things
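
The structure-indexing step can be sketched for the simplest case, InChI strings that appear literally in page text. This is an illustration under stated assumptions: the regular expression is a rough heuristic rather than a validating InChI parser, the page identifiers and texts are invented, and image or name conversion (Clide, ChemReader, OSCAR3) is out of scope.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Heuristic: an InChI starts with "InChI=1/" or "InChI=1S/" and runs until
# whitespace, a quote or an angle bracket. This does not validate the string.
INCHI_RE = re.compile(r"InChI=1S?/[^\s\"'<>]+")

ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
WATER = "InChI=1S/H2O/h1H2"


def extract_inchis(text: str) -> List[str]:
    """Find InChI-like strings and trim trailing sentence punctuation."""
    return [match.rstrip(".,;") for match in INCHI_RE.findall(text)]


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Inverted index: InChI -> set of page identifiers that mention it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for inchi in extract_inchis(text):
            index[inchi].add(page_id)
    return dict(index)


# Invented page texts standing in for crawled documents.
pages = {
    "page-a": f"Ethanol, {ETHANOL}, is miscible with water.",
    "page-b": f"Solvents used: {ETHANOL}, {WATER}.",
}
index = build_index(pages)
print(sorted(index[ETHANOL]))
print(sorted(index[WATER]))
```

A structure query then becomes a dictionary lookup on the InChI, and the same index shape could hold entries produced by name or image conversion once those tools have turned their input into a structure identifier.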