Chemoinformatics - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Chemoinformatics


1
Chemoinformatics
  • David Wild, djwild_at_indiana.edu
  • Bioinformatics Retreat, Feb 2nd, 2007

2
Current state of chemoinformatics research
  • What works and what doesn't
  • Fingerprints, clustering and diversity
  • QSAR - predictive and descriptive methods,
    virtual screening
  • 3D similarity, pharmacophores, docking
  • Visualization, organization and navigation of
    chemical datasets
  • Current buzz areas in chemoinformatics
  • How can we use our internal strengths to do
    something new, important and impressive?

3
What works and what doesn't
  • 2D structure and similarity searching well
    established
  • Lots of papers comparing fingerprints for
    similarity
  • Some recent evidence that SciTegic ECFPs are
    better for recall of actives
  • Clustering well established but definite room for
    improvement
  • Traditional methods: Ward's, K-means,
    Jarvis-Patrick
  • Recently, single-pass similarity cutoff methods
    used for very fast organization: >0.85 for
    similar activity, >0.55 for QSAR
  • Data mining methods (ROCK, Chameleon, CURE, etc.)
    unexplored
  • Diversity: hot -> cold -> smart
  • QSAR - poor relation of academic work to industry
    usefulness
  • Lots of papers: "this method works best on this
    dataset"
  • Random forests appear to work rather well in
    practice
  • Interpretability vs. predictive ability
  • Predictive methods for LogP, pKa, solubility, etc.
    work reasonably well
  • Virtual screening virtually useless unless tied
    in with the HTS screening process. However, it is
    useful for exploring around leads.
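
The single-pass similarity cutoff idea above can be sketched in a few lines. This is a minimal illustration only, assuming fingerprints are represented as sets of on-bit indices and that "similarity" means the Tanimoto coefficient; the function names and the toy bit sets are made up for the example and are not taken from the talk.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def single_pass_cluster(fps: List[Set[int]], cutoff: float = 0.85) -> List[List[int]]:
    """Leader-style clustering in one pass over the data.

    Each fingerprint joins the first cluster whose leader it exceeds the
    cutoff against; otherwise it becomes the leader of a new cluster.
    """
    leaders: List[int] = []
    clusters: List[List[int]] = []
    for i, fp in enumerate(fps):
        for c, leader in enumerate(leaders):
            if tanimoto(fp, fps[leader]) > cutoff:
                clusters[c].append(i)
                break
        else:
            leaders.append(i)
            clusters.append([i])
    return clusters


# Toy fingerprints (made-up bit sets, not real molecules).
toy = [
    {1, 2, 3, 4, 5, 6, 7},
    {1, 2, 3, 4, 5, 6, 7, 8},  # Tanimoto 7/8 = 0.875 to item 0
    {20, 21, 22},              # shares no bits with item 0
    {1, 2, 3, 4, 5, 20},       # Tanimoto 5/8 = 0.625 to item 0
]

print(single_pass_cluster(toy, cutoff=0.85))  # [[0, 1], [2], [3]]
print(single_pass_cluster(toy, cutoff=0.55))  # [[0, 1, 3], [2]]
```

The result depends on input order, since whichever item arrives first becomes the leader. That is the price paid for needing only one pass.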

4
What works and what doesn't
  • Mostly, 3D methods haven't worked out yet
  • Similarity and QSAR: in almost every paper, 2D is
    better for recall and precision, but 3D methods
    give interesting ideas. Useful for lead hopping
  • Pharmacophore searching not widely used
  • Docking - very useful for visual inspection, poor
    correlation of scoring functions with binding
  • Visualization, organization and navigation of
    datasets
  • Still not clear how to work with datasets of more
    than a few hundred compounds
  • Dot plots, spreadsheet-based methods work
    minimally
  • Need for UI design and research

5
The current buzz in chemoinformatics
  • Decorporatization and commoditization of data and
    software
  • MLSCN, PubChem, open source, small companies
  • Crisis for the software companies, nice for
    academia
  • Pharma companies in the brown stuff without a
    paddle
  • Integration with other "-ics"
  • Data mining chemical/genomic information
  • Linking compounds -> proteins -> pathways, etc.
    (e.g. KEGG)
  • Fuzzy boundaries, integration with science and
    informatics
  • Microsoft 2020 vision for science
  • Integration of text and structure searching
  • Semantic web, services and mashups will probably
    have a BIG impact: exporting best of breed, but
    what happens to the rest?

6
Suggested collaboration areas
  • Chem/bio/complex systems mashups using web
    services in each of the areas: nice, confined
    projects for students once you have the
    infrastructure
  • Chem and complex can work together on integrating
    text and structure-based searching, indexing and
    crawling (e.g. networks of web services and
    databases), and intelligent agents
  • Data mining of chemogenomic information
  • Integration of advanced chemoinformatics methods
    with systems biology and pathway mapping tools
  • Performing research to establish best practices
    for areas of chemoinformatics
  • Tackling algorithmic problems for which there is
    currently no good solution - docking and scoring

7
Cyberinfrastructure
  • Geoffrey Fox
  • Computer Science, Informatics and Physics

8
Cyberinfrastructure
  • Supports distributed science: data, people,
    computers
  • Exploits Internet technology (Web 2.0), adding
    (via Grid technology) management, security,
    supercomputers, etc.
  • It has two aspects: parallel, with low latency
    (microseconds) between nodes, and distributed,
    with highish latency (milliseconds) between nodes
  • Parallel needed to get high performance on
    individual 3D simulations, data analysis, etc.;
    must decompose the problem
  • Distributed aspect integrates already distinct
    components
  • Cyberinfrastructure is in general a distributed
    collection of parallel systems
  • Cyberinfrastructure is made of services (usually
    Web services) that are just programs or data
    sources packaged for distributed access
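
To make "programs packaged for distributed access" concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the `formula_mass` function, the `/mass?formula=...` URL layout, and the four-element mass table are hypothetical and are not part of any infrastructure described in the talk.

```python
import json
import re
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse
from urllib.request import urlopen

# Small illustrative table of standard atomic masses (only four elements).
MASSES = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def formula_mass(formula: str) -> float:
    """The 'program': sum atomic masses for a simple formula such as C2H6O."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MASSES[symbol] * (int(count) if count else 1)
    return total


class MassHandler(BaseHTTPRequestHandler):
    """The 'packaging': expose formula_mass over plain HTTP GET as JSON."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        formula = query.get("formula", [""])[0]
        try:
            body, status = {"formula": formula, "mass": formula_mass(formula)}, 200
        except KeyError:
            body, status = {"error": "unsupported element"}, 400
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def call_service_once(formula: str) -> dict:
    """Start the service on a free local port, call it once, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), MassHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        with urlopen(f"http://127.0.0.1:{port}/mass?formula={formula}") as resp:
            return json.loads(resp.read().decode("utf-8"))
    finally:
        server.shutdown()
        server.server_close()
```

Calling `call_service_once("C2H6O")` should return a dictionary holding the formula and a mass of roughly 46.07. A client on another machine would only need the URL, not the code.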

9
TeraGrid: Integrating NSF Cyberinfrastructure
TeraGrid is a facility that integrates
computational, information, and analysis
resources at the San Diego Supercomputer Center,
the Texas Advanced Computing Center, the
University of Chicago / Argonne National
Laboratory, the National Center for
Supercomputing Applications, Purdue University,
Indiana University, Oak Ridge National
Laboratory, the Pittsburgh Supercomputing Center,
and the National Center for Atmospheric
Research. Today 100 teraflops; tomorrow a
petaflop. Indiana: 20 teraflops today.
10
Cyberinfrastructure at IU
  • Interpreted broadly (Web presences), there are
    many activities at IU
  • Interpreted narrowly as the programmable web or
    using Grid technologies, there are large
    projects in atmospheric, earthquake and ice-sheet
    sciences, network systems, particle physics,
    crystallography and cheminformatics
  • IU has an international reputation in both
    parallel and distributed Cyberinfrastructure
    including education, research and resources
  • IU has the #31 supercomputer in the world and is
    part of two major national activities: TeraGrid
    and Open Science Grid
  • There are several well known Bioinformatics Grids
    such as BIRN (mainly images) and caBIG (cancer
    databases) from NIH and MyGrid from UK (EBI)
  • Could be opportunities to link Biology and
    Informatics/CS in Cyberinfrastructure projects

11
Cyberinfrastructure motivated by Web 2.0
  • Capture the power of interactive Web/Grid sites
    enabling people to create, collaborate and build
    on each other's work

12
Web services, workflows, portals and ontologies
  • Web Services allow us to quickly develop and
    deploy new tools, interfaces that cross
    disciplines and are broadly accessible
  • Can use simple HTTP and ignore Web Service
    complications
  • Workflows (called mashups in Web 2.0) allow us to
    string together collections of web services to do
    computation that is tailored to the science (as a
    one-off or for re-use).
  • Develop core capabilities as services and use in
    many different ways, as in the 770 Google Maps
    mashups
  • APIs/Languages/Data structures/Ontologies (WSDL,
    AJAX, JSON at low level) allow us to describe
    workflows and services in discoverable, standard
    ways, such that reasoning tools can piece them
    together to match queries
  • Portals enable composable, reusable user
    interfaces
  • Distributed posting of services and easily
    available composition tools enable everybody to
    contribute
  • Interesting implications for broader
    participation
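
As a rough sketch of a workflow that strings services together, the example below chains three stand-in "services". All names here are hypothetical: in a real setting each step would be a remote web service call, whereas these are local functions with a toy synonym table and a deliberately crude descriptor.

```python
from typing import Callable, Dict, List

# A "service" here is anything that takes a message dict and returns a new one.
Service = Callable[[Dict], Dict]

# Toy synonym table standing in for a real name-to-structure service.
TOY_SYNONYMS = {"ethanol": "CCO", "methanol": "CO", "propane": "CCC"}


def synonym_service(msg: Dict) -> Dict:
    """Look up a SMILES string for a compound name (toy table only)."""
    return {**msg, "smiles": TOY_SYNONYMS[msg["name"].lower()]}


def descriptor_service(msg: Dict) -> Dict:
    """Crude heavy-atom count: capital letters in the SMILES string.

    Only valid for very simple SMILES; a real service would use a toolkit.
    """
    heavy = sum(1 for ch in msg["smiles"] if ch.isupper())
    return {**msg, "heavy_atoms": heavy}


def filter_service(msg: Dict) -> Dict:
    """Flag compounds under an arbitrary, illustrative size limit."""
    return {**msg, "small": msg["heavy_atoms"] <= 3}


def run_workflow(steps: List[Service], msg: Dict) -> Dict:
    """Pass the message through each service in order."""
    for step in steps:
        msg = step(msg)
    return msg


PIPELINE: List[Service] = [synonym_service, descriptor_service, filter_service]

print(run_workflow(PIPELINE, {"name": "Ethanol"}))
# {'name': 'Ethanol', 'smiles': 'CCO', 'heavy_atoms': 3, 'small': True}
```

Because every step has the same dict-in, dict-out shape, the same services can be reordered or reused in other workflows, which is the point of developing core capabilities once and composing them many ways.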

13
Model and Data Sharing
  • Cyberinfrastructure requires agreed sharing
    standards (data structures, APIs, protocols,
    ontologies, languages) because it is
    intrinsically internationally distributed
  • There are agreed data structures for taking
    Sequence -> Protein -> Folding -> Interaction
    transparently, e.g. BLAST
  • Nothing at the level where genomics and
    proteomics are important: cells and tissues.
  • Partial answers: CellML, FieldML, SBML, which do
    not link to relevant standards outside Biology
  • Need to connect models at these levels. Need
    standard ontologies/data structures for cell
    behaviors to allow connections and validation
  • Need to connect models like SBW (Systems Biology
    Workbench)/BioSpice -> cell-level models
    (Compucell) -> tissue-level models (Physiome)
  • Model builders at these scales not
    CS-sophisticated. Models NOT interoperable and
    don't use useful general ideas
  • Glazier organizing activity in this area with H.
    Sauro (U. Washington), W. Li (UCSD-SDSC), Hunter
    (U. Auckland) and NIH
  • Link to Open Grid Forum standard setting and
    community activities

14
http://www.chembiogrid.org
  • Database-enabled quantum chemistry computations
  • Services to link PubChem, supercomputers, results
    of high-throughput screening centers
  • Education: IU has unique Cheminformatics degrees
  • Portals

15
Chemical Informatics web service infrastructure
  • Database Services
  • Local NIH DTP Human Tumor Cell Line set
  • Local PubChem mirror
  • Derived properties database
  • Pub3D, PubDock
  • Synonym service
  • VARUNA quantum chemistry database
  • Statistics (based on R)
  • Regression, Neural Nets, Random Forest
  • LDA
  • K-means clustering
  • Plotting
  • T-test and distribution sampling
  • Computation Services
  • OpenEye FRED, OMEGA, FILTER
  • Cambridge OSCAR3
  • BCI fingerprint generation, Ward's, Divisive
    K-means clustering
  • Tox Tree
  • Similarity fingerprint calculations (CDK)
  • Descriptor calculation (CDK)
  • 2D structure diagrams (CDK)
  • 2D -> 3D File format conversions

16
Workflows - Taverna (taverna.sourceforge.net)
17
(No Transcript)
18
PubDock - Chimera-based interface
19
Kemo - A ChatBot for PubChem
  • Uses ALICE chatbot (www.alicebot.org)
  • AIML used to define knowledge base, e.g. reaction
    to common phrases like "FIND ME", "WHAT IS THE
    LOGP OF", etc.
  • Can iteratively improve knowledge base
  • Accesses PubChem through web service interface
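
The pattern-and-response idea behind an AIML knowledge base can be imitated in plain Python. The sketch below is not AIML and not the Kemo code; it assumes AIML-like patterns with a trailing `*` wildcard, and `fake_pubchem_lookup` is a hypothetical stand-in that returns a placeholder string instead of calling any web service.

```python
import re
from typing import Callable, List, Tuple


def normalize(text: str) -> str:
    """Upper-case the input and drop punctuation, roughly as AIML bots do."""
    return " ".join(re.sub(r"[^A-Za-z0-9 ]+", " ", text).upper().split())


def compile_pattern(pattern: str):
    """Turn a pattern such as 'FIND ME *' into a regex with one group per '*'."""
    pieces = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile("^" + "(.+)".join(pieces) + "$")


def fake_pubchem_lookup(name: str, prop: str) -> str:
    """Hypothetical stand-in for a web service call; returns a placeholder."""
    return f"<{prop} for {name} would be fetched here>"


# The knowledge base: (compiled pattern, handler) pairs, first match wins.
RULES: List[Tuple[object, Callable[..., str]]] = []


def add_rule(pattern: str, handler: Callable[..., str]) -> None:
    """Extend the knowledge base with one more pattern."""
    RULES.append((compile_pattern(pattern), handler))


def respond(text: str) -> str:
    """Answer with the first matching rule, or a fallback reply."""
    cleaned = normalize(text)
    for regex, handler in RULES:
        match = regex.match(cleaned)
        if match:
            return handler(*match.groups())
    return "Sorry, I do not know how to answer that yet."


add_rule("WHAT IS THE LOGP OF *", lambda name: fake_pubchem_lookup(name, "LogP"))
add_rule("FIND ME *", lambda name: fake_pubchem_lookup(name, "structure record"))

print(respond("What is the logP of aspirin?"))
# <LogP for ASPIRIN would be fetched here>
```

Improving the knowledge base iteratively then amounts to calling `add_rule` again whenever users ask something that currently hits the fallback reply. Note that this simple normalization also strips hyphens, so names like "2-propanol" would need more careful handling.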

20
Workflow in Xbaya - a meteorology tool!
http://www.extreme.indiana.edu/xgws/xbaya/
21
Indexing the world's chemical information AND
computational functionality
  • Crawl and index web pages, journal articles, etc.
    for:
  • Structures (InChIs, SMILES)
  • Images (converted using Clide or ChemReader)
  • Names (converted using OSCAR3 or similar package)
  • Other information (IR spectra, reactions, etc.)
  • Technology still immature, but improving quickly
  • Problem with access to journal articles: we will
    assume open access in the future!
  • Expose computational functionality as web
    services, contextualize in an OWL-S ontology
    (semantics), and publish in a UDDI
  • Now we know what information we have, and what we
    can do with it
  • Develop bots and intelligent agents to
    automatically do useful things
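
The structure-indexing step can be sketched for the easiest case, InChI strings that appear literally in page text. This is a minimal illustration under stated assumptions: the regular expression is a simple heuristic rather than a validating InChI parser, the page texts and `example.org` URLs are invented, and no crawling is performed.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Heuristic: an InChI starts with "InChI=1/" or "InChI=1S/" and runs to whitespace.
INCHI_RE = re.compile(r"InChI=1S?/[^\s\"'<>]+")


def extract_inchis(text: str) -> List[str]:
    """Return InChI-like strings found in text, minus trailing sentence punctuation."""
    return [match.rstrip(".,;") for match in INCHI_RE.findall(text)]


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Inverted index: InChI string -> set of page URLs that mention it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for url, text in pages.items():
        for inchi in extract_inchis(text):
            index[inchi].add(url)
    return dict(index)


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
WATER = "InChI=1S/H2O/h1H2"

# Invented page contents keyed by placeholder URLs.
toy_pages = {
    "http://example.org/a": f"Ethanol has the identifier {ETHANOL}.",
    "http://example.org/b": f"Mixtures of {ETHANOL} and {WATER} were studied.",
}

index = build_index(toy_pages)
print(sorted(index[ETHANOL]))  # ['http://example.org/a', 'http://example.org/b']
print(sorted(index[WATER]))    # ['http://example.org/b']
```

SMILES strings, images and chemical names are much harder to spot than InChIs, which is why the slide points to dedicated converters for those cases.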