EChemistry and Web 2.0 - PowerPoint PPT Presentation

Title: EChemistry and Web 2.0


1
E-Chemistry and Web 2.0
  • Marlon Pierce
  • mpierce_at_cs.indiana.edu
  • Community Grids Lab
  • Indiana University

2
One Talk, Two Projects
  • NIH-funded Chemical Informatics and
    Cyberinfrastructure Collaboratory (CICC) at IU.
  • Geoffrey Fox
  • Gary Wiggins
  • Rajarshi Guha
  • David Wild
  • Mookie Baik
  • Kevin Gilbert
  • And others
  • Proposed Microsoft-funded project: E-Chemistry
  • Carl Lagoze (Cornell),
  • Lee Giles (PSU),
  • Steve Bryant (NIH),
  • Jeremy Frey (Soton),
  • Peter Murray-Rust (Cambridge),
  • Herbert Van de Sompel (Los Alamos),
  • Geoffrey Fox (Indiana)
  • And others

3
CICC Infrastructure Vision
  • Chemical Informatics: drug discovery and other
    academic chemistry, pharmacology, and
    bioinformatics research will be aided by
    powerful, modern, open information technology.
  • NIH PubChem and PubMed provide unprecedented
    open, free data and information.
  • We need a corresponding open service architecture
    (i.e. avoid stove-piped applications)
  • CICC is set up as a distributed cyberinfrastructure
    in the eScience model
  • Web clients (user interfaces) to distributed
    databases, results of high throughput screening
    instruments, results of computational chemical
    simulations and other analyses.
  • Composed of clients to open service APIs
    (mash-ups)
  • Aggregated into portals
  • Web services manipulate this data and are
    combined into workflows.
  • So our main agenda items: create interesting
    databases and build lots of Web services and
    clients.

4
CICC Databases
  • Most of our databases aim to add value to PubChem
    or link into PubChem
  • 1D (SMILES) and 2D structures
  • 3D structures (MMFF94)
  • Searchable by CID, SMARTS, 3D similarity
  • Docked ligands (FRED, Autodock)
  • 906K drug-like compounds into 7 ligands
  • Will eventually cover 2000 targets
  • Philosophy: we have big computers, so let's
    calculate everything ahead of time and put the
    results in a DB.
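
The "calculate everything ahead of time" philosophy can be sketched as a precompute-then-look-up pattern. This is a minimal illustration only: the compound IDs, the table layout, and the `heavy_atom_estimate` function are hypothetical stand-ins, not the CICC schema or any real descriptor calculation.

```python
import sqlite3


def heavy_atom_estimate(smiles: str) -> int:
    """Toy stand-in for an expensive calculation (hypothetical).

    Counts uppercase letters in a SMILES string, which only roughly
    approximates the number of heavy atoms in simple aliphatic molecules.
    """
    return sum(1 for ch in smiles if ch.isupper())


def precompute(conn: sqlite3.Connection, compounds) -> None:
    """Run the calculation once per compound and store the result."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(cid INTEGER PRIMARY KEY, smiles TEXT, value INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
        [(cid, smi, heavy_atom_estimate(smi)) for cid, smi in compounds],
    )
    conn.commit()


def lookup(conn: sqlite3.Connection, cid: int):
    """Answer a query from the stored results without recomputing."""
    row = conn.execute(
        "SELECT value FROM results WHERE cid = ?", (cid,)
    ).fetchone()
    return None if row is None else row[0]


# Made-up compound IDs paired with SMILES strings.
conn = sqlite3.connect(":memory:")
precompute(conn, [(1, "CCO"), (2, "CC(=O)O")])
print(lookup(conn, 1), lookup(conn, 2))
```

The design trade-off is the one the slide names: compute time is spent up front on large machines so that services only need cheap database reads.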

5
Building Up the Infrastructure
  • Our SOA philosophy: use standard Web services.
  • Mostly stateless
  • Some cluster, HPC work needed but these populate
    databases
  • Services are aggregate-able into different
    workflows.
  • Taverna, Pipeline Pilot,
  • You can also build lots of Web clients.
  • See http://www.chembiogrid.org/wiki/index.php/CICC_Web_Resources
    for links and details.
  • Not so far from Web 2.0.

6
Sample Services
11
Web Client Interfaces
12
More Clients
13
More Clients
14
Example: PubDock
  • Database of approximately 1 million PubChem
    structures (the most drug-like) docked into
    proteins taken from the PDB
  • Available as a web service, so structures can be
    accessed in your own programs, or using workflow
    tools like Pipeline Pilot
  • Several interfaces developed, including one based
    on Chimera (right) which integrates the database
    with the PDB to allow browsing of compounds in
    different targets, or different compounds in the
    same target
  • Can be used as a tool to help understand
    molecular basis of activity in cellular or image
    based assays

15
Example: R Statistics applied to PubChem data
  • By exposing the R statistical package, and the
    Chemistry Development Kit (CDK) toolkit as web
    services and integrating them with PubChem, we
    can quickly and easily perform statistical
    analysis and virtual screening of PubChem assay
    data
  • Predictive models for particular screens are
    exposed as web services, and can be used either
    as simple web tools or integrated into other
    applications
  • Example uses DTP Tumor Cell Line screens - a
    predictive model using Random Forests in R makes
    predictions of probability of activity across
    multiple cell lines.

16
Example assay screening workflow: finding
cell-protein relationships
  • A protein implicated in tumor growth with a known
    ligand is selected (in this case HSP90, taken from
    the PDB 1Y4 complex).
  • The screening data from a cellular HTS assay is
    similarity searched for compounds with 2D
    structures similar to the ligand.
  • Structures similar to the ligand can be browsed
    using client portlets.
  • Similar structures are filtered for druggability,
    converted to 3D, and automatically passed
    to the OpenEye FRED docking program for docking
    into the target protein.
  • Once docking is complete, the user visualizes the
    high-scoring docked structures in a portlet using
    the Jmol applet.
  • Docking results and activity patterns are fed into
    R services for building activity models and
    correlations (least squares regression, random
    forests, neural nets).
17
Relevance to Web 2.0
  • Some Web 2.0 Key Features
  • REST Services
  • Use of RSS/Atom feeds
  • Client interfaces are mashups
  • Gadgets, widgets for portals aggregate clients
  • So
  • We provide RSS as an alternative WS format.
  • We have experimented with RSS feeds, using Yahoo
    Pipes to manipulate multiple feeds.
  • CICC Web interfaces can be easily wrapped as
    universal gadgets in iGoogle, Netvibes.
  • Alternative to classic science gateways.

18
RSS Feeds/REST Services
  • Provide access to DBs via RSS feeds
  • Feeds include 2D/3D structures in CML
  • Viewable in Bioclipse, Jmol as well as Sage etc.
  • Two feeds currently available
  • SynSearch: get structures based on full or
    partial chemical names
  • DockSearch: get best N structures for a target
  • Really hampered by size of DB and Postgres
    performance.
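
A feed that carries structures in CML might look roughly like the output of the sketch below. The feed layout, the example URL, and the `build_feed` helper are assumptions for illustration; the actual SynSearch and DockSearch feed formats are not shown in the slides. The CML element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `x2`, `y2`) follow the CML schema.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"


def build_feed(title, link, hits):
    """Build an RSS 2.0 document whose items embed CML molecules.

    `hits` is a list of (name, atoms) pairs, where atoms is a list of
    (element_symbol, x, y) tuples holding 2D coordinates.
    """
    ET.register_namespace("cml", CML_NS)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Structure search results"
    for name, atoms in hits:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = name
        molecule = ET.SubElement(item, f"{{{CML_NS}}}molecule", id=name)
        atom_array = ET.SubElement(molecule, f"{{{CML_NS}}}atomArray")
        for index, (element, x, y) in enumerate(atoms, start=1):
            ET.SubElement(
                atom_array,
                f"{{{CML_NS}}}atom",
                id=f"a{index}",
                elementType=element,
                x2=str(x),
                y2=str(y),
            )
    return ET.tostring(rss, encoding="unicode")


# Hypothetical feed with a single one-atom "hit".
feed_xml = build_feed(
    "DockSearch results",
    "http://example.org/feeds/docksearch",
    [("water", [("O", 0.0, 0.0)])],
)
print(feed_xml)
```

Because the molecule lives in its own XML namespace, ordinary feed readers can ignore it while chemistry-aware clients pick it up.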

19
Tools and mashups based on web service
infrastructure
http://www.chembiogrid.org/projects/proj_tools.html
20
Mining information from journal articles
  • Until now, SciFinder / CAS has been the only
    chemistry-aware portal into journal information
  • We can access full text of journal articles
    online (with subscription)
  • ACS does not make full text available but there
    are ways round that!
  • RSC is now marking up with SMILES and GO/Goldbook
    terms!
  • www.projectprospect.org
  • Having SMILES or InChI means that we can build a
    similarity/structure searchable database of
    papers, e.g. "find me all the papers published
    since 2000 which contain a structure with 90%
    similarity to this one"
  • In the absence of full text, we can at least use
    the abstract

21
Text Mining: OSCAR
  • A tool for shallow, chemistry-specific natural
    language parsing of chemical documents (e.g.
    journal articles).
  • It identifies (or attempts to identify)
  • Chemical names: singular nouns, plurals, verbs
    etc., also formulae and acronyms.
  • Chemical data: spectra, melting/boiling point,
    yield etc. in experimental sections.
  • Other entities: things like N(5)-C(3) and so on.
  • Part of the larger SciBorg effort
  • See http://www.cl.cam.ac.uk/aac10/escience/sciborg.html
  • http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3
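
To give a feel for what extracting "chemical data" from an experimental section involves, here is a small regular-expression sketch. It is not OSCAR and does not use OSCAR's methods; the patterns, the output keys, and the `extract_data` function are hypothetical and cover only two simple cases (melting point and yield).

```python
import re

# Matches e.g. "mp 132-134 °C", "m.p. 98 C"; the upper bound is optional.
MELTING_POINT = re.compile(
    r"\bm\.?p\.?\s*(\d+(?:\.\d+)?)(?:\s*-\s*(\d+(?:\.\d+)?))?\s*°?\s*C\b",
    re.IGNORECASE,
)

# Matches e.g. "85% yield" or "72.5 % yield".
YIELD = re.compile(r"(\d+(?:\.\d+)?)\s*%\s*yield", re.IGNORECASE)


def extract_data(text: str) -> dict:
    """Pull a melting point range and a yield out of free text, if present."""
    found = {}
    mp_match = MELTING_POINT.search(text)
    if mp_match:
        low = float(mp_match.group(1))
        high = float(mp_match.group(2)) if mp_match.group(2) else low
        found["melting_point_C"] = (low, high)
    yield_match = YIELD.search(text)
    if yield_match:
        found["yield_percent"] = float(yield_match.group(1))
    return found


sentence = "The product was obtained as a white solid (85% yield), mp 132-134 C."
print(extract_data(sentence))
```

Real experimental sections vary far more than these two patterns allow, which is why a dedicated chemistry-aware parser is needed.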

22
Mash-Up: What published compounds might bind to
this protein?
  • Database service
  • Create a database containing the text of all
    recent PubMed abstracts (2006-2007, 500,000)
  • Use OSCAR to extract all of the chemical names
    referred to in the abstracts and convert them to
    SMILES
  • Docking service
  • Convert molecules to 3D and dock them into a
    protein of interest
  • Visualize top docked molecules in a Google-like
    interface
23
E-Chemistry and Digital Libraries
  • We can't wait to get started.

24
E-Chemistry and Digital Libraries
  • Key problem with our SOA-based e-Science is
    information management.
  • Where is the service that I need?
  • What does it do?
  • We may consider our data-centric services to be
    digital libraries.
  • Data is diverse
  • Documents
  • Not just computational information like
    structures.
  • Another point of view: how can I link together
    publications, results, workflows, etc.?
  • That is, I need to manage digital documents.

25
Digital Libraries
  • Open Archives Initiative Object Reuse and
    Exchange Project (OAI-ORE)
  • Developing standardized, interoperable, and
    machine-readable mechanisms to express
    information about compound information objects on
    the web.
  • Graph-based representations of connected digital
    objects.
  • Objects may be encoded in (for example) RDF or
    XML,
  • Retrievable via repositories with REST service
    interfaces (cf. Atom Publishing Protocol)
  • Obtain, harvest, and register
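
The graph-based view of a compound object can be sketched as a list of subject-predicate-object triples. The two predicate URIs come from the OAI-ORE vocabulary; everything else (the example.org URIs and both helper functions) is hypothetical, and a real implementation would more likely use an RDF library and a standard serialization.

```python
ORE_DESCRIBES = "http://www.openarchives.org/ore/terms/describes"
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"


def build_resource_map(map_uri, aggregation_uri, resource_uris):
    """Return triples stating that a resource map describes an aggregation
    and that the aggregation groups the given resources."""
    triples = [(map_uri, ORE_DESCRIBES, aggregation_uri)]
    triples += [(aggregation_uri, ORE_AGGREGATES, uri) for uri in resource_uris]
    return triples


def aggregated_resources(triples, aggregation_uri):
    """List the resources that belong to one aggregation."""
    return [
        obj
        for subj, pred, obj in triples
        if subj == aggregation_uri and pred == ORE_AGGREGATES
    ]


# A paper, the workflow that produced its results, and the result data,
# grouped into one compound object (all URIs are placeholders).
parts = [
    "http://example.org/paper.pdf",
    "http://example.org/workflow.xml",
    "http://example.org/docking-results.csv",
]
triples = build_resource_map(
    "http://example.org/rem", "http://example.org/aggregation", parts
)
print(aggregated_resources(triples, "http://example.org/aggregation"))
```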

28
Challenges for E-Chemistry
  • Can digital library principles be applied to data
    as well as documents?
  • Can you link your workflow to your conference
    paper?
  • Can we engineer a publishing framework and
    message formats around Web 2.0 principles?
  • REST, Atom Publishing Protocol, Atom Syndication
    Format, JSON, Microformats
  • Can we do this securely?
  • Access control, provenance, and identity
    federation are key problems.

30
More Information
  • Project Web Site: www.chembiogrid.org
  • Project Wiki: www.chembiogrid.org/wiki
  • Contact me: mpierce_at_cs.indiana.edu

32
Chemical Informatics and Cyberinfrastructure
Collaboratory (CICC), funded by the National Institutes
of Health, www.chembiogrid.org
CICC Combines Grid Computing with Chemical
Informatics
Large Scale Computing Challenges
Science and Cyberinfrastructure
CICC is an NIH-funded project to support the chemical
informatics needs of High Throughput Cancer
Screening Centers. The NIH is creating a data
deluge of publicly available data on potential
new drugs.
Chemical informatics is a non-traditional area of
high performance computing, but many new,
challenging problems may be investigated.
NIH PubMed DataBase
OSCAR Text Analysis
Toxicity Filtering
Cluster Grouping
Docking
Initial 3D Structure Calculation
OSCAR-mined molecular signatures can be
clustered, filtered for toxicity, and docked onto
larger proteins. These are classic pleasingly
parallel tasks. Top-ranking docked molecules
can be further examined for drug potential.
Chemical informatics text analysis programs can
process 100,000s of abstracts of online
journal articles to extract chemical signatures
of potential drugs.
Molecular Mechanics Calculations
Big Red (and the TeraGrid) will also enable us to
perform time consuming, multi-stepped Quantum
Chemistry calculations on all of PubMed. Results
go back to public databases that are freely
accessible by the scientific community.
  • CICC supports the NIH mission by combining state
    of the art chemical informatics techniques with
  • World class high performance computing
  • National-scale computing resources (TeraGrid)
  • Internet-standard web services
  • International activities for service
    orchestration
  • Open distributed computing infrastructure for
    scientists world wide

NIH PubChem DataBase
Quantum Mechanics Calculations
IU's Varuna DataBase
POVRay Parallel Rendering
Indiana University Department of Chemistry,
School of Informatics, and Pervasive Technology
Laboratories
33
MLSCN Post-HTS Biology Decision Support
  • Percent Inhibition or IC50 data is retrieved from
    HTS.
  • Question: Was this screen successful?
  • Workflows encoding plate and control well
    statistics, distribution analysis, etc.
  • Question: What should the active/inactive cutoffs
    be?
  • Workflows encoding distribution analysis of
    screening results.
  • Question: What can we learn about the target
    protein or cell line from this screen?
  • Workflows encoding statistical comparison of
    results to similar screens, docking of compounds
    into proteins to correlate binding with
    activity, literature search of active compounds,
    etc.
  • Compounds submitted to PubChem.
  • Grids can link data analysis (e.g. image
    processing developed in existing Grids),
    traditional cheminformatics tools, and
    annotation tools (Semantic Web, del.icio.us) to
    enhance lead ID and SAR analysis.
  • A Grid of Grids linking collections of services at
    PubChem, ECCR centers, and MLSCN centers.
  • Diagram labels: CHEMINFORMATICS, PROCESS, GRIDS
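
One way a workflow could answer "was this screen successful?" from control well statistics is the Z'-factor, a widely used assay quality measure. The slides do not name the statistic they use, so treat this as an assumed example; the percent inhibition formula likewise assumes that negative controls are uninhibited and positive controls are fully inhibited.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z'-factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    mu_pos, mu_neg = mean(positive_controls), mean(negative_controls)
    sd_pos, sd_neg = stdev(positive_controls), stdev(negative_controls)
    return 1.0 - 3.0 * (sd_pos + sd_neg) / abs(mu_pos - mu_neg)


def percent_inhibition(signal, mu_neg, mu_pos):
    """Scale a well's signal between the uninhibited mean (0 percent)
    and the fully inhibited mean (100 percent)."""
    return 100.0 * (mu_neg - signal) / (mu_neg - mu_pos)


# Made-up plate readings.
positive = [10.0, 12.0, 8.0]      # fully inhibited control wells
negative = [100.0, 102.0, 98.0]   # uninhibited control wells
print(round(z_prime(positive, negative), 3))
print(percent_inhibition(55.0, mean(negative), mean(positive)))
```

A common rule of thumb treats a Z'-factor above 0.5 as indicating a well-separated assay, although the cutoff a given screening center applies may differ.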
34
R Web Services
35
Why?
  • Need access to math and stat functionality
  • Did not want to recode algorithms
  • Wanted latest methods
  • Needed a distributed approach to computation
  • Keep computation on a powerful machine
  • Access it from a smaller machine

36
Why R?
  • Free, open-source
  • Many cutting edge methods available
  • Flexible programming language
  • Interfaces with many languages
  • Python
  • Perl
  • Java
  • C

37
The R Server
  • R can be run as a remote compute server
  • Requires the Rserve package
  • Allows authenticated access over TCP/IP
  • Connections can maintain state
  • Client libraries for Java and C

38
R as a Web Service
  • On its own the R server is not a web service
  • We provide Java frontends to specific
    functionalities
  • The frontend classes are hosted in a Tomcat web
    container
  • Accessible via SOAP
  • Full Javadocs for all available WSs

39
Flowchart
40
Functionality
  • Two classes of functionality
  • General functions
  • Allows you to supply data and build a predictive
    model
  • Sample from various distributions
  • Obtain scatter plots and histograms
  • Model development functions use a Java front-end
    to encapsulate model specific information

41
Functionality
  • Two classes of functionality
  • Model deployment
  • Allows you to build a model outside of the
    infrastructure
  • Place the final model in the infrastructure
  • Becomes available as a web service
  • Each model deployed requires its own front end
    class
  • In general, these classes are identical - could
    be autogenerated

42
Available Functionality
  • Predictive models - OLS, RF, CNN, LDA
  • Clustering - k-means
  • Statistical distributions
  • XY plot and scatter plots
  • Model deployment for single model types and
    ensemble model types

43
Deployed Models
  • Since deployed models are visible as web services
    we can build a simple web front end for them
  • Examples
  • NCI anti-cancer predictions
  • Ames mutagenicity predictions

44
Applications
  • The R WS is not restricted to atomic
    functionality
  • Can write a whole R program
  • Load it on the R compute server
  • Provide a Java WS frontend
  • Examples
  • Feature selection
  • Automated model generation
  • Pharmacokinetic parameter calculation

45
Data Input/Output
  • Most modeling applications require data matrices
  • Depending on client language we can use
  • SOAP array of arrays (2D matrices)
  • SOAP array (1D vector form of a 2D matrix)
  • VOTables
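
The "1D vector form of a 2D matrix" option can be sketched as follows. The sketch assumes row-major ordering and sends the dimensions alongside the values; the actual CICC services may use a different convention, and R itself fills matrices column by column by default, so the ordering must be agreed between client and service.

```python
def flatten(matrix):
    """Turn a rectangular list of rows into (values, n_rows, n_cols),
    with values in row-major order."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    if any(len(row) != n_cols for row in matrix):
        raise ValueError("all rows must have the same length")
    values = [value for row in matrix for value in row]
    return values, n_rows, n_cols


def unflatten(values, n_rows, n_cols):
    """Rebuild the list of rows from the row-major vector."""
    if len(values) != n_rows * n_cols:
        raise ValueError("vector length does not match the dimensions")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


# Two observations (rows) of three descriptors (columns).
descriptors = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
vector, rows, cols = flatten(descriptors)
print(vector, rows, cols)
print(unflatten(vector, rows, cols))
```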

46
Data Input/Output
  • Some R web services can take a URL to a VOTables
    document
  • Conversion to R or Java matrices is done by a
    local VOTables Java library
  • R also has basic support for VOTables directly
  • Ignores binary data streams

47
Interacting With R WSs
  • Traditional WSs do not maintain state
  • Predictive models are different
  • A model is built at one time
  • May be used for prediction at another time
  • Need to maintain state
  • State is maintained by serialization to R binary
    files on the compute server
  • Clients deal with model IDs

48
Interacting with R WSs
  • Protocol
  • Send data to model WS
  • Get back model ID
  • Get various information via model ID
  • Fitted values
  • Training statistics
  • New predictions
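
The model ID protocol can be sketched as a small store that fits a model, serializes it to disk, and hands back an ID for later calls. This is an illustration under stated assumptions: Python's `pickle` stands in for R binary files, a one-variable least squares fit stands in for the R models, and the `ModelStore` class and its method names are hypothetical rather than the service's actual interface.

```python
import os
import pickle
import tempfile
import uuid


class ModelStore:
    """Keeps fitted models on disk so that stateless calls can share them."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, model_id):
        return os.path.join(self.directory, f"{model_id}.pkl")

    def build(self, xs, ys):
        """Fit y = slope * x + intercept, save the model, return its ID."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        model = {
            "slope": slope,
            "intercept": intercept,
            "fitted": [slope * x + intercept for x in xs],
        }
        model_id = uuid.uuid4().hex
        with open(self._path(model_id), "wb") as handle:
            pickle.dump(model, handle)
        return model_id

    def _load(self, model_id):
        with open(self._path(model_id), "rb") as handle:
            return pickle.load(handle)

    def fitted_values(self, model_id):
        return self._load(model_id)["fitted"]

    def predict(self, model_id, new_xs):
        model = self._load(model_id)
        return [model["slope"] * x + model["intercept"] for x in new_xs]


with tempfile.TemporaryDirectory() as workdir:
    model_id = ModelStore(workdir).build([0, 1, 2, 3], [1, 3, 5, 7])
    # A later, separate call only needs the ID, not the training data.
    print(ModelStore(workdir).predict(model_id, [4]))
```

Because the state lives in files keyed by ID, each individual service call stays stateless, which fits the SOA approach described earlier in the talk.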

49
Cheminformatics at Indiana University School of
Informatics
  • David J. Wild
  • djwild_at_indiana.edu
  • Associate Director of Chemical Informatics and
    Assistant Professor
  • Indiana University School of Informatics,
    Bloomington
  • http://djwild.info

50
Cheminformatics education at Indiana
  • M.S. in Chemical Informatics
  • 2 years, 36 semester hours
  • Includes a 6-hour capstone / research project
  • Opportunity to work in Laboratory Informatics
    (IUPUI) or closely with Bioinformatics (IUB)
  • Currently 9 students enrolled
  • Ph.D. in Informatics, Cheminformatics Specialty
  • 90 credit hours, including 30 hours dissertation
    research. Usually 4 years.
  • Research rotations expose students to research in
    related areas
  • Currently 4 students enrolled
  • Graduate Certificate
  • 4 courses, all available by Distance Education
  • I571 Chemical Information Technology
  • I572 Computational Chemistry and Molecular Modeling
  • I573 Programming for Science Informatics
  • I553 Independent Study in Chemical Informatics
  • D.E. students pay in-state fees! ($800 per
    class)
  • See http://cheminfo.informatics.indiana.edu for
    more information, or a general review of
    cheminformatics education in Drug Discovery Today
    11(9/10) (May 2006), pp. 436-439

51
Distance Education for Cheminformatics
  • Uses Breeze teleconference for live sharing of
    classes; all that is required is a P.C. and a
    telephone. Optional Polycom videoconferencing.
  • Lectures are recorded for easy playback through a
    web browser
  • Wiki or similar webpage for dissemination of
    course materials
  • Also participate in CIC courseshare to give class
    at University of Michigan
  • Of 75 students taking our courses since fall
    2005, 39 have been D.E. students
  • See JCIM 2006, 46(2), pp. 495-502 for more
    details

52
Current research in the Wild lab
  • Integration of cheminformatics tools and data
    sources
  • A web service infrastructure for cheminformatics
  • Compound information aggregation web service
    and interface ("by the way" box)
  • An enhanced chatbot for exploiting chemical
    information web services
  • Semantically aware workflow tools for
    cheminformatics
  • Data mining the NIH DTP tumor cell line database
  • PubDock: a docking database for PubChem
  • Aggregating life science information from web and
    journal documents
  • Data mining semantically rich chemistry journal
    articles
  • Document similarity based on chemical structure
    similarity
  • Evaluating semantic markup of chemistry journal
    articles
  • Integrating cheminformatics into the chemistry
    lab
  • Integrating cheminformatics with the Second Life
    virtual world
  • Integrating cheminformatics tools with electronic
    lab notebooks
  • Usability of cheminformatics tools

53
Current research in the Guha lab
  • Predictive Modeling
  • Interpretation, validation, domain applicability
  • Generalization to other models such as docking,
    pharmacophore etc
  • Integration of multiple data types
  • Addressing imbalanced and noisy datasets
  • Analysis of Chemical Spaces
  • Quantify distributions in spaces
  • Investigation of density approaches
  • Applications to lead hopping, model domains
  • Methods to summarize and compare data
  • Applications to HTS and smaller lead series type
    datasets
  • Network models combining chemical structures and
    biological systems
  • Software and infrastructure
  • Model exchange and annotation
  • Pharmacophore representations, matching
  • Toolkit development (CDK)

54
Cheminformatics web service infrastructure
  • Cheminformatics services
  • Docking (FRED)
  • 3D structure generation (OMEGA)
  • Filtering (FRED, etc)
  • OSCAR3
  • Fingerprints (BCI, CDK)
  • Clustering (BCI)
  • Toxicity prediction (ToxTree)
  • R-based predictive models
  • Similarity calculations (CDK)
  • Descriptor calculation (CDK)
  • 2D structure diagrams (CDK)
  • Database Services
  • PostgreSQL / gNova
  • PubChem mirror (augmented)
  • Pub3D - 3D structures for PubChem
  • PubDock - Bound 3D structures
  • Compound-indexed journal article DB
  • NIH Human Tumor Cell Line
  • Local PubChem mirror
  • VARUNA quantum chemistry database
  • Statistics (based on R)
  • Regression, LDA
  • Neural Nets, Random Forest
  • K-means clustering
  • Plotting
  • T-test and distribution sampling

Xiao Dong, Kevin E. Gilbert, Rajarshi Guha, Randy
Heiland, Jungkee Kim, Marlon E. Pierce, Geoffrey
C. Fox and David J. Wild, "Web service
infrastructure for chemoinformatics," Journal of
Chemical Information and Modeling, 2007, 47(4), pp.
1303-1307
55
RSC Project Prospect - what can we do with the
information?
  • www.projectprospect.org
  • 100 papers marked up with SMILES/InChI (using
    OSCAR3), plus Gene Ontology and Goldbook Ontology
    terms
  • Created similarity searchable PostgreSQL / gNova
    database with paper DOIs, SMILES, and ontology
    terms
  • Web service and simple HTML interfaces for
    searching, e.g. "which papers reference compounds
    similar to this one in the scope of these
    ontological terms?"
  • Applying statistics to look at co-occurrence of
    compounds, structural features (MACCS keys) and
    ontological terms in papers

56
Greasemonkey / OSCAR script
http://cheminfo.informatics.indiana.edu:8080/ChemGM/index.jsp
57
"By the way" annotation (mock-up!)
  • By the way...
  • This compound is very similar to a prescription
    drug, Tamoxifen.
  • This compound is referenced in 20 journal articles
    published in the last 5 years.
  • Similar compounds are associated with the words
    "toxic" and "death" in 280 web pages.
  • It appears to be covered under 3 patents.
  • It has been shown to be active in 5 screens.
  • Computer models predict it to show some activity
    against 8 protein targets.
  • Here are some comments on this compound:
  • David Wild: don't take any notice of the
    computational models - they are rubbish
58
Cheminformatics-aware simple lab notebook
(mock-up!)
  • Plug-in allows structures to be drawn with the
    pen and cleaned up.
  • Example notebook text: Some useful chemical
    reactions: Iodoacetate and Iodoacetamide
    (I-CH2COO-, ICH2CONH2). This may also react,
    chem favored by alkaline pH.
  • Web service interface provides access to
    computation and searching. Page is marked up by
    what is possible (e.g. FIND INFO ABOUT THIS
    REACTION).
  • Free text input can be converted to
    machine-readable form by Electrovaya.
  • Automatic detection of data fields (yield, etc.)
    where possible.
59
Automatic workflow generation and natural
language queries
  • Develop service ontology using OWL-S or similar
    language
  • Allows service interoperability, replacement and
    input/output compatibility
  • We can then use generic reasoning and network
    analysis tools to find paths from inputs to
    desired outputs
  • Natural language can be parsed to inputs and
    desired outputs
  • Smart Clients Agents Services
  • Possible supercharged life science Google? -
    e.g. type in "what compounds might bind to the
    enclosed protein?"
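
Finding a path from inputs to desired outputs can be sketched as a breadth-first search over the data types that services consume and produce. The service names and type labels below are hypothetical, each service is simplified to a single input and output, and no ontology language such as OWL-S is involved; a real system would reason over much richer service descriptions.

```python
from collections import deque

# Hypothetical registry: service name -> (input type, output type).
SERVICES = {
    "name_to_2d": ("chemical name", "2D structure"),
    "2d_similarity": ("2D structure", "2D structure"),
    "2d_to_3d": ("2D structure", "3D structure"),
    "dock": ("3D structure", "docked complex"),
}


def find_workflow(services, start_type, goal_type):
    """Return the shortest list of service names that turns start_type
    into goal_type, or None if no chain of services can do it."""
    queue = deque([(start_type, [])])
    seen = {start_type}
    while queue:
        current, path = queue.popleft()
        if current == goal_type:
            return path
        for name, (input_type, output_type) in services.items():
            if input_type == current and output_type not in seen:
                seen.add(output_type)
                queue.append((output_type, path + [name]))
    return None


print(find_workflow(SERVICES, "chemical name", "docked complex"))
```

Breadth-first search returns a chain with the fewest services; a natural language front end would only need to map a question onto the start and goal types.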

[Figure: example workflow graph. Service nodes: 2D structure crawler, 2D similarity, 2D - 3D conversion, 3D search, pharmacophore search, dock. Data passed between services: 2D structures, 3D structures, 3D protein structure, 3D structure complexes, result. Relation labels: "are compounds", "bind".]