Title: http://www.pdbj.org/
1. PRAGMA 11 Workshop, Osaka Univ., October 16, 2006
Integrated Web Service for Database Queries and Computations: Grid Web Services for Analog Queries to Protein 3D Structures
Haruki Nakamura, Institute for Protein Research, Lab. of Protein Informatics, Osaka University
2. BioGrid at Osaka (http://www.biogrid.jp)
- Started in 2002
- Goals: Grid technology development for biotechnology (drug discovery) and medical sciences
- Leader: Shinji Shimojo (CMC, Osaka Univ.)
- Government support (MEXT): 5 years
- H. Nakamura, K. Yamaguchi, T. Takada, H. Matsuda, S. Shimojo, S. Date, T. Akiyama
- Integration of Data Grid and Computing Grid to access integrated and standardized databases with high-performance computers, from anywhere and at low cost.
3. Databases in Life Sciences: nearly 1,000 DBs
[Figure: life-science databases arranged by scale (large to fine: organism, organ, tissue, cell, organelle, molecule, atom) and by conceptual level (less conceptual (formal) to highly conceptual). Databases shown: OMIM, MEDLINE, MDDR, Gene Ontology, DIP, KEGG, GEO, InterPro, Pfam, Swiss-Prot, DDBJ/EMBL/GenBank, H-Inv, FANTOM, PDB, PubChem, Ligand, ChEBI. Data types shown: medical, drug data, literature, function, sugar chain, molecular interaction network, gene expression, mass analysis, 3D coordinates, chemical structure, sequence.]
4. BioDataGrid by Hideo Matsuda at Osaka Univ.
6. BioPfuga: Biosimulation Platform United on Grid Architecture
Nakamura et al. (2004) New Generation Computing, 22, 157-166.
- A platform on which individual application programs at different levels are united to execute a hybrid computation. In particular, BioPfuga is a platform for biosimulation.
http://www.biogrid.jp/
7. Construction of BioPfuga
- Propose and design communication between program components using an XML description (BMSML: BioMolecular Simulation-ML), and create the APIs to handle it.
- Divide programs into a set of many components.
- Implement the different program modules as service programs based on the OGSA (Open Grid Services Architecture) mechanism.
10. Quantum mechanics (QM) and molecular mechanics (MM) simulation coupled on BioPfuga
Coupled simulation on Grids (Yonezawa et al. at SC2004)
Hamiltonian of the total system:
$$H_{\mathrm{total}} = H_{\mathrm{QM}}(x,x) + H_{\mathrm{QM/MM}}(x,y) + H_{\mathrm{MM}}(y,y)$$
[Figure labels: active site of an enzyme, ligand, QM area (calculated in QM), MM area (calculated in MM).]
11. PKCd-C1B: 856 atoms, 8672 AOs
- Simulation of electronic structure (1): AMOSS (Ab initio Molecular Orbital Simulation for Supercomputer), developed by the NEC Quantum Chemistry Group. Rapid computation for huge molecular systems (20,000 basis functions for RHF; 1,000 basis functions for CASSCF and MP2), with high parallel performance.
- Simulation of electronic structure (2): GSO-X, generalized spin density functional theory (DFT), developed by Shusuke Yamanaka and Kizashi Yamaguchi at Osaka Univ.
- Simulation of protein-solvent structure: prestoX-basic (Protein Engineering Simulator eXtended, basic version), developed by Yoshifumi Fukunishi at JBIRC-AIST and Haruki Nakamura at the Institute for Protein Research, Osaka Univ.
12. A hybrid QM/MM with BioPfuga on the Grid Service at SC2004
[Figure labels: client controller and viewer portal; user at Pittsburgh, USA; sites labeled Site-A, Site-B, and Site-C; locations labeled IPR Osaka, TokyoIT, and CMC Osaka; MM (MD) with prestoX (MPI) and MD-Grape2 (special-purpose computing board, max 50 GFLOPS); QM (DFT) with GSO-X (DFT); resource labels 8 CPU, 8 boards, MPI, 50 CPU; the components exchange data as BMSML.]
13. Configuration of a computing system on a Grid environment with several different program modules
[Figure: a personal user runs a portal/client program that communicates using SOAP with a distributed computing system offering three services: QM (HF), QM (DFT), and MM.]
15. Integration of computing systems and database query systems on a Grid
[Figure: a personal user runs a portal/client program that communicates using SOAP with a distributed computing system. Site-A, site-B, and site-C run GT and host Service A, Service B, and Service C as Grid computing services.]
H. Nakamura, BioGrid2005 (March 9, 2005)
16. Integration of computing systems and database query systems on a Grid
[Figure: a personal user runs a portal/client program that communicates using SOAP with a distributed computing and DB system. Site-A, site-B, and site-C run GT and host Service A, Service B, and Service C, now offering both Grid computing services and database query services.]
H. Nakamura, BioGrid2005 (March 9, 2005)
17. Proposed Portal for Multiple Databases for Protein Structures
Berman, H. et al. (2006) Structure, 14, 1211-1217.
[Figure labels: Theoretical Model DB 1, Theoretical Model DB 2, Theoretical Model DB 3]
Figure 1. A Schematic for the Proposed Portal
The portal is envisioned to be a modular web services architecture achieved by using an implementation of the Simple Object Access Protocol (SOAP; http://www.w3.org/TR/soap/) that allows for seamless data exchange between the portal and all registered contributors. SOAP is a lightweight protocol for exchange of information in a decentralized, distributed environment. It is an XML-based protocol, which defines a framework for representing remote procedure calls and responses. The XML wrappers basically map the individual contributing metadata format into an XML format that the data portal understands (Drinkwater et al., 2004). These services will be platform and language independent, allowing other services (other portals or clients) to communicate with the data portal (see http://www.e-science.clrc.ac.uk/web/projects/dataportal/).
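As a rough illustration of the SOAP exchange described in the caption, the sketch below builds a SOAP 1.1 request envelope for one remote procedure call and reads a value back out of an envelope, using only the Python standard library. The operation name `getEntry`, the parameter `pdbId`, and the service namespace `http://example.org/portal` are hypothetical placeholders, not part of any real portal interface.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/portal"  # hypothetical service namespace


def build_request(operation, params):
    """Wrap one remote procedure call in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def read_value(xml_bytes, tag):
    """Return the text of the first element named `tag` inside the SOAP body."""
    root = ET.fromstring(xml_bytes)
    body = root.find(f"{{{SOAP_ENV}}}Body")
    return body.find(f".//{{{SERVICE_NS}}}{tag}").text


# Hypothetical call: ask the service for the entry with PDB ID 1crn.
request = build_request("getEntry", {"pdbId": "1crn"})
```

Because both the call and its reply are plain XML, a client in any language can produce or consume them, which is the platform and language independence the caption refers to.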
18. PDB (Protein Data Bank): Protein Tertiary Structure Database
Atom species and coordinates, amino acid residues, and other experimental results with raw data, experimental methods, and conditions.
From X-ray crystallography, NMR, and electron microscopy experiments.
20. International collaboration in wwPDB
- 1) Curation, data processing, and registration are carried out by all the members in collaboration with each other.
- 2) There is a single data archive, maintained by one archive keeper (RCSB).
- 3) Data formats and new descriptions are discussed among the members.
- 4) Members are encouraged to develop their own browsers, viewers, and other APIs and services.
(Berman, Henrick and Nakamura (2003) Nat. Struct. Biol. 10, 980)
21. Protein Data Bank Japan (http://www.pdbj.org/)
At the Institute for Protein Research, Osaka Univ., since 2001, supported by the Institute for Bioinformatics Research and Development, Japan Science and Technology Agency (BIRD-JST).
22. Processed data numbers at PDBj
We process more than 30% of the data deposited worldwide, mainly from the Asian and Oceania regions.
23. Maintain Format Standards
- PDB (conventional, flat file)
- PDB Exchange (mmCIF)
  - Mechanism for extension based on new demands
- PDBML
  - Derived from mmCIF
  - All entries converted to XML
  - Automatic translation from mmCIF data files and dictionaries
  - Three styles of translation released
(Westbrook, Ito, Nakamura, Henrick, Berman (2005) Bioinformatics, 21, 988-992)
24. PDBML: canonical XML description of PDB data, developed by the wwPDB
(Westbrook et al. (2005) Bioinformatics, 21, 988-992)
http://pdbml.pdb.org/schema/pdbx.xsd, http://pdbml.pdb.org/schema/pdbx-ext.xsd, ftp://ftp.pdbj.org/
- No validation errors for more than 39,000 PDB file descriptions.
25. Examples for the atom coordinates
Full-tag description:
<PDBx:atom_siteCategory>
  <PDBx:atom_site id="1">
    <PDBx:group_PDB>ATOM</PDBx:group_PDB>
    <PDBx:type_symbol>N</PDBx:type_symbol>
    <PDBx:label_atom_id>N</PDBx:label_atom_id>
    <PDBx:label_comp_id>THR</PDBx:label_comp_id>
    <PDBx:label_asym_id>A</PDBx:label_asym_id>
    <PDBx:label_entity_id>1</PDBx:label_entity_id>
    <PDBx:label_seq_id>1</PDBx:label_seq_id>
    <PDBx:Cartn_x>17.047</PDBx:Cartn_x>
    <PDBx:Cartn_y>14.099</PDBx:Cartn_y>
    <PDBx:Cartn_z>3.625</PDBx:Cartn_z>
    <PDBx:occupancy>1.00</PDBx:occupancy>
    <PDBx:B_iso_or_equiv>13.79</PDBx:B_iso_or_equiv>
    <PDBx:auth_seq_id>1</PDBx:auth_seq_id>
    <PDBx:auth_comp_id>THR</PDBx:auth_comp_id>
    <PDBx:auth_asym_id>A</PDBx:auth_asym_id>
    <PDBx:auth_atom_id>N</PDBx:auth_atom_id>
    <PDBx:pdbx_PDB_model_num>1</PDBx:pdbx_PDB_model_num>
  </PDBx:atom_site>
</PDBx:atom_siteCategory>
Separated file for coordinates:
<atom_record id="1">ATOM 1 A A 1 1 ? . THR THR N N N 17.047 14.099 3.625 1.00 13.79</atom_record>
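A minimal sketch of how a client might read the full-tag form with the Python standard library. It assumes the `PDBx` prefix is bound to the schema URL quoted on the previous slide; real PDBML files may declare a different (versioned) namespace URI. The datablock name `EXAMPLE` and the shortened sample record are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the PDBx prefix; check the xmlns declaration of a real file.
NS = {"PDBx": "http://pdbml.pdb.org/schema/pdbx.xsd"}

SAMPLE = """<PDBx:datablock xmlns:PDBx="http://pdbml.pdb.org/schema/pdbx.xsd"
                datablockName="EXAMPLE">
  <PDBx:atom_siteCategory>
    <PDBx:atom_site id="1">
      <PDBx:type_symbol>N</PDBx:type_symbol>
      <PDBx:label_atom_id>N</PDBx:label_atom_id>
      <PDBx:label_comp_id>THR</PDBx:label_comp_id>
      <PDBx:Cartn_x>17.047</PDBx:Cartn_x>
      <PDBx:Cartn_y>14.099</PDBx:Cartn_y>
      <PDBx:Cartn_z>3.625</PDBx:Cartn_z>
    </PDBx:atom_site>
  </PDBx:atom_siteCategory>
</PDBx:datablock>"""


def read_atoms(xml_text):
    """Collect id, atom name, residue name, and coordinates of every atom_site."""
    root = ET.fromstring(xml_text)
    atoms = []
    for site in root.iterfind("PDBx:atom_siteCategory/PDBx:atom_site", NS):
        atoms.append({
            "id": int(site.get("id")),
            "atom": site.findtext("PDBx:label_atom_id", namespaces=NS),
            "residue": site.findtext("PDBx:label_comp_id", namespaces=NS),
            "xyz": tuple(
                float(site.findtext(f"PDBx:Cartn_{axis}", namespaces=NS))
                for axis in "xyz"
            ),
        })
    return atoms


atoms = read_atoms(SAMPLE)
```

Every value sits in its own named element, so a generic XML parser is enough and no column-position parsing of the flat PDB format is needed.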
26. Applications of PDBML
- Browser at PDBj with the native XML DB
- Database extension and annotation
- SOAP (Simple Object Access Protocol) services at PDBj
- Molecular graphics viewer (jV), which directly parses PDBML and displays the information written in XML
(Kinoshita and Nakamura (2004) Bioinformatics, 20, 1329-1330; Westbrook et al. (2005) Bioinformatics, 21, 988-992)
30. The "and" operator and wildcards are available.
35. Search for proteins that have one or more helices longer than 10 residues.
/datablock[struct_confCategory/struct_conf/pdbx_PDB_helix_length>10]/@datablockName
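The same selection can be sketched outside an XML database. The Python standard library's ElementTree supports only a subset of XPath without numeric comparisons, so this sketch walks the element path and applies the length test in Python. The wrapper element `entries`, the datablock names `AAAA` and `BBBB`, and the omission of namespace prefixes are simplifying assumptions.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<entries>
  <datablock datablockName="AAAA">
    <struct_confCategory>
      <struct_conf id="H1"><pdbx_PDB_helix_length>14</pdbx_PDB_helix_length></struct_conf>
      <struct_conf id="H2"><pdbx_PDB_helix_length>6</pdbx_PDB_helix_length></struct_conf>
    </struct_confCategory>
  </datablock>
  <datablock datablockName="BBBB">
    <struct_confCategory>
      <struct_conf id="H1"><pdbx_PDB_helix_length>8</pdbx_PDB_helix_length></struct_conf>
    </struct_confCategory>
  </datablock>
</entries>"""


def entries_with_long_helix(datablocks, min_length=10):
    """Return datablockName of every entry with at least one helix longer than min_length."""
    names = []
    for block in datablocks:
        lengths = [
            int(node.text)
            for node in block.iterfind(
                "struct_confCategory/struct_conf/pdbx_PDB_helix_length"
            )
        ]
        if any(length > min_length for length in lengths):
            names.append(block.get("datablockName"))
    return names


root = ET.fromstring(SAMPLE)
result = entries_with_long_helix(root.iterfind("datablock"))
```

A native XML database evaluates the predicate on the server, so only the matching entry names travel back to the client.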
37. xPSSS new facility: both XQuery and XPath are available.
41. Queries by XQuery and XPath at our new xPSSS
42. PDBjViewer or jV
- Offers interactive molecular visualization with RasMol-type commands (source code is free).
- Can be used both as a stand-alone program and as a Java applet.
- Any polygons defined in XML can be displayed and manipulated.
- Can parse PDBML files and display the molecules.
- http://www.pdbj.org/PDBjViewer/
43. xPSSS Database System
[Figure: xPSSS data flow. Components shown: archive (RCSB-PDB / MSD-EBI / PDBj) reached over the Internet by FTP download; downloader; PDBML; XSLT processor (add information); PDBMLplus; filtering and reconstructing; PDBMLplusF; loader; native XML-DB; Web server and FTP server. Added data: function/source information, CATRES data, annotation data. Get/input tools draw on DDBJ, Swiss-Prot/UniProt, PIR, GenBank, KEGG, GDB, ProTherm, EzCatDB, EBI/CSA, CATRES, PDF files of the primary citations, and manual input from the literature.]
44. Development of Secondary Databases (data mining)
- Protein Dynamics Database, ProMode (Wako and Endo)
- Protein Molecular Surface Database, eF-site (Kinoshita and Nakamura)
- Alignment of Structural Homologues, ASH (Toh)
- Encyclopedia of Protein Structures, eProtS (Ito and Nakamura)
- Sequence Navigator and Structure Navigator (Standley)
45. Superfolds of Proteins
46. Identify Protein Function from
- Sequence similarity
- Fold similarity
47. Compare Protein folds/architectures
48. DNA-binding domain of Bovine Papilloma Virus-1 E2 (2BOP)
RNA-binding domain of the U1A spliceosomal protein (1URN)
49. Map of all protein folds
498 SCOP domains
Dali (distance matrix alignment) http://www.ebi.ac.uk/dali/
Hou et al. (2003) Proc. Natl. Acad. Sci. USA, 100, 2386-2390.
50. Structure Navigator (http://www.pdbj.org/strucnavi/)
- Rapidly generates structure neighbors for any PDB ID
- Both HTTP and SOAP interfaces are available
- Alignments are displayed in a readable format
- 3D coordinates can be viewed or downloaded
Structural similarity is defined by the Number of Equivalent Residues (NER) [1].
Standley, D.M. et al. (2005)
[1] Standley D, Toh H, Nakamura H. Proteins 2004; 57: 381-391.
51. Structure Navigator Real-Time Server
Standley, D.M. et al. (2005)
52. Real-Time Query Strategy
Standley, D.M. (2006)
53. Structure Navigator-RT can reply to queries quickly, but ...
- If we open the Web service of Structure Navigator-RT and accept many requests, our own computing resources may not be sufficient.
- We want to use large computing resources located elsewhere.
54. Grid Web Service using Opal
Opal is a toolkit for wrapping application
programs as Web services, providing scheduling,
Grid security, and data management.
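The general idea of wrapping a command-line program behind launch, status, and output operations can be sketched locally as follows. This is a toy stand-in and not Opal's API: the class and method names are invented for illustration, and scheduling, Grid security, data management, and the SOAP transport are all left out. The wrapped "application" here is just the Python interpreter printing a line.

```python
import subprocess
import sys
import uuid


class JobService:
    """Toy service that exposes a command-line program as launch/status/output calls."""

    def __init__(self, command_prefix):
        self.command_prefix = list(command_prefix)
        self.jobs = {}     # job id -> running or finished process
        self.outputs = {}  # job id -> captured standard output

    def launch_job(self, args):
        """Start the wrapped program with the given arguments and return a job id."""
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = subprocess.Popen(
            self.command_prefix + list(args),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
        return job_id

    def query_status(self, job_id):
        """Report RUNNING, DONE, or FAILED without blocking."""
        code = self.jobs[job_id].poll()
        if code is None:
            return "RUNNING"
        return "DONE" if code == 0 else "FAILED"

    def get_output(self, job_id):
        """Wait for the job to finish and return its standard output."""
        if job_id not in self.outputs:
            out, _err = self.jobs[job_id].communicate()
            self.outputs[job_id] = out
        return self.outputs[job_id]


# Hypothetical usage: the "application" is a one-line Python program.
service = JobService([sys.executable, "-c"])
job = service.launch_job(["print('fold search finished')"])
output = service.get_output(job)
status = service.query_status(job)
```

Separating launch from output retrieval lets a client submit a long-running search and poll for its result instead of holding a connection open.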
55. Grid Web Service using Opal
56. Opal Operation Provider, by W.W. Li and K. Ichikawa (under the PRIUS project)
- Opal is deployed as one of the Operation Providers of Globus Toolkit 4 (GT4): the Opal Operation Provider (Opal-OP).
[Figure: an application service (WSRF) on Globus Toolkit 4 (GT4) combines the Opal OP Toolkit (Opal OP, wrapping Opal) with operation providers for WSRF (GetRPProvider, SetRPProvider, QueryRPProvider) and operation providers for notification (SubscribeProvider, NotificationConsumer Provider).]
57. System Architecture
- End user: sends the request query.
- Portal server at IPR, Osaka Univ. (Suita): Apache Tomcat (JSP for the input page, servlet for the Opal-OP client program), GT4 library, Operation Provider library, Operation Provider Service library. Submits the job to the computing server.
- Computing server at CMC, Osaka Univ. (Toyonaka): PC cluster of 16 nodes.
  - Head node (1 PC node): Tomcat, Opal Operation Provider Service, Pre-WS GRAM, GT4, OpenPBS (launches pbs_server, pbs_sched, pbs_mom). Handles job scheduling and runs the application program.
  - 15 worker nodes: OpenPBS (launches pbs_mom). Run the application program.
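For orientation, the kind of batch script that a PBS-style local scheduler accepts can be generated as below. In the system on this slide the job reaches OpenPBS through GRAM rather than through a hand-written script, so this is only a sketch of the scheduler's input; the job name `strucnavi_1crn` and the executable `./structure_navigator_rt` are hypothetical.

```python
def make_pbs_script(job_name, command, nodes=1, ppn=1):
    """Build the text of a PBS batch script for one application run."""
    lines = [
        "#!/bin/sh",
        f"#PBS -N {job_name}",               # job name shown in the queue
        f"#PBS -l nodes={nodes}:ppn={ppn}",  # requested nodes and processors per node
        "#PBS -j oe",                        # merge standard error into standard output
        "cd $PBS_O_WORKDIR",                 # start in the directory of submission
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical application command for a fold query with PDB ID 1crn.
script = make_pbs_script("strucnavi_1crn", "./structure_navigator_rt 1crn")
```

The resulting text would normally be saved to a file and handed to `qsub`; the scheduler on the head node then chooses a worker node to run the command.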
58. Structure Navigator-RT with Opal-OP on Grid
Query: 3D fold (1crn)
65. Blue: query structure. Red: second nearest cluster to the query.
66. Conclusion
We implemented the Opal Operation Provider to run Structure Navigator-RT as a Grid Web Service, using the local scheduler, OpenPBS, to run our application program on the PC cluster at the Cyber-Media Center, Osaka University, through the network inside Osaka University.
68. eF-site: database for ligand recognition sites (www.pdbj.org/eF-site/)
Kinoshita, Furui, Nakamura (2002) J. Struct. Funct. Genomics 2, 9-22.
Kinoshita, Nakamura (2004) Bioinformatics 20, 1329-1330.
69. Function identification of hypothetical proteins
Kinoshita and Nakamura (2003) Protein Science 12, 1589-1595.
Kinoshita and Nakamura (2005) Protein Science 14, 711-718.
[Figure: search results against eF-site/ActiveSite and eF-site/mono-nucleotide (1684 entries) for the query MJ0226 (free form); proteins shown: folylpolyglutamate synthetase, pyruvate kinase.]
70. eF-seek: query service for searching similar molecular surfaces (beta version)
ef-site.hgc.jp/eF-seek/
Kinoshita, K. and Nakamura, H. (2006)
71. Application of Grid to Web Service of PDBj
Query for analog structural data: protein folds and molecular surfaces
[Figure: for a new structure, a query for molecular surface looks for similar surfaces in the eF-site database (eF-seek), and a query for fold looks for similar folds in the fold database (Structure Navigator Real-Time server). Each search returns a result, with quick response by using Grid computing.]
72. Integration of computing systems and database query systems at PDBj
[Figure: a personal user runs a portal/client program that communicates using SOAP with a distributed computing and DB system across site-A, site-B, and site-C (GT, XMLDB; Service A, Service B, Service C). Grid computing services with Opal-OP serve the analog query search; database query services with XQuery/XPath serve the text query search.]
73. Acknowledgements
- Reiko Yamashita, Daron M. Standley (BIRD-JST / Inst. Protein Res., Osaka Univ.)
- Kohei Ichikawa, Susumu Date, Shinji Shimojo (Cyber Media Center, Osaka Univ.)
- Wilfred W. Li, Peter Arzberger (UCSD)
- Hiroyuki Toh (Medical Inst. Bioregulation, Kyushu Univ.)
- Kengo Kinoshita (Inst. Med. Science, Univ. Tokyo)
- Other PDBj members (BIRD-JST / Inst. Protein Res., Osaka Univ.)
- PRIUS