Title: Introduction of a Grid Approach in the Biotechnology Industry
1. Introduction of a Grid Approach in the Biotechnology Industry
- Grid Day
- University of Cyprus, Nicosia, March 26, 2003
2. BioGrid team: Laboratory of Software Engineering and Internet Technologies, University of Cyprus
- Head of bioGrid team
- Prof. George Papadopoulos george_at_cs.ucy.ac.cy
- Research group
- Aristos Stavrou cs98sa2_at_cs.ucy.ac.cy
- Dr. Dimitrios Vogiatzis dimitrv_at_cs.ucy.ac.cy
3. BioGrid
- 2-year trial IST project
- Sep. 2002 to Sep. 2004
- Project leader: ZooRobotics
- Site: www.bio-grid.net
4. Scientific Objectives
- BioGrid is a trial IST project with the following objectives
- Development and integration of grid technologies so that
- Researchers obtain an efficient information output
- Three tools to be integrated
- PSIMAP, protein interaction discovery and visualisation
- Space Explorer, gene and protein visualisation
- Classification Server, text data mining
- Tools access information resources
- Databases (protein, gene expression)
- Unstructured data (PubMed abstracts)
- Software tools (TOPS, for protein structural comparison)
5. Business Objectives
- Information Grid for large proteomics and genomics databases
- Efficient transnational enterprise collaboration
- Faster time to market for biotech innovations
- Software license model for bioGrid
- Targeted customers
- Pharmaceutical: Aventis, GSK, Novartis, AkzoNobel
- SMEs: KeyGene, Inpharmatica, LionBioScience, Avantium
6. Data sources and formats I
- Biological objects are complex, with no standard format (XML, ASN.1, proprietary)
- Literature is unstructured, so text data mining is necessary
- Protein structure
- Protein Data Bank (PDB)
- Structural classification of proteins (SCOP, CATH)
- Gene expression databases
- Stanford Microarray Database (SMD)
- Gene Expression Omnibus
- Biomolecular interaction database (ASN.1 format)
- PubMed (biology-oriented journal abstracts)
- TOPS (protein comparison tool)
7. Data sources and formats II
LITERATURE DATABASE FORMAT, XML BASED (PUBMED)
<PubmedArticle>
  <PMID>12649791</PMID>
  <Year>2003</Year>
  <Journal>
    <ISSN>0941-3790</ISSN>
  </Journal>
  <ArticleTitle>Nutritional ecology chances . services to shape procedures </ArticleTitle>
  <Abstract>
    <AbstractText>Nutrition ecology is the science that studies the impacts of human nutrition on nutrition ecology in their projects. </AbstractText>
  </Abstract>
</PubmedArticle>
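The record above can be read with any standard XML parser. The following is a minimal sketch in Python, assuming the simplified element layout shown on the slide (a real PubMed export nests these fields more deeply). The abstract text is shortened, the title is left out, and the function name `parse_pubmed_record` is a hypothetical helper, not part of any bioGrid tool.

```python
import xml.etree.ElementTree as ET

# Record in the simplified layout shown on the slide (abstract shortened).
RECORD = """
<PubmedArticle>
  <PMID>12649791</PMID>
  <Year>2003</Year>
  <Journal><ISSN>0941-3790</ISSN></Journal>
  <Abstract>
    <AbstractText>Nutrition ecology is the science that studies the impacts of human nutrition</AbstractText>
  </Abstract>
</PubmedArticle>
"""

def parse_pubmed_record(xml_text):
    """Return the fields of one record as a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "pmid": root.findtext("PMID"),
        "year": int(root.findtext("Year")),
        "issn": root.findtext("Journal/ISSN"),
        "abstract": root.findtext("Abstract/AbstractText", default="").strip(),
    }

record = parse_pubmed_record(RECORD)
print(record["pmid"], record["year"], record["issn"])
```

Because the fields are tagged, a text mining tool can pick out the abstract without knowing anything else about the record.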
PROTEIN DATABASE FORMAT (PDB)
Columns denote attributes, e.g. near the end of the file, cols 31-54 denote atomic coordinates (3D vector)
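As a sketch of what fixed-column parsing looks like in practice, the Python fragment below reads the coordinates from one atom record. In the PDB format, columns 31-54 hold three 8-character fields (x, y, z) on ATOM and HETATM lines. The sample values and the function name `atom_coordinates` are illustrative assumptions; the line is assembled field by field so that the column positions are explicit.

```python
# One illustrative ATOM record, built field by field to keep the columns exact.
ATOM_LINE = (
    "ATOM  "      # cols 1-6: record name
    "    1"       # cols 7-11: atom serial number
    " "           # col 12: unused
    " N  "        # cols 13-16: atom name
    " "           # col 17: alternate location
    "MET"         # cols 18-20: residue name
    " "           # col 21: unused
    "A"           # col 22: chain identifier
    "   1"        # cols 23-26: residue number
    "    "        # cols 27-30: insertion code and padding
    "  27.340"    # cols 31-38: x coordinate
    "  24.430"    # cols 39-46: y coordinate
    "   2.614"    # cols 47-54: z coordinate
)

def atom_coordinates(line):
    """Return (x, y, z) from an ATOM/HETATM record, or None for other records."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    # 1-based columns 31-54 are the 0-based slice [30:54]: three 8-character fields.
    return tuple(float(line[start:start + 8]) for start in (30, 38, 46))

print(atom_coordinates(ATOM_LINE))
```

Unlike the XML record, nothing in the line itself names the fields, so every consumer has to carry the column table with it.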
GENE EXPRESSION FORMAT (SMD)
8. Tool 1: Space Explorer
- High-dimensional data mapped into a 1-, 2- or 3-dimensional subspace
- Interactive, web-enabled, virtual reality environment
- 3D visualisation complemented by hierarchical clustering
- Subsequent visualisations as dendrograms, which are linked to the scatter plots
- Space Explorer facilitates visual data mining.
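The slides do not say which clustering method Space Explorer uses. Purely as an illustration of the hierarchical clustering step, the sketch below runs single-linkage agglomerative clustering with Euclidean distance on a few made-up 3D points and returns the merge steps from which a dendrogram could be drawn. The linkage rule, the distance measure, the data and the function name are all assumptions.

```python
import math
from itertools import combinations

def single_linkage(points):
    """Agglomerative clustering; returns the merge steps of the dendrogram.

    Each step is (members_a, members_b, distance), with members given as
    sorted tuples of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the two clusters whose closest members are nearest to each other.
        best = None
        for a, b in combinations(clusters, 2):
            d = min(math.dist(points[i], points[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges

# Hypothetical points, e.g. data already projected into 3 dimensions.
points = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0), (5.0, 6.0, 5.0)]
merges = single_linkage(points)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

The index tuples in each merge step are what would link a dendrogram node back to the corresponding points in a scatter plot.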
9. Tool 2: PSIMAP
- PSIMAP is the first complete protein structural domain interaction map
- Shows what kinds of protein domains are found to be interacting structurally.
- PSIMAP has specific shapes reflecting the types of protein domains and their interaction partners
10. Tool 3: Classification Server
- Custom hierarchy from a sample of documents, producing a global, consistent view
- Automatic classification of textual items into a hierarchy of topics
- Classifications are output in XML for flexible, standard data interchange
- APIs for easy integration with existing applications and new services
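The slides give neither the classification method nor the XML schema of the Classification Server. As a toy illustration of the idea only, the sketch below assigns a text to topics in a small hierarchy by keyword overlap and serialises the result as XML. The hierarchy, the keywords, the scoring rule and the element names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical topic hierarchy: topic path -> keywords that indicate the topic.
HIERARCHY = {
    "biology/proteins": {"protein", "domain", "structure"},
    "biology/genes": {"gene", "expression", "microarray"},
}

def classify(text):
    """Return the topic paths whose keywords occur in the text, best match first."""
    words = set(text.lower().split())
    scores = {topic: len(words & keywords) for topic, keywords in HIERARCHY.items()}
    return [t for t, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0]

def to_xml(doc_id, topics):
    """Serialise one classification result as XML (element names are assumptions)."""
    root = ET.Element("classification", {"document": doc_id})
    for topic in topics:
        ET.SubElement(root, "topic").text = topic
    return ET.tostring(root, encoding="unicode")

topics = classify("Gene expression measured by microarray for one protein")
print(to_xml("doc-1", topics))
```

A consuming application only needs an XML parser to read the result, which is the interchange benefit the slide refers to.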
11. Current situation in Biotechnology
- Genetic sequencing databases
- More than 10,000,000 in GenBank
- Protein databases: >1,000,000 (PIR)
- Tools: >500 on-line
- PubMed: 11,000,000 abstracts on-line
- There is no universal access to data objects
- Not possible to pass automatically from gene expression -> protein -> protein families -> visualisation
12. BioGrid project overview
13. Requirements for Grid I
Scenario: genes -> expression -> proteins -> relevant literature
- Unified view of the data objects (XML) -> unified interfaces for users and applications
- Co-operation between gene expression, protein interaction and connection to literature
- Passing data objects to the 3 tools
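One way to picture a unified XML view is a common envelope around records from different sources, so that each tool reads every object the same way. The sketch below is an assumption made for illustration: the `dataObject` and `field` element names, the identifiers and the helper functions are hypothetical and do not come from the bioGrid design.

```python
import xml.etree.ElementTree as ET

def to_data_object(kind, identifier, fields):
    """Wrap one source record in a common XML envelope (hypothetical schema)."""
    obj = ET.Element("dataObject", {"kind": kind, "id": identifier})
    for name, value in fields.items():
        ET.SubElement(obj, "field", {"name": name}).text = str(value)
    return obj

def read_data_object(obj):
    """What a tool sees: the same three parts regardless of the original source."""
    fields = {f.get("name"): f.text for f in obj.findall("field")}
    return obj.get("kind"), obj.get("id"), fields

# Hypothetical records following the scenario: gene, expression, protein, literature.
objects = [
    to_data_object("gene", "gene-1", {"symbol": "EXAMPLE1"}),
    to_data_object("expression", "expr-1", {"gene": "gene-1", "level": 2.5}),
    to_data_object("protein", "prot-1", {"gene": "gene-1"}),
    to_data_object("literature", "12649791", {"source": "PubMed"}),
]

for obj in objects:
    print(read_data_object(obj))
```

The shared `gene` field is what would let a tool follow the chain from an expression measurement to the protein without source-specific code.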
Lower level functionality
- Addition of new resources is transparent
- Continuous operation when adding data sources or in the event of component failure
- Functionality on different platforms (Unix, Windows)
- Possibility of local caching and net traffic regulation (to be determined in the evaluation phase)
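The lower-level requirements above (transparent addition of resources, continued operation on component failure, local caching) can be sketched as a small registry that callers address by resource name rather than by server. This is a hedged illustration, not the bioGrid implementation: the class `ResourceRegistry`, the provider functions and the identifier `1abc` are hypothetical, and the Grid technologies under evaluation may solve this quite differently.

```python
class ResourceRegistry:
    """Hypothetical registry: callers name a resource, not a concrete server."""

    def __init__(self):
        self._providers = {}   # resource name -> list of fetch functions
        self._cache = {}       # (resource name, key) -> cached result

    def register(self, name, fetch):
        """Add a provider at run time; existing callers need no change."""
        self._providers.setdefault(name, []).append(fetch)

    def get(self, name, key):
        """Return cached data if present, else try providers until one answers."""
        if (name, key) in self._cache:
            return self._cache[(name, key)]
        for fetch in self._providers.get(name, []):
            try:
                result = fetch(key)
            except ConnectionError:
                continue  # component failure: fall through to the next provider
            self._cache[(name, key)] = result
            return result
        raise LookupError(f"no provider could serve {name!r}")

def broken_mirror(key):
    """Stand-in for a data source that is currently unreachable."""
    raise ConnectionError("mirror is down")

def working_mirror(key):
    """Stand-in for a healthy data source."""
    return f"record for {key}"

registry = ResourceRegistry()
registry.register("protein", broken_mirror)
registry.register("protein", working_mirror)
print(registry.get("protein", "1abc"))
```

The first provider fails, the second answers, and a repeated request for the same key is served from the local cache without any network traffic.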
14. Requirements for Grid II
Requirements lead to the following Grid types
- Information grid: accessibility to sources of information and tools for analysis and visualisation
- Knowledge grid: data mining and machine learning for filtering literature abstracts.
15. Schematic design: integrating 3 platforms and data sources
[Diagram; labels: PSIMAP, Space Explorer, Classification Server, XML, GRID, Text Mining]
16. bioGrid: part of health Grids
- Health Grid projects
- MammoGrid (databases of mammograms)
- GEMSS (Grid Enabled Medical Simulation Services): access to advanced simulation and image processing services (www.gemss.de)
- BioMOBY (discovery and distribution of bio-data to web)
- Tambis: single user interface to bio-data
- SRS: access to unstructured data
- Grid technologies under investigation
- Globus
- Legion by Avaki
- SDSC Storage Resource Broker