BioWarehouse: A Bioinformatics Database Warehouse

About This Presentation

Title:

BioWarehouse: A Bioinformatics Database Warehouse

Description:

A Bioinformatics Database Warehouse Peter D. Karp, Thomas J. Lee, Valerie Wagner BioCyc BioPAX BioCyc ENZYME CMR Genbank BioWarehouse Eco2DBase Oracle (10g) or – PowerPoint PPT presentation

Number of Views:193

Avg rating:3.0/5.0

Slides: 32

Provided by: sric7

Category:

more less

Transcript and Presenter's Notes

Title: BioWarehouse: A Bioinformatics Database Warehouse

1
BioWarehouse A Bioinformatics Database Warehouse

Peter D. Karp, Thomas J. Lee, Valerie Wagner

2
Overview

Motivations for BioWarehouse
Facile programmatic access to individual DBs
Capture locally produced data
Database integration
BioWarehouse technical approach
Loaders
Schema overview
Applications of BioWarehouse
Join the BioWarehouse project

3
Motivations Computing with Individual Databases

Most bioinformatics DBs are not queryable via a
database management system
Via Internet or locally installable
Having relational database versions of individual
bioinformatics DBs facilitates complex queries
against individual DBs
What is the alternative? Perl scripts? Awkward
to program, slow to execute

4
MotivationsManage/Integrate Locally Produced
Data

Need schema to capture locally produced data
Integrate locally produced data with public
databases

5
Why is the Multidatabase Approach Alone Not
Sufficient?

Multidatabase query approaches assume databases
are in a queryable DBMS
Most sites that do operate DBMSs do not allow
remote query access because of security and
loading concerns
Users want to control data stability
Users want to control speed of their queries
Multidatabase query systems limited by Internet
bandwidth and by the speed of the slowest data
source that they query
Users need to capture, integrate and publish
locally produced data of different types
Multidatabase and Warehouse approaches
complementary

6
Key Challenges / Results for BioWarehouse

Design schema that accurately captures the
contents of source DBs
Design schema that is understandable and scalable
Address poorly-specified syntax semantics of
source DBs
Balancing the preservation of source data with
mapping into common semantics
Clearly document data mappings performed by
loaders

7
Technical Approach

Multi-platform support Oracle (10G) and MySQL
Schema support for multitude of bioinformatics
datatypes
Create loaders for public bioinformatics DBs
Parse file format of the source DB
Some loaders parse interchange formats (BioPAX)
Semantic transformations
Insert DB contents into warehouse tables

BMC Bioinformatics 7170 2006 http//bioinformatic
s.ai.sri.com/biowarehouse/
8
Technical Approach

Provide Warehouse query access mechanisms
SQL queries via ODBC, JDBC, OAA
High quality documentation for schema and loader
transformations
No graphical query interface yet

9
How to Use BioWarehouse?

Create your own local instance of BioWarehouse
Query an existing BioWarehouse instance, such as
publichouse

10
PublicHouse Server

Publicly queryable BioWarehouse server operated
by SRI
Manages a set of biological DBs constructed using
BioWarehouse
CMR
BioCyc Pathway/Genome DBs
ENZYME
NCBI Taxonomy
Will be transitioning publichouse to contain
BioCyc
E. coli gene expression, proteomics, and
ChIP-chip datasets
See http//bioinformatics.ai.sri.com/biowarehous
e/publichouse.html
Note publichouse will become a BioCyc/EcoliHub
BioWarehouse server

Host publichouse.sri.com Port 3306 Database
biospice
11
BioWarehouse Schema

Manages many bioinformatics datatypes
simultaneously
Pathways, Reactions, Chemicals
Proteins, Genes, Replicons
Sequences, Sequence Features
Gene expression data
Protein expression data
Flow cytometry data
Organisms, Taxonomic relationships
Computations (sequence matches)
Citations, Controlled vocabularies
Links to external databases
Each type of warehouse object implemented through
one or more relational tables

12
BioWarehouse Schema

Manages multiple datasets simultaneously
Dataset Single version of a database
Version comparison
Multiple software tools or experiments that
require access to different versions
Each dataset is a warehouse entity
Every warehouse object is registered in a dataset

13
BioWarehouse Schema

Different databases storing the same biological
datatypes are coerced into same warehouse tables
Design of most datatypes inspired by multiple
databases
Representational tricks to decrease schema bloat
Single space of primary keys
Single set of satellite tables such as for
synonyms, citations, comments, etc.
Schema size
Core schema 70 tables
Gene expression schema 109 tables

14
BioWarehouse Loaders
Database Loader Language Input Format Comments
BioCyc C BioCyc attribute-value Pathway/Genome Databases
BioPAX Java BioPAX format Protein interactions data
CMR C CMR column-delimited Comprehensive Microbial Resource 350 microbial genomes
Eco2Dbase Java Relational table dumps E. coli 2-D gel data
ENZYME Java ENZYME attribute-value Enzyme Commission set of reactions
Genbank Java XML derived from ASN.1 Bacterial subset of Genbank
Gene Ontology Java OBO XML Hierarchical controlled vocabulary
KEGG C KEGG format Metabolic pathway data
MAGE-ML Java MAGE-ML format Microarray gene expression data
NCBI Taxonomy C Taxonomy format Organism taxonomy
UniProt Java UniProt XML SWISS-PROT and TrEMBL
15
BioWarehouse Schema Overview

Schema manages many bioinformatics datatypes
including links to external databases
Main biological objects

Each type of warehouse object implemented through
one or more relational tables (70)

16
Pathway Data

BioCyc
KEGG
BioPAX format
Physical interaction data only
ENZYME
Populates these tables Reaction, Protein,
Chemical

17
Pathway Schema Neighborhood
Pathway
Product
PathwayReaction
Chemical
Reaction
Substrate
EnzymaticReaction
Protein
18
Pathway Data BioCyc Loader

Each BioCyc DB can be loaded into separate
BioWarehouse dataset, or one common dataset
Loads data from 13 BioCyc source files
pubs.dat not present for all BioCyc PDDBs
compounds.dat
proteins.dat
protseq.fasta not present for all BioCyc PGDBs
transunits.dat not present for all BioCyc PGDBs
genes.dat
promoters.dat not present for the MetaCyc PGDB
terminators.dat not present for the MetaCyc
PGDB
dnabindsites.dat not present for the MetaCyc
PGDB
reactions.dat
enzrxns.dat
regulation.dat
pathways.dat
http//biowarehouse.ai.sri.com/repos/biocyc-loader
/flatfile/doc/index.html

19
BioCyc Loader Chemical Compounds
20
BioCyc Loader Reactions
21
BioCyc Loader Products
22
(No Transcript)
23
Comparative Analysis with BioWarehouseCompare
MetaCyc to KEGG

KEGG pathways are larger than MetaCyc pathways
MetaCyc has a larger number of pathways
Which database has a larger collection of pathway
data?
Prior result KEGG pathways are on average 4.2
times larger than MetaCyc pathways

The outcomes of pathway database computations
depend on pathway ontology Green and Karp,
Nucleic Acids Research 200634 3687-97
24
MetaCyc contains 5.1 times as many pathways as
does KEGG
25
MetaCyc contains 1.4 times as many reactions
within its pathways as does KEGG
26
Gene Expression Data inBioWarehouse

Goals
Experimentalist loads locally produced data into
BioWarehouse
Computational biologist loads remotely downloaded
data into BioWarehouse
For processing and/or integration with other data
Source data format MAGE-ML
http//mged.org/
BioWarehouse and ArrayExpress are only MAGE-ML
compliant data models we could find

27
Our Approach

Translate MAGE-OM into a relational database
schema
One class ? One table gives too large a schema
(ArrayExpress)
Instead, one table per inheritance hierarchy
reduces table count by half
Use MAGE SDK tool for XML-gtObject use Castor for
Object-gtRelational mapping
Merge the resulting schema into BioWarehouse
schema to eliminate redundancy
Result 109 tables

28
ChIP-Chip Data

Current project to extend MAGE-ML loader and
BioWarehouse to accommodate ChIP-chip data
Meta data, gene expression data, transcription
factor(s), antibody(s)

29
Protein Interactions Data

Schema support
Load via BioPAX

30
Contribute to BioWarehouse

Open Source project
Ways to contribute
Maintain/update an existing loader
Implement a new loader
Port to new compiler or platform or DBMS
support_at_biowarehouse.org

31
Acknowledgments

Funded by
NIH/NIGMS EcoliHub project
NIH/NIGMS BioCyc project
DARPA Bio-SPICE program
SRI Colleagues
Valerie Wagner, Tom Lee, Tomer Altman
Learn more
http//bioinformatics.ai.sri.com/biowarehouse/
BMC Bioinformatics 7170 2006

Write a Comment

User Comments (0)

About PowerShow.com

BioWarehouse: A Bioinformatics Database Warehouse - PowerPoint PPT Presentation

BioWarehouse: A Bioinformatics Database Warehouse

A Bioinformatics Database Warehouse Peter D. Karp, Thomas J. Lee, Valerie Wagner BioCyc BioPAX BioCyc ENZYME CMR Genbank BioWarehouse Eco2DBase Oracle (10g) or – PowerPoint PPT presentation