Title: BioMart
1BioMart
- A Federated Query Architecture
Arek Kasprzyk, European Bioinformatics Institute,
29 July 2004
2BioMart
- A distributed data integration system based on
query-optimized relational views.
3Fixed schema transformation
4BIOMART
5BioMart
- Generic
- Universal BioMart data model
- Query-based interface
- No data dependent abstractions
- Network scalability
- Query optimised schema
- Platform portability
- Automatic, simple SQL
6Distributed integration
SOAP CORBA
7System overview
8BioMart
- A toolset for creating, maintaining and accessing
relational views on existing data sources.
- The views: BioMart databases (marts)
9Building BioMart databases
MartBuilder
MartEditor
10Building BioMart databases
- MartBuilder
- Transforms source schema into mart schema
- A set of DDL commands
- An automatic schema transformation
- Adapts to source schema changes
11MartBuilder
- Input
- central object
- database meta data
- cardinalities
- Output
- Set of DDL statements
- create table as select
- Transformations
- represented as asymmetric tree
- Secondary transformations built on top of
existing ones
12MartBuilder
DATASET: hsapiens_gene_ensembl
TYPE MAIN M DIMENSION D EXIT E: M
TABLE NAME: gene
gene alt_allele cardinality 11 n1 0n 1n SKIP S: S
gene gene cardinality 11 n1 0n 1n SKIP S: S
gene gene_description cardinality 11 n1 0n 1n SKIP S: 11
gene gene_stable_id cardinality 11 n1 0n 1n SKIP S: 11
gene kk__gene__main cardinality 11 n1 0n 1n SKIP S: S
gene transcript cardinality 11 n1 0n 1n SKIP S: S
gene analysis cardinality 11 n1 0n 1n SKIP S: n1
gene dna cardinality 11 n1 0n 1n SKIP S: S
gene dnac cardinality 11 n1 0n 1n SKIP S: S
gene seq_region cardinality 11 n1 0n 1n SKIP S: S
TYPE MAIN M DIMENSION D EXIT E: E
ADD EXTENSION hsapiens_gene_ensembl__gene__MAIN YN: N
CHANGE FINAL TABLE NAME hsapiens_gene_ensembl__gene__MAIN TO:

CREATE TABLE TEMP0 AS SELECT
  gene.gene_id, gene.type, gene.analysis_id,
  gene.seq_region_id, gene.seq_region_start,
  gene.seq_region_end, gene.seq_region_strand,
  gene.display_xref_id,
  gene_description.gene_id AS gene_id_TEMP0,
  gene_description.description
FROM gene, gene_description
WHERE gene_description.gene_id = gene.gene_id;

CREATE TABLE hsapiens_gene_ensembl__gene__MAIN AS SELECT
  TEMP0.gene_id, TEMP0.type, TEMP0.analysis_id,
  TEMP0.seq_region_id, TEMP0.seq_region_start,
  TEMP0.seq_region_end, TEMP0.seq_region_strand,
  TEMP0.display_xref_id, TEMP0.gene_id_TEMP0,
  TEMP0.description,
  gene_stable_id.gene_id AS gene_id_TEMP1,
  gene_stable_id.stable_id, gene_stable_id.version
FROM TEMP0, gene_stable_id
WHERE gene_stable_id.gene_id = TEMP0.gene_id;

DROP TABLE TEMP0;
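
The statements on this slide follow a regular pattern: each 1:1 join is materialised with "create table as select", intermediate results go into TEMPn tables, and the last step writes the final main table and drops the temporary one. The following is a minimal Python sketch of that pattern, not MartBuilder's actual code. The helper name `build_ddl`, the shortened column lists and the single toy row are assumptions made for illustration, and the statements are run on an in-memory SQLite database rather than MySQL or Oracle.

```python
import sqlite3


def build_ddl(central, central_cols, joins, key, final_name):
    """Return CREATE TABLE ... AS SELECT statements that fold each joined
    table into the central table, one step per join.

    joins: list of (table_name, [columns]) pairs, all joined 1:1 on `key`.
    Hypothetical helper written for this sketch.
    """
    statements = []
    source, cols = central, list(central_cols)
    for i, (table, table_cols) in enumerate(joins):
        last = i == len(joins) - 1
        target = final_name if last else f"TEMP{i}"
        select = [f"{source}.{c}" for c in cols]
        # Keep the joined table's key under a distinct alias, as on the slide.
        alias = f"{key}_TEMP{i}"
        select.append(f"{table}.{key} AS {alias}")
        select += [f"{table}.{c}" for c in table_cols]
        statements.append(
            f"CREATE TABLE {target} AS SELECT {', '.join(select)} "
            f"FROM {source}, {table} WHERE {table}.{key} = {source}.{key}"
        )
        # Once a temporary table has been consumed it can be dropped.
        if source != central:
            statements.append(f"DROP TABLE {source}")
        source = target
        cols = cols + [alias] + list(table_cols)
    return statements


# Toy source schema with one made-up row per table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (gene_id INTEGER, type TEXT);
    CREATE TABLE gene_description (gene_id INTEGER, description TEXT);
    CREATE TABLE gene_stable_id (gene_id INTEGER, stable_id TEXT, version INTEGER);
    INSERT INTO gene VALUES (1, 'protein_coding');
    INSERT INTO gene_description VALUES (1, 'example gene');
    INSERT INTO gene_stable_id VALUES (1, 'EXAMPLE0001', 1);
""")

ddl = build_ddl(
    central="gene",
    central_cols=["gene_id", "type"],
    joins=[("gene_description", ["description"]),
           ("gene_stable_id", ["stable_id", "version"])],
    key="gene_id",
    final_name="hsapiens_gene_ensembl__gene__MAIN",
)
for statement in ddl:
    conn.execute(statement)

rows = conn.execute(
    "SELECT gene_id, stable_id, description "
    "FROM hsapiens_gene_ensembl__gene__MAIN"
).fetchall()
print(rows)
```

With two joins the sketch emits three statements: the TEMP0 table, the final main table, and the drop of TEMP0, in the same order as on the slide.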
13Transformation configuration
satellog_repeats  M  repeats  disease        n1
satellog_repeats  M  repeats  gc             11
satellog_repeats  M  repeats  linkage_depth  S
satellog_repeats  M  repeats  repeats        S
satellog_repeats  M  repeats  transcripts    S
satellog_repeats  M  repeats  ugcount        S
satellog_repeats  M  repeats  ugstats        S
satellog_repeats  M  repeats  rep_class      n1
satellog_repeats  D  ugcount  ugcount        S
satellog_repeats  D  ugcount  ugstats        S
satellog_repeats  D  ugcount  gc             S
satellog_repeats  D  ugcount  repeats        n1r
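
Each entry in this configuration has five whitespace-separated fields. Reading them as dataset, table type (M for main, D for dimension), central table, linked table, and cardinality (with S meaning skip) is an interpretation based on the MartBuilder prompts on the previous slide, not a documented file format. Under that assumption, a short Python sketch that parses the entries and lists the tables that would actually be joined (`Link`, `parse_config` and `tables_to_join` are hypothetical names):

```python
from collections import namedtuple

# Field names are an assumed reading of the slide, not an official format.
Link = namedtuple("Link", "dataset table_type central linked cardinality")

CONFIG = """
satellog_repeats M repeats disease n1
satellog_repeats M repeats gc 11
satellog_repeats M repeats linkage_depth S
satellog_repeats M repeats repeats S
satellog_repeats M repeats transcripts S
satellog_repeats M repeats ugcount S
satellog_repeats M repeats ugstats S
satellog_repeats M repeats rep_class n1
satellog_repeats D ugcount ugcount S
satellog_repeats D ugcount ugstats S
satellog_repeats D ugcount gc S
satellog_repeats D ugcount repeats n1r
"""


def parse_config(text):
    """Split whitespace-separated text into five-field Link records."""
    tokens = text.split()
    if len(tokens) % 5:
        raise ValueError("expected five fields per entry")
    return [Link(*tokens[i:i + 5]) for i in range(0, len(tokens), 5)]


def tables_to_join(links, table_type):
    """Linked tables of the given type whose cardinality is not 'S' (skip)."""
    return [link.linked for link in links
            if link.table_type == table_type and link.cardinality != "S"]


links = parse_config(CONFIG)
main_joins = tables_to_join(links, "M")
dimension_joins = tables_to_join(links, "D")
print(main_joins, dimension_joins)
```

Because the parser works on tokens rather than lines, it accepts the entries whether they are laid out one per line or run together.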
14Reversed Star Schema
GENE CENTRAL
gene_id (PK), gene_stable_id, gene_chrom_start,
gene_chrom_end, chromosome, gene_display_id, band,
description, etc.
15Table naming convention
- Tables
- Meta tables: meta_content
- Data tables: dataset__content__type
- Data tables
- Main: __main
- Dimension: __dm
- Columns
- Key: _key
- Boolean filter: _bool
- List filter: _list
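
Because the convention is purely lexical, software can recognise the role of a table or column from its name alone. A minimal Python sketch of that idea follows. The function names are hypothetical, the case-insensitive handling of the type suffix is an assumption (the slides show both `__main` and `__MAIN`), and the column names in the usage lines are made up.

```python
def parse_table_name(name):
    """Split a data table name of the form dataset__content__type."""
    parts = name.split("__")
    if len(parts) != 3:
        raise ValueError(f"not a mart data table name: {name!r}")
    dataset, content, table_type = parts
    table_type = table_type.lower()  # assumption: suffix case is ignored
    if table_type not in ("main", "dm"):
        raise ValueError(f"unknown table type: {table_type!r}")
    return dataset, content, table_type


def column_role(column):
    """Classify a column by its suffix; anything else is left as 'other'."""
    suffixes = (("_key", "key"),
                ("_bool", "boolean filter"),
                ("_list", "list filter"))
    for suffix, role in suffixes:
        if column.endswith(suffix):
            return role
    return "other"


parsed = parse_table_name("hsapiens_gene_ensembl__gene__MAIN")
roles = [column_role(c) for c in ("gene_id_key", "disease_bool", "description")]
print(parsed, roles)
```

A meta table such as `meta_content` has no double underscore, so `parse_table_name` rejects it, which keeps meta tables and data tables apart.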
16Configuring BioMart databases
- MartEditor
- XML editor with built-in system logic
- Configure existing interfaces
- Automatically create new, naive configuration
- Automated detection of user-added tables
- Updates
17MartEditor
18XML-based Configuration
19BioMart: a generic system
- Key abstractions
- Dataset
- Attribute
- Filter
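
One way to picture the three abstractions: a dataset names the tables to query, attributes are the columns returned, and filters are the restrictions applied. The sketch below is an assumption-laden illustration in Python, not the BioMart API. The class and function names are hypothetical, the table holds two made-up rows, and only equality filters on a single main table are handled.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    main_table: str


@dataclass(frozen=True)
class Attribute:
    name: str     # label shown to the user
    column: str   # column in the main table


@dataclass(frozen=True)
class Filter:
    name: str
    column: str


def compile_query(dataset, attributes, filters):
    """Return (sql, params). `filters` is a list of (Filter, value) pairs,
    combined with AND and passed as bound parameters."""
    columns = ", ".join(a.column for a in attributes)
    sql = f"SELECT {columns} FROM {dataset.main_table}"
    params = []
    if filters:
        sql += " WHERE " + " AND ".join(f"{f.column} = ?" for f, _ in filters)
        params = [value for _, value in filters]
    return sql, params


# Toy main table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hsapiens_gene_ensembl__gene__main (
        gene_id_key INTEGER, gene_stable_id TEXT,
        chromosome TEXT, description TEXT);
    INSERT INTO hsapiens_gene_ensembl__gene__main
        VALUES (1, 'EXAMPLE0001', '1', 'toy gene on chromosome 1');
    INSERT INTO hsapiens_gene_ensembl__gene__main
        VALUES (2, 'EXAMPLE0002', '2', 'toy gene on chromosome 2');
""")

dataset = Dataset("hsapiens_gene_ensembl", "hsapiens_gene_ensembl__gene__main")
attributes = [Attribute("Stable ID", "gene_stable_id"),
              Attribute("Description", "description")]
chromosome = Filter("Chromosome", "chromosome")

sql, params = compile_query(dataset, attributes, [(chromosome, "1")])
rows = conn.execute(sql, params).fetchall()
print(sql, rows)
```

The user-facing side only ever sees the three abstractions; the table and column names stay in the configuration.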
20Use cases
- Upstream sequences
- for all kinases
- up-regulated in brain and associated with
known diseases
- Name, chromosome position, description
- of all genes
- located on chromosome 1, expressed in lung,
associated with mouse homologues and
non-synonymous SNP changes
21Key Abstractions
22Mart Query Language (MQL)
23Data access
- MartView (Web)
- MartExplorer (GUI)
- MartShell (Text)
- MartLib (API)
24MartView
25MartShell
26MartExplorer
27Distributed Architecture
28Query-chaining
using Dataset1 get Attribute1 where Filter1=var1
as q
using Dataset2 get Attribute2 where Filter2=var2
and filter3 in q
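
The chained query can be read as two steps: the first query runs against one dataset and its result is named q; the second query runs against another dataset and uses q as the value list of one of its filters. A minimal Python sketch of that reading follows, with two in-memory SQLite databases standing in for two separate marts. The function `run_chained`, the `{q}` marker in the SQL template and all rows are assumptions for illustration; this is not how MartShell executes MQL.

```python
import sqlite3


def run_chained(conn1, sql1, params1, conn2, sql2_template, params2):
    """Run sql1 on conn1, then feed its single-column result into the
    IN (...) list of the second query on conn2.

    `{q}` in sql2_template marks where the list goes; any placeholders
    for params2 must appear before it.
    """
    q = [row[0] for row in conn1.execute(sql1, params1)]
    if not q:
        return []  # nothing to chain on
    placeholders = ", ".join("?" for _ in q)
    sql2 = sql2_template.format(q=placeholders)
    return conn2.execute(sql2, list(params2) + q).fetchall()


# Two independent toy marts with made-up rows.
mart1 = sqlite3.connect(":memory:")
mart1.executescript("""
    CREATE TABLE dataset1 (attribute1 TEXT, filter1 TEXT);
    INSERT INTO dataset1 VALUES ('a', 'x');
    INSERT INTO dataset1 VALUES ('b', 'x');
    INSERT INTO dataset1 VALUES ('c', 'y');
""")
mart2 = sqlite3.connect(":memory:")
mart2.executescript("""
    CREATE TABLE dataset2 (attribute2 TEXT, filter2 TEXT, filter3 TEXT);
    INSERT INTO dataset2 VALUES ('r1', 'k', 'a');
    INSERT INTO dataset2 VALUES ('r2', 'k', 'c');
    INSERT INTO dataset2 VALUES ('r3', 'j', 'b');
    INSERT INTO dataset2 VALUES ('r4', 'k', 'b');
""")

result = run_chained(
    mart1, "SELECT attribute1 FROM dataset1 WHERE filter1 = ?", ["x"],
    mart2,
    "SELECT attribute2 FROM dataset2 "
    "WHERE filter2 = ? AND filter3 IN ({q}) ORDER BY attribute2",
    ["k"],
)
print(result)
```

Only the list of values crosses from one database to the other, so the two marts need no shared schema beyond a common identifier.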
29BioMart: A Distributed Architecture
30BioMart: User Perspective
31Requirements
- Mart-spec database
- Mart-compatible star schema
- Table naming convention (dataset__content__type)
- XML configuration file
- RDBMS server outside firewall
32What Do You Get?
- Flexible interfaces configurable according to
your spec
- Performance-assured data retrieval
- Query chaining across data sources
- Administrator tools for modifying and deploying
the system
34Current status
- BioMart 1 (EnsMart) web
- Universal data model
- Network scalability
- Universal applicability
- Automated SQL
- Platform portability
- File-based query chaining
35Current status
- BioMart 2 (MartJ) standalone interfaces
- XML-based configuration
- MQL
- Basic support for single attribute query-chaining
36Current status
- BioMart 3 web
- Plug in architecture
- Multi-attribute chaining
- Implicit chaining
- Multiple schemas/Query Compilers
- MartBuilder
37BioMart: an Open Project
- All code and data freely available
- Website
- www.ebi.ac.uk/biomart
- www.ebi.ac.uk/biomart/martview
- Public MySQL server
- martdb.ebi.ac.uk
- Ftp
- ftp.ebi.ac.uk
- Mailing lists
- mart-dev
- mart-announce
38Summary
- If you need
- Scalable and flexible search interfaces for an
existing database
- Single integrated search interface to many
in-house databases
- To connect your databases to other databases on
the internet
- BioMart
39Credits
- Damian Keefe
- Damian Smedley
- Craig Melsopp
- Darin London
- Katerina Tzouvara
- Will Spooner
- Andreas Kahari
41Changing Research Focus
- The increase in high-throughput technologies
- Growing sophistication of the user
- Research questions involving big datasets
- Multispecies
- Multiexperiments
- Multidatasets
- Data sources distributed
42Use cases
- Upstream sequences for all kinases upregulated in
brain and associated with known diseases
- Name, chromosome position, description of all
genes located on chromosome 1, expressed in
lung, associated with mouse homologues, and
non-synonymous SNP changes
43Solutions
- Bioinformatics support
- Processing data files
- Use third party software
- In house processing
- No bioinformatics?
- One-stop shop for biological data
45CORBA SOAP
46 A Container Ship
47Future
48July
- Alpha release of the BioMart suite
- Specification
- Schema naming convention
- DTD for XML config
- Administration Tools
- Configure
- Data access (Perl/Java)
- Lib
- Interfaces
- Tested on MySQL 4/Oracle 9i mixture
49After July
- MartBuilder
- Automatically build marts from existing 3NF with
predefined PK/FK
- Fixed schema data transformation function
- SQL collection
- Collaboration
- Laboratory for Foundations of Computer Science
- Bell Labs
50BioMart
51Biological databases
- Update oriented
- Complex, normalised schemas
- Sophisticated queries involve multiple joins
- Difficult and slow
52Query optimised
- Data mart
- Few joins
- Duplicated data
- Denormalised
- Efficient and scalable for large and
sophisticated queries
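
The trade-off can be shown on a toy example: in the normalised source a chromosome name is stored once and reached through a join, while in the mart table it is copied onto every gene row so the same question needs no join. The Python sketch below uses an in-memory SQLite database. All table names, column names and rows are made up for illustration and are not the Ensembl or BioMart schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalised source: each chromosome name is stored once.
    CREATE TABLE chromosome (chrom_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, label TEXT,
                       chrom_id INTEGER);
    INSERT INTO chromosome VALUES (1, '1');
    INSERT INTO chromosome VALUES (2, 'X');
    INSERT INTO gene VALUES (10, 'geneA', 1);
    INSERT INTO gene VALUES (11, 'geneB', 1);
    INSERT INTO gene VALUES (12, 'geneC', 2);

    -- Denormalised mart table: the name is duplicated on every gene row.
    CREATE TABLE gene_mart AS
        SELECT gene.gene_id, gene.label, chromosome.name AS chromosome
        FROM gene JOIN chromosome ON chromosome.chrom_id = gene.chrom_id;
""")

# Same question, asked of the source schema (one join) ...
normalised = conn.execute(
    "SELECT gene.label FROM gene "
    "JOIN chromosome ON chromosome.chrom_id = gene.chrom_id "
    "WHERE chromosome.name = ? ORDER BY gene.label", ("1",)
).fetchall()

# ... and of the mart table (no join).
mart = conn.execute(
    "SELECT label FROM gene_mart WHERE chromosome = ? ORDER BY label", ("1",)
).fetchall()

# How many rows carry a copy of the chromosome name '1'.
copies = conn.execute(
    "SELECT COUNT(*) FROM gene_mart WHERE chromosome = ?", ("1",)
).fetchone()[0]
print(normalised, mart, copies)
```

Both queries return the same genes; the cost of the join-free form is the duplicated chromosome value and the need to rebuild the mart table when the source changes.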
53Distributed Model Benefits
- Each group retains full control over their data
source
- Data content
- Data updates
- Data presentation (interface)
- Deployment platform
- Security
54BioMart
- Building marts
- MartBuilder - schema transformation
- MartEditor - configuration editor
- Accessing marts
- MartShell - command-line interface
- MartExplorer - GUI
- MartView - website