Title: BioMart Query Network
1BioMart Query Network
Arek Kasprzyk, European Bioinformatics Institute
8 January 2005
2Biological databases
- Distributed
- Different format
- Different focus
- Different release schedule
- Scalability factor
4BioMart
6MartView
7BioMart@Ensembl
8MartShell
9MartExplorer
10Database
11Schema
12Schema
[Diagram: schema tables connected through primary key (PK) and foreign key (FK) columns]
13Schema
[Diagram: schema tables with foreign key (FK) columns]
14Schema - reversed star
[Diagram: reversed star schema. Labels: main1, dm, PK1, PK2, FK1, FK2]
15Fixed schema transformation
16Schema transformation
- Central table
- Longest n:1, 1:1 path
- Dimension table
- Central transformation around a 1:n table
- Link tables are first decomposed into a set of 1:n relations
17MartBuilder
- Input
- central object
- database meta data
- cardinalities
- Output
- Set of SQL statements
- create table as select
- Transformations
- represented as asymmetric tree
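The MartBuilder output listed above, a set of CREATE TABLE ... AS SELECT statements, could be generated along these lines. This is a minimal sketch and not MartBuilder's actual code: the function name `build_statements`, its arguments, the `TEMP<n>` naming, and the assumption that every merged table shares one join column are all hypothetical and chosen only for illustration.

```python
def build_statements(central, central_cols, key, merged, final_name):
    """Return SQL statements that fold each table in `merged` into `central`.

    Hypothetical helper: `merged` is a list of (table, columns) pairs and
    every table is assumed to be joined on the same column `key`.
    """
    statements = []
    source, source_cols = central, list(central_cols)
    for i, (table, cols) in enumerate(merged):
        # The last join writes the final table, earlier joins write TEMP tables.
        target = final_name if i == len(merged) - 1 else f"TEMP{i}"
        alias = f"{key}_TEMP{i}"  # avoids a duplicate key column name
        select = [f"{source}.{c}" for c in source_cols]
        select.append(f"{table}.{key} AS {alias}")
        select += [f"{table}.{c}" for c in cols]
        statements.append(
            f"CREATE TABLE {target} AS SELECT {','.join(select)} "
            f"FROM {source}, {table} WHERE {table}.{key} = {source}.{key}"
        )
        if source != central:
            # Intermediate tables are dropped once they have been consumed.
            statements.append(f"DROP TABLE {source}")
        source = target
        source_cols = source_cols + [alias] + list(cols)
    return statements


statements = build_statements(
    central="gene",
    central_cols=["gene_id", "type"],
    key="gene_id",
    merged=[
        ("gene_description", ["description"]),
        ("gene_stable_id", ["stable_id", "version"]),
    ],
    final_name="hsapiens_gene_ensembl__gene__MAIN",
)
for statement in statements:
    print(statement)
```

Chaining through temporary tables keeps each statement a simple two-table join, at the cost of writing intermediate data that is dropped afterwards.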
18MartBuilder
DATASET hsapiens_gene_ensembl
TYPE MAIN M DIMENSION D EXIT E: M
TABLE NAME: gene
gene alt_allele cardinality 11 n1 0n 1n SKIP S: S
gene gene cardinality 11 n1 0n 1n SKIP S: S
gene gene_description cardinality 11 n1 0n 1n SKIP S: 11
gene gene_stable_id cardinality 11 n1 0n 1n SKIP S: 11
gene kk__gene__main cardinality 11 n1 0n 1n SKIP S: S
gene transcript cardinality 11 n1 0n 1n SKIP S: S
gene analysis cardinality 11 n1 0n 1n SKIP S: n1
gene dna cardinality 11 n1 0n 1n SKIP S: S
gene dnac cardinality 11 n1 0n 1n SKIP S: S
gene seq_region cardinality 11 n1 0n 1n SKIP S: S
TYPE MAIN M DIMENSION D EXIT E: E
ADD EXTENSION hsapiens_gene_ensembl__gene__MAIN YN: N
CHANGE FINAL TABLE NAME hsapiens_gene_ensembl__gene__MAIN TO:

CREATE TABLE TEMP0 as SELECT
  gene.gene_id, gene.type, gene.analysis_id, gene.seq_region_id,
  gene.seq_region_start, gene.seq_region_end, gene.seq_region_strand,
  gene.display_xref_id, gene_description.gene_id AS gene_id_TEMP0,
  gene_description.description
FROM gene, gene_description
WHERE gene_description.gene_id = gene.gene_id

CREATE TABLE hsapiens_gene_ensembl__gene__MAIN as SELECT
  TEMP0.gene_id, TEMP0.type, TEMP0.analysis_id, TEMP0.seq_region_id,
  TEMP0.seq_region_start, TEMP0.seq_region_end, TEMP0.seq_region_strand,
  TEMP0.display_xref_id, TEMP0.gene_id_TEMP0, TEMP0.description,
  gene_stable_id.gene_id AS gene_id_TEMP1, gene_stable_id.stable_id,
  gene_stable_id.version
FROM TEMP0, gene_stable_id
WHERE gene_stable_id.gene_id = TEMP0.gene_id

drop table TEMP0
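To see what the CREATE TABLE ... AS SELECT chain above produces, the following sketch runs a cut-down version of it against an in-memory SQLite database. Everything about the setup is an assumption made for the example: SQLite stands in for the MySQL and Oracle platforms named in the talk, the source tables carry only a few of the columns, and the rows are invented toy data rather than Ensembl content.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy source tables with a subset of the columns used on the slide.
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER, type TEXT);
CREATE TABLE gene_description (gene_id INTEGER, description TEXT);
CREATE TABLE gene_stable_id (gene_id INTEGER, stable_id TEXT, version INTEGER);
INSERT INTO gene VALUES (1, 'protein_coding'), (2, 'pseudogene');
INSERT INTO gene_description VALUES (1, 'toy description A'), (2, 'toy description B');
INSERT INTO gene_stable_id VALUES (1, 'TOY0001', 1), (2, 'TOY0002', 3);
""")

# Fold the two 1:1 tables into the main table through a temporary table.
conn.executescript("""
CREATE TABLE TEMP0 AS SELECT
    gene.gene_id, gene.type,
    gene_description.gene_id AS gene_id_TEMP0, gene_description.description
  FROM gene, gene_description
  WHERE gene_description.gene_id = gene.gene_id;

CREATE TABLE hsapiens_gene_ensembl__gene__MAIN AS SELECT
    TEMP0.gene_id, TEMP0.type, TEMP0.gene_id_TEMP0, TEMP0.description,
    gene_stable_id.gene_id AS gene_id_TEMP1,
    gene_stable_id.stable_id, gene_stable_id.version
  FROM TEMP0, gene_stable_id
  WHERE gene_stable_id.gene_id = TEMP0.gene_id;

DROP TABLE TEMP0;
""")

rows = conn.execute(
    "SELECT gene_id, type, description, stable_id, version "
    "FROM hsapiens_gene_ensembl__gene__MAIN ORDER BY gene_id"
).fetchall()
for row in rows:
    print(row)
```

Each gene ends up as one wide row in the main table, so later queries can filter on description or stable ID without further joins.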
19Transformation configuration
satellog_repeats  M  repeats  disease        n1
satellog_repeats  M  repeats  gc             11
satellog_repeats  M  repeats  linkage_depth  S
satellog_repeats  M  repeats  repeats        S
satellog_repeats  M  repeats  transcripts    S
satellog_repeats  M  repeats  ugcount        S
satellog_repeats  M  repeats  ugstats        S
satellog_repeats  M  repeats  rep_class      n1
satellog_repeats  D  ugcount  ugcount        S
satellog_repeats  D  ugcount  ugstats        S
satellog_repeats  D  ugcount  gc             S
satellog_repeats  D  ugcount  repeats        n1r
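A configuration like the one above is easy to process mechanically. The sketch below parses the twelve rows and keeps the ones that are not skipped. The field names (`dataset`, `table_type`, `central`, `linked`, `cardinality`) are assumptions, since the slide shows no column headers, and reading `S` as "skip" is inferred from the SKIP S option in the MartBuilder prompts.

```python
from collections import namedtuple

# Field names are hypothetical; the slide lists the values without headers.
Row = namedtuple("Row", "dataset table_type central linked cardinality")

CONFIG = """\
satellog_repeats M repeats disease n1
satellog_repeats M repeats gc 11
satellog_repeats M repeats linkage_depth S
satellog_repeats M repeats repeats S
satellog_repeats M repeats transcripts S
satellog_repeats M repeats ugcount S
satellog_repeats M repeats ugstats S
satellog_repeats M repeats rep_class n1
satellog_repeats D ugcount ugcount S
satellog_repeats D ugcount ugstats S
satellog_repeats D ugcount gc S
satellog_repeats D ugcount repeats n1r
"""


def parse_config(text):
    """Turn whitespace-separated configuration lines into Row tuples."""
    return [Row(*line.split()) for line in text.splitlines() if line.strip()]


rows = parse_config(CONFIG)
# Assumption: "S" marks a table pair that the transformation skips.
kept = [r for r in rows if r.cardinality != "S"]
for r in kept:
    print(r.table_type, r.central, r.linked, r.cardinality)
```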
20Data access
21Dataset Key Abstraction
- Dataset
- Organised into a single schema
- BioMart database contains one or more dataset(s)
- Attribute
- Filter
- Exportable/Importable (Links)
- Dataset: the equivalent of a relational table
- Exportable/Importable: the equivalent of PK/FK
22Key Abstractions
23Exportables, Importables and Links
- Exportable: ordered list of attributes
- Importable: ordered list of filters
- WHERE filt1 = value1
- WHERE filt1 = value1 or filt1 = value2
- WHERE filt1 > value1 and filt2 < value2
- Links: matching importable and exportable
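One way to picture a link: the values returned for one dataset's exportable attributes become filter values for the other dataset's importable. The sketch below builds such a WHERE clause with bound parameters. It is only an illustration under assumptions, not BioMart's implementation: the function `link_where` and the filter names `chr_name` and `position` are hypothetical.

```python
def link_where(importable_filters, exported_rows):
    """Build a parameterised WHERE clause from exported attribute values.

    `importable_filters` is the ordered list of filter names; each row in
    `exported_rows` holds the exportable's attribute values in the same order.
    """
    if not exported_rows:
        # Nothing was exported, so nothing can match.
        return "WHERE 1 = 0", []
    group = "(" + " AND ".join(f"{name} = ?" for name in importable_filters) + ")"
    params = []
    for row in exported_rows:
        if len(row) != len(importable_filters):
            raise ValueError("exportable and importable must have the same length")
        params.extend(row)
    return "WHERE " + " OR ".join([group] * len(exported_rows)), params


clause, params = link_where(["chr_name", "position"], [("1", 100), ("X", 250)])
print(clause)
print(params)
```

Because both sides are ordered lists, matching is positional: the first attribute feeds the first filter, the second feeds the second, and so on.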
24MartView
25Dataset Configuration
- Dataset configuration
- Attributes
- Filters
- Trees, Groups, Collections
- Links
- Semantics
- Relational mapping
- User interface
- Linking datasets
- XML-based
26Dataset Configuration
27Table naming convention - Naïve configuration
- Tables
- Meta tables: meta_content
- Data tables: dataset__content__type
- Data tables
- Main: __main
- Dimension: __dm
- Columns
- Key: _key
- Boolean filter: _bool
- List filter: _list
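The naming convention above is regular enough that table and column roles can be read off the names alone, which is what makes a naïve configuration possible. A minimal sketch, assuming double underscores appear only as separators; the helper names `parse_table_name` and `column_role` and the fallback role "other" are hypothetical.

```python
def parse_table_name(name):
    """Split dataset__content__type; return None for names outside the convention."""
    parts = name.split("__")
    if len(parts) != 3:
        return None
    dataset, content, kind = parts
    kind = kind.lower()  # table names may be written as __MAIN or __main
    if kind not in ("main", "dm"):
        return None
    return {"dataset": dataset, "content": content, "type": kind}


def column_role(column):
    """Classify a column by its suffix: key, boolean filter, or list filter."""
    suffixes = (("_key", "key"), ("_bool", "boolean filter"), ("_list", "list filter"))
    for suffix, role in suffixes:
        if column.endswith(suffix):
            return role
    return "other"  # assumption: columns without a known suffix carry no special role


print(parse_table_name("hsapiens_gene_ensembl__gene__main"))
print(parse_table_name("meta_content"))
print(column_role("gene_id_key"))
```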
28MartEditor
29MartEditor
- Naïve configuration
- Updates
- Links
- Automatic discovery of new tables
30Class diagram - configuration
31Class diagram - querying
32Information flow
- Read connections
- Register individual datasets and create linked datasets
- Get input from the user, split queries to individual datasets
- Find the shortest path between datasets (Dijkstra)
- Compile SQL
33Summary
34BioMart
- Domain independent
- Platform independent
- MySQL 4
- Oracle 9i
- Plugin architecture
35BioMart model
- Already applied
- Ensembl
- Vega
- dbSNP
- UniProt
- MSD
- Variety of small projects
- In development
- ArrayExpress
- WormBase
- RGD
36Future work
- BioMart v0.2 to be released later in January
- Java library to be upgraded over the coming months to the new architecture
- BioMart has been integrated with Taverna
- MartBuilder - to be properly implemented
37BioMart
- www.ebi.ac.uk/biomart
- Open source (LGPL)
- Public MySQL server
- ftp
- mart-dev_at_ebi.ac.uk
- mart-announce_at_ebi.ac.uk
38Acknowledgments
- BioMart
- Damian Smedley
- Darin London
- Contributors
- Arne Stabenau (Ensembl)
- Andreas Kahari (Ensembl)
- Craig Melsopp (Ensembl)
- Katerina Tzouvara (UniProt)
- Paul Donlon (Unilever)
- Will Spooner (CSHL)