Title: BioMart
1BioMart
Richard Holland, European Bioinformatics Institute
Helsinki, September 2006
2BioMart
- A joint project
- European Bioinformatics Institute (EBI)
- Cold Spring Harbor Laboratory (CSHL)
- Aim
- To develop a generic, query-oriented data
management system capable of integrating
distributed data sources.
3Focus
- Data mining or advanced search
- Creating custom datasets
- Querying multiple datasets
- Interactive
- Users
- People who provide database-based services
- Power user biologists and bioinformaticians
4Requirements
- User
- One-stop shop for biological data
- Suitable for power biologists and bioinformaticians
- A set of interfaces that allow users to group and refine biological data based upon many criteria
- Deployer
- Out of the box installation
- Built in query optimization
- Easy data federation
- Architecture
- Domain agnostic
- Distributed
- Platform independent
5Advanced search GUIs
6Single interface
7Single access point
8Queries across different databases
9Main features
- Domain agnostic
- Platform independent (MySQL, Oracle, PostgreSQL)
- Scalable for big datasets
- Federated architecture
- Automated UI configuration
10How does it work?
11BioMart
12Federated architecture
13Data model
[Figure: data model diagram; tables labelled with foreign keys (FK)]
14Data model
[Figure: data model diagram; tables labelled with a primary key (PK) and foreign keys (FK)]
15Data model - reversed star
[Figure: reversed star schema. Labels: main tables (main1, ...) with primary keys PK1 and PK2; dimension tables (dm) with foreign keys FK1 and FK2]
16Data mart and dataset
17Data mart, dataset and virtual schema
18BioMart abstractions
- Dataset
- A subset of data organized into 1 or more tables
- Attribute
- A single data point
- e.g. gene name
- Filter
- An operation on an attribute
- e.g. Chromosome 1
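The three abstractions above can be pictured with a small in-memory model. This is only an illustrative sketch, not BioMart's implementation: the class names, the equality-only filter, and the toy rows (`geneA`, `geneB`, `geneC`) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """A single data point, e.g. a gene name."""
    name: str


@dataclass
class Filter:
    """An operation that restricts rows by the value of one attribute.

    Only equality is modelled here (an assumption for the sketch).
    """
    attribute: str
    value: object

    def matches(self, row: dict) -> bool:
        return row.get(self.attribute) == self.value


@dataclass
class Dataset:
    """A subset of data; flattened here to a list of row dictionaries."""
    name: str
    rows: list = field(default_factory=list)

    def query(self, attributes, filters=()):
        """Return the requested attributes for rows passing every filter."""
        return [
            tuple(row[a.name] for a in attributes)
            for row in self.rows
            if all(f.matches(row) for f in filters)
        ]


# Hypothetical toy data, not taken from any real mart.
genes = Dataset("toy_genes", rows=[
    {"gene_name": "geneA", "chromosome": "1"},
    {"gene_name": "geneB", "chromosome": "2"},
    {"gene_name": "geneC", "chromosome": "1"},
])
result = genes.query([Attribute("gene_name")], [Filter("chromosome", "1")])
print(result)
```

Here the filter "Chromosome 1" selects rows and the attribute "gene name" selects the columns returned.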
19Datasets, Attributes and Filters
20BioMart abstractions (cont)
- Link
- common currency between two datasets
- e.g. accession
- Exportable
- Potential links to export
- Importable
- Potential links to import
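A link can be thought of as one dataset exporting a set of values (the common currency, such as accessions) that a second dataset imports as a filter. The sketch below illustrates that idea only; the function names, column names, and rows are hypothetical and do not reflect how BioMart implements linking.

```python
def export_values(rows, exportable):
    """Collect the values a dataset offers through an exportable column."""
    return {row[exportable] for row in rows}


def import_filter(rows, importable, values):
    """Keep rows whose importable column matches one of the linked values."""
    return [row for row in rows if row[importable] in values]


# Hypothetical toy datasets sharing an "accession" column.
structures = [
    {"pdb_id": "s1", "accession": "ACC1"},
    {"pdb_id": "s2", "accession": "ACC3"},
]
genes = [
    {"gene_name": "geneA", "accession": "ACC1"},
    {"gene_name": "geneB", "accession": "ACC2"},
    {"gene_name": "geneC", "accession": "ACC3"},
]

linked = export_values(structures, "accession")      # exportable side
joined = import_filter(genes, "accession", linked)   # importable side
print([row["gene_name"] for row in joined])
```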
21Exportables, Importables and Links
22Exportables, Importables and Links
23Exportables, Importables and Links
24Creating BioMart databases
25Building BioMart databases
MartBuilder
26Schema transformation principles
- Central table
- Longest n:1, 1:1 path
- Dimension table
- Central transformation around a 1:n table.
- Link tables are decomposed into a set of 1:n first
27MartBuilder Application
- Reads database metadata
- Transforms a source schema into suggested datasets and lets you edit the process
- Produces a set of SQL statements (DDL) to run against the server to perform the transformation
29Dataset Configuration
- Dataset configuration
- Attributes
- Filters
- Trees, Groups, Collections
- Exportables, Importables
- Semantics
- Relational mapping
- User interface
- Linking datasets
- XML-based
30Table naming convention: Naïve configuration
- Tables
- Meta tables: meta_content
- Data tables: dataset__content__type
- Data tables
- Main: __main
- Dimension: __dm
- Columns
- Key: _key
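To make the convention concrete, the sketch below builds one main table and one dimension table in an in-memory SQLite database and joins them on a shared `_key` column. SQLite is used only so the example is self-contained; the table names (`toy__gene__main`, `toy__xref__dm`), columns, and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE toy__gene__main (
    gene_id_key INTEGER PRIMARY KEY,
    gene_name   TEXT,
    chromosome  TEXT
);
CREATE TABLE toy__xref__dm (
    gene_id_key INTEGER,   -- refers back to the main table's key column
    xref_name   TEXT
);
INSERT INTO toy__gene__main VALUES (1, 'geneA', '1'), (2, 'geneB', '2');
INSERT INTO toy__xref__dm VALUES (1, 'X1'), (1, 'X2'), (2, 'X3');
""")

# Filter on the main table, pull an attribute from the dimension table.
rows = conn.execute("""
    SELECT m.gene_name, d.xref_name
    FROM toy__gene__main AS m
    JOIN toy__xref__dm AS d ON d.gene_id_key = m.gene_id_key
    WHERE m.chromosome = '1'
    ORDER BY d.xref_name
""").fetchall()
print(rows)
conn.close()
```

Because every dimension table hangs off the main table through the same key column, a query like this needs only one join per dimension it touches.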
31Naming convention examples
- Homo sapiens gene ensembl
- hsapiens_gene_ensembl__gene__main
- hsapiens_gene_ensembl__xref_hugo__dm
- Encode
- hsapiens_encode__encode__main
- Uniprot
- uniprot__protein__main
- uniprot__interpro__dm
- Uniprot sequence
- uniprot_sequence__sequence__main
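Since the parts of a data table name are separated by double underscores, a name can be split back into dataset, content, and type mechanically. The helper below is a hypothetical sketch (the `MartTable` and `parse_table_name` names are not part of BioMart), assuming single underscores never form a separator.

```python
from typing import NamedTuple


class MartTable(NamedTuple):
    dataset: str
    content: str
    kind: str  # "main" or "dm"


def parse_table_name(name: str) -> MartTable:
    """Split dataset__content__type on the double-underscore separator."""
    parts = name.split("__")
    if len(parts) != 3:
        raise ValueError(f"expected dataset__content__type, got {name!r}")
    dataset, content, kind = parts
    if kind not in ("main", "dm"):
        raise ValueError(f"unknown table type {kind!r} in {name!r}")
    return MartTable(dataset, content, kind)


for table in ("hsapiens_gene_ensembl__gene__main",
              "hsapiens_gene_ensembl__xref_hugo__dm",
              "uniprot__interpro__dm"):
    print(parse_table_name(table))
```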
32Dataset Configuration
33MartEditor
34Accessing BioMart databases
35BioMart architecture
36MartView (current)
37MartView (new 0_5)
38MartExplorer
39MartShell
40MartShell (MQL)
- Uses Mart Query Language (MQL) to generate queries
- using <dataset> get <attributes> where <filters>
- Can join datasets together
- using Dataset1 get Attribute1 where Filter1=var1 as q
- using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
- Can script and pipe
- martshell.sh -E MQLscript.mql > results.txt
- martshell.sh -E MQLscript.mql | wc
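For scripted use, MQL statements following the using / get / where pattern can be assembled programmatically before being written to a script file. The helper below is a hypothetical sketch, not part of MartShell; joining filters with `and` follows the slide examples, while separating multiple attributes with commas is an assumption.

```python
def build_mql(dataset, attributes, filters=(), alias=None):
    """Assemble 'using <dataset> get <attributes> where <filters>'."""
    if not attributes:
        raise ValueError("at least one attribute is required")
    # Comma-separated attributes are an assumption of this sketch.
    query = f"using {dataset} get {', '.join(attributes)}"
    if filters:
        query += " where " + " and ".join(filters)
    if alias:
        # Naming the result lets a later query refer to it ("... in q").
        query += f" as {alias}"
    return query


first = build_mql(
    "MSD.msd",
    ["pdb_id"],
    ["resolution_less < 1.5", "has_ec_info only"],
    alias="q",
)
print(first)
```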
41MartShell examples
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only
193l
194l
1arb
...
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only as q
MartShell> using Ensembl.hsapiens_gene_ensembl get sequence transcript_flanks1000 where pdb in q
ENST00000270142.2 ENSG00000142168.2 strand=forward chr=21 assembly=NCBI34 downstream flanking sequence of transcript only
AAACTAAATTAGCTCTGATACTTATTTATATAAACAGCTTCAGTGGAA ....
42biomaRt
43Taverna
44DAS ProServer
45BioMart deployers
- Large scale data federation (EBI)
- Optimising access to a large database (Ensembl, WormBase)
- Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen, etc.)
46Hinxton example
WWW
47BioMart deployers
- Large scale data federation (Hinxton)
- Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
- Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen, etc.)
48WormBase
- Genes
- Expression
- Phenotypes
- Variations
- Literature
- Ontologies
- Sequence
49Ensembl
- Genes
- Ontologies
- Variations
- Protein annotation
- Disease
- Homologies
- Sequence
- Array annotations
50HapMap
- Population frequencies
- Inter-population comparisons
- Gene annotation
51ArrayExpress
52BioMart deployers
- Large scale data federation (Hinxton)
- Optimising access to a large database (Ensembl, WormBase)
- Federating third party data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa, etc.)
53In development
- CAPRISA
- RGD
- DICTYBASE
- PURDUE UNIVERSITY
- RZPD
54Music Mart
55BioMart model
- Already applied
- Ensembl
- Vega
- SNP
- Uniprot
- MSD
- ArrayExpress
- WormBase
- Gramene
- HapMap
- Variety of in house projects (academia and
industrial)
56User restriction
martUser
Dataset
default
advanced
57Interface configuration
Interface
Dataset
single-page web interface
wizard style web interface
58Web services
[Figure: MartView and MartService with a local mart and a remote mart; connections are labelled with ports 80 and 3306, and one 3306 connection is marked X]
59Web services (cont)
- MartService requests
- Registry XML
- Dataset information: name, type, etc.
- DatasetConfig XML
- Mart Query
- API query object is converted to an XML representation on the client and sent to the server.
- Query object is regenerated on the server and processed. Results are sent back to the client as a simple tab-delimited HTML page.
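The round trip described above (query object to XML on the client, XML back to a query object on the server) can be sketched with the Python standard library. The element and attribute names used here (`Query`, `Dataset`, `Attribute`, `Filter`, `name`, `value`) and the `gene_name` / `chromosome` identifiers are assumptions chosen for illustration; the actual MartService schema should be taken from the BioMart documentation.

```python
import xml.etree.ElementTree as ET


def query_to_xml(dataset, attributes, filters):
    """Client side: serialise a simple query description to an XML string."""
    root = ET.Element("Query")
    ds = ET.SubElement(root, "Dataset", name=dataset)
    for name in attributes:
        ET.SubElement(ds, "Attribute", name=name)
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", name=name, value=str(value))
    return ET.tostring(root, encoding="unicode")


def xml_to_query(xml_text):
    """Server side: rebuild the query description from the XML string."""
    ds = ET.fromstring(xml_text).find("Dataset")
    attributes = [a.get("name") for a in ds.findall("Attribute")]
    filters = {f.get("name"): f.get("value") for f in ds.findall("Filter")}
    return ds.get("name"), attributes, filters


xml_text = query_to_xml("hsapiens_gene_ensembl", ["gene_name"], {"chromosome": "1"})
print(xml_text)
print(xml_to_query(xml_text))
```

No network call is made here; the sketch only shows that the XML form carries enough information to regenerate the query on the other side.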
60Summary
- A generic data management system
- A set of easily configurable user interfaces
- Distributed Data federation
- Query optimization
61BioMart
- www.biomart.org
- Open source (LGPL)
- Public MySQL server
- ftp
- mart-dev_at_ebi.ac.uk
- mart-announce_at_ebi.ac.uk
62Acknowledgments
- BioMart
- Arek Kasprzyk (EBI)
- Damian Smedley (EBI)
- Syed Haider (EBI)
- Gudmundur Thorisson (CSHL)
- Contributors
- Darin London (EBI)
- Will Spooner (CSHL)
- Damian Keefe (Ensembl)
- Arne Stabenau (Ensembl)
- Andreas Kahari (Ensembl)
- Craig Melsopp (Ensembl)
- Katerina Tzouvara (Uniprot)
- Paul Donlon (Unilever)
- Steffen Durinck (SCD-ESAT, Katholieke Universiteit Leuven)
- Benoit Ballester (Universite de la Mediterranee)
- Stephen Robinson (EBI)
- Asif Kibria (EBI)