BioMart - PowerPoint PPT Presentation

1
BioMart
  • A Federated Query Architecture

Arek Kasprzyk, European Bioinformatics Institute
29 July 2004
2
BioMart
  • A distributed data integration system based on
    query-optimized relational views.

3
Fixed schema transformation
4
BIOMART
5
BioMart
  • Generic
  • Universal BioMart data model
  • Query-based interface
  • No data dependent abstractions
  • Network scalability
  • Query optimised schema
  • Platform portability
  • Automatic, simple SQL

6
Distributed integration
SOAP CORBA
7
System overview
8
BioMart
  • A toolset for creating, maintaining and accessing
    relational views on existing data sources.
  • The views - BioMart databases (marts)

9
Building BioMart databases
MartBuilder
MartEditor
10
Building BioMart databases
  • MartBuilder
  • Transforms source schema into mart schema
  • A set of DDL commands
  • An automatic schema transformation
  • Adapts to source schema changes

11
MartBuilder
  • Input
  • central object
  • database meta data
  • cardinalities
  • Output
  • Set of DDL statements
  • create table as select
  • Transformations
  • represented as an asymmetric tree
  • Secondary transformations built on top of
    existing ones
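
A minimal Python sketch of the idea on this slide, not MartBuilder's actual code: given a central table, its columns and a list of tables marked with cardinality 11, it emits a chain of "create table as select" statements that pass through temporary tables. The function name build_main_ddl, the shared join key and the shortened column lists are assumptions made for illustration.

```python
def build_main_ddl(central, central_cols, joins, key, final_name):
    """Return DDL statements that fold 1:1 tables into one main table.

    joins: list of (table_name, [columns]) pairs, each assumed to have
    cardinality 11 and to share the column `key` with the central table.
    """
    statements = []
    source = central                       # table the next step reads from
    select_cols = [f"{central}.{c}" for c in central_cols]
    for step, (table, cols) in enumerate(joins):
        last = step == len(joins) - 1
        target = final_name if last else f"TEMP{step}"
        # keep the joined table's key under a distinct alias
        added = [f"{table}.{key} AS {key}_TEMP{step}"]
        added += [f"{table}.{c}" for c in cols]
        statements.append(
            f"CREATE TABLE {target} AS SELECT {', '.join(select_cols + added)} "
            f"FROM {source}, {table} WHERE {table}.{key} = {source}.{key}"
        )
        if source != central:              # never drop the source database table
            statements.append(f"DROP TABLE {source}")
        # the next step re-selects everything accumulated so far
        carried = [c.split(" AS ")[-1].split(".")[-1] for c in select_cols + added]
        select_cols = [f"{target}.{c}" for c in carried]
        source = target
    return statements


statements = build_main_ddl(
    "gene", ["gene_id", "type"],
    [("gene_description", ["description"]),
     ("gene_stable_id", ["stable_id", "version"])],
    key="gene_id",
    final_name="hsapiens_gene_ensembl__gene__MAIN",
)
for statement in statements:
    print(statement)
```

Each step reads from the table produced by the previous one, so only the final table survives; the intermediate TEMP tables are dropped as soon as they have been consumed.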

12
MartBuilder
DATASET: hsapiens_gene_ensembl
TYPE MAIN M DIMENSION D EXIT E: M
TABLE NAME: gene
gene alt_allele cardinality 11 n1 0n 1n SKIP S: S
gene gene cardinality 11 n1 0n 1n SKIP S: S
gene gene_description cardinality 11 n1 0n 1n SKIP S: 11
gene gene_stable_id cardinality 11 n1 0n 1n SKIP S: 11
gene kk__gene__main cardinality 11 n1 0n 1n SKIP S: S
gene transcript cardinality 11 n1 0n 1n SKIP S: S
gene analysis cardinality 11 n1 0n 1n SKIP S: n1
gene dna cardinality 11 n1 0n 1n SKIP S: S
gene dnac cardinality 11 n1 0n 1n SKIP S: S
gene seq_region cardinality 11 n1 0n 1n SKIP S: S
TYPE MAIN M DIMENSION D EXIT E: E
ADD EXTENSION hsapiens_gene_ensembl__gene__MAIN YN: N
CHANGE FINAL TABLE NAME hsapiens_gene_ensembl__gene__MAIN TO:

CREATE TABLE TEMP0 as SELECT
    gene.gene_id, gene.type, gene.analysis_id, gene.seq_region_id,
    gene.seq_region_start, gene.seq_region_end, gene.seq_region_strand,
    gene.display_xref_id,
    gene_description.gene_id AS gene_id_TEMP0, gene_description.description
  FROM gene, gene_description
  WHERE gene_description.gene_id = gene.gene_id

CREATE TABLE hsapiens_gene_ensembl__gene__MAIN as SELECT
    TEMP0.gene_id, TEMP0.type, TEMP0.analysis_id, TEMP0.seq_region_id,
    TEMP0.seq_region_start, TEMP0.seq_region_end, TEMP0.seq_region_strand,
    TEMP0.display_xref_id, TEMP0.gene_id_TEMP0, TEMP0.description,
    gene_stable_id.gene_id AS gene_id_TEMP1, gene_stable_id.stable_id,
    gene_stable_id.version
  FROM TEMP0, gene_stable_id
  WHERE gene_stable_id.gene_id = TEMP0.gene_id

drop table TEMP0
13
Transformation configuration
satellog_repeats  M  repeats  disease        n1
satellog_repeats  M  repeats  gc             11
satellog_repeats  M  repeats  linkage_depth  S
satellog_repeats  M  repeats  repeats        S
satellog_repeats  M  repeats  transcripts    S
satellog_repeats  M  repeats  ugcount        S
satellog_repeats  M  repeats  ugstats        S
satellog_repeats  M  repeats  rep_class      n1
satellog_repeats  D  ugcount  ugcount        S
satellog_repeats  D  ugcount  ugstats        S
satellog_repeats  D  ugcount  gc             S
satellog_repeats  D  ugcount  repeats        n1r
14
Reversed Star Schema
GENE CENTRAL
gene_id (PK), gene_stable_id, gene_chrom_start, gene_chrom_end, chromosome, gene_display_id, band, description, etc.
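
A small sketch of this layout using Python's built-in sqlite3 module: one central table with one row per gene, and a satellite table that carries the same key. The table names, the gene_id_key column, the disease satellite table and all data values are made up for illustration; only the general shape follows the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- central (main) table: one row per gene
    CREATE TABLE demo__gene__main (
        gene_id_key    INTEGER PRIMARY KEY,
        gene_stable_id TEXT,
        chromosome     TEXT,
        description    TEXT
    );
    -- satellite (dimension) table: zero or more rows per gene, linked by the key
    CREATE TABLE demo__disease__dm (
        gene_id_key INTEGER,
        disease     TEXT
    );
    INSERT INTO demo__gene__main VALUES
        (1, 'G0001', '1', 'example kinase'),
        (2, 'G0002', '1', 'example receptor'),
        (3, 'G0003', '2', 'example channel');
    INSERT INTO demo__disease__dm VALUES
        (1, 'disease A'), (1, 'disease B'), (3, 'disease C');
""")

# attributes and simple filters are answered from the central table alone
on_chr1 = conn.execute(
    "SELECT gene_stable_id FROM demo__gene__main "
    "WHERE chromosome = ? ORDER BY gene_id_key", ("1",)
).fetchall()

# a filter on satellite data needs one key lookup against one satellite table
with_disease = conn.execute(
    "SELECT gene_stable_id FROM demo__gene__main WHERE gene_id_key IN "
    "(SELECT gene_id_key FROM demo__disease__dm) ORDER BY gene_id_key"
).fetchall()

print(on_chr1)
print(with_disease)
```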
15
Table naming convention
  • Tables
  • Meta tables: meta_content
  • Data tables: dataset__content__type
  • Data tables
  • Main: __main
  • Dimension: __dm
  • Columns
  • Key: _key
  • Boolean filter: _bool
  • List filter: _list
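
Because the convention is purely name-based, software can recover the role of a table or column from its name alone. The sketch below shows one way that could look in Python; the function names are hypothetical, and treating unsuffixed columns as plain attributes is an assumption, not something the slide states.

```python
def parse_table_name(name):
    """Split a data table name of the form dataset__content__type."""
    parts = name.split("__")
    if len(parts) != 3 or parts[2].lower() not in ("main", "dm"):
        raise ValueError(f"not a mart data table name: {name!r}")
    dataset, content, kind = parts
    return {
        "dataset": dataset,
        "content": content,
        "type": "main" if kind.lower() == "main" else "dimension",
    }


def classify_column(column):
    """Map a column name to its role using the suffix convention."""
    for suffix, role in (("_key", "key"),
                         ("_bool", "boolean filter"),
                         ("_list", "list filter")):
        if column.endswith(suffix):
            return role
    return "attribute"   # assumption: no suffix means an ordinary data column


print(parse_table_name("hsapiens_gene_ensembl__gene__main"))
print(classify_column("gene_id_key"))
```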

16
Configuring BioMart databases
  • MartEditor
  • XML editor with built-in system logic
  • Configure existing interfaces
  • Automatically create new, naive configuration
  • Automated detection of user-added tables
  • Updates

17
MartEditor
18
XML-based Configuration
19
BioMart: a generic system
  • Key abstractions
  • Dataset
  • Attribute
  • Filter
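
One way to picture the three abstractions is as a small query object: a dataset says where to search, attributes say what to return, and filters say which rows qualify. The Python sketch below is illustrative only; the class names, the equality-only filters and the direct mapping of a dataset to a single table are simplifying assumptions, not BioMart's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Filter:
    column: str     # column the restriction applies to
    value: object   # value the column must equal


@dataclass
class Query:
    dataset: str                                     # table to search
    attributes: list = field(default_factory=list)   # columns to return
    filters: list = field(default_factory=list)      # restrictions on rows

    def to_sql(self):
        """Compile to a parameterised single-table SELECT.

        Identifiers are interpolated as-is, so they must come from trusted
        configuration; only filter values are bound as parameters.
        """
        sql = f"SELECT {', '.join(self.attributes)} FROM {self.dataset}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"{f.column} = ?" for f in self.filters)
        return sql, [f.value for f in self.filters]


query = Query(
    dataset="hsapiens_gene_ensembl__gene__main",
    attributes=["gene_stable_id", "description"],
    filters=[Filter("chromosome", "1")],
)
sql, params = query.to_sql()
print(sql, params)
```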

20
Use cases
  • Upstream sequences
  • for all kinases
  • up-regulated in brain and associated with
    known diseases
  • Name, chromosome position, description
  • of all genes
  • located on chromosome 1, expressed in lung,
    associated with mouse homologues and
    non-synonymous SNP changes

21
Key Abstractions
22
Mart Query Language (MQL)
23
Data access
  • MartView (Web)
  • MartExplorer (GUI)
  • MartShell (Text)
  • MartLib (API)

24
MartView
25
MartShell
26
MartExplorer
27
Distributed Architecture
28
Query-chaining
using Dataset1 get Attribute1 where Filter1=var1 as q
using Dataset2 get Attribute2 where Filter2=var2 and Filter3 in q
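
The chaining pattern can be sketched in Python with two independent in-memory SQLite databases standing in for marts on different servers: the result of the first query is kept under the name q and then used as an "in" filter on the second. The table names, columns and data are invented for illustration.

```python
import sqlite3


def make_mart(ddl, insert, rows):
    """Create an in-memory database standing in for one remote mart."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert, rows)
    return conn


genes = make_mart(
    "CREATE TABLE genes (gene_id TEXT, chromosome TEXT)",
    "INSERT INTO genes VALUES (?, ?)",
    [("G1", "1"), ("G2", "1"), ("G3", "2")],
)
expression = make_mart(
    "CREATE TABLE expression (gene_id TEXT, tissue TEXT)",
    "INSERT INTO expression VALUES (?, ?)",
    [("G1", "lung"), ("G2", "brain"), ("G3", "lung")],
)

# first query: genes on chromosome 1, result kept as q
q = [row[0] for row in genes.execute(
    "SELECT gene_id FROM genes WHERE chromosome = ? ORDER BY gene_id", ("1",))]

# second query, against the other database: lung-expressed genes that are in q
result = []
if q:  # an empty q can match nothing, and "IN ()" is not portable SQL
    placeholders = ", ".join("?" for _ in q)
    result = [row[0] for row in expression.execute(
        "SELECT gene_id FROM expression WHERE tissue = ? "
        f"AND gene_id IN ({placeholders}) ORDER BY gene_id", ["lung", *q])]

print(result)
```

The two databases never join directly; only the list of identifiers travels between them, which is what lets the data sources stay on separate servers.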
29
BioMart: A Distributed Architecture
30
BioMart: User Perspective
31
Requirements
  • Mart-spec database
  • Mart-compatible star schema
  • Table naming convention (dataset__content__type)
  • XML configuration file
  • RDBMS server outside firewall

32
What Do You Get?
  • Flexible interfaces configurable according to
    your spec
  • Performance-assured data retrieval
  • Query chaining across data sources
  • Administrator tools for modifying and deploying
    the system

33
(No Transcript)
34
Current status
  • BioMart 1 (EnsMart) web
  • Universal data model
  • Network scalability
  • Universal applicability
  • Automated SQL
  • Platform portability
  • File-based query chaining

35
Current status
  • BioMart 2 (MartJ) standalone interfaces
  • XML-based configuration
  • MQL
  • Basic support for single attribute query-chaining

36
Current status
  • BioMart 3 web
  • Plug in architecture
  • Multi-attribute chaining
  • Implicit chaining
  • Multiple schemas/Query Compilers
  • MartBuilder

37
BioMart: an Open Project
  • All code and data freely available
  • Website
  • www.ebi.ac.uk/biomart
  • www.ebi.ac.uk/biomart/martview
  • Public MySQL server
  • martdb.ebi.ac.uk
  • Ftp
  • ftp.ebi.ac.uk
  • Mailing lists
  • mart-dev
  • mart-announce

38
Summary
  • If you need
  • Scalable and flexible search interfaces for an
    existing database
  • Single integrated search interface to many in
    house databases
  • Connect your databases to other databases on
    the internet
  • BioMart

39
Credits
  • Damian Keefe
  • Damian Smedley
  • Craig Melsopp
  • Darin London
  • Katerina Tzouvara
  • Will Spooner
  • Andreas Kahari

40
(No Transcript)
41
Changing Research Focus
  • The increase in high-throughput technologies
  • Growing sophistication of the user
  • Research questions involving big datasets
  • Multispecies
  • Multiexperiments
  • Multidatasets
  • Data sources distributed

42
Use cases
  • Upstream sequences for all kinases upregulated in
    brain and associated with known diseases
  • Name, chromosome position, description of all
    genes located on chromosome 1, expressed in
    lung, associated with mouse homologues, and
    non-synonymous SNP changes

43
Solutions
  • Bioinformatics support
  • Processing data files
  • Use third party software
  • In house processing
  • No bioinformatics?
  • One-stop shop for biological data

44
(No Transcript)
45
CORBA SOAP
46
A Container Ship
47
Future
48
July
  • Alpha release of the BioMart suite
  • Specification
  • Schema naming convention
  • DTD for XML config
  • Administration Tools
  • Configure
  • Data access (Perl/Java)
  • Lib
  • Interfaces
  • Tested on MySQL 4/Oracle 9i mixture

49
After July
  • MartBuilder
  • Automatically build marts from existing 3NF with
    predefined PK/FK
  • Fixed schema data transformation function
  • SQL collection
  • Collaboration
  • Laboratory for Foundations of Computer Science
  • Bell Labs

50
BioMart
51
Biological databases
  • Update oriented
  • Complex, normalised schemas
  • Sophisticated queries involve multiple joins
  • Difficult and slow

52
Query optimised
  • Data mart
  • Few joins
  • Duplicated data
  • Denormalised
  • Efficient and scalable for large and
    sophisticated queries

53
Distributed Model Benefits
  • Each group retains full control over their data
    source
  • Data content
  • Data updates
  • Data presentation (interface)
  • Deployment platform
  • Security

54
BioMart
  • Building marts
  • MartBuilder - schema transformation
  • MartEditor - configuration editor
  • Accessing marts
  • MartShell - command-line interface
  • MartExplorer - GUI
  • MartView - website