BioMart - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: BioMart


1
BioMart
  • Databases made easy

Richard Holland, European Bioinformatics Institute
Helsinki, September 2006
2
BioMart
  • A joint project
  • European Bioinformatics Institute (EBI)
  • Cold Spring Harbor Laboratory (CSHL)
  • Aim
  • To develop a generic, query-oriented data
    management system capable of integrating
    distributed data sources.

3
Focus
  • Data mining or advanced search
  • Creating custom datasets
  • Querying multiple datasets
  • Interactive
  • Users
  • People who provide database-based services
  • Power user biologists and bioinformaticians

4
Requirements
  • User
  • One-stop shop for biological data
  • Suitable for power biologists and
    bioinformaticians
  • A set of interfaces that allow users to group and
    refine biological data based upon many criteria
  • Deployer
  • Out of the box installation
  • Built in query optimization
  • Easy data federation
  • Architecture
  • Domain agnostic
  • Distributed
  • Platform independent

5
Advanced search GUIs
6
Single interface
7
Single access point
8
Queries across different databases
9
Main features
  • Domain agnostic
  • Platform independent (MySQL, ORACLE, Postgres)
  • Scalable for big datasets
  • Federated architecture
  • Automated UI configuration

10
How does it work?
11
BioMart
12
Federated architecture
13
Data model
(Diagram: tables related through foreign keys, FK)
14
Data model
(Diagram: tables related through foreign keys, FK, with one primary key, PK)
15
Data model - reversed star
(Diagram: reversed star schema; main tables labelled with keys PK1 and PK2, dimension (dm) tables labelled with FK1 and FK2)
16
Data mart and dataset
17
Data mart, dataset and virtual schema
18
BioMart abstractions
  • Dataset
  • A subset of data organized into 1 or more tables
  • Attribute
  • A single data point
  • e. g. gene name
  • Filter
  • An operation on an attribute
  • e. g. Chromosome 1
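
The three abstractions above can be pictured with a small Python sketch. It is not the BioMart API: the class names, the toy rows and the `chromosome_name` column are assumptions made up for illustration. A dataset holds rows, an attribute names one value per row, and a filter is a test applied to an attribute.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass(frozen=True)
class Attribute:
    """A single data point exposed by a dataset, e.g. a gene name."""
    name: str


@dataclass(frozen=True)
class Filter:
    """An operation on an attribute that restricts which rows are returned."""
    attribute: Attribute
    predicate: Callable[[str], bool]


@dataclass
class Dataset:
    """A subset of data; a plain list of rows stands in for the tables."""
    name: str
    rows: List[Row]

    def query(self, attributes: List[Attribute], filters: List[Filter]) -> List[Row]:
        # Keep rows that pass every filter, then project the requested attributes.
        kept = [r for r in self.rows
                if all(f.predicate(r[f.attribute.name]) for f in filters)]
        return [{a.name: r[a.name] for a in attributes} for r in kept]


# Toy data, invented for illustration only.
gene_name = Attribute("gene_name")
chromosome = Attribute("chromosome_name")
genes = Dataset("toy_genes", [
    {"gene_name": "geneA", "chromosome_name": "1"},
    {"gene_name": "geneB", "chromosome_name": "2"},
    {"gene_name": "geneC", "chromosome_name": "1"},
])
on_chr1 = Filter(chromosome, lambda value: value == "1")
result = genes.query([gene_name], [on_chr1])
print(result)
```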

19
Datasets, Attributes and Filters
20
BioMart abstractions (cont)
  • Link
  • common currency between two datasets
  • e. g. accession
  • Exportable
  • Potential links to export
  • Importable
  • Potential links to import
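
One way to read the link idea is as a join on a shared key: the first dataset exports a set of values (for example accessions) and the second dataset imports them as a filter. The sketch below shows only that idea; the function names and all data values are hypothetical and do not come from BioMart.

```python
from typing import Dict, List, Set

Row = Dict[str, str]


def export_values(rows: List[Row], exportable: str) -> Set[str]:
    """Collect the values a dataset offers for linking (its exportable)."""
    return {row[exportable] for row in rows if exportable in row}


def import_values(rows: List[Row], importable: str, values: Set[str]) -> List[Row]:
    """Keep only rows whose importable column matches an exported value."""
    return [row for row in rows if row.get(importable) in values]


# Two toy datasets sharing an accession; all values are invented.
structures = [
    {"pdb_id": "s1", "accession": "ACC1"},
    {"pdb_id": "s2", "accession": "ACC3"},
]
genes = [
    {"gene": "geneA", "accession": "ACC1"},
    {"gene": "geneB", "accession": "ACC2"},
    {"gene": "geneC", "accession": "ACC3"},
]

linked = import_values(genes, "accession", export_values(structures, "accession"))
print([row["gene"] for row in linked])
```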

21
Exportables, Importables and Links
22
Exportables, Importables and Links
23
Exportables, Importables and Links
24
Creating BioMart databases
25
Building BioMart databases
MartBuilder
26
Schema transformation principles
  • Central table
  • Longest n:1, 1:1 path
  • Dimension table
  • Central transformation around 1:n table.
  • Link tables are decomposed into a set of 1:n first
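
A minimal sketch of the resulting layout, using Python's built-in `sqlite3` module: one central (main) table holds one row per entity, and a dimension table holds the one-to-many values that point back to it through a key column. The table names, column names and rows are assumptions chosen for illustration; they are not output from MartBuilder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Central (main) table: one row per gene.
CREATE TABLE toy__gene__main (
    gene_id_key INTEGER PRIMARY KEY,
    gene_name   TEXT
);
-- Dimension table: many cross-references per gene, linked by the key column.
CREATE TABLE toy__xref__dm (
    gene_id_key INTEGER REFERENCES toy__gene__main(gene_id_key),
    xref        TEXT
);
INSERT INTO toy__gene__main VALUES (1, 'geneA');
INSERT INTO toy__gene__main VALUES (2, 'geneB');
INSERT INTO toy__xref__dm VALUES (1, 'X1');
INSERT INTO toy__xref__dm VALUES (1, 'X2');
INSERT INTO toy__xref__dm VALUES (2, 'X3');
""")

# A query touches the main table plus only the dimension tables it needs.
rows = conn.execute("""
    SELECT m.gene_name, d.xref
    FROM toy__gene__main AS m
    JOIN toy__xref__dm AS d ON d.gene_id_key = m.gene_id_key
    WHERE m.gene_name = 'geneA'
    ORDER BY d.xref
""").fetchall()
print(rows)
```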

27
MartBuilder Application
  • Reads database metadata
  • Transforms a source schema into suggested datasets
    and lets you edit the process
  • Produces a set of SQL statements (DDL) to run
    against the server to perform the transformation

28
(No Transcript)
29
Dataset Configuration
  • Dataset configuration
  • Attributes
  • Filters
  • Trees, Groups, Collections
  • Exportables, Importables
  • Semantics
  • Relational mapping
  • User interface
  • Linking datasets
  • XML-based

30
Table naming convention - Naïve configuration
  • Tables
  • Meta tables: meta_content
  • Data tables: dataset__content__type
  • Data tables
  • Main: __main
  • Dimension: __dm
  • Columns
  • Key: _key

31
Naming convention examples
  • Homo sapiens gene ensembl
  • hsapiens_gene_ensembl__gene__main
  • hsapiens_gene_ensembl__xref_hugo__dm
  • Encode
  • hsapiens_encode__encode__main
  • Uniprot
  • uniprot__protein__main
  • uniprot__interpro__dm
  • Uniprot sequence
  • uniprot_sequence__sequence__main
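
Because the three parts of a data table name are separated by double underscores, the examples above can be taken apart mechanically. The helper below is a hypothetical sketch, not a BioMart function; it assumes every data table name has exactly three parts and ends in `main` or `dm`.

```python
from typing import NamedTuple


class MartTable(NamedTuple):
    dataset: str
    content: str
    kind: str  # "main" or "dm"


def parse_table_name(table: str) -> MartTable:
    """Split a dataset__content__type table name on its double underscores."""
    parts = table.split("__")
    if len(parts) != 3 or parts[2] not in ("main", "dm"):
        raise ValueError(f"not a mart data table name: {table!r}")
    return MartTable(*parts)


for name in ("hsapiens_gene_ensembl__gene__main",
             "hsapiens_gene_ensembl__xref_hugo__dm",
             "uniprot__interpro__dm"):
    print(parse_table_name(name))
```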

32
Dataset Configuration
33
MartEditor
34
Accessing BioMart databases
35
BioMart architecture
36
MartView (current)
37
MartView (new 0_5)
38
MartExplorer
39
MartShell
40
MartShell (MQL)
  • Uses Mart Query Language (MQL) to generate
    queries
  • using <dataset> get <attributes> where <filters>
  • Can join datasets together
  • using Dataset1 get Attribute1 where Filter1=var1
    as q
  • using Dataset2 get Attribute2 where Filter2=var2
    and Filter3 in q
  • Can script and pipe
  • martshell.sh -E MQLscript.mql > results.txt
  • martshell.sh -E MQLscript.mql | wc
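
Since an MQL statement is plain text of the form "using dataset, get attributes, where filters", scripts can assemble one before handing it to MartShell. The helper below is a hypothetical sketch rather than part of BioMart; it assumes filters are written as `name=value`, joined with `and`, and that multiple attributes are separated by commas.

```python
from typing import Dict, List, Optional


def build_mql(dataset: str, attributes: List[str],
              filters: Optional[Dict[str, str]] = None,
              alias: Optional[str] = None) -> str:
    """Assemble a 'using ... get ... where ...' statement as plain text."""
    if not attributes:
        raise ValueError("at least one attribute is required")
    query = f"using {dataset} get {', '.join(attributes)}"
    if filters:
        # Assumed clause form: name=value, joined with 'and'.
        clauses = [f"{name}={value}" for name, value in filters.items()]
        query += " where " + " and ".join(clauses)
    if alias:
        # 'as <name>' stores the result so that a later query can refer to it.
        query += f" as {alias}"
    return query


first = build_mql("Dataset1", ["Attribute1"], {"Filter1": "var1"}, alias="q")
print(first)
```

The resulting text could then be written to a script file and run with `martshell.sh -E`, as in the bullets above.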

41
MartShell examples
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only
193l
194l
1arb
...
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only as q
MartShell> using Ensembl.hsapiens_gene_ensembl get sequence transcript_flanks+1000 where pdb in q
ENST00000270142.2 ENSG00000142168.2 strand=forward chr=21 assembly=NCBI34 downstream flanking sequence of transcript only
AAACTAAATTAGCTCTGATACTTATTTATATAAACAGCTTCAGTGGAA ....
42
biomaRt
43
Taverna
44
DAS ProServer
45
BioMart deployers
  • Large scale data federation (EBI)
  • Optimising access to a large database (Ensembl,
    WormBase)
  • Connecting proprietary datasets to public data
    (Pasteur, Unilever, Serono, Sanofi-Aventis,
    DevGen, etc.)

46
Hinxton example
WWW
47
BioMart deployers
  • Large scale data federation (Hinxton)
  • Optimising access to a large database (Ensembl,
    WormBase, ArrayExpress)
  • Connecting proprietary datasets to public data
    (Pasteur, Unilever, Serono, Sanofi-Aventis,
    DevGen, etc.)

48
WormBase
Genes Expression Phenotypes Variations Literature
Ontologies Sequence
49
Ensembl
Genes Ontologies Variations Protein
annotation Disease Homologies Sequence Array
annotations
50
HapMap
Population Frequencies Inter-population comparisons
Gene annotation
51
ArrayExpress
52
BioMart deployers
  • Large scale data federation (Hinxton)
  • Optimising access to a large database (Ensembl,
    WormBase)
  • Federating third party data with public data
    (Pasteur, INRA, Bayer, Unilever, Serono,
    Sanofi-Aventis, DevGen, Solexa, etc.)

53
In development
  • CAPRISA
  • RGD
  • DICTYBASE
  • PURDUE UNIVERSITY
  • RZPD

54
Music Mart
55
BioMart model
  • Already applied
  • Ensembl
  • Vega
  • SNP
  • Uniprot
  • MSD
  • ArrayExpress
  • WormBase
  • Gramene
  • HapMap
  • Variety of in house projects (academia and
    industrial)

56
User restriction
martUser
Dataset
default
advanced
57
Interface configuration
Interface
Dataset
single-page web interface
wizard style web interface
58
Web services
(Diagram labels: MartView, MartService, Local Mart, Remote Mart; ports 80 and 3306; one connection marked X)
59
Web services (cont)
  • MartService requests
  • Registry XML
  • Dataset information: name, type, etc.
  • DatasetConfig XML
  • Mart Query
  • API query object is converted to an XML
    representation on the client and sent to the
    server.
  • Query object is regenerated on the server and
    processed. Results are sent back to the client as a
    simple tab-delimited HTML page.
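
The round trip described above (query object to XML on the client, XML back to a query object on the server, tab-delimited results back to the client) can be sketched with Python's standard library. The element and attribute names (`Query`, `Dataset`, `Filter`, `Attribute`) and all function names are assumptions for illustration; the slides do not show the actual MartService schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def query_to_xml(dataset: str, attributes: List[str], filters: Dict[str, str]) -> str:
    """Client side: serialise a query into an XML string."""
    root = ET.Element("Query")
    ds = ET.SubElement(root, "Dataset", name=dataset)
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", name=name, value=value)
    for name in attributes:
        ET.SubElement(ds, "Attribute", name=name)
    return ET.tostring(root, encoding="unicode")


def xml_to_query(xml_text: str) -> Tuple[str, List[str], Dict[str, str]]:
    """Server side: regenerate the query parts from the XML string."""
    ds = ET.fromstring(xml_text).find("Dataset")
    if ds is None:
        raise ValueError("no Dataset element in query")
    attributes = [a.get("name") for a in ds.findall("Attribute")]
    filters = {f.get("name"): f.get("value") for f in ds.findall("Filter")}
    return ds.get("name"), attributes, filters


def rows_to_tab_delimited(rows: List[List[str]]) -> str:
    """Server side: format result rows as tab-delimited text."""
    return "\n".join("\t".join(row) for row in rows)


sent = query_to_xml("toy_genes", ["gene_name"], {"chromosome_name": "1"})
received = xml_to_query(sent)
print(sent)
print(rows_to_tab_delimited([["geneA"], ["geneC"]]))
```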

60
Summary
  • A generic data management system
  • A set of easily configurable user interfaces
  • Distributed Data federation
  • Query optimization

61
BioMart
  • www.biomart.org
  • Open source (LGPL)
  • Public MySQL server
  • ftp
  • mart-dev_at_ebi.ac.uk
  • mart-announce_at_ebi.ac.uk

62
Acknowledgments
  • BioMart
  • Arek Kasprzyk (EBI)
  • Damian Smedley (EBI)
  • Syed Haider (EBI)
  • Gudmundur Thorisson (CSHL)
  • Contributors
  • Darin London (EBI)
  • Will Spooner (CSHL)
  • Damian Keefe (Ensembl)
  • Arne Stabenau (Ensembl)
  • Andreas Kahari (Ensembl)
  • Craig Melsopp (Ensembl)
  • Katerina Tzouvara (Uniprot)
  • Paul Donlon (Unilever)
  • Steffen Durinck (SCD-ESAT, Katholieke
    Universiteit Leuven)
  • Benoit Ballester (Universite de la Mediterranee)
  • Stephen Robinson (EBI)
  • Asif Kibria (EBI)