Database Issues in Nutritional Genomics

1
Database Issues in Nutritional Genomics

Tony Travis
Peter Gray
Rowett Research Institute
University of Aberdeen
Jan 2005

2
(No Transcript)
3
Utopian view

Share data freely
Everyone benefits
Ideas develop
Science prospers

4
Big pharma disagree!

Sell data commercially
Big pharma benefits
Ideas are exploited
Science is a business

5
Scientists are confused

Intellectual freedom?
Curiosity driven science
Poor funding
Intellectual property?
Commercially driven science
Good funding

6
Preserving intellectual property

Autonomy
Scientists or institutions control who their data is disclosed to
Control data sharing by collaborators who share their IP
Needs a federated solution

Security
Prevent unauthorised access to data
Prevent unauthorised use of data
Maintain integrity and provenance of data

7
Typical NutriGenomics Use Case

Example of pragmatic solution
DNA microarray work at RRI
Autonomy
Data held locally on PC spreadsheets
Completely under control of investigator
Collaborators
Each creates a spreadsheet of local results
All collaborators exchange spreadsheets

8
Spreadsheet microarray data
9
Distribution of one spreadsheet
(Diagram: collaborators A, B, C and D)
10
Exchange of all spreadsheets
(Diagram: collaborators A, B, C and D)
11
Manual replication of database

Advantages
Simple peer-to-peer data transfer via email
Each collaborator has entire database locally
Local analysis tools are readily available
Complete control of IP within collaboration

Disadvantages
N(N-1) solution: each of N collaborators must send to the other N-1
Does not scale well
Each collaborator must merge data into a local database replica
No control over data integrity or provenance
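
The N(N-1) figure above can be made concrete with a small calculation. This is only an illustrative sketch: the function name is made up here, and it assumes one spreadsheet email per ordered pair of collaborators per update round.

```python
def transfers_per_round(n_sites: int) -> int:
    """Directed spreadsheet transfers when every site emails every other site once."""
    return n_sites * (n_sites - 1)


# Growth is quadratic: adding collaborators quickly multiplies the email traffic.
for n in (4, 10, 50):
    print(n, "sites ->", transfers_per_round(n), "transfers")
```

With the four collaborators A to D this is already 12 transfers per round, and each one ends in a manual merge.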

12
Spreadsheet Replicated Data Model

Distributed
Data originates at each collaborator's site
Replicated
Copy of the entire database at each site
Manually updated
Data and corrections are pushed from each collaborator to all others by emailing Excel spreadsheets of expression data, which each recipient merges into a single spreadsheet
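
The manual merge step could, in principle, be scripted. The sketch below is an assumption-laden illustration rather than the procedure used at RRI: the row layout (gene_id, experiment, value), the site names and the `merge_replicas` helper are all hypothetical. It keeps the originating site with every value and reports disagreements instead of silently overwriting them.

```python
def merge_replicas(sheets):
    """Merge per-site rows into one table keyed by (gene_id, experiment).

    `sheets` maps a site name to a list of (gene_id, experiment, value) rows.
    Returns (merged, conflicts); each merged value keeps its originating site.
    """
    merged = {}
    conflicts = []
    for site, rows in sheets.items():
        for gene_id, experiment, value in rows:
            key = (gene_id, experiment)
            if key in merged and merged[key][0] != value:
                # Two sites disagree: record it for a human to resolve.
                conflicts.append((key, merged[key], (value, site)))
                continue
            merged.setdefault(key, (value, site))
    return merged, conflicts


# Hypothetical data: site B disagrees with site A on g2/exp1.
sheets = {
    "A": [("g1", "exp1", 2.5), ("g2", "exp1", 0.8)],
    "B": [("g1", "exp2", 1.9), ("g2", "exp1", 0.9)],
}
merged, conflicts = merge_replicas(sheets)
print(len(merged), "merged rows;", len(conflicts), "conflict(s)")
```

Recording the source site alongside each value is one simple way to retain some provenance that a merged spreadsheet would otherwise lose.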

13
Local analysis tools: maxd

Microarray Bioinformatics Group, University of Manchester (UK)
Java-based
maxdView
Visualise and analyse gene expression data.
maxdLoad2
Store and curate gene expression data to MIAME
standards
Export in MAGE/ML format for submission to
ArrayExpress.

14
Import spreadsheet data into maxd
15
Analyse expression profiles

10,000 genes
Four experiments by one collaborator
Normalised
Clustered
Comparison of gene expression profiles between
experiments

16
Upgrade spreadsheet solution

maxdLoad2
Replace spreadsheets
Use MIAME standard
JDBC compliant interface
SQL92 (MySQL, Postgres)
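
To illustrate what moving from spreadsheets to an SQL store buys, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for MySQL or Postgres, and the `measurement` table is a hypothetical layout, not the maxdLoad2 schema. The point is that a primary key lets the database reject duplicate entries that a spreadsheet would accept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; column names are assumptions for illustration only.
conn.execute("""
    CREATE TABLE measurement (
        gene_id    VARCHAR(32) NOT NULL,
        experiment VARCHAR(32) NOT NULL,
        site       VARCHAR(32) NOT NULL,
        log_ratio  REAL NOT NULL,
        PRIMARY KEY (gene_id, experiment, site)
    )
""")
rows = [("g1", "exp1", "A", 2.5), ("g2", "exp1", "A", 0.8), ("g1", "exp2", "B", 1.9)]
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?)", rows)

# A second value for the same gene/experiment/site violates the primary key.
try:
    conn.execute("INSERT INTO measurement VALUES ('g1', 'exp1', 'A', 9.9)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

per_gene = conn.execute(
    "SELECT gene_id, COUNT(*) FROM measurement GROUP BY gene_id ORDER BY gene_id"
).fetchall()
print(duplicate_rejected, per_gene)
```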

17
Candidate Mediator middleware

maxd
Designed for use with single database
P/FDM
Integration of heterogeneous data sources
Federated union/join of relations
BioMart
MartShell scripting language
Federate database instances

18
Example federated DB
19
MartShell

Command-line (text mode) user interface to BioMart that can be used by programs
Mart Query Language (MQL)
Queries can be executed in batch mode using
stored procedures in MQL scripts

20
BioArray Software Environment

BASE is a comprehensive database server to manage
massive amounts of data generated by microarray
analysis
Lund University, Oklahoma University
Data can be analysed using a web-based GUI to server-side PHP scripts, or data can be extracted from the BASE database by applications such as Genespring

21
Querying a Federated DB

There are two kinds of distributed query that you can send out to the federation:
Federated Join - like adding extra columns with cross-referenced information on the same object or related objects.
Federated Union - like adding extra rows with the same column headings: the same kinds of experiments but done at different sites.
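
The two query kinds can be sketched with plain Python data structures. Everything here is illustrative: the site contents, the pathway annotations and the helper names `federated_union` and `federated_join` are invented for the example, and real sites would be remote databases rather than in-memory lists.

```python
# Two hypothetical sites holding the same kind of experiment (same columns).
site_a = [
    {"gene_id": "g1", "experiment": "a_exp1", "log_ratio": 2.5},
    {"gene_id": "g2", "experiment": "a_exp1", "log_ratio": 0.8},
]
site_b = [{"gene_id": "g1", "experiment": "b_exp1", "log_ratio": 1.9}]

# A third hypothetical site holding cross-referenced information per gene.
annotation_site = {
    "g1": {"pathway": "lipid metabolism"},
    "g2": {"pathway": "glycolysis"},
}


def federated_union(*sites):
    """Extra rows: same column headings, rows from different sites, tagged by source."""
    return [dict(row, site=name) for name, rows in sites for row in rows]


def federated_join(rows, lookup, key="gene_id"):
    """Extra columns: attach cross-referenced attributes through the shared key."""
    return [dict(row, **lookup[row[key]]) for row in rows if row[key] in lookup]


union = federated_union(("A", site_a), ("B", site_b))
joined = federated_join(union, annotation_site)
for row in joined:
    print(row)
```

The union grows the result downwards (three rows from two sites), while the join widens each row with a `pathway` column looked up by `gene_id`.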

22
Comparing expression profiles (e.g. looking for co-regulation)
23
Conditions for making a Federated DB work

Needs a Common Ontology for data of the same type.
BEWARE measurements made in different units, or using a very different experimental procedure, or qualitative measurements such as "large" .. "medium"

24
Conditions for making a Federated DB work

Need Common Unique Identifiers - if no property allows you to tell that one entity instance is the same as another then integration is UNSAFE!
(Note - it might be OK for say 95 percent of identifiers...)
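
One way to act on this condition is to measure identifier overlap before attempting a join. The sketch below is hypothetical: the helper name and the sample identifiers are made up, and the 95 percent threshold is simply the figure mentioned on the slide, not an established cut-off.

```python
def identifier_coverage(local_ids, remote_ids):
    """Return (fraction of local identifiers also found remotely, unmatched ids)."""
    local, remote = set(local_ids), set(remote_ids)
    if not local:
        return 0.0, []
    return len(local & remote) / len(local), sorted(local - remote)


# Hypothetical identifier lists from two sites.
coverage, unmatched = identifier_coverage(
    ["g1", "g2", "g3", "g4"], ["g1", "g2", "g3", "g9"]
)

THRESHOLD = 0.95  # assumption taken from the "say 95 percent" remark
safe_to_join = coverage >= THRESHOLD
print(f"coverage={coverage:.2f} unmatched={unmatched} safe_to_join={safe_to_join}")
```

Listing the unmatched identifiers, rather than only the percentage, gives a curator something concrete to check.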

25
Conditions for making a Federated DB work

Mechanisation of Value mapping
If data values can only be compared or made compatible with others using the judgement of an experienced scientist, then one must use a Warehouse (as in early PDB);
otherwise, if you can mechanise it using rules or equations then it can be done by a view, or by a mediator accessing the Federation
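
A mechanised value mapping might look like the following sketch, in which each unit has a rule that converts it to a common scale. The unit names and the choice of log2 ratio as the target are assumptions made for illustration. Values with no rule, such as qualitative labels, are rejected so that they can go to a human curator instead.

```python
import math

# Hypothetical conversion rules: every supported unit maps onto a log2 ratio.
RULES = {
    "log2_ratio": lambda v: v,
    "fold_change": lambda v: math.log2(v),
    "log10_ratio": lambda v: v / math.log10(2),
}


def to_log2_ratio(value, unit):
    """Apply the rule for `unit`; fail loudly when no mechanical rule exists."""
    try:
        rule = RULES[unit]
    except KeyError:
        raise ValueError(f"no mechanical rule for unit {unit!r}; needs curation") from None
    return rule(value)


print(to_log2_ratio(4.0, "fold_change"))  # a four-fold change on the log2 scale
```

In a relational setting the same rules could be written once as a view, so that every application sees values on the common scale.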

26
Conditions for making a Federated DB work

Need Standard Interchange Formats
Formats such as mmCIF helped reduce human intervention in PDB. The widely used MIAME format may do the same for MicroArray Data.
However, such data is much harder to integrate as it may be measured under different conditions with different technology.

27
Difficulties of Federated Approach

Reliability - Sites must be available continuously, and not crash too often
Support costs - must be proof against virus attacks, etc., and have people able to bring them back up again promptly

28
Difficulties of Federated Approach

Compatibility - must provide a common interface - may be able to share development of some downloadable server software (like Java WebStart), responding to SOAP protocol messages and commands, configurable through web forms, that keeps logs of errors.

29
Difficulties of Federated Approach

Performance - Warehouses will provide better performance for data mining programs and other programs with a high hit rate.
Federated systems compete well on more focused
queries which allow the use of indexes in remote
systems.

30
Having it Both Ways

A Federated Solution can include some sites that
are adopting Warehouse technology to collect and
vet large volumes of data of a particular kind.
The NUGO data model and ontologies are bound to
change a lot in ways we cannot foresee. Thus it
makes sense to be flexible to start, allowing
site autonomy, and to delay committing to large
warehouses until we understand more about the
data model and IPR issues.

31
Discovering the Model

Birney and Clamp (2004) say "the true biological interpretation of data stored in a database will change over time, and discovering new relationships between aspects of the data is an important part of the motivation for storing it."

32
(No Transcript)
33
Conclusion (1) - Spreadsheets

Spreadsheets are easy and popular
Integrating Spreadsheets manually is time-wasting and can easily lead to errors and wrong conclusions
Scientists need the discipline of a shared Data
Model and the automation of data transfer and
conversion, usually provided by a Mediator

34
Conclusion (2) - Shared Data Model

Agreement on a shared Ontology is mainly a
problem of agreeing Standards for names, units,
and specialised types.
Agreeing a shared Data Model is more subtle. It
may need experimentation in advance of a
standard.
The Data Model, based on the Entity-Relationship Model with SubTypes, must be able to evolve - not fixed in stone - coping with the unforeseen.

35
Conclusion (2) - Shared Data Model

The Data Model must be at Conceptual Level - independent of Storage Technique - arrays, ASN.1, XML, tables etc... Otherwise agreeing a Shared Model becomes too hard!
The Data Model must provide External Views both to restrict access and to provide a consistent API to External Applications; these may be Spreadsheets or Statistical Packages or maxd or Genespring etc...
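
An External View in the relational sense can be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in for a server database, and the table, the `released` flag and the view name are all hypothetical. Note that SQLite has no user permissions, so the view here only shows the shape of the idea; in a server database one would grant external users access to the view and not to the underlying table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical base table; 'released' marks rows the owner agrees to share.
    CREATE TABLE measurement (
        gene_id TEXT, experiment TEXT, log_ratio REAL, released INTEGER
    );
    INSERT INTO measurement VALUES
        ('g1', 'exp1', 2.5, 1),
        ('g2', 'exp1', 0.8, 0),
        ('g1', 'exp2', 1.9, 1);

    -- External view: a stable set of columns, restricted to released rows.
    CREATE VIEW shared_measurement AS
        SELECT gene_id, experiment, log_ratio
        FROM measurement
        WHERE released = 1;
""")

shared = conn.execute(
    "SELECT gene_id, experiment, log_ratio FROM shared_measurement ORDER BY experiment"
).fetchall()
print(shared)
```

Because applications query the view rather than the table, the storage layout underneath can change without breaking them.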

36
Conclusion (3) - Federating Microarray Data

Usually, a federation is based on a federated
Join, through common identifiers, because
irrelevant joins can be left out, to speed up the
query.
Federated Joins suit integrating other types of
data with Microarray data, e.g. physiological,
epidemiological data
This is easily done on the fly; it allows us to evolve the data model and experiment with it without making changes to a centralised warehouse. Once the data model is more stable, parts of it can be stored in a warehouse.

37
Conclusion (3) - Federating Microarray Data

Queries that want to compare Gene Expression
Profiles across many Experiments need a federated
Union of data from different experimenters.
Comparing one profile against those from many
experimental sites could be done in parallel.
Trusted methods could work with an encrypted
profile to keep it confidential.
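
The parallel comparison could be sketched as below. This is only an illustration under several assumptions: the sites are simulated as in-memory dictionaries rather than remote services, Pearson correlation is used as one possible similarity measure, and all gene names and values are made up. It does not attempt the encrypted-profile idea.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical per-site profile stores; in practice each would be a remote query.
SITES = {
    "A": {"g1": [1.0, 2.0, 3.0, 4.0], "g2": [4.0, 3.0, 2.0, 1.0]},
    "B": {"g7": [2.0, 4.0, 6.0, 8.5]},
}


def best_match_at_site(site_name, query):
    """Run the comparison locally at one site; only the best match leaves the site."""
    gene, score = max(
        ((g, pearson(query, p)) for g, p in SITES[site_name].items()),
        key=lambda pair: pair[1],
    )
    return site_name, gene, score


def compare_everywhere(query):
    """Send the query profile to every site concurrently and collect the answers."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda s: best_match_at_site(s, query), SITES))


results = compare_everywhere([1.0, 2.0, 3.0, 4.0])
for site, gene, score in results:
    print(site, gene, round(score, 3))
```

Each site returns only its best-matching gene and score, so the full local data never has to leave the site, which fits the autonomy requirement.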

38
Conclusion (4) - IPR and Federation

Scientists want to retain their autonomy and
right to recognised authorship of the data,
otherwise they may not share it!
If Database Right (EU proposal) becomes
established, scientists may wish to keep data in
their own DB in order to take advantage of it.
Thus we may need to make more use of federated
techniques to bring such data together.
Revenue-Raising Potential may become important
(iTunes for example).