Title: Database Issues in Nutritional Genomics


1
Database Issues in Nutritional Genomics
  • Tony Travis
  • Peter Gray
  • Rowett Research Institute
  • University of Aberdeen
  • Jan 2005

2
(No Transcript)
3
Utopian view
  • Share data freely
  • Everyone benefits
  • Ideas develop
  • Science prospers

4
Big pharma disagree!
  • Sell data commercially
  • Big pharma benefits
  • Ideas are exploited
  • Science is a business

5
Scientists are confused
  • Intellectual freedom?
    • Curiosity-driven science
    • Poor funding
  • Intellectual property?
    • Commercially driven science
    • Good funding

6
Preserving intellectual property
  • Autonomy
    • Scientists or institutions control who their data
      is disclosed to
    • Control data sharing by collaborators who share
      their IP
    • Needs federated solution
  • Security
    • Prevent unauthorised access to data
    • Prevent unauthorised use of data
    • Maintain integrity and provenance of data

7
Typical NutriGenomics Use Case
  • Example of pragmatic solution
    • DNA microarray work at RRI
  • Autonomy
    • Data held locally on PC spreadsheets
    • Completely under control of investigator
  • Collaborators
    • Each creates a spreadsheet of local results
    • All collaborators exchange spreadsheets

8
Spreadsheet microarray data
9
Distribution of one spreadsheet
(Diagram: collaborators A, B, C and D)
10
Exchange of all spreadsheets
(Diagram: collaborators A, B, C and D)
11
Manual replication of database
  • Advantages
    • Simple peer-to-peer data transfer via email
    • Each collaborator has entire database locally
    • Local analysis tools are readily available
    • Complete control of IP within collaboration
  • Disadvantages
    • N(N-1) solution: N collaborators need N(N-1)
      transfers
    • Does not scale well
    • Each collaborator must merge data into local
      database replica
    • No control over data integrity or provenance
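
The N(N-1) figure above can be checked with a short sketch. It assumes every collaborator emails its spreadsheet directly to every other collaborator, with no relaying; the function name `transfers_needed` is illustrative and does not come from the presentation.

```python
from itertools import permutations


def transfers_needed(collaborators):
    """Return every (sender, receiver) pair in a full peer-to-peer exchange."""
    # Assumption: each ordered pair is one emailed spreadsheet.
    return list(permutations(collaborators, 2))


sites = ["A", "B", "C", "D"]
pairs = transfers_needed(sites)
print(len(pairs))  # N(N-1) for N = 4

# The count grows quadratically with the number of collaborators.
for n in (4, 10, 50):
    print(n, n * (n - 1))
```

With four collaborators this is manageable; the quadratic growth is what makes the manual approach scale badly.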

12
Spreadsheet Replicated Data Model
  • Distributed
    • Data originates at each collaborator's site
  • Replicated
    • Copy of the entire database at each site
  • Manually updated
    • Each collaborator pushes data and corrections to
      all others by emailing Excel spreadsheets of
      expression data, which are merged into a single
      spreadsheet

13
Local analysis tools: maxd
  • Microarray Bioinformatics Group, University of
    Manchester (UK)
  • Java-based
  • maxdView
    • Visualise and analyse gene expression data
  • maxdLoad2
    • Store and curate gene expression data to MIAME
      standards
    • Export in MAGE-ML format for submission to
      ArrayExpress

14
Import spreadsheet data into maxd
15
Analyse expression profiles
  • 10,000 genes
  • Four experiments by one collaborator
  • Normalised
  • Clustered
  • Comparison of gene expression profiles between
    experiments

16
Upgrade spreadsheet solution
  • maxdLoad2
    • Replace spreadsheets
    • Use MIAME standard
    • JDBC compliant interface
    • SQL92 (MySQL, Postgres)

17
Candidate Mediator middleware
  • maxd
    • Designed for use with single database
  • P/FDM
    • Integration of heterogeneous data sources
    • Federated union/join of relations
  • BioMart
    • MartShell scripting language
    • Federate database instances

18
Example federated DB
19
MartShell
  • Command line (text mode) user interface to
    BioMart that can be used by programs
  • Mart Query Language (MQL)
  • Queries can be executed in batch mode using
    stored procedures in MQL scripts

20
BioArray Software Environment
  • BASE is a comprehensive database server to manage
    massive amounts of data generated by microarray
    analysis
  • Lund University, Oklahoma University
  • Data can be analysed using a web-based GUI to
    server-side PHP scripts or data can be extracted
    from the BASE database by applications such as
    Genespring

21
Querying a Federated DB
  • There are two kinds of distributed query that you
    can send out to the federation
  • Federated Join - like adding extra columns with
    cross-referenced information on the same object
    or related objects.
  • Federated Union - like adding extra rows with the
    same column headings: the same kinds of
    experiments but done at different sites.
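
The two query kinds can be sketched with plain Python dictionaries. This is only an illustration, not how P/FDM or BioMart implement federation; the site names, gene identifiers, values and helper functions are all invented placeholders.

```python
# Hypothetical result sets returned by three sites (placeholder values).
site_a_expression = [
    {"gene_id": "G1", "experiment": "A-1", "log_ratio": 1.2},
    {"gene_id": "G2", "experiment": "A-1", "log_ratio": -0.4},
]
site_b_expression = [
    {"gene_id": "G1", "experiment": "B-7", "log_ratio": 0.9},
]
site_c_annotation = {
    "G1": {"annotation": "note-1"},
    "G2": {"annotation": "note-2"},
}


def federated_union(*result_sets):
    """Stack rows that share the same column headings (adds rows)."""
    rows = []
    for result_set in result_sets:
        rows.extend(result_set)
    return rows


def federated_join(rows, lookup, key):
    """Attach cross-referenced columns via a shared identifier (adds columns)."""
    return [{**row, **lookup[row[key]]} for row in rows if row[key] in lookup]


union_rows = federated_union(site_a_expression, site_b_expression)
joined_rows = federated_join(union_rows, site_c_annotation, "gene_id")
print(len(union_rows), sorted(joined_rows[0]))
```

The union only makes sense because both sites use the same column headings, and the join only works because `gene_id` means the same thing at every site.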

22
Comparing expression profiles (e.g. looking for
co-regulation)
23
Conditions for making a Federated DB work
  • Needs Common Ontology for data of same type.
    BEWARE measurements made in different units, or
    using a very different experimental procedure, or
    qualitative measurements such as
    "large", "medium"

24
Conditions for making a Federated DB work
  • Need Common Unique Identifiers: if no property
    allows you to tell that one entity instance is
    the same as another, then integration is UNSAFE!
  • (Note - it might be OK for, say, 95 percent of
    identifiers...)
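
One way to see how far two sites agree on identifiers is to measure the overlap before attempting a join. This is a minimal sketch; the function name `identifier_overlap` and the sample identifiers are assumptions made for illustration.

```python
def identifier_overlap(local_ids, remote_ids):
    """Return the share of local identifiers also present remotely,
    plus the local identifiers that have no remote match."""
    local, remote = set(local_ids), set(remote_ids)
    shared = local & remote
    unmatched = sorted(local - remote)
    coverage = len(shared) / len(local) if local else 0.0
    return coverage, unmatched


# Hypothetical identifier lists from two sites.
coverage, unmatched = identifier_overlap(
    ["G1", "G2", "G3", "G4"],
    ["G1", "G2", "G3", "G9"],
)
print(coverage, unmatched)
```

The unmatched list is the part that needs attention before integration can be considered safe.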

25
Conditions for making a Federated DB work
  • Mechanisation of Value mapping
    • If data values can only be compared or made
      compatible with others using the judgement of an
      experienced scientist, then one must use a
      Warehouse (as in early PDB).
    • Otherwise, if you can mechanise it using rules or
      equations, then it can be done by a view, or by
      a mediator accessing the Federation.
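
A mechanised value mapping might look like the sketch below: a table of conversion rules applied by a generator that behaves like a view over the source rows. The units, factors, field names and function names are assumptions chosen for illustration. A qualitative value such as "large" has no rule and is rejected, which is the case that needs human judgement.

```python
# Hypothetical conversion rules: factor that converts each unit to grams.
TO_GRAMS = {"g": 1.0, "mg": 1e-3, "ug": 1e-6}


def to_grams(value, unit):
    """Apply a mechanical conversion rule, or fail if none exists."""
    if unit not in TO_GRAMS:
        raise ValueError(f"no mechanical rule for unit {unit!r}")
    return value * TO_GRAMS[unit]


def mapped_view(rows):
    """Yield each source row re-expressed in the common unit."""
    for row in rows:
        yield {
            "sample": row["sample"],
            "mass_g": to_grams(row["value"], row["unit"]),
        }


source_rows = [
    {"sample": "s1", "value": 250.0, "unit": "mg"},
    {"sample": "s2", "value": 0.5, "unit": "g"},
]
view = list(mapped_view(source_rows))
print(view)
```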

26
Conditions for making a Federated DB work
  • Need Standard Interchange Formats
  • Formats such as mmCIF helped reduce human
    intervention in PDB. The widely used MIAME format
    may do the same for MicroArray Data.
  • However, such data is much harder to integrate as
    it may be measured under different conditions
    with different technology.

27
Difficulties of Federated Approach
  • Reliability - Sites must be available
    continuously, and not crash too often
  • Support costs - must be proof against virus
    attacks, etc., and have people able to bring them
    back up again promptly

28
Difficulties of Federated Approach
  • Compatibility - must provide a common interface.
    May be able to share development of some
    downloadable server software (like Java
    WebStart) that responds to SOAP protocol messages
    and commands, is configurable through web forms,
    and keeps logs of errors.

29
Difficulties of Federated Approach
  • Performance - Warehouses will provide better
    performance for data mining programs and other
    programs with a high hit rate.
  • Federated systems compete well on more focused
    queries which allow the use of indexes in remote
    systems.

30
Having it Both Ways
  • A Federated Solution can include some sites that
    are adopting Warehouse technology to collect and
    vet large volumes of data of a particular kind.
  • The NUGO data model and ontologies are bound to
    change a lot in ways we cannot foresee. Thus it
    makes sense to be flexible to start, allowing
    site autonomy, and to delay committing to large
    warehouses until we understand more about the
    data model and IPR issues.

31
Discovering the Model
  • Birney & Clamp (2004) say "the true biological
    interpretation of data stored in a database will
    change over time, and discovering new
    relationships between aspects of the data is an
    important part of the motivation for storing
    it..."

32
(No Transcript)
33
Conclusion (1) - Spreadsheets
  • Spreadsheets are easy and popular
  • Integrating Spreadsheets manually is time-wasting
    and can easily lead to errors and wrong
    conclusions
  • Scientists need the discipline of a shared Data
    Model and the automation of data transfer and
    conversion, usually provided by a Mediator

34
Conclusion (2) - Shared Data Model
  • Agreement on a shared Ontology is mainly a
    problem of agreeing Standards for names, units,
    and specialised types.
  • Agreeing a shared Data Model is more subtle. It
    may need experimentation in advance of a
    standard.
  • The Data Model, based on Entity-Relationship
    Model with SubTypes, must be able to evolve - not
    fixed in stone, coping with the unforeseen.

35
Conclusion (2) - Shared Data Model
  • The Data Model must be at Conceptual Level -
    independent of Storage Technique - arrays,
    ASN.1, XML, tables etc... Otherwise agreeing a
    Shared Model becomes too hard!
  • The Data Model must provide External Views both
    to restrict access and to provide a consistent
    API to External Applications: these may be
    Spreadsheets or Statistical Packages or maxd or
    Genespring etc...
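
An external view can be sketched with an SQL view, here using Python's built-in `sqlite3` module and an in-memory database. The table, column and view names are invented for illustration. SQLite itself does not enforce per-user permissions, so this only shows the shape of the idea: a server DBMS would additionally grant external applications access to the view rather than to the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE expression (
        gene_id TEXT, experiment TEXT, log_ratio REAL, owner_notes TEXT
    );
    INSERT INTO expression VALUES ('G1', 'E1', 1.2, 'unpublished');
    INSERT INTO expression VALUES ('G2', 'E1', -0.4, 'unpublished');
    -- External view: exposes only the columns collaborators may see.
    CREATE VIEW shared_expression AS
        SELECT gene_id, experiment, log_ratio FROM expression;
""")

# External applications query the view, not the underlying table.
cursor = conn.execute("SELECT * FROM shared_expression ORDER BY gene_id")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(columns)
print(rows)
```

Because applications depend only on the view, the underlying storage can change without breaking them.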

36
Conclusion (3) - Federating Microarray Data
  • Usually, a federation is based on a federated
    Join, through common identifiers, because
    irrelevant joins can be left out, to speed up the
    query.
  • Federated Joins suit integrating other types of
    data with Microarray data, e.g. physiological,
    epidemiological data
  • This is easily done on the fly. It allows us to
    evolve the data model and experiment with it
    without making changes to a centralised
    warehouse. Once the data model is more stable,
    parts of it can be stored in a warehouse.

37
Conclusion (3) - Federating Microarray Data
  • Queries that want to compare Gene Expression
    Profiles across many Experiments need a federated
    Union of data from different experimenters.
  • Comparing one profile against those from many
    experimental sites could be done in parallel.
    Trusted methods could work with an encrypted
    profile to keep it confidential.
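
Comparing one profile against several sites in parallel could be sketched as follows. The sites are simulated by an in-memory dictionary, Pearson correlation is used as one possible similarity measure (the presentation does not name one), and all names and numbers are placeholders. Encryption of the query profile is not covered here.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical expression profiles held at each site.
site_profiles = {
    "site_A": [1.0, 2.0, 3.0, 4.0],
    "site_B": [4.0, 3.0, 2.0, 1.0],
    "site_C": [2.0, 4.0, 6.0, 8.0],
}
query_profile = [0.5, 1.0, 1.5, 2.0]


def compare_at_site(item):
    """Stand-in for a remote call: score the query against one site."""
    site, profile = item
    return site, pearson(query_profile, profile)


# Each site is scored independently, so the work can run concurrently.
with ThreadPoolExecutor() as pool:
    scores = dict(pool.map(compare_at_site, site_profiles.items()))

for site, score in sorted(scores.items()):
    print(site, round(score, 3))
```

The results from all sites form a federated union of scores that can then be ranked locally.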

38
Conclusion (4) - IPR and Federation
  • Scientists want to retain their autonomy and
    right to recognised authorship of the data,
    otherwise they may not share it!
  • If Database Right (EU proposal) becomes
    established, scientists may wish to keep data in
    their own DB in order to take advantage of it.
    Thus we may need to make more use of federated
    techniques to bring such data together.
  • Revenue-Raising Potential may become important
    (iTunes for example).