Database and R Interfacing for Annotated Microarray Data - PowerPoint PPT Presentation

1
Database and R Interfacing for Annotated Microarray Data
  • Michael Mader
  • http://mips.gsf.de

2
Microarray data management
  • public central databases like GEO and
    ArrayExpress
  • open-source distributed management systems (SMD,
    TM4)
  • repositories with mining capabilities,
    comprehensive annotation and (www-)statistics
    (GeneX, ...)

3
ME - System Architecture
  • Graphics/statistics
  • (Re-)annotation system
  • Data retrieval
  • Upload interface
  • Authentication module
  • R statistical engine
  • DB/Oracle
  • CORBA, bioMoby, ...
  • Interfaces

4
ME DB structure
  • Oracle or DB2 (transaction control)
  • Entity-Attribute-Value systems in grey

Schema blocks: Biomaterial, Core Data, Arrays, Element anno
Simplified schema, 45 tables element anno system
5
Interface Requirements
  • fast (time does matter for online applications)
  • flexible
  • transparent
  • handling EAV-systems
  • fast even with complex SQL statements
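
To make the "handling EAV-systems" requirement concrete, here is a minimal sketch of what an interface has to do with entity-attribute-value rows: turn many narrow rows into one attribute map per entity. The struct, the function name `pivot`, and the attribute names are hypothetical and are not taken from the ME schema.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One row of a hypothetical entity-attribute-value (EAV) annotation table.
struct EavRow {
    int entity;             // e.g. an array element id (assumed)
    std::string attribute;  // e.g. "gene_symbol" (assumed)
    std::string value;      // stored as text, as is common in EAV tables
};

using Annotation = std::map<std::string, std::string>;

// Pivot EAV rows into one attribute map per entity.
std::map<int, Annotation> pivot(const std::vector<EavRow>& rows) {
    std::map<int, Annotation> out;
    for (const EavRow& r : rows) {
        assert(!r.attribute.empty());          // an attribute needs a name
        out[r.entity][r.attribute] = r.value;  // a later duplicate overwrites an earlier one
    }
    return out;
}
```

Calling `pivot` on three rows for two entities yields two maps; attributes that an entity lacks are simply absent, which is the sparseness that makes EAV storage attractive and its retrieval awkward.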

9
Queries and analysis pipelines
  • Simple ones by biologists and medical staff
    without statistical background (level 1 users)
  • Advanced pipelines by lab researchers with
    statistical background (level 2 users)
  • Sophisticated queries by bioinformaticians and
    statisticians (level 3 users)

10
Analysis pipelines II
[Figure: run-time contributions on a small local DB server,
with ROracle/Rdbi as interface]
11
Analysis pipelines III
  • Level 1 tasks/users are extremely frequent in
    microarray repositories
  • Data retrieval has considerable impact
  • Is ROracle/Rdbi the best choice for this?

16
The prepare issue
  • Oracle prepares are good if they are rare
    (cost-based optimizer)
  • repeated prepares are still OK if there is
    nothing to optimize (primitive SQL)
  • repeated prepares are even OK if you stick to the
    rule-based optimizer
  • repeated prepares are wasted time in mining-like
    SQL queries (cost-based optimizer, complex SQL)
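
The difference between re-preparing a statement on every call and preparing it once with a bind variable can be illustrated with a toy counter. This is only a sketch: `MockSession` and `count_prepares` are hypothetical names, nothing here talks to Oracle, and the model simply assumes that each prepare of a complex statement costs one pass through the optimizer.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy stand-in for a database session. It only counts how often a statement
// would be parsed and optimized ("prepared"); it is not a real Oracle or OTL API.
class MockSession {
public:
    // Naive path: every call prepares the SQL text again.
    void run_unprepared(const std::string& sql, int bind_value) {
        ++prepares_;
        execute(sql, bind_value);
    }

    // Cached path: prepare once per distinct SQL text, then only rebind.
    void run_cached(const std::string& sql, int bind_value) {
        if (prepared_.insert(sql).second) ++prepares_;  // true only on first sight
        execute(sql, bind_value);
    }

    int prepares() const { return prepares_; }
    int executions() const { return executions_; }

private:
    void execute(const std::string& sql, int /*bind_value*/) {
        assert(!sql.empty());
        ++executions_;
    }

    std::set<std::string> prepared_;
    int prepares_ = 0;
    int executions_ = 0;
};

// Run the same parameterized statement for n ids and report the prepare count.
int count_prepares(int n, bool reuse) {
    MockSession s;
    const std::string sql = "SELECT val, bkg FROM expr.data WHERE f_key = :id";
    for (int id = 0; id < n; ++id) {
        if (reuse) s.run_cached(sql, id);
        else       s.run_unprepared(sql, id);
    }
    return s.prepares();
}
```

With 100 lookups the naive path prepares 100 times and the cached path once. How much that matters in practice depends on how expensive each prepare is, which is the point the bullets above make about complex SQL under the cost-based optimizer.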

17
Status
  • Data retrieval has considerable impact,
    especially for the common level 1 tasks
  • repeated prepares (ROracle) don't scale
  • retrieval from EAV systems turns out to be
    difficult and resource-intensive

18
OTL
  • open-source template library
  • implements large parts of OCI
  • high-level DB interface programming
  • for Oracle, DB2, ODBC, ...
  • by Sergei Kuchin, http://otl.sourceforge.net

19
OTL implementation
  • uses SQL binds and streams directly into STL
    vector-based objects
  • separated handling of expression data (relational
    data) and annotation (EAV data, DB-links, ...)
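
As an illustration of streaming relational expression data into a vector-based object, the sketch below fills a dense column-major buffer, which is the memory layout R uses for matrices. The record type, field names, and `to_matrix` are hypothetical, and cells that are never fetched are left at 0.0 here purely for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One fetched expression value; field names are hypothetical.
struct ExprValue {
    std::size_t element;  // row index (array element), 0-based
    std::size_t array;    // column index (hybridization), 0-based
    double value;
};

// Dense matrix stored column by column, as R stores its matrices.
struct Matrix {
    std::size_t nrow;
    std::size_t ncol;
    std::vector<double> data;
    double at(std::size_t r, std::size_t c) const { return data[c * nrow + r]; }
};

// Place each fetched value at its column-major position.
Matrix to_matrix(const std::vector<ExprValue>& rows,
                 std::size_t nrow, std::size_t ncol) {
    Matrix m{nrow, ncol, std::vector<double>(nrow * ncol, 0.0)};
    for (const ExprValue& e : rows) {
        assert(e.element < nrow && e.array < ncol);  // indices must fit the matrix
        m.data[e.array * nrow + e.element] = e.value;
    }
    return m;
}
```

Because the buffer is already in column-major order, it could in principle be handed to R without a further reshaping step; annotation (EAV) data would take a separate path, as the slide notes.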

21
OTL simplistic example
  vector<float> select(int idl) {
    otl_stream i(1000, /* buffer size */
                 "SELECT val,bkg FROM expr.data WHERE f_key = :var<int>",
                 db);
    float val, bkg;
    vector<float> v;
    i << idl; /* bind variable */
    while (!i.eof()) {
      i >> val >> bkg;
      v.push_back(val - bkg);
    }
    return v;
  }
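
The read loop on this slide can be tried without a database by swapping the OTL stream for an in-memory stand-in. In the sketch below, `FakeStream` and `subtract_background` are hypothetical names; the class mimics only the `eof()` / `operator>>` usage pattern and is not part of OTL.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for a row stream. It serves a flat list of
// floats in order, the way a two-column result set would arrive value by value.
class FakeStream {
public:
    explicit FakeStream(std::vector<float> values) : values_(std::move(values)) {}

    bool eof() const { return pos_ >= values_.size(); }

    FakeStream& operator>>(float& x) {
        assert(!eof());  // reading past the end is a caller error
        x = values_[pos_++];
        return *this;
    }

private:
    std::vector<float> values_;
    std::size_t pos_ = 0;
};

// Read (val, bkg) pairs and return the background-corrected values.
std::vector<float> subtract_background(FakeStream& in) {
    std::vector<float> v;
    float val, bkg;
    while (!in.eof()) {
        in >> val >> bkg;
        v.push_back(val - bkg);
    }
    return v;
}
```

Feeding the stand-in the values 10, 2, 7.5, 0.5 gives two corrected values, 8 and 7. The stand-in assumes an even number of values, matching the two selected columns.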

22
Performance
  • much faster than the RefCursor (PL/SQL) approach (on
    small local DB servers)
  • OTL-based interface is 40 faster than ROracle
    for typical microarray data (20,000100 10
    anno values)
  • memory consumption is at least 20 lower (but not
    an issue)
  • no excessive type checking needed in the R code

23
Future Developments
  • bidirectional interface for R result objects
  • generalization of the interface and public
    open-source release
  • Mining on C level (no problems with excessively
    large data objects in R)?
  • switch from HTML to XML on output level

24
Summary
  • fast and flexible user-transparent interface
    needed in bioinformatics
  • higher performance due to bindings might be
    crucial
  • restructuring heterogeneous data in C can
    be faster and less memory-intensive
  • OTL offers nice opportunities for
    vector/matrix-oriented retrieval

25
Acknowledgements
  • Alexey Antonov
  • Jan Budczies
  • Matthias Oesterheld
  • Sabine Tornow
  • Werner Mewes