Database and R Interfacing for Annotated Microarray Data - PowerPoint PPT Presentation

About This Presentation

Title:

Database and R Interfacing for Annotated Microarray Data

Description:

public central databases like GEO and ArrayExpress ... Acknowledgements. Alexey Antonov. Jan Budczies. Matthias Oesterheld. Sabine Tornow. Werner Mewes ... – PowerPoint PPT presentation

Transcript and Presenter's Notes

Title: Database and R Interfacing for Annotated Microarray Data

1
Database and R Interfacing for Annotated
Microarray Data

Michael Mader
http://mips.gsf.de

2
Microarray data management

public central databases like GEO and
ArrayExpress
open-source distributed management systems (SMD,
TM4)
repositories with mining capabilities,
comprehensive annotation and (www-)statistics
(GeneX, ...)

3
ME - System Architecture

[Architecture diagram. Component labels: Graphics/statistics, (Re-)annotation system, Data retrieval, Upload interface, Authentication module, R statistical engine, DB/Oracle, Interfaces, CORBA, bioMoby, ...]

4
ME DB structure

Oracle or DB2 (transaction control)
Entity-Attribute-Value systems in grey

[Schema diagram. Table groups: Biomaterial, Core Data, Arrays, Element anno. Caption: Simplified schema, 45 tables element anno system]

5
Interface Requirements

fast (time matters for online applications)
flexible
transparent
able to handle EAV systems
fast even with complex SQL statements

9
Queries and analysis pipelines

Simple ones by biologists and medical staff
without statistical background (level 1 users)
Advanced pipelines by lab researchers with
statistical background (level 2 users)
Sophisticated queries by bioinformaticians and
statisticians (level 3 users)

10
Analysis pipelines II

run-time contributions on small local DB server,
ROracle/Rdbi as interface

11
Analysis pipelines III

Level 1 tasks/users are extremely frequent in
microarray repositories
Data retrieval has considerable impact
Is ROracle/Rdbi the best choice for this?

16
The prepare issue

Oracle prepares are good if they are rare
(cost-based optimizer)
repeated prepares are still OK if there is
nothing to optimize (primitive SQL)
repeated prepares are even OK if you stick to the
rule-based optimizer
repeated prepares are wasted time in mining-like
SQL queries (cost-based optimizer, complex SQL)
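
The effect of repeated prepares can be illustrated with a minimal sketch. It is not Oracle, OTL or ROracle code: `PreparedStatement`, `StatementCache` and `count_prepares` are hypothetical names, and the "prepare" step is only counted, not performed. The point it shows is that a query with a bind placeholder keeps the same SQL text for every id and can be prepared once, while pasting each id into the text as a literal forces a new prepare per call.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a prepared statement; a real one would hold the
// parsed statement and the execution plan chosen by the optimizer.
struct PreparedStatement {
    std::string sql;
    int executions;
};

// Caches statements by SQL text so each distinct text is prepared only once.
class StatementCache {
public:
    // Returns the cached statement, "preparing" it on first use.
    PreparedStatement& get(const std::string& sql) {
        auto it = cache_.find(sql);
        if (it == cache_.end()) {
            ++prepare_count_;  // the costly step for complex SQL
            it = cache_.emplace(sql, PreparedStatement{sql, 0}).first;
        }
        return it->second;
    }

    std::size_t prepare_count() const { return prepare_count_; }

private:
    std::unordered_map<std::string, PreparedStatement> cache_;
    std::size_t prepare_count_ = 0;
};

// Simulates one query per id and returns how many prepares were needed.
// With a bind placeholder the SQL text is identical for every id; with the
// id pasted in as a literal, every id yields a new text to prepare.
std::size_t count_prepares(int n_ids, bool use_bind) {
    StatementCache cache;
    const std::string base = "SELECT val, bkg FROM expr.data WHERE f_key = ";
    for (int id = 0; id < n_ids; ++id) {
        const std::string sql =
            use_bind ? base + ":var<int>" : base + std::to_string(id);
        cache.get(sql).executions++;
    }
    return cache.prepare_count();
}
```

Under these assumptions, `count_prepares(100, true)` needs a single prepare, while `count_prepares(100, false)` needs one per id.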

17
Status

Data retrieval has considerable impact,
especially for the common level 1 tasks
repeated prepares (ROracle) don't scale
retrieval from EAV systems turns out to be
difficult and resource-consuming

18
OTL

open-source template library
implements large parts of OCI
high-level DB interface programming
for Oracle, DB2, ODBC, ...
by Sergei Kuchin, http://otl.sourceforge.net

19
OTL implementation

uses SQL binds and streams directly into STL
vector-based objects
separate handling of expression data (relational
data) and annotation (EAV data, DB-links, ...)
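
Handling EAV annotation separately from the relational expression data could look like the following sketch. It assumes the annotation arrives as plain (entity, attribute, value) rows; `EavRow`, `Annotation` and `pivot_eav` are hypothetical names, not part of OTL or of the system described here, and the database access itself is left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One row of a hypothetical EAV annotation result set.
struct EavRow {
    int entity;             // e.g. an array element id
    std::string attribute;  // e.g. "gene_symbol"
    std::string value;
};

// Annotation of one entity: attribute name -> value.
using Annotation = std::map<std::string, std::string>;

// Groups EAV rows by entity in a single pass over the result set.
// If an (entity, attribute) pair occurs twice, the later row wins.
std::map<int, Annotation> pivot_eav(const std::vector<EavRow>& rows) {
    std::map<int, Annotation> out;
    for (const EavRow& r : rows) {
        out[r.entity][r.attribute] = r.value;
    }
    return out;
}
```

Entities may carry different attribute sets, so a missing attribute is simply absent from that entity's map rather than stored as an empty column.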

21
OTL simplistic example

// db: an open otl_connect object
vector<float> select(int idl) {
  otl_stream i(1000, /* buffer size */
    "SELECT val,bkg FROM expr.data WHERE f_key = :var<int>",
    db);
  float val, bkg;
  vector<float> v;
  i << idl; /* bind variable */
  while (!i.eof()) {
    i >> val >> bkg;
    v.push_back(val - bkg); /* background-corrected value */
  }
  return v;
}

22
Performance

much faster than the RefCursor (PL/SQL) approach (on
small local DB servers)
OTL-based interface is 40% faster than ROracle
for typical microarray data (20,000100 10
anno values)
memory consumption is at least 20% lower (but not
an issue)
no excessive type checking needed in the R code

23
Future Developments

bidirectional interface for R result objects
generalization of the interface and public
open-source release
Mining on C level (no problems with excessively
large data objects in R)?
switch from HTML to XML on output level

24
Summary

fast and flexible user-transparent interface
needed in bioinformatics
higher performance due to bindings might be
crucial
restructuring heterogeneous data in C can
be faster and less memory-consuming
OTL offers nice opportunities for
vector/matrix-oriented retrieval
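
Matrix-oriented retrieval can be sketched as follows, assuming the expression values arrive as (gene, array, value) rows. `ExprRow` and `to_column_major` are hypothetical names used only for illustration. The flat vector is filled in column-major order, which is the layout R uses for its matrices, so the result could be handed over without reordering.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One row of a hypothetical relational result set: (gene, array, value).
struct ExprRow {
    std::size_t gene;   // 0-based row index
    std::size_t array;  // 0-based column index
    double value;
};

// Fills a genes x arrays matrix stored column-major in one flat vector.
// Cells without a matching row keep the fill value; rows whose indices
// fall outside the matrix are skipped.
std::vector<double> to_column_major(const std::vector<ExprRow>& rows,
                                    std::size_t n_genes,
                                    std::size_t n_arrays,
                                    double fill) {
    std::vector<double> m(n_genes * n_arrays, fill);
    for (const ExprRow& r : rows) {
        if (r.gene < n_genes && r.array < n_arrays) {
            m[r.gene + r.array * n_genes] = r.value;
        }
    }
    return m;
}
```

Allocating the vector once at its final size avoids repeated growth while the rows are streamed in.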

25
Acknowledgements