Title: The Gedeon Project: Data, Metadata and Databases Yves DENNEULIN LIG laboratory, Grenoble
1The Gedeon Project: Data, Metadata and Databases
Yves DENNEULIN
LIG laboratory, Grenoble
Laboratoire LIP6
ACI MD
2Context and goals
- Heterogeneous metadata management on grids
- Clusters of clusters
- High-level queries using metadata
- Easy and flexible deployment and configuration
- Minimal overhead
- Various interfaces
- Initial target application domains
- Biocomputing (lots of metadata, few data)
- Microscopic imaging (lots of data, few metadata)
3The Gedeon middleware
- Metadata management on lightweight grids
- Records of (attribute, value) pairs stored in files
- Flexible requests
- Can be combined through scripting
- Various interfaces
- Command line (tools)
- Libraries
- Virtual FS (legacy applications support)
- Deployment à la carte
- Composition of various data sources
- Performance
- Dedicated I/O library
- Semantic caching
4Outline
- General architecture
- Gedeon internal structure
- Composition of various data sources
- Practical use
- Dual cache
- Conclusion
5Example of a deployment
(Diagram: example deployment)
- Client: query interface (API, FS, GUI, ...) on top of a local proxy with its cache
- Servers close to the client, each with a cache
- Interconnect middleware between the client side and the storage sites
- Storage sites, each with a local proxy and caches
6Gedeon components
- Gedeon Kernel
- fuple
- I/O Library
- Evaluate the queries
- lowerG
- Operators to compose bases
- Remote access
- Interface
- API lowerG
- Virtual FS
- Cache
(Diagram: a local proxy made of a cache and lowerG)
7What is inside the sources?
- Records of attribute/value pairs
Record:
  Id        457
  classifA  Bacteria
  classifB  Clostridia
  taille    26
  ref
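A record of attribute/value pairs can be pictured as a plain mapping. The following is a minimal sketch, assuming an in-memory Python dictionary; it is not the on-disk format used by Gedeon, and the helper names `attributes` and `value` are hypothetical.

```python
# Hypothetical in-memory model of one record: a set of (attribute, value) pairs.
# Values are kept as strings, as they would be when read from a text file.
record = {
    "Id": "457",
    "classifA": "Bacteria",
    "classifB": "Clostridia",
    "taille": "26",  # attribute name taken from the slide (French for "size")
}


def attributes(rec):
    """Return the attribute names of a record, in insertion order."""
    return list(rec)


def value(rec, attr, default=None):
    """Return the value of an attribute, or `default` when the record lacks it."""
    return rec.get(attr, default)


print(attributes(record))
print(value(record, "classifB"))
print(value(record, "missing"))
```

Records do not have to share the same attributes, which is why `value` takes a default instead of raising an error.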
8Example of composition of sources
(Diagram: a client queries sources hosted on sites S1, S2 and S3, composed through join (J) and round-robin (RR) operators)
- Metadata can be local or copies
9Union
(Diagram: records A1, A2, A3, A4, ... of one source and records B1, B2, B3, B4, ... of another source are presented as a single source)
- Unify storage space
- Parallel evaluation
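The two properties of the union (a unified storage space and parallel evaluation) can be sketched as follows. This is an illustration under stated assumptions: sources are plain Python lists of dictionaries, and the names `union` and `union_select` are hypothetical, not Gedeon operators.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def union(*sources):
    """Present several sources as one: yield the records of each source in turn."""
    return chain.from_iterable(sources)


def union_select(predicate, *sources):
    """Evaluate the same predicate on every source in parallel, then merge."""

    def evaluate(source):
        return [rec for rec in source if predicate(rec)]

    # One worker per source; pool.map keeps the results in source order.
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        partial_results = list(pool.map(evaluate, sources))
    return [rec for part in partial_results for rec in part]


# Illustrative data only.
source_a = [{"Id": "A1", "taille": "26"}, {"Id": "A2", "taille": "47"}]
source_b = [{"Id": "B1", "taille": "51"}, {"Id": "B2", "taille": "12"}]

print([rec["Id"] for rec in union(source_a, source_b)])
print([rec["Id"] for rec in union_select(lambda r: int(r["taille"]) > 36,
                                         source_a, source_b)])
```

Each source filters its own records, so only matching records need to travel to the point where the partial results are merged.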
10Round Robin
- Fault tolerance
(Diagram: a client reaches Source 1 or Source 2 through a round-robin (RR) operator)
11Round Robin
- Load balancing
(Diagram: two clients share Source 1 and Source 2 through a round-robin (RR) operator)
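Both uses of the round-robin operator can be shown with one small sketch: successive requests rotate over the sources (load balancing), and a source that fails is skipped in favour of the next one (fault tolerance). The class `RoundRobin`, the use of callables as sources and the choice of `ConnectionError` as the failure signal are assumptions made for the illustration.

```python
class RoundRobin:
    """Hypothetical round-robin operator over equivalent sources (replicas)."""

    def __init__(self, sources):
        if not sources:
            raise ValueError("at least one source is required")
        self._sources = list(sources)
        self._next = 0  # index of the source that serves the next request

    def query(self, request):
        last_error = None
        # Try each source at most once, starting from the next one in turn.
        for _ in range(len(self._sources)):
            source = self._sources[self._next]
            self._next = (self._next + 1) % len(self._sources)
            try:
                return source(request)
            except ConnectionError as exc:  # assumed failure signal
                last_error = exc
        raise ConnectionError("no source available") from last_error


def source_1(request):
    return ("Source 1", request)


def source_2(request):
    return ("Source 2", request)


def broken_source(request):
    raise ConnectionError("source unreachable")


balanced = RoundRobin([source_1, source_2])
print([balanced.query("q")[0] for _ in range(3)])

tolerant = RoundRobin([broken_source, source_2])
print(tolerant.query("q")[0])
```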
12Join operator
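(Diagram: the join operator J matches the records of two sources on Id)
Source 1:
  Id 457: A1 = v1, A2 = v2, A3 = v3, ...
  Id 458: A1 = v4, A2 = v5, A3 = v6, ...
Source 2:
  Id 457: An = vAn1
  Id 458: An = vAn2
Result:
  Id 457: A1 = v1, A2 = v2, A3 = v3, ..., An = vAn1
  Id 458: A1 = v4, A2 = v5, A3 = v6, ..., An = vAn2
- Enrich a source with another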
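Enriching one source with another amounts to a join on a shared attribute. Below is a minimal sketch, assuming in-memory lists of dictionaries and `Id` as the join attribute; the function name `join` and the rule that the enriched source keeps its own values on conflict are assumptions.

```python
def join(left, right, key="Id"):
    """Enrich each record of `left` with the attributes of the `right`
    record that has the same `key` value. Records of `left` without a
    match are returned unchanged."""
    # Index the enriching source once, so each lookup is a dictionary access.
    index = {rec[key]: rec for rec in right if key in rec}
    for rec in left:
        merged = dict(rec)
        for attr, val in index.get(rec.get(key), {}).items():
            merged.setdefault(attr, val)  # values of `left` take precedence
        yield merged


source_1 = [
    {"Id": "457", "A1": "v1", "A2": "v2", "A3": "v3"},
    {"Id": "458", "A1": "v4", "A2": "v5", "A3": "v6"},
]
source_2 = [
    {"Id": "457", "An": "vAn1"},
    {"Id": "458", "An": "vAn2"},
]

for rec in join(source_1, source_2):
    print(rec)
```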
14Tools 1/2
- Libraries
- CLI
- Operations
- sort
- projection
- select
- index
- ...
15Tools 2/2
- Examples
- sort
  command line: > cat mesmeta.g | fsort 'taille' > trie_taille.g
  library: sort(attr='taille')
- index (index file: .Id.idx)
  create_idx(attr='Id')
  search_idx('Id', 'P0123')
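What the sort and index operations do can be sketched on in-memory records. The function names below echo those on the slide, but their signatures and behaviour are assumptions made for the illustration, not the Gedeon library API; in particular, the index here is a dictionary rather than an index file.

```python
def sort_key(raw):
    """Order numeric values first (as numbers), then text, then missing values."""
    if raw is None:
        return (2, 0.0, "")
    try:
        return (0, float(raw), "")
    except ValueError:
        return (1, 0.0, raw)


def sort_records(records, attr):
    """Return the records sorted on one attribute."""
    return sorted(records, key=lambda rec: sort_key(rec.get(attr)))


def create_idx(records, attr):
    """Build an index: attribute value -> list of records holding that value."""
    index = {}
    for rec in records:
        if attr in rec:
            index.setdefault(rec[attr], []).append(rec)
    return index


def search_idx(index, wanted):
    """Look a value up in an index built by create_idx."""
    return index.get(wanted, [])


records = [
    {"Id": "459", "classifB": "Bacteria", "taille": "47"},
    {"Id": "460", "classifA": "Fermicutes"},
    {"Id": "457", "classifA": "Bacteria", "classifB": "Clostridia", "taille": "26"},
]

print([rec["Id"] for rec in sort_records(records, "taille")])
idx = create_idx(records, "Id")
print(search_idx(idx, "459"))
```

Building the index costs one pass over the records; after that, a search on the indexed attribute no longer scans the whole source.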
16Language for the requests
- Simple comparisons, e.g. Id>459 (type control with the operators)
- Regular expressions
- Second order (expressions on attribute names as well as on values)
17Select expression
Input records:
  Id 457: classifA = Bacteria, classifB = Clostridia, taille = 26
  Id 459: classifB = Bacteria, taille = 47
  Id 460: classifA = Fermicutes
Query: Select Id>459
Result:
  Id 460: classifA = Fermicutes
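A simple select compares one attribute with a literal. The sketch below assumes that "type control" means a numeric literal is only compared with numeric values and a textual literal only with text; this interpretation, the function names and the set of operators are assumptions.

```python
import operator

# Assumed set of comparison operators.
OPERATORS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def coerce(text):
    """Read a value as a number when possible, otherwise keep it as text."""
    try:
        return float(text)
    except ValueError:
        return text


def select(records, attr, op, literal):
    """Yield the records whose `attr` value satisfies `attr op literal`."""
    compare = OPERATORS[op]
    wanted = coerce(literal)
    for rec in records:
        if attr not in rec:
            continue  # a record without the attribute cannot match
        actual = coerce(rec[attr])
        # Type control: only compare values of the same kind.
        if type(actual) is type(wanted) and compare(actual, wanted):
            yield rec


records = [
    {"Id": "457", "classifA": "Bacteria", "classifB": "Clostridia", "taille": "26"},
    {"Id": "459", "classifB": "Bacteria", "taille": "47"},
    {"Id": "460", "classifA": "Fermicutes"},
]

print(list(select(records, "Id", ">", "459")))
```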
18Select using regexp
Input records:
  Id 457: classifA = Bacteria, classifB = Clostridia, taille = 26
  Id 459: classifB = Bacteria, taille = 47
  Id 460: classifA = Fermicutes
Query: Select classifB=/.a/
Result:
  Id 457: classifA = Bacteria, classifB = Clostridia, taille = 26
  Id 459: classifB = Bacteria, taille = 47
19Select using 2nd order logic
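Input records:
  Id 457: classifA = Bacteria, classifB = Clostridia, taille = 26
  Id 459: classifB = Bacteria, taille = 47
  Id 460: classifA = Fermicutes
Query: Select /classifAB/=Bacteria taille>36
(the attribute name is itself matched by a pattern covering classifA and classifB)
Result:
  Id 459: classifB = Bacteria, taille = 47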
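The two preceding selects differ in where the regular expression applies: on the values of a named attribute, or on the attribute names themselves (second order). A minimal sketch of both follows, using Python's `re` module; the function names are hypothetical, and the attribute pattern `classif[AB]` is an assumption about what the slide's pattern denotes.

```python
import re


def select_regex(records, attr, pattern):
    """Keep the records whose `attr` value matches the regular expression."""
    rx = re.compile(pattern)
    return [rec for rec in records if attr in rec and rx.search(rec[attr])]


def select_second_order(records, attr_pattern, wanted, extra=lambda rec: True):
    """Keep the records where some attribute whose NAME matches `attr_pattern`
    holds the value `wanted`, and where the extra condition also holds."""
    rx = re.compile(attr_pattern)
    return [
        rec
        for rec in records
        if extra(rec)
        and any(rx.search(name) and val == wanted for name, val in rec.items())
    ]


records = [
    {"Id": "457", "classifA": "Bacteria", "classifB": "Clostridia", "taille": "26"},
    {"Id": "459", "classifB": "Bacteria", "taille": "47"},
    {"Id": "460", "classifA": "Fermicutes"},
]

# Regular expression on values: any character followed by "a" in classifB.
print([rec["Id"] for rec in select_regex(records, "classifB", ".a")])

# Second order: classifA or classifB equal to Bacteria, and taille above 36.
# A record without "taille" is treated as not satisfying the condition.
print([rec["Id"] for rec in select_second_order(
    records, "classif[AB]", "Bacteria",
    extra=lambda rec: float(rec.get("taille", "0")) > 36)])
```

The second-order form is useful when the same kind of information may be stored under different attribute names in different records.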
20Virtual FS interface
- Just a specific file-oriented interface
- Data and metadata can be anywhere in the grid
- Definition of logical directories
- Ex: cd 'classifB=.a'
- and between directories
- 1 filename = value of a metadata
(Diagram: logical view)
/fs_virt/classifB=.a> ls
457 459
/fs_virt/classifB=.a> cat > /tmp/mater
/fs_virt/classifB=.a>
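The idea of a logical directory can be sketched without any file system machinery: the directory name is read as a query, and listing the directory returns one file name per matching record. The sketch assumes directory names of the form `attribute=regexp` and file names taken from the `Id` attribute; both are assumptions, as is the function name `list_directory`.

```python
import re


def list_directory(records, directory, name_attr="Id"):
    """Return the file names visible in a logical directory.

    `directory` is read as 'attribute=regexp'; each record whose attribute
    value matches the expression appears as a file named after `name_attr`.
    """
    attr, _, pattern = directory.partition("=")
    rx = re.compile(pattern)
    return [
        rec[name_attr]
        for rec in records
        if attr in rec and name_attr in rec and rx.search(rec[attr])
    ]


records = [
    {"Id": "457", "classifA": "Bacteria", "classifB": "Clostridia", "taille": "26"},
    {"Id": "459", "classifB": "Bacteria", "taille": "47"},
    {"Id": "460", "classifA": "Fermicutes"},
]

print(list_directory(records, "classifB=.a"))
```

A legacy application only sees directories and files, so it can benefit from metadata queries without being modified.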
22Dual cache (1)
- 2 cooperative caches
- cache of requests (R, id, ...) -> saves computing power
- cache of data (id, attr, ...) -> saves bandwidth
- Semantic cache
- Can evaluate a query using the data in the cache
- Can generate a remainder to complement the cached data
23Example
- Refinement of a request
- 'OC=/Eukaryota/' -> (R, Lid = id1, id2, ...)
- 'OC=/Eukaryota/ year>1998' = Select(Lid, 'year>1998')
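The refinement above can be sketched as follows: the cache of requests remembers the list of identifiers (Lid) that answered a request, and a more restrictive request is evaluated on that list only instead of on the whole source. The class `RequestCache`, its methods and the sample records are hypothetical and serve only the illustration.

```python
class RequestCache:
    """Hypothetical cache of requests: request text -> list of matching ids."""

    def __init__(self, records):
        # Stand-in for the metadata source; a real deployment would query
        # remote sites or a cache of data instead.
        self._records = {rec["Id"]: rec for rec in records}
        self._lids = {}
        self.source_scans = 0  # number of full evaluations on the source

    def select(self, request, predicate):
        """Evaluate a request on the source unless its answer is cached."""
        if request not in self._lids:
            self.source_scans += 1
            self._lids[request] = [
                rid for rid, rec in self._records.items() if predicate(rec)
            ]
        return self._lids[request]

    def refine(self, cached_request, predicate):
        """Evaluate an extra condition on the ids of a cached request only."""
        lid = self._lids[cached_request]
        return [rid for rid in lid if predicate(self._records[rid])]


# Illustrative records only.
records = [
    {"Id": "P1", "OC": "Eukaryota; Fungi", "year": "1997"},
    {"Id": "P2", "OC": "Eukaryota; Metazoa", "year": "2001"},
    {"Id": "P3", "OC": "Bacteria", "year": "2003"},
]

cache = RequestCache(records)
lid = cache.select("OC=/Eukaryota/", lambda rec: "Eukaryota" in rec["OC"])
refined = cache.refine("OC=/Eukaryota/", lambda rec: int(rec["year"]) > 1998)
print(lid, refined, cache.source_scans)
```

The refined request never scans the source again: it only touches the records already known to satisfy the first condition, which is where the saving in computing power comes from.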
24Dual cache (2)
- Distributed semantic cache
- Typically used inside communities
- Lots of common requests
- No location constraints
- Members of the community can be geographically scattered
- Distributed data cache
- Minimize time and data transfer
- Cooperation between sites that are close from a topological point of view
25Dual cache (3)
26Dual cache (4)
- Work in progress on the notion of distance
- Find geographical proximity
- Find common interests between communities
- Create hybrid communities based on their requests
- Could be used to change the cache parameters
- Manual and/or automatic
27Conclusion
- A data integration middleware
- Handling of metadata
- Distributed and modular
- Deployment can be done according to architectural/organisational constraints
- Definition of a dual cache infrastructure
- Reflect both organisational use
- Prototype in use
- Packaging and documentation needed
28Questions?