Title: Beyond Federation of Data Collections
1 Beyond Federation of Data Collections
- Making Information Integration Service a Part of NPACI Data Management Infrastructure
Amarnath Gupta, Bertram Ludäscher, Maryann Martone, Ilya Zaslavsky
2 Collection Federation
- In this scenario, scientific groups
  - produce data items (e.g., text data, images, simulation data)
  - put them in collections
  - add metadata (who created it, what the data is about)
  - make it available for sharing (on the web, in a data cache accessible with VBN, in HPSS with authorization information)
- The Problem
  - The data may be a large number of small chunks or a small number of large chunks: data movement is an issue
  - Heterogeneity in data types, storage technologies, networks, authentication protocols
  - Access has to be collection-based, data item-wise, or data fragment-wise; access may need executing data-specific functions
- Storage Resource Broker/Metadata Catalog
  - The focus is on making the data available
3 Information Integration
Cross-source queries:
"What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?"
- Cross-source relationships are modeled
- Information-producing services can be invoked
- Data, relationships, constraints are modeled
[Figure: Integrated View and Integrated View Definition on top of a Mediator, which sits over four Wrappers. Source labels: Web, protein localization, morphometry, neurotransmission, CaBP, Expasy]
4 Hidden Semantics: Protein Localization
<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red">
    <name>RyR</name>
    ...
  </protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</name>
        <amount name="RyR">0</amount>
      </structure>
      <structure fraction="0.2">
        <name>branchlet</name>
        <amount name="RyR">30</amount>
      </structure>
5 Hidden Semantics: Morphometry
<neuron name="purkinje cell">
  <branch level="10">
    <shaft>
      ...
    </shaft>
    <spine number="1">
      <attachment x="5.3" y="-3.2" z="8.7" />
      <length>12.348</length>
      <min_section>1.93</min_section>
      <max_section>4.47</max_section>
      <surface_area>9.884</surface_area>
      <volume>7.930</volume>
      <head>
        <width>4.47</width>
        <length>1.79</length>
      </head>
    </spine>
    ...
6 The Problem
- Multiple Worlds Integration
  - compatible terms not directly joinable
  - complex, indirect associations among schema elements
  - unstated integrity constraints
- What's needed?
  - a theory under which non-identical terms can be semantically joined
  - => lift mediation to the level of conceptual models (CMs)
  - => domain knowledge, ICs become rules over CMs
  - => Model-Based Mediation
7 Information Integration
"What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?"
[Figure: Integrated View and Integrated View Definition on top of a Mediator, which sits over four Wrappers. Source labels: Web, protein localization, morphometry, neurotransmission, CaBP, Expasy]
8 Example Query Evaluation (I)
- Example: protein_distribution
  - given: organism, protein, brain_region
- Use DOMAIN-KNOWLEDGE-BASE
  - recursively traverse the has_a_star paths under brain_region; collect all anatomical_entities
- Source PROLAB
  - join with anatomical structures and collect the value of attribute image.segments.features.feature.protein_amount where image.segments.features.feature.protein_name = protein and study_db.study.animal.name = organism
- Mediator
  - aggregate over all parents up to brain_region
  - report distribution
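The three evaluation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HAS_A` edges, the `PROLAB` rows, and the amounts are hypothetical toy data, the domain map is assumed to be a tree (one parent per entity), and the real mediator evaluates a logic view rather than Python loops.

```python
from collections import defaultdict

# Hypothetical domain knowledge: has_a edges between anatomical entities.
HAS_A = {
    "cerebellum": ["purkinje_cell"],
    "purkinje_cell": ["dendrite", "axon"],
    "dendrite": ["spiny_branchlet", "main_branches"],
}

# Hypothetical PROLAB rows: (organism, protein_name, structure, protein_amount).
PROLAB = [
    ("rat", "RyR", "spiny_branchlet", 30.0),
    ("rat", "RyR", "main_branches", 10.0),
    ("rat", "RyR", "axon", 5.0),
    ("mouse", "RyR", "axon", 7.0),
]


def has_a_star(region):
    """Step 1: the region plus every entity reachable over has_a edges."""
    seen, stack = set(), [region]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(HAS_A.get(node, []))
    return seen


def parent_of():
    """Invert HAS_A; assumes each entity has a single parent (a tree)."""
    return {child: p for p, children in HAS_A.items() for child in children}


def protein_distribution(organism, protein, brain_region):
    entities = has_a_star(brain_region)
    parent = parent_of()
    totals = defaultdict(float)
    # Step 2: select the matching source rows and join on anatomical structure.
    for org, prot, structure, amount in PROLAB:
        if org == organism and prot == protein and structure in entities:
            # Step 3: add the amount to the structure and every parent
            # up to and including brain_region.
            node = structure
            while True:
                totals[node] += amount
                if node == brain_region:
                    break
                node = parent[node]
    return dict(totals)
```

Calling `protein_distribution("rat", "RyR", "cerebellum")` rolls the three rat rows up the tree, so the entry for `cerebellum` holds the total over all of its substructures.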
9 Example Query Evaluation (II)
"How does the parallel fiber output (Yale/SENSELAB) relate to the distribution of Ryanodine Receptors (UCSD/NCMIR)?"
- @SENSELAB: X1 := select output from parallel fiber
- @MEDIATOR: X2 := hang off X1 from Domain Map
- @MEDIATOR: X3 := subregion-closure(X2)
- @NCMIR: X4 := select PROT-data(X3, Ryanodine Receptors)
- @MEDIATOR: X5 := compute aggregate(X4)
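The plan above alternates between source-side and mediator-side steps. A minimal Python sketch of that shape follows; every function name, the toy lookup tables, and the numbers are assumptions made for illustration, not the interfaces of SENSELAB, NCMIR, or the mediator.

```python
# Hypothetical stand-ins for the two sources and the mediator's domain map.
SENSELAB_OUTPUT = {"parallel fiber": ["purkinje_dendrite"]}
DOMAIN_MAP_HAS_A = {"purkinje_dendrite": ["spiny_branchlet", "main_branches"]}
NCMIR_PROT = {
    ("spiny_branchlet", "Ryanodine Receptors"): 30.0,
    ("main_branches", "Ryanodine Receptors"): 10.0,
}


def senselab_select_output(neuron):
    """@SENSELAB: where does the given fiber deliver its output?"""
    return SENSELAB_OUTPUT.get(neuron, [])


def hang_off(terms):
    """@MEDIATOR: keep only the terms that can be anchored in the domain map."""
    known = set(DOMAIN_MAP_HAS_A)
    for children in DOMAIN_MAP_HAS_A.values():
        known.update(children)
    return [t for t in terms if t in known]


def subregion_closure(anchors):
    """@MEDIATOR: anchors plus everything below them in the domain map."""
    seen, stack = set(), list(anchors)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(DOMAIN_MAP_HAS_A.get(node, []))
    return seen


def ncmir_prot_data(regions, protein):
    """@NCMIR: protein amounts recorded for the given regions."""
    return {r: NCMIR_PROT[(r, protein)]
            for r in sorted(regions) if (r, protein) in NCMIR_PROT}


def run_plan():
    x1 = senselab_select_output("parallel fiber")
    x2 = hang_off(x1)
    x3 = subregion_closure(x2)
    x4 = ncmir_prot_data(x3, "Ryanodine Receptors")
    x5 = sum(x4.values())  # "compute aggregate" is assumed to be a sum here
    return x4, x5
```

Only `X1` and `X4` touch a source; the domain-map steps stay inside the mediator, which is what lets two sources with no shared vocabulary be combined.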
10 Integration Issues
SEMANTIC Integration
- SYNTACTIC/STRUCTURAL Integration
  - Integrated Views (Src-XML => Intgr-XML)
  - Schema Integration (DTD => DTD)
  - Wrapping, Data Extraction (Text => XML)
MIX: Mediation of Information using XML
Distributed Query Processing
SRB/MCAT
storage, query capabilities, protocols, services
SYSTEM Integration
TCP/IP, HTTP, CORBA
11 The Mediator Architecture
Mediation Services API
Mediator Layer (figure components: Deductive Engine, Model Reasoner, Optimizer)
- Source model lifting
  - domain knowledge reconciliation
  - model transformation
- Query formulation
  - user query
  - integrated view definition
- Source registration
  - domain knowledge
  - model schema
  - query computation capabilities
- Query processing
  - view unfolding
  - semantic optimization
  - capability-based rewriting
Wrapper Layer
- Query interface (down API)
  - SDLIP, SOAP, ...
  - (subsets of) SQL, X(ML)-Query, CPL, ...
  - DOM
  - SRB-based access
- Result delivery interface (up API)
  - SDLIP, SOAP, ...
  - pull (tuple/set-at-a-time, DOM) vs. push (stream)
  - synchronous/asynchronous
  - direct data/data reference
Sources: XML Sources, RDB Sources, File Sources, HTML Sources, Digital Libraries (Collections), Spatial Sources
Sites: Boston Univ., NCMIR UCSD, Yale Univ., Montana Univ.
Other figure labels: SDLIP, ARC IMS
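The pull versus push distinction in the result delivery interface can be illustrated with a small Python sketch. The class and method names below are hypothetical and the source is just an in-memory list; this is not the wrapper API of the system described here.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


class ListWrapper:
    """Hypothetical wrapper over an in-memory source."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows = list(rows)

    def pull_tuples(self) -> Iterator[Row]:
        """Pull, tuple-at-a-time: the caller asks for each next row."""
        yield from self._rows

    def pull_set(self) -> List[Row]:
        """Pull, set-at-a-time: the caller receives the whole result at once."""
        return list(self._rows)

    def push(self, on_row: Callable[[Row], None]) -> int:
        """Push (stream): the wrapper drives delivery through a callback."""
        for row in self._rows:
            on_row(row)
        return len(self._rows)
```

With pull, the mediator controls the pace and can stop early; with push, the wrapper controls the pace and the mediator must be ready to consume each row as it arrives.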
12 Mediation Services: Source Registration-I
[Figure: a Source described by Data Type, Query Capability, Result Delivery, and Access Protocol. Value labels in the figure: ARC, SQL, XML QL, DOOD; table, tree, file; SRB, HTTP, Java; Tuple-at-a-time, Stream, Set-at-a-time; SPJ, Selections; Binary for Viewer]
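One way to picture a registration record is as a small typed structure. The sketch below is an assumption-laden illustration: the assignment of the slide's labels to the four dimensions is a guess, and `SourceRegistration`, `register`, and `can_evaluate_join` are hypothetical names rather than part of the mediator.

```python
from dataclasses import dataclass
from enum import Enum


class DataType(Enum):
    TABLE = "table"
    TREE = "tree"
    FILE = "file"


class QueryCapability(Enum):
    SELECTIONS = "Selections"
    SPJ = "SPJ"  # select-project-join


class ResultDelivery(Enum):
    TUPLE_AT_A_TIME = "Tuple-at-a-time"
    SET_AT_A_TIME = "Set-at-a-time"
    STREAM = "Stream"
    BINARY_FOR_VIEWER = "Binary for Viewer"


class AccessProtocol(Enum):
    SRB = "SRB"
    HTTP = "HTTP"
    JAVA = "Java"


@dataclass(frozen=True)
class SourceRegistration:
    name: str
    data_type: DataType
    query_capability: QueryCapability
    result_delivery: ResultDelivery
    access_protocol: AccessProtocol


REGISTRY = {}


def register(source):
    """Record a source under its name so the mediator can look it up later."""
    REGISTRY[source.name] = source
    return source


def can_evaluate_join(source):
    """A join can be pushed to the source only if it declares SPJ capability."""
    return source.query_capability is QueryCapability.SPJ
```

A mediator could consult such records during capability-based rewriting: a source registered with only `SELECTIONS` would have its joins evaluated at the mediator instead.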
13 Mediation Services: Source Registration-II
- Domain Model Registration
  - Here is my concept ontology
  - Keep it only as a private object
  - Merge my ontology with a pre-existing non-private ontology
  - Here are the equivalence relations
  - Detect conflicts between my ontology and a given public ontology
- Conceptual Schema Registration
  - Classes, methods
  - Constraints
  - Domain Model Reference
14 ANATOM Domain Map
[Figure: ANATOM domain map]
15
anatom_dom(X) :- (ucsd_has_a(X,_) ; ucsd_has_a(_,X) ; ucsd_isa(X,_) ; ucsd_isa(_,X)).
senselab_dom(X) :- (sl_has_a(X,_) ; sl_has_a(_,X) ; sl_isa(X,_) ; sl_isa(_,X)).

% map senselab anatom terms to equivalent ucsd anatom terms
sl2ucsd(X,X) :- senselab_dom(X), anatom_dom(X).
sl2ucsd('A',axon).
sl2ucsd('AH',axon).
sl2ucsd('Dad',spiny_branchlet).   % should REALLY map to a PATH, not just the end of the path
sl2ucsd('Dam',main_branches).     % really only SOME of the main_branches, based on the branch level
sl2ucsd('Dap',main_branches).
sl2ucsd('Dbd',spiny_branchlet).
sl2ucsd('Dbm',main_branches).
sl2ucsd('Dbp',main_branches).
sl2ucsd('Ded',spiny_branchlet).
sl2ucsd('Dem',main_branches).
sl2ucsd('Dep',main_branches).
sl2ucsd('T',axon).

% keep has_a edge if at least one node is known from ucsd
has_a(X,Y) :- sl2ucsd(_,X), ucsd_has_a(X,Y).
has_a(X,Y) :- sl2ucsd(_,Y), ucsd_has_a(X,Y).

% keep all and only ucsd is-a's
isa(X,Y) :- ucsd_isa(X,Y).
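For readers less used to logic rules, the same merge can be mimicked with Python sets. The twelve `sl2ucsd` pairs are taken from the slide; the edge sets `UCSD_HAS_A`, `UCSD_ISA`, `SL_HAS_A`, and `SL_ISA` are hypothetical toy facts, since the real ones live in the sources.

```python
# Hypothetical edge facts standing in for the UCSD and SENSELAB sources.
UCSD_HAS_A = {
    ("purkinje_cell", "dendrite"),
    ("dendrite", "spiny_branchlet"),
    ("dendrite", "main_branches"),
    ("purkinje_cell", "axon"),
    ("cerebellum", "cortex_layer"),
}
UCSD_ISA = {("purkinje_cell", "neuron")}
SL_HAS_A = {("purkinje_cell", "Dad")}
SL_ISA = set()

# Explicit term mappings, as listed on the slide.
SL2UCSD_FACTS = {
    ("A", "axon"), ("AH", "axon"), ("T", "axon"),
    ("Dad", "spiny_branchlet"), ("Dbd", "spiny_branchlet"), ("Ded", "spiny_branchlet"),
    ("Dam", "main_branches"), ("Dap", "main_branches"),
    ("Dbm", "main_branches"), ("Dbp", "main_branches"),
    ("Dem", "main_branches"), ("Dep", "main_branches"),
}


def domain(has_a_edges, isa_edges):
    """All terms that occur on either end of a has_a or isa edge."""
    return {term for edge in has_a_edges | isa_edges for term in edge}


def sl2ucsd():
    """Explicit mappings plus the identity mapping for terms both sides share."""
    shared = domain(SL_HAS_A, SL_ISA) & domain(UCSD_HAS_A, UCSD_ISA)
    return SL2UCSD_FACTS | {(x, x) for x in shared}


def has_a():
    """Keep a UCSD has_a edge if at least one endpoint is a mapped term."""
    known = {ucsd_term for _, ucsd_term in sl2ucsd()}
    return {(x, y) for (x, y) in UCSD_HAS_A if x in known or y in known}


def isa():
    """Keep all and only the UCSD is-a edges."""
    return set(UCSD_ISA)
```

With these toy facts the edge `("cerebellum", "cortex_layer")` is dropped, because neither endpoint is the target of any mapping, while every edge touching `axon`, `spiny_branchlet`, `main_branches`, or the shared term `purkinje_cell` is kept.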
16
[Figure: domain map fragment. Node labels: Neuron, MyNeuron, Neostriatum, Compartment, Spiny Neuron, Axon, Soma, Dendrite, Medium Spiny Neuron, Neurotransmitter, MyDendrite, GABA, Substance P, Dopamine R, Substantia Nigra Pc, Substantia Nigra Pr, Globus Pallidus Int., Globus Pallidus Ext. Edge and connective labels: ALL:has, exp, AND, OR]
17 Mediation Services: Client Registration
[Figure: client registration diagram. Labels: Client; Update Client; Query Client; Thin Result Viewer; Fat Result Viewer; Navigate/Ad-hoc; Query Capability; Query on Schema; Derive Before Insert; Check Data; Merge Before Insert; Client-side Processing; Client-side Buffer; Send Full Data; Context Sensitive; Server-side Buffer; Server-Push/Client-Pull]
18 Mediation Services: Integrated View Definition
- For the domain data modeler
- Currently in a Logic Language (Frame-logic)
    protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value) if
      I:protein_label_image[proteins ->> {Protein}; organism -> Organism;
          anatomical_structures ->> {AS:anatomical_structure[name -> Anatom]}],   % from PROLAB
      NAE:neuro_anatomic_entity[name -> Anatom; located_in ->> {Brain_region}],   % from ANATOM
      AS..segments..features[name -> Feature_name; value -> Value].
- May be wrapped into a simpler tool
19 Mediation Services: Query Formulation Tools
- Combination of ad hoc and navigational
- Open Issues
  - Recursive queries
  - Aggregate queries
  - Combining data and service requests
20 Mediation Services: Data Update Tools
21 Some Open Issues
- Data/Knowledge Modeling
  - Extensibility: how to handle a source with new data types and operations?
  - Temporal Data: instrument readings, video microscopy
  - Spatial Data: integrating with spatial database systems
  - Image database systems
- Conflict Management
  - Grades of certainty
  - Alternate Hypothesis
- Integrating Services
  - Registration and warping of my image slice to a reference
- Integrating into Larger Applications
  - M-Cell simulation
  - Telemicroscopy
  - Visualization