Beyond Federation of Data Collections - PowerPoint PPT Presentation

Provided by: amar91
Transcript and Presenter's Notes

1
Beyond Federation of Data Collections
  • Making Information Integration Service a part of
    NPACI Data Management Infrastructure

Amarnath Gupta, Bertram Ludäscher, Maryann Martone, Ilya Zaslavsky
2
Collection Federation
  • In this scenario, scientific groups:
      • produce data items (e.g., text data, images,
        simulation data)
      • put them in collections
      • add metadata (who created it, what the data is
        about)
      • make it available for sharing (on the web, in a
        data cache accessible with VBN, in HPSS with
        authorization information)
  • The Problem
      • The data may be a large number of small chunks or
        a small number of large chunks, so data movement
        is an issue
      • Heterogeneity in data types, storage
        technologies, networks, and authentication protocols
      • Access has to be collection-based, data-item-wise,
        or data-fragment-wise, and may require executing
        data-specific functions
  • Storage Resource Broker/Metadata Catalog
      • The focus is on making the data available

3
Information Integration
Cross-source queries
What is the cerebellar distribution of rat
proteins with more than 70% homology with human
NCS-1? Any structure specificity? How about other
rodents?
Cross-source relationships are modeled
Information-producing services can be invoked
Data, relationships, constraints are modeled
[Figure: a mediator exposes an integrated view, specified by an integrated view definition, over four wrapped sources: the Web (CaBP, Expasy), protein localization, morphometry, and neurotransmission.]
4
Hidden Semantics: Protein Localization
<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red">
    <name>RyR</>
    ...
  </protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</>
        <amount name="RyR">0</>
      </>
      <structure fraction="0.2">
        <name>branchlet</>
        <amount name="RyR">30</>
      </>

5
Hidden Semantics: Morphometry
<neuron name="purkinje cell">
  <branch level="10">
    <shaft>
    </shaft>
    <spine number="1">
      <attachment x="5.3" y="-3.2" z="8.7" />
      <length>12.348</>
      <min_section>1.93</>
      <max_section>4.47</>
      <surface_area>9.884</>
      <volume>7.930</>
      <head>
        <width>4.47</>
        <length>1.79</>
      </head>
    </spine>

6
The Problem
  • Multiple Worlds Integration
      • compatible terms are not directly joinable
      • complex, indirect associations among schema
        elements
      • unstated integrity constraints
  • What's needed?
      • a theory under which non-identical terms can be
        semantically joined
      • => lift mediation to the level of conceptual
        models (CMs)
      • => domain knowledge and integrity constraints (ICs)
        become rules over CMs
      • => Model-Based Mediation

7
Information Integration
What is the cerebellar distribution of rat
proteins with more than 70% homology with human
NCS-1? Any structure specificity? How about other
rodents?
[Figure: a mediator exposes an integrated view, specified by an integrated view definition, over four wrapped sources: the Web (CaBP, Expasy), protein localization, morphometry, and neurotransmission.]
8
Example Query Evaluation (I)
  • Example: protein_distribution
      • given: organism, protein, brain_region
  • Use DOMAIN-KNOWLEDGE-BASE
      • recursively traverse the has_a_star paths under
        brain_region; collect all anatomical_entities
  • Source PROLAB
      • join with anatomical structures and collect the
        value of attribute
        image.segments.features.feature.protein_amount
        where image.segments.features.feature.protein_name = protein
        and study_db.study.animal.name = organism
  • Mediator
      • aggregate over all parents up to brain_region
      • report distribution
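
The three evaluation steps above can be sketched in Python. This is only an illustration under stated assumptions: the `HAS_A` edges and `PROLAB` rows are invented toy data, the has_a hierarchy is assumed to be a tree, and the function names are hypothetical rather than part of the mediator described in the slides.

```python
from collections import defaultdict

# Hypothetical fragment of the domain knowledge base: has_a edges (a tree).
HAS_A = {
    "cerebellum": ["purkinje_cell"],
    "purkinje_cell": ["dendrite", "axon"],
    "dendrite": ["spiny_branchlet", "main_branches"],
}

# Hypothetical PROLAB rows: (organism, protein, anatomical structure, amount).
PROLAB = [
    ("rat", "RyR", "spiny_branchlet", 30.0),
    ("rat", "RyR", "main_branches", 12.0),
    ("rat", "RyR", "axon", 5.0),
    ("mouse", "RyR", "axon", 7.0),
]


def has_a_star(region):
    """Return the region plus every entity reachable through has_a edges."""
    seen, stack = set(), [region]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(HAS_A.get(node, []))
    return seen


def protein_distribution(organism, protein, brain_region):
    # Domain knowledge base step: collect all anatomical entities.
    entities = has_a_star(brain_region)

    # Source step: keep measurements matching protein and organism.
    direct = defaultdict(float)
    for org, prot, structure, amount in PROLAB:
        if org == organism and prot == protein and structure in entities:
            direct[structure] += amount

    # Mediator step: aggregate each entity's amount into all its parents.
    def total(node):
        return direct.get(node, 0.0) + sum(total(c) for c in HAS_A.get(node, []))

    return {entity: total(entity) for entity in entities}
```

Calling `protein_distribution("rat", "RyR", "cerebellum")` returns one aggregated amount per anatomical entity under the region, with each parent summing the amounts of its parts.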

9
Example Query Evaluation (II)
"How does the parallel fiber output
(Yale/SENSELAB) relate to the distribution of
Ryanodine Receptors (UCSD/NCMIR)?"
  • @SENSELAB: X1 := select output from parallel
    fiber
  • @MEDIATOR: X2 := hang off X1 from Domain Map
  • @MEDIATOR: X3 := subregion-closure(X2)
  • @NCMIR: X4 := select PROT-data(X3,
    Ryanodine Receptors)
  • @MEDIATOR: X5 := compute aggregate(X4)
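
The five-step plan can be read as a pipeline in which each step runs at one site and feeds the next. The sketch below is a hedged illustration only: the dictionaries standing in for SENSELAB, the domain map, and NCMIR hold invented toy values, and every function name is hypothetical.

```python
# Hypothetical stand-ins for the two sources and the mediator's domain map.
SENSELAB_OUTPUT = {"parallel fiber": ["Dad", "Dbd"]}
SL2UCSD = {"Dad": "spiny_branchlet", "Dbd": "spiny_branchlet", "Dam": "main_branches"}
SUBREGIONS = {"spiny_branchlet": ["spine"]}
NCMIR_PROT = {
    ("spiny_branchlet", "Ryanodine Receptors"): 30.0,
    ("spine", "Ryanodine Receptors"): 0.0,
}


def at_senselab(fiber):
    """X1: select the output of the given fiber (toy lookup)."""
    return SENSELAB_OUTPUT.get(fiber, [])


def hang_off(terms):
    """X2: locate the source terms in the domain map."""
    return sorted({SL2UCSD[t] for t in terms if t in SL2UCSD})


def subregion_closure(concepts):
    """X3: add every subregion reachable from the given concepts."""
    closed, stack = set(), list(concepts)
    while stack:
        concept = stack.pop()
        if concept not in closed:
            closed.add(concept)
            stack.extend(SUBREGIONS.get(concept, []))
    return closed


def at_ncmir(regions, protein):
    """X4: select protein data for the given regions (toy lookup)."""
    return {r: NCMIR_PROT[(r, protein)] for r in regions if (r, protein) in NCMIR_PROT}


def run_plan():
    x1 = at_senselab("parallel fiber")
    x2 = hang_off(x1)
    x3 = subregion_closure(x2)
    x4 = at_ncmir(x3, "Ryanodine Receptors")
    x5 = sum(x4.values())  # X5: the mediator computes the aggregate
    return x1, x2, x3, x4, x5
```

Keeping each step a separate function mirrors the plan's site annotations: only `at_senselab` and `at_ncmir` would touch a remote source, while the other steps run at the mediator.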

10
Integration Issues
SEMANTIC Integration
  • SYNTACTIC/STRUCTURAL Integration
  • Integrated Views (Src-XML => Intgr-XML)
  • Schema Integration (DTD => DTD)
  • Wrapping, Data Extraction (Text => XML)

MIX: Mediation of Information using XML
Distributed Query Processing
SRB/MCAT
storage, query capabilities protocols services
SYSTEM Integration
TCP/IP HTTP CORBA
11
The Mediator Architecture
Mediation Services API
Mediator Layer
  • Source model lifting
      • domain knowledge reconciliation
      • model transformation
  • Query formulation
      • user query
      • integrated view definition

Deductive Engine
Model Reasoner
  • Source registration
      • domain knowledge
      • model schema
      • query computation capabilities
  • Query processing
      • view unfolding
      • semantic optimization
      • capability-based rewriting

Optimizer
Wrapper Layer
  • Query interface (down API)
      • SDLIP, SOAP, ...
      • (subsets of) SQL, X(ML)-Query, CPL, ...
      • DOM
      • SRB-based access
  • Result delivery interface (up API)
      • SDLIP, SOAP, ...
      • pull (tuple/set-at-a-time, DOM) vs. push
        (stream)
      • synchronous/asynchronous
      • direct data/data reference

XML Sources
RDB Sources
File Sources
HTML Sources
Digital Libraries (Collections)
Spatial Sources
Boston Univ.
NCMIR UCSD
Yale Univ.
Montana Univ.
SDLIP
ARC IMS
12
Mediation Services: Source Registration-I
Source
  • Data Type: table, tree, file
  • Query Capability: ARC, SQL, XML QL, DOOD, SPJ, Selections
  • Result Delivery: Tuple-at-a-time, Set-at-a-time, Stream, Binary for Viewer
  • Access Protocol: SRB, HTTP, Java
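
One way to picture a registration record along these four dimensions is as a small typed structure. The sketch below is an assumption-laden illustration, not the mediator's actual API: the class names, the `register` helper, and the `can_answer` check are all hypothetical, and only a subset of the values named on the slide is modeled.

```python
from dataclasses import dataclass
from enum import Enum


class DataType(Enum):
    TABLE = "table"
    TREE = "tree"
    FILE = "file"


class ResultDelivery(Enum):
    TUPLE_AT_A_TIME = "tuple-at-a-time"
    SET_AT_A_TIME = "set-at-a-time"
    STREAM = "stream"


class AccessProtocol(Enum):
    SRB = "SRB"
    HTTP = "HTTP"
    JAVA = "Java"


@dataclass(frozen=True)
class SourceRegistration:
    """Hypothetical record describing one registered source."""
    name: str
    data_type: DataType
    query_capabilities: frozenset  # e.g. {"selection"} or {"SPJ"}
    result_delivery: ResultDelivery
    access_protocol: AccessProtocol

    def can_answer(self, required):
        """True if the source supports every required capability."""
        return set(required) <= self.query_capabilities


REGISTRY = {}


def register(source):
    """Add a source to the registry, rejecting duplicate names."""
    if source.name in REGISTRY:
        raise ValueError(f"source already registered: {source.name}")
    REGISTRY[source.name] = source
    return source
```

A mediator could consult `can_answer` before pushing part of a query to a source, and evaluate the remaining operations itself when the source's capabilities fall short.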
13
Mediation Services: Source Registration-II
  • Domain Model Registration
      • Here is my concept ontology
      • Keep it only as a private object
      • Merge my ontology with a pre-existing non-private
        ontology
      • Here are the equivalence relations
      • Detect conflicts between my ontology and a given
        public ontology
  • Conceptual Schema Registration
      • Classes, methods
      • Constraints
      • Domain Model Reference

14
ANATOM Domain Map
[Figure: the ANATOM domain map]
15
anatom_dom(X) :- (ucsd_has_a(X,_) ; ucsd_has_a(_,X) ; ucsd_isa(X,_) ; ucsd_isa(_,X)).
senselab_dom(X) :- (sl_has_a(X,_) ; sl_has_a(_,X) ; sl_isa(X,_) ; sl_isa(_,X)).

% map senselab anatom terms to equivalent ucsd anatom terms
sl2ucsd(X,X) :- senselab_dom(X), anatom_dom(X).
sl2ucsd('A',axon).
sl2ucsd('AH',axon).
sl2ucsd('Dad',spiny_branchlet).
% should REALLY map to a PATH not just the end of the path
sl2ucsd('Dam',main_branches).
% really only SOME of the main_branches based on the branch level
sl2ucsd('Dap',main_branches).
sl2ucsd('Dbd',spiny_branchlet).
sl2ucsd('Dbm',main_branches).
sl2ucsd('Dbp',main_branches).
sl2ucsd('Ded',spiny_branchlet).
sl2ucsd('Dem',main_branches).
sl2ucsd('Dep',main_branches).
sl2ucsd('T',axon).

% keep has_a edge if at least one node is known from ucsd
has_a(X,Y) :- sl2ucsd(_,X), ucsd_has_a(X,Y).
has_a(X,Y) :- sl2ucsd(_,Y), ucsd_has_a(X,Y).

% keep all and only ucsd is-a's
isa(X,Y) :- ucsd_isa(X,Y).
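
The effect of these rules can be traced with a small Python sketch. It is an illustration under assumptions: the `EXPLICIT` facts are taken from the slide, but the `UCSD_*` and `SL_*` edge sets are invented toy data, and the function names are hypothetical.

```python
# sl2ucsd facts as listed on the slide.
EXPLICIT = {
    "A": "axon", "AH": "axon", "T": "axon",
    "Dad": "spiny_branchlet", "Dbd": "spiny_branchlet", "Ded": "spiny_branchlet",
    "Dam": "main_branches", "Dap": "main_branches", "Dbm": "main_branches",
    "Dbp": "main_branches", "Dem": "main_branches", "Dep": "main_branches",
}

# Hypothetical toy edges; the real relations live in the two sources.
UCSD_HAS_A = {
    ("cerebellum", "purkinje_cell"), ("purkinje_cell", "dendrite"),
    ("dendrite", "spiny_branchlet"), ("dendrite", "main_branches"),
    ("purkinje_cell", "axon"), ("cerebellar_cortex", "granule_layer"),
}
UCSD_ISA = {("purkinje_cell", "neuron")}
SL_HAS_A = {("purkinje_cell", "Dad"), ("purkinje_cell", "A")}
SL_ISA = set()


def terms(*relations):
    """All terms occurring on either side of any edge (the *_dom rules)."""
    return {t for rel in relations for edge in rel for t in edge}


def build_sl2ucsd():
    """Identity mapping for shared terms, plus the explicit facts."""
    shared = terms(SL_HAS_A, SL_ISA) & terms(UCSD_HAS_A, UCSD_ISA)
    return {(x, x) for x in shared} | set(EXPLICIT.items())


def merged_has_a(sl2ucsd):
    """Keep a ucsd has_a edge if at least one endpoint is a mapped term."""
    known = {ucsd_term for _, ucsd_term in sl2ucsd}
    return {(x, y) for (x, y) in UCSD_HAS_A if x in known or y in known}


SL2UCSD = build_sl2ucsd()
HAS_A = merged_has_a(SL2UCSD)
ISA = set(UCSD_ISA)  # keep all and only the ucsd is-a edges
```

With this toy data, the edge between `cerebellar_cortex` and `granule_layer` is dropped because neither endpoint is the target of any mapping, while every edge touching a mapped term survives.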
16
[Figure: domain map fragment. Node labels: Neuron, MyNeuron, Neostriatum, Compartment, Spiny Neuron, Axon, Soma, Dendrite, Medium Spiny Neuron, Neurotransmitter, MyDendrite, GABA, Substance P, Dopamine R, Substantia Nigra Pc, Substantia Nigra Pr, Globus Pallidus Int., Globus Pallidus Ext. Edge and connective labels: ALL:has, exp, AND, OR.]
17
Mediation Services: Client Registration
Client
Update Client
Query Client
Thin Result Viewer
Fat Result Viewer
Navigate/ Ad-hoc
Query Capability
Query on Schema
Derive Before Insert
Check Data
Merge Before Insert
Client-side Processing
Client-side Buffer
Send Full Data
Context Sensitive
Server-side Buffer
Server-Push/ Client-Pull
18
Mediation Services: Integrated View Definition
  • For the domain data modeler
  • Currently in a Logic Language (Frame-logic)
      protein_distribution(Protein, Organism,
          Brain_region, Feature_name, Anatom, Value)
      if
        % from PROLAB
        I:protein_label_image[proteins ->> {Protein};
            organism -> Organism;
            anatomical_structures ->>
              {AS:anatomical_structure[name -> Anatom]}],
        % from ANATOM
        NAE:neuro_anatomic_entity[name -> Anatom;
            located_in ->> {Brain_region}],
        AS..segments..features[name -> Feature_name;
            value -> Value].
  • May be wrapped into a simpler tool

19
Mediation Services: Query Formulation Tools
  • Combination of ad hoc and navigational
  • Open Issues
      • Recursive queries
      • Aggregate queries
      • Combining data and service requests

20
Mediation Services: Data Update Tools
21
Some Open Issues
  • Data/Knowledge Modeling
      • Extensibility: how to handle a source with new
        data types and operations?
      • Temporal Data: instrument readings, video
        microscopy
      • Spatial Data: integrating with spatial database
        systems
      • Image database systems
  • Conflict Management
      • Grades of certainty
      • Alternate Hypothesis
  • Integrating Services
      • Registration and warping of my image slice to a
        reference
  • Integrating into Larger Applications
      • M-Cell simulation
      • Telemicroscopy
      • Visualization