Title: Knowledge-Based Integration of Neuroscience Data Sources
1. Knowledge-Based Integration of Neuroscience Data Sources
- Amarnath Gupta
- Bertram Ludäscher
- Maryann Martone
- University of California San Diego
2. A Standard Information Mediation Framework
[Figure: mediation architecture, top to bottom: Client Query; Integrated XML View; Mediator; XML Views (one per source); Wrappers; Data Sources (one of them a native XML data source).]
3. A Neuroscience Question
Cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
[Figure: the question is posed against the Integrated View of a Mediator, which sits on four Wrappers: WWW (CaBP, Expasy), protein localization, morphometry, and neurotransmission.]
4. Integration Issues
- Structural Heterogeneity
- Resolved by converting to a common semistructured data model
- Heterogeneity in Query Capabilities
- Resolved by writing wrappers with binding patterns and other capability-definition languages
- Semantic Heterogeneity
- Schema conflicts
- Partially resolved by mapping rules in the mediator
- Hidden Semantics?
5. Hidden Semantics: Protein Localization
<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red">
    <name>RyR</>
    ...
  </protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</>
        <amount name="RyR">0</>
      </>
      <structure fraction="0.2">
        <name>branchlet</>
        <amount name="RyR">30</>
      </>
6. Hidden Semantics: Morphometry
<neuron name="purkinje cell">
  <branch level="10">
    <shaft>
      ...
    </shaft>
    <spine number="1">
      <attachment x="5.3" y="-3.2" z="8.7" />
      <length>12.348</>
      <min_section>1.93</>
      <max_section>4.47</>
      <surface_area>9.884</>
      <volume>7.930</>
      <head>
        <width>4.47</>
        <length>1.79</>
      </head>
    </spine>
    ...
7. The Problem
- Multiple Worlds Integration
- compatible terms not directly joinable
- complex, indirect associations among schema elements
- unstated integrity constraints
- Why not use ontologies?
- typical ontologies associate terms along a limited number of dimensions
- What's needed
- a theory under which non-identical terms can be semantically joined
8. Our Approach
- Modify the standard Mediation Architecture
- Wrapper
- Extend to encode an object-version of the structure schema
- Mediator
- Redesign to incorporate auxiliary knowledge sources to
- Correlate object schema of sources
- Define additional objects not specified but derivable from sources
- At the Mediator
- Use a logic engine to
- Encode the mapping rules between sources
- Define integrated views using a combination of exported objects from sources and the auxiliary knowledge sources
- Perform query decomposition
- We still use the Global-as-View form of mediation
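
The Global-as-View idea can be illustrated with a minimal Python sketch: the integrated view is defined as a query over objects exported by a source, enriched with an attribute derived from an auxiliary knowledge source. The record layout, the sample values, and the names `PROLAB_RECORDS`, `ANATOM_PART_OF`, `integrated_view`, and `query` are all hypothetical; this is not the KIND implementation, which uses a logic engine rather than Python.

```python
# Hypothetical wrapper export: the source exposes its objects as dict records.
PROLAB_RECORDS = [  # illustrative values only
    {"animal": "rat", "structure": "spine", "protein": "RyR", "amount": 0},
    {"animal": "rat", "structure": "branchlet", "protein": "RyR", "amount": 30},
]

# Hypothetical auxiliary knowledge source: which entity contains which.
ANATOM_PART_OF = {"spine": "branchlet", "branchlet": "purkinje cell"}


def integrated_view():
    """Global-as-View: the view is defined as a query over exported source
    objects plus the auxiliary knowledge source."""
    for rec in PROLAB_RECORDS:
        yield {
            "organism": rec["animal"],
            "protein": rec["protein"],
            "structure": rec["structure"],
            # Derived attribute: not stored in the source, but derivable
            # from the auxiliary knowledge source.
            "located_in": ANATOM_PART_OF.get(rec["structure"]),
            "amount": rec["amount"],
        }


def query(protein, organism):
    """A client query is answered by unfolding the view definition."""
    return [
        row for row in integrated_view()
        if row["protein"] == protein and row["organism"] == organism
    ]


rows = query("RyR", "rat")
```

Because the view is expressed over the sources, adding a derived attribute such as `located_in` only changes the view definition, not the sources themselves.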
9. The KIND Architecture
[Figure: KIND architecture. Components shown: Integrated User View; View Definition Rules; Logic Engine with Integration Logic; Auxiliary Knowledge Sources 1 and 2; Schema of Registered Sources; Materialized Views; sources Src 1 and Src 2.]
10. The Knowledge-Base
- Situate every data object in its anatomical context
- An illustration
- New data is registered with the knowledge-base
- Insertion of new data reconciles the current knowledge-base with the new information by
- Indexing the data with the source as part of registration
- Extending the knowledge-base
- Creating new views with complex rules to encode additional domain knowledge
11. F-Logic for the Mediation Engine
- Why F-Logic?
- Provides the power of Datalog (with negation) and object creation through Skolem IDs
- Correct amount of notational sugar and rules to provide object-oriented abstraction
- Schema-level reasoning
- Expressing variable arity
- F-Logic in KIND
- Source schema wrapped into F-Logic schema
- Knowledge-sources programmed in F-Logic
- Definition of Integrated Views
12. Wrapping into Logic Objects
<!ELEMENT Studies (Study)*>
<!ELEMENT Study (study_id, animal, experiments, experimenters)>
<!ELEMENT experiments (experiment)*>
<!ELEMENT experiment (description, instrument, parameters)>

studyDB[studies =>> study].
study[study_id => string;
      animal => animal;
      experiments =>> experiment;
      experimenters =>> string].
- Non-automated Part
- Subclasses
- Rules
- Integrity Constraints
mushroom_spine :: spine.
S : mushroom_spine IF S : spine[head -> _; neck -> _].
ic1(S) : alert[type -> "invalid spine"; object -> S] IF S : spine[undef ->> {head, neck}].
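
The subclass rule and the integrity constraint on this slide can be mimicked in Python. This is only a sketch under a stated reading of the rules: a spine that has both a head and a neck is classified as a mushroom spine, and a spine whose head and neck are both undefined raises an alert. The `Spine` class and the function names are hypothetical, and a logic engine would derive these facts declaratively rather than through explicit function calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Spine:
    """Hypothetical wrapped spine object; None means 'undefined'."""
    head: Optional[dict] = None
    neck: Optional[dict] = None


def is_mushroom_spine(s: Spine) -> bool:
    # Rule: S is a mushroom_spine IF S is a spine with some head and some neck.
    return s.head is not None and s.neck is not None


def undefined_parts(s: Spine) -> set:
    # Names of the attributes that carry no value on this object.
    return {name for name in ("head", "neck") if getattr(s, name) is None}


def ic1(s: Spine) -> Optional[dict]:
    # Integrity constraint: alert when both head and neck are undefined.
    if {"head", "neck"} <= undefined_parts(s):
        return {"type": "invalid spine", "object": s}
    return None


complete = Spine(head={"width": 4.47, "length": 1.79}, neck={})
empty = Spine()
alert = ic1(empty)
```

Returning the alert as an object, instead of rejecting the data, mirrors the slide's style of treating constraint violations as derivable facts that a query can inspect.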
13. Computing with Auxiliary Sources
- Creating Mediated Classes
- Reasoning with Schema
animal[M => R] IF S : source, S.animal[M => R].
animal[taxon => TAXON.taxon].
X[taxon -> T] IF X : PROLAB.animal[name -> N],
    words(N, [W1, W2 | _]),
    T : TAXON.taxon[genus -> W1; species -> W2].
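
The taxon-derivation rule can be read as: split the animal's name into words, take the first two as genus and species, and look up the matching taxon. A minimal Python sketch under that reading follows; the record layouts and sample entries are illustrative assumptions, not the actual PROLAB or TAXON contents, and `taxon_of` is a hypothetical helper.

```python
# Hypothetical TAXON entries and PROLAB animal records (illustrative only).
TAXON = [
    {"id": "t1", "genus": "Rattus", "species": "norvegicus"},
    {"id": "t2", "genus": "Mus", "species": "musculus"},
]
PROLAB_ANIMALS = [
    {"name": "Rattus norvegicus Sprague-Dawley"},
    {"name": "unknown"},
]


def taxon_of(animal):
    """Derive the taxon of an animal from the first two words of its name."""
    words = animal["name"].split()
    if len(words) < 2:
        return None  # the rule does not fire: no genus/species pair
    genus, species = words[0], words[1]
    for taxon in TAXON:
        if taxon["genus"] == genus and taxon["species"] == species:
            return taxon
    return None


derived = [taxon_of(a) for a in PROLAB_ANIMALS]
```

The derived `taxon` attribute is what makes animals from different sources joinable even when their names differ beyond the first two words.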
14. Integrated View Definition
- Views are defined between sources and knowledge base
- Example: protein_distribution
- given: organism, protein, brain_region
- KB Anatom:
- recursively traverse the has_a paths under brain_region; collect all anatomical_entities
- Source PROLAB:
- join with anatomical structures and collect the value of attribute image.segments.features.feature.protein_amount where image.segments.features.feature.protein_name = protein and study_db.study.animal.name = organism
- Mediator:
- aggregate over all parents up to brain_region
- report distribution
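
The three steps of the protein_distribution view (traverse, join, aggregate) can be sketched in Python as follows. The `HAS_A` fragment, the flattened `MEASUREMENTS` tuples, and the choice of summation as the aggregate are assumptions made for illustration; the slides do not specify the aggregate function, and the actual view is defined as rules in the mediator.

```python
# Hypothetical fragment of the anatomical knowledge base (has_a edges).
HAS_A = {
    "cerebellum": ["cerebellar cortex"],
    "cerebellar cortex": ["purkinje cell"],
    "purkinje cell": ["dendrite"],
    "dendrite": ["branchlet"],
    "branchlet": ["spine"],
}

# Hypothetical flattened source data: (organism, protein, structure, amount).
MEASUREMENTS = [
    ("rat", "RyR", "spine", 0),
    ("rat", "RyR", "branchlet", 30),
    ("mouse", "RyR", "spine", 5),
]


def anatomical_entities(region):
    """Knowledge-base step: every entity reachable from region via has_a."""
    seen, stack = [], [region]
    while stack:
        entity = stack.pop()
        if entity not in seen:
            seen.append(entity)
            stack.extend(HAS_A.get(entity, []))
    return seen


def protein_distribution(organism, protein, brain_region):
    # Source step: select the amounts for this organism and protein.
    direct = {}
    for org, prot, structure, amount in MEASUREMENTS:
        if org == organism and prot == protein:
            direct[structure] = direct.get(structure, 0) + amount

    # Mediator step: aggregate each entity's own amount with everything
    # below it (assumes has_a forms a tree, and that summing is meaningful).
    def total(entity):
        return direct.get(entity, 0) + sum(
            total(child) for child in HAS_A.get(entity, [])
        )

    return {e: total(e) for e in anatomical_entities(brain_region)}


dist = protein_distribution("rat", "RyR", "cerebellum")
```

Keeping the traversal in the knowledge base and the selection in the source reflects the division of labor listed above; only the aggregation happens at the mediator.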
15. Query Evaluation Example
- protein distribution of Human NCS-1 homologue
- from wrapped CaBP website
- get the amino acid sequence for human NCS-1
- from wrapped Expasy website
- submit amino acid sequence, get ranked homologues
- at Mediator
- select homologues H found in rat, and homology > 0.70
- at Mediator
- for each h in H
- from previous view
- protein_distribution(rat, h, cerebellum, distribution)
- Construct result
a second integrated view
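
The evaluation plan above can be sketched as a small Python pipeline. Every function here is a hypothetical stand-in: `cabp_sequence` and `expasy_homologues` are stubs for the wrapped web sources and return placeholder data, not real sequences or homology scores, and `protein_distribution` is a stub for the first integrated view.

```python
def cabp_sequence(protein):
    """Stub for the wrapped CaBP site: returns a placeholder sequence."""
    return {"human NCS-1": "SEQUENCE-PLACEHOLDER"}[protein]


def expasy_homologues(sequence):
    """Stub for the wrapped Expasy site: ranked homologues (made-up data)."""
    return [
        {"protein": "protein_A", "organism": "rat", "homology": 0.93},
        {"protein": "protein_C", "organism": "mouse", "homology": 0.88},
        {"protein": "protein_B", "organism": "rat", "homology": 0.55},
    ]


def protein_distribution(organism, protein, region):
    """Stub for the first integrated view."""
    return {region: 0}


def homologue_distribution(protein, organism, region, threshold=0.70):
    """Second integrated view: chain the sources, filter, then reuse view 1."""
    sequence = cabp_sequence(protein)            # from wrapped CaBP
    ranked = expasy_homologues(sequence)         # from wrapped Expasy
    selected = [                                 # selection at the mediator
        h["protein"] for h in ranked
        if h["organism"] == organism and h["homology"] > threshold
    ]
    # For each selected homologue, query the previous view.
    return {h: protein_distribution(organism, h, region) for h in selected}


result = homologue_distribution("human NCS-1", "rat", "cerebellum")
```

The second view is built by composition: it never touches the localization source directly, only the first view.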
16. Implementation
- System
- Flora as F-Logic Engine
- Communicate with ODBC databases through underlying XSB Prolog
- XML wrapping and Web querying through XMAS, our XML query language, and custom-built wrappers
- Data
- Human Brain Project sites
- NPACI Neuroscience Thrust sites
17. Work in Progress
- Architecture
- plug-in architecture for
- domain knowledge sources
- conceptual models from data sources
- Functionality
- better handling of large data
- operations
- expressive query language
- operators for domain knowledge manipulation
- query evaluation
- query optimization using domain knowledge
- Demonstration
- at VLDB 2000