1. Biological data integration by bi-directional schema transformation rules
Alexandra Poulovassilis, Birkbeck, University of London
2. The London Knowledge Lab
- Institute of Education, University of London, and Birkbeck College, University of London
- Purpose-designed building, funded by the Science Research Infrastructure Fund (6m)
- Research staff and students: 50
- Location: Bloomsbury; opened June 2004
- Social scientists: experts in education, sociology, culture and media, semiotics, philosophy, knowledge management, ...
- Computer scientists: experts in information systems, information management, web technologies, personalisation, ubiquitous technologies
3. LKL Research Themes
- Research at the London Knowledge Lab consists mainly of externally-funded projects, funded by the EU, EPSRC, ESRC, AHRB, BBSRC, JISC and the Wellcome Trust; currently about 25 projects
- Four broad themes guide our work and inform our research strategy:
  - new forms of knowledge
  - turning information into knowledge
  - the changing cultures of new media
  - creating empowering technologies for formal and informal learning
4. Turning Information Into Knowledge
- The need to cope with ubiquitous, complex, incomplete and inconsistent information is pervasive in our societies
- How can people benefit from this information in their learning, working and social lives?
- What new techniques are necessary for managing, accessing, integrating and personalising such information?
- How do we design and build tools that help people to understand such information and generate new knowledge from it?
5. Turning Information Into Knowledge: Information Integration
- AutoMed (EPSRC)
  - developing tools for semi-automatic integration of heterogeneous information sources
  - can handle both structured and semi-structured (RDF/S, XML) data
  - can handle virtual, materialised and hybrid integration scenarios
  - applications in biological data integration, e-learning, P2P data integration
- ISPIDER (BBSRC e-Science programme)
  - developing an integrated platform of proteomic data sources, enabled as Grid and Web services
  - collaboration with groups at the EBI, Manchester and UCL
6. The AutoMed Project
- Partners: Birkbeck and Imperial Colleges
- Data integration based on schema equivalence
- A low-level metamodel, the Hypergraph Data Model (HDM), in terms of which higher-level modelling languages are defined; the framework is therefore extensible with new modelling languages
- Automatically provides a set of primitive equivalence-preserving schema transformations for higher-level modelling languages:
  - addT(c,q), deleteT(c,q), renameT(c,n,n')
- There are also two more primitive transformations for imprecise integration scenarios:
  - extendT(c,Range q1 q2), contractT(c,Range q1 q2)
7. AutoMed Features
- Schema transformations are automatically reversible:
  - addT/deleteT(c,q) is reversed by deleteT/addT(c,q)
  - extendT(c,Range q1 q2) by contractT(c,Range q1 q2)
  - renameT(c,n,n') by renameT(c,n',n)
- Hence bi-directional transformation pathways (more generally, transformation networks) are defined between schemas
- The queries within transformations allow automatic data and query translation
- Schemas may be expressed in a variety of modelling languages
- Schemas may or may not have a data source associated with them; thus virtual, materialised or hybrid integration can be supported
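The automatic reversibility above can be sketched in code. This is a minimal illustration, not the actual AutoMed API: the `Step` type, field names and the example pathway are all invented here.

```python
# A minimal sketch (not the AutoMed API) of BAV primitive transformation
# steps and their automatic reversal; all names here are invented.
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Step:
    op: str         # "add", "delete", "extend", "contract" or "rename"
    construct: str  # the schema construct the step acts on
    arg: Any        # query (add/delete), Range (extend/contract), (old,new) names (rename)

# Each primitive has a well-defined inverse, so any pathway S1 -> S2
# can be reversed mechanically into a pathway S2 -> S1.
INVERSE = {"add": "delete", "delete": "add",
           "extend": "contract", "contract": "extend",
           "rename": "rename"}

def reverse_step(s: Step) -> Step:
    if s.op == "rename":          # renameT(c,n,n') reverses to renameT(c,n',n)
        old, new = s.arg
        return Step("rename", s.construct, (new, old))
    return Step(INVERSE[s.op], s.construct, s.arg)

def reverse_pathway(pathway: list[Step]) -> list[Step]:
    # Invert each step and reverse their order.
    return [reverse_step(s) for s in reversed(pathway)]

forward = [Step("add", "<<student,id>>", "<<ug,id>> ++ <<phd,id>>"),
           Step("rename", "<<ug,sno>>", ("sno", "tutor_no"))]
backward = reverse_pathway(forward)
print([s.op for s in backward])   # ['rename', 'delete']
```

Because every step carries its query, the reversed pathway is itself a valid pathway, which is what makes data and query translation possible in both directions.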
8. Schema Transformation/Integration Networks
[Figure: local schemas LS1, ..., LSn, each linked by a transformation pathway to a union-compatible schema US1, ..., USn; the union-compatible schemas are linked to each other by id transformations and to the global schema GS]
9. Schema Transformation/Integration Networks (contd)
- On the previous slide:
  - GS is a global schema
  - LS1, ..., LSn are local schemas
  - US1, ..., USn are union-compatible schemas
  - the transformation pathway between each pair LSi and USi may consist of add, delete, rename, extend and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository
  - the transformation pathway between USi and GS is similar
  - the transformation pathway between each pair of union-compatible schemas consists of id transformation steps
10. AutoMed Architecture
[Figure: architecture components]
- Schema and Transformation Repository
- Model Definitions Repository
- Wrappers
- Schema Transformation and Integration Tools
- Global Query Processor and Global Query Optimiser
- Model Definition Tool
- Schema Evolution Tool
11. Comparison with GAV and LAV Data Integration
- Global-As-View (GAV) approach: specify GS constructs by view definitions over LSi constructs
- Local-As-View (LAV) approach: specify LSi constructs by view definitions over GS constructs
12. GAV Example
- student(id,name,left,degree) =
  {⟨x,y,z,w⟩ | ⟨x,y,z,w,_⟩ ∈ ug ∧ ⟨x,_,_,_⟩ ∉ phd} ∪
  {⟨x,y,z,w⟩ | ⟨x,y,z,_⟩ ∈ phd ∧ w = 'phd'}
- monitors(sno,id) =
  {⟨y,x⟩ | ⟨x,_,_,_,y⟩ ∈ ug ∧ ⟨x,_,_,_⟩ ∉ phd} ∪
  {⟨x,y⟩ | ⟨x,y⟩ ∈ supervises}
- staff(sno,sname,dept) =
  {⟨x,y,null⟩ | ⟨x,y⟩ ∈ tutor ∧ ⟨x,_,_⟩ ∉ supervisor} ∪
  {⟨x,y,z⟩ | ⟨x,y,z⟩ ∈ supervisor}
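To make the GAV idea concrete, the view definitions above can be executed directly as comprehensions over the source relations. The sample tuples below are invented purely for illustration.

```python
# Illustrative sketch: GAV view definitions executed as Python
# comprehensions; the sample source data is invented here.
ug  = [(1, "Ann", 2004, "BSc", 10)]       # (id, name, left, degree, sno)
phd = [(2, "Bob", 2005, "Databases")]     # (id, name, left, title)
supervises = [(20, 2)]                    # (sno, id)

phd_ids = {p[0] for p in phd}

# student(id,name,left,degree): ug students not also in phd,
# plus phd students with degree fixed to the constant 'phd'
student = [(x, y, z, w) for (x, y, z, w, _) in ug if x not in phd_ids] + \
          [(x, y, z, "phd") for (x, y, z, _) in phd]

# monitors(sno,id): (sno,id) pairs of ug-only students, plus supervises
monitors = [(v, x) for (x, _, _, _, v) in ug if x not in phd_ids] + \
           [(s, i) for (s, i) in supervises]

print(student)   # [(1, 'Ann', 2004, 'BSc'), (2, 'Bob', 2005, 'phd')]
print(monitors)  # [(10, 1), (20, 2)]
```

This is exactly the GAV property: every global construct is a materialisable query over the local sources.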
13. LAV Example
- tutor(sno,sname) =
  {⟨x,y⟩ | ⟨x,y,_⟩ ∈ staff ∧ ⟨x,z⟩ ∈ monitors ∧ ⟨z,_,_,w⟩ ∈ student ∧ w ≠ 'phd'}
- ug(id,name,left,degree,sno) =
  {⟨x,y,z,w,v⟩ | ⟨x,y,z,w⟩ ∈ student ∧ ⟨v,x⟩ ∈ monitors ∧ w ≠ 'phd'}
- phd, supervises and supervisor are defined similarly
14. Evolution Problems of GAV and LAV
- GAV does not readily support evolution of local schemas: e.g. adding an age attribute to phd invalidates some of the global view definitions
- In LAV, changes to a local schema impact only the derivation rules defined for that schema: e.g. adding an age attribute to phd affects only the rule defining phd
- But LAV has problems if one wants to evolve the global schema, since all the rules defining local schema constructs in terms of the global schema would need to be reviewed
- These problems are exacerbated in P2P data integration scenarios, where there is no distinction between local and global schemas
15. AutoMed Approach, Growing Phase (assuming initially a schema U = S1 ∪ S2)
- addRel(<<student,id>>,
    {x | x ∈ <<ug,id>> ∨ x ∈ <<phd,id>>})
- addAtt(<<student,name>>,
    {⟨x,y⟩ | (⟨x,y⟩ ∈ <<ug,name>> ∧ x ∉ <<phd,id>>) ∨ ⟨x,y⟩ ∈ <<phd,name>>})
- addAtt(<<student,left>>,
    {⟨x,y⟩ | (⟨x,y⟩ ∈ <<ug,left>> ∧ x ∉ <<phd,id>>) ∨ ⟨x,y⟩ ∈ <<phd,left>>})
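Each add step above carries a query that computes the extent of the new construct from existing extents, so the growing phase is directly executable. A sketch, with invented extents:

```python
# Sketch: the growing-phase add steps compute new extents from existing
# ones; the extents below are invented example data.
ug_id,  phd_id  = [1, 3], [2, 3]           # <<ug,id>>, <<phd,id>>
ug_name  = [(1, "Ann"), (3, "Col")]        # <<ug,name>>
phd_name = [(2, "Bob"), (3, "Col")]        # <<phd,name>>

# addRel(<<student,id>>, ...): union of ug and phd ids
student_id = sorted(set(ug_id) | set(phd_id))

# addAtt(<<student,name>>, ...): ug names for ug-only students,
# plus all phd names (phd wins for students appearing in both)
student_name = [(x, y) for (x, y) in ug_name if x not in phd_id] + phd_name

print(student_id)     # [1, 2, 3]
print(student_name)   # [(1, 'Ann'), (2, 'Bob'), (3, 'Col')]
```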
16. AutoMed Approach, Shrinking Phase
- contrAtt(<<tutor,sname>>,
    Range {⟨x,y⟩ | ⟨x,y⟩ ∈ <<staff,sname>> ∧ ⟨z,x⟩ ∈ <<ug,sno>>} Any)
- contrRel(<<tutor,sno>>,
    Range {x | x ∈ <<staff,sno>> ∧ ⟨z,x⟩ ∈ <<ug,sno>>} Any)
- Similar contractions for the ug attributes and relation
17. AutoMed Approach, Shrinking Phase (contd)
- contrAtt(<<phd,title>>, Range Void Any)
- delAtt(<<phd,left>>,
    {⟨x,y⟩ | ⟨x,y⟩ ∈ <<student,left>> ∧ x ∈ <<phd,id>>})
- delAtt(<<phd,name>>,
    {⟨x,y⟩ | ⟨x,y⟩ ∈ <<student,name>> ∧ x ∈ <<phd,id>>})
- delRel(<<phd,id>>,
    {x | x ∈ <<student,id>> ∧ ⟨x,'phd'⟩ ∈ <<student,degree>>})
- Similar deletions for supervises and supervisor
18. AutoMed vs GAV/LAV/GLAV
- AutoMed schema transformation pathways capture at least the information available from GAV and LAV rules:
  - add/extend transformations correspond to GAV rules
  - delete/contract transformations correspond to LAV rules
- We discussed this in our ICDE03 paper, where we termed our integration approach both-as-view (BAV)
- In particular, we discussed how GAV and LAV view definitions can be derived from a BAV specification
- GLAV rules e1 = e2 are captured by BAV transformations of the form add(T,e1); del(T,e2)
- Thus any reasoning or processing that is possible using GAV, LAV or GLAV is also possible using BAV
19. Schema Evolution in BAV
[Figure: transformation pathways T from Global Schema S to New Global Schema S', and from Local Schema S to New Local Schema S']
- Unlike GAV/LAV/GLAV, the BAV framework readily supports the evolution of both local and global schemas
- The evolution of the global or local schema is specified by a schema transformation pathway from the old to the new schema
- For example, the figure on the right shows transformation pathways T from an old to a new global or local schema
20. Global Schema Evolution
- Each transformation step t in the pathway T from S to the new schema S' is considered in turn:
  - if t is an add, delete or rename, then schema equivalence is preserved and there is nothing further to do (except perhaps optimise the extended transformation pathway); the extended pathway can be used to regenerate the necessary GAV or LAV views
  - if t is a contract, then there will be information present in S that is no longer available in S'; again, there is nothing further to do
  - if t is an extend, then domain knowledge is required to determine whether the new construct in S' can in fact be derived from existing constructs; if not, there is nothing further to do; if yes, the extend step is replaced by an add step
21. Local Schema Evolution
- This is a bit more complicated, as it may also require changes to be propagated to the global schema(s)
- Again, each transformation step t in the pathway T from S to S' is considered in turn
- If t is an add, delete, rename or contract step, the evolution can be carried out automatically
- If it is an extend, then domain knowledge is required
- See our CAiSE02, ICDE03 and ER04 papers for more details
- The last of these discusses a materialised data integration scenario, where the old/new global/local schemas have an extent
22. Global Query Processing
- We handle query language heterogeneity by translation into/from a functional intermediate query language, IQL
- A query Q expressed in a high-level query language on a schema S is first translated into IQL (this functionality is not yet supported in the AutoMed toolkit)
- View definitions are derived from the transformation pathways between S and the requested data source schemas
- These view definitions are substituted into Q, reformulating it into an IQL query over source schema constructs
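The substitution step can be sketched as view unfolding. This is a toy stand-in for IQL processing, using plain string rewriting over construct names; the view and query are invented examples.

```python
# Toy sketch of view unfolding: view definitions derived from the
# transformation pathways are substituted into the query until only
# source-schema constructs remain (invented view and query).
views = {
    "<<student,id>>": "(<<ug,id>> ++ <<phd,id>>)",
}

def unfold(query: str, views: dict[str, str]) -> str:
    # Repeatedly substitute view bodies; stops when no view name occurs.
    changed = True
    while changed:
        changed = False
        for name, body in views.items():
            if name in query:
                query = query.replace(name, body)
                changed = True
    return query

q = "count <<student,id>>"
print(unfold(q, views))   # count (<<ug,id>> ++ <<phd,id>>)
```

In AutoMed proper this substitution happens on IQL syntax trees rather than strings, but the principle is the same: after unfolding, the query mentions only data source constructs and can be optimised and evaluated.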
23. Global Query Processing (contd)
- Query optimisation (currently algebraic) and query evaluation then occur
- During query evaluation, the evaluator submits to the wrappers sub-queries that they are able to translate into the local query language; currently, AutoMed supports wrappers for SQL, OQL, XPath, XQuery and flat-file data sources
- The wrappers translate sub-query results back into the IQL type system
- Further query post-processing then occurs in the IQL evaluator
24. Other AutoMed research at BBK
- As well as virtual integration of data sources, we have investigated using AutoMed for materialised data integration, e.g. a data warehousing approach
- In particular, Hao Fan has worked on incremental view maintenance, data lineage tracing and schema evolution over AutoMed schema transformation pathways
- Lucas Zamboulis has been looking at semi-automatic techniques for transforming and integrating heterogeneous XML data
- In recent work we have also investigated using correspondences to RDFS schemas to enhance these techniques
25. Other AutoMed research at BBK (contd)
- Dean Williams has been working on extracting structure from unstructured text sources
- The aim here is to integrate information extracted from unstructured text with structured information available from other sources
- Dean is using existing technology (the GATE tool) for the text annotation and information extraction (IE) parts of this work
- The information extracted from the text is matched with existing structured information to derive new instance data, and perhaps also new schema fragments
- AutoMed is being used for the schema and data integration aspects of this project
26. Other AutoMed research at Imperial
- Automatic generation of equivalences between different data models
- A graphical schema transformations editor
- Data mining techniques for extracting schema equivalences
- Optimising schema transformation pathways
27. ISPIDER Project
- Partners: Birkbeck, EBI, Manchester, UCL
- Aims:
  - Vast, heterogeneous biological data
  - Need for interoperability
  - Need for efficient processing
- Development of a Proteomics Grid Infrastructure: use existing proteomics resources and develop new ones; develop new proteomics clients for querying, visualisation, workflow etc.
28-32. Project Aims
[Figures]
33. myGrid / DQP / AutoMed
- myGrid: a collection of services/components allowing high-level integration of data/applications for in-silico experiments in biology
- DQP:
  - OGSA-DAI (Open Grid Services Architecture - Data Access and Integration)
  - Distributed query processing over OGSA-DAI enabled resources
- Current research: AutoMed / DQP interoperation
- Future research: AutoMed / myGrid workflows interoperation
34. DQP / AutoMed Interoperability
- Data sources are wrapped with OGSA-DAI
- AutoMed OGSA-DAI wrappers extract the data sources' metadata
- Semantic integration of the data sources, using AutoMed transformation pathways into an integrated AutoMed schema
- IQL queries submitted to this integrated schema are:
  - Reformulated into IQL queries on the data sources, using the AutoMed transformation pathways
  - Submitted to DQP for evaluation
35. Data source schema extraction
- The AutoMed wrapper requests the schema of the data source using an OGSA-DAI service
- The service replies with the source schema encoded in XML
- The AutoMed wrapper creates the corresponding schema in the AutoMed repository
36. Using AutoMed in the BioMap Project
- Relational/XML data sources containing protein sequence, structure, function and pathway data; gene expression data; other experimental data
- Wrapping of data sources
- Translation of source and global schemas into AutoMed's XML schema
- Domain expert provides matchings between constructs in the source and global schemas
- Automatic schema restructuring, with automatic generation of schema transformation pathways
- See our DILS05 paper for more details
[Figure: BioMap integration architecture - XML and relational data sources (files and RDBs) are wrapped by AutoMed wrappers, yielding AutoMed XMLDSS and relational schemas; transformation pathways connect these to an AutoMed integrated schema, which is materialised in an integrated database]
37. Ongoing and future research
- Using the BAV approach for data integration in Grid and P2P environments
- The integration may be virtual, materialised or hybrid
- P2P query processing over BAV pathways
- P2P update processing over BAV pathways
- Use of ECA rules and a P2P ECA rule execution engine
- Optimisation of ECA rules on semi-structured data