Biological data integration by bidirectional schema transformation rules, Alexandra Poulovassilis

1
Biological data integration by bi-directional
schema transformation rules
Alexandra Poulovassilis, Birkbeck, U. of London
2
The London Knowledge Lab
Institute of Education, University of London
Birkbeck College, University of London
Purpose-designed building: Science Research
Infrastructure Fund, 6m
Research staff and students: 50
Location: Bloomsbury
Open: June 2004
Social scientists: experts in education,
sociology, culture and media, semiotics,
philosophy, knowledge management ...
Computer scientists: experts in information
systems, information management, web
technologies, personalisation, ubiquitous
technologies
3
LKL Research Themes
  • Research at the London Knowledge Lab consists
    mainly of externally funded projects by EU,
    EPSRC, ESRC, AHRB, BBSRC, JISC, Wellcome Trust;
    currently about 25 projects.
  • Four broad themes guide our work and inform our
    research strategy:
  • new forms of knowledge
  • turning information into knowledge
  • the changing cultures of new media
  • creating empowering technologies for formal and
    informal learning

4
Turning Information Into Knowledge
  • The need to cope with ubiquitous, complex,
    incomplete and inconsistent information is
    pervasive in our societies
  • How can people benefit from this information in
    their learning, working and social lives?
  • What new techniques are necessary for managing,
    accessing, integrating and personalising such
    information?
  • How can we design and build tools that help
    people to understand such information and
    generate new knowledge from it?

5
Turning Information Into Knowledge: Information
Integration
  • AutoMed (EPSRC)
  • developing tools for semi-automatic integration
    of heterogeneous information sources
  • can handle both structured and semi-structured
    (RDF/S, XML) data
  • can handle virtual, materialised and hybrid
    integration scenarios
  • application in biological data integration,
    e-learning, p2p data integration
  • ISPIDER (BBSRC e-Science programme)
  • developing an integrated platform of proteomic
    data sources, enabled as Grid and Web services
  • collaboration with groups at EBI, Manchester,
    UCL

6
The AutoMed Project
  • Partners: Birkbeck and Imperial Colleges
  • Data integration based on schema equivalence
  • Low-level metamodel, the Hypergraph Data Model
    (HDM), in terms of which higher-level modelling
    languages are defined; it is therefore extensible
    with new modelling languages
  • Automatically provides a set of primitive
    equivalence-preserving schema transformations for
    higher-level modelling languages:
  • addT(c,q), deleteT(c,q), renameT(c,n,n')
  • There are also two more primitive transformations
    for imprecise integration scenarios:
  • extendT(c,Range q1 q2), contractT(c,Range q1 q2)

7
AutoMed Features
  • Schema transformations are automatically
    reversible:
  • addT/deleteT(c,q) by deleteT/addT(c,q)
  • extendT(c,Range q1 q2) by contractT(c,Range q1
    q2)
  • renameT(c,n,n') by renameT(c,n',n)
  • Hence bi-directional transformation pathways
    (more generally transformation networks) are
    defined between schemas
  • The queries within transformations allow
    automatic data and query translation
  • Schemas may be expressed in a variety of
    modelling languages
  • Schemas may or may not have a data source
    associated with them; thus virtual, materialised
    or hybrid integration can be supported
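
The reversibility rules above can be sketched in a few lines of Python. This is only an illustration, not the AutoMed API: the Step class, its field names and the two helper functions are hypothetical, and queries are kept as opaque strings.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Step:
    """One primitive transformation (hypothetical representation)."""
    kind: str                                  # add, delete, extend, contract, rename
    construct: str                             # construct c (for rename: the old name n)
    query: Optional[str] = None                # q, for add/delete
    bounds: Optional[Tuple[str, str]] = None   # (q1, q2), for extend/contract
    new_name: Optional[str] = None             # n', for rename only


_INVERSE = {"add": "delete", "delete": "add",
            "extend": "contract", "contract": "extend"}


def reverse_step(s: Step) -> Step:
    """Return the step that undoes s, following the rules on this slide."""
    if s.kind == "rename":
        # rename(c, n, n') is undone by rename(c, n', n)
        return Step("rename", s.new_name, new_name=s.construct)
    return Step(_INVERSE[s.kind], s.construct, s.query, s.bounds)


def reverse_pathway(steps: List[Step]) -> List[Step]:
    """Reverse a pathway: reverse each step and apply them in opposite order."""
    return [reverse_step(s) for s in reversed(steps)]
```

Reversing a pathway twice returns the original pathway, which is the sense in which pathways are bi-directional.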

8
Schema Transformation/Integration Networks
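[Figure: local schemas LS1, LS2, ..., LSi, ..., LSn,
each linked to a union-compatible schema US1, US2,
..., USi, ..., USn; edges labelled id; global schema
GS at the top.]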
9
Schema Transformation/Integration Networks
(contd)
  • On the previous slide:
  • GS is a global schema
  • LS1, ..., LSn are local schemas
  • US1, ..., USn are union-compatible schemas
  • the transformation pathway between each pair LSi
    and USi may consist of add, delete, rename,
    extend and contract primitive transformations,
    operating on any modelling construct defined in
    the AutoMed Model Definitions Repository
  • the transformation pathway between USi and GS is
    similar
  • the transformation pathway between each pair of
    union-compatible schemas consists of id
    transformation steps

10
AutoMed Architecture
Schema and Transformation Repository
Wrapper
Schema Transformation and Integration Tools
Global Query Processor
Model Definitions Repository
Global Query Optimiser
Model Definition Tool
Schema Evolution Tool
11
Comparison with GAV and LAV Data Integration
  • Global-As-View (GAV) approach: specify GS
    constructs by view definitions over LSi
    constructs
  • Local-As-View (LAV) approach: specify LS
    constructs by view definitions over GS constructs

12
GAV Example
  • student(id,name,left,degree) =
    {x,y,z,w | (⟨x,y,z,w,_⟩ ∈ ug ∧ ⟨x,_,_,_,_⟩ ∉ phd) ∨
               (⟨x,y,z,w,_⟩ ∈ phd ∧ w = 'phd')}
  • monitors(sno,id) =
    {x,y | (⟨x,_,_,_,y⟩ ∈ ug ∧ ⟨x,_,_,_,_⟩ ∉ phd) ∨
           ⟨x,y⟩ ∈ supervises}
  • staff(sno,sname,dept) =
    {x,y,z | (⟨x,y,z,w,_⟩ ∈ tutor ∧ ⟨x,_,_⟩ ∉ supervisor) ∨
             ⟨x,y,z⟩ ∈ supervisor}
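
To make the GAV idea concrete, the global student relation can be computed as a view over local relations using Python sets. This is a sketch under assumptions: the local relations are taken to be ug(id,name,left,degree,sno) and phd(id,name,left,title), following the attribute names used on later slides, and the sample rows and the function name student_view are invented.

```python
# Hypothetical local extents (sample rows invented for illustration).
# ug(id, name, left, degree, sno); phd(id, name, left, title)
ug = {(1, "Ann", 2004, "BSc", 10), (2, "Bob", 2005, "BSc", 11)}
phd = {(2, "Bob", 2008, "Graph models"), (3, "Cy", 2009, "Data integration")}


def student_view(ug, phd):
    """GAV-style view: student(id, name, left, degree) over ug and phd."""
    phd_ids = {row[0] for row in phd}
    # Undergraduates who are not also recorded as PhD students.
    from_ug = {(i, n, l, d) for (i, n, l, d, _sno) in ug if i not in phd_ids}
    # PhD students, with the degree fixed to the constant 'phd'.
    from_phd = {(i, n, l, "phd") for (i, n, l, _title) in phd}
    return from_ug | from_phd


student = student_view(ug, phd)
```

Here student 2 appears in both sources, and the view keeps only the phd record for that id.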

13
LAV Example
  • tutor(sno,sname) =
    {x,y | ⟨x,y,_⟩ ∈ staff ∧ ⟨x,z⟩ ∈ monitors ∧
           ⟨z,_,_,w⟩ ∈ student ∧ w ≠ 'phd'}
  • ug(id,name,left,degree,sno) =
    {x,y,z,w,v | ⟨x,y,z,w⟩ ∈ student ∧
                 ⟨v,x⟩ ∈ monitors ∧ w ≠ 'phd'}
  • phd, supervises, supervisor are defined similarly
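
In the LAV direction the local relation is the view and the global relations are the inputs. A minimal Python sketch of the ug definition, assuming global relations student(id,name,left,degree) and monitors(sno,id); the sample rows and the function name ug_view are invented.

```python
# Hypothetical global extents (sample rows invented for illustration).
student = {(1, "Ann", 2004, "BSc"), (2, "Bob", 2008, "phd")}
monitors = {(10, 1), (20, 2)}  # (sno, id)


def ug_view(student, monitors):
    """LAV-style view: ug(id, name, left, degree, sno) over the global schema."""
    return {(x, y, z, w, v)
            for (x, y, z, w) in student
            for (v, sid) in monitors
            if sid == x and w != "phd"}


ug = ug_view(student, monitors)
```

Only non-PhD students who have a monitoring staff member appear in the result.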

14
Evolution problems of GAV and LAV
  • GAV does not readily support evolution of local
    schemas, e.g. adding an age attribute to phd
    invalidates some of the global view definitions
  • In LAV, changes to a local schema impact only the
    derivation rules defined for that schema, e.g.
    adding an age attribute to phd affects only
    the rule defining phd
  • But LAV has problems if one wants to evolve the
    global schema, since all the rules defining local
    schema constructs in terms of the global schema
    would need to be reviewed
  • These problems are exacerbated in P2P data
    integration scenarios, where there is no
    distinction between local and global schemas

15
AutoMed approach, Growing Phase
assuming initially a schema U = S1 ∪ S2
  • addRel(<<student,id>>,
      {x | x ∈ <<ug,id>> ∨ x ∈ <<phd,id>>})
  • addAtt(<<student,name>>,
      {<x,y> | (<x,y> ∈ <<ug,name>> ∧ x ∉ <<phd,id>>) ∨
               <x,y> ∈ <<phd,name>>})
  • addAtt(<<student,left>>,
      {<x,y> | (<x,y> ∈ <<ug,left>> ∧ x ∉ <<phd,id>>) ∨
               <x,y> ∈ <<phd,left>>})
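
The growing phase can be imitated on an in-memory schema, where each construct maps to its extent and each add step evaluates its query over the constructs already present. This is a sketch only: the dictionary representation, the add function and the sample data are assumptions, not the AutoMed toolkit.

```python
# A schema is modelled as a dict from construct to extent (invented sample data).
schema = {
    ("ug", "id"): {1, 2},
    ("phd", "id"): {2, 3},
    ("ug", "name"): {(1, "Ann"), (2, "Bob")},
    ("phd", "name"): {(2, "Robert"), (3, "Cy")},
}


def add(schema, construct, query):
    """Return a new schema extended with construct, whose extent is query(schema)."""
    if construct in schema:
        raise ValueError(f"{construct} already exists")
    grown = dict(schema)
    grown[construct] = query(schema)
    return grown


# addRel(<<student,id>>, ...): union of the ug and phd ids
s1 = add(schema, ("student", "id"),
         lambda s: s[("ug", "id")] | s[("phd", "id")])

# addAtt(<<student,name>>, ...): ug names unless the id is also a phd id
s2 = add(s1, ("student", "name"),
         lambda s: {(x, y) for (x, y) in s[("ug", "name")]
                    if x not in s[("phd", "id")]} | s[("phd", "name")])
```

The original schema dictionary is left unchanged, so each intermediate schema on the pathway stays available.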

16
AutoMed approach, Shrinking Phase
  • contrAtt(<<tutor,sname>>,
      Range {<x,y> | <x,y> ∈ <<staff,sname>> ∧
                     <z,x> ∈ <<ug,sno>>} Any)
  • contrRel(<<tutor,sno>>,
      Range {x | x ∈ <<staff,sno>> ∧
                 <z,x> ∈ <<ug,sno>>} Any)
  • Similarly contractions for the ug attributes and
    relation

17
AutoMed approach, Shrinking Phase (contd)
  • contrAtt(<<phd,title>>,
      Range Void Any)
  • delAtt(<<phd,left>>,
      {<x,y> | <x,y> ∈ <<student,left>> ∧
               x ∈ <<phd,id>>})
  • delAtt(<<phd,name>>,
      {<x,y> | <x,y> ∈ <<student,name>> ∧
               x ∈ <<phd,id>>})
  • delRel(<<phd,id>>,
      {x | x ∈ <<student,id>> ∧
           <x,'phd'> ∈ <<student,degree>>})
  • Similarly deletions for supervises and supervisor
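
The difference between delete and contract in the shrinking phase can be sketched as follows. A delete step carries a query that reconstructs the removed construct from what remains, while a contract step only gives lower and upper bounds (Range Void Any meaning no information is recoverable). The representation below is an assumption for illustration: VOID, ANY, the consistency checks and the sample data are not taken from the AutoMed toolkit.

```python
VOID = frozenset()  # lower bound: nothing is known to be in the extent
ANY = None          # upper bound: no restriction (stand-in for Any)


def within_range(extent, lower, upper):
    """True if lower is contained in extent and extent is within upper."""
    return set(lower) <= set(extent) and (upper is ANY or set(extent) <= set(upper))


def contract(schema, construct, lower, upper):
    """Remove construct; its extent is only bounded by the given range."""
    if not within_range(schema[construct], lower, upper):
        raise ValueError("extent lies outside the stated range")
    return {k: v for k, v in schema.items() if k != construct}


def delete(schema, construct, query):
    """Remove construct; query must rebuild its extent from the remaining schema."""
    rest = {k: v for k, v in schema.items() if k != construct}
    if query(rest) != schema[construct]:
        raise ValueError("query does not reconstruct the deleted construct")
    return rest


schema = {
    ("student", "id"): {1, 2, 3},
    ("student", "degree"): {(1, "BSc"), (2, "phd"), (3, "phd")},
    ("phd", "id"): {2, 3},
    ("phd", "title"): {(2, "Graph models")},
}

# contrAtt(<<phd,title>>, Range Void Any): titles cannot be derived from student
s1 = contract(schema, ("phd", "title"), VOID, ANY)

# delRel(<<phd,id>>, ...): phd ids are exactly the students with degree 'phd'
s2 = delete(s1, ("phd", "id"),
            lambda s: {x for x in s[("student", "id")]
                       if (x, "phd") in s[("student", "degree")]})
```

The checks inside contract and delete are a sanity test added for this sketch; they show why phd ids can be deleted while phd titles can only be contracted.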

18
AutoMed vs GAV/LAV/GLAV
  • AutoMed schema transformation pathways capture at
    least the information available from GAV and LAV
    rules
  • add/extend transformations correspond to GAV
    rules
  • delete/contract transformations correspond to LAV
    rules
  • We discussed this in our ICDE'03 paper, where we
    termed our integration approach both-as-view
    (BAV)
  • In particular, we discussed how GAV and LAV view
    definitions can be derived from a BAV
    specification
  • GLAV rules e → e' are captured by BAV
    transformations of the form add(T,e),
    del(T,e')
  • Thus any reasoning or processing that is possible
    using GAV, LAV or GLAV is also possible using
    BAV
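
The correspondence between step kinds and view styles can be illustrated with a small classifier over a pathway. This is a deliberately simplified sketch: steps are hypothetical (kind, construct, query) tuples, queries are placeholder strings, and the substitution of intermediate constructs that a full derivation would need is omitted.

```python
def split_views(pathway):
    """Split a local-to-global pathway into GAV-style and LAV-style definitions."""
    gav, lav = {}, {}
    for kind, construct, query in pathway:
        if kind in ("add", "extend"):
            gav[construct] = query   # global construct defined over local ones
        elif kind in ("delete", "contract"):
            lav[construct] = query   # local construct defined over global ones
    return gav, lav


# Placeholder queries; the strings stand in for real query expressions.
pathway = [
    ("add", "student.id", "ug.id union phd.id"),
    ("delete", "phd.id", "student.id where degree = 'phd'"),
    ("rename", "staff", None),
]
gav, lav = split_views(pathway)
```

Rename steps define no view on their own and so land in neither dictionary.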

19
Schema Evolution in BAV
  • Unlike GAV/LAV/GLAV, the BAV framework readily
    supports the evolution of both local and global
    schemas
  • The evolution of the global or local schema is
    specified by a schema transformation pathway from
    the old to the new schema
  • For example, the figure shows transformation
    pathways T from an old to a new global or local
    schema

[Figure: a pathway T from Global Schema S to New
Global Schema S', and a pathway T from Local Schema S
to New Local Schema S'.]
20
Global Schema Evolution
  • Each transformation step t in T: S → S' is
    considered in turn
  • if t is an add, delete or rename, then schema
    equivalence is preserved and there is nothing
    further to do (except perhaps optimise the
    extended transformation pathway); the extended
    pathway can be used to regenerate the necessary
    GAV or LAV views
  • if t is a contract, then there will be information
    present in S that is no longer available in S';
    again there is nothing further to do
  • if t is an extend, then domain knowledge is
    required to determine if the new construct in S'
    can in fact be derived from existing constructs;
    if not, there is nothing further to do; if yes,
    the extend step is replaced by an add step
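
The case analysis above amounts to a single pass over the evolution pathway. A minimal sketch, assuming steps are (kind, construct, query) tuples and that domain knowledge is supplied as a dictionary named derivable, which maps an extended construct to the query that derives it; both are hypothetical representations.

```python
def evolve_global(evolution_steps, derivable):
    """Process the evolution pathway step by step, as on this slide.

    derivable stands in for domain knowledge: construct -> deriving query.
    """
    result = []
    for kind, construct, query in evolution_steps:
        if kind == "extend" and construct in derivable:
            # The new construct can be derived after all: use add instead.
            result.append(("add", construct, derivable[construct]))
        else:
            # add, delete, rename, contract, or a non-derivable extend:
            # nothing further to do, keep the step as it is.
            result.append((kind, construct, query))
    return result


steps = [
    ("rename", "staff", None),
    ("extend", "student.age", None),
    ("extend", "student.email", None),
]
evolved = evolve_global(steps, {"student.age": "current_year - birth_year"})
```

Only the extend step for which a derivation is known is rewritten; the other steps pass through unchanged.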

21
Local Schema Evolution
  • This is a bit more complicated, as it may require
    changes to be propagated also to the global
    schema(s)
  • Again each transformation step t in T: S → S' is
    considered in turn
  • In the case that t is an add, delete, rename or
    contract step, the evolution can be carried out
    automatically
  • If it is an extend, then domain knowledge is
    required
  • See our CAiSE'02, ICDE'03 and ER'04 papers for
    more details
  • The last of these discusses a materialised data
    integration scenario where the old/new
    global/local schemas have an extent

22
Global Query Processing
  • We handle query language heterogeneity by
    translation into/from a functional intermediate
    query language IQL
  • A query Q expressed in a high-level query
    language on a schema S is first translated into
    IQL (this functionality is not yet supported in
    the AutoMed toolkit)
  • View definitions are derived from the
    transformation pathways between S and the
    requested data source schemas
  • These view definitions are substituted into Q,
    reformulating it into an IQL query over source
    schema constructs
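
Substituting view definitions into a query is a form of query unfolding. The sketch below does not use IQL: queries are modelled as nested tuples (operator, argument, ...) with construct names as strings, and the function name unfold and the view dictionary are hypothetical. View definitions are assumed to be acyclic.

```python
def unfold(query, views):
    """Replace every construct that has a view definition by that definition."""
    if isinstance(query, str):
        # A construct name: unfold it if a view defines it, else it is a source construct.
        return unfold(views[query], views) if query in views else query
    op, *args = query
    return (op, *[unfold(a, views) for a in args])


# Views derived from the pathways (invented for illustration).
views = {
    "student_id": ("union", "ug_id", "phd_id"),
    "phd_id": ("project_id", "phd"),
}

# A query over the global schema, reformulated over source constructs.
reformulated = unfold(("count", "student_id"), views)
```

After unfolding, the query mentions only source constructs, so it can be passed to optimisation and evaluation.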

23
Global Query Processing (contd)
  • Query optimisation (currently algebraic) and
    query evaluation then occur
  • During query evaluation, the evaluator submits to
    wrappers sub-queries that they are able to
    translate into the local query language.
    Currently, AutoMed supports wrappers for SQL,
    OQL, XPath, XQuery and flat-file data sources
  • The wrappers translate sub-query results back
    into the IQL type system
  • Further query post-processing then occurs in the
    IQL evaluator

24
Other AutoMed research at BBK
  • As well as virtual integration of data sources,
    we have investigated using AutoMed for
    materialised data integration e.g. a data
    warehousing approach
  • In particular, Hao Fan has worked on incremental
    view maintenance, data lineage tracing and schema
    evolution over AutoMed schema transformation
    pathways
  • Lucas Zamboulis has been looking at
    semi-automatic techniques for transforming and
    integrating heterogeneous XML data
  • In recent work we have also investigated using
    correspondences to RDFS schemas to enhance these
    techniques

25
Other AutoMed research at BBK (contd)
  • Dean Williams has been working on extracting
    structure from unstructured text sources
  • The aim here is to integrate information
    extracted from unstructured text with structured
    information available from other sources
  • Dean is using existing technology (the GATE tool)
    for the text annotation and IE part of this work
  • The information extracted from the text is
    matched with existing structured information to
    derive new instance data and perhaps also new
    schema fragments
  • AutoMed is being used for the schema and data
    integration aspects of this project

26
Other AutoMed research at Imperial
  • Automatic generation of equivalences between
    different data models
  • A graphical schema transformations editor
  • Data mining techniques for extracting schema
    equivalences
  • Optimising schema transformation pathways

27
ISPIDER Project
  • Partners: Birkbeck, EBI, Manchester, UCL
  • Aims
  • Vast, heterogeneous biological data
  • Need for interoperability
  • Need for efficient processing
  • Develop a Proteomics Grid Infrastructure; use
    existing proteomics resources and develop new
    ones; develop new proteomics clients for
    querying, visualisation, workflow etc.

28-32
Project Aims
[Slides 28 to 32 share this title; no text content
survives in the transcript.]
33
myGrid / DQP / AutoMed
  • myGrid: collection of services/components
    allowing high-level integration of
    data/applications for in-silico experiments in
    biology
  • DQP
  • OGSA-DAI (Open Grid Services Architecture Data
    Access and Integration)
  • Distributed query processing over OGSA-DAI
    enabled resources
  • Current research: AutoMed and DQP interoperation
  • Future research: AutoMed and myGrid workflows
    interoperation

34
DQP AutoMed Interoperability
  • Data sources wrapped with OGSA-DAI
  • AutoMed OGSA-DAI wrappers extract the data
    sources' metadata
  • Semantic integration of data sources using
    AutoMed transformation pathways into an
    integrated AutoMed schema
  • IQL queries submitted to this integrated schema
    are
  • Reformulated to IQL queries on the data sources,
    using the AutoMed transformation pathways
  • Submitted to DQP for evaluation

35
Data source schema extraction
  • AutoMed wrapper requests the schema of the data
    source using an OGSA-DAI service
  • The service replies with the source schema
    encoded in XML
  • The AutoMed wrapper creates the corresponding
    schema in the AutoMed repository
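
The last step, turning an XML-encoded source schema into schema constructs, might look like the sketch below. The XML layout (schema, table and column elements with a name attribute) is invented for illustration and is not the actual OGSA-DAI response format; the function name constructs_from_xml is likewise hypothetical. Only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Invented sample; the real OGSA-DAI schema document will differ.
SAMPLE = """
<schema>
  <table name="protein">
    <column name="id"/>
    <column name="sequence"/>
  </table>
</schema>
"""


def constructs_from_xml(xml_text):
    """List one construct per table and one per (table, column) pair."""
    root = ET.fromstring(xml_text)
    constructs = []
    for table in root.iter("table"):
        table_name = table.get("name")
        constructs.append((table_name,))
        for column in table.iter("column"):
            constructs.append((table_name, column.get("name")))
    return constructs


constructs = constructs_from_xml(SAMPLE)
```

Each tuple could then be stored as a relation or attribute construct in a schema repository.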

36
Using AutoMed in the BioMap Project
  • Relational/XML data sources containing protein
    sequence, structure, function and pathway data;
    gene expression data; other experimental data
  • Wrapping of data sources
  • Translation of source and global schemas into
    AutoMed's XML schema
  • Domain expert provides matchings between
    constructs in source and global schemas
  • Automatic schema restructuring, with automatic
    generation of schema transformation pathways
  • See DILS'05 paper for more details

[Figure: wrappers over the data sources (RDB, XML
file, ...) produce AutoMed relational and XMLDSS
schemas; transformation pathways connect these to the
AutoMed integrated schema, which a wrapper links to
the integrated database.]
37
Ongoing and future research
  • Using the BAV approach for data integration in
    Grid and P2P environments
  • The integration may be virtual, materialised or
    hybrid
  • P2P query processing over BAV pathways
  • P2P update processing over BAV pathways
  • Use of ECA rules and a P2P ECA rule execution
    engine
  • Optimisation of ECA rules on semi-structured data