Title: AutoMed
1AutoMed
- A Heterogeneous Data Integration System
2Outline
- Both-As-View (BAV) approach
- GAV & LAV approaches
- BAV approach
- Comparison of integration approaches
- BAV advantages
- The AutoMed system
- Architecture
- Current & future work
- Testbeds
3GAV & LAV Approaches
- Global-As-View (GAV) approach: describe GS constructs with view definitions over LSi constructs
- Local-As-View (LAV) approach: describe LSi constructs with view definitions over GS constructs
4Global-As-View Approach (GAV)
- student(id,name,left,degree) = {x,y,z,w | ⟨x,y,z,w,_⟩ ∈ ug ∧ ⟨x,_,_,_,_⟩ ∉ phd ∨ ⟨x,y,z,w,_⟩ ∈ phd ∧ w = phd}
- monitors(sno,id) = {x,y | ⟨x,_,_,_,y⟩ ∈ ug ∧ ⟨x,_,_,_,_⟩ ∉ phd ∨ ⟨x,y⟩ ∈ supervises}
- staff(sno,sname,dept) = {x,y,z | ⟨x,y,z,w,_⟩ ∈ tutor ∧ ⟨x,_,_⟩ ∉ supervisor ∨ ⟨x,y,z⟩ ∈ supervisor}
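The GAV idea (a global construct defined as a view over local constructs) can be sketched in Python. This is only an illustration: the sample rows and the attribute layout of the local relations ug and phd are assumptions, not taken from the slides.

```python
# Hypothetical local relations; the column layouts are assumptions.
ug = {(1, "Ann", 2004, "BSc", 10), (2, "Bob", 2005, "BEng", 11)}  # (id, name, left, degree, sno)
phd = {(3, "Cy", 2006, "Thesis X")}                               # (id, name, left, title)

def student_gav(ug_rows, phd_rows):
    """Global relation student(id, name, left, degree) as a view over ug and phd."""
    phd_ids = {row[0] for row in phd_rows}
    # Undergraduates that are not also PhD students keep their own degree.
    from_ug = {(i, n, l, d) for (i, n, l, d, _sno) in ug_rows if i not in phd_ids}
    # PhD students are given the constant degree value "phd".
    from_phd = {(i, n, l, "phd") for (i, n, l, _title) in phd_rows}
    return from_ug | from_phd

students = student_gav(ug, phd)
```

Here the global relation is computed purely from the local ones, which is why a change to a local schema forces the view definition to be rewritten.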
5Local-As-View Approach (LAV)
- tutor(sno,sname) = {x,y | ⟨x,y,_⟩ ∈ staff ∧ ⟨x,z⟩ ∈ monitors ∧ ⟨z,_,_,w⟩ ∈ student ∧ w ≠ phd}
- ug(id,name,left,degree,sno) = {x,y,z,w,v | ⟨x,y,z,w⟩ ∈ student ∧ ⟨v,x⟩ ∈ monitors ∧ w ≠ phd}
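In the LAV direction, a local construct is described as a view over the global schema. A minimal Python sketch for tutor, using made-up sample rows for the global relations staff, monitors and student (the data is an assumption for illustration only):

```python
# Hypothetical global relations with sample rows.
staff = {(10, "Dr A", "CS"), (11, "Dr B", "Maths")}            # (sno, sname, dept)
monitors = {(10, 1), (11, 3)}                                  # (sno, id)
student = {(1, "Ann", 2004, "BSc"), (3, "Cy", 2006, "phd")}    # (id, name, left, degree)

def tutor_lav(staff_rows, monitors_rows, student_rows):
    """Local relation tutor(sno, sname): staff who monitor a non-PhD student."""
    non_phd_ids = {sid for (sid, _name, _left, degree) in student_rows if degree != "phd"}
    return {
        (sno, sname)
        for (sno, sname, _dept) in staff_rows
        for (msno, sid) in monitors_rows
        if msno == sno and sid in non_phd_ids
    }

tutors = tutor_lav(staff, monitors, student)
```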
6Both-As-View (BAV) (1/3)
- Schema transformation approach
- For each pair (LSi,GS): incrementally modify LSi/GS to match GS/LSi
7Both-As-View (BAV) (2/3)
- Common Data Model: Hypergraph Data Model (HDM)
- Constructs are nodes, edges & constraints
- It avoids the semantic mismatches that may occur between constructs of higher-level modelling languages
8Both-As-View (BAV) (3/3)
- Modify using primitive schema transformations
- add/delete
- rename
- extend/contract
- Supply transformations with queries
- add(⟨⟨table,attrib3⟩⟩, q), where q = [{t,(a1+a2)} | {t,a1} ← ⟨⟨table,attrib1⟩⟩; {t,a2} ← ⟨⟨table,attrib2⟩⟩]
- extend(⟨⟨table,attrib3⟩⟩, q1,q2)
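As a rough illustration of an add transformation supplied with a query, the Python sketch below models a schema as a dictionary from constructs to extents. The dictionary representation, the sample values and the use of + to combine the two attribute values are all assumptions made for this example; it is not AutoMed's API.

```python
# Toy schema: each construct maps to its extent, a list of (key, value) pairs.
schema = {
    ("table", "attrib1"): [(1, 10), (2, 20)],
    ("table", "attrib2"): [(1, 5), (2, 7)],
}

def q(s):
    """Pair each key t with a1 + a2 (the '+' is an assumed combining operator)."""
    second = dict(s[("table", "attrib2")])
    return [(t, a1 + second[t]) for (t, a1) in s[("table", "attrib1")] if t in second]

def add(s, construct, query):
    """Return a new schema in which `construct` exists with the extent given by `query`."""
    if construct in s:
        raise ValueError(f"{construct} already exists")
    extended = dict(s)
    extended[construct] = query(s)
    return extended

new_schema = add(schema, ("table", "attrib3"), q)
```

Because the query fully determines the new construct from existing ones, the same query is what allows the step to be undone later.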
9Example (1/2)
- S1 → Sg
- add(⟨⟨monitors⟩⟩,q1)
- add(⟨⟨monitors,sno⟩⟩,q2)
- add(⟨⟨monitors,id⟩⟩,q3)
- add(⟨⟨tutor,dept⟩⟩,q4)
- rename(⟨⟨ug⟩⟩,⟨⟨student⟩⟩)
- rename(⟨⟨tutor⟩⟩,⟨⟨staff⟩⟩)
- delete(⟨⟨student,sno⟩⟩,q5)
- S2 → Sg can be derived similarly
10Example (2/2)
- Automatically derivable reverse transformations
- add(C,q)/extend(C,q1,q2) → delete(C,q)/contract(C,q1,q2)
- delete/contract → add/extend
- rename(C1,C2) → rename(C2,C1)
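The reversal rules above are mechanical, so a pathway can be reversed step by step. A minimal Python sketch, assuming each step is represented as a tuple of an operation name followed by its arguments (this tuple encoding is an assumption for illustration):

```python
# Each primitive maps to the primitive that undoes it.
INVERSE = {"add": "delete", "delete": "add", "extend": "contract", "contract": "extend"}

def reverse_step(step):
    """Reverse one primitive transformation."""
    op, *args = step
    if op == "rename":
        old, new = args
        return ("rename", new, old)
    # add/delete and extend/contract swap, keeping the construct and queries.
    return (INVERSE[op], *args)

def reverse_pathway(pathway):
    """Reverse every step and apply them in the opposite order."""
    return [reverse_step(step) for step in reversed(pathway)]

s1_to_sg = [
    ("add", "monitors", "q1"),
    ("rename", "ug", "student"),
    ("delete", "student,sno", "q5"),
]
sg_to_s1 = reverse_pathway(s1_to_sg)
```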
11BAV vs. LAV, GAV & GLAV
- BAV approach subsumes other integration approaches
- Can be used to derive GAV & LAV view definitions (ICDE03)
- Comparison with GAV, LAV & GLAV in DBIS'04
12Schema Evolution
- In GAV & LAV, view definitions have to be regenerated
- The BAV approach readily supports the evolution of both local and global schemas
- In particular (CAiSE02 & ICDE03 papers)
- if the evolved schema is semantically equivalent to the original schema, schema evolution is automatic
- if the evolved schema is a contraction of the original schema, schema evolution is automatic
- if the evolved schema is an extension of the original schema, then domain knowledge may be required (but again the pathway can be evolved rather than regenerated)
13Local Schema Evolution Example
- Define the evolution of the global or local
schema as a schema transformation pathway from
the old to the new schema
14Types Of Integration
- Virtual integration
- Materialised integration
- Hybrid integration
15Outline
- Both-As-View (BAV) approach
- GAV & LAV approaches
- BAV approach
- Comparison with GAV, LAV & GLAV
- BAV advantages
- The AutoMed system
- Architecture
- Current & future work
- Testbeds
16The AutoMed System
- The AutoMed toolkit implements the BAV data integration approach
- AutoMed repository
- Model Definitions Repository (MDR)
- Schema Transformation Repository (STR)
- AutoMed query language: IQL
- Higher-level query languages are translated to IQL
- IQL is translated to the query languages of the datasources
17Query Engine
18Query Engine
- Query Reformulator
19Wrappers
- Current
- Relational (Oracle, PostgreSQL, SQLServer)
- XML documents (DOM & SAX)
- YATTA
- RDF
- Near future
- Object-oriented (ODMG 3.0 compliant)
- Native XML Databases (Xindice, Sedna)
- RDF Schema Specific DataBase (RSSDB)
20Testbeds
- BioMap
- http://www.biochem.ucl.ac.uk/bsm/biomap
- Data warehouse containing diverse biological data
- AutoMed used for the creation and maintenance of the data warehouse
- ISPIDER
- http://www.ispider.man.ac.uk
- Develop a Grid architecture for sharing data from various biological data sources (such as BioMap)
- Extend AutoMed system with Grid services
21Development/Research Areas
- Query engine
- Query processing & optimisation
- Query language translation
- Tools
- Data Warehousing & data lineage
- Automatic schema matching (data mining)
- Automatic integration of XML data sources
- Unstructured/semi-structured data
- Transformation pathway optimisation
- Visualisation tool
- Grid/P2P architecture
22Project Information
- Homepage: http://www.doc.ic.ac.uk/automed
- Technical details
- Papers
- Technical reports
- Software
- AutoMed releases
- Documentation
23Project Members
- Birkbeck College
- Alexandra Poulovassilis (P.I.)
- Hao Fan
- Dean Williams
- Lucas Zamboulis
- Past members
- Tanvir Amed Faqueer
- Edgar Jasper
- Dimitri Theodoratos
- Imperial College
- Peter McBrien (P.I.)
- Mike Boyd
- Sasivimol Kittivoravitkul
- Nikolaos Rizopoulos
- Nerissa Tong
- Past members
- Siegfried Hodgson
- Charalambos Lazanitis
24XML Data Transformation & Integration
- Lucas Zamboulis, Alexandra Poulovassilis
- {lucas,ap}_at_dcs.bbk.ac.uk
25Overview
- Objective: restructuring & integration of XML files
- Motivation
- Interoperability
- Related work on relational databases
- Need for XML-specific solutions
26Outline
- Semantic Heterogeneity
- Schema Matching
- Ontologies
- Structural Heterogeneity
- XML schema type in AutoMed
- Schema transformation
- Schema integration
27Semantic Heterogeneity
- Problem definition
- Schema Matching
- Data mining
- Neural networks
- Machine learning (LSD)
- Ontologies (RDFS/OWL)
28Schema Matching (1/2)
- Types
- 1-1, 1-n, n-1, n-m
- Subset, superset, equivalence
- Use schema matching output to create the
intermediate schemas used by the schema
restructuring / schema integration algorithms
29Schema Matching (2/2)
- Necessary transformations
- add attributes day, month, year in S
- delete attribute dob from S
- The reverse transformation pathway describes an n-1 match
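The 1-n match between dob and day, month, year can be illustrated on a single record. In this Python sketch the record is a plain dictionary and dob is assumed to be stored as a "YYYY-MM-DD" string; both are assumptions for the example.

```python
def split_dob(record):
    """1-n direction: add day, month and year, then delete dob."""
    year, month, day = record["dob"].split("-")
    out = {k: v for k, v in record.items() if k != "dob"}
    out.update(day=int(day), month=int(month), year=int(year))
    return out

def merge_dob(record):
    """n-1 direction (the reverse pathway): rebuild dob from day, month and year."""
    out = {k: v for k, v in record.items() if k not in ("day", "month", "year")}
    out["dob"] = f"{record['year']:04d}-{record['month']:02d}-{record['day']:02d}"
    return out

person = {"name": "Ann", "dob": "1980-03-07"}
restructured = split_dob(person)
```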
30Structural Heterogeneity
- Problem: the same information can be represented in many different ways
- Ancestor/descendant ↔ different branches
- Elements & attributes not clearly distinguished in XML model
- Ordering policy
31Aims
- XML-specific solution
- Insert-remove-rename operations on elements, attributes, edges
- Efficient move (node/subtree) operation
- Element-to-attribute, attribute-to-element transformations
- Avoid loss of data due to structural incompatibilities
- Automation
32A Schema Type For XML
- DTD
- Advantage: wide adoption
- Disadvantages
- Non-XML format
- Grammar
- XML Schema
- Advantage: XML format
- Disadvantages
- Grammar
- Unnecessary complexity
33XML DataSource Schema (1/3)
- Basic characteristics
- Structure-only representation
- XML format → ease of traversal & manipulation
- Automatically derived from an XML file
- XMLDSS from other schema types (DTD, XML Schema)
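To give a feel for deriving a structure-only schema from an XML file, the Python sketch below collapses a document into a tree with one node per distinct element under each parent and with attribute names but no values. It is a simplified stand-in written for illustration, not the actual XMLDSS derivation; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def merge(target, other):
    """Fold the structure of `other` into `target` (same tag assumed)."""
    for name in other.attrib:
        target.attrib.setdefault(name, "")
    for child in other:
        existing = target.find(child.tag)
        if existing is None:
            target.append(child)
        else:
            merge(existing, child)

def summarise(elem):
    """Return a structure-only copy of `elem`: tags and attribute names, no data."""
    summary = ET.Element(elem.tag, {name: "" for name in elem.attrib})
    for child in elem:
        child_summary = summarise(child)
        existing = summary.find(child.tag)
        if existing is None:
            summary.append(child_summary)
        else:
            merge(existing, child_summary)
    return summary

doc = ET.fromstring(
    "<lib><book id='1'><title>A</title></book>"
    "<book><title>B</title><year>2000</year></book></lib>"
)
schema = summarise(doc)
```

The two book elements collapse into one schema node that carries the union of their children and attributes.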
34XML DataSource Schema (2/3)
35XML DataSource Schema (3/3)
- XMLDSS is being extended
- Structural summary → schema type (persistence, describe multiple documents)
- Constraints
- Primary/foreign keys
- Cardinality
- Ordering
- If present, translate DTD/XML Schema to XMLDSS
36Schema Transformation (1/2)
- Target schema T given
- Source schema S is transformed to match the
structure of T
37Schema Transformation (2/2)
- Schema matching phase
- Schema transformation phase
- id phase
- Target schema materialisation
38Algorithm
- Growing phase: traverse the target schema and issue an add/extend transformation for every construct that does not exist in the source schema.
- Shrinking phase: traverse the source schema and issue a delete/contract transformation for every construct that does not exist in the target schema.
- Completeness of algorithm
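The two phases can be sketched in Python over schemas represented as ordered lists of construct names. The list representation and the combined "add/extend" and "delete/contract" labels are simplifications assumed for this example; choosing between add and extend (or delete and contract) depends on whether a defining query is available, which the sketch does not model.

```python
def transformation_pathway(source, target):
    """Growing phase followed by shrinking phase, in traversal order."""
    # Growing phase: constructs of the target that the source lacks.
    growing = [("add/extend", c) for c in target if c not in source]
    # Shrinking phase: constructs of the source that the target lacks.
    shrinking = [("delete/contract", c) for c in source if c not in target]
    return growing + shrinking

# Inserting an element C between A and B.
source = ["<A>", "<B>", "<A,B>"]
target = ["<A>", "<C>", "<B>", "<A,C>", "<C,B>"]
pathway = transformation_pathway(source, target)
```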
39Transformation Types
- AutoMed primitive transformations
- add/extend
- delete/contract
- rename
- Schema level
- Insert, remove or rename schema constructs
- Move element/subtree
- Element ↔ attribute
40Example 1
- Insert element C
- ext(<C>,Void,Any)
- ext(<A,C>,Void,Any)
- ext(<C,B>,Void,Any)
- del(<A,B>,q)
- Remove element C
- add(<A,B>,q)
- con(<C>,Void,Any)
- con(<C,B>,Void,Any)
- con(<A,C>,Void,Any)
41Example 2
- Insert/remove edge: move operation
42Example 3
- Move
- add(<root,B>,q3)
- add(<B,A>, [{b,a} | {a,b} ← <A,B>])
- delete(<A,B>, [{a,b} | {b,a} ← <B,A>])
- Complete
- add(<B>, <B>q1)
- add(<A,B>, <A,B>q2)
- delete(<A,B>, <A,B>)
- delete(<B>, <B>)
- rename(<B>, <B>)
43Example 1 - revisited
- Actually, this can also be treated with an
add/delete transformation
44Example 4
- Element-to-attribute transformation
- insert(<A,AB>,q)
- remove(<A,B>,q)
- remove(<B,PCDATA>,q)
- remove(<B>,q)
- Attribute-to-element transformation
- insert(<B>,q)
- insert(<A,B>,q)
- insert(<B,PCDATA>,q)
- remove(<A,AB>,q)
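At the data level, the effect of the two transformations can be shown with Python's standard xml.etree.ElementTree module. This is a sketch of the effect on one document, not of the AutoMed schema-level operations; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

def element_to_attribute(parent, child_tag):
    """Replace a text-only child element by an attribute of the same name."""
    child = parent.find(child_tag)
    if child is None or len(child) > 0:
        raise ValueError("child must exist and contain only text")
    parent.set(child_tag, child.text or "")
    parent.remove(child)

def attribute_to_element(parent, name):
    """Replace an attribute by a child element holding its value as text."""
    value = parent.attrib.pop(name)
    child = ET.SubElement(parent, name)
    child.text = value

a = ET.fromstring("<A><B>hello</B></A>")
element_to_attribute(a, "B")
as_attribute = dict(a.attrib)   # snapshot after the first transformation
attribute_to_element(a, "B")
```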
45Schema Integration Type I
46Schema Integration Type II
- Type I integration performs two tasks at once
- schema integration
- schema improvement
- Type II
- Augment with missing constructs
- Remove redundant constructs
47Schema Integration Type II
- Improve GS as a second step
48Materialisation
- Strategy
- Materialise root and its attributes
- Consider all edges (ep,ec) in a depth-first way
- Materialise ec and its attributes
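The strategy above amounts to a depth-first walk over the schema's parent-child edges. A minimal Python sketch that only computes the order in which elements would be materialised, assuming the schema is a tree given as a list of (ep, ec) pairs (the representation and the element names are assumptions):

```python
def materialisation_order(root, edges):
    """Return the elements in the order they would be materialised."""
    order = [root]  # the root (and its attributes) comes first

    def visit(parent):
        # Follow each edge (ep, ec) leaving this parent, depth-first.
        for ep, ec in edges:
            if ep == parent:
                order.append(ec)
                visit(ec)

    visit(root)
    return order

edges = [("root", "A"), ("A", "B"), ("root", "C")]
order = materialisation_order("root", edges)
```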
49Conclusions
- XML specific solution
- element ↔ attribute transformations
- move operation
- No loss of data by synthetically creating missing
structure
50Evaluation
- BIOMAP
- Integration of biological data sources
- Relational databases, XML documents, XML databases
51Future Work
- Short-term
- Use ontologies for resolving semantic heterogeneity
- Extend XMLDSS
- Native XML Databases (Xindice, Sedna)
- XML-Enabled Databases (Oracle)
- Long-term
- Schema integration
- GS improvement (type II)
- Overlapping data identification
- Targeted rematerialisation of GS
- Schema evolution