Title: Young-Kwang Nam
1A Metadata Integration Assistant Generator for
Heterogeneous Databases
- Young-Kwang Nam
- Joseph Goguen
- Guilian Wang
2Data Integration in Synthetic Scientific
Applications
Applications
Integrated result without inconsistency, etc.
Query
global unified schema/ontology
Integration System
local schema/ontology
local schema/ontology
local schema/ontology
data source 1
data source 2
data source n
3Why Difficult Data Heterogeneity
- Platform System Heterogeneity
- OS, Hardware
- DBMSs, Concurrency control and recovery
capabilities - Syntactic Structural Heterogeneity
- Machine readable aspects of representation
- Data models, Schemas,
- Semantic Heterogeneity
- Naming conflicts synonyms, homonyms
- Scaling precision conflicts
- Sampling rates, error distribution, etc.
4More Difficult Flexible Integration
- No all-encompassing system satisfies everyone
- frequent update of sources
- frequent change of user requirements
- non-published data from ones own lab
- Simplicity and readability are more desirable
than completeness or exhaustiveness to domain
scientists - Domain knowledge is crucial for
- solving heterogeneities
- query optimization
- Desirable to support domain scientists to do data
integration on their own
5A Common Data Integration Architecture
Query
Result
An Integrated View Materialized or Virtual
Mediator
Wrapper
Wrapper
Wrapper
data source 1
data source 2
data source n
6Structural vs. Semanticwrt Mediation Level
- Structural approach (Mediated schema approach)
- integration by generating mediated schema that
characterize a set of data sources - Semantic approach (Ontology-based approach)
- difficult to integrate structural aspects of
sources from semantic perspective due to
inherent embedded semantics within local schemas
implicit assumptions - integration by sharing a common ontology among
the differentdata sources
7Global-as-view vs. Local-as-viewwrt Mapping
Direction
- Global-as-view approach
- each item in Global schema/ontology as a view
(query) over source schemas/ontologies - query(G) query(f(S1, S2, , Sn))
- straightforward query rewriting
- Local-as-view approach
- Each source as a view/query over global
schema/ontology - query(G) query(f1-1 (S1), f2-1(S2), , fn-1
(Sn)) - easy adding or removing sources
8Representative Systems
- TSIMMIS (Stanford IBM, 1995)
- MedMaker (Stanford, 1996)
- MIX (SDSCUCSD, 2000)
- IM (ATT, 1996)
- ClioGarlic (IBM, 2000)
- DIXSE (UT, 2001)
- XYLEME (2001)
- HERMES (UMD, 1994)
- SIMS (USC, 1996)
- Observer (UG, 1996)
- Infosleuth (MCC, 1997)
- COIN (MIT, 1999)
- Ontobroker (Ger., 2000)
- KIND (SDSCUCSD, 2001)
9Our Approach
- Virtual Integration retrieve data and resolve
conflicts at query time, easy maintenance - Structural Approach take users knowledge on
data semantics hidden in structural information
as input to achieve semantic mediation - Local-as-view easily adds or removes sources,
convenient to fit applications - GUI for specifying semantic mappings through
assigning same index to same meaning nodes
(paths) - Automatically generate DDXMI for query
decomposition - Semantic functions
10Current Prototype Architecture
11Distributed Database XML Metadata Interface
(DDXMI)
- Include Database or XML document name or location
information - Contain table columns or XML path information
- Function or operation name for resolving semantic
issues about table columns or XML elements and
attributes
12DDXMI DTD
lt!ELEMENT DDXMIA (DDXMI.header,
DDXMI.isequivalent, documentspec)gt lt!ELEMENT
DDXMI.header (documentation,version,date,authoriza
tion)gt lt!ELEMENT documentation (PCDATA)gt lt!ELEMEN
T version (PCDATA)gt lt!ELEMENT date
(PCDATA)gt lt!ELEMENT authorization
(PCDATA)gt lt!ELEMENT DDXMI.isequivalent
(source,destination)gt lt!ELEMENT source
(PCDATA)gt lt!ELEMENT destination
(PCDATA)gt lt!ELEMENT documentspec (document,
(elementname,operation))gt lt!ELEMENT document
(PCDATA)gt lt!ELEMENT elementname
(PCDATA)gt lt!ELEMENT operation (PCDATA)gt
13How to generate DDXMI
- Define a Master DTD (global schema) based on
application requirements for choosing elements or
tables from the distributed systems - Parse the master DTD and generate a path for each
element from root to current element - Assign the master index number to the site
element node which has the same meaning of the
master DTD node - May include a function name for some nodes
- Generate DDXMI file automatically by collecting
over same index numbers
14Generate Master Index
15Site1 Book1 DTD Tree
Index number
function name
16Book1 Path Information
Site1 Index
Master Index
- 0 book1.xml
- 1 /bib/book
- 11 /bib/book/price
- /bib/book/author
- 1211 /bib/book/author/first
- 1212 /bib/book/author/last
- /bib/book/title
- 15 /bib/book/publisher
- 16 /bib/book/editor
- 161 /bib/book/editor/affiliation
- 162 /bib/book/editor/last
- 162 /bib/book/editor/first
17Site 2 Book2 DTD Tree
18Book2 Path Information
Site2 Index
Master Index
0 book2.xml 1 /arts/book 12 /arts/book/author
1211 /arts/book/author/firstname 1212
/arts/book/author/lastname 13 /arts/book/title
15 /arts/book/publisher
19Site 3 Book3 DTD Tree
20Book3 Path Information
Site3 Index
Master Index
- 0 book3.xml
- /bookstore/book
- 11 /bookstore/book/price
- 12 /bookstore/book/author
- 1211 /bookstore/book/author/name
- 1212 /bookstore/book/author/name
- 13 /bookstore/book/title
21XML Query Languages
- XQL takes a document point of view
- XML-QL takes a database point of view
- Quilt draws from both areas
- proposed by Don Chamberlin, Jonathan Robie, and
Daniela Florescu - Kweelt (University of Washington), a XML query
engine based on Quilt, used in our prototype - XQuery proposal follows Quilt closely
22How to generate site queries
- Parse the master query, a query over the global
schema - If encounter a path, depending on its kind, get
corresponding path name from DDXMI file and
substitute it - If there is no corresponding path in the DDXMI,
then put it as a null value - ? no queries generated for that site
23How to get site element names
In Quilt Query 1.book ? bookstore/book 2.
price ? bookstore/book/price_info/price ?
price_info/price
cut!!
ltsourcegtbooklt/sourcegt ltdestinationgtbooksore/boo
klt/destinationgt ltsourcegtbook/pricelt/sourcegt
ltdestinationgtbookstore/book/price_info/priceltdesti
nationgt
2411 Mapping Example
FOR book IN document("book.xml")//book publ
isher "Addison-Wesley" RETURN
ltbookgtbook/titlelt/bookgt
Book1
Master index
bib
Book2
book
arts
book
book
price
editor
publisher
title
year
author
title
publisher
publisher
title
Book3
full_name
affiliation
full_name
bookstore
book
last_name
first_name
title
25Query Execution Result
261N Mapping Example
FOR edi IN document("book.xml")//book/editor
RETURN lteditorgtedi/full_namelt/editorgt
27Query Execution Result
28N1 Mapping Example
FOR a IN document("book.xml")//book//author RETUR
N ltauthorgt a/last_name,a/first_name lt/authorgt
ltoperationgtfstringlt/operationgt
ltoperationgtlstringlt/operationgt
29Query Generation Result
import split as UDF_split FUNCTION
fstring(str) split(" ",str)1 FUNCTION
lstring(str) split(" ",str)2 FOR a IN
document("book3.xml") //book//author RETURN
ltauthorgt fstring(a/name), lstring(a/name) lt/a
uthorgt
30Query Execution Result
31Semantic Function Involved Example
FOR book IN document("book.xml")//book RETURN
ltbookgt book/title,book/author,boo
k/price lt/bookgt
ltoperationgtdiv(100)lt/operationgt
32Query Execution Result
33Remaining Issues
- Handle attributes one DTD has an attribute but
others dont, or an attribute in one DTD as an
element in others - More efficient way for generating DDXMI file
automatically when there are many paths in the
master DTD - e.g., treetree mapping if two paths are
indicated as the same and have the same children,
then the index numbers should be generated
automatically - Migrate to XML schemas, instead of DTDs
- Support JOIN, PRODUCT generated by queries
- Move to XQuery and a query engine with
distributed query support - Integrate the individual site query results as
one return as a single data source ready for
further analysis - Provide mechanisms for removing redundancy
- Justify the semantics of the query generated
34Conclusion
- Our prototype uses distributed metadata to
generate a GUI tool to describe mappings between
master and local databases by assigning index
numbers and specifying conversion function names - Uses Quilt as its XML query language.
- A DDXMI file is generated based on the mappings,
and is used to translate queries over the virtual
master database into sub-queries to local
databases - An experiment testing feasibility is reported in
which 3 different bibliography databases are
integrated. - Implemented with Java Webserver and JavaCC
- Move to real applications, e.g. in the context of
NSF project SEEK (Science Environment for
Ecological Knowledge)