SCHEMA Matching Mapping Integration Mediation

1
SCHEMA Matching Mapping Integration
Mediation

Zohra Bellahsène, Khalid Saleem, Remi Colleta
{bella, saleem, colleta}@lirmm.fr
Master 2 Mention INFORMATIQUE
350 Cours - Intégration de Données XML
I2S - LIRMM

28-09-2007
2
Road Map

Schema
Matching Schemas
Match Operation
Application Domains for Schema Matching
Match Techniques
Result of Match Operation
Mapping
Integration and Mediation
Existing Match Tools and their Deficiencies
Large Scale Scenarios
Search Space Optimization
Charlie's Algorithm, Similarity Flooding,
S-match, COMA
Tree Mining approach perspective
Conclusion

3
Schema

Data representation: a particular way to
structure the data, e.g. XML DTDs, XML Schemas,
ontologies, OO representations or ER models.
It consists of a finite set of elements
(representation elements).
An element is a syntactic construct of the
representation, e.g.
XML elements or attributes in DTDs and schemas;
concepts, attributes and relations in ontologies.
Each element is associated with a set of data
instances.

4
Schema and Ontology

Schema represents the Database Community
Schemas often do not provide explicit semantics
for their data (ER, XML document schema).
Ontology represents the AI Community
Ontologies are logical systems that themselves
obey some formal semantics. They are designed to be
interpreted by computers for reasoning (OWL).
Schemas and ontologies are similar in the sense that:
Both provide a vocabulary of elements that
describes a domain
Both constrain the meaning of the terms used in the
vocabulary (hierarchy of elements / relations
among the elements)

5
XML (semi-structured)
XML Schema
<xs:element name="employee">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
XML Document
<employee>
  <firstname>John</firstname>
  <lastname>Smith</lastname>
</employee>
Data
John Smith
6
Schema vs Ontology examples
DARPA Agent Markup Language
Ontology Inference Layer
7
OWL

OWL is built on top of RDF
OWL is for processing information on the web
OWL was designed to be interpreted by computers
OWL was not designed for being read by people
OWL is written in XML

8
OWL Example
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xml:base="http://www.daml.org/2001/10/html/airport-ont">
  <owl:Ontology rdf:about="">
    <owl:versionInfo>$Id: airport-ont.daml,v 1.1 2002/03/14 06:24:16 mdean Exp $</owl:versionInfo>
    <rdfs:comment>Airport</rdfs:comment>
  </owl:Ontology>
  <rdfs:Class rdf:ID="Airport">
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#name"/>
        <owl:allValuesFrom rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
      </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#iataCode"/>
        <owl:allValuesFrom rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
      </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:subClassOf>
    </rdfs:subClassOf>
  </rdfs:Class>
  <owl:DatatypeProperty rdf:ID="elevation"/>
  <owl:DatatypeProperty rdf:ID="iataCode"/>
  <owl:DatatypeProperty rdf:ID="icaoCode"/>
  <owl:DatatypeProperty rdf:ID="latitude"/>
  <owl:DatatypeProperty rdf:ID="location"/>
  <owl:DatatypeProperty rdf:ID="longitude"/>
  <owl:DatatypeProperty rdf:ID="name"/>
</rdf:RDF>
9
Taxonomy

Mathematically, a hierarchical taxonomy is a
tree structure of classifications for a given set
of objects
(Much lighter than an ontology)

Entity
Undergrad Courses
Grad Courses
People
Staff
Faculty
Assistant Professor
Associate Professor
Professor
CS Dept. US
11
Match Operation

Input: two schemas/ontologies
Output: a similarity correspondence between each
pair of elements (E1 of Schema A and E2 of
Schema B)
A similarity correspondence can be:
E1 = E2
E1 <> E2
E1 p E2 (E1 ∩ E2 or E1 ? E2)

12
Match Operation
A ? B
B ? A
Books Schema A: price, book-title, author-name
Books Schema B: listed-price, title, a-fname, a-lname
Correspondences shown: match, partial match
13
The 3 Dimensions of Schema Matching
X - Basic Match Research
Y - Research Domains
Z - Application Domains
14
Application and Research Domains

Data Interoperability
Data Integration
Data Warehousing
Catalogue Integration
Web Services Discovery and Composition
Query over the Web
...
Data Exchange
E-commerce
Agents Communication
P2P DB Systems

Static: the contributing schema set is not evolving,
so matching is a one-time process
Dynamic: the contributing schema set is evolving,
so the matching also evolves
15
Match Techniques based on ..

Schema

Data Instance

16
Schema Match Techniques

Schema Structure Level Techniques
Matching combinations of elements appearing
together
Graph Matching (example: directed acyclic graph,
tree)
Children, Descendants
Leaves
Relations
17
Match Techniques Element level

Language based
Tokenization, e.g. Tool_Kit → (Tool, Kit)
Lemmatization, e.g. Kits → Kit
Elimination, e.g. IsRelatedTo → Related
Word sense similarity: synonym, hypernym
(generalization), hyponym (specialization), etc.
Or verb, noun, adjective, etc.
DeliverTo ? InvoiceTo ? DeliverTo ? ShipTo
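The language-based steps on this slide (tokenization, lemmatization, elimination) can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the plural-stripping rule are assumptions made for the example, not the rules used by any particular matcher, and a real system would rely on a proper lemmatizer and thesaurus.

```python
import re

# Assumed, illustrative stop-word list for the "elimination" step.
STOP_WORDS = {"is", "to", "of", "the", "a", "an"}

def tokenize(label):
    """Split a label on underscores, hyphens, spaces and camelCase boundaries."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    return [t.lower() for t in re.split(r"[\s_\-]+", spaced) if t]

def lemmatize(token):
    """Naive stand-in for a lemmatizer: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def normalize(label):
    """Tokenize, drop stop words (elimination), then lemmatize."""
    return [lemmatize(t) for t in tokenize(label) if t not in STOP_WORDS]

print(tokenize("Tool_Kit"))      # tokens of Tool_Kit
print(normalize("Kits"))         # lemmatized form of Kits
print(normalize("IsRelatedTo"))  # IsRelatedTo with Is/To eliminated
```

After normalization, two labels can be compared token by token instead of as raw strings.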

18
Match Techniques Element level

String based
prefix, suffix, e.g. auth → author
n-gram: common consecutive substrings of size n
2-grams of kit: {ki, it}; 2-grams of kite: {ki, it, te}; 2 in common
edit distance: number of steps required to
transform one string into another
e.g. class → classe: 1 insertion
Soundex, e.g. class ≈ classe
Description matching
Approximate matching used in natural language
processing

http://www.dcs.shef.ac.uk/sam/stringmetrics.html
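Two of the string-based measures above, n-grams and edit distance, are easy to reproduce. The sketch below uses the slide's own examples (kit/kite and class/classe); the function names are chosen for illustration, and the edit distance is the standard Levenshtein variant (insert, delete, substitute), which is an assumption about which variant the slide means.

```python
def ngrams(text, n):
    """Return the set of consecutive substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def edit_distance(source, target):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions or substitutions turning source into target."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

common = ngrams("kit", 2) & ngrams("kite", 2)
print(len(common))                       # number of shared 2-grams
print(edit_distance("class", "classe"))  # one insertion
```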
19
Other techniques

Constraint based
Data Types, e.g. String, Integer, Currency, etc.
Value domain, e.g. 1..12 for month or hour
Model Based
Entity Relationship, XML documents or XML schema,
OWL, OO, etc.
Auxiliary Information
User Input
Global Schema / Ontology
Previous Match Decisions
Dictionaries, Thesauri, e.g. WordNet

20
Matching Two Schemas - Example
21
Example
22
Matching
23
Result of Match Operation

Match Candidates in Similarity Matrix
For each element of Schema A , we have a set of
possibly similar elements in schema B

Match Confidence!
24
Match Confidence

Match quality/confidence is calculated as a
numeric weight between 0 and 1,
i.e. 0 means no match and 1 means a perfect match
3-gram example
S1 = university, S2 = université
3-grams(S1) = {uni, niv, ive, ver, ers, rsi, sit, ity}, 8 in total
3-grams(S2) = {uni, niv, ive, ver, ers, rsi, sit, ité}, 8 in total
sim(S1, S2) = 1 / (1 + |3-grams(S1)| + |3-grams(S2)| - 2 |3-grams(S1) ∩ 3-grams(S2)|)
            = 1 / (1 + 8 + 8 - 2(7)) = 1/3
            = 0.333

Match confidence between S1 and S2 using the 3-gram
algorithm!
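The 3-gram confidence on this slide is 1 divided by (1 + number of 3-grams of S1 + number of 3-grams of S2 - 2 x number of shared 3-grams). A minimal Python sketch of that formula follows; the function names are illustrative, and the accented character is written as an escape so that it is a single code point.

```python
def ngrams(text, n=3):
    """Set of consecutive substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_confidence(s1, s2, n=3):
    """1 / (1 + |G1| + |G2| - 2 * |G1 & G2|), as in the slide's example."""
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    return 1.0 / (1 + len(g1) + len(g2) - 2 * len(g1 & g2))

# "universit\u00e9" is "université" with a precomposed e-acute.
print(round(ngram_confidence("university", "universit\u00e9"), 3))
```

Identical strings share all their n-grams, so the denominator collapses to 1 and the confidence is 1; the value falls as fewer n-grams are shared.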
25
Match Confidence
26
Combining different Match Algorithms

Hybrid Matcher
Directly combine several match algorithms to
determine match candidates.
Gives better performance
Multiple/Composite Matcher
Combines results of several independently
executed matchers, can include other hybrid
matchers.
Gives Flexibility depending upon the input
schemas
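A composite matcher can be pictured as a set of independent matchers whose scores are aggregated afterwards. The sketch below assumes a weighted average as the aggregation rule and uses two toy matchers; both the matchers and the weights are hypothetical, and real tools offer several aggregation strategies.

```python
def prefix_matcher(a, b):
    """Score 1.0 if one label is a prefix of the other, else 0.0."""
    a, b = a.lower(), b.lower()
    return 1.0 if a.startswith(b) or b.startswith(a) else 0.0

def exact_matcher(a, b):
    """Score 1.0 only for case-insensitive equality."""
    return 1.0 if a.lower() == b.lower() else 0.0

def composite_match(a, b, matchers):
    """Run each (matcher, weight) pair independently and
    combine the scores by weighted average (assumed rule)."""
    total_weight = sum(weight for _, weight in matchers)
    return sum(weight * matcher(a, b) for matcher, weight in matchers) / total_weight

matchers = [(prefix_matcher, 1.0), (exact_matcher, 1.0)]
print(composite_match("auth", "author", matchers))
```

Because each matcher only sees the two labels, matchers can be added, removed or re-weighted per input schema without touching the others, which is the flexibility the slide refers to.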

27
Match Algorithm Dimensions [SE05]

When designing a match algorithm,
we need to know how it will be used, i.e. its
dimensions
Input of the algorithm
Data or schema, element level or structure level
Characteristics of the matching process
Requires exact or approximate matching
Performance over quality
Output of the algorithm
Output is an approximate result,
OR it is part of a set of match algorithms which are
combined together for a map result
Integrated Schema
Mapping Expressions

29
Mapping

Selection of the best candidate match
If the candidate match element set contains more
than one element, further techniques are applied
to select the best match as the mapping.
Usually this is a manual user selection, for quality
results

Complex Map
Simple Map
30
Map Cardinality

Map Complexity - 1:1, 1:n, n:m
1:n - authorName → a-fname, a-lname
n:m - Tel. of Person

31
Map Quality Measuring Similarity

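Precision: share of real mappings among all
found
Recall: share of real mappings that are found
A = real mappings not found, B = real mappings found,
C = found mappings that are not real

Precision = B / (B + C)    Recall = B / (A + B)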
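Precision and recall can be computed directly from two sets of correspondences: the real mappings and the ones a matcher found. In the sketch below, B is the number of real mappings that were found, A the real mappings that were missed and C the found mappings that are not real; the example data is made up from the Books schemas used earlier and is purely illustrative.

```python
def precision_recall(real, found):
    """Return (precision, recall) for a set of found mappings
    measured against the set of real (expected) mappings."""
    real, found = set(real), set(found)
    b = len(real & found)   # real mappings that were found
    a = len(real - found)   # real mappings that were missed
    c = len(found - real)   # found mappings that are not real
    precision = b / (b + c) if found else 0.0
    recall = b / (a + b) if real else 0.0
    return precision, recall

# Hypothetical example data.
real = {("price", "listed-price"), ("book-title", "title"),
        ("author-name", "a-fname"), ("author-name", "a-lname")}
found = {("price", "listed-price"), ("book-title", "title"),
         ("price", "title")}
print(precision_recall(real, found))
```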
32
Schema Mapping / Ontology Alignment

Schema mapping is usually performed with the help
of techniques that try to guess the meaning encoded
in the schemas
www.xml.com/
Ontology alignment tries to exploit knowledge
explicitly encoded in the ontologies.
Ontology example (travel)
http://protege.stanford.edu/plugins/owl/

In real-world applications, solutions from both
domains are mutually beneficial
33
Semantic Web Layers
35
Example
Data Source
Consumer
Mediator
Data Source
Data Source

Schema heterogeneity a key roadblock for
information integration
Different data sources speak their own schema
Mapping is key to any data sharing architecture

36
Schema Integration and Mediation

Schema Mediation
All input schemas are merged together into one
schema without any concept redundancy, i.e.
similar concepts are represented by one concept.
All the input schemas are mapped to this schema,
called the mediated schema

37
Types of Integration Strategies [Batini86]
binary: ladder, balanced
n-ary: one-shot, iterative
http://www.ifi.unizh.ch/pziegler/IntegrationProjects.htmlx
38
Xylème warehouse
39
Mediator overview
XML Documents
XQuery Requests
e-XML Mediator
Sub-requests XPath
Sub-requests XQuery
Sub-requests XQuery
Sub-requests XQuery
Web site Wrapper
Adapter
Adapter
XDBMS
RDBMS
RDBMS
Site Web (pages HTML)
40
Nimble
Front-End
Lens Builder
User Applications
Lens File
InfoBrowser
Software Developers Kit
NIMBLE APIs
Management Tools
Integration Layer
Nimble Integration Engine
Metadata Server
Compiler
Executor
Cache
Security Tools
Common XML View
Integration Builder
Concordance Developer
Data Administrator
42
Existing Matching Tools

Ontologies Specific
NOM/ QOM
OLA
Anchor-PROMPT
S-Match [GSY04]
HICAL
SKAT
Machine Learning
GLUE (LSD, CGLUE) [DMDH02]
Automatch

Cupid [MBR01]
COMA (COMA++) [ADMR05]
Similarity Flooding
SemInt
Artemis
DIKE
TransScm
AutoMed
Charlie [TBBT04]

43
Charlie's Algorithm
x/writer/own_books: m/book/author/name, m/book/author/birth, m/book/author, m/book/publisher, m/book/title, m/book
x/writer: m/book 0.15, m/book/author 0.8, m/book/publisher 0.01, m/book/title 0, m/book/author/name 0.6
x/writer/birth: m/book/author/name, m/book/author, m/book/publisher, m/book/title, m/book
Parameters: depth_max, backtrack_max, max_dist
x/writer/name: m/book/author/name
44
Similarity Flooding

Directed graph matching technique
The technique starts from a string-based comparison
(common prefix, suffix tests) of the vertex labels
to obtain an initial matching.
The algorithm is based on the assumption that
whenever any two elements in schemas S1 and S2
are found to be similar, the similarity of their
adjacent elements increases.
Iterative process
www-db.stanford.edu/melnik/mm/sfa/
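The propagation idea can be illustrated with a much-simplified sketch. It is not the published Similarity Flooding algorithm: edge labels are ignored, the update rule and normalization are one assumed choice among several, and the function and variable names are hypothetical. It only shows how an initial string-based similarity spreads to adjacent element pairs over a few iterations.

```python
def similarity_flooding(edges_a, edges_b, initial, iterations=10):
    """Simplified propagation: a pair (a, b) gains similarity from
    pairs adjacent to it in both schema graphs. Edge labels are ignored."""
    # Pair (a1, b1) is adjacent to (a2, b2) when a1->a2 in A and b1->b2 in B.
    neighbours = {}
    for a1, a2 in edges_a:
        for b1, b2 in edges_b:
            neighbours.setdefault((a1, b1), set()).add((a2, b2))
            neighbours.setdefault((a2, b2), set()).add((a1, b1))
    pairs = set(neighbours) | set(initial)
    sim = {p: initial.get(p, 0.0) for p in pairs}
    for _ in range(iterations):
        updated = {}
        for p in pairs:
            # Each neighbour shares its similarity equally among its own neighbours.
            flow = sum(sim[q] / len(neighbours[q]) for q in neighbours.get(p, ()))
            updated[p] = initial.get(p, 0.0) + sim[p] + flow
        top = max(updated.values()) or 1.0
        sim = {p: value / top for p, value in updated.items()}  # normalize to [0, 1]
    return sim

# Hypothetical schemas: book -> {title, author} and livre -> {titre, auteur}.
edges_a = [("book", "title"), ("book", "author")]
edges_b = [("livre", "titre"), ("livre", "auteur")]
initial = {("title", "titre"): 1.0, ("author", "auteur"): 1.0}
result = similarity_flooding(edges_a, edges_b, initial)
print(sorted(result.items(), key=lambda item: -item[1]))
```

In this toy run the pair (book, livre) starts with no similarity at all and acquires it only because its children were found to be similar, which is the assumption stated on the slide.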

45
S-Match approach

S-Match is an element-level matcher
Syntactic Matching
Initially, the literal concepts of labels are computed
at nodes using WordNet.
Each label is tokenized and the sense of each token is
looked up in WordNet. The senses of the label tokens are
combined to make up the concept represented by the
label.
E.g. VineandCheese represents Vine and Cheese
Semantic Matching
Relations are computed between the concepts at nodes
The concept at a node is a combination of the label
semantics and the node's placement in the tree
S-Match utilizes the idea of schemas as trees

46
COMA - Matching Process
http://dbs.uni-leipzig.de/Research/coma.html
47
Deficiencies in Current Matching Tools

These tools do not completely fulfil the
requirements for large scale schema matching
because they are:
Not fully automated
Placing less emphasis on search space optimization
Every element of schema A is matched to every
element of schema B, and a set of algorithms is
applied to each pair of elements.
Schema A: n elements
Schema B: m elements
k: number of match algorithms
Comparisons: n x m x k
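The n x m x k cost is easy to see when the exhaustive strategy is written out as three nested loops. The sketch below is illustrative only: the two toy matchers and the use of max as the aggregation rule are assumptions, and the schemas are small made-up label lists.

```python
def exhaustive_match(schema_a, schema_b, matchers):
    """Apply every matcher to every pair of elements and
    count how many comparisons were performed."""
    comparisons = 0
    scores = {}
    for a in schema_a:                  # n elements
        for b in schema_b:              # m elements
            results = []
            for matcher in matchers:    # k match algorithms
                results.append(matcher(a, b))
                comparisons += 1
            scores[(a, b)] = max(results)  # assumed aggregation rule
    return scores, comparisons

same_label = lambda a, b: 1.0 if a == b else 0.0
same_initial = lambda a, b: 0.5 if a[0] == b[0] else 0.0

schema_a = ["price", "book-title", "author-name"]
schema_b = ["listed-price", "title", "a-fname", "a-lname"]
scores, comparisons = exhaustive_match(schema_a, schema_b, [same_label, same_initial])
print(comparisons)  # n * m * k = 3 * 4 * 2
```

With two schemas of a thousand nodes each and five matchers, the same loops perform five million comparisons, which is why the later slides look for ways to prune the search space.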

49
Large Scale Scenarios

Creating a merged schema for data integration
from two large schemas (with thousands of nodes).
For example bio-genetic taxonomies
Creating a mediated schema from a large set of
schemas (with hundreds of schemas and thousands
of nodes)
For example creating a mediated web interface
input form (schema) from the hundreds of web
interface forms (schemas) related to travel domain

50
Web Interface Schema for Travel
yahoo-form: 1 yahoo-form, 2 Where_do_you_want_to_go, 3 dep_arp_cd_1, 4 dep_arp_range_1, 5 When_are_you_traveling, 6 Depart, 7 dep_dt_mn_1, 8 dep_dt_dy_1, 9 dep_tm_1, 10 Return, 11 dep_dt_mn_2, 12 dep_dt_dy_2, 13 dep_tm_2, 14 num_cnx, 15 How_many_travelers_are_there, 16 adult_pax_cnt, 17 chld_pax_cnt, 18 senior_pax_cnt, 19 Airline_preferences, 20 cls_svc, 21 aln_cd_1
aa: 1 aa, 2 WhereDoYouWantToGo, 3 origin, 4 destination, 5 WhenDoYouWantToGo, 6 DepartureDate, 7 departureMonth, 8 departureDay, 9 departureTime, 10 ReturnDate, 11 returnMonth, 12 returnDay, 13 returnTime, 14 NumberOfPassengers, 15 numAdultPassengers, 16 numChildPassengers, 17 WhatAreYourServicePref, 18 cabinClass, 19 maximumStops, 20 carrier, 21 countryPointOfSale
nwa-form: 1 nwa-form, 2 origin, 3 EnterDepartureDate, 4 departMonth, 5 departDay, 6 departTime, 7 destination, 8 EnterReturnDate, 9 returnMonth, 10 returnDay, 11 returnTime, 12 adult
absTravel: 1 absTravel, 2 D_City, 3 A_City, 4 Depart, 5 D_Month, 6 D_Day, 7 Return, 8 R_Month, 9 R_Day, 10 ClassOfService, 11 NumAdults
51
Large Scale Integration
52
Search Space Optimization

The n x m x k complexity has to be reduced for better
performance! How?
Considering schemas as trees
Trees add structure, which allows the
classification of documents to be performed more
effectively (Charlie's Algorithm)
Labels of tree nodes have the same literal meaning
but a specific sense in specific domains, e.g. title
(Similarity Flooding)
Books domain: title of a book
Human Resource domain: title of a person (M, Mme)
Calculate this domain-specific sense of a label to
find similar labels,
thus minimizing the search space for a certain
node,
i.e. match nodes with similar-sense labels
e.g. author: writer, auth-name

53
Tree mining approach!

Why?
Basic function of tree mining is to find
sub-trees that are frequent in the given set of
trees, which is similar to schema matching
activity that tries to find similar concepts
among a set of schema trees.
So it provides us with the technique to compare
more than two schemas in parallel

54
Tree Mining
Legend: a = author, b = book, d = detail, f = information, g = general,
h = birth, i = isbn, n = name, o = own-books, p = publisher,
r = price, t = title, w = writer

Motivation
Large Scale Scenario
Peer-to-peer Information Systems over the XML Web

Tree Mining Approach
Semantic Label Concept Matcher
Element Level Matching
Structure Level Matching

Synonyms: a = w, b = o, f = d
Search sub-trees
55
Tree Mining Approach

Node scope values (calculated by depth-first
pre-order traversal), top-down [Zaki02]

Schema matching and integration process for
handling large sets of XML schema trees.
Employs
Element level Name Matcher (same node label or
synonym)
Cluster similar/synonym labels
Utilize the node scope values properties to
extract node context semantics out of the
structure
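The label-clustering step can be sketched with a small union-find over a synonym table. The labels and the three synonym pairs (author/writer, book/own-books, information/detail) come from the tree mining example; the union-find implementation itself is an assumption chosen for illustration, not necessarily how the authors cluster labels.

```python
def cluster_labels(labels, synonym_pairs):
    """Group labels into clusters so that synonyms end up together."""
    parent = {label: label for label in labels}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for first, second in synonym_pairs:
        parent[find(first)] = find(second)

    clusters = {}
    for label in labels:
        clusters.setdefault(find(label), set()).add(label)
    return list(clusters.values())

labels = ["author", "book", "detail", "information", "general", "birth",
          "isbn", "name", "own-books", "publisher", "price", "title", "writer"]
synonyms = [("author", "writer"), ("book", "own-books"), ("information", "detail")]
clusters = cluster_labels(labels, synonyms)
print(len(clusters))
```

Once labels are clustered, a node only needs to be compared with nodes whose labels fall in the same cluster, which shrinks the search space before any structural check is applied.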

56
Clustering Search space optimization
Synonym table: a = w, b = o, f = d
R
57
Property: Descendant Node Check
Descendant Node Check: if the scope of node x is [X, Y]
and the scope of a descendant node xd is [Xd, Yd], then
Xd > X and Yd <= Y

publisher is mapped to publisher
publisher name can be mapped to writer/name or
publisher/name

The node with label name, n4 [4,4], is a descendant of
node n3 [3,4] (label publisher), whereas the node name,
n2 [2,2], is not; this is verified using the descendant test
For n2: 2 > 3 and 2 <= 4 (False); for n4: 4 > 3
and 4 <= 4 (True)
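Node scopes and the descendant test can be reproduced with a short depth-first traversal. In the sketch below a scope is the pair (pre-order number of the node, pre-order number of its last descendant). The four-node tree, with a root n1 above name (n2) and publisher (n3), and name (n4) below publisher, is an assumption reconstructed from the scope values quoted on the slide.

```python
def assign_scopes(tree, root):
    """Number nodes in depth-first pre-order and return
    {node: (own number, number of its last descendant)}."""
    scopes = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        start = counter
        for child in tree.get(node, []):
            visit(child)
        scopes[node] = (start, counter)

    visit(root)
    return scopes

def is_descendant(scope_d, scope_x):
    """Descendant test: Xd > X and Yd <= Y."""
    xd, yd = scope_d
    x, y = scope_x
    return xd > x and yd <= y

# Assumed tree: n1 -> n2 (name), n3 (publisher); n3 -> n4 (name).
tree = {"n1": ["n2", "n3"], "n3": ["n4"]}
scopes = assign_scopes(tree, "n1")
print(scopes)
print(is_descendant(scopes["n4"], scopes["n3"]))  # name under publisher
print(is_descendant(scopes["n2"], scopes["n3"]))  # name under the root only
```

Because the test only compares two pairs of integers, the context of a node can be checked without walking the tree again.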

58
Conclusion

Element level Name and Linguistic Matching with
the support of thesaurus is an integral part of
every Match system.
With systems moving towards schema/ontology based
manipulation, and lack of global schemas or
previous matching results, Structure Level
matching is equally important for making out the
semantics.
Peer-to-peer environment requires new methods to
be exploited for performance and quality mapping
i.e. integration of Tree Mining techniques for
matching purposes and search space optimisation.
Machine Learning algorithms can be beneficial in
the P2P environment in later stages when training
examples have been created from instance data,
provided the target domain remains the same.

59
Some References

[AH04] Antoniou G., Harmelen F. A Semantic Web Primer. The MIT Press, 2004.
[ADMR05] Aumuller D., Do H. H., Massmann S. and Rahm E. Schema and ontology matching with COMA++. In Proceedings of the International Conference on Management of Data (SIGMOD), 2005.
[BR04] Bellahsène Z. and Roantree M. (2004) Querying Distributed Data in a Super-peer based Architecture. DEXA 2004.
[BMP04] Bernstein PA., Melnik S., Petropoulos M. and Quix C. (2004) Industrial-Strength Schema Mapping. SIGMOD Record, Vol. 33, No. 4, December 2004.
[DMDH02] Doan AH., Madhavan J., Domingos P. and Halevy A. (2002) Learning to Map Ontologies on the Semantic Web. WWW 2002.
[MBR01] Madhavan J., Bernstein PA. and Rahm E. (2001) Generic Schema Matching with Cupid. VLDB 2001.
[RB01] Rahm E. and Bernstein PA. (2001) A Survey of Approaches to Automatic Schema Matching. VLDB Journal, 2001, 10(4): 334-350.
[SE05] Shvaiko P. and Euzenat J. (2005) A Survey of Schema-based Matching Approaches. Journal on Data Semantics, 2005.
[TBBT04] Tranier J., Baraer R., Bellahsène Z. and Teisseire M. (2004) Where's Charlie: Family Based Heuristics for Peer-to-Peer Schema Integration. IDEAS 2004, 227-235.
[Zaki02] Zaki MJ. (2002) Efficiently Mining Frequent Trees in a Forest. 8th ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining, July 2002.
http://www.w3.org/TR/daml+oil-reference
http://www.doc.ic.ac.uk/automed/

60
Thank you
61
Backup slides
62
URI

A URI can be classified as a locator or a name or
both. A Uniform Resource Locator (URL) is a URI
that, in addition to identifying a resource,
provides means of acting upon or obtaining a
representation of the resource by describing its
primary access mechanism or network "location".
For example, the URL http://www.wikipedia.org/ is
a URI that identifies a resource (Wikipedia's
home page) and implies that a representation of
that resource (such as the home page's current
HTML code, as encoded characters) is obtainable
via HTTP from a network host named
www.wikipedia.org. A Uniform Resource Name (URN)
is a URI that identifies a resource by name in a
particular namespace. A URN can be used to talk
about a resource without implying its location or
how to dereference it. For example, the URN
urn:isbn:0-395-36341-1 is a URI that, like an
International Standard Book Number (ISBN), allows
one to talk about a book, but doesn't suggest
where and how to obtain an actual copy of it.