Title: SCHEMA Matching Mapping Integration Mediation
1 SCHEMA Matching, Mapping, Integration, Mediation
- Zohra Bellahsène, Khalid Saleem, Remi Colleta
- bella, saleem, colleta_at_lirmm.fr
- Master 2 Mention INFORMATIQUE
- 350 Cours - Intégration de Données XML
- I2S - LIRMM
28-09-2007
2 Road Map
- Schema
- Matching Schemas
- Match Operation
- Application Domains for Schema Matching
- Match Techniques
- Result of Match Operation
- Mapping
- Integration and Mediation
- Existing Match Tools and their Deficiencies
- Large Scale Scenarios
- Search Space Optimization
- Charlie's Algorithm, Similarity Flooding, S-Match, COMA
- Tree Mining approach perspective
- Conclusion
3 Schema
- Data representation: a particular way to structure the data, e.g. XML DTDs, XML Schemas, ontologies, OO representations or ER models.
- It consists of a finite set of elements (representation elements).
- An element is a syntactic construct of the representation, e.g.
  - XML elements or attributes in DTDs and schemas,
  - concepts, attributes and relations in ontologies.
- Each element is associated with a set of data instances.
4 Schema and Ontology
- Schema: represents the database community
  - Schemas often do not provide explicit semantics for their data (ER, XML document schema).
- Ontology: represents the AI community
  - Ontologies are logical systems that themselves obey some formal semantics. They are designed to be interpreted by computers for reasoning (OWL).
- Schemas and ontologies are similar in the sense that
  - both provide a vocabulary of elements that describes a domain;
  - both constrain the meaning of the terms used in the vocabulary (hierarchy of elements, relations among the elements).
5 XML: semi-structured data
XML Schema:
<xs:element name="employee">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
XML Document:
<employee>
  <firstname>John</firstname>
  <lastname>Smith</lastname>
</employee>
Data: John Smith
6 Schema vs Ontology examples
- DAML: DARPA Agent Markup Language
- OIL: Ontology Inference Layer
7 OWL
- OWL is built on top of RDF
- OWL is for processing information on the web
- OWL was designed to be interpreted by computers
- OWL was not designed for being read by people
- OWL is written in XML
8 OWL Example
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xml:base="http://www.daml.org/2001/10/html/airport-ont">
  <owl:Ontology rdf:about="">
    <owl:versionInfo>$Id: airport-ont.daml,v 1.1 2002/03/14 06:24:16 mdean Exp $</owl:versionInfo>
    <rdfs:comment>Airport</rdfs:comment>
  </owl:Ontology>
  <rdfs:Class rdf:ID="Airport">
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#name"/>
        <owl:allValuesFrom rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
      </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#iataCode"/>
        <owl:allValuesFrom rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
      </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:subClassOf>
    </rdfs:subClassOf>
  </rdfs:Class>
  <owl:DatatypeProperty rdf:ID="elevation"/>
  <owl:DatatypeProperty rdf:ID="iataCode"/>
  <owl:DatatypeProperty rdf:ID="icaoCode"/>
  <owl:DatatypeProperty rdf:ID="latitude"/>
  <owl:DatatypeProperty rdf:ID="location"/>
  <owl:DatatypeProperty rdf:ID="longitude"/>
  <owl:DatatypeProperty rdf:ID="name"/>
</rdf:RDF>
9 Taxonomy
- Mathematically, a hierarchical taxonomy is a tree structure of classifications for a given set of objects
- Much lighter than an ontology
Figure: example taxonomy for "CS Dept. US" with nodes Entity, Undergrad Courses, Grad Courses, People, Staff, Faculty, Assistant Professor, Associate Professor, Professor
11 Match Operation
- Input: two schemas/ontologies
- Output: a similarity correspondence between each pair of elements (E1 of Schema A and E2 of Schema B)
- The similarity correspondence can be
  - E1 = E2
  - E1 <> E2
  - E1 p E2: partial match (E1 ∩ E2 or E1 ⊆ E2)
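12 Match Operation
Figure: match between Books Schema A and Books Schema B, in both directions (A to B and B to A).
- Books Schema A: price, book-title, author-name
- Books Schema B: listed-price, title, a-fname, a-lname
- Edge legend: match, partial match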
13 The 3 Dimensions of Schema Matching
- X: Basic Match Research
- Y: Research Domains
- Z: Application Domains
14 Application and Research Domains
- Data Interoperability
- Data Integration
- Data Warehousing
- Catalogue Integration
- Web Services Discovery and Composition
- Query over the Web
- ...
- Data Exchange
- E-commerce
- Agents Communication
- P2P DB Systems
- Static: the contributing schema set is not evolving >> matching is a one-time process
- Dynamic: the contributing schema set is evolving >> matching also evolves
15 Match Techniques based on ...
16 Schema Match Techniques
- Schema structure level techniques
  - Matching combinations of elements appearing together
  - Graph matching (example: directed acyclic graph, tree)
    - Children, descendants
    - Leaves
    - Relations
17 Match Techniques: Element level
- Language based
  - Tokenization, e.g. Tool_Kit → (Tool, Kit)
  - Lemmatization, e.g. Kits → Kit
  - Elimination, e.g. IsRelatedTo → Related
  - Word sense similarity: synonym, hypernym (generalization), hyponym (specialization) etc., or verb, noun, adjective etc.
  - E.g. comparing DeliverTo with InvoiceTo, and DeliverTo with ShipTo
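The three language-based steps above (tokenization, lemmatization, elimination) can be sketched as follows. This is a minimal illustration, not the implementation of any tool named in these slides: the stop-word list and the plural-stripping rule are assumptions made for the example, and a real matcher would rely on a lexicon such as WordNet.

```python
import re

# Hypothetical stop-word list for the elimination step (an assumption).
STOP_WORDS = {"is", "to", "of", "the", "a", "an"}


def tokenize(label: str) -> list[str]:
    """Split a label on underscores, hyphens, spaces and camelCase boundaries."""
    tokens = []
    for part in re.split(r"[_\-\s]+", label):
        # "IsRelatedTo" -> "Is", "Related", "To"; "XMLSchema" -> "XML", "Schema"
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [t.lower() for t in tokens]


def lemmatize(token: str) -> str:
    """Very naive lemmatization: strip a plural 's' (assumption for the sketch)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(label: str) -> list[str]:
    """Tokenize, eliminate stop words, then lemmatize what is left."""
    return [lemmatize(t) for t in tokenize(label) if t not in STOP_WORDS]


print(tokenize("Tool_Kit"))      # ['tool', 'kit']
print(normalize("Kits"))         # ['kit']
print(normalize("IsRelatedTo"))  # ['related']
```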
18 Match Techniques: Element level
- String based
  - Prefix, suffix, e.g. auth and author
  - n-gram: common consecutive substrings of size n
    - 2-grams of kit = {ki, it}; 2-grams of kite = {ki, it, te}; 2 in common
  - Edit distance: number of steps required to transform one string into another
    - e.g. class → classe: 1 insertion
  - Soundex, e.g. class and classe
- Description matching
  - Approximate matching used in natural language processing
http://www.dcs.shef.ac.uk/sam/stringmetrics.html
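The string-based measures listed above can be tried out with a few lines of Python. This is a sketch under stated assumptions: the function names are illustrative, and the edit distance shown is the standard Levenshtein distance (insertions, deletions, substitutions), which is one common reading of "number of steps".

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def ngrams(s: str, n: int) -> set[str]:
    """Set of consecutive substrings of size n."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(common_prefix_len("auth", "author"))         # 4
print(len(ngrams("kit", 2) & ngrams("kite", 2)))   # 2
print(edit_distance("class", "classe"))            # 1
```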
19 Other techniques
- Constraint based
  - Data types, e.g. String, Integer, Currency etc.
  - Value domain, e.g. 1..12: month or hour
- Model based
  - Entity Relationship, XML documents or XML schema, OWL, OO etc.
- Auxiliary information
  - User input
  - Global schema / ontology
  - Previous match decisions
  - Dictionaries, thesauri: WordNet
20 Matching Two Schemas - Example
21 Example
22 Matching
23 Result of Match Operation
- Match candidates in a similarity matrix
- For each element of schema A, we have a set of possibly similar elements in schema B
- Each candidate carries a match confidence!
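A similarity matrix and its candidate sets can be sketched as below, using the element names of the Books schemas from the earlier Match Operation slide. The similarity measure (Dice coefficient over character bigrams) and the 0.3 threshold are assumptions chosen for the example, not the measure used by any particular tool.

```python
def bigrams(s: str) -> set[str]:
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient over character bigrams, in [0, 1] (assumed measure)."""
    ga, gb = bigrams(a.lower()), bigrams(b.lower())
    if not ga or not gb:
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def similarity_matrix(schema_a, schema_b, matcher=dice):
    """One row per element of schema A, one column per element of schema B."""
    return {ea: {eb: matcher(ea, eb) for eb in schema_b} for ea in schema_a}


def candidates(matrix, threshold=0.3):
    """For each element of A, the elements of B above the threshold, best first."""
    return {ea: sorted((eb for eb, s in row.items() if s >= threshold),
                       key=lambda eb: -row[eb])
            for ea, row in matrix.items()}


SCHEMA_A = ["price", "book-title", "author-name"]
SCHEMA_B = ["listed-price", "title", "a-fname", "a-lname"]

matrix = similarity_matrix(SCHEMA_A, SCHEMA_B)
for element, matches in candidates(matrix).items():
    print(element, "->", matches)
```

With these inputs, author-name keeps two candidates (a-fname and a-lname), which is the situation where a further selection step is needed.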
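24 Match Confidence
- Match quality/confidence is calculated as a numeric weight between 0 and 1
  - i.e. 0 → no match and 1 → perfect match
- 3-gram example
  - S1 = university, S2 = université
  - 3-grams of S1: uni, niv, ive, ver, ers, rsi, sit, ity (8)
  - 3-grams of S2: uni, niv, ive, ver, ers, rsi, sit, ité (8)
  - The two sets share 7 of their 3-grams
  - $\mathrm{conf}_{3gram}(S_1, S_2) = \dfrac{1}{1 + |\mathrm{3gram}(S_1)| + |\mathrm{3gram}(S_2)| - 2\,|\mathrm{3gram}(S_1) \cap \mathrm{3gram}(S_2)|} = \dfrac{1}{1 + 8 + 8 - 2 \cdot 7} = \dfrac{1}{3} \approx 0.333$
This is the match confidence between S1 and S2 under the 3-gram algorithm!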
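The 3-gram confidence of the university/université example can be checked with a short script. The function names are illustrative; the formula is the one from the slide, 1 / (1 + |3gram(S1)| + |3gram(S2)| - 2 |3gram(S1) ∩ 3gram(S2)|).

```python
def trigrams(s: str) -> set[str]:
    """Set of consecutive substrings of length 3."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_confidence(s1: str, s2: str) -> float:
    """1 for identical 3-gram sets, approaching 0 as the sets diverge."""
    g1, g2 = trigrams(s1), trigrams(s2)
    return 1.0 / (1 + len(g1) + len(g2) - 2 * len(g1 & g2))


print(round(trigram_confidence("university", "université"), 3))  # 0.333
```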
25 Match Confidence
26 Combining different Match Algorithms
- Hybrid matcher
  - Directly combines several match algorithms to determine match candidates.
  - Gives better performance.
- Multiple/composite matcher
  - Combines the results of several independently executed matchers, which can include hybrid matchers.
  - Gives flexibility depending upon the input schemas.
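A composite matcher can be sketched as a function that runs several matchers independently and aggregates their scores. The two base matchers and the weighted average used here are assumptions made for illustration; real systems offer several aggregation strategies.

```python
from typing import Callable, Sequence

Matcher = Callable[[str, str], float]


def exact_name(a: str, b: str) -> float:
    """1.0 when the labels are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def prefix_ratio(a: str, b: str) -> float:
    """Length of the common prefix relative to the longer label."""
    a, b = a.lower(), b.lower()
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b), 1)


def composite(matchers: Sequence[Matcher], weights: Sequence[float]) -> Matcher:
    """Combine independently executed matchers by a weighted average (assumed strategy)."""
    total = sum(weights)

    def combined(a: str, b: str) -> float:
        return sum(w * m(a, b) for m, w in zip(matchers, weights)) / total

    return combined


matcher = composite([exact_name, prefix_ratio], [1.0, 1.0])
print(matcher("title", "Title"))   # 1.0
print(matcher("auth", "author"))   # average of 0.0 and 4/6
```

Because the result is itself a matcher, a composite can be passed as one of the inputs of another composite, which gives the flexibility mentioned above.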
27 Match Algorithm Dimensions [SE05]
- To design match algorithms, we need to know the dimensions along which they are used:
- Input of the algorithm
  - Data or schema, element level or structure level
- Characteristics of the matching process
  - Requires exact or approximate matching
  - Performance over quality
- Output of the algorithm
  - The output is an approximate result
  - Or the algorithm is part of a set of match algorithms that are combined together for a map result
  - Integrated schema
  - Mapping expressions
29 Mapping
- Selection of the best candidate match
- If the candidate match set contains more than one element, further techniques are applied to select the best match as the mapping.
- Usually this is a manual user selection, for quality results
Figure labels: Simple Map, Complex Map
30 Map Cardinality
- Map complexity: 1:1, 1:n, n:m
- 1:n, e.g. authorName and a-fname, a-lname
- n:m, e.g. Tel. of Person
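31 Map Quality: Measuring Similarity
- Precision: share of real mappings among all found mappings
- Recall: share of real mappings that is found
- With A = real mappings that were not found, B = real mappings that were found, and C = found mappings that are not real:
  $\mathrm{Precision} = \dfrac{B}{B + C}, \qquad \mathrm{Recall} = \dfrac{B}{A + B}$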
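Precision and recall can be computed directly from the set of real mappings and the set of found mappings. In this sketch A, B and C are read as: A = real mappings that were missed, B = real mappings that were found, C = found mappings that are not real. The example mappings are made up for illustration.

```python
def precision_recall(real: set, found: set) -> tuple[float, float]:
    b = len(real & found)  # B: real mappings that were found
    a = len(real - found)  # A: real mappings that were missed
    c = len(found - real)  # C: found mappings that are not real
    precision = b / (b + c) if found else 0.0
    recall = b / (a + b) if real else 0.0
    return precision, recall


# Hypothetical reference and matcher output for the Books schemas.
real = {("price", "listed-price"), ("book-title", "title"),
        ("author-name", "a-fname"), ("author-name", "a-lname")}
found = {("price", "listed-price"), ("book-title", "title"),
         ("author-name", "title")}

precision, recall = precision_recall(real, found)
print(round(precision, 3), recall)  # 0.667 0.5
```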
32 Schema Mapping / Ontology Alignment
- Schema mapping is usually performed with the help of techniques that try to guess the meaning encoded in the schemas - www.xml.com/
- Ontology alignment tries to exploit the knowledge explicitly encoded in the ontologies. Ontology example (travel) - http://protege.stanford.edu/plugins/owl/
- In real-world applications, solutions from both domains are mutually beneficial
33 Semantic Web Layers
35 Example
Figure: a Consumer, a Mediator and three Data Sources.
- Schema heterogeneity is a key roadblock for information integration
- Different data sources speak their own schema
- Mapping is key to any data sharing architecture
36 Schema Integration and Mediation
- Schema Mediation
  - All input schemas are merged together into one schema without any concept redundancy, i.e. similar concepts are represented by one concept.
  - All the input schemas are mapped to this schema, called the mediated schema.
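The two steps above (merge without redundancy, then map every input schema to the merged schema) can be sketched as follows. This is a simplified illustration on flat lists of labels: the hand-written synonym table stands in for the output of a match step, and all names are hypothetical.

```python
def build_mediated_schema(schemas: dict[str, list[str]], synonyms: dict[str, str]):
    """Merge input schemas into one mediated schema and map each input element to it.

    `synonyms` maps a label to the concept that represents it (an assumption:
    in practice this would come from the match operation).
    """
    mediated: list[str] = []
    mappings: dict[str, dict[str, str]] = {}
    for name, elements in schemas.items():
        mappings[name] = {}
        for element in elements:
            concept = synonyms.get(element, element)
            if concept not in mediated:  # similar concepts are kept only once
                mediated.append(concept)
            mappings[name][element] = concept
    return mediated, mappings


schemas = {
    "A": ["price", "book-title", "author-name"],
    "B": ["listed-price", "title", "author-name"],
}
synonyms = {"listed-price": "price", "book-title": "title"}

mediated, mappings = build_mediated_schema(schemas, synonyms)
print(mediated)       # ['price', 'title', 'author-name']
print(mappings["B"])
```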
37 Types of Integration Strategies [Batini86]
- binary: ladder, balanced
- n-ary: one-shot, iterative
http://www.ifi.unizh.ch/pziegler/IntegrationProjects.htmlx
38 Xylème warehouse
39 Mediator overview
Figure: e-XML Mediator. Labels: XQuery requests, XML documents, e-XML Mediator, sub-requests (one XPath, three XQuery), Web site wrapper, two Adapters, Web site (HTML pages), XDBMS, two RDBMS.
40 Nimble
Figure: Nimble architecture. Labels: Front-End, Lens Builder, User Applications, Lens File, InfoBrowser, Software Developers Kit, NIMBLE APIs, Management Tools, Integration Layer, Nimble Integration Engine, Metadata Server, Compiler, Executor, Cache, Security Tools, Common XML View, Integration Builder, Concordance Developer, Data Administrator.
42 Existing Matching Tools
- Ontologies Specific
- NOM/ QOM
- OLA
- Anchor-PROMPT
- S-Match [GSY04]
- HICAL
- SKAT
- Machine Learning
- GLUE (LSD, CGLUE) [DMDH02]
- Automatch
- Cupid [MBR01]
- COMA (COMA++) [ADMR05]
- Similarity Flooding
- SemInt
- Artemis
- DIKE
- TransScm
- AutoMed
- Charlie [TBBT04]
43 Charlie's Algorithm
Figure: matching the paths of schema x against the paths of schema m.
- x/writer/own_books: m/book/author/name, m/book/author/birth, m/book/author, m/book/publisher, m/book/title, m/book
- x/writer: m/book 0.15, m/book/author 0.8, m/book/publisher 0.01, m/book/title 0, m/book/author/name 0.6
- x/writer/birth: m/book/author/name, m/book/author, m/book/publisher, m/book/title, m/book
- x/writer/name: m/book/author/name
- Parameters: depth_max, backtrack_max, max_dist
44 Similarity Flooding
- Directed graph matching technique
- The technique starts from a string-based comparison (common prefix, suffix tests) of the vertex labels to obtain an initial matching.
- The algorithm is based on the assumption that whenever any two elements in schemas S1 and S2 are found to be similar, the similarity of their adjacent elements increases.
- Iterative process
- www-db.stanford.edu/melnik/mm/sfa/
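The idea of the iterative process can be illustrated with a deliberately simplified sketch. It is not the published Similarity Flooding algorithm: the propagation here is unweighted, the number of iterations is fixed, and the seed is a plain common prefix/suffix ratio. All of these are assumptions made to keep the example short. Graphs are given as lists of (parent, child) edges.

```python
from itertools import product


def initial_similarity(a: str, b: str) -> float:
    """String-based seed: longest common prefix or suffix over the longer label."""
    def common_prefix(x: str, y: str) -> int:
        n = 0
        for cx, cy in zip(x, y):
            if cx != cy:
                break
            n += 1
        return n

    best = max(common_prefix(a, b), common_prefix(a[::-1], b[::-1]))
    return best / max(len(a), len(b))


def flood(edges1, edges2, iterations=10):
    """Propagate similarity between node pairs whose neighbours are similar."""
    nodes1 = {n for e in edges1 for n in e}
    nodes2 = {n for e in edges2 for n in e}
    sigma0 = {(a, b): initial_similarity(a, b) for a, b in product(nodes1, nodes2)}
    sigma = dict(sigma0)
    for _ in range(iterations):
        nxt = {}
        for (a, b) in sigma:
            incoming = 0.0
            for (p1, c1), (p2, c2) in product(edges1, edges2):
                if (p1, p2) == (a, b):   # similarity flowing up from the children
                    incoming += sigma[(c1, c2)]
                if (c1, c2) == (a, b):   # similarity flowing down from the parents
                    incoming += sigma[(p1, p2)]
            nxt[(a, b)] = sigma0[(a, b)] + sigma[(a, b)] + incoming
        top = max(nxt.values()) or 1.0
        sigma = {pair: value / top for pair, value in nxt.items()}  # normalise to [0, 1]
    return sigma0, sigma


edges_x = [("writer", "name"), ("writer", "birth")]
edges_m = [("author", "name"), ("author", "birth")]
seed, result = flood(edges_x, edges_m)
print(seed[("writer", "author")], result[("writer", "author")])
```

The labels writer and author share almost nothing as strings, but their children (name, birth) match exactly, so the pair (writer, author) gains similarity over the iterations.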
45 S-Match approach
- S-Match is an element level matcher
- Syntactic matching
  - Initially, the literal concepts of the labels are computed at the nodes using WordNet.
  - Each label is tokenized and the sense of each token is computed using WordNet. The senses of the label tokens are combined to make up the concept represented by the label.
  - E.g. VineandCheese represents Vine and Cheese
- Semantic matching
  - Relations are computed between the concepts at nodes
  - The concept at a node is a combination of the label semantics and the node placement in the tree
- S-Match utilizes the idea of schemas as trees
46 COMA++ - Matching Process
http://dbs.uni-leipzig.de/Research/coma.html
47 Deficiencies in Current Matching Tools
- These tools do not completely fulfil the requirements for large scale schema matching because
  - they are not fully automated
  - they place less emphasis on search space optimization
- Every element of schema A is matched to every element of schema B, and a set of algorithms is applied to each pair of elements.
  - Schema A: n elements
  - Schema B: m elements
  - k: number of match algorithms
  - Cost: n x m x k comparisons
49 Large Scale Scenarios
- Creating a merged schema for data integration from two large schemas (with thousands of nodes).
  - For example, bio-genetic taxonomies
- Creating a mediated schema from a large set of schemas (with hundreds of schemas and thousands of nodes)
  - For example, creating a mediated web interface input form (schema) from the hundreds of web interface forms (schemas) related to the travel domain
50 Web Interface Schema for Travel
Four form schemas, each listed with its node numbers:
- yahoo-form: 1 yahoo-form, 2 Where_do_you_want_to_go, 3 dep_arp_cd_1, 4 dep_arp_range_1, 5 When_are_you_traveling, 6 Depart, 7 dep_dt_mn_1, 8 dep_dt_dy_1, 9 dep_tm_1, 10 Return, 11 dep_dt_mn_2, 12 dep_dt_dy_2, 13 dep_tm_2, 14 num_cnx, 15 How_many_travelers_are_there, 16 adult_pax_cnt, 17 chld_pax_cnt, 18 senior_pax_cnt, 19 Airline_preferences, 20 cls_svc, 21 aln_cd_1
- aa: 1 aa, 2 WhereDoYouWantToGo, 3 origin, 4 destination, 5 WhenDoYouWantToGo, 6 DepartureDate, 7 departureMonth, 8 departureDay, 9 departureTime, 10 ReturnDate, 11 returnMonth, 12 returnDay, 13 returnTime, 14 NumberOfPassengers, 15 numAdultPassengers, 16 numChildPassengers, 17 WhatAreYourServicePref, 18 cabinClass, 19 maximumStops, 20 carrier, 21 countryPointOfSale
- nwa-form: 1 nwa-form, 2 origin, 3 EnterDepartureDate, 4 departMonth, 5 departDay, 6 departTime, 7 destination, 8 EnterReturnDate, 9 returnMonth, 10 returnDay, 11 returnTime, 12 adult
- absTravel: 1 absTravel, 2 D_City, 3 A_City, 4 Depart, 5 D_Month, 6 D_Day, 7 Return, 8 R_Month, 9 R_Day, 10 ClassOfService, 11 NumAdults
51 Large Scale Integration
52 Search Space Optimization
- The n x m x k complexity has to be reduced for better performance! How?
- Considering schemas as trees
  - Trees add structure, which allows the classification of documents to be performed more effectively (Charlie algorithm)
- Labels of tree nodes have the same literal meaning but a specific sense in specific domains, e.g. title (Similarity Flooding)
  - Books domain → title of book
  - Human resource domain → title of person: Mr, Mrs
- Calculate this domain-specific sense of a label to find similar labels
  - thus minimizing the search space for a certain node
  - i.e. match nodes with similar sense labels
  - e.g. author → writer, auth-name
53 Tree mining approach!
- Why?
  - The basic function of tree mining is to find sub-trees that are frequent in a given set of trees. This is similar to the schema matching activity, which tries to find similar concepts among a set of schema trees.
  - So it provides us with a technique to compare more than two schemas in parallel
54 Tree Mining
Label legend: a = author, b = book, d = detail, f = information, g = general, h = birth, i = isbn, n = name, o = own-books, p = publisher, r = price, t = title, w = writer
- Motivation
  - Large scale scenario
  - Peer-to-peer information systems over the XML Web
- Tree mining approach
  - Semantic label concept matcher
  - Element level matching
  - Structure level matching
Figure: search for sub-trees, with the synonym pairs a/w, b/o, f/d
55 Tree Mining Approach
- Node scope values (calculated by a depth-first pre-order traversal), top-down [Zaki02]
- Schema matching and integration process for handling large sets of XML schema trees.
- Employs
  - an element level name matcher (same node label or synonym)
  - clustering of similar/synonym labels
  - the properties of the node scope values, to extract node context semantics out of the structure
56 Clustering: Search Space Optimization
Synonym table: a/w, b/o, f/d
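Clustering labels with a synonym table shrinks the search space because only elements whose labels fall in the same cluster need to be compared. The sketch below uses the synonym pairs of the slide written out in full (author/writer, book/own-books, information/detail). The two example schemas and the union-find clustering are assumptions made for illustration.

```python
SYNONYMS = [("author", "writer"), ("book", "own-books"), ("information", "detail")]


def cluster_labels(labels, synonym_pairs):
    """Group labels so that synonyms end up in the same cluster (union-find)."""
    parent = {label: label for label in labels}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for x, y in synonym_pairs:
        if x in parent and y in parent:    # ignore synonyms of unseen labels
            parent[find(x)] = find(y)

    clusters = {}
    for label in labels:
        clusters.setdefault(find(label), set()).add(label)
    return list(clusters.values())


def candidate_pairs(schema_a, schema_b, clusters):
    """Only pair up elements whose labels belong to the same cluster."""
    cluster_of = {label: i for i, c in enumerate(clusters) for label in c}
    return [(a, b) for a in schema_a for b in schema_b
            if cluster_of[a] == cluster_of[b]]


schema_x = ["writer", "name", "birth", "own-books", "title"]
schema_m = ["book", "title", "author", "name", "publisher", "price"]

clusters = cluster_labels(sorted(set(schema_x) | set(schema_m)), SYNONYMS)
pairs = candidate_pairs(schema_x, schema_m, clusters)
print(len(schema_x) * len(schema_m), "pairs without clustering")  # 30
print(len(pairs), "pairs with clustering:", pairs)                # 4
```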
57 Property: Descendant Node Check
Descendant node check: if the scope of node x is [X,Y] and the scope of a descendant node xd is [Xd,Yd], then Xd > X and Yd <= Y.
- publisher is mapped to publisher
- publisher name can be mapped to writer/name or publisher/name
- The node labelled name, n4 [4,4], is a descendant of the node labelled publisher, n3 [3,4], whereas the node labelled name, n2 [2,2], is not. This is verified using the descendant test.
- For n2: 2 > 3 and 2 <= 4 (False); for n4: 4 > 3 and 4 <= 4 (True)
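Node scopes and the descendant test can be reproduced with a short script. The tree below is an assumption consistent with the scopes quoted on the slide (n2 [2,2], n3 [3,4], n4 [4,4]); the root n1 and the dictionary representation are hypothetical.

```python
def assign_scopes(tree, root):
    """Number nodes in depth-first pre-order.

    The scope of a node is (its own number, the number of its last descendant).
    """
    scopes = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        start = counter
        for child in tree.get(node, []):
            visit(child)
        scopes[node] = (start, counter)

    visit(root)
    return scopes


def is_descendant(scope_d, scope_x):
    """Descendant test: Xd > X and Yd <= Y."""
    xd, yd = scope_d
    x, y = scope_x
    return xd > x and yd <= y


# Hypothetical tree: n1 is the root, n2 is labelled name,
# n3 is labelled publisher, n4 is labelled name.
tree = {"n1": ["n2", "n3"], "n3": ["n4"]}
scopes = assign_scopes(tree, "n1")
print(scopes)
print(is_descendant(scopes["n2"], scopes["n3"]))  # False
print(is_descendant(scopes["n4"], scopes["n3"]))  # True
```

The test needs only two integer comparisons per pair of nodes, so the context of a node (for example, which name belongs to publisher) can be checked without walking the tree again.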
58 Conclusion
- Element level name and linguistic matching with the support of a thesaurus is an integral part of every match system.
- With systems moving towards schema/ontology based manipulation, and the lack of global schemas or previous matching results, structure level matching is equally important for making out the semantics.
- A peer-to-peer environment requires new methods to be exploited for performance and quality mapping, i.e. the integration of tree mining techniques for matching purposes and search space optimisation.
- Machine learning algorithms can be beneficial in the P2P environment in later stages, when training examples have been created from instance data, provided the target domain remains the same.
59 Some References
- [AH04] Antoniou G., Harmelen F. A Semantic Web Primer. The MIT Press, 2004.
- [ADMR05] Aumuller D., Do H. H., Massmann S. and Rahm E. Schema and ontology matching with COMA++. In Proceedings of the International Conference on Management of Data (SIGMOD), 2005.
- [BR04] Bellahsène Z. and Roantree M. (2004) Querying Distributed Data in a Super-peer based Architecture. DEXA 2004.
- [BMP04] Bernstein PA., Melnik S., Petropoulos M. and Quix C. (2004) Industrial-Strength Schema Mapping. SIGMOD Record, Vol. 33, No. 4, December 2004.
- [DMDH02] Doan AH., Madhavan J., Domingos P. and Halevy A. (2002) Learning to Map Ontologies on the Semantic Web. WWW 2002.
- [MBR01] Madhavan J., Bernstein PA. and Rahm E. (2001) Generic Schema Matching with Cupid. VLDB 2001.
- [RB01] Rahm E. and Bernstein PA. (2001) A Survey of Approaches to Automatic Schema Matching. VLDB Journal 10(4): 334-350, 2001.
- [SE05] Shvaiko P. and Euzenat J. (2005) A Survey of Schema-based Matching Approaches. Journal on Data Semantics, 2005.
- [TBBT04] Tranier J., Baraer R., Bellahsène Z. and Teisseire M. (2004) Where's Charlie: Family Based Heuristics for Peer-to-Peer Schema Integration. IDEAS 2004, 227-235.
- [Zaki02] Zaki MJ. (2002) Efficiently Mining Frequent Trees in a Forest. 8th ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining, July 2002.
- http://www.w3.org/TR/daml+oil-reference
- http://www.doc.ic.ac.uk/automed/
60 Thank you
61 Backup slides
62 URI
- A URI can be classified as a locator or a name or both.
- A Uniform Resource Locator (URL) is a URI that, in addition to identifying a resource, provides means of acting upon or obtaining a representation of the resource by describing its primary access mechanism or network "location". For example, the URL http://www.wikipedia.org/ is a URI that identifies a resource (Wikipedia's home page) and implies that a representation of that resource (such as the home page's current HTML code, as encoded characters) is obtainable via HTTP from a network host named www.wikipedia.org.
- A Uniform Resource Name (URN) is a URI that identifies a resource by name in a particular namespace. A URN can be used to talk about a resource without implying its location or how to dereference it. For example, the URN urn:isbn:0-395-36341-1 is a URI that, like an International Standard Book Number (ISBN), allows one to talk about a book, but doesn't suggest where and how to obtain an actual copy of it.
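The difference between the two example URIs shows up when they are parsed with Python's standard `urllib.parse` module. The helper `is_locator` is a hypothetical heuristic introduced for this sketch (a URI with a network host is treated as a locator); it is not a standard classification rule.

```python
from urllib.parse import urlparse


def is_locator(uri: str) -> bool:
    """Heuristic (assumption): a URI that names a network host acts as a locator."""
    return bool(urlparse(uri).netloc)


url = urlparse("http://www.wikipedia.org/")
urn = urlparse("urn:isbn:0-395-36341-1")

print(url.scheme, url.netloc)  # http www.wikipedia.org
print(urn.scheme, urn.path)    # urn isbn:0-395-36341-1
print(is_locator("http://www.wikipedia.org/"), is_locator("urn:isbn:0-395-36341-1"))
```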