1. International Conference on Cooperative Information Systems (CoopIS 2001)
Deriving sub-source similarities from heterogeneous, semi-structured information sources
D. Rosaci, G. Terracina, D. Ursino
Dipartimento di Informatica, Matematica, Elettronica e Trasporti
Università Mediterranea di Reggio Calabria
September 5-7, Trento
2. Scheme Match: finding a mapping between those elements of two schemes that semantically correspond to each other.
Applications: information source integration, e-commerce, scheme evaluation and migration, data and web warehousing, information source design, and so on.
The need for semi-automatic techniques to carry out this task is now widely recognized.
Most of the Scheme Match techniques proposed in the literature have been designed only for databases.
They aim at deriving terminological and structural relationships between single concepts.
3. New approaches to Scheme Match, capable of handling semi-structured information sources, are needed.
Such approaches must differ from the traditional ones because:
- in semi-structured information sources, significant pieces of information are expressed as groups of concepts rather than single ones
- different instances of the same concept can have different structures
The emphasis shifts from extracting semantic correspondences between concepts to deriving semantic correspondences between groups of concepts.
4. We propose a semi-automatic technique for extracting similarities between sub-sources belonging to different, heterogeneous and semi-structured information sources.
Adopting a conceptual model capable of uniformly handling information sources of different formats is extremely useful.
Translation rules should be defined from classical information source formats to the adopted conceptual model.
Our approach exploits the SDR-Network conceptual model, which meets the requirements described above.
5. Given an information source IS, the number of possible sub-sources that can be derived from it is extremely high.
To avoid handling huge numbers of sub-source pairs, we propose a heuristic technique for singling out only the most promising ones.
Once the most promising pairs of sub-sources have been selected, their similarity degree must be computed.
$SS_i$ can be considered similar to $SS_j$ only if it is possible to single out concepts of $SS_i$ and $SS_j$ that are, in turn, pairwise similar.
The similarity degree associated with each pair of sub-sources is determined by computing the objective function associated with a maximum weight matching.
6. The SDR-Network and its metrics have already been exploited to define a technique for deriving synonymies and homonymies.
On the whole, we propose a unified, semi-automatic approach for deriving concept synonymies and homonymies, as well as sub-source similarities.
This is particularly interesting because:
- we derive a property that is generally not handled by most of the Scheme Match approaches proposed in the literature
- the technique proposed here is part of a more general framework for deriving various kinds of terminological and structural properties
7. Given an information source IS, the associated SDR-Network Net(IS) is
$Net(IS) = \langle NS(IS), AS(IS) \rangle$
- NS(IS) is the set of nodes; each node is characterized by a name
- AS(IS) is the set of arcs; each arc can be represented by a triplet $\langle S, T, L_{ST} \rangle$, where
  - S is the source node
  - T is the target node
  - $L_{ST} = [d_{ST}, r_{ST}]$ is a label associated with the arc
8. $d_{ST}$ is the semantic distance coefficient:
- it indicates how semantically close the concept expressed by T is to the concept expressed by S
- it depends on the capability of the concept associated with T to characterize the concept associated with S
$r_{ST}$ is the semantic relevance coefficient: it indicates the fraction of instances of the concept denoted by S whose complete definition requires at least one instance of the concept represented by T.
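The arc triplets and the two label coefficients can be held in a small data structure. The following is only a minimal sketch: the class names `Arc` and `SDRNetwork`, the method names, and the range check on the relevance coefficient (which follows from its definition as a fraction of instances) are assumptions made for illustration, not part of the presented model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Arc:
    """An SDR-Network arc <S, T, [d_ST, r_ST]> (hypothetical representation)."""
    source: str       # S: name of the source node
    target: str       # T: name of the target node
    distance: float   # d_ST: semantic distance coefficient
    relevance: float  # r_ST: semantic relevance coefficient


@dataclass
class SDRNetwork:
    """Net(IS) = <NS(IS), AS(IS)>: named nodes plus labeled arcs."""
    nodes: set = field(default_factory=set)
    arcs: list = field(default_factory=list)

    def add_arc(self, source, target, distance, relevance):
        # r_ST is a fraction of instances, so it is assumed to lie in [0, 1].
        if not 0.0 <= relevance <= 1.0:
            raise ValueError("relevance coefficient must lie in [0, 1]")
        self.nodes.update((source, target))
        arc = Arc(source, target, distance, relevance)
        self.arcs.append(arc)
        return arc

    def outgoing(self, node):
        """Return the arcs whose source node is `node`."""
        return [a for a in self.arcs if a.source == node]


net = SDRNetwork()
net.add_arc("Project", "Country", 0, 1.0)
net.add_arc("Project", "Type", 0, 0.9)
```

Keeping the arcs in a flat list mirrors the set AS(IS); an adjacency map would be a natural alternative for larger networks.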
9.
- The Path Semantic Distance $PSD_P$ of a path P in Net(IS) is the sum of the semantic distance coefficients associated with the arcs included in the path
- The Path Semantic Relevance $PSR_P$ of a path P in Net(IS) is the product of the semantic relevance coefficients associated with the arcs included in the path
- The CD-Shortest-Path (Conditional D-Shortest-Path) between two nodes N and N' in Net(IS) and including an arc A (denoted by $\langle N, N' \rangle_A$) is the path having the minimum Path Semantic Distance among those connecting N and N' and including A
- A $D\_Path_n$ is a path P in Net(IS) such that $n \le PSD_P < n + 1$
- The i-th neighborhood of an SDR-Network node x is
  $nbh(x, i) = \{ A \mid A \in AS(IS),\ A = \langle z, y, l_{zy} \rangle,\ \langle x, y \rangle_A \text{ is a } D\_Path_i,\ x \ne y \}, \quad i \ge 0$
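The metrics above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: distance coefficients are assumed to be non-negative, and the CD-Shortest-Path from x to y including an arc (z, y) is assumed to be the shortest path from x to z followed by that arc. The function names and the four-arc network are made up for the example.

```python
import heapq
import math

# Arcs as (source, target, d, r) tuples; nodes and coefficients are illustrative only.
ARCS = [
    ("A", "B", 0, 1.0),
    ("A", "C", 1, 0.5),
    ("B", "D", 1, 0.8),
    ("C", "D", 0, 0.4),
]


def psd(path):
    """Path Semantic Distance: sum of the distance coefficients along the path."""
    return sum(d for _, _, d, _ in path)


def psr(path):
    """Path Semantic Relevance: product of the relevance coefficients along the path."""
    return math.prod(r for _, _, _, r in path)


def shortest_distances(arcs, start):
    """Dijkstra over the d coefficients (assumed non-negative)."""
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d_u, u = heapq.heappop(heap)
        if d_u > dist[u]:
            continue  # stale heap entry
        for s, t, d, _ in arcs:
            if s == u and d_u + d < dist.get(t, math.inf):
                dist[t] = d_u + d
                heapq.heappush(heap, (d_u + d, t))
    return dist


def nbh(arcs, x, i):
    """Arcs <z, y, l_zy> with y != x whose CD-Shortest-Path from x is a D_Path_i."""
    dist = shortest_distances(arcs, x)
    result = []
    for arc in arcs:
        z, y, d, _ = arc
        if y == x or z not in dist:
            continue
        cd_psd = dist[z] + d  # assumed: shortest path x -> z, then the arc itself
        if i <= cd_psd < i + 1:
            result.append(arc)
    return result
```

With this toy network, `nbh(ARCS, "A", 0)` contains only the arc from A to B, while the other three arcs fall into `nbh(ARCS, "A", 1)`.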
10. The number of possible sub-sources that can be identified in IS is exponential in the number of nodes of Net(IS).
We have defined a technique for singling out the most promising pairs of sub-sources.
The proposed technique receives two information sources $IS_1$ and $IS_2$ and a Dictionary SD of Synonymies between nodes of $Net(IS_1)$ and $Net(IS_2)$.
Synonymies are represented in SD by tuples of the form $\langle N_i, N_j, f_{ij} \rangle$, where $N_i$ and $N_j$ are the synonym nodes and $f_{ij}$ is a coefficient in the real interval $[0, 1]$ indicating the similarity degree of $N_i$ and $N_j$.
11. The technique works according to the following rules:
- It considers those pairs of sub-sources $(SS_i, SS_j)$ such that $SS_i \subseteq Net(IS_1)$ is a rooted sub-net having a node $N_i$ as root, $SS_j \subseteq Net(IS_2)$ is a rooted sub-net having a node $N_j$ as root, and $N_i$ and $N_j$ are interesting synonyms, i.e., the synonymy coefficient associated with them is greater than a certain threshold
- It computes the maximum weight matching on suitable bipartite graphs obtained from the target nodes of the arcs included in the neighborhoods of $N_i$ and $N_j$
- Given a pair of synonym nodes $N_i$ and $N_j$, it derives a promising pair of sub-sources $(SS_{i_k}, SS_{j_k})$ for each k such that both $nbh(N_i, k)$ and $nbh(N_j, k)$ are not empty
- $SS_{i_k}$ and $SS_{j_k}$ are constructed by determining the promising pairs of arcs $(A_{i_k}, A_{j_k})$ such that $A_{i_k} \in nbh(N_i, l)$ and $A_{j_k} \in nbh(N_j, l)$, for each l belonging to the integer interval $[0, k]$
12. A pair of arcs $(A_{i_k}, A_{j_k})$ is considered promising if:
- an edge between the target nodes $T_{i_k}$ of $A_{i_k}$ and $T_{j_k}$ of $A_{j_k}$ is present in the maximum weight matching computed on a suitable bipartite graph constructed from the target nodes of the arcs of $nbh(N_i, l)$ and $nbh(N_j, l)$, for some l belonging to the integer interval $[0, k]$
- the similarity degree of $T_{i_k}$ and $T_{j_k}$ is greater than a certain given threshold
The rationale underlying this approach is to construct promising pairs of sub-sources such that each pair contains the maximum possible number of pairs of concepts whose synonymy has already been established.
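The two conditions for a promising pair of arcs can be sketched for a single neighborhood level as follows. This is a hedged illustration: the function names are hypothetical, arcs are reduced to (source, target) pairs, the similarity coefficients are invented for the example, and the matching is found by brute force over permutations, which is only practical for very small graphs and stands in for a proper maximum weight matching algorithm.

```python
from itertools import permutations


def max_weight_matching(left, right, weight):
    """Brute-force maximum weight bipartite matching (tiny graphs only).

    `weight` maps (left_node, right_node) to a similarity coefficient;
    pairs missing from it are treated as non-edges.
    """
    if len(left) > len(right):
        flipped = {(b, a): w for (a, b), w in weight.items()}
        return [(a, b) for b, a in max_weight_matching(right, left, flipped)]
    best, best_total = [], 0.0
    for perm in permutations(right, len(left)):
        edges = [(a, b) for a, b in zip(left, perm) if (a, b) in weight]
        total = sum(weight[e] for e in edges)
        if total > best_total:
            best, best_total = edges, total
    return best


def promising_arc_pairs(nbh_i, nbh_j, weight, threshold):
    """Pair arcs whose targets are matched AND whose similarity exceeds the threshold."""
    targets_i = sorted({t for _, t in nbh_i})
    targets_j = sorted({t for _, t in nbh_j})
    matching = max_weight_matching(targets_i, targets_j, weight)
    kept = {(a, b) for a, b in matching if weight[(a, b)] > threshold}
    return [(arc_i, arc_j) for arc_i in nbh_i for arc_j in nbh_j
            if (arc_i[1], arc_j[1]) in kept]


# Illustrative neighborhoods and coefficients (not taken from a real dictionary).
nbh_esf = [("Project_ESF", "Country_ESF"), ("Project_ESF", "Type_ESF")]
nbh_ecp = [("Project_ECP", "Country_ECP"), ("Project_ECP", "Type_ECP")]
sd = {
    ("Country_ESF", "Country_ECP"): 0.9,
    ("Type_ESF", "Type_ECP"): 0.4,
    ("Country_ESF", "Type_ECP"): 0.2,
}
pairs = promising_arc_pairs(nbh_esf, nbh_ecp, sd, threshold=0.5)
```

Here the matching pairs both target nodes, but only the Country arcs survive the threshold; a full version would repeat the step for every level l in [0, k] and merge the results.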
13. Theorem
Let $IS_1$ and $IS_2$ be two information sources and let $Net(IS_1)$ and $Net(IS_2)$ be the corresponding SDR-Networks. Let $nc_1$ (resp., $nc_2$) be the number of complex nodes of $Net(IS_1)$ (resp., $Net(IS_2)$). Let l be the maximum neighborhood index associated with a node of $Net(IS_1)$ or $Net(IS_2)$. Then the number of possible pairs of sub-sources is $\min(nc_1, nc_2) \times (l + 1)$.
In real applications, the number of promising pairs of sub-sources relative to $IS_1$ and $IS_2$ is generally far smaller than $\min(nc_1, nc_2) \times (l + 1)$.
14. The SDR-Network of the European Social Funds (ESF) information source
15. The SDR-Network of the European Community Projects (ECP) information source
16. The Synonymy Dictionary associated with ESF and ECP
17.
- The interesting pairs of synonym nodes are $\langle Judicial\ Person_{ESF}, Partner_{ECP}, 0.59 \rangle$, $\langle Payment_{ESF}, Payment_{ECP}, 0.65 \rangle$ and $\langle Project_{ESF}, Project_{ECP}, 0.63 \rangle$
- As an example, consider the pair of synonym nodes $Project_{ESF}$ and $Project_{ECP}$
- Since the neighborhoods of $Project_{ESF}$ and $Project_{ECP}$ are both non-empty only for k = 0, k = 1 and k = 2, our technique obtains three promising pairs of sub-sources relative to $Project_{ESF}$ and $Project_{ECP}$
- To illustrate the behaviour of our technique, we show the derivation of the promising pair of sub-sources associated with $Project_{ESF}$ and $Project_{ECP}$ for k = 0
18. Figure: the bipartite graph and the associated maximum weight matching
19.
- The technique selects only those arcs of $nbh(Project_{ESF}, 0)$ and $nbh(Project_{ECP}, 0)$ that participate in the matching and have a similarity coefficient greater than a certain given threshold
- The promising pair of sub-sources associated with $nbh(Project_{ESF}, 0)$ and $nbh(Project_{ECP}, 0)$ is $(SS_1, SS_2)$, where
  $SS_1 = \{ \langle Project_{ESF}, Country_{ESF}, [0, 1] \rangle, \langle Project_{ESF}, Type_{ESF}, [0, 0.9] \rangle, \langle Project_{ESF}, ESF\_Contribution_{ESF}, [0, 0.75] \rangle, \langle Project_{ESF}, Country\_Share_{ESF}, [0, 0.9] \rangle \}$
  $SS_2 = \{ \langle Project_{ECP}, Country_{ECP}, [0, 1] \rangle, \langle Project_{ECP}, Type_{ECP}, [0, 0.6] \rangle, \langle Project_{ECP}, ESF\_Contribution_{ECP}, [0, 0.8] \rangle, \langle Project_{ECP}, Country\_Share_{ECP}, [0, 1] \rangle \}$
- The technique works analogously for k = 1 and k = 2, as well as for the other interesting synonym pairs
20. The technique for deriving sub-source similarities from a given pair of information sources consists of two steps:
- the first one computes the similarity degree of each promising pair of sub-sources derived previously
- the second one constructs a Sub-source Similarity Dictionary SSD by selecting only those pairs of sub-sources whose similarity degree is greater than a certain, dynamically computed, threshold
More formally, the technique can be encoded as $SSD = \psi(\xi(SPS, SD))$, where $\xi$ carries out the first step, $\psi$ carries out the second one, and
- SPS is the set of promising pairs of sub-sources
- SD is the Synonymy Dictionary
21.
- For each promising pair of sub-sources $SS_i$ and $SS_j$, the function $\xi$ calls a function $\delta$, which computes the corresponding similarity degree:
  $SSS = \xi(SPS, SD) = \{ \langle SS_i, SS_j, \delta(SD, \nu(SS_i), \nu(SS_j)) \rangle \mid (SS_i, SS_j) \in SPS \}$
- The function $\nu$ receives a rooted sub-net SS and returns the nodes of SS
- The function $\delta$ derives the similarity degree associated with $SS_i$ and $SS_j$ by computing a suitable objective function associated with the maximum weight matching on a bipartite graph constructed from the nodes of $SS_i$ and $SS_j$:
  $\delta(T, P, Q) = \left(1 - \frac{|P| + |Q| - 2|E|}{|P| + |Q|}\right) \times \frac{\phi(E)}{|E|}$
  where E is the set of edges of the maximum weight matching and $\phi(E)$ is the sum of their weights
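The objective function can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the function names are hypothetical, the matching is found by brute force (adequate only for small node sets), and the example weights assume that the Project pair carries the dictionary coefficient 0.63 while the four remaining node pairs have weight 1.

```python
from itertools import permutations


def max_weight_matching(p, q, sd):
    """Return the (p_node, q_node, weight) edges of a maximum weight matching."""
    swap = len(p) > len(q)
    small, large = (q, p) if swap else (p, q)
    best, best_total = [], 0.0
    for perm in permutations(large, len(small)):
        edges = []
        for s, big in zip(small, perm):
            pair = (big, s) if swap else (s, big)  # keep (p_node, q_node) order
            if sd.get(pair, 0.0) > 0.0:
                edges.append((*pair, sd[pair]))
        total = sum(w for _, _, w in edges)
        if total > best_total:
            best, best_total = edges, total
    return best


def similarity_degree(sd, p, q):
    """(1 - (|P| + |Q| - 2|E|) / (|P| + |Q|)) * (sum of matched weights / |E|)."""
    edges = max_weight_matching(p, q, sd)
    if not edges:
        return 0.0  # no matched concepts: treated here as zero similarity
    unmatched = (len(p) + len(q) - 2 * len(edges)) / (len(p) + len(q))
    return (1 - unmatched) * sum(w for _, _, w in edges) / len(edges)


names = ["Project", "Country", "Type", "ESF_Contribution", "Country_Share"]
p = [n + "_ESF" for n in names]
q = [n + "_ECP" for n in names]
# Assumed weights: 0.63 for the Project pair, 1 for the others.
sd = {(a, b): 1.0 for a, b in zip(p, q)}
sd[("Project_ESF", "Project_ECP")] = 0.63
degree = similarity_degree(sd, p, q)
```

The first factor penalizes nodes left out of the matching, while the second is the average weight of the matched pairs; with five nodes on each side all matched, only the average matters.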
22.
- The function $\psi$ constructs the Sub-source Similarity Dictionary SSD by taking those similarities of SSS having a coefficient greater than a certain, dynamically computed, threshold:
  $SSD = \psi(SSS) = \{ \langle SS_i, SS_j, f_{ij} \rangle \mid \langle SS_i, SS_j, f_{ij} \rangle \in SSS,\ f_{ij} > th_{Sim} \}$
- Here $th_{Sim}$ is the dynamically computed threshold
  $th_{Sim} = \min\left(\frac{F_{Max} + F_{Min}}{2},\ th_M\right)$
  where
  - $F_{Max}$ is the maximum coefficient associated with the similarities of SSS
  - $F_{Min}$ is the minimum coefficient associated with the similarities of SSS
  - $th_M$ is a limit threshold value
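The threshold rule and the selection step can be sketched directly. The function names, the sub-source labels and the value chosen for the limit threshold are assumptions made for the example; the coefficients are illustrative.

```python
def dynamic_threshold(sss, th_m):
    """min((F_Max + F_Min) / 2, th_M) over the coefficients of the triplets in sss."""
    coeffs = [f for _, _, f in sss]
    return min((max(coeffs) + min(coeffs)) / 2, th_m)


def build_ssd(sss, th_m):
    """Keep the triplets whose coefficient is strictly greater than the threshold."""
    if not sss:
        return []
    th_sim = dynamic_threshold(sss, th_m)
    return [triplet for triplet in sss if triplet[2] > th_sim]


# Illustrative similarity triplets <SS_i, SS_j, f_ij>.
sss = [("SS1", "SS2", 0.93), ("SS3", "SS4", 0.70), ("SS5", "SS6", 0.40)]
ssd = build_ssd(sss, th_m=0.6)
```

In this run the midpoint (0.93 + 0.40) / 2 = 0.665 exceeds the assumed limit 0.6, so the limit is used and the first two triplets are kept. The cap prevents a single very high coefficient from pushing the threshold so far up that otherwise good similarities are discarded.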
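23.
- Consider the SDR-Networks ESF and ECP: $SSD = \psi(\xi(SPS, SD))$
- For the pair of sub-sources $(SS_1, SS_2) \in SPS$ derived previously, $\xi$ calls $\delta(SD, \nu(SS_1), \nu(SS_2))$, where $\nu$ returns the nodes of a sub-source and $\delta$ computes the similarity degree
- Figure: the bipartite graph and the associated maximum weight matching relative to $\nu(SS_1)$ and $\nu(SS_2)$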
24.
- The objective function associated with the maximum weight matching is
  $\left(1 - \frac{5 + 5 - 2 \cdot 5}{10}\right) \times \frac{0.63 + 1 + 1 + 1 + 1}{5} \approx 0.93$
- The similarity degrees associated with all the other promising pairs of sub-sources are obtained in the same way
- SSS is then given as input to the function $\psi$, which constructs the Sub-source Similarity Dictionary SSD
- SSD is determined by selecting those triplets of SSS whose similarity coefficient is greater than $th_{Sim}$
- In this example all similarities of SSS are valid, and SSD = SSS
25. Sub-source similarities can be exploited in several contexts.
All applications of Scheme Match relative to synonymies between single concepts can be extended to similarities between sub-sources.
In particular, sub-source similarities can be exploited for:
- Information Source Integration
- E-commerce
- Semantic Query Processing
- Data and Web Warehousing
- Source clustering and cataloguing
26. We have presented a semi-automatic technique for deriving similarities of sub-sources belonging to information sources having different formats.
The technique is based on a conceptual model, called SDR-Network, which makes it possible to uniformly represent information sources of different formats.
It consists of two steps: the first one selects a set of promising pairs of sub-sources, whereas the second one computes a similarity degree to associate with each pair of the set.
We have pointed out that the derivation of sub-source similarities is a special case of the more general problem of Scheme Match.
Finally, we have illustrated a set of applications that could benefit from sub-source similarities.
27.
- We have already designed an approach that exploits sub-source similarities for carrying out information source integration
- In the future we plan to:
  - develop techniques that exploit sub-source similarities in the other application contexts mentioned previously
  - define techniques for deriving other terminological and structural properties in the context of semi-structured information sources
28. Domenico Ursino
Dipartimento di Informatica, Matematica, Elettronica, Trasporti
Università Mediterranea di Reggio Calabria
E-mail: ursino_at_ing.unirc.it
Web: http://www.ing.unirc.it/didattica/inform00/gruppo