No slide title

About This Presentation

Slides: 29

Provided by: urs48

Transcript and Presenter's Notes

1
International Conference on Cooperative Information Systems (CoopIS 2001)
Deriving sub-source similarities from heterogeneous, semi-structured information sources
D. Rosaci, G. Terracina, D. Ursino
Dipartimento di Informatica, Matematica, Elettronica e Trasporti
Università Mediterranea di Reggio Calabria
September 5-7, Trento
2
Scheme Match: finding a mapping between those elements of two schemes that semantically correspond to each other
Applications: information source integration, e-commerce, scheme evaluation and migration, data and web warehousing, information source design and so on
The need for semi-automatic techniques for carrying out this task is now widely recognized
Most of the techniques for Scheme Match proposed in the literature have been designed only for databases
They aim at deriving terminological and structural relationships between single concepts
3
New approaches to Scheme Match, capable of handling semi-structured information sources, appear to be necessary
Such approaches must differ from the traditional ones since:

in semi-structured information sources, significant pieces of information are expressed as groups of concepts rather than single ones
different instances of the same concept can have different structures

The emphasis shifts from the extraction of semantic correspondences between concepts to the derivation of semantic correspondences between groups of concepts
4
We propose a semi-automatic technique for extracting similarities between sub-sources belonging to different, heterogeneous and semi-structured information sources
A conceptual model capable of uniformly handling information sources of different formats is extremely useful for this purpose
Translation rules should be defined from classical information source formats to the adopted conceptual model
Our approach exploits the SDR-Network conceptual model, which meets the requirements described above
5
Given an information source IS, the number of possible sub-sources that can be derived from it is extremely high
In order to avoid handling huge numbers of sub-source pairs, we propose a heuristic technique for singling out only the most promising ones
Once the most promising pairs of sub-sources have been selected, their similarity degree must be computed
SSi can be detected to be similar to SSj only if it is possible to single out concepts of SSi and SSj that are, in turn, pairwise similar
The similarity degree associated with each pair of sub-sources is determined by computing the objective function associated with a maximum weight matching
6
We are proposing the derivation of a property which, generally, is not handled by most of the approaches for Scheme Match proposed in the literature
The technique proposed here is part of a more general framework for deriving various kinds of terminological and structural properties
This is particularly interesting since:

the SDR-Network and its metrics have already been exploited for defining a technique for deriving synonymies and homonymies
on the whole, we propose a unified, semi-automatic approach for deriving concept synonymies and homonymies, as well as sub-source similarities
7

Given an information source IS, the associated SDR-Network Net(IS) is
$Net(IS) = \langle NS(IS), AS(IS) \rangle$
NS(IS) represents the set of nodes; each node is characterized by a name
AS(IS) denotes the set of arcs; each arc can be represented by a triplet $\langle S, T, L_{ST} \rangle$
S is the source node
T is the target node
$L_{ST} = [d_{ST}, r_{ST}]$ is a label associated with the arc

$d_{ST}$ is the semantic distance coefficient
it indicates how semantically close the concept expressed by T is to the concept expressed by S
this depends on the capability of the concept associated with T to characterize the concept associated with S
$r_{ST}$ is the semantic relevance coefficient; it indicates the fraction of instances of the concept denoted by S whose complete definition requires at least one instance of the concept represented by T

The Path Semantic Distance $PSD_P$ of a path P in Net(IS) is the sum of the semantic distance coefficients associated with the arcs included in the path
The Path Semantic Relevance $PSR_P$ of a path P in Net(IS) is the product of the semantic relevance coefficients associated with the arcs included in the path
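The two path metrics can be illustrated with a minimal Python sketch. The arc representation (a tuple of source, target, distance coefficient, relevance coefficient), the node names and the coefficient values are all assumptions made for illustration; they do not come from the presentation.

```python
from math import prod

# Each arc is a tuple (source, target, d, r): d is the semantic distance
# coefficient and r the semantic relevance coefficient of the arc.
# The node names and coefficient values below are hypothetical.
path = [
    ("Project", "Payment", 1, 0.8),
    ("Payment", "Amount", 0, 1.0),
    ("Amount", "Currency", 1, 0.5),
]


def path_semantic_distance(path):
    """Sum of the semantic distance coefficients of the arcs in the path."""
    return sum(d for _source, _target, d, _r in path)


def path_semantic_relevance(path):
    """Product of the semantic relevance coefficients of the arcs in the path."""
    return prod(r for _source, _target, _d, r in path)


psd = path_semantic_distance(path)   # 1 + 0 + 1
psr = path_semantic_relevance(path)  # 0.8 * 1.0 * 0.5
```

With these values the path has a Path Semantic Distance of 2, so it would count as a D-Path of index 2.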
The CD-Shortest-Path (Conditional D-Shortest-Path) between two nodes N and N' in Net(IS) and including an arc A (denoted by $\langle N, N' \rangle_A$) is the path having the minimum Path Semantic Distance among those connecting N and N' and including A
A D-Path$_n$ is a path P in Net(IS) such that $n \leq PSD_P < n+1$
The i-th neighborhood of an SDR-Network node x is
$nbh(x,i) = \{ A \mid A \in AS(IS),\ A = \langle z, y, l_{zy} \rangle,\ \langle x, y \rangle_A \text{ is a D-Path}_i,\ x \neq y \}, \quad i \geq 0$
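The neighborhood definition can be sketched in Python as follows. This is only an illustration under stated assumptions: arcs are tuples (source, target, d, r), distance coefficients are non-negative, and the CD-Shortest-Path from x to y that includes the arc from z to y is taken to be the shortest path from x to z followed by that arc. The function names and the sample network are hypothetical.

```python
import heapq
from collections import defaultdict
from math import floor, inf


def shortest_distances(arcs, start):
    """Dijkstra over the semantic distance coefficients, starting from `start`."""
    adjacency = defaultdict(list)
    for source, target, d, _r in arcs:
        adjacency[source].append((target, d))
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, inf):
            continue  # stale heap entry
        for nxt, d in adjacency[node]:
            new_cost = cost + d
            if new_cost < dist.get(nxt, inf):
                dist[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    return dist


def neighborhood(arcs, x, i):
    """Arcs (z, y, d, r) whose conditional shortest path from x is a D-Path of index i."""
    dist = shortest_distances(arcs, x)
    result = []
    for z, y, d, r in arcs:
        if y == x or z not in dist:
            continue  # the definition requires x != y; unreachable arcs are skipped
        # Assumption: cheapest path from x to y through the arc (z, y)
        # is the shortest path from x to z followed by that arc.
        psd = dist[z] + d
        if floor(psd) == i:  # i <= PSD < i + 1
            result.append((z, y, d, r))
    return result


# Hypothetical network with integer distance coefficients.
arcs = [
    ("Project", "Country", 0, 1.0),
    ("Project", "Payment", 1, 0.8),
    ("Payment", "Amount", 0, 1.0),
    ("Payment", "Bank", 1, 0.6),
]
nbh0 = neighborhood(arcs, "Project", 0)
nbh1 = neighborhood(arcs, "Project", 1)
nbh2 = neighborhood(arcs, "Project", 2)
```

In this sample network, the neighborhood of index 0 contains only the arc to Country, index 1 contains the arcs to Payment and Amount, and index 2 contains the arc to Bank.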

10
The number of possible sub-sources that can be
identified in IS is exponential in the number of
nodes of Net(IS)
We have defined a technique for singling out the
most promising pairs of sub-sources
The proposed technique receives two information
sources IS1 and IS2 and a Dictionary SD of
Synonymies between nodes of Net(IS1) and Net(IS2)
Synonymies are represented in SD by tuples of the form $\langle N_i, N_j, f_{ij} \rangle$, where Ni and Nj are the synonym nodes and $f_{ij}$ is a coefficient in the real interval [0,1] indicating the similarity degree of Ni and Nj
11

The technique works according to the following rules:
It considers those pairs of sub-sources [SSi, SSj] such that $SS_i \subseteq Net(IS_1)$ is a rooted sub-net having a node Ni as root, $SS_j \subseteq Net(IS_2)$ is a rooted sub-net having a node Nj as root, and Ni and Nj are interesting synonyms, i.e., the synonym coefficient associated with them is greater than a certain threshold
It computes the maximum weight matching on some suitable bipartite graphs obtained from the target nodes of the arcs included in the neighborhoods of Ni and Nj
Given a pair of synonym nodes Ni and Nj, it derives a promising pair of sub-sources [SSik, SSjk] for each k such that both nbh(Ni,k) and nbh(Nj,k) are not empty
SSik and SSjk are constructed by determining the promising pairs of arcs [Aik, Ajk] such that $A_{ik} \in nbh(N_i, l)$ and $A_{jk} \in nbh(N_j, l)$, for each l belonging to the integer interval [0,k]
A pair of arcs [Aik, Ajk] is considered promising if:
an edge between the target nodes Tik of Aik and Tjk of Ajk is present in the maximum weight matching computed on a suitable bipartite graph constructed from the target nodes of the arcs of nbh(Ni,l) and nbh(Nj,l), for some l belonging to the integer interval [0,k]
the similarity degree of Tik and Tjk is greater than a certain given threshold
The rationale underlying this approach is to construct promising pairs of sub-sources such that each pair consists of the maximum possible number of pairs of concepts whose synonymy has already been stated
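The selection of promising pairs can be sketched as below. This is a hedged illustration, not the authors' implementation: the matching is found by exhaustive search (suitable only for small node sets, and swapped in here for whatever matching algorithm the technique actually uses), the bipartite graph is assumed to have an edge wherever the similarity table holds a positive coefficient, and all node names, coefficients and the threshold are hypothetical.

```python
def max_weight_matching(left, right, weight):
    """Exhaustive maximum weight bipartite matching; fine for small inputs only."""
    best_total, best_pairs = 0.0, []

    def extend(index, used, total, pairs):
        nonlocal best_total, best_pairs
        if index == len(left):
            if total > best_total:
                best_total, best_pairs = total, list(pairs)
            return
        extend(index + 1, used, total, pairs)  # leave left[index] unmatched
        for r in right:
            w = weight.get((left[index], r), 0.0)
            if r not in used and w > 0:
                pairs.append((left[index], r, w))
                extend(index + 1, used | {r}, total + w, pairs)
                pairs.pop()

    extend(0, frozenset(), 0.0, [])
    return best_pairs


def promising_target_pairs(targets_i, targets_j, similarity, threshold):
    """Matched target pairs whose similarity degree exceeds the threshold."""
    matching = max_weight_matching(targets_i, targets_j, similarity)
    return [(a, b) for a, b, w in matching if w > threshold]


# Hypothetical target nodes of the arcs in two neighborhoods of the same index.
targets_1 = ["Country_1", "Type_1", "Budget_1"]
targets_2 = ["Country_2", "Type_2", "Duration_2"]
# Hypothetical similarity coefficients between nodes of the two sources.
similarity = {
    ("Country_1", "Country_2"): 1.0,
    ("Type_1", "Type_2"): 1.0,
    ("Type_1", "Country_2"): 0.3,
    ("Budget_1", "Duration_2"): 0.2,
}
pairs = promising_target_pairs(targets_1, targets_2, similarity, threshold=0.5)
```

Here the matching contains three edges, but the pair Budget_1 and Duration_2 is discarded because its coefficient does not exceed the threshold, so only the arcs leading to the two remaining pairs would enter the sub-sources.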

Theorem
Let IS1 and IS2 be two information sources and let Net(IS1) and Net(IS2) be the corresponding SDR-Networks. Let nc1 (resp., nc2) be the number of complex nodes of Net(IS1) (resp., Net(IS2)). Let l be the maximum neighborhood index associated with a node of Net(IS1) or Net(IS2). Then the number of possible pairs of sub-sources is $\min(nc_1, nc_2) \times (l+1)$
In real applications, the number of promising pairs of sub-sources relative to IS1 and IS2 is generally far smaller than $\min(nc_1, nc_2) \times (l+1)$
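As a quick illustration of the count stated in the theorem, assume hypothetical values of 4 complex nodes in Net(IS1), 6 complex nodes in Net(IS2) and a maximum neighborhood index of 2 (these numbers are not taken from the presentation):

```latex
\min(nc_1, nc_2) \times (l + 1) = \min(4, 6) \times (2 + 1) = 4 \times 3 = 12
```

The factor l + 1 reflects that neighborhood indexes range over the integer interval [0, l].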

14
The SDR-Network of the European Social Funds
(ESF) information source
15
The SDR-Network of European Community Projects
(ECP) information source
16
The Synonymy Dictionary associated with ESF and
ECP
17

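The interesting pairs of synonym nodes are
$\langle \text{Judicial Person}_{[ESF]}, \text{Partner}_{[ECP]}, 0.59 \rangle$, $\langle \text{Payment}_{[ESF]}, \text{Payment}_{[ECP]}, 0.65 \rangle$, $\langle \text{Project}_{[ESF]}, \text{Project}_{[ECP]}, 0.63 \rangle$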
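Selecting the interesting synonyms amounts to filtering the Synonymy Dictionary by a threshold, which can be sketched as follows. The first three tuples are the interesting pairs listed above; the fourth tuple, the threshold value and the function name are hypothetical and serve only to show a pair being discarded.

```python
# Synonymy Dictionary entries as (node of Net(IS1), node of Net(IS2), coefficient).
# The last tuple and the threshold are hypothetical, added for illustration only.
SD = [
    ("Judicial Person[ESF]", "Partner[ECP]", 0.59),
    ("Payment[ESF]", "Payment[ECP]", 0.65),
    ("Project[ESF]", "Project[ECP]", 0.63),
    ("Region[ESF]", "Area[ECP]", 0.30),
]
THRESHOLD = 0.5


def interesting_synonyms(sd, threshold):
    """Keep the synonym pairs whose coefficient is greater than the threshold."""
    return [(ni, nj, f) for ni, nj, f in sd if f > threshold]


interesting = interesting_synonyms(SD, THRESHOLD)
```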
As an example, consider the pair of synonym nodes Project[ESF] and Project[ECP]
Since the neighborhoods of Project[ESF] and Project[ECP] are both non-empty only for k = 0, k = 1 and k = 2, our technique obtains three promising pairs of sub-sources relative to Project[ESF] and Project[ECP]
To illustrate the behaviour of our technique, we show the derivation of the promising pair of sub-sources associated with Project[ESF] and Project[ECP] for k = 0

The bipartite graph and the associated maximum
weight matching are

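The technique selects only those arcs of nbh(Project[ESF],0) and nbh(Project[ECP],0) which participate in the matching and have a similarity coefficient greater than a certain given threshold
The promising pair of sub-sources associated with nbh(Project[ESF],0) and nbh(Project[ECP],0) is [SS1, SS2]
$SS_1 = \{ \langle \text{Project}_{[ESF]}, \text{Country}_{[ESF]}, [0, 1] \rangle, \langle \text{Project}_{[ESF]}, \text{Type}_{[ESF]}, [0, 0.9] \rangle, \langle \text{Project}_{[ESF]}, \text{ESF\_Contribution}_{[ESF]}, [0, 0.75] \rangle, \langle \text{Project}_{[ESF]}, \text{Country\_Share}_{[ESF]}, [0, 0.9] \rangle \}$
$SS_2 = \{ \langle \text{Project}_{[ECP]}, \text{Country}_{[ECP]}, [0, 1] \rangle, \langle \text{Project}_{[ECP]}, \text{Type}_{[ECP]}, [0, 0.6] \rangle, \langle \text{Project}_{[ECP]}, \text{ESF\_Contribution}_{[ECP]}, [0, 0.8] \rangle, \langle \text{Project}_{[ECP]}, \text{Country\_Share}_{[ECP]}, [0, 1] \rangle \}$
The technique works analogously for k = 1 and k = 2 as well as for the other interesting synonym pairs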

The technique for deriving sub-source similarities from a given pair of information sources consists of two steps
The first one computes the similarity degree relative to each promising pair of sub-sources derived previously
The second one constructs a Sub-source Similarity Dictionary SSD by selecting only those pairs of sub-sources whose similarity degree is greater than a certain, dynamically computed, threshold
More formally, the technique can be encoded as follows
$SSD = ?(?(SPS, SD))$
where
SPS is the set of promising pairs of sub-sources
SD is the Synonymy Dictionary

For each promising pair of sub-sources SSi and SSj, the inner function ? calls a further function ? which computes the corresponding similarity degree
$SSS = ?(SPS, SD) = \{ \langle SS_i, SS_j, ?(SD, ?(SS_i), ?(SS_j)) \rangle \mid [SS_i, SS_j] \in SPS \}$
The function ? applied to SSi and SSj receives a rooted sub-net SS and returns the nodes of SS
The function ? applied to SD and the two node sets derives the similarity degree associated with SSi and SSj by computing a suitable objective function associated with the maximum weight matching on a bipartite graph constructed from the nodes of SSi and SSj
$?(T, P, Q) = \left( 1 - \frac{|P| + |Q| - 2|E|}{|P| + |Q|} \right) \times \frac{?(E)}{|E|}$
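The objective function can be sketched in Python as follows. The sketch assumes the formula as it can be read from this transcript: E is taken to be the set of edges of the maximum weight matching, the numerator of the second factor is taken to be the sum of their weights, and any coefficient that may have been lost in front of the first fraction is ignored. The function name and the handling of an empty matching are assumptions. The sample values are those of the ESF/ECP example in this presentation (five nodes on each side, five matched edges with weights 0.63, 1, 1, 1, 1).

```python
def objective(p_size, q_size, matching_weights):
    """(1 - (|P| + |Q| - 2|E|) / (|P| + |Q|)) * (sum of edge weights / |E|)."""
    e_size = len(matching_weights)
    if e_size == 0:
        return 0.0  # assumption: no matched nodes means no similarity
    # Fraction of nodes of P and Q covered by the matching.
    coverage = 1 - (p_size + q_size - 2 * e_size) / (p_size + q_size)
    # Average weight of the matched edges.
    average_weight = sum(matching_weights) / e_size
    return coverage * average_weight


degree = objective(5, 5, [0.63, 1, 1, 1, 1])
rounded = round(degree, 2)  # 0.93
```

The first factor penalizes nodes left unmatched, while the second rewards strong matches; in the example every node is matched, so the result is just the average edge weight.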

The function ? is called for constructing the Sub-source Similarity Dictionary SSD by taking those similarities of SSS having a coefficient greater than a certain, dynamically computed, threshold
$SSD = ?(SSS) = \{ \langle SS_i, SS_j, f_{ij} \rangle \mid \langle SS_i, SS_j, f_{ij} \rangle \in SSS,\ f_{ij} > th_{Sim} \}$
Here thSim is the dynamically computed threshold
$th_{Sim} = \min\left( \frac{F_{Max} + F_{Min}}{2},\ th_M \right)$
where
FMax is the maximum coefficient associated with the similarities of SSS
FMin is the minimum coefficient associated with the similarities of SSS
thM is a limit threshold value
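The threshold-based selection can be sketched as below. The function name, the return of the threshold alongside the selected triplets, and the handling of an empty input are assumptions; apart from the coefficient 0.93, which appears in the example of this presentation, the sample triplets and the value of thM are hypothetical.

```python
def select_similar(sss, th_m):
    """Keep the triplets of SSS whose coefficient exceeds thSim; also return thSim."""
    if not sss:
        return [], th_m  # assumption: with no similarities only thM is meaningful
    coefficients = [f for _ssi, _ssj, f in sss]
    # thSim = min((FMax + FMin) / 2, thM)
    th_sim = min((max(coefficients) + min(coefficients)) / 2, th_m)
    return [t for t in sss if t[2] > th_sim], th_sim


# Hypothetical similarities between pairs of sub-sources.
SSS = [
    ("SS1", "SS2", 0.93),
    ("SS3", "SS4", 0.71),
    ("SS5", "SS6", 0.40),
]
SSD, th_sim = select_similar(SSS, th_m=0.6)
```

Here the midpoint between the largest and smallest coefficients is 0.665, so the limit value 0.6 becomes the threshold and the first two triplets are kept.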

Consider the SDR-Networks ESF and ECP and the encoding $SSD = ?(?(SPS, SD))$
As for the pair of sub-sources $[SS_1, SS_2] \in SPS$ derived previously, the inner function ? calls $?(SD, ?(SS_1), ?(SS_2))$
The bipartite graph and the associated maximum
weight matching relative to ?(SS1) and ?(SS2)
are

The objective function associated with the maximum weight matching is
$\left( 1 - \frac{5 + 5 - 2 \cdot 5}{10} \right) \times \frac{0.63 + 1 + 1 + 1 + 1}{5} = 0.93$
The similarity degrees associated with all the other promising pairs of sub-sources are obtained in the same way
Then SSS is provided as input to the function ? which constructs the Sub-source Similarity Dictionary SSD
SSD is determined by selecting those triplets of SSS whose similarity coefficient is greater than thSim
In this example all similarities of SSS are valid and SSD = SSS

25
Sub-source similarities can be exploited in several contexts
All applications of Scheme Match relative to synonymies between single concepts can be extended to similarities between sub-sources
In particular, sub-source similarities can be exploited for:
Information Source Integration
E-commerce
Semantic Query Processing
Data and Web Warehouse
Source clustering and cataloguing
26
We have presented a semi-automatic technique for deriving similarities of sub-sources belonging to information sources having different formats
The technique is based on a conceptual model, called SDR-Network, which allows information sources of different formats to be represented uniformly
We have pointed out that the derivation of sub-source similarities is a special case of the more general problem of Scheme Match
The technique consists of two steps: the first one selects a set of promising pairs of sub-sources, whereas the second one computes a similarity degree to associate with each pair of the set
Finally, we have illustrated a set of applications which could benefit from sub-source similarities
27

We have already designed an approach which
exploits sub-source similarities for carrying out
information source integration
In the future we plan to:
develop techniques which exploit sub-source similarities in the other application contexts mentioned previously
define techniques for deriving other terminological and structural properties in the context of semi-structured information sources

28
Domenico Ursino
Dipartimento di Informatica, Matematica, Elettronica, Trasporti
Università Mediterranea di Reggio Calabria
E-mail: ursino_at_ing.unirc.it
Web: http://www.ing.unirc.it/didattica/inform00/gruppo
