1
International Conference on Cooperative
Information Systems (CoopIS 2001)
Deriving sub-source similarities from
heterogeneous, semi-structured information
sources
D. Rosaci, G. Terracina, D. Ursino
Dipartimento di Informatica, Matematica,
Elettronica e Trasporti
Università Mediterranea di Reggio Calabria
September 5-7, Trento
2
Scheme Match: finding a mapping between those
elements of two schemes that semantically
correspond to each other
Applications: information source integration,
e-commerce, scheme evaluation and migration, data
and web warehousing, information source design,
and so on
The need for semi-automatic techniques for
carrying out this task is now widely recognized
Most of the techniques for Scheme Match proposed
in the literature have been designed only for
databases
They aim at deriving terminological and
structural relationships between single concepts
3
New approaches to Scheme Match, able to handle
semi-structured information sources, appear to be
necessary
Such approaches must differ from the traditional
ones, since:
  • in semi-structured information sources,
    significant pieces of information are expressed
    as groups of concepts rather than as single
    concepts
  • different instances of the same concept can
    have different structures

The emphasis shifts away from the extraction of
semantic correspondences between concepts to the
derivation of semantic correspondences between
groups of concepts
4
We propose a semi-automatic technique for
extracting similarities between sub-sources
belonging to different, heterogeneous and
semi-structured information sources
The adoption of a conceptual model capable of
uniformly handling information sources of
different formats appears to be extremely useful
Translation rules should be defined from
classical information source formats to the
adopted conceptual model
Our approach exploits the SDR-Network conceptual
model, which meets the requirements described above
5
Given an information source IS, the number of
possible sub-sources that can be derived from it
is extremely high
In order to avoid handling huge numbers of
sub-source pairs, we propose a heuristic
technique for singling out only the most
promising ones
After the most promising pairs of sub-sources
have been selected, their similarity degree must
be computed
SSi can be detected to be similar to SSj only if
it is possible to single out concepts of SSi and
SSj that are, in turn, pairwise similar
The similarity degree associated with each pair of
sub-sources is determined by computing the
objective function associated with a maximum
weight matching
6
The SDR-Network and its metrics have already been
exploited for defining a technique for deriving
synonymies and homonymies
On the whole, we propose a unified,
semi-automatic approach for deriving concept
synonymies and homonymies, as well as sub-source
similarities
This is particularly interesting since:
  • We are proposing the derivation of a property
    which, generally, is not handled by most of the
    approaches for Scheme Match proposed in the
    literature
  • The technique proposed here is part of a more
    general framework for deriving various kinds of
    terminological and structural properties

7
  • Given an information source IS, the associated
    SDR-Network Net(IS) is
  • $Net(IS) = \langle NS(IS), AS(IS) \rangle$
  • NS(IS) is the set of nodes; each node is
    characterized by a name
  • AS(IS) is the set of arcs; each arc can be
    represented by a triplet $\langle S, T, L_{ST} \rangle$
  • S is the source node
  • T is the target node
  • $L_{ST} = [d_{ST}, r_{ST}]$ is a label associated with
    the arc

8
  • $d_{ST}$ is the semantic distance coefficient
  • it indicates how close the concept expressed by T
    is, semantically, to the concept expressed by S
  • this depends on the capability of the concept
    associated with T to characterize the concept
    associated with S
  • $r_{ST}$ is the semantic relevance coefficient; it
    indicates the fraction of instances of the
    concept denoted by S whose complete definition
    requires at least one instance of the concept
    represented by T
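The definitions on slides 7 and 8 can be pictured as a small data structure. The following is a minimal sketch, assuming Python dataclasses; the names `Arc`, `SDRNetwork`, `d` and `r` are illustrative choices and do not come from the presentation. The two sample arcs reuse labels that appear later in the ESF example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Arc:
    source: str   # S, the source node name
    target: str   # T, the target node name
    d: float      # d_ST, semantic distance coefficient
    r: float      # r_ST, semantic relevance coefficient (a fraction)


@dataclass
class SDRNetwork:
    nodes: set = field(default_factory=set)    # NS(IS)
    arcs: list = field(default_factory=list)   # AS(IS)

    def add_arc(self, source, target, d, r):
        # r_ST is a fraction of instances, so it is assumed to lie in [0, 1].
        if not 0.0 <= r <= 1.0:
            raise ValueError("r_ST must lie in [0, 1]")
        self.nodes.update((source, target))
        self.arcs.append(Arc(source, target, d, r))


net = SDRNetwork()
net.add_arc("Project", "Country", 0, 1.0)
net.add_arc("Project", "Type", 0, 0.9)
```

Keeping the label as two plain fields on the arc makes the path metrics of the next slide simple folds over a list of arcs.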

9
  • The Path Semantic Distance $PSD_P$ of a path P in
    Net(IS) is the sum of the semantic distance
    coefficients associated with the arcs included in
    the path
  • The Path Semantic Relevance $PSR_P$ of a path P in
    Net(IS) is the product of the semantic relevance
    coefficients associated with the arcs included in
    the path
  • The CD-Shortest-Path (Conditional
    D-Shortest-Path) between two nodes N and N' in
    Net(IS) and including an arc A (denoted by
    $\langle N, N' \rangle_A$) is the path having the minimum Path
    Semantic Distance among those connecting N and N'
    and including A
  • A $D\_Path_n$ is a path P in Net(IS) such that
    $n \le PSD_P < n + 1$
  • The i-th neighborhood of an SDR-Network node x
    is
  • $nbh(x, i) = \{ A \mid A \in AS(IS),\ A = \langle z, y, l_{zy} \rangle,\ \langle x, y \rangle_A \text{ is a } D\_Path_i,\ x \neq y \}, \quad i \ge 0$
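The metrics above can be sketched in a few lines. This is only an illustration under stated assumptions: arcs are plain tuples `(source, target, d, r)`, the coefficients in `ARCS` are invented, distances are non-negative, and the CD-shortest-path from x to y through an arc from z to y is taken to be "shortest way to reach z, then the arc itself". The function names are hypothetical.

```python
import heapq
import math

# Arcs as (source, target, d, r); the values are invented for illustration.
ARCS = [
    ("A", "B", 0.0, 1.0),
    ("A", "C", 0.5, 0.8),
    ("C", "D", 0.7, 0.5),
    ("B", "D", 1.5, 1.0),
]


def psd(path):
    """Path Semantic Distance: sum of d over the arcs of the path."""
    return sum(d for _, _, d, _ in path)


def psr(path):
    """Path Semantic Relevance: product of r over the arcs of the path."""
    return math.prod(r for _, _, _, r in path)


def shortest_distances(arcs, start):
    """Dijkstra over the d coefficients (assumed non-negative)."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        cur, node = heapq.heappop(heap)
        if cur > dist.get(node, math.inf):
            continue  # stale heap entry
        for s, t, d, _ in arcs:
            if s == node and cur + d < dist.get(t, math.inf):
                dist[t] = cur + d
                heapq.heappush(heap, (cur + d, t))
    return dist


def nbh(arcs, x, i):
    """Arcs <z, y, l_zy> whose CD-shortest-path from x to y is a D_Path_i."""
    dist = shortest_distances(arcs, x)
    result = []
    for arc in arcs:
        z, y, d, _ = arc
        if y == x or z not in dist:
            continue
        # Shortest path from x to y forced through this arc.
        if math.floor(dist[z] + d) == i:   # i <= PSD < i + 1
            result.append(arc)
    return result
```

With the sample arcs, `nbh(ARCS, "A", 0)` contains the two arcs leaving A, and `nbh(ARCS, "A", 1)` contains the two arcs entering D.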

10
The number of possible sub-sources that can be
identified in IS is exponential in the number of
nodes of Net(IS)
We have defined a technique for singling out the
most promising pairs of sub-sources
The proposed technique receives two information
sources IS1 and IS2 and a Dictionary SD of
Synonymies between nodes of Net(IS1) and Net(IS2)
Synonymies are represented in SD by tuples of the
form $\langle N_i, N_j, f_{ij} \rangle$, where $N_i$ and $N_j$ are the
synonym nodes and $f_{ij}$ is a coefficient in the
real interval [0, 1], indicating the similarity
degree of $N_i$ and $N_j$
11
  • The technique works according to the following
    rules
  • It considers those pairs of sub-sources
    $(SS_i, SS_j)$ such that $SS_i \subseteq Net(IS_1)$ is a
    rooted sub-net having a node $N_i$ as root,
    $SS_j \subseteq Net(IS_2)$ is a rooted sub-net having a
    node $N_j$ as root, and $N_i$ and $N_j$ are interesting
    synonyms, i.e., the synonym coefficient associated
    with them is greater than a certain threshold
  • It computes the maximum weight matching on some
    suitable bipartite graphs obtained from the
    target nodes of the arcs included in the
    neighborhoods of $N_i$ and $N_j$
  • Given a pair of synonym nodes $N_i$ and $N_j$, it
    derives a promising pair of sub-sources
    $(SS_{ik}, SS_{jk})$ for each k such that both
    $nbh(N_i, k)$ and $nbh(N_j, k)$ are not empty
  • $SS_{ik}$ and $SS_{jk}$ are constructed by determining
    the promising pairs of arcs $(A_{ik}, A_{jk})$ such that
    $A_{ik} \in nbh(N_i, l)$, $A_{jk} \in nbh(N_j, l)$, for each l
    belonging to the integer interval [0, k]

12
  • A pair of arcs $(A_{ik}, A_{jk})$ is considered
    promising if
  • An edge between the target nodes $T_{ik}$ of $A_{ik}$ and
    $T_{jk}$ of $A_{jk}$ is present in the maximum weight
    matching computed on a suitable bipartite graph
    constructed from the target nodes of the arcs of
    $nbh(N_i, l)$ and $nbh(N_j, l)$, for some l belonging
    to the integer interval [0, k]
  • The similarity degree of $T_{ik}$ and $T_{jk}$ is greater
    than a certain given threshold
  • The rationale underlying this approach is that of
    constructing promising pairs of sub-sources such
    that each pair consists of the maximum possible
    number of pairs of concepts whose synonymy has
    already been stated
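The two conditions for a promising pair of arcs can be sketched as follows. This is a simplified illustration, not the authors' implementation: it handles a single neighborhood level, it finds the maximum weight matching by brute force (adequate only for a handful of nodes), and the node names, the dictionary `sd` and its coefficients are all invented.

```python
from itertools import permutations


def max_weight_matching(left, right, weight):
    """Brute-force maximum weight bipartite matching; small graphs only."""
    if len(left) > len(right):
        flipped = max_weight_matching(right, left, lambda a, b: weight(b, a))
        return [(a, b) for b, a in flipped]
    best, best_total = [], -1.0
    for perm in permutations(right, len(left)):
        # Keep only edges that actually exist (positive similarity).
        pairs = [(a, b) for a, b in zip(left, perm) if weight(a, b) > 0]
        total = sum(weight(a, b) for a, b in pairs)
        if total > best_total:
            best, best_total = pairs, total
    return best


def promising_arc_pairs(arcs_i, arcs_j, sd, threshold):
    """Arc pairs whose targets are matched and similar above the threshold."""
    def sim(a, b):
        return sd.get((a, b), 0.0)

    targets_i = sorted({t for _, t in arcs_i})
    targets_j = sorted({t for _, t in arcs_j})
    matched = set(max_weight_matching(targets_i, targets_j, sim))
    return [
        (a_i, a_j)
        for a_i in arcs_i
        for a_j in arcs_j
        if (a_i[1], a_j[1]) in matched and sim(a_i[1], a_j[1]) > threshold
    ]


# Hypothetical neighborhoods of two synonym roots P and Q, as (source, target).
arcs_i = [("P", "Country"), ("P", "Type"), ("P", "Notes")]
arcs_j = [("Q", "Nation"), ("Q", "Kind")]
sd = {
    ("Country", "Nation"): 0.9,
    ("Type", "Kind"): 0.8,
    ("Notes", "Kind"): 0.4,
    ("Type", "Nation"): 0.3,
}
pairs = promising_arc_pairs(arcs_i, arcs_j, sd, threshold=0.5)
```

Here `pairs` holds the Country/Nation and Type/Kind arc pairs; `Notes` is left out because the matching prefers `Type` for `Kind`.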

13
  • Theorem
  • Let IS1 and IS2 be two information sources and
    let Net(IS1) and Net(IS2) be the corresponding
    SDR-Networks. Let nc1 (resp., nc2) be the number
    of complex nodes of Net(IS1) (resp., Net(IS2)).
    Let l be the maximum neighborhood index
    associated with a node of Net(IS1) or Net(IS2).
    Then the number of possible pairs of sub-sources
    is $\min(nc_1, nc_2) \times (l + 1)$
  • Actually, in real applications, the number of
    promising pairs of sub-sources relative to IS1
    and IS2 is generally far smaller than
    $\min(nc_1, nc_2) \times (l + 1)$
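One way to read the bound: at most min(nc1, nc2) pairs of root nodes, each contributing at most one pair of sub-sources per neighborhood index from 0 to l. A tiny sketch of the arithmetic, with a hypothetical function name and invented input values:

```python
def max_sub_source_pairs(nc1, nc2, l):
    """Upper bound from the theorem: min(nc1, nc2) * (l + 1)."""
    return min(nc1, nc2) * (l + 1)


# Invented sizes: 6 and 4 complex nodes, maximum neighborhood index 2.
bound = max_sub_source_pairs(6, 4, 2)   # 4 root pairs * 3 indices
```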

14
The SDR-Network of the European Social Funds
(ESF) information source
15
The SDR-Network of European Community Projects
(ECP) information source
16
The Synonymy Dictionary associated with ESF and
ECP
17
  • The interesting pairs of synonym nodes are
    $\langle Judicial\ Person_{ESF}, Partner_{ECP}, 0.59 \rangle$,
    $\langle Payment_{ESF}, Payment_{ECP}, 0.65 \rangle$,
    $\langle Project_{ESF}, Project_{ECP}, 0.63 \rangle$
  • As an example, consider the pair of synonym nodes
    $Project_{ESF}$ and $Project_{ECP}$
  • Since the neighborhoods of $Project_{ESF}$ and
    $Project_{ECP}$ are both non-empty only for k = 0,
    k = 1 and k = 2, our technique obtains three
    promising pairs of sub-sources relative to
    $Project_{ESF}$ and $Project_{ECP}$
  • In order to provide an example of the behaviour
    of our technique, we show the derivation of the
    promising pair of sub-sources associated with
    $Project_{ESF}$ and $Project_{ECP}$ for k = 0

18
  • The bipartite graph and the associated maximum
    weight matching are

19
  • The technique selects only those arcs of
    $nbh(Project_{ESF}, 0)$ and $nbh(Project_{ECP}, 0)$ which
    participate in the matching and have a similarity
    coefficient greater than a certain given
    threshold
  • The promising pair of sub-sources associated with
    $nbh(Project_{ESF}, 0)$ and $nbh(Project_{ECP}, 0)$ is
    $(SS_1, SS_2)$
  • $SS_1 = \{$ $\langle Project_{ESF}, Country_{ESF}, [0, 1] \rangle$,
    $\langle Project_{ESF}, Type_{ESF}, [0, 0.9] \rangle$,
    $\langle Project_{ESF}, ESF\_Contribution_{ESF}, [0, 0.75] \rangle$,
    $\langle Project_{ESF}, Country\_Share_{ESF}, [0, 0.9] \rangle$ $\}$
  • $SS_2 = \{$ $\langle Project_{ECP}, Country_{ECP}, [0, 1] \rangle$,
    $\langle Project_{ECP}, Type_{ECP}, [0, 0.6] \rangle$,
    $\langle Project_{ECP}, ESF\_Contribution_{ECP}, [0, 0.8] \rangle$,
    $\langle Project_{ECP}, Country\_Share_{ECP}, [0, 1] \rangle$ $\}$
  • The technique works analogously for k = 1 and
    k = 2, as well as for the other interesting
    synonym pairs

20
  • The technique for deriving sub-source
    similarities from a given pair of information
    sources consists of two steps
  • The first one computes the similarity degree
    of each promising pair of sub-sources derived
    previously
  • The second one constructs a Sub-source Similarity
    Dictionary SSD by selecting only those pairs of
    sub-sources whose similarity degree is greater
    than a certain, dynamically computed, threshold
  • More formally, the technique can be encoded as
    follows: $SSD = \xi(\rho(SPS, SD))$
  • where
  • SPS is the set of promising pairs of sub-sources
  • SD is the Synonymy Dictionary

21
  • For each promising pair of sub-sources SSi and
    SSj, the function $\rho$ calls a function $\psi$ which
    computes the corresponding similarity degree:
    $SSS = \rho(SPS, SD) = \{ \langle SS_i, SS_j, \psi(SD, \nu(SS_i), \nu(SS_j)) \rangle \mid (SS_i, SS_j) \in SPS \}$
  • The function $\nu$ receives a rooted sub-net SS and
    returns the nodes of SS
  • The function $\psi$ derives the similarity degree
    associated with SSi and SSj by computing a
    suitable objective function associated with the
    maximum weight matching on a bipartite graph
    constructed from the nodes of SSi and SSj:
    $\psi(T, P, Q) = \left( 1 - \frac{|P| + |Q| - 2|E|}{|P| + |Q|} \right) \times \frac{\phi(E)}{|E|}$
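A minimal sketch of the node-extraction function and of the objective function, assuming that E is the set of edges of the maximum weight matching and that the numerator of the last factor is the sum of their weights (the worked example on slide 24 is consistent with this reading). The function names are hypothetical, the matching itself is not computed here, and the matched weights are invented; only the arcs of `ss1` come from the slides.

```python
def nodes(sub_source):
    """Node names of a rooted sub-net given as (source, target, d, r) arcs."""
    seen = []
    for source, target, _, _ in sub_source:
        for name in (source, target):
            if name not in seen:
                seen.append(name)
    return seen


def objective(matched_weights, size_p, size_q):
    """(1 - (|P| + |Q| - 2|E|) / (|P| + |Q|)) * (sum of weights / |E|)."""
    matched = len(matched_weights)
    if matched == 0:
        return 0.0   # assumption: no matched edge means no similarity
    unmatched_share = (size_p + size_q - 2 * matched) / (size_p + size_q)
    return (1 - unmatched_share) * (sum(matched_weights) / matched)


# SS1 as listed on slide 19 (source suffixes dropped for brevity).
ss1 = [
    ("Project", "Country", 0, 1),
    ("Project", "Type", 0, 0.9),
    ("Project", "ESF_Contribution", 0, 0.75),
    ("Project", "Country_Share", 0, 0.9),
]
p = nodes(ss1)                          # five nodes
full = objective([0.8] * 5, 5, 5)       # every node matched
partial = objective([0.8] * 3, 5, 5)    # two nodes per side unmatched
```

The first factor penalizes unmatched nodes: with the same average edge weight, `partial` is lower than `full` because only three of the five nodes on each side take part in the matching.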

22
  • The function $\xi$ is called for constructing the
    Sub-source Similarity Dictionary SSD by taking
    those similarities of SSS having a coefficient
    greater than a certain, dynamically computed,
    threshold:
    $SSD = \xi(SSS) = \{ \langle SS_i, SS_j, f_{ij} \rangle \mid \langle SS_i, SS_j, f_{ij} \rangle \in SSS,\ f_{ij} > th_{Sim} \}$
  • Here $th_{Sim}$ is the dynamically computed
    threshold: $th_{Sim} = \min((F_{Max} + F_{Min}) / 2,\ th_M)$
  • where
  • $F_{Max}$ is the maximum coefficient associated with
    the similarities of SSS
  • $F_{Min}$ is the minimum coefficient associated with
    the similarities of SSS
  • $th_M$ is a limit threshold value
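The selection step can be sketched as follows, assuming SSS is a list of `(SSi, SSj, fij)` triplets. The function names, the sub-source labels and every coefficient below are invented for illustration.

```python
def dynamic_threshold(sss, th_m):
    """thSim = min((FMax + FMin) / 2, thM)."""
    coefficients = [f for _, _, f in sss]
    return min((max(coefficients) + min(coefficients)) / 2, th_m)


def build_ssd(sss, th_m):
    """Keep the triplets whose coefficient is strictly above thSim."""
    if not sss:
        return []
    th_sim = dynamic_threshold(sss, th_m)
    return [triplet for triplet in sss if triplet[2] > th_sim]


# Invented similarities.
sss = [("SSa", "SSb", 0.93), ("SSc", "SSd", 0.62), ("SSe", "SSf", 0.40)]
loose = build_ssd(sss, th_m=0.6)    # thM is the binding value
strict = build_ssd(sss, th_m=0.8)   # the midpoint is the binding value
```

With `th_m=0.6` the first two triplets survive; with `th_m=0.8` the midpoint of the extremes becomes the threshold and only the first one does. Because the comparison is strict, the lowest-scoring triplet can only be kept when thM lies below FMin.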

23
  • Consider the SDR-Networks ESF and ECP:
    $SSD = \xi(\rho(SPS, SD))$
  • As for the pair of sub-sources $(SS_1, SS_2) \in SPS$
    derived previously, $\rho$ calls
    $\psi(SD, \nu(SS_1), \nu(SS_2))$
  • The bipartite graph and the associated maximum
    weight matching relative to $\nu(SS_1)$ and $\nu(SS_2)$
    are

24
  • The objective function associated with the maximum
    weight matching is
    $\left( 1 - \frac{5 + 5 - 2 \cdot 5}{10} \right) \times \frac{0.63 + 1 + 1 + 1 + 1}{5} = 0.93$
  • In the same way the similarity degrees associated
    with all the other promising pairs of sub-sources
    are obtained
  • Then SSS is provided as input to the function $\xi$,
    which constructs the Sub-source Similarity
    Dictionary SSD
  • SSD is determined by selecting those triplets of
    SSS whose similarity coefficient is greater than
    thSim
  • In this example all similarities of SSS are valid
    and SSD = SSS
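Spelled out step by step, and reading the figures on the slide as |P| = |Q| = 5 nodes, |E| = 5 matched edges and edge weights 0.63, 1, 1, 1, 1, the computation of the similarity degree of SS1 and SS2 is:

```latex
\begin{aligned}
\left( 1 - \frac{|P| + |Q| - 2|E|}{|P| + |Q|} \right) \times \frac{\phi(E)}{|E|}
  &= \left( 1 - \frac{5 + 5 - 2 \cdot 5}{5 + 5} \right) \times \frac{0.63 + 1 + 1 + 1 + 1}{5} \\
  &= \left( 1 - \frac{0}{10} \right) \times \frac{4.63}{5} \\
  &= 1 \times 0.926 \approx 0.93
\end{aligned}
```

Since every node takes part in the matching, the first factor is 1 and the result is simply the average weight of the matched edges.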

25
Sub-source similarities can be exploited in
several contexts
All applications of Scheme Match relative to
synonymies between single concepts can be
extended to similarities between sub-sources
In particular, sub-source similarities can be
exploited for:
  • Information Source Integration
  • E-commerce
  • Semantic Query Processing
  • Data and Web Warehouse
  • Source clustering and cataloguing
We have presented a semi-automatic technique for
deriving similarities between sub-sources
belonging to information sources having different
formats
The technique is based on a conceptual model,
called SDR-Network, which allows information
sources of different formats to be represented
uniformly
We have pointed out that the derivation of
sub-source similarities is a special case of the
more general problem of Scheme Match
The technique consists of two steps: the first one
selects a set of promising pairs of sub-sources,
whereas the second one computes a similarity
degree to associate with each pair of the set
Finally, we have illustrated a set of
applications which could benefit from sub-source
similarities
27
  • We have already designed an approach which
    exploits sub-source similarities for carrying out
    information source integration
  • In the future we plan to:
  • Develop techniques which exploit sub-source
    similarities in the other possible application
    contexts we have previously mentioned
  • Define techniques for deriving other
    terminological and structural properties in the
    context of semi-structured information sources

28
Domenico Ursino
Dipartimento di Informatica, Matematica,
Elettronica, Trasporti
Università Mediterranea di Reggio Calabria
E-mail: ursino_at_ing.unirc.it
Web: http://www.ing.unirc.it/didattica/inform00/gruppo