DISSIMILARITIES AND MATCHING BETWEEN SYMBOLIC OBJECTS

1
DISSIMILARITIES AND MATCHING BETWEEN SYMBOLIC
OBJECTS
  • Prof. Donato Malerba
  • Department of Informatics,
  • University of Bari, Italy
  • malerba_at_di.uniba.it
  • ASSO School
  • Athens, Greece
  • October 6-8, 2003

2
COMPUTING DISSIMILARITIES: WHY?
  • Several data analysis techniques are based on
    quantifying a dissimilarity (or similarity)
    measure between multivariate data.
  • Clustering
  • Discriminant analysis
  • Visualization-based approaches
  • Symbolic objects are a kind of multivariate data.
  • Ex.: [colour = {red, black}] ∧ [weight ∈ {60, 70, 80}]
    ∧ [height ∈ [1.50, 1.60]]
  • The dissimilarity measures presented here are
    among those investigated in the ASSO Project.

3
A case study
  • Abalone features survey
  • Abalones are members of a large class
    (Gastropoda) of molluscs having one-piece shells.
  • 4177 cases of marine molluscs described by
    the following attributes

4
The construction of SO
  • DB2SO facility of the ASSO system to generate
    (Boolean or Probabilistic) symbolic objects from
    relational databases.
  • Input
  • a set of groups or classes C1, C2, ..., CK
  • a set of n individuals ωk, each of which is
    described by p variables Y1, ..., Yp and is
    assigned to one or more groups
  • Output
  • a set of K symbolic objects ei described by p
    variables Y1, ..., Yp
  • Example: nine symbolic objects, one for each
    interval of the number of rings

5
TABLE OF BOOLEAN SYMBOLIC OBJECTS
6
COMPUTATION OF DISSIMILARITIES BETWEEN SYMBOLIC
OBJECTS
  • Dissimilarity matrix

7
The MID property
The degree of dissimilarity between abalones
computed on the independent attributes should be
proportional to the dissimilarity in the
dependent attribute (i.e., the difference in the
number of rings). This property is called
monotonic increasing dissimilarity (MID).
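One way to test the MID property on a computed dissimilarity matrix is sketched below. It assumes the symbolic objects are stored in increasing order of the dependent attribute (here, the ring interval) and reads "monotonic increasing" as "never decreasing"; the function name `satisfies_mid` and the toy matrices are illustrative and not part of the ASSO software.

```python
def satisfies_mid(D):
    """Return True if, for every object i, D[i][j] never decreases as j moves
    further past i (objects assumed sorted by the dependent attribute)."""
    n = len(D)
    for i in range(n):
        later = [D[i][j] for j in range(i + 1, n)]
        if any(x > y for x, y in zip(later, later[1:])):
            return False
    return True


# Toy matrices over four ordered objects (values are made up for illustration).
good = [[abs(i - j) for j in range(4)] for i in range(4)]
bad = [row[:] for row in good]
bad[0][3] = bad[3][0] = 0.5  # object 0 now looks closer to object 3 than to object 1

print(satisfies_mid(good), satisfies_mid(bad))
```

A measure that passes this check on the abalone objects orders them consistently with the number of rings; a failure points at the pair of objects where the ordering breaks.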
9
BOOLEAN SYMBOLIC OBJECTS (BSOs)
  • A BSO is a conjunction of boolean elementary
    events
  • [Y1 = A1] ∧ [Y2 = A2] ∧ ... ∧ [Yp = Ap]
  • where each variable Yi takes values in its domain 𝒴i
    and Ai is a subset of 𝒴i
  • Let a and b be two BSOs
  • a = [Y1 = A1] ∧ [Y2 = A2] ∧ ... ∧ [Yp = Ap]
  • b = [Y1 = B1] ∧ [Y2 = B2] ∧ ... ∧ [Yp = Bp]
  • where each variable Yj takes values in 𝒴j and Aj
    and Bj are subsets of 𝒴j. We are interested in
    computing the dissimilarity d(a,b).

10
CONSTRAINED BSOs
  • Two types of dependencies between variables
  • Hierarchical dependence (mother-daughter): a
    variable Yi may be inapplicable if another
    variable Yj takes its values in a subset Sj ⊂ 𝒴j.
    This dependence is expressed as a rule
  • if [Yj = Sj] then [Yi = NA]
  • Logical dependence: this case occurs if a
    subset Sj ⊂ 𝒴j of a variable Yj is related to a
    subset Si ⊂ 𝒴i of a variable Yi by a rule such as
  • if [Yj = Sj] then [Yi = Si]

11
DISSIMILARITY AND SIMILARITY MEASURES
  • Dissimilarity Measure
  • d: E×E → R such that d*_a = d(a,a) ≤ d(a,b) = d(b,a)
    < ∞ ∀ a,b ∈ E
  • Similarity Measure
  • s: E×E → R such that s*_a = s(a,a) ≥ s(a,b) =
    s(b,a) ≥ 0 ∀ a,b ∈ E
  • Generally
  • ∀ a ∈ E: d*_a = d* and s*_a = s*, and specifically,
    d* = 0 while s* = 1
  • Dissimilarity measures can be transformed into
    similarity measures (and vice versa)
  • d = φ(s) (s = φ⁻¹(d))
  • where
  • φ(s) is a strictly decreasing function, and φ(1) = 0,
    φ(0) = ∞
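As an illustration of such a transformation, the sketch below uses φ(s) = ln(1/s), which is strictly decreasing on (0, 1] with φ(1) = 0 and φ(0) = ∞. This particular φ is only one admissible choice picked for the example; the slides do not prescribe it, and the function names are hypothetical.

```python
import math


def sim_to_dissim(s: float) -> float:
    """phi(s) = ln(1/s): strictly decreasing, phi(1) = 0, phi(0) = +inf."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return math.inf if s == 0.0 else math.log(1.0 / s)


def dissim_to_sim(d: float) -> float:
    """Inverse map phi^{-1}(d) = exp(-d)."""
    if d < 0.0:
        raise ValueError("dissimilarity must be non-negative")
    return math.exp(-d)


print(sim_to_dissim(1.0), sim_to_dissim(0.0))
```

The simpler map d = 1 − s, used later for De Carvalho's measures, is also strictly decreasing with φ(1) = 0, but it gives φ(0) = 1 rather than ∞.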

12
DISSIMILARITY AND SIMILARITY MEASURES: PROPERTIES
Some properties that a dissimilarity measure d on
E may satisfy are:
1. d(a, b) = 0 ⇒ ∀ c ∈ E: d(a, c) = d(b, c) (evenness)
2. d(a, b) = 0 ⇒ a = b (definiteness)
3. d(a, b) ≤ d(a, c) + d(c, b) (triangle inequality)
4. d(a, b) ≤ max(d(a, c), d(c, b)) (ultrametric inequality)
5. d(a, b) + d(c, d) ≤ max(d(a, c) + d(b, d), d(a, d) + d(b, c)) (Buneman's inequality)
6. Let (E, +) be a group, then d(a, b) = d(a + c, b + c) (translation invariance)
  • A dissimilarity function that satisfies
    properties 2 and 3 is called a metric.
  • A dissimilarity function that satisfies only
    property 3 is called a pseudo-metric or semi-
    distance.
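On a finite set of objects, properties 2, 3 and 4 can be checked by brute force on the dissimilarity matrix. The sketch below assumes a square matrix `D` (list of lists) whose rows correspond to pairwise distinct objects; the function names are illustrative.

```python
def is_definite(D, tol=1e-12):
    """Property 2 on a matrix of distinct objects: zero only on the diagonal."""
    n = range(len(D))
    return all(D[a][b] > tol for a in n for b in n if a != b)


def satisfies_triangle(D, tol=1e-12):
    """Property 3: d(a,b) <= d(a,c) + d(c,b) for every triple."""
    n = range(len(D))
    return all(D[a][b] <= D[a][c] + D[c][b] + tol for a in n for b in n for c in n)


def satisfies_ultrametric(D, tol=1e-12):
    """Property 4: d(a,b) <= max(d(a,c), d(c,b)) for every triple."""
    n = range(len(D))
    return all(D[a][b] <= max(D[a][c], D[c][b]) + tol for a in n for b in n for c in n)


points = [0, 1, 3]
absolute = [[abs(x - y) for y in points] for x in points]
squared = [[(x - y) ** 2 for y in points] for x in points]

print(satisfies_triangle(absolute), satisfies_ultrametric(absolute))
print(satisfies_triangle(squared))
```

The absolute difference is a metric but not an ultrametric, while the squared difference already violates the triangle inequality (9 > 1 + 4), which is the kind of failure reported later for some description-potential measures.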

13
DISSIMILARITY MEASURES BETWEEN BSOs
  • Author(s) (Year) → Notation from the SODAS
    Package
  • Gowda & Diday (1991) → U_1
  • Ichino & Yaguchi (1994) → U_2, U_3, U_4
  • De Carvalho (1994) → SO_1, SO_2
  • De Carvalho (1996, 1998) → SO_3, SO_4, SO_5, C_1
  • U: only for unconstrained BSOs
  • C: only for constrained BSOs
  • SO: for both constrained and unconstrained BSOs

14
GOWDA & DIDAY'S DISSIMILARITY MEASURE
  • Gowda & Diday's dissimilarity measure for two
    BSOs a and b
  • U_1: D(a, b) = Σ_{j=1..p} D(Aj, Bj)
  • If Yj is a continuous variable
  • D(Aj, Bj) = Dπ(Aj, Bj) + Ds(Aj, Bj) + Dc(Aj, Bj)
  • while if Yj is a nominal variable
  • D(Aj, Bj) = Ds(Aj, Bj) + Dc(Aj, Bj)
  • where the components are defined so that their
    values are normalized between 0 and 1
  • Dπ(Aj, Bj) due to position,
  • Ds(Aj, Bj) due to span,
  • Dc(Aj, Bj) due to content
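A minimal sketch of this decomposition follows. The slide does not give the component formulas, so the definitions used here (position normalized by the domain length, span and content normalized by the length or cardinality of the span of Aj and Bj) are an assumption based on the form in which Gowda and Diday's measure is commonly presented. A BSO is represented as a dict mapping a variable name to either a `(low, high)` tuple or a set; all names are hypothetical.

```python
def gd_interval(a, b, domain):
    """Position + span + content components for two intervals (assumed forms)."""
    (al, au), (bl, bu), (lo, hi) = a, b, domain
    la, lb = au - al, bu - bl                       # interval lengths
    span = max(au, bu) - min(al, bl)                # length of the joint span
    inters = max(0.0, min(au, bu) - max(al, bl))    # length of the overlap
    d_pos = abs(al - bl) / (hi - lo)
    if span == 0:                                   # both intervals are the same point
        return d_pos
    d_span = abs(la - lb) / span
    d_content = (la + lb - 2 * inters) / span
    return d_pos + d_span + d_content


def gd_nominal(a, b):
    """Span + content components for two value sets (assumed forms)."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 0.0
    d_span = abs(len(a) - len(b)) / union
    d_content = (len(a) + len(b) - 2 * len(a & b)) / union
    return d_span + d_content


def gowda_diday(a, b, domains):
    """Sum the per-variable dissimilarities D(Aj, Bj)."""
    total = 0.0
    for var, A in a.items():
        B = b[var]
        total += gd_interval(A, B, domains[var]) if isinstance(A, tuple) else gd_nominal(A, B)
    return total


a = {"colour": {"red", "black"}, "height": (150, 160)}
b = {"colour": {"red"}, "height": (155, 170)}
print(gowda_diday(a, b, domains={"height": (100, 200)}))
```

For this pair the nominal variable contributes 0.5 + 0.5 and the interval variable contributes 0.05 + 0.25 + 0.75, so each component stays within [0, 1] as the slide requires.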

15
GOWDA & DIDAY'S DISSIMILARITY MEASURE
  • Properties
  • D(a, b) = 0 ⇒ a = b (definiteness property),
  • No proof is reported for the triangle inequality
    property

16
ICHINO & YAGUCHI'S DISSIMILARITY MEASURES
  • Ichino & Yaguchi's dissimilarity measures are
    based on the Cartesian operators join ⊕ and meet
    ⊗.
  • For continuous variables
  • Aj ⊕ Bj = [min(AjL, BjL), max(AjU, BjU)], where AjL
    and AjU are the lower and upper bounds of Aj
  • Aj ⊗ Bj = Aj ∩ Bj
  • while for nominal variables
  • Aj ⊕ Bj = Aj ∪ Bj
  • Aj ⊗ Bj = Aj ∩ Bj
  • Given a pair of subsets (Aj, Bj) of 𝒴j, the
    componentwise dissimilarity φ(Aj, Bj) is
  • φ(Aj, Bj) = |Aj ⊕ Bj| − |Aj ⊗ Bj| + γ (2|Aj ⊗
    Bj| − |Aj| − |Bj|)
  • where 0 ≤ γ ≤ 0.5 and |Aj| is defined depending on
    variable types.

17
ICHINO & YAGUCHI'S DISSIMILARITY MEASURES
  • The φ(Aj, Bj) are aggregated by an aggregation
    function such as the generalised Minkowski
    distance of order q
  • U_2
  • Drawback: dependence on the chosen units of
    measurement.
  • Solution: normalization of the componentwise
    dissimilarity
  • U_3
  • The weighted formulation guarantees that
    dq(a,b) ∈ [0,1].
  • U_4

The above measures are metrics.
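The componentwise measure and its aggregation can be sketched as follows. The formula for φ is the one stated in the transcript; the exact formulas of U_2, U_3 and U_4 are not in it, so the aggregation below (Minkowski sum of order q, optional division by the domain size, optional weights) is an assumed reading of the three bullet points. Intervals are `(low, high)` tuples, nominal values are sets, and all function and parameter names are hypothetical.

```python
def size(v):
    """|V|: interval length for a (low, high) tuple, cardinality for a set."""
    return v[1] - v[0] if isinstance(v, tuple) else len(v)


def join(a, b):
    """Cartesian join: covering interval, or set union."""
    if isinstance(a, tuple):
        return (min(a[0], b[0]), max(a[1], b[1]))
    return set(a) | set(b)


def meet_size(a, b):
    """Size of the Cartesian meet (the intersection)."""
    if isinstance(a, tuple):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return len(set(a) & set(b))


def phi(a, b, gamma=0.5):
    """phi = |A join B| - |A meet B| + gamma * (2|A meet B| - |A| - |B|)."""
    m = meet_size(a, b)
    return size(join(a, b)) - m + gamma * (2 * m - size(a) - size(b))


def ichino_yaguchi(a, b, q=2, gamma=0.5, domain_sizes=None, weights=None):
    """Minkowski aggregation of order q; normalization and weights are optional."""
    total = 0.0
    for var in a:
        comp = phi(a[var], b[var], gamma)
        if domain_sizes is not None:       # normalized variant (assumed form of U_3)
            comp /= domain_sizes[var]
        w = 1.0 if weights is None else weights[var]   # weighted variant (assumed form of U_4)
        total += w * comp ** q
    return total ** (1.0 / q)


a = {"colour": {"red", "black"}, "height": (150, 160)}
b = {"colour": {"red"}, "height": (155, 170)}
print(ichino_yaguchi(a, b, q=1))
```

With γ = 0.5 the two components are 0.5 (colour) and 7.5 (height), which shows the drawback named on the slide: without normalization the variable measured in larger units dominates the result.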
18
DE CARVALHO'S DISSIMILARITY MEASURES
  • A straightforward extension of similarity
    measures for classical data matrices with nominal
    variables.
  • where µ(Vj) is either the cardinality of the set
    Vj (if Yj is a nominal variable) or the length of
    the interval Vj (if Yj is a continuous variable).

19
DE CARVALHO'S DISSIMILARITY MEASURES
  • Five different similarity measures si, i = 1,
    ..., 5, are defined
  • The corresponding dissimilarities are di = 1 −
    si.
  • The di are aggregated by an aggregation function
    AF such as the generalised Minkowski metric, thus
    obtaining
  • SO_1

20
DE CARVALHO'S EXTENSION OF ICHINO & YAGUCHI'S
DISSIMILARITY MEASURE
  • A different componentwise dissimilarity measure
  • where φ is defined as in Ichino & Yaguchi's
    dissimilarity measure.
  • The aggregation function AF suggested by De
    Carvalho is
  • SO_2

This measure is a metric.
21
THE DESCRIPTION-POTENTIAL APPROACH
  • All dissimilarity measures considered so far are
    defined by two functions: a comparison function
    (componentwise measure) and an aggregation
    function.
  • A different approach is based on the concept of
    description potential π(a) of a symbolic object
    a.
  • where µ(Vj) is either the cardinality of the set
    Vj (if Yj is a nominal variable) or the length of
    the interval Vj (if Yj is a continuous variable).
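The formula for the description potential did not survive in the transcript. The sketch below assumes the form in which it is usually given, π(a) = µ(A1) · µ(A2) · ... · µ(Ap), that is, the product of the per-variable sizes; treat this as an assumption rather than a quotation of the slide. Intervals are `(low, high)` tuples and nominal values are sets; the function names are illustrative.

```python
from math import prod


def mu(v):
    """Length of an interval given as (low, high), or cardinality of a set."""
    return v[1] - v[0] if isinstance(v, tuple) else len(v)


def description_potential(a):
    """Assumed form: product of mu(Aj) over all variables of the BSO."""
    return prod(mu(v) for v in a.values())


a = {"color": {"black", "white"}, "height": (170, 200)}
print(description_potential(a))
```

Under this reading the potential behaves like the volume of the region of the description space covered by the object, so a measure built on it compares whole objects rather than aggregating variable by variable.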

22
THE DESCRIPTION-POTENTIAL APPROACH
  • SO_3
  • SO_4
  • SO_5
  • The triangular inequality does not hold for SO_3
    and SO_4, which are equivalent. SO_5 is a metric.

23
DESCRIPTION POTENTIAL FOR CONSTRAINED BSOs
  • Given a BSO a and a logical dependence expressed
    by the rule
  • if [Yj = Sj] then [Yi = Si]
  • the incoherent restriction a' of a is defined as
  • a' = [Y1 = A1] ∧ ... ∧ [Yj-1 = Aj-1] ∧ [Yj = Aj ∩ Sj] ∧
    ... ∧ [Yi-1 = Ai-1] ∧ [Yi = Ai ∩ (𝒴i \ Si)] ∧ ... ∧
    [Yp = Ap]
  • Then the description potential of a is
  • A similar extension exists for hierarchical
    dependencies.

24
DISSIMILARITY MEASURES FOR CONSTRAINED BSOs
  • The extended definition of description potential
    can be applied to the computation of the
    distances SO_3, SO_4 and SO_5.
  • De Carvalho proposed an extension of the
    componentwise dissimilarity measure, so that
    SO_2 can also be applied to constrained BSOs.
  • He also proposed an extension of α, β, γ, and δ
    in order to take constraints into account.
    Therefore, SO_1 can also be applied to
    constrained BSOs.
  • Finally, C_1 is defined as follows
  • where
  • If all BSOs are coherent, then the dissimilarity
    measures do not change.

25
DISSIMILARITY MEASURES FOR CONSTRAINED BSOs
  • The extended definition of description potential
    can be applied to the computation of the
    distances SO_3, SO_4 and SO_5.
  • De Carvalho proposed an extension of the
    componentwise dissimilarity measure, so that
    SO_2 can also be applied to constrained BSOs
  • where

26
DISSIMILARITY MEASURES FOR CONSTRAINED BSOs
where [Y1 = A1] ∧ ... ∧ [Yj-1 = Aj-1] ∧ [Yj = Aj ∩ Bj] ∧ ... ∧ [Yp = Ap]
and [Y1 = B1] ∧ ... ∧ [Yj-1 = Bj-1] ∧ [Yj = Aj ∩ Bj] ∧ ... ∧ [Yp = Bp]
27
DISSIMILARITY MEASURES FOR CONSTRAINED BSOs
where [Y1 = A1] ∧ ... ∧ [Yj-1 = Aj-1] ∧ [Yj = Aj ∩ c(Bj)] ∧ ... ∧ [Yp = Ap]
28
DISSIMILARITY MEASURES FOR CONSTRAINED BSOs
where [Y1 = B1] ∧ ... ∧ [Yj-1 = Bj-1] ∧ [Yj = c(Aj) ∩ Bj] ∧ ... ∧ [Yp = Bp]
29
DISSIMILARITY MEASURES FOR CONSTRAINED BSOs
  • De Carvalho proposed an extension of α, β, γ in
    order to take constraints into account

30
DISSIMILARITY MEASURES FOR CONSTRAINED BSOs
  • The previous extension of α, β, γ, which takes
    constraints into account, can be used in
    SO_1.
  • Finally, C_1 is defined as follows
  • where
  • If all BSOs are coherent, then the dissimilarity
    measures do not change.

31
MATCHING
  • Matching is the process of comparing two or more
    structures to discover their similarities or
    differences.
  • Similarity judgements in the matching process
    are directional: they have a
  • referent, a, a prototype or the description of a
    class of objects
  • subject, b, a variant of the prototype or an
    instance of a class of objects.
  • Matching two structures is a problem common to
    many domains, like symbolic classification,
    pattern recognition, data mining and expert
    systems.

32
MATCHING BSOs
  • Generally, a BSO represents a class description
    and plays the role of the referent in the
    matching process.
  • a = [color = {black, white}] ∧ [height = [170,
    200]]
  • describes a set of individuals either black or
    white, whose height is in the interval [170, 200].
    Such a set of individuals is called the extension
    of the BSO. The extension is a subset of the
    universe Ω of individuals.
  • Given two BSOs a and b, the matching operators
    determine whether b is the description of an
    individual in the extension of a.
  • In the ASSO software two matching operators for
    BSOs have been defined.

33
CANONICAL MATCHING OPERATOR
  • The result of the canonical matching operator is
    either 0 (false) or 1 (true).
  • If E denotes the space of BSOs described by a
    set of p variables Yi taking values in the
    corresponding domains Yi, then the matching
    operator is a function
  • Match: E × E → {0, 1}
  • such that for any two BSOs a, b ∈ E
  • a = [Y1 = A1] ∧ [Y2 = A2] ∧ ... ∧ [Yp = Ap]
  • b = [Y1 = B1] ∧ [Y2 = B2] ∧ ... ∧ [Yp = Bp]
  • it happens that
  • Match(a,b) = 1 if Bi ⊆ Ai for each i = 1, 2, ..., p,
  • Match(a,b) = 0 otherwise.

34
CANONICAL MATCHING OPERATOR
  • Examples
  • District1 = [profession = {farmer, driver}] ∧
    [age = [24, 34]]
  • Indiv1 = [profession = farmer] ∧ [age = 28]
  • Indiv2 = [profession = salesman] ∧ [age = [27, 28]]
  • Match(District1, Indiv1) = 1
  • Match(District1, Indiv2) = 0
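The canonical operator and the examples above can be reproduced with a short sketch. The data representation is an assumption made for the example: a BSO is a dict, nominal values are sets, intervals are `(low, high)` tuples, and a single number such as age 28 is written as the degenerate interval `(28, 28)`. The function names are hypothetical and not those of the ASSO software.

```python
def contained(subject_value, referent_value):
    """True if the subject's value set Bi is included in the referent's Ai."""
    if isinstance(referent_value, tuple):            # interval-valued variable
        return referent_value[0] <= subject_value[0] and subject_value[1] <= referent_value[1]
    return set(subject_value) <= set(referent_value)  # nominal variable


def match(referent, subject):
    """Canonical matching: 1 if Bi is a subset of Ai for every variable, else 0."""
    return int(all(contained(subject[var], referent[var]) for var in referent))


district1 = {"profession": {"farmer", "driver"}, "age": (24, 34)}
indiv1 = {"profession": {"farmer"}, "age": (28, 28)}
indiv2 = {"profession": {"salesman"}, "age": (27, 28)}

print(match(district1, indiv1), match(district1, indiv2))
print(match(indiv1, district1))  # roles swapped: the operator is directional
```

Swapping referent and subject changes the result for the first pair, which illustrates why the operator cannot be symmetric.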

35
CANONICAL MATCHING OPERATOR
  • The canonical matching function satisfies two
    out of three properties of a similarity measure
  • ∀ a, b ∈ E: Match(a, b) ≥ 0
  • ∀ a, b ∈ E: Match(a, a) ≥ Match(a, b)
  • while it does not satisfy the commutativity or
    symmetry property
  • ∀ a, b ∈ E: Match(a, b) = Match(b, a)
  • because of the different role played by a and b.

36
FLEXIBLE MATCHING OPERATOR
  • The requirement Bi ⊆ Ai for each i = 1, 2, ..., p
    might be too strict for real-world problems,
    because of the presence of noise in the
    description of the individuals of the universe.
  • Example
  • District1 = [profession = {farmer, driver}] ∧
    [age = [24, 34]]
  • Indiv3 = [profession = farmer] ∧ [age = 23]
  • Match(District1, Indiv3) = 0
  • It is necessary to rely on a flexible definition
    of the matching operator, which returns a number
    in [0,1] corresponding to the degree of match
    between two BSOs, that is
  • flexible-matching: E × E → [0,1]

37
FLEXIBLE MATCHING OPERATOR
  • For any two BSOs a and b,
  • i) flexible-matching(a,b) = 1 if Match(a,b) = true,
  • ii) flexible-matching(a,b) ∈ [0,1) otherwise.
  • The result of the flexible matching can be
    interpreted as the probability of a matching b
    provided that a change is made in b.
  • Let Ea = {b' ∈ E | Match(a,b') = 1} and P(b | b') be
    the conditional probability of observing b given
    that the original observation was b'. Then
  • flexible-matching(a,b) = max over b' ∈ Ea of P(b | b'),
  • that is, flexible-matching(a,b) equals the maximum
    conditional probability over the space of BSOs
    canonically matched by a.

38
FLEXIBLE MATCHING: AN APPLICATION
  • Credit card applications (Quinlan)
  • Fifteen variables whose names and values have
    been changed to meaningless symbols to protect
    the confidentiality of the data.
  • Class variable: positive in case of approval of
    credit facilities, negative otherwise.
  • Training set: 490 cases
  • 6 rules generated by Quinlan's system C4.5

39
FLEXIBLE MATCHING: AN APPLICATION
  • Such rules can be easily represented by means of
    Boolean symbolic objects.
  • Both matching operators can be considered in
    order to test the validity of the induced rules.

40
A new dissimilarity measure
  • Flexible matching is asymmetric. However, it is
    possible to symmetrize it → new dissimilarity
    measure SO_6
  • It is computed as
  • d(a,b) = 1 − (flexible_matching(a,b) +
    flexible_matching(b,a))/2
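The symmetrization can be sketched independently of how flexible matching itself is computed. In the code below, `so6` takes any flexible-matching function as a parameter; `toy_flexible_matching` is a made-up stand-in for nominal variables only (the share of the subject's values allowed by the referent, multiplied over variables) and is not the ASSO definition.

```python
def so6(a, b, flexible_matching):
    """d(a,b) = 1 - (fm(a,b) + fm(b,a)) / 2 for a given flexible-matching function."""
    return 1.0 - (flexible_matching(a, b) + flexible_matching(b, a)) / 2.0


def toy_flexible_matching(referent, subject):
    """Hypothetical stand-in: equals 1 exactly when the canonical match holds."""
    score = 1.0
    for var, allowed in referent.items():
        observed = subject[var]              # assumed to be a non-empty set
        score *= len(allowed & observed) / len(observed)
    return score


a = {"profession": {"farmer", "driver"}}
b = {"profession": {"farmer"}}

print(toy_flexible_matching(a, b), toy_flexible_matching(b, a))
print(so6(a, b, toy_flexible_matching))
```

Averaging the two directions makes d(a,b) = d(b,a) by construction, and d(a,a) = 0 follows from property i) of flexible matching.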

42
PROBABILISTIC SYMBOLIC OBJECTS (PSOs)
  • Probabilistic symbolic objects (PSOs) involve
    modal (probabilistic) variables.
  • Each cell represents the set of weighted values
    that the variable can take for a symbolic object,
    where a probabilistic weighting system is
    adopted.
  • For PSOs, it is not possible to use the
    dissimilarity measures for BSOs, because they do
    not take the probabilities into account, which
    causes a notable loss of information.
  • Therefore, new dissimilarity measures for PSOs are
    needed.

43
Defining dissimilarity measures for probabilistic
symbolic objects
  • Steps
  • Define coefficients measuring the divergence
    between two probability distributions
  • Kullback-Leibler divergence
  • Chi-square divergence
  • Hellinger
  • K-divergence
  • Variation distance
  • From them, two dissimilarity measures, namely
    Rényi's and Chernoff's coefficients, are
    obtained

44
Defining dissimilarity measures for probabilistic
symbolic objects
  • Steps
  • Symmetrize the non-symmetric coefficients
  • m_sym(P,Q) = m(Q,P) + m(P,Q)
  • Aggregate the contribution of all variables to
    compute the dissimilarity between two symbolic
    objects
  • PSO dissimilarity measures
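The three steps can be sketched for discrete distributions stored as dicts from value to probability. The Kullback-Leibler divergence and the variation distance are written in their standard forms; the aggregation by a plain sum over variables is an assumption, since the transcript does not say which aggregation function is used. All names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence sum_x p(x) ln(p(x)/q(x)); not symmetric."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total


def variation_distance(p, q):
    """Variation distance sum_x |p(x) - q(x)|; already symmetric."""
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in set(p) | set(q))


def symmetrized(m):
    """Turn a coefficient m into m(Q,P) + m(P,Q)."""
    return lambda p, q: m(q, p) + m(p, q)


def pso_dissimilarity(a, b, coefficient):
    """Aggregate over variables; a plain sum is an assumption of this sketch."""
    return sum(coefficient(a[var], b[var]) for var in a)


p = {"red": 0.5, "black": 0.5}
q = {"red": 0.9, "black": 0.1}
j = symmetrized(kl_divergence)

print(kl_divergence(p, q), kl_divergence(q, p))
print(j(p, q), j(q, p))
print(pso_dissimilarity({"colour": p}, {"colour": q}, j))
```

The first line prints two different values, which is why the symmetrization step is needed before the coefficient can serve as a dissimilarity.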

45
Mixture SO
  • Some SOs can be described by both non-modal and
    modal variables
  • They are neither BSOs nor PSOs
  • What dissimilarity measure, then?
  • In ASSO it has been proposed to combine the
    results of two dissimilarity measures, one for
    modal and the other for non-modal variables.
  • Combination can be either additive or
    multiplicative.
  • This possibility should be used with great
    care!

48
METHOD DISS
  • Dissimilarity measures between both BSOs and
    PSOs.
  • Input: ASSO file of SOs
  • Output for dissimilarities: report, ASSO file
    with dissimilarity matrix
  • Developer: Dipartimento di Informatica,
    University of Bari, Italy.

[Figure: DI method and report file]
49
TWO USE CASE DIAGRAMS
50
PARAMETER SETUP
  • The user can select a subset of variables Yi on
    which the dissimilarity measure or the matching
    operator has to be computed.

51
PARAMETER SETUP
  • The user can select a number of parameters.

[Figure: parameter dialog; labels: Dissimilarity measure, combine, Name of the new ASSO file]
52
OUTPUT SODAS FILE
  • The output ASSO file contains both the same input
    data and an additional dissimilarity matrix. The
    dissimilarity between the i-th and the j-th BSO
    is written in the cell (entry) (i, j) of the
    matrix.
  • Only the lower part of the dissimilarity matrix
    is reported in the file, since dissimilarities
    are symmetric.
  • abalone output file
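The layout described above can be sketched as follows: because d(i, j) = d(j, i), only the cells with j ≤ i are computed and written. Whether the diagonal is stored in the actual ASSO file is not stated in the transcript; this sketch includes it. The function names and the toy ring values are illustrative.

```python
def lower_triangle(objects, dissim):
    """Row i holds d(objects[i], objects[j]) for j = 0..i (diagonal included)."""
    return [[dissim(objects[i], objects[j]) for j in range(i + 1)]
            for i in range(len(objects))]


def format_lower_triangle(rows, digits=2):
    """One text line per row, values separated by spaces."""
    return "\n".join(" ".join(f"{v:.{digits}f}" for v in row) for row in rows)


rings = [3, 6, 10]  # toy stand-ins for symbolic objects
rows = lower_triangle(rings, lambda x, y: abs(x - y))
print(format_lower_triangle(rows))
```

Storing the lower part only halves the file size and is lossless exactly when the chosen measure is symmetric, which rules out writing raw flexible-matching scores this way.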

53
OUTPUT REPORT FILE
  • The report file is organized as follows
  • Output report file

54
Output
  • Visualization of the dissimilarity table

55
Output
  • Visualization of a line graph of dissimilarities

Each line represents the dissimilarity between a
given SO and the subsequent SOs in the file
The number of lines in each graph is equal to the
number of SOs minus one
56
Output
  • Visualization of a scatterplot of Sammon's
    nonlinear mapping into a bidimensional space