Finding Syntactic Similarities Between XML Documents

Slides: 20
Provided by: csUal

1
Finding Syntactic Similarities Between XML Documents
  • Davood Rafiei, University of Alberta
  • Joint work with
    • Daniel Moise, University of Alberta
    • Dabo Sun, University of Alberta

2
Motivations
  • Ranked retrieval, e.g. the query book[author=Abiteboul and year=2000]
  • DTD extraction
    • useful for query processing
  • Clustering
    • for efficient storage and indexing
    • for efficient retrieval (similar documents are expected to match the same queries more often)

3
Problem Statement
  • How to measure similarity (or distance) between XML documents
  • Desired properties
    • The distance must be a metric
    • Documents generated by the same DTD are expected to have a smaller distance
    • Documents with more tags in common are expected to be more similar
  • Interested in syntactic similarity only
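
The metric requirement above means the distance should satisfy non-negativity, identity, symmetry, and the triangle inequality. The following is a minimal sketch of how those axioms could be checked empirically on a small sample. It uses Jaccard distance over tag sets purely as a stand-in; the helper names `jaccard_distance` and `violates_metric` and the sample tag sets are illustrative assumptions, not the measure defined in the presentation.

```python
from itertools import product


def jaccard_distance(a, b):
    """Jaccard distance between two sets (stand-in measure, an assumption)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def violates_metric(dist, items, tol=1e-12):
    """Return the name of the first metric axiom violated on `items`, or None."""
    for x, y in product(items, repeat=2):
        d = dist(x, y)
        if d < -tol:
            return "non-negativity"
        if (d <= tol) != (x == y):
            return "identity"
        if abs(d - dist(y, x)) > tol:
            return "symmetry"
    for x, y, z in product(items, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:
            return "triangle inequality"
    return None


# Hypothetical tag sets, loosely modelled on the tags used in the examples.
samples = [
    frozenset({"book", "author", "year"}),
    frozenset({"book", "author", "title"}),
    frozenset({"paper", "author", "year"}),
    frozenset({"X", "Y", "Z"}),
]

jaccard_check = violates_metric(jaccard_distance, samples)
# Squared difference on numbers is a classic non-metric, shown for contrast.
squared_check = violates_metric(lambda a, b: (a - b) ** 2, [0, 1, 2])
```

A check like this can only find counterexamples on the sample it is given; it does not prove that a distance is a metric.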

4
Examples
  • Similar documents
    <book><author>Abiteboul</author><year>2000</year></book>
    <book><author>John</author><year>1994</year></book>
    <book><author>George</author><title>Animal Farm</title></book>
  • Non-similar documents
    <book><author>Abiteboul</author><year>2000</year></book>
    <????><???????>Abiteboul</???????><???>2000</???></????>
    <KSIAZKA><autor>Abiteboul</autor><rok>2000</rok></KSIAZKA>
    <X><Y>John</Y><Z>20</Z></X>

5
Related Work
  • Structural similarity
    • Edit distance between ordered trees (Nierman and Jagadish [11], Zhang et al. [21, 23], Chawathe et al. '96)
    • Edit distance between unordered trees: NP-complete (Zhang et al. [22])
    • Specialized solutions (Flesca et al. [5], Zaki and Aggarwal [20])

6
Related Work (Cont.)
  • More syntactic similarity
    • Based on common parent-child tags (Lian et al. [10]); an example of non-similar documents:
      <paper><journal><author>A</author><title>T</title><year>2006</year></journal></paper>
      <paper><conference><author>A</author><title>T</title><year>2006</year></conference></paper>
    • Use parent-child tags, twigs, content terms, semantic relationships (Theobald et al. [18])

7
Structural Sketch
  • Document d:
    <user> <person><name>John</name></person> </user>
    <user> <person> <name>Mary</name> <id>u200</id> </person> </user>
  • Sketch t (tree figure not preserved)
  • For every path in d there is a path in t, and vice versa, and t is minimal.

8
Sketch Similarity
  • Document d:
    <user> <person><name>John</name></person> </user>
    <user> <person> <name>Mary</name> <id>u200</id> </person> </user>
  • Sketch t (tree figure not preserved)
  • Problems of matching trees
    • Sketch tree is not unique

9
Path Sets
  • Root paths: user/person/name, user/person/id
  • Path set: user/person/name, user/person/id, user/person, person/name, person/id, user, person, name, id

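
The path set on this slide can be reproduced with a short sketch. It assumes that a root path is a root-to-leaf sequence of tag names and that the path set contains every contiguous sub-path of a root path, which is how the listed paths read. The function names `root_paths` and `path_set` are hypothetical, and the two `<user>` fragments are parsed separately because together they have no single root element.

```python
import xml.etree.ElementTree as ET


def root_paths(xml_text):
    """Return the distinct root-to-leaf tag paths of one XML fragment as tuples."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        children = list(node)
        if not children:          # a leaf element ends a root path
            paths.add(path)
        for child in children:
            walk(child, path)

    walk(root, ())
    return paths


def path_set(paths):
    """Return every contiguous sub-path of the given root paths as '/'-joined strings."""
    result = set()
    for path in paths:
        for i in range(len(path)):
            for j in range(i + 1, len(path) + 1):
                result.add("/".join(path[i:j]))
    return result


# The two fragments from the slide, parsed one at a time (an assumption).
fragments = [
    "<user><person><name>John</name></person></user>",
    "<user><person><name>Mary</name><id>u200</id></person></user>",
]

roots = set()
for fragment in fragments:
    roots |= root_paths(fragment)

paths = path_set(roots)
```

Here `roots` holds the two root paths and `paths` holds the nine entries of the path set; text content such as "John" or "u200" is ignored because only tag names are compared.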
10
Similar Path Sets
  • Standard set comparisons apply
    • E.g. Cosine, Jaccard, Dice
  • Path set size: at most n·l(l+1)/2
    • for n root paths, each of length l
  • Fast similarity comparison
    • Cost is linear in the size of the path set
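
As a rough illustration of the set comparisons named above, here is a minimal sketch of the set-based forms of the Cosine, Jaccard, and Dice coefficients applied to two path sets. The function names and the two example path sets are assumptions made for illustration; the presentation does not show its implementation. With hash-based sets, each comparison takes time roughly linear in the sizes of the two sets.

```python
import math


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|)"""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def cosine(a, b):
    """|A ∩ B| / sqrt(|A| |B|), the cosine of the two sets' binary vectors."""
    if not a or not b:
        return 1.0 if a == b else 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def max_path_set_size(n, l):
    """Upper bound n * l * (l + 1) / 2 on the path set size for n root paths of length l."""
    return n * l * (l + 1) // 2


# Path sets of two hypothetical documents: one with only user/person/name,
# one that also has user/person/id.
doc_a = {"user", "person", "name", "user/person", "person/name", "user/person/name"}
doc_b = doc_a | {"id", "person/id", "user/person/id"}

scores = {
    "jaccard": jaccard(doc_a, doc_b),
    "dice": dice(doc_a, doc_b),
    "cosine": cosine(doc_a, doc_b),
}
```

The bound from `max_path_set_size(2, 3)` is 12, while `doc_b` has only 9 entries, because sub-paths shared by several root paths are stored once.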

11
Evaluation
  • Effectiveness in clustering documents generated by the same DTD
    • Count the mis-clusterings
  • For result comparison
    • Used the same dataset and settings as some earlier work
    • Also used a larger dataset
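
The slides do not define how mis-clusterings are counted. One plausible reading, used here only as an assumption, is to treat the most frequent DTD in each cluster as that cluster's label and count every document whose DTD differs from it. The function name `count_misclusterings` and the sample data are hypothetical.

```python
from collections import Counter


def count_misclusterings(cluster_ids, dtd_labels):
    """Count documents whose DTD differs from the majority DTD of their cluster.

    This majority-vote definition is an assumption, not taken from the slides.
    """
    by_cluster = {}
    for cluster_id, label in zip(cluster_ids, dtd_labels):
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    return sum(
        sum(counts.values()) - max(counts.values())
        for counts in by_cluster.values()
    )


# Six hypothetical documents: cluster 0 holds two "A" documents and one stray "B".
clusters = [0, 0, 0, 1, 1, 1]
dtds = ["A", "A", "B", "B", "B", "B"]
errors = count_misclusterings(clusters, dtds)
```

Under this definition a perfect clustering scores zero, and splitting one DTD across several pure clusters also scores zero, so a stricter count may be needed if that case matters.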

12
Real Data
  • XML files of ACM SIGMOD Record since March 1999
  • Four DTDs (989 XML files in total)
    • ProceedingsPage: 17 XML files
    • IndexTermsPage: 920 XML files
    • OrdinaryIssuePage: 51 XML files
    • SigmodRecord: 1 XML file

13
Synthetic Data
  • Generated using the IBM XML generator
  • DTDs
    • Set A: the set used by Nierman and Jagadish
    • Set B: set A plus 5 more DTDs
  • Parameters
    • M: max repeat for * or +
    • P: probability of an optional attribute

14
Example Clusters
15
Mis-Clusterings
  • Cosine was used for similarity measurements
  • Also tried Jaccard and Dice coefficients, but the results weren't better.

16
Comparison
Our results
Earlier results
17
Tag Frequency
18
Conclusions
  • Presented a method for clustering documents generated by the same DTD
  • Compared to tree-edit-distance-based methods, our method is
    • more effective (based on our evaluations)
    • and also much more efficient

19
Future Work
  • Detecting documents with similar structures and related tag names, e.g.
    <book><author>Abiteboul</author><year>2000</year></book>
    <KSIAZKA><autor>Abiteboul</autor><rok>2000</rok></KSIAZKA>
  • Possible solutions
    • Allow users to specify relabeling rules
    • Learn relabeling rules from training data