Title: Finding Syntactic Similarities Between XML Documents
1. Finding Syntactic Similarities Between XML Documents
- Davood Rafiei, University of Alberta
- Joint work with
- Daniel Moise, University of Alberta, and
- Dabo Sun, University of Alberta
2. Motivations
- Ranked retrievals, e.g. the query book[author=Abiteboul and year=2000]
- DTD extraction
- useful for query processing
- Clustering
- for efficient storage and indexing
- for efficient retrievals (similar documents are expected to match the same queries more often)
3. Problem Statement
- How to measure similarity (or distance) between XML documents
- Desired properties
- The distance must be a metric
- Documents generated by the same DTD are expected to have a smaller distance
- Documents with more common tags are expected to be more similar
- Interested in syntactic similarity only
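The requirement that the distance be a metric can be spelled out with the standard metric axioms. The symbols below (a distance function d over XML documents x, y, z) are notation introduced here for illustration, not taken from the slides.

```latex
\begin{align*}
  d(x, y) &\ge 0                      && \text{(non-negativity)} \\
  d(x, y) &= 0 \iff x = y             && \text{(identity)} \\
  d(x, y) &= d(y, x)                  && \text{(symmetry)} \\
  d(x, z) &\le d(x, y) + d(y, z)      && \text{(triangle inequality)}
\end{align*}
```

When documents are compared through a derived representation, such as a set of paths, "x = y" is understood as equality of the representations, so two distinct documents with the same representation have distance 0.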
4. Examples
- Similar documents
<book><author>Abiteboul</author><year>2000</year></book>
<book><author>John</author><year>1994</year></book>
<book><author>George</author><title>Animal Farm</title></book>
- Non-similar documents
<book><author>Abiteboul</author><year>2000</year></book>
<????><???????>Abiteboul</???????><???>2000</???></????>
<KSIAZKA><autor>Abiteboul</autor><rok>2000</rok></KSIAZKA>
<X><Y>John</Y><Z>20</Z></X>
5. Related Work
- Structural Similarity
- Edit distance between ordered trees (Nierman and Jagadish [11], Zhang et al. [21, 23], Chawathe et al. 96)
- Edit distance between unordered trees: NP-complete (Zhang et al. [22])
- Specialized solutions (Flesca et al. [5], Zaki and Aggarwal [20])
6. Related Work (Cont.)
- More Syntactic Similarity
- Based on common parent-child tags (Lian et al. [10]); example of non-similar documents:
<paper><journal><author>A</author><title>T</title><year>2006</year></journal></paper>
<paper><conference><author>A</author><title>T</title><year>2006</year></conference></paper>
- Use parent-child tags, twigs, content terms, semantic relationships (Theobald et al. [18])
7. Structural Sketch
Document d:
<user> <person><name>John</name></person> </user>
<user> <person> <name>Mary</name> <id>u200</id> </person> </user>
Sketch t: [figure not preserved]
For every path in d, there is a path in t and vice versa, and t is minimal.
8. Sketch Similarity
- Problems of matching trees
- Sketch tree is not unique
9. Path Sets
Root paths: user/person/name, user/person/id
Path set: user/person/name, user/person/id, user/person, person/name, person/id, user, person, name, id
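The construction on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `root_paths` and `path_set` are made up here, root paths are assumed to be root-to-leaf tag paths, and the path set is assumed to contain every contiguous sub-path of a root path, which is what the listed example suggests.

```python
import xml.etree.ElementTree as ET


def root_paths(xml_text):
    """Return the set of tag paths from the root element to each leaf element."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + [node.tag]
        children = list(node)
        if not children:
            # Leaf element: record the full path from the root.
            paths.add("/".join(path))
        for child in children:
            walk(child, path)

    walk(root, [])
    return paths


def path_set(paths):
    """Return every contiguous sub-path of the given root paths (assumed definition)."""
    result = set()
    for p in paths:
        tags = p.split("/")
        for i in range(len(tags)):
            for j in range(i + 1, len(tags) + 1):
                result.add("/".join(tags[i:j]))
    return result


# One of the <user> elements from the structural sketch example.
doc = "<user><person><name>Mary</name><id>u200</id></person></user>"
roots = root_paths(doc)
print(sorted(roots))
print(sorted(path_set(roots)))
```

Storing paths in a set means that sub-paths shared by several root paths (here `user`, `person`, `user/person`) are counted only once.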
10. Similar Path Sets
- Standard set comparisons apply
- E.g. Cosine, Jaccard, Dice
- Path set size: up to nl(l+1)/2
- for n root paths, each of length l
- Fast similarity comparison
- Cost linear in the size of the path set
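The three set comparisons named above can be sketched as follows. This is a hedged illustration that treats each path set as a plain (unweighted) Python set; the slides do not say whether paths are weighted, so the binary form of cosine is an assumption, as are the function names and the convention of returning 1.0 for two empty sets.

```python
import math


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """2|A ∩ B| / (|A| + |B|); two empty sets are treated as identical."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def cosine(a, b):
    """|A ∩ B| / sqrt(|A| |B|), i.e. cosine over binary membership vectors."""
    if not a or not b:
        return 1.0 if a == b else 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Path sets of <book><author/><year/></book> and <book><author/><title/></book>.
p1 = {"book/author", "book/year", "book", "author", "year"}
p2 = {"book/author", "book/title", "book", "author", "title"}

print(jaccard(p1, p2), dice(p1, p2), cosine(p1, p2))
```

With hash-based sets, each comparison takes time proportional to the sizes of the two path sets on average, which is in line with the linear cost stated on the slide.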
11. Evaluation
- Effectiveness in clustering documents generated by the same DTD
- Count the mis-clusterings
- For result comparison
- Used the same dataset and setting as some earlier work
- Also used a larger dataset
12. Real Data
- XML files of ACM SIGMOD Record since March 1999
- Four DTDs (total of 989 XML files)
- ProceedingsPage: 17 XML files
- IndexTermsPage: 920 XML files
- OrdinaryIssuePage: 51 XML files
- SigmodRecord: 1 XML file
13. Synthetic Data
- Generated using the IBM XML Generator
- DTDs
- Set A: the set used by Nierman and Jagadish
- Set B: set A plus 5 more DTDs
- Parameters
- M: max repeat for * or +
- P: probability of an optional attribute
14. Example Clusters
15. Mis-Clusterings
- Cosine was used for similarity measurements
- Also tried Jaccard and Dice coefficients, but the results weren't better.
16. Comparison
[Figure panels: "Our results" and "Earlier results"]
17. Tag Frequency
18. Conclusions
- Presented a method for clustering documents generated by the same DTD
- Compared to tree-edit distance-based methods, our method is
- more effective (based on our evaluations)
- and also much more efficient
19. Future Work
- Detecting documents with similar structures and related tag names, e.g.
<book><author>Abiteboul</author><year>2000</year></book>
<KSIAZKA><autor>Abiteboul</autor><rok>2000</rok></KSIAZKA>
- Possible solutions
- Allow users to specify relabeling rules
- Learn relabeling rules from training data
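One way user-specified relabeling rules might look is sketched below. This is purely illustrative: the slides name the idea but give no design, so the `relabel` function, the dictionary format of the rules, and the choice to apply them to tag paths before comparison are all assumptions made here.

```python
def relabel(paths, rules):
    """Rewrite every tag in every path using the rules; tags without a rule are kept."""
    return {"/".join(rules.get(tag, tag) for tag in p.split("/")) for p in paths}


# Hypothetical rules mapping the Polish tag names from the example to English ones.
rules = {"KSIAZKA": "book", "autor": "author", "rok": "year"}

polish_paths = {"KSIAZKA/autor", "KSIAZKA/rok"}
english_paths = {"book/author", "book/year"}

print(relabel(polish_paths, rules) == english_paths)
```

After relabeling, the two documents share the same root paths, so any set-based similarity over their paths would treat them as identical in structure.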