Finding Syntactic Similarities Between XML Documents

1
Finding Syntactic Similarities Between XML Documents

Davood Rafiei, University of Alberta
Joint work with
Daniel Moise, University of Alberta, and
Dabo Sun, University of Alberta

2
Motivations

Ranked retrieval, e.g. for the query book[author="Abiteboul" and year=2000]
DTD extraction
  useful for query processing
Clustering
  for efficient storage and indexing
  for efficient retrieval (similar documents are expected to match the same queries more often)

3
Problem Statement

How can we measure the similarity (or distance) between XML documents?
Desired properties
  The distance must be a metric
  Documents generated by the same DTD are expected to be closer to each other
  Documents with more tags in common are expected to be more similar
Interested in syntactic similarity only
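
For reference, the requirement that the distance be a metric can be spelled out as the standard metric axioms. The symbol dist and the document names x, y, z are notation chosen here for illustration; they do not appear on the slides.

```latex
\begin{align*}
\mathrm{dist}(x, y) &\ge 0, \\
\mathrm{dist}(x, y) &= 0 \iff x = y, \\
\mathrm{dist}(x, y) &= \mathrm{dist}(y, x), \\
\mathrm{dist}(x, z) &\le \mathrm{dist}(x, y) + \mathrm{dist}(y, z).
\end{align*}
```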

4
Examples

Similar documents
  <book><author>Abiteboul</author><year>2000</year></book>
  <book><author>John</author><year>1994</year></book>
  <book><author>George</author><title>Animal Farm</title></book>
Non-similar documents
  <book><author>Abiteboul</author><year>2000</year></book>
  <????><???????>Abiteboul</???????><???>2000</???></????>
  <KSIAZKA><autor>Abiteboul</autor><rok>2000</rok></KSIAZKA>
  <X><Y>John</Y><Z>20</Z></X>
5
Related Work

Structural similarity
  Edit distance between ordered trees (Nierman and Jagadish [11], Zhang et al. [21, 23], Chawathe et al. 96)
  Edit distance between unordered trees
    NP-complete (Zhang et al. [22])
  Specialized solutions (Flesca et al. [5], Zaki and Aggarwal [20])

6
Related Work (Cont.)

More syntactic similarity
  Based on common parent-child tags (Lian et al. [10]); e.g. these documents count as non-similar:
    <paper><journal><author>A</author><title>T</title><year>2006</year></journal></paper>
    <paper><conference><author>A</author><title>T</title><year>2006</year></conference></paper>
  Use parent-child tags, twigs, content terms, semantic relationships (Theobald et al. [18])
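
To see why the two paper documents on this slide come out as non-similar when only parent-child tag pairs are compared, the pairs can be extracted and intersected. This is a minimal sketch, not the measure defined by Lian et al.; the function name parent_child_pairs and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

def parent_child_pairs(xml_text):
    """Return the set of (parent_tag, child_tag) pairs found in the document."""
    root = ET.fromstring(xml_text)
    pairs = set()
    for parent in root.iter():      # every element, including the root
        for child in parent:        # direct children only
            pairs.add((parent.tag, child.tag))
    return pairs

journal_doc = ("<paper><journal><author>A</author><title>T</title>"
               "<year>2006</year></journal></paper>")
conference_doc = ("<paper><conference><author>A</author><title>T</title>"
                  "<year>2006</year></conference></paper>")

journal_pairs = parent_child_pairs(journal_doc)
conference_pairs = parent_child_pairs(conference_doc)

# Pairs present in both documents (assumption: plain set intersection).
common_pairs = journal_pairs & conference_pairs
print(sorted(journal_pairs))
print(sorted(conference_pairs))
print(common_pairs)
```

The two documents share the tags paper, author, title and year, but every parent-child pair involves either journal or conference, so the intersection is empty.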

7
Structural Sketch
<user> <person><name>John</name></person> </user>
<user> <person> <name>Mary</name> <id>u200</id> </person> </user>
[Figure: sketch tree t and document tree d]
For every path in d there is a path in t, and vice versa, and t is minimal.
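
One way to build a tree with the same root-to-node tag paths as the document is to merge siblings that carry the same tag, recursively. The sketch below does that for the two user elements on this slide. It is an illustration under assumptions: the function name sketch, the nested-dict representation and the wrapper element used for parsing are all hypothetical, and the slides do not confirm that this merged tree is the authors' minimal sketch.

```python
import xml.etree.ElementTree as ET

def sketch(elements):
    """Merge same-tag siblings recursively into a nested dict (tag -> sub-sketch)."""
    grouped = {}
    for el in elements:
        # Collect the children of all siblings that share a tag.
        grouped.setdefault(el.tag, []).extend(list(el))
    return {tag: sketch(children) for tag, children in grouped.items()}

fragment = ("<user><person><name>John</name></person></user>"
            "<user><person><name>Mary</name><id>u200</id></person></user>")

# The wrapper element exists only so the two <user> fragments parse as one document.
wrapper = ET.fromstring("<wrapper>" + fragment + "</wrapper>")
t = sketch(list(wrapper))
print(t)
```

Both user elements collapse into a single user/person branch with the children name and id, so every tag path in the input also exists in the merged tree and the reverse holds as well.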
8
Sketch Similarity
<user> <person><name>John</name></person> </user>
<user> <person> <name>Mary</name> <id>u200</id> </person> </user>
[Figure: sketch tree t and document tree d]

Problems of matching trees
  Sketch tree is not unique

9
Path Sets
Root paths
  user/person/name, user/person/id
Path set
  user/person/name, user/person/id, user/person, person/name, user, person, name, person/id, id
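
The root paths and the path set listed on this slide can be reproduced with a short sketch. It assumes that a root path runs from the root element to a leaf and that the path set contains every contiguous sub-path of a root path, which is what the nine listed paths suggest. The function names root_paths and path_set are illustrative, and only the second user element of the running example is parsed.

```python
import xml.etree.ElementTree as ET

def root_paths(element, prefix=()):
    """Yield the tag path (as a tuple) from the root to every leaf element."""
    path = prefix + (element.tag,)
    children = list(element)
    if not children:
        yield path
    for child in children:
        yield from root_paths(child, path)

def path_set(paths):
    """Return every contiguous sub-path of the given root paths as 'a/b/c' strings."""
    result = set()
    for path in paths:
        for start in range(len(path)):
            for end in range(start + 1, len(path) + 1):
                result.add("/".join(path[start:end]))
    return result

doc = ET.fromstring(
    "<user><person><name>Mary</name><id>u200</id></person></user>")

roots = {"/".join(p) for p in root_paths(doc)}
paths = path_set(root_paths(doc))
print(sorted(roots))
print(sorted(paths))
```

Representing each path as a string makes the path set an ordinary set, so two documents can be compared with standard set operations.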
10
Similar Path Sets