1
XML Keyword Search

Ziyang Liu

2
Pros and Cons

Pros
Users do not have to know the structure of the
XML document they are working with.
Users do not have to know the grammar of XPath or
XQuery.
Suitable for casual, non-expert users.
Cons
Keywords can be very ambiguous.

3
Keyword Ambiguity

What does the query "Mary Author Title Year" mean?
Find the title and year of publications of which
Mary is an author.
Find the additional authors of publications of
which Mary is an author.
Find the year and author of publications with titles
similar to those of Mary's publications.

4
Current Approach

In keyword search, many nodes in the XML tree may
match the keywords.
Current approaches try to identify semantically related
matches so that irrelevant XML nodes are not returned
to users.

5
Semantically Related
Courses
  Course
    Title: CSE550
    Room: BYAC 260
  Course
    Title: CSE540
    Room: BYAC 240

CSE550 and BYAC 260 are semantically related.
CSE550 and BYAC 240 are not.

6
Current Approach

"Semantically related" is defined in two
different ways.
Interconnected: two nodes n and n' are
interconnected iff the paths from n and n' to
their LCA contain no two nodes with the same label,
or the only nodes sharing a label are n and n' themselves.
Meaningfully related: two nodes n and n' are
meaningfully related iff no descendant of their
LCA has n and n' as its descendants.
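
The interconnected test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from any of the papers: nodes are encoded as hypothetical Dewey ids (tuples of child positions), the LABELS table mirrors the Courses tree from the previous slide, and the function names are made up for this example.

```python
# Hypothetical encoding (not from the slides): each element node is a Dewey id,
# a tuple of child positions from the root, mapped to its element label.
# (0, 0) is the Title holding CSE550, (0, 1) the Room holding BYAC 260,
# (1, 1) the Room holding BYAC 240.
LABELS = {
    (): "Courses",
    (0,): "Course", (0, 0): "Title", (0, 1): "Room",
    (1,): "Course", (1, 0): "Title", (1, 1): "Room",
}


def lca(a, b):
    """Lowest common ancestor = longest common prefix of two Dewey ids."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]


def path_through_lca(a, b):
    """Nodes on the tree path a -> lca(a, b) -> b, endpoints included."""
    top = lca(a, b)
    up = [a[:n] for n in range(len(a), len(top), -1)]
    down = [b[:n] for n in range(len(top) + 1, len(b) + 1)]
    return up + [top] + down


def interconnected(a, b, labels):
    """True iff no two nodes on the path share a label, except a and b."""
    first_seen = {}
    for node in path_through_lca(a, b):
        label = labels[node]
        if label in first_seen and {first_seen[label], node} != {a, b}:
            return False
        first_seen.setdefault(label, node)
    return True


print(interconnected((0, 0), (0, 1), LABELS))  # True: path is Title, Course, Room
print(interconnected((0, 0), (1, 1), LABELS))  # False: the path crosses two Course nodes
```

The second call fails because the path from the first Title to the second Room passes through both Course elements, which is why CSE550 and BYAC 240 are not considered related.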

7
Current Approach

Different papers take different approaches, but
they all try to identify the related nodes in the
XML document and return only those to users,
leaving out irrelevant ones.

8
One of the Current Approaches

Efficient Keyword Search for Smallest LCAs in XML
Databases
Yu Xu, Yannis Papakonstantinou (UCSD)
Proceedings of the 2005 ACM SIGMOD
International Conference on Management
of Data

9
Xu's Paper

Given a set of keywords, they find the keywords'
SLCAs in the XML document and return them to
users.
Definition of SLCA (Smallest Lowest Common
Ancestor): if the subtree rooted at a node contains
all the keywords and none of its descendants' subtrees
does, then the node is an SLCA of the keywords.

10
Example
For the query "John Ben", nodes 0.1.1, 0.1.2, and 0.2.0.0
are the SLCAs.
11
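Xu's Paper

Nodes under an SLCA are likely to be semantically
related, but not necessarily.
Sometimes there are no semantically related nodes
in the XML document that match the keywords.
They develop an Indexed Lookup Eager algorithm to
find all SLCAs in the XML document for the keywords.
The algorithm runs in O(|Smin| · k · d · log |Smax|) time,
where Smin and Smax are the smallest and largest keyword
node lists, k is the number of keywords, and d is the
maximum depth of the tree.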

12
Finding SLCA

Definition of SLCA
The SLCA of a set of keywords is the set of nodes that
(a) contain all the keywords, either in their own
labels or in the labels of their descendant
nodes, and
(b) have no descendant node that also
contains all the keywords.

13
Brute Force Approach

Suppose there are k keywords. In the XML
document, the corresponding node sets are S1, S2,
..., Sk.
For each v1 in S1, v2 in S2, ..., vk in Sk, compute
the lca of v1, v2, ..., vk. This gives lca(S1, S2, ...,
Sk).
Remove ancestor nodes from lca(S1, S2, ..., Sk) so
that no node in it is the ancestor of another.
The remaining set is slca(S1, S2, ..., Sk).
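
The three steps above translate almost directly into code. The following is a minimal sketch, assuming nodes are identified by Dewey ids stored as Python tuples (an encoding chosen here for illustration); the function names and the sample keyword lists are hypothetical.

```python
from itertools import product


def lca(*nodes):
    """Lowest common ancestor of Dewey ids = their longest common prefix."""
    prefix = []
    for parts in zip(*nodes):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def is_ancestor(a, b):
    """True iff a is a proper ancestor of b."""
    return len(a) < len(b) and b[:len(a)] == a


def remove_ancestors(nodes):
    """Keep only the nodes that are not an ancestor of another node in the set."""
    return {n for n in nodes if not any(is_ancestor(n, m) for m in nodes)}


def slca_brute_force(*keyword_lists):
    """Compute the lca of every combination, then drop the ancestors."""
    lcas = {lca(*combo) for combo in product(*keyword_lists)}
    return remove_ancestors(lcas)


# Hypothetical node lists for two keywords.
S1 = [(0, 0, 0), (0, 1, 0)]
S2 = [(0, 0, 1), (0, 2)]
print(slca_brute_force(S1, S2))  # {(0, 0)}
```

Here the combinations yield the LCAs (0, 0) and (0,); the latter is an ancestor of the former and is removed.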

14
Brute Force Approach

Time complexity of the brute force approach:
O(|S1| · |S2| · ... · |Sk|)
The number of LCA computations is the product of the
list sizes, which is extremely inefficient.

15
A Better Approach

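A better approach has been developed based on the
following property:
slca(v, S) = descendant(lca(v, lm(v, S)),
lca(v, rm(v, S)))
Here lm(v, S) and rm(v, S) are the closest matches
in S to the left and to the right of v in document
order, and descendant(a, b) returns whichever of a
and b is the descendant of the other.
From this property, we get
slca(v, S2, ..., Sk) = slca(slca(v, S2, ..., Sk-1), Sk)
and furthermore,
slca(S1, S2, ..., Sk) = removeAncestor(slca(v1, S2, ...,
Sk) for all v1 in S1).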
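
A minimal sketch of how these three identities could be turned into code, assuming Dewey ids stored as tuples (whose tuple ordering coincides with document order) and non-empty keyword lists. The binary searches via Python's bisect module stand in for the index lookups; all function names and sample data are hypothetical.

```python
from bisect import bisect_left, bisect_right


def lca(a, b):
    """Longest common prefix of two Dewey ids."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]


def lm(v, S):
    """Left match: the last node in sorted S at or before v, or None."""
    i = bisect_right(S, v)
    return S[i - 1] if i > 0 else None


def rm(v, S):
    """Right match: the first node in sorted S at or after v, or None."""
    i = bisect_left(S, v)
    return S[i] if i < len(S) else None


def slca_single(v, S):
    """slca(v, S): the deeper of lca(v, lm(v, S)) and lca(v, rm(v, S)).

    Both candidates are ancestors-or-self of v, so the longer id is the descendant.
    """
    candidates = [lca(v, m) for m in (lm(v, S), rm(v, S)) if m is not None]
    return max(candidates, key=len)


def slca(S1, *others):
    """slca(S1, ..., Sk) by folding over S2..Sk for each v1, then removing ancestors."""
    others = [sorted(S) for S in others]
    found = set()
    for v in S1:
        x = v
        for S in others:  # slca(v, S2..Sk) = slca(slca(v, S2..Sk-1), Sk)
            x = slca_single(x, S)
        found.add(x)
    # removeAncestor: drop every node that is a proper ancestor of another.
    return {n for n in found
            if not any(len(n) < len(m) and m[:len(n)] == n for m in found)}


S1 = [(0, 0, 0), (0, 1, 0)]
S2 = [(0, 0, 1), (0, 2)]
S3 = [(0, 0, 2), (0, 1, 1)]
print(slca(S1, S2))      # {(0, 0)}
print(slca(S1, S2, S3))  # {(0, 0)}
```

Each node of S1 costs one pair of binary searches per remaining list, instead of being combined with every node of every other list.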

16
A Better Approach

This approach is much more efficient than the
brute force one, as it processes Sk only after
all Si (i < k) have been processed.
Its time complexity is
O(|Smin| · k · log |Smax| + |Smin|^2):
|Smin| · k · log |Smax| for computing the lcas, and
|Smin|^2 for removing ancestors.
This is still NOT satisfactory performance.

17
An Even Better One

We can detect ancestor nodes along the way, so we do
not have to remove them after
calculating the lca set.
The time complexity then becomes
O(|S1| · k · log |S|), where S1 is the smallest keyword
list and S the largest. This is a great improvement.
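
One way to detect ancestors along the way is sketched below. The slide does not spell out the rule, so the rule used here is an assumption: S1 is scanned in document order and a single pending candidate is kept, relying on the fact that a later candidate that does not come after the pending one must be its ancestor (or the same node). Dewey ids are tuples, and all names and sample lists are hypothetical.

```python
from bisect import bisect_left, bisect_right


def lca(a, b):
    """Longest common prefix of two Dewey ids."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]


def closest_slca(v, S):
    """Deeper of the lcas of v with its left and right neighbours in sorted S."""
    i, j = bisect_right(S, v), bisect_left(S, v)
    matches = ([S[i - 1]] if i > 0 else []) + ([S[j]] if j < len(S) else [])
    return max((lca(v, m) for m in matches), key=len)


def slca_eager(S1, *others):
    """Emit SLCAs in document order without a separate ancestor-removal pass."""
    others = [sorted(S) for S in others]
    result, cur = [], None
    for v in sorted(S1):
        x = v
        for S in others:
            x = closest_slca(x, S)
        if cur is None:
            cur = x
        elif x <= cur:
            continue              # x is cur or an ancestor of cur: discard x
        elif x[:len(cur)] == cur:
            cur = x               # cur is an ancestor of x: discard cur
        else:
            result.append(cur)    # cur can no longer be beaten: it is an SLCA
            cur = x
    if cur is not None:
        result.append(cur)
    return result


S1 = [(0, 0, 0), (0, 1, 0), (0, 2, 0)]
S2 = [(0, 0, 1), (0, 2, 1)]
print(slca_eager(S1, S2))  # [(0, 0), (0, 2)]
```

Because only one pending candidate is held at a time, confirmed results can be returned to the user before the whole of S1 has been read.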

18
In case we have limited memory

If the memory is limited, we may read in a
limited number of nodes in S1 at each time. This
does not change the time complexity, but will
slow down the process as there are more disk
accesses.
This algorithm is the so-called Indexed Lookup
Eager algorithm.

19
Limitations of Current Approaches
Departments
  CSE
    Students
      Student
        Name: Peter
        Year: 2006
        Degree: M.S.
        Course: CSE550
        Course: CSE550
      Student
        Name: Ronnie
        Year: 2004
        Degree: Ph.D.
        Course: CSE520
        Course: CSE530
  EE
  PHY
  CHEM
20
Limitations of Current Approaches

(1) They fail when the keywords are not semantically
related.
Example: search for "Peter 2004".
Desired output: Peter 2006 or Ronnie 2004.
Current approaches: either output nothing or
output the whole subtree rooted at Students.

21
Limitations of Current Approaches

(2) They may rank results without taking
semantics into consideration.
Example: "Peter 2004".
Desired output: Peter 2006, then Ronnie 2004.
Current approaches: may output Ronnie 2004 before
Peter 2006.

22
Limitations of Current Approaches

(3) They may output too much.
Example: "CSE Students".
Desired output: a few students with an expansion
link.
Current approaches: may output all CSE students.
When searching for "Amazon Books", nobody wants
the output to contain all the millions of books from
Amazon.

23
Limitations of Current Approaches

(4) They do not give semantics enough consideration.
Example 1: "Peter Degree".
Desired output: M.S.
Current approaches: output all information about
Peter.
Example 2: "Peter M.S."
Desired output: all information about Peter.
Current approaches: Peter M.S.
Example 3: "CSE Peter".
Desired output: information about Peter.
Current approaches: all information about CSE.

24
Our Contribution

We look into the keyword patterns and determine
what users really want. This is done by modifying
the keywords when they are not semantically related
(we use "interconnected", as it is easier to
compute).
We develop a better ranking mechanism and optimize
the results.

25
Our System Architecture
26
Timeline