XML Keyword Search

Provided by: ziyan

1
XML Keyword Search
  • Ziyang Liu

2
Pros and Cons
  • Pros
  • Users do not have to know the structure of the
    XML document they are working with.
  • Users do not have to know the grammar of XPath or
    XQuery.
  • Well suited to casual, non-expert users.
  • Cons
  • Keywords can be very ambiguous.

3
Keyword Ambiguity
  • What does the query "Mary Author Title Year" mean?
  • Find the title and year of publications of which
    Mary is an author.
  • Find the additional authors of the publications of
    which Mary is an author.
  • Find the year and author of publications with titles
    similar to those of Mary's publications.

4
Current Approach
  • In keyword search, many nodes in the XML tree can
    match the keywords.
  • Current approaches look for semantically related
    keyword matches so that irrelevant XML nodes are not
    returned to users.

5
Semantically Related
Courses
    Course
        Title: CSE550
        Room: BYAC 260
    Course
        Title: CSE540
        Room: BYAC 240

  • CSE550 and BYAC 260 are semantically related.
  • CSE550 and BYAC 240 are not.

6
Current Approach
  • "Semantically related" is defined in two
    different ways.
  • Interconnected: two nodes n and n' are
    interconnected iff the paths from n and n' to
    their LCA contain no two distinct nodes with the
    same label, or the only nodes with the same label
    are n and n' themselves.
  • Meaningfully related: two nodes n and n' are
    meaningfully related iff no descendant of their
    LCA is itself the LCA of n and another node of the
    same type as n' (or of n' and another node of the
    same type as n).
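
The interconnected test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the integer node ids, the TREE encoding, and the function names are hypothetical and simply encode the Courses example from the previous slide.

```python
# Hypothetical encoding of the Courses tree: node id -> (label, parent id).
TREE = {
    0: ("Courses", None),
    1: ("Course", 0),
    2: ("Course", 0),
    3: ("Title", 1),  # CSE550
    4: ("Room", 1),   # BYAC 260
    5: ("Title", 2),  # CSE540
    6: ("Room", 2),   # BYAC 240
}


def path_to_root(n):
    """Return the node ids from n up to the root, inclusive."""
    path = []
    while n is not None:
        path.append(n)
        n = TREE[n][1]
    return path


def interconnected(a, b):
    """True iff no two distinct nodes on the path a .. LCA .. b share a
    label, except possibly a and b themselves."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    ancestors_a = set(up_a)
    lca = next(n for n in up_b if n in ancestors_a)
    path = up_a[:up_a.index(lca) + 1] + up_b[:up_b.index(lca)]
    seen = {}  # label -> first node on the path carrying that label
    for n in path:
        label = TREE[n][0]
        if label in seen and {seen[label], n} != {a, b}:
            return False
        seen.setdefault(label, n)
    return True


print(interconnected(3, 4))  # CSE550 and BYAC 260: True
print(interconnected(3, 6))  # CSE550 and BYAC 240: False
```

The second pair fails because the path between the two nodes passes through two different Course nodes.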

7
Current Approach
  • Different papers take different approaches, but
    all of them try to identify the related nodes in
    the XML document and return those to users, while
    leaving out irrelevant ones.

8
One of Current Approaches
  • "Efficient Keyword Search for Smallest LCAs in XML
    Databases"
  • Yu Xu and Yannis Papakonstantinou, UCSD
  • Proceedings of the 2005 ACM SIGMOD International
    Conference on Management of Data

9
Xu's Paper
  • Given a set of keywords, they find the SLCAs of
    the keywords in the XML document and return them
    to users.
  • Definition of SLCA (Smallest Lowest Common
    Ancestor): if the subtree rooted at a node contains
    all the keywords and no descendant of that node
    does, then the node is an SLCA of the keywords.

10
Example
For the query "John Ben", the nodes 0.1.1, 0.1.2, and 0.2.0.0
are the SLCAs.
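
Assuming the node numbers in this example are Dewey IDs (each component is a child position on the path from the root), the LCA of two nodes is simply the longest common prefix of their IDs. A minimal sketch, with hypothetical helper names and hypothetical keyword positions, since the slide's figure is not part of the transcript:

```python
def parse(dewey):
    """Turn a Dewey ID such as "0.1.1" into a tuple of ints."""
    return tuple(int(part) for part in dewey.split("."))


def lca(a, b):
    """The LCA of two Dewey IDs is their longest common prefix."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def is_ancestor(a, b):
    """True iff a is a proper ancestor of b (a proper prefix of b)."""
    return len(a) < len(b) and b[:len(a)] == a


# Hypothetical positions of "John" and "Ben" below node 0.1.1.
john, ben = parse("0.1.1.0"), parse("0.1.1.2")
print(lca(john, ben))  # (0, 1, 1)
```
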
11
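Xu's Paper
  • Nodes under an SLCA are likely to be semantically
    related, but not necessarily.
  • Sometimes there are no semantically related nodes
    in the XML document that match the keywords.
  • They develop the Indexed Lookup Eager algorithm to
    find all SLCAs of the keywords in the XML document.
    It runs in $O(|S_{\min}| \, k \, d \log |S_{\max}|)$ time, where
    k is the number of keywords, d is the maximum depth
    of the tree, and $S_{\min}$ and $S_{\max}$ are the smallest and
    largest keyword lists.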

12
Finding SLCA
  • Definition of SLCA
  • The SLCA of a set of keywords is the set of nodes that
  • (a) contain all the keywords, either in their own
    labels or in the labels of their descendant
    nodes, and
  • (b) have no descendant node that also
    contains all the keywords.

13
Brute Force Approach
  • Suppose there are k keywords. In the XML
    document, the corresponding node sets are
    $S_1, S_2, \ldots, S_k$.
  • For each $v_1$ in $S_1$, $v_2$ in $S_2$, ..., $v_k$ in $S_k$, compute
    the LCA of $v_1, v_2, \ldots, v_k$. This gives the set
    $lca(S_1, S_2, \ldots, S_k)$.
  • Remove from $lca(S_1, S_2, \ldots, S_k)$ every node that is
    an ancestor of another node in the set.
  • The remaining set is $slca(S_1, S_2, \ldots, S_k)$.
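
The brute-force steps above can be sketched directly in Python. The sketch assumes nodes are Dewey IDs stored as tuples; the two keyword lists are hypothetical and were chosen only so that the result matches the three SLCAs quoted in the earlier "John Ben" example.

```python
from functools import reduce
from itertools import product


def lca(a, b):
    """LCA of two Dewey IDs (tuples): their longest common prefix."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def is_ancestor(a, b):
    """True iff a is a proper ancestor of b."""
    return len(a) < len(b) and b[:len(a)] == a


def remove_ancestors(nodes):
    """Keep only the nodes that are not an ancestor of another node."""
    return {n for n in nodes if not any(is_ancestor(n, m) for m in nodes)}


def slca_brute_force(keyword_lists):
    """Compute the LCA of every combination, then drop the ancestors."""
    candidates = {reduce(lca, combo) for combo in product(*keyword_lists)}
    return remove_ancestors(candidates)


# Hypothetical keyword lists (assumed positions, not from the slides).
S_JOHN = [(0, 1, 1, 0), (0, 1, 2, 0), (0, 2, 0, 0, 0)]
S_BEN = [(0, 1, 1, 1), (0, 1, 2, 1), (0, 2, 0, 0, 1)]

print(sorted(slca_brute_force([S_JOHN, S_BEN])))
# [(0, 1, 1), (0, 1, 2), (0, 2, 0, 0)]
```

The call to product enumerates every combination of one node per list, which is where the multiplicative cost comes from.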

14
Brute Force Approach
  • Time complexity of the brute force approach:
    $O(|S_1| \cdot |S_2| \cdots |S_k|)$, one LCA computation per
    combination of matches.
  • This is extremely inefficient.

15
A Better Approach
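  • A better approach has been developed based on the
    following property:
  • $slca(v, S) = descendant(lca(v, lm(v, S)), lca(v, rm(v, S)))$
  • Here lm(v, S) and rm(v, S) are the left and right
    matches of v in S, that is, the nodes of S closest
    to v in document order on either side, and
    descendant returns the deeper of its two arguments.
  • From this property, we get
  • $slca(v, S_2, \ldots, S_k) = slca(slca(v, S_2, \ldots, S_{k-1}), S_k)$
  • and furthermore,
  • $slca(S_1, S_2, \ldots, S_k) = removeAncestor(slca(v_1, S_2, \ldots, S_k))$,
    taken over all $v_1$ in $S_1$.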
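
A minimal sketch of the first property, assuming each keyword list is a non-empty list of Dewey-ID tuples sorted in document order (Python's tuple comparison gives exactly that order). The function names mirror lm and rm from the slide; deepest_lca is a hypothetical name for the right-hand side of the property, and binary search via bisect is one way to obtain the logarithmic lookup.

```python
from bisect import bisect_left, bisect_right


def lca(a, b):
    """LCA of two Dewey IDs (tuples): their longest common prefix."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def lm(v, s):
    """Left match: the last node of sorted list s that is <= v, or None."""
    i = bisect_right(s, v)
    return s[i - 1] if i > 0 else None


def rm(v, s):
    """Right match: the first node of sorted list s that is >= v, or None."""
    i = bisect_left(s, v)
    return s[i] if i < len(s) else None


def deepest_lca(v, s):
    """slca of the single node v and the list s: the deeper of the LCAs of
    v with its left and right matches. Assumes s is non-empty."""
    candidates = [lca(v, m) for m in (lm(v, s), rm(v, s)) if m is not None]
    # Both candidates are prefixes of v, so the longer one is the descendant.
    return max(candidates, key=len)


# Hypothetical sorted keyword list.
S_BEN = [(0, 1, 1, 1), (0, 1, 2, 1), (0, 2, 0, 0, 1)]
print(deepest_lca((0, 1, 2, 0), S_BEN))  # (0, 1, 2)
```

Only two LCA computations are needed per node v, regardless of how large S is.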

16
A Better Approach
  • This approach is much more efficient than the
    brute force one, as it processes $S_k$ only after
    all $S_i$ ($i < k$) have been processed.
  • Its time complexity is
    $O(|S_{\min}| \, k \log |S_{\max}| + |S_{\min}|^2)$, with $S_1$ chosen as
    the smallest list $S_{\min}$.
  • $|S_{\min}| \, k \log |S_{\max}|$ is for computing the LCAs, and
    $|S_{\min}|^2$ is for removing ancestors.
  • This is still NOT satisfactory performance.

17
An Even Better One
  • We can detect ancestor nodes along the way, so we
    do not have to remove them after computing the
    LCA set.
  • Therefore, the time complexity becomes
    $O(|S_1| \, k \log |S|)$. This is a great improvement.
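
One way to detect ancestors along the way is to process the nodes of the smallest list in document order and keep a single pending candidate. The sketch below is an illustration under assumptions, not the paper's pseudocode: nodes are Dewey-ID tuples, and all function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def lca(a, b):
    """LCA of two Dewey IDs (tuples): their longest common prefix."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def deepest_lca(v, s):
    """Deeper of the LCAs of v with its left and right matches in sorted s."""
    i, j = bisect_right(s, v), bisect_left(s, v)
    matches = ([s[i - 1]] if i > 0 else []) + ([s[j]] if j < len(s) else [])
    return max((lca(v, m) for m in matches), key=len)


def slca_eager(keyword_lists):
    """Return the SLCAs in document order without a final ancestor pass."""
    lists = sorted((sorted(s) for s in keyword_lists), key=len)
    if not lists or not lists[0]:
        return []  # some keyword has no match at all
    first, rest = lists[0], lists[1:]
    result, pending = [], None
    for v in first:
        x = v
        for s in rest:  # chain the lookups through the remaining lists
            x = deepest_lca(x, s)
        if pending is None:
            pending = x
        elif pending <= x:  # x comes at or after pending in document order
            if x[:len(pending)] != pending:  # pending is not above x
                result.append(pending)
            pending = x
        # Otherwise x precedes pending, so x is an ancestor of it: skip x.
    result.append(pending)
    return result


# Hypothetical keyword lists.
S_JOHN = [(0, 1, 1, 0), (0, 1, 2, 0), (0, 2, 0, 0, 0)]
S_BEN = [(0, 1, 1, 1), (0, 1, 2, 1), (0, 2, 0, 0, 1)]
print(slca_eager([S_JOHN, S_BEN]))
# [(0, 1, 1), (0, 1, 2), (0, 2, 0, 0)]
```

Because each confirmed result is appended as soon as a later, unrelated candidate appears, results can be returned to the user before the whole list has been scanned.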

18
In case we have limited memory
  • If memory is limited, we can read in a
    limited number of nodes from $S_1$ at a time. This
    does not change the time complexity, but it
    slows down the process because there are more disk
    accesses.
  • This algorithm is the so-called Indexed Lookup
    Eager algorithm.

19
Limitations of Current Approaches
Departments
    CSE
        Students
            Student
                Name: Peter
                Year: 2006
                Degree: M.S.
                Course: CSE550
                Course: CSE550
            Student
                Name: Ronnie
                Year: 2004
                Degree: Ph.D.
                Course: CSE520
                Course: CSE530
    EE
    PHY
    CHEM

20
Limitations of Current Approaches
  • (1) They go wrong when the keywords are not
    semantically related.
  • Example: search for "Peter 2004".
  • Desired output: "Peter 2006" or "Ronnie 2004".
  • Current approaches: either output nothing or
    output the whole subtree rooted at Students.

21
Limitations of Current Approaches
  • (2) They may rank results without taking
    semantics into consideration.
  • Example: "Peter 2004".
  • Desired output: "Peter 2006", then "Ronnie 2004".
  • Current approaches: may output "Ronnie 2004" before
    "Peter 2006".

22
Limitations of Current Approaches
  • (3) They may output too much.
  • Example: "CSE Students".
  • Desired output: a few students with an expansion
    link.
  • Current approaches: may output all CSE students.
  • When searching for "Amazon Books", nobody wants
    the system to output all the millions of books from
    Amazon.

23
Limitations of Current Approaches
  • (4) They do not give semantics enough consideration.
  • Example 1: "Peter Degree".
  • Desired output: M.S.
  • Current approaches: output all information about
    Peter.
  • Example 2: "Peter M.S.".
  • Desired output: all information about Peter.
  • Current approaches: "Peter M.S.".
  • Example 3: "CSE Peter".
  • Desired output: information about Peter.
  • Current approaches: all information about CSE.

24
Our Contribution
  • We look into the keyword patterns and determine
    what users really want. This is done by modifying
    the keywords when they are not semantically related
    (we use "interconnected" as the criterion because
    it is easier to compute).
  • We develop a better ranking mechanism and optimize
    the results.

25
Our System Architecture
26
Timeline
  • Implementation of System by Oct 15.
  • Paper by Nov 12.
  • Thank you, and we hope you'll enjoy our XML keyword
    search system.