Making XML Documents Searchable through the Web

Provided by: ibib
Learn more at: http://www.ibiblio.org
1
Making XML Documents Searchable through the Web
  • Dongwook Shin
  • dwshin_at_nlm.nih.gov
  • Lister Hill National Center for Biomedical
    Informatics
  • National Library of Medicine

2
Importance of XML Search Engine
  • More and more documents are beginning to be
    provided in XML formats.
  • XML documents are supposed to have certain
    structures
  • Current Web Search Engines do not provide
    structural search capability

3
Searching Characteristics
  • Content Searching
  • Searching for certain words in the element
    hierarchy
  • Retrieve CHAPTER whose TITLE contains servlet
    and PARAGRAPH contains session.
  • Structural Searching
  • Searching for elements satisfying certain
    relations
  • Retrieve SECTION that has at least two FIGUREs

4
Searching Characteristics (Cont'd)
  • Combined Searching
  • Content + Structural Searching
  • Retrieve SECTION that has TITLE containing XML
    and contains at least a FIGURE.
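The combined query above can be illustrated on a small in-memory document. This is only a sketch using Python's standard `xml.etree.ElementTree`, not the XRS engine: the sample document, the `id` attributes, and the function name `combined_search` are assumptions made for the example, and it scans the tree directly instead of using an index.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names follow the slide's examples.
DOC = """
<CHAPTER>
  <SECTION id="s1"><TITLE>XML basics</TITLE><FIGURE/><PARAGRAPH>text</PARAGRAPH></SECTION>
  <SECTION id="s2"><TITLE>XML without pictures</TITLE><PARAGRAPH>text</PARAGRAPH></SECTION>
  <SECTION id="s3"><TITLE>Servlets</TITLE><FIGURE/><FIGURE/></SECTION>
</CHAPTER>
"""


def combined_search(root, target, title_word, min_figures=1):
    """Return target elements whose TITLE contains title_word (content
    condition) and that contain at least min_figures FIGUREs (structural
    condition)."""
    hits = []
    for elem in root.iter(target):
        title = elem.find("TITLE")
        words = (title.text or "").lower().split() if title is not None else []
        n_figures = len(elem.findall(".//FIGURE"))
        if title_word.lower() in words and n_figures >= min_figures:
            hits.append(elem)
    return hits


root = ET.fromstring(DOC)

# Combined search: SECTION whose TITLE contains "XML" and that has a FIGURE.
combined_ids = [s.get("id") for s in combined_search(root, "SECTION", "XML")]

# Purely structural search: SECTION that has at least two FIGUREs.
structural_ids = [s.get("id") for s in root.iter("SECTION")
                  if len(s.findall(".//FIGURE")) >= 2]
```

Here `combined_ids` holds only the first section and `structural_ids` only the third, which shows why neither condition alone answers the combined query.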

5
Other XML Search Engines on the Web
  • Most engines provide search in a fixed set of
    fields.
  • User cannot search in any elements in the
    document hierarchy.
  • http://www.goxml.com
  • http://www.scoobs.com
  • http://www.xmlTree.com

6
XRS (XML Retrieval System)
  • Providing a variety of structural search
    functions
  • Users can search in any elements in the document
    hierarchy
  • Content + Structural Searching
  • Allowing less index overhead and quick retrieval
    time
  • BUS (Bottom Up Scheme) is used
  • Applicable to valid documents, but not to
    well-formed documents
  • Using DTD when making queries and retrieving
  • Examples are Shakespeare or Bible data

7
Architecture of XRS
8
User Interface (Initialization)
DTD can be browsed here
Query conditions appear here
Search results are shown here with similarity
value
9
Query Composition
  • Principle
  • Any element can be a target - the element to be
    retrieved
  • Search conditions can be imposed on any elements
  • EXAMPLE
  • Retrieve SPEECH whose SPEAKER contains Hamlet
    and LINE contains Denmark

Target
Search Condition
10
User Interface (Query Composition)
11
User Interface (with Search Results)
12
Browser Side
  • Show XML results

13
Browsing a List of Elements
14
XML Result
15
Query at Another Target Element
  • Retrieve SCENE whose TITLE contains Castle
    and SPEAKER contains Horatio

16
XML Result
17
Query Mediator Servlet
  • Mediate the query and results
  • Convey the user query to the backend search
    engine
  • Transmit the retrieved results to the applet or
    the rendering component
  • Send the result sets with brief information to
    the applet
  • Send the XML content with a proper XSL to the
    rendering component so that it can be transformed
    into HTML
  • Session tracking and Result Sets Reclamation
  • Keep session tracking so that a user can use
    his/her session continuously until he/she quits.
  • Detect the dead sessions periodically and reclaim
    the corresponding result sets.

18
Query Language
  • INIT
  • Get the DBs and their DTDs available in the
    server
  • It is sent to the server when the applet is
    initialized
  • SEARCH db_name search_cond
  • db_name is one of DBs available in the server
  • search_cond includes the target and search
    conditions
  • PRES num
  • Get the XML results
  • num is the n-th result in the result set

19
Result Set
  • A result set is assigned to each session
  • Query Mediator does session tracking
  • Backend Search engine keeps multiple result sets
  • Multi-thread safe code is required
  • When a session is relinquished, the result set is
    reclaimed
  • Garbage collection for the result set is required
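The requirements above (one result set per session, thread safety, reclamation of relinquished sessions) can be sketched as follows. The real backend is native C; this Python class, its name `ResultSetTable`, its method names, and the 1-based result numbering are assumptions chosen only to show the bookkeeping.

```python
import threading


class ResultSetTable:
    """Keeps one result set per session behind a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_session = {}

    def store(self, session_id, results):
        # A new search replaces the session's previous result set.
        with self._lock:
            self._by_session[session_id] = list(results)

    def get(self, session_id, n):
        """Return the n-th result of a session (1-based, an assumption)."""
        with self._lock:
            return self._by_session[session_id][n - 1]

    def reclaim(self, alive_sessions):
        """Drop result sets of sessions that are no longer alive and
        return the reclaimed session ids."""
        alive = set(alive_sessions)
        with self._lock:
            dead = [s for s in self._by_session if s not in alive]
            for s in dead:
                del self._by_session[s]
        return sorted(dead)


table = ResultSetTable()
table.store("session-a", ["speech 12", "speech 40"])
table.store("session-b", ["scene 3"])
reclaimed = table.reclaim(alive_sessions=["session-a"])
```

Passing the list of alive sessions, instead of the dead ones, mirrors the periodic check in which the mediator reports which sessions still exist and the backend works out what can be freed.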

20
The Content of a Result Set
  • DB_name
  • The name of the database where the search is
    performed and the result is obtained
  • DB_path
  • The directory path from the root where the DB
    resides
  • ptr_to_result_set
  • pointer to the dynamic arrays having the search
    results
  • num_result
  • number of elements retrieved
  • ptr_to_K_ary_table
  • pointer to the table that keeps the k_ary
    information for the DB

21
RS (Result Set) Management
Backend search engine
22
Periodical RS Reclamation
[Diagram: reclamation messages exchanged between the Query Mediator and the Backend search engine. Labels:]
  • Reclamation done
  • Session ends
  • Reclamation requested (j, ...)
  • RS indices to be reclaimed returned (j, ...)
  • Alive sessions sent
  • i-th RS: actual result
  • j-th RS: reclaimed
23
Backend Search Engine
  • Less Indexing Overhead and Quick Retrieval
  • Use BUS (Bottom Up Scheme)
  • Most of the code is written in native C
  • Support Multi-thread
  • Multi-thread safe C code
  • Compile the C code into a shared library
  • Save index information in files

24
BUS (Bottom Up Scheme)
  • Main Idea
  • Index only at the lowest level of the document
    structure
  • Weight information at higher level is computed at
    retrieval time
  • Benefits
  • Minimize the indexing overhead
  • Support term weight and full-blown structural
    search
  • Guarantee quick retrieval time

25
Principle of BUS
Term frequency is computed at run time.
Document tree with index terms:
chapter: hypertext(10) browser(4) internet(5) multimedia(5) java(7)
  section1: hypertext(2) browser(4)
  section2: hypertext(8) internet(5) multimedia(5) java(7)
    para1: hypertext(3) internet(3) multimedia(5)
    para2: hypertext(5) internet(2) java(7)
Bottom Up Scheme:
Indexing is performed at leaf nodes only (section1, para1, para2)
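The principle can be checked with the numbers on this slide: only the leaves carry an index, and the frequencies of any higher element are summed at retrieval time. The sketch below assumes the tree shape implied by those numbers (section1 is a leaf, section2 contains para1 and para2); the dictionary layout and function name are illustrative, not the engine's data structures.

```python
from collections import Counter

# child -> parent, as implied by the slide's frequencies.
PARENT = {
    "section1": "chapter",
    "section2": "chapter",
    "para1": "section2",
    "para2": "section2",
}

# Index information exists at the leaves only.
LEAF_INDEX = {
    "section1": Counter(hypertext=2, browser=4),
    "para1": Counter(hypertext=3, internet=3, multimedia=5),
    "para2": Counter(hypertext=5, internet=2, java=7),
}


def term_frequencies(element):
    """Sum the leaf postings of every leaf at or below the given element."""
    total = Counter()
    for leaf, terms in LEAF_INDEX.items():
        node = leaf
        while node is not None:
            if node == element:
                total += terms
                break
            node = PARENT.get(node)
    return total


section2_tf = term_frequencies("section2")
chapter_tf = term_frequencies("chapter")
```

Both results match the frequencies the slide attaches to section2 and chapter, without those frequencies ever being stored.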
26
Key Issues in BUS
  • How to figure out ancestor elements of a leaf
    element efficiently ?
  • How to accumulate the term frequency effectively ?

27
UID (Unique element IDentifier)
  • Represent each document as a k-ary complete tree
    and assign a UID to each node

[Figure: a document tree with real nodes a through j, padded with virtual nodes (labeled e) into a complete 3-ary tree, shown together with the result of assigning UIDs]
$\mathrm{parent}(i) = \lfloor (i-2)/k \rfloor + 1$
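A short sketch of the UID arithmetic, assuming the parent formula reads parent(i) = floor((i-2)/k) + 1 with the root numbered 1. That reading is consistent with the UIDs on the "Representing a Document Tree" slide, where nodes 32 and 33 sit under node 11 in a 3-ary tree. The `children` function is derived from the formula and does not appear in the slides.

```python
def parent(uid: int, k: int) -> int:
    """UID of the parent of node uid in a complete k-ary tree (root is 1)."""
    if uid <= 1:
        raise ValueError("the root has no parent")
    return (uid - 2) // k + 1


def children(uid: int, k: int) -> list:
    """UIDs of the k child slots of node uid (real or virtual)."""
    first = k * (uid - 1) + 2
    return list(range(first, first + k))


# In a 3-ary tree, node 11 owns slots 32, 33 and 34.
slots_of_11 = children(11, 3)
```

Because the parent is computed rather than stored, finding the ancestors of a leaf needs no extra index, which addresses the first key issue of BUS.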
28
K-ary table
  • Each document is assigned k, which is the maximum
    number of siblings in the document tree.
  • Each collection has a K-ary table, each entry
    of which holds the k of one document.
  • Each result set has a pointer to the K-ary table.
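One way to build such a table is to take, for each document, the largest number of children under any single element. The sketch below assumes that only element children are counted and that documents are keyed by a document number; both choices and the sample documents are assumptions.

```python
import xml.etree.ElementTree as ET


def max_fanout(root) -> int:
    """Largest number of child elements under any single element."""
    return max(len(elem) for elem in root.iter())


def build_k_ary_table(documents: dict) -> dict:
    """Map each document number to its k (at least 1)."""
    return {doc_no: max(1, max_fanout(ET.fromstring(xml)))
            for doc_no, xml in documents.items()}


# Hypothetical two-document collection.
collection = {
    5: "<a><b/><c><f/><g/></c><d><h/><i/><j/></d></a>",
    6: "<a><b/><c/></a>",
}
k_table = build_k_ary_table(collection)
```

Keeping k per document, instead of one k for the whole collection, avoids inflating the UID range of small documents because of one unusually wide document.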

29
Level and Element Type Number
  • Level
  • Level means the level in the document tree
  • It indicates how many times the parent function
    must be applied to reach a target element
  • Element type number
  • A unique number is assigned to each element type
    in DTD ( not the elements in documents )
  • It makes it possible to filter out unnecessary
    elements and accumulate the correct frequencies

30
Level and Element Type Number (Cont'd)
  • User query: Retrieve sections that contain
    hypertext

Level 1:
  chapter: hypertext(9) browser(1) internet(5) multimedia(5) java(7)
Level 2 (user level):
  title: hypertext(1) browser(1)
  section1: hypertext(8) internet(5) multimedia(5) java(7)
Level 3 (text level; index information):
  para1: hypertext(3) internet(3) multimedia(5)
  para2: hypertext(5) internet(2) java(7)
The level difference tells how many times the parent
function is applied.
The element type number lets unnecessary index
information be filtered out.
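The two roles can be sketched together: the level difference drives how many times the parent function is applied, and the element type number decides which postings are kept. The code is one plausible reading, not the engine's logic. It assumes the parent formula parent(i) = floor((i-2)/k) + 1, postings of the form (document, UID, level, element type, frequency), and a hypothetical DTD-derived table `DESCENDANT_TYPES` saying which element types may occur below the target type. The posting values for the term "index" are taken from the "Representing a Document Tree" slide.

```python
def ancestor_at_level(uid: int, level: int, target_level: int, k: int) -> int:
    """Apply the parent function (level - target_level) times."""
    for _ in range(level - target_level):
        uid = (uid - 2) // k + 1
    return uid


# Hypothetical table derived from a DTD: type 6 elements may contain type 7.
DESCENDANT_TYPES = {6: {6, 7}}


def user_level_hits(postings, target_type, target_level, k):
    """Keep postings that can lie under the target type and lift them
    to the user level."""
    allowed = DESCENDANT_TYPES[target_type]
    hits = []
    for doc, uid, level, etype, tf in postings:
        if etype not in allowed or level < target_level:
            continue  # filtered out by element type number or level
        hits.append((doc, ancestor_at_level(uid, level, target_level, k), tf))
    return hits


# Postings for the term "index": (document, UID, level, element type, tf).
index_postings = [(5, 8, 3, 5, 3), (5, 9, 3, 5, 3), (5, 32, 4, 7, 3)]
hits = user_level_hits(index_postings, target_type=6, target_level=3, k=3)
```

Only the posting at UID 32 survives and is lifted one level to UID 11; the two postings of element type 5 are dropped before any frequency is accumulated.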
31
Representing a Document Tree
[Figure: document tree. Node labels read as <document number, UID, level, element type number>; positions not listed are virtual nodes (labeled e)]
<5,1,1,1>
  <5,2,2,2>: hypertext(1) model(1) retrieval(1) semantics(1)
  <5,3,2,3>
    <5,8,3,5>: index(3) lexical(1) noun(4) stem(2)
    <5,9,3,5>: document(4) index(3) precision(2) term(5)
  <5,4,2,3>
    <5,11,3,6>
      <5,32,4,7>: document(4) index(3) precision(1) term(5)
      <5,33,4,7>: browser(2) hypertext(2) java(5) link(6)
    <5,12,3,6>
      <5,35,4,7>: anchor(2) browser(1) html(3) internet(5)
      <5,36,4,7>: basian(3) inquiry(2) link(3) matrix(3)
32
Query Evaluation
  • Create accumulators at user level
  • Accumulators correspond to the elements at the
    user level
  • Compute the TF (Term Frequency) and DF (Document
    Frequency) of a term
  • Sum all the term frequencies of the descendant
    elements into the corresponding accumulators.
  • The number of non-zero accumulators is the DF of
    the term.
  • Calculate the term weight
  • Compute the similarity of the elements and rank
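The last two steps can be sketched once the accumulators are filled. The slides do not state the weighting formula, so the tf × log(1 + N/df) weight and the sum-of-weights similarity below are assumptions standing in for whatever XRS actually uses; the function name and input layout are also illustrative.

```python
import math


def rank(accumulators_by_term: dict, num_user_level_elements: int) -> list:
    """Rank user-level elements by the sum of their term weights.

    accumulators_by_term maps term -> {element: accumulated TF}.
    """
    scores = {}
    for term, acc in accumulators_by_term.items():
        # DF is the number of non-zero accumulators for the term.
        df = sum(1 for tf in acc.values() if tf > 0)
        if df == 0:
            continue
        idf = math.log(1 + num_user_level_elements / df)  # assumed weighting
        for elem, tf in acc.items():
            if tf > 0:
                scores[elem] = scores.get(elem, 0.0) + tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Accumulators for one term over two user-level elements (UIDs 11 and 12).
ranked = rank({"browser": {11: 6, 12: 1}}, num_user_level_elements=2)
```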

33
Accumulating Term Frequency
Query: find sections containing browser
Subtree of the tree in the "Representing a Document Tree" slide:
<5,1>
  . . . .
  <5,11,3,6>  accumulator <5,11>: 6
    <5,32,4,7>: browser(4) index(3) precision(1) term(5)
    <5,33,4,7>: browser(2) hypertext(2) java(5) link(6)
  <5,12,3,6>  accumulator <5,12>: 1
    <5,35,4,7>: anchor(2) browser(1) html(3) internet(5)
    <5,36,4,7>: basian(3) inquiry(2) link(3) matrix(3)
  . . . .
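The accumulation on this slide can be reproduced with a few lines. The sketch assumes k = 3, the parent formula parent(i) = floor((i-2)/k) + 1, and postings of the form (document, UID, level, frequency); the `POSTINGS` layout and function name are illustrative. The frequencies for "browser" are the ones shown on the slide.

```python
from collections import defaultdict

K = 3  # the example document is a 3-ary tree

# Leaf-level postings: term -> [(document, UID, level, term frequency)].
POSTINGS = {"browser": [(5, 32, 4, 4), (5, 33, 4, 2), (5, 35, 4, 1)]}


def accumulate(term: str, user_level: int, k: int = K) -> dict:
    """Sum leaf frequencies into accumulators keyed by the user-level
    ancestor of each leaf."""
    acc = defaultdict(int)
    for doc, uid, level, tf in POSTINGS.get(term, []):
        for _ in range(level - user_level):
            uid = (uid - 2) // k + 1   # climb one level
        acc[(doc, uid)] += tf
    return dict(acc)


accumulators = accumulate("browser", user_level=3)
document_frequency = sum(1 for tf in accumulators.values() if tf > 0)
```

The accumulators come out as 6 for <5,11> and 1 for <5,12>, and the number of non-zero accumulators gives a document frequency of 2 for the term.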
34
Performance Data (in Ultra Sparc 2)
  • Index Overhead
  • Retrieval time
  • Almost all single-term queries are evaluated
    within one second

35
Advantage of XRS
  • Provides a variety of structural search
    functions.
  • Less indexing overhead and quick retrieval time
  • Easy to port
  • Java + native C code
  • C code is made as shared libraries

36
Alternative Architecture of XRS
[Diagram labels]
Server Side: Search Engine (Shared Library), JNI interface, Servlet (Rendering Component, Query Mediator Servlet)
Client Side: Web browser, Applet (User Interface)
Arrows: Initiate; query; Search result; XML result with XSL; HTML format
37
Benefit and Problem
  • Benefit
  • Simpler and easier to port than the current
    implementation
  • Do not need an independent Java process
  • Problem
  • Current Java Servlet engines cannot run the
    shared libraries
  • Apache JServ, JRun, and Jigsaw fail to run it!

38
Current Status
  • Development of the content retrieval part is
    finished
  • Available on the Web at the end of August 1999.
  • http://dlb2.nlm.nih.gov/dwshin
  • The structural retrieval part is in development
  • It will be finished soon.