Title: Making XML Documents Searchable through the Web
1. Making XML Documents Searchable through the Web
- Dongwook Shin
- dwshin_at_nlm.nih.gov
- Lister Hill National Center for Biomedical Informatics
- National Library of Medicine
2. Importance of XML Search Engine
- More and more documents are being provided in XML formats.
- XML documents are supposed to have certain structures.
- Current Web search engines do not provide structural search capability.
3. Searching Characteristics
- Content Searching
  - Searching for certain words in the element hierarchy
  - Retrieve CHAPTER whose TITLE contains "servlet" and PARAGRAPH contains "session".
- Structural Searching
  - Searching for elements satisfying certain relations
  - Retrieve SECTION that has at least two FIGUREs.
4. Searching Characteristics (Cont'd)
- Combined Searching
  - Content + Structural Searching
  - Retrieve SECTION that has a TITLE containing "XML" and contains at least one FIGURE.
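The three kinds of query can be sketched with Python's standard `xml.etree.ElementTree`. This is only an illustration of the query semantics, not how XRS evaluates queries; the sample document, the helper names `contains_word` and `combined_search`, and the use of case-insensitive substring matching are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names follow the slide's examples.
DOC = """
<BOOK>
  <SECTION><TITLE>XML basics</TITLE><FIGURE/><PARAGRAPH>text</PARAGRAPH></SECTION>
  <SECTION><TITLE>XML search</TITLE><PARAGRAPH>no figure here</PARAGRAPH></SECTION>
  <SECTION><TITLE>Servlets</TITLE><FIGURE/><FIGURE/></SECTION>
</BOOK>
"""

def contains_word(element, word):
    """Content condition: the element's text (descendants included) contains word.
    Case-insensitive substring matching is an assumption of this sketch."""
    text = "".join(element.itertext())
    return word.lower() in text.lower()

def combined_search(root, target, content_child, word, struct_child, min_count):
    """Return target elements that have a content_child containing word
    and at least min_count struct_child elements anywhere below them."""
    hits = []
    for elem in root.iter(target):
        content_ok = any(contains_word(c, word) for c in elem.iter(content_child))
        structure_ok = len(list(elem.iter(struct_child))) >= min_count
        if content_ok and structure_ok:
            hits.append(elem)
    return hits

root = ET.fromstring(DOC)

# Combined: SECTION whose TITLE contains "XML" and that has at least one FIGURE.
sections = combined_search(root, "SECTION", "TITLE", "XML", "FIGURE", 1)
titles = [s.findtext("TITLE") for s in sections]
print(titles)  # ['XML basics']

# Structural only: SECTION that has at least two FIGUREs.
two_figures = [s.findtext("TITLE") for s in root.iter("SECTION")
               if len(s.findall(".//FIGURE")) >= 2]
print(two_figures)  # ['Servlets']
```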
5. Other XML Search Engines on the Web
- Most engines provide search in a fixed set of fields.
- Users cannot search in arbitrary elements of the document hierarchy.
- http://www.goxml.com
- http://www.scoobs.com
- http://www.xmlTree.com
6. XRS (XML Retrieval System)
- Providing a variety of structural search functions
  - Users can search in any element of the document hierarchy
  - Content + Structural Searching
- Allowing less index overhead and quick retrieval time
  - BUS (Bottom Up Scheme) is used
- Applicable to valid documents, but not to documents that are merely well-formed
  - Using the DTD when making queries and retrieving
  - Examples are Shakespeare or Bible data
7. Architecture of XRS
8. User Interface (Initialization)
DTD can be browsed here
Query conditions appear here
Search results are shown here with similarity value
9. Query Composition
- Principle
  - Any element can be a target (the element to be retrieved)
  - Search conditions can be imposed on any elements
- EXAMPLE
  - Retrieve SPEECH whose SPEAKER contains "Hamlet" and LINE contains "Denmark"
  - Target: SPEECH
  - Search condition: SPEAKER contains "Hamlet" and LINE contains "Denmark"
10. User Interface (Query Composition)
11. User Interface (with Search Results)
12. Browser Side
13. Browsing a List of Elements
14. XML Result
15. Query at Another Target Element
- Retrieve SCENE whose TITLE contains "Castle" and SPEAKER contains "Horatio"
16. XML Result
17. Query Mediator Servlet
- Mediate the query and results
  - Convey the user query to the backend search engine
  - Transmit the retrieved results to the applet or the rendering component
    - Send the result sets with brief information to the applet
    - Send the XML content with a proper XSL to the rendering component so that it can be transformed into HTML
- Session tracking and Result Set Reclamation
  - Track sessions so that a user can keep using his/her session until he/she quits.
  - Detect dead sessions periodically and reclaim the corresponding result sets.
18. Query Language
- INIT
  - Get the DBs and their DTDs available on the server
  - It is sent to the server when the applet is initialized
- SEARCH db_name search_cond
  - db_name is one of the DBs available on the server
  - search_cond includes the target and the search conditions
- PRES num
  - Get the XML results
  - num selects the num-th result in the result set
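A minimal sketch of how the three commands might be parsed on the server side. The slides do not give the wire format, so the whitespace-separated layout, the `Command` type, and the `parse_command` helper are assumptions; `search_cond` is kept as an opaque string because its syntax is not shown.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Command:
    verb: str                          # "INIT", "SEARCH" or "PRES"
    db_name: Optional[str] = None      # SEARCH only
    search_cond: Optional[str] = None  # SEARCH only, kept opaque
    num: Optional[int] = None          # PRES only

def parse_command(line: str) -> Command:
    """Parse one command line (assumed format: verb, then space-separated arguments)."""
    parts = line.strip().split(None, 2)
    if not parts:
        raise ValueError("empty command")
    verb = parts[0].upper()
    if verb == "INIT":
        if len(parts) != 1:
            raise ValueError("INIT takes no arguments")
        return Command("INIT")
    if verb == "SEARCH":
        if len(parts) != 3:
            raise ValueError("usage: SEARCH db_name search_cond")
        return Command("SEARCH", db_name=parts[1], search_cond=parts[2])
    if verb == "PRES":
        if len(parts) != 2 or not parts[1].isdigit():
            raise ValueError("usage: PRES num")
        return Command("PRES", num=int(parts[1]))
    raise ValueError(f"unknown command: {verb}")

# The database name and the condition text below are placeholders.
search = parse_command("SEARCH shakespeare target=SPEECH SPEAKER:Hamlet")
present = parse_command("PRES 3")
```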
19. Result Set
- A result set is assigned to each session
  - Query Mediator does session tracking
- Backend search engine keeps multiple result sets
  - Multi-thread safe code is required
- When a session is relinquished, the result set is reclaimed
  - Garbage collection for the result set is required
20. The Content of a Result Set
- DB_name
  - The name of the database where the search is performed and the result is obtained
- DB_path
  - The directory path from the root where the DB resides
- ptr_to_result_set
  - Pointer to the dynamic arrays holding the search results
- num_result
  - Number of elements retrieved
- ptr_to_K_ary_table
  - Pointer to the table that keeps the k-ary information for the DB
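The fields above can be pictured as a small record. The backend is described as C code, so this Python dataclass is only a stand-in; the layout of one result entry (document number, UID, similarity) and the mapping from document number to k are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ResultSet:
    db_name: str   # database where the search was performed
    db_path: str   # directory path where the database resides
    # Stand-in for ptr_to_result_set: (document number, UID, similarity), assumed layout.
    results: List[Tuple[int, int, float]] = field(default_factory=list)
    # Stand-in for ptr_to_K_ary_table: document number -> k, assumed layout.
    k_ary_table: Dict[int, int] = field(default_factory=dict)

    @property
    def num_result(self) -> int:
        """Number of elements retrieved."""
        return len(self.results)

# Placeholder values for illustration only.
rs = ResultSet("shakespeare", "/data/shakespeare",
               results=[(5, 11, 0.9), (5, 12, 0.2)], k_ary_table={5: 3})
```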
21. RS (Result Set) Management
Figure (labels only): backend search engine.
22. Periodical RS Reclamation
Figure (labels only): exchange between the Query Mediator and the backend search engine. Labels: session ends; alive sessions sent; RS indices to be reclaimed returned (j, ...); reclamation requested (j, ...); reclamation done. The i-th RS holds an actual result; the j-th RS is reclaimed.
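Periodic reclamation can be sketched as a store that drops every result set whose session is no longer alive. The exact message order between the Query Mediator and the backend is not recoverable from the slide, so this collapses it into a single call; the `ResultSetStore` class and its method names are assumptions, and a lock stands in for the multi-thread safety the slides require.

```python
import threading

class ResultSetStore:
    """Keeps one result set per session and reclaims those of dead sessions."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_session = {}  # session id -> result set (any object)

    def assign(self, session_id, result_set):
        with self._lock:
            self._by_session[session_id] = result_set

    def get(self, session_id):
        with self._lock:
            return self._by_session.get(session_id)

    def reclaim(self, alive_sessions):
        """Drop result sets whose session is not in alive_sessions.
        Returns the ids of the reclaimed sessions."""
        alive = set(alive_sessions)
        with self._lock:
            dead = [s for s in self._by_session if s not in alive]
            for s in dead:
                del self._by_session[s]
        return sorted(dead)

store = ResultSetStore()
store.assign("i", ["actual result"])
store.assign("j", ["stale result"])
reclaimed = store.reclaim(alive_sessions=["i"])  # session "j" has ended
```

In a servlet setting the mediator would call `reclaim` on a timer with the ids of the sessions it still tracks.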
23. Backend Search Engine
- Less Indexing Overhead and Quick Retrieval
  - Use BUS (Bottom Up Scheme)
  - Most of the code is written in native C
- Support Multi-thread
  - Multi-thread safe C code
  - Compile the C code into a shared library
- Save index information in files
24. BUS (Bottom Up Scheme)
- Main Idea
  - Index only at the lowest level of the document structure
  - Weight information at higher levels is computed at retrieval time
- Benefits
  - Minimize the indexing overhead
  - Support term weight and full-blown structural search
  - Guarantee quick retrieval time
25. Principle of BUS
Figure (reconstructed from labels): a document tree with index terms, and the same tree under the Bottom Up Scheme. Indexing is performed at leaf nodes only; term frequency at higher levels is computed at run time.
chapter: hypertext(10) browser(4) internet(5) multimedia(5) java(7)
  section1 (leaf): hypertext(2) browser(4)
  section2: hypertext(8) internet(5) multimedia(5) java(7)
    para1 (leaf): hypertext(3) internet(3) multimedia(5)
    para2 (leaf): hypertext(5) internet(2) java(7)
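The numbers on this slide can be reproduced by storing frequencies for the leaves only and summing them on demand. This sketch walks the tree recursively by name purely to show the arithmetic; it is not the UID-based mechanism BUS uses, and the `LEAF_INDEX` and `CHILDREN` tables are illustrative structures.

```python
from collections import Counter

# Only leaves carry index entries (term -> frequency), as on the slide.
LEAF_INDEX = {
    "section1": Counter(hypertext=2, browser=4),
    "para1": Counter(hypertext=3, internet=3, multimedia=5),
    "para2": Counter(hypertext=5, internet=2, java=7),
}

# Tree shape: parent -> children.
CHILDREN = {
    "chapter": ["section1", "section2"],
    "section2": ["para1", "para2"],
}

def term_frequencies(node):
    """Compute a node's term frequencies at retrieval time by summing its leaves."""
    if node in LEAF_INDEX:
        return Counter(LEAF_INDEX[node])
    total = Counter()
    for child in CHILDREN.get(node, []):
        total += term_frequencies(child)
    return total

section2_tf = term_frequencies("section2")
chapter_tf = term_frequencies("chapter")
print(chapter_tf["hypertext"], chapter_tf["browser"])  # 10 4
```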
26. Key Issues in BUS
- How to figure out the ancestor elements of a leaf element efficiently?
- How to accumulate the term frequency effectively?
27. UID (Unique element IDentifier)
- Represent each document as a k-ary complete tree and assign a UID to each node
Figure (labels only): a 3-ary tree and the result of assigning UIDs. Letters a, b, c, d, f, g, h, i, j mark real nodes; "e" marks virtual nodes.
$\mathrm{parent}(i) = \lfloor (i-2)/k \rfloor + 1$
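The parent function is plain integer arithmetic, so finding an ancestor needs no tree traversal. In the sketch below `parent` follows the slide's formula with the root numbered 1; `ancestor` and `children` are helper names introduced here, and the child range is derived by inverting the formula rather than taken from the slides.

```python
def parent(uid: int, k: int) -> int:
    """UID of the parent of node uid in a k-ary complete tree whose root is 1."""
    if uid <= 1:
        raise ValueError("the root has no parent")
    return (uid - 2) // k + 1

def ancestor(uid: int, k: int, steps: int) -> int:
    """Apply the parent function steps times."""
    for _ in range(steps):
        uid = parent(uid, k)
    return uid

def children(uid: int, k: int) -> list:
    """UIDs of the k child slots (real or virtual) of node uid."""
    first = k * (uid - 1) + 2
    return list(range(first, first + k))

print(parent(32, 3))       # 11
print(ancestor(33, 3, 3))  # 1  (33 -> 11 -> 4 -> 1)
print(children(11, 3))     # [32, 33, 34]
```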
28. K-ary Table
- Each document is assigned k, which is the maximum number of sibling elements under any parent in the document tree.
- Each collection has a K-ary table, each entry of which holds the k of one document.
- Each result set has a pointer to the K-ary table.
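A sketch of how k could be derived from a parsed document and used to number its nodes. The sample document is hypothetical (it is not the tree from the slides), `max_fanout` and `assign_uids` are names introduced here, and the child numbering is the inverse of the parent function rather than something stated in the slides.

```python
import xml.etree.ElementTree as ET

def max_fanout(root):
    """k for one document: the largest number of children under any element."""
    return max(1, max(len(elem) for elem in root.iter()))

def assign_uids(root, k):
    """Map UID -> tag for every real node; unused child slots are the virtual nodes."""
    uids = {}
    pending = [(root, 1)]  # the root gets UID 1
    while pending:
        elem, uid = pending.pop()
        uids[uid] = elem.tag
        for j, child in enumerate(elem):
            pending.append((child, k * (uid - 1) + 2 + j))
    return uids

# Hypothetical document: a has 2 children, b has 3, c has 1.
doc = ET.fromstring("<a><b><d/><f/><g/></b><c><h/></c></a>")
k = max_fanout(doc)
uids = assign_uids(doc, k)

k_ary_table = {1: k}  # document number -> k; the document number is a placeholder
print(k, sorted(uids.items()))
```

Here UID 4 is never assigned: it is the virtual third child of the root.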
29. Level and Element Type Number
- Level
  - Level means the level in the document tree
  - It tells how many times the parent function must be applied to reach a target element
- Element type number
  - A unique number is assigned to each element type in the DTD (not to the elements in documents)
  - It makes it possible to filter out unnecessary elements and accumulate the correct frequencies
30. Level and Element Type Number (Cont'd)
- User query: Retrieve sections that contain "hypertext"
Figure (reconstructed from labels):
Level 1: chapter: hypertext(9) browser(1) internet(5) multimedia(5) java(7)
Level 2 (user level): title: hypertext(1) browser(1); section1: hypertext(8) internet(5) multimedia(5) java(7)
Level 3 (text level, index information): para1: hypertext(3) internet(3) multimedia(5); para2: hypertext(5) internet(2) java(7)
- The level difference tells how many times the parent function is applied.
- The element type number lets unnecessary index information be filtered out.
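31. Representing a Document Tree
Figure (reconstructed from labels): a document tree whose real nodes carry four-field labels; "e" marks virtual nodes. The fields appear to be <document number, UID, level, element type number>. The pairing of term lists with nodes below follows the order of the labels and is partly inferred.
Level 1: <5,1,1,1>
Level 2: <5,2,2,2> hypertext(1) model(1) retrieval(1) semantics(1); <5,3,2,3>; <5,4,2,3>
Level 3: <5,8,3,5> index(3) lexical(1) noun(4) stem(2); <5,9,3,5> document(4) index(3) precision(2) term(5); <5,11,3,6>; <5,12,3,6>
Level 4: <5,32,4,7> document(4) index(3) precision(1) term(5); <5,33,4,7> browser(2) hypertext(2) java(5) link(6); <5,35,4,7> anchor(2) browser(1) html(3) internet(5); <5,36,4,7> basian(3) inquiry(2) link(3) matrix(3)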
32. Query Evaluation
- Create accumulators at user level
  - Accumulators correspond to the elements at the user level
- Compute the TF (Term Frequency) and DF (Document Frequency) of a term
  - Sum all the term frequencies of the descendant elements into the corresponding accumulators.
  - The number of non-zero accumulators is the DF of the term.
- Calculate the term weight
- Compute the similarity of the elements and rank them
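The last two steps can be sketched once the accumulators are filled. The slides do not state the weighting or similarity formula, so the tf times log(1 + N/df) weight and the sum-of-weights score below are assumptions chosen only to make the steps concrete; `rank` is a hypothetical helper.

```python
import math

def rank(accumulators_by_term, num_user_elements):
    """Rank user-level elements for a query.

    accumulators_by_term: term -> {element id: accumulated TF}
    num_user_elements: number of elements at the user level (N)
    Weighting (tf * log(1 + N/df)) is an assumed formula, not taken from the slides.
    """
    scores = {}
    for term, acc in accumulators_by_term.items():
        df = sum(1 for tf in acc.values() if tf > 0)  # non-zero accumulators
        if df == 0:
            continue
        idf = math.log(1 + num_user_elements / df)
        for elem, tf in acc.items():
            if tf > 0:
                scores[elem] = scores.get(elem, 0.0) + tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Element ids and frequencies are illustrative.
ranking = rank({"browser": {11: 6, 12: 1}}, num_user_elements=2)
print([elem for elem, _ in ranking])  # [11, 12]
```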
33. Accumulating Term Frequency
Query: find sections containing "browser"
Figure (reconstructed from labels): subtree of the tree in slide 31, with one accumulator per section.
<5,1>
  <5,11,3,6>: accumulator <5,11> = 6
    <5,32,4,7> browser(4) index(3) precision(1) term(5)
    <5,33,4,7> browser(2) hypertext(2) java(5) link(6)
  <5,12,3,6>: accumulator <5,12> = 1
    <5,35,4,7> anchor(2) browser(1) html(3) internet(5)
    <5,36,4,7> basian(3) inquiry(2) link(3) matrix(3)
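The accumulator values on this slide follow from applying the parent function once per level of difference between the indexed leaf and the user level. The sketch below uses the "browser" frequencies from the slide with k = 3; reading each label as (document, UID, level, element type) is an assumption, and filtering by element type number is left out.

```python
from collections import defaultdict

K = 3  # the UIDs on the slide are consistent with a 3-ary tree

# Index entries for "browser": (document, UID, level, element type, frequency).
POSTINGS = [
    (5, 32, 4, 7, 4),
    (5, 33, 4, 7, 2),
    (5, 35, 4, 7, 1),
]

def parent(uid, k):
    """Parent UID in a k-ary complete tree whose root is 1."""
    return (uid - 2) // k + 1

def accumulate(postings, user_level, k):
    """Sum leaf frequencies into accumulators keyed by (document, ancestor UID)."""
    acc = defaultdict(int)
    for doc, uid, level, _etype, tf in postings:
        if level < user_level:
            continue  # entries above the user level cannot contribute
        for _ in range(level - user_level):
            uid = parent(uid, k)
        acc[(doc, uid)] += tf
    return dict(acc)

accumulators = accumulate(POSTINGS, user_level=3, k=K)
df = sum(1 for tf in accumulators.values() if tf > 0)
print(accumulators, df)  # {(5, 11): 6, (5, 12): 1} 2
```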
34. Performance Data (on Ultra Sparc 2)
- Index Overhead
- Retrieval time
  - Almost all single-term queries are evaluated within one second
35. Advantages of XRS
- Provides a variety of structural search functions.
- Less indexing overhead and quick retrieval time
- Easy to port
  - Java + native C code
  - C code is built as shared libraries
36. Alternative Architecture of XRS
Figure (labels only). Server side: Search Engine (shared library, JNI interface), Query Mediator Servlet, Rendering Component (servlet). Client side: Web browser, Applet (user interface). Flows: initiate, query, search result, XML result with XSL, HTML format.
37. Benefit and Problem
- Benefit
  - Simpler and easier to port than the current implementation
  - Does not need an independent Java process
- Problem
  - Current Java Servlet engines cannot run the shared libraries
  - Apache JServ, JRun and Jigsaw fail to run it!
38. Current Status
- Development of the content retrieval part is finished
  - Available on the Web at the end of August 1999.
  - http://dlb2.nlm.nih.gov/dwshin
- Structural retrieval part is in development
  - Will be finished soon.