Making XML Documents Searchable through the Web

Slides: 39

Provided by: ibib

Learn more at: http://www.ibiblio.org

Transcript and Presenter's Notes

1
Making XML Documents Searchable through the Web

Dongwook Shin
dwshin_at_nlm.nih.gov
Lister Hill National Center for Biomedical
Informatics
National Library of Medicine

2
Importance of XML Search Engine

More and more documents are beginning to be
provided in XML formats.
XML documents are supposed to have certain
structures
Current Web Search Engines do not provide
structural search capability

3
Searching Characteristics

Content Searching
Searching for certain words in the element
hierarchy
Retrieve CHAPTER whose TITLE contains servlet
and PARAGRAPH contains session.
Structural Searching
Searching for elements satisfying certain
relations
Retrieve SECTION that has at least two FIGUREs

4
Searching Characteristics (Cont'd)

Combined Searching
Content + Structural Searching
Retrieve SECTION that has TITLE containing XML
and contains at least one FIGURE.
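
The three kinds of search above can be tried on a small document with Python's standard `xml.etree.ElementTree`. This is only an illustrative sketch: the sample document and the helper names `contains` and `count` are made up here, and the brute-force tree walk is not how XRS evaluates queries.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names follow the slide examples.
SAMPLE = """
<BOOK>
  <CHAPTER>
    <TITLE>Java servlet basics</TITLE>
    <SECTION>
      <TITLE>XML output</TITLE>
      <PARAGRAPH>Each session keeps its own state.</PARAGRAPH>
      <FIGURE/>
      <FIGURE/>
    </SECTION>
    <SECTION>
      <TITLE>Logging</TITLE>
      <PARAGRAPH>No figures here.</PARAGRAPH>
    </SECTION>
  </CHAPTER>
</BOOK>
"""

def contains(elem, tag, word):
    """True if some descendant <tag> of elem has text containing word."""
    return any(word.lower() in "".join(d.itertext()).lower()
               for d in elem.iter(tag) if d is not elem)

def count(elem, tag):
    """Number of descendant <tag> elements below elem."""
    return sum(1 for d in elem.iter(tag) if d is not elem)

root = ET.fromstring(SAMPLE)

# Content search: CHAPTER whose TITLE contains "servlet" and PARAGRAPH contains "session".
content_hits = [c for c in root.iter("CHAPTER")
                if contains(c, "TITLE", "servlet") and contains(c, "PARAGRAPH", "session")]

# Structural search: SECTION that has at least two FIGUREs.
structural_hits = [s for s in root.iter("SECTION") if count(s, "FIGURE") >= 2]

# Combined search: SECTION whose TITLE contains "XML" and that has at least one FIGURE.
combined_hits = [s for s in root.iter("SECTION")
                 if contains(s, "TITLE", "XML") and count(s, "FIGURE") >= 1]
```

In each query the element being collected is the target and the filters are the search conditions.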

5
Other XML Search Engines on the Web

Most engines provide search in a fixed set of
fields.
Users cannot search in arbitrary elements of the
document hierarchy.
http://www.goxml.com
http://www.scoobs.com
http://www.xmlTree.com

6
XRS (XML Retrieval System)

Providing a variety of structural search
functions
Users can search in any elements in the document
hierarchy
Content + Structural Searching
Allowing less index overhead and quick retrieval
time
BUS (Bottom Up Scheme) is used
Applicable to valid documents, but not to
well-formed documents
Using DTD when making queries and retrieving
Examples are Shakespeare or Bible data

7
Architecture of XRS
8
User Interface (Initialization)
DTD can be browsed here
Query conditions appear here
Search results are shown here with similarity
value
9
Query Composition

Principle
Any element can be a target - the element to be
retrieved
Search conditions can be imposed on any elements
EXAMPLE
Retrieve SPEECH whose SPEAKER contains Hamlet
and LINE contains Denmark

Target: SPEECH
Search conditions: SPEAKER contains Hamlet, LINE contains Denmark
10
User Interface (Query Composition)
11
User Interface (with Search Results)
12
Browser Side

Show XML results

13
Browsing a List of Elements
14
XML Result
15
Query at Another Target Element

Retrieve SCENE whose TITLE contains Castle
and SPEAKER contains Horatio

16
XML Result
17
Query Mediator Servlet

Mediate the query and results
Convey the user query to the backend search
engine
Transmit the retrieved results to the applet or
the rendering component
Send the result sets with brief information to
the applet
Send the XML content with a proper XSL to the
rendering component so that it can transform the
content into HTML
Session tracking and result set reclamation
Track each session so that a user can keep using
it until he or she quits.
Periodically detect dead sessions and reclaim
the corresponding result sets.

18
Query Language

INIT
Get the DBs and their DTDs available in the
server
It is sent to the server when the applet is
initialized
SEARCH db_name search_cond
db_name is one of DBs available in the server
search_cond includes the target and search
conditions
PRES num
Get the XML results
num is the n-th result in the result set
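
A minimal sketch of how a server might parse these three commands. The one-line, whitespace-separated wire format and the names `Command` and `parse_command` are assumptions made for illustration. The slides give only the command names and arguments, so `search_cond` is kept as an opaque string.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Command:
    verb: str                          # "INIT", "SEARCH", or "PRES"
    db_name: Optional[str] = None      # SEARCH only
    search_cond: Optional[str] = None  # SEARCH only, kept as an opaque string
    num: Optional[int] = None          # PRES only

def parse_command(line: str) -> Command:
    """Parse one request line; the wire format itself is an assumption."""
    parts = line.strip().split(None, 2)
    if not parts:
        raise ValueError("empty command")
    verb = parts[0].upper()
    if verb == "INIT" and len(parts) == 1:
        return Command("INIT")
    if verb == "SEARCH" and len(parts) == 3:
        return Command("SEARCH", db_name=parts[1], search_cond=parts[2])
    if verb == "PRES" and len(parts) == 2 and parts[1].isdigit():
        return Command("PRES", num=int(parts[1]))
    raise ValueError(f"malformed command: {line!r}")
```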

19
Result Set

A result set is assigned to each session
Query Mediator does session tracking
Backend Search engine keeps multiple result sets
Multi-thread safe code is required
When a session is relinquished, the result set is
reclaimed
Garbage collection for the result set is required
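
The requirement for thread-safe handling of per-session result sets can be sketched as a lock-guarded table. The real backend is native C; the class `ResultSetStore` and its method names below are hypothetical and only illustrate the idea of one result set per session with explicit reclamation.

```python
import threading

class ResultSetStore:
    """Maps a session id to that session's result set, guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_session = {}

    def put(self, session_id, results):
        with self._lock:
            self._by_session[session_id] = list(results)

    def get(self, session_id, n):
        """Return the n-th result (1-based) of a session, or None."""
        with self._lock:
            results = self._by_session.get(session_id, [])
            return results[n - 1] if 1 <= n <= len(results) else None

    def release(self, session_id):
        """Reclaim the result set when its session is relinquished."""
        with self._lock:
            return self._by_session.pop(session_id, None) is not None

    def __len__(self):
        with self._lock:
            return len(self._by_session)
```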

20
The Content of a Result Set

DB_name
The name of the database where the search is
performed and the result is obtained
DB_path
The directory path from the root where the DB
resides
ptr_to_result_set
pointer to the dynamic arrays having the search
results
num_result
number of elements retrieved
ptr_to_K_ary_table
pointer to the table that keeps the k_ary
information for the DB
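
As a rough picture of the record described above, the fields can be written as a Python dataclass. Pointers become object references here, and the layout of one result entry (document number, UID, similarity) is an assumption, not something the slides specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ResultSet:
    db_name: str    # database where the search was performed
    db_path: str    # directory path from the root to the DB
    # Stands in for ptr_to_result_set; the tuple layout is assumed.
    results: List[Tuple[int, int, float]] = field(default_factory=list)
    # Stands in for ptr_to_K_ary_table; maps document number -> k.
    k_ary_table: Dict[int, int] = field(default_factory=dict)

    @property
    def num_result(self) -> int:
        """Number of elements retrieved."""
        return len(self.results)
```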

21
RS (Result Set) Management
Backend search engine
22
Periodical RS Reclamation
[Figure: message exchange between the Query Mediator and the backend search engine. Labels: session ends; alive sessions sent; RS indices to be reclaimed returned (j, ...); reclamation requested (j, ...); reclamation done. The i-th RS holds an actual result; the j-th RS is reclaimed.]
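
One possible reading of this exchange, sketched in Python: the mediator periodically sends the list of live sessions, and the backend frees every result set whose session is not on that list and reports the freed indices. The class and method names are hypothetical, and the exact message order in XRS may differ.

```python
class Backend:
    """Holds result sets in numbered slots; the slot number is the RS index."""

    def __init__(self):
        self.slots = {}          # RS index -> (session id, results)
        self._next = 0

    def store(self, session_id, results):
        # Keep one result set per session: drop any earlier slot first.
        for i, (sid, _) in list(self.slots.items()):
            if sid == session_id:
                del self.slots[i]
        index = self._next
        self._next += 1
        self.slots[index] = (session_id, results)
        return index

    def reclaim_dead(self, alive_sessions):
        """Free every slot whose session is not alive; return the freed indices."""
        alive = set(alive_sessions)
        dead = [i for i, (sid, _) in self.slots.items() if sid not in alive]
        for i in dead:
            del self.slots[i]
        return sorted(dead)

class QueryMediator:
    def __init__(self, backend):
        self.backend = backend
        self.sessions = {}       # session id -> RS index

    def search(self, session_id, results):
        self.sessions[session_id] = self.backend.store(session_id, results)

    def end_session(self, session_id):
        self.sessions.pop(session_id, None)

    def periodic_reclamation(self):
        """Send the alive sessions and learn which RS indices were reclaimed."""
        return self.backend.reclaim_dead(self.sessions.keys())
```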
23
Backend Search Engine

Less Indexing Overhead and Quick Retrieval
Use BUS (Bottom Up Scheme)
Most of the code is written in native C
Supports multithreading
Thread-safe C code
Compile the C code into a shared library
Save index information in files

24
BUS (Bottom Up Scheme)

Main Idea
Index only at the lowest level of the document
structure
Weight information at higher level is computed at
retrieval time
Benefits
Minimize the indexing overhead
Support term weight and full-blown structural
search
Guarantee quick retrieval time

25
Principle of BUS
Term frequency is computed at run time.
Indexing is performed at leaf nodes only.

Document tree with index terms (Bottom Up Scheme). Leaf entries are indexed; the other entries are the sums of their children:
chapter: hypertext(10) browser(4) internet(5) multimedia(5) java(7)
  section1 (leaf): hypertext(2) browser(4)
  section2: hypertext(8) internet(5) multimedia(5) java(7)
    para1 (leaf): hypertext(3) internet(3) multimedia(5)
    para2 (leaf): hypertext(5) internet(2) java(7)
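
The bottom-up principle can be checked with a few lines of Python using the numbers from this slide: only the leaves carry stored frequencies, and the totals for section2 and chapter are summed on demand. The dictionaries and the function name `term_frequencies` are illustrative, not part of XRS.

```python
from collections import Counter

# Leaf-level index from the slide: only these three elements are indexed.
LEAF_INDEX = {
    "section1": Counter(hypertext=2, browser=4),
    "para1": Counter(hypertext=3, internet=3, multimedia=5),
    "para2": Counter(hypertext=5, internet=2, java=7),
}

# Document structure: element -> children (names are those on the slide).
CHILDREN = {
    "chapter": ["section1", "section2"],
    "section2": ["para1", "para2"],
}

def term_frequencies(element):
    """Compute an element's term frequencies at retrieval time, bottom up."""
    if element in LEAF_INDEX:
        return Counter(LEAF_INDEX[element])
    total = Counter()
    for child in CHILDREN.get(element, []):
        total += term_frequencies(child)
    return total
```

Nothing is stored for section2 or chapter, which is where the saving in index size comes from.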
26
Key Issues in BUS

How to figure out the ancestor elements of a leaf
element efficiently?
How to accumulate the term frequency effectively?

27
UID (Unique element IDentifier)

Represent each document as a k-ary complete tree
and assign a UID to each node

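[Figure: result of assigning UIDs in a 3-ary complete tree. Real nodes (a, b, c, d, f, g, h, i, j) are padded with virtual nodes (shown as e) so that every node has k = 3 child slots.]
Parent of the node with UID i: $\mathrm{parent}(i) = \lfloor (i-2)/k \rfloor + 1$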
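
A short sketch of the UID arithmetic, assuming the root has UID 1, slots are numbered level by level, and the parent function is floor((i - 2) / k) + 1. The helper names `children` and `ancestor` are additions for illustration.

```python
def parent(uid, k):
    """UID of the parent in a k-ary complete tree whose root has UID 1."""
    if uid <= 1:
        raise ValueError("the root has no parent")
    return (uid - 2) // k + 1

def children(uid, k):
    """UIDs of the k child slots (real or virtual) of a node."""
    first = k * (uid - 1) + 2
    return list(range(first, first + k))

def ancestor(uid, k, steps):
    """Apply the parent function the given number of times."""
    for _ in range(steps):
        uid = parent(uid, k)
    return uid
```

Because the ancestor of any element follows from arithmetic on its UID, no parent pointers need to be stored in the index.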
28
K-ary table

Each document is assigned a value k, the maximum
number of siblings in its document tree.
Each collection has a K-ary table, each entry
of which records k for one document.
Each result set has a pointer to the K-ary table.
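
A sketch of how k and the UIDs might be derived for each document, reading "maximum number of siblings" as the largest number of children under any element. The two sample documents and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def max_fanout(root):
    """k for one document: the largest number of children under any element."""
    return max((len(e) for e in root.iter()), default=0)

def assign_uids(root, k):
    """Return {uid: element}, numbering child slots as in a k-ary complete tree."""
    uids = {}
    stack = [(1, root)]
    while stack:
        uid, elem = stack.pop()
        uids[uid] = elem
        first = k * (uid - 1) + 2       # UID of the first child slot
        for offset, child in enumerate(elem):
            stack.append((first + offset, child))
    return uids

# Hypothetical collection: document number -> XML text.
docs = {
    1: "<a><b><d/></b><c><f/><g/></c></a>",
    2: "<a><b/><c/><d/></a>",
}
k_ary_table = {n: max_fanout(ET.fromstring(x)) for n, x in docs.items()}
```

UIDs that never appear in the returned mapping correspond to virtual nodes.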

29
Level and Element Type Number

Level
Level means the level in the document tree
It tells how many times the parent function must be
applied to reach a target element
Element type number
A unique number is assigned to each element type
in the DTD (not to the elements in documents)
It makes it possible to filter out unnecessary
elements and accumulate the correct frequencies

30
Level and Element Type Number (Cont'd)

User query: Retrieve sections that contain "hypertext"

Level 1
  chapter: hypertext(9) browser(1) internet(5) multimedia(5) java(7)
Level 2 (user level)
  title: hypertext(1) browser(1)
  section1: hypertext(8) internet(5) multimedia(5) java(7)
Level 3 (text level, index information)
    para1: hypertext(3) internet(3) multimedia(5)
    para2: hypertext(5) internet(2) java(7)

The level difference tells how many times the parent function is applied.
The element type number lets unnecessary index information be filtered out.
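
The two roles can be sketched with the numbers from this slide. Everything structural here is an assumption made for illustration: the `Posting` layout, the element type numbers, the UIDs (chosen to fit a 3-ary tree with chapter = 1, title = 2, section1 = 3, para1 = 8, para2 = 9), and the table `LEAF_TYPES_UNDER`, which stands in for whatever the DTD says about which leaf types can occur under a target type.

```python
from collections import Counter, namedtuple

# One index entry per (leaf element, term); the field layout is an assumption.
Posting = namedtuple("Posting", "uid level etype term freq")

K = 3
TITLE, SECTION, PARA = 2, 3, 5       # hypothetical element type numbers
# Hypothetical DTD-derived table: leaf types that may occur under a target type.
LEAF_TYPES_UNDER = {SECTION: {PARA}}

POSTINGS = [
    Posting(2, 2, TITLE, "hypertext", 1),   # title, level 2
    Posting(8, 3, PARA, "hypertext", 3),    # para1, level 3
    Posting(9, 3, PARA, "hypertext", 5),    # para2, level 3
]

def parent(uid, k):
    return (uid - 2) // k + 1

def accumulate(term, target_type, user_level):
    """Sum leaf frequencies into accumulators keyed by user-level UID."""
    acc = Counter()
    for p in POSTINGS:
        if p.term != term or p.etype not in LEAF_TYPES_UNDER[target_type]:
            continue                            # filtered out by element type number
        uid = p.uid
        for _ in range(p.level - user_level):   # level difference = parent() calls
            uid = parent(uid, K)
        acc[uid] += p.freq
    return acc
```

With these inputs the title's entry is discarded and both paragraphs climb one level to section1, giving it hypertext(8) as on the slide.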
31
Representing a Document Tree
[Figure: document tree drawn as a 3-ary tree; e marks a virtual node. Indentation shows parent and child.]
<5,1,1,1>
  <5,2,2,2>: hypertext(1) model(1) retrieval(1) semantics(1)
  <5,3,2,3>
    <5,8,3,5>: index(3) lexical(1) noun(4) stem(2)
    <5,9,3,5>: document(4) index(3) precision(2) term(5)
  <5,4,2,3>
    <5,11,3,6>
      <5,32,4,7>: document(4) index(3) precision(1) term(5)
      <5,33,4,7>: browser(2) hypertext(2) java(5) link(6)
    <5,12,3,6>
      <5,35,4,7>: anchor(2) browser(1) html(3) internet(5)
      <5,36,4,7>: basian(3) inquiry(2) link(3) matrix(3)
32
Query Evaluation

Create accumulators at user level
Accumulators correspond to the elements at the
user level
Compute the TF (Term Frequency) and DF (Document
Frequency) of a term
Sum all the term frequencies of the
descendant elements into the corresponding
accumulators.
The number of non-zero accumulators is the DF of
the term.
Calculate the term weight
Compute the similarity of the elements and rank

33
Accumulating Term Frequency
Query: find sections containing "browser"
Subtree of the tree in slide 31 (Representing a Document Tree)

Accumulators at the user level:
<5,11>: 6
<5,12>: 1

<5,1>
  . . . .
  <5,11,3,6>
    <5,32,4,7>: browser(4) index(3) precision(1) term(5)
    <5,33,4,7>: browser(2) hypertext(2) java(5) link(6)
  <5,12,3,6>
    <5,35,4,7>: anchor(2) browser(1) html(3) internet(5)
    <5,36,4,7>: basian(3) inquiry(2) link(3) matrix(3)
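
The accumulator values on this slide can be reproduced in a few lines. The sketch reads each tuple as (document number, UID, level, element type number), which is an interpretation of the slide, and assumes a 3-ary tree with parent(i) = floor((i - 2) / k) + 1.

```python
from collections import Counter

K = 3                 # the tree on the slide is read as 3-ary
USER_LEVEL = 3        # sections sit at level 3

# (doc number, UID, level, element type) -> term frequencies, as on the slide.
LEAVES = {
    (5, 32, 4, 7): {"browser": 4, "index": 3, "precision": 1, "term": 5},
    (5, 33, 4, 7): {"browser": 2, "hypertext": 2, "java": 5, "link": 6},
    (5, 35, 4, 7): {"anchor": 2, "browser": 1, "html": 3, "internet": 5},
    (5, 36, 4, 7): {"basian": 3, "inquiry": 2, "link": 3, "matrix": 3},
}

def parent(uid, k=K):
    return (uid - 2) // k + 1

def accumulate(term):
    """Sum a term's leaf frequencies into accumulators at the user level."""
    acc = Counter()
    for (doc, uid, level, _etype), freqs in LEAVES.items():
        tf = freqs.get(term, 0)
        if tf == 0:
            continue
        for _ in range(level - USER_LEVEL):   # climb from text level to user level
            uid = parent(uid)
        acc[(doc, uid)] += tf
    return acc

acc = accumulate("browser")
df = len(acc)         # number of non-zero accumulators
```

The result is 6 for <5,11> and 1 for <5,12>, so the DF of "browser" at the section level is 2.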
34
Performance Data (on an Ultra SPARC 2)

Index Overhead
Retrieval time
Almost all single-term queries are evaluated
within one second

35
Advantage of XRS

Provides a variety of structural search
functions.
Less indexing overhead and quick retrieval time
Easy to port
Java + native C code
C code is made as shared libraries

36
Alternative Architecture of XRS
[Figure: alternative architecture.
Client side: Web browser, applet (user interface).
Server side: search engine (shared library), JNI interface, servlet, rendering component, Query Mediator Servlet.
Labeled flows: initiate, query, search result, XML result with XSL, HTML format.]
37
Benefit and Problem