Querying Structured Text in an XML Database

1
Querying Structured Text in an XML Database

By
Xuemei Luo

2
Introduction

Data retrieval (DR)
provides a means to formulate queries based on
exact matches of data.
Information retrieval (IR)
is based on the notion of relevance of documents
within a document collection.

3
Introduction

Traditional databases (XML)
efficiently deal with data retrieval
not good at dealing with information retrieval
XML provides a unified view to all kinds of
structured and semi-structured data as well as
loosely structured documents.
It is important to integrate information
retrieval into standard database query.

4
Introduction

Relevance ranking
it is central to information retrieval
it becomes more complex in XML

5
Introduction

An algebra called TIX for querying Text In XML
was developed to integrate information retrieval
techniques into a standard database query
evaluation engine.
New evaluation strategies were also developed to
obtain good performance.

6
articles.xml

<article>                                  <!-- a1 -->
  <article-title>                          <!-- a2 -->
    Internet Technologies
  </article-title>
  <author id="first">                      <!-- a3 -->
    <fname>Jane</fname>                    <!-- a4 -->
    <sname>Doe</sname>                     <!-- a5 -->
  </author>
  <chapter>                                <!-- a10 -->
    <ct>Search and Retrieval</ct>          <!-- a11 -->
    <section>                              <!-- a12 -->
      <section-title>                      <!-- a13 -->
        Search Engine Basics
      </section-title>
      ...
    </section>
    <section>                              <!-- a14 -->
      <section-title>                      <!-- a15 -->
        Information Retrieval Techniques
      </section-title>...
    </section>
    <section>                              <!-- a16 -->
      <section-title>Examples</section-title>   <!-- a17 -->
      <p> ... Here are some IR based
          search engines ... </p>          <!-- a18 -->
      <p> ... search engine NewsInEssence
          uses a new information retrieval
          technology ... </p>              <!-- a19 -->
      <p> ... semantic information retrieval
          techniques are also being incorporated
          into some search engines ... </p>   <!-- a20 -->
    </section>
  </chapter>
</article>

Figure 1: Example XML Database
7
Query 1 (simple IR-style query):
Find document components in articles.xml that are
about "search engine". Relevance to "internet"
and "information retrieval" is desirable but not
necessary.
Query 2 (structured IR-style query):
Find document components in articles.xml that
are part of an article written by an author with
last name "Doe" and are about "search engine".
Relevance to "internet" and "information
retrieval" is desirable but not necessary.
Figure 2: Example IR-style Queries
8
Motivation

Problems of a boolean specification
OR: retrieves components relevant only to the two
secondary terms but not to the primary term
(a15).
AND: loses the relevant paragraph (a18).
AND and OR: it is hard to determine a suitable
query expression applicable to all possible
database instances.
Weighting and ranking support in the boolean
query engine is required.

9
Algebra - scored data tree

Definition
It is a rooted ordered tree, such that each node
carries data in the form of a set of
attribute-value pairs, including at least a tag
and a real number valued score. The score of the
tree is the score of the root node.
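
The definition above can be sketched as a small data structure. This is only an illustration: the class name ScoredNode, its field names, and the sample scores are assumptions made for the example, not names or values from the presentation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScoredNode:
    """One node of a scored data tree (illustrative names)."""
    tag: str                      # every node carries at least a tag ...
    score: float = 0.0            # ... and a real-valued score
    attrs: Dict[str, str] = field(default_factory=dict)         # other attribute-value pairs
    children: List["ScoredNode"] = field(default_factory=list)  # ordered children


def tree_score(root: ScoredNode) -> float:
    """The score of a tree is the score of its root node."""
    return root.score


# Hypothetical section with two scored paragraphs.
section = ScoredNode(
    "section",
    0.8,
    children=[ScoredNode("p", 0.6), ScoredNode("p", 1.0)],
)
```

Keeping children in a list preserves the document order that an ordered tree requires.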

10
Algebra - scored pattern tree

Definition
It is a triple P = (T, F, S), where T = (V, E) is
a node and edge labeled tree
each node in V has a distinct integer as its
label.
each edge is labeled pc (for parent-child
relationship), ad (for
ancestor-descendant relationship), or ad* (for
self-or-descendant
relationship).
F is a formula: a boolean combination of
predicates applicable to nodes.
S is a set of scoring functions specifying how
to calculate the score of each node.
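
One way to picture the triple is as a plain record with a validity check. The sketch below is an assumption-laden illustration: the class ScoredPatternTree, its field names, the "ad*" spelling of the self-or-descendant label, and the sample pattern are all invented for the example and do not reproduce Figure 3.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Edge labels: parent-child, ancestor-descendant, self-or-descendant.
EDGE_LABELS = {"pc", "ad", "ad*"}


@dataclass
class ScoredPatternTree:
    nodes: List[int]                             # V: distinct integer labels
    edges: List[Tuple[int, int, str]]            # E: (parent, child, edge label)
    formula: Callable[[Dict[int, dict]], bool]   # F: boolean combination of predicates
    scoring: Dict[int, Callable[..., float]]     # S: scoring function per pattern node

    def validate(self) -> None:
        """Raise ValueError if the structural rules of the definition are broken."""
        if len(set(self.nodes)) != len(self.nodes):
            raise ValueError("node labels must be distinct")
        for parent, child, label in self.edges:
            if label not in EDGE_LABELS:
                raise ValueError(f"unknown edge label: {label}")
            if parent not in self.nodes or child not in self.nodes:
                raise ValueError("edge refers to an unknown node")


# Hypothetical pattern: node 1 is an article, node 2 a surname below it,
# node 3 any component at or below the article that gets a relevance score.
pattern = ScoredPatternTree(
    nodes=[1, 2, 3],
    edges=[(1, 2, "ad"), (1, 3, "ad*")],
    formula=lambda b: b[1]["tag"] == "article"
    and b[2]["tag"] == "sname"
    and b[2]["text"] == "Doe",
    scoring={3: lambda text: float(text.lower().count("search engine"))},
)
pattern.validate()
```

The formula takes a binding from pattern-node labels to data-node properties, so the value-based constraints stay separate from the structural ones carried by the edges.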

11
Figure 3 Scored Pattern Tree for Query 2
12
Scored pattern tree

Nodes are constrained in the normal ways
the pattern imposes structural requirements on
the nodes.
the formula imposes value-based constraints.
the scoring function defines how the scores of
nodes are calculated.

13
Scored pattern tree

Primary IR-node
A node whose score is defined by a scoring
function and to which relevance finding is
applied directly
Secondary IR-node
A node that has primary IR-nodes in its sub-tree
or
A node whose score is defined by a scoring
function based on the scores of other IR-nodes.

14
Extension of existing operators

Scored selection
Scored projection

15
Scored selection

Input: data trees
Parameter: a scored pattern tree
Output: scored data trees
Each scored data tree matches the scored pattern
tree
The score of each data IR-node is calculated
using the corresponding scoring function

16
Figure 5: Three Representative Result Trees of
Query 2 with Selection
The figure shows three of the results obtained by
applying Query 2 to the example database in
Figure 1. The scores of the IR-nodes are
calculated using the functions defined in
Figure 9 and are shown in square brackets.
17
Scored projection

Input: data trees
Parameters: a scored pattern tree, a projection
list PL
Output: scored data trees
Nodes that do not match the scored pattern tree
or are not listed in PL are eliminated from
the output.
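
A minimal sketch of the elimination step, under stated assumptions: the pattern match has already been computed and is passed in as a mapping from data node ids to pattern-node labels, and a retained node is re-attached to its nearest retained ancestor. The names Node, project, and matched, and the labels used in the example, are hypothetical and do not come from Figure 3 or Figure 6.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    node_id: str
    tag: str
    score: float = 0.0
    children: List["Node"] = field(default_factory=list)


def project(node: Node, matched: Dict[str, int], pl: Set[int]) -> List[Node]:
    """Return the retained forest rooted at or below `node`.

    A node is retained only if it matched a pattern node whose label is in
    the projection list `pl`. Retained descendants of a dropped node are
    passed up to the nearest retained ancestor (an assumption of this sketch).
    """
    kept_below: List[Node] = []
    for child in node.children:
        kept_below.extend(project(child, matched, pl))
    if matched.get(node.node_id) in pl:
        return [Node(node.node_id, node.tag, node.score, kept_below)]
    return kept_below


# Hypothetical data tree and match, loosely shaped like the example database.
tree = Node("a1", "article", children=[
    Node("a3", "author", children=[Node("a5", "sname")]),
    Node("a10", "chapter", 0.9, children=[
        Node("a16", "section", 0.8, children=[Node("a18", "p", 0.7)]),
    ]),
])
matched = {"a1": 1, "a5": 2, "a10": 4, "a16": 4, "a18": 4}
result = project(tree, matched, pl={1, 4})
```

Here the author subtree disappears: a3 matched nothing, and a5 matched a pattern node that is not in the projection list.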

18
Figure 6: Result Tree of Query 2 with Projection,
PL = {1, 3, 4}
19
New operators

Threshold
Pick

20
Threshold

Input: scored data trees
Parameters: a scored pattern tree P, a threshold
condition TC.
TC is either a real number value V or an integer
K.

21
Threshold

The output scored data trees satisfy one of the
following, depending on TC
if TC is a value V: at least one data IR-node
matching the query IR-node in the result data
tree has a score higher than V.
if TC is a rank K: at least one data IR-node is
ranked within the top K, where the rank is
obtained by sorting the data IR-nodes by score.
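
The two forms of the threshold condition can be sketched as follows. This is a simplified illustration, not the operator's actual implementation: each result tree is reduced to the list of scores of its data IR-nodes that match the query IR-node, a float argument is treated as the value V, an int as the rank K, and ties at the K-th score are all kept. These conventions are assumptions of the sketch.

```python
from typing import List, Sequence, Union

# A result tree, reduced to the scores of its matching data IR-nodes.
ResultTree = Sequence[float]


def threshold(trees: List[ResultTree], tc: Union[float, int]) -> List[ResultTree]:
    """Keep trees that satisfy the threshold condition `tc`.

    float -> value V: some IR-node score must be strictly higher than V.
    int   -> rank K:  some IR-node must be among the K best-scored IR-nodes
                      of the whole input (ties at the cutoff are kept).
    """
    if isinstance(tc, float):
        return [t for t in trees if any(s > tc for s in t)]

    all_scores = sorted((s for t in trees for s in t), reverse=True)
    if tc <= 0 or not all_scores:
        return []
    cutoff = all_scores[min(tc, len(all_scores)) - 1]  # score of the K-th ranked node
    return [t for t in trees if any(s >= cutoff for s in t)]


# Hypothetical scores for three result trees.
trees = [[0.9, 0.2], [0.5], [0.7, 0.6]]
by_value = threshold(trees, 0.6)
top_one = threshold(trees, 1)
```

With TC = 1 only the tree containing the single best-scored IR-node survives, which is how a "top result" query can be expressed.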

22
Pick

Input: scored data trees
Parameters: a scored pattern tree, a
pick-criterion PC
It is the key operator for removing redundancy

23
Pick

Pick is different from projection
Projection only needs information local to the
node being projected, e.g., the tag name.
Pick needs information that may reside elsewhere
in the data tree, e.g., the ancestor nodes.
Pick operator is usually applied after the
projection operator to eliminate the redundancy

24
Figure 8 Result of Query 2 with Projection
Followed by Pick

PC condition (PickFoo)
any data IR-node with a score of at least 0.8 is
considered relevant
for any data IR-node (starting with the one
highest in the tree hierarchy), if more than 50%
of its child nodes are relevant and
its direct parent node is not picked or it has
no parent node, then the data IR-node
is picked (parent/child redundancy elimination).
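
The criterion above can be read as a top-down traversal. The following is a hedged sketch of that reading, not the PickFoo function from Figure 9: the names IRNode, passes_child_test, and pick are hypothetical, and the treatment of leaf nodes (a leaf passes the child test when it is itself relevant) is an assumption added so that isolated relevant paragraphs can still be returned.

```python
from dataclasses import dataclass, field
from typing import List

RELEVANCE_THRESHOLD = 0.8  # a score of at least 0.8 counts as relevant


@dataclass
class IRNode:
    node_id: str
    score: float
    children: List["IRNode"] = field(default_factory=list)


def is_relevant(node: IRNode) -> bool:
    return node.score >= RELEVANCE_THRESHOLD


def passes_child_test(node: IRNode) -> bool:
    """More than 50% of the child nodes are relevant."""
    if not node.children:
        # ASSUMPTION: the slide does not cover leaves; use the leaf's own relevance.
        return is_relevant(node)
    relevant = sum(1 for child in node.children if is_relevant(child))
    return relevant > 0.5 * len(node.children)


def pick(root: IRNode) -> List[str]:
    """Visit nodes top-down; pick a node only if its direct parent was not picked."""
    picked: List[str] = []

    def visit(node: IRNode, parent_picked: bool) -> None:
        node_picked = (not parent_picked) and passes_child_test(node)
        if node_picked:
            picked.append(node.node_id)
        for child in node.children:
            visit(child, node_picked)

    visit(root, False)
    return picked


# Hypothetical chapter with two sections and five paragraphs.
chapter = IRNode("chapter", 0.9, [
    IRNode("s1", 0.85, [IRNode("p1", 0.9), IRNode("p2", 0.85)]),
    IRNode("s2", 0.3, [IRNode("p3", 0.9), IRNode("p4", 0.1), IRNode("p5", 0.2)]),
])
picked = pick(chapter)
```

In this example section s1 is returned instead of its two paragraphs, while paragraph p3 is returned on its own because its section is mostly irrelevant.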

25
Figure 9 Example User Functions
26
Example

Using the example database and scored pattern
tree, to obtain the top result (a10)
Projection: generates Figure 6
Pick: generates Figure 8
Selection: generates a collection of five trees
corresponding to the five primary data IR-nodes.
Threshold: selects the highest scored result. The
subtree rooted at a10 can then be retrieved.

27
Extension of XQuery
Figure 10 XQuery Expression of IR-style Queries
28
Access methods

Score generating methods
TermJoin
PhraseFinder
Score modifying methods
Score utilizing methods

29
Score generating methods

More than one term for relevance scoring
Term matching is the most common IR predicate. A
node is scored based on how many terms it has in
itself and its descendant nodes.
Phrase matching
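
As a rough illustration of term matching, a node's score can be taken as the number of query-term occurrences in its own text plus those in its descendants. The unweighted count, whitespace tokenization, and the names TextNode and term_score are assumptions of this sketch; the actual scoring functions are the ones in Figure 9, which are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class TextNode:
    tag: str
    text: str = ""
    children: List["TextNode"] = field(default_factory=list)


def term_score(node: TextNode, terms: Sequence[str]) -> int:
    """Count query-term occurrences in the node itself and in all descendants."""
    words = node.text.lower().split()
    own = sum(words.count(term.lower()) for term in terms)
    return own + sum(term_score(child, terms) for child in node.children)


# Hypothetical section: a title and one paragraph.
section = TextNode("section", children=[
    TextNode("section-title", "Search Engine Basics"),
    TextNode("p", "a search engine ranks pages"),
])
score = term_score(section, ["search", "engine"])
```

Because the count is aggregated upward, an ancestor never scores lower than any of its descendants under this scheme, which is one reason redundancy elimination is needed later.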

30
Score generating methods

TermJoin algorithm
Implement score generation based on term matching
Find all ancestors that are common among the
terms in a query.
Terms are read from an inverted index.
PhraseFinder algorithm
Use word offset information in the index to
verify phrase occurrence.
Use phrase occurrences to generate appropriate
score values.
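
The idea of verifying phrases with word offsets can be sketched as below. This is not the PhraseFinder algorithm itself, only a minimal illustration of the offset check: the index layout (term mapped to a set of (node id, offset) postings), the function names, and the sample texts are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Posting = Tuple[str, int]  # (node id, word offset within that node's text)


def build_index(docs: Dict[str, str]) -> Dict[str, Set[Posting]]:
    """Build a toy inverted index that records word offsets."""
    index: Dict[str, Set[Posting]] = defaultdict(set)
    for node_id, text in docs.items():
        for offset, word in enumerate(text.lower().split()):
            index[word].add((node_id, offset))
    return index


def phrase_counts(index: Dict[str, Set[Posting]], phrase: str) -> Dict[str, int]:
    """Count phrase occurrences per node by checking consecutive offsets."""
    words = phrase.lower().split()
    if not words:
        return {}
    counts: Dict[str, int] = defaultdict(int)
    for node_id, start in index.get(words[0], ()):
        # The i-th word of the phrase must occur at offset start + i in the same node.
        if all((node_id, start + i) in index.get(w, ())
               for i, w in enumerate(words[1:], start=1)):
            counts[node_id] += 1
    return dict(counts)


# Hypothetical node texts.
index = build_index({
    "n1": "a search engine is not a search box",
    "n2": "engine search",
})
counts = phrase_counts(index, "search engine")
```

Node n2 contains both words but in the wrong order, so it is not counted; the per-node counts could then feed whatever scoring function is in use.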

31
Score modifying methods

Example: Consider the value join access method.
It takes in two sets of scored witness trees and
outputs a set of scored witness trees where each
witness tree is the merging of two input witness
trees that satisfy the join condition.

c is the join condition
A and B are the non-scored versions of input
sets A and B.
s is a score assigned to an output tree x.

32
Score utilizing methods

Properties of PC condition
A notion of relevance score threshold for data
IR-nodes in the input collection.
Removing the redundancy either along the ad
relationship or along the sibling relationship.
Challenge of ad redundancy
Need to examine all nodes
Pick algorithm
use a stack-based strategy to eliminate redundancy

33
Figure 12 Algorithm Pick
34
Experiment evaluation

To evaluate the performance of the new
access methods
Use an XML database system
Run each experiment five times
Ignore the lowest and the highest readings, and
average the remaining three
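
The measurement procedure above amounts to a trimmed mean over five runs. A small sketch, with a hypothetical function name and made-up timings:

```python
from typing import Sequence


def trimmed_mean(readings: Sequence[float]) -> float:
    """Drop the lowest and highest of five readings and average the other three."""
    if len(readings) != 5:
        raise ValueError("expected exactly five readings")
    middle = sorted(readings)[1:-1]  # discard one minimum and one maximum
    return sum(middle) / len(middle)


# Made-up timings in seconds; the outlier 10.0 is discarded.
result = trimmed_mean([2.0, 10.0, 3.0, 4.0, 1.0])
```

Discarding the extremes reduces the influence of one-off effects such as a cold cache on the reported time.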

35
Experiment evaluation

TermJoin and PhraseFinder
improve performance by a factor of two
Pick
efficiently eliminate the redundancy

36
Table 1 Performance (in seconds) of the
different techniques using queries with
different number of terms
37
Table 2 Performance (in seconds) of the
PhraseFinder and Composite of Access Methods
38
Conclusion

A new algebra TIX has been developed to integrate
information retrieval into standard database
query
Advantages of TIX
Manage the relevance score
Manage result granularity
New access methods have been developed to
manipulate scores, and they effectively improve
the performance.

39
Q&A
