Title: Querying Structured Text in an XML Database
2. Introduction
- Data retrieval (DR)
  - provides means to formulate queries based on exact matches of data.
- Information retrieval (IR)
  - based on the notion of relevance of documents within a document collection.
3. Introduction
- Traditional databases (XML)
  - efficiently deal with data retrieval
  - not good at dealing with information retrieval
- XML provides a unified view of all kinds of structured and semi-structured data as well as loosely structured documents.
- It is important to integrate information retrieval into standard database querying.
4. Introduction
- Relevance ranking
  - it is central to information retrieval
  - it becomes more complex in XML
5. Introduction
- An algebra called TIX, for querying Text In XML, was developed to integrate information retrieval techniques into a standard database query evaluation engine.
- New evaluation strategies were also developed to obtain good performance.
6. articles.xml
<article>                                              a1
  <article-title>                                      a2
    Internet Technologies
  </article-title>
  <author id="first">                                  a3
    <fname>Jane</fname>                                a4
    <sname>Doe</sname>                                 a5
  </author>
  <chapter>                                            a10
    <ct>Search and Retrieval</ct>                      a11
    <section>                                          a12
      <section-title>                                  a13
        Search Engine Basics
      </section-title>
      ...
    </section>
    <section>                                          a14
      <section-title>                                  a15
        Information Retrieval Techniques
      </section-title> ...
    </section>
    <section>                                          a16
      <section-title>Examples</section-title>          a17
      <p> ... Here are some IR based
        search engines ... </p>                        a18
      <p> ... search engine NewsInEssence
        uses a new information retrieval
        technology ... </p>                            a19
      <p> ... semantic information retrieval
        techniques are also being incorporated into
        some search engines ... </p>                   a20
    </section>
  </chapter>
</article>
Figure 1: Example XML Database
7. Query 1 (simple IR-style query): Find document components in articles.xml that are about "search engine". Relevance to "internet" and "information retrieval" is desirable but not necessary.
Query 2 (structured IR-style query): Find document components in articles.xml that are part of an article written by an author with last name "Doe" and are about "search engine". Relevance to "internet" and "information retrieval" is desirable but not necessary.
Figure 2: Example IR-style Queries
8. Motivation
- Problems of a boolean specification
  - OR: retrieves components relevant only to the two secondary terms but not to the primary term (a15).
  - AND: loses the relevant paragraph (a18).
  - AND and OR: it is hard to determine a suitable query expression applicable to all possible database instances.
- Weighting and ranking support in the boolean query engine is required.
9. Algebra - scored data tree
- Definition
  - A scored data tree is a rooted ordered tree in which each node carries data in the form of a set of attribute-value pairs, including at least a tag and a real-valued score. The score of the tree is the score of its root node.
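The definition above can be sketched as a small Python data structure. This is only an illustration: the names ScoredNode and tree_score and the example scores are assumptions, and only the node labels a10 and a16 are borrowed from Figure 1.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScoredNode:
    """One node of a scored data tree: a tag, a score, and ordered children."""
    tag: str
    score: float = 0.0                                   # real-valued score
    attrs: Dict[str, str] = field(default_factory=dict)  # other attribute-value pairs
    children: List["ScoredNode"] = field(default_factory=list)  # order is significant


def tree_score(root: ScoredNode) -> float:
    """The score of a tree is the score of its root node."""
    return root.score


# Made-up scores, for illustration only.
section = ScoredNode("section", score=0.9, attrs={"label": "a16"})
chapter = ScoredNode("chapter", score=0.6, attrs={"label": "a10"}, children=[section])
print(tree_score(chapter))
```

Note that the tree score ignores the scores of descendants; any aggregation has to happen when the root's own score is computed.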
10. Algebra - scored pattern tree
- Definition
  - A scored pattern tree is a triple P = (T, F, S), where T = (V, E) is a node- and edge-labeled tree:
    - each node in V has a distinct integer as its label.
    - each edge is labeled pc (for parent-child relationship), ad (for ancestor-descendant relationship), or ad* (for self-or-descendant relationship).
  - F is a formula: a boolean combination of predicates applicable to nodes.
  - S is a set of scoring functions specifying how to calculate the score of each node.
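One possible way to hold the triple in code is sketched below. The class name, the dictionary-based encoding of edges, and the example pattern are all assumptions for illustration; the pattern is loosely in the spirit of Query 2 and is not a copy of Figure 3.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Edge labels named in the definition: parent-child, ancestor-descendant,
# and self-or-descendant.
EDGE_LABELS = {"pc", "ad", "ad*"}


@dataclass
class ScoredPatternTree:
    nodes: List[int]                              # V: distinct integer labels
    edges: Dict[Tuple[int, int], str]             # E: (parent, child) -> edge label
    formula: Callable[[Dict[int, dict]], bool]    # F: predicate over a node binding
    scoring: Dict[int, Callable[[dict], float]]   # S: scoring function per node label

    def validate(self) -> None:
        """Check the structural constraints stated in the definition."""
        if len(set(self.nodes)) != len(self.nodes):
            raise ValueError("node labels must be distinct")
        for (parent, child), label in self.edges.items():
            if label not in EDGE_LABELS:
                raise ValueError(f"unknown edge label: {label}")
            if parent not in self.nodes or child not in self.nodes:
                raise ValueError("edge refers to an unknown node")


# Hypothetical pattern: article (1) / author (2) / sname (3), plus an IR-node (4).
pattern = ScoredPatternTree(
    nodes=[1, 2, 3, 4],
    edges={(1, 2): "pc", (2, 3): "pc", (1, 4): "ad*"},
    formula=lambda b: b[1]["tag"] == "article" and b[3]["content"] == "Doe",
    # Toy scoring function (an assumption): count occurrences of the phrase.
    scoring={4: lambda node: float(node["content"].lower().count("search engine"))},
)
pattern.validate()

# A binding maps each pattern node label to the data node it matched.
binding = {
    1: {"tag": "article", "content": ""},
    2: {"tag": "author", "content": ""},
    3: {"tag": "sname", "content": "Doe"},
    4: {"tag": "p", "content": "a search engine is a search tool"},
}
print(pattern.formula(binding), pattern.scoring[4](binding[4]))
```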
11. Figure 3: Scored Pattern Tree for Query 2
12. Scored pattern tree
- Nodes are constrained in the normal ways
  - the pattern imposes structural requirements on the nodes.
  - the formula imposes value-based constraints.
  - the scoring function defines how the scores of nodes are calculated.
13. Scored pattern tree
- Primary IR-node
  - A node that is defined by a scoring function and to which relevance finding is applied.
- Secondary IR-node
  - A node that has primary IR-nodes in its sub-tree, or
  - A node defined by a scoring function based on the scores of other IR-nodes.
14. Extension of existing operators
- Scored selection
- Scored projection
15. Scored selection
- Input: data trees
- Parameter: a scored pattern tree
- Output: scored data trees
  - Each scored data tree matches the scored pattern tree.
  - The score of each data IR-node is calculated using the corresponding scoring function.
16. Figure 5: Three Representative Result Trees of Query 2 with Selection
The figure shows three of the results obtained by applying Query 2 to the example database in Figure 1. The scores of the IR-nodes are calculated using the functions defined in Figure 9 and are indicated in square brackets.
17. Scored projection
- Input: data trees
- Parameters: a scored pattern tree, a projection list PL
- Output: scored data trees
  - Nodes that do not match the scored pattern tree, or that are not preserved in the PL, are eliminated from the output.
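A simplified sketch of the elimination step is given below. It assumes pattern matching has already been done and that each data node records the label of the pattern node it matched (the match field, a hypothetical convention). It also assumes that the surviving descendants of a dropped node are reattached to the nearest kept ancestor; the slide does not spell this out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    tag: str
    score: float = 0.0
    match: Optional[int] = None   # label of the matched pattern node, if any
    children: List["Node"] = field(default_factory=list)


def project(node: Node, pl: Set[int]) -> List[Node]:
    """Return the projected forest for the subtree rooted at `node`."""
    kept_children: List[Node] = []
    for child in node.children:
        kept_children.extend(project(child, pl))
    if node.match in pl:
        # Node matched a pattern node that is in the projection list: keep it.
        return [Node(node.tag, node.score, node.match, kept_children)]
    # Node is dropped; its surviving descendants move up one level.
    return kept_children


# Toy input: the match labels and the score are made up for illustration.
article = Node("article", match=1, children=[
    Node("article-title"),                                   # matched nothing
    Node("author", match=2, children=[Node("sname", match=3)]),
    Node("chapter", score=0.7, match=4),
])
result = project(article, {1, 3, 4})
print([child.tag for child in result[0].children])
```

With the projection list {1, 3, 4}, the author node (label 2) is removed while its sname child survives directly under article.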
18. Figure 6: Result Tree of Query 2 with Projection (PL: 1, 3, 4)
19. New operators
20. Threshold
- Input: scored data trees
- Parameters: a scored pattern tree P, a threshold condition TC.
  - TC is either a real-number value V or an integer K.
21. Threshold
- Each output scored data tree satisfies one of the following, depending on TC:
  - for a value V: at least one data IR-node matching the query IR-node in the result data tree has a score higher than V.
  - for an integer K: at least one data IR-node has a rank higher than K, where the rank is obtained by sorting the data IR-nodes by score.
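Both forms of the threshold condition can be illustrated with a short sketch. Each tree is reduced here to an identifier and the list of scores of its data IR-nodes, which is an assumed simplification. "Rank higher than K" is read as "within the top K by score", which is also an assumption, and the function names are hypothetical.

```python
from typing import Dict, List


def threshold_by_value(trees: List[Dict], v: float) -> List[Dict]:
    """Keep trees in which at least one data IR-node scores higher than v."""
    return [t for t in trees if any(score > v for score in t["ir_scores"])]


def threshold_by_rank(trees: List[Dict], k: int) -> List[Dict]:
    """Keep trees holding at least one of the k best-scored data IR-nodes."""
    ranked = sorted(
        ((score, idx) for idx, t in enumerate(trees) for score in t["ir_scores"]),
        key=lambda pair: pair[0],
        reverse=True,
    )
    keep = {idx for _, idx in ranked[:k]}
    return [t for idx, t in enumerate(trees) if idx in keep]


# Made-up scores, for illustration only.
trees = [
    {"id": "t1", "ir_scores": [0.3, 0.9]},
    {"id": "t2", "ir_scores": [0.5]},
    {"id": "t3", "ir_scores": [0.1]},
]
print([t["id"] for t in threshold_by_value(trees, 0.4)])
print([t["id"] for t in threshold_by_rank(trees, 1)])
```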
22. Pick
- Input: scored data trees
- Parameters: a scored pattern tree, a pick-criterion PC
- It is a key operator for removing redundancy.
23. Pick
- Pick is different from projection
  - Projection only needs information local to the node being projected, e.g., the tag name.
  - Pick needs information that may reside elsewhere in the data tree, e.g., the ancestor nodes.
  - The pick operator is usually applied after the projection operator to eliminate redundancy.
24. Figure 8: Result of Query 2 with Projection Followed by Pick
- PC condition (PickFoo)
  - Any data IR-node with a score of at least 0.8 is considered relevant.
  - For any data IR-node (starting with the one highest in the tree hierarchy), if more than 50% of its child nodes are relevant, and its direct parent node is not picked or it has no parent node, then the data IR-node is picked (parent/child redundancy elimination).
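The pick criterion described above can be sketched as a top-down traversal. This is a literal reading of the slide text, not the PickFoo function of Figure 9: the node labels and scores are made up, and a node without children is never picked here because the "more than 50% of its child nodes" test cannot hold for it. The actual function may treat leaves differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    score: float
    children: List["Node"] = field(default_factory=list)


def pick(root: Node, min_score: float = 0.8, min_fraction: float = 0.5) -> List[str]:
    """Return labels of picked nodes, visiting parents before children."""
    picked: List[str] = []

    def visit(node: Node, parent_picked: bool) -> None:
        relevant = sum(1 for child in node.children if child.score >= min_score)
        is_picked = (
            bool(node.children)
            and relevant > min_fraction * len(node.children)
            and not parent_picked          # parent/child redundancy elimination
        )
        if is_picked:
            picked.append(node.label)
        for child in node.children:
            visit(child, is_picked)

    visit(root, False)                     # the root has no parent
    return picked


# Hypothetical tree: n3 has two relevant children out of three.
root = Node("n1", 0.5, [
    Node("n2", 0.3),
    Node("n3", 0.9, [Node("n4", 0.85), Node("n5", 0.9), Node("n6", 0.2)]),
])
print(pick(root))
```

Here n1 is not picked because only one of its two children is relevant (not more than 50%), while n3 is picked and thereby suppresses its own children.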
25. Figure 9: Example User Functions
26. Example
- Using the example database and scored pattern tree, obtain the top result (a10):
  - Projection: generates Figure 6.
  - Pick: generates Figure 8.
  - Selection: generates a collection of five trees corresponding to the five primary data IR-nodes.
  - Threshold: selects the highest-scored result. The subtree rooted at a10 can then be retrieved.
27. Extension of XQuery
Figure 10: XQuery Expression of IR-style Queries
28. Access methods
- Score generating methods
  - TermJoin
  - PhraseFinder
- Score modifying methods
- Score utilizing methods
29. Score generating methods
- More than one term for relevance scoring
  - Term matching is the most common IR predicate. A node is scored based on how many of the query terms occur in the node itself and in its descendant nodes.
  - Phrase matching
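The term-matching score described above can be illustrated with a naive recursive count. This is not the TermJoin algorithm, which works from an inverted index; it only shows what score each node ends up with. The Node class, the function name, and the sample text are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def score_subtree(node: Node, terms: List[str], scores: Dict[str, int]) -> int:
    """Count query-term occurrences in `node` and all of its descendants."""
    words = node.text.lower().split()
    count = sum(words.count(term.lower()) for term in terms)
    for child in node.children:
        count += score_subtree(child, terms, scores)
    scores[node.label] = count   # record the score of every visited node
    return count


# Toy section with two paragraphs; the text is made up.
section = Node("s", children=[
    Node("p1", "search engine basics"),
    Node("p2", "retrieval with an engine"),
])
scores: Dict[str, int] = {}
score_subtree(section, ["search", "engine"], scores)
print(scores)
```

The section itself contains no text, yet it receives the sum of its paragraphs' counts, which is why ancestors of matching nodes become candidate results.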
30. Score generating methods
- TermJoin algorithm
  - Implements score generation based on term matching.
  - Finds all ancestors that are common among the terms in a query.
  - Terms are read from an inverted index.
- PhraseFinder algorithm
  - Uses word offset information in the index to verify phrase occurrence.
  - Uses phrase occurrences to generate appropriate score values.
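The idea of verifying a phrase from word offsets can be sketched with a small positional inverted index. This is an illustration of the general technique under simplifying assumptions (whitespace tokenization, one text string per node, occurrence count used as the score), not the PhraseFinder algorithm itself. The function names are hypothetical, and the text fragments are taken from the visible parts of Figure 1.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(texts: Dict[str, str]) -> Dict[str, Dict[str, List[int]]]:
    """Map each word to {node id: [word offsets]}."""
    index: Dict[str, Dict[str, List[int]]] = defaultdict(lambda: defaultdict(list))
    for node_id, text in texts.items():
        for offset, word in enumerate(text.lower().split()):
            index[word][node_id].append(offset)
    return index


def phrase_counts(index: Dict[str, Dict[str, List[int]]], phrase: str) -> Dict[str, int]:
    """Count phrase occurrences per node by checking consecutive offsets."""
    words = phrase.lower().split()
    if not words:
        return {}
    first, rest = words[0], words[1:]
    counts: Dict[str, int] = {}
    for node_id, offsets in index.get(first, {}).items():
        hits = 0
        for start in offsets:
            # Every following word must appear at the next offset in the same node.
            if all(
                start + i + 1 in index.get(word, {}).get(node_id, ())
                for i, word in enumerate(rest)
            ):
                hits += 1
        if hits:
            counts[node_id] = hits
    return counts


texts = {
    "a15": "information retrieval techniques",
    "a18": "here are some IR based search engines",
    "a19": "search engine NewsInEssence uses a new information retrieval technology",
}
index = build_index(texts)
print(phrase_counts(index, "search engine"))
print(phrase_counts(index, "information retrieval"))
```

Because no stemming is applied, "search engines" in a18 does not count as an occurrence of "search engine"; a real system would likely normalize terms first.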
31. Score modifying methods
- Example: consider the value join access method. It takes in two sets of scored witness trees and outputs a set of scored witness trees, where each output witness tree is the merge of two input witness trees that satisfy the join condition.
  - c is the join condition.
  - A and B are the non-scored versions of input sets A and B.
  - s is a score assigned to an output tree x.
32. Score utilizing methods
- Properties of PC condition
  - A notion of relevance score threshold for data IR-nodes in the input collection.
  - Removing the redundancy either along the ad relationship or along the sibling relationship.
- Challenge of ad redundancy
  - Need to examine all nodes
- Pick algorithm
  - uses a stack-based strategy to eliminate redundancy
33. Figure 12: Algorithm Pick
34. Experiment evaluation
- To evaluate the performance of the new access methods:
  - Use an XML database system
  - Run each experiment five times
  - Ignore the lowest and the highest readings, and average the remaining three
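The measurement rule above amounts to a trimmed mean over five readings. A minimal sketch follows; the function name and the sample timings are made up for illustration and are not results from the experiments.

```python
from typing import List


def trimmed_average(readings: List[float]) -> float:
    """Drop the lowest and highest of five readings and average the rest."""
    if len(readings) != 5:
        raise ValueError("expected exactly five readings")
    middle = sorted(readings)[1:-1]      # the three remaining readings
    return sum(middle) / len(middle)


# Hypothetical timings in seconds; the outlier 10.0 is discarded.
print(trimmed_average([4.0, 1.0, 10.0, 2.0, 3.0]))
```

Discarding the extremes makes the reported time less sensitive to a single slow or fast run, for example one affected by caching.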
35. Experiment evaluation
- TermJoin and PhraseFinder
  - improve performance by a factor of two
- Pick
  - efficiently eliminates redundancy
36. Table 1: Performance (in seconds) of the different techniques using queries with different numbers of terms
37. Table 2: Performance (in seconds) of the PhraseFinder and Composite of Access Methods
38. Conclusion
- A new algebra, TIX, has been developed to integrate information retrieval into standard database querying.
- Advantages of TIX
  - Manages the relevance score
  - Manages result granularity
- New access methods have been developed to manipulate scores, and they effectively improve performance.
39. Q&A