Title: Dynamic Element Retrieval in a Structured Environment
1Dynamic Element Retrieval in a Structured
Environment
- Crouch, Carolyn J.
- University of Minnesota Duluth, MN
- October 1, 2006
2Key Problems
- Retrieval of elements at the desired level of granularity
- Assigning a rank order to each element that reflects its perceived relevance to the query
3Retrieval Environment
- Vector Space Model
- INEX Environment
- Flexible Retrieval
4Vector Space Model
- Document Indexing
- Term Weighting
- Similarity Coefficients
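As a rough illustration of a similarity coefficient in the vector space model, the sketch below computes the cosine between two sparse term-weight vectors stored as dictionaries. The function name `cosine` and the sample vectors are assumptions made for illustration; the slides do not say which coefficient the system uses at this point.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    doc = {"xml": 1.0, "tree": 1.0}   # hypothetical document vector
    query = {"xml": 1.0}              # hypothetical query vector
    print(cosine(doc, query))
```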
5INEX - Initiative for the Evaluation of XML Retrieval
- INEX provides an environment for experiments in structured retrieval
- Traditionally contains two types of topics: CO (content-only) and CAS (content-and-structure)
- Both INEX 2004 and 2005 use an evaluation measure known as inex-eval
- Recall (the proportion of relevant information retrieved) and Precision (the proportion of retrieved items that are relevant)
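The two definitions can be made concrete with a small sketch. It uses plain binary relevance over element identifiers, which is a simplifying assumption: inex-eval itself quantizes graded assessments, and that step is not reproduced here. The function names and sample identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved ids against a set of relevant ids."""
    retrieved = list(retrieved)
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def precision_at_rank(ranked, relevant, k):
    """Precision over the top-k items of a ranked list (k must be positive)."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    hits = sum(1 for item in list(ranked)[:k] if item in relevant)
    return hits / k


if __name__ == "__main__":
    ranked = ["e1", "e2", "e3", "e4"]   # hypothetical ranked element ids
    relevant = {"e1", "e3", "e9"}       # hypothetical relevance judgments
    print(precision_recall(ranked, relevant))
    print(precision_at_rank(ranked, relevant, 2))
```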
6Flexible Retrieval System
- System processes XML documents
- Smart format (Salton's Magic Automatic Retriever of Text)
- Lnu-ltu term weighting
7A Method for Flexible Retrieval
- Input to Flexible Retrieval
- Construction of the Document Tree
- Ranking of Elements
- Output of Flexible Retrieval
8Input to Flexible Retrieval
- Preorder traversal
- Ranked terminal leaf nodes(paragraphs)
- Generate document tree(schema and paragraphs)
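A preorder traversal of a document tree might look like the sketch below, which uses Python's standard `xml.etree.ElementTree`. The positional XPath-style path format, the helper names, and the sample tags (`article`, `sec`, `ss1`, `p`) are assumptions for illustration, not the system's actual schema handling.

```python
import xml.etree.ElementTree as ET


def preorder_paths(root):
    """Yield (path, element) pairs in preorder with positional XPath-style paths."""
    def walk(elem, path):
        yield path, elem
        counts = {}  # per-tag sibling counter used for the [n] position
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
    yield from walk(root, f"/{root.tag}[1]")


def leaf_paths(root):
    """Paths of terminal nodes (elements with no child elements)."""
    return [path for path, elem in preorder_paths(root) if len(elem) == 0]


if __name__ == "__main__":
    # Hypothetical miniature document
    xml = "<article><sec><p>a</p><p>b</p></sec><sec><ss1><p>c</p></ss1></sec></article>"
    root = ET.fromstring(xml)
    for path, _ in preorder_paths(root):
        print(path)
```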
9Document Tree
10Construction of the Document Tree
- Schema determines the document tree
- Calculate Lnu-ltu term weights
11Ranking of Elements
- Address ranking issues with Lnu-ltu term weighting
- Length and normalization issues
- Pivot and slope
12Simple structured document
13Lnu (weight of element vector formula)
\[
\mathrm{Lnu} = \frac{\dfrac{1 + \log(\text{term frequency})}{1 + \log(\text{average term frequency})}}{(1 - \text{slope}) + \text{slope} \times \dfrac{\text{number unique terms}}{\text{pivot}}}
\]
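A minimal sketch of the Lnu element weight follows: the log-scaled term frequency divided by the log-scaled average term frequency, then normalized by (1 - slope) + slope × (number of unique terms / pivot). The natural logarithm and the slope and pivot values in the usage example are assumptions; the slide does not specify them.

```python
import math


def lnu_weight(tf, avg_tf, n_unique, slope, pivot):
    """Lnu weight of one term in an element vector.

    tf        term frequency in the element
    avg_tf    average term frequency in the element (must be > 0)
    n_unique  number of unique terms in the element
    slope, pivot  normalization parameters (tuned experimentally)
    """
    if tf <= 0:
        return 0.0  # absent terms carry no weight
    numerator = (1 + math.log(tf)) / (1 + math.log(avg_tf))
    denominator = (1 - slope) + slope * (n_unique / pivot)
    return numerator / denominator


if __name__ == "__main__":
    # Illustrative values only; slope and pivot are not taken from the slides.
    print(lnu_weight(tf=3, avg_tf=1.5, n_unique=120, slope=0.25, pivot=100))
```

When the element has exactly `pivot` unique terms, the denominator is 1 and the weight is unaffected by normalization; longer elements are damped, shorter ones boosted.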
14Ltu (weighting of query terms formula)
\[
\mathrm{Ltu} = \frac{\bigl(1 + \log(\text{term frequency})\bigr) \times \log\!\left(\dfrac{N}{n_k}\right)}{(1 - \text{slope}) + \text{slope} \times \dfrac{\text{number unique terms}}{\text{pivot}}}
\]
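The Ltu query-term weight can be sketched the same way: the log-scaled term frequency times log(N / nk), divided by the same pivoted normalization, (1 - slope) + slope × (number of unique terms / pivot). Here N is read as the number of elements in the collection at the level being scored and nk as the number of those containing term k; that reading, the natural logarithm, and all sample values are assumptions.

```python
import math


def ltu_weight(tf, n_elements, nk, n_unique, slope, pivot):
    """Ltu weight of one query term.

    tf          term frequency in the query
    n_elements  N, the collection size at the level being scored
    nk          number of those elements that contain the term
    n_unique    number of unique terms in the query
    """
    if tf <= 0 or nk <= 0:
        return 0.0  # term absent from the query or from the collection
    numerator = (1 + math.log(tf)) * math.log(n_elements / nk)
    denominator = (1 - slope) + slope * (n_unique / pivot)
    return numerator / denominator


if __name__ == "__main__":
    # Illustrative values only.
    print(ltu_weight(tf=1, n_elements=100, nk=10, n_unique=5, slope=0.25, pivot=50))
```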
15Overview of flexible retrieval
1. Parse to extract leaf nodes from the original XML documents
2. Index leaf nodes and queries using Smart
3. Perform Smart retrieval to get highly correlated leaf nodes
16Overview of flexible retrieval(cont)
4. For each document containing a retrieved leaf node
   a. Get its document schema
   b. Generate vector representations for inner nodes (elements)
5. For each term in the query
   a. Get its inverted file entry and corresponding xpaths
   b. Find nk at all levels
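Steps 4b and 5b can be sketched as follows, under the assumption that an inner node's term frequencies are the sum of those of its descendant leaf nodes and that "level" means the element type at the end of the path. Both are illustrative simplifications; the path format and function names are hypothetical.

```python
from collections import Counter


def ancestors(path):
    """'/a[1]/b[2]/p[1]' -> ['/a[1]/b[2]', '/a[1]'] (nearest ancestor first)."""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def build_element_tfs(leaf_tfs):
    """Propagate leaf term frequencies up to every ancestor element."""
    element_tfs = {path: Counter(tf) for path, tf in leaf_tfs.items()}
    for path, tf in leaf_tfs.items():
        for anc in ancestors(path):
            element_tfs.setdefault(anc, Counter()).update(tf)
    return element_tfs


def nk_by_level(element_tfs):
    """For each element type, count how many elements contain each term."""
    nk = {}
    for path, tf in element_tfs.items():
        level = path.rsplit("/", 1)[-1].split("[")[0]  # tag of the last step
        nk.setdefault(level, Counter()).update(tf.keys())
    return nk


if __name__ == "__main__":
    # Hypothetical leaf (paragraph) term frequencies keyed by xpath
    leaves = {
        "/article[1]/sec[1]/p[1]": {"xml": 2, "index": 1},
        "/article[1]/sec[1]/p[2]": {"xml": 1},
        "/article[1]/sec[2]/p[1]": {"tree": 3},
    }
    tfs = build_element_tfs(leaves)
    print(tfs["/article[1]/sec[1]"])
    print(nk_by_level(tfs)["p"])
```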
17Output of Flexible Retrieval
- Equivalent to all-element index
18Experiments in flexible retrieval
- Factors of interest
- Experiments and results
19Factors of interest
- Slope and pivot during Lnu-ltu term weighting
- The n (number of paragraphs)
20Experiments and Results
- Attendant file size (dictionary, inverted index, element vectors) reduced by 60%, 50% and 50% respectively
- 30-40% less storage than all-element index
- Is dynamic element retrieval cost effective?
21Conclusion
- Similar work (Grabs and Schek)
- Exhaustivity dependent
- Progress in specificity
22Researchers
- Grabs and Schek (similar work to flexible retrieval)
- Gövert et al. (term weights are multiplied by a collection-dependent augmentation factor as they are propagated up the document tree)
- Mass et al. (maintain separate indices for elements at different levels of granularity; solves issues of distorted statistics)
23Overview of flexible retrieval(cont)
6. Correlate element vectors at each level with query
7. Return ranked list of elements
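Steps 6 and 7 can be pictured as scoring every element vector against the query vector and sorting by score. The sketch assumes an inner-product correlation over already weighted vectors and breaks ties by path; the function name and sample data are hypothetical.

```python
def rank_elements(element_vectors, query_vector):
    """Return [(path, score), ...] sorted by descending inner product with the query."""
    scored = []
    for path, vec in element_vectors.items():
        score = sum(w * query_vector.get(term, 0.0) for term, w in vec.items())
        if score > 0:  # elements sharing no query term are dropped
            scored.append((path, score))
    # Highest score first; ties are broken alphabetically by path.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


if __name__ == "__main__":
    # Hypothetical weighted vectors at the article and section levels
    elements = {
        "/article[1]": {"xml": 0.5, "tree": 0.5},
        "/article[1]/sec[1]": {"xml": 1.0},
        "/article[1]/sec[2]": {"tree": 1.0},
    }
    for path, score in rank_elements(elements, {"xml": 1.0}):
        print(path, score)
```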
24Table I
              INEX 2004                  INEX 2005
article       12,107                     16,440
sections      69,577                     94,421
subsections   77,397                     104,746
paragraphs    1,029,747                  1,378,202
elements      1,188,828                  1,593,809
CO Topics     40 Topics (34 assessed)    40 Topics (29 assessed)
25Table II. Comparison of All-Element and Flexible Retrieval under Inex-Eval (Generalized)
Precision at Rank
        2004                       2005
Rank    All Element   Flexible     All Element   Flexible
1       0.3897        0.3971       0.4224        0.4224
5       0.3088        0.2882       0.3241        0.3413
10      0.2735        0.2669       0.2991        0.2991
20      0.2529        0.2390       0.2841        0.2939
25      0.2456        0.2379       0.2669        0.2800
50      0.2000        0.1972       0.2364        0.2366
100     0.1523        0.1501       0.1921        0.1920
500     0.0697        0.0697       0.0943        0.0949
1500    0.0353        0.0362       0.0472        0.0483
26Table II. (cont)
Precision at Various Points of Recall
            2004                       2005
Recall      All Element   Flexible     All Element   Flexible
0.01        0.3395        0.3348       0.3562        0.3693
0.25        0.0971        0.0951       0.1131        0.1165
0.50        0.0257        0.0283       0.0385        0.0404
0.75        0.0017        0.0017       0.0097        0.0095
1.00        0.0013        0.0013       0.0015        0.0015
avg prec    0.0625        0.0620       0.0739        0.0750
27Table III. Comparison of All-Element and Flexible Retrieval under Inex-Eval (Strict)
Precision at Rank
        2004                       2005
Rank    All Element   Flexible     All Element   Flexible
1       0.2000        0.2000       0.1481        0.1481
5       0.1440        0.1200       0.0667        0.0741
10      0.1240        0.1200       0.0852        0.0778
20      0.1120        0.1020       0.0815        0.0815
25      0.1024        0.0992       0.0800        0.0830
50      0.0898        0.0832       0.0689        0.0681
100     0.0628        0.0608       0.0511        0.0500
500     0.0268        0.0259       0.0219        0.0217
1500    0.0141        0.0143       0.0096        0.0097
28Table III. (cont)
Precision at Various Points of Recall
            2004                       2005
Recall      All Element   Flexible     All Element   Flexible
0.01        0.2134        0.2115       0.1521        0.1535
0.25        0.1006        0.1070       0.0540        0.0515
0.50        0.0411        0.0394       0.0156        0.0191
0.75        0.0166        0.0159       0.0103        0.0104
1.00        0.0042        0.0044       0.0046        0.0048
avg prec    0.0586        0.0577       0.0318        0.0335