Dynamic Index Pruning for Effective Caching
Yohannes Tsegay, Andrew Turpin, Justin Zobel
School of Computer Science and IT, RMIT University, Melbourne, Australia
Introduction
- Information Retrieval systems make use of an inverted index to allow efficient query resolution.
- An inverted index is built of two main components: a vocabulary of all the unique terms in the collection, and, for each unique term, an inverted list of the documents in which that term occurs.
- The size of inverted lists grows proportionally with the size of the collection.
- On average, larger collections require more computation time and resources to evaluate a query.
- A significant amount of query evaluation time is spent fetching inverted lists from disk.
- To reduce the amount of time spent accessing disk, search engines use inverted list caching.
- Caching large inverted lists can be counterproductive: other cache items will be evicted to make room for the large lists, resulting in a large number of cache misses.
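The two components described above can be sketched as a minimal in-memory inverted index (the `build_index` helper and the toy documents are illustrative, not part of the authors' system):

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: a vocabulary mapping each unique term
    to the sorted list of document IDs in which that term occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sort each inverted list so documents appear in a fixed order.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "the quick dog", 2: "dog runs fast", 3: "quick brown fox"}
index = build_index(docs)
# index["dog"] -> [1, 2]; index["quick"] -> [1, 3]
```

In a real engine the lists live on disk and are compressed, which is exactly why list size drives both fetch cost and cache pressure.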
Gradual stop

Given a query with terms t1, t2, t3:
- Process impact blocks in decreasing order of impact weight.
- Add a new document to the accumulator set A only if its impact weight is at least min_score_doc(A), the minimum score of any document currently in A.
- To ensure documents in A are correctly ranked, continue processing further impact blocks, but do not add new documents; only update accumulators already in A.
[Figure: impact blocks for terms t1, t2, and t3 (e.g. impacts 5, 4, 2, 1 for t1), processed in decreasing impact order and merged into the accumulator set A; new documents are admitted only while their impact is at least min_score_doc(A).]
- Stop processing further blocks if, after
processing a large number of documents, the order
of documents in A does not change. In this case A
is said to have become stable.
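The steps above can be sketched as follows. This is a minimal, illustrative Python sketch, not the authors' implementation; the block layout `(impact, [doc_ids])` and the posting-count stability test are assumptions:

```python
def gradual_stop(lists, stable_after=1000):
    """Gradual stop over impact-sorted blocks.
    lists: {term: [(impact, [doc_ids]), ...]} with blocks in decreasing impact order.
    A new document enters the accumulator set A only while its block impact is
    at least the minimum score in A; otherwise only existing accumulators are
    updated. Stop early once the ranking of A is unchanged for `stable_after`
    consecutive postings (A has become "stable")."""
    # Merge all blocks from all term lists into one stream,
    # sorted by decreasing impact weight.
    blocks = sorted(
        (blk for term_blocks in lists.values() for blk in term_blocks),
        key=lambda b: -b[0],
    )
    A = {}          # accumulators: doc_id -> partial score
    unchanged = 0   # postings since the ranking of A last changed
    for impact, doc_ids in blocks:
        for doc in doc_ids:
            # Recomputing the full ranking per posting is for clarity only.
            before = sorted(A, key=A.get, reverse=True)
            min_score = min(A.values()) if A else 0
            if doc in A:
                A[doc] += impact      # always update existing accumulators
            elif impact >= min_score:
                A[doc] = impact       # admit new doc while impact is high enough
            after = sorted(A, key=A.get, reverse=True)
            unchanged = unchanged + 1 if before == after else 0
            if unchanged >= stable_after:
                return A              # A is stable: stop processing blocks
    return A
```

Because blocks arrive in decreasing impact order, admission naturally ceases once impacts fall below min_score_doc(A), which matches the "update only" phase above.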
[Figure: query evaluation without caching versus with an inverted list cache between the query evaluator and the on-disk index. Each inverted list (e.g. for the term "dog", with impacts 7, 5, 4, 2) is stored as blocks of equi-impact documents, sorted in decreasing order of impact.]
Contributions
1. A scheme to cache dynamically pruned inverted lists instead of full inverted lists. This not only increases the number of items maintained in cache, it also reduces the number of accumulators used to process a query.
2. Gradual stop, a method for dynamically pruning inverted lists based on impacts.
3. Cost-aware cache eviction policies, which take into consideration the cost of reading an inverted list into cache.
Test data

- We used two large TREC collections: WT100g (100 GB) and GOV2 (425 GB).
- Each collection has a corresponding query log: 2 million queries from the MSN search engine and 2 million queries from the Excite query log.
- The last one million queries from each log were run against their corresponding collection using Gradual Stop. The inverted lists used to process those queries were cached, and term hits and byte hits were counted.
Results
1. How did the proposed pruning method perform?
2. How did caching pruned lists perform?
Cost-aware cache eviction policies

- Current cache eviction policies make use of the recency and access frequency of items in cache.
- Suppose two items with the same recency and access frequency are about to be evicted, and one costs twice as much to read from disk as the other: which should we keep to reduce disk access?
- The cost of re-reading an inverted list should be considered before evicting it from cache.
- Cost per byte
- Greedy Dual Size with Frequency (GDSF)
[Figure: term hit rate, the percentage of inverted lists found in cache out of the total number of lists processed, for the GOV2-MSN data set.]
- d_i: disk access cost of inverted list i
- s_i: size of inverted list i
- t_i: age of inverted list i in cache
- f_i: number of times the list was accessed while in cache
- L: cost of the last item evicted, initially 0
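Using these quantities, Greedy Dual Size with Frequency is commonly implemented with the priority H_i = L + f_i * d_i / s_i: evict the lowest-priority list and inflate L to its priority. A minimal sketch (the class name and capacity handling are illustrative, not the authors' implementation):

```python
class GDSFCache:
    """Greedy Dual Size with Frequency eviction: evict the item with the
    lowest priority H_i = L + f_i * d_i / s_i, where L is the priority of
    the most recently evicted item (initially 0). Large, cheap-to-read,
    rarely accessed lists are evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity   # total bytes available
        self.used = 0
        self.L = 0.0               # priority of the last evicted item
        self.items = {}            # term -> (size s_i, cost d_i, freq f_i)

    def _priority(self, term):
        s, d, f = self.items[term]
        return self.L + f * d / s

    def access(self, term, size, cost):
        """Return True on a cache hit, False on a miss (then cache the list)."""
        if term in self.items:
            s, d, f = self.items[term]
            self.items[term] = (s, d, f + 1)   # bump access frequency
            return True
        # Evict lowest-priority lists until the new list fits.
        while self.used + size > self.capacity and self.items:
            victim = min(self.items, key=self._priority)
            self.L = self._priority(victim)    # inflate L (aging)
            self.used -= self.items.pop(victim)[0]
        if size <= self.capacity:
            self.items[term] = (size, cost, 1)
            self.used += size
        return False
```

The L term ages out previously popular items: a newly cached list starts with a priority at least as high as everything already evicted, so stale lists cannot linger forever.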
Acknowledgments

Thanks to Microsoft for providing the MSN query log. This research was supported by the Australian Research Council.
CIKM, November 2007