Title: Chapter 14: Ranking Algorithms
Chapter 14: Ranking Algorithms
Outline
- Introduction
- Ranking models
- Selecting ranking techniques
- Data structures and algorithms
- The creation of an inverted file
- Searching the inverted file
- Stemmed and unstemmed query terms
- A Boolean system with ranking
- Pruning
Introduction
- Boolean systems
  - Providing powerful on-line search capabilities for librarians and other trained intermediaries
  - Providing very poor service for end-users who use the system infrequently
- The ranking approach
  - Inputting a natural language query without Boolean syntax
  - Producing a list of ranked records that answer the query
  - More oriented toward end-users
Introduction (cont.)
- Natural language/ranking approach
  - More effective for end-users
  - Results are ranked based on co-occurrence of query terms, modified by statistical term-weighting
  - Eliminates the often-wrong Boolean syntax used by end-users
  - Provides some results even if a query term is incorrect
Figure 14.1 Statistical ranking
Query: "Human factors in information retrieval systems"
Rec1: human, factors, information, retrieval
Rec2: human, factors, help, systems
Rec3: factors, operation, systems

Term vectors (term order: factors, information, help, human, operation, retrieval, systems):
Query (1 1 0 1 0 1 1)
Rec1  (1 1 0 1 0 1 0)
Rec2  (1 0 1 1 0 0 1)
Rec3  (1 0 0 0 1 0 1)
Figure 14.1 Statistical ranking (cont.)
- Simple match (bitwise AND of query and record vectors, then count)
  - Query (1 1 0 1 0 1 1) AND Rec1 (1 1 0 1 0 1 0) = (1 1 0 1 0 1 0), score 4
  - Query (1 1 0 1 0 1 1) AND Rec2 (1 0 1 1 0 0 1) = (1 0 0 1 0 0 1), score 3
  - Query (1 1 0 1 0 1 1) AND Rec3 (1 0 0 0 1 0 1) = (1 0 0 0 0 0 1), score 2
- Weighted match (query vector masks the record's term weights, then sum)
  - Query (1 1 0 1 0 1 1) × Rec1 (2 3 0 5 0 3 0) = (2 3 0 5 0 3 0), score 13
  - Query (1 1 0 1 0 1 1) × Rec2 (2 0 4 5 0 0 1) = (2 0 0 5 0 0 1), score 8
  - Query (1 1 0 1 0 1 1) × Rec3 (2 0 0 0 2 0 1) = (2 0 0 0 0 0 1), score 3
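The two scoring rules of Figure 14.1 reduce to a few lines of code. A minimal Python sketch follows, with the vectors transcribed from the figure (variable and function names are illustrative, not from the chapter):

```python
# Simple and weighted matches from Figure 14.1.
query = [1, 1, 0, 1, 0, 1, 1]  # factors, information, help, human,
                               # operation, retrieval, systems
records = {
    "Rec1": [1, 1, 0, 1, 0, 1, 0],  # binary vectors (simple match)
    "Rec2": [1, 0, 1, 1, 0, 0, 1],
    "Rec3": [1, 0, 0, 0, 1, 0, 1],
}
weights = {
    "Rec1": [2, 3, 0, 5, 0, 3, 0],  # weighted vectors (weighted match)
    "Rec2": [2, 0, 4, 5, 0, 0, 1],
    "Rec3": [2, 0, 0, 0, 2, 0, 1],
}

def simple_match(q, r):
    # Count of terms shared by the query and the record.
    return sum(qi & ri for qi, ri in zip(q, r))

def weighted_match(q, w):
    # Sum of record term weights for terms present in the query.
    return sum(qi * wi for qi, wi in zip(q, w))

for rid in records:
    print(rid, simple_match(query, records[rid]),
          weighted_match(query, weights[rid]))
# Rec1 4 13 / Rec2 3 8 / Rec3 2 3, matching the figure.
```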
Ranking models
- Two types of ranking models
  - Ranking the query against individual documents
    - Vector space model
    - Probabilistic model
  - Ranking the query against entire sets of related documents
Ranking models (cont.)
- Vector space model
  - Using cosine correlation to compute similarity (see the sketch after this list)
  - Early experiments
    - SMART system (overlap similarity function)
    - Results
      - Within-document frequency weighting > no term weighting
      - Cosine correlation with frequency term weighting > overlap similarity function
  - Salton & Yang (1973): relying on term importance within an entire collection
    - Results
      - Significant performance improvement from combining within-document frequency weighting with the inverse document frequency (IDF)
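The cosine correlation itself is compact enough to sketch. A minimal generic illustration (not the SMART implementation) follows, reusing the weighted vectors from Figure 14.1:

```python
# Cosine correlation between a query and a document, both given as
# term-weight vectors over the same vocabulary.
import math

def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / norm if norm else 0.0

# Figure 14.1 query against Rec1's weighted vector:
print(cosine([1, 1, 0, 1, 0, 1, 1], [2, 3, 0, 5, 0, 3, 0]))  # ~0.85
```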
Ranking models (cont.)
- Probabilistic model
  - Terms appearing in previously retrieved relevant documents are given a higher weight
  - Croft and Harper (1979)
    - Probabilistic indexing without any relevance information
    - Assuming all query terms have equal probability of appearing in relevant documents
    - Deriving a term-weighting formula (a hedged sketch follows)
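With no relevance information, that derivation yields an IDF-like collection weight. The sketch below uses the commonly cited smoothed form; the exact constants in the original paper may differ, so treat this as an assumption:

```python
# Hedged sketch of a Croft/Harper-style term weight with no relevance
# information: an IDF-like measure over the whole collection. The 0.5
# smoothing follows the commonly cited form, not necessarily the paper's.
import math

def croft_harper_weight(n_docs, n_postings):
    return math.log((n_docs - n_postings + 0.5) / (n_postings + 0.5))

print(croft_harper_weight(1000, 50))  # rare term -> large positive weight
```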
Ranking models (cont.)
- Probabilistic model
  - Croft (1983)
    - Incorporating within-document frequency weights
    - Using a tuning factor K (sketched below)
    - Result
      - Significant improvement over both the IDF weighting alone and the combination weighting
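A hedged sketch of how such a tuning factor typically enters the weight: K sets a floor under the normalized within-document frequency before it is combined with the collection weight. The exact formula in Croft (1983) may differ; this only illustrates the role of K:

```python
# K-tuned combination weight: normalized within-document frequency,
# damped by K, then scaled by IDF.
import math

def combination_weight(freq, max_freq, n_docs, n_postings, K=0.3):
    ndf = K + (1 - K) * (freq / max_freq)  # K floors the frequency component
    idf = math.log(n_docs / n_postings)    # collection-wide importance
    return ndf * idf

# A term occurring 3 times in a record whose most frequent term occurs
# 10 times, in a 1000-record collection where 50 records contain it:
print(combination_weight(3, 10, 1000, 50))
```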
Other experiments involving ranking
- Direct comparison of similarity measures and term-weighting schemes
  - Four types of term-frequency weightings (Sparck Jones, 1973)
    - Term frequency within a document
    - Term frequency within a collection
    - Term postings within a document (a binary measure)
    - Term postings within a collection
  - Indexing was taken from manually extracted keywords
  - Results
    - Using the term frequency (or postings) within a collection always improved performance
    - Using the term frequency (or postings) within a document improved performance only for some collections
Other experiments involving ranking (cont.)
- Harman (1986)
  - Four term-weighting factors
    - (a) The number of matches between a document and a query
    - (b) The distribution of a term within the document collection
      - IDF and noise measures
    - (c) The frequency of a term within a document
    - (d) The length of the document
  - Results
    - Using the single measures alone, the distribution of the term within the collection (b) outperformed the within-document frequency (c)
    - Combining the within-document frequency with either the IDF or noise measure outperformed using the IDF or noise alone
Other experiments involving ranking (cont.)
- Ranking based on document structure
  - Not only using weights based on term importance both within an entire collection and within a given document (Bernstein and Williamson, 1984)
  - But also using the structural position of the term
    - Summary versus text paragraphs
  - In SIBRIS, term-weights are increased for terms in document titles and decreased for terms added to a query from a thesaurus
Selecting ranking techniques
- Term-weighting based on the distribution of a term within a collection (e.g., IDF)
  - Always improves performance
- Within-document frequency combined with the IDF weight
  - Often provides even more improvement
  - Several methods exist for combining within-document frequency with the IDF measure (one is sketched below)
- Adding additional weight for document structure
  - E.g., higher weightings for terms appearing in the title or abstract vs. those appearing only in the text
- Relevance weighting (Chap 11)
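As a concrete baseline, here is a minimal sketch of one such combination: a log-damped within-document frequency multiplied by IDF. The damping choice is an assumption for illustration; the chapter surveys several variants:

```python
# Minimal tf*idf sketch: damped within-document frequency times IDF.
import math

def tf_idf(freq, n_docs, n_postings):
    tf = 1 + math.log(freq) if freq > 0 else 0.0  # within-document component
    idf = math.log(n_docs / n_postings)           # within-collection component
    return tf * idf

print(tf_idf(3, 1000, 50))
```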
The creation of an inverted file
- Implications of ranking for inverted file structures
  - Only the record id has to be stored (smaller index)
  - Strategies that increase recall at the expense of precision can be used
- The inverted file is usually split into two pieces for searching (see the sketch after this list)
  - The dictionary, containing each term along with statistics about that term (such as the number of postings and the IDF) and a pointer to the location of that term's postings
  - The postings file, containing the record ids and the weights for all occurrences of the term
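A minimal in-memory sketch of the two-piece layout follows; a real system would keep the postings on disk and store a file offset in each dictionary entry (the field names here are illustrative):

```python
# Sketch of the dictionary/postings split, held in memory for clarity.
from dataclasses import dataclass, field

@dataclass
class DictEntry:
    num_postings: int = 0  # statistics kept alongside the term
    idf: float = 0.0
    postings: list = field(default_factory=list)  # stand-in for a postings-file pointer

dictionary = {}  # term -> DictEntry; each posting is (record_id, weight)

def add_posting(term, record_id, weight):
    entry = dictionary.setdefault(term, DictEntry())
    entry.postings.append((record_id, weight))
    entry.num_postings += 1

add_posting("human", 1, 5.0)
add_posting("human", 2, 5.0)
```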
The creation of an inverted file (cont.)
- Four major options for storing weights in the postings file
  - Store the raw frequency
    - Slowest search
    - Most flexible
  - Store a normalized frequency
    - Not suitable for use with the cosine similarity function
    - Updating would not change the postings
The creation of an inverted file (cont.)
- Store the completely weighted term
  - Any of the combination weighting schemes are suitable
  - Disadvantage: updating requires changing all postings
- If no within-record weighting is used, then the postings records do not have to store weights
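To make the raw-versus-precomputed trade-off concrete, here is a small sketch (illustrative structures, not the chapter's code): precomputing tf × IDF makes search cheaper, but any change to a term's IDF forces every posting for that term to be rewritten:

```python
# Option 1 stores raw frequencies; option 3 precomputes the full weight.
import math

raw_postings = {"human": [(1, 3), (2, 1)]}  # term -> [(record_id, raw freq)]

def precompute_weights(raw, n_docs):
    weighted = {}
    for term, plist in raw.items():
        idf = math.log(n_docs / len(plist))
        # If n_docs or the posting count changes, every entry is rewritten.
        weighted[term] = [(rec, freq * idf) for rec, freq in plist]
    return weighted

print(precompute_weights(raw_postings, 1000))
```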
Searching the inverted file
- Figure 14.4 Flowchart of the search engine (a sketch of this pipeline follows):
  query → parser → dictionary lookup → dictionary entry → get weights →
  record numbers on a per-term basis → accumulator → record numbers with
  total weights → sort by weight → ranked record numbers
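A minimal Python sketch of that pipeline, reusing the DictEntry structure from the inverted-file section (names are illustrative):

```python
# Accumulator-based search: look up each query term, add its posting
# weights into a per-record accumulator, then sort by total weight.
from collections import defaultdict

def search(query_terms, dictionary):
    accumulators = defaultdict(float)             # record id -> total weight
    for term in query_terms:                      # dictionary lookup
        entry = dictionary.get(term)
        if entry is None:
            continue                              # term absent from collection
        for record_id, weight in entry.postings:  # get weights per term
            accumulators[record_id] += weight     # accumulate
    # Sort by weight -> ranked record numbers.
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)
```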
Searching the inverted file (cont.)
- Inefficiencies of this technique
  - The I/O needs to be minimized
    - Use a single read for all the postings of a given term, then separate the buffer into record ids and weights
  - Time savings can be gained at the expense of some memory space
    - Direct access to the accumulators in memory rather than through hashing
  - A final major bottleneck can be the sort step over the accumulators for large data sets
    - A full sort of thousands of records is very time consuming (see the top-k sketch below)
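Since users typically examine only the top of the ranking, one common mitigation (an assumption here, not the chapter's prescription) is to keep just the k best records in a bounded heap rather than sorting every accumulator:

```python
# Top-k selection in O(n log k) instead of a full O(n log n) sort.
import heapq

def top_k(accumulators, k=10):
    return heapq.nlargest(k, accumulators.items(), key=lambda kv: kv[1])

print(top_k({"r1": 13.0, "r2": 8.0, "r3": 3.0}, k=2))  # [("r1", 13.0), ("r2", 8.0)]
```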
Stemmed and unstemmed query terms
- If query terms are automatically stemmed in a ranking system, users generally get better results (Frakes, 1984; Candela, 1990)
- In some cases, a stem is produced that leads to improper results
  - The original record terms are not stored in the inverted file; only their stems are used
Stemmed and unstemmed query terms (cont.)
- Harman & Candela (1990)
  - Two separate inverted files could be created and stored
    - Stemmed terms, serving normal queries
    - Unstemmed terms, serving "don't stem" queries
  - Hybrid inverted file (see the sketch below)
    - Saves no space in the dictionary part
    - Saves considerable storage compared with two versions of the postings
    - At the expense of some additional search time
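A hedged sketch of how such a hybrid file could be organized (the layout below is an illustration, not Harman & Candela's actual format): every unstemmed term keeps its own dictionary entry, but entries sharing a stem point at one postings list, and a "don't stem" query pays extra time filtering it:

```python
# Hybrid layout sketch: full dictionary, shared postings per stem.
shared_postings = {
    # stem -> [(record_id, original term, weight)]
    "comput": [(1, "computer", 2.0), (2, "computing", 1.5)],
}
dictionary = {
    "computer": "comput",   # no dictionary savings: every form has an entry...
    "computing": "comput",  # ...but all forms share one postings list
}

def lookup(term, stem_it=True):
    plist = shared_postings[dictionary[term]]
    if stem_it:
        return [(rec, w) for rec, _orig, w in plist]
    # "Don't stem": filter to exact-term postings (the extra search cost).
    return [(rec, w) for rec, orig, w in plist if orig == term]

print(lookup("computer", stem_it=False))  # [(1, 2.0)]
```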
A Boolean system with ranking
- SIRE system
  - Full Boolean capability plus a variation of the basic search process
  - Accepts queries that are either Boolean logic strings or natural language queries (implicit OR)
  - Major modification to the basic search process (see the sketch below)
    - Merge postings from the query terms before ranking is done
  - Performance
    - Faster response time for Boolean queries
    - No increase in response time for natural language queries
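A hedged sketch of that modification for an AND query (the merge logic is illustrative; SIRE's implementation may differ): the per-term postings are intersected first, so accumulators exist only for records that satisfy the Boolean logic:

```python
# Merge (intersect) postings before ranking, then score the survivors.
from functools import reduce

def boolean_and_rank(query_terms, postings):
    # postings: term -> {record_id: weight}
    per_term = [postings[t] for t in query_terms if t in postings]
    if len(per_term) < len(query_terms):
        return []  # an AND query with a missing term matches nothing
    surviving = reduce(lambda acc, p: acc & p.keys(), per_term[1:], set(per_term[0]))
    scores = {rec: sum(p[rec] for p in per_term) for rec in surviving}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

postings = {"human": {1: 5.0, 2: 5.0}, "factors": {1: 2.0, 2: 2.0, 3: 2.0}}
print(boolean_and_rank(["human", "factors"], postings))  # records 1 and 2 tie at 7.0
```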
Pruning
- A major time bottleneck in the basic search process
  - The sort of the accumulators for large data sets
- Changed search algorithm with pruning (a sketch follows the steps):
  1. Sort all query terms (stems) by decreasing IDF value.
  2. Do a binary search for the first term (i.e., the one with the highest IDF) and get the address of the postings list for that term.
  3. Read the entire postings file for that term into a buffer.
  4. Add the term weights for each record id into the contents of the unique accumulator for that record id.
Pruning (cont.)
  5. Check the IDF of the next query term. If its IDF > 1/3 × (the maximum IDF of any term in the data set), then repeat steps 2, 3, and 4; otherwise repeat steps 2, 3, and 4 but do not add weights to zero-weight accumulators.
  6. Sort the accumulators with nonzero weights to produce the final ranked record list.
- If a query has only high-frequency terms, then pruning cannot be done.
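A minimal sketch of this pruned loop, reusing the DictEntry structure from the inverted-file section (binary search and buffered I/O are elided; `max_idf` is the collection's maximum IDF):

```python
# Pruned accumulator search: once a term's IDF falls to one third of the
# maximum IDF or below, its weights go only into accumulators that are
# already nonzero, so common terms cannot create new candidate records.
def pruned_search(query_terms, dictionary, max_idf):
    terms = sorted(query_terms, key=lambda t: dictionary[t].idf, reverse=True)  # step 1
    accumulators = {}
    for term in terms:
        entry = dictionary[term]          # steps 2-3: locate and read postings
        prune = entry.idf <= max_idf / 3  # step 5's threshold test
        for record_id, weight in entry.postings:  # step 4: add weights
            if prune and record_id not in accumulators:
                continue                  # skip zero-weight accumulators
            accumulators[record_id] = accumulators.get(record_id, 0.0) + weight
    # Step 6: sort the nonzero accumulators into the final ranked list.
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)
```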