Title: Inverted Index, Compressing the Inverted Index, and Computing Scores in a Complete Search System (Chintan Mistry, Mrugank Dalal)
2. Indexing in a Search Engine
- Documents go through linguistic preprocessing, producing normalized terms
- The normalized terms are stored in an inverted index, built in advance
- A user query is looked up in the already built inverted index to find the documents that contain its terms
- The returned documents are ranked according to their relevancy and presented as results
3. Forward index
- What is an INVERTED INDEX? First, look at the FORWARD INDEX!
- A forward index maps each document to the words it contains:
- Document 1: hat, dog, the, cow, is, now
- Document 2: cow, run, away, morning, in, tree
- Document 3: what, family, at, some, is, take
- Querying the forward index would require sequential iteration through each document, and through each of its words, to verify a matching document: too much time, memory, and resources required!
4. What is an inverted index?
- As opposed to the forward index, it stores the list of documents for each word; this list is the posting list, and each entry in it is one posting
- It lets us directly access the set of documents containing a word
5. How to build an inverted index? (1/3)
- Build the index in advance:
- 1. Collect the documents
- 2. Turn each document into a list of tokens
- 3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
- 4. Index the documents: record the postings for each word in the dictionary
6. How to build an inverted index? (2/3)
- Given two documents:
- Document 1: This is first document. Microsoft's products are office, visio, and sql server.
- Document 2: This is second document. Google's services are gmail, google labs and google code.
7. How to build an inverted index? (3/3)
- Sort-based indexing:
- 1. Sort the terms alphabetically
- 2. Instances of the same term are grouped, first by word and then by documentID
- 3. The terms and documentIDs are then separated out
- This reduces the storage requirement
- The dictionary is commonly kept in memory, while the postings lists are kept on disk
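The sort-based procedure above can be sketched in a few lines. This is a minimal sketch, assuming a letters-only tokenizer with lowercasing as the linguistic preprocessing; the example documents are the two from the previous slide.

```python
import re

def build_inverted_index(docs):
    """Sort-based construction: collect (term, docID) pairs,
    sort them, then group them into postings lists."""
    pairs = []
    for doc_id, text in docs.items():
        # Tokenize and normalize (here: lowercase, letters only)
        for token in re.findall(r"[a-z]+", text.lower()):
            pairs.append((token, doc_id))
    pairs.sort()                       # sort by term, then by docID
    index = {}
    for term, doc_id in pairs:         # separate terms and docIDs
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)    # skip duplicate docIDs
    return index

docs = {
    1: "This is first document. Microsoft's products are office, visio, and sql server",
    2: "This is second document. Google's services are gmail, google labs and google code.",
}
index = build_inverted_index(docs)
```

Looking up `index["document"]` then returns the postings list `[1, 2]` directly, with no scan over the documents.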
8. Blocked sort-based indexing
- Uses termIDs instead of terms
- Main memory is insufficient to hold all termID-docID pairs, so we need an external sorting algorithm that uses disk
- Segment the collection into parts of equal size
- Sort and group the termID-docID pairs of each part in memory
- Store the intermediate results on disk
- Merge all intermediate results into the final index
- Running time: O(T log T), where T is the number of termID-docID pairs
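The block-then-merge idea can be sketched as follows, with sorted in-memory lists standing in for the intermediate files that BSBI would write to disk (an assumption made to keep the sketch self-contained):

```python
import heapq

def bsbi(pairs, block_size):
    """BSBI sketch: sort equal-sized blocks of (termID, docID)
    pairs separately, then merge the sorted runs."""
    # Segment the collection and sort each block in memory
    runs = [sorted(pairs[i:i + block_size])
            for i in range(0, len(pairs), block_size)]
    # Merge all intermediate runs into the final index
    index = {}
    for term_id, doc_id in heapq.merge(*runs):
        postings = index.setdefault(term_id, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index

pairs = [(2, 1), (1, 1), (3, 2), (1, 2), (2, 2), (1, 3)]
index = bsbi(pairs, block_size=2)
```

`heapq.merge` streams the runs, mirroring how a real merge reads each intermediate file sequentially instead of loading everything at once.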
9. Single-pass in-memory indexing
- SPIMI uses terms instead of termIDs
- Writes each block's dictionary to disk, then starts a new dictionary for the next block
- Assume we have a stream of term-docID pairs
- Tokens are processed one by one; when a term occurs for the first time, it is added to the dictionary and a new postings list is created
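A minimal sketch of this per-block dictionary building, with a Python list standing in for the on-disk block files (an assumption for self-containment):

```python
def spimi(token_stream, block_size):
    """SPIMI sketch: one in-memory dictionary per block; postings
    are appended directly, with no sorting of term-docID pairs."""
    blocks, dictionary = [], {}
    for n, (term, doc_id) in enumerate(token_stream, 1):
        # First occurrence of a term: add it to the dictionary
        # and create a new postings list
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
        if n % block_size == 0:        # block full: write its dictionary
            blocks.append(dictionary)
            dictionary = {}
    if dictionary:
        blocks.append(dictionary)
    # Merge the block dictionaries into the final index
    final = {}
    for block in blocks:
        for term, postings in block.items():
            merged = final.setdefault(term, [])
            for d in postings:
                if d not in merged:
                    merged.append(d)
    return {term: sorted(p) for term, p in final.items()}

stream = [("a", 1), ("b", 1), ("a", 2), ("c", 2), ("b", 3), ("a", 3)]
index = spimi(stream, block_size=3)
```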
10. Difference between BSBI and SPIMI

SPIMI:
- Adds postings directly to the postings lists
- Faster than BSBI, because no sorting is necessary
- Saves memory, because no termIDs need to be stored
- Time complexity O(T)

BSBI:
- Collects term-docID pairs, sorts them, and then creates the postings lists
- Slower than SPIMI
- Requires storing termIDs, so it needs more space
- Time complexity O(T log T)
11. Distributed Indexing (1/4)
- We cannot perform web-scale index construction on a single computer; web search engines use distributed indexing algorithms for index construction
- The work is partitioned across several machines
- They use the MapReduce architecture
- A general architecture for distributed computing
- Divides the work into chunks that can easily be assigned and reassigned
- Consists of a map phase and a reduce phase
12. Distributed Indexing (2/4)
13. Distributed Indexing (3/4)
- MAP PHASE
- Maps the splits of the input data to key-value pairs
- Each parser writes its output to local segment files
- The machines running this phase are called parsers
- REDUCE PHASE
- Partition the keys into j term partitions and have the parsers write the key-value pairs of each term partition into a separate file
- Each parser thus writes j segment files, one per term partition
14. Distributed Indexing (4/4)
- REDUCE PHASE (cont.)
- Collecting all values (docIDs) for a given key (termID) into one list is the task of the inverter
- The master assigns each term partition to a different inverter
- Finally, the list of values is sorted for each key and written to the final sorted postings list
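The map, partition, and reduce steps above can be simulated in one process. This is a sketch, not a real MapReduce job: plain lists stand in for segment files, and the example documents are the ones from the forward-index slide.

```python
from collections import defaultdict

def map_phase(split):
    """Parser: turn one split of the input into (term, docID) pairs."""
    return [(term, doc_id)
            for doc_id, text in split
            for term in text.lower().split()]

def partition(pairs, j):
    """Write the key-value pairs into j term partitions (segment files)."""
    segments = defaultdict(list)
    for term, doc_id in pairs:
        segments[hash(term) % j].append((term, doc_id))
    return segments

def reduce_phase(segment):
    """Inverter: collect all docIDs for each term into one sorted list."""
    postings = defaultdict(set)
    for term, doc_id in segment:
        postings[term].add(doc_id)
    return {term: sorted(docs) for term, docs in postings.items()}

# Two splits, each handled by one parser; j = 2 term partitions
splits = [[(1, "hat dog the cow is now")],
          [(2, "cow run away morning in tree")]]
segments = defaultdict(list)
for split in splits:
    for part, seg in partition(map_phase(split), 2).items():
        segments[part].extend(seg)
index = {}
for seg in segments.values():          # one inverter per term partition
    index.update(reduce_phase(seg))
```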
15. Dynamic indexing
- Motivation: what we have seen so far assumed a static collection of documents; what if documents are added, updated, or deleted?
- Maintain two indexes: main and auxiliary
- The auxiliary index is kept in memory; searches are run across both indexes, and the results are merged
- When the auxiliary index becomes too large, merge it into the main index
- Deleted documents can be filtered out while returning the results
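A minimal sketch of the main-plus-auxiliary scheme. The `aux_limit` threshold (counted in postings) and the invalidation set for deletions are illustrative choices, not prescribed by the slide.

```python
class DynamicIndex:
    """Main index plus a small in-memory auxiliary index; deleted
    docIDs are tracked in a set and filtered out at query time."""

    def __init__(self, aux_limit=2):
        self.main, self.aux, self.deleted = {}, {}, set()
        self.aux_limit = aux_limit

    def add(self, doc_id, terms):
        for t in terms:
            self.aux.setdefault(t, set()).add(doc_id)
        # Merge the auxiliary index once it grows too large
        if sum(len(p) for p in self.aux.values()) > self.aux_limit:
            self._merge()

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def _merge(self):
        for t, postings in self.aux.items():
            self.main.setdefault(t, set()).update(postings)
        self.aux = {}

    def search(self, term):
        # Search runs across both indexes; deleted docs are filtered
        hits = self.main.get(term, set()) | self.aux.get(term, set())
        return sorted(hits - self.deleted)

idx = DynamicIndex(aux_limit=2)
idx.add(1, ["cow", "dog"])
idx.add(2, ["cow"])        # triggers a merge into the main index
idx.add(3, ["cow"])
idx.delete(2)
```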
16. Querying distributed indexes (1/2)
- Partition by terms
- Partition the dictionary of index terms into subsets, each stored along with the postings lists of its terms
- A query is routed only to the nodes holding its terms, which allows greater concurrency
- But sending long postings lists between nodes for merging is very costly, and that cost outweighs the greater concurrency
- Partition by documents
- Each node contains the index for a subset of all documents
- A query is distributed to all nodes, then the results are merged
17. Querying distributed indexes (2/2)
- Partition by documents (cont.)
- Problem: idf must be calculated for the entire collection, even though the index at a single node covers only a subset of the documents
- The query is broadcast to each of the nodes, and the top k results from each node are merged to find the top k documents for the query
18. Index compression (1/8)
- Compression techniques for the dictionary and the postings lists
- Advantages:
- Less disk space
- Better use of caching: frequently used terms can be cached in memory for faster processing, and compression allows more terms to be kept in memory
- Faster data transfer from disk to memory: the total time to transfer compressed data from disk and decompress it is less than the time to transfer the uncompressed data
19. Index compression (2/8)
- Dictionary compression
- The dictionary is small compared to the postings lists, so why compress it?
- Because when a large part of the dictionary (think of the millions of terms in it!) sits on disk, many more disk seeks are necessary
- The goal is to fit the dictionary into memory for fast response times
20. Index compression (3/8)
- 1. Dictionary as an array
- Can be stored as an array of fixed-width entries
- For example, with 400,000 terms in the dictionary, allotting 20 bytes per term, 4 bytes for its document frequency, and 4 bytes for a pointer to its postings list:
- 400,000 × (20 + 4 + 4) bytes = 11.2 MB
21. Index compression (4/8)
- Any problem in storing the dictionary as an array?
- 1. The average length of a term in English is about eight characters, so we waste about 12 of the 20 characters allotted per term
- 2. There is no way to store terms of more than 20 characters, like hydrochlorofluorocarbons
- SOLUTION?
- 2. Dictionary as a string
- Store the dictionary as one long string of characters
- A term pointer marks the end of the preceding term and the beginning of the next
22. Index compression (5/8)
- 2. Dictionary as a string (cont.)
- Per term: 4 bytes for the document frequency, 4 bytes for the postings pointer, 3 bytes for the term pointer, and 8 bytes on average for the term itself
- 400,000 × (4 + 4 + 3 + 8) bytes = 7.6 MB (compared to 11.2 MB earlier)
23. Index compression (6/8)
- 3. Blocked storage
- Group the terms in the string into blocks of size k, keeping a term pointer only for the first term of each block
- With k = 4, we save (k - 1) × 3 = 9 bytes of term pointers per block, but need an additional 4 bytes per block (1 byte per term) to store term lengths
- Net saving: 400,000 × (1/4) × 5 bytes = 0.5 MB, giving 7.1 MB (compared to 7.6 MB)
24. Index compression (7/8)
- 4. Blocked storage with front coding
- Consecutive terms in a sorted block share common prefixes, which need not be stored repeatedly
- According to an experiment conducted by the author, the size reduces to 5.9 MB (compared to 7.1 MB)
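Front coding can be sketched as follows: each term after the first is stored as the length of the prefix it shares with the previous term, plus the differing suffix. The `automat*` block is a hypothetical example chosen for its long shared prefixes.

```python
def front_code(terms):
    """Encode a sorted block of terms as (shared-prefix length, suffix)."""
    coded, prev = [], ""
    for term in terms:
        k = 0
        while k < min(len(prev), len(term)) and prev[k] == term[k]:
            k += 1                     # length of prefix shared with prev
        coded.append((k, term[k:]))
        prev = term
    return coded

def front_decode(coded):
    """Rebuild the terms by reusing each previous term's prefix."""
    terms, prev = [], ""
    for k, suffix in coded:
        term = prev[:k] + suffix
        terms.append(term)
        prev = term
    return terms

block = ["automata", "automate", "automatic", "automation"]
coded = front_code(block)
```

Here `coded` stores only 13 suffix characters instead of the 37 characters of the raw block, plus one small integer per term.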
25. Index compression (8/8)
- Postings file compression
- Encode gaps: the gaps between successive postings are much shorter than the docIDs themselves
- So we can store the gaps rather than the postings themselves
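Gap encoding in miniature (the docIDs are illustrative; in practice the small gaps would then be compressed further, e.g. with variable-byte or gamma codes):

```python
from itertools import accumulate

def encode_gaps(postings):
    """Store the first docID, then each docID's gap to its predecessor."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def decode_gaps(gaps):
    """Recover the docIDs by running sums over the gaps."""
    return list(accumulate(gaps))

postings = [824, 829, 215406]
gaps = encode_gaps(postings)   # small numbers compress better
```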
26. Review: scoring and term weighting
- Metadata: information about a document
- Metadata generally consists of fields
- E.g., date of creation, authors, title, etc.
- Zone: similar to a field
- Difference: a zone holds arbitrary free text
- E.g., abstract, overview
27. Review: scoring and term weighting
- Term frequency (tf_t,d): the number of occurrences of term t in document d
- Problem: longer documents have larger term counts, which leads to inappropriate ranking
- Document frequency (df_t): the number of documents in the collection that contain term t
- Inverse document frequency (idf_t):
- idf_t = log(N / df_t), where N is the total number of documents
- Significance of idf:
- If low, the term is a common one (e.g., a stop word)
- If high, the term is a rare word (e.g., apothecary)
28. Review: scoring and term weighting
- Tf-idf weighting
- tf-idf_t,d = tf_t,d × idf_t
- High when the term occurs many times in a small number of documents
- Lower when the term occurs fewer times in a document, or occurs in many documents
- Lowest when the term occurs in almost all documents
- Score of a document:
- Score(q, d) = Σ_{t ∈ q} tf-idf_t,d
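The tf-idf score above can be computed directly. A minimal sketch, assuming whitespace tokenization and base-10 logarithms (the base only rescales scores, it does not change the ranking):

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """Score(q, d) = sum over t in q of tf_{t,d} * idf_t,
    with idf_t = log10(N / df_t)."""
    N = len(docs)
    # Term frequencies per document
    tfs = {d: Counter(text.lower().split()) for d, text in docs.items()}
    # Document frequency: in how many docs each term appears
    df = Counter()
    for counts in tfs.values():
        df.update(counts.keys())
    return {d: sum(tfs[d][t] * math.log10(N / df[t])
                   for t in query_terms if df[t])
            for d in docs}

docs = {1: "cow dog cow", 2: "dog tree", 3: "tree tree the"}
scores = tf_idf_scores(["cow", "dog"], docs)
```

Note how the rare term "cow" (df = 1) dominates the ranking, while a term present in every document would contribute log(1) = 0, matching the "lowest" case above.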
29. Computing scores in a complete search system
30. Inexact top-K document retrieval
- Motivation: reduce the cost of computing scores for all N documents
- We compute scores ONLY for the documents whose scores are likely to be high w.r.t. the given query
- How:
- Find a set A of documents that are contenders, where K < |A| ≪ N
- Return the top K scoring documents from A
31. Index elimination
- Idf above a preset threshold
- Only traverse postings for terms with high idf
- Benefit: low-idf terms have long postings lists, so eliminating them removes many documents from score computation
- Require many (or all) query terms
- Only consider documents that contain many of the query terms
- Danger: we may end up with fewer than K documents in the end
32. Champion lists
- Champion list (also: fancy list, top docs)
- A precomputed set of the r documents with the highest weights for each term t in the dictionary
- How to create set A:
- Take the union of the champion lists of each term in the query
- Compute scores only for the documents in that union
- How and when to decide r:
- Highly application dependent
- The lists are created at the time of indexing the documents
- Problem: ????????
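A sketch of the two steps: precompute per-term champion lists at indexing time, then form the candidate set A at query time. The `weights` mapping (term → {doc: weight}) is a hypothetical stand-in for precomputed tf-idf weights.

```python
def build_champion_lists(weights, r):
    """For each term, keep the r documents with the highest weights."""
    return {t: sorted(w, key=w.get, reverse=True)[:r]
            for t, w in weights.items()}

def candidate_set(query_terms, champions):
    """Set A = union of the champion lists of the query's terms."""
    A = set()
    for t in query_terms:
        A.update(champions.get(t, []))
    return A

weights = {"car": {1: 0.9, 2: 0.5, 3: 0.1},
           "auto": {2: 0.8, 4: 0.7}}
champions = build_champion_lists(weights, r=2)
A = candidate_set(["car", "auto"], champions)
```

Only the documents in `A` are then scored fully, instead of all N documents.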
33. Static quality scores and ordering
- In many search engines we have:
- A measure of quality g(d) for each document
- The net score is calculated as a combination of g(d) and the tf-idf score
- How to achieve this:
- Each postings list is kept in decreasing order of g(d)
- So we can traverse just the first few documents in each list
- Global champion lists:
- Choose the r documents with the highest values of g(d) + tf-idf
34. Cluster pruning (1/2)
- We cluster the documents in a preprocessing step:
- Pick √N documents; call them leaders
- For each document that is not a leader, compute its nearest leader
- Followers: the documents that are not leaders
- Each leader has approximately √N followers
35. Cluster pruning (2/2)
- How does it help?
- Given a query q, find the leader L nearest to q
- i.e., we compute scores against only √N documents
- Set A contains leader L together with its √N followers
- i.e., again only about √N documents are scored
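The two phases can be sketched with documents as plain coordinate vectors and Euclidean distance as the (hypothetical) similarity measure; real systems would use cosine similarity over term vectors.

```python
import math
import random

def cluster_prune(vectors, seed=0):
    """Preprocessing: pick ~sqrt(N) random leaders, then attach each
    remaining document (follower) to its nearest leader."""
    random.seed(seed)
    ids = list(vectors)
    leaders = random.sample(ids, round(math.sqrt(len(ids))))
    followers = {L: [] for L in leaders}
    for d in ids:
        if d not in followers:
            nearest = min(followers,
                          key=lambda L: math.dist(vectors[L], vectors[d]))
            followers[nearest].append(d)
    return followers

def prune_query(q, vectors, followers):
    """Query: score only the leaders, then only the chosen leader's
    followers -- this is the candidate set A."""
    leader = min(followers, key=lambda L: math.dist(vectors[L], q))
    return [leader] + followers[leader]

vectors = {1: (0, 0), 2: (0, 1), 3: (10, 10), 4: (10, 11)}
followers = cluster_prune(vectors)
A = prune_query((9, 9), vectors, followers)
```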
36. Tiered indexes
- [Figure: a two-tier inverted index]
- Tier 1 (preset threshold value set to 20): auto → Doc1, Doc2; car → Doc1, Doc2, Doc3; best → Doc4
- Tier 2 (preset threshold value set to 10): auto → Doc1; car → Doc1; best → Doc4
- This addresses the issue of the set A of contenders containing fewer than K documents: if a tier yields too few documents, fall through to the next tier
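The fall-through logic can be sketched as follows, using two tiers shaped like the figure on this slide (the tier contents are the slide's illustrative example):

```python
def tiered_lookup(query_terms, tiers, k):
    """Walk the tiers from highest to lowest threshold; stop as soon
    as the candidate set A holds at least k documents."""
    A = set()
    for tier in tiers:
        for t in query_terms:
            A.update(tier.get(t, []))
        if len(A) >= k:      # enough contenders: no need to go deeper
            break
    return A

tier1 = {"auto": ["Doc1", "Doc2"],
         "car": ["Doc1", "Doc2", "Doc3"],
         "best": ["Doc4"]}                  # e.g. postings with tf >= 20
tier2 = {"auto": ["Doc1"], "car": ["Doc1"],
         "best": ["Doc4"]}                  # e.g. postings with tf >= 10
A = tiered_lookup(["car", "best"], [tier1, tier2], k=4)
```

With k = 4 the first tier already yields four contenders; a query on "auto" alone with k = 3 would fall through to the second tier.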
37. A complete search system
- [Figure: components of a complete search system]
- Documents flow through parsing and linguistics into the indexers, which build the documents cache and the indexes: metadata in zone and field indexes, tiered inverted positional indexes, and k-gram indexes
- A user query goes through the free text query parser and spell correction, then inexact top-K retrieval over the indexes, followed by scoring and ranking, producing the result page
- A training set and machine-learned ranking (MLR) supply the scoring parameters