Title: Inverted Index, Compressing the Inverted Index, and Computing Scores in a Complete Search System (Chintan Mistry, Mrugank Dalal)
2. Indexing in a Search Engine
- Documents go through linguistic preprocessing, producing normalized terms
- The normalized terms are stored in an inverted index, built in advance
- A user query is looked up in the already built inverted index to find the documents that contain its terms
- The returned documents are ranked according to their relevancy and presented as results
3. Forward index
- What is an INVERTED INDEX? First, look at the FORWARD INDEX!
- A forward index maps each document to the words it contains:
- Document 1: hat, dog, the, cow, is, now
- Document 2: cow, run, away, morning, in, tree
- Document 3: what, family, at, some, is, take
- Querying the forward index would require sequential iteration through each document, and through each of its words, to verify a matching document: too much time, memory, and resources required!
4. What is an inverted index?
- As opposed to the forward index, it stores the list of documents for each word; this list is the posting list, and each entry in it is one posting
- It lets us directly access the set of documents containing a word
5. How to build an inverted index? (1/3)
- Build the index in advance:
- 1. Collect the documents
- 2. Turn each document into a list of tokens
- 3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
- 4. Index the documents: record the postings for each word in the dictionary
6. How to build an inverted index? (2/3)
- Given two documents:
- Document 1: This is first document. Microsoft's products are office, visio, and sql server.
- Document 2: This is second document. Google's services are gmail, google labs and google code.
7. How to build an inverted index? (3/3)
- Sort-based indexing:
- 1. Sort the terms alphabetically
- 2. Instances of the same term are grouped, first by word and then by documentID
- 3. The terms and documentIDs are then separated out
- This reduces the storage requirement
- The dictionary is commonly kept in memory, while the postings lists are kept on disk
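The sort-based procedure above can be sketched in a few lines. This is a minimal sketch, assuming a letters-only tokenizer with lowercasing as the linguistic preprocessing; the example documents are the two from the previous slide.

```python
import re

def build_inverted_index(docs):
    """Sort-based construction: collect (term, docID) pairs,
    sort them, then group them into postings lists."""
    pairs = []
    for doc_id, text in docs.items():
        # Tokenize and normalize (here: lowercase, letters only)
        for token in re.findall(r"[a-z]+", text.lower()):
            pairs.append((token, doc_id))
    pairs.sort()                       # sort by term, then by docID
    index = {}
    for term, doc_id in pairs:         # separate terms and docIDs
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)    # skip duplicate docIDs
    return index

docs = {
    1: "This is first document. Microsoft's products are office, visio, and sql server",
    2: "This is second document. Google's services are gmail, google labs and google code.",
}
index = build_inverted_index(docs)
```

Looking up `index["document"]` then returns the postings list `[1, 2]` directly, with no scan over the documents.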
8. Blocked sort-based indexing
- Uses termIDs instead of terms
- Main memory is insufficient to hold all termID-docID pairs, so we need an external sorting algorithm that uses disk
- Segment the collection into parts of equal size
- Sort and group the termID-docID pairs of each part in memory
- Store the intermediate results on disk
- Merge all intermediate results into the final index
- Running time: O(T log T), where T is the number of termID-docID pairs
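The block-then-merge idea can be sketched as follows, with sorted in-memory lists standing in for the intermediate files that BSBI would write to disk (an assumption made to keep the sketch self-contained):

```python
import heapq

def bsbi(pairs, block_size):
    """BSBI sketch: sort equal-sized blocks of (termID, docID)
    pairs separately, then merge the sorted runs."""
    # Segment the collection and sort each block in memory
    runs = [sorted(pairs[i:i + block_size])
            for i in range(0, len(pairs), block_size)]
    # Merge all intermediate runs into the final index
    index = {}
    for term_id, doc_id in heapq.merge(*runs):
        postings = index.setdefault(term_id, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index

pairs = [(2, 1), (1, 1), (3, 2), (1, 2), (2, 2), (1, 3)]
index = bsbi(pairs, block_size=2)
```

`heapq.merge` streams the runs, mirroring how a real merge reads each intermediate file sequentially instead of loading everything at once.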
9. Single-pass in-memory indexing
- SPIMI uses terms instead of termIDs
- Writes each block's dictionary to disk, then starts a new dictionary for the next block
- Assume we have a stream of term-docID pairs
- Tokens are processed one by one; when a term occurs for the first time, it is added to the dictionary and a new postings list is created
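A minimal sketch of this per-block dictionary building, with a Python list standing in for the on-disk block files (an assumption for self-containment):

```python
def spimi(token_stream, block_size):
    """SPIMI sketch: one in-memory dictionary per block; postings
    are appended directly, with no sorting of term-docID pairs."""
    blocks, dictionary = [], {}
    for n, (term, doc_id) in enumerate(token_stream, 1):
        # First occurrence of a term: add it to the dictionary
        # and create a new postings list
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
        if n % block_size == 0:        # block full: write its dictionary
            blocks.append(dictionary)
            dictionary = {}
    if dictionary:
        blocks.append(dictionary)
    # Merge the block dictionaries into the final index
    final = {}
    for block in blocks:
        for term, postings in block.items():
            merged = final.setdefault(term, [])
            for d in postings:
                if d not in merged:
                    merged.append(d)
    return {term: sorted(p) for term, p in final.items()}

stream = [("a", 1), ("b", 1), ("a", 2), ("c", 2), ("b", 3), ("a", 3)]
index = spimi(stream, block_size=3)
```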
10. Difference between BSBI and SPIMI

SPIMI:
- Adds postings directly to the postings lists
- Faster than BSBI, because no sorting is necessary
- Saves memory, because no termIDs need to be stored
- Time complexity O(T)

BSBI:
- Collects term-docID pairs, sorts them, and then creates the postings lists
- Slower than SPIMI
- Requires storing termIDs, so it needs more space
- Time complexity O(T log T)
11. Distributed Indexing (1/4)
- We cannot perform web-scale index construction on a single computer; web search engines use distributed indexing algorithms for index construction
- The work is partitioned across several machines
- They use the MapReduce architecture
- A general architecture for distributed computing
- Divides the work into chunks that can easily be assigned and reassigned
- Consists of a map phase and a reduce phase
12. Distributed Indexing (2/4)
13. Distributed Indexing (3/4)
- MAP PHASE
- Maps the splits of the input data to key-value pairs
- Each parser writes its output to local segment files
- The machines running this phase are called parsers
- REDUCE PHASE
- Partition the keys into j term partitions and have the parsers write the key-value pairs of each term partition into a separate file
- Each parser thus writes j segment files, one per term partition
14. Distributed Indexing (4/4)
- REDUCE PHASE (cont.)
- Collecting all values (docIDs) for a given key (termID) into one list is the task of the inverter
- The master assigns each term partition to a different inverter
- Finally, the list of values is sorted for each key and written to the final sorted postings list
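The map, partition, and reduce steps above can be simulated in one process. This is a sketch, not a real MapReduce job: plain lists stand in for segment files, and the example documents are the ones from the forward-index slide.

```python
from collections import defaultdict

def map_phase(split):
    """Parser: turn one split of the input into (term, docID) pairs."""
    return [(term, doc_id)
            for doc_id, text in split
            for term in text.lower().split()]

def partition(pairs, j):
    """Write the key-value pairs into j term partitions (segment files)."""
    segments = defaultdict(list)
    for term, doc_id in pairs:
        segments[hash(term) % j].append((term, doc_id))
    return segments

def reduce_phase(segment):
    """Inverter: collect all docIDs for each term into one sorted list."""
    postings = defaultdict(set)
    for term, doc_id in segment:
        postings[term].add(doc_id)
    return {term: sorted(docs) for term, docs in postings.items()}

# Two splits, each handled by one parser; j = 2 term partitions
splits = [[(1, "hat dog the cow is now")],
          [(2, "cow run away morning in tree")]]
segments = defaultdict(list)
for split in splits:
    for part, seg in partition(map_phase(split), 2).items():
        segments[part].extend(seg)
index = {}
for seg in segments.values():          # one inverter per term partition
    index.update(reduce_phase(seg))
```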
15. Dynamic indexing
- Motivation: what we have seen so far assumed a static collection of documents; what if documents are added, updated, or deleted?
- Maintain two indexes: main and auxiliary
- The auxiliary index is kept in memory; searches are run across both indexes, and the results are merged
- When the auxiliary index becomes too large, merge it into the main index
- Deleted documents can be filtered out while returning the results
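A minimal sketch of the main-plus-auxiliary scheme. The `aux_limit` threshold (counted in postings) and the invalidation set for deletions are illustrative choices, not prescribed by the slide.

```python
class DynamicIndex:
    """Main index plus a small in-memory auxiliary index; deleted
    docIDs are tracked in a set and filtered out at query time."""

    def __init__(self, aux_limit=2):
        self.main, self.aux, self.deleted = {}, {}, set()
        self.aux_limit = aux_limit

    def add(self, doc_id, terms):
        for t in terms:
            self.aux.setdefault(t, set()).add(doc_id)
        # Merge the auxiliary index once it grows too large
        if sum(len(p) for p in self.aux.values()) > self.aux_limit:
            self._merge()

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def _merge(self):
        for t, postings in self.aux.items():
            self.main.setdefault(t, set()).update(postings)
        self.aux = {}

    def search(self, term):
        # Search runs across both indexes; deleted docs are filtered
        hits = self.main.get(term, set()) | self.aux.get(term, set())
        return sorted(hits - self.deleted)

idx = DynamicIndex(aux_limit=2)
idx.add(1, ["cow", "dog"])
idx.add(2, ["cow"])        # triggers a merge into the main index
idx.add(3, ["cow"])
idx.delete(2)
```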
16. Querying distributed indexes (1/2)
- Partition by terms
- Partition the dictionary of index terms into subsets, each stored along with the postings lists of its terms
- A query is routed only to the nodes holding its terms, which allows greater concurrency
- But sending long postings lists between nodes for merging is very costly, and that cost outweighs the greater concurrency
- Partition by documents
- Each node contains the index for a subset of all documents
- A query is distributed to all nodes, then the results are merged
17. Querying distributed indexes (2/2)
- Partition by documents (cont.)
- Problem: idf must be calculated for the entire collection, even though the index at a single node covers only a subset of the documents
- The query is broadcast to each of the nodes, and the top k results from each node are merged to find the top k documents for the query
18. Index compression (1/8)
- Compression techniques for the dictionary and the postings lists
- Advantages:
- Less disk space
- Better use of caching: frequently used terms can be cached in memory for faster processing, and compression allows more terms to be kept in memory
- Faster data transfer from disk to memory: the total time to transfer compressed data from disk and decompress it is less than the time to transfer the uncompressed data
19. Index compression (2/8)
- Dictionary compression
- The dictionary is small compared to the postings lists, so why compress it?
- Because when a large part of the dictionary (think of the millions of terms in it!) sits on disk, many more disk seeks are necessary
- The goal is to fit the dictionary into memory for fast response times
20. Index compression (3/8)
- 1. Dictionary as an array
- Can be stored as an array of fixed-width entries
- For example, with 400,000 terms in the dictionary, allotting 20 bytes per term, 4 bytes for its document frequency, and 4 bytes for a pointer to its postings list:
- 400,000 × (20 + 4 + 4) bytes = 11.2 MB
21. Index compression (4/8)
- Any problem in storing the dictionary as an array?
- 1. The average length of a term in English is about eight characters, so we waste about 12 of the 20 characters allotted per term
- 2. There is no way to store terms of more than 20 characters, like hydrochlorofluorocarbons
- SOLUTION?
- 2. Dictionary as a string
- Store the dictionary as one long string of characters
- A term pointer marks the end of the preceding term and the beginning of the next
22. Index compression (5/8)
- 2. Dictionary as a string (cont.)
- Per term: 4 bytes for the document frequency, 4 bytes for the postings pointer, 3 bytes for the term pointer, and 8 bytes on average for the term itself
- 400,000 × (4 + 4 + 3 + 8) bytes = 7.6 MB (compared to 11.2 MB earlier)
23. Index compression (6/8)
- 3. Blocked storage
- Group the terms in the string into blocks of size k, keeping a term pointer only for the first term of each block
- With k = 4, we save (k - 1) × 3 = 9 bytes of term pointers per block, but need an additional 4 bytes per block (1 byte per term) to store term lengths
- Net saving: 400,000 × (1/4) × 5 bytes = 0.5 MB, giving 7.1 MB (compared to 7.6 MB)
24. Index compression (7/8)
- 4. Blocked storage with front coding
- Consecutive terms in a sorted block share common prefixes, which need not be stored repeatedly
- According to an experiment conducted by the author, the size reduces to 5.9 MB (compared to 7.1 MB)
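Front coding can be sketched as follows: each term after the first is stored as the length of the prefix it shares with the previous term, plus the differing suffix. The `automat*` block is a hypothetical example chosen for its long shared prefixes.

```python
def front_code(terms):
    """Encode a sorted block of terms as (shared-prefix length, suffix)."""
    coded, prev = [], ""
    for term in terms:
        k = 0
        while k < min(len(prev), len(term)) and prev[k] == term[k]:
            k += 1                     # length of prefix shared with prev
        coded.append((k, term[k:]))
        prev = term
    return coded

def front_decode(coded):
    """Rebuild the terms by reusing each previous term's prefix."""
    terms, prev = [], ""
    for k, suffix in coded:
        term = prev[:k] + suffix
        terms.append(term)
        prev = term
    return terms

block = ["automata", "automate", "automatic", "automation"]
coded = front_code(block)
```

Here `coded` stores only 13 suffix characters instead of the 37 characters of the raw block, plus one small integer per term.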
25. Index compression (8/8)
- Postings file compression
- Encode gaps: the gaps between successive postings are much shorter than the docIDs themselves
- So we can store the gaps rather than the postings themselves
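Gap encoding in miniature (the docIDs are illustrative; in practice the small gaps would then be compressed further, e.g. with variable-byte or gamma codes):

```python
from itertools import accumulate

def encode_gaps(postings):
    """Store the first docID, then each docID's gap to its predecessor."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def decode_gaps(gaps):
    """Recover the docIDs by running sums over the gaps."""
    return list(accumulate(gaps))

postings = [824, 829, 215406]
gaps = encode_gaps(postings)   # small numbers compress better
```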
26. Review: scoring and term weighting
- Metadata: information about a document
- Metadata generally consists of fields
- E.g., date of creation, authors, title, etc.
- Zone: similar to a field
- Difference: a zone holds arbitrary free text
- E.g., abstract, overview
27. Review: scoring and term weighting
- Term frequency (tf_t,d): the number of occurrences of term t in document d
- Problem: longer documents have larger term counts, which leads to inappropriate ranking
- Document frequency (df_t): the number of documents in the collection that contain term t
- Inverse document frequency (idf_t):
- idf_t = log(N / df_t), where N is the total number of documents
- Significance of idf:
- If low, the term is a common one (e.g., a stop word)
- If high, the term is a rare word (e.g., apothecary)
28. Review: scoring and term weighting
- Tf-idf weighting
- tf-idf_t,d = tf_t,d × idf_t
- High when the term occurs many times in a small number of documents
- Lower when the term occurs fewer times in a document, or occurs in many documents
- Lowest when the term occurs in almost all documents
- Score of a document:
- Score(q, d) = Σ_{t ∈ q} tf-idf_t,d
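The tf-idf score above can be computed directly. A minimal sketch, assuming whitespace tokenization and base-10 logarithms (the base only rescales scores, it does not change the ranking):

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """Score(q, d) = sum over t in q of tf_{t,d} * idf_t,
    with idf_t = log10(N / df_t)."""
    N = len(docs)
    # Term frequencies per document
    tfs = {d: Counter(text.lower().split()) for d, text in docs.items()}
    # Document frequency: in how many docs each term appears
    df = Counter()
    for counts in tfs.values():
        df.update(counts.keys())
    return {d: sum(tfs[d][t] * math.log10(N / df[t])
                   for t in query_terms if df[t])
            for d in docs}

docs = {1: "cow dog cow", 2: "dog tree", 3: "tree tree the"}
scores = tf_idf_scores(["cow", "dog"], docs)
```

Note how the rare term "cow" (df = 1) dominates the ranking, while a term present in every document would contribute log(1) = 0, matching the "lowest" case above.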
29. Computing scores in a complete search system
30. Inexact top-K document retrieval
- Motivation: reduce the cost of computing scores for all N documents
- We compute scores ONLY for the documents whose scores are likely to be high w.r.t. the given query
- How:
- Find a set A of documents that are contenders, where K < |A| ≪ N
- Return the top K scoring documents from A
31. Index elimination
- Idf above a preset threshold
- Only traverse postings for terms with high idf
- Benefit: low-idf terms have long postings lists, so eliminating them removes many documents from score computation
- Require many (or all) query terms
- Only consider documents that contain many of the query terms
- Danger: we may end up with fewer than K documents in the end
32. Champion lists
- Champion list (also: fancy list, top docs)
- A precomputed set of the r documents with the highest weights for each term t in the dictionary
- How to create set A:
- Take the union of the champion lists of each term in the query
- Compute scores only for the documents in that union
- How and when to decide r:
- Highly application dependent
- The lists are created at the time of indexing the documents
- Problem: ????????
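A sketch of the two steps: precompute per-term champion lists at indexing time, then form the candidate set A at query time. The `weights` mapping (term → {doc: weight}) is a hypothetical stand-in for precomputed tf-idf weights.

```python
def build_champion_lists(weights, r):
    """For each term, keep the r documents with the highest weights."""
    return {t: sorted(w, key=w.get, reverse=True)[:r]
            for t, w in weights.items()}

def candidate_set(query_terms, champions):
    """Set A = union of the champion lists of the query's terms."""
    A = set()
    for t in query_terms:
        A.update(champions.get(t, []))
    return A

weights = {"car": {1: 0.9, 2: 0.5, 3: 0.1},
           "auto": {2: 0.8, 4: 0.7}}
champions = build_champion_lists(weights, r=2)
A = candidate_set(["car", "auto"], champions)
```

Only the documents in `A` are then scored fully, instead of all N documents.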
33. Static quality scores and ordering
- In many search engines we have:
- A measure of quality g(d) for each document
- The net score is calculated as a combination of g(d) and the tf-idf score
- How to achieve this:
- Each postings list is kept in decreasing order of g(d)
- So we can traverse just the first few documents in each list
- Global champion lists:
- Choose the r documents with the highest values of g(d) + tf-idf
34. Cluster pruning (1/2)
- We cluster the documents in a preprocessing step:
- Pick √N documents; call them leaders
- For each document that is not a leader, compute its nearest leader
- Followers: the documents that are not leaders
- Each leader has approximately √N followers
35. Cluster pruning (2/2)
- How does it help?
- Given a query q, find the leader L nearest to q
- i.e., we compute scores against only √N documents
- Set A contains leader L together with its √N followers
- i.e., again only about √N documents are scored
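The two phases can be sketched with documents as plain coordinate vectors and Euclidean distance as the (hypothetical) similarity measure; real systems would use cosine similarity over term vectors.

```python
import math
import random

def cluster_prune(vectors, seed=0):
    """Preprocessing: pick ~sqrt(N) random leaders, then attach each
    remaining document (follower) to its nearest leader."""
    random.seed(seed)
    ids = list(vectors)
    leaders = random.sample(ids, round(math.sqrt(len(ids))))
    followers = {L: [] for L in leaders}
    for d in ids:
        if d not in followers:
            nearest = min(followers,
                          key=lambda L: math.dist(vectors[L], vectors[d]))
            followers[nearest].append(d)
    return followers

def prune_query(q, vectors, followers):
    """Query: score only the leaders, then only the chosen leader's
    followers -- this is the candidate set A."""
    leader = min(followers, key=lambda L: math.dist(vectors[L], q))
    return [leader] + followers[leader]

vectors = {1: (0, 0), 2: (0, 1), 3: (10, 10), 4: (10, 11)}
followers = cluster_prune(vectors)
A = prune_query((9, 9), vectors, followers)
```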
36. Tiered indexes
- [Figure: a two-tier inverted index]
- Tier 1 (preset threshold value set to 20): auto → Doc1, Doc2; car → Doc1, Doc2, Doc3; best → Doc4
- Tier 2 (preset threshold value set to 10): auto → Doc1; car → Doc1; best → Doc4
- This addresses the issue of the set A of contenders containing fewer than K documents: if a tier yields too few documents, fall through to the next tier
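The fall-through logic can be sketched as follows, using two tiers shaped like the figure on this slide (the tier contents are the slide's illustrative example):

```python
def tiered_lookup(query_terms, tiers, k):
    """Walk the tiers from highest to lowest threshold; stop as soon
    as the candidate set A holds at least k documents."""
    A = set()
    for tier in tiers:
        for t in query_terms:
            A.update(tier.get(t, []))
        if len(A) >= k:      # enough contenders: no need to go deeper
            break
    return A

tier1 = {"auto": ["Doc1", "Doc2"],
         "car": ["Doc1", "Doc2", "Doc3"],
         "best": ["Doc4"]}                  # e.g. postings with tf >= 20
tier2 = {"auto": ["Doc1"], "car": ["Doc1"],
         "best": ["Doc4"]}                  # e.g. postings with tf >= 10
A = tiered_lookup(["car", "best"], [tier1, tier2], k=4)
```

With k = 4 the first tier already yields four contenders; a query on "auto" alone with k = 3 would fall through to the second tier.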
37. A complete search system
- [Figure: components of a complete search system]
- Documents flow through parsing and linguistics into the indexers, which build the documents cache and the indexes: metadata in zone and field indexes, tiered inverted positional indexes, and k-gram indexes
- A user query goes through the free text query parser and spell correction, then inexact top-K retrieval over the indexes, followed by scoring and ranking, producing the result page
- A training set and machine-learned ranking (MLR) supply the scoring parameters