Title: Signature Files
1. Signature Files
- CSE3201/CSE4500
- Information Retrieval Systems
2. Signature File for Text Retrieval
- A signature is created as an abstraction of a document.
- All the signatures that represent the documents in the collection are kept in a file called the signature file.
3. Word Signature (WS)
- A word signature
  - is a fixed-length bit string that represents a word.
  - is described by
    - the length (N)
    - the number of bits set to 1 (k)
- Example with N = 24 and k = 7:
  1 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0
4. Word Signature Generation
- Use a hash function to find the location of the bit(s) that will be set to 1.
- Use triplets of characters to generate the word signature.
  - Divide the word into overlapping triplets.
  - For each triplet of characters
    - convert the characters to a numeric value (this can be the ASCII representation of the characters).
    - use that number as the input to the hash function.
  - The hash function produces a number that represents the bit position of the triplet in the word signature.
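The generation steps above can be sketched in Python. This is only an illustration under stated assumptions: the function names (`triplets`, `triplet_position`, `word_signature`), the base-256 encoding of the characters, and the modulo hash are hypothetical choices made for simplicity, not the hash function used in the course.

```python
def triplets(word):
    """Return the overlapping character triplets of a word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def triplet_position(triplet, n_bits):
    """Map a triplet to a bit position in the range 0..n_bits-1."""
    # Convert the characters to one number from their character codes
    # (the ASCII codes for plain English text).
    number = 0
    for ch in triplet:
        number = number * 256 + ord(ch)
    # Assumed hash function: remainder modulo the signature length.
    return number % n_bits


def word_signature(word, n_bits=24):
    """Build an n_bits-long signature; index 0 is the leftmost bit."""
    bits = [0] * n_bits
    for t in triplets(word):
        bits[triplet_position(t, n_bits)] = 1
    return "".join(str(b) for b in bits)


sig = word_signature("signature")
print(sig, "bits set:", sig.count("1"))
```

With this simple scheme the number of bits set is at most the number of triplets, and it can be lower when two triplets hash to the same position. Words shorter than three characters produce an all-zero signature and would need separate handling.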
5. Word Signature Generation
- Example
  - A signature 111000111001 is generated for the word "signature".
  - The bit positions are read from left to right.
6. Document Signature (DS)
- A document signature can be created using two methods:
  - concatenation of word signatures
  - superimposed coding
7. Document Signature: Concatenation of WS
- The length of document signatures (DS) can vary.
  - A fixed number of bits may precede the document signature (DS) to indicate the length of the DS.
- It is possible to fix the length of the document signature (DS).
  - The length can be set to that of the longest document signature in the collection.
  - The signatures of shorter documents are padded with extra 0 bits.
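Both variants of concatenation can be sketched as follows. The helper names (`concatenate`, `with_length_prefix`, `pad_to_longest`) and the 16-bit default prefix width are assumptions made for illustration; the two 12-bit word signatures are the ones this deck uses for "free" and "text".

```python
def concatenate(word_sigs):
    """Document signature as the word signatures placed end to end."""
    return "".join(word_sigs)


def with_length_prefix(doc_sig, prefix_bits=16):
    """Variable-length variant: a fixed-width binary length precedes the DS."""
    return format(len(doc_sig), "0{}b".format(prefix_bits)) + doc_sig


def pad_to_longest(doc_sigs):
    """Fixed-length variant: pad shorter signatures with 0 bits on the right."""
    longest = max(len(sig) for sig in doc_sigs)
    return [sig.ljust(longest, "0") for sig in doc_sigs]


d1 = concatenate(["001000110010", "000010101001"])  # two words
d2 = concatenate(["001000110010"])                  # one word
print(with_length_prefix(d1))
print(pad_to_longest([d1, d2]))
```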
8. Document Signature: Superimposed Coding
- Each document is divided into blocks containing a constant number of distinct words.
- To create a block signature, perform an OR operation on the signatures of all the words in the block.
9. Document Signature: Superimposed Coding
- To create the document signature, all the block
signatures are superimposed.
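A minimal sketch of superimposed coding is given below. The lookup table `WORD_SIGS` and the helper names are hypothetical; the signatures for "free" and "text" are the ones used in this deck's query example, while the one for "search" is invented for illustration.

```python
# Hypothetical table of 12-bit word signatures.
WORD_SIGS = {
    "free": "001000110010",
    "text": "000010101001",
    "search": "100000010100",  # invented for this sketch
}


def split_into_blocks(words, words_per_block):
    """Cut a word sequence into blocks of a constant number of distinct words."""
    blocks, current = [], []
    for w in words:
        if w not in current:
            current.append(w)
        if len(current) == words_per_block:
            blocks.append(current)
            current = []
    if current:  # the last block may hold fewer words
        blocks.append(current)
    return blocks


def superimpose(signatures):
    """Bitwise OR of equal-length bit strings."""
    width = len(signatures[0])
    combined = 0
    for sig in signatures:
        combined |= int(sig, 2)
    return format(combined, "0{}b".format(width))


def block_signatures(words, words_per_block):
    """One superimposed signature per block of the document."""
    return [superimpose([WORD_SIGS[w] for w in block])
            for block in split_into_blocks(words, words_per_block)]


blocks = block_signatures(["free", "text", "search"], 2)
print(blocks)               # block signatures
print(superimpose(blocks))  # document signature from superimposed blocks
```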
10. Query Signature
- The query is converted to a block signature in the same way as a document block.
- Query
  - free  001 000 110 010
  - text  000 010 101 001
  - Block 001 010 111 011
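11. Query on Signature File
- Match? Perform an AND operation between the query signature and the block signature; if the result equals the query signature, they match.
- Query 001 010 111 011
[Figure: the query signature compared against seven block signatures; match results, in order: No, Yes, No, No, Yes, No, Yes.]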
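The matching rule can be sketched as a single bitwise test. The query signature is the one from the slides; the three block signatures are hypothetical, since the original figure's values are not available, and the function name `matches` is an assumption.

```python
def matches(query_sig, block_sig):
    """True when every bit set in the query is also set in the block."""
    q = int(query_sig, 2)
    b = int(block_sig, 2)
    return (q & b) == q


query = "001010111011"
# Hypothetical block signatures for illustration.
blocks = ["001010111011", "101110111111", "001010110011"]
results = [matches(query, b) for b in blocks]
print(results)
```

A block may have extra bits set and still match; it fails only when at least one bit of the query is missing from the block.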
12. Signature File Structure
- Sequential
  - During searching, each signature is compared to the query signature.
  - Time consuming
- Bit-Sliced Signature
  - The signature file undergoes a matrix transposition.
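13. Matrix Transposition
14. Bit-Sliced
[Figure: the sequential file stores one N-bit signature per document (d1 to d4); the bit-sliced file stores N records, each holding one bit position across d1 to d4.]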
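The transposition from a sequential file to a bit-sliced file can be sketched as below. The four 6-bit signatures for d1 to d4 are hypothetical, as is the function name `bit_slice`.

```python
def bit_slice(signatures):
    """Transpose: slice i holds bit i of every document signature."""
    return ["".join(column) for column in zip(*signatures)]


# Hypothetical 6-bit signatures for documents d1..d4.
sequential = ["110100", "011010", "100101", "010011"]
slices = bit_slice(sequential)
for i, s in enumerate(slices):
    print("bit", i, "->", s)
```

Four records of six bits become six records of four bits, and transposing the slices again gives back the sequential file.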
15. Bit-Sliced Signature File
- Retrieval
  - If the ith bit in the query signature is set to 1, retrieve the ith signature block/record.
  - If n bits are set to 1, only n records need to be retrieved.
16. Bit-Sliced Signature File
- Query 001 010 111 011
[Figure: the records retrieved for the bits set in the query. A column is a match when all of its bits are set to 1; here that is the 2nd block.]
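Retrieval on a bit-sliced file can be sketched as follows, assuming each slice is stored as a bit string with one character per document. The slices and the query are hypothetical, and `bit_sliced_search` is an illustrative name.

```python
def bit_sliced_search(slices, query_sig):
    """Return the retrieved slice numbers and the matching document indexes."""
    # Only the slices whose bit is set in the query are read.
    wanted = [i for i, bit in enumerate(query_sig) if bit == "1"]
    retrieved = [slices[i] for i in wanted]
    n_docs = len(slices[0])
    # A document matches when its bit is 1 in every retrieved slice.
    matched = [d for d in range(n_docs)
               if all(s[d] == "1" for s in retrieved)]
    return wanted, matched


# Hypothetical bit-sliced file: 6 slices over 4 documents.
slices = ["1010", "1101", "0100", "1010", "0101", "0011"]
wanted, matched = bit_sliced_search(slices, "010010")
print("slices read:", wanted, "matching documents:", matched)
```

Here the query has two bits set, so only two of the six records are read.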
17. Bit-Sliced Signature File
- Advantages
  - Fewer records are retrieved, so retrieval is faster.
- Disadvantages
  - An update operation becomes very costly.
18. False Drop
- A false drop occurs when a document's signature matches a query's signature but the query's word does not match any word in the document.
- It is possible because two distinct blocks may have the same signature due to
  - the hashing algorithm
  - superimposed coding
- The rate of false drops depends on
  - the size of the signature (N bits)
  - the number of bits set to 1 (k bits)
  - the number of words per block
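A false drop can be reproduced with the block from the query example. The signature assigned to the word "data" is hypothetical; it was chosen so that each of its bits is already set in the block, partly by "free" and partly by "text".

```python
def superimpose(signatures):
    """Bitwise OR of equal-length bit strings."""
    combined = 0
    for sig in signatures:
        combined |= int(sig, 2)
    return format(combined, "0{}b".format(len(signatures[0])))


def matches(query_sig, block_sig):
    """True when every bit set in the query is also set in the block."""
    q = int(query_sig, 2)
    return (q & int(block_sig, 2)) == q


block_words = {"free": "001000110010", "text": "000010101001"}
block_sig = superimpose(list(block_words.values()))

data_sig = "001010000010"  # hypothetical signature for the word "data"
is_false_drop = matches(data_sig, block_sig) and "data" not in block_words
print(block_sig, is_false_drop)
```

Because signature matching alone cannot tell such a block apart from a true hit, the matched text has to be checked against the actual query words before it is returned.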
19. Inverted or Signature?
- Inverted files
- Slower retrieval
- More accurate
- Easier to maintain
- In fact, inverted files are still the most
popular storage structure for information
retrieval.