Signature Files - PowerPoint PPT Presentation

About This Presentation
Title:

Signature Files

Description:

A 'signature' is created as an abstraction of a document. ... ign. gna. nat. atu. tur. ure. re- 12. 7. 3. 2. 3. 1. 9. 12. 8. 1 1 1 0 0 0 1 1 1 0 0 1. 6 ... – PowerPoint PPT presentation

Provided by: Indr1

Transcript and Presenter's Notes

Title: Signature Files


1
Signature Files
  • CSE3201/CSE4500
  • Information Retrieval Systems

2
Signature File for Text Retrieval
  • A signature is created as an abstraction of a
    document.
  • All the signatures that represent the documents
    in the collection are kept in a file called the
    signature file.

3
Word Signature (WS)
  • A word signature
  • is a fixed-length bit string that represents a word.
  • is described by
  • the length (N)
  • the number of bits set to 1 (k)

N = 24
1 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0
k = 7
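
As a quick check of the two parameters, the following minimal sketch reads N and k off the bit string shown on this slide. The variable names are illustrative only.

```python
# The word signature shown on the slide, one token per bit.
signature = "1 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0".split()

n = len(signature)        # N: the length of the signature in bits
k = signature.count("1")  # k: the number of bits set to 1

print(n, k)  # prints 24 7
```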
4
Word Signature Generation
  • Use a hash function to find the position of the
    bit(s) that will be set to 1.
  • Use triplets of characters to generate the word
    signature.
  • Divide the word into overlapping triplets.
  • For each triplet of characters
  • convert the characters to a numeric value (this can
    be the ASCII representation of the characters).
  • Use the number as the input to the hash
    function.
  • The hash function produces a number which
    represents the bit position of the triplet in the
    word signature.
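
The steps above can be sketched as follows. This is only an illustration: the slides do not say which hash function or numeric encoding is used, so `triplet_to_number` and `toy_hash` are hypothetical stand-ins, and no padding character is added at the word boundaries.

```python
def triplets(word):
    """Return the overlapping character triplets of a word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def triplet_to_number(triplet):
    """Assumed encoding: combine the character codes into one integer."""
    value = 0
    for ch in triplet:
        value = value * 256 + ord(ch)
    return value


def toy_hash(number, n_bits):
    """Hypothetical hash: map the number to a bit index in 0..n_bits-1."""
    return number % n_bits


def word_signature(word, n_bits=12):
    """Build a word signature as a list of 0/1 values, leftmost bit first."""
    bits = [0] * n_bits
    for t in triplets(word):
        bits[toy_hash(triplet_to_number(t), n_bits)] = 1
    return bits


print(triplets("signature"))
print(word_signature("signature"))
```

Two triplets can hash to the same position, so the number of bits set to 1 may be smaller than the number of triplets. Words shorter than three characters produce no triplets in this sketch.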

5
Word Signature Generation
  • Example
  • The signature 111000111001 is generated for the
    word "signature".
  • Bit positions are read from left to right.
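
The presentation description lists the values 12, 7, 3, 2, 3, 1, 9, 12, 8 next to the triplets of "signature". Assuming these are the bit positions returned by the hash function, counted from 1 at the left, the sketch below shows that they reproduce the signature on this slide. The function name is illustrative.

```python
def signature_from_positions(positions, n_bits):
    """Set the given 1-based positions (left to right) in an n_bits string."""
    bits = ["0"] * n_bits
    for p in positions:
        bits[p - 1] = "1"  # repeated positions simply set the same bit again
    return "".join(bits)


# Values taken from the presentation description; the 1-based
# interpretation is an assumption.
positions = [12, 7, 3, 2, 3, 1, 9, 12, 8]
print(signature_from_positions(positions, 12))  # prints 111000111001
```

Positions 3 and 12 each occur twice, which is why nine hash values set only seven bits.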

6
Document Signature (DS)
  • A document signature can be created using one of
    two methods
  • concatenation of word signatures
  • superimposed coding.

7
Document Signature: Concatenation of WS
  • The length of document signatures (DS) can vary.
  • A fixed number of bits may precede the document
    signature (DS) to indicate the length of DS.
  • It is possible to fix the length of the Document
    Signature (DS).
  • The length can be set to that of the longest
    document signature in the collection.
  • Shorter document signatures are padded with
    extra 0 bits.
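
Both variants described above can be sketched with bit strings. The 16-bit length prefix is an assumed width, and all function names are illustrative.

```python
def concatenate(word_signatures):
    """Document signature as the concatenation of word signatures."""
    return "".join(word_signatures)


def with_length_prefix(ds, prefix_bits=16):
    """Variable-length DS: prepend its length in a fixed number of bits."""
    if len(ds) >= 2 ** prefix_bits:
        raise ValueError("document signature too long for the prefix")
    return format(len(ds), f"0{prefix_bits}b") + ds


def pad_to_fixed_length(ds, fixed_length):
    """Fixed-length DS: pad shorter signatures with extra 0 bits."""
    if len(ds) > fixed_length:
        raise ValueError("document signature exceeds the fixed length")
    return ds.ljust(fixed_length, "0")


ds = concatenate(["101", "010"])
print(with_length_prefix(ds, prefix_bits=8))  # prints 00000110101010
print(pad_to_fixed_length(ds, 9))             # prints 101010000
```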

8
Document Signature: Superimposed Coding
  • Each document is divided into blocks containing a
    constant number of distinct words.
  • To create a block signature, perform an OR operation
    on the word signatures of all the words in the block.
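
A minimal sketch of block signatures built by superimposed coding follows. The signatures for "free" and "text" are the ones used on the query slide; the other two are made up for illustration. Removing duplicates across the whole document before grouping is a simplification of "a constant number of distinct words" per block.

```python
from functools import reduce
from operator import or_

# 12-bit word signatures. "free" and "text" come from the query slide;
# "file" and "index" are hypothetical values.
WORD_SIG = {
    "free": 0b001000110010,
    "text": 0b000010101001,
    "file": 0b100000010001,
    "index": 0b010100000100,
}


def split_into_blocks(words, words_per_block):
    """Group the distinct words (in first-occurrence order) into blocks."""
    distinct = list(dict.fromkeys(words))
    return [distinct[i:i + words_per_block]
            for i in range(0, len(distinct), words_per_block)]


def block_signature(words, word_sig):
    """OR together the word signatures of all the words in a block."""
    return reduce(or_, (word_sig[w] for w in words), 0)


blocks = split_into_blocks(["free", "text", "free", "file", "index"], 2)
for block in blocks:
    print(block, format(block_signature(block, WORD_SIG), "012b"))
```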

9
Document Signature: Superimposed Coding
  • To create the document signature, all the block
    signatures are superimposed.

10
Query Signature
  • The query is converted to a block signature in the
    same way as a block of a document.
  • Query: "free text"
  • free  001 000 110 010
  • text  000 010 101 001
  • Block 001 010 111 011 (bitwise OR of the two word
    signatures)
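
The query signature above can be reproduced with a bitwise OR. This is a small sketch; the helper names `parse_bits` and `format_bits` are illustrative.

```python
def parse_bits(text):
    """Parse a bit string such as '001 000 110 010' into an integer."""
    return int(text.replace(" ", ""), 2)


def format_bits(value, n_bits=12, group=3):
    """Format an integer as a bit string in groups, as on the slide."""
    bits = format(value, f"0{n_bits}b")
    return " ".join(bits[i:i + group] for i in range(0, n_bits, group))


free = parse_bits("001 000 110 010")
text = parse_bits("000 010 101 001")
query = free | text  # superimpose the two word signatures

print(format_bits(query))  # prints 001 010 111 011
```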

11
Query on Signature File
Match? Perform an AND operation between the query
signature and the block signature. If the result equals
the query signature, they match.
Query 001 010 111 011
[Figure: block signatures, each marked Yes or No for a match against the query]
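
The match test can be sketched as below. The query is the one from the slide; the two block signatures are made up for illustration, since the ones on the original slide are not available in this transcript.

```python
def matches(block_sig, query_sig):
    """A block matches if every bit set in the query is also set in the block."""
    return (block_sig & query_sig) == query_sig


query = 0b001010111011

# Hypothetical block signatures.
block_a = 0b101010111011  # contains all query bits plus one extra bit
block_b = 0b001010110011  # one query bit is missing

print(matches(block_a, query))  # prints True
print(matches(block_b, query))  # prints False
```

A match only means the block may contain the query words, so a matching block still has to be checked against the text.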
12
Signature File Structure
  • Sequential
  • During searching, each signature is compared
    to the query signature.
  • Time consuming
  • Bit-Sliced Signature
  • The signature file is transposed like a matrix.
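
The transposition can be sketched with `zip`. The three 4-bit signatures are hypothetical; each output string is one bit slice, holding the same bit position for every signature.

```python
def transpose(signatures):
    """Turn equal-length signatures (one per block) into bit slices
    (one per bit position)."""
    return ["".join(column) for column in zip(*signatures)]


# Hypothetical sequential signature file: three blocks, N = 4 bits each.
signatures = ["1100", "0110", "1010"]

slices = transpose(signatures)
print(slices)  # prints ['101', '110', '011', '000']
```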

13
Matrix Transposed
14
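Bit-Sliced
[Figure: the sequential layout stores one signature of N bits for each of d1 to d4; the bit-sliced layout stores N records, each holding one bit position for d1 to d4]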
15
Bit Sliced Signature File
  • Retrieval
  • If the ith bit in the query signature is set to 1,
    retrieve the ith signature block/record.
  • If n bits are set to 1, only n records need to
    be retrieved.
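
A minimal sketch of retrieval from a bit-sliced file follows. Only the slices for the 1 bits of the query are read, and a block matches if it has a 1 in every one of them. The data and the function name are illustrative.

```python
def bit_sliced_search(slices, query):
    """Return (indices of matching blocks, number of slices fetched).

    slices[i] holds bit i of every block signature as a bit string;
    query is a bit string of the same length as one signature.
    """
    n_blocks = len(slices[0])
    candidates = [True] * n_blocks
    fetched = 0
    for i, q in enumerate(query):
        if q == "1":
            fetched += 1  # only slices for 1 bits are retrieved
            candidates = [c and b == "1"
                          for c, b in zip(candidates, slices[i])]
    return [j for j, c in enumerate(candidates) if c], fetched


# Hypothetical slices for the block signatures 1100, 0110 and 1010.
slices = ["101", "110", "011", "000"]
print(bit_sliced_search(slices, "0110"))  # prints ([1], 2)
```

Here two of the four slices are read, and only the second block (index 1, signature 0110) has both query bits set.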

16
Bit Sliced Signature File
Query 001 010 111 011
Retrieved records
Match, because all bits in this column are set to
1 (the 2nd block).
17
Bit Sliced Signature File
  • Advantages
  • Fewer records are retrieved, so retrieval is
    faster.
  • Disadvantages
  • An update operation becomes very costly.

18
False Drop
  • A false drop occurs when a document's signature
    matches a query's signature but the query's word
    does not match any word in the document.
  • It is possible because two distinct blocks may have
    the same signature due to
  • the hashing algorithm
  • superimposed coding
  • The rate of false drops depends on
  • the size of the signature (N bits)
  • the number of bits set to 1 (k bits)
  • the number of words per block
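
How the three factors interact can be illustrated with a rough estimate. The sketch assumes that each word sets k bit positions chosen independently and uniformly at random, and that the query is a single word not present in the block. This is a textbook approximation for superimposed codes, not a formula taken from the slides, and the function name is illustrative.

```python
def false_drop_estimate(n_bits, k, words_per_block):
    """Approximate probability that a block signature covers the
    signature of a word that is not in the block."""
    # Probability that a given bit of the block signature is 1 after
    # superimposing words_per_block words with k bit choices each.
    p_bit_set = 1.0 - (1.0 - 1.0 / n_bits) ** (k * words_per_block)
    # All k bits of the query word must already be set in the block.
    return p_bit_set ** k


for words in (1, 4, 16):
    print(words, false_drop_estimate(n_bits=24, k=7, words_per_block=words))
```

Under these assumptions the estimate grows as more words are packed into a block and shrinks as the signature gets longer.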

19
Inverted or Signature?
  • Inverted files
  • Slower retrieval
  • More accurate
  • Easier to maintain
  • In fact, inverted files are still the most
    popular storage structure for information
    retrieval.