Title: Chapter 8 Indexing and Searching
1Chapter 8: Indexing and Searching
- Hsin-Hsi Chen
- Department of Computer Science and Information Engineering
- National Taiwan University
2Introduction
- searching
- Online text searching
- Scan the text sequentially
- Indexed searching
- Build data structures over the text to speed up the search
- Semi-static collections: updated at reasonably regular intervals
- indexing techniques
- inverted files
- suffix (PAT) arrays
- signature files
3Assumptions
- n: the size of the text database
- m: the length of the search pattern (m < n)
- M: the amount of memory available
- n': the size of the text that is modified (n' < n)
- Experiments
- 32-bit Sun UltraSparc-1 of 167 MHz with 64 MB of RAM
- TREC-2 (WSJ, DOE, FR, ZIFF, AP)
4File Structures for IR
- lexicographical indices
- indices that are sorted
- inverted files
- Patricia (PAT) trees (Suffix trees and arrays)
- cluster file structures (see Chapter 7 in document clustering)
- indices based on hashing
- signature files
5Inverted Files
6Inverted Files
- Each document is assigned a list of keywords or attributes.
- Each keyword (attribute) is associated with operational relevance weights.
- An inverted file is the sorted list of keywords (attributes), with each keyword having links to the documents containing that keyword.
- Penalty
- the size of inverted files ranges from 10% to 100% or more of the size of the text itself
- need to update the index as the data set changes
7Inverted Files (Continued)
Text, with the starting character position of each word:
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.
Inverted index (vocabulary: occurrences):
letters: 60
made: 50
many: 28
text: 11, 19
words: 33, 40
- addressing granularity
- inverted list: word positions or character positions
- inverted file: document
- Heaps' law: the vocabulary grows as O(n^β), with β between 0.4 and 0.6
- Vocabulary for 1 GB of the TREC-2 collection: 5 MB (before stemming and normalization)
- Occurrences: the extra space is O(n), 30% to 40% of the text size
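The example above can be reproduced with a few lines of Python. This is a minimal sketch, not the chapter's implementation: the stop-word list and the function name `build_inverted_index` are assumptions, chosen so that the resulting vocabulary matches the slide. Positions are 1-based character offsets.

```python
import re
from collections import defaultdict

TEXT = "This is a text. A text has many words. Words are made from letters."

# Hypothetical stop list, chosen so the vocabulary matches the slide.
STOP_WORDS = {"this", "is", "a", "has", "are", "from"}


def build_inverted_index(text, stop_words=STOP_WORDS):
    """Map each lowercased non-stop word to its 1-based character positions."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in stop_words:
            index[word].append(match.start() + 1)
    # Sort the vocabulary alphabetically, as in an inverted file.
    return dict(sorted(index.items()))


index = build_inverted_index(TEXT)
for word, positions in index.items():
    print(word, positions)
```

With this stop list the vocabulary is letters, made, many, text, words, and the occurrence lists hold character positions in increasing order.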
8Block Addressing
- Full inverted indices
- Point to exact occurrences
- Block addressing
- Point to the blocks where the word appears
- Pointers are smaller
- 5% overhead over the text size
- A block may be a fixed-size block, a file, a document, a Web page, ...
- Are blocks the retrieval units?
Text, divided into Block 1 to Block 4:
This is a text. A text has many words. Words are made from letters.
Inverted index (vocabulary: occurrences, as block numbers):
letters: 4
made: 4
many: 2
text: 1, 2
words: 3
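A block-addressing index stores block numbers instead of exact positions. The sketch below assumes fixed-size blocks of 16 characters and the same hypothetical stop list as before; the slide does not state its block boundaries, so the block size is only a choice that happens to reproduce the occurrences shown.

```python
import re
from collections import defaultdict

TEXT = "This is a text. A text has many words. Words are made from letters."
STOP_WORDS = {"this", "is", "a", "has", "are", "from"}  # hypothetical stop list
BLOCK_SIZE = 16  # assumed block length in characters


def build_block_index(text, block_size=BLOCK_SIZE, stop_words=STOP_WORDS):
    """Map each non-stop word to the 1-based numbers of the blocks it starts in."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word in stop_words:
            continue
        block = match.start() // block_size + 1
        # Words are scanned left to right, so a repeated block is always last.
        if not index[word] or index[word][-1] != block:
            index[word].append(block)
    return dict(sorted(index.items()))


block_index = build_block_index(TEXT)
for word, blocks in block_index.items():
    print(word, blocks)
```

Both occurrences of "words" fall in block 3, so the list holds a single pointer; finding the exact positions then requires scanning that block.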
9Sorted array implementation of an inverted file
the documents in which keyword occurs
10Full inversion (all words, exact positions,
4-byte pointers)
2 or 1 byte(s) per pointer independent of the
text size
document size (10KB), 1, 2, 3 bytes per pointer,
depending on text size
All words are indexed
Stop words are not indexed
11Searching
Three general steps:
- Vocabulary search
- Identify the words and patterns in the query
- Search them in the vocabulary
- Retrieval of occurrences
- Retrieve the lists of occurrences of all the words
- Manipulation of occurrences
- Solve phrases, proximity, or Boolean operations
- Find the exact word positions when block addressing is used
12Structures used in Inverted Files
- Sorted Arrays
- store the list of keywords in a sorted array
- using a standard binary search
- advantage: easy to implement
- disadvantage: updating the index is expensive
- B-Trees
- Tries
- Hashing Structures
- Combinations of these structures
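13Trie
Text, with the starting character position of each word:
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.
[Figure: vocabulary trie. The root branches on l, m, t, and w; under m, the path continues with a and then splits on d and n. The leaves hold the occurrences.]
letters: 60
made: 50
many: 28
text: 11, 19
words: 33, 40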
14B-trees
[Figure: B-tree over the vocabulary. Node keys shown: F, M; Al, Br, E; Gr, H, Ja, L; Rut, Uni. Leaf entries shown: Afgan 2, Russian 9, Ruthenian 1.]
15Sorted Arrays
1. The input text is parsed into a list of words along with their location in the text (a time- and storage-consuming operation).
2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order.
3. Add term weights, or reorganize or compress the files.
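The three steps can be sketched as follows. The records are made-up sample data, and using the within-record frequency as the term weight is an assumption made for illustration; the slides do not fix a weighting scheme.

```python
from itertools import groupby

# Made-up sample records (record id -> text).
RECORDS = {
    1: "report on indexing",
    2: "searching and indexing",
    3: "annual report",
}


def invert(records):
    """Build {term: [(record id, frequency), ...]} by parsing, sorting, grouping."""
    # Step 1: parse into (term, location) pairs, in location order.
    pairs = [(term, rec_id)
             for rec_id, text in records.items()
             for term in text.lower().split()]
    # Step 2: invert by sorting into alphabetical order (then by record id).
    pairs.sort()
    # Step 3: group equal terms and attach a simple weight (term frequency).
    inverted = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        rec_ids = [rec_id for _, rec_id in group]
        inverted[term] = [(rec_id, rec_ids.count(rec_id))
                          for rec_id in sorted(set(rec_ids))]
    return inverted


inverted = invert(RECORDS)
print(inverted["report"])
```

In this sample the term "report" ends up with postings for two records, and the explicit sort in step 2 is the expensive part for large collections.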
16Inversion of Word List
report appears in two records
17Dictionary and postings file
Idea: the file to be searched should be as short as possible, so split a single file into two pieces:
- dictionary (vocabulary)
- postings file (occurrences), holding (document, frequency) entries
e.g., a data set with 38,304 records, 250,000 unique terms, and 88 postings/record
18Producing an Inverted File for Large Data Sets
without Sorting
Idea: avoid the use of an explicit sort by using a right-threaded binary tree.
- Each tree node holds the current number of term postings and the storage location of the postings list.
- To produce the inverted file, traverse the binary tree and the linked postings lists.
19Indexing Statistics
Final index: only 8% of the input text size for the 50 MB database, and 14% of the input text size for the larger database.
Working storage (the storage needed to build the index): not much larger than the size of the final index for the new indexing method.
p.1718
p.20
933
2GB
the same
20A Fast Inversion Algorithm
- Principle 1: large primary memories are available. If databases can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.
- Principle 2: exploit the inherent order of the input data. It is very expensive to use polynomial or even n log n sorting algorithms for large files.
concept postings/ pointers
See p. 22.
22Sample document vector
document number
concept number
(one concept number for each unique word)
Similar to the document- word list shown in p. 16.
The concept numbers are sorted within
document numbers, and document numbers
are sorted within collection
23FAST-INV algorithm (Continued)
- HCN = highest concept number in dictionary (total number of concepts in dictionary)
- L = number of concept/document (document/concept) pairs in the collection
- M = available primary memory size, in bytes
- M >> HCN, M < L
- Choose j such that L/j < M, so that each part will fit into primary memory
- HCN/j concepts, approximately, are associated with each part
- LL = length of current load, i.e., the number of concept-weight pairs (8 bytes for each concept-weight pair)
- S = spread of concept numbers in current load (4 bytes for each count of postings)
- Constraint: $8 \cdot LL + 4 \cdot S < M$
24Preparation
1. Allocate an array, con_entries_cnt, of size HCN.
2. For each <doc, con> entry in the document vector file, increment con_entries_cnt[con].
0 (1,2), (1,4).. 2 (2,3) ..
3 (3,1), (3,2), (3,5) ... 6 (4,2), (4,3) .
8 ...
25Preparation (continued)
5. For each <con, count> pair obtained from con_entries_cnt: if there is no room for documents with this concept to fit in the current load, then create an entry in the load table and initialize the next load entry; otherwise update information for the current load table entry.
26The range of concepts for each primary load
- LL = length of current load
- S = end concept - start concept + 1
- (space for a concept/weight pair) * LL + (space for each concept to store its count of postings) * S < M
- Terms named in the remaining annotation: (Doc, Con) pairs, the load of each Con, the load file, CONPTR, offset
- copy rather than sort
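The "copy rather than sort" idea can be illustrated for a single load that fits in memory. This is a simplified sketch and not the full FAST-INV algorithm (no load table, no weights); `invert_by_counting` and the sample pairs are assumptions made for the example. One pass counts the postings per concept, a prefix sum turns the counts into offsets, and a second pass copies every document number straight into its final slot.

```python
def invert_by_counting(doc_vectors, hcn):
    """Invert (doc, con) pairs, concepts numbered 1..hcn, without sorting."""
    # Pass 1: count the postings of each concept (con_entries_cnt).
    con_entries_cnt = [0] * (hcn + 1)
    for _, con in doc_vectors:
        con_entries_cnt[con] += 1

    # Prefix sums: offsets[con] is where the postings of con start.
    offsets = [0] * (hcn + 2)
    for con in range(1, hcn + 1):
        offsets[con + 1] = offsets[con] + con_entries_cnt[con]

    # Pass 2: copy each document number into the next free slot of its concept.
    postings = [None] * len(doc_vectors)
    next_free = offsets[:]
    for doc, con in doc_vectors:
        postings[next_free[con]] = doc
        next_free[con] += 1

    return {con: postings[offsets[con]:offsets[con + 1]]
            for con in range(1, hcn + 1) if con_entries_cnt[con]}


# Made-up document vectors: (document number, concept number), sorted by document.
PAIRS = [(1, 2), (1, 4), (2, 3), (3, 1), (3, 2), (3, 5), (4, 2), (4, 3)]
inverted_file = invert_by_counting(PAIRS, hcn=5)
print(inverted_file)
```

Because the input is already ordered by document number, each postings list comes out ordered by document number as well, and the total work is linear in the number of pairs.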
27PAT Trees and PAT Arrays (Suffix Trees and Suffix Arrays)
28PAT Trees and PAT Arrays
- Problems of traditional IR models
- Documents and words are assumed.
- Keywords must be extracted from the text (indexing).
- Queries are restricted to keywords.
- New indices for text
- A text is regarded as a long string.
- Each position corresponds to a semi-infinite string (sistring).
- suffix: a string that goes from a text position to the end of the text
- Each suffix is uniquely identified by its position
- no structures and no keywords
29Text
This is a text. A text has many words. Words are made from letters.
text. A text has many words. Words are made from letters.
text has many words. Words are made from letters.
many words. Words are made from letters.
Suffixes
Words are made from letters.
different
made from letters.
Index points are selected from the text,
which point to the beginning of the text
positions which are retrievable.
letters.
30PATRICIA
- trie
- branch decision node: search decision-markers
- element node: real data
- if branch decisions are made on each bit, a complete binary tree is formed where the depth is equal to the number of bits of the longest strings
- many element nodes and branch nodes are null
31PATRICIA (Continued)
- compressed digital search trie
- the null element nodes and branch nodes are removed
- an additional field to denote the comparing bit for the branching decision is included in each decision node
- a matching between the search results and their search keys is required because only some of the bits are compared during the search process
32PATRICIA (Continued)
- Practical Algorithm to Retrieve Information Coded in Alphanumeric
- augmented branch node: an additional field for storing elements is included in the branch node
- each element is stored in an upper node or in itself
- an additional root node is needed; note that the number of leaf nodes is always greater than that of internal nodes by one
33PAT-tree
- PATRICIA + semi-infinite strings
- a text T with n basic units u1 u2 ... un
- semi-infinite strings: u1 u2 ... un ..., u2 u3 ... un ..., u3 u4 ... un ..., ...
- an end to the left but none to the right
- store the starting positions of semi-infinite strings in a text using PATRICIA
34semi-infinite strings
- Example
Text: Once upon a time, in a far away land ...
sistring 1: Once upon a time ...
sistring 2: nce upon a time ...
sistring 8: on a time, in a ...
sistring 11: a time, in a far ...
sistring 22: a far away land ...
- Compare sistrings: 22 < 11 < 2 < 8 < 1
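The ordering of the sistrings can be checked directly. In this sketch a sistring is represented as the suffix that starts at a 1-based position, and the comparison ignores letter case; the case-insensitive comparison is an assumption, made because it reproduces the order given on the slide (in plain ASCII order the capital "O" would sort first).

```python
TEXT = "Once upon a time, in a far away land"


def sistring(text, position):
    """Return the semi-infinite string that starts at a 1-based position."""
    return text[position - 1:]


def sort_sistrings(text, positions):
    """Order sistring positions lexicographically, ignoring letter case."""
    return sorted(positions, key=lambda p: sistring(text, p).lower())


order = sort_sistrings(TEXT, [1, 2, 8, 11, 22])
print(order)  # [22, 11, 2, 8, 1]
```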
35PAT Tree
- PAT Tree: a Patricia tree constructed over all the possible sistrings of a text
- Patricia tree
- a digital tree where the individual bits of the keys are used to decide on the branching
- each internal node indicates which bit of the query is used for branching
- absolute bit position
- a count of the number of bits to skip
- each external node is a sistring, i.e., the integer displacement
36Example
Text: 01100100010111 ...
sistring 1: 01100100010111 ...
sistring 2: 1100100010111 ...
sistring 3: 100100010111 ...
sistring 4: 00100010111 ...
sistring 5: 0100010111 ...
sistring 6: 100010111 ...
sistring 7: 00010111 ...
sistring 8: 0010111 ...
[Figure: PAT trees obtained while inserting the sistrings. Each external node holds a sistring (its integer displacement). Each internal node holds either the total displacement of the bit to be inspected, or a skip counter and pointers.]
37Example (Continued)
Text: 01100100010111 ...
sistring 1: 01100100010111 ...
sistring 2: 1100100010111 ...
sistring 3: 100100010111 ...
sistring 4: 00100010111 ...
sistring 5: 0100010111 ...
sistring 6: 100010111 ...
sistring 7: 00010111 ...
sistring 8: 0010111 ...
[Figure: the PAT tree after each further insertion, up to sistring 8. Internal nodes are labeled with the bit position to inspect, external nodes with sistring numbers.]
Search: 00101
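38Suffix Trie and Suffix Tree
Text, with the starting character position of each word:
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.
[Figure: the suffix trie over the index points 11, 19, 28, 33, 40, 50, and 60, and the suffix tree obtained from it by compressing unary paths; the internal nodes of the suffix tree carry the character positions 1, 3, 5, and 6.]
space overhead: 120% to 240% over the text size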
39PAT Trees Represented as Arrays
- indirect binary search vs. sequential search
- Keep the external nodes in the bucket in the same relative order as they would be in the tree
PAT array: 7 4 8 5 1 6 3 2
Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
[Figure: the PAT tree of the previous example with its external nodes read from left to right, which gives the PAT array.]
40Suffix Tree, Suffix Array, and Supra-Index
Text, with the starting character position of each word:
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.
[Figure: (1) the suffix tree, 120% to 240% overhead; (2) the suffix array, 40% overhead; (3) a supra-index built over the suffix array.]
41difference between suffix array and inverted list
- suffix array: the occurrences of each word are sorted lexicographically by the text following the word
- inverted list: the occurrences of each word are sorted by text position
[Figure: the sample text (This is a text. A text has many words. Words are made from letters.) with a vocabulary supra-index over its suffix array, contrasted with the inverted list.]
42Indexing Points
- The above example assumes every position in the text is indexed: n external nodes, one for each position in the text
- word and phrase searches: only the sistrings that are at the beginning of words are necessary
- trade-off between the size of the index and the search requirements
43Prefix searching
- idea: every subtree of the PAT tree has all the sistrings with a given prefix.
- Search time is proportional to the query length: exhaust the prefix or go down to an external node.
- Example: search for the prefix 10100 and its answer
44Searching PAT Trees as Arrays
- Prefix searching and range searching: do an indirect binary search over the array, with the results of the comparisons being less than, equal, and greater than.
- Example: search for the prefix 100 and its answer.
PAT array: 7 4 8 5 1 6 3 2
Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
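The prefix search over the PAT array can be sketched with two binary searches. The array stores only positions, so every comparison reads the text indirectly through the stored pointer. The function names are assumptions, only the 14 bits shown on the slide are used as the text, and only sistrings 1 to 8 are indexed.

```python
TEXT = "01100100010111"


def build_pat_array(text, index_points):
    """Sort 1-based index points by the sistring that starts at each of them."""
    return sorted(index_points, key=lambda p: text[p - 1:])


def prefix_search(text, pat_array, prefix):
    """Return the index points whose sistrings start with the given prefix."""

    def key(i):
        # Indirect access: follow the pointer into the text.
        start = pat_array[i] - 1
        return text[start:start + len(prefix)]

    # First entry whose truncated sistring is >= prefix.
    lo, hi = 0, len(pat_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if key(mid) < prefix:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First entry whose truncated sistring is > prefix.
    hi = len(pat_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if key(mid) <= prefix:
            lo = mid + 1
        else:
            hi = mid
    return pat_array[first:lo]


pat_array = build_pat_array(TEXT, range(1, 9))
print(pat_array)                              # [7, 4, 8, 5, 1, 6, 3, 2]
print(prefix_search(TEXT, pat_array, "100"))  # [6, 3]
```

The answer is a contiguous slice of the array, which is also why range searching reduces to two binary searches for the limiting patterns.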
45Proximity Searching
- Find all places where s1 is at most a fixed (given by a user) number of characters away from s2, e.g., in <4> ation => insulation, international, information
- Algorithm
1. Search for s1 and s2.
2. Select the smaller answer set from these two sets and sort it by position.
3. Traverse the unsorted answer set, searching every position in the sorted set and checking whether the distance between positions satisfies the proximity condition.
- sort + traverse time: $(m_1 + m_2) \log m_1$ (assume $m_1 < m_2$)
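The three steps translate into a short routine. This is a sketch under the assumption that the two answer sets are already available as lists of text positions; `proximity_search` is a hypothetical name, and the standard-library `bisect_left` stands in for "searching every position in the sorted set".

```python
from bisect import bisect_left


def proximity_search(positions_1, positions_2, max_distance):
    """Return pairs (p1, p2) with abs(p1 - p2) <= max_distance."""
    # Step 2: sort only the smaller answer set.
    swapped = len(positions_1) > len(positions_2)
    small, large = (positions_2, positions_1) if swapped else (positions_1, positions_2)
    small_sorted = sorted(small)

    # Step 3: traverse the larger set, binary-searching the sorted one.
    matches = []
    for p in large:
        i = bisect_left(small_sorted, p - max_distance)
        while i < len(small_sorted) and small_sorted[i] <= p + max_distance:
            q = small_sorted[i]
            # Keep the (s1 position, s2 position) orientation in the output.
            matches.append((p, q) if swapped else (q, p))
            i += 1
    return matches


# Occurrences of "text" (11, 19) and "many" (28) in the sample text of the slides.
print(proximity_search([11, 19], [28], max_distance=10))  # [(19, 28)]
```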
46Range Searching
- Search for all the strings within a certain lexicographical range.
- e.g., the range abc .. acc contains abracadabra and acacia, but not abacus or acrimonious
- Algorithm
- Search each end of the defining intervals.
- Collect all the sub-trees between (and including) them.
47Searching Suffix Array
- $P_1 \le S < P_2$
- Binary search both limiting patterns in the suffix array.
- Find all the elements lying between both positions
48Longest Repetition Searching
- the match between two different positions of a text where this match is the longest in the entire text, e.g., 0 1 1 0 0 1 0 0 0 1 0 1 1 1
- the tallest internal node gives a pair of sistrings that match for the greatest number of characters
Text: 01100100010111
sistring 1: 01100100010111
sistring 2: 1100100010111
sistring 3: 100100010111
sistring 4: 00100010111
sistring 5: 0100010111
sistring 6: 100010111
sistring 7: 00010111
sistring 8: 0010111
[Figure: the PAT tree over sistrings 1 to 8.]
49Most Significant or Most Frequent Matching
- the most frequently occurring strings within the text database, e.g., the most frequent trigram
- find the most frequent trigram: find the largest subtree at a distance of 3 characters from the root
[Figure: the PAT tree over sistrings 1 to 8. The tallest internal node gives a pair of sistrings that match for the greatest number of characters, i.e., bits 1, 2, 3 are the same for sistrings 100100010111 and 100010111.]
50Building PAT Trees as Patricia Trees
- bucketing of external nodes
- collect more than one external node
- a bucket replaces any subtree with size less than a certain constant (b): this saves a significant number of internal nodes
- the external nodes inside a bucket do not have any structure associated with them: this increases the number of comparisons for each search
51Building PAT Trees as Patricia Trees(Continued)
- mapping the tree onto the disk using super-nodes
- Allocate as much as possible of the tree in a disk page
- Every disk page has a single entry point, contains as much of the tree as possible, and terminates either in external nodes or in pointers to other disk pages
- The pointers in internal nodes address either a disk page or another node inside the same page
- disk pages contain on the order of 1,000 internal/external nodes
- on average, each disk page contains about 10 steps of a root-to-leaf path
52Suffix array construction (in MM)
- The suffix array and the text must be in main memory
- The suffix array is the set of pointers, lexicographically sorted
- The pointers are collected in ascending text order
- The pointers are sorted by the text they point to (accessing the text at random positions)
53Suffix array construction (in MM)
- Algorithm
- All the suffixes are bucket-sorted according to the first letter only
- At iteration i, the suffixes begin already sorted by their first $2^{i-1}$ letters and end up sorted by their first $2^i$ letters.
- To sort the text positions $T_a$ and $T_b$ in the suffix array,
- determine the relative order between text positions $T_{a+2^{i-1}}$ and $T_{b+2^{i-1}}$ in the current stage of the search
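The doubling idea can be sketched as follows. Note one swapped-in simplification: each round uses Python's general-purpose sort on pairs of ranks instead of the bucket sort described above, so this version is easy to follow but not the linear-per-round method of the slides. The function name and the 0-based positions are choices made for the example.

```python
def suffix_array(text):
    """Return the 0-based suffix array of text, built by prefix doubling."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]  # order by the first letter only
    order = list(range(n))
    k = 1
    while True:
        # Suffixes are ordered by their first k letters (via rank); comparing
        # (rank[i], rank[i + k]) orders them by their first 2k letters.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        order.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            prev, cur = order[j - 1], order[j]
            new_rank[cur] = new_rank[prev] + (key(cur) != key(prev))
        rank = new_rank
        if rank[order[-1]] == n - 1:  # all ranks distinct: fully sorted
            return order
        k *= 2


text = "01100100010111"
sa = suffix_array(text)
print([p + 1 for p in sa])  # 1-based text positions in lexicographic order
```

Restricting the result to positions 1 to 8 gives 7 4 8 5 1 6 3 2, the PAT array used in the earlier slides.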
54Construction of Suffix Arrays for Large Text
- Split the text into blocks that can be sorted in MM.
- Build the suffix array for the first block
- Build the suffix array for the second block
- Merge both suffix arrays
- Build the suffix array for the third block
- Merge the suffix array with the previous one
- Build the suffix array for the fourth block
- Merge the new suffix array with previous one
55Merge Step
- How to merge a large suffix array with the small suffix array?
- Determine how many elements of the large array are to be placed between each pair of elements in the small array
- Read the large array's text sequentially into main memory
- Each suffix of that text is searched in the small suffix array
- Increment the appropriate counter
- Use the information to merge the arrays without accessing the text
56Merge Step (Continued)
[Figure: (a) the local suffix array is built: small text -> small suffix array. (b) Counters are computed: long text, small text, small suffix array -> counters. (c) The suffix arrays are merged: long suffix array, small suffix array, counters -> final suffix array.]
57Signature Files
58Signature Files
- basic idea: inexact filter
- discard many of the nonqualifying items
- qualifying items definitely pass the test
- false hits or false drops may also pass accidentally
- procedure
- Documents are stored sequentially in the text file.
- Their signatures (hash-coded bit patterns) are stored in the signature file.
- When a query arrives, scan the signature file, discard nonqualifying documents, and check the rest.
59Merits of Signature Files
- faster than full text scanning
- 1 or 2 orders of magnitude faster
- modest space overhead
- 10%-15% vs. 50%-300% (inversion)
- insertions can be handled more easily than in inversion
- append only
- no reorganization or rewriting
60Basic Concepts
- Use superimposed coding to create the signature.
- Each document is divided into logical blocks.
- A block contains D distinct non-common words.
- Each word yields a word signature.
- A word signature is an F-bit pattern, with m 1-bits.
- Each word is divided into successive, overlapping triplets, e.g., free --> _fr, fre, ree, ee_ (where _ denotes a blank)
- Each such triplet is hashed to a bit position.
- The word signatures are ORed to form the block signature.
- Block signatures are concatenated to form the document signature.
- Notation: F is called B in the textbook, and m is called l.
61Basic Concepts (Continued)
- Example (D=2, F=12, m=4)
word: signature
free: 001 000 110 010
text: 000 010 101 001
block signature: 001 010 111 011
- Search
- Use the hash function to determine the m 1-bit positions.
- Examine each block signature for 1s in those bit positions where the signature of the search word has a 1.
62A Signature File
Text, divided into Block 1 to Block 4:
This is a text. A text has many words. Words are made from letters.
Text signature (one block signature per block):
000101 110101 100100 101101
Word signatures:
h(text) = 000101
h(many) = 110000
h(words) = 100100
h(made) = 001100
h(letters) = 100001
63Basic Concepts (Continued)
- false alarm (false hit, or false drop) probability $F_d$: the probability that a block signature seems to qualify, given that the block does not actually qualify: $F_d = \mathrm{Prob}\{\text{signature qualifies} \mid \text{block does not}\}$
- Ensure the probability of a false alarm is low enough while keeping the signature file as short as possible
- For a given value of F, the value of m that minimizes the false drop probability is such that each row of the matrix contains 1s with probability 0.5: $F_d = 2^{-m}$ and $F \ln 2 = m D$, i.e., $m = \ln 2 \cdot F / D$
- The signature file is an $N \times F$ binary matrix
- F: signature size in bits
- m: number of bits per word
- D: number of distinct noncommon words per document
- $F_d$: false drop probability
64Space overhead of the index: $(1/80)(F/D)$, where F is measured in bits and D in words
- 10% overhead: false drop probability close to 2%
- $10\% = (1/80)(F/D) \Rightarrow F/D = 8$, $m = 8 \ln 2 = 5.545$, $F_d = 2^{-5.545} \approx 2\%$
- 20% overhead: false drop probability close to 0.046%
- $20\% = (1/80)(F/D) \Rightarrow F/D = 16$, $m = 16 \ln 2 = 11.09$, $F_d = 2^{-11.09} \approx 0.046\%$
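The two figures can be recomputed from the formulas $m = \ln 2 \cdot F/D$ and $F_d = 2^{-m}$. The function names are assumptions; the factor 80 is the one used on the slide to turn F/D into a space overhead.

```python
import math


def optimal_m(f_over_d):
    """Number of 1-bits per word that minimizes the false drop probability."""
    return math.log(2) * f_over_d


def false_drop_probability(f_over_d):
    """F_d = 2^(-m) at the optimal m."""
    return 2.0 ** (-optimal_m(f_over_d))


for overhead in (0.10, 0.20):
    f_over_d = 80 * overhead  # from overhead = (1/80)(F/D)
    m = optimal_m(f_over_d)
    fd = false_drop_probability(f_over_d)
    print(f"overhead {overhead:.0%}: F/D = {f_over_d:.0f}, m = {m:.3f}, Fd = {fd:.5f}")
```

Doubling the overhead from 10% to 20% squares the false drop probability, because m doubles.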
65Sequential Signature File (SSF)
documents
the size of the document signature = the size of the block signature = F (assume documents span exactly one logical block)
66Classification of Signature-Based Methods
- Compression: if the signature matrix is deliberately sparse, it can be compressed.
- Vertical partitioning: storing the signature matrix columnwise improves the response time at the expense of insertion time.
- Horizontal partitioning: grouping similar signatures together and/or providing an index on the signature matrix may result in better-than-linear search.
67Classification of Signature-Based Methods
- Sequential storage of the signature matrix
- without compression: sequential signature files (SSF)
- with compression: bit-block compression (BC), variable bit-block compression (VBC)
- Vertical partitioning
- without compression: bit-sliced signature files (BSSF, B'SSF), frame-sliced (FSSF), generalized frame-sliced (GFSSF)
68Classification of Signature-Based
Methods(Continued)
- with compression: compressed bit slices (CBS), doubly compressed bit slices (DCBS), no-false-drop method (NFD)
- Horizontal partitioning
- data independent partitioning: Gustafson's method, partitioned signature files
- data dependent partitioning: 2-level signature files, S-trees
69Criteria
- the storage overhead
- the response time on single word queries
- the performance on insertion, as well as whether
the insertion maintains the append-only property
70Compression
- idea
- Create sparse document signatures on purpose.
- Compress them before storing them sequentially.
- Method
- Use B-bit vector, where B is large.
- Hash each word into one (n) bit position(s).
- Use run-length encoding.
71Compression using run-length encoding
word: signature
data: 0000 0000 0000 0010 0000
base: 0000 0001 0000 0000 0000
management: 0000 1000 0000 0000 0000
system: 0000 0000 0000 0000 1000
block signature: 0000 1001 0000 0010 1000
The block signature is stored as [L1] [L2] [L3] [L4] [L5], where L1 to L5 are the lengths of the intervals (runs of 0s) and [x] is the encoded value of x.
search: decode the encoded lengths of all the preceding intervals
example: search data
(1) data => 0000 0000 0000 0010 0000
(2) decode [L1] = 0000, decode [L2] = 00, decode [L3] = 000000
disadvantage: search becomes slow
72Bit-block Compression (BC)
Data structure
(1) The sparse vector is divided into groups of consecutive bits (bit-blocks).
(2) Each bit-block is encoded individually.
Algorithm (example: block signature 0000 1001 0000 0010 1000, bit-block size b = 4)
Part I. It is one bit long, and it indicates whether there are any 1s in the bit-block (1) or the bit-block is empty (0). In the latter case, the bit-block signature stops here.
0000 1001 0000 0010 1000 => 0 1 0 1 1
Part II. It indicates the number s of 1s in the bit-block. It consists of s-1 1s and a terminating zero.
=> 10 0 0
Part III. It contains the offsets of the 1s from the beginning of the bit-block. With b = 4, the offsets 0, 1, 2, 3 are written as 00, 01, 10, 11.
=> 00 11 10 00
block signature: 01011 | 10 0 0 | 00 11 10 00
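The three parts can be produced mechanically. The sketch below keeps the layout of the slide, with all Part I bits first, then all Part II codes, then all Part III offsets; the function name and the choice of returning three strings are assumptions made for readability.

```python
def bit_block_encode(signature, b=4):
    """Encode a sparse bit string with bit-block compression.

    Returns (part1, part2, part3) as bit strings.
    """
    bits = signature.replace(" ", "")
    offset_width = max(1, (b - 1).bit_length())  # bits needed for offsets 0..b-1
    part1, part2, part3 = [], [], []
    for start in range(0, len(bits), b):
        block = bits[start:start + b]
        ones = [i for i, bit in enumerate(block) if bit == "1"]
        if not ones:
            part1.append("0")  # empty bit-block: nothing else is stored
            continue
        part1.append("1")
        part2.append("1" * (len(ones) - 1) + "0")  # s-1 ones, then a zero
        part3.extend(format(i, "0{}b".format(offset_width)) for i in ones)
    return "".join(part1), "".join(part2), "".join(part3)


print(bit_block_encode("0000 1001 0000 0010 1000"))
# ('01011', '1000', '00111000')
```

Grouped per bit-block, the output reads 01011 | 10 0 0 | 00 11 10 00, the block signature shown above.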
73Bit-block Compression (BC) (Continued)
Search: data
(1) data => 0000 0000 0000 0010 0000
(2) The 1 falls in the 4th bit-block.
(3) Signature: 01011 | 10 0 0 | 00 11 10 00
(4) OK, there is at least one bit set in the 4th bit-block.
(5) Check further. The 0 in Part II tells us there is only one bit set in the 4th bit-block. Is it the 3rd bit?
(6) Yes, the offset 10 confirms the result.
Discussion
(1) Bit-block compression requires less space than the Sequential Signature File for the same false drop probability.
(2) The response time of bit-block compression is slightly less than that of the Sequential Signature File.
74Vertical Partitioning
- idea: avoid bringing useless portions of the document signature into main memory
- methods
- store the signature file in a bit-sliced form or in a frame-sliced form
- store the signature matrix column-wise to improve the response time at the expense of insertion time
75Bit-Sliced Signature Files (BSSF)
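[Figure: transposed bit matrix. The signature matrix, with one row per document (its document signature), is transposed so that each bit position of the signatures is stored separately.]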
76documents
bit-files
- search
(1) Retrieve m bit vectors (instead of F bit vectors). E.g., the word signature of free is 001 000 110 010; if the document contains free, the 3rd, 7th, 8th, and 11th bits are set, i.e., only the 3rd, 7th, 8th, and 11th bit-files are examined.
(2) AND these vectors. The 1s in the resulting N-bit vector denote the qualifying logical blocks (documents).
(3) Retrieve the text file through the pointer file.
- insertion: requires F disk accesses for a new logical block (document), one for each bit-file, but no rewriting
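The search steps can be sketched with the bit-files held as in-memory lists. The first two document signatures reuse the earlier example (free ORed with text, and text alone); the third is made-up data, and the pointer file and disk layout are left out.

```python
def to_bit_slices(doc_signatures, f):
    """Transpose the signature matrix: slice j holds bit j of every document."""
    return [[int(sig[j]) for sig in doc_signatures] for j in range(f)]


def search(bit_slices, query_signature):
    """AND the slices selected by the 1-bits of the query signature."""
    positions = [j for j, bit in enumerate(query_signature) if bit == "1"]
    result = [1] * len(bit_slices[0])
    for j in positions:  # only m slices are read, not all F
        result = [r & s for r, s in zip(result, bit_slices[j])]
    # 1-based numbers of the qualifying documents (logical blocks).
    return [doc + 1 for doc, bit in enumerate(result) if bit]


DOC_SIGNATURES = [
    "001010111011",  # document 1: free OR text, from the earlier example
    "000010101001",  # document 2: text only
    "101000110110",  # document 3: made-up signature
]
FREE = "001000110010"  # 1-bits at positions 3, 7, 8, 11 (1-based)

slices = to_bit_slices(DOC_SIGNATURES, f=12)
print(search(slices, FREE))  # [1, 3]
```

Document 3 qualifies only because its made-up signature covers the bits of free; as with any signature method, the candidates still have to be checked against the text.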
77Frame-Sliced Signature File (FSSF)
- Ideas
- random disk accesses are more expensive than sequential ones.
- force each word to hash into bit positions that are closer to each other in the document signature
- these bit files are stored together and can be retrieved with a few random accesses
- Procedures
- The document signature (F bits long) is divided into k frames of s consecutive bits each.
- For each word in the document, one of the k frames will be chosen by a hash function.
- Using another hash function, the word sets m bits in that frame.
78documents
frames
Each frame will be kept in consecutive disk
blocks.
79FSSF (Continued)
- Example (D=2, F=12, s=6, k=2, m=3)
word: signature
free: 000000 110010
text: 010110 000000
doc. signature: 010110 110010
- Search
- Only one frame has to be retrieved for a single-word query, i.e., only one random disk access is required. E.g., to search for documents that contain the word free: because the word signature of free is placed in the 2nd frame, only the 2nd frame has to be examined.
- At most n frames have to be scanned for an n-word query.
- Insertion: only k frames have to be accessed instead of F bit-slices.
80Vertical Partitioning and Compression
- idea
- create a very sparse signature matrix
- store it in a bit-sliced form
- compress each bit slice by storing the position
of the 1s in the slice.
81Compressed Bit Slices (CBS)
- Room for improvement in the bit-sliced method
- Searching
- Each search word requires the retrieval of m bit files.
- The search time could be improved if m were forced to be 1.
- Insertion
- Requires too many disk accesses (equal to F, which is typically 600-1000).
82Compressed Bit Slices (CBS)(Continued)
- Let m = 1. To maintain the same false drop probability, F (written S here, the size of a signature) has to be increased.
[Figure: sparse bit matrix over the documents with one bit set for each word; only one row has to be read for a search word.]
83(document collection)
representation for a word
Do not distinguish synonyms.
Hash a word to obtain the bucket address.
Obtain the pointers to the relevant documents from the buckets.
h(base) = 30
84Doubly Compressed Bit Slices
Idea: compress the sparse directory, using buckets and a second hash function.
Distinguish synonyms partially.
Follow the pointers of the posting buckets to retrieve the qualifying documents.
h1(base) = 30
h2(base) = 011
85No False Drops Method
Fixed length: saves space.
Uses a pointer to the word in the text file.
Distinguish synonyms completely.
86Horizontal Partitioning
1. Goal: group the signatures into sets, partitioning the signature matrix horizontally.
2. Grouping criterion
documents
87Partitioned Signature Files
- Use a portion of a document signature as a signature key to partition the signature file.
- All signatures with the same key will be grouped into a so-called module.
- When a query signature arrives,
- examine its signature key and look for the corresponding modules
- scan all the signatures within those modules that have been selected
88Comparisons
- signature files
- Use hashing techniques to produce an index
- advantage
- storage overhead is small (10%-20%)
- disadvantages
- the search time on the index is linear
- some answers may not match the query, thus
filtering must be done
89Comparisons (Continued)
- inverted files
- storage overhead (30%-100%)
- search time for word searches is logarithmic
- PAT arrays
- potential use in other kind of searches
- phrases
- regular expression searching
- approximate string searching
- longest repetitions
- most frequent searching