Chapter 8 Indexing and Searching

Transcript and Presenter's Notes
1
Chapter 8 Indexing and Searching
  • Hsin-Hsi Chen
  • Department of Computer Science and Information
    Engineering
  • National Taiwan University

2
Introduction
  • Searching
  • online text searching: scan the text sequentially
  • indexed searching: build data structures over the
    text to speed up the search; suited to semi-static
    collections updated at reasonably regular intervals
  • Indexing techniques
  • inverted files
  • suffix (PAT) arrays
  • signature files

3
Assumptions
  • n: the size of the text database
  • m: the length of the search pattern (m < n)
  • M: the amount of memory available
  • n': the size of the texts that are modified (n' < n)
  • Experiments
  • 32-bit Sun UltraSparc-1, 167 MHz, with 64 MB of
    RAM
  • TREC-2 collection (WSJ, DOE, FR, ZIFF, AP)

4
File Structures for IR
  • lexicographical indices (indices that are sorted)
  • inverted files
  • Patricia (PAT) trees (suffix trees and arrays)
  • cluster file structures (see Chapter 7, on document
    clustering)
  • indices based on hashing
  • signature files

5
Inverted Files
6
Inverted Files
  • Each document is assigned a list of keywords or
    attributes.
  • Each keyword (attribute) is associated with an
    operational relevance weight.
  • An inverted file is the sorted list of keywords
    (attributes), with each keyword having links to
    the documents containing that keyword.
  • Penalty
  • the size of an inverted file ranges from 10% to
    100% or more of the size of the text itself
  • the index must be updated as the data set changes

7
Example text, with the character position of each word:
This(1) is(6) a(9) text(11). A(17) text(19) has(24)
many(28) words(33). Words(40) are(46) made(50) from(55)
letters(60).

Vocabulary and occurrences (word positions):
  letters  60
  made     50
  many     28
  text     11, 19
  words    33, 40

  • addressing granularity
  • inverted list: word positions or character positions
  • inverted file: documents

Heaps' law: the vocabulary grows as O(n^β), with
β ≈ 0.4 to 0.6. Vocabulary for 1 GB of the TREC-2
collection: 5 MB (before stemming and normalization).
Occurrences: the extra space is O(n), 30%-40% of the
text size.
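The vocabulary/occurrences layout above can be sketched in a few lines. This is an illustration, not the book's implementation; note that Python positions are 0-based, while the slide numbers characters from 1.

```python
# Sketch of a word-position inverted index over the chapter's example text.
import re
from collections import defaultdict

def build_inverted_index(text, stopwords=frozenset()):
    """Map each indexed word to the character positions where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text):
        word = match.group().lower()
        if word not in stopwords:
            index[word].append(match.start())
    return dict(sorted(index.items()))   # vocabulary kept in sorted order

text = "This is a text. A text has many words. Words are made from letters."
stop = {"this", "is", "a", "has", "are", "from"}
index = build_inverted_index(text, stop)
print(index)
# {'letters': [59], 'made': [49], 'many': [27], 'text': [10, 18], 'words': [32, 39]}
```

The stop-word set is an assumption chosen so that the vocabulary matches the slide's five terms.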
8
Block Addressing
  • Full inverted indices
  • point to exact occurrences
  • Block addressing
  • points to the blocks where the word appears
  • pointers are smaller
  • 5% overhead over the text size

block = fixed-size blocks, files, documents, Web
pages, ...
block = retrieval units?

Example: the text split into four blocks:
  Block 1: This is a text.
  Block 2: A text has many
  Block 3: words. Words are
  Block 4: made from letters.

Inverted index (vocabulary, with block numbers as
occurrences):
  letters  4
  made     4
  many     2
  text     1, 2
  words    3
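Block addressing can be sketched the same way, with postings holding block numbers instead of positions. The block boundaries below are an assumption chosen to reproduce the slide's postings (the slide does not mark them explicitly).

```python
# Block-addressing sketch: postings point to blocks, not exact positions.
from collections import defaultdict

def build_block_index(blocks, stopwords=frozenset()):
    """blocks: list of text blocks; returns word -> sorted 1-based block numbers."""
    index = defaultdict(set)
    for block_no, block in enumerate(blocks, start=1):
        for word in block.lower().replace(".", " ").split():
            if word not in stopwords:
                index[word].add(block_no)
    return {w: sorted(b) for w, b in sorted(index.items())}

blocks = ["This is a text.", "A text has many", "words. Words are", "made from letters."]
stop = {"this", "is", "a", "has", "are", "from"}
print(build_block_index(blocks, stop))
# {'letters': [4], 'made': [4], 'many': [2], 'text': [1, 2], 'words': [3]}
```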
9
Sorted array implementation of an inverted file
(figure: each keyword points to the documents in which
it occurs)
10
Index size trade-offs (figure):
  • full inversion: all words, exact positions, 4-byte
    pointers
  • block addressing: 2 or 1 byte(s) per pointer,
    independent of the text size
  • document addressing (document size 10 KB): 1, 2, or
    3 bytes per pointer, depending on the text size
  • all words indexed vs. stop words not indexed
11
Searching: three general steps
  • Vocabulary search
  • identify the words and patterns in the query
  • search for them in the vocabulary
  • Retrieval of occurrences
  • retrieve the lists of occurrences of all the
    words
  • Manipulation of occurrences
  • solve phrases, proximity, or Boolean operations
  • find the exact word positions when block
    addressing is used
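The three steps can be sketched for a conjunctive (AND) query. The index layout here is an assumption: a sorted vocabulary array with a parallel list of occurrence lists of (document, position) pairs.

```python
# Vocabulary search + occurrence retrieval + Boolean AND, as a sketch.
from bisect import bisect_left

def vocabulary_search(vocab, word):
    """Step 1: binary-search the sorted vocabulary; return its slot or None."""
    i = bisect_left(vocab, word)
    return i if i < len(vocab) and vocab[i] == word else None

def and_query(vocab, occurrences, words):
    slots = [vocabulary_search(vocab, w) for w in words]
    if None in slots:
        return set()
    # Step 2: retrieve the occurrence lists; step 3: AND on document numbers.
    doc_sets = [set(doc for doc, _pos in occurrences[i]) for i in slots]
    return set.intersection(*doc_sets)

vocab = ["letters", "made", "many", "text", "words"]
occurrences = [[(2, 60)], [(2, 50)], [(1, 28)], [(1, 11), (1, 19)], [(1, 33), (2, 40)]]
print(and_query(vocab, occurrences, ["text", "words"]))  # {1}
```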
12
Structures used in Inverted Files
  • Sorted Arrays
  • store the list of keywords in a sorted array
  • search it with a standard binary search
  • advantage: easy to implement
  • disadvantage: updating the index is expensive
  • B-Trees
  • Tries
  • Hashing Structures
  • Combinations of these structures

13
Trie
Example text: This is a text. A text has many words.
Words are made from letters. (word positions 1, 6, 9,
11, 17, 19, 24, 28, 33, 40, 46, 50, 55, 60)

(Vocabulary-trie figure: the root branches on l, m, t,
w; under m, the branch on a leads to d and n. Leaves
hold the occurrence lists: letters 60; made 50;
many 28; text 11, 19; words 33, 40.)
14
B-trees
(B-tree figure: the root holds the separators F, M;
internal nodes hold Al, Br, E; Gr, H, Ja, L; Rut, Uni.
Leaves hold terms with posting counts, e.g., Afgan 2,
..., Russian 9, Ruthenian 1, ...)
15
Sorted Arrays
1. The input text is parsed into a list of words along
   with their locations in the text (a time- and
   storage-consuming operation).
2. The list is inverted: from a list of terms in
   location order to a list of terms in alphabetical
   order.
3. Term weights are added, and the files are
   reorganized or compressed.
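The three steps above can be sketched on a toy corpus (illustrative only; real systems stream these lists through disk and compress them):

```python
# Sorted-array build: parse -> invert by sorting -> group into postings.
def build_sorted_array(docs):
    # 1. Parse into (word, location) pairs in text order.
    word_list = [(w.lower().strip("."), doc_id)
                 for doc_id, text in enumerate(docs, start=1)
                 for w in text.split()]
    # 2. Invert: re-sort by term (then location) instead of location order.
    word_list.sort()
    # 3. Reorganize: group into (term, postings) entries.
    inverted = {}
    for term, doc_id in word_list:
        inverted.setdefault(term, []).append(doc_id)
    return inverted

docs = ["memory cost", "sort cost"]
print(build_sorted_array(docs))  # {'cost': [1, 2], 'memory': [1], 'sort': [2]}
```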
16
Inversion of Word List
(figure: the word "report" appears in two records)
17
Dictionary and postings file
Idea: the file to be searched should be as short as
possible, so split the single file into two pieces:
the dictionary (vocabulary) and the postings file
(occurrences).

Example: a data set of 38,304 records with 250,000
unique terms, averaging 88 postings/record; each
posting is a (document, frequency) pair.
18
Producing an Inverted File for Large Data Sets
without Sorting
Idea: avoid an explicit sort by using a right-threaded
binary tree. Each node records the current number of
term postings and the storage location of its postings
list; the inverted file is produced by traversing the
binary tree and the linked postings lists.
19
Indexing Statistics
Final index: only 8% of the input text size for the
50 MB database; 14% of the input text size for the
larger (2 GB) database. Working storage (the storage
needed to build the index): not much larger than the
size of the final index for the new indexing method.
20
A Fast Inversion Algorithm
  • Principle 1: large primary memories are available.
    If databases can be split into memory loads that
    can be rapidly processed and then combined, the
    overall cost will be minimized.
  • Principle 2: exploit the inherent order of the
    input data. It is very expensive to use polynomial
    or even n log n sorting algorithms for large files.

21
FAST-INV algorithm
concept postings/ pointers
See p. 22.
22
Sample document vector
(figure: pairs of document number and concept number,
one concept number for each unique word; similar to the
document-word list shown on p. 16)
The concept numbers are sorted within document numbers,
and document numbers are sorted within the collection.
23
Definitions:
  HCN = highest concept number in the dictionary
        (total number of concepts in the dictionary)
  L   = number of (concept, document) pairs in the
        collection
  M   = available primary memory size, in bytes

Assume M >> HCN and M < L. Choose j parts such that
L/j < M, so that each part will fit into primary
memory; approximately HCN/j concepts are associated
with each part.

Let LL = length of the current load
         (8 bytes for each concept-weight pair)
    S  = spread of concept numbers in the current load
         (4 bytes for each count of postings)
Then each load must satisfy 8 LL + 4 S < M.
24
Preparation
1. Allocate an array, con_entries_cnt, of size HCN.
2. For each <doc, con> entry in the document vector
   file: increment con_entries_cnt[con].

Example document vector (offset: entries):
  0  (1,2), (1,4) ...
  2  (2,3) ...
  3  (3,1), (3,2), (3,5) ...
  6  (4,2), (4,3) ...
  8  ...
25
Preparation (continued)
5. For each <con, count> pair obtained from
   con_entries_cnt: if there is no room for documents
   with this concept to fit in the current load, then
   create an entry in the load table and initialize
   the next load entry; otherwise, update the
   information for the current load table entry.
26
The range of concepts for each primary load is chosen
so that, with LL = length of the current load and
S = end concept - start concept + 1, the space for the
concept/weight pairs (8 LL) plus the space for each
concept's count of postings (4 S) stays below M.
Each (doc, con) pair is then routed to its load using
the load table; within a load, the CONPTR table gives
each concept's offset in the load file, so postings are
placed directly: copy rather than sort.
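The "copy rather than sort" idea can be sketched in memory for one load: a counting pass over the (doc, con) pairs yields each concept's offset, and a second pass places each document number directly in its slot. (In FAST-INV proper the counts come from the load table and the postings go to disk; the pairs below reuse the earlier example.)

```python
# One FAST-INV load, sketched: counting pass + direct placement, no sort.
def fast_inv_load(pairs, hcn):
    # Pass 1: count postings per concept (the con_entries_cnt array).
    counts = [0] * (hcn + 1)
    for _doc, con in pairs:
        counts[con] += 1
    # Prefix sums give each concept's starting offset in the postings area.
    offsets = [0] * (hcn + 2)
    for con in range(1, hcn + 1):
        offsets[con + 1] = offsets[con] + counts[con]
    # Pass 2: copy each document number to its concept's next free slot.
    postings = [0] * len(pairs)
    nxt = offsets[:]
    for doc, con in pairs:
        postings[nxt[con]] = doc
        nxt[con] += 1
    return {con: postings[offsets[con]:offsets[con + 1]]
            for con in range(1, hcn + 1) if counts[con]}

pairs = [(1, 2), (1, 4), (2, 3), (3, 1), (3, 2), (3, 5), (4, 2), (4, 3)]
print(fast_inv_load(pairs, hcn=5))
# {1: [3], 2: [1, 3, 4], 3: [2, 4], 4: [1], 5: [3]}
```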
27
PAT Trees and PAT Arrays (Suffix Trees and Suffix
Arrays)
28
PAT Trees and PAT Arrays
  • Problems of traditional IR models
  • Documents and words are assumed.
  • Keywords must be extracted from the text
    (indexing).
  • Queries are restricted to keywords.
  • New indices for text
  • The text is regarded as one long string.
  • Each position corresponds to a semi-infinite
    string (sistring).
  • suffix: a string that goes from a text position
    to the end of the text
  • Each suffix is uniquely identified by its
    position.
  • no structures and no keywords

29
Text: This is a text. A text has many words. Words
are made from letters.

Suffixes (index points are selected from the text; they
point to the beginnings of the text positions which are
retrievable):
  text. A text has many words. Words are made from letters.
  text has many words. Words are made from letters.
  many words. Words are made from letters.
  Words are made from letters.
  made from letters.
  letters.

(Note: even the two suffixes starting with "text" are
different strings.)
30
PATRICIA
  • trie
  • branch (decision) node: search decision markers
  • element node: real data
  • If branch decisions are made on each bit, a
    complete binary tree is formed whose depth equals
    the number of bits of the longest string.
  • Many element nodes and branch nodes are null.

31
PATRICIA (Continued)
  • compressed digital search trie
  • The null element nodes and branch nodes are
    removed.
  • An additional field denoting the comparing bit for
    the branching decision is included in each
    decision node.
  • A match between the search results and their
    search keys is required, because only some of the
    bits are compared during the search process.

32
PATRICIA (Continued)
  • Practical Algorithm to Retrieve Information Coded
    in Alphanumeric
  • augmented branch node: an additional field for
    storing an element is included in each branch node
  • Each element is stored in an upper node or in
    itself.
  • an additional root node (note: the number of leaf
    nodes is always greater than that of internal
    nodes by one)

33
PAT-tree
  • PATRICIA + semi-infinite strings
  • a text T with n basic units: u1 u2 ... un
  • sistrings: u1 u2 ... un, u2 u3 ... un,
    u3 u4 ... un, ...
  • each has an end to the left but none to the right
  • Store the starting positions of the semi-infinite
    strings in the text using PATRICIA.

34
semi-infinite strings
  • Example. Text: Once upon a time, in a far away
    land ...
    sistring 1:  Once upon a time ...
    sistring 2:  nce upon a time ...
    sistring 8:  on a time, in a ...
    sistring 11: a time, in a far ...
    sistring 22: a far away land ...
  • Comparing them: sistring 22 < 11 < 2 < 8 < 1

35
PAT Tree
  • PAT Tree: a Patricia tree constructed over all the
    possible sistrings of a text
  • Patricia tree
  • a digital tree where the individual bits of the
    keys are used to decide on the branching
  • Each internal node indicates which bit of the
    query is used for branching:
  • an absolute bit position, or
  • a count of the number of bits to skip
  • Each external node is a sistring, i.e., an
    integer displacement.

36
1
Example
2
2
Text 01100100010111 sistring 1 01100100010111
sistring 2 1100100010111 sistring
3 100100010111 sistring 4 00100010111
sistring 5 0100010111 sistring 6 100010111
sistring 7 00010111 sistring 8 0010111 ...
3
4
2
1
1
2
2
3
4
2
3
5
1
external node sistring (integer
displacement) total displacement of the bit
to be inspected
1
1
1
1
0
0
1
1
1
2
2
0
1
3
2
internal node skip counter pointer
37
Example (continued). Text: 01100100010111, with
sistrings 1-8 as above.
(Figure: the completed PAT tree over sistrings 1-8.)
Search for 00101: the skip counters route the
comparison through bits 3, 6, and 4.
38
Example text: This is a text. A text has many words.
Words are made from letters. (word positions 1, 6, 9,
11, 17, 19, 24, 28, 33, 40, 46, 50, 55, 60)

(Suffix-trie figure: branches on l; m-a-d and m-a-n;
t-e-x-t; w-o-r-d-s. Leaves hold positions: letters 60,
made 50, many 28, text 11 and 19, words 33 and 40.
Space overhead: 120%-240% over the text size.)

(Suffix-tree figure: the same trie with unary paths
compressed; internal nodes store skip values 1, 3, 5,
6, and the leaves hold the same positions.)
39
PAT Trees Represented as Arrays
  • indirect binary search vs. sequential search: keep
    the external nodes in each bucket in the same
    relative order as they would be in the tree

Text:      0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
PAT array: 7 4 8 5 1 6 3 2
(Figure: the PAT tree over sistrings 1-8; listing its
external nodes left to right yields the PAT array.)
40
Example text: This is a text. A text has many words.
Words are made from letters. (word positions as on
p. 38)

(1) Suffix tree: 120%-240% space overhead over the
    text.
(2) Suffix array: 40% overhead.
(3) Supra-index: a sampled index built on top of the
    suffix array.
41
difference between suffix array and inverted list
  • suffix array: the occurrences of each word are
    sorted lexicographically by the text following
    the word
  • inverted list: the occurrences of each word are
    sorted by text position

(Figure: the example text with its vocabulary,
supra-index, suffix array, and inverted list.)
42
Indexing Points
  • The example above assumes that every position in
    the text is indexed: n external nodes, one for
    each position in the text.
  • For word and phrase searches, only the sistrings
    that start at word beginnings are necessary.
  • trade-off between the size of the index and the
    search requirements it supports

43
Prefix searching
  • Idea: every subtree of the PAT tree contains all
    the sistrings with a given prefix.
  • Search time is proportional to the query length:
    descend until the prefix is exhausted or an
    external node is reached.

(Figure: search for the prefix 10100 and its answer
subtree.)
44
Searching PAT Trees as Arrays
  • Prefix searching and range searching: do an
    indirect binary search over the array, with the
    results of the comparisons being less than, equal,
    and greater than.
  • Example: search for the prefix 100 and its answer.

Text:      0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
PAT array: 7 4 8 5 1 6 3 2
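The indirect binary search can be sketched over the slides' bit text; the PAT array is copied from the figure (positions are 1-based, as there). The sentinel suffix trick for the upper bound is an implementation assumption.

```python
# Prefix search as indirect binary search over a PAT (suffix) array.
from bisect import bisect_left, bisect_right

text = "01100100010111"
pat_array = [7, 4, 8, 5, 1, 6, 3, 2]   # sistring starts, lexicographic order

def prefix_range(prefix):
    """Return the PAT-array slice whose sistrings start with `prefix`."""
    suffixes = [text[p - 1:] for p in pat_array]   # already sorted
    lo = bisect_left(suffixes, prefix)
    hi = bisect_right(suffixes, prefix + "\uffff") # everything extending it
    return pat_array[lo:hi]

print(prefix_range("100"))  # [6, 3]: sistrings 6 and 3 start with 100
```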
45
Proximity Searching
  • Find all places where s1 is at most a fixed number
    of characters (given by the user) away from s2,
    e.g., "in 4 ation" matches insulation,
    international, information.
  • Algorithm:
    1. Search for s1 and s2.
    2. Select the smaller of the two answer sets and
       sort it by position.
    3. Traverse the unsorted answer set, searching for
       every position in the sorted set and checking
       whether the distance between positions satisfies
       the proximity condition.

sort/traverse time: O((m1 + m2) log m1), assuming
m1 < m2
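The three steps can be sketched directly (positions are plain integers here; in a PAT array they would come from two prefix searches):

```python
# Proximity sketch: sort the smaller set, probe it with binary search.
from bisect import bisect_left

def proximity(pos1, pos2, k):
    """Pairs (p, q) with |p - q| <= k, given two unsorted position sets."""
    small, large = sorted((pos1, pos2), key=len)
    small = sorted(small)                      # step 2: sort the smaller set
    out = []
    for p in large:                            # step 3: probe each position
        i = bisect_left(small, p - k)
        while i < len(small) and small[i] <= p + k:
            out.append((p, small[i]))
            i += 1
    return out

print(proximity([100, 400], [106, 90, 250], k=10))  # [(106, 100), (90, 100)]
```

Each of the m2 probes costs O(log m1), matching the (m1 + m2) log m1 bound quoted above.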
46
Range Searching
  • Search for all the strings within a certain
    lexicographical range.
  • Example: the range abc..acc includes abracadabra
    and acacia, but not abacus or acrimonious.
  • Algorithm
  • Search for each end of the defining interval.
  • Collect all the subtrees between (and including)
    them.
47
Searching Suffix Array
  • Find all suffixes S with P1 ≤ S < P2.
  • Binary-search both limiting patterns in the
    suffix array.
  • Report all the elements lying between the two
    positions.
48
Longest Repetition Searching
  • Find the match between two different positions of
    a text where this match is the longest in the
    entire text, e.g., in 0 1 1 0 0 1 0 0 0 1 0 1 1 1.

The tallest internal node gives a pair of sistrings
that match for the greatest number of characters.

Text: 01100100010111
  sistring 1: 01100100010111
  sistring 2: 1100100010111
  sistring 3: 100100010111
  sistring 4: 00100010111
  sistring 5: 0100010111
  sistring 6: 100010111
  sistring 7: 00010111
  sistring 8: 0010111

(Figure: the PAT tree; its deepest internal node marks
the longest repetition.)
49
Most Significant or Most Frequent Matching
  • Find the most frequently occurring strings within
    the text database, e.g., the most frequent
    trigram.
  • To find the most frequent trigram, find the
    largest subtree at a distance of 3 characters from
    the root.

(Figure: in the example PAT tree, the first 3 bits are
the same for sistrings 100100010111 and 100010111.)
50
Building PAT Trees as Patricia Trees
  • bucketing of external nodes
  • collect more than one external node per bucket
  • A bucket replaces any subtree with size less than
    a certain constant b, saving a significant number
    of internal nodes.
  • The external nodes inside a bucket do not have any
    structure associated with them, which increases
    the number of comparisons for each search.

51
Building PAT Trees as Patricia Trees(Continued)
  • mapping the tree onto the disk using super-nodes
  • Allocate as much of the tree as possible in each
    disk page.
  • Every disk page has a single entry point, contains
    as much of the tree as possible, and terminates
    either in external nodes or in pointers to other
    disk pages.
  • The pointers in internal nodes address either a
    disk page or another node inside the same page.
  • Disk pages contain on the order of 1,000
    internal/external nodes.
  • On average, each disk page covers about 10 steps
    of a root-to-leaf path.

52
Suffix array construction (in MM)
  • The suffix array and the text must both fit in
    main memory.
  • The suffix array is the set of suffix pointers,
    lexicographically sorted.
  • The pointers are collected in ascending text
    order, then sorted by the text they point to
    (accessing the text at random positions).

53
Suffix array construction (in MM)
  • Algorithm
  • All the suffixes are bucket-sorted according to
    their first letter only.
  • At iteration i, the suffixes begin already sorted
    by their first 2^(i-1) letters and end up sorted
    by their first 2^i letters.
  • To sort the text positions Ta and Tb in the
    suffix array, determine the relative order of the
    positions Ta + 2^(i-1) and Tb + 2^(i-1) in the
    current stage of the sort.
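The doubling iteration can be sketched compactly. This version re-sorts with ranks from the previous stage (O(n log^2 n)); the chapter's algorithm achieves the same doubling with linear-time bucket steps per iteration.

```python
# Prefix-doubling suffix array construction, as a sketch.
def suffix_array(text):
    n = len(text)
    rank = [ord(c) for c in text]
    sa = sorted(range(n), key=lambda a: text[a])   # bucket-sort by first letter
    h = 1
    while h < n:
        # Key: (rank of first h chars, rank of the next h chars, or -1 past end).
        key = lambda a: (rank[a], rank[a + h] if a + h < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank                            # suffixes now sorted by 2h chars
        h *= 2
    return sa

print(suffix_array("abracadabra"))  # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
```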

54
Construction of Suffix Arrays for Large Texts
  • Split the text into blocks that can be sorted in
    main memory.
  • Build the suffix array for the first block.
  • Build the suffix array for the second block.
  • Merge both suffix arrays.
  • Build the suffix array for the third block.
  • Merge its suffix array with the previous one.
  • Build the suffix array for the fourth block.
  • Merge the new suffix array with the previous one.
  • ...

55
Merge Step
  • How do we merge a large suffix array with a small
    suffix array?
  • Determine how many elements of the large array are
    to be placed between each pair of elements in the
    small array:
  • Read the large text sequentially into main
    memory.
  • Search each suffix of that text in the small
    suffix array.
  • Increment the appropriate counter.
  • Use the counters to merge the arrays without
    accessing the text.

56
(Figure: merging a large suffix array with a small one.
(a) The suffix array for the small text is built in
main memory. (b) The long text is read sequentially and
counters record how many of its suffixes fall between
each pair of entries of the small suffix array. (c) The
long and small suffix arrays are merged, without
accessing the text, into the final suffix array.)
57
Signature Files
58
Signature Files
  • basic idea: an inexact filter
  • discards many of the nonqualifying items
  • qualifying items definitely pass the test
  • false hits (false drops) may also pass
    accidentally
  • procedure
  • Documents are stored sequentially in the text
    file.
  • Their signatures (hash-coded bit patterns) are
    stored in the signature file.
  • When a query arrives, scan the signature file,
    discard nonqualifying documents, and check the
    rest.

59
Merits of Signature Files
  • faster than full text scanning
  • 1 or 2 orders of magnitude faster
  • modest space overhead
  • 10-15% vs. 50-300% (inversion)
  • insertions can be handled more easily than with
    inversion
  • append-only
  • no reorganization or rewriting

60
Basic Concepts
  • Use superimposed coding to create signatures.
  • Each document is divided into logical blocks.
  • A block contains D distinct non-common words.
  • Each word yields a word signature.
  • A word signature is an F-bit pattern with m bits
    set to 1.
  • Each word is divided into successive, overlapping
    triplets, e.g., free -> _fr, fre, ree, ee_
    (with _ a padding character).
  • Each such triplet is hashed to a bit position.
  • The word signatures are OR-ed to form the block
    signature.
  • Block signatures are concatenated to form the
    document signature.

(D is called B, and m is called l, in the textbook.)
61
Basic Concepts (Continued)
  • Example (D=2, F=12, m=4):
    word    signature
    free    001 000 110 010
    text    000 010 101 001
    block   001 010 111 011
  • Search
  • Use the hash function to determine the m 1-bit
    positions of the query word.
  • Examine each block signature for 1s in the bit
    positions where the signature of the search word
    has a 1.
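The triplet hashing and OR-ing can be sketched as follows; the padding character and the MD5-based hash are assumptions, so the bit patterns differ from the slide's example, but the structure (F-bit signatures, superimposed by OR, filter test by bit containment) is the same.

```python
# Superimposed coding sketch: triplet-hashed word signatures OR-ed per block.
import hashlib

F = 12  # signature size in bits

def triplets(word):
    padded = f"_{word}_"                   # '_' is an assumed padding character
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_signature(word):
    sig = 0
    for t in triplets(word):
        h = hashlib.md5(t.encode()).digest()
        sig |= 1 << (h[0] % F)             # each triplet sets one bit position
    return sig

def block_signature(words):
    sig = 0
    for w in words:
        sig |= word_signature(w)           # superimpose (OR) the word signatures
    return sig

def may_contain(block_sig, word):
    w = word_signature(word)
    return block_sig & w == w              # all the word's 1-bits must be set

blk = block_signature(["free", "text"])
print(may_contain(blk, "free"))  # True (qualifying items always pass)
```

A word not in the block usually fails this test, but can pass accidentally: that is exactly the false drop discussed below.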

62
A Signature File
Example text in four blocks:
  Block 1: This is a text.
  Block 2: A text has many
  Block 3: words. Words are
  Block 4: made from letters.

Text signature: 000101 110101 100100 101101

Word signatures:
  h(text)    = 000101
  h(many)    = 110000
  h(words)   = 100100
  h(made)    = 001100
  h(letters) = 100001
63
Basic Concepts (Continued)
  • false drop (false alarm, false hit) Fd: the
    probability that a block signature seems to
    qualify, given that the block does not actually
    qualify:
    Fd = Prob{signature qualifies | block does not}
  • Ensure that the false drop probability is low
    enough while keeping the signature file as short
    as possible.
  • For a given value of F, the value of m that
    minimizes the false drop probability is such that
    each row of the matrix contains 1s with
    probability 0.5:
    Fd = 2^(-m), with F ln 2 = m D, i.e.,
    m = ln 2 (F/D).

N x F binary matrix
  F:  signature size in bits
  m:  number of bits per word
  D:  number of distinct non-common words per document
  Fd: false drop probability
64
Space overhead of the index = (1/80)(F/D), with F
measured in bits and D in words.

10% overhead -> false drop probability close to 2%:
  0.10 = (1/80)(F/D) => F/D = 8
  m = 8 ln 2 = 5.545,  Fd = 2^(-5.545) ≈ 2%

20% overhead -> false drop probability close to 0.046%:
  0.20 = (1/80)(F/D) => F/D = 16
  m = 16 ln 2 = 11.09,  Fd = 2^(-11.09) ≈ 0.046%
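The arithmetic above can be reproduced in a few lines (the 80 bits/word factor is taken from the slide's overhead formula):

```python
# overhead = (1/80)(F/D); optimal m = ln(2) * F/D; false drop Fd = 2^(-m).
import math

def signature_params(overhead, bits_per_word=80):
    f_over_d = overhead * bits_per_word   # F/D implied by the space overhead
    m = math.log(2) * f_over_d            # optimal bits per word
    fd = 2 ** (-m)                        # false drop probability
    return f_over_d, m, fd

for overhead in (0.10, 0.20):
    f_over_d, m, fd = signature_params(overhead)
    print(f"overhead {overhead:.0%}: F/D={f_over_d:.0f}, m={m:.2f}, Fd={fd:.5f}")
```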
65
Sequential Signature File (SSF)
(Figure: the document signatures stored one after
another, with pointers to the documents.)
The size of a document signature equals the size of a
block signature, F, assuming documents span exactly one
logical block.
66
Classification of Signature-Based Methods
  • Compression: if the signature matrix is
    deliberately sparse, it can be compressed.
  • Vertical partitioning: storing the signature
    matrix column-wise improves the response time at
    the expense of insertion time.
  • Horizontal partitioning: grouping similar
    signatures together and/or providing an index on
    the signature matrix may result in
    better-than-linear search.

67
Classification of Signature-Based Methods
  • Sequential storage of the signature matrix
  • without compression: sequential signature files
    (SSF)
  • with compression: bit-block compression (BC),
    variable bit-block compression (VBC)
  • Vertical partitioning
  • without compression: bit-sliced signature files
    (BSSF, B'SSF), frame-sliced (FSSF), generalized
    frame-sliced (GFSSF)

68
Classification of Signature-Based Methods (Continued)
  • with compression: compressed bit slices (CBS),
    doubly compressed bit slices (DCBS),
    no-false-drop method (NFD)
  • Horizontal partitioning
  • data-independent partitioning: Gustafson's
    method, partitioned signature files
  • data-dependent partitioning: two-level signature
    files, S-trees

69
Criteria
  • the storage overhead
  • the response time on single word queries
  • the performance on insertion, as well as whether
    the insertion maintains the append-only property

70
Compression
  • idea
  • Create sparse document signatures on purpose.
  • Compress them before storing them sequentially.
  • method
  • Use a B-bit vector, where B is large.
  • Hash each word into one (or n) bit position(s).
  • Use run-length encoding.

71
Compression using run-length encoding
  word        signature
  data        0000 0000 0000 0010 0000
  base        0000 0001 0000 0000 0000
  management  0000 1000 0000 0000 0000
  system      0000 0000 0000 0000 1000
  block sig.  0000 1001 0000 0010 1000

The 1s split the vector into runs L1, L2, L3, L4, L5;
the stored signature is [L1][L2][L3][L4][L5], where [x]
is the encoded value of x.

Search: decode the encoded lengths of all the preceding
intervals. Example: search for "data":
(1) data -> 0000 0000 0000 0010 0000
(2) decode [L1] = 0000, decode [L2] = 00, decode
    [L3] = 000000
Disadvantage: search becomes slow.
72
Bit-block Compression (BC)
Data structure:
(1) The sparse vector is divided into groups of
    consecutive bits (bit-blocks).
(2) Each bit-block is encoded individually.

Algorithm (for each bit-block):
Part I.   One bit long; indicates whether there are
          any 1s in the bit-block (1) or the bit-block
          is empty (0). In the latter case, the
          bit-block signature stops here.
             0000 1001 0000 0010 1000
          -> 0    1    0    1    1
Part II.  Indicates the number s of 1s in the
          bit-block: s-1 ones and a terminating zero.
          ->      10        0    0
Part III. Contains the offsets of the 1s from the
          beginning of the bit-block (with b = 4,
          offsets 0, 1, 2, 3 are encoded as 00, 01,
          10, 11).
          ->      0011      10   00
Block signature: 01011 10 0 0 0011 10 00
73
Bit-block Compression (BC) (Continued)
Search for "data":
(1) data -> 0000 0000 0000 0010 0000
(2) The 1 falls in the 4th bit-block.
(3) Signature: 01011 10 0 0 0011 10 00
(4) OK, Part I shows at least one bit set in the 4th
    bit-block.
(5) Check further: the Part II code 0 tells us there
    is only one 1 in the 4th bit-block. Is it the 3rd
    bit?
(6) Yes, the Part III offset 10 confirms it.
Discussion:
(1) Bit-block compression requires less space than the
    sequential signature file for the same false drop
    probability.
(2) The response time of bit-block compression is
    slightly less than that of the sequential
    signature file.
74
Vertical Partitioning
  • idea: avoid bringing useless portions of the
    document signature into main memory
  • methods
  • store the signature file in a bit-sliced or
    frame-sliced form
  • store the signature matrix column-wise to improve
    the response time at the expense of insertion
    time

75
Bit-Sliced Signature Files (BSSF)
(Figure: the N x F signature matrix, one document
signature per row, is transposed so that each of the F
bit positions becomes a bit file of N bits.)
76
Search:
(1) Retrieve m bit files (instead of F). E.g., if the
    word signature of "free" is 001 000 110 010, then
    the 3rd, 7th, 8th, and 11th bits are set, so only
    the 3rd, 7th, 8th, and 11th bit files are
    examined.
(2) AND these vectors. The 1s in the resulting N-bit
    vector denote the qualifying logical blocks
    (documents).
(3) Retrieve the text through the pointer file.

Insertion: requires F disk accesses for a new logical
block (document), one for each bit file, but no
rewriting.
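The retrieve-and-AND step can be sketched with the signatures from the slide on p. 62, using integers as toy in-memory "bit files" (real bit files live on disk, one per signature bit):

```python
# Bit-sliced search sketch: AND only the m slices for the word's 1-bits.
def bit_sliced_search(bit_files, n_docs, word_bits):
    """bit_files[j]: integer whose i-th bit is document i's j-th signature bit."""
    result = (1 << n_docs) - 1        # start with all documents qualifying
    for j in word_bits:               # retrieve m slices instead of all F
        result &= bit_files[j]
    return [i for i in range(n_docs) if result >> i & 1]

# Document signatures (F=6) for 4 blocks: 000101 110101 100100 101101.
doc_sigs = [0b000101, 0b110101, 0b100100, 0b101101]
F = 6
# Transpose into F bit files, one per signature bit position (MSB first).
bit_files = [sum(((sig >> (F - 1 - j)) & 1) << i for i, sig in enumerate(doc_sigs))
             for j in range(F)]
# h(text) = 000101: the word's 1-bits are positions 3 and 5 (0-based, MSB first).
print(bit_sliced_search(bit_files, 4, [3, 5]))  # [0, 1, 3], i.e., blocks 1, 2, 4
```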
77
Frame-Sliced Signature File (FSSF)
  • Ideas
  • Random disk accesses are more expensive than
    sequential ones.
  • Force each word to hash into bit positions that
    are close to each other in the document
    signature.
  • These bits are stored together and can be
    retrieved with a few random accesses.
  • Procedure
  • The document signature (F bits long) is divided
    into k frames of s consecutive bits each.
  • For each word in the document, one of the k
    frames is chosen by a hash function.
  • Using another hash function, the word sets m bits
    in that frame.
78
(Figure: the signature matrix divided into frames;
each frame is kept in consecutive disk blocks.)
79
FSSF (Continued)
  • Example (D=2, F=12, s=6, k=2, m=3):
    word        signature
    free        000000 110010
    text        010110 000000
    document    010110 110010
  • Search
  • Only one frame has to be retrieved for a
    single-word query, i.e., only one random disk
    access is required. E.g., to search for documents
    containing the word "free", only the 2nd frame
    has to be examined, because the word signature of
    "free" falls in the 2nd frame.
  • At most n frames have to be scanned for an n-word
    query.
  • Insertion: only k frames have to be accessed,
    instead of F bit slices.

80
Vertical Partitioning and Compression
  • idea
  • Create a very sparse signature matrix.
  • Store it in bit-sliced form.
  • Compress each bit slice by storing the positions
    of the 1s in the slice.

81
Compressed Bit Slices (CBS)
  • Room for improvement over the bit-sliced method
  • Searching
  • Each search word requires the retrieval of m bit
    files.
  • The search time could be improved if m were
    forced to be 1.
  • Insertion
  • Requires too many disk accesses (equal to F,
    which is typically 600-1000).

82
Compressed Bit Slices (CBS)(Continued)
  • Let m = 1. To maintain the same false drop
    probability, the signature size F has to be
    increased.

(Figure: one bit set for each word; only one row of
the sparse signature matrix has to be read per search
word.)
83
(Figure: the document collection and the compressed bit
slices; each slice is the representation for a word.
Synonyms, i.e., words hashing to the same slice, are
not distinguished.)
  • Hash a word to obtain its bucket address, e.g.,
    h(base) = 30.
  • Obtain the pointers to the relevant documents
    from the bucket.
84
Doubly Compressed Bit Slices
Idea: compress the sparse directory with a second hash
function.
(Figure: h1 addresses the directory and h2 selects
within the bucket chain; synonyms are distinguished
only partially. Follow the pointers of the posting
buckets to retrieve the qualifying documents, e.g.,
h1(base) = 30, h2(base) = 011.)
85
No False Drops Method
(Figure: as in DCBS, but each posting also stores a
fixed-length pointer to the word in the text file,
saving space and distinguishing synonyms completely.)
86
Horizontal Partitioning
1. Goal: group the signatures into sets, partitioning
   the signature matrix horizontally.
2. Grouping criterion: similar signatures are placed
   in the same group.
87
Partitioned Signature Files
  • Use a portion of each document signature as a
    signature key to partition the signature file.
  • All signatures with the same key are grouped into
    a so-called module.
  • When a query signature arrives:
  • examine its signature key and look for the
    corresponding modules
  • scan all the signatures within those modules that
    have been selected

88
Comparisons
  • signature files
  • use hashing techniques to produce an index
  • advantage
  • storage overhead is small (10-20%)
  • disadvantages
  • the search time on the index is linear
  • some answers may not match the query, so
    filtering must be done

89
Comparisons (Continued)
  • inverted files
  • storage overhead (30%-100%)
  • search time for word searches is logarithmic
  • PAT arrays
  • potential use in other kinds of searches
  • phrases
  • regular expression searching
  • approximate string searching
  • longest repetitions
  • most frequent searching