Title: Indexing and Searching
1Indexing and Searching
2The Retrieval Process
3Outline
- Conventional text retrieval systems (8.1-8.3, Salton)
- File Structures for Indexing and Searching (Chap. 8)
- Inverted files
- Suffix trees and suffix arrays
- Signature files
- Sequential searching
- Pattern matching
4Conventional Text Retrieval Systems
- Database management, e.g. employee DB
- Structured records
- Precise meaning for attribute values
- Exact match
- Text retrieval, e.g. bibliographic systems
- Structured attributes and unstructured content
- Index terms
- Imprecise representation of the text
- Approximate or partial matching
5Conceptual Information Retrieval
Queries
Documents
Similarity computation
Retrieval of similar terms
6Expanded Text Retrieval System
Formal statements
Indexed documents
Similarity computation
Documents
Queries
Negotiation and analysis (Query formulation)
Text indexing (Content Analysis)
Retrieval of similar terms
Example documents: Taipei city government, Taipei travel guide, Wiki page on Taipei, Taipei 101, Taipei Times
Example query: Taipei
7Representation
- Documents
- Indexed terms (or term vectors)
- Unweighted or weighted
- Queries
- Unweighted or weighted terms
- Boolean operators: or, and, not
- E.g. Taiwan AND NOT Taipei
- Efficiency
8Data Structure
- Requirement
- Fast access to documents
- Very large number of index terms
- For each term a separate index is constructed that stores the document identifiers for all documents identified by that term
- Inverted index (or inverted file)
9Inverted Index
- The complete file is represented as an array of
indexed documents.
10Inverted-file Process
- The document-term array is inverted (actually
transposed).
11Inverted-file Process
- The rows are manipulated according to the query specification (list merging)
- Ex: Query (term 2 and term 3)
  term 2:   1 1 0 0
  term 3:   0 1 1 1
  -----------------
  result:   0 1 0 0
- Ex: Query ((T1 or T2) and not T3)
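The list-merging step can be sketched with one bit per document in each row. This is only an illustration: the T2 and T3 rows are read off the example above (which row belongs to which term is an assumption), the T1 row is made up, and the helper names are hypothetical.

```python
# Rows of the inverted (transposed) document-term array: one bit per document.
rows = {
    "T1": [0, 0, 1, 0],  # hypothetical row, not from the slides
    "T2": [1, 1, 0, 0],
    "T3": [0, 1, 1, 1],
}

def row_and(a, b):
    """Documents that contain both terms."""
    return [x & y for x, y in zip(a, b)]

def row_or(a, b):
    """Documents that contain at least one of the terms."""
    return [x | y for x, y in zip(a, b)]

def row_not(a):
    """Documents that do not contain the term."""
    return [1 - x for x in a]

# Query (T2 and T3)
q1 = row_and(rows["T2"], rows["T3"])
# Query ((T1 or T2) and not T3)
q2 = row_and(row_or(rows["T1"], rows["T2"]), row_not(rows["T3"]))

print(q1)  # [0, 1, 0, 0]
print(q2)  # [1, 0, 0, 0]
```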
12Extensions of Inverted Index
- Distance Constraints
- Term Weights
- Synonym Specification
- Term Truncation
13Distance Constraints
- Nearness parameters
- Within sentence: terms co-occur in a common sentence
- Adjacency: terms occur adjacently in the text
14- Implementation
- To include term-location information in the inverted index
- information: D345, D348, D350, ...
- retrieval: D123, D128, D345, ...
- Cost: size of the indexes
- To include sentence numbers for all term occurrences in the inverted index
- information: (D345, 25), (D345, 37), (D348, 10), (D350, 8), ...
- retrieval: (D123, 5), (D128, 25), (D345, 37), (D345, 40), ...
15- To include paragraph numbers, sentence numbers within paragraphs, and word numbers within sentences in the inverted index
- information: (D345, 2, 3, 5), ...
- retrieval: (D345, 2, 3, 6), ...
- Ex: (information adjacent retrieval), (information within five words retrieval)
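A minimal sketch of how distance constraints could be evaluated once locations are stored. It assumes a simplified layout with a single running word number per document (instead of paragraph/sentence/word triples); the postings and the function names `within` and `adjacent` are hypothetical.

```python
# Hypothetical positional postings: term -> {document id: sorted word numbers}.
index = {
    "information": {"D345": [4, 20], "D348": [7]},
    "retrieval": {"D345": [5, 31], "D348": [15]},
}

def within(index, t1, t2, k):
    """Documents in which t2 occurs at most k words after t1."""
    p1, p2 = index.get(t1, {}), index.get(t2, {})
    hits = []
    for doc in sorted(p1.keys() & p2.keys()):
        if any(0 < b - a <= k for a in p1[doc] for b in p2[doc]):
            hits.append(doc)
    return hits

def adjacent(index, t1, t2):
    """Documents in which t2 immediately follows t1."""
    return within(index, t1, t2, 1)

print(adjacent(index, "information", "retrieval"))   # ['D345']
print(within(index, "information", "retrieval", 5))  # ['D345']
print(within(index, "information", "retrieval", 8))  # ['D345', 'D348']
```

The price is the one named above: every occurrence carries location data, so the index grows.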
16Term Weights
- Term-importance weights
- D_i = (T_i1, 0.2; T_i2, 0.5; T_i3, 0.6)
- Issues
- How to generate term weights? (more on this later)
- How to apply term weights?
- Vector queries: the sum of the weights of all document terms that match the given query
- Boolean queries (more complex)
17Term Weights (for Boolean Queries)
- Transforming each query into sum-of-products form (or disjunctive normal form)
- The weight of each conjunct is the minimum term weight of any document term in the conjunct
- The document weight is the maximum of all the conjunct weights
18An Example
- Example: Q = (T1 and T2) or T3
  Document vectors                  | Conjunct weights          | Query weight
                                    | (T1 and T2) | (T3)        | (T1 and T2) or T3
  D1 = (T1, 0.2; T2, 0.5; T3, 0.6)  | 0.2         | 0.6         | 0.6
  D2 = (T1, 0.7; T2, 0.2; T3, 0.1)  | 0.2         | 0.1         | 0.2
- D1 is preferred.
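The min/max rule can be checked with a few lines of code. This is a sketch that assumes the query is already in disjunctive normal form, given as a list of conjuncts; the function name `dnf_weight` is hypothetical.

```python
def dnf_weight(doc, dnf):
    """Weight of a document for a query in disjunctive normal form.

    doc: mapping term -> weight (a missing term counts as weight 0).
    dnf: list of conjuncts, each conjunct a list of terms.
    """
    # Each conjunct is as strong as its weakest term ...
    conjunct_weights = [min(doc.get(t, 0.0) for t in conj) for conj in dnf]
    # ... and the document takes the best conjunct.
    return max(conjunct_weights)

query = [["T1", "T2"], ["T3"]]  # (T1 and T2) or T3
d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
d2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}

print(dnf_weight(d1, query))  # 0.6
print(dnf_weight(d2, query))  # 0.2
```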
19- Synonym Specification
- (T1 and T2) or T3
- ((T1 or S1) and T2) or (T3 or S3)
- Term Truncation (or stemming)
- Removing suffixes and/or prefixes
- Ex: PSYCH: psychiatrist, psychiatry, psychiatric, psychology, psychological, ...
20File Structures for Indexing and Searching
21Introduction
- How to retrieve information?
- A simple alternative is to search the whole text sequentially (online search)
- Another option is to build data structures over the text (called indices) to speed up the search
22Introduction
- Indexing techniques
- Inverted files
- Suffix arrays
- Signature files
23Notation
- n: the size of the text
- m: the length of the pattern (m << n)
- v: the size of the vocabulary
- M: the amount of main memory available
24Inverted Files
- Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.
- Structure of an inverted file
- Vocabulary: the set of all distinct words in the text
- Occurrences: lists containing all the information necessary for each word of the vocabulary (text position, frequency, documents where the word appears, etc.)
25Example
Positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
Text: That house has a garden. The garden has many flowers. The flowers are beautiful
Vocabulary   Occurrences
beautiful    70
flowers      45, 58
garden       18, 29
house        6
26 Space Requirements
- The space required for the vocabulary is rather small. According to Heaps' law the vocabulary grows as O(n^β), where β is a constant between 0.4 and 0.6 in practice (sublinear)
- On the other hand, the occurrences demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n)
- To reduce space requirements, a technique called block addressing is used
27Block Addressing
- The text is divided into blocks
- The occurrences point to the blocks where the word appears
- Advantages
- the number of pointers is smaller than the number of positions
- all the occurrences of a word inside a single block are collapsed to one reference
- Disadvantages
- online search over the qualifying blocks if exact positions are required
28Example
Text divided into Block 1, Block 2, Block 3 and Block 4:
That house has a garden. The garden has many flowers. The flowers are beautiful
Vocabulary   Occurrences (block numbers)
beautiful    4
flowers      3
garden       2
house        1
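A small sketch of block addressing on the example sentence. The block boundaries of the figure are not recoverable, so the code assumes fixed blocks of four words, which happens to reproduce the block numbers listed above; `block_index` is a hypothetical helper and, unlike the figure, it indexes every word.

```python
import re
from collections import defaultdict

def block_index(text, words_per_block):
    """Map each word to the sorted list of block numbers (1-based) it occurs in."""
    words = re.findall(r"[a-z]+", text.lower())
    index = defaultdict(list)
    for i, word in enumerate(words):
        block = i // words_per_block + 1
        # Several occurrences inside one block collapse to a single reference.
        if not index[word] or index[word][-1] != block:
            index[word].append(block)
    return dict(index)

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
idx = block_index(text, words_per_block=4)

print(idx["house"], idx["garden"], idx["flowers"], idx["beautiful"])
# [1] [2] [3] [4]
```

Both occurrences of "garden" and of "flowers" fall into one block each, so each word needs only one pointer; finding the exact positions would require scanning that block.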
29Inverted Files for Different Addressing
Granularity
All words indexed
Stopwords not indexed
30Searching
- The search algorithm on an inverted index follows three steps
- Vocabulary search: the words present in the query are searched in the vocabulary
- Retrieval of occurrences: the lists of the occurrences of all words found are retrieved
- Manipulation of occurrences: the occurrences are processed to solve the query
31Searching
- The searching task on an inverted file always starts in the vocabulary (it is better to store the vocabulary in a separate file)
- The structures most used to store the vocabulary are hashing, tries or B-trees
- Hashing, tries: O(m)
- An alternative is simply storing the words in lexicographical order (cheaper in space and very competitive, with O(log v) cost)
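The lexicographical-order alternative amounts to a binary search over a sorted word list. A minimal sketch using Python's standard `bisect` module; the two parallel lists and the `lookup` helper are illustrative choices, with the occurrence numbers copied from the earlier example.

```python
from bisect import bisect_left

# Vocabulary in lexicographical order, with a parallel list of occurrence lists.
vocabulary = ["beautiful", "flowers", "garden", "house"]
occurrences = [[70], [45, 58], [18, 29], [6]]

def lookup(word):
    """Binary search in the sorted vocabulary: O(log v) word comparisons."""
    i = bisect_left(vocabulary, word)
    if i < len(vocabulary) and vocabulary[i] == word:
        return occurrences[i]
    return []

print(lookup("garden"))  # [18, 29]
print(lookup("tree"))    # []
```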
32Construction
- All the vocabulary is kept in a suitable data structure, storing for each word a list of its occurrences
- Each word of the text is read and searched in the vocabulary
- If it is not found, it is added to the vocabulary with an empty list of occurrences
- In either case, the new position is added to the end of its list of occurrences
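The construction loop can be sketched as follows. This is an in-memory illustration only: it uses a Python dict where the slides mention a suitable data structure such as a trie, it indexes every word (no stopword removal), and it records 0-based character offsets, so some values differ by one from the numbering used in the figures.

```python
import re

def build_inverted_index(text):
    """Read the text word by word and append each position to the word's list."""
    index = {}
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in index:
            index[word] = []             # new word: start with an empty list
        index[word].append(match.start())  # in either case, append the position
    return index

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
index = build_inverted_index(text)

print(index["garden"])     # [17, 29]
print(index["flowers"])    # [45, 58]
print(index["beautiful"])  # [70]
```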
33Example
Positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
Text: That house has a garden. The garden has many flowers. The flowers are beautiful
Vocabulary trie (branching on the first letter):
b -> beautiful: 70
f -> flowers: 45, 58
g -> garden: 18, 29
h -> house: 6
34Construction
- Once the text is exhausted, the vocabulary is written to disk with the lists of occurrences. Two files are created
- in the first file, the lists of occurrences are stored contiguously (posting file)
- in the second file, the vocabulary is stored in lexicographical order and, for each word, a pointer to its list in the first file is also included. This allows the vocabulary to be kept in memory at search time
- The overall process is O(n) worst-case time
- Not practical for large texts
35Construction
- An option is to use the previous algorithm until the main memory is exhausted. When no more memory is available, the partial index Ii obtained up to now is written to disk and erased from main memory before continuing with the rest of the text
- Once the text is exhausted, a number of partial indices Ii exist on disk
- The partial indices are merged to obtain the final index
36Example
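[Figure: merging the partial indices in a binary fashion; the numbers give the order of the merges.]
- initial dumps: I 1, I 2, I 3, I 4, I 5, I 6, I 7, I 8
- level 1: I 1...2 (merge 1), I 3...4 (merge 2), I 5...6 (merge 4), I 7...8 (merge 5)
- level 2: I 1...4 (merge 3), I 5...8 (merge 6)
- level 3: I 1...8, the final index (merge 7)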
37Construction
- The total time to generate partial indices is O(n)
- The number of partial indices is O(n/M)
- To merge the O(n/M) partial indices, log2(n/M) merging levels are necessary
- The total cost of this algorithm is O(n log(n/M))
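The level-by-level merge can be sketched in memory. This only illustrates the merging pattern: a real implementation would stream sorted vocabularies and occurrence lists from disk, and the names `merge_two` and `merge_all` are hypothetical.

```python
def merge_two(a, b):
    """Merge two partial indices (word -> sorted positions).

    b must cover a later part of the text than a, so that concatenating the
    occurrence lists keeps them sorted.
    """
    merged = {word: list(positions) for word, positions in a.items()}
    for word, positions in b.items():
        merged.setdefault(word, []).extend(positions)
    return merged

def merge_all(partials):
    """Merge neighbouring partial indices pairwise, level by level."""
    if not partials:
        return {}, 0
    levels = 0
    while len(partials) > 1:
        next_level = []
        for i in range(0, len(partials), 2):
            if i + 1 < len(partials):
                next_level.append(merge_two(partials[i], partials[i + 1]))
            else:
                next_level.append(partials[i])  # odd one out moves up unchanged
        partials = next_level
        levels += 1
    return partials[0], levels

# Eight hypothetical partial indices I1..I8, each covering one slice of the text.
partials = [{"w": [i], f"u{i}": [i]} for i in range(8)]
final_index, levels = merge_all(partials)
print(final_index["w"])  # [0, 1, 2, 3, 4, 5, 6, 7]
print(levels)            # 3
```

With eight partial indices the loop runs three levels, matching the log2 bound above.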
38Summary on Inverted File
- Inverted file is probably the most adequate indexing technique for database text
- The indices are appropriate when the text collection is large and semi-static
- Otherwise, if the text collection is volatile, online searching is the only option
- Some techniques combine online and indexed searching
39Suffix Trees and Suffix Arrays
- Each position in the text is considered as a text suffix
- Index points are selected from the text, which point to the beginning of the text positions which will be retrievable
- The problem with suffix trees is their space overhead
40Example
- Text
Positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
Text: That house has a garden. The garden has many flowers. The flowers are beautiful
- Suffixes
- house has a garden. The garden has many flowers. The flowers are beautiful
- garden. The garden has many flowers. The flowers are beautiful
- garden has many flowers. The flowers are beautiful
- flowers. The flowers are beautiful
- flowers are beautiful
- beautiful
41Example
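Positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
Text: That house has a garden. The garden has many flowers. The flowers are beautiful
[Figure: suffix trie over the indexed suffixes of the text, branching character by character; the leaves are the text positions 70, 58, 45, 29, 18 and 6.]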
42Example
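Positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
Text: That house has a garden. The garden has many flowers. The flowers are beautiful
[Figure: suffix tree (compacted trie) for the same suffixes. The root inspects character 1 and branches on b, f, g, h; the f node inspects character 8 (leaves 58 and 45) and the g node inspects character 7 (leaves 29 and 18); b leads to leaf 70 and h to leaf 6.]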
43Suffix Arrays
- An array containing all the pointers to the text suffixes listed in lexicographical order
- The space requirements are almost the same as those for inverted indices
- The main drawback of suffix arrays is their costly construction process
- Allow binary searches done by comparing the contents of each pointer
- Supra-indices (for large suffix arrays)
- The space requirements of a suffix array with a vocabulary supra-index are exactly the same as for inverted indices
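A minimal sketch of a word-based suffix array with binary search. It is a naive construction (sorting whole suffixes), meant only to show the structure; it assumes every word start is an index point and uses 0-based offsets, so some numbers differ by one from the figures. The names `build_suffix_array` and `search_prefix` are hypothetical.

```python
import re

def build_suffix_array(text):
    """Pointers to all word starts, sorted by the suffix beginning there."""
    points = [m.start() for m in re.finditer(r"[A-Za-z]+", text)]
    return sorted(points, key=lambda p: text[p:])

def search_prefix(text, sa, pattern):
    """All index points whose suffix starts with pattern, via two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
sa = build_suffix_array(text)
print(search_prefix(text, sa, "garden"))   # [17, 29]
print(search_prefix(text, sa, "flowers"))  # [45, 58]
```

Each comparison reads the text through the pointer, which is the "comparing the contents of each pointer" step mentioned above.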
44Example
- Text
- Suffix Array
- Supra Index (l = 4, b = 2)
Positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
Text: That house has a garden. The garden has many flowers. The flowers are beautiful
45Example
- Text
- Vocabulary Supra-Index
- Suffix Array
- Inverted List
Positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
Text: That house has a garden. The garden has many flowers. The flowers are beautiful
46Construction of Suffix Arrays for Large Texts
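[Figure: three steps. (1) A small suffix array is built from a small text block. (2) The long text is compared against the small text and its small suffix array to compute counters. (3) The counters are used to merge the small suffix array with the long suffix array into the final suffix array.]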
47Signature Files
- Characteristics
- Word-oriented index structures based on hashing
- Low overhead (10%-20% over the text size) at the cost of forcing a sequential search over the index
- Suitable for not very large texts
- Inverted files outperform signature files for most applications
48Construction and Search
- Word-oriented index structures based on hashing
- Maps words to bit masks of B bits
- Divides the text into blocks of b words each
- The mask of a block is obtained by bitwise ORing the signatures of all the words in the text block.
- Search
- Hash the query to a bit mask W
- If W & B_i = W, the text block may contain the word
- For all candidate blocks, an online traversal must be performed to verify if the word is actually there
49Example
- Four blocks
- Text: This is a text. | A text has many | words. Words are | made from letters.
- Signatures: 000101 | 110101 | 100100 | 101101
- Hash(text) = 000101
- Hash(many) = 110000
- Hash(words) = 100100
- Hash(made) = 001100
- Hash(letters) = 100001
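The example can be replayed in code. The sketch takes the five word signatures as given rather than computing a real hash function, assumes the remaining words are stopwords without a signature, and numbers blocks from 0; `block_signature` and `search` are hypothetical helpers.

```python
import re

# Word signatures copied from the example (B = 6 bits).
HASH = {
    "text": 0b000101,
    "many": 0b110000,
    "words": 0b100100,
    "made": 0b001100,
    "letters": 0b100001,
}
blocks = ["This is a text.", "A text has many", "words. Words are", "made from letters."]

def words_of(block):
    return re.findall(r"[a-z]+", block.lower())

def block_signature(block):
    """Bitwise OR of the signatures of all (non-stopword) words in the block."""
    sig = 0
    for word in words_of(block):
        sig |= HASH.get(word, 0)
    return sig

signatures = [block_signature(b) for b in blocks]

def search(word):
    """Return (candidate blocks, verified blocks) for a query word."""
    w = HASH[word]
    candidates = [i for i, s in enumerate(signatures) if (s & w) == w]
    # Candidates must be scanned to discard false drops.
    verified = [i for i in candidates if word in words_of(blocks[i])]
    return candidates, verified

print([format(s, "06b") for s in signatures])
# ['000101', '110101', '100100', '101101']
print(search("text"))  # ([0, 1, 3], [0, 1])
```

The last block is a candidate for "text" although the word is not there: the bits set by "made" and "letters" together cover the signature of "text".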
50False Drop
- Assumes that l bits are randomly set in the mask
- Let $\alpha = l/B$
- For b words, the probability that a given bit of the mask is set is $1 - (1 - 1/B)^{bl} \approx 1 - e^{-b\alpha}$
- Hence, the probability that the l random bits are also set is $F_d = (1 - e^{-b\alpha})^{\alpha B}$ (false alarm)
- $F_d$ is minimized for $\alpha = \ln(2)/b$
- $F_d = 2^{-l}$, $l = B \ln 2 / b$
51Comparisons
- Signature files
- Use hashing techniques to produce an index
- advantage
- storage overhead is small (10-20%)
- disadvantages
- the search time on the index is linear
- some answers may not match the query, thus
filtering must be done
52Comparisons (Continued)
- Inverted files
- storage overhead (30-100%)
- search time for word searches is logarithmic
- Suffix arrays
- potential use in other kind of searches
- phrases
- regular expression searching
- approximate string searching
- longest repetitions
- most frequent searching
53Sequential Searching
- Brute Force (BF)
- Knuth-Morris-Pratt (KMP)
- Boyer-Moore Family (BM)
- Shift-Or
- Suffix Automaton
54Exact String Matching
- Definition: Given a short pattern P of length m and a long text T of length n, find all the text positions where the pattern occurs
- The simplest algorithm: Brute-Force (BF)
- Trying all possible pattern positions in the text
- Worst-case cost O(mn), average-case cost O(n)
- O(n) text positions
- O(m) worst-case cost for each position
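A direct sketch of the brute-force search, with 0-based positions; the function name is illustrative. The outer loop visits the O(n) text positions and the inner loop costs at most m comparisons per position.

```python
def brute_force(text, pattern):
    """Try every text position; return the start positions of all occurrences."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):  # O(n) candidate positions
        j = 0
        while j < m and text[i + j] == pattern[j]:  # up to m comparisons each
            j += 1
        if j == m:
            matches.append(i)
    return matches

print(brute_force("abracabracadabra", "abracadabra"))  # [5]
print(brute_force("aaaa", "aa"))                       # [0, 1, 2]
```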
55Knuth-Morris-Pratt
- The KMP method scans the characters left-to-right
- When a mismatch occurs, an optimum shift is carried out for pattern P
- No new match can be obtained except when some head of the already matching part of P is identical to a tail of the matching part of T
- How to detect coincidences between heads of P and tails of T?
- Any matching tail of T is also a matching tail of P
- Detecting repeating portions in P
56Knuth-Morris-Pratt
- Next table: at position j, the length of the longest proper prefix of P[1..j-1] which is also a suffix and such that the characters following the prefix and the suffix are different
- j - next[j] - 1 positions can be safely skipped
- j:    1 2 3 4 5 6 7 8 9 10 11 12
- Next: 0 0 0 0 1 0 1 0 0 0  0  4
- P:    a b r a c a d a b r  a
- T: a b r a c a b r a c a d a b r a
- P: a b r a c a d   (mismatch at j = 7)
- P:           a b r a c a d a b r a   (shifted by 7 - next[7] - 1 = 5)
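A sketch of KMP for comparison. Note the swap: it uses the plain failure function (longest proper prefix that is also a suffix), not the refined Next table above with its extra "following characters differ" condition, so the table values differ while the reported matches are the same. Positions are 0-based and the function names are illustrative.

```python
def failure_table(p):
    """fail[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, p):
    """Scan the text left to right once; return start positions of all matches."""
    if not p:
        return []
    fail = failure_table(p)
    matches, k = [], 0  # k = number of pattern characters currently matched
    for i, c in enumerate(text):
        while k > 0 and c != p[k]:
            k = fail[k - 1]  # shift the pattern, reuse the matched head
        if c == p[k]:
            k += 1
        if k == len(p):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches

print(failure_table("abracadabra"))                   # [0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4]
print(kmp_search("abracabracadabra", "abracadabra"))  # [5]
```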
57- Since at each text comparison either the window or the text pointer advances by at least one position, the algorithm performs at most 2n comparisons (and at least n)
- The Aho-Corasick algorithm is an extension of KMP for matching a set of patterns
- Patterns are arranged in a trie-like data structure
- Ex: hello, elbow, eleven
58Boyer-Moore Family
- The BM method scans characters from right to left
- The heuristic which gives the longest shift is selected
- Matching shift (or good-suffix shift, δ2 shift)
- When some tail of P already matches some substring of S
- Occurrence shift (or bad-character shift, δ1 shift)
- When a mismatched character is known not to occur in the pattern
- Extended δ1 shift places in coincidence any matching positions between heads and tails of P
59Examples
- T: a b r a c a b r a c a d a b r a
- P: a b r a c a d a b r a
- P:       a b r a c a d a b r a   (δ2 = 3)
- P:           a b r a c a d a b r a   (δ1 = 5)
- T: b a b c b a d c a b c a a b c a
- P: a b c a b c a c a b
- P:           a b c a b c a c a b   (δ2 = 5)
- P:               a b c a b c a c a b   (δ1 = 7)
- P:                 a b c a b c a c a b   (extended δ1 = 8)
60- Some variations
- Simplified BM algorithm
- BM-Horspool (BMH) algorithm
- BM-Sunday (BMS) algorithm
- Commentz-Walter algorithm: an extension of BM to multi-pattern search
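As an illustration of one of these variations, here is a sketch of BM-Horspool as it is commonly described: the window is compared right to left, and the shift depends only on the text character aligned with the last pattern position. It is not the full BM algorithm (there is no matching/good-suffix shift), positions are 0-based, and the function name is illustrative.

```python
def horspool(text, p):
    """Boyer-Moore-Horspool search; returns start positions of all matches."""
    n, m = len(text), len(p)
    if m == 0 or m > n:
        return []
    # Distance from the rightmost occurrence of each character (excluding the
    # last pattern position) to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(p[:-1])}
    matches = []
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == p[j]:  # compare right to left
            j -= 1
        if j < 0:
            matches.append(i)
        # Characters that do not occur in the pattern allow a full shift of m.
        i += shift.get(text[i + m - 1], m)
    return matches

print(horspool("abracabracadabra", "abracadabra"))  # [5]
```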
61Shift-Or
- Based on bit-parallelism to simulate the operation of a non-deterministic automaton
- It first builds a table B which stores a bit mask b_m...b_1 for each character
- B[c] has the i-th bit set to zero iff p_i = c
- The state of the search is kept in D = d_m...d_1 (initially set to all 1s)
- Where d_i is zero whenever the state numbered i is active
- A match is reported whenever d_m is zero
- For each new character T_j, D <- (D << 1) | B[T_j]
62Example
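Pattern P = a b r a c a b a (m = 8); mask bits listed for positions 1, 2, ..., m
position:  1 2 3 4 5 6 7 8
P:         a b r a c a b a
B[a]:      0 1 1 0 1 0 1 0
B[b]:      1 0 1 1 1 1 0 1
B[c]:      1 1 1 1 0 1 1 1
B[r]:      1 1 0 1 1 1 1 1
B[other]:  1 1 1 1 1 1 1 1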
63Example
- Ex: Input T = abcabracaba (masks written as b_m...b_1)
- (11111111 << 1) | 01010110 = 11111110 (A)
- (11111110 << 1) | 10111101 = 11111101 (AB)
- (11111101 << 1) | 11101111 = 11111111 ()
- (11111111 << 1) | 01010110 = 11111110 (A)
- (11111110 << 1) | 10111101 = 11111101 (AB)
- (11111101 << 1) | 11111011 = 11111011 (ABR)
- (11111011 << 1) | 01010110 = 11110110 (ABRA)
- (11110110 << 1) | 11101111 = 11101111 (ABRAC)
- (11101111 << 1) | 01010110 = 11011110 (ABRACA)
- (11011110 << 1) | 10111101 = 10111101 (ABRACAB)
- (10111101 << 1) | 01010110 = 01111110 (ABRACABA) -> Matched!
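The update rule can be sketched with Python integers standing in for the m-bit machine word. Bit i-1 of each mask corresponds to pattern position i, positions in the result are 0-based, and the function name is illustrative.

```python
def shift_or(text, pattern):
    """Shift-Or search; returns start positions of all matches."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # B[c] has bit i set to zero iff pattern[i] == c.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, all_ones) & ~(1 << i)
    D = all_ones  # no state active yet
    matches = []
    for j, c in enumerate(text):
        # Shift in a zero (state 1 can always start), then OR the character mask.
        D = ((D << 1) | B.get(c, all_ones)) & all_ones
        if (D & (1 << (m - 1))) == 0:  # d_m is zero: whole pattern matched
            matches.append(j - m + 1)
    return matches

print(shift_or("abcabracaba", "abracaba"))  # [3]
```

The cost per text character is a shift and an OR, as long as the pattern fits into a machine word.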
64Suffix Automaton
- Suffix automaton on a pattern P: an automaton that recognizes all suffixes of P
- Backward DAWG matching (BDM) algorithm converts this automaton to a deterministic one
- DAWG: directed acyclic word graph
[Figure: suffix automaton for the pattern a b r a c a b a, with initial state I.]
65- To search a pattern P
- Suffix automaton of P^r (the reversed pattern) is built
- Search backwards inside the text window for a substring of P using the suffix automaton
- Each time a terminal state is reached before hitting the beginning of the window, the position inside the window is remembered
- Finding a prefix of the pattern -> suffix of the window
- The last prefix recognized backwards is the longest prefix of P
- The window is aligned with the longest prefix recognized
66Example
- P = abracadabra
- P^r = arbadacarba
- T = a b r a c a b r a c a d a b r a
- x x
- x x
x
67Practical Comparison
- The clear winners are BNDM and BMS (Sunday)
- Classical BM and BDM are also very close
- For English texts, Agrep is much faster
- Because the code is carefully optimized
- For longer patterns, BDM is better than BNDM
- For extended patterns, BNDM is normally the
fastest, otherwise Shift-Or is the best option
69Pattern Matching
- Searching allowing errors (Approximate String Matching)
- Dynamic Programming
- Automaton
- Regular Expressions and Extended patterns
- Pattern Matching Using Indices
- Inverted files
- Suffix Trees and Suffix Arrays
70Approximate String Matching
- Definition: Given a short pattern P of length m, a long text T of length n, and a maximum allowed number of errors k, find all the text positions where the pattern occurs with at most k errors
- This corresponds to the Levenshtein distance (edit distance)
- With minimum modifications it is adapted to searching whole words matching the pattern with k errors
71Dynamic Programming
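The slide body is missing from the transcript, so the following is only a sketch of the classical dynamic-programming approach to searching with at most k errors: one column of edit distances is updated per text character, and the top cell stays 0 so that a match may start anywhere. The function name and the example strings are illustrative.

```python
def approx_search(text, pattern, k):
    """Return the 0-based text positions where a match with <= k errors ends."""
    m = len(pattern)
    # col[i] = fewest errors to match pattern[:i] against a suffix of the text read so far.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]  # value of col[i-1] from the previous column
        col[0] = 0          # an occurrence may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra text character
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends

print(approx_search("surgery", "survey", 2))  # [4, 5, 6]
```

The cost is O(mn) time and O(m) space, since only one column is kept.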
72Automaton
73Regular Expressions
74Pattern Matching Using Indices
- Inverted Files
- Query types such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search
- The restriction: not able to efficiently find approximate matches or regular expressions that span many words.
75Pattern Matching Using Indices
- Suffix Trees
- Suffix trees are able to perform complex searches
- Word, prefix, suffix, substring, and range queries
- Regular expressions
- Unrestricted approximate string matching
- Useful in specific areas
- Find the longest substring
- Find the most common substring of a fixed size
76Pattern Matching Using Indices
- Suffix Arrays
- Some patterns can be searched directly in the suffix array without simulating the suffix tree
- Word, prefix, suffix, subword search and range search
77Compression
- Compressed text--Huffman coding
- Taking words as symbols
- Use an alphabet of bytes instead of bits
- Compressed indices
- Inverted Files
- Suffix Trees and Suffix Arrays
- Signature Files