text positions O(m) worst-case cost for each position presentation

1
Indexing and Searching

J. H. Wang
Feb. 20, 2008

2
The Retrieval Process
3
Outline

Conventional text retrieval systems (8.1-8.3,
Salton)
File Structures for Indexing and Searching (Chap.
8)
Inverted files
Suffix trees and suffix arrays
Signature files
Sequential searching
Pattern matching

4
Conventional Text Retrieval Systems

Database management, e.g. employee DB
Structured records
Precise meaning for attribute values
Exact match
Text retrieval, e.g. bibliographic systems
Structured attributes and unstructured content
Index terms
Imprecise representation of the text
Approximate or partial matching

5
Conceptual Information Retrieval
Queries
Documents
Similarity computation
Retrieval of similar terms
6
Expanded Text Retrieval System
Formal statements
Indexed documents
Similarity computation
Documents
Queries
Negotiation and analysis (Query formulation)
Text indexing (Content Analysis)
Retrieval of similar terms
Ex: Taipei city government, Taipei travel guide, Wiki page on Taipei, Taipei 101, Taipei times
Taipei
7
Representation

Documents
Indexed terms (or term vectors)
Unweighted or weighted
Queries
Unweighted or weighted terms
Boolean operators: or, and, not
E.g., Taiwan AND NOT Taipei
Efficiency

8
Data Structure

Requirement
Fast access to documents
Very large number of index terms
For each term a separate index is constructed
that stores the document identifiers for all
documents identified by that term
Inverted index (or inverted file)

9
Inverted Index

The complete file is represented as an array of
indexed documents.

10
Inverted-file Process

The document-term array is inverted (actually
transposed).

11
Inverted-file Process

The rows are manipulated according to query
specification. (list-merging)
Ex: Query (term 2 and term 3)
term 2:   1 1 0 0
term 3:   0 1 1 1
-----------------
result:   0 1 0 0
Ex: Query ((T1 or T2) and not T3)
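The list-merging step can be sketched with plain bit rows. This is only an illustration: the rows for term 2 and term 3 are the ones from the example above, while the T1 row and the helper names (row_and, row_or, row_not) are made up for this sketch.

```python
# Rows of a small inverted (term-document) array: rows[t][d] is 1 when
# document d is indexed by term t. The T1 row is invented for illustration.
rows = {
    "T1": [1, 0, 0, 1],
    "T2": [1, 1, 0, 0],
    "T3": [0, 1, 1, 1],
}

def row_and(a, b):
    """Documents present in both rows."""
    return [x & y for x, y in zip(a, b)]

def row_or(a, b):
    """Documents present in at least one row."""
    return [x | y for x, y in zip(a, b)]

def row_not(a):
    """Documents absent from the row."""
    return [1 - x for x in a]

# Query (T2 and T3)
q1 = row_and(rows["T2"], rows["T3"])
# Query ((T1 or T2) and not T3)
q2 = row_and(row_or(rows["T1"], rows["T2"]), row_not(rows["T3"]))
print(q1, q2)
```

A real system would merge sorted lists of document identifiers instead of fixed-width bit rows, but the Boolean logic is the same.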

12
Extensions of Inverted Index

Distance Constraints
Term Weights
Synonym Specification
Term Truncation

13
Distance Constraints

Nearness parameters
Within sentence: terms co-occur in a common sentence
Adjacency: terms occur adjacently in the text

Implementation
To include term-location information in the inverted index
information: D345, D348, D350, ...
retrieval: D123, D128, D345, ...
Cost: size of the indexes
To include sentence numbers for all term occurrences in the inverted index
information: (D345, 25), (D345, 37), (D348, 10), (D350, 8)
retrieval: (D123, 5), (D128, 25), (D345, 37), (D345, 40)

To include paragraph numbers, sentence numbers within paragraphs, word numbers within sentences in the inverted index
information: (D345, 2, 3, 5)
retrieval: (D345, 2, 3, 6)
Ex: (information adjacent retrieval)
Ex: (information within five words retrieval)
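A minimal sketch of how such location entries could answer the two example queries. The posting tuples are the ones shown above; the function name within and the choice to require the second term to follow the first inside the same sentence are assumptions of this sketch.

```python
# Postings with (document, paragraph, sentence, word) locations,
# taken from the example entries above.
postings = {
    "information": [("D345", 2, 3, 5)],
    "retrieval": [("D345", 2, 3, 6)],
}

def within(term_a, term_b, max_gap):
    """Documents where term_b follows term_a in the same sentence,
    at most max_gap words later (max_gap=1 means adjacent)."""
    hits = []
    for doc_a, par_a, sen_a, word_a in postings[term_a]:
        for doc_b, par_b, sen_b, word_b in postings[term_b]:
            same_sentence = (doc_a, par_a, sen_a) == (doc_b, par_b, sen_b)
            if same_sentence and 0 < word_b - word_a <= max_gap:
                hits.append(doc_a)
    return hits

adjacent = within("information", "retrieval", 1)   # information adjacent retrieval
near = within("information", "retrieval", 5)       # within five words
print(adjacent, near)
```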

16
Term Weights

Term-importance weights
Di = (Ti1, 0.2; Ti2, 0.5; Ti3, 0.6)
Issues
How to generate term weights? (more on this later)
How to apply term weights?
Vector queries: the sum of the weights of all document terms that match the given query
Boolean queries (more complex)

17
Term Weights (for Boolean Queries)

Transforming each query into sum-of-products form
(or disjunctive normal form)
The weight of each conjunct is the minimum term
weight of any document term in the conjunct
The document weight is the maximum of all the
conjunct weights

18
An Example

Example: Q = (T1 and T2) or T3

Document vectors                    Conjunct weights       Query weight
                                    (T1 and T2)   (T3)     (T1 and T2) or T3
D1 = (T1, 0.2; T2, 0.5; T3, 0.6)    0.2           0.6      0.6
D2 = (T1, 0.7; T2, 0.2; T3, 0.1)    0.2           0.1      0.2

D1 is preferred.
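The min/max rule for this example can be checked with a few lines of code. This is a sketch only: the query is assumed to be already in disjunctive normal form, and dnf_weight is a name chosen for the illustration.

```python
def dnf_weight(doc, conjuncts):
    """Weight of a document for a query in disjunctive normal form:
    minimum term weight inside each conjunct, maximum over the conjuncts."""
    return max(min(doc.get(term, 0.0) for term in conj) for conj in conjuncts)

# Q = (T1 and T2) or T3, written as a list of conjuncts
query = [("T1", "T2"), ("T3",)]
d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
d2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}

w1 = dnf_weight(d1, query)
w2 = dnf_weight(d2, query)
print(w1, w2)
```

D1 gets the higher weight, so it is ranked first.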

Synonym Specification
(T1 and T2) or T3
becomes ((T1 or S1) and T2) or (T3 or S3)
Term Truncation (or stemming)
Removing suffixes and/or prefixes
Ex: PSYCH matches psychiatrist, psychiatry, psychiatric, psychology, psychological, ...

20
File Structures for Indexing and Searching
21
Introduction

How to retrieve information?
A simple alternative is to search the whole text
sequentially (online search)
Another option is to build data structures over
the text (called indices) to speed up the search

22
Introduction

Indexing techniques
Inverted files
Suffix arrays
Signature files

23
Notation

n: the size of the text
m: the length of the pattern (m << n)
v: the size of the vocabulary
M: the amount of main memory available

24
Inverted Files

Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.
Structure of inverted file
Vocabulary: the set of all distinct words in the text
Occurrences: lists containing all information necessary for each word of the vocabulary (text position, frequency, documents where the word appears, etc.)

25
Example

Text
Word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful

Inverted file
Vocabulary   Occurrences
beautiful    70
flowers      45, 58
garden       18, 29
house        6
26
Space Requirements

The space required for the vocabulary is rather small. According to Heaps' law the vocabulary grows as O(n^β), where β is a constant between 0.4 and 0.6 in practice (sublinear)
On the other hand, the occurrences demand much
more space. Since each word appearing in the text
is referenced once in that structure, the extra
space is O(n)
To reduce space requirements, a technique called
block addressing is used

27
Block Addressing

The text is divided in blocks
The occurrences point to the blocks where the
word appears
Advantages
the number of pointers is smaller than positions
all the occurrences of a word inside a single
block are collapsed to one reference
Disadvantages
online search over the qualifying blocks if exact
positions are required
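A minimal sketch of block addressing. The slides do not state how blocks are delimited, so this sketch assumes blocks of a fixed number of words; block_index and words_per_block are names chosen for the illustration.

```python
import re

def block_index(text, words_per_block):
    """Map each word to the 1-based numbers of the blocks containing it.
    Repeated occurrences inside one block collapse to a single reference."""
    index = {}
    for i, word in enumerate(re.findall(r"[A-Za-z]+", text)):
        block = i // words_per_block + 1
        blocks = index.setdefault(word.lower(), [])
        if not blocks or blocks[-1] != block:
            blocks.append(block)
    return index

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
idx = block_index(text, words_per_block=4)
print(idx["garden"], idx["flowers"], idx["has"])
```

Both occurrences of "garden" fall in block 2 and are stored once, which is where the space saving comes from. Finding the exact positions still needs an online scan of that block.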

28
Example

Text (divided into Block 1, Block 2, Block 3, Block 4)
That house has a garden. The garden has many flowers. The flowers are beautiful

Inverted file
Vocabulary   Occurrences (block)
beautiful    4
flowers      3
garden       2
house        1
29
Inverted Files for Different Addressing
Granularity
All words indexed
Stopwords not indexed
30
Searching

The search algorithm on an inverted index follows three steps
Vocabulary search: the words present in the query are searched in the vocabulary
Retrieval of occurrences: the lists of the occurrences of all words found are retrieved
Manipulation of occurrences: the occurrences are processed to solve the query

31
Searching

Searching task on an inverted file always starts
in the vocabulary (It is better to store the
vocabulary in a separate file)
The structures most used to store the vocabulary
are hashing, tries or B-trees
Hashing, tries: O(m)
An alternative is simply storing the words in
lexicographical order (cheaper in space and very
competitive with O(log v) cost)
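The sorted-vocabulary alternative amounts to a binary search. A small sketch, using the vocabulary and occurrence lists of the earlier example; the parallel-list layout and the name lookup are assumptions of this illustration.

```python
from bisect import bisect_left

# Vocabulary in lexicographical order, with a parallel list of occurrences.
vocabulary = ["beautiful", "flowers", "garden", "house"]
occurrences = [[70], [45, 58], [18, 29], [6]]

def lookup(word):
    """Binary search in the sorted vocabulary: O(log v) comparisons."""
    i = bisect_left(vocabulary, word)
    if i < len(vocabulary) and vocabulary[i] == word:
        return occurrences[i]
    return []

print(lookup("garden"), lookup("tree"))
```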

32
Construction

All the vocabulary is kept in a suitable data
structure storing for each word a list of its
occurrences
Each word of the text is read and searched in the
vocabulary
If it is not found, it is added to the vocabulary with an empty list of occurrences
In either case, the new position is added to the end of its list of occurrences
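The in-memory construction loop might look like the sketch below. A Python dict stands in for the trie used on the slides, the sample text is hypothetical, and positions are counted from 1 over every character of the string.

```python
import re

def build_inverted_file(text):
    """Read each word, look it up in the vocabulary, add it with an empty
    list if it is new, then append the word's starting position."""
    vocabulary = {}
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in vocabulary:
            vocabulary[word] = []
        vocabulary[word].append(match.start() + 1)  # 1-based position
    return vocabulary

index = build_inverted_file("to be or not to be")
print(index)
```

Each word costs one vocabulary lookup and one append, which is where the O(n) bound for texts that fit in memory comes from.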

33
Example

Text
Word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful

Vocabulary trie
b -> beautiful: 70
f -> flowers: 45, 58
g -> garden: 18, 29
h -> house: 6
34
Construction

Once the text is exhausted, the vocabulary is written to disk with the lists of occurrences. Two files are created:
in the first file, the lists of occurrences are stored contiguously (posting file)
in the second file, the vocabulary is stored in lexicographical order and, for each word, a pointer to its list in the first file is also included. This allows the vocabulary to be kept in memory at search time
The overall process is O(n) worst-case time
Not practical for large texts

35
Construction

An option is to use the previous algorithm until the main memory is exhausted. When no more memory is available, the partial index Ii obtained up to now is written to disk and erased from main memory before continuing with the rest of the text
Once the text is exhausted, a number of partial
indices Ii exist on disk
The partial indices are merged to obtain the
final index

36
Example
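Merging eight partial indices (the numbers 1 to 7 in parentheses give the order of the merges)

level 3 (final index):  I 1...8 (7)
level 2:                I 1...4 (3)   I 5...8 (6)
level 1:                I 1...2 (1)   I 3...4 (2)   I 5...6 (4)   I 7...8 (5)
initial dumps:          I 1   I 2   I 3   I 4   I 5   I 6   I 7   I 8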
37
Construction

The total time to generate partial indices is
O(n)
The number of partial indices is O(n/M)
To merge the O(n/M) partial indices, log2(n/M)
merging levels are necessary
The total cost of this algorithm is O(n log(n/M))
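The dump-and-merge scheme can be sketched as follows. This is an illustration under simplifying assumptions: partial indices are kept as in-memory dicts instead of disk files, "memory" is measured in postings, positions are word numbers, and merging goes level by level (the same binary tree as in the example, in a slightly different order).

```python
def build_partial_indices(words, memory_limit):
    """Dump a partial index every time memory_limit postings are held."""
    partials, current, used = [], {}, 0
    for position, word in enumerate(words, start=1):
        current.setdefault(word, []).append(position)
        used += 1
        if used == memory_limit:      # "memory exhausted": dump and restart
            partials.append(current)
            current, used = {}, 0
    if current:
        partials.append(current)
    return partials

def merge_two(left, right):
    """Merge two adjacent partial indices; right covers later text than left,
    so concatenating the lists keeps positions sorted."""
    return {w: left.get(w, []) + right.get(w, [])
            for w in sorted(set(left) | set(right))}

def merge_all(partials):
    """Merge pairwise, level by level; return (final index, number of levels)."""
    levels = 0
    while len(partials) > 1:
        merged = []
        for i in range(0, len(partials), 2):
            if i + 1 < len(partials):
                merged.append(merge_two(partials[i], partials[i + 1]))
            else:
                merged.append(partials[i])
        partials = merged
        levels += 1
    return (partials[0] if partials else {}), levels

words = "a b a c b a d a".split()
partials = build_partial_indices(words, memory_limit=1)
index, levels = merge_all(partials)
print(len(partials), levels, index)
```

With 8 partial indices the merge needs log2(8) = 3 levels, and every level touches all the postings once.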

38
Summary on Inverted File

Inverted file is probably the most adequate
indexing technique for database text
The indices are appropriate when the text
collection is large and semi-static
Otherwise, if the text collection is volatile
online searching is the only option
Some techniques combine online and indexed
searching

39
Suffix Trees and Suffix Arrays

Each position in the text is considered as a text suffix
Index points are selected from the text, which point to the beginning of the text positions which will be retrievable
The problem with suffix trees is their space overhead

40
Example

Text
Word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful

Suffixes
house has a garden. The garden has many flowers. The flowers are beautiful
garden. The garden has many flowers. The flowers are beautiful
garden has many flowers. The flowers are beautiful
flowers. The flowers are beautiful
flowers are beautiful
beautiful
41
Example

Text
Word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful

Suffix Trie
b -> 70
f - l - o - w - e - r - s -> "." -> 45
                          -> " " -> 58
g - a - r - d - e - n -> "." -> 18
                      -> " " -> 29
h -> 6
42
Example

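Text
Word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful

Suffix Tree (numbers in parentheses are the internal node labels)
root (1)
  b -> 70
  f -> node (8)
         "." -> 45
         " " -> 58
  g -> node (7)
         "." -> 18
         " " -> 29
  h -> 6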
43
Suffix Arrays

An array containing all the pointers to the text suffixes listed in lexicographical order
The space requirements are almost the same as those for inverted indices
The main drawback of suffix arrays is their costly construction process
They allow binary searches done by comparing the contents of each pointer
Supra-indices (for large suffix arrays)
The space requirements of a suffix array with a vocabulary supra-index are exactly the same as for inverted indices
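A small sketch of a suffix array and its binary search. It sorts the suffixes directly, which is simple but far too slow for large texts; the function names and the sample text are choices of this illustration, and positions are 0-based.

```python
def build_suffix_array(text, index_points):
    """Pointers to the suffixes starting at index_points, in lexicographical
    order of the suffixes (naive construction by direct sorting)."""
    return sorted(index_points, key=lambda p: text[p:])

def search(text, suffix_array, pattern):
    """Binary search for the block of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:                      # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])

text = "abracadabra"
sa = build_suffix_array(text, range(len(text)))
print(sa, search(text, sa, "abra"))
```

Here every character is an index point; restricting index_points to word beginnings gives the word-oriented variant shown in the examples.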

44
Example

Text
Word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful

Suffix Array
Supra Index (l4, b2)
45
Example

Text
Word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful

Vocabulary Supra-Index
Suffix Array
Inverted List
46
Construction of Suffix Arrays for Large Texts
[Figure: small text, small suffix array, long text, long suffix array, counters, final suffix array]
47
Signature Files

Characteristics
Word-oriented index structures based on hashing
Low overhead (10%-20% over the text size) at the cost of forcing a sequential search over the index
Suitable for not very large texts
Inverted files outperform signature files for
most applications

48
Construction and Search

Word-oriented index structures based on hashing
Maps words to bit masks of B bits
Divides the text in blocks of b words each
The mask of a block is obtained by bitwise ORing the signatures of all the words in the text block
Search
Hash the query to a bit mask W
If W & Bi = W, the text block may contain the word
For all candidate blocks, an online traversal must be performed to verify if the word is actually there
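A minimal sketch of building and searching a signature file. The hash function (bits derived from an MD5 digest), the mask width B = 16 and the number of bits per word are arbitrary choices for this illustration, not values from the slides.

```python
import hashlib
import re

B = 16        # bits per mask (illustrative)
L_BITS = 3    # up to this many bits are set per word (illustrative)

def signature(word):
    """Hash a word to a B-bit mask with at most L_BITS bits set."""
    digest = hashlib.md5(word.lower().encode()).digest()
    mask = 0
    for byte in digest[:L_BITS]:
        mask |= 1 << (byte % B)
    return mask

def build(text, b):
    """Split the text into blocks of b words; OR the word signatures."""
    words = re.findall(r"[A-Za-z]+", text)
    blocks = [words[i:i + b] for i in range(0, len(words), b)]
    masks = []
    for block in blocks:
        mask = 0
        for word in block:
            mask |= signature(word)
        masks.append(mask)
    return blocks, masks

def search(word, blocks, masks):
    """Sequentially test every block mask, then verify the candidates."""
    w = signature(word)
    candidates = [i for i, mask in enumerate(masks) if (w & mask) == w]
    # Online traversal of candidate blocks removes the false drops.
    return [i for i in candidates
            if word.lower() in (x.lower() for x in blocks[i])]

text = "This is a text. A text has many words. Words are made from letters."
blocks, masks = build(text, b=4)
print(search("text", blocks, masks), search("letters", blocks, masks))
```

The mask test can only give false positives, never false negatives, so the verification step is what makes the answer exact.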

49
Example

Four blocks

Block 1             Block 2            Block 3             Block 4
This is a text.     A text has many    words. Words are    made from letters.
000101              110101             100100              101101

Hash(text)    = 000101
Hash(many)    = 110000
Hash(words)   = 100100
Hash(made)    = 001100
Hash(letters) = 100001

50
False Drop

Assume that l bits are randomly set in the mask
Let $\alpha = l/B$
For b words, the probability that a given bit of the mask is set is $1 - (1 - 1/B)^{bl} \approx 1 - e^{-b\alpha}$
Hence, the probability that the l random bits are also set is $F_d = (1 - e^{-b\alpha})^{\alpha B}$ (false alarm)
$F_d$ is minimized for $\alpha = \ln(2)/b$
Then $F_d = 2^{-l}$, with $l = B \ln 2 / b$
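The false-drop estimate can be checked numerically. The values B = 64 and b = 4 below are arbitrary, and false_drop is a name chosen for this sketch.

```python
import math

def false_drop(B, b, l):
    """Fd = (1 - e^(-b*alpha))^(alpha*B) with alpha = l / B."""
    alpha = l / B
    return (1 - math.exp(-b * alpha)) ** (alpha * B)

B, b = 64, 4
l_opt = B * math.log(2) / b           # the minimizing l = B ln 2 / b
fd_opt = false_drop(B, b, l_opt)
print(l_opt, fd_opt, 2 ** -l_opt)
```

At the optimum each mask bit is set with probability 1/2, so Fd collapses to 2^(-l); setting fewer or more bits per word gives a larger false-drop probability.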

51
Comparisons

Signature files
Use hashing techniques to produce an index
advantage
storage overhead is small (10%-20%)
disadvantages
the search time on the index is linear
some answers may not match the query, thus
filtering must be done

52
Comparisons (Continued)

Inverted files
storage overhead (30%-100%)
search time for word searches is logarithmic
Suffix arrays
potential use in other kinds of searches
phrases
regular expression searching
approximate string searching
longest repetitions
most frequent searching

53
Sequential Searching

Brute Force (BF)
Knuth-Morris-Pratt (KMP)
Boyer-Moore Family (BM)
Shift-Or
Suffix Automaton

54
Exact String Matching

Definition: Given a short pattern P of length m and a long text T of length n, find all the text positions where the pattern occurs
The simplest algorithm: Brute-Force (BF)
Trying all possible pattern positions in the text
Worst-case cost O(mn), average-case cost O(n)
O(n) text positions
O(m) worst-case cost for each position
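A direct sketch of the brute-force search, with 0-based positions; the function name is chosen for this illustration.

```python
def brute_force(text, pattern):
    """Try every text position; compare up to m characters at each one."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):          # O(n) candidate positions
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # O(m) work in the worst case
        if j == m:
            matches.append(i)
    return matches

print(brute_force("abracabracadabra", "abracadabra"))
```

On typical text the inner loop stops after a character or two, which is why the average cost stays close to O(n).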

55
Knuth-Morris-Pratt

The KMP method scans the characters left-to-right
When a mismatch occurs, an optimum shift is
carried out for pattern P
No new match can be obtained except when some
head of the already matching part of P is
identical to a tail of the matching part of T
How to detect coincidences between heads of P and
tails of T
Any matching tail of T is also a matching tail of
P
Detecting repeating portions in P

56
Knuth-Morris-Pratt

Next table: at position j, the length of the longest proper prefix of P[1..j-1] which is also a suffix, such that the characters following the prefix and the suffix are different
j - next[j] - 1 positions can be safely skipped

Next  0 0 0 0 1 0 1 0 0 0 0 4
P     a b r a c a d a b r a

T     a b r a c a b r a c a d a b r a
      a b r a c a d
                a b r a c a d a b r a

Since at each text comparison the window or the pointer advances by at least one position, the algorithm performs at most 2n comparisons (and at least n)
The Aho-Corasick algorithm is an extension of KMP for matching a set of patterns
Patterns are arranged in a trie-like data structure
Ex: hello, elbow, eleven
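A sketch of single-pattern KMP. Note the assumption: this version uses the plain failure table (longest proper prefix that is also a suffix), not the refined Next table above that also requires the following characters to differ, so its table values differ at some positions. Positions are 0-based and the pattern is assumed non-empty.

```python
def failure_table(pattern):
    """fail[j] = length of the longest proper prefix of pattern[:j]
    that is also a suffix of pattern[:j]."""
    fail = [0] * (len(pattern) + 1)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j + 1] = k
    return fail

def kmp_search(text, pattern):
    """Scan the text left to right; the text pointer never moves back."""
    fail = failure_table(pattern)
    matches, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k]                 # shift the pattern
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k]
    return matches

print(failure_table("abracadabra"))
print(kmp_search("abracabracadabra", "abracadabra"))
```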

58
Boyer-Moore Family

The BM method scans characters from right to left
The heuristic which gives the longest shift is
selected
Matching shift (or good-suffix shift, δ2 shift)
When some tail of P already matches some substring of T
Occurrence shift (or bad-character shift, δ1 shift)
When a mismatched character is known not to occur in the pattern
Extended δ1 shift: places in coincidence any matching positions between heads and tails of P

59
Examples

a b r a c a b r a c a d a b r a
a b r a c a d a b r a
      a b r a c a d a b r a          (δ2 = 3)
          a b r a c a d a b r a      (δ1 = 5)

b a b c b a d c a b c a a b c a
a b c a b c a c a b
          a b c a b c a c a b        (δ2 = 5)
              a b c a b c a c a b    (δ1 = 7)
                a b c a b c a c a b  (extended δ1 = 8)

Some variations
Simplified BM algorithm
BM-Horspool (BMH) algorithm
BM-Sunday (BMS) algorithm
Commentz-Walter algorithm: an extension of BM to multi-pattern search
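Of the variations listed, BM-Horspool is the shortest to write down. The sketch below is one common formulation (shift decided only by the text character under the last pattern position); positions are 0-based and the function name is chosen for this illustration.

```python
def horspool(text, pattern):
    """Boyer-Moore-Horspool: compare right to left, then shift by the
    distance from the last occurrence of the window's final character
    (excluding the pattern's last position) to the end of the pattern."""
    m, n = len(pattern), len(text)
    # Later occurrences overwrite earlier ones, so the smallest safe shift wins.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches, pos = [], 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(pos)
        pos += shift.get(text[pos + m - 1], m)  # unseen character: shift by m
    return matches

print(horspool("abracabracadabra", "abracadabra"))
```

On this text the window moves by 3, then by 2, and then finds the match at position 5, without ever examining most of the text characters.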

61
Shift-Or

Based on bit-parallelism to simulate the
operation of a non-deterministic automaton
It first builds a table B which stores a bit mask b_m...b_1 for each character
B[c] has the i-th bit set to zero iff p_i = c
The state of the search is kept in D = d_m...d_1 (initially set to all 1s)
Where d_i is zero whenever the state numbered i is active
A match is reported whenever d_m is zero
For each new character T_j, D = (D << 1) | B[T_j]
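The update rule translates almost literally into code. A sketch using Python integers as bit masks, where bit i-1 of the integer plays the role of b_i or d_i; build_masks and shift_or are names chosen for this illustration, and match positions are 0-based.

```python
def build_masks(pattern):
    """B[c]: bit i (0-based) is zero iff pattern[i] == c."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, all_ones) & ~(1 << i)
    return masks

def shift_or(text, pattern):
    """D = (D << 1) | B[T_j]; a zero in the top bit signals a match."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    masks = build_masks(pattern)
    state = all_ones
    matches = []
    for j, c in enumerate(text):
        # Characters not in the pattern use an all-ones mask.
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:
            matches.append(j - m + 1)
    return matches

print(format(build_masks("abracaba")["a"], "08b"))
print(shift_or("abcabracaba", "abracaba"))
```

The whole state fits in one machine word when m does not exceed the word size, which is what makes each text character cost a constant number of operations.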

62
Example
P = a b r a c a b a

i       1 2 3 4 5 6 7 8 (= m)
B[a]    0 1 1 0 1 0 1 0
B[b]    1 0 1 1 1 1 0 1
B[c]    1 1 1 1 0 1 1 1
B[r]    1 1 0 1 1 1 1 1
B[*]    1 1 1 1 1 1 1 1   (* = any other character)
63
Example

Ex: Input T = abcabracaba (masks written as b_m...b_1)
(11111111 << 1) | 01010110 = 11111110 (A)
(11111110 << 1) | 10111101 = 11111101 (AB)
(11111101 << 1) | 11101111 = 11111111 ()
(11111111 << 1) | 01010110 = 11111110 (A)
(11111110 << 1) | 10111101 = 11111101 (AB)
(11111101 << 1) | 11111011 = 11111011 (ABR)
(11111011 << 1) | 01010110 = 11110110 (ABRA)
(11110110 << 1) | 11101111 = 11101111 (ABRAC)
(11101111 << 1) | 01010110 = 11011110 (ABRACA)
(11011110 << 1) | 10111101 = 10111101 (ABRACAB)
(10111101 << 1) | 01010110 = 01111110 (ABRACABA)

The highest bit d_m is zero -> Matched!
64
Suffix Automaton

Suffix automaton on a pattern P: an automaton that recognizes all suffixes of P
Backward DAWG matching (BDM) algorithm converts this automaton to deterministic
DAWG: directed acyclic word graph

[Figure: automaton over the characters a b r a c a b a, with initial state I]
65

To search a pattern P
The suffix automaton of P^r (the reversed pattern) is built
Search backwards inside the text window for a substring of P using the suffix automaton
Each time a terminal state is reached before hitting the beginning of the window, the position inside the window is remembered
Finding a prefix of the pattern -> suffix of the window
The last prefix recognized backwards is the longest prefix of P
The window is aligned with the longest prefix recognized

66
Example

P = abracadabra
P^r = arbadacarba
T = a b r a c a b r a c a d a b r a

67
Practical Comparison

The clear winners are BNDM and BMS (Sunday)
Classical BM and BDM are also very close
For English texts, Agrep is much faster
Because the code is carefully optimized
For longer patterns, BDM is better than BNDM
For extended patterns, BNDM is normally the
fastest, otherwise Shift-Or is the best option

69
Pattern Matching

Searching allowing errors (Approximate String
Matching)
Dynamic Programming
Automaton
Regular Expressions and Extended patterns
Pattern Matching Using Indices
Inverted files
Suffix Trees and Suffix Arrays

70
Approximate String Matching

Definition: Given a short pattern P of length m, a long text T of length n, and a maximum allowed number of errors k, find all the text positions where the pattern occurs with at most k errors
This corresponds to the Levenshtein distance (edit distance)
With minimum modifications it is adapted to searching whole words matching the pattern with k errors
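The definition can be made concrete with the classical dynamic-programming formulation: one column of edit distances is updated per text character, and a match may start at any text position. This is a sketch of that standard technique, not code from the slides; approx_search is a name chosen here and reported positions are the 0-based ends of matches.

```python
def approx_search(text, pattern, k):
    """End positions in text where pattern occurs with at most k errors
    (insertions, deletions, substitutions)."""
    m = len(pattern)
    col = list(range(m + 1))   # distances of pattern prefixes vs. empty text
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]
        col[0] = 0             # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost,   # match or substitution
                      col[i] + 1,         # extra character in the text
                      col[i - 1] + 1)     # missing pattern character
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends

print(approx_search("abxd", "abcd", 1))
print(approx_search("abracadabra", "abra", 0))
```

The cost is O(mn) time and O(m) space; with k = 0 it reduces to exact matching.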

71
Dynamic Programming
72
Automaton
73
Regular Expressions
74
Pattern Matching Using Indices

Inverted Files
The types of queries such as suffix or substring queries, searching allowing errors and regular expressions, are solved by a sequential search
The restriction: not able to efficiently find approximate matches or regular expressions that span many words

75
Pattern Matching Using Indices

Suffix Trees
Suffix trees are able to perform complex searches
Word, prefix, suffix, substring, and range
queries
Regular expressions
Unrestricted approximate string matching
Useful in specific areas
Find the longest substring
Find the most common substring of a fixed size

76
Pattern Matching Using Indices

Suffix Arrays
Some patterns can be searched directly in the
suffix array without simulating the suffix tree
Word, prefix, suffix, subword search and range
search

77
Compression

Compressed text--Huffman coding
Taking words as symbols
Use an alphabet of bytes instead of bits
Compressed indices
Inverted Files
Suffix Trees and Suffix Arrays
Signature Files
