1
Indexing and Searching
  • J. H. Wang
  • Feb. 20, 2008

2
The Retrieval Process
3
Outline
  • Conventional text retrieval systems (8.1-8.3,
    Salton)
  • File Structures for Indexing and Searching (Chap.
    8)
  • Inverted files
  • Suffix trees and suffix arrays
  • Signature files
  • Sequential searching
  • Pattern matching

4
Conventional Text Retrieval Systems
  • Database management, e.g. employee DB
  • Structured records
  • Precise meaning for attribute values
  • Exact match
  • Text retrieval, e.g. bibliographic systems
  • Structured attributes and unstructured content
  • Index terms
  • Imprecise representation of the text
  • Approximate or partial matching

5
Conceptual Information Retrieval
(Figure: Queries and Documents feed a similarity computation, which retrieves similar terms.)
6
Expanded Text Retrieval System
(Figure: Queries pass through negotiation and analysis (query formulation) into formal statements; Documents pass through text indexing (content analysis) into indexed documents; a similarity computation then retrieves similar terms. Example: the query "Taipei" retrieves the Taipei city government page, a Taipei travel guide, the Wiki page on Taipei, Taipei 101, and the Taipei Times.)
7
Representation
  • Documents
  • Indexed terms (or term vectors)
  • Unweighted or weighted
  • Queries
  • Unweighted or weighted terms
  • Boolean operators: OR, AND, NOT
  • E.g. Taiwan AND NOT Taipei
  • Efficiency

8
Data Structure
  • Requirement
  • Fast access to documents
  • Very large number of index terms
  • For each term a separate index is constructed
    that stores the document identifiers for all
    documents identified by that term
  • Inverted index (or inverted file)

9
Inverted Index
  • The complete file is represented as an array of
    indexed documents.

10
Inverted-file Process
  • The document-term array is inverted (actually
    transposed).

11
Inverted-file Process
  • The rows are manipulated according to query
    specification. (list-merging)
  • Ex: Query (term 2 AND term 3)
        term 2: 1 1 0 0
        term 3: 0 1 1 1
        ---------------
        result: 0 1 0 0
  • Ex: Query ((T1 OR T2) AND NOT T3)
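The list-merging above can be sketched with plain bit vectors; a minimal illustration in Python (the rows are hypothetical document-term data matching the first example):

```python
# List-merging on an inverted document-term array:
# rows[t][d] == 1 iff term t occurs in document d.
rows = {
    "T1": [1, 0, 1, 0],
    "T2": [1, 1, 0, 0],
    "T3": [0, 1, 1, 1],
}

def AND(a, b): return [x & y for x, y in zip(a, b)]
def OR(a, b):  return [x | y for x, y in zip(a, b)]
def NOT(a):    return [1 - x for x in a]

# Query (T2 AND T3) -> 0 1 0 0, as on the slide
print(AND(rows["T2"], rows["T3"]))                       # [0, 1, 0, 0]
# Query ((T1 OR T2) AND NOT T3)
print(AND(OR(rows["T1"], rows["T2"]), NOT(rows["T3"])))  # [1, 0, 0, 0]
```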

12
Extensions of Inverted Index
  • Distance Constraints
  • Term Weights
  • Synonym Specification
  • Term Truncation

13
Distance Constraints
  • Nearness parameters
  • Within sentence: terms co-occur in a common
    sentence
  • Adjacency: terms occur adjacently in the text

14
  • Implementation
  • To include term-location information in the
    inverted index
  • information: D345, D348, D350, …
    retrieval: D123, D128, D345, …
  • Cost: size of the indexes
  • To include sentence numbers for all term
    occurrences in the inverted index
  • information: (D345, 25), (D345, 37), (D348, 10), (D350, 8)
    retrieval: (D123, 5), (D128, 25), (D345, 37), (D345, 40)

15
  • To include paragraph numbers, sentence numbers
    within paragraphs, and word numbers within sentences
    in the inverted index
  • information: (D345, 2, 3, 5)
    retrieval: (D345, 2, 3, 6)
  • Ex: (information adjacent retrieval), (information
    within five words retrieval)
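A minimal sketch of such a distance-constrained query over positional postings; the index contents and the flat word numbering (instead of the slide's paragraph/sentence/word triples) are illustrative assumptions:

```python
# Positional postings: term -> {doc: [word positions]} (hypothetical data).
index = {
    "information": {"D345": [5]},
    "retrieval":   {"D345": [6, 40]},
}

def within(index, t1, t2, k):
    """Docs where t1 and t2 occur within k words of each other (adjacent: k = 1)."""
    hits = set()
    docs = index.get(t1, {}).keys() & index.get(t2, {}).keys()
    for d in docs:
        for p1 in index[t1][d]:
            for p2 in index[t2][d]:
                if abs(p1 - p2) <= k:
                    hits.add(d)
    return hits

print(within(index, "information", "retrieval", 1))  # {'D345'}
```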

16
Term Weights
  • Term-importance weights
  • Di = (Ti1, 0.2), (Ti2, 0.5), (Ti3, 0.6)
  • Issues
  • How to generate term weights? (more on this
    later)
  • How to apply term weights?
  • Vector queries the sum of the weights of all
    document terms that match the given query
  • Boolean queries (more complex)

17
Term Weights (for Boolean Queries)
  • Transforming each query into sum-of-products form
    (or disjunctive normal form)
  • The weight of each conjunct is the minimum term
    weight of any document term in the conjunct
  • The document weight is the maximum of all the
    conjunct weights

18
An Example
  • Example: Q = (T1 AND T2) OR T3

    Document vectors                    Conjunct weights        Query weight
                                        (T1 AND T2)   (T3)      (T1 AND T2) OR T3
    D1 = (T1, 0.2; T2, 0.5; T3, 0.6)       0.2        0.6            0.6
    D2 = (T1, 0.7; T2, 0.2; T3, 0.1)       0.2        0.1            0.2

  • D1 is preferred.
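The min/max rule can be sketched in a few lines; `dnf_weight` is a hypothetical helper name, and the documents are the example's term-weight vectors:

```python
# Term weights for Boolean queries: the query is in disjunctive normal form,
# a conjunct's weight is the minimum matching term weight, and the document
# weight is the maximum over all conjuncts.
def dnf_weight(doc, dnf):
    """doc: {term: weight}; dnf: list of conjuncts, each a list of terms."""
    best = 0.0
    for conjunct in dnf:
        if all(t in doc for t in conjunct):
            best = max(best, min(doc[t] for t in conjunct))
    return best

q = [["T1", "T2"], ["T3"]]               # Q = (T1 AND T2) OR T3
d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
d2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}
print(dnf_weight(d1, q), dnf_weight(d2, q))  # 0.6 0.2 -> D1 preferred
```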

19
  • Synonym Specification
  • (T1 AND T2) OR T3
  • ((T1 OR S1) AND T2) OR (T3 OR S3)
  • Term Truncation (or stemming)
  • Removing suffixes and/or prefixes
  • Ex: PSYCH → psychiatrist, psychiatry,
    psychiatric, psychology, psychological, …

20
File Structures for Indexing and Searching
21
Introduction
  • How to retrieve information?
  • A simple alternative is to search the whole text
    sequentially (online search)
  • Another option is to build data structures over
    the text (called indices) to speed up the search

22
Introduction
  • Indexing techniques
  • Inverted files
  • Suffix arrays
  • Signature files

23
Notation
  • n: the size of the text
  • m: the length of the pattern (m << n)
  • v: the size of the vocabulary
  • M: the amount of main memory available

24
Inverted Files
  • Definition: an inverted file is a word-oriented
    mechanism for indexing a text collection in order
    to speed up the searching task.
  • Structure of an inverted file
  • Vocabulary: the set of all distinct words in
    the text
  • Occurrences: lists containing all information
    necessary for each word of the vocabulary (text
    positions, frequency, documents where the word
    appears, etc.)

25
Example
  • Text
  • Inverted file

Text (word positions from the slide):
  1 That  6 house  12 has  16 a  18 garden.  25 The  29 garden  36 has
  40 many  45 flowers.  54 The  58 flowers  66 are  70 beautiful

Inverted file:
  Vocabulary   Occurrences
  beautiful    70
  flowers      45, 58
  garden       18, 29
  house        6
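A minimal sketch of building such an inverted file. Note that the exact character positions it produces differ by one or two from the slide's figure, which counts positions slightly differently:

```python
import re
from collections import defaultdict

def build_inverted_file(text, stopwords=()):
    """Map each non-stopword to the 1-based character positions where it starts."""
    occ = defaultdict(list)
    for m in re.finditer(r"\w+", text):
        w = m.group().lower()
        if w not in stopwords:
            occ[w].append(m.start() + 1)
    return dict(sorted(occ.items()))

text = "That house has a garden. The garden has many flowers. The flowers are beautiful"
idx = build_inverted_file(text, stopwords={"that", "has", "a", "the", "many", "are"})
print(idx)  # {'beautiful': [71], 'flowers': [46, 59], 'garden': [18, 30], 'house': [6]}
```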
26
Space Requirements
  • The space required for the vocabulary is rather
    small. According to Heaps' law the vocabulary
    grows as O(n^β), where β is a constant between 0.4
    and 0.6 in practice (sublinear)
  • On the other hand, the occurrences demand much
    more space. Since each word appearing in the text
    is referenced once in that structure, the extra
    space is O(n)
  • To reduce space requirements, a technique called
    block addressing is used

27
Block Addressing
  • The text is divided into blocks
  • The occurrences point to the blocks where the
    word appears
  • Advantages
  • the number of pointers is smaller than the number
    of text positions
  • all the occurrences of a word inside a single
    block are collapsed to one reference
  • Disadvantages
  • an online search over the qualifying blocks is
    needed if exact positions are required

28
Example
  • Text
  • Inverted file

Text divided into four blocks:
  Block 1: That house has a        Block 2: garden. The garden has
  Block 3: many flowers. The flowers   Block 4: are beautiful

Inverted file:
  Vocabulary   Occurrences (block numbers)
  beautiful    4
  flowers      3
  garden       2
  house        1
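Block addressing can be sketched as follows, assuming fixed-size blocks of four words (the block boundaries are an illustrative assumption):

```python
# Block addressing: record only the block numbers where each word occurs,
# so repeated occurrences within a block collapse to one reference.
from collections import defaultdict

def block_index(words, block_size):
    occ = defaultdict(set)
    for i, w in enumerate(words):
        occ[w.lower().strip(".")].add(i // block_size + 1)
    return {w: sorted(b) for w, b in occ.items()}

words = ("That house has a garden. The garden has many flowers. "
         "The flowers are beautiful").split()
idx = block_index(words, 4)
print(idx["house"], idx["garden"], idx["flowers"], idx["beautiful"])
# [1] [2] [3] [4]  -- both occurrences of "flowers" collapse into block 3
```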
29
Inverted Files for Different Addressing
Granularity
All words indexed
Stopwords not indexed
30
Searching
  • The search algorithm on an inverted index follows
    three steps
  • Vocabulary search: the words present in the query
    are searched in the vocabulary
  • Retrieval of occurrences: the lists of the
    occurrences of all words found are retrieved
  • Manipulation of occurrences: the occurrences are
    processed to solve the query

31
Searching
  • The searching task on an inverted file always starts
    in the vocabulary (it is better to store the
    vocabulary in a separate file)
  • The structures most used to store the vocabulary
    are hashing, tries, or B-trees
  • Hashing, tries: O(m)
  • An alternative is simply storing the words in
    lexicographical order (cheaper in space and very
    competitive, with O(log v) cost)

32
Construction
  • All the vocabulary is kept in a suitable data
    structure storing for each word a list of its
    occurrences
  • Each word of the text is read and searched in the
    vocabulary
  • If it is not found, it is added to the vocabulary
    with an empty list of occurrences; the new
    position is then added to the end of its list of
    occurrences

33
Example
  • Text
  • Vocabulary trie

Text: That house has a garden. The garden has many flowers. The flowers are beautiful
(word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70)

Vocabulary trie:
  b → beautiful: 70
  f → flowers:   45, 58
  g → garden:    18, 29
  h → house:     6
34
Construction
  • Once the text is exhausted, the vocabulary is
    written to disk with the lists of occurrences. Two
    files are created:
  • in the first file, the lists of occurrences are
    stored contiguously (posting file)
  • in the second file, the vocabulary is stored in
    lexicographical order and, for each word, a
    pointer to its list in the first file is also
    included. This allows the vocabulary to be kept
    in memory at search time
  • The overall process is O(n) worst-case time
  • Not practical for large texts

35
Construction
  • An option is to use the previous algorithm until
    the main memory is exhausted. When no more memory
    is available, the partial index Ii obtained up to
    now is written to disk and the main memory is
    erased before continuing with the rest of the text
  • Once the text is exhausted, a number of partial
    indices Ii exist on disk
  • The partial indices are merged to obtain the
    final index

36
Example
(Figure: binary merging of eight initial dumps.
  Initial dumps: I1 I2 I3 I4 I5 I6 I7 I8
  Level 1: I1..2 (merge 1), I3..4 (merge 2), I5..6 (merge 4), I7..8 (merge 5)
  Level 2: I1..4 (merge 3), I5..8 (merge 6)
  Level 3: final index I1..8 (merge 7))
37
Construction
  • The total time to generate partial indices is
    O(n)
  • The number of partial indices is O(n/M)
  • To merge the O(n/M) partial indices, log2(n/M)
    merging levels are necessary
  • The total cost of this algorithm is O(n log(n/M))
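A single merge step of two partial indices can be sketched as word-wise concatenation of sorted occurrence lists; the partial-index contents are illustrative:

```python
# Merge two partial indices, each mapping word -> sorted occurrence list.
import heapq
from collections import defaultdict

def merge_indices(i1, i2):
    out = defaultdict(list)
    for idx in (i1, i2):
        for word, occ in idx.items():
            out[word] = list(heapq.merge(out[word], occ))  # keep lists sorted
    return dict(out)

I1 = {"garden": [18], "house": [6]}
I2 = {"flowers": [46], "garden": [30]}
print(merge_indices(I1, I2))
# {'garden': [18, 30], 'house': [6], 'flowers': [46]}
```

Applying this pairwise over the O(n/M) dumps gives the log2(n/M) merging levels described above.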

38
Summary on Inverted File
  • The inverted file is probably the most adequate
    indexing technique for text databases
  • The indices are appropriate when the text
    collection is large and semi-static
  • Otherwise, if the text collection is volatile,
    online searching is the only option
  • Some techniques combine online and indexed
    searching

39
Suffix Trees and Suffix Arrays
  • Each position in the text is considered as a text
    suffix
  • Index points are selected from the text; they
    point to the beginnings of the text positions
    which will be retrievable
  • The problem with suffix trees is their space
    overhead

40
Example
  • Text
  • Suffixes
  • house has a garden. The garden has many flowers.
    The flowers are beautiful
  • garden. The garden has many flowers. The flowers
    are beautiful
  • garden has many flowers. The flowers are
    beautiful
  • flowers. The flowers are beautiful
  • flowers are beautiful
  • beautiful

Text: That house has a garden. The garden has many flowers. The flowers are beautiful
(word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70)
41
Example
  • Text: That house has a garden. The garden has many flowers. The flowers are beautiful
  • Suffix trie over the index points:
    b …             → 70 (beautiful)
    f l o w e r s   → branches: ' ' → 58 (flowers are …), '.' → 45 (flowers. …)
    g a r d e n     → branches: ' ' → 29 (garden has …), '.' → 18 (garden. …)
    h …             → 6 (house …)
42
Example
  • Text: That house has a garden. The garden has many flowers. The flowers are beautiful
  • Suffix tree (the trie with unary paths compressed):
    the 'f' branch splits at character 8 into 58 (flowers are …) and 45 (flowers. …);
    the 'g' branch splits at character 7 into 29 (garden has …) and 18 (garden. …);
    'b' leads to 70 and 'h' leads to 6
43
Suffix Arrays
  • An array containing all the pointers to the text
    suffixes, listed in lexicographical order
  • The space requirements are almost the same as
    those for inverted indices
  • The main drawback of suffix arrays is their costly
    construction process
  • They allow binary searches, done by comparing the
    contents of each pointer
  • Supra-indices (for large suffix arrays)
  • The space requirements of a suffix array with a
    vocabulary supra-index are exactly the same as
    for inverted indices
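A minimal word-level suffix array sketch over the running example; the index points are the word starts used throughout (1-based positions computed directly from the string, so they differ slightly from the slide's numbers), and binary search is done on the suffix strings:

```python
# Suffix array: pointers to text suffixes sorted lexicographically,
# searched with binary search over the suffix prefixes.
import bisect

def suffix_array(text, points):
    """points: 1-based word-start positions chosen as index points."""
    return sorted(points, key=lambda p: text[p - 1:])

def find(text, sa, pattern):
    """All index points whose suffix starts with `pattern`."""
    suffixes = [text[p - 1:] for p in sa]
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_left(suffixes, pattern + "\uffff")
    return sa[lo:hi]

text = "That house has a garden. The garden has many flowers. The flowers are beautiful"
sa = suffix_array(text, [6, 18, 30, 46, 59, 71])
print(find(text, sa, "garden"))  # [30, 18]
```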

44
Example
  • Text
  • Suffix Array
  • Supra Index (l4, b2)

Text: That house has a garden. The garden has many flowers. The flowers are beautiful
(word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70)
45
Example
  • Text
  • Vocabulary Supra-Index
  • Suffix Array
  • Inverted List

Text: That house has a garden. The garden has many flowers. The flowers are beautiful
(word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70)
46
Construction of Suffix Arrays for Large Texts
(Figure: (1) build the suffix array of a small text block in main memory; (2) merge it with the long suffix array built so far, using counters that record how many small-text suffixes fall between consecutive entries of the long array; (3) the result is the final suffix array.)
47
Signature Files
  • Characteristics
  • Word-oriented index structures based on hashing
  • Low overhead (10–20% over the text size), at the
    cost of forcing a sequential search over the
    index
  • Suitable for not very large texts
  • Inverted files outperform signature files for
    most applications

48
Construction and Search
  • Word-oriented index structures based on hashing
  • Maps words to bit masks of B bits
  • Divides the text into blocks of b words each
  • The block mask Bi is obtained by bitwise ORing the
    signatures of all the words in the text block
  • Search
  • Hash the query to a bit mask W
  • If W AND Bi = W, the text block may contain the
    word
  • For all candidate blocks, an online traversal
    must be performed to verify whether the word is
    actually there
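A minimal signature-file sketch; the salted-hash scheme and the values of B and l are illustrative assumptions, and false drops are removed by the online verification scan:

```python
# Signature file: each word hashes to a B-bit mask with (up to) l bits set;
# a block's signature ORs the masks of its words.
import hashlib

B, l = 16, 2  # mask width and bits per word (illustrative values)

def word_mask(word):
    m = 0
    for i in range(l):  # l salted hashes stand in for l hash functions
        h = int(hashlib.sha1(f"{i}:{word}".encode()).hexdigest(), 16)
        m |= 1 << (h % B)
    return m

def search(blocks, word):
    w = word_mask(word)
    sigs = [0] * len(blocks)
    for i, block in enumerate(blocks):
        for token in block.split():
            sigs[i] |= word_mask(token)
    # candidate blocks: every bit of w is also set in the block signature
    cand = [i for i, s in enumerate(sigs) if s & w == w]
    # online scan of candidates removes false drops
    return [i for i in cand if word in blocks[i].split()]

blocks = ["this is a text", "a text has many words", "words are made from letters"]
print(search(blocks, "text"))  # [0, 1]
```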

49
Example
  • Four blocks
  • This is a text. A text has many words. Words are
    made from letters.
  •   Block 1: This is a text.     → 000101
      Block 2: A text has many     → 110101
      Block 3: words. Words are    → 100100
      Block 4: made from letters.  → 101101
  • Hash(text)    = 000101
  • Hash(many)    = 110000
  • Hash(words)   = 100100
  • Hash(made)    = 001100
  • Hash(letters) = 100001

50
False Drop
  • Assume that l bits are randomly set in the mask
  • Let α = l/B
  • For b words, the probability that a given bit of
    the mask is set is 1 - (1 - 1/B)^(bl) ≈ 1 - e^(-bα)
  • Hence, the probability that the l random bits are
    also set is Fd = (1 - e^(-bα))^(αB)  (false alarm)
  • Fd is minimized for α = ln(2)/b
  • Then Fd = 2^(-l), i.e. l = B ln(2) / b
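A quick numeric check of the analysis: choosing α = ln(2)/b makes the false-drop probability collapse to 2^(-l). The values of B and b are illustrative:

```python
# Verify that Fd = (1 - e^(-b*alpha))^(alpha*B) equals 2^(-l)
# when alpha = ln(2)/b and l = alpha*B.
import math

B, b = 64, 10               # illustrative mask width and block size
alpha = math.log(2) / b     # the minimizing choice
l = alpha * B               # bits set per mask: l = B ln 2 / b
Fd = (1 - math.exp(-b * alpha)) ** (alpha * B)
print(round(Fd, 4), round(2 ** -l, 4))  # both ≈ 0.0462
```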

51
Comparisons
  • Signature files
  • Use hashing techniques to produce an index
  • Advantage
  • storage overhead is small (10–20%)
  • Disadvantages
  • the search time on the index is linear
  • some answers may not match the query, thus
    filtering must be done

52
Comparisons (Continued)
  • Inverted files
  • storage overhead (30–100%)
  • search time for word searches is logarithmic
  • Suffix arrays
  • potential use in other kinds of searches
  • phrases
  • regular expression searching
  • approximate string searching
  • longest repetitions
  • most frequent searching

53
Sequential Searching
  • Brute Force (BF)
  • Knuth-Morris-Pratt (KMP)
  • Boyer-Moore Family (BM)
  • Shift-Or
  • Suffix Automaton

54
Exact String Matching
  • Definition: Given a short pattern P of length m
    and a long text T of length n, find all the text
    positions where the pattern occurs
  • The simplest algorithm: Brute-Force (BF)
  • Trying all possible pattern positions in the text
  • Worst-case cost O(mn), average-case cost O(n)
  • O(n) text positions
  • O(m) worst-case cost for each position
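The brute-force algorithm fits in a few lines:

```python
# Brute force: try every text position; O(mn) worst case, O(n) on average.
def bf_search(text, pat):
    n, m = len(text), len(pat)
    return [i for i in range(n - m + 1) if text[i:i + m] == pat]

print(bf_search("abcabracaba", "abra"))  # [3]
```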

55
Knuth-Morris-Pratt
  • The KMP method scans the characters left-to-right
  • When a mismatch occurs, an optimum shift is
    carried out for pattern P
  • No new match can be obtained except when some
    head of the already matching part of P is
    identical to a tail of the matching part of T
  • How to detect coincidences between heads of P and
    tails of T?
  • Any matching tail of T is also a matching tail of
    P
  • Detecting repeating portions in P

56
Knuth-Morris-Pratt
  • Next table: at position j, the length of the longest
    proper prefix of P[1..j-1] which is also a suffix,
    such that the characters following the prefix and
    the suffix are different
  • j - next[j] - 1 positions can be safely skipped
  • P:    a b r a c a d a b r a
    Next: 0 0 0 0 1 0 1 0 0 0 0 4
  • T: a b r a c a b r a c a d a b r a
       a b r a c a d              (mismatch at the 7th position)
                 a b r a c a d a b r a   (after the KMP shift)
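A minimal KMP sketch; note it uses the plain failure table (longest border per prefix) rather than the slide's optimized next table, which additionally requires the characters after prefix and suffix to differ:

```python
# KMP: precompute the failure table, then scan the text left to right,
# never moving the text pointer backwards.
def kmp_search(text, pat):
    m = len(pat)
    fail = [0] * (m + 1)          # fail[i]: longest border of pat[:i]
    k = 0
    for i in range(1, m):
        while k and pat[i] != pat[k]:
            k = fail[k]
        if pat[i] == pat[k]:
            k += 1
        fail[i + 1] = k
    hits, k = [], 0
    for i, c in enumerate(text):
        while k and c != pat[k]:
            k = fail[k]           # shift the pattern, keep the text pointer
        if c == pat[k]:
            k += 1
        if k == m:
            hits.append(i - m + 1)
            k = fail[k]
    return hits

print(kmp_search("abracabracadabra", "abracad"))  # [5]
```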

57
  • At each text comparison, the window or the
    pointer advances by at least one position, so the
    algorithm performs at most 2n comparisons (and at
    least n)
  • The Aho-Corasick algorithm is an extension of KMP
    for matching a set of patterns
  • Patterns are arranged in a trie-like data
    structure
  • Ex: hello, elbow, eleven

58
Boyer-Moore Family
  • The BM method scans characters from right to left
  • The heuristic which gives the longest shift is
    selected
  • Matching shift (or good-suffix shift, δ2 shift)
  • When some tail of P already matches some
    substring of T
  • Occurrence shift (or bad-character shift, δ1
    shift)
  • When a mismatched character is known not to occur
    in the pattern
  • Extended δ1 shift: places in coincidence any
    matching positions between heads and tails of P

59
Examples
  • T: a b r a c a b r a c a d a b r a
  • P: a b r a c a d a b r a
  • a b r a c a d a b r a  (shifted by δ2 = 3)
  • a b r a c a d a b r a  (shifted by δ1 = 5)
  • T: b a b c b a d c a b c a a b c a
  • P: a b c a b c a c a b
  • a b c a b c a c a b  (shifted by δ2 = 5)
  • a b c a b c a c a b  (shifted by δ1 = 7)
  • a b c a b c a c a b  (shifted by extended δ1 = 8)

60
  • Some variations
  • Simplified BM algorithm
  • BM-Horspool (BMH) algorithm
  • BM-Sunday (BMS) algorithm
  • Commentz-Walter algorithm: an extension of BM to
    multi-pattern search
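The BM-Horspool variant is the simplest of the family to sketch: only the bad-character (occurrence) shift is kept, keyed on the text character under the last pattern position:

```python
# Boyer-Moore-Horspool: right-to-left window comparison; on any mismatch
# (or after a match) shift by the bad-character distance of the text
# character aligned with the last pattern cell.
def bmh_search(text, pat):
    m, n = len(pat), len(text)
    shift = {c: m - i - 1 for i, c in enumerate(pat[:-1])}
    hits, i = [], 0
    while i <= n - m:
        if text[i:i + m] == pat:
            hits.append(i)
        i += shift.get(text[i + m - 1], m)
    return hits

print(bmh_search("abracabracadabra", "abracad"))  # [5]
```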

61
Shift-Or
  • Based on bit-parallelism to simulate the
    operation of a non-deterministic automaton
  • It first builds a table B which stores a bit mask
    bm…b1 for each character
  • B[c] has the i-th bit set to zero iff pi = c
  • The state of the search is kept in D = dm…d1
    (initially set to all 1s)
  • where di is zero whenever the state numbered i is
    active
  • A match is reported whenever dm is zero
  • For each new character Tj, D = (D << 1) | B[Tj]

62
Example
  • P = a b r a c a b a  (m = 8; masks written b8…b1)
  • B[a] = 01010110
  • B[b] = 10111101
  • B[c] = 11101111
  • B[r] = 11111011
  • B[*] = 11111111  (any other character)
63
Example
  • Ex: Input T = abcabracaba
  • (11111111 << 1) | 01010110 = 11111110  (a)
  • (11111110 << 1) | 10111101 = 11111101  (ab)
  • (11111101 << 1) | 11101111 = 11111111  (no active prefix)
  • (11111111 << 1) | 01010110 = 11111110  (a)
  • (11111110 << 1) | 10111101 = 11111101  (ab)
  • (11111101 << 1) | 11111011 = 11111011  (abr)
  • (11111011 << 1) | 01010110 = 11110110  (abra, a)
  • (11110110 << 1) | 11101111 = 11101111  (abrac)
  • (11101111 << 1) | 01010110 = 11011110  (abraca, a)
  • (11011110 << 1) | 10111101 = 10111101  (abracab, ab)
  • (10111101 << 1) | 01010110 = 01111110  (abracaba, a)

→ d8 = 0: matched!
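The whole Shift-Or algorithm fits in a few lines; a minimal sketch:

```python
# Shift-Or: B[c] has bit i clear iff pat[i] == c; the state D keeps one
# bit per pattern position (0 = that prefix is currently matching).
def shift_or(text, pat):
    m = len(pat)
    all_ones = (1 << m) - 1
    B = {}
    for i, c in enumerate(pat):
        B[c] = B.get(c, all_ones) & ~(1 << i)
    D, hits = all_ones, []
    for j, c in enumerate(text):
        D = ((D << 1) | B.get(c, all_ones)) & all_ones
        if D & (1 << (m - 1)) == 0:   # bit d_m clear: full match ends at j
            hits.append(j - m + 1)
    return hits

print(shift_or("abcabracaba", "abracaba"))  # [3]
```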
64
Suffix Automaton
  • Suffix automaton on a pattern P: an automaton
    that recognizes all suffixes of P
  • The Backward DAWG matching (BDM) algorithm converts
    this automaton to a deterministic one
  • DAWG: directed acyclic word graph

(Figure: the nondeterministic suffix automaton for P = abracaba: an initial state I with ε-transitions into every state of the chain a → b → r → a → c → a → b → a.)
65
  • To search a pattern P
  • The suffix automaton of Pr (the reversed pattern)
    is built
  • Search backwards inside the text window for a
    substring of P using the suffix automaton
  • Each time a terminal state is reached before
    hitting the beginning of the window, the position
    inside the window is remembered
  • Finding a prefix of the pattern → suffix of the
    window
  • The last prefix recognized backwards is the
    longest prefix of P in the window
  • The window is aligned with the longest prefix
    recognized

66
Example
  • P  = abracadabra
  • Pr = arbadacarba
  • T: a b r a c a b r a c a d a b r a
  • (the x marks on the slide indicate the mismatch
    positions found during the backward scans of each
    window)

67
Practical Comparison
  • The clear winners are BNDM and BMS (Sunday)
  • Classical BM and BDM are also very close
  • For English texts, Agrep is much faster
  • because its code is carefully optimized
  • For longer patterns, BDM is better than BNDM
  • For extended patterns, BNDM is normally the
    fastest; otherwise, Shift-Or is the best option

68
(No Transcript)
69
Pattern Matching
  • Searching allowing errors (Approximate String
    Matching)
  • Dynamic Programming
  • Automaton
  • Regular Expressions and Extended patterns
  • Pattern Matching Using Indices
  • Inverted files
  • Suffix Trees and Suffix Arrays

70
Approximate String Matching
  • Definition: Given a short pattern P of length m,
    a long text T of length n, and a maximum allowed
    number of errors k, find all the text positions
    where the pattern occurs with at most k errors
  • This corresponds to the Levenshtein distance
    (edit distance)
  • With minimal modifications it can be adapted to
    searching whole words matching the pattern with k
    errors

71
Dynamic Programming
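The dynamic-programming figure is not in the transcript; a minimal sketch of the standard column-wise computation, where row 0 is all zeros so a match may start at any text position (the example pattern and text are illustrative):

```python
# Approximate matching by dynamic programming: C[i][j] is the minimum
# number of errors to match pat[1..i] against a text substring ending
# at j; report every j with C[m][j] <= k.
def approx_search(text, pat, k):
    m = len(pat)
    col = list(range(m + 1))       # C[i][0] = i (delete the whole prefix)
    ends = []
    for j, c in enumerate(text, 1):
        new = [0]                  # C[0][j] = 0: a match can start anywhere
        for i in range(1, m + 1):
            new.append(min(col[i - 1] + (pat[i - 1] != c),  # substitute/match
                           col[i] + 1,                      # insert into pat
                           new[i - 1] + 1))                 # delete from pat
        col = new
        if col[m] <= k:
            ends.append(j)         # 1-based end positions of matches
    return ends

print(approx_search("abracadabra", "brac", 0))  # [5]
print(approx_search("abracadabra", "brac", 1))
```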
72
Automaton
73
Regular Expressions
74
Pattern Matching Using Indices
  • Inverted Files
  • Query types such as suffix or substring
    queries, searching allowing errors, and regular
    expressions are solved by a sequential search
  • The restriction: not able to efficiently find
    approximate matches or regular expressions that
    span many words

75
Pattern Matching Using Indices
  • Suffix Trees
  • Suffix trees are able to perform complex searches
  • Word, prefix, suffix, substring, and range
    queries
  • Regular expressions
  • Unrestricted approximate string matching
  • Useful in specific areas
  • Find the longest substring
  • Find the most common substring of a fixed size

76
Pattern Matching Using Indices
  • Suffix Arrays
  • Some patterns can be searched directly in the
    suffix array without simulating the suffix tree
  • Word, prefix, suffix, subword search and range
    search

77
Compression
  • Compressed text: Huffman coding
  • Taking words as symbols
  • Using an alphabet of bytes instead of bits
  • Compressed indices
  • Inverted Files
  • Suffix Trees and Suffix Arrays
  • Signature Files