Special Topics in Computer Science: The Art of Information Retrieval. Chapter 8: Indexing and Searching


1
Special Topics in Computer Science
The Art of Information Retrieval
Chapter 8: Indexing and Searching
  • Alexander Gelbukh
  • www.Gelbukh.com

2
Previous Chapter: Conclusions
  • Text transformation: meaning instead of strings
  • Lexical analysis
  • Stopwords
  • Stemming
  • POS, WSD, syntax, semantics
  • Ontologies to collate similar stems
  • Text compression
  • Searchable (compress the query, then search)
  • Random access
  • Word-based statistical methods (Huffman)
  • Index compression

3
Previous Chapter: Research topics
  • All computational linguistics
  • Improved POS tagging
  • Improved WSD
  • Uses of thesaurus
  • for user navigation
  • for collating similar terms
  • Better compression methods
  • Searchable compression
  • Random access

5
Types of searching
  • Sequential
  • Small texts
  • Volatile, or space limited
  • Indexed
  • Semi-static
  • Space overhead
  • First, we discuss indexed searching, then
    sequential

6
Inverted files
  • Vocabulary: O(sqrt(n)) by Heaps' law. 1 GB of text → about 5 MB
  • Occurrences: O(n), about 40% of the text size (stopwords)
  • Positions (word, char), files, sections...
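
As a rough illustration of the structure on this slide, here is a minimal sketch of an inverted file held in memory. It assumes words are runs of `\w` characters, lowercased, and that occurrences are stored as 0-based word positions; the function name `build_inverted_index` is hypothetical and not from the lecture.

```python
import re
from collections import defaultdict


def build_inverted_index(text):
    """Map each lowercased word to the list of its 0-based word positions."""
    index = defaultdict(list)
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        # Positions are appended in text order, so every list stays sorted.
        index[match.group()].append(position)
    return dict(index)


index = build_inverted_index("to be or not to be")
print(index["to"])  # word positions where "to" occurs
```

The keys form the vocabulary and the lists are the occurrences. Storing character offsets or file identifiers instead of word positions would only change what is appended.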

7
Compression: Block addressing
  • Block addressing: 5% overhead
  • 256, 64K, ... blocks (1, 2, ... bytes per pointer)
  • Equal size (faster search) or logical sections (retrieval units)
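
A small sketch of block addressing, under the assumption of equal-size blocks measured in words. The index stores only block numbers, so finding exact positions requires a sequential scan inside each candidate block. Both function names are hypothetical.

```python
def build_block_index(words, block_size):
    """Map each word to the sorted list of block numbers that contain it."""
    index = {}
    for i, word in enumerate(words):
        block = i // block_size
        blocks = index.setdefault(word, [])
        # Record each block at most once per word.
        if not blocks or blocks[-1] != block:
            blocks.append(block)
    return index


def find_positions(words, index, block_size, word):
    """Scan only the candidate blocks sequentially to recover exact positions."""
    hits = []
    for block in index.get(word, []):
        start = block * block_size
        end = min(start + block_size, len(words))
        for i in range(start, end):
            if words[i] == word:
                hits.append(i)
    return hits


words = "a b a c a b d a".split()
index = build_block_index(words, block_size=4)
print(index["d"], find_positions(words, index, 4, "a"))
```

Larger blocks shrink the index but make the sequential scan longer, which is the trade-off the slide refers to.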

8
Searching in inverted files
  • Vocabulary search
  • Separate file
  • Many searching techniques
  • Lexicographic order: O(log V) (V = vocabulary size) ≈ ½ log n (Heaps)
  • Hashing is not good for prefix search
  • Retrieval of occurrences
  • Manipulation of occurrences: O(sqrt(n)) (Heaps, Zipf)
  • Boolean operations. Context search
  • Merging
  • One list is shorter (Zipf law)
  • Only inverted files allow both space and time to be sublinear
  • Suffix trees and signature files don't
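
To show why a lexicographically sorted vocabulary supports prefix search where hashing does not, here is a minimal sketch using binary search from Python's standard `bisect` module. The function name `prefix_range` and the sample vocabulary are illustrative assumptions.

```python
from bisect import bisect_left


def prefix_range(vocabulary, prefix):
    """Return all words of a sorted vocabulary that start with the prefix."""
    lo = bisect_left(vocabulary, prefix)
    # Assumption: no word contains U+10FFFF, so this string sorts after
    # every word that starts with the prefix.
    hi = bisect_left(vocabulary, prefix + "\U0010ffff")
    return vocabulary[lo:hi]


vocabulary = sorted(["index", "indexing", "inverted", "search", "searching", "suffix"])
print(prefix_range(vocabulary, "ind"))
```

Each lookup costs two binary searches, that is O(log V) comparisons, matching the lexicographic bound on the slide.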

9
Building inverted file 1
  • Infinite memory?
  • Use trie to store vocabulary
  • append positions
  • O(n)

10
Building inverted file 2
  • Finite memory?
  • Fill the memory
  • Write partial index: n/M pieces
  • Merge partial indices (hierarchically): O(n log(n/M))
  • Insertion: index the new text, then merge. O(n + n' log(n'/M))
  • Deletion: eliminate every occurrence. O(n)
  • Very fast creating/maintenance
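
The finite-memory procedure can be sketched as follows. This is a simplification: partial indices are kept as in-memory sorted runs rather than written to disk, and the merge is a single multiway merge via `heapq.merge` instead of the hierarchical pairwise merge named on the slide. The function names and the `memory_limit` parameter (M, counted in words) are assumptions.

```python
import heapq
from itertools import groupby


def build_partial_indices(words, memory_limit):
    """Split the text into chunks of memory_limit words; sort each chunk
    into a run of (word, position) pairs, one run per partial index."""
    runs = []
    for start in range(0, len(words), memory_limit):
        chunk = words[start:start + memory_limit]
        runs.append(sorted((word, start + i) for i, word in enumerate(chunk)))
    return runs


def merge_partial_indices(runs):
    """Merge the sorted runs into one index: word -> sorted positions."""
    merged = heapq.merge(*runs)
    return {
        word: [position for _, position in group]
        for word, group in groupby(merged, key=lambda pair: pair[0])
    }


words = "to be or not to be".split()
runs = build_partial_indices(words, memory_limit=2)
print(len(runs), merge_partial_indices(runs))
```

With n words and room for M at a time, this yields about n/M runs, as on the slide.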

11
Suffix trees
  • Text as one long string. No words.
  • Genetic databases
  • Complex queries
  • Compacted trie structure
  • Problem space
  • For text retrieval, inverted files are better

14
Suffix array
  • All suffixes (by position) in lexicographic order
  • Allows binary search
  • Much less space: about 40% of n
  • Supra-index: sampling, for better disk access

15
Searching. Construction
  • Searching
  • Patterns, prefixes, phrases. Not only words
  • Suffix tree: O(m), but space (m = query size)
  • Suffix array: O(log n) (n = database size)
  • Construction of arrays: sorting
  • Large text: O(n² log(M)/M), more than for inverted files
  • Skip details
  • Addition: O(n n' log(M)/M)
  • Deletion: O(n)
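
A minimal sketch of a suffix array with binary search, for illustration only. Construction here is a plain in-memory sort of all suffixes, which is fine for short strings but is not the large-text algorithm the slide alludes to. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the suffix start positions sorted by the suffixes they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Binary-search the range of suffixes that start with the pattern."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "abracadabra"
suffix_array = build_suffix_array(text)
print(find_occurrences(text, suffix_array, "abra"))  # [0, 7]
```

Because the pattern is matched against suffixes rather than words, the same search handles prefixes and phrases, as the slide notes.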

16
Signature files
  • Usually worse than inverted files
  • Words are mapped to bit patterns
  • Blocks are mapped to ORs of their word patterns
  • If a block contains a word, all its bits are set
  • Sequential search for blocks
  • False drops!
  • Design of the hash function
  • Have to traverse the block
  • Good to search ANDs or proximity queries
  • bit patterns are ORed
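
The following sketch illustrates signatures and false drops. The signature width, the number of bits per word, and the use of SHA-256 as the hash are all assumptions chosen for the example; the lecture does not specify a hash function.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_WORD = 3    # assumed number of bits set per word


def word_signature(word):
    """Map a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        signature |= 1 << (digest[i] % SIGNATURE_BITS)
    return signature


def block_signature(words):
    """OR together the signatures of all words in the block."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature


def candidate_blocks(signatures, word):
    """Blocks whose signature has every bit of the word set (may be false drops)."""
    mask = word_signature(word)
    return [i for i, sig in enumerate(signatures) if sig & mask == mask]


def search(blocks, signatures, word):
    """Traverse each candidate block to discard false drops."""
    return [i for i in candidate_blocks(signatures, word) if word in blocks[i]]


blocks = [["inverted", "files"], ["suffix", "trees"], ["signature", "files"]]
signatures = [block_signature(block) for block in blocks]
print(search(blocks, signatures, "files"))  # [0, 2]
```

A block that truly contains the word always passes the signature test; the final traversal is needed only to reject blocks that pass by coincidence.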

18
Boolean operations
  • Merging file (occurrence) lists
  • AND: to find repetitions
  • According to query syntax tree
  • Complexity: linear in intermediate results
  • Can be slow if they are huge
  • There are optimization techniques
  • E.g., merge a small list with a big one by searching
  • This is a usual case (Zipf)
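
One way to read "merge a small list with a big one by searching" is to binary-search each element of the short list in the long one. The sketch below assumes both lists are sorted and free of duplicates; the function name is hypothetical.

```python
from bisect import bisect_left


def intersect_by_search(short_list, long_list):
    """AND of two sorted occurrence lists: look up each element of the
    short list in the long list by binary search."""
    result = []
    lo = 0
    for item in short_list:
        # Later items cannot occur before the previous search position.
        lo = bisect_left(long_list, item, lo)
        if lo == len(long_list):
            break
        if long_list[lo] == item:
            result.append(item)
    return result


print(intersect_by_search([3, 50, 97], list(range(0, 100, 2))))  # [50]
```

The cost is roughly the short length times the logarithm of the long length, which beats a linear merge when one list is much shorter.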

19
Sequential search
  • Necessary part of many algorithms (e.g., block addressing)
  • Brute force: O(nm) worst case, O(n) on average
  • Knuth-Morris-Pratt: linear worst case, but the same average
  • Boyer-Moore: O(n log(m)/m). Not all chars are examined!
  • If some part of the pattern was already compared, no need to compare inside it: you analyze the pattern once
  • Shift-Or: uses logical operations on all 32 bits in parallel
  • BDM: automaton. Complexity same as Boyer-Moore
  • Combination of BDM with bit parallelism
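
As an illustration of bit parallelism, here is a sketch of Shift-Or. It relies on Python's arbitrary-precision integers, so the pattern is not limited to 32 characters as it would be with a machine word. The function name is an assumption.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all exact occurrences of the pattern."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared exactly when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    # Bit i of state is 0 when pattern[:i + 1] matches the text ending here.
    state = all_ones
    matches = []
    for pos, ch in enumerate(text):
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            matches.append(pos - m + 1)
    return matches


print(shift_or_search("abracadabra", "abra"))  # [0, 7]
```

Each text character costs one shift and one OR regardless of the pattern length, as long as the pattern fits in the word used for the state.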

20
Approximate string matching
  • Match with k errors
  • Levenshtein distance
  • Dynamic programming: O(mn), O(kn)
  • Automaton: non-deterministic
  • Convert to deterministic: O(n), but huge structure
  • Bit-parallel: O(n), the fastest known
  • Filtering: sublinear!
  • k errors cannot alter k + 1 segments
  • Multipattern exact search: detect suspicious places
  • uses approximate algorithm only when needed
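
The O(mn) dynamic programming approach can be sketched as below. It keeps a single column of the Levenshtein matrix and lets a match start anywhere in the text by fixing the top row at zero. The function name and the convention of reporting exclusive end positions are assumptions.

```python
def approximate_search(text, pattern, k):
    """Return the end positions (exclusive) in text where the pattern
    matches with at most k errors under Levenshtein distance."""
    m = len(pattern)
    # column[i] = fewest errors aligning pattern[:i] with some text
    # substring ending at the current position.
    column = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        previous_diagonal = column[0]
        column[0] = 0  # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_value = min(
                previous_diagonal + cost,  # match or substitution
                column[i] + 1,             # extra text character
                column[i - 1] + 1,         # missing pattern character
            )
            previous_diagonal = column[i]
            column[i] = new_value
        if column[m] <= k:
            ends.append(j)
    return ends


print(approximate_search("surgery", "survey", 2))  # [5, 6, 7]
```

Filtering methods would call a routine like this only on the text areas flagged by the exact multipattern search, rather than on the whole text.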

21
Regular expressions
  • Regular expressions
  • Automaton: O(m 2^m) + O(n), bad for long patterns
  • Bit-parallel (simulates non-deterministic)
  • Using indices to search for words with errors
  • Inverted files: search in vocabulary, then each word
  • Suffix trees and suffix arrays: the same algorithms!

22
Structural queries
  • Ad-hoc index for structure
  • Indexing tags as words
  • Inverted files are good, since they store occurrences in order

23
Search over compression
  • Improves both space AND time (fewer disk operations)
  • Compress query and search
  • Huffman compression, words as symbols, bytes
  • (frequencies: most frequent words get shorter codes)
  • Search each word in the vocabulary → its code
  • More sophisticated algorithms
  • Compressed inverted files: less disk → less time
  • Text and index compression can be combined

24
...compression
  • Suffix trees can be compressed almost to the size of suffix arrays
  • Suffix arrays can't be compressed (almost random), but can be constructed over compressed text
  • Instead of Huffman, use a code that respects alphabetic order
  • Almost the same compression
  • Signature files are sparse, so can be compressed
  • Ratios up to 70%

26
Research topics
  • Perhaps, new details in integration of
    compression and search
  • Linguistic indexing allowing linguistic
    variations
  • Search in plural or only singular
  • Search with or without synonyms

27
Conclusions
  • Inverted files seem to be the best option
  • Other structures are good for specific cases
  • Genetic databases
  • Sequential searching is an integral part of many indexing-based search techniques
  • Many methods to improve sequential searching
  • Compression can be integrated with search

28
Thank you! Till compensation lecture?