Special Topics in Computer Science: The Art of Information Retrieval. Chapter 8: Indexing and Searching

Slides: 29

1
Special Topics in Computer Science: The Art of Information Retrieval
Chapter 8: Indexing and Searching

Alexander Gelbukh
www.Gelbukh.com

2
Previous Chapter: Conclusions

Text transformation: meaning instead of strings
Lexical analysis
Stopwords
Stemming
POS, WSD, syntax, semantics
Ontologies to collate similar stems
Text compression
Searchable (compress the query, then search)
Random access
Word-based statistical methods (Huffman)
Index compression

3
Previous Chapter: Research topics

All computational linguistics
Improved POS tagging
Improved WSD
Uses of thesaurus
for user navigation
for collating similar terms
Better compression methods
Searchable compression
Random access

5
Types of searching

Sequential
Small texts
Volatile, or space limited
Indexed
Semi-static
Space overhead
First, we discuss indexed searching, then sequential searching

6
Inverted files

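Vocabulary: O(sqrt(n)) space by Heaps' law; e.g., 1 GB of text → about 5 MB of vocabulary
Occurrences: O(n) space, about 40% of the text size (stopwords not indexed)
Occurrences can address positions (word or character), files, sections, ...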
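
The two parts of an inverted file, the vocabulary and the lists of occurrences, can be sketched as a dictionary from words to word positions. This is only an illustration: the function name `build_inverted_index` and the tokenization rule (lowercased runs of word characters) are assumptions, not something the slides specify.

```python
import re
from collections import defaultdict


def build_inverted_index(text):
    """Map each lowercased word to the list of its word positions (0-based)."""
    index = defaultdict(list)
    # Each regex match is one word; enumerate() gives its word position.
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        index[match.group()].append(position)
    return dict(index)


index = build_inverted_index("to be or not to be")
print(sorted(index))   # the vocabulary
print(index["to"])     # occurrences of "to" as word positions
```

Storing character offsets, file numbers, or section numbers instead of word positions only changes what is appended to each list.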

7
Compression: Block addressing

Block addressing: about 5% space overhead
256, 64K, ... blocks (pointers of 1, 2, ... bytes)
Blocks of equal size (faster search) or logical sections (retrieval units)
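
A minimal sketch of block addressing, assuming the blocks are logical sections given as a list of strings: the index stores only block numbers, and exact positions are recovered by scanning the candidate blocks sequentially. The names `build_block_index` and `search` are hypothetical.

```python
import re


def build_block_index(blocks):
    """Map each word to the ascending list of block numbers that contain it."""
    index = {}
    for number, block in enumerate(blocks):
        # set() keeps one entry per block, however often the word occurs in it.
        for word in set(re.findall(r"\w+", block.lower())):
            index.setdefault(word, []).append(number)
    return index


def search(blocks, index, word):
    """Return (block number, character offset) for every occurrence of word."""
    word = word.lower()
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    hits = []
    for number in index.get(word, []):
        # The index only names the block; a sequential scan finds the offsets.
        for match in pattern.finditer(blocks[number].lower()):
            hits.append((number, match.start()))
    return hits


blocks = ["the cat sat", "a dog barked", "the dog saw the cat"]
index = build_block_index(blocks)
print(index["cat"])
print(search(blocks, index, "cat"))
```

The index is smaller because a word that occurs many times in a block is stored once, at the price of a sequential scan at query time.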

8
Searching in inverted files

Vocabulary search
Separate file
Many searching techniques
Lexicographic order: O(log V) (V is the vocabulary size), about ½ log n by Heaps' law
Hashing is not good for prefix search
Retrieval of occurrences
Manipulation of occurrences: about sqrt(n) (Heaps, Zipf)
Boolean operations. Context search
Merging
One list is usually much shorter (Zipf's law)
Only inverted files allow both sublinear space and sublinear time
Suffix trees and signature files don't
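
The remark about hashing and prefix search can be illustrated with a vocabulary kept in lexicographic order: all words sharing a prefix are contiguous, so two binary searches delimit them. This is a sketch; `prefix_range` is a hypothetical name, and it assumes no word contains the highest Unicode code point used as the upper sentinel.

```python
from bisect import bisect_left


def prefix_range(vocabulary, prefix):
    """Return all words of a sorted vocabulary that start with prefix."""
    low = bisect_left(vocabulary, prefix)
    # Assumption: no word contains U+10FFFF, so this string sorts
    # after every word that starts with the prefix.
    high = bisect_left(vocabulary, prefix + "\U0010ffff")
    return vocabulary[low:high]


vocabulary = sorted(["suffix", "search", "signature", "sea", "sequential", "searching"])
print(prefix_range(vocabulary, "sea"))
```

A hash table finds an exact word in constant expected time, but gives no such contiguous range for a prefix.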

9
Building inverted file 1

Infinite memory?
Use a trie to store the vocabulary
Append each position to the occurrence list of its word
Total time: O(n)

10
Building inverted file 2

Finite memory?
Fill the memory
Write partial indices: n/M pieces (M is the memory size)
Merge the partial indices (hierarchically): O(n log(n/M))
Insertion: index the new text of size n', then merge. O(n + n' log(n'/M))
Deletion: eliminate every occurrence. O(n)
Very fast creation and maintenance
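
The fill-write-merge procedure can be sketched as follows. Two simplifications are assumed: in-memory lists stand in for the partial index files on disk, and `heapq.merge` performs a single multiway merge rather than the hierarchical pairwise merge named on the slide. The names `partial_indices` and `merge_indices` are hypothetical.

```python
import heapq
import re
from itertools import groupby


def partial_indices(text, memory_limit):
    """Yield sorted runs of (word, position), each with at most memory_limit entries."""
    run = []
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        run.append((match.group(), position))
        if len(run) == memory_limit:   # memory is full: write out a partial index
            yield sorted(run)
            run = []
    if run:
        yield sorted(run)


def merge_indices(runs):
    """Merge sorted runs into one word -> positions mapping."""
    merged = heapq.merge(*runs)        # streams the runs in sorted order
    return {word: [position for _, position in group]
            for word, group in groupby(merged, key=lambda entry: entry[0])}


runs = list(partial_indices("to be or not to be", memory_limit=2))
index = merge_indices(runs)
print(len(runs), index)
```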

11
Suffix trees

Text as one long string. No words.
Genetic databases
Complex queries
Compacted trie structure
Problem: space
For text retrieval, inverted files are better

14
Suffix array

All suffixes (by position) in lexicographic order
Allows binary search
Much less space: about 40% of n
Supra-index: sampling, for better disk access
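
A small sketch of a suffix array and the binary search it allows. The construction here simply sorts all suffixes, which is fine for a toy string but is not one of the construction algorithms for large texts; the function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern, found by two binary searches."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # First suffix whose m-character prefix is >= pattern.
    low, high = 0, len(suffix_array)
    while low < high:
        mid = (low + high) // 2
        if prefix(mid) < pattern:
            low = mid + 1
        else:
            high = mid
    first = low
    # First suffix whose m-character prefix is > pattern.
    high = len(suffix_array)
    while low < high:
        mid = (low + high) // 2
        if prefix(mid) <= pattern:
            low = mid + 1
        else:
            high = mid
    return sorted(suffix_array[first:low])


text = "abracadabra"
suffix_array = build_suffix_array(text)
print(suffix_array)
print(find_occurrences(text, suffix_array, "abra"))
```

Because the pattern is compared against arbitrary suffixes, the same search answers prefix and phrase queries, not only whole words.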

15
Searching. Construction

Searching
Patterns, prefixes, phrases. Not only words
Suffix tree: O(m), but space is a problem (m is the query size)
Suffix array: O(log n) (n is the database size)
Construction of arrays: sorting
Large text: O(n^2 log(M)/M), more than for inverted files
Skip details
Addition: O(n n' log(M)/M)
Deletion: O(n)

16
Signature files

Usually worse than inverted files
Words are mapped to bit patterns
Blocks are mapped to ORs of their word patterns
If a block contains a word, all bits of that word's pattern are set in the block signature
Sequential search for blocks
False drops!
Design of the hash function
Have to traverse the block
Good for searching ANDs or proximity queries:
the bit patterns of the query words are ORed
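
The signature mechanism can be sketched as below. The signature width, the number of bits per word, and the use of SHA-256 as the hash function are all arbitrary assumptions made for illustration; the slides do not prescribe them.

```python
import hashlib

SIGNATURE_BITS = 64   # assumed signature width
BITS_PER_WORD = 3     # assumed number of bits set per word


def word_signature(word):
    """Map a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        signature |= 1 << (digest[i] % SIGNATURE_BITS)
    return signature


def block_signature(block):
    """OR together the patterns of all words in the block."""
    signature = 0
    for word in block.split():
        signature |= word_signature(word)
    return signature


def search(blocks, signatures, word):
    """Return the numbers of the blocks that really contain word."""
    wanted = word_signature(word)
    hits = []
    for number, signature in enumerate(signatures):
        if signature & wanted == wanted:                 # candidate block
            # The candidate may be a false drop, so traverse the block to verify.
            if word.lower() in blocks[number].lower().split():
                hits.append(number)
    return hits


blocks = ["the cat sat", "a dog barked", "the dog saw the cat"]
signatures = [block_signature(block) for block in blocks]
print(search(blocks, signatures, "cat"))
```

The signature test can never miss a block that contains the word; it can only produce false drops, which the verification step removes.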

18
Boolean operations

Merging file (occurrence) lists
AND: find the entries repeated in both lists
Evaluate according to the query syntax tree
Complexity is linear in the size of the intermediate results
Can be slow if they are huge
There are optimization techniques
E.g., merge a small list with a big one by searching in the big list
This is a usual case (Zipf)
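
The optimization of merging a small list with a big one by searching can be sketched like this: each entry of the short list is looked up in the long list by binary search, so the cost grows with the short list times the logarithm of the long one. The function name `intersect` is hypothetical, and both lists are assumed to be sorted.

```python
from bisect import bisect_left


def intersect(list_a, list_b):
    """AND of two sorted occurrence lists, searching the longer one."""
    short_list, long_list = sorted((list_a, list_b), key=len)
    result = []
    low = 0
    for item in short_list:
        # Continue from the previous position: both lists are sorted.
        low = bisect_left(long_list, item, low)
        if low == len(long_list):
            break
        if long_list[low] == item:
            result.append(item)
    return result


rare_word = [3, 40, 95]
frequent_word = list(range(0, 100, 5))
print(intersect(rare_word, frequent_word))
```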

19
Sequential search

Necessary part of many algorithms (e.g., block addressing)
Brute force: O(nm) worst case, O(n) on average
Knuth-Morris-Pratt: linear worst case, but the same average
Boyer-Moore: O(n log(m)/m) on average. Not all chars are examined!
If some part of the pattern has already been compared, there is no need to compare inside it again: you analyze the pattern once
Shift-Or: uses logical operations on all 32 bits in parallel
BDM (Backward DAWG Matching): automaton. Complexity same as Boyer-Moore
Combination of BDM with bit parallelism
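
A sketch of Shift-Or for exact matching. In this formulation a 0 bit at position i of the state means that the first i + 1 pattern characters match the text ending at the current character. Python integers are unbounded, so the state is masked to m bits here instead of relying on a 32-bit machine word; the function name is hypothetical.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all exact occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[ch] has a 0 bit at every position where the pattern holds ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    state = all_ones              # no prefix of the pattern matched yet
    accept = 1 << (m - 1)         # bit that signals a full match when it is 0
    hits = []
    for j, ch in enumerate(text):
        # One shift and one OR update all m prefix states at once.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if (state & accept) == 0:
            hits.append(j - m + 1)
    return hits


print(shift_or_search("abracadabra", "abra"))
```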

20
Approximate string matching

Match with k errors
Levenshtein distance
Dynamic programming: O(mn); improved versions O(kn)
Automaton: non-deterministic
Convert to deterministic: O(n), but a huge structure
Bit-parallel: O(n), the fastest known
Filtering: sublinear!
k errors cannot alter all of k + 1 segments
Multipattern exact search detects suspicious places
Uses the approximate algorithm only when needed
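
The O(mn) dynamic programming approach can be sketched as follows, using Levenshtein distance and keeping only one column of the matrix. The first cell of every column is 0 because a match may start at any text position. The function name and the choice to report end positions are assumptions of this sketch.

```python
def approximate_search(text, pattern, k):
    """Return the text indices where a match with at most k errors ends."""
    m = len(pattern)
    # column[i]: fewest errors to match pattern[:i] against some
    # substring of the text ending at the current text character.
    column = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text):
        diagonal = column[0]      # value of column[i - 1] before this update
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_value = min(column[i] + 1,        # extra character in the text
                            column[i - 1] + 1,    # pattern character missing
                            diagonal + cost)      # match or substitution
            diagonal = column[i]
            column[i] = new_value
        if column[m] <= k:
            hits.append(j)
    return hits


print(approximate_search("hello world", "wurld", 1))
```

With k = 0 this reduces to exact matching, which is a convenient sanity check.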

21
Regular expressions

Regular expressions
Automaton: O(m 2^m) to build, O(n) to search; bad for long patterns
Bit-parallel (simulates the non-deterministic automaton)
Using indices to search for words with errors
Inverted files: search in the vocabulary, then for each matching word
Suffix trees and suffix arrays: the same algorithms!

22
Structural queries

Ad-hoc index for structure
Indexing tags as words
Inverted files are good since they store occurrences in order

23
Search over compression

Improves both space AND time (fewer disk operations)
Compress the query and search
Huffman compression, words as symbols, byte-oriented codes (by frequency: the most frequent words get shorter codes)
Search for each word in the vocabulary → its code
More sophisticated algorithms
Compressed inverted files: less disk → less time
Text and index compression can be combined

24
...compression

Suffix trees can be compressed almost to the size of suffix arrays
Suffix arrays can't be compressed (almost random), but can be constructed over compressed text
Instead of Huffman, use a code that respects alphabetic order
Almost the same compression
Signature files are sparse, so they can be compressed
Ratios up to 70%

26
Research topics

Perhaps, new details in the integration of compression and search
Linguistic indexing: allowing linguistic variations
Search in plural or only singular
Search with or without synonyms

27
Conclusions

Inverted files seem to be the best option
Other structures are good for specific cases
Genetic databases
Sequential searching is an integral part of many indexing-based search techniques
Many methods to improve sequential searching
Compression can be integrated with search

28
Thank you! Till compensation lecture?
