Indexing and Complexity - PowerPoint PPT Presentation

Provided by: Preferr654
1
Indexing and Complexity
2
Agenda
  • Inverted indexes
  • Computational complexity

3
Some Interesting Questions
  • How long will it take to find a document?
  • Is there any work we can do in advance?
  • If so, how long will that take?
  • How big a computer will I need?
  • How much disk space? How much RAM?
  • What if more documents arrive?
  • How much of the advance work must be repeated?
  • Will searching become slower?
  • How much more disk space will be needed?

4
A Cautionary Tale
  • Searching is easy - just ask Microsoft!
  • Find can search my 1 GB disk in 30 seconds
  • Well, actually it only looks at the file names...
  • How long do you think Find would take for:
  • The 100 GB disk we just got?
  • For the World Wide Web?
  • Computers are getting faster, but...
  • How does AltaVista give answers in 5 seconds?

5
The Inverted File Trick
  • Organize the bag of words matrix by terms
  • You know the terms that you are looking for
  • Look up terms like you search phone books
  • For each letter, jump directly to the right spot
  • For terms of reasonable length, this is very fast
  • For each term, store the document identifiers
  • For every document that contains that term
  • At query time, use the document identifiers
  • Consult a postings file
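
The idea on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions: whitespace tokenization, lowercasing, and a tiny made-up collection (the `docs` dictionary and the function name `build_inverted_index` are hypothetical, not from the slides).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sorted postings make later merging and intersection straightforward.
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical toy collection (not the example from the slides).
docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick lazy fox",
}
index = build_inverted_index(docs)
print(index["brown"])  # [1, 2]
print(index["fox"])    # [1, 3]
```

At query time only the entries for the query terms are consulted, so the cost depends on the terms asked for rather than on a scan of every document.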

6
An Example
Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8  Inverted File  Postings
aid       0      0      0      1      0      0      0      1    A -> AI        4, 8
all       0      1      0      1      0      1      0      0    A -> AL        2, 4, 6
back      1      0      1      0      0      0      1      0    B -> BA        1, 3, 7
brown     1      0      1      0      1      0      1      0    B -> BR        1, 3, 5, 7
come      0      1      0      1      0      1      0      1    C              2, 4, 6, 8
dog       0      0      1      0      1      0      0      0    D              3, 5
fox       0      0      1      0      1      0      1      0    F              3, 5, 7
good      0      1      0      1      0      1      0      1    G              2, 4, 6, 8
jump      0      0      1      0      0      0      0      0    J              3
lazy      1      0      1      0      1      0      1      0    L              1, 3, 5, 7
men       0      1      0      1      0      0      0      1    M              2, 4, 8
now       0      1      0      0      0      1      0      1    N              2, 6, 8
over      1      0      1      0      1      0      1      1    O              1, 3, 5, 7, 8
party     0      0      0      0      0      1      0      1    P              6, 8
quick     1      0      1      0      0      0      0      0    Q              1, 3
their     1      0      0      0      1      0      1      0    T -> TH        1, 5, 7
time      0      1      0      1      0      1      0      0    T -> TI        2, 4, 6
7
The Finished Product
Term    Inverted File  Postings
aid     A -> AI        4, 8
all     A -> AL        2, 4, 6
back    B -> BA        1, 3, 7
brown   B -> BR        1, 3, 5, 7
come    C              2, 4, 6, 8
dog     D              3, 5
fox     F              3, 5, 7
good    G              2, 4, 6, 8
jump    J              3
lazy    L              1, 3, 5, 7
men     M              2, 4, 8
now     N              2, 6, 8
over    O              1, 3, 5, 7, 8
party   P              6, 8
quick   Q              1, 3
their   T -> TH        1, 5, 7
time    T -> TI        2, 4, 6
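
As a rough illustration of how these postings might be used at query time, the sketch below answers a Boolean AND query by merging sorted postings lists. The postings are copied from the slide; the function names `intersect` and `boolean_and` are made up for this example.

```python
def intersect(a, b):
    """Merge-intersect two sorted postings lists in one pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# A few postings lists taken from the slide.
postings = {
    "quick": [1, 3],
    "brown": [1, 3, 5, 7],
    "fox": [3, 5, 7],
    "dog": [3, 5],
}

def boolean_and(terms):
    """Return ids of documents that contain every term in `terms`."""
    result = postings.get(terms[0], [])
    for term in terms[1:]:
        result = intersect(result, postings.get(term, []))
    return result

print(boolean_and(["quick", "brown", "fox"]))  # [3]
print(boolean_and(["fox", "dog"]))             # [3, 5]
```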
8
What Goes in a Postings File?
  • Boolean retrieval
  • Just the document number
  • Ranked Retrieval
  • Document number and term weight (TF-IDF, ...)
  • Proximity operators
  • Word offsets for each occurrence of the term
  • Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
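
A minimal sketch of how word offsets could support a proximity operator follows. The entry for `term_a` reuses the offsets from the example above; `term_b`, its offsets, and the function name `within` are hypothetical and exist only to make the sketch runnable.

```python
# Positional postings: term -> {document number: [word offsets]}
positional = {
    "term_a": {3: [17, 36], 13: [3, 45]},  # offsets from the slide's example
    "term_b": {3: [18, 90], 13: [60]},     # hypothetical second term
}

def within(term1, term2, k):
    """Return documents where term1 and term2 occur within k words of each other."""
    p1, p2 = positional[term1], positional[term2]
    hits = []
    for doc_id in sorted(p1.keys() & p2.keys()):
        # Compare every pair of offsets; fine for short lists like these.
        if any(abs(x - y) <= k for x in p1[doc_id] for y in p2[doc_id]):
            hits.append(doc_id)
    return hits

print(within("term_a", "term_b", 1))   # [3]
print(within("term_a", "term_b", 15))  # [3, 13]
```

Storing one offset per occurrence is what makes this kind of postings file so much larger than the Boolean one.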

9
How Big Is the Postings File?
  • Very compact for Boolean retrieval
  • About 10% of the size of the documents
  • If an aggressive stopword list is used!
  • Not much larger for ranked retrieval
  • Perhaps 20%
  • Enormous for proximity operators
  • Sometimes larger than the documents!
  • But access is fast - you know where to look

10
Building an Inverted Index
  • Simplest solution is a single sorted array
  • Fast lookup using binary search
  • But sorting large files on disk is very slow
  • And adding one document means starting over
  • Tree structures allow easy insertion
  • But the worst-case lookup time is linear
  • Balanced trees provide the best of both
  • Fast lookup and easy insertion
  • But they require 45% more disk space
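
The sorted-array option can be sketched with Python's standard `bisect` module. This is an in-memory illustration only: the terms and postings come from the earlier example, the function names are made up, and `insert_term` assumes the term is not already present.

```python
import bisect

# Sorted vocabulary with a parallel list of postings.
terms = ["all", "good", "now", "time"]
postings = [[2, 4, 6], [2, 4, 6, 8], [2, 6, 8], [2, 4, 6]]

def lookup(term):
    """Binary search: O(log n) comparisons."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []

def insert_term(term, doc_ids):
    """Insert a new term (assumed absent); every later entry must shift, O(n)."""
    i = bisect.bisect_left(terms, term)
    terms.insert(i, term)
    postings.insert(i, doc_ids)

insert_term("men", [2, 4, 8])
print(terms)          # ['all', 'good', 'men', 'now', 'time']
print(lookup("men"))  # [2, 4, 8]
```

In memory the shift is cheap, but for a large sorted file on disk the same insertion means rewriting much of the file, which is the cost that tree structures avoid.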

11
Starting a B Tree Inverted File
Input text: Now is the time for all good
[Figure: tree diagram. Index node keys: aaaaa, now. Leaf terms: now, time, good, all.]
12
Adding a New Term
Input text: Now is the time for all good men
[Figure: tree diagram after inserting "men". Index node keys: aaaaa, now; aaaaa, men. Leaf terms: now, time, good, all, men.]
13
How Big is the Inverted Index?
  • Typically smaller than the postings file
  • Depends on number of terms, not documents
  • Eventually almost all terms will be indexed
  • But the postings file will continue to grow
  • Postings dominate asymptotic space complexity
  • Linear in the number of documents
  • Assuming that the documents remain about the same
    size
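
The growth pattern described here can be illustrated with a small count. The four documents below are made up for the purpose; the sketch simply tracks the vocabulary size (which drives the inverted index) and the number of postings after each document.

```python
def growth(docs):
    """Return (documents seen, distinct terms, total postings) after each document."""
    vocab = set()
    postings = 0
    rows = []
    for n, text in enumerate(docs, start=1):
        doc_terms = set(text.lower().split())
        vocab |= doc_terms
        postings += len(doc_terms)  # one posting per (term, document) pair
        rows.append((n, len(vocab), postings))
    return rows

# Hypothetical documents of equal length.
docs = [
    "now is the time",
    "the time is now",
    "now is the party",
    "the party is now",
]
for row in growth(docs):
    print(row)
# (1, 4, 4)
# (2, 4, 8)
# (3, 5, 12)
# (4, 5, 16)
```

The vocabulary levels off quickly while the postings count rises by the same amount with every document, which is the linear growth the slide refers to.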

14
Some Facts About Disks
  • It takes a long time to get the first byte
  • A Pentium can do 1,000,000 operations in 10 ms
  • But you can get 1,000 bytes just about as fast
  • 40 MB/sec transfer rates are typical
  • So it pays to put related stuff in each block
  • M-ary trees (such as B-trees) are better than binary trees
  • Time complexity is measured in disk blocks read
  • Since computing time is negligible by comparison
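
A back-of-the-envelope calculation makes the point concrete. The 40 MB/sec transfer rate and the 1,000-byte block are from the slide; the 10 ms access time, the one-block-read-per-tree-level model, the fanout of 100, and the one-million-term vocabulary are assumptions chosen for illustration.

```python
ACCESS_MS = 10.0               # assumed time to reach the first byte of a block
TRANSFER_BYTES_PER_SEC = 40e6  # 40 MB/sec, as on the slide
BLOCK_BYTES = 1000             # block size used on the slide

def block_read_ms(block_bytes=BLOCK_BYTES):
    """Time to fetch one block: access time plus transfer time."""
    return ACCESS_MS + block_bytes / TRANSFER_BYTES_PER_SEC * 1000.0

def levels(num_keys, fanout):
    """Smallest h with fanout**h >= num_keys, using integer arithmetic."""
    h, capacity = 0, 1
    while capacity < num_keys:
        capacity *= fanout
        h += 1
    return h

def lookup_ms(num_keys, fanout):
    """Assumes one block read per tree level."""
    return levels(num_keys, fanout) * block_read_ms()

print(levels(1_000_000, 2), levels(1_000_000, 100))  # 20 3
print(lookup_ms(1_000_000, 2), lookup_ms(1_000_000, 100))
```

Under these assumptions the transfer adds only 0.025 ms to each 10 ms access, so a lookup costs roughly 200 ms with a binary tree and roughly 30 ms with a 100-way tree.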

15
Time Complexity
  • Indexing
  • Walk the inverted file, splitting if needed
  • Insert into the postings file in sorted order
  • Hours or days for large collections
  • Query processing
  • Walk the inverted file
  • Read the postings file
  • Seconds, even for enormous collections

16
Summary
  • Slow indexing yields fast query processing
  • We use extra disk space to save query time
  • Index space is in addition to document space
  • Time and space complexity must be balanced
  • Disk block reads are the critical resource
  • Fast disks are more useful than fast computers

17
A Question
  • If insertions are more common than queries (for
    example, filtering news stories as they arrive
    and then never looking at them again), what kind
    of an index should you build?

18
Indexing High Volume Streams
  • Build an index based on dates
  • Index based on anticipated search strategies
  • Balanced trees allow easy insertions
  • Easier than sorted arrays
  • Unbalanced trees might be even faster
  • Indexing time saved could justify query time cost
  • Don't do any indexing at all
  • If the queries are stable, just keep them in RAM
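
The last option might look something like the sketch below: the standing queries stay in memory and each arriving document is checked against them once, with no index built over the documents. The query profiles, the all-terms-must-match rule, and the function name `route` are assumptions made for illustration.

```python
# Hypothetical standing queries: profile name -> terms that must all appear.
standing_queries = {
    "elections": {"election", "vote"},
    "weather": {"storm", "flood"},
}

def route(document):
    """Return the names of the standing queries that an incoming document matches."""
    words = set(document.lower().split())
    return sorted(
        name for name, required in standing_queries.items()
        if required <= words  # subset test: every required term is present
    )

print(route("Storm and flood warnings issued"))  # ['weather']
print(route("Early vote counts released"))       # []
```

Each document is read once and then discarded, so the only state that persists is the small set of queries.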