Title: File organization
1. File Organization
2. File Organizations (Indexes)
- Choices for accessing data during query evaluation
  - Scan the entire collection
    - Typical in early (batch) retrieval systems
    - Computational and I/O costs are O(characters in collection)
    - Practical only for small text collections
    - Large memory systems make scanning feasible
  - Use indexes for direct access
    - Evaluation time is O(query term occurrences in collection)
    - Practical for large collections
    - Many opportunities for optimization
  - Hybrids: use a small index, then scan a subset of the collection
3. Indexes
- What should the index contain?
  - Database systems index primary and secondary keys
    - This is the hybrid approach
    - The index provides fast access to a subset of database records
    - Scan the subset to find the solution set
  - The IR problem
    - Cannot predict the keys that people will use in queries
    - Every word in a document is a potential search term
  - The IR solution: index by all keys (words) → full-text indexes
4. Indexes
- The index is accessed by the atoms of a query language
  - The atoms are called features, keys, or terms
- Most common feature types
  - Words in text, punctuation
  - Manually assigned terms (controlled and uncontrolled vocabulary)
  - Document structure (sentence and paragraph boundaries)
  - Inter- or intra-document links (e.g., citations)
- Composed features
  - Feature sequences (phrases, names, dates, monetary amounts)
  - Feature sets (e.g., synonym classes)
- Indexing and retrieval models drive the choices
  - Must be able to construct all components of those models
- Indexing choices (there is no right answer)
  - Tokenization, case folding (e.g., New vs. new, Apple vs. apple), stopwords (e.g., the, a, its), morphology (e.g., computer, computers, computing, computed)
- Index granularity has a large impact on speed and effectiveness
5. Index Contents
- The contents depend upon the retrieval model
- Feature presence/absence
  - Boolean
  - Statistical (tf, df, ctf, doclen, maxtf)
  - Often about 10% of the size of the raw data, compressed
- Positional
  - Feature location within the document
  - Granularities include word, sentence, paragraph, etc.
  - Coarse granularities are less precise, but take less space
  - Word-level granularity is about 20-30% of the size of the raw data, compressed
6. Indexes: Implementation
- Common implementations of indexes
  - Bitmaps
  - Signature files
  - Inverted files
- Common index components
  - Dictionary (lexicon)
  - Postings
    - document ids
    - word positions
7. Indexes: Bitmaps
- Bag-of-words index only
- For each term, allocate a vector with one bit per document
- If the feature is present in document n, set the nth bit to 1, otherwise 0
- Boolean operations are very fast
- Space efficient for common terms (why?)
- Space inefficient for rare terms (why?)
- Good compression with run-length encoding (why?)
- Not widely used
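As a minimal sketch of the idea, the following uses Python's arbitrary-precision integers as per-term bit vectors over a made-up three-document collection; Boolean AND and OR each reduce to a single bitwise operation.

```python
# Sketch of a bitmap index: one bit vector per term, one bit per document.
# Document n corresponds to bit n. The collection is invented for illustration.

docs = [
    "the cat sat",        # doc 0
    "the dog sat",        # doc 1
    "a cat and a dog",    # doc 2
]

bitmap = {}
for doc_id, text in enumerate(docs):
    for term in set(text.split()):
        bitmap[term] = bitmap.get(term, 0) | (1 << doc_id)

def matches(bits):
    """Return the doc ids whose bit is set in the vector."""
    return [d for d in range(len(docs)) if bits & (1 << d)]

# Boolean operations are plain bitwise ops, hence very fast:
print(matches(bitmap["cat"] & bitmap["dog"]))   # AND -> [2]
print(matches(bitmap["cat"] | bitmap["dog"]))   # OR  -> [0, 1, 2]
```

Note how a common term sets most bits (dense vector, space well used), while a rare term sets almost none, which is why bitmaps waste space on rare terms.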
8. Indexes: Signature Files
- Bag-of-words only
- For each term, allocate a fixed-size s-bit vector (signature)
- Define hash functions
  - Multiple functions, word → 1..s, select which bits to set
- Each term has an s-bit signature
  - may not be unique!
- OR the term signatures to form the document signature
- Long documents are a problem (why?)
  - Usually segment them into smaller pieces / blocks
9. Signature File Example
10. Signature File Example
11. Indexes: Signature Files
- At query time
  - Look up the signature for the query (how?)
  - If all corresponding 1-bits are on in the document signature, the document probably contains that term
- Vary s to control P(false positive)
  - Note the space tradeoff
  - The optimal s changes as the collection grows (why?)
- Widely studied, but not widely used
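A small sketch of the scheme above, with arbitrary parameters (s = 16 bits, 2 hash functions) and MD5 used as a stand-in hash; any real deployment would tune s against the false-positive rate.

```python
# Sketch of a signature file: each term hashes to a few bit positions in an
# s-bit signature; a document's signature is the OR of its term signatures.
import hashlib

S = 16          # signature width in bits (arbitrary for illustration)
NUM_HASHES = 2  # hash functions per term (arbitrary for illustration)

def term_signature(term):
    sig = 0
    for i in range(NUM_HASHES):
        h = hashlib.md5(f"{i}:{term}".encode()).digest()
        sig |= 1 << (int.from_bytes(h[:4], "big") % S)
    return sig

def doc_signature(text):
    sig = 0
    for term in text.split():
        sig |= term_signature(term)
    return sig

def probably_contains(doc_sig, term):
    """True if all of the term's 1-bits are on (may be a false positive)."""
    t = term_signature(term)
    return doc_sig & t == t

d = doc_signature("the cat sat on the mat")
print(probably_contains(d, "cat"))    # True
print(probably_contains(d, "zebra"))  # usually False, but false positives occur
```

A long document ORs together many term signatures, driving the document signature toward all 1-bits, so every lookup "probably matches"; that is why long documents are segmented into blocks.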
12. Indexes: Inverted Lists
- Inverted lists are currently the most common indexing technique
- Source file: the collection, organized by document
- Inverted file: the collection, organized by term
  - one record per term, listing the locations where the term occurs
- During evaluation, traverse the lists for each query term
  - OR: the union of the component lists
  - AND: the intersection of the component lists
  - Proximity: an intersection of the component lists
  - SUM: the union of the component lists; each entry has a score
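The AND and OR operations above can be sketched as linear merges over sorted lists of document ids; the lists below are invented, loosely following the "porridge pot" example used later in these slides.

```python
# Boolean AND/OR over sorted inverted lists via a linear merge.

def intersect(a, b):  # AND
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):  # OR
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

porridge = [1, 2, 5]   # doc ids containing "porridge" (invented)
pot = [2, 4]           # doc ids containing "pot" (invented)
print(intersect(porridge, pot))  # [2]
print(union(porridge, pot))      # [1, 2, 4, 5]
```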
13. Inverted Files
14. Inverted Files
15. Word-Level Inverted File
16. How Big Is the Index?
- For an n-word collection
  - Lexicon
    - Heaps' Law: V = O(n^β), 0.4 < β < 0.6
    - TREC-2: 1 GB of text, 5 MB lexicon
  - Postings
    - at most one per occurrence of a word in the text: O(n)
17. Inverted Search Algorithm
- Find the query elements (terms) in the lexicon
- Retrieve the postings for each lexicon entry
- Manipulate the postings according to the retrieval model
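The three steps can be sketched with a toy in-memory lexicon and postings; the scoring here is a simple sum of term frequencies standing in for a real retrieval model, and all data is invented.

```python
# Inverted search sketch: lexicon lookup -> fetch postings -> combine scores.

lexicon = {"porridge": 0, "pot": 1}   # term -> postings slot (invented data)
postings = [                          # slot -> list of (doc_id, tf)
    [(1, 2), (2, 1), (5, 1)],         # "porridge"
    [(2, 1), (4, 3)],                 # "pot"
]

def search(query):
    scores = {}
    for term in query.split():
        if term not in lexicon:                    # step 1: lexicon lookup
            continue
        for doc, tf in postings[lexicon[term]]:    # step 2: fetch postings
            scores[doc] = scores.get(doc, 0) + tf  # step 3: combine (SUM of tf)
    return sorted(scores.items(), key=lambda x: -x[1])

print(search("porridge pot"))
```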
18Word-Level Inverted File
lexicon
posting
Query 1.porridge pot (BOOL) 2.porridge
pot (BOOL) 3. porridge pot (VSM)
Answer
19. Lexicon Data Structures
- According to Heaps' Law, the lexicon may be very large
- Hash table
  - O(1) lookup, with constant-time h() and collision handling
  - Supports exact-match lookup
  - May be complex to expand
- B-Tree
  - On-disk storage with fast retrieval and good caching behavior
  - Supports exact-match and range-based lookup
  - O(log n) lookups to find a list
  - Usually easy to expand
20. In-memory Inversion Algorithm
- Create an empty lexicon
- For each document d in the collection:
  - Read the document, parse it into terms
  - For each indexing term t:
    - f(d,t) = frequency of t in d
    - If t is not in the lexicon, insert it
    - Append <d, f(d,t)> to the postings list for t
- Output each postings list into the inverted file
  - For each term, start a new file entry
  - Append each <d, f(d,t)> to the entry
  - Compress the entry
  - Write the entry out to the file
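The algorithm above can be sketched in a few lines; the "inverted file" is printed as plain text here, whereas a real implementation would compress each entry before writing it, and the two-document collection is invented.

```python
# In-memory inversion: build a postings list of (d, f_dt) pairs per term.
from collections import Counter

docs = {1: "the cat sat", 2: "the cat and the dog"}  # toy collection

lexicon = {}                       # term -> postings list of (d, f_dt)
for d, text in docs.items():
    for t, f_dt in Counter(text.split()).items():
        lexicon.setdefault(t, []).append((d, f_dt))  # append <d, f(d,t)>

# Output each postings list as one inverted-file entry (uncompressed here)
for term in sorted(lexicon):
    entry = " ".join(f"{d}:{f}" for d, f in lexicon[term])
    print(f"{term}\t{entry}")
```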
21. Complexity of In-memory Inversion
- Time: O(n) for n-byte text
- Space
  - Lexicon: space for the unique words and their offsets
  - Postings: 10 bytes per entry
    - document number: 4 bytes
    - frequency count: 2 bytes (allows 65,536 occurrences max)
    - "next" pointer: 4 bytes
- Is this affordable?
  - For a 5 GB collection, at 10 bytes/entry and 400M entries, we need 4 GB of main memory
22. Idea 1: Partition the Text
- Invert a chunk of the text at a time
- Then merge the sub-indexes into one complete index
23. Idea 2: Sort-based Inversion
- Invert in two passes
  - Output records <t, d, f(d,t)> to a temporary file
  - Sort the records using external merge sort
    - read a chunk of the temp file
    - sort it using Quicksort
    - write it back into the same place
    - then merge-sort the chunks in place
  - Read the sorted file, and write the inverted file
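A sketch of the two passes, with Python's in-memory sort standing in for the external merge sort that a real system would use when the records do not fit in memory; the collection is invented.

```python
# Sort-based inversion: pass 1 emits <t, d, f_dt> records, pass 2 sorts
# them by (t, d) and groups them into postings lists.
from collections import Counter
from itertools import groupby

docs = {1: "to be or not to be", 2: "not to be"}

# Pass 1: emit one record per (term, document) pair
records = [(t, d, f) for d, text in docs.items()
           for t, f in Counter(text.split()).items()]

# Pass 2: sort (externally, in a real system) and group by term
records.sort()
inverted = {t: [(d, f) for _, d, f in group]
            for t, group in groupby(records, key=lambda r: r[0])}

print(inverted["to"])  # [(1, 2), (2, 1)]
```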
24. Access Optimizations
- Skip lists
  - A table of contents for the inverted list
  - Embedded pointers that jump ahead n documents (why is this useful?)
- Separating presence information from location information
  - Many operators only need presence information
  - Location information takes substantial space (I/O)
  - If split:
    - reduced I/O for presence operators
    - increased I/O for location operators (or a larger index)
  - Common in CD-ROM implementations
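The skip-list idea can be sketched as an intersection that jumps ahead k entries when the skip target is still below the other list's current document id; k = 3 and the lists are arbitrary choices for illustration.

```python
# Intersection over sorted inverted lists with simple skip pointers:
# every k-th position may jump ahead k entries instead of stepping by one.

def intersect_with_skips(a, b, k=3):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            # skip ahead in a only if the skip target is still <= b[j],
            # so no potential match can be jumped over
            if i % k == 0 and i + k < len(a) and a[i + k] <= b[j]:
                i += k
            else:
                i += 1
        else:
            if j % k == 0 and j + k < len(b) and b[j + k] <= a[i]:
                j += k
            else:
                j += 1
    return out

long_list = list(range(0, 100, 2))   # 0, 2, 4, ..., 98
short_list = [8, 41, 90]
print(intersect_with_skips(long_list, short_list))  # [8, 90]
```

Skips pay off exactly when one list is much shorter than the other: long runs of the long list are passed over without examining each id.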
25. Inverted File Compression
- Inverted lists are usually compressed
  - Inverted files with word locations are about the size of the raw data
- The distribution of the numbers is skewed
  - Most numbers are small (e.g., word locations, term frequencies)
- The distribution can easily be made more skewed
  - Delta encoding: 5, 8, 10, 17 → 5, 3, 2, 7
- Simple compression techniques are often the best choice
  - Simple algorithms are nearly as effective as complex algorithms
  - Simple algorithms are much faster than complex algorithms
  - Goal: time saved by reduced I/O > time required to uncompress
- Compress → Huffman encoding
26. Inverted File Compression
- The longest lists, which take up the most space, contain the most frequent (probable) words
- Compressing the longest lists would save the most space
- The longest lists should compress easily because they contain the least information (why?)
- Algorithms
  - Delta encoding
  - Variable-length encodings
    - Unary codes
    - Gamma codes
    - Delta codes
    - Variable-byte code
    - Golomb code
27. Inverted List Indexes: Compression
- Delta encoding ("storing gaps")
  - Reduces the range of the numbers
  - Keep the document numbers d in ascending order
  - 3, 5, 20, 21, 23, 76, 77, 78
  - becomes 3, 2, 15, 1, 2, 53, 1, 1
  - Produces a more skewed distribution
  - Increases the probability of smaller numbers
  - Stemming also increases the probability of smaller numbers (why?)
28. Variable-Byte Code
- Binary, but uses the minimum number of bytes
  - 7 bits store the value, 1 bit flags whether there is another byte
  - 0 < x < 128: 1 byte
  - 128 ≤ x < 16384: 2 bytes
  - 16384 ≤ x < 2097152: 3 bytes
- Integral byte sizes for easy coding
- Very effective for medium-sized numbers
- A little wasteful for very small numbers
- Best trade-off between smallest index size and fastest query time
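One common byte-aligned layout of this code (least-significant 7-bit group first, high bit as the continuation flag) can be sketched as:

```python
# Variable-byte code: 7 data bits per byte, high bit = continuation flag
# (1 = another byte follows, 0 = last byte). Low-order groups come first.

def vbyte_encode(x):
    out = []
    while x >= 128:
        out.append((x & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        x >>= 7
    out.append(x)                      # final byte, continuation bit clear
    return bytes(out)

def vbyte_decode(data):
    x, shift = 0, 0
    for b in data:
        x |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            break
    return x

print(len(vbyte_encode(5)))     # 1 byte
print(len(vbyte_encode(300)))   # 2 bytes
assert vbyte_decode(vbyte_encode(2**20)) == 2**20
```

Gaps below 128, the common case after delta encoding, cost exactly one byte, which is why the two techniques are usually combined.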
29. Index/IR Toolkits
- The Lemur Toolkit for Language Modeling and Information Retrieval
  - http://www.lemurproject.org/
- Nutch: Java-based indexing and search technology providing web search application software
  - http://lucene.apache.org/nutch/
30. Summary
- Common implementations of indexes
  - Bitmaps, signature files, inverted files
- Common index components
  - Lexicon, postings
- Inverted search algorithm
- Inversion algorithms
- Inverted file access optimizations
- Compression
31. The End
32. Vector Space Model
- Represent the document d and the query q as vectors, weight the terms with TF-IDF, and rank documents by their similarity to the query (combining the tf and idf factors)
33. Vocabulary Growth (Heaps' Law)
- How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
  - The vocabulary has no upper bound due to proper names, typos, etc.
  - New words occur less frequently as the vocabulary grows
- If V is the size of the vocabulary and n is the length of the corpus in words:
  - V = K·n^β (0 < β < 1)
- Typical constants
  - K ≈ 10-100
  - β ≈ 0.4-0.6 (approximately the square root of n)
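Plugging mid-range constants from the slide into the formula shows how slowly the vocabulary grows; K = 50 and β = 0.5 are illustrative choices, not fitted values.

```python
# Heaps' Law sketch: estimated vocabulary size V = K * n^beta for a
# corpus of n words, with mid-range constants from the slide.

def heaps_vocabulary(n, K=50, beta=0.5):
    return K * n ** beta

# With beta = 0.5, a 100x larger corpus has only a 10x larger vocabulary:
print(heaps_vocabulary(10**6))  # 50000.0 unique words
print(heaps_vocabulary(10**8))  # 500000.0 unique words
```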
34. Query Answer
- 1. porridge pot (BOOL)
  - d2
- 2. porridge pot (BOOL)
  - null
- 3. porridge pot (VSM)
  - d2 > d1 > d5
35. porridge pot (VSM)
- N = 6; f(d1) is the value computed for document d1