1
File organization
2
File Organizations (Indexes)
  • Choices for accessing data during query
    evaluation
  • Scan the entire collection
  • Typical in early (batch) retrieval systems
  • Computational and I/O costs are O(characters in
    collection)
  • Practical for only small text collections
  • Large memory systems make scanning feasible
  • Use indexes for direct access
  • Evaluation time O(query term occurrences in
    collection)
  • Practical for large collections
  • Many opportunities for optimization
  • Hybrids: use a small index, then scan a subset of
    the collection

3
Indexes
  • What should the index contain?
  • Database systems index primary and secondary keys
  • This is the hybrid approach
  • Index provides fast access to a subset of
    database records
  • Scan subset to find solution set
  • IR problem
  • Cannot predict the keys that people will use in
    queries
  • Every word in a document is a potential search
    term
  • IR solution: index by all keys (words) → full-text
    indexes

4
Indexes
  • The index is accessed by the atoms of a query
    language; the atoms are called features, keys,
    or terms
  • Most common feature types
  • Words in text, punctuation
  • Manually assigned terms (controlled and
    uncontrolled vocabulary)
  • Document structure (sentence and paragraph
    boundaries)
  • Inter- or intra-document links (e.g.,
    citations)
  • Composed features
  • Feature sequences (phrases, names, dates,
    monetary amounts)
  • Feature sets (e.g., synonym classes)
  • Indexing and retrieval models drive choices
  • Must be able to construct all components of
    those models
  • Indexing choices (there is no right answer)
  • Tokenization, case folding (e.g., New vs. new,
    Apple vs. apple), stopwords (e.g., the, a, its),
    morphology (e.g., computer, computers, computing,
    computed)
  • Index granularity has a large impact on speed and
    effectiveness

5
Index Contents
  • The contents depend upon the retrieval model
  • Feature presence/absence
  • Boolean
  • Statistical (tf, df, ctf, doclen, maxtf)
  • Often about 10% of the size of the raw data,
    compressed
  • Positional
  • Feature location within document
  • Granularities include word, sentence, paragraph,
    etc.
  • Coarse granularities are less precise, but take
    less space
  • Word-level granularity: about 20-30% of the size
    of the raw data, compressed

6
Indexes: Implementation
  • Common implementations of indexes
  • Bitmaps
  • Signature files
  • Inverted files
  • Common index components
  • Dictionary (lexicon)
  • Postings
  • document ids
  • word positions

No positional data indexed
7
Indexes: Bitmaps
  • Bag-of-words index only
  • For each term, allocate vector with one bit per
    document
  • If feature present in document n, set nth bit to
    1, otherwise 0
  • Boolean operations very fast
  • Space efficient for common terms (why?)
  • Space inefficient for rare terms (why?)
  • Good compression with run-length encoding (why?)
  • Not widely used
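
A minimal sketch of such a bitmap index (the toy collection, the terms, and the use of Python integers as arbitrary-length bit vectors are all illustrative assumptions, not part of the original slides):

```python
# Bitmap index sketch: one bit per document per term.
docs = ["the old pot", "porridge in the pot", "old porridge"]  # toy collection

bitmaps = {}  # term -> bit vector; bit n is 1 iff the term occurs in document n
for n, text in enumerate(docs):
    for term in text.split():
        bitmaps[term] = bitmaps.get(term, 0) | (1 << n)

# Boolean operations reduce to machine-level AND/OR on the vectors
both = bitmaps["porridge"] & bitmaps["pot"]    # AND: documents with both terms
either = bitmaps["porridge"] | bitmaps["pot"]  # OR: documents with either term

print([n for n in range(len(docs)) if both & (1 << n)])    # [1]
print([n for n in range(len(docs)) if either & (1 << n)])  # [0, 1, 2]
```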

8
Indexes: Signature Files
  • Bag-of-words only
  • For each term, allocate fixed size s-bit vector
    (signature)
  • Define hash functions
  • Multiple functions, each mapping word → 1..s,
    select which bits to set
  • Each term has an s-bit signature
  • may not be unique!
  • OR the term signatures to form document signature
  • Long documents are a problem (why?)
  • Usually segment them into smaller pieces / blocks

9
Signature File Example
10
Signature File Example
11
Indexes: Signature Files
  • At query time
  • Lookup signature for query (how?)
  • If all corresponding 1-bits are on in document
    signature, document probably contains that term
  • Vary s to control P(false positive)
  • Note space tradeoff
  • Optimal s changes as collection grows (why?)
  • Widely studied, but not widely used
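
A sketch of signature construction and query-time testing, assuming s = 16 bits and two hash functions derived from hashlib; every parameter and document here is an illustrative choice, not something the slides specify:

```python
import hashlib

S = 16          # signature width in bits (illustrative)
NUM_HASHES = 2  # hash functions per term (illustrative)

def term_signature(term):
    """OR together NUM_HASHES bit positions (0..S-1) derived from the term."""
    sig = 0
    for i in range(NUM_HASHES):
        h = int(hashlib.md5(f"{i}:{term}".encode()).hexdigest(), 16)
        sig |= 1 << (h % S)
    return sig

def doc_signature(text):
    """OR the term signatures to form the document signature."""
    sig = 0
    for term in text.split():
        sig |= term_signature(term)
    return sig

def probably_contains(doc_sig, term):
    """True if all of the term's 1-bits are on; may be a false positive."""
    t = term_signature(term)
    return doc_sig & t == t

d = doc_signature("porridge in the pot")
print(probably_contains(d, "porridge"))  # True
print(probably_contains(d, "walrus"))    # usually False; sometimes a false positive
```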

12
Indexes: Inverted Lists
  • Inverted lists are currently the most common
    indexing technique
  • Source file: the collection, organized by document
  • Inverted file: the collection, organized by term
  • one record per term, listing the locations where
    the term occurs
  • During evaluation, traverse lists for each query
    term
  • OR: the union of the component lists
  • AND: the intersection of the component lists (see
    the sketch below)
  • Proximity: an intersection of the component lists
  • SUM: the union of the component lists; each entry
    has a score
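
A sketch of the AND and OR operations over postings lists of ascending document ids (the lists themselves are toy data):

```python
def intersect(p, q):
    """AND: two-pointer intersection of sorted document-id lists."""
    i = j = 0
    out = []
    while i < len(p) and j < len(q):
        if p[i] == q[j]:
            out.append(p[i]); i += 1; j += 1
        elif p[i] < q[j]:
            i += 1
        else:
            j += 1
    return out

def union(p, q):
    """OR: merge two postings lists, keeping each document id once."""
    return sorted(set(p) | set(q))

porridge = [1, 2, 5]  # illustrative postings
pot = [0, 1, 2]
print(intersect(porridge, pot))  # [1, 2]
print(union(porridge, pot))      # [0, 1, 2, 5]
```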

13
Inverted Files
14
Inverted Files
15
Word-Level Inverted File
16
How big is the index?
  • For an n-word collection
  • Lexicon
  • Heaps' Law: V = O(n^β), 0.4 < β < 0.6
  • TREC-2: 1 GB text, 5 MB lexicon
  • Postings
  • at most one per occurrence of a word in the
    text: O(n)

17
Inverted Search Algorithm
  • Find query elements (terms) in the lexicon
  • Retrieve postings for each lexicon entry
  • Manipulate postings according to the retrieval
    model

18
Word-Level Inverted File
lexicon
postings
Query: 1. porridge pot (BOOL) 2. porridge pot (BOOL)
3. porridge pot (VSM)
Answer
19
Lexicon Data Structures
  • According to Heaps' Law, the size of the lexicon
    may be very large
  • Hash table
  • O(1) lookup, given a constant-time hash function
    and collision handling
  • Supports exact-match lookup
  • May be complex to expand
  • B-Tree
  • On-disk storage with fast retrieval and good
    caching behavior
  • Supports exact-match and range-based lookup
  • O(log n) lookups to find a list
  • Usually easy to expand
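
A sketch contrasting the two structures: a Python dict stands in for the hash table, and a sorted array with binary search stands in for the ordered access a B-tree provides (the lexicon entries are toy data):

```python
import bisect

# Hash-table lexicon: O(1) exact-match lookup
lexicon = {"pease": 0, "porridge": 1, "pot": 2}  # term -> postings offset
print(lexicon.get("pot"))  # 2

# Sorted-array lexicon: O(log n) exact lookup, plus range/prefix scans,
# which is what makes B-tree-style ordered storage attractive
terms = sorted(lexicon)                    # ['pease', 'porridge', 'pot']
lo = bisect.bisect_left(terms, "po")       # first term >= 'po'
hi = bisect.bisect_left(terms, "po\xff")   # first term past the 'po' prefix
print(terms[lo:hi])                        # ['porridge', 'pot']
```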

20
In-memory Inversion Algorithm
  • Create an empty lexicon
  • For each document d in the collection,
  • Read document, parse into terms
  • For each indexing term t,
  • f_{d,t} = frequency of t in d
  • If t is not in the lexicon, insert it
  • Append <d, f_{d,t}> to the postings list for t
  • Output each postings list into the inverted file
  • For each term, start a new file entry
  • Append each <d, f_{d,t}> to the entry
  • Compress the entry
  • Write the entry out to the file (see the sketch
    below)
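
The algorithm translates almost line-for-line into Python; this sketch keeps everything in memory and omits the compression and file-output steps:

```python
from collections import Counter, defaultdict

def invert(collection):
    """In-memory inversion; collection is a list of (doc_id, text) pairs."""
    lexicon = defaultdict(list)        # term -> postings list of (d, f_{d,t})
    for d, text in collection:
        freqs = Counter(text.split())  # f_{d,t} for every term t in d
        for t, f in freqs.items():
            lexicon[t].append((d, f))  # append <d, f_{d,t}>
    return lexicon                     # a real system would compress and write out

index = invert([(1, "pease porridge hot"), (2, "porridge in the pot")])
print(index["porridge"])  # [(1, 1), (2, 1)]
```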

21
Complexity of In-memory Inv.
  • Time O(n) for n-byte text
  • Space
  • Lexicon: space for unique words and offsets
  • Postings: 10 bytes per entry
  • document number: 4 bytes
  • frequency count: 2 bytes (allows 65,536 max
    occurrences)
  • "next" pointer: 4 bytes
  • Is this affordable?
  • For a 5 GB collection with 400M entries at 10
    bytes/entry, we need 4 GB of main memory

22
Idea 1: Partition the Text
  • Invert a chunk of the text at a time
  • Then merge the sub-indexes into one complete
    index (see the sketch below)
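
A sketch of the merge step, assuming each sub-index maps terms to postings sorted by document id (the chunks are toy data):

```python
from heapq import merge

def merge_indexes(a, b):
    """Merge two sub-indexes into one; postings stay sorted by doc id."""
    out = {}
    for t in set(a) | set(b):
        # Chunks are built in document order, so each list is already sorted
        out[t] = list(merge(a.get(t, []), b.get(t, [])))
    return out

chunk1 = {"pot": [(1, 2)], "porridge": [(2, 1)]}  # index of documents 1-2
chunk2 = {"pot": [(3, 1)]}                        # index of document 3
print(merge_indexes(chunk1, chunk2))
# {'pot': [(1, 2), (3, 1)], 'porridge': [(2, 1)]} (key order may vary)
```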

23
Idea 2: Sort-based Inversion
  • Invert in two passes
  • Output records <t, d, f_{d,t}> to a temporary file
  • Sort the records using external merge sort
  • read a chunk of the temp file
  • sort it using Quicksort
  • write it back into the same place
  • then merge-sort the chunks in place
  • Read sorted file, and write inverted file
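
A sketch of the two passes, with an in-memory sort standing in for the temp file and external merge sort (toy data):

```python
from collections import Counter

def sort_based_invert(collection):
    records = []
    for d, text in collection:                   # pass 1: emit <t, d, f_{d,t}>
        for t, f in Counter(text.split()).items():
            records.append((t, d, f))
    records.sort()                               # stands in for external merge sort
    inverted = {}
    for t, d, f in records:                      # pass 2: write contiguous lists
        inverted.setdefault(t, []).append((d, f))
    return inverted

print(sort_based_invert([(1, "pease porridge"), (2, "porridge pot")]))
# {'pease': [(1, 1)], 'porridge': [(1, 1), (2, 1)], 'pot': [(2, 1)]}
```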

24
Access Optimizations
  • Skip lists
  • A table of contents to the inverted list
  • Embedded pointers that jump ahead n documents
    (why is this useful?)
  • Separating presence information from location
    information
  • Many operators only need presence information
  • Location information takes substantial space
    (I/O)
  • If split,
  • reduced I/O for presence operators
  • increased I/O for location operators (or larger
    index)
  • Common in CD-ROM implementations
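
A sketch of intersection with skip-ahead, approximating the embedded pointers with a fixed skip distance of sqrt(list length); that distance is a common textbook choice, not something the slide fixes:

```python
import math

def intersect_with_skips(p, q):
    """AND of two sorted doc-id lists, using skips to jump ahead in p."""
    skip = int(math.sqrt(len(p))) or 1  # skip distance: ~sqrt(list length)
    out, i, j = [], 0, 0
    while i < len(p) and j < len(q):
        if p[i] == q[j]:
            out.append(p[i]); i += 1; j += 1
        elif p[i] < q[j]:
            if i + skip < len(p) and p[i + skip] <= q[j]:
                i += skip  # jump ahead without examining the skipped entries
            else:
                i += 1
        else:
            j += 1
    return out

evens = list(range(0, 1000, 2))
print(intersect_with_skips(evens, [300, 301, 998]))  # [300, 998]
```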

25
Inverted file compression
  • Inverted lists are usually compressed
  • Inverted files with word locations are about the
    size of the raw data
  • Distribution of numbers is skewed
  • Most numbers are small (e.g., word locations,
    term frequency)
  • Distribution can be made more skewed easily
  • Delta encoding: 5, 8, 10, 17 → 5, 3, 2, 7
  • Simple compression techniques are often the best
    choice
  • Simple algorithms nearly as effective as complex
    algorithms
  • Simple algorithms much faster than complex
    algorithms
  • Goal: time saved by reduced I/O > time required
    to uncompress

Compress → e.g., Huffman encoding
26
Inverted file compression
  • The longest lists, which take up the most space,
    contain the most frequent (probable) words.
  • Compressing the longest lists would save the most
    space.
  • The longest lists should compress easily because
    they contain the least information (why?)
  • Algorithms
  • Delta encoding
  • Variable-length encoding
  • Unary codes
  • Gamma codes
  • Delta codes
  • Variable-Byte Code
  • Golomb code
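
Of the codes listed, unary and Elias gamma are easy to sketch as bit strings (a toy rendering; real implementations pack the bits):

```python
def unary(x):
    """Unary code for x >= 1: (x - 1) one-bits followed by a zero."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Elias gamma code for x >= 1: unary length of binary(x), then its low bits."""
    b = bin(x)[2:]                 # binary representation without the '0b' prefix
    return unary(len(b)) + b[1:]   # the leading 1-bit of b is implied by the length

for x in (1, 2, 5, 9):
    print(x, unary(x), gamma(x))
# 1 0 0
# 2 10 100
# 5 11110 11001
# 9 111111110 1110001
```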

27
Inverted List Indexes Compression
  • Delta Encoding ("Storing Gaps")
  • Reduces range of numbers.
  • Keep document ids d in ascending order
  • 3, 5, 20, 21, 23, 76, 77, 78
  • becomes 3, 2, 15, 1, 2, 53, 1, 1
  • Produces a more skewed distribution.
  • Increases probability of smaller numbers.
  • Stemming also increases the probability of
    smaller numbers. (Why?)
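
A sketch of gap encoding and decoding, reproducing the example above:

```python
def delta_encode(doc_ids):
    """Store gaps: each ascending doc id becomes its distance from the previous."""
    prev, gaps = 0, []
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def delta_decode(gaps):
    """Recover the original ids with a running sum."""
    total, ids = 0, []
    for g in gaps:
        total += g
        ids.append(total)
    return ids

ids = [3, 5, 20, 21, 23, 76, 77, 78]
print(delta_encode(ids))                # [3, 2, 15, 1, 2, 53, 1, 1]
print(delta_decode(delta_encode(ids)))  # round-trips to the original list
```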

28
Variable-Byte Code
  • Binary, but use minimum number of bytes
  • 7 bits to store value, 1 bit to flag if there is
    another byte
  • 0 ≤ x < 128: 1 byte
  • 128 ≤ x < 16,384: 2 bytes
  • 16,384 ≤ x < 2,097,152: 3 bytes
  • Integral byte sizes for easy coding
  • Very effective for medium-sized numbers
  • A little wasteful for very small numbers
  • Best trade-off between smallest index size and
    fastest query time
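
A sketch of the code, assuming the common least-significant-7-bits-first variant (the slides fix the 7+1 bit layout but not the byte order):

```python
def vbyte_encode(x):
    """Encode x >= 0: 7 data bits per byte; the high bit flags another byte."""
    out = bytearray()
    while x >= 128:
        out.append((x & 0x7F) | 0x80)  # high bit 1: more bytes follow
        x >>= 7
    out.append(x)                      # high bit 0: final byte
    return bytes(out)

def vbyte_decode(data):
    """Decode a single v-byte number (inverse of vbyte_encode)."""
    x = shift = 0
    for b in data:
        x |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return x

for x in (5, 127, 128, 16383, 16384):
    print(x, len(vbyte_encode(x)), vbyte_decode(vbyte_encode(x)) == x)
# 5 and 127 take 1 byte; 128 and 16383 take 2 bytes; 16384 takes 3 bytes
```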

29
Index/IR toolkits
  • The Lemur Toolkit for Language Modeling and
    Information Retrieval
  • Nutch: Java-based indexing and search technology
    that provides web search application software

http://www.lemurproject.org/
http://lucene.apache.org/nutch/
30
Summary
  • Common implementations of indexes
  • Bitmap, signature file, inverted file
  • Common index components
  • Lexicon and postings
  • Inverted Search Algorithm
  • Inversion algorithm
  • Inverted file access optimizations
  • compression

31
The End
32
Vector Space Model
  • Represent document d and query q as m-dimensional
    term vectors, weight the terms by TF-IDF, and rank
    documents by vector similarity (using the tf and
    idf formulas)
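
A minimal sketch of TF-IDF scoring under one common weighting; the exact formula on the original slide is not recoverable, and the collection and query here are toy data:

```python
import math
from collections import Counter

docs = ["pease porridge hot", "porridge in the pot", "old pot"]  # toy collection
N = len(docs)
tfs = [Counter(d.split()) for d in docs]   # term frequencies per document
df = Counter(t for tf in tfs for t in tf)  # document frequencies

def score(query, tf):
    """Sum tf * idf over the query terms, with idf = log(1 + N/df)."""
    return sum(tf[t] * math.log(1 + N / df[t]) for t in query.split() if t in df)

for i, tf in enumerate(tfs):
    print(i, round(score("porridge pot", tf), 3))  # document 1 scores highest
```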

33
Vocabulary Growth (Heaps Law)
  • How does the size of the overall vocabulary
    (number of unique words) grow with the size of
    the corpus?
  • Vocabulary has no upper bound due to proper
    names, typos, etc.
  • New words occur less frequently as vocabulary
    grows
  • If V is the size of the vocabulary and n is the
    length of the corpus in words:
  • V = K·n^β (0 < β < 1)
  • Typical constants
  • K = 10-100
  • β = 0.4-0.6 (so V grows roughly as the square
    root of n)
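
A worked example with illustrative constants chosen from the typical ranges above (K = 30, β = 0.5):

```python
K, beta = 30, 0.5  # illustrative constants within the typical ranges

for n in (10**6, 10**8, 10**10):  # corpus length in words
    print(f"n = {n:.0e}: V = K * n**beta = {K * n**beta:,.0f} unique words")
# n = 1e+06: V = 30,000
# n = 1e+08: V = 300,000
# n = 1e+10: V = 3,000,000
```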

34
Query Answer
  • 1.porridge pot (BOOL)
  • d2
  • 2.porridge pot (BOOL)
  • null
  • 3. porridge pot (VSM)
  • d2 > d1 > d5
  • Next page?

35
3. porridge pot (VSM): N = 6; f(d1) denotes the term frequency in document d1
