1
Processing of large document collections
  • Part 5

2
In this part
  • Indexing
  • querying
  • index construction

3
Indexing
  • An index is a mechanism for locating a given term
    in a text
  • Index in a book
  • it is possible to find information without
    browsing the pages
  • in large document collections (gigabytes),
    page-by-page search would be practically
    impossible

4
Indexing
  • It is assumed that
  • a document collection consists of a set of
    separate documents
  • each document is described by a set of
    representative terms
  • the index must be capable of identifying all
    documents that contain combinations of specified
    terms
  • a document is the unit of text that is returned
    in response to queries

5
Indexing
  • What is a document?
  • E.g. emails
  • sender, recipient, subject, message body
  • one email, one field, a set of emails?

6
Indexing
  • Granularity of the index: the resolution to
    which term locations are recorded within each
    document
  • e.g. 1 email = 1 document, but the index could be
    capable of ascertaining a more exact location
    within the document of each term
  • which documents contain the terms "tax" and
    "avoidance" in the same sentence?

7
Indexing
  • If the granularity of the index is taken to be
    one word, then the index will record the exact
    location of every word in the collection
  • the original text can be recovered from the index
  • the index takes more space than the original text

8
Indexing
  • Choice of representative terms
  • each word that appears in the documents is
    included verbatim as a term in the index
  • the number of terms is huge
  • usually some transformations are applied
  • case folding
  • stemming, baseword reduction
  • removal of stopwords

9
Inverted file indexing
  • An inverted file contains, for each term in the
    lexicon, an inverted list that stores a list of
    pointers to all occurrences of that term in the
    main text
  • each pointer is the number of a document in which
    that term appears
  • a lexicon: a list of all terms that appear in the
    document collection
  • supports mapping from terms to their
    corresponding inverted lists

10
Inverted file indexing
  • A query involving a single term is answered by
    scanning its inverted list and retrieving every
    document that it cites
  • for conjunctive Boolean queries of the form
    term1 AND term2 AND ... AND termr, the
    intersection of the terms' inverted lists is
    formed
  • for disjunction (OR), the union of the lists
  • for negation (NOT), the complement

11
Inverted file indexing
  • The inverted lists are usually stored in order of
    increasing document number
  • various merging operations can be performed in a
    time that is linear in the size of the lists
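
As an illustration of these merge operations, here is a minimal
sketch in Python (all names and data are invented for the example):
the index maps each term to its sorted list of document numbers,
and AND/OR queries are answered by linear-time merges of two
sorted lists.

    # Toy inverted file: term -> sorted list of document numbers.
    def build_index(docs):
        """docs: list of strings; document numbers start at 1."""
        index = {}
        for doc_num, text in enumerate(docs, start=1):
            for term in set(text.lower().split()):
                index.setdefault(term, []).append(doc_num)
        return index  # lists are sorted because doc_num increases

    def intersect(xs, ys):
        """Linear-time merge for AND: documents in both lists."""
        out, i, j = [], 0, 0
        while i < len(xs) and j < len(ys):
            if xs[i] == ys[j]:
                out.append(xs[i]); i += 1; j += 1
            elif xs[i] < ys[j]:
                i += 1
            else:
                j += 1
        return out

    def union(xs, ys):
        """Linear-time merge for OR: documents in either list."""
        out, i, j = [], 0, 0
        while i < len(xs) and j < len(ys):
            if xs[i] == ys[j]:
                out.append(xs[i]); i += 1; j += 1
            elif xs[i] < ys[j]:
                out.append(xs[i]); i += 1
            else:
                out.append(ys[j]); j += 1
        return out + xs[i:] + ys[j:]

    docs = ["text compression and retrieval",
            "compression of images",
            "retrieval of compressed text"]
    index = build_index(docs)
    print(intersect(index["text"], index["retrieval"]))  # -> [1, 3]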

12
Inverted file indexing granularity
  • A coarse-grained index might identify only a
    block of text, where each block stores several
    documents
  • a moderate-grain index will store locations in
    terms of document numbers
  • a fine-grained index will return a sentence or
    word number

13
Inverted file indexing granularity
  • Coarse indexes
  • require less storage, but during retrieval, more
    of the plain text must be scanned to find terms
  • multiterm queries are more likely to give rise to
    false matches, where each of the desired terms
    appears somewhere in the block, but not all
    within the same document

14
Inverted file indexing granularity
  • Word-level indexing
  • enables queries involving adjacency and proximity
    to be answered quickly because the desired
    relationship can be checked before the text is
    retrieved
  • adding precise locational information expands the
    index
  • more pointers in the index
  • each pointer requires more bits of storage

15
Inverted file indexing granularity
  • Unless a significant fraction of the queries are
    expected to be proximity-based, the usual
    granularity is the individual document
  • phrase-based queries can be handled by the
    slightly slower method of a postretrieval scan

16
Inverted file compression
  • Uncompressed inverted files can consume
    considerable space
  • 50-100% of the space of the text itself
  • the size of an inverted file can be reduced
    considerably by compressing it
  • the key to compression:
  • each inverted list can, without loss of
    generality, be stored as an ascending sequence
    of integers

17
Inverted file compression
  • Suppose that some term appears in 8 documents of
    a collection; the term is then described in the
    inverted file by the list
  • <8; 3, 5, 20, 21, 23, 76, 77, 78>,
    the address of which is contained in the
    lexicon
  • more generally, the list for a term t stores the
    number of documents f_t in which the term appears
    and then a list of f_t document numbers

18
Inverted file compression
  • the list of document numbers within each inverted
    list is in ascending order, and all processing is
    sequential from the beginning of the list
  • → the list can be stored as an initial position
    followed by a list of d-gaps
  • the list for the term above becomes
  • <8; 3, 2, 15, 1, 2, 53, 1, 1>
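
A minimal sketch of this transformation in Python (the code is
invented; the data is the list from the slide): gaps are the
differences between consecutive document numbers, and a prefix
sum recovers the original list.

    # Convert an ascending list of document numbers to d-gaps
    # and back.
    def to_gaps(doc_nums):
        prev, gaps = 0, []
        for d in doc_nums:
            gaps.append(d - prev)  # first gap is measured from 0
            prev = d
        return gaps

    def from_gaps(gaps):
        total, doc_nums = 0, []
        for g in gaps:
            total += g
            doc_nums.append(total)
        return doc_nums

    nums = [3, 5, 20, 21, 23, 76, 77, 78]
    print(to_gaps(nums))             # -> [3, 2, 15, 1, 2, 53, 1, 1]
    print(from_gaps(to_gaps(nums)) == nums)  # -> True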

19
Inverted file compression
  • The two forms are equivalent, but it is not
    obvious that any saving has been achieved
  • the largest d-gap in the second representation is
    still potentially the same as the largest
    document number in the first
  • if there are N documents in the collection and a
    flat binary encoding is used to represent the gap
    sizes, both methods require ⌈log N⌉ bits per
    stored pointer

20
Inverted file compression
  • Considering each inverted list as a list of
    d-gaps, the sum of which is bounded by N, allows
    improved representation
  • → it is possible to code inverted lists using on
    average substantially fewer than ⌈log N⌉ bits
    per pointer

21
Inverted file compression
  • Many specific models have been proposed
  • global methods
  • every inverted list is compressed using the same
    common model
  • local methods
  • the model is adjusted for each list according to
    some parameter, usually term frequency
  • local methods tend to outperform global ones, but
    are more complex to implement
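
The slides do not name a specific code, but a classic global
method for d-gaps is Elias's gamma code, which spends few bits on
small gaps: a gap x >= 1 is coded as the unary length of its
binary form, followed by the binary digits after the leading 1.
A minimal sketch (a real implementation would pack bits rather
than use character strings):

    # Elias gamma code: short codewords for small d-gaps.
    def gamma_encode(x):
        assert x >= 1
        b = bin(x)[2:]                 # e.g. 9 -> '1001'
        return '0' * (len(b) - 1) + b  # unary length prefix + binary

    def gamma_decode(bits, pos=0):
        n = 0
        while bits[pos] == '0':        # leading zeros give the length
            n += 1
            pos += 1
        x = int(bits[pos:pos + n + 1], 2)
        return x, pos + n + 1

    code = ''.join(gamma_encode(g) for g in [3, 2, 15, 1, 2, 53, 1, 1])
    pos, gaps = 0, []
    while pos < len(code):
        g, pos = gamma_decode(code, pos)
        gaps.append(g)
    print(gaps)  # -> [3, 2, 15, 1, 2, 53, 1, 1]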

22
Querying
  • How to use an index to locate information in the
    text it describes?

23
Boolean queries
  • A Boolean query comprises a list of terms that
    are combined using the connectives AND, OR, and
    NOT
  • the answers to the query are those documents that
    satisfy the condition

24
Boolean queries
  • e.g. text AND compression AND retrieval
  • all three words must occur somewhere in every
    answer (no particular order)
  • "the compression and retrieval of large amounts
    of text is an interesting problem"
  • "this text describes the fractional distillation
    scavenging technique for retrieving argon from
    compressed air"

25
Boolean queries
  • A problem with all retrieval systems
  • non-relevant answers are returned
  • must be filtered out manually
  • broad query → high recall
  • narrow query → high precision

26
Boolean queries
  • Small variations in a query can generate very
    different results
  • data AND compression AND retrieval
  • text AND compression AND retrieval
  • the user should be able to pose complex queries
    like
  • (text OR data OR image) AND
    (compression OR compaction OR
    decompression) AND (archiving OR retrieval OR
    storage)

27
Ranked queries
  • Non-professional users might prefer simply giving
    a list of words that are of interest and letting
    the retrieval system supply the documents that
    seem most relevant, rather than seeking exact
    Boolean answers
  • text, data, image, compression, compaction,
    archiving, storage, retrieval...

28
Ranked queries
  • It would be useless to convert a list of words to
    a Boolean query
  • connect with AND → too few documents
  • connect with OR → too many documents
  • solution: a ranked query
  • a heuristic that is applied to measure the
    similarity of each document to the query
  • r most closely matching documents are returned

29
Ranking strategies
  • Simple techniques
  • count the number of query terms that appear
    somewhere in the document
  • a document that contains 5 query terms is ranked
    higher than a document that contains 3 query
    terms
  • more advanced techniques
  • cosine measure
  • takes into account the lengths of the documents,
    etc.

30
Accessing the lexicon
  • The lexicon for an inverted file index stores
  • the terms that can be used to search the
    collection
  • information needed to allow queries to be
    processed
  • address in the inverted file (of the
    corresponding list of document numbers)
  • the number of documents containing the term

31
Access structures
  • A simple structure
  • an array of records, each comprising a string
    along with two integer fields
  • if the lexicon is sorted, a word can be located
    by a binary search of the strings
  • consumes a lot of space
  • e.g. a collection of a million words (5 GB),
    stored as 20-byte strings, each with a 4-byte
    inverted file address and a 4-byte frequency
    value → 28 MB

32
Access structures
  • The space for the strings is reduced if they are
    all concatenated into one long contiguous string
  • an array of 4-byte character pointers is used for
    access
  • each term takes its exact number of characters,
    plus 4 bytes for the pointer
  • it is not necessary to store string lengths: the
    next pointer indicates where the string ends
  • for the collection of a million terms, the memory
    is reduced by 8 MB, to 20 MB

33
Access structures
  • The memory required can be further reduced by
    eliminating many of the string pointers
  • only 1 word in 4 is indexed, and each stored word
    is prefixed by a 1-byte length field
  • the length field allows the start of the next
    string to be identified and the block of strings
    traversed

34
Access structures
  • In each group of four words, 12 bytes of pointers
    are saved
  • at the cost of including 4 bytes of length
    information
  • for a million-word lexicon, a saving of 2 MB →
    18 MB

35
Access structures
  • Blocking makes the search process more complex;
    to look up a term:
  • the array of string pointers is binary-searched
    to locate the correct block of words
  • the block is scanned in a linear fashion to find
    the term
  • the term's ordinal number is inferred from the
    combination of the block number and the position
    within the block
  • the frequency value and inverted file address are
    accessed using the ordinal term number
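
A minimal sketch of this 1-in-4 blocking scheme in Python (the
byte layout and all names are invented for illustration): words
within a block are stored with a 1-byte length prefix, only every
4th word gets a pointer, and lookup combines a binary search over
the pointers with a linear scan inside the block.

    # Blocked lexicon: pointer for every 4th word; 1-byte length
    # field before each stored word.
    def build_blocked(words, k=4):
        blocks, pointers = bytearray(), []
        for i, w in enumerate(words):
            if i % k == 0:
                pointers.append((w, len(blocks)))  # word, byte offset
            b = w.encode('ascii')
            blocks.append(len(b))       # 1-byte length field
            blocks.extend(b)
        return bytes(blocks), pointers

    def lookup(term, blocks, pointers, k=4):
        """Return the term's ordinal number, or -1 if absent."""
        # binary search: rightmost block whose first word <= term
        lo, hi = 0, len(pointers) - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if pointers[mid][0] <= term:
                lo = mid
            else:
                hi = mid - 1
        # linear scan within the block, using the length fields
        pos = pointers[lo][1]
        for j in range(k):
            if pos >= len(blocks):
                break
            n = blocks[pos]
            word = blocks[pos + 1:pos + 1 + n].decode('ascii')
            if word == term:
                return lo * k + j       # ordinal term number
            pos += 1 + n
        return -1

    words = sorted(['apple', 'axe', 'bat', 'cat',
                    'dog', 'eel', 'fig', 'gnu'])
    blocks, ptrs = build_blocked(words)
    print(lookup('dog', blocks, ptrs))  # -> 4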

36
Access structures
  • Consecutive words in a sorted list are likely to
    share a common prefix
  • front coding
  • 2 integers are stored with each word
  • one to indicate how many prefix characters are
    the same as the previous word
  • the other to record how many suffix characters
    remain when the prefix is removed
  • the integers are followed by the suffix characters
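
A minimal sketch of front coding in Python (invented code; the
word list is just an example of a shared-prefix run): each entry
stores the shared-prefix length, the suffix length, and the
suffix characters.

    # Front coding of a sorted word list.
    def common_prefix_len(a, b):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    def front_encode(words):
        out, prev = [], ''
        for w in words:
            p = common_prefix_len(prev, w)
            out.append((p, len(w) - p, w[p:]))  # prefix, suffix len, suffix
            prev = w
        return out

    def front_decode(entries):
        words, prev = [], ''
        for p, _, suffix in entries:
            w = prev[:p] + suffix
            words.append(w)
            prev = w
        return words

    words = ['jezebel', 'jezer', 'jezerit', 'jeziah', 'jeziel']
    coded = front_encode(words)
    print(coded)  # e.g. (4, 1, 'r'): 'jezer' shares 'jeze' with 'jezebel'
    print(front_decode(coded) == words)  # -> True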

37
Access structures
  • Front coding yields a net saving of about 40
    percent of the space required for string storage
    in a typical lexicon for the English language
  • a problem with complete front coding:
  • binary search is no longer possible
  • solution: partial 3-in-4 front coding

38
Access structures
  • Partial 3-in-4 front coding
  • every 4th word (the one indexed by the block
    pointer) is stored without front coding, so that
    binary search can proceed
  • on a large lexicon, expected to save about 4
    bytes on each of three words, at the cost of 2
    extra bytes of prefix-length information
  • a net gain of 10 bytes per 4-word block
  • for a million-word lexicon → 15.5 MB

39
Disk-based lexicon storage
  • The amount of primary memory required by the
    lexicon can be reduced by putting the lexicon on
    disk
  • just enough information is retained in primary
    memory to identify the disk block corresponding
    to each term

40
Disk-based lexicon storage
  • To locate the information corresponding to a
    given term, the in-memory index is searched to
    determine a block number
  • the block is read into a buffer
  • search is continued within the block
  • B-tree etc. can be used

41
Disk-based lexicon storage
  • This approach is simple and requires minimal
    amount of primary memory
  • a disk-based lexicon is many times slower to
    access than a memory-based one
  • one disk access per lookup is required
  • the extra time is tolerable when just a few terms
    are being looked up (as in normal query
    processing, fewer than 50 terms)
  • but not suitable for the index construction
    process

42
Boolean query processing
  • Processing a query
  • the lexicon is searched for each term in the
    query
  • each inverted list is retrieved and decoded
  • lists are merged, taking the intersection, union,
    or complement, as appropriate
  • finally, the documents are retrieved and displayed

43
Conjunctive queries
  • text AND compression AND retrieval
  • a conjunctive query of r terms is processed as
    follows:
  • each term is stemmed and located in the lexicon
  • if the lexicon is on disk, one disk access per
    term is required
  • the terms are sorted by increasing frequency

44
Conjunctive queries
  • The inverted list for the least frequent term is
    read into memory
  • this list forms the set of candidates (documents
    that have not yet been eliminated and might be
    answers to the query)
  • all remaining inverted lists are processed
    against this set of candidates, in increasing
    order of term frequency

45
Conjunctive queries
  • In a conjunctive query, a candidate cannot be an
    answer unless it appears in all inverted lists
  • → the size of the set of candidates is
    non-increasing
  • to process a term, each document in the set of
    candidates is checked and removed if it does not
    appear in the terms inverted list
  • the remaining candidates are the answers
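
A minimal sketch of this strategy in Python (the index and its
contents are invented): inverted lists are processed from the
rarest term to the most frequent one, so the candidate set can
only shrink, and processing can stop early if it becomes empty.

    # Conjunctive (AND) query evaluation by candidate filtering.
    def conjunctive_query(index, terms):
        lists = [index.get(t, []) for t in terms]
        lists.sort(key=len)            # increasing document frequency
        candidates = set(lists[0])     # rarest term initializes the set
        for lst in lists[1:]:
            if not candidates:         # may drop to zero early
                break
            candidates &= set(lst)     # remove docs missing this term
        return sorted(candidates)

    index = {
        'text':        [1, 3, 4, 7, 9],
        'compression': [1, 2, 3, 9],
        'retrieval':   [3, 9],
    }
    print(conjunctive_query(index, ['text', 'compression',
                                    'retrieval']))  # -> [3, 9]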

46
Term processing order
  • Reasons to select the least frequent term to
    initialize the set of candidates (and to process
    the remaining terms in increasing frequency
    order):
  • to minimize the amount of temporary memory space
    required during query processing
  • the number of candidates may be quickly reduced,
    even to zero, after which no processing is
    required

47
Processing ranked queries
  • How to assign a similarity measure to each
    document that indicates how closely it matches a
    query?

48
Coordinate matching
  • Count the number of query terms that appear in
    each document
  • the more terms that appear, the more likely it is
    that the document is relevant
  • a hybrid query between a conjunctive AND query
    and a disjunctive OR query
  • a document that contains any of the terms is a
    potential answer, but preference is given to
    documents that contain all or most of them
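
A minimal sketch of coordinate matching in Python (invented code,
reusing a toy inverted index): each query term contributes one
point to every document in its inverted list, and the documents
with the highest counts are returned.

    from collections import Counter

    # Coordinate matching: rank by number of query terms present.
    def coordinate_match(index, terms, r=3):
        scores = Counter()
        for t in terms:
            for doc in index.get(t, []):
                scores[doc] += 1       # one point per matching term
        return scores.most_common(r)   # top-r (document, score) pairs

    index = {
        'text':        [1, 3, 4, 7, 9],
        'compression': [1, 2, 3, 9],
        'retrieval':   [3, 9],
    }
    print(coordinate_match(index, ['text', 'compression', 'retrieval']))
    # -> [(3, 3), (9, 3), (1, 2)]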

49
Inner product similarity
  • Coordinate matching can be formalized as an inner
    product of a query vector with a set of document
    vectors
  • the similarity measure of query Q with document
    D_d is expressed as
  • M(Q, D_d) = Q · D_d
  • the inner product of two n-vectors X and Y is
  • X · Y = Σ_{i=1..n} x_i · y_i

50
Drawbacks
  • Takes no account of term frequency
  • documents with many occurrences of a term should
    be favored
  • takes no account of term scarcity
  • rare terms should have more weight?
  • long documents with many terms are automatically
    favored
  • they are likely to contain more of any given list
    of query terms

51
Solutions
  • Term frequency
  • binary present - not-present judgment can be
    replaced with an integer indicating how many
    times the term appears in the document
  • f_{d,t}: the within-document frequency
  • more generally, a term t in document d can be
    assigned a document-term weight w_{d,t}
  • and a query-term weight w_{q,t}

52
Solutions
  • The similarity measure is the inner product of
    these two weight vectors
  • it is normal to assign w_{q,t} = 0 if t does not
    appear in Q, so the measure can be stated as
  • M(Q, D_d) = Σ_{t ∈ Q} w_{q,t} · w_{d,t}

53
Inverse document frequency
  • If only the term frequency is taken into account,
    and a query contains common words, a document
    with enough appearances of a common term is
    always ranked first, irrespective of other words
  • → terms can be weighted according to their
    inverse document frequency

54
Weighting
  • Many possibilities exist to combine term
    frequency and inverse document frequency
  • principles
  • a term that appears in many documents should not
    be regarded as being more important than a term
    that appears in a few
  • a document with many occurrences of a term should
    not be regarded as being less important than a
    document that has just a few

55
Weighting
  • For instance,
  • TF·IDF for w_{d,t}
  • IDF for w_{q,t}

56
Similarity of vectors
  • Long documents should not be favored over short
    documents
  • similarity of the direction indicated by the two
    vectors is measured
  • similarity is defined as the cosine of the angle
    between the document and query vector
  • cos θ = 1, when θ = 0
  • cos θ = 0, when the vectors are orthogonal

57
Similarity of vectors
  • The cosine of the angle between two vectors can
    be calculated as
  • cos θ = (X · Y) / (|X| · |Y|)
  • |X| is the length of vector X
  • the lengths act as a normalization factor

58
Similarity of vectors
  • Cosine rule for ranking:
  • cosine(Q, D_d) = (1 / (W_q · W_d)) ·
    Σ_{t ∈ Q ∩ D_d} w_{q,t} · w_{d,t}
  • where W_d = sqrt(Σ_t w_{d,t}²)
  • and W_q = sqrt(Σ_t w_{q,t}²)
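
A minimal sketch in Python tying the pieces together (invented
code; it assumes the common weighting w_{d,t} = f_{d,t} ·
log(N/f_t) and w_{q,t} = log(N/f_t), one of many variants). Since
W_q is the same for every document, it is dropped without
changing the ranking.

    import math
    from collections import Counter

    # Cosine ranking with TF-IDF weights (one common variant).
    def rank(docs, query, r=3):
        N = len(docs)
        tf = [Counter(d.lower().split()) for d in docs]   # f_{d,t}
        df = Counter(t for counts in tf for t in counts)  # f_t
        idf = {t: math.log(N / df[t]) for t in df}

        results = []
        for d, counts in enumerate(tf):
            w_d = {t: f * idf[t] for t, f in counts.items()}
            W_d = math.sqrt(sum(w * w for w in w_d.values()))
            dot = sum(w_d.get(t, 0.0) * idf.get(t, 0.0)   # w_{q,t} = IDF
                      for t in set(query.lower().split()))
            if W_d > 0 and dot > 0:
                results.append((dot / W_d, d + 1))  # doc numbers from 1
        return sorted(results, reverse=True)[:r]

    docs = ["text compression and text retrieval",
            "compression of images",
            "retrieval of stored text"]
    print(rank(docs, "text compression retrieval"))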

59
Index construction
  • Each document of the collection contains some
    index terms, and each index term appears in some
    of the documents
  • this relationship can be expressed with a
    frequency matrix
  • each column corresponds to one word
  • each row corresponds to one document
  • the number stored at any row and column is the
    frequency, in that document, of the word
    indicated by that column

60
Index construction
  • Each document of the collection is summarized in
    one row of the frequency matrix
  • to create an index, the matrix must be
    transposed, forming a new version in which the
    rows correspond to terms
  • from this form, an inverted file index is easy to
    construct

61
Index construction
  • Trivial algorithm
  • build in memory a transposed frequency matrix,
    reading the text in document order, one column of
    the matrix at a time
  • write the matrix to disk row by row, in term order
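
A minimal sketch of the trivial algorithm in Python (invented
code): the frequency matrix is filled while reading the text in
document order, then read out term by term to form the inverted
lists. This is only feasible for tiny collections.

    from collections import Counter

    docs = ["text compression and text retrieval",
            "compression of images",
            "retrieval of stored text"]

    terms = sorted({t for d in docs for t in d.lower().split()})
    t_id = {t: i for i, t in enumerate(terms)}

    # one row per document, one column per term
    matrix = [[0] * len(terms) for _ in docs]
    for d, doc in enumerate(docs):            # read in document order
        for t, f in Counter(doc.lower().split()).items():
            matrix[d][t_id[t]] = f

    # transpose: emit one inverted list per term, in term order
    for t in terms:
        postings = [(d + 1, row[t_id[t]])
                    for d, row in enumerate(matrix) if row[t_id[t]] > 0]
        print(t, postings)  # (document number, frequency) pairs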

62
Index construction
  • In reality, inversion is much more difficult
  • the problem is the size of the frequency matrix
  • for instance, for a collection with 535,346
    distinct terms and 741,856 documents, the size of
    the matrix can be 1.4 TB
  • we could use a machine with a large virtual
    memory → inversion would take about 2 months

63
Index construction
  • More economical methods for constructing and
    inverting a frequency matrix exist
  • an index for the large collection mentioned above
    could be created in less than 2 hours (in 1998) on
    a personal computer, consuming just 30 MB of main
    memory and less than 20 MB of temporary disk
    space beyond the space required by the final
    inverted file

64
Final words
  • We have discussed
  • character sets
  • preprocessing of text
  • feature selection
  • text categorization
  • text summarization
  • text compression
  • indexing, querying

65
Final words
  • What else there is...
  • structured documents (XML, ...)
  • metadata (semantic Web, ontologies, ...)
  • linguistic resources (WordNet, thesauri, ...)
  • document management systems (archiving)
  • document analysis (scanning of documents)
  • digital libraries
  • text mining, question answering, ...
  • ...

66
Administrative...
  • Exam on Tuesday 4.12.
  • large essays (2-3 pages/each)
  • data comprehension (e.g. recall/precision)
  • use full sentences!
  • Exercise points
  • 28 or more original points → 30 pts
  • otherwise original points + 2
  • Remember the course feedback survey
    (Kurssikysely)!