Title: Processing of large document collections
1 Processing of large document collections
2 In this part
- indexing
- querying
- index construction
3 Indexing
- An index is a mechanism for locating a given term in a text
- index in a book: it is possible to find information without browsing the pages
- in large document collections (gigabytes), a page-by-page search would be impossible
4 Indexing
- We assume that
  - a document collection consists of a set of separate documents
  - each document is described by a set of representative terms
  - the index must be capable of identifying all documents that contain combinations of specified terms
  - a document is the unit of text that is returned in response to queries
5 Indexing
- What is a document?
- e.g. emails
  - sender, recipient, subject, message body
  - one email, one field, a set of emails?
6 Indexing
- Granularity of the index: the resolution to which term locations are recorded within each document
- e.g. one email = one document, but the index could be capable of ascertaining a more exact location of each term within the document
  - which documents contain the terms tax and avoidance in the same sentence?
7 Indexing
- If the granularity of the index is taken to be one word, then the index will record the exact location of every word in the collection
  - the original text can be recovered from the index
  - the index takes more space than the original text
8 Indexing
- Choice of representative terms
  - if each word that appears in the documents is included verbatim as a term in the index, the number of terms is huge
- usually some transformations are applied
  - case folding
  - stemming, baseword reduction
  - removal of stopwords
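The transformations listed above can be sketched in a few lines of Python. This is only illustrative: the stopword list is a tiny sample, and the suffix-stripping loop is a crude stand-in for a real stemmer (the names `normalize` and `extract_terms` are not from the source).

```python
# Toy term transformations: case folding, stopword removal, and a crude
# suffix-stripping stand-in for stemming / baseword reduction.

STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}  # illustrative sample

def normalize(word):
    """Case-fold a word and strip a few common English suffixes (toy stemmer)."""
    word = word.lower()                    # case folding
    for suffix in ("ing", "ed", "s"):      # toy baseword reduction
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_terms(text):
    """Turn raw text into index terms: tokenize, drop stopwords, normalize."""
    tokens = text.lower().split()
    return [normalize(t) for t in tokens if t not in STOPWORDS]

print(extract_terms("The compression and retrieval of large texts"))
# -> ['compression', 'retrieval', 'large', 'text']
```

Note how the transformations shrink the term set: "texts" and "text" collapse to a single index term.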
9 Inverted file indexing
- An inverted file contains, for each term in the lexicon, an inverted list that stores pointers to all occurrences of that term in the main text
  - each pointer is the number of a document in which that term appears
- a lexicon: a list of all terms that appear in the document collection
  - supports mapping from terms to their corresponding inverted lists
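A minimal in-memory version of this structure, at document-level granularity, can be sketched as follows (the function name `build_inverted_file` and the toy collection are illustrative, not from the source):

```python
# A minimal inverted file: the lexicon maps each term to its inverted list,
# a sorted list of the numbers of the documents containing the term.

from collections import defaultdict

documents = {
    1: "text compression methods",
    2: "retrieval of compressed text",
    3: "image compression",
}

def build_inverted_file(docs):
    index = defaultdict(list)
    for doc_num in sorted(docs):           # process documents in order, so
        for term in set(docs[doc_num].split()):
            index[term].append(doc_num)    # each inverted list stays ascending
    return dict(index)

index = build_inverted_file(documents)
print(index["compression"])   # -> [1, 3]
```

No stemming is applied here, which is why "compressed" in document 2 does not land on the list for "compression".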
10 Inverted file indexing
- A query involving a single term is answered by scanning its inverted list and retrieving every document that it cites
- for conjunctive Boolean queries of the form term AND term AND ... AND term, the intersection of the terms' inverted lists is formed
- for disjunction (OR), the union of the lists
- for negation (NOT), the complement
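The AND and OR operations above can be sketched directly on sorted inverted lists. The linear-merge intersection below exploits the ascending order discussed on the next slide (function names are illustrative):

```python
# Boolean operations over sorted inverted lists.

def intersect(a, b):
    """AND: linear merge of two ascending document-number lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: merge two ascending lists without duplicates."""
    return sorted(set(a) | set(b))

print(intersect([3, 5, 20, 21], [5, 9, 21, 76]))   # -> [5, 21]
print(union([3, 5], [5, 9]))                       # -> [3, 5, 9]
```

Both operations walk each list at most once, which is the linear-time behaviour that sorted storage makes possible.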
11 Inverted file indexing
- The inverted lists are usually stored in order of increasing document number
- various merging operations can then be performed in time that is linear in the size of the lists
12 Inverted file indexing: granularity
- A coarse-grained index might identify only a block of text, where each block stores several documents
- a moderate-grain index will store locations in terms of document numbers
- a fine-grained index will return a sentence or word number
13 Inverted file indexing: granularity
- Coarse indexes
  - require less storage, but during retrieval more of the plain text must be scanned to find terms
  - multiterm queries are more likely to give rise to false matches, where each of the desired terms appears somewhere in the block, but not all within the same document
14 Inverted file indexing: granularity
- Word-level indexing
  - enables queries involving adjacency and proximity to be answered quickly, because the desired relationship can be checked before the text is retrieved
- adding precise locational information expands the index
  - more pointers in the index
  - each pointer requires more bits of storage
15 Inverted file indexing: granularity
- Unless a significant fraction of the queries are expected to be proximity-based, the usual granularity is individual documents
- phrase-based queries can be handled by the slightly slower method of a post-retrieval scan
16 Inverted file compression
- Uncompressed inverted files can consume considerable space
  - 50-100% of the space of the text itself
- the size of an inverted file can be reduced considerably by compressing it
- key to compression
  - each inverted list can, without any loss of generality, be stored as an ascending sequence of integers
17 Inverted file compression
- Suppose that some term appears in 8 documents of a collection; the term is described in the inverted file by the list
  <8; 3, 5, 20, 21, 23, 76, 77, 78>
  the address of which is contained in the lexicon
- more generally, the list for a term t stores the number of documents f_t in which the term appears, and then a list of f_t document numbers
18 Inverted file compression
- The list of document numbers within each inverted list is in ascending order, and all processing is sequential from the beginning of the list
- -> the list can be stored as an initial position followed by a list of d-gaps
- the list for the term above becomes
  <8; 3, 2, 15, 1, 2, 53, 1, 1>
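The conversion between the two forms is a simple differencing step, sketched below on the example list from these slides (function names are illustrative):

```python
# Converting an inverted list to d-gaps and back:
# <8; 3, 5, 20, 21, 23, 76, 77, 78> becomes <8; 3, 2, 15, 1, 2, 53, 1, 1>.

def to_gaps(doc_numbers):
    """Keep the first document number, then store differences to the previous one."""
    return [doc_numbers[0]] + [b - a for a, b in zip(doc_numbers, doc_numbers[1:])]

def from_gaps(gaps):
    """Recover the original document numbers by a running sum."""
    docs, total = [], 0
    for g in gaps:
        total += g
        docs.append(total)
    return docs

doc_list = [3, 5, 20, 21, 23, 76, 77, 78]
gaps = to_gaps(doc_list)
print(gaps)                        # -> [3, 2, 15, 1, 2, 53, 1, 1]
assert from_gaps(gaps) == doc_list
```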
19 Inverted file compression
- The two forms are equivalent, but it is not obvious that any saving has been achieved
- the largest d-gap in the second representation is still potentially the same as the largest document number in the first
- if there are N documents in the collection and a flat binary encoding is used to represent the gap sizes, both methods require ⌈log N⌉ bits per stored pointer
20 Inverted file compression
- Considering each inverted list as a list of d-gaps, the sum of which is bounded by N, allows an improved representation
- -> it is possible to code inverted lists using, on average, substantially fewer than ⌈log N⌉ bits per pointer
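One well-known code that achieves this (not named on these slides, so treat it as an illustrative choice) is the Elias gamma code: a gap x is written as its binary length minus one in unary zeros, followed by the binary digits of x. Small gaps, which dominate the lists of frequent terms, get very short codewords.

```python
# Elias gamma coding of d-gaps, as bit strings.

def gamma_encode(x):
    """Gamma code of a positive integer: unary length prefix + binary digits."""
    binary = bin(x)[2:]                       # binary digits, leading '1' first
    return "0" * (len(binary) - 1) + binary   # e.g. 5 = '101' -> '00101'

def gamma_decode(bits):
    """Decode one gamma codeword from the front of a bit string."""
    zeros = 0
    while bits[zeros] == "0":                 # count the unary length prefix
        zeros += 1
    value = int(bits[zeros : 2 * zeros + 1], 2)
    return value, bits[2 * zeros + 1 :]       # value and the remaining bits

print(gamma_encode(1))    # -> '1'      (1 bit instead of ceil(log N))
print(gamma_encode(5))    # -> '00101'
print(gamma_decode(gamma_encode(53) + gamma_encode(1)))   # -> (53, '1')
```

Gamma coding is a global method in the terminology of the next slide: every list is coded with the same model, regardless of term frequency.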
21 Inverted file compression
- Many specific models have been proposed
- global methods
  - every inverted list is compressed using the same common model
- local methods
  - the model is adjusted according to some parameter, usually term frequency
  - tend to outperform global ones, but are more complex to implement
22 Querying
- How to use an index to locate information in the
text it describes?
23 Boolean queries
- A Boolean query comprises a list of terms that are combined using the connectives AND, OR, and NOT
- the answers to the query are those documents that satisfy the condition
24 Boolean queries
- e.g. text AND compression AND retrieval
- all three words must occur somewhere in every answer (in no particular order)
  - "the compression and retrieval of large amounts of text is an interesting problem"
  - "this text describes the fractional distillation scavenging technique for retrieving argon from compressed air"
25 Boolean queries
- A problem with all retrieval systems
  - non-relevant answers are returned
  - they must be filtered out manually
- broad query -> high recall
- narrow query -> high precision
26 Boolean queries
- Small variations in a query can generate very different results
  - data AND compression AND retrieval
  - text AND compression AND retrieval
- the user should be able to pose complex queries like
  - (text OR data OR image) AND (compression OR compaction OR decompression) AND (archiving OR retrieval OR storage)
27 Ranked queries
- Non-professional users might prefer simply giving a list of words that are of interest and letting the retrieval system supply the documents that seem most relevant, rather than seeking exact Boolean answers
  - text, data, image, compression, compaction, archiving, storage, retrieval, ...
28 Ranked queries
- It would be useless to convert a list of words to a Boolean query
  - connecting with AND -> too few documents
  - connecting with OR -> too many documents
- solution: a ranked query
  - a heuristic is applied to measure the similarity of each document to the query
  - the r most closely matching documents are returned
29 Ranking strategies
- Simple techniques
  - count the number of query terms that appear somewhere in the document
  - a document that contains 5 query terms is ranked higher than a document that contains 3 query terms
- more advanced techniques
  - the cosine measure
  - takes into account the lengths of the documents, etc.
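The simple counting technique can be computed straight from an inverted file, without touching the documents themselves. A sketch, assuming a toy index (the function name `rank_by_term_count` and the parameter `r` for the number of results are illustrative):

```python
# Rank documents by how many distinct query terms they contain,
# using only the inverted lists.

from collections import Counter

index = {                       # toy inverted file: term -> document numbers
    "text":        [1, 2, 4],
    "compression": [1, 3, 4],
    "retrieval":   [2, 4],
}

def rank_by_term_count(query_terms, index, r=3):
    """Score each document by the number of query terms it contains."""
    scores = Counter()
    for term in query_terms:
        for doc in index.get(term, []):
            scores[doc] += 1
    return scores.most_common(r)      # the r best-matching documents

print(rank_by_term_count(["text", "compression", "retrieval"], index))
# document 4 contains all three terms, so it ranks first with score 3
```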
30 Accessing the lexicon
- The lexicon for an inverted file index stores
  - the terms that can be used to search the collection
  - the information needed to allow queries to be processed
    - the address in the inverted file of the corresponding list of document numbers
    - the number of documents containing the term
31 Access structures
- A simple structure
  - an array of records, each comprising a string along with two integer fields
  - if the lexicon is sorted, a word can be located by a binary search of the strings
- consumes a lot of space
  - e.g. a lexicon of one million words (from a 5 GB collection), stored as 20-byte strings with a 4-byte inverted file address and a 4-byte frequency value -> 28 MB
32 Access structures
- The space for the strings is reduced if they are all concatenated into one long contiguous string
  - an array of 4-byte character pointers is used for access
  - each term takes its exact number of characters, plus 4 for the pointer
  - it is not necessary to store string lengths: the next pointer indicates the end of the string
- for the lexicon of one million terms, the memory drops by 8 MB, from 28 MB to 20 MB
33 Access structures
- The memory required can be further reduced by eliminating many of the string pointers
  - only 1 word in 4 is indexed by a pointer, and each stored word is prefixed by a 1-byte length field
  - the length field allows the start of the next string to be identified and the block of strings traversed
34 Access structures
- In each 4-word group, 12 bytes of pointers are saved
  - at the cost of including 4 bytes of length information
- for the million-word lexicon, a saving of 2 MB: 20 MB -> 18 MB
35 Access structures
- Blocking makes the search process more complex; to look up a term
  - the array of string pointers is binary-searched to locate the correct block of words
  - the block is scanned in a linear fashion to find the term
  - the term's ordinal term number is inferred from the combination of the block number and the position within the block
  - the frequency value and inverted file address are accessed using the ordinal term number
36 Access structures
- Consecutive words in a sorted list are likely to share a common prefix
- front coding
  - two integers are stored with each word
    - one to indicate how many prefix characters are the same as in the previous word
    - the other to record how many suffix characters remain when the prefix is removed
  - the integers are followed by the suffix characters
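The encoding step can be sketched as follows; the word list is a made-up example of consecutive sorted words with long shared prefixes, and the function name is illustrative:

```python
# Front coding of a sorted word list: each word becomes
# (shared prefix length, suffix length, suffix characters).

def front_code(words):
    coded, prev = [], ""
    for w in words:
        p = 0                         # length of the prefix shared with prev
        while p < min(len(prev), len(w)) and prev[p] == w[p]:
            p += 1
        coded.append((p, len(w) - p, w[p:]))
        prev = w
    return coded

print(front_code(["jezebel", "jezer", "jezerit", "jeziah", "jeziel"]))
# -> [(0, 7, 'jezebel'), (4, 1, 'r'), (5, 2, 'it'), (3, 3, 'iah'), (4, 2, 'el')]
```

Only 14 suffix characters remain of the original 30, which is where the roughly 40 percent string-storage saving mentioned on the next slide comes from.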
37 Access structures
- Front coding yields a net saving of about 40 percent of the space required for string storage in a typical lexicon for the English language
- a problem with complete front coding
  - binary search is no longer possible
- solution: partial 3-in-4 front coding
38 Access structures
- Partial 3-in-4 front coding
  - every 4th word (the one indexed by the block pointer) is stored without front coding, so that binary search can proceed
  - on a large lexicon, expected to save about 4 bytes on each of three words, at the cost of 2 extra bytes of prefix-length information
  - a net gain of 10 bytes per 4-word block
  - for the million-word lexicon -> 15.5 MB
39 Disk-based lexicon storage
- The amount of primary memory required by the lexicon can be reduced by putting the lexicon on disk
- just enough information is retained in primary memory to identify the disk block corresponding to each term
40 Disk-based lexicon storage
- To locate the information corresponding to a given term
  - the in-memory index is searched to determine a block number
  - the block is read into a buffer
  - the search is continued within the block
- a B-tree etc. can be used
41 Disk-based lexicon storage
- This approach is simple and requires a minimal amount of primary memory
- a disk-based lexicon is many times slower to access than a memory-based one
  - one disk access per lookup is required
  - the extra time is tolerable when just a few terms are being looked up (as in normal query processing, fewer than 50 terms)
  - not suitable for the index construction process
42 Boolean query processing
- Processing a query
  - the lexicon is searched for each term in the query
  - each inverted list is retrieved and decoded
  - the lists are merged, taking the intersection, union, or complement as appropriate
  - finally, the documents are retrieved and displayed
43 Conjunctive queries
- text AND compression AND retrieval
- a conjunctive query of r terms is processed as follows
  - each term is stemmed and located in the lexicon
  - if the lexicon is on disk, one disk access per term is required
  - the terms are sorted by increasing frequency
44 Conjunctive queries
- The inverted list for the least frequent term is read into memory
- this list forms the set of candidates (documents that have not yet been eliminated and might be answers to the query)
- all remaining inverted lists are processed against this set of candidates, in increasing order of term frequency
45 Conjunctive queries
- In a conjunctive query, a candidate cannot be an answer unless it appears in all inverted lists
- -> the size of the set of candidates is non-increasing
- to process a term, each document in the set of candidates is checked and removed if it does not appear in the term's inverted list
- the remaining candidates are the answers
46 Term processing order
- Reasons to select the least frequent term to initialize the set of candidates (and to process the rest in increasing frequency order)
  - to minimize the amount of temporary memory space required during query processing
  - the number of candidates may be quickly reduced, even to zero, after which no further processing is required
47 Processing ranked queries
- How to assign a similarity measure to each
document that indicates how closely it matches a
query?
48 Coordinate matching
- Count the number of query terms that appear in each document
- the more terms that appear, the more likely it is that the document is relevant
- a hybrid between a conjunctive AND query and a disjunctive OR query
  - a document that contains any of the terms is a potential answer, but preference is given to documents that contain all or most of them
49 Inner product similarity
- Coordinate matching can be formalized as an inner product of a query vector with a set of document vectors
- the similarity measure of query Q with document D_d is expressed as
  M(Q, D_d) = Q · D_d
- the inner product of two n-vectors X and Y is
  X · Y = Σ_{i=1..n} x_i · y_i
50 Drawbacks
- Takes no account of term frequency
  - documents with many occurrences of a term should be favored
- takes no account of term scarcity
  - rare terms should have more weight?
- long documents with many terms are automatically favored
  - they are likely to contain more of any given list of query terms
51 Solutions
- Term frequency
  - the binary present/not-present judgment can be replaced with an integer indicating how many times the term appears in the document
  - f_{d,t}: within-document frequency
- more generally, a term t
  - in document d can be assigned a document-term weight w_{d,t}
  - and a query-term weight w_{q,t}
52 Solutions
- The similarity measure is the inner product of these two weight vectors
- it is normal to assign w_{q,t} = 0 if t does not appear in Q, so the measure can be stated as
  M(Q, D_d) = Σ_{t ∈ Q} w_{q,t} · w_{d,t}
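Because w_{q,t} = 0 for terms outside the query, only the query's own terms need to be touched. A sketch with made-up weights (the function name and the example vectors are illustrative):

```python
# Weighted inner product similarity: sum w_{q,t} * w_{d,t} over query terms.

def inner_product(query_weights, doc_weights):
    """M(Q, D_d) = sum over the query's terms of w_{q,t} * w_{d,t}."""
    return sum(wq * doc_weights.get(t, 0.0)
               for t, wq in query_weights.items())

q = {"text": 1.0, "compression": 2.0}
d = {"text": 3.0, "compression": 1.0, "methods": 5.0}
print(inner_product(q, d))   # 1*3 + 2*1 -> 5.0
```

Note that the weight of "methods" never enters the sum: a term absent from the query contributes nothing, exactly as the w_{q,t} = 0 convention dictates.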
53 Inverse document frequency
- If only the term frequency is taken into account, and a query contains common words, a document with enough appearances of a common term is always ranked first, irrespective of other words
- -> terms can be weighted according to their inverse document frequency
54 Weighting
- Many possibilities exist for combining term frequency and inverse document frequency
- principles
  - a term that appears in many documents should not be regarded as more important than a term that appears in a few
  - a document with many occurrences of a term should not be regarded as less important than a document that has just a few
55 Weighting
- For instance, with N documents and a term t appearing in f_t of them
  - TF-IDF for w_{d,t}: e.g. w_{d,t} = f_{d,t} · log(N / f_t)
  - IDF for w_{q,t}: e.g. w_{q,t} = log(N / f_t)
56 Similarity of vectors
- Long documents should not be favored over short documents
- the similarity of the direction indicated by the two vectors is measured
- similarity is defined as the cosine of the angle θ between the document and query vectors
  - cos θ = 1, when θ = 0
  - cos θ = 0, when the vectors are orthogonal
57 Similarity of vectors
- The cosine of the angle can be calculated as
  cos θ = (X · Y) / (|X| · |Y|)
- |X| is the length of vector X
- |X| · |Y| is the normalization factor
58 Similarity of vectors
- Cosine rule for ranking
  cosine(Q, D_d) = (1 / (W_q · W_d)) · Σ_{t ∈ Q} w_{q,t} · w_{d,t}
- where W_d = sqrt(Σ_t w_{d,t}²)
- and W_q = sqrt(Σ_t w_{q,t}²)
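The cosine rule can be sketched directly over weight vectors stored as dictionaries (the function name and the toy weights are illustrative, not from the source):

```python
# Cosine similarity: the weighted inner product divided by the product
# of the two vector lengths (the normalization factor).

import math

def cosine(query_weights, doc_weights):
    dot = sum(wq * doc_weights.get(t, 0.0) for t, wq in query_weights.items())
    wq_len = math.sqrt(sum(w * w for w in query_weights.values()))
    wd_len = math.sqrt(sum(w * w for w in doc_weights.values()))
    if wq_len == 0 or wd_len == 0:
        return 0.0
    return dot / (wq_len * wd_len)

q = {"text": 1.0}
d1 = {"text": 2.0}                     # short document, entirely on-topic
d2 = {"text": 2.0, "other": 10.0}      # long document, mostly off-topic
print(cosine(q, d1) > cosine(q, d2))   # -> True: length no longer wins
```

The comparison at the end shows the point of normalization: the longer document d2 has the same raw inner product as d1 but a much smaller cosine.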
59 Index construction
- Each document of the collection contains some index terms, and each index term appears in some of the documents
- this relationship can be expressed with a frequency matrix
  - each column corresponds to one word
  - each row corresponds to one document
  - the number stored at any row and column is the frequency, in that document, of the word indicated by that column
60 Index construction
- Each document of the collection is summarized in one row of the frequency matrix
- to create an index, the matrix must be transposed, forming a new version in which the rows correspond to term numbers
- from this form, an inverted file index is easy to construct
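For a collection small enough to fit in memory, the transposition amounts to reading documents row by row and appending (document, frequency) pairs to per-term lists. A sketch with an illustrative two-document collection:

```python
# Transposing document-ordered data into term order: each document is one
# "row" of the frequency matrix; the result is the inverted file directly.

from collections import Counter, defaultdict

documents = {
    1: "text compression text",
    2: "compression of images",
}

def invert(docs):
    inverted = defaultdict(list)     # term -> [(doc number, frequency), ...]
    for doc_num in sorted(docs):     # read the text in document order
        for term, freq in Counter(docs[doc_num].split()).items():
            inverted[term].append((doc_num, freq))
    return dict(inverted)

inv = invert(documents)
print(inv["text"])          # -> [(1, 2)]
print(inv["compression"])   # -> [(1, 1), (2, 1)]
```

This is the trivial in-memory algorithm of the next slide; as the slides after it note, it breaks down when the matrix no longer fits in memory.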
61 Index construction
- Trivial algorithm
  - build in memory a transposed frequency matrix, reading the text in document order, one column of the matrix at a time
  - write the matrix to disk row by row, in term order
62 Index construction
- In reality, inversion is much more difficult
- the problem is the size of the frequency matrix
  - for instance, for a collection with 535,346 distinct terms and 741,856 documents, the size of the matrix can be 1.4 terabytes
- we could use a machine with a large virtual memory -> it would take 2 months
63 Index construction
- More economical methods for constructing and inverting a frequency matrix exist
- an index for the large collection mentioned above could be created in less than 2 hours (in 1998) on a personal computer, consuming just 30 MB of main memory and less than 20 MB of temporary disk space beyond the space required by the final inverted file
64 Final words
- We have discussed
- character sets
- preprocessing of text
- feature selection
- text categorization
- text summarization
- text compression
- indexing, querying
65 Final words
- What else there is...
  - structured documents (XML, ...)
  - metadata (semantic Web, ontologies, ...)
  - linguistic resources (WordNet, thesauri, ...)
  - document management systems (archiving)
  - document analysis (scanning of documents)
  - digital libraries
  - text mining, question answering, ...
66 Administrative...
- Exam on Tuesday 4.12.
  - large essays (2-3 pages each)
  - data comprehension (e.g. recall/precision)
  - use full sentences!
- Exercise points
  - 28 or more original points -> 30 pts
  - otherwise original points 2
- Remember the course feedback survey (Kurssikysely)!