Title: Processing of large document collections
1 Processing of large document collections
2 In this part
- indexing
- querying
- index construction
3 Indexing
- An index is a mechanism for locating a given term in a text
- index in a book: it is possible to find information without browsing the pages
- in large document collections (gigabytes), a page-by-page search would be impossible
4 Indexing
- We assume that
  - a document collection consists of a set of separate documents
  - each document is described by a set of representative terms
  - the index must be capable of identifying all documents that contain combinations of specified terms
  - a document is the unit of text that is returned in response to queries
5 Indexing
- What is a document?
- e.g. emails
  - sender, recipient, subject, message body
  - one email, one field, a set of emails?
6 Indexing
- Granularity of the index: the resolution to which term locations are recorded within each document
- e.g. one email = one document, but the index could be capable of ascertaining a more exact location of each term within the document
  - which documents contain the terms tax and avoidance in the same sentence?
7 Indexing
- If the granularity of the index is taken to be one word, then the index will record the exact location of every word in the collection
  - the original text can be recovered from the index
  - the index takes more space than the original text
8 Indexing
- Choice of representative terms
  - if each word that appears in the documents is included verbatim as a term in the index, the number of terms is huge
- usually some transformations are applied
  - case folding
  - stemming, baseword reduction
  - removal of stopwords
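The transformations listed above can be sketched in a few lines of Python. This is only illustrative: the stopword list is a tiny sample, and the suffix-stripping loop is a crude stand-in for a real stemmer (the names `normalize` and `extract_terms` are not from the source).

```python
# Toy term transformations: case folding, stopword removal, and a crude
# suffix-stripping stand-in for stemming / baseword reduction.

STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}  # illustrative sample

def normalize(word):
    """Case-fold a word and strip a few common English suffixes (toy stemmer)."""
    word = word.lower()                    # case folding
    for suffix in ("ing", "ed", "s"):      # toy baseword reduction
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_terms(text):
    """Turn raw text into index terms: tokenize, drop stopwords, normalize."""
    tokens = text.lower().split()
    return [normalize(t) for t in tokens if t not in STOPWORDS]

print(extract_terms("The compression and retrieval of large texts"))
# -> ['compression', 'retrieval', 'large', 'text']
```

Note how the transformations shrink the term set: "texts" and "text" collapse to a single index term.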
9 Inverted file indexing
- An inverted file contains, for each term in the lexicon, an inverted list that stores pointers to all occurrences of that term in the main text
  - each pointer is the number of a document in which that term appears
- a lexicon: a list of all terms that appear in the document collection
  - supports mapping from terms to their corresponding inverted lists
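A minimal in-memory version of this structure, at document-level granularity, can be sketched as follows (the function name `build_inverted_file` and the toy collection are illustrative, not from the source):

```python
# A minimal inverted file: the lexicon maps each term to its inverted list,
# a sorted list of the numbers of the documents containing the term.

from collections import defaultdict

documents = {
    1: "text compression methods",
    2: "retrieval of compressed text",
    3: "image compression",
}

def build_inverted_file(docs):
    index = defaultdict(list)
    for doc_num in sorted(docs):           # process documents in order, so
        for term in set(docs[doc_num].split()):
            index[term].append(doc_num)    # each inverted list stays ascending
    return dict(index)

index = build_inverted_file(documents)
print(index["compression"])   # -> [1, 3]
```

No stemming is applied here, which is why "compressed" in document 2 does not land on the list for "compression".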
10 Inverted file indexing
- A query involving a single term is answered by scanning its inverted list and retrieving every document that it cites
- for conjunctive Boolean queries of the form term AND term AND ... AND term, the intersection of the terms' inverted lists is formed
- for disjunction (OR), the union of the lists
- for negation (NOT), the complement
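The AND and OR operations above can be sketched directly on sorted inverted lists. The linear-merge intersection below exploits the ascending order discussed on the next slide (function names are illustrative):

```python
# Boolean operations over sorted inverted lists.

def intersect(a, b):
    """AND: linear merge of two ascending document-number lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: merge two ascending lists without duplicates."""
    return sorted(set(a) | set(b))

print(intersect([3, 5, 20, 21], [5, 9, 21, 76]))   # -> [5, 21]
print(union([3, 5], [5, 9]))                       # -> [3, 5, 9]
```

Both operations walk each list at most once, which is the linear-time behaviour that sorted storage makes possible.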
11 Inverted file indexing
- The inverted lists are usually stored in order of increasing document number
- various merging operations can then be performed in time that is linear in the size of the lists
12 Inverted file indexing: granularity
- A coarse-grained index might identify only a block of text, where each block stores several documents
- a moderate-grain index will store locations in terms of document numbers
- a fine-grained index will return a sentence or word number
13 Inverted file indexing: granularity
- Coarse indexes
  - require less storage, but during retrieval more of the plain text must be scanned to find terms
  - multiterm queries are more likely to give rise to false matches, where each of the desired terms appears somewhere in the block, but not all within the same document
14 Inverted file indexing: granularity
- Word-level indexing
  - enables queries involving adjacency and proximity to be answered quickly, because the desired relationship can be checked before the text is retrieved
- adding precise locational information expands the index
  - more pointers in the index
  - each pointer requires more bits of storage
15 Inverted file indexing: granularity
- Unless a significant fraction of the queries are expected to be proximity-based, the usual granularity is individual documents
- phrase-based queries can be handled by the slightly slower method of a post-retrieval scan
16 Inverted file compression
- Uncompressed inverted files can consume considerable space
  - 50-100% of the space of the text itself
- the size of an inverted file can be reduced considerably by compressing it
- key to compression
  - each inverted list can, without any loss of generality, be stored as an ascending sequence of integers
17 Inverted file compression
- Suppose that some term appears in 8 documents of a collection; the term is described in the inverted file by the list
  <8; 3, 5, 20, 21, 23, 76, 77, 78>
  the address of which is contained in the lexicon
- more generally, the list for a term t stores the number of documents f_t in which the term appears, and then a list of f_t document numbers
18 Inverted file compression
- The list of document numbers within each inverted list is in ascending order, and all processing is sequential from the beginning of the list
- -> the list can be stored as an initial position followed by a list of d-gaps
- the list for the term above becomes
  <8; 3, 2, 15, 1, 2, 53, 1, 1>
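The conversion between the two forms is a simple differencing step, sketched below on the example list from these slides (function names are illustrative):

```python
# Converting an inverted list to d-gaps and back:
# <8; 3, 5, 20, 21, 23, 76, 77, 78> becomes <8; 3, 2, 15, 1, 2, 53, 1, 1>.

def to_gaps(doc_numbers):
    """Keep the first document number, then store differences to the previous one."""
    return [doc_numbers[0]] + [b - a for a, b in zip(doc_numbers, doc_numbers[1:])]

def from_gaps(gaps):
    """Recover the original document numbers by a running sum."""
    docs, total = [], 0
    for g in gaps:
        total += g
        docs.append(total)
    return docs

doc_list = [3, 5, 20, 21, 23, 76, 77, 78]
gaps = to_gaps(doc_list)
print(gaps)                        # -> [3, 2, 15, 1, 2, 53, 1, 1]
assert from_gaps(gaps) == doc_list
```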
19 Inverted file compression
- The two forms are equivalent, but it is not obvious that any saving has been achieved
- the largest d-gap in the second representation is still potentially the same as the largest document number in the first
- if there are N documents in the collection and a flat binary encoding is used to represent the gap sizes, both methods require ⌈log N⌉ bits per stored pointer
20 Inverted file compression
- Considering each inverted list as a list of d-gaps, the sum of which is bounded by N, allows an improved representation
- -> it is possible to code inverted lists using, on average, substantially fewer than ⌈log N⌉ bits per pointer
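One well-known code that achieves this (not named on these slides, so treat it as an illustrative choice) is the Elias gamma code: a gap x is written as its binary length minus one in unary zeros, followed by the binary digits of x. Small gaps, which dominate the lists of frequent terms, get very short codewords.

```python
# Elias gamma coding of d-gaps, as bit strings.

def gamma_encode(x):
    """Gamma code of a positive integer: unary length prefix + binary digits."""
    binary = bin(x)[2:]                       # binary digits, leading '1' first
    return "0" * (len(binary) - 1) + binary   # e.g. 5 = '101' -> '00101'

def gamma_decode(bits):
    """Decode one gamma codeword from the front of a bit string."""
    zeros = 0
    while bits[zeros] == "0":                 # count the unary length prefix
        zeros += 1
    value = int(bits[zeros : 2 * zeros + 1], 2)
    return value, bits[2 * zeros + 1 :]       # value and the remaining bits

print(gamma_encode(1))    # -> '1'      (1 bit instead of ceil(log N))
print(gamma_encode(5))    # -> '00101'
print(gamma_decode(gamma_encode(53) + gamma_encode(1)))   # -> (53, '1')
```

Gamma coding is a global method in the terminology of the next slide: every list is coded with the same model, regardless of term frequency.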
21 Inverted file compression
- Many specific models have been proposed
- global methods
  - every inverted list is compressed using the same common model
- local methods
  - the model is adjusted according to some parameter, usually term frequency
  - tend to outperform global ones, but are more complex to implement
22 Querying
- How to use an index to locate information in the
text it describes?
23 Boolean queries
- A Boolean query comprises a list of terms that are combined using the connectives AND, OR, and NOT
- the answers to the query are those documents that satisfy the condition
24 Boolean queries
- e.g. text AND compression AND retrieval
- all three words must occur somewhere in every answer (in no particular order)
  - "the compression and retrieval of large amounts of text is an interesting problem"
  - "this text describes the fractional distillation scavenging technique for retrieving argon from compressed air"
25 Boolean queries
- A problem with all retrieval systems
  - non-relevant answers are returned
  - they must be filtered out manually
- broad query -> high recall
- narrow query -> high precision
26 Boolean queries
- Small variations in a query can generate very different results
  - data AND compression AND retrieval
  - text AND compression AND retrieval
- the user should be able to pose complex queries like
  - (text OR data OR image) AND (compression OR compaction OR decompression) AND (archiving OR retrieval OR storage)
27 Ranked queries
- Non-professional users might prefer simply giving a list of words that are of interest and letting the retrieval system supply the documents that seem most relevant, rather than seeking exact Boolean answers
  - text, data, image, compression, compaction, archiving, storage, retrieval, ...
28 Ranked queries
- It would be useless to convert a list of words to a Boolean query
  - connecting with AND -> too few documents
  - connecting with OR -> too many documents
- solution: a ranked query
  - a heuristic is applied to measure the similarity of each document to the query
  - the r most closely matching documents are returned
29 Ranking strategies
- Simple techniques
  - count the number of query terms that appear somewhere in the document
  - a document that contains 5 query terms is ranked higher than a document that contains 3 query terms
- more advanced techniques
  - the cosine measure
  - takes into account the lengths of the documents, etc.
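The simple counting technique can be computed straight from an inverted file, without touching the documents themselves. A sketch, assuming a toy index (the function name `rank_by_term_count` and the parameter `r` for the number of results are illustrative):

```python
# Rank documents by how many distinct query terms they contain,
# using only the inverted lists.

from collections import Counter

index = {                       # toy inverted file: term -> document numbers
    "text":        [1, 2, 4],
    "compression": [1, 3, 4],
    "retrieval":   [2, 4],
}

def rank_by_term_count(query_terms, index, r=3):
    """Score each document by the number of query terms it contains."""
    scores = Counter()
    for term in query_terms:
        for doc in index.get(term, []):
            scores[doc] += 1
    return scores.most_common(r)      # the r best-matching documents

print(rank_by_term_count(["text", "compression", "retrieval"], index))
# document 4 contains all three terms, so it ranks first with score 3
```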
30 Accessing the lexicon
- The lexicon for an inverted file index stores
  - the terms that can be used to search the collection
  - the information needed to allow queries to be processed
    - the address in the inverted file of the corresponding list of document numbers
    - the number of documents containing the term
31 Access structures
- A simple structure
  - an array of records, each comprising a string along with two integer fields
  - if the lexicon is sorted, a word can be located by a binary search of the strings
- consumes a lot of space
  - e.g. a lexicon of one million words (from a 5 GB collection), stored as 20-byte strings with a 4-byte inverted file address and a 4-byte frequency value -> 28 MB
32 Access structures
- The space for the strings is reduced if they are all concatenated into one long contiguous string
  - an array of 4-byte character pointers is used for access
  - each term takes its exact number of characters, plus 4 for the pointer
  - it is not necessary to store string lengths: the next pointer indicates the end of the string
- for the lexicon of one million terms, the memory drops by 8 MB, from 28 MB to 20 MB
33 Access structures
- The memory required can be further reduced by eliminating many of the string pointers
  - only 1 word in 4 is indexed by a pointer, and each stored word is prefixed by a 1-byte length field
  - the length field allows the start of the next string to be identified and the block of strings traversed
34 Access structures
- In each 4-word group, 12 bytes of pointers are saved
  - at the cost of including 4 bytes of length information
- for the million-word lexicon, a saving of 2 MB: 20 MB -> 18 MB
35 Access structures
- Blocking makes the search process more complex; to look up a term
  - the array of string pointers is binary-searched to locate the correct block of words
  - the block is scanned in a linear fashion to find the term
  - the term's ordinal term number is inferred from the combination of the block number and the position within the block
  - the frequency value and inverted file address are accessed using the ordinal term number
36 Access structures
- Consecutive words in a sorted list are likely to share a common prefix
- front coding
  - two integers are stored with each word
    - one to indicate how many prefix characters are the same as in the previous word
    - the other to record how many suffix characters remain when the prefix is removed
  - the integers are followed by the suffix characters
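The encoding step can be sketched as follows; the word list is a made-up example of consecutive sorted words with long shared prefixes, and the function name is illustrative:

```python
# Front coding of a sorted word list: each word becomes
# (shared prefix length, suffix length, suffix characters).

def front_code(words):
    coded, prev = [], ""
    for w in words:
        p = 0                         # length of the prefix shared with prev
        while p < min(len(prev), len(w)) and prev[p] == w[p]:
            p += 1
        coded.append((p, len(w) - p, w[p:]))
        prev = w
    return coded

print(front_code(["jezebel", "jezer", "jezerit", "jeziah", "jeziel"]))
# -> [(0, 7, 'jezebel'), (4, 1, 'r'), (5, 2, 'it'), (3, 3, 'iah'), (4, 2, 'el')]
```

Only 14 suffix characters remain of the original 30, which is where the roughly 40 percent string-storage saving mentioned on the next slide comes from.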
37 Access structures
- Front coding yields a net saving of about 40 percent of the space required for string storage in a typical lexicon for the English language
- a problem with complete front coding
  - binary search is no longer possible
- solution: partial 3-in-4 front coding
38 Access structures
- Partial 3-in-4 front coding
  - every 4th word (the one indexed by the block pointer) is stored without front coding, so that binary search can proceed
  - on a large lexicon, expected to save about 4 bytes on each of three words, at the cost of 2 extra bytes of prefix-length information
  - a net gain of 10 bytes per 4-word block
  - for the million-word lexicon -> 15.5 MB
39 Disk-based lexicon storage
- The amount of primary memory required by the lexicon can be reduced by putting the lexicon on disk
- just enough information is retained in primary memory to identify the disk block corresponding to each term
40 Disk-based lexicon storage
- To locate the information corresponding to a given term
  - the in-memory index is searched to determine a block number
  - the block is read into a buffer
  - the search is continued within the block
- a B-tree etc. can be used
41 Disk-based lexicon storage
- This approach is simple and requires a minimal amount of primary memory
- a disk-based lexicon is many times slower to access than a memory-based one
  - one disk access per lookup is required
  - the extra time is tolerable when just a few terms are being looked up (as in normal query processing, fewer than 50 terms)
  - not suitable for the index construction process
42 Boolean query processing
- Processing a query
  - the lexicon is searched for each term in the query
  - each inverted list is retrieved and decoded
  - the lists are merged, taking the intersection, union, or complement as appropriate
  - finally, the documents are retrieved and displayed
43 Conjunctive queries
- text AND compression AND retrieval
- a conjunctive query of r terms is processed as follows
  - each term is stemmed and located in the lexicon
  - if the lexicon is on disk, one disk access per term is required
  - the terms are sorted by increasing frequency
44 Conjunctive queries
- The inverted list for the least frequent term is read into memory
- this list forms the set of candidates (documents that have not yet been eliminated and might be answers to the query)
- all remaining inverted lists are processed against this set of candidates, in increasing order of term frequency
45 Conjunctive queries
- In a conjunctive query, a candidate cannot be an answer unless it appears in all inverted lists
- -> the size of the set of candidates is non-increasing
- to process a term, each document in the set of candidates is checked and removed if it does not appear in the term's inverted list
- the remaining candidates are the answers
46 Term processing order
- Reasons to select the least frequent term to initialize the set of candidates (and to process the rest in increasing frequency order)
  - to minimize the amount of temporary memory space required during query processing
  - the number of candidates may be quickly reduced, even to zero, after which no further processing is required
47 Processing ranked queries
- How to assign a similarity measure to each
document that indicates how closely it matches a
query?
48 Coordinate matching
- Count the number of query terms that appear in each document
- the more terms that appear, the more likely it is that the document is relevant
- a hybrid between a conjunctive AND query and a disjunctive OR query
  - a document that contains any of the terms is a potential answer, but preference is given to documents that contain all or most of them
49 Inner product similarity
- Coordinate matching can be formalized as an inner product of a query vector with a set of document vectors
- the similarity measure of query Q with document D_d is expressed as
  M(Q, D_d) = Q · D_d
- the inner product of two n-vectors X and Y is
  X · Y = Σ_{i=1..n} x_i · y_i
50 Drawbacks
- Takes no account of term frequency
  - documents with many occurrences of a term should be favored
- takes no account of term scarcity
  - rare terms should have more weight?
- long documents with many terms are automatically favored
  - they are likely to contain more of any given list of query terms
51 Solutions
- Term frequency
  - the binary present/not-present judgment can be replaced with an integer indicating how many times the term appears in the document
  - f_{d,t}: within-document frequency
- more generally, a term t
  - in document d can be assigned a document-term weight w_{d,t}
  - and a query-term weight w_{q,t}
52 Solutions
- The similarity measure is the inner product of these two weight vectors
- it is normal to assign w_{q,t} = 0 if t does not appear in Q, so the measure can be stated as
  M(Q, D_d) = Σ_{t ∈ Q} w_{q,t} · w_{d,t}
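Because w_{q,t} = 0 for terms outside the query, only the query's own terms need to be touched. A sketch with made-up weights (the function name and the example vectors are illustrative):

```python
# Weighted inner product similarity: sum w_{q,t} * w_{d,t} over query terms.

def inner_product(query_weights, doc_weights):
    """M(Q, D_d) = sum over the query's terms of w_{q,t} * w_{d,t}."""
    return sum(wq * doc_weights.get(t, 0.0)
               for t, wq in query_weights.items())

q = {"text": 1.0, "compression": 2.0}
d = {"text": 3.0, "compression": 1.0, "methods": 5.0}
print(inner_product(q, d))   # 1*3 + 2*1 -> 5.0
```

Note that the weight of "methods" never enters the sum: a term absent from the query contributes nothing, exactly as the w_{q,t} = 0 convention dictates.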
53 Inverse document frequency
- If only the term frequency is taken into account, and a query contains common words, a document with enough appearances of a common term is always ranked first, irrespective of other words
- -> terms can be weighted according to their inverse document frequency
54 Weighting
- Many possibilities exist for combining term frequency and inverse document frequency
- principles
  - a term that appears in many documents should not be regarded as more important than a term that appears in a few
  - a document with many occurrences of a term should not be regarded as less important than a document that has just a few
55 Weighting
- For instance, with N documents and a term t appearing in f_t of them
  - TF-IDF for w_{d,t}: e.g. w_{d,t} = f_{d,t} · log(N / f_t)
  - IDF for w_{q,t}: e.g. w_{q,t} = log(N / f_t)
56 Similarity of vectors
- Long documents should not be favored over short documents
- the similarity of the direction indicated by the two vectors is measured
- similarity is defined as the cosine of the angle θ between the document and query vectors
  - cos θ = 1, when θ = 0
  - cos θ = 0, when the vectors are orthogonal
57 Similarity of vectors
- The cosine of the angle can be calculated as
  cos θ = (X · Y) / (|X| · |Y|)
- |X| is the length of vector X
- |X| · |Y| is the normalization factor
58 Similarity of vectors
- Cosine rule for ranking
  cosine(Q, D_d) = (1 / (W_q · W_d)) · Σ_{t ∈ Q} w_{q,t} · w_{d,t}
- where W_d = sqrt(Σ_t w_{d,t}²)
- and W_q = sqrt(Σ_t w_{q,t}²)
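The cosine rule can be sketched directly over weight vectors stored as dictionaries (the function name and the toy weights are illustrative, not from the source):

```python
# Cosine similarity: the weighted inner product divided by the product
# of the two vector lengths (the normalization factor).

import math

def cosine(query_weights, doc_weights):
    dot = sum(wq * doc_weights.get(t, 0.0) for t, wq in query_weights.items())
    wq_len = math.sqrt(sum(w * w for w in query_weights.values()))
    wd_len = math.sqrt(sum(w * w for w in doc_weights.values()))
    if wq_len == 0 or wd_len == 0:
        return 0.0
    return dot / (wq_len * wd_len)

q = {"text": 1.0}
d1 = {"text": 2.0}                     # short document, entirely on-topic
d2 = {"text": 2.0, "other": 10.0}      # long document, mostly off-topic
print(cosine(q, d1) > cosine(q, d2))   # -> True: length no longer wins
```

The comparison at the end shows the point of normalization: the longer document d2 has the same raw inner product as d1 but a much smaller cosine.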
59 Index construction
- Each document of the collection contains some index terms, and each index term appears in some of the documents
- this relationship can be expressed with a frequency matrix
  - each column corresponds to one word
  - each row corresponds to one document
  - the number stored at any row and column is the frequency, in that document, of the word indicated by that column
60 Index construction
- Each document of the collection is summarized in one row of the frequency matrix
- to create an index, the matrix must be transposed, forming a new version in which the rows correspond to term numbers
- from this form, an inverted file index is easy to construct
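For a collection small enough to fit in memory, the transposition amounts to reading documents row by row and appending (document, frequency) pairs to per-term lists. A sketch with an illustrative two-document collection:

```python
# Transposing document-ordered data into term order: each document is one
# "row" of the frequency matrix; the result is the inverted file directly.

from collections import Counter, defaultdict

documents = {
    1: "text compression text",
    2: "compression of images",
}

def invert(docs):
    inverted = defaultdict(list)     # term -> [(doc number, frequency), ...]
    for doc_num in sorted(docs):     # read the text in document order
        for term, freq in Counter(docs[doc_num].split()).items():
            inverted[term].append((doc_num, freq))
    return dict(inverted)

inv = invert(documents)
print(inv["text"])          # -> [(1, 2)]
print(inv["compression"])   # -> [(1, 1), (2, 1)]
```

This is the trivial in-memory algorithm of the next slide; as the slides after it note, it breaks down when the matrix no longer fits in memory.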
61 Index construction
- Trivial algorithm
  - build in memory a transposed frequency matrix, reading the text in document order, one column of the matrix at a time
  - write the matrix to disk row by row, in term order
62 Index construction
- In reality, inversion is much more difficult
- the problem is the size of the frequency matrix
  - for instance, for a collection with 535,346 distinct terms and 741,856 documents, the size of the matrix can be 1.4 terabytes
- we could use a machine with a large virtual memory -> it would take 2 months
63 Index construction
- More economical methods for constructing and inverting a frequency matrix exist
- an index for the large collection mentioned above could be created in less than 2 hours (in 1998) on a personal computer, consuming just 30 MB of main memory and less than 20 MB of temporary disk space beyond the space required by the final inverted file
64 Final words
- We have discussed
- character sets
- preprocessing of text
- feature selection
- text categorization
- text summarization
- text compression
- indexing, querying
65 Final words
- What else there is...
  - structured documents (XML, ...)
  - metadata (semantic Web, ontologies, ...)
  - linguistic resources (WordNet, thesauri, ...)
  - document management systems (archiving)
  - document analysis (scanning of documents)
  - digital libraries
  - text mining, question answering, ...
66 Administrative...
- Exam on Tuesday 4.12.
  - large essays (2-3 pages each)
  - data comprehension (e.g. recall/precision)
  - use full sentences!
- Exercise points
  - 28 or more original points -> 30 pts
  - otherwise original points 2
- Remember the course feedback survey (Kurssikysely)!