Title: Chapter 7: Document Preprocessing (textbook)
- Document preprocessing is a procedure that can be divided mainly into five text operations (or transformations):
- (1) Lexical analysis of the text, with the objective of treating digits, hyphens, punctuation marks, and the case of letters.
- (2) Elimination of stop words, with the objective of filtering out words with very low discrimination value for retrieval purposes.
- (3) Stemming of the remaining words, with the objective of removing affixes (i.e., prefixes and suffixes) and allowing the retrieval of documents containing syntactic variations of query terms (e.g., connect, connecting, connected, etc.).
- (4) Selection of index terms, to determine which words/stems (or groups of words) will be used as indexing elements. Usually, the decision on whether a particular word will be used as an index term is related to the syntactic nature of the word. In fact, nouns frequently carry more semantics than adjectives, adverbs, and verbs.
- (5) Construction of term categorization structures, such as a thesaurus, or extraction of structure directly represented in the text, allowing the expansion of the original query with related terms (usually a useful procedure).
Lexical Analysis of the Text
- Task: convert a string of characters into a sequence of words.
- The main task is to deal with spaces; e.g., multiple spaces are treated as one space.
- Digits: ignoring numbers is a common approach, but there are special cases. Numbers such as 1999 and 2000, standing for specific years, are important; mixed alphanumeric strings such as 510B.C. are important; a 16-digit number might be a credit card number.
- Hyphens: state-of-the-art and state of the art should be treated as the same.
- Punctuation marks: remove them. Exception: 510B.C.
- Lower- and upper-case letters are treated as the same.
- Many exceptions remain, e.g., semi-automatic.
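The rules above can be sketched as a small tokenizer. The exact treatment of years, punctuation, and 510B.C.-style exceptions below is an illustrative choice, not a fixed standard:

```python
import re

def lexical_analysis(text):
    """Tokenize following the rules above: multiple spaces act as one,
    case is folded, most punctuation is removed, and purely numeric
    tokens are dropped unless they look like 4-digit years.
    (Illustrative policy choices, not a standard algorithm.)"""
    tokens = []
    for raw in text.split():                     # collapses runs of spaces
        token = raw.lower()                      # case folding
        token = re.sub(r"[^\w.]", "", token)     # drop punctuation, keep '.' for cases like "510b.c."
        token = token.strip(".")                 # trailing sentence periods ("end." -> "end")
        if not token:
            continue
        if token.isdigit() and len(token) != 4:  # keep year-like numbers such as 1999
            continue
        tokens.append(token)
    return tokens

print(lexical_analysis("In  1999, a 16-digit number appeared;  twice."))
```

Note that hyphen handling here simply merges the parts (16-digit becomes 16digit); treating state-of-the-art and state of the art identically would need an extra normalization step.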
Elimination of Stop Words
- Words that appear too often are not useful for IR.
- Stop words: words that appear in more than 80% of the documents in the collection are treated as stop words and are filtered out as potential index words.
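The 80%-of-documents criterion above can be sketched directly; the tiny collection and the threshold value are illustrative:

```python
# A minimal sketch: any term appearing in more than 80% of the
# documents is treated as a stop word and filtered from the
# candidate index terms. The collection below is made up.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "and", "the", "dog"],
]

def stop_words(docs, threshold=0.8):
    n = len(docs)
    vocab = {t for d in docs for t in d}
    # document frequency of t = number of documents containing t
    return {t for t in vocab if sum(t in d for d in docs) / n > threshold}

stops = stop_words(docs)
index_terms = [[t for t in d if t not in stops] for d in docs]
print(stops)   # "the" appears in 3/3 documents, so it is filtered
```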
Stemming
- Stem: the portion of a word that is left after the removal of its affixes (i.e., prefixes or suffixes).
- Example: connect is the stem of connected, connecting, connection, and connections.
- The Porter algorithm uses a suffix list for suffix stripping, with rules such as sses → ss and s → ∅.
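Suffix-list stripping can be sketched as follows. This is only a fragment in the spirit of the first step of Porter's algorithm, not the full algorithm, which has several rule groups and conditions on the stem:

```python
# Ordered (suffix, replacement) rules; the first match wins.
# These four are the classic step-1a rules of the Porter stemmer.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def strip_suffix(word):
    for suffix, repl in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", strip_suffix(w))
```

The ss → ss rule looks redundant, but it must precede s → ∅ so that caress is not truncated to cares.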
Index Terms Selection
- Identification of noun groups.
- Treat nouns that appear close together as a single component, e.g., computer science.
Thesaurus
- Thesaurus: a precompiled list of important words in a given domain of knowledge; for each word in this list, there is a set of related words.
- Provides vocabulary control in an information retrieval system.
- Thesaurus construction:
- Manual construction
- Automatic construction
Vocabulary Control
- A standard vocabulary for both indexing and searching (for the constructors of the system and the users of the system).
Objectives of Vocabulary Control
- To promote the consistent representation of subject matter by indexers and searchers, thereby avoiding the dispersion of related materials.
- To facilitate the conduct of a comprehensive search on some topic by linking together terms whose meanings are related paradigmatically.
Thesaurus
- Not like a common dictionary:
- A dictionary lists words with their explanations.
- A thesaurus may cover the words of a whole language, or only the words of a specific domain.
- It carries a lot of other information, especially the relationships between words:
- Classification of the words in the language.
- Word relationships such as synonyms and antonyms.
On-Line Thesaurus
- http://www.thesaurus.com
- http://www.dictionary.com/
- http://www.cogsci.princeton.edu/wn/
Dictionary vs. Thesaurus
Check "information" using http://www.thesaurus.com

Dictionary entry for information (n.):
- Knowledge derived from study, experience, or instruction.
- Knowledge of specific events or situations that has been gathered or received by communication; intelligence or news. See Synonyms at knowledge.
- ......

Thesaurus entry for information:
- Nouns: information, enlightenment, acquaintance
- Verbs: tell, inform, inform of, acquaint, acquaint with, impart
- Adjectives: informed, reported, published
Use of a Thesaurus
- To control the terms used in indexing; for a specific domain, only the terms in the thesaurus are used as indexing terms.
- To assist users in forming proper queries through the help information contained in the thesaurus.
Construction of a Thesaurus
- Stemming can be used to reduce the size of the thesaurus.
- A thesaurus can be constructed either manually or automatically.
WordNet (Manually Constructed)
- WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives, and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
Relations in WordNet
Automatic Thesaurus Construction
- A variety of methods can be used to construct a thesaurus.
- Term similarity can be used as the basis for construction.
Complete Term Relation Method
- The term-document relationship can be calculated using a variety of methods, such as tf-idf.
- Term similarity can be calculated based on the term-document relationship, for example:
- Set the similarity threshold to 10.
- Resulting groups of related terms:
- {T1, T3, T4, T6}
- {T1, T5}
- {T2, T4, T6}
- {T2, T6, T8}
- {T7}
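The complete term relation method can be sketched as follows: term-term similarity is taken as the inner product of the terms' rows in a term-document weight matrix, and pairs above the threshold are grouped. The matrix below is illustrative (the slides' own matrix, which produces the groups above, is not reproduced here):

```python
# Illustrative term-document weight matrix: W[i][j] is the weight
# (e.g., tf-idf) of term i in document j. Three terms, four documents.
W = [
    [3, 0, 4, 0],   # T1
    [0, 5, 0, 1],   # T2
    [2, 0, 3, 0],   # T3
]
THRESHOLD = 10

def similarity(i, j):
    """Inner product of the rows for terms i and j."""
    return sum(a * b for a, b in zip(W[i], W[j]))

related = {(i, j)
           for i in range(len(W)) for j in range(i + 1, len(W))
           if similarity(i, j) >= THRESHOLD}
print(related)   # T1 and T3 co-occur: 3*2 + 4*3 = 18 >= 10
```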
Indexing
- Arrangement of data (a data structure) to permit fast searching.
- Which list is easier to search?
- sow fox pig eel yak hen ant cat dog hog
- ant cat dog eel fox hen hog pig sow yak
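The sorted list above admits binary search in O(log n) comparisons, while the unsorted one requires a linear scan. A minimal sketch using the standard library:

```python
from bisect import bisect_left

# The sorted word list from the slide.
words = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]

def contains(sorted_words, target):
    """Binary search: locate the insertion point, then check for a match."""
    i = bisect_left(sorted_words, target)
    return i < len(sorted_words) and sorted_words[i] == target

print(contains(words, "fox"), contains(words, "owl"))  # True False
```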
Creating Inverted Files
- [Diagram: word extraction maps the original documents to word IDs and document IDs; the inverted file stores, for each word, the documents containing it, e.g., W1 → d1, d2, d3; W2 → d2, d4, d7, d9; ...; Wn → di, ..., dn]
Creating an Inverted File
- Map the file names to file IDs.
- Consider the following original documents. [Figure: the original documents]
- [Figure: the documents with stop words shown in red]
- [Figure: the terms after stemming, lowercasing (optional), and deleting numbers (optional)]
- [Figure: the resulting inverted file, unsorted]
- [Figure: the resulting inverted file, sorted]
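The construction steps above can be sketched end to end: map each term to the sorted list of IDs of the documents that contain it. The toy documents are illustrative; stop-word removal and stemming, as on the earlier slides, would run before this step:

```python
from collections import defaultdict

# Toy collection after preprocessing: document ID -> its terms.
docs = {
    1: ["cat", "dog"],
    2: ["dog", "eel"],
    3: ["cat", "fox", "dog"],
}

# Build term -> set of document IDs, then sort terms and posting lists.
inverted = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        inverted[term].add(doc_id)

inverted_file = {t: sorted(ids) for t, ids in sorted(inverted.items())}
print(inverted_file)
# {'cat': [1, 3], 'dog': [1, 2, 3], 'eel': [2], 'fox': [3]}
```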
Searching an Inverted File
- Binary search: used on a small scale.
- On a larger scale, create a thesaurus and combine techniques such as:
- Hashing
- B-tree
- Pointers to addresses in the indexed file
Huffman Codes
- Binary character code: each character is represented by a unique binary string.
- A data file can be coded in two ways:

                      a    b    c    d    e     f
frequency (%)         45   13   12   16   9     5
fixed-length code     000  001  010  011  100   101
variable-length code  0    101  100  111  1101  1100

The first way needs 100 × 3 = 300 bits. The second way needs 45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4 = 224 bits.
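The two totals can be checked directly from the table (a quick verification sketch):

```python
# Frequencies and variable-length codewords from the table above,
# for a file of 100 characters.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
var_code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

fixed_bits = sum(freq.values()) * 3                       # 100 characters * 3 bits
var_bits = sum(freq[c] * len(var_code[c]) for c in freq)  # weighted codeword lengths
print(fixed_bits, var_bits)   # 300 224
```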
Variable-Length Codes
- Some care is needed to read the code.
- 001011101 (codewords: a = 0, b = 00, c = 01, d = 11)
- Where to cut? 00 can be read as either aa or b.
- The prefixes of 0011 are 0, 00, 001, and 0011.
- Prefix codes: no codeword is a prefix of any other codeword (prefix-free).
- Prefix codes are simple to encode and decode.
Using the Codewords in the Table to Encode and Decode
- Encode abc: 0 · 101 · 100 = 0101100 (just concatenate the codewords).
- Decode 001011101: 0 · 0 · 101 · 1101 = aabe (use the binary code tree for the variable-length code).
- [Figures: the code tree for the fixed-length codewords and the code tree for the variable-length codewords]
Binary Tree
- Every nonleaf node has two children.
- The fixed-length code in our example is not optimal.
- The total number of bits required to encode a file is B(T) = Σc∈C f(c) · dT(c), where
- f(c) is the frequency (number of occurrences) of c in the file, and
- dT(c) denotes the depth of c's leaf in the tree T.
Constructing an Optimal Code
- Formal definition of the problem:
- Input: a set of characters C = {c1, c2, ..., cn}, where each c ∈ C has frequency f[c].
- Output: a binary tree representing codewords such that the total number of bits required for the file is minimized.
- Huffman proposed a greedy algorithm to solve the problem.
[Figures (a)-(f): step-by-step construction of the Huffman tree for the frequencies a:45, b:13, c:12, d:16, e:9, f:5]
HUFFMAN(C)
1  n = |C|
2  Q = C
3  for i = 1 to n - 1
4      do z = ALLOCATE-NODE()
5         x = left[z] = EXTRACT-MIN(Q)
6         y = right[z] = EXTRACT-MIN(Q)
7         f[z] = f[x] + f[y]
8         INSERT(Q, z)
9  return EXTRACT-MIN(Q)
The Huffman Algorithm
- This algorithm builds the tree T corresponding to the optimal code in a bottom-up manner.
- C is a set of n characters, and each character c in C has a defined frequency f[c].
- Q is a priority queue, keyed on f, used to identify the two least-frequent objects to merge together.
- The result of the merger is a new object (an internal node) whose frequency is the sum of the frequencies of the two merged objects.
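The HUFFMAN(C) pseudocode above can be sketched in Python with a binary heap as the priority queue Q. The nested-tuple node representation and the tie-breaking counter are implementation choices of this sketch, not part of the slides; ties in Q may therefore resolve differently from the figures, but the total cost is the same for every optimal tree:

```python
import heapq
from itertools import count

def huffman(freq):
    """Build a Huffman code tree for {character: frequency}.
    Leaves are characters; internal nodes are (left, right) tuples."""
    tiebreak = count()        # keeps heap entries comparable on frequency ties
    q = [(f, next(tiebreak), c) for c, f in freq.items()]
    heapq.heapify(q)
    for _ in range(len(freq) - 1):       # lines 3-8: n-1 merges
        fx, _, x = heapq.heappop(q)      # x = EXTRACT-MIN(Q)
        fy, _, y = heapq.heappop(q)      # y = EXTRACT-MIN(Q)
        heapq.heappush(q, (fx + fy, next(tiebreak), (x, y)))   # f[z] = f[x] + f[y]
    return q[0][2]                       # line 9: the root of the code tree

def codewords(tree, prefix=""):
    """Read codewords off the tree: left edge = 0, right edge = 1."""
    if not isinstance(tree, tuple):
        return {tree: prefix or "0"}
    left, right = tree
    return {**codewords(left, prefix + "0"), **codewords(right, prefix + "1")}

# Frequencies from the running example.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
codes = codewords(huffman(freq))
total = sum(freq[c] * len(codes[c]) for c in freq)
print(total)   # 224 bits, matching the optimal cost computed earlier
```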
Time Complexity
- Lines 4-8 are executed n - 1 times.
- Each heap operation in lines 4-8 takes O(lg n) time.
- The total time required is O(n lg n).
- Note: the details of the heap operations will not be tested; the O(n lg n) time complexity should be remembered.
Another Example
- [Figure: Huffman tree construction for the frequencies e:4, a:6, c:6, b:9, d:11]
Correctness of Huffman's Greedy Algorithm
(Fun part, not required)
- Again, we use our general strategy.
- Let x and y be the two characters in C having the lowest frequencies (the first two characters selected by the greedy algorithm).
- We will show two properties:
- (1) There exists an optimal solution Topt (a binary tree representing codewords) such that x and y are siblings in Topt.
- (2) Let z be a new character with frequency f[z] = f[x] + f[y], and let C' = C - {x, y} ∪ {z}. Let T' be an optimal tree for C'. Then we can get Topt from T' by replacing the leaf z with an internal node having x and y as children.
Proof of Property 1
- [Figure: trees Topt and Tnew]
- Look at the lowest siblings in Topt, say b and c.
- Exchange x with b and y with c to obtain Tnew.
- B(Topt) - B(Tnew) ≥ 0, since f[x] and f[y] are the smallest frequencies; as Topt is optimal, Tnew is also optimal, and x and y are siblings in it.
- Property 1 is proved.
Proof of Property 2
- Let z be a new character with frequency f[z] = f[x] + f[y], and let C' = C - {x, y} ∪ {z}. Let T' be an optimal tree for C'. Then we can get Topt from T' by replacing the leaf z with an internal node having x and y as children.
- Proof: Let T be the tree obtained from T' by replacing the leaf z with the three nodes (z as an internal node with children x and y).
- B(T) = B(T') + f[x] + f[y].   (1)
- (The codewords for x and y are 1 bit longer than the codeword for z.)
- Now prove T = Topt by contradiction.
- If T ≠ Topt, then B(T) > B(Topt).   (2)
- By Property 1, x and y are siblings in Topt.
- Thus, we can delete x and y from Topt and get another tree T'' for C'.
- B(T'') = B(Topt) - f[x] - f[y] < B(T) - f[x] - f[y] = B(T').
- (The inequality uses (2); the last equality uses (1).)
- Thus, B(T'') < B(T'), contradicting the optimality of T' for C'. Therefore T = Topt is optimal.