Title: Chapter 7: Document Preprocessing (textbook)
- Document preprocessing is a procedure that can be divided mainly into five text operations (or transformations):
- (1) Lexical analysis of the text, with the objective of treating digits, hyphens, punctuation marks, and the case of letters.
- (2) Elimination of stop words, with the objective of filtering out words with very low discrimination value for retrieval purposes.
- (3) Stemming of the remaining words, with the objective of removing affixes (i.e., prefixes and suffixes) and allowing the retrieval of documents containing syntactic variations of query terms (e.g., connect, connecting, connected, etc.).
- (4) Selection of index terms, to determine which words/stems (or groups of words) will be used as indexing elements. Usually, the decision on whether a particular word will be used as an index term is related to the syntactic nature of the word. In fact, nouns frequently carry more semantics than adjectives, adverbs, and verbs.
- (5) Construction of term categorization structures, such as a thesaurus, or extraction of structure directly represented in the text, allowing the expansion of the original query with related terms (usually a useful procedure).
Lexical Analysis of the Text
- Task: convert a string of characters into a sequence of words.
- The main task is to deal with spaces; e.g., multiple spaces are treated as one space.
- Digits: ignoring numbers is a common approach, but there are special cases. Numbers such as 1999 and 2000, standing for specific years, are important; mixed alphanumeric strings such as 510B.C. are important; a 16-digit number might be a credit card number.
- Hyphens: state-of-the-art and state of the art should be treated as the same.
- Punctuation marks: remove them. Exception: 510B.C.
- Lower- and upper-case letters are treated as the same.
- Many exceptions remain, e.g., semi-automatic.
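The rules above can be sketched as a small tokenizer. The exact treatment of years, punctuation, and 510B.C.-style exceptions below is an illustrative choice, not a fixed standard:

```python
import re

def lexical_analysis(text):
    """Tokenize following the rules above: multiple spaces act as one,
    case is folded, most punctuation is removed, and purely numeric
    tokens are dropped unless they look like 4-digit years.
    (Illustrative policy choices, not a standard algorithm.)"""
    tokens = []
    for raw in text.split():                     # collapses runs of spaces
        token = raw.lower()                      # case folding
        token = re.sub(r"[^\w.]", "", token)     # drop punctuation, keep '.' for cases like "510b.c."
        token = token.strip(".")                 # trailing sentence periods ("end." -> "end")
        if not token:
            continue
        if token.isdigit() and len(token) != 4:  # keep year-like numbers such as 1999
            continue
        tokens.append(token)
    return tokens

print(lexical_analysis("In  1999, a 16-digit number appeared;  twice."))
```

Note that hyphen handling here simply merges the parts (16-digit becomes 16digit); treating state-of-the-art and state of the art identically would need an extra normalization step.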
Elimination of Stop Words
- Words that appear too often are not useful for IR.
- Stop words: words that appear in more than 80% of the documents in the collection are treated as stop words and are filtered out as potential index words.
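The 80%-of-documents criterion above can be sketched directly; the tiny collection and the threshold value are illustrative:

```python
# A minimal sketch: any term appearing in more than 80% of the
# documents is treated as a stop word and filtered from the
# candidate index terms. The collection below is made up.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "and", "the", "dog"],
]

def stop_words(docs, threshold=0.8):
    n = len(docs)
    vocab = {t for d in docs for t in d}
    # document frequency of t = number of documents containing t
    return {t for t in vocab if sum(t in d for d in docs) / n > threshold}

stops = stop_words(docs)
index_terms = [[t for t in d if t not in stops] for d in docs]
print(stops)   # "the" appears in 3/3 documents, so it is filtered
```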
Stemming
- Stem: the portion of a word that is left after the removal of its affixes (i.e., prefixes or suffixes).
- Example: connect is the stem of connected, connecting, connection, and connections.
- The Porter algorithm uses a suffix list for suffix stripping, with rules such as sses → ss and s → ∅.
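Suffix-list stripping can be sketched as follows. This is only a fragment in the spirit of the first step of Porter's algorithm, not the full algorithm, which has several rule groups and conditions on the stem:

```python
# Ordered (suffix, replacement) rules; the first match wins.
# These four are the classic step-1a rules of the Porter stemmer.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def strip_suffix(word):
    for suffix, repl in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", strip_suffix(w))
```

The ss → ss rule looks redundant, but it must precede s → ∅ so that caress is not truncated to cares.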
Index Terms Selection
- Identification of noun groups.
- Treat nouns that appear close together as a single component, e.g., computer science.
Thesaurus
- Thesaurus: a precompiled list of important words in a given domain of knowledge; for each word in this list, there is a set of related words.
- Provides vocabulary control in an information retrieval system.
- Thesaurus construction:
- Manual construction
- Automatic construction
Vocabulary Control
- A standard vocabulary for both indexing and searching (for the constructors of the system and the users of the system).
Objectives of Vocabulary Control
- To promote the consistent representation of subject matter by indexers and searchers, thereby avoiding the dispersion of related materials.
- To facilitate the conduct of a comprehensive search on some topic by linking together terms whose meanings are related paradigmatically.
Thesaurus
- Not like a common dictionary:
- A dictionary lists words with their explanations.
- A thesaurus may cover the words of a whole language, or only the words of a specific domain.
- It carries a lot of other information, especially the relationships between words:
- Classification of the words in the language.
- Word relationships such as synonyms and antonyms.
On-Line Thesaurus
- http://www.thesaurus.com
- http://www.dictionary.com/
- http://www.cogsci.princeton.edu/wn/
Dictionary vs. Thesaurus
Check "information" using http://www.thesaurus.com

Dictionary entry for information (n.):
- Knowledge derived from study, experience, or instruction.
- Knowledge of specific events or situations that has been gathered or received by communication; intelligence or news. See Synonyms at knowledge.
- ......

Thesaurus entry for information:
- Nouns: information, enlightenment, acquaintance
- Verbs: tell, inform, inform of, acquaint, acquaint with, impart
- Adjectives: informed, reported, published
Use of a Thesaurus
- To control the terms used in indexing; for a specific domain, only the terms in the thesaurus are used as indexing terms.
- To assist users in forming proper queries through the help information contained in the thesaurus.
Construction of a Thesaurus
- Stemming can be used to reduce the size of the thesaurus.
- A thesaurus can be constructed either manually or automatically.
WordNet (Manually Constructed)
- WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives, and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
Relations in WordNet
Automatic Thesaurus Construction
- A variety of methods can be used to construct a thesaurus.
- Term similarity can be used as the basis for construction.
Complete Term Relation Method
- The term-document relationship can be calculated using a variety of methods, such as tf-idf.
- Term similarity can be calculated based on the term-document relationship, for example:
- Set the similarity threshold to 10.
- Resulting groups of related terms:
- {T1, T3, T4, T6}
- {T1, T5}
- {T2, T4, T6}
- {T2, T6, T8}
- {T7}
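The complete term relation method can be sketched as follows: term-term similarity is taken as the inner product of the terms' rows in a term-document weight matrix, and pairs above the threshold are grouped. The matrix below is illustrative (the slides' own matrix, which produces the groups above, is not reproduced here):

```python
# Illustrative term-document weight matrix: W[i][j] is the weight
# (e.g., tf-idf) of term i in document j. Three terms, four documents.
W = [
    [3, 0, 4, 0],   # T1
    [0, 5, 0, 1],   # T2
    [2, 0, 3, 0],   # T3
]
THRESHOLD = 10

def similarity(i, j):
    """Inner product of the rows for terms i and j."""
    return sum(a * b for a, b in zip(W[i], W[j]))

related = {(i, j)
           for i in range(len(W)) for j in range(i + 1, len(W))
           if similarity(i, j) >= THRESHOLD}
print(related)   # T1 and T3 co-occur: 3*2 + 4*3 = 18 >= 10
```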
Indexing
- Arrangement of data (a data structure) to permit fast searching.
- Which list is easier to search?
- sow fox pig eel yak hen ant cat dog hog
- ant cat dog eel fox hen hog pig sow yak
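The sorted list above admits binary search in O(log n) comparisons, while the unsorted one requires a linear scan. A minimal sketch using the standard library:

```python
from bisect import bisect_left

# The sorted word list from the slide.
words = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]

def contains(sorted_words, target):
    """Binary search: locate the insertion point, then check for a match."""
    i = bisect_left(sorted_words, target)
    return i < len(sorted_words) and sorted_words[i] == target

print(contains(words, "fox"), contains(words, "owl"))  # True False
```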
Creating Inverted Files
- [Diagram: word extraction maps the original documents to word IDs and document IDs; the inverted file stores, for each word, the documents containing it, e.g., W1 → d1, d2, d3; W2 → d2, d4, d7, d9; ...; Wn → di, ..., dn]
Creating an Inverted File
- Map the file names to file IDs.
- Consider the following original documents. [Figure: the original documents]
- [Figure: the documents with stop words shown in red]
- [Figure: the terms after stemming, lowercasing (optional), and deleting numbers (optional)]
- [Figure: the resulting inverted file, unsorted]
- [Figure: the resulting inverted file, sorted]
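The construction steps above can be sketched end to end: map each term to the sorted list of IDs of the documents that contain it. The toy documents are illustrative; stop-word removal and stemming, as on the earlier slides, would run before this step:

```python
from collections import defaultdict

# Toy collection after preprocessing: document ID -> its terms.
docs = {
    1: ["cat", "dog"],
    2: ["dog", "eel"],
    3: ["cat", "fox", "dog"],
}

# Build term -> set of document IDs, then sort terms and posting lists.
inverted = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        inverted[term].add(doc_id)

inverted_file = {t: sorted(ids) for t, ids in sorted(inverted.items())}
print(inverted_file)
# {'cat': [1, 3], 'dog': [1, 2, 3], 'eel': [2], 'fox': [3]}
```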
Searching an Inverted File
- Binary search: used on a small scale.
- On a larger scale, create a thesaurus and combine techniques such as:
- Hashing
- B-tree
- Pointers to addresses in the indexed file
Huffman Codes
- Binary character code: each character is represented by a unique binary string.
- A data file can be coded in two ways:

                      a    b    c    d    e     f
frequency (%)         45   13   12   16   9     5
fixed-length code     000  001  010  011  100   101
variable-length code  0    101  100  111  1101  1100

The first way needs 100 × 3 = 300 bits. The second way needs 45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4 = 224 bits.
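The two totals can be checked directly from the table (a quick verification sketch):

```python
# Frequencies and variable-length codewords from the table above,
# for a file of 100 characters.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
var_code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

fixed_bits = sum(freq.values()) * 3                       # 100 characters * 3 bits
var_bits = sum(freq[c] * len(var_code[c]) for c in freq)  # weighted codeword lengths
print(fixed_bits, var_bits)   # 300 224
```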
Variable-Length Codes
- Some care is needed to read the code.
- 001011101 (codewords: a = 0, b = 00, c = 01, d = 11)
- Where to cut? 00 can be read as either aa or b.
- The prefixes of 0011 are 0, 00, 001, and 0011.
- Prefix codes: no codeword is a prefix of any other codeword (prefix-free).
- Prefix codes are simple to encode and decode.
Using the Codewords in the Table to Encode and Decode
- Encode abc: 0 · 101 · 100 = 0101100 (just concatenate the codewords).
- Decode 001011101: 0 · 0 · 101 · 1101 = aabe (use the binary code tree for the variable-length code).
- [Figures: the code tree for the fixed-length codewords and the code tree for the variable-length codewords]
Binary Tree
- Every nonleaf node has two children.
- The fixed-length code in our example is not optimal.
- The total number of bits required to encode a file is B(T) = Σc∈C f(c) · dT(c), where
- f(c) is the frequency (number of occurrences) of c in the file, and
- dT(c) denotes the depth of c's leaf in the tree T.
Constructing an Optimal Code
- Formal definition of the problem:
- Input: a set of characters C = {c1, c2, ..., cn}, where each c ∈ C has frequency f[c].
- Output: a binary tree representing codewords such that the total number of bits required for the file is minimized.
- Huffman proposed a greedy algorithm to solve the problem.
[Figures (a)-(f): step-by-step construction of the Huffman tree for the frequencies a:45, b:13, c:12, d:16, e:9, f:5]
HUFFMAN(C)
1  n = |C|
2  Q = C
3  for i = 1 to n - 1
4      do z = ALLOCATE-NODE()
5         x = left[z] = EXTRACT-MIN(Q)
6         y = right[z] = EXTRACT-MIN(Q)
7         f[z] = f[x] + f[y]
8         INSERT(Q, z)
9  return EXTRACT-MIN(Q)
The Huffman Algorithm
- This algorithm builds the tree T corresponding to the optimal code in a bottom-up manner.
- C is a set of n characters, and each character c in C has a defined frequency f[c].
- Q is a priority queue, keyed on f, used to identify the two least-frequent objects to merge together.
- The result of the merger is a new object (an internal node) whose frequency is the sum of the frequencies of the two merged objects.
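The HUFFMAN(C) pseudocode above can be sketched in Python with a binary heap as the priority queue Q. The nested-tuple node representation and the tie-breaking counter are implementation choices of this sketch, not part of the slides; ties in Q may therefore resolve differently from the figures, but the total cost is the same for every optimal tree:

```python
import heapq
from itertools import count

def huffman(freq):
    """Build a Huffman code tree for {character: frequency}.
    Leaves are characters; internal nodes are (left, right) tuples."""
    tiebreak = count()        # keeps heap entries comparable on frequency ties
    q = [(f, next(tiebreak), c) for c, f in freq.items()]
    heapq.heapify(q)
    for _ in range(len(freq) - 1):       # lines 3-8: n-1 merges
        fx, _, x = heapq.heappop(q)      # x = EXTRACT-MIN(Q)
        fy, _, y = heapq.heappop(q)      # y = EXTRACT-MIN(Q)
        heapq.heappush(q, (fx + fy, next(tiebreak), (x, y)))   # f[z] = f[x] + f[y]
    return q[0][2]                       # line 9: the root of the code tree

def codewords(tree, prefix=""):
    """Read codewords off the tree: left edge = 0, right edge = 1."""
    if not isinstance(tree, tuple):
        return {tree: prefix or "0"}
    left, right = tree
    return {**codewords(left, prefix + "0"), **codewords(right, prefix + "1")}

# Frequencies from the running example.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
codes = codewords(huffman(freq))
total = sum(freq[c] * len(codes[c]) for c in freq)
print(total)   # 224 bits, matching the optimal cost computed earlier
```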
Time Complexity
- Lines 4-8 are executed n - 1 times.
- Each heap operation in lines 4-8 takes O(lg n) time.
- The total time required is O(n lg n).
- Note: the details of the heap operations will not be tested; the O(n lg n) time complexity should be remembered.
Another Example
- [Figure: Huffman tree construction for the frequencies e:4, a:6, c:6, b:9, d:11]
Correctness of Huffman's Greedy Algorithm
(Fun part, not required)
- Again, we use our general strategy.
- Let x and y be the two characters in C having the lowest frequencies (the first two characters selected by the greedy algorithm).
- We will show two properties:
- (1) There exists an optimal solution Topt (a binary tree representing codewords) such that x and y are siblings in Topt.
- (2) Let z be a new character with frequency f[z] = f[x] + f[y], and let C' = C - {x, y} ∪ {z}. Let T' be an optimal tree for C'. Then we can get Topt from T' by replacing the leaf z with an internal node having x and y as children.
Proof of Property 1
- [Figure: trees Topt and Tnew]
- Look at the lowest siblings in Topt, say b and c.
- Exchange x with b and y with c to obtain Tnew.
- B(Topt) - B(Tnew) ≥ 0, since f[x] and f[y] are the smallest frequencies; as Topt is optimal, Tnew is also optimal, and x and y are siblings in it.
- Property 1 is proved.
Proof of Property 2
- Let z be a new character with frequency f[z] = f[x] + f[y], and let C' = C - {x, y} ∪ {z}. Let T' be an optimal tree for C'. Then we can get Topt from T' by replacing the leaf z with an internal node having x and y as children.
- Proof: Let T be the tree obtained from T' by replacing the leaf z with the three nodes (z as an internal node with children x and y).
- B(T) = B(T') + f[x] + f[y].   (1)
- (The codewords for x and y are 1 bit longer than the codeword for z.)
- Now prove T = Topt by contradiction.
- If T ≠ Topt, then B(T) > B(Topt).   (2)
- By Property 1, x and y are siblings in Topt.
- Thus, we can delete x and y from Topt and get another tree T'' for C'.
- B(T'') = B(Topt) - f[x] - f[y] < B(T) - f[x] - f[y] = B(T').
- (The inequality uses (2); the last equality uses (1).)
- Thus, B(T'') < B(T'), contradicting the optimality of T' for C'. Therefore T = Topt is optimal.