Chapter 7: Document Preprocessing (textbook)
1
Chapter 7: Document Preprocessing (textbook)
  • Document preprocessing is a procedure that can be divided mainly into five text operations (or transformations):
  • (1) Lexical analysis of the text with the
    objective of treating digits, hyphens,
    punctuation marks, and the case of letters.
  • (2) Elimination of stop-words with the
    objective of filtering out words with very low
    discrimination values for retrieval purposes.

2
Document Preprocessing
  • (3) Stemming of the remaining words with the objective of removing affixes (i.e., prefixes and suffixes) and allowing the retrieval of documents containing syntactic variations of query terms (e.g., connect, connecting, connected, etc.).
  • (4) Selection of index terms to determine which words/stems (or groups of words) will be used as indexing elements. Usually, the decision on whether a particular word will be used as an index term is related to the syntactic nature of the word. In fact, nouns frequently carry more semantics than adjectives, adverbs, and verbs.

3
Document Preprocessing
  • (5) Construction of term categorization structures, such as a thesaurus, or extraction of structure directly represented in the text, to allow the expansion of the original query with related terms (usually a useful procedure).

4
Lexical Analysis of the text
  • Task: convert a string of characters into a sequence of words.
  • The main task is to deal with spaces, e.g., multiple spaces are treated as one space.
  • Digits: ignoring numbers is a common approach, but there are special cases. Numbers such as 1999 or 2000 standing for specific years are important, and mixed alphanumeric strings such as 510B.C. are important too. A 16-digit number might be a credit card number.
  • Hyphens: state-of-the-art and state of the art should be treated as the same.
  • Punctuation marks: remove them. Exception: 510B.C.
  • Case of letters: lower and upper case are usually treated as the same.
  • There are many exceptions, so lexical analysis is often semi-automatic.
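A minimal sketch of such a lexical analyzer in Python (illustrative only: the regular expression and the keep-4-digit-years rule are my assumptions, not the textbook's):

  import re

  def tokenize(text):
      # Case: treat lower and upper case as the same.
      text = text.lower()
      # Hyphens: treat "state-of-the-art" like "state of the art".
      text = text.replace("-", " ")
      # Punctuation: keep only alphanumeric runs, allowing inner
      # periods so that mixed tokens like "510b.c" survive.
      tokens = re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text)
      kept = []
      for t in tokens:
          # Digits: drop pure numbers, except 4-digit years like 1999.
          if t.isdigit() and len(t) != 4:
              continue
          kept.append(t)
      return kept

  print(tokenize("State-of-the-art retrieval in 1999 (see 510B.C., page 16)"))
  # ['state', 'of', 'the', 'art', 'retrieval', 'in', '1999', 'see', '510b.c', 'page']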

5
Elimination of Stopwords
  • Words that appear too often are not useful for IR.
  • Stopwords: words that appear in more than 80% of the documents in the collection are stopwords and are filtered out as potential index words.
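A small sketch of this document-frequency test (the function name and sample data are hypothetical; the 80% threshold is the slide's):

  from collections import Counter

  def find_stopwords(docs, threshold=0.80):
      # docs: list of token lists; count document frequency per word.
      df = Counter()
      for doc in docs:
          df.update(set(doc))          # one count per document, not per use
      return {w for w, c in df.items() if c / len(docs) > threshold}

  docs = [["the", "cat"], ["the", "dog"], ["the", "eel"], ["the", "fox"]]
  print(find_stopwords(docs))          # {'the'}: appears in 4/4 documents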

6
Stemming
  • Stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes or suffixes).
  • Example: connect is the stem for connected, connecting, connection, connections.
  • The Porter algorithm uses a suffix list for suffix stripping, with rules such as sses → ss and s → (null).
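A toy illustration of just those two rules (a tiny fragment in the spirit of Porter's suffix list, not the full algorithm):

  def strip_plural_suffix(word):
      # Porter-style rules: "sses" -> "ss", then a plain "s" -> "".
      if word.endswith("sses"):
          return word[:-2]             # caresses -> caress
      if word.endswith("s") and not word.endswith("ss"):
          return word[:-1]             # cats -> cat
      return word

  for w in ["caresses", "cats", "caress"]:
      print(w, "->", strip_plural_suffix(w))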

7
  • Index term selection
  • Identification of noun groups
  • Treat nouns that appear close together as a single component, e.g., computer science

8
Thesaurus
  • Thesaurus: a precompiled list of important words in a given domain of knowledge; for each word in this list, there is a set of related words.
  • Vocabulary control in an information retrieval
    system
  • Thesaurus construction
  • Manual construction
  • Automatic construction

9
Vocabulary control
  • Standard vocabulary for both indexing and
    searching (for the constructors of the system and
    the users of the system)

10
Objectives of vocabulary control
  • To promote the consistent representation of subject matter by indexers and searchers, thereby avoiding the dispersion of related materials.
  • To facilitate the conduct of a comprehensive
    search on some topic by linking together terms
    whose meanings are related paradigmatically.

11
Thesaurus
  • Unlike a common dictionary, which lists words with their explanations
  • May contain all the words of a language, or only the words of a specific domain
  • Carries a lot of other information, especially the relationships between words
  • Classification of the words in the language
  • Word relationships such as synonyms and antonyms

12
On-Line Thesaurus
  • http://www.thesaurus.com
  • http://www.dictionary.com/
  • http://www.cogsci.princeton.edu/~wn/

13
Dictionary vs. Thesaurus
Check information using http://www.thesaurus.com

Dictionary:
  • information (ĭn′fər-mā′shən) n.
  • Knowledge derived from study, experience, or instruction.
  • Knowledge of specific events or situations that has been gathered or received by communication; intelligence or news. See Synonyms at knowledge.
  • ......

Thesaurus:
  • Nouns: information, enlightenment, acquaintance
  • Verbs: tell, inform, inform of, acquaint, acquaint with, impart
  • Adjectives: informed, reported, published
14
Use of Thesaurus
  • To control the terms used in indexing: for a specific domain, use only the terms in the thesaurus as indexing terms
  • To assist users in forming proper queries, via the help information contained in the thesaurus

15
Construction of Thesaurus
  • Stemming can be used to reduce the size of the thesaurus
  • Can be constructed either manually or
    automatically

16
WordNet manually constructed
  • WordNet is an online lexical reference system
    whose design is inspired by current
    psycholinguistic theories of human lexical
    memory. English nouns, verbs, adjectives and
    adverbs are organized into synonym sets, each
    representing one underlying lexical concept.
    Different relations link the synonym sets.

17
Relations in WordNet
18
Automatic Thesaurus Construction
  • A variety of methods can be used to construct the thesaurus
  • Term similarity can be used for constructing the thesaurus

19
Complete Term Relation Method
  • The term-document relationship can be calculated using a variety of methods, such as tf-idf
  • Term similarity can then be calculated based on the term-document relationship
  • For example, see the sketch below
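One common choice is the inner product of the term rows of a term-document weight matrix; a sketch with made-up weights (numpy is assumed):

  import numpy as np

  # Rows = terms, columns = documents; entries could be tf-idf weights.
  W = np.array([[0.5, 0.0, 0.3],
                [0.0, 0.8, 0.1],
                [0.4, 0.0, 0.6]])

  # similarity(Ti, Tj) = sum over documents d of W[i, d] * W[j, d]
  sim = W @ W.T
  print(sim[0, 2])   # 0.5*0.4 + 0.0*0.0 + 0.3*0.6 ≈ 0.38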

20
Complete Term Relation Method
Set threshold to 10
21
Complete Term Relation Method
  • Groups (see the sketch below):
  • {T1, T3, T4, T6}
  • {T1, T5}
  • {T2, T4, T6}
  • {T2, T6, T8}
  • {T7}
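In this method, two terms belong together when their similarity exceeds the threshold; a sketch that lists the above-threshold pairs, from which such groups (maximal cliques of the "similar to" graph) can be read off. The matrix values here are made up:

  import numpy as np

  # Hypothetical symmetric term-term similarity matrix for T1..T4.
  sim = np.array([[ 0, 12,  3, 11],
                  [12,  0,  2,  4],
                  [ 3,  2,  0, 15],
                  [11,  4, 15,  0]])

  def similar_pairs(sim, threshold=10):
      n = sim.shape[0]
      return [(i + 1, j + 1)           # 1-based labels, T1..Tn
              for i in range(n) for j in range(i + 1, n)
              if sim[i, j] > threshold]

  print(similar_pairs(sim))            # [(1, 2), (1, 4), (3, 4)]
  # Groups: {T1, T2}, {T1, T4}, {T3, T4}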

22
Indexing
  • Arrangement of data (data structure) to permit
    fast searching
  • Which list is easier to search?
  • sow fox pig eel yak hen ant cat dog hog
  • ant cat dog eel fox hen hog pig sow yak

23
Creating inverted files
[Diagram: word extraction maps the original documents to word IDs and document IDs, yielding inverted files of the form W1 → d1, d2, d3; W2 → d2, d4, d7, d9; ...; Wn → di, ..., dn]
24
Creating Inverted file
  • Map the file names to file IDs
  • Consider the following original documents (shown on the slides)
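A minimal sketch of the construction as a word → document-ID mapping (the sample documents are hypothetical):

  from collections import defaultdict

  docs = {1: "ant cat dog", 2: "cat eel", 3: "dog fox ant"}

  inverted = defaultdict(set)
  for doc_id, text in docs.items():
      for word in text.split():        # a real system would also remove
          inverted[word].add(doc_id)   # stopwords and stem, as below

  for word in sorted(inverted):        # sorted, as in the later slides
      print(word, sorted(inverted[word]))
  # ant [1, 3]
  # cat [1, 2]
  # dog [1, 3]
  # eel [2]
  # fox [3]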

25
Creating Inverted file
Red = stop word
26
Creating Inverted file
After stemming, make lowercase (optional), delete numbers (optional)
27
Creating Inverted file (unsorted)
28
Creating Inverted file (sorted)
29
Searching on Inverted File
  • Binary search
  • Used on a small scale
  • Create a thesaurus and combine techniques such as:
  • Hashing
  • B-tree
  • Pointers to the addresses in the indexed file
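A sketch of the binary search over the sorted term list, using Python's bisect (the vocabulary and posting lists are illustrative):

  import bisect

  terms    = ["ant", "cat", "dog", "eel", "fox"]   # sorted vocabulary
  postings = [[1, 3], [1, 2], [1, 3], [2], [3]]    # parallel posting lists

  def lookup(term):
      i = bisect.bisect_left(terms, term)          # O(log n) comparisons
      if i < len(terms) and terms[i] == term:
          return postings[i]
      return []

  print(lookup("dog"))   # [1, 3]
  print(lookup("pig"))   # []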

30
Huffman codes
  • Binary character code: each character is represented by a unique binary string.
  • A data file can be coded in two ways:

                        a    b    c    d    e    f
frequency (%)           45   13   12   16   9    5
fixed-length code       000  001  010  011  100  101
variable-length code    0    101  100  111  1101 1100

The first way needs 100 × 3 = 300 bits. The second way needs 45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4 = 224 bits.
31
Variable-length code
  • Some care is needed to read the code.
  • 001011101 (codewords: a = 0, b = 00, c = 01, d = 11)
  • Where to cut? 00 can be explained as either aa or b.
  • Prefixes of 0011: 0, 00, 001, and 0011.
  • Prefix codes: no codeword is a prefix of some other codeword (prefix-free).
  • Prefix codes are simple to encode and decode; see the sketch below.
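A sketch of that decoding, scanning bits and emitting a character as soon as the buffer matches a codeword (the prefix-free property guarantees this greedy match is correct; the table is the variable-length code above):

  codes = {"a": "0", "b": "101", "c": "100",
           "d": "111", "e": "1101", "f": "1100"}
  decode_map = {v: k for k, v in codes.items()}

  def decode(bits):
      out, buf = [], ""
      for bit in bits:
          buf += bit
          if buf in decode_map:        # prefix-free: first match is final
              out.append(decode_map[buf])
              buf = ""
      return "".join(out)

  print(decode("001011101"))           # aabe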

32
Using codeword in Table to encode and decode
  • Encode abc = 0.101.100 = 0101100
  • (just concatenate the codewords)
  • Decode 001011101 = 0.0.101.1101 = aabe

                        a    b    c    d    e    f
frequency (%)           45   13   12   16   9    5
fixed-length code       000  001  010  011  100  101
variable-length code    0    101  100  111  1101 1100
33
  • Encode abc = 0.101.100 = 0101100
  • (just concatenate the codewords)
  • Decode 001011101 = 0.0.101.1101 = aabe
  • (use the right binary tree below)

Tree for the fixed-length codeword
Tree for the variable-length codeword
34
Binary tree
  • Every nonleaf node has two children.
  • The fixed-length code in our example is not
    optimal.
  • The total number of bits required to encode a file is B(T) = Σ_{c ∈ C} f(c) · d_T(c), where
  • f(c) is the frequency (number of occurrences) of c in the file
  • d_T(c) denotes the depth of c's leaf in the tree

35
Constructing an optimal code
  • Formal definition of the problem:
  • Input: a set of characters C = {c1, c2, ..., cn}, where each c ∈ C has frequency f[c].
  • Output: a binary tree representing codewords so that the total number of bits required for the file is minimized.
  • Huffman proposed a greedy algorithm to solve the
    problem.

36
[Figure: construction steps (a) and (b), starting from the leaf nodes a:45, b:13, c:12, d:16, e:9, f:5]
37
[Figure: construction steps (c) and (d)]
38
[Figure: construction steps (e) and (f)]
39
HUFFMAN(C)
1  n ← |C|
2  Q ← C
3  for i ← 1 to n-1
4      do z ← ALLOCATE-NODE()
5         x ← left[z] ← EXTRACT-MIN(Q)
6         y ← right[z] ← EXTRACT-MIN(Q)
7         f[z] ← f[x] + f[y]
8         INSERT(Q, z)
9  return EXTRACT-MIN(Q)
40
The Huffman Algorithm
  • This algorithm builds the tree T corresponding to
    the optimal code in a bottom-up manner.
  • C is a set of n characters, and each character c in C has a defined frequency f[c].
  • Q is a priority queue, keyed on f, used to
    identify the two least-frequent characters to
    merge together.
  • The result of the merger is a new object
    (internal node) whose frequency is the sum of
    the two objects.
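A runnable sketch of the same procedure using Python's heapq as the priority queue (the node representation and the tie-breaking counter are my own choices; the codeword lengths match an optimal tree even if the individual 0/1 labels differ from the slides):

  import heapq, itertools

  def huffman(freqs):
      tie = itertools.count()          # breaks ties so tuples stay comparable
      heap = [(f, next(tie), c) for c, f in freqs.items()]
      heapq.heapify(heap)
      for _ in range(len(freqs) - 1):  # n-1 merges (lines 3-8 above)
          fx, _, x = heapq.heappop(heap)   # two least-frequent nodes
          fy, _, y = heapq.heappop(heap)
          heapq.heappush(heap, (fx + fy, next(tie), (x, y)))
      _, _, root = heap[0]

      codes = {}
      def walk(node, prefix):          # read codewords off the tree
          if isinstance(node, tuple):
              walk(node[0], prefix + "0")
              walk(node[1], prefix + "1")
          else:
              codes[node] = prefix or "0"
      walk(root, "")
      return codes

  print(huffman({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))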

41
Time complexity
  • Lines 4-8 are executed n-1 times.
  • Each heap operation in Lines 4-8 takes O(lg n)
    time.
  • Total time required is O(n lg n).
  • Note: the details of the heap operations will not be tested. The time complexity O(n lg n) should be remembered.

42
Another example
[Figure: another construction, starting from the leaf nodes e:4, a:6, c:6, b:9, d:11]
43
[Figure: the construction continues]
44
45
Correctness of Huffmans Greedy Algorithm
(Fun Part, not required)
  • Again, we use our general strategy.
  • Let x and y be the two characters in C having the lowest frequencies (the first two characters selected by the greedy algorithm).
  • We will show two properties:
  • 1. There exists an optimal solution Topt (a binary tree representing codewords) such that x and y are siblings in Topt.
  • 2. Let z be a new character with frequency f[z] = f[x] + f[y] and C' = C - {x, y} ∪ {z}. Let T' be an optimal tree for C'. Then we can get Topt from T' by replacing z with an internal node whose children are x and y.
46
Proof of Property 1
[Figure: Topt and Tnew]
  • Look at the lowest siblings in Topt, say, b and c.
  • Exchange x with b and y with c.
  • B(Topt) - B(Tnew) ≥ 0 since f[x] and f[y] are the smallest.
  • Property 1 is proved.

47
  • Property 2: Let z be a new character with frequency f[z] = f[x] + f[y] and C' = C - {x, y} ∪ {z}. Let T' be an optimal tree for C'. Then we can get Topt from T' by replacing z with an internal node whose children are x and y.
  • Proof: Let T be the tree obtained from T' by replacing z with the three nodes (z as an internal node with children x and y).
  • B(T) = B(T') + f[x] + f[y].   (1)
  • (The codes for x and y are 1 bit longer than the code for z.)
  • Now prove T = Topt by contradiction.
  • If T ≠ Topt, then B(T) > B(Topt).   (2)
  • From Property 1, x and y are siblings in Topt.
  • Thus, we can delete x and y from Topt and get another tree T'' for C'.
  • B(T'') = B(Topt) - f[x] - f[y] < B(T) - f[x] - f[y] = B(T'), using (2) for the inequality and (1) for the last equality.
  • Thus, B(T'') < B(T'), contradicting the optimality of T'. Hence T is optimal.