1
INSYS 300 Text Analysis
  • Dr. Xia Lin
  • Associate Professor
  • College of Information Science and Technology
  • Drexel University

2
Improving the Indexing
  • So far we have treated words simply as tokens when
    creating the inverted index
  • To improve the indexing, we should also consider
  • meanings of words
  • structures of language
  • word usage

3
Text Analysis
  • Word (token) extraction
  • Stop words
  • Stemming
  • Word frequency counts
  • Inverse document frequency
  • Zipf's Law

4
Stop words
  • Many of the most frequently used words in English
    are worthless for indexing; such words are called
    stop words.
  • the, of, and, to, ...
  • Typically about 400 to 500 such words
  • Why do we need to remove stop words? (see the
    sketch after this list)
  • Reduce the indexing file size
  • stop words account for 20-30% of total word counts
  • Improve efficiency
  • stop words are not useful for searching
  • stop words always have a large number of hits
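A minimal sketch of the filtering step in C, assuming a tiny illustrative stop list and a hand-made token array (a real system would use the full 400-500 word list):

    #include <stdio.h>
    #include <string.h>

    /* Illustrative stop list; real lists hold 400-500 words. */
    static const char *stop_words[] = { "the", "of", "and", "to", "a", "in" };

    /* Return 1 if token is on the stop list, 0 otherwise. */
    static int is_stop_word(const char *token) {
        for (size_t i = 0; i < sizeof stop_words / sizeof stop_words[0]; i++)
            if (strcmp(token, stop_words[i]) == 0)
                return 1;
        return 0;
    }

    int main(void) {
        const char *tokens[] = { "the", "meanings", "of", "words" };
        for (int i = 0; i < 4; i++)
            if (!is_stop_word(tokens[i]))
                printf("index: %s\n", tokens[i]); /* keeps "meanings", "words" */
        return 0;
    }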

5
Stop words
  • Potential problems of removing stop words
  • a small stop list does not improve indexing much
  • a large stop list may eliminate some words that
    might be useful for someone or for some purposes
  • stop words might be part of phrases
  • stop words need to be processed in both indexing
    and queries

6
Stemming
  • Techniques used to find the root/stem of a word
  • Example lookup:

        user    15      engineering   12
        users    4      engineered    23
        used     5      engineer      12
        using    5

        stem: use       stem: engineer

7
Advantages of stemming
  • Improving effectiveness
  • matching similar words
  • Reducing indexing size
  • combining words with the same root may reduce
    indexing size by as much as 40-50%
  • Criteria for stemming
  • correctness
  • retrieval effectiveness
  • compression performance

8
Basic stemming methods
  • Use tables and rules
  • Remove endings
  • if a word ends with a consonant other than "s",
    followed by an "s", then delete the "s"
  • if a word ends in "es", drop the "s"
  • if a word ends in "ing", delete the "ing" unless
    the remaining word consists of only one letter or
    of "th"
  • if a word ends with "ed", preceded by a consonant,
    delete the "ed" unless this leaves only a single
    letter
  • ...

9
  • Transform the remaining word
  • if a word ends with "ies" but not "eies" or
    "aies", then replace "ies" with "y"
    (a sketch of these rules follows)
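The removal and transform rules above can be coded directly. A minimal sketch in C of just the rules listed on these two slides (the rule order is my own choice: the "ies" transform is tried first so it wins over the plain "es" rule; a real stemmer has many more rules):

    #include <stdio.h>
    #include <string.h>

    static int is_vowel(char c) { return strchr("aeiou", c) != NULL; }

    /* Apply the slide rules in place; stop after the first rule that fires. */
    static void stem(char *w) {
        size_t n = strlen(w);
        /* "ies" -> "y", unless the ending is "eies" or "aies" */
        if (n > 4 && strcmp(w + n - 3, "ies") == 0 &&
            w[n - 4] != 'e' && w[n - 4] != 'a') {
            strcpy(w + n - 3, "y");
            return;
        }
        /* ends in "es": drop the "s" */
        if (n > 2 && strcmp(w + n - 2, "es") == 0) { w[n - 1] = '\0'; return; }
        /* consonant other than "s" followed by "s": delete the "s" */
        if (n > 2 && w[n - 1] == 's' && !is_vowel(w[n - 2]) && w[n - 2] != 's') {
            w[n - 1] = '\0';
            return;
        }
        /* ends in "ing": delete it unless one letter or "th" would remain */
        if (n > 3 && strcmp(w + n - 3, "ing") == 0) {
            size_t rem = n - 3;
            if (rem > 1 && !(rem == 2 && strncmp(w, "th", 2) == 0)) {
                w[rem] = '\0';
                return;
            }
        }
        /* ends in "ed" preceded by a consonant: delete it
           (n > 3 guarantees more than one letter remains) */
        if (n > 3 && strcmp(w + n - 2, "ed") == 0 && !is_vowel(w[n - 3]))
            w[n - 2] = '\0';
    }

    int main(void) {
        char words[][16] = { "users", "engineering", "studies", "thing" };
        for (int i = 0; i < 4; i++) {
            stem(words[i]);
            printf("%s\n", words[i]); /* user, engineer, study, thing */
        }
        return 0;
    }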

10
Example 1: The Porter Stemming Algorithm
  • A set of condition/action rules
  • conditions on the stem
  • conditions on the suffix
  • conditions on the rules
  • Different combinations of conditions activate
    different rules.
  • Implementation (stem.c)
  • Stem(word)
  •   ...
  •   ReplaceEnd(word, step1a_rules);
  •   rule = ReplaceEnd(word, step1b_rules);
  •   if (rule == 106 || rule == 107)
  •     ReplaceEnd(word, step1b1_rules);

11
Example 2: Sound-based stemming
  • Soundex rules
  • Letter                    Numeric equivalent
  • B, F, P, V                1
  • C, G, J, K, Q, S, X, Z    2
  • D, T                      3
  • L                         4
  • M, N                      5
  • R                         6
  • A, E, I, O, U, W, Y       not coded
  • Words that sound similar often have the same code
  • The code is not unique
  • High compression rate (a sketch follows)
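A compact C sketch of the table above, using the classic Soundex shape (keep the first letter, then up to three digits, skipping immediate repeats of the same code); the example words are my own:

    #include <ctype.h>
    #include <stdio.h>

    /* Map a letter to its Soundex digit per the table; 0 = not coded. */
    static char digit(char c) {
        switch (toupper((unsigned char)c)) {
        case 'B': case 'F': case 'P': case 'V': return '1';
        case 'C': case 'G': case 'J': case 'K':
        case 'Q': case 'S': case 'X': case 'Z': return '2';
        case 'D': case 'T': return '3';
        case 'L': return '4';
        case 'M': case 'N': return '5';
        case 'R': return '6';
        default:  return 0;  /* A, E, I, O, U, W, Y: not coded */
        }
    }

    /* Build a 4-character code: initial letter + up to three digits. */
    static void soundex(const char *word, char code[5]) {
        int n = 1;
        char prev = digit(word[0]);
        code[0] = (char)toupper((unsigned char)word[0]);
        for (int i = 1; word[i] != '\0' && n < 4; i++) {
            char d = digit(word[i]);
            if (d != 0 && d != prev)
                code[n++] = d;
            prev = d;
        }
        while (n < 4) code[n++] = '0';  /* pad short codes */
        code[4] = '\0';
    }

    int main(void) {
        char a[5], b[5];
        soundex("Robert", a);
        soundex("Rupert", b);
        printf("%s %s\n", a, b);  /* both print R163: same code, not unique */
        return 0;
    }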

12
Example 3: N-gram stemmers
  • An n-gram is n consecutive letters
  • A digram is 2 consecutive letters
  • A trigram is 3 consecutive letters
  • All digrams of the word "statistics" are
  • st ta at ti is st ti ic cs
  • Unique: at cs ic is st ta ti
  • All digrams of "statistical" are
  • st ta at ti is st ti ic ca al
  • Unique: al at ca ic is st ta ti

13
  • The similarity of two words can be calculated by
    Dice's coefficient:
  • S = 2C / (A + B)
  • where
  • A is the number of unique digrams in the first
    word
  • B is the number of unique digrams in the second
    word
  • C is the number of unique digrams shared by the
    two words (a sketch follows)
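A C sketch of the whole procedure, assuming single words short enough for a fixed 64-slot digram buffer; run on the slide's example it prints 0.80 (A = 7, B = 8, C = 6):

    #include <stdio.h>
    #include <string.h>

    /* Collect the unique digrams of w into out; return how many. */
    static int unique_digrams(const char *w, char out[][3]) {
        int n = 0;
        size_t len = strlen(w);
        for (size_t i = 0; i + 1 < len; i++) {
            char d[3] = { w[i], w[i + 1], '\0' };
            int seen = 0;
            for (int j = 0; j < n; j++)
                if (strcmp(out[j], d) == 0) { seen = 1; break; }
            if (!seen)
                strcpy(out[n++], d);
        }
        return n;
    }

    /* Dice's coefficient S = 2C / (A + B) over unique digrams. */
    static double similarity(const char *w1, const char *w2) {
        char a[64][3], b[64][3];
        int A = unique_digrams(w1, a);
        int B = unique_digrams(w2, b);
        int C = 0;
        for (int i = 0; i < A; i++)
            for (int j = 0; j < B; j++)
                if (strcmp(a[i], b[j]) == 0) { C++; break; }
        return 2.0 * C / (A + B);
    }

    int main(void) {
        printf("%.2f\n", similarity("statistics", "statistical")); /* 0.80 */
        return 0;
    }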

14
Frequency counts
  • The idea
  • The best a computer can do is count
  • count the number of times a word occurs in a
    document
  • count the number of documents in a collection
    that contain a word
  • Use occurrence frequencies to indicate the
    relative importance of a word in a document
  • if a word appears often in a document, the
    document likely deals with subjects related to
    the word

15
  • Use occurrence frequencies to select the most
    useful words to index a document collection
  • if a word appears in every document, it is not a
    good indexing word
  • if a word appears in only one or two documents,
    it may not be a good indexing word
  • if a word appears in a title, each occurrence
    should be counted 5 (or 10) times

16
Salton's Vector Space
  • A document is represented as a vector
  • (w1, w2, ..., wn)
  • Binary
  • wi = 1 if the corresponding term is in the
    document
  • wi = 0 if the term is not in the document
  • TF (Term Frequency)
  • wi = tfi, where tfi is the number of times term i
    occurs in the document
  • TF-IDF (Term Frequency × Inverse Document
    Frequency)
  • wi = tfi × idfi = tfi × (1 + log(N/dfi)), where
    dfi is the number of documents containing term i,
    and N is the total number of documents in the
    collection (a numeric sketch follows)
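A small numeric sketch of the TF-IDF weighting in C, using made-up tf and df values for a collection of N = 4 documents (natural log here; the log base is a convention):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double N = 4.0;      /* total documents in the collection */
        int tf[] = { 3, 3, 3 };    /* same raw count for each term ... */
        int df[] = { 4, 2, 1 };    /* ... but different document spread */
        for (int i = 0; i < 3; i++) {
            double w = tf[i] * (1.0 + log(N / df[i]));
            printf("tf=%d df=%d w=%.2f\n", tf[i], df[i], w);
        }
        return 0;
    }

With equal term frequency, the weight rises from 3.00 (df = 4) to 7.16 (df = 1): the rarer a term is across the collection, the more weight it carries.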

17
Inverse Document Frequency
  • idf_k = log(N / D_k)
  • where N is the total number of documents and D_k
    is the number of documents containing the k-th
    term.
18
IDF-based Indexing
19
Example
  • D1: a b c a f o n l p o f t y x
  • D2: a m o e e e n n n a n p l
  • D3: r a c e e f n l i f f f f x l
  • D4: a f f f f c d e e f g h l l x
  • Calculate the term frequencies of terms a, b, and
    c in each document.
  • Calculate the inverse document frequencies of a,
    b, and c. (a worked check follows)
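One way to check your answers (a worked sketch using the conventions above, with natural logarithms):

    tf(a): D1 = 2, D2 = 2, D3 = 1, D4 = 1
    tf(b): D1 = 1, D2 = 0, D3 = 0, D4 = 0
    tf(c): D1 = 1, D2 = 0, D3 = 1, D4 = 1

    df(a) = 4, df(b) = 1, df(c) = 3, with N = 4 documents

    idf(a) = log(4/4) = 0
    idf(b) = log(4/1) ≈ 1.39
    idf(c) = log(4/3) ≈ 0.29

Because a appears in every document, its idf is 0: it has no discriminating power as an indexing term.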

20
Automatic indexing
  • 1. Parse individual words (tokens)
  • 2. Remove stop words
  • 3. Apply stemming
  • 4. Use frequency data
  • decide the head (high-frequency) threshold
  • decide the tail (low-frequency) threshold
  • decide the variance of counting

21
  • 5. Create the indexing structure
  • inverted index (a minimal sketch follows)
  • other structures
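A minimal sketch in C of what an inverted-index entry can look like, hand-built here for the terms a, b, and c from the earlier example (each term maps to the documents that contain it, with per-document counts):

    #include <stdio.h>

    struct posting { int doc_id; int tf; };   /* one document, one count */
    struct entry {
        const char *term;
        int df;                               /* number of postings */
        struct posting postings[4];
    };

    int main(void) {
        struct entry inv[] = {
            { "a", 4, { {1, 2}, {2, 2}, {3, 1}, {4, 1} } },
            { "b", 1, { {1, 1} } },
            { "c", 3, { {1, 1}, {3, 1}, {4, 1} } },
        };
        for (int i = 0; i < 3; i++) {
            printf("%s (df=%d):", inv[i].term, inv[i].df);
            for (int j = 0; j < inv[i].df; j++)
                printf(" D%d:%d", inv[i].postings[j].doc_id,
                       inv[i].postings[j].tf);
            printf("\n");
        }
        return 0;
    }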

22
More about Counting
  • Zipf's Law
  • In a large, well-written English document,
  • r × f = c
  • where
  • r is the rank of a word when words are ordered by
    frequency,
  • f is the number of times the given word is used
    in the document,
  • c is a constant.
    (a numeric check follows)
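A quick numeric check of the law in C, using hypothetical frequencies for the ten most common words of some large document (not real data); if the law holds, the product r × f stays roughly constant:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical frequencies, already sorted by rank. */
        int f[] = { 1000, 480, 350, 240, 200, 170, 140, 120, 110, 100 };
        for (int r = 1; r <= 10; r++)
            printf("rank %2d  f = %4d  r*f = %d\n", r, f[r - 1], r * f[r - 1]);
        return 0;   /* r*f hovers near 1000 for every rank */
    }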

23
  • Zipfs Law is an observation of a fact in
    proximity.
  • Examples
  • Word frequencies in Alice in Wonderland
  • Zipfs Law has been verified for many many years
    on many different collections.
  • There are also many revised version of Ziphs Law.

24
More about Counting
  • English letter usage statistics
  • Letter use frequencies
  • E  72881  (12.4%)
  • T  52397  (8.9%)
  • A  47072  (8.0%)
  • O  45116  (7.6%)
  • N  41316  (7.0%)
  • I  39710  (6.7%)
  • H  38334  (6.5%)

25
  • Doubled letter frequencies
  • LL  2979  (20.6%)
  • EE  2146  (14.8%)
  • SS  2128  (14.7%)
  • OO  2064  (14.3%)
  • TT  1169  (8.1%)
  • RR  1068  (7.4%)
  • --   701  (4.8%)
  • PP   628  (4.3%)
  • FF   430  (2.9%)

26
  • Initial letter frequencies
  • T  20665  (15.2%)
  • A  15564  (11.4%)
  • H  11623  (8.5%)
  • W   9597  (7.0%)
  • I   9468  (6.9%)
  • S   9376  (6.9%)
  • O   8205  (6.0%)
  • M   6293  (4.6%)
  • B   5831  (4.2%)

27
  • Ending letter frequencies
  • E  26439  (19.4%)
  • D  17313  (12.7%)
  • S  14737  (10.8%)
  • T  13685  (10.0%)
  • N  10525  (7.7%)
  • R   9491  (6.9%)
  • Y   7915  (5.8%)
  • O   6226  (4.5%)

28
Term Associations
  • Counting word pairs
  • If two words appear together very often, they are
    likely to be a phrase
  • Counting document pairs
  • If two documents have many common words, they are
    likely related

29
More Counting
  • Counting citation pairs
  • If documents A and B both cite documents C and D,
    then A and B might be related.
  • If documents C and D are often cited together,
    they are likely related. (a counting sketch
    follows)
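A sketch in C of counting co-citations from a small, made-up citation matrix: cites[p][d] is 1 when later document p cites earlier document d, and the co-citation count of two earlier documents is the number of later documents that cite both:

    #include <stdio.h>

    #define N_LATER 4   /* later (citing) documents */
    #define N_EARLY 3   /* earlier (cited) documents */

    int main(void) {
        int cites[N_LATER][N_EARLY] = {   /* illustrative data only */
            { 1, 1, 0 },
            { 1, 1, 1 },
            { 0, 1, 1 },
            { 1, 1, 0 },
        };
        for (int c = 0; c < N_EARLY; c++)
            for (int d = c + 1; d < N_EARLY; d++) {
                int count = 0;
                for (int p = 0; p < N_LATER; p++)
                    count += cites[p][c] && cites[p][d];
                /* recurrent co-citation (a high count) suggests the two
                   earlier documents are strongly related */
                printf("co-citation(C%d, C%d) = %d\n", c + 1, d + 1, count);
            }
        return 0;
    }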

30
Co-Citation
  • The college has a more than 20-year tradition of
    co-citation research.
  • Co-citation is the mentioning of any two earlier
    documents in the bibliographic references of a
    later, third document.

(Diagram: a later Document 3 cites both Document 1 and Document 2, so the two earlier documents are co-cited.)
31
Co-Citation Analysis
  • The count of mentions may grow over time as new
    writings appear. Thus, co-citation counts can
    reflect citers' changing perceptions of documents
    as more or less strongly related.
  • Documents shown to be related by their
    co-citation counts can be mapped as proximate in
    intellectual space.

32
Co-Citation Mapping
  • Detects patterns in the frequency with which any
    works by any two authors are jointly cited in
    later works.
  • Only recurrent co-citation is significant: the
    more times authors are cited together, the more
    strongly related they are in the eyes of citers.

33
AuthorLinks
34
Midterms
  • Concepts
  • What is information retrieval?
  • Data, information, text, and documents
  • What is a controlled vocabulary?
  • Two abstraction principles
  • Considerations of document representation
  • Queries and query formats
  • What is a document vector space?
  • What are tf and idf?

35
  • Procedure problem solving
  • steps of creating automatic indexing
  • creating vector spaces
  • calculating similarity
  • calculating tf and idf
  • Boolean query matching
  • Vector query matching
  • Discussions
  • Advantages and disadvantages of .
  • What can we do to improve automatic indexing?
  • Why do we do