Text Mining Application Programming Chapter 3 Explore Text

Slides: 40
Provided by: jdw6

1
Text Mining Application Programming Chapter 3
Explore Text
  • Manu Konchady, 2006

3
Outline
  • Words
  • Zipf's Law
  • Sentences
  • Indexing Document Text

4
Extracting words from text
  • A linguistic definition of a word is the smallest
    syntactic unit that cannot be broken into smaller
    segments.
  • Words in a sequence governed by the grammar of
    the language form sentences.

5
The eight standard parts of speech
  • Nouns
  • Verbs
  • Adjectives
  • Adverbs
  • Conjunctions
  • Determiners
  • Prepositions
  • Pronouns

Content words: nouns, verbs, adjectives, adverbs
Function words: conjunctions, determiners, prepositions, pronouns
6
Five types of phrases
  • Noun phrases: a good day
  • Verb phrases: had thought, was right, and will be jumping
  • Adjective phrases: a nice shiny
  • Prepositional phrases: with very long hair

7
Words vs. Tokens
  • A token is a more formal definition of a single
    unit of text.
  • A single word may not be the smallest unit of
    text and a token may consist of one or more
    words.
  • We will use token to represent the smallest unit
    of text processed in the higher layers of our
    model.

8
Complex tokens
  • Yahoo!, AT&T, Hancock&Co.
  • Mr. Smith, lb., or 192.168.1.1
  • New York-New Jersey, small-scale, or x-ray
  • Web URL
  • -3.1415E-01
  • 888-555-9999
  • (-lt).
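
A minimal Python sketch of how such complex tokens might be kept whole. The pattern list is an assumption made for illustration and is not the TextMine tokenizer; it only shows why a tokenizer has to try the specific patterns (URLs, IP addresses, phone numbers, signed numbers, abbreviations) before falling back to plain words.

```python
import re

# Hypothetical patterns, tried left to right; the first alternative that
# matches at a position wins, so specific patterns come before general ones.
TOKEN_RE = re.compile(r"""
    https?://\S+                            # web URL
  | \d{1,3}(?:\.\d{1,3}){3}                 # IP address such as 192.168.1.1
  | \d{3}-\d{3}-\d{4}                       # phone number such as 888-555-9999
  | [-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?     # number, optionally signed or scientific
  | (?:Mr|Mrs|Dr|lb)\.                      # a few abbreviations that keep their period
  | \w+(?:-\w+)*                            # word, possibly hyphenated (x-ray)
  | \S                                      # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    """Return the tokens of text in order, skipping whitespace."""
    return TOKEN_RE.findall(text)
```

With this sketch, `tokenize("Mr. Smith called 888-555-9999.")` keeps `Mr.` and the phone number as single tokens and emits the final period separately. A real tokenizer would need a much longer abbreviation list.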

9
  • Vector representations of documents used in
    clustering and text categorization are made up of
    a sequence of tokens and weights.
  • Documents can be correctly categorized only when
    the vector accurately represents the contents of
    the documents.

10
Token Assembly
11
Abbreviations
  • Currencies
  • Dimensions
  • Time
  • Places
  • Organizations.

12
Base Words
  • A base word is the root form of a word that can
    be found in the WordNet dictionary.
  • Jump (base word)
  • Jumps

13
Word Stems
  • A word stem is a root form of a word.
  • Prevent
  • Prevents
  • Prevented
  • Preventing
  • Prevention
  • Porter's stemming algorithm
  • TextMine/token.pm
  • http://cpan.org
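
The idea of reducing word forms to a shared stem can be sketched as below. This is a deliberately crude suffix stripper written for illustration, not Porter's algorithm (which applies several ordered rule phases with conditions on the remaining stem) and not the TextMine code; the suffix list and the minimum stem length are assumptions.

```python
# Illustrative suffix list only; checked in this order.
SUFFIXES = ("ing", "ion", "ed", "s")

def crude_stem(word):
    """Strip one known suffix, keeping at least three letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# All five forms from the slide collapse to the same stem.
forms = ["Prevent", "Prevents", "Prevented", "Preventing", "Prevention"]
stems = {crude_stem(w) for w in forms}
```

Here `stems` contains the single string `prevent`. The sketch over-strips many ordinary words (for example, `bus` becomes `bu`), which is exactly the kind of case a real stemmer's extra conditions are meant to handle.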

14
Word and Meaning Relationships
  • A thesaurus organizes words and word meanings.
  • In WordNet 2.0:
  • 115,775 word meanings, or synonym sets (synsets)
  • 152,217 word forms
  • Antonyms
  • Words with opposite meanings
  • Rich and poor
  • Hot and cold

15
Organize word meanings into an acyclic hierarchy
  • Hypernym: the parent node
  • Hyponyms: the child nodes

16
  • Meronyms
  • Finger is a meronym of the word hand.
  • Hand is a meronym of the word human.
  • Holonyms
  • Hand is a holonym of the words finger, metacarpus,
    palm, etc.

17
Project Gutenberg: http://www.gutenberg.org
  • Reuters
  • Alice in Wonderland
  • A Tale of Two Cities
  • Holy Bible

21
Heaps' Law
  • Heaps' Law predicts the size of the vocabulary
    given the size of the text.
  • If the number of words is n, then the size of the
    vocabulary is $v = K n^{\beta}$, where $\beta$ is
    between 0 and 1 and K is some constant between 10
    and 100.
  • Values of $\beta$ between 0.4 and 0.6 have given
    reasonably good predictions of the size of the
    vocabulary.
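
As a worked example of the formula, the sketch below evaluates v = K * n^beta in Python. The function name and the default values K = 50 and beta = 0.5 are assumptions picked from inside the ranges given on the slide; they are not fitted to any corpus.

```python
def heaps_vocabulary(n, k=50.0, beta=0.5):
    """Predicted vocabulary size for a text of n words: v = K * n**beta."""
    return k * n ** beta

# A 10,000-word text with K = 50 and beta = 0.5:
predicted = heaps_vocabulary(10_000)  # 50 * sqrt(10000) = 5000.0
```

Because beta is below 1, doubling the text length less than doubles the predicted vocabulary. In practice K and beta would be estimated from the collection rather than assumed.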

23
Word Distribution
24
Zipf's Law
  • G.K. Zipf first claimed that, by the principle of
    least effort, we use a few words very often and
    rarely use most other words.
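
One way to look at this skew in a text is to rank words by frequency. The Python sketch below builds such a table; the function name is hypothetical. Zipf's law in its usual form says frequency is roughly proportional to 1/rank, so the rank-times-frequency column should stay roughly constant on a large corpus (the toy input here is far too small to show that reliably).

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency), most frequent first."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

table = rank_frequency("the the the the cat cat sat".split())
```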

27
Sentences
  • A sentence is made up of one or more clauses, and
    each clause is made up of phrases.
  • The subject, verb, object, complement, and
    adverbial phrases are arranged in order to make
    up a clause.
  • Sentence separators
  • Period (.)
  • Exclamation mark (!), question mark (?)
  • Semicolon (;)
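
A minimal Python sketch of splitting on these separators. It is an illustration under simple assumptions, not the TextMine text_split function: it breaks after any period, exclamation mark, question mark, or semicolon that is followed by whitespace.

```python
import re

def split_sentences(text):
    """Split text after ., !, ? or ; when followed by whitespace."""
    parts = re.split(r"(?<=[.!?;])\s+", text.strip())
    return [p for p in parts if p]

sentences = split_sentences("It rained. We stayed in; nobody minded! Did you?")
```

This naive rule also splits after abbreviations such as "Mr.", so a practical splitter has to recognize complex tokens first.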

28
  • TextMine/WordUtil.pm
  • The text_split function

29
Stopwords
  • Since high-frequency words are not generally
    useful in an index, they can be removed to save
    space and improve performance.
  • The words that we exclude are called stopwords.
  • High-frequency vs. low-frequency
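
Stopword removal can be sketched in a few lines of Python. The stopword set below is a tiny hypothetical sample; real lists are much longer and are often derived from the high-frequency end of the collection's own word counts.

```python
# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

kept = remove_stopwords("The size of the vocabulary is large".split())
```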

30
Inverse Document Frequency(IDF)
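  • $f_m = \log N - \log d_m + 1$, where N is the number
    of documents and $d_m$ is the number of documents
    containing word m.
  • The value 1 is added so that a word m that occurs
    in every document does not get a value of 0 for
    $f_m$.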
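
The computation can be sketched in Python as follows, reading the slide's formula as f_m = log N - log d_m + 1, with N the number of documents and d_m the number of documents containing word m. The function name and the toy documents are assumptions for illustration.

```python
import math

def idf(documents):
    """Map each word m to log N - log d_m + 1 over tokenized documents."""
    n = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n) - math.log(d) + 1 for w, d in doc_freq.items()}

weights = idf([["text", "mining"], ["text", "index"], ["text", "law"]])
```

The word `text` appears in all three documents and gets weight 1 instead of 0, while a word found in only one document gets log 3 + 1.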

31
Latent Semantic Indexing
  • Latent semantic indexing (LSI) is an indexing
    method based on the Singular Value Decomposition
    (SVD) of the word-document matrix.
  • The SVD is a mathematical procedure to transform
    the word-document matrix such that major
    intrinsic associative patterns in the collection
    are revealed.
  • Minor patterns that are not very important can be
    removed to identify major global relationships.

32
LSI
  • LSI builds relationships based on co-occurring
    words in multiple documents.
  • These hidden underlying relationships are called
    the latent semantic structure in the collection.

33
The advantage of LSI
  • LSI does not depend on individual words to locate
    documents, but rather uses a concept or topic to
    find relevant documents.
  • Keyword-based methods rely on an exact match
    between words in a document and a query.

34
LSI
  • A concept or a topic is a group of words that
    collectively describe similar thoughts, things,
    places, or people.
  • It need not be as narrow as a single meaning from
    a dictionary.
  • When a researcher submits a query, it is
    transformed to LSI space and compared with other
    documents in the same space.
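
The query step can be sketched with NumPy's SVD. This is an illustration on an invented toy matrix, not the SVDPACKC-based implementation the slides refer to; the helper names and the choice of k = 2 are assumptions, and the query is mapped with the commonly used folding-in formula q_hat = q^T U_k S_k^{-1}.

```python
import numpy as np

# Rows are words, columns are documents (toy counts invented for illustration).
# Words: car, automobile, engine, flower, petal.
A = np.array([
    [1, 0, 1, 0],   # car
    [0, 1, 1, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 0, 1],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

def lsi(matrix, k):
    """Keep the k largest singular triplets of the word-document matrix."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :].T   # words, singular values, one row per document

def fold_in(query_vec, u_k, s_k):
    """Map a word-count query vector into the k-dimensional LSI space."""
    return query_vec @ u_k / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

u_k, s_k, docs_k = lsi(A, k=2)
query = fold_in(np.array([0, 1, 0, 0, 0], dtype=float), u_k, s_k)  # query: "automobile"
sim_car_doc = cosine(query, docs_k[0])      # document 0 contains only "car" and "engine"
sim_flower_doc = cosine(query, docs_k[3])   # document 3 contains only "flower" and "petal"
```

In this toy collection the query "automobile" comes out close to document 0 even though that document never contains the word, because both fall on the same latent dimension; the flower document stays unrelated. An exact keyword match would have missed document 0.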

35
Document relationships based on shared words.
36
SVD of word-document matrix
37
Vector and LSI spaces for three documents
38
Implementation of LSI
  • The SVDPACKC package
  • TextMine/WordUtil.pm
  • The gen_vectors function

39
Index Maintenance