Text Mining Application Programming Chapter 3 Explore Text - PowerPoint PPT Presentation


Slides: 40

Provided by: jdw6

Transcript and Presenter's Notes

Title: Text Mining Application Programming Chapter 3 Explore Text

1
Text Mining Application Programming Chapter 3
Explore Text

Manu Konchady, 2006

2
(No Transcript)
3
Outline

Words
Zipf's Law
Sentences
Indexing Document Text

4
Extracting words from text

A linguistic definition of a word is the smallest
syntactic unit that cannot be broken into smaller
segments.
Words in a sequence governed by the grammar of
the language form sentences.

5
The eight standard parts of speech

Nouns
Verbs
Adjectives
Adverbs
Conjunctions
Determiners
Prepositions
Pronouns

Content words
Function words
6
Five types of phrases

Noun phrases
A good day
Verb phrases
had thought, was right, and will be jumping
Adjective phrases
A nice shiny
Prepositional phrases
with very long hair

7
Words vs. Tokens

A token is a more formal definition of a single
unit of text.
A single word may not be the smallest unit of
text and a token may consist of one or more
words.
We will use token to represent the smallest unit
of text processed in the higher layers of our
model.

8
Complex tokens

Yahoo!, AT&T, Hancock & Co.
Mr. Smith, lb., or 192.168.1.1
New York-New Jersey, small-scale, or x-ray
Web URL
-3.1415E-01
888-555-9999
(-lt).

Vector representations of documents used in
clustering and text categorization are made up of
a sequence of tokens and weights.
Documents can be correctly categorized only when
the vector representation accurately reflects the
contents of the documents.
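
One way to keep complex tokens such as those on the previous slide intact is to try specific patterns before the generic word pattern. The sketch below is only an illustration: the pattern list, the abbreviation list, and the name tokenize are assumptions, not the TextMine implementation.

```python
import re

# Hypothetical token patterns, tried left to right; the first alternative
# that matches at a position wins, so specific patterns come first.
TOKEN_PATTERN = re.compile(r"""
    https?://\S+                      # Web URL
  | \d{1,3}(?:\.\d{1,3}){3}           # IP address such as 192.168.1.1
  | \d{3}-\d{3}-\d{4}                 # phone number such as 888-555-9999
  | [-+]?\d+\.\d+(?:[eE][-+]?\d+)?    # decimal number with optional exponent
  | (?:Mr|Mrs|Dr|lb)\.                # tiny, illustrative abbreviation list
  | \w+(?:-\w+)+                      # hyphenated word such as small-scale
  | \w+                               # plain word
""", re.VERBOSE)


def tokenize(text):
    """Return the tokens found in text, skipping unmatched punctuation."""
    return TOKEN_PATTERN.findall(text)


sample = "Mr. Smith sent -3.1415E-01 from 192.168.1.1 to 888-555-9999 on a small-scale x-ray."
tokens = tokenize(sample)
```

Splitting the same sentence on whitespace and punctuation alone would break the IP address, the phone number, and the signed number into several meaningless pieces.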

10
Token Assembly
11
Abbreviations

Currencies
Dimensions
Time
Places
Organizations.

12
Base Words

A base word is the root form of a word that can
be found in the WordNet dictionary.
Jump (base word)
Jumps

13
Word Stems

A word stem is a root form of a word.
Prevent
Prevents
Prevented
Preventing
Prevention
Porter's stemming algorithm
TextMine/token.pm,
http://cpan.org
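
To show the idea of reducing word forms to a common stem, here is a deliberately crude suffix stripper. It is not Porter's algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix list and the name crude_stem are assumptions made for this sketch.

```python
# Illustrative suffix list only; a real stemmer uses many more rules.
SUFFIXES = ("ing", "ion", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


forms = ["Prevent", "Prevents", "Prevented", "Preventing", "Prevention"]
stems = {crude_stem(w) for w in forms}
```

All five forms from the slide collapse to the single stem "prevent", so an index needs only one entry for them.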

14
Word and Meaning Relationships

A thesaurus organizes words and word meanings.
In WordNet 2.0
115,775 word meanings, or synonym sets (synsets)
152,217 word forms.
Antonyms
Words with opposite meanings
Rich and Poor
Hot and cold

15
Organize word meanings into an acyclical hierarchy

Hypernym
The parent node
Hyponyms
The child nodes

Meronyms
Finger is a meronym of the word hand.
Hand is a meronym of the human body.
Holonyms
Hand is a holonym of its parts: the finger,
metacarpus, palm, etc.
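
The hypernym hierarchy can be pictured as a child-to-parent map. The sketch below uses a small hand-made table, not WordNet data, and the names HYPERNYM, hypernym_chain, and hyponyms are hypothetical.

```python
# Hand-made toy hierarchy (child -> parent); not taken from WordNet.
HYPERNYM = {
    "dog": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def hypernym_chain(word):
    """Walk up from a word to the root; terminates because the map is acyclic."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def hyponyms(word):
    """Return the direct children of a word."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == word)
```

Here hypernym_chain("dog") walks up through canine and mammal to animal, and hyponyms("mammal") returns the two child nodes.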

17
Project Gutenberg http://www.gutenberg.org

Reuters
Alice in Wonderland
A Tale of Two Cities
Holy Bible

18
(No Transcript)
19
(No Transcript)
20
(No Transcript)
21
Heaps' Law

Heaps' Law predicts the size of the vocabulary
from the size of the text.
If the number of words is n, then the size of the
vocabulary is v = K n^β, where β is between 0 and
1 and K is some constant between 10 and 100.
Values of ß between 0.4 and 0.6 have been
reasonably good approximations to predict the
size of the vocabulary.
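
The prediction v = K n^β can be compared with the vocabulary actually observed in a text. In this sketch the defaults K = 20 and β = 0.5 are arbitrary picks inside the ranges quoted above, and the function names are assumptions.

```python
def heaps_estimate(n, K=20.0, beta=0.5):
    """Predicted vocabulary size for a text of n words (v = K * n**beta)."""
    return K * n ** beta


def vocabulary_growth(tokens):
    """Return (words read so far, distinct words seen so far) after each token."""
    seen = set()
    growth = []
    for count, token in enumerate(tokens, start=1):
        seen.add(token.lower())
        growth.append((count, len(seen)))
    return growth


predicted = heaps_estimate(10_000)          # 20 * sqrt(10000)
observed = vocabulary_growth("the cat saw the dog".split())
```

For a real corpus, K and β would be fitted to the observed growth curve rather than fixed in advance.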

22
(No Transcript)
23
Word Distribution
24
Zipf's Law

G.K. Zipf first claimed that, by the principle of
least effort, we use a few words very often and
rarely use most other words.
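
Zipf's Law is usually stated as word frequency being roughly inversely proportional to frequency rank, so rank times frequency stays roughly constant. A minimal sketch for inspecting this on any token list follows; the function name rank_frequency is an assumption.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency), most frequent first."""
    counts = Counter(token.lower() for token in tokens)
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


table = rank_frequency("the the the the cat cat dog".split())
```

On a large corpus the last column varies much less than the frequency column, which is the pattern the law describes.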

25
(No Transcript)
26
(No Transcript)
27
Sentences

A sentence is made up of one or more clauses, and
each clause is made up of phrases.
The subject, verb, object, complement, and
adverbial phrases are arranged in order to make
up a clause.
Sentence separators
Period (.)
Exclamation mark (!) and question mark (?)
Semicolon (;)

TextMine/WordUtil.pm
The text_split function
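
A naive splitter that treats the separators above as sentence boundaries can be sketched as follows. This is not the text_split function from TextMine, whose behavior is not shown in these slides; split_sentences is a hypothetical name, and the sketch will wrongly split after abbreviations such as "Mr.".

```python
import re


def split_sentences(text):
    """Split after '.', '!', '?' or ';' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?;])\s+", text.strip())
    return [part for part in parts if part]


sentences = split_sentences(
    "It rained. Did you see it? Yes! We stayed in; nobody minded."
)
```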

29
Stopwords

Since high-frequency words are not generally
useful in an index, they can be removed to save
space and improve performance.
The words that we exclude are called stopwords.
High-frequency vs. low-frequency
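
Filtering stopwords before indexing takes only a set lookup. The stopword list below is a tiny illustrative sample, not a list taken from the source, and remove_stopwords is a hypothetical name.

```python
# Small illustrative stopword list; real lists contain a few hundred words.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not stopwords (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stopwords]


kept = remove_stopwords("The size of the vocabulary is small".split())
```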

30
Inverse Document Frequency(IDF)

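f_m = log N - log d_m + 1, where N is the number
of documents in the collection and d_m is the
number of documents that contain word m.
The value 1 is added to avoid cases where a word
m occurs in every document, leading to a value of
0 for f_m.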
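
The IDF weight can be computed directly from document frequencies. This sketch assumes each document is already a list of tokens and uses the natural logarithm; the function name idf_weights is an assumption.

```python
import math


def idf_weights(documents):
    """Return {word: log N - log d_m + 1} for a list of token lists."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):              # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {
        word: math.log(n_docs) - math.log(d_m) + 1
        for word, d_m in doc_freq.items()
    }


weights = idf_weights([["the", "cat"], ["the", "dog"], ["the", "bird"], ["the", "cat"]])
```

A word that appears in every document ("the") gets weight 1 rather than 0, while rarer words get larger weights.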

31
Latent Semantic Indexing

Latent semantic indexing (LSI) is an indexing
method based on the Singular Value Decomposition
(SVD) of the word-document matrix.
The SVD is a mathematical procedure to transform
the word-document matrix such that major
intrinsic associative patterns in the collection
are revealed.
Minor patterns that are not very important can be
removed to identify major global relationships.

32
LSI

LSI builds relationships based on co-occurring
words in multiple documents.
These hidden underlying relationships are called
the latent semantic structure in the collection.

33
The advantage of LSI

LSI does not depend on individual words to locate
documents, but rather uses a concept or topic to
find relevant documents.
Keyword-based methods rely on an exact match
between words in a document and a query.

34
LSI

A concept or a topic is a group of words that
collectively describe similar thoughts, things,
places, or people.
It need not be as narrow as a single meaning from
a dictionary.
When a researcher submits a query, it is
transformed to LSI space and compared with other
documents in the same space.
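
The query step can be illustrated on a toy collection. This is a sketch under stated assumptions, not the deck's implementation: it approximates the top singular vectors by power iteration in plain Python instead of calling a library SVD, projects vectors with the concept vectors alone (no singular-value scaling), and all names and data are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def concept_vectors(A, k, iters=200):
    """Approximate the top-k left singular vectors of A (words x documents)
    by power iteration on A A^T, removing earlier vectors at each step."""
    m = len(A)
    B = [[dot(A[i], A[j]) for j in range(m)] for i in range(m)]  # A A^T
    found = []
    for _component in range(k):
        # Fixed start vector; assumed not orthogonal to the target vector.
        v = [1.0 / (i + 1) for i in range(m)]
        for _step in range(iters):
            w = [dot(row, v) for row in B]
            for u in found:
                proj = dot(w, u)
                w = [x - proj * y for x, y in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        found.append(v)
    return found


def to_lsi_space(vector, concepts):
    """Coordinates of a word-count vector along each concept vector."""
    return [dot(u, vector) for u in concepts]


docs = [["car", "engine"], ["automobile", "engine"], ["flower", "petal"]]
vocab = sorted({word for doc in docs for word in doc})
A = [[float(doc.count(word)) for doc in docs] for word in vocab]
concepts = concept_vectors(A, k=2)


def word_vector(words):
    return [float(words.count(word)) for word in vocab]


query = to_lsi_space(word_vector(["car"]), concepts)
similarities = [
    cosine(query, to_lsi_space(word_vector(doc), concepts)) for doc in docs
]
```

The query "car" scores high against the second document even though that document never contains the word, because "automobile" and "car" both co-occur with "engine". An exact keyword match would have returned only the first document.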