COMP 7118 Fall 2004 - PowerPoint PPT Presentation

1
COMP 7118 Fall 2004
  • Text Mining

2
Text Databases
  • Large collections of documents from various
    sources: news articles, research papers, books,
    digital libraries, e-mail messages, Web pages,
    and library databases
  • Properties
  • Unstructured in general (semi-structured with
    help, e.g. XML)
  • Semantics, not only syntax, is important
  • Non-numeric in nature

3
Text Database and Information Retrieval
  • Information retrieval
  • Traditional study of how to retrieve information
    from text documents
  • Information is organized into (a large number of)
    documents
  • Information retrieval problem: locating relevant
    documents based on user input, such as keywords
    or example documents

4
Text Database and Information Retrieval
  • Typical IR systems
  • Online library catalogs
  • Online document management systems
  • Information retrieval vs. database systems
  • Some DB problems are not present in IR, e.g.,
    update, transaction management, complex objects
  • Some IR problems are not addressed well in DBMS,
    e.g., unstructured documents, approximate search
    using keywords and relevance

5
Basic Measures for Text Retrieval
  • Precision: the percentage of retrieved documents
    that are in fact relevant to the query (i.e.,
    correct responses)
  • Recall: the percentage of documents that are
    relevant to the query and were, in fact, retrieved
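The two measures above can be sketched as set operations over document IDs (a minimal illustration; the function and document names are hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)   # correct responses
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```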

6
Basic retrieval problems
  • Keyword-based
  • Find documents that contain certain keywords
  • Expression of keywords
  • Boolean: car and repair shop, tea or coffee, DBMS
    but not Oracle
  • Regular expression: car ? ? repair

7
Basic retrieval problems
  • Similarity-based
  • Finds similar documents based on a set of common
    keywords
  • Answers should be ranked by degree of relevance,
    based on the nearness of the keywords, relative
    frequency of the keywords, etc.

8
Challenges in text retrieval
  • Semantics
  • Synonymy: a keyword T does not appear anywhere in
    the document, even though the document is closely
    related to T, e.g., data mining
  • Polysemy: the same keyword may mean different
    things in different contexts, e.g., mining

9
Challenges in text retrieval
  • Data cleaning
  • Stop list: a set of words that are deemed
    irrelevant, even though they may appear
    frequently
  • E.g., a, the, of, for, with, etc.
  • Stop lists may vary as the document set varies
  • Word stem: several words are small syntactic
    variants of each other since they share a common
    word stem
  • E.g., drug, drugs, drugged
  • Tagging: sometimes it is better to view a group
    of words as a single unit (like a noun phrase)
  • E.g., data mining
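The cleaning steps above can be sketched as a small pipeline; the stop list and the crude suffix-stripping rule here are toy stand-ins (a real system would use a proper stemmer such as Porter's):

```python
STOP_WORDS = {"a", "the", "of", "for", "with", "is", "this"}

def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase the text, drop stop words, and crudely strip suffixes
    so that small syntactic variants map to one stem."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    stems = []
    for t in tokens:
        for suffix in ("ged", "s"):  # drug, drugs, drugged -> drug
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems
```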

10
Challenges in text retrieval
  • Data representation
  • How to represent the data for each data mining
    task
  • Clustering: how to define similarity?
  • Classification: which attributes to use?

11
Challenges in text retrieval
  • Example: classifying text documents
  • News items
  • Each item belongs to a pre-defined topic
  • Build a classifier to classify them
  • What should the attributes be?
  • Individual words?
  • Even after removing stop words, the number of
    words may be large
  • A training set of 100 documents typically has
    1000 distinct words
  • Each word appears in very few documents

12
Challenges in text retrieval
  • Assume we use a Naïve Bayes classifier
  • Each word appears in very few documents
  • Moreover, many words do not appear in various
    classes
  • Each test document does not contain many words
  • ⇒ Probabilities are very small, and computational
    error can be introduced
  • The same concept can also be represented by
    different words/phrases
  • Challenge
  • How well can we do without semantics?
  • How can we incorporate semantic information?

13
Data Representation
  • Document vector / frequency matrix
  • Each document is represented by a vector
  • Each dimension of the vector is associated with a
    word/term
  • For each document, the value of each dimension is
    the frequency with which that word occurs in the
    document

14
Data Representation
  • Example
  • Document 1: This is a database system textbook
  • Document 2: Oracle database sells for 1000 this
    year
  • Vector dimensions: (database, system, textbook,
    oracle, sells, year)
  • D1 = (1, 1, 1, 0, 0, 0)
  • D2 = (1, 0, 0, 1, 1, 1)
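A minimal sketch of building these vectors over a fixed vocabulary (function names are illustrative):

```python
def doc_vector(doc, vocab):
    """Term-frequency vector for one document over a fixed vocabulary."""
    counts = {}
    for token in doc.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return [counts.get(term, 0) for term in vocab]

vocab = ["database", "system", "textbook", "oracle", "sells", "year"]
d1 = doc_vector("This is a database system textbook", vocab)
d2 = doc_vector("Oracle database sells for 1000 this year", vocab)
# d1 -> [1, 1, 1, 0, 0, 0]; d2 -> [1, 0, 0, 1, 1, 1]
```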

15
Data Representation
  • Variations
  • Binary values: only the presence/absence of a word
    is recorded
  • Normalized values: normalize each dimension to the
    (0,1) range
  • Weighted frequency

16
Data Representation (tf-idf scheme)
  • Weighted frequency (tf-idf scheme)
  • Useful for static data sets
  • Term frequency (tf_ij): how often term j appears
    in document i (normalized)
  • Document frequency (df_j): how many documents
    contain term j
  • For document i and term j:
    w_ij = tf_ij × log(d / df_j)
  • where d is the number of documents in the
    database
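A sketch of the weighting, assuming tf is normalized by the largest raw count in the document (one common choice; the slide does not fix the normalization):

```python
import math

def tf_idf(freq_matrix):
    """freq_matrix[i][j] is the raw count of term j in document i.
    Returns the weights w_ij = tf_ij * log(d / df_j)."""
    d = len(freq_matrix)
    n_terms = len(freq_matrix[0])
    # df_j: number of documents containing term j
    df = [sum(1 for i in range(d) if freq_matrix[i][j] > 0)
          for j in range(n_terms)]
    weights = []
    for row in freq_matrix:
        max_f = max(row) or 1          # normalize tf by the max count
        weights.append([(f / max_f) * math.log(d / df[j]) if df[j] else 0.0
                        for j, f in enumerate(row)])
    return weights
```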

17
Data Representation Example (tf-idf scheme)
  • D1: This is a database system textbook
  • D2: Oracle database sells for 1000 this year
  • D3: My oracle database textbook for my database
    class
  • Raw frequencies

18
Data Representation Example (tf-idf scheme)
  • D1: This is a database system textbook
  • D2: Oracle database sells for 1000 this year
  • D3: My oracle database textbook for my database
    class
  • Normalized frequencies

19
Data Representation Example (tf-idf scheme)
  • D1: This is a database system textbook
  • D2: Oracle database sells for 1000 this year
  • D3: My oracle database textbook for my database
    class
  • Document frequencies

20
[Tables: normalized term frequencies (tf_ij) and document frequencies (df_j)]
21
Data Representation (tf-idf scheme)
  • Properties
  • Words that appear in every document have a weight
    of 0
  • Words that appear in very few documents have
    higher weight
  • Good for clustering? Classification?

22
Data Representation
  • Term-frequency matrix/table
  • The combination of all the document vectors forms
    a matrix
  • Query vector
  • Queries usually consist of a set of keywords and
    thus can be represented as vectors
  • A query can also be a separate document, with its
    vector calculated as before

23
Data Representation: Similarity measures
  • For various tasks, we need a measure of the
    similarity between documents
  • Cosine similarity
  • Focuses on the co-occurrence of words
  • This corresponds to the cosine of the angle
    between the two vectors
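A minimal sketch of the measure over term-frequency vectors:

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| |v|): 1.0 for identical directions,
    0.0 when the documents share no terms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

On the example vectors from slide 14, D1 = (1, 1, 1, 0, 0, 0) and D2 = (1, 0, 0, 1, 1, 1) share only the term "database", giving a similarity of 1/(√3 · 2) ≈ 0.29.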

24
Data Representation: Latent Semantic Indexing
  • Weaknesses of keyword-based techniques
  • Lack of semantics
  • Cannot identify similar words/concepts without
    help
  • Observation
  • Words/phrases that represent similar concepts are
    usually grouped together
  • The most important unit of information for
    documents may not be the word but the concept

25
Data Representation: Latent Semantic Indexing
  • Latent Semantic Indexing is an attempt to produce
    such information
  • Find a (relatively) small number of concepts to
    serve as the vector dimensions
  • Try to approximate the original information by
    these dimensions

26
Data Representation: Latent Semantic Indexing
  • Start with the term-frequency matrix M
  • M is of size (m × n)
  • m: number of terms
  • n: number of documents
  • Find the singular values of M
  • These are the square roots of the eigenvalues of
    MM^T
  • Assume we have r singular values (r = rank(M))

27
Data Representation: Latent Semantic Indexing
  • Then M = U S V^T
  • S: (r × r) diagonal matrix of the singular
    values
  • U: (m × r) matrix of term vectors
  • V: (n × r) matrix of document vectors
  • Interpretation
  • There are r distinct concepts within this set
    of documents
  • Each row in U is a term vector that corresponds
    to a term
  • The value in each dimension indicates how strongly
    the term corresponds to that concept
  • Similarly for V

28
Data Representation: Latent Semantic Indexing
  • Notice that S is diagonal
  • We can reorganize the values in descending order
  • Many values tend to be small
  • Replace those values with 0 to form a new matrix
    S'
  • Then M' = U S' V^T is a good approximation of
    the original matrix M
  • Notice that each document is now represented by a
    shorter vector: data reduction
  • And each dimension now corresponds to a concept,
    which is based on the co-occurrence of words
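The truncation step can be sketched with NumPy's SVD (assuming NumPy is available; `lsi_reduce` and `k` are illustrative names):

```python
import numpy as np

def lsi_reduce(M, k):
    """Keep only the k largest singular values of the term-by-document
    matrix M (zeroing the rest) and return the rank-k approximation
    M' = U S' V^T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_kept = np.zeros_like(s)
    s_kept[:k] = s[:k]    # singular values come back sorted descending
    return U @ np.diag(s_kept) @ Vt
```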

29
Data Representation: Latent Semantic Indexing --
Example
  • (From Berry, Dumais, O'Brien 1994)

30
Types of Text Data Mining
  • Keyword-based association analysis
  • Automatic document classification
  • Similarity detection
  • Cluster documents by a common author
  • Cluster documents containing information from a
    common source

31
Types of Text Data Mining
  • Link analysis: unusual correlations between
    entities
  • Sequence analysis: predicting a recurring event
  • Anomaly detection: finding information that
    violates usual patterns
  • Hypertext analysis
  • Patterns in anchors/links
  • Anchor-text correlations with linked objects

32
Keyword-based association analysis
  • Collect sets of keywords or terms that occur
    frequently together and then find the association
    or correlation relationships among them
  • First preprocess the text data by parsing,
    stemming, removing stop words, etc.
  • Then invoke association mining algorithms
  • Consider each document as a transaction
  • View the set of keywords in the document as the
    set of items in the transaction
  • Term-level association mining
  • No need for human effort in tagging documents
  • The number of meaningless results and the
    execution time are greatly reduced
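The transaction view above can be sketched as a candidate-pair count, one step of an Apriori-style algorithm (function and parameter names are illustrative):

```python
from collections import Counter
from itertools import combinations

def frequent_keyword_pairs(docs, min_support):
    """Treat each document's keyword set as a transaction and count
    co-occurring keyword pairs; keep pairs meeting the support threshold."""
    counts = Counter()
    for doc in docs:
        for pair in combinations(sorted(set(doc)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}
```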

33
Keyword-based association analysis -- SNOWBALL
  • Goal: discover relationships in text documents
  • E.g., companies and their headquarters
  • Microsoft, headquartered at Redmond, WA
  • Boeing, a Seattle-based company
  • the Santa Clara, CA company Intel
  • Can we automatically retrieve such information
    from news articles?

34
SNOWBALL
  • Assumptions
  • Articles tend to follow the same context
  • A base set of (seed) knowledge is available
  • General idea
  • Find documents containing the seed knowledge
  • Extract the patterns where the knowledge occurs
  • Find similar patterns in other documents
  • Extract new knowledge from there

35
SNOWBALL
  • Example of seed knowledge
  • <organization, location> pairs
  • Seed tuples

36
SNOWBALL: Extracting patterns
  • Find the locations where each pair is present
  • Extract the patterns
  • <Organization>'s headquarters in <Location>
  • <Location>-based <Organization>
  • <Organization>, <Location>
  • Need a good tagger to tag the information first

37
SNOWBALL: Extracting patterns
  • Represent the pattern as a 5-tuple
  • <Left, tag1, Middle, tag2, Right>
  • Left, right, and middle are contexts
  • The words/symbols attached around the tags
  • Example: The Seattle-based Boeing Company
  • Left: <The>
  • Middle: <->, <based>
  • Right: <Company>
  • Each context has a weight associated with it
  • A function of the frequency of the word in that
    context
  • Scaled: middle contexts have higher weight
  • Normalized
  • E.g., <The, 0.2>

38
SNOWBALL: Extracting patterns
  • Similar patterns are combined
  • A clustering algorithm is run
  • Similarity function: inner product of the weights
  • Each cluster is represented by its centroid
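A sketch of the similarity used for this clustering, representing each context as a word-to-weight dict (the dict layout is an assumption for illustration):

```python
def context_similarity(p1, p2):
    """Inner product of weighted context terms, summed over the left,
    middle, and right contexts of two patterns. Each context is a dict
    mapping a word/symbol to its weight."""
    def dot(a, b):
        return sum(w * b.get(term, 0.0) for term, w in a.items())
    return sum(dot(p1[part], p2[part]) for part in ("left", "middle", "right"))
```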

39
SNOWBALL: finding new tuples
  • Read new documents
  • Locate the <organization> and <location> tags
  • Locate the patterns
  • See if they match the centroids
  • If so, add them as candidates
  • Good candidates become new seed tuples

40
SNOWBALL: finding new tuples
  • However, how trustworthy are they?
  • Two directions
  • Patterns
  • If a pattern matches too many new tuples, then it
    is suspicious
  • Example: Microsoft, Redmond, announced
  • Left: <>, Middle: <,>, Right: <,>
  • Matches everything
  • Not very useful

41
SNOWBALL: finding new tuples
  • However, how trustworthy are they?
  • Two directions
  • Tuples
  • If a tuple appears in many locations, then it is
    more likely to be true
  • Measure the confidence of tuples and patterns to
    determine which can be used

42
Automatic document classification
  • Motivation
  • Automatic classification for the tremendous
    number of on-line text documents (Web pages,
    e-mails, etc.)
  • A classification problem
  • Text document classification differs from the
    classification of relational data
  • Document databases are not structured according
    to attribute-value pairs

43
Association-based document classification
  • Extract keywords and terms by information
    retrieval and simple association analysis
    techniques
  • Obtain concept hierarchies of keywords and terms
    using
  • Available term classes, such as WordNet
  • Expert knowledge
  • Some keyword classification systems
  • Classify documents in the training set into class
    hierarchies
  • Apply term association mining method to discover
    sets of associated terms

44
Association-based document classification
  • Use the terms to maximally distinguish one class
    of documents from others
  • Derive a set of association rules associated with
    each document class
  • Order the classification rules based on their
    occurrence frequency and discriminative power
  • Use the rules to classify new documents

45
Document clustering
  • Automatically group related documents based on
    their contents
  • Requires no training sets or predetermined
    taxonomies; generates a taxonomy at runtime
  • One approach: define standard similarity
    measures and use hierarchical clustering
  • One potential drawback: a document may have
    multiple themes

46
Document clustering
  • Potential solutions
  • Fuzzy clustering: allow documents in multiple
    clusters
  • Drawback: can be inefficient
  • Form multiple base clusters
  • Document can reside in multiple clusters
  • Combine the base clusters in a hierarchical
    fashion

47
Example: Suffix Tree Clustering
  • Basic idea
  • Form base clusters based on common phrases
  • Each phrase is assigned a score; only clusters
    with a high enough score survive
  • Use a suffix tree to help recognize the clusters
  • Each cluster is represented by the documents that
    are in it
  • Inter-cluster distance is measured by the number
    of documents the clusters have in common
  • Single-link hierarchical clustering is used to
    join the base clusters

48
Example: Suffix Tree Clustering
  • Example: 3 documents
  • Cat ate cheese
  • Mouse ate cheese too
  • Cat ate mouse too
  • Recognize all the suffixes
  • Cat ate cheese:
  • cheese, ate cheese, cat ate cheese
  • Mouse ate cheese too:
  • too, cheese too, ate cheese too, mouse ate
    cheese too
  • Cat ate mouse too:
  • too, mouse too, ate mouse too, cat ate
    mouse too
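Enumerating the word-level suffixes, as in the example above, is straightforward (a minimal sketch):

```python
def word_suffixes(sentence):
    """All word-level suffixes of a sentence, longest first."""
    words = sentence.lower().split()
    return [" ".join(words[i:]) for i in range(len(words))]
```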

49
[Suffix tree for documents 1 "cat ate cheese", 2 "mouse ate cheese too", 3 "cat ate mouse too"]
50
Suffix Tree Clustering
  • Find base clusters
  • Documents that share the same prefix of a suffix
    are treated as a base cluster
  • These are the internal nodes of the suffix tree
  • Base clusters are evaluated for their goodness
  • Score = N × F(suffix)
  • N: number of documents in that cluster
  • F(suffix): a function of the suffix, based on the
    frequency of the words, the length of the
    phrase, etc.
  • Stop words are removed from the phrase
  • Take only base clusters above a certain score

51
Suffix Tree Clustering
  • Combine base clusters
  • Combine clusters when they share at least half of
    their elements
  • Resulting clusters are represented by the
    combined keywords
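The merge rule can be sketched as an overlap test on the base clusters' document sets (the 0.5 threshold follows the half-sharing criterion above; names are illustrative):

```python
def should_merge(a, b, threshold=0.5):
    """Link two base clusters when their overlap exceeds the threshold
    fraction of each cluster's documents; single-link merging then
    joins all connected base clusters."""
    a, b = set(a), set(b)
    overlap = len(a & b)
    return overlap / len(a) > threshold and overlap / len(b) > threshold
```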