1
An Introduction to Text Mining
  • Ravindra Jaju

2
Outline of the presentation
  • Initiation/Introduction ...
  • What makes text stand apart from other kinds of
    data?
  • Classification
  • Clustering
  • Mining on The Web

3
Data Mining
  • What? Looking for information in (usually large)
    amounts of data
  • Mainly two kinds of activities: descriptive and
    predictive
  • Example of a descriptive activity: clustering
  • Example of a predictive activity: classification

4
What kind of data is this?
  • <1, 1, 0, 0, 1, 0>
  • <0, 0, 1, 1, 0, 1>
  • It could be two customers' baskets, containing
    (milk, bread, butter) and (shaving cream, razor,
    after-shave lotion) respectively.
  • Or, it could be two documents - "Java programming
    language" and "India beat Pakistan"

5
And what kind of data is this?
  • <550000, 155>
  • <750000, 115>
  • <120000, 165>
  • Data about people: <income, IQ> pairs!

6
Data representation
  • Humans understand data in various forms
  • Text
  • Sales figures
  • Images
  • Computers understand only numbers

7
Working with data
  • Most of the mining algorithms work only with
    numeric data
  • Hence, all data are represented as numbers so
    that they lend themselves to these algorithms
  • Whether it is sales figures, crime rates, text,
    or images, one has to find a suitable way to
    transform the data into numbers

8
Text mining: Working with numbers
  • "Java Programming Language"
  • "India beat Pakistan"
  • OR
  • <1, 1, 0, 0, 1, 0>
  • <0, 0, 1, 1, 0, 1>
  • The transformation to 1's and 0's hides the
    relationships between "Java" and "Language", and
    between "India" and "Pakistan", which humans can
    make out (how?) - a minimal sketch of this
    transformation follows below
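A minimal Python sketch of this transformation, assuming a tiny
vocabulary built from the two example documents themselves (the
vocabulary and its ordering are made up for illustration):

# A minimal bag-of-words sketch: turn short documents into 0/1 term vectors.
docs = ["Java Programming Language", "India beat Pakistan"]

# Build the vocabulary (the vector's dimensions) from all words seen.
vocab = sorted({word.lower() for doc in docs for word in doc.split()})

def to_vector(doc):
    """Binary document vector: 1 if the vocabulary word occurs in doc."""
    words = {w.lower() for w in doc.split()}
    return [1 if term in words else 0 for term in vocab]

vectors = [to_vector(d) for d in docs]
print(vocab)      # dimension order of the vectors
print(vectors)    # one 0/1 entry per vocabulary word, per document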

9
Text mining: Working with numbers (contd.)
  • As we have seen, data transformation (from
    text/word to some index number in this case)
    means that there is some information loss
  • One big challenge in this field today is to find
    a good data representation for input to the
    mining algorithms

10
Text Representation Issues
  • Each word has a dictionary meaning, or meanings
  • "Run": (1) the verb. (2) the noun, in cricket
  • "Cricket": (1) the game. (2) the insect.
  • Each word is used in various senses
  • "Tendulkar made 100 runs"
  • "Because of an injury, Tendulkar cannot run and
    will need a runner between the wickets"
  • Capturing the meaning of sentences is an
    important issue as well. Grammar, parts of
    speech, tense could be easy!
  • Automatically finding out who the "he" in "He is
    the President" refers to, given a document, is
    hard. And president of what? Well ...

11
Text Representation Issues (contd.)
  • In general, it is hard to capture these features
    from a text document
  • One, it is difficult to extract this
    automatically
  • Two, even if we did it, it won't scale!
  • One simplification is to represent documents as a
    vector of words
  • We have already seen examples
  • Each document is represented as a vector, and
    each component of the vector represents some
    quantity related to a single word.

12
The Document Vector
  • "Java Programming Language"
  • <1, 1, 0, 0, 1, 0, 0> (document A)
  • "India beat Pakistan"
  • <0, 0, 1, 1, 0, 1, 0> (document B)
  • "India beat Australia"
  • <0, 0, 1, 1, 0, 0, 1> (document C)
  • What vector operation can you think of to find
    two similar documents?
  • How about the dot product?
  • As we can easily verify, documents B and C will
    have a higher dot product than any other
    combination (a small sketch follows below)
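A small sketch of comparing the three vectors above with the dot
product and the length-normalized cosine (the component ordering of
the vectors is an assumption, since the slide does not spell it out):

import math

# Document vectors: <Java, Programming, India, beat, Language, Pakistan, Australia>
A = [1, 1, 0, 0, 1, 0, 0]   # "Java Programming Language"
B = [0, 0, 1, 1, 0, 1, 0]   # "India beat Pakistan"
C = [0, 0, 1, 1, 0, 0, 1]   # "India beat Australia"

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

print(dot(A, B), dot(B, C))      # 0 and 2: B and C share two terms, A shares none
print(round(cosine(B, C), 3))    # ~0.667, the highest pairwise similarity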

13
More on document similarity
  • The dot product or cosine between two vectors is
    a measure of similarity.
  • Documents about related topics should have higher
    similarity

(Figure: document vectors drawn in a three-dimensional term space with
axes "Java", "Language", and "Indonesia", starting from the origin <0, 0, 0>)
14
Document Similarity (contd.)
  • How about distance measures?
  • Cosine similarity measure will not capture the
    inter-cluster distances!

15
Further refinements to the DV representation
  • Not all words are equally important
  • "the", "is", "and", "to", "he", "she", "it" (Why?)
  • Of course, these words could be important in
    certain contexts
  • We have the option of scaling the components of
    these words, or completely removing them from the
    corpus
  • In general, we prefer to remove the stopwords and
    scale the remaining words
  • Important words should be scaled upwards, and
    vice versa
  • One widely used scaling factor: TF-IDF
  • TF-IDF stands for the product of Term Frequency
    and Inverse Document Frequency for a word (a
    small sketch follows below)
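A minimal sketch of one common TF-IDF variant (tf times log(N/df)) over
a toy corpus; real systems differ in the exact TF and IDF formulas and
in normalization:

import math
from collections import Counter

corpus = [
    "java programming language",
    "india beat pakistan",
    "india beat australia",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Map each term in doc to tf * idf, with idf = log(N / df)."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tf_idf(docs[1]))   # 'pakistan' outweighs the more common 'india'/'beat'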

16
Text Mining: Moving Further
  • Document/Term Clustering
  • Given a large set, group similar entities
  • Text Classification
  • Given a document, find what topic it talks about
  • Information Retrieval
  • Search engines
  • Information Extraction
  • Question Answering

17
Clustering (Descriptive Activity)
  • Activity: Group together similar documents
  • Techniques used:
  • Partitioning
  • Hierarchical
  • Agglomerative
  • Divisive
  • Grid based
  • Model based

18
Clustering (contd.)
  • Partitioning
  • Divide the input data into k partitions
  • K-means, K-medoids (a minimal K-means sketch
    follows this slide)
  • Hierarchical clustering
  • Agglomerative
  • Each data point is assumed to be a cluster
    representative
  • Keep merging similar clusters till we get a
    single cluster
  • Divisive
  • The opposite of agglomerative
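A minimal K-means sketch on toy document vectors (plain Python, squared
Euclidean distance, naive random initialization; a production
implementation would typically use cosine distance and smarter seeding):

import random

def kmeans(points, k, iterations=20):
    """Very small K-means: points are equal-length numeric vectors."""
    centroids = random.sample(points, k)          # naive initialization
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(col) / len(cluster) for col in zip(*cluster)]
    return clusters, centroids

docs = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
clusters, _ = kmeans(docs, k=2)
print(clusters)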

19
Frequent term-based text clustering
  • Idea
  • Frequent terms carry more information about the
    cluster they might belong to
  • Highly correlated frequent terms probably belong
    to the same cluster
  • D = {D1, ..., Dn} is the set of documents
  • Each Dj is a subset of T, the set of all terms
  • Candidate clusters are then generated from F =
    {F1, ..., Fk}, where each Fi is a set of frequent
    terms which occur together (a rough sketch of the
    candidate-generation step follows below)
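A rough sketch of the candidate-generation step, counting the support
of co-occurring term pairs only (the actual algorithm of Beil et al.
handles larger term sets and then selects a low-overlap cover; the
documents here are made up):

from itertools import combinations
from collections import Counter

docs = [
    {"java", "programming", "language"},
    {"java", "language", "compiler"},
    {"india", "beat", "pakistan"},
    {"india", "beat", "australia"},
]
min_support = 2   # a term set is "frequent" if it occurs in >= 2 documents

# Count how many documents contain each term pair (size-2 term sets for brevity).
support = Counter()
for doc in docs:
    for pair in combinations(sorted(doc), 2):
        support[pair] += 1

# Each frequent term set F_i defines a candidate cluster: the documents covering it.
candidates = {
    pair: [i for i, doc in enumerate(docs) if set(pair) <= doc]
    for pair, count in support.items() if count >= min_support
}
print(candidates)   # e.g. ('java', 'language') -> [0, 1], ('beat', 'india') -> [2, 3]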

20
Classification
  • The problem statement:
  • Given a set of documents, each with a label
    called the class label for that document
  • Given a classifier which learns from the above
    data set
  • For a new, unseen document, the classifier should
    be able to predict with a high degree of
    accuracy the correct class to which the new
    document belongs

21
Decision Tree Classifier
  • A tree
  • Each node represents some kind of an evaluation
    for an attribute of the data
  • Each edge, the decision taken
  • The evaluation at each node is some kind of an
    information gain measure
  • Reduction in entropy = more information gained
  • Entropy: E(X) = - Σ pi log2(pi)
  • pi is the proportion of the data belonging to
    class i
  • Each edge represents a choice for the value of
    the attribute the node represents
  • Good for text mining, but doesn't scale (a small
    entropy/information-gain sketch follows below)
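A small sketch of the entropy and information-gain computation a
decision-tree learner uses when scoring a candidate split (class
proportions play the role of the pi; the labels are made up):

import math
from collections import Counter

def entropy(labels):
    """E = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, partitions):
    """Reduction in entropy after splitting the labels into partitions."""
    total = len(labels)
    remainder = sum(len(p) / total * entropy(p) for p in partitions)
    return entropy(labels) - remainder

labels = ["sports"] * 5 + ["tech"] * 5
split = [["sports"] * 4 + ["tech"], ["sports"] + ["tech"] * 4]
print(entropy(labels))                   # 1.0 bit for a 50/50 class mix
print(information_gain(labels, split))   # > 0: the split is informative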

22
Statistical (Bayesian) Classification
  • For document-class data, we calculate the
    probabilities of occurrence of events
  • Bayes' Theorem:
  • P(c|d) = P(c) · P(d|c) / P(d)
  • Given a document d, the probability that it
    belongs to a class c is given by the above
    formula.
  • In practice, the exact values of the
    probabilities of each event are unknown, and are
    estimated from the samples

23
Naïve Bayes Classification
  • Probability of the document event d:
  • P(d) = P(w1, ..., wn), where the wi are the words
  • The RHS is generally a headache - we have to
    consider the interdependence of each of the wj
    events
  • Naïve Bayes: assume all the wj events are
    independent. The RHS then factors into
  • Π P(wj)
  • Most Bayesian text classifiers work with this
    simplification (a minimal sketch follows below)
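A minimal multinomial Naïve Bayes sketch: per-class word counts,
add-one smoothing, and log probabilities to avoid underflow (the
classes and documents are made up):

import math
from collections import Counter, defaultdict

train = [
    ("comp", "java programming language"),
    ("comp", "java compiler program"),
    ("cricket", "india beat pakistan"),
    ("cricket", "tendulkar made runs"),
]

class_docs = Counter(c for c, _ in train)    # document counts per class, for P(c)
word_counts = defaultdict(Counter)           # per-class word counts, for P(w|c)
for c, text in train:
    word_counts[c].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """argmax_c log P(c) + sum_j log P(w_j | c), with add-one smoothing."""
    best, best_score = None, float("-inf")
    for c in class_docs:
        score = math.log(class_docs[c] / len(train))
        total = sum(word_counts[c].values())
        for w in text.split():
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

print(classify("java program"))   # expected: "comp"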

24
Bayesian Belief Networks
  • This is an intermediate approach
  • Not all words are independent
  • If "java" and "program" occur together, then boost
    the probability value of class "computer
    programming"
  • If "java" and "indonesia" occur together, then the
    document is more likely about some other class
  • Problem?
  • How do we come up with correlations like the above?

25
Other classification techniques
  • Support Vector Machines
  • Find the best discriminant plane between two
    classes
  • k Nearest Neighbour
  • Association Rule Mining
  • Neural Networks
  • Case-based reasoning

26
An example: Text Classification from Labeled
and Unlabeled Documents with Expectation
Maximization
  • Problem setting:
  • Labeling documents is a manual process
  • A lot more unlabeled documents are available as
    compared to labeled documents
  • Unlabeled documents contain information which
    could help in the classification activity

27
An example (contd.)
  • Train a classifier with the labeled documents
  • Say, a Naïve Bayes classifier
  • This classifier estimates the model parameters
    (the prior probabilities of the various events)
  • Now, classify the unlabeled documents.
  • Assuming the applied labels to be correct,
    re-estimate the model parameters
  • Repeat the above steps till convergence (a rough
    sketch of this loop follows below)
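A rough sketch of this loop using hard labels and scikit-learn's
CountVectorizer and MultinomialNB (assumed to be available); the Nigam
et al. paper actually weights the unlabeled documents by their label
probabilities rather than committing to hard labels, and the documents
here are made up:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled = ["java programming language", "india beat pakistan"]
labels = ["comp", "cricket"]
unlabeled = ["java compiler program", "tendulkar made runs", "pakistan beat australia"]

vectorizer = CountVectorizer()
X_lab = vectorizer.fit_transform(labeled)
X_unl = vectorizer.transform(unlabeled)

# Step 1: train a classifier on the labeled documents only.
clf = MultinomialNB().fit(X_lab, labels)

# Steps 2-3: label the unlabeled documents, re-estimate, repeat until stable.
prev = None
for _ in range(10):
    guessed = clf.predict(X_unl)
    if prev is not None and (guessed == prev).all():
        break                     # convergence: the guessed labels stopped changing
    prev = guessed
    X_all = np.vstack([X_lab.toarray(), X_unl.toarray()])
    clf = MultinomialNB().fit(X_all, list(labels) + list(guessed))

print(clf.predict(X_unl))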

28
Expectation Maximization
  • A useful technique for estimating hidden
    parameters
  • In the previous example, the class labels were
    missing from some documents
  • Consists of two steps
  • E-step: set z(k+1) = E[z | D; θ(k)]
  • M-step: set θ(k+1) = arg maxθ P(θ | D; z(k+1))
  • The above steps are repeated till convergence,
    and convergence does occur

29
Another example: Fast and Accurate Text
Classification via Multiple Linear Discriminant
Projections
30
Contd.
  • Idea:
  • Find a direction α which maximizes the separation
    between classes
  • Why?
  • Reduce noise, or rather
  • Enhance the differences between classes
  • The vector corresponding to this direction is
    Fisher's discriminant
  • Project the data points onto this α
  • For all data points not separated by this vector,
    choose another α (a small sketch of computing one
    such direction follows below)
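A small numpy sketch of computing one Fisher discriminant direction for
two classes, as the within-class scatter inverse applied to the
difference of class means (the data is made up, and the paper builds
several such directions):

import numpy as np

# Toy 2-D data for two classes (rows are data points).
X0 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])   # class 0
X1 = np.array([[6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])   # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter matrix S_W = sum over classes of (x - mu)(x - mu)^T.
S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# Fisher's direction: alpha proportional to S_W^{-1} (mu1 - mu0).
alpha = np.linalg.solve(S_W, mu1 - mu0)
alpha /= np.linalg.norm(alpha)

# Project every point onto alpha; the two classes separate along this axis.
print(X0 @ alpha, X1 @ alpha)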

31
Contd.
  • Repeat till all data are now separable
  • Note, we are looking at a 2-class case. This
    easily extends to multiple classes
  • Project all the document vectors into the space
    represented by the α vectors as the basis vectors
  • Now, induce a decision tree on this projected
    representation
  • The number of attributes is highly reduced
  • Since this representation nicely separates the
    data points (documents), accuracy increases

32
Web Text Mining
  • The WWW is a huge, directed graph, with documents
    as nodes and hyperlinks as the directed edges
  • Apart from the text itself, this graph structure
    carries a lot of information about the
    usefulness of the nodes
  • For example
  • 10 random, average people on the streets say Mr.
    T. Ache is a good dentist
  • 5 reputed doctors, including dentists, recommend
    Mr. P. Killer as a better dentist
  • Who would you choose?

33
Kleinberg's HITS
  • HITS: Hypertext Induced Topic Selection
  • Nodes on the web can be categorized into two
    types: hubs and authorities
  • Authorities are nodes which one refers to for
    definitive information about a topic
  • Hubs point to authorities
  • HITS computes the hub and authority scores on a
    sub-universe of the web
  • How does one collect this sub-universe?

34
HITS (contd.)
  • The basic steps:
  • A(u) = Σ H(v), summed over all v pointing to u
  • H(u) = Σ A(v), summed over all v pointed to by u
  • Repeat the above till convergence (a small sketch
    follows below)
  • Nodes with high A scores are relevant
  • Relevant to what?
  • Can we use this for efficient retrieval for a
    query?
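A small sketch of the HITS iteration on a made-up link graph, with the
scores normalized every round so they stay bounded:

import math

# Toy web graph: node -> list of nodes it links to.
links = {"p1": ["a1", "a2"], "p2": ["a1"], "p3": ["a1", "a2"], "a1": [], "a2": []}
nodes = list(links)

hub = {n: 1.0 for n in nodes}
auth = {n: 1.0 for n in nodes}

for _ in range(20):   # a fixed number of rounds stands in for a convergence test
    # Authority score: sum of hub scores of the nodes pointing to u.
    auth = {u: sum(hub[v] for v in nodes if u in links[v]) for u in nodes}
    # Hub score: sum of authority scores of the nodes u points to.
    hub = {u: sum(auth[v] for v in links[u]) for u in nodes}
    # Normalize so the scores stay bounded across iterations.
    a_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
    h_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
    auth = {u: x / a_norm for u, x in auth.items()}
    hub = {u: x / h_norm for u, x in hub.items()}

print(sorted(auth.items(), key=lambda kv: -kv[1]))   # a1, a2 come out as authorities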

35
PageRank
  • Similar to HITS, but all pages have only one
    score: a rank
  • R(u) = c Σ (R(v)/Nv)
  • The sum is over the pages v linking to u; Nv is
    the number of outgoing links of v, and c is a
    scaling factor (< 1)
  • The higher the rank of the pages linking to a
    page, the higher is its own rank!
  • To handle rank sinks (documents which do not link
    outside a set of pages), the formula is modified
    as
  • R(u) = c Σ (R(v)/Nv) + c E(u)
  • E(u) is defined over a set of pages and acts as a
    rank source (what kind of pages?) - a small
    sketch follows below
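A small sketch of the PageRank iteration on a made-up graph, taking the
c·E(u) term to be a uniform (1 - c)/N rank source, which is one common
way to instantiate it:

# Toy web graph: node -> list of nodes it links to (every node has out-links).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
nodes = list(links)
N = len(nodes)
c = 0.85                          # scaling factor (< 1)

rank = {u: 1.0 / N for u in nodes}
for _ in range(50):               # fixed rounds stand in for a convergence check
    new_rank = {}
    for u in nodes:
        in_links = [v for v in nodes if u in links[v]]   # pages pointing to u
        new_rank[u] = c * sum(rank[v] / len(links[v]) for v in in_links) + (1 - c) / N
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))   # "c" collects the most rank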

36
Some more topics which we haven't touched
  • Using external dictionaries
  • WordNet
  • Using language specific techniques
  • Computational linguistics
  • Use grammar for judging the sense of a query in
    the information retrieval scenario
  • Other interesting techniques
  • Latent Semantic Indexing
  • Finding the latent information in documents using
    Linear Algebra Techniques

37
Some more comments
  • Some purists do not consider most of the current
    activities in the text mining field as "real"
    text mining
  • For example, see Marti Hearst's write-up,
    "Untangling Text Data Mining"

38
Some more comments (contd.)
  • One example that she mentions:
  • stress is associated with migraines
  • stress can lead to loss of magnesium
  • calcium channel blockers prevent some migraines
  • magnesium is a natural calcium channel blocker
  • spreading cortical depression (SCD) is implicated
    in some migraines
  • high levels of magnesium inhibit SCD
  • migraine patients have high platelet
    aggregability
  • magnesium can suppress platelet aggregability
  • The above was inferred from a set of documents,
    with some human help

39
References
  • Data Mining: Concepts and Techniques, by Jiawei
    Han and Micheline Kamber
  • Principles of Data Mining, by David J. Hand et al.
  • Text Classification from Labeled and Unlabeled
    Documents using EM, by Kamal Nigam et al.
  • Fast and Accurate Text Classification via
    Multiple Linear Discriminant Projections, by S.
    Chakrabarti et al.
  • Frequent Term-Based Text Clustering, by Florian
    Beil et al.
  • The PageRank Citation Ranking: Bringing Order to
    the Web, by Lawrence Page and Sergey Brin
  • Untangling Text Data Mining, by Marti A. Hearst,
    http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
  • And others