Title: An Introduction to Text Mining
1. An Introduction to Text Mining
2. Outline of the presentation
- Initiation/Introduction ...
- What makes text stand apart from other kinds of data?
- Classification
- Clustering
- Mining on the Web
3. Data Mining
- What? Looking for information in usually large amounts of data
- Mainly two kinds of activities: descriptive and predictive
- Example of a descriptive activity: clustering
- Example of a predictive activity: classification
4. What kind of data is this?
- <1, 1, 0, 0, 1, 0>
- <0, 0, 1, 1, 0, 1>
- It could be two customers' baskets, containing (milk, bread, butter) and (shaving cream, razor, after-shave lotion) respectively.
- Or, it could be two documents: "Java programming language" and "India beat Pakistan"
5. And what kind of data is this?
- <550000, 155>
- <750000, 115>
- <120000, 165>
- Data about people: <income, IQ> pairs!
6. Data representation
- Humans understand data in various forms
- Text
- Sales figures
- Images
- Computers understand only numbers
7. Working with data
- Most mining algorithms work only with numeric data
- Hence, all data are represented as numbers so that they lend themselves to the algorithms
- Whether it is sales figures, crime rates, text, or images, one has to find a suitable way to transform the data into numbers.
8. Text mining: Working with numbers
- "Java Programming Language"
- "India beat Pakistan"
- OR
- <1, 1, 0, 0, 1, 0>
- <0, 0, 1, 1, 0, 1>
- The transformation to 1's and 0's hides all the relationships between "Java" and "Language", and between "India" and "Pakistan", which humans can make out (How?)
9. Text mining: Working with numbers (contd.)
- As we have seen, data transformation (from text/word to some index number in this case) means that there is some information loss
- One big challenge in this field today is to find a good data representation for input to the mining algorithms
10. Text Representation Issues
- Each word has a dictionary meaning, or meanings
- Run: (1) the verb; (2) the noun, in cricket
- Cricket: (1) the game; (2) the insect
- Each word is used in various senses
- Tendulkar made 100 runs
- Because of an injury, Tendulkar cannot run and will need a runner between the wickets
- Capturing the meaning of sentences is an important issue as well. Grammar, parts of speech, and time sense could be easy!
- Finding out automatically who the "he" in "He is the President" refers to, given a document, is hard. And president of what? Well ...
11. Text Representation Issues (contd.)
- In general, it is hard to capture these features from a text document
- One, it is difficult to extract them automatically
- Two, even if we did, it won't scale!
- One simplification is to represent documents as a vector of words
- We have already seen examples
- Each document is represented as a vector, and each component of the vector represents some quantity related to a single word.
12. The Document Vector
- "Java Programming Language": <1, 1, 0, 0, 1, 0, 0> (document A)
- "India beat Pakistan": <0, 0, 1, 1, 0, 1, 0> (document B)
- "India beat Australia": <0, 0, 1, 1, 0, 0, 1> (document C)
- What vector operation can you think of to find two similar documents?
- How about the dot product?
- As we can easily verify, documents B and C will have a higher dot product than any other combination (see the sketch below)
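A minimal Python sketch of this dot-product check; the vocabulary ordering [java, programming, india, beat, language, pakistan, australia] is an assumption made for illustration, not stated on the slide.

```python
# Document vectors from slide 12, assuming the vocabulary order
# [java, programming, india, beat, language, pakistan, australia].
doc_a = [1, 1, 0, 0, 1, 0, 0]  # "Java Programming Language"
doc_b = [0, 0, 1, 1, 0, 1, 0]  # "India beat Pakistan"
doc_c = [0, 0, 1, 1, 0, 0, 1]  # "India beat Australia"

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

print(dot(doc_a, doc_b))  # 0 -> no shared terms
print(dot(doc_a, doc_c))  # 0 -> no shared terms
print(dot(doc_b, doc_c))  # 2 -> shares "india" and "beat": the most similar pair
```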
13. More on document similarity
- The dot product or cosine between two vectors is a measure of similarity.
- Documents about related topics should have higher similarity
[Figure: document vectors plotted in a term space with axes "Java", "Language", and "Indonesia", originating at (0, 0, 0)]
14. Document Similarity (contd.)
- How about distance measures?
- The cosine similarity measure will not capture the inter-cluster distances! (see the sketch below)
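A small illustration of the point above, with made-up vectors: cosine similarity ignores magnitude, so a distance measure can separate documents that cosine considers identical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    num = sum(x * y for x, y in zip(u, v))
    den = (math.sqrt(sum(x * x for x in u))
           * math.sqrt(sum(x * x for x in v)))
    return num / den

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

u = [1, 2, 0]
v = [2, 4, 0]  # same direction as u, twice the magnitude

print(cosine(u, v))     # 1.0   -> cosine treats them as identical
print(euclidean(u, v))  # ~2.24 -> a distance measure still tells them apart
```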
15. Further refinements to the DV representation
- Not all words are equally important
- the, is, and, to, he, she, it (Why?)
- Of course, these words could be important in certain contexts
- We have the option of scaling the components for these words, or completely removing them from the corpus
- In general, we prefer to remove the stopwords and scale the remaining words
- Important words should be scaled upwards, and vice versa
- One widely used scaling factor: TF-IDF
- TF-IDF stands for the Term Frequency and Inverse Document Frequency product for a word (a sketch follows below)
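A minimal TF-IDF sketch, assuming the common tf × log(N/df) weighting; real systems use many smoothed and normalized variants.

```python
import math

docs = [
    ["java", "programming", "language"],
    ["india", "beat", "pakistan"],
    ["india", "beat", "australia"],
]
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

def tf_idf(term, doc):
    """Term frequency in the document times inverse document frequency."""
    tf = doc.count(term)
    idf = math.log(n_docs / df[term])
    return tf * idf

print(tf_idf("india", docs[1]))     # ~0.41: "india" occurs in 2 of 3 docs
print(tf_idf("pakistan", docs[1]))  # ~1.10: "pakistan" is rarer, so scaled up
```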
16. Text Mining: Moving Further
- Document/Term Clustering
- Given a large set, group similar entities
- Text Classification
- Given a document, find what topic it talks about
- Information Retrieval
- Search engines
- Information Extraction
- Question Answering
17. Clustering (Descriptive Activity)
- Activity: Group together similar documents
- Techniques used
- Partitioning
- Hierarchical
- Agglomerative
- Divisive
- Grid based
- Model based
18. Clustering (contd.)
- Partitioning
- Divide the input data into k partitions
- K-means, K-medoids
- Hierarchical clustering
- Agglomerative
- Each data point is assumed to be a cluster representative
- Keep merging similar clusters till we get a single cluster
- Divisive
- The opposite of agglomerative (a k-means sketch follows below)
19. Frequent term-based text clustering
- Idea:
- Frequent terms carry more information about the cluster they might belong to
- Highly correlated frequent terms probably belong to the same cluster
- D = {D1, ..., Dn} is the set of documents
- Each Dj ⊆ T, the set of all terms
- Then candidate clusters are generated from F = {F1, ..., Fk}, where each Fi is a set of all frequent terms which occur together (a sketch follows below)
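A toy sketch of the idea, checking only term pairs against a support threshold; the Beil et al. algorithm mines full frequent term sets of any size.

```python
from itertools import combinations

docs = [
    {"java", "programming", "language"},
    {"java", "language", "code"},
    {"india", "beat", "pakistan"},
    {"india", "beat", "australia"},
]

min_support = 2  # a term set must co-occur in at least 2 documents

def support(term_set):
    """Number of documents containing every term in term_set."""
    return sum(1 for d in docs if term_set <= d)

vocab = sorted(set().union(*docs))
frequent_pairs = [set(pair) for pair in combinations(vocab, 2)
                  if support(set(pair)) >= min_support]
print(frequent_pairs)  # {java, language}, {india, beat} -> candidate clusters
```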
20. Classification
- The problem statement:
- Given a set of documents, each with a label called the class label for that document
- Given a classifier which learns from the above data set
- For a new, unseen document, the classifier should be able to predict, with a high degree of accuracy, the correct class to which the new document belongs
21. Decision Tree Classifier
- A tree
- Each node represents some kind of an evaluation of an attribute of the data
- Each edge, the decision taken
- The evaluation at each node is some kind of an information gain measure
- Reduction in entropy = more information gained
- Entropy: E(X) = -Σ_i p_i log2(p_i)
- p_i represents the probability that the data corresponds to sample i
- Each edge represents a choice for the value of the attribute the node represents
- Good for text mining, but doesn't scale (an entropy sketch follows below)
22. Statistical (Bayesian) Classification
- For document-class data, we calculate the probabilities of occurrence of events
- Bayes' Theorem:
- P(c|d) = P(c) · P(d|c) / P(d)
- Given a document d, the probability that it belongs to a class c is given by the above formula.
- In practice, the exact values of the probabilities of each event are unknown, and are estimated from the samples
23. Naïve Bayes Classification
- Probability of the document event d:
- P(d) = P(w1, ..., wn), where the wi are the words
- The RHS is generally a headache: we have to consider the inter-dependence of each of the wj events
- Naïve Bayes: assume all the wj events are independent. The RHS then expands to ∏_j P(wj)
- Most Bayesian text classifiers work with this simplification (a sketch follows below)
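A minimal multinomial Naïve Bayes sketch with add-one smoothing; the training documents and class names here are made up for illustration.

```python
import math
from collections import Counter

train = [
    (["java", "programming", "language"], "computing"),
    (["java", "code", "program"], "computing"),
    (["india", "beat", "pakistan"], "cricket"),
    (["tendulkar", "made", "runs"], "cricket"),
]

classes = {c for _, c in train}
vocab = {w for doc, _ in train for w in doc}
doc_counts = Counter(c for _, c in train)
word_counts = {c: Counter() for c in classes}
for doc, c in train:
    word_counts[c].update(doc)

def log_posterior(doc, c):
    """log P(c) + sum_j log P(w_j | c), with add-one (Laplace) smoothing."""
    total = sum(word_counts[c].values())
    logp = math.log(doc_counts[c] / len(train))
    for w in doc:
        logp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return logp

new_doc = ["india", "beat", "australia"]
print(max(classes, key=lambda c: log_posterior(new_doc, c)))  # -> cricket
```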
24. Bayesian Belief Networks
- This is an intermediate approach
- Not all words are independent
- If "java" and "program" occur together, then boost the probability of the class computer programming
- If "java" and "indonesia" occur together, then the document is more likely about some other class
- Problem?
- How do we come up with correlations like the above?
25. Other classification techniques
- Support Vector Machines
- Find the best discriminant plane between two classes
- k Nearest Neighbour
- Association Rule Mining
- Neural Networks
- Case-based reasoning
26. An example: Text Classification from Labeled and Unlabeled Documents with Expectation Maximization
- Problem setting:
- Labeling documents is a manual process
- A lot more unlabeled documents are available as compared to labeled documents
- Unlabeled documents contain information which could help in the classification activity
27. An example (contd.)
- Train a classifier with the labeled documents
- Say, a Naïve Bayes classifier
- This classifier estimates the model parameters (the prior probabilities of the various events)
- Now, classify the unlabeled documents
- Assuming the applied labels to be correct, re-estimate the model parameters
- Repeat the above step till convergence
28. Expectation Maximization
- A useful technique for estimating hidden parameters
- In the previous example, the class labels were missing from some documents
- Consists of two steps:
- E-step: set z^(k+1) = E[z | D; θ^(k)]
- M-step: set θ^(k+1) = argmax_θ P(θ | D; z^(k+1))
- The above steps are repeated till convergence, and convergence does occur (a sketch follows below)
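A self-contained sketch of this loop in the spirit of the Nigam et al. setting, with a deliberately tiny Naïve Bayes model and made-up documents; a real implementation would use the full classifier and soft (probabilistic) label assignments.

```python
import math
from collections import Counter

labeled = [
    (["java", "code"], "computing"),
    (["india", "beat"], "cricket"),
]
unlabeled = [["java", "program"], ["india", "runs"], ["beat", "pakistan"]]
classes = ["computing", "cricket"]

def fit(docs):
    """Estimate the model parameters (theta) from (doc, label) pairs."""
    priors = Counter(c for _, c in docs)
    counts = {c: Counter() for c in classes}
    for doc, c in docs:
        counts[c].update(doc)
    return priors, counts

def predict(model, doc):
    """Return the most likely class under the current parameters."""
    priors, counts = model
    vocab = {w for c in classes for w in counts[c]} | set(doc)
    def score(c):
        total = sum(counts[c].values())
        return (math.log(priors[c] / sum(priors.values()))
                + sum(math.log((counts[c][w] + 1) / (total + len(vocab)))
                      for w in doc))
    return max(classes, key=score)

model = fit(labeled)                 # initialize from labeled data only
for _ in range(5):
    # E-step: guess labels for the unlabeled documents.
    guessed = [(doc, predict(model, doc)) for doc in unlabeled]
    # M-step: re-estimate parameters treating guessed labels as correct.
    model = fit(labeled + guessed)

print([predict(model, d) for d in unlabeled])
```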
29. Another example: Fast and Accurate Text Classification via Multiple Linear Discriminant Projections
30. Contd.
- Idea:
- Find a direction α which maximizes the separation between classes.
- Why?
- Reduce noise, or rather
- Enhance the differences between classes
- The vector corresponding to this direction is Fisher's discriminant
- Project the data points onto this α
- For all data points not separated by this vector, choose another α
31. Contd.
- Repeat till all data are separable
- Note, we are looking at the 2-class case; this easily extends to multiple classes
- Project all the document vectors into the space with the α vectors as the basis vectors
- Now induce a decision tree on this projected representation
- The number of attributes is highly reduced
- Since this representation nicely separates the data points (documents), accuracy increases (a sketch of the Fisher direction follows below)
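A sketch of computing a single Fisher discriminant direction for two classes, w = Sw⁻¹(μ1 − μ2), on made-up points; the paper iterates this to obtain multiple projections.

```python
import numpy as np

class1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
class2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])

mu1, mu2 = class1.mean(axis=0), class2.mean(axis=0)

def scatter(X, mu):
    """Per-class scatter matrix: sum of outer products of centered points."""
    d = X - mu
    return d.T @ d

# Within-class scatter, then the Fisher direction w = Sw^{-1} (mu1 - mu2).
Sw = scatter(class1, mu1) + scatter(class2, mu2)
w = np.linalg.solve(Sw, mu1 - mu2)

# Projections onto w separate the two classes cleanly.
print(class1 @ w)
print(class2 @ w)
```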
32. Web Text Mining
- The WWW is a huge, directed graph, with documents as nodes and hyperlinks as the directed edges
- Apart from the text itself, this graph structure carries a lot of information about the usefulness of the nodes
- For example:
- 10 random, average people on the street say Mr. T. Ache is a good dentist
- 5 reputed doctors, including dentists, recommend Mr. P. Killer as a better dentist
- Who would you choose?
33. Kleinberg's HITS
- HITS: Hypertext Induced Topic Selection
- Nodes on the web can be categorized into two types: hubs and authorities
- Authorities are nodes which one refers to for definitive information about a topic
- Hubs point to authorities
- HITS computes the hub and authority scores on a sub-universe of the web
- How does one collect this sub-universe?
34. HITS (contd.)
- The basic steps:
- A(u) = Σ H(v), over all v pointing to u
- H(u) = Σ A(v), over all v pointed to by u
- Repeat the above till convergence
- Nodes with high A scores are relevant
- Relevant to what?
- Can we use this for efficient retrieval for a query? (a sketch follows below)
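A compact sketch of the HITS iteration on a made-up four-node link graph:

```python
graph = {          # node -> set of nodes it links to
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"d"},
    "d": set(),
}
nodes = list(graph)
auth = {n: 1.0 for n in nodes}
hub = {n: 1.0 for n in nodes}

for _ in range(20):
    # Authority update: sum the hub scores of nodes pointing to u.
    auth = {u: sum(hub[v] for v in nodes if u in graph[v]) for u in nodes}
    # Hub update: sum the authority scores of nodes u points to.
    hub = {u: sum(auth[v] for v in graph[u]) for u in nodes}
    # Normalize so the scores do not blow up across iterations.
    a_norm = sum(x * x for x in auth.values()) ** 0.5
    h_norm = sum(x * x for x in hub.values()) ** 0.5
    auth = {u: x / a_norm for u, x in auth.items()}
    hub = {u: x / h_norm for u, x in hub.items()}

print(auth)  # "d" gets the highest authority score
print(hub)   # "a" and "b" emerge as the strongest hubs
```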
35. PageRank
- Similar to HITS, but all pages have only one score: a Rank
- R(u) = c Σ_v R(v)/N_v
- v ranges over the set of pages linking to u, N_v is the number of links in v, and c is a scaling factor (< 1)
- The higher the rank of the pages linking to a page, the higher is its own rank!
- To handle rank sinks (documents which do not link outside a set of pages), the formula is modified as:
- R(u) = c Σ_v R(v)/N_v + c E(u)
- E(u) is a set of some pages, and acts as a rank source (what kind of pages?) (a sketch follows below)
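A small sketch of the modified iteration above, assuming a uniform rank source E and a made-up link graph:

```python
graph = {          # node -> set of nodes it links to
    "a": {"b", "c"},
    "b": {"c"},
    "c": {"a"},
    "d": {"c"},
}
nodes = list(graph)
c = 0.85                                    # scaling factor (< 1)
rank = {n: 1.0 / len(nodes) for n in nodes}
# Uniform rank source, an assumption here; chosen so that
# c * (sum + E[u]) matches the familiar damped form.
E = {n: (1.0 - c) / (c * len(nodes)) for n in nodes}

for _ in range(50):
    # R(u) = c * sum over in-neighbors v of R(v)/N_v, plus c * E(u).
    rank = {
        u: c * (sum(rank[v] / len(graph[v]) for v in nodes if u in graph[v])
                + E[u])
        for u in nodes
    }

print(rank)  # "c" accumulates the highest rank in this toy graph
```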
36. Some more topics which we haven't touched
- Using external dictionaries
- WordNet
- Using language-specific techniques
- Computational linguistics
- Use grammar for judging the sense of a query in the information retrieval scenario
- Other interesting techniques
- Latent Semantic Indexing
- Finding the latent information in documents using linear algebra techniques
37. Some more comments
- Some purists do not consider most of the current activities in the text mining field as real text mining
- For example, see Marti Hearst's write-up, "Untangling Text Data Mining"
38. Some more comments (contd.)
- One example that she mentions:
- Stress is associated with migraines
- Stress can lead to loss of magnesium
- Calcium channel blockers prevent some migraines
- Magnesium is a natural calcium channel blocker
- Spreading cortical depression (SCD) is implicated in some migraines
- High levels of magnesium inhibit SCD
- Migraine patients have high platelet aggregability
- Magnesium can suppress platelet aggregability
- The above was inferred from a set of documents, with some human help
39. References
- Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber
- Principles of Data Mining, by David J. Hand et al.
- Text Classification from Labeled and Unlabeled Documents using EM, by Kamal Nigam et al.
- Fast and Accurate Text Classification via Multiple Linear Discriminant Projections, by S. Chakrabarti et al.
- Frequent Term-Based Text Clustering, by Florian Beil et al.
- The PageRank Citation Ranking: Bringing Order to the Web, by Lawrence Page and Sergey Brin
- Untangling Text Data Mining, by Marti A. Hearst, http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
- And others