1
An Introduction to Text Mining
  • Ravindra Jaju

2
Outline of the presentation
  • Initiation/Introduction ...
  • What makes text stand apart from other kinds of
    data?
  • Classification
  • Clustering
  • Mining on The Web

3
Data Mining
  • What? Looking for information in (usually large)
    amounts of data
  • Mainly two kinds of activities: descriptive and
    predictive
  • Example of a descriptive activity: clustering
  • Example of a predictive activity: classification

4
What kind of data is this?
  • <1, 1, 0, 0, 1, 0>
  • <0, 0, 1, 1, 0, 1>
  • It could be two customers' baskets, containing
    (milk, bread, butter) and (shaving cream, razor,
    after-shave lotion) respectively.
  • Or, it could be two documents - "Java programming
    language" and "India beat Pakistan"

5
And what kind of data is this?
  • <550000, 155>
  • <750000, 115>
  • <120000, 165>
  • Data about people: <income, IQ> pairs!

6
Data representation
  • Humans understand data in various forms
  • Text
  • Sales figures
  • Images
  • Computers understand only numbers

7
Working with data
  • Most of the mining algorithms work only with
    numeric data
  • Hence, all data are represented as numbers so
    that they lend themselves to these algorithms
  • Whether it is sales figures, crime rates, text,
    or images, one has to find a suitable way to
    transform the data into numbers

8
Text mining: Working with numbers
  • "Java Programming Language"
  • "India beat Pakistan"
  • OR
  • <1, 1, 0, 0, 1, 0>
  • <0, 0, 1, 1, 0, 1>
  • The transformation to 1's and 0's hides the
    relationships between "Java" and "Language", and
    between "India" and "Pakistan", which humans can
    make out (how?) - a minimal sketch of this
    transformation follows below
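A minimal Python sketch of this transformation, assuming a tiny
vocabulary built from the two example documents themselves (the
vocabulary and its ordering are made up for illustration):

# A minimal bag-of-words sketch: turn short documents into 0/1 term vectors.
docs = ["Java Programming Language", "India beat Pakistan"]

# Build the vocabulary (the vector's dimensions) from all words seen.
vocab = sorted({word.lower() for doc in docs for word in doc.split()})

def to_vector(doc):
    """Binary document vector: 1 if the vocabulary word occurs in doc."""
    words = {w.lower() for w in doc.split()}
    return [1 if term in words else 0 for term in vocab]

vectors = [to_vector(d) for d in docs]
print(vocab)      # dimension order of the vectors
print(vectors)    # one 0/1 entry per vocabulary word, per document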

9
Text mining: Working with numbers (contd.)
  • As we have seen, data transformation (from
    text/word to some index number in this case)
    means that there is some information loss
  • One big challenge in this field today is to find
    a good data representation for input to the
    mining algorithms

10
Text Representation Issues
  • Each word has a dictionary meaning, or meanings
  • "Run": (1) the verb. (2) the noun, in cricket
  • "Cricket": (1) the game. (2) the insect.
  • Each word is used in various senses
  • "Tendulkar made 100 runs"
  • "Because of an injury, Tendulkar cannot run and
    will need a runner between the wickets"
  • Capturing the meaning of sentences is an
    important issue as well. Grammar, parts of
    speech, tense could be easy!
  • Automatically finding out who the "he" in "He is
    the President" refers to, given a document, is
    hard. And president of what? Well ...

11
Text Representation Issues (contd.)
  • In general, it is hard to capture these features
    from a text document
  • One, it is difficult to extract this
    automatically
  • Two, even if we did it, it won't scale!
  • One simplification is to represent documents as a
    vector of words
  • We have already seen examples
  • Each document is represented as a vector, and
    each component of the vector represents some
    quantity related to a single word.

12
The Document Vector
  • "Java Programming Language"
  • <1, 1, 0, 0, 1, 0, 0> (document A)
  • "India beat Pakistan"
  • <0, 0, 1, 1, 0, 1, 0> (document B)
  • "India beat Australia"
  • <0, 0, 1, 1, 0, 0, 1> (document C)
  • What vector operation can you think of to find
    two similar documents?
  • How about the dot product?
  • As we can easily verify, documents B and C will
    have a higher dot product than any other
    combination (a small sketch follows below)
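A small sketch of comparing the three vectors above with the dot
product and the length-normalized cosine (the component ordering of
the vectors is an assumption, since the slide does not spell it out):

import math

# Document vectors: <Java, Programming, India, beat, Language, Pakistan, Australia>
A = [1, 1, 0, 0, 1, 0, 0]   # "Java Programming Language"
B = [0, 0, 1, 1, 0, 1, 0]   # "India beat Pakistan"
C = [0, 0, 1, 1, 0, 0, 1]   # "India beat Australia"

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

print(dot(A, B), dot(B, C))      # 0 and 2: B and C share two terms, A shares none
print(round(cosine(B, C), 3))    # ~0.667, the highest pairwise similarity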

13
More on document similarity
  • The dot product or cosine between two vectors is
    a measure of similarity.
  • Documents about related topics should have higher
    similarity

(Figure: document vectors drawn in a three-dimensional term space with
axes "Java", "Language", and "Indonesia", starting from the origin <0, 0, 0>)
14
Document Similarity (contd.)
  • How about distance measures?
  • Cosine similarity measure will not capture the
    inter-cluster distances!

15
Further refinements to the DV representation
  • Not all words are equally important
  • "the", "is", "and", "to", "he", "she", "it" (Why?)
  • Of course, these words could be important in
    certain contexts
  • We have the option of scaling the components of
    these words, or completely removing them from the
    corpus
  • In general, we prefer to remove the stopwords and
    scale the remaining words
  • Important words should be scaled upwards, and
    vice versa
  • One widely used scaling factor: TF-IDF
  • TF-IDF stands for the product of Term Frequency
    and Inverse Document Frequency for a word (a
    small sketch follows below)
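A minimal sketch of one common TF-IDF variant (tf times log(N/df)) over
a toy corpus; real systems differ in the exact TF and IDF formulas and
in normalization:

import math
from collections import Counter

corpus = [
    "java programming language",
    "india beat pakistan",
    "india beat australia",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Map each term in doc to tf * idf, with idf = log(N / df)."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tf_idf(docs[1]))   # 'pakistan' outweighs the more common 'india'/'beat'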

16
Text Mining: Moving Further
  • Document/Term Clustering
  • Given a large set, group similar entities
  • Text Classification
  • Given a document, find what topic it talks about
  • Information Retrieval
  • Search engines
  • Information Extraction
  • Question Answering

17
Clustering (Descriptive Activity)
  • Activity: Group together similar documents
  • Techniques used:
  • Partitioning
  • Hierarchical
  • Agglomerative
  • Divisive
  • Grid based
  • Model based

18
Clustering (contd.)
  • Partitioning
  • Divide the input data into k partitions
  • K-means, K-medoids (a minimal K-means sketch
    follows this slide)
  • Hierarchical clustering
  • Agglomerative
  • Each data point is assumed to be a cluster
    representative
  • Keep merging similar clusters till we get a
    single cluster
  • Divisive
  • The opposite of agglomerative
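A minimal K-means sketch on toy document vectors (plain Python, squared
Euclidean distance, naive random initialization; a production
implementation would typically use cosine distance and smarter seeding):

import random

def kmeans(points, k, iterations=20):
    """Very small K-means: points are equal-length numeric vectors."""
    centroids = random.sample(points, k)          # naive initialization
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(col) / len(cluster) for col in zip(*cluster)]
    return clusters, centroids

docs = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
clusters, _ = kmeans(docs, k=2)
print(clusters)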

19
Frequent term-based text clustering
  • Idea
  • Frequent terms carry more information about the
    cluster they might belong to
  • Highly correlated frequent terms probably belong
    to the same cluster
  • D = {D1, ..., Dn} is the set of documents
  • Each Dj is a subset of T, the set of all terms
  • Candidate clusters are then generated from F =
    {F1, ..., Fk}, where each Fi is a set of frequent
    terms which occur together (a rough sketch of the
    candidate-generation step follows below)
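A rough sketch of the candidate-generation step, counting the support
of co-occurring term pairs only (the actual algorithm of Beil et al.
handles larger term sets and then selects a low-overlap cover; the
documents here are made up):

from itertools import combinations
from collections import Counter

docs = [
    {"java", "programming", "language"},
    {"java", "language", "compiler"},
    {"india", "beat", "pakistan"},
    {"india", "beat", "australia"},
]
min_support = 2   # a term set is "frequent" if it occurs in >= 2 documents

# Count how many documents contain each term pair (size-2 term sets for brevity).
support = Counter()
for doc in docs:
    for pair in combinations(sorted(doc), 2):
        support[pair] += 1

# Each frequent term set F_i defines a candidate cluster: the documents covering it.
candidates = {
    pair: [i for i, doc in enumerate(docs) if set(pair) <= doc]
    for pair, count in support.items() if count >= min_support
}
print(candidates)   # e.g. ('java', 'language') -> [0, 1], ('beat', 'india') -> [2, 3]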

20
Classification
  • The problem statement:
  • Given a set of documents, each with a label
    called the class label for that document
  • Given a classifier which learns from the above
    data set
  • For a new, unseen document, the classifier should
    be able to predict with a high degree of
    accuracy the correct class to which the new
    document belongs

21
Decision Tree Classifier
  • A tree
  • Each node represents some kind of an evaluation
    for an attribute of the data
  • Each edge, the decision taken
  • The evaluation at each node is some kind of an
    information gain measure
  • Reduction in entropy = more information gained
  • Entropy: E(X) = - Σ pi log2(pi)
  • pi is the proportion of the data belonging to
    class i
  • Each edge represents a choice for the value of
    the attribute the node represents
  • Good for text mining, but doesn't scale (a small
    entropy/information-gain sketch follows below)
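A small sketch of the entropy and information-gain computation a
decision-tree learner uses when scoring a candidate split (class
proportions play the role of the pi; the labels are made up):

import math
from collections import Counter

def entropy(labels):
    """E = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, partitions):
    """Reduction in entropy after splitting the labels into partitions."""
    total = len(labels)
    remainder = sum(len(p) / total * entropy(p) for p in partitions)
    return entropy(labels) - remainder

labels = ["sports"] * 5 + ["tech"] * 5
split = [["sports"] * 4 + ["tech"], ["sports"] + ["tech"] * 4]
print(entropy(labels))                   # 1.0 bit for a 50/50 class mix
print(information_gain(labels, split))   # > 0: the split is informative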

22
Statistical (Bayesian) Classification
  • For document-class data, we calculate the
    probabilities of occurrence of events
  • Bayes' Theorem:
  • P(c|d) = P(c) · P(d|c) / P(d)
  • Given a document d, the probability that it
    belongs to a class c is given by the above
    formula.
  • In practice, the exact values of the
    probabilities of each event are unknown, and are
    estimated from the samples

23
Naïve Bayes Classification
  • Probability of the document event d:
  • P(d) = P(w1, ..., wn), where the wi are the words
  • The RHS is generally a headache - we have to
    consider the interdependence of each of the wj
    events
  • Naïve Bayes: assume all the wj events are
    independent. The RHS then factors into
  • Π P(wj)
  • Most Bayesian text classifiers work with this
    simplification (a minimal sketch follows below)
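A minimal multinomial Naïve Bayes sketch: per-class word counts,
add-one smoothing, and log probabilities to avoid underflow (the
classes and documents are made up):

import math
from collections import Counter, defaultdict

train = [
    ("comp", "java programming language"),
    ("comp", "java compiler program"),
    ("cricket", "india beat pakistan"),
    ("cricket", "tendulkar made runs"),
]

class_docs = Counter(c for c, _ in train)    # document counts per class, for P(c)
word_counts = defaultdict(Counter)           # per-class word counts, for P(w|c)
for c, text in train:
    word_counts[c].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """argmax_c log P(c) + sum_j log P(w_j | c), with add-one smoothing."""
    best, best_score = None, float("-inf")
    for c in class_docs:
        score = math.log(class_docs[c] / len(train))
        total = sum(word_counts[c].values())
        for w in text.split():
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

print(classify("java program"))   # expected: "comp"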

24
Bayesian Belief Networks
  • This is an intermediate approach
  • Not all words are independent
  • If "java" and "program" occur together, then boost
    the probability value of class "computer
    programming"
  • If "java" and "indonesia" occur together, then the
    document is more likely about some other class
  • Problem?
  • How do we come up with correlations like the above?

25
Other classification techniques
  • Support Vector Machines
  • Find the best discriminant plane between two
    classes
  • k Nearest Neighbour
  • Association Rule Mining
  • Neural Networks
  • Case-based reasoning

26
An example: Text Classification from Labeled
and Unlabeled Documents with Expectation
Maximization
  • Problem setting:
  • Labeling documents is a manual process
  • A lot more unlabeled documents are available as
    compared to labeled documents
  • Unlabeled documents contain information which
    could help in the classification activity

27
An example (contd.)
  • Train a classifier with the labeled documents
  • Say, a Naïve Bayes classifier
  • This classifier estimates the model parameters
    (the prior probabilities of the various events)
  • Now, classify the unlabeled documents.
  • Assuming the applied labels to be correct,
    re-estimate the model parameters
  • Repeat the above steps till convergence (a rough
    sketch of this loop follows below)
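A rough sketch of this loop using hard labels and scikit-learn's
CountVectorizer and MultinomialNB (assumed to be available); the Nigam
et al. paper actually weights the unlabeled documents by their label
probabilities rather than committing to hard labels, and the documents
here are made up:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled = ["java programming language", "india beat pakistan"]
labels = ["comp", "cricket"]
unlabeled = ["java compiler program", "tendulkar made runs", "pakistan beat australia"]

vectorizer = CountVectorizer()
X_lab = vectorizer.fit_transform(labeled)
X_unl = vectorizer.transform(unlabeled)

# Step 1: train a classifier on the labeled documents only.
clf = MultinomialNB().fit(X_lab, labels)

# Steps 2-3: label the unlabeled documents, re-estimate, repeat until stable.
prev = None
for _ in range(10):
    guessed = clf.predict(X_unl)
    if prev is not None and (guessed == prev).all():
        break                     # convergence: the guessed labels stopped changing
    prev = guessed
    X_all = np.vstack([X_lab.toarray(), X_unl.toarray()])
    clf = MultinomialNB().fit(X_all, list(labels) + list(guessed))

print(clf.predict(X_unl))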

28
Expectation Maximization
  • A useful technique for estimating hidden
    parameters
  • In the previous example, the class labels were
    missing from some documents
  • Consists of two steps
  • E-step: set z(k+1) = E[z | D; θ(k)]
  • M-step: set θ(k+1) = arg maxθ P(θ | D; z(k+1))
  • The above steps are repeated till convergence,
    and convergence does occur

29
Another example: Fast and Accurate Text
Classification via Multiple Linear Discriminant
Projections
30
Contd.
  • Idea:
  • Find a direction α which maximizes the separation
    between classes
  • Why?
  • Reduce noise, or rather
  • Enhance the differences between classes
  • The vector corresponding to this direction is
    Fisher's discriminant
  • Project the data points onto this α
  • For all data points not separated by this vector,
    choose another α (a small sketch of computing one
    such direction follows below)
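A small numpy sketch of computing one Fisher discriminant direction for
two classes, as the within-class scatter inverse applied to the
difference of class means (the data is made up, and the paper builds
several such directions):

import numpy as np

# Toy 2-D data for two classes (rows are data points).
X0 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])   # class 0
X1 = np.array([[6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])   # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter matrix S_W = sum over classes of (x - mu)(x - mu)^T.
S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# Fisher's direction: alpha proportional to S_W^{-1} (mu1 - mu0).
alpha = np.linalg.solve(S_W, mu1 - mu0)
alpha /= np.linalg.norm(alpha)

# Project every point onto alpha; the two classes separate along this axis.
print(X0 @ alpha, X1 @ alpha)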

31
Contd.
  • Repeat till all data are now separable
  • Note, we are looking at a 2-class case. This
    easily extends to multiple classes
  • Project all the document vectors into the space
    represented by the α vectors as the basis vectors
  • Now, induce a decision tree on this projected
    representation
  • The number of attributes is highly reduced
  • Since this representation nicely separates the
    data points (documents), accuracy increases

32
Web Text Mining
  • The WWW is a huge, directed graph, with documents
    as nodes and hyperlinks as the directed edges
  • Apart from the text itself, this graph structure
    carries a lot of information about the
    usefulness of the nodes
  • For example
  • 10 random, average people on the streets say Mr.
    T. Ache is a good dentist
  • 5 reputed doctors, including dentists, recommend
    Mr. P. Killer as a better dentist
  • Who would you choose?

33
Kleinberg's HITS
  • HITS: Hypertext Induced Topic Selection
  • Nodes on the web can be categorized into two
    types: hubs and authorities
  • Authorities are nodes which one refers to for
    definitive information about a topic
  • Hubs point to authorities
  • HITS computes the hub and authority scores on a
    sub-universe of the web
  • How does one collect this sub-universe?

34
HITS (contd.)
  • The basic steps:
  • A(u) = Σ H(v), summed over all v pointing to u
  • H(u) = Σ A(v), summed over all v pointed to by u
  • Repeat the above till convergence (a small sketch
    follows below)
  • Nodes with high A scores are relevant
  • Relevant to what?
  • Can we use this for efficient retrieval for a
    query?
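A small sketch of the HITS iteration on a made-up link graph, with the
scores normalized every round so they stay bounded:

import math

# Toy web graph: node -> list of nodes it links to.
links = {"p1": ["a1", "a2"], "p2": ["a1"], "p3": ["a1", "a2"], "a1": [], "a2": []}
nodes = list(links)

hub = {n: 1.0 for n in nodes}
auth = {n: 1.0 for n in nodes}

for _ in range(20):   # a fixed number of rounds stands in for a convergence test
    # Authority score: sum of hub scores of the nodes pointing to u.
    auth = {u: sum(hub[v] for v in nodes if u in links[v]) for u in nodes}
    # Hub score: sum of authority scores of the nodes u points to.
    hub = {u: sum(auth[v] for v in links[u]) for u in nodes}
    # Normalize so the scores stay bounded across iterations.
    a_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
    h_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
    auth = {u: x / a_norm for u, x in auth.items()}
    hub = {u: x / h_norm for u, x in hub.items()}

print(sorted(auth.items(), key=lambda kv: -kv[1]))   # a1, a2 come out as authorities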

35
PageRank
  • Similar to HITS, but all pages have only one
    score: a rank
  • R(u) = c Σ (R(v)/Nv)
  • The sum is over the pages v linking to u; Nv is
    the number of outgoing links of v, and c is a
    scaling factor (< 1)
  • The higher the rank of the pages linking to a
    page, the higher is its own rank!
  • To handle rank sinks (documents which do not link
    outside a set of pages), the formula is modified
    as
  • R(u) = c Σ (R(v)/Nv) + c E(u)
  • E(u) is defined over a set of pages and acts as a
    rank source (what kind of pages?) - a small
    sketch follows below
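A small sketch of the PageRank iteration on a made-up graph, taking the
c·E(u) term to be a uniform (1 - c)/N rank source, which is one common
way to instantiate it:

# Toy web graph: node -> list of nodes it links to (every node has out-links).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
nodes = list(links)
N = len(nodes)
c = 0.85                          # scaling factor (< 1)

rank = {u: 1.0 / N for u in nodes}
for _ in range(50):               # fixed rounds stand in for a convergence check
    new_rank = {}
    for u in nodes:
        in_links = [v for v in nodes if u in links[v]]   # pages pointing to u
        new_rank[u] = c * sum(rank[v] / len(links[v]) for v in in_links) + (1 - c) / N
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))   # "c" collects the most rank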

36
Some more topics which we haven't touched
  • Using external dictionaries
  • WordNet
  • Using language specific techniques
  • Computational linguistics
  • Use grammar for judging the sense of a query in
    the information retrieval scenario
  • Other interesting techniques
  • Latent Semantic Indexing
  • Finding the latent information in documents using
    Linear Algebra Techniques

37
Some more comments
  • Some purists do not consider most of the current
    activities in the text mining field as "real"
    text mining
  • For example, see Marti Hearst's write-up,
    "Untangling Text Data Mining"

38
Some more comments (contd.)
  • One example that she mentions:
  • stress is associated with migraines
  • stress can lead to loss of magnesium
  • calcium channel blockers prevent some migraines
  • magnesium is a natural calcium channel blocker
  • spreading cortical depression (SCD) is implicated
    in some migraines
  • high levels of magnesium inhibit SCD
  • migraine patients have high platelet
    aggregability
  • magnesium can suppress platelet aggregability
  • The above was inferred from a set of documents,
    with some human help

39
References
  • Data Mining: Concepts and Techniques, by Jiawei
    Han and Micheline Kamber
  • Principles of Data Mining, by David J. Hand et al.
  • Text Classification from Labeled and Unlabeled
    Documents using EM, by Kamal Nigam et al.
  • Fast and Accurate Text Classification via
    Multiple Linear Discriminant Projections, by S.
    Chakrabarti et al.
  • Frequent Term-Based Text Clustering, by Florian
    Beil et al.
  • The PageRank Citation Ranking: Bringing Order to
    the Web, by Lawrence Page and Sergey Brin
  • Untangling Text Data Mining, by Marti A. Hearst,
    http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
  • And others