Chapter 7: Text mining

Slides: 50
Provided by: xlin1
Learn more at: https://www.cs.uic.edu

1
Chapter 7 Text mining
2
Text mining
  • It refers to data mining using text documents as
    data.
  • There are many special techniques for
    pre-processing text documents to make them
    suitable for mining.
  • Most of these techniques are from the field of
    Information Retrieval.

3
Information Retrieval (IR)
  • Conceptually, information retrieval (IR) is the
    study of finding needed information. I.e., IR
    helps users find information that matches their
    information needs.
  • Historically, information retrieval is about
    document retrieval, emphasizing the document as
    the basic unit.
  • Technically, IR studies the acquisition,
    organization, storage, retrieval, and
    distribution of information.
  • IR has become a center of focus in the Web era.

4
Information Retrieval
  • Translating information needs into queries
  • Matching queries to stored information
  • Query result evaluation: does the information
    found match the user's information needs?
5
Text Processing
  • Word (token) extraction
  • Stop words
  • Stemming
  • Frequency counts

6
Stop words
  • Many of the most frequently used words in English
    are worthless in IR and text mining; these words
    are called stop words.
      • the, of, and, to, ...
      • Typically about 400 to 500 such words
      • For an application, an additional
        domain-specific stop word list may be
        constructed
  • Why do we need to remove stop words?
      • Reduce indexing (or data) file size
          • stop words account for 20-30% of total
            word counts.
      • Improve efficiency
          • stop words are not useful for searching or
            text mining
          • stop words always have a large number of
            hits
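
Word extraction and stop word removal can be sketched in a few lines of Python. This is only an illustration: the names `tokenize` and `remove_stop_words` are hypothetical, and `STOP_WORDS` is a tiny made-up list rather than a real 400 to 500 word list.

```python
import re

# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "are"}

def tokenize(text):
    """Lowercase the text and extract runs of letters/digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("The users of the software and the hardware")
content_words = remove_stop_words(tokens)
```

Here five of the eight tokens are stop words, which shows how quickly they inflate an index.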

7
Stemming
  • Techniques used to find the root/stem of a word
  • E.g.,
      • user, users, used, using -> stem: use
      • engineering, engineered, engineer -> stem:
        engineer
  • Usefulness
      • improving effectiveness of IR and text mining
          • matching similar words
      • reducing indexing size
          • combining words with the same roots may
            reduce indexing size by as much as 40-50%.

8
Basic stemming methods
  • remove ending
      • if a word ends with a consonant other than s,
        followed by an s, then delete the s.
      • if a word ends in es, drop the s.
      • if a word ends in ing, delete the ing unless
        the remaining word consists only of one letter
        or of th.
      • if a word ends with ed, preceded by a
        consonant, delete the ed unless this leaves
        only a single letter.
      • ...
  • transform words
      • if a word ends with ies but not eies or aies,
        then ies -> y.
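
The rules above can be turned into a small function. This is a minimal sketch, not a full stemmer: the name `simple_stem` is hypothetical, the order in which the rules are tried is an assumption (the slide does not give one), and "y" is treated as a consonant.

```python
VOWELS = set("aeiou")

def simple_stem(word):
    """Apply the suffix rules listed above, trying one rule per word."""
    w = word.lower()
    # Transform rule: ies -> y, unless the word ends in eies or aies.
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Ends in es: drop only the final s.
    if w.endswith("es"):
        return w[:-1]
    # Consonant other than s, followed by s: delete the s.
    if len(w) >= 2 and w[-1] == "s" and w[-2] not in VOWELS and w[-2] != "s":
        return w[:-1]
    # Ends in ing: delete it unless one letter or "th" would remain.
    if w.endswith("ing"):
        rest = w[:-3]
        return rest if len(rest) > 1 and rest != "th" else w
    # Ends in ed preceded by a consonant: delete it unless one letter remains.
    if w.endswith("ed") and len(w) >= 3 and w[-3] not in VOWELS:
        rest = w[:-2]
        return rest if len(rest) > 1 else w
    return w
```

For example, `simple_stem("engineering")` and `simple_stem("engineered")` both give `engineer`, while `thing` is left unchanged by the "th" exception.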

9
Frequency counts
  • Count the number of times a word occurs in a
    document.
  • Count the number of documents in a collection
    that contain the word.
  • Using occurrence frequencies to indicate relative
    importance of a word in a document.
  • if a word appears often in a document, the
    document likely deals with subjects related to
    the word.

10
Vector Space Representation
  • A document is represented as a vector
      • $(W_1, W_2, \ldots, W_n)$
  • Binary
      • $W_i = 1$ if the corresponding term i (often a
        word) is in the document
      • $W_i = 0$ if the term i is not in the document
  • TF (Term Frequency)
      • $W_i = tf_i$, where $tf_i$ is the number of
        times the term occurred in the document
  • TFIDF (Term Frequency × Inverse Document Frequency)
      • $W_i = tf_i \times idf_i = tf_i \times \log(N / df_i)$,
        where $df_i$ is the number of documents that
        contain term i, and N is the total number of
        documents in the collection.
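
As a rough illustration of the TFIDF weighting, the sketch below builds one weight vector per document. The function name `tfidf_vectors` is hypothetical, and the natural logarithm is an assumption, since the slide does not specify a base.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of weight vectors)."""
    n_docs = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    # df[t]: number of documents that contain term t at least once.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf[t]: occurrences of t in this document
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

docs = [
    ["hardware", "hardware", "users"],
    ["software", "users"],
    ["software", "software", "users"],
]
vocab, vectors = tfidf_vectors(docs)
```

In this toy collection "users" occurs in every document, so its idf is log(3/3) = 0 and it contributes nothing to any vector.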

11
Vector Space and Document Similarity
  • Each indexing term is a dimension. An indexing
    term is normally a word.
  • Each document is a vector
      • Di = (ti1, ti2, ti3, ti4, ..., tin)
      • Dj = (tj1, tj2, tj3, tj4, ..., tjn)
  • Document similarity is defined as
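
The similarity formula itself does not appear in this transcript. Assuming the cosine measure that the worked example later in the chapter uses, it can be written as:

```latex
\mathrm{sim}(D_i, D_j) = \cos(D_i, D_j)
  = \frac{\sum_{k=1}^{n} t_{ik}\, t_{jk}}
         {\sqrt{\sum_{k=1}^{n} t_{ik}^{2}}\;\sqrt{\sum_{k=1}^{n} t_{jk}^{2}}}
```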

12
Query formats
  • A query is a representation of the user's
    information needs
      • Normally a list of words.
  • Query as a simple question in natural language
      • The system translates the question into
        executable queries
  • Query as a document
      • "Find similar documents like this one"
      • The system defines what the similarity is

13
An Example
  • A document space is defined by three terms
      • hardware, software, users
  • A set of documents is defined as
      • A1 = (1, 0, 0), A2 = (0, 1, 0), A3 = (0, 0, 1)
      • A4 = (1, 1, 0), A5 = (1, 0, 1), A6 = (0, 1, 1)
      • A7 = (1, 1, 1), A8 = (1, 0, 1), A9 = (0, 1, 1)
  • If the query is "hardware and software",
      • what documents should be retrieved?

14
An Example (cont.)
  • In Boolean query matching
      • documents A4, A7 will be retrieved (AND)
      • documents A1, A2, A4, A5, A6, A7, A8, A9 will
        be retrieved (OR)
  • In similarity matching (cosine)
      • q = (1, 1, 0)
      • S(q, A1) = 0.71, S(q, A2) = 0.71, S(q, A3) = 0
      • S(q, A4) = 1, S(q, A5) = 0.5, S(q, A6) = 0.5
      • S(q, A7) = 0.82, S(q, A8) = 0.5, S(q, A9) = 0.5
  • Document retrieved set (with ranking)
      • A4, A7, A1, A2, A5, A6, A8, A9
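
The numbers in this example can be reproduced with a short script. This is a sketch under the assumption that S is the cosine of the angle between the query and document vectors; the variable names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Term order: (hardware, software, users)
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)

boolean_and = [name for name, v in docs.items() if v[0] and v[1]]
boolean_or = [name for name, v in docs.items() if v[0] or v[1]]

scores = {name: cosine(q, v) for name, v in docs.items()}
# Rank documents with non-zero similarity, best first; ties keep document order.
ranking = sorted((n for n in docs if scores[n] > 0), key=lambda n: -scores[n])
```

Unlike Boolean matching, the cosine ranking orders the documents: A4 matches the query exactly, while A7 is penalized for the extra term "users".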

15
Relevance judgment for IR
  • A measurement of the outcome of a search or
    retrieval
  • The judgment on what should or should not be
    retrieved.
  • There is no simple answer to what is relevant and
    what is not relevant; human users are needed to
    judge.
      • difficult to define
      • subjective
      • depending on knowledge, needs, time, etc.
  • The central concept of information retrieval

16
Precision and Recall
  • Given a query
  • Are all retrieved documents relevant?
  • Have all the relevant documents been retrieved?
  • Measures for system performance
  • The first question is about the precision of the
    search
  • The second is about the completeness (recall) of
    the search.

17
Precision and Recall (cont)
                   Relevant    Not relevant
  Retrieved           a             b
  Not retrieved       c             d

  $P = \frac{a}{a + b}$        $R = \frac{a}{a + c}$
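
A minimal sketch of how P and R follow from the retrieved and relevant sets, using the cell names a, b and c of the table. The function name `precision_recall` and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)   # retrieved and relevant
    b = len(retrieved - relevant)   # retrieved but not relevant
    c = len(relevant - retrieved)   # relevant but not retrieved
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    return precision, recall

p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
```

With four documents retrieved, two of them relevant, and three relevant documents overall, this gives P = 2/4 and R = 2/3.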
18
Precision and Recall (cont)
  • Precision measures how precise a search is.
      • the higher the precision, the fewer unwanted
        documents.
  • Recall measures how complete a search is.
      • the higher the recall, the fewer missing
        documents.

19
Relationship of R and P
  • Theoretically,
      • R and P do not depend on each other.
  • Practically,
      • High recall is achieved at the expense of
        precision.
      • High precision is achieved at the expense of
        recall.
  • When will P = 0?
      • Only when none of the retrieved documents is
        relevant.
  • When will P = 1?
      • Only when every retrieved document is relevant.
  • Depending on the application, you may want a
    higher precision or a higher recall.

20
P-R diagram
[Figure: precision (P) plotted against recall (R) for System A, System B and System C; both axes are marked at 0.1, 0.5 and 1.0.]
21
Alternative measures
  • Combining recall and precision: the F score
      • $F = \frac{2PR}{R + P}$
  • Breakeven point: the point where $P = R$
  • These two measures are commonly used in text
    mining classification and clustering.
  • Accuracy is not normally used in the text domain
    because the set of relevant documents is almost
    always very small compared to the set of
    irrelevant documents.
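
The point about accuracy can be made concrete with a small sketch. The function names and the collection sizes below are made up for illustration; a, b, c and d are the counts of retrieved-relevant, retrieved-irrelevant, missed-relevant and correctly-ignored documents.

```python
def f_score(p, r):
    """F = 2PR / (R + P), defined as 0 when both P and R are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0

def accuracy(a, b, c, d):
    """Fraction of all documents that were handled correctly."""
    return (a + d) / (a + b + c + d)

# Hypothetical collection: 1000 documents, 10 relevant, and a system
# that retrieves nothing at all.
acc = accuracy(a=0, b=0, c=10, d=990)
f = f_score(0.0, 0.0)
```

The do-nothing system scores 0.99 on accuracy but 0 on the F score, which is why accuracy says little when relevant documents are rare.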

22
Web Search as a huge IR system
  • A Web crawler (robot) crawls the Web to collect
    all the pages.
  • Servers establish a huge inverted indexing
    database and other indexing databases
  • At query (search) time, search engines conduct
    different types of vector query matching

23
Different search engines
  • The real differences among different search
    engines are
  • their indexing weight schemes
  • their query process methods
  • their ranking algorithms
  • None of these are published by any of the search
    engine firms.

24
Vector Space Based Document Classification
25
Vector Space Representation
  • Each doc j is a vector, one component for each
    term (= word).
  • Have a vector space
      • terms are attributes
      • n docs live in this space
      • even with stop word removal and stemming, we
        may have 10,000 dimensions, or even 1,000,000

26
Classification in Vector space
  • Each training doc is a point (vector) labeled by
    its topic (= class)
  • Hypothesis: docs of the same topic form a
    contiguous region of space
  • Define surfaces to delineate topics in space

27
Test doc = Government
[Figure: training documents of the Government, Science and Arts classes plotted as points in the vector space.]
28
Rocchio Classification Method
  • Given training documents, compute a prototype
    vector for each class.
  • Given a test doc, assign it to the topic whose
    prototype (centroid) is nearest using cosine
    similarity.

29
Rocchio Classification

  • Combine the document vectors into a prototype
    vector for each class cj.
  • α and β are parameters that adjust the relative
    impact of relevant and irrelevant training
    examples. Normally, α = 16 and β = 4.
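
The prototype formula is not shown in this transcript. The sketch below assumes the commonly used form in which the prototype for a class is α times the mean of the length-normalized vectors in the class minus β times the mean of the length-normalized vectors outside it. The function names and the toy data are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine(u, v):
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

def rocchio_prototypes(docs, labels, alpha=16.0, beta=4.0):
    """Assumed form: alpha * mean(in-class) - beta * mean(out-of-class)."""
    unit = [normalize(d) for d in docs]
    dim = len(docs[0])
    prototypes = {}
    for c in set(labels):
        pos = [d for d, y in zip(unit, labels) if y == c]
        neg = [d for d, y in zip(unit, labels) if y != c]
        proto = []
        for i in range(dim):
            p = sum(d[i] for d in pos) / len(pos)
            n = sum(d[i] for d in neg) / len(neg) if neg else 0.0
            proto.append(alpha * p - beta * n)
        prototypes[c] = proto
    return prototypes

def classify(doc, prototypes):
    """Assign the class whose prototype has the highest cosine similarity."""
    return max(prototypes, key=lambda c: cosine(doc, prototypes[c]))

train = [[2, 0, 0], [1, 1, 0], [0, 0, 3], [0, 1, 2]]
labels = ["hw", "hw", "users", "users"]
protos = rocchio_prototypes(train, labels)
```

Training is a single pass over the data, which is the main attraction of the method; the price is that each class is summarized by one point.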

30
Naïve Bayesian Classifier
  • Given a set of training documents D,
      • each document is considered an ordered list of
        words.
      • $w_{d_i,k}$ denotes the word $w_t$ in position k
        of document $d_i$, where each word is from the
        vocabulary $V = \langle w_1, w_2, \ldots, w_{|V|} \rangle$.
  • Let $C = \{c_1, c_2, \ldots, c_{|C|}\}$ be the set of
    pre-defined classes.
  • There are two naïve Bayesian models,
      • One based on the multi-variate Bernoulli model
        (a word occurs or does not occur in a document).
      • One based on the multinomial model (the number
        of word occurrences is considered)

31
Naïve Bayesian Classifier (multinomial model)
(1)
(2)
$N(w_t, d_i)$ is the number of times the word $w_t$
occurs in document $d_i$. $P(c_j | d_i) \in \{0, 1\}$.
(3)
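
The three equations are not reproduced in this transcript. The sketch below assumes the standard multinomial formulation: word probabilities estimated from counts with add-one (Laplace) smoothing, class priors estimated as the fraction of training documents per class, and classification by the largest posterior, computed in log space. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    """docs: list of token lists; labels: one class per document."""
    vocab = {w for doc in docs for w in doc}
    word_counts = defaultdict(Counter)          # per-class word counts
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    class_sizes = Counter(labels)
    priors = {c: n / len(docs) for c, n in class_sizes.items()}
    cond = {}
    for c in class_sizes:
        total = sum(word_counts[c].values())
        # Add-one smoothing: (1 + count) / (|V| + total words in class)
        cond[c] = {w: (1 + word_counts[c][w]) / (len(vocab) + total)
                   for w in vocab}
    return priors, cond, vocab

def classify_nb(doc, priors, cond, vocab):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in doc:
            if w in vocab:
                score += math.log(cond[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

docs = [["cpu", "disk", "cpu"], ["compiler", "code"],
        ["disk", "memory"], ["code", "bug", "code"]]
labels = ["hardware", "software", "hardware", "software"]
priors, cond, vocab = train_multinomial_nb(docs, labels)
```

Because word counts enter the estimates directly, a word that occurs twice in a document counts twice, which is what distinguishes this model from the Bernoulli one.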
32
k Nearest Neighbor Classification
  • To classify document d into class c
      • Define the k-neighborhood N as the k nearest
        neighbors of d
      • Count the number of documents n in N that
        belong to c
      • Estimate $P(c | d)$ as n/k
  • No training is needed (?). Classification time is
    linear in training set size.
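
The procedure can be sketched as follows. Using cosine similarity to decide which neighbors are nearest is an assumption here (the slide does not fix a distance), and the function names and toy vectors are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_estimate(doc, training, k):
    """training: list of (vector, label). Returns label -> n/k over the k nearest."""
    # Every training document is compared with doc: time linear in training size.
    neighbors = sorted(training, key=lambda ex: cosine(doc, ex[0]), reverse=True)[:k]
    counts = Counter(label for _, label in neighbors)
    return {label: n / k for label, n in counts.items()}

training = [
    ([1.0, 0.0], "science"), ([0.9, 0.1], "science"),
    ([0.0, 1.0], "arts"), ([0.1, 0.9], "arts"),
    ([0.8, 0.3], "government"),
]
estimate = knn_estimate([1.0, 0.1], training, k=3)
```

Here the three nearest neighbors are two science documents and one government document, so the estimate for science is 2/3.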

33
Example
[Figure: documents of the Government, Science and Arts classes plotted in the vector space.]
34
Example: k = 6 (6NN)
P(Science | test doc)?
[Figure: a test document and its nearest neighbors among the Government, Science and Arts documents.]
35
Linear classifiers Binary Classification
  • Consider 2-class problems
  • Assume linear separability for now
      • in 2 dimensions, can separate by a line
      • in higher dimensions, need hyperplanes
  • Can find a separating hyperplane by linear
    programming (or iteratively, e.g. with the
    perceptron)
      • separator can be expressed as $ax + by = c$

36
Linear programming / Perceptron
Find a, b, c such that $ax + by \ge c$ for red
points and $ax + by \le c$ for green points.
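
A perceptron is one simple way to search for such a, b and c. The sketch below is illustrative only: the function name, the learning rate and the toy points are assumptions, and labels +1 and -1 stand in for the two colors. On linearly separable data the loop stops once every point is on the correct side.

```python
def train_perceptron(points, labels, epochs=100, lr=1.0):
    """Find a, b, c with a*x + b*y > c for label +1 and < c for label -1."""
    a = b = c = 0.0
    for _ in range(epochs):
        errors = 0
        for (x, y), t in zip(points, labels):
            # A point is misclassified if it is on the wrong side (or on the line).
            if t * (a * x + b * y - c) <= 0:
                a += lr * t * x
                b += lr * t * y
                c -= lr * t
                errors += 1
        if errors == 0:      # every point is correctly separated
            break
    return a, b, c

points = [(2, 2), (3, 3), (3, 1), (-1, -1), (-2, 0), (0, -2)]
labels = [1, 1, 1, -1, -1, -1]
a, b, c = train_perceptron(points, labels)
```

The line it returns is just the first separator the updates happen to reach, not a particularly good one.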
37
Linear Classifiers (cont.)
  • Many common text classifiers are linear
    classifiers
  • Despite this similarity, there are large
    performance differences
  • For separable problems, there is an infinite
    number of separating hyperplanes. Which one do
    you choose?
  • What to do for non-separable problems?

38
Which hyperplane?
In general, there are lots of possible solutions for
a, b, c. A Support Vector Machine (SVM) finds an
optimal solution.
39
Support Vector Machine (SVM)
  • SVMs maximize the margin around the separating
    hyperplane.
  • The decision function is fully specified by a
    subset of training samples, the support vectors.
  • Quadratic programming problem.
  • SVMs are very good for text classification

40
Optimal hyperplane
  • Let the training examples be $(x_i, y_i)$,
    $i = 1, 2, \ldots, n$, where $x_i$ is an
    n-dimensional vector and $y_i$ is its class, -1
    or 1.
  • The class represented by the subset with
    $y_i = -1$ and the class represented by the subset
    with $y_i = 1$ are linearly separable if there
    exists $(w, b)$ such that
      • $w^T x_i + b \ge 0$ for $y_i = 1$
      • $w^T x_i + b < 0$ for $y_i = -1$
  • The margin of separation m is the separation
    between the hyperplane $w^T x + b = 0$ and the
    closest data points (support vectors).
  • The goal of an SVM is to find the optimal
    hyperplane with the maximum margin of separation.

41
A Geometrical Interpretation
  • The decision boundary should be as far away from
    the data of both classes as possible
  • We maximize the margin, m

[Figure: points of Class 1 and Class 2 on either side of the decision boundary, with the margin m marked.]
42
SVM formulation: separable case
  • Thus, support vector machines (SVM) are linear
    functions of the form $f(x) = w^T x + b$, where w
    is the weight vector and x is the input vector.
  • To find the linear function
  • Minimize
  • Subject to
  • Quadratic programming.
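
The objective and constraints are not reproduced in this transcript. A standard statement of the separable (hard-margin) problem, given here as an assumption rather than as the slide's own wording, is:

```latex
\begin{aligned}
\text{Minimize:}\quad   & \frac{1}{2}\, w^T w \\
\text{Subject to:}\quad & y_i\,(w^T x_i + b) \ge 1, \qquad i = 1, \ldots, n
\end{aligned}
```

Under this scaling the closest points satisfy $y_i (w^T x_i + b) = 1$ and lie at distance $1 / \lVert w \rVert$ from the hyperplane, so minimizing $w^T w$ maximizes the margin.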

43
Non-separable case: Soft margin SVM
  • To deal with cases where there may be no
    separating hyperplane due to noisy labels of both
    positive and negative training examples, the soft
    margin SVM is proposed.
  • Minimize
  • Subject to
      • $\xi_i \ge 0$, $i = 1, \ldots, n$
  • where $C \ge 0$ is a parameter that controls the
    amount of training errors allowed.
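
The soft-margin objective is not reproduced in this transcript. The sketch below assumes the standard form: minimize (1/2) wᵀw + C Σ ξ_i subject to y_i (wᵀx_i + b) ≥ 1 − ξ_i, where ξ_i is the slack of example i. It only evaluates the objective for a given w and b; it does not solve the quadratic program. The function names and data are hypothetical.

```python
def slacks(w, b, xs, ys):
    """Smallest slack per example: max(0, 1 - y_i * (w . x_i + b))."""
    return [
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in zip(xs, ys)
    ]

def soft_margin_objective(w, b, xs, ys, C):
    """Assumed objective: 0.5 * w.w + C * sum of slacks."""
    return 0.5 * sum(wi * wi for wi in w) + C * sum(slacks(w, b, xs, ys))

xs = [[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0], [-2.0, 0.0]]
ys = [1, 1, 1, -1]
xi = slacks([1.0, 0.0], 0.0, xs, ys)
objective = soft_margin_objective([1.0, 0.0], 0.0, xs, ys, C=1.0)
```

In this toy case the slacks are 0 (outside the margin), 0.5 (inside the margin but correctly classified), 1.5 (misclassified) and 0; a larger C makes such violations more expensive.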

44
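Illustration: Non-separable case
  • Support vectors
      • 1: margin s.v., $\xi_i = 0$: correct
      • 2: non-margin s.v., $\xi_i < 1$: correct (in
        margin)
      • 3: non-margin s.v., $\xi_i > 1$: error

[Figure: data points labeled 1, 2 and 3 according to the three support vector types above.]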
45
Extension to Non-linear Decision surface
  • In general, complex real world applications may
    not be expressed with linear functions.
  • Key idea: transform $x_i$ into a higher
    dimensional space to make life easier
      • Input space: the space the $x_i$ are in
      • Feature space: the space of $\phi(x_i)$ after
        transformation

46
Kernel Trick
  • The mapping function $\phi(\cdot)$ is used to
    project data into a higher dimensional feature
    space.
      • $x = (x_1, \ldots, x_n) \mapsto \phi(x) = (\phi_1(x), \ldots, \phi_N(x))$
  • With a higher dimensional space, the data are
    more likely to be linearly separable.
  • In SVM, the projection can be done implicitly,
    rather than explicitly, because the optimization
    does not actually need the explicit projection.
  • It only needs a way to compute inner products
    between pairs of training examples (e.g., x, z)
  • Kernel: $K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle$
  • If you know how to compute K, you do not need to
    know $\phi$.
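
A small numeric check makes the last point concrete. The example uses the degree-2 polynomial kernel K(x, z) = (x · z)² on 2-dimensional inputs, chosen here only for illustration; its explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²). The function names are hypothetical.

```python
import math

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without ever building phi(x) or phi(z)."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))   # inner product in the 3-D feature space
implicit = poly_kernel(x, z)     # same value from the 2-D inputs
```

Both routes give 121 for these inputs (up to floating-point rounding in the explicit one), but the kernel needs only one inner product in the input space.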

47
Comments of SVM
  • SVMs are seen as the best-performing method by
    many.
  • The statistical significance of most results is
    not clear.
  • Kernels are an elegant and efficient way to map
    data into a better representation.
  • SVMs can be expensive to train (quadratic
    programming).
  • For text classification, a linear kernel is
    common and often sufficient.

48
Document clustering
  • We can still use the normal clustering
    techniques, e.g., partition and hierarchical
    methods.
  • Documents can be represented using the vector
    space model.
  • For the distance function, the cosine similarity
    measure is commonly used.

49
Summary
  • Text mining applies and adapts data mining
    techniques to the text domain.
  • A significant amount of pre-processing is needed
    before mining, using information retrieval
    techniques.