Title: Text Mining
Text Mining
- Huaizhong KOU
- PhD student of Georges Gardarin
- PRiSM Laboratory
0. Content
- 1. Introduction
- What happens
- What is Text Mining
- Text Mining vs Data Mining
- Applications
- 2. Feature Extraction
- Task
- Indexing
- Dimensionality Reduction
- 3. Document Categorization
- Task
- Architecture
- Categorization Classifiers
- Application: Trend Analysis
- 4. Document Clustering
- Task
- Algorithms
- Application
- 5. Product
- 6. References
1. Text Mining: Introduction
1.1 What happens  1.2 What is Text Mining  1.3 Text Mining vs Data Mining  1.4 Applications
1.1 Introduction: What happens (1)
- Information explosion
- 80% of information is stored in text documents: journals, web pages, emails, ...
- It is difficult to extract specific information
- Current technologies...
[Figure: users posing questions to the Internet]
1.1 Introduction: What happens (2)
- It is necessary to automatically analyze, organize, summarize, ...
[Figure: documents flow through Text Mining, which produces knowledge, e.g. "The value of the shares of XML companies will rise"]
1.2 Introduction: What is Text Mining (1)
- Text Mining: the procedure of synthesizing information by analyzing the relations, the patterns, and the rules among textual data, i.e. semi-structured or unstructured text.
- Techniques:
- data mining
- machine learning
- information retrieval
- statistics
- natural-language understanding
- case-based reasoning
[Refs 1-4]
1.2 Introduction: What is Text Mining (2)
[Figure: text mining framework, with a learning phase and a working phase]
1.3 Introduction: Text Mining vs Data Mining
[Refs 12, 13]
1.4 Introduction: Applications
- The potential applications are countless.
- Customer profile analysis
- Trend analysis
- Information filtering and routing
- Event tracking
- news stories classification
- Web search
- ...
[Refs 12, 13]
2. Feature Extraction
2.1 Task  2.2 Indexing  2.3 Weighting Model  2.4 Dimensionality Reduction
[Refs 7, 11, 14, 18, 20, 22]
2.1 Feature Extraction: Task (1)
- Task: extract a good subset of words to represent documents
[Figure: a document collection flows through feature extraction]
[Refs 7, 11, 14, 18, 20, 22]
2.1 Feature Extraction: Task (2)
While more and more textual information is available online, effective retrieval is difficult without good indexing of text content.
Example: the 16-word sentence above reduces, after feature extraction, to 5 feature terms with their frequencies:
  text (2), information (1), online (1), retrieval (1), index (1)
[Refs 7, 11, 14, 18, 20, 22]
2.2 Feature Extraction: Indexing (1)
- Identification of all unique words
- Removal of stop words
  - non-informative words, e.g. the, and, when, more
- Word stemming
  - removal of suffixes to generate word stems
  - groups word variants, increasing the relevance
  - e.g. walker, walking → walk
- Term weighting
  - naive terms
  - importance of each term in the document
[Refs 7, 11, 14, 18, 20, 22]
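As an illustration of this pipeline, here is a minimal Python sketch; the toy stop list and the crude suffix-stripping rule are stand-ins for a real stop list and a real stemmer such as Porter's:

    import re
    from collections import Counter

    STOP_WORDS = {"the", "and", "when", "more", "was", "is", "of", "a", "to"}  # toy stop list

    def stem(word):
        """Crude suffix stripping (a stand-in for a real stemmer)."""
        for suffix in ("ing", "ers", "er", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word

    def index(document):
        """Identify words, remove stop words, stem, and count term frequencies."""
        words = re.findall(r"[a-z]+", document.lower())
        terms = [stem(w) for w in words if w not in STOP_WORDS]
        return Counter(terms)

    print(index("The walker was walking when more walkers arrived."))
    # Counter({'walk': 3, 'arrived': 1})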
2.2 Feature Extraction: Indexing (2)
- Document representation: the vector space model
  d = (w1, w2, ..., wt) ∈ R^t
  where wi is the weight of the ith term in document d.
[Refs 7, 11, 14, 18, 20, 22]
2.3 Feature Extraction: Weighting Model (1)
- tf: Term Frequency weighting
  wij = Freqij
  where Freqij = the number of times the jth term occurs in document Di.
- Drawback: does not reflect a term's importance for discriminating between documents.
[Refs 11, 22]
2.3 Feature Extraction: Weighting Model (2)
- tf·idf: Inverse Document Frequency weighting
  wij = Freqij × log(N / DocFreqj)
  where
  N = the number of documents in the training document collection
  DocFreqj = the number of documents in which the jth term occurs
- Advantage: reflects a term's importance for discriminating between documents.
- Assumption: terms with low DocFreq are better discriminators than ones with high DocFreq in the document collection.
Example weights:
        A    B    K    O    Q    R    S    T    W    X
  D1    0    0    0    0.3  0    0    0    0    0.3  0
  D2    0    0    0.3  0    0    0    0    0    0    0
[Refs 11, 13, 22]
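The tf·idf weights above are easy to compute directly; a minimal Python sketch over a toy two-document collection, using raw counts as Freqij:

    import math
    from collections import Counter

    def tfidf(docs):
        """w_ij = Freq_ij * log(N / DocFreq_j)."""
        N = len(docs)
        doc_freq = Counter()              # DocFreq_j: number of docs containing term j
        for doc in docs:
            doc_freq.update(set(doc))
        weights = []
        for doc in docs:
            freq = Counter(doc)           # Freq_ij: raw term frequency
            weights.append({t: f * math.log(N / doc_freq[t]) for t, f in freq.items()})
        return weights

    docs = [["text", "text", "information", "online"],
            ["text", "retrieval", "index"]]
    for w in tfidf(docs):
        print(w)
    # 'text' occurs in every document, so log(N / DocFreq) = log(2/2) = 0:
    # it gets weight 0 and cannot discriminate between D1 and D2.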
2.3 Feature Extraction: Weighting Model (3)
- Entropy weighting: one standard form, following Dumais [22], is
  wij = log(Freqij + 1) × (1 + ei)
  where
  ei = (1 / log N) × Σj pij log(pij),  pij = Freqij / Σj Freqij
  is the average entropy of the ith term; ei = -1 if the word occurs once in every document, and ei = 0 if the word occurs in only one document.
[Refs 11, 13, 22]
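A short sketch of the average-entropy term ei; the two boundary values stated above fall out directly:

    import math

    def average_entropy(freqs):
        """e_i = (1 / log N) * sum_j p_ij * log(p_ij), p_ij = Freq_ij / sum_j Freq_ij."""
        N = len(freqs)
        total = sum(freqs)
        p = [f / total for f in freqs if f > 0]
        return sum(pi * math.log(pi) for pi in p) / math.log(N)

    print(average_entropy([1, 1, 1, 1]))  # -1.0: the word occurs once in every document
    print(average_entropy([5, 0, 0, 0]))  #  0.0: the word occurs in only one document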
2.4 Feature Extraction: Dimension Reduction
- 2.4.1 Document Frequency Thresholding
- 2.4.2 χ²-statistic
- 2.4.3 Latent Semantic Indexing
[Refs 11, 20, 21, 27]
2.4.1 Dimension Reduction: DocFreq Thresholding
- Document Frequency Thresholding
  1. Start from the naive terms
  2. Calculate DocFreq(w) for each word w
  3. Set a threshold θ
  4. Remove all words with DocFreq(w) < θ
  5. The remaining words are the feature terms
[Refs 11, 20, 21, 27]
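A minimal sketch of DocFreq thresholding, modeling each document as the set of its unique words:

    from collections import Counter

    def df_threshold(docs, theta):
        """Keep only the words whose document frequency reaches the threshold."""
        doc_freq = Counter()
        for doc in docs:
            doc_freq.update(doc)                     # each doc: a set of unique words
        return {w for w, df in doc_freq.items() if df >= theta}

    docs = [{"text", "mining"}, {"text", "web"}, {"text", "mining", "index"}]
    print(df_threshold(docs, theta=2))               # {'text', 'mining'}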
2.4.2 Dimension Reduction: χ²-statistic
- Assumption: a pre-defined category set C = {c1, ..., cl} for a training collection D
- Goal: estimate the independence between a term and a category
Procedure:
  1. Start from the naive terms
  2. Compute the categorical score χ²(w, cj) of each term w for each category cj:
     χ²(w, cj) = N (AD - CB)² / ((A+C)(B+D)(A+B)(C+D))
     where
     A = |{d : d ∈ cj and w ∈ d}|    B = |{d : d ∉ cj and w ∈ d}|
     C = |{d : d ∈ cj and w ∉ d}|    D = |{d : d ∉ cj and w ∉ d}|
     N = |{d : d ∈ D}|
  3. Set a threshold θ
  4. Remove all words with χ²max(w) = maxj χ²(w, cj) < θ
  5. The remaining words are the FEATURE TERMS
[Refs 11, 20, 21, 27]
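A direct translation of the χ² score into Python, with the contingency counts A, B, C, D as defined above and a hypothetical toy labeled collection:

    def chi2(docs, labels, w, c):
        """chi^2(w, c) = N * (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
        A = sum(1 for d, l in zip(docs, labels) if l == c and w in d)
        B = sum(1 for d, l in zip(docs, labels) if l != c and w in d)
        C = sum(1 for d, l in zip(docs, labels) if l == c and w not in d)
        D = sum(1 for d, l in zip(docs, labels) if l != c and w not in d)
        N = len(docs)
        denom = (A + C) * (B + D) * (A + B) * (C + D)
        return N * (A * D - C * B) ** 2 / denom if denom else 0.0

    def chi2_max(docs, labels, w):
        """Score a word by its maximum chi^2 over all categories."""
        return max(chi2(docs, labels, w, c) for c in set(labels))

    docs = [{"stock", "rise"}, {"stock", "fall"}, {"match", "goal"}, {"goal", "win"}]
    labels = ["finance", "finance", "sports", "sports"]
    print(chi2_max(docs, labels, "stock"))  # 4.0: perfectly correlated with 'finance'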
2.4.3 Dimension Reduction: LSI (1)
- LSI: Latent Semantic Indexing
1. SVD Model: Singular Value Decomposition of the term-document matrix X

   X = T0 · S0 · D0'
   (t,d) = (t,m) × (m,m) × (m,d)

   - X: the t × d term-document matrix (rows: terms; columns: documents)
   - T0: t × m orthogonal matrix; its columns are eigenvectors of X·X'
   - D0: d × m orthogonal matrix; its columns are eigenvectors of X'·X
   - S0: m × m diagonal matrix of singular values (the square roots of the eigenvalues), in decreasing order of importance
   - m: the rank of matrix X; m ≤ min(t, d)
[Refs 11, 20, 21, 27]
2.4.3 Dimension Reduction: LSI (2)
2. Approximate Model

   Select k < m and keep only the k largest singular values:

   X ≈ appr(X) = T · S · D'
   (t,d) ≈ (t,k) × (k,k) × (k,d)

- Each column of matrix appr(X) approximately represents one document.
- Given a document vector xi (a column of appr(X), written as a row vector) and its corresponding row di of D, the following holds:

   xi = di · S · T'   and   di = xi · T · S⁻¹
[Refs 11, 20, 21, 27]
2.4.3 Dimension Reduction: LSI (3)
3. Document Representation Model

   A new document d over the t naive terms, d = (w1, w2, ..., wt) ∈ R^t, is mapped into the k-dimensional latent space:

   appr(d) = d · T · S⁻¹ ∈ R^k
   (1,k) = (1,t) × (t,k) × (k,k)

- There is no good method to determine k; it depends on the application domain. Some experiments suggest k between 100 and 300.
[Refs 11, 20, 21, 27]
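The whole LSI construction fits in a few lines of NumPy; a sketch with a toy 4-term × 3-document matrix, including the fold-in of a new document via the appr(d) formula above:

    import numpy as np

    # Toy term-document matrix X (t = 4 terms, d = 3 documents).
    X = np.array([[2., 0., 1.],
                  [1., 1., 0.],
                  [0., 2., 1.],
                  [0., 1., 2.]])

    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)   # X = T0 . S0 . D0'

    k = 2                                   # keep the k largest singular values
    T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
    appr_X = T @ S @ D.T                    # appr(X) = T . S . D'

    # Map a new document d = (w1, ..., wt) into the latent space:
    d = np.array([1., 0., 1., 0.])
    appr_d = d @ T @ np.linalg.inv(S)       # appr(d) = d . T . S^-1
    print(appr_d.shape)                     # (2,): from R^t down to R^k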
3. Document Categorization
3.1 Task  3.2 Architecture  3.3 Categorization Classifiers  3.4 Application
[Refs 1, 2, 4, 5, 11, 18, 23, 24]
3.1 Categorization: Task
- Task: assignment of one or more predefined categories (topics, themes) to a document.
3.2 Categorization: Architecture
[Figure: training documents and a new document d pass through preprocessing, weighting, and feature selection; a classifier built from the predefined categories assigns the category(ies) to d]
3.3 Categorization: Classifiers
- 3.3.1 Centroid-Based Classifier
- 3.3.2 k-Nearest Neighbor Classifier
- 3.3.3 Naive Bayes Classifier
3.3.1 Model: Centroid-Based Classifier (1)
3.3.1 Model: Centroid-Based Classifier (2)
- If the angle between d1 and d3 is larger than the angle between d1 and d2, then cos(d1, d3) < cos(d1, d2): d2 is closer to d1 than d3 is.
[Figure: three document vectors d1, d2, d3 and the angles between them]
- The cosine-based similarity model can reflect the relations between features.
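A minimal sketch of a centroid-based classifier built on this cosine similarity; each category is summarized by the mean vector of its training documents (the toy collection is hypothetical):

    import numpy as np

    def cos(a, b):
        """Cosine similarity: a larger angle means a smaller cosine."""
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    def centroid_classify(train, d):
        """Assign d to the category whose centroid is most similar to d."""
        centroids = {c: docs.mean(axis=0) for c, docs in train.items()}
        return max(centroids, key=lambda c: cos(centroids[c], d))

    train = {"finance": np.array([[3., 1., 0.], [2., 0., 0.]]),
             "sports":  np.array([[0., 1., 3.], [0., 0., 2.]])}
    print(centroid_classify(train, np.array([1., 0., 0.])))  # finance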
3.3.2 Model: k-Nearest Neighbor Classifier
1. Input: new document d
2. Training collection D = {d1, d2, ..., dn}
3. Predefined categories C = {c1, c2, ..., cl}
4. // Compute similarities
   for (di ∈ D) Simil(d, di) = cos(d, di)
5. // Select the k nearest neighbors
   Construct the k-document subset Dk so that
   Simil(d, di) < min{Simil(d, doc) : doc ∈ Dk}, ∀di ∈ D - Dk
6. // Compute a score for each category
   for (ci ∈ C) {
     score(ci) = 0
     for (doc ∈ Dk) score(ci) += ((doc ∈ ci) ? 1 : 0)
   }
7. // Output: assign to d the category c with the highest score
   score(c) ≥ score(ci), ∀ci ∈ C - {c}
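The same algorithm in NumPy; a minimal sketch with cosine similarity and one vote per neighbor:

    import numpy as np

    def knn_classify(docs, cats, d, k):
        """Score each category by its members among the k nearest neighbors."""
        sims = docs @ d / (np.linalg.norm(docs, axis=1) * np.linalg.norm(d))
        nearest = np.argsort(sims)[-k:]          # the k most similar documents
        scores = {}
        for i in nearest:                        # score(ci) += 1 per neighbor in ci
            scores[cats[i]] = scores.get(cats[i], 0) + 1
        return max(scores, key=scores.get)       # category with the highest score

    docs = np.array([[3., 1., 0.], [2., 0., 1.], [0., 1., 3.], [0., 2., 2.]])
    cats = ["finance", "finance", "sports", "sports"]
    print(knn_classify(docs, cats, np.array([1., 0., 0.]), k=3))  # finance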
3.3.3 Model: Naive Bayes Classifier
Basic assumption: all terms are distributed in documents independently of one another.
1. Input: new document d
2. Predefined categories C = {c1, c2, ..., cl}
3. // Compute the probability that d belongs to each class ci ∈ C
   for (ci ∈ C)
     // note that the terms wj of the document are independent of each other
     P(ci | d) ∝ P(ci) × Πwj∈d P(wj | ci)
4. // Output: assign to d the category c with the highest probability
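A minimal multinomial Naive Bayes sketch of steps 3-4; the add-one smoothing is an implementation choice not on the slide:

    import math
    from collections import Counter, defaultdict

    def train_nb(docs, cats):
        """Collect the class counts and per-class word counts."""
        prior, word_counts = Counter(cats), defaultdict(Counter)
        for doc, c in zip(docs, cats):
            word_counts[c].update(doc)
        vocab = {w for doc in docs for w in doc}
        return prior, word_counts, vocab

    def classify_nb(model, d):
        """argmax_c log P(c) + sum_w log P(w|c), terms assumed independent."""
        prior, word_counts, vocab = model
        def log_prob(c):
            total = sum(word_counts[c].values()) + len(vocab)   # add-one smoothing
            return math.log(prior[c]) + sum(
                math.log((word_counts[c][w] + 1) / total) for w in d if w in vocab)
        return max(prior, key=log_prob)

    docs = [["stock", "rise"], ["stock", "fall"], ["match", "goal"], ["goal", "win"]]
    model = train_nb(docs, ["finance", "finance", "sports", "sports"])
    print(classify_nb(model, ["stock", "rise", "goal"]))  # finance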
3.4 Categorization Application: Trend Analysis (EAnalyst System)
Goal: predict the trend of a stock price based on news stories.
- Learning process: sampled textual data and retrieved documents are aligned with trends (slope, confidence) found by piecewise linear fitting and trend clustering; the aligned pairs train a Bayes classifier.
- Categorization: new news documents go through the Bayes classifier, which outputs the new trend (slope, confidence).
[Ref 28]
4. Document Clustering
4.1 Task  4.2 Algorithms  4.3 Application
[Refs 5, 7, 8, 9, 10, 15, 16, 29]
4.1 Document Clustering: Task
- Task: group all documents so that the documents in the same group are more similar to each other than to ones in other groups.
- Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.
[Refs 5, 7, 8, 9, 10, 15, 16, 29]
4.2 Document Clustering: Algorithms
- 4.2.1 k-means
- 4.2.2 Hierarchic Agglomerative Clustering (HAC)
- 4.2.3 Association Rule Hypergraph Partitioning (ARHP)
[Refs 5, 7, 8, 9, 10, 15, 16, 29]
4.2.1 Document Clustering: k-means
- k-means: distance-based flat clustering

0. Input: D = {d1, d2, ..., dn}; k = the number of clusters
1. Select k document vectors as the initial centroids of the k clusters
2. Repeat
3.   Select one vector d among the remaining documents
4.   Compute the similarities between d and the k centroids
5.   Put d in the closest cluster and recompute that cluster's centroid
6. Until the centroids don't change
7. Output: k clusters of documents
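A compact NumPy sketch of the procedure above (batch-style updates and Euclidean distance are one common variant; it assumes no cluster goes empty):

    import numpy as np

    def kmeans(docs, k, iters=100):
        """Distance-based flat clustering; returns one cluster label per document."""
        rng = np.random.default_rng(0)
        centroids = docs[rng.choice(len(docs), k, replace=False)]  # initial centroids
        for _ in range(iters):
            dists = np.linalg.norm(docs[:, None] - centroids, axis=2)
            labels = np.argmin(dists, axis=1)      # put each doc in the closest cluster
            new = np.array([docs[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new, centroids):        # until the centroids don't change
                break
            centroids = new
        return labels

    docs = np.array([[1., 0.], [0.9, 0.1], [0., 1.], [0.1, 0.9]])
    print(kmeans(docs, k=2))   # e.g. [0 0 1 1]: the two near-duplicate pairs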
- Advantages
  - linear time complexity
  - works relatively well in low-dimensional spaces
- Drawbacks
  - distance computation is expensive in high-dimensional spaces
  - the centroid vector may not summarize the cluster's documents well
  - the choice of the initial k clusters affects the quality of the result
[Refs 5, 7, 8, 9, 10, 15, 16, 29]
4.2.2 Document Clustering: HAC
- Hierarchic agglomerative clustering (HAC): distance-based hierarchic clustering

0. Input: D = {d1, d2, ..., dn}
1. Calculate the similarity matrix SIM[i, j]
2. Repeat
3.   Merge the most similar two clusters, K and L, to form a new cluster KL
4.   Compute the similarities between KL and each of the remaining clusters and update SIM[i, j]
5. Until there is a single cluster (or a specified number of clusters)
6. Output: a dendrogram of clusters
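A minimal sketch of HAC with cosine similarity; single-link merging (the similarity of two clusters is that of their closest members) is one common choice, and the dendrogram bookkeeping is omitted for brevity:

    import numpy as np

    def hac(docs, n_clusters=1):
        """Merge the most similar pair of clusters until n_clusters remain."""
        norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
        sim = norm @ norm.T                          # similarity matrix SIM[i, j]
        clusters = [[i] for i in range(len(docs))]
        while len(clusters) > n_clusters:
            best, pair = -np.inf, None
            for a in range(len(clusters)):           # find the most similar two clusters
                for b in range(a + 1, len(clusters)):
                    s = max(sim[i, j] for i in clusters[a] for j in clusters[b])
                    if s > best:
                        best, pair = s, (a, b)
            a, b = pair
            clusters[a] += clusters.pop(b)           # merge K and L into KL
        return clusters

    docs = np.array([[1., 0.], [0.9, 0.2], [0., 1.], [0.1, 0.8]])
    print(hac(docs, n_clusters=2))                   # [[0, 1], [2, 3]]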
- Advantages
  - produces better-quality clusters
  - works relatively well in low-dimensional spaces
- Drawbacks
  - distance computation is expensive in high-dimensional spaces
  - quadratic time complexity
[Refs 5, 7, 8, 9, 10, 15, 16, 29]
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (1)
- A hypergraph H = (V, E): V is a set of vertices; E is a set of hyperedges.
[Refs 30-35]
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (2)
- Transactional view of documents and features
  - item = document
  - transaction = feature

Transactional database of documents and features (transactions w1..wt over items Doc1..Docn):

          Doc1  Doc2  Doc3  ...  Docn
    w1      5     5     2   ...    1
    w2      2     4     3   ...    5
    w3      0     0     0   ...    1
    ...    ...   ...   ...  ...   ...
    wt      6     0     0   ...    3
[Refs 30-35]
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (3)
Pipeline: document-feature transaction database → discovering association rules (Apriori algorithm) → constructing the hypergraph → association rule hypergraph → partitioning the hypergraph (hypergraph partitioning algorithm) → k partitions
- Hyperedges: frequent item sets
- Hyperedge weight: the average of the confidences of all rules over the item set
- Assumption: documents occurring in the same frequent item set are more similar
[Refs 30-35]
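A sketch of the rule-discovery stage only, using a bare-bones Apriori; the min_support value and the toy transactions are illustrative, and the hypergraph construction and partitioning stages are left to the cited tools:

    from itertools import combinations

    def apriori(transactions, min_support):
        """Level-wise search: join frequent k-item sets, keep those with enough support."""
        def support(itemset):
            return sum(1 for t in transactions if itemset <= t)
        items = {i for t in transactions for i in t}
        current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
        frequent = []
        while current:
            frequent += current
            candidates = {a | b for a, b in combinations(current, 2)
                          if len(a | b) == len(a) + 1}
            current = [c for c in candidates if support(c) >= min_support]
        return frequent

    # Each transaction is one feature; its items are the documents containing it.
    transactions = [{"Doc1", "Doc2"}, {"Doc1", "Doc2", "Doc3"}, {"Doc3", "Doc4"}]
    print(apriori(transactions, min_support=2))
    # includes frozenset({'Doc1', 'Doc2'}): a hyperedge connecting Doc1 and Doc2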
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (4)
- Advantages
  - No calculation of cluster means is required.
  - Linear time complexity.
  - The quality of the clusters is not affected by the dimensionality of the space.
  - Performs much better than traditional clustering in high-dimensional spaces, in terms of both cluster quality and runtime.
[Refs 30-35]
4.3 Document Clustering: Applications
- Summarization of documents
- Navigation of large document collections
- Organization of Web search results
[Refs 10, 15-17]
5. Product: Intelligent Miner for Text (IMT) (1)
[Refs 5, 36]
5. Product: Intelligent Miner for Text (IMT) (2)
- 1. Feature extraction tools
  - 1.1 Information extraction
    - Extracts linguistic items that represent the document contents
  - 1.2 Feature extraction
    - Assigns different categories to the vocabulary in documents
    - Measures their importance to the document content
  - 1.3 Name extraction
    - Locates names in text
    - Determines what type of entity a name refers to
  - 1.4 Term extraction
    - Discovers terms in text, including multiword technical terms
    - Recognizes variants of the same concept
  - 1.5 Abbreviation recognition
    - Finds abbreviations and matches them with their full forms
  - 1.6 Relation extraction
5. Product: Intelligent Miner for Text (IMT) (3)
- Feature extraction demo
5. Product: Intelligent Miner for Text (IMT) (4)
- 2. Clustering tools
  - 2.1 Applications
    - Provides an overview of the content of a large document collection
    - Identifies hidden structures between groups of objects
    - Improves the browsing process to find similar or related information
    - Finds outstanding documents within a collection
  - 2.2 Hierarchical clustering
    - Clusters are organized in a clustering tree; related clusters occur in the same branch of the tree.
  - 2.3 Binary relational clustering
    - Relationship of topics: document → cluster → topic.
    - NB: a preprocessing step for the categorization tool
5. Product: Intelligent Miner for Text (IMT) (5)
- Clustering demo: navigation of a document collection
5. Product: Intelligent Miner for Text (IMT) (6)
- 3. Summarization tools
  - 3.1 Steps
    - rank the relevancy of each sentence to the document
    - select the most relevant sentences
    - produce a summary of the document, with its length set by the user
  - 3.2 Applications
    - Judge the relevancy of a full text
      - Easily determine whether the document is relevant to read
    - Enrich search results
      - The results of a query to a search engine can be enriched with a short summary of each document
    - Get a fast overview over document collections
      - read the summary instead of the full document
5. Product: Intelligent Miner for Text (IMT) (7)
- 4. Categorization tool
- Applications
  - Organize intranet documents
  - Assign documents to folders
  - Dispatch requests
  - Forward news to subscribers
[Figure: a news article passes through the categorizer into one of the folders sports, cultures, health, politics, economics, vacations; a router forwards health news to a subscriber whose profile says "I like health news"]
6. References (1)
Bibliography
[1] Marti A. Hearst, "Untangling Text Data Mining", Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26, 1999 (invited paper). http://www.sims.berkeley.edu/hearst
[2] Feldman and Dagan, "KDT - Knowledge Discovery in Texts", in Proceedings of the First Annual Conference on Knowledge Discovery and Data Mining (KDD), Montreal, 1995.
[3] IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, Stockholm, Sweden, August 2, 1999. http://www.cs.biu.ac.il/~feldman/ijcai-workshop%20cfp.html
[4] Taeho C. Jo, "Text Categorization Considering Categorical Weights and Substantial Weights of Informative Keywords", 1999. (http://www.sccs.chukyo-u.ac.jp/ICCS/olp/p3-13/p3-13.htm)
[5] IBM Technology white paper, "Text Mining: Turning Information Into Knowledge", February 17, 1998, editor Daniel Tkach, IBM Software Solutions. (http://allen.comm.virginia.edu/jtl5t/whiteweb.html)
[6] http://allen.comm.virginia.edu/jtl5t/index.htm
[7] G. Salton et al., Introduction to Modern Information Retrieval, McGraw-Hill Book Company, 1983.
[8] Michael Steinbach, George Karypis and Vipin Kumar, "A Comparison of Document Clustering Techniques", KDD-2000.
[9] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey, "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections", SIGIR '92, pages 318-329.
[10] Oren Zamir, Oren Etzioni, Omid Madani, Richard M. Karp, "Fast and Intuitive Clustering of Web Documents", KDD '97, pages 287-290, 1997.
6. References (2)
Bibliography
[11] Kjersti Aas et al., "Text Categorization: A Survey", 1999.
[12] Text Mining white paper. http://textmining.krdl.org.sg/whiteppr.html
[13] Gartner Group, Text Mining white paper, June 2000. http://www.xanalys.com/intelligence_tools/products/text_mining_text.html
[14] Yiming Yang and Jan O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization", in the 14th Int. Conf. on Machine Learning, pp. 412-420, 1997.
[15] C. J. van Rijsbergen, Information Retrieval, Butterworths, London, 1979.
[16] Chris Buckley and Alan F. Lewit, "Optimization of Inverted Vector Searches", SIGIR '85, pp. 97-110, 1985.
[17] Daphne Koller and Mehran Sahami, "Hierarchically Classifying Documents Using Very Few Words", Proceedings of the 14th International Conference on Machine Learning, Nashville, Tennessee, July 1997, pp. 170-178.
[18] T. Joachims, "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization", in Int. Conf. on Machine Learning, 1997.
[19] K. Lang, "NewsWeeder: Learning to Filter Netnews", International Conference on Machine Learning, 1995. http://anther.learning.cs.cmu.edu/ml95.ps
[20] S.P. Harter, "A Probabilistic Approach to Automatic Keyword Indexing", Journal of the American Society for Information Science, July-August 1975.
6. References (3)
Bibliography
[21] Scott Deerwester et al., "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, Vol. 41, No. 6, pp. 391-407, 1990.
[22] S.T. Dumais, "Improving the Retrieval of Information from External Sources", Behavior Research Methods, Instruments and Computers, Vol. 23, No. 2, pp. 229-236, 1991.
[23] T. Yavuz and H. A. Guvenir, "Application of k-Nearest Neighbor on Feature Projections Classifier to Text Categorization", 1998.
[24] Eui-Hong Han and George Karypis, "Centroid-Based Document Classification: Analysis & Experimental Results", in European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), 2000.
[25] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[26] Ji He, Ah-Hwee Tan and Chew-Lim Tan, "A Comparative Study on Chinese Text Categorization Methods", PRICAI 2000 Workshop on Text and Web Mining, Melbourne, pp. 24-35, August 2000. (http://textmining.krdl.org.sg/PRICAI2000/text-categorization.pdf)
[27] Erik Wiener, Jan O. Pedersen and Andreas S. Weigend, "A Neural Network Approach to Topic Spotting", in Proc. 4th Annual Symposium on Document Analysis and Information Retrieval, pp. 22-34, 1993.
[28] V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen and J. Allan, in the Proceedings of the KDD 2000 Conference, pp. 37-44.
[29] Peter Willett, "Recent Trends in Hierarchic Document Clustering: A Critical Review", Information Processing & Management, Vol. 24, No. 5, pp. 577-597, 1988.
6. References (4)
Bibliography
[30] J. Larocca Neto, A.D. Santos, C.A.A. Kaestner, A.A. Freitas, "Document Clustering and Text Summarization", Proc. 4th Int. Conf. on Practical Applications of Knowledge Discovery and Data Mining (PADD-2000), pp. 41-55, London: The Practical Application Company, 2000.
[31] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar and Bamshad Mobasher, "Clustering Based On Association Rule Hypergraphs", SIGMOD '97 Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.
[32] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong (Sam) Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore, "Partitioning-Based Clustering for Web Document Categorization", Decision Support Systems Journal, Vol. 27, No. 3, pp. 329-341, 1999.
[34] George Karypis, Rajat Aggarwal, Vipin Kumar and Shashi Shekhar, "Multilevel Hypergraph Partitioning: Applications in VLSI Domain", in Proceedings of the Design and Automation Conference '97.
[35] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar and Bamshad Mobasher, "Clustering in a High-Dimensional Space Using Hypergraph Models", Technical Report 97-019. http://www-users.cs.umn.edu/~han/
[36] IBM white paper, "Information Mining with the IBM Intelligent Miner Family", Daniel S. Tkach, February 1998.