Title: Text Mining
Text Mining
- Huaizhong KOU
- PhD student of Georges Gardarin
- PRiSM Laboratory
0. Content
- 1. Introduction
- What happens
- What is Text Mining
- Text Mining vs Data Mining
- Applications
- 2. Feature Extraction
- Task
- Indexing
- Dimensionality Reduction
- 3. Document Categorization
- Task
- Architecture
- Categorization Classifiers
- Application: Trend Analysis
- 4. Document Clustering
- Task
- Algorithms
- Application
- 5. Product
- 6. References
1. Text Mining: Introduction
1.1 What happens  1.2 What is Text Mining  1.3 Text Mining vs Data Mining  1.4 Applications
1.1 Introduction: What happens (1)
- Information explosion
- 80% of information is stored in text documents: journals, web pages, emails, ...
- It is difficult to extract specific information
- Current technologies...
[Figure: users posing questions to the Internet]
1.1 Introduction: What happens (2)
- It is necessary to automatically analyze, organize, summarize, ...
[Figure: documents flow through Text Mining, which produces knowledge, e.g. "The value of the shares of XML companies will rise"]
1.2 Introduction: What is Text Mining (1)
- Text Mining: the procedure of synthesizing information by analyzing the relations, the patterns, and the rules among textual data, i.e. semi-structured or unstructured text.
- Techniques:
- data mining
- machine learning
- information retrieval
- statistics
- natural-language understanding
- case-based reasoning
[Refs 1-4]
1.2 Introduction: What is Text Mining (2)
[Figure: text mining framework, with a learning phase and a working phase]
1.3 Introduction: Text Mining vs Data Mining
[Refs 12, 13]
1.4 Introduction: Applications
- The potential applications are countless.
- Customer profile analysis
- Trend analysis
- Information filtering and routing
- Event tracking
- news stories classification
- Web search
- ...
[Refs 12, 13]
2. Feature Extraction
2.1 Task  2.2 Indexing  2.3 Weighting Model  2.4 Dimensionality Reduction
[Refs 7, 11, 14, 18, 20, 22]
2.1 Feature Extraction: Task (1)
- Task: extract a good subset of words to represent documents
[Figure: a document collection flows through feature extraction]
[Refs 7, 11, 14, 18, 20, 22]
2.1 Feature Extraction: Task (2)
While more and more textual information is available online, effective retrieval is difficult without good indexing of text content.
Example: the 16-word sentence above reduces, after feature extraction, to 5 feature terms with their frequencies:
  text (2), information (1), online (1), retrieval (1), index (1)
[Refs 7, 11, 14, 18, 20, 22]
2.2 Feature Extraction: Indexing (1)
- Identification of all unique words
- Removal of stop words
  - non-informative words, e.g. the, and, when, more
- Word stemming
  - removal of suffixes to generate word stems
  - groups word variants, increasing the relevance
  - e.g. walker, walking → walk
- Term weighting
  - naive terms
  - importance of each term in the document
[Refs 7, 11, 14, 18, 20, 22]
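As an illustration of this pipeline, here is a minimal Python sketch; the toy stop list and the crude suffix-stripping rule are stand-ins for a real stop list and a real stemmer such as Porter's:

    import re
    from collections import Counter

    STOP_WORDS = {"the", "and", "when", "more", "was", "is", "of", "a", "to"}  # toy stop list

    def stem(word):
        """Crude suffix stripping (a stand-in for a real stemmer)."""
        for suffix in ("ing", "ers", "er", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word

    def index(document):
        """Identify words, remove stop words, stem, and count term frequencies."""
        words = re.findall(r"[a-z]+", document.lower())
        terms = [stem(w) for w in words if w not in STOP_WORDS]
        return Counter(terms)

    print(index("The walker was walking when more walkers arrived."))
    # Counter({'walk': 3, 'arrived': 1})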
2.2 Feature Extraction: Indexing (2)
- Document representation: the vector space model
  d = (w1, w2, ..., wt) ∈ R^t
  where wi is the weight of the ith term in document d.
[Refs 7, 11, 14, 18, 20, 22]
2.3 Feature Extraction: Weighting Model (1)
- tf: Term Frequency weighting
  wij = Freqij
  where Freqij = the number of times the jth term occurs in document Di.
- Drawback: does not reflect a term's importance for discriminating between documents.
[Refs 11, 22]
2.3 Feature Extraction: Weighting Model (2)
- tf·idf: Inverse Document Frequency weighting
  wij = Freqij × log(N / DocFreqj)
  where
  N = the number of documents in the training document collection
  DocFreqj = the number of documents in which the jth term occurs
- Advantage: reflects a term's importance for discriminating between documents.
- Assumption: terms with low DocFreq are better discriminators than ones with high DocFreq in the document collection.
Example weights:
        A    B    K    O    Q    R    S    T    W    X
  D1    0    0    0    0.3  0    0    0    0    0.3  0
  D2    0    0    0.3  0    0    0    0    0    0    0
[Refs 11, 13, 22]
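The tf·idf weights above are easy to compute directly; a minimal Python sketch over a toy two-document collection, using raw counts as Freqij:

    import math
    from collections import Counter

    def tfidf(docs):
        """w_ij = Freq_ij * log(N / DocFreq_j)."""
        N = len(docs)
        doc_freq = Counter()              # DocFreq_j: number of docs containing term j
        for doc in docs:
            doc_freq.update(set(doc))
        weights = []
        for doc in docs:
            freq = Counter(doc)           # Freq_ij: raw term frequency
            weights.append({t: f * math.log(N / doc_freq[t]) for t, f in freq.items()})
        return weights

    docs = [["text", "text", "information", "online"],
            ["text", "retrieval", "index"]]
    for w in tfidf(docs):
        print(w)
    # 'text' occurs in every document, so log(N / DocFreq) = log(2/2) = 0:
    # it gets weight 0 and cannot discriminate between D1 and D2.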
2.3 Feature Extraction: Weighting Model (3)
- Entropy weighting: one standard form, following Dumais [22], is
  wij = log(Freqij + 1) × (1 + ei)
  where
  ei = (1 / log N) × Σj pij log(pij),  pij = Freqij / Σj Freqij
  is the average entropy of the ith term; ei = -1 if the word occurs once in every document, and ei = 0 if the word occurs in only one document.
[Refs 11, 13, 22]
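A short sketch of the average-entropy term ei; the two boundary values stated above fall out directly:

    import math

    def average_entropy(freqs):
        """e_i = (1 / log N) * sum_j p_ij * log(p_ij), p_ij = Freq_ij / sum_j Freq_ij."""
        N = len(freqs)
        total = sum(freqs)
        p = [f / total for f in freqs if f > 0]
        return sum(pi * math.log(pi) for pi in p) / math.log(N)

    print(average_entropy([1, 1, 1, 1]))  # -1.0: the word occurs once in every document
    print(average_entropy([5, 0, 0, 0]))  #  0.0: the word occurs in only one document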
2.4 Feature Extraction: Dimension Reduction
- 2.4.1 Document Frequency Thresholding
- 2.4.2 χ²-statistic
- 2.4.3 Latent Semantic Indexing
[Refs 11, 20, 21, 27]
2.4.1 Dimension Reduction: DocFreq Thresholding
- Document Frequency Thresholding
  1. Start from the naive terms
  2. Calculate DocFreq(w) for each word w
  3. Set a threshold θ
  4. Remove all words with DocFreq(w) < θ
  5. The remaining words are the feature terms
[Refs 11, 20, 21, 27]
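A minimal sketch of DocFreq thresholding, modeling each document as the set of its unique words:

    from collections import Counter

    def df_threshold(docs, theta):
        """Keep only the words whose document frequency reaches the threshold."""
        doc_freq = Counter()
        for doc in docs:
            doc_freq.update(doc)                     # each doc: a set of unique words
        return {w for w, df in doc_freq.items() if df >= theta}

    docs = [{"text", "mining"}, {"text", "web"}, {"text", "mining", "index"}]
    print(df_threshold(docs, theta=2))               # {'text', 'mining'}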
2.4.2 Dimension Reduction: χ²-statistic
- Assumption: a pre-defined category set C = {c1, ..., cl} for a training collection D
- Goal: estimate the independence between a term and a category
Procedure:
  1. Start from the naive terms
  2. Compute the categorical score χ²(w, cj) of each term w for each category cj:
     χ²(w, cj) = N (AD - CB)² / ((A+C)(B+D)(A+B)(C+D))
     where
     A = |{d : d ∈ cj and w ∈ d}|    B = |{d : d ∉ cj and w ∈ d}|
     C = |{d : d ∈ cj and w ∉ d}|    D = |{d : d ∉ cj and w ∉ d}|
     N = |{d : d ∈ D}|
  3. Set a threshold θ
  4. Remove all words with χ²max(w) = maxj χ²(w, cj) < θ
  5. The remaining words are the FEATURE TERMS
[Refs 11, 20, 21, 27]
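A direct translation of the χ² score into Python, with the contingency counts A, B, C, D as defined above and a hypothetical toy labeled collection:

    def chi2(docs, labels, w, c):
        """chi^2(w, c) = N * (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
        A = sum(1 for d, l in zip(docs, labels) if l == c and w in d)
        B = sum(1 for d, l in zip(docs, labels) if l != c and w in d)
        C = sum(1 for d, l in zip(docs, labels) if l == c and w not in d)
        D = sum(1 for d, l in zip(docs, labels) if l != c and w not in d)
        N = len(docs)
        denom = (A + C) * (B + D) * (A + B) * (C + D)
        return N * (A * D - C * B) ** 2 / denom if denom else 0.0

    def chi2_max(docs, labels, w):
        """Score a word by its maximum chi^2 over all categories."""
        return max(chi2(docs, labels, w, c) for c in set(labels))

    docs = [{"stock", "rise"}, {"stock", "fall"}, {"match", "goal"}, {"goal", "win"}]
    labels = ["finance", "finance", "sports", "sports"]
    print(chi2_max(docs, labels, "stock"))  # 4.0: perfectly correlated with 'finance'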
2.4.3 Dimension Reduction: LSI (1)
- LSI: Latent Semantic Indexing
1. SVD Model: Singular Value Decomposition of the term-document matrix X

   X = T0 · S0 · D0'
   (t,d) = (t,m) × (m,m) × (m,d)

   - X: the t × d term-document matrix (rows: terms; columns: documents)
   - T0: t × m orthogonal matrix; its columns are eigenvectors of X·X'
   - D0: d × m orthogonal matrix; its columns are eigenvectors of X'·X
   - S0: m × m diagonal matrix of singular values (the square roots of the eigenvalues), in decreasing order of importance
   - m: the rank of matrix X; m ≤ min(t, d)
[Refs 11, 20, 21, 27]
2.4.3 Dimension Reduction: LSI (2)
2. Approximate Model

   Select k < m and keep only the k largest singular values:

   X ≈ appr(X) = T · S · D'
   (t,d) ≈ (t,k) × (k,k) × (k,d)

- Each column of matrix appr(X) approximately represents one document.
- Given a document vector xi (a column of appr(X), written as a row vector) and its corresponding row di of D, the following holds:

   xi = di · S · T'   and   di = xi · T · S⁻¹
[Refs 11, 20, 21, 27]
2.4.3 Dimension Reduction: LSI (3)
3. Document Representation Model

   A new document d over the t naive terms, d = (w1, w2, ..., wt) ∈ R^t, is mapped into the k-dimensional latent space:

   appr(d) = d · T · S⁻¹ ∈ R^k
   (1,k) = (1,t) × (t,k) × (k,k)

- There is no good method to determine k; it depends on the application domain. Some experiments suggest k between 100 and 300.
[Refs 11, 20, 21, 27]
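The whole LSI construction fits in a few lines of NumPy; a sketch with a toy 4-term × 3-document matrix, including the fold-in of a new document via the appr(d) formula above:

    import numpy as np

    # Toy term-document matrix X (t = 4 terms, d = 3 documents).
    X = np.array([[2., 0., 1.],
                  [1., 1., 0.],
                  [0., 2., 1.],
                  [0., 1., 2.]])

    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)   # X = T0 . S0 . D0'

    k = 2                                   # keep the k largest singular values
    T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
    appr_X = T @ S @ D.T                    # appr(X) = T . S . D'

    # Map a new document d = (w1, ..., wt) into the latent space:
    d = np.array([1., 0., 1., 0.])
    appr_d = d @ T @ np.linalg.inv(S)       # appr(d) = d . T . S^-1
    print(appr_d.shape)                     # (2,): from R^t down to R^k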
3. Document Categorization
3.1 Task  3.2 Architecture  3.3 Categorization Classifiers  3.4 Application
[Refs 1, 2, 4, 5, 11, 18, 23, 24]
3.1 Categorization: Task
- Task: assignment of one or more predefined categories (topics, themes) to a document.
3.2 Categorization: Architecture
[Figure: training documents and a new document d pass through preprocessing, weighting, and feature selection; a classifier built from the predefined categories assigns the category(ies) to d]
3.3 Categorization: Classifiers
- 3.3.1 Centroid-Based Classifier
- 3.3.2 k-Nearest Neighbor Classifier
- 3.3.3 Naive Bayes Classifier
3.3.1 Model: Centroid-Based Classifier (1)
3.3.1 Model: Centroid-Based Classifier (2)
- If the angle between d1 and d3 is larger than the angle between d1 and d2, then cos(d1, d3) < cos(d1, d2): d2 is closer to d1 than d3 is.
[Figure: three document vectors d1, d2, d3 and the angles between them]
- The cosine-based similarity model can reflect the relations between features.
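A minimal sketch of a centroid-based classifier built on this cosine similarity; each category is summarized by the mean vector of its training documents (the toy collection is hypothetical):

    import numpy as np

    def cos(a, b):
        """Cosine similarity: a larger angle means a smaller cosine."""
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    def centroid_classify(train, d):
        """Assign d to the category whose centroid is most similar to d."""
        centroids = {c: docs.mean(axis=0) for c, docs in train.items()}
        return max(centroids, key=lambda c: cos(centroids[c], d))

    train = {"finance": np.array([[3., 1., 0.], [2., 0., 0.]]),
             "sports":  np.array([[0., 1., 3.], [0., 0., 2.]])}
    print(centroid_classify(train, np.array([1., 0., 0.])))  # finance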
3.3.2 Model: k-Nearest Neighbor Classifier
1. Input: new document d
2. Training collection D = {d1, d2, ..., dn}
3. Predefined categories C = {c1, c2, ..., cl}
4. // Compute similarities
   for (di ∈ D) Simil(d, di) = cos(d, di)
5. // Select the k nearest neighbors
   Construct the k-document subset Dk so that
   Simil(d, di) < min{Simil(d, doc) : doc ∈ Dk}, ∀di ∈ D - Dk
6. // Compute a score for each category
   for (ci ∈ C) {
     score(ci) = 0
     for (doc ∈ Dk) score(ci) += ((doc ∈ ci) ? 1 : 0)
   }
7. // Output: assign to d the category c with the highest score
   score(c) ≥ score(ci), ∀ci ∈ C - {c}
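The same algorithm in NumPy; a minimal sketch with cosine similarity and one vote per neighbor:

    import numpy as np

    def knn_classify(docs, cats, d, k):
        """Score each category by its members among the k nearest neighbors."""
        sims = docs @ d / (np.linalg.norm(docs, axis=1) * np.linalg.norm(d))
        nearest = np.argsort(sims)[-k:]          # the k most similar documents
        scores = {}
        for i in nearest:                        # score(ci) += 1 per neighbor in ci
            scores[cats[i]] = scores.get(cats[i], 0) + 1
        return max(scores, key=scores.get)       # category with the highest score

    docs = np.array([[3., 1., 0.], [2., 0., 1.], [0., 1., 3.], [0., 2., 2.]])
    cats = ["finance", "finance", "sports", "sports"]
    print(knn_classify(docs, cats, np.array([1., 0., 0.]), k=3))  # finance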
3.3.3 Model: Naive Bayes Classifier
Basic assumption: all terms are distributed in documents independently of one another.
1. Input: new document d
2. Predefined categories C = {c1, c2, ..., cl}
3. // Compute the probability that d belongs to each class ci ∈ C
   for (ci ∈ C)
     // note that the terms wj of the document are independent of each other
     P(ci | d) ∝ P(ci) × Πwj∈d P(wj | ci)
4. // Output: assign to d the category c with the highest probability
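A minimal multinomial Naive Bayes sketch of steps 3-4; the add-one smoothing is an implementation choice not on the slide:

    import math
    from collections import Counter, defaultdict

    def train_nb(docs, cats):
        """Collect the class counts and per-class word counts."""
        prior, word_counts = Counter(cats), defaultdict(Counter)
        for doc, c in zip(docs, cats):
            word_counts[c].update(doc)
        vocab = {w for doc in docs for w in doc}
        return prior, word_counts, vocab

    def classify_nb(model, d):
        """argmax_c log P(c) + sum_w log P(w|c), terms assumed independent."""
        prior, word_counts, vocab = model
        def log_prob(c):
            total = sum(word_counts[c].values()) + len(vocab)   # add-one smoothing
            return math.log(prior[c]) + sum(
                math.log((word_counts[c][w] + 1) / total) for w in d if w in vocab)
        return max(prior, key=log_prob)

    docs = [["stock", "rise"], ["stock", "fall"], ["match", "goal"], ["goal", "win"]]
    model = train_nb(docs, ["finance", "finance", "sports", "sports"])
    print(classify_nb(model, ["stock", "rise", "goal"]))  # finance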
3.4 Categorization Application: Trend Analysis (EAnalyst System)
Goal: predict the trend of a stock price based on news stories.
- Learning process: sampled textual data and retrieved documents are aligned with trends (slope, confidence) found by piecewise linear fitting and trend clustering; the aligned pairs train a Bayes classifier.
- Categorization: new news documents go through the Bayes classifier, which outputs the new trend (slope, confidence).
[Ref 28]
4. Document Clustering
4.1 Task  4.2 Algorithms  4.3 Application
[Refs 5, 7, 8, 9, 10, 15, 16, 29]
4.1 Document Clustering: Task
- Task: group all documents so that the documents in the same group are more similar to each other than to ones in other groups.
- Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.
[Refs 5, 7, 8, 9, 10, 15, 16, 29]
4.2 Document Clustering: Algorithms
- 4.2.1 k-means
- 4.2.2 Hierarchic Agglomerative Clustering (HAC)
- 4.2.3 Association Rule Hypergraph Partitioning (ARHP)
[Refs 5, 7, 8, 9, 10, 15, 16, 29]
4.2.1 Document Clustering: k-means
- k-means: distance-based flat clustering

0. Input: D = {d1, d2, ..., dn}; k = the number of clusters
1. Select k document vectors as the initial centroids of the k clusters
2. Repeat
3.   Select one vector d among the remaining documents
4.   Compute the similarities between d and the k centroids
5.   Put d in the closest cluster and recompute that cluster's centroid
6. Until the centroids don't change
7. Output: k clusters of documents
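A compact NumPy sketch of the procedure above (batch-style updates and Euclidean distance are one common variant; it assumes no cluster goes empty):

    import numpy as np

    def kmeans(docs, k, iters=100):
        """Distance-based flat clustering; returns one cluster label per document."""
        rng = np.random.default_rng(0)
        centroids = docs[rng.choice(len(docs), k, replace=False)]  # initial centroids
        for _ in range(iters):
            dists = np.linalg.norm(docs[:, None] - centroids, axis=2)
            labels = np.argmin(dists, axis=1)      # put each doc in the closest cluster
            new = np.array([docs[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new, centroids):        # until the centroids don't change
                break
            centroids = new
        return labels

    docs = np.array([[1., 0.], [0.9, 0.1], [0., 1.], [0.1, 0.9]])
    print(kmeans(docs, k=2))   # e.g. [0 0 1 1]: the two near-duplicate pairs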
- Advantages
  - linear time complexity
  - works relatively well in low-dimensional spaces
- Drawbacks
  - distance computation is expensive in high-dimensional spaces
  - the centroid vector may not summarize the cluster's documents well
  - the choice of the initial k clusters affects the quality of the result
[Refs 5, 7, 8, 9, 10, 15, 16, 29]
4.2.2 Document Clustering: HAC
- Hierarchic agglomerative clustering (HAC): distance-based hierarchic clustering

0. Input: D = {d1, d2, ..., dn}
1. Calculate the similarity matrix SIM[i, j]
2. Repeat
3.   Merge the most similar two clusters, K and L, to form a new cluster KL
4.   Compute the similarities between KL and each of the remaining clusters and update SIM[i, j]
5. Until there is a single cluster (or a specified number of clusters)
6. Output: a dendrogram of clusters
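A minimal sketch of HAC with cosine similarity; single-link merging (the similarity of two clusters is that of their closest members) is one common choice, and the dendrogram bookkeeping is omitted for brevity:

    import numpy as np

    def hac(docs, n_clusters=1):
        """Merge the most similar pair of clusters until n_clusters remain."""
        norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
        sim = norm @ norm.T                          # similarity matrix SIM[i, j]
        clusters = [[i] for i in range(len(docs))]
        while len(clusters) > n_clusters:
            best, pair = -np.inf, None
            for a in range(len(clusters)):           # find the most similar two clusters
                for b in range(a + 1, len(clusters)):
                    s = max(sim[i, j] for i in clusters[a] for j in clusters[b])
                    if s > best:
                        best, pair = s, (a, b)
            a, b = pair
            clusters[a] += clusters.pop(b)           # merge K and L into KL
        return clusters

    docs = np.array([[1., 0.], [0.9, 0.2], [0., 1.], [0.1, 0.8]])
    print(hac(docs, n_clusters=2))                   # [[0, 1], [2, 3]]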
- Advantages
  - produces better-quality clusters
  - works relatively well in low-dimensional spaces
- Drawbacks
  - distance computation is expensive in high-dimensional spaces
  - quadratic time complexity
[Refs 5, 7, 8, 9, 10, 15, 16, 29]
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (1)
- A hypergraph H = (V, E): V is a set of vertices; E is a set of hyperedges.
[Refs 30-35]
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (2)
- Transactional view of documents and features
  - item = document
  - transaction = feature

Transactional database of documents and features (transactions w1..wt over items Doc1..Docn):

          Doc1  Doc2  Doc3  ...  Docn
    w1      5     5     2   ...    1
    w2      2     4     3   ...    5
    w3      0     0     0   ...    1
    ...    ...   ...   ...  ...   ...
    wt      6     0     0   ...    3
[Refs 30-35]
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (3)
Pipeline: document-feature transaction database → discovering association rules (Apriori algorithm) → constructing the hypergraph → association rule hypergraph → partitioning the hypergraph (hypergraph partitioning algorithm) → k partitions
- Hyperedges: frequent item sets
- Hyperedge weight: the average of the confidences of all rules over the item set
- Assumption: documents occurring in the same frequent item set are more similar
[Refs 30-35]
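A sketch of the rule-discovery stage only, using a bare-bones Apriori; the min_support value and the toy transactions are illustrative, and the hypergraph construction and partitioning stages are left to the cited tools:

    from itertools import combinations

    def apriori(transactions, min_support):
        """Level-wise search: join frequent k-item sets, keep those with enough support."""
        def support(itemset):
            return sum(1 for t in transactions if itemset <= t)
        items = {i for t in transactions for i in t}
        current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
        frequent = []
        while current:
            frequent += current
            candidates = {a | b for a, b in combinations(current, 2)
                          if len(a | b) == len(a) + 1}
            current = [c for c in candidates if support(c) >= min_support]
        return frequent

    # Each transaction is one feature; its items are the documents containing it.
    transactions = [{"Doc1", "Doc2"}, {"Doc1", "Doc2", "Doc3"}, {"Doc3", "Doc4"}]
    print(apriori(transactions, min_support=2))
    # includes frozenset({'Doc1', 'Doc2'}): a hyperedge connecting Doc1 and Doc2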
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (4)
- Advantages
  - No calculation of cluster means is required.
  - Linear time complexity.
  - The quality of the clusters is not affected by the dimensionality of the space.
  - Performs much better than traditional clustering in high-dimensional spaces, in terms of both cluster quality and runtime.
[Refs 30-35]
4.3 Document Clustering: Applications
- Summarization of documents
- Navigation of large document collections
- Organization of Web search results
[Refs 10, 15-17]
5. Product: Intelligent Miner for Text (IMT) (1)
[Refs 5, 36]
5. Product: Intelligent Miner for Text (IMT) (2)
- 1. Feature extraction tools
  - 1.1 Information extraction
    - Extracts linguistic items that represent the document contents
  - 1.2 Feature extraction
    - Assigns different categories to the vocabulary in documents
    - Measures their importance to the document content
  - 1.3 Name extraction
    - Locates names in text
    - Determines what type of entity a name refers to
  - 1.4 Term extraction
    - Discovers terms in text, including multiword technical terms
    - Recognizes variants of the same concept
  - 1.5 Abbreviation recognition
    - Finds abbreviations and matches them with their full forms
  - 1.6 Relation extraction
5. Product: Intelligent Miner for Text (IMT) (3)
- Feature extraction demo
5. Product: Intelligent Miner for Text (IMT) (4)
- 2. Clustering tools
  - 2.1 Applications
    - Provides an overview of the content of a large document collection
    - Identifies hidden structures between groups of objects
    - Improves the browsing process to find similar or related information
    - Finds outstanding documents within a collection
  - 2.2 Hierarchical clustering
    - Clusters are organized in a clustering tree; related clusters occur in the same branch of the tree.
  - 2.3 Binary relational clustering
    - Relationship of topics: document → cluster → topic.
    - NB: a preprocessing step for the categorization tool
5. Product: Intelligent Miner for Text (IMT) (5)
- Clustering demo: navigation of a document collection
5. Product: Intelligent Miner for Text (IMT) (6)
- 3. Summarization tools
  - 3.1 Steps
    - rank the relevancy of each sentence to the document
    - select the most relevant sentences
    - produce a summary of the document, with its length set by the user
  - 3.2 Applications
    - Judge the relevancy of a full text
      - Easily determine whether the document is relevant to read
    - Enrich search results
      - The results of a query to a search engine can be enriched with a short summary of each document
    - Get a fast overview over document collections
      - read the summary instead of the full document
5. Product: Intelligent Miner for Text (IMT) (7)
- 4. Categorization tool
- Applications
  - Organize intranet documents
  - Assign documents to folders
  - Dispatch requests
  - Forward news to subscribers
[Figure: a news article passes through the categorizer into one of the folders sports, cultures, health, politics, economics, vacations; a router forwards health news to a subscriber whose profile says "I like health news"]
6. References (1)
Bibliography
[1] Marti A. Hearst, "Untangling Text Data Mining", Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26, 1999 (invited paper). http://www.sims.berkeley.edu/hearst
[2] Feldman and Dagan, "KDT - Knowledge Discovery in Texts", in Proceedings of the First Annual Conference on Knowledge Discovery and Data Mining (KDD), Montreal, 1995.
[3] IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, Stockholm, Sweden, August 2, 1999. http://www.cs.biu.ac.il/~feldman/ijcai-workshop%20cfp.html
[4] Taeho C. Jo, "Text Categorization Considering Categorical Weights and Substantial Weights of Informative Keywords", 1999. (http://www.sccs.chukyo-u.ac.jp/ICCS/olp/p3-13/p3-13.htm)
[5] IBM Technology white paper, "Text Mining: Turning Information Into Knowledge", February 17, 1998, editor Daniel Tkach, IBM Software Solutions. (http://allen.comm.virginia.edu/jtl5t/whiteweb.html)
[6] http://allen.comm.virginia.edu/jtl5t/index.htm
[7] G. Salton et al., Introduction to Modern Information Retrieval, McGraw-Hill Book Company, 1983.
[8] Michael Steinbach, George Karypis and Vipin Kumar, "A Comparison of Document Clustering Techniques", KDD-2000.
[9] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey, "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections", SIGIR '92, pages 318-329.
[10] Oren Zamir, Oren Etzioni, Omid Madani, Richard M. Karp, "Fast and Intuitive Clustering of Web Documents", KDD '97, pages 287-290, 1997.
6. References (2)
Bibliography
[11] Kjersti Aas et al., "Text Categorization: A Survey", 1999.
[12] Text Mining white paper. http://textmining.krdl.org.sg/whiteppr.html
[13] Gartner Group, Text Mining white paper, June 2000. http://www.xanalys.com/intelligence_tools/products/text_mining_text.html
[14] Yiming Yang and Jan O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization", in the 14th Int. Conf. on Machine Learning, pp. 412-420, 1997.
[15] C. J. van Rijsbergen, Information Retrieval, Butterworths, London, 1979.
[16] Chris Buckley and Alan F. Lewit, "Optimization of Inverted Vector Searches", SIGIR '85, pp. 97-110, 1985.
[17] Daphne Koller and Mehran Sahami, "Hierarchically Classifying Documents Using Very Few Words", Proceedings of the 14th International Conference on Machine Learning, Nashville, Tennessee, July 1997, pp. 170-178.
[18] T. Joachims, "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization", in Int. Conf. on Machine Learning, 1997.
[19] K. Lang, "NewsWeeder: Learning to Filter Netnews", International Conference on Machine Learning, 1995. http://anther.learning.cs.cmu.edu/ml95.ps
[20] S.P. Harter, "A Probabilistic Approach to Automatic Keyword Indexing", Journal of the American Society for Information Science, July-August 1975.
6. References (3)
Bibliography
[21] Scott Deerwester et al., "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, Vol. 41, No. 6, pp. 391-407, 1990.
[22] S.T. Dumais, "Improving the Retrieval of Information from External Sources", Behavior Research Methods, Instruments and Computers, Vol. 23, No. 2, pp. 229-236, 1991.
[23] T. Yavuz and H. A. Guvenir, "Application of k-Nearest Neighbor on Feature Projections Classifier to Text Categorization", 1998.
[24] Eui-Hong Han and George Karypis, "Centroid-Based Document Classification: Analysis & Experimental Results", in European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), 2000.
[25] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[26] Ji He, Ah-Hwee Tan and Chew-Lim Tan, "A Comparative Study on Chinese Text Categorization Methods", PRICAI 2000 Workshop on Text and Web Mining, Melbourne, pp. 24-35, August 2000. (http://textmining.krdl.org.sg/PRICAI2000/text-categorization.pdf)
[27] Erik Wiener, Jan O. Pedersen and Andreas S. Weigend, "A Neural Network Approach to Topic Spotting", in Proc. 4th Annual Symposium on Document Analysis and Information Retrieval, pp. 22-34, 1993.
[28] V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen and J. Allan, in the Proceedings of the KDD 2000 Conference, pp. 37-44.
[29] Peter Willett, "Recent Trends in Hierarchic Document Clustering: A Critical Review", Information Processing & Management, Vol. 24, No. 5, pp. 577-597, 1988.
6. References (4)
Bibliography
[30] J. Larocca Neto, A.D. Santos, C.A.A. Kaestner, A.A. Freitas, "Document Clustering and Text Summarization", Proc. 4th Int. Conf. on Practical Applications of Knowledge Discovery and Data Mining (PADD-2000), pp. 41-55, London: The Practical Application Company, 2000.
[31] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar and Bamshad Mobasher, "Clustering Based On Association Rule Hypergraphs", SIGMOD '97 Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.
[32] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong (Sam) Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore, "Partitioning-Based Clustering for Web Document Categorization", Decision Support Systems Journal, Vol. 27, No. 3, pp. 329-341, 1999.
[34] George Karypis, Rajat Aggarwal, Vipin Kumar and Shashi Shekhar, "Multilevel Hypergraph Partitioning: Applications in VLSI Domain", in Proceedings of the Design and Automation Conference '97.
[35] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar and Bamshad Mobasher, "Clustering in a High-Dimensional Space Using Hypergraph Models", Technical Report 97-019. http://www-users.cs.umn.edu/~han/
[36] IBM white paper, "Information Mining with the IBM Intelligent Miner Family", Daniel S. Tkach, February 1998.