1
Text Mining
  • Huaizhong KOU
  • PhD student of Georges Gardarin
  • PRiSM Laboratory

2
0. Content
  • 1. Introduction
  • What happens
  • What is Text Mining
  • Text Mining vs Data Mining
  • Applications
  • 2. Feature Extraction
  • Task
  • Indexing
  • Dimensionality Reduction
  • 3. Document Categorization
  • Task
  • Architecture
  • Categorization Classifiers
  • Application: Trend Analysis
  • 4. Document Clustering
  • Task
  • Algorithms
  • Application
  • 5. Product
  • 6. References

3
1. Text Mining: Introduction
1.1 What Happens  1.2 What's Text Mining  1.3 Text Mining vs Data Mining  1.4 Applications
4
1.1 Introduction: What Happens (1)
  • Information explosion
  • 80% of information is stored in text
  • documents: journals, web pages, emails...
  • Difficult to extract specific information with current technologies

[Figure: users facing the mass of documents on the Internet]
5
1.1 Introduction: What Happens (2)
  • It is necessary to automatically analyze, organize, summarize...

[Diagram: documents → Text Mining → Knowledge, e.g. "The value of XML companies' shares will rise"]
6
1.2 Introduction: What's Text Mining (1)
  • Text Mining: the procedure of synthesizing information by analyzing the relations, patterns, and rules among textual data (semi-structured or unstructured text).
  • Techniques
  • data mining
  • machine learning
  • information retrieval
  • statistics
  • natural-language understanding
  • case-based reasoning

Ref [1-4]
7
1.2 Introduction: What's Text Mining (2)

[Diagram: the learning phase and the working phase of a text mining system]
8
1.3 Introduction: Text Mining vs Data Mining

[Table comparing text mining and data mining]
Ref [12][13]
9
1.4 Introduction: Applications
  • The potential applications are countless.
  • Customer profile analysis
  • Trend analysis
  • Information filtering and routing
  • Event tracking
  • News story classification
  • Web search
  • ...

Ref [12][13]
10
2. Feature Extraction
2.1 Task 2.2 Indexing 2.3 Weighting Model 2.4
Dimensionality Reduction
Ref [7][11][14][18][20][22]
11
2.1 Feature Extraction: Task (1)
Task: extract a good subset of words to represent documents

[Diagram: document collection → Feature Extraction → feature terms]
Ref [7][11][14][18][20][22]
12
2.1 Feature Extraction: Task (2)

"While more and more textual information is available online, effective retrieval is difficult without good indexing of text content."  (16 words)

  ↓ Feature Extraction

5 feature terms, with frequencies: text (2), information (1), online (1), retrieval (1), index (1)
Ref [7][11][14][18][20][22]
13
2.2 Feature Extraction: Indexing (1)

Identification of all unique words
  ↓
Removal of stop words
  • non-informative words
  • e.g. the, and, when, more
  ↓
Word stemming: removal of suffixes to generate word stems
  • groups word variants, increasing relevance
  • e.g. walker, walking → walk
  ↓
Term weighting: importance of each naive term in the document
Ref [7][11][14][18][20][22]
14
2.2 Feature Extraction: Indexing (2)
  • Document representation: vector space model

d = (w_1, w_2, ..., w_t) ∈ R^t

w_i is the weight of the ith term in document d.
Ref [7][11][14][18][20][22]
15
2.3 Feature Extraction: Weighting Model (1)
  • tf - Term Frequency weighting
  • w_ij = Freq_ij
  • Freq_ij: the number of times the jth term occurs in document D_i
  • Drawback: does not reflect a term's importance for document discrimination
  • Ex.: [example with documents D1 and D2]
Ref [11][22]
16
2.3 Feature Extraction: Weighting Model (2)
  • tf·idf - Inverse Document Frequency weighting
  • w_ij = Freq_ij · log(N / DocFreq_j)
  • N: the number of documents in the training document collection
  • DocFreq_j: the number of documents in which the jth term occurs
  • Advantage: reflects a term's importance for document discrimination
  • Assumption: terms with low DocFreq are better discriminators than ones with high DocFreq in the document collection
  • Ex.:

          A     B     K     O     Q     R     S     T     W     X
    D1    0     0     0     0.3   0     0     0     0     0.3   0
    D2    0     0     0.3   0     0     0     0     0     0     0
Ref [13]
Ref [11][22]
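A minimal Python sketch of the tf·idf weighting above (the toy `docs` list and the pre-tokenized documents are illustrative assumptions, not from the slides):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """w_ij = Freq_ij * log(N / DocFreq_j) for a list of tokenized documents."""
    N = len(docs)
    # DocFreq_j: number of documents in which term j occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)  # Freq_ij: occurrences of term j in document i
        weights.append({t: f * math.log(N / doc_freq[t]) for t, f in freq.items()})
    return weights

docs = [["text", "mining", "text"], ["data", "mining"]]
print(tfidf_weights(docs))  # "mining" gets weight 0: it occurs in every document
```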
17
2.3 Feature Extraction: Weighting Model (3)
  • Entropy weighting (as defined in [22]):

w_ij = log(Freq_ij + 1) · (1 + H_j)

where H_j = (1 / log N) · Σ_i (Freq_ij / n_j) · log(Freq_ij / n_j) is the average entropy of the jth term (n_j is its total frequency in the collection): H_j = -1 if the word occurs once in every document, and H_j = 0 if the word occurs in only one document.
Ref [13]
Ref [11][22]
18
2.4 Feature Extraction: Dimension Reduction
  • 2.4.1 Document Frequency Thresholding
  • 2.4.2 χ²-statistic
  • 2.4.3 Latent Semantic Indexing

Ref [11][20][21][27]
19
2.4.1 Dimension Reduction: DocFreq Thresholding
  • Document Frequency Thresholding

Naive terms
  ↓ calculate DocFreq(w) for each word w
  ↓ set a threshold θ
  ↓ remove all words with DocFreq < θ
Feature terms
Ref [11][20][21][27]
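A short sketch of the thresholding step, assuming tokenized documents and an illustrative threshold name `theta`:

```python
from collections import Counter

def df_threshold(docs, theta):
    """Keep only words whose document frequency is at least theta;
    words with DocFreq < theta are removed, as in the pipeline above."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term for term, df in doc_freq.items() if df >= theta}
```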
20
2.4.2 Dimension Reduction: χ²-statistic
  • Assumption: a pre-defined category set for a training collection D
  • Goal: estimate the independence between a term and a category

Naive terms
  ↓ compute the term-category score
    χ²(w, c_j) = N · (AD - CB)² / ((A+C)(B+D)(A+B)(C+D))
    A = |{d | d ∈ c_j ∧ w ∈ d}|    B = |{d | d ∉ c_j ∧ w ∈ d}|
    C = |{d | d ∈ c_j ∧ w ∉ d}|    D = |{d | d ∉ c_j ∧ w ∉ d}|    N = |{d | d ∈ D}|
  ↓ set a threshold θ
  ↓ remove all words with χ²max(w) < θ
Feature terms
Ref [11][20][21][27]
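A sketch of the term-category score using the slide's contingency counts A, B, C, D; the χ² formula is the standard one used for feature selection (cf. [14]):

```python
def chi_square(w, cj, docs, labels):
    """Chi-square score of term w for category cj.
    A: docs of cj containing w;      B: docs outside cj containing w;
    C: docs of cj not containing w;  D: docs outside cj not containing w."""
    N = len(docs)
    A = sum(1 for d, c in zip(docs, labels) if c == cj and w in d)
    B = sum(1 for d, c in zip(docs, labels) if c != cj and w in d)
    C = sum(1 for d, c in zip(docs, labels) if c == cj and w not in d)
    D = N - A - B - C
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - C * B) ** 2 / denom

def chi_square_max(w, categories, docs, labels):
    """X2max(w): best score over all categories; words below the threshold are dropped."""
    return max(chi_square(w, cj, docs, labels) for cj in categories)
```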
21
2.4.3 Dimension Reduction: LSI (1)
  • LSI: Latent Semantic Indexing

1. SVD Model: Singular Value Decomposition of the t×d term-document matrix X (rows = terms, columns = documents):

X   =   T0 · S0 · D0'
(t,d)   (t,m) (m,m) (m,d)

T0: t×m orthogonal eigenvector matrix; its columns are eigenvectors of X·X'
D0: d×m orthogonal eigenvector matrix; its columns are eigenvectors of X'·X
S0: m×m diagonal matrix of singular values (square roots of eigenvalues) in decreasing order of importance
m: rank of matrix X, m ≤ min(t,d)
Ref [11][20][21][27]
22
2.4.3 Dimension Reduction: LSI (2)

2. Approximate Model: select k < m and keep only the k largest singular values:

X ≈ appr(X) = T · S · D'
(t,d)         (t,k) (k,k) (k,d)

  • Every column of appr(X) approximately represents one document.
  • Writing document i as the row vector x_i (the ith column of appr(X), transposed) and d_i as the ith row of D, the following holds:

x_i = d_i · S · T'        and        d_i = x_i · T · S^-1
Ref [11][20][21][27]
23
2.4.3 Dimension Reduction: LSI (3)

3. Document Representation Model

A new document d = (w_1, w_2, ..., w_t) ∈ R^t over the t naive terms is mapped into the k-dimensional LSI space:

appr(d) = d · T · S^-1 ∈ R^k
(1,k)     (1,t) (t,k) (k,k)

  • There is no good method to determine k; it depends on the application domain.
  • Some experiments suggest 100 ≤ k ≤ 300.

Ref [11][20][21][27]
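A numpy sketch of the SVD model and the document-folding formula above; the toy matrix X and the value of k are illustrative assumptions:

```python
import numpy as np

# X: t x d term-document matrix (rows = terms, columns = documents)
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
k = 2  # target dimensionality (100-300 is suggested above for real collections)

# Full SVD: X = T0 @ S0 @ D0', singular values in decreasing order
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

# Approximate model: keep the k largest singular values, appr(X) = T @ S @ D'
T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]
appr_X = T @ S @ Dt

# Fold a new document d (a 1 x t row of term weights) into the k-dim space:
# appr(d) = d @ T @ S^-1
d_new = np.array([1.0, 0.0, 1.0, 0.0])
appr_d = d_new @ T @ np.linalg.inv(S)
print(appr_d)  # k-dimensional representation of the new document
```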
24
3. Document Categorization
3.1 Task 3.2 Architecture 3.3 Categorization
Classifiers 3.4 Application
Ref [1][2][4][5][11][18][23][24]
25
3.1 Categorization: Task
  • Task: assignment of one or more predefined categories (topics, themes) to a document.
26
3.2 Categorization: Architecture

Training documents, new document d
  ↓ preprocessing → weighting → feature selection
Classifier ← predefined categories
  ↓
Category(ies) assigned to d
27
3.3 Categorization: Classifiers
3.3.1 Centroid-Based Classifier  3.3.2 k-Nearest Neighbor Classifier  3.3.3 Naive Bayes Classifier
28
3.3.1 Model: Centroid-Based Classifier (1)
  • For each category c_j, compute the centroid of its training documents: C_j = (1/|c_j|) · Σ_{d ∈ c_j} d
  • A new document d is assigned to the category whose centroid is most similar to d under the cosine measure.
29
3.3.1 Model: Centroid-Based Classifier (2)
  • α > β ⇒ cos(α) < cos(β)
  • so d2 is closer to d1 than d3 is

[Figure: vectors d1, d2 and d3, with angle β between d1 and d2 and angle α between d1 and d3]

  • The cosine-based similarity model can reflect the relations between features.
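A minimal sketch of cosine similarity and the centroid decision rule (the two-dimensional vectors and category names are made-up examples):

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two document vectors: smaller angle -> higher similarity."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def centroid_classify(d, centroids):
    """Assign d to the category whose centroid is most similar under the cosine measure."""
    return max(centroids, key=lambda c: cosine(d, centroids[c]))

train = {"sports": np.array([[1.0, 0.0], [0.9, 0.1]]),
         "health": np.array([[0.0, 1.0], [0.2, 0.8]])}
# Each centroid is the mean vector of the category's training documents
centroids = {c: vecs.mean(axis=0) for c, vecs in train.items()}
print(centroid_classify(np.array([0.8, 0.2]), centroids))  # -> "sports"
```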

30
3.3.2 Model: K-Nearest Neighbor Classifier

1. Input: new document d
2. Training collection D = {d1, d2, ..., dn}
3. Predefined categories C = {c1, c2, ..., cl}
4. // Compute similarities
   for (di ∈ D) Simil(d, di) = cos(d, di)
5. // Select the k nearest neighbors
   Construct the k-document subset Dk such that
   Simil(d, di) ≤ min{Simil(d, doc) | doc ∈ Dk} for all di ∈ D - Dk
6. // Compute the score of each category
   for (ci ∈ C) { score(ci) = 0;
     for (doc ∈ Dk) score(ci) += ((doc ∈ ci) ? 1 : 0) }
7. // Output: assign to d the category c with the highest score:
   score(c) ≥ score(ci) for all ci ∈ C - {c}
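The same algorithm as a runnable Python sketch (unweighted votes, one per neighbor, as in step 6):

```python
import numpy as np

def knn_classify(d, docs, labels, k):
    """Assign to d the category with the highest score among its k nearest neighbors."""
    sims = np.array([d @ di / (np.linalg.norm(d) * np.linalg.norm(di)) for di in docs])
    nearest = np.argsort(sims)[-k:]      # indices of the k most similar training documents
    scores = {}
    for i in nearest:                    # each neighbor votes for its own category
        scores[labels[i]] = scores.get(labels[i], 0) + 1
    return max(scores, key=scores.get)
```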
31
3.3.3 Model: Naive Bayes Classifier

Basic assumption: all terms are distributed in documents independently of one another.
1. Input: new document d
2. Predefined categories C = {c1, c2, ..., cl}
3. // Compute the probability that d is in each class ci ∈ C
   for (ci ∈ C) P(ci|d) ∝ P(ci) · Π_{wj ∈ d} P(wj|ci)
   // note that the terms wj in the document are treated as independent of each other
4. // Output: assign to d the category c with the highest probability:
   P(c|d) ≥ P(ci|d) for all ci ∈ C - {c}
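A sketch of the decision rule, using log-probabilities for numerical stability; the priors, term probabilities, and the small smoothing constant for unseen terms are assumed to come from a separate training step:

```python
import math

def nb_classify(doc, categories, prior, term_prob, eps=1e-9):
    """Pick argmax_c P(c) * prod_{w in doc} P(w|c).
    prior[c] = P(c); term_prob[c][w] = P(w|c), both estimated from training data."""
    def log_posterior(c):
        return math.log(prior[c]) + sum(math.log(term_prob[c].get(w, eps)) for w in doc)
    return max(categories, key=log_posterior)
```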
32
3.4 Categorization Application: Trend Analysis - the EAnalyst System

Goal: predict the trend in stock prices based on news stories.

Learning process:
  • Find trends: piecewise linear fitting of the price series → trends (slope, confidence) → trend clusters
  • Retrieve docs: sample textual data → sample documents
  • Align trends with documents and train a Bayes classifier

Categorization: news documents → Bayes classifier → new trend
Ref [28]
33
4. Document Clustering
4.1 Task 4.2 Algorithms 4.3 Application
Ref [5][7][8][9][10][15][16][29]
34
4.1 Document Clustering: Task
  • Task: group all documents so that the documents in the same group are more similar to each other than to documents in other groups.
  • Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.

Ref [5][7][8][9][10][15][16][29]
35
4.2 Document Clustering: Algorithms
  • 4.2.1 k-means
  • 4.2.2 Hierarchic Agglomerative Clustering (HAC)
  • 4.2.3 Association Rule Hypergraph Partitioning (ARHP)

Ref [5][7][8][9][10][15][16][29]
36
4.2.1 Document Clustering: k-means
  • k-means: distance-based flat clustering

0. Input: D = {d1, d2, ..., dn}; k = the number of clusters
1. Select k document vectors as the initial centroids of k clusters
2. Repeat
3.   Select one vector d from the remaining documents
4.   Compute similarities between d and the k centroids
5.   Put d in the closest cluster and recompute the centroid
6. Until the centroids don't change
7. Output: k clusters of documents

  • Advantages
  • linear time complexity
  • works relatively well in low-dimensional spaces
  • Drawbacks
  • distance computation in high-dimensional spaces
  • a centroid vector may not summarize the cluster's documents well
  • the initial k clusters affect the quality of the final clusters

Ref [5][7][8][9][10][15][16][29]
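A compact numpy sketch of the algorithm above; it uses Euclidean distance instead of a similarity measure and ignores the empty-cluster edge case:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Flat clustering of the row vectors of X (n x t) into k clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # step 1: k docs as initial centroids
    for _ in range(iters):
        # steps 3-5: assign each document to its closest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # step 6: stop when the centroids don't change
            break
        centroids = new
    return labels, centroids
```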
37
4.2.2 Document Clustering: HAC
  • Hierarchic agglomerative clustering (HAC): distance-based hierarchic clustering

0. Input: D = {d1, d2, ..., dn}
1. Calculate the similarity matrix SIM[i,j]
2. Repeat
3.   Merge the two most similar clusters, K and L, to form a new cluster KL
4.   Compute the similarity between KL and each remaining cluster and update SIM[i,j]
5. Until there is a single (or a specified number of) cluster(s)
6. Output: dendrogram of clusters

  • Advantages
  • produces better-quality clusters
  • works relatively well in low-dimensional spaces
  • Drawbacks
  • distance computation in high-dimensional spaces
  • quadratic time complexity

Ref [5][7][8][9][10][15][16][29]
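A short sketch using scipy's agglomerative clustering; average linkage over cosine distance is one reasonable choice here, not the slide's prescription:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy document vectors; a real collection would use tf-idf features
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

# Repeatedly merge the two most similar clusters; Z encodes the dendrogram
Z = linkage(X, method='average', metric='cosine')
print(fcluster(Z, t=2, criterion='maxclust'))  # cut the dendrogram into 2 clusters
```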
38
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (1)
  • Hypergraph

H = (V, E), where V is a set of vertices and E is a set of hyperedges (edges that can connect more than two vertices).
Ref [30-35]
39
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (2)
  • Transactional view of documents and features
  • item = document
  • transaction = feature

                     items
               Doc1  Doc2  Doc3  ...  Docn
         w1     5     5     2    ...   1
         w2     2     4     3    ...   5
         w3     0     0     0    ...   1
         ...
         wt     6     0     0    ...   3

(Transactional database of documents and features; the rows w1..wt are the transactions)
Ref [30-35]
40
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (3)
  • Clustering

Document-feature transaction database
  ↓ discover association rules (Apriori algorithm)
Association rule hypergraph
  ↓ partition the hypergraph (hypergraph partitioning algorithm)
k partitions

  • Hyperedges: frequent item sets
  • Hyperedge weight: average of the confidences of all rules
  • Assumption: documents occurring in the same frequent item set are more similar

Ref [30-35]
41
4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (4)
  • Advantages
  • No calculation of cluster means is needed
  • Linear time complexity
  • The quality of the clusters is not affected by the dimensionality of the space
  • Performs much better than traditional clustering in high-dimensional spaces, in terms of both cluster quality and runtime

Ref [30-35]
42
4.3 Document Clustering: Applications
  • Summarization of documents
  • Navigation of large document collections
  • Organization of Web search results

Ref [10][15-17]
43
5. Product: Intelligent Miner for Text (IMT) (1)
Ref [5][36]
44
5. Product: Intelligent Miner for Text (IMT) (2)
  • 1. Feature extraction tools
  • 1.1 Information extraction
  • Extract linguistic items that represent document contents
  • 1.2 Feature extraction
  • Assign different categories to the vocabulary in documents,
  • Measure their importance to the document content.
  • 1.3 Name extraction
  • Locate names in text,
  • Determine what type of entity the name refers to
  • 1.4 Term extraction
  • Discover terms in text, including multiword technical terms
  • Recognize variants of the same concept
  • 1.5 Abbreviation recognition
  • Find abbreviations and match them with their full forms.
  • 1.6 Relation extraction

45
5. Product: Intelligent Miner for Text (IMT) (3)
Feature extraction demo.
46
5. Product: Intelligent Miner for Text (IMT) (4)
  • 2. Clustering tools
  • 2.1 Applications
  • Provide an overview of the content of a large document collection
  • Identify hidden structures between groups of objects
  • Improve the browsing process to find similar or related information
  • Find outstanding documents within a collection
  • 2.2 Hierarchical clustering
  • Clusters are organized in a clustering tree; related clusters occur in the same branch of the tree.
  • 2.3 Binary relational clustering
  • Relationship of topics: document → cluster → topic.
  • NB: a preprocessing step for the categorization tool

47
5. Product: Intelligent Miner for Text (IMT) (5)
  • Clustering demo: navigation of a document collection

48
5. Product: Intelligent Miner for Text (IMT) (6)
  • 3. Summarization tools
  • 3.1 Steps
  • compute the relevancy of each sentence to the document → select the most relevant sentences → produce a summary of the document, with its length set by the user
  • 3.2 Applications
  • Judge the relevancy of a full text
  • Easily determine whether the document is relevant to read
  • Enrich search results: the results of a query to a search engine can be enriched with a short summary of each document
  • Get a fast overview over document collections: summary → full document

49
5. Product: Intelligent Miner for Text (IMT) (7)
  • 4. Categorization tool
  • Applications
  • Organize intranet documents
  • Assign documents to folders
  • Dispatch requests
  • Forward news to subscribers

[Diagram: a news article passes through the categorizer into categories (sports, cultures, health, politics, economics, vacations); a news router forwards, e.g., health news to a subscriber who says "I like health news"]
50
6. References (1)
Bibliography
[1] Marti A. Hearst, "Untangling Text Data Mining", Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26, 1999 (invited paper). http://www.sims.berkeley.edu/hearst
[2] R. Feldman and I. Dagan, "KDT - Knowledge Discovery in Texts", Proceedings of the First Annual Conference on Knowledge Discovery and Data Mining (KDD), Montreal, 1995.
[3] IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, Stockholm, Sweden, August 2, 1999. http://www.cs.biu.ac.il/feldman/ijcai-workshop20cfp.html
[4] Taeho C. Jo, "Text Categorization considering Categorical Weights and Substantial Weights of Informative Keywords", 1999. http://www.sccs.chukyo-u.ac.jp/ICCS/olp/p3-13/p3-13.htm
[5] "Text Mining: Turning Information Into Knowledge", a white paper from IBM Technology, February 17, 1998; editor Daniel Tkach, IBM Software Solutions. http://allen.comm.virginia.edu/jtl5t/whiteweb.html
[6] http://allen.comm.virginia.edu/jtl5t/index.htm
[7] G. Salton et al., Introduction to Modern Information Retrieval, McGraw-Hill Book Company, 1983.
[8] Michael Steinbach, George Karypis and Vipin Kumar, "A Comparison of Document Clustering Techniques", KDD-2000.
[9] Douglass R. Cutting, David R. Karger, Jan O. Pedersen and John W. Tukey, "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections", SIGIR '92, pages 318-329.
[10] Oren Zamir, Oren Etzioni, Omid Madani and Richard M. Karp, "Fast and Intuitive Clustering of Web Documents", KDD '97, pages 287-290, 1997.
51
6. References (2)
Bibliography
[11] Kjersti Aas et al., "Text Categorization: A Survey", 1999.
[12] Text mining white paper. http://textmining.krdl.org.sg/whiteppr.html
[13] Gartner Group, text mining white paper, June 2000. http://www.xanalys.com/intelligence_tools/products/text_mining_text.html
[14] Yiming Yang and Jan O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization", in the 14th Int. Conf. on Machine Learning, pp. 412-420, 1997.
[15] C. J. van Rijsbergen, Information Retrieval, Butterworths, London, 1979.
[16] Chris Buckley and Alan F. Lewit, "Optimizations of Inverted Vector Searches", SIGIR '85, pp. 97-110, 1985.
[17] Daphne Koller and Mehran Sahami, "Hierarchically Classifying Documents Using Very Few Words", Proceedings of the 14th International Conference on Machine Learning, Nashville, Tennessee, July 1997, pp. 170-178.
[18] T. Joachims, "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization", in Int. Conf. on Machine Learning, 1997.
[19] K. Lang, "NewsWeeder: Learning to Filter Netnews", International Conference on Machine Learning, 1995. http://anther.learning.cs.cmu.edu/ml95.ps
[20] S.P. Harter, "A Probabilistic Approach to Automatic Keyword Indexing", Journal of the American Society for Information Science, July-August 1975.
52
6. References (3)
Bibliography
[21] Scott Deerwester et al., "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, Vol. 41, No. 6, pp. 391-407, 1990.
[22] S.T. Dumais, "Improving the Retrieval of Information from External Sources", Behavior Research Methods, Instruments and Computers, Vol. 23, No. 2, pp. 229-236, 1991.
[23] T. Yavuz and H. A. Guvenir, "Application of k-Nearest Neighbor on Feature Projections Classifier to Text Categorization", 1998.
[24] Eui-Hong Han and George Karypis, "Centroid-Based Document Classification: Analysis & Experimental Results", in European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), 2000.
[25] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[26] Ji He, Ah-Hwee Tan and Chew-Lim Tan, "A Comparative Study on Chinese Text Categorization Methods", PRICAI 2000 Workshop on Text and Web Mining, Melbourne, pp. 24-35, August 2000. http://textmining.krdl.org.sg/PRICAI2000/text-categorization.pdf
[27] Erik Wiener, Jan O. Pedersen and Andreas S. Weigend, "A Neural Network Approach to Topic Spotting", in Proc. 4th Annual Symposium on Document Analysis and Information Retrieval, pp. 22-34, 1993.
[28] V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen and J. Allan, in the Proceedings of the KDD 2000 Conference, pp. 37-44.
[29] Peter Willett, "Recent Trends in Hierarchic Document Clustering: A Critical Review", Information Processing & Management, Vol. 24, No. 5, pp. 577-597, 1988.
53
6. References (4)
Bibliography
[30] J. Larocca Neto, A.D. Santos, C.A.A. Kaestner and A.A. Freitas, "Document Clustering and Text Summarization", Proc. 4th Int. Conf. on Practical Applications of Knowledge Discovery and Data Mining (PADD-2000), pp. 41-55, London: The Practical Application Company, 2000.
[31] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar and Bamshad Mobasher, "Clustering Based On Association Rule Hypergraphs", SIGMOD'97 Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.
[32] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong (Sam) Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher and Jerome Moore, "Partitioning-Based Clustering for Web Document Categorization", Decision Support Systems Journal, Vol. 27, No. 3, pp. 329-341, 1999.
[34] George Karypis, Rajat Aggarwal, Vipin Kumar and Shashi Shekhar, "Multilevel Hypergraph Partitioning: Applications in VLSI Domain", in Proceedings of the Design and Automation Conference '97.
[35] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar and Bamshad Mobasher, "Clustering in a High-Dimensional Space Using Hypergraph Models", Technical Report 97-019. http://www-users.cs.umn.edu/han/
[36] "Information Mining with the IBM Intelligent Miner Family", IBM white paper, Daniel S. Tkach, February 1998.