Hierarchical Document Clustering Using Frequent Itemsets
(Transcript of a PowerPoint presentation, 38 slides)
1
Hierarchical Document Clustering Using Frequent
Itemsets
  • Benjamin Fung, Ke Wang, Martin Ester
  • {bfung, wangk, ester}@cs.sfu.ca
  • Simon Fraser University
  • May 1, 2003 (SDM 03)

2
Outline
  • What is hierarchical document clustering?
  • Previous works
  • Our method: Frequent Itemset-based Hierarchical
    Clustering (FIHC)
  • Experimental results
  • Conclusions

3
Hierarchical Document Clustering
  • Document Clustering: Automatic organization of
    documents into clusters so that documents within
    a cluster have high similarity in comparison to
    one another, but are very dissimilar to documents
    in other clusters.
  • Hierarchical Document Clustering

4
Challenges in Hierarchical Document Clustering
  • High dimensionality.
  • High volume of data.
  • Consistently high clustering quality.
  • Meaningful cluster description.

5
Previous Works
  • Hierarchical Methods
  • Agglomerative and Divisive.
  • Reasonably accurate but not scalable.
  • Partitioning Methods
  • Efficient, scalable, easy to implement.
  • Clustering quality degrades if an inappropriate
    number of clusters is provided.
  • Frequent item-based Method
  • HFTC depends on a greedy heuristic.

6
Preprocessing
  • Remove stop words and apply stemming.
  • Construct the vector model:
  • doc_i = ( item frequency_1, if_2, if_3, ..., if_m )
  • e.g.
  • ( apple, boy, cat, window )
  • doc1 = ( 5, 2, 7, 0 )
  • doc2 = ( 4, 0, 0, 3 )
  • doc3 = ( 0, 3, 1, 5 )   <- document vectors

doc1: apple 5, boy 2, cat 7
doc2: apple 4, window 3
doc3: boy 3, cat 1, window 5
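A minimal Python sketch of this step (the tiny stop-word list and the
crude suffix stripper below are illustrative stand-ins for a real stop
list and the Porter stemmer cited in the references):

    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "and", "is", "of"}  # toy stop list

    def stem(word):
        # crude placeholder for a real stemmer such as Porter's
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def to_vector(text, vocabulary):
        # item frequencies over a fixed vocabulary (the vector model)
        items = [stem(w) for w in text.lower().split()
                 if w not in STOP_WORDS]
        counts = Counter(items)
        return [counts[item] for item in vocabulary]

    vocab = ["apple", "boy", "cat", "window"]
    print(to_vector("the apple and the boy saw cats", vocab))
    # -> [1, 1, 1, 0]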
7
Algorithm Overview of Our Method (FIHC)
  (high dimensional doc vectors) -> (reduced dimensions feature
  vectors) -> Construct clusters -> Build a Tree -> Pruning
8
Definition Global Frequent Itemset
  • A global frequent itemset refers to a set of
    items (words) that appear together in more than a
    user-specified fraction of the document set.
  • The global support of an itemset is the
    percentage of documents containing the itemset.
  • e.g., 7% of the documents contain both words.
  • {apple, window} has global support 7%.
  • A global frequent item refers to an item that
    belongs to some global frequent itemset, e.g.,
    apple.

9
Reduced Dimensions Vector Model
  • High dimensional vector model:
  • ( apple, boy, cat, window )
  • doc1 = ( 5, 2, 1, 1 )
  • doc2 = ( 4, 0, 0, 3 )
  • doc3 = ( 0, 3, 1, 5 )
  • doc4 = ( 8, 0, 2, 0 )
  • doc5 = ( 5, 0, 0, 3 )   <- document vectors
  • Suppose we set the minimum support to 60%. The
    global frequent itemsets are {apple}, {cat},
    {window}, and {apple, window}.
  • Store the frequencies only for global frequent
    items:
  • ( apple, cat, window )
  • doc1 = ( 5, 1, 1 )
  • doc2 = ( 4, 0, 3 )   <- feature vectors
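A brute-force Python sketch of both steps on these five toy documents
(a real implementation would use an Apriori-style miner; the names
here are illustrative):

    from itertools import combinations

    vocab = ["apple", "boy", "cat", "window"]
    docs = [[5, 2, 1, 1], [4, 0, 0, 3], [0, 3, 1, 5],
            [8, 0, 2, 0], [5, 0, 0, 3]]
    min_sup = 0.6

    def support(itemset):
        # fraction of documents containing every item in the itemset
        hits = sum(1 for d in docs
                   if all(d[vocab.index(i)] > 0 for i in itemset))
        return hits / len(docs)

    frequent = [c for r in range(1, len(vocab) + 1)
                for c in combinations(vocab, r)
                if support(c) >= min_sup]
    # -> ('apple',), ('cat',), ('window',), ('apple', 'window')

    global_items = [i for i in vocab if any(i in f for f in frequent)]
    feature_vecs = [[d[vocab.index(i)] for i in global_items]
                    for d in docs]
    # doc1 -> [5, 1, 1], doc2 -> [4, 0, 3], ...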
10
Intuition
  • Frequent itemsets are meaningful combinations of words.
  • Ex. {apple} -> Topic: Fruits
  • {window} -> Topic: Renovation
  • {apple, window} -> Topic: Computer

11
Construct Initial Clusters
  • Construct a cluster for each global frequent
    itemset.
  • Global frequent itemsets: {apple}, {cat},
    {window}, {apple, window}
  • All documents containing a given itemset are
    included in that itemset's cluster.

Initial clusters: C(apple), C(window), C(apple, window), C(cat)
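A sketch of the construction, continuing the variables of the previous
sketch (note that the initial clusters overlap; doc2, for instance,
lands in C(apple), C(window), and C(apple, window)):

    def initial_clusters(frequent, docs, vocab):
        # one cluster per global frequent itemset; a document joins
        # every cluster whose itemset it fully contains
        clusters = {}
        for itemset in frequent:
            clusters[itemset] = [
                j for j, d in enumerate(docs)
                if all(d[vocab.index(i)] > 0 for i in itemset)]
        return clusters

    # initial_clusters(frequent, docs, vocab) ->
    # {('apple',): [0, 1, 3, 4], ('cat',): [0, 2, 3],
    #  ('window',): [0, 1, 2, 4], ('apple', 'window'): [0, 1, 4]}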
12
Making Clusters Disjoint
  • Assign a document to the best initial cluster.
  • Intuitively, a cluster Ci is good for a document
    docj if there are many global frequent items in
    docj that appear in many documents in Ci.

13
Cluster Frequent Items
  • A global frequent item is cluster frequent in a
    cluster Ci if the item is contained in some
    minimum fraction of documents in Ci.
  • Suppose we set the minimum cluster support to 60%.

C(apple): ( apple, cat, window )
  doc1 = ( 5, 1, 1 )
  doc2 = ( 4, 0, 3 )
  doc4 = ( 8, 2, 0 )
  doc5 = ( 5, 0, 3 )

  Item     Cluster Support
  apple    100%
  cat      50%
  window   75%

apple and window are cluster frequent items in C(apple).
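The same computation as a small Python sketch (the documents of
C(apple) as item-frequency dicts; 0.6 is the minimum cluster support):

    members = {"doc1": {"apple": 5, "cat": 1, "window": 1},
               "doc2": {"apple": 4, "window": 3},
               "doc4": {"apple": 8, "cat": 2},
               "doc5": {"apple": 5, "window": 3}}

    def cluster_support(item):
        # fraction of the cluster's documents that contain the item
        return sum(1 for d in members.values() if item in d) / len(members)

    cluster_frequent = [i for i in ("apple", "cat", "window")
                        if cluster_support(i) >= 0.6]
    print(cluster_frequent)  # ['apple', 'window']; cat is only at 50%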
14
Score Function (Example)
C(apple):          apple 100%, window 75%
C(window):         cat 60%, window 100%
C(apple, window):  apple 100%, cat 60%, window 100%
C(cat):            cat 100%

doc1: apple 5, cat 1, window 3
15
Score Function
  • Assign each doc_j to the initial cluster Ci that
    has the highest score:
  • Score(Ci <- doc_j) = sum_x [ n(x) * cluster_support(x) ]
    - sum_x' [ n(x') * global_support(x') ]
  • x represents a global frequent item in doc_j that
    is also cluster frequent in Ci.
  • x' represents a global frequent item in doc_j that
    is not cluster frequent in Ci.
  • n(x) is the frequency of x in the feature vector
    of doc_j.
  • n(x') is the frequency of x' in the feature
    vector of doc_j.

16
Score Function (Example)
C(apple):          apple 100%, window 75%
C(window):         cat 60%, window 100%
C(apple, window):  apple 100%, cat 60%, window 100%
C(cat):            cat 100%

doc1: apple 5, cat 1, window 3
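Working the numbers through (assuming the global supports of the
earlier five-document example: apple 80%, cat 60%, window 80%): for
C(apple), cat is not cluster frequent, so Score = 5(1.0) + 3(0.75)
- 1(0.60) = 6.65; for C(apple, window), all three items are cluster
frequent, so Score = 5(1.0) + 1(0.60) + 3(1.0) = 8.60; C(window) and
C(cat) score negative. doc1 is therefore assigned to C(apple, window).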
17
Tree Construction
  • Put the more specific clusters at the bottom of
    the tree.
  • Put the more general clusters at the top of the
    tree.
  • Build a tree from bottom-up by choosing a parent
    for each cluster (start from the cluster with the
    largest number of items in its cluster label).

    null
    +-- CS
    |   +-- CS, AI
    |   +-- CS, DM
    +-- Sports
        +-- Sports, Ball
        +-- Sports, Tennis
            +-- Sports, Tennis, Ball

  (each node is labelled by its cluster label)
18
Choose a Parent Cluster (example)
  • Candidate parents: C(Sports, Ball) and C(Sports, Tennis)
  • Child: C(Sports, Tennis, Ball)

         ( CS, DM, AI, Sports, Tennis, Ball )
  doc1 = (  0,  0,  0,      5,     10,    2 )
  doc2 = (  1,  0,  0,      5,      5,    3 )
  doc3 = (  0,  1,  0,     15,     10,    1 )
  sum  = (  1,  1,  0,     25,     25,    6 )
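A sketch of the selection, assuming the score() function sketched
earlier and precomputed cluster supports for each candidate parent:

    def choose_parent(child_sum, candidates, global_sup):
        # child_sum: {item: frequency} for the conceptual document
        #   formed by merging the child's documents (the sum row)
        # candidates: {label: cluster_sup dict} for clusters one
        #   level up whose label is a subset of the child's label
        return max(candidates,
                   key=lambda label: score(child_sum,
                                           candidates[label],
                                           global_sup))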
19
Prune Cluster Tree
  • Why do we want to prune the tree?
  • To remove overly specific child clusters.
  • Without pruning, documents of the same class (topic) are
    likely to be distributed over different subtrees, which
    would lead to poor clustering quality.

20
Inter-Cluster Similarity
  • Inter_Sim measures the similarity between two
    clusters Ca and Cb.
  • Reuse the score function to calculate Sim(Ci <- Cj).
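The formulas themselves (a reconstruction from the paper's
definitions; the normalization into the range [0, 2] is what makes the
pruning threshold of 1 on the next slide natural): treat Cj as a
single conceptual document doc(Cj) by combining the feature vectors of
its documents, then

    Sim(Ci <- Cj) = Score(Ci <- doc(Cj)) / (sum_x n(x) + sum_x' n(x')) + 1

    Inter_Sim(Ca <-> Cb) = sqrt( Sim(Ca <- Cb) * Sim(Cb <- Ca) )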

21
Child Pruning
  • Efficiently shorten a tree by replacing child
    clusters by their parent.
  • A child is pruned only if it is similar to its
    parent.
  • Prune a child if Inter_Sim > 1.

    null
    +-- CS
    |   +-- CS, DM
    |   +-- CS, AI
    +-- Sports
        +-- Sports, Ball
        +-- Sports, Tennis
            +-- Sports, Tennis, Racket   <- pruned into its parent
22
Sibling Merging
  • Narrow a tree by merging similar subtrees at
    level 1.

    null
    +-- CS
    |   +-- CS, DM
    |   +-- CS, AI
    +-- Sports
    |   +-- Sports, Ball
    |   +-- Sports, Tennis
    +-- IT
        +-- IT, Server
        +-- IT, Engineer

Inter_Sim(CS <-> IT) = 1.5
Inter_Sim(IT <-> Sports) = 0.75
Inter_Sim(CS <-> Sports) = 0.5
23
Sibling Merging
After merging CS and IT (the most similar pair, Inter_Sim > 1):

    null
    +-- CS
    |   +-- CS, DM
    |   +-- CS, AI
    |   +-- IT, Server
    |   +-- IT, Engineer
    +-- Sports
        +-- Sports, Ball
        +-- Sports, Tennis
24
Experimental Results
  • Compare with state-of-the-art clustering
    algorithms
  • Bisecting k-means (Cluto 2.0 Toolkit)
  • UPGMA (Cluto 2.0 Toolkit)
  • HFTC (Source code in Java from author)
  • Clustering quality.
  • Efficiency and Scalability.

25
Data Sets
  • Each document is pre-classified into a single
    natural class.

26
Clustering Quality (F-measure)
  • Widely used evaluation method for clustering
    algorithms.
  • Based on recall and precision.
  • The overall F-measure is a weighted average, over
    all natural classes, of each class's best F-score
    (see the sketch below).
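A sketch of this standard computation (classes and clusters as sets of
document ids; each class takes the best F-score over all clusters,
weighted by class size):

    def f_measure(classes, clusters):
        # classes, clusters: lists of sets of document ids
        n = sum(len(c) for c in classes)
        total = 0.0
        for ci in classes:
            best = 0.0
            for kj in clusters:
                overlap = len(ci & kj)
                if overlap:
                    p, r = overlap / len(kj), overlap / len(ci)
                    best = max(best, 2 * p * r / (p + r))
            total += (len(ci) / n) * best
        return total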

27
For FIHC and HFTC, we use MinSup from 3% to 6%.
28
Efficiency
29
Complexity Analysis
  • Clustering: sum of global_support(f) over all global
    frequent itemsets f in F (two scans over the
    documents).
  • Constructing the tree: empty clusters are removed
    first; O(n), where n is the number of documents.
  • Child pruning: one scan over the remaining clusters.
  • Sibling merging: O(g^2), where g is the number of
    remaining clusters at level 1.

30
Conclusions
  • This research exploits frequent itemsets for
  • defining a cluster, and
  • organizing the cluster hierarchy.
  • Our contributions:
  • Reduced dimensionality -> efficient and scalable.
  • High clustering quality.
  • Number of clusters as an optional input parameter.
  • Meaningful cluster description.

31
Thank you.
  • Questions?

32
References
  1. C. Aggarwal, S. Gates, and P. Yu. On the merits
     of building categorization systems by supervised
     clustering. In Proceedings of KDD '99, 5th ACM
     International Conference on Knowledge Discovery
     and Data Mining, pages 352-356, San Diego, US,
     1999. ACM Press, New York, US.
  2. R. Agrawal, C. Aggarwal, and V. V. V. Prasad.
     Depth-first generation of large itemsets for
     association rules. Technical Report RC21538, IBM
     Technical Report, October 1999.
  3. R. Agrawal, C. Aggarwal, and V. V. V. Prasad. A
     tree projection algorithm for generation of
     frequent item sets. Journal of Parallel and
     Distributed Computing, 61(3):350-371, 2001.
  4. R. Agrawal, J. Gehrke, D. Gunopulos, and P.
     Raghavan. Automatic subspace clustering of high
     dimensional data for data mining applications. In
     Proceedings of ACM SIGMOD International
     Conference on Management of Data (SIGMOD '98),
     pages 94-105, 1998.
  5. R. Agrawal, T. Imielinski, and A. N. Swami.
     Mining association rules between sets of items in
     large databases. In Proceedings of ACM SIGMOD
     International Conference on Management of Data
     (SIGMOD '93), pages 207-216, Washington, D.C., May
     1993.
  6. R. Agrawal and R. Srikant. Fast algorithms for
     mining association rules. In J. B. Bocca, M.
     Jarke, and C. Zaniolo, editors, Proc. 20th Int.
     Conf. Very Large Data Bases, VLDB, pages 487-499.
     Morgan Kaufmann, 1994.
  7. R. Agrawal and R. Srikant. Mining sequential
     patterns. In Proc. 1995 Int. Conf. Data
     Engineering, pages 3-14, Taipei, Taiwan, March
     1995.
  8. M. Ankerst, M. Breunig, H. Kriegel, and J.
     Sander. OPTICS: Ordering points to identify the
     clustering structure. In 1999 ACM-SIGMOD Int.
     Conf. Management of Data (SIGMOD '99), pages
     49-60, Philadelphia, PA, June 1999.

33
References
  1. F. Beil, M. Ester, and X. Xu. Frequent term-based
     text clustering. In Proc. 8th Int. Conf. on
     Knowledge Discovery and Data Mining (KDD 2002),
     Edmonton, Alberta, Canada, 2002.
     http://www.cs.sfu.ca/~ester/publications.html.
  2. H. Borko and M. Bernick. Automatic document
     classification. Journal of the ACM, 10:151-162,
     1963.
  3. S. Chakrabarti. Data mining for hypertext: A
     tutorial survey. SIGKDD Explorations: Newsletter
     of the Special Interest Group (SIG) on Knowledge
     Discovery & Data Mining, ACM, 1:1-11, 2000.
  4. M. Charikar, C. Chekuri, T. Feder, and R.
     Motwani. Incremental clustering and dynamic
     information retrieval. In Proceedings of the 29th
     Symposium on Theory Of Computing (STOC 1997), pages
     626-635, 1997.
  5. Classic. ftp://ftp.cs.cornell.edu/pub/smart/.
  6. D. R. Cutting, D. R. Karger, J. O. Pedersen, and
     J. W. Tukey. Scatter/Gather: A cluster-based
     approach to browsing large document collections.
     In Proceedings of the Fifteenth Annual
     International ACM SIGIR Conference on Research
     and Development in Information Retrieval, pages
     318-329, 1992.
  7. P. Domingos and G. Hulten. Mining high-speed data
     streams. In Knowledge Discovery and Data Mining,
     pages 71-80, 2000.
  8. R. C. Dubes and A. K. Jain. Algorithms for
     Clustering Data. Prentice Hall College Div,
     Englewood Cliffs, NJ, March 1998.
  9. A. El-Hamdouchi and P. Willet. Comparison of
     hierarchic agglomerative clustering methods for
     document retrieval. The Computer Journal, 32(3),
     1989.
  10. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A
      density-based algorithm for discovering clusters
      in large spatial databases with noise. In
      Proceedings of the 2nd Int. Conf. on Knowledge
      Discovery and Data Mining (KDD '96), pages
      226-231, Portland, Oregon, August 1996. AAAI
      Press.
  11. A. Griffiths, L. A. Robinson, and P. Willett.
      Hierarchical agglomerative clustering methods for
      automatic document classification. Journal of
      Documentation, 40(3):175-205, September 1984.
  12. S. Guha, N. Mishra, R. Motwani, and L.
      O'Callaghan. Clustering data streams. In IEEE
      Symposium on Foundations of Computer Science,
      pages 359-366, 2000.

34
References
  • S. Guha, R. Rastogi, and K. Shim. ROCK: A robust
    clustering algorithm for categorical attributes.
    In Proceedings of the 15th International
    Conference on Data Engineering, 1999.
  • E. H. Han, B. Boley, M. Gini, R. Gross, K.
    Hastings, G. Karypis, V. Kumar, B. Mobasher, and
    J. Moore. WebACE: A web agent for document
    categorization and exploration. In Proceedings of
    the Second International Conference on Autonomous
    Agents, pages 408-415. ACM Press, 1998.
  • J. Han and M. Kamber. Data Mining: Concepts and
    Techniques. Morgan Kaufmann, August 2000.
  • J. Han, J. Pei, and Y. Yin. Mining frequent
    patterns without candidate generation. In
    Proceedings of the 2000 ACM SIGMOD International
    Conference on Management of Data (SIGMOD '00),
    Dallas, Texas, USA, May 2000.
  • J. Hipp, U. Guntzer, and G. Nakhaeizadeh.
    Algorithms for association rule mining - a
    general survey and comparison. SIGKDD
    Explorations, 2(1):58-64, July 2000.
  • G. Hulten, L. Spencer, and P. Domingos. Mining
    time-changing data streams. In Proceedings of the
    Seventh ACM SIGKDD International Conference on
    Knowledge Discovery and Data Mining, pages
    97-106, San Francisco, CA, 2001. ACM Press.
  • G. Karypis. Cluto 2.0 clustering toolkit, April
    2002. http://www.users.cs.umn.edu/~karypis/cluto/.
  • L. Kaufman and P. J. Rousseeuw. Finding Groups in
    Data: An Introduction to Cluster Analysis. John
    Wiley and Sons, March 1990.
  • D. Koller and M. Sahami. Hierarchically
    classifying documents using very few words. In D.
    Fisher, editor, Proceedings of ICML '97, 14th
    International Conference on Machine Learning,
    pages 170-178, Nashville, US, 1997. Morgan
    Kaufmann Publishers, San Francisco, US.
  • Kosala and Blockeel. Web mining research: A
    survey. SIGKDD Explorations: Newsletter of the
    Special Interest Group (SIG) on Knowledge Discovery
    & Data Mining, 2, 2000.
  • G. Kowalski and M. Maybury. Information Storage
    and Retrieval Systems: Theory and Implementation.
    Kluwer Academic Publishers, 2nd edition, July 2000.


36
References
  1. J. Lam. Multi-dimensional constrained gradient
     mining. Master's thesis, Simon Fraser University,
     August 2001.
  2. B. Larsen and C. Aone. Fast and effective text
     mining using linear-time document clustering. In
     KDD '99, 1999.
  3. D. D. Lewis. Reuters.
     http://www.research.att.com/~lewis/.
  4. B. Liu, W. Hsu, and Y. Ma. Integrating
     classification and association rule mining. In
     Knowledge Discovery and Data Mining (KDD '98),
     pages 80-86, 1998.
  5. Miller. Princeton WordNet, 1990.
  6. M. F. Porter. An algorithm for suffix stripping.
     Program, 14(3):130-137, July 1980.
  7. J. R. Quinlan. C4.5: Programs for Machine
     Learning. Morgan Kaufmann, 1993.
  8. K. Ross and D. Srivastava. Fast computation of
     sparse datacubes. In M. Jarke, M. Carey, K.
     Dittrich, F. Lochovsky, P. Loucopoulos, and M.
     Jeusfeld, editors, Proceedings of the 23rd
     International Conference on Very Large Data Bases
     (VLDB '97), pages 116-125, Athens, Greece, August
     1997. Morgan Kaufmann.
  9. H. Schutze and H. Silverstein. Projections for
     efficient document clustering. In Proceedings of
     SIGIR '97, pages 74-81, Philadelphia, PA, July
     1997.
  10. C. E. Shannon. A mathematical theory of
      communication. Bell Systems Technical Journal,
      27:379-423 and 623-656, July and October 1948.
  11. M. Steinbach, G. Karypis, and V. Kumar. A
      comparison of document clustering techniques. KDD
      Workshop on Text Mining, 2000.
  12. Text REtrieval Conference (TREC) TIPSTER, 1999.
      http://trec.nist.gov/.
  13. H. Uchida, M. Zhu, and T. Della Senta. UNL: A
      gift for a millennium. The United Nations
      University, 2000.
  14. C. J. van Rijsbergen. Information Retrieval.
      Dept. of Computer Science, University of Glasgow,
      Butterworth, London, 2nd edition, 1979.
  15. P. Vossen. EuroWordNet, Summer 1999.
  16. K. Wang, C. Xu, and B. Liu. Clustering
      transactions using large items. In CIKM '99, pages
      483-490, 1999.

37
References
  1. K. Wang, S. Zhou, and Y. He. Hierarchical
     classification of real life documents. In
     Proceedings of the 1st SIAM International
     Conference on Data Mining, Chicago, US, 2001.
  2. W. Wang, J. Yang, and R. R. Muntz. STING: A
     statistical information grid approach to spatial
     data mining. In M. Jarke, M. J. Carey, K. R.
     Dittrich, F. H. Lochovsky, P. Loucopoulos, and M.
     A. Jeusfeld, editors, VLDB '97, Proceedings of the
     23rd International Conference on Very Large Data
     Bases, pages 186-195, Athens, Greece, August
     25-29, 1997. Morgan Kaufmann.
  3. Yahoo! http://www.yahoo.com/.
  4. O. Zamir, O. Etzioni, O. Madani, and R. M. Karp.
     Fast and intuitive clustering of web documents.
     In KDD '97, pages 287-290, 1997.