IRC: An Iterative Reinforcement Categorization Algorithm for Interrelated Web Objects - PowerPoint PPT Presentation

1
IRC: An Iterative Reinforcement Categorization
Algorithm for Interrelated Web Objects
  • Gui-Rong Xue, Dou Shen, Qiang Yang
  • Hua-Jun Zeng, Zheng Chen, Yong Yu and Wei-Ying Ma

2
Outline
  • Motivation
  • Related Work
  • Categorization of interrelated objects
  • Experiments
  • Conclusion

3
Document Classification
  • Definition and Applications
  • Approaches
  • SVM
  • Naïve Bayes
  • kNN
  • Document Representation
  • VSM model
  • N-gram model

4
Webpage Classification
  • Differences from Document Classification
  • Page semi-structure
  • Hyperlinks and anchor text
  • Challenges
  • Noisy content: ads, navigation bars, images,
    scripts
  • No coherent page-construction style, language, or
    structure
  • New Approaches
  • Link-based classification
  • Summarization-based classification

5
Heterogeneous Web
  • How to leverage the inter-relationship between
    the heterogeneous data objects?

[Figure: heterogeneous Web data objects and the relationships among them. Labels: Query, Query Thesaurus, Query Session Log, Query Log, Hyperlink, Web page, Browsing patterns, Human Relationship, User Profile]

6
Virtual-document
  • One possible approach: the virtual document
  • Augment the features of a Web object by using
    its interrelated Web objects as additional
    features
  • Yields only a small improvement over the
    content-based categorization approach
  • There is still headroom for harnessing the
    interrelationships between heterogeneous
    objects.

[Figure: the original feature vector and the additional feature vector are combined into a virtual feature vector]
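
The virtual-document idea can be illustrated with a minimal Python sketch. The slides do not say how the features are combined, so the bag-of-words representation, the function names `bag_of_words` and `virtual_document`, and the `query_weight` parameter are assumptions made only for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, whitespace-tokenized term counts (a deliberately simple tokenizer)."""
    return Counter(text.lower().split())

def virtual_document(page_text, related_queries, query_weight=1):
    """Add the term counts of related queries to a page's own term counts.

    query_weight is a hypothetical knob for how strongly query terms count.
    """
    features = bag_of_words(page_text)              # original feature vector
    for query in related_queries:                   # additional feature vector
        for term, count in bag_of_words(query).items():
            features[term] += query_weight * count
    return features                                 # virtual feature vector

# Illustrative page text and queries, not taken from the data set.
vec = virtual_document("Cheap flights and hotel deals",
                       ["cheap flights", "flight booking"])
```

The resulting vector can be passed to any content-based classifier. The related objects only enlarge the feature set here, which is the limitation the deck goes on to address.
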
7
Our Approach
  • Iterative Reinforcement Categorization (IRC)
  • The category information of one object is
    reinforced by the category information of all its
    interrelated objects
  • The updated category information of the object
    will consequently reinforce the category
    information of its interrelated objects
  • Iterate until the process converges to a
    conclusive result.
  • This fully exploits the relationships among the
    heterogeneous data objects.

8
Outline
  • Motivation
  • Related Work
  • Categorization of interrelated objects
  • Experiments
  • Conclusion

9
Related Work
  • Classification on the Web pages
  • Content-based
  • Joachims [13] proposed using Support
    Vector Machines (SVMs) to classify documents.
  • Dumais and Chen [7] use text representation to
    organize search results into an existing
    hierarchical structure.

10
Related Work
  • Classification on the Web pages
  • Link-analysis-based classification
  • Handles both the text components of Web pages
    and the hyperlink relationships.
  • Cohn et al. [6] and Glover et al. [9] combine
    link-based and content-based methods.
  • Chakrabarti et al. [2]: probabilistic model.
  • Getoor et al. [8]: probabilistic relational
    models (PRMs).
  • Our work is different
  • We classify heterogeneous data objects
    across different data types.
  • Data objects of different types cannot be
    treated as lying in the same feature space.

11
Related Work
  • Query log analysis
  • Beeferman and Berger [1]: iterative clustering on
    clickthrough data.
  • Ignores the content features of the queries and
    Web pages.
  • Wen et al. [19] described a query clustering
    method.
  • Wang et al. [17] proposed using the query
    clickthrough log to iteratively reinforce query
    and Web page clusters.

12
Outline
  • Motivation
  • Related Work
  • Categorization of interrelated objects
  • Experiments
  • Conclusion

13
Categorization of interrelated objects
  • Problem Definition
  • Iterative reinforcement algorithm

14
Categorization of interrelated objects: Problem
Definition
  • The Web can be modeled as a weighted directed
    bipartite graph G(V, E):
  • the nodes in V represent queries and Web pages,
  • the edges in E represent the clickthrough
    information from a query to a browsed Web page.
  • In this paper, we divide V into two subsets:
  • Queries Q = {q1, q2, ..., qm}
  • Web pages D = {d1, d2, ..., dn}
  • A matrix M is used to represent the adjacency
    matrix, whose (i, j)-element is the weight from
    Web page i to query j.
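
As a rough illustration of the matrix M, the sketch below builds it from a list of (query, page) click records. The slides do not define the edge weight, so using the click count as the weight is an assumption, and `build_adjacency` and the sample identifiers are hypothetical.

```python
def build_adjacency(clicks, pages, queries):
    """Return an n x m matrix M where M[i][j] links Web page i and query j."""
    page_index = {d: i for i, d in enumerate(pages)}
    query_index = {q: j for j, q in enumerate(queries)}
    M = [[0] * len(queries) for _ in pages]
    for query, page in clicks:
        # Assumption: the weight is the number of recorded clicks.
        M[page_index[page]][query_index[query]] += 1
    return M

pages = ["d1", "d2", "d3"]
queries = ["q1", "q2"]
clicks = [("q1", "d1"), ("q1", "d1"), ("q1", "d2"), ("q2", "d3")]
M = build_adjacency(clicks, pages, queries)
```

For the data set sizes reported in the experiments, a sparse representation would be more practical than nested lists. The dense form is used here only to keep the indexing visible.
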

15
Categorization of interrelated objects: Problem
Definition
  • Given a bipartite graph G(V, E), the problem is
    how to classify the Web pages D and queries Q
    into a set of predefined categories C = {c1, c2, ...,
    ck}, where k is the number of categories.

[Figure: Web pages and queries grouped under the categories C1, C2, C3, ..., Ck]

16
Categorization of interrelated objects: Iterative
reinforcement algorithm
  • Basic Idea
  • We first classify the Web pages according to
    their content features.
  • To fully utilize the relationships in the
    bipartite graph, we propose a novel iterative
    reinforcement classification method.
  • The basic idea is to propagate the categories
    computed for one type of object to all related
    objects by updating their probability of
    belonging to a certain category.
  • This process is iteratively performed until the
    classification results for all object types
    converge.

17
Categorization of interrelated objects: Iterative
reinforcement algorithm
  • We divide the Web pages D into a training set Dt
    and a testing set Ds.
  • We consider all queries as part of the test
    set Qs.

[Figure: Web pages Dt, queries Qs, and Web pages Ds]

18
Categorization of interrelated objects: Iterative
reinforcement algorithm

[Figure: flow of the algorithm. Step 1: a text classifier takes the Web pages Dt as input and produces the categories of the Web pages Ds. Step 2: the categories of the Web pages Dt and Ds are passed through the inter-relationship to give the categories of the queries Qs. Step 3: iterative propagation.]

19
Categorization of interrelated objects: Iterative
reinforcement algorithm
  • Step 2: Classifying queries based on the bipartite
    graph
  • Infer the probability of each query belonging to
    each category from its interrelated Web
    pages

[Figure labels: Testing, Training]

20
Categorization of interrelated objects: Iterative
reinforcement algorithm
  • Step 3: Classifying Web pages based on the
    bipartite graph
  • Re-classify the Web pages through the relationship
    between queries and pages

[Figure labels: Queries, Testing, Web pages, Content]

21
Categorization of interrelated objects: Iterative
reinforcement algorithm
  • Steps 2-3: Iterative reinforcement categorization
  • Then,
  • After several iterations, PS and RS
    reach a fixed point.
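
The update equations for Steps 2 and 3 are not preserved in this transcript, so the following is only a sketch of one plausible reading, not the authors' formulation. It assumes that a query's category probabilities are the click-weighted average of its pages' probabilities, and that a test page's probabilities mix its content-based probabilities with the click-weighted average over its queries. The names `irc_sketch` and `weighted_average` and the parameters `alpha`, `tol` and `max_iter` are all hypothetical.

```python
def weighted_average(weights, rows, k):
    """Weighted average of k-dimensional rows, or None if all weights are zero."""
    total = sum(weights)
    if total == 0:
        return None
    return [sum(w * r[c] for w, r in zip(weights, rows)) / total for c in range(k)]

def irc_sketch(M, content_probs, is_training, alpha=0.5, tol=1e-6, max_iter=100):
    """M[i][j]: weight between page i and query j.
    content_probs[i]: category probabilities of page i (one-hot for training pages).
    alpha: assumed mixing weight between content and propagated evidence."""
    n, m, k = len(M), len(M[0]), len(content_probs[0])
    pages = [row[:] for row in content_probs]
    queries = [[1.0 / k] * k for _ in range(m)]
    for _ in range(max_iter):
        # Step 2: infer query probabilities from the interrelated pages.
        for j in range(m):
            avg = weighted_average([M[i][j] for i in range(n)], pages, k)
            if avg is not None:
                queries[j] = avg
        # Step 3: re-classify the test pages from the interrelated queries.
        change = 0.0
        for i in range(n):
            if is_training[i]:
                continue                      # labeled pages stay fixed
            avg = weighted_average(M[i], queries, k)
            if avg is None:
                continue
            new = [alpha * content_probs[i][c] + (1 - alpha) * avg[c] for c in range(k)]
            change = max(change, max(abs(a - b) for a, b in zip(new, pages[i])))
            pages[i] = new
        if change < tol:                      # treat this as the fixed point
            break
    return pages, queries

# Toy example: page 1 is a test page sharing query 0 with labeled page 0.
M = [[1, 0], [1, 0], [0, 1]]
content = [[1.0, 0.0], [0.4, 0.6], [0.0, 1.0]]
pages, queries = irc_sketch(M, content, is_training=[True, False, True])
```

In this toy example the content classifier slightly prefers the second category for page 1, but the page shares a query with a page labeled as the first category, so the propagated evidence shifts its probabilities toward the first category before the values stop changing.
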



22
Outline
  • Motivation
  • Related Work
  • Categorization of interrelated objects
  • Experiments
  • Conclusion

23
Experiments: Data set
  • A set of classified Web pages extracted from the
    Open Directory Project (ODP) (http://dmoz.org/)
  • A real MSN query clickthrough log was collected.
  • We use the pages that appear in
    both the ODP data set and the query clickthrough
    log:
  • 131,788 Web pages in 15 top-level categories
  • 199,564 associated queries
  • 468,696 relationships between Web pages and
    Queries.

24
Experiments: Feature selection, Classifier and
Evaluation Criteria
  • A simple feature selection method, Information
    Gain (IG) [20], is applied in our
    experiments
  • We use an SVM as the content-based classifier
  • Results are evaluated using the conventional
    precision, recall and F1 measures
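
For reference, precision, recall and F1 can be computed from pooled (micro-averaged) counts as in the sketch below. The function name `micro_f1` and the sample labels are illustrative and are not taken from the experiments.

```python
from collections import Counter

def micro_f1(true_labels, predicted_labels):
    """Micro-averaged precision, recall and F1 for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1      # predicted category gains a false positive
            fn[truth] += 1     # true category gains a false negative
    # Micro-averaging pools the counts over all categories before dividing.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

truth = ["Arts", "Arts", "Sports", "News"]
pred = ["Arts", "Sports", "Sports", "Arts"]
scores = micro_f1(truth, pred)
```

When every object receives exactly one predicted category, each error counts once as a false positive and once as a false negative, so micro-averaged precision, recall and F1 all coincide with accuracy.
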

25
Experiments: Performance
  • Baseline
  • Content-based classification method
  • Comparison methods: the interrelated queries are
    treated as additional features for their
    corresponding pages
  • Query-metadata-based classification
  • Uses only the query metadata as the features of a
    Web page
  • Virtual-document-based classification
  • Uses both Web page content and query metadata

26
Experiments: Performance
  • IRC achieves higher performance than the
    other three methods in the comparison.
  • Improvement in micro-averaged F1:
  • Over the content method by 26.4%,
  • Over the query metadata method by 21%,
  • Over the virtual document method by 16.4%.

27
Experiments: Performance
  • Effect of increasing the size of the
    clickthrough data

28
Experiments: Performance
  • Page length has an important effect on the
    performance of classification based on content
    features
  • The error rate of content-based
    categorization gradually increases as
    pages get shorter,
  • while IRC maintains stable quality.

29
Experiments: Performance
  • Performance of the IRC algorithm versus the
    number of iterations
  • The convergence curve of our iterative algorithm
  • The execution time of the algorithm on different
    data sizes

30
Outline
  • Motivation
  • Related Work
  • Categorization of interrelated objects
  • Experiments
  • Conclusion

31
Conclusion
  • The novelty of our work can be seen in several
    respects.
  • First, we extend traditional classification
    methods to multi-type interrelated data objects.
  • We aim to classify interrelated data objects of
    different types simultaneously, using both their
    content features and their relationships with
    other types of objects.
  • Second, we present a reinforcement algorithm to
    classify interrelated Web data objects on a
    bipartite graph.
  • The categories of one type of object are
    propagated to reinforce the categorization of the
    other interrelated data objects, and vice versa,
    as an iterative process.

32
References
  • [1] D. Beeferman and A. Berger. Agglomerative
    clustering of a search engine query log. In
    Proceedings of the Sixth ACM SIGKDD International
    Conference on Knowledge Discovery and Data
    Mining, pages 407-415, 2000.
  • [2] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced
    hypertext categorization using hyperlinks. In
    Proceedings of the ACM SIGMOD International
    Conference on Management of Data, pages 307-318,
    Seattle, Washington, June 1998.
  • [3] C. Cortes and V. Vapnik. Support vector networks.
    Machine Learning, 20:1-25, 1995.
  • [4] S. L. Chuang and L. F. Chien. Enriching Web
    taxonomies through subject categorization of
    query terms from search engine logs. Decision
    Support Systems, Volume 35, Issue 1, April 2003.
  • [5] H. Cui, J. R. Wen, J. Y. Nie, and W. Y. Ma. Query
    expansion by mining user logs. IEEE Transactions
    on Knowledge and Data Engineering, Vol. 15, No.
    4, July/August 2003.
  • [6] D. Cohn and T. Hofmann. The missing link - a
    probabilistic model of document content and
    hypertext connectivity. In Advances in Neural
    Information Processing Systems 13, pages
    430-436. MIT Press, 2001.
  • [7] S. Dumais and H. Chen. Hierarchical
    classification of Web content. In Proceedings of
    the 23rd Annual International ACM SIGIR
    Conference on Research and Development in
    Information Retrieval, 2000.
  • [8] L. Getoor, N. Friedman, D. Koller, and B. Taskar.
    Learning probabilistic models of relational
    structure. In Proceedings of the 18th
    International Conference on Machine Learning,
    2001.
  • [9] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence,
    D. M. Pennock, and G. W. Flake. Using Web
    structure for classifying and describing Web
    pages. In Proceedings of WWW-02, International
    Conference on the World Wide Web, 2002.
  • [10] G. Grimmett and D. Stirzaker. Probability and
    Random Processes, 2nd ed. Oxford, England: Oxford
    University Press, 1992.
  • [11] C. K. Huang, L. F. Chien, and Y. J. Oyang.
    Relevant term suggestion in interactive Web
    search based on contextual information in query
    session logs. JASIST 54(7):638-649, 2003.
  • [12] G. Jeh and J. Widom. SimRank: a measure of
    structural-context similarity. In Proceedings of
    the Eighth ACM SIGKDD International Conference on
    Knowledge Discovery and Data Mining, pages
    538-543, Edmonton, Canada, July 2002.
  • [13] T. Joachims. Text categorization with support
    vector machines: learning with many relevant
    features. In Proceedings of ECML-98, 10th
    European Conference on Machine Learning, pages
    137-142, Chemnitz, Germany, April 1998.
  • [14] H. J. Oh, S. H. Myaeng, and M. H. Lee. A
    practical hypertext categorization method using
    links and incrementally available class
    information. In Proceedings of the 23rd Annual
    International ACM SIGIR Conference on Research
    and Development in Information Retrieval, pages
    264-271. ACM Press, 2000.
  • [15] Sequential Minimal Optimization,
    http://research.microsoft.com/~jplatt/smo.html.
  • [16] S. Slattery and M. Craven. Discovering test set
    regularities in relational domains. In
    Proceedings of ICML-00, 17th International
    Conference on Machine Learning, pages 895-902,
    Stanford, US, 2000.
  • [17] J. D. Wang, H. J. Zeng, Z. Chen, H. J. Lu, L.
    Tao, and W.-Y. Ma. ReCoM: reinforcement clustering
    of multi-type interrelated data objects. In
    Proceedings of the ACM SIGIR Conference on
    Research and Development in Information
    Retrieval, pages 274-281, Toronto, Canada, July 2003.
  • [18] J. Platt. Probabilistic outputs for support
    vector machines and comparisons to regularized
    likelihood methods. In A. Smola, P. Bartlett, B.
    Schölkopf, and D. Schuurmans, editors, Advances
    in Large Margin Classifiers. MIT Press, 1999.
  • [19] J. R. Wen, J. Y. Nie, and H. J. Zhang. Clustering
    user queries of a search engine. In Proceedings
    of the Tenth International World Wide Web
    Conference, Hong Kong, May 2001.

33
Thanks!