Title: IRC: An Iterative Reinforcement Categorization Algorithm for Interrelated Web Objects
1 IRC: An Iterative Reinforcement Categorization Algorithm for Interrelated Web Objects
- Gui-Rong Xue, Dou Shen, Qiang Yang
- Hua-Jun Zeng, Zheng Chen, Yong Yu and Wei-Ying Ma
2 Outline
- Motivation
- Related Work
- Categorization of interrelated objects
- Experiments
- Conclusion
3 Document Classification
- Definition and Applications
- Approaches
- SVM
- Naïve Bayes
- kNN
- Document Representation
- VSM model
- N-gram model
4 Webpage Classification
- Differences from Document Classification
- Page semi-structure
- Hyperlinks and anchor text
- Challenges
- Noisy content: ads, navigation bars, images, scripts
- No coherent page-construction style, language, or structure
- New Approaches
- Link-based classification
- Summarization-based classification
5 Heterogeneous Web
How can we leverage the inter-relationships between heterogeneous data objects?
[Figure: heterogeneous Web objects and the relationships among them. Labels: Query, Query Thesaurus, Query Session Log, Query Log, Hyperlink, Web-page, Browsing patterns, Human Relationship, User Profile]
6 Virtual-document
- One possible approach: the virtual document
- Augment the features of Web objects by using their interrelated Web objects as additional features
- Yields only a small improvement over the content-based categorization approach
- There is still headroom for harnessing the interrelationships between heterogeneous objects
[Figure: Original Feature Vector, Additional Feature Vector, Virtual Feature Vector]
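The virtual-document idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenizer, the `q:` prefix that keeps query terms apart from page terms, and the function names are all assumptions made for the example.

```python
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts (naive whitespace tokenizer, an assumption)."""
    return Counter(text.lower().split())


def virtual_document(page_text, related_queries, prefix="q:"):
    """Merge a page's original feature vector with an additional vector
    built from its interrelated queries.

    The prefix is a hypothetical device that keeps query-derived features
    distinct from content-derived ones.
    """
    original = term_vector(page_text)
    additional = Counter()
    for query in related_queries:
        for term, count in term_vector(query).items():
            additional[prefix + term] += count
    virtual = Counter(original)
    virtual.update(additional)  # adds the counts of the additional features
    return virtual


vec = virtual_document("cheap flights to paris", ["paris flights", "flights"])
```

The resulting vector can be handed to any content-based classifier in place of the original one, which is why this approach only changes the features and not the classification procedure.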
7 Our Approach
- Iterative Reinforcement Categorization (IRC)
- The category information of one object is reinforced by the category information of all its interrelated objects
- The updated category information of the object in turn reinforces the category information of its interrelated objects
- Iterate until the process converges to a conclusive result
- Goal: fully exploit the relationships among heterogeneous data objects
9 Related Work
- Classification of Web pages
- Content-based
- Joachims [13] proposed using Support Vector Machines (SVMs) to classify documents.
- Dumais and Chen [7] use text representation to organize search results into an existing hierarchical structure.
10 Related Work
- Classification of Web pages
- Link-analysis-based classification
- Handles both the text components of Web pages and the hyperlink relationships between them.
- Cohn et al. [6] and Glover et al. [9]: combine link-based and content-based methods.
- Chakrabarti et al. [2]: probabilistic model.
- Getoor et al. [8]: PRMs (probabilistic relational models).
- Our work is different
- We can classify heterogeneous data objects across different data types.
- Data objects of different types cannot be treated as lying in the same space.
11 Related Work
- Query log analysis
- Beeferman and Berger [1]: iterative clustering on clickthrough data.
- Ignores the content features of the queries and Web pages.
- Wen et al. [19] described a query clustering method.
- Wang et al. [17] proposed using the query clickthrough log to iteratively reinforce query and Web page clusters.
13 Categorization of interrelated objects
- Problem Definition
- Iterative reinforcement algorithm
14 Categorization of interrelated objects: Problem Definition
- The Web can be modeled as a weighted directed bipartite graph G = (V, E)
- The nodes in V represent queries and Web pages
- The edges in E represent the clickthrough information from a query to a browsed Web page
- In this paper, we divide V into two subsets
- Queries Q = {q1, q2, ..., qm}
- Web pages D = {d1, d2, ..., dn}
- A matrix M represents the adjacency matrix; its (i, j)-element is the weight from Web page i to query j
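As an illustration of this definition, the adjacency matrix M can be assembled from clickthrough records. The record layout `(query, page, click_count)` and the use of raw click counts as edge weights are assumptions for this sketch; the slides do not say how the weights are derived.

```python
def build_adjacency(clicks, pages, queries):
    """Return M as a list of rows, where M[i][j] is the weight between
    Web page i and query j.

    Using the click count as the weight is an assumption of this sketch.
    """
    page_index = {d: i for i, d in enumerate(pages)}
    query_index = {q: j for j, q in enumerate(queries)}
    M = [[0.0] * len(queries) for _ in pages]
    for query, page, count in clicks:
        M[page_index[page]][query_index[query]] += count
    return M


pages = ["d1", "d2", "d3"]
queries = ["q1", "q2"]
# Hypothetical clickthrough records: (query, clicked page, number of clicks)
clicks = [("q1", "d1", 3), ("q1", "d2", 1), ("q2", "d2", 2), ("q2", "d3", 5)]
M = build_adjacency(clicks, pages, queries)
```

With 131,788 pages and 199,564 queries, as in the experiments, a dense list-of-lists would be impractical; a sparse representation would be the natural choice at that scale.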
15 Categorization of interrelated objects: Problem Definition
- Given a bipartite graph G = (V, E), the problem is how to classify the Web pages D and queries Q into a set of predefined categories C = {c1, c2, ..., ck}, where k is the number of categories.
[Figure: Web pages and queries grouped under categories C1, C2, C3, ..., Ck]
16 Categorization of interrelated objects: Iterative reinforcement algorithm
- Basic Idea
- We first classify the Web pages according to their content features.
- To fully utilize the relationships in the bipartite graph, we propose a novel iterative reinforcement classification method.
- The basic idea is to propagate the categories computed for one type of object to all related objects by updating their probability of belonging to a certain category.
- This process is performed iteratively until the classification results for all object types converge.
17 Categorization of interrelated objects: Iterative reinforcement algorithm
- We divide the Web pages D into a training set Dt and a testing set Ds.
- We consider all queries as being part of the test set Qs.
[Figure: Web pages Dt, queries Qs, Web pages Ds]
18 Categorization of interrelated objects: Iterative reinforcement algorithm
[Figure: algorithm overview. Step 1: a text classifier takes Web pages Dt as input and produces the categories of Web pages Ds. Step 2: the categories of Web pages Dt and Ds are passed through the inter-relationship to produce the categories of queries Qs. Step 3: iterative propagation.]
19 Categorization of interrelated objects: Iterative reinforcement algorithm
- Step 2: Classifying queries based on a bipartite graph
- Infer the probability of queries belonging to categories according to their interrelated Web pages
[Equation not preserved in this transcript; its terms were labeled "Testing" and "Training"]
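Since the Step 2 equation is not preserved here, the following is only one plausible reading of it: each query's category distribution is taken as the weighted average of the distributions of the pages it links to. The averaging rule and the uniform fallback for isolated queries are assumptions, not the authors' formula.

```python
def classify_queries(M, page_probs):
    """For each query j, average the category distributions of its linked
    pages, weighted by M[i][j], and normalize so the result sums to 1.

    M[i][j]: weight between page i and query j.
    page_probs[i][c]: probability that page i belongs to category c.
    """
    n_queries = len(M[0])
    k = len(page_probs[0])
    query_probs = []
    for j in range(n_queries):
        scores = [0.0] * k
        for i, row in enumerate(M):
            for c in range(k):
                scores[c] += row[j] * page_probs[i][c]
        total = sum(scores)
        if total > 0:
            query_probs.append([s / total for s in scores])
        else:
            # Assumption: a query with no linked pages stays undecided.
            query_probs.append([1.0 / k] * k)
    return query_probs


M = [[3.0, 0.0], [1.0, 2.0], [0.0, 5.0]]
# Pages 0 and 2 have known (training) labels; page 1 is uncertain.
page_probs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
query_probs = classify_queries(M, page_probs)
```

In this toy input, the first query is dominated by clicks on a page of the first category, so most of its probability mass lands there.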
20 Categorization of interrelated objects: Iterative reinforcement algorithm
- Step 3: Classifying Web pages based on a bipartite graph
- Re-classify Web pages through the relationship between queries and pages
[Equation not preserved in this transcript; its terms were labeled "Queries Testing" and "Web pages Content"]
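The Step 3 equation is likewise missing, but its labels suggest that a page's new category distribution combines a content-based term with a term propagated from its queries. The sketch below blends the two linearly; the mixing weight `alpha` and the linear form are hypothetical choices for illustration.

```python
def reclassify_pages(M, content_probs, query_probs, alpha=0.5):
    """Blend each page's content-based distribution with the weighted
    average of the distributions of its linked queries.

    alpha is a hypothetical mixing weight: 1.0 means content only.
    """
    k = len(content_probs[0])
    new_probs = []
    for i, row in enumerate(M):
        link_scores = [0.0] * k
        for j, weight in enumerate(row):
            for c in range(k):
                link_scores[c] += weight * query_probs[j][c]
        total = sum(link_scores)
        if total > 0:
            link = [s / total for s in link_scores]
            blended = [alpha * content_probs[i][c] + (1 - alpha) * link[c]
                       for c in range(k)]
        else:
            # No clickthrough evidence: keep the content-based result.
            blended = list(content_probs[i])
        new_probs.append(blended)
    return new_probs


M = [[3.0, 0.0], [1.0, 2.0], [0.0, 5.0]]
content_probs = [[0.6, 0.4], [0.5, 0.5], [0.2, 0.8]]
query_probs = [[0.875, 0.125], [0.25, 0.75]]
page_probs = reclassify_pages(M, content_probs, query_probs)
```

Because both inputs are probability distributions, the blended result is again a distribution, so it can be fed straight back into the query classification step.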
21 Categorization of interrelated objects: Iterative reinforcement algorithm
- Steps 2-3: Iterative reinforcement categorization
- Then, [equations not preserved in this transcript]
- After several iterations, PS and RS reach a fixed point.
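The alternation of Steps 2 and 3 until a fixed point is reached can be sketched as a loop. This is an assumption-laden illustration rather than the authors' algorithm: the weighted-average propagation, the mixing weight `alpha`, the convergence tolerance `tol`, and the choice to hold training pages fixed are all hypothetical.

```python
def propagate(weights_by_target, source_probs, k):
    """Weighted average of source distributions for each target object."""
    result = []
    for weights in weights_by_target:
        scores = [sum(w * source_probs[s][c] for s, w in enumerate(weights))
                  for c in range(k)]
        total = sum(scores)
        result.append([x / total for x in scores] if total > 0
                      else [1.0 / k] * k)
    return result


def irc(M, content_probs, is_training, alpha=0.5, tol=1e-6, max_iter=100):
    """Alternate query and page classification until page probabilities
    change by less than tol (a fixed point, up to tolerance)."""
    k = len(content_probs[0])
    M_T = [list(col) for col in zip(*M)]  # rows are queries, columns pages
    page_probs = [list(p) for p in content_probs]
    query_probs = []
    iteration = 0
    for iteration in range(1, max_iter + 1):
        query_probs = propagate(M_T, page_probs, k)   # Step 2
        link_probs = propagate(M, query_probs, k)     # Step 3
        updated = []
        for i in range(len(M)):
            if is_training[i] or sum(M[i]) == 0:
                # Known labels and unlinked pages are left unchanged.
                updated.append(list(content_probs[i]))
            else:
                updated.append([alpha * content_probs[i][c]
                                + (1 - alpha) * link_probs[i][c]
                                for c in range(k)])
        change = max(abs(a - b)
                     for old, new in zip(page_probs, updated)
                     for a, b in zip(old, new))
        page_probs = updated
        if change < tol:
            break
    return page_probs, query_probs, iteration


M = [[3.0, 0.0], [1.0, 2.0], [0.0, 5.0]]
content_probs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
is_training = [True, False, True]
page_probs, query_probs, iterations = irc(M, content_probs, is_training)
```

Under these assumptions each pass is a weighted average anchored to the fixed content term, so the updates shrink geometrically and the loop stops well before `max_iter` on this toy input.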
23 Experiments: Data set
- A set of classified Web pages extracted from the Open Directory Project (ODP) (http://dmoz.org/)
- A real MSN query clickthrough log
- We use the pages that appear in both the ODP data set and the query clickthrough log
- 131,788 Web pages in 15 top-level categories
- 199,564 associated queries
- 468,696 relationships between Web pages and queries
24 Experiments: Feature selection, Classifier and Evaluation Criteria
- A simple feature selection method, Information Gain (IG) [20], is applied in our experiments
- We use an SVM as the content-based classifier
- Results are evaluated using the conventional precision, recall, and F1 measures
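For readers unfamiliar with Information Gain as a feature selection score, here is a minimal sketch of the standard definition: the entropy of the category labels minus the expected entropy after splitting documents on the presence of a term. The toy documents and labels are invented for the example, and the slides do not state how many top-scoring terms were kept.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of a term: H(C) minus the expected entropy of C given whether
    the term is present. docs is a list of term sets, parallel to labels."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional


# Hypothetical toy corpus
docs = [{"goal", "match"}, {"goal", "team"},
        {"stock", "market"}, {"market", "team"}]
labels = ["sports", "sports", "business", "business"]
ig_goal = information_gain(docs, labels, "goal")
ig_team = information_gain(docs, labels, "team")
```

Terms are then ranked by this score and the highest-ranked ones are kept as features; here "goal" separates the two categories perfectly, while "team" carries no information about the category.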
25 Experiments: Performance
- Baseline
- Content-based classification method
- Comparison: the interrelated queries can be taken as additional features for their corresponding pages
- Query-metadata-based classification
- Uses only the query metadata directly as features of the Web page
- Virtual-document-based classification
- Uses both Web page content and query metadata
26 Experiments: Performance
- IRC achieves higher performance than the other three methods in the comparison.
- Improvement in micro-averaged F1
- Over the content-based method: 26.4%
- Over the query metadata method: 21%
- Over the virtual document method: 16.4%
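As a reminder of how a micro-averaged F1 score is obtained, the sketch below pools true positives, false positives, and false negatives over all categories before computing precision and recall. The label lists are invented for illustration and are unrelated to the reported results.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1: pool per-category counts, then compute
    precision, recall, and their harmonic mean."""
    categories = set(true_labels) | set(predicted_labels)
    tp = fp = fn = 0
    for c in categories:
        for t, p in zip(true_labels, predicted_labels):
            if p == c and t == c:
                tp += 1   # correctly assigned to c
            elif p == c:
                fp += 1   # assigned to c, but belongs elsewhere
            elif t == c:
                fn += 1   # belongs to c, but assigned elsewhere
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for four pages
true_labels = ["arts", "arts", "business", "sports"]
predicted_labels = ["arts", "business", "business", "sports"]
score = micro_f1(true_labels, predicted_labels)
```

When every object receives exactly one category, as in this toy case, micro-averaged precision, recall, and F1 all coincide with plain accuracy.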
27 Experiments: Performance
- The effect of increasing the clickthrough data size
28 Experiments: Performance
- Page length has an important effect on the performance of content-based classification
- The error rate of content-based categorization gradually increases as pages get shorter
- IRC, in contrast, maintains stable quality
29 Experiments: Performance
- The performance of the IRC algorithm as a function of the number of iterations
- The convergence curve of our iterative algorithm
- The execution time of the algorithm on different data sizes
31 Conclusion
- The novelty of our work can be seen from several aspects.
- First, we extend traditional classification methods to multi-type interrelated data objects.
- We aim to classify interrelated data objects of different types simultaneously, using both their content features and their relationships with other types of objects.
- Second, we present a reinforcement algorithm to classify interrelated Web data objects on a bipartite graph.
- The category of one type is propagated to reinforce the categorization of other interrelated data objects, and vice versa, as an iterative process.
32 References
- [1] D. Beeferman and A. Berger. Agglomerative clustering of a search engine query log. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 407-415, 2000.
- [2] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 307-318, Seattle, Washington, June 1998.
- [3] C. Cortes and V. Vapnik. Support Vector Networks. Machine Learning, 20:1-25, 1995.
- [4] S. L. Chuang and L. F. Chien. Enriching Web taxonomies through subject categorization of query terms from search engine logs. Decision Support System, Volume 35, Issue 1, April 2003.
- [5] H. Cui, J. R. Wen, J. Y. Nie, and W. Y. Ma. Query Expansion by Mining User Logs. IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 4, July/August 2003.
- [6] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, pages 430-436. MIT Press, 2001.
- [7] S. Dumais and H. Chen. Hierarchical Classification of Web Content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000.
- [8] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning Probabilistic Models of Relational Structure. In Proceedings of the 18th International Conference on Machine Learning, 2001.
- [9] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence, D. M. Pennock, and G. W. Flake. Using Web structure for classifying and describing Web pages. In Proceedings of WWW-02, International Conference on the World Wide Web, 2002.
- [10] G. Grimmett and D. Stirzaker. Probability and Random Processes, 2nd ed. Oxford, England: Oxford University Press, 1992.
- [11] C. K. Huang, L. F. Chien, and Y. J. Oyang. Relevant term suggestion in interactive Web search based on contextual information in query session logs. JASIST 54(7): 638-649, 2003.
- [12] G. Jeh and J. Widom. SimRank: A Measure of Structural-Context Similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538-543, Edmonton, Canada, July 2002.
- [13] T. Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, pages 137-142, Chemnitz, Germany, April 1998.
- [14] H. J. Oh, S. H. Myaeng, and M. H. Lee. A practical hypertext categorization method using links and incrementally available class information. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 264-271. ACM Press, 2000.
- [15] Sequential Minimal Optimization, http://research.microsoft.com/jplatt/smo.html.
- [16] S. Slattery and M. Craven. Discovering test set regularities in relational domains. In Proceedings of ICML-00, 17th International Conference on Machine Learning, pages 895-902, Stanford, US, 2000.
- [17] J. D. Wang, H. J. Zeng, Z. Chen, H. J. Lu, L. Tao, and W.-Y. Ma. ReCoM: reinforcement clustering of multi-type interrelated data objects. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 274-281, Toronto, CA, July 2003.
- [18] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. Smola, P. Bartlett, B. Scholkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 1999.
- [19] J. R. Wen, J. Y. Nie, and H. J. Zhang. Clustering user queries of a search engine. In Proceedings of the Tenth International World Wide Web Conference, Hong Kong, May 2001.
33 Thanks!