IRC: An Iterative Reinforcement Categorization Algorithm for Interrelated Web Objects


1
IRC: An Iterative Reinforcement Categorization Algorithm for Interrelated Web Objects

Gui-Rong Xue, Dou Shen, Qiang Yang
Hua-Jun Zeng, Zheng Chen, Yong Yu and Wei-Ying Ma

2
Outline

Motivation
Related Work
Categorization of interrelated objects
Experiments
Conclusion

3
Document Classification

Definition and Applications
Approaches
SVM
Naïve Bayes
kNN
Document Representation
VSM model
N-gram model

4
Webpage Classification

Differences from Document Classification
Page semi-structure
Hyperlinks and anchor text
Challenges
Noisy content: ads, navigation bars, images, scripts
No coherent page-construction style, language, or structure
New Approaches
Link-based classification
Summarization-based classification

5
Heterogeneous Web
[Figure: heterogeneous Web objects and relationships, labeled Query, Query Thesaurus, Query Session Log, Query Log, Hyperlink, Web-page, Browsing patterns, Human Relationship, User Profile]
How to leverage the inter-relationship between the heterogeneous data objects?
6
Virtual-document

One possible approach: the virtual document
Augment the features of a Web object by using its interrelated Web objects as additional features
This yields only a small improvement over the content-based categorization approach
There is still headroom for harnessing the interrelationships between heterogeneous objects

[Figure: the original feature vector and the additional feature vector together form the virtual feature vector]
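
The virtual-document idea can be illustrated with a minimal Python sketch. The function name `build_virtual_vector`, the term-count representation, and the `weight` parameter are assumptions made for illustration; the slides do not say how the additional features are weighted.

```python
from collections import Counter

def build_virtual_vector(page_terms, related_query_terms, weight=1.0):
    """Merge a page's own term counts with terms from its related queries.

    page_terms: tokens from the page content (original feature vector).
    related_query_terms: one token list per related query
        (source of the additional feature vector).
    weight: hypothetical scaling factor for the additional features.
    """
    original = Counter(page_terms)
    additional = Counter()
    for terms in related_query_terms:
        additional.update(terms)

    # The virtual vector is the original vector plus the weighted additions.
    virtual = {term: float(count) for term, count in original.items()}
    for term, count in additional.items():
        virtual[term] = virtual.get(term, 0.0) + weight * count
    return virtual

page = ["jaguar", "speed", "engine"]
queries = [["jaguar", "car"], ["sports", "car"]]
virtual_vector = build_virtual_vector(page, queries)
```

A single classifier is then trained on the virtual vectors, which is why this approach treats queries only as extra page features rather than as objects to be classified themselves.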
7
Our Approach

Iterative Reinforcement Categorization (IRC)
The category information of one object is reinforced by the category information of all its interrelated objects
The updated category information of the object in turn reinforces the category information of its interrelated objects
The process repeats until it converges to a conclusive result
Goal: fully exploit the relationships among the heterogeneous data objects

8
Outline

Motivation
Related Work
Categorization of interrelated objects
Experiments
Conclusion

9
Related Work

Classification on the Web pages
Content-based
Joachims [13] proposed using Support Vector Machines (SVMs) to classify documents.
Dumais and Chen [7] use text representation to organize search results into an existing hierarchical structure.

10
Related Work

Classification on the Web pages
Link-analysis-based classification
Handles both the text components of Web pages and their hyperlink relationships.
Cohn et al. [6] and Glover et al. [9] combine link-based and content-based methods.
Chakrabarti et al. [2]: probabilistic model.
Getoor et al. [8]: probabilistic relational models (PRMs).
Our work is different
We classify heterogeneous data objects across different data types.
Data objects of different types cannot be treated as lying in the same space.

11
Related Work

Query log analysis
Beeferman and Berger [1]: iterative clustering on clickthrough data.
Ignores the content features of the queries and Web pages.
Wen et al. [19] described a query clustering method.
Wang et al. [17] proposed using a query clickthrough log to iteratively reinforce query and Web page clusters.

12
Outline

Motivation
Related Work
Categorization of interrelated objects
Experiments
Conclusion

13
Categorization of interrelated objects

Problem Definition
Iterative reinforcement algorithm

14
Categorization of interrelated objects: Problem Definition

The Web can be modeled as a weighted directed bipartite graph $G = (V, E)$
The nodes in $V$ represent queries and Web pages
The edges in $E$ represent the clickthrough information from a query to a browsed Web page
In this paper, we divide $V$ into two subsets
Queries $Q = \{q_1, q_2, \ldots, q_m\}$
Web pages $D = \{d_1, d_2, \ldots, d_n\}$
A matrix $M$ represents the adjacency matrix, whose $(i, j)$-element is the weight from Web page $i$ to query $j$
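
As a rough illustration of this representation, the sketch below builds the page-by-query matrix from aggregated clickthrough records. The record format `(query_index, page_index, count)` and the use of raw click counts as edge weights are assumptions; the slides do not state how the weights are computed.

```python
import numpy as np

def build_click_matrix(clicks, num_pages, num_queries):
    """Build M where M[i, j] is the click weight between page i and query j.

    clicks: iterable of (query_index, page_index, count) tuples, a
        hypothetical log format chosen for this sketch.
    """
    M = np.zeros((num_pages, num_queries))
    for query_idx, page_idx, count in clicks:
        M[page_idx, query_idx] += count
    return M

# Aggregated clicks: query 0 -> page 0 (2 clicks), query 1 -> page 0 (1 click),
# query 1 -> page 2 (3 clicks). Page 1 was never clicked.
clicks = [(0, 0, 2), (1, 0, 1), (1, 2, 3)]
M = build_click_matrix(clicks, num_pages=3, num_queries=2)
```

For a data set of the size reported later in the slides, a sparse matrix would be the more practical choice; a dense array is used here only to keep the example short.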

15
Categorization of interrelated objects: Problem Definition

Given a bipartite graph $G = (V, E)$, the problem is how to classify the Web pages $D$ and queries $Q$ into a set of predefined categories $C = \{c_1, c_2, \ldots, c_k\}$, where $k$ is the number of categories.

[Figure: Web pages and queries grouped under categories C1, C2, C3, ..., Ck]
16
Categorization of interrelated objects: Iterative reinforcement algorithm

Basic Idea
We first classify the Web pages according to their content features.
To fully utilize the relationships in the bipartite graph, we propose a novel iterative reinforcement classification method.
The basic idea is to propagate the categories computed for one type of object to all related objects by updating their probability of belonging to each category.
This process is performed iteratively until the classification results for all object types converge.

17
Categorization of interrelated objects: Iterative reinforcement algorithm

We divide the Web pages $D$ into a training set $D_t$ and a testing set $D_s$.
We treat all queries as part of the test set $Q_s$.

[Figure: training Web pages $D_t$ and testing Web pages $D_s$ linked to queries $Q_s$]
18
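Categorization of interrelated objects: Iterative reinforcement algorithm

[Figure: algorithm overview. Input: Web pages $D_t$, Web pages $D_s$, and queries $Q_s$. Step 1: a text classifier trained on $D_t$ produces the categories of Web pages $D_s$. Step 2: the categories of Web pages $D_t$ and $D_s$ are passed through the inter-relationship to produce the categories of queries $Q_s$. Step 3: iterative propagation between the query and Web page categories.]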
19
Categorization of interrelated objects: Iterative reinforcement algorithm

Step 2: Classifying queries based on the bipartite graph
Infer the probability of each query belonging to each category from its interrelated Web pages

[Figure: queries linked to training and testing Web pages]
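
The exact update formula for Step 2 is not preserved in this transcript. One plausible reading, sketched below purely as an assumption, is that each query takes the click-weighted average of the category probabilities of the pages it leads to. The name `classify_queries` is hypothetical.

```python
import numpy as np

def classify_queries(M, page_probs):
    """Estimate query category probabilities from clicked pages.

    M: (n_pages, n_queries) click-weight matrix.
    page_probs: (n_pages, k) category probabilities of the pages.
    Returns an (n_queries, k) matrix; a query with no clicks gets all zeros.
    """
    weights = M.T                                  # (n_queries, n_pages)
    totals = weights.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0                      # avoid division by zero
    return (weights / totals) @ page_probs

M = np.array([[2.0, 1.0],
              [0.0, 0.0],
              [0.0, 3.0]])
page_probs = np.array([[0.9, 0.1],
                       [0.5, 0.5],
                       [0.2, 0.8]])
query_probs = classify_queries(M, page_probs)
```

Query 0 is linked only to page 0, so it inherits that page's distribution. Query 1 mixes pages 0 and 2 in proportion to their click counts.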
20
Categorization of interrelated objects: Iterative reinforcement algorithm

Step 3: Classifying Web pages based on the bipartite graph
Re-classify the Web pages through the relationship between queries and pages

[Figure: testing Web pages, with evidence from queries and from Web page content]
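
The transcript does not preserve how Step 3 combines the two sources of evidence. The sketch below assumes a simple linear blend of the content-based probabilities with the click-weighted average of the query probabilities. Both the blend and the mixing weight `alpha` are hypothetical choices, not the authors' stated formula.

```python
import numpy as np

def reclassify_pages(M, query_probs, content_probs, alpha=0.5):
    """Blend content-based page probabilities with evidence from queries.

    M: (n_pages, n_queries) click-weight matrix.
    alpha: hypothetical mixing weight for the content-based estimate.
    Pages with no clicked queries keep their content-based probabilities.
    """
    totals = M.sum(axis=1, keepdims=True)           # (n_pages, 1)
    has_clicks = totals[:, 0] > 0
    safe_totals = np.where(totals == 0, 1.0, totals)
    from_queries = (M / safe_totals) @ query_probs  # (n_pages, k)
    blended = alpha * content_probs + (1 - alpha) * from_queries
    return np.where(has_clicks[:, None], blended, content_probs)

M = np.array([[2.0, 1.0],
              [0.0, 0.0],
              [0.0, 3.0]])
content_probs = np.array([[0.9, 0.1],
                          [0.5, 0.5],
                          [0.2, 0.8]])
query_probs = np.array([[0.9, 0.1],
                        [0.375, 0.625]])
new_page_probs = reclassify_pages(M, query_probs, content_probs)
```

Page 1 has no clickthrough links, so only its content-based estimate is available and it is returned unchanged.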
21
Categorization of interrelated objects: Iterative reinforcement algorithm

Steps 2-3: Iterative reinforcement categorization
After several iterations, PS and RS reach a fixed point.
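
A minimal sketch of the alternating loop follows. The update equations on this slide are not preserved in the transcript, so the averaging and blending rules, the mixing weight `alpha`, the stopping tolerance `tol`, and all function and variable names here are assumptions. The slide's PS and RS are not defined in the transcript and are not modeled by name.

```python
import numpy as np

def row_normalize(A):
    """Scale each row to sum to 1, leaving all-zero rows unchanged."""
    totals = A.sum(axis=1, keepdims=True)
    return A / np.where(totals == 0, 1.0, totals)

def iterative_reinforcement(M, content_probs, is_training, alpha=0.5,
                            tol=1e-6, max_iter=100):
    """Alternate query and page updates until page probabilities stabilize.

    is_training: boolean mask; labeled training pages are held fixed.
    Returns (page_probs, query_probs, iterations_used).
    """
    page_to_query = row_normalize(M.T)   # each query averages its pages
    query_to_page = row_normalize(M)     # each page averages its queries
    update = (M.sum(axis=1) > 0) & ~is_training
    page_probs = content_probs.copy()
    for iteration in range(1, max_iter + 1):
        query_probs = page_to_query @ page_probs            # Step 2
        from_queries = query_to_page @ query_probs          # Step 3
        new_probs = page_probs.copy()
        new_probs[update] = (alpha * content_probs[update]
                             + (1 - alpha) * from_queries[update])
        change = np.abs(new_probs - page_probs).max()
        page_probs = new_probs
        if change < tol:
            break
    return page_probs, query_probs, iteration

M = np.array([[2.0, 1.0],
              [0.0, 0.0],
              [0.0, 3.0]])
content_probs = np.array([[1.0, 0.0],    # training page with a known label
                          [0.5, 0.5],
                          [0.2, 0.8]])
is_training = np.array([True, False, False])
page_probs, query_probs, iterations = iterative_reinforcement(
    M, content_probs, is_training)
```

Under these assumed rules each pass shrinks the remaining change by at least a factor of `1 - alpha`, so the loop reaches its tolerance well before `max_iter` on this toy input.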

22
Outline

Motivation
Related Work
Categorization of interrelated objects
Experiments
Conclusion

23
Experiments: Data set

A set of classified Web pages extracted from the Open Directory Project (ODP) (http://dmoz.org/)
A real MSN query clickthrough log
We use the pages common to both the ODP data set and the query clickthrough log
131,788 Web pages in 15 top-level categories
199,564 associated queries
468,696 relationships between Web pages and queries

24
Experiments: Feature selection, Classifier and Evaluation Criteria

A simple feature selection method, Information Gain (IG) [20], is applied in our experiments
We use an SVM as the content-based classifier
Results are evaluated using the conventional precision, recall, and F1 measures
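
For reference, micro-averaged precision, recall, and F1 can be computed by pooling the per-category counts, as in the sketch below. The function name and the toy labels are illustrative only and do not come from the experiments.

```python
def micro_prf(true_labels, predicted_labels, categories):
    """Micro-averaged precision, recall, and F1 over single-label predictions."""
    tp = fp = fn = 0
    for category in categories:
        for truth, guess in zip(true_labels, predicted_labels):
            if guess == category and truth == category:
                tp += 1          # correctly assigned to this category
            elif guess == category:
                fp += 1          # wrongly assigned to this category
            elif truth == category:
                fn += 1          # missed member of this category
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

true_labels = ["Arts", "Sports", "Arts", "News"]
predicted_labels = ["Arts", "Arts", "Arts", "News"]
precision, recall, f1 = micro_prf(
    true_labels, predicted_labels, ["Arts", "Sports", "News"])
```

When every item has exactly one true label and one predicted label, as here, the three micro-averaged values coincide with plain accuracy.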

25
Experiments: Performance

Baseline
Content-based classification method
Comparison: the interrelated queries are taken as additional features for their corresponding pages
Query-metadata-based classification
Uses only the query metadata as features of the Web page
Virtual-document-based classification
Uses both Web page content and query metadata

26
Experiments: Performance

IRC achieves higher performance than the other three methods.
Improvement in micro-averaged F1
Over the content method by 26.4%
Over the query metadata method by 21%
Over the virtual document method by 16.4%

27
Experiments: Performance

Effect of increasing the clickthrough data size

28
Experiments: Performance

Page length has an important effect on the performance of classification based on content features
The error rate of content-based categorization gradually increases as pages get shorter
IRC, in contrast, maintains stable quality

29
Experiments: Performance

Performance of the IRC algorithm versus the number of iterations
Convergence curve of the iterative algorithm
Execution time of the algorithm on different data sizes

30
Outline

Motivation
Related Work
Categorization of interrelated objects
Experiments
Conclusion

31
Conclusion

The novelty of our work can be seen from several aspects.
First, we extend traditional classification methods to multi-type interrelated data objects.
We aim to classify interrelated data objects of different types simultaneously, using both their content features and their relationships with other types of objects.
Second, we present a reinforcement algorithm to classify interrelated Web data objects on a bipartite graph.
The category of one type is propagated to reinforce the categorization of other interrelated data objects, and vice versa, as an iterative process.

32
References

[1] D. Beeferman and A. Berger. Agglomerative clustering of a search engine query log. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 407-415, 2000.
[2] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 307-318, Seattle, Washington, June 1998.
[3] C. Cortes and V. Vapnik. Support Vector Networks. Machine Learning, 20:1-25, 1995.
[4] S. L. Chuang and L. F. Chien. Enriching Web taxonomies through subject categorization of query terms from search engine logs. Decision Support Systems, Volume 35, Issue 1, April 2003.
[5] H. Cui, J. R. Wen, J. Y. Nie, and W. Y. Ma. Query Expansion by Mining User Logs. IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 4, July/August 2003.
[6] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, pages 430-436. MIT Press, 2001.
[7] S. Dumais and H. Chen. Hierarchical Classification of Web Content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000.
[8] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning Probabilistic Models of Relational Structure. In Proceedings of the 18th International Conference on Machine Learning, 2001.
[9] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence, D. M. Pennock, and G. W. Flake. Using Web structure for classifying and describing Web pages. In Proceedings of WWW-02, International Conference on the World Wide Web, 2002.
[10] G. Grimmett and D. Stirzaker. Probability and Random Processes, 2nd ed. Oxford, England: Oxford University Press, 1992.
[11] C. K. Huang, L. F. Chien, and Y. J. Oyang. Relevant term suggestion in interactive Web search based on contextual information in query session logs. JASIST 54(7):638-649, 2003.
[12] G. Jeh and J. Widom. SimRank: A Measure of Structural-Context Similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538-543, Edmonton, Canada, July 2002.
[13] T. Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, pages 137-142, Chemnitz, Germany, April 1998.
[14] H. J. Oh, S. H. Myaeng, and M. H. Lee. A practical hypertext categorization method using links and incrementally available class information. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 264-271. ACM Press, 2000.
[15] Sequential Minimal Optimization, http://research.microsoft.com/~jplatt/smo.html.
[16] S. Slattery and M. Craven. Discovering test set regularities in relational domains. In Proceedings of ICML-00, 17th International Conference on Machine Learning, pages 895-902, Stanford, US, 2000.
[17] J. D. Wang, H. J. Zeng, Z. Chen, H. J. Lu, L. Tao, and W.-Y. Ma. ReCoM: reinforcement clustering of multi-type interrelated data objects. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 274-281, Toronto, CA, July 2003.
[18] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 1999.
[19] J. R. Wen, J. Y. Nie, and H. J. Zhang. Clustering user queries of a search engine. In Proceedings of the Tenth International World Wide Web Conference, Hong Kong, May 2001.