Title: Topic Distillation and Web Page Categorization
1 Topic Distillation and Web Page Categorization
- Prasanna K. Desikan
- (05/29/2002)
2 Motivation
- The web is a huge repository of information.
- Categorizing web documents facilitates the
search and retrieval of pages.
- Topic distillation is the process of finding
authoritative Web pages and comprehensive hubs
which reciprocally endorse each other and are
relevant to a given query.
3 Approaches for Categorization
- Text-based categorization
- Structure- or link-based categorization
- Combination of link and text information
4 Web Page Categorization Algorithms
- Manual categorization by domain-specific experts.
- Categorization involves analysis of the
contents of the web page by a number of domain
experts and classification based on the textual
content, as done by Yahoo.
- Content-based categorization: based solely on document
content or on a combination of document content and
META tags.
- To classify a document, all the stop words are
removed and the remaining keywords/phrases are
represented in the form of a feature vector.
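The content-based step above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the stop-word list, the vocabulary, and the function name `feature_vector` are all hypothetical, and raw term counts stand in for whatever weighting a real classifier would use.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def feature_vector(text, vocabulary):
    """Return term counts over `vocabulary` after dropping stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document text.
vocab = ["web", "page", "cheese"]
vec = feature_vector("The web page links to a page of cheese", vocab)
```

Each position of `vec` corresponds to one vocabulary term, so documents can be compared or passed to a classifier as fixed-length vectors.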
5 Web Page Categorization Algorithms
- Link and Content Analysis.
- Based on the fact that a web page that refers to
a document must contain enough hints about its
content to induce someone to read it. Such hints
can be used to classify the document being
referred to.
6 Topic Distillation in Hyperlinked Environment [1]
- Aim: To find quality documents related to a query
topic.
- Problems encountered with the HITS approach:
- Mutually reinforcing relationships between hosts.
- Automatically generated links.
- Non-relevant nodes (documents not relevant to the
query topic).
7 Topic Distillation in Hyperlinked Environment [1]
- Let the Web be represented as a graph in which
each node is a web page and each edge is a link.
- Approaches
- If there are k edges (an edge here is a link)
from documents on a first host to a single
document on a second host, we give each edge an
authority weight of 1/k.
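The 1/k rule above can be sketched as follows. This is an illustration under stated assumptions: the function name `authority_edge_weights` and the example URLs are hypothetical, hosts are taken from the URL's hostname, and duplicate links are counted once. Links within the same host are not treated specially here.

```python
from collections import Counter
from urllib.parse import urlparse

def authority_edge_weights(edges):
    """Give each (source_url, target_url) edge the weight 1/k, where k is the
    number of edges from the source's host to that same target document."""
    edges = set(edges)  # count each distinct link once
    k = Counter((urlparse(src).hostname, dst) for src, dst in edges)
    return {(src, dst): 1.0 / k[(urlparse(src).hostname, dst)]
            for src, dst in edges}

# Hypothetical example: two pages on one host and one page on another host
# all link to the same document.
links = [
    ("http://a.example/p1", "http://c.example/doc"),
    ("http://a.example/p2", "http://c.example/doc"),
    ("http://b.example/p1", "http://c.example/doc"),
]
weights = authority_edge_weights(links)
```

With these weights, the two links from `a.example` together contribute as much authority as the single link from `b.example`, which is the point of the rule: one host cannot inflate a document's score by linking to it many times.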
8 Topic Distillation in Hyperlinked Environment [1]
- Approaches (contd.)
- Compute the relevance weight for each node.
- Eliminate non-relevant nodes from the graph by
setting a threshold on the relevance weight.
- Regulate the influence of a node based on its
relevance.
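A minimal sketch of the two ideas above, assuming the relevance weights have already been computed elsewhere (the slides do not say how). The function names, the threshold value, and the choice to scale each hub's contribution by its relevance weight are assumptions for illustration, one plausible reading rather than the exact formulation in [1].

```python
def prune_by_relevance(edges, relevance, threshold):
    """Keep only edges whose endpoints both meet the relevance threshold."""
    keep = {node for node, w in relevance.items() if w >= threshold}
    return [(p, q) for p, q in edges if p in keep and q in keep]

def weighted_authority_step(edges, hub, relevance):
    """One authority update in which each linking node's hub score is
    scaled by that node's relevance weight (assumed regulation scheme)."""
    auth = {}
    for p, q in edges:
        auth[q] = auth.get(q, 0.0) + hub[p] * relevance[p]
    return auth

# Hypothetical graph and relevance weights.
edges = [("a", "c"), ("b", "c"), ("a", "b")]
relevance = {"a": 0.9, "b": 0.1, "c": 0.5}
pruned = prune_by_relevance(edges, relevance, threshold=0.3)
auth = weighted_authority_step(edges, {"a": 1.0, "b": 1.0, "c": 1.0}, relevance)
```

Pruning removes low-relevance nodes outright, while the weighted update keeps them in the graph but lets them contribute less.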
9 Topic Distillation in Hyperlinked Environment [1]
- Approaches (contd.)
- Partial Content Analysis.
- Content pruning by analyzing only a part of the
graph, i.e. the nodes which are most influential
in the outcome.
10 Automatic Resource Compilation [2]
- Goal: Automatically compile a resource list on
any topic that is broad and well-represented on
the Web.
- Approach
- A search-and-growth phase.
- A weighting phase.
- w(p,q) = 1 + n(t)
- w(p,q): measure of the authority on the topic
invested by page p in page q.
- n(t): number of matches for terms of the
topic description in the anchor window of width
B.
- An iteration-and-reporting phase.
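The weighting phase can be sketched as follows. This is an illustration only: the function name `anchor_weight`, the character-offset interface, the default B = 50, and the choice to apply the window on each side of the anchor are assumptions, and matching is done on whole lower-cased words.

```python
import re

def anchor_weight(page_text, anchor_start, anchor_end, topic_terms, B=50):
    """Return w(p,q) = 1 + n(t), with n(t) counted in a window of B characters
    on each side of the anchor spanning [anchor_start, anchor_end)."""
    lo = max(0, anchor_start - B)
    window = page_text[lo:anchor_end + B].lower()
    words = re.findall(r"[a-z0-9]+", window)
    n_t = sum(words.count(term.lower()) for term in topic_terms)
    return 1 + n_t

# Hypothetical page text with one anchor, "Fromages.com".
text = "Great French cheese shop: Fromages.com sells cheese online"
start = text.index("Fromages.com")
end = start + len("Fromages.com")
w_wide = anchor_weight(text, start, end, ["cheese"])         # window covers whole text
w_narrow = anchor_weight(text, start, end, ["cheese"], B=0)  # anchor text only
```

A link with no topic terms nearby still gets weight 1, so every edge contributes something, and topical anchor text raises the weight above that baseline.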
11 Relaxation Labeling Technique [3]
- First, classify the unclassified documents from
the neighborhood (using a terms-only classifier,
i.e. using the text from the neighboring
documents).
- Iterate until convergence.
- Recompute the class for each document using both
the local text and the class information of the
neighbors.
- The relaxation is guaranteed to converge to a
consistent state.
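The iteration above can be sketched with a deliberately simplified stand-in: instead of the probabilistic model in [3], each neighbor casts a vote for its current class, and the votes are added to the document's own text scores. The function name, the `link_weight` parameter, and the iteration cap are assumptions for illustration. The cap is there because this toy synchronous update is not guaranteed to settle.

```python
def relaxation_labeling(text_scores, neighbors, link_weight=0.5, max_iters=100):
    """text_scores: {doc: {cls: score}} from a terms-only classifier.
    neighbors: {doc: [doc, ...]}. Returns {doc: cls}."""
    # Step 1: initial labels from text alone.
    labels = {d: max(s, key=s.get) for d, s in text_scores.items()}
    for _ in range(max_iters):
        new_labels = {}
        for d, scores in text_scores.items():
            combined = dict(scores)
            nbrs = neighbors.get(d, [])
            # Step 2: add neighbor class votes to the local text evidence.
            for n in nbrs:
                cls = labels[n]
                combined[cls] = combined.get(cls, 0.0) + link_weight / len(nbrs)
            new_labels[d] = max(combined, key=combined.get)
        if new_labels == labels:  # no label changed: stop
            break
        labels = new_labels
    return labels

# Hypothetical scores: "c" leans "ski" on text alone, but both neighbors are "cheese".
text_scores = {
    "a": {"cheese": 0.9, "ski": 0.1},
    "b": {"cheese": 0.8, "ski": 0.2},
    "c": {"cheese": 0.45, "ski": 0.55},
}
neighbors = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}
labels = relaxation_labeling(text_scores, neighbors)
```

In this example the text-only label of "c" is overturned by the class information of its neighbors, which is the effect the technique relies on.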
12 Probabilistic Relational Model [4]
- Web pages and links are modeled as entities and
relationships respectively, and each of them is
represented as a class.
- Create a Bayesian network using the attributes from
the entity-relationship model in order to model
uncertainty and make inferences.
13 Probabilistic Relational Model
- By belief propagation, an approximate inference
approach, we can use our prior knowledge to infer
the unobserved cases.
- Given new data with some unobserved variables,
first assign the most likely values to them.
- Based on the estimates of those marginal
probabilities, we predict the
classification.
14 Probabilistic Relational Model
- This approach proved effective when applied
to the hypertext classification problem. By utilizing
information from both the content and the link
structure, it provides more accurate
classification and the ability to do probabilistic
reasoning.
15 Integrating the DOM With Hyperlinks for Enhanced
Topic Distillation [6]
- A uniform fine-grained model.
- Web pages are represented by their tag trees
(also called their Document Object Models,
or DOMs).
- DOM trees are interconnected by ordinary
hyperlinks.
- Disaggregates mixed hubs.
16 A new fine-grained model [7]
<html><body>
<table>
  <tr><td>
    <table>
      <tr><td><a href="http://art.qaz.com">art</a></td></tr>
      <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>
    </table>
  </td></tr>
  <tr><td>
    <ul>
      <li><a href="http://www.fromages.com">Fromages.com</a> French cheese</li>
      <li><a href="http://www.teddingtoncheese.co.uk">Teddington</a> Buy online</li>
    </ul>
  </td></tr>
</table>
</body></html>
17 Integrating the DOM With Hyperlinks for Enhanced
Topic Distillation
- Figure 6 The fine-grained model of Web linkage
which unifies hyperlinks and DOM structure
18 Integrating the DOM With Hyperlinks for Enhanced
Topic Distillation
- Benefits
- Reduces topic drift.
- Identifies and extracts regions (DOM subtrees)
relevant to the query out of the following:
- Broader hub
- Hub with additional less-relevant contents and
links
19 Web Page Classification Based on Document
Structure
- Web pages that belong to a particular category
have some similarity in their structure.
- Information Pages.
- Research Pages.
- Personal Home Pages.
- The general structural information of any page
can be deduced from the placement of links, text
and images, including equations and graphs.
20 Web Page Categories Based on Structural
Similarities
- Information Pages
- A logo at the top, followed by a navigation bar
linking the page to other important pages.
- The ratio of link text (amount of text with
links) to normal text also tends to be relatively
high.
- Research Pages
- Contain large amounts of text, equations and
graphs in the form of images.
- The number of distinct gray levels/color
shades in the images also provides a cue.
21 Web Page Categories Based on Structural
Similarities
- Personal Pages
- The name and address of the person appear
prominently at the top of the page.
- A photograph of the person concerned.
- Towards the bottom of the page, the person
provides links to his or her publications, if there are
any, and other useful references or links to
favorite destinations on the web.
22 Feature Extraction
- Textual Information
- The number and placement of links in a page
provide valuable information about the broad
category the page belongs to.
- The ratio of the number of characters in links to the
total number of characters in the page.
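The link-character ratio above can be computed with Python's standard `html.parser` module. This is a minimal sketch: the class and function names are hypothetical, whitespace is ignored when counting characters, and text inside `<script>` or `<style>` elements is not filtered out.

```python
from html.parser import HTMLParser

class LinkTextCounter(HTMLParser):
    """Counts non-whitespace text characters, inside and outside <a> elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth inside <a> elements
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))  # ignore whitespace
        self.total_chars += n
        if self.depth:
            self.link_chars += n

def link_text_ratio(html):
    """Return link characters / total characters (0.0 for a page with no text)."""
    parser = LinkTextCounter()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars

ratio = link_text_ratio('<p><a href="http://www.fromages.com">Fromages</a> shop</p>')
```

A navigation-heavy information page would give a ratio near 1, while a text-heavy research page would give a ratio near 0.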
23 Feature Extraction
- Image Information
- Information pages have more colors than personal
homepages, which in turn have more colors than
research pages.
- The histogram of synthetic images generally tends
to concentrate at a few bands of color shades. In
contrast, the histogram of natural images is
spread over a larger area.
- Information pages usually contain many natural
images, while research pages contain a number of
synthetic images.
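The histogram cue above can be sketched with two simple features: the number of distinct gray levels, and the share of pixels falling in the few most common levels. The function name, the `top_k` parameter, and the toy pixel lists are assumptions for illustration. Decoding real image files is out of scope here.

```python
from collections import Counter

def histogram_features(pixels, top_k=4):
    """pixels: iterable of gray levels (0-255).
    Returns (number of distinct levels, fraction of pixels in the top_k levels)."""
    hist = Counter(pixels)
    total = sum(hist.values())
    if total == 0:
        return 0, 0.0
    top_mass = sum(count for _, count in hist.most_common(top_k)) / total
    return len(hist), top_mass

# Hypothetical data: a two-tone "synthetic" image and a smoothly varying "natural" one.
synthetic = [0] * 90 + [255] * 10
natural = list(range(100))
syn_levels, syn_mass = histogram_features(synthetic)
nat_levels, nat_mass = histogram_features(natural)
```

A high top-level share with few distinct levels suggests a synthetic image such as a graph or equation, while a spread-out histogram suggests a natural photograph.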
24 Feature Extraction
- Other Information
- Approaches using classification based on video
and other multimedia content are presently not
implemented.
25 Results
26 Web Page Categories Based on Structural
Similarities
- Conclusions and future work for the approach
- This approach, augmented with traditional text-based
approaches, could be used for effective
categorization of web pages.
- Improvement in feature selection.
- Automate the training process.
- Needs to be tested on more data sets.
27 References
- [1] K. Bharat and M. Henzinger. Improved Algorithms
for Topic Distillation in a Hyperlinked
Environment. In 21st International ACM SIGIR
Conference on Research and Development in
Information Retrieval.
- [2] S. Chakrabarti, B. Dom, D. Gibson, J.
Kleinberg, P. Raghavan, and S. Rajagopalan.
Automatic Resource Compilation by Analyzing
Hyperlink Structure and Associated Text.
Proceedings of the 7th World-Wide Web conference,
1998.
- [3] S. Chakrabarti, B. Dom and P. Indyk. Enhanced
hypertext categorization using hyperlinks.
Proceedings of ACM SIGMOD 1998.
28 References
- [4] L. Getoor, E. Segal, B. Taskar, D. Koller.
Probabilistic Models of Text and Link Structure
for Hypertext Classification. IJCAI Workshop on
"Text Learning Beyond Supervision", Seattle, WA,
August 2001.
- [5] Arul Prakash Asirvatham, Kranthi Kumar Ravi,
C. V. Jawahar. Web Page Classification based on
Document Structure.
- [6] Soumen Chakrabarti. Integrating the Document
Object Model with Hyperlinks for Enhanced Topic
Distillation and Information Extraction. 10th
International World Wide Web Conference, Hong
Kong, May 2001.
- [7] Soumen Chakrabarti, Mukul M. Joshi, Vivek B.
Tawde. Enhanced topic distillation using text,
markup tags, and hyperlinks. SIGIR 2001, New
Orleans, LA, Sep 2001.