Topic Distillation and Web Page Categorization - PowerPoint PPT Presentation

Slides: 29

Provided by: prasanna4

Learn more at: https://www-users.cse.umn.edu

Transcript and Presenter's Notes

1
Topic Distillation and Web Page Categorization

Prasanna K. Desikan
(05/29/2002)

2
Motivation

The web is a huge repository of information.
Categorizing web documents facilitates the
search and retrieval of pages.
Topic distillation is the process of finding
authoritative Web pages and comprehensive hubs
which reciprocally endorse each other and are
relevant to a given query.

3
Approaches for Categorization

Text based Categorization
Structure or link based Categorization
Combination of link and text information

4
Web Page Categorization Algorithms

Manual categorization by domain-specific experts.
Categorization involves analysis of the contents
of the web page by a number of domain experts, who
classify it based on its textual content, as is
done by Yahoo.
Content-based categorization: based solely on
document content, or on a combination of document
content and META tags.
To classify a document, all the stop words are
removed and the remaining keywords/phrases are
represented in the form of a feature vector.
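
The feature-vector step above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the stop-word list, the vocabulary, and the function name feature_vector are all assumptions made for the example, and raw term counts stand in for whatever weighting a real classifier would use.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; real systems use larger ones.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "in", "on", "for"}


def feature_vector(text, vocabulary):
    """Return term counts for `text` over a fixed vocabulary, skipping stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # One vector component per vocabulary term, in vocabulary order.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and page text, for illustration only.
vocab = ["cheese", "french", "ski", "web"]
vec = feature_vector("The French cheese shop: cheese of the web", vocab)
print(vec)
```

The resulting vector can be passed to any standard text classifier; the choice of classifier is outside what the slide specifies.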

5
Web Page Categorization Algorithms

Link and Content Analysis.
Based on the observation that a web page that
refers to a document must contain enough hints
about its content to induce someone to read it.
Such hints can be used to classify the document
being referred to.

6
Topic Distillation in a Hyperlinked Environment [1]

Aim: To find quality documents related to a query
topic.
Problems encountered with the HITS approach:
Mutually reinforcing relationships between hosts.
Automatically generated links.
Non-relevant nodes (documents not relevant to the
query topic).

7
Topic Distillation in a Hyperlinked Environment [1]

Let the Web be represented as a graph in which
each node is a web page and each edge is a link.
Approaches
If there are k edges (an edge here is a link)
from documents on a first host to a single
document on a second host, we give each edge an
authority weight of 1/k.
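
The 1/k rule above can be sketched as follows. This is a minimal sketch under stated assumptions: edges are given as (source URL, target URL) pairs, the host is taken to be the URL's network location, and links within a single host are skipped. The function name authority_edge_weights and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def authority_edge_weights(edges):
    """Map each (source_url, target_url) edge to an authority weight of 1/k.

    k is the number of edges from documents on one host to the same
    target document on a different host.
    """
    groups = defaultdict(list)
    for src, dst in set(edges):  # ignore duplicate edges
        src_host = urlparse(src).netloc
        dst_host = urlparse(dst).netloc
        if src_host == dst_host:
            continue  # assumption: links within one host are not weighted
        groups[(src_host, dst)].append((src, dst))

    weights = {}
    for group in groups.values():
        k = len(group)
        for edge in group:
            weights[edge] = 1.0 / k
    return weights


# Hypothetical link graph: two pages on a.example point at the same document.
edges = [
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/x"),
    ("http://c.example/1", "http://b.example/x"),
]
weights = authority_edge_weights(edges)
```

With these weights, a host that links to a document many times contributes the same total authority as a host that links to it once, which addresses the mutually reinforcing relationships between hosts.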

8
Topic Distillation in a Hyperlinked Environment [1]

Approaches (contd.)
Compute the relevance weight for each node.
Eliminate non-relevant nodes from the graph by
setting a threshold on the relevance weight.
Regulate the influence of a node based on its
relevance.
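
The pruning step above can be sketched as follows. The slide does not say how the relevance weight is computed, so this sketch assumes cosine similarity between term counts of the query and of each document. The function names, the threshold value, and the example documents are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_by_relevance(docs, query, threshold):
    """Keep only documents whose relevance weight reaches the threshold.

    docs maps a node id to its text. Returns {node id: relevance weight}.
    """
    q = Counter(query.lower().split())
    relevance = {
        node: cosine(q, Counter(text.lower().split()))
        for node, text in docs.items()
    }
    return {node: r for node, r in relevance.items() if r >= threshold}


# Hypothetical documents and threshold, for illustration only.
docs = {"a": "cheese cheese shop", "b": "ski resort"}
kept = prune_by_relevance(docs, "cheese", threshold=0.1)
```

The retained relevance weights could also be used to scale each node's contribution during the hub and authority computation, which is one way to regulate a node's influence.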

9
Topic Distillation in a Hyperlinked Environment [1]

Approaches (contd.)
Partial content analysis.
Content pruning by analyzing only a part of the
graph, i.e. the nodes that are most influential
in the outcome.

10
Automatic Resource Compilation [2]

Goal: Automatically compile a resource list on
any topic that is broad and well-represented on
the Web.
Approach:
A search-and-growth phase.
A weighting phase.
w(p,q) = 1 + n(t).
w(p,q) - a measure of the authority on the topic
invested by page p in page q.
n(t) - the number of terms from the topic
description that match text in the anchor window
of width B around the link.
An iteration-and-reporting phase.
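
The weighting phase can be sketched as follows, assuming the weighting rule reads w(p,q) = 1 + n(t). For simplicity the window width is measured in tokens on each side of the link, which is an assumption of this sketch. The function name anchor_weight, the LINK placeholder token, and the example page text are hypothetical.

```python
def anchor_weight(page_tokens, anchor_index, topic_terms, width):
    """Return w(p,q) = 1 + n(t) for one link on page p.

    page_tokens  -- the text of page p as a list of tokens
    anchor_index -- position of the link to q within page_tokens
    topic_terms  -- terms of the topic description
    width        -- window width B, here counted in tokens on each side
    """
    lo = max(0, anchor_index - width)
    hi = anchor_index + width + 1
    window = page_tokens[lo:hi]
    topic = {t.lower() for t in topic_terms}
    # n(t): number of window tokens that match a topic term.
    n_t = sum(1 for tok in window if tok.lower() in topic)
    return 1 + n_t


# Hypothetical page text; LINK marks where the hyperlink to q appears.
tokens = "buy french cheese online at LINK for cheese lovers and ski fans".split()
link_pos = tokens.index("LINK")
w_narrow = anchor_weight(tokens, link_pos, ["cheese"], width=2)
w_wide = anchor_weight(tokens, link_pos, ["cheese"], width=3)
```

A link with no topic terms near it still gets weight 1, so it contributes to the iteration phase, only less than a link surrounded by topic terms.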

11
Relaxation Labeling Technique [3]

First, classify the unclassified documents in the
neighborhood using a terms-only classifier, i.e.
using only the text of the neighboring documents.
Iterate until convergence.
Recompute the class for each document using both
the local text and the class information of the
neighbors.
The relaxation is guaranteed to converge to a
consistent state.
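
The iteration above can be sketched as follows. This is a deliberately simplified stand-in, not the probabilistic model of the cited paper: the text-only class probabilities are assumed to be given, and each round multiplies a document's text probability for a class by the smoothed average probability that its neighbors currently assign to that class. All names, the smoothing constant, and the example data are hypothetical.

```python
def relaxation_labeling(text_probs, neighbors, max_iter=50, smoothing=0.1, tol=1e-9):
    """Iteratively relabel documents using local text and neighbor classes.

    text_probs -- {doc: {cls: probability from a text-only classifier}}
    neighbors  -- {doc: [linked docs]}
    Returns {doc: most likely class}.
    """
    current = {d: dict(p) for d, p in text_probs.items()}
    for _ in range(max_iter):
        updated = {}
        for doc, local in text_probs.items():
            nbrs = neighbors.get(doc, [])
            scores = {}
            for cls, p_text in local.items():
                if nbrs:
                    support = sum(current[n].get(cls, 0.0) for n in nbrs) / len(nbrs)
                else:
                    support = 1.0  # no neighbors: rely on the text alone
                scores[cls] = p_text * (support + smoothing)
            total = sum(scores.values())
            updated[doc] = {c: s / total for c, s in scores.items()}
        # Largest change in any class probability during this round.
        change = max(
            abs(updated[d][c] - current[d][c]) for d in updated for c in updated[d]
        )
        current = updated
        if change < tol:
            break
    return {d: max(p, key=p.get) for d, p in current.items()}


# Hypothetical example: document "c" leans towards "ski" on text alone,
# but both of its neighbors are confidently "art".
text_probs = {
    "a": {"art": 0.9, "ski": 0.1},
    "b": {"art": 0.8, "ski": 0.2},
    "c": {"art": 0.45, "ski": 0.55},
}
neighbors = {"a": ["b"], "b": ["a"], "c": ["a", "b"]}
labels = relaxation_labeling(text_probs, neighbors)
```

In this example the neighbor information overrides the weak text evidence for document "c", which is the effect the technique relies on.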

12
Probabilistic Relational Model [4]

Web pages and links are modeled as entities and
relationships respectively, and each of them is
represented as a class.
A Bayesian network is created using the attributes
from the entity-relationship model in order to
model uncertainty and make inferences.

13
Probabilistic Relational Model

By belief propagation, an approximate inference
approach, we can use our prior knowledge to infer
the unobserved cases.
Given new data with some unobserved variables,
first assign the most likely values to them.
Based on the estimates of those marginal
probabilities, we predict the classification.

14
Probabilistic Relational Model

This approach proved to be effective when applied
to the hypertext classification problem. By using
information from both the content and the link
structure, it provides more accurate
classification and the ability to do
probabilistic reasoning.

15
Integrating the DOM With Hyperlinks for Enhanced
Topic Distillation [6]

A uniform fine-grained model.
Web pages are represented by their tag trees
(also called their Document Object Models, or
DOMs).
DOM trees are interconnected by ordinary
hyperlinks.
Disaggregates mixed hubs.
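
The tag-tree view can be sketched with Python's standard html.parser module, as follows. The sketch records, for each hyperlink, the path of tags from the root of the page down to the link, so that links in different DOM subtrees can be told apart. The class name LinkPathParser and the shortened example page are assumptions made for illustration; this is not the algorithm of the cited paper.

```python
from html.parser import HTMLParser


class LinkPathParser(HTMLParser):
    """Collect (DOM path, href) for every hyperlink in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, root first
        self.links = []  # (path of tags from root to the link, href)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/".join(self.stack), href))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


# Hypothetical mixed hub: one region about art, one about cheese.
html_doc = (
    '<html><body>'
    '<table><tr><td><a href="http://art.qaz.com">art</a></td></tr></table>'
    '<ul><li><a href="http://www.fromages.com">Fromages.com</a> French cheese</li></ul>'
    '</body></html>'
)
parser = LinkPathParser()
parser.feed(html_doc)
parser.close()
links = parser.links
```

Grouping links by a shared path prefix (for example everything under html/body/ul) gives candidate regions of a mixed hub that can be scored separately.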

16
A New Fine-Grained Model [7]

<html><body>
<table>
  <tr><td>
    <table>
      <tr><td><a href="http://art.qaz.com">art</a></td></tr>
      <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>
    </table>
  </td></tr>
  <tr><td>
    <ul>
      <li><a href="http://www.fromages.com">Fromages.com</a> French cheese</li>
      <li><a href="http://www.teddingtoncheese.co.uk">Teddington</a> Buy online</li>
    </ul>
  </td></tr>
</table>
</body></html>

17
Integrating the DOM With Hyperlinks for Enhanced
Topic Distillation

Figure 6: The fine-grained model of Web linkage,
which unifies hyperlinks and DOM structure.

18
Integrating the DOM With Hyperlinks for Enhanced
Topic Distillation

Benefits
Reduces topic drift.
Identifies and extracts regions (DOM subtrees)
relevant to the query out of the following:
A broader hub.
A hub with additional less-relevant contents and
links.

19
Web Page Classification Based on Document
Structure

Web pages that belong to a particular category
have some similarity in their structure.
Information Pages.
Research Pages.
Personal Home Pages.

The general structural information of any page
can be deduced from the placement of links, text,
and images (including equations and graphs).

20
Web Page Categories Based on Structural
Similarities

Information Pages
A logo at the top, followed by a navigation bar
linking the page to other important pages.
The ratio of link text (amount of text with
links) to normal text also tends to be relatively
high.
Research Pages
Contain large amounts of text, with equations and
graphs in the form of images.
The number of distinct gray levels/color
shades in the images also provides a cue.

21
Web Page Categories Based on Structural
Similarities

Personal Pages
The name and address of the person appear
prominently at the top of the page.
A photograph of the person concerned.
Towards the bottom of the page, the person
provides links to publications, if there are
any, and other useful references or links to
favorite destinations on the web.

22
Feature Extraction

Textual Information.
The number and placement of links in a page
provide valuable information about the broad
category the page belongs to.
The ratio of number of characters in links to the
total number of characters in the page.
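
The link-character ratio above can be sketched with Python's standard html.parser module, as follows. The sketch counts visible text characters after stripping leading and trailing whitespace from each text run, which is an assumption, and it does not exclude script or style content. The class and function names and the example markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Count characters of text inside links and in the whole page."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0  # greater than 0 while inside an <a> element
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.link_depth > 0:
            self.link_chars += n


def link_text_ratio(html):
    """Ratio of characters in links to total characters in the page."""
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    if counter.total_chars == 0:
        return 0.0
    return counter.link_chars / counter.total_chars


# Hypothetical page fragment: 4 plain characters and 4 link characters.
ratio = link_text_ratio('<p>abcd<a href="x">efgh</a></p>')
```

A navigation-heavy information page would be expected to score closer to 1 on this feature than a text-heavy research page.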

23
Feature Extraction

Image Information
Information pages have more colors than personal
homepages, which in turn have more colors than
research pages
The histogram of synthetic images generally tends
to concentrate at a few bands of color shades. In
contrast, the histogram of natural images is
spread over a larger area
Information pages usually contain many natural
images, while research pages contain a number of
synthetic images
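
The histogram cue above can be sketched as follows. To stay self-contained, the sketch assumes an image has already been decoded into a list of RGB tuples; it reports the number of distinct colors and the share of pixels that fall into the few most frequent colors. The function name color_features, the top_k parameter, and the example pixel data are hypothetical.

```python
from collections import Counter


def color_features(pixels, top_k=4):
    """Return (number of distinct colors, share of pixels in the top_k colors).

    pixels -- list of (r, g, b) tuples for one image
    """
    if not pixels:
        return 0, 0.0
    counts = Counter(pixels)
    distinct = len(counts)
    top = sum(count for _, count in counts.most_common(top_k))
    return distinct, top / len(pixels)


# Hypothetical images: a two-color chart and a smooth gray gradient.
synthetic = [(255, 255, 255)] * 90 + [(0, 0, 0)] * 10
natural = [(i, i, i) for i in range(100)]

synthetic_features = color_features(synthetic)
natural_features = color_features(natural)
```

A high concentration in a few colors suggests a synthetic image such as a graph or equation, while a low concentration suggests a natural image such as a photograph.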

24
Feature Extraction

Other Information
Approaches using classification based on video
and other multimedia content are not presently
implemented.

25
Results

26
Web Page Categories Based on Structural
Similarities

Conclusions and future work for the approach
This approach, augmented with traditional
text-based approaches, could be used for effective
categorization of web pages.
Improve feature selection.
Automate the training process.
Evaluate the approach on more data sets.

27
References

[1] K. Bharat and M. Henzinger. Improved Algorithms
for Topic Distillation in a Hyperlinked
Environment. In 21st International ACM SIGIR
Conference on Research and Development in
Information Retrieval.
[2] S. Chakrabarti, B. Dom, D. Gibson, J.
Kleinberg, P. Raghavan, and S. Rajagopalan.
Automatic Resource Compilation by Analyzing
Hyperlink Structure and Associated Text.
Proceedings of the 7th World-Wide Web Conference,
1998.
[3] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced
Hypertext Categorization Using Hyperlinks.
Proceedings of ACM SIGMOD 1998.

28
References

[4] L. Getoor, E. Segal, B. Taskar, and D. Koller.
Probabilistic Models of Text and Link Structure
for Hypertext Classification. IJCAI Workshop on
"Text Learning: Beyond Supervision", Seattle, WA,
August 2001.
[5] Arul Prakash Asirvatham, Kranthi Kumar Ravi,
and C. V. Jawahar. Web Page Classification Based
on Document Structure.
[6] Soumen Chakrabarti. Integrating the Document
Object Model with Hyperlinks for Enhanced Topic
Distillation and Information Extraction. 10th
International World Wide Web Conference, Hong
Kong, May 2001.
[7] Soumen Chakrabarti, Mukul M. Joshi, and Vivek
B. Tawde. Enhanced Topic Distillation Using Text,
Markup Tags, and Hyperlinks. SIGIR 2001, New
Orleans, LA, Sep 2001.