Enhanced hypertext categorization using hyperlinks

1
Enhanced hypertext categorization using hyperlinks
  • Soumen Chakrabarti (IBM Almaden)
  • Byron Dom (IBM Almaden)
  • Piotr Indyk (Stanford)

2
Hypertext categorization
  • Automatic topic identification
  • Also called supervised learning
  • Given
  • Hypertext document corpus
  • A small set of classified documents
  • Goal
  • Construct a classifier
  • Apply to new documents

3
Example from the web
4
Applications and benefits
  • Retrieval
  • Browsing (Yahoo!)
  • Searching (socks and NOT apparel)
  • Adopted by most search companies
  • Profile based filtering and routing
  • Email, news, push services
  • Collaborative filtering
  • Automatically categorize click trails
  • Cluster users based on frequently visited topics

5
Click-trail and bookmark organizer
[Screenshot: integrated browser showing a view of the topic hierarchy alongside the web page]
6
The limitation of text-only classifiers
  • Text-only classifiers are well-researched
  • Rule induction
  • Bayesian learning
  • 87% accurate on news
  • Lower accuracy on hyperlinked corpora
  • Heterogeneous
  • Information in links not utilized

7
Our contributions
  • A novel approach to hypertext classification
  • Combine text and link information
  • Framework for link modeling in hypertext graphs
  • Markov random field (limited sphere of
    influence)
  • Techniques for feature extraction
  • Use of domain knowledge to limit complexity
  • Techniques to handle incomplete information
  • Iterative labeling algorithm

8
Is this a new problem?
  • Reduction to text classification
  • Include (tagged) text from neighbors
  • Classify the result
  • Does not increase accuracy
  • Big neighbor pages
  • Lack of semantic correlation

9
Big neighbor
10
More of big neighbor
11
Coherent pages linking to incoherent pages
12
Model specification
  • A hypertext graph
  • Nodes: documents
  • Edges: hyperlinks
  • Document: sequence or set of terms and links
  • Each document has a class label
  • Some labels are known
  • Most are unknown
  • Labels are drawn from some distribution

13
Assumptions used in probability model
  • No indirect coupling between the text and the
    neighbors' classes
  • The probability of a node's class depends only on
    neighbors within a limited radius
  • Independence among the neighbor class
    probabilities
  • Can assume higher order dependence
  • (neighborhood radius greater than 1)

14
Probability estimation
  • Posterior probability of class given text and neighborhood
  • Prior class probability
  • Class-conditional neighbor class distribution (independence between neighbors)
  • Class-conditional term distribution
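
One way the quantities listed above could combine, given the independence assumptions on the previous slide, is a naive-Bayes style product. This is a sketch, not the formula from the slide. The notation is assumed here: c is the class of the document, τ its set of terms, N its set of neighbors, and c_j the class of neighbor j. Independence among terms is an additional assumption.

```latex
\Pr(c \mid \tau, N) \;\propto\;
  \underbrace{\Pr(c)}_{\text{prior}}
  \;\prod_{t \in \tau} \underbrace{\Pr(t \mid c)}_{\text{term distribution}}
  \;\prod_{j \in N} \underbrace{\Pr(c_j \mid c)}_{\text{neighbor class distribution}}
```

The normalizing constant is the same for every class, so it can be ignored when picking the most probable class.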
15
Bayesian classification algorithm
  • Learning phase (parameter estimation)
  • Distribution of a text within a class
  • Interclass linkage probabilities
  • Prior probability of a class
  • Classification phase
  • Compute class probabilities
  • Choose the class with the highest posterior probability
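
The two phases above can be sketched as a small naive-Bayes classifier that treats neighbor classes as extra features. This is an illustration, not the authors' implementation: the function names, the add-one smoothing, and the input layout (a list of `(terms, neighbor_classes, label)` tuples) are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """Learning phase. docs: list of (terms, neighbor_classes, label)."""
    prior = Counter()                   # class frequencies
    term_counts = defaultdict(Counter)  # term counts within each class
    link_counts = defaultdict(Counter)  # neighbor-class counts within each class
    vocab = set()
    for terms, nbr_classes, label in docs:
        prior[label] += 1
        term_counts[label].update(terms)
        link_counts[label].update(nbr_classes)
        vocab.update(terms)
    return prior, term_counts, link_counts, vocab

def log_posterior(model, terms, nbr_classes):
    """Unnormalized log posterior for every class (add-one smoothing assumed)."""
    prior, term_counts, link_counts, vocab = model
    classes = list(prior)
    total_docs = sum(prior.values())
    scores = {}
    for c in classes:
        score = math.log(prior[c] / total_docs)
        t_total = sum(term_counts[c].values())
        for t in terms:
            score += math.log((term_counts[c][t] + 1) / (t_total + len(vocab)))
        l_total = sum(link_counts[c].values())
        for n in nbr_classes:
            score += math.log((link_counts[c][n] + 1) / (l_total + len(classes)))
        scores[c] = score
    return scores

def classify(model, terms, nbr_classes):
    """Classification phase: choose the class with the highest posterior."""
    scores = log_posterior(model, terms, nbr_classes)
    return max(scores, key=scores.get)
```

Passing an empty `nbr_classes` list reduces this to a text-only classifier, and passing an empty `terms` list to a link-only one.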

16
Partial neighborhood knowledge
  • Problem
  • Class of test page depends on neighbors' classes
  • Must know neighbors' classes to use interclass
    probabilities → circularity!
  • Solution
  • Iterative labeling
  • Initially classify neighboring nodes using text
  • Repeatedly reclassify until consistent
  • Text, link, or joint model
  • Will this stabilize?
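
The iterative labeling loop can be sketched as follows, assuming per-class text scores have already been computed by some text classifier. The function name, argument layout, and the `max_rounds` cap are hypothetical. The cap is there because this sketch does not guarantee that the labels stabilize.

```python
import math

def iterative_labeling(text_logp, neighbors, link_logp, known, max_rounds=20):
    """
    text_logp: {node: {cls: log P(cls) + log P(text | cls)}} for unlabeled nodes
    neighbors: {node: [neighbor nodes]}
    link_logp: {(cls, nbr_cls): log P(nbr_cls | cls)}
    known:     {node: cls} for nodes whose labels are given
    """
    # Initial pass: label every unknown node from its text alone.
    labels = {n: max(s, key=s.get) for n, s in text_logp.items()}
    labels.update(known)

    for _ in range(max_rounds):
        new_labels = dict(labels)
        for node, scores in text_logp.items():
            if node in known:
                continue  # given labels are never changed
            # Joint score: text evidence plus the current guesses for the neighbors.
            joint = {
                c: s + sum(link_logp[(c, labels[m])] for m in neighbors.get(node, []))
                for c, s in scores.items()
            }
            new_labels[node] = max(joint, key=joint.get)
        if new_labels == labels:
            break  # consistent: no label changed in this round
        labels = new_labels
    return labels
```

Each round reclassifies all unknown nodes against the previous round's labels, so a node's text-only guess can be overturned once its neighbors agree on a different class.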

17
Data set 1: US patent database
  • Local text information
  • Title
  • Abstract
  • Citation links
  • Related patents cite each other
  • Complete knowledge of the neighbors' classes

18
Complete knowledge of neighborhood
  • Features used
  • Local text
  • Class tags from neighbor links
  • Large gain from tags
  • Gains sensitive to tag representation
  • /Arts
  • /Arts/Painting
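
One reading of the tag-representation bullet is that a neighbor's class can be encoded at different granularities: only the top-level topic, only the full path, or every prefix of the path. The sketch below illustrates that choice. The function name and the `mode` values are hypothetical, and the slides do not say which encoding was used.

```python
def class_tag_features(class_path, mode="prefixes"):
    """Turn a neighbor's class path such as '/Arts/Painting' into feature tags."""
    parts = [p for p in class_path.split("/") if p]
    if not parts:
        return []
    if mode == "top":        # coarse: top-level topic only
        return ["/" + parts[0]]
    if mode == "full":       # fine: the complete path only
        return ["/" + "/".join(parts)]
    if mode == "prefixes":   # both: every prefix of the path
        return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    raise ValueError(f"unknown mode: {mode}")
```

For example, `class_tag_features("/Arts/Painting")` yields both `/Arts` and `/Arts/Painting`, so a classifier can use either level of the hierarchy.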

19
Partial knowledge of neighborhood
  • Algorithm
  • Grow radius-two neighborhood
  • Delete labels from a fraction of nodes
  • Do iterative labeling
  • Observations
  • Benefit from links
  • Text+Link most robust

20
Data set 2: Yahoo!
  • Few links point to classified documents
  • 19% of docs have any classified out-link
  • 28% have any classified in-link
  • 40% have either one
  • → Need to find a new source of information and
    extend the algorithm

21
Radius-2 information: co-citations
  • An IO-bridge connects to many pages of similar
    topics
  • OI tends to be noisy (many topics point to
    Netscape and Free Speech Online)
  • II and OO lead to topic divergence

[Figure: a bridge page connected by I-links and O-links to the document to be classified, to classified documents, and to an unclassified document; radius-2 path types labeled IO, OI, and II/OO]
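
Assuming an IO path means following an in-link back to a bridge page and then one of the bridge's out-links forward, the co-cited pages of a document can be collected as sketched below. The function name and the `out_links` adjacency layout are hypothetical.

```python
from collections import defaultdict

def io_neighbors(out_links, doc):
    """
    out_links: {page: [pages it links to]}
    Returns {co-cited page: set of bridge pages that link to both it and doc}.
    """
    # Invert the graph so in-links can be followed backwards.
    in_links = defaultdict(set)
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].add(src)

    co_cited = defaultdict(set)
    for bridge in in_links[doc]:                   # I step: pages pointing at doc
        for sibling in out_links.get(bridge, []):  # O step: what else they cite
            if sibling != doc:
                co_cited[sibling].add(bridge)
    return dict(co_cited)
```

Pages reached through several bridges are stronger co-citation candidates, which is why the result keeps the set of bridges instead of only the sibling names.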
22
Link proximity
  • Are out-links that are close together more likely
    to point to related topics than out-links that
    are far apart?

23
Bridges are locally coherent
  • Link proximity → semantic proximity
  • Exploit this source of information
  • Huge attribute space
  • Simple classification
  • Check coherence
  • Voting
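
The coherence check and voting step might look like the sketch below. It looks at the classified out-links near the target link on a bridge page and accepts the majority class only if enough of them agree. The `window` and `min_agreement` parameters and the function name are assumptions made for illustration.

```python
from collections import Counter

def vote_from_bridge(out_links, known, target, window=2, min_agreement=0.5):
    """
    out_links: the bridge page's out-links, in page order
    known:     {url: class} for links whose class is already known
    target:    the link to classify; must appear in out_links
    Returns the majority class among nearby classified links, or None
    if there are none or they are not coherent enough.
    """
    idx = out_links.index(target)
    lo, hi = max(0, idx - window), idx + window + 1
    nearby = [u for u in out_links[lo:hi] if u != target and u in known]
    votes = Counter(known[u] for u in nearby)
    if not votes:
        return None
    cls, count = votes.most_common(1)[0]
    # Coherence check: the winning class needs more than min_agreement of the votes.
    if count / len(nearby) <= min_agreement:
        return None
    return cls
```

A smaller `window` relies more heavily on link proximity, while a larger one gathers more votes at the risk of mixing topics.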

24
Effect of exploiting bridges and locality
25
Conclusions
  • New model for citation among hyperlinked
    documents belonging to various topics
  • New categorization algorithm
  • Complexity controlled using domain knowledge
    about citations
  • Significant increase in accuracy

26
Future work
  • Better models for joint distribution between
    terms and links
  • Semantic page segmentation to distill pure
    bridges from ones having a mixture of topics
  • Higher complexity
  • Potentially better results
  • More clever use of neighbors' text
  • Investigation of the relationship between spatial
    and semantic proximity

27
Related work