PEBL: Web Page Classification without Negative Examples
1
PEBL: Web Page Classification without Negative
Examples
  • Hwanjo Yu, Jiawei Han, Kevin Chen-Chuan Chang

IEEE Transactions on Knowledge and Data
Engineering, January 2004
2
Outline
  • Problem statement
  • Motivation
  • Related work
  • Main contribution
  • Technical details
  • Experiments
  • Summary

3
Problem Statement
  • Classify web pages into classes that interest
    the user, e.g., a home-page classifier or a
    call-for-papers classifier.
  • Negative samples are not given explicitly.
  • Only positive and unlabeled samples are
    available.

4
Motivation
  • Collecting negative samples may be delicate and
    arduous.
  • Negative samples must uniformly represent the
    universal set.
  • Manually collected negative training examples
    could be biased.
  • Predefined classes usually do not match users'
    diverse and changing search targets.

5
Challenges
  • Collecting unbiased unlabeled data from the
    universal set: random sampling of web pages on
    the Internet.
  • Achieving classification accuracy from positive
    and unlabeled data as high as from fully labeled
    data: the PEBL framework (Mapping-Convergence
    algorithm using SVM).

6
Related Work
  • Semisupervised learning
    - Requires a sample of labeled (+/-) and
      unlabeled data.
    - EM algorithm
    - Transductive SVM
  • Single-class learning or classification
    - Rule-based (k-DNF): not tolerant of sparse,
      high-dimensional data; requires knowledge of
      the proportion of positive instances in the
      universal set.
    - Probability-based: requires prior
      probabilities for each class; assumes linear
      separation.
    - OSVM, neural networks
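The single-class OSVM listed above learns a boundary around the positive class alone, with no negative or unlabeled data. A minimal sketch using scikit-learn's OneClassSVM; the data and parameter values are illustrative, not from the paper:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic "positive class" sample: 200 points around the origin
# (made-up data for illustration only).
rng = np.random.default_rng(0)
positives = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu bounds the fraction of training points allowed to fall outside
# the learned boundary (0.1 is an illustrative choice).
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(positives)

# predict returns +1 for points inside the positive region, -1 outside
preds = ocsvm.predict(np.array([[0.0, 0.0], [8.0, 8.0]]))
```

Because it sees only positive data, OSVM tends to draw a looser boundary than a discriminative classifier, which is the weakness PEBL targets.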

7
Main Contribution
  • Collecting only positive samples speeds up the
    process of building classifiers.
  • The universal set of unlabeled samples can be
    reused for training different classifiers.
  • This supports example-based querying on the
    Internet.
  • PEBL achieves accuracy as high as that of a
    typical framework without loss of efficiency in
    testing.

8
(No Transcript)
9
SVM Overview
10
Mapping-Convergence Algorithm
  • Mapping Stage
  • A weak classifier (?1) that draws an initial
    approximation of strong negative data.
  • ?1 must not generate false negatives.
  • Convergence Stage
  • Runs in iteration using a second base classifier
    (?2) that maximizes the margin to make
    progressively better approximation of negative
    data.
  • ?2 must maximize margin.
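The two stages above can be sketched in Python. This is a minimal illustration, not the paper's implementation: scikit-learn's LinearSVC stands in for LIBSVM, Φ1 is approximated by a 1-DNF-style rule ("positive features" are those more frequent in positive documents than in unlabeled ones), and all function names are ours:

```python
import numpy as np
from sklearn.svm import LinearSVC

def mapping_stage(P, U):
    """Phi_1 (sketch): a feature is 'positive' if it occurs more often
    in positive docs P than in unlabeled docs U; unlabeled docs with
    no positive feature become the initial strong negatives."""
    pos_freq = (P > 0).mean(axis=0)
    unl_freq = (U > 0).mean(axis=0)
    pos_features = pos_freq > unl_freq
    has_pos_feature = (U[:, pos_features] > 0).any(axis=1)
    return U[~has_pos_feature], U[has_pos_feature]

def mapping_convergence(P, U, max_iter=20):
    """Phi_2 (sketch): iteratively retrain a margin-maximizing
    classifier and grow the negative set until no new negatives
    are extracted from the remaining unlabeled data."""
    N, remaining = mapping_stage(P, U)
    clf = None
    for _ in range(max_iter):
        if len(N) == 0:
            break
        X = np.vstack([P, N])
        y = np.concatenate([np.ones(len(P)), -np.ones(len(N))])
        clf = LinearSVC().fit(X, y)          # Phi_2
        if len(remaining) == 0:
            break
        pred = clf.predict(remaining)
        newly_negative = remaining[pred == -1]
        if len(newly_negative) == 0:         # converged
            break
        N = np.vstack([N, newly_negative])
        remaining = remaining[pred == 1]
    return clf
```

The key property mirrored here is that Φ1 errs on the side of keeping possible positives in `remaining`, while each Φ2 iteration pushes the boundary closer to the true positive class.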

11
Mapping Stage
12
Mapping-Convergence Algorithm
13
Mapping-Convergence Algorithm
14
Experiments
  • LIBSVM for the SVM implementation.
  • Gaussian kernels for better text-categorization
    accuracy.
  • Experiment 1: The Internet
  • 2,388 pages from DMOZ as the unlabeled dataset.
  • 368 personal homepages, 449 non-homepages.
  • 192 college admission pages, 450 non-admission
    pages.
  • 188 resume pages, 533 non-resume pages.
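The SVM setup above (Gaussian kernel via LIBSVM) can be reproduced with scikit-learn's SVC, which wraps LIBSVM. The toy data and the gamma/C settings below are illustrative, not the paper's:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny made-up dataset: the class is determined by the first feature.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, 1, 1])

# kernel="rbf" is the Gaussian kernel; gamma and C are illustrative values.
clf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)
```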

15
Experiments
  • Experiment 2: University CS departments
  • 4,093 pages from WebKB as the unlabeled dataset.
  • 1,641 student pages, 662 non-student pages.
  • 504 project pages, 753 non-project pages.
  • 1,124 faculty pages, 729 non-faculty pages.
  • Precision-recall (P-R) breakeven point is used as
    the performance measure.
  • Compared against:
  • TSVM: traditional SVM
  • OSVM: one-class SVM
  • UN: treating unlabeled data as negative instances
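The P-R breakeven point used above is the value at which precision equals recall as the decision threshold is swept over the ranked test examples. A minimal way to approximate it from classifier scores; this helper is ours, not from the paper:

```python
import numpy as np

def pr_breakeven(scores, labels):
    """Sweep the decision threshold over examples sorted by score and
    return the point where precision and recall are closest (they are
    equal at the true breakeven)."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                 # true positives among top-k
    k = np.arange(1, len(labels) + 1)
    precision = tp / k
    recall = tp / labels.sum()
    i = np.argmin(np.abs(precision - recall))
    return (precision[i] + recall[i]) / 2
```

For example, with scores [0.9, 0.8, 0.7, 0.6] and labels [1, 1, 0, 1], precision and recall both equal 2/3 at a cutoff of three, so the breakeven point is 2/3.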

16
Experiments
17
Experiments
18
Summary
  • Classifying web pages of interesting class
    requires laborious preprocessing.
  • PEBL framework eliminates the need for negative
    training samples.
  • M-C algorithm achieves accuracy as high as
    traditional SVM.
  • Additional multiplicative logarithmic factor in
    training time on top of SVM.

19
End of Show