Web Taxonomy Integration through Co-Bootstrapping

Transcript and Presenter's Notes

1
Web Taxonomy Integration through Co-Bootstrapping
  • Dell Zhang, National University of Singapore
  • Wee Sun Lee, National University of Singapore
  • SIGIR 2004

2
  • Introduction

3
Problem Statement
  • Two web taxonomies file the same sites under different categories
  • First taxonomy
  • Games / Roleplaying: Final Fantasy Fan, Dragon Quest Home, EverQuest Addict, Warcraft III Clan
  • Games / Strategy: Shogun Total War, Warcraft III Clan
  • Second taxonomy
  • Games / Roleplaying: Final Fantasy Fan, Dragon Quest Home
  • Games / Strategy: Shogun Total War
  • Games / Online: EverQuest Addict, Warcraft III Clan
  • Games / Single-Player: Warcraft III Clan

4
Possible Approach
Train
  • Games / Roleplaying: Final Fantasy Fan, Dragon Quest Home
  • Games / Strategy: Shogun Total War
Classify
  • EverQuest Addict
  • Warcraft III Clan

  • Problem: this ignores the original Yahoo! categories
5
Another Approach (1/2)
  • Use the original Yahoo! categories as well
  • Advantage
  • the two taxonomies have similar categories
  • Potential problem
  • different structure
  • the categories do not match exactly

6
Another Approach (2/2)
  • Example: Crayon Shin-chan
  • Entertainment / Comics and Animation / Animation / Anime / Titles / Crayon Shin-chan
  • Arts / Animation / Anime / Titles / C / Crayon Shin-chan

7
This Paper's Approach
  • Weak Learner (as opposed to Naïve Bayes)
  • Boosting to combine the weak hypotheses
  • New idea: Co-Bootstrapping to exploit the source categories

8
Assumptions
  • Multi-category data are reduced to binary data
  • Totoro Fan in Cartoon / My Neighbor Totoro and in Toys / My Neighbor Totoro
  • is converted into two single-category examples:
  • Totoro Fan in Cartoon / My Neighbor Totoro
  • Totoro Fan in Toys / My Neighbor Totoro
  • Hierarchies are ignored
  • Console / Sega and Console / Sega / Dreamcast are treated as unrelated
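The reduction above can be sketched in Python (the helper name and the data are illustrative, not from the paper):

```python
def to_binary_pairs(labeled_docs):
    """Expand {doc: [category, ...]} into independent (doc, category) pairs."""
    return [(doc, cat) for doc, cats in labeled_docs.items() for cat in cats]

# A multi-category document becomes two binary training examples.
pairs = to_binary_pairs({
    "Totoro Fan": ["Cartoon / My Neighbor Totoro", "Toys / My Neighbor Totoro"],
})
```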

9
  • Weak Learner
  • Boosting
  • Co-Bootstrapping
  • This section: Weak Learner

10
Weak Learner
  • A type of classifier similar to Naïve Bayes
  • outputs + to accept a (document, category) pair
  • outputs − to reject it
  • A term may be a word, an n-gram, or ...

After training, the Weak Learner outputs a Weak Hypothesis (a term-based classifier)
11
Weak Hypothesis Example
  • Contains "Crayon Shin-chan"?
  • in Comics / Crayon Shin-chan
  • not in Education / Early Childhood
  • Does not contain "Crayon Shin-chan"?
  • not in Comics / Crayon Shin-chan
  • in Education / Early Childhood

12
Weak Learner Inputs (1/2)
  • Training data are in the form (x1, y1), (x2, y2), ..., (xm, ym)
  • xi is a document
  • yi is a category
  • (xi, yi) means document xi is in category yi
  • D(x, y) is a distribution over all combinations of xi and yj
  • D(xi, yj) indicates the importance of the pair (xi, yj)
  • w is the term (found automatically)

13
Weak Learner Algorithm
  • For each possible category y, compute four weight sums: W0+, W0−, W1+, W1−
  • Wb± sums D(xi, y) over the examples xi whose term-presence is b (1 if xi contains w, else 0) and whose membership in y is + (in y) or − (not in y)
  • Note: pairs (xi, y) with greater D(xi, y) have more influence.

14
Weak Hypothesis h(x, y)
  • Given an unclassified document x and a category y
  • If x contains w, then h(x, y) = ½ ln(W1+ / W1−)
  • Else, if x does not contain w, then h(x, y) = ½ ln(W0+ / W0−)

15
Weak Learner Comments
  • If sign(h(x, y)) is +, then x is predicted to be in y
  • |h(x, y)| is the confidence
  • The term w is found as follows:
  • repeatedly run the weak learner for every possible w
  • choose the run with the smallest Z = 2 Σb √(Wb+ Wb−) value as the model
  • Boosting minimizes the probability of h(x, y) having the wrong sign
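A hedged Python sketch of this term-based weak learner, in the style of the real-valued one-term stumps of Schapire and Singer's BoosTexter (which this kind of weak learner follows); the function names, the smoothing constant, and the toy data are mine:

```python
import math

def train_weak_learner(docs, labels, cats, D, vocab, eps=1e-6):
    """Pick the term w whose one-term stump minimizes Z; return (w, c0, c1),
    where c0[y] / c1[y] are the confidences output when w is absent / present."""
    best = None
    for w in vocab:
        # W[b][s][y]: total weight D(xi, y) of examples xi with term-presence
        # b (1 if w in xi) and membership s in y (1 if xi is in y, else 0).
        W = [[{y: 0.0 for y in cats} for _ in range(2)] for _ in range(2)]
        for i, doc in enumerate(docs):
            b = 1 if w in doc else 0
            for y in cats:
                s = 1 if y in labels[i] else 0
                W[b][s][y] += D[(i, y)]
        # Z = 2 * sum_b sum_y sqrt(Wb+ * Wb-): the value being minimized.
        Z = 2 * sum(math.sqrt(W[b][1][y] * W[b][0][y])
                    for b in range(2) for y in cats)
        if best is None or Z < best[0]:
            c = [{y: 0.5 * math.log((W[b][1][y] + eps) / (W[b][0][y] + eps))
                  for y in cats} for b in range(2)]
            best = (Z, w, c[0], c[1])
    return best[1], best[2], best[3]

def weak_hypothesis(w, c0, c1, doc, y):
    """h(x, y): signed confidence that document `doc` belongs to category y."""
    return (c1 if w in doc else c0)[y]

# Toy demo (illustrative data): "shogun" separates Strategy from Roleplaying.
docs = [{"shogun"}, {"everquest"}]
labels = [{"Strategy"}, {"Roleplaying"}]
cats = ["Strategy", "Roleplaying"]
D = {(i, y): 0.25 for i in range(2) for y in cats}  # uniform over 4 pairs
w, c0, c1 = train_weak_learner(docs, labels, cats, D, ["shogun", "everquest"])
```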

16
  • Weak Learner
  • Boosting
  • Co-Bootstrapping
  • This section: Boosting (AdaBoost.MH)

17
Boosting Idea
  • Train the weak learner on different Dt(x, y)
    distributions
  • After each run, adjust Dt(x, y) by putting more
    weight on the most often misclassified training
    data
  • Output the final hypothesis as a linear
    combination of weak hypotheses

18
Boosting Algorithm
  • Given (x1, y1), (x2, y2), ..., (xm, ym), where xi ∈ X and yi ∈ Y
  • Initialize D1(x, y) = 1/(mk)
  • for t = 1, ..., T do
  • Pass distribution Dt to the weak learner
  • Get weak hypothesis ht(x, y)
  • Choose αt ∈ ℝ
  • Update Dt+1(x, y) = Dt(x, y) exp(−αt Y[y] ht(x, y)) / Zt, where Y[y] = +1 if x is in y (else −1) and Zt normalizes Dt+1 to a distribution
  • end for
  • Output the final hypothesis H(x, y) = sign(Σt αt ht(x, y))

19
Boosting Algorithm Initialization
  • Given (x1, y1), (x2, y2), ..., (xm, ym)
  • Initialize D1(x, y) = 1/(mk)
  • k = total number of categories
  • i.e. the uniform distribution

20
Boosting Algorithm Loop
  • for t = 1, ..., T do
  • Run the weak learner using distribution D
  • Get weak hypothesis ht(x, y)
  • For each pair (x, y) in the training data:
  • If ht(x, y) guesses incorrectly, increase D(x, y)
  • end for
  • return the combined hypothesis
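The loop above can be sketched as follows: a simplified AdaBoost.MH with αt = 1 folded into the real-valued weak hypotheses, and the weak learner passed in as a callback. The fixed stump in the demo is a stand-in for the actual weak learner, and the data are illustrative:

```python
import math

def adaboost_mh(examples, cats, weak_learn, T):
    """examples: list of (x, set_of_true_categories); weak_learn(D) -> h(x, y).
    Returns the combined hypothesis H(x, y) = sum of the h_t(x, y)."""
    m, k = len(examples), len(cats)
    D = {(i, y): 1.0 / (m * k) for i in range(m) for y in cats}  # uniform init
    hyps = []
    for _ in range(T):
        h = weak_learn(D)
        hyps.append(h)
        # Y = +1 if x is truly in y, else -1; a sign mismatch with h(x, y)
        # makes exp(...) > 1, so misclassified pairs gain weight next round.
        for i, (x, true_cats) in enumerate(examples):
            for y in cats:
                Y = 1.0 if y in true_cats else -1.0
                D[(i, y)] *= math.exp(-Y * h(x, y))
        Z = sum(D.values())             # Z_t: renormalize to a distribution
        for key in D:
            D[key] /= Z
    return lambda x, y: sum(h(x, y) for h in hyps)

def fixed_stump(D):
    # Stand-in weak learner: ignores D and always returns the same stump.
    return lambda x, y: 0.5 if ("shogun" in x) == (y == "Strategy") else -0.5

examples = [({"shogun"}, {"Strategy"}), ({"everquest"}, {"Roleplaying"})]
cats = ["Strategy", "Roleplaying"]
H = adaboost_mh(examples, cats, fixed_stump, T=3)
```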

21
  • Weak Learner
  • Boosting
  • Co-Bootstrapping
  • This section: Co-Bootstrapping

22
Co-Bootstrapping Idea
  • We want to use Yahoo! categories to increase
    classification accuracy

23
Recall Example Problem
  • Games / Online: EverQuest Addict, Warcraft III Clan
  • Games / Single-Player: Warcraft III Clan
  • Games / Roleplaying: Final Fantasy Fan, Dragon Quest Home
  • Games / Strategy: Shogun Total War

24
Co-Bootstrapping Algorithm (1/4)
  • 1. Run AdaBoost on Yahoo! sites
  • Get classifier Y1
  • 2. Run AdaBoost on Google sites
  • Get classifier G1
  • 3. Run Y1 on Google sites
  • Get predicted Yahoo! categories for Google sites
  • 4. Run G1 on Yahoo! sites
  • Get predicted Google categories for Yahoo! sites

25
Co-Bootstrapping Algorithm (2/4)
  • 5. Run AdaBoost on Yahoo! sites
  • Include Google category as a feature
  • Get classifier Y2
  • 6. Run AdaBoost on Google sites
  • Include Yahoo! category as a feature
  • Get classifier G2
  • 7. Run Y2 on the original Google sites
  • Get more accurate Yahoo! categories for the Google sites
  • 8. Run G2 on the original Yahoo! sites
  • Get more accurate Google categories for the Yahoo! sites

26
Co-Bootstrapping Algorithm (3/4)
  • 9. Run AdaBoost on Yahoo! sites
  • Include Google category as a feature
  • Get classifier Y3
  • 10. Run AdaBoost on Google sites
  • Include Yahoo! category as a feature
  • Get classifier G3
  • 11. Run Y3 on the original Google sites
  • Get even more accurate Yahoo! categories for the Google sites
  • 12. Run G3 on the original Yahoo! sites
  • Get even more accurate Google categories for the Yahoo! sites

27
Co-Bootstrapping Algorithm (4/4)
  • Repeat, repeat, and repeat
  • Hopefully, the classification will become more
    accurate after each iteration
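A minimal sketch of the iterative scheme in steps 1-12, with a deliberately toy classifier standing in for AdaBoost; the `train` heuristic, the `Y:`/`G:` pseudo-feature encoding, and the data are all illustrative assumptions:

```python
def train(docs, labels):
    """Toy stand-in for the AdaBoost classifier: maps each term to the first
    category it was seen with, and predicts from the first known term."""
    model = {}
    for doc, cat in zip(docs, labels):
        for term in sorted(doc):
            model.setdefault(term, cat)
    def predict(doc):
        for term in sorted(doc):
            if term in model:
                return model[term]
        return None
    return predict

def co_bootstrap(y_docs, y_labels, g_docs, g_labels, iterations=3):
    """Return predicted Yahoo!-side categories for the Google-side docs."""
    y_feats = [set(d) for d in y_docs]
    g_feats = [set(d) for d in g_docs]
    for _ in range(iterations):
        clf_y = train(y_feats, y_labels)   # steps 1/5/9: Yahoo!-side classifier
        clf_g = train(g_feats, g_labels)   # steps 2/6/10: Google-side classifier
        # Steps 3-4 / 7-8 / 11-12: add each side's predicted category on the
        # other taxonomy as an extra pseudo-feature, then retrain next round.
        g_feats = [d | {"Y:%s" % clf_y(d)} for d in g_feats]
        y_feats = [d | {"G:%s" % clf_g(d)} for d in y_feats]
    clf_y = train(y_feats, y_labels)
    return [clf_y(d) for d in g_feats]

# Toy demo (hypothetical data): the Google doc containing "ff" lands in the
# Yahoo!-side category "Role".
preds = co_bootstrap([{"ff"}, {"shogun"}], ["Role", "Strat"], [{"ff"}], ["RPG"])
```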

28
  • Enhanced Naïve Bayes
  • (Benchmark)

29
Enhanced Naïve Bayes (1/2)
  • Given
  • document x
  • source category S of x
  • Predict master category C
  • In NB, PrC x ? PrC ?w?x(Prw C)n(x,w)
  • w word
  • n(x,w) number of occurrences of w in x
  • PrC x, S ? PrC S ?w?x(Prw C)n(x,w)

30
Enhanced Naïve Bayes (2/2)
  • PrC
  • Estimate PrC S ?
  • C ? S number of docs in S that is classified
    into C by NB classifier
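The ENB score above can be sketched in log space as follows; the smoothing floor for unseen words and the toy probabilities are my simplifications, not the paper's exact estimator:

```python
import math

def enb_log_score(doc_counts, C, prior_CS, word_probs, floor=1e-6):
    """log of Pr[C | x, S] up to an additive constant:
       log Pr[C | S] + sum over words w of n(x, w) * log Pr[w | C]."""
    score = math.log(prior_CS[C])
    for w, n in doc_counts.items():
        # `floor` stands in for proper smoothing of unseen words.
        score += n * math.log(word_probs[C].get(w, floor))
    return score

# Toy demo (made-up probabilities): Pr[C | S] comes from the |C ∩ S| counts.
prior_CS = {"A": 0.8, "B": 0.2}
word_probs = {"A": {"anime": 0.5}, "B": {"anime": 0.1}}
score_a = enb_log_score({"anime": 2}, "A", prior_CS, word_probs)
score_b = enb_log_score({"anime": 2}, "B", prior_CS, word_probs)
```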

31
  • Experiment

32
Datasets
33
Number of Categories/Dataset (1/2)
Top level categories only
34
Number of Categories/Dataset (2/2)
  • Book
  • Horror
  • Science Fiction
  • Non-fiction
  • Biography
  • History

Biography and History are merged into Non-fiction
35
Number of Websites
36
Method (1/2)
  • Classify Yahoo! Book websites into Google Book categories (G ← Y)
  • Find G ∩ Y for Book
  • Hide the Google categories of the sites in G ∩ Y
  • G ∩ Y ⊆ Yahoo! Book
  • Randomly take |G ∩ Y| sites from G − Y ⊆ Google Book

37
Method (2/2)
  • For each dataset, do G ← Y five times and Y ← G five times
  • macro F-score: calculate the F-score for each category, then average over all categories
  • micro F-score: calculate the F-score on the entire dataset
  • recall 100?
  • The paper doesn't say anything about multi-category ENB
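The two averaging schemes can be sketched for single-label predictions as follows (function name and demo labels are illustrative):

```python
def f_scores(true, pred, cats):
    """Return (macro_F1, micro_F1) for parallel single-label lists."""
    per_cat, tp, fp, fn = [], 0, 0, 0
    for c in cats:
        ctp = sum(1 for t, p in zip(true, pred) if t == c and p == c)
        cfp = sum(1 for t, p in zip(true, pred) if t != c and p == c)
        cfn = sum(1 for t, p in zip(true, pred) if t == c and p != c)
        tp, fp, fn = tp + ctp, fp + cfp, fn + cfn
        denom = 2 * ctp + cfp + cfn
        per_cat.append(2 * ctp / denom if denom else 0.0)
    macro = sum(per_cat) / len(cats)      # macro: average of per-category F1
    micro = 2 * tp / (2 * tp + fp + fn)   # micro: F1 from pooled counts
    return macro, micro

macro, micro = f_scores(["A", "A", "A", "B"], ["A", "A", "B", "B"], ["A", "B"])
```

Macro weights every category equally, so rare categories count as much as large ones; micro is dominated by the large categories.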

38
Results (1/3)
  • Co-Bootstrapping-AdaBoost outperforms plain AdaBoost

macro-averaged F scores
micro-averaged F scores
39
Results (2/3)
  • Co-Bootstrapping-AdaBoost iteratively improves
    AdaBoost

Book Dataset
40
Results (3/3)
  • Co-Bootstrapping-AdaBoost outperforms Enhanced Naïve Bayes

macro-averaged F scores
micro-averaged F scores
41
Contribution
  • Co-Bootstrapping improves Boosting performance
  • Does not require a manually tuned parameter, as ENB does