Web Taxonomy Integration through Co-Bootstrapping

Transcript and Presenter's Notes

1
Web Taxonomy Integration through Co-Bootstrapping
  • Dell Zhang, National University of Singapore
  • Wee Sun Lee, National University of Singapore
  • SIGIR 2004

2
  • Introduction

3
Problem Statement
  • Two web taxonomies file the same sites under different categories
  • First taxonomy
  • Games / Roleplaying: Final Fantasy Fan, Dragon Quest Home, EverQuest Addict, Warcraft III Clan
  • Games / Strategy: Shogun Total War, Warcraft III Clan
  • Second taxonomy
  • Games / Roleplaying: Final Fantasy Fan, Dragon Quest Home
  • Games / Strategy: Shogun Total War
  • Games / Online: EverQuest Addict, Warcraft III Clan
  • Games / Single-Player: Warcraft III Clan

4
Possible Approach
Train
  • Games / Roleplaying: Final Fantasy Fan, Dragon Quest Home
  • Games / Strategy: Shogun Total War
Classify
  • EverQuest Addict
  • Warcraft III Clan

  • Problem: this ignores the original Yahoo! categories
5
Another Approach (1/2)
  • Use the original Yahoo! categories as well
  • Advantage
  • the two taxonomies have similar categories
  • Potential problem
  • different structure
  • the categories do not match exactly

6
Another Approach (2/2)
  • Example: Crayon Shin-chan
  • Entertainment / Comics and Animation / Animation / Anime / Titles / Crayon Shin-chan
  • Arts / Animation / Anime / Titles / C / Crayon Shin-chan

7
This Paper's Approach
  • Weak Learner (as opposed to Naïve Bayes)
  • Boosting to combine the weak hypotheses
  • New idea: Co-Bootstrapping to exploit the source categories

8
Assumptions
  • Multi-category data are reduced to binary data
  • Totoro Fan in Cartoon / My Neighbor Totoro and in Toys / My Neighbor Totoro
  • is converted into two single-category examples:
  • Totoro Fan in Cartoon / My Neighbor Totoro
  • Totoro Fan in Toys / My Neighbor Totoro
  • Hierarchies are ignored
  • Console / Sega and Console / Sega / Dreamcast are treated as unrelated
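The reduction above can be sketched in Python (the helper name and the data are illustrative, not from the paper):

```python
def to_binary_pairs(labeled_docs):
    """Expand {doc: [category, ...]} into independent (doc, category) pairs."""
    return [(doc, cat) for doc, cats in labeled_docs.items() for cat in cats]

# A multi-category document becomes two binary training examples.
pairs = to_binary_pairs({
    "Totoro Fan": ["Cartoon / My Neighbor Totoro", "Toys / My Neighbor Totoro"],
})
```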

9
  • Weak Learner
  • Boosting
  • Co-Bootstrapping
  • This section: Weak Learner

10
Weak Learner
  • A type of classifier similar to Naïve Bayes
  • outputs + to accept a (document, category) pair
  • outputs − to reject it
  • A term may be a word, an n-gram, or ...

After training, the Weak Learner outputs a Weak Hypothesis (a term-based classifier)
11
Weak Hypothesis Example
  • Contains "Crayon Shin-chan"?
  • in Comics / Crayon Shin-chan
  • not in Education / Early Childhood
  • Does not contain "Crayon Shin-chan"?
  • not in Comics / Crayon Shin-chan
  • in Education / Early Childhood

12
Weak Learner Inputs (1/2)
  • Training data are in the form (x1, y1), (x2, y2), ..., (xm, ym)
  • xi is a document
  • yi is a category
  • (xi, yi) means document xi is in category yi
  • D(x, y) is a distribution over all combinations of xi and yj
  • D(xi, yj) indicates the importance of the pair (xi, yj)
  • w is the term (found automatically)

13
Weak Learner Algorithm
  • For each possible category y, compute four weight sums: W0+, W0−, W1+, W1−
  • Wb± sums D(xi, y) over the examples xi whose term-presence is b (1 if xi contains w, else 0) and whose membership in y is + (in y) or − (not in y)
  • Note: pairs (xi, y) with greater D(xi, y) have more influence.

14
Weak Hypothesis h(x, y)
  • Given an unclassified document x and a category y
  • If x contains w, then h(x, y) = ½ ln(W1+ / W1−)
  • Else, if x does not contain w, then h(x, y) = ½ ln(W0+ / W0−)

15
Weak Learner Comments
  • If sign(h(x, y)) is +, then x is predicted to be in y
  • |h(x, y)| is the confidence
  • The term w is found as follows:
  • repeatedly run the weak learner for every possible w
  • choose the run with the smallest Z = 2 Σb √(Wb+ Wb−) value as the model
  • Boosting minimizes the probability of h(x, y) having the wrong sign
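A hedged Python sketch of this term-based weak learner, in the style of the real-valued one-term stumps of Schapire and Singer's BoosTexter (which this kind of weak learner follows); the function names, the smoothing constant, and the toy data are mine:

```python
import math

def train_weak_learner(docs, labels, cats, D, vocab, eps=1e-6):
    """Pick the term w whose one-term stump minimizes Z; return (w, c0, c1),
    where c0[y] / c1[y] are the confidences output when w is absent / present."""
    best = None
    for w in vocab:
        # W[b][s][y]: total weight D(xi, y) of examples xi with term-presence
        # b (1 if w in xi) and membership s in y (1 if xi is in y, else 0).
        W = [[{y: 0.0 for y in cats} for _ in range(2)] for _ in range(2)]
        for i, doc in enumerate(docs):
            b = 1 if w in doc else 0
            for y in cats:
                s = 1 if y in labels[i] else 0
                W[b][s][y] += D[(i, y)]
        # Z = 2 * sum_b sum_y sqrt(Wb+ * Wb-): the value being minimized.
        Z = 2 * sum(math.sqrt(W[b][1][y] * W[b][0][y])
                    for b in range(2) for y in cats)
        if best is None or Z < best[0]:
            c = [{y: 0.5 * math.log((W[b][1][y] + eps) / (W[b][0][y] + eps))
                  for y in cats} for b in range(2)]
            best = (Z, w, c[0], c[1])
    return best[1], best[2], best[3]

def weak_hypothesis(w, c0, c1, doc, y):
    """h(x, y): signed confidence that document `doc` belongs to category y."""
    return (c1 if w in doc else c0)[y]

# Toy demo (illustrative data): "shogun" separates Strategy from Roleplaying.
docs = [{"shogun"}, {"everquest"}]
labels = [{"Strategy"}, {"Roleplaying"}]
cats = ["Strategy", "Roleplaying"]
D = {(i, y): 0.25 for i in range(2) for y in cats}  # uniform over 4 pairs
w, c0, c1 = train_weak_learner(docs, labels, cats, D, ["shogun", "everquest"])
```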

16
  • Weak Learner
  • Boosting
  • Co-Bootstrapping
  • This section: Boosting (AdaBoost.MH)

17
Boosting Idea
  • Train the weak learner on different Dt(x, y)
    distributions
  • After each run, adjust Dt(x, y) by putting more
    weight on the most often misclassified training
    data
  • Output the final hypothesis as a linear
    combination of weak hypotheses

18
Boosting Algorithm
  • Given (x1, y1), (x2, y2), ..., (xm, ym), where xi ∈ X and yi ∈ Y
  • Initialize D1(x, y) = 1/(mk)
  • for t = 1, ..., T do
  • Pass distribution Dt to the weak learner
  • Get weak hypothesis ht(x, y)
  • Choose αt ∈ ℝ
  • Update Dt+1(x, y) = Dt(x, y) exp(−αt Y[y] ht(x, y)) / Zt, where Y[y] = +1 if x is in y (else −1) and Zt normalizes Dt+1 to a distribution
  • end for
  • Output the final hypothesis H(x, y) = sign(Σt αt ht(x, y))

19
Boosting Algorithm Initialization
  • Given (x1, y1), (x2, y2), ..., (xm, ym)
  • Initialize D1(x, y) = 1/(mk)
  • k = total number of categories
  • i.e. the uniform distribution

20
Boosting Algorithm Loop
  • for t = 1, ..., T do
  • Run the weak learner using distribution D
  • Get weak hypothesis ht(x, y)
  • For each pair (x, y) in the training data:
  • If ht(x, y) guesses incorrectly, increase D(x, y)
  • end for
  • return the combined hypothesis
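The loop above can be sketched as follows: a simplified AdaBoost.MH with αt = 1 folded into the real-valued weak hypotheses, and the weak learner passed in as a callback. The fixed stump in the demo is a stand-in for the actual weak learner, and the data are illustrative:

```python
import math

def adaboost_mh(examples, cats, weak_learn, T):
    """examples: list of (x, set_of_true_categories); weak_learn(D) -> h(x, y).
    Returns the combined hypothesis H(x, y) = sum of the h_t(x, y)."""
    m, k = len(examples), len(cats)
    D = {(i, y): 1.0 / (m * k) for i in range(m) for y in cats}  # uniform init
    hyps = []
    for _ in range(T):
        h = weak_learn(D)
        hyps.append(h)
        # Y = +1 if x is truly in y, else -1; a sign mismatch with h(x, y)
        # makes exp(...) > 1, so misclassified pairs gain weight next round.
        for i, (x, true_cats) in enumerate(examples):
            for y in cats:
                Y = 1.0 if y in true_cats else -1.0
                D[(i, y)] *= math.exp(-Y * h(x, y))
        Z = sum(D.values())             # Z_t: renormalize to a distribution
        for key in D:
            D[key] /= Z
    return lambda x, y: sum(h(x, y) for h in hyps)

def fixed_stump(D):
    # Stand-in weak learner: ignores D and always returns the same stump.
    return lambda x, y: 0.5 if ("shogun" in x) == (y == "Strategy") else -0.5

examples = [({"shogun"}, {"Strategy"}), ({"everquest"}, {"Roleplaying"})]
cats = ["Strategy", "Roleplaying"]
H = adaboost_mh(examples, cats, fixed_stump, T=3)
```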

21
  • Weak Learner
  • Boosting
  • Co-Bootstrapping
  • This section: Co-Bootstrapping

22
Co-Bootstrapping Idea
  • We want to use Yahoo! categories to increase
    classification accuracy

23
Recall Example Problem
  • Games / Online: EverQuest Addict, Warcraft III Clan
  • Games / Single-Player: Warcraft III Clan
  • Games / Roleplaying: Final Fantasy Fan, Dragon Quest Home
  • Games / Strategy: Shogun Total War

24
Co-Bootstrapping Algorithm (1/4)
  • 1. Run AdaBoost on Yahoo! sites
  • Get classifier Y1
  • 2. Run AdaBoost on Google sites
  • Get classifier G1
  • 3. Run Y1 on Google sites
  • Get predicted Yahoo! categories for Google sites
  • 4. Run G1 on Yahoo! sites
  • Get predicted Google categories for Yahoo! sites

25
Co-Bootstrapping Algorithm (2/4)
  • 5. Run AdaBoost on Yahoo! sites
  • Include Google category as a feature
  • Get classifier Y2
  • 6. Run AdaBoost on Google sites
  • Include Yahoo! category as a feature
  • Get classifier G2
  • 7. Run Y2 on the original Google sites
  • Get more accurate Yahoo! categories for the Google sites
  • 8. Run G2 on the original Yahoo! sites
  • Get more accurate Google categories for the Yahoo! sites

26
Co-Bootstrapping Algorithm (3/4)
  • 9. Run AdaBoost on Yahoo! sites
  • Include Google category as a feature
  • Get classifier Y3
  • 10. Run AdaBoost on Google sites
  • Include Yahoo! category as a feature
  • Get classifier G3
  • 11. Run Y3 on the original Google sites
  • Get even more accurate Yahoo! categories for the Google sites
  • 12. Run G3 on the original Yahoo! sites
  • Get even more accurate Google categories for the Yahoo! sites

27
Co-Bootstrapping Algorithm (4/4)
  • Repeat, repeat, and repeat
  • Hopefully, the classification will become more
    accurate after each iteration
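A minimal sketch of the iterative scheme in steps 1-12, with a deliberately toy classifier standing in for AdaBoost; the `train` heuristic, the `Y:`/`G:` pseudo-feature encoding, and the data are all illustrative assumptions:

```python
def train(docs, labels):
    """Toy stand-in for the AdaBoost classifier: maps each term to the first
    category it was seen with, and predicts from the first known term."""
    model = {}
    for doc, cat in zip(docs, labels):
        for term in sorted(doc):
            model.setdefault(term, cat)
    def predict(doc):
        for term in sorted(doc):
            if term in model:
                return model[term]
        return None
    return predict

def co_bootstrap(y_docs, y_labels, g_docs, g_labels, iterations=3):
    """Return predicted Yahoo!-side categories for the Google-side docs."""
    y_feats = [set(d) for d in y_docs]
    g_feats = [set(d) for d in g_docs]
    for _ in range(iterations):
        clf_y = train(y_feats, y_labels)   # steps 1/5/9: Yahoo!-side classifier
        clf_g = train(g_feats, g_labels)   # steps 2/6/10: Google-side classifier
        # Steps 3-4 / 7-8 / 11-12: add each side's predicted category on the
        # other taxonomy as an extra pseudo-feature, then retrain next round.
        g_feats = [d | {"Y:%s" % clf_y(d)} for d in g_feats]
        y_feats = [d | {"G:%s" % clf_g(d)} for d in y_feats]
    clf_y = train(y_feats, y_labels)
    return [clf_y(d) for d in g_feats]

# Toy demo (hypothetical data): the Google doc containing "ff" lands in the
# Yahoo!-side category "Role".
preds = co_bootstrap([{"ff"}, {"shogun"}], ["Role", "Strat"], [{"ff"}], ["RPG"])
```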

28
  • Enhanced Naïve Bayes
  • (Benchmark)

29
Enhanced Naïve Bayes (1/2)
  • Given
  • document x
  • source category S of x
  • Predict master category C
  • In NB, PrC x ? PrC ?w?x(Prw C)n(x,w)
  • w word
  • n(x,w) number of occurrences of w in x
  • PrC x, S ? PrC S ?w?x(Prw C)n(x,w)

30
Enhanced Naïve Bayes (2/2)
  • PrC
  • Estimate PrC S ?
  • C ? S number of docs in S that is classified
    into C by NB classifier
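The ENB score above can be sketched in log space as follows; the smoothing floor for unseen words and the toy probabilities are my simplifications, not the paper's exact estimator:

```python
import math

def enb_log_score(doc_counts, C, prior_CS, word_probs, floor=1e-6):
    """log of Pr[C | x, S] up to an additive constant:
       log Pr[C | S] + sum over words w of n(x, w) * log Pr[w | C]."""
    score = math.log(prior_CS[C])
    for w, n in doc_counts.items():
        # `floor` stands in for proper smoothing of unseen words.
        score += n * math.log(word_probs[C].get(w, floor))
    return score

# Toy demo (made-up probabilities): Pr[C | S] comes from the |C ∩ S| counts.
prior_CS = {"A": 0.8, "B": 0.2}
word_probs = {"A": {"anime": 0.5}, "B": {"anime": 0.1}}
score_a = enb_log_score({"anime": 2}, "A", prior_CS, word_probs)
score_b = enb_log_score({"anime": 2}, "B", prior_CS, word_probs)
```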

31
  • Experiment

32
Datasets
33
Number of Categories/Dataset (1/2)
Top level categories only
34
Number of Categories/Dataset (2/2)
  • Book
  • Horror
  • Science Fiction
  • Non-fiction
  • Biography
  • History

Biography and History are merged into Non-fiction
35
Number of Websites
36
Method (1/2)
  • Classify Yahoo! Book websites into Google Book categories (G ← Y)
  • Find G ∩ Y for Book
  • Hide the Google categories of the sites in G ∩ Y
  • G ∩ Y ⊆ Yahoo! Book
  • Randomly take |G ∩ Y| sites from G − Y ⊆ Google Book

37
Method (2/2)
  • For each dataset, do G ← Y five times and Y ← G five times
  • macro F-score: calculate the F-score for each category, then average over all categories
  • micro F-score: calculate the F-score on the entire dataset
  • recall 100?
  • The paper doesn't say anything about multi-category ENB
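The two averaging schemes can be sketched for single-label predictions as follows (function name and demo labels are illustrative):

```python
def f_scores(true, pred, cats):
    """Return (macro_F1, micro_F1) for parallel single-label lists."""
    per_cat, tp, fp, fn = [], 0, 0, 0
    for c in cats:
        ctp = sum(1 for t, p in zip(true, pred) if t == c and p == c)
        cfp = sum(1 for t, p in zip(true, pred) if t != c and p == c)
        cfn = sum(1 for t, p in zip(true, pred) if t == c and p != c)
        tp, fp, fn = tp + ctp, fp + cfp, fn + cfn
        denom = 2 * ctp + cfp + cfn
        per_cat.append(2 * ctp / denom if denom else 0.0)
    macro = sum(per_cat) / len(cats)      # macro: average of per-category F1
    micro = 2 * tp / (2 * tp + fp + fn)   # micro: F1 from pooled counts
    return macro, micro

macro, micro = f_scores(["A", "A", "A", "B"], ["A", "A", "B", "B"], ["A", "B"])
```

Macro weights every category equally, so rare categories count as much as large ones; micro is dominated by the large categories.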

38
Results (1/3)
  • Co-Bootstrapping-AdaBoost outperforms plain AdaBoost

macro-averaged F scores
micro-averaged F scores
39
Results (2/3)
  • Co-Bootstrapping-AdaBoost iteratively improves
    AdaBoost

Book Dataset
40
Results (3/3)
  • Co-Bootstrapping-AdaBoost outperforms Enhanced Naïve Bayes

macro-averaged F scores
micro-averaged F scores
41
Contribution
  • Co-Bootstrapping improves Boosting performance
  • Does not require a manually tuned parameter, as ENB does