Enhanced topic distillation using text, markup tags, and hyperlinks - PowerPoint PPT Presentation

About This Presentation
Title:

Enhanced topic distillation using text, markup tags, and hyperlinks

Description:

<tr><td><a href='http://art.qaz.com'>art</a></td></tr> ... measure cosine similarity of authorities with root-set centroid in vector space ...

Slides: 18
Provided by: cseIi3

Transcript and Presenter's Notes

Title: Enhanced topic distillation using text, markup tags, and hyperlinks


1
Enhanced topic distillation using text, markup
tags, and hyperlinks
  • Soumen Chakrabarti, Mukul Joshi, Vivek Tawde
  • www.cse.iitb.ac.in/soumen

2
Topic distillation
  • Given a query or some example URLs
  • Collect a relevant subgraph (community) of the
    Web
  • Bipartite reinforcement between hubs and
    authorities
  • Prototypes
  • HITS and Clever
  • Bharat and Henzinger

[Figure labels: Keyword query, Search engine, Root set, Expanded set]
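
The bipartite reinforcement between hubs and authorities can be sketched as the classic HITS iteration. This is a minimal illustration, not the presenters' code: the function name `hits`, the toy edge list, and the choice of L2 normalization are assumptions made for the example.

```python
import math

def hits(edges, iterations=50):
    """Plain HITS on a list of (source_page, target_page) links."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # Hub score: sum of authority scores of the pages it links to.
        new_hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_hub[u] += auth[v]
        hub = new_hub
        # L2-normalize both score vectors so they stay bounded.
        for scores in (hub, auth):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Toy graph (hypothetical): h1 cites a1 and a2; h2 and h3 cite only a1.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
```

In this toy graph `a1` ends up with a higher authority score than `a2`, and `h1` with a higher hub score than `h2`, because each page's score is fed by the scores on the other side of the bipartite graph.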
3
Challenges and limitations
  • Web authoring style in flux since 1996
  • Complex pages generated from templates
  • File or page boundary less meaningful
  • Clique attacks: rampant multi-host nepotism
    via rings, ads, banner exchanges
  • Models are too simplistic
  • Hub and authority symmetry is illusory
  • Coarse-grain hub model leaks authority
  • Ad-hoc linear segmentation not content-aware
  • Deteriorating results of topic distillation

4
Clique attacks!
Irrelevant links form pseudo-community
Relevant regions that lead to inclusion of page
in base set
5
Benign drift and generalization
Remaining sections generalize and/or drift
This section specializes on Shakespeare
6
A new fine-grained model
<html><body>
<table>
  <tr><td>
    <table>
      <tr><td><a href='http://art.qaz.com'>art</a></td></tr>
      <tr><td><a href='http://ski.qaz.com'>ski</a></td></tr>
    </table>
  </td></tr>
  <tr><td>
    <ul>
      <li><a href='http://www.fromages.com'>Fromages.com</a> French cheese</li>
      <li><a href='http://www.teddingtoncheese.co.uk'>Teddington</a> Buy online</li>
    </ul>
  </td></tr>
</table>
</body></html>
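
To make the fine-grained view concrete, the sketch below walks a page like the one on this slide and records each link together with the tag path leading to it, so that links in the same DOM subtree can be grouped. It uses Python's standard `html.parser`; the class name `LinkPathParser` is hypothetical, and plain tag paths are an assumption that does not distinguish sibling subtrees with identical tags.

```python
from html.parser import HTMLParser

class LinkPathParser(HTMLParser):
    """Record every <a href> together with its tag path from the root."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.links = []   # (tag_path, href) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/".join(self.stack), href))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

page = """<html><body><table>
<tr><td><table>
<tr><td><a href='http://art.qaz.com'>art</a></td></tr>
<tr><td><a href='http://ski.qaz.com'>ski</a></td></tr>
</table></td></tr>
<tr><td><ul>
<li><a href='http://www.fromages.com'>Fromages.com</a> French cheese</li>
<li><a href='http://www.teddingtoncheese.co.uk'>Teddington</a> Buy online</li>
</ul></td></tr>
</table></body></html>"""

parser = LinkPathParser()
parser.feed(page)
for path, href in parser.links:
    print(path, href)
```

The two qaz.com links share one path (through the nested table) and the two cheese links share another (through the list), which is the kind of grouping a page-level hub model cannot see.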
7
Generative model for hub text
  • Global hub text distribution Φ0 relevant to given
    query
  • Authors use internal DOM nodes to specialize Φ0
    into ΦI
  • At a certain frontier in the DOM tree, local
    distribution directly generates text in hot and
    cold subtrees

[Figure labels: Global term distribution Φ0, Progressive distortion, Model frontier, ΦI, Other pages]
8
A balanced cost measure
  • Reference distribution Φ0
  • Cumulative distortion cost: KL(Φ0 ‖ Φu) + ... + KL(Φu ‖ Φv)
    (for exponential distribution)
  • Data encoding cost for Dv is roughly -Σ log Pr(d | Φv),
    summed over d in Dv
  • Goal: find the minimum-cost frontier

[Figure labels: u, v, Dv]
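
The two cost terms can be illustrated with a small numeric sketch. It assumes discrete (multinomial) term distributions over a two-word vocabulary rather than the exponential family mentioned on the slide, and all names and numbers (`phi_0`, `phi_v`, `d_v`) are made up for the example.

```python
import math

def kl(p, q):
    """KL(p || q) in bits for two term distributions on the same vocabulary."""
    return sum(pi * math.log2(pi / q[t]) for t, pi in p.items() if pi > 0)

def encoding_cost(terms, model):
    """Bits needed to encode the observed terms under `model`."""
    return -sum(math.log2(model[t]) for t in terms)

# Hypothetical reference model and a specialized model at node v.
phi_0 = {"cheese": 0.5, "ski": 0.5}
phi_v = {"cheese": 0.9, "ski": 0.1}

# Hypothetical text under v: mostly about cheese.
d_v = ["cheese"] * 9 + ["ski"]

# Option 1: encode the text directly with the reference model.
cost_reference = encoding_cost(d_v, phi_0)
# Option 2: pay the distortion cost once, then encode with the better model.
cost_specialized = kl(phi_0, phi_v) + encoding_cost(d_v, phi_v)
```

With this data the specialized model is cheaper overall: the one-time distortion cost is outweighed by the savings on every encoded term. With only a few terms under v the balance can tip the other way, which is what makes the measure "balanced".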
9
Marking hot subtrees
  • Hard to solve exactly (knapsack)
  • (1+ε) dynamic programming solution
  • Too slow for 10 million DOM nodes
  • Greedy expansion approach: at each node v,
    compare the cost of
  • Directly encoding Dv w.r.t. model Φv at v
  • First distorting Φv to Φw for each child w of v,
    then encoding all Dw w.r.t. respective Φw
  • If the latter is smaller, expand v; else prune
  • Mark relevant subtrees as must-prune
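
The greedy expansion rule above might look like the following sketch. It is a simplification under stated assumptions, not the presenters' implementation: each node's model is a Laplace-smoothed multinomial estimated from the text in its subtree, text lives only at leaves, and the names `Node`, `model`, and `greedy_frontier` are hypothetical.

```python
import math
from collections import Counter

class Node:
    def __init__(self, name, terms=(), children=()):
        self.name = name
        self.children = list(children)
        # D_v: all terms in the subtree rooted at this node.
        self.terms = list(terms) + [t for c in self.children for t in c.terms]

def model(terms, vocab, alpha=1.0):
    """Laplace-smoothed term distribution (an assumed estimator)."""
    counts = Counter(terms)
    total = len(terms) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}

def kl(p, q):
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p)

def encode(terms, phi):
    return -sum(math.log2(phi[t]) for t in terms)

def greedy_frontier(root):
    vocab = sorted(set(root.terms))
    frontier, stack = [], [root]
    while stack:
        v = stack.pop()
        phi_v = model(v.terms, vocab)
        direct = encode(v.terms, phi_v)          # encode D_v at v
        split = 0.0
        for w in v.children:                     # distort, then encode each D_w
            phi_w = model(w.terms, vocab)
            split += kl(phi_v, phi_w) + encode(w.terms, phi_w)
        if v.children and split < direct:
            stack.extend(v.children)             # expand v
        else:
            frontier.append(v.name)              # prune: v joins the frontier
    return sorted(frontier)

# Toy tree: A is all "cheese"; B has two children that are both all "ski".
a = Node("A", terms=["cheese"] * 20)
b = Node("B", children=[Node("B1", ["ski"] * 10), Node("B2", ["ski"] * 10)])
root = Node("root", children=[a, b])
result = greedy_frontier(root)
```

Here the root is expanded because its two children have very different term distributions, while B is pruned because splitting it into two near-identical children costs extra distortion without saving any encoding bits.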

10
Exploiting co-citation in our model
[Figure, four panels]
  • Initial values of leaf hub scores = target auth
    scores (0.10, 0.20, 0.01, 0.06, 0.05, 0.13)
  • Must-prune nodes are marked
  • Frontier microhubs accumulate scores
    (0.01 + 0.06 + 0.05 = 0.12)
  • Aggregate hub scores are copied back to leaves
    (0.10, 0.20, 0.12, 0.12, 0.12, 0.13)
Annotations: known authorities; have reason to
believe these could be good too
Non-linear transform, unlike HITS
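
The score smoothing on this slide can be sketched with the numbers from its figure. The function name `smooth_hub_scores` and the representation of the frontier as lists of leaf indices are assumptions; leaves outside any group (for example must-prune leaves) simply keep their own score here.

```python
def smooth_hub_scores(leaf_scores, frontier_groups):
    """Sum leaf hub scores inside each frontier microhub, then copy
    the aggregate back to every leaf of that microhub."""
    smoothed = list(leaf_scores)
    for group in frontier_groups:
        total = sum(leaf_scores[i] for i in group)
        for i in group:
            smoothed[i] = total
    return smoothed

# Leaf hub scores initialized from the authority scores of their targets.
leaf_scores = [0.10, 0.20, 0.01, 0.06, 0.05, 0.13]
# Assumed frontier: leaves 2, 3 and 4 hang under one microhub.
frontier_groups = [[2, 3, 4]]
smoothed = smooth_hub_scores(leaf_scores, frontier_groups)
print([round(s, 2) for s in smoothed])
```

Because the grouping itself depends on the current segmentation of the DOM tree, the overall update is not a fixed linear map of the scores, which is the contrast with HITS noted on the slide.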
11
Complete algorithm
  • Collect root set and base set
  • Pre-segment using text and mark relevant
    micro-hubs to be pruned
  • Initialize authority scores to 1 for root-set
    pages only
  • Iterate:
  • Transfer from authority to hub leaves
  • Re-segment hub DOM trees using link text
  • Smooth and redistribute hub scores
  • Transfer from hub leaves to authority roots
  • Report top authority and hot microhubs
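
A heavily simplified sketch of the iteration follows. It assumes one link per DOM leaf, a fixed microhub grouping (the re-segmentation step is omitted), authority scores of 0 outside the root set at the start, and sum-to-one normalization; the function name `distill` and the toy data are hypothetical.

```python
def distill(links, groups, root_set, iterations=20):
    """links: (leaf_id, target_url) pairs, one link per hub leaf.
    groups: lists of leaf ids forming microhubs (held fixed here).
    root_set: URLs whose authority score starts at 1."""
    targets = {t for _, t in links}
    auth = {t: (1.0 if t in root_set else 0.0) for t in targets}
    hub = {}
    for _ in range(iterations):
        # Transfer from authorities to hub leaves.
        hub = {leaf: auth[t] for leaf, t in links}
        # Smooth and redistribute hub scores within each microhub.
        for group in groups:
            total = sum(hub[leaf] for leaf in group)
            for leaf in group:
                hub[leaf] = total
        # Transfer from hub leaves back to authorities.
        auth = {t: 0.0 for t in targets}
        for leaf, t in links:
            auth[t] += hub[leaf]
        norm = sum(auth.values()) or 1.0
        auth = {t: s / norm for t, s in auth.items()}
    return auth, hub

# Toy data: leaves 0 and 1 share a microhub; leaf 2 stands alone.
links = [(0, "a"), (1, "b"), (2, "x")]
auth, hub = distill(links, groups=[[0, 1]], root_set={"a"})
```

In this toy run, page "b" gains authority because it is co-cited in the same microhub as the root-set page "a", while "x", cited from an unrelated leaf, stays at zero.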

12
Experimental setup
  • Large data sets
  • 28 queries from Clever, >20 topics from Dmoz
  • Collect 2,000 to 10,000 pages per query/topic
  • Several million DOM nodes and fine links
  • Find top authorities using various algos
  • For ad-hoc query, measure cosine similarity of
    authorities with root-set centroid in vector
    space
  • For Dmoz, use an automatic classifier

13
Avoiding topic drift via micro-hubs
Query "cycling": no danger of topic drift
Query "affirmative action": topic drift from
software sites
14
Results for the Clever benchmark
  • Take top 40 auths
  • Find average cosine similarity to root set
    centroid
  • HITS < DOM+Text < DOM similarity
  • DOM alone cannot prune well enough; most top
    auths come from the root set
  • HITS drifts often

15
Dmoz experiments and results
  • 223 topics from http://dmoz.org
  • Sample root set URLs from a class c
  • Top authorities not in root set submitted to
    Rainbow classifier
  • Σd Pr(c | d) is the expected number of relevant
    documents
  • DOM+Text best
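
The reason the sum of classifier posteriors counts relevant documents is linearity of expectation. The step below assumes that the classifier's posterior Pr(c | d) is read as the probability that authority d is relevant to class c, with d ranging over the reported top authorities that are not in the root set.

```latex
\mathbb{E}\Big[\sum_{d} \mathbf{1}[d \text{ relevant to } c]\Big]
  = \sum_{d} \mathbb{E}\big[\mathbf{1}[d \text{ relevant to } c]\big]
  = \sum_{d} \Pr(c \mid d)
```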

[Figure labels: DMoz, Rainbow classifier, Train, Sample, Test, Music, Root set, Expanded set, Top authority]
16
Anecdotes
  • amusement parks: http://www.411fun.com/THEMEPARKS
    leaks authority via nepotistic links to
    www.411florists.com, www.411fashion.com,
    www.411eshopping.com, etc.
  • New algorithm reduces drift
  • Mixed hubs accurately segmented, e.g. amusement
    parks, classical guitar, Shakespeare and sushi
  • Mixed hubs in top 50 for 13/28 queries

17
Conclusion and ongoing work
  • Hypertext shows complex idioms, missed by
    coarse-grained graph model
  • Enhanced fine-grained distillation
  • Identifies content-bearing hot micro-hubs
  • Disaggregates hub scores
  • Reduces topic drift caused by mixed hubs and
    pseudo-communities
  • Application: topic-based focused crawling
  • Need probabilistic combination of evidence from
    text and links