1
Enhanced topic distillation using text, markup tags, and hyperlinks
  • Soumen Chakrabarti, Mukul Joshi, Vivek Tawde
  • www.cse.iitb.ac.in/soumen

2
Topic distillation
  • Given a query or some example URLs
  • Collect a relevant subgraph (community) of the
    Web
  • Bipartite reinforcement between hubs and
    authorities
  • Prototypes
  • HITS, Clever, SALSA
  • Bharat and Henzinger

(Diagram: a keyword query is sent to a search engine, which returns the root set; link expansion grows it into the expanded/base set.)
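For orientation, the hub/authority reinforcement shared by these prototypes can be sketched as a simple power iteration over the base-set link graph. This is a generic HITS-style sketch in Python, not the enhanced algorithm of this talk; the toy link graph is invented for illustration.

# Generic HITS-style hub/authority reinforcement on a small toy graph
# (illustration only; the graph below is made up).
def hits(links, iters=50):
    """links: dict mapping page -> list of pages it points to."""
    pages = set(links) | {v for tgts in links.values() for v in tgts}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority score: sum of hub scores of pages pointing to it
        auth = {p: sum(hub[u] for u, tgts in links.items() if p in tgts) for p in pages}
        # hub score: sum of authority scores of the pages it points to
        hub = {p: sum(auth[v] for v in links.get(p, ())) for p in pages}
        # L2-normalize to keep scores bounded
        for vec in (auth, hub):
            norm = sum(x * x for x in vec.values()) ** 0.5 or 1.0
            for p in vec:
                vec[p] /= norm
    return hub, auth

toy = {"hub1": ["a", "b"], "hub2": ["a", "c"], "a": [], "b": [], "c": []}
print(hits(toy))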
3
Two issues
  • How to collect the base set
  • Radius-1 expansion is arbitrary
  • Content relevance must play a role
  • How to spread prestige along links
  • Instability of HITS (Borodin, Lempel, Zheng)
  • Stability of PageRank (Zheng)
  • Stochastic variants of HITS (Lempel)
  • Need better recall when collecting the base graph
  • Need accurate boundaries around it

4
Challenges and limitations
  • Topic distillation results deteriorating
  • Web authoring style in flux since 1996
  • Complex pages, templates, cloaks
  • File or page boundary less meaningful
  • Clique attacks: rampant multi-host nepotism via rings, ads, banner exchanges
  • Models too simplistic
  • Hub and authority symmetry is illusory
  • Coarse-grain hub model leaks authority
  • Ad-hoc linear segmentation not content-aware

5
Clique attacks!
(Diagram: irrelevant links form a pseudo-community; relevant regions lead to inclusion of the page in the base set.)
6
Benign drift and generalization
(Diagram: this section specializes on Shakespeare; the remaining sections generalize and/or drift.)
7
A fine-grained hypertext model
<html><body>
 <table ...>
  <tr><td>
   <table ...>
    <tr><td><a href="http://art.qaz.com">art</a></td></tr>
    <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>
   </table>
  </td></tr>
  <tr><td>
   <ul>
    <li><a href="http://www.fromages.com">Fromages.com</a> French cheese</li>
    <li><a href="http://www.teddingtoncheese.co.uk">Teddington</a> Buy online</li>
   </ul>
  </td></tr>
 </table>
</body></html>
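In a fine-grained model, the hub-side unit is a DOM node rather than a whole page, so each <a> element becomes its own micro-hub leaf. Below is a minimal sketch, using Python's standard html.parser, of extracting (DOM path, target URL) pairs from markup like the example above; the path encoding (and the omission of sibling positions) is our own simplification, not the paper's representation.

from html.parser import HTMLParser

class FineGrainedLinks(HTMLParser):
    """Collect (DOM path, href) pairs so each <a> is treated as a micro-hub leaf."""
    def __init__(self):
        super().__init__()
        self.stack = []      # open tags from the root to the current node
        self.links = []      # (path, href) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/".join(self.stack), href))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop back to (and including) the matching open tag
            while self.stack and self.stack.pop() != tag:
                pass

parser = FineGrainedLinks()
parser.feed('<html><body><ul><li><a href="http://www.fromages.com">cheese</a></li></ul></body></html>')
print(parser.links)   # [('html/body/ul/li/a', 'http://www.fromages.com')]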
8
Preliminary approaches
  • Apply HITS to fine-grained base graph
  • Blocked reinforcement
  • Model DOM trees as resistance or flow networks
  • Ad-hoc decay factors
  • Apply BH outlier elimination to every DOM node
  • Hot absorbs cold, includes drift-enhancing links

(Diagram: DOM regions range from cold to hot; a region warm enough figures as one hub.)
9
Generative model for hub text
  • Global hub text distribution Θ0 relevant to the given query
  • Authors use internal DOM nodes to hierarchically specialize Θ0 into Θi
  • At a certain frontier, the local models are frozen and the text is generated

(Diagram: the global term distribution Θ0 is progressively distorted down the DOM tree to a model frontier Θi; hyperlinks lead to other pages.)
10
Examples using the binary model
  • Binary model
  • Code length for document d
  • Cost for specializing a term distribution
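As a hedged reading of these bullets: in a binary (Bernoulli) model each term either occurs in a document or not with some per-term probability, the code length of a document is its negative log-likelihood under those probabilities, and the cost of specializing a parent term distribution into a child's can be taken as a sum of per-term Bernoulli KL divergences. The sketch below illustrates that reading; the exact formulas and the toy numbers are assumptions, not taken from the slides.

import math

def code_length(doc_terms, theta, vocab):
    """Bits to encode which vocabulary terms occur in doc_terms,
    under per-term occurrence probabilities theta[t] (binary model)."""
    bits = 0.0
    for t in vocab:
        p = theta[t]
        bits += -math.log2(p) if t in doc_terms else -math.log2(1.0 - p)
    return bits

def kl_binary(theta_child, theta_parent, vocab):
    """Cost of specializing the parent distribution into the child's:
    sum of per-term Bernoulli KL divergences."""
    cost = 0.0
    for t in vocab:
        p, q = theta_child[t], theta_parent[t]
        cost += p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))
    return cost

vocab = ["cheese", "ski", "art"]
theta0 = {"cheese": 0.2, "ski": 0.2, "art": 0.2}       # reference distribution (made up)
theta_u = {"cheese": 0.7, "ski": 0.05, "art": 0.05}    # specialized for a cheese section (made up)
print(code_length({"cheese"}, theta_u, vocab), kl_binary(theta_u, theta0, vocab))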

11
Discovering the frontier
Reference distribution Θ0; cumulative distortion cost KL(Θ0 || Θu); KL(Θu || Θv)
  • Use Θu to directly generate the text snippets in the subtree rooted at u
  • Or expand to the children v and use different parameters for each subtree
  • Greedily pick the locally better choice (see the sketch after the diagram below)

(Diagram: DOM node u with child v and the subtree Dv below it.)
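A hedged sketch of that greedy choice: at node u, compare the cost of generating all text in Du directly from Θu against the cost of expanding to the children v (paying a distortion cost per child plus each child's own cost), and keep whichever is locally cheaper. The tree encoding, cost callbacks, and toy numbers are illustrative assumptions.

def discover_frontier(node, cost_here, expand_cost, children):
    """Greedy frontier discovery on a DOM tree.

    cost_here(u)      : bits to generate all text in the subtree Du directly from u's model
    expand_cost(u, v) : distortion cost (e.g. KL(theta_u || theta_v)) of specializing for child v
    children(u)       : list of u's child nodes
    Returns the list of frontier nodes whose local models are frozen.
    """
    kids = children(node)
    prune_cost = cost_here(node)
    # local comparison: freeze the model at this node, or pay the per-child
    # distortion and let each child model its own subtree
    expand_total = sum(expand_cost(node, v) + cost_here(v) for v in kids)
    if not kids or prune_cost <= expand_total:
        return [node]                      # freeze: node joins the frontier
    frontier = []
    for v in kids:                         # expand: recurse greedily into the children
        frontier += discover_frontier(v, cost_here, expand_cost, children)
    return frontier

# toy tree and made-up costs, purely illustrative
tree = {"root": ["a", "b"], "a": [], "b": []}
costs = {"root": 10.0, "a": 3.0, "b": 4.0}
print(discover_frontier("root",
                        cost_here=lambda u: costs[u],
                        expand_cost=lambda u, v: 1.0,
                        children=lambda u: tree[u]))   # -> ['a', 'b']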
12
Exploiting co-citation in our model
(Diagram, four panels:
 1. Initial values of leaf hub scores are the target authority scores.
 2. Segment the tree using the hub scores.
 3. Frontier micro-hubs accumulate scores (a non-linear transform, unlike HITS).
 4. Aggregate hub scores are copied back to the leaves, giving reason to believe that other pages co-cited with known authorities could be good too.)
13
Complete algorithm
  • Collect root set and base set
  • Pre-segment using text and mark relevant
    micro-hubs to be pruned
  • Assign authority score 1 only to root-set pages
  • Iterate
  • Transfer from authority to hub leaves
  • Re-segment hub DOM trees using link text
  • Smooth and redistribute hub scores
  • Transfer from hub leaves to authority roots
  • Report top authorities and hot micro-hubs
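A compressed, hedged sketch of the iteration: authority scores flow to the hub DOM leaves that point at them, the leaf scores are smoothed by the tree segmenter, and the smoothed scores flow back to authority pages. The function and argument names, the data layout, and the identity 'segment' stub below are illustrative assumptions, not the paper's exact update equations.

def distill(root_auth, leaf_links, segment, iters=20):
    """Hedged sketch of the iterative loop.

    root_auth : set of root-set pages, whose authority starts at 1
    leaf_links: dict mapping a hub DOM leaf -> the target page it points to
    segment(h): maps leaf hub scores to smoothed leaf hub scores by
                aggregating at frontier micro-hubs and copying back
    """
    pages = set(leaf_links.values()) | set(root_auth)
    auth = {p: (1.0 if p in root_auth else 0.0) for p in pages}
    for _ in range(iters):
        # authority -> hub leaves: a leaf inherits its target's authority
        hub = {leaf: auth[tgt] for leaf, tgt in leaf_links.items()}
        # re-segment and smooth hub scores along the DOM tree
        hub = segment(hub)
        # hub leaves -> authorities: sum the (smoothed) leaf scores pointing at each page
        auth = {p: 0.0 for p in pages}
        for leaf, tgt in leaf_links.items():
            auth[tgt] += hub[leaf]
        # normalize authorities
        norm = max(auth.values()) or 1.0
        auth = {p: a / norm for p, a in auth.items()}
    return auth

# identity smoothing keeps the sketch runnable; a real segmenter would pool scores at frontier micro-hubs
print(distill({"p1"}, {"leafA": "p1", "leafB": "p2"}, segment=lambda h: h))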

14
Experimental setup
  • Large data sets
  • 28 queries from Clever, >20 topics from Dmoz
  • Collect 2000–10000 pages per query/topic
  • Several million DOM nodes and fine links
  • Find top authorities using various algos
  • Measurements and anecdotes
  • For ad-hoc queries, measure the cosine similarity of authorities to the root-set centroid in vector space
  • Compare HITS, DOM, and DOMText
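The similarity measurement can be sketched as: build a term-frequency vector per document, take the centroid of the root-set vectors, and average the cosine similarity of the returned authorities to that centroid. Tokenization, weighting, and the toy documents below are assumptions; the original setup's vectorization details are not reproduced here.

import math
from collections import Counter

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    c = Counter()
    for v in vectors:
        c.update(v)
    return {t: x / len(vectors) for t, x in c.items()}

def avg_similarity_to_root_centroid(auth_docs, root_docs):
    """auth_docs, root_docs: lists of token lists; returns mean cosine to the root-set centroid."""
    root_vecs = [Counter(d) for d in root_docs]
    c = centroid(root_vecs)
    return sum(cosine(Counter(d), c) for d in auth_docs) / len(auth_docs)

print(avg_similarity_to_root_centroid(
    [["cheese", "shop"], ["ski", "resort"]],
    [["cheese", "french", "shop"], ["cheese", "online"]]))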

15
Avoiding topic drift via micro-hubs
Query "cycling": no danger of topic drift.
Query "affirmative action": topic drift from software sites.
16
Empirical convergence
  • Convergence for all queries within 20 iterations
  • Faster convergence for drift-free graphs, slower
    for graphs that posed a danger of topic drift
  • Very important not to initialize all authority scores > 0

17
Results for the Clever benchmark
  • Take top 40 auths
  • Find average cosine similarity to root set
    centroid
  • Similarity ordering: HITS < DOMText < DOM
  • DOM alone cannot prune well enough: most top authorities come from the root set
  • HITS drifts often

18
Dmoz experiments and results
  • 223 topics from http://dmoz.org
  • Sample root set URLs from a class c
  • Top authorities not in root set submitted to
    Rainbow classifier
  • Σd Pr(c | d) is the expected number of relevant documents
  • DOMText best
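Computing that metric is a direct sum over the classifier's posteriors for the target class on the returned authorities; a trivial sketch with made-up probabilities:

def expected_relevant(posteriors):
    """posteriors: Pr(c | d) for each returned top authority d, for the target DMoz class c.
    The sum is the expected number of relevant documents among them."""
    return sum(posteriors)

print(expected_relevant([0.9, 0.8, 0.1, 0.6]))  # 2.4 expected relevant documents out of 4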

(Diagram: DMoz classes such as Music provide training documents for the Rainbow classifier and sample URLs for the root set; top authorities from the expanded set are tested against the classifier.)
19
Anecdotes
  • Query "amusement parks": http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.
  • New algorithm reduces drift
  • Mixed hubs accurately segmented, e.g. amusement
    parks, classical guitar, Shakespeare and sushi
  • Mixed hubs and clique attacks rampant

20
Application: surfing like humans
Focused crawling:
  Train a topic classifier
  Initialize the priority queue with a few sample URLs about a topic
  Assume they have relevance 1
  Repeat:
    Fetch the page most relevant to the topic
    Estimate its relevance R using the classifier
    Guess that all outlinks have relevance R
    Add the outlinks to the priority queue
  • Problem: the average out-degree is too high (10)
  • Discovering irrelevance after 10X more work
  • Can we use DOM and text to bias the walk?
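The baseline crawler outlined above is essentially best-first search over a priority queue keyed by guessed relevance, which is where the wasted work on irrelevant outlinks comes from. The sketch below is schematic: fetch and classify are hypothetical stubs, not a real fetcher or the classifier used here.

import heapq

def focused_crawl(seed_urls, fetch, classify, budget=100):
    """Best-first focused-crawling sketch.

    fetch(url)     -> (page_text, outlinks); stubbed/hypothetical here
    classify(text) -> relevance R in [0, 1] from a trained topic classifier
    """
    # max-heap via negated relevance; seeds are assumed fully relevant (R = 1)
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen, fetched = set(seed_urls), []
    while frontier and len(fetched) < budget:
        _negr, url = heapq.heappop(frontier)          # most relevant guess first
        text, outlinks = fetch(url)
        r = classify(text)                            # relevance estimated only after fetching
        fetched.append((url, r))
        for link in outlinks:                         # every outlink inherits the parent's R
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-r, link))
    return fetched

# tiny stub run with made-up pages and relevances
pages = {"seed": ("cheese shop", ["x", "y"]), "x": ("car ads", []), "y": ("cheddar cheese", [])}
print(focused_crawl(["seed"],
                    fetch=lambda u: pages.get(u, ("", [])),
                    classify=lambda t: 1.0 if "cheese" in t else 0.1))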

21
Preliminary results
(Diagram: a standard focused crawler produces relevance estimates R1 and R2; features collected from the source page DOM feed a meta-learner, which labels promising and unpromising clicks and provides feedback.)
22
Summary
  • Hypertext shows complex idioms, missed by
    coarse-grained graph model
  • Enhanced fine-grained distillation
  • Identifies content-bearing hot micro-hubs
  • Disaggregates hub scores
  • Reduces topic drift caused by mixed hubs and pseudo-communities
  • Application: online reinforcement learning
  • Need probabilistic combination of evidence from
    text and links