Title: Enhanced topic distillation using text, markup tags, and hyperlinks
1. Enhanced topic distillation using text, markup tags, and hyperlinks
- Soumen Chakrabarti, Mukul Joshi, Vivek Tawde
- www.cse.iitb.ac.in/soumen
2. Topic distillation
- Given a keyword query or some example URLs
- Collect a relevant subgraph (community) of the Web
- Bipartite reinforcement between hubs and authorities
- Prototypes: HITS, Clever, SALSA; Bharat and Henzinger
(Figure: a keyword query goes to a search engine; the response forms the root set, which is expanded over links into the base/expanded set.)
3. Two issues
- How to collect the base set
  - Radius-1 expansion is arbitrary
  - Content relevance must play a role
- How to spread prestige along links
  - Instability of HITS (Borodin, Lempel, Zheng)
  - Stability of PageRank (Zheng)
  - Stochastic variants of HITS (Lempel)
- Need better recall collecting the base graph
- Need accurate boundaries around it
4. Challenges and limitations
- Topic distillation results deteriorating
  - Web authoring style in flux since 1996
  - Complex pages, templates, cloaks
  - File or page boundary less meaningful
  - Clique attacks: rampant multi-host nepotism via rings, ads, banner exchanges
- Models too simplistic
  - Hub and authority symmetry is illusory
  - Coarse-grain hub model leaks authority
  - Ad-hoc linear segmentation not content-aware
5. Clique attacks!
(Figure: irrelevant links form a pseudo-community; relevant regions lead to inclusion of the page in the base set.)
6. Benign drift and generalization
(Figure: one section of a hub specializes on Shakespeare; the remaining sections generalize and/or drift.)
7. A fine-grained hypertext model
<html><body>
  <table>
    <tr><td>
      <table>
        <tr><td><a href="http://art.qaz.com">art</a></td></tr>
        <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>
      </table>
    </td></tr>
    <tr><td>
      <ul>
        <li><a href="http://www.fromages.com">Fromages.com</a> French cheese</li>
        <li><a href="http://www.teddingtoncheese.co.uk">Teddington</a> Buy online</li>
      </ul>
    </td></tr>
  </table>
</body></html>
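One way to make the model concrete: each hyperlink can be keyed by the DOM path of its anchor, so that hub scores later attach to DOM nodes (micro-hubs) rather than whole pages. A minimal sketch using Python's standard html.parser; the class name and path encoding are illustrative, not the authors' code:

```python
from html.parser import HTMLParser

class DomLinkExtractor(HTMLParser):
    """Record each hyperlink together with the DOM path of its anchor."""
    def __init__(self):
        super().__init__()
        self.path = []          # stack of currently open tag names
        self.links = []         # (dom_path, href) pairs

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/".join(self.path), href))

    def handle_endtag(self, tag):
        if tag in self.path:
            # pop up to and including the matching open tag
            while self.path and self.path.pop() != tag:
                pass

page = ('<html><body><table><tr><td>'
        '<a href="http://art.qaz.com">art</a></td></tr>'
        '<tr><td><ul><li><a href="http://www.fromages.com">Fromages.com</a>'
        ' French cheese</li></ul></td></tr></table></body></html>')

parser = DomLinkExtractor()
parser.feed(page)
for dom_path, href in parser.links:
    print(dom_path, "->", href)
```

On the slide's example page, the two links land at distinct DOM paths (one inside the nested table, one inside the list), which is exactly the distinction a fine-grained model needs.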
8. Preliminary approaches
- Apply HITS to the fine-grained base graph
  - Blocked reinforcement
- Model DOM trees as resistance or flow networks
  - Ad-hoc decay factors
- Apply Bharat-Henzinger outlier elimination to every DOM node
  - Hot absorbs cold, includes drift-enhancing links
(Figure: DOM subtrees range from cold to hot; a subtree warm enough figures as one hub.)
9. Generative model for hub text
- Global hub text distribution Θ0 relevant to the given query
- Authors use internal DOM nodes to hierarchically specialize Θ0 into Θi
- At a certain frontier, local models are frozen and text generated
(Figure: the global term distribution Θ0 is progressively distorted down the DOM tree to Θi at the model frontier; other pages attach below.)
10. Examples using the binary model
- Binary model
- Code length for a document d
- Cost for specializing a term distribution
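The bullets above can be made concrete. A plausible form of the binary model, consistent with the slides but not copied from the paper: a document d is a set of terms, each term t present independently with probability θt, so the code length of d under model Θ is

```latex
L(d \mid \Theta) \;=\; -\sum_{t \in d} \log \theta_t \;-\; \sum_{t \notin d} \log (1 - \theta_t)
```

and the cost of specializing a reference distribution Θ0 into Θu is the per-term Bernoulli KL divergence

```latex
\mathrm{KL}(\Theta_0 \,\|\, \Theta_u) \;=\; \sum_t \left[ \theta_{0,t} \log \frac{\theta_{0,t}}{\theta_{u,t}} + (1-\theta_{0,t}) \log \frac{1-\theta_{0,t}}{1-\theta_{u,t}} \right].
```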
11. Discovering the frontier
- Use u to directly generate text snippets in the subtree rooted at u
- Or expand to children v and use different parameters for each subtree
- Greedily pick the better local choice
(Figure: reference distribution Θ0; cumulative distortion cost KL(Θ0‖Θu) down to u; additional cost KL(Θu‖Θv) to expand from u to a child v generating text Dv.)
12. Exploiting co-citation in our model
1. Initial values of leaf hub scores come from target authority scores
2. Segment the tree using hub scores
3. Frontier micro-hubs accumulate scores
4. Aggregate hub scores are copied back to leaves: a non-linear transform, unlike HITS
(Figure: leaves pointing at known authorities start with scores such as 0.10, 0.20, 0.13; after segmentation and aggregation, the frontier micro-hub's score (e.g. 0.12) is copied back to its leaves, giving reason to believe their other targets could be good too.)
13. Complete algorithm
- Collect root set and base set
- Pre-segment using text and mark irrelevant micro-hubs to be pruned
- Assign authority score 1 only to root-set pages
- Iterate:
  - Transfer from authorities to hub leaves
  - Re-segment hub DOM trees using link text
  - Smooth and redistribute hub scores
  - Transfer from hub leaves to authority roots
- Report top authorities and hot micro-hubs
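The iteration can be sketched in miniature. This toy version (an assumption-laden sketch, not the paper's implementation) keeps hub scores on DOM segments ("micro-hubs") rather than whole pages, which is the key departure from HITS; re-segmentation and smoothing are elided:

```python
def iterate(segments, root_auth, steps=20):
    """Distillation sketch with disaggregated hubs.
    segments: list of lists of authority ids (one list per micro-hub).
    root_auth: authorities initialized to score 1 (the root set)."""
    auth = {a: (1.0 if a in root_auth else 0.0)
            for seg in segments for a in seg}
    for _ in range(steps):
        # authority -> micro-hub: each segment scores by its own targets,
        # so a relevant segment cannot leak score to an off-topic sibling
        hub = [sum(auth[a] for a in seg) for seg in segments]
        # micro-hub -> authority, then normalize
        auth = {a: 0.0 for a in auth}
        for h, seg in zip(hub, segments):
            for a in seg:
                auth[a] += h
        norm = max(auth.values()) or 1.0
        auth = {a: s / norm for a, s in auth.items()}
    return auth

# Two segments co-cite "b" alongside a root-set authority; a third,
# disconnected segment points only at "x".
scores = iterate([["a", "b"], ["b", "c"], ["x"]], root_auth={"a"})
print(scores)
```

Note how "x" never acquires score: only authorities reachable from segments that carry root-set endorsement are reinforced, the fine-grained analogue of restricting reinforcement to hot micro-hubs.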
14. Experimental setup
- Large data sets
  - 28 queries from Clever, >20 topics from DMoz
  - Collect 2000-10000 pages per query/topic
  - Several million DOM nodes and fine links
- Find top authorities using various algorithms
- Measurements and anecdotes
  - For each ad-hoc query, measure cosine similarity of authorities to the root-set centroid in vector space
  - Compare HITS, DOM, DOMText
15Avoiding topic drift via micro-hubs
Query cyclingNo danger of topic drift
Query affirmative actionTopic drift from
software sites
16. Empirical convergence
- Convergence for all queries within 20 iterations
- Faster convergence for drift-free graphs; slower for graphs that posed a danger of topic drift
- Very important not to set all authority scores > 0
17. Results for the Clever benchmark
- Take the top 40 authorities
- Find average cosine similarity to the root-set centroid
- Similarity: HITS < DOMText < DOM
- DOM alone cannot prune well enough; most top authorities come from the root set
- HITS drifts often
18. DMoz experiments and results
- 223 topics from http://dmoz.org
- Sample root-set URLs from a class c
- Top authorities not in the root set are submitted to the Rainbow classifier
- Σd Pr(c|d) is the expected number of relevant documents
- DOMText best
(Figure: DMoz samples train the Rainbow classifier; a sampled root set, e.g. for Music, is expanded, and top authorities outside the root set are tested.)
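The evaluation measure above follows from linearity of expectation: treating each returned authority d as a Bernoulli indicator of membership in class c,

```latex
\mathbb{E}\big[\#\text{relevant}\big]
  \;=\; \mathbb{E}\Big[\sum_d \mathbf{1}[d \in c]\Big]
  \;=\; \sum_d \Pr(c \mid d).
```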
19. Anecdotes
- Amusement parks: http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.; the new algorithm reduces this drift
- Mixed hubs accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi
- Mixed hubs and clique attacks rampant
20. Application: surfing like humans
Focused crawling:
  Train a topic classifier
  Initialize priority queue to a few sample URLs about a topic
  Assume they have relevance 1
  Repeat:
    Fetch the page most relevant to the topic
    Estimate relevance R using the classifier
    Guess that all outlinks have relevance R
    Add outlinks to the priority queue
- Problem: average out-degree is too high (10)
- Discovering irrelevance only after 10X more work
- Can we use DOM and text to bias the walk?
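The loop above can be sketched with a standard-library priority queue; the fetcher and classifier are stand-ins (any callables returning outlinks and a relevance in [0, 1]), not a real crawler:

```python
import heapq

def focused_crawl(seeds, fetch_outlinks, relevance, budget=100):
    """Focused-crawling sketch: always expand the most promising URL.
    Seeds are assumed fully relevant (R = 1), per the slide."""
    # max-heap via negated priorities
    frontier, seen, visited = [], set(seeds), []
    for url in seeds:
        heapq.heappush(frontier, (-1.0, url))
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)          # most promising page
        r = relevance(url)                        # classifier's estimate
        visited.append((url, r))
        for out in fetch_outlinks(url):
            if out not in seen:                   # crawl each URL once
                seen.add(out)
                heapq.heappush(frontier, (-r, out))  # inherit parent's R
    return visited
```

A toy usage: with graph = {"s": ["a", "b"]} and per-URL relevances, the crawler visits pages in priority order, and one can see the slide's problem directly: every outlink of a relevant page inherits its score, so low-value children of high-value parents are fetched before their irrelevance is discovered.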
21. Preliminary results
(Figure: a standard focused crawler assigns relevance R1; a meta-learner, fed features collected from the source page's DOM, separates promising from unpromising clicks and feeds back a refined relevance R2.)
22. Summary
- Hypertext shows complex idioms missed by the coarse-grained graph model
- Enhanced fine-grained distillation
  - Identifies content-bearing hot micro-hubs
  - Disaggregates hub scores
  - Reduces topic drift from mixed hubs and pseudo-communities
- Application: online reinforcement learning
- Need probabilistic combination of evidence from text and links