Title: Enhanced topic distillation using text, markup tags, and hyperlinks
1 Enhanced topic distillation using text, markup tags, and hyperlinks
- Soumen Chakrabarti, Mukul Joshi, Vivek Tawde
- www.cse.iitb.ac.in/soumen
2 Topic distillation
- Given a query or some example URLs
- Collect a relevant subgraph (community) of the Web
- Bipartite reinforcement between hubs and authorities
- Prototypes
- HITS and Clever
- Bharat and Henzinger
[Figure: keyword query, search engine, root set, expanded set]
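The bipartite reinforcement between hubs and authorities can be illustrated with a minimal HITS-style iteration. This is only a sketch of the baseline idea, not the code used by HITS, Clever, or Bharat and Henzinger; the function name `hits`, the toy edge list, and the iteration count are assumptions made for illustration.

```python
import math

def hits(edges, iterations=50):
    """Plain HITS-style mutual reinforcement on a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # L2-normalize both score vectors so they stay bounded.
        for scores in (hub, auth):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical toy graph: three hub pages, two candidate authorities.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
best_auth = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph the most-cited page wins the authority ranking and the page citing both authorities wins the hub ranking. Every page is treated as one indivisible hub here, which is the coarse-grained model the later slides refine.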
3 Challenges and limitations
- Web authoring style in flux since 1996
- Complex pages generated from templates
- File or page boundary less meaningful
- Clique attacks: rampant multi-host nepotism via rings, ads, banner exchanges
- Models are too simplistic
- Hub and authority symmetry is illusory
- Coarse-grain hub model leaks authority
- Ad-hoc linear segmentation not content-aware
- Deteriorating results of topic distillation
4 Clique attacks!
[Figure: irrelevant links form a pseudo-community; relevant regions that lead to inclusion of the page in the base set]
5 Benign drift and generalization
[Figure: this section specializes on Shakespeare; the remaining sections generalize and/or drift]
6 A new fine-grained model
<html><body>
<table>
<tr><td>
<table>
<tr><td><a href="http://art.qaz.com">art</a></td></tr>
<tr><td><a href="http://ski.qaz.com">ski</a></td></tr>
</table>
</td></tr>
<tr><td>
<ul>
<li><a href="http://www.fromages.com">Fromages.com</a> French cheese</li>
<li><a href="http://www.teddingtoncheese.co.uk">Teddington</a> Buy online</li>
</ul>
</td></tr>
</table>
</body></html>
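To make the fine-grained view concrete, the sketch below parses a page like the one on this slide and records the DOM path above each link. It is an illustration only: the class name `LinkPathParser` is hypothetical, and the markup in `PAGE` is a reconstruction of the slide's example (attribute quoting and URL punctuation are assumed).

```python
from html.parser import HTMLParser

class LinkPathParser(HTMLParser):
    """Collects (DOM path, href) for every anchor in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.links = []   # (path, href) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/".join(self.stack), href))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates unclosed tags).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

PAGE = """
<html><body><table>
<tr><td><table>
<tr><td><a href="http://art.qaz.com">art</a></td></tr>
<tr><td><a href="http://ski.qaz.com">ski</a></td></tr>
</table></td></tr>
<tr><td><ul>
<li><a href="http://www.fromages.com">Fromages.com</a> French cheese</li>
<li><a href="http://www.teddingtoncheese.co.uk">Teddington</a> Buy online</li>
</ul></td></tr>
</table></body></html>
"""

parser = LinkPathParser()
parser.feed(PAGE)
links = parser.links
```

The two cheese links share the path ending in `ul/li/a`, while the art and ski links sit under a nested table. That structural difference is what lets a single page be split into separate micro-hubs. The paths here do not distinguish siblings; a fuller model would keep the tree itself.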
7 Generative model for hub text
- Global hub text distribution Θ0 relevant to given query
- Authors use internal DOM nodes to specialize Θ0 into ΘI
- At a certain frontier in the DOM tree, local distribution directly generates text in hot and cold subtrees
[Figure: global term distribution Θ0, progressive distortion, model frontier, ΘI, other pages]
8 A balanced cost measure
Reference distribution Θ0
Cumulative distortion cost: KL(Θ0 ‖ Θu) + … + KL(Θu ‖ Θv)
Data encoding cost is roughly [formula missing from source]
(for exponential distribution)
Goal: find minimum-cost frontier
[Figure: DOM path through nodes u and v, with data Dv at v]
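The slide's own expression for the data encoding cost did not survive extraction. As a hedged reminder of the standard quantities such a cost measure is built from (not a reconstruction of the authors' exact formula), the block below writes Θ for the distribution parameters and assumes, for the worked case, an exponential density with rate θ.

```latex
% Shannon code length of the data D_v under the local model \Theta_v
\[
  L(D_v \mid \Theta_v) \;\approx\; -\sum_{d \in D_v} \log \Pr(d \mid \Theta_v)
\]
% Assumed exponential density p(x \mid \theta) = \theta e^{-\theta x}
\[
  -\log p(x \mid \theta) \;=\; \theta x - \log \theta
\]
% KL divergence between two exponentials with rates \theta_u and \theta_v
\[
  \mathrm{KL}(\theta_u \,\|\, \theta_v)
  \;=\; \log\frac{\theta_u}{\theta_v} + \frac{\theta_v}{\theta_u} - 1
\]
```

The trade-off is then the usual one: pushing the frontier deeper lowers the data encoding cost because local models fit better, but adds one distortion term per step down the tree.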
9 Marking hot subtrees
- Hard to solve exactly (knapsack)
- (1+ε) dynamic programming solution
- Too slow for 10 million DOM nodes
- Greedy expansion approach: at each node v, compare the cost of
  - Directly encoding Dv w.r.t. model Θv at v
  - First distorting Θv to Θw for each child w of v, then encoding all Dw w.r.t. the respective Θw
- If the latter is smaller, expand v; else prune
- Mark relevant subtrees as must-prune
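The greedy expansion rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each node's model is taken to be an exponential distribution fitted by maximum likelihood to the positive values in its subtree, the distortion cost is the KL divergence between parent and child exponentials, and the names `Node`, `rate`, `encode_cost`, `distort_cost`, and `greedy_frontier` are hypothetical.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    data: list = field(default_factory=list)   # positive values, leaves only
    must_prune: bool = False

def subtree_data(node):
    """All leaf values below (or at) this node."""
    if not node.children:
        return list(node.data)
    return [x for child in node.children for x in subtree_data(child)]

def rate(values):
    """Maximum-likelihood rate of an exponential distribution: 1 / mean."""
    return len(values) / sum(values)

def encode_cost(values, theta):
    """Negative log-likelihood (nats) of the values under Exp(theta)."""
    return sum(theta * x - math.log(theta) for x in values)

def distort_cost(theta_parent, theta_child):
    """KL(Exp(theta_parent) || Exp(theta_child))."""
    return (math.log(theta_parent / theta_child)
            + theta_child / theta_parent - 1.0)

def greedy_frontier(node):
    """Return the frontier nodes chosen by the greedy expand-or-prune rule."""
    if node.must_prune or not node.children:
        return [node]
    d_v = subtree_data(node)
    theta_v = rate(d_v)
    keep = encode_cost(d_v, theta_v)          # encode Dv directly at v
    expand = 0.0                              # distort to each child, then encode
    for w in node.children:
        d_w = subtree_data(w)
        theta_w = rate(d_w)
        expand += distort_cost(theta_v, theta_w) + encode_cost(d_w, theta_w)
    if expand < keep:
        return [f for w in node.children for f in greedy_frontier(w)]
    return [node]

# Hypothetical toy tree: block A is uniform, block B is very different from A.
a = Node("A", [Node("a1", data=[1.0]), Node("a2", data=[1.0])])
b = Node("B", [Node("b1", data=[0.01]), Node("b2", data=[8.0])])
root = Node("root", [a, b])
frontier = [n.name for n in greedy_frontier(root)]
```

On this toy tree the root is expanded because its two blocks are better described by separate models, while each block is kept whole because expanding it would not lower the total cost. Setting `must_prune` on a node stops expansion there regardless of cost.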
10 Exploiting co-citation in our model
1. Initial values of leaf hub scores = target auth scores
2. Must-prune nodes are marked
3. Frontier microhubs accumulate scores
4. Aggregate hub scores are copied back to leaves
Non-linear transform, unlike HITS
[Figure: DOM tree of a hub page with known authorities and sibling links annotated "Have reason to believe these could be good too"; leaf scores 0.10, 0.20, 0.01, 0.06, 0.05, 0.13 become 0.10, 0.20, 0.12, 0.12, 0.12, 0.13 after copy-back]
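The accumulate-then-copy-back step can be sketched in a few lines. The leaf scores below are the numbers that appear in the slide's figure; which leaves share a frontier microhub is an assumption inferred from those numbers, and the names `aggregate_and_copy_back`, `leaf_scores`, and `frontier_of` are hypothetical.

```python
from collections import defaultdict

def aggregate_and_copy_back(leaf_scores, frontier_of):
    """Sum leaf hub scores at each frontier node, then copy the sum to its leaves.

    leaf_scores: leaf id -> hub score
    frontier_of: leaf id -> id of the frontier node covering that leaf
    """
    totals = defaultdict(float)
    for leaf, score in leaf_scores.items():
        totals[frontier_of[leaf]] += score
    return {leaf: totals[frontier_of[leaf]] for leaf in leaf_scores}

# Leaf scores taken from the figure; the grouping is an assumed reading of it.
leaf_scores = {"l1": 0.10, "l2": 0.20, "l3": 0.01,
               "l4": 0.06, "l5": 0.05, "l6": 0.13}
frontier_of = {"l1": "l1", "l2": "l2",           # leaves that are their own frontier
               "l3": "m", "l4": "m", "l5": "m",  # three leaves under microhub m
               "l6": "l6"}
smoothed = aggregate_and_copy_back(leaf_scores, frontier_of)
```

Leaves under the shared microhub all end up with the microhub's total, so a link with a low initial score inherits credit from its co-cited siblings. Because the grouping depends on the current segmentation, the overall update is not a fixed linear map, which is consistent with the slide's remark about HITS.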
11 Complete algorithm
- Collect root set and base set
- Pre-segment using text and mark relevant micro-hubs to be pruned
- Assign only root set authority scores to 1s
- Iterate
  - Transfer from authority to hub leaves
  - Re-segment hub DOM trees using link text
  - Smooth and redistribute hub scores
  - Transfer from hub leaves to authority roots
- Report top authority and hot microhubs
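The loop above can be sketched in simplified form. This is a rough illustration, not the authors' implementation: the re-segmentation step is omitted and the leaf-to-microhub assignment is held fixed, each hub leaf is assumed to be a single outgoing link, and the names `distill`, `links`, and `frontier_of` are hypothetical.

```python
import math
from collections import defaultdict

def distill(links, frontier_of, root_auths, iterations=20):
    """links: (hub_leaf, authority) pairs; frontier_of: hub_leaf -> microhub id."""
    # Only root-set authorities start with score 1.
    auth = defaultdict(float, {a: 1.0 for a in root_auths})
    micro = {}
    for _ in range(iterations):
        # Transfer from authorities to hub leaves (one link per leaf).
        leaf = {h: auth[a] for h, a in links}
        # Smooth: accumulate at each microhub, then copy back to its leaves.
        micro = defaultdict(float)
        for h, score in leaf.items():
            micro[frontier_of[h]] += score
        leaf = {h: micro[frontier_of[h]] for h in leaf}
        # Transfer from hub leaves to authorities, then L2-normalize.
        auth = defaultdict(float)
        for h, a in links:
            auth[a] += leaf[h]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        for a in auth:
            auth[a] /= norm
    return dict(auth), dict(micro)

# Hypothetical mixed hub page p1 (navigation block plus cheese block) and page p2.
links = [("p1:art", "art"), ("p1:ski", "ski"),
         ("p1:fromages", "fromages"), ("p1:teddington", "teddington"),
         ("p2:fromages", "fromages"), ("p2:teddington", "teddington")]
fine = {"p1:art": "p1/nav", "p1:ski": "p1/nav",
        "p1:fromages": "p1/cheese", "p1:teddington": "p1/cheese",
        "p2:fromages": "p2/cheese", "p2:teddington": "p2/cheese"}
coarse = {leaf: leaf.split(":")[0] for leaf in fine}   # whole page as one hub

fine_auth, fine_micro = distill(links, fine, root_auths=["fromages"])
coarse_auth, _ = distill(links, coarse, root_auths=["fromages"])
```

With the fine-grained assignment, authority reaches the co-cited cheese site but not the navigation links on the same page. With the whole page treated as one hub, the navigation targets also pick up score, which is the leak the fine-grained model is meant to stop.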
12 Experimental setup
- Large data sets
- 28 queries from Clever, >20 topics from Dmoz
- Collect 2000 to 10000 pages per query/topic
- Several million DOM nodes and fine links
- Find top authorities using various algos
- For ad-hoc query, measure cosine similarity of authorities with root-set centroid in vector space
- For Dmoz, use an automatic classifier
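The cosine-similarity measure for ad-hoc queries can be sketched as below. The term weighting used in the experiments is not stated on the slide, so this example assumes plain term-frequency vectors; the helper names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed weighting)."""
    return Counter(text.lower().split())

def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical root-set pages and two candidate authorities.
root_docs = ["cycling bike race", "bike cycling tour"]
root_centroid = centroid([tf_vector(d) for d in root_docs])
on_topic = cosine(tf_vector("cycling bike shop"), root_centroid)
off_topic = cosine(tf_vector("software download free"), root_centroid)
mean_similarity = (on_topic + off_topic) / 2
```

Averaging this similarity over the top-ranked authorities gives a single number per algorithm; a lower average suggests the ranking has drifted away from the root-set topic.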
13 Avoiding topic drift via micro-hubs
Query "cycling": no danger of topic drift
Query "affirmative action": topic drift from software sites
14 Results for the Clever benchmark
- Take top 40 auths
- Find average cosine similarity to root set centroid
- HITS < DOM+Text < DOM similarity
- DOM alone cannot prune well enough: most top auths from root set
- HITS drifts often
15 Dmoz experiments and results
- 223 topics from http://dmoz.org
- Sample root set URLs from a class c
- Top authorities not in root set submitted to Rainbow classifier
- Σd Pr(c | d) is the expected number of relevant documents
- DOM+Text best
[Figure: DMoz topics (e.g. Music) are split into a training sample for the Rainbow classifier and a sampled root set; the root set grows into the expanded set, and the top authorities are tested with the classifier]
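The claim that the sum of Pr(c | d) is the expected number of relevant documents follows from linearity of expectation. A short derivation, where the indicator R_d and the set A of reported authorities are notation introduced here for illustration:

```latex
% R_d = 1 if document d is relevant to class c, and 0 otherwise
\[
  \mathbb{E}\Big[\sum_{d \in A} R_d\Big]
  \;=\; \sum_{d \in A} \mathbb{E}[R_d]
  \;=\; \sum_{d \in A} \Pr(c \mid d)
\]
```

No independence between documents is needed for this step; it only requires that the classifier's output is read as the probability that d belongs to c.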
16 Anecdotes
- amusement parks: http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.
- New algorithm reduces drift
- Mixed hubs accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi
- Mixed hubs in top 50 for 13/28 queries
17 Conclusion and ongoing work
- Hypertext shows complex idioms, missed by coarse-grained graph model
- Enhanced fine-grained distillation
- Identifies content-bearing hot micro-hubs
- Disaggregates hub scores
- Reduces topic drift via mixed hubs and pseudo-communities
- Application: topic-based focused crawling
- Need probabilistic combination of evidence from text and links