Title: Enhanced topic distillation using text, markup tags, and hyperlinks
1. Enhanced topic distillation using text, markup tags, and hyperlinks
- Soumen Chakrabarti, Mukul Joshi, Vivek Tawde
- www.cse.iitb.ac.in/soumen
2. Topic distillation
- Given a keyword query or some example URLs
- Collect a relevant subgraph (community) of the Web
- Bipartite reinforcement between hubs and authorities
- Prototypes: HITS, Clever, SALSA; Bharat and Henzinger
(Figure: a keyword query goes to a search engine; the response forms the root set, which is expanded over links into the base/expanded set.)
3. Two issues
- How to collect the base set
  - Radius-1 expansion is arbitrary
  - Content relevance must play a role
- How to spread prestige along links
  - Instability of HITS (Borodin, Lempel, Zheng)
  - Stability of PageRank (Zheng)
  - Stochastic variants of HITS (Lempel)
- Need better recall collecting the base graph
- Need accurate boundaries around it
4. Challenges and limitations
- Topic distillation results deteriorating
  - Web authoring style in flux since 1996
  - Complex pages, templates, cloaks
  - File or page boundary less meaningful
  - Clique attacks: rampant multi-host nepotism via rings, ads, banner exchanges
- Models too simplistic
  - Hub and authority symmetry is illusory
  - Coarse-grain hub model leaks authority
  - Ad-hoc linear segmentation not content-aware
5. Clique attacks!
(Figure: irrelevant links form a pseudo-community; relevant regions lead to inclusion of the page in the base set.)
6. Benign drift and generalization
(Figure: one section of a hub specializes on Shakespeare; the remaining sections generalize and/or drift.)
7. A fine-grained hypertext model
<html><body>
  <table>
    <tr><td>
      <table>
        <tr><td><a href="http://art.qaz.com">art</a></td></tr>
        <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>
      </table>
    </td></tr>
    <tr><td>
      <ul>
        <li><a href="http://www.fromages.com">Fromages.com</a> French cheese</li>
        <li><a href="http://www.teddingtoncheese.co.uk">Teddington</a> Buy online</li>
      </ul>
    </td></tr>
  </table>
</body></html>
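One way to make the model concrete: each hyperlink can be keyed by the DOM path of its anchor, so that hub scores later attach to DOM nodes (micro-hubs) rather than whole pages. A minimal sketch using Python's standard html.parser; the class name and path encoding are illustrative, not the authors' code:

```python
from html.parser import HTMLParser

class DomLinkExtractor(HTMLParser):
    """Record each hyperlink together with the DOM path of its anchor."""
    def __init__(self):
        super().__init__()
        self.path = []          # stack of currently open tag names
        self.links = []         # (dom_path, href) pairs

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/".join(self.path), href))

    def handle_endtag(self, tag):
        if tag in self.path:
            # pop up to and including the matching open tag
            while self.path and self.path.pop() != tag:
                pass

page = ('<html><body><table><tr><td>'
        '<a href="http://art.qaz.com">art</a></td></tr>'
        '<tr><td><ul><li><a href="http://www.fromages.com">Fromages.com</a>'
        ' French cheese</li></ul></td></tr></table></body></html>')

parser = DomLinkExtractor()
parser.feed(page)
for dom_path, href in parser.links:
    print(dom_path, "->", href)
```

On the slide's example page, the two links land at distinct DOM paths (one inside the nested table, one inside the list), which is exactly the distinction a fine-grained model needs.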
8. Preliminary approaches
- Apply HITS to the fine-grained base graph
  - Blocked reinforcement
- Model DOM trees as resistance or flow networks
  - Ad-hoc decay factors
- Apply Bharat-Henzinger outlier elimination to every DOM node
  - Hot absorbs cold, includes drift-enhancing links
(Figure: DOM subtrees range from cold to hot; a subtree warm enough figures as one hub.)
9. Generative model for hub text
- Global hub text distribution Θ0 relevant to the given query
- Authors use internal DOM nodes to hierarchically specialize Θ0 into Θi
- At a certain frontier, local models are frozen and text generated
(Figure: the global term distribution Θ0 is progressively distorted down the DOM tree to Θi at the model frontier; other pages attach below.)
10. Examples using the binary model
- Binary model
- Code length for a document d
- Cost for specializing a term distribution
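The bullets above can be made concrete. A plausible form of the binary model, consistent with the slides but not copied from the paper: a document d is a set of terms, each term t present independently with probability θt, so the code length of d under model Θ is

```latex
L(d \mid \Theta) \;=\; -\sum_{t \in d} \log \theta_t \;-\; \sum_{t \notin d} \log (1 - \theta_t)
```

and the cost of specializing a reference distribution Θ0 into Θu is the per-term Bernoulli KL divergence

```latex
\mathrm{KL}(\Theta_0 \,\|\, \Theta_u) \;=\; \sum_t \left[ \theta_{0,t} \log \frac{\theta_{0,t}}{\theta_{u,t}} + (1-\theta_{0,t}) \log \frac{1-\theta_{0,t}}{1-\theta_{u,t}} \right].
```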
11. Discovering the frontier
- Use u to directly generate text snippets in the subtree rooted at u
- Or expand to children v and use different parameters for each subtree
- Greedily pick the better local choice
(Figure: reference distribution Θ0; cumulative distortion cost KL(Θ0‖Θu) down to u; additional cost KL(Θu‖Θv) to expand from u to a child v generating text Dv.)
12. Exploiting co-citation in our model
1. Initial values of leaf hub scores come from target authority scores
2. Segment the tree using hub scores
3. Frontier micro-hubs accumulate scores
4. Aggregate hub scores are copied back to leaves: a non-linear transform, unlike HITS
(Figure: leaves pointing at known authorities start with scores such as 0.10, 0.20, 0.13; after segmentation and aggregation, the frontier micro-hub's score (e.g. 0.12) is copied back to its leaves, giving reason to believe their other targets could be good too.)
13. Complete algorithm
- Collect root set and base set
- Pre-segment using text and mark irrelevant micro-hubs to be pruned
- Assign authority score 1 only to root-set pages
- Iterate:
  - Transfer from authorities to hub leaves
  - Re-segment hub DOM trees using link text
  - Smooth and redistribute hub scores
  - Transfer from hub leaves to authority roots
- Report top authorities and hot micro-hubs
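The iteration can be sketched in miniature. This toy version (an assumption-laden sketch, not the paper's implementation) keeps hub scores on DOM segments ("micro-hubs") rather than whole pages, which is the key departure from HITS; re-segmentation and smoothing are elided:

```python
def iterate(segments, root_auth, steps=20):
    """Distillation sketch with disaggregated hubs.
    segments: list of lists of authority ids (one list per micro-hub).
    root_auth: authorities initialized to score 1 (the root set)."""
    auth = {a: (1.0 if a in root_auth else 0.0)
            for seg in segments for a in seg}
    for _ in range(steps):
        # authority -> micro-hub: each segment scores by its own targets,
        # so a relevant segment cannot leak score to an off-topic sibling
        hub = [sum(auth[a] for a in seg) for seg in segments]
        # micro-hub -> authority, then normalize
        auth = {a: 0.0 for a in auth}
        for h, seg in zip(hub, segments):
            for a in seg:
                auth[a] += h
        norm = max(auth.values()) or 1.0
        auth = {a: s / norm for a, s in auth.items()}
    return auth

# Two segments co-cite "b" alongside a root-set authority; a third,
# disconnected segment points only at "x".
scores = iterate([["a", "b"], ["b", "c"], ["x"]], root_auth={"a"})
print(scores)
```

Note how "x" never acquires score: only authorities reachable from segments that carry root-set endorsement are reinforced, the fine-grained analogue of restricting reinforcement to hot micro-hubs.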
14. Experimental setup
- Large data sets
  - 28 queries from Clever, >20 topics from DMoz
  - Collect 2000-10000 pages per query/topic
  - Several million DOM nodes and fine links
- Find top authorities using various algorithms
- Measurements and anecdotes
  - For each ad-hoc query, measure cosine similarity of authorities to the root-set centroid in vector space
  - Compare HITS, DOM, DOMText
15Avoiding topic drift via micro-hubs
Query cyclingNo danger of topic drift
Query affirmative actionTopic drift from
software sites
16. Empirical convergence
- Convergence for all queries within 20 iterations
- Faster convergence for drift-free graphs; slower for graphs that posed a danger of topic drift
- Very important not to set all authority scores > 0
17. Results for the Clever benchmark
- Take the top 40 authorities
- Find average cosine similarity to the root-set centroid
- Similarity: HITS < DOMText < DOM
- DOM alone cannot prune well enough; most top authorities come from the root set
- HITS drifts often
18. DMoz experiments and results
- 223 topics from http://dmoz.org
- Sample root-set URLs from a class c
- Top authorities not in the root set are submitted to the Rainbow classifier
- Σd Pr(c|d) is the expected number of relevant documents
- DOMText best
(Figure: DMoz samples train the Rainbow classifier; a sampled root set, e.g. for Music, is expanded, and top authorities outside the root set are tested.)
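The evaluation measure above follows from linearity of expectation: treating each returned authority d as a Bernoulli indicator of membership in class c,

```latex
\mathbb{E}\big[\#\text{relevant}\big]
  \;=\; \mathbb{E}\Big[\sum_d \mathbf{1}[d \in c]\Big]
  \;=\; \sum_d \Pr(c \mid d).
```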
19. Anecdotes
- Amusement parks: http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.; the new algorithm reduces this drift
- Mixed hubs accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi
- Mixed hubs and clique attacks rampant
20. Application: surfing like humans
Focused crawling:
  Train a topic classifier
  Initialize priority queue to a few sample URLs about a topic
  Assume they have relevance 1
  Repeat:
    Fetch the page most relevant to the topic
    Estimate relevance R using the classifier
    Guess that all outlinks have relevance R
    Add outlinks to the priority queue
- Problem: average out-degree is too high (10)
- Discovering irrelevance only after 10X more work
- Can we use DOM and text to bias the walk?
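The loop above can be sketched with a standard-library priority queue; the fetcher and classifier are stand-ins (any callables returning outlinks and a relevance in [0, 1]), not a real crawler:

```python
import heapq

def focused_crawl(seeds, fetch_outlinks, relevance, budget=100):
    """Focused-crawling sketch: always expand the most promising URL.
    Seeds are assumed fully relevant (R = 1), per the slide."""
    # max-heap via negated priorities
    frontier, seen, visited = [], set(seeds), []
    for url in seeds:
        heapq.heappush(frontier, (-1.0, url))
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)          # most promising page
        r = relevance(url)                        # classifier's estimate
        visited.append((url, r))
        for out in fetch_outlinks(url):
            if out not in seen:                   # crawl each URL once
                seen.add(out)
                heapq.heappush(frontier, (-r, out))  # inherit parent's R
    return visited
```

A toy usage: with graph = {"s": ["a", "b"]} and per-URL relevances, the crawler visits pages in priority order, and one can see the slide's problem directly: every outlink of a relevant page inherits its score, so low-value children of high-value parents are fetched before their irrelevance is discovered.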
21. Preliminary results
(Figure: a standard focused crawler assigns relevance R1; a meta-learner, fed features collected from the source page's DOM, separates promising from unpromising clicks and feeds back a refined relevance R2.)
22. Summary
- Hypertext shows complex idioms missed by the coarse-grained graph model
- Enhanced fine-grained distillation
  - Identifies content-bearing hot micro-hubs
  - Disaggregates hub scores
  - Reduces topic drift from mixed hubs and pseudo-communities
- Application: online reinforcement learning
- Need probabilistic combination of evidence from text and links