Title: Topical Locality in the Web
1 Topical Locality in the Web
- David J. Manura
- Lehigh University, Dept. of Computer Science and Engineering
- 2002-09-24
- (A paper presentation of B. Davison, "Topical Locality in the Web," in Proceedings of the 23rd International ACM SIGIR Conference on Research and Development in Information Retrieval, July
2 The question
- Proposition 1: Web pages are linked to pages with related content
- Proposition 2: HTML anchors describe the pages to which they point
- To what extent do these hold?
- Problem statement
- Motivation
- Methods
- Results
- Summary
<!-- blue.html -->
<html><head>
<title>Color Blue</title>
<meta name="desc" content="What blue is.">
</head><body>
Blue is the color of the sky.
</body></html>

<!-- sky.html -->
<html><head>
<title>Sky</title>
<meta name="desc" content="Sky info.">
</head><body>
The sky is <a href="blue.html">blue</a> and <a href="air.html">contains oxygen</a>.
</body></html>

<!-- air.html -->
<html><head>
<title>Air Composition</title>
<meta name="desc" content="What air is made of.">
</head><body>
Air consists of nitrogen and oxygen.
</body></html>
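The example pages above can be used to illustrate Proposition 2. The following is a minimal sketch, not the method used in the paper, of how the anchor text and link targets of a page such as sky.html might be collected with Python's standard `html.parser` module. The names `AnchorExtractor`, `extract_anchors`, and `SKY_HTML` are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, text) pairs
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None


def extract_anchors(html):
    """Return the list of (href, anchor text) pairs found in html."""
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.anchors


# Body of the sky.html example page (hypothetical constant name).
SKY_HTML = (
    "<html><head><title>Sky</title></head><body>The sky is "
    '<a href="blue.html">blue</a> and '
    '<a href="air.html">contains oxygen</a>.</body></html>'
)

for href, text in extract_anchors(SKY_HTML):
    print(href, "<-", text)
```

Each pair links an anchor text ("blue", "contains oxygen") to its target page, which is the unit of comparison behind Proposition 2.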
- Web indexing
- Search ranking
- Meta-search engines
- Focused crawlers
- Intelligent browsing agents
6 Methods: Dataset
- 100 000 pages out of 3 000 000 pages in the neighborhood of highly-ranked pages (1999)
- Two random outgoing links per page
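The link-sampling step above could be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the function name `sample_outlinks`, the dictionary input format, the removal of duplicate links, and the choice to keep all links when a page has fewer than two are all assumptions made here.

```python
import random


def sample_outlinks(outlinks, k=2, seed=0):
    """Pick up to k distinct random outgoing links for each page.

    outlinks maps a page URL to the list of URLs it links to.
    Pages with fewer than k distinct links keep all of them (assumption).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for page, links in outlinks.items():
        unique = sorted(set(links))  # drop duplicate links, fix the order
        sample[page] = rng.sample(unique, min(k, len(unique)))
    return sample


# Toy link graph built from the example pages (hypothetical data).
graph = {
    "sky.html": ["blue.html", "air.html", "blue.html"],
    "blue.html": [],
    "air.html": ["sky.html"],
}
print(sample_outlinks(graph))
```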
7 Methods: Textual Similarity Measures
- TFIDF cosine similarity
- Query term probability
- Query-document overlap
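To make the measures above concrete, here is a minimal pure-Python sketch of TFIDF cosine similarity and of a simple query-document overlap. The slides do not give the exact formulas, so the tokenizer, the smoothed IDF weighting, and the overlap definition (fraction of distinct query terms found in the document) are assumptions. Query term probability is not sketched because its definition is not stated here. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(docs):
    """Smoothed inverse document frequency for every term in docs (assumed form)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(text, idf):
    """Sparse TFIDF vector; terms missing from the idf table get weight 0."""
    tf = Counter(tokenize(text))
    return {t: count * idf.get(t, 0.0) for t, count in tf.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def overlap(query, document):
    """Fraction of distinct query terms that occur in the document (assumed form)."""
    q = set(tokenize(query))
    d = set(tokenize(document))
    return len(q & d) / len(q) if q else 0.0


docs = [
    "Blue is the color of the sky.",
    "Air consists of nitrogen and oxygen.",
    "The sky is blue and contains oxygen.",
]
idf = idf_table(docs)
vectors = [tfidf_vector(d, idf) for d in docs]
print(cosine(vectors[2], vectors[0]), cosine(vectors[1], vectors[0]))
print(overlap("contains oxygen", docs[1]))
```

In this toy corpus the sky page scores higher against the page it links to (blue) than the unlinked air page does, which is the kind of comparison the study makes at scale.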
Distributions of URL match length.
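The slides do not define URL match length. One plausible reading, sketched below purely as an assumption, is the number of leading URL components (host name first, then path segments) that two URLs share. The function names and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def url_components(url):
    """Split a URL into its host name followed by its non-empty path segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return [parts.netloc.lower()] + segments


def url_match_length(a, b):
    """Count the leading components that URLs a and b have in common."""
    n = 0
    for x, y in zip(url_components(a), url_components(b)):
        if x != y:
            break
        n += 1
    return n


# Same host and same first directory: two components match.
print(url_match_length("http://example.com/a/b/c.html",
                       "http://example.com/a/d.html"))
```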
Similarity of title, description, and
combined title and description compared to page text.
Similarity of pages that are random, siblings,
and linked (same and different domains)
Similarities between sibling pages as a function
of distance between referring URLs
Similarities of anchor text to pages that are
random, siblings, and linked (same and different domains)
Similarities of anchor text to linked text as a
function of amount of anchor text
To order online, click here.
- Empirical evidence that topical locality mirrors spatial locality in web pages
- Anchor text amount does not greatly affect similarities
- Title, description, and anchor text represent target page well