Title: Detecting Nepotistic Links by Language Model Disagreement
Detecting Nepotistic Links by Language Model Disagreement
András A. Benczúr, István Bíró, Károly Csalogány, Máté Uher
http://www.ilab.sztaki.hu/websearch
- Problem Statement
  - Hyperlinks between topically dissimilar pages cover
    - Undeserved PageRank (spam or navigational links)
    - Unrelated anchor hit (links to owners, maintainers)
    - Content spam (text with no meaning for humans)
Language model disagreement: unigram language model for text (D) in collection (C)
Example language model (figure label): kamera, video, dvd, Vinci
NRank: PageRank over penalized hyperlinks
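A minimal sketch of the NRank idea follows. The slide only says that hyperlinks are penalized, so treating a penalty as outright removal of the link is an assumption, as are the damping factor 0.85, the uniform handling of dangling pages, and the hypothetical names `pagerank` and `nrank`.

```python
def pagerank(nodes, edges, damping=0.85, iters=50):
    """Plain power-iteration PageRank; dangling mass is spread uniformly."""
    n = len(nodes)
    out = {u: [] for u in nodes}
    for u, v in edges:
        out[u].append(v)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        dangling = sum(rank[u] for u in nodes if not out[u])
        nxt = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            if out[u]:
                share = damping * rank[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
        rank = nxt
    return rank


def nrank(nodes, edges, penalized, damping=0.85, iters=50):
    """PageRank over the graph with penalized links dropped (assumption)."""
    kept = [e for e in edges if e not in penalized]
    return pagerank(nodes, kept, damping, iters)


# Toy graph: the link a -> spam is assumed to have been flagged as nepotistic.
nodes = ["a", "b", "spam"]
edges = [("a", "b"), ("b", "a"), ("a", "spam"), ("spam", "a")]
penalized = {("a", "spam")}
pr = pagerank(nodes, edges)
nr = nrank(nodes, edges, penalized)
```

In this toy graph the page `spam` loses its only in-link, so its NRank score falls below its PageRank score while the total score mass stays the same.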
- NRank evaluation
  - Calculate PageRank
  - Form 20 buckets, each containing 5% of total PR value
  - Pick 50 URLs from each bucket
  - → 1000-page sample stratified on PageRank
  - Manually classify each page as
    - Non-usable: unknown, alias, empty, dead
    - Usable: reputable, ad, weborg, spam
  - Within spam: thema-.de link farm
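The sampling steps above can be sketched as follows. This is an illustration under assumptions: pages are ordered by decreasing PageRank before bucketing, sampling within a bucket is uniform without replacement, and the names `pagerank_buckets` and `stratified_sample` are hypothetical.

```python
import random


def pagerank_buckets(pr, n_buckets=20):
    """Split URLs, sorted by decreasing PageRank, into buckets that each
    hold roughly 1/n_buckets of the total PageRank mass."""
    total = sum(pr.values())
    target = total / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    cum = 0.0
    for url, value in sorted(pr.items(), key=lambda kv: -kv[1]):
        # A URL goes to the bucket given by the mass accumulated before it.
        idx = min(int(cum / target), n_buckets - 1)
        buckets[idx].append(url)
        cum += value
    return buckets


def stratified_sample(pr, n_buckets=20, per_bucket=50, seed=0):
    """Draw up to per_bucket URLs from every PageRank bucket."""
    rng = random.Random(seed)
    sample = []
    for bucket in pagerank_buckets(pr, n_buckets):
        # A bucket may hold fewer than per_bucket URLs on skewed data.
        k = min(per_bucket, len(bucket))
        sample.extend(rng.sample(bucket, k))
    return sample


# Toy input: 2000 URLs with equal scores, so each bucket holds 100 URLs.
pr = {f"url{i}": 1.0 for i in range(2000)}
sample = stratified_sample(pr)
```

With 20 buckets and 50 URLs per bucket the sample has 1000 pages, and high-PageRank pages are heavily over-represented relative to a uniform sample because the top buckets contain far fewer pages.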
Example language model (figure label): lipstick, fashion, pickup, ...
Kullback-Leibler divergence (KL) between the language models of the target and source pages
Nepotistic links: penalize links whose KL is above a threshold
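To make the disagreement measure concrete, here is a small sketch of a smoothed unigram model and the KL divergence between two pages. Several details are assumptions not stated on the slide: Jelinek-Mercer smoothing with weight `lam`, the direction KL(source || target), summing over the collection vocabulary, and the threshold value; all function names are hypothetical.

```python
import math
from collections import Counter


def unigram_model(doc_tokens, collection_counts, lam=0.8):
    """Smoothed unigram model p(w | D): mixes the document's relative
    frequencies with the collection's (smoothing choice is an assumption)."""
    doc_counts = Counter(doc_tokens)
    doc_len = sum(doc_counts.values())
    coll_len = sum(collection_counts.values())

    def p(word):
        p_doc = doc_counts[word] / doc_len if doc_len else 0.0
        p_coll = collection_counts[word] / coll_len
        return lam * p_doc + (1.0 - lam) * p_coll

    return p


def kl_divergence(p, q, vocabulary):
    """KL(p || q) summed over the given vocabulary."""
    return sum(p(w) * math.log(p(w) / q(w)) for w in vocabulary if p(w) > 0)


def is_nepotistic(kl, threshold):
    """Flag a link whose language model disagreement exceeds the threshold."""
    return kl > threshold


source = "kamera video dvd kamera".split()
similar = "video dvd kamera dvd".split()
unrelated = "lipstick fashion pickup fashion".split()
collection = Counter(source + similar + unrelated)
vocab = list(collection)

p_source = unigram_model(source, collection)
kl_similar = kl_divergence(p_source, unigram_model(similar, collection), vocab)
kl_unrelated = kl_divergence(p_source, unigram_model(unrelated, collection), vocab)
```

On this toy data the link to the topically unrelated page has a clearly larger divergence than the link to the similar page, so a threshold placed between the two values would penalize only the former.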
Distribution of categories, .de domain (%) [Benczúr-Csalogány-Sarlós-Uher 2005]
- Unknown 0.4
- Alias 0.3
- Empty 0.4
- Non-existent 7.9
- Ad 3.7
- Weborg 0.8
- Spam 16.5
- Reputable 70.0
Figure: Distribution of KL between anchor text and target document
Figure: The assumed Gaussian mixture model [Mishne et al. 2005]
- Disagreement in anchor text
  - anchor within doc
  - anchor of pointing doc
  - empty (or short) anchor: use neighboring 5 words
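The fallback for empty or short anchors could look like the sketch below. The slide does not say whether the 5 neighboring words are taken on each side or in total, nor what counts as short; taking up to 5 words on each side and the cut-off `min_anchor_len=2` are assumptions, and `anchor_context` is a hypothetical name.

```python
def anchor_context(tokens, start, end, min_anchor_len=2, window=5):
    """Return the anchor tokens tokens[start:end]; if the anchor is empty
    or shorter than min_anchor_len, add up to `window` words per side."""
    anchor = tokens[start:end]
    if len(anchor) >= min_anchor_len:
        return anchor
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    return left + anchor + right


tokens = "buy a cheap digital kamera and dvd player here today from our shop".split()
short_anchor = anchor_context(tokens, 8, 9)   # anchor text is just "here"
long_anchor = anchor_context(tokens, 3, 5)    # anchor text is "digital kamera"
```

The extended token list then plays the role of the anchor text when its language model is compared with that of the target document.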
Figure: Average demotion of reputable and spam pages into NRank buckets
Figure: Fraction of spam in NRank buckets
Algorithmic Issues
Example: links to the maintainer are penalized
- KL(D1 || D2) along all hyperlinks: requires docs in internal memory
- KL(A || D) for document and anchors from pointing docs: external sort of the anchor text of all referencing docs to D
- KL(A || D) for document and own anchor [Mishne et al. 2005]: works even with streaming docs

Computer and Automation Research Institute, Hungarian Academy of Sciences
Eötvös University Budapest
Inter-University Center for Telecommunications and Informatics (ETIK)