Detecting Nepotistic Links by Language Model Disagreement
András A. Benczúr, István Bíró, Károly Csalogány, Máté Uher
http://www.ilab.sztaki.hu/websearch
Problem Statement
  • Hyperlinks between topically dissimilar pages cover:
      • Undeserved PageRank (spam or navigational links)
      • Unrelated anchor hits (links to owners, maintainers)
      • Content spam (text with no meaning for humans)

Language model disagreement: unigram language model for a text (D) in a collection (C)
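
As a rough illustration of the unigram model named above, the sketch below estimates a smoothed term distribution for a text D against a collection C. The linear interpolation form and the weight `lam` are assumptions made for this example; the slide's own formula did not survive in the transcript.

```python
from collections import Counter

def unigram_model(doc_tokens, collection_tokens, lam=0.8):
    """Return a smoothed unigram model p(w | D) as a dict word -> probability.

    Assumed form (not taken from the slides):
        p(w | D) = lam * tf(w, D) / |D| + (1 - lam) * tf(w, C) / |C|
    `lam` is a hypothetical interpolation weight.
    """
    doc_tf = Counter(doc_tokens)
    coll_tf = Counter(collection_tokens)
    doc_len = max(len(doc_tokens), 1)          # guard against empty input
    coll_len = max(len(collection_tokens), 1)
    vocab = set(doc_tf) | set(coll_tf)
    return {
        w: lam * doc_tf[w] / doc_len + (1 - lam) * coll_tf[w] / coll_len
        for w in vocab
    }
```

Smoothing with the collection keeps every probability positive for words seen in C, which matters once two such models are compared with a divergence measure.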

Figure: example language model (top terms: kamera, video, dvd, Vinci)
NRank: PageRank over penalized hyperlinks
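
A minimal sketch of the idea in the line above: run ordinary PageRank power iteration, but leave out the links flagged as penalized. The function name `nrank`, the damping factor 0.85, the uniform redistribution of dangling mass, and dropping penalized links outright (rather than down-weighting them) are all assumptions; the slides do not give these details.

```python
def nrank(links, penalized, damping=0.85, iterations=50):
    """PageRank over the link graph with penalized links removed.

    links:     iterable of (source, target) pairs
    penalized: set of (source, target) pairs to ignore (assumed input)
    """
    links = list(links)
    nodes = sorted({page for edge in links for page in edge})
    if not nodes:
        return {}
    out = {v: [] for v in nodes}
    for s, t in links:
        if (s, t) not in penalized:
            out[s].append(t)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Pages with no remaining out-links spread their mass uniformly.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for s in nodes:
            for t in out[s]:
                new[t] += damping * rank[s] / len(out[s])
        rank = new
    return rank
```

Calling it once with an empty `penalized` set and once with the flagged links gives two rankings whose difference shows how far each page is demoted.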
NRank evaluation
  • Calculate PageRank
  • Form 20 buckets, each containing 5% of the total PageRank value
  • Pick 50 URLs from each bucket
  • Result: a 1000-page sample stratified on PageRank
  • Manually classify each page as
      • Non-usable: unknown, alias, empty, dead
      • Usable: reputable, ad, weborg, spam
      • Within spam: thema-.de link farm
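
The bucketing and sampling steps above can be sketched as follows. Ordering pages by decreasing PageRank before cutting the buckets, the fixed random seed, and the function name `stratified_sample` are assumptions for illustration; the input is assumed to have a positive PageRank total.

```python
import random

def stratified_sample(pagerank, n_buckets=20, per_bucket=50, seed=0):
    """Split pages into buckets of equal total PageRank, then sample each.

    pagerank: dict mapping URL -> PageRank value (total assumed > 0)
    Returns a list of n_buckets lists of sampled URLs.
    """
    rng = random.Random(seed)
    ordered = sorted(pagerank, key=pagerank.get, reverse=True)
    share = sum(pagerank.values()) / n_buckets   # 5% of the total for 20 buckets
    buckets = [[] for _ in range(n_buckets)]
    cumulative = 0.0
    for url in ordered:
        # Assign the page to the bucket its cumulative mass starts in.
        idx = min(int(cumulative / share), n_buckets - 1)
        buckets[idx].append(url)
        cumulative += pagerank[url]
    # A bucket with fewer pages than requested is taken whole.
    return [rng.sample(b, min(per_bucket, len(b))) for b in buckets]
```

Because a few high-PageRank pages can fill a whole bucket, the top buckets may hold fewer than 50 pages, so the sample can fall short of 1000 in that case.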

Figure: example language model (top terms: lipstick, fashion, pickup, ...)
Kullback-Leibler divergence (KL) between the language models of the target and source pages
Nepotistic links: penalize links whose KL is above a threshold
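
A small sketch of this thresholding step, assuming each page already has a smoothed unigram model stored as a dict of word probabilities. The direction KL(target || source), the helper names, and the threshold value are assumptions; the slides only state that links above a threshold are penalized.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum over w of p(w) * log(p(w) / q(w)).

    p, q: dicts word -> probability. q is assumed to be smoothed so that
    q(w) > 0 wherever p(w) > 0.
    """
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def penalized_links(links, models, threshold):
    """Return the (source, target) links whose divergence exceeds threshold.

    models: dict page -> smoothed unigram model (hypothetical input).
    Direction assumed here: KL(target model || source model).
    """
    return {
        (s, t)
        for s, t in links
        if kl_divergence(models[t], models[s]) > threshold
    }
```

With toy models, a link from a fashion page to a camera page gets a large divergence and is flagged, while a link between two camera pages stays below the threshold.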
Distribution of categories in the .de domain (Benczúr-Csalogány-Sarlós-Uher 2005)

Category        Share (%)
Unknown           0.4
Alias             0.3
Empty             0.4
Non-existent      7.9
Ad                3.7
Weborg            0.8
Spam             16.5
Reputable        70.0
Figure: distribution of KL between anchor text and target document
Figure: the assumed Gaussian mixture model (Mishne et al. 2005)
Disagreement in anchor text
  • Anchor within doc
  • Anchor of pointing doc
  • Empty (or short) anchor: use the neighboring 5 words
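
The fallback for empty or short anchors could look like the sketch below. Taking up to 5 words on each side of the anchor and the cutoff `min_len` for what counts as "short" are assumptions; the slides only mention using the neighboring 5 words.

```python
def anchor_context(tokens, start, end, min_len=2, window=5):
    """Return the words used to model the anchor at tokens[start:end].

    If the anchor is empty or shorter than `min_len` words (hypothetical
    cutoff), it is extended with up to `window` words on each side.
    """
    anchor = tokens[start:end]
    if len(anchor) >= min_len:
        return anchor
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    return left + anchor + right
```

For an image link with no anchor text, `start == end` and the result is just the surrounding words.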

Figure: average demotion of reputable and spam pages into NRank buckets
Figure: fraction of spam in NRank buckets
Algorithmic Issues
  • KL(D1 || D2) along all hyperlinks: requires docs in internal memory
  • KL(A || D) for a document and the anchors from pointing docs: external sort of the anchor text of all docs referencing D
  • KL(A || D) for a document and its own anchor (Mishne et al. 2005): works even with streaming docs
  • Example: links to maintainers are penalized
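
One way to picture the variant that sorts anchor text by referenced document: emit (target, anchor text) records, sort them by target so that all anchors pointing to a document D become adjacent, then handle one target at a time. In this sketch an in-memory `sorted` call stands in for the external sort, and the record layout and function name are assumptions.

```python
from itertools import groupby
from operator import itemgetter

def anchors_by_target(anchor_records):
    """Yield (target, anchor words pointing to it), one target at a time.

    anchor_records: iterable of (target_url, anchor_text) pairs.
    `sorted` is an in-memory stand-in for an external (on-disk) sort.
    """
    ordered = sorted(anchor_records, key=itemgetter(0))
    for target, group in groupby(ordered, key=itemgetter(0)):
        words = [w for _, text in group for w in text.lower().split()]
        yield target, words
```

After the sort, only the current target's anchor words and document need to be held in memory at once.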
Computer and Automation Research Institute, Hungarian Academy of Sciences
Eötvös University Budapest
Inter-University Center for Telecommunications and Informatics (ETIK)