Blocking Blog Spam with Language Model Disagreement

1
Blocking Blog Spam with Language Model
Disagreement
  • Gilad Mishne (Amsterdam)
  • David Carmel (IBM Israel)
  • AIRWeb 2005

2
What is Blog Spam?
  • Bots posting comments unrelated to the original
    blog post
  • Comments contain links to irrelevant sites
  • Links are used to fool Google into ranking the linked sites higher

3
Current Solutions
  • Register
  • Solve a puzzle
  • Prevent HTML
  • Prevent comments in old posts
  • IP Filter
  • Limit comment rate

4
Objective
  • Filter out blog spam

5
Approach
  • Compare post contents with comment contents

6
KL-Divergence Similarity
  • Use KL-Divergence between the language models of the
    post and the comment as a similarity score
  • Lower score means higher similarity
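
The scoring idea above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes unigram language models over whitespace tokens, and it assumes Jelinek-Mercer smoothing against a background model built from the post and comment themselves. The function names and the `lam` parameter are hypothetical.

```python
import math
from collections import Counter

def unigram_model(text):
    """Maximum-likelihood unigram distribution over whitespace tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def smoothed(model, background, lam=0.9):
    """Interpolate with a background model (assumed smoothing scheme)
    so that every background word gets a non-zero probability."""
    return {w: lam * model.get(w, 0.0) + (1 - lam) * p
            for w, p in background.items()}

def kl_divergence(p, q):
    """KL(p || q) in bits; p and q must share the same vocabulary."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)

def post_comment_divergence(post, comment, lam=0.9):
    """Lower value = comment language is closer to the post language."""
    # Assumption: the background model is just post + comment text.
    background = unigram_model(post + " " + comment)
    p = smoothed(unigram_model(post), background, lam)
    c = smoothed(unigram_model(comment), background, lam)
    return kl_divergence(p, c)

post = "my cat sleeps on the sofa all day and the cat loves the sofa"
on_topic = "your cat loves the sofa too"
spam = "cheap pills online casino poker buy now"
print(post_comment_divergence(post, on_topic))
print(post_comment_divergence(post, spam))
```

In this toy example the on-topic comment scores lower (more similar) than the spam comment, which shares no vocabulary with the post.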

7
Clustering with Gaussian Mixture
  • Use clustering based on Gaussian Mixture
  • Cluster all comments of a post into 2 groups by
    KL-Divergence value
  • The group with the higher KL-Divergence values is the spam group
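
The clustering step above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it fits a one-dimensional, two-component Gaussian mixture with plain EM, initialised at the smallest and largest value, and flags a comment as spam when it more likely belongs to the component with the higher mean. The function names, the iteration count, and the variance floor are hypothetical choices.

```python
import math

def _posteriors(x, weights, means, variances):
    """Posterior probability of each component for a single value x."""
    dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances)]
    total = sum(dens) or 1e-300  # guard against numerical underflow
    return [d / total for d in dens]

def fit_two_gaussians(values, iterations=50):
    """EM for a 1-D mixture of two Gaussians."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                          # assumed initialisation
    spread = max((hi - lo) ** 2 / 4, 1e-6)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: soft assignment of every value to both components.
        resp = [_posteriors(x, weights, means, variances) for x in values]
        # M-step: re-estimate weight, mean, and variance per component.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2
                      for r, x in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)     # floor keeps densities finite
    return weights, means, variances

def label_spam(divergences):
    """True for comments assigned to the higher-divergence component."""
    weights, means, variances = fit_two_gaussians(divergences)
    spam = 0 if means[0] > means[1] else 1
    return [_posteriors(x, weights, means, variances)[spam] > 0.5
            for x in divergences]

scores = [0.4, 0.5, 0.6, 0.55, 3.0, 3.2, 2.9, 3.1]
print(label_spam(scores))
```

Here `scores` stands in for the KL-Divergence values of all comments on one post; the four high values end up in the spam group.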

8
Limitations
  • Spammers can cheat the system by using words similar to
    the post in their comments
  • Posts and comments are often too short to estimate a
    reliable language model
  • Follow the links

9
Experiment Corpus
  • 50 random blog posts with 1024 comments
  • At least 3 comments per post
  • 32% of comments are valid
  • 68% of comments are spam

10
Sample Spam

11
Result
  • Baseline: classify each comment as spam with 68% probability
  • Threshold multiplier: adjusts the classification
    boundary
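
One way to read the threshold multiplier is as a scale factor on the KL-Divergence cutoff between the two groups. The sketch below assumes exactly that; the function name `classify`, the example threshold, and the multiplier values are hypothetical, and the slides do not say how the base threshold is obtained.

```python
def classify(divergences, threshold, multiplier=1.0):
    """Flag a comment as spam when its divergence exceeds the scaled cutoff.

    Assumption: the multiplier simply scales the base threshold.
    A multiplier below 1 flags more comments; above 1 flags fewer.
    """
    cutoff = threshold * multiplier
    return [d > cutoff for d in divergences]

scores = [0.5, 1.1, 1.9, 3.0]        # made-up divergence values
print(classify(scores, threshold=1.5))
print(classify(scores, threshold=1.5, multiplier=0.7))
print(classify(scores, threshold=1.5, multiplier=1.4))
```

Lowering the multiplier trades more false alarms for fewer missed spam comments, and raising it does the opposite.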

12
Conclusion
  • No training
  • No hand-coded rules
  • Still working on: following the links to the websites