Detecting Phrase-Level Duplication on the World Wide Web

1
Detecting Phrase-Level Duplication on the World
Wide Web
  • Fetterly, Manasse, Najork
  • Paper presentation by Vinay Goel

2
Introduction
  • Problem
  • Identify instances of "slice and dice" generation
    of web pages
  • Example
  • German spammer
  • 1 million URLs originating from a single IP
    address (but using many host names)
  • Pages changed completely on every download
  • Pages consisted of grammatically well-formed
    sentences stitched together at random

3
Goal
  • Find instances of sentence-level synthesis of web
    pages
  • More generally, of pages with an unusually large
    number of popular phrases

4
The Data
  • Datasets
  • DS1
  • BFS crawl starting at www.yahoo.com
  • 151 million HTML pages
  • DS2
  • Large crawl conducted by MSN Search
  • 96 million HTML pages chosen at random

5
Finding Phrase Replication
  • Sampling
  • Reduce each document to a feature vector
  • Employ a variant of the shingling algorithm of
    Broder et al.
  • Significantly reduces the data volume

6
Sampling method
  • Replace all HTML markup with whitespace
  • k-phrases of a document: all sequences of k
    consecutive words
  • Treat the document as a circle: the last word is
    followed by the first word
  • An n-word document has exactly n k-phrases
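The circular k-phrase construction above can be sketched in a few lines of Python. This is only an illustration: the function name `k_phrases` and the regular expression used to strip markup are assumptions, not taken from the slides.

```python
import re

def k_phrases(html: str, k: int) -> list[tuple[str, ...]]:
    """Return the k-phrases of a document, treating it as a circle."""
    # Crude markup removal (an assumption): anything that looks like a tag
    # is replaced by a space so that adjacent words stay separated.
    text = re.sub(r"<[^>]*>", " ", html)
    words = text.split()
    n = len(words)
    # One phrase starts at every word; indices wrap around past the end,
    # so an n-word document yields exactly n phrases.
    return [tuple(words[(i + j) % n] for j in range(k)) for i in range(n)]

phrases = k_phrases("<p>a b c d</p>", 3)
print(len(phrases))   # one phrase per word
print(phrases[-1])    # the last phrase wraps around to the start
```

Because of the wrap-around, the number of phrases does not depend on k.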

7
Sampling method
  • Exploit properties of Rabin fingerprints
  • Rabin fingerprints support efficient extension
    and prefix deletion
  • Fingerprints of distinct bit patterns are
    distinct (with high probability)
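To illustrate what "efficient extension and prefix deletion" means, here is a minimal sketch using a Rabin-Karp style polynomial hash over integers. It is a stand-in, not a true Rabin fingerprint (which works with polynomials over GF(2) modulo an irreducible polynomial); the modulus `P`, the base `B`, and the function names are illustrative assumptions.

```python
P = (1 << 61) - 1   # modulus: a Mersenne prime (illustrative choice)
B = 1_000_003       # base (illustrative choice)

def extend(h: int, token: int) -> int:
    """Fingerprint of the sequence with `token` appended, in O(1)."""
    return (h * B + token) % P

def drop_prefix(h: int, first_token: int, length: int) -> int:
    """Fingerprint of the sequence with its first token removed.

    `length` is the number of tokens currently covered by `h`.
    """
    return (h - first_token * pow(B, length - 1, P)) % P

def fingerprint(tokens) -> int:
    """Fingerprint of a whole token sequence, built by repeated extension."""
    h = 0
    for t in tokens:
        h = extend(h, t)
    return h

# Slide a 3-token window one position to the right without rehashing it.
tokens = [3, 1, 4, 1, 5]
h = fingerprint(tokens[0:3])
h = extend(drop_prefix(h, tokens[0], 3), tokens[3])
print(h == fingerprint(tokens[1:4]))
```

With these two operations, the fingerprints of all n phrases can be computed in one pass over the document instead of rehashing k tokens per phrase.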

8
Computing feature vectors
  • Fingerprint each word in the document - gives n
    tokens
  • Compute fingerprint of each k-token phrase -
    gives n phrase fingerprints
  • Apply m different fingerprint functions
  • Retain the smallest of the n resulting values for
    each function
  • Vector of m fingerprints representative of
    document (elements referred to as shingles)
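The steps above can be sketched as follows. This is a rough illustration under stated assumptions: salted BLAKE2b from Python's `hashlib` stands in for the Rabin fingerprint functions, and the defaults `k=5` and `m=8` as well as the names `_fp` and `feature_vector` are hypothetical.

```python
import hashlib

def _fp(data: bytes, seed: int = 0) -> int:
    """64-bit fingerprint; the seed selects one of several hash functions.

    Salted BLAKE2b is a stand-in here, not the Rabin fingerprints of the slides.
    """
    salt = seed.to_bytes(8, "little")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8, salt=salt).digest(), "little")

def feature_vector(words: list[str], k: int = 5, m: int = 8) -> list[int]:
    n = len(words)
    if n == 0:
        return []
    # Step 1: fingerprint each word, giving n tokens.
    tokens = [_fp(w.encode("utf-8")) for w in words]
    # Step 2: fingerprint each circular k-token phrase, giving n phrase fingerprints.
    phrase_fps = []
    for i in range(n):
        phrase = b"".join(tokens[(i + j) % n].to_bytes(8, "little") for j in range(k))
        phrase_fps.append(_fp(phrase))
    # Steps 3 and 4: apply m different functions and keep the minimum under each.
    return [
        min(_fp(p.to_bytes(8, "little"), seed=i + 1) for p in phrase_fps)
        for i in range(m)
    ]

words = "the quick brown fox jumps over the lazy dog".split()
vector = feature_vector(words)
print(len(vector))  # m shingles, regardless of document length
```

Each document is reduced to m numbers, which is what makes the later steps tractable at the scale of a web crawl.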

9
Duplicate Suppression
  • Replication rampant on the web
  • Clustered all pages in data set into equivalence
    classes
  • Each class contains all pages that are exact or
    near duplicates of one another
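One way to form such equivalence classes is a union-find structure over pages. The slides do not say how near duplicates were detected, so the criterion below (two pages agree in at least `min_matches` positions of their feature vectors) is an assumption, and the all-pairs comparison is for illustration only; it would not scale to a full crawl.

```python
def cluster(vectors: dict[str, list[int]], min_matches: int) -> list[set[str]]:
    """Group pages into classes of exact or near duplicates."""
    parent = {page: page for page in vectors}

    def find(x: str) -> str:
        # Follow parent links to the class representative, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pages = list(vectors)
    for idx, a in enumerate(pages):
        for b in pages[idx + 1:]:
            # Assumed near-duplicate test: count positions where the shingles agree.
            matches = sum(x == y for x, y in zip(vectors[a], vectors[b]))
            if matches >= min_matches:
                parent[find(a)] = find(b)

    classes: dict[str, set[str]] = {}
    for page in pages:
        classes.setdefault(find(page), set()).add(page)
    return list(classes.values())

vectors = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 9], "c": [7, 7, 7, 7], "d": [1, 2, 8, 9]}
print(sorted(sorted(c) for c in cluster(vectors, min_matches=3)))
```

In the example, pages a and d are not directly similar enough, yet they end up in the same class because both are near duplicates of b.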

10
Popular phrases
  • Occur in more documents than would be expected by
    chance
  • Assumptions
  • Normal web pages characterized by a generative
    model
  • Sought web pages: characterized by a copying
    model (need to consider the number of phrases and
    the length of typical documents)

11
Popular Phrases
  • Limit attention to the shingles chosen by
    sampling functions
  • Phrase is popular if selected as shingle in
    sufficiently many documents
  • To determine popular phrases, consider triplets
    (i, s, d): s is the shingle selected by sampling
    function i in document d
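Counting over the triplets can be sketched as below. The in-memory `Counter`, the `threshold` parameter, and the function name are assumptions; the slides do not say how the counting was carried out at crawl scale.

```python
from collections import Counter

def popular_shingles(vectors: dict[str, list[int]], threshold: int) -> set[tuple[int, int]]:
    """Return the (i, s) pairs selected as shingle in at least `threshold` documents."""
    counts: Counter = Counter()
    for doc_id, vector in vectors.items():
        # Each position i of the vector contributes one triplet (i, s, doc_id).
        for i, s in enumerate(vector):
            counts[(i, s)] += 1
    return {key for key, count in counts.items() if count >= threshold}

vectors = {"d1": [10, 20], "d2": [10, 21], "d3": [10, 20]}
print(sorted(popular_shingles(vectors, threshold=2)))
```

Keeping the function index i in the key means a value is only compared against shingles chosen by the same sampling function.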

12
Popular Phrases
  • The 24 most popular phrases are not very
    interesting
  • Starting from the 36th phrase, we discover phrases
    caused by machine-generated content
  • Templatic form: common text, fill-in-the-blank
    slots, and optional parts
  • The 60th phrase is an instance of an idiomatic
    phrase

13
Zipfian Distribution

14
Histogram of popular shingles per document

15
Covering set
  • Compute a covering set for the shingles of each
    page: a set of other pages that together contain
    those shingles
  • Approximate a minimum covering set using a greedy
    heuristic
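The standard greedy heuristic for set cover can be sketched as follows: repeatedly pick the page that covers the most still-uncovered shingles. The slides do not describe the exact variant used, so the function name, the data layout, and the handling of shingles that no page covers are all assumptions.

```python
def greedy_cover(target: set[int], candidates: dict[str, set[int]]) -> tuple[list[str], set[int]]:
    """Approximate a minimum set of candidate pages covering `target`.

    Returns the chosen pages and any shingles that no candidate covers.
    """
    uncovered = set(target)
    cover: list[str] = []
    while uncovered:
        # Pick the page that covers the most shingles still uncovered.
        best = max(candidates, key=lambda p: len(candidates[p] & uncovered), default=None)
        if best is None or not candidates[best] & uncovered:
            break  # no remaining page helps; the rest stays uncovered
        cover.append(best)
        uncovered -= candidates[best]
    return cover, uncovered

page_shingles = {1, 2, 3, 4, 5}
others = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {9}}
print(greedy_cover(page_shingles, others))
```

A small covering set suggests that a page could have been assembled from only a few other pages.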

16
Distribution of covering set sizes

17
German spammer

18
Looking for likely sources

19
Conclusion
  • Power law distribution
  • Popular phrases
  • Often limited by design choices
  • Legal disclaimers
  • Navigational phrases
  • Fill-in-the-blank templates
  • More replicated than original content