A large-scale study of the evolution of Web pages

1
A large-scale study of the evolution of Web pages
  • D. Fetterly, M. Manasse, M. Najork and L. Wiener
  • Software: Practice and Experience (SPE), Vol. 34,
    No. 2, pages 213-237, Feb. 2004

Apr. 18, 2006, So Jeong Han
2
Content
  • Results
  • Conclusions

3
Results(1) Document Size Analysis
  • Fig 2. Document size (bytes) versus top-level
    domain
  • Pages returned with HTTP status code 200
  • x-axis: document size bucket; bucket x covers
    sizes from 2^(x-1) to 2^x bytes
  • Distribution centered at x = 14, with standard
    deviation 1
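
The bucketing on the x-axis can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `size_bucket` is hypothetical, and treating the lower bound as inclusive and the upper bound as exclusive is an assumption, since the slide does not say which side of each boundary belongs to which bucket.

```python
def size_bucket(size_bytes: int) -> int:
    """Return x such that 2**(x-1) <= size_bytes < 2**x.

    An empty document (0 bytes) is placed in bucket 0.
    The inclusive/exclusive boundary choice is an assumption.
    """
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    # For a positive integer n, n.bit_length() is the smallest k
    # with n < 2**k, which is exactly the bucket index we want.
    return size_bytes.bit_length()


if __name__ == "__main__":
    for size in (1, 8192, 16383, 16384):
        print(size, "bytes -> bucket", size_bucket(size))
```

Under this convention, bucket 14 holds documents of 8 KB up to just under 16 KB.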

4
Results(2) Document Size Analysis
  • Fig 3. Document size (words) versus top-level
    domain

5
Results(3) Status Code Analysis
Heisenberg Effect
  • Fig 4. Distribution of HTTP status codes over
    crawl generations.
  • For each crawl generation, the percentage of page
    retrievals resulting in each category of status
    code.
  • The y-axis starts at 85%.
  • The lifetime of URLs is limited.
  • Objective: why do downloads fail?

6
Results(4) TLD Analysis
  • Fig 5. Download success per top-level domain.
  • Pages in .jp, .de and .edu are the most available
  • The decline in the curves bears out the limited
    lifetime of Web pages

7
Results(5) TLD Analysis
Unreachable after crawl 1
Downloaded during the final crawl
  • Fig 6. Last successfully downloaded Web page per
    TLD
  • Shows the lifetime of URLs from different
    domains.
  • Pages in .cn expire sooner than average.

8
Results(6) Change Analysis
  • Fig 7. Distribution of change
  • Fig 8. Distribution of change, scaled to show the
    low-percentage categories
  • Cumulative percentage distribution

9
Results(7)
  • Fig 9. Type of markup change
  • The markup of 1,468,671 pages changed
  • Observed the link evolution of this type of
    change
  • A crawler can recognize and remove session IDs
    to avoid recrawling the same content
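
Removing session IDs before comparing URLs could look like the sketch below. It is an illustration under stated assumptions, not the crawler described in the talk: the parameter names in `SESSION_PARAMS` are hypothetical examples, and only query-string session IDs are handled (IDs embedded in the path are not).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed, illustrative list of query parameters that carry session IDs.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL with session-ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # Re-encode the remaining parameters in their original order.
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    a = strip_session_id("http://example.com/page?x=1&sid=abc123")
    b = strip_session_id("http://example.com/page?x=1&sid=zzz999")
    print(a, a == b)
```

Two URLs that differ only in their session ID then normalize to the same string, so a crawler can treat them as one page.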

10
Results(8)
Fig 9.
Shrink
  • Fig 10. Type of markup change normalized by hosts
  • Was Fig 9 influenced by individual URLs or by
    hosts?
  • Each type of change is counted once per host.
  • Share of changes attributed to URL query drops
    (48% -> 4%)
  • Session IDs tend to appear many times on the same
    page.
  • When session IDs are embedded in links, they tend
    to appear in all relative links on the page

11
Results(9)
  • Fig 11. Breakdown of tags that changed
  • changes that were additions or deletions of tags

12
Results(10) - TLD, Change Analysis
Complete change cluster
No change cluster
  • Fig 12. Clustered rates of change by TLD
  • Each bar is divided into six regions, one per
    change cluster
  • .de domain problem
  • Automatically generated pages (keyword stuffing)
  • Use distinct host names as a front to a single
    server
  • Trick link-based ranking algorithms.
  • Draw visitors to an adult Web site.

13
Results(11) - TLD, Change Analysis
  • Solution
  • Resolved the symbolic host names of all the URLs
    in our data set
  • Singled out each IP address with more than 1000
    symbolic host names mapping to it.
  • Fig 13. Clustered rates of change by TLD after
    excluding automatically generated keyword-spam
    documents.
  • Eliminated about 60%
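
The filtering step can be sketched as below. This is a minimal illustration, assuming the host-to-IP resolution has already been done and is available as a dictionary; the function names and the sample addresses are hypothetical. Only the threshold of 1000 host names comes from the slide.

```python
from collections import defaultdict


def suspicious_ips(host_to_ip, threshold=1000):
    """Return IP addresses with more than `threshold` host names mapping to them."""
    hosts_by_ip = defaultdict(set)
    for host, ip in host_to_ip.items():
        hosts_by_ip[ip].add(host.lower())
    return {ip for ip, hosts in hosts_by_ip.items() if len(hosts) > threshold}


def hosts_to_exclude(host_to_ip, threshold=1000):
    """Return every host name that resolves to a suspicious IP address."""
    bad = suspicious_ips(host_to_ip, threshold)
    return {host for host, ip in host_to_ip.items() if ip in bad}


if __name__ == "__main__":
    # Toy data with a low threshold so the effect is visible.
    resolved = {
        "a.example.de": "10.0.0.1",
        "b.example.de": "10.0.0.1",
        "c.example.de": "10.0.0.1",
        "www.example.edu": "10.0.0.2",
    }
    print(sorted(hosts_to_exclude(resolved, threshold=2)))
```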

14
Results(12) TLD, Change Analysis
  • Fig 14. Clustered rates of change by TLD,
    omitting the no change cluster, after excluding
    automatically generated keyword-spam documents.
  • Conclusion
  • Adult content continues to skew our results.
  • The shingling technique might not be well adapted
    to writing systems such as Chinese or Kanji, which
    do not employ inter-word spacing
  • The extent of change is quite consistent with
    other TLDs
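
Why inter-word spacing matters for shingling can be seen in a small sketch. This is an illustration only: it splits on whitespace and uses a window of `w=5` words, both of which are assumptions here rather than details taken from the slides, and it omits the fingerprinting and sampling a real system would add.

```python
def word_shingles(text, w=5):
    """Return the set of w-word shingles of `text`, splitting on whitespace."""
    words = text.split()
    if not words:
        return set()
    if len(words) < w:
        # Too short for a full window: the whole text is a single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


if __name__ == "__main__":
    spaced = "a large scale study of the evolution of Web pages"
    unspaced = "网页变化的大规模研究"  # no inter-word spacing
    print(len(word_shingles(spaced)), len(word_shingles(unspaced)))
```

Text without spaces collapses into a single token and therefore a single shingle, so any edit at all makes the two versions look completely different.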

15
Results(13) - Change Analysis
Fig 8.
Fig 12.
  • Fig 15. scaled to show the low-percentage
    categories, after excluding automatically
    generated keyword-spam documents.
  • Bucket 0 is cut in half
  • The right-hand side is monotonic
  • Next question:
  • whether the length of pages impacts their rate of
    change

16
Results(14)
  • Fig 16. Clustered rates of change by document
    size. (byte)
  • Document size is strongly related to rate of
    change.
  • Small documents are the least likely to change
  • Large documents (above 32 KB) change much more
    frequently than smaller ones (below 4 KB).

17
Results(15)
  • Fig 17. Clustered rates of change by the number
    of words per document.
  • The sensitivity of our shingling technique depends
    on the number of words in a document
  • The all-or-nothing similarity metric gives a
    relatively coarse measure.
  • Large documents are more likely to change than
    smaller ones.
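
The dependence on document length can be made concrete with a little arithmetic: editing one word touches at most w consecutive shingles, so the affected share shrinks as the document grows. The sketch below computes that worst-case share. The window size `w=5` and the function name are assumptions for illustration, and this is shingle-level arithmetic only, not the feature-vector comparison used in the study.

```python
def max_fraction_changed(n_words, w=5):
    """Worst-case share of w-word shingles touched by editing a single word."""
    # A document with fewer than w words is treated as one shingle.
    total = max(n_words - w + 1, 1)
    # One word belongs to at most w shingles, and never more than exist.
    return min(w, total) / total


if __name__ == "__main__":
    for n in (9, 104, 1004):
        print(n, "words ->", max_fraction_changed(n))
```

A one-word edit can replace every shingle of a 9-word document, but only about 5% of the shingles of a 104-word document.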

18
Results(16)
  • Fig 18. Clustered rates of change by the number
    of words per document, and omitting the no change
    cluster.

19
Results(17)
  • .com and .net
  • Stronger effect for larger documents than in .gov
    and .edu
  • - Commercial Web sites: appearance of freshness
  • - Educational and governmental Web sites: archival
    purposes
  • Fig 19. Clustered rates of change by top-level
    domain and number of words per document

20
Results(18)
  • Fig 20. Distribution of the standard deviations
    of the rate of change in a given document over
    its lifetime

21
Results(19)
(85,85): Web pages don't change much over a
3-week interval
10000 times higher than any other feature
The number of pre-images in a document unchanged
from week n to n+1
The number of pre-images in a document unchanged
from week n-1 to n
  • Plate 1. Logarithmic histogram of intra-document
    changes over three successive weeks, showing the
    absolute number of changes.

22
Results(20)
Indicating once again that past change is a
strong predictor of future change.
  • Plate 2. Logarithmic histogram of intra-document
    changes over three successive weeks, normalized to
    show the conditional probabilities of changes.
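
The normalization behind Plate 2 can be sketched as turning raw counts into conditional probabilities. This is a minimal illustration under assumptions: each observation is a pair (unchanged count from week n-1 to n, unchanged count from week n to n+1), the normalization is taken per value of the first element, and the sample pairs are toy data, not results from the study.

```python
from collections import Counter


def conditional_change_probs(pairs):
    """Estimate P(next | prev) from observed (prev, next) pairs."""
    counts = Counter(pairs)
    row_totals = Counter()
    for (prev, _next), count in counts.items():
        row_totals[prev] += count
    # Divide each cell by the total of its row (same `prev` value).
    return {
        (prev, nxt): count / row_totals[prev]
        for (prev, nxt), count in counts.items()
    }


if __name__ == "__main__":
    # Toy observations, for illustration only.
    observed = [(85, 85)] * 3 + [(85, 80)] + [(40, 40)]
    for key, prob in sorted(conditional_change_probs(observed).items()):
        print(key, prob)
```

When most of a row's probability mass sits on the diagonal, last week's amount of change is a good guess for this week's, which is the point of the slide.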

23
Conclusion(1)
  • Purpose: measuring the rate and degree of Web
    page change
  • Method
  • crawled 151 million pages once a week for 11
    weeks
  • saving salient information about each downloaded
    document
  • including a feature vector of the text without
    markup
  • plus the full text of 0.1% of all downloaded pages

24
Conclusion(2)
  • Conclusion (we found...)
  • When Web pages change, they usually change only
    in their markup or in trivial ways
  • Relation with TLD
  • frequency of change of a document (strong)
  • degree of change (weaker)
  • Document size
  • related to both frequency and degree of change
    (.com, .net)
  • large documents change more often and more
    extensively
  • Past change predicts future change ->
    implications for Web crawlers
  • German anomaly: a fast-changing page is not
    necessarily worth recrawling.

Fin.