Title: A large-scale study of the evolution of Web pages
1A large-scale study of the evolution of Web pages
- D. Fetterly, M. Manasse, M. Najork and L. Wiener
- SPE Vol.34 No.2 pages 213-237, Feb. 2004
Apr. 18, 2006 So Jeong Han
2Content
3Results(1) Document Size Analysis
- Fig 2. Document size (bytes) versus top-level domain
- HTTP status code of 200
- x: document size (2^(x-1) to 2^x bytes)
- Centered at x = 14, with standard deviation 1
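As a rough illustration of the x-axis in Fig 2, the sketch below maps a document size to its power-of-two bucket. The function name `size_bucket` and the exact boundary convention (bucket x holds sizes above 2^(x-1) and up to 2^x bytes) are assumptions made for this example, not details taken from the paper.

```python
def size_bucket(size_bytes: int) -> int:
    """Return x such that 2**(x-1) < size_bytes <= 2**x (assumed convention)."""
    if size_bytes < 1:
        raise ValueError("size must be at least 1 byte")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using integers only.
    return (size_bytes - 1).bit_length()


if __name__ == "__main__":
    for size in (1, 1000, 16384, 16385):
        print(size, "bytes -> bucket", size_bucket(size))
```

Under this convention a 16 KB page (16384 bytes) falls into bucket 14, and one byte more moves it to bucket 15.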
4Results(2) Document Size Analysis
- Fig 3. Document size (words) versus top-level domain
5Results(3) Status Code Analysis
Heisenberg effect
- Fig 4. Distribution of HTTP status codes over crawl generations.
- For each crawl generation, the percentage of page retrievals resulting in each category of status code.
- The y-axis starts at 85%.
- The lifetime of URLs is limited.
- Objective: why do downloads fail?
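The per-generation percentages plotted in Fig 4 could be tabulated along the following lines. This is only a sketch: the category names in `status_category` and the input format (pairs of crawl generation and status code) are assumptions for illustration, and the paper's own categories may differ.

```python
from collections import Counter, defaultdict


def status_category(code: int) -> str:
    """Group a status code into a coarse category (assumed grouping)."""
    if code == 200:
        return "200 OK"
    if 300 <= code < 400:
        return "3xx redirect"
    if 400 <= code < 500:
        return "4xx client error"
    if 500 <= code < 600:
        return "5xx server error"
    return "other"


def distribution_per_generation(retrievals):
    """retrievals: iterable of (generation, status_code) pairs.

    Returns {generation: {category: percentage of that generation's retrievals}}.
    """
    counts = defaultdict(Counter)
    for generation, code in retrievals:
        counts[generation][status_category(code)] += 1
    result = {}
    for generation, counter in counts.items():
        total = sum(counter.values())
        result[generation] = {cat: 100.0 * n / total for cat, n in counter.items()}
    return result


if __name__ == "__main__":
    toy = [(1, 200), (1, 200), (1, 200), (1, 404), (2, 200), (2, 404)]
    for generation, shares in sorted(distribution_per_generation(toy).items()):
        print(generation, shares)
```

A falling share of the "200 OK" category across generations is what the limited lifetime of URLs would look like in such a table.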
6Results(4) TLD Analysis
- Fig 5. Download success per top-level domain.
- Pages in .jp, .de, and .edu tend to remain available.
- The decline in the curves bears out the limited lifetime of Web pages.
7Results(5) TLD Analysis
Unreachable after crawl 1
Downloaded during the final crawl
- Fig 6. Last successfully downloaded Web page per TLD
- For viewing the lifetime of URLs from different domains.
- Pages in .cn expire sooner than average.
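The quantity behind Fig 6, the last crawl in which each URL was still downloadable, can be sketched as below. The data layout (one boolean per crawl generation for each URL) and both function names are hypothetical. Grouping the URLs by TLD before counting would give one histogram per domain.

```python
from collections import Counter


def last_success(outcomes):
    """outcomes: list of booleans, one per crawl generation (True = downloaded).

    Returns the 1-based index of the last successful crawl, or 0 if none succeeded.
    """
    for generation in range(len(outcomes), 0, -1):
        if outcomes[generation - 1]:
            return generation
    return 0


def last_success_histogram(pages):
    """pages: {url: outcomes}. Counts URLs by their last successful generation."""
    return Counter(last_success(outcomes) for outcomes in pages.values())


if __name__ == "__main__":
    toy = {
        "http://a.example/": [True, False, False],  # unreachable after crawl 1
        "http://b.example/": [True, True, True],    # downloaded during the final crawl
    }
    print(last_success_histogram(toy))
```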
8Results(6) Change Analysis
- Fig 7. Distribution of change
- Fig 8. Scaled to show the low-percentage categories
- Cumulative percentage distribution
9Results(7)
- Fig 9. Type of markup change
- The markup of 1,468,671 pages changed
- Observing link evolution of this type
- A crawler should recognize session IDs and remove them
- This avoids recrawling the same content
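One way a crawler might remove session IDs before deciding whether a URL has already been seen is sketched here. The parameter names in `SESSION_PARAMS` are hypothetical examples, and real sites embed session IDs in many other ways (for instance in the path), which this sketch does not handle.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; real sites use many different ones.
SESSION_PARAMS = {"sid", "sessionid", "session_id", "phpsessid", "jsessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL with query parameters that look like session IDs removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # urlencode may re-escape characters, so other parameters can change form.
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    first = strip_session_id("http://example.com/page?id=7&PHPSESSID=abc123")
    second = strip_session_id("http://example.com/page?id=7&PHPSESSID=zzz999")
    print(first == second)
```

Two links that differ only in their session ID then map to the same canonical URL, so the page is fetched once.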
10Results(8)
Fig 9.
Shrink
- Fig 10. Type of markup change, normalized by host
- Were the results in Fig 9 influenced by individual URLs or by hosts?
- Each type of change is counted per host.
- % of changes attributed to URL queries (48% -> 4%)
- They tend to appear many times on the same page.
- When session IDs are embedded in the URLs of links, they tend to appear in all relative links on the page.
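The effect of normalizing by host can be illustrated with a small sketch. It assumes that normalization means counting each combination of host and change type once. That is one reading of "counted per host", not necessarily the paper's exact procedure. The function name and input format are hypothetical.

```python
from urllib.parse import urlsplit


def change_shares(changes, per_host=False):
    """changes: iterable of (url, change_type) pairs, one per observed change.

    With per_host=False every observation counts. With per_host=True each
    (host, change_type) combination counts once (assumed normalization).
    Returns {change_type: percentage share}.
    """
    if per_host:
        units = {(urlsplit(url).hostname, kind) for url, kind in changes}
        kinds = [kind for _, kind in units]
    else:
        kinds = [kind for _, kind in changes]
    total = len(kinds)
    return {kind: 100.0 * kinds.count(kind) / total for kind in set(kinds)}


if __name__ == "__main__":
    toy = [
        ("http://shop.example/p1", "url query"),
        ("http://shop.example/p2", "url query"),
        ("http://shop.example/p3", "url query"),
        ("http://news.example/a", "text"),
    ]
    print(change_shares(toy))
    print(change_shares(toy, per_host=True))
```

A single host that repeats the same kind of change on many links dominates the raw counts but contributes only once after normalization, which is the direction of the shrink seen between Fig 9 and Fig 10.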
11Results(9)
- Fig 11. Breakdown of tags that changed
- changes that were additions or deletions of tags
12Results(10) - TLD, Change Analysis
Complete change cluster
No change cluster
- Fig 12. Clustered rates of change by TLD
- Each bar is divided into six regions, one per change cluster
- .de domain problem
- Automatically generated pages (keyword stuffing)
- Distinct host names used as a front to a single server
- Tricks link-based ranking algorithms.
- Draws visitors to adult Web sites.
13Results(11) - TLD, Change Analysis
- Solution
- Resolved the symbolic host names of all the URLs in our data set
- Singled out each IP address with more than 1000 symbolic host names mapping to it.
- Fig 13. Clustered rates of change by TLD after excluding automatically generated keyword-spam documents.
- Eliminated about 60%
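The filtering step described on this slide, singling out IP addresses that many symbolic host names map to, might look roughly like this. The sketch assumes the host names have already been resolved into a `host_to_ip` dictionary. The function name and the example addresses are made up for illustration.

```python
from collections import defaultdict


def suspicious_ips(host_to_ip, threshold=1000):
    """host_to_ip: {symbolic host name: IP address}, assumed already resolved.

    Returns the set of IP addresses with more than `threshold` host names.
    """
    hosts_by_ip = defaultdict(set)
    for host, ip in host_to_ip.items():
        hosts_by_ip[ip].add(host.lower())
    return {ip for ip, hosts in hosts_by_ip.items() if len(hosts) > threshold}


if __name__ == "__main__":
    # 1001 generated host names fronting one server, plus one ordinary site.
    mapping = {f"kw{i}.spam.example": "192.0.2.1" for i in range(1001)}
    mapping["www.uni.example"] = "198.51.100.7"
    print(suspicious_ips(mapping))
```

Documents served from the flagged addresses would then be excluded before the rates of change are clustered again.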
14Results(12) TLD, Change Analysis
- Fig 14. Clustered rates of change by TLD, omitting the no change cluster, after excluding automatically generated keyword-spam documents.
- Conclusion
- Adult content continues to skew our results.
- The shingling technique might not be well adapted to writing systems such as Chinese or Kanji, which do not employ inter-word spacing.
- The extent of change is quite consistent with other TLDs.
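The concern about writing systems without inter-word spacing can be seen in a minimal word-shingling sketch. The shingle width of 5 and the whitespace-only tokenization are assumptions chosen for illustration, and the sample sentences are invented.

```python
def word_shingles(text: str, width: int = 5):
    """Return the set of `width`-word shingles, splitting on whitespace only."""
    words = text.split()
    return {
        " ".join(words[i:i + width])
        for i in range(len(words) - width + 1)
    }


if __name__ == "__main__":
    spaced = "the quick brown fox jumps over the lazy dog"
    unspaced = "网页的内容每周都在变化"
    print(len(word_shingles(spaced)), "shingles from the spaced sentence")
    print(len(word_shingles(unspaced)), "shingles from the unspaced sentence")
```

Because the unspaced sentence is treated as a single "word", it yields no shingles at all, so any comparison built on such shingles has very little to work with.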
15Results(13) - Change Analysis
Fig 8.
Fig 12.
- Fig 15. Scaled to show the low-percentage categories, after excluding automatically generated keyword-spam documents.
- Bucket 0 is cut in half
- The right side is monotonic
- To consider: whether the length of pages impacts their rate of change
16Results(14)
- Fig 16. Clustered rates of change by document size (bytes).
- Document size is strongly related to rate of change.
- Small documents mostly do not change.
- Large documents (32KB and above) change much more frequently than smaller ones (4KB and below).
17Results(15)
- Fig 17. Clustered rates of change by the number of words per document.
- The sensitivity of our shingling technique depends on the number of words in a document.
- An all-or-nothing similarity metric gives a relatively coarse measure.
- Large documents are more likely to change than smaller ones.
18Results(16)
- Fig 18. Clustered rates of change by the number
of words per document, and omitting the no change
cluster.
19Results(17)
- .com, .net
- Stronger effect for larger documents (than in .gov, .edu)
- Commercial Web sites: appearance of freshness
- Educational and governmental Web sites: archival purpose
- Fig 19. Clustered rates of change by top-level domain and number of words per document
20Results(18)
- Fig 20. Distribution of the standard deviations
of the rate of change in a given document over
its lifetime
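For a single document, the statistic summarized in Fig 20 could be computed as sketched below. The input (one change measurement per pair of successive crawls) and the use of the population standard deviation are assumptions. The slide does not say which estimator the paper used.

```python
from statistics import pstdev


def change_stddev(weekly_changes):
    """weekly_changes: one change measurement per pair of successive crawls.

    Returns the population standard deviation (assumed estimator).
    """
    if len(weekly_changes) < 2:
        return 0.0
    return pstdev(weekly_changes)


if __name__ == "__main__":
    print(change_stddev([3, 3, 3, 3]))  # a document that changes steadily
    print(change_stddev([0, 10]))       # a document that changes erratically
```

A value near zero means the document changes by about the same amount every week, whether that amount is large or small.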
21Results(19)
(85,85): Web pages don't change much over a 3-week interval
10000 times higher than any other feature
The number of pre-images in a document unchanged from week n to n+1
The number of pre-images in a document unchanged from week n-1 to n
- Plate 1. Logarithmic histogram of intra-document
changes over three successive weeks, showing the
absolute number of changes.
22Results(20)
Indicating once again that past change is a
strong predictor of future change.
- Plate 2. Logarithmic histogram of intra-document changes over three successive weeks, normalized to show the conditional probabilities of changes.
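Going from the absolute counts of Plate 1 to the conditional probabilities of Plate 2 amounts to normalizing a two-dimensional histogram. The sketch below conditions on the earlier week pair (rows). Which axis the paper conditions on is an assumption here, as are the function name and the toy counts.

```python
def conditional_probabilities(counts):
    """counts[i][j]: number of documents with i unchanged pre-images from week
    n-1 to n and j unchanged pre-images from week n to n+1.

    Returns rows normalized so that row i estimates P(j | i).
    """
    result = []
    for row in counts:
        total = sum(row)
        result.append([c / total if total else 0.0 for c in row])
    return result


if __name__ == "__main__":
    toy = [[8, 2], [1, 9]]
    for row in conditional_probabilities(toy):
        print(row)
```

In the toy table most of each row's mass sits on the diagonal, which is the pattern that would indicate past change predicting future change.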
23Conclusion(1)
- Purpose: measuring the rate and degree of Web page change
- Method
- Crawled 151 million pages once a week for 11 weeks
- Saved salient information about each downloaded document
- Including a feature vector of the text without markup
- Plus the full text of 0.1% of all downloaded pages
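To make "a feature vector of the text without markup" more concrete, here is a minimal min-hash style sketch. It is not the paper's implementation: the regular-expression tag stripping, the salted BLAKE2b hashes, the vector length of 8 and the shingle width of 5 are all assumptions chosen to keep the example short.

```python
import hashlib
import re


def strip_markup(html: str) -> str:
    """Crudely remove tags; a real crawler would use an HTML parser."""
    return re.sub(r"<[^>]*>", " ", html)


def feature_vector(text: str, num_features: int = 8, width: int = 5):
    """Return one min-hash value per feature over the word shingles of text."""
    words = text.split()
    shingles = {" ".join(words[i:i + width]) for i in range(len(words) - width + 1)}
    if not shingles:
        return [None] * num_features
    vector = []
    for seed in range(num_features):
        # One salted hash function per feature; keep the smallest digest.
        salt = seed.to_bytes(8, "big")
        vector.append(min(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest()
            for s in shingles
        ))
    return vector


def unchanged_features(old, new):
    """Count positions where both vectors hold the same (non-empty) feature."""
    return sum(1 for a, b in zip(old, new) if a is not None and a == b)


if __name__ == "__main__":
    week1 = feature_vector(strip_markup("<p>one two three four five six seven</p>"))
    week2 = feature_vector(strip_markup("<div>one two three four five six seven</div>"))
    print(unchanged_features(week1, week2), "of", len(week1), "features unchanged")
```

Because tags are removed before shingling, a page whose markup changes but whose text does not keeps every feature, while a page with entirely new text keeps none.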
24Conclusion(2)
- Conclusion (we found ...)
- When Web pages change, they mostly change only in their markup or in trivial ways
- Relation with TLD
- Frequency of change of a document (strong)
- Degree of change (weaker)
- Document size
- Related to both frequency and degree of change (.com, .net)
- Large documents change more often and more extensively
- Past change predicts future change -> implications for Web crawlers
- German anomaly: a fast-changing page is not necessarily a worthwhile one.
Fin.